nips nips2006 nips2006-147 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andriy Myronenko, Xubo Song, Miguel Á. Carreira-Perpiñán
Abstract: We introduce Coherent Point Drift (CPD), a novel probabilistic method for nonrigid registration of point sets. The registration is treated as a Maximum Likelihood (ML) estimation problem with motion coherence constraint over the velocity field such that one point set moves coherently to align with the second set. We formulate the motion coherence constraint and derive a solution of regularized ML estimation through the variational approach, which leads to an elegant kernel form. We also derive the EM algorithm for the penalized ML optimization with deterministic annealing. The CPD method simultaneously finds both the non-rigid transformation and the correspondence between two point sets without making any prior assumption of the transformation model except that of motion coherence. This method can estimate complex non-linear non-rigid transformations, and is shown to be accurate on 2D and 3D examples and robust in the presence of outliers and missing points.
Reference: text
sentIndex sentText sentNum sentScore
1 Non-rigid point set registration: Coherent Point Drift Andriy Myronenko Xubo Song Miguel Á. [sent-1, score-0.093]
2 Abstract We introduce Coherent Point Drift (CPD), a novel probabilistic method for nonrigid registration of point sets. [sent-4, score-0.556]
3 The registration is treated as a Maximum Likelihood (ML) estimation problem with motion coherence constraint over the velocity field such that one point set moves coherently to align with the second set. [sent-5, score-1.094]
4 We formulate the motion coherence constraint and derive a solution of regularized ML estimation through the variational approach, which leads to an elegant kernel form. [sent-6, score-0.475]
5 We also derive the EM algorithm for the penalized ML optimization with deterministic annealing. [sent-7, score-0.132]
6 The CPD method simultaneously finds both the non-rigid transformation and the correspondence between two point sets without making any prior assumption of the transformation model except that of motion coherence. [sent-8, score-0.591]
7 This method can estimate complex non-linear non-rigid transformations, and is shown to be accurate on 2D and 3D examples and robust in the presence of outliers and missing points. [sent-9, score-0.092]
8 1 Introduction Registration of point sets is an important issue for many computer vision applications such as robot navigation, image guided surgery, motion tracking, and face recognition. [sent-10, score-0.346]
9 In fact, it is the key component in tasks such as object alignment, stereo matching, point set correspondence, image segmentation and shape/pattern matching. [sent-11, score-0.116]
10 The registration problem is to find meaningful correspondence between two point sets and to recover the underlying transformation that maps one point set to the second. [sent-12, score-0.838]
11 The “points” in the point set are features, most often the locations of interest points extracted from an image. [sent-13, score-0.124]
12 Any geometrical feature can be represented as a point set; in this sense, point locations are the most general of all features. [sent-15, score-0.227]
13 Registration techniques can be rigid or non-rigid depending on the underlying transformation model. [sent-16, score-0.225]
14 The key characteristic of a rigid transformation is that all distances are preserved. [sent-17, score-0.197]
15 The simplest nonrigid transformation is affine, which also allows anisotropic scaling and skews. [sent-18, score-0.168]
16 However, the need for more general non-rigid registration occurs in many tasks, where complex non-linear transformation models are required. [sent-20, score-0.533]
17 Non-linear non-rigid registration remains a challenge in computer vision. [sent-21, score-0.414]
18 Another popular method for point set registration is the Iterative Closest Point (ICP) algorithm [2], which iteratively assigns correspondences and finds the least-squares transformation (usually rigid) relating the two point sets. [sent-26, score-0.831]
19 The algorithm then redetermines the closest point set and continues until it reaches a local minimum. [sent-27, score-0.137]
20 Nonetheless, ICP requires that the initial poses of the two point sets be adequately close, which is not always possible, especially when the transformation is non-rigid [3]. [sent-29, score-0.27]
21 The Robust Point Matching (RPM) method [4] allows global to local search and soft assignment of correspondences between two point sets. [sent-31, score-0.143]
22 In [5] it is further shown that the RPM algorithm is similar to Expectation Maximization (EM) algorithms for the mixture models, where one point set represents data points and the other represents centroids of mixture models. [sent-32, score-0.534]
23 According to regularization theory, the TPS parametrization is a solution of the interpolation problem in 2D that penalizes the second order derivatives of the transformation. [sent-34, score-0.098]
24 In 3D the solution is not differentiable at point locations. [sent-35, score-0.114]
25 A correlation-based approach to point set registration is proposed in [8]. [sent-39, score-0.507]
26 The registration is considered as the alignment between the two distributions that minimizes a similarity function defined by L2 norm. [sent-41, score-0.414]
27 Once again, a thin-plate spline is used to parameterize the smooth non-linear underlying transformation. [sent-43, score-0.112]
28 In this paper we introduce a probabilistic method for point set registration that we call the Coherent Point Drift (CPD) method. [sent-44, score-0.507]
29 Similar to [5], given two point sets, we fit a GMM to the first point set, whose Gaussian centroids are initialized from the points in the second set. [sent-45, score-0.376]
30 However, unlike [4, 5, 9], which assume a thin-plate spline transformation, we do not make any explicit assumption about the transformation model. [sent-46, score-0.178]
31 Instead, we consider the process of adapting the Gaussian centroids from their initial positions to their final positions as a temporal motion process, and impose a motion coherence constraint over the velocity field. [sent-47, score-1.057]
32 Velocity coherence is a particular way of imposing smoothness on the underlying transformation. [sent-48, score-0.252]
33 The concept of motion coherence was proposed in the Motion Coherence Theory [10]. [sent-49, score-0.332]
34 This motion coherence constraint penalizes derivatives of all orders of the underlying velocity field (thin-plate spline only penalizes the second order derivative). [sent-51, score-0.778]
35 Examples of velocity fields with different levels of motion coherence for different point correspondence are illustrated in Fig. [sent-52, score-0.715]
36 Figure 1: (a) Two given point sets. [sent-54, score-0.093]
37 (c, d) Velocity fields that are less coherent for the given correspondences. [sent-56, score-0.107]
38 We derive a solution for the velocity field through a variational approach by maximizing the likelihood of GMM penalized by motion coherence. [sent-57, score-0.509]
39 We show that the final transformation has an elegant kernel form. [sent-58, score-0.18]
40 We also derive an EM algorithm for the penalized ML optimization with deterministic annealing. [sent-59, score-0.132]
41 Once we have the final positions of the GMM centroids, the correspondence between the two point sets can be easily inferred through the posterior probability of the Gaussian mixture components given the first point set. [sent-60, score-0.428]
42 Our method is a true probabilistic approach and is shown to be accurate and robust in the presence of outliers and missing points, and is effective for estimation of complex non-linear non-rigid transformations. [sent-61, score-0.092]
43 2 Method Assume two point sets are given, where the template point set Y = (y_1, . [sent-66, score-0.391]
44 , y_M)^T (expressed as an M × D matrix) should be aligned with the reference point set X = (x_1, . [sent-69, score-0.183]
45 We consider the points in Y as the centroids of a Gaussian Mixture Model, and fit it to the data points X by maximizing the likelihood function. [sent-73, score-0.221]
46 We denote Y0 as the initial centroid positions and define a continuous velocity function v for the template point set such that the current position of the centroids is defined as Y = v(Y0) + Y0. [sent-74, score-0.762]
47 Consider a Gaussian-mixture density $p(x) = \sum_{m=1}^{M} \frac{1}{M} p(x|m)$ with $x|m \sim \mathcal{N}(y_m, \sigma^2 I_D)$, where Y represents D-dimensional centroids of equally-weighted Gaussians with equal isotropic covariance matrices, and the set X represents data points. [sent-75, score-0.261]
48 In order to enforce a smooth motion constraint, we define the prior $p(Y|\lambda) \propto \exp\left(-\frac{\lambda}{2}\phi(Y)\right)$, where λ is a weighting constant and φ(Y) is a function that regularizes the motion to be smooth. [sent-76, score-0.363]
49 Using Bayes' theorem, we want to find the parameters Y by maximizing the posterior probability, or equivalently by minimizing the following energy function: $E(Y) = -\sum_{n=1}^{N} \log \sum_{m=1}^{M} e^{-\frac{1}{2}\left\|\frac{x_n - y_m}{\sigma}\right\|^2} + \frac{\lambda}{2}\phi(Y)$ (1) We make the i. [sent-77, score-0.103]
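As a rough illustration of Eq. 1, the energy can be evaluated numerically as sketched below. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `cpd_energy` and the argument `phi` (an already-computed value of the regularizer φ(Y)) are hypothetical names introduced for the example.

```python
import numpy as np

def cpd_energy(X, Y, sigma, lam=0.0, phi=0.0):
    """Energy of Eq. 1: negative log-likelihood plus (lam / 2) * phi.

    X: (N, D) data points, Y: (M, D) centroids. `phi` is a hypothetical
    argument holding an already-computed regularizer value phi(Y).
    """
    # Squared distance between every x_n and every y_m, shape (N, M).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    a = -0.5 * d2 / sigma**2
    # Log-sum-exp over the centroids, shifted by the row maximum for stability.
    a_max = a.max(axis=1, keepdims=True)
    log_sum = a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))
    return -log_sum.sum() + 0.5 * lam * phi
```

The shift by the row maximum matters once σ becomes small, since the individual exponentials would otherwise underflow to zero.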
50 Equation 1 has a similar form to that of Generalized Elastic Net (GEN) [11], which has shown good performance in nonrigid image registration [12]; note that there we directly penalized Y, while here we penalize the transformation v. [sent-81, score-0.656]
51 Specifically, we want the velocity field v generated by template point set displacement to be smooth. [sent-83, score-0.53]
52 Here $\tilde{v}$ denotes the Fourier transform of the velocity and $\tilde{G}$ represents a symmetric low-pass filter, so that its Fourier transform G is real and symmetric. [sent-88, score-0.398]
53 2 has the form of the radial basis function: $v(z) = \sum_{m=1}^{M} w_m G(z - y_{0m})$ (3) We choose a Gaussian kernel form for G (note it is not related to the Gaussian form of the distribution chosen for the mixture model). [sent-90, score-0.21]
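The radial basis form of Eq. 3 can be sketched in a few lines. The extracted text does not state the exact parameterization of the Gaussian kernel, so the width parameter `beta` and the scaling exp(-||z - y||^2 / (2 beta^2)) used here are assumptions, as are the function names.

```python
import numpy as np

def gaussian_kernel(A, B, beta):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 beta^2)).

    The width `beta` and this particular scaling are assumptions.
    """
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / beta**2)

def velocity(Z, Y0, W, beta):
    """Evaluate v(z) = sum_m w_m G(z - y_0m) at each row z of Z (Eq. 3)."""
    # Row i of the kernel matrix is G(z_i, .), so v(Z) = G(Z, Y0) W.
    return gaussian_kernel(Z, Y0, beta) @ W
```

Evaluating `velocity` on a regular grid of points Z is one way to visualize the deformation field, since the field is defined everywhere and not only at the centroids.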
54 The regularization term $\int_{\mathbb{R}^d} |\tilde{v}(s)|^2 / \tilde{G}(s)\, ds$, with a Gaussian function for G, is equivalent to the sum of weighted squares of all order derivatives of the velocity field, $\int_{\mathbb{R}^d} \sum^{\infty} \frac{\beta^{2m}}{m!\,\ldots}\ldots$ [sent-95, score-0.293]
55 The equivalence of the regularization term with that of the Motion Coherence Theory implies that we are imposing motion coherence among the points and thus we call our method the Coherent Point Drift (CPD) method. [sent-97, score-0.419]
56 7 · Update Y = Y0 + GW • Anneal σ = ασ • Compute the velocity field: v(z) = G(z, ·)W Figure 2: Pseudo-code of CPD algorithm. [sent-102, score-0.231]
57 4 as (E-step): $Q(W) = \sum_{n=1}^{N} \sum_{m=1}^{M} P^{old}(m|x_n) \frac{\left\|x_n - y_{0m} - G(m,\cdot)W\right\|^2}{2\sigma^2} + \frac{\lambda}{2}\operatorname{tr}\left(W^T G W\right)$ (5) where $P^{old}$ denotes the posterior probabilities calculated using previous parameter values, and G(m, ·) denotes the mth row of G. [sent-110, score-0.176]
58 where P is a matrix of posterior probabilities with p_mn = e… The diag(·) notation indicates a diagonal matrix and 1 is a column vector of all ones. [sent-115, score-0.151]
59 This results in a system of nonlinear equations that can be iteratively solved using fixed point update, which is exactly the EM algorithm shown above. [sent-123, score-0.114]
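The EM iteration described above can be sketched as follows. This is a hedged sketch, not the authors' code. The M-step is obtained here by setting the gradient of Q(W) in Eq. 5 to zero and assuming G is invertible, which gives the linear system (diag(P1) G + λσ² I) W = P X − diag(P1) Y0. The kernel scaling, the function names, the default parameter values and the choice of one EM update per annealing step are all assumptions made for brevity.

```python
import numpy as np

def gaussian_kernel(Y0, beta):
    # Assumed kernel scaling: exp(-||y_i - y_j||^2 / (2 beta^2)).
    d2 = ((Y0[:, None, :] - Y0[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / beta**2)

def e_step(X, Y, sigma):
    """Posterior matrix P of shape (M, N) with P[m, n] = P(m | x_n)."""
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    a = -0.5 * d2 / sigma**2
    a -= a.max(axis=0, keepdims=True)  # shift for numerical stability
    P = np.exp(a)
    return P / P.sum(axis=0, keepdims=True)

def m_step(X, Y0, G, P, sigma, lam):
    """Solve (diag(P1) G + lam sigma^2 I) W = P X - diag(P1) Y0 for W."""
    p1 = P.sum(axis=1)  # the vector P1
    A = p1[:, None] * G + lam * sigma**2 * np.eye(Y0.shape[0])
    B = P @ X - p1[:, None] * Y0
    return np.linalg.solve(A, B)

def cpd(X, Y0, beta=1.0, lam=1.0, sigma=1.0, alpha=0.95, n_iter=50):
    """Simplified loop: one EM update, then anneal sigma (placeholder defaults)."""
    G = gaussian_kernel(Y0, beta)
    W = np.zeros(Y0.shape)
    for _ in range(n_iter):
        Y = Y0 + G @ W                       # current centroid positions
        P = e_step(X, Y, sigma)              # E-step
        W = m_step(X, Y0, G, P, sigma, lam)  # M-step
        sigma *= alpha                       # anneal
    return Y0 + G @ W, W
```

A call such as `Y, W = cpd(X, Y0)` returns the moved template and the kernel weights; a fuller implementation would iterate EM to convergence at each value of σ before annealing.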
60 The use of a probabilistic assignment of correspondences between point sets is innately more robust than the binary assignment used in ICP. [sent-128, score-0.231]
61 However, the GMM requires that each data point be explained by the model. [sent-129, score-0.093]
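62 This new component changes the posterior probability matrix P to $p_{mn} = e^{-\frac{1}{2}\left\|\frac{y_m^{old} - x_n}{\sigma}\right\|^2} \Big/ \left( \frac{(2\pi\sigma^2)^{D/2}}{a} + \sum_{m=1}^{M} e^{-\frac{1}{2}\left\|\frac{y_m^{old} - x_n}{\sigma}\right\|^2} \right)$ in Eq. [sent-131, score-0.109]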
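The effect of the extra component on the posterior can be illustrated with a small sketch. Because the constant term in the denominator is hard to recover reliably from the extracted text, it is passed in directly as a hypothetical parameter `c` rather than computed from σ, D and a; the function name is also an assumption.

```python
import numpy as np

def posterior_with_outliers(X, Y, sigma, c):
    """Posterior P[m, n] with an additive constant c > 0 in the denominator.

    `c` stands in for the term contributed by the extra component; its
    exact expression is not reproduced here.
    """
    d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # shape (M, N)
    e = np.exp(-0.5 * d2 / sigma**2)
    # Columns no longer sum to 1: the remaining mass goes to the extra component.
    return e / (c + e.sum(axis=0, keepdims=True))
```

A data point far from every centroid gets a near-zero column in P, so it has little influence on the M-step instead of having to be explained by some Gaussian component.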
63 We use deterministic annealing for σ, starting with a large value and gradually reducing it according to σ = ασ, where α is the annealing rate (normally between [0. [sent-141, score-0.162]
64 The first column shows template (◦) and reference (+) point sets. [sent-160, score-0.4]
65 The second column shows the registered position of the template set superimposed over the reference set. [sent-161, score-0.389]
66 The third column represents the recovered underlying deformation. [sent-162, score-0.167]
67 The last column shows the link between initial and final template point positions (only every second point’s displacement is shown). [sent-163, score-0.427]
68 All point sets are preprocessed to have zero mean and unit variance (which normalizes translation and scaling). [sent-165, score-0.153]
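This preprocessing step might look like the sketch below. The text does not say whether the variance is normalized per coordinate or globally; the sketch assumes a single global scale so that the shape is not distorted, and the function name is hypothetical.

```python
import numpy as np

def normalize(points):
    """Center a point set and rescale it to unit overall variance.

    Assumes one isotropic scale factor (an assumption, see lead-in) and a
    non-degenerate point set (scale > 0).
    """
    mean = points.mean(axis=0)
    centered = points - mean
    # After centering, the standard deviation over all entries is the RMS value.
    scale = centered.std()
    return centered / scale, mean, scale
```

Returning `mean` and `scale` allows the registered points to be mapped back to the original coordinates via `Q * scale + mean`.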
69 We compare our method on non-rigid point registration with RPM and ICP. [sent-166, score-0.507]
70 The RPM and ICP implementations and the 2D point sets used for comparison are taken from the TPS-RPM Matlab package [4]. [sent-167, score-0.125]
71 We show the velocity field through the deformation of a regular grid. [sent-172, score-0.275]
72 The deformation field for RPM corresponds to parameterized TPS transformation, while that for CPD represents a motion coherent non-linear deformation. [sent-173, score-0.371]
73 The fish head in the reference point set is removed, and random noise is added. [sent-176, score-0.183]
74 The CPD algorithm shows robustness even in the area of missing points and corrupted data. [sent-178, score-0.12]
75 We artificially deform the control point positions non-rigidly and use it as a template point set. [sent-185, score-0.442]
76 The original control point positions are used as a reference point set. [sent-186, score-0.359]
77 CPD is effective and accurate for this 3D non-rigid registration problem. [sent-187, score-0.414]
78 Figure 4 (panels: CPD algorithm, RPM algorithm, ICP algorithm): The reference point set is corrupted to make the registration task more challenging. [sent-188, score-0.681]
79 Noise is added and the fish head is removed in the reference point set. [sent-189, score-0.211]
80 The tail is also removed in the template point set. [sent-190, score-0.317]
81 The first column shows template (◦) and reference (+) point sets. [sent-191, score-0.4]
82 The second column shows the registered position of the template set superimposed over the reference set. [sent-192, score-0.389]
83 The last column shows the link between the initial and final template point positions. [sent-194, score-0.336]
84 4 Discussion and Conclusion We introduce Coherent Point Drift, a new probabilistic method for non-rigid registration of two point sets. [sent-195, score-0.507]
85 The registration is considered as a Maximum Likelihood estimation problem, where one point set represents centroids of a GMM and the other represents the data. [sent-196, score-0.768]
86 We regularize the velocity field over the points domain to enforce coherent motion and define the mathematical formulation of this constraint. [sent-197, score-0.538]
87 We derive the solution for the penalized ML estimation through the variational approach, and show that the final transformation has an elegant kernel form. [sent-198, score-0.289]
88 The estimated velocity field represents the underlying non-rigid transformation. [sent-200, score-0.31]
89 Once we have the final positions of the GMM centroids, the correspondence between the two point sets can be easily inferred through the posterior probability of the Gaussian mixture components given the first point set. [sent-201, score-0.271]
90 Figure 5: The results of CPD non-rigid registration on 3D point sets. [sent-223, score-0.507]
91 (a, d) The reference face and its control point set. [sent-224, score-0.237]
92 (b, e) The template face and its control point set. [sent-225, score-0.32]
93 (c, f) Result obtained by registering the template point set onto the reference point set using CPD. [sent-226, score-0.449]
94 The computational complexity of CPD is O(M^3), where M is the number of points in the template point set. [sent-228, score-0.297]
95 It is worth mentioning that the components in the point vector are not limited to spatial coordinates. [sent-229, score-0.093]
96 Typically, such a transformation can first be compensated by other well-known global registration techniques before the CPD algorithm is carried out. [sent-237, score-0.554]
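97 Appendix A $E = -\sum_{n=1}^{N} \log \sum_{m=1}^{M} e^{-\frac{1}{2}\left\|\frac{x_n - y_m}{\sigma}\right\|^2} + \frac{\lambda}{2} \int_{\mathbb{R}^d} \frac{|\tilde{v}(s)|^2}{\tilde{G}(s)}\, ds$ (8) Consider the function in Eq. [sent-239, score-0.157]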
98 8, where $y_m = y_{0m} + v(y_{0m})$, and $y_{0m}$ is the initial position of the point $y_m$. [sent-240, score-0.332]
99 v is a continuous velocity function and $v(y_{0m}) = \int_{\mathbb{R}^d} \tilde{v}(s)\, e^{2\pi i \langle y_{0m}, s \rangle}\, ds$ in terms of its Fourier transform $\tilde{v}$. [sent-241, score-0.367]
100 A robust algorithm for point set registration using mixture of gaussians. [sent-311, score-0.623]
wordName wordTfidf (topN-words)
[('cpd', 0.46), ('registration', 0.414), ('rpm', 0.29), ('icp', 0.268), ('velocity', 0.231), ('template', 0.173), ('motion', 0.169), ('coherence', 0.163), ('centroids', 0.159), ('ym', 0.142), ('transformation', 0.119), ('wm', 0.118), ('gmm', 0.113), ('coherent', 0.107), ('point', 0.093), ('reference', 0.09), ('ds', 0.09), ('em', 0.079), ('gw', 0.078), ('rigid', 0.078), ('drift', 0.074), ('rd', 0.074), ('amn', 0.067), ('miguel', 0.067), ('tps', 0.067), ('xn', 0.067), ('mixture', 0.064), ('fourier', 0.059), ('correspondence', 0.059), ('spline', 0.059), ('positions', 0.058), ('eld', 0.053), ('annealing', 0.052), ('penalized', 0.051), ('represents', 0.051), ('song', 0.049), ('elastic', 0.049), ('nonrigid', 0.049), ('ml', 0.049), ('transform', 0.046), ('awarded', 0.045), ('iis', 0.045), ('myronenko', 0.045), ('pmn', 0.045), ('yold', 0.045), ('deformation', 0.044), ('column', 0.044), ('geometrical', 0.041), ('old', 0.04), ('mct', 0.039), ('chui', 0.039), ('energy', 0.036), ('penalizes', 0.036), ('gaussian', 0.036), ('registered', 0.035), ('smoothness', 0.035), ('deterministic', 0.035), ('outliers', 0.035), ('diag', 0.033), ('displacement', 0.033), ('elegant', 0.033), ('variational', 0.033), ('sets', 0.032), ('derivative', 0.032), ('derivatives', 0.032), ('robust', 0.031), ('sh', 0.031), ('points', 0.031), ('regularization', 0.03), ('posterior', 0.029), ('nal', 0.029), ('face', 0.029), ('underlying', 0.028), ('associating', 0.028), ('kernel', 0.028), ('removed', 0.028), ('translation', 0.028), ('px', 0.026), ('imposing', 0.026), ('initial', 0.026), ('missing', 0.026), ('control', 0.025), ('superimposed', 0.025), ('correspondences', 0.025), ('assignment', 0.025), ('smooth', 0.025), ('derive', 0.025), ('symmetric', 0.024), ('constraint', 0.024), ('tail', 0.023), ('gradually', 0.023), ('reaches', 0.023), ('image', 0.023), ('lter', 0.022), ('position', 0.022), ('algorithm', 0.021), ('corrupted', 0.021), ('robustness', 0.021), ('appendix', 0.021), ('differentiable', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 147 nips-2006-Non-rigid point set registration: Coherent Point Drift
Author: Andriy Myronenko, Xubo Song, Miguel Á. Carreira-Perpiñán
Abstract: We introduce Coherent Point Drift (CPD), a novel probabilistic method for nonrigid registration of point sets. The registration is treated as a Maximum Likelihood (ML) estimation problem with motion coherence constraint over the velocity field such that one point set moves coherently to align with the second set. We formulate the motion coherence constraint and derive a solution of regularized ML estimation through the variational approach, which leads to an elegant kernel form. We also derive the EM algorithm for the penalized ML optimization with deterministic annealing. The CPD method simultaneously finds both the non-rigid transformation and the correspondence between two point sets without making any prior assumption of the transformation model except that of motion coherence. This method can estimate complex non-linear non-rigid transformations, and is shown to be accurate on 2D and 3D examples and robust in the presence of outliers and missing points.
2 0.29711699 42 nips-2006-Bayesian Image Super-resolution, Continued
Author: Lyndsey C. Pickup, David P. Capel, Stephen J. Roberts, Andrew Zisserman
Abstract: This paper develops a multi-frame image super-resolution approach from a Bayesian view-point by marginalizing over the unknown registration parameters relating the set of input low-resolution views. In Tipping and Bishop’s Bayesian image super-resolution approach [16], the marginalization was over the superresolution image, necessitating the use of an unfavorable image prior. By integrating over the registration parameters rather than the high-resolution image, our method allows for more realistic prior distributions, and also reduces the dimension of the integral considerably, removing the main computational bottleneck of the other algorithm. In addition to the motion model used by Tipping and Bishop, illumination components are introduced into the generative model, allowing us to handle changes in lighting as well as motion. We show results on real and synthetic datasets to illustrate the efficacy of this approach.
3 0.10781903 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification
Author: Ali Rahimi, Ben Recht
Abstract: We derive a cost functional for estimating the relationship between high-dimensional observations and the low-dimensional process that generated them with no input-output examples. Limiting our search to invertible observation functions confers numerous benefits, including a compact representation and no suboptimal local minima. Our approximation algorithms for optimizing this cost functional are fast and give diagnostic bounds on the quality of their solution. Our method can be viewed as a manifold learning algorithm that utilizes a prior on the low-dimensional manifold coordinates. The benefits of taking advantage of such priors in manifold learning and searching for the inverse observation functions in system identification are demonstrated empirically by learning to track moving targets from raw measurements in a sensor network setting and in an RFID tracking experiment.
4 0.10096779 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations
Author: Lorenzo Torresani, Peggy Hackney, Christoph Bregler
Abstract: This paper presents an algorithm for synthesis of human motion in specified styles. We use a theory of movement observation (Laban Movement Analysis) to describe movement styles as points in a multi-dimensional perceptual space. We cast the task of learning to synthesize desired movement styles as a regression problem: sequences generated via space-time interpolation of motion capture data are used to learn a nonlinear mapping between animation parameters and movement styles in perceptual space. We demonstrate that the learned model can apply a variety of motion styles to pre-recorded motion sequences and it can extrapolate styles not originally included in the training data.
5 0.097109228 110 nips-2006-Learning Dense 3D Correspondence
Author: Florian Steinke, Volker Blanz, Bernhard Schölkopf
Abstract: Establishing correspondence between distinct objects is an important and nontrivial task: correctness of the correspondence hinges on properties which are difficult to capture in an a priori criterion. While previous work has used a priori criteria which in some cases led to very good results, the present paper explores whether it is possible to learn a combination of features that, for a given training set of aligned human heads, characterizes the notion of correct correspondence. By optimizing this criterion, we are then able to compute correspondence and morphs for novel heads.
6 0.07991007 31 nips-2006-Analysis of Contour Motions
7 0.079534218 134 nips-2006-Modeling Human Motion Using Binary Latent Variables
8 0.071968608 80 nips-2006-Fundamental Limitations of Spectral Clustering
9 0.061624385 57 nips-2006-Conditional mean field
10 0.059042536 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
11 0.053651679 33 nips-2006-Analysis of Representations for Domain Adaptation
12 0.052875381 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
13 0.05266327 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation
14 0.049312048 153 nips-2006-Online Clustering of Moving Hyperplanes
15 0.049047273 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
16 0.048103515 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
17 0.047649544 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
18 0.04711578 120 nips-2006-Learning to Traverse Image Manifolds
19 0.046621725 128 nips-2006-Manifold Denoising
20 0.046476375 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation
topicId topicWeight
[(0, -0.173), (1, 0.034), (2, 0.084), (3, -0.016), (4, 0.032), (5, -0.077), (6, 0.03), (7, -0.085), (8, 0.003), (9, 0.147), (10, 0.054), (11, -0.065), (12, -0.042), (13, 0.075), (14, 0.069), (15, -0.002), (16, -0.145), (17, -0.228), (18, -0.096), (19, 0.145), (20, -0.09), (21, 0.087), (22, -0.013), (23, -0.023), (24, -0.193), (25, 0.142), (26, 0.026), (27, 0.217), (28, 0.122), (29, 0.07), (30, 0.014), (31, -0.104), (32, -0.109), (33, -0.046), (34, 0.03), (35, 0.021), (36, -0.115), (37, 0.156), (38, 0.04), (39, 0.191), (40, -0.143), (41, -0.058), (42, -0.113), (43, 0.173), (44, -0.004), (45, 0.051), (46, -0.112), (47, 0.026), (48, -0.1), (49, -0.089)]
simIndex simValue paperId paperTitle
same-paper 1 0.94782811 147 nips-2006-Non-rigid point set registration: Coherent Point Drift
Author: Andriy Myronenko, Xubo Song, Miguel Á. Carreira-Perpiñán
Abstract: We introduce Coherent Point Drift (CPD), a novel probabilistic method for nonrigid registration of point sets. The registration is treated as a Maximum Likelihood (ML) estimation problem with motion coherence constraint over the velocity field such that one point set moves coherently to align with the second set. We formulate the motion coherence constraint and derive a solution of regularized ML estimation through the variational approach, which leads to an elegant kernel form. We also derive the EM algorithm for the penalized ML optimization with deterministic annealing. The CPD method simultaneously finds both the non-rigid transformation and the correspondence between two point sets without making any prior assumption of the transformation model except that of motion coherence. This method can estimate complex non-linear non-rigid transformations, and is shown to be accurate on 2D and 3D examples and robust in the presence of outliers and missing points.
2 0.73752135 42 nips-2006-Bayesian Image Super-resolution, Continued
Author: Lyndsey C. Pickup, David P. Capel, Stephen J. Roberts, Andrew Zisserman
Abstract: This paper develops a multi-frame image super-resolution approach from a Bayesian view-point by marginalizing over the unknown registration parameters relating the set of input low-resolution views. In Tipping and Bishop’s Bayesian image super-resolution approach [16], the marginalization was over the superresolution image, necessitating the use of an unfavorable image prior. By integrating over the registration parameters rather than the high-resolution image, our method allows for more realistic prior distributions, and also reduces the dimension of the integral considerably, removing the main computational bottleneck of the other algorithm. In addition to the motion model used by Tipping and Bishop, illumination components are introduced into the generative model, allowing us to handle changes in lighting as well as motion. We show results on real and synthetic datasets to illustrate the efficacy of this approach.
3 0.39800751 200 nips-2006-Unsupervised Regression with Applications to Nonlinear System Identification
Author: Ali Rahimi, Ben Recht
Abstract: We derive a cost functional for estimating the relationship between high-dimensional observations and the low-dimensional process that generated them with no input-output examples. Limiting our search to invertible observation functions confers numerous benefits, including a compact representation and no suboptimal local minima. Our approximation algorithms for optimizing this cost functional are fast and give diagnostic bounds on the quality of their solution. Our method can be viewed as a manifold learning algorithm that utilizes a prior on the low-dimensional manifold coordinates. The benefits of taking advantage of such priors in manifold learning and searching for the inverse observation functions in system identification are demonstrated empirically by learning to track moving targets from raw measurements in a sensor network setting and in an RFID tracking experiment.
4 0.33558181 45 nips-2006-Blind Motion Deblurring Using Image Statistics
Author: Anat Levin
Abstract: We address the problem of blind motion deblurring from a single image, caused by a few moving objects. In such situations only part of the image may be blurred, and the scene consists of layers blurred in different degrees. Most existing blind deconvolution research concentrates on recovering a single blurring kernel for the entire image. However, in the case of different motions, the blur cannot be modeled with a single kernel, and trying to deconvolve the entire image with the same kernel will cause serious artifacts. Thus, the task of deblurring needs to involve segmentation of the image into regions with different blurs. Our approach relies on the observation that the statistics of derivative filters in images are significantly changed by blur. Assuming the blur results from a constant velocity motion, we can limit the search to one dimensional box filter blurs. This enables us to model the expected derivatives distributions as a function of the width of the blur kernel. Those distributions are surprisingly powerful in discriminating regions with different blurs. The approach produces convincing deconvolution results on real world images with rich texture.
5 0.33083981 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
Author: Graham Mcneill, Sethu Vijayakumar
Abstract: Correspondence algorithms typically struggle with shapes that display part-based variation. We present a probabilistic approach that matches shapes using independent part transformations, where the parts themselves are learnt during matching. Ideas from semi-supervised learning are used to bias the algorithm towards finding ‘perceptually valid’ part structures. Shapes are represented by unlabeled point sets of arbitrary size and a background component is used to handle occlusion, local dissimilarity and clutter. Thus, unlike many shape matching techniques, our approach can be applied to shapes extracted from real images. Model parameters are estimated using an EM algorithm that alternates between finding a soft correspondence and computing the optimal part transformations using Procrustes analysis.
6 0.3243778 110 nips-2006-Learning Dense 3D Correspondence
7 0.32009351 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations
8 0.29788652 121 nips-2006-Learning to be Bayesian without Supervision
9 0.29742166 31 nips-2006-Analysis of Contour Motions
10 0.28115886 6 nips-2006-A Kernel Subspace Method by Stochastic Realization for Learning Nonlinear Dynamical Systems
11 0.27412045 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
12 0.26581946 182 nips-2006-Statistical Modeling of Images with Fields of Gaussian Scale Mixtures
13 0.24925463 95 nips-2006-Implicit Surfaces with Globally Regularised and Compactly Supported Basis Functions
14 0.24608336 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
15 0.2368688 153 nips-2006-Online Clustering of Moving Hyperplanes
16 0.23636594 139 nips-2006-Multi-dynamic Bayesian Networks
17 0.23062988 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
18 0.22606485 52 nips-2006-Clustering appearance and shape by learning jigsaws
19 0.22445501 159 nips-2006-Parameter Expanded Variational Bayesian Methods
20 0.22395179 39 nips-2006-Balanced Graph Matching
topicId topicWeight
[(1, 0.076), (3, 0.034), (7, 0.076), (9, 0.031), (22, 0.038), (44, 0.055), (57, 0.045), (65, 0.031), (69, 0.501), (71, 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.92056024 147 nips-2006-Non-rigid point set registration: Coherent Point Drift
Author: Andriy Myronenko, Xubo Song, Miguel Á. Carreira-Perpiñán
Abstract: We introduce Coherent Point Drift (CPD), a novel probabilistic method for nonrigid registration of point sets. The registration is treated as a Maximum Likelihood (ML) estimation problem with motion coherence constraint over the velocity field such that one point set moves coherently to align with the second set. We formulate the motion coherence constraint and derive a solution of regularized ML estimation through the variational approach, which leads to an elegant kernel form. We also derive the EM algorithm for the penalized ML optimization with deterministic annealing. The CPD method simultaneously finds both the non-rigid transformation and the correspondence between two point sets without making any prior assumption of the transformation model except that of motion coherence. This method can estimate complex non-linear non-rigid transformations, and is shown to be accurate on 2D and 3D examples and robust in the presence of outliers and missing points.
2 0.89189541 88 nips-2006-Greedy Layer-Wise Training of Deep Networks
Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle
Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.
3 0.87817436 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics
Author: John R. Hershey, Trausti Kristjansson, Steven Rennie, Peder A. Olsen
Abstract: Human listeners have the extraordinary ability to hear and recognize speech even when more than one person is talking. Their machine counterparts have historically been unable to compete with this ability, until now. We present a model-based system that performs on par with humans in the task of separating speech of two talkers from a single-channel recording. Remarkably, the system surpasses human recognition performance in many conditions. The models of speech use temporal dynamics to help infer the source speech signals, given mixed speech signals. The estimated source signals are then recognized using a conventional speech recognition system. We demonstrate that the system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task. One of the hallmarks of human perception is our ability to solve the auditory cocktail party problem: we can direct our attention to a given speaker in the presence of interfering speech, and understand what was said remarkably well. Until now the same could not be said for automatic speech recognition systems. However, we have recently introduced a system which in many conditions performs this task better than humans [1][2]. The model addresses the Pascal Speech Separation Challenge task [3], and outperforms all other published results by more than 10% word error rate (WER). In this model, dynamics are modeled using a layered combination of one or two Markov chains: one for long-term dependencies and another for short-term dependencies. The combination of the two speakers was handled via an iterative Laplace approximation method known as Algonquin [4]. Here we describe experiments that show better performance on the same task with a simpler version of the model.
The task we address is provided by the PASCAL Speech Separation Challenge [3], which provides standard training, development, and test data sets of single-channel speech mixtures following an arbitrary but simple grammar. In addition, the challenge organizers have conducted human-listening experiments to provide an interesting baseline for comparison of computational techniques. The overall system we developed is composed of three components: a speaker identification and gain estimation component, a signal separation component, and a speech recognition system. In this paper we focus on the signal separation component, which is composed of the acoustic and grammatical models. The details of the other components are discussed in [2]. Single-channel speech separation has previously been attempted using Gaussian mixture models (GMMs) on individual frames of acoustic features. However, such models tend to perform well only when speakers are of different gender or have rather different voices [4]. When speakers have similar voices, speaker-dependent mixture models cannot unambiguously identify the component speakers. In such cases it is helpful to model the temporal dynamics of the speech. Several models in the literature have attempted to do so either for recognition [5, 6] or enhancement [7, 8] of speech. Such models have typically been based on a discrete-state hidden Markov model (HMM) operating on a frame-based acoustic feature vector. Modeling the dynamics of the log spectrum of speech is challenging in that different speech components evolve at different time-scales. For example, the excitation, which carries mainly pitch, and the filter, which consists of the formant structure, are somewhat independent of each other. The formant structure closely follows the sequences of phonemes in each word, which are pronounced at a rate of several per second.
In non-tonal languages such as English, the pitch fluctuates with prosody over the course of a sentence, and is not directly coupled with the words being spoken. Nevertheless, it seems to be important in separating speech, because the pitch harmonics carry predictable structure that stands out against the background. We address the various dynamic components of speech by testing different levels of dynamic constraints in our models. We explore four different levels of dynamics: no dynamics, low-level acoustic dynamics, high-level grammar dynamics, and a layered combination, dual dynamics, of the acoustic and grammar dynamics. The grammar dynamics and dual dynamics models perform the best in our experiments. The acoustic models are combined to model mixtures of speech using two methods: a nonlinear model known as Algonquin, which models the combination of log-spectrum models as a sum in the power spectrum, and a simpler max model that combines two log spectra using the max function. It turns out that whereas Algonquin works well, our formulation of the max model does better overall. With the combination of the max model and grammar-level dynamics, the model produces remarkable results: it is often able to extract two utterances from a mixture even when they are from the same speaker (see footnote 1). Overall results are given in Table 1, which shows that our closest competitors are human listeners.
Table 1: Overall word error rates across all conditions on the challenge task. Human: average human error rate, IBM: our best result, Next Best: the best of the eight other published results on this task, and Chance: the theoretical error rate for random guessing.
System       Word Error Rate
Human        22.3%
IBM          22.6%
Next Best    34.2%
Chance       93.0%
1 Speech Models
The model consists of an acoustic model and temporal dynamics model for each source, and a mixing model, which models how the source models are combined to describe the mixture.
The acoustic features were short-time log spectrum frames computed every 15 ms. Each frame was of length 40 ms and a 640-point mixed-radix FFT was used. The DC component was discarded, producing a 319-dimensional log-power-spectrum feature vector $y_t$. The acoustic model consists of a set of diagonal-covariance Gaussians in the features. For a given speaker, $a$, we model the conditional probability of the log-power spectrum of each source signal $x^a$ given a discrete acoustic state $s^a$ as Gaussian, $p(x^a|s^a) = N(x^a; \mu_{s^a}, \Sigma_{s^a})$, with mean $\mu_{s^a}$ and covariance matrix $\Sigma_{s^a}$. We used 256 Gaussians, one per acoustic state, to model the acoustic space of each speaker. For efficiency and tractability we restrict the covariance to be diagonal. A model with no dynamics can be formulated by producing state probabilities $p(s^a)$, and is depicted in Figure 1(a).
Acoustic Dynamics: To capture the low-level dynamics of the acoustic signal, we modeled the acoustic dynamics of a given speaker, $a$, via state transitions $p(s^a_t|s^a_{t-1})$ as shown in Figure 1(b). There are 256 acoustic states, hence for each speaker $a$, we estimated a 256 × 256 element transition matrix $A^a$.
Grammar Dynamics: The grammar dynamics are modeled by grammar state transitions, $p(v^a_t|v^a_{t-1})$, which consist of left-to-right phone models. The legal word sequences are given by the Speech Separation Challenge grammar [3] and are modeled using a set of pronunciations that map from words to three-state context-dependent phone models. The state transition probabilities derived from these phone models are sparse in the sense that most transition probabilities are zero. We model speaker dependent distributions $p(s^a|v^a)$ that associate the grammar states, $v^a$, to the speaker-dependent acoustic states. These are learned from training data where the grammar state sequences and acoustic state sequences are known for each utterance. The grammar of our system has 506 states, so we estimate a 506 × 256 element conditional probability matrix $B^a$ for each speaker.
Dual Dynamics: The dual-dynamics model combines the acoustic dynamics with the grammar dynamics. It is useful in this case to avoid modeling the full combination of $s$ and $v$ states in the joint transitions $p(s^a_t|s^a_{t-1}, v_t)$. Instead we make a naive-Bayes assumption to approximate this as $\frac{1}{z}\, p(s^a_t|s^a_{t-1})^{\alpha}\, p(s^a_t|v_t)^{\beta}$, where $\alpha$ and $\beta$ adjust the relative influence of the two probabilities, and $z$ is the normalizing constant. Here we simply use the probability matrices $A^a$ and $B^a$, defined above.
Figure 1: Graph of models for a given source, with panels (a) No Dynamics, (b) Acoustic Dynamics, (c) Grammar Dynamics and (d) Dual Dynamics. In (a), there are no dynamics, so the model is a simple mixture model. In (b), only acoustic dynamics are modeled. In (c), grammar dynamics are modeled with a shared set of acoustic Gaussians; in (d), dual (grammar and acoustic) dynamics have been combined. Note that (a), (b) and (c) are special cases of (d), where different nodes are assumed independent.
Footnote 1: Demos and information can be found at: http://www.research.ibm.com/speechseparation
2 Mixed Speech Models
The speech separation challenge involves recognizing speech in mixtures of signals from two speakers, $a$ and $b$. We consider only mixing models that operate independently on each frequency for analytical and computational tractability. The short-time log spectrum of the mixture $y_t$, in a given frequency band, is related to that of the two sources $x^a_t$ and $x^b_t$ via the mixing model given by the conditional probability distribution, $p(y|x^a, x^b)$. The joint distribution of the observation and source in one feature dimension, given the source states is thus:
$$p(y_t, x^a_t, x^b_t|s^a_t, s^b_t) = p(y_t|x^a_t, x^b_t)\, p(x^a_t|s^a_t)\, p(x^b_t|s^b_t). \tag{1}$$
In general, to infer and reconstruct speech we need to compute the likelihood of the observed mixture
$$p(y_t|s^a_t, s^b_t) = \int p(y_t, x^a_t, x^b_t|s^a_t, s^b_t)\, dx^a_t\, dx^b_t, \tag{2}$$
and the posterior expected values of the sources given the states,
$$E(x^a_t|y_t, s^a_t, s^b_t) = \int x^a_t\, p(x^a_t, x^b_t|y_t, s^a_t, s^b_t)\, dx^a_t\, dx^b_t, \tag{3}$$
and similarly for $x^b_t$. These quantities, combined with a prior model for the joint state sequences $\{s^a_{1..T}, s^b_{1..T}\}$, allow us to compute the minimum mean squared error (MMSE) estimators $E(x^a_{1..T}|y_{1..T})$ or the maximum a posteriori (MAP) estimate $E(x^a_{1..T}|y_{1..T}, \hat{s}^a_{1..T}, \hat{s}^b_{1..T})$, where $\hat{s}^a_{1..T}, \hat{s}^b_{1..T} = \arg\max_{s^a_{1..T}, s^b_{1..T}} p(s^a_{1..T}, s^b_{1..T}|y_{1..T})$, and where the subscript $1..T$ refers to all frames in the signal. The mixing model can be defined in a number of ways. We explore two popular candidates, for which the above integrals can be readily computed: Algonquin, and the max model.
Figure 2: Model combination for two talkers, with panels (a) Mixing Model and (b) Dual Dynamics Factorial Model. In (a) all dependencies are shown. In (b) the full dual-dynamics model is graphed with the $x^a$ and $x^b$ integrated out, and corresponding states from each speaker combined into product states. The other models are special cases of this graph with different edges removed, as in Figure 1.
Algonquin: The relationship between the sources and mixture in the log power spectral domain is approximated as
$$p(y_t|x^a_t, x^b_t) = N(y_t; \log(\exp(x^a_t) + \exp(x^b_t)), \Psi) \tag{4}$$
where $\Psi$ is introduced to model the error due to the omission of phase [4]. An iterative Newton-Laplace method accurately approximates the conditional posterior $p(x^a_t, x^b_t|y_t, s^a_t, s^b_t)$ from (1) as Gaussian. This Gaussian allows us to analytically compute the observation likelihood $p(y_t|s^a_t, s^b_t)$ and expected value $E(x^a_t|y_t, s^a_t, s^b_t)$, as in [4].
Max model: The mixing model is simplified using the fact that the log of a sum is approximately the log of the maximum:
$$p(y|x^a, x^b) = \delta(y - \max(x^a, x^b)) \tag{5}$$
In this model the likelihood is
$$p(y_t|s^a_t, s^b_t) = p_{x^a_t}(y_t|s^a_t)\, \Phi_{x^b_t}(y_t|s^b_t) + p_{x^b_t}(y_t|s^b_t)\, \Phi_{x^a_t}(y_t|s^a_t), \tag{6}$$
where $\Phi_{x^a_t}(y_t|s^a_t) = \int_{-\infty}^{y_t} N(x^a_t; \mu_{s^a_t}, \Sigma_{s^a_t})\, dx^a_t$ is a Gaussian cumulative distribution function [5]. In [5], such a model was used to compute state likelihoods and find the optimal state sequence. In [8], a simplified model was used to infer binary masking values for refiltering. We take the max model a step further and derive source posteriors, so that we can compute the MMSE estimators for the log power spectrum. Note that the source posteriors in $x^a_t$ and $x^b_t$ are each a mixture of a delta function and a truncated Gaussian. Thus we analytically derive the necessary expected value:
$$E(x^a_t|y_t, s^a_t, s^b_t) = p(x^a_t = y_t|y_t, s^a_t, s^b_t)\, y_t + p(x^a_t < y_t|y_t, s^a_t, s^b_t)\, E(x^a_t|x^a_t < y_t, s^a_t) \tag{7}$$
$$= \pi^a_t\, y_t + \pi^b_t \left( \mu_{s^a_t} - \Sigma_{s^a_t} \frac{p_{x^a_t}(y_t|s^a_t)}{\Phi_{x^a_t}(y_t|s^a_t)} \right), \tag{8}$$
with weights $\pi^a_t = p(x^a_t = y_t|y_t, s^a_t, s^b_t) = p_{x^a_t}(y_t|s^a_t)\, \Phi_{x^b_t}(y_t|s^b_t) / p(y_t|s^a_t, s^b_t)$, and $\pi^b_t = 1 - \pi^a_t$. For many pairs of states one model is significantly louder than another, $\mu_{s^a} \gg \mu_{s^b}$, in a given frequency band, relative to their variances. In such cases it is reasonable to approximate the likelihood as $p(y_t|s^a_t, s^b_t) \approx p_{x^a_t}(y_t|s^a_t)$, and the posterior expected values according to $E(x^a_t|y_t, s^a_t, s^b_t) \approx y_t$ and $E(x^b_t|y_t, s^a_t, s^b_t) \approx \min(y_t, \mu_{s^b_t})$, and similarly for $\mu_{s^a} \ll \mu_{s^b}$.
3 Likelihood Estimation
Because of the large number of state combinations, the model would not be practical without techniques to reduce computation time. To speed up the evaluation of the joint state likelihood, we employed both band quantization of the acoustic Gaussians and joint-state pruning.
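The max-model likelihood of Eq. (6) and the posterior means of Eqs. (7)-(8) can be sketched for a single frequency band and a single pair of states. This is a minimal illustration, assuming scalar means and variances; the function names (`gauss_pdf`, `gauss_cdf`, `max_model`) are hypothetical and not from the paper, and no care is taken over numerical underflow when an observation lies far below a state mean.

```python
import math


def gauss_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gauss_cdf(y, mean, var):
    """Cumulative distribution of N(mean, var) evaluated at y."""
    return 0.5 * (1.0 + math.erf((y - mean) / math.sqrt(2.0 * var)))


def max_model(y, mean_a, var_a, mean_b, var_b):
    """Return (likelihood, E[x_a | y], E[x_b | y]) under y = max(x_a, x_b)."""
    pa, pb = gauss_pdf(y, mean_a, var_a), gauss_pdf(y, mean_b, var_b)
    ca, cb = gauss_cdf(y, mean_a, var_a), gauss_cdf(y, mean_b, var_b)
    lik = pa * cb + pb * ca                    # Eq. (6)
    pi_a = pa * cb / lik                       # posterior that source a is the max
    pi_b = 1.0 - pi_a
    # Eq. (8): a delta at y mixed with the mean of a Gaussian truncated above at y.
    ex_a = pi_a * y + pi_b * (mean_a - var_a * pa / ca)
    ex_b = pi_b * y + pi_a * (mean_b - var_b * pb / cb)   # same formula, roles swapped
    return lik, ex_a, ex_b


# Source a is much louder than source b in this band (illustrative numbers only).
print(max_model(0.0, 0.0, 1.0, -20.0, 1.0))
```

When one mean dominates, the returned values approach the approximations stated in the text: the likelihood is close to the louder source's density, its posterior mean is close to the observation, and the quieter source's posterior mean is close to the smaller of the observation and its own mean.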
Band Quantization: One source of computational savings stems from the fact that some of the Gaussians in our model may differ only in a few features. Band quantization addresses this by approximating each of the $D$ Gaussians of each model with a shared set of $d$ Gaussians, where $d \ll D$, in each of the $F$ frequency bands of the feature vector. A similar idea is described in [9]. It relies on the use of a diagonal covariance matrix, so that $p(x^a|s^a) = \prod_f N(x^a_f; \mu_{f,s^a}, \Sigma_{f,s^a})$, where $\Sigma_{f,s^a}$ are the diagonal elements of covariance matrix $\Sigma_{s^a}$. The mapping $M_f(s_i)$ associates each of the $D$ Gaussians with one of the $d$ Gaussians in band $f$. Now $\hat{p}(x^a|s^a) = \prod_f N(x^a_f; \mu_{f,M_f(s^a)}, \Sigma_{f,M_f(s^a)})$ is used as a surrogate for $p(x^a|s^a)$. Figure 3 illustrates the idea.
Figure 3: In band quantization, many multi-dimensional Gaussians are mapped to a few unidimensional Gaussians.
Under this model the $d$ Gaussians are optimized by minimizing the KL-divergence $D\left(\sum_{s^a} p(s^a)\, p(x^a|s^a) \,\|\, \sum_{s^a} p(s^a)\, \hat{p}(x^a|s^a)\right)$, and likewise for $s^b$. Then in each frequency band, only $d \times d$, instead of $D \times D$, combinations of Gaussians have to be evaluated to compute $p(y|s^a, s^b)$. Despite the relatively small number of components $d$ in each band, taken across bands, band quantization is capable of expressing $d^F$ distinct patterns in an $F$-dimensional feature space, although in practice only a subset of these will be used to approximate the Gaussians in a given model. We used $d = 8$ and $D = 256$, which reduced the likelihood computation time by three orders of magnitude.
Joint State Pruning: Another source of computational savings comes from the sparseness of the model. Only a handful of $s^a, s^b$ combinations have likelihoods that are significantly larger than the rest for a given observation. Only these states are required to adequately explain the observation.
By pruning the total number of combinations down to a smaller number we can speed up the likelihood calculation, estimation of the component signals, as well as the temporal inference. However, we must estimate the likelihoods in order to determine which states to retain. We therefore used band quantization to estimate likelihoods for all states, performed state pruning, and then evaluated the full model on the pruned states using the exact parameters. In the experiments reported here, we pruned down to 256 state combinations. The effect of these speedup methods on accuracy will be reported in a future publication.
4 Inference
In our experiments we performed inference in four different conditions: no dynamics, with acoustic dynamics only, with grammar dynamics only, and with dual dynamics (acoustic and grammar). With no dynamics the source models reduce to GMMs and we infer MMSE estimates of the sources based on $p(x^a, x^b|y)$ as computed from (1), using Algonquin or the max model. Once the log spectrum of each source is estimated, we estimate the corresponding time-domain signal as shown in [4]. In the acoustic dynamics condition the exact inference algorithm uses a 2-D Viterbi search, described below, with acoustic temporal constraints $p(s_t|s_{t-1})$ and likelihoods from Eqn. (1), to find the most likely joint state sequence $s_{1..T}$. Similarly in the grammar dynamics condition, 2-D Viterbi search is used to infer the grammar state sequences, $v_{1..T}$. Instead of single Gaussians as the likelihood models, however, we have mixture models in this case. So we can perform an MMSE estimate of the sources by averaging over the posterior probability of the mixture components given the grammar Viterbi sequence, and the observations. It is critical to use the 2-D Viterbi algorithm in both cases, rather than the forward-backward algorithm, because in the same-speaker condition at 0 dB, the acoustic models and dynamics are symmetric.
This symmetry means that the posterior is essentially bimodal and averaging over these modes would yield identical estimates for both speakers. By finding the best path through the joint state space, the 2-D Viterbi algorithm breaks this symmetry and allows the model to make different estimates for each speaker. In the dual-dynamics condition we use the model of Figure 2(b). With two speakers, exact inference is computationally complex because the full joint distribution of the grammar and acoustic states, $(v^a \times s^a) \times (v^b \times s^b)$, is required and is very large in number. Instead we perform approximate inference by alternating the 2-D Viterbi search between two factors: the Cartesian product $s^a \times s^b$ of the acoustic state sequences and the Cartesian product $v^a \times v^b$ of the grammar state sequences. When evaluating each state sequence we hold the other chain constant, which decouples its dynamics and allows for efficient inference. This is a useful factorization because the states $s^a$ and $s^b$ interact strongly with each other, and similarly for $v^a$ and $v^b$. Again, in the same-talker condition, the 2-D Viterbi search breaks the symmetry in each factor.
2-D Viterbi search: The Viterbi algorithm estimates the maximum-likelihood state sequence $s_{1..T}$ given the observations $x_{1..T}$. The complexity of the Viterbi search is $O(TD^2)$, where $D$ is the number of states and $T$ is the number of frames. For producing MAP estimates of the 2 sources, we require a 2-dimensional Viterbi search which finds the most likely joint state sequences $s^a_{1..T}$ and $s^b_{1..T}$ given the mixed signal $y_{1..T}$, as was proposed in [5]. On the surface, the 2-D Viterbi search appears to be of complexity $O(TD^4)$. Surprisingly, it can be computed in $O(TD^3)$ operations. This stems from the fact that the dynamics for each chain are independent. The forward-backward algorithm for a factorial HMM with $N$ state variables requires only $O(TND^{N+1})$ rather than the $O(TD^{2N})$ required for a naive implementation [10].
The same is true for the Viterbi algorithm. In the Viterbi algorithm, we wish to find the most probable paths leading to each state by finding the two arguments $s^a_{t-1}$ and $s^b_{t-1}$ of the following maximization:
$$\{\hat{s}^a_{t-1}, \hat{s}^b_{t-1}\} = \arg\max_{s^a_{t-1}, s^b_{t-1}} p(s^a_t|s^a_{t-1})\, p(s^b_t|s^b_{t-1})\, p(s^a_{t-1}, s^b_{t-1}|y_{1..t-1})$$
$$= \arg\max_{s^a_{t-1}} p(s^a_t|s^a_{t-1}) \max_{s^b_{t-1}} p(s^b_t|s^b_{t-1})\, p(s^a_{t-1}, s^b_{t-1}|y_{1..t-1}). \tag{9}$$
The two maximizations can be done in sequence, requiring $O(D^3)$ operations with $O(D^2)$ storage for each step. In general, as with the forward-backward algorithm, the $N$-dimensional Viterbi search requires $O(TND^{N+1})$ operations. We can also exploit the sparsity of the transition matrices and observation likelihoods, by pruning unlikely values. Using both of these methods our implementation of 2-D Viterbi search is faster than the acoustic likelihood computation that serves as its input, for the model sizes and grammars chosen in the speech separation task.
Speaker and Gain Estimation: In the challenge task, the gains and identities of the two speakers were unknown at test time and were selected from a set of 34 speakers which were mixed at SNRs ranging from 6 dB to -9 dB. We used speaker-dependent acoustic models because of their advantages when separating different speakers. These models were trained on gain-normalized data, so the models are not well matched to the different gains of the signals at test time. This means that we have to estimate both the speaker identities and the gain in order to adapt our models to the source signals for each test utterance. The number of speakers and range of SNRs in the test set makes it too expensive to consider every possible combination of models and gains. Instead, we developed an efficient model-based method for identifying the speakers and gains, described in [2].
The algorithm is based upon a very simple idea: identify and utilize frames that are dominated by a single source, based on their likelihoods under each speaker-dependent acoustic model, to determine what sources are present in the mixture. Using this criterion we can eliminate most of the unlikely speakers, and explore all combinations of the remaining speakers. An approximate EM procedure is then used to select a single pair of speakers and estimate their gains.
Recognition: Although inference in the system may involve recognition of the words (for models that contain a grammar), we still found that a separately trained recognizer performed better. After reconstruction, each of the two signals is therefore decoded with a speech recognition system that incorporates Speaker Dependent Labeling (SDL) [2]. This method uses speaker dependent models for each of the 34 speakers. Instead of using the speaker identities provided by the speaker ID and gain module, we followed the approach for gender dependent labeling (GDL) described in [11]. This technique provides better results than if the true speaker ID is specified.
5 Results
The Speech Separation Challenge [3] involves separating the mixed speech of two speakers drawn from a set of 34 speakers. An example utterance is "place white by R 4 now". In each recording, one of the speakers says "white" while the other says "blue", "red" or "green". The task is to recognize the letter and the digit of the speaker that said "white". Using the SDL recognizer, we decoded the two estimated signals under the assumption that one signal contains "white" and the other does not, and vice versa. We then used the association that yielded the highest combined likelihood.
Figure 4: Average word error rate (WER) as a function of model dynamics, in different talker conditions, compared to Human error rates, using Algonquin. (Plot not reproduced. Vertical axis: WER (%). Talker conditions: Same Talker, Same Gender, Different Gender, All. Series: No Separation, No dynamics, Acoustic Dyn., Grammar Dyn., Dual Dyn., Human.)
Human listener performance [3] is compared in Figure 4 to results using the SDL recognizer without speech separation, and for each of the proposed models. Performance is poor without separation in all conditions. With no dynamics the models do surprisingly well in the different talker conditions, but poorly when the signals come from the same talker. Acoustic dynamics gives some improvement, mainly in the same-talker condition. The grammar dynamics seems to give the most benefit, bringing the error rate in the same-gender condition below that of humans. The dual-dynamics model performed about the same as the grammar dynamics model, despite our intuitions. Replacing Algonquin with the max model reduced the error rate in the dual dynamics model (from 24.3% to 23.5%) and grammar dynamics model (from 24.6% to 22.6%), which brings the latter closer than any other model to the human recognition rate of 22.3%. Figure 5 shows the relative word error rate of the best system compared to human subjects. When both speakers are around the same loudness, the system exceeds human performance, and in the same-gender condition makes less than half the errors of the humans. Human listeners do better when the two signals are at different levels, even if the target is below the masker (i.e., at -9 dB), suggesting that they are better able to make use of differences in amplitude as a cue for separation.
Figure 5: Word error rate of best system relative to human performance. Shaded area is where the system outperforms human listeners. (Plot not reproduced. Vertical axis: Relative Word Error Rate (WER). Horizontal axis: Signal to Noise Ratio (SNR), from 6 dB to -9 dB. Series: Same Talker, Same Gender, Different Gender, Human.)
An interesting question is to what extent different grammar constraints affect the results.
To test this, we limited the grammar to just the two test utterances, and the error rate on the estimated sources dropped to around 10%. This may be a useful paradigm for separating speech from background noise when the text is known, such as in closed-captioned recordings. At the other extreme, in realistic speech recognition scenarios, there is little knowledge of the background speaker’s grammar. In such cases the benefits of models of low-level acoustic continuity over purely grammar-based systems may be more apparent. It is our hope that further experiments with both human and machine listeners will provide us with a better understanding of the differences in their performance characteristics, and provide insights into how the human auditory system functions, as well as how automatic speech perception in general can be brought to human levels of performance.
References
[1] T. Kristjansson, J. R. Hershey, P. A. Olsen, S. Rennie, and R. Gopinath, “Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system,” in ICSLP, 2006.
[2] Steven Rennie, Peder A. Olsen, John R. Hershey, and Trausti Kristjansson, “Separating multiple speakers using temporal constraints,” in ISCA Workshop on Statistical And Perceptual Audition, 2006.
[3] Martin Cooke and Tee-Won Lee, “Interspeech speech separation challenge,” http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm, 2006.
[4] T. Kristjansson, J. Hershey, and H. Attias, “Single microphone source separation using high resolution signal reconstruction,” ICASSP, 2004.
[5] P. Varga and R.K. Moore, “Hidden Markov model decomposition of speech and noise,” ICASSP, pp. 845–848, 1990.
[6] M. Gales and S. Young, “Robust continuous speech recognition using parallel model combination,” IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352–359, September 1996.
[7] Y. Ephraim, “A Bayesian estimation approach for speech enhancement using hidden Markov models,” vol. 40, no. 4, pp. 725–735, 1992.
[8] S. Roweis, “Factorial models and refiltering for speech separation and denoising,” Eurospeech, pp. 1009–1012, 2003.
[9] E. Bocchieri, “Vector quantization for the efficient computation of continuous density likelihoods,” in ICASSP, 1993, vol. II, pp. 692–695.
[10] Zoubin Ghahramani and Michael I. Jordan, “Factorial hidden Markov models,” in Advances in Neural Information Processing Systems, vol. 8.
[11] Peder Olsen and Satya Dharanipragada, “An efficient integrated gender detection scheme and time mediated averaging of gender dependent acoustic models,” in Eurospeech 2003, 2003, vol. 4, pp. 2509–2512.
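As a rough illustration of the sequential maximization in Eq. (9) of the paper text above, the sketch below compares one step of a joint maximization over both previous states (about D^4 operations) with the factored version that maximizes over the previous state of chain b first and chain a second (about 2·D^3 operations). It is a toy under stated assumptions: the function names and the 3-state matrices are hypothetical, back-pointers and the observation likelihood are omitted, and the equality relies on all probabilities being non-negative.

```python
import math


def step_naive(delta, A, B):
    """Joint maximization over (a_prev, b_prev) for every (a, b): O(D^4)."""
    D = len(delta)
    return [[max(A[ap][a] * B[bp][b] * delta[ap][bp]
                 for ap in range(D) for bp in range(D))
             for b in range(D)] for a in range(D)]


def step_factored(delta, A, B):
    """Maximize over b_prev first, then over a_prev: O(D^3) time, O(D^2) storage."""
    D = len(delta)
    # inner[ap][b] = max over b_prev of B[b_prev][b] * delta[ap][b_prev]
    inner = [[max(B[bp][b] * delta[ap][bp] for bp in range(D))
              for b in range(D)] for ap in range(D)]
    # out[a][b] = max over a_prev of A[a_prev][a] * inner[a_prev][b]
    return [[max(A[ap][a] * inner[ap][b] for ap in range(D))
             for b in range(D)] for a in range(D)]


# Illustrative transition matrices (rows sum to 1) and previous joint scores.
A = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
B = [[0.5, 0.25, 0.25], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
delta = [[0.10, 0.05, 0.15], [0.20, 0.05, 0.05], [0.10, 0.25, 0.05]]

naive = step_naive(delta, A, B)
fast = step_factored(delta, A, B)
print(all(math.isclose(naive[a][b], fast[a][b], rel_tol=1e-12)
          for a in range(3) for b in range(3)))
```

A full search would also record the maximizing previous states at each step and multiply in the frame likelihood before moving to the next frame; those parts are left out here to keep the comparison of the two maximization orders easy to read.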
4 0.86098099 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms
Author: Xinhua Zhang, Wee S. Lee
Abstract: Semi-supervised learning algorithms have been successfully applied in many applications with scarce labeled data, by utilizing the unlabeled data. One important category is graph based semi-supervised learning algorithms, for which the performance depends considerably on the quality of the graph, or its hyperparameters. In this paper, we deal with the less explored problem of learning the graphs. We propose a graph learning method for the harmonic energy minimization method; this is done by minimizing the leave-one-out prediction error on labeled data points. We use a gradient based method and design an efficient algorithm which significantly accelerates the calculation of the gradient by applying the matrix inversion lemma and using careful pre-computation. Experimental results show that the graph learning method is effective in improving the performance of the classification algorithm.
5 0.80884123 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation
Author: Daniel Tarlow, Gal Elidan, Daphne Koller, John C. Duchi
Abstract: In general, the problem of computing a maximum a posteriori (MAP) assignment in a Markov random field (MRF) is computationally intractable. However, in certain subclasses of MRF, an optimal or close-to-optimal assignment can be found very efficiently using combinatorial optimization algorithms: certain MRFs with mutual exclusion constraints can be solved using bipartite matching, and MRFs with regular potentials can be solved using minimum cut methods. However, these solutions do not apply to the many MRFs that contain such tractable components as sub-networks, but also other non-complying potentials. In this paper, we present a new method, called COMPOSE, for exploiting combinatorial optimization for sub-networks within the context of a max-product belief propagation algorithm. COMPOSE uses combinatorial optimization for computing exact max-marginals for an entire sub-network; these can then be used for inference in the context of the network as a whole. We describe highly efficient methods for computing max-marginals for subnetworks corresponding both to bipartite matchings and to regular networks. We present results on both synthetic and real networks encoding correspondence problems between images, which involve both matching constraints and pairwise geometric constraints. We compare to a range of current methods, showing that the ability of COMPOSE to transmit information globally across the network leads to improved convergence, decreased running time, and higher-scoring assignments.
6 0.57377481 134 nips-2006-Modeling Human Motion Using Binary Latent Variables
7 0.52570105 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
8 0.50876659 167 nips-2006-Recursive ICA
9 0.4608497 158 nips-2006-PG-means: learning the number of clusters in data
10 0.45955971 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
11 0.45574266 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models
12 0.45422351 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements
13 0.45149878 31 nips-2006-Analysis of Contour Motions
14 0.45052293 34 nips-2006-Approximate Correspondences in High Dimensions
15 0.44664127 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
16 0.44091621 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo
17 0.42583334 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
18 0.42566323 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
19 0.42538264 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
20 0.42447081 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation