nips nips2000 nips2000-59 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Cynthia Archer, Todd K. Leen
Abstract: We establish a principled framework for adaptive transform coding. Transform coders are often constructed by concatenating an ad hoc choice of transform with suboptimal bit allocation and quantizer design. Instead, we start from a probabilistic latent variable model in the form of a mixture of constrained Gaussian mixtures. From this model we derive a transform coding algorithm, which is a constrained version of the generalized Lloyd algorithm for vector quantizer design. A byproduct of our derivation is the introduction of a new transform basis, which unlike other transforms (PCA, DCT, etc.) is explicitly optimized for coding. Image compression experiments show adaptive transform coders designed with our algorithm improve compressed image signal-to-noise ratio up to 3 dB compared to global transform coding and 0.5 to 2 dB compared to other adaptive transform coders. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We establish a principled framework for adaptive transform coding. [sent-6, score-0.585]
2 Transform coders are often constructed by concatenating an ad hoc choice of transform with suboptimal bit allocation and quantizer design. [sent-7, score-1.06]
3 Instead, we start from a probabilistic latent variable model in the form of a mixture of constrained Gaussian mixtures. [sent-8, score-0.498]
4 From this model we derive a transform coding algorithm, which is a constrained version of the generalized Lloyd algorithm for vector quantizer design. [sent-9, score-0.991]
5 A byproduct of our derivation is the introduction of a new transform basis, which unlike other transforms (PCA, DCT, etc. [sent-10, score-0.543]
6 Image compression experiments show adaptive transform coders designed with our algorithm improve compressed image signal-to-noise ratio up to 3 dB compared to global transform coding and 0. [sent-12, score-1.779]
7 5 to 2 dB compared to other adaptive transform coders. [sent-13, score-0.585]
8 1 Introduction Compression algorithms for image and video signals often use transform coding as a low-complexity alternative to vector quantization (VQ). [sent-14, score-0.699]
9 Transform coders compress multi-dimensional data by transforming the signal vectors to new coordinates and coding the transform coefficients independently of one another with scalar quantizers. [sent-15, score-0.973]
10 The coordinate transform may be fixed a priori as in the discrete cosine transform (DCT). [sent-16, score-0.962]
11 It can also be adapted to the signal statistics using, for example, principal component analysis (PCA), where the goal is to concentrate signal energy in a few signal components. [sent-17, score-0.256]
12 Noting that signals such as images and speech are nonstationary, several researchers have developed non-linear [1, 2] and local linear or adaptive [3, 4] PCA transforms for dimension reduction. [sent-18, score-0.343]
13 None of these transforms is designed to minimize compression distortion, nor are they designed in concert with quantizer development. [sent-20, score-0.725]
14 In dimension reduction, the original d-dimensional signal is projected onto a subspace or submanifold of lower dimension. [sent-21, score-0.113]
15 Several researchers have extended the idea of local linear transforms to transform coding [5, 6, 7]. [sent-23, score-0.756]
16 In these adaptive transform coders, the signal space is partitioned into disjoint regions and a transform and set of scalar quantizers are designed for each region. [sent-24, score-1.317]
17 In our own previous work [7], we use k-means partitioning to define the regions. [sent-25, score-0.136]
18 Dony and Haykin [5] partition the space to minimize dimension-reduction error. [sent-26, score-0.157]
19 Tipping and Bishop [6] use soft partitioning according to a probabilistic rule that reduces, in the appropriate limit, to partitioning by dimension-reduction error. [sent-27, score-0.305]
20 These systems neither design transforms nor partition the signal space with the goal of minimizing compression distortion. [sent-28, score-0.426]
21 This ad hoc construction contrasts sharply with the solid grounding of vector quantization. [sent-29, score-0.081]
22 Nowlan [8] develops a probabilistic framework for VQ by demonstrating the correspondence between a VQ and a mixture of spherically symmetric Gaussians. [sent-30, score-0.261]
23 In the limit that the mixture component variance goes to zero, the Expectation-Maximization (EM) procedure for fitting the mixture model to data becomes identical to the Linde-Buzo-Gray (LBG) algorithm [9] for vector quantizer design. [sent-31, score-0.824]
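To make this correspondence concrete, the following is a minimal sketch (not code from the paper) of the LBG/k-means iteration that EM reduces to in the zero-variance limit; the random initialization and fixed iteration count are illustrative choices.

```python
import numpy as np

def lbg(X, K, iters=20, seed=0):
    """Zero-variance limit of EM for a mixture of spherical Gaussians:
    hard nearest-codeword assignment alternated with centroid updates."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(iters):
        # Responsibilities collapse to the nearest codeword (hard assignment).
        d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # The M-step mean update becomes the cell centroid.
        for k in range(K):
            if np.any(assign == k):
                codebook[k] = X[assign == k].mean(axis=0)
    return codebook
```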
24 This paper develops a similar grounding for both global and adaptive (local) transform coding. [sent-32, score-0.692]
25 We define a constrained mixture of Gaussians model that provides a framework for transform coder design. [sent-33, score-1.094]
26 Our new design algorithm is simply a constrained version of the LBG algorithm. [sent-34, score-0.173]
27 It iteratively optimizes the signal space partition, the local transforms, the allocation of coding bits, and the scalar quantizer reproduction values until it reaches a local distortion minimum. [sent-35, score-0.939]
28 This approach leads to two new results, an orthogonal transform and a method of partitioning the signal space, both designed to minimize coding error. [sent-36, score-1.004]
29 2 Global Transform Coder Model In this section, we develop a constrained mixture of Gaussians model that provides a probabilistic framework for global transform coding. [sent-37, score-0.834]
30 1 Latent Variable Model A transform coder converts a signal to new coordinates and then codes the coordinate values independently of one another with scalar quantizers. [sent-39, score-1.029]
31 To replicate this structure, we envision the data as drawn from a d-dimensional latent data space, S, with density p(s) = p(s_1, s_2, …, s_d). [sent-40, score-0.156]
32 Figure 1: Structure of latent variable space, S, and mapping to observed space, X. [sent-47, score-0.189]
33 The latent data density consists of a mixture of spherical Gaussians with component means q_a constrained to lie at the vertices of a rectangular grid. [sent-48, score-0.867]
34 The latent data is mapped to the observed space by an orthogonal transform, W. [sent-49, score-0.317]
35 We model the density in the latent space with a constrained mixture of Gaussian densities p(s) = Σ_{a=1}^{K} π_a p(s|a) (1), where π_a are the mixing coefficients and p(s|a) = N(q_a, Σ_a) is Gaussian with mean q_a and covariance Σ_a. [sent-50, score-0.771]
36 The mixture component means, q_a, lie at the vertices of a rectangular grid as illustrated in figure (1). [sent-51, score-0.538]
37 Each grid vertex has the form q_a = [r_{1i_1}, …, r_{di_d}]^T, where r_{Ji_J} is the i_J-th grid mark on the s_J axis. [sent-54, score-0.229]
38 There are K_J grid mark values on the s_J axis, so the total number of grid vertices is K = Π_J K_J. [sent-55, score-0.457]
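As an illustration of this grid constraint (a sketch, not the authors' code; the function and variable names are ours), the vertices q_a can be enumerated as the Cartesian product of the per-axis grid marks:

```python
import itertools
import numpy as np

def grid_vertices(marks_per_axis):
    """Enumerate all K = prod_J K_J grid vertices q_a.

    marks_per_axis: list of d 1-D arrays; entry J holds the K_J grid-mark
    values r_{J,i_J} on latent axis s_J.  Returns an array of shape (K, d).
    """
    return np.array(list(itertools.product(*marks_per_axis)))

# Example: 3 marks on s_1 and 2 marks on s_2 give K = 3 * 2 = 6 vertices.
marks = [np.array([-1.0, 0.0, 1.0]), np.array([-0.5, 0.5])]
print(grid_vertices(marks).shape)  # (6, 2)
```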
39 We constrain the mixture component covariances, Σ_a, to be spherically symmetric with the same variance, σ²I, with I the identity matrix. [sent-56, score-0.228]
40 We do not fit σ² to the data, but treat it as a "knob", which we will turn to zero to reveal a transform coder. [sent-57, score-0.465]
41 These mean and variance constraints yield marginal densities p_J(s_J | i_J) = N(r_{Ji_J}, σ²). [sent-58, score-0.084]
42 We write the density of s conditioned on a as p(s|a) = p(s_1, …, s_d | i_1, …, i_d) = Π_J p_J(s_J | i_J). [sent-59, score-0.062]
43 Incorporating these constraints into (1) and noting that the sum over the mixture components a is equivalent to sums over all grid mark values, the latent density becomes p(s) = Σ_{i_1=1}^{K_1} Σ_{i_2=1}^{K_2} … Σ_{i_d=1}^{K_d} π_{i_1…i_d} Π_{J=1}^{d} p_J(s_J | i_J). [sent-69, score-0.653]
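A direct, unoptimized evaluation of this constrained latent density might look like the sketch below; it assumes the vertices are enumerated as in the grid_vertices sketch above and that the mixing coefficients are stored in matching order (both assumptions of ours).

```python
import numpy as np

def latent_density(s, vertices, mix, sigma2):
    """Evaluate p(s) = sum_a pi_a N(s; q_a, sigma2 * I) at one latent point s.

    vertices: (K, d) grid vertices q_a (e.g. from grid_vertices above).
    mix:      (K,) mixing coefficients pi_a, summing to one.
    sigma2:   shared spherical variance sigma^2.
    """
    d = vertices.shape[1]
    sq_dist = ((vertices - s) ** 2).sum(axis=1)                      # |s - q_a|^2
    comp = np.exp(-0.5 * sq_dist / sigma2) / (2.0 * np.pi * sigma2) ** (d / 2.0)
    return float(mix @ comp)
```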
44 The latent data is mapped to the observation space by an orthogonal transformation, W (figure 1). [sent-73, score-0.317]
45 Using p(x|s) = δ(x − Ws − μ) and (1), the density on observed data x conditioned on component a is p(x|a) = N(Wq_a + μ, σ²I). [sent-74, score-0.292]
46 2 Model Fitting and Transform Coder design The model (4) can be fit to data using the EM algorithm. [sent-80, score-0.07]
47 In the limit that the variance of the mixture components goes to zero, the EM procedure for fitting the mixture model to data corresponds to a constrained LBG (CLBG) algorithm for optimal transform coder design. [sent-81, score-1.456]
48 In the limit σ² → 0 the entropy term, ln π_a, becomes insignificant and the component posteriors collapse to (6): each data vector is assigned to the component whose mean has the smallest Euclidean distance to it. [sent-82, score-0.177]
49 In the limit that σ² → 0, maximizing the likelihood (5) is equivalent to minimizing the compression distortion D = Σ_a π_a (1/N_a) Σ_{x∈R_a} |x − Wq_a − μ|² (7), where R_a = {x | p(a|x) = 1}, N_a is the number of x ∈ R_a, and π_a = N_a/N. [sent-84, score-0.26]
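In this limit the E-step (6) and the distortion (7) amount to a nearest-reproduction assignment followed by a mean squared error; here is a sketch under the assumption that the data are rows of X (the names are ours, not the paper's):

```python
import numpy as np

def assign_and_distortion(X, W, vertices, mu):
    """Hard-assign each x to the nearest reproduction W q_a + mu (cf. eq. 6)
    and return the assignments and the average distortion (cf. eq. 7).

    X: (N, d) data, W: (d, d) orthogonal transform,
    vertices: (K, d) grid vertices q_a, mu: (d,) data mean."""
    recon = vertices @ W.T + mu                           # rows are W q_a + mu
    d2 = ((X[:, None, :] - recon[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    D = d2[np.arange(len(X)), assign].mean()              # mean |x - W q_a - mu|^2
    return assign, D
```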
50 To optimize the transform, we find the orientation of the current quantizer grid which minimizes (7). [sent-85, score-0.406]
51 The transform, W, is constrained to be orthogonal, that is W^T W = I. [sent-86, score-0.103]
52 Defining the statistic Q as in (8) and minimizing the distortion (7) with respect to the elements of W, with Lagrange multipliers enforcing the orthogonality of W, yields the condition QW = W^T Q^T (9); that is, QW is symmetric. [sent-88, score-0.118]
53 This symmetry condition and the orthogonality condition, W^T W = I, uniquely determine the coding optimal transform (COT) W. [sent-89, score-0.692]
54 The COT reduces to the PCA transform when the data is Gaussian. [sent-90, score-0.465]
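One way to construct a W satisfying both conditions is via a singular value decomposition. The sketch below assumes the statistic Q referenced in (8) and (9) is the cross-correlation Q = Σ_a Σ_{x∈R_a} q_a (x − μ)^T; this form is our assumption, not a quotation of (8). Under that assumption, the orthogonal W maximizing tr(WQ), and hence minimizing (7) over orthogonal transforms, automatically makes QW symmetric.

```python
import numpy as np

def coding_optimal_transform(X, vertices, assign, mu):
    """Sketch of one COT update under an assumed form of the statistic Q.

    Q = sum_a sum_{x in R_a} q_a (x - mu)^T is assumed; the orthogonal W
    maximizing tr(W Q) is W = V U^T from the SVD Q = U S V^T, and this W
    satisfies the symmetry condition Q W = W^T Q^T (eq. 9)."""
    Q = vertices[assign].T @ (X - mu)            # (d, d) assumed cross-statistic, cf. (8)
    U, _, Vt = np.linalg.svd(Q)
    W = Vt.T @ U.T                               # orthogonal by construction
    assert np.allclose(Q @ W, (Q @ W).T)         # symmetry condition (9)
    return W
```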
55 For instance in global transform coding trials on a variety of grayscale images, the COT improves signal-to-noise ratio (SNR) relative to PCA by 0. [sent-92, score-0.689]
56 We next minimize (7) with respect to the grid mark values r_{Ji_J}, for J = 1, …, d and i_J = 1, …, [sent-101, score-0.292]
57 K_J, and the number of grid values K_J for each coordinate. [sent-107, score-0.147]
58 It is advantageous to rewrite the compression distortion as the sum of distortions D = Σ_J D_J due to quantizing the transform coefficients s_J = w_J^T x, where w_J is the J-th column vector of W. [sent-108, score-0.67]
59 The r_{Ji_J} grid mark values that minimize each D_J are the reproduction values of a scalar Lloyd quantizer [10] designed for the transform coefficients s_J. [sent-109, score-1.204]
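A minimal scalar Lloyd quantizer design for one transform coordinate is sketched below; it follows the standard algorithm cited as [10], with an initialization and stopping rule of our own choosing rather than the authors' implementation.

```python
import numpy as np

def lloyd_1d(s, K, iters=50):
    """Design K reproduction values for scalar samples s by Lloyd iteration:
    nearest-reproduction assignment alternated with cell-mean updates."""
    r = np.quantile(s, (np.arange(K) + 0.5) / K)     # spread initial values over the data
    for _ in range(iters):
        idx = np.abs(s[:, None] - r[None, :]).argmin(axis=1)
        for k in range(K):
            if np.any(idx == k):
                r[k] = s[idx == k].mean()
    return np.sort(r)
```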
60 K_J is the number of reproduction values in the quantizer for transform coordinate J. [sent-110, score-0.827]
61 Allocating the log_2(K) coding bits among the transform coordinates so as to minimize distortion [11] determines the optimal K_J's. [sent-111, score-0.929]
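One common way to carry out such an allocation is a greedy marginal-return search; the sketch below is illustrative rather than the specific procedure of [11], and it reuses the lloyd_1d sketch above to measure per-coordinate distortion.

```python
import numpy as np

def quantizer_distortion(s, K):
    """Mean squared error of a K-level scalar Lloyd quantizer for samples s."""
    r = lloyd_1d(s, K)                               # from the sketch above
    return float(((s[:, None] - r[None, :]) ** 2).min(axis=1).mean())

def allocate_bits(S, total_bits):
    """Greedily split log2(K) = total_bits among the d transform coordinates.

    S: (N, d) matrix of transform coefficients s_J.  Each step gives one more
    bit to the coordinate whose distortion D_J drops the most, so the total
    D = sum_J D_J decreases as much as possible per bit spent."""
    d = S.shape[1]
    bits = np.zeros(d, dtype=int)
    D = np.array([quantizer_distortion(S[:, J], 1) for J in range(d)])
    for _ in range(total_bits):
        gains = [D[J] - quantizer_distortion(S[:, J], 2 ** (bits[J] + 1))
                 for J in range(d)]
        J = int(np.argmax(gains))
        bits[J] += 1
        D[J] -= gains[J]
    return bits                                      # K_J = 2 ** bits[J]
```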
62 3 Local Transform Coder Model In this section, we develop a mixture of constrained Gaussian mixtures model that provides a probabilistic framework for adaptive transform coding. [sent-112, score-0.954]
63 1 Latent Variable Model A local or adaptive transform coder identifies regions in data space that require different quantizer grids and orthogonal transforms. [sent-114, score-1.376]
64 A separate transform coder is designed for each of these regions. [sent-115, score-0.878]
65 To replicate this structure, we envision the observed data as drawn from one of M grids in the latent space. [sent-116, score-0.324]
66 The latent variables, s, are modeled with a mixture of Gaussian densities, where the mixture components are constrained to lie at the grid vertices. [sent-117, score-0.829]
67 Each grid has the same number of mixture components, K; however, the number and spacing of grid marks on each axis can differ. [sent-118, score-0.467]
68 Figure 2: Nonstationary data model: structure of latent variable space, S, and mapping (in the hard clustering limit) to observed space, X. [sent-122, score-0.222]
69 The density in the latent space consists of mixtures of spherically symmetric Gaussians. [sent-123, score-0.404]
70 The mixture component means, q_α^(m), lie at the vertices of the m-th grid. [sent-124, score-0.359]
71 Latent data is mapped to the observation space by W^(m). [sent-125, score-0.077]
72 The density on s conditioned on a single mixture component α in grid m is p(s|α, m) = N(q_α^(m), σ²I). [sent-126, score-0.429]
73 The latent density is a mixture of constrained Gaussian mixture densities, p(s) = Σ_{m=1}^{M} π_m Σ_{α=1}^{K} p(α|m) p(s|α, m) (10). The latent data is mapped to the observation space by orthonormal transforms W^(m). [sent-127, score-1.044]
74 The density on x conditioned on α in grid m is p(x|α, m) = N(W^(m) q_α^(m) + μ^(m), σ²I). [sent-128, score-0.256]
75 The observed density is then p(x) = Σ_{m=1}^{M} π_m Σ_{α=1}^{K} p(α|m) p(x|α, m) (11). 2 Optimal Adaptive Transform Coder Design In the limit that σ² → 0, the EM procedure for fitting this model corresponds to a constrained LBG algorithm for adaptive transform coder design. [sent-130, score-1.169]
76 As before, a single mixture component becomes responsible for x_n: p(α, m | x) → 1 if |x − W^(m) q_α^(m) − μ^(m)|² ≤ |x − W^(m') q_{α'}^(m') − μ^(m')|² for all α', m', and 0 otherwise. [sent-131, score-0.234]
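In code, this hard assignment simply picks, for every data vector, the region and component whose reproduction is closest; the sketch below assumes per-region transforms W^(m), means μ^(m), and vertex sets, with variable names of our own choosing.

```python
import numpy as np

def adaptive_assign(X, transforms, means, vertex_sets):
    """Assign each x to the (grid m, component a) with the least distortion.

    transforms:  list of M orthogonal (d, d) matrices W^(m).
    means:       list of M mean vectors mu^(m).
    vertex_sets: list of M (K, d) arrays of grid vertices q_a^(m)."""
    N = X.shape[0]
    best_d2 = np.full(N, np.inf)
    best_m = np.zeros(N, dtype=int)
    best_a = np.zeros(N, dtype=int)
    for m, (W, mu, verts) in enumerate(zip(transforms, means, vertex_sets)):
        recon = verts @ W.T + mu                 # reproductions for region m
        d2 = ((X[:, None, :] - recon[None, :, :]) ** 2).sum(axis=2)
        a = d2.argmin(axis=1)
        dmin = d2[np.arange(N), a]
        better = dmin < best_d2
        best_d2[better] = dmin[better]
        best_m[better] = m
        best_a[better] = a[better]
    return best_m, best_a
```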
77 Second, instead of using the coding optimal transform (9), we use the PCA transform. [sent-133, score-0.66]
78 We report on compression experiments using two types of images, Magnetic Resonance Images (MRI) and gray-scale natural images of traffic moving through street intersections. [sent-139, score-0.235]
79 These MRI images were used by Dony and Haykin in [5] and we duplicate their image pre-processing. [sent-140, score-0.118]
80 One MRI image is decomposed into overlapping 8 x 8 blocks to form 15,625 training vectors; a second image is used for testing. [sent-141, score-0.14]
81 The traffic images are frames from two video sequences. [sent-142, score-0.147]
(a) MRI test image SNR vs. compressed bit-rate (bpp). [sent-149, score-0.07]
(b) Traffic test image SNR vs. compressed bit-rate (bpp). [sent-158, score-0.07]
84 Figure 3: × is our coding optimal partition, ○ is local PCA partition with dimension eight, • is k-means clustering, and + is global PCA. [sent-160, score-0.408]
85 Figure 3 shows compressed test image SNR for four compressed bit-rates and four compression methods. [sent-163, score-0.371]
86 The quoted bit-rates include the bits necessary to specify region assignments. [sent-164, score-0.063]
87 The x results are for our transform coder which uses coding optimal partitioning. [sent-165, score-1.013]
8 dB and local PCA partitioning with target dimension eight (○) by 0. [sent-170, score-0.264]
0 dB higher than Dony and Haykin's local PCA transform coder (dimension eight) [5]. [sent-175, score-0.867]
90 Their local PCA coder does not use optimal bit allocation or Lloyd quantizers, which further reduces compressed image SNR. [sent-176, score-0.731]
91 5 Summary In this paper, we cast the design of both conventional and adaptive transform coders as a constrained optimization procedure. [sent-177, score-0.923]
92 We derive our algorithm from the EM procedure for fitting a mixture of mixtures model to data. [sent-178, score-0.306]
93 In contrast to standard transform coder design, all operations (partitioning the signal space in the adaptive case, transform design, allocation of coding bits, and quantizer design) are coupled together to minimize compression distortion. [sent-179, score-2.348]
94 This approach leads to a new transform basis that is optimized for coding. [sent-180, score-0.505]
95 The coding optimal transform is in general different from PCA. [sent-181, score-0.66]
96 This approach also leads to a method of data space partitioning that is optimized for coding. [sent-182, score-0.214]
97 This method assigns each signal vector to the coder that compresses it with the least distortion. [sent-183, score-0.418]
98 Acknowledgements The authors wish to thank Robert Dony and Simon Haykin for the use of their MRI image data and the Institut für Algorithmen und Kognitive Systeme, Universität Karlsruhe, for making their traffic images available. [sent-186, score-0.186]
99 Optimal dimension reduction and transform coding with mixture principal components. [sent-221, score-0.85]
100 Soft Competitive Adaptation: neural network learning algorithms based on fitting statistical mixtures. [sent-224, score-0.073]
wordName wordTfidf (topN-words)
[('transform', 0.465), ('coder', 0.353), ('quantizer', 0.259), ('latent', 0.189), ('mixture', 0.173), ('coders', 0.165), ('coding', 0.164), ('grid', 0.147), ('partitioning', 0.136), ('qa', 0.122), ('adaptive', 0.12), ('compression', 0.119), ('cot', 0.118), ('dony', 0.118), ('constrained', 0.103), ('allocation', 0.101), ('bpp', 0.094), ('lbg', 0.094), ('mri', 0.094), ('pea', 0.092), ('compressed', 0.091), ('snr', 0.091), ('db', 0.089), ('distortion', 0.086), ('mark', 0.082), ('haykin', 0.081), ('vertices', 0.081), ('transforms', 0.078), ('fitting', 0.073), ('archer', 0.071), ('lloyd', 0.071), ('reproduction', 0.071), ('sjlij', 0.071), ('sla', 0.071), ('image', 0.07), ('design', 0.07), ('traffic', 0.068), ('signal', 0.065), ('kj', 0.064), ('bits', 0.063), ('minimize', 0.063), ('density', 0.062), ('component', 0.061), ('vq', 0.061), ('pj', 0.06), ('global', 0.06), ('designed', 0.06), ('mixtures', 0.06), ('coordinates', 0.057), ('scalar', 0.057), ('partition', 0.056), ('limit', 0.055), ('spherically', 0.055), ('pca', 0.054), ('densities', 0.054), ('orthogonal', 0.051), ('local', 0.049), ('images', 0.048), ('jij', 0.048), ('dimension', 0.048), ('conditioned', 0.047), ('alx', 0.047), ('dct', 0.047), ('envision', 0.047), ('grounding', 0.047), ('quantizers', 0.047), ('replicate', 0.047), ('xer', 0.047), ('xla', 0.047), ('xlo', 0.047), ('lie', 0.044), ('grids', 0.041), ('slo', 0.041), ('todd', 0.041), ('em', 0.04), ('optimized', 0.04), ('mapped', 0.039), ('sj', 0.038), ('space', 0.038), ('wtw', 0.037), ('nonstationary', 0.037), ('qw', 0.037), ('tij', 0.037), ('bit', 0.036), ('tesauro', 0.034), ('hoc', 0.034), ('clustering', 0.033), ('probabilistic', 0.033), ('transactions', 0.033), ('coordinate', 0.032), ('cowan', 0.032), ('simon', 0.032), ('orthogonality', 0.032), ('dj', 0.032), ('tipping', 0.032), ('rectangular', 0.032), ('eight', 0.031), ('optimal', 0.031), ('frames', 0.031), ('variance', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding
Author: Cynthia Archer, Todd K. Leen
Abstract: We establish a principled framework for adaptive transform coding. Transform coders are often constructed by concatenating an ad hoc choice of transform with suboptimal bit allocation and quantizer design. Instead, we start from a probabilistic latent variable model in the form of a mixture of constrained Gaussian mixtures. From this model we derive a transform coding algorithm, which is a constrained version of the generalized Lloyd algorithm for vector quantizer design. A byproduct of our derivation is the introduction of a new transform basis, which unlike other transforms (PCA, DCT, etc.) is explicitly optimized for coding. Image compression experiments show adaptive transform coders designed with our algorithm improve compressed image signal-to-noise ratio up to 3 dB compared to global transform coding and 0.5 to 2 dB compared to other adaptive transform coders. 1
2 0.151943 51 nips-2000-Factored Semi-Tied Covariance Matrices
Author: Mark J. F. Gales
Abstract: A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. This new model form is evaluated on a large vocabulary speech recognition task. It is shown that using this factored form of covariance modelling reduces the word error rate.
3 0.11015267 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
Author: Michael S. Gray, Terrence J. Sejnowski, Javier R. Movellan
Abstract: We examine eight different techniques for developing visual representations in machine vision tasks. In particular we compare different versions of principal component and independent component analysis in combination with stepwise regression methods for variable selection. We found that local methods, based on the statistics of image patches, consistently outperformed global methods based on the statistics of entire images. This result is consistent with previous work on emotion and facial expression recognition. In addition, the use of a stepwise regression technique for selecting variables and regions of interest substantially boosted performance. 1
4 0.10950267 60 nips-2000-Gaussianization
Author: Scott Saobing Chen, Ramesh A. Gopinath
Abstract: High dimensional data modeling is difficult mainly because the so-called
5 0.096145771 36 nips-2000-Constrained Independent Component Analysis
Author: Wei Lu, Jagath C. Rajapakse
Abstract: The paper presents a novel technique of constrained independent component analysis (CICA) to introduce constraints into the classical ICA and solve the constrained optimization problem by using Lagrange multiplier methods. This paper shows that CICA can be used to order the resulted independent components in a specific manner and normalize the demixing matrix in the signal separation procedure. It can systematically eliminate the ICA's indeterminacy on permutation and dilation. The experiments demonstrate the use of CICA in ordering of independent components while providing normalized demixing processes. Keywords: Independent component analysis, constrained independent component analysis, constrained optimization, Lagrange multiplier methods 1
6 0.080699645 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
7 0.073863149 121 nips-2000-Sparse Kernel Principal Component Analysis
8 0.070425831 27 nips-2000-Automatic Choice of Dimensionality for PCA
9 0.069899179 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation
10 0.068076536 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes
11 0.065008931 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
12 0.057092819 94 nips-2000-On Reversing Jensen's Inequality
14 0.055028379 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images
15 0.050651506 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
16 0.048563216 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition
17 0.048049677 82 nips-2000-Learning and Tracking Cyclic Human Motion
18 0.046916861 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
19 0.043780766 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation
20 0.042874098 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method
topicId topicWeight
[(0, 0.168), (1, -0.047), (2, 0.101), (3, 0.127), (4, -0.022), (5, 0.032), (6, -0.056), (7, -0.041), (8, -0.013), (9, -0.019), (10, -0.216), (11, -0.057), (12, 0.025), (13, 0.023), (14, -0.15), (15, 0.101), (16, -0.071), (17, -0.015), (18, -0.064), (19, 0.12), (20, 0.061), (21, 0.079), (22, -0.054), (23, 0.089), (24, -0.203), (25, -0.03), (26, 0.015), (27, 0.079), (28, 0.325), (29, -0.001), (30, -0.079), (31, -0.043), (32, 0.025), (33, 0.077), (34, -0.022), (35, 0.117), (36, -0.087), (37, -0.106), (38, 0.026), (39, -0.09), (40, 0.145), (41, -0.047), (42, 0.003), (43, -0.044), (44, -0.032), (45, -0.015), (46, 0.062), (47, 0.033), (48, 0.045), (49, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.97890151 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding
Author: Cynthia Archer, Todd K. Leen
Abstract: We establish a principled framework for adaptive transform coding. Transform coders are often constructed by concatenating an ad hoc choice of transform with suboptimal bit allocation and quantizer design. Instead, we start from a probabilistic latent variable model in the form of a mixture of constrained Gaussian mixtures. From this model we derive a transform coding algorithm, which is a constrained version of the generalized Lloyd algorithm for vector quantizer design. A byproduct of our derivation is the introduction of a new transform basis, which unlike other transforms (PCA, DCT, etc.) is explicitly optimized for coding. Image compression experiments show adaptive transform coders designed with our algorithm improve compressed image signal-to-noise ratio up to 3 dB compared to global transform coding and 0.5 to 2 dB compared to other adaptive transform coders. 1
2 0.74590236 60 nips-2000-Gaussianization
Author: Scott Saobing Chen, Ramesh A. Gopinath
Abstract: High dimensional data modeling is difficult mainly because the so-called
3 0.62972683 51 nips-2000-Factored Semi-Tied Covariance Matrices
Author: Mark J. F. Gales
Abstract: A new form of covariance modelling for Gaussian mixture models and hidden Markov models is presented. This is an extension to an efficient form of covariance modelling used in speech recognition, semi-tied covariance matrices. In the standard form of semi-tied covariance matrices the covariance matrix is decomposed into a highly shared decorrelating transform and a component-specific diagonal covariance matrix. The use of a factored decorrelating transform is presented in this paper. This factoring effectively increases the number of possible transforms without increasing the number of free parameters. Maximum likelihood estimation schemes for all the model parameters are presented including the component/transform assignment, transform and component parameters. This new model form is evaluated on a large vocabulary speech recognition task. It is shown that using this factored form of covariance modelling reduces the word error rate.
4 0.39938983 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
Author: Lucas C. Parra, Clay Spence, Paul Sajda
Abstract: We present evidence that several higher-order statistical properties of natural images and signals can be explained by a stochastic model which simply varies scale of an otherwise stationary Gaussian process. We discuss two interesting consequences. The first is that a variety of natural signals can be related through a common model of spherically invariant random processes, which have the attractive property that the joint densities can be constructed from the one dimensional marginal. The second is that in some cases the non-stationarity assumption and only second order methods can be explicitly exploited to find a linear basis that is equivalent to independent components obtained with higher-order methods. This is demonstrated on spectro-temporal components of speech. 1
5 0.37759757 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
Author: Michael S. Gray, Terrence J. Sejnowski, Javier R. Movellan
Abstract: We examine eight different techniques for developing visual representations in machine vision tasks. In particular we compare different versions of principal component and independent component analysis in combination with stepwise regression methods for variable selection. We found that local methods, based on the statistics of image patches, consistently outperformed global methods based on the statistics of entire images. This result is consistent with previous work on emotion and facial expression recognition. In addition, the use of a stepwise regression technique for selecting variables and regions of interest substantially boosted performance. 1
6 0.37259534 36 nips-2000-Constrained Independent Component Analysis
7 0.35487342 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns
9 0.32304123 27 nips-2000-Automatic Choice of Dimensionality for PCA
10 0.31617588 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes
11 0.31389821 85 nips-2000-Mixtures of Gaussian Processes
12 0.26360884 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images
13 0.25583795 121 nips-2000-Sparse Kernel Principal Component Analysis
14 0.245729 136 nips-2000-The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity
15 0.22638816 143 nips-2000-Using the Nyström Method to Speed Up Kernel Machines
16 0.21828474 135 nips-2000-The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference
17 0.21096249 48 nips-2000-Exact Solutions to Time-Dependent MDPs
18 0.20968787 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation
19 0.20033705 30 nips-2000-Bayesian Video Shot Segmentation
20 0.1937283 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets
topicId topicWeight
[(10, 0.036), (17, 0.136), (33, 0.059), (54, 0.017), (55, 0.024), (62, 0.043), (65, 0.017), (67, 0.036), (76, 0.041), (78, 0.375), (79, 0.014), (81, 0.017), (90, 0.031), (91, 0.023), (97, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.82703662 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding
Author: Cynthia Archer, Todd K. Leen
Abstract: We establish a principled framework for adaptive transform coding. Transform coders are often constructed by concatenating an ad hoc choice of transform with suboptimal bit allocation and quantizer design. Instead, we start from a probabilistic latent variable model in the form of a mixture of constrained Gaussian mixtures. From this model we derive a transform coding algorithm, which is a constrained version of the generalized Lloyd algorithm for vector quantizer design. A byproduct of our derivation is the introduction of a new transform basis, which unlike other transforms (PCA, DCT, etc.) is explicitly optimized for coding. Image compression experiments show adaptive transform coders designed with our algorithm improve compressed image signal-to-noise ratio up to 3 dB compared to global transform coding and 0.5 to 2 dB compared to other adaptive transform coders. 1
2 0.39707911 122 nips-2000-Sparse Representation for Gaussian Process Models
Author: Lehel Csató, Manfred Opper
Abstract: We develop an approach for a sparse representation for Gaussian Process (GP) models in order to overcome the limitations of GPs caused by large data sets. The method is based on a combination of a Bayesian online algorithm together with a sequential construction of a relevant subsample of the data which fully specifies the prediction of the model. Experimental results on toy examples and large real-world data sets indicate the efficiency of the approach.
3 0.39510694 74 nips-2000-Kernel Expansions with Unlabeled Examples
Author: Martin Szummer, Tommi Jaakkola
Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.
4 0.39482003 4 nips-2000-A Linear Programming Approach to Novelty Detection
Author: Colin Campbell, Kristin P. Bennett
Abstract: Novelty detection involves modeling the normal behaviour of a system, hence enabling detection of any divergence from normality. It has potential applications in many areas such as detection of machine damage or highlighting abnormal features in medical data. One approach is to build a hypothesis estimating the support of the normal data, i.e. constructing a function which is positive in the region where the data is located and negative elsewhere. Recently kernel methods have been proposed for estimating the support of a distribution and they have performed well in practice - training involves solution of a quadratic programming problem. In this paper we propose a simpler kernel method for estimating the support based on linear programming. The method is easy to implement and can learn large datasets rapidly. We demonstrate the method on medical and fault detection datasets. 1 Introduction. An important classification task is the ability to distinguish between new instances similar to members of the training set and all other instances that can occur. For example, we may want to learn the normal running behaviour of a machine and highlight any significant divergence from normality which may indicate onset of damage or faults. This issue is a generic problem in many fields. For example, an abnormal event or feature in medical diagnostic data typically leads to further investigation. Novel events can be highlighted by constructing a real-valued density estimation function. However, here we will consider the simpler task of modelling the support of a data distribution, i.e. creating a binary-valued function which is positive in those regions of input space where the data predominantly lies and negative elsewhere. Recently kernel methods have been applied to this problem [4]. In this approach data is implicitly mapped to a high-dimensional space called feature space [13]. Suppose the data points in input space are x_i (with i = 1, …, m) and the mapping is x_i → φ(x_i); then in the span of {φ(x_i)} we can expand a vector w = Σ_j α_j φ(x_j). Hence we can define separating hyperplanes in feature space by w · φ(x_i) + b = 0. We will refer to w · φ(x_i) + b as the margin, which will be positive on one side of the separating hyperplane and negative on the other. Thus we can also define a decision function: (1) where z is a new data point. The data appears in the form of an inner product in feature space so we can implicitly define feature space by our choice of kernel function: (2) A number of choices for the kernel are possible, for example, RBF kernels: (3) With the given kernel the decision function is therefore given by: (4) One approach to novelty detection is to find a hypersphere in feature space with a minimal radius R and centre a which contains most of the data: novel test points lie outside the boundary of this hypersphere [3, 12]. This approach to novelty detection was proposed by Tax and Duin [10] and successfully used on real life applications [11]. The effect of outliers is reduced by using slack variables e_i to allow for datapoints outside the sphere, and the task is to minimise the volume of the sphere and the number of datapoints outside, i.e. min R² + λ Σ_i e_i s.t. (x_i − a) · (x_i − a) ≤ R² + e_i, e_i ≥ 0 (5). Since the data appears in the form of inner products, kernel substitution can be applied and the learning task can be reduced to a quadratic programming problem. An alternative approach has been developed by Schölkopf et al. [7].
Suppose we restricted our attention to RBF kernels (3); then the data lies on the surface of a hypersphere in feature space since φ(x) · φ(x) = K(x, x) = 1. The objective is therefore to separate off the surface region containing data from the region containing no data. This is achieved by constructing a hyperplane which is maximally distant from the origin with all datapoints lying on the opposite side from the origin and such that the margin is positive. The learning task in dual form involves minimisation of W(α) = ½ Σ_{i,j=1}^{m} α_i α_j K(x_i, x_j) s.t. 0 ≤ α_i ≤ C, Σ_{i=1}^{m} α_i = 1 (6). However, the origin plays a special role in this model. As the authors point out [9] this is a disadvantage since the origin effectively acts as a prior for where the class of abnormal instances is assumed to lie. In this paper we avoid this problem: rather than repelling the hyperplane away from an arbitrary point outside the data distribution we instead try and attract the hyperplane towards the centre of the data distribution. In this paper we will outline a new algorithm for novelty detection which can be easily implemented using linear programming (LP) techniques. As we illustrate in section 3 it performs well in practice on datasets involving the detection of abnormalities in medical data and fault detection in condition monitoring. 2 The Algorithm For the hard margin case (see Figure 1) the objective is to find a surface in input space which wraps around the data clusters: anything outside this surface is viewed as abnormal. This surface is defined as the level set, J(z) = 0, of some nonlinear function. In feature space, J(z) = Σ_i α_i K(z, x_i) + b, this corresponds to a hyperplane which is pulled onto the mapped datapoints with the restriction that the margin always remains positive or zero. We make the fit of this nonlinear function or hyperplane as tight as possible by minimizing the mean value of the output of the function, i.e., Σ_i J(x_i). This is achieved by minimising: (7) subject to: Σ_{j=1}^{m} α_j K(x_i, x_j) + b ≥ 0 (8), Σ_{i=1}^{m} α_i = 1, α_i ≥ 0 (9). The bias b is just treated as an additional parameter in the minimisation process though unrestricted in sign. The added constraints (9) on α bound the class of models to be considered - we don't want to consider simple linear rescalings of the model. These constraints amount to a choice of scale for the weight vector normal to the hyperplane in feature space and hence do not impose a restriction on the model. Also, these constraints ensure that the problem is well-posed and that an optimal solution with α ≠ 0 exists. Other constraints on the class of functions are possible, e.g. ||α||_1 = 1 with no restriction on the sign of α_i. Many real-life datasets contain noise and outliers. To handle these we can introduce a soft margin in analogy to the usual approach used with support vector machines. In this case we minimise (10) subject to Σ_{j=1}^{m} α_j K(x_i, x_j) + b ≥ −e_i, e_i ≥ 0 (11), and constraints (9). The parameter λ controls the extent of margin errors (larger λ means fewer outliers are ignored: λ → ∞ corresponds to the hard margin limit). The above problem can be easily solved for problems with thousands of points using standard simplex or interior point algorithms for linear programming. With the addition of column generation techniques, these same approaches can be adopted for very large problems in which the kernel matrix exceeds the capacity of main memory.
Column generation algorithms incrementally add and drop columns, each corresponding to a single kernel function, until optimality is reached. Such approaches have been successfully applied to other support vector problems [6, 2]. Basic simplex algorithms were sufficient for the problems considered in this paper, so we defer a listing of the code for column generation to a later paper together with experiments on large datasets [1]. 3 Experiments Artificial datasets. Before considering experiments on real-life data we will first illustrate the performance of the algorithm on some artificial datasets. In Figure 1 the algorithm places a boundary around two data clusters in input space: a hard margin was used with RBF kernels and σ
5 0.39250243 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
Author: Michael S. Gray, Terrence J. Sejnowski, Javier R. Movellan
Abstract: We examine eight different techniques for developing visual representations in machine vision tasks. In particular we compare different versions of principal component and independent component analysis in combination with stepwise regression methods for variable selection. We found that local methods, based on the statistics of image patches, consistently outperformed global methods based on the statistics of entire images. This result is consistent with previous work on emotion and facial expression recognition. In addition, the use of a stepwise regression technique for selecting variables and regions of interest substantially boosted performance. 1
6 0.39246792 37 nips-2000-Convergence of Large Margin Separable Linear Classification
7 0.39124563 130 nips-2000-Text Classification using String Kernels
8 0.39073226 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
9 0.3906351 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm
10 0.38797247 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
11 0.38783735 133 nips-2000-The Kernel Gibbs Sampler
12 0.38666016 51 nips-2000-Factored Semi-Tied Covariance Matrices
13 0.38594458 94 nips-2000-On Reversing Jensen's Inequality
14 0.38573578 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work
15 0.38510862 60 nips-2000-Gaussianization
16 0.38486138 36 nips-2000-Constrained Independent Component Analysis
17 0.38374093 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling
18 0.3822566 111 nips-2000-Regularized Winnow Methods
19 0.38179606 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks
20 0.38000399 119 nips-2000-Some New Bounds on the Generalization Error of Combined Classifiers