nips nips2002 nips2002-67 knowledge-graph by maker-knowledge-mining

67 nips-2002-Discriminative Binaural Sound Localization


Source: pdf

Author: Ehud Ben-reuven, Yoram Singer

Abstract: Time difference of arrival (TDOA) is commonly used to estimate the azimuth of a source in a microphone array. The most common methods to estimate TDOA are based on finding extrema in generalized crosscorrelation waveforms. In this paper we apply microphone array techniques to a manikin head. By considering the entire cross-correlation waveform we achieve azimuth prediction accuracy that exceeds extrema locating methods. We do so by quantizing the azimuthal angle and treating the prediction problem as a multiclass categorization task. We demonstrate the merits of our approach by evaluating the various approaches on Sony’s AIBO robot.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Time difference of arrival (TDOA) is commonly used to estimate the azimuth of a source in a microphone array. [sent-5, score-0.428]

2 The most common methods to estimate TDOA are based on finding extrema in generalized crosscorrelation waveforms. [sent-6, score-0.141]

3 In this paper we apply microphone array techniques to a manikin head. [sent-7, score-0.329]

4 By considering the entire cross-correlation waveform we achieve azimuth prediction accuracy that exceeds extrema locating methods. [sent-8, score-0.439]

5 We do so by quantizing the azimuthal angle and treating the prediction problem as a multiclass categorization task. [sent-9, score-0.385]

6 1 Introduction In this paper we describe and evaluate several algorithms to perform sound localization in a commercial entertainment robot. [sent-11, score-0.507]

7 The physical system being investigated is composed of a manikin head equipped with two microphones and placed on a manikin body. [sent-12, score-0.606]

8 This type of system is commonly used to model sound localization in biological systems, and the algorithms used to analyze the signal are usually inspired by neurology. [sent-13, score-0.522]

9 In the case of an entertainment robot there is no need to be limited to a neurologically inspired model, and we will use a combination of techniques that are commonly used in microphone arrays and statistical learning. [sent-14, score-0.289]

10 The focus of the work is the task of localizing an unknown stationary source (compact in location and broad in spectrum). [sent-15, score-0.19]

11 The goal is to find the azimuth angle of the source relative to the head. [sent-16, score-0.405]

12 A common paradigm to approximately find the location of a sound source employs a microphone array and estimates time differences of arrival (TDOA) between microphones in the array (see for instance [1]). [sent-17, score-0.743]

13 In a dual-microphone array it is usually assumed that the difference between the two channels is limited to a small time delay (or a linear phase in the frequency domain), and therefore the cross-correlation is peaked at the time corresponding to the delay. [sent-18, score-0.206]

14 Thus, methods that search for extrema in cross-correlation waveforms are commonly used [2]. [sent-19, score-0.231]
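
As a minimal sketch of this classical extremum-locating approach (the function name, channel arrays, and sampling rate `fs` are illustrative assumptions, not from the paper), the TDOA can be estimated as the lag that maximizes the cross-correlation of the two channels:

```python
import numpy as np

def tdoa_by_peak(left: np.ndarray, right: np.ndarray, fs: float) -> float:
    """Estimate the inter-channel time delay (seconds) from the cross-correlation peak."""
    corr = np.correlate(left, right, mode="full")   # lags run from -(N-1) to N-1
    lag = int(np.argmax(corr)) - (len(right) - 1)   # shift index to a signed lag;
                                                    # positive: `left` is a delayed copy of `right`
    return lag / fs
```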

15 The time delay approach is based on the assumption that the sound waves propagate along a single path from the source to the microphone and that the microphone response of the two channels for the given source location is approximately the same. [sent-20, score-0.861]

16 The time delay assumption fails in the case of a manikin head: the microphones are antipodal and, in addition, the manikin head and body affect the response in a complex way. [sent-23, score-0.583]

17 First, we perform signal processing based on the generalized cross-correlation transform called the Phase Transform (PHAT), also called the Cross Power Spectrum Phase (CPSP). [sent-26, score-0.272]

18 This signal processing removes, to a large extent, variations due to the sound source. [sent-27, score-0.279]

19 Then, rather than proceeding with peak-finding we employ discriminative learning methods by casting the azimuth estimation as a multiclass prediction problem. [sent-28, score-0.461]

20 In Sec. 2 we describe how the signal received at the two microphones was processed to generate accurate features. [sent-32, score-0.249]

21 2 Signal Processing Throughout the paper we denote signals in the time domain by lower case letters and in the frequency domain by upper case letters. [sent-40, score-0.145]

22 We denote the convolution operator between two signals by and the correlation operator by . [sent-41, score-0.132]

23 The unknown source signal is denoted by and thus its spectrum is . [sent-42, score-0.328]

24 The source signal passes through different physical setups and is received at the right and left microphones. [sent-43, score-0.346]

25 We model the different physical media the signal passes through as two linear systems whose frequency responses are denoted by and . [sent-45, score-0.255]

26 In addition the signals are contaminated with noise that may account for non-linear effects such as room reverberations (see for instance [3] for more detailed noise models). [sent-46, score-0.224]

27 Let be the number of segments a signal is divided into and the number of samples in a single segment. [sent-48, score-0.174]

28 Each segment is multiplied by a Hanning window and padded with zeros to smooth the end-of-segment effects and increase the resolution of the short-time Fourier transform (see for instance [8]). [sent-49, score-0.183]

29 Based on the properties of the Fourier transform, the local cross-correlation between the two signals can be computed efficiently as the inverse Fourier transform of the product of the spectrum of one signal and the complex conjugate of the spectrum of the other. [sent-51, score-0.264]
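
A hedged sketch of this per-segment computation, combining the Hanning windowing and zero-padding described above with the FFT-based cross-correlation (segment length and the FFT size `n_fft` are assumptions):

```python
import numpy as np

def segment_cross_correlation(x_l: np.ndarray, x_r: np.ndarray, n_fft: int) -> np.ndarray:
    """Local cross-correlation of one windowed segment pair, computed via the FFT."""
    win = np.hanning(len(x_l))              # Hanning window, as described above
    X_l = np.fft.fft(x_l * win, n=n_fft)    # choosing n_fft > len(x_l) zero-pads the segment
    X_r = np.fft.fft(x_r * win, n=n_fft)
    return np.real(np.fft.ifft(X_l * np.conj(X_r)))
```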

30 However, since the source signal passes through different physical media the short-time cross-correlation does not necessarily obtain a large value at the time-lag index. [sent-53, score-0.326]

31 It is therefore common (see for instance [1]) to multiply the spectrum of the cross-correlation by a weighting function in order to compensate for the differences in the frequency responses obtained at the two microphones. [sent-54, score-0.222]

32 Denoting the spectral shaping function for the th segment by , the generalized cross-correlation from Eq. [sent-55, score-0.146]

33 The transform is obtained by setting, where is the average over all measurements and both channels of . [sent-59, score-0.196]
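
A minimal sketch of the PHAT weighting, which normalizes the cross-spectrum by its magnitude so that only phase information survives; the small `eps` guard against division by zero is our addition:

```python
import numpy as np

def phat_waveform(X_l: np.ndarray, X_r: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """GCC-PHAT: inverse FFT of the unit-magnitude cross-spectrum."""
    cross = X_l * np.conj(X_r)
    return np.real(np.fft.ifft(cross / (np.abs(cross) + eps)))
```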

34 Figure 1: Average waveform with standard deviation after performing PHAT (top) and the equalized cross-correlation (bottom). [sent-79, score-0.307]

35 Formally, the feature vector of the th segment is defined as samples of the resulting waveform. Therefore, assuming the noise is zero, PHAT eliminates the contribution of the unknown source and the entire waveform of PHAT is only a function of the physical setup. [sent-82, score-0.515]

36 If all other physical parameters are constant, the PHAT waveform (as well as its peak location) is a function of the azimuth angle of the sound source relative to the manikin head. [sent-83, score-1.019]

37 This is of course an approximation and the presence of noise and changes in the environment result in a waveform that deviates from the closed-form given in Eq. [sent-84, score-0.15]

38 In Fig. 1 we show the empirical average of the waveform for PHAT and for the equalized cross-correlation; the vertical bars represent an error of . [sent-87, score-0.272]

39 In both cases, the location of the maximal correlation is clearly at as expected. [sent-88, score-0.128]

40 Nonetheless, the high variance, especially in the case of the equalized cross-correlation, implies that classification of individual segments may often be rather difficult. [sent-89, score-0.217]

41 Summarising, the signal processing we perform is based on a short-time Fourier transform of the signals received at the two microphones. [sent-91, score-0.259]

42 From the two spectra we then compute the generalized cross-correlation using one of the three weighting schemes described above, and take samples of the resulting waveforms as the feature vectors. [sent-92, score-0.185]

43 3 Single Segment Classification Traditional approaches to sound localization search for the position of the extreme value in the generalized cross-correlation waveforms that were derived in Sec. [sent-95, score-0.592]

44 It can be seen from Eq. 5 that the entire waveform of PHAT can be used as a feature vector to localise the source. [sent-99, score-0.186]

45 In a supervised learning setting, we have access to labelled examples and the goal is to find a mapping from the instance domain (the peak-location or waveforms in our setting) to a response variable (the azimuth angle). [sent-104, score-0.353]

46 Since the angle is a continuous variable, the first approach that comes to mind is to use a linear or non-linear regressor. [sent-105, score-0.141]

47 Instead of treating the learning problem as a regression problem, we quantized the angle and converted the sound localization problem into a multiclass decision problem. [sent-107, score-0.782]

48 We can now transform the real-valued angle of the th segment into a discrete variable, where iff . [sent-109, score-0.296]
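
For illustration, a sketch of the quantization step; the angular range and bin width here (-90 to 90 degrees in 10-degree steps) are assumptions, not the paper's values:

```python
import numpy as np

edges = np.arange(-90.0, 90.0 + 1e-9, 10.0)  # class boundaries in degrees (assumed)

def angle_to_class(theta_deg: float) -> int:
    """Map a real-valued azimuth to its quantized class index."""
    return int(np.clip(np.digitize(theta_deg, edges) - 1, 0, len(edges) - 2))
```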

49 After this quantization, the training set is composed of instance-label pairs and the first task is to find a classification rule from the peak-location or waveforms space into . [sent-110, score-0.18]

50 We will first describe the method used for peak-location and then we will describe two discriminative methods to classify the waveform. [sent-111, score-0.137]

51 The first is based on a multiclass version of the Fisher linear discriminant [7] and is very simple to implement. [sent-112, score-0.296]
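
As a stand-in sketch (the paper's exact multiclass Fisher variant [7] may differ), scikit-learn's multiclass LDA can play the role of the discriminant; the feature and label arrays below are placeholders:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_train = np.random.randn(200, 64)           # placeholder PHAT-waveform feature vectors
y_train = np.random.randint(0, 9, size=200)  # placeholder quantized azimuth classes

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
segment_probs = lda.predict_proba(X_train)   # per-class probabilities, reused for multi-segment fusion
```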

52 Peak location classification: Due to the relatively low sampling frequency, spline interpolation was used to improve the peak location. [sent-119, score-0.153]
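
A sketch of sub-sample peak refinement by spline interpolation; the upsampling factor is an assumption:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def refined_peak(corr: np.ndarray, upsample: int = 10) -> float:
    """Return a sub-sample peak location by evaluating a cubic spline on a finer lag grid."""
    lags = np.arange(len(corr))
    fine = np.linspace(0, len(corr) - 1, (len(corr) - 1) * upsample + 1)
    return float(fine[np.argmax(CubicSpline(lags, corr)(fine))])
```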

53 In microphone arrays it is common to translate the peak-location to an estimate of the source azimuth using a geometric formula. [sent-120, score-0.419]

54 However, this was found to be inappropriate due to the internal reverberations generated by the manikin head. [sent-121, score-0.244]

55 The peak location data was modelled using a separate histogram for each direction. [sent-123, score-0.186]

56 For a given direction, all the training measurements for which are used to build a single histogram: where is if is true and otherwise, is the size of the bin in the histogram, and is the number of bins. [sent-124, score-0.131]

57 An estimate of the probability density function was taken to be the normalized histogram step function: where is the number of training measurements for which . [sent-125, score-0.149]
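
A minimal sketch of this histogram-based maximum-likelihood classifier; shared bin edges across directions and the smoothing constant are our assumptions:

```python
import numpy as np

def fit_histograms(peaks: np.ndarray, labels: np.ndarray, n_classes: int, bins: np.ndarray) -> np.ndarray:
    """Build one normalized peak-location histogram per direction class."""
    hists = np.zeros((n_classes, len(bins) - 1))
    for c in range(n_classes):
        counts, _ = np.histogram(peaks[labels == c], bins=bins)
        hists[c] = (counts + 1e-9) / max(counts.sum(), 1)  # normalized density estimate
    return hists

def classify_peak(peak: float, hists: np.ndarray, bins: np.ndarray) -> int:
    """Pick the direction whose histogram assigns the peak the highest likelihood."""
    b = int(np.clip(np.digitize(peak, bins) - 1, 0, hists.shape[1] - 1))
    return int(np.argmax(hists[:, b]))
```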

58 To do so we divide the training set into subsets where the th subset corresponds to measurements from azimuth in . [sent-127, score-0.314]

59 New test waveforms were then classified using the ML formula, Eq. [sent-132, score-0.124]

60 However, it degenerates if the training data is non-stationary, as is often the case in sound localization problems due to effects such as moving objects. [sent-135, score-0.438]

61 The Perceptron algorithm is a conservative online algorithm: it receives an instance, outputs a prediction for the instance, and only when it makes a prediction mistake does the Perceptron update its classification rule, which is a hyperplane. [sent-139, score-0.154]

62 Since our setting requires building a multiclass rule, we use the version described in [6] which generalises the Perceptron to multiclass settings. [sent-140, score-0.408]

63 We first describe the general form of the algorithm and then discuss the modifications we performed in order to adapt it to the sound localization problem. [sent-141, score-0.438]

64 To extend the Perceptron algorithm to multiclass problems we maintain hyperplanes (one per class), denoted . [sent-142, score-0.3]

65 On the th round, the algorithm gets a new instance and sets the predicted class to be the index of the hyperplane attaining the largest inner-product with the input instance. If the algorithm made a prediction error, it updates the set of hyperplanes. [sent-144, score-0.239]

66 The uniform update moves the hyperplane corresponding to the correct label in the direction of the instance, and moves all the hyperplanes whose inner-products were larger away from it. [sent-147, score-0.158]

67 This update of the hyperplanes is performed only on rounds on which there was a prediction error. [sent-149, score-0.158]
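
A hedged sketch of this multiclass Perceptron with the uniform update; the array shapes and epoch count are assumptions:

```python
import numpy as np

def multiclass_perceptron(X: np.ndarray, y: np.ndarray, n_classes: int, epochs: int = 5) -> np.ndarray:
    """Train one hyperplane per class; update only on prediction mistakes."""
    W = np.zeros((n_classes, X.shape[1]))
    for _ in range(epochs):
        for x, label in zip(X, y):
            scores = W @ x
            if int(np.argmax(scores)) != label:  # conservative: no update on correct rounds
                err = scores >= scores[label]    # hyperplanes outscoring the true class
                err[label] = False
                W[label] += x                    # pull the correct hyperplane toward x
                W[err] -= x / max(err.sum(), 1)  # push the offenders away, uniformly
    return W
```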

68 The multiclass Perceptron algorithm is guaranteed to converge to a perfect classification rule if the data can be classified perfectly by an unknown set of hyperplanes. [sent-151, score-0.204]

69 However, linear classifiers may not suffice in many applications, including the sound localization application. [sent-154, score-0.407]

70 Common kernels are RBF kernels and polynomial kernels which take the form . [sent-158, score-0.138]

71 In the case of the multiclass Perceptron we replace the update from Eq. [sent-160, score-0.236]
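
A sketch of the kernelized variant, keeping per-class dual coefficients over the training points instead of explicit hyperplanes; the polynomial kernel form and its degree are illustrative:

```python
import numpy as np

def kernel_multiclass_perceptron(X: np.ndarray, y: np.ndarray, n_classes: int,
                                 degree: int = 2, epochs: int = 5) -> np.ndarray:
    """Dual-form multiclass Perceptron with an assumed polynomial kernel (1 + x.z)^degree."""
    K = (1.0 + X @ X.T) ** degree          # Gram matrix over the training set
    alpha = np.zeros((n_classes, len(X)))  # dual coefficients, one row per class
    for _ in range(epochs):
        for i, label in enumerate(y):
            scores = alpha @ K[i]
            if int(np.argmax(scores)) != label:
                err = scores >= scores[label]
                err[label] = False
                alpha[label, i] += 1.0
                alpha[err, i] -= 1.0 / max(err.sum(), 1)
    return alpha
```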

72 Table 1: Summary of results of sound localization methods for a single segment. [sent-173, score-0.127]

73 However, if the source of sound does not move for a period of time, we can accumulate evidence from multiple segments in order to increase the accuracy. [sent-179, score-0.417]

74 Due to the lack of space we only outline the multi-segment classification procedure for the Fisher discriminant and compare it to smoothing and averaging techniques used in the signal processing community. [sent-180, score-0.171]

75 In multi-segment classification we are given waveforms for which we assume that the source angle did not change in this period. [sent-181, score-0.124]

76 That is, each waveform in the window corresponds to the same source angle. [sent-182, score-0.263]

77 We then converted the waveform feature vector into a probability estimate for each discrete angle direction, using the Fisher discriminant. [sent-185, score-0.321]

78 The probability density function of the entire window is therefore the product of the per-segment densities, and the ML estimate follows. We compared the Maximum Likelihood decision under the independence assumption with the following commonly used signal processing technique. [sent-189, score-0.19]
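
Under the independence assumption, the window-level decision reduces to summing per-segment log-probabilities; a minimal sketch (the probabilities could come, for instance, from the Fisher discriminant's output above):

```python
import numpy as np

def classify_window(segment_probs: np.ndarray) -> int:
    """segment_probs: (n_segments, n_classes) per-segment class probabilities."""
    log_like = np.sum(np.log(segment_probs + 1e-12), axis=0)  # window log-likelihood per class
    return int(np.argmax(log_like))                           # ML decision for the window
```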

79 The sampling frequency was fixed to and the robot’s uni-directional microphone without automatic level control was used. [sent-196, score-0.173]

80 The robot was laid on a concrete floor in a regular office room; the room reverberation time was . [sent-197, score-0.175]

81 A PC connected through a wireless link to the robot directed its head relative to the speaker. [sent-199, score-0.135]

82 The location of the sound source was limited to be in front of the head ( ) at a fixed constant elevation and in jumps of . [sent-200, score-0.468]

83 Table 2: Summary of results of sound localization methods for multiple segments. [sent-213, score-0.407]

84 ) For each head direction segments of data were collected. [sent-219, score-0.203]

85 Therefore, altogether there were segments for training and the same amount for evaluation. [sent-224, score-0.126]

86 From the transformed waveforms, samples were taken; locations in histograms were found using bins. [sent-228, score-0.124]

87 The first, denoted , is the empirical classification error that counts the number of times the predicted (discretized) angle was different from the true angle, that is, . [sent-233, score-0.177]

88 The second evaluation measure, denoted , is the average absolute difference between the predicted angle and the true angle, . [sent-234, score-0.177]
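
Both evaluation measures are straightforward to compute; a sketch assuming angles given in degrees:

```python
import numpy as np

def eval_measures(pred_deg: np.ndarray, true_deg: np.ndarray):
    """Return (classification error over discretized angles, mean absolute angular error)."""
    err = float(np.mean(pred_deg != true_deg))         # fraction of misclassified segments
    mad = float(np.mean(np.abs(pred_deg - true_deg)))  # mean absolute difference (degrees)
    return err, mad
```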

89 It is clear from the results that traditional methods which search for extrema in the waveforms are inferior to the discriminative methods. [sent-239, score-0.341]

90 As a by-product we confirmed that the equalized cross-correlation is inferior to PHAT modelling for high SNR with strong reverberations; similar results were reported in [11]. [sent-240, score-0.193]

91 Using the Perceptron algorithm with degree achieves the best results but the difference between the Perceptron and the multiclass Fisher discriminant is not statistically significant. [sent-243, score-0.296]

92 Their performance turns out to be inferior to the discriminative multiclass approaches. [sent-245, score-0.35]

93 A possible explanation is that the multiclass methods employ multiple hyperplanes and project each class onto a different hyperplane, while linear regression methods seek a single hyperplane onto which examples are projected. [sent-246, score-0.336]

94 This suggests that online algorithms may be more suitable when the sound source is stationary only for short periods. [sent-252, score-0.364]

95 Multisegment classification was performed by taking consecutive measurements over a window of during which the source location remained fixed. [sent-254, score-0.299]

96 The resulting prediction accuracy of Fisher's discriminant is good enough to make the solution practical so long as the sound source is fixed and the recording conditions do not change. [sent-260, score-0.454]

97 For classification using multiple segments, classifying the entire PHAT waveform gave better results than various techniques that smooth the power spectrum over the segments. [sent-266, score-0.397]

98 Our current research is focused on efficient discriminative methods for sound localization in changing environments. [sent-267, score-0.482]

99 Acoustic event localization using a crosspower-spectrum phase based technique. [sent-278, score-0.237]

100 Benesty, Adaptive eigenvalue decomposition algorithm for passive acoustic source localization, J. [sent-292, score-0.354]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('phat', 0.489), ('hfd', 0.306), ('perceptron', 0.236), ('localization', 0.207), ('multiclass', 0.204), ('sound', 0.2), ('fisher', 0.173), ('manikin', 0.163), ('waveform', 0.15), ('azimuth', 0.142), ('angle', 0.141), ('microphone', 0.128), ('waveforms', 0.124), ('equalized', 0.122), ('source', 0.122), ('classi', 0.108), ('hxd', 0.102), ('segments', 0.095), ('discriminant', 0.092), ('spectrum', 0.091), ('microphones', 0.089), ('transform', 0.084), ('reverberations', 0.081), ('signal', 0.079), ('xd', 0.078), ('head', 0.078), ('segment', 0.075), ('discriminative', 0.075), ('th', 0.071), ('extrema', 0.071), ('inferior', 0.071), ('measurements', 0.07), ('location', 0.068), ('fourier', 0.065), ('aibo', 0.061), ('scot', 0.061), ('sony', 0.061), ('tdoa', 0.061), ('physical', 0.061), ('hyperplanes', 0.06), ('instance', 0.06), ('robot', 0.057), ('vu', 0.056), ('cation', 0.055), ('smoothed', 0.052), ('delay', 0.051), ('received', 0.05), ('histogram', 0.048), ('kernels', 0.046), ('signals', 0.046), ('frequency', 0.045), ('channels', 0.042), ('online', 0.042), ('binaural', 0.041), ('entertainment', 0.041), ('trademark', 0.041), ('xsq', 0.041), ('peak', 0.04), ('cross', 0.04), ('prediction', 0.04), ('window', 0.039), ('array', 0.038), ('room', 0.037), ('entire', 0.036), ('nding', 0.036), ('denoted', 0.036), ('hyperplane', 0.036), ('commonly', 0.036), ('crosscorrelation', 0.035), ('generalized', 0.035), ('correlation', 0.034), ('passes', 0.034), ('bv', 0.032), ('attaining', 0.032), ('err', 0.032), ('update', 0.032), ('describe', 0.031), ('training', 0.031), ('oor', 0.03), ('media', 0.03), ('converted', 0.03), ('phase', 0.03), ('direction', 0.03), ('ml', 0.029), ('uv', 0.028), ('commercial', 0.028), ('speech', 0.028), ('arrays', 0.027), ('placed', 0.027), ('domain', 0.027), ('maximal', 0.026), ('weighting', 0.026), ('rmed', 0.026), ('rounds', 0.026), ('operator', 0.026), ('composed', 0.025), ('power', 0.025), ('plain', 0.025), ('acoustic', 0.025), ('singer', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 67 nips-2002-Discriminative Binaural Sound Localization

Author: Ehud Ben-reuven, Yoram Singer

Abstract: Time difference of arrival (TDOA) is commonly used to estimate the azimuth of a source in a microphone array. The most common methods to estimate TDOA are based on finding extrema in generalized crosscorrelation waveforms. In this paper we apply microphone array techniques to a manikin head. By considering the entire cross-correlation waveform we achieve azimuth prediction accuracy that exceeds extrema locating methods. We do so by quantizing the azimuthal angle and treating the prediction problem as a multiclass categorization task. We demonstrate the merits of our approach by evaluating the various approaches on Sony’s AIBO robot.

2 0.20253606 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

Author: Sariel Har-Peled, Dan Roth, Dav Zimak

Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.

3 0.13873808 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

Author: Gil-jin Jang, Te-Won Lee

Abstract: We present a new technique for achieving source separation when given only a single channel recording. The main idea is based on exploiting the inherent time structure of sound sources by learning a priori sets of basis filters in time domain that encode the sources in a statistically efficient manner. We derive a learning algorithm using a maximum likelihood approach given the observed single channel data and sets of basis filters. For each time point we infer the source signals and their contribution factors. This inference is possible due to the prior knowledge of the basis filters and the associated coefficient densities. A flexible model for density estimation allows accurate modeling of the observation and our experimental results exhibit a high level of separation performance for mixtures of two music signals as well as the separation of two voice signals.

4 0.11035647 53 nips-2002-Clustering with the Fisher Score

Author: Koji Tsuda, Motoaki Kawanabe, Klaus-Robert Müller

Abstract: Recently the Fisher score (or the Fisher kernel) has been increasingly used as a feature extractor for classification problems. The Fisher score is a vector of parameter derivatives of the log-likelihood of a probabilistic model. This paper gives a theoretical analysis of how class information is preserved in the space of the Fisher score, showing that the Fisher score consists of a few important dimensions with class information and many nuisance dimensions. When we perform clustering with the Fisher score, K-Means type methods are obviously inappropriate because they make use of all dimensions. We therefore develop a novel but simple clustering algorithm specialized for the Fisher score, which can exploit the important dimensions. This algorithm is successfully tested in experiments with artificial data and real data (amino acid sequences).

5 0.10590336 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata

Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor

Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features. In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline.

6 0.10022896 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

7 0.097387925 120 nips-2002-Kernel Design Using Boosting

8 0.092404656 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement

9 0.091604672 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

10 0.08725889 169 nips-2002-Real-Time Particle Filters

11 0.085873857 145 nips-2002-Mismatch String Kernels for SVM Protein Classification

12 0.077277094 147 nips-2002-Monaural Speech Separation

13 0.076104388 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

14 0.073009603 45 nips-2002-Boosted Dyadic Kernel Discriminants

15 0.071621336 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering

16 0.070983268 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

17 0.070865184 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

18 0.070645757 136 nips-2002-Linear Combinations of Optic Flow Vectors for Estimating Self-Motion - a Real-World Test of a Neural Model

19 0.069012582 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions

20 0.067011714 103 nips-2002-How Linear are Auditory Cortical Responses?


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.209), (1, -0.037), (2, 0.062), (3, -0.019), (4, 0.084), (5, -0.039), (6, -0.063), (7, -0.024), (8, 0.145), (9, -0.016), (10, -0.017), (11, 0.138), (12, -0.122), (13, -0.02), (14, 0.089), (15, 0.023), (16, 0.011), (17, 0.033), (18, -0.068), (19, -0.065), (20, 0.034), (21, -0.064), (22, -0.136), (23, 0.171), (24, 0.125), (25, 0.009), (26, -0.121), (27, 0.064), (28, -0.025), (29, -0.11), (30, 0.03), (31, -0.065), (32, -0.083), (33, -0.15), (34, -0.049), (35, -0.001), (36, -0.036), (37, -0.063), (38, 0.052), (39, -0.053), (40, 0.01), (41, 0.107), (42, -0.044), (43, 0.002), (44, 0.046), (45, 0.027), (46, -0.081), (47, 0.055), (48, 0.114), (49, 0.078)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92966986 67 nips-2002-Discriminative Binaural Sound Localization

Author: Ehud Ben-reuven, Yoram Singer

Abstract: Time difference of arrival (TDOA) is commonly used to estimate the azimuth of a source in a microphone array. The most common methods to estimate TDOA are based on finding extrema in generalized crosscorrelation waveforms. In this paper we apply microphone array techniques to a manikin head. By considering the entire cross-correlation waveform we achieve azimuth prediction accuracy that exceeds extrema locating methods. We do so by quantizing the azimuthal angle and treating the prediction problem as a multiclass categorization task. We demonstrate the merits of our approach by evaluating the various approaches on Sony’s AIBO robot.

2 0.54088151 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering

Author: Hagai Attias

Abstract: Source separation is an important problem at the intersection of several fields, including machine learning, signal processing, and speech technology. Here we describe new separation algorithms which are based on probabilistic graphical models with latent variables. In contrast with existing methods, these algorithms exploit detailed models to describe source properties. They also use subband filtering ideas to model the reverberant environment, and employ an explicit model for background and sensor noise. We leverage variational techniques to keep the computational complexity per EM iteration linear in the number of frames.

3 0.53493679 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking

Author: Sariel Har-Peled, Dan Roth, Dav Zimak

Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.

4 0.50900877 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation

Author: Gil-jin Jang, Te-Won Lee

Abstract: We present a new technique for achieving source separation when given only a single channel recording. The main idea is based on exploiting the inherent time structure of sound sources by learning a priori sets of basis filters in time domain that encode the sources in a statistically efficient manner. We derive a learning algorithm using a maximum likelihood approach given the observed single channel data and sets of basis filters. For each time point we infer the source signals and their contribution factors. This inference is possible due to the prior knowledge of the basis filters and the associated coefficient densities. A flexible model for density estimation allows accurate modeling of the observation and our experimental results exhibit a high level of separation performance for mixtures of two music signals as well as the separation of two voice signals.

5 0.49876434 47 nips-2002-Branching Law for Axons

Author: Dmitri B. Chklovskii, Armen Stepanyants

Abstract: What determines the caliber of axonal branches? We pursue the hypothesis that the axonal caliber has evolved to minimize signal propagation delays, while keeping arbor volume to a minimum. We show that for a general cost function the optimal diameters of mother (d0) and daughter (d1, d2) branches at a bifurcation obey a simple power law.

6 0.45994914 149 nips-2002-Multiclass Learning by Probabilistic Embeddings

7 0.45014095 53 nips-2002-Clustering with the Fisher Score

8 0.41155928 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata

9 0.38956493 145 nips-2002-Mismatch String Kernels for SVM Protein Classification

10 0.3873986 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

11 0.3824743 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement

12 0.37827754 45 nips-2002-Boosted Dyadic Kernel Discriminants

13 0.37445262 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study

14 0.37190175 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging

15 0.36219758 58 nips-2002-Conditional Models on the Ranking Poset

16 0.36149058 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

17 0.34519675 55 nips-2002-Combining Features for BCI

18 0.3376694 92 nips-2002-FloatBoost Learning for Classification

19 0.33073992 167 nips-2002-Rational Kernels

20 0.32864559 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.013), (11, 0.016), (23, 0.017), (42, 0.046), (54, 0.56), (55, 0.03), (67, 0.013), (68, 0.019), (74, 0.057), (92, 0.021), (98, 0.121)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.99666595 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

Author: Tong Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert function Spaces. We introduce a concept of scalesensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. We derive upper bounds on the generalization performance and show that the resulting convergent rates are optimal under various circumstances.

2 0.99608588 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

Author: Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik

Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. Keywords: use and refinement of prior knowledge, support vector machines, linear programming

3 0.99466276 36 nips-2002-Automatic Alignment of Local Representations

Author: Yee W. Teh, Sam T. Roweis

Abstract: We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a high-dimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4].

4 0.9940753 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction

Author: Vin D. Silva, Joshua B. Tenenbaum

Abstract: Recently proposed algorithms for nonlinear dimensionality reduction fall broadly into two categories which have different advantages and disadvantages: global (Isomap [1]), and local (Locally Linear Embedding [2], Laplacian Eigenmaps [3]). We present two variants of Isomap which combine the advantages of the global approach with what have previously been exclusive advantages of local methods: computational sparsity and the ability to invert conformal maps.

same-paper 5 0.98792332 67 nips-2002-Discriminative Binaural Sound Localization

Author: Ehud Ben-reuven, Yoram Singer

Abstract: Time difference of arrival (TDOA) is commonly used to estimate the azimuth of a source in a microphone array. The most common methods to estimate TDOA are based on finding extrema in generalized crosscorrelation waveforms. In this paper we apply microphone array techniques to a manikin head. By considering the entire cross-correlation waveform we achieve azimuth prediction accuracy that exceeds extrema locating methods. We do so by quantizing the azimuthal angle and treating the prediction problem as a multiclass categorization task. We demonstrate the merits of our approach by evaluating the various approaches on Sony’s AIBO robot.

6 0.98186046 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus

7 0.94809139 156 nips-2002-On the Complexity of Learning the Kernel Matrix

8 0.91550112 113 nips-2002-Information Diffusion Kernels

9 0.9111203 119 nips-2002-Kernel Dependency Estimation

10 0.91080058 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

11 0.89719236 190 nips-2002-Stochastic Neighbor Embedding

12 0.89639115 106 nips-2002-Hyperkernels

13 0.89105225 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

14 0.88943744 120 nips-2002-Kernel Design Using Boosting

15 0.88626909 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

16 0.8838715 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques

17 0.87906373 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

18 0.8767243 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers

19 0.87078565 140 nips-2002-Margin Analysis of the LVQ Algorithm

20 0.86375546 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting