nips nips2005 nips2005-174 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yun-gang Zhang, Chang-shui Zhang
Abstract: Separation of music signals is an interesting but difficult problem. It is helpful for many other areas of music research, such as audio content analysis. In this paper, a new music signal separation method is proposed, which is based on harmonic structure modeling. The main idea of harmonic structure modeling is that the harmonic structure of a music signal is stable, so a music signal can be represented by a harmonic structure model. Accordingly, a corresponding separation algorithm is proposed. The main idea is to learn a harmonic structure model for each music signal in the mixture, and then separate signals by using these models to distinguish harmonic structures of different signals. Experimental results show that the algorithm can separate signals and obtain not only a very high Signal-to-Noise Ratio (SNR) but also a rather good subjective audio quality.
Reference: text
sentIndex sentText sentNum sentScore
1 Separation of music signals is an interesting but difficult problem. [sent-7, score-0.77]
2 It is helpful for many other areas of music research, such as audio content analysis. [sent-8, score-0.816]
3 In this paper, a new music signal separation method is proposed, which is based on harmonic structure modeling. [sent-9, score-1.375]
4 The main idea of harmonic structure modeling is that the harmonic structure of a music signal is stable, so a music signal can be represented by a harmonic structure model. [sent-10, score-2.986]
5 The main idea is to learn a harmonic structure model for each music signal in the mixture, and then separate signals by using these models to distinguish harmonic structures of different signals. [sent-12, score-1.861]
6 Experimental results show that the algorithm can separate signals and obtain not only a very high Signal-to-Noise Ratio (SNR) but also a rather good subjective audio quality. [sent-13, score-0.257]
7 1 Introduction Audio content analysis is an important area in music research. [sent-14, score-0.707]
8 There are many open problems in this area, such as content-based music retrieval and classification, Computational Auditory Scene Analysis (CASA), multi-pitch estimation, automatic transcription, query by humming, etc. [sent-15, score-0.73]
9 In a song, the sounds of different instruments are mixed together, and it is difficult to parse the information of each instrument. [sent-18, score-0.18]
10 However, music signals are very different from general signals. [sent-20, score-0.77]
11 So, we try to find a way to separate music signals by utilizing the special characteristics of music signals. [sent-21, score-1.479]
12 After source separation, many audio content analysis problems will become much easier. [sent-22, score-0.151]
13 In this paper, a music signal means a monophonic music signal performed by one instrument. [sent-23, score-1.636]
14 A song is a mixture of several music signals and one or more singing voice signals. [sent-24, score-1.185]
15 As we know, music signals are more “ordered” than voice. [sent-25, score-0.77]
16 The entropy of music is much more constant in time than that of speech [5]. [sent-26, score-0.785]
17 More fundamentally, we found that an important characteristic of a music signal is that its harmonic structure is stable. [sent-27, score-1.284]
18 And the harmonic structures of music signals performed by different instruments are different. [sent-28, score-1.326]
19 So, a harmonic structure model is built to represent a music signal. [sent-29, score-1.127]
20 This model is the foundation of the separation algorithm. [sent-30, score-0.172]
21 In the separation algorithm, an extended multi-pitch estimation algorithm is used to extract harmonic structures of all sources, and a clustering algorithm is used to calculate harmonic structure models. [sent-31, score-1.103]
22 Then, signals are separated by using these models to distinguish harmonic structures of different signals. [sent-32, score-0.697]
23 There are many other signal separation methods, such as ICA [6]. [sent-33, score-0.248]
24 General signal separation methods do not sufficiently utilize the special characteristics of music signals. [sent-34, score-0.935]
25 Gil-Jin Jang and Te-Won Lee proposed a probabilistic approach to single channel blind signal separation [7], which is based on exploiting the inherent time structure of sound sources by learning a priori sets of basis filters. [sent-35, score-0.388]
26 Vanroose used ICA to remove music background from speech by subtracting ICA components with the lowest entropy [9]. [sent-39, score-0.813]
27 Compared to these approaches, our method can separate each individual instrument sound, preserve the harmonic structure in the separated signals and obtain a good subjective audio quality. [sent-40, score-0.887]
28 We translate the amplitudes into a log scale, because the human ear has a roughly logarithmic sensitivity to signal intensity. [sent-61, score-0.158]
29 In this paper, these coefficients are used to represent the harmonic structure of a sound. [sent-67, score-0.462]
30 Average Harmonic Structure and Harmonic Structure Stability are defined as follows to model music signals and measure the stability of harmonic structures. [sent-68, score-1.194]
31 Since timbres of most instruments are stable, $B_l$ varies little across the frames of a music signal, and the AHS is a good model to represent music signals. [sent-76, score-1.592]
32 In contrast, $B_l$ varies considerably in a voice signal, and the corresponding HSS is much larger than that of a music signal. [sent-77, score-1.165]
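To make these two quantities concrete, here is a minimal Python sketch. The defining formulas are not reproduced in this extract, so the code assumes that the AHS is the per-harmonic mean of the harmonic structure coefficients over all frames and that the HSS is the mean squared deviation from that average. The function names and the toy data are illustrative only.

```python
def average_harmonic_structure(frames):
    """Per-harmonic mean of the coefficient vectors over all frames.

    `frames` is a list of equally long lists; frames[i][r] is the
    coefficient of harmonic r+1 in frame i (assumed layout).
    """
    n_frames = len(frames)
    n_harmonics = len(frames[0])
    return [sum(frame[r] for frame in frames) / n_frames
            for r in range(n_harmonics)]


def harmonic_structure_stability(frames):
    """Mean squared deviation of the frames from their AHS.

    ASSUMPTION: this normalisation (by frames and by harmonics) is a
    plausible reading of the text, not a formula quoted from the paper.
    """
    ahs = average_harmonic_structure(frames)
    n_frames = len(frames)
    n_harmonics = len(ahs)
    total = sum((frame[r] - ahs[r]) ** 2
                for frame in frames
                for r in range(n_harmonics))
    return total / (n_frames * n_harmonics)


# Toy data: a "music-like" source whose coefficients barely move,
# and a "voice-like" source whose coefficients change between frames.
music_frames = [[1.0, 0.5, 0.2], [1.0, 0.5, 0.2]]
voice_frames = [[1.0, 0.1, 0.9], [1.0, 0.9, 0.1]]
print(harmonic_structure_stability(music_frames))
print(harmonic_structure_stability(voice_frames))
```

Under these assumptions, the stable frames give an HSS of zero and the varying frames give a clearly larger value, which is the contrast the text draws between music and voice.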
33 (a) Spectra in different frames of a voice signal. [sent-79, score-0.39]
34 The number of harmonics (significant peaks in the spectrum) and their amplitude ratios are totally different. [sent-80, score-0.152]
35 The number of harmonics (significant peaks in the spectrum) and their amplitude ratios are almost the same. [sent-82, score-0.152]
36 (e) The AHS and HSS of a male singing voice. (f) The AHS and HSS of a female singing voice. Figure 1: Spectra, AHSs and HSSs of voice and music signals. [sent-92, score-1.782]
37 In (c)-(f), x-axis is harmonic number, y-axis is the corresponding harmonic structure coefficient. [sent-93, score-0.867]
38 3 Separation algorithm based on harmonic structure modeling Without loss of generality, suppose we have a signal mixture consisting of one voice and several music signals. [sent-94, score-1.639]
39 The separation algorithm consists of four steps: preprocessing, extraction of harmonic structures, music AHSs analysis, separation of signals. [sent-95, score-1.314]
40 In the second step, the pitch estimation algorithm of Terhardt [11] is extended and used to extract harmonic structures. [sent-97, score-0.512]
41 For a fundamental frequency candidate $f$, count the number of $f_i$ that satisfy the following condition: $\lfloor (1 + d) f_i / f \rfloor \ge (1 - d) f_i / f$ (4), where $\lfloor x \rfloor$ denotes the greatest integer less than or equal to $x$. [sent-104, score-0.155]
42 If the condition is fulfilled, $f_i$ is the frequency of the $r_i$-th harmonic component when the fundamental frequency is $f$. [sent-106, score-0.608]
43 For each fundamental frequency candidate $f$, the coincidence number is calculated, and the candidate $\hat{f}$ corresponding to the largest coincidence number is selected as the estimated fundamental frequency. [sent-107, score-0.272]
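The coincidence test of condition (4) can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the harmonic number is taken to be the floor term on the left-hand side of (4), the tolerance value 0.03 is arbitrary, and the function names are hypothetical.

```python
import math


def coincident_harmonics(peaks, f0, d=0.03):
    """Return (peak_frequency, harmonic_number) pairs that satisfy (4).

    ASSUMPTION: the harmonic number r_i is floor((1 + d) * f_i / f0).
    The default tolerance d is a placeholder, not a value from the paper.
    """
    matches = []
    for fi in peaks:
        ratio = fi / f0
        r = math.floor((1 + d) * ratio)
        # Condition (4): floor((1 + d) f_i / f) >= (1 - d) f_i / f
        if r >= (1 - d) * ratio:
            matches.append((fi, r))
    return matches


def estimate_f0(peaks, candidates, d=0.03):
    """Pick the candidate with the largest coincidence number."""
    return max(candidates,
               key=lambda f: len(coincident_harmonics(peaks, f, d)))


peaks = [220.0, 441.0, 659.0, 1000.0]
print(coincident_harmonics(peaks, 220.0))
print(estimate_f0(peaks, [220.0, 300.0]))
```

With these toy peaks, the first three frequencies are accepted as harmonics 1 to 3 of 220 Hz, while 1000 Hz lies too far from any integer multiple and is rejected.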
44 Secondly, not only the fundamental frequency but also all its harmonics are extracted, then B can be calculated. [sent-110, score-0.148]
45 This criterion is not stable when the signal is polyphonic, because harmonic components of different sources may influence each other. [sent-112, score-0.603]
46 A new optimality criterion is defined as follows ($n$ is the coincidence number): $d = \frac{1}{n} \sum_{i=1,\ f_i \text{ coincident with } f}^{K} \frac{|r_i - f_i/f|}{r_i}$ (5). The candidate $\hat{f}$ corresponding to the smallest $d$ is the estimated fundamental frequency. [sent-113, score-0.188]
47 For each fundamental frequency, harmonic components of the same source are more likely to have a high coincidence precision than those of a different source. [sent-115, score-0.512]
48 So, the new criterion is helpful for separation of harmonic structures of different sources. [sent-116, score-0.604]
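A small Python sketch of criterion (5) follows. It reuses the coincidence test of condition (4) and averages the relative deviation of each coincident peak from its harmonic number. To avoid clashing with the criterion value d, the tolerance of condition (4) is called `tol` here; that name, its default value and the function names are assumptions made for illustration.

```python
import math


def coincidence_deviation(peaks, f0, tol=0.03):
    """Criterion (5): mean of |r_i - f_i/f0| / r_i over coincident peaks.

    Returns None when no peak coincides with the candidate f0.
    ASSUMPTION: r_i = floor((1 + tol) * f_i / f0), as in condition (4).
    """
    deviations = []
    for fi in peaks:
        ratio = fi / f0
        r = math.floor((1 + tol) * ratio)
        # r >= 1 guards against division by zero for degenerate input.
        if r >= 1 and r >= (1 - tol) * ratio:
            deviations.append(abs(r - ratio) / r)
    if not deviations:
        return None
    return sum(deviations) / len(deviations)


def select_f0(peaks, candidates, tol=0.03):
    """Pick the candidate with the smallest deviation, or None."""
    scored = []
    for f in candidates:
        dev = coincidence_deviation(peaks, f, tol)
        if dev is not None:
            scored.append((dev, f))
    if not scored:
        return None
    return min(scored)[1]


peaks = [220.0, 441.0, 659.0]
print(coincidence_deviation(peaks, 220.0))
print(coincidence_deviation(peaks, 222.0))
print(select_f0(peaks, [220.0, 222.0]))
```

Both candidates match all three peaks, so a count-only criterion cannot separate them; the deviation measure prefers 220 Hz because the peaks sit closer to its integer multiples.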
49 After harmonic structure extraction, a data set of harmonic structures is obtained. [sent-122, score-0.934]
50 As analyzed in Section 2, the music harmonic structures of the same instrument in different frames are similar to each other and different from those of other instruments. [sent-123, score-1.185]
51 So, in the data set all music harmonic structures form several high density clusters. [sent-124, score-1.137]
52 Voice harmonic structures scatter around like background noise, because the harmonic structure of the voice signal is not stable. [sent-126, score-1.444]
53 In the third step, NK algorithm [12] is used to learn music AHSs. [sent-127, score-0.665]
54 Actually, the harmonic structure data set is such a data set. [sent-130, score-0.462]
55 Clusters of harmonic structures of different instruments have different densities. [sent-131, score-0.556]
56 Each data point, a harmonic structure, has a high dimensionality (20 in our experiments). [sent-133, score-0.405]
57 In the separation step, all harmonic structures of an instrument in all frames are extracted to reconstruct the corresponding music signals and then removed from the mixture. [sent-145, score-1.446]
58 After removing all music signals, the rest of the mixture is the separated voice signal. [sent-146, score-1.162]
59 The procedure of music harmonic structure detection is detailed as follows. [sent-147, score-1.127]
60 , $B_R$] and a fundamental frequency candidate $f$, a music harmonic structure is predicted. [sent-151, score-1.244]
61 , BR ] are its frequencies and harmonic structure coefficients. [sent-158, score-0.484]
62 The closest peak in the magnitude spectrum for each predicted harmonic component is detected. [sent-159, score-0.43]
63 , BR ] are the frequencies and harmonic structure coefficients of these peaks (measured peaks). [sent-166, score-0.566]
64 Formula 6 is defined to calculate the distance between the predicted harmonic structure and the measured peaks. [sent-167, score-0.482]
65 Note that only harmonic components with non-zero harmonic structure coefficients are considered. [sent-170, score-0.867]
66 Let $\hat{f}$ denote the fundamental frequency candidate corresponding to the smallest distance between the predicted peaks and the actual spectral peaks. [sent-171, score-0.199]
67 If $D(\hat{f})$ is smaller than a threshold $T_d$, a music harmonic structure is detected. [sent-172, score-1.127]
68 Otherwise there is no music harmonic structure in the frame. [sent-173, score-1.127]
69 If a music harmonic structure is detected, the corresponding measured peaks in the spectrum are extracted, and the music signal is reconstructed by IFFT. [sent-174, score-2.034]
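The detection procedure can be sketched as follows. Formula 6 is not reproduced in this extract, so the distance used here is a stand-in (relative frequency error plus absolute coefficient difference, averaged over the active harmonics) and should not be read as the paper's definition. The data layout, function names and threshold value are likewise hypothetical.

```python
def closest_peak(peaks, target_freq):
    """Return the (frequency, coefficient) peak nearest to target_freq."""
    return min(peaks, key=lambda p: abs(p[0] - target_freq))


def detect_harmonic_structure(peaks, ahs, candidates, threshold):
    """Try to find a music harmonic structure in one frame.

    peaks:      list of (frequency, coefficient) spectral peaks
    ahs:        learned AHS coefficients, ahs[0] is harmonic 1
    candidates: fundamental frequency candidates to try
    threshold:  plays the role of T_d in the text

    Returns (f0, measured_peaks) or None if nothing is detected.
    ASSUMPTION: the distance below is a placeholder for Formula 6.
    """
    # Only harmonics with non-zero AHS coefficients are considered.
    active = [(r, b) for r, b in enumerate(ahs, start=1) if b != 0]
    if not active or not peaks:
        return None
    best = None
    for f0 in candidates:
        measured = [closest_peak(peaks, r * f0) for r, _ in active]
        dist = sum(abs(m[0] - r * f0) / (r * f0) + abs(m[1] - b)
                   for (r, b), m in zip(active, measured)) / len(active)
        if best is None or dist < best[0]:
            best = (dist, f0, measured)
    if best is not None and best[0] < threshold:
        return best[1], best[2]
    return None


frame_peaks = [(220.0, 1.0), (440.0, 0.8), (660.0, 0.5), (300.0, 0.9)]
learned_ahs = [1.0, 0.8, 0.5, 0.0]
print(detect_harmonic_structure(frame_peaks, learned_ahs,
                                [220.0, 300.0], threshold=0.1))
```

In a full system the returned peaks would be the ones extracted from the spectrum and passed to the inverse FFT; that reconstruction step is outside the scope of this sketch.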
70 4 Experimental results We have tested the performance of the proposed method on mixtures of different voice and music signals. [sent-176, score-1.012]
71 In experiments 1 and 2, the mixed signals consist of one voice signal and one music signal. [sent-181, score-1.321]
72 In experiment 3, the mixture consists of two music signals. [sent-182, score-0.714]
73 In experiment 4, the mixture consists of one voice and two music signals. [sent-183, score-1.061]
74 It can be seen that the mixtures are well separated into voice and music signals and very high SNRs are obtained in the separated signals. [sent-185, score-1.357]
75 Experimental results show that music AHS is a good model for music signal representation and separation. [sent-186, score-1.465]
76 In the separation procedure, music harmonic structures are detected by the music AHS model and separated from the mixture, and most of the time voice harmonic structures remain almost untouched. [sent-188, score-2.874]
77 This procedure makes separated signals with a rather good subjective audio quality due to the good harmonic structure in the separated signals. [sent-189, score-0.937]
78 Few existing methods can obtain such a good result because the harmonic structure is distorted in most of the existing methods. [sent-190, score-0.462]
79 However, we compared our method with a speech enhancement method, because separation 1 http://www. [sent-192, score-0.313]
80 htm Table 1: SNR results (dB): snrv, snrm1 and snrm2 are the SNRs of the voice and music signals in the mixed signal. [sent-197, score-1.234]
81 snrv′, snrm1′ and snrm2′ are the SNRs of the separated voice and music signals. [sent-199, score-1.18]
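For readers who want to reproduce this kind of measurement, the following Python sketch computes an SNR in decibels from a reference signal and an estimate of it. It uses the common definition (energy of the reference divided by energy of the error); the extract does not state which exact formula the authors used, so treat this as an assumption.

```python
import math


def snr_db(reference, estimate):
    """SNR in dB: 10 * log10(sum(s^2) / sum((s - s_hat)^2)).

    ASSUMPTION: this is the conventional definition, not a formula
    quoted from the paper. Both inputs are equally long sample lists.
    """
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if error_energy == 0:
        return math.inf  # perfect reconstruction
    return 10 * math.log10(signal_energy / error_energy)


clean = [1.0, -1.0, 1.0, -1.0]
separated = [0.9, -0.9, 0.9, -0.9]
print(snr_db(clean, separated))
```

The same function covers both rows of the table: passing the mixture as the estimate gives the SNR of a source in the mixed signal, and passing the separated output gives the SNR after separation.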
82 of voice and music can be regarded as a speech enhancement problem by regarding music as background noise. [sent-225, score-1.905]
83 Figure 2 (b), (d) give speech enhancement results obtained by a speech enhancement software which tries to estimate the spectrum of noise in the pause of speech and enhance the speech by spectral subtraction [14]. [sent-226, score-0.683]
84 Detecting pauses in speech with music background and enhancing speech with fast music noise are both very difficult problems, so traditional speech enhancement techniques cannot work here. [sent-227, score-1.816]
85 5 Conclusion and discussion In this paper, a harmonic structure model is proposed to represent music signals and used to separate music signals. [sent-228, score-1.919]
86 The proposed method has many applications, such as multi-pitch estimation, audio content analysis, audio editing, speech enhancement with music background, etc. [sent-230, score-1.125]
87 Multi-pitch estimation is an important problem in music research. [sent-231, score-0.701]
88 In our algorithm, not only harmonic structures but also corresponding fundamental frequencies are extracted. [sent-235, score-0.553]
89 It analyzes the primary multi-pitch estimation results and learns models to represent music signals and improve multi-pitch estimation results. [sent-237, score-0.842]
90 After separation, pitch estimation on the separated voice signal that contains melody becomes a monophonic pitch estimation problem, which can be done easily. [sent-244, score-0.9]
91 Then, many content-based audio analysis tasks such as audio retrieval and classification become much easier, and many MIDI-based algorithms can be used on audio files. [sent-246, score-0.392]
92 In these cases, the music harmonic structures of this instrument will form several clusters, not one. [sent-254, score-1.185]
93 Then a GMM instead of an average harmonic structure model (actually a point model) should be used to represent the music. [sent-255, score-0.462]
94 (i) Experiment 4: The learned music AHSs. Figure 2: Experimental results. [sent-259, score-0.665]
95 Andre-Obrecht, “Robust speech / music classification in audio documents,” in 7th International Conference On Spoken Language Processing (ICSLP), 2002, pp. [sent-283, score-0.894]
96 [7] Gil-Jin Jang and Te-Won Lee, “A probabilistic approach to single channel blind signal separation,” in Neural Information Processing Systems 15 (NIPS2002), 2003. [sent-290, score-0.157]
97 [8] Yazhong Feng, Yueting Zhuang, and Yunhe Pan, “Popular music retrieval by independent component analysis,” in ISMIR, 2002, pp. [sent-291, score-0.688]
98 [9] Peter Vanroose, “Blind source separation of speech and background music for improved speech recognition,” in The 24th Symposium on Information Theory, May 2003, pp. [sent-293, score-1.046]
99 Beauchamp, “Fundamental frequency estimation of musical signals using a two-way mismatch procedure,” Journal of the Acoustical Society of America, vol. [sent-313, score-0.218]
100 [17] Hirokazu Kameoka, Takuya Nishimoto, and Shigeki Sagayama, “Separation of harmonic structures based on tied gaussian mixture model and information criterion for concurrent sounds,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP04), 2004. [sent-326, score-0.521]
wordName wordTfidf (topN-words)
[('music', 0.665), ('harmonic', 0.405), ('voice', 0.347), ('hss', 0.168), ('signal', 0.135), ('ahs', 0.132), ('speech', 0.12), ('separated', 0.12), ('separation', 0.113), ('audio', 0.109), ('signals', 0.105), ('piccolo', 0.096), ('instruments', 0.084), ('br', 0.084), ('peaks', 0.082), ('enhancement', 0.08), ('pitch', 0.071), ('mixed', 0.069), ('structures', 0.067), ('organ', 0.067), ('bl', 0.064), ('fundamental', 0.059), ('structure', 0.057), ('ahss', 0.048), ('harmonics', 0.048), ('melody', 0.048), ('pitches', 0.048), ('snrv', 0.048), ('coincidence', 0.048), ('instrument', 0.048), ('frames', 0.043), ('content', 0.042), ('frequency', 0.041), ('fi', 0.038), ('singing', 0.038), ('fr', 0.038), ('estimation', 0.036), ('monophonic', 0.036), ('musical', 0.036), ('terhardt', 0.036), ('timbre', 0.036), ('sound', 0.033), ('snrs', 0.031), ('mixture', 0.03), ('ar', 0.029), ('transcription', 0.029), ('snr', 0.028), ('background', 0.028), ('sources', 0.028), ('sounds', 0.027), ('original', 0.026), ('rf', 0.025), ('coef', 0.025), ('spectrum', 0.025), ('al', 0.024), ('beauchamp', 0.024), ('exceeding', 0.024), ('loor', 0.024), ('maher', 0.024), ('polyphonic', 0.024), ('snre', 0.024), ('tsinghua', 0.024), ('vanroose', 0.024), ('ri', 0.024), ('automatic', 0.023), ('retrieval', 0.023), ('amplitudes', 0.023), ('clusters', 0.022), ('separate', 0.022), ('blind', 0.022), ('character', 0.022), ('amplitude', 0.022), ('frequencies', 0.022), ('subjective', 0.021), ('spectra', 0.021), ('rth', 0.021), ('beijing', 0.021), ('nk', 0.02), ('cult', 0.02), ('detected', 0.02), ('calculate', 0.02), ('ica', 0.02), ('criterion', 0.019), ('zhang', 0.019), ('mikhail', 0.019), ('experiment', 0.019), ('stability', 0.019), ('extraction', 0.018), ('noise', 0.018), ('feng', 0.018), ('bigger', 0.018), ('neighborhood', 0.017), ('eigenvalues', 0.017), ('candidate', 0.017), ('cluster', 0.016), ('stable', 0.016), ('acoustics', 0.016), ('china', 0.016), ('automation', 0.016), ('dif', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 174 nips-2005-Separation of Music Signals by Harmonic Structure Modeling
Author: Yun-gang Zhang, Chang-shui Zhang
2 0.080449432 88 nips-2005-Gradient Flow Independent Component Analysis in Micropower VLSI
Author: Abdullah Celik, Milutin Stanacevic, Gert Cauwenberghs
Abstract: We present micropower mixed-signal VLSI hardware for real-time blind separation and localization of acoustic sources. Gradient flow representation of the traveling wave signals acquired over a miniature (1cm diameter) array of four microphones yields linearly mixed instantaneous observations of the time-differentiated sources, separated and localized by independent component analysis (ICA). The gradient flow and ICA processors each measure 3mm × 3mm in 0.5 µm CMOS, and consume 54 µW and 180 µW power, respectively, from a 3 V supply at 16 ks/s sampling rate. Experiments demonstrate perceptually clear (12dB) separation and precise localization of two speech sources presented through speakers positioned at 1.5m from the array on a conference room table. Analysis of the multipath residuals shows that they are spectrally diffuse, and void of the direct path.
3 0.073351488 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
4 0.058167126 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections
Author: Michael B. Wakin, Marco F. Duarte, Shriram Sarvotham, Dror Baron, Richard G. Baraniuk
Abstract: Compressed sensing is an emerging field based on the revelation that a small group of linear projections of a sparse signal contains enough information for reconstruction. In this paper we introduce a new theory for distributed compressed sensing (DCS) that enables new distributed coding algorithms for multi-signal ensembles that exploit both intra- and inter-signal correlation structures. The DCS theory rests on a new concept that we term the joint sparsity of a signal ensemble. We study three simple models for jointly sparse signals, propose algorithms for joint recovery of multiple signals from incoherent projections, and characterize theoretically and empirically the number of measurements per sensor required for accurate reconstruction. In some sense DCS is a framework for distributed compression of sources with memory, which has remained a challenging problem in information theory for some time. DCS is immediately applicable to a range of problems in sensor networks and arrays. 1
5 0.03708249 178 nips-2005-Soft Clustering on Graphs
Author: Kai Yu, Shipeng Yu, Volker Tresp
Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1
6 0.036045715 136 nips-2005-Noise and the two-thirds power Law
7 0.03543381 29 nips-2005-Analyzing Coupled Brain Sources: Distinguishing True from Spurious Interaction
8 0.035090599 104 nips-2005-Laplacian Score for Feature Selection
9 0.034382697 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
10 0.033631593 183 nips-2005-Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity
11 0.032550924 20 nips-2005-Affine Structure From Sound
12 0.03195053 111 nips-2005-Learning Influence among Interacting Markov Chains
13 0.031199491 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators
14 0.03108965 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
15 0.030590331 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?
16 0.029555846 171 nips-2005-Searching for Character Models
17 0.029387871 150 nips-2005-Optimizing spatio-temporal filters for improving Brain-Computer Interfacing
18 0.0276381 159 nips-2005-Q-Clustering
19 0.026856495 141 nips-2005-Norepinephrine and Neural Interrupts
20 0.025209967 25 nips-2005-An aVLSI Cricket Ear Model
topicId topicWeight
[(0, 0.084), (1, -0.001), (2, -0.022), (3, 0.031), (4, -0.043), (5, -0.009), (6, -0.019), (7, -0.063), (8, 0.031), (9, 0.011), (10, -0.108), (11, -0.07), (12, 0.03), (13, 0.001), (14, -0.017), (15, -0.056), (16, -0.022), (17, 0.032), (18, 0.054), (19, -0.012), (20, -0.108), (21, -0.061), (22, 0.066), (23, 0.006), (24, 0.028), (25, -0.086), (26, -0.054), (27, -0.086), (28, -0.037), (29, -0.088), (30, 0.03), (31, 0.013), (32, 0.016), (33, 0.076), (34, 0.044), (35, 0.017), (36, -0.038), (37, -0.045), (38, 0.059), (39, -0.015), (40, 0.02), (41, 0.031), (42, -0.016), (43, 0.012), (44, -0.015), (45, -0.057), (46, 0.126), (47, 0.064), (48, 0.111), (49, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.95924193 174 nips-2005-Separation of Music Signals by Harmonic Structure Modeling
Author: Yun-gang Zhang, Chang-shui Zhang
2 0.58415192 88 nips-2005-Gradient Flow Independent Component Analysis in Micropower VLSI
Author: Abdullah Celik, Milutin Stanacevic, Gert Cauwenberghs
Abstract: We present micropower mixed-signal VLSI hardware for real-time blind separation and localization of acoustic sources. Gradient flow representation of the traveling wave signals acquired over a miniature (1cm diameter) array of four microphones yields linearly mixed instantaneous observations of the time-differentiated sources, separated and localized by independent component analysis (ICA). The gradient flow and ICA processors each measure 3mm × 3mm in 0.5 µm CMOS, and consume 54 µW and 180 µW power, respectively, from a 3 V supply at 16 ks/s sampling rate. Experiments demonstrate perceptually clear (12dB) separation and precise localization of two speech sources presented through speakers positioned at 1.5m from the array on a conference room table. Analysis of the multipath residuals shows that they are spectrally diffuse, and void of the direct path.
3 0.53413224 183 nips-2005-Stimulus Evoked Independent Factor Analysis of MEG Data with Large Background Activity
Author: Kenneth Hild, Kensuke Sekihara, Hagai T. Attias, Srikantan S. Nagarajan
Abstract: This paper presents a novel technique for analyzing electromagnetic imaging data obtained using the stimulus evoked experimental paradigm. The technique is based on a probabilistic graphical model, which describes the data in terms of underlying evoked and interference sources, and explicitly models the stimulus evoked paradigm. A variational Bayesian EM algorithm infers the model from data, suppresses interference sources, and reconstructs the activity of separated individual brain sources. The new algorithm outperforms existing techniques on two real datasets, as well as on simulated data. 1
4 0.45824081 163 nips-2005-Recovery of Jointly Sparse Signals from Few Random Projections
Author: Michael B. Wakin, Marco F. Duarte, Shriram Sarvotham, Dror Baron, Richard G. Baraniuk
Abstract: Compressed sensing is an emerging field based on the revelation that a small group of linear projections of a sparse signal contains enough information for reconstruction. In this paper we introduce a new theory for distributed compressed sensing (DCS) that enables new distributed coding algorithms for multi-signal ensembles that exploit both intra- and inter-signal correlation structures. The DCS theory rests on a new concept that we term the joint sparsity of a signal ensemble. We study three simple models for jointly sparse signals, propose algorithms for joint recovery of multiple signals from incoherent projections, and characterize theoretically and empirically the number of measurements per sensor required for accurate reconstruction. In some sense DCS is a framework for distributed compression of sources with memory, which has remained a challenging problem in information theory for some time. DCS is immediately applicable to a range of problems in sensor networks and arrays. 1
5 0.45564139 20 nips-2005-Affine Structure From Sound
Author: Sebastian Thrun
Abstract: We consider the problem of localizing a set of microphones together with a set of external acoustic events (e.g., hand claps), emitted at unknown times and unknown locations. We propose a solution that approximates this problem under a far field approximation defined in the calculus of affine geometry, and that relies on singular value decomposition (SVD) to recover the affine structure of the problem. We then define low-dimensional optimization techniques for embedding the solution into Euclidean geometry, and further techniques for recovering the locations and emission times of the acoustic events. The approach is useful for the calibration of ad-hoc microphone arrays and sensor networks. 1
6 0.43386143 29 nips-2005-Analyzing Coupled Brain Sources: Distinguishing True from Spurious Interaction
7 0.39507237 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
8 0.36504591 158 nips-2005-Products of ``Edge-perts
9 0.346876 104 nips-2005-Laplacian Score for Feature Selection
10 0.34019464 101 nips-2005-Is Early Vision Optimized for Extracting Higher-order Dependencies?
11 0.32597628 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
12 0.30660722 171 nips-2005-Searching for Character Models
13 0.304896 203 nips-2005-Visual Encoding with Jittering Eyes
14 0.29714966 33 nips-2005-Bayesian Sets
15 0.29321197 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care
16 0.28981265 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries
17 0.28902167 121 nips-2005-Location-based activity recognition
18 0.28779748 111 nips-2005-Learning Influence among Interacting Markov Chains
19 0.2805292 3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence
20 0.27291849 73 nips-2005-Fast biped walking with a reflexive controller and real-time policy searching
topicId topicWeight
[(3, 0.038), (10, 0.026), (27, 0.027), (31, 0.042), (34, 0.054), (41, 0.012), (55, 0.014), (57, 0.011), (67, 0.43), (69, 0.056), (73, 0.051), (88, 0.089), (91, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.82774377 174 nips-2005-Separation of Music Signals by Harmonic Structure Modeling
Author: Yun-gang Zhang, Chang-shui Zhang
2 0.58926624 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories
Author: Nicolas Loeff, Himanshu Arora, Alexander Sorokin, David Forsyth
Abstract: We describe a novel method for learning templates for recognition and localization of objects drawn from categories. A generative model represents the configuration of multiple object parts with respect to an object coordinate system; these parts in turn generate image features. The complexity of the model in the number of features is low, meaning our model is much more efficient to train than comparative methods. Moreover, a variational approximation is introduced that allows learning to be orders of magnitude faster than previous approaches while incorporating many more features. This results in both accuracy and localization improvements. Our model has been carefully tested on standard datasets; we compare with a number of recent template models. In particular, we demonstrate state-of-the-art results for detection and localization.
3 0.50417417 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
Author: Jeremy Kubica, Joseph Masiero, Robert Jedicke, Andrew Connolly, Andrew W. Moore
Abstract: In this paper we consider the problem of finding sets of points that conform to a given underlying model from within a dense, noisy set of observations. This problem is motivated by the task of efficiently linking faint asteroid detections, but is applicable to a range of spatial queries. We survey current tree-based approaches, showing a trade-off exists between single tree and multiple tree algorithms. To this end, we present a new type of multiple tree algorithm that uses a variable number of trees to exploit the advantages of both approaches. We empirically show that this algorithm performs well using both simulated and astronomical data.
4 0.31028712 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
Author: Amir Navot, Lavi Shpigelman, Naftali Tishby, Eilon Vaadia
Abstract: We present a non-linear, simple, yet effective, feature subset selection method for regression and use it in analyzing cortical neural activity. Our algorithm involves a feature-weighted version of the k-nearest-neighbor algorithm. It is able to capture complex dependency of the target function on its input and makes use of the leave-one-out error as a natural regularization. We explain the characteristics of our algorithm on synthetic problems and use it in the context of predicting hand velocity from spikes recorded in motor cortex of a behaving monkey. By applying feature selection we are able to improve prediction quality and suggest a novel way of exploring neural data.
5 0.30794075 45 nips-2005-Conditional Visual Tracking in Kernel Space
Author: Cristian Sminchisescu, Atul Kanujia, Zhiguo Li, Dimitris Metaxas
Abstract: We present a conditional temporal probabilistic framework for reconstructing 3D human motion in monocular video based on descriptors encoding image silhouette observations. For computational efficiency we restrict visual inference to low-dimensional kernel induced non-linear state spaces. Our methodology (kBME) combines kernel PCA-based non-linear dimensionality reduction (kPCA) and Conditional Bayesian Mixture of Experts (BME) in order to learn complex multivalued predictors between observations and model hidden states. This is necessary for accurate, inverse, visual perception inferences, where several probable, distant 3D solutions exist due to noise or the uncertainty of monocular perspective projection. Low-dimensional models are appropriate because many visual processes exhibit strong non-linear correlations in both the image observations and the target, hidden state variables. The learned predictors are temporally combined within a conditional graphical model in order to allow a principled propagation of uncertainty. We study several predictors and empirically show that the proposed algorithm compares favorably with techniques based on regression, Kernel Dependency Estimation (KDE) or PCA alone, and gives results competitive to those of high-dimensional mixture predictors at a fraction of their computational cost. We show that the method successfully reconstructs the complex 3D motion of humans in real monocular video sequences.
1 Introduction and Related Work
We consider the problem of inferring 3D articulated human motion from monocular video. This research topic has applications for scene understanding including human-computer interfaces, markerless human motion capture, entertainment and surveillance. A monocular approach is relevant because in real-world settings the human body parts are rarely completely observed even when using multiple cameras. This is due to occlusions from other people or objects in the scene. A robust system necessarily has to deal with incomplete, ambiguous and uncertain measurements.
Methods for 3D human motion reconstruction can be classified as generative and discriminative. They both require a state representation, namely a 3D human model with kinematics (joint angles) or shape (surfaces or joint positions) and they both use a set of image features as observations for state inference. The computational goal in both cases is the conditional distribution for the model state given image observations. Generative model-based approaches [6, 16, 14, 13] have been demonstrated to flexibly reconstruct complex unknown human motions and to naturally handle problem constraints. However, it is difficult to construct reliable observation likelihoods due to the complexity of modeling human appearance. This varies widely due to different clothing and deformation, body proportions or lighting conditions. Besides being somewhat indirect, the generative approach further imposes strict conditional independence assumptions on the temporal observations given the states in order to ensure computational tractability. Due to these factors inference is expensive and produces highly multimodal state distributions [6, 16, 13]. Generative inference algorithms require complex annealing schedules [6, 13] or systematic non-linear search for local optima [16] in order to ensure continuing tracking.
These difficulties motivate the advent of a complementary class of discriminative algorithms [10, 12, 18, 2] that approximate the state conditional directly, in order to simplify inference. However, inverse, observation-to-state multivalued mappings are difficult to learn (see e.g. fig. 1a) and a probabilistic temporal setting is necessary. In an earlier paper [15] we introduced a probabilistic discriminative framework for human motion reconstruction. Because the method operates in the originally selected state and observation spaces that can be task generic, therefore redundant and often high-dimensional, inference is more expensive and can be less robust.
Figure 1: (a, Left) Example of 180° ambiguity in predicting 3D human poses from silhouette image features (center). It is essential that multiple plausible solutions (e.g. F1 and F2) are correctly represented and tracked over time. A single state predictor will either average the distant solutions or zig-zag between them, see also tables 1 and 2. (b, Right) A conditional chain model. The local distributions p(y_t | y_{t-1}, z_t) or p(y_t | z_t) are learned as in fig. 2. For inference, the predicted local state conditional is recursively combined with the filtered prior, c.f. (1).
To summarize, reconstructing 3D human motion in a conditional temporal framework poses the following difficulties: (i) The mapping between temporal observations and states is multivalued (i.e. the local conditional distributions to be learned are multimodal), therefore it cannot be accurately represented using global function approximations. (ii) Human models have multivariate, high-dimensional continuous states of 50 or more human joint angles. The temporal state conditionals are multimodal, which makes efficient Kalman filtering algorithms inapplicable. General inference methods (particle filters, mixtures) have to be used instead, but these are expensive for high-dimensional models (e.g. when reconstructing the motion of several people that operate in a joint state space). (iii) The components of the human state and of the silhouette observation vector exhibit strong correlations, because many repetitive human activities like walking or running have low intrinsic dimensionality. It appears wasteful to work with high-dimensional states of 50+ joint angles. Even if the space were truly high-dimensional, predicting correlated state dimensions independently may still be suboptimal.
In this paper we present a conditional temporal estimation algorithm that restricts visual inference to low-dimensional, kernel induced state spaces. To exploit correlations among observations and among state variables, we model the local, temporal conditional distributions using ideas from Kernel PCA [11, 19] and conditional mixture modeling [7, 5], here adapted to produce multiple probabilistic predictions. The corresponding predictor is referred to as a Conditional Bayesian Mixture of Low-dimensional Kernel-Induced Experts (kBME). By integrating it within a conditional graphical model framework (fig. 1b), we can exploit temporal constraints probabilistically. We demonstrate that this methodology is effective for reconstructing the 3D motion of multiple people in monocular video. Our contribution w.r.t. [15] is a probabilistic conditional inference framework that operates over non-linear, kernel-induced low-dimensional state spaces, and a set of experiments (on both real and artificial image sequences) that show how the proposed framework compares favorably with powerful predictors based on KDE, PCA, or with the high-dimensional models of [15] at a fraction of their cost.
2 Probabilistic Inference in a Kernel Induced State Space
We work with conditional graphical models with a chain structure [9], as shown in fig. 1b. These have continuous temporal states y_t, t = 1 ... T, and observations z_t. For compactness, we denote joint states Y_t = (y_1, y_2, ..., y_t) or joint observations Z_t = (z_1, ..., z_t). Learning and inference are based on local conditionals: p(y_t | z_t) and p(y_t | y_{t-1}, z_t), with y_t and z_t being low-dimensional, kernel induced representations of some initial model having state x_t and observation r_t. We obtain z_t, y_t from r_t, x_t using kernel PCA [11, 19]. Inference is performed in a low-dimensional, non-linear, kernel induced latent state space (see fig. 1b, fig. 2 and (1)). For display or error reporting, we compute the original conditional p(x | r), or a temporally filtered version p(x_t | R_t), R_t = (r_1, r_2, ..., r_t), using a learned pre-image state map [3].
2.1 Density Propagation for Continuous Conditional Chains
For online filtering, we compute the optimal distribution p(y_t | Z_t) for the state y_t, conditioned by observations Z_t up to time t. The filtered density can be recursively derived as:
$p(y_t \mid Z_t) = \int p(y_t \mid y_{t-1}, z_t)\, p(y_{t-1} \mid Z_{t-1})\, dy_{t-1}$ (1)
We compute (1) using a conditional mixture for p(y_t | y_{t-1}, z_t) (a Bayesian mixture of experts, c.f. §2.2) and the prior p(y_{t-1} | Z_{t-1}), each having, say, M components. We integrate M^2 pairwise products of Gaussians analytically. The means of the expanded posterior are clustered and the centers are used to initialize a reduced M-component Kullback-Leibler approximation that is refined using gradient descent [15]. The propagation rule (1) is similar to the one used for discrete state labels [9], but here we work with multivariate continuous state spaces and represent the local multimodal state conditionals using kBME (fig. 2), and not log-linear models [9] (these would require intractable normalization). This complex continuous model rules out inference based on Kalman filtering or dynamic programming [9].
2.2 Learning Bayesian Mixtures over Kernel Induced State Spaces (kBME)
In order to model conditional mappings between low-dimensional non-linear spaces we rely on kernel dimensionality reduction and conditional mixture predictors. The authors of KDE [19] propose a powerful structured unimodal predictor. This works by decorrelating the output using kernel PCA and learning a ridge regressor between the input and each decorrelated output dimension.
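The propagation rule (1) from Section 2.1 can be illustrated numerically. The sketch below is not the analytic product-of-Gaussians scheme described in the text; it is a plain grid (Riemann sum) approximation in one dimension, which is only practical for very low-dimensional states. The function names, the grid, and the two-mode `toy_conditional` are assumptions introduced for illustration.

```python
import math


def gaussian(x, mean, std):
    """Density of a 1-D Gaussian N(x | mean, std^2)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def propagate(prior, grid, conditional, z_t):
    """One step of rule (1) on a uniform grid.

    prior[i] approximates p(y_{t-1} = grid[i] | Z_{t-1}) and
    conditional(y, y_prev, z) evaluates p(y_t | y_{t-1}, z_t).
    The integral over y_{t-1} is replaced by a Riemann sum.
    """
    step = grid[1] - grid[0]
    return [
        step * sum(conditional(y, y_prev, z_t) * p_prev
                   for y_prev, p_prev in zip(grid, prior))
        for y in grid
    ]


def toy_conditional(y, y_prev, z):
    """Hypothetical two-mode conditional: the observation z is equally
    consistent with +z and -z (a mirror ambiguity), smoothed by y_prev."""
    mean_a = 0.5 * y_prev + 0.5 * z
    mean_b = 0.5 * y_prev - 0.5 * z
    return 0.5 * gaussian(y, mean_a, 0.3) + 0.5 * gaussian(y, mean_b, 0.3)


STEP = 0.05
grid = [-6.0 + STEP * i for i in range(241)]        # uniform grid on [-6, 6]
prior = [gaussian(y, 0.0, 0.5) for y in grid]       # p(y_{t-1} | Z_{t-1})
posterior = propagate(prior, grid, toy_conditional, z_t=2.0)
mass = STEP * sum(posterior)                        # should stay close to 1
```

In this toy setting the filtered density keeps two separated modes instead of collapsing to their average, which is the behavior the multimodal representation is meant to preserve.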
Our procedure is also based on kernel PCA but takes into account the structure of the studied visual problem where both inputs and outputs are likely to be low-dimensional and the mapping between them multivalued. The output variables x_i are projected onto the column vectors of the principal space in order to obtain their principal coordinates y_i. A similar procedure is performed on the inputs r_i to obtain z_i.
[Figure 2 diagram: r ∈ R ⊂ R^r is mapped to Φ_r(r) ⊂ F_r and, by kPCA, to z ∈ P(F_r); x ∈ X ⊂ R^x is mapped to Φ_x(x) ⊂ F_x and, by kPCA, to y ∈ P(F_x); p(y|z) links z to y; x ≈ PreImage(y); p(x|r) ≈ p(x|y).]
Figure 2: The learned low-dimensional predictor, kBME, for computing p(x|r) ≡ p(x_t | r_t), ∀t. (We similarly learn p(x_t | x_{t-1}, r_t), with input (x, r) instead of r; here we illustrate only p(x|r) for clarity.) The input r and the output x are decorrelated using Kernel PCA to obtain z and y respectively. The kernels used for the input and output are Φ_r and Φ_x, with induced feature spaces F_r and F_x, respectively. Their principal subspaces obtained by kernel PCA are denoted by P(F_r) and P(F_x), respectively. A conditional Bayesian mixture of experts p(y|z) is learned using the low-dimensional representation (z, y). Using learned local conditionals of the form p(y_t | z_t) or p(y_t | y_{t-1}, z_t), temporal inference can be efficiently performed in a low-dimensional kernel induced state space (see e.g. (1) and fig. 1b). For visualization and error measurement, the filtered density, e.g. p(y_t | Z_t), can be mapped back to p(x_t | R_t) using the pre-image, c.f. (3).
In order to relate the reduced feature spaces of z and y (P(F_r) and P(F_x)), we estimate a probability distribution over mappings from training pairs (z_i, y_i). We use a conditional Bayesian mixture of experts (BME) [7, 5] in order to account for ambiguity when mapping similar, possibly identical reduced feature inputs to very different feature outputs, as common in our problem (fig. 1a).
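To make the idea of a gated mixture of linear experts concrete, here is a minimal sketch for scalar input and output. It only evaluates a conditional density of the form "softmax gate times linear-Gaussian expert"; it does not implement the Bayesian learning procedure (hierarchical priors, double-loop EM) described in the text. The dictionary keys and the two hand-set experts are assumptions chosen to mimic a mirror ambiguity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian(y, mean, std):
    """Density of a 1-D Gaussian N(y | mean, std^2)."""
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def gates(z, experts):
    """Gate probabilities g_j(z): softmax of delta_j * z (scalar case)."""
    return softmax([e["delta"] * z for e in experts])


def mixture_density(y, z, experts):
    """p(y | z) = sum_j g_j(z) * N(y | w_j * z, std_j^2)."""
    return sum(g * gaussian(y, e["w"] * z, e["std"])
               for g, e in zip(gates(z, experts), experts))


# Two hypothetical experts that map the same input to mirrored outputs.
experts = [
    {"delta": 0.0, "w": 1.0, "std": 0.3},
    {"delta": 0.0, "w": -1.0, "std": 0.3},
]

z = 2.0
# The gate-weighted mean of the expert predictions falls between the modes.
mean_prediction = sum(g * e["w"] * z for g, e in zip(gates(z, experts), experts))
density_at_mode = mixture_density(2.0, z, experts)
density_at_mean = mixture_density(mean_prediction, z, experts)
```

With these settings the averaged prediction lands where the conditional density is close to zero, while each expert mean sits on a mode. This is the reason a single unimodal regressor is a poor fit for ambiguous inputs.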
The resulting model is a conditional mixture of low-dimensional kernel-induced experts (kBME):
$p(y \mid z) = \sum_{j=1}^{M} g(z \mid \delta_j)\, N(y \mid W_j z, \Sigma_j)$ (2)
where g(z | δ_j) is a softmax function parameterized by δ_j and (W_j, Σ_j) are the parameters and the output covariance of expert j, here a linear regressor. As in many Bayesian settings [17, 5], the weights of the experts and of the gates, W_j and δ_j, are controlled by hierarchical priors, typically Gaussians with 0 mean, and having inverse variance hyperparameters controlled by a second level of Gamma distributions. We learn this model using a double-loop EM and employ ML-II type approximations [8, 17] with greedy (weight) subset selection [17, 15].
Finally, the kBME algorithm requires the computation of pre-images in order to recover the state distribution x from its image y ∈ P(F_x). This is a closed form computation for polynomial kernels of odd degree. For more general kernels, optimization or learning (regression based) methods are necessary [3]. Following [3, 19], we use a sparse Bayesian kernel regressor to learn the pre-image. This is based on training data (x_i, y_i):
$p(x \mid y) = N(x \mid A\,\Phi_y(y), \Omega)$ (3)
with parameters and covariances (A, Ω). Since temporal inference is performed in the low-dimensional kernel induced state space, the pre-image function needs to be calculated only for visualizing results or for the purpose of error reporting. Propagating the result from the reduced feature space P(F_x) to the output space X produces a Gaussian mixture with M elements, having coefficients g(z | δ_j) and components $N(x \mid A\,\Phi_y(W_j z),\; A J_{\Phi_y} \Sigma_j J_{\Phi_y}^{\top} A^{\top} + \Omega)$, where $J_{\Phi_y}$ is the Jacobian of the mapping Φ_y.
3 Experiments
We run experiments on both real image sequences (fig. 5 and fig. 6) and on sequences where silhouettes were artificially rendered. The prediction error is reported in degrees (for mixture of experts, this is w.r.t. the most probable one, but see also fig. 4a), and normalized per joint angle, per frame. The models are learned using standard cross-validation. Pre-images are learned using kernel regressors and have average error 1.7°.
Training Set and Model State Representation: For training we gather pairs of 3D human poses together with their image projections, here silhouettes, using the graphics package Maya. We use realistically rendered computer graphics human surface models which we animate using human motion capture [1]. Our original human representation (x) is based on articulated skeletons with spherical joints and has 56 skeletal d.o.f. including global translation. The database consists of 8000 samples of human activities including walking, running, turns, jumps, gestures in conversations, quarreling and pantomime.
Image Descriptors: We work with image silhouettes obtained using statistical background subtraction (with foreground and background models). Silhouettes are informative for pose estimation although prone to ambiguities (e.g. the left / right limb assignment in side views) or occasional lack of observability of some of the d.o.f. (e.g. 180° ambiguities in the global azimuthal orientation for frontal views, e.g. fig. 1a). These are multiplied by intrinsic forward / backward monocular ambiguities [16]. As observations r, we use shape contexts extracted on the silhouette [4] (5 radial, 12 angular bins, size range 1/8 to 3 on log scale). The features are computed at different scales and sizes for points sampled on the silhouette. To work in a common coordinate system, we cluster all features in the training set into K = 50 clusters. To compute the representation of a new shape feature (a point on the silhouette), we 'project' onto the common basis by (inverse distance) weighted voting into the cluster centers. To obtain the representation (r) for a new silhouette we regularly sample 200 points on it and add all their feature vectors into a feature histogram.
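The descriptor construction just described (soft voting into cluster centers, then accumulation over sampled points) can be sketched as follows. The text does not specify the distance measure or how votes are normalized, so the Euclidean distance, the per-point normalization to unit mass, and the small `eps` guard are assumptions, as are the function names and the toy 2-D data.

```python
import math


def soft_vote(feature, centers, eps=1e-9):
    """Spread one feature vector over all cluster centers with weights
    proportional to inverse Euclidean distance, normalized to sum to 1.
    eps avoids division by zero when a feature coincides with a center."""
    inverse = [1.0 / (math.dist(feature, c) + eps) for c in centers]
    total = sum(inverse)
    return [w / total for w in inverse]


def silhouette_descriptor(features, centers):
    """Accumulate the soft votes of all sampled points into one histogram
    with one bin per cluster center."""
    histogram = [0.0] * len(centers)
    for feature in features:
        for k, weight in enumerate(soft_vote(feature, centers)):
            histogram[k] += weight
    return histogram


# Toy data: 3 centers and 3 local features in 2-D. The paper's setup uses
# K = 50 centers and 200 sampled points of shape-context features.
centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
features = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.9)]
descriptor = silhouette_descriptor(features, centers)
```

Because every point distributes a unit of mass, the histogram sums to the number of sampled points, and nearby features produce similar descriptors rather than jumping between bins as hard assignment would.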
Because the representation uses overlapping features of the observation, the elements of the descriptor are not independent. However, a conditional temporal framework (fig. 1b) flexibly accommodates this.
For experiments, we use Gaussian kernels for the joint angle feature space and dot product kernels for the observation feature space. We learn state conditionals for p(y_t | z_t) and p(y_t | y_{t-1}, z_t) using 6 dimensions for the joint angle kernel induced state space and 25 dimensions for the observation induced feature space, respectively. In fig. 3b we show an evaluation of the efficacy of our kBME predictor for different dimensions in the joint angle kernel induced state space (the observation feature space dimension is here 50). On the analyzed dancing sequence, which involves complex motions of the arms and the legs, the non-linear model significantly outperforms alternative PCA methods and gives good predictions for compact, low-dimensional models (see footnote 1).
In tables 1 and 2, as well as fig. 4, we perform quantitative experiments on artificially rendered silhouettes. 3D ground truth joint angles are available and this allows a more systematic evaluation. Notice that the kernelized low-dimensional models generally outperform the PCA ones. At the same time, they give results competitive to the ones of high-dimensional BME predictors, while being lower-dimensional and therefore significantly less expensive for inference, e.g. the integral in (1).
Footnote 1: Running times: On a Pentium 4 PC (3 GHz, 2 GB RAM), a full dimensional BME model with 5 experts takes 802s to train p(x_t | x_{t-1}, r_t), whereas a kBME (including the pre-image) takes 95s to train p(y_t | y_{t-1}, z_t). The prediction time is 13.7s for BME and 8.7s (including the pre-image cost 1.04s) for kBME. The integration in (1) takes 2.67s for BME and 0.31s for kBME. The speed-up for kBME is significant and likely to increase with original models having higher dimensionality.
[Figure 3 plots: (a) number of clusters vs. degree of multimodality; (b) prediction error vs. number of dimensions for kBME, KDE_RVM, PCA_BME and PCA_RVM.]
Figure 3: (a, Left) Analysis of 'multimodality' for a training set. The input z_t dimension is 25, the output y_t dimension is 6, both reduced using kPCA. We cluster independently in (y_{t-1}, z_t) and y_t using many clusters (2100) to simulate small input perturbations and we histogram the y_t clusters falling within each cluster in (y_{t-1}, z_t). This gives intuition on the degree of ambiguity in modeling p(y_t | y_{t-1}, z_t), for small perturbations in the input. (b, Right) Evaluation of dimensionality reduction methods for an artificial dancing sequence (models trained on 300 samples). The kBME is our model (§2.2), whereas the KDE-RVM is a KDE model learned with a Relevance Vector Machine (RVM) [17] feature space map. PCA-BME and PCA-RVM are models where the mapping between feature spaces (obtained using PCA) is learned using a BME and an RVM. The non-linearity is significant. Kernel-based methods outperform PCA and give low prediction error for 5-6d models.
In fig. 5 and fig. 6 we show human motion reconstruction results for two real image sequences. Fig. 5 shows the good quality reconstruction of a person performing an agile jump. (Given the missing observations in a side view, 3D inference for the occluded body parts would not be possible without using prior knowledge!) For this sequence we do inference using conditionals having 5 modes and reduced 6d states. We initialize tracking using p(y_t | z_t), whereas for inference we use p(y_t | y_{t-1}, z_t) within (1). In the second sequence in fig. 6, we simultaneously reconstruct the motion of two people mimicking domestic activities, namely washing a window and picking an object. Here we do inference over a product, 12-dimensional state space consisting of the joint 6d state of each person. We obtain good 3D reconstruction results, using only 5 hypotheses. Notice however that the results are not perfect: there are small errors in the elbow and the bending of the knee for the subject at the l.h.s., and in the different wrist orientations for the subject at the r.h.s. This reflects the bias of our training set.
Sequence            KDE-RR   RVM    KDE-RVM   BME    kBME
Walk and turn       10.46    4.95   7.57      4.27   4.69
Conversation         7.95    4.96   6.31      4.15   4.79
Run and turn left    5.22    5.02   6.25      5.01   4.92
Table 1: Comparison of average joint angle prediction error for different models. All kPCA-based models use 6 output dimensions. Testing is done on 100 video frames for each sequence; the inputs are artificially generated silhouettes, not in the training set. 3D joint angle ground truth is used for evaluation. KDE-RR is a KDE model with ridge regression (RR) for the feature space mapping, KDE-RVM uses an RVM. BME uses a Bayesian mixture of experts with no dimensionality reduction. kBME is our proposed model. kPCA-based methods use kernel regressors to compute pre-images.
[Figure 4 plots: (a) frequency of being closest to ground truth vs. expert number; (b) frequency vs. current expert, grouped by the 1st to 5th most probable previous output.]
Figure 4: (a, Left) Histogram showing the accuracy of various expert predictors: how many times the expert ranked as the k-th most probable by the model (horizontal axis) is closest to the ground truth. The model is consistent (the most probable expert indeed is the most accurate most frequently), but occasionally less probable experts are better. (b, Right) Histograms show the dynamics of p(y_t | y_{t-1}, z_t), i.e. how the probability mass is redistributed among experts between two successive time steps, in a conversation sequence.
Sequence             KDE-RR   RVM    KDE-RVM   BME   kBME
Walk and turn back   7.59     6.9    7.15      3.6   3.72
Run and turn         17.7     16.8   16.08     8.2   8.01
Table 2: Joint angle prediction error computed for two complex sequences with walks, runs and turns, thus more ambiguity (100 frames). Models have 6 state dimensions. Unimodal predictors average competing solutions. kBME has significantly lower error.
Figure 5: Reconstruction of a jump (selected frames). Top: original image sequence. Middle: extracted silhouettes. Bottom: 3D reconstruction seen from a synthetic viewpoint.
Figure 6: Reconstructing the activities of 2 people operating in a 12-d state space (each person has its own 6d state). Top: original image sequence. Bottom: 3D reconstruction seen from a synthetic viewpoint.
4 Conclusion
We have presented a probabilistic framework for conditional inference in latent kernel-induced low-dimensional state spaces. Our approach has the following properties: (a) Accounts for non-linear correlations among input or output variables, by using kernel nonlinear dimensionality reduction (kPCA); (b) Learns probability distributions over mappings between low-dimensional state spaces using conditional Bayesian mixture of experts, as required for accurate prediction. In the resulting low-dimensional kBME predictor, ambiguities and multiple solutions common in visual, inverse perception problems are accurately represented. (c) Works in a continuous, conditional temporal probabilistic setting and offers a formal management of uncertainty. We show comparisons that demonstrate how the proposed approach outperforms regression, PCA or KDE alone for reconstructing the 3D human motion in monocular video. In future work we will investigate scaling aspects for large training sets and alternative structured prediction methods.
References
[1] CMU Human Motion DataBase. Online at http://mocap.cs.cmu.edu/search.html, 2003.
[2] A. Agarwal and B. Triggs. 3d human pose from silhouettes by Relevance Vector Regression. In CVPR, 2004.
[3] G. Bakir, J. Weston, and B. Schölkopf. Learning to find pre-images. In NIPS, 2004.
[4] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24, 2002.
[5] C. Bishop and M. Svensen. Bayesian mixtures of experts. In UAI, 2003.
[6] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In CVPR, 2000.
[7] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, (6):181–214, 1994.
[8] D. Mackay. Bayesian interpolation. Neural Computation, 4(5):720–736, 1992.
[9] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, 2000.
[10] R. Rosales and S. Sclaroff. Learning Body Pose Via Specialized Maps. In NIPS, 2002.
[11] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[12] G. Shakhnarovich, P. Viola, and T. Darrell. Fast Pose Estimation with Parameter Sensitive Hashing. In ICCV, 2003.
[13] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard. Tracking Loose-limbed People. In CVPR, 2004.
[14] C. Sminchisescu and A. Jepson. Generative Modeling for Continuous Non-Linearly Embedded Visual Inference. In ICML, pages 759–766, Banff, 2004.
[15] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative Density Propagation for 3D Human Motion Estimation. In CVPR, 2005.
[16] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D Human Tracking. In CVPR, volume 1, pages 69–76, Madison, 2003.
[17] M. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. JMLR, 2001.
[18] C. Tomasi, S. Petrov, and A. Sastry. 3d tracking = classification + interpolation. In ICCV, 2003.
[19] J. Weston, O. Chapelle, A. Elisseeff, B. Schölkopf, and V. Vapnik. Kernel dependency estimation. In NIPS, 2002.
6 0.3073051 102 nips-2005-Kernelized Infomax Clustering
7 0.30172953 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts
8 0.30086306 48 nips-2005-Context as Filtering
9 0.30085358 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
10 0.30050865 144 nips-2005-Off-policy Learning with Options and Recognizers
11 0.30047446 179 nips-2005-Sparse Gaussian Processes using Pseudo-inputs
12 0.30023426 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails
13 0.29948834 30 nips-2005-Assessing Approximations for Gaussian Process Classification
14 0.29751787 74 nips-2005-Faster Rates in Regression via Active Learning
15 0.29634941 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
16 0.29531044 133 nips-2005-Nested sampling for Potts models
17 0.29508141 171 nips-2005-Searching for Character Models
18 0.29462975 23 nips-2005-An Application of Markov Random Fields to Range Sensing
19 0.29454079 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity
20 0.29347134 130 nips-2005-Modeling Neuronal Interactivity using Dynamic Bayesian Networks