nips nips2000 nips2000-78 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John W. Fisher III, Trevor Darrell, William T. Freeman, Paul A. Viola
Abstract: People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a low level, faces severe challenges, including the lack of accurate statistical models for the signals, and their high dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a non-parametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.
Reference: text
sentIndex sentText sentNum sentScore
1 People can understand complex auditory and visual information, often using one to disambiguate the other. [sent-10, score-0.106]
2 Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. [sent-12, score-0.303]
3 We learn the joint distribution of the visual and auditory signals using a non-parametric approach. [sent-13, score-0.244]
4 We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. [sent-15, score-0.219]
5 We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video. [sent-17, score-1.371]
6 Humans face complex perception tasks in which ambiguous auditory and visual information must be combined in order to support accurate perception. [sent-19, score-0.17]
7 Simplifying assumptions about the joint measurement statistics are often made in order to yield tractable analytic forms. [sent-22, score-0.128]
8 For example Hershey and Movellan have shown that correlations between video data and audio can be used to highlight regions of the image which are the "cause" of the audio signal. [sent-23, score-1.58]
9 Furthermore, these assumptions may not be appropriate for fusing modalities such as video and audio. [sent-28, score-0.421]
10 The joint statistics for these and many other mixed modal signals are not well understood and are not well-modeled by simple densities such as multi-variate exponential distributions. [sent-29, score-0.278]
11 A critical question is whether, in the absence of an adequate parametric model for the joint measurement statistics, one can integrate measurements in a principled way without discounting statistical uncertainty. [sent-31, score-0.206]
12 In the nonparametric statistical framework, principles such as MAP and ML are equivalent to the information-theoretic concepts of mutual information and entropy. [sent-33, score-0.205]
13 Consequently we suggest an approach for learning maximally informative joint subspaces for multi-media signal analysis. [sent-34, score-0.366]
14 The technique is a natural application of [8, 3, 5, 4], which formulate a learning approach by which the entropy, and by extension the mutual information, of a differentiable map may be optimized. [sent-35, score-0.113]
15 In the experiments we are able to show significant audio signal enhancement and video source localization. [sent-37, score-1.287]
16 Similarly, mutual information quantifies the information (uncertainty reduction) that two random variables convey about each other. [sent-40, score-0.136]
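For reference, mutual information can be written in terms of entropies as I(X; Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X | Y); this is why an estimator of entropy extends directly to an estimator of mutual information, as noted above for [8, 3, 5, 4].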
17 1 Maximally Informative Subspaces. In order to make the problem tractable, we project high-dimensional audio and video measurements to low-dimensional subspaces. [sent-43, score-1.069]
18 The parameters of the sub-space are not chosen in an ad hoc fashion, but are learned by maximizing the mutual information between the derived features. [sent-44, score-0.172]
19 Specifically, let v_i ~ V ∈ R^{N_v} and a_i ~ A ∈ R^{N_a} be video and audio measurements, respectively, taken at time i. [sent-45, score-0.962]
20 During adaptation the parameter vectors α_v and α_a (the perceptron weights) are chosen such that {α_v, α_a} = arg max I(f_v(v_i; α_v), f_a(a_i; α_a)), (1) where f_v and f_a denote the learned projections; that is, the mutual information between the projected features is maximized. This process is illustrated notionally in figure 1, in which video frames and sequences of periodogram coefficients are projected to scalar values. [sent-49, score-0.62]
21 A clear advantage of learning a projection is that, rather than requiring pixels of the video frames or spectral coefficients to be inspected individually, the projection summarizes the entire set efficiently into two scalar values (one for video and one for audio). [sent-50, score-1.189]
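As a concrete illustration of the subspace learning described above, the sketch below projects video features and periodogram frames to scalars and searches for projection weights that maximize a plug-in mutual information estimate. The paper's own procedure uses a differentiable nonparametric (Parzen) estimator optimized by gradient ascent [8, 3]; here a simpler kernel density estimate and a derivative-free optimizer are substituted, and all function and variable names are illustrative rather than taken from the paper.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import gaussian_kde

    def mutual_information(x, y, n_grid=32):
        """Plug-in estimate of I(x; y) for two 1-D projections via Gaussian KDEs."""
        joint = gaussian_kde(np.vstack([x, y]))
        px, py = gaussian_kde(x), gaussian_kde(y)
        gx = np.linspace(x.min(), x.max(), n_grid)
        gy = np.linspace(y.min(), y.max(), n_grid)
        GX, GY = np.meshgrid(gx, gy)
        pj = joint(np.vstack([GX.ravel(), GY.ravel()])).reshape(n_grid, n_grid)
        pm = np.outer(py(gy), px(gx))                  # product of marginals on the grid
        dA = (gx[1] - gx[0]) * (gy[1] - gy[0])
        mask = (pj > 1e-12) & (pm > 1e-12)
        return np.sum(pj[mask] * np.log(pj[mask] / pm[mask])) * dA

    def learn_projections(video_feats, audio_feats, seed=0):
        """Search for alpha_v, alpha_a maximizing MI between the scalar projections (cf. eq. (1))."""
        rng = np.random.default_rng(seed)
        Nv, Na = video_feats.shape[1], audio_feats.shape[1]
        def neg_mi(params):
            fv = video_feats @ params[:Nv]
            fa = audio_feats @ params[Nv:]
            fv = (fv - fv.mean()) / (fv.std() + 1e-9)  # keep the KDE bandwidth sensible
            fa = (fa - fa.mean()) / (fa.std() + 1e-9)
            return -mutual_information(fv, fa)
        res = minimize(neg_mi, rng.standard_normal(Nv + Na), method="Nelder-Mead",
                       options={"maxiter": 200})
        return res.x[:Nv], res.x[Nv:]

In practice one would call learn_projections(V, A) with V holding one flattened video frame (or flow field) per row and A holding the corresponding 513 periodogram coefficients per row; a derivative-free search only makes sense at toy dimensionalities, which is one reason a gradient-based, differentiable estimator is used in the work cited above.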
22 We have little reason to believe that joint audio/video measurements are accurately characterized by simple parametric models (e. [sent-51, score-0.171]
23 Moreover, low dimensional projections which do not preserve this complex structure will not capture the true form of the relationship (i. [sent-54, score-0.103]
24 The low dimensional projections which are learned by maximizing mutual information reduce the complexity of the joint distribution, but still preserve the important and potentially complex relationships between audio and visual signals. [sent-57, score-0.971]
25 [Figure 1: Fusion: the video sequence and audio sequence are each projected to the learned subspace.] This possibility motivates the methodology of [3, 8] in which the density in the joint subspace is modeled nonparametrically. [sent-58, score-1.931]
26 There are a variety of ways the subspace and the associated joint density might be used to, for example, manipulate one of the disparate signals based on another. [sent-60, score-0.321]
27 In these experiments, two sub-space mappings are learned, one from video and another from audio. [sent-64, score-0.421]
28 In all cases, video data is sampled at 30 frames/second. [sent-65, score-0.421]
29 We use both pixel based representations (raw pixel data) and motion based representations (i. [sent-66, score-0.182]
30 Anandan's optical flow algorithm [1] is a coarse-to-fine method, implemented on a Laplacian pyramid, based on minimizing the sum of squared differences between frames. [sent-69, score-0.252]
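A drastically simplified, single-scale sketch of the SSD matching criterion that algorithm is built on; the actual method works coarse-to-fine on a Laplacian pyramid, which this toy block matcher does not attempt. The function name and block/search sizes are illustrative.

    import numpy as np

    def ssd_block_flow(prev, curr, block=8, search=4):
        """Estimate a coarse flow field by minimizing the sum of squared
        differences (SSD) between blocks of consecutive frames."""
        prev = prev.astype(float)
        curr = curr.astype(float)
        H, W = prev.shape
        flow = np.zeros((H // block, W // block, 2))   # (dx, dy) per block
        for by in range(H // block):
            for bx in range(W // block):
                y0, x0 = by * block, bx * block
                ref = prev[y0:y0 + block, x0:x0 + block]
                best, best_uv = np.inf, (0.0, 0.0)
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        y1, x1 = y0 + dy, x0 + dx
                        if 0 <= y1 and y1 + block <= H and 0 <= x1 and x1 + block <= W:
                            cand = curr[y1:y1 + block, x1:x1 + block]
                            ssd = np.sum((ref - cand) ** 2)
                            if ssd < best:
                                best, best_uv = ssd, (dx, dy)
                flow[by, bx] = best_uv
        return flow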
31 When raw video is used as an input to the subspace mapper, the pixels are collected into a single vector. [sent-72, score-0.668]
32 The raw video images range in resolution from 240 by 180 (i. [sent-73, score-0.539]
33 When optical flow is used as an input to the sub-space mapper, vector valued flow for each pixel is collected into a single vector, yielding an input vector with twice as many dimensions as pixels. [sent-79, score-0.543]
34 4 ms duration sampled at 30 Hz (commensurate with the video rate). [sent-84, score-0.421]
35 At each point in time there are 513 periodogram coefficients input to the sub-space mapper. [sent-85, score-0.151]
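A sketch of this audio front end: one periodogram per video frame. The audio sampling rate and FFT length below are assumptions chosen so that an rfft yields 513 coefficients, matching the dimensionality quoted here; they are not values stated in this excerpt.

    import numpy as np

    def periodogram_frames(audio, fs=16000, frame_rate=30.0, nfft=1024):
        """Compute one windowed periodogram every 1/frame_rate seconds;
        nfft = 1024 gives 1024 // 2 + 1 = 513 coefficients per frame."""
        hop = int(round(fs / frame_rate))              # one analysis frame per video frame
        window = np.hanning(nfft)
        frames = []
        for start in range(0, len(audio) - nfft + 1, hop):
            seg = audio[start:start + nfft] * window
            frames.append(np.abs(np.fft.rfft(seg)) ** 2 / nfft)
        return np.array(frames)                        # shape: (n_frames, 513)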
36 Mouth parameters are functionally related to one audio signal. [sent-87, score-0.59]
37 The goal of the experiment is to use a video sequence to enhance an associated audio sequence. [sent-91, score-1.133]
38 Figure 2 shows examples from a synthetically generated image sequence of faces (and the associated optical flow field). [sent-92, score-0.479]
39 The parameters of the ellipse are functionally related to a recorded audio signal. [sent-94, score-0.695]
40 Specifically, the area of the ellipse is proportional to the average power of the audio signal (computed over the same periodogram window) while the eccentricity is controlled by the entropy of the normalized periodogram. [sent-95, score-0.897]
41 Consequently, observed changes in the image sequence are functionally related to the recorded audio signal. [sent-96, score-0.789]
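A sketch of that synthetic mapping from audio to "mouth" ellipse parameters; the scale constants and the squashing of the entropy into a valid eccentricity range are illustrative choices, not taken from the paper.

    import numpy as np

    def ellipse_params_from_audio(periodograms, base_area=400.0, area_gain=2.0):
        """Map each periodogram frame to (area, eccentricity): area tracks average
        power, eccentricity tracks the entropy of the normalized periodogram."""
        power = periodograms.mean(axis=1)
        p = periodograms / (periodograms.sum(axis=1, keepdims=True) + 1e-12)
        entropy = -(p * np.log(p + 1e-12)).sum(axis=1)
        area = base_area + area_gain * power
        ecc = 0.95 * (entropy - entropy.min()) / (np.ptp(entropy) + 1e-12)
        return area, ecc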
42 The associated audio signal is mixed with an interfering, or noise, signal. [sent-98, score-0.782]
43 If the power spectra of the associated and interfering signals were known, then the optimal filter for recovering the associated audio sequence would be the Wiener filter. [sent-100, score-1.124]
44 Its spectrum is described by H(f) = P_a(f) / (P_a(f) + P_n(f)) (2) where P_a(f) is the power spectrum of the desired signal and P_n(f) is the power spectrum of the interfering signal. [sent-101, score-0.5]
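In the synthetic setup the two spectra are available, so the Wiener gain of eq. (2) can be computed and applied directly; the short-time analysis parameters in the sketch below are illustrative.

    import numpy as np

    def wiener_gain(P_signal, P_noise):
        """Frequency-domain Wiener gain H(f) = P_a(f) / (P_a(f) + P_n(f))."""
        return P_signal / (P_signal + P_noise + 1e-12)

    def apply_spectral_gain(audio, gain, nfft=1024, hop=512):
        """Apply a fixed spectral gain (length nfft // 2 + 1) by overlap-add."""
        window = np.hanning(nfft)
        out = np.zeros(len(audio))
        norm = np.zeros(len(audio))
        for start in range(0, len(audio) - nfft + 1, hop):
            spec = np.fft.rfft(audio[start:start + nfft] * window)
            out[start:start + nfft] += np.fft.irfft(spec * gain, n=nfft) * window
            norm[start:start + nfft] += window ** 2
        return out / np.maximum(norm, 1e-12)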
45 Furthermore, suppose y = Sa + n where Sa is the signal of interest and n is an independent interference signal. [sent-104, score-0.138]
46 Then σ_sa^2 / σ_n^2 = ρ^2 / (1 - ρ^2) (3) where ρ is the correlation coefficient between s_a and the corrupted version y and σ_sa^2 / σ_n^2 is the signal-to-noise power ratio (SNR). [sent-107, score-0.224]
47 Consequently, given a reference signal and some signal plus interferer, we can use the relationships above to gauge signal enhancement. [sent-108, score-0.528]
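A small helper illustrating that gauge: with a clean reference in hand, the SNR of a mixture or of an enhanced output follows from the correlation coefficient via the relation in eq. (3), which itself follows from y = s_a + n with independent n. The function name is illustrative.

    import numpy as np

    def snr_from_correlation(reference, observed):
        """Estimate the SNR (dB) of observed = reference + independent noise
        from the correlation coefficient rho, via SNR = rho^2 / (1 - rho^2)."""
        rho = np.corrcoef(reference, observed)[0, 1]
        return 10.0 * np.log10(rho ** 2 / max(1.0 - rho ** 2, 1e-12))

    # Processing gain of an enhanced signal over the raw mixture:
    # gain_db = snr_from_correlation(clean, enhanced) - snr_from_correlation(clean, mixture)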
48 The question we address is this: in the absence of the separate power spectra, which are necessary to implement the Wiener filter, how well can we do using the associated video data instead? [sent-109, score-0.533]
49 It is not immediately obvious how one might achieve signal enhancement by learning a joint subspace in the manner described. [sent-110, score-0.43]
50 For this simple case it is only the associated audio signal which bears any relationship to the video sequence. [sent-112, score-1.157]
51 Furthermore, the coefficients of the audio projection, α_a, correspond to spectral coefficients. [sent-113, score-0.687]
52 Our reasoning is that large-magnitude coefficients correspond to those spectral components which have more signal component than those with small magnitude. [sent-114, score-0.321]
53 Using this reasoning we can construct a filter whose coefficients are proportional to our projection α_a. [sent-115, score-0.298]
54 Specifically, we use the following to design our filter: H_MI(f) = β ( |α_a(f)| - min(|α_a|) ) / ( max(|α_a|) - min(|α_a|) ) + (1 - β). [sent-116, score-0.3]
55 Solid line indicates the desired audio component while the dashed line indicates the interference. [sent-123, score-0.541]
56 where α_a(f) are the audio projection coefficients associated with spectral coefficient f. [sent-124, score-0.828]
57 While somewhat ad hoc, the filter is consistent with our reasoning above and, as we shall see, yields good results. [sent-129, score-0.18]
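The filter itself reduces to a few lines once the audio projection α_a has been learned; β is a free mixing parameter whose value is not given in this excerpt, so the default below is only illustrative. The resulting gain can be applied with the same overlap-add routine sketched after eq. (2).

    import numpy as np

    def mi_filter_gain(alpha_a, beta=0.7):
        """Spectral gain built from the learned audio projection coefficients:
        H_MI(f) = beta * (|a(f)| - min|a|) / (max|a| - min|a|) + (1 - beta)."""
        mag = np.abs(alpha_a)
        scaled = (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)
        return beta * scaled + (1.0 - beta)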
58 Furthermore, because the signal and interferer are known (in our experimental setup), we can compare our results to the unachievable, yet optimal, Wiener filter for this case. [sent-130, score-0.294]
59 In this case the SNR was 0 dB; furthermore, as the two signals have significant spectral overlap, signal recovery is challenging. [sent-131, score-0.299]
60 The optimal Wiener filter achieves a signal processing gain of 2. [sent-132, score-0.246]
61 6 dB while the filter constructed as described achieves 2. [sent-133, score-0.108]
62 2 Video Attribution of Single Audio Source. The previous example demonstrated that the audio projection coefficients could be used to reduce an interfering signal. [sent-137, score-0.814]
63 Figure 4(a) shows a video frame from the sequence used in the next experiment. [sent-139, score-0.479]
64 There is a single audio signal source (of the speaker) but several interfering motion fields in the video sequence. [sent-141, score-1.256]
65 Figure 4(b) shows the pixel-wise standard deviations of the video sequence while figure 4(c) shows the pixel-wise flow field energy. [sent-142, score-0.633]
66 Note that the most intense changes in the image sequence are associated with the monitor and not the speaker. [sent-144, score-0.229]
67 Our goal with this experiment is to show that via the method described we can properly attribute the region of the video image which is associated with the audio sequence. [sent-145, score-1.096]
68 We expect that large image projection coefficients, α_v, correspond to those pixels which are related to the audio signal. [sent-147, score-0.735]
69 Figure 4(d) shows the image of α_v when images are fed directly into the algorithm while figure 4(e) shows the same image when flow fields are the input. [sent-148, score-0.191]
70 Clearly both cases have detected regions associated with the speaker with the substantive difference being that the use of flow fields resulted in a smoother attribution. [sent-149, score-0.399]
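For the pixel-based case this attribution amounts to reshaping the magnitude of the learned video projection into an image; for flow-field input one would first combine the x and y components per pixel. A minimal sketch, with an illustrative function name:

    import numpy as np

    def attribution_map(alpha_v, height, width):
        """Reshape |alpha_v| into an image, normalized to [0, 1], so that pixels
        carrying large projection weight (those most informative about the
        audio) appear bright, as in figures 4(d) and 4(e)."""
        mag = np.abs(np.asarray(alpha_v)).reshape(height, width)
        return (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)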
71 In this case there are two speakers recorded with a single microphone (the speakers were recorded with stereo microphones so as to obtain a reference, but the experiments used a single mixed audio source). [sent-153, score-1.043]
72 Figure 5(a) shows an example frame from the video sequence. [sent-154, score-0.421]
73 We now demonstrate the ability to enhance the audio signal in a user-assisted fashion. [sent-155, score-0.735]
74 As the original data was collected with stereo microphones, we can again compare our result to an approximation to the Wiener filter (neglecting cross-channel leakage). [sent-157, score-0.227]
75 In this case, because the speakers are male and female, the signals have better spectral separation. [sent-158, score-0.323]
76 Consequently the Wiener filter achieves a better signal processing gain. [sent-159, score-0.246]
77 For the male speaker the Wiener filter improves the SNR by 10. [sent-160, score-0.393]
78 43 dB, while for the female speaker the improvement is 10. [sent-161, score-0.276]
79 2 dB SNR gain (optic flow based) for the male speaker while for the female speaker we achieve 5. [sent-165, score-0.715]
80 It is not clear why performance is not as good for the female speaker, but figures 5(b) and (c) are provided by way of partial explanation. [sent-168, score-0.123]
81 Having recovered the audio in the user-assisted fashion described, we used the recovered audio signal for video attribution (pixel-based) of the entire scene. [sent-169, score-1.803]
82 Figures 5(b) and (c) are the images of the resulting α_v when using the male (b) and female (c) recovered voice signals. [sent-170, score-0.311]
83 The attribution of the male speaker in (b) appears to be clearer than that of (c). [sent-171, score-0.381]
84 This may be an indication that the video cues were not as detectable for the female speaker as they were for the male in this experiment. [sent-172, score-0.794]
85 In any event these results are consistent with the enhancement results described above. [sent-173, score-0.112]
86 Recent commercial advances in speech recognition rely on careful placement of the microphone so that background sounds are minimized. [sent-176, score-0.148]
87 Results in more natural environments, where the microphone is some distance from the speaker and there is significant background noise, are disappointing. [sent-177, score-0.274]
88 Our approach may prove useful for teleconferencing, where audio and video of multiple speakers are recorded simultaneously. [sent-178, score-1.091]
89 Other applications include broadcast television in situations where careful microphone placement is not possible, or where post-hoc processing to enhance the audio channel might prove valuable. [sent-179, score-0.745]
90 For example, if one speaker's microphone at a news conference malfunctions, the voice of that speaker might be enhanced with the aid of video information. [sent-180, score-0.817]
91 5 Conclusions. One key contribution of this paper is to extend the notion of multi-media fusion to complex domains in which the statistical relationships between audio and video are complex and non-Gaussian. [sent-181, score-1.171]
92 This claim is supported in part by the results of Slaney and Covell, in which canonical correlations failed to detect audio/video synchrony when a spectral representation was used for the audio signal [7]. [sent-182, score-0.793]
93 Previous approaches have attempted to model these relationships using simple models such as measuring the short term correlation between pixel values and the sound signal [6]. [sent-183, score-0.295]
94 The power of the non-parametric mutual information approach allows our technique to handle complex non-linear relationships between audio and video signals. [sent-184, score-1.226]
95 Experiments were performed using raw pixel intensities as well as optical flow (which is a complex non-linear function of pixel values across time), yielding similar results. [sent-186, score-0.4]
96 Another key contribution is to establish an important application for this approach, video enhanced audio segmentation. [sent-187, score-0.997]
97 Initial experiments have shown that information from the video signal can be used to reduce the noise in a simultaneously recorded audio signal. [sent-188, score-1.266]
98 Noise is reduced without any a priori information about the form of the audio signal or noise. [sent-189, score-0.711]
99 Surprisingly, in our limited experiments, the noise reduction approaches what is possible using a priori knowledge of the audio signal (using Wiener filtering). [sent-190, score-0.71]
100 Facesync: A linear operator for measuring synchronization of video facial images and audio tracks. [sent-247, score-0.999]
wordName wordTfidf (topN-words)
[('audio', 0.541), ('video', 0.421), ('speaker', 0.188), ('wiener', 0.165), ('flow', 0.154), ('signal', 0.138), ('interfering', 0.12), ('enhancement', 0.112), ('filter', 0.108), ('db', 0.102), ('optical', 0.098), ('male', 0.097), ('attribution', 0.096), ('lexa', 0.096), ('joint', 0.093), ('pixel', 0.091), ('female', 0.088), ('subspace', 0.087), ('microphone', 0.086), ('projection', 0.084), ('signals', 0.084), ('periodogram', 0.082), ('raw', 0.081), ('synthetic', 0.077), ('image', 0.077), ('snr', 0.077), ('spectral', 0.077), ('tv', 0.075), ('mutual', 0.072), ('fisher', 0.07), ('nonparametric', 0.069), ('coefficients', 0.069), ('relationships', 0.066), ('fusion', 0.065), ('speakers', 0.065), ('recorded', 0.064), ('informative', 0.061), ('sequence', 0.058), ('associated', 0.057), ('viola', 0.056), ('voice', 0.056), ('enhance', 0.056), ('densities', 0.055), ('power', 0.055), ('consequently', 0.053), ('functionally', 0.049), ('eta', 0.048), ('interferer', 0.048), ('mapper', 0.048), ('methodology', 0.048), ('mixed', 0.046), ('sa', 0.046), ('collected', 0.046), ('spectra', 0.046), ('measurements', 0.045), ('spectrum', 0.044), ('mller', 0.041), ('nv', 0.041), ('ellipse', 0.041), ('hershey', 0.041), ('trevor', 0.041), ('differentiable', 0.041), ('stereo', 0.041), ('entropy', 0.04), ('complex', 0.039), ('maximally', 0.039), ('experiments', 0.039), ('reasoning', 0.037), ('monitor', 0.037), ('slaney', 0.037), ('synchrony', 0.037), ('images', 0.037), ('auditory', 0.036), ('source', 0.036), ('figures', 0.035), ('measurement', 0.035), ('hoc', 0.035), ('enhanced', 0.035), ('faces', 0.035), ('subspaces', 0.035), ('pa', 0.034), ('learned', 0.033), ('pixels', 0.033), ('parametric', 0.033), ('recovered', 0.033), ('projections', 0.033), ('mouth', 0.032), ('automated', 0.032), ('microphones', 0.032), ('ma', 0.032), ('information', 0.032), ('face', 0.032), ('massachusetts', 0.032), ('noise', 0.031), ('conference', 0.031), ('dimensional', 0.031), ('visual', 0.031), ('person', 0.031), ('careful', 0.031), ('placement', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation
Author: John W. Fisher III, Trevor Darrell, William T. Freeman, Paul A. Viola
Abstract: People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a lowlevel, faces severe challenges, including the lack of accurate statistical models for the signals, and their high-dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a non-parametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.
2 0.49134186 50 nips-2000-FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks
Author: Malcolm Slaney, Michele Covell
Abstract: FaceSync is an optimal linear algorithm that finds the degree of synchronization between the audio and image recordings of a human speaker. Using canonical correlation, it finds the best direction to combine all the audio and image data, projecting them onto a single axis. FaceSync uses Pearson's correlation to measure the degree of synchronization between the audio and image data. We derive the optimal linear transform to combine the audio and visual information and describe an implementation that avoids the numerical problems caused by computing the correlation matrices. 1 Motivation In many applications, we want to know about the synchronization between an audio signal and the corresponding image data. In a teleconferencing system, we might want to know which of the several people imaged by a camera is heard by the microphones; then, we can direct the camera to the speaker. In post-production for a film, clean audio dialog is often dubbed over the video; we want to adjust the audio signal so that the lip-sync is perfect. When analyzing a film, we want to know when the person talking is in the shot, instead of off camera. When evaluating the quality of dubbed films, we can measure how well the translated words and audio fit the actor's face. This paper describes an algorithm, FaceSync, that measures the degree of synchronization between the video image of a face and the associated audio signal. We can do this task by synthesizing the talking face, using techniques such as Video Rewrite [1], and then comparing the synthesized video with the test video. That process, however, is expensive. Our solution finds a linear operator that, when applied to the audio and video signals, generates an audio-video-synchronization-error signal. The linear operator gathers information from throughout the image and thus allows us to do the computation inexpensively. Hershey and Movellan [2] describe an approach based on measuring the mutual information between the audio signal and individual pixels in the video. The correlation between the audio signal, x, and one pixel in the image y, is given by Pearson's correlation, r. The mutual information between these two variables is given by I(x, y) = -(1/2) log(1 - r^2). They create movies that show the regions of the video that have high correlation with the audio; 1. Currently at IBM Almaden Research, 650 Harry Road, San Jose, CA 95120. 2. Currently at Yes Video. com, 2192 Fortune Drive, San Jose, CA 95131.
3 0.22266586 83 nips-2000-Machine Learning for Video-Based Rendering
Author: Arno Schödl, Irfan A. Essa
Abstract: We present techniques for rendering and animation of realistic scenes by analyzing and training on short video sequences. This work extends the new paradigm for computer animation, video textures, which uses recorded video to generate novel animations by replaying the video samples in a new order. Here we concentrate on video sprites, which are a special type of video texture. In video sprites, instead of storing whole images, the object of interest is separated from the background and the video samples are stored as a sequence of alpha-matted sprites with associated velocity information. They can be rendered anywhere on the screen to create a novel animation of the object. We present methods to create such animations by finding a sequence of sprite samples that is both visually smooth and follows a desired path. To estimate visual smoothness, we train a linear classifier to estimate visual similarity between video samples. If the motion path is known in advance, we use beam search to find a good sample sequence. We can specify the motion interactively by precomputing the sequence cost function using Q-Iearning.
4 0.17210715 96 nips-2000-One Microphone Source Separation
Author: Sam T. Roweis
Abstract: Source separation, or computational auditory scene analysis , attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as lCA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting (
5 0.16586778 103 nips-2000-Probabilistic Semantic Video Indexing
Author: Milind R. Naphade, Igor Kozintsev, Thomas S. Huang
Abstract: We propose a novel probabilistic framework for semantic video indexing. We define probabilistic multimedia objects (multijects) to map low-level media features to high-level semantic labels. A graphical network of such multijects (multinet) captures scene context by discovering intra-frame as well as inter-frame dependency relations between the concepts. The main contribution is a novel application of a factor graph framework to model this network. We model relations between semantic concepts in terms of their co-occurrence as well as the temporal dependencies between these concepts within video shots. Using the sum-product algorithm [1] for approximate or exact inference in these factor graph multinets, we attempt to correct errors made during isolated concept detection by forcing high-level constraints. This results in a significant improvement in the overall detection performance. 1
6 0.13766128 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
7 0.13344018 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images
8 0.12397232 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
9 0.1102953 30 nips-2000-Bayesian Video Shot Segmentation
10 0.1096919 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition
11 0.099072911 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System
12 0.093233928 82 nips-2000-Learning and Tracking Cyclic Human Motion
13 0.090062395 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
14 0.088558897 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates
15 0.080876447 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech
16 0.076961145 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
17 0.069899179 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding
18 0.067573145 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes
19 0.065811589 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code
20 0.063711986 101 nips-2000-Place Cells and Spatial Navigation Based on 2D Visual Feature Extraction, Path Integration, and Reinforcement Learning
topicId topicWeight
[(0, 0.226), (1, -0.188), (2, 0.182), (3, 0.328), (4, -0.207), (5, -0.207), (6, 0.184), (7, 0.131), (8, -0.324), (9, -0.128), (10, -0.033), (11, -0.147), (12, -0.195), (13, -0.004), (14, 0.212), (15, 0.082), (16, 0.026), (17, -0.07), (18, 0.032), (19, 0.025), (20, -0.031), (21, 0.028), (22, 0.02), (23, 0.1), (24, 0.118), (25, -0.03), (26, 0.008), (27, 0.084), (28, -0.023), (29, 0.045), (30, -0.038), (31, 0.001), (32, -0.0), (33, 0.061), (34, 0.072), (35, -0.041), (36, -0.016), (37, -0.046), (38, 0.101), (39, -0.041), (40, 0.008), (41, 0.105), (42, 0.064), (43, -0.107), (44, 0.043), (45, 0.025), (46, -0.001), (47, 0.005), (48, 0.076), (49, -0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.98102695 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation
Author: John W. Fisher III, Trevor Darrell, William T. Freeman, Paul A. Viola
Abstract: People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a lowlevel, faces severe challenges, including the lack of accurate statistical models for the signals, and their high-dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a non-parametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.
2 0.96466696 50 nips-2000-FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks
Author: Malcolm Slaney, Michele Covell
Abstract: FaceSync is an optimal linear algorithm that finds the degree of synchronization between the audio and image recordings of a human speaker. Using canonical correlation, it finds the best direction to combine all the audio and image data, projecting them onto a single axis. FaceSync uses Pearson's correlation to measure the degree of synchronization between the audio and image data. We derive the optimal linear transform to combine the audio and visual information and describe an implementation that avoids the numerical problems caused by computing the correlation matrices. 1 Motivation In many applications, we want to know about the synchronization between an audio signal and the corresponding image data. In a teleconferencing system, we might want to know which of the several people imaged by a camera is heard by the microphones; then, we can direct the camera to the speaker. In post-production for a film, clean audio dialog is often dubbed over the video; we want to adjust the audio signal so that the lip-sync is perfect. When analyzing a film, we want to know when the person talking is in the shot, instead of off camera. When evaluating the quality of dubbed films, we can measure how well the translated words and audio fit the actor's face. This paper describes an algorithm, FaceSync, that measures the degree of synchronization between the video image of a face and the associated audio signal. We can do this task by synthesizing the talking face, using techniques such as Video Rewrite [1], and then comparing the synthesized video with the test video. That process, however, is expensive. Our solution finds a linear operator that, when applied to the audio and video signals, generates an audio-video-synchronization-error signal. The linear operator gathers information from throughout the image and thus allows us to do the computation inexpensively. Hershey and Movellan [2] describe an approach based on measuring the mutual information between the audio signal and individual pixels in the video. The correlation between the audio signal, x, and one pixel in the image y, is given by Pearson's correlation, r. The mutual information between these two variables is given by I(x, y) = -(1/2) log(1 - r^2). They create movies that show the regions of the video that have high correlation with the audio; 1. Currently at IBM Almaden Research, 650 Harry Road, San Jose, CA 95120. 2. Currently at Yes Video. com, 2192 Fortune Drive, San Jose, CA 95131.
3 0.55461079 83 nips-2000-Machine Learning for Video-Based Rendering
Author: Arno Schödl, Irfan A. Essa
Abstract: We present techniques for rendering and animation of realistic scenes by analyzing and training on short video sequences. This work extends the new paradigm for computer animation, video textures, which uses recorded video to generate novel animations by replaying the video samples in a new order. Here we concentrate on video sprites, which are a special type of video texture. In video sprites, instead of storing whole images, the object of interest is separated from the background and the video samples are stored as a sequence of alpha-matted sprites with associated velocity information. They can be rendered anywhere on the screen to create a novel animation of the object. We present methods to create such animations by finding a sequence of sprite samples that is both visually smooth and follows a desired path. To estimate visual smoothness, we train a linear classifier to estimate visual similarity between video samples. If the motion path is known in advance, we use beam search to find a good sample sequence. We can specify the motion interactively by precomputing the sequence cost function using Q-Iearning.
4 0.40768707 103 nips-2000-Probabilistic Semantic Video Indexing
Author: Milind R. Naphade, Igor Kozintsev, Thomas S. Huang
Abstract: We propose a novel probabilistic framework for semantic video indexing. We define probabilistic multimedia objects (multijects) to map low-level media features to high-level semantic labels. A graphical network of such multijects (multinet) captures scene context by discovering intra-frame as well as inter-frame dependency relations between the concepts. The main contribution is a novel application of a factor graph framework to model this network. We model relations between semantic concepts in terms of their co-occurrence as well as the temporal dependencies between these concepts within video shots. Using the sum-product algorithm [1] for approximate or exact inference in these factor graph multinets, we attempt to correct errors made during isolated concept detection by forcing high-level constraints. This results in a significant improvement in the overall detection performance. 1
5 0.38594007 30 nips-2000-Bayesian Video Shot Segmentation
Author: Nuno Vasconcelos, Andrew Lippman
Abstract: Prior knowledge about video structure can be used both as a means to improve the peiformance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulations is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.
6 0.37374619 96 nips-2000-One Microphone Source Separation
7 0.35856491 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals
8 0.33823526 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images
9 0.28739294 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
10 0.26740715 99 nips-2000-Periodic Component Analysis: An Eigenvalue Method for Representing Periodic Structure in Speech
11 0.25128931 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System
12 0.23534961 91 nips-2000-Noise Suppression Based on Neurophysiologically-motivated SNR Estimation for Robust Speech Recognition
13 0.2158235 2 nips-2000-A Comparison of Image Processing Techniques for Visual Speech Recognition Applications
14 0.21000692 82 nips-2000-Learning and Tracking Cyclic Human Motion
15 0.20433825 59 nips-2000-From Mixtures of Mixtures to Adaptive Transform Coding
16 0.19294919 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition
17 0.18890263 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates
18 0.18849604 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach
19 0.16664129 88 nips-2000-Multiple Timescales of Adaptation in a Neural Code
topicId topicWeight
[(2, 0.017), (10, 0.025), (17, 0.112), (32, 0.022), (33, 0.068), (55, 0.032), (62, 0.031), (65, 0.015), (67, 0.056), (69, 0.021), (75, 0.011), (76, 0.035), (79, 0.024), (81, 0.049), (90, 0.02), (91, 0.358), (97, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.89670694 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation
Author: John W. Fisher III, Trevor Darrell, William T. Freeman, Paul A. Viola
Abstract: People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a lowlevel, faces severe challenges, including the lack of accurate statistical models for the signals, and their high-dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a non-parametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.
2 0.88148934 114 nips-2000-Second Order Approximations for Probability Models
Author: Hilbert J. Kappen, Wim Wiegerinck
Abstract: In this paper, we derive a second order mean field theory for directed graphical probability models. By using an information theoretic argument it is shown how this can be done in the absense of a partition function. This method is a direct generalisation of the well-known TAP approximation for Boltzmann Machines. In a numerical example, it is shown that the method greatly improves the first order mean field approximation. For a restricted class of graphical models, so-called single overlap graphs, the second order method has comparable complexity to the first order method. For sigmoid belief networks, the method is shown to be particularly fast and effective.
3 0.80460209 85 nips-2000-Mixtures of Gaussian Processes
Author: Volker Tresp
Abstract: We introduce the mixture of Gaussian processes (MGP) model which is useful for applications in which the optimal bandwidth of a map is input dependent. The MGP is derived from the mixture of experts model and can also be used for modeling general conditional probability densities. We discuss how Gaussian processes -in particular in form of Gaussian process classification, the support vector machine and the MGP modelcan be used for quantifying the dependencies in graphical models.
4 0.58087802 13 nips-2000-A Tighter Bound for Graphical Models
Author: Martijn A. R. Leisink, Hilbert J. Kappen
Abstract: We present a method to bound the partition function of a Boltzmann machine neural network with any odd order polynomial. This is a direct extension of the mean field bound, which is first order. We show that the third order bound is strictly better than mean field. Additionally we show the rough outline how this bound is applicable to sigmoid belief networks. Numerical experiments indicate that an error reduction of a factor two is easily reached in the region where expansion based approximations are useful. 1
5 0.49911824 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks
Author: Chiranjib Bhattacharyya, S. Sathiya Keerthi
Abstract: A variational derivation of Plefka's mean-field theory is presented. This theory is then applied to sigmoidal belief networks with the aid of further approximations. Empirical evaluation on small scale networks show that the proposed approximations are quite competitive. 1
6 0.49570426 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning
7 0.46127757 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA
8 0.4500193 94 nips-2000-On Reversing Jensen's Inequality
9 0.43571955 96 nips-2000-One Microphone Source Separation
10 0.43427792 49 nips-2000-Explaining Away in Weight Space
11 0.43403882 122 nips-2000-Sparse Representation for Gaussian Process Models
12 0.43372896 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data
13 0.43029389 62 nips-2000-Generalized Belief Propagation
14 0.42690989 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models
15 0.42393175 89 nips-2000-Natural Sound Statistics and Divisive Normalization in the Auditory System
16 0.42374715 74 nips-2000-Kernel Expansions with Unlabeled Examples
17 0.42364553 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing
18 0.42364082 136 nips-2000-The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity
19 0.42182848 127 nips-2000-Structure Learning in Human Causal Induction
20 0.42068571 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition