nips nips2010 nips2010-28 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Javier R. Movellan, Paul L. Ruvolo
Abstract: Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human computer interaction. One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further we show that this person independent visual-only detector can be used to train very accurate audio-based person dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human computer interaction. [sent-8, score-1.186]
2 One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. [sent-9, score-0.372]
3 A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. [sent-10, score-1.543]
4 Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. [sent-11, score-1.209]
5 Further we show that this person independent visual-only detector can be used to train very accurate audio-based person dependent voice models. [sent-12, score-0.812]
6 The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). [sent-13, score-0.548]
7 Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. [sent-16, score-0.852]
8 The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database. [sent-17, score-0.358]
9 1 Introduction In recent years interest has been building [10, 21, 16, 8, 12] in the problem of detecting locations in the visual field that are responsible for auditory signals. [sent-18, score-0.48]
10 A specialization of this problem is determining whether a person in the visual field is currently talking. [sent-19, score-0.364]
11 Applications of this technology are wide-ranging: from speech recognition in noisy environments, to speaker diarization, to expression recognition systems that may benefit from knowing whether or not the person is talking in order to interpret the observed expressions. [sent-20, score-1.143]
12 Past approaches to the problem of speaker detection have focused on exploiting audio-visual synchrony as a measure of how likely a person in the visual field is to have generated the current audio signal [10, 21, 16, 8, 12]. [sent-21, score-1.336]
13 One benefit of these approaches is that they are not limited to detecting human speech [12]. [sent-24, score-0.248]
14 Another benefit is that they require very little processing of the visual signal (some of them operating on raw pixel values [10]). [sent-25, score-0.246]
15 However, as we show in this document, when visual features tailored to the analysis of facial expressions are used, it is possible to develop a very robust speech detector, based only on the visual signal, that far outperforms the past approaches. [sent-26, score-1.189]
16 Given the strong performance of the visual speech detector, we incorporate auditory information using the paradigm of transductive learning. [sent-27, score-0.898]
17 Specifically, we use the visual-only detector’s output as an uncertain labeling of when a given person is speaking and then use this labeling along with a set of acoustic measurements to create a voice model of how that person sounds when he/she speaks. [sent-28, score-0.803]
18 We show that the error rate of the visual-only speech detector can be more than halved by combining it with the auditory voice models developed via transductive learning. [sent-29, score-0.973]
19 Another view of our proposed approach is that it too is based on synchrony detection, albeit at a much higher level and a much longer time scale than previous approaches. [sent-30, score-0.265]
20 More concretely our approach moves from the level of synchrony between pixel fluctuations and sound energy to the level of the visual markers of talking and auditory markers of a particular person’s voice. [sent-31, score-1.018]
21 As we will show later, a benefit of this approach is that the auditory model that is optimized to predict the talking/not-talking visual signal for a particular candidate speaker also works quite well without using any visual input. [sent-32, score-1.102]
22 The results presented here challenge the orthodox use of low-level, synchrony-related measures that dominates research in this area. [sent-36, score-0.265]
23 2 Methods In this section we review a popular approach to speech detection that uses Canonical Correlation Analysis (CCA). [sent-37, score-0.311]
24 Next we present our method for visual-only speaker detection using facial expression dynamics. [sent-38, score-0.707]
25 [10] pioneered the use of audio-visual synchrony for speech detection. [sent-43, score-0.469]
26 The authors of [10] were chiefly interested in designing a system to automatically synchronize audio and video; however, their results inspired others to use similar approaches for detecting regions in the visual field responsible for auditory events [12]. [sent-49, score-0.739]
27 For example, if mouth pixels of a potential speaker are highly predictable based on sound energy then it is likely that there is a common cause underlying both sensory measurements. [sent-51, score-0.431]
28 Let A1 , . . . , AN and V1 , . . . , VN be sequences of audio and visual features respectively, with each Ai ∈ Rv and Vi ∈ Ru . [sent-61, score-0.492]
29 We collectively refer to the audio and visual features with the variables A ∈ Rv×N and V ∈ Ru×N . [sent-62, score-0.492]
30 Our model of speaker detection based on CCA involves computing canonical vectors wA and wV that solve Equation 1 and then computing time-windowed estimates of the correlation of the auditory and visual features projected on these vectors at each point in time. [sent-66, score-1.023]
31 The final judgment as to whether or not a candidate face is speaking is determined by thresholding the windowed correlation value. [sent-67, score-0.297]
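As a point of reference, a minimal sketch of this CCA-based synchrony baseline is given below. It is an illustration only, not the authors' implementation: sklearn's CCA stands in for solving Equation 1, and the window length and decision threshold are placeholder assumptions.

```python
# Minimal sketch of CCA-based audio-visual synchrony detection (assumptions noted above).
import numpy as np
from sklearn.cross_decomposition import CCA

def windowed_synchrony(audio_feats, visual_feats, window=50, threshold=0.3):
    # Fit a one-dimensional CCA to obtain the canonical directions w_A and w_V.
    cca = CCA(n_components=1)
    a_proj, v_proj = cca.fit_transform(audio_feats, visual_feats)  # projections onto w_A, w_V
    a_proj, v_proj = a_proj.ravel(), v_proj.ravel()

    # Time-windowed correlation of the projected audio and visual signals.
    scores = np.zeros(len(a_proj))
    for t in range(window, len(a_proj)):
        scores[t] = np.corrcoef(a_proj[t - window:t], v_proj[t - window:t])[0, 1]

    # Final talking/not-talking decision by thresholding the windowed correlation.
    return scores, scores > threshold
```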
32 2 Visual Detector of Speech The Facial Action Coding System (FACS) is an anatomically inspired, comprehensive and versatile method to describe human facial expressions [7]. [sent-69, score-0.25]
33 Figure 1: The Computer Expression Recognition Toolbox was used to automatically extract 84 features describing the observed facial expressions. [sent-72, score-0.262]
34 These features were used for training a speech detector. [sent-73, score-0.274]
35 The output of the CERT system provides a versatile and effective set of features for vision-based automatic analysis of facial behavior. [sent-76, score-0.392]
36 In this paper we used 84 outputs of the CERT system ranging from the locations of key feature points on the face to movements of individual facial muscle groups (Action Units) to detectors that specify high-level emotional categories (such as distress). [sent-78, score-0.389]
37 Figure 2 shows an example of the dynamics of CERT outputs during periods of talking and non-talking. [sent-79, score-0.393]
38 There appears to be a periodicity to the modulations in the chin raise Action Unit (AU 17) during the speech period. [sent-80, score-0.338]
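The classifier applied on top of the CERT outputs is not spelled out in this extract, so the sketch below uses a standardized logistic regression purely as an illustrative stand-in; cert_features (an N x 84 array of CERT outputs) and talking_labels (binary frame labels from a labeled training set) are assumed inputs.

```python
# Minimal sketch of a visual-only talking detector trained on CERT outputs
# (logistic regression is an assumed stand-in, not necessarily the paper's classifier).
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_visual_detector(cert_features, talking_labels):
    # Standardize the 84 CERT channels, then fit a binary talking/not-talking classifier.
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    clf.fit(cert_features, talking_labels)
    return clf

def visual_scores(clf, cert_features):
    # Continuous per-frame scores; thresholding is deferred to the threshold-selection step.
    return clf.predict_proba(cert_features)[:, 1]
```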
39 3 Voice Model The visual speech detector described above was then used to automatically label audio-visual speech signals. [sent-96, score-0.853]
40 These labels were then used to train person-specific voice models. [sent-97, score-0.293]
41 It is possible to cast the bootstrapping of the voice model in a form very similar to the more conventional Canonical Correlation method discussed in Section 2. [sent-99, score-0.324]
42 Although it is known [20] that non-linear models provide superior performance to linear models for auditory speaker identification, consider the case where we seek to learn a linear model over auditory features to determine a model of a particular speaker’s voice. [sent-101, score-0.822]
43 Figure 2: An example of the shift in action unit output when talking begins. [sent-104, score-0.423]
44 Qualitatively there is a periodicity in CERT’s Action Unit 17 (Chin Raise) output during the talking period. [sent-107, score-0.41]
45 While this view is useful for seeing the commonalities between our approach and the classical synchrony approaches, it is important to note that our approach is not restricted to linear models for either the auditory or visual talking detectors. [sent-129, score-1.018]
46 In this section we show how to fit a non-linear voice model, one that is very popular for the task of speaker detection, using the visual detector's output as a training signal. [sent-130, score-1.276]
47 1 Auditory Features We use the popular Mel-Frequency Cepstral Coefficients (MFCCs) [3] as the auditory descriptors to model the voice of a candidate speaker. [sent-133, score-0.589]
48 MFCCs have been applied to a wide range of audio category recognition problems such as genre identification and speaker identification [19], and can be seen as capturing the timbral information of sound. [sent-134, score-0.619]
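As an illustration of the auditory front end, the following sketch extracts 13 MFCCs per frame with librosa; the choice of librosa, the sampling rate, and the frame/hop sizes are assumptions, not values reported in the paper.

```python
# Minimal sketch of MFCC extraction for the voice model (parameters are assumptions).
import librosa

def extract_mfccs(wav_path, n_mfcc=13, sr=16000):
    # Load the audio track and compute 13 coefficients (0th through 12th) per frame.
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=512, hop_length=160)
    return mfcc.T  # shape (num_frames, n_mfcc): one MFCC vector per time frame
```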
49 2 Learning and Classification Given a temporal segmentation of when each of a set of candidate speakers is speaking, we define the set of MFCC features generated by speaker i as FAi , where the jth column, FAij , denotes the MFCC features of speaker i at the jth time point at which that speaker is talking. [sent-142, score-1.432]
50 In order to build an auditory model that can discriminate who is speaking we first model the density of input features pi for the ith speaker based on the training data FAi . [sent-143, score-0.711]
51 In order to determine the probability of a speaker generating new input audio features, TA , we apply Bayes’ rule p(Si = 1|TA ) ∝ p(TA |Si = 1)p(Si = 1). [sent-144, score-0.572]
52 Here Si indicates whether or not the ith speaker is currently speaking. [sent-145, score-0.342]
53 The probability distributions of the audio features, given whether or not a given speaker is talking, are modeled using 4-state hidden Markov models, with each state having an independent 4-component Gaussian mixture model. [sent-146, score-0.933]
54 The parameters of the voice model were learned using the Expectation Maximization Algorithm [6]. [sent-150, score-0.293]
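A minimal sketch of this voice model follows, using hmmlearn as an assumed implementation choice: one 4-state HMM with 4-component Gaussian mixtures per state for "talking" frames and one for "not talking" frames, fit by EM, then combined via Bayes' rule at test time. The uniform prior and EM settings are illustrative assumptions.

```python
# Minimal sketch of the person-dependent voice model (hmmlearn is an assumed choice).
import numpy as np
from hmmlearn.hmm import GMMHMM

def fit_voice_models(mfcc_talking, mfcc_not_talking):
    # Frames are segmented into the two classes by the thresholded visual detector output.
    talk_hmm = GMMHMM(n_components=4, n_mix=4, covariance_type='diag', n_iter=20)
    silent_hmm = GMMHMM(n_components=4, n_mix=4, covariance_type='diag', n_iter=20)
    talk_hmm.fit(mfcc_talking)        # EM (Baum-Welch) on a (num_frames, 13) MFCC array
    silent_hmm.fit(mfcc_not_talking)
    return talk_hmm, silent_hmm

def log_posterior_talking(talk_hmm, silent_hmm, mfcc_window, prior_talking=0.5):
    # p(Si = 1 | TA) is proportional to p(TA | Si = 1) p(Si = 1), evaluated in the log domain.
    log_talk = talk_hmm.score(mfcc_window) + np.log(prior_talking)
    log_silent = silent_hmm.score(mfcc_window) + np.log(1.0 - prior_talking)
    return log_talk - np.logaddexp(log_talk, log_silent)
```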
55 3 Threshold Selection The outputs of the visual detector over time provide an estimate of whether or not a candidate speaker is talking. [sent-153, score-0.941]
56 In this work we convert these outputs into a binary temporal segmentation of when a candidate speaker was or was not talking. [sent-154, score-0.598]
57 Our threshold selection mechanism uses a training portion of audio-visual input as a method of tuning the threshold to each candidate speaker. [sent-156, score-0.284]
58 In order to select an appropriate threshold we trained a number of audio models, each using a different threshold on the visual speech detector's output. [sent-157, score-0.963]
59 Each of these thresholds induces a binary segmentation which in turn is fed to the voice model learning component described in Section 2. [sent-158, score-0.368]
60 Next, we evaluate each voice model on a set of testing samples. [sent-160, score-0.338]
61 The acoustic model that achieved the highest generalization performance (with respect to the thresholded visual detector’s output on the testing portion) was then selected for fusion with the visual-only model. [sent-163, score-0.481]
62 In the testing stage each of these discrete models is evaluated (in the figure there are only two but in practice we use more) to see how well it generalizes on the testing set (where ground truth is defined based on the visual detector’s thresholded output). [sent-166, score-0.347]
63 Finally, the detector that generalizes the best is fused with the visual detector to give the final output of our system. [sent-167, score-0.793]
64 All assessments of generalization performance are with respect to the outputs of the visual classifier and not the true speaking vs. not-speaking labels. [sent-171, score-0.375]
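A minimal sketch of this selection loop is given below, reusing fit_voice_models and log_posterior_talking from the voice-model sketch above. The candidate threshold grid, the assumption that visual scores and MFCC frames are time-aligned, and the use of ROC area against the thresholded visual output as the agreement measure are all illustrative assumptions.

```python
# Minimal sketch of threshold selection for bootstrapping the voice model.
import numpy as np
from sklearn.metrics import roc_auc_score

def select_voice_model(vis_train, mfcc_train, vis_test_windows, mfcc_test_windows,
                       thresholds=np.linspace(0.2, 0.8, 7)):
    best_auc, best_models = -np.inf, None
    for thr in thresholds:
        seg = vis_train > thr  # binary segmentation induced by this threshold
        talk_hmm, silent_hmm = fit_voice_models(mfcc_train[seg], mfcc_train[~seg])
        # Pseudo ground truth on the held-out portion is the thresholded visual output.
        scores = [log_posterior_talking(talk_hmm, silent_hmm, w) for w in mfcc_test_windows]
        auc = roc_auc_score(vis_test_windows > thr, scores)
        if auc > best_auc:
            best_auc, best_models = auc, (talk_hmm, silent_hmm)
    return best_models
```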
65 4 Fusion There are many approaches [15] to fusing the visual and auditory model outputs to estimate the likelihood that someone is or is not talking. [sent-174, score-0.609]
66 In order to compute the fused output we simply add the whitened outputs of the visual and auditory detectors. [sent-176, score-0.633]
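A minimal sketch of this fusion rule follows; "whitening" is read here as per-detector standardization (zero mean, unit variance), which is an assumption about the intended operation.

```python
# Minimal sketch of audio-visual score fusion by adding whitened detector outputs.
import numpy as np

def fuse(visual_scores, audio_scores):
    def whiten(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / (x.std() + 1e-8)  # zero mean, unit variance
    return whiten(visual_scores) + whiten(audio_scores)
```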
67 5 Related Work Most past approaches for detecting whether someone is talking have either been purely visual [18] (i. [sent-178, score-0.644]
68 using a classifier trained on visual features from a training database) or based on audio-visual synchrony [21, 8, 12]. [sent-180, score-0.553]
69 The first method used low-level audio-visual synchrony detection to estimate the probability of whether or not someone is speaking at each point in time (see Section 2.1). [sent-185, score-0.518]
70 The second is the approach proposed in this document: start with a visual-only speech detector, then incorporate acoustic information by training speaker-dependent voice models, and finally fuse the audio and visual models' outputs. [sent-187, score-1.019]
71 The database contains a wide variety of vocal and facial expression behavior, as the responses of the interviewees are not scripted but rather spontaneous. [sent-190, score-0.347]
72 As a consequence this database provides a much more realistic testbed for speech detection algorithms than highly scripted databases. [sent-191, score-0.4]
73 Since we cannot extract visual information for the person behind the camera, we define the task of interest to be a binary classification of whether or not the person being interviewed is talking at each point in time. [sent-194, score-0.855]
74 It is reasonable to conclude that our performance on the task of speaker detection in two-speaker environments would only improve if we could see both speakers' faces. [sent-195, score-0.791]
75 In order to test the effect of the voice model bootstrapping we use the first half of each interview as a training portion (that is the portion on which the voice model is learned) and the second half as the testing portion. [sent-198, score-0.935]
76 The specific choice of a 50/50 split between training and test is somewhat arbitrary, however, it is a reasonable compromise between spending too long learning the voice model and not having sufficient audio input to fit the voice model. [sent-199, score-0.842]
77 It is important to note that no ground truth was used from the first 50% of each interview, as the labeling was the result of the person-independent visual speech detector. [sent-200, score-0.623]
78 That is, we have audio and video information and codes as to when the person in front of the camera is talking. [sent-203, score-0.462]
79 For both our method and the synchrony method the audio modality was summarized by the first 13 (0th through 12th) MFCCs. [sent-206, score-0.495]
80 First we apply CCA between MFCCs and CERT outputs (plus the temporal derivatives and absolute value of the temporal derivatives) over the database of six interviews. [sent-208, score-0.248]
81 Next we look for regions in the interview where the projections of the audio and video onto the vectors found by CCA yield high correlation. [sent-209, score-0.343]
82 This evaluation method is called “Windowed Correlation” in the synchrony detection results table (see Table 2). [sent-211, score-0.248]
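A small sketch of the feature augmentation used for this baseline is shown below: each feature stream is stacked with its temporal derivative and the derivative's absolute value before CCA is applied. The first-difference form of the derivative is an assumption.

```python
# Minimal sketch of the derivative augmentation applied before CCA (assumed first differences).
import numpy as np

def augment_with_derivatives(X):
    # X: (num_frames, num_features). Pad the first difference so it keeps the same length.
    dX = np.diff(X, axis=0, prepend=X[:1])
    return np.hstack([X, dX, np.abs(dX)])
```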
83 Each row indicates the performance (as measured by area under the ROC) of a particular detector on the second half of a video of a particular subject. [sent-242, score-0.285]
84 Table 2: The performance of the synchrony detection model. [sent-250, score-0.372]
85 Each row indicates the performance of a particular detector on the second half of a video of a particular subject. [sent-251, score-0.285]
86 Table 2 and Table 1 summarize the performance of the synchrony detection approach and our approach respectively. [sent-252, score-0.372]
87 Our approach achieves a higher average area under the ROC (0.9620) than the synchrony detection approach that has access to both audio and video. [sent-257, score-0.602]
88 This validates that our method is of use in situations where we cannot expect to always have visual input on each of the candidate speakers’ faces. [sent-263, score-0.296]
89 Our approach also benefited from fusing in the learned audio-based speaker models. [sent-264, score-0.374]
90 4 Discussion and Future Work We described a new method for multi-modal detection of when a candidate person is speaking. [sent-266, score-0.331]
91 Our approach used the output of a person-independent, vision-based speech detector to train a person-dependent voice model. [sent-267, score-0.933]
92 To this end we described a novel approach for threshold selection for training the voice model based on the outputs of the visual detector. [sent-268, score-0.655]
93 We showed that our method greatly improved performance with respect to previous approaches to the speech detection problem. [sent-269, score-0.311]
94 We also briefly discussed how the work proposed here can be seen in a similar light as the more conventional synchrony detection methods of the past. [sent-270, score-0.372]
95 This view, combined with the large gain in performance for the method presented here, demonstrates that synchrony over long time scales and high-level features (e.g. [sent-271, score-0.309]
96 talking / not talking) works significantly better than over short time scales and low-level features. [sent-273, score-0.361]
97 Another challenge is incorporating confidences from the visual detector output in the learning of the voice model. [sent-277, score-0.801]
98 Robust sensor fusion: Analysis and application to audio visual speech recognition. [sent-378, score-0.652]
99 Facesync: A linear operator for measuring synchronization of video facial images and audio tracks. [sent-413, score-0.506]
100 Measuring the difficulty of a lecture using automatic facial expression recognition. [sent-436, score-0.322]
wordName wordTfidf (topN-words)
[('speaker', 0.342), ('talking', 0.317), ('voice', 0.293), ('synchrony', 0.265), ('audio', 0.23), ('detector', 0.227), ('facial', 0.218), ('visual', 0.218), ('auditory', 0.218), ('speech', 0.204), ('cert', 0.166), ('person', 0.146), ('detection', 0.107), ('wv', 0.106), ('wa', 0.104), ('portion', 0.096), ('speaking', 0.081), ('candidate', 0.078), ('outputs', 0.076), ('mfcc', 0.074), ('facs', 0.074), ('littlewort', 0.074), ('noulas', 0.074), ('fusion', 0.068), ('cca', 0.066), ('interviews', 0.065), ('someone', 0.065), ('output', 0.063), ('temporal', 0.06), ('mfccs', 0.06), ('fused', 0.058), ('video', 0.058), ('correlation', 0.057), ('speakers', 0.057), ('chin', 0.055), ('interview', 0.055), ('krose', 0.055), ('slaney', 0.055), ('windowed', 0.055), ('database', 0.052), ('gabor', 0.051), ('movellan', 0.049), ('raise', 0.049), ('ta', 0.049), ('acoustic', 0.048), ('recognition', 0.047), ('bartlett', 0.047), ('multimodal', 0.047), ('testing', 0.045), ('roc', 0.044), ('detecting', 0.044), ('features', 0.044), ('action', 0.043), ('segmentation', 0.042), ('threshold', 0.042), ('multimedia', 0.042), ('ru', 0.042), ('imaginary', 0.04), ('muscle', 0.04), ('expression', 0.04), ('thresholded', 0.039), ('automatic', 0.038), ('rv', 0.038), ('canonical', 0.037), ('atkinson', 0.037), ('cuave', 0.037), ('diarization', 0.037), ('fai', 0.037), ('fasel', 0.037), ('gilman', 0.037), ('hershey', 0.037), ('lainscsek', 0.037), ('mail', 0.037), ('scripted', 0.037), ('tutoring', 0.037), ('thresholds', 0.033), ('expressions', 0.032), ('genuine', 0.032), ('aus', 0.032), ('fusing', 0.032), ('mouth', 0.032), ('pain', 0.032), ('whitehill', 0.032), ('bank', 0.032), ('sensory', 0.031), ('transductive', 0.031), ('bootstrapping', 0.031), ('logistic', 0.031), ('driver', 0.03), ('periodicity', 0.03), ('system', 0.029), ('camera', 0.028), ('signal', 0.028), ('measurements', 0.026), ('jolla', 0.026), ('bandwidths', 0.026), ('thorough', 0.026), ('training', 0.026), ('lecture', 0.026), ('face', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
Author: Javier R. Movellan, Paul L. Ruvolo
Abstract: Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human computer interaction. One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further we show that this person independent visual-only detector can be used to train very accurate audio-based person dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database. 1
2 0.1259269 125 nips-2010-Inference and communication in the game of Password
Author: Yang Xu, Charles Kemp
Abstract: Communication between a speaker and hearer will be most efficient when both parties make accurate inferences about the other. We study inference and communication in a television game called Password, where speakers must convey secret words to hearers by providing one-word clues. Our working hypothesis is that human communication is relatively efficient, and we use game show data to examine three predictions. First, we predict that speakers and hearers are both considerate, and that both take the other’s perspective into account. Second, we predict that speakers and hearers are calibrated, and that both make accurate assumptions about the strategy used by the other. Finally, we predict that speakers and hearers are collaborative, and that they tend to share the cognitive burden of communication equally. We find evidence in support of all three predictions, and demonstrate in addition that efficient communication tends to break down when speakers and hearers are placed under time pressure.
3 0.11944751 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
Author: Tian Lan, Yang Wang, Weilong Yang, Greg Mori
Abstract: We propose a discriminative model for recognizing group activities. Our model jointly captures the group activity, the individual person actions, and the interactions among them. Two new types of contextual information, group-person interaction and person-person interaction, are explored in a latent variable framework. Different from most of the previous latent structured models which assume a predefined structure for the hidden layer, e.g. a tree structure, we treat the structure of the hidden layer as a latent variable and implicitly infer it during learning and inference. Our experimental results demonstrate that by inferring this contextual information together with adaptive structures, the proposed model can significantly improve activity recognition performance. 1
4 0.11350358 281 nips-2010-Using body-anchored priors for identifying actions in single images
Author: Leonid Karlinsky, Michael Dinerstein, Shimon Ullman
Abstract: This paper presents an approach to the visual recognition of human actions using only single images as input. The task is easy for humans but difficult for current approaches to object recognition, because instances of different actions may be similar in terms of body pose, and often require detailed examination of relations between participating objects and body parts in order to be recognized. The proposed approach applies a two-stage interpretation procedure to each training and test image. The first stage produces accurate detection of the relevant body parts of the actor, forming a prior for the local evidence needed to be considered for identifying the action. The second stage extracts features that are anchored to the detected body parts, and uses these features and their feature-to-part relations in order to recognize the action. The body anchored priors we propose apply to a large range of human actions. These priors allow focusing on the relevant regions and relations, thereby significantly simplifying the learning process and increasing recognition performance. 1
5 0.091470957 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date. 1
6 0.087024346 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
7 0.084974349 157 nips-2010-Learning to localise sounds with spiking neural networks
8 0.078095235 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
9 0.072740167 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
10 0.061922811 149 nips-2010-Learning To Count Objects in Images
11 0.058499012 91 nips-2010-Fast detection of multiple change-points shared by many signals using group LARS
12 0.058362775 138 nips-2010-Large Margin Multi-Task Metric Learning
13 0.058314212 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
14 0.057718504 268 nips-2010-The Neural Costs of Optimal Control
15 0.05562282 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
16 0.054735351 61 nips-2010-Direct Loss Minimization for Structured Prediction
17 0.053440236 186 nips-2010-Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification
18 0.05186012 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
19 0.050154757 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
20 0.048543397 17 nips-2010-A biologically plausible network for the computation of orientation dominance
topicId topicWeight
[(0, 0.138), (1, 0.045), (2, -0.134), (3, -0.046), (4, 0.018), (5, 0.007), (6, -0.064), (7, 0.024), (8, -0.049), (9, 0.022), (10, 0.02), (11, 0.025), (12, 0.009), (13, -0.035), (14, 0.006), (15, 0.016), (16, 0.026), (17, 0.043), (18, -0.066), (19, -0.027), (20, -0.049), (21, 0.048), (22, 0.087), (23, 0.064), (24, 0.153), (25, -0.037), (26, 0.04), (27, -0.034), (28, -0.007), (29, 0.036), (30, -0.01), (31, -0.022), (32, 0.095), (33, -0.027), (34, 0.072), (35, 0.04), (36, 0.029), (37, -0.099), (38, 0.02), (39, 0.037), (40, -0.016), (41, -0.12), (42, -0.024), (43, -0.111), (44, -0.05), (45, -0.024), (46, 0.048), (47, 0.025), (48, 0.048), (49, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.93631893 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
Author: Javier R. Movellan, Paul L. Ruvolo
Abstract: Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human computer interaction. One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further we show that this person independent visual-only detector can be used to train very accurate audio-based person dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database. 1
2 0.69451773 281 nips-2010-Using body-anchored priors for identifying actions in single images
Author: Leonid Karlinsky, Michael Dinerstein, Shimon Ullman
Abstract: This paper presents an approach to the visual recognition of human actions using only single images as input. The task is easy for humans but difficult for current approaches to object recognition, because instances of different actions may be similar in terms of body pose, and often require detailed examination of relations between participating objects and body parts in order to be recognized. The proposed approach applies a two-stage interpretation procedure to each training and test image. The first stage produces accurate detection of the relevant body parts of the actor, forming a prior for the local evidence needed to be considered for identifying the action. The second stage extracts features that are anchored to the detected body parts, and uses these features and their feature-to-part relations in order to recognize the action. The body anchored priors we propose apply to a large range of human actions. These priors allow focusing on the relevant regions and relations, thereby significantly simplifying the learning process and increasing recognition performance. 1
3 0.67276049 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
Author: Fabian Triefenbach, Azarakhsh Jalalvand, Benjamin Schrauwen, Jean-pierre Martens
Abstract: Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant. 1
4 0.63051361 125 nips-2010-Inference and communication in the game of Password
Author: Yang Xu, Charles Kemp
Abstract: Communication between a speaker and hearer will be most efficient when both parties make accurate inferences about the other. We study inference and communication in a television game called Password, where speakers must convey secret words to hearers by providing one-word clues. Our working hypothesis is that human communication is relatively efficient, and we use game show data to examine three predictions. First, we predict that speakers and hearers are both considerate, and that both take the other’s perspective into account. Second, we predict that speakers and hearers are calibrated, and that both make accurate assumptions about the strategy used by the other. Finally, we predict that speakers and hearers are collaborative, and that they tend to share the cognitive burden of communication equally. We find evidence in support of all three predictions, and demonstrate in addition that efficient communication tends to break down when speakers and hearers are placed under time pressure.
5 0.59818637 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date. 1
7 0.50805736 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
8 0.49058223 111 nips-2010-Hallucinations in Charles Bonnet Syndrome Induced by Homeostasis: a Deep Boltzmann Machine Model
9 0.47450626 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
10 0.47234926 157 nips-2010-Learning to localise sounds with spiking neural networks
11 0.45656663 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
12 0.4488847 17 nips-2010-A biologically plausible network for the computation of orientation dominance
13 0.41622528 271 nips-2010-Tiled convolutional neural networks
14 0.40721673 149 nips-2010-Learning To Count Objects in Images
15 0.40658131 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
16 0.40426004 61 nips-2010-Direct Loss Minimization for Structured Prediction
17 0.39578143 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
18 0.38717687 241 nips-2010-Size Matters: Metric Visual Search Constraints from Monocular Metadata
19 0.38673469 39 nips-2010-Bayesian Action-Graph Games
20 0.38288933 171 nips-2010-Movement extraction by detecting dynamics switches and repetitions
topicId topicWeight
[(13, 0.032), (17, 0.03), (27, 0.088), (30, 0.07), (35, 0.03), (45, 0.169), (50, 0.059), (52, 0.024), (60, 0.015), (77, 0.043), (87, 0.326), (90, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.74668556 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
Author: Javier R. Movellan, Paul L. Ruvolo
Abstract: Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human computer interaction. One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further we show that this person independent visual-only detector can be used to train very accurate audio-based person dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database. 1
2 0.70614743 41 nips-2010-Block Variable Selection in Multivariate Regression and High-dimensional Causal Inference
Author: Vikas Sindhwani, Aurelie C. Lozano
Abstract: We consider multivariate regression problems involving high-dimensional predictor and response spaces. To efficiently address such problems, we propose a variable selection method, Multivariate Group Orthogonal Matching Pursuit, which extends the standard Orthogonal Matching Pursuit technique. This extension accounts for arbitrary sparsity patterns induced by domain-specific groupings over both input and output variables, while also taking advantage of the correlation that may exist between the multiple outputs. Within this framework, we then formulate the problem of inferring causal relationships over a collection of high-dimensional time series variables. When applied to time-evolving social media content, our models yield a new family of causality-based influence measures that may be seen as an alternative to the classic PageRank algorithm traditionally applied to hyperlink graphs. Theoretical guarantees, extensive simulations and empirical studies confirm the generality and value of our framework.
3 0.66915435 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
Author: Amin Sayedi, Morteza Zadimoghaddam, Avrim Blum
Abstract: We discuss an online learning framework in which the agent is allowed to say “I don’t know” as well as making incorrect predictions on given examples. We analyze the trade off between saying “I don’t know” and making mistakes. If the number of don’t-know predictions is required to be zero, the model reduces to the well-known mistake-bound model introduced by Littlestone [Lit88]. On the other hand, if no mistakes are allowed, the model reduces to KWIK framework introduced by Li et. al. [LLW08]. We propose a general, though inefficient, algorithm for general finite concept classes that minimizes the number of don’t-know predictions subject to a given bound on the number of allowed mistakes. We then present specific polynomial-time algorithms for the concept classes of monotone disjunctions and linear separators with a margin.
4 0.55064446 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models. 1
5 0.54874635 155 nips-2010-Learning the context of a category
Author: Dan Navarro
Abstract: This paper outlines a hierarchical Bayesian model for human category learning that learns both the organization of objects into categories, and the context in which this knowledge should be applied. The model is fit to multiple data sets, and provides a parsimonious method for describing how humans learn context specific conceptual representations.
6 0.5485279 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
7 0.54800338 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
8 0.54795051 268 nips-2010-The Neural Costs of Optimal Control
9 0.54607028 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
10 0.54566616 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
11 0.5425961 98 nips-2010-Functional form of motion priors in human motion perception
12 0.54184812 117 nips-2010-Identifying graph-structured activation patterns in networks
13 0.54153359 194 nips-2010-Online Learning for Latent Dirichlet Allocation
14 0.54123223 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
15 0.54106498 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
16 0.54072219 161 nips-2010-Linear readout from a neural population with partial correlation data
17 0.54039431 17 nips-2010-A biologically plausible network for the computation of orientation dominance
18 0.53920066 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
19 0.53903836 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication
20 0.53847933 96 nips-2010-Fractionally Predictive Spiking Neurons