nips nips2012 nips2012-356 knowledge-graph by maker-knowledge-mining

356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio


Source: pdf

Author: Sourish Chaudhuri, Bhiksha Raj

Abstract: Approaches to audio classification and retrieval tasks largely rely on detection-based discriminative models. We submit that such models make a simplistic assumption in mapping acoustics directly to semantics, whereas the actual process is likely more complex. We present a generative model that maps acoustics in a hierarchical manner to increasingly higher-level semantics. Our model has two layers with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. We evaluate our model on a large-scale retrieval task from TRECVID 2011, and report significant improvements over standard baselines.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Approaches to audio classification and retrieval tasks largely rely on detection-based discriminative models. [sent-5, score-0.491]

2 Our model has two layers with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. [sent-8, score-0.643]

3 1 Introduction Automatic semantic analysis of multimedia content has been an active area of research due to potential implications for indexing and retrieval [1–7]. [sent-10, score-0.318]

4 In this paper, we limit ourselves to the analysis of the audio component of multimedia data only. [sent-11, score-0.546]

5 Early approaches for semantic indexing of audio relied on automatic speech recognition techniques to generate semantically relevant keywords [2]. [sent-12, score-0.678]

6 Subsequently, supervised approaches were developed for detecting specific (potentially semantically relevant) sounds in audio streams [6, 8–10], e. [sent-13, score-0.721]

7 , and using the detected sounds to characterize the audio files. [sent-16, score-0.639]

8 While this approach has been shown to be effective on certain datasets, it requires data for each of the various sounds expected in the dataset. [sent-17, score-0.174]

9 audio libraries are studio-quality, while user-generated YouTube-style content is noisy. [sent-20, score-0.434]

10 In order to avoid the issues that arise with using supervised, detection-based systems, unsupervised approaches were developed to learn sound dictionaries from the data [7, 11, 12]. [sent-21, score-0.186]

11 Typically, these methods use clustering techniques on fixed length audio segments to learn a dictionary, and then characterize new data using this dictionary. [sent-22, score-0.554]

12 However, characterizing audio data with elements from an audio dictionary (supervised or unsupervised) for semantic analysis involves an implicit assumption that the acoustics map directly to semantics. [sent-23, score-1.058]

13 In reality, we expect the mapping to be more complex, because acoustically similar sounds can be produced by very different sources. [sent-24, score-0.269]

14 Further, since most audio datasets do not contain detailed hierarchical labels that our framework would require, we present unsupervised formulations for two layers in this hierarchical framework, building on previous work for the first layer, and developing a model for the second. [sent-27, score-0.597]

15 Instead, we use features derived from this structure to characterize audio, and evaluate these characterizations in a large-scale audio retrieval task with semantic categories, where our model significantly improves over state-of-the-art baselines. [sent-29, score-0.671]

16 A further benefit of the induced structure is that the generated segments may be used for annotation by humans, thus removing the need for the annotator to scan the audio to identify and mark segment boundaries, making the annotation process much faster [13]. [sent-30, score-0.546]

17 In Section 2, we introduce a novel framework for mapping acoustics to semantics for deeper analysis of audio. [sent-31, score-0.189]

18 Section 3 describes the process of learning the lower-level acoustic units in the framework, while Section 4 describes a generative model that automatically identifies patterns over and segments these acoustic units. [sent-32, score-0.937]

19 Thus, changes in real-world scenes are sequential by nature and the human brain can perceive this sequentiality and use it to learn semantic relationships between the various events to analyze scenes; e. [sent-35, score-0.373]

20 In this section, we present a hierarchical model that maps observed scene characteristics to semantics in a hierarchical fashion. [sent-38, score-0.196]

21 Traditional detection-based approaches, that assign each frame or a sequence of frames of prespecified length to sound categories/clusters, are severely limited in their ability to account for context. [sent-42, score-0.186]

22 In addition to context, we need to consider the possibility of polysemy in sounds– semantically different sounds may be acoustically similar; e. [sent-43, score-0.308]

23 a dull metallic sound may be produced by a hammer striking an object, a baseball bat hitting a ball, or a car collision. [sent-45, score-0.222]

24 The sound alone doesn’t provide us with sufficient information to infer the semantic context. [sent-46, score-0.274]

25 However, if the sound is followed by applause, we guess the context to be baseball, screams or sirens suggest an accident, while monotonic repetitions of the metallic sound suggest someone using a hammer. [sent-47, score-0.285]

26 The grey circles closest to the observed audio represent short-duration lower-level acoustic units which produce sounds that human ears can perceive, such as the clink of glass, thump produced by footsteps, etc. [sent-50, score-1.036]

27 These units have acoustic characteristics, but no clear associated semantics since the semantics may be context dependent. [sent-51, score-0.616]

28 Sequences of these units, however, will have interpretable semantics– we refer to these as events marked by grey rectangles in Figure 1a. [sent-52, score-0.249]

29 Further, these events themselves likely influence future events, shown by the arrows, e. [sent-54, score-0.224]

30 the loud cheering in the audio clip is because a hitter hit a home run. [sent-56, score-0.539]

31 The event layer in Figure 1b has been further divided into two, where the lower level (indexed by v) corresponds to observable events (e. [sent-59, score-0.495]

32 hit-ball, cheering), whereas the higher level (e) corresponds to a semantic event (e. [sent-61, score-0.356]

33 batting-in-run), and the root node represents the semantic category (baseball, in this case). [sent-63, score-0.149]

34 Typically, audio datasets contain only a category or genre label for each audio file. [sent-65, score-0.868]

35 This framework for semantic analysis of audio is the first effort to extract deeper semantic structure, to the best of our knowledge. [sent-67, score-0.786]

36 We build on previous work to automatically learn lower level units unsupervised from audio data [14]. [sent-69, score-0.615]

37 We then develop a generative model to learn event patterns over the lower-level units, which correspond to the second layer in Figure 1b. [sent-70, score-0.346]

38 We represent the audio as a sequence of 39-dimensional feature vectors, each comprising 13 Mel-Frequency Cepstral Coefficients and 13-dimensional Δ and ΔΔ features. [sent-71, score-0.468]
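
The 39-dimensional layout described above (13 static coefficients plus their Δ and ΔΔ) can be sketched as follows. This is only an illustration: the function names are hypothetical, and the central-difference delta with edge replication is an assumption, since the excerpt does not say which delta window the authors used.

```python
def delta(frames):
    """Central-difference delta over time, replicating the edge frames.

    Assumption: a simple (next - previous) / 2 difference; the paper's
    actual delta window is not given in this excerpt.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_with_deltas(static):
    """Concatenate static, delta, and delta-delta features per frame."""
    d1 = delta(static)   # first-order differences
    d2 = delta(d1)       # second-order differences
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 5 frames of 13 made-up "cepstral" values each.
static = [[float(t * k) for k in range(13)] for t in range(5)]
feats = stack_with_deltas(static)
```

Each output frame has 13 + 13 + 13 = 39 values, matching the dimensionality stated in the sentence above.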

39 Figure 1: Conceptual representation of the proposed hierarchical framework. (a) Left figure: Conceptualizing increasingly complex semantic analysis; (b) Right figure: An example semantic parse for baseball. [sent-72, score-0.411]

40 3 Unsupervised Learning of the Acoustic Unit Lexicon At the lowest level of the hierarchical structure specified by the model of Figure 1a is a sequence of atomic acoustic units, as described earlier. [sent-73, score-0.425]

41 In reality, the number of such acoustic units is very large, possibly even infinite. [sent-74, score-0.428]

42 For the task of learning a lexicon of lower-level acoustic units, we leverage an unsupervised learning framework proposed in [14], which employs the generative model shown in Figure 2 to describe audio recordings. [sent-76, score-0.959]

43 We define a finite set of audio symbols A, and corresponding to each symbol a ∈ A, we define an acoustic model λ_a, and we refer to the set of all acoustic models as Λ. [sent-77, score-1.16]

44 Thereafter, for each symbol a_t in T, a variable-length audio segment D_{a_t} is generated in accordance with λ_{a_t}. [sent-79, score-0.434]

45 The final audio D comprises the concatenation of the audio segments corresponding to all the symbols in T . [sent-80, score-0.976]

46 Similar to [14], we represent each acoustic unit as a 5-state HMM with gaussian mixture output densities. [sent-81, score-0.34]

47 The learnt parameters λa for each symbol a ∈ A allow us to decode any new audio file in terms of the set of symbols. [sent-83, score-0.507]

48 While these symbols are not guaranteed to have any semantic interpretations, we expect them to capture acoustically consistent phenomena, and we see later that they do so in Figure 7. [sent-84, score-0.29]

49 The symbols may hence be interpreted as representing generalized acoustic units (representing clusters of basic sound units). [sent-85, score-0.599]

50 H and Λ are the language model and acoustic model parameters, T is the latent transcript and D is the observed data. [sent-89, score-0.405]

51 1 A Generative Model for Inducing Patterns over AUDs As discussed in Section 2, we expect that audio data are composed of a sequence of semantically meaningful events which manifest themselves in various acoustic forms, depending on the context. [sent-92, score-1.121]

52 The acoustic unit (AUD) lexicon described in Section 3 automatically learns the various acoustic manifestations from a dataset, but these units do not have interpretable semantic meaning. [sent-93, score-0.963]

53 In this section, we introduce a generative model for the second layer in Fig 1a where the semantically interpretable acoustic events generate lower level AUDs (and thus, the observed audio). [sent-95, score-0.764]

54 The distribution of AUDs for a specific event will be stochastic in nature (e. [sent-96, score-0.207]

55 segments for a cheering event may contain any or all of claps, shouts, speech, music), and the distribution of the events themselves is stochastic and category-dependent. [sent-98, score-0.563]

56 Again, while the number of such events can be expected to be very large, we assume that for a given dataset, a limited number of events can describe the event space fairly well. [sent-99, score-0.655]

57 Further, we expect that the distribution of naturally occurring events in audio will follow the power law properties typically found in natural distributions [15, 16]. [sent-100, score-0.683]

58 Events, drawn from this distribution, then generate lower level acoustic units (AUDs) corresponding to the sounds that are to be produced. [sent-102, score-0.602]

59 Because this process is stochastic, different occurrences of the same event may produce different sequences of AUDs, which are variants of a common underlying pattern. [sent-103, score-0.283]

60 We assume K audio events in the vocabulary, and M distinct AUD tokens, and we can generate a corpus of D documents as follows: for each document d, we first draw a unigram distribution U for the events based on a power-law prior µ. [sent-105, score-0.925]

61 We then draw Nd event tokens from the distribution for the events. [sent-106, score-0.25]

62 Each event token can generate a sequence of AUDs of variable length n, where n is drawn from an event specific distribution α. [sent-107, score-0.534]

63 n AUDs (c_1^n) are now drawn from the multinomial AUD-emission distribution for the event, Φ_event. [sent-108, score-0.207]

64 Thus, in this model, each audio document is a bag of events and each occurrence of an event is a bag of AUDs; the events themselves are distributions over AUDs. [sent-109, score-1.171]
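
The generative story in sentences 60 to 64 can be sketched as a small sampler. Everything concrete here is an assumption made for illustration: the function names, the Zipf-style weights standing in for the power-law prior µ, and the per-event categorical length distribution standing in for α. The paper's actual priors are not given in this excerpt.

```python
import random


def zipf_weights(k, s=1.0):
    """Power-law (Zipf-like) weights over k events; a stand-in for the prior."""
    raw = [1.0 / (rank ** s) for rank in range(1, k + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_document(rng, unigram, length_probs, emissions, n_events):
    """Sample one document as a list of (event, bag of AUDs) pairs.

    unigram[e]         : probability of event e in this document
    length_probs[e][j] : probability that event e emits j + 1 AUDs
    emissions[e][a]    : probability that event e emits AUD a
    """
    doc = []
    for _ in range(n_events):
        e = rng.choices(range(len(unigram)), weights=unigram)[0]
        n = rng.choices(range(1, len(length_probs[e]) + 1),
                        weights=length_probs[e])[0]
        auds = rng.choices(range(len(emissions[e])),
                           weights=emissions[e], k=n)
        doc.append((e, auds))
    return doc


rng = random.Random(0)
unigram = zipf_weights(3)                 # K = 3 events
length_probs = [[0.5, 0.5]] * 3           # each event emits 1 or 2 AUDs
emissions = [[0.25] * 4] * 3              # M = 4 AUD tokens, uniform
doc = sample_document(rng, unigram, length_probs, emissions, n_events=5)
```

Concatenating the AUD lists of a sampled document gives the kind of AUD stream that the learning procedure would later have to segment back into events.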

65 We can then use these parameters to estimate the latent events present in audio based on an observed AUD stream (the AUD stream is obtained by decoding audio as described in Section 3). [sent-113, score-1.194]

66 Unlike those, however, we model each event as a bag of AUDs as opposed to an AUD sequence for two reasons. [sent-115, score-0.269]

67 First, AUD sequences (and indeed, the observed audio) for different instances of the same event will have innate variations. [sent-116, score-0.242]

68 Second, in the case of the audio, presence of multiple sounds may result in noisy AUD streams so that text character streams which are usually clean are not directly analogous; instead, noisy, badly spelt text might be a better analogy. [sent-117, score-0.3]

69 1 Latent Variable Estimation in the Learning Framework Figure 4: An example automaton for a word of maximum length 3. [sent-129, score-0.191]

70 Figure 5: An automaton with the K word automatons in parallel for decoding a token stream. We construct an automaton for each of the K events; an example is shown in Figure 4. [sent-131, score-0.395]

71 Based on these, states can skip to the final state, thus accounting for variable lengths of events in terms of number of AUDs. [sent-136, score-0.256]

72 We define S as the set of all start states for events, so that S_i = start state of event i. [sent-137, score-0.207]

73 Since we model event occurrences as bags of AUDs, AUD emission probabilities are shared by all states for a given event. [sent-138, score-0.288]

74 The automatons for the events are now put together as shown in Figure 5– the black circle represents a dummy start state, and terminal states for each event can transition to this start state. [sent-139, score-0.53]

75 P_d(w_i) represents the probability of the event w_i given the unigram distribution for the document d. [sent-140, score-0.301]
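
Decoding an AUD stream against parallel event automatons can be approximated by a Viterbi-style dynamic program over segment boundaries. The sketch below is a simplification under stated assumptions, not the authors' procedure: it scores each candidate segment by the event's unigram log-probability plus shared bag-of-AUDs emission log-probabilities, it caps segments at a hypothetical `max_len`, and it ignores the event-specific length distribution.

```python
import math


def segment(auds, log_prior, log_emit, max_len):
    """Best-scoring split of an AUD stream into events.

    auds           : list of AUD indices
    log_prior[e]   : log probability of event e in this document
    log_emit[e][a] : log probability that event e emits AUD a
    max_len        : assumed maximum number of AUDs per event
    Returns (segments, score), with segments as (start, end, event).
    """
    n = len(auds)
    best = [-math.inf] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            for e in range(len(log_prior)):
                score = (best[start] + log_prior[e]
                         + sum(log_emit[e][a] for a in auds[start:end]))
                if score > best[end]:
                    best[end] = score
                    back[end] = (start, e)
    # Trace the back-pointers from the end of the stream to the start.
    segs = []
    end = n
    while end > 0:
        start, e = back[end]
        segs.append((start, end, e))
        end = start
    return segs[::-1], best[n]


# Toy setup: event 0 mostly emits AUD 0, event 1 mostly emits AUD 1.
log_prior = [math.log(0.5), math.log(0.5)]
log_emit = [[math.log(0.9), math.log(0.1)],
            [math.log(0.1), math.log(0.9)]]
segs, score = segment([0, 0, 1, 1], log_prior, log_emit, max_len=2)
```

In this toy case the stream splits into one event-0 segment followed by one event-1 segment, because every extra segment pays an additional prior cost.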

76 Thus, if for word i, we have a set of m occurrences in these paths of lengths n1 , n2 , . [sent-152, score-0.151]

77 Then, we present results using the 2-level hierarchical model on the event kit of the 2011 TRECVID Multimedia Event Detection (MED) task [22]. [sent-164, score-0.258]

78 Again, the learner automatically identified the most frequent word to be one which had highest emission probabilities for {‘. [sent-176, score-0.18]

79 ’, ‘c’, ‘o’, ‘m’} and the second most frequent word with {‘h’, ‘t’, ‘p’, ‘/’, ‘:’, ‘w’} characters having high probabilities. [sent-177, score-0.153]

80 board trick, feeding an animal– full list at [22]), and was to be used to build detectors for each semantic category, so that given a new file, it can predict whether it belongs to any of those categories or not. [sent-182, score-0.194]

81 All our reported results use 8-fold cross validation on the entire event kit. [sent-183, score-0.207]

82 While the AUDs are not required to have clear semantic interpretations, listening to the concatenated instances shows that the AUD on the left primarily spans music segments while the right consists primarily of speech– speech formant structures are visible in the image. [sent-188, score-0.355]

83 We then use the decoded AUD sequences as character streams to learn parameters for the second layer of observable acoustic events spanning local AUD sequences. [sent-189, score-0.77]

84 Since there are no annotations available, these events [Table 1: Performance summary across MED11 dataset (lower is better). Columns: System, Average AUC, Best Performance in #categories, #SSI-over-VQ, #SSI-over-FOLEY, #SSI-over-AUD-FREQ, #SSI-over-EVENT-FREQ. Values not recoverable.] [sent-191, score-0.264]

85 One such event consists of sequences of sounds that relate to crowds with loud cheering and a babble of voices in a party being subsumed within the same event. [sent-197, score-0.521]

86 We use the decoded AUDs and event sequences for each file to characterize the MED11 data, and evaluate the effect of using the AUDs layer and the event layer individually (AUD-FREQ and EVENT-FREQ, respectively) and together (COMB). [sent-198, score-0.638]

87 The first is a VQ baseline (VQ) where a set of audio words is learned unsupervised by applying K-Means on the data at the frame level. [sent-202, score-0.495]
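
A VQ baseline of this kind is commonly built as a bag of audio words: cluster frames with K-Means, then describe each file by a histogram of nearest codewords. The sketch below is a minimal stdlib version under that assumption. The function names, the plain Lloyd iterations, and the normalized histogram are illustrative choices, not details confirmed by the excerpt.

```python
import random


def nearest(frame, codebook):
    """Index of the codeword closest to frame (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda j: sum((x - c) ** 2
                                 for x, c in zip(frame, codebook[j])))


def kmeans(frames, k, iters=10, seed=0):
    """Plain Lloyd iterations, initialized from k randomly chosen frames."""
    rng = random.Random(seed)
    codebook = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in frames:
            groups[nearest(f, codebook)].append(f)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster went empty
                codebook[j] = [sum(col) / len(g) for col in zip(*g)]
    return codebook


def histogram(frames, codebook):
    """Normalized count of frames assigned to each codeword."""
    counts = [0] * len(codebook)
    for f in frames:
        counts[nearest(f, codebook)] += 1
    total = float(len(frames))
    return [c / total for c in counts]


# Toy data: two well-separated groups of 2-dimensional "frames".
frames = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
codebook = kmeans(frames, k=2)
hist = histogram(frames, codebook)
```

The histogram is the per-file feature vector a downstream classifier would consume; the reported system used 4096 clusters rather than the two shown here.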

88 The second uses an audio library to create a supervised sound library from the 480 sound types in the Foley Sound Library [23], and we characterize each file using occurrence information of these sounds in the file (FOLEY). [sent-203, score-0.915]

89 We used the best performing lexicon size for the various systems– 4096 clusters for the VQ, 480 Foley audio events, 1024 AUDs, and 128 acoustic events. [sent-206, score-0.851]

90 This work presents an initial approach to extracting such deeper semantic features from audio based on local patterns of low-level acoustic units. [sent-213, score-1.005]

91 Since the discovered latent events and acoustic units do not have true labels, we would like to explore ways to leverage tags, knowledge bases and human annotators to induce labels. [sent-215, score-0.652]

92 In such settings, we would like to explore non-parametric techniques that can grow the event set based on data. [sent-216, score-0.207]

93 Finally, we would like to use such event structure to study co-occurrences and dependencies of acoustic event types that might allow us to predict sounds in the future based on the context. [sent-218, score-0.928]

94 Content-based audio classification and retrieval by support vector machines. [sent-240, score-0.537]

95 Mixture of probability experts for audio retrieval and indexing. [sent-244, score-0.491]

96 Classification of sound clips by two schemes: using onomatopoeia and semantic labels. [sent-264, score-0.274]

97 Classification of tv programs based on audio information using hidden markov model. [sent-270, score-0.464]

98 Unsupervised learning of acoustic unit descriptors for audio content representation and classification. [sent-306, score-0.774]

99 Bayesian unsupervised word segmentation with nested Pitman-Yor language modeling. [sent-320, score-0.217]

100 Improving nonparametric Bayesian inference: Experiments on unsupervised word segmentation with adaptor grammars. [sent-338, score-0.183]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('audio', 0.434), ('aud', 0.385), ('auds', 0.35), ('acoustic', 0.34), ('events', 0.224), ('event', 0.207), ('sounds', 0.174), ('semantic', 0.149), ('sound', 0.125), ('multimedia', 0.112), ('semantics', 0.094), ('units', 0.088), ('automaton', 0.086), ('word', 0.078), ('lexicon', 0.077), ('acoustically', 0.07), ('cheering', 0.07), ('foley', 0.07), ('layer', 0.064), ('semantically', 0.064), ('segments', 0.062), ('baseball', 0.062), ('trecvid', 0.062), ('unsupervised', 0.061), ('token', 0.059), ('retrieval', 0.057), ('deeper', 0.054), ('sundaram', 0.053), ('tir', 0.053), ('le', 0.052), ('hierarchical', 0.051), ('wi', 0.051), ('stream', 0.051), ('auc', 0.049), ('streams', 0.049), ('vq', 0.049), ('generative', 0.047), ('symbols', 0.046), ('classication', 0.046), ('characters', 0.045), ('categories', 0.045), ('segmentation', 0.044), ('unigram', 0.043), ('music', 0.043), ('tokens', 0.043), ('occurrences', 0.041), ('acoustics', 0.041), ('concatenated', 0.041), ('comb', 0.04), ('annotations', 0.04), ('emission', 0.04), ('learnt', 0.038), ('ni', 0.038), ('cn', 0.037), ('terminal', 0.037), ('sequences', 0.035), ('emitted', 0.035), ('decode', 0.035), ('phenomena', 0.035), ('automatons', 0.035), ('bhiksha', 0.035), ('efi', 0.035), ('expo', 0.035), ('georgiou', 0.035), ('loud', 0.035), ('metallic', 0.035), ('morphological', 0.035), ('sequence', 0.034), ('di', 0.034), ('language', 0.034), ('med', 0.033), ('lengths', 0.032), ('automatically', 0.032), ('speech', 0.031), ('characterize', 0.031), ('transcript', 0.031), ('decoded', 0.03), ('tv', 0.03), ('frequent', 0.03), ('technologies', 0.029), ('zipf', 0.029), ('listening', 0.029), ('eqn', 0.029), ('urls', 0.029), ('character', 0.028), ('bag', 0.028), ('patterns', 0.028), ('ij', 0.028), ('length', 0.027), ('binomial', 0.027), ('dummy', 0.027), ('linguistics', 0.026), ('occurrence', 0.026), ('backward', 0.026), ('count', 0.026), ('interpretable', 0.025), ('detection', 0.025), ('expect', 0.025), ('oracle', 0.025), ('annotation', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

2 0.22306231 150 nips-2012-Hierarchical spike coding of sound

Author: Yan Karklin, Chaitanya Ekanadham, Eero P. Simoncelli

Abstract: Natural sounds exhibit complex statistical regularities at multiple scales. Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as understanding the mechanisms of auditory perception. Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. Patterns in the positions of first layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. When fit to speech data, the second layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods.

3 0.10756665 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

Author: Hyunsin Park, Sungrack Yun, Sanghyuk Park, Jongmin Kim, Chang D. Yoo

Abstract: For phoneme classification, this paper describes an acoustic model based on the variational Gaussian process dynamical system (VGPDS). The nonlinear and nonparametric acoustic model is adopted to overcome the limitations of classical hidden Markov models (HMMs) in modeling speech. The Gaussian process prior on the dynamics and emission functions respectively enable the complex dynamic structure and long-range dependency of speech to be better represented than that by an HMM. In addition, a variance constraint in the VGPDS is introduced to eliminate the sparse approximation error in the kernel matrix. The effectiveness of the proposed model is demonstrated with three experimental results, including parameter estimation and classification performance, on the synthetic and benchmark datasets.

4 0.099608473 276 nips-2012-Probabilistic Event Cascades for Alzheimer's disease

Author: Jonathan Huang, Daniel Alexander

Abstract: Accurate and detailed models of neurodegenerative disease progression are crucially important for reliable early diagnosis and the determination of effective treatments. We introduce the ALPACA (Alzheimer’s disease Probabilistic Cascades) model, a generative model linking latent Alzheimer’s progression dynamics to observable biomarker data. In contrast with previous works which model disease progression as a fixed event ordering, we explicitly model the variability over such orderings among patients which is more realistic, particularly for highly detailed progression models. We describe efficient learning algorithms for ALPACA and discuss promising experimental results on a real cohort of Alzheimer’s patients from the Alzheimer’s Disease Neuroimaging Initiative.

5 0.098553315 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

Author: Hiroki Terashima, Masato Okada

Abstract: The computational modelling of the primary auditory cortex (A1) has been less fruitful than that of the primary visual cortex (V1) due to the less organized properties of A1. Greater disorder has recently been demonstrated for the tonotopy of A1 that has traditionally been considered to be as ordered as the retinotopy of V1. This disorder appears to be incongruous, given the uniformity of the neocortex; however, we hypothesized that both A1 and V1 would adopt an efficient coding strategy and that the disorder in A1 reflects natural sound statistics. To provide a computational model of the tonotopic disorder in A1, we used a model that was originally proposed for the smooth V1 map. In contrast to natural images, natural sounds exhibit distant correlations, which were learned and reflected in the disordered map. The auditory model predicted harmonic relationships among neighbouring A1 cells; furthermore, the same mechanism used to model V1 complex cells reproduced nonlinear responses similar to the pitch selectivity. These results contribute to the understanding of the sensory cortices of different modalities in a novel and integrated manner.

6 0.076605499 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

7 0.076048881 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

8 0.073914349 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

9 0.072983406 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

10 0.06786783 155 nips-2012-Human memory search as a random walk in a semantic network

11 0.065313719 197 nips-2012-Learning with Recursive Perceptual Representations

12 0.065003231 238 nips-2012-Neurally Plausible Reinforcement Learning of Working Memory Tasks

13 0.064841412 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

14 0.064045981 72 nips-2012-Cocktail Party Processing via Structured Prediction

15 0.063944325 182 nips-2012-Learning Networks of Heterogeneous Influence

16 0.063907862 205 nips-2012-MCMC for continuous-time discrete-state systems

17 0.06267184 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

18 0.057978459 14 nips-2012-A P300 BCI for the Masses: Prior Information Enables Instant Unsupervised Spelling

19 0.05478818 12 nips-2012-A Neural Autoregressive Topic Model

20 0.053199172 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.134), (1, 0.03), (2, -0.112), (3, 0.02), (4, -0.028), (5, -0.011), (6, -0.003), (7, 0.016), (8, -0.011), (9, 0.007), (10, 0.035), (11, 0.058), (12, 0.008), (13, 0.004), (14, -0.022), (15, -0.041), (16, -0.023), (17, 0.029), (18, 0.037), (19, -0.098), (20, 0.001), (21, 0.019), (22, -0.098), (23, -0.146), (24, 0.047), (25, -0.029), (26, 0.047), (27, -0.079), (28, 0.013), (29, 0.003), (30, 0.233), (31, 0.025), (32, 0.034), (33, -0.032), (34, -0.076), (35, -0.004), (36, 0.165), (37, -0.02), (38, 0.134), (39, 0.087), (40, 0.073), (41, 0.203), (42, -0.162), (43, -0.007), (44, -0.04), (45, 0.062), (46, -0.065), (47, 0.117), (48, -0.001), (49, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9476853 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

2 0.76737261 150 nips-2012-Hierarchical spike coding of sound

Author: Yan Karklin, Chaitanya Ekanadham, Eero P. Simoncelli

Abstract: Natural sounds exhibit complex statistical regularities at multiple scales. Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as understanding the mechanisms of auditory perception. Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. Patterns in the positions of first layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. When fit to speech data, the second layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods.

3 0.56708598 72 nips-2012-Cocktail Party Processing via Structured Prediction

Author: Yuxuan Wang, Deliang Wang

Abstract: While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. To capture complex, nonlinear relationship between input and output, both state and transition feature functions in CRFs are learned by deep neural networks. The formulation of the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility. The proposed system substantially outperforms existing ones in a variety of noises.

4 0.56666154 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

Author: Hyunsin Park, Sungrack Yun, Sanghyuk Park, Jongmin Kim, Chang D. Yoo

Abstract: For phoneme classification, this paper describes an acoustic model based on the variational Gaussian process dynamical system (VGPDS). The nonlinear and nonparametric acoustic model is adopted to overcome the limitations of classical hidden Markov models (HMMs) in modeling speech. The Gaussian process priors on the dynamics and emission functions enable the complex dynamic structure and long-range dependency of speech to be better represented than by an HMM. In addition, a variance constraint in the VGPDS is introduced to eliminate the sparse approximation error in the kernel matrix. The effectiveness of the proposed model is demonstrated with three experimental results, including parameter estimation and classification performance, on synthetic and benchmark datasets.

5 0.49008635 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

Author: Charles Blundell, Jeff Beck, Katherine A. Heller

Abstract: We present a Bayesian nonparametric model that discovers implicit social structure from interaction time-series data. Social groups are often formed implicitly, through actions among members of groups. Yet many models of social networks use explicitly declared relationships to infer social structure. We consider a particular class of Hawkes processes, a doubly stochastic point process, that is able to model reciprocity between groups of individuals. We then extend the Infinite Relational Model by using these reciprocating Hawkes processes to parameterise its edges, making events associated with edges co-dependent through time. Our model outperforms general, unstructured Hawkes processes as well as structured Poisson process-based models at predicting verbal and email turn-taking, and military conflicts among nations.

6 0.47621486 155 nips-2012-Human memory search as a random walk in a semantic network

7 0.4720929 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

8 0.46929932 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

9 0.44463098 276 nips-2012-Probabilistic Event Cascades for Alzheimer's disease

10 0.44114175 289 nips-2012-Recognizing Activities by Attribute Dynamics

11 0.39358613 136 nips-2012-Forward-Backward Activation Algorithm for Hierarchical Hidden Markov Models

12 0.35133037 306 nips-2012-Semantic Kernel Forests from Multiple Taxonomies

13 0.34215054 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

14 0.3289237 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

15 0.32764462 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

16 0.32200974 205 nips-2012-MCMC for continuous-time discrete-state systems

17 0.31850591 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

18 0.31558648 266 nips-2012-Patient Risk Stratification for Hospital-Associated C. diff as a Time-Series Classification Task

19 0.31381142 182 nips-2012-Learning Networks of Heterogeneous Influence

20 0.31326529 2 nips-2012-3D Social Saliency from Head-mounted Cameras


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.076), (17, 0.018), (21, 0.034), (38, 0.079), (39, 0.013), (42, 0.026), (54, 0.017), (55, 0.052), (74, 0.04), (76, 0.157), (77, 0.018), (80, 0.073), (92, 0.053), (98, 0.254)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87035263 154 nips-2012-How They Vote: Issue-Adjusted Models of Legislative Behavior

Author: Sean Gerrish, David M. Blei

Abstract: We develop a probabilistic model of legislative data that uses the text of the bills to uncover lawmakers’ positions on specific political issues. Our model can be used to explore how a lawmaker’s voting patterns deviate from what is expected and how that deviation depends on what is being voted on. We derive approximate posterior inference algorithms based on variational methods. Across 12 years of legislative data, we demonstrate both improvement in held-out predictive performance and the model’s utility in interpreting an inherently multi-dimensional space.

same-paper 2 0.78894985 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

Author: Sourish Chaudhuri, Bhiksha Raj

Abstract: Approaches to audio classification and retrieval tasks largely rely on detection-based discriminative models. We submit that such models make a simplistic assumption in mapping acoustics directly to semantics, whereas the actual process is likely more complex. We present a generative model that maps acoustics in a hierarchical manner to increasingly higher-level semantics. Our model has two layers with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. We evaluate our model on a large-scale retrieval task from TRECVID 2011, and report significant improvements over standard baselines.

3 0.78217876 267 nips-2012-Perceptron Learning of SAT

Author: Alex Flint, Matthew Blaschko

Abstract: Boolean satisfiability (SAT) as a canonical NP-complete decision problem is one of the most important problems in computer science. In practice, real-world SAT sentences are drawn from a distribution that may result in efficient algorithms for their solution. Such SAT instances are likely to have shared characteristics and substructures. This work approaches the exploration of a family of SAT solvers as a learning problem. In particular, we relate polynomial time solvability of a SAT subset to a notion of margin between sentences mapped by a feature function into a Hilbert space. Provided this mapping is based on polynomial time computable statistics of a sentence, we show that the existence of a margin between these data points implies the existence of a polynomial time solver for that SAT subset based on the Davis-Putnam-Logemann-Loveland algorithm. Furthermore, we show that a simple perceptron-style learning rule will find an optimal SAT solver with a bounded number of training updates. We derive a linear time computable set of features and show analytically that margins exist for important polynomial special cases of SAT. Empirical results show an order of magnitude improvement over a state-of-the-art SAT solver on a hardware verification task.

4 0.78077608 55 nips-2012-Bayesian Warped Gaussian Processes

Author: Miguel Lázaro-Gredilla

Abstract: Warped Gaussian processes (WGP) [1] model output observations in regression tasks as a parametric nonlinear transformation of a Gaussian process (GP). The use of this nonlinear transformation, which is included as part of the probabilistic model, was shown to enhance performance by providing a better prior model on several data sets. In order to learn its parameters, maximum likelihood was used. In this work we show that it is possible to use a non-parametric nonlinear transformation in WGP and variationally integrate it out. The resulting Bayesian WGP is then able to work in scenarios in which the maximum likelihood WGP failed: low-data regimes, data with censored values, classification, etc. We demonstrate the superior performance of Bayesian warped GPs on several real data sets.

5 0.70262861 180 nips-2012-Learning Mixtures of Tree Graphical Models

Author: Anima Anandkumar, Furong Huang, Daniel J. Hsu, Sham M. Kakade

Abstract: We consider unsupervised estimation of mixtures of discrete graphical models, where the class variable is hidden and each mixture component can have a potentially different Markov graph structure and parameters over the observed variables. We propose a novel method for estimating the mixture components with provable guarantees. Our output is a tree-mixture model which serves as a good approximation to the underlying graphical model mixture. The sample and computational requirements for our method scale as poly(p, r), for an r-component mixture of p-variate graphical models, for a wide class of models which includes tree mixtures and mixtures over bounded degree graphs. Keywords: Graphical models, mixture models, spectral methods, tree approximation.

6 0.67481256 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

7 0.63356167 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

8 0.62779486 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

9 0.62675363 188 nips-2012-Learning from Distributions via Support Measure Machines

10 0.62404299 215 nips-2012-Minimizing Uncertainty in Pipelines

11 0.62324226 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

12 0.62197137 192 nips-2012-Learning the Dependency Structure of Latent Factors

13 0.62001014 197 nips-2012-Learning with Recursive Perceptual Representations

14 0.61967784 233 nips-2012-Multiresolution Gaussian Processes

15 0.61912256 294 nips-2012-Repulsive Mixtures

16 0.61869609 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

17 0.61764961 47 nips-2012-Augment-and-Conquer Negative Binomial Processes

18 0.61747891 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

19 0.61685592 126 nips-2012-FastEx: Hash Clustering with Exponential Families

20 0.61663365 220 nips-2012-Monte Carlo Methods for Maximum Margin Supervised Topic Models