nips nips2012 nips2012-72 knowledge-graph by maker-knowledge-mining

72 nips-2012-Cocktail Party Processing via Structured Prediction


Source: pdf

Author: Yuxuan Wang, Deliang Wang

Abstract: While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. To capture complex, nonlinear relationship between input and output, both state and transition feature functions in CRFs are learned by deep neural networks. The formulation of the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility. The proposed system substantially outperforms existing ones in a variety of noises.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. [sent-3, score-0.13]

2 We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. [sent-4, score-0.544]

3 To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. [sent-5, score-0.377]

4 To capture complex, nonlinear relationship between input and output, both state and transition feature functions in CRFs are learned by deep neural networks. [sent-6, score-0.109]

5 The formulation of the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility. [sent-7, score-0.305]

6 1 Introduction The cocktail party problem, or the speech separation problem, is one of the central problems in speech processing. [sent-9, score-0.788]

7 A particularly difficult scenario is monaural speech separation, in which mixtures are recorded by a single microphone and the task is to separate the target speech from its interference. [sent-10, score-0.673]

8 This is a severely underdetermined figure-ground separation problem, and has been studied for decades with limited success. [sent-11, score-0.087]

9 Researchers have attempted to solve the monaural speech separation problem from various angles. [sent-12, score-0.437]

10 Computational auditory scene analysis (CASA) [4] is inspired by how the human auditory system functions [5]. [sent-19, score-0.171]

11 CASA has the potential to deal with general acoustic environments but existing systems have limited performance, particularly in dealing with unvoiced speech. [sent-20, score-0.208]

12 Recent studies suggest a new formulation to the cocktail party problem, where the focus is to classify whether a time-frequency (T-F) unit is dominated by the target speech [6]. [sent-21, score-0.46]

13 Motivated by this viewpoint, we propose to approach the monaural speech separation problem via structured prediction. [sent-22, score-0.473]

14 The use of structured predictors, as opposed to binary classifiers, is motivated by the temporal dynamics of the speech signal. [sent-23, score-0.394]

15 2 Separation as binary classification We aim to estimate a time-frequency matrix called the ideal binary mask (IBM). [sent-25, score-0.089]

16 The IBM is a binary matrix constructed from premixed target and interference, where 1 indicates that the target energy exceeds the interference energy by a local signal-to-noise ratio (SNR) criterion (LC) in the corresponding T-F unit, and 0 otherwise. [sent-26, score-0.095]
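
For concreteness, here is a minimal sketch of IBM construction from premixed target and interference T-F energies; the (channels × frames) array layout, the default LC of 0 dB, and the epsilon guard are our own assumptions, not details from the paper.

```python
import numpy as np

def ideal_binary_mask(target_tf, interference_tf, lc_db=0.0):
    """Label a T-F unit 1 when target energy exceeds interference energy
    by the local criterion LC (in dB), and 0 otherwise.

    target_tf, interference_tf: (channels, frames) per-unit energies.
    """
    eps = 1e-12  # avoid log(0) on silent units
    local_snr_db = 10.0 * np.log10((target_tf + eps) / (interference_tf + eps))
    return (local_snr_db > lc_db).astype(np.uint8)
```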

17 First, the IBM is directly based on the auditory masking phenomenon whereby a stronger sound tends to mask a weaker one within a critical band. [sent-30, score-0.127]

18 Second, unlike other objectives such as maximizing SNR, it is well established that large human speech intelligibility improvements result from IBM processing, even for very low SNR mixtures [7–9]. [sent-31, score-0.502]

19 Improving human speech intelligibility is considered a gold standard for speech separation. [sent-32, score-0.76]

20 Third, IBM estimation naturally leads to classification, which opens the cocktail party problem to a plethora of machine learning techniques. [sent-33, score-0.141]

21 The output from each filter channel is divided into 20-ms frames with 10-ms frame shift, producing a cochleagram [4]. [sent-36, score-0.15]

22 Due to different spectral properties of speech, a subband classifier is trained for each filter channel independently, with the IBM providing training labels. [sent-37, score-0.141]

23 Acoustic features for each subband classifier are extracted from T-F units in the cochleagram. [sent-38, score-0.091]

24 The target speech is separated by binary weighting of the cochleagram using the estimated IBM [4]. [sent-39, score-0.323]
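
As a rough sketch of this front end, the helper below frames per-channel filter outputs into 20-ms windows with a 10-ms shift and records unit energies; it assumes the auditory filterbank has already been applied, and the loop-based framing is chosen for clarity rather than speed.

```python
import numpy as np

def cochleagram_energies(subband_signals, fs=16000, frame_ms=20, shift_ms=10):
    """Per-unit energies of a cochleagram: each filter-channel output is
    cut into 20-ms frames with a 10-ms shift (the paper's settings).

    subband_signals: (n_channels, n_samples) filterbank outputs.
    Returns a (n_channels, n_frames) energy matrix.
    """
    frame = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_ch, n_samp = subband_signals.shape
    n_frames = 1 + (n_samp - frame) // shift
    cg = np.zeros((n_ch, n_frames))
    for c in range(n_ch):
        for t in range(n_frames):
            seg = subband_signals[c, t * shift : t * shift + frame]
            cg[c, t] = np.sum(seg ** 2)
    return cg
```

Separation then amounts to keeping the units where the estimated mask is 1 during resynthesis, i.e., binary weighting of the subband signals, per [4].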

25 Kim et al. [10] show that estimated masks can improve human speech intelligibility in noise. [sent-42, score-0.531]

26 Wang et al. [12] propose a set of complementary acoustic features that shows further improvements over previous systems. [sent-46, score-0.109]

27 Speech intelligibility studies [9, 10] have evaluated the influence of the hit (HIT) and false-alarm (FA) rate on intelligibility scores. [sent-49, score-0.742]

28 The difference, the HIT−FA rate, is found to be well correlated to human speech intelligibility in noise [10]. [sent-50, score-0.499]

29 The HIT rate is the percent of correctly classified target-dominant T-F units (1’s) in the IBM, and the FA rate is the percent of wrongly classified interference-dominant T-F units (0’s). [sent-51, score-0.098]
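
The sketch below computes these rates from an estimated mask and the IBM; the max() guards against empty classes are our own addition.

```python
import numpy as np

def hit_fa(estimated_mask, ibm):
    """HIT, FA, and HIT-FA for a binary mask estimate.

    HIT: fraction of target-dominant units (1's in the IBM) labeled 1.
    FA:  fraction of interference-dominant units (0's) wrongly labeled 1.
    """
    est = estimated_mask.astype(bool)
    ref = ibm.astype(bool)
    hit = (est & ref).sum() / max(ref.sum(), 1)
    fa = (est & ~ref).sum() / max((~ref).sum(), 1)
    return hit, fa, hit - fa
```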

30 Therefore, it is desirable to design a separation algorithm that maximizes HIT−FA of the output mask. [sent-52, score-0.087]

31 3 Proposed system Dictated by speech production mechanisms, the IBM contains highly structured, rather than random, patterns. [sent-53, score-0.328]

32 However, these works do not treat separation as classification. [sent-61, score-0.087]

33 In this paper, we treat unit classification at each filter channel as a sequence labeling problem and employ linear-chain conditional random fields (CRFs) [15] as subband classifiers. [sent-63, score-0.133]

34 f is a vector-valued feature function associated with each local site (a T-F unit in our task), often categorized into state feature functions s(y_t, x, t) and transition feature functions t(y_{t-1}, y_t, x, t). [sent-69, score-0.274]

35 State feature functions define the local discriminant functions for each T-F unit and transition feature functions capture the interaction between neighboring labels. [sent-70, score-0.102]

36 To simplify notation, all feature functions are written as f(y_{t-1}, y_t, x, t) in the remainder of the paper. [sent-76, score-0.172]
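
To make the roles of the two kinds of feature functions concrete, here is a minimal sketch of the unnormalized log-score of one label sequence in a binary linear-chain CRF; the dense (T, 2) and (T-1, 2, 2) score arrays stand in for learned state and transition functions, and are an assumption of this sketch rather than the paper's parameterization.

```python
import numpy as np

def crf_sequence_score(state_scores, trans_scores, labels):
    """Unnormalized log-score of a 0/1 label sequence along one channel.

    state_scores: (T, 2) local scores s(y_t, x, t) for each T-F unit.
    trans_scores: (T-1, 2, 2) scores t(y_{t-1}, y_t, x, t) coupling
        neighboring labels.
    """
    score = state_scores[0, labels[0]]
    for t in range(1, len(labels)):
        score += trans_scores[t - 1, labels[t - 1], labels[t]]
        score += state_scores[t, labels[t]]
    return score
```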

37 As acoustic features are generally not linearly separable, the direct use of CRFs is unlikely to produce good results. [sent-83, score-0.089]

38 DNNs can be viewed as hierarchical feature detectors that learn increasingly complex feature mappings as the number of hidden layers increases. [sent-88, score-0.084]

39 We first train a DNN in the standard way to classify speech dominance in each T-F unit. [sent-92, score-0.28]

40 In a discriminatively trained DNN, the weights from the last hidden layer to the output layer would define a linear classifier, hence the last hidden layer representations are more amenable to linear classification. [sent-94, score-0.144]
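
A minimal sketch of this two-step recipe follows, using scikit-learn's MLPClassifier as a stand-in DNN on synthetic data; the layer sizes, the random stand-in labels, and the manual ReLU forward pass are our assumptions, not the paper's actual training setup.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Step 1: train a DNN to classify speech dominance per T-F unit.
# X: (n_units, n_acoustic_features), y: 0/1 IBM labels -- both synthetic here.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 40))
y = (X[:, :5].sum(axis=1) > 0).astype(int)  # stand-in labels

dnn = MLPClassifier(hidden_layer_sizes=(64, 64), activation="relu",
                    max_iter=300, random_state=0).fit(X, y)

# Step 2: reuse the last hidden layer as learned features for the CRF.
def last_hidden_features(model, X):
    """ReLU forward pass through all hidden layers, stopping before the
    output layer; these activations are more linearly separable."""
    h = X
    for W, b in zip(model.coefs_[:-1], model.intercepts_[:-1]):
        h = np.maximum(h @ W + b, 0.0)
    return h

crf_inputs = last_hidden_features(dnn, X)
```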

41 We want to point out that an important advantage of using neural networks for feature learning is their efficiency in the test phase; once trained, the nonlinear feature extraction of a DNN is extremely fast (it involves only a forward pass). [sent-103, score-0.103]

42 Test phase efficiency is crucial for real-time implementation of a speech separation system. [sent-106, score-0.367]

43 The proposed model differs from the previous methods in that (1) a discriminatively trained deep architecture is used, and/or (2) a CRF instead of a Viterbi decoder is used on top of a neural network for sequence labeling, and/or (3) nonlinear features are also used in modeling transitions. [sent-114, score-0.104]

44 In addition, the use of a contextual window and the change of objective function discussed in the next subsection are specifically tailored to the speech separation problem. [sent-115, score-0.421]

45 Denote the output label as u_t ∈ {0, 1} and the true label as y_t ∈ {0, 1}. [sent-120, score-0.161]

46 The per-utterance HIT−FA rate can be expressed as $\sum_t u_t y_t / \sum_t y_t - \sum_t u_t (1 - y_t) / \sum_t (1 - y_t)$, where the first term is the HIT rate and the second the FA rate. [sent-121, score-0.653]

47 To make the objective function differentiable, we replace u_t by the marginal probability p(y_t = 1 | x); hence we seek w by maximizing HIT−FA on the training set:

$$\max_{w}\ \frac{\sum_{m}\sum_{t} p(y_t^{(m)} = 1 \mid x^{(m)}, w)\, y_t^{(m)}}{\sum_{m}\sum_{t} y_t^{(m)}} \;-\; \frac{\sum_{m}\sum_{t} p(y_t^{(m)} = 1 \mid x^{(m)}, w)\,\bigl(1 - y_t^{(m)}\bigr)}{\sum_{m}\sum_{t} \bigl(1 - y_t^{(m)}\bigr)}.$$

[sent-122, score-0.467]
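
A sketch of the resulting objective, evaluated on concatenated marginals and labels, is below; it assumes both classes occur in the training labels, and autodiff (or the gradient in Eq. 8) would supply derivatives in practice.

```python
import numpy as np

def soft_hit_fa(marginals, labels):
    """Differentiable HIT-FA surrogate: the hard decision u_t is replaced
    by the CRF marginal p(y_t = 1 | x).

    marginals: p(y_t = 1 | x) for all training units (any flat array).
    labels: matching 0/1 IBM labels y_t.
    """
    p = np.asarray(marginals, dtype=float)
    y = np.asarray(labels, dtype=float)
    hit = (p * y).sum() / y.sum()                  # soft HIT rate
    fa = (p * (1.0 - y)).sum() / (1.0 - y).sum()   # soft FA rate
    return hit - fa
```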

48 A speech utterance (sentence) typically spans several hundred time frames, so numerical stability is critically important in our task. [sent-124, score-0.327]

49 It is easy to show that, with normalized forward and backward scores α and β, the marginal has the simpler form p(y_t | x, w) = α(t, y_t) β(t, y_t). [sent-130, score-0.284]

50 Therefore, the gradient of the marginal is

$$\frac{\partial p(y_t \mid x, w)}{\partial w} = G_{\alpha}(t, y_t)\,\beta(t, y_t) + \alpha(t, y_t)\,G_{\beta}(t, y_t), \qquad (8)$$

where $G_{\alpha}$ and $G_{\beta}$ are the gradients of the normalized forward and backward scores, respectively. [sent-131, score-0.568]
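
For reference, here is one standard way to obtain the normalized scores and marginals: a Rabiner-style scaled forward-backward pass over exponentiated potentials. The dense potential arrays are assumptions of this sketch, and the gradients G_α, G_β of Eq. (8) are not computed here.

```python
import numpy as np

def scaled_forward_backward(state_pot, trans_pot):
    """Marginals p(y_t | x) = alpha(t, y_t) * beta(t, y_t) for a binary
    linear chain, with per-step normalization so that utterances spanning
    hundreds of frames do not underflow.

    state_pot: (T, 2) exponentiated state potentials.
    trans_pot: (T-1, 2, 2) exponentiated transition potentials.
    """
    T = state_pot.shape[0]
    alpha = np.zeros((T, 2))
    beta = np.ones((T, 2))
    c = np.zeros(T)  # per-step scaling constants

    alpha[0] = state_pot[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans_pot[t - 1]) * state_pot[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = trans_pot[t] @ (state_pot[t + 1] * beta[t + 1]) / c[t + 1]

    return alpha * beta  # each row sums to 1; column 1 is p(y_t = 1 | x)
```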

51 This enables us to directly compare with previous intelligibility studies [10], where the same speaker is used in training and testing. [sent-148, score-0.205]

52 The training set is created by mixing 50 utterances with 12 noises at 0 dB. [sent-149, score-0.099]

53 To create the test set, we choose 20 unseen utterances from the same speaker. [sent-150, score-0.107]

54 The test utterances are first mixed with the 12 training noises to create a matched-noise test condition, then with 5 unseen noises to create an unmatched-noise test condition. [sent-153, score-0.122]

55 We use the suffixes R and P to distinguish training features for the CRF, where R stands for learned features without a context window (features learned from the complementary acoustic feature set mentioned in Section 2) and P stands for a window of posterior features. [sent-159, score-0.23]

56 We use a two-hidden-layer DNN, as it provides a good trade-off between performance and complexity, and a context window spanning 5 time frames and 17 frequency channels to construct the posterior feature vector. [sent-160, score-0.168]
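
The sketch below assembles such windowed posterior features; the zero padding at the spectrogram edges and the naive loops are our own simplifications.

```python
import numpy as np

def posterior_window_features(posteriors, t_span=5, f_span=17):
    """Stack DNN posteriors from a T-F context window (5 frames x 17
    channels here, matching the text) into one feature vector per unit.

    posteriors: (channels, frames) array of p(y = 1) estimates.
    Returns a (channels, frames, t_span * f_span) feature array.
    """
    n_ch, n_fr = posteriors.shape
    pf, pt = f_span // 2, t_span // 2
    padded = np.pad(posteriors, ((pf, pf), (pt, pt)), mode="constant")
    feats = np.empty((n_ch, n_fr, t_span * f_span))
    for c in range(n_ch):
        for t in range(n_fr):
            feats[c, t] = padded[c : c + f_span, t : t + t_span].ravel()
    return feats
```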

57 To evaluate the contribution from the change of the objective alone, we use ideal pitch in the following experiments to neutralize pitch estimation errors. [sent-164, score-0.104]

58 We document HIT−FA rates on three levels: overall, voiced intervals (pitched frames) and unvoiced intervals (unpitched frames). [sent-169, score-0.237]

59 In the matched condition, the improvement by directly maximizing HIT−FA is most significant in unvoiced intervals. [sent-174, score-0.182]

60 For a closer inspection, Figure 2 shows channelwise HIT−FA comparisons on the 0 dB test mixtures in the matched-noise test condition. [sent-180, score-0.096]

61 It is well known that unvoiced speech is indispensable for speech intelligibility but hard to separate. [sent-181, score-0.875]

62 Due to the lack of harmonicity and weak energy, frequency channels containing unvoiced speech often have significantly skewed distributions of target-dominant and interference-dominant units. [sent-182, score-0.442]

63 As an illustration, Figure 3 shows two masks for an utterance mixed with an unseen crowd noise at 0 dB, using DNN and DNN-CRF*-P respectively. [sent-184, score-0.187]

64 However, it is clear that the DNN mask misses significant portions of unvoiced speech. [sent-186, score-0.197]

65 1 Test noises are: babble, bird chirp, crow, cocktail party, yelling, clap, rain, rock music, siren, telephone, white, wind, crowd, fan, speech shaped, traffic, and factory noise. [sent-189, score-0.419]

66 Table 2: Performance comparisons when tested on different unseen speakers, reporting Accuracy, HIT−FA, SNR (dB), and SegSNR (dB) under matched-noise and unmatched-noise conditions, for systems including SVM [11]. [sent-273, score-0.106]

67 In summary, direct maximization of HIT−FA improves HIT−FA performance compared to accuracy maximization, especially for unvoiced speech, and the improvement is more significant when the system is tested on unseen acoustic environments. [sent-298, score-0.301]

68 3 Experiment 2: system comparisons We systematically compare the proposed system with three kinds of systems on 0 dB mixtures: binary classifier based, structured predictor based, and speech enhancement based. [sent-300, score-0.488]

69 To compute SNRs, we use the target speech resynthesized from the IBM as the ground truth signal for all classification-based systems. [sent-302, score-0.3]
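
A minimal sketch of that SNR computation follows; passing the IBM-resynthesized target as `reference` matches the protocol described here, while the epsilon guard is our addition.

```python
import numpy as np

def output_snr_db(reference, estimate):
    """SNR (dB) of a separated signal against a reference waveform, here
    the target speech resynthesized from the IBM."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) /
                           (np.sum(noise ** 2) + 1e-12))
```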

70 It is interesting to see that DNN significantly outperforms SVM, especially for unvoiced speech (not shown), which is important for speech intelligibility. [sent-314, score-0.7]

71 Kim et al.'s system fails to generalize to different acoustic environments due to substantially increased FA rates. [sent-318, score-0.116]

72 The proposed system significantly outperforms SVM and DNN, achieving about 71% overall HIT−FA and 10 dB SNR for unseen noises. [sent-319, score-0.093]

73 Since Kim et al.'s system has been shown to improve human speech intelligibility [10], it is reasonable to project that the proposed system will provide further speech intelligibility improvements. [sent-321, score-1.031]

74 With its better ability to encode contextual information, using a window of posteriors as features clearly outperforms single-unit features in terms of classification. [sent-331, score-0.115]

75 Finally, we compare with two representative speech enhancement systems [1, 2]. [sent-335, score-0.335]

76 The algorithm proposed in [1] represents a recent state-of-the-art method and Wiener filtering [2] is one of the most widely used speech enhancement algorithms. [sent-336, score-0.335]

77 Since speech enhancement does not aim to estimate the IBM, we compare SNRs by using clean speech (not the IBM) as the ground truth. [sent-337, score-0.615]

78 As shown in Table 1, the speech enhancement algorithms are much worse, and this is true of all 17 noises. [sent-338, score-0.335]

79 Due to temporal continuity modeling and the use of T-F context, the proposed system produces masks that are smoother than those from the other systems. [sent-339, score-0.174]

80 4 Experiment 3: speaker generalization Although the training set contains only a single IEEE speaker, the proposed system generalizes reasonably well to different unseen speakers. [sent-344, score-0.123]

81 We show the results in Table 2, and it is clear that the proposed system generalizes better than existing ones to unseen speakers. [sent-347, score-0.093]

82 5 Discussion and conclusion Listening tests have shown that a high FA rate is more detrimental to speech intelligibility than a high miss rate (i.e., a low HIT rate) [9]. [sent-349, score-0.455]

83 These may include objectives concerning either speech intelligibility or quality, as long as the objective of interest can be expressed or approximated by a combination of marginal probabilities. [sent-355, score-0.455]

84 One such objective combines P_EL and P_NR (e.g., [25]), where P_EL represents the percent of target energy loss and P_NR the percent of noise energy residue. [sent-358, score-0.127]
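
For illustration only, here is one simplified mask-domain reading of these two quantities, assuming per-unit target and interference energies are available; the definitions in [25] operate on resynthesized waveforms, so this sketch should not be taken as the exact metric.

```python
import numpy as np

def pel_pnr(est_mask, ibm, target_tf, interference_tf):
    """Simplified mask-domain P_EL / P_NR (illustrative only).

    P_EL: share of target energy in IBM-1 units dropped by the estimate.
    P_NR: share of the retained mixture energy that is leaked interference.
    """
    eps = 1e-12
    missed = (ibm == 1) & (est_mask == 0)
    leaked = (ibm == 0) & (est_mask == 1)
    p_el = target_tf[missed].sum() / (target_tf[ibm == 1].sum() + eps)
    kept = (target_tf[est_mask == 1].sum()
            + interference_tf[est_mask == 1].sum())
    p_nr = interference_tf[leaked].sum() / (kept + eps)
    return p_el, p_nr
```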

85 We have demonstrated that the challenging monaural speech separation problem can be effectively approached via structured prediction. [sent-360, score-0.473]

86 Observing that the IBM exhibits highly structured patterns, we have proposed to use a CRF to explicitly model the temporal continuity in the IBM. [sent-361, score-0.111]

87 Consistent with the results from speech perception, we train the proposed DNN-CRF model to maximize a measure that is well correlated to human speech intelligibility in noise. [sent-363, score-0.76]

88 Experimental results show that the proposed system significantly outperforms existing ones and generalizes better to different acoustic environments. [sent-364, score-0.116]

89 Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines, Divenyi P. [sent-387, score-0.138]

90 Loizou, “Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction,” J. [sent-413, score-0.226]

91 Loizou, “An algorithm that improves speech intelligibility in noise for normal-hearing listeners,” J. [sent-425, score-0.474]

92 Wang, “An SVM based classification approach to speech separation,” in ICASSP, 2011. [sent-434, score-0.28]

93 Wang, “Exploring monaural features for classification-based speech segregation,” IEEE Trans. [sent-438, score-0.371]

94 Smaragdis, “A non-negative approach to semi-supervised separation of speech from noise with the use of temporal dynamics,” in ICASSP, 2011. [sent-444, score-0.433]

95 Olsen, “Single channel speech separation using factorial dynamics,” in NIPS, 2007. [sent-449, score-0.434]

96 Hinton, “Deep belief networks for phone recognition,” in NIPS workshop on speech recognition and related applications, 2009. [sent-480, score-0.28]

97 Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. [sent-485, score-0.304]

98 [24] IEEE, “IEEE recommended practice for speech quality measurements,” IEEE Trans. [sent-490, score-0.28]

99 Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” IEEE Trans. [sent-497, score-0.351]

100 Garofolo, DARPA TIMIT acoustic-phonetic continuous speech corpus, NIST, 1993. [sent-507, score-0.28]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dnn', 0.575), ('hit', 0.392), ('fa', 0.291), ('speech', 0.28), ('db', 0.229), ('crf', 0.212), ('intelligibility', 0.175), ('yt', 0.142), ('unvoiced', 0.14), ('ibm', 0.138), ('snr', 0.088), ('separation', 0.087), ('cnf', 0.085), ('cocktail', 0.082), ('monaural', 0.07), ('acoustic', 0.068), ('channel', 0.067), ('party', 0.059), ('dnns', 0.058), ('segsnr', 0.058), ('noises', 0.057), ('mask', 0.057), ('enhancement', 0.055), ('masks', 0.051), ('auditory', 0.049), ('system', 0.048), ('utterance', 0.047), ('temporal', 0.047), ('snrs', 0.047), ('subband', 0.047), ('voiced', 0.047), ('unseen', 0.045), ('classi', 0.044), ('utterances', 0.042), ('unmatched', 0.041), ('crfs', 0.039), ('structured', 0.036), ('pitch', 0.036), ('window', 0.035), ('casa', 0.035), ('channelwise', 0.035), ('hendriks', 0.035), ('segregation', 0.035), ('frames', 0.034), ('kim', 0.034), ('deep', 0.033), ('ideal', 0.032), ('dynamics', 0.031), ('feature', 0.03), ('speaker', 0.03), ('continuity', 0.028), ('elds', 0.028), ('trained', 0.027), ('timit', 0.027), ('percent', 0.026), ('icassp', 0.026), ('frame', 0.026), ('crowd', 0.025), ('human', 0.025), ('intervals', 0.025), ('gmm', 0.024), ('hidden', 0.024), ('wang', 0.024), ('ams', 0.023), ('wiener', 0.023), ('layer', 0.023), ('units', 0.023), ('cochleagram', 0.023), ('listeners', 0.023), ('loizou', 0.023), ('microphone', 0.023), ('mysore', 0.023), ('pel', 0.023), ('nonlinear', 0.023), ('lc', 0.023), ('transition', 0.023), ('frequency', 0.022), ('maximizing', 0.022), ('features', 0.021), ('comparisons', 0.021), ('masking', 0.021), ('rabiner', 0.021), ('matched', 0.02), ('target', 0.02), ('svm', 0.02), ('complementary', 0.02), ('hinton', 0.02), ('test', 0.02), ('condition', 0.02), ('ut', 0.019), ('score', 0.019), ('et', 0.019), ('hershey', 0.019), ('interference', 0.019), ('morency', 0.019), ('ohio', 0.019), ('noise', 0.019), ('unit', 0.019), ('contextual', 0.019), ('energy', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 72 nips-2012-Cocktail Party Processing via Structured Prediction

Author: Yuxuan Wang, Deliang Wang

Abstract: While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. To capture complex, nonlinear relationship between input and output, both state and transition feature functions in CRFs are learned by deep neural networks. The formulation of the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility. The proposed system substantially outperforms existing ones in a variety of noises.

2 0.15415002 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

Author: Dan Ciresan, Alessandro Giusti, Luca M. Gambardella, Jürgen Schmidhuber

Abstract: We address a central problem of neuroanatomy, namely, the automatic segmentation of neuronal structures depicted in stacks of electron microscopy (EM) images. This is necessary to efficiently map 3D brain structure and connectivity. To segment biological neuron membranes, we use a special type of deep artificial neural network as a pixel classifier. The label of each pixel (membrane or nonmembrane) is predicted from raw pixel values in a square window centered on it. The input layer maps each window pixel to a neuron. It is followed by a succession of convolutional and max-pooling layers which preserve 2D information and extract features with increasing levels of abstraction. The output layer produces a calibrated probability for each class. The classifier is trained by plain gradient descent on a 512 × 512 × 30 stack with known ground truth, and tested on a stack of the same size (ground truth unknown to the authors) by the organizers of the ISBI 2012 EM Segmentation Challenge. Even without problem-specific postprocessing, our approach outperforms competing techniques by a large margin in all three considered metrics, i.e. rand error, warping error and pixel error. For pixel error, our approach is the only one outperforming a second human observer. 1

3 0.13834348 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

4 0.13030192 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

Author: Mathieu Sinn, Bei Chen

Abstract: Conditional Markov Chains (also known as Linear-Chain Conditional Random Fields in the literature) are a versatile class of discriminative models for the distribution of a sequence of hidden states conditional on a sequence of observable variables. Large-sample properties of Conditional Markov Chains have been first studied in [1]. The paper extends this work in two directions: first, mixing properties of models with unbounded feature functions are being established; second, necessary conditions for model identifiability and the uniqueness of maximum likelihood estimates are being given. 1

5 0.11357685 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

Author: Claudio Gentile, Francesco Orabona

Abstract: We present a novel multilabel/ranking algorithm working in partial information settings. The algorithm is based on 2nd-order descent methods, and relies on upper-confidence bounds to trade-off exploration and exploitation. We analyze this algorithm in a partial adversarial setting, where covariates can be adversarial, but multilabel probabilities are ruled by (generalized) linear models. We show O(T 1/2 log T ) regret bounds, which improve in several ways on the existing results. We test the effectiveness of our upper-confidence scheme by contrasting against full-information baselines on real-world multilabel datasets, often obtaining comparable performance. 1

6 0.10424665 150 nips-2012-Hierarchical spike coding of sound

7 0.099067606 197 nips-2012-Learning with Recursive Perceptual Representations

8 0.095579281 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

9 0.075930312 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function

10 0.064045981 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

11 0.062321335 303 nips-2012-Searching for objects driven by context

12 0.061845895 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

13 0.056038201 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

14 0.050612234 292 nips-2012-Regularized Off-Policy TD-Learning

15 0.04654374 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

16 0.04416611 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

17 0.043777272 170 nips-2012-Large Scale Distributed Deep Networks

18 0.043362938 200 nips-2012-Local Supervised Learning through Space Partitioning

19 0.043347094 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

20 0.041691151 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.121), (1, 0.014), (2, -0.036), (3, 0.135), (4, 0.056), (5, -0.057), (6, -0.006), (7, -0.003), (8, -0.004), (9, -0.008), (10, 0.01), (11, 0.039), (12, 0.048), (13, 0.075), (14, 0.027), (15, -0.07), (16, 0.016), (17, 0.01), (18, 0.1), (19, -0.074), (20, -0.025), (21, -0.04), (22, -0.054), (23, 0.015), (24, 0.086), (25, -0.033), (26, 0.02), (27, -0.077), (28, -0.027), (29, 0.04), (30, 0.108), (31, 0.116), (32, -0.017), (33, 0.023), (34, -0.012), (35, -0.066), (36, 0.107), (37, 0.053), (38, 0.021), (39, 0.092), (40, 0.028), (41, 0.026), (42, -0.079), (43, 0.061), (44, 0.024), (45, 0.119), (46, 0.005), (47, 0.035), (48, -0.003), (49, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92740405 72 nips-2012-Cocktail Party Processing via Structured Prediction

Author: Yuxuan Wang, Deliang Wang

Abstract: While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. To capture complex, nonlinear relationship between input and output, both state and transition feature functions in CRFs are learned by deep neural networks. The formulation of the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility. The proposed system substantially outperforms existing ones in a variety of noises.

2 0.62188727 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

3 0.59301281 356 nips-2012-Unsupervised Structure Discovery for Semantic Analysis of Audio

Author: Sourish Chaudhuri, Bhiksha Raj

Abstract: Approaches to audio classification and retrieval tasks largely rely on detectionbased discriminative models. We submit that such models make a simplistic assumption in mapping acoustics directly to semantics, whereas the actual process is likely more complex. We present a generative model that maps acoustics in a hierarchical manner to increasingly higher-level semantics. Our model has two layers with the first layer modeling generalized sound units with no clear semantic associations, while the second layer models local patterns over these sound units. We evaluate our model on a large-scale retrieval task from TRECVID 2011, and report significant improvements over standard baselines. 1

4 0.52757555 150 nips-2012-Hierarchical spike coding of sound

Author: Yan Karklin, Chaitanya Ekanadham, Eero P. Simoncelli

Abstract: Natural sounds exhibit complex statistical regularities at multiple scales. Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as understanding the mechanisms of auditory perception. Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. Patterns in the positions of first layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. When fit to speech data, the second layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods. 1

5 0.51025891 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction

Author: Pietro D. Lena, Ken Nagata, Pierre F. Baldi

Abstract: Residue-residue contact prediction is a fundamental problem in protein structure prediction. Hower, despite considerable research efforts, contact prediction methods are still largely unreliable. Here we introduce a novel deep machine-learning architecture which consists of a multidimensional stack of learning modules. For contact prediction, the idea is implemented as a three-dimensional stack of Neural Networks NNk , where i and j index the spatial coordinates of the contact ij map and k indexes “time”. The temporal dimension is introduced to capture the fact that protein folding is not an instantaneous process, but rather a progressive refinement. Networks at level k in the stack can be trained in supervised fashion to refine the predictions produced by the previous level, hence addressing the problem of vanishing gradients, typical of deep architectures. Increased accuracy and generalization capabilities of this approach are established by rigorous comparison with other classical machine learning approaches for contact prediction. The deep approach leads to an accuracy for difficult long-range contacts of about 30%, roughly 10% above the state-of-the-art. Many variations in the architectures and the training algorithms are possible, leaving room for further improvements. Furthermore, the approach is applicable to other problems with strong underlying spatial and temporal components. 1

6 0.48988056 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

7 0.47001961 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

8 0.46778771 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

9 0.4632692 283 nips-2012-Putting Bayes to sleep

10 0.46290663 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

11 0.44427639 303 nips-2012-Searching for objects driven by context

12 0.44239047 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies

13 0.42089668 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

14 0.41418594 341 nips-2012-The topographic unsupervised learning of natural sounds in the auditory cortex

15 0.40276831 80 nips-2012-Confusion-Based Online Learning and a Passive-Aggressive Scheme

16 0.40216583 197 nips-2012-Learning with Recursive Perceptual Representations

17 0.39535165 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

18 0.3939628 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

19 0.38519076 289 nips-2012-Recognizing Activities by Attribute Dynamics

20 0.37363306 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.092), (21, 0.027), (38, 0.085), (41, 0.298), (42, 0.032), (44, 0.018), (54, 0.027), (55, 0.024), (74, 0.032), (76, 0.077), (80, 0.107), (87, 0.012), (92, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.75011718 72 nips-2012-Cocktail Party Processing via Structured Prediction

Author: Yuxuan Wang, Deliang Wang

Abstract: While human listeners excel at selectively attending to a conversation in a cocktail party, machine performance is still far inferior by comparison. We show that the cocktail party problem, or the speech separation problem, can be effectively approached via structured prediction. To account for temporal dynamics in speech, we employ conditional random fields (CRFs) to classify speech dominance within each time-frequency unit for a sound mixture. To capture complex, nonlinear relationship between input and output, both state and transition feature functions in CRFs are learned by deep neural networks. The formulation of the problem as classification allows us to directly optimize a measure that is well correlated with human speech intelligibility. The proposed system substantially outperforms existing ones in a variety of noises.

2 0.73185641 345 nips-2012-Topic-Partitioned Multinetwork Embeddings

Author: Peter Krafft, Juston Moore, Bruce Desmarais, Hanna M. Wallach

Abstract: We introduce a new Bayesian admixture model intended for exploratory analysis of communication networks—specifically, the discovery and visualization of topic-specific subnetworks in email data sets. Our model produces principled visualizations of email networks, i.e., visualizations that have precise mathematical interpretations in terms of our model and its relationship to the observed data. We validate our modeling assumptions by demonstrating that our model achieves better link prediction performance than three state-of-the-art network models and exhibits topic coherence comparable to that of latent Dirichlet allocation. We showcase our model’s ability to discover and visualize topic-specific communication patterns using a new email data set: the New Hanover County email network. We provide an extensive analysis of these communication patterns, leading us to recommend our model for any exploratory analysis of email networks or other similarly-structured communication data. Finally, we advocate for principled visualization as a primary objective in the development of new network models. 1

3 0.54452741 280 nips-2012-Proper losses for learning from partial labels

Author: Jesús Cid-sueiro

Abstract: This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, we establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices. 1

4 0.51982194 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables

Author: Aaron Dennis, Dan Ventura

Abstract: The sum-product network (SPN) is a recently-proposed deep model consisting of a network of sum and product nodes, and has been shown to be competitive with state-of-the-art deep models on certain difficult tasks such as image completion. Designing an SPN network architecture that is suitable for the task at hand is an open question. We propose an algorithm for learning the SPN architecture from data. The idea is to cluster variables (as opposed to data instances) in order to identify variable subsets that strongly interact with one another. Nodes in the SPN network are then allocated towards explaining these interactions. Experimental evidence shows that learning the SPN architecture significantly improves its performance compared to using a previously-proposed static architecture. 1

5 0.51605755 192 nips-2012-Learning the Dependency Structure of Latent Factors

Author: Yunlong He, Yanjun Qi, Koray Kavukcuoglu, Haesun Park

Abstract: In this paper, we study latent factor models with dependency structure in the latent space. We propose a general learning framework which induces sparsity on the undirected graphical model imposed on the vector of latent factors. A novel latent factor model SLFA is then proposed as a matrix factorization problem with a special regularization term that encourages collaborative reconstruction. The main benefit (novelty) of the model is that we can simultaneously learn the lowerdimensional representation for data and model the pairwise relationships between latent factors explicitly. An on-line learning algorithm is devised to make the model feasible for large-scale learning problems. Experimental results on two synthetic data and two real-world data sets demonstrate that pairwise relationships and latent factors learned by our model provide a more structured way of exploring high-dimensional data, and the learned representations achieve the state-of-the-art classification performance. 1

6 0.50186574 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

7 0.50033814 233 nips-2012-Multiresolution Gaussian Processes

8 0.50007951 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

9 0.49942887 65 nips-2012-Cardinality Restricted Boltzmann Machines

10 0.49930447 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

11 0.49886811 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

12 0.49792489 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

13 0.4977102 200 nips-2012-Local Supervised Learning through Space Partitioning

14 0.49749586 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

15 0.49635652 197 nips-2012-Learning with Recursive Perceptual Representations

16 0.49498212 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

17 0.49330455 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

18 0.49300578 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

19 0.49139935 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

20 0.4905929 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation