nips nips2010 nips2010-207 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Fabian Triefenbach, Azarakhsh Jalalvand, Benjamin Schrauwen, Jean-pierre Martens
Abstract: Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant.
Reference: text
sentIndex sentText sentNum sentScore
1 Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. [sent-4, score-0.484]
2 In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. [sent-7, score-1.018]
3 Basically all state-of-the-art systems utilize Hidden Markov Models (HMMs) to compose an acoustic model that captures the relations between the acoustic signal and the phonemes, defined as the basic contrastive units of the sound system of a spoken language. [sent-10, score-0.356]
4 Two techniques, namely Deep Belief Networks (DBNs) [3, 4] and Long Short-Term Memory (LSTM) recurrent neural networks [5], have already been used with great success for phoneme recognition. [sent-15, score-0.534]
5 In this paper we present the first (to our knowledge) phoneme recognizer that employs Reservoir Computing (RC) [6, 7, 8] as its core technology. [sent-16, score-0.465]
6 The RC concept has already been successfully applied to time series generation [6], robot navigation [9], signal classification [8], audio prediction [10] and isolated spoken digit recognition [11, 12, 13]. [sent-18, score-0.123]
7 In this contribution we envisage an RC system that can recognize the English phonemes in continuous speech. [sent-19, score-0.172]
8 In a short period (a couple of months) we have been able to design a hierarchical system of large reservoirs that can already compete with many state-of-the-art HMMs that have only emerged after several decades of research. [sent-20, score-0.272]
9 2 The speech corpus Since the main aim of this paper is to demonstrate that reservoir computing can yield a good acoustic model, we will conduct experiments on TIMIT, an internationally renowned corpus [14] that was specifically designed to support the development and evaluation of such a model. [sent-22, score-1.123]
10 The TIMIT corpus contains 5040 English sentences spoken by 630 different speakers representing eight dialect groups. [sent-23, score-0.215]
11 The corpus documentation defines a training set of 462 speakers and a test set of 168 different speakers: a main test set of 144 speakers and a core test set of 24 speakers. [sent-25, score-0.202]
12 Each speaker has uttered 10 sentences: two SA sentences which are the same for all speakers, 5 SX-sentences from a list of 450 sentences (each one thus appearing 7 times in the corpus) and 3 SI-sentences from a set of 1890 sentences (each one thus appearing only once in the corpus). [sent-26, score-0.165]
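The corpus figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 5040-sentence total excludes the two SA sentences per speaker (an assumption of this example, since 630 speakers times 10 sentences would give 6300); all variable names are illustrative.

```python
# Sanity check of the TIMIT counts quoted above.
# Assumption: the 5040-sentence total excludes the 2 SA sentences per speaker.
speakers = 630
train_speakers, main_test_speakers, core_test_speakers = 462, 144, 24

sx_per_speaker, si_per_speaker = 5, 3
sx_unique, sx_repeats = 450, 7    # each SX sentence is spoken 7 times
si_unique, si_repeats = 1890, 1   # each SI sentence is spoken once

sx_utterances = sx_unique * sx_repeats   # 450 * 7
si_utterances = si_unique * si_repeats   # 1890 * 1
total_without_sa = sx_utterances + si_utterances

print(total_without_sa)  # 5040
```

Under that assumption the per-speaker and per-sentence counts agree: 5 SX and 3 SI sentences for each of the 630 speakers give the same totals as 450 SX sentences spoken 7 times and 1890 SI sentences spoken once.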
13 It indicates where the phones, defined as the atomic units of the acoustic realizations of the phonemes, begin and end. [sent-29, score-0.11]
14 One is the Classification Error Rate (CER), defined as the percentage of the time the top hypothesis of the tested acoustic model is incorrect. [sent-32, score-0.11]
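As an error rate, the CER counts the frames on which the top hypothesis differs from the reference label. A minimal sketch of that computation follows; the function name and the toy label sequences are hypothetical and serve only to illustrate the definition.

```python
def classification_error_rate(predicted, reference):
    """Percentage of frames whose top hypothesis differs from the reference label."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty label sequences of equal length")
    errors = sum(p != r for p, r in zip(predicted, reference))
    return 100.0 * errors / len(reference)

# Hypothetical frame labels: 2 of the 8 frames are misclassified.
reference = ["k", "k", "ae", "ae", "ae", "t", "t", "s"]
predicted = ["k", "ae", "ae", "ae", "eh", "t", "t", "s"]
cer = classification_error_rate(predicted, reference)
print(cer)  # 25.0
```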
15 Both classification and recognition can be performed at the phone and the phoneme level. [sent-35, score-0.573]
16 The reservoir neurons have an activation function f(x) = logistic(x). [sent-39, score-0.808]
17 Based on its recurrent connections, the reservoir can capture the long-term dynamics of the human articulatory system to perform speech sound classification. [sent-44, score-1.132]
18 Besides the ’memory’ introduced through the recurrent connections, the neurons themselves can also integrate information over time. [sent-46, score-0.177]
19 With such neurons the reservoir state at time k+1 can be computed as follows: $x[k+1] = (1-\lambda)\,x[k] + \lambda f(W_{res}\,x[k] + W_{in}\,u[k])$ (1) with $u[k]$ and $x[k]$ representing the inputs and the reservoir state at time k. [sent-48, score-1.619]
20 As long as the leak rate λ < 1, the integration function provides an additional fading memory of the reservoir state. [sent-51, score-0.803]
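The state update of Equation (1) can be sketched in a few lines of plain Python. This is a toy illustration only: the reservoir size, the weight ranges, the input signal and the leak rate of 0.3 are arbitrary choices for the example, not the settings used in the paper.

```python
import math
import random

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def reservoir_step(x, u, w_res, w_in, leak):
    """One update of Equation (1):
    x[k+1] = (1 - leak) * x[k] + leak * f(W_res x[k] + W_in u[k])."""
    new_x = []
    for i in range(len(x)):
        drive = sum(w_res[i][j] * x[j] for j in range(len(x)))
        drive += sum(w_in[i][j] * u[j] for j in range(len(u)))
        new_x.append((1.0 - leak) * x[i] + leak * logistic(drive))
    return new_x

# Toy sizes and weight ranges are illustrative, not the paper's settings.
rng = random.Random(0)
n_neurons, n_inputs = 5, 3
w_res = [[rng.uniform(-0.5, 0.5) for _ in range(n_neurons)] for _ in range(n_neurons)]
w_in = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(n_neurons)]

x = [0.0] * n_neurons
for k in range(20):
    u = [math.sin(0.3 * k + c) for c in range(n_inputs)]  # made-up input frames
    x = reservoir_step(x, u, w_res, w_in, leak=0.3)
```

With a leak rate of 1 the previous state is discarded and the neuron reduces to a plain logistic unit; smaller leak rates make each neuron average its own activation over more frames.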
21 To perform a classification task, the RC network computes the outputs at time k by means of the following linear equation: $y[k] = W_{out}\,x[k]$ (2) The reservoir state in this equation is augmented with a constant bias. [sent-52, score-0.825]
22 For large training sets, as common in speech processing, the matrices $X^T X$ and $X^T D$ are updated on-line in order to suppress the need for huge storage capacity. [sent-55, score-0.198]
23 This regularization is equivalent to adding Gaussian noise with a variance of $10^{-8}$ to the reservoir state variables. [sent-57, score-0.753]
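The on-line accumulation of the two matrices can be sketched as follows. The example assumes that the readout is the usual regularized least-squares (ridge regression) solution of (X^T X + eps*I) W = X^T D with eps = 1e-8; that form, the function names and the toy data are assumptions of this sketch. The solver is a plain Gaussian elimination that is only suitable for tiny matrices.

```python
def accumulate(A, B, x, d):
    """Add one frame in place: A += x x^T and B += x d^T."""
    n, m = len(x), len(d)
    for i in range(n):
        for j in range(n):
            A[i][j] += x[i] * x[j]
        for j in range(m):
            B[i][j] += x[i] * d[j]

def solve_ridge(A, B, eps=1e-8):
    """Solve (A + eps*I) W = B by Gaussian elimination with partial pivoting."""
    n, m = len(A), len(B[0])
    # Augmented matrix [A + eps*I | B]
    M = [[A[i][j] + (eps if i == j else 0.0) for j in range(n)] + list(B[i])
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + m):
                M[r][c] -= factor * M[col][c]
    # Back substitution, one column of W per output
    W = [[0.0] * m for _ in range(n)]
    for i in reversed(range(n)):
        for j in range(m):
            s = M[i][n + j] - sum(M[i][k] * W[k][j] for k in range(i + 1, n))
            W[i][j] = s / M[i][i]
    return W

# Toy data: two "state" variables plus a constant bias, one target output.
A = [[0.0] * 3 for _ in range(3)]
B = [[0.0] for _ in range(3)]
for a in range(3):
    for b in range(3):
        state = [float(a), float(b), 1.0]           # state augmented with a bias
        accumulate(A, B, state, [2.0 * a - b + 0.5])
W = solve_ridge(A, B)                                # close to [[2.0], [-1.0], [0.5]]
```

Only the N x N matrix A and the N x 41 matrix B need to be kept in memory, regardless of the number of training frames. Here W holds one column per output, so it corresponds to the transpose of W_out in Equation (2).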
24 4 System architecture The main objective of our research is to build an RC-based LVCSR system that can retrieve the words from a spoken utterance. [sent-58, score-0.161]
25 [Figure 2: Hierarchical reservoir architecture with multiple layers.] The preprocessing stage converts the speech waveform into a sequence of acoustic [sent-60, score-1.044]
26 feature vectors representing the acoustic properties in subsequent speech frames. [sent-61, score-0.286]
27 This sequence is supplied to a hierarchical system of RC networks. [sent-62, score-0.145]
28 Each reservoir is composed of LINs which are fully connected to the inputs and to the 41 outputs. [sent-63, score-0.771]
29 The outputs of the last RC network are supplied to a decoder which retrieves the most likely linguistic interpretation of the speech input, given the information computed by the RC networks and given some prior knowledge of the spoken language. [sent-65, score-0.427]
30 In this paper, the decoder is a phoneme recognizer just accommodating a bigram phoneme language model. [sent-66, score-0.885]
31 We conjecture that the integration time of the LINs in the first reservoir should ideally be long enough to capture the co-articulations between successive phonemes emerging from the dynamical constraints of the articulatory system. [sent-68, score-0.895]
32 On the other hand, it has to remain short enough that information pointing to the presence of a short phoneme is not blurred too much by the left phonetic context. [sent-69, score-0.419]
33 Furthermore, we argue that additional reservoirs can correct some of the errors made by the first reservoir. [sent-70, score-0.167]
34 Indeed, such an error correcting reservoir can guess the correct labels from its inputs, and take the past phonetic context into account in an implicit way to refine the decision. [sent-71, score-0.794]
35 The analysis is performed on 25 ms Hamming-windowed speech frames, and subsequent speech frames are shifted over 10 ms with respect to each other. [sent-76, score-0.446]
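The framing described above (25 ms Hamming-windowed frames with a 10 ms shift) can be illustrated with a short sketch. The 16 kHz sampling rate is an assumption of this example, and the function names are hypothetical; the window uses the standard Hamming coefficients 0.54 and 0.46.

```python
import math

def frame_count(n_samples, sample_rate=16000, window_ms=25.0, shift_ms=10.0):
    """Number of complete analysis frames that fit into the signal."""
    window = int(round(sample_rate * window_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if n_samples < window:
        return 0
    return 1 + (n_samples - window) // shift

def hamming(n):
    """Standard Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

# At the assumed 16 kHz rate, 25 ms is 400 samples and 10 ms is 160 samples.
frames_in_one_second = frame_count(16000)
window = hamming(400)
print(frames_in_one_second)  # 98
```

At a 10 ms shift the front end thus delivers roughly 100 feature vectors per second of speech to the first reservoir.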
36 Consequently, by rescaling the features, the impact of the inputs on the activations of the reservoir neurons is changed as well, which makes it compulsory to employ an appropriate input scaling [8]. [sent-81, score-0.867]
37 To establish a proper input scaling the acoustic feature vector is split into six sub-vectors according to the dimensions (energy, cepstrum) and (static, velocity, acceleration). [sent-82, score-0.11]
38 If the $z_i$ were supplied to the reservoir, each sub-vector would on average have the same impact on the reservoir neuron activations. [sent-89, score-0.816]
39 Therefore, in a second stage, the $z_i$ are rescaled to $u_i = \beta_s z_i$ with $\beta_s$ representing the relative importance of sub-vector s in the reservoir neuron activations. [sent-90, score-0.825]
40 [Table 1: Different types of acoustic information in the input features and their optimal scale factors.] The normalization constants $\alpha_s$ follow directly from a statistical analysis of the [sent-91, score-0.11]
41 The factors $\beta_s$ are free parameters that were selected such that the phoneme classification error of a single reservoir system of 1000 neurons is minimized on the validation set. [sent-114, score-1.287]
42 2 Sequence decoding The decoder in our present system performs a Viterbi search for the most likely phonemic sequence given the acoustic inputs and a bigram phoneme language model. [sent-118, score-0.765]
43 The search is driven by a simple model for the conditional likelihood p(y|m) that the reservoir output vector y is observed during the acoustical realization of phoneme m. [sent-119, score-1.166]
44 It controls the relative importance of the acoustic model and the bigram phoneme language model. [sent-129, score-0.55]
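A Viterbi search that balances acoustic scores against a bigram phoneme language model can be sketched as below. This is a simplified illustration, not the decoder of the paper: the per-frame log scores stand in for the conditional likelihood model, staying in the same phoneme is assumed to cost nothing, and the name `lm_weight` for the balancing factor is hypothetical.

```python
import math

def viterbi(frame_logp, log_bigram, lm_weight=1.0):
    """frame_logp: one dict per frame mapping phoneme -> log acoustic score.
    log_bigram: dict mapping (previous, next) -> log P(next | previous).
    Returns the best frame-level path collapsed into a phoneme sequence."""
    phonemes = list(frame_logp[0])

    def transition(prev, cur):
        if prev == cur:
            return 0.0            # assumption: no cost for staying in a phoneme
        if (prev, cur) not in log_bigram:
            return -math.inf      # unseen bigrams are forbidden in this sketch
        return lm_weight * log_bigram[(prev, cur)]

    best = dict(frame_logp[0])    # best score of any path ending in each phoneme
    backpointers = []
    for scores in frame_logp[1:]:
        new_best, pointers = {}, {}
        for cur in phonemes:
            prev = max(phonemes, key=lambda p: best[p] + transition(p, cur))
            new_best[cur] = best[prev] + transition(prev, cur) + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace back the best path, then merge runs of identical labels.
    state = max(phonemes, key=lambda p: best[p])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    collapsed = [path[0]]
    for label in path[1:]:
        if label != collapsed[-1]:
            collapsed.append(label)
    return collapsed

# Made-up scores: the third frame weakly favours "b".
frames = [{"a": -0.1, "b": -3.0}, {"a": -0.1, "b": -3.0},
          {"a": -1.0, "b": -0.8}, {"a": -0.1, "b": -3.0}]
bigram = {("a", "b"): math.log(0.5), ("b", "a"): math.log(0.5)}
with_lm = viterbi(frames, bigram, lm_weight=1.0)
without_lm = viterbi(frames, bigram, lm_weight=0.0)
```

In this toy case the language model cost suppresses the spurious one-frame switch to "b", whereas a weight of zero lets the acoustic scores alone decide and yields the sequence a, b, a.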
45 3 Reservoir optimization The training of the reservoir output nodes is based on Equations (3) and (4) and the desired phoneme labels emerge from a time synchronized phonemic transcription. [sent-131, score-1.241]
46 The recurrent weights of the reservoir are not trained but randomly drawn from statistical distributions. [sent-134, score-0.835]
47 The value of U controls the relative importance of the inputs in the activation of the reservoir neurons and is often called the input scale factor (ISF). [sent-136, score-0.846]
48 The SR describes the dynamical excitability of the reservoir [6, 8]. [sent-138, score-0.754]
49 If the nonlinear function is ignored and the time between frames is $T_f$, the reservoir neurons represent a first-order leaky integrator with a time constant $\tau$ that is related to $\lambda$ by $\lambda = 1 - e^{-T_f/\tau}$. [sent-156, score-0.86]
50 This is confirmed by Figure 3 showing how the phoneme CER of a single reservoir system changes as a function of the integrator time constant. [sent-158, score-1.223]
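The relation between the leak rate and the integrator time constant is easy to evaluate numerically. The sketch below takes the 10 ms frame shift as the time between frames; the function names and the 40 ms example value are illustrative choices, not figures from the paper.

```python
import math

def leak_rate(tau_ms, frame_shift_ms=10.0):
    """Leak rate lambda = 1 - exp(-T_f / tau)."""
    if tau_ms <= 0.0:
        raise ValueError("time constant must be positive")
    return 1.0 - math.exp(-frame_shift_ms / tau_ms)

def time_constant(leak, frame_shift_ms=10.0):
    """Inverse relation: tau = -T_f / ln(1 - lambda), valid for 0 < lambda < 1."""
    if not 0.0 < leak < 1.0:
        raise ValueError("leak rate must lie strictly between 0 and 1")
    return -frame_shift_ms / math.log(1.0 - leak)

lam = leak_rate(40.0)        # an example time constant of 40 ms
tau = time_constant(lam)     # recovers 40 ms
print(round(lam, 4))         # 0.2212
```

A longer time constant gives a smaller leak rate, so each neuron integrates over more frames.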
51 It has been reported [19] that one can easily reduce the number of recurrent connections in an RC network without much affecting its performance. [sent-162, score-0.156]
52 5 Experiments Since our ultimate goal is to perform LVCSR, and since LVCSR systems work with a dictionary of phonemic transcriptions, we have worked with phonemes rather than with phones. [sent-164, score-0.112]
53 As in [20] we consider the 41 phoneme symbols one encounters in a typical phonetic dictionary like COMLEX [21]. [sent-165, score-0.452]
54 The 41 symbols are very similar to the 39 symbols of the reduced phone set proposed by [15], but with one major difference, namely, that a phoneme string does not contain any silences referring to closures of plosive sounds (e. [sent-166, score-0.601]
55 By ignoring confusions between /sh/ and /zh/ and between /ao/ and /aa/ we finally measure phoneme error rates for 39 classes, in order to make them more compliant with the phone error rates for 39 classes reported in other papers. [sent-169, score-0.629]
56 Nevertheless, we will see later that phoneme recognition is harder to accomplish than phone recognition. [sent-170, score-0.573]
57 This is because the closures are easy to recognize and contribute to a low phone error rate. [sent-171, score-0.207]
58 The bigram phoneme language model used for the sequence decoding step is created from the phonemic transcriptions of the training utterances. [sent-174, score-0.555]
59 1 Single reservoir systems In a first experiment we assess the performance of a single reservoir system as a function of the reservoir size, defined as the number of neurons in the reservoir. [sent-176, score-2.353]
60 The phoneme targets during training are derived from the manual acoustic-phonetic segmentation, as explained in Section 4. [sent-177, score-0.401]
61 The latter figure corresponds to the number of trainable parameters in an HMM system comprising 1200 independent Gaussian mixture distributions of 8 mixtures each. [sent-181, score-0.156]
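The comparison of trainable parameters can be made concrete with some rough arithmetic. The sketch below counts the readout weights of a reservoir with 41 outputs and a bias, and the parameters of 1200 Gaussian mixture distributions with 8 components each. The 39-dimensional feature vector and the diagonal covariances are assumptions of this example, and the text does not state which reservoir size "the latter figure" refers to, so the 20000-node case is shown purely as an illustration.

```python
def readout_parameters(n_neurons, n_outputs=41):
    # One weight per neuron plus one bias weight, for every output.
    return (n_neurons + 1) * n_outputs

def gmm_parameters(n_distributions=1200, n_mixtures=8, feature_dim=39):
    # Assumption: diagonal covariances, i.e. a mean and a variance per
    # dimension plus one mixture weight for every component.
    return n_distributions * n_mixtures * (2 * feature_dim + 1)

readout_20k = readout_parameters(20000)
hmm_gmm = gmm_parameters()
print(readout_20k, hmm_gmm)  # 820041 758400
```

Under these assumptions both counts are of the same order of magnitude, several hundred thousand parameters.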
62 Figure 4 shows that the phoneme CER on the training set drops by about 4% every time the reservoir size is doubled. [sent-182, score-1.134]
63 The phoneme CER on the test set shows a similar trend, but the slope is decreasing from 4% at low reservoir sizes to 2% at 20000 neurons (nodes). [sent-183, score-1.187]
64 [Figure 4: The Classification Error Rate (CER) at the phoneme level for the training and test set as a function of the reservoir size.] At that point the CER on the test [sent-184, score-1.134]
65 Although the figures show that an even larger reservoir will perform better, we stopped at 20000 nodes because the storage and the inversion of the large matrix XT X are getting problematic. [sent-189, score-0.771]
66 2 Multilayer reservoir systems Usually, a single reservoir system produces a number of competing outputs at all time steps, and this hampers the identification of the correct phoneme sequence. [sent-192, score-1.973]
67 The left panel of Figure 5 shows the outputs of a reservoir of 8000 nodes in a time interval of 350 ms. [sent-193, score-0.82]
68 Our hypothesis was that the observed confusions are not arbitrary, and that a second reservoir operating on the outputs of the first reservoir system may be able to discover regularities in the error patterns. [sent-194, score-1.643]
69 And indeed, the outputs of this second reservoir happen to exhibit a larger margin between the winner and the competition, as illustrated in the right panel of Figure 5. [sent-195, score-0.782]
70 Figure 5: The outputs of the first (left) and the second (right) layer of a two-layer system composed of two 8000 node reservoirs. [sent-196, score-0.157]
71 In Figure 6, we have plotted the phoneme CER and RER as a function of the number of reservoirs (layers) and the size of these reservoirs. [sent-198, score-0.546]
72 We have thus far only tested systems with equally large reservoirs at every layer. [sent-199, score-0.167]
73 [Figure 6: The phoneme CERs and RERs for different combinations of number of nodes and layers.] For all reservoir sizes, the second layer induces a significant improvement of the CER by 3-4% absolute. [sent-203, score-1.202]
74 The corresponding improvements of the recognition error rates are a little bit less but still significant. [sent-204, score-0.113]
75 The best RER obtained with a two-layer system comprising reservoirs of 20000 nodes is 29. [sent-205, score-0.324]
76 Both plots demonstrate that a third layer does not cause any additional gain when the reservoir size is large enough. [sent-207, score-0.762]
77 We have also included the results of our own experiments conducted with SPRAAK [22], a recently launched HMM-based speech recognition toolkit. [sent-213, score-0.242]
78 In order to provide an easier comparison, we also built a phone recognition system based on the same design parameters that were optimized for phoneme recognition. [sent-214, score-0.652]
79 calculated on the core test set, while the phoneme RERs were measured on the main test set. [sent-217, score-0.405]
80 We do this because most figures in speech community papers apply to these experimental settings. [sent-218, score-0.176]
81 Before discussing our figures in detail we emphasize that the two figures for SPRAAK confirm our earlier statement that phoneme recognition is harder than phone recognition. [sent-220, score-0.573]
82 It is also fair to say that better systems do exist, like the Deep Belief Network system [4] and the hierarchical HMM system with multiple Multi-Layer Perceptrons (MLPs) on top of an HMM system [20]. [sent-233, score-0.263]
83 Such a system is impractical in many applications since it has to wait until the end of a speech utterance to start the recognition. [sent-237, score-0.255]
84 The training of our two-layer 20K reservoir systems takes about 100 hours on a single core 3. [sent-240, score-0.781]
85 6 Conclusion and future work In this paper we showed for the first time that good phoneme recognition on TIMIT can be achieved with a system based on Reservoir Computing. [sent-242, score-0.524]
86 We demonstrated that in order to achieve this, we need large reservoirs (at least 20000 nodes) which are configured in a hierarchical way. [sent-243, score-0.193]
87 By stacking two reservoir layers, we were able to achieve error rates that are competitive with what is attainable using state-of-the-art HMM technology. [sent-244, score-0.78]
88 Our results support the idea that reservoirs can exploit long-term dynamic properties of the articulatory system in continuous speech recognition. [sent-245, score-0.464]
89 To achieve this improvement we will investigate even larger reservoirs with 50000 and more nodes and we will more thoroughly optimize the parameters of the different reservoirs. [sent-247, score-0.205]
90 Finally, we will develop an embedded training scheme that permits the training of reservoirs on much larger speech corpora for which only orthographic representations are distributed together with the speech data. [sent-249, score-0.563]
91 An application of recurrent neural nets to phone probability estimation. [sent-253, score-0.23]
92 Framewise phoneme classification with bidirectional LSTM and other neural network architectures. [sent-277, score-0.423]
93 Tutorial on training recurrent neural networks, covering BPTT, RTRL, EKF and the echo state network approach (48 pp). [sent-281, score-0.215]
94 Generative modeling of autonomous robots and their environments using reservoir computing. [sent-300, score-0.733]
95 Echo state networks with filter neurons and a delay & sum readout. [sent-305, score-0.148]
96 Automatic speech recognition using a predictive echo state network classifier. [sent-316, score-0.333]
97 Optimization and applications of echo state networks with leaky-integrator neurons. [sent-343, score-0.121]
98 Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. [sent-348, score-0.123]
99 Computational power and the order-chaos phase transition in reservoir computing. [sent-361, score-0.733]
100 SPRAAK: An open source speech recognition and automatic annotation kit. [sent-378, score-0.242]
wordName wordTfidf (topN-words)
[('reservoir', 0.733), ('phoneme', 0.379), ('speech', 0.176), ('reservoirs', 0.167), ('cer', 0.139), ('phone', 0.128), ('acoustic', 0.11), ('recurrent', 0.102), ('rc', 0.099), ('hmm', 0.099), ('rer', 0.084), ('system', 0.079), ('neurons', 0.075), ('timit', 0.074), ('lvcsr', 0.07), ('sr', 0.067), ('recognition', 0.066), ('phonemes', 0.063), ('schrauwen', 0.061), ('spoken', 0.057), ('lstm', 0.056), ('spraak', 0.056), ('wout', 0.056), ('sentences', 0.055), ('networks', 0.053), ('corpus', 0.052), ('speakers', 0.051), ('isf', 0.049), ('phonemic', 0.049), ('outputs', 0.049), ('echo', 0.048), ('ms', 0.047), ('deep', 0.043), ('articulatory', 0.042), ('lins', 0.042), ('rers', 0.042), ('supplied', 0.04), ('comprising', 0.04), ('phonetic', 0.04), ('nodes', 0.038), ('bigram', 0.038), ('inputs', 0.038), ('trainable', 0.037), ('recognizer', 0.037), ('hmms', 0.037), ('integration', 0.036), ('leak', 0.034), ('mfcc', 0.034), ('acoustical', 0.034), ('symbols', 0.033), ('acoustics', 0.033), ('gures', 0.032), ('integrator', 0.032), ('connections', 0.031), ('recognize', 0.03), ('layer', 0.029), ('ui', 0.029), ('decoder', 0.029), ('assp', 0.028), ('closures', 0.028), ('comlex', 0.028), ('confusions', 0.028), ('ctc', 0.028), ('ghent', 0.028), ('mlps', 0.028), ('verstraeten', 0.028), ('core', 0.026), ('rates', 0.026), ('hierarchical', 0.026), ('architecture', 0.025), ('acceleration', 0.025), ('recurrently', 0.024), ('insertions', 0.024), ('transcriptions', 0.024), ('triphone', 0.024), ('english', 0.024), ('employs', 0.023), ('network', 0.023), ('layers', 0.023), ('neuron', 0.023), ('language', 0.023), ('velocity', 0.023), ('belief', 0.023), ('tf', 0.022), ('methodologies', 0.022), ('edit', 0.022), ('phones', 0.022), ('training', 0.022), ('technology', 0.022), ('error', 0.021), ('bidirectional', 0.021), ('rescaling', 0.021), ('dynamical', 0.021), ('decoding', 0.02), ('state', 0.02), ('output', 0.02), ('xt', 0.02), ('zi', 0.02), ('closure', 0.02), ('leaky', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
2 0.19370653 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
Author: George Dahl, Marc'aurelio Ranzato, Abdel-rahman Mohamed, Geoffrey E. Hinton
Abstract: Straightforward application of Deep Belief Nets (DBNs) to acoustic modeling produces a rich distributed representation of speech data that is useful for recognition and yields impressive results on the speaker-independent TIMIT phone recognition task. However, the first-layer Gaussian-Bernoulli Restricted Boltzmann Machine (GRBM) has an important limitation, shared with mixtures of diagonal-covariance Gaussians: GRBMs treat different components of the acoustic input vector as conditionally independent given the hidden state. The mean-covariance restricted Boltzmann machine (mcRBM), first introduced for modeling natural images, is a much more representationally efficient and powerful way of modeling the covariance structure of speech data. Every configuration of the precision units of the mcRBM specifies a different precision matrix for the conditional distribution over the acoustic space. In this work, we use the mcRBM to learn features of speech data that serve as input into a standard DBN. The mcRBM features combined with DBNs allow us to achieve a phone error rate of 20.5%, which is superior to all published results on speaker-independent TIMIT to date.
3 0.16098799 61 nips-2010-Direct Loss Minimization for Structured Prediction
Author: Tamir Hazan, Joseph Keshet, David A. McAllester
Abstract: In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimize the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving feature vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem.
4 0.078095235 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
Author: Javier R. Movellan, Paul L. Ruvolo
Abstract: Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human computer interaction. One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further we show that this person independent visual-only detector can be used to train very accurate audio-based person dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database.
5 0.06231967 264 nips-2010-Synergies in learning words and their referents
Author: Mark Johnson, Katherine Demuth, Bevan Jones, Michael J. Black
Abstract: This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these.
6 0.061840191 158 nips-2010-Learning via Gaussian Herding
7 0.061125007 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
8 0.059280384 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
9 0.049242489 157 nips-2010-Learning to localise sounds with spiking neural networks
10 0.04704788 281 nips-2010-Using body-anchored priors for identifying actions in single images
11 0.043890759 285 nips-2010-Why are some word orders more common than others? A uniform information density account
12 0.041151989 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
13 0.040491741 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
14 0.040337086 138 nips-2010-Large Margin Multi-Task Metric Learning
15 0.039503362 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
16 0.039160836 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
17 0.038145922 252 nips-2010-SpikeAnts, a spiking neuron network modelling the emergence of organization in a complex system
18 0.035545744 99 nips-2010-Gated Softmax Classification
19 0.035317373 96 nips-2010-Fractionally Predictive Spiking Neurons
20 0.03443386 161 nips-2010-Linear readout from a neural population with partial correlation data
topicId topicWeight
[(0, 0.107), (1, 0.032), (2, -0.078), (3, 0.015), (4, 0.0), (5, 0.034), (6, -0.026), (7, 0.026), (8, -0.051), (9, 0.028), (10, 0.029), (11, -0.037), (12, 0.07), (13, -0.032), (14, -0.05), (15, -0.002), (16, 0.051), (17, -0.05), (18, -0.105), (19, -0.117), (20, -0.048), (21, 0.129), (22, 0.103), (23, 0.091), (24, 0.147), (25, -0.005), (26, 0.115), (27, -0.075), (28, 0.058), (29, -0.054), (30, -0.055), (31, -0.057), (32, 0.112), (33, 0.05), (34, 0.035), (35, 0.025), (36, 0.023), (37, 0.011), (38, -0.019), (39, -0.08), (40, -0.063), (41, -0.033), (42, -0.026), (43, -0.176), (44, 0.026), (45, 0.038), (46, -0.04), (47, 0.014), (48, 0.002), (49, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.92609006 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
2 0.74299335 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
3 0.597166 61 nips-2010-Direct Loss Minimization for Structured Prediction
4 0.59538251 28 nips-2010-An Alternative to Low-level-Sychrony-Based Methods for Speech Detection
Author: Javier R. Movellan, Paul L. Ruvolo
Abstract: Determining whether someone is talking has applications in many areas such as speech recognition, speaker diarization, social robotics, facial expression recognition, and human-computer interaction. One popular approach to this problem is audio-visual synchrony detection [10, 21, 12]. A candidate speaker is deemed to be talking if the visual signal around that speaker correlates with the auditory signal. Here we show that with the proper visual features (in this case movements of various facial muscle groups), a very accurate detector of speech can be created that does not use the audio signal at all. Further, we show that this person-independent visual-only detector can be used to train very accurate audio-based person-dependent voice models. The voice model has the advantage of being able to identify when a particular person is speaking even when they are not visible to the camera (e.g. in the case of a mobile robot). Moreover, we show that a simple sensory fusion scheme between the auditory and visual models improves performance on the task of talking detection. The work here provides dramatic evidence about the efficacy of two very different approaches to multimodal speech detection on a challenging database.
5 0.49892527 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
Author: Grégoire Montavon, Klaus-Robert Müller, Mikio L. Braun
Abstract: Deep networks can potentially express a learning problem more efficiently than local learning machines. While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after layer as the deep network builds higher-level abstract representations of the input. We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers.
6 0.47985506 188 nips-2010-On Herding and the Perceptron Cycling Theorem
7 0.46352667 271 nips-2010-Tiled convolutional neural networks
8 0.4592967 125 nips-2010-Inference and communication in the game of Password
9 0.45362312 111 nips-2010-Hallucinations in Charles Bonnet Syndrome Induced by Homeostasis: a Deep Boltzmann Machine Model
10 0.4101316 157 nips-2010-Learning to localise sounds with spiking neural networks
11 0.40894461 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
12 0.40770924 285 nips-2010-Why are some word orders more common than others? A uniform information density account
13 0.37431625 264 nips-2010-Synergies in learning words and their referents
14 0.35105938 156 nips-2010-Learning to combine foveal glimpses with a third-order Boltzmann machine
15 0.3496691 8 nips-2010-A Log-Domain Implementation of the Diffusion Network in Very Large Scale Integration
16 0.34582561 158 nips-2010-Learning via Gaussian Herding
17 0.34366333 99 nips-2010-Gated Softmax Classification
18 0.33425617 252 nips-2010-SpikeAnts, a spiking neuron network modelling the emergence of organization in a complex system
19 0.3219215 281 nips-2010-Using body-anchored priors for identifying actions in single images
20 0.3191005 34 nips-2010-Attractor Dynamics with Synaptic Depression
topicId topicWeight
[(13, 0.031), (17, 0.023), (27, 0.064), (30, 0.073), (35, 0.019), (45, 0.138), (50, 0.055), (52, 0.038), (59, 0.057), (60, 0.022), (64, 0.27), (77, 0.053), (78, 0.022), (90, 0.04)]
simIndex simValue paperId paperTitle
same-paper 1 0.74030453 207 nips-2010-Phoneme Recognition with Large Hierarchical Reservoirs
Author: Fabian Triefenbach, Azarakhsh Jalalvand, Benjamin Schrauwen, Jean-pierre Martens
Abstract: Automatic speech recognition has gradually improved over the years, but the reliable recognition of unconstrained speech is still not within reach. In order to achieve a breakthrough, many research groups are now investigating new methodologies that have potential to outperform the Hidden Markov Model technology that is at the core of all present commercial systems. In this paper, it is shown that the recently introduced concept of Reservoir Computing might form the basis of such a methodology. In a limited amount of time, a reservoir system that can recognize the elementary sounds of continuous speech has been built. The system already achieves a state-of-the-art performance, and there is evidence that the margin for further improvements is still significant.
2 0.56848383 94 nips-2010-Feature Set Embedding for Incomplete Data
Author: David Grangier, Iain Melvin
Abstract: We present a new learning strategy for classification problems in which train and/or test data suffer from missing features. In previous work, instances are represented as vectors from some feature space and one is forced to impute missing values or to consider an instance-specific subspace. In contrast, our method considers instances as sets of (feature,value) pairs which naturally handle the missing value case. Building on this framework, we propose a classification strategy for sets. Our proposal maps (feature,value) pairs into an embedding space and then nonlinearly combines the set of embedded vectors. The embedding and the combination parameters are learned jointly on the final classification objective. This simple strategy allows great flexibility in encoding prior knowledge about the features in the embedding step and yields advantageous results compared to alternative solutions over several datasets.
3 0.56010985 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
Author: Surya Ganguli, Haim Sompolinsky
Abstract: Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of Gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However, a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that exceeds the number of neurons. This enhanced capacity is achieved by a class of “orthogonal” recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternatively, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance.
4 0.55886686 117 nips-2010-Identifying graph-structured activation patterns in networks
Author: James Sharpnack, Aarti Singh
Abstract: We consider the problem of identifying an activation pattern in a complex, large-scale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task, especially if the network is large (the pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of high-dimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size.
5 0.55787653 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
Author: Dahua Lin, Eric Grimson, John W. Fisher
Abstract: We present a novel method for constructing dependent Dirichlet processes. The approach exploits the intrinsic relationship between Dirichlet and Poisson processes in order to create a Markov chain of Dirichlet processes suitable for use as a prior over evolving mixture models. The method allows for the creation, removal, and location variation of component models over time while maintaining the property that the random measures are marginally DP distributed. Additionally, we derive a Gibbs sampling algorithm for model inference and test it on both synthetic and real data. Empirical results demonstrate that the approach is effective in estimating dynamically varying mixture models.
6 0.55684268 63 nips-2010-Distributed Dual Averaging In Networks
7 0.55319136 4 nips-2010-A Computational Decision Theory for Interactive Assistants
8 0.55158061 155 nips-2010-Learning the context of a category
9 0.54932892 222 nips-2010-Random Walk Approach to Regret Minimization
10 0.54799843 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
11 0.54641891 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
12 0.54625499 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
13 0.54585701 206 nips-2010-Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine
14 0.54505873 148 nips-2010-Learning Networks of Stochastic Differential Equations
15 0.54336858 268 nips-2010-The Neural Costs of Optimal Control
16 0.54295075 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
17 0.54181582 265 nips-2010-The LASSO risk: asymptotic results and real world examples
18 0.54148006 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
19 0.54101425 254 nips-2010-Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models
20 0.54092556 96 nips-2010-Fractionally Predictive Spiking Neurons