88 nips-2006-Greedy Layer-Wise Training of Deep Networks

Source: pdf

Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle

Abstract: Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. [sent-3, score-0.5]

2 However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. [sent-5, score-0.385]

3 Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. [sent-7, score-0.743]

4 This is also true of feedforward neural networks with a single hidden layer (which can become SVMs when the number of hidden units becomes large (Bengio, Le Roux, Vincent, Delalleau, & Marcotte, 2006)). [sent-30, score-1.025]

5 A serious problem with shallow architectures is that they can be very inefﬁcient in terms of the number of computational units (e. [sent-31, score-0.398]

6 , 2006), O(d2 ) parameters for a one-hidden-layer neural network, O(d) parameters and units for a multi-layer network with O(log2 d) layers, and O(1) parameters with a recurrent neural network. [sent-38, score-0.369]

7 , with a shallow circuit, the number of training examples required to learn the concept may also be impractical. [sent-42, score-0.221]

8 Formal analyses of the computational complexity of shallow circuits can be found in (Hastad, 1987) or (Allender, 1996). [sent-43, score-0.178]

9 They point in the same direction: shallow circuits are much less expressive than deep ones. [sent-44, score-0.467]

10 However, until recently, it was believed too difﬁcult to train deep multi-layer neural networks. [sent-45, score-0.376]

11 Empirically, deep networks were generally found to be not better, and often worse, than neural networks with one or two hidden layers (Tesauro, 1992). [sent-46, score-0.891]

12 This was previously done using a supervised criterion at each stage (Fahlman & Lebiere, 1990; Lengellé & Denoeux, 1996). [sent-50, score-0.204]

13 Hinton, Osindero, and Teh (2006) recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. [sent-51, score-0.743]

14 The training strategy for such networks may hold great promise as a principle to help address the problem of training deep networks. [sent-52, score-0.588]

15 Upper layers of a DBN are supposed to represent more “abstract” concepts that explain the input observation x, whereas lower layers extract “low-level features” from x. [sent-53, score-0.634]

16 Second, we perform experiments to better understand the advantage brought by the greedy layer-wise unsupervised learning. [sent-58, score-0.328]

17 Finally, we discuss a problem that occurs with the layer-wise greedy unsupervised procedure when the input distribution is not revealing enough of the conditional distribution of the target variable given the input variable. [sent-61, score-0.527]

18 2 Deep Belief Nets Let x be the input, and $g^i$ the hidden variables at layer i, with joint distribution $P(x, g^1, g^2, \ldots, g^\ell) = P(x|g^1)P(g^1|g^2) \cdots P(g^{\ell-2}|g^{\ell-1})P(g^{\ell-1}, g^\ell)$, [sent-63, score-0.582]

19 where all the conditional layers $P(g^i|g^{i+1})$ are factorized conditional distributions for which computation of probability and sampling are easy. [sent-66, score-0.344]

20 If we denote $g^0 = x$, the generative model for the first layer $P(x|g^1)$ also follows (1). [sent-69, score-0.4]

21 1 Restricted Boltzmann machines The top-level prior $P(g^{\ell-1}, g^\ell)$ is a Restricted Boltzmann Machine (RBM) between layer $\ell-1$ and layer $\ell$. [sent-71, score-0.822]

22 To lighten notation, consider a generic RBM with input layer activations v (for visible units) and hidden layer activations h (for hidden units). [sent-72, score-1.174]

23 It has the following joint distribution: $P(v, h) = \frac{1}{Z} e^{h'Wv + b'v + c'h}$, where Z is the normalization constant for this distribution, b is the vector of biases for visible units, c is the vector of biases for the hidden units, and W is the weight matrix for the layer. [sent-73, score-0.23]
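To make the role of the normalization constant Z concrete, the following minimal sketch enumerates every binary configuration of a tiny RBM and normalizes the weights e^{h'Wv + b'v + c'h}. It assumes binary units and a weight matrix of shape (hidden, visible); the function names are hypothetical, and brute-force enumeration is only practical for a handful of units.

```python
import itertools
import numpy as np

def rbm_log_weight(v, h, W, b, c):
    """Log of the unnormalized probability: h'Wv + b'v + c'h."""
    return h @ W @ v + b @ v + c @ h

def rbm_joint_table(W, b, c):
    """Return {(v, h): P(v, h)} over all binary states (tiny RBMs only)."""
    n_hidden, n_visible = W.shape
    weights = {}
    for v in itertools.product([0, 1], repeat=n_visible):
        for h in itertools.product([0, 1], repeat=n_hidden):
            va = np.array(v, dtype=float)
            ha = np.array(h, dtype=float)
            weights[(v, h)] = np.exp(rbm_log_weight(va, ha, W, b, c))
    Z = sum(weights.values())  # normalization constant
    return {state: w / Z for state, w in weights.items()}

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(2, 3))  # 2 hidden units, 3 visible units
b = rng.normal(size=3)                  # visible biases
c = rng.normal(size=2)                  # hidden biases
table = rbm_joint_table(W, b, c)
total = sum(table.values())             # should be 1 up to rounding
```

Because Z sums over 2^(visible + hidden) states, it is intractable for realistic layer sizes, which is why sampling-based gradient estimators are used instead.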

24 2 Gibbs Markov chain and log-likelihood gradient in an RBM To obtain an estimator of the gradient on the log-likelihood of an RBM, we consider a Gibbs Markov chain on the (visible units, hidden units) pair of variables. [sent-79, score-0.202]

25 Pseudo-code for Contrastive Divergence training (with k = 1) of an RBM with binomial input and hidden units is presented in the Appendix (Algorithm RBMupdate(x, $\epsilon$, W, b, c)). [sent-85, score-0.623]
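The appendix pseudo-code is not reproduced on this page. As a rough illustration only, a standard Contrastive Divergence step with k = 1 for binary units might look like the sketch below. The name rbm_update, the argument order, and the (hidden, visible) shape of W are assumptions of this sketch, not the authors' exact procedure.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def rbm_update(x, lr, W, b, c, rng):
    """One CD-1 step on a single binary example x (updates W, b, c in place).

    W: (n_hidden, n_visible), b: visible biases, c: hidden biases.
    Returns the sampled hidden vector for x.
    """
    # Positive phase: sample hidden units given the data.
    q_h0 = sigm(c + W @ x)
    h0 = (rng.random(q_h0.shape) < q_h0).astype(float)
    # One Gibbs step: reconstruct the visible units, then recompute hidden probabilities.
    p_v1 = sigm(b + W.T @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    q_h1 = sigm(c + W @ v1)
    # Update: (input value times output value) for data minus reconstruction.
    W += lr * (np.outer(h0, x) - np.outer(q_h1, v1))
    b += lr * (x - v1)
    c += lr * (h0 - q_h1)
    return h0

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 6))
b = np.zeros(6)
c = np.zeros(4)
x = np.array([1, 0, 1, 1, 0, 0], dtype=float)
h0 = rbm_update(x, 0.1, W, b, c, rng)
```

Using the hidden probabilities rather than samples in the negative phase is a common variance-reduction choice; other variants sample both phases.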

26 3 Greedy layer-wise training of a DBN A greedy layer-wise training algorithm was proposed (Hinton et al. [sent-90, score-0.371]

27 This gives rise to an “empirical” distribution $\hat{p}^1$ over the first layer $g^1$, when $g^0$ is sampled from the data empirical distribution $\hat{p}$: we have $\hat{p}^1(g^1) = \sum_{g^0} \hat{p}(g^0) Q(g^1|g^0)$. [sent-94, score-0.435]

28 In the RBM between layers $\ell-1$ and $\ell$, $P(g^\ell)$ is defined in terms of the parameters of that RBM, whereas in the DBN $P(g^\ell)$ is defined in terms of the parameters of the upper layers. [sent-97, score-0.284]

29 Consequently, $Q(g^\ell|g^{\ell-1})$ of the RBM does not correspond to $P(g^\ell|g^{\ell-1})$ in the DBN, except when that RBM is the top layer of the DBN. [sent-98, score-0.425]

30 As a nice side benefit, one obtains an approximation of the posterior for all the hidden variables in the DBN, at all levels, given an input $g^0 = x$. [sent-102, score-0.22]

31 Note that if we consider all the layers of a DBN from level i to the top, we have a smaller DBN, which generates the marginal distribution $P(g^i)$ for the complete DBN. [sent-104, score-0.284]

32 The motivation for the greedy procedure is that a partial DBN with $\ell - i$ levels starting above level i may provide a better model for $P(g^i)$ than does the RBM initially associated with level i itself. [sent-105, score-0.218]

33 The above greedy procedure is justiﬁed using a variational bound (Hinton et al. [sent-106, score-0.195]

34 The greedy layer-wise training algorithm for DBNs is quite simple, as illustrated by the pseudo-code in Algorithm TrainUnsupervisedDBN of the Appendix. [sent-109, score-0.283]
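The greedy procedure can be sketched as follows: train the first RBM on the data, freeze it, sample its hidden units to produce inputs for the second RBM, and so on. This is a simplified reading, not the appendix algorithm TrainUnsupervisedDBN itself; the function names, the fixed number of epochs per layer, and the initialization scale are all assumptions made for illustration.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def cd1_step(v0, lr, W, b, c, rng):
    """CD-1 update for one binary example; W has shape (n_hidden, n_visible)."""
    q_h0 = sigm(c + W @ v0)
    h0 = sample(q_h0, rng)
    v1 = sample(sigm(b + W.T @ h0), rng)
    q_h1 = sigm(c + W @ v1)
    W += lr * (np.outer(h0, v0) - np.outer(q_h1, v1))
    b += lr * (v0 - v1)
    c += lr * (h0 - q_h1)

def train_unsupervised_dbn(data, layer_sizes, lr=0.1, epochs=5, seed=0):
    """Greedy layer-wise pre-training; returns a list of (W, b, c) per layer."""
    rng = np.random.default_rng(seed)
    layers = []
    n_in = data.shape[1]
    for n_hid in layer_sizes:
        W = rng.normal(scale=0.01, size=(n_hid, n_in))
        b, c = np.zeros(n_in), np.zeros(n_hid)
        for _ in range(epochs):
            for x in data:
                g = x
                # Propagate through the already-trained, frozen lower layers.
                for Wk, _, ck in layers:
                    g = sample(sigm(ck + Wk @ g), rng)
                cd1_step(g, lr, W, b, c, rng)
        layers.append((W, b, c))
        n_in = n_hid
    return layers

toy = (np.random.default_rng(1).random((20, 6)) < 0.5).astype(float)
dbn = train_unsupervised_dbn(toy, [4, 3], epochs=2)
```

Only the layer currently being trained is modified; lower layers act as a fixed stochastic encoder for it.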

35 4 Supervised ﬁne-tuning As a last training stage, it is possible to ﬁne-tune the parameters of all the layers together. [sent-111, score-0.372]

36 According to these propagation rules, the whole network now deterministically computes internal representations as functions of the network input $g^0 = x$. [sent-116, score-0.3]
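The propagation rules themselves are not shown in this extract. Assuming the usual mean-field rule for binomial units, where each layer outputs the sigmoid of an affine function of the layer below, a deterministic forward pass could be sketched like this (names and shapes are illustrative assumptions):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field_forward(x, layers):
    """Deterministic representations for each layer given input x.

    layers: list of (W, c) with W of shape (n_out, n_in) and hidden biases c.
    """
    representations = []
    g = x
    for W, c in layers:
        g = sigm(c + W @ g)  # expected hidden activation, no sampling
        representations.append(g)
    return representations

layers = [(np.zeros((4, 6)), np.zeros(4)), (np.zeros((3, 4)), np.zeros(3))]
reps = mean_field_forward(np.ones(6), layers)
```

Since no sampling is involved, the output is a differentiable function of the parameters, which is what makes gradient-based fine-tuning possible.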

37 After unsupervised pre-training of the layers of a DBN following Algorithm TrainUnsupervisedDBN (see Appendix) the whole network can be further optimized by gradient descent with respect to any deterministically computable training criterion that depends on these representations. [sent-117, score-0.666]

38 For example, this can be used (Hinton & Salakhutdinov, 2006) to ﬁne-tune a very deep auto-encoder, minimizing a reconstruction error. [sent-118, score-0.342]

39 It is also possible to use this as initialization of all except the last layer of a traditional multi-layer neural network, using gradient descent to ﬁne-tune the whole network with respect to a supervised training criterion. [sent-119, score-0.841]

40 Algorithm DBNSupervisedFineTuning in the appendix contains pseudo-code for supervised ﬁne-tuning, as part of the global supervised learning algorithm TrainSupervisedDBN. [sent-120, score-0.354]

41 Note that better results were obtained when using a 20-fold larger learning rate with the supervised criterion (here, squared error or cross-entropy) updates than in the contrastive divergence updates. [sent-121, score-0.379]

42 3 Extension to continuous-valued inputs With the binary units introduced for RBMs and DBNs in Hinton et al. [sent-122, score-0.308]

43 Linear energy: exponential or truncated exponential Consider a unit with value y of an RBM, connected to units z of the other layer. [sent-127, score-0.4]

44 p(y|z) can be obtained from the terms in the exponential that contain y, which can be grouped in $y\,a(z)$ for linear energy functions as in (2), where $a(z) = b + w'z$ with b the bias of unit y, and w the vector of weights connecting unit y to units z. [sent-128, score-0.429]

45 Alternatively, if I is a closed interval (as in many applications of interest), or if we would like to use such a unit as a hidden unit with non-linear expected value, the above density is a truncated exponential. [sent-131, score-0.277]
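As a worked illustration of the non-linear expected value, assume the closed interval is I = [0, 1] (an assumption of this sketch) and that the density is proportional to exp(y a) on that interval. Integrating gives E[y] = 1/(1 - exp(-a)) - 1/a, which the hypothetical helper below compares against a simple numerical average.

```python
import numpy as np

def truncated_exp_mean(a):
    """E[y] for a density proportional to exp(a * y) on [0, 1]."""
    if abs(a) < 1e-6:
        return 0.5 + a / 12.0  # first-order expansion around a = 0
    return 1.0 / (1.0 - np.exp(-a)) - 1.0 / a

def numeric_mean(a, n=200001):
    """Grid approximation of the same expectation, for checking."""
    ys = np.linspace(0.0, 1.0, n)
    w = np.exp(a * ys)
    return float(np.sum(ys * w) / np.sum(w))

closed_form = truncated_exp_mean(2.0)
approx = numeric_mean(2.0)
```

The expectation rises smoothly from 0 to 1 as a goes from large negative to large positive values, so it behaves like a squashing function of the weighted input.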

46 In both the truncated and non-truncated cases, the Contrastive Divergence updates have the same form as for binomial units (input value times output value), since the updates only depend on the derivative of the energy with respect to the parameters. [sent-135, score-0.531]

47 Quadratic energy: Gaussian units To obtain Gaussian-distributed units, one adds quadratic terms to the energy. [sent-137, score-0.232]

48 Adding $\sum_i d_i^2 y_i^2$ gives rise to a diagonal covariance matrix between units of the same layer, where $y_i$ is the continuous value of a Gaussian unit and $d_i^2$ is a positive parameter that is equal to the inverse of the variance of $y_i$. [sent-138, score-0.308]

49 [Figure 1 legend: Deep Network with no pre-training; DBN with partially supervised pre-training; DBN with unsupervised pre-training] [sent-140, score-0.297]

50 Figure 1: Training classification error vs training iteration, on the Cotton price task, for deep network without pre-training, for DBN with unsupervised pre-training, and DBN with partially supervised pre-training. [sent-153, score-0.776]

51 Illustrates optimization difﬁculty of deep networks and advantage of partially supervised training. [sent-154, score-0.555]

52 [Table 1 row labels: Deep Network with no pre-training; Logistic regression; DBN, binomial inputs, unsupervised; DBN, binomial inputs, partially supervised; DBN, Gaussian inputs, unsupervised; DBN, Gaussian inputs, partially supervised] [sent-155, score-0.76]

53 In this case the variance is unconditional, whereas the mean depends on the inputs of the unit: for a unit y with inputs z and inverse variance $d^2$, $E[y|z] = \frac{a(z)}{2d^2}$. [sent-195, score-0.193]

54 , b and w above), the derivatives have the same form (input unit value times output unit value) as for the case of binomial units. [sent-199, score-0.165]

55 Gaussian units were previously used as hidden units of an RBM (with binomial or multinomial inputs) applied to an information retrieval task (Welling, Rosen-Zvi, & Hinton, 2005). [sent-201, score-0.701]

56 However, Gaussian and exponential hidden units have a weakness: the mean-ﬁeld propagation through a Gaussian unit gives rise to a purely linear transformation. [sent-204, score-0.555]

57 Hence if we have only such linear hidden units in a multi-layered network, the mean-ﬁeld propagation function that maps inputs to internal representations would be completely linear. [sent-205, score-0.508]

58 On the other hand, combining Gaussian with other types of units could be interesting. [sent-207, score-0.232]

59 In contrast with Gaussian or exponential units, remark that the conditional expectation of truncated exponential units is non-linear, and in fact involves a sigmoidal form of non-linearity applied to the weighted sum of its inputs. [sent-208, score-0.411]

60 In Table 1 (rows 3 and 5), we show improvements brought by DBNs with Gaussian inputs over DBNs with binomial inputs (with binomial hidden units in both cases). [sent-218, score-0.704]

61 4 Understanding why the layer-wise strategy works A reasonable explanation for the apparent success of the layer-wise training strategy for DBNs is that unsupervised pre-training helps to mitigate the difﬁcult optimization problem of deep networks by better initializing the weights of all layers. [sent-221, score-0.773]

62 Training each layer as an auto-encoder We want to verify that the layer-wise greedy unsupervised pre-training principle can be applied when using an auto-encoder instead of the RBM as a layer building block. [sent-223, score-1.105]

63 For a layer with weights matrix W, hidden biases column vector b and input biases column vector c, the reconstruction probability for bit i is $p_i(x)$, with the vector of probabilities $p(x) = \mathrm{sigm}(c + W\,\mathrm{sigm}(b + W'x))$. [sent-225, score-0.726]

64 The training criterion for the layer is the average of negative log-likelihoods for predicting x from p(x). [sent-226, score-0.539]
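The reconstruction and its criterion can be sketched as below, assuming tied weights with W of shape (inputs, hidden), so that the hidden code is sigm(b + W'x) and the reconstruction is sigm(c + W h). The function names and the small constant guarding the logarithm are assumptions of this illustration.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def reconstruct(x, W, b, c):
    """p(x) = sigm(c + W sigm(b + W'x)); W has shape (n_inputs, n_hidden)."""
    hidden = sigm(b + W.T @ x)
    return sigm(c + W @ hidden)

def reconstruction_nll(X, W, b, c, eps=1e-12):
    """Average negative log-likelihood of binary inputs under p(x)."""
    total = 0.0
    for x in X:
        p = reconstruct(x, W, b, c)
        total -= np.sum(x * np.log(p + eps) + (1.0 - x) * np.log(1.0 - p + eps))
    return total / len(X)

X = np.array([[1, 0, 1, 0, 1], [0, 0, 1, 1, 0]], dtype=float)
W = np.zeros((5, 3))
b = np.zeros(3)  # hidden biases
c = np.zeros(5)  # input biases
nll = reconstruction_nll(X, W, b, c)
```

With all parameters at zero every bit is reconstructed with probability 0.5, so the criterion starts at (number of inputs) times log 2 and should decrease as training proceeds.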

65 We report several experimental results using this training criterion for each layer, in comparison to the contrastive divergence algorithm for an RBM. [sent-228, score-0.26]

66 Pseudo-code for a deep network obtained by training each layer as an auto-encoder is given in Appendix (Algorithm TrainGreedyAutoEncodingDeepNet). [sent-229, score-0.879]

67 However, our experiments suggest that networks with non-decreasing layer sizes generalize well. [sent-231, score-0.479]

68 This might be due to weight decay and stochastic gradient descent, preventing large weights: optimization falls in a local minimum which corresponds to a good transformation of the input (that provides a good initialization for supervised training of the whole net). [sent-232, score-0.399]

69 Greedy layer-wise supervised training A reasonable question to ask is whether the fact that each layer is trained in an unsupervised way is critical or not. [sent-233, score-0.751]

70 Pseudo-code for a deep network obtained by training each layer as the hidden layer of a supervised one-hidden-layer neural network is given in Appendix (Algorithm TrainGreedySupervisedDeepNet). [sent-235, score-1.694]

71 The ﬁnal ﬁne-tuning is done by adding a logistic regression layer on top of the network and training the whole network by stochastic gradient descent on the cross-entropy with respect to the target classiﬁcation. [sent-238, score-0.75]

72 The networks have the following architecture: 784 inputs, 10 outputs, 3 hidden layers with variable number of hidden units, selected by validation set performance (typically selected layer sizes are between 500 and 1000). [sent-239, score-1.078]

73 The DBN was slower to train and fewer experiments were performed, so that longer training and more appropriately chosen sizes of layers and learning rates could yield better results (Hinton 2006, unpublished, reports 1. [sent-242, score-0.43]

74 [Table 2 row labels: DBN, unsupervised pre-training; Deep net, auto-associator pre-training; Deep net, supervised pre-training; Deep net, no pre-training; Shallow net, no pre-training] [sent-244, score-0.263]

75 Table 2: Classification error on MNIST training, validation, and test sets, with the best hyperparameters according to validation error, with and without pre-training, using purely supervised or purely unsupervised pre-training. [sent-272, score-0.347]

76 In experiment 3, the size of the top hidden layer was set to 20. [sent-273, score-0.602]

77 The results in Table 2 suggest that the auto-encoding criterion can yield performance comparable to the DBN when the layers are ﬁnally tuned in a supervised fashion. [sent-276, score-0.511]

78 They also clearly show that the greedy unsupervised layer-wise pre-training gives much better results than the standard way to train a deep network (with no greedy pre-training) or a shallow network, and that, without pre-training, deep networks tend to perform worse than shallow networks. [sent-277, score-1.583]

79 The results also suggest that unsupervised greedy layer-wise pre-training can perform signiﬁcantly better than purely supervised greedy layer-wise pre-training. [sent-278, score-0.726]

80 Without pre-training, the lower layers are initialized poorly, but the top two layers can still learn the training set almost perfectly, because the output layer and the last hidden layer form a standard shallow but fat neural network. [sent-282, score-1.82]

81 Consider the top two layers of the deep network with pre-training: it presumably takes as input a better representation, one that allows for better generalization. [sent-283, score-0.812]

82 Instead, the network without pre-training sees a “random” transformation of the input, one that preserves enough information about the input to ﬁt the training set, but that does not help to generalize. [sent-284, score-0.233]

83 To test that hypothesis, we performed a second series of experiments in which we constrain the top hidden layer to be small (20 hidden units). [sent-285, score-0.733]

84 With no pre-training, training error degrades signiﬁcantly when there are only 20 hidden units in the top hidden layer. [sent-287, score-0.653]

85 Continuous training of all layers of a DBN With the layer-wise training algorithm for DBNs (TrainUnsupervisedDBN in Appendix), one element that we would like to dispense with is having to decide the number of training iterations for each layer. [sent-290, score-0.548]

86 It would be good if we did not have to explicitly add layers one at a time, i. [sent-291, score-0.284]

87 , if we could train all layers simultaneously, but keeping the “greedy” idea that each layer is pre-trained to model its input, ignoring the effect of higher layers. [sent-293, score-0.719]

88 To achieve this it is sufﬁcient to insert a line in TrainUnsupervisedDBN, so that RBMupdate is called on all the layers and the stochastic hidden values are propagated all the way up. [sent-294, score-0.438]
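One way to read this variant is sketched below: for every example, each layer receives one Contrastive Divergence update in turn, and the sampled hidden values of one layer become the input of the next. The function names, initialization, and single pass over the data are assumptions of this sketch rather than the modified appendix algorithm.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def cd1_step(v0, lr, W, b, c, rng):
    """CD-1 update in place; returns the sampled hidden values for v0."""
    q_h0 = sigm(c + W @ v0)
    h0 = sample(q_h0, rng)
    v1 = sample(sigm(b + W.T @ h0), rng)
    q_h1 = sigm(c + W @ v1)
    W += lr * (np.outer(h0, v0) - np.outer(q_h1, v1))
    b += lr * (v0 - v1)
    c += lr * (h0 - q_h1)
    return h0

def train_all_layers_online(data, layer_sizes, lr=0.1, seed=0):
    """Single on-line pass in which every layer is updated on every example."""
    rng = np.random.default_rng(seed)
    sizes = [data.shape[1]] + list(layer_sizes)
    layers = [(rng.normal(scale=0.01, size=(n_out, n_in)), np.zeros(n_in), np.zeros(n_out))
              for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    for x in data:
        g = x
        for W, b, c in layers:
            # Each layer models its own input; its stochastic output feeds the next one.
            g = cd1_step(g, lr, W, b, c, rng)
    return layers

toy = (np.random.default_rng(1).random((30, 6)) < 0.5).astype(float)
stack = train_all_layers_online(toy, [4, 3])
```

Each layer still ignores the layers above it, so the scheme stays greedy even though all layers are updated in the same pass.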

89 Computation time is slightly greater, since we do more computations initially (on the upper layers), which might be wasted (before the lower layers converge to a decent representation), but time is saved on optimizing hyper-parameters. [sent-297, score-0.284]

90 This variant may be more appealing for on-line training on very large data-sets, where one would never cycle back on the training data. [sent-298, score-0.176]

91 5 Dealing with uncooperative input distributions In classiﬁcation problems such as MNIST where classes are well separated, the structure of the input distribution p(x) naturally contains much information about the target variable y . [sent-299, score-0.186]

92 Imagine a supervised learning task in which the input distribution is mostly unrelated to y. [sent-300, score-0.219]

93 In such settings we cannot expect the unsupervised greedy layer-wise pre-training procedure to help in training deep supervised networks. [sent-305, score-0.858]

94 To deal with such uncooperative input distributions, we propose to train each layer with a mixed training criterion that combines the unsupervised objective (modeling or reconstructing the input) and a supervised objective (helping to predict the target). [sent-306, score-0.932]
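A mixed criterion of this kind might be sketched as one Contrastive Divergence update followed by one supervised gradient step through a temporary output layer. Everything specific here is an assumption for illustration: the linear read-out (V, d), the squared-error loss, the separate learning rates, and the function name are not taken from the paper's appendix.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def partially_supervised_step(x, y, W, b, c, V, d, lr_unsup, lr_sup, rng):
    """Mixed update for one layer; returns the squared error before the update.

    W: (n_hidden, n_visible) layer weights, b/c: visible/hidden biases.
    V: (n_out, n_hidden) and d: (n_out,) form a temporary linear output layer.
    """
    # Unsupervised part: contrastive divergence with k = 1.
    q_h0 = sigm(c + W @ x)
    h0 = (rng.random(q_h0.shape) < q_h0).astype(float)
    p_v1 = sigm(b + W.T @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    q_h1 = sigm(c + W @ v1)
    # Supervised part: squared error of a linear read-out of the mean-field code.
    h = q_h0
    err = V @ h + d - y
    grad_a = (V.T @ err) * h * (1.0 - h)  # gradient w.r.t. hidden pre-activations
    # Combine both signals on the shared layer parameters.
    W += lr_unsup * (np.outer(h0, x) - np.outer(q_h1, v1)) - lr_sup * np.outer(grad_a, x)
    b += lr_unsup * (x - v1)
    c += lr_unsup * (h0 - q_h1) - lr_sup * grad_a
    V -= lr_sup * np.outer(err, h)
    d -= lr_sup * err
    return 0.5 * float(err @ err)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 6))
b, c = np.zeros(6), np.zeros(4)
V, d = np.zeros((1, 4)), np.zeros(1)
x = np.array([1, 0, 1, 1, 0, 0], dtype=float)
y = np.array([1.0])
loss = partially_supervised_step(x, y, W, b, c, V, d, 0.01, 0.1, rng)
```

The ratio of the two learning rates controls how strongly the target is forced into the layer's representation relative to modeling the input.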

95 In our experiments it appeared sufﬁcient to perform that partial supervision with the ﬁrst layer only, since once the predictive information about the target is “forced” into the representation of the ﬁrst layer, it tends to stay in the upper layers. [sent-308, score-0.425]

96 The results in Figure 1 and Table 1 clearly show the advantage of this partially supervised greedy training algorithm, in the case of the ﬁnancial dataset. [sent-309, score-0.47]

97 6 Conclusion This paper is motivated by the need to develop good training algorithms for deep architectures, since these can be much more representationally efficient than shallow ones such as SVMs and one-hidden-layer neural nets. [sent-311, score-0.562]

98 These experiments suggest a general principle that can be applied beyond DBNs, and we obtained similar results when each layer is initialized as an auto-associator instead of as an RBM. [sent-316, score-0.446]

99 In that case the DBN unsupervised greedy layer-wise strategy appears inadequate and we proposed a simple ﬁx based on partial supervision, that can yield signiﬁcant improvements. [sent-318, score-0.349]

100 Training MLPs layer by layer using an objective function for internal representations. [sent-407, score-0.823]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('layer', 0.4), ('rbm', 0.357), ('dbn', 0.336), ('deep', 0.312), ('layers', 0.284), ('units', 0.232), ('greedy', 0.195), ('hinton', 0.155), ('hidden', 0.154), ('supervised', 0.153), ('shallow', 0.133), ('dbns', 0.116), ('unsupervised', 0.11), ('sigm', 0.103), ('bengio', 0.093), ('training', 0.088), ('binomial', 0.083), ('contrastive', 0.082), ('vk', 0.08), ('network', 0.079), ('inputs', 0.076), ('energy', 0.072), ('input', 0.066), ('rbms', 0.064), ('hk', 0.06), ('cotton', 0.059), ('cun', 0.059), ('trainunsuperviseddbn', 0.059), ('networks', 0.056), ('criterion', 0.051), ('le', 0.048), ('appendix', 0.048), ('delalleau', 0.047), ('roux', 0.047), ('strategy', 0.044), ('exponential', 0.043), ('net', 0.042), ('unit', 0.041), ('truncated', 0.041), ('mnist', 0.039), ('divergence', 0.039), ('gj', 0.039), ('biases', 0.038), ('initialization', 0.038), ('helps', 0.036), ('rise', 0.035), ('abalone', 0.035), ('revealing', 0.035), ('train', 0.035), ('pieces', 0.034), ('partially', 0.034), ('architectures', 0.033), ('boltzmann', 0.033), ('pre', 0.033), ('initializing', 0.033), ('updates', 0.031), ('reconstruction', 0.03), ('conditional', 0.03), ('validation', 0.03), ('whole', 0.03), ('allender', 0.029), ('denoeux', 0.029), ('fahlman', 0.029), ('hastad', 0.029), ('lebiere', 0.029), ('lengell', 0.029), ('marcotte', 0.029), ('mineiro', 0.029), ('montr', 0.029), ('movellan', 0.029), ('rbmupdate', 0.029), ('stracuzzi', 0.029), ('uncooperative', 0.029), ('utgoff', 0.029), ('wjk', 0.029), ('ya', 0.029), ('neural', 0.029), ('gi', 0.028), ('hypothesis', 0.027), ('purely', 0.027), ('explanation', 0.027), ('belief', 0.026), ('salakhutdinov', 0.026), ('target', 0.025), ('top', 0.025), ('gradient', 0.024), ('experiment', 0.023), ('analyses', 0.023), ('nn', 0.023), ('better', 0.023), ('tesauro', 0.023), ('bi', 0.023), ('propagation', 0.023), ('internal', 0.023), ('initialized', 0.023), ('suggest', 0.023), ('circuits', 0.022), ('machines', 0.022), ('vincent', 0.022), ('sigmoidal', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999994 88 nips-2006-Greedy Layer-Wise Training of Deep Networks

Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle

2 0.29708162 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

Abstract: Independent Component Analysis (ICA) is a popular method for extracting independent features from visual data. However, as a fundamentally linear technique, there is always nonlinear residual redundancy that is not captured by ICA. Hence there have been many attempts to try to create a hierarchical version of ICA, but so far none of the approaches have a natural way to apply them more than once. Here we show that there is a relatively simple technique that transforms the absolute values of the outputs of a previous application of ICA into a normal distribution, to which ICA may be applied again. This results in a recursive ICA algorithm that may be applied any number of times in order to extract higher order structure from previous layers.

3 0.29507089 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis

Abstract: We propose a non-linear generative model for human motion data that uses an undirected model with binary latent variables and real-valued “visible” variables that represent joint angles. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. Such an architecture makes on-line inference efﬁcient and allows us to use a simple approximate learning procedure. After training, the model ﬁnds a single set of parameters that simultaneously capture several different kinds of motion. We demonstrate the power of our approach by synthesizing various motion sequences and by performing on-line ﬁlling in of data lost during motion capture. Website: http://www.cs.toronto.edu/∼gwtaylor/publications/nips2006mhmublv/

4 0.14369516 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

Author: Marc'aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann L. Cun

Abstract: We describe a novel unsupervised method for learning sparse, overcomplete features. The model uses a linear encoder, and a linear decoder preceded by a sparsifying non-linearity that turns a code vector into a quasi-binary sparse code vector. Given an input, the optimal code minimizes the distance between the output of the decoder and the input patch while being as similar as possible to the encoder output. Learning proceeds in a two-phase EM-like fashion: (1) compute the minimum-energy code vector, (2) adjust the parameters of the encoder and decoder so as to decrease the energy. The model produces “stroke detectors” when trained on handwritten numerals, and Gabor-like filters when trained on natural image patches. Inference and learning are very fast, requiring no preprocessing, and no expensive sampling. Using the proposed unsupervised method to initialize the first layer of a convolutional network, we achieved an error rate slightly lower than the best reported result on the MNIST dataset. Finally, an extension of the method is described to learn topographical filter maps.

5 0.076981001 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

Author: Su-in Lee, Varun Ganapathi, Daphne Koller

Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efﬁcient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efﬁcient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1 -based method achieves considerably higher generalization performance than the more standard L2 -based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.

6 0.074837208 139 nips-2006-Multi-dynamic Bayesian Networks

7 0.066774309 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo

8 0.064025015 116 nips-2006-Learning from Multiple Sources

9 0.057168528 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

10 0.056818459 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

11 0.056599353 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space

12 0.048522171 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation

13 0.047620822 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models

14 0.047072127 50 nips-2006-Chained Boosting

15 0.04688371 57 nips-2006-Conditional mean field

16 0.045587994 108 nips-2006-Large Scale Hidden Semi-Markov SVMs

17 0.045508381 154 nips-2006-Optimal Change-Detection and Spiking Neurons

18 0.044794917 69 nips-2006-Distributed Inference in Dynamical Systems

19 0.044656109 181 nips-2006-Stability of $K$-Means Clustering

20 0.04464709 45 nips-2006-Blind Motion Deblurring Using Image Statistics

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.18), (1, -0.031), (2, 0.081), (3, -0.073), (4, 0.04), (5, -0.012), (6, -0.022), (7, -0.059), (8, -0.048), (9, 0.22), (10, -0.104), (11, -0.112), (12, -0.207), (13, -0.003), (14, -0.218), (15, -0.006), (16, -0.098), (17, 0.251), (18, 0.159), (19, -0.128), (20, -0.137), (21, 0.022), (22, 0.177), (23, -0.028), (24, -0.105), (25, 0.023), (26, 0.105), (27, 0.072), (28, -0.048), (29, -0.044), (30, 0.014), (31, 0.203), (32, -0.006), (33, 0.04), (34, -0.17), (35, -0.124), (36, 0.188), (37, -0.107), (38, -0.017), (39, 0.141), (40, -0.096), (41, -0.105), (42, -0.013), (43, 0.096), (44, 0.068), (45, 0.011), (46, -0.001), (47, -0.06), (48, 0.056), (49, -0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96428043 88 nips-2006-Greedy Layer-Wise Training of Deep Networks

Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle

2 0.71551657 167 nips-2006-Recursive ICA

Author: Honghao Shan, Lingyun Zhang, Garrison W. Cottrell

3 0.65366161 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis

4 0.61890543 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

Author: Marc'aurelio Ranzato, Christopher Poultney, Sumit Chopra, Yann L. Cun

5 0.31282359 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions

Author: Christopher J. Burges, Robert Ragno, Quoc V. Le

Abstract: The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions.

6 0.26697826 13 nips-2006-A Scalable Machine Learning Approach to Go

7 0.25966853 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields

8 0.25960672 139 nips-2006-Multi-dynamic Bayesian Networks

9 0.25003532 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields

10 0.24935377 113 nips-2006-Learning Structural Equation Models for fMRI

11 0.24651052 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models

12 0.24478333 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

13 0.24337126 189 nips-2006-Temporal dynamics of information content carried by neurons in the primary visual cortex

14 0.23751171 108 nips-2006-Large Scale Hidden Semi-Markov SVMs

15 0.22138625 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis

16 0.21949844 169 nips-2006-Relational Learning with Gaussian Processes

17 0.21227027 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation

18 0.21126437 129 nips-2006-Map-Reduce for Machine Learning on Multicore

19 0.21067457 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy

20 0.20508532 41 nips-2006-Bayesian Ensemble Learning

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.066), (3, 0.014), (7, 0.073), (9, 0.032), (20, 0.022), (22, 0.05), (44, 0.075), (57, 0.066), (65, 0.039), (69, 0.464)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92909265 147 nips-2006-Non-rigid point set registration: Coherent Point Drift

Author: Andriy Myronenko, Xubo Song, Miguel Á. Carreira-Perpiñán

Abstract: We introduce Coherent Point Drift (CPD), a novel probabilistic method for non-rigid registration of point sets. The registration is treated as a Maximum Likelihood (ML) estimation problem with a motion coherence constraint over the velocity field such that one point set moves coherently to align with the second set. We formulate the motion coherence constraint and derive a solution of regularized ML estimation through the variational approach, which leads to an elegant kernel form. We also derive the EM algorithm for the penalized ML optimization with deterministic annealing. The CPD method simultaneously finds both the non-rigid transformation and the correspondence between two point sets without making any prior assumption of the transformation model except that of motion coherence. This method can estimate complex non-linear non-rigid transformations, and is shown to be accurate on 2D and 3D examples and robust in the presence of outliers and missing points.

same-paper 2 0.91491222 88 nips-2006-Greedy Layer-Wise Training of Deep Networks

Author: Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle

3 0.89478189 176 nips-2006-Single Channel Speech Separation Using Factorial Dynamics

Author: John R. Hershey, Trausti Kristjansson, Steven Rennie, Peder A. Olsen

Abstract: Human listeners have the extraordinary ability to hear and recognize speech even when more than one person is talking. Their machine counterparts have historically been unable to compete with this ability, until now. We present a model-based system that performs on par with humans in the task of separating speech of two talkers from a single-channel recording. Remarkably, the system surpasses human recognition performance in many conditions. The models of speech use temporal dynamics to help infer the source speech signals, given mixed speech signals. The estimated source signals are then recognized using a conventional speech recognition system. We demonstrate that the system achieves its best performance when the model of temporal dynamics closely captures the grammatical constraints of the task.

One of the hallmarks of human perception is our ability to solve the auditory cocktail party problem: we can direct our attention to a given speaker in the presence of interfering speech, and understand what was said remarkably well. Until now the same could not be said for automatic speech recognition systems. However, we have recently introduced a system which in many conditions performs this task better than humans [1][2]. The model addresses the Pascal Speech Separation Challenge task [3], and outperforms all other published results by more than 10% word error rate (WER). In this model, dynamics are modeled using a layered combination of one or two Markov chains: one for long-term dependencies and another for short-term dependencies. The combination of the two speakers was handled via an iterative Laplace approximation method known as Algonquin [4]. Here we describe experiments that show better performance on the same task with a simpler version of the model.
The task we address is provided by the PASCAL Speech Separation Challenge [3], which provides standard training, development, and test data sets of single-channel speech mixtures following an arbitrary but simple grammar. In addition, the challenge organizers have conducted human-listening experiments to provide an interesting baseline for comparison of computational techniques. The overall system we developed is composed of three components: a speaker identification and gain estimation component, a signal separation component, and a speech recognition system. In this paper we focus on the signal separation component, which is composed of the acoustic and grammatical models. The details of the other components are discussed in [2]. Single-channel speech separation has previously been attempted using Gaussian mixture models (GMMs) on individual frames of acoustic features. However, such models tend to perform well only when speakers are of different gender or have rather different voices [4]. When speakers have similar voices, speaker-dependent mixture models cannot unambiguously identify the component speakers. In such cases it is helpful to model the temporal dynamics of the speech. Several models in the literature have attempted to do so either for recognition [5, 6] or enhancement [7, 8] of speech. Such models have typically been based on a discrete-state hidden Markov model (HMM) operating on a frame-based acoustic feature vector. Modeling the dynamics of the log spectrum of speech is challenging in that different speech components evolve at different time-scales. For example, the excitation, which carries mainly pitch, and the filter, which consists of the formant structure, are somewhat independent of each other. The formant structure closely follows the sequences of phonemes in each word, which are pronounced at a rate of several per second.
In non-tonal languages such as English, the pitch fluctuates with prosody over the course of a sentence, and is not directly coupled with the words being spoken. Nevertheless, it seems to be important in separating speech, because the pitch harmonics carry predictable structure that stands out against the background. We address the various dynamic components of speech by testing different levels of dynamic constraints in our models. We explore four different levels of dynamics: no dynamics, low-level acoustic dynamics, high-level grammar dynamics, and a layered combination, dual dynamics, of the acoustic and grammar dynamics. The grammar dynamics and dual dynamics models perform the best in our experiments. The acoustic models are combined to model mixtures of speech using two methods: a nonlinear model known as Algonquin, which models the combination of log-spectrum models as a sum in the power spectrum, and a simpler max model that combines two log spectra using the max function. It turns out that whereas Algonquin works well, our formulation of the max model does better overall. With the combination of the max model and grammar-level dynamics, the model produces remarkable results: it is often able to extract two utterances from a mixture even when they are from the same speaker (footnote 1). Overall results are given in Table 1, which shows that our closest competitors are human listeners.

Table 1: Overall word error rates across all conditions on the challenge task. Human: average human error rate, IBM: our best result, Next Best: the best of the eight other published results on this task, and Chance: the theoretical error rate for random guessing.

System       Word Error Rate
Human        22.3%
IBM          22.6%
Next Best    34.2%
Chance       93.0%

1 Speech Models

The model consists of an acoustic model and temporal dynamics model for each source, and a mixing model, which models how the source models are combined to describe the mixture.
The acoustic features were short-time log spectrum frames computed every 15 ms. Each frame was of length 40 ms and a 640-point mixed-radix FFT was used. The DC component was discarded, producing a 319-dimensional log-power-spectrum feature vector $y_t$. The acoustic model consists of a set of diagonal-covariance Gaussians in the features. For a given speaker, $a$, we model the conditional probability of the log-power spectrum of each source signal $x^a$ given a discrete acoustic state $s^a$ as Gaussian, $p(x^a | s^a) = N(x^a; \mu_{s^a}, \Sigma_{s^a})$, with mean $\mu_{s^a}$ and covariance matrix $\Sigma_{s^a}$. We used 256 Gaussians, one per acoustic state, to model the acoustic space of each speaker. For efficiency and tractability we restrict the covariance to be diagonal. A model with no dynamics can be formulated by producing state probabilities $p(s^a)$, and is depicted in Figure 1(a).

Acoustic Dynamics: To capture the low-level dynamics of the acoustic signal, we modeled the acoustic dynamics of a given speaker, $a$, via state transitions $p(s^a_t | s^a_{t-1})$ as shown in Figure 1(b). There are 256 acoustic states, hence for each speaker $a$, we estimated a 256 × 256 element transition matrix $A^a$.

Grammar Dynamics: The grammar dynamics are modeled by grammar state transitions, $p(v^a_t | v^a_{t-1})$, which consist of left-to-right phone models. The legal word sequences are given by the Speech Separation Challenge grammar [3] and are modeled using a set of pronunciations that map from words to three-state context-dependent phone models. The state transition probabilities derived from these phone models are sparse in the sense that most transition probabilities are zero. We model speaker-dependent distributions $p(s^a | v^a)$ that associate the grammar states $v^a$ to the speaker-dependent acoustic states. These are learned from training data where the grammar state sequences and acoustic state sequences are known for each utterance. The grammar of our system has 506 states, so we estimate a 506 × 256 element conditional probability matrix $B^a$ for each speaker.

Footnote 1: Demos and information can be found at: http://www.research.ibm.com/speechseparation

Figure 1 (panels: (a) No Dynamics, (b) Acoustic Dynamics, (c) Grammar Dynamics, (d) Dual Dynamics): Graph of models for a given source. In (a), there are no dynamics, so the model is a simple mixture model. In (b), only acoustic dynamics are modeled. In (c), grammar dynamics are modeled with a shared set of acoustic Gaussians; in (d), dual (grammar and acoustic) dynamics have been combined. Note that (a), (b) and (c) are special cases of (d), where different nodes are assumed independent.

Dual Dynamics: The dual-dynamics model combines the acoustic dynamics with the grammar dynamics. It is useful in this case to avoid modeling the full combination of $s$ and $v$ states in the joint transitions $p(s^a_t | s^a_{t-1}, v_t)$. Instead we make a naive-Bayes assumption to approximate this as $\frac{1}{z} p(s^a_t | s^a_{t-1})^{\alpha} p(s^a_t | v_t)^{\beta}$, where $\alpha$ and $\beta$ adjust the relative influence of the two probabilities, and $z$ is the normalizing constant. Here we simply use the probability matrices $A^a$ and $B^a$, defined above.

2 Mixed Speech Models

The speech separation challenge involves recognizing speech in mixtures of signals from two speakers, $a$ and $b$. We consider only mixing models that operate independently on each frequency for analytical and computational tractability. The short-time log spectrum of the mixture $y_t$, in a given frequency band, is related to that of the two sources $x^a_t$ and $x^b_t$ via the mixing model given by the conditional probability distribution, $p(y | x^a, x^b)$. The joint distribution of the observation and source in one feature dimension, given the source states, is thus:

$p(y_t, x^a_t, x^b_t | s^a_t, s^b_t) = p(y_t | x^a_t, x^b_t) \, p(x^a_t | s^a_t) \, p(x^b_t | s^b_t)$. (1)

In general, to infer and reconstruct speech we need to compute the likelihood of the observed mixture

$p(y_t | s^a_t, s^b_t) = \int p(y_t, x^a_t, x^b_t | s^a_t, s^b_t) \, dx^a_t \, dx^b_t$, (2)

and the posterior expected values of the sources given the states,

$E(x^a_t | y_t, s^a_t, s^b_t) = \int x^a_t \, p(x^a_t, x^b_t | y_t, s^a_t, s^b_t) \, dx^a_t \, dx^b_t$, (3)

and similarly for $x^b_t$. These quantities, combined with a prior model for the joint state sequences $\{s^a_{1..T}, s^b_{1..T}\}$, allow us to compute the minimum mean squared error (MMSE) estimators $E(x^a_{1..T} | y_{1..T})$ or the maximum a posteriori (MAP) estimate $E(x^a_{1..T} | y_{1..T}, \hat{s}^a_{1..T}, \hat{s}^b_{1..T})$, where $\hat{s}^a_{1..T}, \hat{s}^b_{1..T} = \arg\max_{s^a_{1..T}, s^b_{1..T}} p(s^a_{1..T}, s^b_{1..T} | y_{1..T})$, and where the subscript $1..T$ refers to all frames in the signal. The mixing model can be defined in a number of ways. We explore two popular candidates, for which the above integrals can be readily computed: Algonquin, and the max model.

Figure 2 (panels: (a) Mixing Model, (b) Dual Dynamics Factorial Model): Model combination for two talkers. In (a) all dependencies are shown. In (b) the full dual-dynamics model is graphed with the $x^a$ and $x^b$ integrated out, and corresponding states from each speaker combined into product states. The other models are special cases of this graph with different edges removed, as in Figure 1.

Algonquin: The relationship between the sources and mixture in the log power spectral domain is approximated as

$p(y_t | x^a_t, x^b_t) = N(y_t; \log(\exp(x^a_t) + \exp(x^b_t)), \Psi)$ (4)

where $\Psi$ is introduced to model the error due to the omission of phase [4]. An iterative Newton-Laplace method accurately approximates the conditional posterior $p(x^a_t, x^b_t | y_t, s^a_t, s^b_t)$ from (1) as Gaussian. This Gaussian allows us to analytically compute the observation likelihood $p(y_t | s^a_t, s^b_t)$ and expected value $E(x^a_t | y_t, s^a_t, s^b_t)$, as in [4].

Max model: The mixing model is simplified using the fact that the log of a sum is approximately the log of the maximum:

$p(y | x^a, x^b) = \delta(y - \max(x^a, x^b))$ (5)

In this model the likelihood is

$p(y_t | s^a_t, s^b_t) = p_{x^a_t}(y_t | s^a_t) \, \Phi_{x^b_t}(y_t | s^b_t) + p_{x^b_t}(y_t | s^b_t) \, \Phi_{x^a_t}(y_t | s^a_t)$, (6)

where $\Phi_{x^a_t}(y_t | s^a_t) = \int_{-\infty}^{y_t} N(x^a_t; \mu_{s^a_t}, \Sigma_{s^a_t}) \, dx^a_t$ is a Gaussian cumulative distribution function [5]. In [5], such a model was used to compute state likelihoods and find the optimal state sequence. In [8], a simplified model was used to infer binary masking values for refiltering. We take the max model a step further and derive source posteriors, so that we can compute the MMSE estimators for the log power spectrum. Note that the source posteriors in $x^a_t$ and $x^b_t$ are each a mixture of a delta function and a truncated Gaussian. Thus we analytically derive the necessary expected value:

$E(x^a_t | y_t, s^a_t, s^b_t) = p(x^a_t = y_t | y_t, s^a_t, s^b_t) \, y_t + p(x^a_t < y_t | y_t, s^a_t, s^b_t) \, E(x^a_t | x^a_t < y_t, s^a_t)$ (7)

$= \pi^a_t \, y_t + \pi^b_t \left( \mu_{s^a_t} - \Sigma_{s^a_t} \frac{p_{x^a_t}(y_t | s^a_t)}{\Phi_{x^a_t}(y_t | s^a_t)} \right)$, (8)

with weights $\pi^a_t = p(x^a_t = y_t | y_t, s^a_t, s^b_t) = p_{x^a_t}(y_t | s^a_t) \, \Phi_{x^b_t}(y_t | s^b_t) / p(y_t | s^a_t, s^b_t)$, and $\pi^b_t = 1 - \pi^a_t$. For many pairs of states one model is significantly louder than another, $\mu_{s^a} \gg \mu_{s^b}$, in a given frequency band, relative to their variances. In such cases it is reasonable to approximate the likelihood as $p(y_t | s^a_t, s^b_t) \approx p_{x^a_t}(y_t | s^a_t)$, and the posterior expected values according to $E(x^a_t | y_t, s^a_t, s^b_t) \approx y_t$ and $E(x^b_t | y_t, s^a_t, s^b_t) \approx \min(y_t, \mu_{s^b_t})$, and similarly for $\mu_{s^a} \ll \mu_{s^b}$.

3 Likelihood Estimation

Because of the large number of state combinations, the model would not be practical without techniques to reduce computation time. To speed up the evaluation of the joint state likelihood, we employed both band quantization of the acoustic Gaussians and joint-state pruning.
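Before turning to those speedups, the max-model quantities above (the likelihood of Eq. (6) and the posterior means of Eqs. (7) and (8)) can be sketched for a single frequency band. This is only an illustrative sketch, not the authors' implementation: the function names `gauss_pdf`, `gauss_cdf`, and `max_model` are hypothetical, each source state is assumed to be a scalar Gaussian with mean `mu` and variance `var`, and no log-domain safeguards against underflow are included.

```python
import math


def gauss_pdf(y, mu, var):
    """Density N(y; mu, var) of a scalar Gaussian."""
    return math.exp(-0.5 * (y - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gauss_cdf(y, mu, var):
    """Cumulative distribution Phi(y) of a scalar Gaussian N(mu, var)."""
    return 0.5 * (1.0 + math.erf((y - mu) / math.sqrt(2.0 * var)))


def max_model(y, mu_a, var_a, mu_b, var_b):
    """Return (likelihood, E[x_a | y], E[x_b | y]) for one band and one state pair.

    Hypothetical helper: assumes y = max(x_a, x_b) with independent Gaussian sources.
    """
    p_a, p_b = gauss_pdf(y, mu_a, var_a), gauss_pdf(y, mu_b, var_b)
    c_a, c_b = gauss_cdf(y, mu_a, var_a), gauss_cdf(y, mu_b, var_b)

    # Eq. (6): either source a equals y while b lies below it, or the reverse.
    lik = p_a * c_b + p_b * c_a

    # Posterior weight that source a is the louder one (x_a = y).
    pi_a = p_a * c_b / lik
    pi_b = 1.0 - pi_a

    # Eq. (8): mix y with the mean of the Gaussian truncated to x < y.
    ex_a = pi_a * y + pi_b * (mu_a - var_a * p_a / c_a)
    ex_b = pi_b * y + pi_a * (mu_b - var_b * p_b / c_b)
    return lik, ex_a, ex_b


if __name__ == "__main__":
    # Source a is much louder than source b in this band.
    print(max_model(10.0, 10.0, 1.0, 0.0, 1.0))
```

When one source is far louder than the other relative to the variances, the returned values approach the approximation quoted in the text: the louder source is estimated as roughly `y` and the quieter one as roughly `min(y, mu_b)`.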
Band Quantization: One source of computational savings stems from the fact that some of the Gaussians in our model may differ only in a few features. Band quantization addresses this by approximating each of the $D$ Gaussians of each model with a shared set of $d$ Gaussians, where $d \ll D$, in each of the $F$ frequency bands of the feature vector. A similar idea is described in [9]. It relies on the use of a diagonal covariance matrix, so that $p(x^a | s^a) = \prod_f N(x^a_f; \mu_{f,s^a}, \Sigma_{f,s^a})$, where $\Sigma_{f,s^a}$ are the diagonal elements of covariance matrix $\Sigma_{s^a}$. The mapping $M_f(s_i)$ associates each of the $D$ Gaussians with one of the $d$ Gaussians in band $f$. Now $\hat{p}(x^a | s^a) = \prod_f N(x^a_f; \mu_{f,M_f(s^a)}, \Sigma_{f,M_f(s^a)})$ is used as a surrogate for $p(x^a | s^a)$. Figure 3 illustrates the idea.

Figure 3: In band quantization, many multi-dimensional Gaussians are mapped to a few unidimensional Gaussians.

Under this model the $d$ Gaussians are optimized by minimizing the KL-divergence $D\left( \sum_{s^a} p(s^a) p(x^a | s^a) \,\|\, \sum_{s^a} p(s^a) \hat{p}(x^a | s^a) \right)$, and likewise for $s^b$. Then in each frequency band, only $d \times d$, instead of $D \times D$, combinations of Gaussians have to be evaluated to compute $p(y | s^a, s^b)$. Despite the relatively small number of components $d$ in each band, taken across bands, band quantization is capable of expressing $d^F$ distinct patterns in an $F$-dimensional feature space, although in practice only a subset of these will be used to approximate the Gaussians in a given model. We used $d = 8$ and $D = 256$, which reduced the likelihood computation time by three orders of magnitude.

Joint State Pruning: Another source of computational savings comes from the sparseness of the model. Only a handful of $s^a, s^b$ combinations have likelihoods that are significantly larger than the rest for a given observation. Only these states are required to adequately explain the observation.
By pruning the total number of combinations down to a smaller number we can speed up the likelihood calculation, the estimation of the component signals, and the temporal inference. However, we must estimate the likelihoods in order to determine which states to retain. We therefore used band quantization to estimate likelihoods for all states, performed state pruning, and then evaluated the full model on the pruned states using the exact parameters. In the experiments reported here, we pruned down to 256 state combinations. The effect of these speedup methods on accuracy will be reported in a future publication.

4 Inference

In our experiments we performed inference in four different conditions: no dynamics, with acoustic dynamics only, with grammar dynamics only, and with dual dynamics (acoustic and grammar). With no dynamics the source models reduce to GMMs and we infer MMSE estimates of the sources based on $p(x^a, x^b | y)$ as computed from (1), using Algonquin or the max model. Once the log spectrum of each source is estimated, we estimate the corresponding time-domain signal as shown in [4]. In the acoustic dynamics condition the exact inference algorithm uses a 2-dimensional Viterbi search, described below, with acoustic temporal constraints $p(s_t | s_{t-1})$ and likelihoods from Eqn. (1), to find the most likely joint state sequence $s_{1..T}$. Similarly in the grammar dynamics condition, 2-D Viterbi search is used to infer the grammar state sequences, $v_{1..T}$. Instead of single Gaussians as the likelihood models, however, we have mixture models in this case. So we can perform an MMSE estimate of the sources by averaging over the posterior probability of the mixture components given the grammar Viterbi sequence and the observations. It is critical to use the 2-D Viterbi algorithm in both cases, rather than the forward-backward algorithm, because in the same-speaker condition at 0 dB, the acoustic models and dynamics are symmetric.
This symmetry means that the posterior is essentially bimodal and averaging over these modes would yield identical estimates for both speakers. By finding the best path through the joint state space, the 2-D Viterbi algorithm breaks this symmetry and allows the model to make different estimates for each speaker. In the dual-dynamics condition we use the model of section 2(b). With two speakers, exact inference is computationally complex because the full joint distribution of the grammar and acoustic states, $(v^a \times s^a) \times (v^b \times s^b)$, is required and is very large in number. Instead we perform approximate inference by alternating the 2-D Viterbi search between two factors: the Cartesian product $s^a \times s^b$ of the acoustic state sequences and the Cartesian product $v^a \times v^b$ of the grammar state sequences. When evaluating each state sequence we hold the other chain constant, which decouples its dynamics and allows for efficient inference. This is a useful factorization because the states $s^a$ and $s^b$ interact strongly with each other, and similarly for $v^a$ and $v^b$. Again, in the same-talker condition, the 2-D Viterbi search breaks the symmetry in each factor.

2-D Viterbi search: The Viterbi algorithm estimates the maximum-likelihood state sequence $s_{1..T}$ given the observations $x_{1..T}$. The complexity of the Viterbi search is $O(T D^2)$, where $D$ is the number of states and $T$ is the number of frames. For producing MAP estimates of the 2 sources, we require a 2-dimensional Viterbi search which finds the most likely joint state sequences $s^a_{1..T}$ and $s^b_{1..T}$ given the mixed signal $y_{1..T}$, as was proposed in [5]. On the surface, the 2-D Viterbi search appears to be of complexity $O(T D^4)$. Surprisingly, it can be computed in $O(T D^3)$ operations. This stems from the fact that the dynamics for each chain are independent. The forward-backward algorithm for a factorial HMM with $N$ state variables requires only $O(T N D^{N+1})$ rather than the $O(T D^{2N})$ required for a naive implementation [10].
The same is true for the Viterbi algorithm. In the Viterbi algorithm, we wish to find the most probable paths leading to each state by finding the two arguments $s^a_{t-1}$ and $s^b_{t-1}$ of the following maximization:

$\{\hat{s}^a_{t-1}, \hat{s}^b_{t-1}\} = \arg\max_{s^a_{t-1}, s^b_{t-1}} p(s^a_t | s^a_{t-1}) \, p(s^b_t | s^b_{t-1}) \, p(s^a_{t-1}, s^b_{t-1} | y_{1..t-1})$
$= \arg\max_{s^a_{t-1}} p(s^a_t | s^a_{t-1}) \max_{s^b_{t-1}} p(s^b_t | s^b_{t-1}) \, p(s^a_{t-1}, s^b_{t-1} | y_{1..t-1})$. (9)

The two maximizations can be done in sequence, requiring $O(D^3)$ operations with $O(D^2)$ storage for each step. In general, as with the forward-backward algorithm, the $N$-dimensional Viterbi search requires $O(T N D^{N+1})$ operations. We can also exploit the sparsity of the transition matrices and observation likelihoods, by pruning unlikely values. Using both of these methods our implementation of 2-D Viterbi search is faster than the acoustic likelihood computation that serves as its input, for the model sizes and grammars chosen in the speech separation task.

Speaker and Gain Estimation: In the challenge task, the gains and identities of the two speakers were unknown at test time and were selected from a set of 34 speakers which were mixed at SNRs ranging from 6 dB to -9 dB. We used speaker-dependent acoustic models because of their advantages when separating different speakers. These models were trained on gain-normalized data, so the models are not well matched to the different gains of the signals at test time. This means that we have to estimate both the speaker identities and the gain in order to adapt our models to the source signals for each test utterance. The number of speakers and range of SNRs in the test set makes it too expensive to consider every possible combination of models and gains. Instead, we developed an efficient model-based method for identifying the speakers and gains, described in [2].
The algorithm is based upon a very simple idea: identify and utilize frames that are dominated by a single source (based on their likelihoods under each speaker-dependent acoustic model) to determine what sources are present in the mixture. Using this criterion we can eliminate most of the unlikely speakers, and explore all combinations of the remaining speakers. An approximate EM procedure is then used to select a single pair of speakers and estimate their gains.

Recognition: Although inference in the system may involve recognition of the words (for models that contain a grammar), we still found that a separately trained recognizer performed better. After reconstruction, each of the two signals is therefore decoded with a speech recognition system that incorporates Speaker Dependent Labeling (SDL) [2]. This method uses speaker-dependent models for each of the 34 speakers. Instead of using the speaker identities provided by the speaker ID and gain module, we followed the approach for gender dependent labeling (GDL) described in [11]. This technique provides better results than if the true speaker ID is specified.

5 Results

The Speech Separation Challenge [3] involves separating the mixed speech of two speakers drawn from a set of 34 speakers. An example utterance is "place white by R 4 now". In each recording, one of the speakers says white while the other says blue, red or green. The task is to recognize the letter and the digit of the speaker that said white. Using the SDL recognizer, we decoded the two estimated signals under the assumption that one signal contains white and the other does not, and vice versa. We then used the association that yielded the highest combined likelihood.

[Figure 4 plot: WER (%) by talker condition (Same Talker, Same Gender, Different Gender, All) for No Separation, No dynamics, Acoustic Dyn., Grammar Dyn., Dual Dyn., and Human.]

Figure 4: Average word error rate (WER) as a function of model dynamics, in different talker conditions, compared to Human error rates, using Algonquin.

Human listener performance [3] is compared in Figure 4 to results using the SDL recognizer without speech separation, and for each of the proposed models. Performance is poor without separation in all conditions. With no dynamics the models do surprisingly well in the different talker conditions, but poorly when the signals come from the same talker. Acoustic dynamics gives some improvement, mainly in the same-talker condition. The grammar dynamics seems to give the most benefit, bringing the error rate in the same-gender condition below that of humans. The dual-dynamics model performed about the same as the grammar dynamics model, despite our intuitions. Replacing Algonquin with the max model reduced the error rate in the dual dynamics model (from 24.3% to 23.5%) and grammar dynamics model (from 24.6% to 22.6%), which brings the latter closer than any other model to the human recognition rate of 22.3%. Figure 5 shows the relative word error rate of the best system compared to human subjects. When both speakers are around the same loudness, the system exceeds human performance, and in the same-gender condition makes less than half the errors of the humans. Human listeners do better when the two signals are at different levels, even if the target is below the masker (i.e., at -9 dB), suggesting that they are better able to make use of differences in amplitude as a cue for separation.

[Figure 5 plot: Relative Word Error Rate (WER) versus Signal to Noise Ratio (SNR), from 6 dB to −9 dB, for Same Talker, Same Gender, Different Gender, and Human.]

Figure 5: Word error rate of best system relative to human performance. Shaded area is where the system outperforms human listeners.

An interesting question is to what extent different grammar constraints affect the results.
To test this, we limited the grammar to just the two test utterances, and the error rate on the estimated sources dropped to around 10%. This may be a useful paradigm for separating speech from background noise when the text is known, such as in closed-captioned recordings. At the other extreme, in realistic speech recognition scenarios, there is little knowledge of the background speaker's grammar. In such cases the benefits of models of low-level acoustic continuity over purely grammar-based systems may be more apparent. It is our hope that further experiments with both human and machine listeners will provide us with a better understanding of the differences in their performance characteristics, and provide insights into how the human auditory system functions, as well as how automatic speech perception in general can be brought to human levels of performance.

References

[1] T. Kristjansson, J. R. Hershey, P. A. Olsen, S. Rennie, and R. Gopinath, “Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system,” in ICSLP, 2006.
[2] Steven Rennie, Peder A. Olsen, John R. Hershey, and Trausti Kristjansson, “Separating multiple speakers using temporal constraints,” in ISCA Workshop on Statistical And Perceptual Audition, 2006.
[3] Martin Cooke and Tee-Won Lee, “Interspeech speech separation challenge,” http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge.htm, 2006.
[4] T. Kristjansson, J. Hershey, and H. Attias, “Single microphone source separation using high resolution signal reconstruction,” ICASSP, 2004.
[5] P. Varga and R.K. Moore, “Hidden Markov model decomposition of speech and noise,” ICASSP, pp. 845–848, 1990.
[6] M. Gales and S. Young, “Robust continuous speech recognition using parallel model combination,” IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352–359, September 1996.
[7] Y. Ephraim, “A Bayesian estimation approach for speech enhancement using hidden Markov models,” vol. 40, no. 4, pp. 725–735, 1992.
[8] S. Roweis, “Factorial models and refiltering for speech separation and denoising,” Eurospeech, pp. 1009–1012, 2003.
[9] E. Bocchieri, “Vector quantization for the efficient computation of continuous density likelihoods,” in ICASSP, 1993, vol. II, pp. 692–695.
[10] Zoubin Ghahramani and Michael I. Jordan, “Factorial hidden Markov models,” in Advances in Neural Information Processing Systems, vol. 8.
[11] Peder Olsen and Satya Dharanipragada, “An efficient integrated gender detection scheme and time mediated averaging of gender dependent acoustic models,” in Eurospeech 2003, 2003, vol. 4, pp. 2509–2512.
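The factored maximization of Eq. (9) in the paper text above, which brings the 2-D Viterbi step from $O(D^4)$ down to $O(D^3)$, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `viterbi_2d` and its argument layout are hypothetical, all quantities are taken to be log-probabilities held in dense NumPy arrays, and the pruning of unlikely states mentioned in the text is omitted.

```python
import numpy as np


def viterbi_2d(log_A, log_B, log_lik, log_prior):
    """Joint MAP state path for two independent chains sharing one observation.

    Hypothetical layout (assumption, not from the paper):
      log_A[k, i]      log p(s^a_t = i | s^a_{t-1} = k)
      log_B[l, j]      log p(s^b_t = j | s^b_{t-1} = l)
      log_lik[t, i, j] log p(y_t | s^a_t = i, s^b_t = j)
      log_prior[i, j]  log-probability of the initial joint state
    """
    T, Da, Db = log_lik.shape
    delta = log_prior + log_lik[0]
    back_a = np.zeros((T, Da, Db), dtype=int)
    back_b = np.zeros((T, Da, Db), dtype=int)
    cols = np.arange(Db)[None, :]

    for t in range(1, T):
        # Inner max over the previous b state l, for every (k, j): O(D^3).
        scores_b = delta[:, :, None] + log_B[None, :, :]   # axes (k, l, j)
        best_l = scores_b.argmax(axis=1)                   # axes (k, j)
        inner = scores_b.max(axis=1)                       # axes (k, j)

        # Outer max over the previous a state k, for every (i, j): O(D^3).
        scores_a = log_A[:, :, None] + inner[:, None, :]   # axes (k, i, j)
        best_k = scores_a.argmax(axis=0)                   # axes (i, j)
        delta = scores_a.max(axis=0) + log_lik[t]

        back_a[t] = best_k
        back_b[t] = best_l[best_k, cols]  # the l that went with the chosen k

    # Backtrack from the best final joint state.
    i, j = np.unravel_index(delta.argmax(), delta.shape)
    path = [(int(i), int(j))]
    for t in range(T - 1, 0, -1):
        i, j = back_a[t, i, j], back_b[t, i, j]
        path.append((int(i), int(j)))
    path.reverse()
    return path, float(delta.max())
```

The two maximizations are done in sequence, so each time step touches two arrays of size $D^3$ instead of one of size $D^4$; the result is the same joint path that a search over all $D^2 \times D^2$ joint transitions would return.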

4 0.87069154 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms

Author: Xinhua Zhang, Wee S. Lee

Abstract: Semi-supervised learning algorithms have been successfully applied in many applications with scarce labeled data, by utilizing the unlabeled data. One important category is graph based semi-supervised learning algorithms, for which the performance depends considerably on the quality of the graph, or its hyperparameters. In this paper, we deal with the less explored problem of learning the graphs. We propose a graph learning method for the harmonic energy minimization method; this is done by minimizing the leave-one-out prediction error on labeled data points. We use a gradient based method and design an efficient algorithm which significantly accelerates the calculation of the gradient by applying the matrix inversion lemma and using careful pre-computation. Experimental results show that the graph learning method is effective in improving the performance of the classification algorithm.

5 0.83150846 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

Author: Daniel Tarlow, Gal Elidan, Daphne Koller, John C. Duchi

Abstract: In general, the problem of computing a maximum a posteriori (MAP) assignment in a Markov random field (MRF) is computationally intractable. However, in certain subclasses of MRF, an optimal or close-to-optimal assignment can be found very efficiently using combinatorial optimization algorithms: certain MRFs with mutual exclusion constraints can be solved using bipartite matching, and MRFs with regular potentials can be solved using minimum cut methods. However, these solutions do not apply to the many MRFs that contain such tractable components as sub-networks, but also other non-complying potentials. In this paper, we present a new method, called COMPOSE, for exploiting combinatorial optimization for sub-networks within the context of a max-product belief propagation algorithm. COMPOSE uses combinatorial optimization for computing exact max-marginals for an entire sub-network; these can then be used for inference in the context of the network as a whole. We describe highly efficient methods for computing max-marginals for sub-networks corresponding both to bipartite matchings and to regular networks. We present results on both synthetic and real networks encoding correspondence problems between images, which involve both matching constraints and pairwise geometric constraints. We compare to a range of current methods, showing that the ability of COMPOSE to transmit information globally across the network leads to improved convergence, decreased running time, and higher-scoring assignments.

6 0.61773694 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

7 0.56109333 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints

8 0.54155338 167 nips-2006-Recursive ICA

9 0.50302601 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

10 0.50187659 158 nips-2006-PG-means: learning the number of clusters in data

11 0.49854708 97 nips-2006-Inducing Metric Violations in Human Similarity Judgements

12 0.49435645 34 nips-2006-Approximate Correspondences in High Dimensions

13 0.49217421 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

14 0.48614219 31 nips-2006-Analysis of Contour Motions

15 0.48522311 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians

16 0.47068575 15 nips-2006-A Switched Gaussian Process for Estimating Disparity and Segmentation in Binocular Stereo

17 0.47009021 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

18 0.46569085 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

19 0.46501887 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model

20 0.45749927 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds