nips nips2002 nips2002-31 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda
Abstract: In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). An appropriate model structure with high recognition performance can be found within a VBEC framework. Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). [sent-5, score-1.154]
2 An appropriate model structure with high recognition performance can be found within a VBEC framework. [sent-6, score-0.236]
3 Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. [sent-7, score-0.298]
4 In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data. [sent-8, score-0.317]
5 1 Introduction Statistical modeling of the spectral features of speech (acoustic modeling) is one of the most crucial parts of speech recognition. [sent-9, score-0.308]
6 In acoustic modeling, a triphone-based hidden Markov model (triphone HMM) has been widely employed. [sent-10, score-0.138]
7 The triphone is a context dependent phoneme unit that considers both the preceding and following phonemes. [sent-11, score-0.432]
8 Although the triphone enables the precise modeling of spectral features, the total number of triphones is too large to prepare sufficient amounts of training data for each triphone. [sent-12, score-0.577]
9 In order to deal with the problem of data insufficiency, an HMM state is usually shared among multiple triphone HMMs, which increases the amount of training data available per state. [sent-13, score-0.492]
10 Such shared-state triphone HMMs (SST-HMMs) can be constructed by successively clustering states based on the phonetic decision tree method [4] [7]. [sent-14, score-0.665]
11 The important practical problem that must be solved when constructing SST-HMMs is how to optimize the total number of shared states adaptively to the amounts of available training data. [sent-15, score-0.423]
12 Namely, maintaining the balance between model complexity and training data size is quite important for high generalization performance. [sent-16, score-0.114]
13 Maximum likelihood (ML) is inappropriate as a model selection criterion, since the likelihood increases monotonically as the number of states increases. [sent-17, score-0.179]
14 To solve this problem, the Bayesian information criterion (BIC) and minimum description length (MDL) criterion have been employed to determine the tree structure of SST-HMMs [2] [5] (Footnote 1). [sent-19, score-0.177]
15 However, since the BIC/MDL is based on an asymptotic assumption, it is invalid in principle when the amount of training data is small, because the assumption then fails. [sent-20, score-0.126]
16 The VB approach has been successfully applied to model selection problems, but mainly for relatively simple mixture models [1] [3] [6] [8]. [sent-23, score-0.109]
17 Here, we try to apply VB to SST-HMMs, which have a more complex model structure than the mixture model, and evaluate the effectiveness through a large-scale real speech recognition experiment. [sent-24, score-0.467]
18 In the Bayesian approach we are interested in posterior distributions over model parameters, p(Θ|O, m), and the model structure, p(m|O). [sent-27, score-0.139]
19 Then the model with a fixed model structure m can be defined by the joint distribution p(O, Z|Θ, m). [sent-31, score-0.116]
20 In VB, variational posteriors q(Θ|O, m), q(Z|O, m), and q(m|O) are introduced to approximate the true corresponding posteriors. [sent-32, score-0.186]
21 The optimal variational posteriors over Θ and Z, and the appropriate model structure that maximizes the optimal q(m|O), can be obtained by maximizing the following objective function: Fm[q] = ⟨ log [ p(O, Z|Θ, m) p(Θ|m) / ( q(Z|O, m) q(Θ|O, m) ) ] ⟩_{q(Z|O,m) q(Θ|O,m)}, (1) with respect to q(Z|O, m) and q(Θ|O, m). [sent-33, score-0.284]
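To make Eq. (1) concrete, the sketch below estimates Fm[q] by Monte Carlo for a hypothetical two-component 1-D Gaussian mixture with unknown means. The data, the model, and the variational posteriors are invented purely for illustration; they are not the paper's triphone HMMs.

```python
import numpy as np

# Toy model: two-component 1-D Gaussian mixture with unknown means Theta = (mu0, mu1),
# latent assignments Z, unit variances, equal weights, and a standard-normal prior on
# each mean.  Everything below is a made-up stand-in used only to illustrate Eq. (1).
rng = np.random.default_rng(0)
O = np.concatenate([rng.normal(-2, 1, 20), rng.normal(2, 1, 20)])

def log_norm(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - 0.5 * ((x - m) / s) ** 2

# Assumed factorized variational posteriors:
#   q(Theta) = N(mu0|a0,b0) N(mu1|a1,b1),  q(Z) = prod_t Categorical(r_t)
q_mean = np.array([-1.8, 1.9])
q_std = np.array([0.2, 0.2])
resp = 1.0 / (1.0 + np.exp(-(O * 4)))          # crude guess for q(z_t = 1)
resp = np.clip(resp, 1e-6, 1 - 1e-6)

def free_energy(n_samples=2000):
    """Monte-Carlo estimate of Fm[q] in Eq. (1):
    E_q[ log p(O,Z|Theta) + log p(Theta) - log q(Z) - log q(Theta) ]."""
    total = 0.0
    for _ in range(n_samples):
        mu = rng.normal(q_mean, q_std)                       # Theta ~ q(Theta)
        z = (rng.uniform(size=O.size) < resp).astype(int)    # Z ~ q(Z)
        log_joint = (log_norm(O, mu[z], 1.0).sum()           # p(O|Z,Theta)
                     + np.log(0.5) * O.size                  # p(Z), equal weights
                     + log_norm(mu, 0.0, 1.0).sum())         # prior p(Theta)
        log_q = (np.where(z == 1, np.log(resp), np.log(1 - resp)).sum()
                 + log_norm(mu, q_mean, q_std).sum())
        total += log_joint - log_q
    return total / n_samples

print("F_m[q] ~", free_energy())
```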
22 3.1 Output distributions and prior distributions. We attempt to apply a VB approach to a left-to-right HMM, which has been widely used to represent a phoneme unit in acoustic models for speech recognition, as shown in Figure 1. [sent-44, score-0.491]
23 Let O = {O^t | t = 1, . . . , T} be a sequential data set for a phoneme unit. [sent-48, score-0.138]
24 The output distribution in an HMM is given by p(O, S, V|Θ, m) = ∏_{t=1}^{T} a_{s_{t-1} s_t} c_{s_t v_t} b_{s_t v_t}(O^t), (2) where S is a set of sequences of hidden states, V is a set of sequences of Gaussian mixture components, and s_t and v_t denote the state and mixture component at time t. [sent-49, score-0.871]
25 aij denotes the state transition probability from state i to state j. (Footnote 1: these criteria have been independently proposed, but they are practically the same.) [sent-51, score-0.174]
26 Figure 1: Hidden Markov model for each phoneme unit. [sent-53, score-0.162]
27 A state is represented by the Gaussian mixture distribution below the state. [sent-54, score-0.125]
28 There are three states and three Gaussian components in this figure. [sent-55, score-0.096]
29 cjk is the k-th weight factor of the Gaussian mixture for state j. [sent-56, score-0.43]
30 bjk (= N(O^t | µjk, Σjk)) denotes the Gaussian distribution with mean vector µjk and covariance matrix Σjk. [sent-57, score-0.103]
31 J denotes the number of states in an HMM and L denotes the number of Gaussian components in a state. [sent-65, score-0.184]
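The following is a minimal, hedged sketch of how the joint probability in Eq. (2) can be evaluated for given state and mixture-component sequences in a small left-to-right HMM. All parameter values are placeholders invented for this example, not estimates from the paper.

```python
import numpy as np

J, L, D, T = 3, 2, 2, 5                     # states, mixtures, feature dim, frames
rng = np.random.default_rng(1)

A = np.array([[0.7, 0.3, 0.0],              # a_ij: left-to-right transitions
              [0.0, 0.6, 0.4],
              [0.0, 0.0, 1.0]])
C = np.full((J, L), 1.0 / L)                # c_jk: mixture weights
mu = rng.normal(size=(J, L, D))             # mu_jk
var = np.ones((J, L, D))                    # diagonal of Sigma_jk

def log_gauss(o, m, v):
    return -0.5 * np.sum(np.log(2 * np.pi * v) + (o - m) ** 2 / v)

def log_joint(O, S, V, s0=0):
    """log p(O, S, V | Theta, m) following Eq. (2):
    sum over t of log a_{s_{t-1} s_t} + log c_{s_t v_t} + log b_{s_t v_t}(O^t)."""
    lp, prev = 0.0, s0
    for t in range(len(S)):
        j, k = S[t], V[t]
        lp += np.log(A[prev, j]) + np.log(C[j, k]) + log_gauss(O[t], mu[j, k], var[j, k])
        prev = j
    return lp

O = rng.normal(size=(T, D))
S = np.array([0, 0, 1, 1, 2])                # a valid left-to-right state path
V = np.array([0, 1, 0, 1, 0])                # mixture component per frame
print("log p(O,S,V|Theta,m) =", log_joint(O, S, V))
```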
32 The conjugate prior distributions are assumed to be as follows: p(Θ|m) = ∏_{i,j,k} D({a_ij}_{j=1}^J | φ^0) D({c_jk}_{k=1}^L | ϕ^0) N(µ_jk | ν^0_jk, (ξ^0)^{-1} Σ_jk) ∏_{d=1}^D G(Σ_{jk,d}^{-1} | η^0, R^0_{jk,d}). (3) [sent-67, score-0.446]
33 In Eq. (3), D denotes a Dirichlet distribution and G denotes a gamma distribution. [sent-71, score-0.111]
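Below is a hedged sketch of the prior in Eq. (3): it simply draws one Θ from Dirichlet priors on the transition rows and mixture weights and a Normal-Gamma prior on each (mean, diagonal precision) pair. The hyperparameter values are placeholders, the gamma distribution is parameterized here with R as a rate, and the left-to-right constraint on the transitions is ignored for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
J, L, D = 3, 2, 2
phi0, varphi0 = 1.0, 1.0          # Dirichlet counts for {a_ij} and {c_jk}
xi0, eta0 = 0.1, 0.1              # scale on the mean, gamma shape
nu0 = np.zeros(D)                 # prior mean nu^0_jk
R0 = np.ones(D)                   # gamma rate R^0_{jk,d}

def sample_theta():
    """Draw Theta = ({a_ij}, {c_jk}, {mu_jk}, {Sigma_jk}) from p(Theta|m)."""
    A = rng.dirichlet(np.full(J, phi0), size=J)          # rows of a_ij
    C = rng.dirichlet(np.full(L, varphi0), size=J)       # c_jk
    prec = rng.gamma(eta0, 1.0 / R0, size=(J, L, D))     # diagonal of Sigma_jk^{-1}
    var = 1.0 / prec
    # mu_jk | Sigma_jk ~ N(nu0, (xi0)^{-1} Sigma_jk)
    mu = nu0 + rng.normal(size=(J, L, D)) * np.sqrt(var / xi0)
    return A, C, mu, var

A, C, mu, var = sample_theta()
print(A.round(2))
```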
34 3.2 Optimal variational posterior distribution q̃(Θ|O, m). From the output distributions and prior distributions in Section 3.1, the optimal variational posterior over Θ can be derived. [sent-73, score-0.308]
35 γ̃^t_ij denotes the transition probability from state i to state j at time t. [sent-76, score-0.686]
36 ζ̃^t_jk denotes the occupation probability of mixture component k in state j at time t. [sent-77, score-0.179]
37 3.3 Optimal variational posterior distribution q̃(S, V|O, m). From the output distributions and prior distributions in Section 3.1, the optimal variational posterior over S and V can be derived. [sent-79, score-0.308]
38 Update γ̃^t_ij[τ+1] and ζ̃^t_jk[τ+1] using q̃(S, V|O, m)[τ+1], via the Viterbi algorithm or the Forward-Backward algorithm. [sent-86, score-0.256]
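The loop below is a minimal, hedged stand-in for the alternation just described: it updates q(Θ) in closed form and then recomputes occupation probabilities, but for a single-state 1-D Gaussian mixture (Dirichlet prior on the weights, Normal-Gamma prior on each mean/precision) rather than the full SST-HMM, where the occupancies γ̃ and ζ̃ would come from a Forward-Backward or Viterbi pass. All data and hyperparameter values are invented.

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
K, T = 2, x.size
phi0, xi0, eta0, R0, nu0 = 1.0, 0.1, 0.1, 1.0, 0.0

r = rng.dirichlet(np.ones(K), size=T)          # initial occupancies (analogue of zeta)
for _ in range(50):
    # ---- update q(Theta): closed-form hyperparameter updates --------------
    N = r.sum(0) + 1e-10
    xbar = (r * x[:, None]).sum(0) / N
    S = (r * (x[:, None] - xbar) ** 2).sum(0)
    phi = phi0 + N
    xi = xi0 + N
    nu = (xi0 * nu0 + N * xbar) / xi
    eta = eta0 + 0.5 * N
    R = R0 + 0.5 * S + 0.5 * xi0 * N * (xbar - nu0) ** 2 / xi
    # ---- update occupancies (E-step analogue of gamma/zeta) ---------------
    Elog_c = digamma(phi) - digamma(phi.sum())
    Elog_prec = digamma(eta) - np.log(R)
    Eprec = eta / R
    log_rho = (Elog_c + 0.5 * Elog_prec - 0.5 * np.log(2 * np.pi)
               - 0.5 * (Eprec * (x[:, None] - nu) ** 2 + 1.0 / xi))
    log_rho -= log_rho.max(1, keepdims=True)
    r = np.exp(log_rho)
    r /= r.sum(1, keepdims=True)

print("posterior means:", nu.round(2), " expected weights:", (phi / phi.sum()).round(2))
```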
39 5 Variational Bayesian estimation and clustering for speech recognition. In the previous section, we described a VB training algorithm for HMMs. [sent-92, score-0.43]
40 Here, we explain VBEC, which constructs an acoustic model based on SST-HMMs and recognizes speech based on the Bayesian prediction classification. [sent-93, score-0.402]
41 VBEC consists of three phases: model structure selection, retraining and recognition. [sent-94, score-0.102]
42 The model structure is determined based on triphone-state clustering by using the phonetic decision tree method [4] [7]. [sent-95, score-0.344]
43 The phonetic decision tree is a kind of binary tree that has a phonetic “Yes/No” question attached at each node, as shown in Figure 2. [sent-96, score-0.426]
44 Let Ω (n) denote a set of states held by a tree node n. [sent-97, score-0.226]
45 We start with only a root node (n = 0), which holds the set Ω(0) of all the triphone HMM states that share the same center phoneme. [sent-98, score-0.455]
46 The set of triphone states is then split into two sets, Ω (nY ) and Ω (nN ), which are held by two new nodes, nY and nN , respectively, as shown in Figure 3. [sent-99, score-0.366]
47 The partition is determined by the answer to a phonetic question such as “is the preceding phoneme a vowel?” [sent-100, score-0.285]
48 Figure 3: Splitting a set of triphone HMM states Ω(n) into two sets Ω(nY) and Ω(nN) by answering phonetic questions according to an objective function. [sent-103, score-0.489]
49 We continue this splitting successively for every new set of states to obtain a binary tree, each leaf node of which holds a clustered set of triphone states. [sent-105, score-0.511]
50 A set of triphones is thus represented by a set of shared-state triphone HMMs (SST-HMMs). [sent-107, score-0.312]
51 A decision tree is produced specifically for each state in the sequence, and the trees are independent of each other. [sent-109, score-0.152]
52 Note that in the triphone-state clustering mentioned above, we assume the following conditions to reduce computation: • The state assignments during splitting are fixed. [sent-110, score-0.137]
53 As a result, all variational posteriors and Fm can be obtained as closed forms without an iterative procedure. [sent-114, score-0.186]
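As a hedged illustration of the greedy splitting procedure described above, the sketch below clusters toy triphone states using a simple pooled-Gaussian log-likelihood gain. In VBEC the gain in Fm (and in ML-BIC/MDL the penalized likelihood gain) would be plugged in instead; the question set and state statistics here are invented.

```python
import numpy as np

QUESTIONS = [("left_is_vowel", True), ("right_is_nasal", True)]  # toy question set

def pooled_score(states):
    """Log-likelihood of modelling all frames in `states` with a single Gaussian."""
    n = sum(s["n"] for s in states)
    mean = sum(s["n"] * s["mean"] for s in states) / n
    var = sum(s["n"] * (s["var"] + (s["mean"] - mean) ** 2) for s in states) / n
    return -0.5 * n * (np.log(2.0 * np.pi * var) + 1.0)

def best_split(states):
    best = (-np.inf, None, None, None)
    for key, val in QUESTIONS:
        yes = [s for s in states if s["ctx"].get(key) == val]
        no = [s for s in states if s["ctx"].get(key) != val]
        if not yes or not no:
            continue
        gain = pooled_score(yes) + pooled_score(no) - pooled_score(states)
        if gain > best[0]:
            best = (gain, (key, val), yes, no)
    return best

def build_tree(root_states, min_gain=1.0):
    """Split Omega(0) top-down; every surviving leaf becomes one shared state."""
    leaves, stack = [], [root_states]
    while stack:
        states = stack.pop()
        gain, question, yes, no = best_split(states)
        if question is None or gain < min_gain:
            leaves.append(states)
        else:
            stack.extend([yes, no])
    return leaves

# toy triphone states for the same center phoneme
demo = [{"n": 50, "mean": -1.0, "var": 1.0, "ctx": {"left_is_vowel": True}},
        {"n": 40, "mean": -1.2, "var": 1.1, "ctx": {"left_is_vowel": True}},
        {"n": 60, "mean": 2.0, "var": 0.9, "ctx": {"left_is_vowel": False}},
        {"n": 30, "mean": 2.3, "var": 1.2, "ctx": {"left_is_vowel": False}}]
print(len(build_tree(demo)), "shared states")
```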
54 Once we have obtained the model structure, we retrain the posterior distributions using the VB algorithm given in section 4. [sent-115, score-0.115]
55 In recognition, an unknown datum x^t for a frame t is classified as the optimal phoneme class y using the predictive posterior classification probability p(y|x^t, O, m̃) ≡ p(y) p(x^t|y, O, m̃) / p(x^t) for the estimated model structure m̃. [sent-116, score-0.427]
56 If we approximate the true posterior p(Θ|y, O, m̃) by the estimated variational posterior q̃(Θ|y, O, m̃), then p(x^t|y, O, m̃) can be calculated as p(x^t|y, O, m̃) ≈ ∫ p(x^t|y, Θ, m̃) q̃(Θ|y, O, m̃) dΘ. [sent-118, score-0.246]
57 Therefore, we can compute a Bayesian predictive score for each frame, and then a phoneme sequence score by using the Viterbi algorithm. [sent-121, score-0.138]
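For intuition, here is a minimal sketch of such a predictive score for one frame and one 1-D Gaussian component whose mean and precision have a Normal-Gamma variational posterior (ν, ξ, η, R): integrating the Gaussian likelihood against that posterior gives a Student-t density, a standard conjugacy result. The full VBEC score would combine such terms over mixture components and states and feed the frame scores to the Viterbi algorithm; the hyperparameter values below are made up.

```python
import numpy as np
from scipy.stats import t as student_t

def log_predictive(x, nu, xi, eta, R):
    """log of the predictive density obtained by integrating N(x | mu, 1/prec)
    against a Normal-Gamma posterior NG(mu, prec | nu, xi, eta, R)."""
    df = 2.0 * eta
    scale = np.sqrt(R * (xi + 1.0) / (eta * xi))
    return student_t.logpdf(x, df=df, loc=nu, scale=scale)

# hypothetical posterior hyperparameters for one Gaussian component
print(log_predictive(x=0.3, nu=0.0, xi=5.0, eta=4.0, R=3.5))
```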
58 Thus, we can construct a VBEC framework for speech recognition by selecting an appropriate model structure and estimating posterior distributions with the VB approach, and then obtaining a recognition result based on the Bayesian prediction classification. [sent-122, score-0.649]
59 The first experiment compared VBEC with the conventional ML-BIC/MDL method for variable amounts of training data. [sent-124, score-0.269]
60 In the ML-BIC/MDL, retraining and recognition are based on the ML approach and model structure selection is based on the BIC/MDL. [sent-125, score-0.271]
61 The second experiment examined the robustness of the recognition performance with preset hyperparameter values against changes in the amounts of training data. [sent-126, score-0.549]
62 The training and recognition data used in these experiments are shown in Table 3. [sent-133, score-0.228]
63 The total training data consisted of about 3,000 Japanese sentences spoken by 30 males. [sent-134, score-0.235]
64 These sentences were designed so that the phonemic balance was maintained. [sent-135, score-0.131]
65 The total recognition data consisted of 2,500 Japanese city names spoken by 25 males. [sent-136, score-0.243]
66 As a result, 40 sets of SST-HMMs were prepared for several subsets of training data. [sent-138, score-0.121]
67 Figures 4 and 5 show the recognition rate and the total number of states in a set of SST-HMMs, according to the varying amounts of training data. [sent-139, score-0.499]
68 As shown in Figure 4, when the number of training sentences was less than 40, VBEC greatly outperformed the ML-BIC/MDL (A). [sent-140, score-0.228]
69 With ML-BIC/MDL (A), an appropriate model structure was obtained by maximizing an objective function l_m^{BIC/MDL} with respect to the model structure m. [sent-141, score-0.134]
70 The second term of Eq. (9) is regarded as a penalty term added to a likelihood, and is dependent on the number of free parameters #(Θ_Ω) and the total frame number T_{Ω(0)} of the training data. [sent-146, score-0.178]
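A hedged sketch of this kind of penalized objective (the exact form of Eq. (9) is not reproduced here): the maximized log-likelihood minus a penalty proportional to the number of free parameters times the log of the total frame count. In this sketch, re-weighting the penalty with `lam` is what distinguishes the tuned variant (B) from the original (A); all numbers are invented.

```python
import numpy as np

def bic_mdl_score(max_loglik, n_free_params, n_frames, lam=1.0):
    """BIC/MDL-style objective: log-likelihood minus a complexity penalty."""
    return max_loglik - lam * 0.5 * n_free_params * np.log(n_frames)

# toy comparison of two candidate model structures
print(bic_mdl_score(max_loglik=-1.20e5, n_free_params=3000, n_frames=5.0e5))
print(bic_mdl_score(max_loglik=-1.19e5, n_free_params=6000, n_frames=5.0e5))
```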
71 ML-BIC/MDL (A) was based on the original definitions of BIC/MDL and has been widely used in speech recognition [2] [5]. [sent-147, score-0.292]
72 Figure 5: Number of shared states according to the amounts of training data based on the VBEC and ML-BIC/MDL (A) and (B). [sent-150, score-0.367]
73 This suggests that VBEC, which does not use an asymptotic assumption, determines the model structure more appropriately than the ML-BIC/MDL (A), when the training data size is small. [sent-153, score-0.195]
74 We also adjusted the penalty term of Eq. (9) so that the total numbers of states for small amounts of data were as close as possible to those of VBEC (ML-BIC/MDL (B) in Figure 5). [sent-155, score-0.271]
75 Nevertheless, the recognition rates obtained by VBEC were about 15 % better than those of ML-BIC/MDL (B) with fewer than 15 training sentences (Figure 4). [sent-156, score-0.364]
76 With such very small amounts of data, the VBEC and ML-BIC/MDL (B) model structures were almost the same (Figure 5). [sent-157, score-0.169]
77 This suggests that the Bayesian estimation and prediction in VBEC (Eq. (8)) suppressed the over-fitting of the models to very small amounts of training data, compared with the ML estimation and recognition in ML-BIC/MDL (B). [sent-159, score-0.373]
78 With more than 100 training sentences, the recognition rates obtained by VBEC converged asymptotically to those obtained by ML-BIC/MDL methods as the amounts of training data became large. [sent-160, score-0.51]
79 The advantage of VBEC is thus twofold, i.e., the appropriate determination of the number of states and the suppression effect on over-fitting. [sent-164, score-0.149]
80 6.2 Influence of hyperparameter values on the quality of SST-HMMs. Throughout the construction of the model structure, the estimation of the posterior distributions, and recognition, we used a fixed combination of hyperparameter values, ξ^0 = η^0 = 0.… [sent-166, score-0.374]
81 However, when the scale of the target application is large, the selection of hyperparameter values might affect the quality of the models. [sent-169, score-0.176]
82 Namely, the best or better values might differ greatly according to the amounts of training data. [sent-170, score-0.259]
83 Moreover, estimating appropriate hyperparameters while training SST-HMMs takes so much time that it is impractical for speech recognition. [sent-171, score-0.304]
84 Therefore, we examined how robustly the SST-HMMs produced by VBEC performed against changes in the hyperparameter values with varying amounts of training data. [sent-172, score-0.411]
85 We varied the hyperparameter values from 0.0001 to 1, and examined the speech recognition rates in two typical cases: one in which the amount of data was very small (10 sentences) and one in which the amount was fairly large (150 sentences). [sent-174, score-0.37]
86 Table 4: Recognition rates for each prior distribution parameter when using training data of 10 sentences. [sent-175, score-0.183]
87 Table 5: Recognition rates for each prior distribution parameter when using training data of 150 sentences. [sent-176, score-0.183]
88 Tables 4 and 5 show the recognition rates for each combination of hyperparameters. [sent-227, score-0.185]
89 We can see that the hyperparameter values for acceptable performance are broadly distributed for both very small and fairly large amounts of training data. [sent-228, score-0.38]
90 Moreover, the roughly ten best recognition rates are highlighted in the tables. [sent-229, score-0.21]
91 The combinations of hyperparameter values that achieved the highlighted recognition rates were similar for the two different amounts of training data. [sent-230, score-0.59]
92 Namely, appropriate combinations of hyperparameter values can consistently provide good performance levels regardless of the varying amounts of training data. [sent-231, score-0.409]
93 In summary, the hyperparameter values do not greatly influence the quality of the SST-HMMs. [sent-232, score-0.169]
94 This suggests that it is not necessary to select the hyperparameter values very carefully. [sent-233, score-0.145]
95 7 Conclusion. In this paper, we proposed VBEC, which constructs SST-HMMs based on the VB approach, and recognizes speech based on the Bayesian prediction classification. [sent-234, score-0.264]
96 With VBEC, the model structure of SST-HMMs is adaptively determined according to the amounts of given training data, and therefore a robust speech recognition system can be constructed. [sent-235, score-0.622]
97 The first experimental results, obtained by using real speech recognition tasks, showed the effectiveness of VBEC. [sent-236, score-0.32]
98 In particular, when the training data size was small, VBEC significantly outperformed conventional methods. [sent-237, score-0.149]
99 The second experimental results suggested that it is not necessary to select the hyperparameter values very carefully. [sent-238, score-0.145]
100 From these results, we conclude that VBEC provides a completely Bayesian framework for speech recognition which effectively handles the sparse data problem. [sent-239, score-0.292]
wordName wordTfidf (topN-words)
[('vbec', 0.519), ('jk', 0.392), ('triphone', 0.27), ('st', 0.19), ('vb', 0.185), ('speech', 0.154), ('amounts', 0.145), ('hyperparameter', 0.145), ('variational', 0.14), ('phoneme', 0.138), ('recognition', 0.138), ('ij', 0.128), ('vt', 0.125), ('phonetic', 0.123), ('hmm', 0.12), ('acoustic', 0.114), ('cjk', 0.104), ('xt', 0.102), ('bayesian', 0.101), ('states', 0.096), ('training', 0.09), ('sentences', 0.089), ('yes', 0.087), ('fm', 0.083), ('aij', 0.082), ('tree', 0.076), ('cst', 0.062), ('viterbi', 0.061), ('nn', 0.061), ('posterior', 0.06), ('frame', 0.058), ('od', 0.058), ('node', 0.054), ('mixture', 0.054), ('ast', 0.049), ('ot', 0.049), ('recognizes', 0.049), ('hmms', 0.049), ('state', 0.048), ('clustering', 0.048), ('rates', 0.047), ('posteriors', 0.046), ('structure', 0.045), ('denotes', 0.044), ('mdl', 0.043), ('watanabe', 0.043), ('phonemic', 0.042), ('rjk', 0.042), ('triphones', 0.042), ('ny', 0.041), ('splitting', 0.041), ('ml', 0.04), ('insuf', 0.037), ('shared', 0.036), ('asymptotic', 0.036), ('mfcc', 0.036), ('bic', 0.036), ('bjk', 0.036), ('japanese', 0.036), ('lm', 0.036), ('latent', 0.035), ('root', 0.035), ('japan', 0.034), ('gaussian', 0.034), ('conventional', 0.034), ('occupation', 0.033), ('retraining', 0.033), ('ueda', 0.033), ('classi', 0.032), ('selection', 0.031), ('hyperparameters', 0.031), ('distributions', 0.031), ('prepared', 0.031), ('examined', 0.031), ('constructs', 0.031), ('prediction', 0.03), ('total', 0.03), ('appropriate', 0.029), ('decision', 0.028), ('criterion', 0.028), ('effectiveness', 0.028), ('ntt', 0.028), ('leaf', 0.026), ('spoken', 0.026), ('dl', 0.026), ('adaptively', 0.026), ('transition', 0.026), ('names', 0.025), ('highlighted', 0.025), ('outperformed', 0.025), ('city', 0.024), ('ax', 0.024), ('determination', 0.024), ('successively', 0.024), ('table', 0.024), ('greatly', 0.024), ('model', 0.024), ('preceding', 0.024), ('prior', 0.023), ('distribution', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
Author: Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda
Abstract: In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). An appropriate model structure with high recognition performance can be found within a VBEC framework. Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data.
2 0.12483677 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
Author: Samy Bengio
Abstract: This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. 1
3 0.1093796 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
Author: Peter Sykacek, Stephen J. Roberts
Abstract: We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of infering competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.
4 0.10545421 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177,080 samples of the TIMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
5 0.092500113 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
6 0.087436125 137 nips-2002-Location Estimation with a Differential Update Network
7 0.086258285 147 nips-2002-Monaural Speech Separation
8 0.0849953 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
9 0.070439801 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
10 0.068933167 134 nips-2002-Learning to Take Concurrent Actions
11 0.064895876 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA
12 0.063465163 199 nips-2002-Timing and Partial Observability in the Dopamine System
13 0.061755955 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
14 0.061273441 38 nips-2002-Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement
15 0.060847163 170 nips-2002-Real Time Voice Processing with Audiovisual Feedback: Toward Autonomous Agents with Perfect Pitch
16 0.060608972 54 nips-2002-Combining Dimensions and Features in Similarity-Based Representations
17 0.057990458 53 nips-2002-Clustering with the Fisher Score
18 0.0577139 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
19 0.057552919 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
20 0.057450581 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
topicId topicWeight
[(0, -0.176), (1, -0.037), (2, -0.05), (3, 0.031), (4, -0.017), (5, 0.06), (6, -0.141), (7, 0.016), (8, 0.123), (9, -0.058), (10, 0.113), (11, -0.015), (12, -0.073), (13, 0.021), (14, -0.059), (15, -0.147), (16, -0.025), (17, 0.123), (18, -0.003), (19, 0.038), (20, 0.05), (21, -0.003), (22, -0.051), (23, -0.031), (24, -0.167), (25, 0.125), (26, -0.055), (27, -0.125), (28, -0.03), (29, -0.042), (30, -0.043), (31, 0.014), (32, 0.115), (33, 0.126), (34, -0.032), (35, 0.008), (36, 0.06), (37, 0.005), (38, 0.052), (39, -0.017), (40, 0.064), (41, -0.077), (42, -0.026), (43, 0.118), (44, -0.077), (45, -0.012), (46, -0.089), (47, -0.132), (48, -0.047), (49, -0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.93379301 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
Author: Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda
Abstract: In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). An appropriate model structure with high recognition performance can be found within a VBEC framework. Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data.
2 0.64656848 25 nips-2002-An Asynchronous Hidden Markov Model for Audio-Visual Speech Recognition
Author: Samy Bengio
Abstract: This paper presents a novel Hidden Markov Model architecture to model the joint probability of pairs of asynchronous sequences describing the same event. It is based on two other Markovian models, namely Asynchronous Input/ Output Hidden Markov Models and Pair Hidden Markov Models. An EM algorithm to train the model is presented, as well as a Viterbi decoder that can be used to obtain the optimal state sequence as well as the alignment between the two sequences. The model has been tested on an audio-visual speech recognition task using the M2VTS database and yielded robust performances under various noise conditions. 1
3 0.52721018 54 nips-2002-Combining Dimensions and Features in Similarity-Based Representations
Author: Daniel J. Navarro, Michael D. Lee
Abstract: unkown-abstract
4 0.52642047 137 nips-2002-Location Estimation with a Differential Update Network
Author: Ali Rahimi, Trevor Darrell
Abstract: Given a set of hidden variables with an a-priori Markov structure, we derive an online algorithm which approximately updates the posterior as pairwise measurements between the hidden variables become available. The update is performed using Assumed Density Filtering: to incorporate each pairwise measurement, we compute the optimal Markov structure which represents the true posterior and use it as a prior for incorporating the next measurement. We demonstrate the resulting algorithm by calculating globally consistent trajectories of a robot as it navigates along a 2D trajectory. To update a trajectory of length t, the update takes O(t). When all conditional distributions are linear-Gaussian, the algorithm can be thought of as a Kalman Filter which simplifies the state covariance matrix after incorporating each measurement.
5 0.51622659 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA
Author: Kwokleung Chan, Te-Won Lee, Terrence J. Sejnowski
Abstract: Missing data is common in real-world datasets and is a problem for many estimation techniques. We have developed a variational Bayesian method to perform Independent Component Analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating the generative density model. Modeling the distributions of the independent sources with mixture of Gaussians allows sources to be estimated with different kurtosis and skewness. The variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. This allows direct probability estimation of missing values in the high dimensional space and avoids dimension reduction preprocessing which is not feasible with missing data.
6 0.50788897 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
7 0.49276063 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
8 0.46782395 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
9 0.4533782 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
10 0.39780155 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
11 0.38874382 142 nips-2002-Maximum Likelihood and the Information Bottleneck
12 0.37688902 134 nips-2002-Learning to Take Concurrent Actions
13 0.34906873 150 nips-2002-Multiple Cause Vector Quantization
14 0.34789869 167 nips-2002-Rational Kernels
15 0.34411049 199 nips-2002-Timing and Partial Observability in the Dopamine System
16 0.33924413 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering
17 0.33421779 87 nips-2002-Fast Transformation-Invariant Factor Analysis
18 0.32762191 185 nips-2002-Speeding up the Parti-Game Algorithm
19 0.31640694 110 nips-2002-Incremental Gaussian Processes
20 0.31355941 84 nips-2002-Fast Exact Inference with a Factored Model for Natural Language Parsing
topicId topicWeight
[(11, 0.029), (23, 0.024), (32, 0.02), (41, 0.011), (42, 0.067), (45, 0.271), (54, 0.084), (55, 0.024), (57, 0.012), (67, 0.014), (68, 0.035), (74, 0.116), (92, 0.09), (98, 0.115)]
simIndex simValue paperId paperTitle
same-paper 1 0.78091323 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
Author: Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda
Abstract: In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). An appropriate model structure with high recognition performance can be found within a VBEC framework. Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data.
2 0.73893291 173 nips-2002-Recovering Intrinsic Images from a Single Image
Author: Marshall F. Tappen, William T. Freeman, Edward H. Adelson
Abstract: We present an algorithm that uses multiple cues to recover shading and reflectance intrinsic images from a single image. Using both color information and a classifier trained to recognize gray-scale patterns, each image derivative is classified as being caused by shading or a change in the surface’s reflectance. Generalized Belief Propagation is then used to propagate information from areas where the correct classification is clear to areas where it is ambiguous. We also show results on real images.
3 0.7382971 43 nips-2002-Binary Coding in Auditory Cortex
Author: Michael R. Deweese, Anthony M. Zador
Abstract: Cortical neurons have been reported to use both rate and temporal codes. Here we describe a novel mode in which each neuron generates exactly 0 or 1 action potentials, but not more, in response to a stimulus. We used cell-attached recording, which ensured single-unit isolation, to record responses in rat auditory cortex to brief tone pips. Surprisingly, the majority of neurons exhibited binary behavior with few multi-spike responses; several dramatic examples consisted of exactly one spike on 100% of trials, with no trial-to-trial variability in spike count. Many neurons were tuned to stimulus frequency. Since individual trials yielded at most one spike for most neurons, the information about stimulus frequency was encoded in the population, and would not have been accessible to later stages of processing that only had access to the activity of a single unit. These binary units allow a more efficient population code than is possible with conventional rate coding units, and are consistent with a model of cortical processing in which synchronous packets of spikes propagate stably from one neuronal population to the next. 1 Binary coding in auditory cortex We recorded responses of neurons in the auditory cortex of anesthetized rats to pure-tone pips of different frequencies [1, 2]. Each pip was presented repeatedly, allowing us to assess the variability of the neural response to multiple presentations of each stimulus. We first recorded multi-unit activity with conventional tungsten electrodes (Fig. 1a). The number of spikes in response to each pip fluctuated markedly from one trial to the next (Fig. 1e), as though governed by a random mechanism such as that generating the ticks of a Geiger counter. Highly variable responses such as these, which are at least as variable as a Poisson process, are the norm in the cortex [3-7], and have contributed to the widely held view that cortical spike trains are so noisy that only the average firing rate can be used to encode stimuli. Because we were recording the activity of an unknown number of neurons, we could not be sure whether the strong trial-to-trial fluctuations reflected the underlying variability of the single units. We therefore used an alternative technique, cell- a b Single-unit recording method 5mV Multi-unit 1sec Raw cellattached voltage 10 kHz c Single-unit . . . . .. .. ... . . .... . ... . Identified spikes Threshold e 28 kHz d Single-unit 80 120 160 200 Time (msec) N = 29 tones 3 2 1 Poisson N = 11 tones ry 40 4 na bi 38 kHz 0 Response variance/mean (spikes/trial) High-pass filtered 0 0 1 2 3 Mean response (spikes/trial) Figure 1: Multi-unit spiking activity was highly variable, but single units obeyed binomial statistics. a Multi-unit spike rasters from a conventional tungsten electrode recording showed high trial-to-trial variability in response to ten repetitions of the same 50 msec pure tone stimulus (bottom). Darker hash marks indicate spike times within the response period, which were used in the variability analysis. b Spikes recorded in cell-attached mode were easily identified from the raw voltage trace (top) by applying a high-pass filter (bottom) and thresholding (dark gray line). Spike times (black squares) were assigned to the peaks of suprathreshold segments. c Spike rasters from a cell-attached recording of single-unit responses to 25 repetitions of the same tone consisted of exactly one well-timed spike per trial (latency standard deviation = 1.0 msec), unlike the multi-unit responses (Fig. 1a). 
Under the Poisson assumption, this would have been highly unlikely (P ~ 10 -11). d The same neuron as in Fig. 1c responds with lower probability to repeated presentations of a different tone, but there are still no multi-spike responses. e We quantified response variability for each tone by dividing the variance in spike count by the mean spike count across all trials for that tone. Response variability for multi-unit tungsten recording (open triangles) was high for each of the 29 tones (out of 32) that elicited at least one spike on one trial. All but one point lie above one (horizontal gray line), which is the value produced by a Poisson process with any constant or time varying event rate. Single unit responses recorded in cell-attached mode were far less variable (filled circles). Ninety one percent (10/11) of the tones that elicited at least one spike from this neuron produced no multi-spike responses in 25 trials; the corresponding points fall on the diagonal line between (0,1) and (1,0), which provides a strict lower bound on the variability for any response set with a mean between 0 and 1. No point lies above one. attached recording with a patch pipette [8, 9], in order to ensure single unit isolation (Fig. 1b). This recording mode minimizes both of the main sources of error in spike detection: failure to detect a spike in the unit under observation (false negatives), and contamination by spikes from nearby neurons (false positives). It also differs from conventional extracellular recording methods in its selection bias: With cell- attached recording neurons are selected solely on the basis of the experimenter’s ability to form a seal, rather than on the basis of neuronal activity and responsiveness to stimuli as in conventional methods. Surprisingly, single unit responses were far more orderly than suggested by the multi-unit recordings; responses typically consisted of either 0 or 1 spikes per trial, and not more (Fig. 1c-e). In the most dramatic examples, each presentation of the same tone pip elicited exactly one spike (Fig. 1c). In most cases, however, some presentations failed to elicit a spike (Fig. 1d). Although low-variability responses have recently been observed in the cortex [10, 11] and elsewhere [12, 13], the binary behavior described here has not previously been reported for cortical neurons. a 1.4 N = 3055 response sets b 1.2 1 Poisson 28 kHz - 100 msec 0.8 0.6 0.4 0.2 0 0 ry na bi Response variance/mean (spikes/trial) The majority of the neurons (59%) in our study for which statistical significance could be assessed (at the p<0.001 significance level; see Fig. 2, caption) showed noisy binary behavior—“binary” because neurons produced either 0 or 1 spikes, and “noisy” because some stimuli elicited both single spikes and failures. In a substantial fraction of neurons, however, the responses showed more variability. We found no correlation between neuronal variability and cortical layer (inferred from the depth of the recording electrode), cortical area (inside vs. outside of area A1) or depth of anesthesia. Moreover, the binary mode of spiking was not due to the brevity (25 msec) of the stimuli; responses that were binary for short tones were comparably binary when longer (100 msec) tones were used (Fig. 2b). Not assessable Not significant Significant (p<0.001) 0.2 0.4 0.6 0.8 1 1.2 Mean response (spikes/trial) 28 kHz - 25 msec 1.4 0 40 80 120 160 Time (msec) 200 Figure 2: Half of the neuronal population exhibited binary firing behavior. 
a Of the 3055 sets of responses to 25 msec tones, 2588 (gray points) could not be assessed for significance at the p<0.001 level, 225 (open circles) were not significantly binary, and 242 were significantly binary (black points; see Identification methods for group statistics below). All points were jittered slightly so that overlying points could be seen in the figure. 2165 response sets contained no multi-spike responses; the corresponding points fell on the line from [0,1] to [1,0]. b The binary nature of single unit responses was insensitive to tone duration, even for frequencies that elicited the largest responses. Twenty additional spike rasters from the same neuron (and tone frequency) as in Fig. 1c contain no multi-spike responses whether in response to 100 msec tones (above) or 25 msec tones (below). Across the population, binary responses were as prevalent for 100 msec tones as for 25 msec tones (see Identification methods for group statistics). In many neurons, binary responses showed high temporal precision, with latencies sometimes exhibiting standard deviations as low as 1 msec (Fig. 3; see also Fig. 1c), comparable to previous observations in the auditory cortex [14], and only slightly more precise than in monkey visual area MT [5]. High temporal precision was positively correlated with high response probability (Fig. 3). a b N = (44 cells)x(32 tones) 14 N = 32 tones 12 30 Jitter (msec) Jitter (msec) 40 10 8 6 20 10 4 2 0 0 0 0.2 0.4 0.6 0.8 Mean response (spikes/trial) 1 0 0.4 0.8 1.2 1.6 Mean response (spikes/trial) 2 Figure 3: Trial-to-trial variability in latency of response to repeated presentations of the same tone decreased with increasing response probability. a Scatter plot of standard deviation of latency vs. mean response for 25 presentations each of 32 tones for a different neuron as in Figs. 1 and 2 (gray line is best linear fit). Rasters from 25 repeated presentations of a low response tone (upper left inset, which corresponds to left-most data point) display much more variable latencies than rasters from a high response tone (lower right inset; corresponds to right-most data point). b The negative correlation between latency variability and response size was present on average across the population of 44 neurons described in Identification methods for group statistics (linear fit, gray). The low trial-to-trial variability ruled out the possibility that the firing statistics could be accounted for by a simple rate-modulated Poisson process (Fig. 4a1,a2). In other systems, low variability has sometimes been modeled as a Poisson process followed by a post-spike refractory period [10, 12]. In our system, however, the range in latencies of evoked binary responses was often much greater than the refractory period, which could not have been longer than the 2 msec inter-spike intervals observed during epochs of spontaneous spiking, indicating that binary spiking did not result from any intrinsic property of the spike generating mechanism (Fig. 4a3). Moreover, a single stimulus-evoked spike could suppress subsequent spikes for as long as hundreds of milliseconds (e.g. Figs. 1d,4d), supporting the idea that binary spiking arises through a circuit-level, rather than a single-neuron, mechanism. Indeed, the fact that this suppression is observed even in the cortex of awake animals [15] suggests that binary spiking is not a special property of the anesthetized state. It seems surprising that binary spiking in the cortex has not previously been remarked upon. 
In the auditory cortex the explanation may be in part technical: Because firing rates in the auditory cortex tend to be low, multi-unit recording is often used to maximize the total amount of data collected. Moreover, our use of cell-attached recording minimizes the usual bias toward responsive or active neurons. Such explanations are not, however, likely to account for the failure to observe binary spiking in the visual cortex, where spike count statistics have been scrutinized more closely [3-7]. One possibility is that this reflects a fundamental difference between the auditory and visual systems. An alternative interpretation— a1 b Response probability 100 spikes/s 2 kHz Poisson simulation c 100 200 300 400 Time (msec) 500 20 Ratio of pool sizes a2 0 16 12 8 4 0 a3 Poisson with refractory period 0 40 80 120 160 200 Time (msec) d Response probability PSTH 0.2 0.4 0.6 0.8 1 Mean spike count per neuron 1 0.8 N = 32 tones 0.6 0.4 0.2 0 2.0 3.8 7.1 13.2 24.9 46.7 Tone frequency (kHz) Figure 4: a The lack of multi-spike responses elicited by the neuron shown in Fig. 3a were not due to an absolute refractory period since the range of latencies for many tones, like that shown here, was much greater than any reasonable estimate for the neuron’s refractory period. (a1) Experimentally recorded responses. (a2) Using the smoothed post stimulus time histogram (PSTH; bottom) from the set of responses in Fig. 4a, we generated rasters under the assumption of Poisson firing. In this representative example, four double-spike responses (arrows at left) were produced in 25 trials. (a3) We then generated rasters assuming that the neuron fired according to a Poisson process subject to a hard refractory period of 2 msec. Even with a refractory period, this representative example includes one triple- and three double-spike responses. The minimum interspike-interval during spontaneous firing events was less than two msec for five of our neurons, so 2 msec is a conservative upper bound for the refractory period. b. Spontaneous activity is reduced following high-probability responses. The PSTH (top; 0.25 msec bins) of the combined responses from the 25% (8/32) of tones that elicited the largest responses from the same neuron as in Figs. 3a and 4a illustrates a preclusion of spontaneous and evoked activity for over 200 msec following stimulation. The PSTHs from progressively less responsive groups of tones show progressively less preclusion following stimulation. c Fewer noisy binary neurons need to be pooled to achieve the same “signal-to-noise ratio” (SNR; see ref. [24]) as a collection of Poisson neurons. The ratio of the number of Poisson to binary neurons required to achieve the same SNR is plotted against the mean number of spikes elicited per neuron following stimulation; here we have defined the SNR to be the ratio of the mean spike count to the standard deviation of the spike count. d Spike probability tuning curve for the same neuron as in Figs. 1c-e and 2b fit to a Gaussian in tone frequency. and one that we favor—is that the difference rests not in the sensory modality, but instead in the difference between the stimuli used. In this view, the binary responses may not be limited to the auditory cortex; neurons in visual and other sensory cortices might exhibit similar responses to the appropriate stimuli. For example, the tone pips we used might be the auditory analog of a brief flash of light, rather than the oriented moving edges or gratings usually used to probe the primary visual cortex. 
Conversely, auditory stimuli analogous to edges or gratings [16, 17] may be more likely to elicit conventional, rate-modulated Poisson responses in the auditory cortex. Indeed, there may be a continuum between binary and Poisson modes. Thus, even in conventional rate-modulated responses, the first spike is often privileged in that it carries most of the information in the spike train [5, 14, 18]. The first spike may be particularly important as a means of rapidly signaling stimulus transients. Binary responses suggest a mode that complements conventional rate coding. In the simplest rate-coding model, a stimulus parameter (such as the frequency of a tone) governs only the rate at which a neuron generates spikes, but not the detailed positions of the spikes; the actual spike train itself is an instantiation of a random process (such as a Poisson process). By contrast, in the binomial model, the stimulus parameter (frequency) is encoded as the probability of firing (Fig. 4d). Binary coding has implications for cortical computation. In the rate coding model, stimulus encoding is “ergodic”: a stimulus parameter can be read out either by observing the activity of one neuron for a long time, or a population for a short time. By contrast, in the binary model the stimulus value can be decoded only by observing a neuronal population, so that there is no benefit to integrating over long time periods (cf. ref. [19]). One advantage of binary encoding is that it allows the population to signal quickly; the most compact message a neuron can send is one spike [20]. Binary coding is also more efficient in the context of population coding, as quantified by the signal-to-noise ratio (Fig. 4c). The precise organization of both spike number and time we have observed suggests that cortical activity consists, at least under some conditions, of packets of spikes synchronized across populations of neurons. Theoretical work [21-23] has shown how such packets can propagate stably from one population to the next, but only if neurons within each population fire at most one spike per packet; otherwise, the number of spikes per packet—and hence the width of each packet—grows at each propagation step. Interestingly, one prediction of stable propagation models is that spike probability should be related to timing precision, a prediction born out by our observations (Fig. 3). The role of these packets in computation remains an open question. 2 Identification methods for group statistics We recorded responses to 32 different 25 msec tones from each of 175 neurons from the auditory cortices of 16 Sprague-Dawley rats; each tone was repeated between 5 and 75 times (mean = 19). Thus our ensemble consisted of 32x175=5600 response sets, with between 5 and 75 samples in each set. Of these, 3055 response sets contained at least one spike on at least on trial. For each response set, we tested the hypothesis that the observed variability was significantly lower than expected from the null hypothesis of a Poisson process. The ability to assess significance depended on two parameters: the sample size (5-75) and the firing probability. Intuitively, the dependence on firing probability arises because at low firing rates most responses produce only trials with 0 or 1 spikes under both the Poisson and binary models; only at high firing rates do the two models make different predictions, since in that case the Poisson model includes many trials with 2 or even 3 spikes while the binary model generates only solitary spikes (see Fig. 4a1,a2). 
Using a stringent significance criterion of p<0.001, 467 response sets had a sufficient number of repeats to assess significance, given the observed firing probability. Of these, half (242/467=52%) were significantly less variable than expected by chance, five hundred-fold higher than the 467/1000=0.467 response sets expected, based on the 0.001 significance criterion, to yield a binary response set. Seventy-two neurons had at least one response set for which significance could be assessed, and of these, 49 neurons (49/72=68%) had at least one significantly sub-Poisson response set. Of this population of 49 neurons, five achieved low variability through repeatable bursty behavior (e.g., every spike count was either 0 or 3, but not 1 or 2) and were excluded from further analysis. The remaining 44 neurons formed the basis for the group statistics analyses shown in Figs. 2a and 3b. Nine of these neurons were subjected to an additional protocol consisting of at least 10 presentations each of 100 msec tones and 25 msec tones of all 32 frequencies. Of the 100 msec stimulation response sets, 44 were found to be significantly sub-Poisson at the p<0.05 level, in good agreement with the 43 found to be significant among the responses to 25 msec tones. 3 Bibliography 1. Kilgard, M.P. and M.M. Merzenich, Cortical map reorganization enabled by nucleus basalis activity. Science, 1998. 279(5357): p. 1714-8. 2. Sally, S.L. and J.B. Kelly, Organization of auditory cortex in the albino rat: sound frequency. J Neurophysiol, 1988. 59(5): p. 1627-38. 3. Softky, W.R. and C. Koch, The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J Neurosci, 1993. 13(1): p. 334-50. 4. Stevens, C.F. and A.M. Zador, Input synchrony and the irregular firing of cortical neurons. Nat Neurosci, 1998. 1(3): p. 210-7. 5. Buracas, G.T., A.M. Zador, M.R. DeWeese, and T.D. Albright, Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 1998. 20(5): p. 959-69. 6. Shadlen, M.N. and W.T. Newsome, The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci, 1998. 18(10): p. 3870-96. 7. Tolhurst, D.J., J.A. Movshon, and A.F. Dean, The statistical reliability of signals in single neurons in cat and monkey visual cortex. Vision Res, 1983. 23(8): p. 775-85. 8. Otmakhov, N., A.M. Shirke, and R. Malinow, Measuring the impact of probabilistic transmission on neuronal output. Neuron, 1993. 10(6): p. 1101-11. 9. Friedrich, R.W. and G. Laurent, Dynamic optimization of odor representations by slow temporal patterning of mitral cell activity. Science, 2001. 291(5505): p. 889-94. 10. Kara, P., P. Reinagel, and R.C. Reid, Low response variability in simultaneously recorded retinal, thalamic, and cortical neurons. Neuron, 2000. 27(3): p. 635-46. 11. Gur, M., A. Beylin, and D.M. Snodderly, Response variability of neurons in primary visual cortex (V1) of alert monkeys. J Neurosci, 1997. 17(8): p. 2914-20. 12. Berry, M.J., D.K. Warland, and M. Meister, The structure and precision of retinal spike trains. Proc Natl Acad Sci U S A, 1997. 94(10): p. 5411-6. 13. de Ruyter van Steveninck, R.R., G.D. Lewen, S.P. Strong, R. Koberle, and W. Bialek, Reproducibility and variability in neural spike trains. Science, 1997. 275(5307): p. 1805-8. 14. Heil, P., Auditory cortical onset responses revisited. I. First-spike timing. J Neurophysiol, 1997. 77(5): p. 2616-41. 15. Lu, T., L. Liang, and X. 
Wang, Temporal and rate representations of timevarying signals in the auditory cortex of awake primates. Nat Neurosci, 2001. 4(11): p. 1131-8. 16. Kowalski, N., D.A. Depireux, and S.A. Shamma, Analysis of dynamic spectra in ferret primary auditory cortex. I. Characteristics of single-unit responses to moving ripple spectra. J Neurophysiol, 1996. 76(5): p. 350323. 17. deCharms, R.C., D.T. Blake, and M.M. Merzenich, Optimizing sound features for cortical neurons. Science, 1998. 280(5368): p. 1439-43. 18. Panzeri, S., R.S. Petersen, S.R. Schultz, M. Lebedev, and M.E. Diamond, The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 2001. 29(3): p. 769-77. 19. Britten, K.H., M.N. Shadlen, W.T. Newsome, and J.A. Movshon, The analysis of visual motion: a comparison of neuronal and psychophysical performance. J Neurosci, 1992. 12(12): p. 4745-65. 20. Delorme, A. and S.J. Thorpe, Face identification using one spike per neuron: resistance to image degradations. Neural Netw, 2001. 14(6-7): p. 795-803. 21. Diesmann, M., M.O. Gewaltig, and A. Aertsen, Stable propagation of synchronous spiking in cortical neural networks. Nature, 1999. 402(6761): p. 529-33. 22. Marsalek, P., C. Koch, and J. Maunsell, On the relationship between synaptic input and spike output jitter in individual neurons. Proc Natl Acad Sci U S A, 1997. 94(2): p. 735-40. 23. Kistler, W.M. and W. Gerstner, Stable propagation of activity pulses in populations of spiking neurons. Neural Comp., 2002. 14: p. 987-997. 24. Zohary, E., M.N. Shadlen, and W.T. Newsome, Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 1994. 370(6485): p. 140-3. 25. Abbott, L.F. and P. Dayan, The effect of correlated variability on the accuracy of a population code. Neural Comput, 1999. 11(1): p. 91-101.
4 0.5956769 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
Author: Christopher M. Bishop, David Spiegelhalter, John Winn
Abstract: In recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. For each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. Each of these steps is both time consuming and error prone. In this paper we describe a general purpose inference engine called VIBES (‘Variational Inference for Bayesian Networks’) which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. New models are specified either through a simple script or via a graphical interface analogous to a drawing package. VIBES then automatically generates and solves the variational equations. We illustrate the power and flexibility of VIBES using examples from Bayesian mixture modelling. 1
5 0.58735001 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
Author: Bernd Fischer, Johann Schumann, Wray Buntine, Alexander G. Gray
Abstract: Machine learning has reached a point where many probabilistic methods can be understood as variations, extensions and combinations of a much smaller set of abstract themes, e.g., as different instances of the EM algorithm. This enables the systematic derivation of algorithms customized for different models. Here, we describe the AUTO BAYES system which takes a high-level statistical model specification, uses powerful symbolic techniques based on schema-based program synthesis and computer algebra to derive an efficient specialized algorithm for learning that model, and generates executable code implementing that algorithm. This capability is far beyond that of code collections such as Matlab toolboxes or even tools for model-independent optimization such as BUGS for Gibbs sampling: complex new algorithms can be generated without new programming, algorithms can be highly specialized and tightly crafted for the exact structure of the model and data, and efficient and commented code can be generated for different languages or systems. We present automatically-derived algorithms ranging from closed-form solutions of Bayesian textbook problems to recently-proposed EM algorithms for clustering, regression, and a multinomial form of PCA.
1 Automatic Derivation of Statistical Algorithms
Overview. We describe a symbolic program synthesis system which works as a “statistical algorithm compiler:” it compiles a statistical model specification into a custom algorithm design and from that further down into a working program implementing the algorithm design. This system, AUTO BAYES, can be loosely thought of as “part theorem prover, part Mathematica, part learning textbook, and part Numerical Recipes.” It provides much more flexibility than a fixed code repository such as a Matlab toolbox, and allows the creation of efficient algorithms which have never before been implemented, or even written down. AUTO BAYES is intended to automate the more routine application of complex methods in novel contexts. For example, recent multinomial extensions to PCA [2, 4] can be derived in this way.
The algorithm design problem. Given a dataset and a task, creating a learning method can be characterized by two main questions: 1. What is the model? 2. What algorithm will optimize the model parameters? The statistical algorithm (i.e., a parameter optimization algorithm for the statistical model) can then be implemented manually. The system in this paper answers the algorithm question given that the user has chosen a model for the data, and continues through to implementation. Performing this task at the state-of-the-art level requires an intertwined meld of probability theory, computational mathematics, and software engineering. However, a number of factors unite to allow us to solve the algorithm design problem computationally: 1. The existence of fundamental building blocks (e.g., standardized probability distributions, standard optimization procedures, and generic data structures). 2. The existence of common representations (i.e., graphical models [3, 13] and program schemas). 3. The formalization of schema applicability constraints as guards (footnote 1).
The challenges of algorithm design. The design problem has an inherently combinatorial nature, since subparts of a function may be optimized recursively and in different ways. It also involves the use of new data structures or approximations to gain performance.
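The abstract above refers to automatically derived EM algorithms for clustering; for orientation, here is a hand-written EM loop for a one-dimensional mixture of Gaussians, the kind of parameter-optimization algorithm the system is meant to derive and implement from a model specification. This sketch is not AUTO BAYES output; the initialization, iteration count, and function name are arbitrary choices for the example.

    import numpy as np

    def em_mog(x, n_classes=3, n_iter=100, seed=0):
        # Maximum-likelihood EM for a one-dimensional mixture of Gaussians.
        rng = np.random.default_rng(seed)
        x = np.asarray(x, dtype=float)
        phi = np.full(n_classes, 1.0 / n_classes)       # mixing weights, sum to one
        mu = rng.choice(x, n_classes, replace=False)    # random data points as initial means
        sigma = np.full(n_classes, x.std())             # common initial spread
        for _ in range(n_iter):
            # E-step: responsibilities r[i, k] proportional to phi_k * N(x_i | mu_k, sigma_k^2)
            logp = np.log(phi) - np.log(sigma) - 0.5 * ((x[:, None] - mu) / sigma) ** 2
            logp -= logp.max(axis=1, keepdims=True)     # for numerical stability
            r = np.exp(logp)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: re-estimate phi, mu, sigma from the weighted data
            nk = r.sum(axis=0)
            phi = nk / x.size
            mu = (r * x[:, None]).sum(axis=0) / nk
            sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        return phi, mu, sigma

In the paper's workflow, a loop like this would be generated from a declarative model specification (the mixture-of-Gaussians specification appears in the excerpt below) rather than written by hand.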
As the research in statistical algorithms advances, its creative focus should move beyond the ultimately mechanical aspects and towards extending the abstract applicability of already existing schemas (algorithmic principles like EM), improving schemas in ways that generalize across anything they can be applied to, and inventing radically new schemas.
2 Combining Schema-based Synthesis and Bayesian Networks
Statistical Models. Externally, AUTO BAYES has the look and feel of a compiler. Users specify their model of interest in a high-level specification language (as opposed to a programming language). The figure shows the specification of the mixture of Gaussians example used throughout this paper (footnote 2). Note the constraint that the sum of the class probabilities must equal one (line 8) along with others (lines 3 and 5) that make optimization of the model well-defined. Also note the ability to specify assumptions of the kind in line 6, which may be used by some algorithms. The last line specifies the goal inference task: maximize the conditional probability pr(x | phi, mu, sigma) with respect to the parameters phi, mu, and sigma. Note that moving the parameters across to the left of the conditioning bar converts this from a maximum likelihood to a maximum a posteriori problem.

     1  model mog as 'Mixture of Gaussians';
     2  const int n_points as 'nr. of data points'
     3    with 0 < n_points;
     4  const int n_classes := 3 as 'nr. classes'
     5    with 0 < n_classes
     6    with n_classes << n_points;
     7  double phi(1..n_classes) as 'weights'
     8    with 1 = sum(I := 1..n_classes, phi(I));
     9  double mu(1..n_classes);
     9  double sigma(1..n_classes);
    10  int c(1..n_points) as 'class labels';
    11  c ~ disc(vec(I := 1..n_classes, phi(I)));
    12  data double x(1..n_points) as 'data';
    13  x(I) ~ gauss(mu(c(I)), sigma(c(I)));
    14  max pr(x | phi, mu, sigma) wrt phi, mu, sigma;

Computational logic and theorem proving. Internally, AUTO BAYES uses a class of techniques known as computational logic which has its roots in automated theorem proving. AUTO BAYES begins with an initial goal and a set of initial assertions, or axioms, and adds new assertions, or theorems, by repeated application of the axioms, until the goal is proven. In our context, the goal is given by the input model; the derived algorithms are side effects of constructive theorems proving the existence of algorithms for the goal.
Footnote 1: Schema guards vary widely; for example, compare Nelder-Mead simplex or simulated annealing (which require only function evaluation), conjugate gradient (which requires both Jacobian and Hessian), EM and its variational extension [6] (which require a latent-variable structure model).
Footnote 2: Here, keywords have been underlined and line numbers have been added for reference in the text. The as-keyword allows annotations to variables which end up in the generated code's comments. Also, n_classes has been set to three (line 4), while n_points is left unspecified. The class variable and single data variable are vectors, which defines them as i.i.d.
Computer algebra. The first core element which makes automatic algorithm derivation feasible is the fact that we can mechanize the required symbol manipulation, using computer algebra methods. General symbolic differentiation and expression simplification are capabilities fundamental to our approach. AUTO BAYES contains a computer algebra engine using term rewrite rules which are an efficient mechanism for substitution of equal quantities or expressions and thus well-suited for this task.
Schema-based synthesis.
The computational cost of full-blown theorem proving grinds simple tasks to a halt while elementary and intermediate facts are reinvented from scratch. To achieve the scale of deduction required by algorithm derivation, we thus follow a schema-based synthesis technique which breaks away from strict theorem proving. Instead, we formalize high-level domain knowledge, such as the general EM strategy, as schemas. A schema combines a generic code fragment with explicitly specified preconditions which describe the applicability of the code fragment. The second core element which makes automatic algorithm derivation feasible is the fact that we can use Bayesian networks to efficiently encode the preconditions of complex algorithms such as EM.
First-order logic representation of Bayesian networks. A first-order logic representation of Bayesian networks was developed by Haddawy [7]. In this framework, random variables are represented by functor symbols and indexes (i.e., specific instances of i.i.d. vectors) are represented as functor arguments. Since unknown index values can be represented by implicitly universally quantified Prolog variables, this approach allows a compact encoding of networks involving i.i.d. variables or plates [3]; the figure (a plate diagram with parameter nodes phi, mu, and sigma, class node c, data node x, plates over n_classes and n_points, and disc and gauss distributions) shows the initial network for our running example. Moreover, such networks correspond to backtrack-free datalog programs, allowing the dependencies to be efficiently computed. We have extended the framework to work with non-ground probability queries since we seek to determine probabilities over entire i.i.d. vectors and matrices. Tests for independence on these indexed Bayesian networks are easily developed in Lauritzen’s framework which uses ancestral sets and set separation [9] and is more amenable to a theorem prover than the double negatives of the more widely known d-separation criteria. Given a Bayesian network, some probabilities can easily be extracted by enumerating the component probabilities at each node:
Lemma 1 (formal statement not fully recoverable from this extract). Let U and V be sets of variables over a Bayesian network; the required conditions on descendents and parents hold in the corresponding dependency graph iff the conditional probability factors into per-node terms, pr(U | V) = prod over u in U of pr(u | parents(u)).
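As an illustration of the per-node enumeration behind Lemma 1, the sketch below evaluates the joint log-probability of the running mixture-of-Gaussians network by summing one term per node, with the dependency structure taken from the specification above (c(I) ~ disc(phi), x(I) ~ gauss(mu(c(I)), sigma(c(I)))). This code is written for this note, not taken from the paper; it conditions on the parameter nodes, and the example numbers are made up.

    import numpy as np
    from scipy.stats import norm

    def mog_joint_logprob(x, c, phi, mu, sigma):
        # Joint log-probability of the mixture-of-Gaussians network, accumulated
        # node by node: sum_i [ log p(c_i | phi) + log p(x_i | c_i, mu, sigma) ].
        # The parameter nodes phi, mu, sigma are treated as given, matching the
        # maximum-likelihood goal of the specification.
        log_p = 0.0
        for i in range(len(x)):
            log_p += np.log(phi[c[i]])                          # node c(I), parent phi
            log_p += norm.logpdf(x[i], mu[c[i]], sigma[c[i]])   # node x(I), parents c(I), mu, sigma
        return log_p

    # Tiny example with n_classes = 3, as in line 4 of the specification above.
    phi = np.array([0.5, 0.3, 0.2])
    mu = np.array([-2.0, 0.0, 3.0])
    sigma = np.array([0.5, 1.0, 0.8])
    x = np.array([-1.9, 0.2, 2.7])
    c = np.array([0, 1, 2])
    print(mog_joint_logprob(x, c, phi, mu, sigma))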
6 0.58347368 163 nips-2002-Prediction and Semantic Association
7 0.57223386 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
8 0.57137948 74 nips-2002-Dynamic Structure Super-Resolution
9 0.57068408 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
10 0.57019603 135 nips-2002-Learning with Multiple Labels
11 0.56972921 27 nips-2002-An Impossibility Theorem for Clustering
12 0.56863964 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
13 0.56738877 3 nips-2002-A Convergent Form of Approximate Policy Iteration
14 0.56675661 89 nips-2002-Feature Selection by Maximum Marginal Diversity
15 0.56371957 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
16 0.56359655 10 nips-2002-A Model for Learning Variance Components of Natural Images
17 0.56240064 53 nips-2002-Clustering with the Fisher Score
18 0.56068325 152 nips-2002-Nash Propagation for Loopy Graphical Games
19 0.56049335 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
20 0.56028914 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games