nips nips2002 nips2002-101 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kwokleung Chan, Te-Won Lee, Terrence J. Sejnowski
Abstract: Missing data is common in real-world datasets and is a problem for many estimation techniques. We have developed a variational Bayesian method to perform Independent Component Analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating the generative density model. Modeling the distributions of the independent sources with mixture of Gaussians allows sources to be estimated with different kurtosis and skewness. The variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. This allows direct probability estimation of missing values in the high dimensional space and avoids dimension reduction preprocessing which is not feasible with missing data.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Missing data is common in real-world datasets and is a problem for many estimation techniques. [sent-3, score-0.034]
2 We have developed a variational Bayesian method to perform Independent Component Analysis (ICA) on high-dimensional data containing missing entries. [sent-4, score-0.374]
3 Missing data are handled naturally in the Bayesian framework by integrating the generative density model. [sent-5, score-0.085]
4 Modeling the distributions of the independent sources with mixture of Gaussians allows sources to be estimated with different kurtosis and skewness. [sent-6, score-0.148]
5 The variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. [sent-7, score-0.176]
6 This allows direct probability estimation of missing values in the high dimensional space and avoids dimension reduction preprocessing which is not feasible with missing data. [sent-8, score-0.556]
7 1 Introduction Data density estimation is an important step in many machine learning problems. [sent-9, score-0.04]
8 Often we are faced with data containing incomplete entries. [sent-10, score-0.06]
9 The data may be missing due to measurement or recording failure. [sent-11, score-0.272]
10 For example, it could be expensive and time consuming to perform some biomedical tests. [sent-13, score-0.012]
11 Data scarcity is not uncommon, and it would be very undesirable to discard data points with missing entries when the dataset is already small. [sent-14, score-0.33]
12 Traditionally, missing data are filled in by mean imputation or regression imputation during preprocessing. [sent-15, score-0.326]
13 This could introduce biases into the data cloud density and adversely affect subsequent analysis. [sent-16, score-0.077]
14 A more principled way would be to use probability density estimates of the missing entries instead of point estimates. [sent-17, score-0.356]
15 A well-known example of this approach is the use of the Expectation-Maximization (EM) algorithm to fit incomplete data with a single Gaussian density [5]. [sent-18, score-0.087]
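As a hedged illustration of this classical single-Gaussian approach (a minimal sketch with illustrative names, not code from [5] or from this paper), the snippet below imputes each missing block by its conditional mean given the observed block and accumulates the matching second-moment correction:

```python
import numpy as np

def em_gaussian_missing(X, n_iter=50):
    """EM for a single Gaussian N(mu, Sigma) fit to data X whose missing entries
    are marked with np.nan (hypothetical helper, illustrative of the approach in [5])."""
    T, N = X.shape
    obs = ~np.isnan(X)
    mu = np.nanmean(X, axis=0)
    Sigma = np.diag(np.nanvar(X, axis=0) + 1e-6)
    for _ in range(n_iter):
        S1 = np.zeros(N)            # accumulates E[x_t]
        S2 = np.zeros((N, N))       # accumulates E[x_t x_t^T]
        for t in range(T):
            o, m = obs[t], ~obs[t]
            x_hat = np.where(o, X[t], 0.0)
            C = np.zeros((N, N))
            if m.any():
                # Condition the missing block on the observed block.
                Soo_inv = np.linalg.inv(Sigma[np.ix_(o, o)])
                K = Sigma[np.ix_(m, o)] @ Soo_inv
                x_hat[m] = mu[m] + K @ (X[t, o] - mu[o])
                C[np.ix_(m, m)] = Sigma[np.ix_(m, m)] - K @ Sigma[np.ix_(o, m)]
            S1 += x_hat
            S2 += np.outer(x_hat, x_hat) + C
        mu = S1 / T
        Sigma = S2 / T - np.outer(mu, mu)
    return mu, Sigma
```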
16 Independent Component Analysis (ICA) [4] tries to locate independent axes within the data cloud and was developed for blind source separation. [sent-19, score-0.116]
17 It has been applied to speech separation and analyzing fMRI and EEG data. [sent-20, score-0.012]
18 ICA is also used to model data density, describing data as a linear mixture of independent features and finding projections that may uncover interesting structure in the data. [sent-21, score-0.064]
19 Maximum likelihood learning of ICA with incomplete data has been studied by [6], in the limited case of a square mixing matrix and predefined source densities. [sent-22, score-0.145]
20 Many real-world datasets have intrinsic dimensionality smaller than that of the observed data. [sent-23, score-0.056]
21 With missing data, principal component analysis cannot be used to perform dimension reduction as preprocessing for ICA. [sent-24, score-0.305]
22 Instead, the variational Bayesian method applied to ICA can handle small datasets with high observed dimension [1, 2]. [sent-25, score-0.143]
23 The Bayesian method prevents overfitting and performs automatic dimension reduction. [sent-26, score-0.015]
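The automatic dimension reduction operates through the hyperparameters α in Q(α); assuming, as in standard variational Bayesian ICA, that α holds ARD precisions on the columns of the mixing matrix, a minimal sketch of the resulting pruning is shown below (threshold and names are illustrative):

```python
import numpy as np

def prune_dimensions(A, alpha, threshold=1e3):
    """ARD-style pruning: columns of the mixing matrix whose precision hyperparameter
    alpha has grown very large contribute negligible variance and can be dropped,
    which is how the intrinsic dimension is inferred.  Threshold is illustrative."""
    keep = np.asarray(alpha) < threshold
    return A[:, keep], keep
```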
24 In this paper, we extend the variational Bayesian ICA method to problems with missing data. [sent-27, score-0.347]
25 The probability density estimate of the missing entries can be used to fill in the missing values. [sent-28, score-0.614]
26 This also allows the density model to be refined and made more accurate. [sent-29, score-0.04]
27 1 ICA generative model with missing data Consider a data set of T data points in an N-dimensional space: $X = \{x_t \in \mathbb{R}^N\}$, $t = 1, \dots, T$. [sent-31, score-0.317]
28 Assume a noisy ICA generative model for the data: $P(x_t|\theta) = \int N(x_t\,|\,A s_t + \nu, \Psi)\, P(s_t|\theta_s)\, ds_t$ (1), where A is the mixing matrix, $\nu$ is the observation mean and $\Psi^{-1}$ is the diagonal noise variance. [sent-32, score-0.119]
29 The hidden source $s_t$ is assumed to have L dimensions. [sent-33, score-0.23]
30 Each component of $s_t$ is modeled by a mixture of K Gaussians to allow for source densities of various kurtosis and skewness: $P(s_t|\theta_s) = \prod_{l=1}^{L} \sum_{k_l=1}^{K} \pi_{l k_l}\, N(s_t(l)\,|\,\phi_{l k_l}, \beta_{l k_l})$ (2). Split each data point into a missing part and an observed part: $x_t = (x_t^o, x_t^m)$. [sent-34, score-0.908]
31 In this paper, we only consider the random missing case [3], i.e. the probability for the missing entries $x_t^m$ is independent of the value of $x_t^m$, but could depend on the value of $x_t^o$. [sent-35, score-0.258] [sent-37, score-0.839]
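To make the generative model of eqns. (1)-(2) and the random-missing assumption concrete, here is a hedged sketch (parameter values are illustrative choices, not the authors' code) that draws mixture-of-Gaussians sources, mixes them with noise, and then masks entries independently of their values:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, K, T = 4, 7, 2, 200                 # sources, observed dims, Gaussians per source, data points

# Per-source mixture parameters: weights pi, means phi, precisions beta (eqn. 2).
pi = rng.dirichlet(np.ones(K), size=L)    # shape (L, K)
phi = rng.normal(0.0, 2.0, size=(L, K))
beta = rng.uniform(0.5, 4.0, size=(L, K))

# Draw each source dimension s_t(l) from its own mixture of Gaussians.
S = np.empty((T, L))
for l in range(L):
    k = rng.choice(K, size=T, p=pi[l])
    S[:, l] = rng.normal(phi[l, k], 1.0 / np.sqrt(beta[l, k]))

A = rng.normal(size=(N, L))               # mixing matrix
nu = rng.normal(size=N)                   # observation mean
noise_std = 0.05                          # Psi^{-1} is the diagonal noise variance
X = S @ A.T + nu + rng.normal(0.0, noise_std, size=(T, N))   # eqn. (1)

# Random-missing mask: entries are dropped independently of their values.
missing = rng.random((T, N)) < 0.3        # missing rate is an illustrative choice
X_obs = np.where(missing, np.nan, X)
```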
33 (1), the variational Bayesian ICA [1, 2] can be extended naturally to handle missing data, but only if care is taken in discounting missing entries in the learning rules. [sent-41, score-0.698]
34 $a_o(\cdot)$, $b_o(\cdot)$, $d_o(\cdot)$, $\mu_o(\cdot)$, and $\Lambda_o(\cdot)$ are prechosen hyperparameters for the priors. [sent-44, score-0.062]
35 Under the variational Bayesian treatment, instead of performing the integration in eqn. [sent-45, score-0.101]
36 A separable approximate posterior Q(θ) will be assumed: $Q(\theta) = Q(\nu)Q(\Psi) \times Q(A)Q(\alpha) \times \prod_l \big[ Q(\pi_l) \prod_{k_l} Q(\phi_{l k_l}) Q(\beta_{l k_l}) \big]$. [sent-49, score-0.018]
37 3 Special treatment for missing data Thus far the analysis follows almost exactly that of the variational Bayesian ICA on complete data, except that $P(x_t|\theta)$ is replaced by $P(x_t^o|\theta)$ in eqn. (6), and consequently the missing entries are discounted in the learning rules. [sent-52, score-0.391] [sent-53, score-0.316]
39 The approximate distribution on the missing entries is given by $Q(x_t^m|x_t^o) = \int Q(\theta)\, N(x_t^m\,|\,[A s_t + \nu]^m, [\Psi]^m)\, Q(s_t)\, ds_t\, d\theta$ (13). [sent-56, score-0.324]
40 As noted in [6], elements of $s_t$ given $x_t^o$ are dependent. [sent-57, score-0.537]
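A minimal sketch of how eqn. (13) can be approximated once point estimates of the parameters and the first two moments of Q(s_t) are available (the full treatment also integrates over Q(θ); function and argument names are illustrative):

```python
import numpy as np

def missing_predictive(A, nu, noise_var, s_mean, s_cov, miss):
    """Gaussian approximation to Q(x_t^m | x_t^o) in the spirit of eqn. (13), using
    point estimates of A, nu and the noise variance plus the first two moments of
    Q(s_t).  The full method also integrates over Q(theta); this is a sketch."""
    Am = A[miss]                                   # rows of A for the missing entries
    mean = Am @ s_mean + nu[miss]                  # [A <s_t> + nu]^m
    cov = Am @ s_cov @ Am.T + np.diag(noise_var[miss])
    return mean, cov
```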
41 This is evident from figure 1 which shows the probability density functions of the data x and hidden variable s. [sent-59, score-0.084]
42 The inserts show the sample data in the two spaces. [sent-60, score-0.035]
43 Here the hidden sources assume density of $P(s_l) \propto \exp(-|s_l|^{0.5})$. [sent-61, score-0.11]
44 They are mixed noiselessly to give P(x) in the left graph. [sent-63, score-0.076]
45 Figure 1: Pdfs for the data x (left) and hidden sources s (right). [sent-88, score-0.084]
46 In this paper, we take a slightly different route from [1, 2] when performing variational Bayesian learning. [sent-94, score-0.101]
47 2) into a mixture of $K^L$ Gaussians in the L-dimensional s space. [sent-96, score-0.034]
48 $k_t$ is a discrete hidden variable and $Q(k_t = k)$ is the probability that the t-th data point belongs to the k-th Gaussian. [sent-99, score-0.406]
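A hedged sketch of this expansion, turning L independent K-component source mixtures into one mixture of K^L joint Gaussians indexed by k = (k_1, ..., k_L):

```python
import itertools
import numpy as np

def expand_source_mixture(pi, phi, beta):
    """Expand L independent K-Gaussian source mixtures (eqn. 2) into one mixture of
    K**L L-dimensional Gaussians indexed by k = (k_1, ..., k_L).  Illustrative sketch."""
    L, K = pi.shape
    weights, means, precisions = [], [], []
    rows = np.arange(L)
    for k in itertools.product(range(K), repeat=L):
        ks = np.array(k)
        weights.append(np.prod(pi[rows, ks]))      # prod_l pi_{l, k_l}
        means.append(phi[rows, ks])                # (phi_{1,k_1}, ..., phi_{L,k_L})
        precisions.append(beta[rows, ks])          # diagonal precisions per component
    return np.array(weights), np.array(means), np.array(precisions)
```

With K = 2 Gaussians per source and L = 4 sources this yields 2^4 = 16 joint components; the count grows exponentially in L, which is the growth in complexity noted in the conclusion.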
49 (10), the variational Bayesian method can be continued as usual. [sent-101, score-0.089]
50 We have drawn in figure 2 a simplified graphical representation for the generative model of variational ICA. [sent-102, score-0.106]
51 $x_t$ is the observed variable, $k_t$ and $s_t$ are hidden variables and the rest are model parameters, where $k_t$ indicates which of the $K^L$ expanded Gaussians generated $s_t$. [sent-103, score-1.089]
52 Figure 2: A simplified directed graph for the generative model of variational ICA. [sent-104, score-0.646]
53 $x_t$ is the observed variable, $k_t$ and $s_t$ are hidden variables and the rest are model parameters. [sent-105, score-0.524]
54 $k_t$ indicates which of the $K^L$ expanded Gaussians generated $s_t$. [sent-106, score-0.5]
55 (10, 12 and 17) we perform functional maximization on the lower bound of the log marginal likelihood, log P(X), w.r.t. [sent-108, score-0.236]
56 Figure 3: The approximation of $Q(x_t^m|x_t^o)$ from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). [sent-122, score-0.552]
57 Shaded area is the exact posterior $P(x_t^m|x_t^o)$ corresponding to the noiseless mixture in fig. [sent-123, score-0.058]
58 In the above equations, $\langle\cdot\rangle$ denotes the expectation over the posterior distributions Q(·). [sent-126, score-0.335]
59 $A_{n\cdot}$ is the n-th row of the mixing matrix A, $k_l = k$ means picking out those Gaussians such that the l-th element of their indices k has the value of k, and $o_{nt}$ is a binary indicator variable for whether or not $x_{nt}$ is observed. [sent-127, score-0.178]
60 For a model of equal noise variance among all the observation dimensions, the summation in the learning rules for Q(Ψ) would be over both t and n. [sent-128, score-0.031]
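A schematic sketch of how the binary indicator $o_{nt}$ discounts missing entries when accumulating the residual statistics used in such updates (it mirrors the role of $o_{nt}$, not the exact Q(Ψ) learning rule):

```python
import numpy as np

def masked_residual_stats(X, X_pred, obs):
    """Accumulate per-dimension residual statistics over observed entries only,
    mirroring how the o_nt indicator discounts missing entries in the learning
    rules.  Schematic only, not the exact Q(Psi) update."""
    resid2 = np.where(obs, (X - X_pred) ** 2, 0.0)   # zero out missing entries
    counts = obs.sum(axis=0)                         # sum_t o_nt for each dimension n
    return resid2.sum(axis=0), counts                # divide sums by counts, not by T
```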
61 Finally, $Q(k_t)$ is given by $\log Q(k_t) = \log P(x_t^o|s_{k_t}, \theta) + \log N(s_{k_t}|\phi_k, \beta_k) - \log Q(s_{k_t}) + \log \pi_k - \log z_t$ (26), where $z_t$ is a normalization constant. [sent-133, score-0.242]
62 The lower bound E(X, Q(θ)|H) for the log marginal likelihood, $E(X, Q(\theta)|H) = \sum_t \log z_t + \int Q(\theta) \log \frac{P(\theta)}{Q(\theta)}\, d\theta$ (27), can be monitored during learning and used for comparison of different solutions or models. [sent-134, score-0.377]
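A small sketch of the bookkeeping implied by eqns. (26)-(27): normalizing per-component log scores into Q(k_t) with log z_t as the per-frame normalizer, and assembling the bound from the summed log z_t and a KL term on the parameters (illustrative; the KL term is assumed computed elsewhere):

```python
import numpy as np
from scipy.special import logsumexp

def component_responsibilities(log_scores):
    """Turn unnormalized per-component log scores (the bracketed terms of eqn. 26,
    shape (T, n_components)) into Q(k_t) and the per-frame normalizers log z_t."""
    log_z = logsumexp(log_scores, axis=1)
    q_k = np.exp(log_scores - log_z[:, None])
    return q_k, log_z

def lower_bound(log_z, kl_theta):
    """E(X, Q(theta)|H) of eqn. (27): the summed log z_t minus the KL divergence
    from Q(theta) to the prior (the integral term), assumed computed elsewhere."""
    return log_z.sum() - kl_theta
```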
63 The shaded area is the exact posterior $P(x_t^m|x_t^o)$ for the noiseless mixing in fig. 1 with observed $x_2 = -2$, and the solid line is the approximation by eqn. [sent-137, score-0.089] [sent-138, score-0.055]
65 We have modified the variational ICA of [1] by discounting missing entries in the learning rules. [sent-140, score-0.426]
66 The dashed line is the approximation of $Q(x_t^m|x_t^o)$ from this modified method. [sent-141, score-0.04]
67 The treatment of fully expanding the $K^L$ hidden source Gaussians discussed in section 2.3 is called "full missing ICA", and the modified method is "polynomial missing ICA". [sent-142, score-0.102] [sent-143, score-0.516]
69 The "full missing ICA" gives a more accurate fit for $P(x_t^m|x_t^o)$ and a better estimate for $\langle x_t^m|x_t^o \rangle$. [sent-144, score-0.412]
70 Figure 4: a)-d) Source density modeling by variational missing ICA of the synthetic data. [sent-145, score-1.103]
71 Histograms: recovered sources distribution; dashed lines: original probability densities; solid line: mixture of Gaussians modeled probability densities; dotted lines: individual Gaussian contribution. [sent-146, score-0.131]
72 e) E(X, Q(θ)|H) as a function of hidden source dimensions. [sent-147, score-0.072]
73 1 Synthetic Data In the first experiment, 200 data points were generated by mixing 4 sources randomly in a 7-dimensional space. [sent-149, score-0.102]
74 The generalized Gaussian, gamma and beta distributions were used to represent source densities of various skewness and kurtosis (fig. [sent-150, score-0.136]
75 Noise at –26 dB level was added to the data and missing entries were created with a probability of 0. [sent-152, score-0.33]
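A hedged sketch of this data-generation recipe; the specific distribution parameters and the missing rate below are illustrative choices, not the paper's exact settings:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200

# Four sources with different skewness and kurtosis (parameter choices are illustrative).
s_gamma = rng.gamma(2.0, 1.0, T)                                  # skewed
s_beta = rng.beta(0.5, 0.5, T)                                    # bimodal, sub-Gaussian
s_lap = rng.laplace(0.0, 1.0, T)                                  # heavy-tailed
s_gg = rng.choice([-1.0, 1.0], T) * rng.gamma(2.0, 1.0, T) ** 2   # P(s) propto exp(-|s|^0.5)
S = np.column_stack([s_gamma, s_beta, s_lap, s_gg])
S = (S - S.mean(0)) / S.std(0)

A = rng.normal(size=(7, 4))                                       # random 7 x 4 mixing
X = S @ A.T
noise_power = X.var() * 10 ** (-26 / 10)                          # about -26 dB relative to the data
X_noisy = X + rng.normal(0.0, np.sqrt(noise_power), X.shape)
miss_rate = 0.3                                                   # assumed rate, for illustration only
X_obs = np.where(rng.random(X.shape) < miss_rate, np.nan, X_noisy)
```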
76 In fig. 4 a)-d), we plotted the histograms of the recovered sources and the probability density functions (pdf) of the 4 sources. [sent-155, score-0.113]
77 The dashed line is the exact pdf used to generate the data and the solid line is the pdf modeled by a mixture of two 1-D Gaussians (eqn. [sent-156, score-0.175]
78 Fig. 4 e) plots the lower bound of the log marginal likelihood (eqn. 27) for models assuming different numbers of intrinsic dimensions. [sent-159, score-0.164] [sent-160, score-0.017]
80 As expected, the Bayesian treatment allows us to infer the intrinsic dimension of the data cloud. [sent-161, score-0.076]
81 In the figure, we also plot the E(X, Q(θ)|H) from the polynomial missing ICA. [sent-162, score-0.281]
82 It is clear that the full missing ICA gave a better fit to the data density. [sent-163, score-0.285]
83 Furthermore, the polynomial missing ICA converges more slowly per epoch of learning, suffers from many more local minima, and these problems get worse with a higher missing rate. [sent-164, score-0.539]
84 2 Mixing Images This experiment demonstrates the ability of the proposed method to fill in missing values while performing demixing. [sent-166, score-0.27]
85 They were linearly mixed into 3 images and –20 dB noise was added. [sent-169, score-0.093]
86 The denoised mixtures and recovered sources are in the 3rd and 4th columns of fig. [sent-171, score-0.09]
87 8% of the pixels were missing from all 3 mixed images and could not be recovered. [sent-174, score-0.376]
88 4% of the pixels were missing from only 1 mixed image and could be filled in with low uncertainty. [sent-176, score-0.347]
89 6% of the pixels were missing from any two of the mixed images. [sent-178, score-0.347]
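The three categories above are determined by how many of the mixtures observe each pixel; a small sketch that tabulates them for independent per-mixture masks (the 20% rate comes from the figure 5 caption; the image size is illustrative):

```python
import numpy as np

def missing_pattern_fractions(masks):
    """masks: boolean array (n_mixtures, H, W), True where a pixel is missing in that
    mixture.  Returns the fraction of pixels missing in exactly 1, 2, ... mixtures,
    which determines whether they can be recovered from the remaining mixtures."""
    n_missing = masks.sum(axis=0)
    total = n_missing.size
    return {k: float((n_missing == k).sum()) / total
            for k in range(1, masks.shape[0] + 1)}

# Independent 20% masks on three mixtures (image size illustrative).
rng = np.random.default_rng(2)
masks = rng.random((3, 64, 64)) < 0.2
print(missing_pattern_fractions(masks))   # roughly {1: 0.38, 2: 0.096, 3: 0.008}
```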
90 In fig. 5, we can see that the source images were well separated and the mixed images were nicely denoised. [sent-181, score-0.176]
91 The denoised mixed images in this example were only meant to visually illustrate the method. [sent-182, score-0.125]
92 6 Conclusion In this paper, we derived the learning rules for variational Bayesian ICA with missing data. [sent-184, score-0.365]
93 Figure 5: A demonstration of recovering missing values. [sent-186, score-0.258]
94 20% of the pixels in the mixed images (2nd column) are missing, while only 0.8% are missing from the denoised mixed (3rd column) and separated images (4th column). [sent-188, score-0.118] [sent-189, score-0.383]
96 However, this exponential growth in complexity is manageable and worthwhile for small data sets containing missing entries in a high-dimensional space. [sent-190, score-0.355]
97 The proposed method shows promise in analyzing and identifying projections of datasets that have a very limited number of expensive data points yet contain missing entries due to data scarcity. [sent-191, score-0.388]
98 We have applied the variational missing ICA to a primate brain volumetric dataset containing 44 examples in 57 dimensions. [sent-192, score-0.36]
99 Variational learning of clusters of undercomplete nonsymmetric independent components. [sent-196, score-0.014]
100 Flexible Bayesian independent component analysis for blind source separation. [sent-201, score-0.098]
wordName wordTfidf (topN-words)
[('lkl', 0.61), ('xo', 0.379), ('skt', 0.345), ('kt', 0.317), ('missing', 0.258), ('ica', 0.196), ('st', 0.158), ('xm', 0.154), ('log', 0.092), ('variational', 0.089), ('kl', 0.089), ('ont', 0.08), ('dst', 0.066), ('xt', 0.065), ('mixed', 0.064), ('bo', 0.062), ('ao', 0.059), ('entries', 0.058), ('xnt', 0.053), ('gaussians', 0.047), ('bayesian', 0.044), ('source', 0.042), ('density', 0.04), ('dskt', 0.04), ('sources', 0.04), ('mixing', 0.036), ('incomplete', 0.033), ('ast', 0.032), ('kurtosis', 0.032), ('denoised', 0.032), ('hidden', 0.03), ('diag', 0.03), ('treatment', 0.03), ('zt', 0.029), ('images', 0.029), ('marginal', 0.027), ('anl', 0.027), ('dxm', 0.027), ('imputation', 0.027), ('kwokleung', 0.027), ('tth', 0.027), ('pixels', 0.025), ('pdf', 0.025), ('expanded', 0.025), ('blind', 0.023), ('polynomial', 0.023), ('chan', 0.023), ('cloud', 0.023), ('skewness', 0.023), ('densities', 0.023), ('jensen', 0.023), ('mixture', 0.022), ('dashed', 0.021), ('discounting', 0.021), ('inserts', 0.021), ('likelihood', 0.02), ('ll', 0.02), ('lk', 0.02), ('datasets', 0.02), ('observed', 0.019), ('component', 0.019), ('line', 0.019), ('noiseless', 0.018), ('rules', 0.018), ('recovered', 0.018), ('posterior', 0.018), ('kth', 0.018), ('terrence', 0.018), ('solid', 0.017), ('generative', 0.017), ('intrinsic', 0.017), ('shaded', 0.017), ('gamma', 0.016), ('inequality', 0.015), ('dimension', 0.015), ('lled', 0.015), ('histograms', 0.015), ('independent', 0.014), ('data', 0.014), ('handled', 0.014), ('preprocessing', 0.013), ('summation', 0.013), ('tting', 0.013), ('modeled', 0.013), ('containing', 0.013), ('bound', 0.013), ('full', 0.013), ('performing', 0.012), ('column', 0.012), ('lower', 0.012), ('dimensional', 0.012), ('analyzing', 0.012), ('expensive', 0.012), ('dummy', 0.012), ('ucsd', 0.012), ('nicely', 0.012), ('blood', 0.012), ('noiselessly', 0.012), ('cbcl', 0.012), ('erkki', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA
Author: Kwokleung Chan, Te-Won Lee, Terrence J. Sejnowski
Abstract: Missing data is common in real-world datasets and is a problem for many estimation techniques. We have developed a variational Bayesian method to perform Independent Component Analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating the generative density model. Modeling the distributions of the independent sources with mixture of Gaussians allows sources to be estimated with different kurtosis and skewness. The variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. This allows direct probability estimation of missing values in the high dimensional space and avoids dimension reduction preprocessing which is not feasible with missing data.
2 0.10994112 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
Author: Gil-jin Jang, Te-Won Lee
Abstract: We present a new technique for achieving source separation when given only a single channel recording. The main idea is based on exploiting the inherent time structure of sound sources by learning a priori sets of basis filters in time domain that encode the sources in a statistically efficient manner. We derive a learning algorithm using a maximum likelihood approach given the observed single channel data and sets of basis filters. For each time point we infer the source signals and their contribution factors. This inference is possible due to the prior knowledge of the basis filters and the associated coefficient densities. A flexible model for density estimation allows accurate modeling of the observation and our experimental results exhibit a high level of separation performance for mixtures of two music signals as well as the separation of two voice signals.
3 0.073119082 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
Author: Luis E. Ortiz, David A. McAllester
Abstract: This paper gives distribution-free concentration inequalities for the missing mass and the error rate of histogram rules. Negative association methods can be used to reduce these concentration problems to concentration questions about independent sums. Although the sums are independent, they are highly heterogeneous. Such highly heterogeneous independent sums cannot be analyzed using standard concentration inequalities such as Hoeffding’s inequality, the Angluin-Valiant bound, Bernstein’s inequality, Bennett’s inequality, or McDiarmid’s theorem.
4 0.064895876 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
Author: Shinji Watanabe, Yasuhiro Minami, Atsushi Nakamura, Naonori Ueda
Abstract: In this paper, we propose a Bayesian framework, which constructs shared-state triphone HMMs based on a variational Bayesian approach, and recognizes speech based on the Bayesian prediction classification; variational Bayesian estimation and clustering for speech recognition (VBEC). An appropriate model structure with high recognition performance can be found within a VBEC framework. Unlike conventional methods, including BIC or MDL criterion based on the maximum likelihood approach, the proposed model selection is valid in principle, even when there are insufficient amounts of data, because it does not use an asymptotic assumption. In isolated word recognition experiments, we show the advantage of VBEC over conventional methods, especially when dealing with small amounts of data.
5 0.058543012 137 nips-2002-Location Estimation with a Differential Update Network
Author: Ali Rahimi, Trevor Darrell
Abstract: Given a set of hidden variables with an a-priori Markov structure, we derive an online algorithm which approximately updates the posterior as pairwise measurements between the hidden variables become available. The update is performed using Assumed Density Filtering: to incorporate each pairwise measurement, we compute the optimal Markov structure which represents the true posterior and use it as a prior for incorporating the next measurement. We demonstrate the resulting algorithm by calculating globally consistent trajectories of a robot as it navigates along a 2D trajectory. To update a trajectory of length t, the update takes O(t). When all conditional distributions are linear-Gaussian, the algorithm can be thought of as a Kalman Filter which simplifies the state covariance matrix after incorporating each measurement.
6 0.05843202 111 nips-2002-Independent Components Analysis through Product Density Estimation
7 0.057527076 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
8 0.050530404 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
9 0.048994988 134 nips-2002-Learning to Take Concurrent Actions
10 0.047528051 10 nips-2002-A Model for Learning Variance Components of Natural Images
11 0.047116589 74 nips-2002-Dynamic Structure Super-Resolution
12 0.046519622 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
13 0.044177596 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
14 0.039843902 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering
15 0.037282057 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
16 0.036118224 73 nips-2002-Dynamic Bayesian Networks with Deterministic Latent Tables
17 0.035377599 120 nips-2002-Kernel Design Using Boosting
18 0.035257827 161 nips-2002-PAC-Bayes & Margins
19 0.032669239 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
20 0.031173503 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
topicId topicWeight
[(0, -0.091), (1, -0.015), (2, -0.027), (3, 0.065), (4, -0.015), (5, 0.025), (6, -0.072), (7, 0.051), (8, 0.045), (9, -0.024), (10, 0.09), (11, -0.051), (12, -0.006), (13, -0.022), (14, 0.007), (15, -0.013), (16, -0.017), (17, 0.016), (18, -0.042), (19, 0.024), (20, 0.009), (21, -0.027), (22, -0.043), (23, 0.009), (24, -0.118), (25, 0.047), (26, -0.115), (27, 0.037), (28, -0.004), (29, -0.013), (30, -0.064), (31, -0.061), (32, -0.042), (33, 0.019), (34, -0.109), (35, 0.007), (36, 0.103), (37, 0.035), (38, -0.034), (39, -0.127), (40, 0.052), (41, 0.187), (42, 0.055), (43, 0.24), (44, 0.11), (45, 0.017), (46, -0.1), (47, -0.107), (48, 0.005), (49, -0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.95314056 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA
Author: Kwokleung Chan, Te-Won Lee, Terrence J. Sejnowski
Abstract: Missing data is common in real-world datasets and is a problem for many estimation techniques. We have developed a variational Bayesian method to perform Independent Component Analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating the generative density model. Modeling the distributions of the independent sources with mixture of Gaussians allows sources to be estimated with different kurtosis and skewness. The variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. This allows direct probability estimation of missing values in the high dimensional space and avoids dimension reduction preprocessing which is not feasible with missing data.
2 0.52908808 54 nips-2002-Combining Dimensions and Features in Similarity-Based Representations
Author: Daniel J. Navarro, Michael D. Lee
Abstract: unknown-abstract
3 0.52830505 183 nips-2002-Source Separation with a Sensor Array using Graphical Models and Subband Filtering
Author: Hagai Attias
Abstract: Source separation is an important problem at the intersection of several fields, including machine learning, signal processing, and speech technology. Here we describe new separation algorithms which are based on probabilistic graphical models with latent variables. In contrast with existing methods, these algorithms exploit detailed models to describe source properties. They also use subband filtering ideas to model the reverberant environment, and employ an explicit model for background and sensor noise. We leverage variational techniques to keep the computational complexity per EM iteration linear in the number of frames. 1 The Source Separation Problem Fig. 1 illustrates the problem of source separation with a sensor array. In this problem, signals from K independent sources are received by each of L ≥ K sensors. The task is to extract the sources from the sensor signals. It is a difficult task, partly because the received signals are distorted versions of the originals. There are two types of distortions. The first type arises from propagation through a medium, and is approximately linear but also history dependent. This type is usually termed reverberations. The second type arises from background noise and sensor noise, which are assumed additive. Hence, the actual task is to obtain an optimal estimate of the sources from data. The task is difficult for another reason, which is lack of advance knowledge of the properties of the sources, the propagation medium, and the noises. This difficulty gave rise to adaptive source separation algorithms, where parameters that are related to those properties are adjusted to optimized a chosen cost function. Unfortunately, the intense activity this problem has attracted over the last several years [1–9] has not yet produced a satisfactory solution. In our opinion, the reason is that existing techniques fail to address three major factors. The first is noise robustness: algorithms typically ignore background and sensor noise, sometime assuming they may be treated as additional sources. It seems plausible that to produce a noise robust algorithm, noise signals and their properties must be modeled explicitly, and these models should be exploited to compute optimal source estimators. The second factor is mixing filters: algorithms typically seek, and directly optimize, a transformation that would unmix the sources. However, in many situations, the filters describing medium propagation are non-invertible, or have an unstable inverse, or have a stable inverse that is extremely long. It may hence be advantageous to Figure 1: The source separation problem. Signals from K = 2 speakers propagate toward L = 2 sensors. Each sensor receives a linear mixture of the speaker signals, distorted by multipath propagation, medium response, and background and sensor noise. The task is to infer the original signals from sensor data. estimate the mixing filters themselves, then use them to estimate the sources. The third factor is source properties: algorithms typically use a very simple source model (e.g., a one time point histogram). But in many cases one may easily obtain detailed models of the source signals. This is particularly true for speech sources, where large datasets exist and much modeling expertise has developed over decades of research. Separation of speakers is also one of the major potential commercial applications of source separation algorithms. It seems plausible that incorporating strong source models could improve performance. 
Such models may potentially have two more advantages: first, they could help limit the range of possible mixing filters by constraining the optimization problem. Second, they could help avoid whitening the extracted signals by effectively limiting their spectral range to the range characteristic of the source model. This paper makes several contributions to the problem of real world source separation. In the following, we present new separation algorithms that are the first to address all three factors. We work in the framework of probabilistic graphical models. This framework allows us to construct models for sources and for noise, combine them with the reverberant mixing transformation in a principled manner, and compute parameter and source estimates from data which are Bayes optimal. We identify three technical ideas that are key to our approach: (1) a strong speech model, (2) subband filtering, and (3) variational EM. 2 Frames, Subband Signals, and Subband Filtering We start with the concept of subband filtering. This is also a good point to define our notation. Let xm denote a time domain signal, e.g., the value of a sound pressure waveform at time point m = 0, 1, 2, .... Let Xn [k] denote the corresponding subband signal at time frame n and subband frequency k. The subband signals are obtained from the time domain signal by imposing an N -point window wm , m = 0 : N − 1 on that signal at equally spaced points nJ, n = 0, 1, 2, ..., and FFT-ing the windowed signal, N −1 e−iωk m wm xnJ+m , Xn [k] = (1) m=0 where ωk = 2πk/N and k = 0 : N − 1. The subband signals are also termed frames. Notice the difference in time scale between the time frame index n in Xn [k] and the time point index n in xn . The chosen value of the spacing J depends on the window length N . For J ≤ N the original signal xm can be synthesized exactly from the subband signals (synthesis formula omitted). An important consideration for selecting J, as well as the window shape, is behavior under filtering. Consider a filter hm applied to xm , and denote by ym the filtered signal. In the simple case hm = hδm,0 (no filtering), the subband signals keep the same dependence as the time domain ones, yn = hxn −→ Yn [k] = hXn [k] . For an arbitrary filter hm , we use the relation yn = hm xn−m −→ Yn [k] = Hm [k]Xn−m [k] , (2) m m with complex coefficients Hm [k] for each k. This relation between the subband signals is termed subband filtering, and the Hm [k] are termed subband filters. Unlike the simple case of non-filtering, the relation (2) holds approximately, but quite accurately using an appropriate choice of J and wm ; see [13] for details on accuracy. Throughout this paper, we will assume that an arbitrary filter hm can be modeled by the subband filters Hm [k] to a sufficient accuracy for our purposes. One advantage of subband filtering is that it replaces a long filter hm by a set of short independent filters Hm [k], one per frequency. This will turn out to decompose the source separation problem into a set of small (albeit coupled) problems, one per frequency. Another advantage is that this representation allows using a detailed speech model on the same footing with the filter model. This is because a speech model is defined on the time scale of a single frame, whereas the original filter hm , in contrast with Hm [k], is typically as long as 10 or more frames. As a final point on notation, we define a Gaussian distribution over a complex number Z ν by p(Z) = N (Z | µ, ν) = π exp(−ν | Z − µ |2 ) . 
Notice that this is a joint distribution over the real and imaginary parts of Z. The mean is µ = X and the precision (inverse variance) ν satisfies ν −1 = | X |2 − | µ |2 . 3 A Model for Speech Signals We assume independent sources, and model the distribution of source j by a mixture model over its subband signals Xjn , N/2−1 p(Xjn | Sjn = s) N (Xjn [k] | 0, Ajs [k]) = p(Sjn = s) = πjs k=1 p(X, S) p(Xjn | Sjn )p(Sjn ) , = (3) jn where the components are labeled by Sjn . Component s of source j is a zero mean Gaussian with precision Ajs . The mixing proportions of source j are πjs . The DAG representing this model is shown in Fig. 2. A similar model was used in [10] for one microphone speech enhancement for recognition (see also [11]). Here are several things to note about this model. (1) Each component has a characteristic spectrum, which may describe a particular part of a speech phoneme. This is because the precision corresponds to the inverse spectrum: the mean energy (w.r.t. the above distribution) of source j at frequency k, conditioned on label s, is | Xjn |2 = A−1 . (2) js A zero mean model is appropriate given the physics of the problem, since the mean of a sound pressure waveform is zero. (3) k runs from 1 to N/2 − 1, since for k > N/2, Xjn [k] = Xjn [N − k] ; the subbands k = 0, N/2 are real and are omitted from the model, a common practice in speech recognition engines. (4) Perhaps most importantly, for each source the subband signals are correlated via the component label s, as p(Xjn ) = s p(Xjn , Sjn = s) = k p(Xjn [k]) . Hence, when the source separation problem decomposes into one problem per frequency, these problems turn out to be coupled (see below), and independent frequency permutations are avoided. (5) To increase sn xn Figure 2: Graphical model describing speech signals in the subband domain. The model assumes i.i.d. frames; only the frame at time n is shown. The node Xn represents a complex N/2 − 1-dimensional vector Xn [k], k = 1 : N/2 − 1. model accuracy, a state transition matrix p(Sjn = s | Sj,n−1 = s ) may be added for each source. The resulting HMM models are straightforward to incorporate without increasing the algorithm complexity. There are several modes of using the speech model in the algorithms below. In one mode, the sources are trained online using the sensor data. In a second mode, source models are trained offline using available data on each source in the problem. A third mode correspond to separation of sources known to be speech but whose speakers are unknown. In this case, all sources have the same model, which is trained offline on a large dataset of speech signals, including 150 male and female speakers reading sentences from the Wall Street Journal (see [10] for details). This is the case presented in this paper. The training algorithm used was standard EM (omitted) using 256 clusters, initialized by vector quantization. 4 Separation of Non-Reverberant Mixtures We now present a source separation algorithm for the case of non-reverberant (or instantaneous) mixing. Whereas many algorithms exist for this case, our contribution here is an algorithm that is significantly more robust to noise. 
Its robustness results, as indicated in the introduction, from three factors: (1) explicitly modeling the noise in the problem, (2) using a strong source model, in particular modeling the temporal statistics (over N time points) of the sources, rather than one time point statistics, and (3) extracting each source signal from data by a Bayes optimal estimator obtained from p(X | Y ). A more minor point is handling the case of less sources than sensors in a principled way. The mixing situation is described by yin = j hij xjn + uin , where xjn is source signal j at time point n, yin is sensor signal i, hij is the instantaneous mixing matrix, and uin is the noise corrupting sensor i’s signal. The corresponding subband signals satisfy Yin [k] = j hij Xjn [k] + Uin [k] . To turn the last equation into a probabilistic graphical model, we assume that noise i has precision (inverse spectrum) Bi [k], and that noises at different sensors are independent (the latter assumption is often inaccurate but can be easily relaxed). This yields p(Yin | X) N (Yin [k] | = p(Y | X) p(Yin | X) , = hij Xjn [k], Bi [k]) j k (4) in which together with the speech model (3) forms a complete model p(Y, X, S) for this problem. The DAG representing this model for the case K = L = 2 is shown in Fig. 3. Notice that this model generalizes [4] to the subband domain. s1n−2 s1n−1 s1 n s2n−2 s2n−1 s2 n x1n−2 x1n−1 x1 n x2n−2 x2n−1 x2 n y1n−2 y1n−1 y1n y2n−2 y2n−1 y2 n Figure 3: Graphical model for noisy, non-reverberant 2 × 2 mixing, showing a 3 frame-long sequence. All nodes Yin and Xjn represent complex N/2 − 1-dimensional vectors (see Fig. 2). While Y1n and Y2n have the same parents, X1n and X2n , the arcs from the parents to Y2n are omitted for clarity. The model parameters θ = {hij , Bi [k], Ajs [k], πjs } are estimated from data by an EM algorithm. However, as the number of speech components M or the number of sources K increases, the E-step becomes computationally intractable, as it requires summing over all O(M K ) configurations of (S1n , ..., SKn ) at each frame. We approximate the E-step using a variational technique: focusing on the posterior distribution p(X, S | Y ), we compute an optimal tractable approximation q(X, S | Y ) ≈ p(X, S | Y ), which we use to compute the sufficient statistics (SS). We choose q(Xjn | Sjn , Y )q(Sjn | Y ) , q(X, S | Y ) = (5) jn where the hidden variables are factorized over the sources, and also over the frames (the latter factorization is exact in this model, but is an approximation for reverberant mixing). This posterior maintains the dependence of X on S, and thus the correlations between different subbands Xjn [k]. Notice also that this posterior implies a multimodal q(Xjn ) (i.e., a mixture distribution), which is more accurate than unimodal posteriors often employed in variational approximations (e.g., [12]), but is also harder to compute. A slightly more general form which allows inter-frame correlations by employing q(S | Y ) = jn q(Sjn | Sj,n−1 , Y ) may also be used, without increasing complexity. By optimizing in the usual way (see [12,13]) a lower bound on the likelihood w.r.t. q, we obtain q(Xjn [k] | Sjn = s, Y )q(Sjn = s | Y ) , q(Xjn , Sjn = s | Y ) = (6) k where q(Xjn [k] | Sjn = s, Y ) = N (Xjn [k] | ρjns [k], νjs [k]) and q(Sjn = s | Y ) = γjns . Both the factorization over k of q(Xjn | Sjn ) and its Gaussian functional form fall out from the optimization under the structural restriction (5) and need not be specified in advance. 
The variational parameters {ρjns [k], νjs [k], γjns }, which depend on the data Y , constitute the SS and are computed in the E-step. The DAG representing this posterior is shown in Fig. 4. s1n−2 s1n−1 s1 n s2n−2 s2n−1 s2 n x1n−2 x1n−1 x1 n x2n−2 x2n−1 x2 n {y im } Figure 4: Graphical model describing the variational posterior distribution applied to the model of Fig. 3. In the non-reverberant case, the components of this posterior at time frame n are conditioned only on the data Yin at that frame; in the reverberant case, the components at frame n are conditioned on the data Yim at all frames m. For clarity and space reasons, this distinction is not made in the figure. After learning, the sources are extracted from data by a variational approximation of the minimum mean squared error estimator, ˆ Xjn [k] = E(Xjn [k] | Y ) = dX q(X | Y )Xjn [k] , (7) i.e., the posterior mean, where q(X | Y ) = S q(X, S | Y ). The time domain waveform xjm is then obtained by appropriately patching together the subband signals. ˆ M-step. The update rule for the mixing matrix hij is obtained by solving the linear equation Bi [k]ηij,0 [k] = hij j k Bi [k]λj j,0 [k] . (8) k The update rule for the noise precisions Bi [k] is omitted. The quantities ηij,m [k] and λj j,m [k] are computed from the SS; see [13] for details. E-step. The posterior means of the sources (7) are obtained by solving ˆ Xjn [k] = νjn [k]−1 ˆ i Bi [k]hij Yin [k] − j =j ˆ hij Xj n [k] (9) ˆ for Xjn [k], which is a K ×K linear system for each frequency k and frame n. The equations for the SS are given in [13], which also describes experimental results. 5 Separation of Reverberant Mixtures In this section we extend the algorithm to the case of reverberant mixing. In that case, due to signal propagation in the medium, each sensor signal at time frame n depends on the source signals not just at the same time but also at previous times. To describe this mathematically, the mixing matrix hij must become a matrix of filters hij,m , and yin = hij,m xj,n−m + uin . jm It may seem straightforward to extend the algorithm derived above to the present case. However, this appearance is misleading, because we have a time scale problem. Whereas are speech model p(X, S) is frame based, the filters hij,m are generally longer than the frame length N , typically 10 frames long and sometime longer. It is unclear how one can work with both Xjn and hij,m on the same footing (and, it is easy to see that straightforward windowed FFT cannot solve this problem). This is where the idea of subband filtering becomes very useful. Using (2) we have Yin [k] = Hij,m [k]Xj,n−m [k] + Uin [k], which yields the probabilistic model jm p(Yin | X) N (Yin [k] | = Hij,m [k]Xj,n−m [k], Bi [k]) . (10) jm k Hence, both X and Y are now frame based. Combining this equation with the speech model (3), we now have a complete model p(Y, X, S) for the reverberant mixing problem. The DAG describing this model is shown in Fig. 5. s1n−2 s1n−1 s1 n s2n−2 s2n−1 s2 n x1n−2 x1n−1 x1 n x2n−2 x2n−1 x2 n y1n−2 y1n−1 y1n y2n−2 y2n−1 y2 n Figure 5: Graphical model for noisy, reverberant 2 × 2 mixing, showing a 3 frame-long sequence. Here we assume 2 frame-long filters, i.e., m = 0, 1 in Eq. (10), where the solid arcs from X to Y correspond to m = 0 (as in Fig. 3) and the dashed arcs to m = 1. While Y1n and Y2n have the same parents, X1n and X2n , the arcs from the parents to Y2n are omitted for clarity. 
The model parameters θ = {Hij,m [k], Bi [k], Ajs [k], πjs } are estimated from data by a variational EM algorithm, whose derivation generally follows the one outlined in the previous section. Notice that the exact E-step here is even more intractable, due to the history dependence introduced by the filters. M-step. The update rule for Hij,m is obtained by solving the Toeplitz system Hij ,m [k]λj j,m−m [k] = ηij,m [k] (11) j m where the quantities λj j,m [k], ηij,m [k] are computed from the SS (see [12]). The update rule for the Bi [k] is omitted. E-step. The posterior means of the sources (7) are obtained by solving ˆ Xjn [k] = νjn [k]−1 ˆ im Bi [k]Hij,m−n [k] Yim [k] − Hij j m =jm ,m−m ˆ [k]Xj m [k] (12) ˆ for Xjn [k]. Assuming P frames long filters Hij,m , m = 0 : P − 1, this is a KP × KP linear system for each frequency k. The equations for the SS are given in [13], which also describes experimental results. 6 Extensions An alternative technique we have been pursuing for approximating EM in our models is Sequential Rao-Blackwellized Monte Carlo. There, we sample state sequences S from the posterior p(S | Y ) and, for a given sequence, perform exact inference on the source signals X conditioned on that sequence (observe that given S, the posterior p(X | S, Y ) is Gaussian and can be computed exactly). In addition, we are extending our speech model to include features such as pitch [7] in order to improve separation performance, especially in cases with less sensors than sources [7–9]. Yet another extension is applying model selection techniques to infer the number of sources from data in a dynamic manner. Acknowledgments I thank Te-Won Lee for extremely valuable discussions. References [1] A.J. Bell, T.J. Sejnowski (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation 7, 1129-1159. [2] B.A. Pearlmutter, L.C. Parra (1997). Maximum likelihood blind source separation: A contextsensitive generalization of ICA. Proc. NIPS-96. [3] A. Cichocki, S.-I. Amari (2002). Adaptive Blind Signal and Image Processing. Wiley. [4] H. Attias (1999). Independent Factor Analysis. Neural Computation 11, 803-851. [5] T.-W. Lee et al. (2001) (Ed.). Proc. ICA 2001. [6] S. Griebel, M. Brandstein (2001). Microphone array speech dereverberation using coarse channel modeling. Proc. ICASSP 2001. [7] J. Hershey, M. Casey (2002). Audiovisual source separation via hidden Markov models. Proc. NIPS 2001. [8] S. Roweis (2001). One Microphone Source Separation. Proc. NIPS-00, 793-799. [9] G.-J. Jang, T.-W. Lee, Y.-H. Oh (2003). A probabilistic approach to single channel blind signal separation. Proc. NIPS 2002. [10] H. Attias, L. Deng, A. Acero, J.C. Platt (2001). A new method for speech denoising using probabilistic models for clean speech and for noise. Proc. Eurospeech 2001. [11] Ephraim, Y. (1992). Statistical model based speech enhancement systems. Proc. IEEE 80(10), 1526-1555. [12] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, L.K. Saul (1999). An introduction to variational methods in graphical models. Machine Learning 37, 183-233. [13] H. Attias (2003). New EM algorithms for source separation and deconvolution with a microphone array. Proc. ICASSP 2003.
4 0.44287202 111 nips-2002-Independent Components Analysis through Product Density Estimation
Author: Trevor Hastie, Rob Tibshirani
Abstract: We present a simple direct approach for solving the ICA problem, using density estimation and maximum likelihood. Given a candidate orthogonal frame, we model each of the coordinates using a semi-parametric density estimate based on cubic splines. Since our estimates have two continuous derivatives , we can easily run a second order search for the frame parameters. Our method performs very favorably when compared to state-of-the-art techniques. 1
5 0.44025153 14 nips-2002-A Probabilistic Approach to Single Channel Blind Signal Separation
Author: Gil-jin Jang, Te-Won Lee
Abstract: We present a new technique for achieving source separation when given only a single channel recording. The main idea is based on exploiting the inherent time structure of sound sources by learning a priori sets of basis filters in time domain that encode the sources in a statistically efficient manner. We derive a learning algorithm using a maximum likelihood approach given the observed single channel data and sets of basis filters. For each time point we infer the source signals and their contribution factors. This inference is possible due to the prior knowledge of the basis filters and the associated coefficient densities. A flexible model for density estimation allows accurate modeling of the observation and our experimental results exhibit a high level of separation performance for mixtures of two music signals as well as the separation of two voice signals.
6 0.43078977 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
7 0.37696299 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
8 0.35935658 137 nips-2002-Location Estimation with a Differential Update Network
9 0.32186604 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
10 0.30464062 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting
11 0.2983942 18 nips-2002-Adaptation and Unsupervised Learning
12 0.29499331 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
13 0.27966252 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
14 0.2690787 134 nips-2002-Learning to Take Concurrent Actions
15 0.25842607 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
16 0.2284378 138 nips-2002-Manifold Parzen Windows
17 0.22722569 67 nips-2002-Discriminative Binaural Sound Localization
18 0.22523482 17 nips-2002-A Statistical Mechanics Approach to Approximate Analytical Bootstrap Averages
19 0.22386786 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
20 0.22345589 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
topicId topicWeight
[(23, 0.019), (42, 0.058), (51, 0.41), (54, 0.075), (55, 0.024), (67, 0.018), (68, 0.026), (74, 0.085), (92, 0.036), (98, 0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.74878019 101 nips-2002-Handling Missing Data with Variational Bayesian Learning of ICA
Author: Kwokleung Chan, Te-Won Lee, Terrence J. Sejnowski
Abstract: Missing data is common in real-world datasets and is a problem for many estimation techniques. We have developed a variational Bayesian method to perform Independent Component Analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating the generative density model. Modeling the distributions of the independent sources with mixture of Gaussians allows sources to be estimated with different kurtosis and skewness. The variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. This allows direct probability estimation of missing values in the high dimensional space and avoids dimension reduction preprocessing which is not feasible with missing data.
2 0.63454074 23 nips-2002-Adaptive Quantization and Density Estimation in Silicon
Author: Seth Bridges, Miguel Figueroa, Chris Diorio, Daniel J. Hsu
Abstract: We present the bump mixture model, a statistical model for analog data where the probabilistic semantics, inference, and learning rules derive from low-level transistor behavior. The bump mixture model relies on translinear circuits to perform probabilistic inference, and floating-gate devices to perform adaptation. This system is low power, asynchronous, and fully parallel, and supports various on-chip learning algorithms. In addition, the mixture model can perform several tasks such as probability estimation, vector quantization, classification, and clustering. We tested a fabricated system on clustering, quantization, and classification of handwritten digits and show performance comparable to the E-M algorithm on mixtures of Gaussians. 1 I n trod u cti on Many system-on-a-chip applications, such as data compression and signal processing, use online adaptation to improve or tune performance. These applications can benefit from the low-power compact design that analog VLSI learning systems can offer. Analog VLSI learning systems can benefit immensely from flexible learning algorithms that take advantage of silicon device physics for compact layout, and that are capable of a variety of learning tasks. One learning paradigm that encompasses a wide variety of learning tasks is density estimation, learning the probability distribution over the input data. A silicon density estimator can provide a basic template for VLSI systems for feature extraction, classification, adaptive vector quantization, and more. In this paper, we describe the bump mixture model, a statistical model that describes the probability distribution function of analog variables using low-level transistor equations. We intend the bump mixture model to be the silicon version of mixture of Gaussians [1], one of the most widely used statistical methods for modeling the probability distribution of a collection of data. Mixtures of Gaussians appear in many contexts from radial basis functions [1] to hidden Markov models [2]. In the bump mixture model, probability computations derive from translinear circuits [3] and learning derives from floating-gate device equations [4]. The bump mixture model can perform different functions such as quantization, probability estimation, and classification. In addition this VLSI mixture model can implement multiple learning algorithms using different peripheral circuitry. Because the equations for system operation and learning derive from natural transistor behavior, we can build large bump mixture model with millions of parameters on a single chip. We have fabricated a bump mixture model, and tested it on clustering, classification, and vector quantization of handwritten digits. The results show that the fabricated system performs comparably to mixtures of Gaussians trained with the E-M algorithm [1]. Our work builds upon several trends of research in the VLSI community. The results in this paper are complement recent work on probability propagation in analog VLSI [5-7]. These previous systems, intended for decoding applications in communication systems, model special forms of probability distributions over discrete variables, and do not incorporate learning. In contrast, the bump mixture model performs inference and learning on probability distributions over continuous variables. The bump mixture model significantly extends previous results on floating-gate circuits [4]. 
Our system is a fully realized floating-gate learning algorithm that can be used for vector quantization, probability estimation, clustering, and classification. Finally, the mixture model’s architecture is similar to many previous VLSI vector quantizers [8, 9]. We can view the bump mixture model as a VLSI vector quantizer with well-defined probabilistic semantics. Computations such as probability estimation and maximum-likelihood classification have a natural statistical interpretation under the mixture model. In addition, because we rely on floating-gate devices, the mixture model does not require a refresh mechanism unlike previous learning VLSI quantizers. 2 T h e ad ap ti ve b u mp ci rcu i t The adaptive bump circuit [4], depicted in Fig.1(a-b), forms the basis of the bump mixture model. This circuit is slightly different from previous versions reported in the literature. Nevertheless, the high level functionality remains the same; the adaptive bump circuit computes the similarity between a stored variable and an input, and adapts to increase the similarity between the stored variable and input. Fig.1(a) shows the computation portion of the circuit. The bump circuit takes as input, a differential voltage signal (+Vin, −Vin) around a DC bias, and computes the similarity between Vin and a stored value, µ. We represent the stored memory µ as a voltage: µ= Vw- − Vw+ 2 (1) where Vw+ and Vw− are the gate-offset voltages stored on capacitors C1 and C2. Because C1 and C2 isolate the gates of transistors M1 and M2 respectively, these transistors are floating-gate devices. Consequently, the stored voltages Vw+ and Vw− are nonvolatile. We can express the floating-gate voltages Vfg1 and Vfg2 as Vfg1 =Vin +Vw+ and Vfg2 =Vw− −Vin, and the output of the bump circuit as [10]: I out = Ib cosh 2 ( ( 4κ / SU ) (V t fg 1 − V fg 2 ) ) = Ib cosh ( ( 8κ / SU t )(Vin − µ ) ) 2 (2) where Ib is the bias current, κ is the gate-coupling coefficient, Ut is the thermal voltage, and S depends on the transistor sizes. Fig.1(b) shows Iout for three different stored values of µ. As the data show, different µ’s shift the location of the peak response of the circuit. Vw+ V fg1 V in V fg2 Vb M1 −V in M2 I out Vw− C1 C2 V ca sc V2 V1 Vb V tun M6 V fg1 V2 V1 V in j (a) (b) bump circuit's transfer function for three µ's 10 Iout (nA) µ2 µ1 µ3 6 4 2 0 -0.4 -0.2 V fg2 M3 M4 V inj 8 V tun M5 0 V in (c) 0.2 0.4 Figure 1. (a-b) The adaptive bump circuit. (a) The original bump circuit augmented by capacitors C1 and C2, and cascode transistors (driven by Vcasc). (b) The adaptation subcircuit. M3 and M4 control injection on the floating-gates and M5 and M6 control tunneling. (b) Measured output current of a bump circuit for three programmed memories. Fig.1(b) shows the circuit that implements learning in the adaptive bump circuit. We implement learning through Fowler-Nordheim tunneling [11] on tunneling junctions M5-M6 and hot electron injection [12] on the floating-gate transistors M3-M4. Transistor M3 and M5 control injection and tunneling on M1’s floating-gate. Transistors M4 and M6 control injection and tunneling on M2’s floating-gate. We activate tunneling and injection by a high Vtun and low Vinj respectively. In the adaptive bump circuit, both processes increase the similarity between Vin and µ. In addition, the magnitude of the update does not depend on the sign of (Vin − µ) because the differential input provides common-mode rejection to the input differential pair. 
The similarity function, as seen in Fig.1(b), has a Gaussian-like shape. Consequently, we can equate the output current of the bump circuit with the probability of the input under a distribution parameterized by mean µ: P (Vin | µ ) = I out (3) In addition, increasing the similarity between Vin and µ is equivalent to increasing P(Vin |µ). Consequently, the adaptive bump circuit adapts to maximize the likelihood of the present input under the circuit’s probability distribution. 3 T h e b u mp mi xtu re mod el We now describe the computations and learning rule implemented by the bump mixture model. A mixture model is a general class of statistical models that approximates the probability of an analog input as the weighted sum of probability of the input under several simple distributions. The bump mixture model comprises a set of Gaussian-like probability density functions, each parameterized by a mean vector, µi. Denoting the j th dimension of the mean of the ith density as µij, we express the probability of an input vector x as: P ( x ) = (1/ N ) i P ( x | i ) = (1/ N ) i (∏ P ( x j j | µij ) ) (4) where N is the number of densities in the model and i denotes the ith density. P(x|i) is the product of one-dimensional densities P(xj|µij) that depend on the j th dimension of the ith mean, µij. We derive each one-dimensional probability distribution from the output current of a single bump circuit. The bump mixture model makes two assumptions: (1) the component densities are equally likely, and (2) within each component density, the input dimensions are independent and have equal variance. Despite these restrictions, this mixture model can, in principle, approximate any probability density function [1]. The bump mixture model adapts all µi to maximize the likelihood of the training data. Learning in the bump mixture model is based on the E-M algorithm, the standard algorithm for training Gaussian mixture models. The E-M algorithm comprises two steps. The E-step computes the conditional probability of each density given the input, P(i|x). The M-step updates the parameters of each distribution to increase the likelihood of the data, using P(i|x) to scale the magnitude of each parameter update. In the online setting, the learning rule is: ∆µij = η P (i | x ) ∂ log P ( x j | µij ) ∂µij =η P( x | i) k P( x | k) ∂ log P ( x j | µij ) ∂µij (5) where η is a learning rate and k denotes component densities. Because the adaptive bump circuit already adapts to increase the likelihood of the present input, we approximate E-M by modulating injection and tunneling in the adaptive bump circuit by the conditional probability: ∆µij = η P ( i | x ) f ( x j − µ ij ) (6) where f() is the parameter update implemented by the bump circuit. We can modulate the learning update in (6) with other competitive factors instead of the conditional probability to implement a variety of learning rules such as online K-means. 4 S i l i con i mp l emen tati on We now describe a VLSI system that implements the silicon mixture model. The high level organization of the system detailed in Fig.2, is similar to VLSI vector quantization systems. The heart of the mixture model is a matrix of adaptive bump circuits where the ith row of bump circuits corresponds to the ith component density. In addition, the periphery of the matrix comprises a set of inhibitory circuits for performing probability estimation, inference, quantization, and generating feedback for learning. We send each dimension of an input x down a single column. 
Unity-gain inverting amplifiers (not pictured) at the boundary of the matrix convert each single-ended voltage input into a differential signal. Each bump circuit computes a current that represents (P(xj|µij))^σ, where σ is the common variance of the one-dimensional densities. The mixture model computes P(x|i) along the ith row, and inhibitory circuits perform inference, estimation, or quantization. We utilize translinear devices [3] to perform all of these computations. Translinear devices, such as the subthreshold MOSFET and the bipolar transistor, exhibit an exponential relationship between gate voltage and source current. This property allows us to establish a power-law relationship between currents and probabilities (i.e., a linear relationship between gate voltages and log-probabilities).

Figure 2. Bump mixture model architecture. The system comprises a matrix of adaptive bump circuits where each row computes the probability P(x|µi). Inhibitory circuits transform the output of each row into system outputs. Spike generators also transform inhibitory circuit outputs into rate-coded feedback for learning.

We compute the multiplication of the probabilities in each row of Fig.2 as addition in the log domain using the circuit in Fig.3(a). This circuit first converts each bump circuit's current into a voltage using a diode (e.g. M1). M2's capacitive divider computes Vavg as the average of the scalar log-probabilities, log P(xj|µij):

Vavg = (σ / N) Σj log P(xj | µij)    (7)

where σ is the variance, N is the number of input dimensions, and voltages are in units of Ut/κ (Ut is the thermal voltage and κ is the transistor-gate coupling coefficient). Transistors M2-M5 mirror Vavg to the gate of M5. We define the drain voltage of M5 as log P(x|i) (up to an additive constant) and compute:

log P(x | i) = ((C1 + C2) / C1) Vavg = ((C1 + C2) σ / (C1 N)) Σj log P(xj | µij) + k    (8)

where k is a constant dependent on Vg (the control-gate voltage on M5), and C1 and C2 are capacitances. From eq.8 we can derive the variance as:

σ = N C1 / (C1 + C2)    (9)

The system computes different output functions and feedback signals for learning by operating on the log probabilities of eq.8. Fig.3(b) shows a circuit that computes P(i|x) for each distribution. The circuit is a k-input differential pair in which the bias transistor M0 normalizes the currents representing the probabilities P(x|i) at the ith leg. Fig.3(c) shows a circuit that computes P(x): the ith transistor exponentiates log P(x|i), and a single wire sums the currents. We can also apply other inhibitory circuits to the log probabilities, such as winner-take-all (WTA) circuits [13] and resistive networks [14]. In our fabricated chip, we implemented probability estimation, conditional probability computation, and WTA. The WTA outputs the index of the most likely component distribution for the present input, and can be used to implement vector quantization and to produce feedback for an online K-means learning rule. At each synapse, the system combines a feedback signal, such as the conditional probability P(i|x) computed at the matrix periphery, with the adaptive bump circuit to implement learning. We trigger adaptation at each bump circuit by a rate-coded spike signal generated from the inhibitory circuit's current outputs. We generate this spike train with a current-to-spike converter based on Lazzaro's low-powered spiking neuron [15].
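Before describing how this rate-coded feedback reaches each synapse in hardware, the following behavioral sketch ties together the periphery computations: the capacitive log-domain averaging of eqs. (7)-(8), the variance of eq. (9), the normalization that yields P(i|x), and the conversion of that conditional probability into a rate-coded spike signal. Capacitance values and the spike-rate scale are illustrative assumptions, not chip parameters.

import numpy as np

rng = np.random.default_rng(1)

# Behavioral model of the matrix periphery (eqs. 7-9) and the rate-coded feedback.
# Capacitances and the spike-rate scale are illustrative assumptions.
C1, C2 = 1.0, 6.0                    # capacitor ratio; sets the variance via eq. (9)
N_rows, N_dims = 8, 7
sigma = N_dims * C1 / (C1 + C2)      # eq. (9)

def row_log_probability(log_p_1d):
    # eq. (7): Vavg is the capacitive average of the per-dimension log-probabilities
    v_avg = (sigma / N_dims) * np.sum(log_p_1d, axis=1)
    # eq. (8): rescaling by (C1 + C2)/C1 recovers log P(x|i), up to an additive constant
    return (C1 + C2) / C1 * v_avg

def conditional_probabilities(log_p):
    # Fig. 3(b): the differential pair normalizes the row currents into P(i|x)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

def rate_coded_feedback(p_i_given_x, rate_scale=20):
    # The spike count delivered to row i is roughly proportional to P(i|x),
    # so the adaptation strength at that row is too.
    return rng.poisson(rate_scale * p_i_given_x)

log_p_1d = -np.abs(rng.normal(size=(N_rows, N_dims)))   # stand-in per-bump log-probabilities
log_p = row_log_probability(log_p_1d)
spikes = rate_coded_feedback(conditional_probabilities(log_p))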
This rate-coded signal toggles Vtun and Vinj at each bump circuit. Consequently, adaptation is proportional to the frequency of the spike train, which is in turn a linear function of the inhibitory feedback signal. The alternative to the rate code would be to transform the inhibitory circuit's output directly into analog Vtun and Vinj signals. Because injection and tunneling are highly nonlinear functions of Vinj and Vtun respectively, implementing updates that are linear in the inhibitory feedback signal is quite difficult with this approach.

Figure 3. (a) Circuit for computing log P(x|i). (b) Circuit for computing P(i|x); the current through the ith leg represents P(i|x). (c) Circuit for computing P(x).

5 Experimental results and conclusions

We fabricated an 8 x 8 mixture model (8 probability distribution functions with 8 dimensions each) in a TSMC 0.35µm CMOS process available through MOSIS, and tested the chip on synthetic data and a handwritten-digits dataset. In our tests, we found that due to a design error, one of the input dimensions coupled to the other inputs. Consequently, we held that input fixed throughout the tests, effectively reducing the input to 7 dimensions. In addition, we found that the learning rule in eq.6 produced poor performance because the variance of the bump distributions was too large. Consequently, in our learning experiments, we used the hard winner-take-all circuit to control adaptation, resulting in a K-means learning rule.

We trained the chip to perform different tasks on handwritten digits from the MNIST dataset [16]. To prepare the data, we first performed PCA to reduce the 784-pixel images to seven-dimensional vectors, and then sent the data on-chip. We first tested the circuit on clustering handwritten digits, training the chip on 1000 examples of each of the digits 1-8. Fig.4(a) shows reconstructions of the eight means before and after training; we compute each reconstruction by multiplying the means by the seven principal eigenvectors of the dataset. The data show that the means diverge to associate with different digits. The chip learns to associate most digits with a single probability distribution. The lone exception is digit 5, which does not clearly associate with one distribution. We speculate that the reason is that 3's, 5's, and 8's are very similar in our training data's seven-dimensional representation. Gaussian mixture models trained with the E-M algorithm demonstrate similar results, recovering only seven of the eight digits.

We next evaluated the same learned means on vector quantization of a set of test digits (4400 examples of each digit). We compare the chip's learned means with means learned by the batch E-M algorithm on mixtures of Gaussians (with σ=0.01), a mismatch E-M algorithm that models chip nonidealities, and a non-adaptive baseline quantizer. The purpose of the mismatch E-M algorithm was to assess the effect of nonuniform injection and tunneling strengths in floating-gate transistors. Because tunneling and injection magnitudes can vary by a large amount across different floating-gate transistors, the adaptive bump circuits can learn a mean that is somewhat off-center. We measured the offset of each bump circuit when adapting to a constant input and constructed the mismatch E-M algorithm by altering the learned means during the M-step by the measured offset.
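A sketch of how such per-circuit offsets could be folded into a software E-M loop to mimic the mismatch E-M comparison; the offsets, data, and variance below are synthetic placeholders, since the measured values are not reproduced here.

import numpy as np

rng = np.random.default_rng(2)
N, D = 8, 7
X = rng.normal(size=(1000, D))                 # synthetic stand-in for the PCA-reduced digits
means = X[rng.choice(len(X), size=N, replace=False)].copy()
offsets = rng.normal(scale=0.05, size=(N, D))  # placeholder for the measured per-circuit offsets
sigma = 0.01                                   # treated here as the common variance (an assumption)

for _ in range(20):
    # E-step: responsibilities under equally weighted, equal-variance Gaussians
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    r = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2 * sigma))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update the means, then perturb them by the (placeholder) measured offsets
    means = (r.T @ X) / (r.sum(axis=0)[:, None] + 1e-12)
    means += offsets

Comparing the quantization error of means trained with and without the offset perturbation isolates the contribution of device mismatch, which is the comparison reported in Fig.4(b).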
We constructed the baseline quantizer by selecting, at random, an example of each digit for the quantizer codebook. For each quantizer, we computed the reconstruction error on the digit's seven-dimensional representation when we represent each test digit by the closest mean.

Figure 4. (a) Reconstruction of chip means before and after training with handwritten digits. (b) Comparison of average quantization error on unseen handwritten digits, for the chip's learned means and mixture models trained by standard algorithms. (c) Plot of probability of unseen examples of 7's and 9's under two bump mixture models trained solely on each digit.

The results in Fig.4(b) show that for most of the digits the chip's learned means perform as well as the E-M algorithm, and better than the baseline quantizer in all cases. The one digit where the chip's performance is far from the E-M algorithm is the digit "1". Upon examination of the E-M algorithm's results, we found that it associated two means with the digit "1", whereas the chip allocated two means to the digit "3". Over all the digits, the E-M algorithm exhibited a quantization error of 9.98, the mismatch E-M algorithm an error of 10.9, the chip an error of 11.6, and the baseline quantizer an error of 15.97. The data show that mismatch is a significant factor in the difference between the bump mixture model's performance and the E-M algorithm's performance on quantization tasks.

Finally, we use the mixture model to classify handwritten digits. If we train a separate mixture model for each class of data, we can classify an input by comparing the probabilities of the input under each model. In our experiment, we train two separate mixture models: one on examples of the digit 7, and the other on examples of the digit 9. We then apply both mixtures to a set of unseen examples of digits 7 and 9, and record the probability score of each unseen example under each mixture model. We plot the resulting data in Fig.4(c), where each axis represents the probability under a different class. The data show that the model probabilities provide a good metric for classification: assigning each test example to the class model that outputs the highest probability results in an accuracy of 87% on 2000 unseen digits (a software sketch of this decision rule follows at the end of this section). Additional software experiments show that mixtures of Gaussians (σ=0.01) trained by the batch E-M algorithm achieve an accuracy of 92.39% on this task.

Our test results show that the bump mixture model's performance on several learning tasks is comparable to that of standard mixtures of Gaussians trained by E-M. These experiments give further evidence that floating-gate circuits can be used to build effective learning systems even though their learning rules derive from silicon physics instead of statistical methods. The bump mixture model also represents a basic building block that we can use to build more complex silicon probability models over analog variables. This work can be extended in several ways: we can build distributions that have parameterized covariances in addition to means, and we can build more complex, adaptive probability distributions in silicon by combining the bump mixture model with silicon probability models over discrete variables [5-7] and spike-based floating-gate learning circuits [4].
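As referenced above, the decision rule in the 7-versus-9 experiment reduces to comparing the probability of a test vector under two per-class mixtures. The sketch below uses generic placeholder means rather than the chip's learned parameters, and a Gaussian stand-in for the bump density.

import numpy as np

def log_likelihood(x, means, sigma):
    # log of eq. (4) for one class model: equally weighted, equal-variance components
    d2 = ((x - means) ** 2).sum(axis=1)
    log_p_xi = -d2 / (2 * sigma)              # per-component log P(x|i), up to a shared constant
    m = log_p_xi.max()
    return m + np.log(np.mean(np.exp(log_p_xi - m)))

def classify_7_vs_9(x, means_7, means_9, sigma=0.01):
    # assign x to whichever class mixture gives it the higher probability
    return 7 if log_likelihood(x, means_7, sigma) > log_likelihood(x, means_9, sigma) else 9

rng = np.random.default_rng(3)
means_7 = rng.normal(size=(8, 7))   # placeholder means for the mixture trained on 7's
means_9 = rng.normal(size=(8, 7))   # placeholder means for the mixture trained on 9's
x = rng.normal(size=7)              # placeholder seven-dimensional test vector
print(classify_7_vs_9(x, means_7, means_9))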
Acknowledgments

This work was supported by NSF under grants BES 9720353 and ECS 9733425, and Packard Foundation and Sloan Fellowships.

References

[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, UK: Clarendon Press, 1995.
[2] L. R. Rabiner,
3 0.60700023 191 nips-2002-String Kernels, Fisher Kernels and Finite State Automata
Author: Craig Saunders, Alexei Vinokourov, John S. Shawe-taylor
Abstract: In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features. In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline.
4 0.52255046 110 nips-2002-Incremental Gaussian Processes
Author: Joaquin Quiñonero-candela, Ole Winther
Abstract: In this paper, we consider Tipping’s relevance vector machine (RVM) [1] and formalize an incremental training strategy as a variant of the expectation-maximization (EM) algorithm that we call Subspace EM (SSEM). Working with a subset of active basis functions, the sparsity of the RVM solution will ensure that the number of basis functions and thereby the computational complexity is kept low. We also introduce a mean field approach to the intractable classification model that is expected to give a very good approximation to exact Bayesian inference and contains the Laplace approximation as a special case. We test the algorithms on two large data sets with O(10^3 - 10^4) examples. The results indicate that Bayesian learning of large data sets, e.g. the MNIST database, is realistic.
5 0.37426278 85 nips-2002-Fast Kernels for String and Tree Matching
Author: Alex J. Smola, S.v.n. Vishwanathan
Abstract: In this paper we present a new algorithm suitable for matching discrete objects such as strings and trees in linear time, thus obviating dynamic programming with quadratic time complexity. Furthermore, prediction cost in many cases can be reduced to linear cost in the length of the sequence to be classified, regardless of the number of support vectors. This improvement on the currently available algorithms makes string kernels a viable alternative for the practitioner.
6 0.36605629 119 nips-2002-Kernel Dependency Estimation
7 0.36068943 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
8 0.35881063 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
9 0.3585825 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
10 0.35836551 145 nips-2002-Mismatch String Kernels for SVM Protein Classification
11 0.35787749 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
12 0.35730189 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
13 0.3571634 3 nips-2002-A Convergent Form of Approximate Policy Iteration
14 0.35678932 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
15 0.35639024 74 nips-2002-Dynamic Structure Super-Resolution
16 0.35627523 10 nips-2002-A Model for Learning Variance Components of Natural Images
17 0.35561195 89 nips-2002-Feature Selection by Maximum Marginal Diversity
18 0.35520375 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals
19 0.35485056 124 nips-2002-Learning Graphical Models with Mercer Kernels
20 0.35409445 135 nips-2002-Learning with Multiple Labels