nips nips2000 nips2000-14 knowledge-graph by maker-knowledge-mining

14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks


Source: pdf

Author: Chiranjib Bhattacharyya, S. Sathiya Keerthi

Abstract: A variational derivation of Plefka's mean-field theory is presented. This theory is then applied to sigmoidal belief networks with the aid of further approximations. Empirical evaluation on small-scale networks shows that the proposed approximations are quite competitive.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A variational mean-field theory for sigmoidal belief networks. [sent-1, score-0.607]

2 Abstract: A variational derivation of Plefka's mean-field theory is presented. [sent-10, score-0.396]

3 This theory is then applied to sigmoidal belief networks with the aid of further approximations. [sent-11, score-0.408]

4 Empirical evaluation on small-scale networks shows that the proposed approximations are quite competitive. [sent-12, score-0.213]

5 1 Introduction: Application of mean-field theory to solve the problem of inference in Belief Networks (BNs) is well known [1]. [sent-13, score-0.118]

6 In this paper we will discuss a variational mean-field theory and its application to BNs, sigmoidal BNs in particular. [sent-14, score-0.506]

7 We present a variational derivation of the mean-field theory proposed by Plefka [2]. [sent-15, score-0.278]

8 The theory will be developed for a stochastic system consisting of N binary random variables, S_i ∈ {0, 1}, described by an energy function E(S) and the following Boltzmann-Gibbs distribution at a temperature T: P(S) = e^{-E(S)/T} / Z, with Z = Σ_S e^{-E(S)/T}. [sent-16, score-0.305]
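
As a concrete illustration of this distribution, the following sketch (not code from the paper; the quadratic toy energy and all names are assumptions) computes Z and P(S) by brute-force enumeration for a small N:

```python
import itertools
import numpy as np

def boltzmann_gibbs(energy, N, T=1.0):
    """Enumerate all 2^N states S in {0,1}^N and return (states, P, Z) for
    P(S) = exp(-E(S)/T) / Z with Z = sum_S exp(-E(S)/T)."""
    states = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
    E = np.array([energy(s) for s in states])
    w = np.exp(-E / T)          # unnormalised Boltzmann weights
    Z = w.sum()
    return states, w / Z, Z

# toy example: a random quadratic (Boltzmann-machine-like) energy on N = 4 units
rng = np.random.default_rng(0)
W = np.triu(rng.normal(size=(4, 4)), k=1)
states, P, Z = boltzmann_gibbs(lambda s: -s @ W @ s, N=4)
print(Z, P.sum())               # P sums to 1
```

Brute-force enumeration is only feasible for small N; the mean-field machinery developed below is what replaces it for larger systems.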

9 The application of this mean-field method to Boltzmann Machines (BMs) has already been carried out [3]. [sent-17, score-0.056]

10 A large class of BNs is described by the following energy function: E(S) = -Σ_{i=1}^{N} {S_i ln f(M_i) + (1 - S_i) ln(1 - f(M_i))}, where M_i = Σ_{j=1}^{i-1} w_{ij} S_j + h_i. The application of the mean-field theory to such energy functions is not straightforward and further approximations are needed. [sent-18, score-0.517]

11 We propose a new approximation scheme and discuss its utility for sigmoid networks, which are obtained by substituting the logistic sigmoid f(x) = 1/(1 + e^{-x}) in the above energy function. [sent-19, score-0.644]
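
A minimal sketch of this energy function, assuming the logistic-sigmoid convention above and a strictly lower-triangular weight matrix W so that M_i depends only on S_j with j < i (function and variable names are illustrative, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bn_energy(S, W, h):
    """E(S) = -sum_i [S_i ln f(M_i) + (1 - S_i) ln(1 - f(M_i))],
    where M_i = sum_{j<i} W[i, j] S_j + h[i] and W is strictly lower triangular."""
    M = W @ S + h                       # lower-triangular W => M_i uses only j < i
    f = sigmoid(M)
    return -np.sum(S * np.log(f) + (1 - S) * np.log(1 - f))

# Note: exp(-bn_energy(S, W, h)) is exactly the joint probability
# prod_i f(M_i)^{S_i} (1 - f(M_i))^{1 - S_i}, so the unclamped BN is normalised
# (Z = 1); the partition function becomes nontrivial once units are clamped.
```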

12 In section 2 we present a variational derivation of Plefka's mean-field theory. [sent-21, score-0.278]

13 In section 3 the theory is extended to sigmoidal belief networks. [sent-22, score-0.329]

14 2 A variational mean-field theory: Plefka [2] proposed a mean-field theory in the context of spin glasses. [sent-25, score-0.236]

15 This theory can, in principle, yield an arbitrarily close approximation to log Z. [sent-26, score-0.242]

16 In this section we present an alternate derivation from a variational viewpoint; see also [4], [5]. [sent-27, score-0.278]

17 Let us define a γ-dependent partition function Z_γ and distribution P_γ (1); note that Z_γ = Z and P_γ = P at γ = 1. [sent-29, score-0.122]

18 Introducing an external real vector θ, let us rewrite Z_γ as in (2). [sent-30, score-0.134]

19 Here Z̃_γ is the partition function associated with the distribution P̃_γ = e^{-γE(S)/T + Σ_i θ_i S_i} / Z̃_γ. [sent-35, score-0.041]

20 By Jensen's inequality, Z_γ ≥ Z̃_γ e^{-Σ_i θ_i u_i} (4), where u_i = ⟨S_i⟩_{P̃_γ} (5). Taking logarithms on both sides of (4) we obtain log Z_γ ≥ log Z̃_γ - Σ_i θ_i u_i (6). The right-hand side is defined as a function of u and γ via the following assumption. [sent-44, score-0.136]

21 Invertibility assumption: for each fixed u and γ, (5) can be solved for θ. If the invertibility assumption holds then we can use u as the independent vector (with θ dependent on u) and rewrite (6) as (7), where G is defined as G(u, γ) = -ln Z̃_γ + Σ_i θ_i u_i. [sent-45, score-0.134]

22 This then gives a variational feel: treat u as a free variable vector and choose it to minimize G for a fixed γ. [sent-46, score-0.247]

23 The stationarity conditions of the above minimization problem yield θ_i = ∂G/∂u_i = 0. [sent-47, score-0.058]

24 At the minimum point we have the equality G = -log Z_γ. [sent-48, score-0.034]

25 It is difficult to invert (5) for γ ≠ 0, thus making it impossible to write an algebraic expression for G for any nonzero γ. [sent-49, score-0.033]

26 At γ = 0 the inversion is straightforward and one obtains G(u, 0) = Σ_{i=1}^{N} (u_i ln u_i + (1 - u_i) ln(1 - u_i)), with P̃_0 = Π_i u_i^{S_i} (1 - u_i)^{1 - S_i}. [sent-50, score-0.045]
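
G(u, 0) is simply the negative entropy of the factorized Bernoulli distribution with means u; a small sketch (illustrative names, with a numerical guard that is an assumption) is:

```python
import numpy as np

def G0(u, eps=1e-12):
    """Zeroth-order term: sum_i [u_i ln u_i + (1 - u_i) ln(1 - u_i)],
    i.e. minus the entropy of a factorized Bernoulli distribution with means u."""
    u = np.clip(u, eps, 1.0 - eps)   # guard the logs at u_i equal to 0 or 1
    return np.sum(u * np.log(u) + (1 - u) * np.log(1 - u))
```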

27 A Taylor series approach is then undertaken around γ = 0 to build an approximation to G. [sent-51, score-0.094]

28 Define G_M(u, γ) = G(u, 0) + Σ_{k=1}^{M} (γ^k / k!) ∂^k G/∂γ^k |_{γ=0} (8); then G_M can be considered as an approximation of G. [sent-52, score-0.18]

29 The stationarity conditions are enforced by setting θ_i = ∂G/∂u_i ≈ ∂G_M/∂u_i. In this paper we will restrict ourselves to M = 2 and the following derivatives. [sent-53, score-0.19]

30 The expression for M = 2 can be identified with the TAP correction. [sent-56, score-0.033]

31 The term (10) yields the TAP term for the BM energy function. [sent-57, score-0.1]

32 3 Mean-field approximations for BNs: The method, as developed in the previous section, is not directly useful for BNs because of the intractability of the partial derivatives at γ = 0. [sent-58, score-0.191]

33 To overcome this problem, we suggest an approximation based on a Taylor series expansion. [sent-59, score-0.147]

34 Though in this paper we restrict ourselves to the sigmoid activation function, this method is applicable to other activation functions as well. [sent-60, score-0.311]

35 This method enables calculation of all the necessary terms required for extending Plefka's method to BNs. [sent-61, score-0.037]

36 Since, for BN operation, T is fixed to 1, it will be dropped from all equations in the rest of the paper. [sent-62, score-0.082]

37 Let us define a new energy function E(β, S, u, w) = -Σ_{i=1}^{N} {S_i ln f(M_i(β)) + (1 - S_i) ln(1 - f(M_i(β)))} (11), where 0 ≤ β ≤ 1, M_i(β) = Σ_{j=1}^{i-1} w_{ij} β(S_j - u_j) + M̄_i, M̄_i = Σ_{j=1}^{i-1} w_{ij} u_j + h_i, and u_k = ⟨S_k⟩_{P̃_{γβ}} for all k, with P̃_{γβ} ∝ e^{-γE + Σ_i θ_i S_i}. [sent-63, score-0.219]
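
A hedged sketch of this interpolated energy, under the same assumptions as the earlier energy-function sketch (logistic sigmoid, strictly lower-triangular W): at β = 0 every field collapses to the mean-field value M̄_i, and at β = 1 the original energy E(S) is recovered.

```python
import numpy as np

def bn_energy_beta(beta, S, u, W, h):
    """E(beta, S, u, w) from (11): M_i(beta) = beta * sum_{j<i} W[i, j] (S_j - u_j) + Mbar_i,
    with Mbar_i = sum_{j<i} W[i, j] u_j + h_i; W is strictly lower triangular."""
    Mbar = W @ u + h
    M_beta = beta * (W @ (S - u)) + Mbar
    f = 1.0 / (1.0 + np.exp(-M_beta))
    return -np.sum(S * np.log(f) + (1 - S) * np.log(1 - f))

# Sanity checks (bn_energy is the earlier sketch, assumed in scope):
#   bn_energy_beta(1.0, S, u, W, h) equals bn_energy(S, W, h)   (beta = 1)
#   bn_energy_beta(0.0, S, u, W, h) freezes every field at its mean-field value Mbar_i
```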

38 We use a Taylor series approximation of E(β) with respect to β. [sent-70, score-0.147]

39 Let us define Ẽ_C(β) = E(0) + Σ_{k=1}^{C} (β^k / k!) ∂^k E/∂β^k |_{β=0} (13). If Ẽ_C approximates E, then we can write E = E(1) ≈ Ẽ_C(1). [sent-71, score-0.081]
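
The idea behind (13) can be checked numerically on any smooth scalar function of β: build a truncated Taylor polynomial around β = 0 and evaluate it at β = 1, mimicking E(1) ≈ Ẽ_C(1). The finite-difference step and the toy function below are assumptions, not values from the paper.

```python
import numpy as np

def taylor2_at_one(g, d=1e-3):
    """Order-2 Taylor estimate of g(1) from derivatives of g at 0:
    g(0) + g'(0) + g''(0)/2, with central finite differences of step d."""
    g0 = g(0.0)
    g1 = (g(d) - g(-d)) / (2.0 * d)            # first derivative at 0
    g2 = (g(d) - 2.0 * g0 + g(-d)) / d ** 2    # second derivative at 0
    return g0 + g1 + 0.5 * g2

g = lambda b: np.log(1.0 + np.exp(0.3 * b + 0.1))   # a smooth toy g(beta)
print(taylor2_at_one(g), g(1.0))                     # close, but not equal
```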

40 In view of (14) one can consider A_C as an approximation to A. [sent-73, score-0.09]

41 We define (18). Figure 1: Three-layer BN (2 × 4 × 6) with top-down propagation of beliefs. [sent-76, score-0.043]

42 In light of the above discussion one can consider G_{MC}, which for a general C is obtained by replacing E with Ẽ_C; hence the mean-field equations can be stated as θ_i = ∂G/∂u_i ≈ ∂G_{MC}/∂u_i = 0 (19). In this paper we will restrict ourselves to M = 2. [sent-78, score-0.087]
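
The paper solves the stationarity conditions (19) as fixed-point equations for u. As a rough stand-in, the sketch below minimizes a naive first-order mean-field free energy, G_1(u) = Σ_i [u_i ln u_i + (1 - u_i) ln(1 - u_i)] plus the energy evaluated at the means u, with a generic optimizer; this is only in the spirit of the lowest-order scheme and does not reproduce the paper's second-order G_22 terms (all names are illustrative assumptions).

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def naive_mf_free_energy(u, W, h, eps=1e-9):
    """G_1(u): negative entropy of the factorized distribution plus the
    sigmoid-BN energy evaluated at the means u (a 'naive' first-order term)."""
    u = np.clip(u, eps, 1.0 - eps)
    f = sigmoid(W @ u + h)
    neg_entropy = np.sum(u * np.log(u) + (1 - u) * np.log(1 - u))
    mean_energy = -np.sum(u * np.log(f) + (1 - u) * np.log(1 - f))
    return neg_entropy + mean_energy

def solve_means(W, h):
    """Minimize G_1 over the means u in (0, 1)^N; the minimum value is a
    crude estimate of -ln Z."""
    n = len(h)
    res = minimize(naive_mf_free_energy, x0=np.full(n, 0.5), args=(W, h),
                   bounds=[(1e-6, 1.0 - 1e-6)] * n, method="L-BFGS-B")
    return res.x, res.fun
```

Clamping observed units, as in the experiments below, would amount to fixing the corresponding entries of u and optimizing only over the remaining means.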

43 4 Experimental results: To test the approximation schemes developed in the previous sections, numerical experiments were conducted. [sent-80, score-0.308]

44 Small networks were chosen so that ln Z can be computed by exact enumeration for evaluation purposes. [sent-85, score-0.148]

45 This choice of the network enables us to compare the results with those of [1]. [sent-87, score-0.075]

46 To compare the performance of our methods with theirs, we repeated the experiment they conducted for sigmoid BNs. [sent-88, score-0.212]

47 Ten thousand networks were generated by randomly choosing weight values in [-1,1]. [sent-89, score-0.079]

48 The bottom-layer units, i.e. the visible units, of each network were instantiated to zero. [sent-90, score-0.037]

49 The likelihood, ln Z, was computed by exact enumeration of all the states in the higher two layers. [sent-91, score-0.074]

50 The approximate value of -ln Z was given by G_{MC}; u was computed by solving the fixed-point equations obtained from (19). [sent-92, score-0.048]

51 The goodness of the approximation scheme was tested by the following measure: ε = -G_{MC}/ln Z - 1 (22). For a proper comparison we also implemented the SJJ method. [sent-93, score-0.341]
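
Putting the pieces of this protocol together, a hedged sketch of the evaluation loop might look as follows: clamp the visible (bottom-layer) units to zero, compute ln Z by exact enumeration over the hidden configurations, and score an approximate free energy G_approx with ε = -G_approx/ln Z - 1 from (22). The unit ordering (hidden units first) and helper names are assumptions, not the paper's code.

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_prob(S, W, h):
    """ln P(S) for a sigmoid BN with strictly lower-triangular W."""
    f = sigmoid(W @ S + h)
    return np.sum(S * np.log(f) + (1 - S) * np.log(1 - f))

def exact_lnZ(W, h, n_hidden):
    """ln sum_{hidden} P(hidden, visible = 0).  The first n_hidden variables are
    the (top-layer) hidden units; the remaining ones are clamped to 0."""
    n = len(h)
    logps = []
    for hid in itertools.product([0, 1], repeat=n_hidden):
        S = np.concatenate([np.array(hid, dtype=float), np.zeros(n - n_hidden)])
        logps.append(log_prob(S, W, h))
    logps = np.array(logps)
    m = logps.max()
    return m + np.log(np.exp(logps - m).sum())     # log-sum-exp

def epsilon(G_approx, lnZ):
    """Goodness measure (22): zero when -G_approx equals the exact ln Z."""
    return -G_approx / lnZ - 1.0
```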

52 The goodness of approximation for the SJJ scheme is evaluated by substituting G_{MC} in (22) with L_approx; for the specific formula see [1]. [sent-94, score-0.389]

53 The results are presented in the form of histograms in Figure 2. [sent-95, score-0.131]

54 We also repeated the experiment with weights and biases taking values between -5 and 5; the results are again presented in the form of histograms in Figure 3. [sent-96, score-0.141] [sent-105, score-0.185]

55 Table 1: Mean of ε for randomly generated sigmoid networks, in different weight ranges (columns G_11, G_12, G_22, SJJ (ε); rows for the different weight ranges, e.g. small weights [-1, 1]; the numerical entries are not recoverable). [sent-104, score-0.173]

57 The findings are summarized in the form of means tabulated in Table 1. [sent-106, score-0.037]

58 For small weights G_12 and the SJJ approach show close results, which was expected. [sent-107, score-0.051]

59 But the improvement achieved by the G_22 scheme is remarkable; it gave a mean value of 0.0029, which compares substantially well against the mean value of 0. [sent-108, score-0.278] [sent-109, score-0.082]

61 The improvement in [6] was achieved by using a mixture distribution, which requires the introduction of extra variational variables; more than 100 extra variational variables are needed for a 5-component mixture. [sent-111, score-0.57]

62 On the other hand, the extra computational cost of G_22 over G_12 is marginal. [sent-113, score-0.086]

63 This makes the G_22 scheme computationally attractive over the mixture distribution. [sent-114, score-0.196]

64 " , \ 0 '" Figure 2: Histograms for GlO and SJJ scheme for weights taking values in [-1,1], for sigmoid networks. [sent-115, score-0.474]

65 The plot on the left shows histograms of ε for the schemes G_11 and G_12; they did not have any overlaps. G_11 gives a mean of -0. [sent-116, score-0.44]

66 The middle plot shows the histogram for the SJJ scheme; the mean is given by 0. [sent-119, score-0.172]

67 The plot at the extreme right is for the scheme G_22, having a mean of 0.0029. [sent-121, score-0.408]

68 Of the three schemes, G_12 is the most robust and also yields reasonably accurate results. [sent-122, score-0.179]

69 It is outperformed only by G_22 in the case of sigmoid networks with low weights. [sent-123, score-0.252]

70 Empirical evidence thus suggests that the choice of a scheme is not straightforward and depends on the activation function and also on the parameter values. [sent-124, score-0.31]

71 Figure 3: Histograms for the G_{MC} and SJJ schemes for weights taking values in [-5, 5], for sigmoid networks. [sent-125, score-0.457]

72 The leftmost histogram shows ε for the G_11 scheme, having a mean of -0.0440. [sent-126, score-0.354]

73 Second from left is the G_12 scheme, having a mean of 0.0231. [sent-127, score-0.278]

74 Second from right is the SJJ scheme, having a mean of 0. [sent-128, score-0.116]

75 The scheme G_22 is at the extreme right, with mean -0. [sent-130, score-0.36]

76 5 Discussion: Application of Plefka's theory to BNs is not straightforward. [sent-132, score-0.118]

77 We presented a scheme in which the BN energy function is approximated by a Taylor series, which gives a tractable approximation to the terms required for Plefka's method. [sent-134, score-0.386]

78 Various approximation schemes depending on the degree of the Taylor series expansion are derived. [sent-135, score-0.36]

79 Unlike the approach in [1], the schemes discussed here are simpler as they do not introduce extra variational variables. [sent-136, score-0.464]

80 Empirical evaluation on small scale networks shows that the quality of approximations is quite good. [sent-137, score-0.213]

81 Saul, L.K., Jaakkola, T., Jordan, M.I. (1996), Mean field theory for sigmoid belief networks, Journal of Artificial Intelligence Research, 4. [2] Plefka, T. (1982), Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model, J. Phys. A, 15. [sent-144, score-0.406]

82 Kappen, H.J., Rodriguez, F.B. (1998), Boltzmann machine learning using mean field theory and linear response correction, Advances in Neural Information Processing Systems 10 (eds. Jordan, M.I., Kearns, M.J., Solla, S.A.). [sent-151, score-0.2]

83 Georges, A., Yedidia, J.S. (1991), How to expand around mean-field theory using high-temperature expansions, J. Phys. A. [sent-161, score-0.166]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('plefka', 0.302), ('sjj', 0.296), ('ae', 0.217), ('bns', 0.201), ('variational', 0.199), ('scheme', 0.196), ('ui', 0.191), ('schemes', 0.179), ('sigmoid', 0.173), ('jisi', 0.172), ('mi', 0.153), ('il', 0.146), ('taylor', 0.139), ('histograms', 0.131), ('aui', 0.129), ('gm', 0.129), ('keerthi', 0.129), ('bn', 0.125), ('theory', 0.118), ('belief', 0.115), ('bhattacharyya', 0.111), ('energy', 0.1), ('sigmoidal', 0.096), ('gu', 0.093), ('mc', 0.093), ('approximation', 0.09), ('extra', 0.086), ('invertibility', 0.086), ('jis', 0.086), ('si', 0.082), ('mean', 0.082), ('networks', 0.079), ('derivation', 0.079), ('ec', 0.075), ('evaluation', 0.074), ('enumeration', 0.074), ('tap', 0.072), ('sj', 0.07), ('activation', 0.069), ('jordan', 0.064), ('approximations', 0.06), ('stationarity', 0.058), ('series', 0.057), ('application', 0.056), ('boltzmann', 0.056), ('derivatives', 0.056), ('goodness', 0.055), ('taking', 0.054), ('weights', 0.051), ('ok', 0.05), ('zl', 0.048), ('substituting', 0.048), ('external', 0.048), ('extreme', 0.048), ('rewrite', 0.048), ('temperature', 0.048), ('plot', 0.048), ('equations', 0.048), ('ee', 0.046), ('ji', 0.046), ('straightforward', 0.045), ('saul', 0.045), ('define', 0.043), ('histogram', 0.042), ('empirical', 0.042), ('partition', 0.041), ('replacing', 0.039), ('kearns', 0.039), ('repeated', 0.039), ('restrict', 0.039), ('developed', 0.039), ('hi', 0.038), ('jaakkola', 0.038), ('us', 0.038), ('discuss', 0.037), ('enables', 0.037), ('bm', 0.037), ('enforced', 0.037), ('georges', 0.037), ('instantiated', 0.037), ('inverting', 0.037), ('pl', 0.037), ('tabulated', 0.037), ('undertaken', 0.037), ('partial', 0.036), ('log', 0.034), ('right', 0.034), ('expansion', 0.034), ('singapore', 0.034), ('biggest', 0.034), ('dropped', 0.034), ('feel', 0.034), ('kg', 0.034), ('leftmost', 0.034), ('logarithms', 0.034), ('overlaps', 0.034), ('production', 0.034), ('yedidia', 0.034), ('expression', 0.033), ('solla', 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks

Author: Chiranjib Bhattacharyya, S. Sathiya Keerthi

Abstract: A variational derivation of Plefka's mean-field theory is presented. This theory is then applied to sigmoidal belief networks with the aid of further approximations. Empirical evaluation on small-scale networks shows that the proposed approximations are quite competitive.

2 0.21670157 114 nips-2000-Second Order Approximations for Probability Models

Author: Hilbert J. Kappen, Wim Wiegerinck

Abstract: In this paper, we derive a second order mean field theory for directed graphical probability models. By using an information theoretic argument it is shown how this can be done in the absence of a partition function. This method is a direct generalisation of the well-known TAP approximation for Boltzmann Machines. In a numerical example, it is shown that the method greatly improves the first order mean field approximation. For a restricted class of graphical models, so-called single overlap graphs, the second order method has comparable complexity to the first order method. For sigmoid belief networks, the method is shown to be particularly fast and effective.

3 0.20783506 13 nips-2000-A Tighter Bound for Graphical Models

Author: Martijn A. R. Leisink, Hilbert J. Kappen

Abstract: We present a method to bound the partition function of a Boltzmann machine neural network with any odd order polynomial. This is a direct extension of the mean field bound, which is first order. We show that the third order bound is strictly better than mean field. Additionally we show the rough outline how this bound is applicable to sigmoid belief networks. Numerical experiments indicate that an error reduction of a factor two is easily reached in the region where expansion based approximations are useful. 1

4 0.16807526 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1

5 0.10785847 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

Author: Oliver B. Downs

Abstract: Recent work has exploited boundedness of data in the unsupervised learning of new types of generative model. For nonnegative data it was recently shown that the maximum-entropy generative model is a Nonnegative Boltzmann Distribution not a Gaussian distribution, when the model is constrained to match the first and second order statistics of the data. Learning for practical sized problems is made difficult by the need to compute expectations under the model distribution. The computational cost of Markov chain Monte Carlo methods and low fidelity of naive mean field techniques has led to increasing interest in advanced mean field theories and variational methods. Here I present a secondorder mean-field approximation for the Nonnegative Boltzmann Machine model, obtained using a

6 0.10115688 115 nips-2000-Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks

7 0.088187985 126 nips-2000-Stagewise Processing in Error-correcting Codes and Image Restoration

8 0.080242999 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

9 0.079760797 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

10 0.077294946 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

11 0.074566945 51 nips-2000-Factored Semi-Tied Covariance Matrices

12 0.072403945 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks

13 0.069698729 15 nips-2000-Accumulator Networks: Suitors of Local Probability Propagation

14 0.060541268 62 nips-2000-Generalized Belief Propagation

15 0.060270302 94 nips-2000-On Reversing Jensen's Inequality

16 0.059145357 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

17 0.058586963 108 nips-2000-Recognizing Hand-written Digits Using Hierarchical Products of Experts

18 0.058025423 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

19 0.05794818 47 nips-2000-Error-correcting Codes on a Bethe-like Lattice

20 0.057906758 18 nips-2000-Active Support Vector Machine Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.213), (1, -0.014), (2, 0.125), (3, -0.101), (4, 0.291), (5, -0.058), (6, -0.03), (7, 0.003), (8, 0.045), (9, 0.081), (10, -0.01), (11, -0.093), (12, -0.268), (13, 0.081), (14, -0.05), (15, -0.104), (16, 0.152), (17, -0.146), (18, -0.004), (19, 0.048), (20, 0.043), (21, -0.133), (22, 0.019), (23, 0.009), (24, -0.073), (25, 0.012), (26, -0.059), (27, 0.038), (28, 0.074), (29, 0.023), (30, 0.051), (31, 0.039), (32, -0.034), (33, 0.031), (34, 0.065), (35, 0.032), (36, 0.056), (37, -0.049), (38, 0.052), (39, 0.079), (40, 0.099), (41, 0.087), (42, -0.014), (43, -0.014), (44, -0.077), (45, -0.058), (46, -0.015), (47, -0.003), (48, 0.055), (49, -0.048)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97505093 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks

Author: Chiranjib Bhattacharyya, S. Sathiya Keerthi

Abstract: A variational derivation of Plefka's mean-field theory is presented. This theory is then applied to sigmoidal belief networks with the aid of further approximations. Empirical evaluation on small scale networks show that the proposed approximations are quite competitive. 1

2 0.79538655 13 nips-2000-A Tighter Bound for Graphical Models

Author: Martijn A. R. Leisink, Hilbert J. Kappen

Abstract: We present a method to bound the partition function of a Boltzmann machine neural network with any odd order polynomial. This is a direct extension of the mean field bound, which is first order. We show that the third order bound is strictly better than mean field. Additionally we show the rough outline how this bound is applicable to sigmoid belief networks. Numerical experiments indicate that an error reduction of a factor two is easily reached in the region where expansion based approximations are useful. 1

3 0.78747606 114 nips-2000-Second Order Approximations for Probability Models

Author: Hilbert J. Kappen, Wim Wiegerinck

Abstract: In this paper, we derive a second order mean field theory for directed graphical probability models. By using an information theoretic argument it is shown how this can be done in the absense of a partition function. This method is a direct generalisation of the well-known TAP approximation for Boltzmann Machines. In a numerical example, it is shown that the method greatly improves the first order mean field approximation. For a restricted class of graphical models, so-called single overlap graphs, the second order method has comparable complexity to the first order method. For sigmoid belief networks, the method is shown to be particularly fast and effective.

4 0.5324176 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

Author: Oliver B. Downs

Abstract: Recent work has exploited boundedness of data in the unsupervised learning of new types of generative model. For nonnegative data it was recently shown that the maximum-entropy generative model is a Nonnegative Boltzmann Distribution not a Gaussian distribution, when the model is constrained to match the first and second order statistics of the data. Learning for practical sized problems is made difficult by the need to compute expectations under the model distribution. The computational cost of Markov chain Monte Carlo methods and low fidelity of naive mean field techniques has led to increasing interest in advanced mean field theories and variational methods. Here I present a secondorder mean-field approximation for the Nonnegative Boltzmann Machine model, obtained using a

5 0.47927323 115 nips-2000-Sequentially Fitting ``Inclusive'' Trees for Inference in Noisy-OR Networks

Author: Brendan J. Frey, Relu Patrascu, Tommi Jaakkola, Jodi Moran

Abstract: An important class of problems can be cast as inference in noisyOR Bayesian networks, where the binary state of each variable is a logical OR of noisy versions of the states of the variable's parents. For example, in medical diagnosis, the presence of a symptom can be expressed as a noisy-OR of the diseases that may cause the symptom - on some occasions, a disease may fail to activate the symptom. Inference in richly-connected noisy-OR networks is intractable, but approximate methods (e .g., variational techniques) are showing increasing promise as practical solutions. One problem with most approximations is that they tend to concentrate on a relatively small number of modes in the true posterior, ignoring other plausible configurations of the hidden variables. We introduce a new sequential variational method for bipartite noisyOR networks, that favors including all modes of the true posterior and models the posterior distribution as a tree. We compare this method with other approximations using an ensemble of networks with network statistics that are comparable to the QMR-DT medical diagnostic network. 1 Inclusive variational approximations Approximate algorithms for probabilistic inference are gaining in popularity and are now even being incorporated into VLSI hardware (T. Richardson, personal communication). Approximate methods include variational techniques (Ghahramani and Jordan 1997; Saul et al. 1996; Frey and Hinton 1999; Jordan et al. 1999), local probability propagation (Gallager 1963; Pearl 1988; Frey 1998; MacKay 1999a; Freeman and Weiss 2001) and Markov chain Monte Carlo (Neal 1993; MacKay 1999b). Many algorithms have been proposed in each of these classes. One problem that most of the above algorithms suffer from is a tendency to concentrate on a relatively small number of modes of the target distribution (the distribution being approximated). In the case of medical diagnosis, different modes correspond to different explanations of the symptoms. Markov chain Monte Carlo methods are usually guaranteed to eventually sample from all the modes, but this may take an extremely long time, even when tempered transitions (Neal 1996) are (a) ,,

6 0.45911035 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

7 0.42060012 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

8 0.41580519 126 nips-2000-Stagewise Processing in Error-correcting Codes and Image Restoration

9 0.39418337 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

10 0.33167759 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

11 0.31048688 125 nips-2000-Stability and Noise in Biochemical Switches

12 0.30292413 80 nips-2000-Learning Switching Linear Models of Human Motion

13 0.28696793 18 nips-2000-Active Support Vector Machine Classification

14 0.27802908 62 nips-2000-Generalized Belief Propagation

15 0.27297717 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

16 0.27278292 47 nips-2000-Error-correcting Codes on a Bethe-like Lattice

17 0.26999304 15 nips-2000-Accumulator Networks: Suitors of Local Probability Propagation

18 0.26986498 51 nips-2000-Factored Semi-Tied Covariance Matrices

19 0.26279265 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

20 0.24301842 120 nips-2000-Sparse Greedy Gaussian Process Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.03), (17, 0.091), (26, 0.339), (33, 0.042), (36, 0.024), (55, 0.023), (62, 0.029), (65, 0.041), (67, 0.059), (76, 0.063), (79, 0.03), (81, 0.017), (90, 0.04), (91, 0.059), (97, 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.85134518 14 nips-2000-A Variational Mean-Field Theory for Sigmoidal Belief Networks

Author: Chiranjib Bhattacharyya, S. Sathiya Keerthi

Abstract: A variational derivation of Plefka's mean-field theory is presented. This theory is then applied to sigmoidal belief networks with the aid of further approximations. Empirical evaluation on small-scale networks shows that the proposed approximations are quite competitive.

2 0.8355844 6 nips-2000-A Neural Probabilistic Language Model

Author: Yoshua Bengio, Réjean Ducharme, Pascal Vincent

Abstract: A goal of statistical language modeling is to learn the joint probability function of sequences of words. This is intrinsically difficult because of the curse of dimensionality: we propose to fight it with its own weapons. In the proposed approach one learns simultaneously (1) a distributed representation for each word (i.e. a similarity between words) along with (2) the probability function for word sequences, expressed with these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar to words forming an already seen sentence. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach very significantly improves on a state-of-the-art trigram model.

3 0.4503051 13 nips-2000-A Tighter Bound for Graphical Models

Author: Martijn A. R. Leisink, Hilbert J. Kappen

Abstract: We present a method to bound the partition function of a Boltzmann machine neural network with any odd order polynomial. This is a direct extension of the mean field bound, which is first order. We show that the third order bound is strictly better than mean field. Additionally we show the rough outline how this bound is applicable to sigmoid belief networks. Numerical experiments indicate that an error reduction of a factor two is easily reached in the region where expansion based approximations are useful. 1

4 0.41161472 94 nips-2000-On Reversing Jensen's Inequality

Author: Tony Jebara, Alex Pentland

Abstract: Jensen's inequality is a powerful mathematical tool and one of the workhorses in statistical learning. Its applications therein include the EM algorithm, Bayesian estimation and Bayesian inference. Jensen computes simple lower bounds on otherwise intractable quantities such as products of sums and latent log-likelihoods. This simplification then permits operations like integration and maximization. Quite often (i.e. in discriminative learning) upper bounds are needed as well. We derive and prove an efficient analytic inequality that provides such variational upper bounds. This inequality holds for latent variable mixtures of exponential family distributions and thus spans a wide range of contemporary statistical models. We also discuss applications of the upper bounds including maximum conditional likelihood, large margin discriminative models and conditional Bayesian inference. Convergence, efficiency and prediction results are shown. 1

5 0.41005078 85 nips-2000-Mixtures of Gaussian Processes

Author: Volker Tresp

Abstract: We introduce the mixture of Gaussian processes (MGP) model which is useful for applications in which the optimal bandwidth of a map is input dependent. The MGP is derived from the mixture of experts model and can also be used for modeling general conditional probability densities. We discuss how Gaussian processes -in particular in form of Gaussian process classification, the support vector machine and the MGP modelcan be used for quantifying the dependencies in graphical models.

6 0.40509775 51 nips-2000-Factored Semi-Tied Covariance Matrices

7 0.40090477 114 nips-2000-Second Order Approximations for Probability Models

8 0.39829502 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

9 0.39622429 80 nips-2000-Learning Switching Linear Models of Human Motion

10 0.39370134 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

11 0.39348227 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

12 0.38000217 74 nips-2000-Kernel Expansions with Unlabeled Examples

13 0.37912086 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

14 0.37851813 122 nips-2000-Sparse Representation for Gaussian Process Models

15 0.37591225 138 nips-2000-The Use of Classifiers in Sequential Inference

16 0.37560451 37 nips-2000-Convergence of Large Margin Separable Linear Classification

17 0.37056571 4 nips-2000-A Linear Programming Approach to Novelty Detection

18 0.36850864 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

19 0.36763221 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

20 0.36743754 95 nips-2000-On a Connection between Kernel PCA and Metric Multidimensional Scaling