nips nips2009 nips2009-97 knowledge-graph by maker-knowledge-mining

97 nips-2009-Free energy score space


Source: pdf

Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic

Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Free energy score-space. Alessandro Perina (1,3), Marco Cristani (1,2), Umberto Castellani (1), Vittorio Murino (1,2) and Nebojsa Jojic (3). {alessandro. [sent-1, score-0.343]

2 (1) Department of Computer Science, University of Verona, Italy; (2) IIT, Italian Institute of Technology, Genova, Italy; (3) Microsoft Research, Redmond, WA. Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. [sent-7, score-0.499]

3 Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. [sent-10, score-0.345]

4 Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. [sent-11, score-0.637]

5 In this paper, we present a novel score space that exploits the free energy associated with a generative model. [sent-12, score-1.07]

6 We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models. [sent-14, score-0.477]

7 1 Introduction. The complementary nature of discriminative and generative approaches to machine learning [20] has motivated lots of research on the ways in which these can be combined [5, 12, 15, 18, 9, 24, 27]. [sent-15, score-0.345]

8 For each class i = 1, . . . , K, we consider a family of generative models P = {P(x|θ_i)} parameterized by θ_i. [sent-26, score-0.342]

9 The observed sequence x is mapped to the fixed-length score vector ϕ_f^F̂(x), ϕ_f^F̂(x) = F̂ f({P_i(x|θ_i)})  (1), where f is the function of the set of probability densities under the different models, and F̂ is some operator applied to it. [sent-27, score-0.275]

10 For instance, in the case of the Fisher score [9], f is the log likelihood, and the operator F̂ produces the first-order derivatives with respect to the parameters, whereas in [24] other derivatives are also included. [sent-28, score-0.424]
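
A minimal sketch of this Fisher-score instance of (1), assuming NumPy and a toy 1-D Gaussian model (illustrative code, not code from the paper): f is the log likelihood of the whole sample and F̂ is the gradient with respect to the parameters (μ, σ), so samples of any length map to the same 2-dimensional score vector.

import numpy as np

def fisher_score(x, mu, sigma):
    # Gradient of log P(x | mu, sigma) for i.i.d. Gaussian observations;
    # the result is a fixed-length score vector regardless of len(x).
    x = np.asarray(x, dtype=float)
    d_mu = np.sum((x - mu) / sigma**2)                      # d/dmu of sum_i log N(x_i | mu, sigma)
    d_sigma = np.sum((x - mu)**2 / sigma**3 - 1.0 / sigma)  # d/dsigma of sum_i log N(x_i | mu, sigma)
    return np.array([d_mu, d_sigma])

print(fisher_score([0.1, -0.3, 0.7], mu=0.0, sigma=1.0))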

11 Another example is the TOP kernel [27], for which the function f is the posterior log-odds and F̂ is again the gradient operator. [sent-29, score-0.136]

12 In these cases, the generative score-space approaches help to distill the relationship between a model parameter θi and the particular data sample. [sent-30, score-0.25]

13 After the mapping, a score-space metric must be defined in order to employ discriminative approaches. [sent-31, score-0.12]

14 A number of nice properties for these mappings, and especially for Fisher score, can be derived under the assumption that the test data indeed follows the generative model used for the score computation. [sent-32, score-0.531]

15 However, generative score spaces are built upon the choice of one (or a few) out of many possible generative models, as well as upon parameters fit to a limited amount of data. [sent-33, score-0.71]

16 The only free parameters are therefore the Gaussian centers; let us assume that the training data is best captured with these centers all lying on (or close to) a hypersphere with a radius sufficiently larger than the Gaussians’ deviation. [sent-36, score-0.378]

17 In this paper, we propose a novel score space which focuses on how well the data point fits different parts of the generative model, rather than on derivatives with respect to the model parameters. [sent-40, score-0.572]

18 We start with the variational free energy as an upper bound on the negative log-likelihood of the data, as this affords us two advantages. [sent-41, score-0.694]

19 Second, a variational approximation of the posterior typically provides an additive decomposition of the free energy, yielding many terms that can be used as features. [sent-43, score-0.441]

20 These terms/features are divided into two categories: the “entropy set” of terms that express uncertainty in the posterior distribution, and the “cross-entropy set” describing the quality of the fit of the data to different parts of the model according to the posterior distribution. [sent-44, score-0.248]

21 We find the resulting score space to be highly informative for discriminative learning. [sent-45, score-0.371]

22 In Sec. 3, we show that the proposed generative score space leads to better classification performance than the related generative counterpart. [sent-51, score-0.667]

23 2 FESS: Free Energy Score Space. A generative model defines the distribution P(h, x|θ) = ∏_{t=1}^{T} P(h^(t), x^(t)|θ) over a set of observations x = {x^(t)}_{t=1}^{T}, each with associated hidden variables h^(t), for a given set of model parameters θ shared across all observations. [sent-55, score-0.378]

24 In addition, to model the posterior distribution P (h|x), we also define a family of distributions Q from which we need to select a variational distribution Q(h) that best fits the model and the data. [sent-56, score-0.29]

25 Constraining Q(h) to belong to a simplified family of distributions Q, however, provides computational advantages for dealing with intractable models P. [sent-61, score-0.172]

26 Examples of distribution families used for approximation are the fully-factorized mean field form [13], or the structured variational approximation [7], where some dependencies among the hidden variables are kept. [sent-62, score-0.229]

27 Minimization of F_Q as a proxy for the negative log likelihood is usually achieved by alternating optimization with respect to Q and θ, a special case of which (when Q is fully expressive) is the EM algorithm. [sent-63, score-0.189]

28 This posterior distribution is fit to minimize the free energy; the first term in (3) is the entropy and quantifies the uncertainty in this fit. [sent-72, score-0.375]

29 For example, if the generative model is described by a Bayesian network, the joint distribution can be written as P(v^(t)) = ∏_n P(v_n^(t)|PA_n), where v^(t) = {x^(t), h^(t)} denotes the set of all variables (hidden or visible) and PA_n are the parents of the n-th of these variables, i.e. [sent-74, score-0.25]

30 q(v_n^(t) = 1, ∪ PA_n | θ̂) · log P(v_n^(t) = 1 | PA_n, θ̂) + · · · + q(v_n^(t) = D_n, ∪ PA_n | θ̂) · log P(v_n^(t) = D_n | PA_n, θ̂)  (5). In a similar fashion, the entropy term can also be decomposed further into a sum of terms as dictated by the factorization of the family Q. [sent-79, score-0.114]

31 Therefore, the free energy for a single sample t can be expressed as the sum F_Q^t = Σ_i f_{i,θ̂}^t  (6), where all the free energy pieces f_{i,θ̂}^t derive from the finest decomposition (5) or (4). [sent-80, score-1.256]

32 The terms f_{i,θ̂}^t describe how the data point fits possible configurations of the hidden variables in different parts of the model. [sent-81, score-0.146]
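
A small numerical sketch of this decomposition, assuming NumPy/SciPy and a toy 1-D Gaussian mixture with known parameters (an illustration, not the paper's code): with the exact posterior q(h = k | x), the "entropy" and "cross-entropy" pieces sum to −log P(x), and the individual pieces are the candidate features.

import numpy as np
from scipy.stats import norm

def fess_terms(x, weights, means, sigmas):
    # Free-energy pieces of one sample under a Gaussian mixture, exact posterior.
    log_joint = np.log(weights) + norm.logpdf(x, means, sigmas)  # log pi_k N(x | mu_k, s_k)
    log_px = np.logaddexp.reduce(log_joint)                      # log P(x)
    q = np.exp(log_joint - log_px)                               # exact posterior q(h = k | x)
    entropy_terms = q * np.log(np.clip(q, 1e-300, None))         # "entropy set"
    cross_terms = -q * log_joint                                 # "cross-entropy set"
    return np.concatenate([entropy_terms, cross_terms])

terms = fess_terms(0.4, np.array([0.3, 0.7]), np.array([-1.0, 1.0]), np.array([1.0, 1.0]))
print(terms.sum())   # equals -log P(x = 0.4), i.e. the free energy of this sample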

33 Such information can be encapsulated in a score space that we call free energy score space or simply FESS. [sent-82, score-1.062]

34 Therefore, using the notation from [24], the free energy score operator ϕ_FESS^F̂(x^(t)) is defined as ϕ_FESS^F̂ : x^(t) → [F_{(Q_1,θ̂_1)}(x^(t)); F_{(Q_2,θ̂_2)}(x^(t))]  (7), where F_{(Q_c,θ̂_c)}^t stacks the free energy terms f_{i,θ̂_c}^t computed under model c. [sent-84, score-0.872]
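
Continuing the toy mixture sketch above (again illustrative, with fess_terms as defined there and hypothetical model tuples), the two-class FESS mapping is simply the concatenation of the free-energy pieces computed under the two class-conditional models:

import numpy as np

def phi_fess(x, model_pos, model_neg):
    # model_pos / model_neg are (weights, means, sigmas) tuples for the two classes.
    return np.concatenate([fess_terms(x, *model_pos), fess_terms(x, *model_neg)])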

35 However, the mapping also allows the fit of different parts of the model to play uneven roles in classification after an additional step of discriminative training. [sent-91, score-0.214]

36 Even when likelihoods cannot be directly compared (e.g., because the models are not nested), the mapping may still provide an abstraction from which another step of discriminative training can benefit. [sent-95, score-0.207]

37 The additional step of training a discriminative model allows for mining the similarities among the data points in terms of the path through different hidden variables that has to be followed in their generation. [sent-96, score-0.276]

38 These similarities may be informative even if the generative process is imperfect. [sent-97, score-0.259]

39 Obviously, (7) can be generalized to include multiple models (or the use of a single model) and/or multiple posterior approximations, either for two-class or multi-class classification problems. [sent-98, score-0.123]

40 3 Free energy score space classification dominates the MAP classification. We use here the terminology introduced in [27], under which FESS would be considered a model-dependent feature extractor, as different generative models lead to different feature vectors [25]. [sent-99, score-0.882]

41 The family of feature extractors ϕ^F̂ : X → R^d maps the input data x ∈ X into a space of fixed dimension derived from a plug-in estimate λ, in our case the generative model with parameters θ̂ from which the features are extracted. [sent-100, score-0.396]

42 Given some observations x and the corresponding class labels y ∈ {−1, +1} following the joint probability P(x, y|θ∗), a generative model can be trained to provide an estimate θ̂ of θ∗, where θ∗ are the true parameters. [sent-101, score-0.25]

43 The Fisher kernel (FK) classifier can perform at least as well as its plug-in estimate if the parameters of a linear classifier are properly determined [9, 27]: R(ϕ_FK^F̂) ≤ E_{x,y} Φ[−y(P(y = +1|x, θ̂) − 1/2)] = R(λ)  (9), where λ represents the generative model used as plug-in estimate. [sent-106, score-0.296]

44 This property also trivially holds for our method, where ϕ^F̂(x^(t)) = ϕ_FESS^F̂(x^(t)), because the free energy can be expressed as a linear combination of the elements of ϕ^F̂. [sent-107, score-0.628]
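
Spelled out in the notation used here (a restatement of the argument, not text from the paper; the weight vector w is introduced only for this check):

\[
\mathbf{w}^{\top}\varphi^{\hat F}_{\mathrm{FESS}}(x^{(t)})
= \sum_i f^{t}_{i,\hat\theta_1} - \sum_i f^{t}_{i,\hat\theta_2}
= F^{t}_{Q_1} - F^{t}_{Q_2},
\qquad \mathbf{w} = [1,\dots,1,-1,\dots,-1]^{\top},
\]

so thresholding this particular linear function at zero reproduces the free energy test, and a discriminatively trained linear classifier in FESS can therefore do at least as well.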

45 When the family Q is expressive enough to capture the true posterior distribution, then free energy reduces to negative log likelihood, and the free energy test reduces to ML classification. [sent-109, score-1.596]

46 In other cases, likelihood computation is intractable, and the free energy test is used instead of the likelihood ratio test. [sent-110, score-0.776]

47 It is straightforward to prove that a kernel classifier that works in FESS is asymptotically at least as good as the MAP labelling based on the generative models for the two classes since generative classification is a special case of our framework. [sent-111, score-0.529]

48 The dominance of the Fisher and TOP kernels [9, 27] over their plug-ins holds for FESS too, and the same plug-in (the likelihood under a generative model) may be used when this is tractable. [sent-115, score-0.314]

49 However, if the computation of the likelihood (and the kernels derived from it) is intractable, then the free energy test, as well as the kernel methods based on FESS that outperform this test, can both still be used with an appropriate family of variational distributions Q. [sent-116, score-0.977]

50 4 Controlling the length of the feature vector. In some generative models, especially sequence models, the number of hidden variables may change from one data point to the next. [sent-117, score-0.419]

51 In speech processing, for instance, hidden Markov models (HMMs) [23] may have to model utterances x_1^(t), . . . [sent-118, score-0.255]

52 As each element in the sequence has an associated hidden variable, the hidden state sequences s_1^(t), . . . [sent-122, score-0.3]

53 Exact inference is tractable in HMMs, so we can use the exact posterior (EX) distribution to formulate the free energy; the free energy minimization is then equivalent to the usual Baum-Welch training algorithm [17], and F_EX = −log P(x). [sent-127, score-1.43]
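
For concreteness, a minimal log-domain forward pass computing F_EX = −log P(x) for a discrete-output HMM (an illustrative sketch with assumed array layouts, not the paper's implementation):

import numpy as np

def hmm_neg_loglik(obs, log_pi, log_A, log_B):
    # obs: sequence of symbol indices; log_pi: (S,) initial log-probs;
    # log_A: (S, S) transition log-probs; log_B: (S, V) emission log-probs.
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        # alpha_t(j) = logsumexp_i(alpha_{t-1}(i) + log A[i, j]) + log B[j, o_t]
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, o]
    return -np.logaddexp.reduce(alpha)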

54 To solve this problem, we first note that a standard approach to dealing with utterances of different lengths is to normalize the likelihood by the sequence length, and this approach is also used for defining other score spaces. [sent-129, score-0.397]

55 If, before the application of the score operator, we simply evaluate the sums over k in the free energy and divide each by K(t), we obtain a fixed number of terms independent of the sequence length. [sent-130, score-0.876]

56 This results in a length-normalized score space nFESS, where the granularity of the decomposition of the free energy is dramatically reduced. [sent-131, score-0.889]
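
A short sketch of this normalization step, assuming NumPy and that the per-step free-energy pieces of one sequence are already arranged one row per time step (the row layout is an assumption made for illustration, not the paper's exact decomposition):

import numpy as np

def nfess_features(step_terms):
    # step_terms: shape (K_t, n_pieces). The per-step average has the same
    # dimension for every sequence length K(t).
    step_terms = np.asarray(step_terms, dtype=float)
    return step_terms.sum(axis=0) / step_terms.shape[0]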

57 (Figure 1 plot area: λHMM panels for T = 7 to 11; axes: regularization log(C), median RFP.)

58 Figure 1: A) SVM error rates for nFESS and probability product kernels [10] using Markov models (we reported only their best result) and hidden Markov models as plug-ins. [sent-154, score-0.233]

59 B) Comparison with results obtained using FK and TK score spaces. [sent-157, score-0.217]

60 Y axis represents the total number of families for which a given method exceeds a median RFP score on the X axis. [sent-159, score-0.309]

61 In general, even for fixed-length data points and arbitrary generative models, we do not need to create large feature vectors corresponding to the finest level of granularity described in (5), or for that matter the slightly coarser level of granularity in (4). [sent-160, score-0.345]

62 The longer the feature vector, the finer the level of detail with which the generative process for the data sample is represented, but the more data is needed for the training of the discriminative classifier. [sent-162, score-0.405]

63 Such control of the feature vector length does not negate the previously discussed advantages of the classification in the free energy score space compared with the straightforward application of free energy, likelihood, or in case of sequence models, length-normalized likelihood tests. [sent-164, score-1.278]

64 Support vector machines (SVMs) with an RBF kernel were used as discriminative classifiers in all the score spaces, as this technique was previously identified as most potent for dealing with variable-length sequences [25]. [sent-166, score-0.473]
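
A sketch of this discriminative stage, assuming scikit-learn (not mentioned in the paper) and a hypothetical matrix Phi_train holding one FESS feature vector per row; feature scaling is a common practical choice rather than something the paper prescribes.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
# clf.fit(Phi_train, y_train)
# y_pred = clf.predict(Phi_test)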

65 As plug-ins, or generative models/likelihoods λ, for the three score spaces we compare across experiments, we used hidden Markov models (HMMs) [23] in Experiments 1-3 and latent Dirichlet allocation (LDA) [4] in Experiment 4. [sent-167, score-0.659]

66 For both FK and FESS, in each experiment we trained a single generative model (HMM or LDA, depending on the experiment). [sent-169, score-0.279]

67 For all HMM models, the length-normalization with associated summation over the sequence as described in the previous section was used in the construction of the free energy score space. [sent-170, score-0.876]

68 E. coli promoter gene sequences (DNA) with associated imperfect domain theory [26]. [sent-177, score-0.224]

69 A promoter is a genetic region which facilitates the transcription of a gene located nearby. [sent-179, score-0.118]

70 Results, obtained using leave-one-out (LOO) validation, are reported in Table 1 and illustrate that FESS represents the fixed-size genetic sequences well, leading to superior performance over other score spaces as well as over the plug-in λHMM. [sent-181, score-0.26]

71 The task here is to distinguish between the two types of gene sequences that can both vary in length (from dozens of nucleotides to tens of thousands of nucleotides). [sent-186, score-0.193]

72 The sequences in the database were selected from the Astral database, based on the E-value threshold of 10^−25 for removing similar sequences from it. [sent-196, score-0.126]

73 In the end, 4352 distinct sequences were grouped into families and superfamilies. [sent-197, score-0.123]

74 For each family, the protein domains within the family are considered positive test examples, and the protein domains outside the family, but within the same superfamily, are taken as positive training examples. [sent-198, score-0.278]

75 The data set yields 54 families containing at least 10 family members (positive test) and 5 superfamily members outside of the family (positive train) for a total of 54 One-Vs-All problems. [sent-199, score-0.271]

76 The experimental setup is similar to that used in [8], except for one important difference: in the current experiments, the positive training sets do not include additional protein sequences extracted from a large, unlabelled database. [sent-200, score-0.157]

77 In order to measure the quality of the ranking, we used the median RFP score [8] which is the fraction of negative test sequences that score as high as or better than the median-scoring positive sequence. [sent-202, score-0.563]
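
The measure defined in the sentence above can be coded directly (a sketch assuming NumPy and that higher scores indicate the positive class):

import numpy as np

def median_rfp(pos_scores, neg_scores):
    # Fraction of negative test sequences scoring as high as or better than
    # the median-scoring positive sequence (lower is better).
    median_pos = np.median(np.asarray(pos_scores, dtype=float))
    return float(np.mean(np.asarray(neg_scores, dtype=float) >= median_pos))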

78 We find that FESS outperforms task-specific algorithms (PSI-Blast [2] and SAM [14]) as well as the Fisher score (FK,[8]) with statistical significance with p-values of 5. [sent-204, score-0.217]

79 We repeated the test using two different choices of Q: the approximate mean field factorization and the exact posterior (FESS-MF and FESS-EX, respectively, in Fig. [sent-211, score-0.154]

80 In both tests, we used Latent Dirichlet allocation (LDA) [4] as the generative model. [sent-216, score-0.225]

81 Although this new technique outperformed the state of the art, once again FESS outperforms both this result and other state-of-the-art discriminative methods [21, 16]. [sent-245, score-0.12]

82 6 Conclusions. In this paper, we present a novel generative score space, FESS, exploiting variational free energy terms as features. [sent-246, score-1.136]

83 The additive free energy terms arise naturally as a consequence of the factorization of the model P and the posterior Q. [sent-247, score-0.773]

84 We show that the use of these terms as features in discriminative classification leads to more robust results than the use of the Fisher scores, which are based on the derivatives of the log likelihood of the data with respect to the model parameters. [sent-248, score-0.32]

85 As was previously observed, we find that the Fisher score space suffers from the so called “wrap-around” problem, where very different data points may map to the same derivative, an example of which was discussed in the introduction. [sent-249, score-0.217]

86 The free energy terms, on the other hand, quantify the data fit in different parts of the model, and seem to be informative even when the model is imperfect. [sent-250, score-0.73]

87 This indicates that the re-scaling of these terms, which the subsequent discriminative training provides, leads to improved modelling of the data in some way. [sent-251, score-0.178]

88 Scaling a term in the free energy composition, e. [sent-252, score-0.628]

89 This is indeed reminiscent of some previous approaches to correcting generative modelling problems. [sent-255, score-0.255]

90 In speech applications, for example, it is a standard practice to raise the observation likelihood in HMMs to a power less than 1, before inference is performed on the test sample, as the acoustic signal would otherwise overwhelm the hidden process modelling the language constraints [28]. [sent-256, score-0.343]

91 For instance, a high-dimensional acoustic observation is often modelled as following a diagonal Gaussian distribution, thus assuming independent noise in the elements of the signal, even though the true acoustics of speech is far more constrained. [sent-258, score-0.119]

92 This results in over-accounting for the variation in the observed acoustic signal, and to correct for this in practice, the log probability of the observation given the hidden variable is scaled down. [sent-259, score-0.215]

93 The technique described here proposes a way to automatically infer the best scaling, but it also goes a step further in allowing for such corrections at all levels of the model hierarchy, and even for specific configurations of hidden variables. [sent-260, score-0.163]

94 This extremely simple technique was shown here to work remarkably well, outperforming previous score space approaches as well as the state of the art in multiple applications. [sent-262, score-0.217]

95 For example, the free energy approximated in different ways is used in [1] to construct various inference algorithms for a single scene parsing task. [sent-264, score-0.668]

96 It may also be effective, for example, to use the terms in the Bethe free energy linked to different belief propagation messages to construct the feature vectors. [sent-265, score-0.66]

97 Finally, although we find that FESS outperforms the previously studied score spaces that depend on the derivatives, i.e. [sent-266, score-0.26]

98 This allows for the construction of kernels similar to FK and TK, but derived from intractable generative models as we show in Experiment 4 (FK in Table 2) on latent Dirichlet allocation. [sent-269, score-0.386]

99 Using the Fisher kernel method to detect remote protein homologies. [sent-326, score-0.14]

100 On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. [sent-408, score-0.225]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fess', 0.54), ('energy', 0.343), ('free', 0.285), ('generative', 0.225), ('score', 0.217), ('ess', 0.176), ('fk', 0.136), ('pan', 0.123), ('fq', 0.121), ('discriminative', 0.12), ('vn', 0.119), ('oer', 0.108), ('hidden', 0.103), ('classi', 0.091), ('posterior', 0.09), ('sk', 0.085), ('family', 0.084), ('promoter', 0.081), ('fisher', 0.081), ('expressive', 0.076), ('hmm', 0.072), ('tk', 0.07), ('protein', 0.066), ('variational', 0.066), ('nfess', 0.065), ('nucleotides', 0.065), ('rfp', 0.065), ('sequences', 0.063), ('speech', 0.063), ('derivatives', 0.062), ('families', 0.06), ('homology', 0.057), ('likelihood', 0.057), ('rq', 0.056), ('acoustic', 0.056), ('log', 0.056), ('hmms', 0.052), ('fi', 0.046), ('graz', 0.046), ('kernel', 0.046), ('lda', 0.045), ('granularity', 0.044), ('parts', 0.043), ('coli', 0.043), ('fex', 0.043), ('nest', 0.043), ('scop', 0.043), ('superfamily', 0.043), ('wg', 0.043), ('ers', 0.043), ('spaces', 0.043), ('markov', 0.042), ('scene', 0.04), ('hm', 0.038), ('bikes', 0.038), ('fps', 0.038), ('latent', 0.038), ('gene', 0.037), ('xk', 0.036), ('jaakkola', 0.035), ('corrections', 0.035), ('test', 0.034), ('informative', 0.034), ('lengths', 0.034), ('models', 0.033), ('centers', 0.033), ('median', 0.032), ('kernels', 0.032), ('italy', 0.032), ('extractor', 0.032), ('hypersphere', 0.032), ('feature', 0.032), ('sequence', 0.031), ('utterances', 0.031), ('tsuda', 0.031), ('dn', 0.03), ('cation', 0.03), ('derived', 0.03), ('factorization', 0.03), ('modelling', 0.03), ('gurations', 0.03), ('detection', 0.03), ('broken', 0.029), ('bg', 0.029), ('recognition', 0.029), ('experiment', 0.029), ('length', 0.028), ('intractable', 0.028), ('svms', 0.028), ('remote', 0.028), ('training', 0.028), ('dealing', 0.027), ('proteins', 0.027), ('er', 0.027), ('operator', 0.027), ('biology', 0.027), ('dna', 0.026), ('mapping', 0.026), ('dirichlet', 0.026), ('model', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999928 97 nips-2009-Free energy score space

Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic

Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.

2 0.12624422 2 nips-2009-3D Object Recognition with Deep Belief Nets

Author: Vinod Nair, Geoffrey E. Hinton

Abstract: We introduce a new type of top-level model for Deep Belief Nets and evaluate it on a 3D object recognition task. The top-level model is a third-order Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database (normalized-uniform version), which contains stereo-pair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5% error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has built-in knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semi-supervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2% error. 1

3 0.12162589 201 nips-2009-Region-based Segmentation and Object Detection

Author: Stephen Gould, Tianshi Gao, Daphne Koller

Abstract: Object detection and multi-class image segmentation are two closely related tasks that can be greatly improved when solved jointly by feeding information from one task to the other [10, 11]. However, current state-of-the-art models use a separate representation for each task making joint inference clumsy and leaving the classification of many parts of the scene ambiguous. In this work, we propose a hierarchical region-based approach to joint object detection and image segmentation. Our approach simultaneously reasons about pixels, regions and objects in a coherent probabilistic model. Pixel appearance features allow us to perform well on classifying amorphous background classes, while the explicit representation of regions facilitate the computation of more sophisticated features necessary for object detection. Importantly, our model gives a single unified description of the scene—we explain every pixel in the image and enforce global consistency between all random variables in our model. We run experiments on the challenging Street Scene dataset [2] and show significant improvement over state-of-the-art results for object detection accuracy. 1

4 0.10255559 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

Author: Yusuke Watanabe, Kenji Fukumizu

Abstract: We propose a new approach to the analysis of Loopy Belief Propagation (LBP) by establishing a formula that connects the Hessian of the Bethe free energy with the edge zeta function. The formula has a number of theoretical implications on LBP. It is applied to give a sufficient condition that the Hessian of the Bethe free energy is positive definite, which shows non-convexity for graphs with multiple cycles. The formula clarifies the relation between the local stability of a fixed point of LBP and local minima of the Bethe free energy. We also propose a new approach to the uniqueness of LBP fixed point, and show various conditions of uniqueness. 1

5 0.09731634 168 nips-2009-Non-stationary continuous dynamic Bayesian networks

Author: Marco Grzegorczyk, Dirk Husmeier

Abstract: Dynamic Bayesian networks have been applied widely to reconstruct the structure of regulatory processes from time series data. The standard approach is based on the assumption of a homogeneous Markov chain, which is not valid in many realworld scenarios. Recent research efforts addressing this shortcoming have considered undirected graphs, directed graphs for discretized data, or over-flexible models that lack any information sharing among time series segments. In the present article, we propose a non-stationary dynamic Bayesian network for continuous data, in which parameters are allowed to vary among segments, and in which a common network structure provides essential information sharing across segments. Our model is based on a Bayesian multiple change-point process, where the number and location of the change-points is sampled from the posterior distribution. 1

6 0.091910541 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

7 0.087662786 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

8 0.085826457 4 nips-2009-A Bayesian Analysis of Dynamics in Free Recall

9 0.080216683 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

10 0.079978898 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

11 0.074211039 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

12 0.07150387 100 nips-2009-Gaussian process regression with Student-t likelihood

13 0.07149417 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

14 0.069820367 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

15 0.067478336 88 nips-2009-Extending Phase Mechanism to Differential Motion Opponency for Motion Pop-out

16 0.066384129 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

17 0.064214028 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

18 0.061390013 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition

19 0.06089925 175 nips-2009-Occlusive Components Analysis

20 0.060396012 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.234), (1, -0.049), (2, -0.067), (3, -0.032), (4, 0.0), (5, -0.046), (6, 0.035), (7, 0.076), (8, -0.044), (9, -0.02), (10, 0.042), (11, 0.006), (12, -0.058), (13, -0.032), (14, 0.069), (15, 0.003), (16, -0.02), (17, -0.116), (18, -0.014), (19, -0.131), (20, 0.022), (21, 0.009), (22, 0.019), (23, -0.066), (24, -0.009), (25, 0.0), (26, 0.157), (27, 0.129), (28, 0.037), (29, 0.092), (30, -0.029), (31, 0.097), (32, -0.022), (33, 0.094), (34, -0.019), (35, 0.001), (36, -0.067), (37, -0.029), (38, -0.025), (39, -0.032), (40, -0.174), (41, 0.036), (42, 0.154), (43, -0.052), (44, 0.018), (45, -0.0), (46, 0.136), (47, 0.037), (48, -0.0), (49, 0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9341622 97 nips-2009-Free energy score space

Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic

Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.

2 0.63023651 56 nips-2009-Conditional Neural Fields

Author: Jian Peng, Liefeng Bo, Jinbo Xu

Abstract: Conditional random fields (CRF) are widely used for sequence labeling such as natural language processing and biological sequence analysis. Most CRF models use a linear potential function to represent the relationship between input features and output. However, in many real-world applications such as protein structure prediction and handwriting recognition, the relationship between input features and output is highly complex and nonlinear, which cannot be accurately modeled by a linear function. To model the nonlinear relationship between input and output we propose a new conditional probabilistic graphical model, Conditional Neural Fields (CNF), for sequence labeling. CNF extends CRF by adding one (or possibly more) middle layer between input and output. The middle layer consists of a number of gate functions, each acting as a local neuron or feature extractor to capture the nonlinear relationship between input and output. Therefore, conceptually CNF is much more expressive than CRF. Experiments on two widely-used benchmarks indicate that CNF performs significantly better than a number of popular methods. In particular, CNF is the best among approximately 10 machine learning methods for protein secondary structure prediction and also among a few of the best methods for handwriting recognition.

3 0.59007972 2 nips-2009-3D Object Recognition with Deep Belief Nets

Author: Vinod Nair, Geoffrey E. Hinton

Abstract: We introduce a new type of top-level model for Deep Belief Nets and evaluate it on a 3D object recognition task. The top-level model is a third-order Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database (normalized-uniform version), which contains stereo-pair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5% error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has built-in knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semi-supervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2% error. 1

4 0.54137599 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori

Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems. 1

5 0.53786188 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

Author: Ruslan Salakhutdinov

Abstract: Markov random fields (MRF’s), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRF’s is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of the Robbins-Monro type that use Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large, densely-connected MRF’s. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data that perform well on digit and object recognition tasks.

6 0.50869668 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

7 0.50006324 39 nips-2009-Bayesian Belief Polarization

8 0.48155078 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

9 0.47902834 84 nips-2009-Evaluating multi-class learning strategies in a generative hierarchical framework for object detection

10 0.47744521 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition

11 0.47652096 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

12 0.47029132 168 nips-2009-Non-stationary continuous dynamic Bayesian networks

13 0.46236321 72 nips-2009-Distribution Matching for Transduction

14 0.46005097 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

15 0.45777541 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

16 0.45254001 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling

17 0.44497839 25 nips-2009-Adaptive Design Optimization in Experiments with People

18 0.43887851 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

19 0.43856844 51 nips-2009-Clustering sequence sets for motif discovery

20 0.43523923 204 nips-2009-Replicated Softmax: an Undirected Topic Model


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(21, 0.015), (24, 0.052), (25, 0.091), (35, 0.071), (36, 0.123), (39, 0.052), (58, 0.096), (61, 0.02), (71, 0.11), (81, 0.016), (86, 0.077), (92, 0.181)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91873521 117 nips-2009-Inter-domain Gaussian Processes for Sparse Inference using Inducing Features

Author: Anibal Figueiras-vidal, Miguel Lázaro-gredilla

Abstract: We present a general inference framework for inter-domain Gaussian Processes (GPs) and focus on its usefulness to build sparse GP models. The state-of-the-art sparse GP model introduced by Snelson and Ghahramani in [1] relies on finding a small, representative pseudo data set of m elements (from the same domain as the n available data elements) which is able to explain existing data well, and then uses it to perform inference. This reduces inference and model selection computation time from O(n3 ) to O(m2 n), where m n. Inter-domain GPs can be used to find a (possibly more compact) representative set of features lying in a different domain, at the same computational cost. Being able to specify a different domain for the representative features allows to incorporate prior knowledge about relevant characteristics of data and detaches the functional form of the covariance and basis functions. We will show how previously existing models fit into this framework and will use it to develop two new sparse GP models. Tests on large, representative regression data sets suggest that significant improvement can be achieved, while retaining computational efficiency. 1 Introduction and previous work Along the past decade there has been a growing interest in the application of Gaussian Processes (GPs) to machine learning tasks. GPs are probabilistic non-parametric Bayesian models that combine a number of attractive characteristics: They achieve state-of-the-art performance on supervised learning tasks, provide probabilistic predictions, have a simple and well-founded model selection scheme, present no overfitting (since parameters are integrated out), etc. Unfortunately, the direct application of GPs to regression problems (with which we will be concerned here) is limited due to their training time being O(n3 ). To overcome this limitation, several sparse approximations have been proposed [2, 3, 4, 5, 6]. In most of them, sparsity is achieved by projecting all available data onto a smaller subset of size m n (the active set), which is selected according to some specific criterion. This reduces computation time to O(m2 n). However, active set selection interferes with hyperparameter learning, due to its non-smooth nature (see [1, 3]). These proposals have been superseded by the Sparse Pseudo-inputs GP (SPGP) model, introduced in [1]. In this model, the constraint that the samples of the active set (which are called pseudoinputs) must be selected among training data is relaxed, allowing them to lie anywhere in the input space. This allows both pseudo-inputs and hyperparameters to be selected in a joint continuous optimisation and increases flexibility, resulting in much superior performance. In this work we introduce Inter-Domain GPs (IDGPs) as a general tool to perform inference across domains. This allows to remove the constraint that the pseudo-inputs must remain within the same domain as input data. This added flexibility results in an increased performance and allows to encode prior knowledge about other domains where data can be represented more compactly. 1 2 Review of GPs for regression We will briefly state here the main definitions and results for regression with GPs. See [7] for a comprehensive review. Assume we are given a training set with n samples D ≡ {xj , yj }n , where each D-dimensional j=1 input xj is associated to a scalar output yj . The regression task goal is, given a new input x∗ , predict the corresponding output y∗ based on D. 
The GP regression model assumes that the outputs can be expressed as some noiseless latent function plus independent noise, y = f (x)+ε, and then sets a zero-mean1 GP prior on f (x), with covariance k(x, x ), and a zero-mean Gaussian prior on ε, with variance σ 2 (the noise power hyperparameter). The covariance function encodes prior knowledge about the smoothness of f (x). The most common choice for it is the Automatic Relevance Determination Squared Exponential (ARD SE): 2 k(x, x ) = σ0 exp − 1 2 D d=1 (xd − xd )2 2 d , (1) 2 with hyperparameters σ0 (the latent function power) and { d }D (the length-scales, defining how d=1 rapidly the covariance decays along each dimension). It is referred to as ARD SE because, when coupled with a model selection method, non-informative input dimensions can be removed automatically by growing the corresponding length-scale. The set of hyperparameters that define the GP are 2 θ = {σ 2 , σ0 , { d }D }. We will omit the dependence on θ for the sake of clarity. d=1 If we evaluate the latent function at X = {xj }n , we obtain a set of latent variables following a j=1 joint Gaussian distribution p(f |X) = N (f |0, Kff ), where [Kff ]ij = k(xi , xj ). Using this model it is possible to express the joint distribution of training and test cases and then condition on the observed outputs to obtain the predictive distribution for any test case pGP (y∗ |x∗ , D) = N (y∗ |kf ∗ (Kff + σ 2 In )−1 y, σ 2 + k∗∗ − kf ∗ (Kff + σ 2 In )−1 kf ∗ ), (2) where y = [y1 , . . . , yn ] , kf ∗ = [k(x1 , x∗ ), . . . , k(xn , x∗ )] , and k∗∗ = k(x∗ , x∗ ). In is used to denote the identity matrix of size n. The O(n3 ) cost of these equations arises from the inversion of the n × n covariance matrix. Predictive distributions for additional test cases take O(n2 ) time each. These costs make standard GPs impractical for large data sets. To select hyperparameters θ, Type-II Maximum Likelihood (ML-II) is commonly used. This amounts to selecting the hyperparameters that correspond to a (possibly local) maximum of the log-marginal likelihood, also called log-evidence. 3 Inter-domain GPs In this section we will introduce Inter-Domain GPs (IDGPs) and show how they can be used as a framework for computationally efficient inference. Then we will use this framework to express two previous relevant models and develop two new ones. 3.1 Definition Consider a real-valued GP f (x) with x ∈ RD and some deterministic real function g(x, z), with z ∈ RH . We define the following transformation: u(z) = f (x)g(x, z)dx. (3) RD There are many examples of transformations that take on this form, the Fourier transform being one of the best known. We will discuss possible choices for g(x, z) in Section 3.3; for the moment we will deal with the general form. Since u(z) is obtained by a linear transformation of GP f (x), 1 We follow the common approach of subtracting the sample mean from the outputs and then assume a zero-mean model. 2 it is also a GP. This new GP may lie in a different domain of possibly different dimension. This transformation is not invertible in general, its properties being defined by g(x, z). IDGPs arise when we jointly consider f (x) and u(z) as a single, “extended” GP. The mean and covariance function of this extended GP are overloaded to accept arguments from both the input and transformed domains and treat them accordingly. We refer to each version of an overloaded function as an instance, which will accept a different type of arguments. 
If the distribution of the original GP is f (x) ∼ GP(m(x), k(x, x )), then it is possible to compute the remaining instances that define the distribution of the extended GP over both domains. The transformed-domain instance of the mean is m(z) = E[u(z)] = E[f (x)]g(x, z)dx = m(x)g(x, z)dx. RD RD The inter-domain and transformed-domain instances of the covariance function are: k(x, z ) = E[f (x)u(z )] = E f (x) f (x )g(x , z )dx = RD k(z, z ) = E[u(z)u(z )] = E f (x)g(x, z)dx RD = k(x, x )g(x , z )dx f (x )g(x , z )dx RD k(x, x )g(x, z)g(x , z )dxdx . RD (4) RD (5) RD Mean m(·) and covariance function k(·, ·) are therefore defined both by the values and domains of their arguments. This can be seen as if each argument had an additional domain indicator used to select the instance. Apart from that, they define a regular GP, and all standard properties hold. In particular k(a, b) = k(b, a). This approach is related to [8], but here the latent space is defined as a transformation of the input space, and not the other way around. This allows to pre-specify the desired input-domain covariance. The transformation is also more general: Any g(x, z) can be used. We can sample an IDGP at n input-domain points f = [f1 , f2 , . . . , fn ] (with fj = f (xj )) and m transformed-domain points u = [u1 , u2 , . . . , um ] (with ui = u(zi )). With the usual assumption of f (x) being a zero mean GP and defining Z = {zi }m , the joint distribution of these samples is: i=1 Kff Kfu f f 0, X, Z = N p , (6) u u Kfu Kuu with [Kff ]pq = k(xp , xq ), [Kfu ]pq = k(xp , zq ), [Kuu ]pq = k(zp , zq ), which allows to perform inference across domains. We will only be concerned with one input domain and one transformed domain, but IDGPs can be defined for any number of domains. 3.2 Sparse regression using inducing features In the standard regression setting, we are asked to perform inference about the latent function f (x) from a data set D lying in the input domain. Using IDGPs, we can use data from any domain to perform inference in the input domain. Some latent functions might be better defined by a set of data lying in some transformed space rather than in the input space. This idea is used for sparse inference. Following [1] we introduce a pseudo data set, but here we place it in the transformed domain: D = {Z, u}. The following derivation is analogous to that of SPGP. We will refer to Z as the inducing features and u as the inducing variables. The key approximation leading to sparsity is to set m n and assume that f (x) is well-described by the pseudo data set D, so that any two samples (either from the training or test set) fp and fq with p = q will be independent given xp , xq and D. With this simplifying assumption2 , the prior over f can be factorised as a product of marginals: n p(f |X, Z, u) ≈ p(fj |xj , Z, u). (7) j=1 2 Alternatively, (7) can be obtained by proposing a generic factorised form for the approximate conn ditional p(f |X, Z, u) ≈ q(f |X, Z, u) = q (f |xj , Z, u) and then choosing the set of funcj=1 j j tions {qj (·)}n so as to minimise the Kullback-Leibler (KL) divergence from the exact joint prior j=1 KL(p(f |X, Z, u)p(u|Z)||q(f |X, Z, u)p(u|Z)), as noted in [9], Section 2.3.6. 3 Marginals are in turn obtained from (6): p(fj |xj , Z, u) = N (fj |kj K−1 u, λj ), where kj is the j-th uu row of Kfu and λj is the j-th element of the diagonal of matrix Λf = diag(Kf f − Kfu K−1 Kuf ). uu Operator diag(·) sets all off-diagonal elements to zero, so that Λf is a diagonal matrix. 
Since p(u|Z) is readily available and also Gaussian, the inducing variables can be integrated out from (7), yielding a new, approximate prior over f (x): n p(f |X, Z) = p(fj |xj , Z, u)p(u|Z)du = N (f |0, Kfu K−1 Kuf + Λf ) uu p(f , u|X, Z)du ≈ j=1 Using this approximate prior, the posterior distribution for a test case is: pIDGP (y∗ |x∗ , D, Z) = N (y∗ |ku∗ Q−1 Kfu Λ−1 y, σ 2 + k∗∗ + ku∗ (Q−1 − K−1 )ku∗ ), y uu (8) Kfu Λ−1 Kfu y where we have defined Q = Kuu + and Λy = Λf + σ 2 In . The distribution (2) is approximated by (8) with the information available in the pseudo data set. After O(m2 n) time precomputations, predictive means and variances can be computed in O(m) and O(m2 ) time per test case, respectively. This model is, in general, non-stationary, even when it is approximating a stationary input-domain covariance and can be interpreted as a degenerate GP plus heteroscedastic white noise. The log-marginal likelihood (or log-evidence) of the model, explicitly including the conditioning on kernel hyperparameters θ can be expressed as 1 log p(y|X, Z, θ) = − [y Λ−1 y−y Λ−1 Kfu Q−1 Kfu Λ−1 y+log(|Q||Λy |/|Kuu |)+n log(2π)] y y y 2 which is also computable in O(m2 n) time. Model selection will be performed by jointly optimising the evidence with respect to the hyperparameters and the inducing features. If analytical derivatives of the covariance function are available, conjugate gradient optimisation can be used with O(m2 n) cost per step. 3.3 On the choice of g(x, z) The feature extraction function g(x, z) defines the transformed domain in which the pseudo data set lies. According to (3), the inducing variables can be seen as projections of the target function f (x) on the feature extraction function over the whole input space. Therefore, each of them summarises information about the behaviour of f (x) everywhere. The inducing features Z define the concrete set of functions over which the target function will be projected. It is desirable that this set captures the most significant characteristics of the function. This can be achieved either using prior knowledge about data to select {g(x, zi )}m or using a very general family of functions and letting model i=1 selection automatically choose the appropriate set. Another way to choose g(x, z) relies on the form of the posterior. The posterior mean of a GP is often thought of as a linear combination of “basis functions”. For full GPs and other approximations such as [1, 2, 3, 4, 5, 6], basis functions must have the form of the input-domain covariance function. When using IDGPs, basis functions have the form of the inter-domain instance of the covariance function, and can therefore be adjusted by choosing g(x, z), independently of the input-domain covariance function. If two feature extraction functions g(·, ·) and h(·, ·) can be related by g(x, z) = h(x, z)r(z) for any function r(·), then both yield the same sparse GP model. This property can be used to simplify the expressions of the instances of the covariance function. In this work we use the same functional form for every feature, i.e. our function set is {g(x, zi )}m , i=1 but it is also possible to use sets with different functional forms for each inducing feature, i.e. {gi (x, zi )}m where each zi may even have a different size (dimension). In the sections below i=1 we will discuss different possible choices for g(x, z). 
3.3.1 Relation with Sparse GPs using pseudo-inputs The sparse GP using pseudo-inputs (SPGP) was introduced in [1] and was later renamed to Fully Independent Training Conditional (FITC) model to fit in the systematic framework of [10]. Since 4 the sparse model introduced in Section 3.2 also uses a fully independent training conditional, we will stick to the first name to avoid possible confusion. IDGP innovation with respect to SPGP consists in letting the pseudo data set lie in a different domain. If we set gSPGP (x, z) ≡ δ(x − z) where δ(·) is a Dirac delta, we force the pseudo data set to lie in the input domain. Thus there is no longer a transformed space and the original SPGP model is retrieved. In this setting, the inducing features of IDGP play the role of SPGP’s pseudo-inputs. 3.3.2 Relation with Sparse Multiscale GPs Sparse Multiscale GPs (SMGPs) are presented in [11]. Seeking to generalise the SPGP model with ARD SE covariance function, they propose to use a different set of length-scales for each basis function. The resulting model presents a defective variance that is healed by adding heteroscedastic white noise. SMGPs, including the variance improvement, can be derived in a principled way as IDGPs: D 1 gSMGP (x, z) ≡ D d=1 2π(c2 d D kSMGP (x, z ) = exp − d=1 D kSMGP (z, z ) = exp − d=1 − 2) d exp − d=1 (xd − µd )2 2cd2 D µ c with z = 2 d cd2 d=1 (µd − µd )2 2(c2 + cd2 − 2 ) d d (xd − µd )2 2(c2 − 2 ) d d (9) (10) D c2 + d d=1 2 d cd2 − 2. d (11) With this approximation, each basis function has its own centre µ = [µ1 , µ2 , . . . , µd ] and its own length-scales c = [c1 , c2 , . . . , cd ] , whereas global length-scales { d }D are shared by all d=1 inducing features. Equations (10) and (11) are derived from (4) and (5) using (1) and (9). The integrals defining kSMGP (·, ·) converge if and only if c2 ≥ 2 , ∀d , which suggests that other values, d d even if permitted in [11], should be avoided for the model to remain well defined. 3.3.3 Frequency Inducing Features GP If the target function can be described more compactly in the frequency domain than in the input domain, it can be advantageous to let the pseudo data set lie in the former domain. We will pursue that possibility for the case where the input domain covariance is the ARD SE. We will call the resulting sparse model Frequency Inducing Features GP (FIFGP). Directly applying the Fourier transform is not possible because the target function is not square 2 integrable (it has constant power σ0 everywhere, so (5) does not converge). We will workaround this by windowing the target function in the region of interest. It is possible to use a square window, but this results in the covariance being defined in terms of the complex error function, which is very slow to evaluate. Instead, we will use a Gaussian window3 . Since multiplying by a Gaussian in the input domain is equivalent to convolving with a Gaussian in the frequency domain, we will be working with a blurred version of the frequency space. This model is defined by: gFIF (x, z) ≡ D 1 D d=1 2πc2 d D kFIF (x, z ) = exp − d=1 D kFIF (z, z ) = exp − d=1 exp − d=1 x2 d cos ω0 + 2c2 d x2 + c2 ωd2 d d cos ω0 + 2(c2 + 2 ) d d D d=1 exp − d=1 D + exp 3 c4 (ωd + ωd )2 d − 2(2c2 + 2 ) d d d=1 x d ωd with z = ω D 2 d d=1 c2 + d 2 d (13) c4 (ωd − ωd )2 d cos(ω0 − ω0 ) 2(2c2 + 2 ) d d D cos(ω0 + ω0 ) d=1 2 d 2c2 + d 2. d A mixture of m Gaussians could also be used as window without increasing the complexity order. 
3.3.3 Frequency Inducing Features GP

If the target function can be described more compactly in the frequency domain than in the input domain, it can be advantageous to let the pseudo data set lie in the former domain. We will pursue that possibility for the case where the input-domain covariance is the ARD SE. We will call the resulting sparse model Frequency Inducing Features GP (FIFGP).

Directly applying the Fourier transform is not possible because the target function is not square integrable (it has constant power σ_0^2 everywhere, so (5) does not converge). We will work around this by windowing the target function in the region of interest. It is possible to use a square window, but this results in the covariance being defined in terms of the complex error function, which is very slow to evaluate. Instead, we will use a Gaussian window (a mixture of m Gaussians could also be used as the window without increasing the complexity order). Since multiplying by a Gaussian in the input domain is equivalent to convolving with a Gaussian in the frequency domain, we will be working with a blurred version of the frequency space. This model is defined by:

g_{FIF}(x, z) \equiv \frac{1}{\prod_{d=1}^{D}\sqrt{2\pi c_d^2}} \exp\Big(-\sum_{d=1}^{D} \frac{x_d^2}{2 c_d^2}\Big) \cos\Big(\omega_0 + \sum_{d=1}^{D} \omega_d x_d\Big), \quad \text{with } z = \omega \qquad (12)

k_{FIF}(x, z') = \exp\Big(-\sum_{d=1}^{D} \frac{x_d^2 + c_d^2 \ell_d^2 \omega_d'^2}{2(c_d^2 + \ell_d^2)}\Big) \cos\Big(\omega'_0 + \sum_{d=1}^{D} \frac{c_d^2 \omega'_d x_d}{c_d^2 + \ell_d^2}\Big) \prod_{d=1}^{D} \sqrt{\frac{\ell_d^2}{c_d^2 + \ell_d^2}} \qquad (13)

k_{FIF}(z, z') = \frac{1}{2} \exp\Big(-\sum_{d=1}^{D} \frac{c_d^2 \ell_d^2 (\omega_d^2 + \omega_d'^2)}{2(2c_d^2 + \ell_d^2)}\Big) \Big[ \exp\Big(-\sum_{d=1}^{D} \frac{c_d^4 (\omega_d - \omega'_d)^2}{2(2c_d^2 + \ell_d^2)}\Big) \cos(\omega_0 - \omega'_0) + \exp\Big(-\sum_{d=1}^{D} \frac{c_d^4 (\omega_d + \omega'_d)^2}{2(2c_d^2 + \ell_d^2)}\Big) \cos(\omega_0 + \omega'_0) \Big] \prod_{d=1}^{D} \sqrt{\frac{\ell_d^2}{2 c_d^2 + \ell_d^2}} \qquad (14)

The inducing features are ω = [ω_0, ω_1, ..., ω_D]^\top, where ω_0 is the phase and the remaining components are frequencies along each dimension. In this model, both the global length-scales {\ell_d}_{d=1}^{D} and the window length-scales {c_d}_{d=1}^{D} are shared, thus c'_d = c_d. Instances (13) and (14) are induced by (12) using (4) and (5).

3.3.4 Time-Frequency Inducing Features GP

Instead of using a single window to select the region of interest, it is possible to use a different window for each feature. We will use windows of the same size but different centres. The resulting model combines SPGP and FIFGP, so we will call it Time-Frequency Inducing Features GP (TFIFGP). It is defined by g_{TFIF}(x, z) \equiv g_{FIF}(x - \mu, \omega), with z = [\mu^\top\ \omega^\top]^\top. The implied inter-domain and transformed-domain instances of the covariance function are:

k_{TFIF}(x, z') = k_{FIF}(x - \mu', \omega'), \qquad k_{TFIF}(z, z') = k_{FIF}(z, z') \exp\Big(-\sum_{d=1}^{D} \frac{(\mu_d - \mu'_d)^2}{2(2c_d^2 + \ell_d^2)}\Big)

FIFGP is trivially obtained by setting every centre to zero, {\mu_i = 0}_{i=1}^{m}, whereas SPGP is obtained by setting the window length-scales c, the frequencies and the phases {\omega_i}_{i=1}^{m} to zero. If the window length-scales were individually adjusted, SMGP would be obtained. While TFIFGP has the modelling power of both FIFGP and SPGP, it might perform worse in practice due to it having roughly twice as many hyperparameters, thus making the optimisation problem harder. The same problem also exists in SMGP. A possible workaround is to initialise the hyperparameters using a simpler model, as done in [11] for SMGP, though we will not do this here.

4 Experiments

In this section we will compare the proposed approximations FIFGP and TFIFGP with the current state of the art, SPGP, on some large data sets, for the same number of inducing features/inputs and therefore roughly equal computational cost. Additionally, we provide results using a full GP, which is expected to provide top performance (though requiring an impractically large amount of computation). In all cases, the (input-domain) covariance function is the ARD SE (1).

We use four large data sets: Kin-40k and Pumadyn-32nm (describing the dynamics of a robot arm, used with SPGP in [1]; Kin-40k has 8 input dimensions and 10000/30000 train/test samples, Pumadyn-32nm has 32 input dimensions and 7168/1024 train/test samples, with exactly the same preprocessing and train/test splits as [1, 3], whose error measure is one half of the Normalized Mean Square Error defined below), as well as Elevators and Pole Telecomm (related to the control of the elevators of an F16 aircraft and to a telecommunications problem, used in [12, 13, 14]; Elevators has 17 non-constant input dimensions and 8752/7847 train/test samples, Pole Telecomm has 26 non-constant input dimensions and 10000/5000 train/test samples; both were downloaded from http://www.liaad.up.pt/∼ltorgo/Regression/datasets.html). Input dimensions that remained constant throughout the training set were removed. Input data was additionally centred for use with FIFGP (the remaining methods are translation invariant). Pole Telecomm outputs take discrete values in the 0–100 range, in multiples of 10; this was taken into account by using the corresponding quantization noise variance (10^2/12) as a lower bound for the noise hyperparameter (if this bound is removed, similar plots are obtained and, in particular, no overfitting is observed).

Hyperparameters are initialised as follows: σ_0^2 = (1/n) \sum_{j=1}^{n} y_j^2, σ^2 = σ_0^2/4, and each \ell_d to one half of the range spanned by the training data along dimension d. For SPGP, pseudo-inputs are initialised to a random subset of the training data; for FIFGP, the window size c is initialised to the standard deviation of the input data, frequencies are randomly chosen from a zero-mean Gaussian distribution with variance \ell_d^{-2}, and phases are obtained from a uniform distribution on [0, 2π). TFIFGP uses the same initialisation as FIFGP, with window centres set to zero. Final values are selected by evidence maximisation.
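A minimal sketch of this initialisation procedure is given below; the function name, the data layout (one row per sample) and the returned dictionary are our own choices for illustration, not the paper's.

```python
import numpy as np

def init_hyperparameters(X, y, m, rng=None):
    """Sketch of the initialisation described above; all names are illustrative."""
    rng = np.random.default_rng(0) if rng is None else rng
    n, D = X.shape
    sigma0_sq = np.mean(y ** 2)                      # signal power sigma_0^2
    noise_sq = sigma0_sq / 4.0                       # noise variance sigma^2
    ell = 0.5 * (X.max(axis=0) - X.min(axis=0))      # half the per-dimension data range
    # SPGP: pseudo-inputs are a random subset of the training inputs.
    pseudo_inputs = X[rng.choice(n, size=m, replace=False)]
    # FIFGP: shared window size, random frequencies and phases.
    c = X.std(axis=0)                                # window length-scales
    omega = rng.normal(0.0, 1.0 / ell, size=(m, D))  # frequencies ~ N(0, ell_d^{-2})
    phase = rng.uniform(0.0, 2.0 * np.pi, size=m)    # phases ~ U[0, 2*pi)
    # TFIFGP: same as FIFGP, with window centres initialised at zero.
    centres = np.zeros((m, D))
    return dict(sigma0_sq=sigma0_sq, noise_sq=noise_sq, ell=ell,
                pseudo_inputs=pseudo_inputs, c=c, omega=omega,
                phase=phase, centres=centres)
```

These values would then be refined by maximising the log-evidence, as stated in the text.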
Denoting the output average over the training set as \bar{y}, and the predictive mean and variance for test sample y_{*l} as \mu_{*l} and \sigma_{*l}^2 respectively, we define the following quality measures: the Normalized Mean Square Error (NMSE), \langle (y_{*l} - \mu_{*l})^2 \rangle / \langle (y_{*l} - \bar{y})^2 \rangle, and the Mean Negative Log-Probability (MNLP), \tfrac{1}{2}\langle (y_{*l} - \mu_{*l})^2/\sigma_{*l}^2 + \log \sigma_{*l}^2 + \log 2\pi \rangle, where \langle \cdot \rangle averages over the test set.

For Kin-40k (Fig. 1, top), all three sparse methods perform similarly, though for high sparseness (the most useful case) FIFGP and TFIFGP are slightly superior. In Pumadyn-32nm (Fig. 1, bottom), only 4 out of the 32 input dimensions are relevant to the regression task, so it can be used as a test of ARD capabilities. We follow [1] and use a full GP on a small subset of the training data (1024 data points) to obtain the initial length-scales. This allows better minima to be found during optimisation. Though all methods are able to find a good solution, FIFGP and especially TFIFGP are better in the sparser regime. Roughly the same considerations can be made about Pole Telecomm and Elevators (Fig. 2), but in these data sets the superiority of FIFGP and TFIFGP is more dramatic. Though not shown here, we have additionally tested these models on smaller, overfitting-prone data sets, and have found no noticeable overfitting even using m > n, despite the relatively high number of parameters being adjusted. This is in line with the results and discussion of [1].

[Figure 1: Performance of the compared methods on Kin-40k and Pumadyn-32nm. Panels: (a) Kin-40k NMSE (log-log plot), (b) Kin-40k MNLP (semilog plot), (c) Pumadyn-32nm NMSE (log-log plot), (d) Pumadyn-32nm MNLP (semilog plot). Each panel plots SPGP, FIFGP and TFIFGP against the number of inducing features / pseudo-inputs, with a full GP trained on all 10000 (Kin-40k) or 7168 (Pumadyn-32nm) data points as reference.]

[Figure 2: Performance of the compared methods on Elevators and Pole Telecomm. Panels: (a) Elevators NMSE (log-log plot), (b) Elevators MNLP (semilog plot), (c) Pole Telecomm NMSE (log-log plot), (d) Pole Telecomm MNLP (semilog plot). Each panel plots SPGP, FIFGP and TFIFGP against the number of inducing features / pseudo-inputs, with a full GP trained on all 8752 (Elevators) or 10000 (Pole Telecomm) data points as reference.]
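The two quality measures reported in Figures 1 and 2 are straightforward to compute; the following sketch (with illustrative names) assumes arrays of test targets, predictive means and predictive variances.

```python
import numpy as np

def nmse(y_test, mu_test, y_train_mean):
    # Normalized Mean Square Error: <(y* - mu*)^2> / <(y* - ybar)^2>.
    return np.mean((y_test - mu_test) ** 2) / np.mean((y_test - y_train_mean) ** 2)

def mnlp(y_test, mu_test, var_test):
    # Mean Negative Log-Probability of Gaussian predictions with mean mu* and variance sigma*^2.
    return 0.5 * np.mean((y_test - mu_test) ** 2 / var_test
                         + np.log(var_test) + np.log(2.0 * np.pi))
```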
5 Conclusions and extensions

In this work we have introduced IDGPs, which are able to combine representations of a GP in different domains, and have used them to extend SPGP to handle inducing features lying in a different domain. This provides a general framework for sparse models, which are defined by a feature extraction function. Using this framework, SMGPs can be reinterpreted as fully principled models using a transformed space of local features, without any need for post-hoc variance improvements. Furthermore, it is possible to develop new sparse models of practical use, such as the proposed FIFGP and TFIFGP, which are able to outperform the state-of-the-art SPGP on some large data sets, especially in high sparsity regimes.

Choosing a transformed space for the inducing features makes it possible to use domains where the target function can be expressed more compactly, or where the evidence (which is a function of the features) is easier to optimise. This added flexibility translates into a decoupling of the functional form of the input-domain covariance from the set of basis functions used to express the posterior mean.

IDGPs approximate full GPs optimally in the KL sense noted in Section 3.2, for a given set of inducing features. Using ML-II to select the inducing features means that models providing a good fit to the data are given preference over models that might approximate the full GP more closely. This, though rarely, might lead to harmful overfitting. To more faithfully approximate the full GP and avoid overfitting altogether, our proposal can be combined with the variational approach from [15], in which the inducing features would be regarded as variational parameters. This would result in more constrained models, which would be closer to the full GP but might show reduced performance.

We have explored the case of regression with Gaussian noise, which is analytically tractable, but it is straightforward to apply the same model to other tasks such as robust regression or classification, using approximate inference (see [16]). Also, IDGPs as a general tool can be used for other purposes, such as modelling noise in the frequency domain, aggregating data from different domains, or even imposing constraints on the target function.

Acknowledgments

We would like to thank the anonymous referees for helpful comments and suggestions. This work has been partly supported by the Spanish government under grant TEC2008-02473/TEC, and by the Madrid Community under grant S-505/TIC/0223.

References

[1] E. Snelson and Z. Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, pages 1259–1266. MIT Press, 2006.
[2] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural Information Processing Systems 13, pages 619–625. MIT Press, 2001.
[3] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the 9th International Workshop on AI Stats, 2003.
[4] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000.
[5] L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14(3):641–669, 2002.
[6] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, pages 682–688. MIT Press, 2001.
[7] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006.
[8] M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems 21, pages 57–64, 2009.
[9] E. Snelson. Flexible and efficient Gaussian process models for machine learning. PhD thesis, University of Cambridge, 2007.
[10] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.
[11] C. Walder, K. I. Kim, and B. Schölkopf. Sparse multiscale Gaussian process regression. In 25th International Conference on Machine Learning. ACM Press, New York, 2008.
[12] G. Potgieter and A. P. Engelbrecht. Evolving model trees for mining data sets with continuous-valued classes. Expert Systems with Applications, 35:1513–1532, 2007.
[13] L. Torgo and J. Pinto da Costa. Clustered partial linear regression. In Proceedings of the 11th European Conference on Machine Learning, pages 426–436. Springer, 2000.
[14] G. Potgieter and A. P. Engelbrecht. Pairwise classification as an ensemble technique. In Proceedings of the 13th European Conference on Machine Learning, pages 97–110. Springer-Verlag, 2002.
[15] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Workshop on AI Stats, 2009.
[16] A. Naish-Guzman and S. Holden. The generalized FITC approximation. In Advances in Neural Information Processing Systems 20, pages 1057–1064. MIT Press, 2008.

2 0.87829906 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals

Author: Sebastian Gerwinn, Philipp Berens, Matthias Bethge

Abstract: Second-order maximum-entropy models have recently gained much interest for describing the statistics of binary spike trains. Here, we extend this approach to take continuous stimuli into account as well. By constraining the joint secondorder statistics, we obtain a joint Gaussian-Boltzmann distribution of continuous stimuli and binary neural firing patterns, for which we also compute marginal and conditional distributions. This model has the same computational complexity as pure binary models and fitting it to data is a convex problem. We show that the model can be seen as an extension to the classical spike-triggered average/covariance analysis and can be used as a non-linear method for extracting features which a neural population is sensitive to. Further, by calculating the posterior distribution of stimuli given an observed neural response, the model can be used to decode stimuli and yields a natural spike-train metric. Therefore, extending the framework of maximum-entropy models to continuous variables allows us to gain novel insights into the relationship between the firing patterns of neural ensembles and the stimuli they are processing.

same-paper 3 0.8658604 97 nips-2009-Free energy score space

Author: Alessandro Perina, Marco Cristani, Umberto Castellani, Vittorio Murino, Nebojsa Jojic

Abstract: A score function induced by a generative model of the data can provide a feature vector of a fixed dimension for each data sample. Data samples themselves may be of differing lengths (e.g., speech segments, or other sequence data), but as a score function is based on the properties of the data generation process, it produces a fixed-length vector in a highly informative space, typically referred to as a “score space”. Discriminative classifiers have been shown to achieve higher performance in appropriately chosen score spaces than is achievable by either the corresponding generative likelihood-based classifiers, or the discriminative classifiers using standard feature extractors. In this paper, we present a novel score space that exploits the free energy associated with a generative model. The resulting free energy score space (FESS) takes into account latent structure of the data at various levels, and can be trivially shown to lead to classification performance that at least matches the performance of the free energy classifier based on the same generative model, and the same factorization of the posterior. We also show that in several typical vision and computational biology applications the classifiers optimized in FESS outperform the corresponding pure generative approaches, as well as a number of previous approaches to combining discriminating and generative models.

4 0.84945267 191 nips-2009-Positive Semidefinite Metric Learning with Boosting

Author: Chunhua Shen, Junae Kim, Lei Wang, Anton Hengel

Abstract: The learning of appropriate distance metrics is a critical problem in image classification and retrieval. In this work, we propose a boosting-based technique, termed B OOST M ETRIC, for learning a Mahalanobis distance metric. One of the primary difficulties in learning such a metric is to ensure that the Mahalanobis matrix remains positive semidefinite. Semidefinite programming is sometimes used to enforce this constraint, but does not scale well. B OOST M ETRIC is instead based on a key observation that any positive semidefinite matrix can be decomposed into a linear positive combination of trace-one rank-one matrices. B OOST M ETRIC thus uses rank-one positive semidefinite matrices as weak learners within an efficient and scalable boosting-based learning process. The resulting method is easy to implement, does not require tuning, and can accommodate various types of constraints. Experiments on various datasets show that the proposed algorithm compares favorably to those state-of-the-art methods in terms of classification accuracy and running time.

5 0.78620178 260 nips-2009-Zero-shot Learning with Semantic Output Codes

Author: Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, Tom M. Mitchell

Abstract: We consider the problem of zero-shot learning, where the goal is to learn a classifier f : X → Y that must predict novel values of Y that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes. We provide a formalism for this type of classifier and study its theoretical properties in a PAC framework, showing conditions under which the classifier can accurately predict novel classes. As a case study, we build a SOC classifier for a neural decoding task and show that it can often predict words that people are thinking about from functional magnetic resonance images (fMRI) of their neural activity, even without training examples for those words.

6 0.78487962 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

7 0.78411049 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

8 0.78178161 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

9 0.78134972 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

10 0.78109992 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

11 0.78077757 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

12 0.77973258 28 nips-2009-An Additive Latent Feature Model for Transparent Object Recognition

13 0.77867687 113 nips-2009-Improving Existing Fault Recovery Policies

14 0.77785528 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

15 0.77714944 226 nips-2009-Spatial Normalized Gamma Processes

16 0.77556038 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

17 0.7728045 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

18 0.77077377 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models

19 0.77046549 129 nips-2009-Learning a Small Mixture of Trees

20 0.76756293 71 nips-2009-Distribution-Calibrated Hierarchical Classification