nips nips2004 nips2004-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. [sent-5, score-0.251]
2 Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. [sent-6, score-0.101]
3 In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. [sent-7, score-0.191]
4 Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. [sent-8, score-0.153]
5 A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. [sent-9, score-0.34]
6 In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords. [sent-10, score-0.079]
7 1 Introduction Graphical models have become the basic framework for generative approaches to probabilistic modelling. [sent-11, score-0.022]
8 In particular models with latent variables have proven to be a powerful way to capture hidden structure in the data. [sent-12, score-0.311]
9 In this paper we study the important subclass of models with one layer of observed units and one layer of hidden units. [sent-13, score-0.115]
10 Two-layer models can be subdivided into various categories depending on a number of characteristics. [sent-14, score-0.022]
11 An important property in that respect is given by the semantics of the graphical model: either directed (Bayes net) or undirected (Markov random field). [sent-15, score-0.149]
12 Directed models enjoy important advantages such as easy (ancestral) sampling and easy handling of unobserved attributes under certain conditions. [sent-17, score-0.104]
13 Moreover, the semantics of directed models dictates marginal independence of the latent variables, which is a suitable modelling assumption for many datasets. [sent-18, score-0.423]
14 For important applications, such as latent semantic indexing (LSI), this drawback may have serious consequences since we would like to swiftly search for documents that are similar in the latent topic space. [sent-20, score-0.662]
15 A type of two-layer model that has not enjoyed much attention is the undirected analogue of the above described family of models. [sent-21, score-0.096]
16 Later papers have studied the harmonium under various names (the “combination machine” in [4] and the “restricted Boltzmann machine” in [5]) and turned it into a practical method by introducing efficient learning algorithms. [sent-23, score-0.079]
17 Harmoniums have only been considered in the context of discrete binary variables (in both hidden and observed layers), and more recently in the Gaussian case [7]. [sent-24, score-0.134]
18 The first contribution of this paper is to extend harmoniums into the exponential family which will make them much more widely applicable. [sent-25, score-0.258]
19 Harmoniums also enjoy a number of important advantages which are rather orthogonal to the properties of directed models. [sent-26, score-0.14]
20 Firstly, their product structure has the ability to produce distributions with very sharp boundaries. [sent-27, score-0.022]
21 Secondly, unlike directed models, inference in these models is very fast, due to the fact that the latent variables are conditionally independent given the observations. [sent-29, score-0.383]
22 Thirdly, the latent variables of harmoniums produce distributed representations of the input. [sent-30, score-0.447]
23 This is much more efficient than the “grandmother-cell” representation associated with mixture models where each observation is generated by a single latent variable. [sent-31, score-0.226]
24 Their most important disadvantage is the presence of a global normalization factor which complicates both the evaluation of probabilities of input vectors1 and learning free parameters from examples. [sent-32, score-0.038]
25 The second objective of this paper is to show that the introduction of contrastive divergence has greatly improved the efficiency of learning and paved the way for large scale applications. [sent-33, score-0.12]
26 Whether a directed two-layer model or a harmonium is more appropriate for a particular application is an interesting question that will depend on many factors such as prior (conditional) independence assumptions and/or computational issues such as efficiency of inference. [sent-34, score-0.179]
27 To expose the fact that harmoniums can be viable alternatives to directed models we introduce an entirely new probabilistic extension of latent semantic indexing (LSI) [3] and show its usefulness in various applications. [sent-35, score-0.556]
28 We do not want to claim superiority of harmoniums over their directed cousins, but rather that harmoniums enjoy rather different advantages that deserve more attention and that may one day be combined with the advantages of directed models. [sent-36, score-0.554]
29 2 Extending Harmoniums into the Exponential Family Let $x_i$, $i = 1..M_x$, be the set of observed random variables and $h_j$, $j = 1..M_h$, the hidden random variables. [sent-37, score-0.04] [sent-40, score-0.277]
31 Both x and h can take values in either the continuous or the discrete domain. [sent-44, score-0.022]
32 For some combinations of exponential family distributions it may be necessary to restrict the domain of $W_{ia}^{jb}$ in order to maintain normalizability of the joint probability distribution (e.g. [sent-53, score-1.038]
33 Although we could also have mutually coupled the observed variables (and/or the hidden variables) using similar interaction terms we refrain from doing so in order to keep the learning and inference procedures efficient. [sent-56, score-0.132]
34 Under the maximum likelihood objective the learning rules for the EFH are conceptually simple:
$\delta\theta_{ia} \propto \langle f_{ia}(x_i)\rangle_{\tilde p} - \langle f_{ia}(x_i)\rangle_{p}$, $\quad \delta\lambda_{jb} \propto \langle \hat B_{jb}(\hat\lambda_{jb})\rangle_{\tilde p} - \langle \hat B_{jb}(\hat\lambda_{jb})\rangle_{p}$  (8)
These learning rules are derived by taking derivatives of the log-likelihood objective using Eqn. [sent-60, score-1.256]
35 $\delta W_{ia}^{jb} \propto \langle f_{ia}(x_i)\,\hat B_{jb}(\hat\lambda_{jb})\rangle_{\tilde p} - \langle f_{ia}(x_i)\,\hat B_{jb}(\hat\lambda_{jb})\rangle_{p}$  (9)
where we have defined $\hat B_{jb} = \partial B_j(\hat\lambda_{jb})/\partial\hat\lambda_{jb}$ with $\hat\lambda_{jb}$ defined in Eqn. [sent-62, score-0.316]
36 In the case of binary harmoniums (restricted BMs) it was shown in [5] that contrastive divergence has the potential to greatly improve on the efficiency and reduce the variance of the estimates needed in the learning rules. [sent-67, score-0.278]
37 The averages over the model distribution in Eqns. 8, 9 are now replaced by averages $\langle\cdot\rangle_{p_{CD}}$ where $p_{CD}$ is the distribution of samples that resulted from the truncated Gibbs chains. [sent-70, score-0.038]
38 Due to space limitations we refer to [5] for more details on contrastive divergence learning3 . [sent-72, score-0.12]
39 Deterministic learning rules can also be derived straightforwardly by generalizing the results described in [12] to the exponential family. [sent-73, score-0.06]
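To make the learning procedure concrete, the following is a minimal sketch of one contrastive-divergence (CD-1) update for the special case of a binary harmonium (a restricted Boltzmann machine), the setting referred to in [5]; the NumPy implementation, array shapes and learning rate are illustrative assumptions rather than the authors' original code.

```python
import numpy as np

def cd1_step(X, W, theta, lam, lr=0.1, rng=None):
    """One CD-1 update for a binary harmonium (RBM special case of the EFH).

    X     : (n, Mx) batch of binary observations
    W     : (Mx, Mh) interaction weights
    theta : (Mx,) observed-unit biases,  lam : (Mh,) hidden-unit biases
    """
    rng = np.random.default_rng(0) if rng is None else rng
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    # Positive phase: p(h|x) factorizes, so the shifted parameters
    # (lam + X W) are computed with a single matrix product.
    ph_data = sigmoid(lam + X @ W)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Negative phase: one Gibbs step down to the observed layer and back up.
    px_model = sigmoid(theta + h_sample @ W.T)
    x_sample = (rng.random(px_model.shape) < px_model).astype(float)
    ph_model = sigmoid(lam + x_sample @ W)

    n = X.shape[0]
    # CD approximations to the gradients of Eqns. 8-9: data averages
    # minus averages under the truncated Gibbs chain.
    W += lr * (X.T @ ph_data - x_sample.T @ ph_model) / n
    theta += lr * (X.mean(0) - x_sample.mean(0))
    lam += lr * (ph_data.mean(0) - ph_model.mean(0))
    return W, theta, lam
```

The same structure carries over to other exponential family members: only the conditional distributions sampled in the positive and negative phases change.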
40 3 A Harmonium Model for Latent Semantic Indexing To illustrate the new possibilities that have opened up by extending harmoniums to the exponential family we will next describe a novel model for latent semantic indexing (LSI). [sent-74, score-0.585]
41 This will represent the undirected counterpart of pLSI [6] and LDA [1]. [sent-75, score-0.035]
42 One of the major drawbacks of LSI is that inherently discrete data (word counts) are being modelled with variables in the continuous domain. [sent-76, score-0.079]
43 The power of LSI on the other hand is that it provides an efficient mapping of the input data into a lower dimensional (continuous) latent space that has the effect of de-noising the input data and inferring semantic relationships among words. [sent-77, score-0.279]
$\sigma$, and $S_{\{x_a\}}[\gamma_a] \propto \exp\big(\sum_{a=1}^{D}\gamma_a x_a\big)$ is the softmax function defining a probability distribution over $x$. [sent-80, score-0.046]
Using Eqn. 6 we can easily deduce the marginal distribution of the input variables, $p(\{x_{ia}\}) \propto \exp\big[\sum_{ia}\alpha_{ia}x_{ia} + \tfrac{1}{2}\sum_j\big(\sum_{ia}W_{ia}^j x_{ia}\big)^2\big]$  (12). (Footnote 3: Non-believers in contrastive divergence are invited to simply run the Gibbs sampler to equilibrium before they do an update of the parameters.) [sent-82, score-1.302]
We observe that the role of the components $W_{ia}^j$ is that of templates or prototypes: input vectors $x_{ia}$ with large inner products $\sum_{ia} W_{ia}^j x_{ia}$ $\forall j$ will have high probability under this model. [sent-84, score-0.783]
47 This idea is supported by the result that the same model with Gaussian units in both hidden and observed layers is in fact equivalent to factor analysis [7]. [sent-86, score-0.097]
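As a small illustration of this template interpretation, the sketch below maps a bag-of-words document to its latent topic representation $h_j = \sum_{ia} W_{ia}^j x_{ia}$ and evaluates the unnormalized log-probability of Eqn. 12; the array layout (word variables by vocabulary entries by topics) is an assumption made for the example.

```python
import numpy as np

def latent_representation(x, W):
    """h_j = sum_{ia} W_{ia}^j x_{ia} for one document.

    x : (Mx, D) one-hot/count matrix over the D-word vocabulary
    W : (Mx, D, J) components W_{ia}^j for J latent topics
    """
    return np.einsum('ia,iaj->j', x, W)

def unnormalized_log_prob(x, W, alpha):
    """log p({x_ia}) up to a constant, following Eqn. 12:
    sum_{ia} alpha_{ia} x_{ia} + 0.5 * sum_j (sum_{ia} W_{ia}^j x_{ia})^2.
    Documents whose word vectors have large inner products with the
    components (templates) receive high scores."""
    h = latent_representation(x, W)
    return float(np.sum(alpha * x) + 0.5 * np.sum(h ** 2))
```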
48 3.1 Identifiability From the form of the marginal distribution in Eqn. 12 it is clear that several transformations leave the model invariant. [sent-88, score-0.044]
49 First we note that the components $W_{ia}^j$ can be rotated and mirrored arbitrarily in latent space: $W_{ia}^j \to \sum_k U^{jk} W_{ia}^k$ with $U^T U = I$. [sent-90, score-0.243]
50 Secondly, we note that the observed variables $x_{ia}$ satisfy a constraint, $\sum_a x_{ia} = 1\ \forall i$. [sent-91, score-0.524]
51 Taken together, this results in the following set of transformations, $W_{ia}^j \to \sum_k U^{jk}(W_{ia}^k + V_i^k)$, $\alpha_{ia} \to (\alpha_{ia} + \beta_i) - \sum_j\big(\sum_l V_l^j\big) W_{ia}^j$  (13), where $U^T U = I$. [sent-93, score-0.039]
52 Although these transformations leave the marginal distribution over the observable variables invariant, they do change the latent representation and as such may have an impact on retrieval performance (if we use a fixed similarity measure between topic representations of documents). [sent-94, score-0.424]
53 To fix the spurious degrees of freedom we have chosen to impose conditions on the representations in latent space: $h_j^n = \sum_{ia} W_{ia}^j x_{ia}^n$. [sent-95, score-0.599]
54 First, we center the latent representations, which has the effect of minimizing the “activity” of the latent variables and moving as much log-probability as possible to the constant component $\alpha_{ia}$. [sent-96, score-0.818]
55 Next we align the axes in latent space with the eigen-directions of the latent covariance matrix. [sent-97, score-0.408]
56 This has the effect of approximately decorrelating the marginal latent activities. [sent-98, score-0.248]
57 This follows because the marginal distribution in latent space can be approximated by $p(\{h_j\}) \approx \sum_n \prod_j \mathcal{N}_{h_j}\big[\sum_{ia} W_{ia}^j x_{ia}^n,\, 1\big]/N$, where we have used Eqn. 10 and replaced $p(\{x_{ia}\})$ by its empirical distribution. [sent-99, score-0.573] [sent-100, score-0.325]
59 This would not result in a more general model because the effect of this on the marginal distribution over the observed variables is given by $W_{ia}^j \to \sum_k K^{jk} W_{ia}^k$ with $K K^T = \Sigma$. [sent-103, score-0.167]
60 However, the extra freedom can be used to define axes in latent space for which the projected data become approximately independent and have the same scale in all directions. [sent-104, score-0.229]
61 Some spurious degrees of freedom remain since shifts $\beta_i$ and shifts $V_i^j$ that satisfy $\sum_i V_i^j = 0$ will not affect the projection into latent space. [sent-106, score-0.271]
62 One could decide to fix the remaining degrees of freedom by, for example, requiring that components are as small as possible in $L_2$ norm (subject to the constraint $\sum_i V_i^j = 0$), leading to the further shifts $W_{ia}^j \to W_{ia}^j - \frac{1}{D}\sum_a W_{ia}^j + \frac{1}{D M_x}\sum_{ia} W_{ia}^j$ and $\alpha_{ia} \to \alpha_{ia} - \frac{1}{D}\sum_a \alpha_{ia}$. [sent-107, score-0.35]
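A rough sketch of this "gauging" procedure, under the assumption that the latent representations of the N training documents are collected in an N x J matrix, is given below; the centering offset and the eigen-rotation are returned so they can be folded back into $\{\alpha_{ia}, W_{ia}^j\}$ if desired.

```python
import numpy as np

def gauge_latent(H):
    """Fix spurious degrees of freedom of the latent representations.

    H : (N, J) matrix with rows h^n_j = sum_{ia} W^j_{ia} x^n_{ia}.
    """
    mean = H.mean(axis=0)              # centering minimizes latent "activity"
    Hc = H - mean
    cov = Hc.T @ Hc / H.shape[0]       # latent covariance matrix
    eigvals, U = np.linalg.eigh(cov)   # columns of U are eigen-directions
    H_gauged = Hc @ U                  # align axes: approximately decorrelated
    return H_gauged, mean, U
```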
63 Figure 1: Precision-recall curves when the query was (a) entire documents, (b) 1 keyword, (c) 2 keywords, for the EFH with and without 10 MF iterations, LSI, TF-IDF weighted words and random guessing. [sent-138, score-0.126]
64 PR curves with more keywords looked very similar to (c). [sent-139, score-0.052]
65 A marker at position k (counted from the left along a curve) indicates that $2^{k-1}$ documents were retrieved. [sent-140, score-0.112]
66 4 Experiments Newsgroups: We have used the reduced version of the “20newsgroups” dataset prepared for MATLAB by Roweis. [sent-141, score-0.025]
67 An EFH model with 10 latent variables was trained on 12000 training cases using stochastic gradient descent on mini-batches of 1000 randomly chosen documents (training time approximately 1 hour on a 2GHz PC). [sent-144, score-0.373]
68 To test the quality of the trained model we mapped the remaining 4242 query documents into latent space using $h_j = \sum_{ia} W_{ia}^j x_{ia}$, where $\{W_{ia}^j, \alpha_{ia}\}$ were “gauged” as in Eqns. [sent-148, score-1.089]
69 Precision-recall curves were computed by comparing training and query documents using the usual “cosine coefficient” (cosine of the angle between documents) and reporting success when the retrieved document was in the same domain as the query (results averaged over all queries). [sent-148, score-0.315]
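A minimal sketch of this retrieval evaluation, computing the cosine coefficient between latent representations and counting a retrieved document as a success when it comes from the same newsgroup domain as the query; function and variable names are assumptions for illustration.

```python
import numpy as np

def precision_recall_at_k(H_train, H_query, y_train, y_query, k):
    """Average precision and recall when k documents are retrieved per query."""
    A = H_train / np.linalg.norm(H_train, axis=1, keepdims=True)
    B = H_query / np.linalg.norm(H_query, axis=1, keepdims=True)
    sims = B @ A.T                                  # cosine coefficients
    top = np.argsort(-sims, axis=1)[:, :k]          # k best matches per query
    hits = (y_train[top] == y_query[:, None]).sum(axis=1)
    relevant = (y_train[None, :] == y_query[:, None]).sum(axis=1)
    precision = hits / k
    recall = hits / np.maximum(relevant, 1)
    return precision.mean(), recall.mean()
```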
70 In figure 1a we compare the results with LSI (also 10 dimensions) [3], for which we preprocessed the data in the standard way (x → log(1 + x) and entropy weighting of the words), and with similarity in word space using TF-IDF weighting of the words. [sent-149, score-0.074]
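For reference, a sketch of the two baseline weightings mentioned here (the standard log-entropy preprocessing used for LSI and a common TF-IDF variant); the exact formulas the authors used are not reproduced in this excerpt, so these should be read as typical choices rather than the paper's precise ones.

```python
import numpy as np

def log_entropy_weight(counts):
    """x -> log(1 + x) with entropy weighting of the words (LSI preprocessing)."""
    X = np.log1p(counts)                              # (n_docs, vocab)
    p = counts / np.maximum(counts.sum(axis=0), 1)    # word distribution over docs
    with np.errstate(divide='ignore', invalid='ignore'):
        ent = -np.nansum(np.where(p > 0, p * np.log(p), 0.0), axis=0)
    g = 1.0 - ent / np.log(counts.shape[0])           # global entropy weight per word
    return X * g

def tfidf_weight(counts):
    """TF-IDF weighting of a document-term count matrix."""
    df = (counts > 0).sum(axis=0)
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return counts * idf
```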
71 In figure 1b,c we show PR curves when only 1 or 2 keywords were provided corresponding to randomly observed words in the query document. [sent-150, score-0.153]
72 The EFH model allows a principled way to deal with unobserved entries by inferring them using the model (in all other methods we insert 0 for the unobserved entries which corresponds to ignoring them). [sent-151, score-0.044]
73 We have used a few iterations of mean field to achieve that: $\hat x_{ia} \to \exp\big[\sum_{jb}\big(\sum_k W_{ia}^k W_{jb}^k + \alpha_{jb}\big)\hat x_{jb}\big]/\gamma_i$, where $\gamma_i$ is a normalization constant and where the $\hat x_{ia}$ represent probabilities: $\hat x_{ia} \in [0,1]$, $\sum_{a=1}^{D}\hat x_{ia} = 1\ \forall i$. [sent-152, score-1.357]
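A sketch of such a mean-field fill-in loop is given below; it is derived from the marginal in Eqn. 12 (observed word variables stay clamped to their one-hot values, unobserved ones are repeatedly updated with a softmax), and the shapes, initialization and stopping rule are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np

def mean_field_impute(x_hat, W, alpha, observed, n_iters=10):
    """Mean-field imputation of unobserved word variables in the EFH-LSI model.

    x_hat    : (Mx, D) word probabilities; observed rows are one-hot and clamped
    W        : (Mx, D, J) components W_{ia}^j,  alpha : (Mx, D) biases
    observed : (Mx,) boolean mask marking the clamped (keyword) variables
    """
    def softmax(s):
        e = np.exp(s - s.max())
        return e / e.sum()

    for _ in range(n_iters):
        h = np.einsum('ia,iaj->j', x_hat, W)                   # current latent activity
        for i in np.where(~observed)[0]:
            h_rest = h - np.einsum('a,aj->j', x_hat[i], W[i])  # drop own contribution
            scores = alpha[i] + W[i] @ h_rest                  # (D,) unnormalized log-probs
            x_hat[i] = softmax(scores)
            h = h_rest + W[i].T @ x_hat[i]                     # refresh latent activity
    return x_hat
```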
74 In all cases we find that without any preprocessing or weighting EFH still outperforms the other methods except when large numbers of documents were retrieved. [sent-154, score-0.13]
75 In the next experiment we compared performance of EFH, LSI and LDA by training models on a random subset of 15430 documents with 5 and 10 latent dimensions (this was found to be close to optimal for LDA). [sent-155, score-0.338]
76 The EFH and LSI models were trained as in the previous experiment while the training and testing details for LDA can be found in [9]. [sent-156, score-0.022]
77 For the remaining test documents we clamped a varying number of observed words and [sent-157, score-0.231]
78 05 10 0 0 2 4 6 number of keywords (a) 8 10 12 6 4 2 0 −2 −4 −6 (b) Retrieval Performance for NIPS Dataset fraction documents also retrieved by TF−IDF fraction observed words correctly predicted Document Reconstruction for Newsgroup Dataset 0. [sent-167, score-0.354]
79 asked the models to predict the remaining observed words in the documents by computing the probabilities for all words in the vocabulary to be present and ranking them (see previous paragraph for details). [sent-177, score-0.262]
80 By comparing the list of the R remaining observed words in the document with the top-R ranked inferred words we computed the fraction of correctly predicted words. [sent-178, score-0.171]
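A sketch of this reconstruction measure: rank every vocabulary word by its predicted probability of being present and count how many of the R held-out observed words appear among the top R; names are illustrative assumptions.

```python
import numpy as np

def fraction_correctly_predicted(word_scores, heldout_ids):
    """word_scores : (V,) predicted presence probabilities over the vocabulary
    heldout_ids  : indices of the R remaining observed words in the document"""
    R = len(heldout_ids)
    top_R = np.argsort(-word_scores)[:R]
    return len(set(top_R.tolist()) & set(heldout_ids)) / R
```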
81 The results are shown in figure 2a as a function of the number of clamped words. [sent-179, score-0.053]
82 To provide anecdotal evidence that EFH can infer semantic relationships we clamped the words ’drive’ ’driver’ and ’car’ which resulted in: ’car’ ’drive’ ’engine’ ’dealer’ ’honda’ ’bmw’ ’driver’ ’oil’ as the most probable words in the documents. [sent-180, score-0.227]
83 NIPS Conference Papers: Next we trained a model with 5 latent dimensions on the NIPS dataset, which has a large vocabulary size (13649 words) and contains 1740 documents of which 1557 were used for training and 183 for testing. [sent-182, score-0.339]
84 Due to the lack of document labels it is hard to assess the quality of the trained model. [sent-186, score-0.041]
85 We chose to compare performance on document retrieval with the “golden standard”: cosine similarity in TF-IDF weighted word space. [sent-187, score-0.139]
86 In figure 2c we depict the fraction of documents retrieved by EFH that was also retrieved by TF-IDF as we vary the number of retrieved documents. [sent-188, score-0.359]
87 This correlation is indeed very high but note that EFH computes similarity in a 5-D space while TF-IDF computes similarity in a 13649-D space. [sent-189, score-0.034]
88 5 Discussion The main point of this paper was to show that there is a flexible family of 2-layer probabilistic models that represents a viable alternative to 2-layer causal (directed) models. [sent-190, score-0.117]
89 These models enjoy very different properties and can be trained efficiently using contrastive divergence. [sent-191, score-0.156]
90 As an example we have studied an EFH alternative for latent semantic indexing where we have found that the EFH has a number of favorable properties: fast inference allowing fast document retrieval and a principled approach to retrieval with keywords. [sent-192, score-0.464]
91 Previous examples of EFH include the original harmonium [10], Gaussian variants thereof [7], and the PoT model [13] which couples a gamma distribution with the covariance of a Gaussian. [sent-194, score-0.079]
92 Some exponential family extensions of general Boltzmann machines were proposed in [2], [14], but they do not have the bipartite structure that we study here. [sent-200, score-0.117]
93 While the components of the Gaussian-multinomial EFH act as prototypes or templates for highly probable input vectors, the components of the PoT act as constraints (i.e. input vectors with large projections onto a component are assigned lower probability). [sent-201, score-0.042]
94 It has proven difficult to jointly model both prototypes and constraints in this formalism except for the fully Gaussian case [11]. [sent-211, score-0.042]
95 A future challenge is therefore to start the modelling process with the desired non-linearity and to subsequently introduce auxiliary variables to facilitate inference and learning. [sent-212, score-0.096]
96 Unsupervised learning of distributions of binary vectors using 2-layer networks. [sent-244, score-0.022]
97 In Advances in Neural Information Processing Systems, volume 4, pages 912–919, 1992. [sent-245, score-0.022]
98 In Proceedings of the 21st International Conference on Machine Learning, volume 21, 2004. [sent-267, score-0.022]
99 In Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 20, 2004. [sent-274, score-0.022]
100 Learning sparse topographic representations with products of student-t distributions. [sent-304, score-0.046]
wordName wordTfidf (topN-words)
[('jb', 0.449), ('wia', 0.449), ('efh', 0.33), ('ia', 0.325), ('xia', 0.22), ('latent', 0.204), ('hj', 0.193), ('lsi', 0.172), ('fia', 0.158), ('harmoniums', 0.158), ('documents', 0.112), ('gjb', 0.106), ('contrastive', 0.092), ('directed', 0.08), ('harmonium', 0.079), ('semantic', 0.075), ('retrieved', 0.074), ('lda', 0.065), ('family', 0.061), ('mx', 0.057), ('variables', 0.057), ('clamped', 0.053), ('mf', 0.052), ('keywords', 0.052), ('newsgroups', 0.049), ('indexing', 0.048), ('marginal', 0.044), ('welling', 0.044), ('tfidf', 0.042), ('enjoy', 0.042), ('mh', 0.042), ('prototypes', 0.042), ('bj', 0.042), ('boltzmann', 0.042), ('document', 0.041), ('xi', 0.04), ('bjb', 0.04), ('plsi', 0.04), ('exponential', 0.039), ('jk', 0.039), ('words', 0.039), ('retrieval', 0.038), ('undirected', 0.035), ('query', 0.035), ('lsa', 0.034), ('vij', 0.034), ('semantics', 0.034), ('driver', 0.031), ('gibbs', 0.031), ('precision', 0.03), ('hidden', 0.028), ('exp', 0.028), ('representations', 0.028), ('divergence', 0.028), ('observed', 0.027), ('nhj', 0.026), ('pcd', 0.026), ('pot', 0.026), ('fraction', 0.025), ('dataset', 0.025), ('freedom', 0.025), ('recall', 0.023), ('hinton', 0.023), ('pc', 0.023), ('vocabulary', 0.023), ('distributions', 0.022), ('models', 0.022), ('unobserved', 0.022), ('cosine', 0.022), ('layers', 0.022), ('discrete', 0.022), ('volume', 0.022), ('rules', 0.021), ('resulted', 0.021), ('shifts', 0.021), ('word', 0.021), ('independence', 0.02), ('inference', 0.02), ('factor', 0.02), ('sampler', 0.02), ('eld', 0.019), ('modelling', 0.019), ('topic', 0.019), ('layer', 0.019), ('products', 0.018), ('fa', 0.018), ('advantages', 0.018), ('domain', 0.018), ('weighting', 0.018), ('disadvantage', 0.018), ('xa', 0.018), ('averages', 0.017), ('transformations', 0.017), ('viable', 0.017), ('bipartite', 0.017), ('secondly', 0.017), ('causal', 0.017), ('hn', 0.017), ('similarity', 0.017), ('ai', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
2 0.18231337 113 nips-2004-Maximum-Margin Matrix Factorization
Author: Nathan Srebro, Jason Rennie, Tommi S. Jaakkola
Abstract: We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them. 1
3 0.14817195 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
4 0.14008252 44 nips-2004-Conditional Random Fields for Object Recognition
Author: Ariadna Quattoni, Michael Collins, Trevor Darrell
Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.
5 0.10646337 125 nips-2004-Multiple Relational Embedding
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1
6 0.092317268 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
7 0.089073107 87 nips-2004-Integrating Topics and Syntax
8 0.08825814 124 nips-2004-Multiple Alignment of Continuous Time Series
9 0.061006952 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
10 0.057097886 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks
11 0.055983983 197 nips-2004-Two-Dimensional Linear Discriminant Analysis
12 0.048940077 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
13 0.047454432 185 nips-2004-The Convergence of Contrastive Divergences
14 0.046306372 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
15 0.0452957 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
16 0.042843014 145 nips-2004-Parametric Embedding for Class Visualization
17 0.04079821 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
18 0.039962139 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
19 0.039712701 165 nips-2004-Semi-supervised Learning on Directed Graphs
20 0.038418248 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
topicId topicWeight
[(0, -0.135), (1, 0.038), (2, -0.021), (3, -0.055), (4, 0.047), (5, 0.045), (6, -0.079), (7, 0.107), (8, 0.198), (9, 0.031), (10, 0.096), (11, -0.125), (12, 0.037), (13, 0.131), (14, 0.202), (15, 0.036), (16, 0.041), (17, 0.196), (18, -0.091), (19, -0.034), (20, 0.046), (21, 0.085), (22, 0.052), (23, -0.031), (24, -0.007), (25, 0.028), (26, -0.019), (27, -0.036), (28, -0.066), (29, -0.11), (30, 0.013), (31, 0.006), (32, -0.044), (33, -0.079), (34, 0.128), (35, 0.023), (36, 0.169), (37, -0.053), (38, -0.11), (39, -0.105), (40, -0.107), (41, -0.018), (42, -0.023), (43, 0.026), (44, -0.033), (45, 0.058), (46, -0.024), (47, 0.029), (48, -0.128), (49, 0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.94185102 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
2 0.5225755 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
3 0.51090807 113 nips-2004-Maximum-Margin Matrix Factorization
Author: Nathan Srebro, Jason Rennie, Tommi S. Jaakkola
Abstract: We present a novel approach to collaborative prediction, using low-norm instead of low-rank factorizations. The approach is inspired by, and has strong connections to, large-margin linear discrimination. We show how to learn low-norm factorizations by solving a semi-definite program, and discuss generalization error bounds for them. 1
4 0.50166696 124 nips-2004-Multiple Alignment of Continuous Time Series
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili
Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1 A Profile Model for Continuous Data When observing multiple time series generated by a noisy, stochastic process, large systematic sources of variability are often present. For example, within a set of nominally replicate time series, the time axes can be variously shifted, compressed and expanded, in complex, non-linear ways. Additionally, in some circumstances, the scale of the measured data can vary systematically from one replicate to the next, and even within a given replicate. We propose a Continuous Profile Model (CPM) for simultaneously analyzing a set of such time series. In this model, each time series is generated as a noisy transformation of a single latent trace. The latent trace is an underlying, noiseless representation of the set of replicated, observable time series. Output time series are generated from this model by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as in an HMM. Each hidden state corresponds to a particular location in the latent trace, and the emitted value from the state depends on the value of the latent trace at that position. To account for changes in the amplitude of the signals across and within replicates, the latent time states are augmented by a set of scale states, which control how the emission signal will be scaled relative to the value of the latent trace. During training, the latent trace is learned, as well as the transition probabilities controlling the Markovian evolution of the scale and time states and the overall noise level of the observed data. After training, the latent trace learned by the model represents a higher resolution ’fusion’ of the experimental replicates. Figure 1 illustrate the model in action. Unaligned, Linear Warp Alignment and CPM Alignment Amplitude 40 30 20 10 0 50 Amplitude 40 30 20 10 Amplitude 0 30 20 10 0 Time a) b) Figure 1: a) Top: ten replicated speech energy signals as described in Section 4), Middle: same signals, aligned using a linear warp with an offset, Bottom: aligned with CPM (the learned latent trace is also shown in cyan). b) Speech waveforms corresponding to energy signals in a), Top: unaligned originals, Bottom: aligned using CPM. 2 Defining the Continuous Profile Model (CPM) The CPM is generative model for a set of K time series, xk = (xk , xk , ..., xk k ). The 1 2 N temporal sampling rate within each xk need not be uniform, nor must it be the same across the different xk . Constraints on the variability of the sampling rate are discussed at the end of this section. 
For notational convenience, we henceforth assume N k = N for all k, but this is not a requirement of the model. The CPM is set up as follows: We assume that there is a latent trace, z = (z1 , z2 , ..., zM ), a canonical representation of the set of noisy input replicate time series. Any given observed time series in the set is modeled as a non-uniformly subsampled version of the latent trace to which local scale transformations have been applied. Ideally, M would be infinite, or at least very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used M = (2 + )N (double the resolution, plus some slack on each end) in our experiments and found this to be sufficient with < 0.2. Because the resolution of the latent trace is higher than that of the observed time series, experimental time can be made effectively to speed up or slow down by advancing along the latent trace in larger or smaller jumps. The subsampling and local scaling used during the generation of each observed time series are determined by a sequence of hidden state variables. Let the state sequence for observation k be π k . Each state in the state sequence maps to a time state/scale state pair: k πi → {τik , φk }. Time states belong to the integer set (1..M ); scale states belong to an i ordered set (φ1 ..φQ ). (In our experiments we have used Q=7, evenly spaced scales in k logarithmic space). States, πi , and observation values, xk , are related by the emission i k probability distribution: Aπi (xk |z) ≡ p(xk |πi , z, σ, uk ) ≡ N (xk ; zτik φk uk , σ), where σ k i i i i is the noise level of the observed data, N (a; b, c) denotes a Gaussian probability density for a with mean b and standard deviation c. The uk are real-valued scale parameters, one per observed time series, that correct for any overall scale difference between time series k and the latent trace. To fully specify our model we also need to define the state transition probabilities. We define the transitions between time states and between scale states separately, so that k Tπi−1 ,πi ≡ p(πi |πi−1 ) = p(φi |φi−1 )pk (τi |τi−1 ). The constraint that time must move forward, cannot stand still, and that it can jump ahead no more than Jτ time states is enforced. (In our experiments we used Jτ = 3.) As well, we only allow scale state transitions between neighbouring scale states so that the local scale cannot jump arbitrarily. These constraints keep the number of legal transitions to a tractable computational size and work well in practice. Each observed time series has its own time transition probability distribution to account for experiment-specific patterns. Both the time and scale transition probability distributions are given by multinomials: dk , if a − b = 1 1 k d2 , if a − b = 2 k . p (τi = a|τi−1 = b) = . . k d , if a − b = J τ Jτ 0, otherwise p(φi = a|φi−1 s0 , if D(a, b) = 0 s1 , if D(a, b) = 1 = b) = s1 , if D(a, b) = −1 0, otherwise where D(a, b) = 1 means that a is one scale state larger than b, and D(a, b) = −1 means that a is one scale state smaller than b, and D(a, b) = 0 means that a = b. The distributions Jτ are constrained by: i=1 dk = 1 and 2s1 + s0 = 1. i Jτ determines the maximum allowable instantaneous speedup of one portion of a time series relative to another portion, within the same series or across different series. 
However, the length of time for which any series can move so rapidly is constrained by the length of the latent trace; thus the maximum overall ratio in speeds achievable by the model between any two entire time series is given by min(Jτ , M ). N After training, one may examine either the latent trace or the alignment of each observable time series to the latent trace. Such alignments can be achieved by several methods, including use of the Viterbi algorithm to find the highest likelihood path through the hidden states [1], or sampling from the posterior over hidden state sequences. We found Viterbi alignments to work well in the experiments below; samples from the posterior looked quite similar. 3 Training with the Expectation-Maximization (EM) Algorithm As with HMMs, training with the EM algorithm (often referred to as Baum-Welch in the context of HMMs [1]), is a natural choice. In our model the E-Step is computed exactly using the Forward-Backward algorithm [1], which provides the posterior probability over k states for each time point of every observed time series, γs (i) ≡ p(πi = s|x) and also the pairwise state posteriors, ξs,t (i) ≡ p(πi−1 = s, πi = t|xk ). The algorithm is modified only in that the emission probabilities depend on the latent trace as described in Section 2. The M-Step consists of a series of analytical updates to the various parameters as detailed below. Given the latent trace (and the emission and state transition probabilities), the complete log likelihood of K observed time series, xk , is given by Lp ≡ L + P. L is the likelihood term arising in a (conditional) HMM model, and can be obtained from the Forward-Backward algorithm. It is composed of the emission and state transition terms. P is the log prior (or penalty term), regularizing various aspects of the model parameters as explained below. These two terms are: K N N L≡ log Aπi (xk |z) + i log p(π1 ) + τ −1 K (zj+1 − zj )2 + P ≡ −λ (1) i=2 i=1 k=1 k log Tπi−1 ,πi j=1 k log D(dk |{ηv }) + log D(sv |{ηv }), v (2) k=1 where p(π1 ) are priors over the initial states. The first term in Equation 2 is a smoothing k penalty on the latent trace, with λ controlling the amount of smoothing. ηv and ηv are Dirichlet hyperprior parameters for the time and scale state transition probability distributions respectively. These ensure that all non-zero transition probabilities remain non-zero. k For the time state transitions, v ∈ {1, Jτ } and ηv corresponds to the pseudo-count data for k the parameters d1 , d2 . . . dJτ . For the scale state transitions, v ∈ {0, 1} and ηv corresponds to the pseudo-count data for the parameters s0 and s1 . Letting S be the total number of possible states, that is, the number of elements in the cross-product of possible time states and possible scale states, the expected complete log likelihood is: K S K p k k γs (1) log T0,s
5 0.47740585 125 nips-2004-Multiple Relational Embedding
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1
6 0.46985686 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
7 0.43951878 71 nips-2004-Generalization Error Bounds for Collaborative Prediction with Low-Rank Matrices
8 0.4154228 41 nips-2004-Comparing Beliefs, Surveys, and Random Walks
9 0.3822563 44 nips-2004-Conditional Random Fields for Object Recognition
10 0.34599632 87 nips-2004-Integrating Topics and Syntax
11 0.30858555 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
12 0.28149289 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
13 0.2771419 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
14 0.26776659 197 nips-2004-Two-Dimensional Linear Discriminant Analysis
15 0.24771343 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
16 0.24417587 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
17 0.24293581 193 nips-2004-Theories of Access Consciousness
18 0.23963174 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference
19 0.23739208 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
20 0.22685806 74 nips-2004-Harmonising Chorales by Probabilistic Inference
topicId topicWeight
[(9, 0.02), (13, 0.063), (15, 0.099), (24, 0.011), (26, 0.058), (31, 0.384), (33, 0.127), (39, 0.052), (50, 0.026), (51, 0.014)]
simIndex simValue paperId paperTitle
1 0.84277147 137 nips-2004-On the Adaptive Properties of Decision Trees
Author: Clayton Scott, Robert Nowak
Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. 1
2 0.8411001 164 nips-2004-Semi-supervised Learning by Entropy Minimization
Author: Yves Grandvalet, Yoshua Bengio
Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1
same-paper 3 0.79963684 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
4 0.75229478 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
Author: Anton Schwaighofer, Volker Tresp, Kai Yu
Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nystr¨ m method, which results in a complex, data o driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance. 1
5 0.57285386 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul
Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1
6 0.57204854 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.56822336 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
8 0.55098152 87 nips-2004-Integrating Topics and Syntax
9 0.54199576 158 nips-2004-Sampling Methods for Unsupervised Learning
10 0.53738129 33 nips-2004-Brain Inspired Reinforcement Learning
11 0.53469026 113 nips-2004-Maximum-Margin Matrix Factorization
12 0.53121889 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination
13 0.53120977 159 nips-2004-Schema Learning: Experience-Based Construction of Predictive Action Models
14 0.53093195 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
15 0.53077829 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
16 0.52811724 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
17 0.52631551 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
18 0.52603996 23 nips-2004-Analysis of a greedy active learning strategy
19 0.52484399 70 nips-2004-Following Curved Regularized Optimization Solution Paths
20 0.52477753 77 nips-2004-Hierarchical Clustering of a Mixture Model