nips nips2004 nips2004-62 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby
Abstract: Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis. 1
Reference: text
sentIndex sentText sentNum sentScore
1 This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. [sent-2, score-0.579]
2 The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. [sent-3, score-0.765]
3 The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. [sent-4, score-0.453]
4 We quantify the performance of our method on two text datasets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis. [sent-5, score-0.267]
5 1 Introduction Embeddings of objects in a low-dimensional space are an important tool in unsupervised learning and in preprocessing data for supervised learning algorithms. [sent-6, score-0.126]
6 All these methods operate on objects of a single type endowed with a measure of similarity or dissimilarity. [sent-9, score-0.22]
7 However, real-world data often involve objects of several very different types without a natural measure of similarity. [sent-10, score-0.221]
8 A measure of similarity between words and pictures is difficult to define objectively. [sent-12, score-0.309]
9 Defining a useful measure of similarity is even difficult for some homogeneous data types, such as pictures or sounds, where the physical properties (pitch and frequency in sounds, color and luminosity distribution in images) do not directly reflect the semantic properties we are interested in. [sent-13, score-0.177]
10 The current paper addresses this problem by creating embeddings from statistical associations. [sent-14, score-0.275]
11 The idea is to find a Euclidean embedding in low dimension that represents the empirical co-occurrence statistics of two variables. [sent-15, score-0.608]
12 We focus on modeling the conditional probability of one variable given the other, since in the data we analyze (documents and words, authors and terms) there is a clear asymmetry which suggests a conditional model. [sent-16, score-0.163]
13 For example, pictures which frequently appear with a given text are expected to have some common, locally low-dimensional characteristic that allows them to be mapped to adjacent points. [sent-21, score-0.227]
14 We can thus rely on co-occurrences to embed different entity types, such as words and pictures, genes and expression arrays, into the same subspace. [sent-22, score-0.231]
15 Once this embedding is achieved it also naturally defines a measure of similarity between entities of the same kind (such as images), induced by their other corresponding modality (such as text), providing a meaningful similarity measure between images. [sent-23, score-0.677]
16 Embedding of heterogeneous objects is performed in statistics using correspondence analysis (CA), a variant of canonical correlation analysis for count data [8]. [sent-24, score-0.41]
17 These are related to Euclidean distances when the embeddings are constrained to be normalized. [sent-25, score-0.529]
18 Statistical embedding of same-type objects was recently studied by Hinton and Roweis [9]. [sent-27, score-0.579]
19 Their approach is similar to ours in that it assumes that distances induce probabilistic relations between objects. [sent-28, score-0.171]
20 However, we do not assume that distances are given in advance, but instead we derive them from the empirical co-occurrence data. [sent-29, score-0.218]
21 This partition function equals Z(x) = Σ_y p̄(y) e^{−d²_{x,y}} and is thus the empirical mean of the exponentiated distances from x (therefore Z(x) ≤ 1). [sent-36, score-0.252]
22 This model directly relates the ratio p(y|x)/p̄(y) to the distance between the embedded x and y. [sent-37, score-0.126]
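Below is a minimal Python sketch of the conditional model in the two sentences above, assuming the CODE form p(y|x) = p̄(y) e^{−d²_{x,y}} / Z(x); the array names and toy sizes are our own illustration, not the authors' code.

```python
import numpy as np

def code_conditional(phi, psi, p_y):
    """Conditional model p(y|x) = p̄(y) exp(-d²_{x,y}) / Z(x) (illustrative sketch).

    phi : (|X|, d) array, embeddings of the x objects
    psi : (|Y|, d) array, embeddings of the y objects
    p_y : (|Y|,)  array, empirical marginal p̄(y)
    """
    # squared Euclidean distances between every embedded x and every embedded y
    d2 = ((phi[:, None, :] - psi[None, :, :]) ** 2).sum(axis=-1)
    unnorm = p_y[None, :] * np.exp(-d2)      # p̄(y) e^{-d²_{x,y}}
    Z = unnorm.sum(axis=1)                   # Z(x) = Σ_y p̄(y) e^{-d²_{x,y}}, so Z(x) <= 1
    return unnorm / Z[:, None], Z

rng = np.random.default_rng(0)
phi, psi = rng.normal(size=(5, 2)), rng.normal(size=(7, 2))
p_y = np.full(7, 1.0 / 7)
p_y_given_x, Z = code_conditional(phi, psi, p_y)
assert np.allclose(p_y_given_x.sum(axis=1), 1.0) and np.all(Z <= 1.0)
```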
23 As a result of the fast decay, the closest objects dominate the distribution. [sent-41, score-0.126]
24 This will be maximized when all distances are zero. [sent-50, score-0.171]
25 This trivial solution is avoided because of the regularization term Σ_x p̄(x) log Z(x), which acts to increase distances between x and y points. [sent-51, score-0.205]
26 To find the optimal φ, ψ for a given embedding dimension d, we used a conjugate gradient ascent algorithm with random restarts. [sent-57, score-0.518]
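The sketch below is a rough stand-in for that optimization: it minimizes the negative log-likelihood Σ_{x,y} p̄(x,y) d²_{x,y} + Σ_x p̄(x) log Z(x) with SciPy's conjugate-gradient routine, using numerical gradients and random restarts. The count matrix, function names, and restart scheme are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def fit_code(counts, d=2, n_restarts=3, seed=0):
    """Fit CODE-style embeddings phi, psi from a co-occurrence count matrix (sketch)."""
    p_xy = counts / counts.sum()              # empirical joint p̄(x, y)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    nx, ny = p_xy.shape

    def neg_loglik(theta):
        phi = theta[:nx * d].reshape(nx, d)
        psi = theta[nx * d:].reshape(ny, d)
        d2 = ((phi[:, None, :] - psi[None, :, :]) ** 2).sum(axis=-1)
        logZ = np.log((p_y[None, :] * np.exp(-d2)).sum(axis=1))
        # negative log-likelihood up to a constant: Σ p̄(x,y) d² + Σ p̄(x) log Z(x)
        return (p_xy * d2).sum() + (p_x * logZ).sum()

    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):               # random restarts, as mentioned above
        theta0 = 0.1 * rng.normal(size=(nx + ny) * d)
        res = minimize(neg_loglik, theta0, method="CG")
        if best is None or res.fun < best.fun:
            best = res
    return best.x[:nx * d].reshape(nx, d), best.x[nx * d:].reshape(ny, d)

counts = np.random.default_rng(1).integers(0, 20, size=(8, 12)).astype(float)
phi, psi = fit_code(counts, d=2)
```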
27 3 Relation to Other Methods Embedding the rows and columns of a contingency table into a low dimensional Euclidean space is related to statistical methods for the analysis of heterogeneous data. [sent-59, score-0.253]
28 Another closely related method is Correspondence analysis [8], which uses a different normalization scheme, and aims to model χ2 distances between rows and columns of p(x, y). [sent-62, score-0.269]
29 The goal of all the above methods is to maximize the correlation coefficient between the embeddings of X and Y. [sent-63, score-0.343]
30 First, note that the correlation coefficient is invariant under affine transformations, so we can focus on centered solutions with unit covariance (E[φ(x)] = 0 and Cov(φ(x)) = Cov(ψ(y)) = I). [sent-65, score-0.178]
31 In this case, the correlation coefficient is given by the following expression (we focus on d = 1 for simplicity): ρ(φ(x), ψ(y)) = Σ_{x,y} p̄(x,y) φ(x)ψ(y) = −(1/2) Σ_{x,y} p̄(x,y) d²_{x,y} + 1.  (4) [sent-66, score-0.103]
32 Maximizing the correlation is therefore equivalent to minimizing the mean distance across all pairs. [sent-67, score-0.117]
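A small numeric check of Eq. (4), assuming embeddings that are centered and scaled to unit variance under the marginals of a toy joint distribution; both the distribution and the embeddings below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p_xy = rng.random((4, 6))
p_xy /= p_xy.sum()                                # toy joint distribution p̄(x, y)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

phi, psi = rng.normal(size=4), rng.normal(size=6)
phi = phi - p_x @ phi                             # center under the x-marginal
phi = phi / np.sqrt(p_x @ phi**2)                 # unit variance
psi = psi - p_y @ psi
psi = psi / np.sqrt(p_y @ psi**2)

rho = (p_xy * np.outer(phi, psi)).sum()           # Σ p̄(x,y) φ(x) ψ(y)
d2 = (phi[:, None] - psi[None, :]) ** 2           # d²_{x,y} for d = 1
assert np.isclose(rho, 1.0 - 0.5 * (p_xy * d2).sum())
```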
33 However, CCA forces both embeddings to be centered and with a unity covariance matrix, whereas our method introduces a global regularization term related to the partition function. [sent-69, score-0.414]
34 Our method is additionally related to exponential models of contingency tables, where the counts are approximated by a normalized exponential of a low-rank matrix [7]. [sent-70, score-0.301]
35 The current approach can be understood as a constrained version of these models where the expression in the exponent is constrained to have a geometric interpretation. [sent-71, score-0.21]
36 A well-known geometrically oriented embedding method is multidimensional scaling (MDS) [4], whose standard version applies to same-type objects with predefined distances. [sent-72, score-0.681]
37 MDS embedding of heterogeneous entities was studied in the context of modeling ranking data (see [4], section 7). [sent-73, score-0.541]
38 4 Semidefinite Representation. The optimal embeddings φ, ψ may be found using unconstrained optimization techniques. [sent-76, score-0.309]
39 However, the Euclidean distances used in the embedding space also allow us to reformulate the problem as constrained convex optimization over the cone of positive semidefinite (PSD) matrices [14]. [sent-77, score-0.774]
40 We start by showing that for embeddings with dimension d = |X| + |Y |, maximizing (2) is equivalent to minimizing a certain convex non-linear function over PSD matrices. [sent-78, score-0.408]
41 Consider the matrix A whose columns are all the embedded vectors φ and ψ. [sent-79, score-0.14]
42 The matrix G ≡ AᵀA is the Gram matrix of the dot products between embedding vectors. [sent-80, score-0.515]
43 The converse is also true: any PSD matrix of rank ≤ d can be factorized as AᵀA, where A is an embedding matrix of dimension d. [sent-82, score-0.685]
44 The distance between two columns in A is linearly related to the Gram matrix via d²_{ij} = g_ii + g_jj − 2g_ij. [sent-83, score-0.147]
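A short sketch of this Gram-matrix identity with arbitrary toy dimensions; it illustrates only the relation d²_{ij} = g_ii + g_jj − 2g_ij, not the full PSD program.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 10))                  # columns = embedded φ and ψ vectors, d = 3
G = A.T @ A                                   # Gram matrix of dot products (PSD)

g = np.diag(G)
D2 = g[:, None] + g[None, :] - 2.0 * G        # d²_{ij} = g_ii + g_jj - 2 g_ij

# agrees with squared distances computed directly from the columns of A
D2_direct = ((A.T[:, None, :] - A.T[None, :, :]) ** 2).sum(axis=-1)
assert np.allclose(D2, D2_direct)
```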
45 The minimized function is convex, since it is the sum of a linear function of G and log-sum-exp functions of affine expressions in G, which are also convex (see the Geometric Programming section in [2]). [sent-86, score-0.103]
46 We conclude that when the embedding dimension is of size d = |X| + |Y | the optimization problem of Eq. [sent-88, score-0.552]
47 Embedding into a low dimension requires constraining the rank, but this is difficult since the problem is no longer convex in the general case. [sent-100, score-0.176]
48 One approach to obtaining low rank solutions is to optimize over a full rank G and then project it into a lower dimension via spectral decomposition as in [14] or classical MDS. [sent-101, score-0.361]
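A minimal sketch of that projection step, assuming a full-rank PSD solution G is already in hand; the eigenvalue clipping is our own numerical guard, not something specified in the paper.

```python
import numpy as np

def project_to_rank(G, d):
    """Project a PSD matrix G onto its top-d eigenspace, returning a d-dim embedding A
    with AᵀA approximately equal to G (as in classical MDS)."""
    w, V = np.linalg.eigh(G)                  # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:d]             # indices of the d largest eigenvalues
    w_top = np.clip(w[top], 0.0, None)        # clip tiny negatives from round-off
    return (V[:, top] * np.sqrt(w_top)).T     # shape (d, n); columns are embedded points

rng = np.random.default_rng(3)
B = rng.normal(size=(6, 6))
G = B @ B.T                                   # toy full-rank PSD matrix
A2 = project_to_rank(G, 2)                    # 2-D coordinates for the 6 points
```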
49 Small values of Tr(G) are expected to correspond to sparse eigenvalue sets and thus penalize high rank solutions. [sent-104, score-0.138]
50 We believe that PSD algorithms may turn out to be more efficient in cases where relatively high dimensional embeddings are sought. [sent-106, score-0.31]
51 Furthermore, under the PSD formulation it is easy to introduce additional constraints, for example on distances between subsets of points (as in [14]), and on marginals of the distribution. [sent-107, score-0.221]
52 Here we present embedding of words and documents and authors and documents. [sent-109, score-0.88]
53 The left panel shows document embeddings for NIPS 15-17, with colors to indicate the document topic. [sent-113, score-0.523]
54 Other panels show embedded words and documents for the areas specified by rectangles. [sent-114, score-0.411]
55 Figure (b) shows the border region between algorithms and architecture (AA) and learning theory (LT) (bottom rectangle in (a)). [sent-115, score-0.104]
56 Figure (c) shows the border region between neuroscience (NS) and biological vision (VB) (upper rectangle in (a)). [sent-116, score-0.104]
57 Figure (d) shows mainly control and navigation (CN) documents (left rectangle in (a)). [sent-117, score-0.261]
58 Left panel shows embeddings for authors (red crosses) and words (blue dots). [sent-119, score-0.5]
59 Other panels show embedded authors (only first 100 shown) and words for the areas specified by rectangles. [sent-120, score-0.302]
60 1 NIPS Database Embedding algorithms may be used to study the structure of document databases. [sent-123, score-0.124]
61 Here we used the NIPS 0-12 database supplied by Roweis 2 , and augmented it with data from NIPS volumes 13-17 3 . [sent-124, score-0.106]
62 We first used CODE to embed documents and words into R2 . [sent-127, score-0.398]
63 It can be seen that documents with similar topics are mapped next to each other (e. [sent-129, score-0.261]
64 Furthermore, words characterize the topics of their neighboring documents. [sent-132, score-0.132]
65 We could now embed authors and words into R2 , by using CODE to model p(word|author). [sent-134, score-0.289]
66 It can be seen that authors are indeed mapped next to terms relevant to their work, and that authors dealing with similar domains are also mapped together. [sent-136, score-0.304]
67 This illustrates how co-occurrence of words and authors may be used to induce a metric on authors alone. [sent-137, score-0.318]
69 Figure 4: (a) Document purity measure for the embedding of newsgroups crypt, electronics and med, as a function of neighborhood size. [sent-149, score-0.749]
70 (b) The doc − doc measure averaged over 7 newsgroup sets. [sent-150, score-0.776]
71 (c) The word − doc measure for CODE and CA algorithms, for 7 newsgroup sets. [sent-158, score-0.544]
72 We first removed the 100 most frequent words, and then selected the next k most frequent words for different values of k (see below). [sent-163, score-0.236]
73 The resulting words and documents were embedded with CODE, Correspondence Analysis (CA), SVD, IsoMap and classical MDS 4 . [sent-164, score-0.411]
74 CODE was used to model the distribution of words given documents p(w|d). [sent-165, score-0.334]
75 All methods were tested under several normalization schemes, including document sum normalization and TFIDF. [sent-166, score-0.221]
76 An embedding of words and documents is expected to map documents with similar semantics together, and to map words close to documents which are related to the meaning of the word. [sent-168, score-1.358]
77 We next test how our embeddings perform with respect to these requirements. [sent-169, score-0.275]
78 To represent the meaning of a document we use its corresponding newsgroup. [sent-170, score-0.124]
79 Note that this information is used only for evaluation and not in constructing the embedding itself. [sent-171, score-0.453]
80 To measure how well similar documents are mapped together we define a purity measure, which we denote doc − doc. [sent-172, score-0.696]
81 For each embedded document, we measure the fraction of its neighbors that are from the same newsgroup. [sent-173, score-0.135]
82 To measure how documents are related to their neighboring words, we use a measure denoted word−doc. [sent-175, score-0.295]
83 For each document d we look at its n nearest words and calculate their probability under the document’s newsgroup, normalized by their prior. [sent-176, score-0.256]
84 This is repeated for neighborhood sizes smaller than 100 and averaged over documents. [sent-177, score-0.248]
85 The word − doc measure was only compared with CA, since this is the only method that provides joint embeddings. [sent-178, score-0.485]
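A minimal sketch of the doc−doc purity measure described above, assuming document embeddings and integer newsgroup labels as inputs; argument names and the toy data are ours.

```python
import numpy as np

def doc_doc_purity(doc_emb, labels, n_neighbors):
    """Fraction of each document's n nearest embedded documents that share its
    newsgroup label, averaged over documents (the doc-doc measure above)."""
    d2 = ((doc_emb[:, None, :] - doc_emb[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)              # a document is not its own neighbor
    nearest = np.argsort(d2, axis=1)[:, :n_neighbors]
    same = labels[nearest] == labels[:, None]
    return same.mean()

rng = np.random.default_rng(4)
emb = rng.normal(size=(30, 2))
labels = rng.integers(0, 3, size=30)
print(doc_doc_purity(emb, labels, n_neighbors=5))
```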
86 Figure 4 compares the performance of CODE with that of the other methods with respect to the doc − doc and word − doc measures. [sent-179, score-1.008]
87 4 CA embedding followed the standard procedure described in [8]. [sent-181, score-0.453]
88 We tested both an SVD over the count matrix and an SVD over the log of the count plus one; only the latter is described here because it performed better than the former. [sent-183, score-0.148]
89 For MDS, the distances between objects were calculated as the dot product between their count vectors (we also tested Euclidean distances). 6 Discussion. We presented a method for embedding objects of different types into the same low-dimensional Euclidean space. [sent-184, score-1.097]
90 This embedding can be used to reveal low dimensional structures when distance measures between objects are unknown. [sent-185, score-0.742]
91 Furthermore, the embedding induces a meaningful metric also between objects of the same type, which could be used, for example, to embed images based on accompanying text, and derive the semantic distance between images. [sent-186, score-0.692]
92 Co-occurrence embedding should not be restricted to pairs of variables, but can be extended to multivariate joint distributions, when these are available. [sent-187, score-0.492]
93 It can also be augmented to use distances between same-type objects when these are known. [sent-188, score-0.327]
94 An important question in embedding objects is whether the embedding is unique. [sent-189, score-1.032]
95 In other words, can there be two non-isometric embeddings which are obtained at the optimum of the problem? [sent-190, score-0.332]
96 This question is related to the rigidity and uniqueness of embeddings on graphs, specifically complete bipartite graphs in our case. [sent-191, score-0.346]
97 A theorem of Bolker and Roth [1] asserts that for such graphs with at least 5 vertices on each side, embeddings are rigid, i. [sent-192, score-0.275]
98 This suggests that the CODE embeddings for |X|, |Y | ≥ 5 are unique (at least locally) for d ≤ 3. [sent-195, score-0.275]
99 Maximum likelihood in these models is a non-trivial constrained optimization problem, and may be approached using the semidefinite representation outlined here. [sent-199, score-0.113]
100 A rank minimization heuristic with application to minimum order system approximation. [sent-233, score-0.105]
wordName wordTfidf (topN-words)
[('embedding', 0.453), ('doc', 0.31), ('embeddings', 0.275), ('documents', 0.202), ('psd', 0.196), ('distances', 0.171), ('crypt', 0.141), ('code', 0.132), ('words', 0.132), ('mds', 0.131), ('objects', 0.126), ('document', 0.124), ('euclidean', 0.111), ('rank', 0.105), ('newsgroup', 0.098), ('authors', 0.093), ('correspondence', 0.085), ('dxy', 0.085), ('electronics', 0.083), ('pictures', 0.083), ('ns', 0.079), ('word', 0.078), ('embedded', 0.077), ('svd', 0.075), ('xy', 0.075), ('med', 0.074), ('semide', 0.073), ('nips', 0.072), ('aa', 0.069), ('correlation', 0.068), ('convex', 0.068), ('cca', 0.067), ('purity', 0.067), ('isomap', 0.066), ('dimension', 0.065), ('embed', 0.064), ('lt', 0.062), ('ca', 0.061), ('mapped', 0.059), ('rectangle', 0.059), ('measure', 0.058), ('optimum', 0.057), ('roweis', 0.057), ('contingency', 0.056), ('shawe', 0.056), ('multidimensional', 0.054), ('heterogeneous', 0.052), ('mappings', 0.052), ('frequent', 0.052), ('marginals', 0.05), ('distance', 0.049), ('vb', 0.049), ('constrained', 0.048), ('geometric', 0.048), ('empirical', 0.047), ('neighborhood', 0.046), ('border', 0.045), ('low', 0.043), ('solutions', 0.043), ('text', 0.043), ('newsgroups', 0.042), ('graphics', 0.042), ('locally', 0.042), ('count', 0.041), ('database', 0.04), ('cov', 0.039), ('associations', 0.039), ('joint', 0.039), ('canonical', 0.038), ('types', 0.037), ('relation', 0.036), ('structures', 0.036), ('unity', 0.036), ('bipartite', 0.036), ('volumes', 0.036), ('entities', 0.036), ('rigid', 0.036), ('similarity', 0.036), ('related', 0.035), ('expression', 0.035), ('dimensional', 0.035), ('conditional', 0.035), ('tested', 0.035), ('mdp', 0.034), ('sounds', 0.034), ('optimization', 0.034), ('regularization', 0.034), ('relationships', 0.034), ('partition', 0.034), ('penalize', 0.033), ('vapnik', 0.033), ('columns', 0.032), ('singh', 0.032), ('hinton', 0.032), ('matrix', 0.031), ('normalization', 0.031), ('exponent', 0.031), ('rewards', 0.031), ('likelihood', 0.031), ('augmented', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby
Abstract: Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis. 1
2 0.25893986 145 nips-2004-Parametric Embedding for Class Visualization
Author: Tomoharu Iwata, Kazumi Saito, Naonori Ueda, Sean Stromsten, Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract: In this paper, we propose a new method, Parametric Embedding (PE), for visualizing the posteriors estimated over a mixture model. PE simultaneously embeds both objects and their classes in a low-dimensional space. PE takes as input a set of class posterior vectors for given data points, and tries to preserve the posterior structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier’s behavior in supervised, semi-supervised and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of web pages, semi-supervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, Latent Dirichlet Allocation. 1
3 0.15900888 125 nips-2004-Multiple Relational Embedding
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1
4 0.14723663 160 nips-2004-Seeing through water
Author: Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai
Abstract: We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted image patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow computations and solves the “leakage” problem inherent in recent manifold embedding techniques. 1
5 0.13149762 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul
Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1
6 0.12460911 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
7 0.10926453 87 nips-2004-Integrating Topics and Syntax
8 0.098123752 54 nips-2004-Distributed Information Regularization on Graphs
9 0.088602059 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
10 0.088203095 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
11 0.087564625 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
12 0.080027476 131 nips-2004-Non-Local Manifold Tangent Learning
13 0.079834543 100 nips-2004-Learning Preferences for Multiclass Problems
14 0.078751341 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
15 0.078361191 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
16 0.076956287 163 nips-2004-Semi-parametric Exponential Family PCA
17 0.07407102 68 nips-2004-Face Detection --- Efficient and Rank Deficient
18 0.073049828 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
19 0.071227737 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification
20 0.068958163 79 nips-2004-Hierarchical Eigensolver for Transition Matrices in Spectral Methods
topicId topicWeight
[(0, -0.248), (1, 0.068), (2, -0.029), (3, -0.096), (4, -0.015), (5, 0.125), (6, -0.124), (7, 0.066), (8, 0.162), (9, 0.039), (10, 0.024), (11, -0.168), (12, -0.116), (13, -0.061), (14, -0.034), (15, 0.145), (16, 0.175), (17, -0.094), (18, -0.169), (19, 0.149), (20, -0.028), (21, -0.105), (22, 0.085), (23, -0.083), (24, -0.082), (25, -0.081), (26, 0.001), (27, -0.068), (28, -0.034), (29, -0.09), (30, 0.047), (31, -0.021), (32, 0.109), (33, -0.046), (34, 0.006), (35, -0.026), (36, -0.163), (37, 0.015), (38, -0.011), (39, 0.019), (40, 0.119), (41, 0.042), (42, 0.004), (43, 0.091), (44, 0.045), (45, 0.042), (46, 0.027), (47, -0.134), (48, 0.099), (49, -0.001)]
simIndex simValue paperId paperTitle
same-paper 1 0.96930218 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby
Abstract: Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis. 1
2 0.81953937 145 nips-2004-Parametric Embedding for Class Visualization
Author: Tomoharu Iwata, Kazumi Saito, Naonori Ueda, Sean Stromsten, Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract: In this paper, we propose a new method, Parametric Embedding (PE), for visualizing the posteriors estimated over a mixture model. PE simultaneously embeds both objects and their classes in a low-dimensional space. PE takes as input a set of class posterior vectors for given data points, and tries to preserve the posterior structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier’s behavior in supervised, semi-supervised and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of web pages, semi-supervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, Latent Dirichlet Allocation. 1
3 0.56725115 87 nips-2004-Integrating Topics and Syntax
Author: Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Abstract: Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or semantics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively. 1
4 0.54573619 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul
Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1
5 0.52930313 125 nips-2004-Multiple Relational Embedding
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1
6 0.50292295 160 nips-2004-Seeing through water
7 0.47066614 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
8 0.44666809 127 nips-2004-Neighbourhood Components Analysis
9 0.4034299 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's
10 0.38305354 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection
11 0.3800562 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
12 0.3726691 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
13 0.36734089 54 nips-2004-Distributed Information Regularization on Graphs
14 0.32509756 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
15 0.32374221 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
16 0.31546584 150 nips-2004-Proximity Graphs for Clustering and Manifold Learning
17 0.31514987 131 nips-2004-Non-Local Manifold Tangent Learning
18 0.31478652 100 nips-2004-Learning Preferences for Multiclass Problems
19 0.29561844 186 nips-2004-The Correlated Correspondence Algorithm for Unsupervised Registration of Nonrigid Surfaces
20 0.29370999 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
topicId topicWeight
[(9, 0.015), (13, 0.11), (15, 0.107), (26, 0.062), (31, 0.051), (33, 0.188), (35, 0.034), (39, 0.029), (50, 0.034), (56, 0.234), (87, 0.03)]
simIndex simValue paperId paperTitle
1 0.91388047 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception
Author: Alan Stocker, Eero P. Simoncelli
Abstract: It has been demonstrated that basic aspects of human visual motion perception are qualitatively consistent with a Bayesian estimation framework, where the prior probability distribution on velocity favors slow speeds. Here, we present a refined probabilistic model that can account for the typical trial-to-trial variabilities observed in psychophysical speed perception experiments. We also show that data from such experiments can be used to constrain both the likelihood and prior functions of the model. Specifically, we measured matching speeds and thresholds in a two-alternative forced choice speed discrimination task. Parametric fits to the data reveal that the likelihood function is well approximated by a LogNormal distribution with a characteristic contrast-dependent variance, and that the prior distribution on velocity exhibits significantly heavier tails than a Gaussian, and approximately follows a power-law function.
2 0.87561655 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
Author: Erik G. Learned-miller, Parvez Ahammad
Abstract: The correction of bias in magnetic resonance images is an important problem in medical image processing. Most previous approaches have used a maximum likelihood method to increase the likelihood of the pixels in a single image by adaptively estimating a correction to the unknown image bias field. The pixel likelihoods are defined either in terms of a pre-existing tissue model, or non-parametrically in terms of the image’s own pixel values. In both cases, the specific location of a pixel in the image is not used to calculate the likelihoods. We suggest a new approach in which we simultaneously eliminate the bias from a set of images of the same anatomy, but from different patients. We use the statistics from the same location across different images, rather than within an image, to eliminate bias fields from all of the images simultaneously. The method builds a “multi-resolution” non-parametric tissue model conditioned on image location while eliminating the bias fields associated with the original image set. We present experiments on both synthetic and real MR data sets, and present comparisons with other methods. 1
same-paper 3 0.87298626 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby
Abstract: Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis. 1
4 0.79290569 145 nips-2004-Parametric Embedding for Class Visualization
Author: Tomoharu Iwata, Kazumi Saito, Naonori Ueda, Sean Stromsten, Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract: In this paper, we propose a new method, Parametric Embedding (PE), for visualizing the posteriors estimated over a mixture model. PE simultaneously embeds both objects and their classes in a low-dimensional space. PE takes as input a set of class posterior vectors for given data points, and tries to preserve the posterior structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier’s behavior in supervised, semi-supervised and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of web pages, semi-supervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, Latent Dirichlet Allocation. 1
5 0.72686976 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
6 0.72625804 125 nips-2004-Multiple Relational Embedding
7 0.72582769 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
8 0.72522241 163 nips-2004-Semi-parametric Exponential Family PCA
9 0.72514898 102 nips-2004-Learning first-order Markov models for control
10 0.72470194 69 nips-2004-Fast Rates to Bayes for Kernel Machines
11 0.72460592 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss
12 0.72333831 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
13 0.72331303 54 nips-2004-Distributed Information Regularization on Graphs
14 0.72266734 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
15 0.72241676 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
16 0.72101909 70 nips-2004-Following Curved Regularized Optimization Solution Paths
17 0.72035283 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations
18 0.72010398 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve
19 0.71804583 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
20 0.71671653 124 nips-2004-Multiple Alignment of Continuous Time Series