nips nips2004 nips2004-125 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. [sent-5, score-0.357]
2 Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. [sent-6, score-0.801]
3 When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. [sent-7, score-0.34]
4 If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. [sent-8, score-0.116]
5 We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. [sent-9, score-0.093]
6 1 Introduction Finding a representation for data in a low-dimensional Euclidean space is useful both for visualization and as a prelude to other kinds of data analysis. [sent-10, score-0.131]
7 The common goal underlying the many different methods that accomplish this task (such as ISOMAP [1], LLE [2], stochastic neighbor embedding [3] and others) is to extract the usually small number of factors that are responsible for the variability in the data. [sent-11, score-0.481]
8 In making the underlying factors explicit, these methods help to focus on the kind of variability that is important and provide representations that make it easier to interpret and manipulate the data in reasonable ways. [sent-12, score-0.313]
9 Most dimensionality reduction methods are unsupervised, so there is no way of guiding the method towards modes of variability that are of particular interest to the user. [sent-13, score-0.248]
10 There is also no way of providing hints when the true underlying factors are too subtle to be discovered by optimizing generic criteria such as maximization of modeled variance in PCA, or preservation of local geometry in LLE. [sent-14, score-0.064]
11 Both these difficulties can be alleviated by allowing the user to provide more information than just the raw data points or a single set of pairwise similarities between data points. [sent-15, score-0.094]
12 Nonlinear methods have been shown to find embeddings that nicely reflect the variability in the data caused by variation in face identity, pose, position, or lighting effects. [sent-17, score-0.312]
13 However, it is not possible to tell these methods to extract a particular single factor for the purpose of, say, intelligent image manipulation or pose identification, because the extracted factors are intermingled and may be represented simultaneously across all latent space dimensions. [sent-18, score-0.603]
14 Here, we consider the problem of learning a latent representation for data based on knowledge that is provided by a user in the form of several different similarity relations. [sent-19, score-0.664]
15 Our method, multiple relational embedding (MRE), finds an embedding that uses a single latent data representation, but weights the available latent space dimensions differently to allow the latent space to model the multiple different similarity relations. [sent-20, score-1.982]
16 By labeling a subset of the data according to the kind of variability one is interested in, one can encourage the model to reserve a subset of the latent dimensions for this kind of variability. [sent-21, score-0.862]
17 The model, in turn, returns a “handle” to that latent space in the form of a corresponding learned latent space metric. [sent-22, score-0.892]
18 They suggest pre-processing the input data by learning a metric in input space that makes the data respect user defined grouping constraints. [sent-27, score-0.19]
19 That is, a user needs to define a grouping structure for the input data by informing the model which data-points belong together. [sent-30, score-0.165]
20 Here, we consider a rather different approach, where the side-information can be encoded in the form of similarity relations. [sent-31, score-0.213]
21 First, this allows arbitrary continuous degrees of freedom to constrain the low-dimensional embeddings. [sent-32, score-0.129]
22 Second, our model can deal with several, possibly conflicting, kinds of side-information. [sent-33, score-0.104]
23 MRE dynamically “allocates” latent space dimensions to model different user-provided similarity relations. [sent-34, score-0.718]
24 So inconsistent relations are modeled in disjoint subspaces, and consistent relations can share dimensions. [sent-35, score-0.182]
25 This scheme of sharing the dimensions of a common latent space is reminiscent of the INDSCAL method [9] that has been popular in the psychometric literature. [sent-36, score-0.538]
26 A quite different way to extend unsupervised models has recently been introduced by [10] and [11], where the authors propose ways to extract common factors that underlie two or more different datasets, with possibly different dimensionalities. [sent-37, score-0.142]
27 While these methods rely on a supervision signal containing information about correspondences between data-points in different datasets, MRE can be used to discover correspondences between different datasets using almost no pre-defined grouping constraints. [sent-38, score-0.312]
28 2 Multiple Relational Embedding In the following we derive MRE as an extension to stochastic neighbor embedding (SNE). [sent-39, score-0.233]
29 Let X denote the matrix of latent space elements arranged column-wise, and let σ^2 be some real-valued neighborhood variance or “kernel bandwidth”. [sent-40, score-0.469]
30 SNE finds a low-dimensional representation for a set of input data points y^i (i = 1, ..., N) by first constructing a similarity matrix P with entries [sent-41, score-0.061]
31 $$P_{ij} := \frac{\exp\left(-\frac{1}{\sigma^2}\|y^i - y^j\|^2\right)}{\sum_k \exp\left(-\frac{1}{\sigma^2}\|y^i - y^k\|^2\right)} \qquad (1)$$ and then minimizing (w.r.t. the latent representatives x^i, i = 1, ..., N) the mismatch between P and the corresponding latent similarity matrix Q(X), defined by [sent-44, score-0.258]
32 $$Q_{ij}(X) := \frac{\exp\left(-\|x^i - x^j\|^2\right)}{\sum_k \exp\left(-\|x^i - x^k\|^2\right)}. \qquad (2)$$ [sent-50, score-0.64]
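To make the two definitions concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) of Eqs. (1) and (2). For convenience the data points are stored row-wise rather than column-wise, and the function names are ours.

```python
import numpy as np

def sne_similarities(Y, sigma2):
    """P of Eq. (1): row-normalized Gaussian similarities of the input points (rows of Y)."""
    D = np.square(Y[:, None, :] - Y[None, :, :]).sum(-1)  # pairwise squared distances
    A = np.exp(-D / sigma2)
    np.fill_diagonal(A, 0.0)                              # exclude self-similarity before normalizing
    return A / A.sum(axis=1, keepdims=True)

def latent_similarities(X):
    """Q(X) of Eq. (2): the same construction in latent space, with unit bandwidth."""
    D = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    A = np.exp(-D)
    np.fill_diagonal(A, 0.0)
    return A / A.sum(axis=1, keepdims=True)
```

Excluding the diagonal before normalizing follows common SNE practice; the extracted text does not state this detail explicitly.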
33 Our goal is to extend SNE so that it learns latent data representations that not only approximate the input space distances well, but also reflect additional characteristics of the input data that one may be interested in. [sent-53, score-0.532]
34 In order to accommodate these additional characteristics, instead of defining a single similarity-matrix that is based on Euclidean distances in data space, we define several matrices P c , (c = 1, . [sent-54, score-0.096]
35 , C), each of which encodes some known type of similarity of the data. [sent-57, score-0.18]
36 Proximity in the Euclidean data-space is typically one of the types of similarity that we use, though it can easily be omitted. [sent-58, score-0.18]
37 The additional types of similarity may reflect any information that the user has access to about any subsets of the data, provided the information can be expressed as a similarity matrix that is normalized over the relevant subset of the data. [sent-59, score-0.517]
38 At first sight, a single latent data representation seems to be unsuitable to accommodate the different, and possibly incompatible, properties encoded in a set of P c -matrices. [sent-60, score-0.529]
39 Since our goal, however, is to capture possibly overlapping relations, we do use a single latent space and in addition we define a linear transformation R^c of the latent space for each of the C different similarity-types that we provide as input. [sent-61, score-0.933]
40 Note that this is equivalent to measuring distances in latent space using a different Mahalanobis metric for each c, corresponding to the matrix (R^c)^T R^c. [sent-62, score-0.563]
41 In order to learn the transformations R^c from the data along with the set of latent representations X we consider the loss function $$E(X) = \sum_c E^c(X), \qquad (3)$$ where we define $$E^c(X) := \frac{1}{N} \sum_{i,j} P^c_{ij} \log \frac{P^c_{ij}}{Q^c_{ij}}$$ and $Q^c_{ij} := Q_{ij}(R^c X)$. [sent-63, score-0.434]
42 As indicated above, here we consider diagonal R-matrices only, which simply amounts to using a rescaling factor for each latent space dimension. [sent-68, score-0.465]
43 By allowing each type of similarity to put a different scaling factor on each dimension the model allows similarity relations that “overlap” to share dimensions. [sent-69, score-0.451]
44 Completely unrelated or “orthogonal” relations can be encoded by using disjoint sets of non-zero scaling factors. [sent-70, score-0.124]
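A sketch of the resulting objective, assuming the diagonal parameterization just described; it reuses the helpers from the previous snippet, and each relation c is represented by a similarity matrix P^c together with a vector of per-dimension scale factors (the diagonal of R^c).

```python
def mre_loss(X, R_diags, P_list):
    """E(X) of Eq. (3): sum over relations c of the KL mismatch between P^c and Q^c = Q(R^c X)."""
    N = len(X)
    total = 0.0
    for r, P in zip(R_diags, P_list):   # r is the diagonal of R^c, P is P^c
        Q = latent_similarities(X * r)  # a diagonal R^c just rescales each latent dimension
        eps = 1e-12                     # guard against log(0)
        total += np.sum(P * (np.log(P + eps) - np.log(Q + eps))) / N
    return total
```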
45 The gradient of E(X) with respect to a single latent space element x^l takes a similar form to the gradient of the standard SNE objective function and is given by $$\frac{\partial E(X)}{\partial x^l} = \frac{2}{N} \sum_c \sum_i \left(P^c_{il} + P^c_{li} - Q^c_{il} - Q^c_{li}\right) R^{c\top} R^c \, (x^l - x^i). \qquad (5)$$ [sent-74, score-0.485]
46 Figure 1: Embedding of images of rotated objects. [sent-86, score-0.13]
47 Latent representatives are colored on a gray-scale corresponding to angle of rotation in the original images. [sent-88, score-0.169]
48 The rightmost plots show entries on the diagonals of latent space transformations REucl and RClass . [sent-89, score-0.612]
49 The gradient with respect to a single entry of the diagonal of R^c reads $$\frac{\partial E(X)}{\partial R^c_{ll}} = \frac{2}{N} R^c_{ll} \sum_{i,j} \left(P^c_{ij} - Q^c_{ij}\right) \left(x^i_l - x^j_l\right)^2, \qquad (6)$$ where x^i_l denotes the lth component of the ith latent representative. [sent-93, score-0.485]
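The two gradients can be written down directly from Eqs. (5) and (6). The following sketch (our own, with hypothetical function names) again assumes row-wise latents and diagonal R^c, and favors clarity over speed.

```python
def mre_gradients(X, R_diags, P_list):
    """Gradients of E(X) w.r.t. the latent points (Eq. 5) and the diagonals of the R^c (Eq. 6)."""
    N, d = X.shape
    gX = np.zeros_like(X)
    gR = []
    for r, P in zip(R_diags, P_list):
        Q = latent_similarities(X * r)
        B = P + P.T - Q - Q.T                          # B[i, l] = P_il + P_li - Q_il - Q_li
        for l in range(N):
            diff = X[l] - X                            # rows are x^l - x^i
            gX[l] += (2.0 / N) * (r * r) * (B[:, l][:, None] * diff).sum(axis=0)
        D2 = np.square(X[:, None, :] - X[None, :, :])  # D2[i, j, l] = (x^i_l - x^j_l)^2
        gR.append((2.0 / N) * r * np.einsum('ij,ijl->l', P - Q, D2))
    return gX, gR
```

These gradients can be handed to any off-the-shelf optimizer; the authors mention training with 'minimize' in their experiments.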
50 As an illustrative example we ran MRE on a set of images from the Columbia object images library (COIL) [12]. [sent-94, score-0.298]
51 The dataset contains (128 × 128)-dimensional gray-scale images of different objects that vary only by rotation, i.e. each object is depicted under a range of rotation angles. [sent-95, score-0.243]
52 We took three subsets of images depicting toy-cars, where each subset corresponds to one of three different kinds of toy-cars, and embedded the first 30 images of each of these subsets in a three-dimensional space. [sent-98, score-0.535]
53 We used two similarity relations: the first, P^Eucl, corresponds to the standard SNE objective; the second, P^Class, is defined as a block-diagonal matrix that contains homogeneous blocks of size 30 × 30 with entries 1/30 and models class membership, i.e. [sent-99, score-0.328]
54 we informed the model that images depicting the same object class belong together. [sent-101, score-0.391]
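A possible construction of these two relations (our own sketch, reusing the helpers above; the variable `car_images` is hypothetical and stands for the 90 stacked COIL images). Note that the paper states entries of 1/30; we zero the diagonal to match the self-similarity convention of the helpers, which makes the off-diagonal entries 1/29 without changing the picture.

```python
n_classes, n_per_class = 3, 30
labels = np.repeat(np.arange(n_classes), n_per_class)        # 30 images per toy-car class
Y = car_images.reshape(len(car_images), -1)                   # flatten 128x128 images to vectors

P_eucl = sne_similarities(Y, sigma2=5e7)                      # sigma^2 value reported later in the text
P_class = (labels[:, None] == labels[None, :]).astype(float)  # block-diagonal class indicator
np.fill_diagonal(P_class, 0.0)                                # no self-similarity, as in the helpers
P_class /= P_class.sum(axis=1, keepdims=True)                 # rows sum to 1 within each class block
```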
55 The resulting separation by class is also reflected in the entries on the diagonal of the corresponding R-matrices, depicted in the two right-most plots. [sent-105, score-0.187]
56 RClass is responsible for representing class membership and can do so using just a single dimension. [sent-106, score-0.157]
57 REucl on the other hand makes use of all dimensions to some degree, reflecting the fact that the overall variability in “pixel-space” depends on class-membership, as well as on other factors (here mainly rotation). [sent-107, score-0.329]
58 Note that with the variability according to class membership factored out, the remaining two dimensions capture the rotational degree of freedom very cleanly (see footnote 1). [sent-108, score-0.308]
59 (Footnote 1: For training we set σ^2 manually to 5 · 10^7 for both SNE and MRE and initialized all entries in X and the diagonals of all R^c with small normally distributed values.) [sent-111, score-0.418]
60 2.1 Partial information In many real-world situations there might be side-information available only for a subset of the data-points, because labelling a complete dataset could be too expensive or for other reasons impossible. [sent-113, score-0.126]
61 A partially labelled dataset can in that case still be used to provide a hint about the kind of variability that one is interested in. [sent-114, score-0.349]
62 It is straightforward to modify the model to deal with partially labelled data. [sent-116, score-0.069]
63 For each type of similarity c that is known to hold for a subset containing N^c examples, the corresponding P^c-matrix references only this subset of the complete dataset and is thus an N^c × N^c matrix. [sent-117, score-0.394]
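In code, a partially observed relation just carries an index set alongside its (smaller) similarity matrix. The sketch below (our own; it normalizes by the subset size N^c, a detail the extracted text does not spell out) shows how such a term would enter the loss.

```python
def partial_relation_loss(X, r, P_sub, idx):
    """KL term for a relation defined only on the subset idx; P_sub is the N^c x N^c matrix."""
    Q_sub = latent_similarities(X[idx] * r)  # latent similarities restricted to the labelled subset
    eps = 1e-12
    return np.sum(P_sub * (np.log(P_sub + eps) - np.log(Q_sub + eps))) / len(idx)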
64 3.1 Learning correspondences between image sets In extending the experiment described in section 2 we trained MRE to discover correspondences between sets of images, in this case with different dimensionalities. [sent-120, score-0.182]
65 We picked 20 successive images from one object of the COIL dataset described above and 28 images (112 × 92 pixels) depicting a person under different viewing angles taken from the UMIST dataset[13]. [sent-121, score-0.447]
66 We chose this data in order to obtain two sets of images that vary in a “similar” or related way. [sent-122, score-0.169]
67 Note that, because the datasets have different dimensionalities, here it is not possible to define a single relation describing Euclidean distance between all data-points. [sent-123, score-0.082]
68 Instead we constructed two relations P Coil and P Umist (for both we used Eq. [sent-124, score-0.123]
69 (1) with σ 2 set as in the previous experiment), with corresponding index-sets I Coil and I Umist containing the indices of the points in each of the two datasets. [sent-125, score-0.067]
70 In addition we constructed one class-membership relation in the same way as before and two identical relations P^1 and P^2 that take the form of a 2 × 2 matrix filled with entries 1/2. [sent-126, score-0.247]
71 Each of the corresponding index sets I^1 and I^2 points to two images (one from each dataset) that represent the end points of the rotational degree of freedom, i.e. [sent-127, score-0.315]
72 to the first and the last points if we sort the data according to rotation (see figure 2, left plot). [sent-129, score-0.089]
73 These similarity types are used to make sure that the model properly aligns the representations of the two different datasets. [sent-130, score-0.223]
74 Note that the end points constitute the only supervision signal; we did not use any additional information about the alignment of the two datasets. [sent-131, score-0.073]
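The supervision therefore amounts to two tiny relations over two-element index sets. A sketch of how they could be built (our own code; it assumes both image sets are already sorted by rotation or viewing angle so that their first and last images correspond):

```python
idx_coil  = np.arange(20)                     # positions of the COIL images in the combined dataset
idx_umist = 20 + np.arange(28)                # positions of the UMIST face images

I1 = np.array([idx_coil[0],  idx_umist[0]])   # one end of the rotational degree of freedom
I2 = np.array([idx_coil[-1], idx_umist[-1]])  # the other end
P_anchor = np.full((2, 2), 0.5)               # 2x2 matrix with entries 1/2, used for both P^1 and P^2
```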
75 After training a two-dimensional embedding, we randomly picked latent representatives of the COIL images and computed reconstructions of corresponding face images using a kernel smoother (i.e. [sent-132, score-0.964]
76 as a linear combination of the face images with coefficients based on latent space distances). [sent-134, score-0.634]
77 In order to factor out variability corresponding to class membership we first multiplied all latent representatives by the inverse of R^Class. [sent-135, score-0.818]
78 (Note that such a strategy will in general blow up the latent space dimensions that do not represent class membership, as the corresponding entries in R^Class may contain very small values. [sent-136, score-0.685]
79 Figure 2, right: Reconstructions of face images from randomly chosen cat images. [sent-140, score-0.206]
80 The kernel smoother consequently requires a very large kernel bandwidth, with the net effect that the latent representation effectively collapses in the dimensions that correspond to class membership – which is exactly what we want.) [sent-141, score-0.694]
81 The reconstructions, depicted in the right plot of figure 2, show that the model has captured the common mode of variability. [sent-142, score-0.093]
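A minimal sketch of the reconstruction step (our own code; the default bandwidth and the epsilon that guards against division by tiny R^Class entries are our choices):

```python
def reconstruct_face(x_query, X_faces, face_images, r_class, bandwidth=1.0, eps=1e-8):
    """Kernel-smoother reconstruction of a face image from a COIL latent representative,
    after dividing out the class-membership scaling (multiplying by the inverse of R^Class)."""
    scale = np.maximum(np.abs(r_class), eps)
    xq = x_query / scale
    Xf = X_faces / scale
    w = np.exp(-np.square(Xf - xq).sum(axis=1) / bandwidth)  # Gaussian weights from latent distances
    w /= w.sum()
    F = face_images.reshape(len(face_images), -1)
    return (w[:, None] * F).sum(axis=0)                      # weighted average of the face images
```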
82 3.2 Supervised feature extraction To investigate the ability of MRE to perform a form of "supervised feature extraction" we used a dataset of synthetic face images that originally appeared in [1]. [sent-144, score-0.325]
83 The face images vary according to pose (two degrees of freedom) and according to the position of a lighting source (one degree of freedom). [sent-145, score-0.46]
84 We computed an embedding with the goal of obtaining features that explicitly correspond to these different kinds of variability in the data. [sent-147, score-0.396]
85 In addition we used a fourth similarity relation P Ink , corresponding to overall brightness or “amount of ink”, by constructing for each image a corresponding feature equal to the sum of its pixel intensities and then defining the similarity matrix as above. [sent-150, score-0.478]
86 In addition we constructed the standard SNE relation P^Eucl (defined for all data-points) using Eq. (1). [sent-153, score-0.078]
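The "amount of ink" relation reduces to a one-dimensional feature per image fed through the same similarity construction. A sketch (our own; `face_images` is hypothetical, and the bandwidth choices here are ours, not values from the paper):

```python
F = face_images.reshape(len(face_images), -1)     # stack of synthetic face images, flattened
ink = F.sum(axis=1, keepdims=True)                # one scalar per image: sum of pixel intensities
P_ink  = sne_similarities(ink, sigma2=ink.var())  # Gaussian similarities on the ink feature
P_eucl = sne_similarities(F, sigma2=5e7)          # Eq. (1); this bandwidth value is illustrative
```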
87 We initialized the model as before and trained for 1000 iterations of ’minimize’ to find an embedding in a four-dimensional space. [sent-155, score-0.177]
88 Figure 3 (right plot) shows the learned latent space metrics corresponding to the five similarity-types. [sent-156, score-0.464]
89 Obviously, MRE devotes one dimension to each of the four similarity-types, reflecting the fact that each of them describes a single one-dimensional degree of freedom that is barely correlated with the others. [sent-157, score-0.139]
90 The plots on the left of figure 3 show the embedding of the 598 unlabelled data-points. [sent-159, score-0.208]
91 The top plot shows the embedding in the two dimensions in which the two "pose"-metrics take on their maximal values; the bottom plot shows the dimensions in which the "lighting"- and "ink"-metrics take on their maximal values. [sent-160, score-0.511]
92 The plots show that MRE generalizes over unlabeled data: In each dimension the unlabeled data is clearly arranged according to the corresponding similarity type, and is arranged rather randomly with respect to other similarity types. [sent-161, score-0.579]
93 These correlations are also reflected in the slightly overlapping latent space metrics. [sent-165, score-0.47]
(Footnote 3: This is certainly not an optimal choice, but we found the solutions to be rather robust against changes in the bandwidth, and this value worked fine.)
94 Figure 3: Left: Embedding of face images that were not informed about their low-dimensional parameters. [sent-189, score-0.245]
95 For a randomly chosen subset of these (marked with a circle), the original images are shown next to their latent representatives. [sent-190, score-0.573]
96 Right: Entries on the diagonals of five latent space transformations. [sent-191, score-0.503]
97 MRE gets the pose-embedding wrong for a few very dark images that are apparently too far away in the data space to be associated with the correct labeled datapoints. [sent-193, score-0.167]
98 4 Conclusions We introduced a way to embed data in a low-dimensional space using a set of similarity relations. [sent-194, score-0.217]
99 Our experiments indicate that the informed feature extraction that this method facilitates will be most useful in cases where conventional dimensionality reduction methods fail because of their completely unsupervised nature. [sent-195, score-0.291]
100 Although we derived our approach as an extension to SNE, it should be straightforward to apply the same idea to other dimensionality reduction methods. [sent-196, score-0.093]
wordName wordTfidf (topN-words)
[('latent', 0.391), ('mre', 0.372), ('sne', 0.324), ('similarity', 0.18), ('embedding', 0.177), ('rc', 0.158), ('variability', 0.155), ('coil', 0.143), ('images', 0.13), ('ink', 0.125), ('informed', 0.115), ('rclass', 0.115), ('reucl', 0.115), ('dimensions', 0.11), ('pij', 0.109), ('qc', 0.095), ('membership', 0.095), ('freedom', 0.094), ('reconstructions', 0.091), ('relational', 0.091), ('correspondences', 0.091), ('relations', 0.091), ('roland', 0.086), ('umist', 0.086), ('lighting', 0.081), ('entries', 0.078), ('face', 0.076), ('diagonals', 0.075), ('depicting', 0.075), ('representatives', 0.075), ('qij', 0.075), ('dataset', 0.074), ('labelled', 0.069), ('hinton', 0.065), ('bandwidth', 0.065), ('kinds', 0.064), ('factors', 0.064), ('geoffrey', 0.064), ('user', 0.063), ('distances', 0.061), ('rotation', 0.058), ('plot', 0.057), ('eucl', 0.057), ('intermingled', 0.057), ('memisevic', 0.057), ('xl', 0.057), ('euclidean', 0.056), ('neighbor', 0.056), ('pose', 0.054), ('dimensionality', 0.053), ('grouping', 0.052), ('subset', 0.052), ('kind', 0.051), ('informing', 0.05), ('relation', 0.046), ('bilinear', 0.045), ('extraction', 0.045), ('degree', 0.045), ('representations', 0.043), ('rotational', 0.042), ('supervision', 0.042), ('worked', 0.042), ('subsets', 0.042), ('arranged', 0.041), ('possibly', 0.04), ('reduction', 0.04), ('vary', 0.039), ('unsupervised', 0.038), ('columbia', 0.038), ('object', 0.038), ('metric', 0.038), ('space', 0.037), ('diagonal', 0.037), ('overlapping', 0.037), ('datasets', 0.036), ('joshua', 0.036), ('style', 0.036), ('corresponding', 0.036), ('depicted', 0.036), ('smoother', 0.035), ('accommodate', 0.035), ('sam', 0.035), ('degrees', 0.035), ('unlabeled', 0.035), ('pca', 0.034), ('encoded', 0.033), ('class', 0.033), ('ected', 0.033), ('mismatch', 0.033), ('constructed', 0.032), ('factored', 0.032), ('tenenbaum', 0.032), ('ecting', 0.032), ('points', 0.031), ('plots', 0.031), ('representation', 0.03), ('re', 0.03), ('gure', 0.029), ('roweis', 0.029), ('responsible', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 125 nips-2004-Multiple Relational Embedding
Author: Roland Memisevic, Geoffrey E. Hinton
Abstract: We describe a way of using multiple different types of similarity relationship to learn a low-dimensional embedding of a dataset. Our method chooses different, possibly overlapping representations of similarity by individually reweighting the dimensions of a common underlying latent space. When applied to a single similarity relation that is based on Euclidean distances between the input data points, the method reduces to simple dimensionality reduction. If additional information is available about the dataset or about subsets of it, we can use this information to clean up or otherwise improve the embedding. We demonstrate the potential usefulness of this form of semi-supervised dimensionality reduction on some simple examples. 1
2 0.21698442 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
3 0.17130238 145 nips-2004-Parametric Embedding for Class Visualization
Author: Tomoharu Iwata, Kazumi Saito, Naonori Ueda, Sean Stromsten, Thomas L. Griffiths, Joshua B. Tenenbaum
Abstract: In this paper, we propose a new method, Parametric Embedding (PE), for visualizing the posteriors estimated over a mixture model. PE simultaneously embeds both objects and their classes in a low-dimensional space. PE takes as input a set of class posterior vectors for given data points, and tries to preserve the posterior structure in an embedding space by minimizing a sum of Kullback-Leibler divergences, under the assumption that samples are generated by a Gaussian mixture with equal covariances in the embedding space. PE has many potential uses depending on the source of the input data, providing insight into the classifier’s behavior in supervised, semi-supervised and unsupervised settings. The PE algorithm has a computational advantage over conventional embedding methods based on pairwise object relations since its complexity scales with the product of the number of objects and the number of classes. We demonstrate PE by visualizing supervised categorization of web pages, semi-supervised categorization of digits, and the relations of words and latent topics found by an unsupervised algorithm, Latent Dirichlet Allocation. 1
4 0.16866222 124 nips-2004-Multiple Alignment of Continuous Time Series
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili
Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1 A Profile Model for Continuous Data When observing multiple time series generated by a noisy, stochastic process, large systematic sources of variability are often present. For example, within a set of nominally replicate time series, the time axes can be variously shifted, compressed and expanded, in complex, non-linear ways. Additionally, in some circumstances, the scale of the measured data can vary systematically from one replicate to the next, and even within a given replicate. We propose a Continuous Profile Model (CPM) for simultaneously analyzing a set of such time series. In this model, each time series is generated as a noisy transformation of a single latent trace. The latent trace is an underlying, noiseless representation of the set of replicated, observable time series. Output time series are generated from this model by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as in an HMM. Each hidden state corresponds to a particular location in the latent trace, and the emitted value from the state depends on the value of the latent trace at that position. To account for changes in the amplitude of the signals across and within replicates, the latent time states are augmented by a set of scale states, which control how the emission signal will be scaled relative to the value of the latent trace. During training, the latent trace is learned, as well as the transition probabilities controlling the Markovian evolution of the scale and time states and the overall noise level of the observed data. After training, the latent trace learned by the model represents a higher resolution ’fusion’ of the experimental replicates. Figure 1 illustrate the model in action. Unaligned, Linear Warp Alignment and CPM Alignment Amplitude 40 30 20 10 0 50 Amplitude 40 30 20 10 Amplitude 0 30 20 10 0 Time a) b) Figure 1: a) Top: ten replicated speech energy signals as described in Section 4), Middle: same signals, aligned using a linear warp with an offset, Bottom: aligned with CPM (the learned latent trace is also shown in cyan). b) Speech waveforms corresponding to energy signals in a), Top: unaligned originals, Bottom: aligned using CPM. 2 Defining the Continuous Profile Model (CPM) The CPM is generative model for a set of K time series, xk = (xk , xk , ..., xk k ). The 1 2 N temporal sampling rate within each xk need not be uniform, nor must it be the same across the different xk . Constraints on the variability of the sampling rate are discussed at the end of this section. 
For notational convenience, we henceforth assume N k = N for all k, but this is not a requirement of the model. The CPM is set up as follows: We assume that there is a latent trace, z = (z1 , z2 , ..., zM ), a canonical representation of the set of noisy input replicate time series. Any given observed time series in the set is modeled as a non-uniformly subsampled version of the latent trace to which local scale transformations have been applied. Ideally, M would be infinite, or at least very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used M = (2 + )N (double the resolution, plus some slack on each end) in our experiments and found this to be sufficient with < 0.2. Because the resolution of the latent trace is higher than that of the observed time series, experimental time can be made effectively to speed up or slow down by advancing along the latent trace in larger or smaller jumps. The subsampling and local scaling used during the generation of each observed time series are determined by a sequence of hidden state variables. Let the state sequence for observation k be π k . Each state in the state sequence maps to a time state/scale state pair: k πi → {τik , φk }. Time states belong to the integer set (1..M ); scale states belong to an i ordered set (φ1 ..φQ ). (In our experiments we have used Q=7, evenly spaced scales in k logarithmic space). States, πi , and observation values, xk , are related by the emission i k probability distribution: Aπi (xk |z) ≡ p(xk |πi , z, σ, uk ) ≡ N (xk ; zτik φk uk , σ), where σ k i i i i is the noise level of the observed data, N (a; b, c) denotes a Gaussian probability density for a with mean b and standard deviation c. The uk are real-valued scale parameters, one per observed time series, that correct for any overall scale difference between time series k and the latent trace. To fully specify our model we also need to define the state transition probabilities. We define the transitions between time states and between scale states separately, so that k Tπi−1 ,πi ≡ p(πi |πi−1 ) = p(φi |φi−1 )pk (τi |τi−1 ). The constraint that time must move forward, cannot stand still, and that it can jump ahead no more than Jτ time states is enforced. (In our experiments we used Jτ = 3.) As well, we only allow scale state transitions between neighbouring scale states so that the local scale cannot jump arbitrarily. These constraints keep the number of legal transitions to a tractable computational size and work well in practice. Each observed time series has its own time transition probability distribution to account for experiment-specific patterns. Both the time and scale transition probability distributions are given by multinomials: dk , if a − b = 1 1 k d2 , if a − b = 2 k . p (τi = a|τi−1 = b) = . . k d , if a − b = J τ Jτ 0, otherwise p(φi = a|φi−1 s0 , if D(a, b) = 0 s1 , if D(a, b) = 1 = b) = s1 , if D(a, b) = −1 0, otherwise where D(a, b) = 1 means that a is one scale state larger than b, and D(a, b) = −1 means that a is one scale state smaller than b, and D(a, b) = 0 means that a = b. The distributions Jτ are constrained by: i=1 dk = 1 and 2s1 + s0 = 1. i Jτ determines the maximum allowable instantaneous speedup of one portion of a time series relative to another portion, within the same series or across different series. 
However, the length of time for which any series can move so rapidly is constrained by the length of the latent trace; thus the maximum overall ratio in speeds achievable by the model between any two entire time series is given by min(Jτ , M ). N After training, one may examine either the latent trace or the alignment of each observable time series to the latent trace. Such alignments can be achieved by several methods, including use of the Viterbi algorithm to find the highest likelihood path through the hidden states [1], or sampling from the posterior over hidden state sequences. We found Viterbi alignments to work well in the experiments below; samples from the posterior looked quite similar. 3 Training with the Expectation-Maximization (EM) Algorithm As with HMMs, training with the EM algorithm (often referred to as Baum-Welch in the context of HMMs [1]), is a natural choice. In our model the E-Step is computed exactly using the Forward-Backward algorithm [1], which provides the posterior probability over k states for each time point of every observed time series, γs (i) ≡ p(πi = s|x) and also the pairwise state posteriors, ξs,t (i) ≡ p(πi−1 = s, πi = t|xk ). The algorithm is modified only in that the emission probabilities depend on the latent trace as described in Section 2. The M-Step consists of a series of analytical updates to the various parameters as detailed below. Given the latent trace (and the emission and state transition probabilities), the complete log likelihood of K observed time series, xk , is given by Lp ≡ L + P. L is the likelihood term arising in a (conditional) HMM model, and can be obtained from the Forward-Backward algorithm. It is composed of the emission and state transition terms. P is the log prior (or penalty term), regularizing various aspects of the model parameters as explained below. These two terms are: K N N L≡ log Aπi (xk |z) + i log p(π1 ) + τ −1 K (zj+1 − zj )2 + P ≡ −λ (1) i=2 i=1 k=1 k log Tπi−1 ,πi j=1 k log D(dk |{ηv }) + log D(sv |{ηv }), v (2) k=1 where p(π1 ) are priors over the initial states. The first term in Equation 2 is a smoothing k penalty on the latent trace, with λ controlling the amount of smoothing. ηv and ηv are Dirichlet hyperprior parameters for the time and scale state transition probability distributions respectively. These ensure that all non-zero transition probabilities remain non-zero. k For the time state transitions, v ∈ {1, Jτ } and ηv corresponds to the pseudo-count data for k the parameters d1 , d2 . . . dJτ . For the scale state transitions, v ∈ {0, 1} and ηv corresponds to the pseudo-count data for the parameters s0 and s1 . Letting S be the total number of possible states, that is, the number of elements in the cross-product of possible time states and possible scale states, the expected complete log likelihood is: K S K p k k γs (1) log T0,s
5 0.15900888 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby
Abstract: Embedding algorithms search for low dimensional structure in complex data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of our embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text datasets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling and correspondence analysis. 1
6 0.10646337 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
7 0.10423689 127 nips-2004-Neighbourhood Components Analysis
8 0.097549498 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
9 0.097151801 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
10 0.085799545 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
11 0.08315406 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
12 0.082555346 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
13 0.081113525 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
14 0.079157591 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
15 0.079015292 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
16 0.074595265 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models
17 0.074111812 160 nips-2004-Seeing through water
18 0.07006532 18 nips-2004-Algebraic Set Kernels with Application to Inference Over Local Image Representations
19 0.069107711 136 nips-2004-On Semi-Supervised Classification
20 0.06569355 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
topicId topicWeight
[(0, -0.221), (1, 0.074), (2, -0.095), (3, -0.173), (4, 0.043), (5, 0.067), (6, -0.137), (7, -0.002), (8, 0.225), (9, 0.079), (10, 0.063), (11, -0.163), (12, -0.025), (13, -0.03), (14, 0.064), (15, 0.089), (16, 0.078), (17, 0.029), (18, -0.093), (19, -0.041), (20, -0.066), (21, 0.08), (22, 0.126), (23, 0.0), (24, -0.006), (25, -0.008), (26, -0.017), (27, 0.027), (28, -0.052), (29, -0.133), (30, 0.147), (31, 0.092), (32, -0.131), (33, -0.105), (34, -0.118), (35, -0.025), (36, -0.002), (37, 0.037), (38, 0.086), (39, -0.063), (40, -0.036), (41, -0.125), (42, 0.095), (43, 0.011), (44, -0.093), (45, 0.057), (46, -0.011), (47, -0.063), (48, 0.129), (49, 0.074)]
simIndex simValue paperId paperTitle
same-paper 1 0.95969754 125 nips-2004-Multiple Relational Embedding
2 0.67308515 163 nips-2004-Semi-parametric Exponential Family PCA
3 0.63344711 145 nips-2004-Parametric Embedding for Class Visualization
4 0.5771994 124 nips-2004-Multiple Alignment of Continuous Time Series
Author: Jennifer Listgarten, Radford M. Neal, Sam T. Roweis, Andrew Emili
Abstract: Multiple realizations of continuous-valued time series from a stochastic process often contain systematic variations in rate and amplitude. To leverage the information contained in such noisy replicate sets, we need to align them in an appropriate way (for example, to allow the data to be properly combined by adaptive averaging). We present the Continuous Profile Model (CPM), a generative model in which each observed time series is a non-uniformly subsampled version of a single latent trace, to which local rescaling and additive noise are applied. After unsupervised training, the learned trace represents a canonical, high resolution fusion of all the replicates. As well, an alignment in time and scale of each observation to this trace can be found by inference in the model. We apply CPM to successfully align speech signals from multiple speakers and sets of Liquid Chromatography-Mass Spectrometry proteomic data. 1 A Profile Model for Continuous Data When observing multiple time series generated by a noisy, stochastic process, large systematic sources of variability are often present. For example, within a set of nominally replicate time series, the time axes can be variously shifted, compressed and expanded, in complex, non-linear ways. Additionally, in some circumstances, the scale of the measured data can vary systematically from one replicate to the next, and even within a given replicate. We propose a Continuous Profile Model (CPM) for simultaneously analyzing a set of such time series. In this model, each time series is generated as a noisy transformation of a single latent trace. The latent trace is an underlying, noiseless representation of the set of replicated, observable time series. Output time series are generated from this model by moving through a sequence of hidden states in a Markovian manner and emitting an observable value at each step, as in an HMM. Each hidden state corresponds to a particular location in the latent trace, and the emitted value from the state depends on the value of the latent trace at that position. To account for changes in the amplitude of the signals across and within replicates, the latent time states are augmented by a set of scale states, which control how the emission signal will be scaled relative to the value of the latent trace. During training, the latent trace is learned, as well as the transition probabilities controlling the Markovian evolution of the scale and time states and the overall noise level of the observed data. After training, the latent trace learned by the model represents a higher resolution ’fusion’ of the experimental replicates. Figure 1 illustrate the model in action. Unaligned, Linear Warp Alignment and CPM Alignment Amplitude 40 30 20 10 0 50 Amplitude 40 30 20 10 Amplitude 0 30 20 10 0 Time a) b) Figure 1: a) Top: ten replicated speech energy signals as described in Section 4), Middle: same signals, aligned using a linear warp with an offset, Bottom: aligned with CPM (the learned latent trace is also shown in cyan). b) Speech waveforms corresponding to energy signals in a), Top: unaligned originals, Bottom: aligned using CPM. 2 Defining the Continuous Profile Model (CPM) The CPM is generative model for a set of K time series, xk = (xk , xk , ..., xk k ). The 1 2 N temporal sampling rate within each xk need not be uniform, nor must it be the same across the different xk . Constraints on the variability of the sampling rate are discussed at the end of this section. 
For notational convenience, we henceforth assume N k = N for all k, but this is not a requirement of the model. The CPM is set up as follows: We assume that there is a latent trace, z = (z1 , z2 , ..., zM ), a canonical representation of the set of noisy input replicate time series. Any given observed time series in the set is modeled as a non-uniformly subsampled version of the latent trace to which local scale transformations have been applied. Ideally, M would be infinite, or at least very large relative to N so that any experimental data could be mapped precisely to the correct underlying trace point. Aside from the computational impracticalities this would pose, great care to avoid overfitting would have to be taken. Thus in practice, we have used M = (2 + )N (double the resolution, plus some slack on each end) in our experiments and found this to be sufficient with < 0.2. Because the resolution of the latent trace is higher than that of the observed time series, experimental time can be made effectively to speed up or slow down by advancing along the latent trace in larger or smaller jumps. The subsampling and local scaling used during the generation of each observed time series are determined by a sequence of hidden state variables. Let the state sequence for observation k be π k . Each state in the state sequence maps to a time state/scale state pair: k πi → {τik , φk }. Time states belong to the integer set (1..M ); scale states belong to an i ordered set (φ1 ..φQ ). (In our experiments we have used Q=7, evenly spaced scales in k logarithmic space). States, πi , and observation values, xk , are related by the emission i k probability distribution: Aπi (xk |z) ≡ p(xk |πi , z, σ, uk ) ≡ N (xk ; zτik φk uk , σ), where σ k i i i i is the noise level of the observed data, N (a; b, c) denotes a Gaussian probability density for a with mean b and standard deviation c. The uk are real-valued scale parameters, one per observed time series, that correct for any overall scale difference between time series k and the latent trace. To fully specify our model we also need to define the state transition probabilities. We define the transitions between time states and between scale states separately, so that k Tπi−1 ,πi ≡ p(πi |πi−1 ) = p(φi |φi−1 )pk (τi |τi−1 ). The constraint that time must move forward, cannot stand still, and that it can jump ahead no more than Jτ time states is enforced. (In our experiments we used Jτ = 3.) As well, we only allow scale state transitions between neighbouring scale states so that the local scale cannot jump arbitrarily. These constraints keep the number of legal transitions to a tractable computational size and work well in practice. Each observed time series has its own time transition probability distribution to account for experiment-specific patterns. Both the time and scale transition probability distributions are given by multinomials: dk , if a − b = 1 1 k d2 , if a − b = 2 k . p (τi = a|τi−1 = b) = . . k d , if a − b = J τ Jτ 0, otherwise p(φi = a|φi−1 s0 , if D(a, b) = 0 s1 , if D(a, b) = 1 = b) = s1 , if D(a, b) = −1 0, otherwise where D(a, b) = 1 means that a is one scale state larger than b, and D(a, b) = −1 means that a is one scale state smaller than b, and D(a, b) = 0 means that a = b. The distributions Jτ are constrained by: i=1 dk = 1 and 2s1 + s0 = 1. i Jτ determines the maximum allowable instantaneous speedup of one portion of a time series relative to another portion, within the same series or across different series. 
However, the length of time for which any series can move so rapidly is constrained by the length of the latent trace; thus the maximum overall ratio in speeds achievable by the model between any two entire time series is given by min(Jτ , M ). N After training, one may examine either the latent trace or the alignment of each observable time series to the latent trace. Such alignments can be achieved by several methods, including use of the Viterbi algorithm to find the highest likelihood path through the hidden states [1], or sampling from the posterior over hidden state sequences. We found Viterbi alignments to work well in the experiments below; samples from the posterior looked quite similar. 3 Training with the Expectation-Maximization (EM) Algorithm As with HMMs, training with the EM algorithm (often referred to as Baum-Welch in the context of HMMs [1]), is a natural choice. In our model the E-Step is computed exactly using the Forward-Backward algorithm [1], which provides the posterior probability over k states for each time point of every observed time series, γs (i) ≡ p(πi = s|x) and also the pairwise state posteriors, ξs,t (i) ≡ p(πi−1 = s, πi = t|xk ). The algorithm is modified only in that the emission probabilities depend on the latent trace as described in Section 2. The M-Step consists of a series of analytical updates to the various parameters as detailed below. Given the latent trace (and the emission and state transition probabilities), the complete log likelihood of K observed time series, xk , is given by Lp ≡ L + P. L is the likelihood term arising in a (conditional) HMM model, and can be obtained from the Forward-Backward algorithm. It is composed of the emission and state transition terms. P is the log prior (or penalty term), regularizing various aspects of the model parameters as explained below. These two terms are: K N N L≡ log Aπi (xk |z) + i log p(π1 ) + τ −1 K (zj+1 − zj )2 + P ≡ −λ (1) i=2 i=1 k=1 k log Tπi−1 ,πi j=1 k log D(dk |{ηv }) + log D(sv |{ηv }), v (2) k=1 where p(π1 ) are priors over the initial states. The first term in Equation 2 is a smoothing k penalty on the latent trace, with λ controlling the amount of smoothing. ηv and ηv are Dirichlet hyperprior parameters for the time and scale state transition probability distributions respectively. These ensure that all non-zero transition probabilities remain non-zero. k For the time state transitions, v ∈ {1, Jτ } and ηv corresponds to the pseudo-count data for k the parameters d1 , d2 . . . dJτ . For the scale state transitions, v ∈ {0, 1} and ηv corresponds to the pseudo-count data for the parameters s0 and s1 . Letting S be the total number of possible states, that is, the number of elements in the cross-product of possible time states and possible scale states, the expected complete log likelihood is: K S K p k k γs (1) log T0,s
5 0.56622875 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval
Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton
Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.
6 0.55859166 62 nips-2004-Euclidean Embedding of Co-Occurrence Data
7 0.54361415 170 nips-2004-Similarity and Discrimination in Classical Conditioning: A Latent Variable Account
8 0.46726337 127 nips-2004-Neighbourhood Components Analysis
9 0.41291392 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
10 0.39621839 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
11 0.38664997 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
12 0.36086765 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
13 0.35481435 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
14 0.34308121 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
15 0.33627996 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
16 0.332012 186 nips-2004-The Correlated Correspondence Algorithm for Unsupervised Registration of Nonrigid Surfaces
17 0.33001721 158 nips-2004-Sampling Methods for Unsupervised Learning
18 0.32861966 191 nips-2004-The Variational Ising Classifier (VIC) Algorithm for Coherently Contaminated Data
19 0.31525084 73 nips-2004-Generative Affine Localisation and Tracking
20 0.30389443 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
topicId topicWeight
[(13, 0.111), (15, 0.137), (17, 0.016), (26, 0.044), (31, 0.02), (33, 0.185), (35, 0.04), (39, 0.057), (50, 0.035), (54, 0.242), (56, 0.016), (71, 0.013), (76, 0.013)]
simIndex simValue paperId paperTitle
1 0.93704939 155 nips-2004-Responding to Modalities with Different Latencies
Author: Fredrik Bissmarck, Hiroyuki Nakahara, Kenji Doya, Okihide Hikosaka
Abstract: Motor control depends on sensory feedback in multiple modalities with different latencies. In this paper we consider within the framework of reinforcement learning how different sensory modalities can be combined and selected for real-time, optimal movement control. We propose an actor-critic architecture with multiple modules, whose output are combined using a softmax function. We tested our architecture in a simulation of a sequential reaching task. Reaching was initially guided by visual feedback with a long latency. Our learning scheme allowed the agent to utilize the somatosensory feedback with shorter latency when the hand is near the experienced trajectory. In simulations with different latencies for visual and somatosensory feedback, we found that the agent depended more on feedback with shorter latency. 1
2 0.90112197 196 nips-2004-Triangle Fixing Algorithms for the Metric Nearness Problem
Author: Suvrit Sra, Joel Tropp, Inderjit S. Dhillon
Abstract: Various problems in machine learning, databases, and statistics involve pairwise distances among a set of objects. It is often desirable for these distances to satisfy the properties of a metric, especially the triangle inequality. Applications where metric data is useful include clustering, classification, metric-based indexing, and approximation algorithms for various graph problems. This paper presents the Metric Nearness Problem: Given a dissimilarity matrix, find the “nearest” matrix of distances that satisfy the triangle inequalities. For p nearness measures, this paper develops efficient triangle fixing algorithms that compute globally optimal solutions by exploiting the inherent structure of the problem. Empirically, the algorithms have time and storage costs that are linear in the number of triangle constraints. The methods can also be easily parallelized for additional speed. 1
same-paper 3 0.83544141 125 nips-2004-Multiple Relational Embedding
4 0.73362577 163 nips-2004-Semi-parametric Exponential Family PCA
5 0.73119807 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
6 0.72626579 70 nips-2004-Following Curved Regularized Optimization Solution Paths
7 0.72538781 178 nips-2004-Support Vector Classification with Input Data Uncertainty
8 0.72338921 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
9 0.72254407 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
10 0.72232497 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
11 0.7214098 102 nips-2004-Learning first-order Markov models for control
12 0.7212646 124 nips-2004-Multiple Alignment of Continuous Time Series
13 0.72123706 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons
14 0.72074908 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
15 0.7206797 93 nips-2004-Kernel Projection Machine: a New Tool for Pattern Recognition
16 0.7203232 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
17 0.7196101 127 nips-2004-Neighbourhood Components Analysis
18 0.71869481 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
19 0.7160033 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
20 0.71587229 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning