nips nips2011 nips2011-102 knowledge-graph by maker-knowledge-mining

102 nips-2011-Generalised Coupled Tensor Factorisation


Source: pdf

Author: Kenan Y. Yılmaz, Ali T. Cemgil, Umut Simsekli

Abstract: We derive algorithms for generalised tensor factorisation (GTF) by building upon the well-established theory of Generalised Linear Models. Our algorithms are general in the sense that we can compute arbitrary factorisations in a message passing framework, derived for a broad class of exponential family distributions including special cases such as Tweedie’s distributions corresponding to β-divergences. By bounding the step size of the Fisher Scoring iteration of the GLM, we obtain general updates for real data and multiplicative updates for non-negative data. The GTF framework is then easily extended to address problems where multiple observed tensors are factorised simultaneously. We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Taylan Cemgil, Umut Şimşekli, Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey; kenan@sibnet. [sent-3, score-0.05]

2 We derive algorithms for generalised tensor factorisation (GTF) by building upon the well-established theory of Generalised Linear Models. [sent-9, score-0.971]

3 Our algorithms are general in the sense that we can compute arbitrary factorisations in a message passing framework, derived for a broad class of exponential family distributions including special cases such as Tweedie’s distributions corresponding to β-divergences. [sent-10, score-0.39]

4 The GTF framework is then easily extended to address problems where multiple observed tensors are factorised simultaneously. [sent-12, score-0.212]

5 We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem. [sent-13, score-0.922]

6 1 Introduction: A fruitful modelling approach for extracting meaningful information from highly structured multivariate datasets is based on matrix factorisations (MFs). [sent-14, score-0.285]

7 In fact, many standard data processing methods of machine learning and statistics such as clustering, source separation, independent component analysis (ICA), nonnegative matrix factorisation (NMF), and latent semantic indexing (LSI) can be expressed and understood as MF problems. [sent-15, score-0.655]

8 These MF models also have well-understood interpretations as probabilistic generative models. [sent-16, score-0.07]

9 Tensors appear as a natural generalisation of matrix factorisation, when observed data and/or a latent representation have several semantically meaningful dimensions. [sent-19, score-0.171]

10 As the first model is a CP (Parafac) model while the second and third are MFs, we call the combined factorization the CP/MF/MF model. [sent-27, score-0.104]

11 Such models are of interest when one can obtain different ’views’ of the same piece of information (here Z2 ) under different experimental conditions. [sent-28, score-0.099]

12 Singh and Gordon [2] focused on a similar problem, called collective matrix factorisation (CMF) or multi-matrix factorisation, for relational learning, but only for matrix factors and observations. [sent-29, score-0.658]

13 In addition, their generalised Bregman divergence minimisation procedure assumes matching link and loss functions. [sent-30, score-0.261]

14 For coupled matrix and tensor factorization (CMTF), recently [3] proposed a gradient-based all-at-once optimization method as an alternative to alternating least squares (ALS) optimization, and demonstrated their approach for a CP/MF coupled model. [sent-31, score-0.949]

15 The main motivation of the current paper is to construct a general and practical framework for computation of tensor factorisations (TF), by extending the well-established theory of Generalised Linear Models (GLM). [sent-33, score-0.57]

16 Our approach is also partially inspired by probabilistic graphical models: our computation procedures for a given factorisation have a natural message passing interpretation. [sent-34, score-0.56]

17 This provides a structured and efficient approach that enables very easy development of application-specific custom models, priors or error measures, as well as algorithms for joint factorisations where an arbitrary set of tensors can be factorised simultaneously. [sent-35, score-0.388]

18 Well-known models of multiway analysis (Parafac, Tucker [5]) appear as special cases, and novel models and associated inference algorithms can be developed automatically. [sent-36, score-0.068]

19 In [6], the authors take an approach to tensor factorisations similar to ours, but that work is limited to KL and Euclidean costs, generalising the MF models of [7] to the tensor case. [sent-37, score-0.964]

20 It is possible to generalise this line of work to β-divergences [8], but none of these works addresses the coupled factorisation case, and they consider only a restricted class of cost functions. [sent-38, score-0.612]

21 2 Generalised Linear Models for Matrix/Tensor Factorisation: To set the notation and our approach, we briefly review GLMs, following closely the original notation of [9, ch. 5]. [sent-39, score-0.058]

22 $\hat{x}_i$ is the expectation of $x_i$ and $b(\cdot)$ is the log partition function, enforcing normalization. [sent-41, score-0.14]

23 A Linear Model (LM) is a special case of GLM that assumes normality, i.e. [sent-46, score-0.078]

24 $x_i \sim \mathcal{N}(x_i;\, \hat{x}_i, \sigma^2)$, as well as linearity, which implies an identity link function, $g(\hat{x}_i) = \hat{x}_i = l_i^\top z$, assuming the $l_i$ are known. [sent-48, score-0.418]

25 Logistic regression assumes a log link, $g(\hat{x}_i) = \log \hat{x}_i = l_i^\top z$; here $\log \hat{x}_i$ and $z$ have a linear relationship [9]. [sent-49, score-0.228]

26 Via Fisher Scoring, the general update equation in matrix form is written as $z \leftarrow z + \big(L^\top D L\big)^{-1} L^\top D G\,(x - \hat{x})$ (6). Although this formulation is somewhat abstract, it covers a very broad range of model classes that are used in practice. [sent-54, score-0.188]
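
To make the Fisher Scoring step in (6) concrete, here is a minimal numpy sketch for the simplest setting, a Gaussian model with identity link, where G reduces to the identity and D holds the precision weights; the function and variable names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def fisher_scoring_step(z, L, x, w):
    """One step of z <- z + (L^T D L)^{-1} L^T D G (x - x_hat), sketched for
    the identity-link Gaussian case where G = I and D = diag(w)."""
    x_hat = L @ z                          # mean prediction under the identity link
    D = np.diag(w)                         # precision (inverse variance) weights
    info = L.T @ D @ L                     # Fisher information (up to scale)
    grad = L.T @ D @ (x - x_hat)           # weighted residual
    return z + np.linalg.solve(info, grad) # solve instead of explicit inverse

# toy usage: recover z from noisy linear observations
rng = np.random.default_rng(0)
L = rng.normal(size=(100, 5))
z_true = rng.normal(size=5)
x = L @ z_true + 0.1 * rng.normal(size=100)
z = np.zeros(5)
for _ in range(3):
    z = fisher_scoring_step(z, L, x, np.ones(100))
```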

27 These are special cases of the exponential family of distributions for any p, named Tweedie’s family [11]. [sent-57, score-0.149]

28 2.1 Tensor Factorisations (TF) as GLMs: The key observation for expressing a TF model as a GLM is to identify the multilinear structure and to use an alternating optimization approach. [sent-60, score-0.041]

29 To hide the notational complexity, we will give an example with a simple matrix factorisation model; the extension to tensors will require heavier notation, but is otherwise conceptually straightforward. [sent-61, score-0.676]

30 Clearly, this is a GLM where ∇2 plays the role of a model matrix and Z2 is the parameter vector. [sent-64, score-0.076]

31 By alternating between Z1 and Z2 , we can maximise the log-likelihood iteratively; indeed this alternating maximisation is standard for solving matrix factorisation problems. [sent-65, score-0.603]
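
As a sketch of this alternating view in the simplest Gaussian, identity-link case, each conditional update is an ordinary least-squares problem; the code below is illustrative and is not the authors' implementation.

```python
import numpy as np

def als_mf(X, rank, iters=50, seed=0):
    """Alternating maximisation of the Gaussian likelihood for X ~ Z1 @ Z2:
    with an identity link and constant variance, each sub-problem is plain
    least squares, i.e. a single Fisher Scoring step per factor."""
    rng = np.random.default_rng(seed)
    Z1 = rng.normal(size=(X.shape[0], rank))
    for _ in range(iters):
        Z2 = np.linalg.lstsq(Z1, X, rcond=None)[0]        # fix Z1, solve for Z2
        Z1 = np.linalg.lstsq(Z2.T, X.T, rcond=None)[0].T  # fix Z2, solve for Z1
    return Z1, Z2
```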

32 2.2 Generalised Tensor Factorisation: We define a tensor Λ as a multiway array with an index set V = {i1, i2, . . . [sent-68, score-0.515]

33 An element of the tensor Λ is a scalar that we denote by Λ(i1 , i2 , . [sent-78, score-0.394]

34 We call the form Λ(v) element-wise; the notation [ ] yields a tensor by enumerating all the indices. [sent-88, score-0.423]

35 For any two tensors X and Y of compatible order, X ◦ Y is an element-wise multiplication and, if not explicitly stressed, X/Y is an element-wise division. [sent-94, score-0.155]

36 A generalised tensor factorisation problem is specified by an observed tensor X (with possibly missing entries, to be treated later) and a collection of latent tensors to be estimated, Z1:|α| = {Zα } for α = 1 . [sent-96, score-1.711]

37 The index set of X is denoted by V0 and the index set of each Zα by Vα. [sent-100, score-0.09]

38 The goal is to find the latent $Z_\alpha$ that maximise the likelihood $p(X \mid Z_{1:|\alpha|})$, where $\hat{X}$ is given via $g(\hat{X}(v_0)) = \sum_{\bar{v}_0} \prod_\alpha Z_\alpha(v_\alpha)$ (9). To clarify our notation with an example, we express the CP (Parafac) model, defined as $\hat{X}(i, j, k) = \sum_r Z_1(i, r)\, Z_2(j, r)\, Z_3(k, r)$. [sent-103, score-0.095]

39 In our notation, we take the identity link $g(\hat{X}) = \hat{X}$ and the index sets $V = \{i, j, k, r\}$, $V_0 = \{i, j, k\}$, $\bar{V}_0 = \{r\}$, $V_1 = \{i, r\}$, $V_2 = \{j, r\}$ and $V_3 = \{k, r\}$. [sent-104, score-0.169]
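
With the identity link, the CP model above can be evaluated in one line with numpy's einsum: the shared index r (the set V̄0) is summed out and the observed indices i, j, k remain. This is a small illustrative sketch, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3
Z1, Z2, Z3 = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))

# X_hat(i, j, k) = sum_r Z1(i, r) Z2(j, r) Z3(k, r)
X_hat = np.einsum('ir,jr,kr->ijk', Z1, Z2, Z3)
```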

40 Our notation deliberately follows that of graphical models; the reader might find it useful to associate indices with discrete random variables and factors with probability tables [14]. [sent-105, score-0.101]

41 To extend the construction of Sec. 2.1 to the tensor case, we need the equivalent of the model matrix when updating Zα. [sent-108, score-0.076]

42 The tensor-valued function $\Delta_\alpha(Q) : \mathbb{R}^{|v_0|} \to \mathbb{R}^{|v_\alpha|}$ is defined as $\Delta_\alpha^{\varepsilon}(Q) = \Big[\sum_{v_0 \cap \bar{v}_\alpha} Q(v_0)\, L_\alpha(o_\alpha)^{\varepsilon}\Big]$ (11), with $\Delta_\alpha(Q)$ being an object of the same order as $Z_\alpha$ and $o_\alpha \equiv (v_0 \cup v_\alpha) \cap (\bar{v}_0 \cup \bar{v}_\alpha)$. [sent-111, score-0.394]

43 Here, on the right side, the nonnegative integer ε denotes the element-wise power, not to be confused with an index. [sent-112, score-0.068]

44 This abstraction also has an important practical facet: the computation of ∆ is algebraically (almost) equivalent to the computation of marginal quantities on a factor graph, for which efficient message passing algorithms exist [14]. [sent-115, score-0.08]

45 We also verify the order of the gradient $\nabla_A$ in (10), $[\, I_i \otimes L_A \,]_{i,p} = \nabla_{i,p}$, which conforms to the matrix derivation convention [13]. [sent-119, score-0.076]

46 We may assume an identity link function, or alternatively we may choose matching link and loss functions such that they cancel each other smoothly [2]. [sent-123, score-0.207]

47 In the sequel we consider the identity link $g(\hat{X}) = \hat{X}$, which results in $g_{\hat{X}}(\hat{X}) = 1$. [sent-124, score-0.152]

48 We define a tensor W that plays the same role as w in (5), which becomes simply the precision (inverse variance function), i.e. [sent-128, score-0.394]

49 $W = 1/v(\hat{X})$, where for the Gaussian, Poisson, Exponential and Inverse Gaussian distributions we have simply $W = \hat{X}^{-p}$ with $p = \{0, 1, 2, 3\}$ [10, pp. 30]. [sent-130, score-0.064]
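
A tiny helper for this precision tensor, assuming element-wise operations on a strictly positive X̂; the function name is illustrative.

```python
import numpy as np

def tweedie_precision(X_hat, p):
    """W = 1 / v(X_hat) = X_hat**(-p): p = 0 (Gaussian), 1 (Poisson),
    2 (Exponential), 3 (Inverse Gaussian), applied element-wise."""
    return X_hat ** (-float(p))
```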

50 Then, the update (14) reduces to $Z_\alpha \leftarrow Z_\alpha + \big(\nabla_\alpha^\top D\, \nabla_\alpha\big)^{-1} \nabla_\alpha^\top D\,(X - \hat{X})$ (15). After this simplification we obtain two update rules for GTF, one for non-negative and one for real data. [sent-131, score-0.166]

51 The update (15) can be used to derive the multiplicative update rules (MUR) popularised by [15] for nonnegative matrix factorisation (NMF). [sent-132, score-0.783]

52 The update equation (15) for nonnegative GTF reduces to the multiplicative form $Z_\alpha \leftarrow Z_\alpha \circ \dfrac{\Delta_\alpha(W \circ X)}{\Delta_\alpha(W \circ \hat{X})}$. [sent-135, score-0.179]

53 For the KL cost it reduces to NMF-KL as $Z_2 \leftarrow Z_2 \circ \dfrac{Z_1^\top (X / \hat{X})}{Z_1^\top \mathbf{1}}$. By dropping the non-negativity requirement we obtain the update equation of Theorem 2. [sent-140, score-0.119]
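
The NMF-KL special case recovered above corresponds to the classic multiplicative rules of [15]; the following numpy sketch (illustrative names, with a small epsilon added for numerical safety) shows both factor updates.

```python
import numpy as np

def nmf_kl(X, rank, iters=200, seed=0, eps=1e-12):
    """Multiplicative KL updates for X ~ Z1 @ Z2, i.e.
    Z <- Z * Delta(W o X) / Delta(W o X_hat) with W = 1 / X_hat."""
    rng = np.random.default_rng(seed)
    Z1 = rng.random((X.shape[0], rank)) + eps
    Z2 = rng.random((rank, X.shape[1])) + eps
    for _ in range(iters):
        X_hat = Z1 @ Z2
        Z2 *= (Z1.T @ (X / (X_hat + eps))) / (Z1.T @ np.ones_like(X) + eps)
        X_hat = Z1 @ Z2
        Z1 *= ((X / (X_hat + eps)) @ Z2.T) / (np.ones_like(X) @ Z2.T + eps)
    return Z1, Z2
```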

54 Note, as an example for $\lambda_{\alpha/0}$, that if $V_\alpha \cap V_0 = \{p, q\}$ then $\lambda_{\alpha/0} = |p||q|$, which is the number of all distinct configurations for the index set $\{p, q\}$. [sent-143, score-0.045]

55 Missing data can be handled easily by dropping the missing data terms from the likelihood [17]. [sent-144, score-0.161]

56 The net effect of this is the addition of an indicator variable $m_i$ to the gradient, $\partial \mathcal{L}/\partial z = \tau^{-2} \sum_i (x_i - \hat{x}_i)\, m_i\, w_i\, g_x(\hat{x}_i)\, l_i^\top$, with $m_i = 1$ if $x_i$ is observed and $m_i = 0$ otherwise. [sent-145, score-0.392]

57 Hence we simply define a mask tensor M having the same order as the observation X, where the element M(v0) is 1 if X(v0) is observed and zero otherwise. [sent-146, score-0.394]

58 In the update equations, we merely replace W with W ◦ M . [sent-147, score-0.083]
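
A sketch of this masking mechanism for the NMF-KL case: the 0-1 mask M multiplies the weights, so unobserved entries contribute nothing to either the numerator or the denominator. Names and shapes are illustrative assumptions.

```python
import numpy as np

def masked_kl_update_Z2(X, M, Z1, Z2, eps=1e-12):
    """One multiplicative KL update of Z2 for X ~ Z1 @ Z2 with missing data.
    M has the same shape as X: 1 where X is observed, 0 otherwise, so the
    effective weights are W o M and missing entries simply drop out."""
    X_hat = Z1 @ Z2
    num = Z1.T @ (M * X / (X_hat + eps))  # Delta(W o M o X) with W = 1 / X_hat
    den = Z1.T @ M + eps                  # Delta(W o M o X_hat) = Z1^T M
    return Z2 * num / den
```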

59 3 Coupled Tensor Factorization: Here we address the problem where multiple observed tensors Xν, for ν = 1, . . . [sent-148, score-0.155]

60 Each observed tensor Xν now has a corresponding index set V0,ν and a particular configuration will be denoted by v0,ν ≡ uν . [sent-152, score-0.439]

61 Following the same steps (getting the Hessian and finding the Fisher Information) we arrive at the update rule in vector form, $Z_\alpha \leftarrow Z_\alpha + \Big(\sum_\nu R^{\nu,\alpha}\, \nabla_{\alpha,\nu}^\top D_\nu\, \nabla_{\alpha,\nu}\Big)^{-1} \sum_\nu R^{\nu,\alpha}\, \nabla_{\alpha,\nu}^\top D_\nu \big(X_\nu - \hat{X}_\nu\big)$ (22). [sent-157, score-0.083]

62 Figure 1: (Left) Coupled factorisation structure, where an arrow indicates the influence of latent tensor Zα on the observed tensor Xν. (Diagram nodes: Z1 . . . Z|α| and X1 . . . X|ν|; right panel: A, B, C, D, E with X1, X2, X3.) [sent-169, score-0.099]

63 The update equations for the coupled case are quite intuitive; we calculate the $\Delta_{\alpha,\nu}$ functions, defined as $\Delta^{\varepsilon}_{\alpha,\nu}(Q) = \Big[\sum_{u_\nu \cap \bar{v}_\alpha} Q(u_\nu) \Big(\prod_{\alpha' \neq \alpha} Z_{\alpha'}(v_{\alpha'})^{R^{\nu,\alpha'}}\Big)^{\varepsilon}\Big]$ (23), for each submodel and add the results, as stated in Lemma 1. [sent-172, score-0.25]

64 The multiplicative update (Lemma 1) is $Z_\alpha \leftarrow Z_\alpha \circ \sum_\nu R^{\nu,\alpha}\, \Delta_{\alpha,\nu}(W_\nu \circ X_\nu) \,\big/\, \sum_\nu R^{\nu,\alpha}\, \Delta_{\alpha,\nu}(W_\nu \circ \hat{X}_\nu)$. [sent-176, score-0.083]

65 Lemma 2 (general update for CTF): $Z_\alpha \leftarrow Z_\alpha + \dfrac{2}{\lambda_{\alpha/0}}\, \dfrac{\sum_\nu R^{\nu,\alpha}\, \Delta_{\alpha,\nu}\big(W_\nu \circ (X_\nu - \hat{X}_\nu)\big)}{\sum_\nu R^{\nu,\alpha}\, \Delta^{2}_{\alpha,\nu}(W_\nu)}$ (25). For the special case of the Tweedie family we plug $W_\nu = \hat{X}_\nu^{-p}$ and get the related formula. [sent-177, score-0.159]

66 4 Experiments: Here we want to solve the CTF problem introduced in (1), which is a coupled CP/MF/MF problem: $\hat{X}_1^{i,j,k} = \sum_r A^{i,r} B^{j,r} C^{k,r}$, $\hat{X}_2^{j,p} = \sum_r B^{j,r} D^{p,r}$, $\hat{X}_3^{j,q} = \sum_r B^{j,r} E^{q,r}$ (26), where we employ the symbols A : E for the latent tensors instead of Zα. [sent-178, score-0.388]

67 This factorisation problem has the following R matrix, with |α| = 5 and |ν| = 3: $R = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 \end{bmatrix}$, where rows index the observations X1, X2, X3 and columns index the factors A, B, C, D, E; equivalently, $\hat{X}_1 = A^1 B^1 C^1 D^0 E^0$, $\hat{X}_2 = A^0 B^1 C^0 D^1 E^0$, $\hat{X}_3 = A^0 B^1 C^0 D^0 E^1$ (27). We want to use the general update equation (25). [sent-179, score-0.604]

68 For Z2 = B, which is the common factor, $\Delta^{\varepsilon}_{B,1}(Q) = \big[\textstyle\sum_{i,k} Q^{i,j,k} (A^{i,r} C^{k,r})^{\varepsilon}\big] = Q_{(1)}(C^{\varepsilon} \odot A^{\varepsilon})$ (28) and $\Delta^{\varepsilon}_{B,2}(Q) = \big[\textstyle\sum_{p} Q^{j,p} (D^{p,r})^{\varepsilon}\big] = Q D^{\varepsilon}$ (29), with $Q_{(n)}$ being the mode-$n$ unfolding operation that turns a tensor into matrix form [5]. [sent-183, score-0.506]
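
Putting (23)-(29) together, a coupled multiplicative update of the shared factor B can be sketched with einsum as below, assuming the KL cost (Wν = 1/X̂ν) for all three observations; the Δ terms are computed per observation and added, as the lemma prescribes. All array names and shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def update_B_kl(X1, X2, X3, A, B, C, D, E, eps=1e-12):
    """One coupled multiplicative (KL) update of the shared factor B for
    X1 ~ sum_r A B C (CP),  X2 ~ B D^T,  X3 ~ B E^T.
    Delta_{B,1}(Q) = sum_{i,k} Q(i,j,k) A(i,r) C(k,r),
    Delta_{B,2}(Q) = Q D,  Delta_{B,3}(Q) = Q E; the three terms are added."""
    X1h = np.einsum('ir,jr,kr->ijk', A, B, C) + eps
    X2h = B @ D.T + eps
    X3h = B @ E.T + eps
    num = (np.einsum('ijk,ir,kr->jr', X1 / X1h, A, C)
           + (X2 / X2h) @ D + (X3 / X3h) @ E)
    den = (np.einsum('ijk,ir,kr->jr', np.ones_like(X1), A, C)
           + np.ones_like(X2) @ D + np.ones_like(X3) @ E + eps)
    return B * num / den
```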

69 The simulated data size for observables is |i| = |j| = |k| = |p| = |q| = 30 while the latent dimension is |r| = 5. [sent-185, score-0.066]

70 Note that CP factorisation is unique up to permutation and scaling [5] while MF factorisation is not unique, but when coupled with CP it recovers the original data as shown in the figure. [sent-191, score-1.092]

71 For visualisation, the matching permutation between the original and the estimate of each Zα is found by solving an orthogonal Procrustes problem [18, p. 601]. [sent-192, score-0.099]

72 4.1 Audio Experiments: In this section, we illustrate a real-data application of our approach, where we reconstruct missing parts of an audio spectrogram X(f, t) that represents the STFT coefficient magnitude at frequency bin f and time frame t of a piano piece (see the top left panel of the figure). [sent-194, score-0.443]

73 This is a difficult matrix completion problem: as entire time frames (columns of X) are missing, low rank reconstruction techniques are likely to be ineffective. [sent-196, score-0.111]

74 Yet such missing data patterns arise often in practice, e.g. [sent-197, score-0.125]

75 We will develop here a novel approach, expressed as a coupled TF model. [sent-200, score-0.167]

76 In particular, the reconstruction will be aided by an approximate musical score, not necessarily belonging to the played piece, and spectra of isolated piano sounds. [sent-201, score-0.483]

77 However, as time frames are modeled conditionally independently, it is impossible to reconstruct audio with this model when entire time frames are missing. [sent-203, score-0.18]

78 In order to restore the missing parts in the audio, we form a model that can incorporate musical information about chord structures and how they evolve in time. [sent-204, score-0.325]

79 In order to achieve this, we hierarchically decompose the excitation matrix E as a convolution of some basis matrices and their weights: $E(i, t) = \sum_{k,\tau} B(i, \tau, k)\, C(k, t - \tau)$. [sent-205, score-0.076]
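
A small numpy sketch of this convolutive decomposition of the excitation matrix, written as an explicit loop over lags τ with terms at negative time indices treated as zero; the shapes and names are assumptions made for illustration.

```python
import numpy as np

def excitation(B, C):
    """E(i, t) = sum_{k, tau} B(i, tau, k) * C(k, t - tau),
    with B of shape (I, T_lag, K) and C of shape (K, T)."""
    I, T_lag, K = B.shape
    T = C.shape[1]
    E = np.zeros((I, T))
    for tau in range(min(T_lag, T)):
        # contribution of lag tau: valid for t >= tau, i.e. columns tau..T-1
        E[:, tau:] += np.einsum('ik,kt->it', B[:, tau, :], C[:, :T - tau])
    return E
```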

80 Here the basis tensor B encapsulates both vertical and temporal information of the notes that are likely to be used in a musical piece; the musical piece to be reconstructed will share B, possibly played at different times or tempi as modelled by G. [sent-206, score-0.981]

81 In eq 32, while forming X3 we concatenate isolated recordings corresponding to different notes. [sent-208, score-0.062]

82 Besides, T is a 0–1 matrix, where T(i, p) = 1 (0) if note i is played (not played) during time frame p, and F models the time-varying amplitudes of the training data. [sent-209, score-0.051]

83 Bottom Row: Reconstructed X1 , the ground truth, and the SNR results with increasing missing data. [sent-214, score-0.125]

84 Here, the initial SNR is computed by substituting 0 for the missing values. [sent-215, score-0.125]

85 5 Discussion: This paper establishes a link between GLMs and TFs and provides a general solution for the computation of arbitrary coupled TFs, using message passing primitives. [sent-216, score-0.33]

86 The framework also allows assigning different cost functions, i.e., distributions, to different observation tensors in a coupled factorization model. [sent-220, score-0.426]

87 This requires only minor modifications to the update equations. [sent-221, score-0.083]

88 We believe that, as a whole, the GCTF framework covers a broad range of models that can be useful in many different application areas beyond audio processing, such as network analysis, bioinformatics or collaborative filtering. [sent-222, score-0.139]

89 Acknowledgements: This work is funded by the TÜBİTAK grant number 110E292, Bayesian matrix and tensor factorisations (BAYTEN), and the Boğaziçi University research fund BAP5723. [sent-223, score-0.696]

90 Cemgil, Bayesian inference for nonnegative matrix factorisation models, Computational Intelligence and Neuroscience 2009 (2009) 1–17. [sent-230, score-0.589]

91 Gordon, A unified view of matrix factorization models, in: ECML PKDD’08, Part II, no. [sent-235, score-0.18]

92 Dunlavy, All-at-once optimization for coupled matrix and tensor factorizations, CoRR abs/1105. [sent-243, score-0.637]

93 Yang, Protein-protein interaction prediction via collective matrix factorization, in: In Proc. [sent-251, score-0.104]

94 Cemgil, Probabilistic latent tensor factorization, in: Proceedings of the 9th international conference on Latent variable analysis and signal separation, LVA/ICA’10, Springer-Verlag, 2010, pp. [sent-263, score-0.46]

95 Cemgil, Nonnegative matrix factorisations as probabilistic inference in composite models, in: Proc. [sent-268, score-0.287]

96 Cemgil, Algorithms for probabilistic latent tensor factorization, Signal Processing(2011),doi:10. [sent-274, score-0.495]

97 Kaas, Compound poisson distributions and glm’s, tweedie’s distribution, Tech. [sent-291, score-0.042]

98 Seung, Algorithms for non-negative matrix factorization, in: NIPS, Vol. [sent-312, score-0.076]

99 Mnih, Probabilistic matrix factorization, in: Advances in Neural Information Processing Systems, Vol. [sent-320, score-0.076]

100 Brown, Non-negative matrix factorization for polyphonic music transcription, in: WASPAA, 2003, pp. [sent-331, score-0.212]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('factorisation', 0.445), ('tensor', 0.394), ('musical', 0.2), ('glm', 0.182), ('factorisations', 0.176), ('gtf', 0.175), ('coupled', 0.167), ('tensors', 0.155), ('tweedie', 0.15), ('vec', 0.139), ('generalised', 0.132), ('piano', 0.132), ('missing', 0.125), ('audio', 0.11), ('cemgil', 0.11), ('gx', 0.108), ('factorization', 0.104), ('mf', 0.099), ('piece', 0.099), ('nmf', 0.095), ('snr', 0.093), ('tf', 0.09), ('link', 0.083), ('update', 0.083), ('fisher', 0.077), ('matrix', 0.076), ('ctf', 0.075), ('lmaz', 0.075), ('cp', 0.073), ('xi', 0.07), ('nonnegative', 0.068), ('latent', 0.066), ('parafac', 0.066), ('la', 0.065), ('sec', 0.065), ('pp', 0.064), ('isolated', 0.062), ('factorised', 0.057), ('va', 0.054), ('hz', 0.052), ('played', 0.051), ('azici', 0.05), ('kenan', 0.05), ('midi', 0.05), ('mur', 0.05), ('simsekli', 0.05), ('tfs', 0.05), ('tub', 0.05), ('umut', 0.05), ('dg', 0.048), ('assumes', 0.046), ('index', 0.045), ('family', 0.044), ('mcculloch', 0.044), ('roll', 0.044), ('li', 0.042), ('poisson', 0.042), ('message', 0.042), ('alternating', 0.041), ('identity', 0.041), ('acar', 0.04), ('array', 0.04), ('indices', 0.039), ('passing', 0.038), ('mi', 0.038), ('glms', 0.038), ('transcription', 0.038), ('spectra', 0.038), ('spectrogram', 0.038), ('kl', 0.038), ('frequency', 0.038), ('notes', 0.037), ('scoring', 0.037), ('unfolding', 0.036), ('dropping', 0.036), ('multiway', 0.036), ('vb', 0.036), ('probabilistic', 0.035), ('frames', 0.035), ('permutation', 0.035), ('arrays', 0.034), ('kolda', 0.034), ('fruitful', 0.033), ('factors', 0.033), ('guration', 0.032), ('special', 0.032), ('dl', 0.032), ('music', 0.032), ('les', 0.032), ('edition', 0.031), ('bo', 0.031), ('wi', 0.03), ('broad', 0.029), ('semantically', 0.029), ('gordon', 0.029), ('notation', 0.029), ('exponential', 0.029), ('collective', 0.028), ('sequel', 0.028), ('multiplicative', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 102 nips-2011-Generalised Coupled Tensor Factorisation

Author: Kenan Y. Yılmaz, Ali T. Cemgil, Umut Simsekli

Abstract: We derive algorithms for generalised tensor factorisation (GTF) by building upon the well-established theory of Generalised Linear Models. Our algorithms are general in the sense that we can compute arbitrary factorisations in a message passing framework, derived for a broad class of exponential family distributions including special cases such as Tweedie’s distributions corresponding to β-divergences. By bounding the step size of the Fisher Scoring iteration of the GLM, we obtain general updates for real data and multiplicative updates for non-negative data. The GTF framework is then easily extended to address problems where multiple observed tensors are factorised simultaneously. We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem.

2 0.33885983 270 nips-2011-Statistical Performance of Convex Tensor Decomposition

Author: Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, Hisashi Kashima

Abstract: We analyze the statistical performance of a recently proposed convex tensor decomposition algorithm. Conventionally tensor decomposition has been formulated as non-convex optimization problems, which hindered the analysis of their performance. We show under some conditions that the mean squared error of the convex method scales linearly with the quantity we call the normalized rank of the true tensor. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.

3 0.20493533 179 nips-2011-Multilinear Subspace Regression: An Orthogonal Tensor Decomposition Approach

Author: Qibin Zhao, Cesar F. Caiafa, Danilo P. Mandic, Liqing Zhang, Tonio Ball, Andreas Schulze-bonhage, Andrzej S. Cichocki

Abstract: A multilinear subspace regression model based on so called latent variable decomposition is introduced. Unlike standard regression methods which typically employ matrix (2D) data representations followed by vector subspace transformations, the proposed approach uses tensor subspace transformations to model common latent variables across both the independent and dependent data. The proposed approach aims to maximize the correlation between the so derived latent variables and is shown to be suitable for the prediction of multidimensional dependent data from multidimensional independent data, where for the estimation of the latent variables we introduce an algorithm based on Multilinear Singular Value Decomposition (MSVD) on a specially defined cross-covariance tensor. It is next shown that in this way we are also able to unify the existing Partial Least Squares (PLS) and N-way PLS regression algorithms within the same framework. Simulations on benchmark synthetic data confirm the advantages of the proposed approach, in terms of its predictive ability and robustness, especially for small sample sizes. The potential of the proposed technique is further illustrated on a real world task of the decoding of human intracranial electrocorticogram (ECoG) from a simultaneously recorded scalp electroencephalograph (EEG). 1

4 0.12371136 86 nips-2011-Empirical models of spiking in neural populations

Author: Jakob H. Macke, Lars Buesing, John P. Cunningham, Byron M. Yu, Krishna V. Shenoy, Maneesh Sahani

Abstract: Neurons in the neocortex code and compute as part of a locally interconnected population. Large-scale multi-electrode recording makes it possible to access these population processes empirically by fitting statistical models to unaveraged data. What statistical structure best describes the concurrent spiking of cells within a local network? We argue that in the cortex, where firing exhibits extensive correlations in both time and space and where a typical sample of neurons still reflects only a very small fraction of the local population, the most appropriate model captures shared variability by a low-dimensional latent process evolving with smooth dynamics, rather than by putative direct coupling. We test this claim by comparing a latent dynamical model with realistic spiking observations to coupled generalised linear spike-response models (GLMs) using cortical recordings. We find that the latent dynamical approach outperforms the GLM in terms of goodness-offit, and reproduces the temporal correlations in the data more accurately. We also compare models whose observations models are either derived from a Gaussian or point-process models, finding that the non-Gaussian model provides slightly better goodness-of-fit and more realistic population spike counts. 1

5 0.10860802 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

Author: Le Song, Eric P. Xing, Ankur P. Parikh

Abstract: Latent tree graphical models are natural tools for expressing long range and hierarchical dependencies among many variables which are common in computer vision, bioinformatics and natural language processing problems. However, existing models are largely restricted to discrete and Gaussian variables due to computational constraints; furthermore, algorithms for estimating the latent tree structure and learning the model parameters are largely restricted to heuristic local search. We present a method based on kernel embeddings of distributions for latent tree graphical models with continuous and non-Gaussian variables. Our method can recover the latent tree structures with provable guarantees and perform local-minimum free parameter learning and efficient inference. Experiments on simulated and real data show the advantage of our proposed approach. 1

6 0.09167318 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

7 0.085387133 258 nips-2011-Sparse Bayesian Multi-Task Learning

8 0.082607873 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation

9 0.077139869 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion

10 0.076402783 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise

11 0.073246121 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems

12 0.071027823 301 nips-2011-Variational Gaussian Process Dynamical Systems

13 0.065888211 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent

14 0.065312587 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

15 0.065072648 225 nips-2011-Probabilistic amplitude and frequency demodulation

16 0.055191129 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression

17 0.053457893 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

18 0.050720058 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables

19 0.050518461 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

20 0.050407164 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.167), (1, 0.037), (2, 0.023), (3, -0.1), (4, -0.071), (5, -0.024), (6, 0.038), (7, -0.045), (8, 0.15), (9, 0.075), (10, 0.07), (11, -0.113), (12, -0.085), (13, -0.026), (14, 0.057), (15, -0.244), (16, -0.207), (17, 0.051), (18, -0.091), (19, -0.111), (20, -0.181), (21, -0.215), (22, -0.056), (23, 0.008), (24, 0.106), (25, 0.004), (26, 0.035), (27, 0.001), (28, -0.028), (29, -0.028), (30, 0.083), (31, -0.025), (32, -0.029), (33, 0.145), (34, -0.054), (35, -0.028), (36, 0.039), (37, 0.054), (38, 0.047), (39, 0.167), (40, -0.073), (41, 0.02), (42, -0.038), (43, -0.031), (44, 0.042), (45, -0.077), (46, -0.074), (47, -0.055), (48, 0.06), (49, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92937177 102 nips-2011-Generalised Coupled Tensor Factorisation

Author: Kenan Y. Yılmaz, Ali T. Cemgil, Umut Simsekli

Abstract: We derive algorithms for generalised tensor factorisation (GTF) by building upon the well-established theory of Generalised Linear Models. Our algorithms are general in the sense that we can compute arbitrary factorisations in a message passing framework, derived for a broad class of exponential family distributions including special cases such as Tweedie’s distributions corresponding to β-divergences. By bounding the step size of the Fisher Scoring iteration of the GLM, we obtain general updates for real data and multiplicative updates for non-negative data. The GTF framework is then easily extended to address problems where multiple observed tensors are factorised simultaneously. We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem.

2 0.87944138 179 nips-2011-Multilinear Subspace Regression: An Orthogonal Tensor Decomposition Approach

Author: Qibin Zhao, Cesar F. Caiafa, Danilo P. Mandic, Liqing Zhang, Tonio Ball, Andreas Schulze-bonhage, Andrzej S. Cichocki

Abstract: A multilinear subspace regression model based on so called latent variable decomposition is introduced. Unlike standard regression methods which typically employ matrix (2D) data representations followed by vector subspace transformations, the proposed approach uses tensor subspace transformations to model common latent variables across both the independent and dependent data. The proposed approach aims to maximize the correlation between the so derived latent variables and is shown to be suitable for the prediction of multidimensional dependent data from multidimensional independent data, where for the estimation of the latent variables we introduce an algorithm based on Multilinear Singular Value Decomposition (MSVD) on a specially defined cross-covariance tensor. It is next shown that in this way we are also able to unify the existing Partial Least Squares (PLS) and N-way PLS regression algorithms within the same framework. Simulations on benchmark synthetic data confirm the advantages of the proposed approach, in terms of its predictive ability and robustness, especially for small sample sizes. The potential of the proposed technique is further illustrated on a real world task of the decoding of human intracranial electrocorticogram (ECoG) from a simultaneously recorded scalp electroencephalograph (EEG). 1

3 0.85791844 270 nips-2011-Statistical Performance of Convex Tensor Decomposition

Author: Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, Hisashi Kashima

Abstract: We analyze the statistical performance of a recently proposed convex tensor decomposition algorithm. Conventionally tensor decomposition has been formulated as non-convex optimization problems, which hindered the analysis of their performance. We show under some conditions that the mean squared error of the convex method scales linearly with the quantity we call the normalized rank of the true tensor. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.

4 0.44798985 108 nips-2011-Greedy Algorithms for Structurally Constrained High Dimensional Problems

Author: Ambuj Tewari, Pradeep K. Ravikumar, Inderjit S. Dhillon

Abstract: A hallmark of modern machine learning is its ability to deal with high dimensional problems by exploiting structural assumptions that limit the degrees of freedom in the underlying model. A deep understanding of the capabilities and limits of high dimensional learning methods under specific assumptions such as sparsity, group sparsity, and low rank has been attsined. Efforts [1,2] are now underway to distill this valuable experience by proposing general unified frameworks that can achieve the twio goals of summarizing previous analyses and enabling their application to notions of structure hitherto unexplored. Inspired by these developments, we propose and analyze a general computational scheme based on a greedy strategy to solve convex optimization problems that arise when dealing with structurally constrained high-dimensional problems. Our framework not only unifies existing greedy algorithms by recovering them as special cases but also yields novel ones. Finally, we extend our results to infinite dimensional settings by using interesting connections between smoothness of norms and behavior of martingales in Banach spaces.

5 0.41770172 5 nips-2011-A Denoising View of Matrix Completion

Author: Weiran Wang, Zhengdong Lu, Miguel Á. Carreira-Perpiñán

Abstract: In matrix completion, we are given a matrix where the values of only some of the entries are present, and we want to reconstruct the missing ones. Much work has focused on the assumption that the data matrix has low rank. We propose a more general assumption based on denoising, so that we expect that the value of a missing entry can be predicted from the values of neighboring points. We propose a nonparametric version of denoising based on local, iterated averaging with meanshift, possibly constrained to preserve local low-rank manifold structure. The few user parameters required (the denoising scale, number of neighbors and local dimensionality) and the number of iterations can be estimated by cross-validating the reconstruction error. Using our algorithms as a postprocessing step on an initial reconstruction (provided by e.g. a low-rank method), we show consistent improvements with synthetic, image and motion-capture data.
The paper is organized as follows: section 1 reviews nonparametric denoising methods based on mean-shift updates, section 2 extends this to matrix completion by using denoising with constraints, section 3 gives experimental results, and section 4 discusses related work. 1 1 Denoising with (manifold) blurring mean-shift algorithms (GBMS/MBMS) In Gaussian blurring mean-shift (GBMS), denoising is performed in a nonparametric way by local averaging: each data point moves to the average of its neighbors (to a certain scale), and the process is repeated. We follow the derivation in [8]. Consider a dataset {xn }N ⊂ RD and define a n=1 Gaussian kernel density estimate p(x) = 1 N N Gσ (x, xn ) (1) n=1 1 with bandwidth σ > 0 and kernel Gσ (x, xn ) ∝ exp − 2 ( x − xn /σ)2 (other kernels may be used, such as the Epanechnikov kernel, which results in sparse affinities). The (non-blurring) mean-shift algorithm rearranges the stationary point equation ∇p(x) = 0 into the iterative scheme x(τ +1) = f (x(τ ) ) with N x (τ +1) = f (x (τ ) p(n|x )= (τ ) )xn p(n|x (τ ) n=1 )= exp − 1 (x(τ ) − xn )/σ 2 N n′ =1 2 exp − 1 (x(τ ) − xn′ )/σ 2 2 . (2) This converges to a mode of p from almost every initial x ∈ RD , and can be seen as taking selfadapting step sizes along the gradient (since the mean shift f (x) − x is parallel to ∇p(x)). This iterative scheme was originally proposed by [9] and it or variations of it have found widespread application in clustering [8, 10–12] and denoising of 3D point sets (surface fairing; [13, 14]) and manifolds in general [5, 6]. The blurring mean-shift algorithm applies one step of the previous scheme, initialized from every point, in parallel for all points. That is, given the dataset X = {x1 , . . . , xN }, for each xn ∈ X ˜ we obtain a new point xn = f (xn ) by applying one step of the mean-shift algorithm, and then we ˜ replace X with the new dataset X, which is a blurred (shrunk) version of X. By iterating this process we obtain a sequence of datasets X(0) , X(1) , . . . (and a corresponding sequence of kernel density estimates p(0) (x), p(1) (x), . . .) where X(0) is the original dataset and X(τ ) is obtained by blurring X(τ −1) with one mean-shift step. We can see this process as maximizing the following objective function [10] by taking parallel steps of the form (2) for each point: N p(xn ) = E(X) = n=1 1 N N N 1 e− 2 Gσ (xn , xm ) ∝ xn −xm σ 2 . (3) n,m=1 n,m=1 This process eventually converges to a dataset X(∞) where all points are coincident: a completely denoised dataset where all structure has been erased. As shown by [8], this process can be stopped early to return clusters (= locally denoised subsets of points); the number of clusters obtained is controlled by the bandwidth σ. However, here we are interested in the denoising behavior of GBMS. ˜ The GBMS step can be formulated in a matrix form reminiscent of spectral clustering [8] as X = X P where X = (x1 , . . . , xN ) is a D×N matrix of data points; W is the N ×N matrix of Gaussian N affinities wnm = Gσ (xn , xm ); D = diag ( n=1 wnm ) is the degree matrix; and P = WD−1 is N an N × N stochastic matrix: pnm = p(n|xm ) ∈ (0, 1) and n=1 pnm = 1. P (or rather its transpose) is the stochastic matrix of the random walk in a graph [15], which in GBMS represents the posterior probabilities of each point under the kernel density estimate (1). P is similar to the 1 1 matrix N = D− 2 WD− 2 derived from the normalized graph Laplacian commonly used in spectral clustering, e.g. in the normalized cut [16]. 
Since, by the Perron-Frobenius theorem [17, ch. 8], all left eigenvalues of P(X) have magnitude less than 1 except for one that equals 1 and is associated with ˜ an eigenvector of constant entries, iterating X = X P(X) converges to the stationary distribution of each P(X), where all points coincide. ˜ From this point of view, the product X = X P(X) can be seen as filtering the dataset X with a datadependent low-pass filter P(X), which makes clear the denoising behavior. This also suggests using ˜ other filters [12] X = X φ(P(X)) as long as φ(1) = 1 and |φ(r)| < 1 for r ∈ [0, 1), such as explicit schemes φ(P) = (1 − η)I + ηP for η ∈ (0, 2], power schemes φ(P) = Pn for n = 1, 2, 3 . . . or implicit schemes φ(P) = ((1 + η)I − ηP)−1 for η > 0. One important problem with GBMS is that it denoises equally in all directions. When the data lies on a low-dimensional manifold, denoising orthogonally to it removes out-of-manifold noise, but 2 denoising tangentially to it perturbs intrinsic degrees of freedom of the data and causes shrinkage of the entire manifold (most strongly near its boundary). To prevent this, the manifold blurring meanshift algorithm (MBMS) [5] first computes a predictor averaging step with GBMS, and then for each point xn a corrector projective step removes the step direction that lies in the local tangent space of xn (obtained from local PCA run on its k nearest neighbors). In practice, both GBMS and MBMS must be stopped early to prevent excessive denoising and manifold distortions. 2 Blurring mean-shift denoising algorithms for matrix completion We consider the natural extension of GBMS to the matrix completion case by adding the constraints given by the present values. We use the subindex notation XM and XP to indicate selection of the missing or present values of the matrix XD×N , where P ⊂ U , M = U \ P and U = {(d, n): d = 1, . . . , D, n = 1, . . . , N }. The indices P and values XP of the present matrix entries are the data of the problem. Then we have the following constrained optimization problem: N Gσ (xn , xm ) max E(X) = X s.t. XP = XP . (4) n,m=1 This is similar to low-rank formulations for matrix completion that have the same constraints but use as objective function the reconstruction error with a low-rank assumption, e.g. X − ABX 2 with AD×L , BL×D and L < D. We initialize XM to the output of some other method for matrix completion, such as singular value projection (SVP; [7]). For simple constraints such as ours, gradient projection algorithms are attractive. The gradient of E wrt X is a matrix of D × N whose nth column is: ∇xn E(X) = 2 σ2 N 1 e− 2 xn −xm σ 2 N (xm − xn ) ∝ m=1 2 p(m|xn )xm p(xn ) −xn + σ2 m=1 (5) and its projection on the constraint space is given by zeroing its entries having indices in P; call ΠP this projection operator. Then, we have the following step of length α ≥ 0 along the projected gradient: (τ +1) X(τ +1) = X(τ ) + αΠP (∇X E(X(τ ) )) ⇐⇒ XM (τ ) = XM + α ΠP (∇X E(X(τ ) )) M (6) which updates only the missing entries XM . Since our search direction is ascent and makes an angle with the gradient that is bounded away from π/2, and E is lower bounded, continuously differentiable and has bounded Hessian (thus a Lipschitz continuous gradient) in RN L , by carrying out a line search that satisfies the Wolfe conditions, we are guaranteed convergence to a local stationary point, typically a maximizer [18, th. 3.2]. 
However, as reasoned later, we do not perform a line search at all, instead we fix the step size to the GBMS self-adapting step size, which results in a simple and faster algorithm consisting of carrying out a GBMS step on X (i.e., X(τ +1) = X(τ ) P(X(τ ) )) and then refilling XP to the present values. While we describe the algorithm in this way for ease of explanation, in practice we do not actually compute the GBMS step for all xdn values, but only for the missing ones, which is all we need. Thus, our algorithm carries out GBMS denoising steps within the missing-data subspace. We can derive this result in a different way by starting from N the unconstrained optimization problem maxXP E(X) = n,m=1 Gσ (xn , xm ) (equivalent to (4)), computing its gradient wrt XP , equating it to zero and rearranging (in the same way the mean-shift algorithm is derived) to obtain a fixed-point iteration identical to our update above. Fig. 1 shows the pseudocode for our denoising-based matrix completion algorithms (using three nonparametric denoising algorithms: GBMS, MBMS and LTP). Convergence and stopping criterion As noted above, we have guaranteed convergence by simply satisfying standard line search conditions, but a line search is costly. At present we do not have (τ +1) a proof that the GBMS step size satisfies such conditions, or indeed that the new iterate XM increases or leaves unchanged the objective, although we have never encountered a counterexample. In fact, it turns out that none of the work about GBMS that we know about proves that either: [10] proves that ∅(X(τ +1) ) ≤ ∅(X(τ ) ) for 0 < ρ < 1, where ∅(·) is the set diameter, while [8, 12] 3 notes that P(X) has a single eigenvalue of value 1 and all others of magnitued less than 1. While this shows that all points converge to the same location, which indeed is the global maximum of (3), it does not necessarily follow that each step decreases E. GBMS (k, σ) with full or k-nn graph: given XD×N , M repeat for n = 1, . . . , N Nn ← {1, . . . , N } (full graph) or k nearest neighbors of xn (k-nn graph) Gσ (xn ,xm ) mean-shift xm ∂xn ← −xn + m∈Nn step m′ ∈Nn Gσ (xn ,xm′ ) end XM ← XM + (∂X)M move points’ missing entries until validation error increases return X However, the question of convergence as τ → ∞ has no practical interest in a denoising setting, because achieving a total denoising almost never yields a good matrix completion. What we want is to achieve just enough denoising and stop the algorithm, as was the case with GBMS clustering, and as is the case in algorithms for image denoising. We propose to determine the optimal number of iterations, as well as the bandwidth σ and any other parameters, by cross-validation. Specifically, we select a held-out set by picking a random subset of the present entries and considering them as missing; this allows us to evaluate an error between our completion for them and the ground truth. We stop iterating when this error increases. MBMS (L, k, σ) with full or k-nn graph: given XD×N , M repeat for n = 1, . . . , N Nn ← {1, . . . 
, N } (full graph) or k nearest neighbors of xn (k-nn graph) Gσ (xn ,xm ) mean-shift xm ∂xn ← −xn + m∈Nn step m′ ∈Nn Gσ (xn ,xm′ ) Xn ← k nearest neighbors of xn (µn , Un ) ← PCA(Xn , L) estimate L-dim tangent space at xn subtract parallel motion ∂xn ← (I − Un UT )∂xn n end XM ← XM + (∂X)M move points’ missing entries until validation error increases return X This argument justifies an algorithmic, as opposed to an opLTP (L, k) with k-nn graph: given XD×N , M timization, view of denoisingrepeat based matrix completion: apfor n = 1, . . . , N ply a denoising step, refill the Xn ← k nearest neighbors of xn present values, iterate until the (µn , Un ) ← PCA(Xn , L) estimate L-dim tangent space at xn validation error increases. This project point onto tangent space allows very general definitions ∂xn ← (I − Un UT )(µn − xn ) n end of denoising, and indeed a lowXM ← XM + (∂X)M move points’ missing entries rank projection is a form of deuntil validation error increases noising where points are not alreturn X lowed outside the linear manifold. Our formulation using Figure 1: Our denoising matrix completion algorithms, based on the objective function (4) is still Manifold Blurring Mean Shift (MBMS) and its particular cases useful in that it connects our Local Tangent Projection (LTP, k-nn graph, σ = ∞) and Gauss- denoising assumption with the ian Blurring Mean Shift (GBMS, L = 0); see [5] for details. Nn more usual low-rank assumption contains all N points (full graph) or only xn ’s nearest neighbors that has been used in much ma(k-nn graph). The index M selects the components of its input trix completion work, and juscorresponding to missing values. Parameters: denoising scale σ, tifies the refilling step as renumber of neighbors k, local dimensionality L. sulting from the present-data constraints under a gradientprojection optimization. MBMS denoising for matrix completion Following our algorithmic-based approach to denois˜ ing, we could consider generalized GBMS steps of the form X = X φ(P(X)). For clustering, Carreira-Perpi˜ an [12] found an overrelaxed explicit step φ(P) = (1 − η)I + ηP with η ≈ 1.25 to n´ achieve similar clusterings but faster. Here, we focus instead on the MBMS variant of GBMS that allows only for orthogonal, not tangential, point motions (defined wrt their local tangent space as estimated by local PCA), with the goal of preserving low-dimensional manifold structure. MBMS has 3 user parameters: the bandwidth σ (for denoising), and the latent dimensionality L and the 4 number of neighbors k (for the local tangent space and the neighborhood graph). A special case of MBMS called local tangent projection (LTP) results by using a neighborhood graph and setting σ = ∞ (so only two user parameters are needed: L and k). LTP can be seen as doing a low-rank matrix completion locally. LTP was found in [5] to have nearly as good performance as the best σ in several problems. MBMS also includes as particular cases GBMS (L = 0), PCA (k = N , σ = ∞), and no denoising (σ = 0 or L = D). Note that if we apply MBMS to a dataset that lies on a linear manifold of dimensionality d using L ≥ d then no denoising occurs whatsoever because the GBMS updates lie on the d-dimensional manifold and are removed by the corrector step. In practice, even if the data are assumed noiseless, the reconstruction from a low-rank method will lie close to but not exactly on the d-dimensional manifold. 
However, this suggests using largish ranks for the low-rank method used to reconstruct X and lower L values in the subsequent MBMS run. In summary, this yields a matrix completion algorithm where we apply an MBMS step, refill the present values, and iterate until the validation error increases. Again, in an actual implementation we compute the MBMS step only for the missing entries of X. The shrinking problem of GBMS is less pronounced in our matrix completion setting, because we constrain some values not to change. Still, in agreement with [5], we find MBMS to be generally superior to GBMS. Computational cost With a full graph, the cost per iteration of GBMS and MBMS is O(N 2 D) and O(N 2 D + N (D + k) min(D, k)2 ), respectively. In practice with high-dimensional data, best denoising results are obtained using a neighborhood graph [5], so that the sums over points in eqs. (3) or (4) extend only to the neighbors. With a k-nearest-neighbor graph and if we do not update the neighbors at each iteration (which affects the result little), the respective cost per iteration is O(N kD) and O(N kD + N (D + k) min(D, k)2 ), thus linear in N . The graph is constructed on the initial X we use, consisting of the present values and an imputation for the missing ones achieved with a standard matrix completion method, and has a one-off cost of O(N 2 D). The cost when we have a fraction µ = |M| ∈ [0, 1] of missing data is simply the above times µ. Hence the run time ND of our mean-shift-based matrix completion algorithms is faster the more present data we have, and thus faster than the usual GBMS or MBMS case, where all data are effectively missing. 3 Experimental results We compare with representative methods of several approaches: a low-rank matrix completion method, singular value projection (SVP [7], whose performance we found similar to that of alternating least squares, ALS [3, 4]); fitting a D-dimensional Gaussian model with EM and imputing the missing values of each xn as the conditional mean E {xn,Mn |xn,Pn } (we use the implementation of [19]); and the nonlinear method of [20] (nlPCA). We initialize GBMS and MBMS from some or all of these algorithms. For methods with user parameters, we set them by cross-validation in the following way: we randomly select 10% of the present entries and pretend they are missing as well, we run the algorithm on the remaining 90% of the present values, and we evaluate the reconstruction at the 10% entries we kept earlier. We repeat this over different parameters’ values and pick the one with lowest reconstruction error. We then run the algorithm with these parameters values on the entire present data and report the (test) error with the ground truth for the missing values. 100D Swissroll We created a 3D swissroll data set with 3 000 points and lifted it to 100D with a random orthonormal mapping, and added a little noise (spherical Gaussian with stdev 0.1). We selected uniformly at random 6.76% of the entries to be present. We use the Gaussian model and SVP (fixed rank = 3) as initialization for our algorithm. We typically find that these initial X are very noisy (fig. 3), with some reconstructed points lying between different branches of the manifold and causing a big reconstruction error. We fixed L = 2 (the known dimensionality) for MBMS and cross-validated the other parameters: σ and k for MBMS and GBMS (both using k-nn graph), and the number of iterations τ to be used. 
Table 1 gives the performance of MBMS and GBMS for testing, along with their optimal parameters. Fig. 3 shows the results of different methods at a few iterations. MBMS initialized from the Gaussian model gives the most remarkable denoising effect. To show that there is a wide range of σ and number of iterations τ that give good performance with GBMS and MBMS, we fix k = 50 and run the algorithm with varying σ values and plot the reconstruction error for missing entries over iterations in fig. 2. Both GBMS can achieve good 5 Methods Gaussian + GBMS (∞, 10, 0, 1) + MBMS (1, 20, 2, 25) SVP + GBMS (3, 50, 0, 1) + MBMS (3, 50, 2, 2) RSSE 168.1 165.8 157.2 156.8 151.4 151.8 mean 2.63 2.57 2.36 1.94 1.89 1.87 stdev 1.59 1.61 1.63 2.10 2.02 2.05 Methods nlPCA SVP + GBMS (400,140,0,1) + MBMS (500,140,9,5) Table 1: Swissroll data set: reconstruction errors obtained by different algorithms along with their optimal parameters (σ, k, L, no. iterations τ ). The three columns show the root sum of squared errors on missing entries, the mean, and the standard deviation of the pointwise reconstruction error, resp. SVP + GBMS error (RSSE) 180 170 SVP + MBMS Gaussian + GBMS 180 180 170 170 170 160 160 ∞ 160 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ stdev 42.6 39.3 37.7 34.9 Gaussian + MBMS 180 8 10 15 25 mean 26.1 21.8 18.8 17.0 Table 2: MNIST-7 data set: errors of the different algorithms and their optimal parameters (σ, k, L, no. iterations τ ). The three columns show the root sum of squared errors on missing entries (×10−4 ), the mean, and the standard deviation of pixel errors, respectively. 160 0.3 0.5 1 2 3 5 RSSE 7.77 6.99 6.54 6.03 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ Figure 2: Reconstruction error of GBMS/MBMS over iterations (each curve is a different σ value). denoising (and reconstruction), but MBMS is more robust, with good results occurring for a wide range of iterations, indicating it is able to preserve the manifold structure better. Mocap data We use the running-motion sequence 09 01 from the CMU mocap database with 148 samples (≈ 1.7 cycles) with 150 sensor readings (3D positions of 50 joints on a human body). The motion is intrinsically 1D, tracing a loop in 150D. We compare nlPCA, SVP, the Gaussian model, and MBMS initialized from the first three algorithms. For nlPCA, we do a grid search for the weight decay coefficient while fixing its structure to be 2 × 10 × 150 units, and use an early stopping criterion. For SVP, we do grid search on {1, 2, 3, 5, 7, 10} for the rank. For MBMS (L = 1) and GBMS (L = 0), we do grid search for σ and k. We report the reconstruction error as a function of the proportion of missing entries from 50% to 95%. For each missing-data proportion, we randomly select 5 different sets of present values and run all algorithms for them. Fig. 4 gives the mean errors of all algorithms. All methods perform well when missing-data proportion is small. nlPCA, being prone to local optima, is less stable than SVP and the Gaussian model, especially when the missing-data proportion is large. The Gaussian model gives the best and most stable initialization. At 95%, all methods fail to give an acceptable reconstruction, but up to 90% missing entries, MBMS and GBMS always beat the other algorithms. Fig. 4 shows selected reconstructions from all algorithms. 
Figure 3 (plots not reproduced; panels: SVP τ = 0, SVP + GBMS τ = 1, SVP + MBMS τ = 2, Gaussian τ = 0, Gaussian + GBMS τ = 1, Gaussian + MBMS τ = 25): Denoising effect of the different algorithms. For visualization, we project the 100D data to 3D with the projection matrix used for creating the data. Present values are refilled for all plots.

Figure 4 (plots not reproduced; left panel: error vs. % of missing data for nlPCA, nlPCA + GBMS, nlPCA + MBMS, SVP, SVP + GBMS, SVP + MBMS, Gaussian, Gaussian + GBMS and Gaussian + MBMS; right panel: frame 2 (leg distance), frame 10 (foot pose), frame 147 (leg pose)): Left: mean of errors (RSSE) over 5 runs obtained by the different algorithms for varying percentages of missing values. Errorbars shown only for Gaussian + MBMS to avoid clutter. Right: sample reconstructions when 85% of the data is missing. Row 1: initialization. Row 2: init + GBMS. Row 3: init + MBMS. Color indicates the initialization: black, original data; red, nlPCA; blue, SVP; green, Gaussian.

MNIST digit ‘7’. The MNIST digit ‘7’ data set contains 6 265 greyscale (0–255) images of size 28 × 28. We create missing entries in a way reminiscent of run-length errors in transmission: we generate 16 to 26 rectangular boxes, each with an area of approximately 25 pixels, at random locations in each image and use them to black out pixels (a sketch of one way to generate such a pattern is given below). In this way, we create a high-dimensional data set (784 dimensions) with about 50% of the entries missing on average. Because of the loss of spatial correlations within the blocks, this missing-data pattern is harder than a random one. The Gaussian model cannot handle such a big data set because it involves inverting large covariance matrices. nlPCA is also very slow and we cannot afford to cross-validate its structure or its weight decay coefficient, so we picked a reasonable structure (10 × 30 × 784 units), used the default weight decay parameter in the code (10^-3), and allowed up to 500 iterations. We only use SVP as initialization for our algorithm. Since the intrinsic dimension of MNIST is suspected to be not very high, we used rank 10 for SVP and L = 9 for MBMS. We also use the same k = 140 as in [5]. So we only had to choose σ and the number of iterations via cross-validation. Table 2 shows the methods and their corresponding errors. Fig. 5 shows some representative reconstructions from the different algorithms, with present values refilled. The mean-shift averaging among nearby neighbors (a soft form of majority voting) helps to eliminate noise, unusual strokes and other artifacts created by SVP, which by their nature tend to occur in different image locations over the neighborhood of images.

Figure 5 (images not reproduced; columns: Orig, Missing, nlPCA, SVP, GBMS, MBMS): Selected reconstructions of MNIST block-occluded digits ‘7’ with the different methods.
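The occlusion pattern just described can be simulated as follows. This is a hypothetical sketch: the text above fixes the number of boxes (16 to 26) and their approximate area (25 pixels), but not their exact shapes, so the (5 × 5, 1 × 25, 25 × 1) box shapes used here are an assumption.

```python
# One way to generate a block-occlusion missing-data mask for a 28x28 image,
# under the assumptions stated above.  True marks a *missing* pixel.
import numpy as np

def occlusion_mask(rng, img_shape=(28, 28), n_boxes_range=(16, 26),
                   box_shapes=((5, 5), (1, 25), (25, 1))):
    H, W = img_shape
    mask = np.zeros(img_shape, dtype=bool)
    n_boxes = rng.integers(n_boxes_range[0], n_boxes_range[1] + 1)
    for _ in range(n_boxes):
        h, w = box_shapes[rng.integers(len(box_shapes))]   # pick a box shape
        top = rng.integers(0, H - h + 1)                   # random location
        left = rng.integers(0, W - w + 1)
        mask[top:top + h, left:left + w] = True            # black out the box
    return mask

rng = np.random.default_rng(0)
m = occlusion_mask(rng)
print(m.mean())   # fraction of missing pixels in this image (roughly 0.5)
```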
4 Related work
Matrix completion is widely studied in theoretical compressed sensing [1, 2] as well as in practical recommender systems [3, 4]. Most matrix completion models rely on a low-rank assumption and cannot fully exploit more complex structure in the problem, such as curved manifolds. Related work also exists on multi-task learning in a broad sense, which extracts the common structure shared by multiple related objects and achieves simultaneous learning on them. This includes applications such as alignment of noise-corrupted images [21], recovery of images with occlusion [22], and even learning of multiple related regressors or classifiers [23]. Again, all these works are essentially based on a subspace assumption and do not generalize to more complex situations.

A line of work based on a nonlinear low-rank assumption (with a latent variable $z$ of dimensionality $L < D$) involves setting up a least-squares error function
$$\min_{f,\,Z} \sum_{n=1}^{N} \|x_n - f(z_n)\|^2 = \sum_{n,d=1}^{N,D} \bigl(x_{dn} - f_d(z_n)\bigr)^2,$$
where one ignores the terms for which $x_{dn}$ is missing, and estimates the function $f$ and the low-dimensional data projections $Z$ by alternating optimization. Linear functions $f$ have been used in the homogeneity analysis literature [24], where this approach is called “missing data deleted”. Nonlinear functions $f$ have been used recently (neural nets [20]; Gaussian processes for collaborative filtering [25]). Better results are obtained by adding a projection term $\sum_{n=1}^{N} \|z_n - F(x_n)\|^2$ and optimizing over the missing data as well [26].
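To make the alternating-optimization scheme concrete, the toy sketch below handles the simplest case of a linear map f (so f(z_n) = A z_n): it alternates least-squares updates of the projections Z and the map A, using only the observed entries (“missing data deleted”). This is only an illustration under that linearity assumption, not the methods of [20], [24]–[26]; it also assumes every point and every dimension has at least one observed entry.

```python
# Alternating least squares for X ≈ Z A^T using only the observed entries.
import numpy as np

def alternating_fit(X, present_mask, L, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Z = rng.standard_normal((N, L))          # low-dimensional projections z_n
    A = rng.standard_normal((D, L))          # linear map, f(z) = A z
    for _ in range(n_iter):
        for n in range(N):                   # update z_n from observed dims of x_n
            obs = present_mask[n]
            Z[n] = np.linalg.lstsq(A[obs], X[n, obs], rcond=None)[0]
        for d in range(D):                   # update row d of A from points observing dim d
            obs = present_mask[:, d]
            A[d] = np.linalg.lstsq(Z[obs], X[obs, d], rcond=None)[0]
    return Z, A

# The reconstruction Z @ A.T can then be used to fill in the missing entries.
```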
Prior to our denoising-based work there have been efforts to extend low-rank models to smooth manifolds, mostly in the context of compressed sensing. Baraniuk and Wakin [27] show that certain random measurements, e.g. random projections to a low-dimensional subspace, can preserve the metric of the manifold fairly well if the intrinsic dimension and the curvature of the manifold are both small enough. However, these observations are not suitable for matrix completion, and no algorithm is given for recovering the signal. Chen et al. [28] explicitly model a pre-determined manifold and use it to regularize the signal when recovering the missing values. They estimate the manifold given complete data, while no complete data is assumed in our matrix completion setting. Another related work is [29], where a manifold modeled with Isomap is used to estimate the positions of satellite cameras in an iterative manner. Finally, our expectation that the value of a missing entry can be predicted from the values of neighboring points is similar to one category of collaborative filtering methods that essentially use similar users/items to predict missing values [3, 4].

5 Conclusion
We have proposed a new paradigm for matrix completion, denoising, which generalizes the commonly used assumption of low rank. Assuming low rank implies a restrictive form of denoising where the data is forced to have zero variance away from a linear manifold. More general definitions of denoising can potentially handle data that lives on a low-dimensional manifold that is nonlinear, or whose dimensionality varies (e.g. a set of manifolds), or that does not have low rank at all, and they naturally handle noise in the data. Denoising works because of the fundamental fact that a missing value can be predicted by averaging nearby present values. Although we motivate our framework from a constrained-optimization point of view (denoise subject to respecting the present data), we argue for an algorithmic view of denoising-based matrix completion: apply a denoising step, refill the present values, and iterate until the validation error increases. In turn, this allows different forms of denoising, such as denoising based on low-rank projection (earlier work) or on local averaging with blurring mean-shift (this paper). Our nonparametric choice of mean-shift averaging further relaxes assumptions about the data and results in a simple algorithm with very few user parameters that afford user control (denoising scale, local dimensionality) but can be set automatically by cross-validation. Our algorithms are intended to be used as a postprocessing step over a user-provided initialization of the missing values, and we show that they consistently improve upon existing algorithms. The MBMS-based algorithm bridges the gap between pure denoising (GBMS) and local low rank. Other definitions of denoising should be possible, for example using temporal as well as spatial neighborhoods, and even applicable to discrete data if we consider denoising as a majority vote among the neighbours of a vector (with suitable definitions of votes and neighborhoods).

Acknowledgments
Work supported by NSF CAREER award IIS-0754089.

References
[1] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
[2] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Information Theory, 56(5):2053–2080, April 2010.
[3] Yehuda Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. SIGKDD 2008, pages 426–434, Las Vegas, NV, August 24–27 2008.
[4] Robert Bell and Yehuda Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. ICDM 2007, pages 43–52, October 28–31 2007.
[5] Weiran Wang and Miguel Á. Carreira-Perpiñán. Manifold blurring mean shift algorithms for manifold denoising. CVPR 2010, pages 1759–1766, San Francisco, CA, June 13–18 2010.
[6] Matthias Hein and Markus Maier. Manifold denoising. NIPS 2006, 19:561–568. MIT Press, 2007.
[7] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Guaranteed rank minimization via singular value projection. NIPS 2010, 23:937–945. MIT Press, 2011.
[8] Miguel Á. Carreira-Perpiñán. Fast nonparametric clustering with Gaussian blurring mean-shift. ICML 2006, pages 153–160, Pittsburgh, PA, June 25–29 2006.
[9] Keinosuke Fukunaga and Larry D. Hostetler. The estimation of the gradient of a density function, with application in pattern recognition. IEEE Trans. Information Theory, 21(1):32–40, January 1975.
[10] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. PAMI, 17(8):790–799, 1995.
[11] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. PAMI, 24(5):603–619, May 2002.
[12] Miguel Á. Carreira-Perpiñán. Generalised blurring mean-shift algorithms for nonparametric clustering. CVPR 2008, Anchorage, AK, June 23–28 2008.
[13] Gabriel Taubin. A signal processing approach to fair surface design. SIGGRAPH 1995, pages 351–358.
[14] Mathieu Desbrun, Mark Meyer, Peter Schröder, and Alan H. Barr. Implicit fairing of irregular meshes using diffusion and curvature flow. SIGGRAPH 1999, pages 317–324.
[15] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, Providence, RI, 1997.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888–905, August 2000.
[17] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1986.
[18] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer-Verlag, New York, second edition, 2006.
[19] Tapio Schneider. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14(5):853–871, March 2001.
[20] Matthias Scholz, Fatma Kaplan, Charles L. Guy, Joachim Kopka, and Joachim Selbig. Non-linear PCA: A missing data approach. Bioinformatics, 21(20):3887–3895, October 15 2005.
[21] Yigang Peng, Arvind Ganesh, John Wright, Wenli Xu, and Yi Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. CVPR 2010, pages 763–770, 2010.
[22] A. M. Buchanan and A. W. Fitzgibbon. Damped Newton algorithms for matrix factorization with missing data. CVPR 2005, pages 316–322, San Diego, CA, June 20–25 2005.
[23] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. NIPS 2006, 19:41–48. MIT Press, 2007.
[24] Albert Gifi. Nonlinear Multivariate Analysis. John Wiley & Sons, 1990.
[25] Neil D. Lawrence and Raquel Urtasun. Non-linear matrix factorization with Gaussian processes. ICML 2009, Montreal, Canada, June 14–18 2009.
[26] Miguel Á. Carreira-Perpiñán and Zhengdong Lu. Manifold learning and missing data recovery through unsupervised regression. ICDM 2011, December 11–14 2011.
[27] Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51–77, February 2009.
[28] Minhua Chen, Jorge Silva, John Paisley, Chunping Wang, David Dunson, and Lawrence Carin. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Trans. Signal Processing, 58(12):6140–6155, December 2010.
[29] Michael B. Wakin. A manifold lifting algorithm for multi-view compressive imaging. In Proc. 27th Conference on Picture Coding Symposium (PCS’09), pages 381–384, 2009.

6 0.41533196 107 nips-2011-Global Solution of Fully-Observed Variational Bayesian Matrix Factorization is Column-Wise Independent

7 0.38656208 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion

8 0.36193171 225 nips-2011-Probabilistic amplitude and frequency demodulation

9 0.35992727 83 nips-2011-Efficient inference in matrix-variate Gaussian models with iid observation noise

10 0.35434902 86 nips-2011-Empirical models of spiking in neural populations

11 0.34888023 301 nips-2011-Variational Gaussian Process Dynamical Systems

12 0.3486279 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

13 0.34737846 211 nips-2011-Penalty Decomposition Methods for Rank Minimization

14 0.32795131 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation

15 0.32019207 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds

16 0.30603316 148 nips-2011-Learning Probabilistic Non-Linear Latent Variable Models for Tracking Complex Activities

17 0.28421116 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

18 0.27804053 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

19 0.27765742 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines

20 0.26815486 68 nips-2011-Demixed Principal Component Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.033), (4, 0.046), (20, 0.03), (26, 0.032), (31, 0.092), (33, 0.025), (42, 0.236), (43, 0.061), (45, 0.061), (57, 0.056), (65, 0.018), (74, 0.046), (83, 0.051), (84, 0.014), (89, 0.011), (99, 0.107)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79151535 102 nips-2011-Generalised Coupled Tensor Factorisation

Author: Kenan Y. Yılmaz, Ali T. Cemgil, Umut Simsekli

Abstract: We derive algorithms for generalised tensor factorisation (GTF) by building upon the well-established theory of Generalised Linear Models. Our algorithms are general in the sense that we can compute arbitrary factorisations in a message passing framework, derived for a broad class of exponential family distributions including special cases such as Tweedie’s distributions corresponding to βdivergences. By bounding the step size of the Fisher Scoring iteration of the GLM, we obtain general updates for real data and multiplicative updates for non-negative data. The GTF framework is, then extended easily to address the problems when multiple observed tensors are factorised simultaneously. We illustrate our coupled factorisation approach on synthetic data as well as on a musical audio restoration problem. 1

2 0.73057365 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm

Author: Fabio Vitale, Nicolò Cesa-Bianchi, Claudio Gentile, Giovanni Zappella

Abstract: Predicting the nodes of a given graph is a fascinating theoretical problem with applications in several domains. Since graph sparsification via spanning trees retains enough information while making the task much easier, trees are an important special case of this problem. Although it is known how to predict the nodes of an unweighted tree in a nearly optimal way, in the weighted case a fully satisfactory algorithm is not available yet. We fill this hole and introduce an efficient node predictor, S HAZOO, which is nearly optimal on any weighted tree. Moreover, we show that S HAZOO can be viewed as a common nontrivial generalization of both previous approaches for unweighted trees and weighted lines. Experiments on real-world datasets confirm that S HAZOO performs well in that it fully exploits the structure of the input tree, and gets very close to (and sometimes better than) less scalable energy minimization methods. 1

3 0.59249794 270 nips-2011-Statistical Performance of Convex Tensor Decomposition

Author: Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, Hisashi Kashima

Abstract: We analyze the statistical performance of a recently proposed convex tensor decomposition algorithm. Conventionally tensor decomposition has been formulated as non-convex optimization problems, which hindered the analysis of their performance. We show under some conditions that the mean squared error of the convex method scales linearly with the quantity we call the normalized rank of the true tensor. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.

4 0.588884 8 nips-2011-A Model for Temporal Dependencies in Event Streams

Author: Asela Gunawardana, Christopher Meek, Puyang Xu

Abstract: We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events. 1

5 0.58017761 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

Author: Alex K. Susemihl, Ron Meir, Manfred Opper

Abstract: Bayesian filtering of stochastic stimuli has received a great deal of attention recently. It has been applied to describe the way in which biological systems dynamically represent and make decisions about the environment. There have been no exact results for the error in the biologically plausible setting of inference on point process, however. We present an exact analysis of the evolution of the meansquared error in a state estimation task using Gaussian-tuned point processes as sensors. This allows us to study the dynamics of the error of an optimal Bayesian decoder, providing insights into the limits obtainable in this task. This is done for Markovian and a class of non-Markovian Gaussian processes. We find that there is an optimal tuning width for which the error is minimized. This leads to a characterization of the optimal encoding for the setting as a function of the statistics of the stimulus, providing a mathematically sound primer for an ecological theory of sensory processing. 1

6 0.57996547 186 nips-2011-Noise Thresholds for Spectral Clustering

7 0.57975399 66 nips-2011-Crowdclustering

8 0.57651025 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

9 0.57275599 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations

10 0.56965667 75 nips-2011-Dynamical segmentation of single trials from population neural data

11 0.56914312 249 nips-2011-Sequence learning with hidden units in spiking neural networks

12 0.56816745 219 nips-2011-Predicting response time and error rates in visual search

13 0.56798875 180 nips-2011-Multiple Instance Filtering

14 0.56787318 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

15 0.56747818 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries

16 0.56568551 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

17 0.56378442 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs

18 0.56364942 258 nips-2011-Sparse Bayesian Multi-Task Learning

19 0.56357729 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models

20 0.56200999 54 nips-2011-Co-regularized Multi-view Spectral Clustering