nips nips2011 nips2011-164 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nitesh Shroff, Pavan Turaga, Rama Chellappa
Abstract: In this paper, we consider the Précis problem of sampling K representative yet diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to be applied to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we consider the Précis problem of sampling K representative yet diverse data points from a large dataset. [sent-4, score-0.18]
2 We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. [sent-7, score-0.302]
3 The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. [sent-9, score-0.432]
4 1 Introduction The problem of sampling K representative data points from a large dataset arises frequently in various applications. [sent-11, score-0.15]
5 This has made the optimal selection of a few representative exemplars from the dataset an important step in exploratory data analysis. [sent-14, score-0.64]
6 Similarly, in medical image analysis, picking a subset of K anatomical shapes from a large population helps in identifying the variations within and across shape classes, providing an invaluable tool for analysts. [sent-16, score-0.217]
7 However, there seems to be a consensus in selecting exemplars that are representative of the dataset while minimizing the redundancy between the exemplars. [sent-18, score-0.589]
8 [2] extended this idea to selecting exemplars from videos that maximize ‘coverage’ and ‘diversity’. [sent-22, score-0.546]
9 Similarly, in statistics, stratified sampling techniques sample the population by dividing the dataset into mutually exclusive and exhaustive ‘strata’ (sub-groups), followed by a random selection of representatives from each stratum [4]. [sent-25, score-0.154]
10 In 1 applications like computer vision, images and videos are represented by features/models like shapes [10], bags-of-words, linear dynamical systems (LDS) [11], etc. [sent-32, score-0.164]
11 Since these feature/model spaces have a non-trivial manifold structure, the distance metrics are highly non-linear functions. [sent-34, score-0.156]
12 Examples of features/models - manifold pairs include: shapes - complex spherical manifold [10], linear subspaces - Grassmann manifold, covariance matrices - tensor space, histograms - simplex in Rn , etc. [sent-35, score-0.345]
13 The geometric properties of the non-Euclidean manifolds allow one to develop accurate inference and classification algorithms [13, 14]. [sent-37, score-0.216]
14 In this paper, we focus on the problem of selecting a subset of K exemplars from a dataset of N points when the dataset has an underlying manifold structure to it. [sent-38, score-0.795]
15 We formulate notions of representational error and diversity of exemplars that exploit the non-Euclidean structure of the data points, and then propose an efficient annealing-based optimization algorithm. [sent-39, score-0.711]
16 Given a data matrix Y , the goal of RRQR factorization is to find a permutation matrix Π such that the QR factorization of Y Π reveals the numerical rank of the matrix. [sent-42, score-0.186]
17 The resultant matrix Y Π has as its first K columns the most “well-conditioned” columns of the matrix Y . [sent-43, score-0.178]
18 The goal of CSS is to pick K columns forming a matrix C ∈ Rm×K such that the residual || Y − PC Y ||ζ is minimized over all possible choices for the matrix C. [sent-45, score-0.167]
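As an illustration of this CSS objective (not code from the cited work), the sketch below computes the residual ||Y − P_C Y||_F for a candidate column subset; the greedy loop at the end is only a toy selection strategy used to exercise the function.

```python
import numpy as np

def css_residual(Y, cols):
    """Frobenius-norm residual ||Y - P_C Y||_F for the column subset `cols`,
    where P_C is the orthogonal projector onto the span of those columns."""
    C = Y[:, cols]                      # m x K matrix of selected columns
    Q, _ = np.linalg.qr(C)              # orthonormal basis for span(C)
    return np.linalg.norm(Y - Q @ (Q.T @ Y), 'fro')

# Toy usage: pick K columns greedily by residual reduction (illustrative only).
rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 50))
K, chosen = 3, []
for _ in range(K):
    best = min((c for c in range(Y.shape[1]) if c not in chosen),
               key=lambda c: css_residual(Y, chosen + [c]))
    chosen.append(best)
print(chosen, css_residual(Y, chosen))
```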
19 In order to select K exemplars, data points are first clustered into ℓ clusters (ℓ ≤ K), followed by the selection of one or multiple exemplars from each cluster to obtain the best representation or low-rank approximation of each cluster. [sent-51, score-0.584]
20 Affinity Propagation [21] is a clustering algorithm that takes similarity measures as input and recursively passes messages between nodes until a set of exemplars emerges. [sent-52, score-0.518]
21 To the best of our knowledge, the problem of subset selection for analytic manifolds remains largely unaddressed. [sent-58, score-0.307]
22 While one could try to solve the problem by obtaining an embedding of a given manifold into a larger ambient Euclidean space, it is desirable to have a solution that is more intrinsic in nature. [sent-59, score-0.186]
23 Further, manifolds such as the Grassmannian or the manifold of infinite-dimensional diffeomorphisms do not admit a natural embedding into a vector space. [sent-61, score-0.404]
24 2 Subset Selection on Analytic Manifolds In this section, we formalize the subset selection problem on manifolds and propose an efficient algorithm. [sent-63, score-0.307]
25 The set of all such tangent vectors is called the tangent space to M at p. [sent-68, score-0.166]
26 If M is a Riemannian manifold, then the exponential map expp : Tp (M) → M is defined by expp (v) = αv (1), where αv is the geodesic starting at p with initial velocity v. [sent-69, score-0.248]
27 The inverse exponential map (logarithmic map) logp : M → Tp (M) takes a point on the manifold and returns a point on the tangent space, which is a Euclidean space. [sent-70, score-0.307]
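To make these two maps concrete, here is a minimal sketch for one manifold with known geometry, the unit hypersphere; this example is ours and is not singled out in the text.

```python
import numpy as np

def sphere_exp(p, v, eps=1e-12):
    """Exponential map on the unit sphere: follow the geodesic from p
    with initial tangent velocity v (v orthogonal to p) for unit time."""
    nv = np.linalg.norm(v)
    if nv < eps:
        return p
    return np.cos(nv) * p + np.sin(nv) * (v / nv)

def sphere_log(p, x, eps=1e-12):
    """Inverse exponential (log) map: tangent vector at p pointing towards x,
    with length equal to the geodesic distance arccos(<p, x>)."""
    u = x - (p @ x) * p                  # project x onto the tangent space at p
    nu = np.linalg.norm(u)
    theta = np.arccos(np.clip(p @ x, -1.0, 1.0))
    return (theta / nu) * u if nu > eps else np.zeros_like(p)
```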
28 The goal is to select a few exemplars E = {e1 , . . . , eK } [sent-75, score-0.475]
29 from the set X, such that the exemplars provide a good representation of the given data points, and are minimally redundant. [sent-78, score-0.475]
30 The linear span error is given by minz ||X − Ez||²_F , where X is the matrix form of the data and E is the matrix of chosen exemplars. [sent-80, score-0.16]
31 The nearest-exemplar error is given by Σi Σ_{xk ∈ Φi} ||xk − ei ||², where ei is the i-th exemplar and Φi corresponds to its Voronoi region. [sent-81, score-0.306]
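For the plain vector-space case, the two errors can be computed as below; this is an illustrative sketch (not code from the paper), with X holding one data point per column and E one exemplar per column.

```python
import numpy as np

def linear_span_error(X, E):
    """min_Z ||X - E Z||_F^2 : residual of projecting the data onto span(E)."""
    Z, *_ = np.linalg.lstsq(E, X, rcond=None)
    return np.linalg.norm(X - E @ Z, 'fro') ** 2

def nearest_exemplar_error(X, E):
    """sum_i sum_{x_k in Phi_i} ||x_k - e_i||^2, Phi_i = Voronoi region of e_i."""
    d2 = ((X[:, :, None] - E[:, None, :]) ** 2).sum(axis=0)   # N x K squared distances
    return d2.min(axis=1).sum()
```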
32 Of these two measures, the notion of linear span, while appropriate for matrix approximation, is not particularly meaningful for general dataset approximation problems since the ‘span’ of a dataset item does not carry much perceptually meaningful information. [sent-82, score-0.187]
33 All elements of this set however are perceptually equivalent, and one does not obtain any representational advantage from considering the span of x. [sent-85, score-0.151]
34 One could attempt to define the notion of linear spans on the manifold as the set of points lying on the geodesic shot from some fixed pole toward the given dataset item. [sent-88, score-0.266]
35 For example, samples from the linear span of a few shapes would give physically meaningless shapes. [sent-91, score-0.159]
36 Hence, it is natural to consider the representational error of a set X with respect to a set of exemplars E as follows: Jrep (E) = Σi Σ_{xj ∈ Φi} dg²(xj , ei ) (1). Here, dg is the geodesic distance on the manifold and Φi is the Voronoi region of the i-th exemplar. [sent-92, score-0.781]
37 This is followed by selecting the representative exemplar ei as the data point closest to the Karcher mean µi of its Voronoi region. [sent-99, score-0.236]
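A sketch of one such representation update is given below. It assumes user-supplied `log_map`, `exp_map`, and geodesic distance `d_g` for the manifold at hand; these helper names are ours, and the gradient-descent Karcher mean is one standard way to realize the intrinsic mean step.

```python
import numpy as np

def karcher_mean(points, log_map, exp_map, n_iter=20, step=1.0):
    """Gradient-descent Karcher mean: average the log-mapped points in the
    tangent space at the current estimate, then shoot back with exp."""
    mu = points[0]
    for _ in range(n_iter):
        v = np.mean([log_map(mu, x) for x in points], axis=0)
        mu = exp_map(mu, step * v)
    return mu

def assign_and_update(points, exemplars, d_g, log_map, exp_map):
    """One J_rep-style pass: Voronoi assignment by geodesic distance, then
    replace each exemplar by the point closest to its region's Karcher mean."""
    K = len(exemplars)
    labels = [int(np.argmin([d_g(x, e) for e in exemplars])) for x in points]
    new_exemplars = []
    for i in range(K):
        cluster = [x for x, l in zip(points, labels) if l == i] or [exemplars[i]]
        mu = karcher_mean(cluster, log_map, exp_map)
        new_exemplars.append(min(cluster, key=lambda x: d_g(x, mu)))
    return new_exemplars, labels
```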
38 Diversity measures on manifolds: The next question we consider is to define the notion of diversity of a selection of points on a manifold. [sent-100, score-0.202]
39 For vector spaces, given a set of vectors X = {xi }, written in matrix form X, RRQR [7] aims to find Q, R and a permutation matrix Π ∈ Rn×n such that XΠ = QR reveals the numerical rank of the matrix X. [sent-107, score-0.185]
40 For the case of manifolds, we adopt an approximate approach in order to measure diversity in terms of the ‘well-conditioned’ nature of the set of exemplars projected on the tangent space at the mean. [sent-110, score-0.673]
41 In particular, for the dataset {xi } ⊆ M, with intrinsic mean µ and a given selection of exemplars, diversity is measured through the tangent vectors log(ej ) of the exemplars at µ. [Algorithm 1 (minimizing Jrep): Input: X ∈ M, k, index vector ω, Γ; Output: permutation matrix Π; initialize Π ← In×n ; for γ ← 1 to Γ: Π(γ) ← In×n , ei ← xωi for i = 1, 2, . . .] [sent-111, score-0.665]
42 Here, log() is the inverse exponential map on the manifold and gives tangent vectors at µ that point towards ej . [sent-117, score-0.345]
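A minimal sketch of this tangent-space diversity measure follows; it reuses the hypothetical `log_map` from the earlier sketch and uses SciPy's column-pivoted QR as a stand-in for a full RRQR routine, with the smallest singular value of the leading block as a conditioning proxy.

```python
import numpy as np
from scipy.linalg import qr

def tangent_diversity(exemplars, mu, log_map, k):
    """Stack the tangent vectors log_mu(e_j) as columns and measure how
    well-conditioned the leading k pivoted columns are."""
    V = np.column_stack([log_map(mu, e) for e in exemplars])   # D x K matrix
    _, R, piv = qr(V, pivoting=True, mode='economic')
    sigma = np.linalg.svd(R[:k, :k], compute_uv=False)
    return sigma[-1], piv        # larger smallest-singular-value = more diverse
```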
43 Given the dataset {x1 , . . . , xn } ∈ M, the procedure for optimizing Jdiv is adopted from [7] and described in Algorithm 2. [Algorithm 3 inputs: number of exemplars k, tolerance step δ.] [sent-126, score-0.501]
44 This is repeated until βm < tol. [Algorithm 3, diversity step: Π ← Div(V, k, tol) as in Algorithm 2.] [sent-140, score-0.132]
45 [Algorithm 3 continues: tol ← tol + δ; end; ei ← xωi for i = 1, 2, . . . , k.] Representation and Diversity Trade-offs for Subset Selection: From (1) and (2), it can be seen that we seek a solution that represents a trade-off between two conflicting cost functions. [sent-144, score-0.31]
46 However, with each iteration, we increase the tolerance parameter tol in algorithm 2. [sent-162, score-0.166]
47 If the tol value is not increased at each iteration, then optimizing Jdiv will continue to provide a new solution at each iteration that modifies the cost function only marginally. [sent-165, score-0.181]
48 As seen in figure 1(b), convergence of Jdiv and Jrep is obtained very quickly using the proposed annealing-based alternation technique. [sent-167, score-0.181]
49 The complete annealing-based alternation algorithm is described in Algorithm 3. [sent-168, score-0.181]
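The overall loop can be sketched as follows. This is a schematic of Algorithm 3 under our reading, reusing the hypothetical helpers from the earlier sketches; `diversity_select`, `tol`, and `delta` are illustrative stand-ins for the Algorithm-2-style RRQR step and its tolerance schedule.

```python
def precis_alternation(points, K, d_g, log_map, exp_map,
                       diversity_select, tol=0.1, delta=0.05, n_iter=10):
    """Alternate a representation step (J_rep) and a diversity step (J_div),
    relaxing the tolerance each round so the alternation settles down."""
    exemplars = points[:K]                      # any initialization
    mu = karcher_mean(points, log_map, exp_map)
    for _ in range(n_iter):
        # Representation: Voronoi assignment + Karcher-mean-based update.
        exemplars, _ = assign_and_update(points, exemplars, d_g, log_map, exp_map)
        # Diversity: swap exemplars via pivoted QR on tangent vectors at mu.
        exemplars = diversity_select(points, exemplars, mu, log_map, K, tol)
        tol += delta                            # annealing: loosen the swap criterion
    return exemplars
```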
50 In this case, algorithm 3 reduces to alternation between clustering and matrix approximation with the annealing ensuring that the algorithm converges. [sent-175, score-0.271]
51 For the case of manifolds implicitly specified using samples, one can approach the problem in one of two ways. [sent-177, score-0.216]
52 Thus the proposed formalism can accommodate manifolds with known and unknown geometries. [sent-183, score-0.216]
53 However, the formalism is limited to manifolds of finite dimension. [sent-184, score-0.216]
54 Let n be the number of data points and K be the number of exemplars to be selected. [sent-189, score-0.511]
55 The last two rows show the complexity of an efficient algorithm proposed by [28] to compute the exponential map and its inverse for the case of Grassmann manifold Gm,p . [sent-191, score-0.224]
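For reference, a sketch of the commonly used SVD-based constructions of these maps on the Grassmann manifold Gm,p is given below; it follows the standard formulas and is not necessarily the specific routine of [28].

```python
import numpy as np

def grassmann_log(X, Y):
    """Tangent vector at span(X) (X is m x p, orthonormal columns) pointing
    towards span(Y); assumes the p x p matrix X^T Y is invertible."""
    M = X.T @ Y
    U, s, Vt = np.linalg.svd((Y - X @ M) @ np.linalg.inv(M), full_matrices=False)
    return U @ np.diag(np.arctan(s)) @ Vt

def grassmann_exp(X, Delta):
    """Follow the geodesic from span(X) with initial direction Delta for unit time."""
    U, s, Vt = np.linalg.svd(Delta, full_matrices=False)
    Y = X @ Vt.T @ np.diag(np.cos(s)) @ Vt + U @ np.diag(np.sin(s)) @ Vt
    Q, _ = np.linalg.qr(Y)               # re-orthonormalize for numerical safety
    return Q
```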
56 The first baseline is a clustering-based solution to subset selection, where we cluster the dataset into K clusters, and pick as exemplar the data point that is closest to the cluster centroid. [sent-193, score-0.243]
57 While trying to minimize the representational error, Jrep picks two exemplars from the dominant class. [sent-201, score-0.655]
58 The proposed approach strikes a balance between the two and picks one ‘representative’ exemplar from each class. [sent-203, score-0.23]
59 This corresponds to optimizing only Jdiv where the input matrix is the matrix of tangent vectors. [sent-208, score-0.203]
60 Since minimization of Jrep is not explicitly enforced, we do not expect the exemplars to be the best representatives, even though the set is diverse. [sent-209, score-0.475]
61 For easy visualization and understanding, we generated a dataset with 3 unbalanced classes in Euclidean space R4 . [sent-211, score-0.152]
62 Individual cost functions, Jrep and Jdiv were first optimized to pick three exemplars using algorithms 1 and 2 respectively. [sent-212, score-0.529]
63 Selected exemplars are shown in figure 1(a), where the four-dimensional dataset has been projected into two dimensions for visualization using Principal Component Analysis (PCA). [sent-213, score-0.534]
64 Despite unbalanced class sizes, when optimized individually, Jdiv seeks to select exemplars from diverse classes but tends to pick them from the class boundaries. [sent-214, score-0.657]
65 Unbalanced class sizes, on the other hand, cause Jrep to pick 2 exemplars from the dominant cluster. [sent-215, score-0.539]
66 Algorithm 3 iteratively optimizes for both these cost functions and picks an exemplar from every class. [sent-216, score-0.253]
67 Figure 1(b) shows the convergence of the algorithm for this simple dataset and compares it with the case when no annealing is applied (figure 1(c)). [sent-218, score-0.163]
68 When annealing is applied, the tolerance value (tol) is increased by 0. [sent-220, score-0.138]
69 Shape sampling/summarization: We conducted a real shape summarization experiment on the MPEG dataset [29]. [sent-225, score-0.171]
70 This dataset has 70 shape classes with 20 shapes per class. [sent-226, score-0.265]
71 For our experiments, we created a smaller dataset of 10 shape classes with 10 shapes per class. [sent-227, score-0.265]
72 These points are concatenated in a matrix to obtain the landmark matrix L ∈ Rm×2 . [sent-231, score-0.161]
73 Left singular vectors (Um×2 ), obtained by the singular value decomposition of matrix L = U ΣV T , give the affine-invariant representation of shapes [30]. [sent-232, score-0.14]
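A sketch of this preprocessing step is shown below; `landmarks` is the m x 2 matrix of 2D landmark coordinates, and the centering step is our assumption about the pipeline rather than something stated here.

```python
import numpy as np

def affine_invariant_shape(landmarks):
    """Affine-invariant shape representation from landmarks: the column space
    of the (centered) m x 2 landmark matrix is unchanged by affine transforms
    of the landmarks, so its left singular vectors U serve as the descriptor."""
    L = landmarks - landmarks.mean(axis=0, keepdims=True)   # remove translation
    U, S, Vt = np.linalg.svd(L, full_matrices=False)        # L = U S V^T, U is m x 2
    return U
```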
74 As the number of shape classes is also 10, one would ideally seek one exemplar from each class. [sent-237, score-0.226]
75 For algorithm 2, data points were projected on the tangent space at the mean using the inverse exponential map and the selected subset is shown in figure 2(c). [sent-241, score-0.257]
76 Individual optimization of Jrep results in 1 exemplar each from 6 classes, 2 each from 2 classes (‘apple’ and ‘flower’) and misses 2 classes (‘bell’ and ‘chopper’). [sent-242, score-0.271]
77 It can be observed that exemplars chosen by Jdiv for classes ‘glass’, ‘heart’, ‘flower’ and ‘apple’ tend to be unusual members of the class. Figure 2: (a) 10 classes from the MPEG dataset with 10 shapes per class. [sent-244, score-0.818]
78 Comparison of 10 exemplars selected by (b)Jrep , (c) Jdiv and (d) Proposed Approach. [sent-245, score-0.475]
79 Jrep picks 2 exemplars each from 2 classes (‘apple’ and ‘flower’) and misses ‘bell’ and ‘chopper’ classes. [sent-246, score-0.69]
80 Jdiv picks 1 from 8 different classes, 2 exemplars from class ‘car’ and none from class ‘bell’. [sent-247, score-0.592]
81 It can be observed that exemplars chosen by Jdiv for classes ‘glass’, ‘heart’, ‘flower’ and ‘apple’ tend to be unusual members of the class. [sent-248, score-0.606]
82 The proposed approach picks one representative exemplar from each class, as desired. [sent-250, score-0.647]
83 Optimizing for both Jdiv and Jrep using algorithm 3 picks one ‘representative’ exemplar from each class as shown in figure 2(d). [sent-253, score-0.23]
84 These exemplars picked by the three algorithms can be further used to label data points. [sent-254, score-0.475]
85 This confusion is largely due to both Jrep and Jdiv having missed picking exemplars from this class. [sent-259, score-0.53]
86 The proposed approach correctly labels all data points as it picks exemplars from every class. [sent-260, score-0.628]
87 This dataset consists of videos with 6 actions conducted by 25 persons in 4 different scenarios. [sent-266, score-0.13]
88 For our experiment, we created a smaller dataset of 30 videos with the first 5 human subjects conducting 6 actions in the s4 (indoor) scenario. [sent-267, score-0.13]
89 5 exemplars selected by: (b)Jrep , (c) Jdiv and (d) Proposed. [sent-277, score-0.475]
90 We asked the algorithm to pick 5 exemplars when the actual number of classes in the dataset is 6. [sent-289, score-0.625]
91 It picks 1 exemplar each from 3 classes (‘box’, ‘hand-clap’ and ‘hand-wave’), 2 from the class ‘run’, and misses ‘walk’ and ‘jog’. [sent-291, score-0.328]
92 On the other hand, Jdiv (when optimized alone) picks 1 each from 5 different classes and misses the class ‘run’. [sent-292, score-0.215]
93 It can be seen that Jdiv picks 2 exemplars that are ‘unusual’ members (occluded videos) of their respective class. [sent-293, score-0.621]
94 The proposed approach picks 1 representative exemplar from 5 classes and none from the class ‘jog’. [sent-294, score-0.345]
95 The proposed approach achieves a diverse selection of exemplars and also avoids picking outlying exemplars. [sent-295, score-0.14]
96 5 Conclusion and Discussion In this paper, we addressed the problem of selecting K exemplars from a dataset when the dataset has an underlying manifold structure to it. [sent-304, score-0.719]
97 We utilized the geometric structure of the manifold to formulate the notion of picking exemplars which minimize the representational error while maximizing the diversity of exemplars. [sent-305, score-0.81]
98 An iterative alternation optimization technique based on annealing has been proposed. [sent-306, score-0.181]
99 We discussed its convergence and complexity and showed its extension to data manifolds and Euclidean spaces. [sent-307, score-0.216]
100 Future work includes formulating subset selection for infinite dimensional manifolds and efficient approximations for this case. [sent-309, score-0.307]
wordName wordTfidf (topN-words)
[('jdiv', 0.506), ('exemplars', 0.475), ('jrep', 0.41), ('manifolds', 0.216), ('tol', 0.132), ('manifold', 0.126), ('picks', 0.117), ('diversity', 0.115), ('exemplar', 0.113), ('annealing', 0.104), ('shapes', 0.093), ('qr', 0.084), ('tangent', 0.083), ('alternation', 0.077), ('apple', 0.072), ('videos', 0.071), ('chopper', 0.068), ('span', 0.066), ('representational', 0.063), ('bell', 0.062), ('classes', 0.06), ('grassmann', 0.06), ('karcher', 0.06), ('summarization', 0.059), ('dataset', 0.059), ('diverse', 0.058), ('representative', 0.055), ('jog', 0.055), ('shape', 0.053), ('selection', 0.051), ('css', 0.048), ('rrqr', 0.048), ('matrix', 0.047), ('ei', 0.046), ('geodesic', 0.045), ('pocket', 0.044), ('permutation', 0.044), ('te', 0.043), ('clustering', 0.043), ('unusual', 0.042), ('columns', 0.042), ('shroff', 0.041), ('subset', 0.04), ('xk', 0.039), ('glass', 0.039), ('ej', 0.038), ('misses', 0.038), ('map', 0.037), ('video', 0.036), ('ower', 0.036), ('joshi', 0.036), ('diffeomorphisms', 0.036), ('routines', 0.036), ('voronoi', 0.036), ('points', 0.036), ('tolerance', 0.034), ('intrinsic', 0.034), ('unbalanced', 0.033), ('picking', 0.031), ('individually', 0.031), ('exponential', 0.031), ('landmark', 0.031), ('cis', 0.031), ('riemannian', 0.031), ('pick', 0.031), ('inverse', 0.03), ('spaces', 0.03), ('srivastava', 0.029), ('members', 0.029), ('euclidean', 0.029), ('rn', 0.028), ('heart', 0.028), ('chellappa', 0.027), ('expp', 0.027), ('flower', 0.027), ('highlighting', 0.027), ('mio', 0.027), ('mpeg', 0.027), ('teddy', 0.027), ('gure', 0.027), ('walk', 0.027), ('optimizing', 0.026), ('embedding', 0.026), ('swap', 0.026), ('dg', 0.026), ('box', 0.025), ('confusion', 0.024), ('car', 0.024), ('baby', 0.024), ('tx', 0.024), ('factorization', 0.024), ('af', 0.023), ('column', 0.023), ('cost', 0.023), ('th', 0.023), ('followed', 0.022), ('perceptually', 0.022), ('strata', 0.022), ('drineas', 0.022), ('maxij', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
Author: Nitesh Shroff, Pavan Turaga, Rama Chellappa
Abstract: In this paper, we consider the Précis problem of sampling K representative yet diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to be applied to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.
2 0.12350844 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds.
3 0.090278976 287 nips-2011-The Manifold Tangent Classifier
Author: Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller
Abstract: We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive auto-encoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local directions changes along the manifold. Record-breaking classification results are obtained.
4 0.078700669 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
Author: Binbin Lin, Chiyuan Zhang, Xiaofei He
Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Many of the existing work use the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semisupervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach.
5 0.068047099 260 nips-2011-Sparse Features for PCA-Like Linear Regression
Author: Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail
Abstract: Principal Components Analysis (PCA) is often used as a feature extraction procedure. Given a matrix X ∈ Rn×d , whose rows represent n data points with respect to d features, the top k right singular vectors of X (the so-called eigenfeatures), are arbitrary linear combinations of all available features. The eigenfeatures are very useful in data analysis, including the regularization of linear regression. Enforcing sparsity on the eigenfeatures, i.e., forcing them to be linear combinations of only a small number of actual features (as opposed to all available features), can promote better generalization error and improve the interpretability of the eigenfeatures. We present deterministic and randomized algorithms that construct such sparse eigenfeatures while provably achieving in-sample performance comparable to regularized linear regression. Our algorithms are relatively simple and practically efficient, and we demonstrate their performance on several data sets.
6 0.064324535 130 nips-2011-Inductive reasoning about chimeric creatures
7 0.062008224 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
8 0.051778533 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
9 0.050107155 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
10 0.049261604 5 nips-2011-A Denoising View of Matrix Completion
11 0.047727112 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
12 0.045657706 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
13 0.044991646 165 nips-2011-Matrix Completion for Multi-label Image Classification
14 0.044832639 186 nips-2011-Noise Thresholds for Spectral Clustering
15 0.043241665 54 nips-2011-Co-regularized Multi-view Spectral Clustering
16 0.040288702 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
17 0.038752463 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
18 0.038692538 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
19 0.038643461 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
20 0.036443736 180 nips-2011-Multiple Instance Filtering
topicId topicWeight
[(0, 0.129), (1, 0.034), (2, -0.041), (3, -0.02), (4, -0.032), (5, 0.012), (6, 0.004), (7, 0.008), (8, 0.068), (9, 0.011), (10, 0.011), (11, 0.084), (12, 0.019), (13, -0.077), (14, 0.072), (15, -0.054), (16, -0.08), (17, 0.06), (18, -0.06), (19, -0.034), (20, -0.031), (21, 0.07), (22, -0.001), (23, 0.048), (24, 0.052), (25, 0.039), (26, -0.095), (27, -0.054), (28, -0.028), (29, 0.013), (30, 0.013), (31, 0.07), (32, -0.065), (33, -0.045), (34, 0.002), (35, 0.028), (36, -0.075), (37, -0.029), (38, 0.083), (39, 0.026), (40, -0.049), (41, -0.011), (42, -0.031), (43, 0.032), (44, 0.137), (45, 0.029), (46, -0.014), (47, 0.109), (48, 0.005), (49, -0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.88363069 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
Author: Nitesh Shroff, Pavan Turaga, Rama Chellappa
Abstract: In this paper, we consider the Précis problem of sampling K representative yet diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to be applied to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.
2 0.72644156 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds.
3 0.68894541 5 nips-2011-A Denoising View of Matrix Completion
Author: Weiran Wang, Zhengdong Lu, Miguel Á. Carreira-Perpiñán
Abstract: In matrix completion, we are given a matrix where the values of only some of the entries are present, and we want to reconstruct the missing ones. Much work has focused on the assumption that the data matrix has low rank. We propose a more general assumption based on denoising, so that we expect that the value of a missing entry can be predicted from the values of neighboring points. We propose a nonparametric version of denoising based on local, iterated averaging with meanshift, possibly constrained to preserve local low-rank manifold structure. The few user parameters required (the denoising scale, number of neighbors and local dimensionality) and the number of iterations can be estimated by cross-validating the reconstruction error. Using our algorithms as a postprocessing step on an initial reconstruction (provided by e.g. a low-rank method), we show consistent improvements with synthetic, image and motion-capture data. Completing a matrix from a few given entries is a fundamental problem with many applications in machine learning, computer vision, network engineering, and data mining. Much interest in matrix completion has been caused by recent theoretical breakthroughs in compressed sensing [1, 2] as well as by the now celebrated Netflix challenge on practical prediction problems [3, 4]. Since completion of arbitrary matrices is not a well-posed problem, it is often assumed that the underlying matrix comes from a restricted class. Matrix completion models almost always assume a low-rank structure of the matrix, which is partially justified through factor models [4] and fast convex relaxation [2], and often works quite well when the observations are sparse and/or noisy. The low-rank structure of the matrix essentially asserts that all the column vectors (or the row vectors) live on a low-dimensional subspace. This assumption is arguably too restrictive for problems with richer structure, e.g. when each column of the matrix represents a snapshot of a seriously corrupted motion capture sequence (see section 3), for which a more flexible model, namely a curved manifold, is more appropriate. In this paper, we present a novel view of matrix completion based on manifold denoising, which conceptually generalizes the low-rank assumption to curved manifolds. Traditional manifold denoising is performed on fully observed data [5, 6], aiming to send the data corrupted by noise back to the correct surface (defined in some way). However, with a large proportion of missing entries, we may not have a good estimate of the manifold. Instead, we start with a poor estimate and improve it iteratively. Therefore the “noise” may be due not just to intrinsic noise, but mostly to inaccurately estimated missing entries. We show that our algorithm can be motivated from an objective purely based on denoising, and prove its convergence under some conditions. We then consider a more general case with a nonlinear low-dimensional manifold and use a stopping criterion that works successfully in practice. Our model reduces to a low-rank model when we require the manifold to be flat, showing a relation with a recent thread of matrix completion models based on alternating projection [7]. In our experiments, we show that our denoising-based matrix completion model can make better use of the latent manifold structure on both artificial and real-world data sets, and yields superior recovery of the missing entries. 
The paper is organized as follows: section 1 reviews nonparametric denoising methods based on mean-shift updates, section 2 extends this to matrix completion by using denoising with constraints, section 3 gives experimental results, and section 4 discusses related work. 1 1 Denoising with (manifold) blurring mean-shift algorithms (GBMS/MBMS) In Gaussian blurring mean-shift (GBMS), denoising is performed in a nonparametric way by local averaging: each data point moves to the average of its neighbors (to a certain scale), and the process is repeated. We follow the derivation in [8]. Consider a dataset {xn }N ⊂ RD and define a n=1 Gaussian kernel density estimate p(x) = 1 N N Gσ (x, xn ) (1) n=1 1 with bandwidth σ > 0 and kernel Gσ (x, xn ) ∝ exp − 2 ( x − xn /σ)2 (other kernels may be used, such as the Epanechnikov kernel, which results in sparse affinities). The (non-blurring) mean-shift algorithm rearranges the stationary point equation ∇p(x) = 0 into the iterative scheme x(τ +1) = f (x(τ ) ) with N x (τ +1) = f (x (τ ) p(n|x )= (τ ) )xn p(n|x (τ ) n=1 )= exp − 1 (x(τ ) − xn )/σ 2 N n′ =1 2 exp − 1 (x(τ ) − xn′ )/σ 2 2 . (2) This converges to a mode of p from almost every initial x ∈ RD , and can be seen as taking selfadapting step sizes along the gradient (since the mean shift f (x) − x is parallel to ∇p(x)). This iterative scheme was originally proposed by [9] and it or variations of it have found widespread application in clustering [8, 10–12] and denoising of 3D point sets (surface fairing; [13, 14]) and manifolds in general [5, 6]. The blurring mean-shift algorithm applies one step of the previous scheme, initialized from every point, in parallel for all points. That is, given the dataset X = {x1 , . . . , xN }, for each xn ∈ X ˜ we obtain a new point xn = f (xn ) by applying one step of the mean-shift algorithm, and then we ˜ replace X with the new dataset X, which is a blurred (shrunk) version of X. By iterating this process we obtain a sequence of datasets X(0) , X(1) , . . . (and a corresponding sequence of kernel density estimates p(0) (x), p(1) (x), . . .) where X(0) is the original dataset and X(τ ) is obtained by blurring X(τ −1) with one mean-shift step. We can see this process as maximizing the following objective function [10] by taking parallel steps of the form (2) for each point: N p(xn ) = E(X) = n=1 1 N N N 1 e− 2 Gσ (xn , xm ) ∝ xn −xm σ 2 . (3) n,m=1 n,m=1 This process eventually converges to a dataset X(∞) where all points are coincident: a completely denoised dataset where all structure has been erased. As shown by [8], this process can be stopped early to return clusters (= locally denoised subsets of points); the number of clusters obtained is controlled by the bandwidth σ. However, here we are interested in the denoising behavior of GBMS. ˜ The GBMS step can be formulated in a matrix form reminiscent of spectral clustering [8] as X = X P where X = (x1 , . . . , xN ) is a D×N matrix of data points; W is the N ×N matrix of Gaussian N affinities wnm = Gσ (xn , xm ); D = diag ( n=1 wnm ) is the degree matrix; and P = WD−1 is N an N × N stochastic matrix: pnm = p(n|xm ) ∈ (0, 1) and n=1 pnm = 1. P (or rather its transpose) is the stochastic matrix of the random walk in a graph [15], which in GBMS represents the posterior probabilities of each point under the kernel density estimate (1). P is similar to the 1 1 matrix N = D− 2 WD− 2 derived from the normalized graph Laplacian commonly used in spectral clustering, e.g. in the normalized cut [16]. 
Since, by the Perron-Frobenius theorem [17, ch. 8], all left eigenvalues of P(X) have magnitude less than 1 except for one that equals 1 and is associated with ˜ an eigenvector of constant entries, iterating X = X P(X) converges to the stationary distribution of each P(X), where all points coincide. ˜ From this point of view, the product X = X P(X) can be seen as filtering the dataset X with a datadependent low-pass filter P(X), which makes clear the denoising behavior. This also suggests using ˜ other filters [12] X = X φ(P(X)) as long as φ(1) = 1 and |φ(r)| < 1 for r ∈ [0, 1), such as explicit schemes φ(P) = (1 − η)I + ηP for η ∈ (0, 2], power schemes φ(P) = Pn for n = 1, 2, 3 . . . or implicit schemes φ(P) = ((1 + η)I − ηP)−1 for η > 0. One important problem with GBMS is that it denoises equally in all directions. When the data lies on a low-dimensional manifold, denoising orthogonally to it removes out-of-manifold noise, but 2 denoising tangentially to it perturbs intrinsic degrees of freedom of the data and causes shrinkage of the entire manifold (most strongly near its boundary). To prevent this, the manifold blurring meanshift algorithm (MBMS) [5] first computes a predictor averaging step with GBMS, and then for each point xn a corrector projective step removes the step direction that lies in the local tangent space of xn (obtained from local PCA run on its k nearest neighbors). In practice, both GBMS and MBMS must be stopped early to prevent excessive denoising and manifold distortions. 2 Blurring mean-shift denoising algorithms for matrix completion We consider the natural extension of GBMS to the matrix completion case by adding the constraints given by the present values. We use the subindex notation XM and XP to indicate selection of the missing or present values of the matrix XD×N , where P ⊂ U , M = U \ P and U = {(d, n): d = 1, . . . , D, n = 1, . . . , N }. The indices P and values XP of the present matrix entries are the data of the problem. Then we have the following constrained optimization problem: N Gσ (xn , xm ) max E(X) = X s.t. XP = XP . (4) n,m=1 This is similar to low-rank formulations for matrix completion that have the same constraints but use as objective function the reconstruction error with a low-rank assumption, e.g. X − ABX 2 with AD×L , BL×D and L < D. We initialize XM to the output of some other method for matrix completion, such as singular value projection (SVP; [7]). For simple constraints such as ours, gradient projection algorithms are attractive. The gradient of E wrt X is a matrix of D × N whose nth column is: ∇xn E(X) = 2 σ2 N 1 e− 2 xn −xm σ 2 N (xm − xn ) ∝ m=1 2 p(m|xn )xm p(xn ) −xn + σ2 m=1 (5) and its projection on the constraint space is given by zeroing its entries having indices in P; call ΠP this projection operator. Then, we have the following step of length α ≥ 0 along the projected gradient: (τ +1) X(τ +1) = X(τ ) + αΠP (∇X E(X(τ ) )) ⇐⇒ XM (τ ) = XM + α ΠP (∇X E(X(τ ) )) M (6) which updates only the missing entries XM . Since our search direction is ascent and makes an angle with the gradient that is bounded away from π/2, and E is lower bounded, continuously differentiable and has bounded Hessian (thus a Lipschitz continuous gradient) in RN L , by carrying out a line search that satisfies the Wolfe conditions, we are guaranteed convergence to a local stationary point, typically a maximizer [18, th. 3.2]. 
However, as reasoned later, we do not perform a line search at all, instead we fix the step size to the GBMS self-adapting step size, which results in a simple and faster algorithm consisting of carrying out a GBMS step on X (i.e., X(τ +1) = X(τ ) P(X(τ ) )) and then refilling XP to the present values. While we describe the algorithm in this way for ease of explanation, in practice we do not actually compute the GBMS step for all xdn values, but only for the missing ones, which is all we need. Thus, our algorithm carries out GBMS denoising steps within the missing-data subspace. We can derive this result in a different way by starting from N the unconstrained optimization problem maxXP E(X) = n,m=1 Gσ (xn , xm ) (equivalent to (4)), computing its gradient wrt XP , equating it to zero and rearranging (in the same way the mean-shift algorithm is derived) to obtain a fixed-point iteration identical to our update above. Fig. 1 shows the pseudocode for our denoising-based matrix completion algorithms (using three nonparametric denoising algorithms: GBMS, MBMS and LTP). Convergence and stopping criterion As noted above, we have guaranteed convergence by simply satisfying standard line search conditions, but a line search is costly. At present we do not have (τ +1) a proof that the GBMS step size satisfies such conditions, or indeed that the new iterate XM increases or leaves unchanged the objective, although we have never encountered a counterexample. In fact, it turns out that none of the work about GBMS that we know about proves that either: [10] proves that ∅(X(τ +1) ) ≤ ∅(X(τ ) ) for 0 < ρ < 1, where ∅(·) is the set diameter, while [8, 12] 3 notes that P(X) has a single eigenvalue of value 1 and all others of magnitued less than 1. While this shows that all points converge to the same location, which indeed is the global maximum of (3), it does not necessarily follow that each step decreases E. GBMS (k, σ) with full or k-nn graph: given XD×N , M repeat for n = 1, . . . , N Nn ← {1, . . . , N } (full graph) or k nearest neighbors of xn (k-nn graph) Gσ (xn ,xm ) mean-shift xm ∂xn ← −xn + m∈Nn step m′ ∈Nn Gσ (xn ,xm′ ) end XM ← XM + (∂X)M move points’ missing entries until validation error increases return X However, the question of convergence as τ → ∞ has no practical interest in a denoising setting, because achieving a total denoising almost never yields a good matrix completion. What we want is to achieve just enough denoising and stop the algorithm, as was the case with GBMS clustering, and as is the case in algorithms for image denoising. We propose to determine the optimal number of iterations, as well as the bandwidth σ and any other parameters, by cross-validation. Specifically, we select a held-out set by picking a random subset of the present entries and considering them as missing; this allows us to evaluate an error between our completion for them and the ground truth. We stop iterating when this error increases. MBMS (L, k, σ) with full or k-nn graph: given XD×N , M repeat for n = 1, . . . , N Nn ← {1, . . . 
, N } (full graph) or k nearest neighbors of xn (k-nn graph) Gσ (xn ,xm ) mean-shift xm ∂xn ← −xn + m∈Nn step m′ ∈Nn Gσ (xn ,xm′ ) Xn ← k nearest neighbors of xn (µn , Un ) ← PCA(Xn , L) estimate L-dim tangent space at xn subtract parallel motion ∂xn ← (I − Un UT )∂xn n end XM ← XM + (∂X)M move points’ missing entries until validation error increases return X This argument justifies an algorithmic, as opposed to an opLTP (L, k) with k-nn graph: given XD×N , M timization, view of denoisingrepeat based matrix completion: apfor n = 1, . . . , N ply a denoising step, refill the Xn ← k nearest neighbors of xn present values, iterate until the (µn , Un ) ← PCA(Xn , L) estimate L-dim tangent space at xn validation error increases. This project point onto tangent space allows very general definitions ∂xn ← (I − Un UT )(µn − xn ) n end of denoising, and indeed a lowXM ← XM + (∂X)M move points’ missing entries rank projection is a form of deuntil validation error increases noising where points are not alreturn X lowed outside the linear manifold. Our formulation using Figure 1: Our denoising matrix completion algorithms, based on the objective function (4) is still Manifold Blurring Mean Shift (MBMS) and its particular cases useful in that it connects our Local Tangent Projection (LTP, k-nn graph, σ = ∞) and Gauss- denoising assumption with the ian Blurring Mean Shift (GBMS, L = 0); see [5] for details. Nn more usual low-rank assumption contains all N points (full graph) or only xn ’s nearest neighbors that has been used in much ma(k-nn graph). The index M selects the components of its input trix completion work, and juscorresponding to missing values. Parameters: denoising scale σ, tifies the refilling step as renumber of neighbors k, local dimensionality L. sulting from the present-data constraints under a gradientprojection optimization. MBMS denoising for matrix completion Following our algorithmic-based approach to denois˜ ing, we could consider generalized GBMS steps of the form X = X φ(P(X)). For clustering, Carreira-Perpi˜ an [12] found an overrelaxed explicit step φ(P) = (1 − η)I + ηP with η ≈ 1.25 to n´ achieve similar clusterings but faster. Here, we focus instead on the MBMS variant of GBMS that allows only for orthogonal, not tangential, point motions (defined wrt their local tangent space as estimated by local PCA), with the goal of preserving low-dimensional manifold structure. MBMS has 3 user parameters: the bandwidth σ (for denoising), and the latent dimensionality L and the 4 number of neighbors k (for the local tangent space and the neighborhood graph). A special case of MBMS called local tangent projection (LTP) results by using a neighborhood graph and setting σ = ∞ (so only two user parameters are needed: L and k). LTP can be seen as doing a low-rank matrix completion locally. LTP was found in [5] to have nearly as good performance as the best σ in several problems. MBMS also includes as particular cases GBMS (L = 0), PCA (k = N , σ = ∞), and no denoising (σ = 0 or L = D). Note that if we apply MBMS to a dataset that lies on a linear manifold of dimensionality d using L ≥ d then no denoising occurs whatsoever because the GBMS updates lie on the d-dimensional manifold and are removed by the corrector step. In practice, even if the data are assumed noiseless, the reconstruction from a low-rank method will lie close to but not exactly on the d-dimensional manifold. 
However, this suggests using largish ranks for the low-rank method used to reconstruct X and lower L values in the subsequent MBMS run. In summary, this yields a matrix completion algorithm where we apply an MBMS step, refill the present values, and iterate until the validation error increases. Again, in an actual implementation we compute the MBMS step only for the missing entries of X. The shrinking problem of GBMS is less pronounced in our matrix completion setting, because we constrain some values not to change. Still, in agreement with [5], we find MBMS to be generally superior to GBMS. Computational cost With a full graph, the cost per iteration of GBMS and MBMS is O(N 2 D) and O(N 2 D + N (D + k) min(D, k)2 ), respectively. In practice with high-dimensional data, best denoising results are obtained using a neighborhood graph [5], so that the sums over points in eqs. (3) or (4) extend only to the neighbors. With a k-nearest-neighbor graph and if we do not update the neighbors at each iteration (which affects the result little), the respective cost per iteration is O(N kD) and O(N kD + N (D + k) min(D, k)2 ), thus linear in N . The graph is constructed on the initial X we use, consisting of the present values and an imputation for the missing ones achieved with a standard matrix completion method, and has a one-off cost of O(N 2 D). The cost when we have a fraction µ = |M| ∈ [0, 1] of missing data is simply the above times µ. Hence the run time ND of our mean-shift-based matrix completion algorithms is faster the more present data we have, and thus faster than the usual GBMS or MBMS case, where all data are effectively missing. 3 Experimental results We compare with representative methods of several approaches: a low-rank matrix completion method, singular value projection (SVP [7], whose performance we found similar to that of alternating least squares, ALS [3, 4]); fitting a D-dimensional Gaussian model with EM and imputing the missing values of each xn as the conditional mean E {xn,Mn |xn,Pn } (we use the implementation of [19]); and the nonlinear method of [20] (nlPCA). We initialize GBMS and MBMS from some or all of these algorithms. For methods with user parameters, we set them by cross-validation in the following way: we randomly select 10% of the present entries and pretend they are missing as well, we run the algorithm on the remaining 90% of the present values, and we evaluate the reconstruction at the 10% entries we kept earlier. We repeat this over different parameters’ values and pick the one with lowest reconstruction error. We then run the algorithm with these parameters values on the entire present data and report the (test) error with the ground truth for the missing values. 100D Swissroll We created a 3D swissroll data set with 3 000 points and lifted it to 100D with a random orthonormal mapping, and added a little noise (spherical Gaussian with stdev 0.1). We selected uniformly at random 6.76% of the entries to be present. We use the Gaussian model and SVP (fixed rank = 3) as initialization for our algorithm. We typically find that these initial X are very noisy (fig. 3), with some reconstructed points lying between different branches of the manifold and causing a big reconstruction error. We fixed L = 2 (the known dimensionality) for MBMS and cross-validated the other parameters: σ and k for MBMS and GBMS (both using k-nn graph), and the number of iterations τ to be used. 
Table 1 gives the performance of MBMS and GBMS for testing, along with their optimal parameters. Fig. 3 shows the results of different methods at a few iterations. MBMS initialized from the Gaussian model gives the most remarkable denoising effect. To show that there is a wide range of σ and number of iterations τ that give good performance with GBMS and MBMS, we fix k = 50 and run the algorithm with varying σ values and plot the reconstruction error for missing entries over iterations in fig. 2. Both GBMS can achieve good 5 Methods Gaussian + GBMS (∞, 10, 0, 1) + MBMS (1, 20, 2, 25) SVP + GBMS (3, 50, 0, 1) + MBMS (3, 50, 2, 2) RSSE 168.1 165.8 157.2 156.8 151.4 151.8 mean 2.63 2.57 2.36 1.94 1.89 1.87 stdev 1.59 1.61 1.63 2.10 2.02 2.05 Methods nlPCA SVP + GBMS (400,140,0,1) + MBMS (500,140,9,5) Table 1: Swissroll data set: reconstruction errors obtained by different algorithms along with their optimal parameters (σ, k, L, no. iterations τ ). The three columns show the root sum of squared errors on missing entries, the mean, and the standard deviation of the pointwise reconstruction error, resp. SVP + GBMS error (RSSE) 180 170 SVP + MBMS Gaussian + GBMS 180 180 170 170 170 160 160 ∞ 160 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ stdev 42.6 39.3 37.7 34.9 Gaussian + MBMS 180 8 10 15 25 mean 26.1 21.8 18.8 17.0 Table 2: MNIST-7 data set: errors of the different algorithms and their optimal parameters (σ, k, L, no. iterations τ ). The three columns show the root sum of squared errors on missing entries (×10−4 ), the mean, and the standard deviation of pixel errors, respectively. 160 0.3 0.5 1 2 3 5 RSSE 7.77 6.99 6.54 6.03 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ 150 0 1 2 3 4 5 6 7 8 910 12 14 16 18 20 iteration τ Figure 2: Reconstruction error of GBMS/MBMS over iterations (each curve is a different σ value). denoising (and reconstruction), but MBMS is more robust, with good results occurring for a wide range of iterations, indicating it is able to preserve the manifold structure better. Mocap data We use the running-motion sequence 09 01 from the CMU mocap database with 148 samples (≈ 1.7 cycles) with 150 sensor readings (3D positions of 50 joints on a human body). The motion is intrinsically 1D, tracing a loop in 150D. We compare nlPCA, SVP, the Gaussian model, and MBMS initialized from the first three algorithms. For nlPCA, we do a grid search for the weight decay coefficient while fixing its structure to be 2 × 10 × 150 units, and use an early stopping criterion. For SVP, we do grid search on {1, 2, 3, 5, 7, 10} for the rank. For MBMS (L = 1) and GBMS (L = 0), we do grid search for σ and k. We report the reconstruction error as a function of the proportion of missing entries from 50% to 95%. For each missing-data proportion, we randomly select 5 different sets of present values and run all algorithms for them. Fig. 4 gives the mean errors of all algorithms. All methods perform well when missing-data proportion is small. nlPCA, being prone to local optima, is less stable than SVP and the Gaussian model, especially when the missing-data proportion is large. The Gaussian model gives the best and most stable initialization. At 95%, all methods fail to give an acceptable reconstruction, but up to 90% missing entries, MBMS and GBMS always beat the other algorithms. Fig. 4 shows selected reconstructions from all algorithms. 
MNIST digit ‘7’ The MNIST digit ‘7’ data set contains 6 265 greyscale (0–255) images of size 28 × 28. We create missing entries in a way reminiscent of run-length errors in transmission. We generate 16 to 26 rectangular boxes of an area approximately 25 pixels at random locations in each image and use them to black out pixels. In this way, we create a high dimensional data set (784 dimensions) with about 50% entries missing on average. Because of the loss of spatial correlations within the blocks, this missing data pattern is harder than random. The Gaussian model cannot handle such a big data set because it involves inverting large covariance matrices. nlPCA is also very slow and we cannot afford cross-validating its structure or the weight decay coefficient, so we picked a reasonable structure (10 × 30 × 784 units), used the default weight decay parameter in the code (10−3 ), and allowed up to 500 iterations. We only use SVP as initialization for our algorithm. Since the intrinsic dimension of MNIST is suspected to be not very high, 6 SVP τ =0 SVP + GBMS τ =1 SVP + MBMS τ =2 Gaussian τ =0 Gaussian + GBMS τ =1 Gaussian + MBMS τ = 25 20 20 20 20 20 20 15 15 15 15 15 15 10 10 10 10 10 10 5 5 5 5 5 5 0 0 0 0 0 0 −5 −5 −5 −5 −5 −5 −10 −10 −15 −15 −10 −5 0 5 10 15 20 −10 −15 −15 −10 −5 0 5 10 15 20 −15 −15 −10 −10 −5 0 5 10 15 20 −15 −15 −10 −10 −5 0 5 10 15 20 −15 −15 −10 −10 −5 0 5 10 15 20 −15 −15 −10 −5 0 5 10 15 20 Figure 3: Denoising effect of the different algorithms. For visualization, we project the 100D data to 3D with the projection matrix used for creating the data. Present values are refilled for all plots. 7000 6000 error 5000 4000 frame 2 (leg distance) frame 10 (foot pose) frame 147 (leg pose) nlPCA nlPCA + GBMS nlPCA + MBMS SVP SVP + GBMS SVP + MBMS Gaussian Gaussian + GBMS Gaussian + MBMS 3000 2000 1000 0 50 60 70 80 85 90 95 % of missing data Figure 4: Left: mean of errors (RSSE) of 5 runs obtained by different algorithms for varying percentage of missing values. Errorbars shown only for Gaussian + MBMS to avoid clutter. Right: sample reconstructions when 85% percent data is missing. Row 1: initialization. Row 2: init+GBMS. Row 3: init+MBMS. Color indicates different initialization: black, original data; red, nlPCA; blue, SVP; green, Gaussian. we used rank 10 for SVP and L = 9 for MBMS. We also use the same k = 140 as in [5]. So we only had to choose σ and the number of iterations via cross-validation. Table 2 shows the methods and their corresponding error. Fig. 5 shows some representative reconstructions from different algorithms, with present values refilled. The mean-shift averaging among closeby neighbors (a soft form of majority voting) helps to eliminate noise, unusual strokes and other artifacts created by SVP, which by their nature tend to occur in different image locations over the neighborhood of images. 4 Related work Matrix completion is widely studied in theoretical compressed sensing [1, 2] as well as practical recommender systems [3, 4]. Most matrix completion models rely on a low-rank assumption, and cannot fully exploit a more complex structure of the problem, such as curved manifolds. Related work is on multi-task learning in a broad sense, which extracts the common structure shared by multiple related objects and achieves simultaneous learning on them. This includes applications such as alignment of noise-corrupted images [21], recovery of images with occlusion [22], and even learning of multiple related regressors or classifiers [23]. 
Again, all these works are essentially based on a subspace assumption and do not generalize to more complex situations. A line of work based on a nonlinear low-rank assumption (with a latent variable z of dimensionality L < D) involves setting up a least-squares error function $\min_{f,Z} \sum_{n=1}^{N} \|x_n - f(z_n)\|^2 = \sum_{n,d=1}^{N,D} (x_{dn} - f_d(z_n))^2$, where one ignores the terms for which $x_{dn}$ is missing, and estimates the function f and the low-dimensional data projections Z by alternating optimization. Linear functions f have been used in the homogeneity analysis literature [24], where this approach is called “missing data deleted”. Nonlinear functions f have been used recently (neural nets [20]; Gaussian processes for collaborative filtering [25]). Better results are obtained by adding a projection term $\sum_{n=1}^{N} \|z_n - F(x_n)\|^2$ and optimizing over the missing data as well [26].

[Figure 5: Selected reconstructions of MNIST block-occluded digits ‘7’ with different methods. Columns: Orig, Missing, nlPCA, SVP, GBMS, MBMS.]

Prior to our denoising-based work there have been efforts to extend the low-rank models to smooth manifolds, mostly in the context of compressed sensing. Baraniuk and Wakin [27] show that certain random measurements, e.g. random projection to a low-dimensional subspace, can preserve the metric of the manifold fairly well, provided the intrinsic dimension and the curvature of the manifold are both small enough. However, these observations are not suitable for matrix completion, and no algorithm is given for recovering the signal. Chen et al. [28] explicitly model a pre-determined manifold and use it to regularize the signal when recovering the missing values. They estimate the manifold from complete data, whereas no complete data is assumed in our matrix completion setting. Another related work is [29], where a manifold modeled with Isomap is used to estimate the positions of satellite cameras in an iterative manner. Finally, our expectation that the value of a missing entry can be predicted from the values of neighboring points is similar to one category of collaborative filtering methods that essentially use similar users/items to predict missing values [3, 4].

5 Conclusion

We have proposed a new paradigm for matrix completion, denoising, which generalizes the commonly used assumption of low rank. Assuming low rank implies a restrictive form of denoising where the data is forced to have zero variance away from a linear manifold. More general definitions of denoising can potentially handle data that lives in a low-dimensional manifold that is nonlinear, or whose dimensionality varies (e.g. a set of manifolds), or that does not have low rank at all, and they naturally handle noise in the data. Denoising works because of the fundamental fact that a missing value can be predicted by averaging nearby present values. Although we motivate our framework from a constrained optimization point of view (denoise subject to respecting the present data), we argue for an algorithmic view of denoising-based matrix completion: apply a denoising step, refill the present values, and iterate until the validation error increases. In turn, this allows different forms of denoising, such as those based on low-rank projection (earlier work) or on local averaging with blurring mean-shift (this paper).
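The loop just described — denoise, refill the present values, iterate until a validation error increases — can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the denoising step shown is a plain Gaussian blurring mean-shift (GBMS) update over all points, whereas the paper restricts the averaging to k nearest neighbors and, in MBMS, also removes the component of the update along the estimated local tangent space; the validation split is schematic.

```python
import numpy as np

def gbms_step(X, sigma):
    """One Gaussian blurring mean-shift step: replace every point by the
    Gaussian-weighted average of all points (full-kernel variant, for clarity)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    W = np.exp(-0.5 * d2 / sigma ** 2)
    W /= W.sum(axis=1, keepdims=True)
    return W @ X

def complete_by_denoising(X0, present, val_mask, sigma, max_iter=50):
    """Denoising-based matrix completion: denoise, refill observed entries, and
    stop when the error on a held-out subset of observed entries increases.
    `present` marks entries used for refilling; `val_mask` marks held-out
    observed entries (disjoint from `present`) used only for validation."""
    X, best, best_err = X0.copy(), X0.copy(), np.inf
    for _ in range(max_iter):
        X = gbms_step(X, sigma)
        X[present] = X0[present]                        # respect the present data
        err = np.sqrt(((X - X0)[val_mask] ** 2).sum())  # validation RSSE
        if err >= best_err:                             # error went up: stop
            break
        best, best_err = X.copy(), err
    return best
```

The point of this algorithmic view is that a different denoiser can be plugged into the same loop without changing anything else.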
Our nonparametric choice of mean-shift averaging further relaxes assumptions about the data and results in a simple algorithm with very few user parameters; these afford user control (denoising scale, local dimensionality) but can also be set automatically by cross-validation. Our algorithms are intended to be used as a postprocessing step over a user-provided initialization of the missing values, and we show they consistently improve upon existing algorithms. The MBMS-based algorithm bridges the gap between pure denoising (GBMS) and local low rank. Other definitions of denoising should be possible, for example using temporal as well as spatial neighborhoods, and denoising could even apply to discrete data if we regard it as majority voting among the neighbors of a vector (with suitable definitions of votes and neighborhoods).

Acknowledgments. Work supported by NSF CAREER award IIS–0754089.

References

[1] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
[2] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Information Theory, 56(5):2053–2080, April 2010.
[3] Yehuda Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. SIGKDD 2008, pages 426–434, Las Vegas, NV, August 24–27 2008.
[4] Robert Bell and Yehuda Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. ICDM 2007, pages 43–52, October 28–31 2007.
[5] Weiran Wang and Miguel Á. Carreira-Perpiñán. Manifold blurring mean shift algorithms for manifold denoising. CVPR 2010, pages 1759–1766, San Francisco, CA, June 13–18 2010.
[6] Matthias Hein and Markus Maier. Manifold denoising. NIPS 2006, 19:561–568. MIT Press, 2007.
[7] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Guaranteed rank minimization via singular value projection. NIPS 2010, 23:937–945. MIT Press, 2011.
[8] Miguel Á. Carreira-Perpiñán. Fast nonparametric clustering with Gaussian blurring mean-shift. ICML 2006, pages 153–160, Pittsburgh, PA, June 25–29 2006.
[9] Keinosuke Fukunaga and Larry D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. Information Theory, 21(1):32–40, January 1975.
[10] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. PAMI, 17(8):790–799, 1995.
[11] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. PAMI, 24(5):603–619, May 2002.
[12] Miguel Á. Carreira-Perpiñán. Generalised blurring mean-shift algorithms for nonparametric clustering. CVPR 2008, Anchorage, AK, June 23–28 2008.
[13] Gabriel Taubin. A signal processing approach to fair surface design. SIGGRAPH 1995, pages 351–358.
[14] Mathieu Desbrun, Mark Meyer, Peter Schröder, and Alan H. Barr. Implicit fairing of irregular meshes using diffusion and curvature flow. SIGGRAPH 1999, pages 317–324.
[15] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, Providence, RI, 1997.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888–905, August 2000.
[17] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1986.
[18] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer-Verlag, New York, second edition, 2006.
[19] Tapio Schneider. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14(5):853–871, March 2001.
[20] Matthias Scholz, Fatma Kaplan, Charles L. Guy, Joachim Kopka, and Joachim Selbig. Non-linear PCA: A missing data approach. Bioinformatics, 21(20):3887–3895, October 15 2005.
[21] Yigang Peng, Arvind Ganesh, John Wright, Wenli Xu, and Yi Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. CVPR 2010, pages 763–770, 2010.
[22] A. M. Buchanan and A. W. Fitzgibbon. Damped Newton algorithms for matrix factorization with missing data. CVPR 2005, pages 316–322, San Diego, CA, June 20–25 2005.
[23] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. NIPS 2006, 19:41–48. MIT Press, 2007.
[24] Albert Gifi. Nonlinear Multivariate Analysis. John Wiley & Sons, 1990.
[25] Neil D. Lawrence and Raquel Urtasun. Non-linear matrix factorization with Gaussian processes. ICML 2009, Montreal, Canada, June 14–18 2009.
[26] Miguel Á. Carreira-Perpiñán and Zhengdong Lu. Manifold learning and missing data recovery through unsupervised regression. ICDM 2011, December 11–14 2011.
[27] Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51–77, February 2009.
[28] Minhua Chen, Jorge Silva, John Paisley, Chunping Wang, David Dunson, and Lawrence Carin. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Trans. Signal Processing, 58(12):6140–6155, December 2010.
[29] Michael B. Wakin. A manifold lifting algorithm for multi-view compressive imaging. In Proc. 27th Conference on Picture Coding Symposium (PCS’09), pages 381–384, 2009.
4 0.67029661 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
Author: Binbin Lin, Chiyuan Zhang, Xiaofei He
Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Much of the existing work uses the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semi-supervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach.
5 0.65180284 287 nips-2011-The Manifold Tangent Classifier
Author: Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller
Abstract: We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive auto-encoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local direction changes along the manifold. Record-breaking classification results are obtained.
6 0.61323625 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion
7 0.59665567 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
8 0.4716129 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
9 0.37655461 73 nips-2011-Divide-and-Conquer Matrix Factorization
10 0.37051564 143 nips-2011-Learning Anchor Planes for Classification
11 0.36909932 95 nips-2011-Fast and Accurate k-means For Large Datasets
12 0.36289364 241 nips-2011-Scalable Training of Mixture Models via Coresets
13 0.36015627 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
14 0.35539275 102 nips-2011-Generalised Coupled Tensor Factorisation
15 0.34806362 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies
16 0.34583622 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure
17 0.34479234 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
18 0.34374881 186 nips-2011-Noise Thresholds for Spectral Clustering
19 0.3414616 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems
20 0.33537456 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
topicId topicWeight
[(0, 0.028), (4, 0.047), (20, 0.062), (26, 0.044), (31, 0.057), (33, 0.027), (43, 0.044), (45, 0.092), (57, 0.031), (74, 0.052), (78, 0.271), (83, 0.053), (84, 0.011), (87, 0.01), (99, 0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.72132999 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
Author: Nitesh Shroff, Pavan Turaga, Rama Chellappa
Abstract: In this paper, we consider the Pr´ cis problem of sampling K representative yet e diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.
2 0.65260905 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds.
3 0.59932387 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon
Abstract: We introduce PiCoDes: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. PiCoDes of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed up the training and testing of novel classes by orders of magnitude.
4 0.5991978 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required.
5 0.52499634 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results.
6 0.52420288 303 nips-2011-Video Annotation and Tracking with Active Learning
7 0.51910484 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
8 0.51838696 22 nips-2011-Active Ranking using Pairwise Comparisons
9 0.51733381 227 nips-2011-Pylon Model for Semantic Segmentation
10 0.51272804 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
11 0.51104015 180 nips-2011-Multiple Instance Filtering
12 0.51047039 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
13 0.50922912 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
14 0.50848341 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
15 0.50798094 102 nips-2011-Generalised Coupled Tensor Factorisation
16 0.507339 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
17 0.50711024 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
18 0.50686544 80 nips-2011-Efficient Online Learning via Randomized Rounding
19 0.50549191 231 nips-2011-Randomized Algorithms for Comparison-based Search
20 0.50522238 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction