nips nips2011 nips2011-263 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. [sent-5, score-0.425]
2 Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. [sent-6, score-0.695]
3 This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. [sent-8, score-0.575]
4 The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. [sent-9, score-0.53]
5 Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. [sent-10, score-0.506]
6 Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. [sent-11, score-0.515]
7 The first step of most dimensionality reduction methods is to build a neighborhood graph by connecting each data point to a fixed number of nearest neighbors or to all points within a certain radius of the given point. [sent-17, score-0.915]
8 Global methods, such as Isomap [4], Semidefinite embedding [5], Minimum volume embedding [6] and Structure preserving embedding [7], try to preserve local and global relationships among all data points. [sent-19, score-0.961]
9 For both local and global methods, a proper choice of the neighborhood size used to build the neighborhood graph is critical. [sent-21, score-0.417]
10 Specifically, a small neighborhood size may not capture sufficient information about the manifold geometry, especially when it is smaller than the intrinsic dimension of the manifold. [sent-22, score-0.512]
11 Moreover, the curvature of the manifold and the density of the data points may be different in different regions of the manifold, hence using a fix neighborhood size may be inappropriate. [sent-24, score-0.568]
12 Thus, to find a low-dimensional embedding of the data, one needs to first cluster the data according to the underlying manifolds and then find a low-dimensional representation for the data in each cluster. [sent-27, score-0.658]
13 Since the manifolds can be very close to each other and they can have arbitrary dimensions, curvature and sampling, the manifold clustering and embedding problem is very challenging. [sent-28, score-0.908]
14 The particular case of clustering data lying in multiple flat manifolds (subspaces) is well studied and numerous algorithms have been proposed (see e. [sent-29, score-0.421]
15 Other methods assume that the manifolds have different instrinsic dimensions and cluster the data according to the dimensions rather than the manifolds themselves [9, 10, 11, 12, 13]. [sent-33, score-0.573]
16 When manifolds are densely sampled and sufficiently separated, existing dimensionality reduction algorithms such as LLE can be extended to perform clustering before the dimensionality reduction step [14, 15, 16]. [sent-36, score-0.708]
17 However, as we will see later, finding the right neighborhood size is in general difficult, especially when manifolds are close to each other. [sent-39, score-0.396]
18 3 Paper Contributions In this paper, we propose an algorithm, called SMCE, for simultaneous clustering and embedding of data lying in multiple manifolds. [sent-42, score-0.54]
19 To do so, we use the geometrically motivated assumption that for each data point there exists a small neighborhood in which only the points that come from the same manifold lie approximately in a low-dimensional affine subspace. [sent-43, score-0.64]
20 We propose an optimization program based on sparse representation to select a few neighbors of each data point that span a low-dimensional affine subspace passing near that point. [sent-44, score-0.662]
21 In addition, the weights associated to the chosen neighbors indicate their distances to the given data point, which can be used for dimensionality reduction. [sent-46, score-0.422]
22 Thus, unlike conventional methods that first build a neighborhood graph and then extract information from it, our method simultaneously builds the neighborhood graph and obtains its weights. [sent-47, score-0.455]
23 This leads to successful results even in challenging situations where the nearest neighbors of a point come from other manifolds. [sent-48, score-0.363]
24 Clustering and embedding of the data into lower dimensions follows by taking the eigenvectors of the matrix of weights and its submatrices, which are sparse hence can be stored and be operated on efficiently. [sent-49, score-0.459]
25 Thanks to the sparse representations obtained by SMCE, the number of neighbors of the data points in each manifold reflects the intrinsic dimensionality of the underlying manifold. [sent-50, score-0.829]
26 To the best of our knowledge, SMCE is the only algorithm proposed to date that allows robust automatic selection of neighbors and simultaneous clustering and dimensionality reduction in a unified manner. [sent-52, score-0.6]
27 2 Proposed Method Assume we are given a collection of N data points {xi ∈ RD }N lying in n different manifolds i=1 {Ml }n of intrinsic dimensions {dl }n . [sent-53, score-0.486]
28 In this section, we consider the problem of simultanel=1 l=1 ously clustering the data according to the underlying manifolds and obtaining a low-dimensional representation of the data points within each cluster. [sent-54, score-0.553]
29 We approach this problem using a spectral clustering and embedding algorithm. [sent-55, score-0.486]
30 Specifically, we build a similarity graph whose nodes represent the data points and whose edges represent the similarity between data points. [sent-56, score-0.395]
31 To 2 M2 x5 x x6 4 xp M1 x2 x1 x3 Figure 1: For x1 ∈ M1 , the smallest neighborhood containing points from M1 also contains points from M2 . [sent-59, score-0.438]
32 However, only the neighbors in M1 span a 1-dimensional subspace around x1 . [sent-60, score-0.391]
33 do dimensionality reduction, we wish to connect each point to neighboring points with appropriate weights that reflect the neighborhood information. [sent-61, score-0.493]
34 The underlying assumption behind the proposed method is that each data point has a small neighborhood in which the minimum number of points that span a low-dimensional affine subspace passing near that point is given by the points from the same manifold. [sent-64, score-0.692]
35 More precisely: Assumption 1 For each data point xi ∈ Ml consider the smallest ball Bi ⊂ RD that contains the dl + 1 nearest neighbors of xi from Ml . [sent-65, score-0.734]
36 Let the neighborhood Ni be the set of all data points in Bi excluding xi . [sent-66, score-0.417]
37 We assume that for all i there exists ≥ 0 such that the nonzero entries of the sparsest solution of j∈Ni cij (xj − xi ) 2 ≤ cij = 1 and (1) j∈Ni corresponds to the dl + 1 neighbors of xi from Ml . [sent-68, score-0.779]
38 In other words, among all affine subspaces spanned by subsets of the points {xj }j∈Ni and passing near xi up to error, the one of lowest dimension has dimension dl and it is spanned by the dl + 1 neighbors of xi from Ml . [sent-69, score-1.037]
39 In the limiting case of densely sampled data, this affine subspace coincides with the dl -dimensional tangent space of Ml at xi . [sent-70, score-0.367]
40 On the other hand, the affine span of any choices of 3 or more data points in the neighborhood always passes through x1 . [sent-74, score-0.4]
41 1 Optimization Algorithm Our goal is to propose a method that selects, for each data point xi , a few neighbors that lie in the same manifold. [sent-77, score-0.45]
42 However, Ni is not known a priori and searching for a few data points in Ni that satisfy (1) becomes more computationally complex as the size of the neighborhood increases. [sent-79, score-0.328]
43 However, by using a sparse optimization program, we bias the method to select a few data points that are close to xi and span a low-dimensional affine subspace passing near xi . [sent-81, score-0.618]
44 Consider a point xi in the dl -dimensional manifold Ml and consider the set of points {xj }j=i . [sent-82, score-0.611]
45 It follows from Assumption 1 that, among these points, the ones that are neighbors of xi in Ml span a dl -dimensional affine subspace of RD that passes near xi . [sent-83, score-0.76]
46 In other words, [x1 − xi · · · xN − xi ] ci 2 ≤ and 1 ci = 1 (2) has a solution ci whose dl + 1 nonzero entries corresponds to dl + 1 neighbors of xi in Ml . [sent-84, score-1.27]
47 (3) In this way, for a small ε, the locations of the nonzero entries of any solution ci of X i ci 2 ≤ ε do not depend on whether the selected points are close to or far from xi . [sent-92, score-0.553]
48 Now, among all the solutions of X i ci 2 ≤ ε that satisfy 1 ci = 1, we look for the one that uses a few closest neighbors of xi . [sent-93, score-0.613]
49 That is, points that are closer to xi get lower penalty than points that are farther away. [sent-95, score-0.354]
50 Another optimization program which is related to (4) by the method of Lagrange multipliers, is 1 min λ Qi ci 1 + X i ci 2 subject to 1 ci = 1, (5) 2 2 where the parameter λ sets the trade-off between the sparsity of the solution and the affine reconstruction error. [sent-102, score-0.461]
51 As we will show in the next section, there is a wide range of values of λ for which the optimization program in (5) successfully finds a sparse solution for each point from neighbors in the same manifold. [sent-105, score-0.428]
52 Notice that, in sharp contrast to the nearest neighbors-based methods, which first fix the number of neighbors or the neighborhood radius and then compute the weights between points in each neighborhood, we do the two steps at the same time. [sent-106, score-0.672]
53 In other words, the optimization programs (4) and (5) automatically choose a few neighbors of the given data point, which approximately span a low-dimensional affine subspace at that point. [sent-107, score-0.502]
54 2 Clustering and Dimensionality Reduction By solving the proposed optimization programs for each data point, we obtain the necessary information for clustering and dimensionality reduction. [sent-110, score-0.337]
55 (6) xj − xi 2 j=i Hence, we can rewrite xi ≈ [x1 x2 · · · xN ] wi , where the weight vector wi [wi1 · · · wiN ] ∈ RN associated to the i-th data point is defined as wii 0, wij cij / xj − xi 2 t=i cit / xt − xi 4 , j = i. [sent-112, score-0.601]
56 2 (7) The indices of the few nonzero elements of wi , ideally, correspond to neighbors of xi in the same manifold and their values indicate their (inverse) distances to xi . [sent-113, score-0.777]
57 While, potentially, every node can get connected to all other nodes, because of the sparsity of wi , each node i connects itself to only a few other nodes that correspond to the neighbors of xi in the same manifold. [sent-117, score-0.436]
58 In addition, the distances of the sparse neighbors to xi are reflected in the weights |wij |. [sent-119, score-0.43]
59 The similarity graph built in this way has ideally several connected components, where points in the same manifold are connected to each other and there is no connection between two points in different manifolds. [sent-120, score-0.661]
60 Thus, we can use the adjacency matrix, W [i], of the i-th cluster as a similarity between points in the corresponding manifold and obtain a low-dimensional embedding of the data by taking the last few eigenvectors of the normalized Laplacian matrix associated to W [i] [3]. [sent-139, score-0.846]
61 Note that there are other ways for inferring the low-dimensional embedding of the data in each cluster along the line of [21] and [1] which is beyond the scope of the current paper. [sent-140, score-0.409]
62 This comes from the fact that a data point xi ∈ Ml and its neighbors in Ml lie approximately in the dl -dimensional tangent space of Ml at xi . [sent-143, score-0.748]
63 Since dl + 1 vectors in this tangent space are linearly dependent, the solution ci of the proposed optimization programs is expected to have dl + 1 nonzero elements. [sent-144, score-0.6]
64 Thus, the number of nonzero elements of msc(l) or, more practically, the number of elements with relatively high magnitude, gives an estimate of the intrinsic dimension of the l-th manifold plus one. [sent-149, score-0.445]
65 2 An advantage of our method is that it allows us to have a different neighborhood size for each data point, depending on the local dimension of its underlying manifold at that point. [sent-150, score-0.474]
66 For example, in the case of two manifolds of dimensions d1 = 2 and d2 = 30, for data points in the l-th manifold we automatically obtain solutions with dl + 1 nonzero elements. [sent-151, score-0.818]
67 On the other hand, methods that fix the number of neighbors fall into trouble because the number of neighbors would be too small for one manifold or too large for the other manifold. [sent-152, score-0.745]
68 = 100 LEM, K = 5 LLE, K = 20 LEM, K = 20 Figure 2: Top: embedding of a punctured sphere and the msc vectors obtained by SMCE for different values of λ. [sent-161, score-0.436]
69 However, the clustering and embedding results obtained by SMCE are stable for λ ∈ [1, 200]. [sent-169, score-0.45]
70 Since the weighted 1 -optimization does not select the points that are very far from the given point, we consider only L < N − 1 neighbors of each data point in the optimization program, where we typically set L = N/10. [sent-170, score-0.497]
71 As in the case of nearest neighbors-based methods, there is no guarantee that the points in the same manifold form a single connected component of the similarity graph built by SMCE. [sent-171, score-0.543]
72 We sample N = 1, 000 data points from a 2-sphere, where a neighborhood of its north pole is excluded. [sent-176, score-0.328]
73 Notice that, for K = 20, nearest neighbor-based methods obtain similar embedding results to those of SMCE, while for K = 5 they obtain poor embedding results. [sent-183, score-0.662]
74 This suggests that the principle used by SMCE to select the neighbors is very effective: it chooses very few neighbors that are very informative for dimensionality reduction. [sent-184, score-0.64]
75 The data points are sampled such that among the 2 nearest neighbors of 1% of the data points there are points from the other manifold. [sent-188, score-0.773]
76 Also, among the 3 and 5 nearest neighbors of 9% and 18% of the data points, respectively, there are points from the other manifold. [sent-189, score-0.502]
77 For such points, the nearest neighbors-based methods will connect them to nearby points in the other manifold and assign large weights to the connection. [sent-190, score-0.504]
78 Table 1 shows the misclassification rates of LLE and LEM for different number of nearest neighbors K as well as the misclassification rates of SMCE for different values of λ. [sent-192, score-0.331]
79 0% Table 2: Percentage of data points whose K nearest neighbors contain points from the other manifold. [sent-218, score-0.621]
80 8% show, enforcing that the neighbors of a point from the same manifold span a low-dimensional affine subspace helps to select neighbors from the correct manifold and not from the other manifolds. [sent-225, score-1.158]
81 This results in successful clustering and embedding of the data as well as unraveling the dimensions of the manifolds. [sent-226, score-0.51]
82 First, we consider the problem of clustering and embedding of face images of two different subjects from the Extended Yale B database [22]. [sent-231, score-0.524]
83 Table 2 shows the percentage of points in the dataset whose K nearest neighbors contain points from the other manifold. [sent-238, score-0.592]
84 Beside the embedding of each method in Figure 4 (top row), we have shown the coefficient vector of a data point in M1 whose closest neighbor comes from M2 . [sent-240, score-0.404]
85 While nearest-neighbor-based methods pick the wrong neighbors with strong weights, SMCE successfully selects sparse neighbors from the correct manifold. [sent-241, score-0.56]
86 Next, we consider the dimensionality reduction of the images in the Frey face dataset, which consists of 1965 face images captured under varying pose and expression. [sent-245, score-0.331]
87 Figure 5: 2-D embedding of Frey face data using SMCE. [sent-249, score-0.372]
88 Cluster 1 Cluster 2 Digit 0 Digit 3 Digit 4 Digit 6 Digit 7 Figure 6: Clustering and embedding of five digits from the MNIST dataset. [sent-254, score-0.337]
89 Left: 2-D embedding obtained by SMCE for five digits {0, 3, 4, 6, 7}. [sent-255, score-0.337]
90 Middle: 2-D embedding of the data in the first cluster that corresponds to digit 3. [sent-256, score-0.471]
91 Right: 2-D embedding of the data in the second cluster that corresponds to digit 6. [sent-257, score-0.471]
92 Finally, we consider the clustering and dimensionality reduction of the digits from the MNIST test database [23]. [sent-258, score-0.358]
93 The middle and the right plots in Figure 6, show the two-dimensional embedding obtained by SMCE for two data clusters, which correspond to the digits 3 and 6. [sent-262, score-0.366]
94 4 Discussion We proposed a new algorithm based on sparse representation for simultaneous clustering and dimensionality reduction of data lying in multiple manifolds. [sent-263, score-0.466]
95 We used the solution of a sparse optimization program to build a similarity graph from which we obtained clustering and low-dimensional embedding of the data. [sent-264, score-0.695]
96 The sparse representation of each data point ideally encodes information that can be used for inferring the dimensionality of the underlying manifold around that point. [sent-265, score-0.458]
97 Finding robust methods for estimating the intrinsic dimension of the manifolds from the sparse coefficients and investigating theoretical guarantees under which SMCE works is the subject of our future research. [sent-266, score-0.391]
98 Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Neural Information Processing Systems, 2002, pp. [sent-282, score-0.363]
99 Medioni, “Unsupervised dimensionality estimation and manifold learning in highdimensional spaces by tensor voting. [sent-314, score-0.331]
100 Zha, “Principal manifolds and nonlinear dimensionality reduction via tangent space alignment,” SIAM J. [sent-362, score-0.425]
wordName wordTfidf (topN-words)
[('smce', 0.623), ('embedding', 0.296), ('neighbors', 0.261), ('manifold', 0.223), ('lle', 0.2), ('manifolds', 0.199), ('neighborhood', 0.178), ('lem', 0.173), ('clustering', 0.154), ('dl', 0.146), ('points', 0.121), ('msc', 0.115), ('ci', 0.107), ('af', 0.101), ('ml', 0.1), ('dimensionality', 0.09), ('xi', 0.089), ('cluster', 0.084), ('reduction', 0.073), ('span', 0.072), ('nearest', 0.07), ('nonzero', 0.069), ('intrinsic', 0.067), ('digit', 0.062), ('subspace', 0.058), ('ni', 0.054), ('similarity', 0.049), ('program', 0.048), ('face', 0.047), ('tangent', 0.045), ('dimension', 0.044), ('subject', 0.043), ('cij', 0.042), ('weights', 0.042), ('connected', 0.042), ('digits', 0.041), ('lying', 0.039), ('lie', 0.039), ('sparse', 0.038), ('graph', 0.038), ('programs', 0.038), ('qi', 0.037), ('proximity', 0.036), ('spectral', 0.036), ('vidal', 0.035), ('xj', 0.034), ('misclassi', 0.033), ('ehsan', 0.033), ('point', 0.032), ('eigenmaps', 0.031), ('dimensions', 0.031), ('connect', 0.03), ('densely', 0.029), ('laplacian', 0.029), ('data', 0.029), ('closest', 0.028), ('select', 0.028), ('median', 0.027), ('images', 0.027), ('subspaces', 0.027), ('vectorized', 0.026), ('optimization', 0.026), ('sphere', 0.025), ('passing', 0.025), ('beside', 0.025), ('wi', 0.025), ('ideally', 0.025), ('near', 0.024), ('wij', 0.024), ('preserving', 0.023), ('eigenvectors', 0.023), ('solution', 0.023), ('build', 0.023), ('farther', 0.023), ('simultaneous', 0.022), ('axis', 0.022), ('johns', 0.022), ('frey', 0.022), ('representation', 0.021), ('elements', 0.021), ('hopkins', 0.021), ('adjacency', 0.021), ('among', 0.021), ('pose', 0.02), ('coef', 0.02), ('saul', 0.02), ('selecting', 0.02), ('whose', 0.019), ('close', 0.019), ('nodes', 0.019), ('favors', 0.019), ('nonlinear', 0.018), ('entries', 0.018), ('highdimensional', 0.018), ('mnist', 0.018), ('recognition', 0.018), ('nearby', 0.018), ('smallest', 0.018), ('approximately', 0.018), ('curvature', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
2 0.14543191 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
Author: Binbin Lin, Chiyuan Zhang, Xiaofei He
Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Many of the existing work use the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semisupervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach. 1
3 0.14321703 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
4 0.14243507 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
Author: Dominique C. Perrault-joncas, Marina Meila
Abstract: This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model the observed graph as a sample from a manifold endowed with a vector field, and we design an algorithm that separates and recovers the features of this process: the geometry of the manifold, the data density and the vector field. The algorithm is motivated by our analysis of Laplacian-type operators and their continuous limit as generators of diffusions on a manifold. We illustrate the recovery algorithm on both artificially constructed and real data. 1 Motivation Recent advances in graph embedding and visualization have focused on undirected graphs, for which the graph Laplacian properties make the analysis particularly elegant [1, 2]. However, there is an important number of graph data, such as social networks, alignment scores between biological sequences, and citation data, which are naturally asymmetric. A commonly used approach for this type of data is to disregard the asymmetry by studying the spectral properties of W +W T or W T W , where W is the affinity matrix of the graph. Some approaches have been offered to preserve the asymmetry information contained in data: [3], [4], [5] or to define directed Laplacian operators [6]. Although quite successful, these works adopt a purely graph-theoretical point of view. Thus, they are not concerned with the generative process that produces the graph, nor with the interpretability and statistical properties of their algorithms. In contrast, we view the nodes of a directed graph as a finite sample from a manifold in Euclidean space, and the edges as macroscopic observations of a diffusion kernel between neighboring points on the manifold. We explore how this diffusion kernel determines the overall connectivity and asymmetry of the resulting graph and demonstrate how Laplacian-type operators of this graph can offer insights into the underlying generative process. Based on the analysis of the Laplacian-type operators, we derive an algorithm that, in the limit of infinite sample and vanishing bandwidth, recovers the key features of the sampling process: manifold geometry, sampling distribution, and local directionality, up to their intrinsic indeterminacies. 2 Model The first premise here is that we observe a directed graph G, with n nodes, having weights W = [Wij ] for the edge from node i to node j. In following with common Laplacian-based embedding approaches, we assume that G is a geometric random graph constructed from n points sampled according to distribution p = e−U on an unobserved compact smooth manifold M ⊆ Rl of known intrinsic dimension d ≤ l. The edge weight Wij is then determined by a directed similarity kernel k (xi , xj ) with bandwidth . The directional component of k (xi , xj ) will be taken to be derived 1 from a vector field r on M, which assigns a preferred direction between weights Wij and Wji . The choice of a vector field r to characterize the directional component of G might seem restrictive at first. In the asymptotic limit of → 0 and n → ∞ however, kernels are characterized by their diffusion, drift, and source components [7]. As such, r is sufficient to characterize any directionality associated with a drift component and as it turns out, the component of r normal M in Rl can also be use to characterize any source component. As for the diffusion component, it is not possible to uniquely identify it from G alone [8]. Some absolute knownledge of M is needed to say anything about it. Hence, without loss of generality, we will construct k (xi , xj ) so that the diffusion component ends being isotropic and constant, i.e. equal to Laplace-Beltrami operator ∆ on M. The schematic of this generative process is shown in the top left of Figure 1 below. From left to right: the graph generative process mapping the sample on M to geometric random graph G via the kernel k (x, y), then the subsequent embedding (α) Ψn of G by operators Haa,n , (α) Hss,n (defined in section 3.1). As these operators converge to (α) their respective limits, Haa and (α) Hss , so will Ψn → Ψ, pn → p, and rn → r. We design an algorithm that, given G, produces the top right embedding (Ψn , pn , and rn ). Figure 1: Schematic of our framework. The question is then as follows: can the generative process’ geometry M, distribution p = e−U , and directionality r, be recovered from G? In other words, is there an embedding of G in Rm , m ≥ d that approximates all three components of the process and that is also consistent as sample size increases and the bandwidth vanishes? In the case of undirected graphs, the theory of Laplacian eigenmaps [1] and Diffusion maps [9] answers this question in the affirmative, in that the geometry of M and p = e−U can be inferred using spectral graph theory. The aim here is to build on the undirected problem and recover all three components of the generative process from a directed graph G. The spectral approach to undirected graph embedding relies on the fact that eigenfunctions of the Laplace-Beltrami operator are known to preserve the local geometry of M [1]. With a consistent empirical Laplace-Beltrami operator based on G, its eigenvectors also recover the geometry of M and converge to the corresponding eigenfunctions on M. For a directed graph G, an additional operator is needed to recover the local directional component r, but the principle remains the same. (α) The schematic for this is shown in Figure 1 where two operators - Hss,n , introduced in [9] for (α) undirected embeddings, and Haa,n , a new operator defined in section 3.1 - are used to obtain the (α) (α) (α) embedding Ψn , distribution pn , and vector field rn . As Haa,n and Hss,n converge to Haa and (α) Hss , Ψn , pn , and rn also converge to Ψ, p, and r, where Ψ is the local geometry preserving the embedding of M into Rm . (α) The algorithm we propose in Section 4 will calculate the matrices corresponding to H·,n from the graph G, and with their eigenvectors, will find estimates for the node coordinates Ψ, the directional component r, and the sampling distribution p. In the next section we briefly describe the mathematical models of the diffusion processes that our model relies on. 2 2.1 Problem Setting The similarity kernel k (x, y) can be used to define transport operators on M. The natural transport operator is defined by normalizing k (x, y) as T [f ](x) = M k (x, y) f (y)p(y)dy , where p (x) = p (x) k (x, y)p(y)dy . (1) M T [f ](x) represents the diffusion of a distribution f (y) by the transition density k (x, y)p(y)/ k (x, y )p(y )dy . The eigenfunctions of this infinitesimal operator are the continuous limit of the eigenvectors of the transition probability matrix P = D−1 W given by normalizing the affinity matrix W of G by D = diag(W 1) [10]. Meanwhile, the infinitesimal transition ∂f (T − I)f = lim (2) →0 ∂t defines the backward equation for this diffusion process over M based on kernel k . Obtaining the explicit expression for transport operators like (2) is then the main technical challenge. 2.2 Choice of Kernel In order for T [f ] to have the correct asymptotic form, some hypotheses about the similarity kernel k (x, y) are required. The hypotheses are best presented by considering the decomposition of k (x, y) into symmetric h (x, y) = h (y, x) and anti-symmetric a (x, y) = −a (y, x) components: k (x, y) = h (x, y) + a (x, y) . (3) The symmetric component h (x, y) is assumed to satisfy the following properties: 1. h (||y − 2 x||2 ) = h(||y−x|| / ) , and 2. h ≥ 0 and h is exponentially decreasing as ||y − x|| → ∞. This form d/2 of symmetric kernel was used in [9] to analyze the diffusion map. For the asymmetric part of the similarity kernel, we assume the form a (x, y) = r(x, y) h(||y − x||2 / ) · (y − x) , d/2 2 (4) with r(x, y) = r(y, x) so that a (x, y) = −a (y, x). Here r(x, y) is a smooth vector field on the manifold that gives an orientation to the asymmetry of the kernel k (x, y). It is worth noting that the dependence of r(x, y) on both x and y implies that r : M × M → Rl with Rl the ambient space of M; however in the asymptotic limit, the dependence in y is only important “locally” (x = y), and as such it is appropriate to think of r(x, x) being a vector field on M. As a side note, it is worth pointing out that even though the form of a (x, y) might seem restrictive at first, it is sufficiently rich to describe any vector field . This can be seen by taking r(x, y) = (w(x) + w(y))/2 so that at x = y the resulting vector field is given by r(x, x) = w(x) for an arbitrary vector field w(x). 3 Continuous Limit of Laplacian Type Operators We are now ready to state the main asymptotic result. Proposition 3.1 Let M be a compact, closed, smooth manifold of dimension d and k (x, y) an asymmetric similarity kernel satisfying the conditions of section 2.2, then for any function f ∈ C 2 (M), the integral operator based on k has the asymptotic expansion k (x, y)f (y)dy = m0 f (x) + g(f (x), x) + o( ) , (5) M where g(f (x), x) = and m0 = Rd m2 (ω(x)f (x) + ∆f (x) + r · 2 h(||u||2 )du, m2 = Rd u2 h(||u||2 )du. i 3 f (x) + f (x) · r + c(x)f (x)) (6) The proof can be found in [8] along with the definition of ω(x) and c(x) in (6). For now, it suffices to say that ω(x) corresponds to an interaction between the symmetric kernel h and the curvature of M and was first derived in [9]. Meanwhile, c(x) is a new term that originates from the interaction between h and the component of r that is normal to M in the ambient space Rl . Proposition 3.1 foreshadows a general fact about spectral embedding algorithms: in most cases, Laplacian operators confound the effects of spatial proximity, sampling density and directional flow due to the presence of the various terms above. 3.1 Anisotropic Limit Operators Proposition 3.1 above can be used to derive the limits of a variety of Laplacian type operators associated with spectral embedding algorithms like [5, 6, 3]. Although we will focus primarily on a few operators that give the most insight into the generative process and enable us to recover the model defined in Figure 1, we first present four distinct families of operators for completeness. These operator families are inspired by the anisotropic family of operators that [9] introduced for undirected graphs, which make use of anisotropic kernels of the form: k (α) (x, y) = k (x, y) pα (x)pα (y) , (7) with α ∈ [0, 1] where α = 0 is the isotropic limit. To normalize the anisotropic kernels, we need (α) (α) (α) as p (x) = M k (x, y)p(y)dy. From (7), four to redefine the outdegrees distribution of k families of diffusion processes of the form ft = H (α) [f ](x) can be derived depending on which kernel is normalized and which outdegree distribution is used for the normalization. Specifically, (α) (α) we define transport operators by normalizing the asymmetric k or symmetric h kernels with the 1 asymmetric p or symmetric q = M h (x, y)p(y)dy outdegree distribution . To keep track of all options, we introduce the following notation: the operators will be indexed by the type of kernel and outdegree distribution they correspond to (symmetric or asymmetric), with the first index identifying the kernel and the second index identifying the outdegree distribution. For example, the family of anisotropic limit operators introduced by [9] is defined by normalizing the symmetric kernel by (α) the symmetric outdegree distribution, hence they will be denoted as Hss , with the superscript corresponding to the anisotropic power α. Proposition 3.2 With the above notation, (α) Haa [f ] = ∆f − 2 (1 − α) U · f + r· f (α) Has [f ] (α) Hsa [f ] (α) Hss [f ] = ∆f − 2 (1 − α) U · = ∆f − 2 (1 − α) U · = ∆f − 2(1 − α) U · (8) f − cf + (α − 1)(r · U )f − ( · r + (α − 1)r · f + (c + f. U )f · r)f + r · f (9) (10) (11) The proof of this proposition, which can be found in [8], follows from repeated application of Proposition 3.1 to p(y) or q(y) and then to k α (x, y) or hα (x, y), as well as the fact that p1 = α 1 p−α [1 − α (ω + ∆p p + 2r · p p +2 · r + c)] + o( ). (α) Thus, if we use the asymmetric k and p , we get Haa , defined by the advected diffusion equa(α) tion (8). In general, Haa is not hermitian, so it commonly has complex eigenvectors. This makes (1) embedding directed graphs with this operator problematic. Nevertheless, Haa will play an important role in extracting the directionality of the sampling process. If we use the symmetric kernel h but the asymmetric outdegree distribution p , we get the family (α) of operators Hsa , of which the WCut of [3] is a special case (α = 0). If we reverse the above, i.e. (α) (α) (α) use k and q , we obtain Has . This turns out to be merely a combination of Haa and Hsa . 1 The reader may notice that there are in fact eight possible combinations of kernel and degree distribution, since the anisotripic kernel (7) could also be defined using a symmetric or asymmetric outdegree distribution. However, there are only four distinct asymptotic results and they are all covered by using one kernel (symmetric or asymmetric) and one degree distribution (symmetric or asymmetric) throughout. 4 Algorithm 1 Directed Embedding Input: Affinity matrix Wi,j and embedding dimension m, (m ≥ d) 1. S ← (W + W T )/2 (Steps 1–6 estimate the coordinates as in [11]) n 2. qi ← j=1 Si,j , Q = diag(q) 3. V ← Q−1 SQ−1 (1) n 4. qi ← j=1 Vi,j , Q(1) = diag(q (1) ) (1) −1 5. Hss,n ← Q(1) V 6. Compute the Ψ the n × (m + 1) matrix with orthonormal columns containing the m + 1 largest (1) right eigenvector (by eigenvalue) of Hss,n as well as the Λ the (m + 1) × (m + 1) diagonal matrix of eigenvalues. Eigenvectors 2 to m + 1 from Ψ are the m coordinates of the embedding. (1) 7. Compute π the left eigenvector of Hss,n with eigenvalue 1. (Steps 7–8 estimate the density) n 8. π ← π/ i=1 πi is the density distribution over the embedding. n 9. pi ← j=1 Wi,j , P = diag(p) (Steps 9–13 estimate the vector field r) 10. T ← P −1 W P −1 (1) n 11. pi ← j=1 Ti,j , P (1) = diag(p(1) ) (1) −1 12. Haa,n ← P (1) T (1) (1) 13. R ← (Haa,n − Hss,n )Ψ/2. Columns 2 to m + 1 of R are the vector field components in the direction of the corresponding coordinates of the embedding. (α) Finally, if we only consider the symmetric kernel h and degree distribution q , we recover Hss , the anisotropic kernels of [9] for symmetric graphs. This operator for α = 1 is shown to separate the manifold from the probability distribution [11] and will be used as part of our recovery algorithm. Isolating the Vector Field r 4 Our aim is to esimate the manifold M, the density distribution p = e−U , and the vector field r. The (1) first two components of the data can be recovered from Hss as shown in [11] and summarized in Algorithm 1. At this juncture, one feature of generative process is missing: the vector field r. The natural approach (α) (α) for recovering r is to isolate the linear operator r · from Haa by substracting Hss : (α) (α) Haa − Hss = r · . (12) The advantage of recovering r in operator form as in (12) is that r · is coordinate free. In other words, as long as the chosen embedding of M is diffeomorphic to M2 , (12) can be used to express the component of r that lies in the tangent space T M, which we denote by r|| . Specifically, let Ψ be a diffeomorphic embedding of M ; the component of r along coordinate ψk is then given by r · ψk = rk , and so, in general, r|| = r · Ψ. (13) The subtle point that only r|| is recovered from (13) follows from the fact that the operator r · only defined along M and hence any directional derivative is necessarily along T M. is Equation (13) and the previous observations are the basis for Algorithm 1, which recovers the three important features of the generative process for an asymmetric graph with affinity matrix W . A similar approach can be employed to recover c + · r, or simply · r if r has no component perpendicular to the tangent space T M (meaning that c ≡ 0). Recovering c + · r is achieved by taking advantage of the fact that (1) (1) (Hsa − Hss ) = (c + 2 · r) , (14) (1) A diffeomorphic embedding is guaranteed by using the eigendecomposition of Hss . 5 (1) (1) which is a diagonal operator. Taking into account that for finite n (Hsa,n − Hss,n ) is not perfectly (1) (1) diagonal, using ψn ≡ 1n (vector of ones), i.e. (Hsa,n − Hss,n )[1n ] = (cn + · rn ), has been found (1) (1) empirically to be more stable than simply extracting the diagonal of (Hsa,n − Hss,n ). 5 Experiments Artificial Data For illustrative purposes, we begin by applying our method to an artificial example. We use the planet Earth as a manifold with a topographic density distribution, where sampling probability is proportional to elevation. We also consider two vector fields: the first is parallel to the line of constant latitude and purely tangential to the sphere, while the second is parallel to the line of constant longitude with a component of the vector field perpendicular to the manifold. The true model with constant latitude vector field is shown in Figure 2, along with the estimated density and vector field projected on the true manifold (sphere). Model Recovered Latitudinal (a) Longitudinal (b) Figure 2: (a): Sphere with latitudinal vector field, i.e East-West asymmetry, with Wew > Wwe if node w lies to the West of node e. The graph nodes are sampled non-uniformly, with the topographic map of the world as sampling density. We sample n = 5000 nodes, and observe only the resulting W matrix, but not the node locations. From W , our algorithm estimates the sample locations (geometry), the vector field (black arrows) generating the observed asymmetries, and the sampling distribution at each data point (colormap). (b) Vector fields on a spherical region (blue), and their estimates (red): latitudinal vector field tangent to the manifold (left) and longitudinal vector field with component perpendicular to manifold tangent plane (right). Both the estimated density and vector field agree with the true model, demonstrating that for artificial data, the recovery algorithm 1 performs quite well. We note that the estimated density does not recover all the details of the original density, even for large sample size (here n = 5000 with = 0.07). Meanwhile, the estimated vector field performs quite well even when the sampling is reduced to n = 500 with = 0.1. This can be seen in Figure 2, b, where the true and estimated vector fields are superimposed. Figure 2 also demonstrates how r · only recovers the tangential component of r. The estimated geometry is not shown on any of these figures, since the success of the diffusion map in recovering the geometry for such a simple manifold is already well established [2, 9]. Real DataThe National Longitudinal Survey of Youth (NLSY) 1979 Cohort is a representative sample of young men and women in the United States who were followed from 1979 to 2000 [12, 13]. The aim here is to use this survey to obtain a representation of the job market as a diffusion process over a manifold. The data set consists of a sample of 7,816 individual career sequences of length 64, listing the jobs a particular individual held every quarter between the ages of 20 and 36. Each token in the sequence identifies a job. Each job corresponds to an industry × occupation pair. There are 25 unique industry and 20 unique occupation indices. Out of the 500 possible pairings, approximately 450 occur in the data, with only 213 occurring with sufficient frequency to be included here. Thus, our graph G has 213 nodes - the jobs - and our observations consist of 7,816 walks between the graph nodes. We convert these walks to a directed graph with affinity matrix W . Specifically, Wij represents the number of times a transition from job i to job j was observed (Note that this matrix is asymmetric, 6 i.e Wij = Wji ). Normalizing each row i of W by its outdegree di gives P = diag(di )−1 W , the non-parametric maximum likelihood estimator for the Markov chain over G for the progression (0) of career sequences. This Markov chain has as limit operator Haa , as the granularity of the job market increases along with the number of observations. Thus, in trying to recover the geometry, distribution and vector field, we are actually interested in estimating the full advective effect of the (0) diffusion process generated by Haa ; that is, we want to estimate r · − 2 U · where we can use (0) (1) −2 U · = Hss − Hss to complement Algorithm 1. 0.25 1600 0.05 0.1 1400 0.1 3 0.7 1800 0.15 ! 0.8 0.15 0.2 0.9 0.2 2000 0.25 !30.05 0.6 0.5 0 0 0.4 1200 −0.05 −0.05 −0.1 0.3 −0.1 1000 0.2 −0.15 −0.15 800 −0.2 −0.25 0.1 −0.2 −0.25 −0.1 −0.05 0 !2 0.05 0.1 0.15 0.2 (a) −0.1 −0.05 0 !2 0.05 0.1 0.15 0.2 (b) Figure 3: Embedding the job market along with field r − 2 U over the first two non-constant eigenvectors. The color map corresponds to the mean monthly wage in dollars (a) and to the female proportion (b) for each job. We obtain an embedding of the job market that describes the relative position of jobs, their distribution, and the natural time progression from each job. Of these, the relative position and natural time progression are the most interesting. Together, they summarize the job market dynamics by describing which jobs are naturally “close” as well as where they can lead in the future. From a public policy perspective, this can potentially improve focus on certain jobs for helping individuals attain better upward mobility. The job market was found to be a high dimensional manifold. We present only the first two dimen(0) sions, that is, the second and third eigenvectors of Hss , since the first eigenvector is uninformative (constant) by construction. The eigenvectors showed correlation with important demographic data, such as wages and gender. Figure 3 displays this two-dimensional sub-embedding along with the directional information r − 2 U for each dimension. The plot shows very little net progression toward regions of increasing mean salary3 . This is somewhat surprising, but it is easy to overstate this observation: diffusion alone would be enough to move the individuals towards higher salary. What Figure 3 (a) suggests is that there appear to be no “external forces” advecting individuals towards higher salary. Nevertheless, there appear to be other external forces at play in the job market: Figure 3 (b), which is analogous to Figure 3 (a), but with gender replacing the salary color scheme, suggests that these forces push individuals towards greater gender differentiation. This is especially true amongst male-dominated jobs which appear to be advected toward the left edge of the embedding. Hence, this simple analysis of the job market can be seen as an indication that males and females tend to move away from each other over time, while neither seems to have a monopoly on high- or low- paying jobs. 6 Discussion This paper makes three contributions: (1) it introduces a manifold-based generative model for directed graphs with weighted edges, (2) it obtains asymptotic results for operators constructed from the directed graphs, and (3) these asymptotic results lead to a natural algorithm for estimating the model. 3 It is worth noting that in the NLSY data set, high paying jobs are teacher, nurse and mechanic. This is due to the fact that the career paths observed stop at at age 36, which is relatively early in an individual’s career. 7 Generative Models that assume that data are sampled from a manifold are standard for undirected graphs, but to our knowledge, none have yet been proposed for directed graphs. When W is symmetric, it is natural to assume that it depends on the points’ proximity. For asymmetric affinities W , one must include an additional component to explain the asymmetry. In the asymptotic limit, this is tantamount to defining a vector field on the manifold. Algorithm We have used from [9] the idea of defining anisotropic kernels (indexed by α) in order to separate the density p and the manifold geometry M. Also, we adopted their general assumptions about the symmetric part of the kernel. As a consequence, the recovery algorithm for p and M is identical to theirs. However, insofar as the asymmetric part of the kernel is concerned, everything, starting from the definition and the introduction of the vector field r as a way to model the asymmetry, through the derivation of the asymptotic expression for the symmetric plus asymmetric kernel, is new. We go significantly beyond the elegant idea of [9] regarding the use of anisotropic kernels by analyzing the four distinct renormalizations possible for a given α, each of them combining different aspects of M, p and r. Only the successful (and novel) combination of two different anisotropic operators is able to recover the directional flow r. Algorithm 1 is natural, but we do not claim it is the only possible one in the context of our model. (α) For instance, we can also use Hsa to recover the operator · r (which empirically seems to have worse numerical properties than r · ). In the National Longitudinal Survery of Youth study, we were interested in the whole advective term, so we estimated it from a different combination of operators. Depending on the specific question, other features of the model could be obtained Limit Results Proposition 3.1 is a general result on the asymptotics of asymmetric kernels. Recovering the manifold and r is just one, albeit the most useful, of the many ways of exploiting these (0) results. For instance, Hsa is the limit operator of the operators used in [3] and [5]. The limit analysis could be extended to other digraph embedding algorithms such as [4, 6]. How general is our model? Any kernel can be decomposed into a symmetric and an asymmetric part, as we have done. The assumptions on the symmetric part h are standard. The paper of [7] goes one step further from these assumptions; we will discuss it in relationship with our work shortly. The more interesting question is how limiting are our assumptions regarding the choice of kernel, especially the asymmetric part, which we parameterized as a (x, y) = r/2 · (y − x)h (x, y) in (4). In the asymptotic limit, this choice turns out to be fully general, at least up to the identifiable aspects of the model. For a more detailed discussion of this issue, see [8]. In [7], Ting, Huang and Jordan presented asymptotic results for a general family of kernels that includes asymmetric and random kernels. Our k can be expressed in the notation of [7] by taking wx (y) ← 1+r(x, y)·(y−x), rx (y) ← 1, K0 ← h, h ← . Their assumptions are more general than the assumptions we make here, yet our model is general up to what can be identified from G alone. The distinction arises because [7] focuses on the graph construction methods from an observed sample of M, while we focus on explaining an observed directed graph G through a manifold generative process. Moreover, while the [7] results can be used to analyze data from directed graphs, they differ from our Proposition 3.1. Specifically, with respect to the limit in Theorem 3 from [7], we obtain the additional source terms f (x) · r and c(x)f (x) that follow from not enforcing (α) (α) conservation of mass while defining operators Hsa and Has . We applied our theory of directed graph embedding to the analysis of the career sequences in Section 5, but asymmetric affinity data abound in other social contexts, and in the physical and life sciences. Indeed, any “similarity” score that is obtained from a likelihood of the form Wvu =likelihood(u|v) is generally asymmetric. Hence our methods can be applied to study not only social networks, but also patterns of human movement, road traffic, and trade relations, as well as alignment scores in molecular biology. Finally, the physical interpretation of our model also makes it naturally applicable to physical models of flows. Acknowledgments This research was partially supported by NSW awards IIS-0313339 and IIS-0535100. 8 References [1] Belkin and Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2002. [2] Nadler, Lafon, and Coifman. Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators. In Neural Information Processing Systems Conference, 2006. [3] Meila and Pentney. Clustering by weighted cuts in directed graphs. In SIAM Data Mining Conference, 2007. [4] Zhou, Huang, and Scholkopf. Learning from labeled and unlabeled data on a directed graph. In International Conference on Machine Learning, pages 1041–1048, 2005. [5] Zhou, Schlkopf, and Hofmann. Semi-supervised learning on directed graphs. In Advances in Neural Information Processing Systems, volume 17, pages 1633–1640, 2005. [6] Fan R. K. Chung. The diameter and laplacian eigenvalues of directed graphs. Electr. J. Comb., 13, 2006. [7] Ting, Huang, and Jordan. An analysis of the convergence of graph Laplacians. In International Conference on Machine Learning, 2010. [8] Dominique Perrault-Joncas and Marina Meil˘ . Directed graph embedding: an algorithm based on contina uous limits of laplacian-type operators. Technical Report TR 587, University of Washington - Department of Statistics, November 2011. [9] Coifman and Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21:6–30, 2006. [10] Mikhail Belkin and Partha Niyogi. Convergence of laplacian eigenmaps. preprint, short version NIPS 2008, 2008. [11] Coifman, Lafon, Lee, Maggioni, Warner, and Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. In Proceedings of the National Academy of Sciences, pages 7426–7431, 2005. [12] United States Department of Labor. National longitudinal survey of youth 1979 cohort. http://www.bls.gov/nls/, retrived October 2011. [13] Marc A. Scott. Affinity models for career sequences. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):417–436, 2011. 9
5 0.12384599 54 nips-2011-Co-regularized Multi-view Spectral Clustering
Author: Abhishek Kumar, Piyush Rai, Hal Daume
Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.
6 0.12350844 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
7 0.1206308 287 nips-2011-The Manifold Tangent Classifier
8 0.10682553 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
9 0.092950821 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
10 0.090604618 5 nips-2011-A Denoising View of Matrix Completion
11 0.085679077 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
12 0.084019117 198 nips-2011-On U-processes and clustering performance
13 0.077339977 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
14 0.075041212 171 nips-2011-Metric Learning with Multiple Kernels
15 0.072443239 157 nips-2011-Learning to Search Efficiently in High Dimensions
16 0.064589016 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
17 0.064201534 67 nips-2011-Data Skeletonization via Reeb Graphs
18 0.061500549 150 nips-2011-Learning a Distance Metric from a Network
19 0.06116062 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
20 0.058984127 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
topicId topicWeight
[(0, 0.166), (1, 0.057), (2, -0.075), (3, -0.06), (4, -0.056), (5, 0.021), (6, -0.051), (7, -0.016), (8, 0.12), (9, -0.122), (10, -0.023), (11, 0.191), (12, 0.042), (13, -0.167), (14, 0.079), (15, 0.058), (16, -0.115), (17, 0.115), (18, -0.077), (19, -0.005), (20, -0.005), (21, 0.103), (22, 0.048), (23, 0.134), (24, 0.05), (25, -0.049), (26, -0.082), (27, 0.012), (28, -0.071), (29, 0.022), (30, 0.034), (31, 0.068), (32, -0.083), (33, -0.111), (34, 0.062), (35, -0.031), (36, -0.067), (37, -0.059), (38, 0.077), (39, 0.093), (40, -0.021), (41, -0.008), (42, -0.022), (43, 0.065), (44, 0.042), (45, 0.022), (46, -0.014), (47, 0.076), (48, -0.048), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.9553684 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
2 0.75421375 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization
Author: Binbin Lin, Chiyuan Zhang, Xiaofei He
Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Many of the existing work use the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semisupervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach. 1
3 0.72677118 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators
Author: Dominique C. Perrault-joncas, Marina Meila
Abstract: This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model the observed graph as a sample from a manifold endowed with a vector field, and we design an algorithm that separates and recovers the features of this process: the geometry of the manifold, the data density and the vector field. The algorithm is motivated by our analysis of Laplacian-type operators and their continuous limit as generators of diffusions on a manifold. We illustrate the recovery algorithm on both artificially constructed and real data. 1 Motivation Recent advances in graph embedding and visualization have focused on undirected graphs, for which the graph Laplacian properties make the analysis particularly elegant [1, 2]. However, there is an important number of graph data, such as social networks, alignment scores between biological sequences, and citation data, which are naturally asymmetric. A commonly used approach for this type of data is to disregard the asymmetry by studying the spectral properties of W +W T or W T W , where W is the affinity matrix of the graph. Some approaches have been offered to preserve the asymmetry information contained in data: [3], [4], [5] or to define directed Laplacian operators [6]. Although quite successful, these works adopt a purely graph-theoretical point of view. Thus, they are not concerned with the generative process that produces the graph, nor with the interpretability and statistical properties of their algorithms. In contrast, we view the nodes of a directed graph as a finite sample from a manifold in Euclidean space, and the edges as macroscopic observations of a diffusion kernel between neighboring points on the manifold. We explore how this diffusion kernel determines the overall connectivity and asymmetry of the resulting graph and demonstrate how Laplacian-type operators of this graph can offer insights into the underlying generative process. Based on the analysis of the Laplacian-type operators, we derive an algorithm that, in the limit of infinite sample and vanishing bandwidth, recovers the key features of the sampling process: manifold geometry, sampling distribution, and local directionality, up to their intrinsic indeterminacies. 2 Model The first premise here is that we observe a directed graph G, with n nodes, having weights W = [Wij ] for the edge from node i to node j. In following with common Laplacian-based embedding approaches, we assume that G is a geometric random graph constructed from n points sampled according to distribution p = e−U on an unobserved compact smooth manifold M ⊆ Rl of known intrinsic dimension d ≤ l. The edge weight Wij is then determined by a directed similarity kernel k (xi , xj ) with bandwidth . The directional component of k (xi , xj ) will be taken to be derived 1 from a vector field r on M, which assigns a preferred direction between weights Wij and Wji . The choice of a vector field r to characterize the directional component of G might seem restrictive at first. In the asymptotic limit of → 0 and n → ∞ however, kernels are characterized by their diffusion, drift, and source components [7]. As such, r is sufficient to characterize any directionality associated with a drift component and as it turns out, the component of r normal M in Rl can also be use to characterize any source component. As for the diffusion component, it is not possible to uniquely identify it from G alone [8]. Some absolute knownledge of M is needed to say anything about it. Hence, without loss of generality, we will construct k (xi , xj ) so that the diffusion component ends being isotropic and constant, i.e. equal to Laplace-Beltrami operator ∆ on M. The schematic of this generative process is shown in the top left of Figure 1 below. From left to right: the graph generative process mapping the sample on M to geometric random graph G via the kernel k (x, y), then the subsequent embedding (α) Ψn of G by operators Haa,n , (α) Hss,n (defined in section 3.1). As these operators converge to (α) their respective limits, Haa and (α) Hss , so will Ψn → Ψ, pn → p, and rn → r. We design an algorithm that, given G, produces the top right embedding (Ψn , pn , and rn ). Figure 1: Schematic of our framework. The question is then as follows: can the generative process’ geometry M, distribution p = e−U , and directionality r, be recovered from G? In other words, is there an embedding of G in Rm , m ≥ d that approximates all three components of the process and that is also consistent as sample size increases and the bandwidth vanishes? In the case of undirected graphs, the theory of Laplacian eigenmaps [1] and Diffusion maps [9] answers this question in the affirmative, in that the geometry of M and p = e−U can be inferred using spectral graph theory. The aim here is to build on the undirected problem and recover all three components of the generative process from a directed graph G. The spectral approach to undirected graph embedding relies on the fact that eigenfunctions of the Laplace-Beltrami operator are known to preserve the local geometry of M [1]. With a consistent empirical Laplace-Beltrami operator based on G, its eigenvectors also recover the geometry of M and converge to the corresponding eigenfunctions on M. For a directed graph G, an additional operator is needed to recover the local directional component r, but the principle remains the same. (α) The schematic for this is shown in Figure 1 where two operators - Hss,n , introduced in [9] for (α) undirected embeddings, and Haa,n , a new operator defined in section 3.1 - are used to obtain the (α) (α) (α) embedding Ψn , distribution pn , and vector field rn . As Haa,n and Hss,n converge to Haa and (α) Hss , Ψn , pn , and rn also converge to Ψ, p, and r, where Ψ is the local geometry preserving the embedding of M into Rm . (α) The algorithm we propose in Section 4 will calculate the matrices corresponding to H·,n from the graph G, and with their eigenvectors, will find estimates for the node coordinates Ψ, the directional component r, and the sampling distribution p. In the next section we briefly describe the mathematical models of the diffusion processes that our model relies on. 2 2.1 Problem Setting The similarity kernel k (x, y) can be used to define transport operators on M. The natural transport operator is defined by normalizing k (x, y) as T [f ](x) = M k (x, y) f (y)p(y)dy , where p (x) = p (x) k (x, y)p(y)dy . (1) M T [f ](x) represents the diffusion of a distribution f (y) by the transition density k (x, y)p(y)/ k (x, y )p(y )dy . The eigenfunctions of this infinitesimal operator are the continuous limit of the eigenvectors of the transition probability matrix P = D−1 W given by normalizing the affinity matrix W of G by D = diag(W 1) [10]. Meanwhile, the infinitesimal transition ∂f (T − I)f = lim (2) →0 ∂t defines the backward equation for this diffusion process over M based on kernel k . Obtaining the explicit expression for transport operators like (2) is then the main technical challenge. 2.2 Choice of Kernel In order for T [f ] to have the correct asymptotic form, some hypotheses about the similarity kernel k (x, y) are required. The hypotheses are best presented by considering the decomposition of k (x, y) into symmetric h (x, y) = h (y, x) and anti-symmetric a (x, y) = −a (y, x) components: k (x, y) = h (x, y) + a (x, y) . (3) The symmetric component h (x, y) is assumed to satisfy the following properties: 1. h (||y − 2 x||2 ) = h(||y−x|| / ) , and 2. h ≥ 0 and h is exponentially decreasing as ||y − x|| → ∞. This form d/2 of symmetric kernel was used in [9] to analyze the diffusion map. For the asymmetric part of the similarity kernel, we assume the form a (x, y) = r(x, y) h(||y − x||2 / ) · (y − x) , d/2 2 (4) with r(x, y) = r(y, x) so that a (x, y) = −a (y, x). Here r(x, y) is a smooth vector field on the manifold that gives an orientation to the asymmetry of the kernel k (x, y). It is worth noting that the dependence of r(x, y) on both x and y implies that r : M × M → Rl with Rl the ambient space of M; however in the asymptotic limit, the dependence in y is only important “locally” (x = y), and as such it is appropriate to think of r(x, x) being a vector field on M. As a side note, it is worth pointing out that even though the form of a (x, y) might seem restrictive at first, it is sufficiently rich to describe any vector field . This can be seen by taking r(x, y) = (w(x) + w(y))/2 so that at x = y the resulting vector field is given by r(x, x) = w(x) for an arbitrary vector field w(x). 3 Continuous Limit of Laplacian Type Operators We are now ready to state the main asymptotic result. Proposition 3.1 Let M be a compact, closed, smooth manifold of dimension d and k (x, y) an asymmetric similarity kernel satisfying the conditions of section 2.2, then for any function f ∈ C 2 (M), the integral operator based on k has the asymptotic expansion k (x, y)f (y)dy = m0 f (x) + g(f (x), x) + o( ) , (5) M where g(f (x), x) = and m0 = Rd m2 (ω(x)f (x) + ∆f (x) + r · 2 h(||u||2 )du, m2 = Rd u2 h(||u||2 )du. i 3 f (x) + f (x) · r + c(x)f (x)) (6) The proof can be found in [8] along with the definition of ω(x) and c(x) in (6). For now, it suffices to say that ω(x) corresponds to an interaction between the symmetric kernel h and the curvature of M and was first derived in [9]. Meanwhile, c(x) is a new term that originates from the interaction between h and the component of r that is normal to M in the ambient space Rl . Proposition 3.1 foreshadows a general fact about spectral embedding algorithms: in most cases, Laplacian operators confound the effects of spatial proximity, sampling density and directional flow due to the presence of the various terms above. 3.1 Anisotropic Limit Operators Proposition 3.1 above can be used to derive the limits of a variety of Laplacian type operators associated with spectral embedding algorithms like [5, 6, 3]. Although we will focus primarily on a few operators that give the most insight into the generative process and enable us to recover the model defined in Figure 1, we first present four distinct families of operators for completeness. These operator families are inspired by the anisotropic family of operators that [9] introduced for undirected graphs, which make use of anisotropic kernels of the form: k (α) (x, y) = k (x, y) pα (x)pα (y) , (7) with α ∈ [0, 1] where α = 0 is the isotropic limit. To normalize the anisotropic kernels, we need (α) (α) (α) as p (x) = M k (x, y)p(y)dy. From (7), four to redefine the outdegrees distribution of k families of diffusion processes of the form ft = H (α) [f ](x) can be derived depending on which kernel is normalized and which outdegree distribution is used for the normalization. Specifically, (α) (α) we define transport operators by normalizing the asymmetric k or symmetric h kernels with the 1 asymmetric p or symmetric q = M h (x, y)p(y)dy outdegree distribution . To keep track of all options, we introduce the following notation: the operators will be indexed by the type of kernel and outdegree distribution they correspond to (symmetric or asymmetric), with the first index identifying the kernel and the second index identifying the outdegree distribution. For example, the family of anisotropic limit operators introduced by [9] is defined by normalizing the symmetric kernel by (α) the symmetric outdegree distribution, hence they will be denoted as Hss , with the superscript corresponding to the anisotropic power α. Proposition 3.2 With the above notation, (α) Haa [f ] = ∆f − 2 (1 − α) U · f + r· f (α) Has [f ] (α) Hsa [f ] (α) Hss [f ] = ∆f − 2 (1 − α) U · = ∆f − 2 (1 − α) U · = ∆f − 2(1 − α) U · (8) f − cf + (α − 1)(r · U )f − ( · r + (α − 1)r · f + (c + f. U )f · r)f + r · f (9) (10) (11) The proof of this proposition, which can be found in [8], follows from repeated application of Proposition 3.1 to p(y) or q(y) and then to k α (x, y) or hα (x, y), as well as the fact that p1 = α 1 p−α [1 − α (ω + ∆p p + 2r · p p +2 · r + c)] + o( ). (α) Thus, if we use the asymmetric k and p , we get Haa , defined by the advected diffusion equa(α) tion (8). In general, Haa is not hermitian, so it commonly has complex eigenvectors. This makes (1) embedding directed graphs with this operator problematic. Nevertheless, Haa will play an important role in extracting the directionality of the sampling process. If we use the symmetric kernel h but the asymmetric outdegree distribution p , we get the family (α) of operators Hsa , of which the WCut of [3] is a special case (α = 0). If we reverse the above, i.e. (α) (α) (α) use k and q , we obtain Has . This turns out to be merely a combination of Haa and Hsa . 1 The reader may notice that there are in fact eight possible combinations of kernel and degree distribution, since the anisotripic kernel (7) could also be defined using a symmetric or asymmetric outdegree distribution. However, there are only four distinct asymptotic results and they are all covered by using one kernel (symmetric or asymmetric) and one degree distribution (symmetric or asymmetric) throughout. 4 Algorithm 1 Directed Embedding Input: Affinity matrix Wi,j and embedding dimension m, (m ≥ d) 1. S ← (W + W T )/2 (Steps 1–6 estimate the coordinates as in [11]) n 2. qi ← j=1 Si,j , Q = diag(q) 3. V ← Q−1 SQ−1 (1) n 4. qi ← j=1 Vi,j , Q(1) = diag(q (1) ) (1) −1 5. Hss,n ← Q(1) V 6. Compute the Ψ the n × (m + 1) matrix with orthonormal columns containing the m + 1 largest (1) right eigenvector (by eigenvalue) of Hss,n as well as the Λ the (m + 1) × (m + 1) diagonal matrix of eigenvalues. Eigenvectors 2 to m + 1 from Ψ are the m coordinates of the embedding. (1) 7. Compute π the left eigenvector of Hss,n with eigenvalue 1. (Steps 7–8 estimate the density) n 8. π ← π/ i=1 πi is the density distribution over the embedding. n 9. pi ← j=1 Wi,j , P = diag(p) (Steps 9–13 estimate the vector field r) 10. T ← P −1 W P −1 (1) n 11. pi ← j=1 Ti,j , P (1) = diag(p(1) ) (1) −1 12. Haa,n ← P (1) T (1) (1) 13. R ← (Haa,n − Hss,n )Ψ/2. Columns 2 to m + 1 of R are the vector field components in the direction of the corresponding coordinates of the embedding. (α) Finally, if we only consider the symmetric kernel h and degree distribution q , we recover Hss , the anisotropic kernels of [9] for symmetric graphs. This operator for α = 1 is shown to separate the manifold from the probability distribution [11] and will be used as part of our recovery algorithm. Isolating the Vector Field r 4 Our aim is to esimate the manifold M, the density distribution p = e−U , and the vector field r. The (1) first two components of the data can be recovered from Hss as shown in [11] and summarized in Algorithm 1. At this juncture, one feature of generative process is missing: the vector field r. The natural approach (α) (α) for recovering r is to isolate the linear operator r · from Haa by substracting Hss : (α) (α) Haa − Hss = r · . (12) The advantage of recovering r in operator form as in (12) is that r · is coordinate free. In other words, as long as the chosen embedding of M is diffeomorphic to M2 , (12) can be used to express the component of r that lies in the tangent space T M, which we denote by r|| . Specifically, let Ψ be a diffeomorphic embedding of M ; the component of r along coordinate ψk is then given by r · ψk = rk , and so, in general, r|| = r · Ψ. (13) The subtle point that only r|| is recovered from (13) follows from the fact that the operator r · only defined along M and hence any directional derivative is necessarily along T M. is Equation (13) and the previous observations are the basis for Algorithm 1, which recovers the three important features of the generative process for an asymmetric graph with affinity matrix W . A similar approach can be employed to recover c + · r, or simply · r if r has no component perpendicular to the tangent space T M (meaning that c ≡ 0). Recovering c + · r is achieved by taking advantage of the fact that (1) (1) (Hsa − Hss ) = (c + 2 · r) , (14) (1) A diffeomorphic embedding is guaranteed by using the eigendecomposition of Hss . 5 (1) (1) which is a diagonal operator. Taking into account that for finite n (Hsa,n − Hss,n ) is not perfectly (1) (1) diagonal, using ψn ≡ 1n (vector of ones), i.e. (Hsa,n − Hss,n )[1n ] = (cn + · rn ), has been found (1) (1) empirically to be more stable than simply extracting the diagonal of (Hsa,n − Hss,n ). 5 Experiments Artificial Data For illustrative purposes, we begin by applying our method to an artificial example. We use the planet Earth as a manifold with a topographic density distribution, where sampling probability is proportional to elevation. We also consider two vector fields: the first is parallel to the line of constant latitude and purely tangential to the sphere, while the second is parallel to the line of constant longitude with a component of the vector field perpendicular to the manifold. The true model with constant latitude vector field is shown in Figure 2, along with the estimated density and vector field projected on the true manifold (sphere). Model Recovered Latitudinal (a) Longitudinal (b) Figure 2: (a): Sphere with latitudinal vector field, i.e East-West asymmetry, with Wew > Wwe if node w lies to the West of node e. The graph nodes are sampled non-uniformly, with the topographic map of the world as sampling density. We sample n = 5000 nodes, and observe only the resulting W matrix, but not the node locations. From W , our algorithm estimates the sample locations (geometry), the vector field (black arrows) generating the observed asymmetries, and the sampling distribution at each data point (colormap). (b) Vector fields on a spherical region (blue), and their estimates (red): latitudinal vector field tangent to the manifold (left) and longitudinal vector field with component perpendicular to manifold tangent plane (right). Both the estimated density and vector field agree with the true model, demonstrating that for artificial data, the recovery algorithm 1 performs quite well. We note that the estimated density does not recover all the details of the original density, even for large sample size (here n = 5000 with = 0.07). Meanwhile, the estimated vector field performs quite well even when the sampling is reduced to n = 500 with = 0.1. This can be seen in Figure 2, b, where the true and estimated vector fields are superimposed. Figure 2 also demonstrates how r · only recovers the tangential component of r. The estimated geometry is not shown on any of these figures, since the success of the diffusion map in recovering the geometry for such a simple manifold is already well established [2, 9]. Real DataThe National Longitudinal Survey of Youth (NLSY) 1979 Cohort is a representative sample of young men and women in the United States who were followed from 1979 to 2000 [12, 13]. The aim here is to use this survey to obtain a representation of the job market as a diffusion process over a manifold. The data set consists of a sample of 7,816 individual career sequences of length 64, listing the jobs a particular individual held every quarter between the ages of 20 and 36. Each token in the sequence identifies a job. Each job corresponds to an industry × occupation pair. There are 25 unique industry and 20 unique occupation indices. Out of the 500 possible pairings, approximately 450 occur in the data, with only 213 occurring with sufficient frequency to be included here. Thus, our graph G has 213 nodes - the jobs - and our observations consist of 7,816 walks between the graph nodes. We convert these walks to a directed graph with affinity matrix W . Specifically, Wij represents the number of times a transition from job i to job j was observed (Note that this matrix is asymmetric, 6 i.e Wij = Wji ). Normalizing each row i of W by its outdegree di gives P = diag(di )−1 W , the non-parametric maximum likelihood estimator for the Markov chain over G for the progression (0) of career sequences. This Markov chain has as limit operator Haa , as the granularity of the job market increases along with the number of observations. Thus, in trying to recover the geometry, distribution and vector field, we are actually interested in estimating the full advective effect of the (0) diffusion process generated by Haa ; that is, we want to estimate r · − 2 U · where we can use (0) (1) −2 U · = Hss − Hss to complement Algorithm 1. 0.25 1600 0.05 0.1 1400 0.1 3 0.7 1800 0.15 ! 0.8 0.15 0.2 0.9 0.2 2000 0.25 !30.05 0.6 0.5 0 0 0.4 1200 −0.05 −0.05 −0.1 0.3 −0.1 1000 0.2 −0.15 −0.15 800 −0.2 −0.25 0.1 −0.2 −0.25 −0.1 −0.05 0 !2 0.05 0.1 0.15 0.2 (a) −0.1 −0.05 0 !2 0.05 0.1 0.15 0.2 (b) Figure 3: Embedding the job market along with field r − 2 U over the first two non-constant eigenvectors. The color map corresponds to the mean monthly wage in dollars (a) and to the female proportion (b) for each job. We obtain an embedding of the job market that describes the relative position of jobs, their distribution, and the natural time progression from each job. Of these, the relative position and natural time progression are the most interesting. Together, they summarize the job market dynamics by describing which jobs are naturally “close” as well as where they can lead in the future. From a public policy perspective, this can potentially improve focus on certain jobs for helping individuals attain better upward mobility. The job market was found to be a high dimensional manifold. We present only the first two dimen(0) sions, that is, the second and third eigenvectors of Hss , since the first eigenvector is uninformative (constant) by construction. The eigenvectors showed correlation with important demographic data, such as wages and gender. Figure 3 displays this two-dimensional sub-embedding along with the directional information r − 2 U for each dimension. The plot shows very little net progression toward regions of increasing mean salary3 . This is somewhat surprising, but it is easy to overstate this observation: diffusion alone would be enough to move the individuals towards higher salary. What Figure 3 (a) suggests is that there appear to be no “external forces” advecting individuals towards higher salary. Nevertheless, there appear to be other external forces at play in the job market: Figure 3 (b), which is analogous to Figure 3 (a), but with gender replacing the salary color scheme, suggests that these forces push individuals towards greater gender differentiation. This is especially true amongst male-dominated jobs which appear to be advected toward the left edge of the embedding. Hence, this simple analysis of the job market can be seen as an indication that males and females tend to move away from each other over time, while neither seems to have a monopoly on high- or low- paying jobs. 6 Discussion This paper makes three contributions: (1) it introduces a manifold-based generative model for directed graphs with weighted edges, (2) it obtains asymptotic results for operators constructed from the directed graphs, and (3) these asymptotic results lead to a natural algorithm for estimating the model. 3 It is worth noting that in the NLSY data set, high paying jobs are teacher, nurse and mechanic. This is due to the fact that the career paths observed stop at at age 36, which is relatively early in an individual’s career. 7 Generative Models that assume that data are sampled from a manifold are standard for undirected graphs, but to our knowledge, none have yet been proposed for directed graphs. When W is symmetric, it is natural to assume that it depends on the points’ proximity. For asymmetric affinities W , one must include an additional component to explain the asymmetry. In the asymptotic limit, this is tantamount to defining a vector field on the manifold. Algorithm We have used from [9] the idea of defining anisotropic kernels (indexed by α) in order to separate the density p and the manifold geometry M. Also, we adopted their general assumptions about the symmetric part of the kernel. As a consequence, the recovery algorithm for p and M is identical to theirs. However, insofar as the asymmetric part of the kernel is concerned, everything, starting from the definition and the introduction of the vector field r as a way to model the asymmetry, through the derivation of the asymptotic expression for the symmetric plus asymmetric kernel, is new. We go significantly beyond the elegant idea of [9] regarding the use of anisotropic kernels by analyzing the four distinct renormalizations possible for a given α, each of them combining different aspects of M, p and r. Only the successful (and novel) combination of two different anisotropic operators is able to recover the directional flow r. Algorithm 1 is natural, but we do not claim it is the only possible one in the context of our model. (α) For instance, we can also use Hsa to recover the operator · r (which empirically seems to have worse numerical properties than r · ). In the National Longitudinal Survery of Youth study, we were interested in the whole advective term, so we estimated it from a different combination of operators. Depending on the specific question, other features of the model could be obtained Limit Results Proposition 3.1 is a general result on the asymptotics of asymmetric kernels. Recovering the manifold and r is just one, albeit the most useful, of the many ways of exploiting these (0) results. For instance, Hsa is the limit operator of the operators used in [3] and [5]. The limit analysis could be extended to other digraph embedding algorithms such as [4, 6]. How general is our model? Any kernel can be decomposed into a symmetric and an asymmetric part, as we have done. The assumptions on the symmetric part h are standard. The paper of [7] goes one step further from these assumptions; we will discuss it in relationship with our work shortly. The more interesting question is how limiting are our assumptions regarding the choice of kernel, especially the asymmetric part, which we parameterized as a (x, y) = r/2 · (y − x)h (x, y) in (4). In the asymptotic limit, this choice turns out to be fully general, at least up to the identifiable aspects of the model. For a more detailed discussion of this issue, see [8]. In [7], Ting, Huang and Jordan presented asymptotic results for a general family of kernels that includes asymmetric and random kernels. Our k can be expressed in the notation of [7] by taking wx (y) ← 1+r(x, y)·(y−x), rx (y) ← 1, K0 ← h, h ← . Their assumptions are more general than the assumptions we make here, yet our model is general up to what can be identified from G alone. The distinction arises because [7] focuses on the graph construction methods from an observed sample of M, while we focus on explaining an observed directed graph G through a manifold generative process. Moreover, while the [7] results can be used to analyze data from directed graphs, they differ from our Proposition 3.1. Specifically, with respect to the limit in Theorem 3 from [7], we obtain the additional source terms f (x) · r and c(x)f (x) that follow from not enforcing (α) (α) conservation of mass while defining operators Hsa and Has . We applied our theory of directed graph embedding to the analysis of the career sequences in Section 5, but asymmetric affinity data abound in other social contexts, and in the physical and life sciences. Indeed, any “similarity” score that is obtained from a likelihood of the form Wvu =likelihood(u|v) is generally asymmetric. Hence our methods can be applied to study not only social networks, but also patterns of human movement, road traffic, and trade relations, as well as alignment scores in molecular biology. Finally, the physical interpretation of our model also makes it naturally applicable to physical models of flows. Acknowledgments This research was partially supported by NSW awards IIS-0313339 and IIS-0535100. 8 References [1] Belkin and Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2002. [2] Nadler, Lafon, and Coifman. Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators. In Neural Information Processing Systems Conference, 2006. [3] Meila and Pentney. Clustering by weighted cuts in directed graphs. In SIAM Data Mining Conference, 2007. [4] Zhou, Huang, and Scholkopf. Learning from labeled and unlabeled data on a directed graph. In International Conference on Machine Learning, pages 1041–1048, 2005. [5] Zhou, Schlkopf, and Hofmann. Semi-supervised learning on directed graphs. In Advances in Neural Information Processing Systems, volume 17, pages 1633–1640, 2005. [6] Fan R. K. Chung. The diameter and laplacian eigenvalues of directed graphs. Electr. J. Comb., 13, 2006. [7] Ting, Huang, and Jordan. An analysis of the convergence of graph Laplacians. In International Conference on Machine Learning, 2010. [8] Dominique Perrault-Joncas and Marina Meil˘ . Directed graph embedding: an algorithm based on contina uous limits of laplacian-type operators. Technical Report TR 587, University of Washington - Department of Statistics, November 2011. [9] Coifman and Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21:6–30, 2006. [10] Mikhail Belkin and Partha Niyogi. Convergence of laplacian eigenmaps. preprint, short version NIPS 2008, 2008. [11] Coifman, Lafon, Lee, Maggioni, Warner, and Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. In Proceedings of the National Academy of Sciences, pages 7426–7431, 2005. [12] United States Department of Labor. National longitudinal survey of youth 1979 cohort. http://www.bls.gov/nls/, retrived October 2011. [13] Marc A. Scott. Affinity models for career sequences. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):417–436, 2011. 9
4 0.72180313 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
Author: Nitesh Shroff, Pavan Turaga, Rama Chellappa
Abstract: In this paper, we consider the Pr´ cis problem of sampling K representative yet e diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.
5 0.60515743 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
Author: Vijay Mahadevan, Chi W. Wong, Jose C. Pereira, Tom Liu, Nuno Vasconcelos, Lawrence K. Saul
Abstract: We propose maximum covariance unfolding (MCU), a manifold learning algorithm for simultaneous dimensionality reduction of data from different input modalities. Given high dimensional inputs from two different but naturally aligned sources, MCU computes a common low dimensional embedding that maximizes the cross-modal (inter-source) correlations while preserving the local (intra-source) distances. In this paper, we explore two applications of MCU. First we use MCU to analyze EEG-fMRI data, where an important goal is to visualize the fMRI voxels that are most strongly correlated with changes in EEG traces. To perform this visualization, we augment MCU with an additional step for metric learning in the high dimensional voxel space. Second, we use MCU to perform cross-modal retrieval of matched image and text samples from Wikipedia. To manage large applications of MCU, we develop a fast implementation based on ideas from spectral graph theory. These ideas transform the original problem for MCU, one of semidefinite programming, into a simpler problem in semidefinite quadratic linear programming. 1
6 0.59461868 5 nips-2011-A Denoising View of Matrix Completion
7 0.58313614 287 nips-2011-The Manifold Tangent Classifier
8 0.53700769 54 nips-2011-Co-regularized Multi-view Spectral Clustering
9 0.52957863 186 nips-2011-Noise Thresholds for Spectral Clustering
10 0.46123302 67 nips-2011-Data Skeletonization via Reeb Graphs
11 0.45141786 150 nips-2011-Learning a Distance Metric from a Network
12 0.43051624 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies
13 0.42893377 157 nips-2011-Learning to Search Efficiently in High Dimensions
14 0.42242628 198 nips-2011-On U-processes and clustering performance
15 0.3978152 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
16 0.39178669 47 nips-2011-Beyond Spectral Clustering - Tight Relaxations of Balanced Graph Cuts
17 0.39167923 213 nips-2011-Phase transition in the family of p-resistances
18 0.39118692 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension
19 0.38045239 241 nips-2011-Scalable Training of Mixture Models via Coresets
20 0.3756884 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
topicId topicWeight
[(0, 0.016), (4, 0.078), (20, 0.076), (26, 0.024), (31, 0.047), (33, 0.022), (43, 0.06), (45, 0.11), (57, 0.046), (65, 0.034), (74, 0.067), (78, 0.18), (83, 0.025), (99, 0.109)]
simIndex simValue paperId paperTitle
1 0.82339275 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds
Author: Nitesh Shroff, Pavan Turaga, Rama Chellappa
Abstract: In this paper, we consider the Pr´ cis problem of sampling K representative yet e diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.
same-paper 2 0.82132852 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
3 0.75030649 29 nips-2011-Algorithms and hardness results for parallel large margin learning
Author: Phil Long, Rocco Servedio
Abstract: We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors ˜ and runs in time O(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ 2 ) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required. 1
4 0.72867584 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon
Abstract: We introduce P I C O D ES: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. P I C O D ES of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.
5 0.70353484 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
6 0.68984491 154 nips-2011-Learning person-object interactions for action recognition in still images
7 0.68694365 127 nips-2011-Image Parsing with Stochastic Scene Grammar
8 0.68634015 303 nips-2011-Video Annotation and Tracking with Active Learning
9 0.68338346 270 nips-2011-Statistical Performance of Convex Tensor Decomposition
10 0.68180543 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
11 0.67799926 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
12 0.67424035 109 nips-2011-Greedy Model Averaging
13 0.67330742 54 nips-2011-Co-regularized Multi-view Spectral Clustering
14 0.67290372 80 nips-2011-Efficient Online Learning via Randomized Rounding
15 0.67272341 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
16 0.6720258 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
17 0.66978544 180 nips-2011-Multiple Instance Filtering
18 0.66960752 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods
19 0.66651893 287 nips-2011-The Manifold Tangent Classifier
20 0.66607046 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding