nips nips2011 nips2011-248 knowledge-graph by maker-knowledge-mining

248 nips-2011-Semi-supervised Regression via Parallel Field Regularization


Source: pdf

Author: Binbin Lin, Chiyuan Zhang, Xiaofei He

Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Much of the existing work uses the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness to achieve faster rates of convergence for semi-supervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Much of the existing work uses the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. [sent-5, score-0.139]

2 However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness to achieve faster rates of convergence for semi-supervised regression problems. [sent-6, score-0.124]

3 To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. [sent-7, score-0.395]

4 Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. [sent-8, score-0.219]

5 We give a continuous objective function on the manifold and discuss how to discretize it by using random points. [sent-9, score-0.348]

6 Consequently, learning with the low dimensional manifold structure, or specifically the intrinsic topological and geometrical properties of the data manifold, becomes a crucial problem. [sent-15, score-0.27]

7 One hopes that the manifold structure could be preserved in the much lower dimensional Euclidean space. [sent-18, score-0.202]

8 For example, ISOMAP [1] is a global approach which tries to preserve the pairwise geodesic distance on the manifold. [sent-19, score-0.083]

9 These methods construct a nearest neighbor graph over the labeled and unlabeled data to model the underlying manifold structure, and use the graph Laplacian [5] to measure the smoothness of the learned function on the manifold. [sent-23, score-0.526]

10 In semi-supervised regression, some recent theoretical analysis [9] shows that using the graph Laplacian regularizer does not lead to faster minimax rates of convergence. [sent-25, score-0.091]

11 [9] also states that the Laplacian regularizer is way too general for measuring the smoothness of the function. [sent-26, score-0.112]

12 It is further suggested that we should ensure second order smoothness to achieve faster rates of convergence for semi-supervised regression problems. [sent-27, score-0.124]

13 The Laplacian regularizer is the integral of the norm of the gradient of the function, which is the first order derivative of the function. [sent-28, score-0.27]

14 In this paper, we design regularization terms that penalize the second order smoothness of the function, i.e., its second order covariant derivative. [sent-29, score-0.171]

15 Estimating the second order covariant derivative of the function is a very challenging problem. [sent-32, score-0.297]

16 We show that the gradient field of a linear function has to be a parallel vector field (or parallel field in short). [sent-34, score-0.44]

17 Consequently, we propose a novel approach called Parallel Field Regularization (PFR) to simultaneously find the function and its gradient field, while requiring the gradient field to be as parallel as possible. [sent-35, score-0.272]

18 Specifically, we propose to compute a function and a vector field which satisfy three conditions simultaneously: 1) the function minimizes the empirical error on the labeled data, 2) the vector field is close to the gradient field of the function, 3) the vector field should be as parallel as possible. [sent-36, score-0.46]

19 A novel regularization framework from the vector field perspective is developed. [sent-37, score-0.116]

20 Given $l$ labeled data points $(x_i, y_i)_{i=1}^{l}$ on $\mathcal{M}$, we aim to learn a function $f: \mathcal{M} \to \mathbb{R}$ based on the manifold $\mathcal{M}$ and the labeled points $(x_i, y_i)_{i=1}^{l}$. [sent-41, score-0.44]

21 If $D$ is the covariant derivative on the manifold, then $R_1(f) = \int_{\mathcal{M}} \|D f\|^2 = \int_{\mathcal{M}} f\, L(f)$ becomes the Laplacian regularizer. [sent-46, score-0.297]

22 1 Parallel Fields and Linear Functions We first show the relationship between a parallel field and a linear function on the manifold. [sent-49, score-0.166]

23 A vector field $X$ on the manifold $\mathcal{M}$ is a parallel field if $\nabla X \equiv 0$, where $\nabla$ is the covariant derivative on $\mathcal{M}$. [sent-52, score-0.72]

24 A continuous function $f: \mathcal{M} \to \mathbb{R}$ is said to be linear if $(f \circ \gamma)(t) = f(\gamma(0)) + ct$ (1) for each geodesic $\gamma$. [sent-55, score-0.083]

25 This proposition tells us the relationship between a parallel field and a linear function on the manifold. [sent-62, score-0.166]

26 2 Objective Function We aim to design regularization terms that penalize the second order smoothness of the function. [sent-64, score-0.171]

27 Following the above analysis, we first approximate the gradient field of the prediction function by a vector field; then we require the vector field to be as parallel as possible. [sent-65, score-0.173] [sent-72, score-0.276]

28 [Figure 1: Covariant derivative demonstration. The covariant derivative along the direction $\frac{d\gamma(t)}{dt}\big|_{t=0}$ can be computed by projecting $\frac{dV}{dt}\big|_{t=0}$ onto the tangent space $T_x\mathcal{M}$ at $x$.] [sent-69, score-0.506]

30 For a fixed point $x \in \mathcal{M}$, $\nabla V|_x$ is a linear map on the tangent space $T_x\mathcal{M}$. [sent-80, score-0.249]

31 Naturally, we propose the following objective function based on vector field regularization:
$$\arg\min_{f \in C^{\infty}(\mathcal{M}),\, V} E(f, V) = \frac{1}{l}\sum_{i=1}^{l} R_0(x_i, y_i, f) + \lambda_1 R_1(f, V) + \lambda_2 R_2(V) \qquad (5)$$
For the loss function $R_0$, we use the squared loss $R_0(f(x_i), y_i) = (f(x_i) - y_i)^2$ for simplicity. [sent-85, score-0.215]

32 3 Implementation Since the manifold M is unknown, the function f which minimizes (5) can not be directly solved. [sent-86, score-0.202]

33 In this section, we discuss how to discretize the continuous objective function (5). [sent-87, score-0.146]

34 3.1 Vector Field Representation. Given $l$ labeled data points $(x_i, y_i)_{i=1}^{l}$ and $n-l$ unlabeled points $x_{l+1}, \ldots, x_n$. [sent-89, score-0.119]

35 We first construct a nearest neighbor graph by either $\epsilon$-neighborhood or $k$ nearest neighbors. [sent-100, score-0.159]

36 Let $x_i \sim x_j$ denote that $x_i$ and $x_j$ are neighbors. [sent-101, score-0.616]

37 For each point xi , we estimate its tangent space Txi M by performing PCA on its neighborhood. [sent-102, score-0.372]

38 It is easy to show that $P_i = T_i T_i^T$ is the unique orthogonal projection from $\mathbb{R}^m$ onto the tangent space $T_{x_i}\mathcal{M}$ [11], where $T_i \in \mathbb{R}^{m \times d}$ collects an orthonormal basis of $T_{x_i}\mathcal{M}$ given by the top $d$ principal directions of the neighborhood of $x_i$. [sent-105, score-0.209]
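
The graph construction and tangent-space estimation described above amount to a k-nearest-neighbor search plus a local PCA per point. A rough sketch follows; the function names and the choice to center each neighborhood at $x_i$ (rather than at its mean) are my own simplifications, not details from the paper.

```python
import numpy as np

def knn_graph(X, k):
    """Boolean adjacency of the symmetrized k-nearest-neighbor graph (x_i ~ x_j)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]                  # k nearest neighbors of each point
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = True
    return A | A.T

def tangent_bases(X, A, d):
    """Local PCA: T[i] is an m x d orthonormal basis of T_{x_i}M; P[i] = T_i T_i^T."""
    n, m = X.shape
    T = np.zeros((n, m, d))
    for i in range(n):
        nbr = X[A[i]] - X[i]                              # neighborhood centered at x_i (a simplification)
        _, _, Vt = np.linalg.svd(nbr, full_matrices=False)
        T[i] = Vt[:d].T                                   # top-d principal directions
    P = np.einsum('iab,icb->iac', T, T)                   # P_i = T_i T_i^T
    return T, P

# toy usage on a noisy circle (intrinsic dimension d = 1, ambient dimension m = 2)
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.01 * rng.standard_normal((200, 2))
A = knn_graph(X, k=6)
T, P = tangent_bases(X, A, d=1)
print(A.sum(), T.shape, P.shape)
```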

39 For each point $x_i$, let $V_{x_i}$ denote the value of the vector field $V$ at $x_i$, and $\nabla V|_{x_i}$ denote the value of $\nabla V$ at $x_i$. [sent-108, score-0.544]

40 According to the definition of vector field, Vxi should be a vector in tangent space Txi M. [sent-109, score-0.319]

41 Therefore, it can be represented in the local coordinates of the tangent space as $V_{x_i} = T_i v_i$, where $v_i \in \mathbb{R}^d$. [sent-110, score-0.508]

42 That is, V is a dn-dimensional big column vector which concatenates all the vi ’s. [sent-115, score-0.186]
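
To make the bookkeeping concrete, here is a small illustration (with variable names of my own) of how the stacked $dn$-dimensional vector $\mathbf{V}$ relates to the per-point tangent coordinates $v_i$ and the ambient-space values $V_{x_i} = T_i v_i$.

```python
import numpy as np

n, m, d = 5, 3, 2
rng = np.random.default_rng(1)
# orthonormal tangent bases T_i (m x d); in practice these come from the local PCA step
T = np.stack([np.linalg.qr(rng.standard_normal((m, d)))[0] for _ in range(n)])

v = rng.standard_normal((n, d))          # local coordinates v_i in R^d
V = v.reshape(-1)                        # the dn-dimensional vector V = [v_1^T, ..., v_n^T]^T

V_amb = np.einsum('iab,ib->ia', T, v)    # ambient-space values V_{x_i} = T_i v_i, shape (n, m)
assert np.allclose(V_amb[2], T[2] @ V.reshape(n, d)[2])
print(V.shape, V_amb.shape)
```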

43 In the following, we first discretize our objective function E(f, V ), and then minimize it to obtain f and V. [sent-116, score-0.112]

44 3.2 Gradient Field Computation. In order to discretize $R_1(f, V)$, we first discuss the Taylor expansion of $f$ on the manifold. [sent-118, score-0.115]

45 The exponential map expx : Tx M → M maps the tangent space Tx M to the manifold M. [sent-120, score-0.606]

46 Then there is a unique geodesic $\gamma_a$ satisfying $\gamma_a(0) = x$ with the initial tangent vector $\gamma_a'(0) = a$. [sent-122, score-0.347]

47 The corresponding exponential map is defined as expx (ta) = γa (t), t ∈ [0, 1]. [sent-123, score-0.195]

48 Then the following Taylor expansion of $f$ holds: $f(\exp_x(a)) \approx f(x) + \langle \nabla f(x), a \rangle$ (6), where $a \in T_x\mathcal{M}$ is a sufficiently small tangent vector. [sent-126, score-0.209]

49 In the discrete case, let expxi denote the exponential map at xi . [sent-127, score-0.308]

50 Since expxi is a diffeomorphism, there exists a tangent vector aij ∈ Txi M such that expxi (aij ) = xj . [sent-128, score-0.67]

51 According to the definition of the exponential map, $\|a_{ij}\|$ equals the geodesic distance between $x_i$ and $x_j$, which can be denoted as $d_{ij}$. [sent-129, score-0.559]

52 Let $e_{ij}$ be the unit vector in the direction of $a_{ij}$, i.e., $e_{ij} = a_{ij}/\|a_{ij}\|$. [sent-130, score-0.394]

53 We approximate $a_{ij}$ by projecting the vector $x_j - x_i$ onto the tangent space, i.e., $a_{ij} \approx P_i(x_j - x_i)$. [sent-133, score-0.681]

54 Then Eq. (6) can be rewritten as follows: $f(x_j) = f(x_i) + \langle \nabla f(x_i), P_i(x_j - x_i) \rangle$ (7). Since $f$ is unknown, $\nabla f$ is also unknown. [sent-137, score-0.163]

55 We first show that the vector norm can be computed by an integral on a unit sphere, where the unit sphere can be discretely approximated by a neighborhood. [sent-139, score-0.219]

56 Let u be a unit vector on tangent space Tx M, then we have (see the exercise 1. [sent-140, score-0.338]

57 Then any vector $b \in T_x\mathcal{M}$ can be written as $b = \sum_{i=1}^{d} b_i \partial_i$. [sent-146, score-0.084]
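
The identity behind this sphere-averaging step is that, for $u$ drawn uniformly from the unit sphere $S^{d-1}$, $\mathbb{E}[\langle u, b \rangle^2] = \|b\|^2 / d$; the paper's constant $\omega_d$ absorbs the normalization (that reading of the convention is my assumption). A quick Monte Carlo check of the proportionality:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
b = rng.standard_normal(d)

U = rng.standard_normal((200_000, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)   # uniform directions on the unit sphere S^{d-1}

print(d * np.mean((U @ b) ** 2), b @ b)          # the two values should nearly agree
```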

58 From Eq. (7), we can see that $\langle \nabla f(x_i), P_i(x_j - x_i) \rangle = f(x_j) - f(x_i)$. [sent-148, score-0.163]

59 Thus, we have
$$\|\nabla f(x_i) - V_{x_i}\|^2 = \frac{1}{\omega_d} \int_{S^{d-1}} \langle X, \nabla f(x_i) - V_{x_i} \rangle^2 \, d\delta(X) \approx \sum_{j \sim i} \langle e_{ij}, \nabla f(x_i) - V_{x_i} \rangle^2 \approx \sum_{j \sim i} w_{ij} \langle a_{ij}, \nabla f(x_i) - V_{x_i} \rangle^2 = \sum_{j \sim i} w_{ij} \langle P_i(x_j - x_i), \nabla f(x_i) - V_{x_i} \rangle^2 = \sum_{j \sim i} w_{ij} \left( (P_i(x_j - x_i))^T V_{x_i} - f(x_j) + f(x_i) \right)^2 \qquad (9)$$ [sent-149, score-1.373]

60 The weight $w_{ij}$ can be approximated either by the heat kernel weight $w_{ij} = \exp(-\|x_i - x_j\|^2/\delta)$ or simply by the 0-1 weight. [sent-151, score-0.596]
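
A minimal sketch of the two weighting schemes just mentioned; the adjacency matrix is assumed to come from the graph-construction sketch above, and `delta` is a bandwidth the user must choose.

```python
import numpy as np

def edge_weights(X, A, delta=None):
    """Heat-kernel weights exp(-||x_i - x_j||^2 / delta) on edges of A; 0-1 weights if delta is None."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    if delta is None:
        return A.astype(float)                    # simple 0-1 weights
    return np.where(A, np.exp(-d2 / delta), 0.0)  # heat kernel weights on graph edges

# toy usage with a small point cloud and a fully connected toy adjacency
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
A = np.ones((50, 50), dtype=bool)
np.fill_diagonal(A, False)
print(edge_weights(X, A, delta=1.0).max())
```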

61 Then $R_1$ reduces to the following: $R_1(f, \mathbf{V}) = \sum_i \sum_{j \sim i} w_{ij} \left( (x_j - x_i)^T T_i v_i - f_j + f_i \right)^2$ (10). [sent-152, score-0.572]
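
Eq. (10) can be evaluated directly by looping over the graph edges. The sketch below is self-contained with toy random inputs; in an actual pipeline X, W, and the tangent bases T would come from the earlier steps, and the function name is mine.

```python
import numpy as np

def R1_discrete(X, W, T, f, v):
    """R1(f, V) = sum_i sum_{j~i} w_ij ((x_j - x_i)^T T_i v_i - f_j + f_i)^2, as in Eq. (10)."""
    total = 0.0
    for i in range(X.shape[0]):
        for j in np.nonzero(W[i])[0]:            # neighbors j ~ i
            resid = (X[j] - X[i]) @ T[i] @ v[i] - f[j] + f[i]
            total += W[i, j] * resid ** 2
    return total

# toy inputs so the sketch runs on its own
rng = np.random.default_rng(0)
n, m, d = 30, 3, 2
X = rng.standard_normal((n, m))
W = (rng.random((n, n)) < 0.2).astype(float)
W = np.maximum(W, W.T); np.fill_diagonal(W, 0.0)
T = np.stack([np.linalg.qr(rng.standard_normal((m, d)))[0] for _ in range(n)])
f, v = rng.standard_normal(n), rng.standard_normal((n, d))
print(R1_discrete(X, W, T, f, v))
```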

62 3.3 Parallel Field Computation. As discussed before, we want the vector field to be as parallel as possible on the manifold. [sent-153, score-0.221]

63 In the discrete case, $R_2$ becomes $R_2(\mathbf{V}) = \sum_{i=1}^{n} \|\nabla V|_{x_i}\|_F^2$ (11). In the following, we discuss how to approximate $\|\nabla V|_{x_i}\|_F^2$ for a given point $x_i$. [sent-154, score-0.197]

64 We have $\mathrm{Trace}(\alpha) = \|\nabla V\|_F^2 = \sum_{i=1}^{d} g(\nabla_{\partial_i} V, \nabla_{\partial_i} V)$, where $\{\partial_i\}_{i=1}^{d}$ is an orthonormal basis of the tangent space. [sent-161, score-0.262]

65 For the trace of α, we have the following geometric interpretation (see the exercise 1. [sent-162, score-0.079]

66 So for a given point $x_i$, we can approximate $\|\nabla V|_{x_i}\|_F^2$ by the following:
$$\|\nabla V|_{x_i}\|_F^2 = \mathrm{Trace}(\alpha)|_{x_i} = \frac{1}{\omega_d} \int_{S^{d-1}} \alpha(X, X)|_{x_i} \, d\delta(X) \approx \sum_{j \sim i} \alpha(e_{ij}, e_{ij}) = \sum_{j \sim i} \|\nabla_{e_{ij}} V\|^2 \qquad (13)$$
Then we discuss how to discretize $\nabla_{e_{ij}} V$. [sent-164, score-0.878]

67 Given $e_{ij} \in T_{x_i}\mathcal{M}$, there exists a unique geodesic $\gamma(t)$ which satisfies $\gamma(0) = x_i$ and $\gamma'(0) = e_{ij}$. [sent-165, score-0.646]

68 Then the covariant derivative of the vector field $V$ along $e_{ij}$ is given by (see Fig. 1) [sent-166, score-0.552]

69 $$\nabla_{e_{ij}} V = P_i \frac{dV(\gamma(t))}{dt}\Big|_{t=0} = P_i \lim_{t \to 0} \frac{V(\gamma(t)) - V(\gamma(0))}{t} \approx P_i \frac{V_{x_j} - V_{x_i}}{d_{ij}} = \sqrt{w_{ij}}\,(P_i V_{x_j} - V_{x_i})$$ [sent-167, score-0.476]

70 Combining Eq. (13), $R_2$ becomes $R_2(\mathbf{V}) = \sum_i \sum_{j \sim i} w_{ij} \left\| P_i T_j v_j - T_i v_i \right\|^2$ (14). [sent-168, score-0.422]
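
Likewise, Eq. (14) is a straightforward double loop over edges; again a self-contained toy sketch with illustrative names rather than the paper's code.

```python
import numpy as np

def R2_discrete(W, T, P, v):
    """R2(V) = sum_i sum_{j~i} w_ij || P_i T_j v_j - T_i v_i ||^2, as in Eq. (14)."""
    total = 0.0
    for i in range(W.shape[0]):
        for j in np.nonzero(W[i])[0]:
            resid = P[i] @ (T[j] @ v[j]) - T[i] @ v[i]
            total += W[i, j] * (resid @ resid)
    return total

# toy inputs so the sketch runs on its own
rng = np.random.default_rng(0)
n, m, d = 30, 3, 2
W = (rng.random((n, n)) < 0.2).astype(float)
W = np.maximum(W, W.T); np.fill_diagonal(W, 0.0)
T = np.stack([np.linalg.qr(rng.standard_normal((m, d)))[0] for _ in range(n)])
P = np.einsum('iab,icb->iac', T, T)
v = rng.standard_normal((n, d))
print(R2_discrete(W, T, P, v))
```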

71 3.4 Objective Function in the Discrete Form. Let $I$ denote an $n \times n$ diagonal matrix where $I_{ii} = 1$ if $x_i$ is labeled and $I_{ii} = 0$ otherwise. [sent-169, score-0.239]

72 And let y ∈ Rn be a column vector whose i-th element is yi if xi is labeled and 0 otherwise. [sent-170, score-0.337]

73 Combining Eqs. (10) and (14), the final objective function in the discrete form can be written as follows:
$$E(\mathbf{f}, \mathbf{V}) = \frac{1}{l}(\mathbf{f} - \mathbf{y})^T I (\mathbf{f} - \mathbf{y}) + \lambda_1 \sum_i \sum_{j \sim i} w_{ij} \left( (x_j - x_i)^T T_i v_i - f_j + f_i \right)^2 + \lambda_2 \sum_i \sum_{j \sim i} w_{ij} \left\| P_i T_j v_j - T_i v_i \right\|^2 \qquad (16)$$ [sent-173, score-1.025]

74 Let L denote the Laplacian matrix of the graph with weights wij . [sent-175, score-0.305]

75 Then we can rewrite $R_1$ as follows: $R_1(\mathbf{f}, \mathbf{V}) = 2\mathbf{f}^T L \mathbf{f} + \sum_i \sum_{j \sim i} w_{ij} \left( (x_j - x_i)^T T_i v_i \right)^2 - 2 \sum_i \sum_{j \sim i} w_{ij} (x_j - x_i)^T T_i v_i \, s_{ij}^T \mathbf{f}$, where $s_{ij} \in \mathbb{R}^n$ is a selection vector of all zero elements except for the $i$-th element being $-1$ and the $j$-th element being $1$. [sent-176, score-1.177]

76 Then the partial derivative of $R_1$ with respect to the variable $v_i$ is $\frac{\partial R_1(\mathbf{f}, \mathbf{V})}{\partial v_i} = 2 \sum_{j \sim i} w_{ij} T_i^T (x_j - x_i)(x_j - x_i)^T T_i v_i - 2 \sum_{j \sim i} w_{ij} T_i^T (x_j - x_i) s_{ij}^T \mathbf{f}$. Thus we get $\frac{\partial R_1(\mathbf{f}, \mathbf{V})}{\partial \mathbf{V}} = 2G\mathbf{V} - 2C\mathbf{f}$ (17), where $G$ is a $dn \times dn$ block diagonal matrix, and $C = [C_1^T, \ldots, C_n^T]^T$. [sent-177, score-1.568]

77 Thus we obtain $\frac{\partial R_2}{\partial \mathbf{V}} = 2B\mathbf{V}$ (21), where $B$ is a $dn \times dn$ sparse block matrix. [sent-182, score-0.163]

78 For $i, j = 1, \ldots, n$, we have $B_{ii} = \sum_{j \sim i} w_{ij} (Q_{ij} Q_{ij}^T + I)$ (22) and $B_{ij} = -2 w_{ij} Q_{ij}$ if $x_i \sim x_j$, and $0$ otherwise (23). Notice that $\frac{\partial R_0}{\partial \mathbf{f}} = \frac{2}{l} I(\mathbf{f} - \mathbf{y})$. [sent-186, score-0.596]
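
Setting these derivatives to zero gives the sparse linear system the paper solves. As an alternative that avoids forming $G$, $C$, and $B$ explicitly (my own convenience formulation, not the authors' implementation), one can exploit that every term of Eq. (16) is a squared linear residual in $(\mathbf{f}, \mathbf{V})$ and hand the stacked residuals to a sparse least-squares solver:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def solve_pfr(X, y, labeled, W, T, P, lam1=1.0, lam2=1.0):
    """Minimize Eq. (16); returns f (n,) and the local tangent coordinates v (n, d)."""
    n, m = X.shape
    d = T.shape[2]
    l = int(labeled.sum())
    rows, cols, vals, rhs = [], [], [], []
    r = 0

    def add_row(entries, target, w):
        # appends the residual sqrt(w) * (sum_c a_c z_c - target) as one least-squares row
        nonlocal r
        for c, a in entries:
            rows.append(r); cols.append(c); vals.append(np.sqrt(w) * a)
        rhs.append(np.sqrt(w) * target)
        r += 1

    fcol = lambda i: i                      # column of f_i in the unknown z = [f; v]
    vcol = lambda i, a: n + i * d + a       # column of the a-th coordinate of v_i

    for i in np.nonzero(labeled)[0]:        # loss term (1/l) (f - y)^T I (f - y)
        add_row([(fcol(i), 1.0)], y[i], 1.0 / l)

    for i in range(n):
        for j in np.nonzero(W[i])[0]:
            w = W[i, j]
            g = (X[j] - X[i]) @ T[i]        # (x_j - x_i)^T T_i, a length-d row vector
            # R1 residual: (x_j - x_i)^T T_i v_i - f_j + f_i
            add_row([(fcol(i), 1.0), (fcol(j), -1.0)] +
                    [(vcol(i, a), g[a]) for a in range(d)], 0.0, lam1 * w)
            # R2 residual: P_i T_j v_j - T_i v_i  (one row per ambient coordinate)
            M = P[i] @ T[j]                 # m x d
            for b in range(m):
                add_row([(vcol(j, a), M[b, a]) for a in range(d)] +
                        [(vcol(i, a), -T[i, b, a]) for a in range(d)], 0.0, lam2 * w)

    A = sp.csr_matrix((vals, (rows, cols)), shape=(r, n + n * d))
    z = lsqr(A, np.array(rhs))[0]           # sparse least-squares solve
    return z[:n], z[n:].reshape(n, d)

# toy usage; in practice X, W, T, P come from the earlier sketches
rng = np.random.default_rng(0)
n, m, d = 40, 3, 2
X = rng.standard_normal((n, m))
W = (rng.random((n, n)) < 0.2).astype(float)
W = np.maximum(W, W.T); np.fill_diagonal(W, 0.0)
T = np.stack([np.linalg.qr(rng.standard_normal((m, d)))[0] for _ in range(n)])
P = np.einsum('iab,icb->iac', T, T)
labeled = np.zeros(n, dtype=bool); labeled[:5] = True
y = rng.standard_normal(n)
f, v = solve_pfr(X, y, labeled, W, T, P, lam1=0.1, lam2=0.1)
print(f.shape, v.shape)
```

Assembling $G$, $C$, and $B$ as in Eqs. (17)-(23) and solving the resulting normal equations would give the same minimizer; the least-squares stacking above is just the shortest way to sketch a correct solver.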

79 Regression on the satellite measurement of temperatures in the middle troposphere. [sent-194, score-0.079]

80 4 Related Work and Discussion. The approximation of the Laplacian operator using the graph Laplacian [5] has enjoyed great success in the last decade. [sent-200, score-0.088]

81 Previous approaches [2, 15] first estimate normal coordinates in the tangent space, and then estimate the first order derivative at each point, which is a matrix pseudo-inversion problem. [sent-205, score-0.366]

82 One major limitation of this is that when the number of nearest neighbors $k$ is larger than $d + \frac{d(d+1)}{2}$, where $d$ is the dimension of the manifold, the estimation will be inaccurate and unstable [15]. [sent-206, score-0.106]

83 Also, we directly estimate the norm of the second order derivative instead of trying to estimate its coefficients, which turns out to be an integral problem over the neighboring points. [sent-209, score-0.185]

84 5 Experiments. In this section, we compare our proposed Parallel Field Regularization (PFR) algorithm with two state-of-the-art semi-supervised regression methods: Laplacian regularized transduction (Laplacian) [8] and Hessian regularized transduction (Hessian) [15]. [sent-212, score-0.132]

85 5.1 Global Temperature. In this test, we perform regression on the Earth's surface, which is a 2D sphere manifold. [sent-216, score-0.083]

86 The coordinates (latitude, longitude) of the measurements are used as features and the corresponding temperature values are the responses. [sent-219, score-0.121]

87 The dimension of manifold is set to 2 and the number of nearest neighbors is set to 6 in graph construction. [sent-220, score-0.34]

88 We randomly select 1% of the samples as labeled data, and compare the predicted temperature values with the ground truth on the rest of the data. [sent-221, score-0.238]
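
The evaluation protocol just described (randomly label 1% of the samples, predict the rest transductively, compare against ground truth) is easy to make concrete. The sketch below is generic scaffolding with a trivial stand-in predictor, scored with mean absolute error as one reasonable choice; the text does not pin down the error measure for this experiment, and none of the names below come from the paper.

```python
import numpy as np

def evaluate_mae(X, targets, predict, frac_labeled=0.01, seed=0):
    """Randomly label a fraction of points, predict the rest transductively, return MAE."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labeled = np.zeros(n, dtype=bool)
    labeled[rng.choice(n, size=max(1, int(frac_labeled * n)), replace=False)] = True
    pred = predict(X, targets, labeled)                      # predictions for every point
    return np.mean(np.abs(pred[~labeled] - targets[~labeled]))

# trivial stand-in predictor: the mean of the labeled targets everywhere
baseline = lambda X, t, lab: np.full(len(t), t[lab].mean())
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))
t = X[:, 0] + 0.1 * rng.standard_normal(500)
print(evaluate_mae(X, t, baseline))
```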

89 [Figure 3: Results on the moving hand dataset; panels show frames 016 and 300 for PFR, Hessian, and Laplacian, together with a plot of MAE versus the number of labels for the three methods.] [sent-233, score-0.117]

90 In each image, the red dots indicate the ground truth positions we labeled manually, and the blue arrows show the positions predicted by different algorithms. [sent-237, score-0.357]

91 Our goal is to predict the positions of the (left and right) elbows and wrists. [sent-242, score-0.176]

92 We extract the first 500 frames of the video and manually label the positions of the elbows and wrists. [sent-243, score-0.241]

93 The response for each frame is an 8-dimensional vector whose elements are the 2D coordinates of the elbows and wrists. [sent-245, score-0.238]

94 Since there are 8 free parameters, we set the dimension of the manifold to 8. [sent-246, score-0.202]

95 For each given number of labeled frames, we perform 10 tests with randomly selected labeled sets. [sent-249, score-0.152]
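
The response here is 8-dimensional while the regression machinery above is scalar-valued; one plausible way to handle this (an assumption on my part, the text does not spell out how the multi-dimensional output is treated) is to solve one scalar regression per output coordinate and stack the predictions.

```python
import numpy as np

def multi_output_regress(X, Y, labeled, scalar_regress):
    """Y: (n, q) responses; scalar_regress(X, y, labeled) -> (n,) predictions per coordinate."""
    return np.column_stack([scalar_regress(X, Y[:, k], labeled) for k in range(Y.shape[1])])

# toy usage with a mean-of-labels baseline standing in for the actual regressor
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((100, 5)), rng.standard_normal((100, 8))
labeled = np.zeros(100, dtype=bool); labeled[:10] = True
baseline = lambda X, y, lab: np.full(len(y), y[lab].mean())
print(multi_output_regress(X, Y, labeled, baseline).shape)   # (100, 8)
```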

96 The red dots in the figures indicate the ground truth positions and the blue arrows are drawn by connecting the positions of elbows and wrists predicted by different algorithms. [sent-258, score-0.382]

97 The parallelism of the vector field is used to measure the linearity of the target prediction function. [sent-262, score-0.096]

98 For example, the Poincaré-Hopf theorem tells us that the sum of the indices over all the isolated zeroes of a vector field equals the Euler characteristic of the manifold, which is a very important topological invariant. [sent-266, score-0.085]

99 Laplacian eigenmaps and spectral techniques for embedding and clustering. [sent-293, score-0.093]

100 Semi-supervised regression using hessian energy with an application to semi-supervised dimensionality reduction. [sent-350, score-0.176]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('vxi', 0.303), ('wij', 0.246), ('laplacian', 0.232), ('pfr', 0.228), ('tangent', 0.209), ('eld', 0.206), ('manifold', 0.202), ('eij', 0.2), ('tx', 0.191), ('txi', 0.178), ('covariant', 0.177), ('tit', 0.177), ('parallel', 0.166), ('xi', 0.163), ('xj', 0.145), ('pi', 0.137), ('hessian', 0.132), ('vi', 0.131), ('expx', 0.126), ('derivative', 0.12), ('ti', 0.116), ('aij', 0.109), ('field', 0.107), ('elbows', 0.101), ('temperature', 0.084), ('geodesic', 0.083), ('discretize', 0.081), ('smoothness', 0.08), ('expxi', 0.076), ('labeled', 0.076), ('tj', 0.076), ('positions', 0.075), ('qij', 0.073), ('china', 0.069), ('elds', 0.065), ('eigenmaps', 0.064), ('regularization', 0.061), ('dn', 0.06), ('graph', 0.059), ('vector', 0.055), ('orthonormal', 0.053), ('gradient', 0.053), ('vxj', 0.051), ('nearest', 0.05), ('dv', 0.047), ('rm', 0.046), ('frame', 0.045), ('vj', 0.045), ('gii', 0.044), ('exercise', 0.044), ('transduction', 0.044), ('temperatures', 0.044), ('regression', 0.044), ('yi', 0.043), ('px', 0.043), ('belkin', 0.043), ('block', 0.043), ('ij', 0.042), ('linearity', 0.041), ('mae', 0.041), ('captions', 0.041), ('yx', 0.041), ('truth', 0.04), ('map', 0.04), ('sphere', 0.039), ('isomap', 0.038), ('partha', 0.038), ('geometrical', 0.038), ('submanifold', 0.038), ('ground', 0.038), ('integral', 0.037), ('coordinates', 0.037), ('mikhail', 0.036), ('bij', 0.036), ('frames', 0.036), ('trace', 0.035), ('satellite', 0.035), ('discuss', 0.034), ('locally', 0.033), ('ams', 0.033), ('fi', 0.032), ('regularizer', 0.032), ('objective', 0.031), ('topological', 0.03), ('dij', 0.03), ('penalize', 0.03), ('unit', 0.03), ('embedding', 0.029), ('exponential', 0.029), ('operator', 0.029), ('neighbors', 0.029), ('bi', 0.029), ('differential', 0.029), ('video', 0.029), ('riemannian', 0.029), ('norm', 0.028), ('arrows', 0.027), ('unstable', 0.027), ('moving', 0.027), ('dots', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

Author: Binbin Lin, Chiyuan Zhang, Xiaofei He

Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Many of the existing work use the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semisupervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach. 1

2 0.16986312 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

Author: Dominique C. Perrault-joncas, Marina Meila

Abstract: This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model the observed graph as a sample from a manifold endowed with a vector field, and we design an algorithm that separates and recovers the features of this process: the geometry of the manifold, the data density and the vector field. The algorithm is motivated by our analysis of Laplacian-type operators and their continuous limit as generators of diffusions on a manifold. We illustrate the recovery algorithm on both artificially constructed and real data. 1 Motivation Recent advances in graph embedding and visualization have focused on undirected graphs, for which the graph Laplacian properties make the analysis particularly elegant [1, 2]. However, there is an important number of graph data, such as social networks, alignment scores between biological sequences, and citation data, which are naturally asymmetric. A commonly used approach for this type of data is to disregard the asymmetry by studying the spectral properties of W +W T or W T W , where W is the affinity matrix of the graph. Some approaches have been offered to preserve the asymmetry information contained in data: [3], [4], [5] or to define directed Laplacian operators [6]. Although quite successful, these works adopt a purely graph-theoretical point of view. Thus, they are not concerned with the generative process that produces the graph, nor with the interpretability and statistical properties of their algorithms. In contrast, we view the nodes of a directed graph as a finite sample from a manifold in Euclidean space, and the edges as macroscopic observations of a diffusion kernel between neighboring points on the manifold. We explore how this diffusion kernel determines the overall connectivity and asymmetry of the resulting graph and demonstrate how Laplacian-type operators of this graph can offer insights into the underlying generative process. Based on the analysis of the Laplacian-type operators, we derive an algorithm that, in the limit of infinite sample and vanishing bandwidth, recovers the key features of the sampling process: manifold geometry, sampling distribution, and local directionality, up to their intrinsic indeterminacies. 2 Model The first premise here is that we observe a directed graph G, with n nodes, having weights W = [Wij ] for the edge from node i to node j. In following with common Laplacian-based embedding approaches, we assume that G is a geometric random graph constructed from n points sampled according to distribution p = e−U on an unobserved compact smooth manifold M ⊆ Rl of known intrinsic dimension d ≤ l. The edge weight Wij is then determined by a directed similarity kernel k (xi , xj ) with bandwidth . The directional component of k (xi , xj ) will be taken to be derived 1 from a vector field r on M, which assigns a preferred direction between weights Wij and Wji . The choice of a vector field r to characterize the directional component of G might seem restrictive at first. In the asymptotic limit of → 0 and n → ∞ however, kernels are characterized by their diffusion, drift, and source components [7]. As such, r is sufficient to characterize any directionality associated with a drift component and as it turns out, the component of r normal M in Rl can also be use to characterize any source component. As for the diffusion component, it is not possible to uniquely identify it from G alone [8]. 
Some absolute knownledge of M is needed to say anything about it. Hence, without loss of generality, we will construct k (xi , xj ) so that the diffusion component ends being isotropic and constant, i.e. equal to Laplace-Beltrami operator ∆ on M. The schematic of this generative process is shown in the top left of Figure 1 below. From left to right: the graph generative process mapping the sample on M to geometric random graph G via the kernel k (x, y), then the subsequent embedding (α) Ψn of G by operators Haa,n , (α) Hss,n (defined in section 3.1). As these operators converge to (α) their respective limits, Haa and (α) Hss , so will Ψn → Ψ, pn → p, and rn → r. We design an algorithm that, given G, produces the top right embedding (Ψn , pn , and rn ). Figure 1: Schematic of our framework. The question is then as follows: can the generative process’ geometry M, distribution p = e−U , and directionality r, be recovered from G? In other words, is there an embedding of G in Rm , m ≥ d that approximates all three components of the process and that is also consistent as sample size increases and the bandwidth vanishes? In the case of undirected graphs, the theory of Laplacian eigenmaps [1] and Diffusion maps [9] answers this question in the affirmative, in that the geometry of M and p = e−U can be inferred using spectral graph theory. The aim here is to build on the undirected problem and recover all three components of the generative process from a directed graph G. The spectral approach to undirected graph embedding relies on the fact that eigenfunctions of the Laplace-Beltrami operator are known to preserve the local geometry of M [1]. With a consistent empirical Laplace-Beltrami operator based on G, its eigenvectors also recover the geometry of M and converge to the corresponding eigenfunctions on M. For a directed graph G, an additional operator is needed to recover the local directional component r, but the principle remains the same. (α) The schematic for this is shown in Figure 1 where two operators - Hss,n , introduced in [9] for (α) undirected embeddings, and Haa,n , a new operator defined in section 3.1 - are used to obtain the (α) (α) (α) embedding Ψn , distribution pn , and vector field rn . As Haa,n and Hss,n converge to Haa and (α) Hss , Ψn , pn , and rn also converge to Ψ, p, and r, where Ψ is the local geometry preserving the embedding of M into Rm . (α) The algorithm we propose in Section 4 will calculate the matrices corresponding to H·,n from the graph G, and with their eigenvectors, will find estimates for the node coordinates Ψ, the directional component r, and the sampling distribution p. In the next section we briefly describe the mathematical models of the diffusion processes that our model relies on. 2 2.1 Problem Setting The similarity kernel k (x, y) can be used to define transport operators on M. The natural transport operator is defined by normalizing k (x, y) as T [f ](x) = M k (x, y) f (y)p(y)dy , where p (x) = p (x) k (x, y)p(y)dy . (1) M T [f ](x) represents the diffusion of a distribution f (y) by the transition density k (x, y)p(y)/ k (x, y )p(y )dy . The eigenfunctions of this infinitesimal operator are the continuous limit of the eigenvectors of the transition probability matrix P = D−1 W given by normalizing the affinity matrix W of G by D = diag(W 1) [10]. Meanwhile, the infinitesimal transition ∂f (T − I)f = lim (2) →0 ∂t defines the backward equation for this diffusion process over M based on kernel k . 
Obtaining the explicit expression for transport operators like (2) is then the main technical challenge. 2.2 Choice of Kernel In order for T [f ] to have the correct asymptotic form, some hypotheses about the similarity kernel k (x, y) are required. The hypotheses are best presented by considering the decomposition of k (x, y) into symmetric h (x, y) = h (y, x) and anti-symmetric a (x, y) = −a (y, x) components: k (x, y) = h (x, y) + a (x, y) . (3) The symmetric component h (x, y) is assumed to satisfy the following properties: 1. h (||y − 2 x||2 ) = h(||y−x|| / ) , and 2. h ≥ 0 and h is exponentially decreasing as ||y − x|| → ∞. This form d/2 of symmetric kernel was used in [9] to analyze the diffusion map. For the asymmetric part of the similarity kernel, we assume the form a (x, y) = r(x, y) h(||y − x||2 / ) · (y − x) , d/2 2 (4) with r(x, y) = r(y, x) so that a (x, y) = −a (y, x). Here r(x, y) is a smooth vector field on the manifold that gives an orientation to the asymmetry of the kernel k (x, y). It is worth noting that the dependence of r(x, y) on both x and y implies that r : M × M → Rl with Rl the ambient space of M; however in the asymptotic limit, the dependence in y is only important “locally” (x = y), and as such it is appropriate to think of r(x, x) being a vector field on M. As a side note, it is worth pointing out that even though the form of a (x, y) might seem restrictive at first, it is sufficiently rich to describe any vector field . This can be seen by taking r(x, y) = (w(x) + w(y))/2 so that at x = y the resulting vector field is given by r(x, x) = w(x) for an arbitrary vector field w(x). 3 Continuous Limit of Laplacian Type Operators We are now ready to state the main asymptotic result. Proposition 3.1 Let M be a compact, closed, smooth manifold of dimension d and k (x, y) an asymmetric similarity kernel satisfying the conditions of section 2.2, then for any function f ∈ C 2 (M), the integral operator based on k has the asymptotic expansion k (x, y)f (y)dy = m0 f (x) + g(f (x), x) + o( ) , (5) M where g(f (x), x) = and m0 = Rd m2 (ω(x)f (x) + ∆f (x) + r · 2 h(||u||2 )du, m2 = Rd u2 h(||u||2 )du. i 3 f (x) + f (x) · r + c(x)f (x)) (6) The proof can be found in [8] along with the definition of ω(x) and c(x) in (6). For now, it suffices to say that ω(x) corresponds to an interaction between the symmetric kernel h and the curvature of M and was first derived in [9]. Meanwhile, c(x) is a new term that originates from the interaction between h and the component of r that is normal to M in the ambient space Rl . Proposition 3.1 foreshadows a general fact about spectral embedding algorithms: in most cases, Laplacian operators confound the effects of spatial proximity, sampling density and directional flow due to the presence of the various terms above. 3.1 Anisotropic Limit Operators Proposition 3.1 above can be used to derive the limits of a variety of Laplacian type operators associated with spectral embedding algorithms like [5, 6, 3]. Although we will focus primarily on a few operators that give the most insight into the generative process and enable us to recover the model defined in Figure 1, we first present four distinct families of operators for completeness. These operator families are inspired by the anisotropic family of operators that [9] introduced for undirected graphs, which make use of anisotropic kernels of the form: k (α) (x, y) = k (x, y) pα (x)pα (y) , (7) with α ∈ [0, 1] where α = 0 is the isotropic limit. 
To normalize the anisotropic kernels, we need (α) (α) (α) as p (x) = M k (x, y)p(y)dy. From (7), four to redefine the outdegrees distribution of k families of diffusion processes of the form ft = H (α) [f ](x) can be derived depending on which kernel is normalized and which outdegree distribution is used for the normalization. Specifically, (α) (α) we define transport operators by normalizing the asymmetric k or symmetric h kernels with the 1 asymmetric p or symmetric q = M h (x, y)p(y)dy outdegree distribution . To keep track of all options, we introduce the following notation: the operators will be indexed by the type of kernel and outdegree distribution they correspond to (symmetric or asymmetric), with the first index identifying the kernel and the second index identifying the outdegree distribution. For example, the family of anisotropic limit operators introduced by [9] is defined by normalizing the symmetric kernel by (α) the symmetric outdegree distribution, hence they will be denoted as Hss , with the superscript corresponding to the anisotropic power α. Proposition 3.2 With the above notation, (α) Haa [f ] = ∆f − 2 (1 − α) U · f + r· f (α) Has [f ] (α) Hsa [f ] (α) Hss [f ] = ∆f − 2 (1 − α) U · = ∆f − 2 (1 − α) U · = ∆f − 2(1 − α) U · (8) f − cf + (α − 1)(r · U )f − ( · r + (α − 1)r · f + (c + f. U )f · r)f + r · f (9) (10) (11) The proof of this proposition, which can be found in [8], follows from repeated application of Proposition 3.1 to p(y) or q(y) and then to k α (x, y) or hα (x, y), as well as the fact that p1 = α 1 p−α [1 − α (ω + ∆p p + 2r · p p +2 · r + c)] + o( ). (α) Thus, if we use the asymmetric k and p , we get Haa , defined by the advected diffusion equa(α) tion (8). In general, Haa is not hermitian, so it commonly has complex eigenvectors. This makes (1) embedding directed graphs with this operator problematic. Nevertheless, Haa will play an important role in extracting the directionality of the sampling process. If we use the symmetric kernel h but the asymmetric outdegree distribution p , we get the family (α) of operators Hsa , of which the WCut of [3] is a special case (α = 0). If we reverse the above, i.e. (α) (α) (α) use k and q , we obtain Has . This turns out to be merely a combination of Haa and Hsa . 1 The reader may notice that there are in fact eight possible combinations of kernel and degree distribution, since the anisotripic kernel (7) could also be defined using a symmetric or asymmetric outdegree distribution. However, there are only four distinct asymptotic results and they are all covered by using one kernel (symmetric or asymmetric) and one degree distribution (symmetric or asymmetric) throughout. 4 Algorithm 1 Directed Embedding Input: Affinity matrix Wi,j and embedding dimension m, (m ≥ d) 1. S ← (W + W T )/2 (Steps 1–6 estimate the coordinates as in [11]) n 2. qi ← j=1 Si,j , Q = diag(q) 3. V ← Q−1 SQ−1 (1) n 4. qi ← j=1 Vi,j , Q(1) = diag(q (1) ) (1) −1 5. Hss,n ← Q(1) V 6. Compute the Ψ the n × (m + 1) matrix with orthonormal columns containing the m + 1 largest (1) right eigenvector (by eigenvalue) of Hss,n as well as the Λ the (m + 1) × (m + 1) diagonal matrix of eigenvalues. Eigenvectors 2 to m + 1 from Ψ are the m coordinates of the embedding. (1) 7. Compute π the left eigenvector of Hss,n with eigenvalue 1. (Steps 7–8 estimate the density) n 8. π ← π/ i=1 πi is the density distribution over the embedding. n 9. pi ← j=1 Wi,j , P = diag(p) (Steps 9–13 estimate the vector field r) 10. T ← P −1 W P −1 (1) n 11. 
pi ← j=1 Ti,j , P (1) = diag(p(1) ) (1) −1 12. Haa,n ← P (1) T (1) (1) 13. R ← (Haa,n − Hss,n )Ψ/2. Columns 2 to m + 1 of R are the vector field components in the direction of the corresponding coordinates of the embedding. (α) Finally, if we only consider the symmetric kernel h and degree distribution q , we recover Hss , the anisotropic kernels of [9] for symmetric graphs. This operator for α = 1 is shown to separate the manifold from the probability distribution [11] and will be used as part of our recovery algorithm. Isolating the Vector Field r 4 Our aim is to esimate the manifold M, the density distribution p = e−U , and the vector field r. The (1) first two components of the data can be recovered from Hss as shown in [11] and summarized in Algorithm 1. At this juncture, one feature of generative process is missing: the vector field r. The natural approach (α) (α) for recovering r is to isolate the linear operator r · from Haa by substracting Hss : (α) (α) Haa − Hss = r · . (12) The advantage of recovering r in operator form as in (12) is that r · is coordinate free. In other words, as long as the chosen embedding of M is diffeomorphic to M2 , (12) can be used to express the component of r that lies in the tangent space T M, which we denote by r|| . Specifically, let Ψ be a diffeomorphic embedding of M ; the component of r along coordinate ψk is then given by r · ψk = rk , and so, in general, r|| = r · Ψ. (13) The subtle point that only r|| is recovered from (13) follows from the fact that the operator r · only defined along M and hence any directional derivative is necessarily along T M. is Equation (13) and the previous observations are the basis for Algorithm 1, which recovers the three important features of the generative process for an asymmetric graph with affinity matrix W . A similar approach can be employed to recover c + · r, or simply · r if r has no component perpendicular to the tangent space T M (meaning that c ≡ 0). Recovering c + · r is achieved by taking advantage of the fact that (1) (1) (Hsa − Hss ) = (c + 2 · r) , (14) (1) A diffeomorphic embedding is guaranteed by using the eigendecomposition of Hss . 5 (1) (1) which is a diagonal operator. Taking into account that for finite n (Hsa,n − Hss,n ) is not perfectly (1) (1) diagonal, using ψn ≡ 1n (vector of ones), i.e. (Hsa,n − Hss,n )[1n ] = (cn + · rn ), has been found (1) (1) empirically to be more stable than simply extracting the diagonal of (Hsa,n − Hss,n ). 5 Experiments Artificial Data For illustrative purposes, we begin by applying our method to an artificial example. We use the planet Earth as a manifold with a topographic density distribution, where sampling probability is proportional to elevation. We also consider two vector fields: the first is parallel to the line of constant latitude and purely tangential to the sphere, while the second is parallel to the line of constant longitude with a component of the vector field perpendicular to the manifold. The true model with constant latitude vector field is shown in Figure 2, along with the estimated density and vector field projected on the true manifold (sphere). Model Recovered Latitudinal (a) Longitudinal (b) Figure 2: (a): Sphere with latitudinal vector field, i.e East-West asymmetry, with Wew > Wwe if node w lies to the West of node e. The graph nodes are sampled non-uniformly, with the topographic map of the world as sampling density. We sample n = 5000 nodes, and observe only the resulting W matrix, but not the node locations. 
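The steps of Algorithm 1 (Directed Embedding) listed above map almost one-to-one onto dense linear algebra. The sketch below is a simplified reading of those steps, not the authors' code: eigenvectors are sorted by the real part of their eigenvalues, real parts are taken where the theory says the quantities should be real, and the orthonormality mentioned in step 6 is not enforced.

```python
import numpy as np
from scipy.linalg import eig

def directed_embedding(W, m):
    # Steps 1-6: embedding coordinates from the symmetrized, anisotropically normalized kernel
    S = (W + W.T) / 2.0
    q = S.sum(axis=1)
    V = S / np.outer(q, q)                 # V = Q^{-1} S Q^{-1}
    q1 = V.sum(axis=1)
    Hss = V / q1[:, None]                  # Hss^{(1)} = diag(q^{(1)})^{-1} V
    lam, left, right = eig(Hss, left=True)
    order = np.argsort(-lam.real)
    Psi = right[:, order[:m + 1]].real     # top m+1 right eigenvectors
    coords = Psi[:, 1:]                    # columns 2..m+1 are the embedding coordinates

    # Steps 7-8: density estimate from the leading left eigenvector
    pi = left[:, order[0]].real
    pi = pi / pi.sum()

    # Steps 9-13: vector field from the asymmetric normalization
    p = W.sum(axis=1)
    T = W / np.outer(p, p)                 # T = P^{-1} W P^{-1}
    p1 = T.sum(axis=1)
    Haa = T / p1[:, None]                  # Haa^{(1)} = diag(p^{(1)})^{-1} T
    R = (Haa - Hss) @ Psi / 2.0
    r = R[:, 1:]                           # vector-field components per embedding coordinate
    return coords, pi, r

# toy asymmetric affinity matrix just to exercise the function
rng = np.random.default_rng(0)
W = rng.random((60, 60)); np.fill_diagonal(W, 0.0)
coords, pi, r = directed_embedding(W, m=2)
print(coords.shape, pi.shape, r.shape)
```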
From W , our algorithm estimates the sample locations (geometry), the vector field (black arrows) generating the observed asymmetries, and the sampling distribution at each data point (colormap). (b) Vector fields on a spherical region (blue), and their estimates (red): latitudinal vector field tangent to the manifold (left) and longitudinal vector field with component perpendicular to manifold tangent plane (right). Both the estimated density and vector field agree with the true model, demonstrating that for artificial data, the recovery algorithm 1 performs quite well. We note that the estimated density does not recover all the details of the original density, even for large sample size (here n = 5000 with = 0.07). Meanwhile, the estimated vector field performs quite well even when the sampling is reduced to n = 500 with = 0.1. This can be seen in Figure 2, b, where the true and estimated vector fields are superimposed. Figure 2 also demonstrates how r · only recovers the tangential component of r. The estimated geometry is not shown on any of these figures, since the success of the diffusion map in recovering the geometry for such a simple manifold is already well established [2, 9]. Real DataThe National Longitudinal Survey of Youth (NLSY) 1979 Cohort is a representative sample of young men and women in the United States who were followed from 1979 to 2000 [12, 13]. The aim here is to use this survey to obtain a representation of the job market as a diffusion process over a manifold. The data set consists of a sample of 7,816 individual career sequences of length 64, listing the jobs a particular individual held every quarter between the ages of 20 and 36. Each token in the sequence identifies a job. Each job corresponds to an industry × occupation pair. There are 25 unique industry and 20 unique occupation indices. Out of the 500 possible pairings, approximately 450 occur in the data, with only 213 occurring with sufficient frequency to be included here. Thus, our graph G has 213 nodes - the jobs - and our observations consist of 7,816 walks between the graph nodes. We convert these walks to a directed graph with affinity matrix W . Specifically, Wij represents the number of times a transition from job i to job j was observed (Note that this matrix is asymmetric, 6 i.e Wij = Wji ). Normalizing each row i of W by its outdegree di gives P = diag(di )−1 W , the non-parametric maximum likelihood estimator for the Markov chain over G for the progression (0) of career sequences. This Markov chain has as limit operator Haa , as the granularity of the job market increases along with the number of observations. Thus, in trying to recover the geometry, distribution and vector field, we are actually interested in estimating the full advective effect of the (0) diffusion process generated by Haa ; that is, we want to estimate r · − 2 U · where we can use (0) (1) −2 U · = Hss − Hss to complement Algorithm 1. 0.25 1600 0.05 0.1 1400 0.1 3 0.7 1800 0.15 ! 0.8 0.15 0.2 0.9 0.2 2000 0.25 !30.05 0.6 0.5 0 0 0.4 1200 −0.05 −0.05 −0.1 0.3 −0.1 1000 0.2 −0.15 −0.15 800 −0.2 −0.25 0.1 −0.2 −0.25 −0.1 −0.05 0 !2 0.05 0.1 0.15 0.2 (a) −0.1 −0.05 0 !2 0.05 0.1 0.15 0.2 (b) Figure 3: Embedding the job market along with field r − 2 U over the first two non-constant eigenvectors. The color map corresponds to the mean monthly wage in dollars (a) and to the female proportion (b) for each job. 
We obtain an embedding of the job market that describes the relative position of jobs, their distribution, and the natural time progression from each job. Of these, the relative position and natural time progression are the most interesting. Together, they summarize the job market dynamics by describing which jobs are naturally “close” as well as where they can lead in the future. From a public policy perspective, this can potentially improve focus on certain jobs for helping individuals attain better upward mobility. The job market was found to be a high dimensional manifold. We present only the first two dimen(0) sions, that is, the second and third eigenvectors of Hss , since the first eigenvector is uninformative (constant) by construction. The eigenvectors showed correlation with important demographic data, such as wages and gender. Figure 3 displays this two-dimensional sub-embedding along with the directional information r − 2 U for each dimension. The plot shows very little net progression toward regions of increasing mean salary3 . This is somewhat surprising, but it is easy to overstate this observation: diffusion alone would be enough to move the individuals towards higher salary. What Figure 3 (a) suggests is that there appear to be no “external forces” advecting individuals towards higher salary. Nevertheless, there appear to be other external forces at play in the job market: Figure 3 (b), which is analogous to Figure 3 (a), but with gender replacing the salary color scheme, suggests that these forces push individuals towards greater gender differentiation. This is especially true amongst male-dominated jobs which appear to be advected toward the left edge of the embedding. Hence, this simple analysis of the job market can be seen as an indication that males and females tend to move away from each other over time, while neither seems to have a monopoly on high- or low- paying jobs. 6 Discussion This paper makes three contributions: (1) it introduces a manifold-based generative model for directed graphs with weighted edges, (2) it obtains asymptotic results for operators constructed from the directed graphs, and (3) these asymptotic results lead to a natural algorithm for estimating the model. 3 It is worth noting that in the NLSY data set, high paying jobs are teacher, nurse and mechanic. This is due to the fact that the career paths observed stop at at age 36, which is relatively early in an individual’s career. 7 Generative Models that assume that data are sampled from a manifold are standard for undirected graphs, but to our knowledge, none have yet been proposed for directed graphs. When W is symmetric, it is natural to assume that it depends on the points’ proximity. For asymmetric affinities W , one must include an additional component to explain the asymmetry. In the asymptotic limit, this is tantamount to defining a vector field on the manifold. Algorithm We have used from [9] the idea of defining anisotropic kernels (indexed by α) in order to separate the density p and the manifold geometry M. Also, we adopted their general assumptions about the symmetric part of the kernel. As a consequence, the recovery algorithm for p and M is identical to theirs. However, insofar as the asymmetric part of the kernel is concerned, everything, starting from the definition and the introduction of the vector field r as a way to model the asymmetry, through the derivation of the asymptotic expression for the symmetric plus asymmetric kernel, is new. 
We go significantly beyond the elegant idea of [9] regarding the use of anisotropic kernels by analyzing the four distinct renormalizations possible for a given α, each of them combining different aspects of M, p and r. Only the successful (and novel) combination of two different anisotropic operators is able to recover the directional flow r. Algorithm 1 is natural, but we do not claim it is the only possible one in the context of our model. (α) For instance, we can also use Hsa to recover the operator · r (which empirically seems to have worse numerical properties than r · ). In the National Longitudinal Survery of Youth study, we were interested in the whole advective term, so we estimated it from a different combination of operators. Depending on the specific question, other features of the model could be obtained Limit Results Proposition 3.1 is a general result on the asymptotics of asymmetric kernels. Recovering the manifold and r is just one, albeit the most useful, of the many ways of exploiting these (0) results. For instance, Hsa is the limit operator of the operators used in [3] and [5]. The limit analysis could be extended to other digraph embedding algorithms such as [4, 6]. How general is our model? Any kernel can be decomposed into a symmetric and an asymmetric part, as we have done. The assumptions on the symmetric part h are standard. The paper of [7] goes one step further from these assumptions; we will discuss it in relationship with our work shortly. The more interesting question is how limiting are our assumptions regarding the choice of kernel, especially the asymmetric part, which we parameterized as a (x, y) = r/2 · (y − x)h (x, y) in (4). In the asymptotic limit, this choice turns out to be fully general, at least up to the identifiable aspects of the model. For a more detailed discussion of this issue, see [8]. In [7], Ting, Huang and Jordan presented asymptotic results for a general family of kernels that includes asymmetric and random kernels. Our k can be expressed in the notation of [7] by taking wx (y) ← 1+r(x, y)·(y−x), rx (y) ← 1, K0 ← h, h ← . Their assumptions are more general than the assumptions we make here, yet our model is general up to what can be identified from G alone. The distinction arises because [7] focuses on the graph construction methods from an observed sample of M, while we focus on explaining an observed directed graph G through a manifold generative process. Moreover, while the [7] results can be used to analyze data from directed graphs, they differ from our Proposition 3.1. Specifically, with respect to the limit in Theorem 3 from [7], we obtain the additional source terms f (x) · r and c(x)f (x) that follow from not enforcing (α) (α) conservation of mass while defining operators Hsa and Has . We applied our theory of directed graph embedding to the analysis of the career sequences in Section 5, but asymmetric affinity data abound in other social contexts, and in the physical and life sciences. Indeed, any “similarity” score that is obtained from a likelihood of the form Wvu =likelihood(u|v) is generally asymmetric. Hence our methods can be applied to study not only social networks, but also patterns of human movement, road traffic, and trade relations, as well as alignment scores in molecular biology. Finally, the physical interpretation of our model also makes it naturally applicable to physical models of flows. Acknowledgments This research was partially supported by NSW awards IIS-0313339 and IIS-0535100. 
8 References [1] Belkin and Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2002. [2] Nadler, Lafon, and Coifman. Diffusion maps, spectral clustering and eigenfunctions of fokker-planck operators. In Neural Information Processing Systems Conference, 2006. [3] Meila and Pentney. Clustering by weighted cuts in directed graphs. In SIAM Data Mining Conference, 2007. [4] Zhou, Huang, and Scholkopf. Learning from labeled and unlabeled data on a directed graph. In International Conference on Machine Learning, pages 1041–1048, 2005. [5] Zhou, Schlkopf, and Hofmann. Semi-supervised learning on directed graphs. In Advances in Neural Information Processing Systems, volume 17, pages 1633–1640, 2005. [6] Fan R. K. Chung. The diameter and laplacian eigenvalues of directed graphs. Electr. J. Comb., 13, 2006. [7] Ting, Huang, and Jordan. An analysis of the convergence of graph Laplacians. In International Conference on Machine Learning, 2010. [8] Dominique Perrault-Joncas and Marina Meil˘ . Directed graph embedding: an algorithm based on contina uous limits of laplacian-type operators. Technical Report TR 587, University of Washington - Department of Statistics, November 2011. [9] Coifman and Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21:6–30, 2006. [10] Mikhail Belkin and Partha Niyogi. Convergence of laplacian eigenmaps. preprint, short version NIPS 2008, 2008. [11] Coifman, Lafon, Lee, Maggioni, Warner, and Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. In Proceedings of the National Academy of Sciences, pages 7426–7431, 2005. [12] United States Department of Labor. National longitudinal survey of youth 1979 cohort. http://www.bls.gov/nls/, retrived October 2011. [13] Marc A. Scott. Affinity models for career sequences. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):417–436, 2011. 9

3 0.16768077 287 nips-2011-The Manifold Tangent Classifier

Author: Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller

Abstract: We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive auto-encoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local directions changes along the manifold. Record-breaking classification results are obtained. 1

4 0.14543191 263 nips-2011-Sparse Manifold Clustering and Embedding

Author: Ehsan Elhamifar, René Vidal

Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1

5 0.10626985 157 nips-2011-Learning to Search Efficiently in High Dimensions

Author: Zhen Li, Huazhong Ning, Liangliang Cao, Tong Zhang, Yihong Gong, Thomas S. Huang

Abstract: High dimensional similarity search in large scale databases becomes an important challenge due to the advent of Internet. For such applications, specialized data structures are required to achieve computational efficiency. Traditional approaches relied on algorithmic constructions that are often data independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees, k-means trees). While supervised learning algorithms have been applied to related problems, those proposed in the literature mainly focused on learning hash codes optimized for compact embedding of the data rather than search efficiency. Consequently such an embedding has to be used with linear scan or another search algorithm. Hence learning to hash does not directly address the search efficiency issue. This paper considers a new framework that applies supervised learning to directly optimize a data structure that supports efficient large scale search. Our approach takes both search quality and computational cost into consideration. Specifically, we learn a boosted search forest that is optimized using pair-wise similarity labeled examples. The output of this search forest can be efficiently converted into an inverted indexing data structure, which can leverage modern text search infrastructure to achieve both scalability and efficiency. Experimental results show that our approach significantly outperforms the start-of-the-art learning to hash methods (such as spectral hashing), as well as state-of-the-art high dimensional search algorithms (such as LSH and k-means trees).

6 0.10019892 249 nips-2011-Sequence learning with hidden units in spiking neural networks

7 0.1000048 226 nips-2011-Projection onto A Nonnegative Max-Heap

8 0.097594954 29 nips-2011-Algorithms and hardness results for parallel large margin learning

9 0.092984945 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion

10 0.091147155 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs

11 0.085393086 213 nips-2011-Phase transition in the family of p-resistances

12 0.084575489 186 nips-2011-Noise Thresholds for Spectral Clustering

13 0.082202926 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

14 0.081665635 59 nips-2011-Composite Multiclass Losses

15 0.078700669 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds

16 0.076597013 305 nips-2011-k-NN Regression Adapts to Local Intrinsic Dimension

17 0.075228885 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

18 0.072316848 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

19 0.068484075 171 nips-2011-Metric Learning with Multiple Kernels

20 0.067757875 5 nips-2011-A Denoising View of Matrix Completion


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.202), (1, 0.033), (2, -0.042), (3, -0.092), (4, -0.06), (5, 0.047), (6, -0.07), (7, -0.05), (8, 0.032), (9, -0.111), (10, -0.047), (11, 0.058), (12, 0.011), (13, -0.113), (14, 0.056), (15, 0.041), (16, -0.046), (17, 0.134), (18, -0.161), (19, 0.032), (20, -0.029), (21, 0.115), (22, 0.087), (23, 0.084), (24, 0.069), (25, -0.107), (26, -0.146), (27, -0.136), (28, -0.074), (29, -0.029), (30, 0.016), (31, 0.151), (32, -0.117), (33, -0.162), (34, -0.011), (35, 0.09), (36, -0.118), (37, 0.012), (38, 0.032), (39, 0.088), (40, -0.09), (41, -0.024), (42, -0.044), (43, 0.051), (44, 0.054), (45, 0.018), (46, -0.019), (47, 0.011), (48, 0.007), (49, -0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95471996 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

Author: Binbin Lin, Chiyuan Zhang, Xiaofei He

Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Many of the existing work use the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semisupervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach. 1

2 0.7542879 263 nips-2011-Sparse Manifold Clustering and Embedding

Author: Ehsan Elhamifar, René Vidal

Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
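The core mechanism described here — choosing each point's neighbors and weights by a per-point sparse optimization, then feeding the resulting graph to spectral clustering — can be sketched with a plain Lasso reconstruction of each point from its nearest neighbors. This is a simplified stand-in (closer to sparse subspace clustering than to SMCE's exact program, which uses normalized direction vectors, a proximity-weighted penalty, and an affine constraint); all names and parameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering
from sklearn.neighbors import NearestNeighbors

def sparse_neighbor_affinity(X, k=15, alpha=0.01):
    """Represent each point sparsely by its k nearest neighbors; the magnitudes
    of the selected coefficients become the graph weights."""
    n = X.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    W = np.zeros((n, n))
    for i in range(n):
        neigh = idx[i, 1:]                       # drop the point itself
        A = X[neigh].T                           # columns are neighbor coordinates
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(A, X[i])
        W[i, neigh] = np.abs(lasso.coef_)
    return np.maximum(W, W.T)                    # symmetrize

# toy data: two well-separated clouds
X = np.vstack([np.random.randn(100, 3) * 0.1 + c for c in ([0, 0, 0], [3, 0, 0])])
W = sparse_neighbor_affinity(X)
labels = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(W)
```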

3 0.75015098 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

Author: Dominique C. Perrault-Joncas, Marina Meila

Abstract: This paper considers the problem of embedding directed graphs in Euclidean space while retaining directional information. We model the observed graph as a sample from a manifold endowed with a vector field, and we design an algorithm that separates and recovers the features of this process: the geometry of the manifold, the data density and the vector field. The algorithm is motivated by our analysis of Laplacian-type operators and their continuous limit as generators of diffusions on a manifold. We illustrate the recovery algorithm on both artificially constructed and real data. 1 Motivation Recent advances in graph embedding and visualization have focused on undirected graphs, for which the graph Laplacian properties make the analysis particularly elegant [1, 2]. However, there is an important number of graph data, such as social networks, alignment scores between biological sequences, and citation data, which are naturally asymmetric. A commonly used approach for this type of data is to disregard the asymmetry by studying the spectral properties of W +W T or W T W , where W is the affinity matrix of the graph. Some approaches have been offered to preserve the asymmetry information contained in data: [3], [4], [5] or to define directed Laplacian operators [6]. Although quite successful, these works adopt a purely graph-theoretical point of view. Thus, they are not concerned with the generative process that produces the graph, nor with the interpretability and statistical properties of their algorithms. In contrast, we view the nodes of a directed graph as a finite sample from a manifold in Euclidean space, and the edges as macroscopic observations of a diffusion kernel between neighboring points on the manifold. We explore how this diffusion kernel determines the overall connectivity and asymmetry of the resulting graph and demonstrate how Laplacian-type operators of this graph can offer insights into the underlying generative process. Based on the analysis of the Laplacian-type operators, we derive an algorithm that, in the limit of infinite sample and vanishing bandwidth, recovers the key features of the sampling process: manifold geometry, sampling distribution, and local directionality, up to their intrinsic indeterminacies. 2 Model The first premise here is that we observe a directed graph G, with n nodes, having weights W = [Wij ] for the edge from node i to node j. In following with common Laplacian-based embedding approaches, we assume that G is a geometric random graph constructed from n points sampled according to distribution p = e−U on an unobserved compact smooth manifold M ⊆ Rl of known intrinsic dimension d ≤ l. The edge weight Wij is then determined by a directed similarity kernel k (xi , xj ) with bandwidth . The directional component of k (xi , xj ) will be taken to be derived 1 from a vector field r on M, which assigns a preferred direction between weights Wij and Wji . The choice of a vector field r to characterize the directional component of G might seem restrictive at first. In the asymptotic limit of → 0 and n → ∞ however, kernels are characterized by their diffusion, drift, and source components [7]. As such, r is sufficient to characterize any directionality associated with a drift component and as it turns out, the component of r normal M in Rl can also be use to characterize any source component. As for the diffusion component, it is not possible to uniquely identify it from G alone [8]. 
Some absolute knownledge of M is needed to say anything about it. Hence, without loss of generality, we will construct k (xi , xj ) so that the diffusion component ends being isotropic and constant, i.e. equal to Laplace-Beltrami operator ∆ on M. The schematic of this generative process is shown in the top left of Figure 1 below. From left to right: the graph generative process mapping the sample on M to geometric random graph G via the kernel k (x, y), then the subsequent embedding (α) Ψn of G by operators Haa,n , (α) Hss,n (defined in section 3.1). As these operators converge to (α) their respective limits, Haa and (α) Hss , so will Ψn → Ψ, pn → p, and rn → r. We design an algorithm that, given G, produces the top right embedding (Ψn , pn , and rn ). Figure 1: Schematic of our framework. The question is then as follows: can the generative process’ geometry M, distribution p = e−U , and directionality r, be recovered from G? In other words, is there an embedding of G in Rm , m ≥ d that approximates all three components of the process and that is also consistent as sample size increases and the bandwidth vanishes? In the case of undirected graphs, the theory of Laplacian eigenmaps [1] and Diffusion maps [9] answers this question in the affirmative, in that the geometry of M and p = e−U can be inferred using spectral graph theory. The aim here is to build on the undirected problem and recover all three components of the generative process from a directed graph G. The spectral approach to undirected graph embedding relies on the fact that eigenfunctions of the Laplace-Beltrami operator are known to preserve the local geometry of M [1]. With a consistent empirical Laplace-Beltrami operator based on G, its eigenvectors also recover the geometry of M and converge to the corresponding eigenfunctions on M. For a directed graph G, an additional operator is needed to recover the local directional component r, but the principle remains the same. (α) The schematic for this is shown in Figure 1 where two operators - Hss,n , introduced in [9] for (α) undirected embeddings, and Haa,n , a new operator defined in section 3.1 - are used to obtain the (α) (α) (α) embedding Ψn , distribution pn , and vector field rn . As Haa,n and Hss,n converge to Haa and (α) Hss , Ψn , pn , and rn also converge to Ψ, p, and r, where Ψ is the local geometry preserving the embedding of M into Rm . (α) The algorithm we propose in Section 4 will calculate the matrices corresponding to H·,n from the graph G, and with their eigenvectors, will find estimates for the node coordinates Ψ, the directional component r, and the sampling distribution p. In the next section we briefly describe the mathematical models of the diffusion processes that our model relies on. 2 2.1 Problem Setting The similarity kernel k (x, y) can be used to define transport operators on M. The natural transport operator is defined by normalizing k (x, y) as T [f ](x) = M k (x, y) f (y)p(y)dy , where p (x) = p (x) k (x, y)p(y)dy . (1) M T [f ](x) represents the diffusion of a distribution f (y) by the transition density k (x, y)p(y)/ k (x, y )p(y )dy . The eigenfunctions of this infinitesimal operator are the continuous limit of the eigenvectors of the transition probability matrix P = D−1 W given by normalizing the affinity matrix W of G by D = diag(W 1) [10]. Meanwhile, the infinitesimal transition ∂f (T − I)f = lim (2) →0 ∂t defines the backward equation for this diffusion process over M based on kernel k . 
Obtaining the explicit expression for transport operators like (2) is then the main technical challenge. 2.2 Choice of Kernel In order for T [f ] to have the correct asymptotic form, some hypotheses about the similarity kernel k (x, y) are required. The hypotheses are best presented by considering the decomposition of k (x, y) into symmetric h (x, y) = h (y, x) and anti-symmetric a (x, y) = −a (y, x) components: k (x, y) = h (x, y) + a (x, y) . (3) The symmetric component h (x, y) is assumed to satisfy the following properties: 1. h (||y − 2 x||2 ) = h(||y−x|| / ) , and 2. h ≥ 0 and h is exponentially decreasing as ||y − x|| → ∞. This form d/2 of symmetric kernel was used in [9] to analyze the diffusion map. For the asymmetric part of the similarity kernel, we assume the form a (x, y) = r(x, y) h(||y − x||2 / ) · (y − x) , d/2 2 (4) with r(x, y) = r(y, x) so that a (x, y) = −a (y, x). Here r(x, y) is a smooth vector field on the manifold that gives an orientation to the asymmetry of the kernel k (x, y). It is worth noting that the dependence of r(x, y) on both x and y implies that r : M × M → Rl with Rl the ambient space of M; however in the asymptotic limit, the dependence in y is only important “locally” (x = y), and as such it is appropriate to think of r(x, x) being a vector field on M. As a side note, it is worth pointing out that even though the form of a (x, y) might seem restrictive at first, it is sufficiently rich to describe any vector field . This can be seen by taking r(x, y) = (w(x) + w(y))/2 so that at x = y the resulting vector field is given by r(x, x) = w(x) for an arbitrary vector field w(x). 3 Continuous Limit of Laplacian Type Operators We are now ready to state the main asymptotic result. Proposition 3.1 Let M be a compact, closed, smooth manifold of dimension d and k (x, y) an asymmetric similarity kernel satisfying the conditions of section 2.2, then for any function f ∈ C 2 (M), the integral operator based on k has the asymptotic expansion k (x, y)f (y)dy = m0 f (x) + g(f (x), x) + o( ) , (5) M where g(f (x), x) = and m0 = Rd m2 (ω(x)f (x) + ∆f (x) + r · 2 h(||u||2 )du, m2 = Rd u2 h(||u||2 )du. i 3 f (x) + f (x) · r + c(x)f (x)) (6) The proof can be found in [8] along with the definition of ω(x) and c(x) in (6). For now, it suffices to say that ω(x) corresponds to an interaction between the symmetric kernel h and the curvature of M and was first derived in [9]. Meanwhile, c(x) is a new term that originates from the interaction between h and the component of r that is normal to M in the ambient space Rl . Proposition 3.1 foreshadows a general fact about spectral embedding algorithms: in most cases, Laplacian operators confound the effects of spatial proximity, sampling density and directional flow due to the presence of the various terms above. 3.1 Anisotropic Limit Operators Proposition 3.1 above can be used to derive the limits of a variety of Laplacian type operators associated with spectral embedding algorithms like [5, 6, 3]. Although we will focus primarily on a few operators that give the most insight into the generative process and enable us to recover the model defined in Figure 1, we first present four distinct families of operators for completeness. These operator families are inspired by the anisotropic family of operators that [9] introduced for undirected graphs, which make use of anisotropic kernels of the form: k (α) (x, y) = k (x, y) pα (x)pα (y) , (7) with α ∈ [0, 1] where α = 0 is the isotropic limit. 
To normalize the anisotropic kernels, we need (α) (α) (α) as p (x) = M k (x, y)p(y)dy. From (7), four to redefine the outdegrees distribution of k families of diffusion processes of the form ft = H (α) [f ](x) can be derived depending on which kernel is normalized and which outdegree distribution is used for the normalization. Specifically, (α) (α) we define transport operators by normalizing the asymmetric k or symmetric h kernels with the 1 asymmetric p or symmetric q = M h (x, y)p(y)dy outdegree distribution . To keep track of all options, we introduce the following notation: the operators will be indexed by the type of kernel and outdegree distribution they correspond to (symmetric or asymmetric), with the first index identifying the kernel and the second index identifying the outdegree distribution. For example, the family of anisotropic limit operators introduced by [9] is defined by normalizing the symmetric kernel by (α) the symmetric outdegree distribution, hence they will be denoted as Hss , with the superscript corresponding to the anisotropic power α. Proposition 3.2 With the above notation, (α) Haa [f ] = ∆f − 2 (1 − α) U · f + r· f (α) Has [f ] (α) Hsa [f ] (α) Hss [f ] = ∆f − 2 (1 − α) U · = ∆f − 2 (1 − α) U · = ∆f − 2(1 − α) U · (8) f − cf + (α − 1)(r · U )f − ( · r + (α − 1)r · f + (c + f. U )f · r)f + r · f (9) (10) (11) The proof of this proposition, which can be found in [8], follows from repeated application of Proposition 3.1 to p(y) or q(y) and then to k α (x, y) or hα (x, y), as well as the fact that p1 = α 1 p−α [1 − α (ω + ∆p p + 2r · p p +2 · r + c)] + o( ). (α) Thus, if we use the asymmetric k and p , we get Haa , defined by the advected diffusion equa(α) tion (8). In general, Haa is not hermitian, so it commonly has complex eigenvectors. This makes (1) embedding directed graphs with this operator problematic. Nevertheless, Haa will play an important role in extracting the directionality of the sampling process. If we use the symmetric kernel h but the asymmetric outdegree distribution p , we get the family (α) of operators Hsa , of which the WCut of [3] is a special case (α = 0). If we reverse the above, i.e. (α) (α) (α) use k and q , we obtain Has . This turns out to be merely a combination of Haa and Hsa . 1 The reader may notice that there are in fact eight possible combinations of kernel and degree distribution, since the anisotripic kernel (7) could also be defined using a symmetric or asymmetric outdegree distribution. However, there are only four distinct asymptotic results and they are all covered by using one kernel (symmetric or asymmetric) and one degree distribution (symmetric or asymmetric) throughout. 4 Algorithm 1 Directed Embedding Input: Affinity matrix Wi,j and embedding dimension m, (m ≥ d) 1. S ← (W + W T )/2 (Steps 1–6 estimate the coordinates as in [11]) n 2. qi ← j=1 Si,j , Q = diag(q) 3. V ← Q−1 SQ−1 (1) n 4. qi ← j=1 Vi,j , Q(1) = diag(q (1) ) (1) −1 5. Hss,n ← Q(1) V 6. Compute the Ψ the n × (m + 1) matrix with orthonormal columns containing the m + 1 largest (1) right eigenvector (by eigenvalue) of Hss,n as well as the Λ the (m + 1) × (m + 1) diagonal matrix of eigenvalues. Eigenvectors 2 to m + 1 from Ψ are the m coordinates of the embedding. (1) 7. Compute π the left eigenvector of Hss,n with eigenvalue 1. (Steps 7–8 estimate the density) n 8. π ← π/ i=1 πi is the density distribution over the embedding. n 9. pi ← j=1 Wi,j , P = diag(p) (Steps 9–13 estimate the vector field r) 10. T ← P −1 W P −1 (1) n 11. 
pi ← j=1 Ti,j , P (1) = diag(p(1) ) (1) −1 12. Haa,n ← P (1) T (1) (1) 13. R ← (Haa,n − Hss,n )Ψ/2. Columns 2 to m + 1 of R are the vector field components in the direction of the corresponding coordinates of the embedding. (α) Finally, if we only consider the symmetric kernel h and degree distribution q , we recover Hss , the anisotropic kernels of [9] for symmetric graphs. This operator for α = 1 is shown to separate the manifold from the probability distribution [11] and will be used as part of our recovery algorithm. Isolating the Vector Field r 4 Our aim is to esimate the manifold M, the density distribution p = e−U , and the vector field r. The (1) first two components of the data can be recovered from Hss as shown in [11] and summarized in Algorithm 1. At this juncture, one feature of generative process is missing: the vector field r. The natural approach (α) (α) for recovering r is to isolate the linear operator r · from Haa by substracting Hss : (α) (α) Haa − Hss = r · . (12) The advantage of recovering r in operator form as in (12) is that r · is coordinate free. In other words, as long as the chosen embedding of M is diffeomorphic to M2 , (12) can be used to express the component of r that lies in the tangent space T M, which we denote by r|| . Specifically, let Ψ be a diffeomorphic embedding of M ; the component of r along coordinate ψk is then given by r · ψk = rk , and so, in general, r|| = r · Ψ. (13) The subtle point that only r|| is recovered from (13) follows from the fact that the operator r · only defined along M and hence any directional derivative is necessarily along T M. is Equation (13) and the previous observations are the basis for Algorithm 1, which recovers the three important features of the generative process for an asymmetric graph with affinity matrix W . A similar approach can be employed to recover c + · r, or simply · r if r has no component perpendicular to the tangent space T M (meaning that c ≡ 0). Recovering c + · r is achieved by taking advantage of the fact that (1) (1) (Hsa − Hss ) = (c + 2 · r) , (14) (1) A diffeomorphic embedding is guaranteed by using the eigendecomposition of Hss . 5 (1) (1) which is a diagonal operator. Taking into account that for finite n (Hsa,n − Hss,n ) is not perfectly (1) (1) diagonal, using ψn ≡ 1n (vector of ones), i.e. (Hsa,n − Hss,n )[1n ] = (cn + · rn ), has been found (1) (1) empirically to be more stable than simply extracting the diagonal of (Hsa,n − Hss,n ). 5 Experiments Artificial Data For illustrative purposes, we begin by applying our method to an artificial example. We use the planet Earth as a manifold with a topographic density distribution, where sampling probability is proportional to elevation. We also consider two vector fields: the first is parallel to the line of constant latitude and purely tangential to the sphere, while the second is parallel to the line of constant longitude with a component of the vector field perpendicular to the manifold. The true model with constant latitude vector field is shown in Figure 2, along with the estimated density and vector field projected on the true manifold (sphere). Model Recovered Latitudinal (a) Longitudinal (b) Figure 2: (a): Sphere with latitudinal vector field, i.e East-West asymmetry, with Wew > Wwe if node w lies to the West of node e. The graph nodes are sampled non-uniformly, with the topographic map of the world as sampling density. We sample n = 5000 nodes, and observe only the resulting W matrix, but not the node locations. 
From W, our algorithm estimates the sample locations (geometry), the vector field (black arrows) generating the observed asymmetries, and the sampling distribution at each data point (colormap). (b) Vector fields on a spherical region (blue), and their estimates (red): latitudinal vector field tangent to the manifold (left) and longitudinal vector field with a component perpendicular to the manifold tangent plane (right). Both the estimated density and vector field agree with the true model, demonstrating that for artificial data, the recovery algorithm 1 performs quite well. We note that the estimated density does not recover all the details of the original density, even for large sample size (here n = 5000 with ε = 0.07). Meanwhile, the estimated vector field performs quite well even when the sampling is reduced to n = 500 with ε = 0.1. This can be seen in Figure 2(b), where the true and estimated vector fields are superimposed. Figure 2 also demonstrates how r · ∇ only recovers the tangential component of r. The estimated geometry is not shown in any of these figures, since the success of the diffusion map in recovering the geometry for such a simple manifold is already well established [2, 9]. Real Data The National Longitudinal Survey of Youth (NLSY) 1979 Cohort is a representative sample of young men and women in the United States who were followed from 1979 to 2000 [12, 13]. The aim here is to use this survey to obtain a representation of the job market as a diffusion process over a manifold. The data set consists of a sample of 7,816 individual career sequences of length 64, listing the jobs a particular individual held every quarter between the ages of 20 and 36. Each token in the sequence identifies a job. Each job corresponds to an industry × occupation pair. There are 25 unique industry and 20 unique occupation indices. Out of the 500 possible pairings, approximately 450 occur in the data, with only 213 occurring with sufficient frequency to be included here. Thus, our graph G has 213 nodes - the jobs - and our observations consist of 7,816 walks between the graph nodes. We convert these walks to a directed graph with affinity matrix W. Specifically, Wij represents the number of times a transition from job i to job j was observed (note that this matrix is asymmetric, i.e. Wij ≠ Wji). Normalizing each row i of W by its outdegree di gives P = diag(di)^{-1} W, the non-parametric maximum likelihood estimator for the Markov chain over G for the progression of career sequences. This Markov chain has as limit operator Haa^(0), as the granularity of the job market increases along with the number of observations. Thus, in trying to recover the geometry, distribution and vector field, we are actually interested in estimating the full advective effect of the diffusion process generated by Haa^(0); that is, we want to estimate r · ∇ − 2∇U · ∇, where we can use −2∇U · ∇ = Hss^(0) − Hss^(1) to complement Algorithm 1. Figure 3: Embedding the job market along with the field r − 2∇U over the first two non-constant eigenvectors. The color map corresponds to the mean monthly wage in dollars (a) and to the female proportion (b) for each job.
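Since Algorithm 1 above reduces to a handful of matrix normalizations and eigendecompositions, a compact numpy sketch may help make the listed steps concrete. It follows the steps directly; the eigensolver handling (taking real parts, which is safe because H_ss^(1) is similar to a symmetric matrix) and the function name are my own choices, not part of the paper.

```python
import numpy as np

def directed_embedding(W, m):
    """Sketch of Algorithm 1: returns coordinates Psi (n x m), density pi, field R."""
    # Steps 1-5: symmetrize and apply the alpha = 1 anisotropic normalization.
    S = (W + W.T) / 2.0
    q = S.sum(axis=1)
    V = S / np.outer(q, q)                  # V = Q^{-1} S Q^{-1}
    q1 = V.sum(axis=1)
    Hss = V / q1[:, None]                   # H_ss^(1) = (Q^(1))^{-1} V, row-stochastic
    # Step 6: top m+1 right eigenvectors give the coordinates (column 1 is constant).
    evals, evecs = np.linalg.eig(Hss)
    order = np.argsort(-evals.real)[: m + 1]
    Psi = evecs[:, order].real
    # Steps 7-8: stationary density = left eigenvector with eigenvalue 1.
    lvals, lvecs = np.linalg.eig(Hss.T)
    pi = np.abs(lvecs[:, np.argmin(np.abs(lvals - 1.0))].real)
    pi /= pi.sum()
    # Steps 9-12: the same normalization applied to the asymmetric affinity.
    p = W.sum(axis=1)
    T = W / np.outer(p, p)
    p1 = T.sum(axis=1)
    Haa = T / p1[:, None]
    # Step 13: vector field components along each embedding coordinate.
    R = (Haa - Hss) @ Psi / 2.0
    return Psi[:, 1:], pi, R[:, 1:]
```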
We obtain an embedding of the job market that describes the relative position of jobs, their distribution, and the natural time progression from each job. Of these, the relative position and natural time progression are the most interesting. Together, they summarize the job market dynamics by describing which jobs are naturally “close” as well as where they can lead in the future. From a public policy perspective, this can potentially improve focus on certain jobs for helping individuals attain better upward mobility. The job market was found to be a high dimensional manifold. We present only the first two dimen(0) sions, that is, the second and third eigenvectors of Hss , since the first eigenvector is uninformative (constant) by construction. The eigenvectors showed correlation with important demographic data, such as wages and gender. Figure 3 displays this two-dimensional sub-embedding along with the directional information r − 2 U for each dimension. The plot shows very little net progression toward regions of increasing mean salary3 . This is somewhat surprising, but it is easy to overstate this observation: diffusion alone would be enough to move the individuals towards higher salary. What Figure 3 (a) suggests is that there appear to be no “external forces” advecting individuals towards higher salary. Nevertheless, there appear to be other external forces at play in the job market: Figure 3 (b), which is analogous to Figure 3 (a), but with gender replacing the salary color scheme, suggests that these forces push individuals towards greater gender differentiation. This is especially true amongst male-dominated jobs which appear to be advected toward the left edge of the embedding. Hence, this simple analysis of the job market can be seen as an indication that males and females tend to move away from each other over time, while neither seems to have a monopoly on high- or low- paying jobs. 6 Discussion This paper makes three contributions: (1) it introduces a manifold-based generative model for directed graphs with weighted edges, (2) it obtains asymptotic results for operators constructed from the directed graphs, and (3) these asymptotic results lead to a natural algorithm for estimating the model. 3 It is worth noting that in the NLSY data set, high paying jobs are teacher, nurse and mechanic. This is due to the fact that the career paths observed stop at at age 36, which is relatively early in an individual’s career. 7 Generative Models that assume that data are sampled from a manifold are standard for undirected graphs, but to our knowledge, none have yet been proposed for directed graphs. When W is symmetric, it is natural to assume that it depends on the points’ proximity. For asymmetric affinities W , one must include an additional component to explain the asymmetry. In the asymptotic limit, this is tantamount to defining a vector field on the manifold. Algorithm We have used from [9] the idea of defining anisotropic kernels (indexed by α) in order to separate the density p and the manifold geometry M. Also, we adopted their general assumptions about the symmetric part of the kernel. As a consequence, the recovery algorithm for p and M is identical to theirs. However, insofar as the asymmetric part of the kernel is concerned, everything, starting from the definition and the introduction of the vector field r as a way to model the asymmetry, through the derivation of the asymptotic expression for the symmetric plus asymmetric kernel, is new. 
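To make the kernel decomposition used in this model concrete, here is a small sketch of how a directed affinity of the form "symmetric Gaussian part plus a drift term aligned with a vector field" might be generated from sampled points. The ε^{d/2} normalization is dropped (it cancels under the row normalizations applied later), the clipping to nonnegative weights is my own finite-ε safeguard, and the circle example is purely illustrative.

```python
import numpy as np

def directed_affinity(X, r, eps):
    """W[i, j] ~ h(||x_j - x_i||^2 / eps) * (1 + r(x_i) . (x_j - x_i) / 2):
    Gaussian symmetric part plus an antisymmetric drift term."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        diff = X - X[i]                        # rows: x_j - x_i
        h = np.exp(-np.sum(diff ** 2, axis=1) / eps)
        drift = 1.0 + 0.5 * diff @ r(X[i])
        W[i] = h * np.clip(drift, 0.0, None)   # keep weights nonnegative at finite eps
    np.fill_diagonal(W, 0.0)
    return W

# toy example: points on a circle with a rotational (tangential) vector field
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
r = lambda x: np.array([-x[1], x[0]])
W = directed_affinity(X, r, eps=0.05)          # asymmetric: W != W.T
```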
We go significantly beyond the elegant idea of [9] regarding the use of anisotropic kernels by analyzing the four distinct renormalizations possible for a given α, each of them combining different aspects of M, p and r. Only the successful (and novel) combination of two different anisotropic operators is able to recover the directional flow r. Algorithm 1 is natural, but we do not claim it is the only possible one in the context of our model. For instance, we can also use Hsa^(α) to recover the operator ∇ · r (which empirically seems to have worse numerical properties than r · ∇). In the National Longitudinal Survey of Youth study, we were interested in the whole advective term, so we estimated it from a different combination of operators. Depending on the specific question, other features of the model could be obtained. Limit Results Proposition 3.1 is a general result on the asymptotics of asymmetric kernels. Recovering the manifold and r is just one, albeit the most useful, of the many ways of exploiting these results. For instance, Hsa^(0) is the limit operator of the operators used in [3] and [5]. The limit analysis could be extended to other digraph embedding algorithms such as [4, 6]. How general is our model? Any kernel can be decomposed into a symmetric and an asymmetric part, as we have done. The assumptions on the symmetric part h are standard. The paper of [7] goes one step further from these assumptions; we will discuss it in relationship with our work shortly. The more interesting question is how limiting are our assumptions regarding the choice of kernel, especially the asymmetric part, which we parameterized as a (x, y) = r/2 · (y − x) h (x, y) in (4). In the asymptotic limit, this choice turns out to be fully general, at least up to the identifiable aspects of the model. For a more detailed discussion of this issue, see [8]. In [7], Ting, Huang and Jordan presented asymptotic results for a general family of kernels that includes asymmetric and random kernels. Our k can be expressed in the notation of [7] by taking wx (y) ← 1+r(x, y)·(y−x), rx (y) ← 1, K0 ← h, h ← . Their assumptions are more general than the assumptions we make here, yet our model is general up to what can be identified from G alone. The distinction arises because [7] focuses on the graph construction methods from an observed sample of M, while we focus on explaining an observed directed graph G through a manifold generative process. Moreover, while the [7] results can be used to analyze data from directed graphs, they differ from our Proposition 3.1. Specifically, with respect to the limit in Theorem 3 from [7], we obtain the additional source terms f(x) ∇ · r and c(x)f(x) that follow from not enforcing conservation of mass while defining operators Hsa^(α) and Has^(α). We applied our theory of directed graph embedding to the analysis of the career sequences in Section 5, but asymmetric affinity data abound in other social contexts, and in the physical and life sciences. Indeed, any “similarity” score that is obtained from a likelihood of the form Wvu = likelihood(u|v) is generally asymmetric. Hence our methods can be applied to study not only social networks, but also patterns of human movement, road traffic, and trade relations, as well as alignment scores in molecular biology. Finally, the physical interpretation of our model also makes it naturally applicable to physical models of flows. Acknowledgments This research was partially supported by NSF awards IIS-0313339 and IIS-0535100.
References [1] Belkin and Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2002. [2] Nadler, Lafon, and Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In Neural Information Processing Systems Conference, 2006. [3] Meila and Pentney. Clustering by weighted cuts in directed graphs. In SIAM Data Mining Conference, 2007. [4] Zhou, Huang, and Schölkopf. Learning from labeled and unlabeled data on a directed graph. In International Conference on Machine Learning, pages 1041–1048, 2005. [5] Zhou, Schölkopf, and Hofmann. Semi-supervised learning on directed graphs. In Advances in Neural Information Processing Systems, volume 17, pages 1633–1640, 2005. [6] Fan R. K. Chung. The diameter and Laplacian eigenvalues of directed graphs. Electr. J. Comb., 13, 2006. [7] Ting, Huang, and Jordan. An analysis of the convergence of graph Laplacians. In International Conference on Machine Learning, 2010. [8] Dominique Perrault-Joncas and Marina Meilă. Directed graph embedding: an algorithm based on continuous limits of Laplacian-type operators. Technical Report TR 587, University of Washington, Department of Statistics, November 2011. [9] Coifman and Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21:6–30, 2006. [10] Mikhail Belkin and Partha Niyogi. Convergence of Laplacian eigenmaps. Preprint, short version in NIPS 2008, 2008. [11] Coifman, Lafon, Lee, Maggioni, Warner, and Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. In Proceedings of the National Academy of Sciences, pages 7426–7431, 2005. [12] United States Department of Labor. National Longitudinal Survey of Youth 1979 cohort. http://www.bls.gov/nls/, retrieved October 2011. [13] Marc A. Scott. Affinity models for career sequences. Journal of the Royal Statistical Society: Series C (Applied Statistics), 60(3):417–436, 2011.

4 0.68250716 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds

Author: Nitesh Shroff, Pavan Turaga, Rama Chellappa

Abstract: In this paper, we consider the Précis problem of sampling K representative yet diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to be applied to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.
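The representative-yet-diverse selection described in this abstract can be previewed with a much simpler Euclidean baseline: farthest-point (k-center-style) sampling, which greedily maximizes spread. This is not the paper's annealing-based alternation and it ignores both the manifold geometry and the representativeness term; it is only meant to convey the flavor of picking K well-separated exemplars.

```python
import numpy as np

def farthest_point_sampling(X, K, seed=0):
    """Greedy diversity-only baseline: start from a random point, then repeatedly
    add the point farthest (in Euclidean distance) from the current selection."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    chosen = [int(rng.integers(n))]
    dist = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(K - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen

X = np.random.rand(1000, 5)
exemplars = farthest_point_sampling(X, K=10)
```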

5 0.63778806 287 nips-2011-The Manifold Tangent Classifier

Author: Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller

Abstract: We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive auto-encoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local directions changes along the manifold. Record-breaking classification results are obtained. 1
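The "contractive auto-encoder plus tangent extraction" idea in this abstract can be sketched for the plain first-order case (the paper actually uses a higher-order variant inside a full training loop): for a sigmoid encoder h = s(Wx + b), the Jacobian is diag(h(1-h)) W, the contractive penalty is its squared Frobenius norm added to the reconstruction loss, and the leading right singular vectors of that Jacobian serve as estimated tangent directions at x. Everything below — names, sizes, the toy input — is illustrative only.

```python
import numpy as np

def encoder(x, W, b):
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))     # sigmoid hidden units

def jacobian(x, W, b):
    h = encoder(x, W, b)
    return (h * (1.0 - h))[:, None] * W           # J = diag(h(1-h)) W, shape (n_hidden, n_in)

def contractive_penalty(x, W, b):
    return np.sum(jacobian(x, W, b) ** 2)         # ||J_f(x)||_F^2

def tangent_directions(x, W, b, d):
    # Leading right singular vectors of J span the estimated tangent space at x.
    _, _, Vt = np.linalg.svd(jacobian(x, W, b))
    return Vt[:d]

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((50, 20))
b = np.zeros(50)
x = rng.standard_normal(20)
penalty = contractive_penalty(x, W, b)            # would be weighted and added to the loss
T = tangent_directions(x, W, b, d=3)              # 3 tangent vectors in input space
```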

6 0.59940982 5 nips-2011-A Denoising View of Matrix Completion

7 0.54456979 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion

8 0.49431336 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data

9 0.45903736 296 nips-2011-Uniqueness of Belief Propagation on Signed Graphs

10 0.4108609 171 nips-2011-Metric Learning with Multiple Kernels

11 0.40809265 67 nips-2011-Data Skeletonization via Reeb Graphs

12 0.40723792 150 nips-2011-Learning a Distance Metric from a Network

13 0.39998618 228 nips-2011-Quasi-Newton Methods for Markov Chain Monte Carlo

14 0.39002988 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

15 0.38484195 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

16 0.36831021 292 nips-2011-Two is better than one: distinct roles for familiarity and recollection in retrieving palimpsest memories

17 0.3527784 143 nips-2011-Learning Anchor Planes for Classification

18 0.34446165 274 nips-2011-Structure Learning for Optimization

19 0.34238255 295 nips-2011-Unifying Non-Maximum Likelihood Learning Objectives with Minimum KL Contraction

20 0.3351945 157 nips-2011-Learning to Search Efficiently in High Dimensions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (4, 0.034), (20, 0.031), (26, 0.029), (31, 0.044), (43, 0.064), (45, 0.064), (57, 0.014), (74, 0.028), (83, 0.035), (99, 0.541)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9433493 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

Author: Binbin Lin, Chiyuan Zhang, Xiaofei He

Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Many of the existing work use the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semisupervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach. 1

2 0.9123975 50 nips-2011-Budgeted Optimization with Concurrent Stochastic-Duration Experiments

Author: Javad Azimi, Alan Fern, Xiaoli Z. Fern

Abstract: Budgeted optimization involves optimizing an unknown function that is costly to evaluate by requesting a limited number of function evaluations at intelligently selected inputs. Typical problem formulations assume that experiments are selected one at a time with a limited total number of experiments, which fail to capture important aspects of many real-world problems. This paper defines a novel problem formulation with the following important extensions: 1) allowing for concurrent experiments; 2) allowing for stochastic experiment durations; and 3) placing constraints on both the total number of experiments and the total experimental time. We develop both offline and online algorithms for selecting concurrent experiments in this new setting and provide experimental results on a number of optimization benchmarks. The results show that our algorithms produce highly effective schedules compared to natural baselines. 1

3 0.90122986 54 nips-2011-Co-regularized Multi-view Spectral Clustering

Author: Abhishek Kumar, Piyush Rai, Hal Daume

Abstract: In many clustering problems, we have access to multiple views of the data each of which could be individually used for clustering. Exploiting information from multiple views, one can hope to find a clustering that is more accurate than the ones obtained using the individual views. Often these different views admit same underlying clustering of the data, so we can approach this problem by looking for clusterings that are consistent across the views, i.e., corresponding data points in each view should have same cluster membership. We propose a spectral clustering framework that achieves this goal by co-regularizing the clustering hypotheses, and propose two co-regularization schemes to accomplish this. Experimental comparisons with a number of baselines on two synthetic and three real-world datasets establish the efficacy of our proposed approaches.

4 0.86584508 8 nips-2011-A Model for Temporal Dependencies in Event Streams

Author: Asela Gunawardana, Christopher Meek, Puyang Xu

Abstract: We introduce the Piecewise-Constant Conditional Intensity Model, a model for learning temporal dependencies in event streams. We describe a closed-form Bayesian approach to learning these models, and describe an importance sampling algorithm for forecasting future events using these models, using a proposal distribution based on Poisson superposition. We then use synthetic data, supercomputer event logs, and web search query logs to illustrate that our learning algorithm can efficiently learn nonlinear temporal dependencies, and that our importance sampling algorithm can effectively forecast future events. 1

5 0.82860601 270 nips-2011-Statistical Performance of Convex Tensor Decomposition

Author: Ryota Tomioka, Taiji Suzuki, Kohei Hayashi, Hisashi Kashima

Abstract: We analyze the statistical performance of a recently proposed convex tensor decomposition algorithm. Conventionally tensor decomposition has been formulated as non-convex optimization problems, which hindered the analysis of their performance. We show under some conditions that the mean squared error of the convex method scales linearly with the quantity we call the normalized rank of the true tensor. The current analysis naturally extends the analysis of convex low-rank matrix estimation to tensors. Furthermore, we show through numerical experiments that our theory can precisely predict the scaling behaviour in practice.

6 0.78530133 46 nips-2011-Better Mini-Batch Algorithms via Accelerated Gradient Methods

7 0.52826846 186 nips-2011-Noise Thresholds for Spectral Clustering

8 0.52764678 102 nips-2011-Generalised Coupled Tensor Factorisation

9 0.52668953 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

10 0.51562065 263 nips-2011-Sparse Manifold Clustering and Embedding

11 0.5041914 29 nips-2011-Algorithms and hardness results for parallel large margin learning

12 0.49414283 121 nips-2011-Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent

13 0.49042755 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration

14 0.47603011 36 nips-2011-Analysis and Improvement of Policy Gradient Estimation

15 0.46850112 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes

16 0.46680915 198 nips-2011-On U-processes and clustering performance

17 0.46034956 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation

18 0.46007213 72 nips-2011-Distributed Delayed Stochastic Optimization

19 0.45871705 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time

20 0.45574343 59 nips-2011-Composite Multiclass Losses