nips nips2009 nips2009-146 knowledge-graph by maker-knowledge-mining

146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence


Source: pdf

Author: Wei Bian, Dacheng Tao

Abstract: In this paper, we study the manifold regularization for the Sliced Inverse Regression (SIR). The manifold regularization improves the standard SIR in two aspects: 1) it encodes the local geometry for SIR and 2) it enables SIR to deal with transductive and semi-supervised learning problems. We prove that the proposed graph Laplacian based regularization is convergent at rate root-n. The projection directions of the regularized SIR are optimized by using a conjugate gradient method on the Grassmann manifold. Experimental results support our theory.

Reference: text


Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 sg Abstract In this paper, we study the manifold regularization for the Sliced Inverse Regression (SIR). [sent-6, score-0.283]

2 The manifold regularization improves the standard SIR in two aspects: 1) it encodes the local geometry for SIR and 2) it enables SIR to deal with transductive and semi-supervised learning problems. [sent-7, score-0.403]

3 We prove that the proposed graph Laplacian based regularization is convergent at rate root-n. [sent-8, score-0.149]

4 The projection directions of the regularized SIR are optimized by using a conjugate gradient method on the Grassmann manifold. [sent-9, score-0.21]

5 1 Introduction Sliced inverse regression (SIR) [7] was proposed for sufficient dimension reduction. [sent-11, score-0.12]

6 In a regression setting, with the predictors X and the response Y, the sufficient dimension reduction (SDR) subspace B is defined by the conditional independency Y⊥ X| B T X. [sent-12, score-0.252]

7 Under the assumption that the distribution of X is elliptic symmetric [7], it has been proved that the SDR subsapce B is related to the inverse regression curve E(X|Y). [sent-13, score-0.132]

8 It can be estimated at least partially by a generalized eigendecomposition between the covariance matrix of the predictors Cov(X) and the covariance matrix of the inverse regression curve Cov(E(X|Y)). [sent-14, score-0.184]

9 When Y is a continuous random variable, it is discretized by slicing its range into several slices so as to estimate E(X|Y) empirically. [sent-15, score-0.05]

10 Therefore, SIR cannot select an efficient subspace for predicting the response Y because the local geometry of the predictors X is ignored. [sent-18, score-0.228]

11 Additionally, SIR only utilizes labeled (given response) data (predictors). [sent-19, score-0.037]

12 Thus, it is valuable to extend SIR to deal with transductive and semi-supervised learning problems by considering unlabelled samples. [sent-20, score-0.06]

13 We solve the above two problems of SIR by using the manifold regularization [2], which has been developed to incorporate the local geometry in learning classification or regression functions. [sent-21, score-0.404]

14 In this paper, we utilize it to preserve the local geometry of predictors in learning the SDR subspace B. [sent-22, score-0.206]

15 In addition, it helps SIR to solve transductive/semi-supervised learning problems because the regularization encodes the marginal distribution of the unlabelled predictors. [sent-23, score-0.105]

16 Different regularizations for SIR have been well studied, e. [sent-24, score-0.041]

17 , the non-singular regularization [14], the ridge regularization [9], and the sparse regularization [8]. [sent-26, score-0.252]

18 However, all existing regularizations do not encode the local geometry of the predictors. [sent-27, score-0.122]

19 Although the localized sliced inverse regression [12] considers the local geometry, it is heuristic and does not follow up the regularization framework. [sent-28, score-0.427]

20 Section 3 proves the convergence of the new manifold regularization. [sent-31, score-0.229]

21 We discuss the optimiza1 tion algorithm of the regularized SIR by using the conjugate gradient method on the Grassmann manifold in Section 4. [sent-32, score-0.381]

22 2 Manifold Regularization for SIR In the rest of the paper, we use terminologies in regression and deem classification as regression with the category response. [sent-35, score-0.08]

23 Upper case letters X ∈ Rp and Y ∈ R are respectively the predictors and the response, and lower case letters x and y are corresponding realizations. [sent-36, score-0.099]

24 Given a sample +n set, containing nl labeled samples {xi , yi }nl and nu unlabeled samples {xi }n=nl+1 u , we seek an i=1 i=nl optimal k-dimensional subspace spanned by B = [β1 , . [sent-37, score-0.126]

25 , βk ] such that the response Y is predictable with the projected predictors B T X. [sent-40, score-0.145]

26 , xn ] to denote all predictors in the sample set. [sent-44, score-0.099]

27 1 Sliced Inverse Regression Suppose the response Y is predictable with a sufficient k-dimensional projection of the original predictors X. [sent-46, score-0.173]

28 Given a set of samples {xi , yi }nl , SIR estimates the projection subspace B = [β1 , . [sent-52, score-0.054]

29 It can be proved that the generalized eigen-decomposition is equivalent to the following optimization, max trace B B T ΣB −1 B T ΓB . [sent-56, score-0.078]

30 (2) We refer to (2) as the objective function of SIR and thus we can impose with the manifold regularization on (2). [sent-57, score-0.283]

31 1 Another way to get the objective (2) is based on the least square formulation for SIR proposed in [3], H ¯ ¯ fh Xh − X − ΣBCh min L (B, C) = B T ¯ ¯ Σ−1 Xh − X − ΣBCh (3) h=1 where C = [C1 , C2 , . [sent-59, score-0.036]

32 2 Manifold Regularization for SIR Each dimension reduction projection β can be deemed as a linear function or a mapping g(x) = β T x. [sent-66, score-0.093]

33 We expect to preserve the local geometry of the distribution of the predictors X while doing mapping g(x). [sent-67, score-0.18]

34 Suppose the predictors X are embedded on a manifold M , this can be achieved by penalizing the gradient M g along the manifold M . [sent-68, score-0.535]

35 (4) X∈M The above formulation is different from the original manifold regularization [2]on the point that the function g(x) is a dimension reduction mapping here while it is a classification or regression 2 function in [2]. [sent-70, score-0.388]

36 Usually, both the manifold and the marginal distribution of X are unknown. [sent-71, score-0.199]

37 It has been well studied in manifold learning, however, that the regularization (4) can be approximated by n=n +n using the associated graph Laplacian of labeled and unlabeled {xi }i=1 l u . [sent-72, score-0.349]

38 n=n +n Construct an adjacent graph for {xi }i=1 1 u , where the pairwise edge weight (W )ij = φ ( xi − xj ) is defined by the kernel function φ (·), e. [sent-73, score-0.28]

39 , the heat kernel φ (d) = exp −d2 , and then the associated graph Laplacian is L = D − W , where D is a diagonal matrix given T by Dii = j Wij . [sent-75, score-0.036]

40 Thus, the regularization in (4) can be approximated by R = g Lg, where g = [β T x1 , . [sent-76, score-0.084]

41 , βk ] , we take the summation of k regularizations k T gi Lgi = trace GT LG R= (5) i=1 where G = [g1 , . [sent-83, score-0.119]

42 In the proposed regularized SIR, we normalize the regularization −1 T (5) as R = trace GT DG G LG , which is equivalent to the constraint GT DG = I. [sent-88, score-0.233]

43 This normalization makes R invariant to scalar and rotation transformations of the projections B = [β1 , . [sent-89, score-0.058]

44 , βk ], which is preferred for dimension reduction problems. [sent-92, score-0.065]

45 By adding the regularization −1 T G LG to SIR (2), and substituting G = X T B, we get the regularized R = trace GT DG SIR max SIRr (B) = trace B B T ΣB −1 B T ΓB − ηtrace B T SB −1 B T QB (6) where Q = 1/n (n − 1) XLX T , S = 1/n (n − 1) XDX T , and η is the positive weighting factor. [sent-93, score-0.311]

46 3 Convergence of the Regularization Different from the existing regularizations [8,9,14] for SIR, which are constructed as deterministic terms, the manifold regularization in (6) is a random term that involves two data dependent variables (matrices) Q and S. [sent-94, score-0.343]

47 Therefore, it is necessary to discuss the convergence property of the proposed manifold regularization. [sent-95, score-0.229]

48 It has been well proved that both Σ and Γ converge at rate root-n [7,11,15]. [sent-96, score-0.042]

49 Therefore, the convergence rate of the objective (6) depends on whether the regularization term converges at rate root-n. [sent-97, score-0.19]

50 Below, we prove that both the sample based estimations Q = 1/n (n − 1) XLX T and S = 1/n (n − 1) XDX T converge to deterministic matrices at rate root-n. [sent-98, score-0.064]

51 Note that the convergence of a special case where the graph Laplacian is built by the kernel function φ (d) = 1 (d < ε) was proved in [6]. [sent-99, score-0.083]

52 Additionally, we target a general choice of kernel φ (·) and also prove the root-n convergence rate which has not been obtained before. [sent-101, score-0.091]

53 Therefore, it is difficult to apply the law of large numbers and the central limit theorem to prove the convergence and obtain the corresponding convergence rate. [sent-103, score-0.08]

54 However, we can prove them by constructing the converged limitation and show that the variance of the sample based estimation with respect to the constructed limitation decades at rate root-n. [sent-104, score-0.045]

55 1 For kernel function φ (d) , it satisfies φ (0) = 1 and |φ (d)| of predictors P (X), the fourth order moment exists, i. [sent-107, score-0.115]

56 1, the next two lemmas show the convergence of T1 and T2 , respectively. [sent-114, score-0.064]

57 The E ϕ (x) xxT exists, and T1 in (8) converges almost surely at rate n−1/2 , i. [sent-117, score-0.086]

58 The E xη (x) exists, and T2 in (8) converges almost surely at rate n−1/2 , i. [sent-123, score-0.086]

59 1, the sample based estimation Q converges almost surely to a. [sent-131, score-0.061]

60 s T a deterministic matrix E (Q) = E ϕ (x) xxT − E xη (x) at rate n−1/2 , i. [sent-132, score-0.044]

61 1, the sample based estimation S converges almost surely to a a. [sent-141, score-0.061]

62 s deterministic matrix E ϕ (x) xxT at rate n−1/2 , i. [sent-142, score-0.044]

63 1 n(n−1) 1 n(n−1) 1 n(n−1) Dii n n i=1 n j=1 i=1 n = j Wij j=1 φ ( xi − xj ) xi xT = i xi xT . [sent-146, score-0.522]

64 n = 1 (n−1) n i=1 φ ( xi − xj ), so S 1 n(n−1) xi xT i n i=1 j=i = 1 n(n−1) n i=1 Dii xi xT i = φ ( xi − xj ) + φ (0) xi xT = T1 + i is an unbiased estimation of Cov(X), we have Therefore, according to Lemma 3. [sent-152, score-0.905]

65 4 Optimization on the Grassmann Manifold The optimization of the regularized SIR (6) is much more difficult than that of the standard SIR (2), which can be solved by the generalized eigen-decomposition. [sent-155, score-0.088]

66 In this section, we present a conjugate 4 gradient method on the Grassmann manifold to solve (6), based on the fact it is invariant to scalar and rotation transformations of the projection B. [sent-156, score-0.379]

67 By exploiting the geometry of the Grassmann manifold, the conjugate gradient algorithm converges faster than the gradient scheme in the Euclidean space. [sent-157, score-0.238]

68 Given a constrained optimization problem min F (A) subject to A ∈ Rp×k and AT A = I, if the problem further satisfies F (A) = F (AO) for an arbitrary orthonormal matrix O, then it is called an optimization problem defined on the Grassmann manifold Gpk . [sent-158, score-0.199]

69 Substituting B = Σ−1/2 U A into (6), we have SIRr (A) = trace ηtrace ˜ AT ΛA −1 AT A −1 ˜ AT ΓA − ˜ AT QA . [sent-164, score-0.061]

70 Because SIRr (A) is invariant to scalar and rotation transformations, a constraint AT A = I can be added to (6). [sent-167, score-0.037]

71 To implement the conjugate gradient method on the Grassmann manifold, the gradient of F (A) in (11) is required. [sent-170, score-0.132]

72 According to [4], the gradient GA of F (A) on the manifold is defined by GA = ΠA FA where FA is the gradient of F (A) in the Euclidian space and ΠA = I − AAT is the projection onto the tangent space at A of the manifold. [sent-171, score-0.303]

73 (12) Next, we present the conjugate gradient method on the Grassmann manifold [4] to solve (11). [sent-173, score-0.293]

74 The algorithm is given by the following three steps: • 1-D searching along the geodesic: given the current position Ak , the gradient Gk and the searching direction Hk , the 1-D searching along the geodesic is given by min F (A (t)) s. [sent-174, score-0.185]

75 A (t) = F Ak V cos (Σt) V T + U sin (Σt) V T t (13) where U ΣV T is the compact SVD of Hk . [sent-176, score-0.044]

76 Record the minimum solution tk = tmin , and Ak+1 = A (tk ) as the starting position for next searching. [sent-177, score-0.049]

77 Note that, the same as the conjugate gradient method in the Euclidian space, the searching direction Hk has to be resetting as Hk = −Gk with a period of p (n − p), i. [sent-182, score-0.149]

78 5 5 Experiments In this section, we evaluate the proposed regularized SIR on two real datasets. [sent-185, score-0.088]

79 We show the results of the standard SIR and the localized SIR on the same experiments for reference. [sent-186, score-0.076]

80 In our experiment, dimension reduction is first implemented and then the nearest neighbor rule is used for classification. [sent-190, score-0.065]

81 It shows that both the regularized SIR and the localized SIR [12] can achieve superior performance to the standard SIR, and the manifold regularized SIR performs better than the localized SIR in both the supervised and the semi-supervised training. [sent-194, score-0.527]

82 Experimental results reflect that the manifold regularized SIR is effective on exploiting the local geometry of a dataset. [sent-195, score-0.383]

83 Table 1: Experimental results on the USPS dataset: SIR; the manifold regularized SIR (RSIR); the localized SIR (LSIR); semi-supervised training of the manifold regularized SIR (sRSIR); semisupervised training of the localized SIR (sLSIR). [sent-196, score-0.743]

84 Figure 1 shows the visualization results obtained by SIR, the proposed regularized SIR and the localized SIR [12]. [sent-235, score-0.2]

85 The figure shows that by exploiting the unlabeled data via the manifold regularization for dimension reduction, the performance for data visualization can be significantly improved. [sent-236, score-0.395]

86 The localized SIR performs better than SIR, but not as good as the regularized SIR. [sent-237, score-0.164]

87 (17) j=i,j=1 = 1 n n E xi xT ϕ (xi ) = E ϕ (x) xxT . [sent-243, score-0.139]

88 When i, j, i , j are distinct, xi ,xj ,xi , and xj are independent, we have E (Φi,j,i ,j ) = E vec φ ( xi − xj ) xi xT i vec φ ( xi − xj ) xi xT i = vec E ϕ (x) xxT vec E ϕ (x) xxT T = (vec (E (T1 ))) (vec (E (T1 ))) . [sent-246, score-2.946]

89 T Therefore, the first term in E vec (T1 ) (vec (T1 )) 1 n2 (n−1)2 E (Φi,j,i ,j ) = T (19) T is T n(n−1)(n−2)(n−3) n2 (n−1)2 (vec (E (T1 ))) (vec (E (T1 ))) i,j,i ,j (20) distinct T = (vec (E (T1 ))) (vec (E (T1 ))) + O n−1 . [sent-247, score-0.484]

90 1, and thus we have 1 n2 E (Φi,j,i ,j ) 2 (n − 1) else , E (Φi,j,i ,j ) is bounded by a constant (matrix) 1 n2 M= 2 (n − 1) else 7 n(n − 1) (4n − 6) 2 n2 (n − 1) M = O n−1 . [sent-249, score-0.07]

91 First, we have    n n 1 1 E (T2 ) = E  xi  φ ( xi − xj ) xT  j n i=1 n−1 j=i,j=1    n n 1 1 E xi  E φ ( xi − xj ) xT |xi  = j n i=1 n−1 (23) j=i,j=1 = 1 n n E (xi η (xi )) = E (xη (x)) . [sent-254, score-0.627]

92 7 Conclusion We have studied the manifold regularization for Sliced Inverse Regression (SIR). [sent-259, score-0.283]

93 The regularized SIR extended the original SIR in many ways, i. [sent-260, score-0.088]

94 , it utilizes the local geometry that is ignored originally and enables SIR to deal with the tranductive/semisupervised learning problems. [sent-262, score-0.112]

95 We also discussed the statistical properties of the proposed regularization, that under mild conditions, the manifold regularization converges at rate root-n. [sent-263, score-0.334]

96 To solve the regularized SIR problem, we present a conjugate gradient method conducted on the Grassmann manifold. [sent-264, score-0.182]

97 Experiments on real datasets validate the effectiveness of the regularized SIR. [sent-265, score-0.088]

98 (2006) Manifold regularization: Ageometric framework for learning from labeled and unlabeled examples. [sent-274, score-0.046]

99 (1991) Sliced inverse regression for dimension reduction (with discussion). [sent-301, score-0.15]

100 (2005) RSIR: Regularized sliced inverse regression for motif discovery. [sent-352, score-0.251]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sir', 0.658), ('vec', 0.484), ('manifold', 0.199), ('sliced', 0.166), ('xxt', 0.161), ('grassmann', 0.151), ('xi', 0.139), ('xt', 0.139), ('xj', 0.105), ('gk', 0.103), ('predictors', 0.099), ('regularized', 0.088), ('regularization', 0.084), ('localized', 0.076), ('sirr', 0.076), ('cov', 0.072), ('ak', 0.07), ('hk', 0.067), ('xh', 0.066), ('geometry', 0.065), ('trace', 0.061), ('conjugate', 0.056), ('nl', 0.054), ('lg', 0.053), ('tk', 0.049), ('nanyang', 0.048), ('gt', 0.047), ('rsir', 0.045), ('xlx', 0.045), ('inverse', 0.045), ('laplacian', 0.041), ('dg', 0.041), ('regularizations', 0.041), ('dii', 0.041), ('regression', 0.04), ('gradient', 0.038), ('searching', 0.038), ('sdr', 0.036), ('fh', 0.036), ('visualization', 0.036), ('surely', 0.035), ('usps', 0.035), ('else', 0.035), ('dimension', 0.035), ('qa', 0.034), ('lemmas', 0.034), ('technological', 0.032), ('ch', 0.031), ('reduction', 0.03), ('aat', 0.03), ('bch', 0.03), ('elliptic', 0.03), ('lsir', 0.03), ('nonsingular', 0.03), ('slicing', 0.03), ('slsir', 0.03), ('srsir', 0.03), ('convergence', 0.03), ('wij', 0.029), ('projection', 0.028), ('xdx', 0.026), ('unlabeled', 0.026), ('subspace', 0.026), ('converges', 0.026), ('rate', 0.025), ('transductive', 0.025), ('ga', 0.025), ('lemma', 0.024), ('belkin', 0.024), ('predictable', 0.024), ('sin', 0.023), ('euclidian', 0.023), ('response', 0.022), ('singapore', 0.021), ('unlabelled', 0.021), ('transformations', 0.021), ('rotation', 0.021), ('cos', 0.021), ('graph', 0.02), ('labeled', 0.02), ('prove', 0.02), ('slices', 0.02), ('deterministic', 0.019), ('niyogi', 0.018), ('fa', 0.018), ('direction', 0.017), ('semisupervised', 0.017), ('proved', 0.017), ('summation', 0.017), ('utilizes', 0.017), ('columbia', 0.017), ('substituting', 0.017), ('geodesic', 0.016), ('kernel', 0.016), ('local', 0.016), ('conditions', 0.016), ('scalar', 0.016), ('exploiting', 0.015), ('li', 0.015), ('deal', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

Author: Wei Bian, Dacheng Tao

Abstract: In this paper, we study the manifold regularization for the Sliced Inverse Regression (SIR). The manifold regularization improves the standard SIR in two aspects: 1) it encodes the local geometry for SIR and 2) it enables SIR to deal with transductive and semi-supervised learning problems. We prove that the proposed graph Laplacian based regularization is convergent at rate root-n. The projection directions of the regularized SIR are optimized by using a conjugate gradient method on the Grassmann manifold. Experimental results support our theory.

2 0.11450823 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou

Abstract: We study the behavior of the popular Laplacian Regularization method for SemiSupervised Learning at the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in Rd , d 2, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the infinite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain Ω ⊂ Rd with a smooth boundary. Let y : Ω → Y be the unknown function we wish to estimate. In case of regression Y = R whereas in binary classification Y = {−1, 1}. The standard (transductive) semisupervised learning problem is formulated as follows: Given l labeled points, (x1 , y1 ), . . . , (xl , yl ), with yi = y(xi ), and u unlabeled points xl+1 , . . . , xl+u , with all points xi sampled i.i.d. from p(x), the goal is to construct an estimate of y(xl+i ) for any unlabeled point xl+i , utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is fixed and u → ∞. 1 2 SSL with Graph Laplacian Regularization We first consider the following graph-based approach formulated by Zhu et. al. [15]: y (x) = arg min In (y) ˆ subject to y(xi ) = yi , i = 1, . . . , l y where 1 n2 In (y) = Wi,j (y(xi ) − y(xj ))2 (1) (2) i,j is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W . This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points xi ∈ Rd are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. We consider the common situation where the similarities are obtained by applying some decay filter to the distances: xi −xj σ Wi,j = G (3) where G : R+ → R+ is some function with an adequately fast decay. Popular choices are the 2 Gaussian filter G(z) = e−z /2 or the ǫ-neighborhood graph obtained by the step filter G(z) = 1z<1 . For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term In (y), e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): lim In (y) = I (σ) (y) = n→∞ G Ω Ω x−x′ σ (y(x) − y(x′ ))2 p(x)p(x′ )dxdx′ . (4) In the above limit, the bandwidth σ is held fixed. Typically, one would also drive the bandwidth σ to zero as n → ∞. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term In (y), at least when σ → 0 slowly enough1 , namely when σ = ω( d log n/n). If σ → 0 as n → ∞, and as long as nσ d / log n → ∞, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: d lim d+2 In (y) n→∞ Cσ (σ) d (y) d+2 I σ→0 Cσ = lim ∇y(x) 2 p(x)2 dx = J(y) = (5) Ω where C = Rd z 2 G( z )dz, and assuming 0 < C < ∞ (which is the case for both the Gaussian and the step filters). This energy functional J(f ) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: y (x) = arg min J(y) ˆ subject to y(xi ) = yi , i = 1, . . . , l (6) y p When σ = o( d 1/n) then all non-diagonal weights Wi,j vanish (points no longer have any “close by” p neighbors). We are not aware of an analysis covering the regime where σ decays roughly as d 1/n, but would be surprised if a qualitatively different meaningful limit is reached. 1 2 3 Graph Laplacian Regularization in R1 We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and x ∈ R. We first consider the situation where the support of p(x) is a continuous interval Ω = [a, b] ⊂ R (a and/or b may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order a x1 < x2 < · · · < xl b. Applying the theory of variational calculus, the solution y (x) ˆ satisfies inside each interval (xi , xi+1 ) the Euler-Lagrange equation d dy p2 (x) = 0. dx dx Performing two integrations and enforcing the constraints at the labeled points yields y(x) = yi + x 1/p2 (t)dt xi (yi+1 xi+1 1/p2 (t)dt xi − yi ) for xi x xi+1 (7) with y(x) = x1 for a x x1 and y(x) = xl for xl x b. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: b 1 Theorem 1. Let p(x) be a smooth density on Ω = [a, b] ⊂ R such that Ap = 4 a 1/p2 (t)dt < ∞. 2 Then, J(f ) can be written as a squared semi-norm J(f ) = f Kp induced by the kernel x′ ′ Kp (x, x ) = Ap − 1 2 x with a null-space of all constant functions. That is, f the RKHS induced by Kp . 1 p2 (t) dt Kp . (8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. Furthermore, there is no dependence between points in disjoint components separated by zero density regions. 4 Graph Laplacian Regularization in Higher Dimensions The analysis of the previous section seems promising, at it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is not the case anymore. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points. Theorem 2. Let p(x) be a smooth density over Rd , d 2, bounded from above by some constant pmax , and let (x1 , y1 ), . . . , (xl , yl ) be any (non-repeating) set of labeled examples. There exist continuous functions yǫ (x), for any ǫ > 0, all satisfying the constraints yǫ (xj ) = yj , j = 1, . . . , l, such ǫ→0 ǫ→0 that J(yǫ ) −→ 0 but yǫ (x) −→ 0 for all x = xj , j = 1, . . . , l. Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at x0 = 0 with y(x0 ) = 0 and the second labeled point is at x1 with x1 = 1 and y(x1 ) = 1. In addition, we assume that the ball B1 (0) of radius one centered around the origin is contained in Ω = {x ∈ Rd | p(x) > 0}. We first consider the case d > 2. Here, for any ǫ > 0, consider the function x ǫ yǫ (x) = min ,1 which indeed satisfies the two constraints yǫ (xi ) = yi , i = 0, 1. Then, J(yǫ ) = Bǫ (0) p2 (x) dx ǫ2 pmax ǫ2 dx = p2 Vd ǫd−2 max (13) Bǫ (0) where Vd is the volume of a unit ball in Rd . Hence, the sequence of functions yǫ (x) satisfy the constraints, but for d > 2, inf ǫ J(yǫ ) = 0. For d = 2, a more extreme example is necessary: consider the functions 2 x yǫ (x) = log +ǫ ǫ log 1+ǫ ǫ for x 1 and yǫ (x) = 1 for x > 1. These functions satisfy the two constraints yǫ (xi ) = yi , i = 0, 1 and: J(yǫ ) = 4 h “ ”i 1+ǫ 2 log ǫ 4πp2 max h “ ”i 1+ǫ 2 log ǫ x B1 (0) log ( x 1+ǫ ǫ 2 2 +ǫ)2 p2 (x)dx 4p2 h “ max ”i2 1+ǫ log ǫ 4πp2 max ǫ→0 = −→ 0. log 1+ǫ ǫ 4 1 0 r2 (r 2 +ǫ)2 2πrdr The implication of Theorem 2 is that regardless of the values at the labeled points, as u → ∞, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel Kp , as in the onedimensional case, is not possible. Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in l 1 y = arg min ˆ (y(xj ) − yj )2 + γIn (y) y(x) l j=1 In the limit of infinite unlabeled data, functions of the form yǫ (x) above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term J(y) to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points. 4.1 Numerical Example We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density p(x) in R2 , which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at (4, 0). We sample a total of n = 3000 points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with σ = 0.4. Figure 1 depicts the predictor y (x) obtained from (1). In fact, two different predictors are shown, ˆ obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution y (x) of (1) satisfies: ˆ n y (xi ) = ˆ n Wij y (xj ) / ˆ j=1 Wij on all unlabeled points i = l + 1, . . . , l + u. (14) j=1 Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for y (xi ), ˆ which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on x1 , . . . , xl . This is the method used in the bottom panels of Figure 1. As predicted, y (x) is almost constant for almost all unlabeled points. Although all values are very ˆ close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known apriori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%. The differences between the two numerical methods for solving (1) also point out to another problem with the ill-posedness of the limit problem: the solution is numerically very un-stable. A more quantitative evaluation, that also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth σ, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error, and the classification error of the resulting predictor y (x). We plot the classification error both when a threshold ˆ of zero is used (i.e. the class is determined by sign(ˆ(x))) and with the ideal threshold minimizing y the test error. For each unlabeled sample size, we choose the bandwidth σ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly. Note that 5 DIRECT INVERSION SQUARED ERROR SIGN ERROR: 45% OPTIMAL BANDWIDTH 1 0.9 1 5 0 4 2 0.85 y(x) > 0 y(x) < 0 6 0.95 10 0 0 −1 10 0 200 400 600 800 0−1 ERROR (THRESHOLD=0) 0.32 −5 10 0 5 −10 0 −10 −5 −5 0 5 10 10 1 0 0 200 400 600 800 OPTIMAL BANDWIDTH 0.5 0 0 200 400 600 800 0−1 ERROR (IDEAL THRESHOLD) 0.19 5 200 400 600 800 OPTIMAL BANDWIDTH 1 0.28 SIGN ERR: 17.1 0.3 0.26 POWER METHOD 0 1.5 8 0 0.18 −1 10 6 0.17 4 −5 10 0 5 −10 0 −5 −10 −5 0 5 10 Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations - the solution has not yet converged, despite small residuals of the order of 2 · 10−4 . 0.16 0 200 400 600 800 2 0 200 400 600 800 Figure 2: Squared error (top), classification error with a threshold of zero (center) and minimal classification error using ideal threhold (bottom), of the minimizer of (1) as a function of number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted. ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even flatter. 4.2 Probabilistic Interpretation, Exit and Hitting Times As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let x(t) denote a random walk on the graph with transition matrix M = D−1 W where D is a diagonal matrix with Dii = j Wij . Then, for the binary classification case with yi = ±1 we have [15]: y (xi ) = 2 Pr x(t) hits a point labeled +1 before hitting a point labeled -1 x(0) = xi − 1 ˆ We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and σ → 0, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location xi , yielding a flat predictor. 5 Keeping σ Finite At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit σ → 0. One possible objection is that there is an intrinsic characteristic scale for the data σ0 where (with high probability) all points at a distance xi − xj < σ0 have the same label. If this is the case, then it may not necessarily make sense to take values of σ < σ0 in constructing W . However, keeping σ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function y(x) which is zero everywhere except at the labeled points, where y(xj ) = yj . With a finite number of labeled points of measure zero, I (σ) (y) = 0 in any dimension 6 50 points 500 points 3500 points 1 1 0.5 0.5 0.5 0 0 0 −0.5 y 1 −0.5 −0.5 −1 −2 0 2 4 6 −1 −2 0 2 4 6 −1 −2 0 2 4 6 x Figure 3: Minimizer of (1) for a 1-d problem with a fixed σ = 0.4, two labeled points and an increasing number of unlabeled points. and for any fixed σ > 0. While this limiting function is discontinuous, it is also possible to construct ǫ→0 a sequence of continuous functions yǫ that all satisfy the constraints and for which I (σ) (yǫ ) −→ 0. This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at x = 4, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed σ, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as u → ∞. 6 Fourier-Eigenvector Based Methods Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3]. Instead of using the Laplacian as a regularizer, constraining candidate predictors y(x) non-parametrically to those with small In (y) values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian: The similarity matrix W is computed as before, and the Graph Laplacian matrix L = D − W is considered (recall D is a diagonal matrix with Dii = j Wij ). Only predictors p j=1 aj ej y (x) = ˆ (15) spanned by the first p eigenvectors e1 , . . . , ep of L (with smallest eigenvalues) are considered. The coefficients aj are chosen by minimizing a loss function on the labeled data, e.g. the squared loss: (ˆ1 , . . . , ap ) = arg min a ˆ l j=1 (yj − y (xj ))2 . ˆ (16) Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit u → ∞. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4]. Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions. Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well can the unknown classification function be approximated by a truncated expansion with relatively few eigenvectors. We illustrate this issue with the following three-dimensional example: Let p(x) denote the uniform density in the box [0, 1] × [0, 0.8] × [0, 0.6], where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, y1 (x) = 1x1 >0.5 , y2 (x) = 1x1 >x2 /0.8 and y3 (x) = 1x2 /0.8>x3 /0.6 . Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors p is set to p = l/5 as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of p selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method). 7 Prediction Error (%) p = #labeled points/5 40 optimal p 20 labeled points 40 Approx. Error 50 20 20 0 20 20 40 60 # labeled points 0 10 20 40 60 # labeled points 0 0 5 10 15 # eigenvectors 0 0 5 10 15 # eigenvectors Figure 4: Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: best fit when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors. The reason for this behavior is that y2 (x) and even more so y3 (x) cannot be as easily approximated by the very few leading eigenfunctions—even though they seem “simple” and “smooth”, they are significantly more complicated than y1 (x) in terms of measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form ψi,j,k (x) = cos(iπx1 ) cos(jπx2 /0.8) cos(kπx3 /0.6), making it hard to represent simple decision boundaries which are not axis-aligned. 7 Discussion Our results show that a popular SSL method, the Laplacian Regularization method (1), is not wellbehaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons. First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the number of both labeled and unlabeled points grow while l/u is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance. Second, in our experiments we noticed that although the predictor y (x) becomes extremely flat, in ˆ binary tasks, it is still typically possible to find a threshold leading to a good classification performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it. Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression. The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least d+1 derivatives in Rd . In the 2 context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the d+1 graph Laplacian matrix L = D − W , where D is the diagonal degree matrix, with L 2 (matrix to d+1 the 2 power). In the infinite unlabeled data limit, this corresponds to regularizing all order- d+1 2 (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather then ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL. A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the “smoothness” of y(x) relative to p(x) remain unclear. Acknowledgments The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06). 8 References [1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975. [2] A. Azran, The rendevous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007. [3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classification, NIPS, vol. 15, 2003. [4] M. Belkin and P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS 19, 2007. [5] M. Belkin, P. Niyogi and S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006. [6] Y. Bengio, O. Delalleau, N. Le Roux, label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Scholkopf and Zien, editors, MIT Press, 2006. [7] O. Bosquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004. [8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006. [9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008. [10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008. [11] M. Meila, J. Shi. A random walks view of spectral segmentation, AI and Statistics, 2001. [12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006. [13] B. Sch¨ lkopf, A. Smola, Learning with Kernels, MIT Press, 2002. o [14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Sch¨ lkopf, Learning with local and global consistency, o NIPS, vol. 16, 2004. [15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian fields and harmonic functions, ICML, 2003. 9

3 0.11054564 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos

Abstract: This paper considers a sensitivity analysis in Hidden Markov Models with continuous state and observation spaces. We propose an Infinitesimal Perturbation Analysis (IPA) on the filtering distribution with respect to some parameters of the model. We describe a methodology for using any algorithm that estimates the filtering density, such as Sequential Monte Carlo methods, to design an algorithm that estimates its gradient. The resulting IPA estimator is proven to be asymptotically unbiased, consistent and has computational complexity linear in the number of particles. We consider an application of this analysis to the problem of identifying unknown parameters of the model given a sequence of observations. We derive an IPA estimator for the gradient of the log-likelihood, which may be used in a gradient method for the purpose of likelihood maximization. We illustrate the method with several numerical experiments.

4 0.10066563 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

Author: Kwang I. Kim, Florian Steinke, Matthias Hein

Abstract: Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression which overcomes both these problems. If the data lies on or close to a low-dimensional submanifold in feature space, the Hessian energy prefers functions whose values vary linearly with respect to geodesic distance. We first derive the Hessian energy for smooth manifolds and continue to give a stable estimation procedure for the common case where only samples of the underlying manifold are given. The preference of ‘’linear” functions on manifolds renders the Hessian energy particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to find a user-defined embedding function given some labeled points which varies smoothly (and ideally linearly) along the manifold. The experimental results suggest superior performance of our method compared with semi-supervised regression using Laplacian regularization or standard supervised regression techniques applied to this task. 1

5 0.080554113 182 nips-2009-Optimal Scoring for Unsupervised Learning

Author: Zhihua Zhang, Guang Dai

Abstract: We are often interested in casting classification and clustering problems as a regression framework, because it is feasible to achieve some statistical properties in this framework by imposing some penalty criteria. In this paper we illustrate optimal scoring, which was originally proposed for performing the Fisher linear discriminant analysis by regression, in the application of unsupervised learning. In particular, we devise a novel clustering algorithm that we call optimal discriminant clustering. We associate our algorithm with the existing unsupervised learning algorithms such as spectral clustering, discriminative clustering and sparse principal component analysis. Experimental results on a collection of benchmark datasets validate the effectiveness of the optimal discriminant clustering algorithm.

6 0.071992107 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

7 0.071308643 137 nips-2009-Learning transport operators for image manifolds

8 0.071077839 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

9 0.06726785 178 nips-2009-On Stochastic and Worst-case Models for Investing

10 0.065721691 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine

11 0.06482245 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

12 0.061781351 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

13 0.061383694 154 nips-2009-Modeling the spacing effect in sequential category learning

14 0.060705908 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

15 0.060627274 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

16 0.059483521 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

17 0.051446643 27 nips-2009-Adaptive Regularization of Weight Vectors

18 0.049673364 128 nips-2009-Learning Non-Linear Combinations of Kernels

19 0.049583867 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

20 0.049226791 177 nips-2009-On Learning Rotations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.142), (1, 0.114), (2, 0.007), (3, 0.07), (4, 0.044), (5, 0.055), (6, 0.036), (7, 0.002), (8, -0.004), (9, 0.022), (10, 0.005), (11, 0.12), (12, -0.017), (13, -0.072), (14, -0.035), (15, -0.099), (16, -0.065), (17, -0.002), (18, -0.064), (19, 0.007), (20, -0.033), (21, -0.018), (22, 0.032), (23, 0.012), (24, -0.005), (25, 0.006), (26, 0.12), (27, 0.014), (28, -0.057), (29, 0.025), (30, 0.059), (31, -0.003), (32, 0.106), (33, -0.04), (34, -0.056), (35, -0.081), (36, -0.077), (37, 0.041), (38, 0.07), (39, 0.062), (40, 0.094), (41, 0.055), (42, 0.07), (43, 0.115), (44, 0.048), (45, -0.011), (46, -0.003), (47, 0.079), (48, -0.055), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94590849 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

Author: Wei Bian, Dacheng Tao

Abstract: In this paper, we study the manifold regularization for the Sliced Inverse Regression (SIR). The manifold regularization improves the standard SIR in two aspects: 1) it encodes the local geometry for SIR and 2) it enables SIR to deal with transductive and semi-supervised learning problems. We prove that the proposed graph Laplacian based regularization is convergent at rate root-n. The projection directions of the regularized SIR are optimized by using a conjugate gradient method on the Grassmann manifold. Experimental results support our theory.

2 0.6874944 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

Author: Kwang I. Kim, Florian Steinke, Matthias Hein

Abstract: Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression which overcomes both these problems. If the data lies on or close to a low-dimensional submanifold in feature space, the Hessian energy prefers functions whose values vary linearly with respect to geodesic distance. We first derive the Hessian energy for smooth manifolds and continue to give a stable estimation procedure for the common case where only samples of the underlying manifold are given. The preference of ‘’linear” functions on manifolds renders the Hessian energy particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to find a user-defined embedding function given some labeled points which varies smoothly (and ideally linearly) along the manifold. The experimental results suggest superior performance of our method compared with semi-supervised regression using Laplacian regularization or standard supervised regression techniques applied to this task. 1

3 0.6078828 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine

Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang

Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependant partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependant and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms. 1

4 0.59023434 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou

Abstract: We study the behavior of the popular Laplacian Regularization method for SemiSupervised Learning at the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in Rd , d 2, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the infinite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain Ω ⊂ Rd with a smooth boundary. Let y : Ω → Y be the unknown function we wish to estimate. In case of regression Y = R whereas in binary classification Y = {−1, 1}. The standard (transductive) semisupervised learning problem is formulated as follows: Given l labeled points, (x1 , y1 ), . . . , (xl , yl ), with yi = y(xi ), and u unlabeled points xl+1 , . . . , xl+u , with all points xi sampled i.i.d. from p(x), the goal is to construct an estimate of y(xl+i ) for any unlabeled point xl+i , utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is fixed and u → ∞. 1 2 SSL with Graph Laplacian Regularization We first consider the following graph-based approach formulated by Zhu et. al. [15]: y (x) = arg min In (y) ˆ subject to y(xi ) = yi , i = 1, . . . , l y where 1 n2 In (y) = Wi,j (y(xi ) − y(xj ))2 (1) (2) i,j is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W . This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points xi ∈ Rd are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. We consider the common situation where the similarities are obtained by applying some decay filter to the distances: xi −xj σ Wi,j = G (3) where G : R+ → R+ is some function with an adequately fast decay. Popular choices are the 2 Gaussian filter G(z) = e−z /2 or the ǫ-neighborhood graph obtained by the step filter G(z) = 1z<1 . For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term In (y), e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): lim In (y) = I (σ) (y) = n→∞ G Ω Ω x−x′ σ (y(x) − y(x′ ))2 p(x)p(x′ )dxdx′ . (4) In the above limit, the bandwidth σ is held fixed. Typically, one would also drive the bandwidth σ to zero as n → ∞. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term In (y), at least when σ → 0 slowly enough1 , namely when σ = ω( d log n/n). If σ → 0 as n → ∞, and as long as nσ d / log n → ∞, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: d lim d+2 In (y) n→∞ Cσ (σ) d (y) d+2 I σ→0 Cσ = lim ∇y(x) 2 p(x)2 dx = J(y) = (5) Ω where C = Rd z 2 G( z )dz, and assuming 0 < C < ∞ (which is the case for both the Gaussian and the step filters). This energy functional J(f ) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: y (x) = arg min J(y) ˆ subject to y(xi ) = yi , i = 1, . . . , l (6) y p When σ = o( d 1/n) then all non-diagonal weights Wi,j vanish (points no longer have any “close by” p neighbors). We are not aware of an analysis covering the regime where σ decays roughly as d 1/n, but would be surprised if a qualitatively different meaningful limit is reached. 1 2 3 Graph Laplacian Regularization in R1 We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and x ∈ R. We first consider the situation where the support of p(x) is a continuous interval Ω = [a, b] ⊂ R (a and/or b may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order a x1 < x2 < · · · < xl b. Applying the theory of variational calculus, the solution y (x) ˆ satisfies inside each interval (xi , xi+1 ) the Euler-Lagrange equation d dy p2 (x) = 0. dx dx Performing two integrations and enforcing the constraints at the labeled points yields y(x) = yi + x 1/p2 (t)dt xi (yi+1 xi+1 1/p2 (t)dt xi − yi ) for xi x xi+1 (7) with y(x) = x1 for a x x1 and y(x) = xl for xl x b. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: b 1 Theorem 1. Let p(x) be a smooth density on Ω = [a, b] ⊂ R such that Ap = 4 a 1/p2 (t)dt < ∞. 2 Then, J(f ) can be written as a squared semi-norm J(f ) = f Kp induced by the kernel x′ ′ Kp (x, x ) = Ap − 1 2 x with a null-space of all constant functions. That is, f the RKHS induced by Kp . 1 p2 (t) dt Kp . (8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. Furthermore, there is no dependence between points in disjoint components separated by zero density regions. 4 Graph Laplacian Regularization in Higher Dimensions The analysis of the previous section seems promising, at it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is not the case anymore. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points. Theorem 2. Let p(x) be a smooth density over Rd , d 2, bounded from above by some constant pmax , and let (x1 , y1 ), . . . , (xl , yl ) be any (non-repeating) set of labeled examples. There exist continuous functions yǫ (x), for any ǫ > 0, all satisfying the constraints yǫ (xj ) = yj , j = 1, . . . , l, such ǫ→0 ǫ→0 that J(yǫ ) −→ 0 but yǫ (x) −→ 0 for all x = xj , j = 1, . . . , l. Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at x0 = 0 with y(x0 ) = 0 and the second labeled point is at x1 with x1 = 1 and y(x1 ) = 1. In addition, we assume that the ball B1 (0) of radius one centered around the origin is contained in Ω = {x ∈ Rd | p(x) > 0}. We first consider the case d > 2. Here, for any ǫ > 0, consider the function x ǫ yǫ (x) = min ,1 which indeed satisfies the two constraints yǫ (xi ) = yi , i = 0, 1. Then, J(yǫ ) = Bǫ (0) p2 (x) dx ǫ2 pmax ǫ2 dx = p2 Vd ǫd−2 max (13) Bǫ (0) where Vd is the volume of a unit ball in Rd . Hence, the sequence of functions yǫ (x) satisfy the constraints, but for d > 2, inf ǫ J(yǫ ) = 0. For d = 2, a more extreme example is necessary: consider the functions 2 x yǫ (x) = log +ǫ ǫ log 1+ǫ ǫ for x 1 and yǫ (x) = 1 for x > 1. These functions satisfy the two constraints yǫ (xi ) = yi , i = 0, 1 and: J(yǫ ) = 4 h “ ”i 1+ǫ 2 log ǫ 4πp2 max h “ ”i 1+ǫ 2 log ǫ x B1 (0) log ( x 1+ǫ ǫ 2 2 +ǫ)2 p2 (x)dx 4p2 h “ max ”i2 1+ǫ log ǫ 4πp2 max ǫ→0 = −→ 0. log 1+ǫ ǫ 4 1 0 r2 (r 2 +ǫ)2 2πrdr The implication of Theorem 2 is that regardless of the values at the labeled points, as u → ∞, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel Kp , as in the onedimensional case, is not possible. Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in l 1 y = arg min ˆ (y(xj ) − yj )2 + γIn (y) y(x) l j=1 In the limit of infinite unlabeled data, functions of the form yǫ (x) above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term J(y) to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points. 4.1 Numerical Example We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density p(x) in R2 , which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at (4, 0). We sample a total of n = 3000 points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with σ = 0.4. Figure 1 depicts the predictor y (x) obtained from (1). In fact, two different predictors are shown, ˆ obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution y (x) of (1) satisfies: ˆ n y (xi ) = ˆ n Wij y (xj ) / ˆ j=1 Wij on all unlabeled points i = l + 1, . . . , l + u. (14) j=1 Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for y (xi ), ˆ which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on x1 , . . . , xl . This is the method used in the bottom panels of Figure 1. As predicted, y (x) is almost constant for almost all unlabeled points. Although all values are very ˆ close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known apriori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%. The differences between the two numerical methods for solving (1) also point out to another problem with the ill-posedness of the limit problem: the solution is numerically very un-stable. A more quantitative evaluation, that also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth σ, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error, and the classification error of the resulting predictor y (x). We plot the classification error both when a threshold ˆ of zero is used (i.e. the class is determined by sign(ˆ(x))) and with the ideal threshold minimizing y the test error. For each unlabeled sample size, we choose the bandwidth σ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly. Note that 5 DIRECT INVERSION SQUARED ERROR SIGN ERROR: 45% OPTIMAL BANDWIDTH 1 0.9 1 5 0 4 2 0.85 y(x) > 0 y(x) < 0 6 0.95 10 0 0 −1 10 0 200 400 600 800 0−1 ERROR (THRESHOLD=0) 0.32 −5 10 0 5 −10 0 −10 −5 −5 0 5 10 10 1 0 0 200 400 600 800 OPTIMAL BANDWIDTH 0.5 0 0 200 400 600 800 0−1 ERROR (IDEAL THRESHOLD) 0.19 5 200 400 600 800 OPTIMAL BANDWIDTH 1 0.28 SIGN ERR: 17.1 0.3 0.26 POWER METHOD 0 1.5 8 0 0.18 −1 10 6 0.17 4 −5 10 0 5 −10 0 −5 −10 −5 0 5 10 Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations - the solution has not yet converged, despite small residuals of the order of 2 · 10−4 . 0.16 0 200 400 600 800 2 0 200 400 600 800 Figure 2: Squared error (top), classification error with a threshold of zero (center) and minimal classification error using ideal threhold (bottom), of the minimizer of (1) as a function of number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted. ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even flatter. 4.2 Probabilistic Interpretation, Exit and Hitting Times As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let x(t) denote a random walk on the graph with transition matrix M = D−1 W where D is a diagonal matrix with Dii = j Wij . Then, for the binary classification case with yi = ±1 we have [15]: y (xi ) = 2 Pr x(t) hits a point labeled +1 before hitting a point labeled -1 x(0) = xi − 1 ˆ We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and σ → 0, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location xi , yielding a flat predictor. 5 Keeping σ Finite At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit σ → 0. One possible objection is that there is an intrinsic characteristic scale for the data σ0 where (with high probability) all points at a distance xi − xj < σ0 have the same label. If this is the case, then it may not necessarily make sense to take values of σ < σ0 in constructing W . However, keeping σ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function y(x) which is zero everywhere except at the labeled points, where y(xj ) = yj . With a finite number of labeled points of measure zero, I (σ) (y) = 0 in any dimension 6 50 points 500 points 3500 points 1 1 0.5 0.5 0.5 0 0 0 −0.5 y 1 −0.5 −0.5 −1 −2 0 2 4 6 −1 −2 0 2 4 6 −1 −2 0 2 4 6 x Figure 3: Minimizer of (1) for a 1-d problem with a fixed σ = 0.4, two labeled points and an increasing number of unlabeled points. and for any fixed σ > 0. While this limiting function is discontinuous, it is also possible to construct ǫ→0 a sequence of continuous functions yǫ that all satisfy the constraints and for which I (σ) (yǫ ) −→ 0. This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at x = 4, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed σ, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as u → ∞. 6 Fourier-Eigenvector Based Methods Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3]. Instead of using the Laplacian as a regularizer, constraining candidate predictors y(x) non-parametrically to those with small In (y) values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian: The similarity matrix W is computed as before, and the Graph Laplacian matrix L = D − W is considered (recall D is a diagonal matrix with Dii = j Wij ). Only predictors p j=1 aj ej y (x) = ˆ (15) spanned by the first p eigenvectors e1 , . . . , ep of L (with smallest eigenvalues) are considered. The coefficients aj are chosen by minimizing a loss function on the labeled data, e.g. the squared loss: (ˆ1 , . . . , ap ) = arg min a ˆ l j=1 (yj − y (xj ))2 . ˆ (16) Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit u → ∞. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4]. Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions. Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well can the unknown classification function be approximated by a truncated expansion with relatively few eigenvectors. We illustrate this issue with the following three-dimensional example: Let p(x) denote the uniform density in the box [0, 1] × [0, 0.8] × [0, 0.6], where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, y1 (x) = 1x1 >0.5 , y2 (x) = 1x1 >x2 /0.8 and y3 (x) = 1x2 /0.8>x3 /0.6 . Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors p is set to p = l/5 as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of p selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method). 7 Prediction Error (%) p = #labeled points/5 40 optimal p 20 labeled points 40 Approx. Error 50 20 20 0 20 20 40 60 # labeled points 0 10 20 40 60 # labeled points 0 0 5 10 15 # eigenvectors 0 0 5 10 15 # eigenvectors Figure 4: Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: best fit when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors. The reason for this behavior is that y2 (x) and even more so y3 (x) cannot be as easily approximated by the very few leading eigenfunctions—even though they seem “simple” and “smooth”, they are significantly more complicated than y1 (x) in terms of measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form ψi,j,k (x) = cos(iπx1 ) cos(jπx2 /0.8) cos(kπx3 /0.6), making it hard to represent simple decision boundaries which are not axis-aligned. 7 Discussion Our results show that a popular SSL method, the Laplacian Regularization method (1), is not wellbehaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons. First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the number of both labeled and unlabeled points grow while l/u is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance. Second, in our experiments we noticed that although the predictor y (x) becomes extremely flat, in ˆ binary tasks, it is still typically possible to find a threshold leading to a good classification performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it. Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression. The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least d+1 derivatives in Rd . In the 2 context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the d+1 graph Laplacian matrix L = D − W , where D is the diagonal degree matrix, with L 2 (matrix to d+1 the 2 power). In the infinite unlabeled data limit, this corresponds to regularizing all order- d+1 2 (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather then ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL. A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the “smoothness” of y(x) relative to p(x) remain unclear. Acknowledgments The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06). 8 References [1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975. [2] A. Azran, The rendevous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007. [3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classification, NIPS, vol. 15, 2003. [4] M. Belkin and P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS 19, 2007. [5] M. Belkin, P. Niyogi and S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006. [6] Y. Bengio, O. Delalleau, N. Le Roux, label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Scholkopf and Zien, editors, MIT Press, 2006. [7] O. Bosquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004. [8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006. [9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008. [10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008. [11] M. Meila, J. Shi. A random walks view of spectral segmentation, AI and Statistics, 2001. [12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006. [13] B. Sch¨ lkopf, A. Smola, Learning with Kernels, MIT Press, 2002. o [14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Sch¨ lkopf, Learning with local and global consistency, o NIPS, vol. 16, 2004. [15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian fields and harmonic functions, ICML, 2003. 9

5 0.57078594 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

Author: Yusuke Watanabe, Kenji Fukumizu

Abstract: We propose a new approach to the analysis of Loopy Belief Propagation (LBP) by establishing a formula that connects the Hessian of the Bethe free energy with the edge zeta function. The formula has a number of theoretical implications on LBP. It is applied to give a sufficient condition that the Hessian of the Bethe free energy is positive definite, which shows non-convexity for graphs with multiple cycles. The formula clarifies the relation between the local stability of a fixed point of LBP and local minima of the Bethe free energy. We also propose a new approach to the uniqueness of LBP fixed point, and show various conditions of uniqueness. 1

6 0.50852895 177 nips-2009-On Learning Rotations

7 0.49965048 215 nips-2009-Sensitivity analysis in HMMs with application to likelihood maximization

8 0.47285143 27 nips-2009-Adaptive Regularization of Weight Vectors

9 0.45285919 238 nips-2009-Submanifold density estimation

10 0.45007503 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

11 0.42854398 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

12 0.42626336 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

13 0.39307424 154 nips-2009-Modeling the spacing effect in sequential category learning

14 0.3883073 178 nips-2009-On Stochastic and Worst-case Models for Investing

15 0.38218457 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

16 0.3771849 137 nips-2009-Learning transport operators for image manifolds

17 0.37010199 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

18 0.36908919 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

19 0.36003751 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

20 0.35755938 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(24, 0.042), (25, 0.035), (35, 0.043), (36, 0.112), (39, 0.03), (58, 0.075), (61, 0.015), (71, 0.065), (80, 0.011), (81, 0.031), (86, 0.077), (96, 0.328)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.76345879 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

Author: Wei Bian, Dacheng Tao

Abstract: In this paper, we study the manifold regularization for the Sliced Inverse Regression (SIR). The manifold regularization improves the standard SIR in two aspects: 1) it encodes the local geometry for SIR and 2) it enables SIR to deal with transductive and semi-supervised learning problems. We prove that the proposed graph Laplacian based regularization is convergent at rate root-n. The projection directions of the regularized SIR are optimized by using a conjugate gradient method on the Grassmann manifold. Experimental results support our theory.

2 0.69833612 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models

Author: Baback Moghaddam, Emtiyaz Khan, Kevin P. Murphy, Benjamin M. Marlin

Abstract: We make several contributions in accelerating approximate Bayesian structural inference for non-decomposable GGMs. Our first contribution is to show how to efficiently compute a BIC or Laplace approximation to the marginal likelihood of non-decomposable graphs using convex methods for precision matrix estimation. This optimization technique can be used as a fast scoring function inside standard Stochastic Local Search (SLS) for generating posterior samples. Our second contribution is a novel framework for efficiently generating large sets of high-quality graph topologies without performing local search. This graph proposal method, which we call “Neighborhood Fusion” (NF), samples candidate Markov blankets at each node using sparse regression techniques. Our third contribution is a hybrid method combining the complementary strengths of NF and SLS. Experimental results in structural recovery and prediction tasks demonstrate that NF and hybrid NF/SLS out-perform state-of-the-art local search methods, on both synthetic and real-world datasets, when realistic computational limits are imposed.

3 0.67721182 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies

Author: Arno Onken, Steffen Grünewälder, Klaus Obermayer

Abstract: The linear correlation coefficient is typically used to characterize and analyze dependencies of neural spike counts. Here, we show that the correlation coefficient is in general insufficient to characterize these dependencies. We construct two neuron spike count models with Poisson-like marginals and vary their dependence structure using copulas. To this end, we construct a copula that allows to keep the spike counts uncorrelated while varying their dependence strength. Moreover, we employ a network of leaky integrate-and-fire neurons to investigate whether weakly correlated spike counts with strong dependencies are likely to occur in real networks. We find that the entropy of uncorrelated but dependent spike count distributions can deviate from the corresponding distribution with independent components by more than 25 % and that weakly correlated but strongly dependent spike counts are very likely to occur in biological networks. Finally, we introduce a test for deciding whether the dependence structure of distributions with Poissonlike marginals is well characterized by the linear correlation coefficient and verify it for different copula-based models. 1

4 0.64503253 154 nips-2009-Modeling the spacing effect in sequential category learning

Author: Hongjing Lu, Matthew Weiden, Alan L. Yuille

Abstract: We develop a Bayesian sequential model for category learning. The sequential model updates two category parameters, the mean and the variance, over time. We define conjugate temporal priors to enable closed form solutions to be obtained. This model can be easily extended to supervised and unsupervised learning involving multiple categories. To model the spacing effect, we introduce a generic prior in the temporal updating stage to capture a learning preference, namely, less change for repetition and more change for variation. Finally, we show how this approach can be generalized to efficiently perform model selection to decide whether observations are from one or multiple categories.

5 0.63619542 94 nips-2009-Fast Learning from Non-i.i.d. Observations

Author: Ingo Steinwart, Andreas Christmann

Abstract: We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case. 1

6 0.48024666 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions

7 0.47988194 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

8 0.47938389 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

9 0.47783992 260 nips-2009-Zero-shot Learning with Semantic Output Codes

10 0.47766036 72 nips-2009-Distribution Matching for Transduction

11 0.47764134 129 nips-2009-Learning a Small Mixture of Trees

12 0.4728533 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models

13 0.47061062 97 nips-2009-Free energy score space

14 0.47055051 70 nips-2009-Discriminative Network Models of Schizophrenia

15 0.47045299 104 nips-2009-Group Sparse Coding

16 0.47030523 71 nips-2009-Distribution-Calibrated Hierarchical Classification

17 0.47018933 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference

18 0.46991634 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

19 0.46952167 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes

20 0.46929175 128 nips-2009-Learning Non-Linear Combinations of Kernels