nips nips2009 nips2009-182 knowledge-graph by maker-knowledge-mining

182 nips-2009-Optimal Scoring for Unsupervised Learning


Source: pdf

Author: Zhihua Zhang, Guang Dai

Abstract: We are often interested in casting classification and clustering problems as a regression framework, because it is feasible to achieve some statistical properties in this framework by imposing some penalty criteria. In this paper we illustrate optimal scoring, which was originally proposed for performing the Fisher linear discriminant analysis by regression, in the application of unsupervised learning. In particular, we devise a novel clustering algorithm that we call optimal discriminant clustering. We associate our algorithm with the existing unsupervised learning algorithms such as spectral clustering, discriminative clustering and sparse principal component analysis. Experimental results on a collection of benchmark datasets validate the effectiveness of the optimal discriminant clustering algorithm.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In this paper we illustrate optimal scoring, which was originally proposed for performing the Fisher linear discriminant analysis by regression, in the application of unsupervised learning. [sent-2, score-0.236]

2 In particular, we devise a novel clustering algorithm that we call optimal discriminant clustering. [sent-3, score-0.407]

3 We associate our algorithm with the existing unsupervised learning algorithms such as spectral clustering, discriminative clustering and sparse principal component analysis. [sent-4, score-0.409]

4 Experimental results on a collection of benchmark datasets validate the effectiveness of the optimal discriminant clustering algorithm. [sent-5, score-0.399]

5 1 Introduction The Fisher linear discriminant analysis (LDA) is a classical method that considers dimensionality reduction and classification jointly. [sent-6, score-0.187]

6 LDA estimates a low-dimensional discriminative space defined by linear transformations through maximizing the ratio of between-class scatter to within-class scatter. [sent-7, score-0.096]

7 It is of great interest to obtain a similar relationship in multi-class problems. [sent-9, score-0.037]

8 This provides another approach to performing LDA by regression, in which penalty criteria can be tractably introduced to achieve desirable statistical properties, as in regularized LDA [5] and sparse discriminant analysis [2]. [sent-11, score-0.188]

9 It is also desirable to explore unsupervised learning problems in a regression framework. [sent-12, score-0.078]

10 [17] reformulated principal component analysis (PCA) as a regression problem and then devised a sparse PCA by imposing the lasso (the elastic net) penalty [10, 16] on the regression vector. [sent-14, score-0.153]

11 In this paper we consider unsupervised learning problems by optimal scoring, which was originally proposed to perform LDA by regression [6]. [sent-15, score-0.111]

12 In particular, we devise a novel unsupervised framework by using the optimal scoring and the ridge penalty. [sent-16, score-0.289]

13 This framework can be used for dimensionality reduction and clustering simultaneously. [sent-17, score-0.241]

14 In particular, we propose a clustering algorithm that we call optimal discriminant clustering (ODC). [sent-19, score-0.56]

15 Moreover, we establish a connection of our clustering algorithm with discriminative clustering algorithms [3, 13] and spectral clustering algorithms [7, 15]. [sent-20, score-0.731]

16 This implies that we can cast these clustering algorithms as regression-type problems. [sent-21, score-0.205]

17 In turn, this facilitates the introduction of penalty terms such as the lasso and elastic net so that we have sparse unsupervised learning algorithms. [sent-22, score-0.136]

18 Throughout this paper, Im denotes the m×m identity matrix, 1m the m×1 vector of ones, 0 the zero vector or matrix of appropriate size, and Hm = Im − (1/m) 1m 1m^T the m×m centering matrix. [sent-23, score-0.044]

19 For a = (a1, . . . , am)^T, diag(a) represents the m×m diagonal matrix with a1, . . . , am as its diagonal entries. [sent-27, score-0.061]

20 For an m×m matrix A = [aij ], we let A^+ be the Moore-Penrose inverse of A, tr(A) be the trace of A, rk(A) be the rank of A and ||A||_F = [tr(A^T A)]^(1/2) be the Frobenius norm of A. [sent-31, score-0.058]
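To make this notation concrete, here is a small numpy sketch (my own illustration, not code from the paper) of the centering matrix, Moore-Penrose inverse, rank and Frobenius norm:

```python
import numpy as np

def centering_matrix(m: int) -> np.ndarray:
    """H_m = I_m - (1/m) 1_m 1_m^T; left-multiplying by H_m removes the column means."""
    return np.eye(m) - np.ones((m, m)) / m

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
A_pinv = np.linalg.pinv(A)          # Moore-Penrose inverse A^+
rank = np.linalg.matrix_rank(A)     # rk(A)
fro = np.sqrt(np.trace(A.T @ A))    # ||A||_F = [tr(A^T A)]^(1/2), same as np.linalg.norm(A, 'fro')
```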

21 Given data points {x1, . . . , xn} ⊂ X ⊂ Rp, we assume that the xi are grouped into c disjoint classes and that each xi belongs to one class. [sent-36, score-0.044]

22 Let V = {1, . . . , n} denote the index set of the data points xi and partition V into c disjoint subsets Vj. [sent-40, score-0.033]

23 That is, Vi ∩ Vj = ∅ for i ≠ j and ∪_{j=1}^{c} Vj = V, where the cardinality of Vj is nj so that Σ_{j=1}^{c} nj = n. [sent-42, score-0.108]

24 We also make use of a matrix representation for the problem in question. [sent-43, score-0.044]

25 Let X = [x1, . . . , xn]^T be the n×p data matrix, and let E = [eij ] be the n×c indicator matrix with eij = 1 if input xi is in class j and eij = 0 otherwise. [sent-47, score-0.149]

26 [6] defined a scoring matrix for the c-class classification problem. [sent-64, score-0.18]

27 That is, it is a c×(c−1) matrix Θ ∈ R^{c×(c−1)} such that Θ^T (E^T E)Θ = Θ^T ΠΘ = Ic−1. [sent-65, score-0.044]

28 The jth row of Θ defines a scoring or scaling for the jth class. [sent-66, score-0.184]

29 Here we refine this definition as: Definition 1 Given a c-class classification problem with the cardinality of the jth class being nj, a c×(c−1) matrix Θ is referred to as the class scoring matrix if it satisfies Θ^T ΠΘ = Ic−1 and π^T Θ = 0. [sent-67, score-0.302]

30 That is, θ1 = ( sqrt((n−n1)/(n n1)), −sqrt(n1/(n(n−n1))) 1_{c−1}^T )^T and θl = ( 0_{l−1}^T, sqrt((Σ_{j=l+1}^{c} nj)/(nl Σ_{j=l}^{c} nj)), −sqrt(nl/((Σ_{j=l}^{c} nj)(Σ_{j=l+1}^{c} nj))) 1_{c−l}^T )^T for l = 2, . . . , c−1. [sent-73, score-0.26]
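As a sanity check on Definition 1, the following numpy sketch builds one Helmert-type class scoring matrix from the class sizes and verifies both conditions numerically; this is my own illustrative construction (one valid choice of Θ, which is unique only up to an orthonormal rotation), not code from the paper:

```python
import numpy as np

def class_scoring_matrix(class_sizes):
    """Build a c x (c-1) Helmert-type scoring matrix Theta from class sizes n_1,...,n_c
    such that Theta^T Pi Theta = I_{c-1} and pi^T Theta = 0, where Pi = diag(n_1,...,n_c)."""
    pi = np.asarray(class_sizes, dtype=float)
    c = len(pi)
    theta = np.zeros((c, c - 1))
    for l in range(c - 1):
        tail = pi[l + 1:].sum()            # sum of n_j for j > l
        total = pi[l:].sum()               # sum of n_j for j >= l
        theta[l, l] = np.sqrt(tail / (pi[l] * total))
        theta[l + 1:, l] = -np.sqrt(pi[l] / (total * tail))
    return theta

sizes = [5, 3, 7, 2]
Theta = class_scoring_matrix(sizes)
Pi = np.diag(sizes).astype(float)
assert np.allclose(Theta.T @ Pi @ Theta, np.eye(len(sizes) - 1))  # Theta^T Pi Theta = I_{c-1}
assert np.allclose(np.asarray(sizes, dtype=float) @ Theta, 0.0)   # pi^T Theta = 0
```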

31 To address an unsupervised clustering problem with c classes, we relax the setting of Y = EΘ and give the following definition. [sent-80, score-0.255]

32 Definition 2 An n×(c−1) matrix Y is referred to as the sample scoring matrix if it satisfies Y^T Y = Ic−1 and 1n^T Y = 0. [sent-81, score-0.224]

33 For example, we view c−1 as the dimension of a reduced dimensional space in the dimensionality reduction problem. [sent-83, score-0.036]

34 Since Rπ^(1/2) = 0, there exists a c×(c−1) orthogonal matrix ∆, the columns of which are the eigenvectors of R. [sent-90, score-0.064]

35 Here [∆, (1/√n) π^(1/2)] is the c×c matrix of the orthonormal eigenvectors of R. [sent-93, score-0.112]

36 Since an arbitrary class scoring matrix Θ has rank c−1, we have Θ = Θ̂Υ where Υ is some (c−1)×(c−1) orthonormal matrix. [sent-94, score-0.225]

37 Moreover, it follows from ΘΘ^T = Π^(−1) − (1/n) 1c 1c^T that the between-class scatter matrix is given by Σb = X^T Hn E ΘΘ^T E^T Hn X = X^T Hn E Θ̂Θ̂^T E^T Hn X. [sent-95, score-0.08]

38 Accordingly, we can also write the generalized eigenproblem for the penalized LDA as X^T Hn E Θ̂Θ̂^T E^T Hn X A = (X^T Hn X + σ^2 Ip) A Λ, because the total scatter matrix Σ is Σ = X^T Hn X. [sent-96, score-0.125]

39 Moreover, Θ̂^T E^T Hn X A is the eigenvector matrix of Θ̂^T E^T Hn X W. [sent-99, score-0.091]

40 We thus establish the relationship between A in the penalized LDA and W in the penalized optimal scoring model (1). [sent-100, score-0.296]
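For concreteness, here is a hypothetical scipy sketch of the penalized-LDA generalized eigenproblem written above, Σb A = (Σ + σ^2 Ip) A Λ; the function and variable names are my own choices, not the authors', and σ^2 > 0 is assumed so the right-hand matrix is positive definite:

```python
import numpy as np
from scipy.linalg import eigh

def penalized_lda_directions(X, labels, sigma2, n_dirs):
    """Leading directions A of Sigma_b A = (Sigma + sigma2 I_p) A Lambda, where
    Sigma = X^T H_n X is the total scatter and Sigma_b the between-class scatter."""
    labels = np.asarray(labels)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                       # H_n X
    Sigma = Xc.T @ Xc                             # total scatter matrix
    Sigma_b = np.zeros((p, p))
    for k in np.unique(labels):
        mk = X[labels == k].mean(axis=0) - X.mean(axis=0)
        Sigma_b += (labels == k).sum() * np.outer(mk, mk)   # between-class scatter
    vals, vecs = eigh(Sigma_b, Sigma + sigma2 * np.eye(p))  # generalized symmetric eigenproblem
    return vecs[:, np.argsort(vals)[::-1][:n_dirs]]         # keep the largest eigenvalues
```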

41 3 Optimal Scoring for Unsupervised Learning In this section we extend the notion of optimal scoring to unsupervised learning problems, leading to a new framework for dimensionality reduction and clustering analysis simultaneously. [sent-101, score-0.462]

42 3.1 Framework In particular, we relax EΘ in (1) to a sample scoring matrix Y and define the following penalized model: min_{Y,W} f(Y, W) = (1/2) ||Y − Hn X W||_F^2 + (σ^2/2) tr(W^T W)  (2), under the constraints 1n^T Y = 0 and Y^T Y = Ic−1. [sent-103, score-0.24]

43 Theorem 2 A minimizer of Problem (2) is Ŷ and Ŵ = (X^T Hn X + σ^2 Ip)^(−1) X^T Hn Ŷ, where Ŷ is the n×(c−1) orthogonal matrix of the top eigenvectors of Hn X (X^T Hn X + σ^2 Ip)^(−1) X^T Hn. [sent-105, score-0.126]

44 Note that all the eigenvalues of Hn X (X^T Hn X + σ^2 Ip)^(−1) X^T Hn are between 0 and 1. [sent-107, score-0.029]

45 In particular, when σ^2 = 0, the eigenvalues are either 1 or 0. [sent-108, score-0.029]

46 In this case, if rk(Hn X) ≥ c−1, f(Ŷ, Ŵ) achieves its minimum 0; otherwise the minimum value is (c−1−rk(Hn X))/2. [sent-109, score-0.03]
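A rough numpy sketch of the closed-form minimizer stated in Theorem 2 (my own illustration; the function name is an assumption, and σ^2 > 0 is assumed so the matrix inverse exists):

```python
import numpy as np

def optimal_scoring_minimizer(X, c, sigma2):
    """Return (Y_hat, W_hat) minimizing (1/2)||Y - Hn X W||_F^2 + (sigma2/2) tr(W^T W)
    subject to Y^T Y = I_{c-1} and 1_n^T Y = 0, following Theorem 2."""
    n, p = X.shape
    Hn = np.eye(n) - np.ones((n, n)) / n
    Xc = Hn @ X                                                 # Hn X
    M = np.linalg.solve(Xc.T @ Xc + sigma2 * np.eye(p), Xc.T)   # (X^T Hn X + sigma2 Ip)^(-1) X^T Hn
    S = Xc @ M                                                  # Hn X (X^T Hn X + sigma2 Ip)^(-1) X^T Hn
    eigvals, eigvecs = np.linalg.eigh((S + S.T) / 2)
    Y_hat = eigvecs[:, np.argsort(eigvals)[::-1][: c - 1]]      # top c-1 eigenvectors
    W_hat = M @ Y_hat                                           # closed-form W given Y
    return Y_hat, W_hat
```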

47 With the estimates Ŷ and Ŵ, we can develop an unsupervised learning procedure. [sent-110, score-0.052]

48 It is clear that W can be treated as a non-orthogonal projection matrix and Hn XW is then the low-dimensional configuration of X. [sent-111, score-0.044]

49 In this paper, however, we concentrate on the application of the framework in clustering analysis. [sent-114, score-0.205]

50 3.2 Optimal Discriminant Clustering Our clustering procedure is given in Algorithm 1. [sent-116, score-0.188]

51 We refer to this procedure as optimal discriminant clustering due to its relationship with LDA, which is shown by the connection between (1) and (2). [sent-117, score-0.423]

52 Here X̃ = [x̃1, . . . , x̃n]^T (n×r) is a feature matrix corresponding to the data matrix X. [sent-121, score-0.102]

53 This implies that we can obtain Ŷ, and moreover compute Z, without the explicit use of the feature matrix X̃. [sent-123, score-0.044]

54 We are thus able to devise this clustering algorithm by using a reproducing kernel K(·, ·) : X × X → R such that K(xi, xj) = x̃i^T x̃j and K = X̃X̃^T. [sent-125, score-0.252]

55 (Algorithm 1, final steps) 3: Z = [z1, . . . , zn]^T = Hn X Ŵ; 4: Perform K-means on the zi; 5: Return the partition of the zi as the partition of the xi. [sent-129, score-0.051]
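Below is a rough kernelized sketch of Algorithm 1 as I read it, assuming a precomputed kernel matrix K, using C = Hn K Hn, and borrowing scikit-learn's K-means for the rounding step; the function name and defaults are my own choices, not the authors':

```python
import numpy as np
from sklearn.cluster import KMeans

def optimal_discriminant_clustering(K, c, sigma2):
    """Optimal discriminant clustering from a precomputed n x n kernel matrix K.
    Returns cluster labels for the n points."""
    n = K.shape[0]
    Hn = np.eye(n) - np.ones((n, n)) / n
    C = Hn @ K @ Hn                                    # centered kernel, C = Hn K Hn
    T = np.linalg.solve(C + sigma2 * np.eye(n), C).T   # C (C + sigma2 In)^(-1), using symmetry of C
    eigvals, eigvecs = np.linalg.eigh((T + T.T) / 2)
    Y_hat = eigvecs[:, np.argsort(eigvals)[::-1][: c - 1]]   # sample scoring matrix (top c-1 eigenvectors)
    Z = T @ Y_hat                                      # low-dimensional configuration Z = Hn X W
    return KMeans(n_clusters=c, n_init=10).fit_predict(Z)

# Purely illustrative usage with a linear kernel on random data:
X = np.random.RandomState(0).randn(60, 5)
labels = optimal_discriminant_clustering(X @ X.T, c=3, sigma2=0.1)
```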

56 3.3 Related Work We now explore the connection of optimal discriminant clustering with the discriminative clustering algorithm [3] and spectral clustering [7]. [sent-131, score-0.693]

57 Recall that Ŷ is the matrix of the c−1 top eigenvectors of C(C + σ^2 In)^(−1). [sent-132, score-0.081]

58 Consider that if λ ≠ 0 is an eigenvalue of C with associated eigenvector u, then λ/(λ + σ^2) (≠ 0) is an eigenvalue of C(C + σ^2 In)^(−1) with associated eigenvector u. [sent-133, score-0.148]

59 This implies that Y is also the matrix of the c−1 top eigenvectors of C. [sent-135, score-0.081]
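A quick numerical check of this eigenvector-sharing property (purely illustrative, with a made-up positive semidefinite matrix standing in for C):

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(8, 8)
C = A @ A.T                                          # symmetric PSD matrix playing the role of C
sigma2 = 0.5
M = C @ np.linalg.inv(C + sigma2 * np.eye(8))        # C (C + sigma2 I)^(-1)

w_C, _ = np.linalg.eigh(C)
w_M, _ = np.linalg.eigh((M + M.T) / 2)
# Eigenvalues map as lambda -> lambda / (lambda + sigma2); the map is monotone, so the
# top eigenvectors of C and of C (C + sigma2 I)^(-1) span the same subspace.
assert np.allclose(np.sort(w_C / (w_C + sigma2)), np.sort(w_M))
```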

60 As we know, spectral clustering applies a rounding scheme such as K-means directly on Ŷ. [sent-136, score-0.247]

61 We thus have a relationship between the spectral clustering and optimal discriminant clustering. [sent-137, score-0.468]

62 We study the relationship between the discriminative clustering algorithm and the spectral clustering algorithm. [sent-138, score-0.344]

63 Let M be a linear transformation from the r-dimensional X̃ to an s-dimensional transformed feature space F, namely F = X̃M, where M is an r×s matrix of rank s (s < r). [sent-139, score-0.058]

64 The corresponding scatter matrices in the F-space are thus given by M^T ΣM and M^T Σb M. [sent-140, score-0.036]

65 Again, if λ ≠ 0 is an eigenvalue of C with associated eigenvector u, then λ/(λ + σ^2) ≠ 0 is an eigenvalue of C(CC + σ^2 C)^+ C with associated eigenvector u. [sent-150, score-0.148]

66 [Table 1, dataset summary: ORL (Face) c=40, p=1024, n=400; Yale (Face) c=15, p=1024, n=165; PIE (Face) c=68, p=1024, n=6800; SRBCT (Gene) c=4, p=2308, n=63; Iris (UCI) c=4, p=4, n=150; Yeast (UCI) c=10, p=8, n=1484; Image segmentation (UCI) c=7, p=19, n=2100; Statlog landsat satellite (UCI) c=7, p=36, n=2000.] (i) If σ^2 = 0, Y* is the solution of the following problem: argmax_{Y ∈ R^{n×(c−1)}} tr(Y^T C C^+ Y), s.t. [sent-156, score-0.134]

67 Theorem 3 shows that discriminative clustering is essentially equivalent to spectral clustering. [sent-162, score-0.307]

68 Combined with the relationship between spectral clustering and optimal discriminant clustering, this further yields a relationship between discriminative clustering and optimal discriminant clustering. [sent-163, score-1.125]

69 In summary, we are able to unify the discriminative clustering as well as spectral clustering into the optimal scoring framework in (2). [sent-164, score-0.681]

70 4 Experimental Study To evaluate the performance of our optimal discriminant clustering (ODC) algorithm, we conducted experimental comparisons with other related clustering algorithms on several real-world datasets. [sent-165, score-0.577]

71 Further details of these datasets are summarized in Table 1. [sent-167, score-0.027]

72 In the experiments, we compared our ODC with four different clustering algorithms, i.e., K-means, NC, DisCluster [3] and DisKmeans [13]. [sent-172, score-0.188]

73 It is worth noting that the two discriminative clustering algorithms, DisCluster [3] and DisKmeans [13], are very closely related to our ODC, because they are in essence derived from discriminant analysis criteria (also see the analysis in Section 3.3). [sent-175, score-0.399]

74 Similarly, the parameters in other clustering algorithms compared here are also searched in a wide range. [sent-188, score-0.205]

75 For simplicity, we just reported the best results of clustering algorithms with respect to different parameters on each dataset. [sent-189, score-0.205]

76 According to the NMI values in Table 2, our ODC outperforms other clustering algorithms on five datasets: ORL, SRBCT, iris, yeast and image segmentation. [sent-191, score-0.287]

77 According to the CE values in Table 2, our ODC performs best among all algorithms on all the datasets, while NC and DisKmeans achieve almost the same performance as ODC on the SRBCT and Iris datasets, respectively. [sent-192, score-0.133]
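For reference, the two evaluation measures can be computed as follows; this is an illustrative sketch (NMI via scikit-learn, clustering error via Hungarian matching with scipy), not necessarily the exact variant used in the paper:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score   # NMI(true, pred) in [0, 1]
from scipy.optimize import linear_sum_assignment

def clustering_error(true_labels, pred_labels):
    """Clustering error (%): fraction of points misassigned under the best one-to-one
    matching between predicted clusters and true classes."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    true_ids, pred_ids = np.unique(true_labels), np.unique(pred_labels)
    cost = np.zeros((len(pred_ids), len(true_ids)))
    for i, pk in enumerate(pred_ids):
        for j, tk in enumerate(true_ids):
            cost[i, j] = -np.sum((pred_labels == pk) & (true_labels == tk))
    row, col = linear_sum_assignment(cost)            # maximizes total agreement
    matched = -cost[row, col].sum()
    return 100.0 * (1.0 - matched / len(true_labels))
```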

78 From Figures 1 and 2, we can see that, as with conventional clustering algorithms (including those compared here), the parameter σ has a significant impact on the performance of ODC, especially when the results are measured by NMI. [sent-263, score-0.205]

79 Table 2: Clustering results: the Normalized Mutual Information (NMI) and the Clustering Error (CE) (%) of all clustering algorithms are calculated on different datasets. [sent-265, score-0.205]

80 [Table 2 layout: NMI and CE (%) by dataset (ORL, Yale, PIE, SRBCT, Iris, Yeast, Image segmentation, Statlog landsat satellite) for K-means and the other compared algorithms; the numerical entries were not recovered from the source.] [sent-266, score-0.268]

81 5 Concluding Remarks In this paper we have proposed a regression framework to deal with unsupervised dimensionality reduction and clustering simultaneously. [sent-346, score-0.319]

82 The framework is based on the optimal scoring and ridge penalty. [sent-347, score-0.202]

83 In particular, we have developed a new clustering algorithm which is called optimal discriminant clustering (ODC). [sent-348, score-0.56]

84 ODC can efficiently identify the optimal solution and it has an underlying relationship with the discriminative clustering and spectral clustering. [sent-349, score-0.377]

85 This framework also allows us to develop sparse unsupervised learning algorithms; that is, we alternatively consider the following optimization problem: min_{Y,W} f(Y, W) = (1/2) ||Y − Hn X W||_F^2 + λ1 tr(W^T W) + λ2 ||W||_1, under the constraints 1n^T Y = 0 and Y^T Y = Ic−1. [sent-366, score-0.086]
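One simple way to attack this sparse variant, shown here only as a hypothetical sketch of an alternating scheme (the extracted sentences do not spell out an algorithm): with Y fixed, take proximal-gradient (ISTA) steps on W; with W fixed, reset Y to the orthogonal Procrustes solution of its subproblem.

```python
import numpy as np

def soft_threshold(A, t):
    """Elementwise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def sparse_optimal_scoring(X, c, lam1, lam2, n_iter=100):
    """Hypothetical alternating sketch for
    (1/2)||Y - Hn X W||_F^2 + lam1 tr(W^T W) + lam2 ||W||_1,  s.t. Y^T Y = I, 1^T Y = 0."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                                   # Hn X
    step = 1.0 / (np.linalg.norm(Xc, 2) ** 2 + 2 * lam1)      # 1 / Lipschitz constant of the smooth part
    W = np.zeros((p, c - 1))
    Y = np.linalg.qr(np.random.RandomState(0).randn(n, c - 1))[0]
    Y -= Y.mean(axis=0)                                       # rough start with (approximately) 1^T Y = 0
    for _ in range(n_iter):
        # W-step: one ISTA step on the smooth part, then soft-threshold for the l1 term
        grad = Xc.T @ (Xc @ W - Y) + 2 * lam1 * W
        W = soft_threshold(W - step * grad, step * lam2)
        # Y-step: orthogonal Procrustes solution of min ||Y - Hn X W||_F^2 s.t. Y^T Y = I
        U, _, Vt = np.linalg.svd(Xc @ W, full_matrices=False)
        Y = U @ Vt
    return Y, W
```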

86 Consider the Lagrange function: L(Y, W, B, b) = (1/2) tr(Y^T Y) − tr(Y^T X W) + (1/2) tr(W^T (X^T X + σ^2 Ip) W) − (1/2) tr(B(Y^T Y − Iq)) − tr(b^T Y^T 1n), where B is a q×q symmetric matrix of Lagrange multipliers and b is a q×1 vector of Lagrange multipliers. [sent-370, score-0.044]

87 Now we take the spectral decomposition of B as B = UB ΛB UB^T, where UB is a q×q orthonormal matrix and ΛB is a q×q diagonal matrix. [sent-376, score-0.151]

88 This shows that the diagonal entries of ΛB and the columns of Y UB are the eigenvalues and the associated eigenvectors of In − X(X^T X + σ^2 Ip)^(−1) X^T. [sent-378, score-0.083]

89 Here diag(γ1, . . . , γp) (p×p) is a diagonal matrix with γ1 ≥ γ2 ≥ · · · ≥ γp ≥ 0. [sent-383, score-0.061]

90 There exists an n×(n−p) orthogonal matrix U3 whose last column is (1/√n) 1n, such that [U, U3] is an n×n orthonormal matrix. [sent-388, score-0.095]

91 That is, U3 is the eigenvector matrix of X(X^T X + σ^2 Ip)^(−1) X^T corresponding to the eigenvalue 0. [sent-389, score-0.118]

92 Let U1 be the n×q matrix of the first q columns of [U, U3 ]. [sent-390, score-0.044]

93 Note that all the eigenvalues of X(X^T X + σ^2 Ip)^(−1) X^T are between 0 and 1. [sent-397, score-0.029]

94 In particular, when σ^2 = 0, the eigenvalues are either 1 or 0. [sent-398, score-0.029]

95 In this case, if rk(X) ≥ q, f(Ŷ, Ŵ) achieves its minimum 0; otherwise the minimum value is (q − rk(X))/2. [sent-399, score-0.03]

96 To verify that (Ŷ, Ŵ) is a minimizer of problem (2), we consider the Hessian matrix of L with respect to (Y, W). [sent-400, score-0.069]

97 The Hessian matrix is then given by H(Y, W) = [ ∂^2L/∂vec(Y)∂vec(Y)^T, ∂^2L/∂vec(Y)∂vec(W)^T ; ∂^2L/∂vec(W)∂vec(Y)^T, ∂^2L/∂vec(W)∂vec(W)^T ] = [ (Iq − B) ⊗ In, −Iq ⊗ X ; −Iq ⊗ X^T, Iq ⊗ (X^T X + σ^2 Ip) ]. [sent-414, score-0.044]

98 Let C = [C1^T, C2^T]^T, where C1 and C2 are n×q and p×q respectively, be an arbitrary nonzero (n+p)×q matrix such that C1^T [1n, Ŷ] = 0, which is equivalent to C1^T 1n = 0 and C1^T U1 = 0. [sent-415, score-0.044]

99 A relationship between linear discriminant analysis and the generalized minimum squared error solution. [sent-491, score-0.203]

100 A flexible and efficient algorithm for regularized Fisher discriminant analysis. [sent-522, score-0.151]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hn', 0.532), ('odc', 0.52), ('nmi', 0.284), ('clustering', 0.188), ('ip', 0.183), ('discriminant', 0.151), ('iq', 0.146), ('tr', 0.144), ('scoring', 0.136), ('vec', 0.134), ('srbct', 0.126), ('xw', 0.123), ('ic', 0.118), ('ce', 0.1), ('orl', 0.083), ('discluster', 0.079), ('iris', 0.072), ('landsat', 0.069), ('statlog', 0.069), ('lda', 0.066), ('ir', 0.065), ('yeast', 0.064), ('pie', 0.063), ('diskmeans', 0.063), ('discriminative', 0.06), ('spectral', 0.059), ('nc', 0.059), ('diag', 0.059), ('ub', 0.055), ('nj', 0.054), ('yale', 0.053), ('unsupervised', 0.052), ('rk', 0.049), ('eigenvector', 0.047), ('zou', 0.047), ('yub', 0.047), ('penalized', 0.045), ('matrix', 0.044), ('eij', 0.038), ('cc', 0.038), ('xm', 0.038), ('eigenvectors', 0.037), ('relationship', 0.037), ('scatter', 0.036), ('yb', 0.036), ('satellite', 0.036), ('devise', 0.035), ('hastie', 0.034), ('optimal', 0.033), ('means', 0.032), ('orthonormal', 0.031), ('elastic', 0.031), ('segmentation', 0.029), ('eigenvalues', 0.029), ('datasets', 0.027), ('vj', 0.027), ('eigenvalue', 0.027), ('regression', 0.026), ('zhejiang', 0.025), ('argmaxy', 0.025), ('dai', 0.025), ('minimizer', 0.025), ('xa', 0.025), ('jth', 0.024), ('lagrange', 0.024), ('nl', 0.022), ('svd', 0.022), ('span', 0.021), ('rc', 0.02), ('orthogonal', 0.02), ('penalty', 0.02), ('fisher', 0.02), ('theorem', 0.019), ('im', 0.018), ('dimensionality', 0.018), ('image', 0.018), ('reduction', 0.018), ('partition', 0.018), ('imposing', 0.017), ('diagonal', 0.017), ('xx', 0.017), ('algorithms', 0.017), ('framework', 0.017), ('pca', 0.017), ('sparse', 0.017), ('net', 0.016), ('ridge', 0.016), ('principal', 0.016), ('moreover', 0.015), ('relax', 0.015), ('zhang', 0.015), ('xi', 0.015), ('classi', 0.015), ('minimum', 0.015), ('xn', 0.014), ('rank', 0.014), ('connection', 0.014), ('reproducing', 0.014), ('figures', 0.014), ('hangzhou', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 182 nips-2009-Optimal Scoring for Unsupervised Learning

Author: Zhihua Zhang, Guang Dai

Abstract: We are often interested in casting classification and clustering problems as a regression framework, because it is feasible to achieve some statistical properties in this framework by imposing some penalty criteria. In this paper we illustrate optimal scoring, which was originally proposed for performing the Fisher linear discriminant analysis by regression, in the application of unsupervised learning. In particular, we devise a novel clustering algorithm that we call optimal discriminant clustering. We associate our algorithm with the existing unsupervised learning algorithms such as spectral clustering, discriminative clustering and sparse principal component analysis. Experimental results on a collection of benchmark datasets validate the effectiveness of the optimal discriminant clustering algorithm.

2 0.12458301 135 nips-2009-Learning to Hash with Binary Reconstructive Embeddings

Author: Brian Kulis, Trevor Darrell

Abstract: Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques. 1

3 0.083072215 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

Author: Novi Quadrianto, John Lim, Dale Schuurmans, Tibério S. Caetano

Abstract: We develop a convex relaxation of maximum a posteriori estimation of a mixture of regression models. Although our relaxation involves a semidefinite matrix variable, we reformulate the problem to eliminate the need for general semidefinite programming. In particular, we provide two reformulations that admit fast algorithms. The first is a max-min spectral reformulation exploiting quasi-Newton descent. The second is a min-min reformulation consisting of fast alternating steps of closed-form updates. We evaluate the methods against Expectation-Maximization in a real problem of motion segmentation from video data. 1

4 0.081846595 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

Author: Shobha Venkataraman, Avrim Blum, Dawn Song, Subhabrata Sen, Oliver Spatscheck

Abstract: We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem, by putting together a number of different “experts” algorithms and online paging algorithms. We prove guarantees on our algorithm’s performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.

5 0.080554113 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

Author: Wei Bian, Dacheng Tao

Abstract: In this paper, we study the manifold regularization for the Sliced Inverse Regression (SIR). The manifold regularization improves the standard SIR in two aspects: 1) it encodes the local geometry for SIR and 2) it enables SIR to deal with transductive and semi-supervised learning problems. We prove that the proposed graph Laplacian based regularization is convergent at rate root-n. The projection directions of the regularized SIR are optimized by using a conjugate gradient method on the Grassmann manifold. Experimental results support our theory.

6 0.072875023 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

7 0.071333744 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

8 0.066064306 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

9 0.064202927 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

10 0.06231926 238 nips-2009-Submanifold density estimation

11 0.057264622 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

12 0.051713817 234 nips-2009-Streaming k-means approximation

13 0.050013609 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

14 0.046276417 223 nips-2009-Sparse Metric Learning via Smooth Optimization

15 0.046014436 166 nips-2009-Noisy Generalized Binary Search

16 0.045939766 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

17 0.045744516 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

18 0.044535499 3 nips-2009-AUC optimization and the two-sample problem

19 0.040147819 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases

20 0.039807878 147 nips-2009-Matrix Completion from Noisy Entries


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.114), (1, 0.063), (2, -0.047), (3, 0.05), (4, -0.021), (5, -0.024), (6, 0.011), (7, -0.004), (8, 0.021), (9, 0.044), (10, 0.014), (11, 0.001), (12, 0.042), (13, -0.028), (14, -0.01), (15, -0.075), (16, 0.084), (17, -0.051), (18, -0.045), (19, 0.001), (20, 0.018), (21, 0.078), (22, -0.008), (23, -0.082), (24, 0.042), (25, -0.012), (26, -0.008), (27, 0.049), (28, 0.104), (29, 0.018), (30, 0.105), (31, -0.046), (32, 0.107), (33, 0.047), (34, 0.177), (35, 0.024), (36, 0.04), (37, 0.031), (38, 0.137), (39, -0.005), (40, 0.126), (41, -0.108), (42, -0.04), (43, 0.001), (44, 0.007), (45, -0.056), (46, -0.016), (47, 0.031), (48, 0.152), (49, 0.141)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94189566 182 nips-2009-Optimal Scoring for Unsupervised Learning

Author: Zhihua Zhang, Guang Dai

Abstract: We are often interested in casting classification and clustering problems as a regression framework, because it is feasible to achieve some statistical properties in this framework by imposing some penalty criteria. In this paper we illustrate optimal scoring, which was originally proposed for performing the Fisher linear discriminant analysis by regression, in the application of unsupervised learning. In particular, we devise a novel clustering algorithm that we call optimal discriminant clustering. We associate our algorithm with the existing unsupervised learning algorithms such as spectral clustering, discriminative clustering and sparse principal component analysis. Experimental results on a collection of benchmark datasets validate the effectiveness of the optimal discriminant clustering algorithm.

2 0.47238889 135 nips-2009-Learning to Hash with Binary Reconstructive Embeddings

Author: Brian Kulis, Trevor Darrell

Abstract: Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques. 1

3 0.46157125 234 nips-2009-Streaming k-means approximation

Author: Nir Ailon, Ragesh Jaiswal, Claire Monteleoni

Abstract: We provide a clustering algorithm that approximately optimizes the k-means objective, in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very light-weight in terms of memory, and computation. This setting is applicable to unsupervised learning on massive data sets, or resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method. 1

4 0.44281149 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

Author: Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, Jiawei Han

Abstract: Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy and robustness over a single model. Their potential, however, is limited in applications which have no access to raw data but to the meta-level model output. In this paper, we study ensemble learning with output from multiple supervised and unsupervised models, a topic where little work has been done. Although unsupervised models, such as clustering, do not directly generate label prediction for each individual, they provide useful constraints for the joint prediction of a set of related objects. We propose to consolidate a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors the smoothness of the prediction over the graph, as well as penalizing deviations from the initial labeling provided by supervised models. We solve this problem through iterative propagation of probability estimates among neighboring nodes. Our method can also be interpreted as conducting a constrained embedding in a transformed space, or a ranking on the graph. Experimental results on three real applications demonstrate the benefits of the proposed method over existing alternatives1 . 1

5 0.41592634 81 nips-2009-Ensemble Nystrom Method

Author: Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar

Abstract: A crucial technique for scaling kernel methods to very large data sets reaching or exceeding millions of instances is based on low-rank approximation of kernel matrices. We introduce a new family of algorithms based on mixtures of Nyström approximations, ensemble Nyström algorithms, that yield more accurate low-rank approximations than the standard Nyström method. We give a detailed study of variants of these algorithms based on simple averaging, an exponential weight method, or regression-based methods. We also present a theoretical analysis of these algorithms, including novel error bounds guaranteeing a better convergence rate than the standard Nyström method. Finally, we report results of extensive experiments with several data sets containing up to 1M points demonstrating the significant improvement over the standard Nyström approximation. 1

6 0.41040596 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

7 0.3758468 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

8 0.37472641 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem

9 0.37410945 98 nips-2009-From PAC-Bayes Bounds to KL Regularization

10 0.36507615 46 nips-2009-Bilinear classifiers for visual recognition

11 0.36364189 51 nips-2009-Clustering sequence sets for motif discovery

12 0.33973941 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

13 0.32400841 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

14 0.31759188 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

15 0.31659737 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions

16 0.31229725 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

17 0.30819333 249 nips-2009-Tracking Dynamic Sources of Malicious Activity at Internet Scale

18 0.30734465 195 nips-2009-Probabilistic Relational PCA

19 0.29271394 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

20 0.28876811 69 nips-2009-Discrete MDL Predicts in Total Variation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(21, 0.018), (24, 0.038), (25, 0.065), (35, 0.019), (36, 0.084), (39, 0.056), (50, 0.375), (58, 0.115), (71, 0.049), (86, 0.055)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.74560416 182 nips-2009-Optimal Scoring for Unsupervised Learning

Author: Zhihua Zhang, Guang Dai

Abstract: We are often interested in casting classification and clustering problems as a regression framework, because it is feasible to achieve some statistical properties in this framework by imposing some penalty criteria. In this paper we illustrate optimal scoring, which was originally proposed for performing the Fisher linear discriminant analysis by regression, in the application of unsupervised learning. In particular, we devise a novel clustering algorithm that we call optimal discriminant clustering. We associate our algorithm with the existing unsupervised learning algorithms such as spectral clustering, discriminative clustering and sparse principal component analysis. Experimental results on a collection of benchmark datasets validate the effectiveness of the optimal discriminant clustering algorithm.

2 0.64771539 228 nips-2009-Speeding up Magnetic Resonance Image Acquisition by Bayesian Multi-Slice Adaptive Compressed Sensing

Author: Matthias Seeger

Abstract: We show how to sequentially optimize magnetic resonance imaging measurement designs over stacks of neighbouring image slices, by performing convex variational inference on a large scale non-Gaussian linear dynamical system, tracking dominating directions of posterior covariance without imposing any factorization constraints. Our approach can be scaled up to high-resolution images by reductions to numerical mathematics primitives and parallelization on several levels. In a first study, designs are found that improve significantly on others chosen independently for each slice or drawn at random. 1

3 0.52756959 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou

Abstract: We study the behavior of the popular Laplacian Regularization method for SemiSupervised Learning at the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in Rd , d 2, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the infinite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain Ω ⊂ Rd with a smooth boundary. Let y : Ω → Y be the unknown function we wish to estimate. In case of regression Y = R whereas in binary classification Y = {−1, 1}. The standard (transductive) semisupervised learning problem is formulated as follows: Given l labeled points, (x1 , y1 ), . . . , (xl , yl ), with yi = y(xi ), and u unlabeled points xl+1 , . . . , xl+u , with all points xi sampled i.i.d. from p(x), the goal is to construct an estimate of y(xl+i ) for any unlabeled point xl+i , utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is fixed and u → ∞. 1 2 SSL with Graph Laplacian Regularization We first consider the following graph-based approach formulated by Zhu et. al. [15]: y (x) = arg min In (y) ˆ subject to y(xi ) = yi , i = 1, . . . , l y where 1 n2 In (y) = Wi,j (y(xi ) − y(xj ))2 (1) (2) i,j is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W . This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points xi ∈ Rd are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. 
We consider the common situation where the similarities are obtained by applying some decay filter to the distances: xi −xj σ Wi,j = G (3) where G : R+ → R+ is some function with an adequately fast decay. Popular choices are the 2 Gaussian filter G(z) = e−z /2 or the ǫ-neighborhood graph obtained by the step filter G(z) = 1z<1 . For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term In (y), e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): lim In (y) = I (σ) (y) = n→∞ G Ω Ω x−x′ σ (y(x) − y(x′ ))2 p(x)p(x′ )dxdx′ . (4) In the above limit, the bandwidth σ is held fixed. Typically, one would also drive the bandwidth σ to zero as n → ∞. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term In (y), at least when σ → 0 slowly enough1 , namely when σ = ω( d log n/n). If σ → 0 as n → ∞, and as long as nσ d / log n → ∞, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: d lim d+2 In (y) n→∞ Cσ (σ) d (y) d+2 I σ→0 Cσ = lim ∇y(x) 2 p(x)2 dx = J(y) = (5) Ω where C = Rd z 2 G( z )dz, and assuming 0 < C < ∞ (which is the case for both the Gaussian and the step filters). This energy functional J(f ) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: y (x) = arg min J(y) ˆ subject to y(xi ) = yi , i = 1, . . . , l (6) y p When σ = o( d 1/n) then all non-diagonal weights Wi,j vanish (points no longer have any “close by” p neighbors). We are not aware of an analysis covering the regime where σ decays roughly as d 1/n, but would be surprised if a qualitatively different meaningful limit is reached. 1 2 3 Graph Laplacian Regularization in R1 We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and x ∈ R. We first consider the situation where the support of p(x) is a continuous interval Ω = [a, b] ⊂ R (a and/or b may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order a x1 < x2 < · · · < xl b. Applying the theory of variational calculus, the solution y (x) ˆ satisfies inside each interval (xi , xi+1 ) the Euler-Lagrange equation d dy p2 (x) = 0. dx dx Performing two integrations and enforcing the constraints at the labeled points yields y(x) = yi + x 1/p2 (t)dt xi (yi+1 xi+1 1/p2 (t)dt xi − yi ) for xi x xi+1 (7) with y(x) = x1 for a x x1 and y(x) = xl for xl x b. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. 
The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: b 1 Theorem 1. Let p(x) be a smooth density on Ω = [a, b] ⊂ R such that Ap = 4 a 1/p2 (t)dt < ∞. 2 Then, J(f ) can be written as a squared semi-norm J(f ) = f Kp induced by the kernel x′ ′ Kp (x, x ) = Ap − 1 2 x with a null-space of all constant functions. That is, f the RKHS induced by Kp . 1 p2 (t) dt Kp . (8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. Furthermore, there is no dependence between points in disjoint components separated by zero density regions. 4 Graph Laplacian Regularization in Higher Dimensions The analysis of the previous section seems promising, at it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is not the case anymore. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points. Theorem 2. Let p(x) be a smooth density over Rd , d 2, bounded from above by some constant pmax , and let (x1 , y1 ), . . . , (xl , yl ) be any (non-repeating) set of labeled examples. 
There exist continuous functions yǫ (x), for any ǫ > 0, all satisfying the constraints yǫ (xj ) = yj , j = 1, . . . , l, such ǫ→0 ǫ→0 that J(yǫ ) −→ 0 but yǫ (x) −→ 0 for all x = xj , j = 1, . . . , l. Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at x0 = 0 with y(x0 ) = 0 and the second labeled point is at x1 with x1 = 1 and y(x1 ) = 1. In addition, we assume that the ball B1 (0) of radius one centered around the origin is contained in Ω = {x ∈ Rd | p(x) > 0}. We first consider the case d > 2. Here, for any ǫ > 0, consider the function x ǫ yǫ (x) = min ,1 which indeed satisfies the two constraints yǫ (xi ) = yi , i = 0, 1. Then, J(yǫ ) = Bǫ (0) p2 (x) dx ǫ2 pmax ǫ2 dx = p2 Vd ǫd−2 max (13) Bǫ (0) where Vd is the volume of a unit ball in Rd . Hence, the sequence of functions yǫ (x) satisfy the constraints, but for d > 2, inf ǫ J(yǫ ) = 0. For d = 2, a more extreme example is necessary: consider the functions 2 x yǫ (x) = log +ǫ ǫ log 1+ǫ ǫ for x 1 and yǫ (x) = 1 for x > 1. These functions satisfy the two constraints yǫ (xi ) = yi , i = 0, 1 and: J(yǫ ) = 4 h “ ”i 1+ǫ 2 log ǫ 4πp2 max h “ ”i 1+ǫ 2 log ǫ x B1 (0) log ( x 1+ǫ ǫ 2 2 +ǫ)2 p2 (x)dx 4p2 h “ max ”i2 1+ǫ log ǫ 4πp2 max ǫ→0 = −→ 0. log 1+ǫ ǫ 4 1 0 r2 (r 2 +ǫ)2 2πrdr The implication of Theorem 2 is that regardless of the values at the labeled points, as u → ∞, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel Kp , as in the onedimensional case, is not possible. Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in l 1 y = arg min ˆ (y(xj ) − yj )2 + γIn (y) y(x) l j=1 In the limit of infinite unlabeled data, functions of the form yǫ (x) above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term J(y) to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points. 4.1 Numerical Example We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density p(x) in R2 , which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at (4, 0). We sample a total of n = 3000 points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with σ = 0.4. Figure 1 depicts the predictor y (x) obtained from (1). In fact, two different predictors are shown, ˆ obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution y (x) of (1) satisfies: ˆ n y (xi ) = ˆ n Wij y (xj ) / ˆ j=1 Wij on all unlabeled points i = l + 1, . . . , l + u. (14) j=1 Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. 
Alternatively, (14) can be viewed as an update equation for y (xi ), ˆ which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on x1 , . . . , xl . This is the method used in the bottom panels of Figure 1. As predicted, y (x) is almost constant for almost all unlabeled points. Although all values are very ˆ close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known apriori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%. The differences between the two numerical methods for solving (1) also point out to another problem with the ill-posedness of the limit problem: the solution is numerically very un-stable. A more quantitative evaluation, that also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth σ, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error, and the classification error of the resulting predictor y (x). We plot the classification error both when a threshold ˆ of zero is used (i.e. the class is determined by sign(ˆ(x))) and with the ideal threshold minimizing y the test error. For each unlabeled sample size, we choose the bandwidth σ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly. Note that 5 DIRECT INVERSION SQUARED ERROR SIGN ERROR: 45% OPTIMAL BANDWIDTH 1 0.9 1 5 0 4 2 0.85 y(x) > 0 y(x) < 0 6 0.95 10 0 0 −1 10 0 200 400 600 800 0−1 ERROR (THRESHOLD=0) 0.32 −5 10 0 5 −10 0 −10 −5 −5 0 5 10 10 1 0 0 200 400 600 800 OPTIMAL BANDWIDTH 0.5 0 0 200 400 600 800 0−1 ERROR (IDEAL THRESHOLD) 0.19 5 200 400 600 800 OPTIMAL BANDWIDTH 1 0.28 SIGN ERR: 17.1 0.3 0.26 POWER METHOD 0 1.5 8 0 0.18 −1 10 6 0.17 4 −5 10 0 5 −10 0 −5 −10 −5 0 5 10 Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations - the solution has not yet converged, despite small residuals of the order of 2 · 10−4 . 0.16 0 200 400 600 800 2 0 200 400 600 800 Figure 2: Squared error (top), classification error with a threshold of zero (center) and minimal classification error using ideal threhold (bottom), of the minimizer of (1) as a function of number of unlabeled points. 
For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted. ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even flatter. 4.2 Probabilistic Interpretation, Exit and Hitting Times As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let x(t) denote a random walk on the graph with transition matrix M = D−1 W where D is a diagonal matrix with Dii = j Wij . Then, for the binary classification case with yi = ±1 we have [15]: y (xi ) = 2 Pr x(t) hits a point labeled +1 before hitting a point labeled -1 x(0) = xi − 1 ˆ We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and σ → 0, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location xi , yielding a flat predictor. 5 Keeping σ Finite At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit σ → 0. One possible objection is that there is an intrinsic characteristic scale for the data σ0 where (with high probability) all points at a distance xi − xj < σ0 have the same label. If this is the case, then it may not necessarily make sense to take values of σ < σ0 in constructing W . However, keeping σ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function y(x) which is zero everywhere except at the labeled points, where y(xj ) = yj . With a finite number of labeled points of measure zero, I (σ) (y) = 0 in any dimension 6 50 points 500 points 3500 points 1 1 0.5 0.5 0.5 0 0 0 −0.5 y 1 −0.5 −0.5 −1 −2 0 2 4 6 −1 −2 0 2 4 6 −1 −2 0 2 4 6 x Figure 3: Minimizer of (1) for a 1-d problem with a fixed σ = 0.4, two labeled points and an increasing number of unlabeled points. and for any fixed σ > 0. While this limiting function is discontinuous, it is also possible to construct ǫ→0 a sequence of continuous functions yǫ that all satisfy the constraints and for which I (σ) (yǫ ) −→ 0. This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at x = 4, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed σ, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as u → ∞. 
6 Fourier-Eigenvector Based Methods Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3]. Instead of using the Laplacian as a regularizer, constraining candidate predictors y(x) non-parametrically to those with small In (y) values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian: The similarity matrix W is computed as before, and the Graph Laplacian matrix L = D − W is considered (recall D is a diagonal matrix with Dii = j Wij ). Only predictors p j=1 aj ej y (x) = ˆ (15) spanned by the first p eigenvectors e1 , . . . , ep of L (with smallest eigenvalues) are considered. The coefficients aj are chosen by minimizing a loss function on the labeled data, e.g. the squared loss: (ˆ1 , . . . , ap ) = arg min a ˆ l j=1 (yj − y (xj ))2 . ˆ (16) Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit u → ∞. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4]. Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions. Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well can the unknown classification function be approximated by a truncated expansion with relatively few eigenvectors. We illustrate this issue with the following three-dimensional example: Let p(x) denote the uniform density in the box [0, 1] × [0, 0.8] × [0, 0.6], where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, y1 (x) = 1x1 >0.5 , y2 (x) = 1x1 >x2 /0.8 and y3 (x) = 1x2 /0.8>x3 /0.6 . Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors p is set to p = l/5 as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of p selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method). 7 Prediction Error (%) p = #labeled points/5 40 optimal p 20 labeled points 40 Approx. 
The reason for this behavior is that y_2(x), and even more so y_3(x), cannot be as easily approximated by the very few leading eigenfunctions: even though they seem "simple" and "smooth", they are significantly more complicated than y_1(x) in terms of the measure of simplicity implied by the eigenvector method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form ψ_{i,j,k}(x) = cos(iπx_1) cos(jπx_2/0.8) cos(kπx_3/0.6), making it hard to represent simple decision boundaries that are not axis-aligned.

7 Discussion

Our results show that a popular SSL method, the Laplacian regularization method (1), is not well-behaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons. First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the numbers of both labeled and unlabeled points grow while l/u is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying, as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in the delicate situation of choosing the "just right" number of unlabeled points without any theoretical guidance.

Second, in our experiments we noticed that although the predictor ŷ(x) becomes extremely flat, in binary tasks it is still typically possible to find a threshold leading to good classification performance. We do not know of any theoretical explanation for this behavior, nor how to characterize it. Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression.

The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first-order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev embedding theorem [1] indicates that one must control at least (d+1)/2 derivatives in R^d. In the context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the graph Laplacian matrix L = D − W, where D is the diagonal degree matrix, with L^{(d+1)/2} (the matrix raised to the (d+1)/2 power). In the infinite unlabeled data limit, this corresponds to regularizing all (mixed) partial derivatives of order (d+1)/2.
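One way to instantiate the iterated-Laplacian fix is to penalize f^T L^k f with k = ⌈(d+1)/2⌉ in a soft-constrained least-squares problem. The soft-constraint formulation, the integer rounding of the exponent, and the value of gamma in the sketch below are illustrative choices of ours, not a prescription from the text.

```python
import numpy as np

def iterated_laplacian_predictor(X, labeled_idx, y_labeled, sigma, d, gamma=1e-2):
    """Sketch of regression with the iterated-Laplacian penalty f^T L^k f,
    k = ceil((d+1)/2); minimizes sum over labeled points of (f_i - y_i)^2
    plus gamma * f^T L^k f."""
    n = X.shape[0]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    Lap = np.diag(W.sum(axis=1)) - W
    k = int(np.ceil((d + 1) / 2))                 # order suggested by the embedding theorem
    Lk = np.linalg.matrix_power(Lap, k)           # L^{(d+1)/2}, rounded up to an integer power

    S = np.zeros((n, n))
    S[labeled_idx, labeled_idx] = 1.0             # diagonal selector of labeled coordinates
    y_target = np.zeros(n)
    y_target[labeled_idx] = y_labeled
    # first-order condition of the penalized least-squares objective
    return np.linalg.solve(S + gamma * Lk, S @ y_target)

# tiny usage example; d should be the *intrinsic* dimension one believes the data has
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
f = iterated_laplacian_predictor(X, labeled_idx=[0, 1],
                                 y_labeled=[-1.0, 1.0], sigma=1.0, d=3)
```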
In the typical case of a low-dimensional manifold in a high-dimensional ambient space, the order of iteration should correspond to the intrinsic, rather than the ambient, dimensionality, which poses the practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor of a good understanding of its appropriateness for SSL. A different approach leading to a well-posed solution is to also include an ambient regularization term [5]. However, the properties of the solution, and in particular its relation to various assumptions about the "smoothness" of y(x) relative to p(x), remain unclear.

Acknowledgments

The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06).

References

[1] R. A. Adams. Sobolev Spaces. Academic Press, New York, 1975.
[2] A. Azran. The rendezvous algorithm: multiclass semi-supervised learning with Markov random walks. ICML, 2007.
[3] M. Belkin and P. Niyogi. Using manifold structure for partially labelled classification. NIPS, vol. 15, 2003.
[4] M. Belkin and P. Niyogi. Convergence of Laplacian eigenmaps. NIPS, vol. 19, 2007.
[5] M. Belkin, P. Niyogi and S. Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399-2434, 2006.
[6] Y. Bengio, O. Delalleau and N. Le Roux. Label propagation and quadratic criterion. In Semi-Supervised Learning, Chapelle, Schölkopf and Zien, editors, MIT Press, 2006.
[7] O. Bousquet, O. Chapelle and M. Hein. Measure based regularization. NIPS, vol. 16, 2004.
[8] M. Hein. Uniform convergence of adaptive graph-based regularization. COLT, 2006.
[9] J. Lafferty and L. Wasserman. Statistical analysis of semi-supervised regression. NIPS, vol. 20, 2008.
[10] U. von Luxburg, M. Belkin and O. Bousquet. Consistency of spectral clustering. Annals of Statistics, vol. 36(2), 2008.
[11] M. Meila and J. Shi. A random walks view of spectral segmentation. AI and Statistics, 2001.
[12] B. Nadler, S. Lafon, I. G. Kevrekidis and R. R. Coifman. Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. NIPS, vol. 18, 2006.
[13] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2002.
[14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston and B. Schölkopf. Learning with local and global consistency. NIPS, vol. 16, 2004.
[15] X. Zhu, Z. Ghahramani and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. ICML, 2003.

4 0.42668682 158 nips-2009-Multi-Label Prediction via Sparse Infinite CCA

Author: Piyush Rai, Hal Daume

Abstract: Canonical Correlation Analysis (CCA) is a useful technique for modeling dependencies between two (or more) sets of variables. Building upon the recently suggested probabilistic interpretation of CCA, we propose a nonparametric, fully Bayesian framework that can automatically select the number of correlation components, and effectively capture the sparsity underlying the projections. In addition, given (partially) labeled data, our algorithm can also be used as a (semi)supervised dimensionality reduction technique, and can be applied to learn useful predictive features in the context of learning a set of related tasks. Experimental results demonstrate the efficacy of the proposed approach for both CCA as a stand-alone problem, and when applied to multi-label prediction.

5 0.42453623 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data

Author: Jaakko Luttinen, Alexander T. Ihler

Abstract: We present a probabilistic factor analysis model which can be used for studying spatio-temporal datasets. The spatial and temporal structure is modeled by using Gaussian process priors both for the loading matrix and the factors. The posterior distributions are approximated using the variational Bayesian framework. High computational cost of Gaussian process modeling is reduced by using sparse approximations. The model is used to compute the reconstructions of the global sea surface temperatures from a historical dataset. The results suggest that the proposed model can outperform the state-of-the-art reconstruction systems.

6 0.4242292 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

7 0.42266613 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

8 0.42233419 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

9 0.41824806 22 nips-2009-Accelerated Gradient Methods for Stochastic Optimization and Online Learning

10 0.4178808 148 nips-2009-Matrix Completion from Power-Law Distributed Samples

11 0.41770542 243 nips-2009-The Ordered Residual Kernel for Robust Motion Subspace Clustering

12 0.41723216 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

13 0.4168362 126 nips-2009-Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

14 0.41660553 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

15 0.41616017 97 nips-2009-Free energy score space

16 0.41497865 202 nips-2009-Regularized Distance Metric Learning:Theory and Algorithm

17 0.41484252 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

18 0.41475385 147 nips-2009-Matrix Completion from Noisy Entries

19 0.4145278 157 nips-2009-Multi-Label Prediction via Compressed Sensing

20 0.4143959 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem