nips nips2005 nips2005-27 knowledge-graph by maker-knowledge-mining

27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning


Source: pdf

Author: Tong Zhang, Rie Kubota Ando

Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Watson Research Center, Yorktown Heights, NY 10598. Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. [sent-5, score-0.923]

2 In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. [sent-8, score-0.678]

3 Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. [sent-9, score-1.113]

4 1 Introduction. Spectral graph methods have been used both in clustering and in semi-supervised learning. [sent-11, score-0.244]

5 This paper focuses on semi-supervised learning, where a classifier is constructed from both labeled and unlabeled training examples. [sent-12, score-0.18]

6 The purpose of this paper is to develop a more complete theoretical understanding for graph based semi-supervised learning. [sent-14, score-0.212]

7 1, we present a transductive formulation of kernel learning on graphs which is equivalent to supervised kernel learning. [sent-16, score-1.226]

8 This new kernel learning formulation includes some of the previous proposed graph semi-supervised learning methods as special cases. [sent-17, score-0.911]

9 A consequence is that we can view such graph-based semi-supervised learning methods as kernel design methods that utilize unlabeled data; the designed kernel is then used in the standard supervised learning setting. [sent-18, score-1.488]

10 This insight allows us to prove useful results concerning the behavior of graph based semi-supervised learning from the more general view of spectral kernel design. [sent-19, score-1.158]

11 Similar spectral kernel design ideas also appeared in [2]. [sent-20, score-1.023]

12 We focus on two issues for graph kernel learning formulations based on Theorem 2. [sent-23, score-0.738]

13 First, we establish the convergence of graph based semi-supervised learning (when the number of unlabeled data increases). [sent-25, score-0.342]

14 This analysis gives insights to what are good kernels, and why graph-based spectral kernel design is often helpful in various applications. [sent-27, score-1.028]

15 Given a feature representation ψ(x), we can define the kernel k(x, x') = ψ(x)^T ψ(x'). [sent-46, score-0.523]
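As a minimal illustration (the function name and array layout below are my own assumptions, not from the paper), a kernel induced by an explicit feature map:

```python
import numpy as np

def linear_kernel_from_features(psi_X1, psi_X2):
    """Kernel induced by an explicit feature map: k(x, x') = psi(x)^T psi(x').

    psi_X1 : (n1, p) array whose rows are feature vectors psi(x) for the first point set.
    psi_X2 : (n2, p) array for the second point set.
    Returns the (n1, n2) kernel (gram) matrix.
    """
    return psi_X1 @ psi_X2.T
```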

16 α̂ = arg inf_{[α_i]∈R^n} (1/n) Σ_{i=1}^n L(Σ_{j=1}^n α_j k(X_i, X_j), Y_i) + λ Σ_{i,j=1}^n α_i α_j k(X_i, X_j). (2) The above formulations of kernel methods are standard. [sent-48, score-0.547]

17 In the following, we present an equivalence of supervised kernel learning to a specific semi-supervised formulation. [sent-49, score-0.659]

18 As we shall see later, this new kernel learning formulation is critical for analyzing a class of graph-based semi-supervised learning methods. [sent-51, score-0.734]

19 The following theorem, which establishes the graph kernel learning formulation we will study in this paper, essentially implies that graph-based semi-supervised learning is equivalent to the supervised learning method which employs the same kernel. [sent-54, score-1.061]

20 , f_m]^T ∈ R^m, and the following semi-supervised learning method: f̂ = arg inf_{f∈R^m} (1/n) Σ_{i=1}^n L(f_i, Y_i) + λ f^T K^{-1} f, (3) where K (often called the gram matrix in kernel learning or the affinity matrix in graph learning) is an m × m matrix with K_{i,j} = k(X_i, X_j) = ψ(X_i)^T ψ(X_j). [sent-66, score-0.953]
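For squared loss, (3) has a closed-form solution. The sketch below is my own illustration (not the authors' code), assuming the labeled points occupy the first n rows and columns of K:

```python
import numpy as np

def semi_supervised_fit(K, Y, n, lam):
    """Solve f_hat = argmin_f (1/n) sum_{i<=n} (f_i - Y_i)^2 + lam * f^T K^{-1} f   (eq. 3).

    K   : (m, m) gram matrix over labeled + unlabeled points (labeled points come first).
    Y   : (n,) labels for the first n points.
    lam : regularization parameter lambda.
    Returns f_hat of length m, i.e. predictive values on all m graph nodes.
    """
    K_LL = K[:n, :n]                                  # labeled-labeled block
    alpha = np.linalg.solve(K_LL + n * lam * np.eye(n), Y)
    return K[:, :n] @ alpha                           # f_hat = K[:, labeled] @ alpha
```

Setting the gradient of the squared-loss objective to zero gives f̂ = K_{·,L}(K_{L,L} + nλI)^{-1} Y_L, which is exactly kernel ridge regression on the labeled points extended to all nodes; this is the equivalence to supervised kernel learning discussed in the surrounding sentences.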

21 The kernel gram matrix K is always positive semi-definite. [sent-71, score-0.726]

22 If we start with a given kernel k and let K = [k(Xi , Xj )], then a semi-supervised learning method of the form (3) is equivalent to the supervised method (1). [sent-73, score-0.686]

23 It follows that with a formulation like (3), the only way to utilize unlabeled data is to replace K by a kernel K̄ in (3), or k by k̄ in (2), where K̄ (or k̄) depends on the unlabeled data. [sent-74, score-0.847]

24 In other words, the only benefit of unlabeled data in this setting is to construct a good kernel based on unlabeled data. [sent-75, score-0.717]

25 Some previously proposed graph-based semi-supervised learning methods employ the same formulation (3) with K^{-1} replaced by the graph Laplacian operator L (which we describe in Section 5). [sent-76, score-0.35]

26 However, the equivalence of this formulation and supervised kernel learning (with kernel matrix K = L−1 ) was not obtained in these earlier studies. [sent-77, score-1.264]

27 Moreover, by treating graph-based supervised learning as unsupervised kernel design (see Figure 1), the scope of this paper is more general than graph Laplacian based methods. [sent-79, score-0.906]

28 , m), kernel function k(·, ·); Output: predictive values f̂_j on X_j (j = 1, . [sent-89, score-0.678]

29 , m) Form the kernel matrix K = [k(Xi , Xj )] (i, j = 1, . [sent-92, score-0.504]

30 , m). Compute the kernel eigen-decomposition K = m Σ_{j=1}^m µ_j v_j v_j^T, where (µ_j, v_j) are eigenpairs of K (v_j^T v_j = 1); modify the kernel matrix as K̄ = m Σ_{j=1}^m s_j µ_j v_j v_j^T (∗); and compute f̂ = arg min_{f∈R^m} (1/n) Σ_{i=1}^n L(f_i, Y_i) + λ f^T K̄^{-1} f. [sent-95, score-2.548]

31 Figure 1: Spectral kernel design based semi-supervised learning on a graph. In Figure 1, we consider a general formulation of semi-supervised learning on a data graph through spectral kernel design. [sent-96, score-2.042]
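A minimal numpy sketch of the Figure 1 procedure under squared loss; the function and argument names are illustrative assumptions, and the shrinkage rule s_j is left to the caller:

```python
import numpy as np

def spectral_kernel_design(K, Y, n, lam, shrinkage):
    """Spectral kernel design based semi-supervised learning (Figure 1, squared loss).

    K         : (m, m) initial gram matrix K = [k(X_i, X_j)], labeled points first.
    Y         : (n,) labels for the first n points.
    lam       : regularization parameter lambda.
    shrinkage : callable mapping the eigenvalue vector mu to shrinkage factors s.
    """
    m = K.shape[0]
    # Kernel eigen-decomposition: K = m * sum_j mu_j v_j v_j^T   (v_j^T v_j = 1).
    eigvals, V = np.linalg.eigh(K)
    mu = eigvals[::-1] / m                     # eigenvalues in decreasing order, scaled by 1/m
    V = V[:, ::-1]
    # Step (*): modify the kernel as K_bar = m * sum_j s_j mu_j v_j v_j^T.
    s = shrinkage(mu)
    K_bar = m * (V * (s * mu)) @ V.T
    # Solve (3) with the designed kernel K_bar (closed form for squared loss;
    # this form does not require K_bar to be invertible).
    K_LL = K_bar[:n, :n]
    alpha = np.linalg.solve(K_LL + n * lam * np.eye(n), Y)
    return K_bar[:, :n] @ alpha

# Example shrinkage: s_j = mu_j, so that K_bar is proportional to K^2,
# which accelerates the eigenvalue decay as discussed later in the paper.
# f_hat = spectral_kernel_design(K, Y, n, lam=1e-3, shrinkage=lambda mu: mu)
```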

32 (4) As mentioned earlier, the idea of using spectral kernel design has appeared in [2] although they didn’t base their method on the graph formulation (3). [sent-101, score-1.319]

33 1 (which uses the original kernel K) – in other words, only when the new kernel K̄ is better than K. [sent-104, score-0.981]

34 Therefore a change of kernel can be regarded as a change of feature mapping. [sent-110, score-0.523]

35 The following result establishes an equivalent feature space formulation of the semi-supervised learning method in Figure 1. [sent-112, score-0.275]

36 Consider S = Σ_{j=1}^m s_j u_j u_j^T, where u_j = Ψ v_j/√(m µ_j), Ψ = [ψ(X_1), . [sent-116, score-0.571]

37 , ψ(X_m)], then (µ_j, u_j) is an eigenpair of ΨΨ^T/m. [sent-119, score-0.238]

38 The spectral decomposition of EX ψ(X)ψ(X)T corresponds to the feature space PCA. [sent-128, score-0.44]

39 In general, S converges if the eigenvectors u_j converge and the shrinkage factors s_j are bounded. [sent-131, score-0.479]

40 Assume that when m → ∞, Σ_{j=1}^m ψ(X_j)ψ(X_j)^T/m converges to E_X ψ(X)ψ(X)^T almost surely, and g is a continuous function on the spectral range of E_X ψ(X)ψ(X)^T. [sent-138, score-0.39]

41 Now in Figure 1 with (∗) given by (4) and kernel k(x, x') = ψ(x)^T ψ(x'), f̂_j converges almost surely for each fixed j. [sent-139, score-0.719]

42 4 Generalization analysis on graph. We study the generalization behavior of the graph-based semi-supervised learning algorithm (3), and use it to compare different kernels. [sent-140, score-0.468]

43 We will then use this bound to justify the kernel design method given in Section 2. [sent-141, score-0.712]

44 We obtain predictive values f̂_j on the graph using the semi-supervised learning method (3) with the labeled data, and test them on the remaining m − n data points. [sent-153, score-0.518]

45 Let f̂(Z_n) be the semi-supervised learning method (3) using training data in Z_n: f̂(Z_n) = arg min_{f∈R^m} (1/n) Σ_{i∈Z_n} L(f_i, Y_i) + λ f^T K^{-1} f. [sent-167, score-0.197]

46 The bound depends on the regularization parameter λ in addition to the kernel K. [sent-169, score-0.514]

47 With the optimal λ, we obtain: E_{Z_n} (1/(m−n)) Σ_{j∉Z_n} L(f̂_j(Z_n), Y_j) ≤ inf_{f∈R^m} [ (1/m) Σ_{j=1}^m L(f_j, Y_j) + (γ/√(2n)) R(f, K) ], where R(f, K) = tr(K/m) f^T K^{-1} f is the complexity of f with respect to the kernel K. [sent-173, score-0.587]
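A small helper (my own sketch, not from the paper) computing the complexity term R(f, K) as written above; it can be used to compare candidate kernels for a fixed target f:

```python
import numpy as np

def complexity(f, K):
    """R(f, K) = tr(K/m) * f^T K^{-1} f : complexity of target f w.r.t. kernel K.

    Assumes K is symmetric and non-singular (use a pseudo-inverse otherwise).
    """
    m = K.shape[0]
    Kinv_f = np.linalg.solve(K, f)          # K^{-1} f without forming the inverse explicitly
    return (np.trace(K) / m) * float(f @ Kinv_f)
```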

48 If we believe that a good approximate target function f can be expressed as f = Σ_j α_j v_j with |α_j| ≤ β_j for some known β_j, then based on this belief, the optimal choice of the shrinkage factor becomes s_j = β_j/µ_j. [sent-175, score-0.52]

49 That is, the kernel that optimizes the bound is K̄ = Σ_j β_j v_j v_j^T, where v_j are the normalized eigenvectors of K. [sent-176, score-1.28]

50 The eigenvalues of the optimal kernel are thus independent of K, and depend only on the range β_j of the spectral coefficients of the approximate target function. [sent-178, score-1.061]

51 There is no reason to believe that the eigenvalues µ_j of the original kernel K are proportional to the target spectral coefficient range. [sent-179, score-1.096]

52 If we have some guess of the spectral coefficients of the target, we may use this knowledge to obtain a better kernel. [sent-180, score-0.423]

53 This justifies why spectral kernel design based algorithms can potentially be helpful (when we have some information on the target spectral coefficients). [sent-181, score-1.506]

54 However, for many application problems, we observe in practice that the eigenvalues of the kernel K decay more slowly than the target spectral coefficients. [sent-183, score-1.161]

55 In this case, our analysis implies that we should use an alternative kernel with faster eigenvalue decay: for example, using K 2 instead of K. [sent-184, score-0.594]

56 The intuition is also quite clear: if the dimension of the target function is small (its spectral coefficients decay fast), then we should project the data onto those dimensions and suppress the remaining noisy dimensions, which corresponds to faster kernel eigenvalue decay. [sent-187, score-0.828]

57 5 Spectral analysis: the effect of input noise. We provide a justification of why the spectral coefficients of the target function often decay faster than the eigenvalues of a natural kernel K. [sent-188, score-1.254]

58 Together with results in the previous section, we know that in order to achieve optimal performance, we need to use a kernel with faster eigenvalue decay. [sent-190, score-0.623]

59 Let E x̄ x̄^T = Σ_j µ_j u_j u_j^T be the spectral decomposition with decreasing eigenvalues µ_j (u_j^T u_j = 1). [sent-208, score-0.947]

60 Then the following claims are valid: let σ_1² ≥ σ_2² ≥ ··· be the eigenvalues of E δ_2 δ_2^T; then µ_j ≥ σ_j²; if ‖δ_1‖_2 ≤ b/‖w_*‖_2, then |w_*^T X_i − Y_i| ≤ b; and ∀t ≥ 0, Σ_{j≥1} (w_*^T u_j)² µ_j^{−t} ≤ w_*^T (E x̄ x̄^T)^{−t} w_*. [sent-209, score-0.238]

61 , X_m] and K = Ψ^T Ψ = m Σ_j µ_j v_j v_j^T be the kernel spectral decomposition. [sent-217, score-1.321]

62 Let u_j = Ψ v_j/√(m µ_j), f_i = w_*^T X_i, and f = Σ_j √(m µ_j) (w_*^T u_j) v_j. [sent-218, score-0.552]

63 Then it is not difficult to verify that, asymptotically as (1/m) Σ_{i=1}^m X_i X_i^T → E XX^T, we have the following consequences: f_i = w_*^T X_i is a good approximate target when b is small. [sent-220, score-0.164]

64 For all t > 0, the spectral coefficients α_j of f decay as (1/m) Σ_{j=1}^m α_j²/µ_j^{1+t} ≤ w_*^T (E x̄ x̄^T)^{−t} w_*. [sent-222, score-0.48]

65 The eigenvalue µ_j decays slowly when the noise spectrum decays slowly: µ_j ≥ σ_j². [sent-223, score-0.711]

66 This analysis implies that if the feature representation associated with the original kernel is corrupted with noise, then it is often helpful to use a kernel with faster spectral decay. [sent-225, score-1.594]

67 However, it may not be easy to estimate the exact decay rate of the target spectral coefficients. [sent-227, score-0.605]

68 A kernel with fast spectral decay projects the data into the most prominent principal components. [sent-229, score-1.047]

69 Therefore we are interested in designing kernels which can achieve a dimension reduction effect. [sent-230, score-0.243]

70 For example, we may consider a normalized kernel such that K/m = Σ_j µ_j u_j u_j^T with 0 ≤ µ_j ≤ 1. [sent-232, score-0.997]

71 This is the function used in various graph Laplacian formulations with normalized Gaussian kernel as the initial kernel K. [sent-237, score-1.199]

72 Our analysis suggests that it is the dimension reduction effect of this function that is important, rather than the connection to graph Laplacian. [sent-239, score-0.325]

73 As we shall see in the empirical examples, other kernels such as K², which achieve a similar dimension reduction effect (but have nothing to do with the graph Laplacian), also improve performance. [sent-240, score-0.409]

74 We regard n = 100 of them as labeled data, and the remaining m − n = 1900 as unlabeled test data. [sent-246, score-0.212]

75 Figure 2: Left: spectral coefficients; right: classification accuracy (both plotted against the dimension d). [sent-263, score-0.616]

76 We study the performance of various kernel design methods, by changing the spectral coefficients of the initial gram matrix K, as in Figure 1. [sent-265, score-1.08]

77 Below we write µ̄_j for the new spectral coefficient of the new gram matrix K̄, i.e., K̄ = m Σ_{j=1}^m µ̄_j v_j v_j^T. [sent-266, score-0.477]

78 We study the following kernel design methods (also see [2]), with a dimension cut-off parameter d, so that µ̄_i = 0 when i > d. [sent-269, score-0.745]

79 This method is essentially kernel principal component analysis, which keeps the d most significant principal components of K. [sent-279, score-0.622]

80 This accelerates the decay of the eigenvalues of K̄. [sent-282, score-0.208]

81 This is essentially graph-Laplacian based semi-supervised learning for normalized kernel (e. [sent-286, score-0.581]

82 This is the oracle kernel that optimizes our generalization bound. [sent-291, score-0.629]
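A compact sketch of the four spectral transforms just described (the kernel-PCA truncation, the power K^p, the spectral-clustering hard threshold, and the oracle 'Y'). The method names and the oracle construction µ̄_i = |v_i^T y| are my own reading of the description, not the authors' code:

```python
import numpy as np

def design_spectrum(mu, V, d, method, y=None, p=2):
    """Return new spectral coefficients mu_bar with cut-off d (mu_bar_i = 0 for i > d).

    mu, V  : eigenvalues (decreasing) and eigenvectors of the initial gram matrix K.
    method : 'kpca'    -- keep the d largest eigenvalues unchanged (kernel PCA truncation)
             'power'   -- mu_bar_i = mu_i^p, accelerating eigenvalue decay (K^2, K^3, ...)
             'cluster' -- mu_bar_i = 1 for i <= d (hard threshold used in spectral clustering)
             'oracle'  -- mu_bar_i = |v_i^T y|, using the target's spectral coefficients ('Y')
    """
    mu_bar = np.zeros_like(mu)
    if method == 'kpca':
        mu_bar[:d] = mu[:d]
    elif method == 'power':
        mu_bar[:d] = mu[:d] ** p
    elif method == 'cluster':
        mu_bar[:d] = 1.0
    elif method == 'oracle':
        mu_bar[:d] = np.abs(V[:, :d].T @ y)   # requires the target vector y
    return mu_bar

# The designed gram matrix is then K_bar = m * V @ np.diag(mu_bar) @ V.T, as in Figure 1.
```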

83 The purpose of testing this oracle method is to validate our analysis by checking whether a kernel that is good in our theory produces good classification performance on real data. [sent-292, score-0.587]

84 Therefore the resulting kernel will not be the best possible kernel for each specific class, and thus its performance may not always be optimal. [sent-294, score-0.946]

85 Figure 2 shows the spectral coefficients of the above mentioned kernel design methods and the corresponding classification performance. [sent-295, score-1.022]

86 The initial kernel is a normalized 25-NN kernel, defined as K = D^{-1/2} W D^{-1/2} (see the previous section), where W_{ij} = 1 if either the i-th example is one of the 25 nearest neighbors of the j-th example or vice versa, and 0 otherwise. [sent-296, score-0.521]
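A minimal construction of the normalized k-NN kernel described above (a sketch; excluding each point from its own neighbor list is an assumption on my part):

```python
import numpy as np

def normalized_knn_kernel(X, k=25):
    """K = D^{-1/2} W D^{-1/2}, with W_ij = 1 if i is among the k nearest neighbors of j
    or vice versa, and 0 otherwise."""
    m = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = np.argsort(dists, axis=1)[:, 1:k + 1]   # k nearest neighbors, excluding the point itself
    W = np.zeros((m, m))
    for i in range(m):
        W[i, nn[i]] = 1.0
    W = np.maximum(W, W.T)                       # symmetrize: "either ... or vice versa"
    d = W.sum(axis=1)
    D_inv_sqrt = 1.0 / np.sqrt(d)
    return (W * D_inv_sqrt[:, None]) * D_inv_sqrt[None, :]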

87 As expected, the results demonstrate that the target spectral coefficients (Y) decay faster than those of the original kernel K. [sent-297, score-1.17]

88 Therefore it is useful to use kernel design methods that accelerate the eigenvalue decay. [sent-298, score-0.696]

89 The near oracle kernel ’Y’ performs well especially when the dimension cut-off is large. [sent-300, score-0.665]

90 With an appropriate dimension d, all methods perform better than the supervised baseline (original K), which is below 65%. [sent-301, score-0.225]

91 However, K^p with p = 2, 3, 4 is less sensitive to the cut-off dimension d than the kernel principal component dimension reduction method K. [sent-303, score-0.843]

92 Moreover, the hard threshold method in spectral clustering ([1, . [sent-304, score-0.48]

93 Figure 3 shows the classification accuracy with the standard Gaussian kernel as the initial kernel K, both with and without normalization. [sent-312, score-1.007]

94 The spectral clustering kernel is sensitive to the cut-off dimension, while K^p with p = 2, 3, 4 is quite stable. [sent-316, score-0.918]

95 The standard kernel principal component dimension reduction (method K) performs very well with appropriately chosen dimension cut-off. [sent-317, score-0.808]

96 Figure 3: Classification accuracy with the Gaussian kernel k(i, j) = exp(−‖x_i − x_j‖²/t), plotted against the dimension d. [sent-345, score-0.923]
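A sketch of the Gaussian initial kernel from the Figure 3 caption, with the optional D^{-1/2} W D^{-1/2} normalization mentioned in the text (the `normalize` flag is my own addition):

```python
import numpy as np

def gaussian_kernel(X, t, normalize=False):
    """k(i, j) = exp(-||x_i - x_j||^2 / t); optionally normalized as D^{-1/2} W D^{-1/2}."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / t)
    if not normalize:
        return W
    d = W.sum(axis=1)
    D_inv_sqrt = 1.0 / np.sqrt(d)
    return (W * D_inv_sqrt[:, None]) * D_inv_sqrt[None, :]
```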

97 By establishing a graph-based formulation of kernel learning, we showed that this class of semi-supervised learning methods is equivalent to supervised kernel learning with unsupervised kernel design (explored in [2]). [sent-350, score-1.882]

98 We then obtained a generalization bound, which implies that the eigenvalues of the optimal kernel should decay at the same rate as the target spectral coefficients. [sent-351, score-1.234]

99 Moreover, we showed that input noise can cause the target spectral coefficients to decay faster than the kernel spectral coefficients. [sent-352, score-1.563]

100 The analysis explains why it is often helpful to modify the original kernel eigenvalues to achieve a dimension reduction effect. [sent-353, score-0.819]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kernel', 0.473), ('spectral', 0.39), ('uj', 0.238), ('vj', 0.229), ('fj', 0.166), ('xj', 0.163), ('graph', 0.16), ('design', 0.13), ('decay', 0.127), ('coef', 0.126), ('unlabeled', 0.122), ('zn', 0.118), ('dimension', 0.113), ('formulation', 0.101), ('xi', 0.097), ('sj', 0.095), ('mnist', 0.094), ('wt', 0.094), ('vp', 0.09), ('decays', 0.09), ('target', 0.088), ('infm', 0.085), ('ex', 0.083), ('supervised', 0.083), ('eigenvalues', 0.081), ('corrupted', 0.081), ('oracle', 0.079), ('fi', 0.076), ('theorem', 0.074), ('ut', 0.074), ('cients', 0.069), ('yj', 0.065), ('eigenvalue', 0.064), ('accuracy', 0.061), ('learning', 0.06), ('labeled', 0.058), ('yi', 0.058), ('principal', 0.057), ('faster', 0.057), ('didn', 0.057), ('exx', 0.057), ('ezn', 0.057), ('gram', 0.056), ('clustering', 0.055), ('arg', 0.053), ('theoretical', 0.052), ('reduction', 0.052), ('feature', 0.05), ('shrinkage', 0.05), ('minf', 0.049), ('digit', 0.048), ('converges', 0.048), ('normalized', 0.048), ('laplacian', 0.048), ('consequences', 0.047), ('generalization', 0.046), ('formulations', 0.045), ('kernels', 0.044), ('equivalence', 0.043), ('behavior', 0.042), ('bound', 0.041), ('shall', 0.04), ('xm', 0.04), ('integers', 0.04), ('predictive', 0.039), ('slowly', 0.039), ('noise', 0.038), ('replacement', 0.038), ('vi', 0.036), ('classi', 0.036), ('transductive', 0.036), ('method', 0.035), ('original', 0.035), ('helpful', 0.035), ('interested', 0.034), ('guess', 0.033), ('concerning', 0.033), ('justify', 0.033), ('inverse', 0.033), ('notations', 0.032), ('regard', 0.032), ('predictor', 0.032), ('surely', 0.032), ('weston', 0.032), ('optimizes', 0.031), ('matrix', 0.031), ('appeared', 0.03), ('modify', 0.03), ('orthogonal', 0.03), ('optimal', 0.029), ('utilize', 0.029), ('methods', 0.029), ('rm', 0.029), ('believe', 0.029), ('im', 0.029), ('limiting', 0.029), ('establishes', 0.029), ('special', 0.028), ('assume', 0.028), ('ny', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999964 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

Author: Tong Zhang, Rie Kubota Ando

Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.

2 0.23417634 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil

Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1

3 0.21228357 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

Author: Boaz Nadler, Stephane Lafon, Ioannis Kevrekidis, Ronald R. Coifman

Abstract: This paper presents a diffusion based probabilistic interpretation of spectral clustering and dimensionality reduction algorithms that use the eigenvectors of the normalized graph Laplacian. Given the pairwise adjacency matrix of all points, we define a diffusion distance between any two data points and show that the low dimensional representation of the data by the first few eigenvectors of the corresponding Markov matrix is optimal under a certain mean squared error criterion. Furthermore, assuming that data points are random samples from a density p(x) = e−U (x) we identify these eigenvectors as discrete approximations of eigenfunctions of a Fokker-Planck operator in a potential 2U (x) with reflecting boundary conditions. Finally, applying known results regarding the eigenvalues and eigenfunctions of the continuous Fokker-Planck operator, we provide a mathematical justification for the success of spectral clustering and dimensional reduction algorithms based on these first few eigenvectors. This analysis elucidates, in terms of the characteristics of diffusion processes, many empirical findings regarding spectral clustering algorithms. Keywords: Algorithms and architectures, learning theory. 1

4 0.21118899 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

Author: Ricky Der, Daniel D. Lee

Abstract: A general analysis of the limiting distribution of neural network functions is performed, with emphasis on non-Gaussian limits. We show that with i.i.d. symmetric stable output weights, and more generally with weights distributed from the normal domain of attraction of a stable variable, that the neural functions converge in distribution to stable processes. Conditions are also investigated under which Gaussian limits do occur when the weights are independent but not identically distributed. Some particularly tractable classes of stable distributions are examined, and the possibility of learning with such processes.

5 0.2074022 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

Author: Y. Altun, D. McAllester, M. Belkin

Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1

6 0.18760087 102 nips-2005-Kernelized Infomax Clustering

7 0.17250407 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

8 0.15679139 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

9 0.15501179 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

10 0.14691648 75 nips-2005-Fixing two weaknesses of the Spectral Method

11 0.1429251 178 nips-2005-Soft Clustering on Graphs

12 0.13305146 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

13 0.12348839 105 nips-2005-Large-Scale Multiclass Transduction

14 0.11883169 138 nips-2005-Non-Local Manifold Parzen Windows

15 0.11529931 47 nips-2005-Consistency of one-class SVM and related algorithms

16 0.10629778 167 nips-2005-Robust design of biological experiments

17 0.10604872 126 nips-2005-Metric Learning by Collapsing Classes

18 0.10526136 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis

19 0.098195009 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm

20 0.097480856 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.338), (1, 0.204), (2, -0.164), (3, -0.171), (4, -0.204), (5, 0.06), (6, 0.136), (7, -0.094), (8, 0.033), (9, 0.058), (10, 0.065), (11, 0.029), (12, -0.078), (13, 0.031), (14, -0.153), (15, -0.029), (16, 0.037), (17, 0.071), (18, 0.1), (19, 0.004), (20, 0.064), (21, -0.126), (22, -0.08), (23, -0.017), (24, 0.115), (25, -0.102), (26, -0.045), (27, -0.104), (28, -0.09), (29, -0.04), (30, 0.013), (31, -0.079), (32, -0.105), (33, -0.038), (34, -0.019), (35, 0.08), (36, 0.068), (37, -0.047), (38, -0.006), (39, -0.023), (40, -0.043), (41, -0.03), (42, 0.027), (43, -0.069), (44, -0.028), (45, 0.039), (46, 0.043), (47, -0.049), (48, -0.017), (49, 0.093)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97667593 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

Author: Tong Zhang, Rie Kubota Ando

Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.

2 0.711891 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables

Author: Y. Altun, D. McAllester, M. Belkin

Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1

3 0.69827664 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil

Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1

4 0.66510719 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines

Author: Yoshua Bengio, Olivier Delalleau, Nicolas L. Roux

Abstract: We present a series of theoretical arguments supporting the claim that a large class of modern learning algorithms that rely solely on the smoothness prior – with similarity between examples expressed with a local kernel – are sensitive to the curse of dimensionality, or more precisely to the variability of the target. Our discussion covers supervised, semisupervised and unsupervised learning algorithms. These algorithms are found to be local in the sense that crucial properties of the learned function at x depend mostly on the neighbors of x in the training set. This makes them sensitive to the curse of dimensionality, well studied for classical non-parametric statistical learning. We show in the case of the Gaussian kernel that when the function to be learned has many variations, these algorithms require a number of training examples proportional to the number of variations, which could be large even though there may exist short descriptions of the target function, i.e. their Kolmogorov complexity may be low. This suggests that there exist non-local learning algorithms that at least have the potential to learn about such structured but apparently complex functions (because locally they have many variations), while not using very specific prior domain knowledge. 1

5 0.5948121 102 nips-2005-Kernelized Infomax Clustering

Author: David Barber, Felix V. Agakov

Abstract: We propose a simple information-theoretic approach to soft clustering based on maximizing the mutual information I(x, y) between the unknown cluster labels y and the training patterns x with respect to parameters of specifically constrained encoding distributions. The constraints are chosen such that patterns are likely to be clustered similarly if they lie close to specific unknown vectors in the feature space. The method may be conveniently applied to learning the optimal affinity matrix, which corresponds to learning parameters of the kernelized encoder. The procedure does not require computations of eigenvalues of the Gram matrices, which makes it potentially attractive for clustering large data sets. 1

6 0.58921653 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm

7 0.58590198 71 nips-2005-Fast Krylov Methods for N-Body Learning

8 0.57620794 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification

9 0.57378268 105 nips-2005-Large-Scale Multiclass Transduction

10 0.56982613 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

11 0.56969607 103 nips-2005-Kernels for gene regulatory regions

12 0.54434001 182 nips-2005-Statistical Convergence of Kernel CCA

13 0.5353663 44 nips-2005-Computing the Solution Path for the Regularized Support Vector Regression

14 0.51208937 38 nips-2005-Beyond Gaussian Processes: On the Distributions of Infinite Networks

15 0.4770115 178 nips-2005-Soft Clustering on Graphs

16 0.46881855 175 nips-2005-Sequence and Tree Kernels with Statistical Feature Mining

17 0.46810493 138 nips-2005-Non-Local Manifold Parzen Windows

18 0.45202819 75 nips-2005-Fixing two weaknesses of the Spectral Method

19 0.44126004 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions

20 0.44040313 69 nips-2005-Fast Gaussian Process Regression using KD-Trees


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.063), (10, 0.02), (27, 0.013), (31, 0.032), (34, 0.151), (50, 0.011), (55, 0.027), (69, 0.032), (73, 0.402), (88, 0.099), (91, 0.069)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97825772 71 nips-2005-Fast Krylov Methods for N-Body Learning

Author: Nando D. Freitas, Yang Wang, Maryam Mahdaviani, Dustin Lang

Abstract: This paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.

2 0.96987456 104 nips-2005-Laplacian Score for Feature Selection

Author: Xiaofei He, Deng Cai, Partha Niyogi

Abstract: In supervised learning scenarios, feature selection has been studied widely in the literature. Selecting features in unsupervised learning scenarios is a much harder problem, due to the absence of class labels that would guide the search for relevant information. And, almost all of previous unsupervised feature selection methods are “wrapper” techniques that require a learning algorithm to evaluate the candidate feature subsets. In this paper, we propose a “filter” method for feature selection which is independent of any learning algorithm. Our method can be performed in either supervised or unsupervised fashion. The proposed method is based on the observation that, in many real world classification problems, data from the same class are often close to each other. The importance of a feature is evaluated by its power of locality preserving, or, Laplacian Score. We compare our method with data variance (unsupervised) and Fisher score (supervised) on two data sets. Experimental results demonstrate the effectiveness and efficiency of our algorithm. 1

same-paper 3 0.9297744 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning

Author: Tong Zhang, Rie Kubota Ando

Abstract: We consider a framework for semi-supervised learning using spectral decomposition based un-supervised kernel design. This approach subsumes a class of previously proposed semi-supervised learning methods on data graphs. We examine various theoretical properties of such methods. In particular, we derive a generalization performance bound, and obtain the optimal kernel design by minimizing the bound. Based on the theoretical analysis, we are able to demonstrate why spectral kernel design based methods can often improve the predictive performance. Experiments are used to illustrate the main consequences of our analysis.

4 0.92800158 55 nips-2005-Describing Visual Scenes using Transformed Dirichlet Processes

Author: Antonio Torralba, Alan S. Willsky, Erik B. Sudderth, William T. Freeman

Abstract: Motivated by the problem of learning to detect and recognize objects with minimal supervision, we develop a hierarchical probabilistic model for the spatial structure of visual scenes. In contrast with most existing models, our approach explicitly captures uncertainty in the number of object instances depicted in a given image. Our scene model is based on the transformed Dirichlet process (TDP), a novel extension of the hierarchical DP in which a set of stochastically transformed mixture components are shared between multiple groups of data. For visual scenes, mixture components describe the spatial structure of visual features in an object–centered coordinate frame, while transformations model the object positions in a particular image. Learning and inference in the TDP, which has many potential applications beyond computer vision, is based on an empirically effective Gibbs sampler. Applied to a dataset of partially labeled street scenes, we show that the TDP’s inclusion of spatial structure improves detection performance, flexibly exploiting partially labeled training images. 1

5 0.8962189 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails

Author: Nebojsa Jojic, Vladimir Jojic, Christopher Meek, David Heckerman, Brendan J. Frey

Abstract: We introduce a new model of genetic diversity which summarizes a large input dataset into an epitome, a short sequence or a small set of short sequences of probability distributions capturing many overlapping subsequences from the dataset. The epitome as a representation has already been used in modeling real-valued signals, such as images and audio. The discrete sequence model we introduce in this paper targets applications in genetics, from multiple alignment to recombination and mutation inference. In our experiments, we concentrate on modeling the diversity of HIV where the epitome emerges as a natural model for producing relatively small vaccines covering a large number of immune system targets known as epitopes. Our experiments show that the epitome includes more epitopes than other vaccine designs of similar length, including cocktails of consensus strains, phylogenetic tree centers, and observed strains. We also discuss epitome designs that take into account uncertainty about Tcell cross reactivity and epitope presentation. In our experiments, we find that vaccine optimization is fairly robust to these uncertainties. 1

6 0.74636567 102 nips-2005-Kernelized Infomax Clustering

7 0.71804315 13 nips-2005-A Probabilistic Approach for Optimizing Spectral Clustering

8 0.69272476 189 nips-2005-Tensor Subspace Analysis

9 0.6757319 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

10 0.66876745 84 nips-2005-Generalization in Clustering with Unobserved Features

11 0.65834421 77 nips-2005-From Lasso regression to Feature vector machine

12 0.65291256 98 nips-2005-Infinite latent feature models and the Indian buffet process

13 0.65098995 178 nips-2005-Soft Clustering on Graphs

14 0.64656997 56 nips-2005-Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators

15 0.629273 177 nips-2005-Size Regularized Cut for Data Clustering

16 0.61472237 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning

17 0.61245996 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction

18 0.61156863 133 nips-2005-Nested sampling for Potts models

19 0.60591012 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation

20 0.58505613 184 nips-2005-Structured Prediction via the Extragradient Method