nips nips2009 nips2009-26 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang
Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependent partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependent and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms.
Reference: text
sentIndex sentText sentNum sentScore
1 We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. [sent-24, score-0.676]
2 In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. [sent-25, score-0.604]
3 Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependent partial regularization. [sent-26, score-0.153]
4 To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. [sent-27, score-0.153]
5 Furthermore, we introduce a method of adaptive regularization that is data dependent and is based on the smoothness assumption. [sent-28, score-0.199]
6 One is whether the unlabeled data can help the classification, and the other is what the relation is between the clustering assumption and the manifold assumption. [sent-34, score-0.47]
7 [16] provided a finite sample analysis on the usefulness of unlabeled data based on the cluster assumption. [sent-36, score-0.327]
8 They show that unlabeled data may be useful for improving the error bounds of supervised learning methods when the margin between different classes satisfies some conditions. [sent-37, score-0.391]
9 However, in real-world problems, it is hard to identify the conditions under which unlabeled data can help. [sent-38, score-0.295]
10 On the other hand, it is interesting to explore the relation between the low density assumption and the manifold assumption. [sent-39, score-0.225]
11 [14] implied that the cut-size of the graph partition converges to the weighted volume of the boundary which separates the two regions of the domain for a fixed partition. [sent-41, score-0.089]
12 Lafferty and Wasserman [13] revisited the assumptions of semi-supervised learning from the perspective of minimax theory, and suggested that the manifold assumption is stronger than the smoothness assumption for regression. [sent-44, score-0.243]
13 So far, the underlying relationship between the cluster assumption and the manifold assumption remains unclear. [sent-45, score-0.246]
14 Specifically, it is unclear in what kind of situation the clustering assumption or the manifold assumption should be adopted. [sent-46, score-0.214]
15 In this paper, we address these current limitations by a unified solution from the perspective of the regularization strength of the unlabeled data. [sent-47, score-0.562]
16 Taking Transductive Support Vector Machine (TSVM) as an example, we suggest a framework that introduces the regularization strength of the unlabeled data when estimating the decision boundary. [sent-48, score-0.576]
17 Therefore, we can obtain a spectrum of models by varying the regularization strength of unlabeled data which corresponds to changing the models from supervised SVM to Transductive SVM. [sent-49, score-0.608]
18 To select the optimal model under the proposed framework, we employ the manifold regularization assumption that enables the prediction function to be smooth over the data space. [sent-50, score-0.41]
19 Further, the optimal function is a linear combination of supervised models, weakly semi-supervised models, and semi-supervised models. [sent-51, score-0.089]
20 Additionally, it provides an effective approach towards combining the cluster assumption and the manifold assumption in semi-supervised learning. [sent-52, score-0.246]
21 In Section 3, we first present a framework of models with different regularization strength, followed by an integrating approach based on manifold regularization. [sent-55, score-0.289]
22 , xn ) denote the entire data set, including both the labeled examples and the unlabeled ones. [sent-62, score-0.362]
23 We assume that the first l examples within X are labeled and the next n − l examples are unlabeled. [sent-63, score-0.086]
24 TSVM [12] maximizes the margin in the presence of unlabeled data and keeps the boundary traversing through low density regions while respecting labels in the input space. [sent-68, score-0.426]
25 Under the maximum-margin framework, TSVM aims to find the classification model with the maximum classification margin for both labeled and unlabeled examples, which amounts to solving the following optimization problem: $\min_{w \in \mathbb{R}^n,\; y^u \in \mathbb{R}^{n-l},\; \xi \in \mathbb{R}^n} \ \frac{1}{2}\|w\|_K^2 + C \sum_{i=1}^{l} \xi_i + C^* \sum_{i=l+1}^{n} \xi_i$ (1) s.t. [sent-69, score-0.424]
26 $y_i w^\top \phi(x_i) \ge 1 - \xi_i,\ \xi_i \ge 0,\ 1 \le i \le l$; $y_i^u w^\top \phi(x_i) \ge 1 - \xi_i,\ \xi_i \ge 0,\ l+1 \le i \le n$, where $C$ and $C^*$ are the trade-off parameters between the complexity of the function w and the margin errors. [sent-71, score-0.11]
27 As in [19] and [20], we can rewrite (1) into the following optimization problem: $\min_{f,\xi} \ \frac{1}{2} f^\top K^{-1} f + C \sum_{i=1}^{l} \xi_i + C^* \sum_{i=l+1}^{n} \xi_i$ (2) s.t. [sent-74, score-0.082]
28 $y_i f_i \ge 1 - \xi_i,\ \xi_i \ge 0,\ 1 \le i \le l$; $|f_i| \ge 1 - \xi_i,\ \xi_i \ge 0,\ l+1 \le i \le n$. [sent-76, score-0.149]
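As a rough illustration of problem (2), the sketch below evaluates its objective for a given prediction vector f, with each slack set to the smallest value allowed by the constraints (hinge loss on labeled points, max(0, 1 − |f_i|) on unlabeled points). The function name `tsvm_objective` and the toy inputs are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def tsvm_objective(f, y_labeled, K, C, C_star):
    """Objective (2) at f, with slacks at their smallest feasible values.

    Hypothetical helper: the first len(y_labeled) entries of f are labeled.
    """
    l = len(y_labeled)
    # Labeled slack from y_i f_i >= 1 - xi_i, xi_i >= 0.
    xi_labeled = np.maximum(0.0, 1.0 - y_labeled * f[:l])
    # Unlabeled slack from |f_i| >= 1 - xi_i, xi_i >= 0.
    xi_unlabeled = np.maximum(0.0, 1.0 - np.abs(f[l:]))
    # Regularizer 0.5 * f^T K^{-1} f, computed with a linear solve.
    reg = 0.5 * f @ np.linalg.solve(K, f)
    return reg + C * xi_labeled.sum() + C_star * xi_unlabeled.sum()

# Toy inputs (assumed): identity kernel, two labeled and two unlabeled points.
K = np.eye(4)
f = np.array([2.0, -0.5, 0.25, -3.0])
y = np.array([1.0, -1.0])
value = tsvm_objective(f, y, K, C=1.0, C_star=0.5)
```

The unlabeled term is what makes the problem non-convex in f, since max(0, 1 − |f_i|) is not a convex function of f_i.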
29 Later, the hinge loss in TSVM is replaced by a smooth loss function, and a gradient descent method is used to find the decision boundary in a region of low density [4]. [sent-81, score-0.115]
30 Despite the success of TSVM, the unlabeled data do not necessarily improve classification accuracy. [sent-83, score-0.295]
31 To better utilize the unlabeled data, unlike existing TSVM approaches, we propose a framework that tries to control the regularization strength of the unlabeled data. [sent-84, score-0.809]
32 To do this, we intend to learn the optimal regularization strength configuration from the combination of a spectrum of models: supervised, weakly-supervised, and semi-supervised. [sent-85, score-0.281]
33 3 TSVM: A Regularization View For the sake of illustration, we first study a model that does not penalize the classification errors of unlabeled data. [sent-86, score-0.276]
34 Note that the penalization on the margin errors of unlabeled data can be included if needed. [sent-87, score-0.335]
35 $y_i f_i \ge 1 - \xi_i,\ \xi_i \ge 0,\ 1 \le i \le l$; $f_i^2 \ge 1,\ l+1 \le i \le n$. [sent-90, score-0.149]
36 1 Full Regularization of Unlabeled Data In order to adjust the strength of the regularization arising from the unlabeled examples, we introduce a coefficient ρ ≥ 0, and modify the above problem (3) as below: $\min_{f,\xi} \ \frac{1}{2} f^\top K^{-1} f + C \sum_{i=1}^{l} \xi_i$ (4) s.t. [sent-92, score-0.555]
37 $y_i f_i \ge 1 - \xi_i,\ \xi_i \ge 0,\ 1 \le i \le l$; $f_i^2 \ge \rho,\ l+1 \le i \le n$. [sent-94, score-0.149]
38 In particular, the larger ρ is, the stronger the regularization from the unlabeled data. [sent-96, score-0.448]
39 Further, we write f = (fl ; fu ) where fl = (f1 , . [sent-99, score-0.826]
40 , fn ) represent the prediction for the labeled and the unlabeled examples, respectively. [sent-105, score-0.364]
41 Thus, the term $f^\top K^{-1} f$ is computed as $f^\top K^{-1} f = f_l^\top M_l^{-1} f_l + f_u^\top M_u^{-1} f_u - 2 f_l^\top K_{l,l}^{-1} K_{l,u} M_u^{-1} f_u$. [sent-107, score-1.825]
42 When the unlabeled data are loosely correlated to the labeled data, namely when most of the elements within $K_{u,l}$ are small, this leads to $M_u \approx K_u$. [sent-108, score-0.343]
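The decomposition of the quadratic form can be checked numerically. The sketch below assumes that M_l and M_u are the Schur complements of the partitioned kernel matrix, M_l = K_{l,l} − K_{l,u} K_{u,u}^{-1} K_{u,l} and M_u = K_{u,u} − K_{u,l} K_{l,l}^{-1} K_{l,u}; these definitions are not shown in this extract, so they are an assumption, as is the random positive definite stand-in for K.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 7, 3

# Random symmetric positive definite matrix standing in for a kernel matrix.
A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)
K_ll, K_lu = K[:l, :l], K[:l, l:]
K_ul, K_uu = K[l:, :l], K[l:, l:]

# Assumed definitions: Schur complements of the two diagonal blocks.
M_l = K_ll - K_lu @ np.linalg.solve(K_uu, K_ul)
M_u = K_uu - K_ul @ np.linalg.solve(K_ll, K_lu)

f = rng.standard_normal(n)
f_l, f_u = f[:l], f[l:]

# Left-hand side: f^T K^{-1} f.
lhs = f @ np.linalg.solve(K, f)

# Right-hand side: the three-term decomposition stated in the text.
cross = f_l @ np.linalg.solve(K_ll, K_lu @ np.linalg.solve(M_u, f_u))
rhs = (f_l @ np.linalg.solve(M_l, f_l)
       + f_u @ np.linalg.solve(M_u, f_u)
       - 2.0 * cross)
```

Under these assumed definitions, `lhs` and `rhs` agree up to floating-point error, and M_u approaches K_{u,u} as the entries of K_{u,l} shrink.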
43 Using the above equations, we rewrite TSVM as follows: $\min_{f_l, f_u, \xi} \ \frac{1}{2} f_l^\top M_l^{-1} f_l + C \sum_{i=1}^{l} \xi_i + \omega(f_l, \rho)$ (5) s.t. [sent-110, score-1.328]
44 $y_i f_i \ge 1 - \xi_i,\ \xi_i \ge 0,\ 1 \le i \le l$, where $\omega(f_l, \rho)$ is a regularization function for $f_l$ and it is the result of the following optimization problem: $\min_{f_u} \ \frac{1}{2} f_u^\top M_u^{-1} f_u - f_l^\top K_{l,l}^{-1} K_{l,u} M_u^{-1} f_u$ s.t. [sent-112, score-2.014]
45 As the derivatives vanish for optimality, we have $f_u = (M_u^{-1} - D(\lambda))^{-1} M_u^{-1} K_{u,l} K_{l,l}^{-1} f_l = (I - M_u D(\lambda))^{-1} K_{u,l} K_{l,l}^{-1} f_l$, where $I$ is an identity matrix. [sent-119, score-1.479]
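The two closed forms for f_u can be compared numerically. The sketch below again assumes M_u is the Schur complement K_{u,u} − K_{u,l} K_{l,l}^{-1} K_{l,u} and that D(λ) is the diagonal matrix of multipliers; the multipliers are drawn at random below 1/σ_1(M_u) so that M_u^{-1} − D(λ) stays positive definite. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, l = 8, 3

A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                      # stand-in kernel matrix
K_ll, K_lu, K_ul, K_uu = K[:l, :l], K[:l, l:], K[l:, :l], K[l:, l:]
M_u = K_uu - K_ul @ np.linalg.solve(K_ll, K_lu)  # assumed Schur complement

f_l = rng.standard_normal(l)
b = K_ul @ np.linalg.solve(K_ll, f_l)            # K_{u,l} K_{l,l}^{-1} f_l

# Multipliers strictly below 1 / sigma_1(M_u), the largest eigenvalue of M_u.
sigma_1 = np.linalg.eigvalsh(M_u).max()
lam = rng.uniform(0.0, 0.5 / sigma_1, size=n - l)
D = np.diag(lam)

M_u_inv = np.linalg.inv(M_u)
# First form: (M_u^{-1} - D)^{-1} M_u^{-1} b.
f_u_first = np.linalg.solve(M_u_inv - D, M_u_inv @ b)
# Second form: (I - M_u D)^{-1} b.
f_u_second = np.linalg.solve(np.eye(n - l) - M_u @ D, b)

# Stationarity condition that both forms should satisfy.
residual = M_u_inv @ f_u_second - M_u_inv @ b - D @ f_u_second
```

Setting λ = 0 reduces both forms to f_u = K_{u,l} K_{l,l}^{-1} f_l, the prediction obtained from the labeled block alone.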
46 Replacing $f_u$ in (6) with the above equation, we have the following dual problem: $\max_{\lambda} \ -\frac{1}{2} f_l^\top K_{l,l}^{-1} K_{l,u} (M_u - M_u D(\lambda) M_u)^{-1} K_{u,l} K_{l,l}^{-1} f_l + \rho \lambda^\top e$ s.t. [sent-120, score-1.503]
47 u (7) The above formulation allows us to understand how the parameter ρ controls the strength of regularization from the unlabeled data. [sent-126, score-0.554]
48 We have the following theorem to illustrate the relationship between the dual problem (7) and the supervised SVM. [sent-130, score-0.08]
49 Theorem 1 When ρ = 0, the optimization problem is reduced to the standard supervised SVM. [sent-131, score-0.094]
50 As a result, $\omega(f_l, \rho)$ becomes $\omega(f_l, \rho = 0) = -\frac{1}{2} f_l^\top K_{l,l}^{-1} K_{l,u} M_u^{-1} K_{u,l} K_{l,l}^{-1} f_l$. Substituting $\omega(f_l, \rho)$ in (5) with the formulation above, the overall optimization problem becomes $\min_{f_l, \xi} \ \frac{1}{2} f_l^\top \left(M_l^{-1} - K_{l,l}^{-1} K_{l,u} M_u^{-1} K_{u,l} K_{l,l}^{-1}\right) f_l + C \sum_{i=1}^{l} \xi_i$ s.t. [sent-133, score-2.04]
51 Finally, the optimization problem is simplified as $\min_{f_l, \xi} \ \frac{1}{2} f_l^\top K_{l,l}^{-1} f_l + C \sum_{i=1}^{l} \xi_i$ (8) s.t. [sent-137, score-1.366]
52 Clearly, the above optimization is identical to the standard supervised SVM. [sent-140, score-0.094]
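The simplification behind Theorem 1 rests on the matrix identity M_l^{-1} − K_{l,l}^{-1} K_{l,u} M_u^{-1} K_{u,l} K_{l,l}^{-1} = K_{l,l}^{-1}. The sketch below checks it numerically, assuming (as the extract does not show the definitions) that M_l and M_u are the Schur complements of the partitioned kernel matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n, l = 9, 4

A = rng.standard_normal((n, n))
K = A @ A.T + n * np.eye(n)                      # stand-in kernel matrix
K_ll, K_lu, K_ul, K_uu = K[:l, :l], K[:l, l:], K[l:, :l], K[l:, l:]

# Assumed definitions: Schur complements of the partitioned kernel matrix.
M_l = K_ll - K_lu @ np.linalg.solve(K_uu, K_ul)
M_u = K_uu - K_ul @ np.linalg.solve(K_ll, K_lu)

K_ll_inv = np.linalg.inv(K_ll)
# Matrix inside the quadratic form when rho = 0.
combined = (np.linalg.inv(M_l)
            - K_ll_inv @ K_lu @ np.linalg.solve(M_u, K_ul) @ K_ll_inv)
```

If the assumption holds, `combined` equals `K_ll_inv`, so with ρ = 0 the regularizer depends only on the labeled block of the kernel, as in a standard supervised SVM.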
53 Hence, the unlabeled data are not employed to regularize the decision boundary when ρ = 0. [sent-141, score-0.36]
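54 Consequently, we can write $\omega(f_l, \rho)$ as follows: $\omega(f_l, \rho) = -\frac{1}{2} f_l^\top K_{l,l}^{-1} K_{l,u} M_u^{-1} K_{u,l} K_{l,l}^{-1} f_l + \phi(f_l, \rho)$, (9) where $\phi(f_l, \rho)$ is the output of the following optimization problem: $\max_{\lambda}$ s.t. [sent-146, score-1.344]
55 $\rho \lambda^\top e - \frac{1}{2} f_l^\top K_{l,l}^{-1} K_{l,u} D(\lambda) K_{u,l} K_{l,l}^{-1} f_l$, $M_u^{-1} \succeq D(\lambda)$, $\lambda_i \ge 0$, i = 1, . [sent-148, score-1.306]
56 $\frac{\rho}{2} \lambda^\top e - \frac{1}{2} f_l^\top K_{l,l}^{-1} K_{l,u} D(\lambda) K_{u,l} K_{l,l}^{-1} f_l$, $0 \le \lambda_i \le [\sigma_1(M_u)]^{-1}$, $1 \le i \le n - l$. [sent-158, score-1.306]
57 As the above problem is a linear programming problem, the solution for λ can be computed as: $\lambda_i = 0$ if $[K_{u,l} K_{l,l}^{-1} f_l]_i^2 > \rho$, and $\lambda_i = [\sigma_1(M_u)]^{-1}$ if $[K_{u,l} K_{l,l}^{-1} f_l]_i^2 \le \rho$. [sent-159, score-1.306]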
58 i l,l From the above formulation, we find that ρ plays the role of a threshold of selecting the unlabeled examples. [sent-160, score-0.276]
59 Since $[K_{u,l} K_{l,l}^{-1} f_l]_i$ can be regarded as the approximation for the ith unlabeled example, the above formulation can be interpreted in the way that only the unlabeled examples with low prediction confidence will be selected for regularizing the decision boundary. [sent-161, score-1.352]
60 Moreover, all the unlabeled examples with high prediction confidence will be ignored. [sent-162, score-0.335]
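The thresholding role of ρ can be shown with a small sketch. It implements the selection rule stated above: an unlabeled example whose squared approximate prediction is at most ρ receives the multiplier 1/σ_1(M_u), and every other example receives zero. The function name `select_lambda` and the sample values are illustrative assumptions.

```python
import numpy as np

def select_lambda(pred_u, rho, sigma_1):
    """Multiplier per unlabeled example under the thresholding rule.

    pred_u  : approximate predictions [K_ul K_ll^{-1} f_l]_i
    rho     : regularization strength, used as a threshold on pred_u**2
    sigma_1 : largest eigenvalue of M_u (passed in directly here)
    """
    return np.where(pred_u ** 2 <= rho, 1.0 / sigma_1, 0.0)

# Illustrative values: low-confidence predictions are near zero.
pred_u = np.array([0.1, -0.9, 0.4, -0.2, 1.4])
lam = select_lambda(pred_u, rho=0.25, sigma_1=2.0)
n_selected = int(np.count_nonzero(lam))
```

With these values, only the three examples with |prediction| ≤ 0.5 are selected, and with ρ = 0 none are, which mirrors the reduction to the supervised SVM.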
61 From the above discussions, we can conclude that ρ determines the regularization strength of unlabeled examples. [sent-163, score-0.533]
62 Then, we rewrite the overall optimization problem as below: $\min_{f_l, \xi} \max_{\lambda} \ \frac{1}{2} f_l^\top K_{l,l}^{-1} f_l + C \sum_{i=1}^{l} \xi_i - \frac{1}{2} f_l^\top K_{l,l}^{-1} K_{l,u} D(\lambda) K_{u,l} K_{l,l}^{-1} f_l$ (10) s.t. [sent-164, score-0.735]
63 $y_i f_i \ge 1 - \xi_i,\ \xi_i \ge 0,\ 1 \le i \le l$; $0 \le \lambda_i \le [\sigma_1(M_u)]^{-1},\ 1 \le i \le n - l$. [sent-166, score-2.108]
64 To obtain the optimal solution, we employ an alternating optimization procedure, which iteratively computes the values of fl and λ. [sent-168, score-0.714]
65 To account for the penalty on the margin error from the unlabeled data, we just need to add an extra constraint of λi ≤ 2C for i = 1, . [sent-169, score-0.316]
66 By varying the parameter ρ from 0 to 1, we can indeed obtain a series of transductive models for SVM. [sent-173, score-0.127]
67 When ρ is small, we call the corresponding optimization problem weakly semi-supervised learning. [sent-174, score-0.131]
68 However, as the data distribution is usually unknown, it is very challenging to directly estimate an optimal regularization strength parameter ρ. [sent-176, score-0.276]
69 4 Adaptive Regularization As stated in previous sections, ρ determines the regularization strength of the unlabeled data. [sent-180, score-0.533]
70 We now try to adapt the parameter ρ according to the unlabeled data information. [sent-181, score-0.295]
71 θi f i , i=1 i=1 where θi is the weight of the prediction function fi and θ ∈ Rm . [sent-199, score-0.154]
72 In the following, we study how to set the regularization strength adaptive to data. [sent-204, score-0.284]
73 Since TSVM naturally follows the cluster assumption of semi-supervised learning, in order to complement the cluster assumption, we adopt another principle in semi-supervised learning, i. [sent-205, score-0.103]
74 From the point of view of the manifold assumption in semi-supervised learning, the prediction function f should be smooth on unlabeled data. [sent-208, score-0.551]
75 To this end, the approach of manifold regularization is widely adopted as a smoothing term in semi-supervised learning literatures, e. [sent-209, score-0.289]
76 In the following, we will employ the manifold regularization principle for selecting the regularization strength. [sent-212, score-0.465]
77 The manifold regularization is mainly based on a graph $G = \langle V, E \rangle$ derived from the whole data space X, where $V = \{x_i\}_{i=1}^{n}$ is the vertex set, and E denotes the edges linking pairs of nodes. [sent-213, score-0.356]
78 In general, a graph is built in the following four steps: (1) constructing the adjacency graph; (2) calculating the weights on edges; (3) computing the adjacency matrix W; (4) obtaining the graph Laplacian by L = diag( [sent-214, score-0.289]
79 Then, we denote the manifold regularization term as $f^\top L f$. For simplicity, we denote the predicted values of function $f_i$ on the data X as $f_i$, such that $f_i = ([f_i]_1$, . [sent-216, score-0.497]
80 , fm ) is used to represent the set of the prediction values of all prediction functions. [sent-223, score-0.102]
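The graph construction and the manifold regularization term described above can be sketched as follows. The Laplacian formula is truncated in this extract, so the sketch assumes the standard unnormalized form L = diag(W1) − W; the k-nearest-neighbour adjacency, the Gaussian edge weights, and the parameters `k` and `sigma` are likewise illustrative choices rather than the paper's settings.

```python
import numpy as np

def graph_laplacian(X, k=3, sigma=1.0):
    """Unnormalized graph Laplacian of a symmetrised k-NN graph (assumed form)."""
    n = X.shape[0]
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Step 1: adjacency graph from the k nearest neighbours of each point
    # (column 0 of the sorted order is the point itself, so it is skipped).
    neighbours = np.argsort(sq_dists, axis=1)[:, 1:k + 1]
    adjacency = np.zeros((n, n), dtype=bool)
    adjacency[np.repeat(np.arange(n), k), neighbours.ravel()] = True
    adjacency |= adjacency.T
    # Steps 2 and 3: Gaussian edge weights collected in the matrix W.
    W = np.where(adjacency, np.exp(-sq_dists / (2.0 * sigma ** 2)), 0.0)
    # Step 4: Laplacian as degree matrix minus weight matrix (assumption).
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 2))
L = graph_laplacian(X)
f = rng.standard_normal(10)
smoothness = f @ L @ f   # manifold regularization term f^T L f
```

The term f^T L f is non-negative and vanishes for constant predictions, so it penalizes predictions that differ across strongly connected points.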
81 , m, where the second term, y (F θ), is used to strengthen the confidence on the prediction over the labeled data. [sent-229, score-0.088]
82 It is important to note that the above optimization problem is less sensitive to the graph structure than Laplacian SVM as used in [1], since the basic learning functions are all strong learners. [sent-232, score-0.086]
83 The above approach indeed provides a practical approach towards a combination of both the cluster assumption and the manifold assumption. [sent-234, score-0.207]
84 Thus the usefulness of unlabeled data is naturally considered by the regularization. [sent-237, score-0.276]
85 Data set (n, d): usps (1500, 241); coil (1500, 241); pcmac (1946, 7511); link (1051, 1800); digit1 (1500, 241); ibm vs rest (1500, 11960); page (1051, 3000); pagelink (1051, 4800). d represents the data dimensionality, and n For simplicity, our proposed adaptive regularization approach is denoted as ARTSVM. [sent-251, score-0.334]
86 In each run, 10% of the data are randomly selected as the training data and the remaining data are used as the unlabeled data. [sent-255, score-0.333]
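A minimal sketch of the split used in each run, assuming a uniform random permutation of the example indices: 10% of the points become the labeled training set and the rest are treated as unlabeled. The helper name `labeled_unlabeled_split` and the seed handling are assumptions for this example.

```python
import numpy as np

def labeled_unlabeled_split(n, labeled_fraction=0.1, seed=0):
    """Return index arrays (labeled, unlabeled) for one random run."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_labeled = int(round(labeled_fraction * n))
    return perm[:n_labeled], perm[n_labeled:]

# Example with n = 1500, the size listed for several of the data sets.
labeled_idx, unlabeled_idx = labeled_unlabeled_split(1500)
```

Changing `seed` from run to run gives independent splits, which is one way to obtain the means and deviations reported over repeated runs.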
87 Note that some very large deviations in SVM are mainly because the labeled data and the unlabeled data may have quite different distributions after the random sampling. [sent-280, score-0.362]
88 On the other hand, the unlabeled data capture the underlying distribution and help to correct such random error. [sent-281, score-0.295]
89 However, the proposed ARTSVM outperforms both the supervised and other semi-supervised algorithms. [sent-284, score-0.116]
90 This indicates that the appropriate regularization from the unlabeled data improves the classification performance. [sent-285, score-0.172]
91 5 Conclusion This paper presents a novel framework for semi-supervised learning from the perspective of the regularization strength from the unlabeled data. [sent-367, score-0.562]
92 In more detail, the loss on the unlabeled data can essentially be regarded as an additional regularizer for the decision boundary in TSVM. [sent-369, score-0.382]
93 To control the regularization strength, we introduce an alternative method of data-dependent regularization based on the principle of manifold regularization. [sent-370, score-0.442]
94 Empirical studies on benchmark data sets demonstrate that the proposed framework is more effective than the previous transductive algorithms and purely supervised methods. [sent-371, score-0.232]
95 For future work, we plan to design a controlling strategy that is adaptive to data from the perspective of low density assumption and manifold regularization of semi-supervised learning. [sent-372, score-0.453]
96 Finally, it is desirable to integrate the low density assumption and manifold regularization into a unified framework. [sent-373, score-0.378]
97 Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. [sent-377, score-0.324]
98 Learning from labeled and unlabeled data using graph mincuts. [sent-380, score-0.391]
99 On the relation between low density separation, spectral clustering and graph cuts. [sent-453, score-0.124]
100 On efficient large margin semisupervised learning: Method and theory. [sent-473, score-0.1]
wordName wordTfidf (topN-words)
[('fl', 0.653), ('tsvm', 0.403), ('mu', 0.279), ('unlabeled', 0.276), ('fu', 0.173), ('regularization', 0.153), ('manifold', 0.136), ('transductive', 0.127), ('artsvm', 0.124), ('fi', 0.114), ('strength', 0.104), ('vikas', 0.071), ('semisupervised', 0.06), ('supervised', 0.056), ('sathiya', 0.053), ('svm', 0.052), ('olivier', 0.05), ('graph', 0.048), ('labeled', 0.048), ('erent', 0.044), ('boundary', 0.041), ('cccp', 0.04), ('prediction', 0.04), ('margin', 0.04), ('assumption', 0.039), ('platt', 0.038), ('optimization', 0.038), ('rong', 0.036), ('fiu', 0.035), ('narayanan', 0.035), ('orts', 0.035), ('pagelink', 0.035), ('yi', 0.035), ('di', 0.035), ('sindhwani', 0.035), ('editors', 0.033), ('weakly', 0.033), ('kong', 0.032), ('cluster', 0.032), ('pcmac', 0.031), ('erty', 0.031), ('partha', 0.031), ('chapelle', 0.031), ('sch', 0.03), ('benchmark', 0.03), ('hong', 0.03), ('ma', 0.03), ('density', 0.029), ('perspective', 0.029), ('coil', 0.028), ('jianke', 0.028), ('mikhail', 0.028), ('zenglin', 0.028), ('lkopf', 0.027), ('jin', 0.027), ('coe', 0.027), ('irwin', 0.027), ('xiaojin', 0.027), ('zurich', 0.027), ('adaptive', 0.027), ('la', 0.026), ('spectral', 0.026), ('ective', 0.025), ('jason', 0.025), ('intend', 0.024), ('lyu', 0.024), ('classi', 0.024), ('dual', 0.024), ('decision', 0.024), ('employ', 0.023), ('hoi', 0.023), ('koller', 0.023), ('min', 0.022), ('rewrite', 0.022), ('bernhard', 0.022), ('fm', 0.022), ('pages', 0.022), ('zhu', 0.022), ('regarded', 0.022), ('annealing', 0.021), ('man', 0.021), ('formulation', 0.021), ('low', 0.021), ('lab', 0.021), ('ho', 0.021), ('usps', 0.021), ('laplacian', 0.021), ('mit', 0.02), ('ibm', 0.02), ('learners', 0.02), ('adjacency', 0.02), ('xu', 0.02), ('support', 0.02), ('data', 0.019), ('roweis', 0.019), ('king', 0.019), ('kaufmann', 0.019), ('belkin', 0.019), ('icml', 0.019), ('examples', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine
Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang
Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependent partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependent and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms.
2 0.16786052 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou
Abstract: We study the behavior of the popular Laplacian Regularization method for Semi-Supervised Learning at the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in $\mathbb{R}^d$, $d \ge 2$, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the infinite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain $\Omega \subset \mathbb{R}^d$ with a smooth boundary. Let $y : \Omega \to Y$ be the unknown function we wish to estimate. In case of regression $Y = \mathbb{R}$ whereas in binary classification $Y = \{-1, 1\}$.
The standard (transductive) semi-supervised learning problem is formulated as follows: Given l labeled points, $(x_1, y_1), \ldots, (x_l, y_l)$, with $y_i = y(x_i)$, and u unlabeled points $x_{l+1}, \ldots, x_{l+u}$, with all points $x_i$ sampled i.i.d. from p(x), the goal is to construct an estimate of $y(x_{l+i})$ for any unlabeled point $x_{l+i}$, utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is fixed and u → ∞. 2 SSL with Graph Laplacian Regularization We first consider the following graph-based approach formulated by Zhu et al. [15]: $\hat{y}(x) = \arg\min_{y} I_n(y)$ subject to $y(x_i) = y_i,\ i = 1, \ldots, l$ (1) where $I_n(y) = \frac{1}{n^2} \sum_{i,j} W_{i,j} \left(y(x_i) - y(x_j)\right)^2$ (2) is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W. This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points $x_i \in \mathbb{R}^d$ are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. We consider the common situation where the similarities are obtained by applying some decay filter to the distances: $W_{i,j} = G\left(\frac{\|x_i - x_j\|}{\sigma}\right)$ (3) where $G : \mathbb{R}^+ \to \mathbb{R}^+$ is some function with an adequately fast decay. Popular choices are the Gaussian filter $G(z) = e^{-z^2/2}$ or the ε-neighborhood graph obtained by the step filter $G(z) = \mathbf{1}_{z<1}$.
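A small sketch of the quantities in this setup, assuming the Gaussian filter G(z) = exp(−z²/2) applied to pairwise Euclidean distances and the regularization term I_n(y) = (1/n²) Σ_{i,j} W_{i,j} (y(x_i) − y(x_j))². The function names and the random test data are assumptions made for illustration.

```python
import numpy as np

def gaussian_similarity(X, sigma):
    """W[i, j] = G(||x_i - x_j|| / sigma) with G(z) = exp(-z**2 / 2)."""
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    z = dists / sigma
    return np.exp(-z ** 2 / 2.0)

def laplacian_energy(W, y):
    """I_n(y) = (1 / n^2) * sum_{i,j} W[i, j] * (y[i] - y[j])**2."""
    n = len(y)
    diffs = y[:, None] - y[None, :]
    return (W * diffs ** 2).sum() / n ** 2

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 2))
y = rng.standard_normal(6)
W = gaussian_similarity(X, sigma=0.4)
energy = laplacian_energy(W, y)

# For symmetric W the double sum equals 2 * y^T (D - W) y, with D = diag(W 1).
L = np.diag(W.sum(axis=1)) - W
```

The identity in the last comment is why this term is called a Laplacian regularizer: minimizing I_n(y) is the same as minimizing the quadratic form of the unnormalized graph Laplacian.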
For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term $I_n(y)$, e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): $\lim_{n \to \infty} I_n(y) = I^{(\sigma)}(y) = \int_{\Omega} \int_{\Omega} G\left(\frac{\|x - x'\|}{\sigma}\right) \left(y(x) - y(x')\right)^2 p(x)\, p(x')\, dx\, dx'$. (4) In the above limit, the bandwidth σ is held fixed. Typically, one would also drive the bandwidth σ to zero as n → ∞. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term $I_n(y)$, at least when σ → 0 slowly enough (footnote 1), namely when $\sigma = \omega\left(\sqrt[d]{\log n / n}\right)$. If σ → 0 as n → ∞, and as long as $n \sigma^d / \log n \to \infty$, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: $\lim_{n \to \infty} \frac{d}{C \sigma^{d+2}} I_n(y) = \lim_{\sigma \to 0} \frac{d}{C \sigma^{d+2}} I^{(\sigma)}(y) = J(y) = \int_{\Omega} \|\nabla y(x)\|^2 p(x)^2\, dx$ (5) where $C = \int_{\mathbb{R}^d} \|z\|^2 G(\|z\|)\, dz$, and assuming 0 < C < ∞ (which is the case for both the Gaussian and the step filters). This energy functional J(f) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: $\hat{y}(x) = \arg\min_{y} J(y)$ subject to $y(x_i) = y_i,\ i = 1, \ldots, l$ (6) Footnote 1: When $\sigma = o\left(\sqrt[d]{1/n}\right)$ then all non-diagonal weights $W_{i,j}$ vanish (points no longer have any “close by” neighbors).
We are not aware of an analysis covering the regime where σ decays roughly as d 1/n, but would be surprised if a qualitatively different meaningful limit is reached. 1 2 3 Graph Laplacian Regularization in R1 We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and x ∈ R. We first consider the situation where the support of p(x) is a continuous interval Ω = [a, b] ⊂ R (a and/or b may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order a x1 < x2 < · · · < xl b. Applying the theory of variational calculus, the solution y (x) ˆ satisfies inside each interval (xi , xi+1 ) the Euler-Lagrange equation d dy p2 (x) = 0. dx dx Performing two integrations and enforcing the constraints at the labeled points yields y(x) = yi + x 1/p2 (t)dt xi (yi+1 xi+1 1/p2 (t)dt xi − yi ) for xi x xi+1 (7) with y(x) = x1 for a x x1 and y(x) = xl for xl x b. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: b 1 Theorem 1. Let p(x) be a smooth density on Ω = [a, b] ⊂ R such that Ap = 4 a 1/p2 (t)dt < ∞. 2 Then, J(f ) can be written as a squared semi-norm J(f ) = f Kp induced by the kernel x′ ′ Kp (x, x ) = Ap − 1 2 x with a null-space of all constant functions. That is, f the RKHS induced by Kp . 1 p2 (t) dt Kp . 
(8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. 
Furthermore, there is no dependence between points in disjoint components separated by zero density regions. 4 Graph Laplacian Regularization in Higher Dimensions The analysis of the previous section seems promising, at it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is not the case anymore. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points. Theorem 2. Let p(x) be a smooth density over Rd , d 2, bounded from above by some constant pmax , and let (x1 , y1 ), . . . , (xl , yl ) be any (non-repeating) set of labeled examples. There exist continuous functions yǫ (x), for any ǫ > 0, all satisfying the constraints yǫ (xj ) = yj , j = 1, . . . , l, such ǫ→0 ǫ→0 that J(yǫ ) −→ 0 but yǫ (x) −→ 0 for all x = xj , j = 1, . . . , l. Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at x0 = 0 with y(x0 ) = 0 and the second labeled point is at x1 with x1 = 1 and y(x1 ) = 1. In addition, we assume that the ball B1 (0) of radius one centered around the origin is contained in Ω = {x ∈ Rd | p(x) > 0}. We first consider the case d > 2. Here, for any ǫ > 0, consider the function x ǫ yǫ (x) = min ,1 which indeed satisfies the two constraints yǫ (xi ) = yi , i = 0, 1. Then, J(yǫ ) = Bǫ (0) p2 (x) dx ǫ2 pmax ǫ2 dx = p2 Vd ǫd−2 max (13) Bǫ (0) where Vd is the volume of a unit ball in Rd . Hence, the sequence of functions yǫ (x) satisfy the constraints, but for d > 2, inf ǫ J(yǫ ) = 0. For d = 2, a more extreme example is necessary: consider the functions 2 x yǫ (x) = log +ǫ ǫ log 1+ǫ ǫ for x 1 and yǫ (x) = 1 for x > 1. 
These functions satisfy the two constraints $y_\epsilon(x_i) = y_i$, $i = 0, 1$ and:

$$J(y_\epsilon) = \frac{4}{\left[\log\left(\frac{1+\epsilon}{\epsilon}\right)\right]^2}\int_{B_1(0)} \frac{\|x\|^2}{(\|x\|^2 + \epsilon)^2}\,p^2(x)\,dx \le \frac{4 p_{\max}^2}{\left[\log\left(\frac{1+\epsilon}{\epsilon}\right)\right]^2}\int_0^1 \frac{r^2}{(r^2 + \epsilon)^2}\,2\pi r\,dr \le \frac{4\pi p_{\max}^2}{\left[\log\left(\frac{1+\epsilon}{\epsilon}\right)\right]^2}\log\left(\frac{1+\epsilon}{\epsilon}\right) = \frac{4\pi p_{\max}^2}{\log\left(\frac{1+\epsilon}{\epsilon}\right)} \xrightarrow{\epsilon \to 0} 0.$$

The implication of Theorem 2 is that regardless of the values at the labeled points, as $u \to \infty$, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel $K_p$, as in the one-dimensional case, is not possible.

Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in

$$\hat{y} = \arg\min_{y(x)} \frac{1}{l}\sum_{j=1}^{l} (y(x_j) - y_j)^2 + \gamma I_n(y).$$

In the limit of infinite unlabeled data, functions of the form $y_\epsilon(x)$ above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term $J(y)$ to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points.

4.1 Numerical Example

We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density $p(x)$ in $\mathbb{R}^2$, which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at $(4, 0)$. We sample a total of $n = 3000$ points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with $\sigma = 0.4$.

Figure 1 depicts the predictor $\hat{y}(x)$ obtained from (1). In fact, two different predictors are shown, obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution $\hat{y}(x)$ of (1) satisfies:

$$\hat{y}(x_i) = \sum_{j=1}^{n} W_{ij}\,\hat{y}(x_j) \Big/ \sum_{j=1}^{n} W_{ij}$$

on all unlabeled points $i = l + 1, \dots, l + u$.
(14)

Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for $\hat{y}(x_i)$, which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on $x_1, \dots, x_l$. This is the method used in the bottom panels of Figure 1.

As predicted, $\hat{y}(x)$ is almost constant for almost all unlabeled points. Although all values are very close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known a priori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%. The differences between the two numerical methods for solving (1) also point to another problem with the ill-posedness of the limit problem: the solution is numerically very unstable.

A more quantitative evaluation, which also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth $\sigma$, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error and the classification error of the resulting predictor $\hat{y}(x)$. We plot the classification error both when a threshold of zero is used (i.e.
the class is determined by $\mathrm{sign}(\hat{y}(x))$) and with the ideal threshold minimizing the test error. For each unlabeled sample size, we choose the bandwidth $\sigma$ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly.

[Figure 1 panels: DIRECT INVERSION (SIGN ERROR: 45%) and POWER METHOD (SIGN ERR: 17.1); legend: y(x) > 0, y(x) < 0]

Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations; the solution has not yet converged, despite small residuals of the order of $2 \cdot 10^{-4}$.

[Figure 2 panels: SQUARED ERROR, 0-1 ERROR (THRESHOLD=0), 0-1 ERROR (IDEAL THRESHOLD), each shown alongside OPTIMAL BANDWIDTH]

Figure 2: Squared error (top), classification error with a threshold of zero (center) and minimal classification error using ideal threshold (bottom), of the minimizer of (1) as a function of number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted.

Note that
ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even flatter.

4.2 Probabilistic Interpretation, Exit and Hitting Times

As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let $x(t)$ denote a random walk on the graph with transition matrix $M = D^{-1}W$ where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. Then, for the binary classification case with $y_i = \pm 1$ we have [15]:

$$\hat{y}(x_i) = 2\Pr\left[\,x(t) \text{ hits a point labeled } +1 \text{ before hitting a point labeled } -1 \;\middle|\; x(0) = x_i\,\right] - 1$$

We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and $\sigma \to 0$, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location $x_i$, yielding a flat predictor.

5 Keeping σ Finite

At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit $\sigma \to 0$. One possible objection is that there is an intrinsic characteristic scale for the data $\sigma_0$ where (with high probability) all points at a distance $\|x_i - x_j\| < \sigma_0$ have the same label.
If this is the case, then it may not necessarily make sense to take values of $\sigma < \sigma_0$ in constructing $W$. However, keeping $\sigma$ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function $y(x)$ which is zero everywhere except at the labeled points, where $y(x_j) = y_j$. With a finite number of labeled points of measure zero, $I^{(\sigma)}(y) = 0$ in any dimension and for any fixed $\sigma > 0$. While this limiting function is discontinuous, it is also possible to construct a sequence of continuous functions $y_\epsilon$ that all satisfy the constraints and for which $I^{(\sigma)}(y_\epsilon) \xrightarrow{\epsilon \to 0} 0$.

[Figure 3 panels: 50 points, 500 points, 3500 points; axes: x, y]

Figure 3: Minimizer of (1) for a 1-d problem with a fixed σ = 0.4, two labeled points and an increasing number of unlabeled points.

This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at $x = 4$, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed $\sigma$, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as $u \to \infty$.

6 Fourier-Eigenvector Based Methods

Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3].
Instead of using the Laplacian as a regularizer, constraining candidate predictors $y(x)$ non-parametrically to those with small $I_n(y)$ values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian: The similarity matrix $W$ is computed as before, and the Graph Laplacian matrix $L = D - W$ is considered (recall $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$). Only predictors

$$\hat{y}(x) = \sum_{j=1}^{p} a_j e_j \tag{15}$$

spanned by the first $p$ eigenvectors $e_1, \dots, e_p$ of $L$ (with smallest eigenvalues) are considered. The coefficients $a_j$ are chosen by minimizing a loss function on the labeled data, e.g. the squared loss:

$$(\hat{a}_1, \dots, \hat{a}_p) = \arg\min \sum_{j=1}^{l} (y_j - \hat{y}(x_j))^2. \tag{16}$$

Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit $u \to \infty$. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4].

Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions.
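The eigenvector method (15)–(16) can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the convention that the labeled points occupy the first rows, and the two-cluster toy data are all choices made here.

```python
import numpy as np


def gaussian_similarity(X, sigma):
    """W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), the Gaussian filter."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def laplacian_eigenvector_fit(W, y_labeled, p):
    """Least-squares fit in the span of the p lowest eigenvectors of L = D - W.

    Assumes the first len(y_labeled) rows of W are the labeled points.
    """
    l = len(y_labeled)
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues of a symmetric matrix in ascending order.
    _, eigvecs = np.linalg.eigh(L)
    E = eigvecs[:, :p]
    # Squared loss on the labeled rows only, as in (16).
    coeffs, *_ = np.linalg.lstsq(E[:l], y_labeled, rcond=None)
    return E @ coeffs


# Toy data (illustrative assumption): two well-separated clusters in R^2,
# with one labeled point per cluster placed in the first two rows.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.5, size=(20, 2))
B = rng.normal(0.0, 0.5, size=(20, 2)) + np.array([8.0, 0.0])
X = np.vstack([A[:1], B[:1], A[1:], B[1:]])
truth = np.concatenate([[-1.0, 1.0], -np.ones(19), np.ones(19)])

W = gaussian_similarity(X, sigma=1.0)
y_hat = laplacian_eigenvector_fit(W, truth[:2], p=2)
print(np.mean(np.sign(y_hat) == truth))
```

In this toy setting the two leading eigenvectors span functions that are close to constant on each cluster, so two labels are enough to fix the fit. Data without such cluster structure would not be expected to behave this way.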
Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well the unknown classification function can be approximated by a truncated expansion with relatively few eigenvectors.

We illustrate this issue with the following three-dimensional example: Let $p(x)$ denote the uniform density in the box $[0, 1] \times [0, 0.8] \times [0, 0.6]$, where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, $y_1(x) = 1_{x_1 > 0.5}$, $y_2(x) = 1_{x_1 > x_2/0.8}$ and $y_3(x) = 1_{x_2/0.8 > x_3/0.6}$. Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors $p$ is set to $p = l/5$ as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of $p$ selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method).

[Figure 4 panels: p = #labeled points/5; optimal p; 20 labeled points; Approx. Error. Axes: Prediction Error (%) against # labeled points or # eigenvectors]

Figure 4: Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points.
Prediction counts the number of sign agreements with the true labels. Rightmost panel: best fit when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors.

The reason for this behavior is that $y_2(x)$ and even more so $y_3(x)$ cannot be as easily approximated by the very few leading eigenfunctions. Even though they seem “simple” and “smooth”, they are significantly more complicated than $y_1(x)$ in terms of the measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form $\psi_{i,j,k}(x) = \cos(i\pi x_1)\cos(j\pi x_2/0.8)\cos(k\pi x_3/0.6)$, making it hard to represent simple decision boundaries which are not axis-aligned.

7 Discussion

Our results show that a popular SSL method, the Laplacian Regularization method (1), is not well-behaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons.

First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the numbers of both labeled and unlabeled points grow while $l/u$ is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance.

Second, in our experiments we noticed that although the predictor $\hat{y}(x)$ becomes extremely flat, in binary tasks, it is still typically possible to find a threshold leading to a good classification performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it.
Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression.

The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least $\frac{d+1}{2}$ derivatives in $\mathbb{R}^d$. In the context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the graph Laplacian matrix $L = D - W$, where $D$ is the diagonal degree matrix, with $L^{\frac{d+1}{2}}$ (matrix to the $\frac{d+1}{2}$ power). In the infinite unlabeled data limit, this corresponds to regularizing all order-$\frac{d+1}{2}$ (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather than ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL.

A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the “smoothness” of $y(x)$ relative to $p(x)$ remain unclear.

Acknowledgments

The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06).

References

[1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975.
[2] A.
Azran, The rendezvous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007.
[3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classification, NIPS, vol. 15, 2003.
[4] M. Belkin, P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS, vol. 19, 2007.
[5] M. Belkin, P. Niyogi, S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006.
[6] Y. Bengio, O. Delalleau, N. Le Roux, Label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Schölkopf and Zien, editors, MIT Press, 2006.
[7] O. Bousquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004.
[8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006.
[9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008.
[10] U. von Luxburg, M. Belkin, O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008.
[11] M. Meila, J. Shi, A random walks view of spectral segmentation, AI and Statistics, 2001.
[12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006.
[13] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, 2002.
[14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, NIPS, vol. 16, 2004.
[15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian fields and harmonic functions, ICML, 2003.
3 0.10553997 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
Author: Kaushik Sinha, Mikhail Belkin
Abstract: We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. It turns out that when the data has clustered, that is, when the high density regions are sufficiently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combination of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for good classification functions when the cluster assumption holds. By first choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classifier in the span of these eigenvectors, we obtain a classifier, which has a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption. Experimental results on a number of real-world data-sets show that our method is competitive with the state of the art semi-supervised learning algorithms and outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis). 1
4 0.10489064 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori
Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems. 1
5 0.085844137 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction
Author: Kwang I. Kim, Florian Steinke, Matthias Hein
Abstract: Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression which overcomes both these problems. If the data lies on or close to a low-dimensional submanifold in feature space, the Hessian energy prefers functions whose values vary linearly with respect to geodesic distance. We first derive the Hessian energy for smooth manifolds and continue to give a stable estimation procedure for the common case where only samples of the underlying manifold are given. The preference of ‘’linear” functions on manifolds renders the Hessian energy particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to find a user-defined embedding function given some labeled points which varies smoothly (and ideally linearly) along the manifold. The experimental results suggest superior performance of our method compared with semi-supervised regression using Laplacian regularization or standard supervised regression techniques applied to this task. 1
6 0.084625833 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
7 0.079509713 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
8 0.077966429 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity
9 0.077432349 72 nips-2009-Distribution Matching for Transduction
10 0.076366767 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
11 0.072949074 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
12 0.066862769 2 nips-2009-3D Object Recognition with Deep Belief Nets
13 0.065721691 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence
14 0.064871937 122 nips-2009-Label Selection on Graphs
15 0.052609049 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models
16 0.048474137 71 nips-2009-Distribution-Calibrated Hierarchical Classification
17 0.047902789 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
18 0.04782562 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
19 0.046188101 187 nips-2009-Particle-based Variational Inference for Continuous Systems
20 0.046023346 179 nips-2009-On the Algorithmics and Applications of a Mixed-norm based Kernel Learning Formulation
topicId topicWeight
[(0, -0.146), (1, 0.083), (2, -0.049), (3, 0.094), (4, -0.062), (5, 0.01), (6, -0.025), (7, 0.033), (8, -0.069), (9, 0.101), (10, -0.052), (11, 0.145), (12, -0.065), (13, -0.174), (14, -0.043), (15, -0.03), (16, -0.051), (17, -0.025), (18, -0.043), (19, -0.005), (20, -0.032), (21, -0.097), (22, 0.001), (23, -0.02), (24, 0.047), (25, -0.039), (26, 0.038), (27, 0.021), (28, -0.017), (29, -0.028), (30, 0.031), (31, -0.012), (32, -0.027), (33, -0.014), (34, -0.078), (35, -0.043), (36, -0.066), (37, -0.004), (38, -0.027), (39, 0.051), (40, 0.018), (41, 0.057), (42, 0.003), (43, 0.022), (44, 0.033), (45, -0.024), (46, -0.046), (47, 0.096), (48, -0.12), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.94117117 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine
Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang
Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependant partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependant and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms. 1
2 0.82040137 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou
Abstract: We study the behavior of the popular Laplacian Regularization method for Semi-Supervised Learning in the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in $\mathbb{R}^d$, $d \ge 2$, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method.

1 Introduction and Setup

In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density $p(x)$.

The premise of SSL is that the marginal density $p(x)$ is informative about the unknown mapping $y(x)$ we are trying to learn, e.g. since $y(x)$ is expected to be “smooth” in some sense relative to $p(x)$. Studying the infinite-unlabeled-data limit, where $p(x)$ is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator.

We consider the following setup: Let $p(x)$ be an unknown smooth density on a compact domain $\Omega \subset \mathbb{R}^d$ with a smooth boundary. Let $y : \Omega \to \mathcal{Y}$ be the unknown function we wish to estimate. In case of regression $\mathcal{Y} = \mathbb{R}$ whereas in binary classification $\mathcal{Y} = \{-1, 1\}$.
The standard (transductive) semi-supervised learning problem is formulated as follows: Given $l$ labeled points, $(x_1, y_1), \dots, (x_l, y_l)$, with $y_i = y(x_i)$, and $u$ unlabeled points $x_{l+1}, \dots, x_{l+u}$, with all points $x_i$ sampled i.i.d. from $p(x)$, the goal is to construct an estimate of $y(x_{l+i})$ for any unlabeled point $x_{l+i}$, utilizing both the labeled and the unlabeled points. We denote the total number of points by $n = l + u$. We are interested in the regime where $l$ is fixed and $u \to \infty$.

2 SSL with Graph Laplacian Regularization

We first consider the following graph-based approach formulated by Zhu et al. [15]:

$$\hat{y}(x) = \arg\min_y I_n(y) \quad \text{subject to } y(x_i) = y_i,\ i = 1, \dots, l \tag{1}$$

where

$$I_n(y) = \frac{1}{n^2}\sum_{i,j} W_{i,j}\,(y(x_i) - y(x_j))^2 \tag{2}$$

is a Laplacian regularization term enforcing “smoothness” with respect to the $n \times n$ similarity matrix $W$. This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities.

In contrast, our focus here is on the more typical scenario where the points $x_i \in \mathbb{R}^d$ are a random sample from a density $p(x)$, and $W$ is constructed based on this sample. We would like to understand the behavior of the method in terms of the density $p(x)$, particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling $y(x)$ and on the density $p(x)$ is the method (1) sensible?

The answer, of course, depends on how the matrix $W$ is constructed. We consider the common situation where the similarities are obtained by applying some decay filter to the distances:

$$W_{i,j} = G\left(\frac{\|x_i - x_j\|}{\sigma}\right) \tag{3}$$

where $G : \mathbb{R}^+ \to \mathbb{R}^+$ is some function with an adequately fast decay. Popular choices are the Gaussian filter $G(z) = e^{-z^2/2}$ or the $\epsilon$-neighborhood graph obtained by the step filter $G(z) = 1_{z<1}$.
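To make the formulation (1) to (3) concrete, here is a small sketch, offered as an illustration rather than as the authors' code. It builds $W$ with the Gaussian filter, evaluates $I_n(y)$, and solves the hard-constrained problem through the linear system obtained by setting the gradient over the unlabeled values to zero. The function names, the convention that the labeled points come first, and the toy data are assumptions.

```python
import numpy as np


def gaussian_similarity(X, sigma):
    """W_ij = G(||x_i - x_j|| / sigma) with the Gaussian filter G(z) = exp(-z^2 / 2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))


def laplacian_energy(W, y):
    """I_n(y) = (1 / n^2) * sum_{i,j} W_ij * (y_i - y_j)^2, as in (2)."""
    n = len(y)
    return np.sum(W * (y[:, None] - y[None, :]) ** 2) / n ** 2


def constrained_minimizer(W, y_labeled):
    """Minimise I_n(y) with y fixed to y_labeled on the first l points.

    Zeroing the gradient with respect to the unlabeled values gives the
    linear system L_uu y_u = W_ul y_l, where L = D - W.
    """
    l = len(y_labeled)
    L = np.diag(W.sum(axis=1)) - W
    y_u = np.linalg.solve(L[l:, l:], W[l:, :l] @ y_labeled)
    return np.concatenate([y_labeled, y_u])


# Toy data (an assumption for illustration): 2 labeled + 28 unlabeled points in R^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y_l = np.array([-1.0, 1.0])

W = gaussian_similarity(X, sigma=1.0)
y_hat = constrained_minimizer(W, y_l)
print(laplacian_energy(W, y_hat))
```

The dense $n \times n$ matrix is used only for brevity; for large $n$ a sparse $W$ (for example from the step filter) would be the more practical choice.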
For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term $I_n(y)$, e.g. [14, 6]. Our analysis and conclusions apply to such variants as well.

Limit of the Laplacian Regularization Term

As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density $p(x)$:

$$\lim_{n \to \infty} I_n(y) = I^{(\sigma)}(y) = \int_\Omega \int_\Omega G\left(\frac{\|x - x'\|}{\sigma}\right)(y(x) - y(x'))^2\,p(x)\,p(x')\,dx\,dx'. \tag{4}$$

In the above limit, the bandwidth $\sigma$ is held fixed. Typically, one would also drive the bandwidth $\sigma$ to zero as $n \to \infty$. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix $W$ sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term $I_n(y)$, at least when $\sigma \to 0$ slowly enough (see footnote 1), namely when $\sigma = \omega\left(\sqrt[d]{\log n / n}\right)$. If $\sigma \to 0$ as $n \to \infty$, and as long as $n\sigma^d / \log n \to \infty$, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]:

$$\lim_{n \to \infty} \frac{d}{C\sigma^{d+2}} I_n(y) = \lim_{\sigma \to 0} \frac{d}{C\sigma^{d+2}} I^{(\sigma)}(y) = J(y) = \int_\Omega \|\nabla y(x)\|^2\,p(x)^2\,dx \tag{5}$$

where $C = \int_{\mathbb{R}^d} \|z\|^2 G(\|z\|)\,dz$, and assuming $0 < C < \infty$ (which is the case for both the Gaussian and the step filters). This energy functional $J(f)$ therefore encodes the notion of “smoothness” with respect to $p(x)$ that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem:

$$\hat{y}(x) = \arg\min_y J(y) \quad \text{subject to } y(x_i) = y_i,\ i = 1, \dots, l \tag{6}$$

Footnote 1: When $\sigma = o\left(\sqrt[d]{1/n}\right)$ then all non-diagonal weights $W_{i,j}$ vanish (points no longer have any “close by” neighbors).
We are not aware of an analysis covering the regime where $\sigma$ decays roughly as $\sqrt[d]{1/n}$, but would be surprised if a qualitatively different meaningful limit is reached.

3 Graph Laplacian Regularization in $\mathbb{R}^1$

We begin by considering the solution of (6) for one dimensional data, i.e. $d = 1$ and $x \in \mathbb{R}$. We first consider the situation where the support of $p(x)$ is a continuous interval $\Omega = [a, b] \subset \mathbb{R}$ ($a$ and/or $b$ may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order $a \le x_1 < x_2 < \cdots < x_l \le b$. Applying the theory of variational calculus, the solution $\hat{y}(x)$ satisfies inside each interval $(x_i, x_{i+1})$ the Euler-Lagrange equation

$$\frac{d}{dx}\left[p^2(x)\frac{dy}{dx}\right] = 0.$$

Performing two integrations and enforcing the constraints at the labeled points yields

$$y(x) = y_i + \frac{\int_{x_i}^{x} 1/p^2(t)\,dt}{\int_{x_i}^{x_{i+1}} 1/p^2(t)\,dt}\,(y_{i+1} - y_i) \quad \text{for } x_i \le x \le x_{i+1} \tag{7}$$

with $y(x) = y_1$ for $a \le x \le x_1$ and $y(x) = y_l$ for $x_l \le x \le b$. If the support of $p(x)$ is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately.

The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when $p(x)$ is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where $p(x)$ is close to zero, $y(x)$ can change abruptly. Furthermore, the regularizer $J(y)$ can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer:

Theorem 1. Let $p(x)$ be a smooth density on $\Omega = [a, b] \subset \mathbb{R}$ such that $A_p = \frac{1}{4}\int_a^b 1/p^2(t)\,dt < \infty$. Then, $J(f)$ can be written as a squared semi-norm $J(f) = \|f\|_{K_p}^2$ induced by the kernel

$$K_p(x, x') = A_p - \frac{1}{2}\left|\int_x^{x'} \frac{1}{p^2(t)}\,dt\right|$$

with a null-space of all constant functions.
(8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. 
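To make the closed form (7) concrete, the following minimal sketch evaluates it on a grid. It is an illustration under stated assumptions, not the authors' code: the helper name `solve_1d`, the grid discretization, and the trapezoid rule used for the integral of $1/p^2$ are all choices made here.

```python
import numpy as np

def solve_1d(grid, p, x_lab, y_lab):
    """Evaluate the 1-D limit solution (7) on `grid` (hypothetical helper).

    grid  : increasing sample locations covering [a, b]
    p     : density values on the grid (assumed strictly positive)
    x_lab : increasing labeled locations inside [grid[0], grid[-1]]
    y_lab : labels at x_lab
    """
    grid = np.asarray(grid, dtype=float)
    inv_p2 = 1.0 / np.asarray(p, dtype=float) ** 2
    # F(x) = integral from a to x of 1/p(t)^2 dt, by the trapezoid rule.
    steps = 0.5 * (inv_p2[1:] + inv_p2[:-1]) * np.diff(grid)
    F = np.concatenate(([0.0], np.cumsum(steps)))
    F_lab = np.interp(x_lab, grid, F)
    # (7) is linear in F between consecutive labels; np.interp also holds
    # the end values constant outside the labeled range, as in the text.
    return np.interp(F, F_lab, y_lab)

# Uniform density on [0, 1] with labels 0 at x = 0.2 and 1 at x = 0.8.
grid = np.linspace(0.0, 1.0, 101)
y_uniform = solve_1d(grid, np.ones_like(grid), [0.2, 0.8], [0.0, 1.0])
```

With a uniform density the result is plain linear interpolation between the two labels and constant outside them. Passing a density with a deep valley between the labels should instead concentrate most of the change in $y$ inside the valley, since $1/p^2$ is large there.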
Furthermore, there is no dependence between points in disjoint components separated by zero density regions.

4 Graph Laplacian Regularization in Higher Dimensions

The analysis of the previous section seems promising, as it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regrettably, in higher dimensions this is no longer the case. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points.

Theorem 2. Let $p(x)$ be a smooth density over $\mathbb{R}^d$, $d \ge 2$, bounded from above by some constant $p_{\max}$, and let $(x_1, y_1), \dots, (x_l, y_l)$ be any (non-repeating) set of labeled examples. There exist continuous functions $y_\epsilon(x)$, for any $\epsilon > 0$, all satisfying the constraints $y_\epsilon(x_j) = y_j$, $j = 1, \dots, l$, such that $J(y_\epsilon) \to 0$ as $\epsilon \to 0$ but $y_\epsilon(x) \to 0$ as $\epsilon \to 0$ for all $x \ne x_j$, $j = 1, \dots, l$.

Proof. We present a detailed proof for the case of $l = 2$ labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at $x_0 = 0$ with $y(x_0) = 0$ and the second labeled point is at $x_1$ with $\|x_1\| = 1$ and $y(x_1) = 1$. In addition, we assume that the ball $B_1(0)$ of radius one centered around the origin is contained in $\Omega = \{x \in \mathbb{R}^d \mid p(x) > 0\}$.

We first consider the case $d > 2$. Here, for any $\epsilon > 0$, consider the function

$$y_\epsilon(x) = \min\left( \frac{\|x\|}{\epsilon},\, 1 \right)$$

which indeed satisfies the two constraints $y_\epsilon(x_i) = y_i$, $i = 0, 1$. Then,

$$J(y_\epsilon) = \int_{B_\epsilon(0)} \frac{p^2(x)}{\epsilon^2}\,dx \le \frac{p_{\max}^2}{\epsilon^2} \int_{B_\epsilon(0)} dx = p_{\max}^2 V_d\, \epsilon^{d-2} \tag{13}$$

where $V_d$ is the volume of a unit ball in $\mathbb{R}^d$. Hence, the sequence of functions $y_\epsilon(x)$ satisfies the constraints, but for $d > 2$, $\inf_\epsilon J(y_\epsilon) = 0$.

For $d = 2$, a more extreme example is necessary: consider the functions

$$y_\epsilon(x) = \log\left( \frac{\|x\|^2 + \epsilon}{\epsilon} \right) \Big/ \log\left( \frac{1+\epsilon}{\epsilon} \right) \quad \text{for } \|x\| \le 1, \qquad y_\epsilon(x) = 1 \quad \text{for } \|x\| > 1.$$
These functions satisfy the two constraints $y_\epsilon(x_i) = y_i$, $i = 0, 1$ and:

$$J(y_\epsilon) = \frac{4}{\left[ \log\left( \frac{1+\epsilon}{\epsilon} \right) \right]^2} \int_{B_1(0)} \frac{\|x\|^2}{(\|x\|^2 + \epsilon)^2}\, p^2(x)\,dx \le \frac{4 p_{\max}^2}{\left[ \log\left( \frac{1+\epsilon}{\epsilon} \right) \right]^2} \int_0^1 \frac{r^2}{(r^2 + \epsilon)^2}\, 2\pi r\,dr \le \frac{4\pi p_{\max}^2}{\log\left( \frac{1+\epsilon}{\epsilon} \right)} \xrightarrow{\epsilon \to 0} 0.$$

The implication of Theorem 2 is that regardless of the values at the labeled points, as $u \to \infty$, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel $K_p$, as in the one-dimensional case, is not possible.

Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in

$$\hat{y} = \arg\min_{y(x)} \frac{1}{l} \sum_{j=1}^{l} (y(x_j) - y_j)^2 + \gamma I_n(y).$$

In the limit of infinite unlabeled data, functions of the form $y_\epsilon(x)$ above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term $J(y)$ to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points.

4.1 Numerical Example

We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density $p(x)$ in $\mathbb{R}^2$, which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at $(4, 0)$. We sample a total of $n = 3000$ points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with $\sigma = 0.4$.

Figure 1 depicts the predictor $\hat{y}(x)$ obtained from (1). In fact, two different predictors are shown, obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution $\hat{y}(x)$ of (1) satisfies:

$$\hat{y}(x_i) = \frac{\sum_{j=1}^{n} W_{ij}\, \hat{y}(x_j)}{\sum_{j=1}^{n} W_{ij}} \quad \text{on all unlabeled points } i = l+1, \dots, l+u. \tag{14}$$

Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for $\hat{y}(x_i)$, which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on $x_1, \dots, x_l$. This is the method used in the bottom panels of Figure 1.

As predicted, $\hat{y}(x)$ is almost constant for almost all unlabeled points. Although all values are very close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known a priori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%.

The differences between the two numerical methods for solving (1) also point to another problem with the ill-posedness of the limit problem: the solution is numerically very unstable.

A more quantitative evaluation, which also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth $\sigma$, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error and the classification error of the resulting predictor $\hat{y}(x)$. We plot the classification error both when a threshold of zero is used (i.e.
the class is determined by sign($\hat{y}(x)$)) and with the ideal threshold minimizing the test error. For each unlabeled sample size, we choose the bandwidth $\sigma$ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases, the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly. Note that ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even flatter.

Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations; the solution has not yet converged, despite small residuals of the order of $2 \cdot 10^{-4}$. (Panel titles: DIRECT INVERSION, SIGN ERROR: 45%; POWER METHOD, SIGN ERR: 17.1.)

Figure 2: Squared error (top), classification error with a threshold of zero (center) and minimal classification error using ideal threshold (bottom), of the minimizer of (1) as a function of number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted. (Panel titles: SQUARED ERROR; 0-1 ERROR (THRESHOLD=0); 0-1 ERROR (IDEAL THRESHOLD); each paired with an OPTIMAL BANDWIDTH panel.)

4.2 Probabilistic Interpretation, Exit and Hitting Times

As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let $x(t)$ denote a random walk on the graph with transition matrix $M = D^{-1}W$, where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$. Then, for the binary classification case with $y_i = \pm 1$ we have [15]:

$$\hat{y}(x_i) = 2 \Pr\big[\, x(t) \text{ hits a point labeled } +1 \text{ before hitting a point labeled } -1 \,\big|\, x(0) = x_i \,\big] - 1.$$

We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and $\sigma \to 0$, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location $x_i$, yielding a flat predictor.

5 Keeping σ Finite

At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit $\sigma \to 0$. One possible objection is that there is an intrinsic characteristic scale for the data $\sigma_0$ where (with high probability) all points at a distance $\|x_i - x_j\| < \sigma_0$ have the same label.
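As an aside on Sections 4.1 and 4.2, the two numerical schemes and the random-walk reading can be sketched on a toy graph. This is a minimal sketch, not the authors' MATLAB code: the function names are hypothetical, NumPy's dense solver stands in for the backslash operator, and the four-node chain is chosen only because its answer can be checked by hand.

```python
import numpy as np

def harmonic_direct(W, lab_idx, y_lab):
    """Solve (14) together with the label constraints by one linear solve."""
    n = W.shape[0]
    lab_idx = np.asarray(lab_idx)
    unl_idx = np.setdiff1d(np.arange(n), lab_idx)
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian D - W
    y = np.zeros(n)
    y[lab_idx] = y_lab
    # Unlabeled rows of L y = 0 give  L_UU y_U = -L_UL y_L.
    A = L[np.ix_(unl_idx, unl_idx)]
    b = -L[np.ix_(unl_idx, lab_idx)] @ y[lab_idx]
    y[unl_idx] = np.linalg.solve(A, b)
    return y

def label_propagation(W, lab_idx, y_lab, n_iter=1000):
    """Iterate (14) from zero labels, re-imposing the known labels each step."""
    lab_idx = np.asarray(lab_idx)
    M = W / W.sum(axis=1, keepdims=True)      # random-walk matrix D^{-1} W
    y = np.zeros(W.shape[0])
    y[lab_idx] = y_lab
    for _ in range(n_iter):
        y = M @ y
        y[lab_idx] = y_lab
    return y

def prob_hit_plus_first(W, lab_idx, y_lab):
    """Probability that the walk D^{-1} W reaches a +1 label before a -1 label."""
    n = W.shape[0]
    lab_idx = np.asarray(lab_idx)
    unl_idx = np.setdiff1d(np.arange(n), lab_idx)
    M = W / W.sum(axis=1, keepdims=True)
    h = np.zeros(n)
    h[lab_idx] = (np.asarray(y_lab) > 0).astype(float)
    # First-step analysis on the unlabeled nodes: (I - M_UU) h_U = M_UL h_L.
    A = np.eye(len(unl_idx)) - M[np.ix_(unl_idx, unl_idx)]
    b = M[np.ix_(unl_idx, lab_idx)] @ h[lab_idx]
    h[unl_idx] = np.linalg.solve(A, b)
    return h

# Toy chain 0 - 1 - 2 - 3 with unit weights; the ends are labeled +1 and -1.
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0
y_direct = harmonic_direct(W, [0, 3], [1.0, -1.0])
y_prop = label_propagation(W, [0, 3], [1.0, -1.0])
h = prob_hit_plus_first(W, [0, 3], [1.0, -1.0])
```

On this chain the direct solve, the propagation iterates and `2 * h - 1` all agree, which is the identity quoted from [15]. On large, nearly flat problems such as the one in Figure 1, the caption reports that 1000 propagation steps were not enough to converge, so agreement between the two schemes should not be taken for granted.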
If this is the case, then it may not necessarily make sense to take values of $\sigma < \sigma_0$ in constructing $W$. However, keeping $\sigma$ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function $y(x)$ which is zero everywhere except at the labeled points, where $y(x_j) = y_j$. With a finite number of labeled points of measure zero, $I^{(\sigma)}(y) = 0$ in any dimension and for any fixed $\sigma > 0$. While this limiting function is discontinuous, it is also possible to construct a sequence of continuous functions $y_\epsilon$ that all satisfy the constraints and for which $I^{(\sigma)}(y_\epsilon) \to 0$ as $\epsilon \to 0$.

This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at $x = 4$, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed $\sigma$, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as $u \to \infty$.

Figure 3: Minimizer of (1) for a 1-d problem with a fixed σ = 0.4, two labeled points and an increasing number of unlabeled points. (Panels: 50 points, 500 points, 3500 points; axes x and y.)

6 Fourier-Eigenvector Based Methods

Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3].
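Looking back at the fixed-σ argument of Section 5, a small numerical check may help. The sketch below assumes the empirical regularizer has the form $I_n(y) = \frac{1}{n^2}\sum_{i,j} W_{ij}(y_i - y_j)^2$ with Gaussian weights $W_{ij} = \exp(-(x_i - x_j)^2 / 2\sigma^2)$; both forms, the helper names and the sample sizes are assumptions made for illustration. It compares the energy of the "spike" function (zero except at the labels) with that of the constrained minimizer.

```python
import numpy as np

def gaussian_weights(x, sigma):
    """Assumed Gaussian filter: W_ij = exp(-(x_i - x_j)^2 / (2 sigma^2))."""
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def energy(W, y):
    """Assumed form of I_n(y): (1/n^2) * sum_ij W_ij (y_i - y_j)^2."""
    diff = y[:, None] - y[None, :]
    return float(np.sum(W * diff ** 2)) / len(y) ** 2

def constrained_minimizer(W, lab_idx, y_lab):
    """Minimize the energy subject to the labels (harmonic solution)."""
    n = W.shape[0]
    unl_idx = np.setdiff1d(np.arange(n), lab_idx)
    L = np.diag(W.sum(axis=1)) - W
    y = np.zeros(n)
    y[lab_idx] = y_lab
    A = L[np.ix_(unl_idx, unl_idx)]
    b = -L[np.ix_(unl_idx, lab_idx)] @ y[lab_idx]
    y[unl_idx] = np.linalg.solve(A, b)
    return y

rng = np.random.default_rng(0)
sigma = 0.4
lab_idx = np.array([0, 1])
y_lab = np.array([-1.0, 1.0])
results = []
for u in (50, 500):
    # Two labeled points at the Gaussian centers, u unlabeled draws.
    x_unl = np.concatenate([rng.normal(0.0, 1.0, u // 2),
                            rng.normal(4.0, 1.0, u - u // 2)])
    x = np.concatenate([[0.0, 4.0], x_unl])
    W = gaussian_weights(x, sigma)
    n, l = len(x), len(lab_idx)
    spike = np.zeros(n)
    spike[lab_idx] = y_lab
    y_min = constrained_minimizer(W, lab_idx, y_lab)
    # Since W_ij <= 1 and |y| <= 1, the spike energy is at most this bound.
    bound = (2 * l * u + 4 * l * l) / n ** 2
    results.append((u, energy(W, y_min), energy(W, spike), bound))
```

Because the weights never exceed one, the spike function's energy is capped by a bound that shrinks roughly like $l/n$, and the minimizer's energy can only be smaller. This is the finite-sample counterpart of $I^{(\sigma)}(y) = 0$ for the spike function, and it holds for any fixed σ.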
Instead of using the Laplacian as a regularizer, constraining candidate predictors $y(x)$ non-parametrically to those with small $I_n(y)$ values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian: The similarity matrix $W$ is computed as before, and the Graph Laplacian matrix $L = D - W$ is considered (recall $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$). Only predictors

$$\hat{y}(x) = \sum_{j=1}^{p} a_j e_j \tag{15}$$

spanned by the first $p$ eigenvectors $e_1, \dots, e_p$ of $L$ (with smallest eigenvalues) are considered. The coefficients $a_j$ are chosen by minimizing a loss function on the labeled data, e.g. the squared loss:

$$(\hat{a}_1, \dots, \hat{a}_p) = \arg\min \sum_{j=1}^{l} (y_j - \hat{y}(x_j))^2. \tag{16}$$

Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit $u \to \infty$. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4].

Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions.
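A minimal sketch of the eigenvector method (15)–(16) follows. It assumes a dense symmetric weight matrix, uses `numpy.linalg.eigh` for the eigenvectors and `numpy.linalg.lstsq` for the squared-loss fit, and the function name and toy graph are illustrative choices, not taken from [3].

```python
import numpy as np

def eigenvector_method(W, lab_idx, y_lab, p):
    """Fit labels in the span of the p smallest-eigenvalue eigenvectors of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    # For a symmetric matrix, eigh returns eigenvalues in ascending order,
    # so the first p columns are the eigenvectors used in (15).
    _, vecs = np.linalg.eigh(L)
    E = vecs[:, :p]
    # Least-squares coefficients on the labeled rows only, as in (16).
    a, *_ = np.linalg.lstsq(E[lab_idx], np.asarray(y_lab, dtype=float), rcond=None)
    return E @ a

# Toy graph: two disconnected triangles (an extreme case of two clusters
# separated by a zero-density region), with one labeled node in each.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
y_hat = eigenvector_method(W, [0, 3], [1.0, -1.0], p=2)
```

In this toy case the two leading eigenvectors span the functions that are constant on each triangle, so a single label per cluster is enough to recover a labeling that is constant within clusters. With overlapping clusters the eigenvectors are only approximately piecewise constant and the fit depends on the choice of `p`.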
Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well the unknown classification function can be approximated by a truncated expansion with relatively few eigenvectors.

We illustrate this issue with the following three-dimensional example: Let $p(x)$ denote the uniform density in the box $[0, 1] \times [0, 0.8] \times [0, 0.6]$, where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, $y_1(x) = 1_{x_1 > 0.5}$, $y_2(x) = 1_{x_1 > x_2/0.8}$ and $y_3(x) = 1_{x_2/0.8 > x_3/0.6}$. Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors $p$ is set to $p = l/5$ as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of $p$ selected on the test set (i.e. a “cheating” choice representing a lower bound on the generalization error of this method).

Figure 4: Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: best fit when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors. (Axes: Prediction Error (%) against # labeled points, with curves for p = #labeled points/5 and optimal p; Approx. Error against # eigenvectors.)

The reason for this behavior is that $y_2(x)$ and even more so $y_3(x)$ cannot be as easily approximated by the very few leading eigenfunctions. Even though they seem “simple” and “smooth”, they are significantly more complicated than $y_1(x)$ in terms of the measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form $\psi_{i,j,k}(x) = \cos(i\pi x_1) \cos(j\pi x_2/0.8) \cos(k\pi x_3/0.6)$, making it hard to represent simple decision boundaries which are not axis-aligned.

7 Discussion

Our results show that a popular SSL method, the Laplacian Regularization method (1), is not well-behaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons.

First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the numbers of both labeled and unlabeled points grow while $l/u$ is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance.

Second, in our experiments we noticed that although the predictor $\hat{y}(x)$ becomes extremely flat, in binary tasks, it is still typically possible to find a threshold leading to a good classification performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it.
Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression.

The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least $\frac{d+1}{2}$ derivatives in $\mathbb{R}^d$. In the context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the graph Laplacian matrix $L = D - W$, where $D$ is the diagonal degree matrix, with $L^{\frac{d+1}{2}}$ (matrix to the $\frac{d+1}{2}$ power). In the infinite unlabeled data limit, this corresponds to regularizing all order-$\frac{d+1}{2}$ (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather than ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL.

A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the “smoothness” of $y(x)$ relative to $p(x)$ remain unclear.

Acknowledgments

The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06).

References

[1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975.
[2] A. Azran, The rendezvous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007.
[3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classification, NIPS, vol. 15, 2003.
[4] M. Belkin and P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS 19, 2007.
[5] M. Belkin, P. Niyogi and S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006.
[6] Y. Bengio, O. Delalleau, N. Le Roux, Label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Schölkopf and Zien, editors, MIT Press, 2006.
[7] O. Bousquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004.
[8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006.
[9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008.
[10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008.
[11] M. Meila, J. Shi, A random walks view of spectral segmentation, AI and Statistics, 2001.
[12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006.
[13] B. Schölkopf, A. Smola, Learning with Kernels, MIT Press, 2002.
[14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, NIPS, vol. 16, 2004.
[15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian fields and harmonic functions, ICML, 2003.
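The iterated Laplacian mentioned in the Discussion can be sketched as follows. This is only an illustration under assumptions: the matrix power is taken through an eigendecomposition so that non-integer exponents such as $(d+1)/2$ with even $d$ are allowed, the helper names are hypothetical, and nothing here is claimed to be an established SSL recipe.

```python
import numpy as np

def iterated_laplacian(W, power):
    """Return (D - W) raised to `power`, computed by eigendecomposition."""
    L = np.diag(W.sum(axis=1)) - W
    vals, vecs = np.linalg.eigh(L)
    vals = np.clip(vals, 0.0, None)      # L is PSD; remove round-off negatives
    return (vecs * vals ** power) @ vecs.T

def constrained_minimizer(R, lab_idx, y_lab):
    """Minimize y^T R y subject to y[lab_idx] = y_lab."""
    n = R.shape[0]
    lab_idx = np.asarray(lab_idx)
    unl_idx = np.setdiff1d(np.arange(n), lab_idx)
    y = np.zeros(n)
    y[lab_idx] = y_lab
    A = R[np.ix_(unl_idx, unl_idx)]
    b = -R[np.ix_(unl_idx, lab_idx)] @ y[lab_idx]
    y[unl_idx] = np.linalg.solve(A, b)
    return y

# Toy chain 0 - 1 - 2 - 3 with unit weights; the ends are labeled +1 and -1.
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0
L = np.diag(W.sum(axis=1)) - W
R1 = iterated_laplacian(W, 1.0)          # ordinary Laplacian regularizer
R2 = iterated_laplacian(W, 2.0)          # L applied twice
y1 = constrained_minimizer(R1, [0, 3], [1.0, -1.0])
y_frac = constrained_minimizer(iterated_laplacian(W, 1.5), [0, 3], [1.0, -1.0])
```

With `power=1.0` this reduces to the usual constrained problem (1). For larger graphs a dense eigendecomposition is expensive, and the appropriate exponent depends on the intrinsic dimensionality, which, as the Discussion notes, is usually unknown.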
Author: Kwang I. Kim, Florian Steinke, Matthias Hein
Abstract: Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression which overcomes both these problems. If the data lies on or close to a low-dimensional submanifold in feature space, the Hessian energy prefers functions whose values vary linearly with respect to geodesic distance. We first derive the Hessian energy for smooth manifolds and continue to give a stable estimation procedure for the common case where only samples of the underlying manifold are given. The preference of “linear” functions on manifolds renders the Hessian energy particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to find a user-defined embedding function given some labeled points which varies smoothly (and ideally linearly) along the manifold. The experimental results suggest superior performance of our method compared with semi-supervised regression using Laplacian regularization or standard supervised regression techniques applied to this task.
4 0.67544472 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence
Author: Wei Bian, Dacheng Tao
Abstract: In this paper, we study the manifold regularization for the Sliced Inverse Regression (SIR). The manifold regularization improves the standard SIR in two aspects: 1) it encodes the local geometry for SIR and 2) it enables SIR to deal with transductive and semi-supervised learning problems. We prove that the proposed graph Laplacian based regularization is convergent at rate root-n. The projection directions of the regularized SIR are optimized by using a conjugate gradient method on the Grassmann manifold. Experimental results support our theory.
5 0.62409282 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori
Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems.
6 0.61602336 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
7 0.58645344 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
8 0.53673649 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
9 0.51301038 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
10 0.47759688 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
11 0.45740104 72 nips-2009-Distribution Matching for Transduction
12 0.45370993 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity
13 0.45312241 122 nips-2009-Label Selection on Graphs
14 0.44766197 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models
15 0.4077563 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines
16 0.40270203 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation
17 0.39404532 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding
18 0.36411867 76 nips-2009-Efficient Learning using Forward-Backward Splitting
19 0.35563871 71 nips-2009-Distribution-Calibrated Hierarchical Classification
20 0.34444869 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints
topicId topicWeight
[(7, 0.014), (22, 0.248), (24, 0.068), (25, 0.082), (35, 0.025), (36, 0.13), (39, 0.057), (44, 0.017), (58, 0.072), (61, 0.014), (71, 0.041), (86, 0.115)]
simIndex simValue paperId paperTitle
1 0.8806473 227 nips-2009-Speaker Comparison with Inner Product Discriminant Functions
Author: Zahi Karam, Douglas Sturim, William M. Campbell
Abstract: Speaker comparison, the process of finding the speaker similarity between two speech signals, occupies a central role in a variety of applications—speaker verification, clustering, and identification. Speaker comparison can be placed in a geometric framework by casting the problem as a model comparison process. For a given speech signal, feature vectors are produced and used to adapt a Gaussian mixture model (GMM). Speaker comparison can then be viewed as the process of compensating and finding metrics on the space of adapted models. We propose a framework, inner product discriminant functions (IPDFs), which extends many common techniques for speaker comparison—support vector machines, joint factor analysis, and linear scoring. The framework uses inner products between the parameter vectors of GMM models motivated by several statistical methods. Compensation of nuisances is performed via linear transforms on GMM parameter vectors. Using the IPDF framework, we show that many current techniques are simple variations of each other. We demonstrate, on a 2006 NIST speaker recognition evaluation task, new scoring methods using IPDFs which produce excellent error rates and require significantly less computation than current techniques.
2 0.85488272 21 nips-2009-Abstraction and Relational learning
Author: Charles Kemp, Alan Jern
Abstract: Most models of categorization learn categories defined by characteristic features but some categories are described more naturally in terms of relations. We present a generative model that helps to explain how relational categories are learned and used. Our model learns abstract schemata that specify the relational similarities shared by instances of a category, and our emphasis on abstraction departs from previous theoretical proposals that focus instead on comparison of concrete instances. Our first experiment suggests that abstraction can help to explain some of the findings that have previously been used to support comparison-based approaches. Our second experiment focuses on one-shot schema learning, a problem that raises challenges for comparison-based approaches but is handled naturally by our abstraction-based account. Categories such as family, sonnet, above, betray, and imitate differ in many respects but all of them depend critically on relational information. Members of a family are typically related by blood or marriage, and the lines that make up a sonnet must rhyme with each other according to a certain pattern. A pair of objects will demonstrate “aboveness” only if a certain spatial relationship is present, and an event will qualify as an instance of betrayal or imitation only if its participants relate to each other in certain ways. All of the cases just described are examples of relational categories. This paper develops a computational approach that helps to explain how simple relational categories are acquired. Our approach highlights the role of abstraction in relational learning. Given several instances of a relational category, it is often possible to infer an abstract representation that captures what the instances have in common. We refer to these abstract representations as schemata, although others may prefer to call them rules or theories. 
For example, a sonnet schema might specify the number of lines that a sonnet should include and the rhyming pattern that the lines should follow. Once a schema has been acquired it can support several kinds of inferences. A schema can be used to make predictions about hidden aspects of the examples already observed: if the final word in a sonnet is illegible, the rhyming pattern can help to predict the identity of this word. A schema can be used to decide whether new examples (e.g. new poems) qualify as members of the category. Finally, a schema can be used to generate novel examples of a category (e.g. novel sonnets).

Most researchers would agree that abstraction plays some role in relational learning, but Gentner [1] and other psychologists have emphasized the role of comparison instead [2, 3]. Given one example of a sonnet and the task of deciding whether a second poem is also a sonnet, a comparison-based approach might attempt to establish an alignment or mapping between the two. Approaches that rely on comparison or mapping are especially prominent in the literature on analogical reasoning [4, 5], and many of these approaches can be viewed as accounts of relational categorization [6]. For example, the problem of deciding whether two systems are analogous can be formalized as the problem of deciding whether these systems are instances of the same relational category. Despite some notable exceptions [6, 7], most accounts of analogy focus on comparison rather than abstraction, and suggest that “analogy passes from one instance of a generalization to another without pausing for explicit induction of the generalization” (p 95) [8].

Figure 1: A hierarchical generative model for learning and using relational categories. The schema s at the top level is a logical sentence that specifies which groups are valid instances of the category. The group g at the second level is randomly sampled from the set of valid instances, and the observation o is a partially observed version of group g. (Levels shown: Schema s, with example ∀Q ∀x ∀y Q(x) < Q(y) ↔ D1(x) < D1(y); Group g; Observation o.)

Researchers that focus on comparison sometimes discuss abstraction, but typically suggest that abstractions emerge as a consequence of comparing two or more concrete instances of a category [3, 5, 9, 10]. This view, however, will not account for one-shot inferences, or inferences based on a single instance of a relational category. Consider a learner who is shown one instance of a sonnet then asked to create a second instance. Since only one instance is provided, it is hard to see how comparisons between instances could account for success on the task. A single instance, however, will sometimes provide enough information for a schema to be learned, and this schema should allow subsequent instances to be generated [11].

Here we develop a formal framework for exploring relational learning in general and one-shot schema learning in particular. Our framework relies on the hierarchical Bayesian approach, which provides a natural way to combine abstraction and probabilistic inference [12]. The hierarchical Bayesian approach supports representations at multiple levels of abstraction, and helps to explain how abstract representations (e.g. a sonnet schema) can be acquired given observations of concrete instances (e.g. individual sonnets). The schemata we consider are represented as sentences in a logical language, and our approach therefore builds on previous probabilistic methods for learning and using logical theories [13, 14]. Following previous authors, we propose that logical representations can help to capture the content of human knowledge, and that Bayesian inference helps to explain how these representations are acquired and how they support inductive inference. The following sections introduce our framework then evaluate it using two behavioral experiments.
Our first experiment uses a standard classification task where participants are shown one example of a category then asked to decide which of two alternatives is more likely to belong to the same category. Tasks of this kind have previously been used to argue for the importance of comparison, but we suggest that these tasks can be handled by accounts that focus on abstraction. Our second experiment uses a less standard generation task [15, 16] where participants are shown a single example of a category then asked to generate additional examples. As predicted by our abstraction-based account, we find that people are able to learn relational categories on the basis of a single example.

1 A generative approach to relational learning

Our examples so far have used real-world relational categories such as family and sonnet but we now turn to a very simple domain where relational categorization can be studied. Each element in the domain is a group of components that vary along a number of dimensions: in Figure 1, the components are figures that vary along the dimensions of size, color, and circle position. The groups can be organized into categories, and one such category includes groups where every component is black. Although our domain is rather basic it allows some simple relational regularities to be explored. We can consider categories, for example, where all components in a group must be the same along some dimension, and categories where all components must be different along some dimension. We can also consider categories defined by relationships between dimensions, for example the category that includes all groups where the size and color dimensions are correlated. Each category is associated with a schema, or an abstract representation that specifies which groups are valid instances of the category.
Here we consider schemata that correspond to rules formulated in a logical language. The language includes three binary connectives: and (∧), or (∨), and if and only if (↔). Four binary relations (=, ≠, <, and >) are available for comparing values along dimensions. Universal quantification (∀x) and existential quantification (∃x) are both permitted, and the language includes quantification over objects (∀x) and dimensions (∀Q). For example, the schema in Figure 1 states that all dimensions are aligned. More precisely, if D1 is the dimension of size, the schema states that for all dimensions Q, a component x is smaller than a component y along dimension Q if and only if x is smaller in size than y. It follows that all three dimensions must increase or decrease together.

Table 1: Templates used to construct a hypothesis space of logical schemata. An instance of a given template can be created by choosing an element from each set enclosed in braces (alternatives within a set are separated by |), replacing each occurrence of Di or Dj with a dimension (e.g. D1) and replacing each occurrence of vk or vl with a value (e.g. 1).
1. {∀x | ∃x} Di(x) {= | ≠ | < | >} vk
2. {∀x ∀y x ≠ y → | ∃x ∃y x ≠ y ∧} Di(x) {= | ≠ | < | >} Di(y)
3. ∀x Di(x) {= | ≠} vk {∧ | ∨ | ↔} Dj(x) {= | ≠} vl
4. ∀x ∀y x ≠ y → (Di(x) {= | ≠ | < | >} Di(y) {∧ | ∨ | ↔} Dj(x) {= | ≠ | < | >} Dj(y))
5. {∀Q | ∃Q} {∀x ∀y x ≠ y → | ∃x ∃y x ≠ y ∧} Q(x) {= | ≠ | < | >} Q(y)
6. {∀Q Q ≠ Di → | ∃Q Q ≠ Di ∧} ∀x ∀y x ≠ y → (Q(x) {= | ≠ | < | >} Q(y) {∧ | ∨ | ↔} Di(x) {= | ≠ | < | >} Di(y))
7. {∀Q ∀R Q ≠ R → | ∃Q ∃R Q ≠ R ∧} ∀x ∀y x ≠ y → (Q(x) {= | ≠ | < | >} Q(y) {∧ | ∨ | ↔} R(x) {= | ≠ | < | >} R(y))

To explain how rules in this logical language are learned we work with the hierarchical generative model in Figure 1. The representation at the top level is a schema s, and we assume that one or more groups g are generated from a distribution P(g|s).
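To make the Figure 1 schema concrete, the following is a minimal Python sketch that tests whether a group satisfies "all dimensions aligned". The representation is an assumption made for illustration, not the authors' implementation: a group is a tuple of cards, each card is a tuple of integer values (one per dimension), and `ref_dim` plays the role of D1.

```python
from itertools import product

def all_dimensions_aligned(group, ref_dim=0):
    """Check the schema: for all Q, x, y:  Q(x) < Q(y)  <->  D_ref(x) < D_ref(y).

    `group` is a tuple of cards; each card is a tuple of values, one per
    dimension (hypothetical encoding chosen for this sketch).
    """
    n_dims = len(group[0])
    for q in range(n_dims):                      # quantify over dimensions Q
        for x, y in product(group, repeat=2):    # quantify over card pairs x, y
            if (x[q] < y[q]) != (x[ref_dim] < y[ref_dim]):
                return False                     # biconditional violated
    return True

aligned = ((1, 1, 1), (3, 3, 3), (4, 4, 4), (5, 5, 5))
mixed = ((1, 1, 3), (2, 2, 2), (3, 3, 1))        # third dimension runs backwards
print(all_dimensions_aligned(aligned))  # True
print(all_dimensions_aligned(mixed))    # False
```

Encoding a schema as a Boolean predicate over groups is convenient because the extension of a schema is then simply the set of groups for which the predicate returns True.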
Following a standard approach to category learning [17, 18], we assume that g is uniformly sampled from all groups consistent with s:

$$p(g \mid s) \propto \begin{cases} 1 & \text{if } g \text{ is consistent with } s \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

For all applications in this paper, we assume that the number of components in a group is known and fixed in advance. The bottom level of the hierarchy specifies observations o that are generated from a distribution P(o|g). In most cases we assume that g can be directly observed, and that P(o|g) = 1 if o = g and 0 otherwise. We also consider the setting shown in Figure 1 where o is generated by concealing a component of g chosen uniformly at random. Note that the observation o in Figure 1 includes only four of the components in group g, and is roughly analogous to our earlier example of a sonnet with an illegible final word. To convert Figure 1 into a fully-specified probabilistic model it remains to define a prior distribution P(s) over schemata. An appealing approach is to consider all of the infinitely many sentences in the logical language already mentioned, and to define a prior favoring schemata that correspond to simple (i.e. short) sentences. We approximate this approach by considering a large but finite space of sentences that includes all instances of the templates in Table 1 and all conjunctions of these instances. When instantiating one of these templates, each occurrence of Di or Dj should be replaced by one of the dimensions in the domain. For example, the schema in Figure 1 is a simplified instance of template 6 where Di is replaced by D1. Similarly, each instance of vk or vl should be replaced by a value along one of the dimensions. Our first experiment considers a problem where there are three dimensions and three possible values along each dimension (i.e. vk = 1, 2, or 3). As a result there are 1568 distinct instances of the templates in Table 1 and roughly one million conjunctions of these instances.
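Equation 1 can be illustrated by brute-force enumeration in the three-dimension, three-value setting. The sketch below is illustrative only: it assumes that groups are ordered triples of cards (the text does not say whether order matters), and the two schemata are hand-written predicates rather than instances generated from Table 1.

```python
from itertools import product

VALUES = (1, 2, 3)                            # three values per dimension
CARDS = list(product(VALUES, repeat=3))       # 27 cards over three dimensions
ALL_GROUPS = list(product(CARDS, repeat=3))   # ordered 3-card groups (assumption)

def extension_size(schema):
    """Number of groups that are consistent with the schema."""
    return sum(1 for g in ALL_GROUPS if schema(g))

def likelihood(group, schema):
    """p(g | s): uniform over the extension of s, zero outside it."""
    if not schema(group):
        return 0.0
    return 1.0 / extension_size(schema)

def d1_always_3(group):
    return all(card[0] == 3 for card in group)

def d1_uniform(group):
    return len({card[0] for card in group}) == 1

example = ((3, 2, 1), (3, 3, 2), (3, 1, 3))
print(extension_size(d1_always_3), extension_size(d1_uniform))  # 729 2187
print(likelihood(example, d1_always_3) > likelihood(example, d1_uniform))  # True
```

Because the likelihood is the reciprocal of the extension size, a schema that admits fewer groups assigns higher probability to each group it does admit, so more specific schemata are favored whenever they remain consistent with the data.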
Our second experiment uses three dimensions with five values along each dimension, which leads to 2768 template instances and roughly three million conjunctions of these instances. The templates in Table 1 capture most of the simple regularities that can be formulated in our logical language. Template 1 generates all rules that include quantification over a single object variable and no binary connectives. Template 3 is similar but includes a single binary connective. Templates 2 and 4 are similar to 1 and 3 respectively, but include two object variables (x and y) rather than one. Templates 5, 6 and 7 add quantification over dimensions to Templates 2 and 4. Although the templates in Table 1 capture a large class of regularities, several kinds of templates are not included. Since we do not assume that the dimensions are commensurable, values along different dimensions cannot be directly compared (∃x D1(x) = D2(x) is not permitted). For the same reason, comparisons to a dimension value must involve a concrete dimension (∀x D1(x) = 1 is permitted) rather than a dimension variable (∀Q ∀x Q(x) = 1 is not permitted). Finally, we exclude all schemata where quantification over objects precedes quantification over dimensions, and as a result there are some simple schemata that our implementation cannot learn (e.g. ∃x∀y∃Q Q(x) = Q(y)). The extension of each schema is a set of groups, and schemata with the same extension can be assigned to the same equivalence class. For example, ∀x D1(x) = v1 (an instance of template 1) and ∀x D1(x) = v1 ∧ D1(x) = v1 (an instance of template 3) end up in the same equivalence class. Each equivalence class can be represented by the shortest sentence that it contains, and we define our prior P(s) over a set that includes a single representative for each equivalence class.
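The grouping of schemata into equivalence classes can be sketched as follows. This is a toy illustration under stated assumptions: sentences are written as ASCII strings, sentence length is measured as the number of whitespace-separated tokens (the text does not specify the length measure), and groups are ordered pairs of cards to keep the enumeration small.

```python
from itertools import product

CARDS = list(product((1, 2, 3), repeat=3))
ALL_GROUPS = list(product(CARDS, repeat=2))   # 2-card groups keep the toy small

# Hypothetical schemata: sentence text -> predicate over groups.
schemata = {
    "forall x: D1(x) = 1 and D1(x) = 1": lambda g: all(c[0] == 1 and c[0] == 1 for c in g),
    "forall x: D1(x) = 1": lambda g: all(c[0] == 1 for c in g),
    "forall x: D2(x) = 1": lambda g: all(c[1] == 1 for c in g),
}

def extension(pred):
    """The set of groups consistent with a schema (hashable, so usable as a key)."""
    return frozenset(g for g in ALL_GROUPS if pred(g))

def length(sentence):
    """Illustrative length measure: number of tokens in the sentence."""
    return len(sentence.split())

# Keep the shortest sentence for each distinct extension.
representatives = {}
for sentence, pred in schemata.items():
    ext = extension(pred)
    best = representatives.get(ext)
    if best is None or length(sentence) < length(best):
        representatives[ext] = sentence

print(sorted(representatives.values()))
```

In this toy run the redundant conjunction shares its extension with the shorter sentence, so only two representatives survive.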
The prior probability P(s) of each sentence decreases exponentially with its length: P(s) ∝ λ^{|s|}, where |s| is the length of schema s and λ is a constant between 0 and 1. For all applications in this paper we set λ = 0.8. The generative model in Figure 1 can be used for several purposes, including schema learning (inferring a schema s given one or more instances generated from the schema), classification (deciding whether group gnew belongs to a category given one or more instances of the category) and generation (generating a group gnew that belongs to the same category as one or more instances). Our first experiment explores all three of these problems.

2 Experiment 1: Relational classification

Our first experiment is organized around a triad task where participants are shown one example of a category then asked to decide which of two choice examples is more likely to belong to the category. Triad tasks are regularly used by studies of relational categorization, and have been used to argue for the importance of comparison [1]. A comparison-based approach to this task, for instance, might compare the example object to each of the choice objects in order to decide which is the better match. Our first experiment is intended in part to explore whether a schema-learning approach can also account for inferences about triad tasks. Materials and Method. 18 adults participated for course credit and interacted with a custom-built computer interface. The stimuli were groups of figures that varied along three dimensions (color, size, and ball position, as in Figure 1). Each shape was displayed on a single card, and all groups in Experiment 1 included exactly three cards. The cards in Figure 1 show five different values along each dimension, but Experiment 1 used only three values along each dimension. The experiment included inferences about 10 triads.
Participants were told that aliens from a certain planet “enjoy organizing cards into groups,” and that “any group of cards will probably be liked by some aliens and disliked by others.” The ten triad tasks were framed as questions about the preferences of 10 aliens. Participants were shown a group that Mr X likes (different names were used for the ten triads), then shown two choice groups and told that “Mr X likes one of these groups but not the other.” Participants were asked to select one of the choice groups, then asked to generate another 3-card group that Mr X would probably like. Cards could be added to the screen using an “Add Card” button, and there were three pairs of buttons that allowed each card to be increased or decreased along the three dimensions. Finally, participants were asked to explain in writing “what kind of groups Mr X likes.” The ten triads used are shown in Figure 2. Each group is represented as a 3 by 3 matrix where rows represent cards and columns show values along the three dimensions. 
Triad 1, for example, has an example group including three cards that each take value 3 along D1. The first choice group is consistent with this regularity but the second choice group is not. The cards in each group were arrayed vertically on screen, and were initially sorted as shown in Figure 2 (i.e. first by D3, then by D2 and then by D1). The cards could be dragged around on screen, and participants were invited to move them around in order to help them understand each group. The mapping between the three dimensions in each matrix and the three dimensions in the experiment (color, position, and size) was randomized across participants, and the order in which triads were presented was also randomized.

Figure 2: Human responses and model predictions for the ten triads in Experiment 1. Panels: (a) D1 value always 3; (b) D2 uniform; (c) D2 and D3 aligned; (d) D1 and D3 anti-aligned; (e) Two dimensions aligned; (f) Two dimensions anti-aligned; (g) All dimensions uniform; (h) Some dimension uniform; (i) All dimensions have no repeats; (j) Some dimension has no repeats. The plot at the left of each panel shows model predictions (white bars) and human preferences (black bars) for the two choice groups in each triad. The plots at the right of each panel summarize the groups created during the generation phase. The 23 elements along the x-axis correspond to the regularities listed in Table 2.

Table 2: Regularities used to code responses to the generation tasks in Experiments 1 and 2.
1. All dimensions aligned
2. Two dimensions aligned
3. D1 and D2 aligned
4. D1 and D3 aligned
5. D2 and D3 aligned
6. All dimensions aligned or anti-aligned
7. Two dimensions anti-aligned
8. D1 and D2 anti-aligned
9. D1 and D3 anti-aligned
10. D2 and D3 anti-aligned
11. All dimensions have no repeats
12. Two dimensions have no repeats
13. One dimension has no repeats
14. D1 has no repeats
15. D2 has no repeats
16. D3 has no repeats
17. All dimensions uniform
18. Two dimensions uniform
19. One dimension uniform
20. D1 uniform
21. D2 uniform
22. D3 uniform
23. D1 value is always 3

Model predictions and results. Let ge be the example group presented in the triad task and g1 and g2 be the two choice groups. We use our model to compute the relative probability of two hypotheses: h1 which states that ge and g1 are generated from the same schema and that g2 is sampled randomly from all possible groups, and h2 which states that ge and g2 are generated from the same schema. We set P(h1) = P(h2) = 0.5, and compute posterior probabilities P(h1|ge, g1, g2) and P(h2|ge, g1, g2) by integrating over all schemata in the hypothesis space already described.
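The triad computation can be sketched with a small hand-picked hypothesis space. Everything specific here is an assumption for illustration: the five schemata and their lengths are hypothetical stand-ins for the much larger template-based space, the groups are not the experimental stimuli, groups are treated as ordered triples, and h2 is taken to mirror h1 (with g1 sampled at random). Under those assumptions the uniform factor for the randomly sampled group and the equal priors on h1 and h2 cancel, leaving a ratio of two sums over schemata.

```python
from itertools import product

CARDS = list(product((1, 2, 3), repeat=3))
ALL_GROUPS = list(product(CARDS, repeat=3))   # ordered 3-card groups (assumption)
LAMBDA = 0.8                                  # prior constant from the text

# (predicate, hypothetical sentence length) pairs: a toy hypothesis space.
SCHEMATA = [
    (lambda g: all(c[0] == 3 for c in g), 5),      # D1 value is always 3
    (lambda g: len({c[0] for c in g}) == 1, 9),    # D1 uniform
    (lambda g: len({c[1] for c in g}) == 1, 9),    # D2 uniform
    (lambda g: len({c[2] for c in g}) == 1, 9),    # D3 uniform
    (lambda g: any(c[0] == 3 for c in g), 5),      # some card has D1 = 3
]
weights = [LAMBDA ** n for _, n in SCHEMATA]
PRIOR = [w / sum(weights) for w in weights]        # P(s) proportional to lambda^|s|
EXT_SIZE = [sum(1 for g in ALL_GROUPS if pred(g)) for pred, _ in SCHEMATA]

def likelihood(group, i):
    """p(g | s_i): uniform over the extension of schema i."""
    pred, _ = SCHEMATA[i]
    return 1.0 / EXT_SIZE[i] if pred(group) else 0.0

def shared_schema_evidence(g_a, g_b):
    """Sum over s of P(s) p(g_a | s) p(g_b | s)."""
    return sum(PRIOR[i] * likelihood(g_a, i) * likelihood(g_b, i)
               for i in range(len(SCHEMATA)))

def triad_posterior(ge, g1, g2):
    """Posterior over (h1, h2) with equal prior probability on each."""
    m1 = shared_schema_evidence(ge, g1)
    m2 = shared_schema_evidence(ge, g2)
    return m1 / (m1 + m2), m2 / (m1 + m2)

ge = ((3, 2, 1), (3, 3, 2), (3, 1, 3))   # illustrative groups only
g1 = ((3, 1, 1), (3, 1, 3), (3, 2, 3))
g2 = ((2, 3, 1), (1, 2, 2), (3, 3, 3))
p_h1, p_h2 = triad_posterior(ge, g1, g2)
print(p_h1 > p_h2)  # True: g1 shares the specific "D1 is always 3" schema with ge
```

The weak schema "some card has D1 = 3" is included so that both hypotheses receive non-zero evidence; without any schema covering g2 the second hypothesis would have probability exactly zero in this toy space.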
Our model assumes that two groups are considered similar to the extent that they appear to have been generated by the same underlying schema, and is consistent with the generative approach to similarity described by Kemp et al. [19]. Model predictions for the ten triads are shown in Figure 2. In each case, the choice probabilities plotted (white bars) are the posterior probabilities of hypotheses h1 and h2. In nine out of ten cases the best choice according to the model is the most common human response. Responses to triads 2c and 2d support the idea that people are sensitive to relationships between dimensions (i.e. alignment and anti-alignment). Triads 2e and 2f are similar to triads studied by Kotovsky and Gentner [1], and we replicate their finding that people are sensitive to relationships between dimensions even when the dimensions involved vary from group to group. The one case where human responses diverge from model predictions is shown in Figure 2h. Note that the schema for this triad involves existential quantification over dimensions (some dimension is uniform), and according to our prior P(s) this kind of quantification is no more complex than other kinds of quantification. Future applications of our approach can explore the idea that existential quantification over dimensions (∃Q) is psychologically more complex than universal quantification over dimensions (∀Q) or existential quantification over cards (∃x), and can consider logical languages that incorporate this inductive bias. To model the generation phase of the experiment we computed the posterior distribution

$$P(g_{new} \mid g_e, g_1, g_2) = \sum_{s,h} P(g_{new} \mid s)\, P(s \mid h, g_e, g_1, g_2)\, P(h \mid g_e, g_1, g_2)$$

where P(h|ge, g1, g2) is the distribution used to model selections in the triad task. Since the space of possible groups is large, we visualize this distribution using a profile that shows the posterior probability assigned to groups consistent with the 23 regularities shown in Table 2.
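A profile of this kind can be sketched as a sum of probability mass over the groups that satisfy each regularity. The code below is an illustrative reading rather than the authors' coding scheme: it covers only four of the 23 regularities, defines alignment via the pairwise biconditional used in the Figure 1 schema, and takes a hand-specified distribution over groups as input.

```python
def uniform(group, d):
    """All cards share one value on dimension d."""
    return len({c[d] for c in group}) == 1

def no_repeats(group, d):
    """All cards take distinct values on dimension d."""
    return len({c[d] for c in group}) == len(group)

def aligned(group, i, j):
    """Di(x) < Di(y) exactly when Dj(x) < Dj(y), for all card pairs."""
    return all((x[i] < y[i]) == (x[j] < y[j]) for x in group for y in group)

def anti_aligned(group, i, j):
    """Di(x) < Di(y) exactly when Dj(x) > Dj(y), for all card pairs."""
    return all((x[i] < y[i]) == (x[j] > y[j]) for x in group for y in group)

# A few Table 2 regularities, keyed by their labels (dimension indices are 0-based).
REGULARITIES = {
    "D2 and D3 aligned": lambda g: aligned(g, 1, 2),
    "D1 and D3 anti-aligned": lambda g: anti_aligned(g, 0, 2),
    "D2 has no repeats": lambda g: no_repeats(g, 1),
    "D2 uniform": lambda g: uniform(g, 1),
}

def profile(group_probs):
    """Map each regularity to the total probability of groups satisfying it."""
    return {name: sum(p for g, p in group_probs.items() if check(g))
            for name, check in REGULARITIES.items()}

# Hypothetical distribution over two generated groups.
group_probs = {
    ((1, 1, 1), (2, 2, 2), (3, 3, 3)): 0.75,
    ((1, 2, 3), (2, 2, 2), (3, 2, 1)): 0.25,
}
result = profile(group_probs)
print(result)
```

The same function yields an empirical profile for human data if `group_probs` holds the relative frequency of each group that participants generated.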
The white bar plots in Figure 2 show profiles predicted by the model, and the black plots immediately above show profiles computed over the groups generated by our 18 participants. In many of the 10 cases the model accurately predicts regularities in the groups generated by people. In case 2c, for example, the model correctly predicts that generated groups will tend to have no repeats along dimensions D2 and D3 (regularities 15 and 16) and that these two dimensions will be aligned (regularities 2 and 5). There are, however, some departures from the model’s predictions, and a notable example occurs in case 2d. Here the model detects the regularity that dimensions D1 and D3 are anti-aligned (regularity 9). Some groups generated by participants are consistent with this regularity, but people also regularly generate groups where two dimensions are aligned rather than anti-aligned (regularity 2). This result may indicate that some participants are sensitive to relationships between dimensions but do not consider the difference between a positive relationship (alignment) and an inverse relationship (anti-alignment) especially important. Kotovsky and Gentner [1] suggest that comparison can explain how people respond to triad tasks, although they do not provide a computational model that can be compared with our approach. It is less clear how comparison might account for our generation data, and our next experiment considers a one-shot generation task that raises even greater challenges for a comparison-based approach.

3 Experiment 2: One-shot schema learning

As described already, comparison involves constructing mappings between pairs of category instances. In some settings, however, learners make confident inferences given a single instance of a category [15, 20], and it is difficult to see how comparison could play a major role when only one instance is available. Models that rely on abstraction, however, can naturally account for one-shot relational learning, and we designed a second experiment to evaluate this aspect of our approach. Several previous studies have explored one-shot relational learning. Holyoak and Thagard [21] developed a study of analogical reasoning using stories as stimuli and found little evidence of one-shot schema learning. Ahn et al. [11] demonstrated, however, that one-shot learning can be achieved with complex materials such as stories, and modeled this result using explanation-based learning. Here we use much simpler stimuli and explore a probabilistic approach to one-shot learning.

Figure 3: Human responses and model predictions for the six cases in Experiment 2. Panels: (a) All dimensions aligned; (b) D2 and D3 aligned; (c) D1 has no repeats, D2 and D3 uniform; (d) D2 uniform; (e) All dimensions uniform; (f) All dimensions have no repeats. In (a) and (b), the 4 cards used for the completion and generation phases are shown on either side of the dashed line (completion cards on the left). In the remaining cases, the same 4 cards were used for both phases. The plots at the right of each panel show model predictions (white bars) and human responses (black bars) for the generation task. In each case, the 23 elements along each x-axis correspond to the regularities listed in Table 2. The remaining plots show responses to the completion task. There are 125 possible responses, and the four responses shown always include the top two human responses and the top two model predictions.
Materials and Method. 18 adults participated for course credit. The same individuals completed Experiments 1 and 2, and Experiment 2 was always run before Experiment 1. The same computer interface was used in both experiments, and the only important difference was that the figures in Experiment 2 could now take five values along each dimension rather than three. The experiment included two phases. During the generation phase, participants saw a 4-card group that Mr X liked and were asked to generate two 5-card groups that Mr X would probably like. During the completion phase, participants were shown four members of a 5-card group and were asked to generate the missing card. The stimuli used in each phase are shown in Figure 3. In the first two cases, slightly different stimuli were used in the generation and completion phases, and in all remaining cases the same set of four cards was used in both phases. All participants responded to the six generation questions before answering the six completion questions. Model predictions and results. The generation phase is modeled as in Experiment 1, but now the posterior distribution P(gnew|ge) is computed after observing a single instance of a category. The human responses in Figure 3 (black bars) are consistent with the model in all cases, and confirm that a single example can provide sufficient evidence for learners to acquire a relational category. For example, the most common response in case 3a was the 5-card group shown in Figure 1, a group with all three dimensions aligned. To model the completion phase, let oe represent a partial observation of group ge. Our model infers which card is missing from ge by computing the posterior distribution

$$P(g_e \mid o_e) \propto P(o_e \mid g_e) \sum_s P(g_e \mid s)\, P(s),$$

where P(oe|ge) captures the idea that oe is generated by randomly concealing one component of ge.
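The completion computation can be sketched at a reduced scale. The assumptions here are deliberate simplifications, not the experimental setup: three values per dimension and 3-card groups (so that all groups can be enumerated), a hypothetical space of twelve schemata with made-up lengths, the hidden card placed last in an ordered group, and P(oe|ge) treated as the same constant for every candidate so that it drops out on normalisation.

```python
from itertools import product

VALUES = (1, 2, 3)
CARDS = list(product(VALUES, repeat=3))
ALL_GROUPS = list(product(CARDS, repeat=3))   # ordered 3-card groups (assumption)
LAMBDA = 0.8

def always(d, v):
    """Schema: every card takes value v on dimension d."""
    return lambda g: all(c[d] == v for c in g)

def uniform(d):
    """Schema: all cards share one value on dimension d."""
    return lambda g: len({c[d] for c in g}) == 1

# (predicate, hypothetical sentence length) pairs.
SCHEMATA = [(always(d, v), 5) for d in range(3) for v in VALUES]
SCHEMATA += [(uniform(d), 9) for d in range(3)]
EXT_SIZE = [sum(1 for g in ALL_GROUPS if pred(g)) for pred, _ in SCHEMATA]

def marginal(group):
    """Unnormalised sum over s of P(group | s) P(s), with P(s) ~ lambda^|s|."""
    return sum(LAMBDA ** length / size
               for (pred, length), size in zip(SCHEMATA, EXT_SIZE)
               if pred(group))

def completion_posterior(observed):
    """Posterior over the hidden card, given the visible cards."""
    scores = {card: marginal(observed + (card,)) for card in CARDS}
    total = sum(scores.values())
    return {card: score / total for card, score in scores.items()}

observed = ((3, 1, 2), (3, 2, 1))             # both visible cards have D1 = 3
posterior = completion_posterior(observed)
support = [card for card, p in posterior.items() if p > 0]
print(len(support), all(card[0] == 3 for card in support))
```

In this toy run the only schemata consistent with the visible cards constrain D1, so the posterior spreads evenly over the cards with D1 = 3, which is the same kind of diffuse prediction the text reports for case 3d.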
The white bars in Figure 3 show model predictions, and in five out of six cases the best response according to the model is the same as the most common human response. In the remaining case (Figure 3d) the model generates a diffuse distribution over all cards with value 3 on dimension 2, and all human responses satisfy this regularity.

4 Conclusion

We presented a generative model that helps to explain how relational categories are learned and used. Our approach captures relational regularities using a logical language, and helps to explain how schemata formulated in this language can be learned from observed data. Our approach differs in several respects from previous accounts of relational categorization [1, 5, 10, 22]. First, we focus on abstraction rather than comparison. Second, we consider tasks where participants must generate examples of categories [16] rather than simply classify existing examples. Finally, we provide a formal account that helps to explain how relational categories can be learned from a single instance. Our approach can be developed and extended in several ways. For simplicity, we implemented our model by working with a finite space of several million schemata, but future work can consider hypothesis spaces that assign non-zero probability to all regularities that can be formulated in the language we described. The specific logical language used here is only a starting point, and future work can aim to develop languages that provide a more faithful account of human inductive biases. Finally, we worked with a domain that provides one of the simplest ways to address core questions such as one-shot learning. Future applications of our general approach can consider domains that include more than three dimensions and a richer space of relational regularities. Relational learning and analogical reasoning are tightly linked, and hierarchical generative models provide a promising approach to both problems.
We focused here on relational categorization, but future studies can explore whether probabilistic accounts of schema learning can help to explain the inductive inferences typically considered by studies of analogical reasoning. Although there are many models of analogical reasoning, there are few that pursue a principled probabilistic approach, and the hierarchical Bayesian approach may help to fill this gap in the literature.

Acknowledgments

We thank Maureen Satyshur for running the experiments. This work was supported in part by NSF grant CDI-0835797.

References

[1] L. Kotovsky and D. Gentner. Comparison and categorization in the development of relational similarity. Child Development, 67:2797–2822, 1996.
[2] D. Gentner and A. B. Markman. Structure mapping in analogy and similarity. American Psychologist, 52:45–56, 1997.
[3] D. Gentner and J. Medina. Similarity and the development of rules. Cognition, 65:263–297, 1998.
[4] B. Falkenhainer, K. D. Forbus, and D. Gentner. The structure-mapping engine: Algorithm and examples. Artificial Intelligence, 41:1–63, 1989.
[5] J. E. Hummel and K. J. Holyoak. A symbolic-connectionist theory of relational inference and generalization. Psychological Review, 110:220–264, 2003.
[6] M. Mitchell. Analogy-making as perception: a computer model. MIT Press, Cambridge, MA, 1993.
[7] D. R. Hofstadter and the Fluid Analogies Research Group. Fluid concepts and creative analogies: computer models of the fundamental mechanisms of thought. 1995.
[8] W. V. O. Quine and J. Ullian. The Web of Belief. Random House, New York, 1978.
[9] J. Skorstad, D. Gentner, and D. Medin. Abstraction processes during concept learning: a structural view. In Proceedings of the 10th Annual Conference of the Cognitive Science Society, pages 419–425. 1988.
[10] D. Gentner and J. Loewenstein. Relational language and relational thought. In E. Amsel and J. P. Byrnes, editors, Language, literacy and cognitive development: the development and consequences of symbolic communication, pages 87–120. 2002.
[11] W. Ahn, W. F. Brewer, and R. J. Mooney. Schema acquisition from a single example. Journal of Experimental Psychology: Learning, Memory and Cognition, 18(2):391–412, 1992.
[12] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian data analysis. Chapman & Hall, New York, 2nd edition, 2003.
[13] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Learning and using relational theories. In J. C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 753–760. MIT Press, Cambridge, MA, 2008.
[14] S. Kok and P. Domingos. Learning the structure of Markov logic networks. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[15] J. Feldman. The structure of perceptual categories. Journal of Mathematical Psychology, 41:145–170, 1997.
[16] A. Jern and C. Kemp. Category generation. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 130–135. Cognitive Science Society, Austin, TX, 2009.
[17] D. Conklin and I. H. Witten. Complexity-based induction. Machine Learning, 16(3):203–225, 1994.
[18] J. B. Tenenbaum and T. L. Griffiths. Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24:629–641, 2001.
[19] C. Kemp, A. Bernstein, and J. B. Tenenbaum. A generative theory of similarity. In B. G. Bara, L. Barsalou, and M. Bucciarelli, editors, Proceedings of the 27th Annual Conference of the Cognitive Science Society, pages 1132–1137. Lawrence Erlbaum Associates, 2005.
[20] C. Kemp, N. D. Goodman, and J. B. Tenenbaum. Theory acquisition and the language of thought. In Proceedings of the 30th Annual Conference of the Cognitive Science Society, pages 1606–1611. Cognitive Science Society, Austin, TX, 2008.
[21] K. J. Holyoak and P. Thagard. Analogical mapping by constraint satisfaction. Cognitive Science, 13(3):295–355, 1989.
[22] L. A. A. Doumas, J. E. Hummel, and C. M. Sandhofer. A theory of the discovery and predication of relational concepts. Psychological Review, 115(1):1–43, 2008.
[23] M. L. Gick and K. J. Holyoak. Schema induction and analogical transfer. Cognitive Psychology, 15:1–38, 1983.
same-paper 3 0.81862056 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine
Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang
Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependent partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependent and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms.
4 0.80730331 247 nips-2009-Time-rescaling methods for the estimation and assessment of non-Poisson neural encoding models
Author: Jonathan W. Pillow
Abstract: Recent work on the statistical modeling of neural responses has focused on modulated renewal processes in which the spike rate is a function of the stimulus and recent spiking history. Typically, these models incorporate spike-history dependencies via either: (A) a conditionally-Poisson process with rate dependent on a linear projection of the spike train history (e.g., generalized linear model); or (B) a modulated non-Poisson renewal process (e.g., inhomogeneous gamma process). Here we show that the two approaches can be combined, resulting in a conditional renewal (CR) model for neural spike trains. This model captures both real-time and rescaled-time history effects, and can be fit by maximum likelihood using a simple application of the time-rescaling theorem [1]. We show that for any modulated renewal process model, the log-likelihood is concave in the linear filter parameters only under certain restrictive conditions on the renewal density (ruling out many popular choices, e.g. gamma with shape κ = 1), suggesting that real-time history effects are easier to estimate than non-Poisson renewal properties. Moreover, we show that goodness-of-fit tests based on the time-rescaling theorem [1] quantify relative-time effects, but do not reliably assess accuracy in spike prediction or stimulus-response modeling. We illustrate the CR model with applications to both real and simulated neural data.
5 0.65439731 87 nips-2009-Exponential Family Graph Matching and Ranking
Author: James Petterson, Jin Yu, Julian J. McAuley, Tibério S. Caetano
Abstract: We present a method for learning max-weight matching predictors in bipartite graphs. The method consists of performing maximum a posteriori estimation in exponential families with sufficient statistics that encode permutations and data features. Although inference is in general hard, we show that for one very relevant application, document ranking, exact inference is efficient. For general model instances, an appropriate sampler is readily available. Contrary to existing max-margin matching models, our approach is statistically consistent and, in addition, experiments with increasing sample sizes indicate superior improvement over such models. We apply the method to graph matching in computer vision as well as to a standard benchmark dataset for learning document ranking, in which we obtain state-of-the-art results, in particular improving on max-margin variants. The drawback of this method with respect to max-margin alternatives is its runtime for large graphs, which is comparatively high.
6 0.65346849 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
7 0.6527127 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
8 0.65219986 71 nips-2009-Distribution-Calibrated Hierarchical Classification
9 0.65175807 129 nips-2009-Learning a Small Mixture of Trees
10 0.65137351 104 nips-2009-Group Sparse Coding
11 0.6511417 77 nips-2009-Efficient Match Kernel between Sets of Features for Visual Recognition
12 0.65081805 72 nips-2009-Distribution Matching for Transduction
13 0.65049505 122 nips-2009-Label Selection on Graphs
14 0.64923447 139 nips-2009-Linear-time Algorithms for Pairwise Statistical Problems
15 0.64871752 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
16 0.64802802 137 nips-2009-Learning transport operators for image manifolds
17 0.64694285 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds
18 0.64501387 112 nips-2009-Human Rademacher Complexity
19 0.64468974 2 nips-2009-3D Object Recognition with Deep Belief Nets
20 0.64461827 128 nips-2009-Learning Non-Linear Combinations of Kernels