nips nips2009 nips2009-213 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kaushik Sinha, Mikhail Belkin
Abstract: We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. It turns out that when the data has clustered, that is, when the high density regions are sufficiently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combination of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for good classification functions when the cluster assumption holds. By first choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classifier in the span of these eigenvectors, we obtain a classifier, which has a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption. Experimental results on a number of real-world data-sets show that our method is competitive with the state of the art semi-supervised learning algorithms and outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis). 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. [sent-7, score-0.397]
2 It turns out that when the data has clustered, that is, when the high density regions are sufficiently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. [sent-8, score-0.985]
3 Linear combination of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for good classification functions when the cluster assumption holds. [sent-9, score-0.53]
4 By first choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classifier in the span of these eigenvectors, we obtain a classifier, which has a very sparse representation in this basis. [sent-10, score-0.794]
5 Importantly, the sparsity corresponds naturally to the cluster assumption. [sent-11, score-0.134]
6 , learning from both labeled and unlabeled data has received considerable attention in recent years due to its potential in reducing the need for expensive labeled data. [sent-15, score-0.459]
7 However, to make effective use of unlabeled examples one needs to make some assumptions about the connection between the process generating the data and the process of assigning labels. [sent-16, score-0.242]
8 In particular, the cluster assumption can be interpreted as saying that two points are likely to have the same class labels if they can be connected by a path passing through a high density area. [sent-18, score-0.416]
9 In other words two high density areas with different class labels must be separated by a low density valley. [sent-19, score-0.451]
10 In this paper, we develop a framework for semi-supervised learning when the cluster assumption holds. [sent-20, score-0.144]
11 Specifically, we show that when the high density areas are sufficiently separated, a few appropriately chosen eigenfunctions of a convolution operator (which is the continuous counterpart of the kernel matrix) represents the high density areas reasonably well. [sent-21, score-1.241]
12 Under the ideal conditions each high density area can be represented by a single unique eigenfunction called the “representative” eigenfunction. [sent-22, score-0.544]
13 If the cluster assumption holds, each high density area will correspond to just one class label and thus a sparse linear combination of these representative eigenfunctions would be a good classifier. [sent-23, score-1.174]
14 Moreover, the basis of such eigenfunctions can be learned using only the unlabeled data by constructing the Nystrom extension of the eigenvectors of an appropriate kernel matrix. [sent-24, score-1.103]
15 Thus, given unlabeled data we construct the basis of eigenfunctions and then apply L 1 penalized optimization procedure Lasso [Tib96] to fit a sparse linear combination of the basis elements to 1 the labeled data. [sent-25, score-0.995]
16 Specifically we will assume that (i) data distribution has natural clusters separated by regions of low density and (ii) the label assignment conforms to these clusters. [sent-33, score-0.263]
17 , [RBV08]) that these eigenfunctions can be approximated from the eigenvectors of a kernel matrix obtained from the unlabeled data. [sent-37, score-1.073]
18 Thus, if the cluster assumption holds we expect each cluster to have exactly one label assignment. [sent-38, score-0.277]
19 Therefore eigenfunctions corresponding to these clusters should produce a natural sparse basis for constructing a classification function. [sent-39, score-0.571]
20 From unlabeled and labeled data obtain the eigenvectors of the Gaussian kernel matrix. [sent-41, score-0.756]
21 From these eigenvectors select a subset of candidate eigenvectors without sign change. [sent-43, score-0.796]
22 Using the labeled data, apply Lasso (sparse linear regression) in the constructed basis to obtain a classifier. [sent-45, score-0.196]
23 Using the Nystrom extension (see [BPV03]), extend the eigenvectors to obtain the classification function defined everywhere. [sent-47, score-0.35]
24 We note that our method is related to KPCA, where data is projected onto the space spanned by the top few eigenvectors of the kernel matrix and classification or regression task can be performed in that projected space. [sent-49, score-0.521]
25 The important difference is that we choose a subset of the eigenvectors in accordance to the cluster assumption. [sent-50, score-0.455]
26 We note that the method simply using the KPCA basis does not seem to benefit from unlabeled data and, in fact, cannot outperform the standard fully supervised SVM classifier. [sent-51, score-0.247]
27 We will see that each cluster in the data corresponds to its unique representative eigenvector of the kernel matrix. [sent-54, score-0.554]
28 However, this eigenvector may not be among the top eigenvectors and may thus be omitted when applying KPCA. [sent-55, score-0.487]
29 Alternatively, if the representative eigenvector is included, it will be included with a number of other uninformative eigenvectors resulting in poor performance due to overfitting. [sent-56, score-0.669]
30 (λi , v i )u i=1 2 has clusters, for each high density region there is a unique representative eigenfunction of a convolution operator that takes positive values around the chosen cluster and is close to zero everywhere else. [sent-62, score-1.188]
31 , each high density region corresponds to a portion of a pure class, then the classifier can be naturally expressed as a linear combination of the representative eigenfunctions. [sent-66, score-0.537]
32 representative eigenvector basis and a linear combination of the representative eigenvectors will be a reasonable candidate for a good classification function. [sent-67, score-1.005]
33 However, identifying representative eigenvectors is not very trivial because in real life depending on the separation between high density clusters the representative eigenvectors can have no sign change up to some small precision > 0. [sent-68, score-1.621]
34 , en ) ∈ Rn has no sign change up to precision if either ∀i ei > − or ∀i ei < . [sent-72, score-0.124]
35 Let N be the set of indices of all eigenvectors that have no sign change up to precision . [sent-73, score-0.474]
36 If is chosen properly, N will contain representative eigenvectors (note that the set N and the set {1, 2, . [sent-74, score-0.622]
37 Thus, instead of identifying the representative eigenvectors, we carefully select a small set containing the representative eigenvectors. [sent-78, score-0.48]
38 Our goal is to learn a linear combination of the eigenvectors i∈N βi v i which minimizes classification error on the labeled examples and the coefficients corresponding to non-representative eigenvectors are zeros. [sent-79, score-0.927]
39 Standard approach to get a sparse solution is to minimize a convex loss function V on the labeled examples and apply a L1 penalty (on βi s). [sent-81, score-0.252]
40 If we select V to be square loss function, we end up solving the L1 penalized least square or so called Lasso [Tib96], whose consistency property was studied in [ZY06]. [sent-82, score-0.134]
41 To obtain a classification function which is defined everywhere, we use the Nystrom extension l+u 1 of the ith eigenvector defined as ψi (x) = λ √l+u j=1 v i (xj )k(x, xj ). [sent-88, score-0.11]
42 Construct kernel matrix K from l + u unlabeled examples {xi }l+u . [sent-92, score-0.355]
43 Select set N containing indices of the eigenvectors with no sign change up to precision . [sent-94, score-0.474]
44 Construct design matrix Ψ whose ith column is top l rows of v N (i) . [sent-96, score-0.119]
45 In [ZY06] a concept of sign consistency was introduced which states ˆ that Lasso is sign consistent if, as l tends to infinity, signs of β l (λ) matches with the signs of β ∗ with probability 1, where β ∗ is the coefficients of the correct model. [sent-104, score-0.381]
46 Note that since we are expecting a ˆ sparse model, matching zeros of β l (λ) to the zeros of β ∗ is not enough, but in addition, matching the signs of the non zero coefficients ensures that the true model will be selected. [sent-105, score-0.195]
47 l Note that, for a random design matrix, sign consistency is equivalent to irrepresentable condition (see [ZY06]). [sent-111, score-0.277]
48 The requirement µΨ < 1 is not new and have also appeared in the context of noisy or noiseless sparse recovery of signal [Tro04, Wai06, Zha08]. [sent-113, score-0.141]
49 Note that Lasso is sign consistent if irrepresentable condition holds and the sufficient condition needed for irrepresentable condition to hold is given by the following result, Theorem 3. [sent-114, score-0.435]
50 Let the matrix C be normalized version C c of C such that Cij = Cij and maxi,j,i=j |Cij | ≤ 2q−1 for a constant 0 ≤ c < 1, then strong ii irrepresentable condition holds. [sent-117, score-0.171]
51 Our main result in the following shows that this sufficient condition is satisfied with high probability requiring relatively few labeled examples, as a result the correct model is identified consistently, which in turn describes a good classification function. [sent-118, score-0.239]
52 Let q be the minimum number of columns of the design matrix Ψ ∈ R l×|N | , constructed from l labeled examples, that describes the sparse model. [sent-121, score-0.227]
53 Then for any 0 < δ < 1, if 2 2048q 2 log( δ ) , then with probability greater than the number of unlabeled examples u satisfies u > g2 λ2 Nmax 1− δ 2 lλ2 max N 1 , maxi=j |Cij | < 2q−1 . [sent-122, score-0.242]
54 Note that in our framework, unlabeled examples help polynomially fast in estimating the eigenfunctions while labeled examples help exponentially fast in identifying the sparse model consisting of representative eigenfunctions. [sent-124, score-1.183]
55 Interestingly, in semi-supervised learning setting, similar role of labeled and unlabeled examples (in reducing classification error) has been reported in literature [CC96, RV95, SB07, SNZ08]. [sent-125, score-0.378]
56 2, we estimate the separation requirement among the high density regions which ensures that each high density region (class) can be well represented by a unique eigenfunction. [sent-128, score-0.747]
57 3, using perturbation results from [RBV08], we estimate the number of unlabeled examples required to ensure that Nystrom extensions of eigenvectors of K approximate the eigenfunctions of the convolution operator LK reasonably well with high probability. [sent-131, score-1.329]
58 2 Separation Requirement To motivate our discussion we consider binary classification problem where the marginal density can be considered as a mixture model where each class has its own probability density function, p1 (x), p2 (x) and corresponding mixing weights π1 , π2 respectively. [sent-135, score-0.445]
59 Thus, the density of the mixture is p(x) = π1 p1 (x) + π2 p2 (x). [sent-136, score-0.214]
60 We will use the following results from [SBY08a] specifying the behavior of the eigenfunction of LK corresponding to the largest eigenvalue. [sent-137, score-0.321]
61 k 2 (x, z)p(z)dz (Tail decay Note that the last (tail decay) property above is not restricted to the top eigenfunction alone but is satisfied by all eigenfunctions of LK . [sent-141, score-0.734]
62 The largest eigenvalues and corresponding eigenfunctions in the above three cases are λ1 , λ2 , λ0 and φL,1 , φL,2 , φL respec0 0 0 0 0 tively. [sent-143, score-0.573]
63 Thus, when T1 (x) and 0 π2 λ2 0 φL,2 are eigenfunctions of Lp with corresponding eigen0 K T2 (x) are small enough then φL,1 and 0 values π1 λ1 and π2 λ2 respectively. [sent-148, score-0.423]
64 Thus, we will restrict ourselves to the following class of probability distributions for each individual class which has reasonably fast tail decay. [sent-151, score-0.152]
65 For any 1/2 < η < 1, let M(η, R) be the class of probability distributions such that its density function p satisfies 1) R p(x)d(x) = η where R is the minimum volume ball around the mean of the distribution. [sent-153, score-0.197]
66 With a little abuse of notation we will use p ∈ M(η, R) to mean that p is the probability density function of a member of M(η, R). [sent-156, score-0.171]
67 Now a rough estimate of separation requirement can be given by the following lemma. [sent-157, score-0.152]
68 The estimate of ∆ in the above lemma, where we hide the log factor by Ω∗ , is by no means tight, nevertheless, it shows that separation requirement refers to existence of a low density valley between 5 two high density regions each corresponding to one of the classes. [sent-162, score-0.57]
69 This separation requirement is roughly of the same order required to learn mixture of Gaussians [Das99]. [sent-163, score-0.195]
70 Note that, provided separation requirement is satisfied, φL,1 and φL,2 are not necessarily the top two eigenfunctions of 0 0 LK corresponding to the two largest eigenvalues but can be quite far down the spectrum of L p K depending on the mixing weights π1 , π2 . [sent-164, score-0.817]
71 Next, the following lemma suggests that we can say more about the eigenfunction corresponding to the largest eigenvalue. [sent-165, score-0.352]
72 The Nmax largest eigenvalues of LK and K, where Nmax = maxi {i : i ∈ N }, are simple and bounded away from zero. [sent-173, score-0.192]
73 Note that Nystrom extension ψi s are eigenfunctions of an operator LK,H : H → H , where H is the unique RKHS defined by the chosen Gaussian kernel and all the eigenvalues of K are also eigenvalues of LK,H ([RBV08]). [sent-174, score-0.819]
74 The first one is due to the bounded away from zero part, which ensures that if we restrict to ψi ∈ H corresponding to the largest Nmax eigenvalues, then each of them is square integrable hence belongs to L2 (X , PX ). [sent-176, score-0.139]
75 The second implication due to the simple part, ensures that eigenfunctions corresponding to the N max largest eigenvalues are uniquely defined and so are the orthogonal projections on to them. [sent-177, score-0.613]
76 Note that if any eigenvalue has multiplicity greater than one then the corresponding eigenspace is well defined but not the individual eigenfunctions. [sent-178, score-0.119]
77 Let gNmax be the Nmax eigengap when eigenvalues of LK are sorted in non increasing order. [sent-180, score-0.142]
78 Suppose Assumption 2 holds and the top Nmax eigenvalues of LK and K are sorted in the decreasing order. [sent-184, score-0.195]
79 4 Concentration Results Having established that {ψi }i∈N approximate the top N eigenfunctions of LK reasonably well, next, we need to consider what happens when we restrict each of the ψi s to finite labeled examples. [sent-188, score-0.682]
80 Note that the design matrix Ψ ∈ Rl×|N | is constructed by restricting the {ψj }j∈N to l labeled data T points {xi }l such that the ith column of Ψ is ψN (i) (x1 ), ψN (i) (x2 ), · · · , ψN (i) (xl ) ∈ Rl . [sent-189, score-0.197]
81 Left panel in Figure 1 shows the probability density of the mixture in blue and representative eigenfunctions of each class in green and magenta respectively using 1000 examples (positive and negative) drawn from this mixture. [sent-208, score-1.026]
82 It is clear that each representative eigenfunction represents high density area of a particular class reasonably well. [sent-209, score-0.828]
83 In fact, the right panel of Fig 1 shows the regularization path for L1 penalized least square regression with 20 labeled examples. [sent-211, score-0.328]
84 The bold green and magenta lines shows the coefficient values for the representative eigenfunctions for different values of regularization parameter t. [sent-212, score-0.76]
85 As can be seen, regularization parameter t can be so chosen that the decision function will consist of a linear combination of representative eigenfunctions only. [sent-213, score-0.796]
86 Note that these representative eigenfunctions need not be the top two eigenfunctions corresponding to the largest eigenvalues. [sent-214, score-1.212]
87 05 Coefficients Density / Eigenfunctions Probability density for the mixture and representative eigenfunctions 0 −0. [sent-216, score-0.877]
88 15 −10 −5 0 5 10 x 15 −20 0 20 10 0 −10 −20 y 0 10 20 30 t 40 50 60 Figure 1: Left panel: Probability density of the mixture in blue and representative eigenfunctions in green and magenta. [sent-219, score-0.877]
89 Bold lines correspond to regularization path associated with representative eigenfunctions. [sent-221, score-0.333]
90 We compared our algorithm with state of the art semi-supervised learning (manifold regularization) method Laplacian SVM (LapSVM) [BNS06], fully supervised SVM and also two other kernel sparse regression methods. [sent-224, score-0.144]
91 In each experiment a specified number of examples (l) were randomly chosen and labeled and the rest (u) were treated as unlabeled test set. [sent-227, score-0.41]
92 As can be seen, for small number of labeled examples our method convincingly outperform SVM and is comparable to LapSVM. [sent-230, score-0.22]
93 The result also suggests that instead of selecting top few eigenvectors, as is normally done in KPCA, selecting them by our method and then applying L1 regularization yields better result. [sent-231, score-0.123]
94 In particular, in case of IONOSPHERE and BREAST-CANCER data sets top |N | (5 and 3 respectively) eigenvectors do not contain the representative ones. [sent-232, score-0.648]
95 1 We also selected 100 top eigenvectors and applied L1 penalty but it gave worse result. [sent-236, score-0.408]
96 The notation A/B represents average sparsity A and number of eigenvectors (|N | or 20). [sent-402, score-0.379]
97 For each pairwise classification problem, in each trial, 500 images of each digit in the USPS training set were chosen uniformly at random out of which 20 images were labeled and the rest were set aside for testing. [sent-405, score-0.199]
98 For the LapSVM we set the regularization terms and the kernel as reported by [BNS06] for a similar set of experiments, namely we set γ A l = γI l 0. [sent-407, score-0.148]
99 In particular, we chose all the images of digits 3, 4 and 5 from USPS training data set (there were 1866 in total) and randomly labeled 10 images from each class. [sent-413, score-0.136]
100 We showed that the cluster assumption is equivalent to the classifier being sparse in a certain appropriately chosen basis and demonstrated how such basis can be computed using only unlabeled data. [sent-419, score-0.544]
wordName wordTfidf (topN-words)
[('eigenfunctions', 0.423), ('eigenvectors', 0.35), ('eigenfunction', 0.253), ('representative', 0.24), ('nmax', 0.225), ('kpca', 0.192), ('lk', 0.191), ('unlabeled', 0.187), ('density', 0.171), ('labeled', 0.136), ('gnmax', 0.135), ('lapsvm', 0.135), ('convolution', 0.132), ('nystrom', 0.118), ('irrepresentable', 0.112), ('cij', 0.109), ('lasso', 0.107), ('cluster', 0.105), ('sign', 0.096), ('kernel', 0.083), ('px', 0.083), ('eigenvalues', 0.082), ('requirement', 0.08), ('eigenvector', 0.079), ('separation', 0.072), ('operator', 0.07), ('largest', 0.068), ('seb', 0.067), ('usps', 0.066), ('classi', 0.065), ('reasonably', 0.065), ('regularization', 0.065), ('signs', 0.061), ('sparse', 0.061), ('belkin', 0.061), ('basis', 0.06), ('top', 0.058), ('lp', 0.056), ('examples', 0.055), ('ohio', 0.054), ('dist', 0.051), ('everywhere', 0.048), ('ssl', 0.048), ('unique', 0.047), ('high', 0.047), ('xk', 0.047), ('uci', 0.047), ('columbus', 0.045), ('eigenspace', 0.045), ('spectroscopy', 0.045), ('region', 0.043), ('mixture', 0.043), ('maxi', 0.042), ('ionosphere', 0.041), ('consistency', 0.04), ('ensures', 0.04), ('sinha', 0.039), ('kpc', 0.039), ('coef', 0.039), ('assumption', 0.039), ('eigenvalue', 0.038), ('multiplicity', 0.036), ('separated', 0.036), ('svm', 0.036), ('combination', 0.036), ('panel', 0.036), ('tail', 0.035), ('cients', 0.034), ('mixing', 0.034), ('rl', 0.034), ('predictors', 0.034), ('oh', 0.034), ('non', 0.033), ('magenta', 0.032), ('penalized', 0.032), ('chosen', 0.032), ('selection', 0.031), ('lemma', 0.031), ('square', 0.031), ('ith', 0.031), ('aside', 0.031), ('cation', 0.03), ('xi', 0.03), ('consistently', 0.03), ('matrix', 0.03), ('dz', 0.029), ('comparable', 0.029), ('regions', 0.029), ('sparsity', 0.029), ('condition', 0.029), ('change', 0.028), ('path', 0.028), ('holds', 0.028), ('concentration', 0.028), ('correct', 0.027), ('sorted', 0.027), ('clusters', 0.027), ('area', 0.026), ('consisting', 0.026), ('class', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
Author: Kaushik Sinha, Mikhail Belkin
Abstract: We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. It turns out that when the data has clustered, that is, when the high density regions are sufficiently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combination of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for good classification functions when the cluster assumption holds. By first choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classifier in the span of these eigenvectors, we obtain a classifier, which has a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption. Experimental results on a number of real-world data-sets show that our method is competitive with the state of the art semi-supervised learning algorithms and outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis). 1
2 0.5259065 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
Author: Rob Fergus, Yair Weiss, Antonio Torralba
Abstract: With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. “Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet. 1
Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou
Abstract: We study the behavior of the popular Laplacian Regularization method for SemiSupervised Learning at the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in Rd , d 2, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the infinite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain Ω ⊂ Rd with a smooth boundary. Let y : Ω → Y be the unknown function we wish to estimate. In case of regression Y = R whereas in binary classification Y = {−1, 1}. The standard (transductive) semisupervised learning problem is formulated as follows: Given l labeled points, (x1 , y1 ), . . . , (xl , yl ), with yi = y(xi ), and u unlabeled points xl+1 , . . . , xl+u , with all points xi sampled i.i.d. from p(x), the goal is to construct an estimate of y(xl+i ) for any unlabeled point xl+i , utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is fixed and u → ∞. 1 2 SSL with Graph Laplacian Regularization We first consider the following graph-based approach formulated by Zhu et. al. [15]: y (x) = arg min In (y) ˆ subject to y(xi ) = yi , i = 1, . . . , l y where 1 n2 In (y) = Wi,j (y(xi ) − y(xj ))2 (1) (2) i,j is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W . This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points xi ∈ Rd are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. We consider the common situation where the similarities are obtained by applying some decay filter to the distances: xi −xj σ Wi,j = G (3) where G : R+ → R+ is some function with an adequately fast decay. Popular choices are the 2 Gaussian filter G(z) = e−z /2 or the ǫ-neighborhood graph obtained by the step filter G(z) = 1z<1 . For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term In (y), e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): lim In (y) = I (σ) (y) = n→∞ G Ω Ω x−x′ σ (y(x) − y(x′ ))2 p(x)p(x′ )dxdx′ . (4) In the above limit, the bandwidth σ is held fixed. Typically, one would also drive the bandwidth σ to zero as n → ∞. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term In (y), at least when σ → 0 slowly enough1 , namely when σ = ω( d log n/n). If σ → 0 as n → ∞, and as long as nσ d / log n → ∞, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: d lim d+2 In (y) n→∞ Cσ (σ) d (y) d+2 I σ→0 Cσ = lim ∇y(x) 2 p(x)2 dx = J(y) = (5) Ω where C = Rd z 2 G( z )dz, and assuming 0 < C < ∞ (which is the case for both the Gaussian and the step filters). This energy functional J(f ) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: y (x) = arg min J(y) ˆ subject to y(xi ) = yi , i = 1, . . . , l (6) y p When σ = o( d 1/n) then all non-diagonal weights Wi,j vanish (points no longer have any “close by” p neighbors). We are not aware of an analysis covering the regime where σ decays roughly as d 1/n, but would be surprised if a qualitatively different meaningful limit is reached. 1 2 3 Graph Laplacian Regularization in R1 We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and x ∈ R. We first consider the situation where the support of p(x) is a continuous interval Ω = [a, b] ⊂ R (a and/or b may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order a x1 < x2 < · · · < xl b. Applying the theory of variational calculus, the solution y (x) ˆ satisfies inside each interval (xi , xi+1 ) the Euler-Lagrange equation d dy p2 (x) = 0. dx dx Performing two integrations and enforcing the constraints at the labeled points yields y(x) = yi + x 1/p2 (t)dt xi (yi+1 xi+1 1/p2 (t)dt xi − yi ) for xi x xi+1 (7) with y(x) = x1 for a x x1 and y(x) = xl for xl x b. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: b 1 Theorem 1. Let p(x) be a smooth density on Ω = [a, b] ⊂ R such that Ap = 4 a 1/p2 (t)dt < ∞. 2 Then, J(f ) can be written as a squared semi-norm J(f ) = f Kp induced by the kernel x′ ′ Kp (x, x ) = Ap − 1 2 x with a null-space of all constant functions. That is, f the RKHS induced by Kp . 1 p2 (t) dt Kp . (8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. Furthermore, there is no dependence between points in disjoint components separated by zero density regions. 4 Graph Laplacian Regularization in Higher Dimensions The analysis of the previous section seems promising, at it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is not the case anymore. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points. Theorem 2. Let p(x) be a smooth density over Rd , d 2, bounded from above by some constant pmax , and let (x1 , y1 ), . . . , (xl , yl ) be any (non-repeating) set of labeled examples. There exist continuous functions yǫ (x), for any ǫ > 0, all satisfying the constraints yǫ (xj ) = yj , j = 1, . . . , l, such ǫ→0 ǫ→0 that J(yǫ ) −→ 0 but yǫ (x) −→ 0 for all x = xj , j = 1, . . . , l. Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at x0 = 0 with y(x0 ) = 0 and the second labeled point is at x1 with x1 = 1 and y(x1 ) = 1. In addition, we assume that the ball B1 (0) of radius one centered around the origin is contained in Ω = {x ∈ Rd | p(x) > 0}. We first consider the case d > 2. Here, for any ǫ > 0, consider the function x ǫ yǫ (x) = min ,1 which indeed satisfies the two constraints yǫ (xi ) = yi , i = 0, 1. Then, J(yǫ ) = Bǫ (0) p2 (x) dx ǫ2 pmax ǫ2 dx = p2 Vd ǫd−2 max (13) Bǫ (0) where Vd is the volume of a unit ball in Rd . Hence, the sequence of functions yǫ (x) satisfy the constraints, but for d > 2, inf ǫ J(yǫ ) = 0. For d = 2, a more extreme example is necessary: consider the functions 2 x yǫ (x) = log +ǫ ǫ log 1+ǫ ǫ for x 1 and yǫ (x) = 1 for x > 1. These functions satisfy the two constraints yǫ (xi ) = yi , i = 0, 1 and: J(yǫ ) = 4 h “ ”i 1+ǫ 2 log ǫ 4πp2 max h “ ”i 1+ǫ 2 log ǫ x B1 (0) log ( x 1+ǫ ǫ 2 2 +ǫ)2 p2 (x)dx 4p2 h “ max ”i2 1+ǫ log ǫ 4πp2 max ǫ→0 = −→ 0. log 1+ǫ ǫ 4 1 0 r2 (r 2 +ǫ)2 2πrdr The implication of Theorem 2 is that regardless of the values at the labeled points, as u → ∞, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel Kp , as in the onedimensional case, is not possible. Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in l 1 y = arg min ˆ (y(xj ) − yj )2 + γIn (y) y(x) l j=1 In the limit of infinite unlabeled data, functions of the form yǫ (x) above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term J(y) to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points. 4.1 Numerical Example We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density p(x) in R2 , which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at (4, 0). We sample a total of n = 3000 points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with σ = 0.4. Figure 1 depicts the predictor y (x) obtained from (1). In fact, two different predictors are shown, ˆ obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution y (x) of (1) satisfies: ˆ n y (xi ) = ˆ n Wij y (xj ) / ˆ j=1 Wij on all unlabeled points i = l + 1, . . . , l + u. (14) j=1 Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for y (xi ), ˆ which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on x1 , . . . , xl . This is the method used in the bottom panels of Figure 1. As predicted, y (x) is almost constant for almost all unlabeled points. Although all values are very ˆ close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known apriori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%. The differences between the two numerical methods for solving (1) also point out to another problem with the ill-posedness of the limit problem: the solution is numerically very un-stable. A more quantitative evaluation, that also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth σ, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error, and the classification error of the resulting predictor y (x). We plot the classification error both when a threshold ˆ of zero is used (i.e. the class is determined by sign(ˆ(x))) and with the ideal threshold minimizing y the test error. For each unlabeled sample size, we choose the bandwidth σ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly. Note that 5 DIRECT INVERSION SQUARED ERROR SIGN ERROR: 45% OPTIMAL BANDWIDTH 1 0.9 1 5 0 4 2 0.85 y(x) > 0 y(x) < 0 6 0.95 10 0 0 −1 10 0 200 400 600 800 0−1 ERROR (THRESHOLD=0) 0.32 −5 10 0 5 −10 0 −10 −5 −5 0 5 10 10 1 0 0 200 400 600 800 OPTIMAL BANDWIDTH 0.5 0 0 200 400 600 800 0−1 ERROR (IDEAL THRESHOLD) 0.19 5 200 400 600 800 OPTIMAL BANDWIDTH 1 0.28 SIGN ERR: 17.1 0.3 0.26 POWER METHOD 0 1.5 8 0 0.18 −1 10 6 0.17 4 −5 10 0 5 −10 0 −5 −10 −5 0 5 10 Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations - the solution has not yet converged, despite small residuals of the order of 2 · 10−4 . 0.16 0 200 400 600 800 2 0 200 400 600 800 Figure 2: Squared error (top), classification error with a threshold of zero (center) and minimal classification error using ideal threhold (bottom), of the minimizer of (1) as a function of number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted. ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even flatter. 4.2 Probabilistic Interpretation, Exit and Hitting Times As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let x(t) denote a random walk on the graph with transition matrix M = D−1 W where D is a diagonal matrix with Dii = j Wij . Then, for the binary classification case with yi = ±1 we have [15]: y (xi ) = 2 Pr x(t) hits a point labeled +1 before hitting a point labeled -1 x(0) = xi − 1 ˆ We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and σ → 0, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location xi , yielding a flat predictor. 5 Keeping σ Finite At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit σ → 0. One possible objection is that there is an intrinsic characteristic scale for the data σ0 where (with high probability) all points at a distance xi − xj < σ0 have the same label. If this is the case, then it may not necessarily make sense to take values of σ < σ0 in constructing W . However, keeping σ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function y(x) which is zero everywhere except at the labeled points, where y(xj ) = yj . With a finite number of labeled points of measure zero, I (σ) (y) = 0 in any dimension 6 50 points 500 points 3500 points 1 1 0.5 0.5 0.5 0 0 0 −0.5 y 1 −0.5 −0.5 −1 −2 0 2 4 6 −1 −2 0 2 4 6 −1 −2 0 2 4 6 x Figure 3: Minimizer of (1) for a 1-d problem with a fixed σ = 0.4, two labeled points and an increasing number of unlabeled points. and for any fixed σ > 0. While this limiting function is discontinuous, it is also possible to construct ǫ→0 a sequence of continuous functions yǫ that all satisfy the constraints and for which I (σ) (yǫ ) −→ 0. This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at x = 4, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed σ, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as u → ∞. 6 Fourier-Eigenvector Based Methods Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3]. Instead of using the Laplacian as a regularizer, constraining candidate predictors y(x) non-parametrically to those with small In (y) values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian: The similarity matrix W is computed as before, and the Graph Laplacian matrix L = D − W is considered (recall D is a diagonal matrix with Dii = j Wij ). Only predictors p j=1 aj ej y (x) = ˆ (15) spanned by the first p eigenvectors e1 , . . . , ep of L (with smallest eigenvalues) are considered. The coefficients aj are chosen by minimizing a loss function on the labeled data, e.g. the squared loss: (ˆ1 , . . . , ap ) = arg min a ˆ l j=1 (yj − y (xj ))2 . ˆ (16) Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit u → ∞. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4]. Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions. Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well can the unknown classification function be approximated by a truncated expansion with relatively few eigenvectors. We illustrate this issue with the following three-dimensional example: Let p(x) denote the uniform density in the box [0, 1] × [0, 0.8] × [0, 0.6], where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, y1 (x) = 1x1 >0.5 , y2 (x) = 1x1 >x2 /0.8 and y3 (x) = 1x2 /0.8>x3 /0.6 . Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors p is set to p = l/5 as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of p selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method). 7 Prediction Error (%) p = #labeled points/5 40 optimal p 20 labeled points 40 Approx. Error 50 20 20 0 20 20 40 60 # labeled points 0 10 20 40 60 # labeled points 0 0 5 10 15 # eigenvectors 0 0 5 10 15 # eigenvectors Figure 4: Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: best fit when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors. The reason for this behavior is that y2 (x) and even more so y3 (x) cannot be as easily approximated by the very few leading eigenfunctions—even though they seem “simple” and “smooth”, they are significantly more complicated than y1 (x) in terms of measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form ψi,j,k (x) = cos(iπx1 ) cos(jπx2 /0.8) cos(kπx3 /0.6), making it hard to represent simple decision boundaries which are not axis-aligned. 7 Discussion Our results show that a popular SSL method, the Laplacian Regularization method (1), is not wellbehaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons. First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the number of both labeled and unlabeled points grow while l/u is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance. Second, in our experiments we noticed that although the predictor y (x) becomes extremely flat, in ˆ binary tasks, it is still typically possible to find a threshold leading to a good classification performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it. Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression. The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least d+1 derivatives in Rd . In the 2 context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the d+1 graph Laplacian matrix L = D − W , where D is the diagonal degree matrix, with L 2 (matrix to d+1 the 2 power). In the infinite unlabeled data limit, this corresponds to regularizing all order- d+1 2 (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather then ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL. A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the “smoothness” of y(x) relative to p(x) remain unclear. Acknowledgments The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06). 8 References [1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975. [2] A. Azran, The rendevous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007. [3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classification, NIPS, vol. 15, 2003. [4] M. Belkin and P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS 19, 2007. [5] M. Belkin, P. Niyogi and S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006. [6] Y. Bengio, O. Delalleau, N. Le Roux, label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Scholkopf and Zien, editors, MIT Press, 2006. [7] O. Bosquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004. [8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006. [9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008. [10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008. [11] M. Meila, J. Shi. A random walks view of spectral segmentation, AI and Statistics, 2001. [12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006. [13] B. Sch¨ lkopf, A. Smola, Learning with Kernels, MIT Press, 2002. o [14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Sch¨ lkopf, Learning with local and global consistency, o NIPS, vol. 16, 2004. [15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian fields and harmonic functions, ICML, 2003. 9
4 0.10923322 2 nips-2009-3D Object Recognition with Deep Belief Nets
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We introduce a new type of top-level model for Deep Belief Nets and evaluate it on a 3D object recognition task. The top-level model is a third-order Boltzmann machine, trained using a hybrid algorithm that combines both generative and discriminative gradients. Performance is evaluated on the NORB database (normalized-uniform version), which contains stereo-pair images of objects under different lighting conditions and viewpoints. Our model achieves 6.5% error on the test set, which is close to the best published result for NORB (5.9%) using a convolutional neural net that has built-in knowledge of translation invariance. It substantially outperforms shallow models such as SVMs (11.6%). DBNs are especially suited for semi-supervised learning, and to demonstrate this we consider a modified version of the NORB recognition task in which additional unlabeled images are created by applying small translations to the images in the database. With the extra unlabeled data (and the same amount of labeled data as before), our model achieves 5.2% error. 1
5 0.10553997 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine
Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang
Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependant partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependant and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms. 1
7 0.094393827 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems
8 0.094316013 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
9 0.090491213 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
10 0.086337328 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
11 0.085723668 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
12 0.079870187 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation
13 0.079630338 137 nips-2009-Learning transport operators for image manifolds
14 0.079334207 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes
15 0.074351802 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity
16 0.073079243 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
17 0.073046789 128 nips-2009-Learning Non-Linear Combinations of Kernels
18 0.07139343 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
19 0.071363889 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors
20 0.070804521 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
topicId topicWeight
[(0, -0.24), (1, 0.099), (2, -0.117), (3, 0.185), (4, -0.156), (5, -0.011), (6, 0.04), (7, 0.076), (8, -0.067), (9, 0.154), (10, -0.173), (11, 0.26), (12, -0.097), (13, -0.273), (14, -0.11), (15, -0.041), (16, -0.143), (17, 0.065), (18, -0.041), (19, 0.185), (20, -0.086), (21, 0.048), (22, -0.083), (23, -0.018), (24, -0.056), (25, -0.061), (26, -0.132), (27, -0.023), (28, 0.072), (29, -0.055), (30, -0.099), (31, 0.104), (32, -0.11), (33, -0.109), (34, 0.091), (35, 0.084), (36, -0.033), (37, -0.134), (38, -0.033), (39, -0.067), (40, 0.011), (41, -0.046), (42, -0.087), (43, -0.115), (44, 0.034), (45, 0.017), (46, 0.018), (47, -0.063), (48, 0.103), (49, -0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.93999237 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
Author: Kaushik Sinha, Mikhail Belkin
Abstract: We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. It turns out that when the data has clustered, that is, when the high density regions are sufficiently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combination of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for good classification functions when the cluster assumption holds. By first choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classifier in the span of these eigenvectors, we obtain a classifier, which has a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption. Experimental results on a number of real-world data-sets show that our method is competitive with the state of the art semi-supervised learning algorithms and outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis). 1
2 0.86532831 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
Author: Rob Fergus, Yair Weiss, Antonio Torralba
Abstract: With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. “Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efficient approximations for semi-supervised learning that are linear in the number of images. Specifically, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet. 1
3 0.70873255 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou
Abstract: We study the behavior of the popular Laplacian Regularization method for SemiSupervised Learning at the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in Rd , d 2, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the infinite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain Ω ⊂ Rd with a smooth boundary. Let y : Ω → Y be the unknown function we wish to estimate. In case of regression Y = R whereas in binary classification Y = {−1, 1}. The standard (transductive) semisupervised learning problem is formulated as follows: Given l labeled points, (x1 , y1 ), . . . , (xl , yl ), with yi = y(xi ), and u unlabeled points xl+1 , . . . , xl+u , with all points xi sampled i.i.d. from p(x), the goal is to construct an estimate of y(xl+i ) for any unlabeled point xl+i , utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is fixed and u → ∞. 1 2 SSL with Graph Laplacian Regularization We first consider the following graph-based approach formulated by Zhu et. al. [15]: y (x) = arg min In (y) ˆ subject to y(xi ) = yi , i = 1, . . . , l y where 1 n2 In (y) = Wi,j (y(xi ) − y(xj ))2 (1) (2) i,j is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W . This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points xi ∈ Rd are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. We consider the common situation where the similarities are obtained by applying some decay filter to the distances: xi −xj σ Wi,j = G (3) where G : R+ → R+ is some function with an adequately fast decay. Popular choices are the 2 Gaussian filter G(z) = e−z /2 or the ǫ-neighborhood graph obtained by the step filter G(z) = 1z<1 . For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term In (y), e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): lim In (y) = I (σ) (y) = n→∞ G Ω Ω x−x′ σ (y(x) − y(x′ ))2 p(x)p(x′ )dxdx′ . (4) In the above limit, the bandwidth σ is held fixed. Typically, one would also drive the bandwidth σ to zero as n → ∞. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term In (y), at least when σ → 0 slowly enough1 , namely when σ = ω( d log n/n). If σ → 0 as n → ∞, and as long as nσ d / log n → ∞, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: d lim d+2 In (y) n→∞ Cσ (σ) d (y) d+2 I σ→0 Cσ = lim ∇y(x) 2 p(x)2 dx = J(y) = (5) Ω where C = Rd z 2 G( z )dz, and assuming 0 < C < ∞ (which is the case for both the Gaussian and the step filters). This energy functional J(f ) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: y (x) = arg min J(y) ˆ subject to y(xi ) = yi , i = 1, . . . , l (6) y p When σ = o( d 1/n) then all non-diagonal weights Wi,j vanish (points no longer have any “close by” p neighbors). We are not aware of an analysis covering the regime where σ decays roughly as d 1/n, but would be surprised if a qualitatively different meaningful limit is reached. 1 2 3 Graph Laplacian Regularization in R1 We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and x ∈ R. We first consider the situation where the support of p(x) is a continuous interval Ω = [a, b] ⊂ R (a and/or b may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order a x1 < x2 < · · · < xl b. Applying the theory of variational calculus, the solution y (x) ˆ satisfies inside each interval (xi , xi+1 ) the Euler-Lagrange equation d dy p2 (x) = 0. dx dx Performing two integrations and enforcing the constraints at the labeled points yields y(x) = yi + x 1/p2 (t)dt xi (yi+1 xi+1 1/p2 (t)dt xi − yi ) for xi x xi+1 (7) with y(x) = x1 for a x x1 and y(x) = xl for xl x b. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: b 1 Theorem 1. Let p(x) be a smooth density on Ω = [a, b] ⊂ R such that Ap = 4 a 1/p2 (t)dt < ∞. 2 Then, J(f ) can be written as a squared semi-norm J(f ) = f Kp induced by the kernel x′ ′ Kp (x, x ) = Ap − 1 2 x with a null-space of all constant functions. That is, f the RKHS induced by Kp . 1 p2 (t) dt Kp . (8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. Furthermore, there is no dependence between points in disjoint components separated by zero density regions. 4 Graph Laplacian Regularization in Higher Dimensions The analysis of the previous section seems promising, at it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is not the case anymore. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points. Theorem 2. Let p(x) be a smooth density over Rd , d 2, bounded from above by some constant pmax , and let (x1 , y1 ), . . . , (xl , yl ) be any (non-repeating) set of labeled examples. There exist continuous functions yǫ (x), for any ǫ > 0, all satisfying the constraints yǫ (xj ) = yj , j = 1, . . . , l, such ǫ→0 ǫ→0 that J(yǫ ) −→ 0 but yǫ (x) −→ 0 for all x = xj , j = 1, . . . , l. Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at x0 = 0 with y(x0 ) = 0 and the second labeled point is at x1 with x1 = 1 and y(x1 ) = 1. In addition, we assume that the ball B1 (0) of radius one centered around the origin is contained in Ω = {x ∈ Rd | p(x) > 0}. We first consider the case d > 2. Here, for any ǫ > 0, consider the function x ǫ yǫ (x) = min ,1 which indeed satisfies the two constraints yǫ (xi ) = yi , i = 0, 1. Then, J(yǫ ) = Bǫ (0) p2 (x) dx ǫ2 pmax ǫ2 dx = p2 Vd ǫd−2 max (13) Bǫ (0) where Vd is the volume of a unit ball in Rd . Hence, the sequence of functions yǫ (x) satisfy the constraints, but for d > 2, inf ǫ J(yǫ ) = 0. For d = 2, a more extreme example is necessary: consider the functions 2 x yǫ (x) = log +ǫ ǫ log 1+ǫ ǫ for x 1 and yǫ (x) = 1 for x > 1. These functions satisfy the two constraints yǫ (xi ) = yi , i = 0, 1 and: J(yǫ ) = 4 h “ ”i 1+ǫ 2 log ǫ 4πp2 max h “ ”i 1+ǫ 2 log ǫ x B1 (0) log ( x 1+ǫ ǫ 2 2 +ǫ)2 p2 (x)dx 4p2 h “ max ”i2 1+ǫ log ǫ 4πp2 max ǫ→0 = −→ 0. log 1+ǫ ǫ 4 1 0 r2 (r 2 +ǫ)2 2πrdr The implication of Theorem 2 is that regardless of the values at the labeled points, as u → ∞, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel Kp , as in the onedimensional case, is not possible. Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in l 1 y = arg min ˆ (y(xj ) − yj )2 + γIn (y) y(x) l j=1 In the limit of infinite unlabeled data, functions of the form yǫ (x) above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term J(y) to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points. 4.1 Numerical Example We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density p(x) in R2 , which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at (4, 0). We sample a total of n = 3000 points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with σ = 0.4. Figure 1 depicts the predictor y (x) obtained from (1). In fact, two different predictors are shown, ˆ obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution y (x) of (1) satisfies: ˆ n y (xi ) = ˆ n Wij y (xj ) / ˆ j=1 Wij on all unlabeled points i = l + 1, . . . , l + u. (14) j=1 Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for y (xi ), ˆ which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on x1 , . . . , xl . This is the method used in the bottom panels of Figure 1. As predicted, y (x) is almost constant for almost all unlabeled points. Although all values are very ˆ close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known apriori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%. The differences between the two numerical methods for solving (1) also point out to another problem with the ill-posedness of the limit problem: the solution is numerically very un-stable. A more quantitative evaluation, that also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth σ, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error, and the classification error of the resulting predictor y (x). We plot the classification error both when a threshold ˆ of zero is used (i.e. the class is determined by sign(ˆ(x))) and with the ideal threshold minimizing y the test error. For each unlabeled sample size, we choose the bandwidth σ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly. Note that 5 DIRECT INVERSION SQUARED ERROR SIGN ERROR: 45% OPTIMAL BANDWIDTH 1 0.9 1 5 0 4 2 0.85 y(x) > 0 y(x) < 0 6 0.95 10 0 0 −1 10 0 200 400 600 800 0−1 ERROR (THRESHOLD=0) 0.32 −5 10 0 5 −10 0 −10 −5 −5 0 5 10 10 1 0 0 200 400 600 800 OPTIMAL BANDWIDTH 0.5 0 0 200 400 600 800 0−1 ERROR (IDEAL THRESHOLD) 0.19 5 200 400 600 800 OPTIMAL BANDWIDTH 1 0.28 SIGN ERR: 17.1 0.3 0.26 POWER METHOD 0 1.5 8 0 0.18 −1 10 6 0.17 4 −5 10 0 5 −10 0 −5 −10 −5 0 5 10 Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations - the solution has not yet converged, despite small residuals of the order of 2 · 10−4 . 0.16 0 200 400 600 800 2 0 200 400 600 800 Figure 2: Squared error (top), classification error with a threshold of zero (center) and minimal classification error using ideal threhold (bottom), of the minimizer of (1) as a function of number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted. ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even flatter. 4.2 Probabilistic Interpretation, Exit and Hitting Times As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let x(t) denote a random walk on the graph with transition matrix M = D−1 W where D is a diagonal matrix with Dii = j Wij . Then, for the binary classification case with yi = ±1 we have [15]: y (xi ) = 2 Pr x(t) hits a point labeled +1 before hitting a point labeled -1 x(0) = xi − 1 ˆ We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and σ → 0, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location xi , yielding a flat predictor. 5 Keeping σ Finite At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit σ → 0. One possible objection is that there is an intrinsic characteristic scale for the data σ0 where (with high probability) all points at a distance xi − xj < σ0 have the same label. If this is the case, then it may not necessarily make sense to take values of σ < σ0 in constructing W . However, keeping σ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function y(x) which is zero everywhere except at the labeled points, where y(xj ) = yj . With a finite number of labeled points of measure zero, I (σ) (y) = 0 in any dimension 6 50 points 500 points 3500 points 1 1 0.5 0.5 0.5 0 0 0 −0.5 y 1 −0.5 −0.5 −1 −2 0 2 4 6 −1 −2 0 2 4 6 −1 −2 0 2 4 6 x Figure 3: Minimizer of (1) for a 1-d problem with a fixed σ = 0.4, two labeled points and an increasing number of unlabeled points. and for any fixed σ > 0. While this limiting function is discontinuous, it is also possible to construct ǫ→0 a sequence of continuous functions yǫ that all satisfy the constraints and for which I (σ) (yǫ ) −→ 0. This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at x = 4, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed σ, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as u → ∞. 6 Fourier-Eigenvector Based Methods Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3]. Instead of using the Laplacian as a regularizer, constraining candidate predictors y(x) non-parametrically to those with small In (y) values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian: The similarity matrix W is computed as before, and the Graph Laplacian matrix L = D − W is considered (recall D is a diagonal matrix with Dii = j Wij ). Only predictors p j=1 aj ej y (x) = ˆ (15) spanned by the first p eigenvectors e1 , . . . , ep of L (with smallest eigenvalues) are considered. The coefficients aj are chosen by minimizing a loss function on the labeled data, e.g. the squared loss: (ˆ1 , . . . , ap ) = arg min a ˆ l j=1 (yj − y (xj ))2 . ˆ (16) Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit u → ∞. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4]. Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions. Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well can the unknown classification function be approximated by a truncated expansion with relatively few eigenvectors. We illustrate this issue with the following three-dimensional example: Let p(x) denote the uniform density in the box [0, 1] × [0, 0.8] × [0, 0.6], where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, y1 (x) = 1x1 >0.5 , y2 (x) = 1x1 >x2 /0.8 and y3 (x) = 1x2 /0.8>x3 /0.6 . Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors p is set to p = l/5 as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of p selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method). 7 Prediction Error (%) p = #labeled points/5 40 optimal p 20 labeled points 40 Approx. Error 50 20 20 0 20 20 40 60 # labeled points 0 10 20 40 60 # labeled points 0 0 5 10 15 # eigenvectors 0 0 5 10 15 # eigenvectors Figure 4: Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: best fit when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors. The reason for this behavior is that y2 (x) and even more so y3 (x) cannot be as easily approximated by the very few leading eigenfunctions—even though they seem “simple” and “smooth”, they are significantly more complicated than y1 (x) in terms of measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form ψi,j,k (x) = cos(iπx1 ) cos(jπx2 /0.8) cos(kπx3 /0.6), making it hard to represent simple decision boundaries which are not axis-aligned. 7 Discussion Our results show that a popular SSL method, the Laplacian Regularization method (1), is not wellbehaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons. First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the number of both labeled and unlabeled points grow while l/u is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance. Second, in our experiments we noticed that although the predictor y (x) becomes extremely flat, in ˆ binary tasks, it is still typically possible to find a threshold leading to a good classification performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it. Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression. The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least d+1 derivatives in Rd . In the 2 context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the d+1 graph Laplacian matrix L = D − W , where D is the diagonal degree matrix, with L 2 (matrix to d+1 the 2 power). In the infinite unlabeled data limit, this corresponds to regularizing all order- d+1 2 (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather then ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL. A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the “smoothness” of y(x) relative to p(x) remain unclear. Acknowledgments The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06). 8 References [1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975. [2] A. Azran, The rendevous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007. [3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classification, NIPS, vol. 15, 2003. [4] M. Belkin and P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS 19, 2007. [5] M. Belkin, P. Niyogi and S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006. [6] Y. Bengio, O. Delalleau, N. Le Roux, label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Scholkopf and Zien, editors, MIT Press, 2006. [7] O. Bosquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004. [8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006. [9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008. [10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008. [11] M. Meila, J. Shi. A random walks view of spectral segmentation, AI and Statistics, 2001. [12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006. [13] B. Sch¨ lkopf, A. Smola, Learning with Kernels, MIT Press, 2002. o [14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Sch¨ lkopf, Learning with local and global consistency, o NIPS, vol. 16, 2004. [15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian fields and harmonic functions, ICML, 2003. 9
4 0.56120211 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine
Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang
Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependant partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependant and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms. 1
5 0.43616647 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori
Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems. 1
6 0.38482147 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
8 0.36361143 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity
9 0.33231458 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
10 0.32342461 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems
11 0.31939304 2 nips-2009-3D Object Recognition with Deep Belief Nets
12 0.30511427 93 nips-2009-Fast Image Deconvolution using Hyper-Laplacian Priors
13 0.29894722 137 nips-2009-Learning transport operators for image manifolds
14 0.29789495 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs
15 0.29591694 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors
16 0.29433557 138 nips-2009-Learning with Compressible Priors
17 0.29038596 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs
18 0.28965166 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
19 0.27739349 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem
20 0.27734941 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models
topicId topicWeight
[(24, 0.062), (25, 0.084), (35, 0.036), (36, 0.108), (39, 0.036), (50, 0.011), (58, 0.094), (61, 0.013), (71, 0.035), (81, 0.019), (86, 0.1), (91, 0.325)]
simIndex simValue paperId paperTitle
1 0.98568738 39 nips-2009-Bayesian Belief Polarization
Author: Alan Jern, Kai-min Chang, Charles Kemp
Abstract: Empirical studies have documented cases of belief polarization, where two people with opposing prior beliefs both strengthen their beliefs after observing the same evidence. Belief polarization is frequently offered as evidence of human irrationality, but we demonstrate that this phenomenon is consistent with a fully Bayesian approach to belief revision. Simulation results indicate that belief polarization is not only possible but relatively common within the set of Bayesian models that we consider. Suppose that Carol has requested a promotion at her company and has received a score of 50 on an aptitude test. Alice, one of the company’s managers, began with a high opinion of Carol and became even more confident of her abilities after seeing her test score. Bob, another manager, began with a low opinion of Carol and became even less confident about her qualifications after seeing her score. On the surface, it may appear that either Alice or Bob is behaving irrationally, since the same piece of evidence has led them to update their beliefs about Carol in opposite directions. This situation is an example of belief polarization [1, 2], a widely studied phenomenon that is often taken as evidence of human irrationality [3, 4]. In some cases, however, belief polarization may appear much more sensible when all the relevant information is taken into account. Suppose, for instance, that Alice was familiar with the aptitude test and knew that it was scored out of 60, but that Bob was less familiar with the test and assumed that the score was a percentage. Even though only one interpretation of the score can be correct, Alice and Bob have both made rational inferences given their assumptions about the test. Some instances of belief polarization are almost certain to qualify as genuine departures from rational inference, but we argue in this paper that others will be entirely compatible with a rational approach. Distinguishing between these cases requires a precise normative standard against which human inferences can be compared. We suggest that Bayesian inference provides this normative standard, and present a set of Bayesian models that includes cases where polarization can and cannot emerge. Our work is in the spirit of previous studies that use careful rational analyses in order to illuminate apparently irrational human behavior (e.g. [5, 6, 7]). Previous studies of belief polarization have occasionally taken a Bayesian approach, but often the goal is to show how belief polarization can emerge as a consequence of approximate inference in a Bayesian model that is subject to memory constraints or processing limitations [8]. In contrast, we demonstrate that some examples of polarization are compatible with a fully Bayesian approach. Other formal accounts of belief polarization have relied on complex versions of utility theory [9], or have focused on continuous hypothesis spaces [10] unlike the discrete hypothesis spaces usually considered by psychological studies of belief polarization. We focus on discrete hypothesis spaces and require no additional machinery beyond the basics of Bayesian inference. We begin by introducing the belief revision phenomena considered in this paper and developing a Bayesian approach that clarifies whether and when these phenomena should be considered irrational. We then consider several Bayesian models that are capable of producing belief polarization and illustrate them with concrete examples. Having demonstrated that belief polarization is compatible 1 (a) Contrary updating (i) Divergence (ii) (b) Parallel updating Convergence A P (h1 ) 0.5 0.5 0.5 B Prior beliefs Updated beliefs Prior beliefs Updated beliefs Prior beliefs Updated beliefs Figure 1: Examples of belief updating behaviors for two individuals, A (solid line) and B (dashed line). The individuals begin with different beliefs about hypothesis h1 . After observing the same set of evidence, their beliefs may (a) move in opposite directions or (b) move in the same direction. with a Bayesian approach, we present simulations suggesting that this phenomenon is relatively generic within the space of models that we consider. We finish with some general comments on human rationality and normative models. 1 Belief revision phenomena The term “belief polarization” is generally used to describe situations in which two people observe the same evidence and update their respective beliefs in the directions of their priors. A study by Lord, et al. [1] provides one classic example in which participants read about two studies, one of which concluded that the death penalty deters crime and another which concluded that the death penalty has no effect on crime. After exposure to this mixed evidence, supporters of the death penalty strengthened their support and opponents strengthened their opposition. We will treat belief polarization as a special case of contrary updating, a phenomenon where two people update their beliefs in opposite directions after observing the same evidence (Figure 1a). We distinguish between two types of contrary updating. Belief divergence refers to cases in which the person with the stronger belief in some hypothesis increases the strength of his or her belief and the person with the weaker belief in the hypothesis decreases the strength of his or her belief (Figure 1a(i)). Divergence therefore includes cases of traditional belief polarization. The opposite of divergence is belief convergence (Figure 1a(ii)), in which the person with the stronger belief decreases the strength of his or her belief and the person with the weaker belief increases the strength of his or her belief. Contrary updating may be contrasted with parallel updating (Figure 1b), in which the two people update their beliefs in the same direction. Throughout this paper, we consider only situations in which both people change their beliefs after observing some evidence. All such situations can be unambiguously classified as instances of parallel or contrary updating. Parallel updating is clearly compatible with a normative approach, but the normative status of divergence and convergence is less clear. Many authors argue that divergence is irrational, and many of the same authors also propose that convergence is rational [2, 3]. For example, Baron [3] writes that “Normatively, we might expect that beliefs move toward the middle of the range when people are presented with mixed evidence.” (p. 210) The next section presents a formal analysis that challenges the conventional wisdom about these phenomena and clarifies the cases where they can be considered rational. 2 A Bayesian approach to belief revision Since belief revision involves inference under uncertainty, Bayesian inference provides the appropriate normative standard. Consider a problem where two people observe data d that bear on some hypothesis h1 . Let P1 (·) and P2 (·) be distributions that capture the two people’s respective beliefs. Contrary updating occurs whenever one person’s belief in h1 increases and the other person’s belief in h1 decreases, or when [P1 (h1 |d) − P1 (h1 )] [P2 (h1 |d) − P2 (h1 )] < 0 . 2 (1) Family 1 (a) H (c) (d) (e) V H D Family 2 (b) V V V D H D H D H (f) (g) V V D H D (h) V H D H D Figure 2: (a) A simple Bayesian network that cannot produce either belief divergence or belief convergence. (b) – (h) All possible three-node Bayes nets subject to the constraints described in the text. Networks in Family 1 can produce only parallel updating, but networks in Family 2 can produce both parallel and contrary updating. We will use Bayesian networks to capture the relationships between H, D, and any other variables that are relevant to the situation under consideration. For example, Figure 2a captures the idea that the data D are probabilistically generated from hypothesis H. The remaining networks in Figure 2 show several other ways in which D and H may be related, and will be discussed later. We assume that the two individuals agree on the variables that are relevant to a problem and agree about the relationships between these variables. We can formalize this idea by requiring that both people agree on the structure and the conditional probability distributions (CPDs) of a network N that captures relationships between the relevant variables, and that they differ only in the priors they assign to the root nodes of N . If N is the Bayes net in Figure 2a, then we assume that the two people must agree on the distribution P (D|H), although they may have different priors P1 (H) and P2 (H). If two people agree on network N but have different priors on the root nodes, we can create a single expanded Bayes net to simulate the inferences of both individuals. The expanded network is created by adding a background knowledge node B that sends directed edges to all root nodes in N , and acts as a switch that sets different root node priors for the two different individuals. Given this expanded network, distributions P1 and P2 in Equation 1 can be recovered by conditioning on the value of the background knowledge node and rewritten as [P (h1 |d, b1 ) − P (h1 |b1 )] [P (h1 |d, b2 ) − P (h1 |b2 )] < 0 (2) where P (·) represents the probability distribution captured by the expanded network. Suppose that there are exactly two mutually exclusive hypotheses. For example, h1 and h0 might state that the death penalty does or does not deter crime. In this case Equation 2 implies that contrary updating occurs when [P (d|h1 , b1 ) − P (d|h0 , b1 )] [P (d|h1 , b2 ) − P (d|h0 , b2 )] < 0 . (3) Equation 3 is derived in the supporting material, and leads immediately to the following result: R1: If H is a binary variable and D and B are conditionally independent given H, then contrary updating is impossible. Result R1 follows from the observation that if D and B are conditionally independent given H, then the product in Equation 3 is equal to (P (d|h1 ) − P (d|h0 ))2 , which cannot be less than zero. R1 implies that the simple Bayes net in Figure 2a is incapable of producing contrary updating, an observation previously made by Lopes [11]. Our analysis may help to explain the common intuition that belief divergence is irrational, since many researchers seem to implicitly adopt a model in which H and D are the only relevant variables. Network 2a, however, is too simple to capture the causal relationships that are present in many real world situations. For example, the promotion example at the beginning of this paper is best captured using a network with an additional node that represents the grading scale for the aptitude test. Networks with many nodes may be needed for some real world problems, but here we explore the space of three-node networks. We restrict our attention to connected graphs in which D has no outgoing edges, motivated by the idea that the three variables should be linked and that the data are the final result of some generative process. The seven graphs that meet these conditions are shown in Figures 2b–h, where the additional variable has been labeled V . These Bayes nets illustrate cases in which (b) V is an additional 3 Models Conventional wisdom Family 1 Family 2 Belief divergence Belief convergence Parallel updating Table 1: The first column represents the conventional wisdom about which belief revision phenomena are normative. The models in the remaining columns include all three-node Bayes nets. This set of models can be partitioned into those that support both belief divergence and convergence (Family 2) and those that support neither (Family 1). piece of evidence that bears on H, (c) V informs the prior probability of H, (d)–(e) D is generated by an intervening variable V , (f) V is an additional generating factor of D, (g) V informs both the prior probability of H and the likelihood of D, and (h) H and D are both effects of V . The graphs in Figure 2 have been organized into two families. R1 implies that none of the graphs in Family 1 is capable of producing contrary updating. The next section demonstrates by example that all three of the graphs in Family 2 are capable of producing contrary updating. Table 1 compares the two families of Bayes nets to the informal conclusions about normative approaches that are often found in the psychological literature. As previously noted, the conventional wisdom holds that belief divergence is irrational but that convergence and parallel updating are both rational. Our analysis suggests that this position has little support. Depending on the causal structure of the problem under consideration, a rational approach should allow both divergence and convergence or neither. Although we focus in this paper on Bayes nets with no more than three nodes, the class of all network structures can be partitioned into those that can (Family 2) and cannot (Family 1) produce contrary updating. R1 is true for Bayes nets of any size and characterizes one group of networks that belong to Family 1. Networks where the data provide no information about the hypotheses must also fail to produce contrary updating. Note that if D and H are conditionally independent given B, then the left side of Equation 3 is equal to zero, meaning contrary updating cannot occur. We conjecture that all remaining networks can produce contrary updating if the cardinalities of the nodes and the CPDs are chosen appropriately. Future studies can attempt to verify this conjecture and to precisely characterize the CPDs that lead to contrary updating. 3 Examples of rational belief divergence We now present four scenarios that can be modeled by the three-node Bayes nets in Family 2. Our purpose in developing these examples is to demonstrate that these networks can produce belief divergence and to provide some everyday examples in which this behavior is both normative and intuitive. 3.1 Example 1: Promotion We first consider a scenario that can be captured by Bayes net 2f, in which the data depend on two independent factors. Recall the scenario described at the beginning of this paper: Alice and Bob are responsible for deciding whether to promote Carol. For simplicity, we consider a case where the data represent a binary outcome—whether or not Carol’s r´ sum´ indicates that she is included e e in The Directory of Notable People—rather than her score on an aptitude test. Alice believes that The Directory is a reputable publication but Bob believes it is illegitimate. This situation is represented by the Bayes net and associated CPDs in Figure 3a. In the tables, the hypothesis space H = {‘Unqualified’ = 0, ‘Qualified’ = 1} represents whether or not Carol is qualified for the promotion, the additional factor V = {‘Disreputable’ = 0, ‘Reputable’ = 1} represents whether The Directory is a reputable publication, and the data variable D = {‘Not included’ = 0, ‘Included’ = 1} represents whether Carol is featured in it. The actual probabilities were chosen to reflect the fact that only an unqualified person is likely to pad their r´ sum´ by mentioning a disreputable publication, but that e e 4 (a) B Alice Bob (b) P(V=1) 0.01 0.9 B Alice Bob V B Alice Bob P(H=1) 0.6 0.4 V H D V 0 0 1 1 H 0 1 0 1 V 0 1 P(D=1) 0.5 0.1 0.1 0.9 (c) P(H=1) 0.1 0.9 H V 0 0 1 1 D H 0 1 0 1 P(D=1) 0.4 0.01 0.4 0.6 (d) B Alice Bob P(V=0) P(V=1) P(V=2) P(V=3) 0.6 0.2 0.1 0.1 0.1 0.1 0.2 0.6 B Alice Bob P(V1=1) 0.9 0.1 P(H=1) 1 1 0 0 H B Alice Bob V1 V V 0 1 2 3 P(V=1) 0.9 0.1 D V 0 1 2 3 P(D=0) P(D=1) P(D=2) P(D=3) 0.7 0.1 0.1 0.1 0.1 0.7 0.1 0.1 0.1 0.1 0.7 0.1 0.1 0.1 0.1 0.7 V1 0 0 1 1 V2 0 1 0 1 P(H=1) 0.5 0.1 0.5 0.9 P(V2=1) 0.5 0.5 V2 H D V2 0 1 P(D=1) 0.1 0.9 Figure 3: The Bayes nets and conditional probability distributions used in (a) Example 1: Promotion, (b) Example 2: Religious belief, (c) Example 3: Election polls, (d) Example 4: Political belief. only a qualified person is likely to be included in The Directory if it is reputable. Note that Alice and Bob agree on the conditional probability distribution for D, but assign different priors to V and H. Alice and Bob therefore interpret the meaning of Carol’s presence in The Directory differently, resulting in the belief divergence shown in Figure 4a. This scenario is one instance of a large number of belief divergence cases that can be attributed to two individuals possessing different mental models of how the observed evidence was generated. For instance, suppose now that Alice and Bob are both on an admissions committee and are evaluating a recommendation letter for an applicant. Although the letter is positive, it is not enthusiastic. Alice, who has less experience reading recommendation letters interprets the letter as a strong endorsement. Bob, however, takes the lack of enthusiasm as an indication that the author has some misgivings [12]. As in the promotion scenario, the differences in Alice’s and Bob’s experience can be effectively represented by the priors they assign to the H and V nodes in a Bayes net of the form in Figure 2f. 3.2 Example 2: Religious belief We now consider a scenario captured by Bayes net 2g. In our example for Bayes net 2f, the status of an additional factor V affected how Alice and Bob interpreted the data D, but did not shape their prior beliefs about H. In many cases, however, the additional factor V will influence both people’s prior beliefs about H as well as their interpretation of the relationship between D and H. Bayes net 2g captures this situation, and we provide a concrete example inspired by an experiment conducted by Batson [13]. Suppose that Alice believes in a “Christian universe:” she believes in the divinity of Jesus Christ and expects that followers of Christ will be persecuted. Bob, on the other hand, believes in a “secular universe.” This belief leads him to doubt Christ’s divinity, but to believe that if Christ were divine, his followers would likely be protected rather than persecuted. Now suppose that both Alice and Bob observe that Christians are, in fact, persecuted, and reassess the probability of Christ’s divinity. This situation is represented by the Bayes net and associated CPDs in Figure 3b. In the tables, the hypothesis space H = {‘Human’ = 0, ‘Divine’ = 1} represents the divinity of Jesus Christ, the additional factor V = {‘Secular’ = 0, ‘Christian’ = 1} represents the nature of the universe, and the data variable D = {‘Not persecuted’ = 0, ‘Persecuted’ = 1} represents whether Christians are subject to persecution. The exact probabilities were chosen to reflect the fact that, regardless of worldview, people will agree on a “base rate” of persecution given that Christ is not divine, but that more persecution is expected if the Christian worldview is correct than if the secular worldview is correct. Unlike in the previous scenario, Alice and Bob agree on the CPDs for both D and H, but 5 (a) (b) P (H = 1) (d) 1 1 1 0.5 1 (c) 0.5 0.5 A 0.5 B 0 0 0 Prior beliefs Updated beliefs Prior beliefs Updated beliefs 0 Prior beliefs Updated beliefs Prior beliefs Updated beliefs Figure 4: Belief revision outcomes for (a) Example 1: Promotion, (b) Example 2: Religious belief, (c) Example 3: Election polls, and (d) Example 4: Political belief. In all four plots, the updated beliefs for Alice (solid line) and Bob (dashed line) are computed after observing the data described in the text. The plots confirm that all four of our example networks can lead to belief divergence. differ in the priors they assign to V . As a result, Alice and Bob disagree about whether persecution supports or undermines a Christian worldview, which leads to the divergence shown in Figure 4b. This scenario is analogous to many real world situations in which one person has knowledge that the other does not. For instance, in a police interrogation, someone with little knowledge of the case (V ) might take a suspect’s alibi (D) as strong evidence of their innocence (H). However, a detective with detailed knowledge of the case may assign a higher prior probability to the subject’s guilt based on other circumstantial evidence, and may also notice a detail in the suspect’s alibi that only the culprit would know, thus making the statement strong evidence of guilt. In all situations of this kind, although two people possess different background knowledge, their inferences are normative given that knowledge, consistent with the Bayes net in Figure 2g. 3.3 Example 3: Election polls We now consider two qualitatively different cases that are both captured by Bayes net 2h. The networks considered so far have all included a direct link between H and D. In our next two examples, we consider cases where the hypotheses and observed data are not directly linked, but are coupled by means of one or more unobserved causal factors. Suppose that an upcoming election will be contested by two Republican candidates, Rogers and Rudolph, and two Democratic candidates, Davis and Daly. Alice and Bob disagree about the various candidates’ chances of winning, with Alice favoring the two Republicans and Bob favoring the two Democrats. Two polls were recently released, one indicating that Rogers was most likely to win the election and the other indicating that Daly was most likely to win. After considering these polls, they both assess the likelihood that a Republican will win the election. This situation is represented by the Bayes net and associated CPDs in Figure 3c. In the tables, the hypothesis space H = {‘Democrat wins’ = 0, ‘Republican wins’ = 1} represents the winning party, the variable V = {‘Rogers’ = 0, ‘Rudolph’ = 1, ‘Davis’ = 2, ‘Daly’ = 3} represents the winning candidate, and the data variables D1 = D2 = {‘Rogers’ = 0, ‘Rudolph’ = 1, ‘Davis’ = 2, ‘Daly’ = 3} represent the results of the two polls. The exact probabilities were chosen to reflect the fact that the polls are likely to reflect the truth with some noise, but whether a Democrat or Republican wins is completely determined by the winning candidate V . In Figure 3c, only a single D node is shown because D1 and D2 have identical CPDs. The resulting belief divergence is shown in Figure 4c. Note that in this scenario, Alice’s and Bob’s different priors cause them to discount the poll that disagrees with their existing beliefs as noise, thus causing their prior beliefs to be reinforced by the mixed data. This scenario was inspired by the death penalty study [1] alluded to earlier, in which a set of mixed results caused supporters and opponents of the death penalty to strengthen their existing beliefs. We do not claim that people’s behavior in this study can be explained with exactly the model employed here, but our analysis does show that selective interpretation of evidence is sometimes consistent with a rational approach. 6 3.4 Example 4: Political belief We conclude with a second illustration of Bayes net 2h in which two people agree on the interpretation of an observed piece of evidence but disagree about the implications of that evidence. In this scenario, Alice and Bob are two economists with different philosophies about how the federal government should approach a major recession. Alice believes that the federal government should increase its own spending to stimulate economic activity; Bob believes that the government should decrease its spending and reduce taxes instead, providing taxpayers with more spending money. A new bill has just been proposed and an independent study found that the bill was likely to increase federal spending. Alice and Bob now assess the likelihood that this piece of legislation will improve the economic climate. This scenario can be modeled by the Bayes net and associated CPDs in Figure 3d. In the tables, the hypothesis space H = {‘Bad policy’ = 0, ‘Good policy’ = 1} represents whether the new bill is good for the economy and the data variable D = {‘No spending’ = 0, ‘Spending increase’ = 1} represents the conclusions of the independent study. Unlike in previous scenarios, we introduce two additional factors, V 1 = {‘Fiscally conservative’ = 0, ‘Fiscally liberal’ = 1}, which represents the optimal economic philosophy, and V 2 = {‘No spending’ = 0, ‘Spending increase’ = 1}, which represents the spending policy of the new bill. The exact probabilities in the tables were chosen to reflect the fact that if the bill does not increase spending, the policy it enacts may still be good for other reasons. A uniform prior was placed on V 2 for both people, reflecting the fact that they have no prior expectations about the spending in the bill. However, the priors placed on V 1 for Alice and Bob reflect their different beliefs about the best economic policy. The resulting belief divergence behavior is shown in Figure 4d. The model used in this scenario bears a strong resemblance to the probabilogical model of attitude change developed by McGuire [14] in which V 1 and V 2 might be logical “premises” that entail the “conclusion” H. 4 How common is contrary updating? We have now described four concrete cases where belief divergence is captured by a normative approach. It is possible, however, that belief divergence is relatively rare within the Bayes nets of Family 2, and that our four examples are exotic special cases that depend on carefully selected CPDs. To rule out this possibility, we ran simulations to explore the space of all possible CPDs for the three networks in Family 2. We initially considered cases where H, D, and V were binary variables, and ran two simulations for each model. In one simulation, the priors and each row of each CPD were sampled from a symmetric Beta distribution with parameter 0.1, resulting in probabilities highly biased toward 0 and 1. In the second simulation, the probabilities were sampled from a uniform distribution. In each trial, a single set of CPDs were generated and then two different priors were generated for each root node in the graph to simulate two individuals, consistent with our assumption that two individuals may have different priors but must agree about the conditional probabilities. 20,000 trials were carried out in each simulation, and the proportion of trials that led to convergence and divergence was computed. Trials were only counted as instances of convergence or divergence if |P (H = 1|D = 1) − P (H = 1)| > for both individuals, with = 1 × 10−5 . The results of these simulations are shown in Table 2. The supporting material proves that divergence and convergence are equally common, and therefore the percentages in the table show the frequencies for contrary updating of either type. Our primary question was whether contrary updating is rare or anomalous. In all but the third simulation, contrary updating constituted a substantial proportion of trials, suggesting that the phenomenon is relatively generic. We were also interested in whether this behavior relied on particular settings of the CPDs. The fact that percentages for the uniform distribution are approximately the same or greater than for the biased distribution indicates that contrary updating appears to be a relatively generic behavior for the Bayes nets we considered. More generally, these results directly challenge the suggestion that normative accounts are not suited for modeling belief divergence. The last two columns of Table 2 show results for two simulations with the same Bayes net, the only difference being whether V was treated as 2-valued (binary) or 4-valued. The 4-valued case is included because both Examples 3 and 4 considered multi-valued additional factor variables V . 7 2-valued V V H Biased Uniform 4-valued V V V V D 9.6% 18.2% D H 12.7% 16.0% H D 0% 0% H D 23.3% 20.0% Table 2: Simulation results. The percentages indicate the proportion of trials that produced contrary updating using the specified Bayes net (column) and probability distributions (row). The prior and conditional probabilities were either sampled from a Beta(0.1, 0.1) distribution (biased) or a Beta(1, 1) distribution (uniform). The probabilities for the simulation results shown in the last column were sampled from a Dirichlet([0.1, 0.1, 0.1, 0.1]) distribution (biased) or a Dirichlet([1, 1, 1, 1]) distribution (uniform). In Example 4, we used two binary variables, but we could have equivalently used a single 4-valued variable. Belief convergence and divergence are not possible in the binary case, a result that is proved in the supporting material. We believe, however, that convergence and divergence are fairly common whenever V takes three or more values, and the simulation in the last column of the table confirms this claim for the 4-valued case. Given that belief divergence seems relatively common in the space of all Bayes nets, it is natural to explore whether cases of rational divergence are regularly encountered in the real world. One possible approach is to analyze a large database of networks that capture everyday belief revision problems, and to determine what proportion of networks lead to rational divergence. Future studies can explore this issue, but our simulations suggest that contrary updating is likely to arise in cases where it is necessary to move beyond a simple model like the one in Figure 2a and consider several causal factors. 5 Conclusion This paper presented a family of Bayes nets that can account for belief divergence, a phenomenon that is typically considered to be incompatible with normative accounts. We provided four concrete examples that illustrate how this family of networks can capture a variety of settings where belief divergence can emerge from rational statistical inference. We also described a series of simulations that suggest that belief divergence is not only possible but relatively common within the family of networks that we considered. Our work suggests that belief polarization should not always be taken as evidence of irrationality, and that researchers who aim to document departures from rationality may wish to consider alternative phenomena instead. One such phenomenon might be called “inevitable belief reinforcement” and occurs when supporters of a hypothesis update their belief in the same direction for all possible data sets d. For example, a gambler will demonstrate inevitable belief reinforcement if he or she becomes increasingly convinced that a roulette wheel is biased towards red regardless of whether the next spin produces red, black, or green. This phenomenon is provably inconsistent with any fully Bayesian approach, and therefore provides strong evidence of irrationality. Although we propose that some instances of polarization are compatible with a Bayesian approach, we do not claim that human inferences are always or even mostly rational. We suggest, however, that characterizing normative behavior can require careful thought, and that formal analyses are invaluable for assessing the rationality of human inferences. In some cases, a formal analysis will provide an appropriate baseline for understanding how human inferences depart from rational norms. In other cases, a formal analysis will suggest that an apparently irrational inference makes sense once all of the relevant information is taken into account. 8 References [1] C. G. Lord, L. Ross, and M. R. Lepper. Biased assimilation and attitude polarization: The effects of prior theories on subsequently considered evidence. Journal of Personality and Social Psychology, 37(1):2098–2109, 1979. [2] L. Ross and M. R. Lepper. The perseverance of beliefs: Empirical and normative considerations. In New directions for methodology of social and behavioral science: Fallible judgment in behavioral research. Jossey-Bass, San Francisco, 1980. [3] J. Baron. Thinking and Deciding. Cambridge University Press, Cambridge, 4th edition, 2008. [4] A. Gerber and D. Green. Misperceptions about perceptual bias. Annual Review of Political Science, 2:189–210, 1999. [5] M. Oaksford and N. Chater. A rational analysis of the selection task as optimal data selection. Psychological Review, 101(4):608–631, 1994. [6] U. Hahn and M. Oaksford. The rationality of informal argumentation: A Bayesian approach to reasoning fallacies. Psychological Review, 114(3):704–732, 2007. [7] S. Sher and C. R. M. McKenzie. Framing effects and rationality. In N. Chater and M. Oaksford, editors, The probablistic mind: Prospects for Bayesian cognitive science. Oxford University Press, Oxford, 2008. [8] B. O’Connor. Biased evidence assimilation under bounded Bayesian rationality. Master’s thesis, Stanford University, 2006. [9] A. Zimper and A. Ludwig. Attitude polarization. Technical report, Mannheim Research Institute for the Economics of Aging, 2007. [10] A. K. Dixit and J. W. Weibull. Political polarization. Proceedings of the National Academy of Sciences, 104(18):7351–7356, 2007. [11] L. L. Lopes. Averaging rules and adjustment processes in Bayesian inference. Bulletin of the Psychonomic Society, 23(6):509–512, 1985. [12] A. Harris, A. Corner, and U. Hahn. “Damned by faint praise”: A Bayesian account. In A. D. De Groot and G. Heymans, editors, Proceedings of the 31th Annual Conference of the Cognitive Science Society, Austin, TX, 2009. Cognitive Science Society. [13] C. D. Batson. Rational processing or rationalization? The effect of disconfirming information on a stated religious belief. Journal of Personality and Social Psychology, 32(1):176–184, 1975. [14] W. J. McGuire. The probabilogical model of cognitive structure and attitude change. In R. E. Petty, T. M. Ostrom, and T. C. Brock, editors, Cognitive Responses in Persuasion. Lawrence Erlbaum Associates, 1981. 9
2 0.96677917 259 nips-2009-Who’s Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation
Author: Jie Luo, Barbara Caputo, Vittorio Ferrari
Abstract: Given a corpus of news items consisting of images accompanied by text captions, we want to find out “who’s doing what”, i.e. associate names and action verbs in the captions to the face and body pose of the persons in the images. We present a joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus. These models can then be used to recognize people and actions in novel images without captions. We demonstrate experimentally that our joint ‘face and pose’ model solves the correspondence problem better than earlier models covering only the face, and that it can perform recognition of new uncaptioned images. 1
3 0.9043982 52 nips-2009-Code-specific policy gradient rules for spiking neurons
Author: Henning Sprekeler, Guillaume Hennequin, Wulfram Gerstner
Abstract: Although it is widely believed that reinforcement learning is a suitable tool for describing behavioral learning, the mechanisms by which it can be implemented in networks of spiking neurons are not fully understood. Here, we show that different learning rules emerge from a policy gradient approach depending on which features of the spike trains are assumed to influence the reward signals, i.e., depending on which neural code is in effect. We use the framework of Williams (1992) to derive learning rules for arbitrary neural codes. For illustration, we present policy-gradient rules for three different example codes - a spike count code, a spike timing code and the most general “full spike train” code - and test them on simple model problems. In addition to classical synaptic learning, we derive learning rules for intrinsic parameters that control the excitability of the neuron. The spike count learning rule has structural similarities with established Bienenstock-Cooper-Munro rules. If the distribution of the relevant spike train features belongs to the natural exponential family, the learning rules have a characteristic shape that raises interesting prediction problems. 1
same-paper 4 0.83762193 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases
Author: Kaushik Sinha, Mikhail Belkin
Abstract: We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. It turns out that when the data has clustered, that is, when the high density regions are sufficiently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combination of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for good classification functions when the cluster assumption holds. By first choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classifier in the span of these eigenvectors, we obtain a classifier, which has a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption. Experimental results on a number of real-world data-sets show that our method is competitive with the state of the art semi-supervised learning algorithms and outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis). 1
5 0.76919329 166 nips-2009-Noisy Generalized Binary Search
Author: Robert Nowak
Abstract: This paper addresses the problem of noisy Generalized Binary Search (GBS). GBS is a well-known greedy algorithm for determining a binary-valued hypothesis through a sequence of strategically selected queries. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. GBS is used in many applications, including fault testing, machine diagnostics, disease diagnosis, job scheduling, image processing, computer vision, and active learning. In most of these cases, the responses to queries can be noisy. Past work has provided a partial characterization of GBS, but existing noise-tolerant versions of GBS are suboptimal in terms of query complexity. This paper presents an optimal algorithm for noisy GBS and demonstrates its application to learning multidimensional threshold functions. 1
6 0.65307546 247 nips-2009-Time-rescaling methods for the estimation and assessment of non-Poisson neural encoding models
7 0.65073371 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
8 0.64836466 210 nips-2009-STDP enables spiking neurons to detect hidden causes of their inputs
9 0.63110745 19 nips-2009-A joint maximum-entropy model for binary neural population patterns and continuous signals
10 0.63074058 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels
11 0.62805247 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies
12 0.61249834 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
13 0.61130369 163 nips-2009-Neurometric function analysis of population codes
14 0.60955781 99 nips-2009-Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning
15 0.59246826 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
16 0.59111208 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity
17 0.58578128 112 nips-2009-Human Rademacher Complexity
18 0.58543968 203 nips-2009-Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks
19 0.58151686 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction
20 0.58101946 3 nips-2009-AUC optimization and the two-sample problem