nips nips2009 nips2009-212 knowledge-graph by maker-knowledge-mining

212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

Source: pdf

Author: Rob Fergus, Yair Weiss, Antonio Torralba

Abstract: With the advent of the Internet it is now possible to collect hundreds of millions of images. These images come with varying degrees of label information. “Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. Semi-supervised learning is a principled framework for combining these different label sources. However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. In this paper we show how to utilize recent results in machine learning to obtain highly efﬁcient approximations for semi-supervised learning that are linear in the number of images. Speciﬁcally, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 “Clean labels” can be manually obtained on a small fraction, “noisy labels” may be extracted automatically from surrounding text, while for most images there are no labels at all. [sent-10, score-0.269]

2 However, it scales polynomially with the number of images, making it impractical for use on gigantic collections with hundreds of millions of images and thousands of classes. [sent-12, score-0.294]

3 Speciﬁcally, we use the convergence of the eigenvectors of the normalized graph Laplacian to eigenfunctions of weighted Laplace-Beltrami operators. [sent-14, score-0.784]

4 Our algorithm enables us to apply semi-supervised learning to a database of 80 million images gathered from the Internet. [sent-15, score-0.266]

5 Effective techniques for searching and labeling this ocean of images and video must address two conﬂicting problems: (i) the techniques to understand the visual content of an image and (ii) the ability to scale to millions of billions of images or video frames. [sent-17, score-0.444]

6 A variety of collaborative and online annotation efforts have attempted to build large collections of human labeled images, ranging from simple image classiﬁcations, to boundingboxes and precise pixel-level segmentation [16, 21, 24]. [sent-20, score-0.262]

7 However, even though most images on the web lack human annotation, they often have some kind of noisy label gleaned from nearby text or from the image ﬁlename and often this gives a strong cue about the content of the image. [sent-22, score-0.352]

8 They rely on the density structure of the data itself to propagate known labels to areas lacking annotations, and provide a natural way to incorporate labeling uncertainty. [sent-25, score-0.247]

9 Building on recent results in spectral graph theory, we efﬁciently construct accurate numerical approximations to the eigenvectors of the normalized graph Laplacian. [sent-29, score-0.478]

10 Semi-supervised learning is a rapidly growing sub-ﬁeld of machine learning, dealing with datasets that have a large number of unlabeled points and a much smaller number of labeled points (see [5] for a recent overview). [sent-41, score-0.27]

11 The problem with approaches based on backbone graphs is that the spectrum of the graph Laplacian can change dramatically with different backbone construction methods [12]. [sent-56, score-0.303]

12 In order to use the unlabeled data, we form a graph G = (V, E) where the vertices V are the datapoints x1 , . [sent-81, score-0.271]

13 In graph-based semi-supervised learning, the graph Laplacian L is used to deﬁne a smoothness operator that takes into account the unlabeled data. [sent-87, score-0.338]

14 The smoothness is measured by the graph Laplacian: 1 2 f T Lf = Wij (f (i) − f (j)) 2 i,j Of course simply minimizing smoothness can be achieved by the trivial solution f = 1, but in semi-supervised learning, we minimize a combination of the smoothness and the training loss. [sent-89, score-0.542]

15 035 Figure 2: Left: The three generalized eigenvectors of the graph Laplacian, for the toy data. [sent-96, score-0.468]

16 Note that the semi-supervised solution can be written as a linear combination of these eigenvectors (in this case, the second eigenvector is enough). [sent-97, score-0.366]

17 Using generalized eigenvectors (or equivalently normalized Laplacians) increases robustness of the ﬁrst eigenvectors, compared to using the un-normalized eigenvectors. [sent-98, score-0.302]

18 Right: The 2D density of the toy data, and the associated smoothness eigenfunctions deﬁned by that density. [sent-99, score-0.684]

19 where Λ is a diagonal matrix whose diagonal elements are Λii = λ if i is a labeled point and Λii = 0 for unlabeled points. [sent-101, score-0.292]

20 But as suggested in [5, 17, 28], the dimension of the problem can be reduced dramatically by only working with a small number of eigenvectors of the Laplacian. [sent-107, score-0.267]

21 Let Φi , σi be the generalized eigenvectors and eigenvalues of the graph Laplacian L (solutions to Lφi = σi Dφi ). [sent-108, score-0.427]

22 Since any vector in Rn can be written f = i αi Φi , 2 the smoothness of a vector is simply i αi σi so that smooth vectors will be linear combinations of 1 the eigenvectors with small eigenvalues . [sent-110, score-0.419]

23 2(left) shows the three generalized eigenvectors of the Laplacian for the toy data shown in Fig. [sent-112, score-0.375]

24 1(c)) is a linear combination of these three eigenvectors (in fact just one eigenvector is enough). [sent-115, score-0.333]

25 In general, we can signiﬁcantly reduce the dimension of f by requiring it to be of the form f = U α where U is a n × k matrix whose columns are the k eigenvectors with smallest eigenvalue. [sent-116, score-0.291]

26 1 From Eigenvectors to Eigenfunctions Given the eigenvectors of the graph Laplacian, we can now solve the semi-supervised problem in a reduced dimensional space. [sent-118, score-0.36]

27 But to ﬁnd the eigenvectors in the ﬁrst place, we need to diagonalize a n × n matrix. [sent-119, score-0.267]

28 How can we efﬁciently calculate the eigenvectors as the number of unlabeled points increases? [sent-120, score-0.441]

29 We follow [23, 14] in assuming the data xi ∈ Rd are samples from a distribution p(x) and analyzing the eigenfunctions of the smoothness operator deﬁned by p(x). [sent-121, score-0.578]

30 Just as the graph Laplacian deﬁned eigenvectors of increasing smoothness, the smoothness operator will deﬁne eigenfunctions of increasing smoothness. [sent-126, score-0.938]

31 We deﬁne the ﬁrst eigenfunction of LP (f ) by a minimization problem: Φ1 = arg min F: Lp (F ) F 2 (x)p(x)D(x)dx=1 where D(x) = x2 W (x, x2 )p(x2 )dx2 . [sent-127, score-0.524]

32 Note that the ﬁrst eigenfunction will always be the trivial function Φ(x) = 1 since it has maximal smoothness LP (1) = 0. [sent-128, score-0.644]

33 The second eigenfunction of Lp (f ) minimizes the same problem, with the additional constraint that it be orthogonal to the ﬁrst eigenfunction F (x)Φ1 (x)D(x)p(x)dx = 0. [sent-129, score-1.048]

34 More generally, the kth eigenfunction minimizes Lp (f ) under additional constraints that F (x)Φl (x)p(x)D(x)dx = 0 for all l < k. [sent-130, score-0.524]

35 The eigenvalue of an eigenfunction Φk is simply its smoothness σk = Lp (Φk ). [sent-131, score-0.695]

36 2(right) shows the ﬁrst three eigenfunctions corresponding to the density of the toy data. [sent-133, score-0.564]

37 Similar to the eigenvectors of the graph Laplacian, the second eigenfunction reveals the natural clustering of the data. [sent-134, score-0.884]

38 Note that the eigenvalue of the eigenfunctions is similar to the eigenvalue of the discrete generalized eigenvector. [sent-135, score-0.561]

39 How are these eigenfunctions related to the generalized eigenvectors of the Laplacian? [sent-136, score-0.726]

40 It is easy 2 1 1 to see that as n → ∞, n2 f T Lf = 2n2 i,j Wij (f (i) − f (j)) will approach Lp (F ), and 1 2 F 2 (x)D(x)p(x)dx so that the minimization problems that dei f (i)D(i, i) will approach n ﬁne the eigenvectors approach the problems that deﬁne the eigenfunctions as n → ∞. [sent-137, score-0.691]

41 Thus under suitable convergence conditions, the eigenfunctions can be seen as the limit of the eigenvectors as the number of points goes to inﬁnity [1, 3, 6, 14]. [sent-138, score-0.739]

42 uniform, Gaussian) the eigenfunctions can be calculated analytically [14, 23]. [sent-141, score-0.424]

43 Thus for these cases, there is a tremendous advantage in estimating p(x) and calculating the eigenfunctions from p(x) rather than attempting to estimate the eigenvectors directly. [sent-142, score-0.691]

44 Instead of diagonalizing an 80 million by 80 million matrix, we can simply estimate a 32 × 32 covariance matrix and get analytical eigenfunctions. [sent-144, score-0.284]

45 In low dimensions, we can calculate the eigenfunction numerically by discretizing the density. [sent-145, score-0.559]

46 This method was used to calculate the eigenfunctions in Fig. [sent-147, score-0.459]

47 This assumption allows us to calculate the eigenfunctions of Lp using only the marginal distributions p(si ). [sent-151, score-0.459]

48 Let Φi (sk ) be an eigenfunction of Lpk with eigenvalue σi , then Φi (s) = Φi (sk ) is also an eigenfunction of Lp with the same eigenvalue σi . [sent-154, score-1.15]

49 Proof: This follows from the observation in [14, 23] that for separable distributions, the eigenfunctions are also separable. [sent-155, score-0.463]

50 4 • Given the approximated density p(sk ), solve numerically for eigenfunctions and eigenvalues of Lpk using Eqn. [sent-159, score-0.523]

51 • Order the eigenfunctions from all components by increasing eigenvalue. [sent-162, score-0.424]

52 The need to add a small constant to the histogram comes from the fact that the smoothness operator Lp (F ) ignores the value of F wherever the density vanishes, p(x) = 0. [sent-163, score-0.266]

53 Thus the eigenfunctions can oscillate wildly in regions with zero density. [sent-164, score-0.424]

54 This algorithm will recover eigenfunctions of Lp , which depend only on a single coordinate. [sent-167, score-0.424]

55 As discussed in [23], products of these eigenfunctions for different coordinates are also eigenfunctions, but we will assume the semi-supervised solution is a linear combination of only the single-coordinate eigenfunctions. [sent-168, score-0.457]

56 By choosing the k eigenfunctions with smallest eigenvalue we now have k functions Φk (x) whose value is given at a set of discrete points for each coordinate. [sent-169, score-0.523]

57 Consider again a dataset of 80 million points in 32 dimensions and assume 100 bins per dimension. [sent-174, score-0.26]

58 In contrast, directly estimating the eigenvectors of the graph Laplacian will require diagonalizing an 80 million by 80 million matrix. [sent-176, score-0.62]

59 We start with a toy example that illustrates our eigenfunction approach, compared to the Nystrom method of Talwalker et al. [sent-181, score-0.597]

60 As Data Nystrom Eigenfunction Data Nystrom Eigenfunction Figure 3: A comparison of the separable eigenfunction approach and the Nystrom method. [sent-188, score-0.563]

61 The Nystrom method is based on computing the graph Laplacian on a set of sparse landmark points and fails in cases where the landmarks do not adequately summarize the density (left). [sent-190, score-0.318]

62 The separable eigenfunction approach fails when the density is far from a product form (right). [sent-191, score-0.63]

63 3, the eigenfunction approach assumes that the input distribution is separable over dimension. [sent-193, score-0.563]

64 The resulting labels span 386 distinct keywords in the Tiny Images dataset. [sent-218, score-0.243]

65 The training examples consist of t positive/negative pairs drawn from the remaining pool of 100 positive/negative images for each keyword. [sent-222, score-0.284]

66 For each class in turn, we use our scheme to propagate labels from the training examples to the test examples. [sent-223, score-0.328]

67 We also make use of the the training examples from keywords other than c by treating them as additional negative examples. [sent-225, score-0.252]

68 For example, if we have C = 16 keywords and t = 5 training pairs per keyword, then we have 5 positive training examples and (5+(16-1)*10)=155 negative training examples for each class. [sent-226, score-0.46]

69 Note that the propagation from labeled images to test images may go through the unlabeled images that are not even in the same class. [sent-228, score-0.57]

70 Our use of examples from other classes as negative examples is motivated by real problems, where training labels are spread over many keywords but relatively few labels are available per class. [sent-229, score-0.623]

71 In experiments using our eigenfunction approach, we compute a ﬁxed set of k=256 eigenfunctions on the entire 63,000 datapoints in the 64D space with ǫ = 0. [sent-230, score-1.035]

72 5(left), we compare our eigenfunction approach to a variety of alternative learning schemes. [sent-239, score-0.524]

73 Our eigenfunction approach outperforms other methods, particularly where relatively few training examples are available. [sent-241, score-0.629]

74 The SVM approach 6 badly over-ﬁts the data for small numbers of training examples, but catches up with the eigenfunction approach once 64+ve/1984-ve labeled examples are used. [sent-243, score-0.712]

75 The exact least-squares approach (f = (L + Λ)−1 ΛY ) achieves comparable results to the eigenfunction method, although it is far more expensive. [sent-245, score-0.524]

76 1) performs less well, being limited by the k = 256 eigenvectors used (as k is increased, the performance converges to the exact least-squares solution). [sent-247, score-0.267]

77 Neither of these methods scale to large image collections as the afﬁnity matrix W becomes too big and cannot be inverted or have its eigenvectors computed. [sent-248, score-0.435]

78 3 NN 0 1 2 3 4 5 6 Log2 number of +ve training examples/class (c) Without noisy labels 0. [sent-262, score-0.264]

79 25 −Inf (b) With noisy labels 0 1 2 3 5 8 10 15 20 40 60 100 # +ve training examples/class Mean precision at 15% recall averaged over 16 classes 0. [sent-268, score-0.312]

80 Our eigenfunction scheme (solid red) outperforms standard supervised methods (nearest-neighbors (green) and a Gaussian SVM (blue)) for small numbers of training pairs. [sent-271, score-0.669]

81 By using noisy labels in addition to the training pairs, the performance is boosted when few training examples are available (dashed red). [sent-273, score-0.369]

82 Right: (a): The performance of our eigenfunction approach as the number of training pairs per class and number of classes is varied. [sent-274, score-0.675]

83 Increasing the number of classes also aids performance since labeled examples from other classes can be used as negative examples. [sent-275, score-0.269]

84 (c): The performance of our approach (using no noisy labels) as the number of eigenfunctions is varied. [sent-279, score-0.495]

85 5(right)(a) we explore how our eigenfunction approach performs as the number of classes C is varied, for different numbers of training pairs t per class. [sent-281, score-0.675]

86 5(right)(c) shows the effect of varying the number of eigenfunctions k for C = 16 classes. [sent-284, score-0.424]

87 The performance is fairly stable above k = 128 eigenfunctions (i. [sent-285, score-0.424]

88 3 Leveraging noisy labels In the experiments above, only two types of data are used: labeled training examples and unlabeled test examples. [sent-289, score-0.487]

89 However, an additional source is the noisy labels from the Tiny Image dataset (the keyword used to query the image search engine). [sent-290, score-0.377]

90 These labels can easily be utilized by our framework: all 300 test examples for a class c are given a positive label with a small weight (λ/10), while the 300(C − 1) test examples from other classes are given negative label with the same small weight. [sent-291, score-0.448]

91 Note that these labels do not reveal any information about which of the 300 test images are true positives. [sent-292, score-0.269]

92 These noisy labels can provide a signiﬁcant performance gain when few training (clean) labels are available, as shown in Fig. [sent-293, score-0.401]

93 Indeed, when no training labels are available, just the noisy labels, our eigenfunction scheme still performs very well. [sent-297, score-0.831]

94 In summary, using 7 the eigenfunction approach with noisy labels, the performance obtained with a total of 32 labeled examples is comparable to the SVM trained with 64*16=512 labeled examples. [sent-300, score-0.81]

95 4 Experiments on Tiny Images dataset Our ﬁnal experiment applies the eigenfunction approach to the whole of the Tiny Images dataset (79,302,017 images). [sent-302, score-0.59]

96 We map the gist descriptor for each image down to a 32D space using PCA and precompute k = 64 eigenfunctions over the entire dataset. [sent-303, score-0.658]

97 Ranking from search engine Nearest Neighbor re-ranking Eigenfunction re-ranking Figure 6: Re-ranking images from 4 keywords in an 80 million image dataset, using 3 labeled pairs for each keyword. [sent-309, score-0.59]

98 From L to R, the columns show the original image order, results of nearest-neighbors and the results of our eigenfunction approach. [sent-311, score-0.611]

99 By regularizing the solution using eigenfunctions computed from all 80 million images, our semi-supervised scheme outperforms the purely supervised method. [sent-312, score-0.654]

100 80 million tiny images: a large database for nonparametric object and scene recognition. [sent-450, score-0.302]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('eigenfunction', 0.524), ('eigenfunctions', 0.424), ('eigenvectors', 0.267), ('laplacian', 0.161), ('nystrom', 0.147), ('lp', 0.142), ('labels', 0.137), ('images', 0.132), ('smoothness', 0.12), ('tiny', 0.119), ('gist', 0.113), ('cifar', 0.109), ('million', 0.108), ('keywords', 0.106), ('backbone', 0.105), ('graph', 0.093), ('unlabeled', 0.091), ('image', 0.087), ('datapoints', 0.087), ('labeled', 0.083), ('mi', 0.075), ('toy', 0.073), ('noisy', 0.071), ('density', 0.067), ('descriptors', 0.066), ('eigenvector', 0.066), ('lf', 0.066), ('gigantic', 0.065), ('internet', 0.063), ('label', 0.062), ('vs', 0.062), ('collections', 0.057), ('torralba', 0.057), ('xl', 0.057), ('training', 0.056), ('fergus', 0.055), ('sk', 0.054), ('billions', 0.053), ('wij', 0.052), ('eigenvalue', 0.051), ('landmark', 0.049), ('keyword', 0.049), ('examples', 0.049), ('classes', 0.048), ('points', 0.048), ('diagonal', 0.047), ('dimensions', 0.047), ('pairs', 0.047), ('supervised', 0.046), ('histogram', 0.045), ('diagonalizing', 0.044), ('krizhevsky', 0.044), ('lpk', 0.044), ('vijayanarasimhan', 0.044), ('scheme', 0.043), ('propagate', 0.043), ('negative', 0.041), ('rx', 0.041), ('millions', 0.04), ('separable', 0.039), ('pca', 0.038), ('coifman', 0.038), ('lafon', 0.038), ('nadler', 0.038), ('dx', 0.037), ('diffusion', 0.036), ('harmonic', 0.036), ('nity', 0.036), ('cvpr', 0.035), ('calculate', 0.035), ('annotation', 0.035), ('landmarks', 0.035), ('laplacians', 0.035), ('generalized', 0.035), ('descriptor', 0.034), ('operator', 0.034), ('solution', 0.033), ('dataset', 0.033), ('eigenvalues', 0.032), ('manifold', 0.032), ('ssl', 0.031), ('squares', 0.03), ('svm', 0.029), ('clean', 0.028), ('yl', 0.028), ('clusterings', 0.027), ('engine', 0.027), ('berg', 0.027), ('zhu', 0.027), ('adequately', 0.026), ('sd', 0.026), ('database', 0.026), ('object', 0.025), ('spectral', 0.025), ('schemes', 0.025), ('kumar', 0.025), ('scene', 0.024), ('matrix', 0.024), ('ijcv', 0.024), ('bins', 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

Author: Rob Fergus, Yair Weiss, Antonio Torralba

2 0.5259065 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases

Author: Kaushik Sinha, Mikhail Belkin

Abstract: We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. It turns out that when the data has clustered, that is, when the high density regions are sufﬁciently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combination of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for good classiﬁcation functions when the cluster assumption holds. By ﬁrst choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classiﬁer in the span of these eigenvectors, we obtain a classiﬁer, which has a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption. Experimental results on a number of real-world data-sets show that our method is competitive with the state of the art semi-supervised learning algorithms and outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis). 1

3 0.27609092 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou

Abstract: We study the behavior of the popular Laplacian Regularization method for SemiSupervised Learning at the regime of a ﬁxed number of labeled points but a large number of unlabeled points. We show that in Rd , d 2, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is ﬁxed and the number of unlabeled points goes to inﬁnity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually inﬁnite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the inﬁnite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the inﬁnite-unlabeled-data limit is also a necessary ﬁrst step to studying the convergence of the ﬁnite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain Ω ⊂ Rd with a smooth boundary. Let y : Ω → Y be the unknown function we wish to estimate. In case of regression Y = R whereas in binary classiﬁcation Y = {−1, 1}. The standard (transductive) semisupervised learning problem is formulated as follows: Given l labeled points, (x1 , y1 ), . . . , (xl , yl ), with yi = y(xi ), and u unlabeled points xl+1 , . . . , xl+u , with all points xi sampled i.i.d. from p(x), the goal is to construct an estimate of y(xl+i ) for any unlabeled point xl+i , utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is ﬁxed and u → ∞. 1 2 SSL with Graph Laplacian Regularization We ﬁrst consider the following graph-based approach formulated by Zhu et. al. [15]: y (x) = arg min In (y) ˆ subject to y(xi ) = yi , i = 1, . . . , l y where 1 n2 In (y) = Wi,j (y(xi ) − y(xj ))2 (1) (2) i,j is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W . This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a ﬁxed graph, over a ﬁnite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points xi ∈ Rd are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. We consider the common situation where the similarities are obtained by applying some decay ﬁlter to the distances: xi −xj σ Wi,j = G (3) where G : R+ → R+ is some function with an adequately fast decay. Popular choices are the 2 Gaussian ﬁlter G(z) = e−z /2 or the ǫ-neighborhood graph obtained by the step ﬁlter G(z) = 1z<1 . For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term In (y), e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): lim In (y) = I (σ) (y) = n→∞ G Ω Ω x−x′ σ (y(x) − y(x′ ))2 p(x)p(x′ )dxdx′ . (4) In the above limit, the bandwidth σ is held ﬁxed. Typically, one would also drive the bandwidth σ to zero as n → ∞. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well deﬁned limit of the smoothness regularization term In (y), at least when σ → 0 slowly enough1 , namely when σ = ω( d log n/n). If σ → 0 as n → ∞, and as long as nσ d / log n → ∞, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: d lim d+2 In (y) n→∞ Cσ (σ) d (y) d+2 I σ→0 Cσ = lim ∇y(x) 2 p(x)2 dx = J(y) = (5) Ω where C = Rd z 2 G( z )dz, and assuming 0 < C < ∞ (which is the case for both the Gaussian and the step ﬁlters). This energy functional J(f ) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions speciﬁed by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: y (x) = arg min J(y) ˆ subject to y(xi ) = yi , i = 1, . . . , l (6) y p When σ = o( d 1/n) then all non-diagonal weights Wi,j vanish (points no longer have any “close by” p neighbors). We are not aware of an analysis covering the regime where σ decays roughly as d 1/n, but would be surprised if a qualitatively different meaningful limit is reached. 1 2 3 Graph Laplacian Regularization in R1 We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and x ∈ R. We ﬁrst consider the situation where the support of p(x) is a continuous interval Ω = [a, b] ⊂ R (a and/or b may be inﬁnite). Without loss of generality, we assume the labeled data is sorted in increasing order a x1 < x2 < · · · < xl b. Applying the theory of variational calculus, the solution y (x) ˆ satisﬁes inside each interval (xi , xi+1 ) the Euler-Lagrange equation d dy p2 (x) = 0. dx dx Performing two integrations and enforcing the constraints at the labeled points yields y(x) = yi + x 1/p2 (t)dt xi (yi+1 xi+1 1/p2 (t)dt xi − yi ) for xi x xi+1 (7) with y(x) = x1 for a x x1 and y(x) = xl for xl x b. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: b 1 Theorem 1. Let p(x) be a smooth density on Ω = [a, b] ⊂ R such that Ap = 4 a 1/p2 (t)dt < ∞. 2 Then, J(f ) can be written as a squared semi-norm J(f ) = f Kp induced by the kernel x′ ′ Kp (x, x ) = Ap − 1 2 x with a null-space of all constant functions. That is, f the RKHS induced by Kp . 1 p2 (t) dt Kp . (8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. Furthermore, there is no dependence between points in disjoint components separated by zero density regions. 4 Graph Laplacian Regularization in Higher Dimensions The analysis of the previous section seems promising, at it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is not the case anymore. In the following theorem we show that the inﬁmum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points. Theorem 2. Let p(x) be a smooth density over Rd , d 2, bounded from above by some constant pmax , and let (x1 , y1 ), . . . , (xl , yl ) be any (non-repeating) set of labeled examples. There exist continuous functions yǫ (x), for any ǫ > 0, all satisfying the constraints yǫ (xj ) = yj , j = 1, . . . , l, such ǫ→0 ǫ→0 that J(yǫ ) −→ 0 but yǫ (x) −→ 0 for all x = xj , j = 1, . . . , l. Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the ﬁrst labeled point is at x0 = 0 with y(x0 ) = 0 and the second labeled point is at x1 with x1 = 1 and y(x1 ) = 1. In addition, we assume that the ball B1 (0) of radius one centered around the origin is contained in Ω = {x ∈ Rd | p(x) > 0}. We ﬁrst consider the case d > 2. Here, for any ǫ > 0, consider the function x ǫ yǫ (x) = min ,1 which indeed satisﬁes the two constraints yǫ (xi ) = yi , i = 0, 1. Then, J(yǫ ) = Bǫ (0) p2 (x) dx ǫ2 pmax ǫ2 dx = p2 Vd ǫd−2 max (13) Bǫ (0) where Vd is the volume of a unit ball in Rd . Hence, the sequence of functions yǫ (x) satisfy the constraints, but for d > 2, inf ǫ J(yǫ ) = 0. For d = 2, a more extreme example is necessary: consider the functions 2 x yǫ (x) = log +ǫ ǫ log 1+ǫ ǫ for x 1 and yǫ (x) = 1 for x > 1. These functions satisfy the two constraints yǫ (xi ) = yi , i = 0, 1 and: J(yǫ ) = 4 h “ ”i 1+ǫ 2 log ǫ 4πp2 max h “ ”i 1+ǫ 2 log ǫ x B1 (0) log ( x 1+ǫ ǫ 2 2 +ǫ)2 p2 (x)dx 4p2 h “ max ”i2 1+ǫ log ǫ 4πp2 max ǫ→0 = −→ 0. log 1+ǫ ǫ 4 1 0 r2 (r 2 +ǫ)2 2πrdr The implication of Theorem 2 is that regardless of the values at the labeled points, as u → ∞, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel Kp , as in the onedimensional case, is not possible. Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in l 1 y = arg min ˆ (y(xj ) − yj )2 + γIn (y) y(x) l j=1 In the limit of inﬁnite unlabeled data, functions of the form yǫ (x) above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term J(y) to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points. 4.1 Numerical Example We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density p(x) in R2 , which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at (4, 0). We sample a total of n = 3000 points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian ﬁlter with σ = 0.4. Figure 1 depicts the predictor y (x) obtained from (1). In fact, two different predictors are shown, ˆ obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution y (x) of (1) satisﬁes: ˆ n y (xi ) = ˆ n Wij y (xj ) / ˆ j=1 Wij on all unlabeled points i = l + 1, . . . , l + u. (14) j=1 Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. Alternatively, (14) can be viewed as an update equation for y (xi ), ˆ which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on x1 , . . . , xl . This is the method used in the bottom panels of Figure 1. As predicted, y (x) is almost constant for almost all unlabeled points. Although all values are very ˆ close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very ﬂat predictor is still problematic even from a classiﬁcation perspective. First, it is not possible to obtain a meaningful conﬁdence measure for particular labels. Second, especially if the size of each class is not known apriori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%. The differences between the two numerical methods for solving (1) also point out to another problem with the ill-posedness of the limit problem: the solution is numerically very un-stable. A more quantitative evaluation, that also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth σ, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error, and the classiﬁcation error of the resulting predictor y (x). We plot the classiﬁcation error both when a threshold ˆ of zero is used (i.e. the class is determined by sign(ˆ(x))) and with the ideal threshold minimizing y the test error. For each unlabeled sample size, we choose the bandwidth σ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a ﬂat predictor. Using a threshold of zero leads to an increase in the classiﬁcation error, possibly due to numerical instability. Interestingly, although the predictors become very ﬂat, the classiﬁcation error using the ideal threshold actually improves slightly. Note that 5 DIRECT INVERSION SQUARED ERROR SIGN ERROR: 45% OPTIMAL BANDWIDTH 1 0.9 1 5 0 4 2 0.85 y(x) > 0 y(x) < 0 6 0.95 10 0 0 −1 10 0 200 400 600 800 0−1 ERROR (THRESHOLD=0) 0.32 −5 10 0 5 −10 0 −10 −5 −5 0 5 10 10 1 0 0 200 400 600 800 OPTIMAL BANDWIDTH 0.5 0 0 200 400 600 800 0−1 ERROR (IDEAL THRESHOLD) 0.19 5 200 400 600 800 OPTIMAL BANDWIDTH 1 0.28 SIGN ERR: 17.1 0.3 0.26 POWER METHOD 0 1.5 8 0 0.18 −1 10 6 0.17 4 −5 10 0 5 −10 0 −5 −10 −5 0 5 10 Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classiﬁcation according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations - the solution has not yet converged, despite small residuals of the order of 2 · 10−4 . 0.16 0 200 400 600 800 2 0 200 400 600 800 Figure 2: Squared error (top), classiﬁcation error with a threshold of zero (center) and minimal classiﬁcation error using ideal threhold (bottom), of the minimizer of (1) as a function of number of unlabeled points. For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted. ideal classiﬁcation performance is achieved with a signiﬁcantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even ﬂatter. 4.2 Probabilistic Interpretation, Exit and Hitting Times As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let x(t) denote a random walk on the graph with transition matrix M = D−1 W where D is a diagonal matrix with Dii = j Wij . Then, for the binary classiﬁcation case with yi = ±1 we have [15]: y (xi ) = 2 Pr x(t) hits a point labeled +1 before hitting a point labeled -1 x(0) = xi − 1 ˆ We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The ﬁrst is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and σ → 0, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a ﬁnite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to inﬁnity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location xi , yielding a ﬂat predictor. 5 Keeping σ Finite At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit σ → 0. One possible objection is that there is an intrinsic characteristic scale for the data σ0 where (with high probability) all points at a distance xi − xj < σ0 have the same label. If this is the case, then it may not necessarily make sense to take values of σ < σ0 in constructing W . However, keeping σ ﬁnite while taking the number of unlabeled points to inﬁnity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function y(x) which is zero everywhere except at the labeled points, where y(xj ) = yj . With a ﬁnite number of labeled points of measure zero, I (σ) (y) = 0 in any dimension 6 50 points 500 points 3500 points 1 1 0.5 0.5 0.5 0 0 0 −0.5 y 1 −0.5 −0.5 −1 −2 0 2 4 6 −1 −2 0 2 4 6 −1 −2 0 2 4 6 x Figure 3: Minimizer of (1) for a 1-d problem with a ﬁxed σ = 0.4, two labeled points and an increasing number of unlabeled points. and for any ﬁxed σ > 0. While this limiting function is discontinuous, it is also possible to construct ǫ→0 a sequence of continuous functions yǫ that all satisfy the constraints and for which I (σ) (yǫ ) −→ 0. This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at x = 4, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a ﬁxed σ, although the solution is reasonable when the number of unlabeled points is small, it becomes ﬂatter, with sharp spikes on the labeled points, as u → ∞. 6 Fourier-Eigenvector Based Methods Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3]. Instead of using the Laplacian as a regularizer, constraining candidate predictors y(x) non-parametrically to those with small In (y) values, here the predictors are constrained to the low-dimensional space spanned by the ﬁrst few eigenvectors of the Laplacian: The similarity matrix W is computed as before, and the Graph Laplacian matrix L = D − W is considered (recall D is a diagonal matrix with Dii = j Wij ). Only predictors p j=1 aj ej y (x) = ˆ (15) spanned by the ﬁrst p eigenvectors e1 , . . . , ep of L (with smallest eigenvalues) are considered. The coefﬁcients aj are chosen by minimizing a loss function on the labeled data, e.g. the squared loss: (ˆ1 , . . . , ap ) = arg min a ˆ l j=1 (yj − y (xj ))2 . ˆ (16) Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit u → ∞. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4]. Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classiﬁcation function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions. Indeed, under this assumption and with sufﬁcient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any signiﬁcant cluster structure, the success of eigenvector based methods critically depends on how well can the unknown classiﬁcation function be approximated by a truncated expansion with relatively few eigenvectors. We illustrate this issue with the following three-dimensional example: Let p(x) denote the uniform density in the box [0, 1] × [0, 0.8] × [0, 0.6], where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, y1 (x) = 1x1 >0.5 , y2 (x) = 1x1 >x2 /0.8 and y3 (x) = 1x2 /0.8>x3 /0.6 . Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors p is set to p = l/5 as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of p selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method). 7 Prediction Error (%) p = #labeled points/5 40 optimal p 20 labeled points 40 Approx. Error 50 20 20 0 20 20 40 60 # labeled points 0 10 20 40 60 # labeled points 0 0 5 10 15 # eigenvectors 0 0 5 10 15 # eigenvectors Figure 4: Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: best ﬁt when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors. The reason for this behavior is that y2 (x) and even more so y3 (x) cannot be as easily approximated by the very few leading eigenfunctions—even though they seem “simple” and “smooth”, they are signiﬁcantly more complicated than y1 (x) in terms of measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form ψi,j,k (x) = cos(iπx1 ) cos(jπx2 /0.8) cos(kπx3 /0.6), making it hard to represent simple decision boundaries which are not axis-aligned. 7 Discussion Our results show that a popular SSL method, the Laplacian Regularization method (1), is not wellbehaved in the limit of inﬁnite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons. First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the number of both labeled and unlabeled points grow while l/u is ﬁxed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not ﬁnd this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance. Second, in our experiments we noticed that although the predictor y (x) becomes extremely ﬂat, in ˆ binary tasks, it is still typically possible to ﬁnd a threshold leading to a good classiﬁcation performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it. Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with ﬂat predictors, and perhaps also make it appropriate for regression. The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the ﬁrst order gradient is not a sufﬁcient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least d+1 derivatives in Rd . In the 2 context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the d+1 graph Laplacian matrix L = D − W , where D is the diagonal degree matrix, with L 2 (matrix to d+1 the 2 power). In the inﬁnite unlabeled data limit, this corresponds to regularizing all order- d+1 2 (mixed) partial derivatives. In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather then ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL. A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the “smoothness” of y(x) relative to p(x) remain unclear. Acknowledgments The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06). 8 References [1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975. [2] A. Azran, The rendevous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007. [3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classiﬁcation, NIPS, vol. 15, 2003. [4] M. Belkin and P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS 19, 2007. [5] M. Belkin, P. Niyogi and S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006. [6] Y. Bengio, O. Delalleau, N. Le Roux, label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Scholkopf and Zien, editors, MIT Press, 2006. [7] O. Bosquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004. [8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006. [9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008. [10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008. [11] M. Meila, J. Shi. A random walks view of spectral segmentation, AI and Statistics, 2001. [12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006. [13] B. Sch¨ lkopf, A. Smola, Learning with Kernels, MIT Press, 2002. o [14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Sch¨ lkopf, Learning with local and global consistency, o NIPS, vol. 16, 2004. [15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian ﬁelds and harmonic functions, ICML, 2003. 9

4 0.13114026 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions

Author: Matthias Bethge, Eero P. Simoncelli, Fabian H. Sinz

Abstract: We introduce a new family of distributions, called Lp -nested symmetric distributions, whose densities are expressed in terms of a hierarchical cascade of Lp norms. This class generalizes the family of spherically and Lp -spherically symmetric distributions which have recently been successfully used for natural image modeling. Similar to those distributions it allows for a nonlinear mechanism to reduce the dependencies between its variables. With suitable choices of the parameters and norms, this family includes the Independent Subspace Analysis (ISA) model as a special case, which has been proposed as a means of deriving ﬁlters that mimic complex cells found in mammalian primary visual cortex. Lp -nested distributions are relatively easy to estimate and allow us to explore the variety of models between ISA and the Lp -spherically symmetric models. By ﬁtting the generalized Lp -nested model to 8 × 8 image patches, we show that the subspaces obtained from ISA are in fact more dependent than the individual ﬁlter coefﬁcients within a subspace. When ﬁrst applying contrast gain control as preprocessing, however, there are no dependencies left that could be exploited by ISA. This suggests that complex cell modeling can only be useful for redundancy reduction in larger image patches. 1

5 0.12178318 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

Author: Kwang I. Kim, Florian Steinke, Matthias Hein

Abstract: Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression which overcomes both these problems. If the data lies on or close to a low-dimensional submanifold in feature space, the Hessian energy prefers functions whose values vary linearly with respect to geodesic distance. We ﬁrst derive the Hessian energy for smooth manifolds and continue to give a stable estimation procedure for the common case where only samples of the underlying manifold are given. The preference of ‘’linear” functions on manifolds renders the Hessian energy particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to ﬁnd a user-deﬁned embedding function given some labeled points which varies smoothly (and ideally linearly) along the manifold. The experimental results suggest superior performance of our method compared with semi-supervised regression using Laplacian regularization or standard supervised regression techniques applied to this task. 1

6 0.11140862 211 nips-2009-Segmenting Scenes by Matching Image Composites

7 0.10581928 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

8 0.1004272 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

9 0.098694667 122 nips-2009-Label Selection on Graphs

10 0.093742706 137 nips-2009-Learning transport operators for image manifolds

11 0.08313477 31 nips-2009-An LP View of the M-best MAP problem

12 0.077471599 142 nips-2009-Locality-sensitive binary codes from shift-invariant kernels

13 0.076951563 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise

14 0.076691464 104 nips-2009-Group Sparse Coding

15 0.074515529 157 nips-2009-Multi-Label Prediction via Compressed Sensing

16 0.073440664 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

17 0.072949074 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine

18 0.072201774 257 nips-2009-White Functionals for Anomaly Detection in Dynamical Systems

19 0.071952239 251 nips-2009-Unsupervised Detection of Regions of Interest Using Iterative Link Analysis

20 0.071108222 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.245), (1, 0.021), (2, -0.179), (3, 0.134), (4, -0.167), (5, 0.064), (6, -0.022), (7, 0.118), (8, -0.046), (9, 0.116), (10, -0.228), (11, 0.303), (12, -0.054), (13, -0.264), (14, -0.141), (15, -0.058), (16, -0.14), (17, 0.094), (18, -0.008), (19, 0.173), (20, -0.063), (21, 0.021), (22, -0.047), (23, -0.004), (24, -0.053), (25, -0.041), (26, -0.154), (27, -0.049), (28, 0.096), (29, -0.075), (30, -0.074), (31, 0.107), (32, -0.084), (33, -0.075), (34, 0.079), (35, 0.096), (36, -0.002), (37, -0.132), (38, -0.037), (39, -0.043), (40, -0.015), (41, -0.011), (42, -0.094), (43, -0.054), (44, -0.02), (45, 0.036), (46, 0.039), (47, -0.074), (48, 0.088), (49, -0.002)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93615001 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

Author: Rob Fergus, Yair Weiss, Antonio Torralba

2 0.8834843 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases

Author: Kaushik Sinha, Mikhail Belkin

3 0.68614411 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou

4 0.53929055 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine

Author: Zenglin Xu, Rong Jin, Jianke Zhu, Irwin King, Michael Lyu, Zhirong Yang

Abstract: We discuss the framework of Transductive Support Vector Machine (TSVM) from the perspective of the regularization strength induced by the unlabeled data. In this framework, SVM and TSVM can be regarded as a learning machine without regularization and one with full regularization from the unlabeled data, respectively. Therefore, to supplement this framework of the regularization strength, it is necessary to introduce data-dependant partial regularization. To this end, we reformulate TSVM into a form with controllable regularization strength, which includes SVM and TSVM as special cases. Furthermore, we introduce a method of adaptive regularization that is data dependant and is based on the smoothness assumption. Experiments on a set of benchmark data sets indicate the promising results of the proposed work compared with state-of-the-art TSVM algorithms. 1

5 0.41899693 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields

Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori

Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random ﬁelds that deﬁnes a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label conﬁgurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classiﬁcation and sequence labeling problems. 1

6 0.40163201 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

7 0.39579532 258 nips-2009-Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise

8 0.38022918 93 nips-2009-Fast Image Deconvolution using Hyper-Laplacian Priors

9 0.36761498 137 nips-2009-Learning transport operators for image manifolds

10 0.35373425 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

11 0.35074079 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization

12 0.34176421 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions

13 0.33968416 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

14 0.32261655 2 nips-2009-3D Object Recognition with Deep Belief Nets

15 0.31751776 149 nips-2009-Maximin affinity learning of image segmentation

16 0.31295946 122 nips-2009-Label Selection on Graphs

17 0.30818275 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

18 0.30311552 58 nips-2009-Constructing Topological Maps using Markov Random Fields and Loop-Closure Detection

19 0.29570591 120 nips-2009-Kernels and learning curves for Gaussian process regression on random graphs

20 0.2929866 5 nips-2009-A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(21, 0.023), (24, 0.065), (25, 0.084), (31, 0.23), (35, 0.06), (36, 0.089), (39, 0.07), (50, 0.02), (58, 0.078), (61, 0.01), (71, 0.043), (81, 0.015), (86, 0.13)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.95626384 183 nips-2009-Optimal context separation of spiking haptic signals by second-order somatosensory neurons

Author: Romain Brasselet, Roland Johansson, Angelo Arleo

Abstract: We study an encoding/decoding mechanism accounting for the relative spike timing of the signals propagating from peripheral nerve ﬁbers to second-order somatosensory neurons in the cuneate nucleus (CN). The CN is modeled as a population of spiking neurons receiving as inputs the spatiotemporal responses of real mechanoreceptors obtained via microneurography recordings in humans. The efﬁciency of the haptic discrimination process is quantiﬁed by a novel deﬁnition of entropy that takes into full account the metrical properties of the spike train space. This measure proves to be a suitable decoding scheme for generalizing the classical Shannon entropy to spike-based neural codes. It permits an assessment of neurotransmission in the presence of a large output space (i.e. hundreds of spike trains) with 1 ms temporal precision. It is shown that the CN population code performs a complete discrimination of 81 distinct stimuli already within 35 ms of the ﬁrst afferent spike, whereas a partial discrimination (80% of the maximum information transmission) is possible as rapidly as 15 ms. This study suggests that the CN may not constitute a mere synaptic relay along the somatosensory pathway but, rather, it may convey optimal contextual accounts (in terms of fast and reliable information transfer) of peripheral tactile inputs to downstream structures of the central nervous system. 1

2 0.90186518 261 nips-2009-fMRI-Based Inter-Subject Cortical Alignment Using Functional Connectivity

Author: Bryan Conroy, Ben Singer, James Haxby, Peter J. Ramadge

Abstract: The inter-subject alignment of functional MRI (fMRI) data is important for improving the statistical power of fMRI group analyses. In contrast to existing anatomically-based methods, we propose a novel multi-subject algorithm that derives a functional correspondence by aligning spatial patterns of functional connectivity across a set of subjects. We test our method on fMRI data collected during a movie viewing experiment. By cross-validating the results of our algorithm, we show that the correspondence successfully generalizes to a secondary movie dataset not used to derive the alignment. 1

3 0.88157415 256 nips-2009-Which graphical models are difficult to learn?

Author: Andrea Montanari, Jose A. Pereira

Abstract: We consider the problem of learning the structure of Ising models (pairwise binary Markov random ﬁelds) from i.i.d. samples. While several methods have been proposed to accomplish this task, their relative merits and limitations remain somewhat obscure. By analyzing a number of concrete examples, we show that low-complexity algorithms systematically fail when the Markov random ﬁeld develops long-range correlations. More precisely, this phenomenon appears to be related to the Ising model phase transition (although it does not coincide with it). 1 Introduction and main results Given a graph G = (V = [p], E), and a positive parameter θ > 0 the ferromagnetic Ising model on G is the pairwise Markov random ﬁeld µG,θ (x) = 1 ZG,θ eθxi xj (1) (i,j)∈E over binary variables x = (x1 , x2 , . . . , xp ). Apart from being one of the most studied models in statistical mechanics, the Ising model is a prototypical undirected graphical model, with applications in computer vision, clustering and spatial statistics. Its obvious generalization to edge-dependent parameters θij , (i, j) ∈ E is of interest as well, and will be introduced in Section 1.2.2. (Let us stress that we follow the statistical mechanics convention of calling (1) an Ising model for any graph G.) In this paper we study the following structural learning problem: Given n i.i.d. samples x(1) , x(2) ,. . . , x(n) with distribution µG,θ ( · ), reconstruct the graph G. For the sake of simplicity, we assume that the parameter θ is known, and that G has no double edges (it is a ‘simple’ graph). The graph learning problem is solvable with unbounded sample complexity, and computational resources [1]. The question we address is: for which classes of graphs and values of the parameter θ is the problem solvable under appropriate complexity constraints? More precisely, given an algorithm Alg, a graph G, a value θ of the model parameter, and a small δ > 0, the sample complexity is deﬁned as nAlg (G, θ) ≡ inf n ∈ N : Pn,G,θ {Alg(x(1) , . . . , x(n) ) = G} ≥ 1 − δ , (2) where Pn,G,θ denotes probability with respect to n i.i.d. samples with distribution µG,θ . Further, we let χAlg (G, θ) denote the number of operations of the algorithm Alg, when run on nAlg (G, θ) samples.1 1 For the algorithms analyzed in this paper, the behavior of nAlg and χAlg does not change signiﬁcantly if we require only ‘approximate’ reconstruction (e.g. in graph distance). 1 The general problem is therefore to characterize the functions nAlg (G, θ) and χAlg (G, θ), in particular for an optimal choice of the algorithm. General bounds on nAlg (G, θ) have been given in [2, 3], under the assumption of unbounded computational resources. A general charactrization of how well low complexity algorithms can perform is therefore lacking. Although we cannot prove such a general characterization, in this paper we estimate nAlg and χAlg for a number of graph models, as a function of θ, and unveil a fascinating universal pattern: when the model (1) develops long range correlations, low-complexity algorithms fail. Under the Ising model, the variables {xi }i∈V become strongly correlated for θ large. For a large class of graphs with degree bounded by ∆, this phenomenon corresponds to a phase transition beyond some critical value of θ uniformly bounded in p, with typically θcrit ≤ const./∆. In the examples discussed below, the failure of low-complexity algorithms appears to be related to this phase transition (although it does not coincide with it). 1.1 A toy example: the thresholding algorithm In order to illustrate the interplay between graph structure, sample complexity and interaction strength θ, it is instructive to consider a warmup example. The thresholding algorithm reconstructs G by thresholding the empirical correlations Cij ≡ 1 n n (ℓ) (ℓ) xi xj for i, j ∈ V . ℓ=1 (3) T HRESHOLDING( samples {x(ℓ) }, threshold τ ) 1: Compute the empirical correlations {Cij }(i,j)∈V ×V ; 2: For each (i, j) ∈ V × V 3: If Cij ≥ τ , set (i, j) ∈ E; We will denote this algorithm by Thr(τ ). Notice that its complexity is dominated by the computation of the empirical correlations, i.e. χThr(τ ) = O(p2 n). The sample complexity nThr(τ ) can be bounded for speciﬁc classes of graphs as follows (the proofs are straightforward and omitted from this paper). Theorem 1.1. If G has maximum degree ∆ > 1 and if θ < atanh(1/(2∆)) then there exists τ = τ (θ) such that 2p 8 log nThr(τ ) (G, θ) ≤ . (4) 1 δ (tanh θ − 2∆ )2 Further, the choice τ (θ) = (tanh θ + (1/2∆))/2 achieves this bound. Theorem 1.2. There exists a numerical constant K such that the following is true. If ∆ > 3 and θ > K/∆, there are graphs of bounded degree ∆ such that for any τ , nThr(τ ) = ∞, i.e. the thresholding algorithm always fails with high probability. These results conﬁrm the idea that the failure of low-complexity algorithms is related to long-range correlations in the underlying graphical model. If the graph G is a tree, then correlations between far apart variables xi , xj decay exponentially with the distance between vertices i, j. The same happens on bounded-degree graphs if θ ≤ const./∆. However, for θ > const./∆, there exists families of bounded degree graphs with long-range correlations. 1.2 More sophisticated algorithms In this section we characterize χAlg (G, θ) and nAlg (G, θ) for more advanced algorithms. We again obtain very distinct behaviors of these algorithms depending on long range correlations. Due to space limitations, we focus on two type of algorithms and only outline the proof of our most challenging result, namely Theorem 1.6. In the following we denote by ∂i the neighborhood of a node i ∈ G (i ∈ ∂i), and assume the degree / to be bounded: |∂i| ≤ ∆. 1.2.1 Local Independence Test A recurring approach to structural learning consists in exploiting the conditional independence structure encoded by the graph [1, 4, 5, 6]. 2 Let us consider, to be deﬁnite, the approach of [4], specializing it to the model (1). Fix a vertex r, whose neighborhood we want to reconstruct, and consider the conditional distribution of xr given its neighbors2 : µG,θ (xr |x∂r ). Any change of xi , i ∈ ∂r, produces a change in this distribution which is bounded away from 0. Let U be a candidate neighborhood, and assume U ⊆ ∂r. Then changing the value of xj , j ∈ U will produce a noticeable change in the marginal of Xr , even if we condition on the remaining values in U and in any W , |W | ≤ ∆. On the other hand, if U ∂r, then it is possible to ﬁnd W (with |W | ≤ ∆) and a node i ∈ U such that, changing its value after ﬁxing all other values in U ∪ W will produce no noticeable change in the conditional marginal. (Just choose i ∈ U \∂r and W = ∂r\U ). This procedure allows us to distinguish subsets of ∂r from other sets of vertices, thus motivating the following algorithm. L OCAL I NDEPENDENCE T EST( samples {x(ℓ) }, thresholds (ǫ, γ) ) 1: Select a node r ∈ V ; 2: Set as its neighborhood the largest candidate neighbor U of size at most ∆ for which the score function S CORE(U ) > ǫ/2; 3: Repeat for all nodes r ∈ V ; The score function S CORE( · ) depends on ({x(ℓ) }, ∆, γ) and is deﬁned as follows, min max W,j xi ,xW ,xU ,xj |Pn,G,θ {Xi = xi |X W = xW , X U = xU }− Pn,G,θ {Xi = xi |X W = xW , X U \j = xU \j , Xj = xj }| . (5) In the minimum, |W | ≤ ∆ and j ∈ U . In the maximum, the values must be such that Pn,G,θ {X W = xW , X U = xU } > γ/2, Pn,G,θ {X W = xW , X U \j = xU \j , Xj = xj } > γ/2 Pn,G,θ is the empirical distribution calculated from the samples {x(ℓ) }. We denote this algorithm by Ind(ǫ, γ). The search over candidate neighbors U , the search for minima and maxima in the computation of the S CORE(U ) and the computation of Pn,G,θ all contribute for χInd (G, θ). Both theorems that follow are consequences of the analysis of [4]. Theorem 1.3. Let G be a graph of bounded degree ∆ ≥ 1. For every θ there exists (ǫ, γ), and a numerical constant K, such that 2p 100∆ , χInd(ǫ,γ) (G, θ) ≤ K (2p)2∆+1 log p . nInd(ǫ,γ) (G, θ) ≤ 2 4 log ǫ γ δ More speciﬁcally, one can take ǫ = 1 4 sinh(2θ), γ = e−4∆θ 2−2∆ . This ﬁrst result implies in particular that G can be reconstructed with polynomial complexity for any bounded ∆. However, the degree of such polynomial is pretty high and non-uniform in ∆. This makes the above approach impractical. A way out was proposed in [4]. The idea is to identify a set of ‘potential neighbors’ of vertex r via thresholding: B(r) = {i ∈ V : Cri > κ/2} , (6) For each node r ∈ V , we evaluate S CORE(U ) by restricting the minimum in Eq. (5) over W ⊆ B(r), and search only over U ⊆ B(r). We call this algorithm IndD(ǫ, γ, κ). The basic intuition here is that Cri decreases rapidly with the graph distance between vertices r and i. As mentioned above, this is true at small θ. Theorem 1.4. Let G be a graph of bounded degree ∆ ≥ 1. Assume that θ < K/∆ for some small enough constant K. Then there exists ǫ, γ, κ such that nIndD(ǫ,γ,κ) (G, θ) ≤ 8(κ2 + 8∆ ) log 4p , δ χIndD(ǫ,γ,κ) (G, θ) ≤ K ′ p∆∆ More speciﬁcally, we can take κ = tanh θ, ǫ = 1 4 log(4/κ) α + K ′ ∆p2 log p . sinh(2θ) and γ = e−4∆θ 2−2∆ . 2 If a is a vector and R is a set of indices then we denote by aR the vector formed by the components of a with index in R. 3 1.2.2 Regularized Pseudo-Likelihoods A different approach to the learning problem consists in maximizing an appropriate empirical likelihood function [7, 8, 9, 10, 13]. To control the ﬂuctuations caused by the limited number of samples, and select sparse graphs a regularization term is often added [7, 8, 9, 10, 11, 12, 13]. As a speciﬁc low complexity implementation of this idea, we consider the ℓ1 -regularized pseudolikelihood method of [7]. For each node r, the following likelihood function is considered L(θ; {x(ℓ) }) = − 1 n n (ℓ) ℓ=1 log Pn,G,θ (x(ℓ) |x\r ) r (7) where x\r = xV \r = {xi : i ∈ V \ r} is the vector of all variables except xr and Pn,G,θ is deﬁned from the following extension of (1), µG,θ (x) = 1 ZG,θ eθij xi xj (8) i,j∈V / where θ = {θij }i,j∈V is a vector of real parameters. Model (1) corresponds to θij = 0, ∀(i, j) ∈ E and θij = θ, ∀(i, j) ∈ E. The function L(θ; {x(ℓ) }) depends only on θr,· = {θrj , j ∈ ∂r} and is used to estimate the neighborhood of each node by the following algorithm, Rlr(λ), R EGULARIZED L OGISTIC R EGRESSION( samples {x(ℓ) }, regularization (λ)) 1: Select a node r ∈ V ; 2: Calculate ˆr,· = arg min {L(θr,· ; {x(ℓ) }) + λ||θr,· ||1 }; θ θ r,· ∈Rp−1 3: ˆ If θrj > 0, set (r, j) ∈ E; Our ﬁrst result shows that Rlr(λ) indeed reconstructs G if θ is sufﬁciently small. Theorem 1.5. There exists numerical constants K1 , K2 , K3 , such that the following is true. Let G be a graph with degree bounded by ∆ ≥ 3. If θ ≤ K1 /∆, then there exist λ such that nRlr(λ) (G, θ) ≤ K2 θ−2 ∆ log 8p2 . δ (9) Further, the above holds with λ = K3 θ ∆−1/2 . This theorem is proved by noting that for θ ≤ K1 /∆ correlations decay exponentially, which makes all conditions in Theorem 1 of [7] (denoted there by A1 and A2) hold, and then computing the probability of success as a function of n, while strenghtening the error bounds of [7]. In order to prove a converse to the above result, we need to make some assumptions on λ. Given θ > 0, we say that λ is ‘reasonable for that value of θ if the following conditions old: (i) Rlr(λ) is successful with probability larger than 1/2 on any star graph (a graph composed by a vertex r connected to ∆ neighbors, plus isolated vertices); (ii) λ ≤ δ(n) for some sequence δ(n) ↓ 0. Theorem 1.6. There exists a numerical constant K such that the following happens. If ∆ > 3, θ > K/∆, then there exists graphs G of degree bounded by ∆ such that for all reasonable λ, nRlr(λ) (G) = ∞, i.e. regularized logistic regression fails with high probability. The graphs for which regularized logistic regression fails are not contrived examples. Indeed we will prove that the claim in the last theorem holds with high probability when G is a uniformly random graph of regular degree ∆. The proof Theorem 1.6 is based on showing that an appropriate incoherence condition is necessary for Rlr to successfully reconstruct G. The analogous result was proven in [14] for model selection using the Lasso. In this paper we show that such a condition is also necessary when the underlying model is an Ising model. Notice that, given the graph G, checking the incoherence condition is NP-hard for general (non-ferromagnetic) Ising model, and requires signiﬁcant computational effort 4 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 20 15 λ0 10 5 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.2 1 0.8 0.6 Psucc 0.4 0.2 1 θ 0 0 0.2 0.4 0.6 θ θcrit 0.8 1 Figure 1: Learning random subgraphs of a 7 × 7 (p = 49) two-dimensional grid from n = 4500 Ising models samples, using regularized logistic regression. Left: success probability as a function of the model parameter θ and of the regularization parameter λ0 (darker corresponds to highest probability). Right: the same data plotted for several choices of λ versus θ. The vertical line corresponds to the model critical temperature. The thick line is an envelope of the curves obtained for different λ, and should correspond to optimal regularization. even in the ferromagnetic case. Hence the incoherence condition does not provide, by itself, a clear picture of which graph structure are difﬁcult to learn. We will instead show how to evaluate it on speciﬁc graph families. Under the restriction λ → 0 the solutions given by Rlr converge to θ∗ with n [7]. Thus, for large n we can expand L around θ∗ to second order in (θ − θ∗ ). When we add the regularization term to L we obtain a quadratic model analogous the Lasso plus the error term due to the quadratic approximation. It is thus not surprising that, when λ → 0 the incoherence condition introduced for the Lasso in [14] is also relevant for the Ising model. 2 Numerical experiments In order to explore the practical relevance of the above results, we carried out extensive numerical simulations using the regularized logistic regression algorithm Rlr(λ). Among other learning algorithms, Rlr(λ) strikes a good balance of complexity and performance. Samples from the Ising model (1) where generated using Gibbs sampling (a.k.a. Glauber dynamics). Mixing time can be very large for θ ≥ θcrit , and was estimated using the time required for the overall bias to change sign (this is a quite conservative estimate at low temperature). Generating the samples {x(ℓ) } was indeed the bulk of our computational effort and took about 50 days CPU time on Pentium Dual Core processors (we show here only part of these data). Notice that Rlr(λ) had been tested in [7] only on tree graphs G, or in the weakly coupled regime θ < θcrit . In these cases sampling from the Ising model is easy, but structural learning is also intrinsically easier. Figure reports the success probability of Rlr(λ) when applied to random subgraphs of a 7 × 7 two-dimensional grid. Each such graphs was obtained by removing each edge independently with probability ρ = 0.3. Success probability was estimated by applying Rlr(λ) to each vertex of 8 graphs (thus averaging over 392 runs of Rlr(λ)), using n = 4500 samples. We scaled the regularization parameter as λ = 2λ0 θ(log p/n)1/2 (this choice is motivated by the algorithm analysis and is empirically the most satisfactory), and searched over λ0 . The data clearly illustrate the phenomenon discussed. Despite the large number of samples n ≫ log p, when θ crosses a threshold, the algorithm starts performing poorly irrespective of λ. Intriguingly, this threshold is not far from the critical point of the Ising model on a randomly diluted grid θcrit (ρ = 0.3) ≈ 0.7 [15, 16]. 5 1.2 1.2 θ = 0.35, 0.40 1 1 θ = 0.25 θ = 0.20 0.8 0.8 θ = 0.45 θ = 0.10 0.6 0.6 Psucc Psucc 0.4 0.4 θ = 0.50 0.2 0.2 θthr θ = 0.65, 0.60, 0.55 0 0 0 2000 4000 6000 8000 10000 0 0.1 0.2 n 0.3 0.4 0.5 0.6 0.7 0.8 θ Figure 2: Learning uniformly random graphs of degree 4 from Ising models samples, using Rlr. Left: success probability as a function of the number of samples n for several values of θ. Right: the same data plotted for several choices of λ versus θ as in Fig. 1, right panel. Figure 2 presents similar data when G is a uniformly random graph of degree ∆ = 4, over p = 50 vertices. The evolution of the success probability with n clearly shows a dichotomy. When θ is below a threshold, a small number of samples is sufﬁcient to reconstruct G with high probability. Above the threshold even n = 104 samples are to few. In this case we can predict the threshold analytically, cf. Lemma 3.3 below, and get θthr (∆ = 4) ≈ 0.4203, which compares favorably with the data. 3 Proofs In order to prove Theorem 1.6, we need a few auxiliary results. It is convenient to introduce some notations. If M is a matrix and R, P are index sets then MR P denotes the submatrix with row indices in R and column indices in P . As above, we let r be the vertex whose neighborhood we are trying to reconstruct and deﬁne S = ∂r, S c = V \ ∂r ∪ r. Since the cost function L(θ; {x(ℓ) }) + λ||θ||1 only depend on θ through its components θr,· = {θrj }, we will hereafter neglect all the other parameters and write θ as a shorthand of θr,· . Let z ∗ be a subgradient of ||θ||1 evaluated at the true parameters values, θ∗ = {θrj : θij = 0, ∀j ∈ ˆ / ˆn be the parameter estimate returned by Rlr(λ) when the number ∂r, θrj = θ, ∀j ∈ ∂r}. Let θ of samples is n. Note that, since we assumed θ∗ ≥ 0, zS = ½. Deﬁne Qn (θ, ; {x(ℓ) }) to be the ˆ∗ (ℓ) (ℓ) n Hessian of L(θ; {x }) and Q(θ) = limn→∞ Q (θ, ; {x }). By the law of large numbers Q(θ) is the Hessian of EG,θ log PG,θ (Xr |X\r ) where EG,θ is the expectation with respect to (8) and X is a random variable distributed according to (8). We will denote the maximum and minimum eigenvalue of a symmetric matrix M by σmax (M ) and σmin (M ) respectively. We will omit arguments whenever clear from the context. Any quantity evaluated at the true parameter values will be represented with a ∗ , e.g. Q∗ = Q(θ∗ ). Quantities under a ∧ depend on n. Throughout this section G is a graph of maximum degree ∆. 3.1 Proof of Theorem 1.6 Our ﬁrst auxiliary results establishes that, if λ is small, then ||Q∗ c S Q∗ −1 zS ||∞ > 1 is a sufﬁcient ˆ∗ S SS condition for the failure of Rlr(λ). Lemma 3.1. Assume [Q∗ c S Q∗ −1 zS ]i ≥ 1 + ǫ for some ǫ > 0 and some row i ∈ V , σmin (Q∗ ) ≥ ˆ∗ S SS SS 3 ǫ/29 ∆4 . Then the success probability of Rlr(λ) is upper bounded as Cmin > 0, and λ < Cmin 2 2 2 δB Psucc ≤ 4∆2 e−nδA + 2∆ e−nλ 2 where δA = (Cmin /100∆2 )ǫ and δB = (Cmin /8∆)ǫ. 6 (10) The next Lemma implies that, for λ to be ‘reasonable’ (in the sense introduced in Section 1.2.2), nλ2 must be unbounded. Lemma 3.2. There exist M = M (K, θ) > 0 for θ > 0 such that the following is true: If G is the graph with only one edge between nodes r and i and nλ2 ≤ K, then Psucc ≤ e−M (K,θ)p + e−n(1−tanh θ) 2 /32 . (11) Finally, our key result shows that the condition ||Q∗ c S Q∗ −1 zS ||∞ ≤ 1 is violated with high ˆ∗ S SS probability for large random graphs. The proof of this result relies on a local weak convergence result for ferromagnetic Ising models on random graphs proved in [17]. Lemma 3.3. Let G be a uniformly random regular graph of degree ∆ > 3, and ǫ > 0 be sufﬁciently small. Then, there exists θthr (∆, ǫ) such that, for θ > θthr (∆, ǫ), ||Q∗ c S Q∗ −1 zS ||∞ ≥ 1 + ǫ with ˆ∗ S SS probability converging to 1 as p → ∞. ˜ ˜ ˜ Furthermore, for large ∆, θthr (∆, 0+) = θ ∆−1 (1 + o(1)). The constant θ is given by θ = ¯ ¯ ¯ ¯ ¯ ¯ tanh h)/h and h is the unique positive solution of h tanh h = (1 − tanh2 h)2 . Finally, there exist Cmin > 0 dependent only on ∆ and θ such that σmin (Q∗ ) ≥ Cmin with probability converging to SS 1 as p → ∞. The proofs of Lemmas 3.1 and 3.3 are sketched in the next subsection. Lemma 3.2 is more straightforward and we omit its proof for space reasons. Proof. (Theorem 1.6) Fix ∆ > 3, θ > K/∆ (where K is a large enough constant independent of ∆), and ǫ, Cmin > 0 and both small enough. By Lemma 3.3, for any p large enough we can choose a ∆-regular graph Gp = (V = [p], Ep ) and a vertex r ∈ V such that |Q∗ c S Q∗ −1 ½S |i > 1 + ǫ for S SS some i ∈ V \ r. By Theorem 1 in [4] we can assume, without loss of generality n > K ′ ∆ log p for some small constant K ′ . Further by Lemma 3.2, nλ2 ≥ F (p) for some F (p) ↑ ∞ as p → ∞ and the condition of Lemma 3.1 on λ is satisﬁed since by the ”reasonable” assumption λ → 0 with n. Using these results in Eq. (10) of Lemma 3.1 we get the following upper bound on the success probability 2 Psucc (Gp ) ≤ 4∆2 p−δA K In particular Psucc (Gp ) → 0 as p → ∞. 3.2 ′ ∆ 2 + 2∆ e−nF (p)δB . (12) Proofs of auxiliary lemmas θ θ Proof. (Lemma 3.1) We will show that under the assumptions of the lemma and if ˆ = (ˆS , ˆS C ) = θ (ˆS , 0) then the probability that the i component of any subgradient of L(θ; {x(ℓ) })+λ||θ||1 vanishes θ for any ˆS > 0 (component wise) is upper bounded as in Eq. (10). To simplify notation we will omit θ {x(ℓ) } in all the expression derived from L. θ θ) z Let z be a subgradient of ||θ|| at ˆ and assume ∇L(ˆ + λˆ = 0. An application of the mean value ˆ theorem yields ∇2 L(θ∗ )[ˆ − θ∗ ] = W n − λˆ + Rn , θ z (13) ∗ n n 2 ¯(j) ) − ∇2 L(θ∗ )]T (ˆ − θ∗ ) with ¯(j) a point in the line where W = −∇L(θ ) and [R ]j = [∇ L(θ θ j θ ˆ to θ∗ . Notice that by deﬁnition ∇2 L(θ∗ ) = Qn ∗ = Qn (θ∗ ). To simplify notation we will from θ omit the ∗ in all Qn ∗ . All Qn in this proof are thus evaluated at θ∗ . Breaking this expression into its S and S c components and since ˆS C = θ∗ C = 0 we can eliminate θ S ˆ − θ∗ from the two expressions obtained and write θS S n n n n ˆ z [WS C − RS C ] − Qn C S (Qn )−1 [WS − RS ] + λQn C S (Qn )−1 zS = λˆS C . SS SS S S Now notice that Qn C S (Qn )−1 = T1 + T2 + T3 + T4 where SS S T1 = Q∗ C S [(Qn )−1 − (Q∗ )−1 ] , SS SS S T3 = [Qn C S − Q∗ C S ][(Qn )−1 − (Q∗ )−1 ] , SS SS S S 7 T2 = [Qn C S − Q∗ C S ]Q∗ −1 , SS S S T4 = Q∗ C S Q∗ −1 . SS S (14) We will assume that the samples {x(ℓ) } are such that the following event holds n E ≡ {||Qn − Q∗ ||∞ < ξA , ||Qn C S − Q∗ C S ||∞ < ξB , ||WS /λ||∞ < ξC } , (15) SS SS S S √ 2 n where ξA ≡ Cmin ǫ/(16∆), ξB ≡ Cmin ǫ/(8 ∆) and ξC ≡ Cmin ǫ/(8∆). Since EG,θ (Q ) = Q∗ and EG,θ (W n ) = 0 and noticing that both Qn and W n are sums of bounded i.i.d. random variables, a simple application of Azuma-Hoeffding inequality upper bounds the probability of E as in (10). From E it follows that σmin (Qn ) > σmin (Q∗ ) − Cmin /2 > Cmin /2. We can therefore lower SS SS bound the absolute value of the ith component of zS C by ˆ n n ∆ Rn WS RS Wn + |[Q∗ C S Q∗ −1 ½S ]i |−||T1,i ||∞ −||T2,i ||∞ −||T3,i ||∞ − i − i − SS S λ λ Cmin λ ∞ λ ∞ where the subscript i denotes the i-th row of a matrix. The proof is completed by showing that the event E and the assumptions of the theorem imply that n each of last 7 terms in this expression is smaller than ǫ/8. Since |[Q∗ C S Q∗ −1 ]T zS | ≥ 1 + ǫ by i ˆ SS S assumption, this implies |ˆi | ≥ 1 + ǫ/8 > 1 which cannot be since any subgradient of the 1-norm z has components of magnitude at most 1. The last condition on E immediately bounds all terms involving W by ǫ/8. Some straightforward manipulations imply (See Lemma 7 from [7]) √ ∆ ∆ n ∗ ||T2,i ||∞ ≤ ||[Qn C S − Q∗ C S ]i ||∞ , ||T1,i ||∞ ≤ 2 ||QSS − QSS ||∞ , S S Cmin Cmin 2∆ ||T3,i ||∞ ≤ 2 ||Qn − Q∗ ||∞ ||[Qn C S − Q∗ C S ]i ||∞ , SS SS S S Cmin and thus all will be bounded by ǫ/8 when E holds. The upper bound of Rn follows along similar lines via an mean value theorem, and is deferred to a longer version of this paper. Proof. (Lemma 3.3.) Let us state explicitly the local weak convergence result mentioned in Sec. 3.1. For t ∈ N, let T(t) = (VT , ET ) be the regular rooted tree of t generations and deﬁne the associated Ising measure as ∗ 1 eθxi xj (16) eh xi . µ+ (x) = T,θ ZT,θ (i,j)∈ET i∈∂T(t) Here ∂T(t) is the set of leaves of T(t) and h∗ is the unique positive solution of h = (∆ − 1) atanh {tanh θ tanh h}. It can be proved using [17] and uniform continuity with respect to the ‘external ﬁeld’ that non-trivial local expectations with respect to µG,θ (x) converge to local expectations with respect to µ+ (x), as p → ∞. T,θ More precisely, let Br (t) denote a ball of radius t around node r ∈ G (the node whose neighborhood we are trying to reconstruct). For any ﬁxed t, the probability that Br (t) is not isomorphic to T(t) goes to 0 as p → ∞. Let g(xBr (t) ) be any function of the variables in Br (t) such that g(xBr (t) ) = g(−xBr (t) ). Then almost surely over graph sequences Gp of uniformly random regular graphs with p nodes (expectations here are taken with respect to the measures (1) and (16)) lim EG,θ {g(X Br (t) )} = ET(t),θ,+ {g(X T(t) )} . (17) p→∞ The proof consists in considering [Q∗ c S Q∗ −1 zS ]i for t = dist(r, i) ﬁnite. We then write ˆ∗ S SS (Q∗ )lk = E{gl,k (X Br (t) )} and (Q∗ c S )il = E{gi,l (X Br (t) )} for some functions g·,· (X Br (t) ) and S SS apply the weak convergence result (17) to these expectations. We thus reduced the calculation of [Q∗ c S Q∗ −1 zS ]i to the calculation of expectations with respect to the tree measure (16). The latter ˆ∗ S SS can be implemented explicitly through a recursive procedure, with simpliﬁcations arising thanks to the tree symmetry and by taking t ≫ 1. The actual calculations consist in a (very) long exercise in calculus and we omit them from this outline. The lower bound on σmin (Q∗ ) is proved by a similar calculation. SS Acknowledgments This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978 and the NSF grant DMS-0806211 and by a Portuguese Doctoral FCT fellowship. 8 , References [1] P. Abbeel, D. Koller and A. Ng, “Learning factor graphs in polynomial time and sample complexity”. Journal of Machine Learning Research., 2006, Vol. 7, 1743–1788. [2] M. Wainwright, “Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting”, arXiv:math/0702301v2 [math.ST], 2007. [3] N. Santhanam, M. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions”, arXiv:0905.2639v1 [cs.IT], 2009. [4] G. Bresler, E. Mossel and A. Sly, “Reconstruction of Markov Random Fields from Samples: Some Observations and Algorithms”,Proceedings of the 11th international workshop, APPROX 2008, and 12th international workshop RANDOM 2008, 2008 ,343–356. [5] Csiszar and Z. Talata, “Consistent estimation of the basic neighborhood structure of Markov random ﬁelds”, The Annals of Statistics, 2006, 34, Vol. 1, 123-145. [6] N. Friedman, I. Nachman, and D. Peer, “Learning Bayesian network structure from massive datasets: The sparse candidate algorithm”. In UAI, 1999. [7] P. Ravikumar, M. Wainwright and J. Lafferty, “High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression”, arXiv:0804.4202v1 [math.ST], 2008. [8] M.Wainwright, P. Ravikumar, and J. Lafferty, “Inferring graphical model structure using l1regularized pseudolikelihood“, In NIPS, 2006. [9] H. H¨ ﬂing and R. Tibshirani, “Estimation of Sparse Binary Pairwise Markov Networks using o Pseudo-likelihoods” , Journal of Machine Learning Research, 2009, Vol. 10, 883–906. [10] O.Banerjee, L. El Ghaoui and A. d’Aspremont, “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data”, Journal of Machine Learning Research, March 2008, Vol. 9, 485–516. [11] M. Yuan and Y. Lin, “Model Selection and Estimation in Regression with Grouped Variables”, J. Royal. Statist. Soc B, 2006, 68, Vol. 19,49–67. [12] N. Meinshausen and P. B¨ uhlmann, “High dimensional graphs and variable selection with the u lasso”, Annals of Statistics, 2006, 34, Vol. 3. [13] R. Tibshirani, “Regression shrinkage and selection via the lasso”, Journal of the Royal Statistical Society, Series B, 1994, Vol. 58, 267–288. [14] P. Zhao, B. Yu, “On model selection consistency of Lasso”, Journal of Machine. Learning Research 7, 25412563, 2006. [15] D. Zobin, ”Critical behavior of the bond-dilute two-dimensional Ising model“, Phys. Rev., 1978 ,5, Vol. 18, 2387 – 2390. [16] M. Fisher, ”Critical Temperatures of Anisotropic Ising Lattices. II. General Upper Bounds”, Phys. Rev. 162 ,Oct. 1967, Vol. 2, 480–485. [17] A. Dembo and A. Montanari, “Ising Models on Locally Tree Like Graphs”, Ann. Appl. Prob. (2008), to appear, arXiv:0804.4726v2 [math.PR] 9

4 0.84946948 255 nips-2009-Variational Inference for the Nested Chinese Restaurant Process

Author: Chong Wang, David M. Blei

Abstract: The nested Chinese restaurant process (nCRP) is a powerful nonparametric Bayesian model for learning tree-based hierarchies from data. Since its posterior distribution is intractable, current inference methods have all relied on MCMC sampling. In this paper, we develop an alternative inference technique based on variational methods. To employ variational methods, we derive a tree-based stick-breaking construction of the nCRP mixture model, and a novel variational algorithm that efﬁciently explores a posterior over a large set of combinatorial structures. We demonstrate the use of this approach for text and hand written digits modeling, where we show we can adapt the nCRP to continuous data as well. 1

same-paper 5 0.82135582 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

Author: Rob Fergus, Yair Weiss, Antonio Torralba

6 0.68349051 32 nips-2009-An Online Algorithm for Large Scale Image Similarity Learning

7 0.68166614 137 nips-2009-Learning transport operators for image manifolds

8 0.67606574 104 nips-2009-Group Sparse Coding

9 0.67421627 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

10 0.67328072 112 nips-2009-Human Rademacher Complexity

11 0.6705572 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

12 0.66945755 87 nips-2009-Exponential Family Graph Matching and Ranking

13 0.66821784 3 nips-2009-AUC optimization and the two-sample problem

14 0.66728157 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

15 0.66638029 211 nips-2009-Segmenting Scenes by Matching Image Composites

16 0.66636354 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

17 0.66609955 119 nips-2009-Kernel Methods for Deep Learning

18 0.66458547 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling

19 0.66445899 9 nips-2009-A Game-Theoretic Approach to Hypergraph Clustering

20 0.66433489 151 nips-2009-Measuring Invariances in Deep Networks