nips nips2009 nips2009-240 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous work has shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases in which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has any advantage. In this paper, we propose intuitively reasonable sufficient conditions under which an agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayes classification boundary and the underlying distribution are smooth to a finite order, active learning achieves a polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We study pool-based active learning in the presence of noise, i.e. the agnostic setting. [sent-4, score-0.24]
2 Previous work has shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. [sent-7, score-0.582]
3 Although there are many cases in which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has any advantage. [sent-8, score-0.48]
4 In this paper, we propose intuitively reasonable sufficient conditions under which an agnostic active learning algorithm is strictly superior to passive supervised learning. [sent-9, score-0.683]
5 This is in contrast with the standard passive supervised learning, where the labeled examples are chosen randomly. [sent-17, score-0.232]
6 The simplest example that demonstrates the potential of active learning is to learn the optimal threshold on an interval. [sent-18, score-0.24]
7 If the problem is separable (i.e. there is no noise), then binary search needs only O(ln(1/ε)) labels to learn an ε-accurate classifier, while passive learning requires O(1/ε) labels. [sent-21, score-0.294]
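To make the threshold example concrete, here is a small simulation contrasting the two query strategies in the separable case. The uniform marginal on [0, 1], the particular threshold value, and the stopping rules are illustrative assumptions, not part of the paper; the point is only that the active learner's label count grows like ln(1/ε) while the passive learner's grows like 1/ε.

```python
import numpy as np

rng = np.random.default_rng(0)
true_threshold = 0.37      # assumed target threshold, for illustration only
eps = 0.01                 # desired accuracy

def label(x):
    """Noise-free oracle: +1 to the right of the threshold, -1 to the left."""
    return 1 if x >= true_threshold else -1

# Active learning: binary search on a sorted unlabeled pool, one label per query.
pool = np.sort(rng.uniform(0.0, 1.0, size=100_000))
lo, hi = 0, len(pool) - 1
active_queries = 0
while pool[hi] - pool[lo] > eps and hi - lo > 1:
    mid = (lo + hi) // 2
    active_queries += 1
    if label(pool[mid]) == 1:
        hi = mid
    else:
        lo = mid

# Passive learning: label random draws until two opposite labels are within eps,
# which pins the threshold down to accuracy eps.
passive_queries, best_neg, best_pos = 0, -np.inf, np.inf
while best_pos - best_neg > eps:
    x = rng.uniform()
    passive_queries += 1
    if label(x) == 1:
        best_pos = min(best_pos, x)
    else:
        best_neg = max(best_neg, x)

print("active  (binary search) labels:", active_queries)   # ~ log2(1/eps)
print("passive (random draws)  labels:", passive_queries)  # ~ 1/eps on average
```

With ε = 0.01 this typically reports a handful of active queries against a few hundred passive ones, matching the O(ln(1/ε)) versus O(1/ε) comparison above.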
8 In this case active learning can still give exponential savings in the label complexity [Das05]. [sent-23, score-0.466]
9 However, there are also very simple problems that active learning does not help at all. [sent-24, score-0.24]
10 Any active learning algorithm needs Ω(1/ε) label requests to learn an ε-accurate classifier [Han07]. [sent-26, score-0.4]
11 Of more interest and more realistic is the agnostic setting, where the class labels can be noisy so that the best classifier in the hypothesis space has a non-zero error ν. [sent-29, score-0.413]
12 For agnostic active learning, there is no active learning algorithm that can always reduce label requests, due to a lower bound Ω(ν²/ε²) on the label complexity [Kaa06]. [sent-30, score-1.057]
13 It is known that whether active learning helps or not depends on the distribution of the instance-label pairs and the hypothesis space. [sent-31, score-0.352]
14 Thus a natural question is: under what conditions is active learning guaranteed to require fewer labels than passive learning? [sent-32, score-0.515]
15 In this paper we propose intuitively reasonable sufficient conditions under which active learning achieves lower label complexity than passive learning. [sent-33, score-0.629]
16 Specifically, we focus on the A2 algorithm [BAL06] which works in the agnostic setting. [sent-34, score-0.211]
17 Earlier work has discovered that the label complexity of A2 can be upper bounded by a parameter of the hypothesis space and the data distribution called the disagreement coefficient [Han07]. [sent-35, score-0.579]
18 1.1 Related Works Our work is closely related to [CN07], in which the authors proved sample complexity bounds for problems with smooth classification boundaries under Tsybakov’s noise condition [Tsy04]. [sent-39, score-0.515]
19 He introduced a different notion of smoothness and showed that this guarantees exponential improvement for active learning. [sent-43, score-0.321]
20 But his work focused on the realizable case and does not apply to the agnostic setting studied here. [sent-44, score-0.243]
21 Soon after A2 , Dasgupta, Hsu and Monteleoni [DHM07] proposed an elegant agnostic active learning algorithm. [sent-45, score-0.451]
22 It reduces active learning to a series of supervised learning problems. [sent-46, score-0.28]
23 If the hypothesis space has a finite VC dimension, it has a better label complexity than A2 . [sent-47, score-0.316]
24 It is not known whether this holds for more general hypothesis spaces such as the smooth boundary class analyzed in this paper. [sent-49, score-0.487]
25 In our active learning model, the algorithm has access to a pool of unlabeled examples from DX . [sent-55, score-0.29]
26 For any unlabeled point x, the algorithm can ask for its label y, which is generated from the conditional distribution at x. [sent-56, score-0.142]
27 A2 is the first rigorous agnostic active learning algorithm. [sent-63, score-0.451]
28 It was shown that A2 is never much worse than passive learning in terms of the label complexity. [sent-66, score-0.351]
29 The key observation behind the potential superiority of A2 over passive learning is that, since our goal is to choose an ĥ such that erD(ĥ) ≤ erD(h∗) + ε, we only need to compare the errors of hypotheses. [sent-67, score-0.23]
30 To achieve such a statistical guarantee, the algorithm must be provided with a uniform convergence bound over the hypothesis space. [sent-71, score-0.213]
31 The other set is the region of disagreement DIS(Vi), which is the set of all x ∈ X for which there are hypotheses in Vi that disagree on x. [sent-73, score-0.235]
32 Requesting labels only for instances from DIS(Vi) is what allows A2 to require fewer labels than passive learning. [sent-77, score-0.365]
33 This process, and in turn the label complexity of A2 , are nicely characterized by the disagreement coefficient θ introduced in [Han07]. [sent-79, score-0.352]
34 Definition 1 Let ρ(·, ·) be the pseudo-metric on a hypothesis space H induced by DX . [sent-80, score-0.138]
35 The disagreement coefficient θ(ε) is θ(ε) = sup_{r≥ε} Pr_{X∼DX}(X ∈ DIS(B(h∗, r))) / r, (1) where h∗ = arg min_{h∈H} erD(h). [sent-83, score-0.211]
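Definition 1 can be checked numerically. The sketch below estimates θ(ε) by Monte Carlo for the one-dimensional threshold class under a uniform marginal, which is an assumed toy setting (the class, the marginal, the value of h∗ and the grid of radii are all illustrative choices): for each r ≥ ε it samples hypotheses from B(h∗, r), measures the mass of the induced disagreement region, and takes the supremum of that mass divided by r.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=200_000)   # sample from D_X (uniform, assumed)
t_star = 0.5                              # threshold of the best hypothesis h*

def disagreement_mass(r, n_hyp=500):
    """Estimate Pr_X(X in DIS(B(h*, r))) for the threshold class.

    Under a uniform marginal, rho(h_t, h*) = |t - t_star|, so B(h*, r) is the
    set of thresholds within r of t_star and DIS is (t_star - r, t_star + r).
    """
    thresholds = t_star + rng.uniform(-r, r, size=n_hyp)
    lo, hi = min(thresholds.min(), t_star), max(thresholds.max(), t_star)
    return ((X > lo) & (X < hi)).mean()

def theta(eps):
    """Monte Carlo estimate of the disagreement coefficient theta(eps)."""
    r_grid = np.geomspace(eps, 0.5, num=30)
    return max(disagreement_mass(r) / r for r in r_grid)

for eps in (0.1, 0.01, 0.001):
    print(f"eps = {eps:6.3f}   theta(eps) ~ {theta(eps):.2f}")
```

For thresholds the estimate stays close to the constant 2 for every ε, whereas repeating the same computation for the interval class gives values growing like 1/ε, which is exactly the negative example mentioned below.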
36 3 Main Results As mentioned earlier, whether active learning helps or not depends on the distribution and the hypothesis space. [sent-85, score-0.352]
37 There are simple examples such as learning intervals for which active learning has no advantage. [sent-86, score-0.259]
38 In this section we provide intuitively reasonable conditions under which the A2 algorithm is strictly superior to passive learning. [sent-89, score-0.211]
39 Our main results (Theorem 11 and Theorem 12) show that if the learning problem has a smooth Bayes classification boundary, and the distribution DX has a density bounded by a smooth function, then under some noise condition A2 saves label requests. [sent-90, score-0.807]
40 In Section 3.1 we formally define the smoothness and introduce the hypothesis space, which contains smooth classifiers. [sent-93, score-0.379]
41 We show a uniform convergence bound of order O(n^{-1/2}) for this hypothesis space. [sent-94, score-0.213]
42 Section 3.2 is the main technical part, where we give upper bounds on the disagreement coefficient for smooth problems. [sent-97, score-0.455]
43 In Section 3.3 we show that under some noise condition, there is a sharper bound for the label complexity in terms of the disagreement coefficient. [sent-99, score-0.469]
44 Define the K-norm as ‖f‖_K := max_{|k|≤K−1} sup_{x∈Ω} |D^k f(x)| + max_{|k|=K−1} sup_{x,x′∈Ω} |D^k f(x) − D^k f(x′)| / ‖x − x′‖, (2) where D^k = ∂^{|k|} / (∂x_1^{k_1} · · · ∂x_d^{k_d}) is the differential operator. [sent-104, score-0.29]
45 Definition 2 (Finite Smooth Functions) A function f is said to be Kth order smooth with respect to a constant C if ‖f‖_K ≤ C. [sent-105, score-0.278]
46 The set of Kth order smooth functions is defined as F_C^K := {f : ‖f‖_K ≤ C}. (3) [sent-106, score-0.224]
47 Thus Kth order smooth functions have uniformly bounded partial derivatives up to order K − 1, and the (K − 1)th order partial derivatives are Lipschitz. [sent-107, score-0.384]
48 Definition 3 (Infinitely Smooth Functions) A function f is said to be infinitely smooth with respect to a constant C if ‖f‖_K ≤ C for all nonnegative integers K. [sent-108, score-0.32]
49 The set of infinitely smooth functions is denoted by F_C^∞. [sent-109, score-0.224]
50 Definition 4 (Hypotheses with Smooth Boundaries) A set of hypotheses H_C^K defined on [0,1]^{d+1} is said to have Kth order smooth boundaries if, for every h ∈ H_C^K, the classification boundary is the graph of a Kth order smooth function on [0,1]^d. [sent-111, score-0.617]
51 Similarly, a hypothesis space H_C^∞ is said to have infinitely smooth boundaries if, for every h ∈ H_C^∞, the classification boundary is the graph of an infinitely smooth function on [0,1]^d. [sent-120, score-0.72]
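As a quick sanity check on Definition 2, the snippet below estimates the two parts of the K-norm in (2) for a univariate function on Ω = [0, 1] (d = 1, K = 2) by finite differences: the supremum of |f| and |f′|, plus the Lipschitz constant of f′. The example function sin(3x) and the grid resolution are assumptions made for illustration, and the result is a crude numerical estimate rather than the exact norm.

```python
import numpy as np

def k_norm_estimate(f, K=2, n_grid=2001):
    """Finite-difference estimate of ||f||_K on Omega = [0, 1] with d = 1."""
    x = np.linspace(0.0, 1.0, n_grid)
    h = x[1] - x[0]
    derivs = [f(x)]
    for _ in range(K - 1):
        derivs.append(np.gradient(derivs[-1], h))       # derivatives up to order K-1
    sup_part = max(np.max(np.abs(g)) for g in derivs)   # max_{k<=K-1} sup |D^k f|
    lip_part = np.max(np.abs(np.diff(derivs[-1]))) / h  # Lipschitz constant of D^{K-1} f
    return sup_part + lip_part

f = lambda x: np.sin(3.0 * x)
# sup|f| = 1, sup|f'| = 3 and the Lipschitz constant of f' is sup|f''| = 9,
# so the estimate should come out close to 3 + 9 = 12.
print("estimated ||sin(3x)||_2 on [0,1]:", round(k_norm_estimate(f, K=2), 2))
```

So sin(3x) is second order smooth with respect to any C of roughly 12 or more, i.e. it lies in F_C^2 in the sense of Definition 2.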
52 Previous results on the label complexity of A2 assume that the hypothesis space has finite VC dimension. [sent-121, score-0.335]
53 Here UB(S, h, δ) = erS(h) + λ√(ln(1/δ)/n) and LB(S, h, δ) = erS(h) − λ√(ln(1/δ)/n), where S is a labeled sample of size n. [sent-137, score-0.268]
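A minimal sketch of how these confidence bounds drive A2-style elimination, assuming the reconstructed form of UB and LB above: after labeling a batch, every hypothesis whose lower bound exceeds the smallest upper bound can be discarded from the version space. The finite threshold grid, the constant λ = 1, the noise level, and the single-round structure are all illustrative assumptions, not the algorithm exactly as specified in [BAL06].

```python
import numpy as np

rng = np.random.default_rng(2)

def ub(err_hat, n, delta, lam=1.0):
    """Upper confidence bound UB(S, h, delta) on the true error of h."""
    return err_hat + lam * np.sqrt(np.log(1.0 / delta) / n)

def lb(err_hat, n, delta, lam=1.0):
    """Lower confidence bound LB(S, h, delta) on the true error of h."""
    return err_hat - lam * np.sqrt(np.log(1.0 / delta) / n)

# Hypotheses: thresholds on [0, 1]; data: Bayes threshold 0.5 with 10% label noise.
thresholds = np.linspace(0.0, 1.0, 201)
t_star, noise, n, delta = 0.5, 0.10, 2000, 0.05

x = rng.uniform(0.0, 1.0, n)
y = np.where(x >= t_star, 1, -1)
y = np.where(rng.uniform(size=n) < noise, -y, y)           # flip labels with prob. noise

pred = np.where(x[None, :] >= thresholds[:, None], 1, -1)  # |H| x n predictions
err_hat = (pred != y[None, :]).mean(axis=1)                # empirical error er_S(h)

best_ub = ub(err_hat, n, delta).min()
keep = lb(err_hat, n, delta) <= best_ub                    # survivors of one elimination round
print(f"version space after one round: thresholds in "
      f"[{thresholds[keep].min():.2f}, {thresholds[keep].max():.2f}]")
```

In the actual algorithm the next batch of labels is drawn only from the region of disagreement of the surviving hypotheses, which is where the savings measured by the disagreement coefficient below come from.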
54 Disagreement Coefficient The disagreement coefficient θ plays an important role in the label complexity of active learning algorithms. [sent-138, score-0.592]
55 In fact, the previous negative examples for which active learning does not work are all due to a large θ. [sent-139, score-0.24]
56 For instance, in the interval learning problem θ(ε) = 1/ε, which leads to the same label complexity as passive learning. [sent-140, score-0.408]
57 In the following two theorems we show that the disagreement coefficient θ(ε) for smooth problems is small. [sent-141, score-0.435]
58 , xd+1 ) C such that there exists a Kth order smooth function g(x1 , . [sent-146, score-0.224]
59 , xd+1 ) C such that there exist an infinitely smooth function g(x1 , . [sent-166, score-0.224]
60 , xd ) and two constants 0 < α ≤ β such that αg(x1 , . [sent-169, score-0.273]
61 The key points in the theorems are: the classification boundaries are smooth; and the density is bounded from above and below by constants times a smooth function. [sent-182, score-0.499]
62 Let f_{h∗}(x) and f_h(x) be the classification boundaries of h∗ and h, and suppose ρ(h, h∗) is small, where ρ(h, h∗) = Pr_{x∼DX}(h(x) ≠ h∗(x)) is the pseudo-metric. [sent-187, score-0.3]
63 If the classification boundaries and the density are all smooth, then the two boundaries have to be close to each other everywhere. [sent-188, score-0.223]
64 Hence only the points close to the classification boundary of h∗ can be in DIS(B(h∗, r)), which leads to a small disagreement coefficient. [sent-190, score-0.274]
65 If there exists a Kth order smooth function Φ̃ and 0 < α ≤ β such that α|Φ̃(x)| ≤ |Φ(x)| ≤ β|Φ̃(x)| for all x ∈ [0,1]^d, then ‖Φ‖_∞ = O(r^{K/(K+d)}) = O(r · (1/r)^{d/(K+d)}), where ‖Φ‖_∞ = sup_{x∈[0,1]^d} |Φ(x)|. [sent-193, score-0.224]
66 If there exists an infinitely smooth function Φ̃ and 0 < α ≤ β such that α|Φ̃(x)| ≤ |Φ(x)| ≤ β|Φ̃(x)| for all x ∈ [0,1]^d, then ‖Φ‖_∞ = O(r · log^d(1/r)). We will briefly describe the ideas of the proofs of these two lemmas in the Appendix. [sent-195, score-0.355]
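The mechanism behind these lemmas can be seen numerically in one dimension. Assuming, as in the cut-off hypothesis of the lemmas, that the L1 mass ∫|Φ| plays the role of r, take a family of C^∞ bumps whose height is scaled like w^K so that the K-norm stays bounded as the width w shrinks; the supremum and the mass are then tied together exactly as the finite-smoothness lemma predicts. The bump family and the parameters d = 1, K = 2 are illustrative assumptions.

```python
import numpy as np

K, d = 2, 1
x = np.linspace(0.0, 1.0, 200_001)
dx = x[1] - x[0]

def bump(w):
    """C^infinity bump of width w centred at 1/2, with height scaled by w**K
    so that the K-norm stays bounded as w shrinks."""
    u = (x - 0.5) / w
    out = np.zeros_like(x)
    inside = np.abs(u) < 1.0
    out[inside] = (w ** K) * np.exp(-1.0 / (1.0 - u[inside] ** 2))
    return out

print(" width w     r = ||Phi||_1   ||Phi||_inf   ||Phi||_inf / r^(K/(K+d))")
for w in (0.2, 0.1, 0.05, 0.025):
    phi = bump(w)
    r = phi.sum() * dx            # L1 mass, playing the role of rho(h, h*)
    sup = phi.max()
    print(f"  {w:5.3f}     {r:.3e}     {sup:.3e}      {sup / r ** (K / (K + d)):.3f}")
```

The last column stays essentially constant as w shrinks, which is the ‖Φ‖_∞ = O(r^{K/(K+d)}) scaling asserted by the lemma.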
67 Let f_h, f_{h∗} ∈ F_C^K be the corresponding classification boundaries of h and h∗, respectively. [sent-201, score-0.275]
68 If r is sufficiently small, we must have ρ(h, h∗) = Pr_{X∼DX}(h(X) ≠ h∗(X)), which can be written as an integral over [0,1]^d of the density mass between the two boundaries. [sent-202, score-0.258]
69 We assert that there is a Kth order smooth function Φ̃_h(x_1, . . . , x_d) [sent-228, score-0.224]
70 and two constants 0 < u ≤ v such that u|Φ̃_h| ≤ |Φ_h| ≤ v|Φ̃_h|. [sent-231, score-0.273]
71 To see this, remember that f_h and f_{h∗} are Kth order smooth functions, and the density p is upper and lower bounded by constants times a Kth order smooth function. [sent-232, score-0.825]
72 Now consider the region of disagreement of B(h∗, r). [sent-251, score-0.174]
73 Hence Pr_{X∼DX}(x ∈ DIS(B(h∗, r))) = Pr_{X∼DX}(x ∈ ∪_{h∈B(h∗,r)} {x : h(x) ≠ h∗(x)}) ≤ 2 sup_{h∈B(h∗,r)} ∫_{[0,1]^d} ‖Φ_h‖_∞ dx. [sent-253, score-0.295]
74 3.3 Label Complexity It was shown in [Han07] that the label complexity of A2 is O(θ²(ν²/ε² + 1) · ln(1/δ) · polylog(1/ε)), (5) where ν = min_{h∈H} erD(h). [sent-260, score-0.381]
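Plugging numbers into (5) shows where the advantage lives. The comparison below uses ln²(1/ε) as a stand-in for the polylog factor and (ν + ε)/ε² (up to constants and a log(1/δ) factor) as the passive agnostic sample complexity; both simplifications, the value θ = 2, and the dropped constants are assumptions made purely for illustration.

```python
import numpy as np

def a2_bound(eps, nu, theta, delta=0.05):
    """Label complexity bound (5), with constants dropped and polylog = ln^2."""
    return theta ** 2 * (nu ** 2 / eps ** 2 + 1) * np.log(1 / eps) ** 2 * np.log(1 / delta)

def passive_bound(eps, nu, delta=0.05):
    """Passive agnostic scale (nu + eps) / eps^2, constants and VC term dropped."""
    return (nu + eps) / eps ** 2 * np.log(1 / delta)

theta = 2.0   # e.g. the threshold class; see the disagreement-coefficient estimate above
for nu in (0.0, 0.05):
    for eps in (0.1, 0.01, 0.001):
        a, p = a2_bound(eps, nu, theta), passive_bound(eps, nu)
        print(f"nu={nu:4.2f}  eps={eps:6.3f}   A2 ~ {a:12.1f}   passive ~ {p:12.1f}")
```

For ε ≥ ν the active bound is polylogarithmic in 1/ε while the passive one grows like 1/ε; once ε ≪ ν the ν²/ε² term dominates and the advantage disappears, which is exactly the regime the noise condition of this subsection is meant to address.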
75 When ε ≥ ν, our previous results on the disagreement coefficient already imply polynomial or exponential improvements for A2. [sent-261, score-0.226]
76 However, when ε < ν, the label complexity becomes O(1/ε²), the same as passive learning whatever θ is. [sent-262, score-0.435]
77 In fact, without any assumption on the noise, the O(1/ε²) result is inevitable due to the Ω(ν²/ε²) lower bound for agnostic active learning [Kaa06]. [sent-263, score-0.498]
78 A remarkable notion is due to Tsybakov [Tsy04]; it was first introduced for passive learning. [sent-265, score-0.211]
79 Castro and Nowak [CN07] noticed that Tsybakov’s noise condition is also important in active learning. [sent-271, score-0.332]
80 They proved label complexity bounds in terms of κ for the membership-query setting. [sent-272, score-0.201]
81 Hence for large κ, active learning gives only a very limited improvement over passive learning, whatever the other factors are. [sent-275, score-0.531]
82 Recently, Hanneke [Han09] obtained a similar label complexity for the pool-based model. [sent-276, score-0.178]
83 He showed that the number of labels requested by A2 is O(θ² ln(1/ε) ln(1/δ)) for the bounded noise case. [sent-277, score-0.531]
84 Here we slightly generalize Hanneke's result to unbounded noise by introducing the following noise condition. [sent-280, score-0.171]
85 Under this noise condition, A2 has a better label complexity. [sent-284, score-0.191]
86 Theorem 10 Assume that the learning problem satisfies the noise condition (8) and DX has a density upper bounded by a constant M . [sent-285, score-0.292]
87 For any hypothesis space H that has an O(n^{-1/2}) uniform convergence bound, if the Bayes classifier h∗ is in H, then with probability at least 1 − δ, A2 outputs ĥ ∈ H with erD(ĥ) ≤ erD(h∗) + ε, and the number of labels requested by the algorithm is at most O(θ²(ε) · ln(1/δ) · polylog(1/ε)). [sent-286, score-0.464]
88 If ∆(Vi) ≤ 2εθ(ε), then due to the O(n^{-1/2}) uniform convergence bound, O(θ²(ε) ln(1/δ)) labels suffice to make ∆(Vi)(UB(Si, h, δ) − LB(Si, h, δ)) ≤ ε for all h ∈ DIS(Vi), and the algorithm stops. [sent-289, score-0.252]
89 By the noise assumption (9) we have that if γ(h) satisfies the condition in (10), then ∆(Vi) < ∆(Vjk)/2. [sent-295, score-0.204]
90 Note that (10) holds if γ(h) ≤ c ∆(Vjk ) θ( ) ln θ( ) ∆(Vj ) k ∆(Vjk ) ) ln 1 , and in turn if γ(h) ≤ c θ( since ∆(Vjk ) ≥ ∆(Vi ) > 2 θ( ). [sent-297, score-0.293]
91 But to have the last inequality, the algorithm only needs to label O(θ²(ε) ln²(1/ε) ln(1/δ)) instances from DIS(Vi). [sent-298, score-0.281]
92 So the total number of labels requested by A2 is O(θ²(ε) ln(1/δ) ln³(1/ε)). Now we give our main label complexity bounds for agnostic active learning. [sent-299, score-0.905]
93 Assume that the Bayes classifier h∗ of the learning problem is in H_C^K; the noise condition (8) holds; and DX has a density bounded by a Kth order smooth function as in Theorem 6. [sent-302, score-0.462]
94 Then the A2 algorithm outputs ĥ with error rate erD(ĥ) ≤ erD(h∗) + ε, and the number of labels requested is at most Õ((1/ε)^{2d/(K+d)} ln(1/δ)), where in Õ we hide the polylog(1/ε) term. [sent-303, score-0.341]
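The savings in Theorem 11 are easiest to read off from the exponent of 1/ε: passive agnostic learning pays on the order of (1/ε)², while the theorem gives (1/ε)^{2d/(K+d)}, so the exponent drops from 2 towards 0 as the smoothness order K grows relative to the dimension d. The small table below tabulates the exponent, ignoring all log and polylog factors (an assumption made for readability).

```python
# Exponent of 1/eps in the label complexity: 2 for passive agnostic learning,
# 2d/(K+d) for A2 under Theorem 11 (log and polylog factors ignored).
def active_exponent(K: int, d: int) -> float:
    return 2.0 * d / (K + d)

for d in (1, 2, 5):
    for K in (1, 2, 5, 20):
        print(f"d = {d:2d}, K = {K:3d}:  passive ~ (1/eps)^2,  "
              f"A2 ~ (1/eps)^{active_exponent(K, d):.2f}")
```

As K → ∞ the exponent tends to 0, consistent with the polylog(1/ε) label complexity that Theorem 12 gives for infinitely smooth problems.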
95 Proof Note that the density of DX being upper bounded by a smooth function implies that it is also upper bounded by a constant M. [sent-304, score-0.475]
96 Assume that the Bayes classifier h∗ of the learning problem is in H_C^∞; the noise condition (8) holds; and DX has a density bounded by an infinitely smooth function as in Theorem 7. [sent-309, score-0.462]
97 Then the A2 algorithm outputs ĥ with error rate erD(ĥ) ≤ erD(h∗) + ε, and the number of labels requested is at most O(polylog(1/ε) · ln(1/δ)). [sent-310, score-0.341]
98 Although we assume that the classification boundary is the graph of a single function, our results can be generalized to the case where the boundary consists of the graphs of a finite number of functions. [sent-312, score-0.185]
99 In addition, by techniques in [Dud99] (page 259), our results may generalize to problems which have intrinsic smooth boundaries (not only graphs of functions). [sent-320, score-0.309]
100 A bound on the label complexity of agnostic active learning. [sent-402, score-0.657]
wordName wordTfidf (topN-words)
[('erd', 0.396), ('dis', 0.259), ('dx', 0.258), ('hc', 0.254), ('xd', 0.228), ('smooth', 0.224), ('active', 0.221), ('passive', 0.211), ('agnostic', 0.211), ('fh', 0.19), ('vi', 0.181), ('disagreement', 0.174), ('lb', 0.156), ('vjk', 0.143), ('ln', 0.134), ('kth', 0.122), ('label', 0.121), ('hypothesis', 0.112), ('minh', 0.104), ('nitely', 0.1), ('boundary', 0.1), ('boundaries', 0.085), ('fc', 0.08), ('logd', 0.079), ('prx', 0.079), ('requested', 0.074), ('noise', 0.07), ('polylog', 0.069), ('si', 0.065), ('labels', 0.064), ('vc', 0.064), ('er', 0.062), ('tsybakov', 0.059), ('dxd', 0.059), ('theorem', 0.057), ('complexity', 0.057), ('classi', 0.057), ('bounded', 0.055), ('coef', 0.054), ('density', 0.053), ('eri', 0.052), ('bound', 0.047), ('lemma', 0.046), ('ers', 0.045), ('constants', 0.045), ('smoothness', 0.043), ('derivatives', 0.043), ('pr', 0.041), ('condition', 0.041), ('dk', 0.04), ('requests', 0.039), ('sup', 0.037), ('jk', 0.037), ('theorems', 0.037), ('hypotheses', 0.035), ('requesting', 0.035), ('said', 0.034), ('upper', 0.034), ('improvement', 0.034), ('hk', 0.033), ('request', 0.032), ('realizable', 0.032), ('castro', 0.032), ('hanneke', 0.032), ('unbounded', 0.031), ('polynomial', 0.029), ('convergence', 0.029), ('pool', 0.029), ('lipschitz', 0.028), ('learnable', 0.028), ('proofs', 0.027), ('whatever', 0.027), ('instances', 0.026), ('space', 0.026), ('bayes', 0.026), ('uniform', 0.025), ('pseudo', 0.025), ('kd', 0.025), ('savings', 0.025), ('ideas', 0.025), ('log', 0.025), ('holds', 0.025), ('hsu', 0.024), ('dasgupta', 0.024), ('nite', 0.023), ('integers', 0.023), ('exponential', 0.023), ('bounds', 0.023), ('fn', 0.022), ('let', 0.022), ('supervised', 0.021), ('unlabeled', 0.021), ('constant', 0.02), ('proof', 0.019), ('assumes', 0.019), ('learning', 0.019), ('uniformly', 0.019), ('minimax', 0.019), ('cation', 0.019), ('nonnegative', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous work has shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases in which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has any advantage. In this paper, we propose intuitively reasonable sufficient conditions under which an agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayes classification boundary and the underlying distribution are smooth to a finite order, active learning achieves a polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.
2 0.13703313 193 nips-2009-Potential-Based Agnostic Boosting
Author: Varun Kanade, Adam Kalai
Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1
3 0.10353205 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
Author: Marcel V. Gerven, Botond Cseke, Robert Oostenveld, Tom Heskes
Abstract: We introduce a novel multivariate Laplace (MVL) distribution as a sparsity promoting prior for Bayesian source localization that allows the specification of constraints between and within sources. We represent the MVL distribution as a scale mixture that induces a coupling between source variances instead of their means. Approximation of the posterior marginals using expectation propagation is shown to be very efficient due to properties of the scale mixture representation. The computational bottleneck amounts to computing the diagonal elements of a sparse matrix inverse. Our approach is illustrated using a mismatch negativity paradigm for which MEG data and a structural MRI have been acquired. We show that spatial coupling leads to sources which are active over larger cortical areas as compared with an uncoupled prior. 1
4 0.090215161 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
Author: Boaz Nadler, Nathan Srebro, Xueyuan Zhou
Abstract: We study the behavior of the popular Laplacian Regularization method for SemiSupervised Learning at the regime of a fixed number of labeled points but a large number of unlabeled points. We show that in Rd , d 2, the method is actually not well-posed, and as the number of unlabeled points increases the solution degenerates to a noninformative function. We also contrast the method with the Laplacian Eigenvector method, and discuss the “smoothness” assumptions associated with this alternate method. 1 Introduction and Setup In this paper we consider the limit behavior of two popular semi-supervised learning (SSL) methods based on the graph Laplacian: the regularization approach [15] and the spectral approach [3]. We consider the limit when the number of labeled points is fixed and the number of unlabeled points goes to infinity. This is a natural limit for SSL as the basic SSL scenario is one in which unlabeled data is virtually infinite. We can also think of this limit as “perfect” SSL, having full knowledge of the marginal density p(x). The premise of SSL is that the marginal density p(x) is informative about the unknown mapping y(x) we are trying to learn, e.g. since y(x) is expected to be “smooth” in some sense relative to p(x). Studying the infinite-unlabeled-data limit, where p(x) is fully known, allows us to formulate and understand the underlying smoothness assumptions of a particular SSL method, and judge whether it is well-posed and sensible. Understanding the infinite-unlabeled-data limit is also a necessary first step to studying the convergence of the finite-labeled-data estimator. We consider the following setup: Let p(x) be an unknown smooth density on a compact domain Ω ⊂ Rd with a smooth boundary. Let y : Ω → Y be the unknown function we wish to estimate. In case of regression Y = R whereas in binary classification Y = {−1, 1}. The standard (transductive) semisupervised learning problem is formulated as follows: Given l labeled points, (x1 , y1 ), . . . , (xl , yl ), with yi = y(xi ), and u unlabeled points xl+1 , . . . , xl+u , with all points xi sampled i.i.d. from p(x), the goal is to construct an estimate of y(xl+i ) for any unlabeled point xl+i , utilizing both the labeled and the unlabeled points. We denote the total number of points by n = l + u. We are interested in the regime where l is fixed and u → ∞. 1 2 SSL with Graph Laplacian Regularization We first consider the following graph-based approach formulated by Zhu et. al. [15]: y (x) = arg min In (y) ˆ subject to y(xi ) = yi , i = 1, . . . , l y where 1 n2 In (y) = Wi,j (y(xi ) − y(xj ))2 (1) (2) i,j is a Laplacian regularization term enforcing “smoothness” with respect to the n×n similarity matrix W . This formulation has several natural interpretations in terms of, e.g. random walks and electrical circuits [15]. These interpretations, however, refer to a fixed graph, over a finite set of points with given similarities. In contrast, our focus here is on the more typical scenario where the points xi ∈ Rd are a random sample from a density p(x), and W is constructed based on this sample. We would like to understand the behavior of the method in terms of the density p(x), particularly in the limit where the number of unlabeled points grows. Under what assumptions on the target labeling y(x) and on the density p(x) is the method (1) sensible? The answer, of course, depends on how the matrix W is constructed. 
We consider the common situation where the similarities are obtained by applying some decay filter to the distances: xi −xj σ Wi,j = G (3) where G : R+ → R+ is some function with an adequately fast decay. Popular choices are the 2 Gaussian filter G(z) = e−z /2 or the ǫ-neighborhood graph obtained by the step filter G(z) = 1z<1 . For simplicity, we focus here on the formulation (1) where the solution is required to satisfy the constraints at the labeled points exactly. In practice, the hard labeling constraints are often replaced with a softer loss-based data term, which is balanced against the smoothness term In (y), e.g. [14, 6]. Our analysis and conclusions apply to such variants as well. Limit of the Laplacian Regularization Term As the number of unlabeled examples grows the regularization term (2) converges to its expectation, where the summation is replaced by integration w.r.t. the density p(x): lim In (y) = I (σ) (y) = n→∞ G Ω Ω x−x′ σ (y(x) − y(x′ ))2 p(x)p(x′ )dxdx′ . (4) In the above limit, the bandwidth σ is held fixed. Typically, one would also drive the bandwidth σ to zero as n → ∞. There are two reasons for this choice. First, from a practical perspective, this makes the similarity matrix W sparse so it can be stored and processed. Second, from a theoretical perspective, this leads to a clear and well defined limit of the smoothness regularization term In (y), at least when σ → 0 slowly enough1 , namely when σ = ω( d log n/n). If σ → 0 as n → ∞, and as long as nσ d / log n → ∞, then after appropriate normalization, the regularizer converges to a density weighted gradient penalty term [7, 8]: d lim d+2 In (y) n→∞ Cσ (σ) d (y) d+2 I σ→0 Cσ = lim ∇y(x) 2 p(x)2 dx = J(y) = (5) Ω where C = Rd z 2 G( z )dz, and assuming 0 < C < ∞ (which is the case for both the Gaussian and the step filters). This energy functional J(f ) therefore encodes the notion of “smoothness” with respect to p(x) that is the basis of the SSL formulation (1) with the graph constructions specified by (3). To understand the behavior and appropriateness of (1) we must understand this functional and the associated limit problem: y (x) = arg min J(y) ˆ subject to y(xi ) = yi , i = 1, . . . , l (6) y p When σ = o( d 1/n) then all non-diagonal weights Wi,j vanish (points no longer have any “close by” p neighbors). We are not aware of an analysis covering the regime where σ decays roughly as d 1/n, but would be surprised if a qualitatively different meaningful limit is reached. 1 2 3 Graph Laplacian Regularization in R1 We begin by considering the solution of (6) for one dimensional data, i.e. d = 1 and x ∈ R. We first consider the situation where the support of p(x) is a continuous interval Ω = [a, b] ⊂ R (a and/or b may be infinite). Without loss of generality, we assume the labeled data is sorted in increasing order a x1 < x2 < · · · < xl b. Applying the theory of variational calculus, the solution y (x) ˆ satisfies inside each interval (xi , xi+1 ) the Euler-Lagrange equation d dy p2 (x) = 0. dx dx Performing two integrations and enforcing the constraints at the labeled points yields y(x) = yi + x 1/p2 (t)dt xi (yi+1 xi+1 1/p2 (t)dt xi − yi ) for xi x xi+1 (7) with y(x) = x1 for a x x1 and y(x) = xl for xl x b. If the support of p(x) is a union of disjoint intervals, the above analysis and the form of the solution applies in each interval separately. 
The solution (7) seems reasonable and desirable from the point of view of the “smoothness” assumptions: when p(x) is uniform, the solution interpolates linearly between labeled data points, whereas across low-density regions, where p(x) is close to zero, y(x) can change abruptly. Furthermore, the regularizer J(y) can be interpreted as a Reproducing Kernel Hilbert Space (RKHS) squared semi-norm, giving us additional insight into this choice of regularizer: b 1 Theorem 1. Let p(x) be a smooth density on Ω = [a, b] ⊂ R such that Ap = 4 a 1/p2 (t)dt < ∞. 2 Then, J(f ) can be written as a squared semi-norm J(f ) = f Kp induced by the kernel x′ ′ Kp (x, x ) = Ap − 1 2 x with a null-space of all constant functions. That is, f the RKHS induced by Kp . 1 p2 (t) dt Kp . (8) is the norm of the projection of f onto If p(x) is supported on several disjoint intervals, Ω = ∪i [ai , bi ], then J(f ) can be written as a squared semi-norm induced by the kernel 1 bi dt 4 ai p2 (t) ′ Kp (x, x ) = − 1 2 x′ dt x p2 (t) if x, x′ ∈ [ai , bi ] (9) if x ∈ [ai , bi ], x′ ∈ [aj , bj ], i = j 0 with a null-space spanned by indicator functions 1[ai ,bi ] (x) on the connected components of Ω. Proof. For any f (x) = i αi Kp (x, xi ) in the RKHS induced by Kp : df dx J(f ) = 2 p2 (x)dx = αi αj Jij (10) i,j where Jij = d d Kp (x, xi ) Kp (x, xj )p2 (x)dx dx dx When xi and xj are in different connected components of Ω, the gradients of Kp (·, xi ) and Kp (·, xj ) are never non-zero together and Jij = 0 = Kp (xi , xj ). When they are in the same connected component [a, b], and assuming w.l.o.g. a xi xj b: Jij = = xi 1 4 1 4 a b a 1 dt + p2 (t) 1 1 dt − p2 (t) 2 xj xi xj xi −1 dt + p2 (t) xj 1 dt p2 (t) 1 dt = Kp (xi , xj ). p2 (t) Substituting Jij = Kp (xi , xj ) into (10) yields J(f ) = 3 b αi αj Kp (xi , xj ) = f (11) Kp . Combining Theorem 1 with the Representer Theorem [13] establishes that the solution of (6) (or of any variant where the hard constraints are replaced by a data term) is of the form: l y(x) = αj Kp (x, xj ) + βi 1[ai ,bi ] (x), j=1 i where i ranges over the connected components [ai , bi ] of Ω, and we have: l J(y) = αi αj Kp (xi , xj ). (12) i,j=1 Viewing the regularizer as y 2 p suggests understanding (6), and so also its empirical approximaK tion (1), by interpreting Kp (x, x′ ) as a density-based “similarity measure” between x and x′ . This similarity measure indeed seems sensible: for a uniform density it is simply linearly decreasing as a function of the distance. When the density is non-uniform, two points are relatively similar only if they are connected by a region in which 1/p2 (x) is low, i.e. the density is high, but are much less “similar”, i.e. related to each other, when connected by a low-density region. Furthermore, there is no dependence between points in disjoint components separated by zero density regions. 4 Graph Laplacian Regularization in Higher Dimensions The analysis of the previous section seems promising, at it shows that in one dimension, the SSL method (1) is well posed and converges to a sensible limit. Regretfully, in higher dimensions this is not the case anymore. In the following theorem we show that the infimum of the limit problem (6) is zero and can be obtained by a sequence of functions which are certainly not a sensible extrapolation of the labeled points. Theorem 2. Let p(x) be a smooth density over Rd , d 2, bounded from above by some constant pmax , and let (x1 , y1 ), . . . , (xl , yl ) be any (non-repeating) set of labeled examples. 
There exist continuous functions yǫ (x), for any ǫ > 0, all satisfying the constraints yǫ (xj ) = yj , j = 1, . . . , l, such ǫ→0 ǫ→0 that J(yǫ ) −→ 0 but yǫ (x) −→ 0 for all x = xj , j = 1, . . . , l. Proof. We present a detailed proof for the case of l = 2 labeled points. The generalization of the proof to more labeled points is straightforward. Furthermore, without loss of generality, we assume the first labeled point is at x0 = 0 with y(x0 ) = 0 and the second labeled point is at x1 with x1 = 1 and y(x1 ) = 1. In addition, we assume that the ball B1 (0) of radius one centered around the origin is contained in Ω = {x ∈ Rd | p(x) > 0}. We first consider the case d > 2. Here, for any ǫ > 0, consider the function x ǫ yǫ (x) = min ,1 which indeed satisfies the two constraints yǫ (xi ) = yi , i = 0, 1. Then, J(yǫ ) = Bǫ (0) p2 (x) dx ǫ2 pmax ǫ2 dx = p2 Vd ǫd−2 max (13) Bǫ (0) where Vd is the volume of a unit ball in Rd . Hence, the sequence of functions yǫ (x) satisfy the constraints, but for d > 2, inf ǫ J(yǫ ) = 0. For d = 2, a more extreme example is necessary: consider the functions 2 x yǫ (x) = log +ǫ ǫ log 1+ǫ ǫ for x 1 and yǫ (x) = 1 for x > 1. These functions satisfy the two constraints yǫ (xi ) = yi , i = 0, 1 and: J(yǫ ) = 4 h “ ”i 1+ǫ 2 log ǫ 4πp2 max h “ ”i 1+ǫ 2 log ǫ x B1 (0) log ( x 1+ǫ ǫ 2 2 +ǫ)2 p2 (x)dx 4p2 h “ max ”i2 1+ǫ log ǫ 4πp2 max ǫ→0 = −→ 0. log 1+ǫ ǫ 4 1 0 r2 (r 2 +ǫ)2 2πrdr The implication of Theorem 2 is that regardless of the values at the labeled points, as u → ∞, the solution of (1) is not well posed. Asymptotically, the solution has the form of an almost everywhere constant function, with highly localized spikes near the labeled points, and so no learning is performed. In particular, an interpretation in terms of a density-based kernel Kp , as in the onedimensional case, is not possible. Our analysis also carries over to a formulation where a loss-based data term replaces the hard label constraints, as in l 1 y = arg min ˆ (y(xj ) − yj )2 + γIn (y) y(x) l j=1 In the limit of infinite unlabeled data, functions of the form yǫ (x) above have a zero data penalty term (since they exactly match the labels) and also drive the regularization term J(y) to zero. Hence, it is possible to drive the entire objective functional (the data term plus the regularization term) to zero with functions that do not generalize at all to unlabeled points. 4.1 Numerical Example We illustrate the phenomenon detailed by Theorem 2 with a simple example. Consider a density p(x) in R2 , which is a mixture of two unit variance spherical Gaussians, one per class, centered at the origin and at (4, 0). We sample a total of n = 3000 points, and label two points from each of the two components (four total). We then construct a similarity matrix using a Gaussian filter with σ = 0.4. Figure 1 depicts the predictor y (x) obtained from (1). In fact, two different predictors are shown, ˆ obtained by different numerical methods for solving (1). Both methods are based on the observation that the solution y (x) of (1) satisfies: ˆ n y (xi ) = ˆ n Wij y (xj ) / ˆ j=1 Wij on all unlabeled points i = l + 1, . . . , l + u. (14) j=1 Combined with the constraints of (1), we obtain a system of linear equations that can be solved by Gaussian elimination (here invoked through MATLAB’s backslash operator). This is the method used in the top panels of Figure 1. 
Alternatively, (14) can be viewed as an update equation for y (xi ), ˆ which can be solved via the power method, or label propagation [2, 6]: start with zero labels on the unlabeled points and iterate (14), while keeping the known labels on x1 , . . . , xl . This is the method used in the bottom panels of Figure 1. As predicted, y (x) is almost constant for almost all unlabeled points. Although all values are very ˆ close to zero, thresholding at the “right” threshold does actually produce sensible results in terms of the true -1/+1 labels. However, beyond being inappropriate for regression, a very flat predictor is still problematic even from a classification perspective. First, it is not possible to obtain a meaningful confidence measure for particular labels. Second, especially if the size of each class is not known apriori, setting the threshold between the positive and negative classes is problematic. In our example, setting the threshold to zero yields a generalization error of 45%. The differences between the two numerical methods for solving (1) also point out to another problem with the ill-posedness of the limit problem: the solution is numerically very un-stable. A more quantitative evaluation, that also validates that the effect in Figure 1 is not a result of choosing a “wrong” bandwidth σ, is given in Figure 2. We again simulated data from a mixture of two Gaussians, one Gaussian per class, this time in 20 dimensions, with one labeled point per class, and an increasing number of unlabeled points. In Figure 2 we plot the squared error, and the classification error of the resulting predictor y (x). We plot the classification error both when a threshold ˆ of zero is used (i.e. the class is determined by sign(ˆ(x))) and with the ideal threshold minimizing y the test error. For each unlabeled sample size, we choose the bandwidth σ yielding the best test performance (this is a “cheating” approach which provides a lower bound on the error of the best method for selecting the bandwidth). As the number of unlabeled examples increases the squared error approaches 1, indicating a flat predictor. Using a threshold of zero leads to an increase in the classification error, possibly due to numerical instability. Interestingly, although the predictors become very flat, the classification error using the ideal threshold actually improves slightly. Note that 5 DIRECT INVERSION SQUARED ERROR SIGN ERROR: 45% OPTIMAL BANDWIDTH 1 0.9 1 5 0 4 2 0.85 y(x) > 0 y(x) < 0 6 0.95 10 0 0 −1 10 0 200 400 600 800 0−1 ERROR (THRESHOLD=0) 0.32 −5 10 0 5 −10 0 −10 −5 −5 0 5 10 10 1 0 0 200 400 600 800 OPTIMAL BANDWIDTH 0.5 0 0 200 400 600 800 0−1 ERROR (IDEAL THRESHOLD) 0.19 5 200 400 600 800 OPTIMAL BANDWIDTH 1 0.28 SIGN ERR: 17.1 0.3 0.26 POWER METHOD 0 1.5 8 0 0.18 −1 10 6 0.17 4 −5 10 0 5 −10 0 −5 −10 −5 0 5 10 Figure 1: Left plots: Minimizer of Eq. (1). Right plots: the resulting classification according to sign(y). The four labeled points are shown by green squares. Top: minimization via Gaussian elimination (MATLAB backslash). Bottom: minimization via label propagation with 1000 iterations - the solution has not yet converged, despite small residuals of the order of 2 · 10−4 . 0.16 0 200 400 600 800 2 0 200 400 600 800 Figure 2: Squared error (top), classification error with a threshold of zero (center) and minimal classification error using ideal threhold (bottom), of the minimizer of (1) as a function of number of unlabeled points. 
For each error measure and sample size, the bandwidth minimizing the test error was used, and is plotted. ideal classification performance is achieved with a significantly larger bandwidth than the bandwidth minimizing the squared loss, i.e. when the predictor is even flatter. 4.2 Probabilistic Interpretation, Exit and Hitting Times As mentioned above, the Laplacian regularization method (1) has a probabilistic interpretation in terms of a random walk on the weighted graph. Let x(t) denote a random walk on the graph with transition matrix M = D−1 W where D is a diagonal matrix with Dii = j Wij . Then, for the binary classification case with yi = ±1 we have [15]: y (xi ) = 2 Pr x(t) hits a point labeled +1 before hitting a point labeled -1 x(0) = xi − 1 ˆ We present an interpretation of our analysis in terms of the limiting properties of this random walk. Consider, for simplicity, the case where the two classes are separated by a low density region. Then, the random walk has two intrinsic quantities of interest. The first is the mean exit time from one cluster to the other, and the other is the mean hitting time to the labeled points in that cluster. As the number of unlabeled points increases and σ → 0, the random walk converges to a diffusion process [12]. While the mean exit time then converges to a finite value corresponding to its diffusion analogue, the hitting time to a labeled point increases to infinity (as these become absorbing boundaries of measure zero). With more and more unlabeled data the random walk will fully mix, forgetting where it started, before it hits any label. Thus, the probability of hitting +1 before −1 will become uniform across the entire graph, independent of the starting location xi , yielding a flat predictor. 5 Keeping σ Finite At this point, a reader may ask whether the problems found in higher dimensions are due to taking the limit σ → 0. One possible objection is that there is an intrinsic characteristic scale for the data σ0 where (with high probability) all points at a distance xi − xj < σ0 have the same label. If this is the case, then it may not necessarily make sense to take values of σ < σ0 in constructing W . However, keeping σ finite while taking the number of unlabeled points to infinity does not resolve the problem. On the contrary, even the one-dimensional case becomes ill-posed in this case. To see this, consider a function y(x) which is zero everywhere except at the labeled points, where y(xj ) = yj . With a finite number of labeled points of measure zero, I (σ) (y) = 0 in any dimension 6 50 points 500 points 3500 points 1 1 0.5 0.5 0.5 0 0 0 −0.5 y 1 −0.5 −0.5 −1 −2 0 2 4 6 −1 −2 0 2 4 6 −1 −2 0 2 4 6 x Figure 3: Minimizer of (1) for a 1-d problem with a fixed σ = 0.4, two labeled points and an increasing number of unlabeled points. and for any fixed σ > 0. While this limiting function is discontinuous, it is also possible to construct ǫ→0 a sequence of continuous functions yǫ that all satisfy the constraints and for which I (σ) (yǫ ) −→ 0. This behavior is illustrated in Figure 3. We generated data from a mixture of two 1-D Gaussians centered at the origin and at x = 4, with one Gaussian labeled −1 and the other +1. We used two labeled points at the centers of the Gaussians and an increasing number of randomly drawn unlabeled points. As predicted, with a fixed σ, although the solution is reasonable when the number of unlabeled points is small, it becomes flatter, with sharp spikes on the labeled points, as u → ∞. 
6 Fourier-Eigenvector Based Methods Before we conclude, we discuss a different approach for SSL, also based on the Graph Laplacian, suggested by Belkin and Niyogi [3]. Instead of using the Laplacian as a regularizer, constraining candidate predictors y(x) non-parametrically to those with small In (y) values, here the predictors are constrained to the low-dimensional space spanned by the first few eigenvectors of the Laplacian: The similarity matrix W is computed as before, and the Graph Laplacian matrix L = D − W is considered (recall D is a diagonal matrix with Dii = j Wij ). Only predictors p j=1 aj ej y (x) = ˆ (15) spanned by the first p eigenvectors e1 , . . . , ep of L (with smallest eigenvalues) are considered. The coefficients aj are chosen by minimizing a loss function on the labeled data, e.g. the squared loss: (ˆ1 , . . . , ap ) = arg min a ˆ l j=1 (yj − y (xj ))2 . ˆ (16) Unlike the Laplacian Regularization method (1), the Laplacian Eigenvector method (15)–(16) is well posed in the limit u → ∞. This follows directly from the convergence of the eigenvectors of the graph Laplacian to the eigenfunctions of the corresponding Laplace-Beltrami operator [10, 4]. Eigenvector based methods were shown empirically to provide competitive generalization performance on a variety of simulated and real world problems. Belkin and Niyogi [3] motivate the approach by arguing that ‘the eigenfunctions of the Laplace-Beltrami operator provide a natural basis for functions on the manifold and the desired classification function can be expressed in such a basis’. In our view, the success of the method is actually not due to data lying on a low-dimensional manifold, but rather due to the low density separation assumption, which states that different class labels form high-density clusters separated by low density regions. Indeed, under this assumption and with sufficient separation between the clusters, the eigenfunctions of the graph Laplace-Beltrami operator are approximately piecewise constant in each of the clusters, as in spectral clustering [12, 11], providing a basis for a labeling that is constant within clusters but variable across clusters. In other settings, such as data uniformly distributed on a manifold but without any significant cluster structure, the success of eigenvector based methods critically depends on how well can the unknown classification function be approximated by a truncated expansion with relatively few eigenvectors. We illustrate this issue with the following three-dimensional example: Let p(x) denote the uniform density in the box [0, 1] × [0, 0.8] × [0, 0.6], where the box lengths are different to prevent eigenvalue multiplicity. Consider learning three different functions, y1 (x) = 1x1 >0.5 , y2 (x) = 1x1 >x2 /0.8 and y3 (x) = 1x2 /0.8>x3 /0.6 . Even though all three functions are relatively simple, all having a linear separating boundary between the classes on the manifold, as shown in the experiment described in Figure 4, the Eigenvector based method (15)–(16) gives markedly different generalization performances on the three targets. This happens both when the number of eigenvectors p is set to p = l/5 as suggested by Belkin and Niyogi, as well as for the optimal (oracle) value of p selected on the test set (i.e. a “cheating” choice representing an upper bound on the generalization error of this method). 7 Prediction Error (%) p = #labeled points/5 40 optimal p 20 labeled points 40 Approx. 
Error 50 20 20 0 20 20 40 60 # labeled points 0 10 20 40 60 # labeled points 0 0 5 10 15 # eigenvectors 0 0 5 10 15 # eigenvectors Figure 4: Left three panels: Generalization Performance of the Eigenvector Method (15)–(16) for the three different functions described in the text. All panels use n = 3000 points. Prediction counts the number of sign agreements with the true labels. Rightmost panel: best fit when many (all 3000) points are used, representing the best we can hope for with a few leading eigenvectors. The reason for this behavior is that y2 (x) and even more so y3 (x) cannot be as easily approximated by the very few leading eigenfunctions—even though they seem “simple” and “smooth”, they are significantly more complicated than y1 (x) in terms of measure of simplicity implied by the Eigenvector Method. Since the density is uniform, the graph Laplacian converges to the standard Laplacian and its eigenfunctions have the form ψi,j,k (x) = cos(iπx1 ) cos(jπx2 /0.8) cos(kπx3 /0.6), making it hard to represent simple decision boundaries which are not axis-aligned. 7 Discussion Our results show that a popular SSL method, the Laplacian Regularization method (1), is not wellbehaved in the limit of infinite unlabeled data, despite its empirical success in various SSL tasks. The empirical success might be due to two reasons. First, it is possible that with a large enough number of labeled points relative to the number of unlabeled points, the method is well behaved. This regime, where the number of both labeled and unlabeled points grow while l/u is fixed, has recently been analyzed by Wasserman and Lafferty [9]. However, we do not find this regime particularly satisfying as we would expect that having more unlabeled data available should improve performance, rather than require more labeled points or make the problem ill-posed. It also places the user in a delicate situation of choosing the “just right” number of unlabeled points without any theoretical guidance. Second, in our experiments we noticed that although the predictor y (x) becomes extremely flat, in ˆ binary tasks, it is still typically possible to find a threshold leading to a good classification performance. We do not know of any theoretical explanation for such behavior, nor how to characterize it. Obtaining such an explanation would be very interesting, and in a sense crucial to the theoretical foundation of the Laplacian Regularization method. On a very practical level, such a theoretical understanding might allow us to correct the method so as to avoid the numerical instability associated with flat predictors, and perhaps also make it appropriate for regression. The reason that the Laplacian regularizer (1) is ill-posed in the limit is that the first order gradient is not a sufficient penalty in high dimensions. This fact is well known in spline theory, where the Sobolev Embedding Theorem [1] indicates one must control at least d+1 derivatives in Rd . In the 2 context of Laplacian regularization, this can be done using the iterated Laplacian: replacing the d+1 graph Laplacian matrix L = D − W , where D is the diagonal degree matrix, with L 2 (matrix to d+1 the 2 power). In the infinite unlabeled data limit, this corresponds to regularizing all order- d+1 2 (mixed) partial derivatives. 
In the typical case of a low-dimensional manifold in a high dimensional ambient space, the order of iteration should correspond to the intrinsic, rather then ambient, dimensionality, which poses a practical problem of estimating this usually unknown dimensionality. We are not aware of much practical work using the iterated Laplacian, nor a good understanding of its appropriateness for SSL. A different approach leading to a well-posed solution is to include also an ambient regularization term [5]. However, the properties of the solution and in particular its relation to various assumptions about the “smoothness” of y(x) relative to p(x) remain unclear. Acknowledgments The authors would like to thank the anonymous referees for valuable suggestions. The research of BN was supported by the Israel Science Foundation (grant 432/06). 8 References [1] R.A. Adams, Sobolev Spaces, Academic Press (New York), 1975. [2] A. Azran, The rendevous algorithm: multiclass semi-supervised learning with Markov Random Walks, ICML, 2007. [3] M. Belkin, P. Niyogi, Using manifold structure for partially labelled classification, NIPS, vol. 15, 2003. [4] M. Belkin and P. Niyogi, Convergence of Laplacian Eigenmaps, NIPS 19, 2007. [5] M. Belkin, P. Niyogi and S. Sindhwani, Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples, JMLR, 7:2399-2434, 2006. [6] Y. Bengio, O. Delalleau, N. Le Roux, label propagation and quadratic criterion, in Semi-Supervised Learning, Chapelle, Scholkopf and Zien, editors, MIT Press, 2006. [7] O. Bosquet, O. Chapelle, M. Hein, Measure Based Regularization, NIPS, vol. 16, 2004. [8] M. Hein, Uniform convergence of adaptive graph-based regularization, COLT, 2006. [9] J. Lafferty, L. Wasserman, Statistical Analysis of Semi-Supervised Regression, NIPS, vol. 20, 2008. [10] U. von Luxburg, M. Belkin and O. Bousquet, Consistency of spectral clustering, Annals of Statistics, vol. 36(2), 2008. [11] M. Meila, J. Shi. A random walks view of spectral segmentation, AI and Statistics, 2001. [12] B. Nadler, S. Lafon, I.G. Kevrekidis, R.R. Coifman, Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators, NIPS, vol. 18, 2006. [13] B. Sch¨ lkopf, A. Smola, Learning with Kernels, MIT Press, 2002. o [14] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, B. Sch¨ lkopf, Learning with local and global consistency, o NIPS, vol. 16, 2004. [15] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-Supervised Learning using Gaussian fields and harmonic functions, ICML, 2003. 9
5 0.089725658 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition
Author: Ashish Kapoor, Eric Horvitz
Abstract: To date, the processes employed for active information acquisition during periods of learning and diagnosis have been considered as separate and have been applied in distinct phases of analysis. While active learning centers on the collection of information about training cases in order to build better predictive models, diagnosis uses fixed predictive models for guiding the collection of observations about a specific test case at hand. We introduce a model and inferential methods that bridge these phases of analysis into a holistic approach to information acquisition that considers simultaneously the extension of the predictive model and the probing of a case at hand. The bridging of active learning and real-time diagnostic feature acquisition leads to a new class of policies for learning and diagnosis. 1
6 0.086546041 148 nips-2009-Matrix Completion from Power-Law Distributed Samples
7 0.079820722 29 nips-2009-An Infinite Factor Model Hierarchy Via a Noisy-Or Mechanism
8 0.077811129 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models
9 0.073258869 122 nips-2009-Label Selection on Graphs
10 0.067266703 71 nips-2009-Distribution-Calibrated Hierarchical Classification
11 0.065315291 33 nips-2009-Analysis of SVM with Indefinite Kernels
12 0.06422969 166 nips-2009-Noisy Generalized Binary Search
13 0.06360092 254 nips-2009-Variational Gaussian-process factor analysis for modeling spatio-temporal data
14 0.061697897 211 nips-2009-Segmenting Scenes by Matching Image Composites
15 0.058892034 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections
16 0.057879187 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
17 0.057256214 157 nips-2009-Multi-Label Prediction via Compressed Sensing
18 0.056768693 3 nips-2009-AUC optimization and the two-sample problem
19 0.05443785 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
20 0.054394901 141 nips-2009-Local Rules for Global MAP: When Do They Work ?
topicId topicWeight
[(0, -0.164), (1, 0.077), (2, -0.012), (3, 0.032), (4, -0.023), (5, -0.026), (6, -0.027), (7, -0.053), (8, -0.045), (9, 0.029), (10, -0.051), (11, 0.018), (12, -0.019), (13, -0.058), (14, 0.028), (15, 0.006), (16, 0.056), (17, 0.035), (18, 0.086), (19, -0.095), (20, 0.084), (21, 0.003), (22, -0.107), (23, -0.02), (24, -0.063), (25, -0.113), (26, -0.133), (27, -0.119), (28, -0.023), (29, -0.073), (30, -0.061), (31, -0.11), (32, 0.108), (33, 0.088), (34, -0.085), (35, -0.011), (36, 0.084), (37, -0.167), (38, -0.089), (39, 0.053), (40, -0.079), (41, 0.16), (42, -0.075), (43, -0.006), (44, 0.049), (45, -0.102), (46, 0.056), (47, 0.043), (48, -0.04), (49, -0.166)]
simIndex simValue paperId paperTitle
same-paper 1 0.95410031 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous work has shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases in which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has any advantage. In this paper, we propose intuitively reasonable sufficient conditions under which an agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayes classification boundary and the underlying distribution are smooth to a finite order, active learning achieves a polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.
2 0.75589353 193 nips-2009-Potential-Based Agnostic Boosting
Author: Varun Kanade, Adam Kalai
Abstract: We prove strong noise-tolerance properties of a potential-based boosting algorithm, similar to MadaBoost (Domingo and Watanabe, 2000) and SmoothBoost (Servedio, 2003). Our analysis is in the agnostic framework of Kearns, Schapire and Sellie (1994), giving polynomial-time guarantees in presence of arbitrary noise. A remarkable feature of our algorithm is that it can be implemented without reweighting examples, by randomly relabeling them instead. Our boosting theorem gives, as easy corollaries, alternative derivations of two recent nontrivial results in computational learning theory: agnostically learning decision trees (Gopalan et al, 2008) and agnostically learning halfspaces (Kalai et al, 2005). Experiments suggest that the algorithm performs similarly to MadaBoost. 1
3 0.51462674 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition
Author: Ashish Kapoor, Eric Horvitz
Abstract: To date, the processes employed for active information acquisition during periods of learning and diagnosis have been considered as separate and have been applied in distinct phases of analysis. While active learning centers on the collection of information about training cases in order to build better predictive models, diagnosis uses fixed predictive models for guiding the collection of observations about a specific test case at hand. We introduce a model and inferential methods that bridge these phases of analysis into a holistic approach to information acquisition that considers simultaneously the extension of the predictive model and the probing of a case at hand. The bridging of active learning and real-time diagnostic feature acquisition leads to a new class of policies for learning and diagnosis. 1
4 0.50248039 71 nips-2009-Distribution-Calibrated Hierarchical Classification
Author: Ofer Dekel
Abstract: While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. Moreover, many hand-designed taxonomies are unbalanced and misrepresent the class structure in the underlying data distribution. We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the unfamiliar waters of geometric-type estimation. In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.
5 0.49283952 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
Author: Massih Amini, Nicolas Usunier, Cyril Goutte
Abstract: We address the problem of learning classifiers when observations have multiple views, some of which may not be observed for all examples. We assume the existence of view generating functions which may complete the missing views in an approximate way. This situation corresponds for example to learning text classifiers from multilingual collections where documents are not available in all languages. In that case, Machine Translation (MT) systems may be used to translate each document in the missing languages. We derive a generalization error bound for classifiers learned on examples with multiple artificially created views. Our result uncovers a trade-off between the size of the training set, the number of views, and the quality of the view generating functions. As a consequence, we identify situations where it is more interesting to use multiple views for learning instead of classical single view learning. An extension of this framework is a natural way to leverage unlabeled multi-view data in semi-supervised learning. Experimental results on a subset of the Reuters RCV1/RCV2 collections support our findings by showing that additional views obtained from MT may significantly improve the classification performance in the cases identified by our trade-off.
6 0.43127361 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models
7 0.42948505 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
8 0.3964617 47 nips-2009-Boosting with Spatial Regularization
9 0.39070287 131 nips-2009-Learning from Neighboring Strokes: Combining Appearance and Context for Multi-Domain Sketch Recognition
10 0.38773918 166 nips-2009-Noisy Generalized Binary Search
11 0.37928128 69 nips-2009-Discrete MDL Predicts in Total Variation
12 0.37849879 180 nips-2009-On the Convergence of the Concave-Convex Procedure
13 0.3699314 122 nips-2009-Label Selection on Graphs
14 0.3673974 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
15 0.36331677 252 nips-2009-Unsupervised Feature Selection for the $k$-means Clustering Problem
16 0.3591682 54 nips-2009-Compositionality of optimal control laws
17 0.35854635 112 nips-2009-Human Rademacher Complexity
18 0.34886044 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording
19 0.34418231 94 nips-2009-Fast Learning from Non-i.i.d. Observations
20 0.33994615 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
topicId topicWeight
[(24, 0.566), (25, 0.066), (35, 0.021), (36, 0.087), (39, 0.024), (58, 0.054), (61, 0.01), (71, 0.034), (86, 0.037), (91, 0.012)]
simIndex simValue paperId paperTitle
1 0.96603525 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright
Abstract: We study minimax rates for estimating high-dimensional nonparametric regression models with sparse additive structure and smoothness constraints. More precisely, our goal is to estimate a function f^* : R^p → R that has an additive decomposition of the form f^*(X_1, . . . , X_p) = Σ_{j∈S} h^*_j(X_j), where each component function h^*_j lies in some class H of “smooth” functions, and S ⊂ {1, . . . , p} is an unknown subset with cardinality s = |S|. Given n i.i.d. observations of f^*(X) corrupted with additive white Gaussian noise, where the covariate vectors (X_1, X_2, X_3, ..., X_p) are drawn with i.i.d. components from some distribution P, we determine lower bounds on the minimax rate for estimating the regression function with respect to squared-L^2(P) error. Our main result is a lower bound on the minimax rate that scales as max{ s log(p/s) / n , s ǫ_n^2(H) }. The first term reflects the sample size required for performing subset selection, and is independent of the function class H. The second term s ǫ_n^2(H) is an s-dimensional estimation term corresponding to the sample size required for estimating a sum of s univariate functions, each chosen from the function class H. It depends linearly on the sparsity index s but is independent of the global dimension p. As a special case, if H corresponds to functions that are m-times differentiable (an m-th order Sobolev space), then the s-dimensional estimation term takes the form s ǫ_n^2(H) ≍ s n^{-2m/(2m+1)}. Either of the two terms may be dominant in different regimes, depending on the relation between the sparsity and smoothness of the additive decomposition.
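Restating the lower bound from the abstract above as a display, with the braces and subscripts that the text extraction scrambled reconstructed; the minimax-risk notation (infimum over estimators, supremum over functions in the class) is the standard form and is an assumption here rather than a quotation from the paper.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\[
\inf_{\hat f}\; \sup_{f^{*}}\;
\mathbb{E}\,\bigl\|\hat f - f^{*}\bigr\|_{L^{2}(\mathbb{P})}^{2}
\;\gtrsim\;
\max\!\left\{ \frac{s \log (p/s)}{n},\; s\,\epsilon_{n}^{2}(\mathcal{H}) \right\},
\qquad
s\,\epsilon_{n}^{2}(\mathcal{H}) \asymp s\, n^{-2m/(2m+1)}
\text{ for an $m$-th order Sobolev class.}
\]
\end{document}
```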
2 0.94631147 181 nips-2009-Online Learning of Assignments
Author: Matthew Streeter, Daniel Golovin, Andreas Krause
Abstract: Which ads should we display in sponsored search in order to maximize our revenue? How should we dynamically rank information sources to maximize the value of the ranking? These applications exhibit strong diminishing returns: Redundancy decreases the marginal utility of each ad or information source. We show that these and other problems can be formalized as repeatedly selecting an assignment of items to positions to maximize a sequence of monotone submodular functions that arrive one by one. We present an efficient algorithm for this general problem and analyze it in the no-regret model. Our algorithm possesses strong theoretical guarantees, such as a performance ratio that converges to the optimal constant of 1 − 1/e. We empirically evaluate our algorithm on two real-world online optimization problems on the web: ad allocation with submodular utilities, and dynamically ranking blogs to detect information cascades.
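The abstract above describes repeatedly choosing an assignment of items to positions against a sequence of monotone submodular utility functions. The sketch below is a simplified version of that setup — one no-regret (Hedge) learner per position, rewarded with marginal gains — and is not the paper's algorithm or analysis; it assumes full-information feedback, and the learning rate, reward construction, and toy coverage utilities are illustrative choices.

```python
import math
import random

class Hedge:
    """Multiplicative-weights (Hedge) learner over a fixed set of items."""
    def __init__(self, n_items, eta):
        self.weights = [1.0] * n_items
        self.eta = eta

    def sample(self):
        total = sum(self.weights)
        r = random.random() * total
        acc = 0.0
        for i, w in enumerate(self.weights):
            acc += w
            if r <= acc:
                return i
        return len(self.weights) - 1

    def update(self, rewards):
        # rewards[i] in [0, 1] for every item (full-information feedback assumed)
        for i, r in enumerate(rewards):
            self.weights[i] *= math.exp(self.eta * r)

def online_assignment(utilities, n_items, n_positions, eta=0.5):
    """utilities: one monotone submodular set function per round, mapping a
    frozenset of item indices to a value in [0, 1]. One Hedge learner per
    position; position k is rewarded with the marginal gain of each item given
    the items already placed at earlier positions. Returns total utility."""
    experts = [Hedge(n_items, eta) for _ in range(n_positions)]
    total = 0.0
    for f in utilities:
        prefix = frozenset()
        reward_vectors = []
        for k in range(n_positions):
            base = f(prefix)
            # Marginal gain of every candidate item at this position.
            reward_vectors.append([f(prefix | {i}) - base for i in range(n_items)])
            item = experts[k].sample()
            prefix = prefix | {item}
        total += f(prefix)
        for k in range(n_positions):
            experts[k].update(reward_vectors[k])
    return total

if __name__ == "__main__":
    # Toy weighted-coverage utilities (monotone submodular, normalized to [0, 1]).
    random.seed(1)
    n_items, n_positions, rounds = 6, 2, 50
    covers = [set(random.sample(range(10), 4)) for _ in range(n_items)]
    def make_round():
        w = [random.random() for _ in range(10)]
        z = sum(w)
        def f(S):
            covered = set().union(*(covers[i] for i in S)) if S else set()
            return sum(w[e] for e in covered) / z
        return f
    print("total utility:", online_assignment([make_round() for _ in range(rounds)],
                                               n_items, n_positions))
```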
same-paper 3 0.93535662 240 nips-2009-Sufficient Conditions for Agnostic Active Learnable
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, i.e. the agnostic setting. Previous works have shown that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have advantage. In this paper, we propose intuitively reasonable sufficient conditions under which agnostic active learning algorithm is strictly superior to passive supervised learning. We show that under some noise condition, if the Bayesian classification boundary and the underlying distribution are smooth to a finite order, active learning achieves polynomial improvement in the label complexity; if the boundary and the distribution are infinitely smooth, the improvement is exponential.
4 0.91239411 221 nips-2009-Solving Stochastic Games
Author: Liam M. Dermed, Charles L. Isbell
Abstract: Solving multi-agent reinforcement learning problems has proven difficult because of the lack of tractable algorithms. We provide the first approximation algorithm which solves stochastic games with cheap-talk to within ε absolute error of the optimal game-theoretic solution, in time polynomial in 1/ε. Our algorithm extends Murray’s and Gordon’s (2007) modified Bellman equation which determines the set of all possible achievable utilities; this provides us a truly general framework for multi-agent learning. Further, we empirically validate our algorithm and find the computational cost to be orders of magnitude less than what the theory predicts.
5 0.88966787 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
Author: Yang Wang, Gholamreza Haffari, Shaojun Wang, Greg Mori
Abstract: We propose a novel information theoretic approach for semi-supervised learning of conditional random fields that defines a training objective to combine the conditional likelihood on labeled data and the mutual information on unlabeled data. In contrast to previous minimum conditional entropy semi-supervised discriminative learning methods, our approach is grounded on a more solid foundation, the rate distortion theory in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems.
6 0.76075697 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
7 0.73895115 45 nips-2009-Beyond Convexity: Online Submodular Minimization
8 0.68550444 161 nips-2009-Nash Equilibria of Static Prediction Games
9 0.67928368 156 nips-2009-Monte Carlo Sampling for Regret Minimization in Extensive Games
10 0.64042556 12 nips-2009-A Generalized Natural Actor-Critic Algorithm
11 0.63986337 239 nips-2009-Submodularity Cuts and Applications
12 0.61871707 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
13 0.61452222 122 nips-2009-Label Selection on Graphs
14 0.61431199 14 nips-2009-A Parameter-free Hedging Algorithm
15 0.60862333 178 nips-2009-On Stochastic and Worst-case Models for Investing
16 0.60732764 94 nips-2009-Fast Learning from Non-i.i.d. Observations
17 0.60075086 55 nips-2009-Compressed Least-Squares Regression
18 0.58898497 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers
19 0.57809329 232 nips-2009-Strategy Grafting in Extensive Games
20 0.57584935 147 nips-2009-Matrix Completion from Noisy Entries