jmlr jmlr2011 jmlr2011-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combined with existing results, this shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function
Reference: text
sentIndex sentText sentNum sentScore
1 China Editor: Rocco Servedio Abstract We study pool-based active learning in the presence of noise, that is, the agnostic setting. [sent-6, score-0.504]
2 It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. [sent-7, score-0.572]
3 Although there are many cases on which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. [sent-8, score-0.566]
4 Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. [sent-9, score-0.831]
5 In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. [sent-10, score-0.774]
6 We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. [sent-11, score-0.586]
7 Combined with existing results, this shows that active learning is superior to passive supervised learning for smooth problems. [sent-12, score-0.679]
8 Keywords: active learning, disagreement coefficient, label complexity, smooth function 1. [sent-13, score-0.954]
9 The simplest example that demonstrates the potential of active learning is to learn the optimal threshold on an interval. [sent-20, score-0.329]
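To make this threshold example concrete, here is a minimal sketch (Python; the oracle query_label and the other names are illustrative, not from the paper) of noise-free active learning by binary search over a pool on [0, 1]: it localizes the optimal threshold using on the order of log(1/ε) label requests, whereas passive learning needs on the order of 1/ε labels.

    import numpy as np

    def active_threshold(pool, query_label, eps=1e-3):
        # Binary search over a sorted pool of unlabeled points in [0, 1].
        # Assumes noiseless labels y = 1[x >= true threshold] and that the
        # threshold lies strictly inside the range of the pool.
        pool = np.sort(np.asarray(pool))
        lo, hi = 0, len(pool) - 1
        queries = 0
        while lo + 1 < hi and pool[hi] - pool[lo] > eps:
            mid = (lo + hi) // 2
            queries += 1
            if query_label(pool[mid]) == 1:   # threshold is to the left of pool[mid]
                hi = mid
            else:                             # threshold is to the right of pool[mid]
                lo = mid
        return (pool[lo] + pool[hi]) / 2, queries

    # Usage: roughly log2(n) label requests instead of the ~1/eps a passive learner needs.
    rng = np.random.default_rng(0)
    pool = rng.uniform(0, 1, 10_000)
    estimate, used = active_threshold(pool, query_label=lambda x: int(x >= 0.37))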
10 If the data are distributed on the unit sphere of R^d, and the distribution has a density function upper and lower bounded by λ and 1/λ respectively, where λ is some constant, then active learning can still give exponential savings in the label complexity (Dasgupta, 2005). [sent-25, score-0.493]
11 However, there are also very simple problems for which active learning does not help. [sent-26, score-0.283]
12 In this case, for any active learning algorithm there exists a distribution (i. [sent-29, score-0.283]
13 Of more interest, and more realistic, is the agnostic setting, where the best classifier in the hypothesis space has a non-zero error ν. [sent-35, score-0.289]
14 For agnostic active learning, there is no active learning algorithm that can always reduce label requests, due to a lower bound of Ω(ν²/ε²) on the label complexity (Kääriäinen, 2006). [sent-36, score-1.112]
15 Previous results have shown that whether active learning helps relies crucially on the disagreement coefficient of the learning problem (Hanneke, 2007). [sent-37, score-0.676]
16 The disagreement coefficient depends on the distribution of the instance-label pairs and the hypothesis space and often describes the intrinsic difficulty of the active learning problem. [sent-38, score-0.717]
17 In particular, it has been shown that the label complexity of two important agnostic active learning algorithms, A² (Balcan et al., 2006) and the algorithm of Dasgupta et al. [sent-39, score-0.686]
18 (2007) (which will be referred to as DHM), is characterized by the disagreement coefficient. [sent-41, score-0.366]
19 If the disagreement coefficient is small, active learning usually has smaller label complexity than passive learning. [sent-42, score-1.027]
20 In this paper, we study the disagreement coefficient for smooth problems. [sent-43, score-0.528]
21 Specifically we analyze the disagreement coefficient for learning problems whose classification boundaries are smooth. [sent-44, score-0.44]
22 Under some mild assumptions on the distribution, we show that the magnitude of the disagreement coefficient depends on the order of smoothness. [sent-46, score-0.366]
23 Combining with known upper bounds on the label complexity in terms of the disagreement coefficient, we give a sufficient condition under which active learning is strictly superior to passive learning. [sent-48, score-1.123]
24 1 Related Works Our work is closely related to Castro and Nowak (2008), which proved label complexity bounds for problems with smooth classification boundaries under Tsybakov's noise condition (Tsybakov, 2004). [sent-50, score-0.507]
25 He gave conditions under which the disagreement coefficient is always bounded from above by a constant. [sent-56, score-0.366]
26 In our active learning model, the algorithm has access to a pool of unlabeled examples from DX . [sent-62, score-0.372]
27 A² (Balcan et al., 2006) is the first rigorous agnostic active learning algorithm. [sent-84, score-0.504]
28 It can be viewed as a robust version of the active learning algorithm due to Cohn et al. [sent-85, score-0.283]
29 It was shown that A2 is never much worse than passive learning in terms of the label complexity. [sent-88, score-0.339]
30 One is the current version space V, consisting of hypotheses that, with statistical confidence, are not too bad compared to h∗; the other is the region of disagreement DIS(V), which is the set of all x ∈ X for which there are hypotheses in V that disagree on x. [sent-92, score-0.438]
31 This process, and in turn the label complexity of A2 , are nicely characterized by the disagreement coefficient θ introduced in Hanneke (2007). [sent-101, score-0.548]
32 The disagreement coefficient θ(ε) is defined as θ(ε) = sup_{r≥ε} Pr_{X∼D_X}(X ∈ DIS(B(h∗, r))) / r = sup_{r≥ε} ∆(B(h∗, r)) / r, where h∗ = arg min_{h∈H} er_D(h). [sent-115, score-0.446]
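As a concrete illustration of this definition (my own, not from the paper), the following Python sketch Monte-Carlo estimates sup_{r≥ε} Pr_{X∼D_X}(X ∈ DIS(B(h∗, r)))/r on a grid of r values for one-dimensional threshold classifiers under the uniform marginal on [0, 1]; for this class the ratio stays near 2, in line with the threshold example discussed later in the text.

    import numpy as np

    def disagreement_coefficient_thresholds(t_star, eps, n_mc=200_000, seed=0):
        # Empirical theta(eps) for h_t(x) = 1[x >= t] with X ~ Uniform[0, 1].
        # Under this marginal Pr(h_t != h_s) = |t - s|, so B(h*, r) is the set of
        # thresholds within distance r of t_star and DIS(B(h*, r)) is an interval.
        rng = np.random.default_rng(seed)
        X = rng.uniform(0, 1, n_mc)
        best = 0.0
        for r in np.geomspace(eps, 1.0, 50):        # sup over r >= eps, on a grid
            lo, hi = max(0.0, t_star - r), min(1.0, t_star + r)
            mass = np.mean((X > lo) & (X < hi))     # Pr(X in DIS(B(h*, r)))
            best = max(best, mass / r)
        return best

    print(disagreement_coefficient_thresholds(t_star=0.4, eps=0.01))   # approximately 2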
33 The following is an upper bound on the label complexity of A² in terms of the disagreement coefficient θ(ε) (Hanneke, 2007). [sent-117, score-0.576]
34 (The bound above is for a problem with ν = 0.) Another important agnostic active learning algorithm is DHM. [sent-121, score-0.504]
35 DHM reduces active learning to a series of constrained supervised learning problems. [sent-123, score-0.283]
36 If we can, we put the data point and the confidently guessed label (x, ŷ) into the guessed set G; otherwise, we request the true label y of x and put (x, y) into the true set T. [sent-125, score-0.328]
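A heavily simplified sketch of this guess-or-query loop is given below (Python). It is not the actual DHM algorithm: the statistical test is abstracted into a caller-supplied slack threshold_fn, and erm_with_constraint is a hypothetical routine returning the best empirical error achievable on the current labeled data when x is forced to carry a given label.

    def dhm_style_loop(unlabeled_pool, query_label, erm_with_constraint, threshold_fn):
        # Sketch of a DHM-style loop: infer the label when the constrained ERM errors
        # differ by more than the slack delta_t, otherwise request the true label.
        G, T = [], []                                  # guessed set and true (queried) set
        for t, x in enumerate(unlabeled_pool, start=1):
            S = G + T                                  # labeled data collected so far
            err_plus = erm_with_constraint(S, x, +1)   # best empirical error with x forced to +1
            err_minus = erm_with_constraint(S, x, -1)  # best empirical error with x forced to -1
            delta_t = threshold_fn(t)                  # statistical slack (DHM uses a specific bound)
            if err_plus - err_minus > delta_t:         # +1 is implausible: confidently guess -1
                G.append((x, -1))
            elif err_minus - err_plus > delta_t:       # -1 is implausible: confidently guess +1
                G.append((x, +1))
            else:                                      # cannot decide: request the true label
                T.append((x, query_label(x)))
        return G, T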
37 They also showed that DHM is never much worse than passive learning, and it has label complexity Õ( θ(ν + ε) (1 + ν²/ε²) · VC(H) · polylog(1/ε, 1/δ) ).  (2) [sent-136, score-0.406]
38 From (1) and (2) it can be seen that if ε > ν, the term ν²/ε² is upper bounded by 1 and the label complexity of the active learning algorithms crucially depends on the disagreement coefficient θ. [sent-137, score-0.886]
39 In fact, this bound cannot be improved: it is known that for some hypothesis space H, for every active learning algorithm A, there is a learning problem (to be concrete, an h∗) such that the label complexity of A is at least Ω(ν²/ε²) (Kääriäinen, 2006). [sent-139, score-0.533]
40 Thus Ω(ν²/ε²) is a minimax lower bound on the label complexity of agnostic active learning algorithms. [sent-140, score-0.686]
41 Although no active learning algorithm is superior to passive learning in all agnostic settings, it turns out that if the disagreement coefficient is small, active learning does always help under a finer parametrization of the noise distribution, known as Tsybakov’s noise condition (Tsybakov, 2004). [sent-141, score-1.485]
42 Under Tsybakov’s noise condition, Hanneke (2009, 2011) proved that a variant of the DHM algorithm (by choosing the threshold ∆t based on local Rademacher complexity (Koltchinskii, 2006)) has the following asymptotic label complexity. [sent-147, score-0.277]
43 Inspired by this result, Koltchinskii (2010) further proved that under similar conditions a variant of the A² algorithm has label complexity O( θ(ε^{1/κ}) (1/ε)^{2 − (2−p)/κ} (log(1/δ) + log log(1/ε)) ).  (4) [sent-154, score-0.311]
44 In contrast, the sample complexity for passive learning under the same conditions is known to be (Tsybakov, 2004) O( (1/ε)^{2 − (1−p)/κ} (log(1/δ) + log log(1/ε)) ),  (5) and it is also a minimax lower bound. [sent-158, score-0.364]
45 Comparing (3), (4) and (5) one can see that whether active learning is strictly superior to passive learning entirely depends on how small the disagreement coefficient θ(ε) is. [sent-159, score-0.883]
46 (2010) developed an active learning algorithm which does not need to keep the version space and therefore is computationally efficient. [sent-165, score-0.283]
47 There are also upper bounds on the label complexity of these two algorithms in terms of the disagreement coefficient. [sent-166, score-0.606]
48 Finally, for a comprehensive survey of the theoretical research on active learning, please see the excellent tutorial (Dasgupta and Langford, 2009). [sent-167, score-0.283]
49 Main Results As described in the previous section, whether active learning helps largely depends on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem using a given hypothesis space. [sent-169, score-0.717]
50 So it is important to understand whether the disagreement coefficient is small for learning problems of practical and theoretical interest. [sent-170, score-0.366]
51 In this section we give bounds on the disagreement coefficient for problems that have smooth classification boundaries, under additional assumptions on the distribution. [sent-171, score-0.558]
52 Such smooth problems are often referred to as the boundary fragment class and have been extensively studied in passive learning and especially in the theory of empirical processes. [sent-172, score-0.489]
53 2 contains the main results, where we establish upper and lower bounds for the disagreement coefficient of smooth problems. [sent-176, score-0.586]
54 For a multi-index k = (k_1, . . . , k_d) with |k| = Σ_{i=1}^d k_i, let D^k = ∂^{|k|} / (∂^{k_1}x_1 ⋯ ∂^{k_d}x_d) be the differential operator. [sent-184, score-0.276]
55 With the definitions of smoothness, we introduce the hypothesis space we use in active learning algorithms. [sent-193, score-0.351]
56 Definition 7 (Hypotheses with Smooth Classification Boundaries) A set of hypotheses H_C^α defined on X = [0, 1]^{d+1} is said to have αth order smooth classification boundaries if, for every h ∈ H_C^α, the classification boundary is an αth order smooth function on [0, 1]^d. [sent-194, score-0.444]
57 Similarly, a hypothesis space H_C^∞ is said to have infinitely smooth classification boundaries if, for every h ∈ H_C^∞, the classification boundary is the graph of an infinitely smooth function on [0, 1]^d. [sent-203, score-0.476]
58 2 Disagreement Coefficient The disagreement coefficient θ plays an important role in the label complexity of active learning algorithms. [sent-213, score-0.831]
59 In fact, the previous negative examples for which active learning does not work all have a large θ. [sent-214, score-0.283]
60 For instance, in the interval learning problem θ(ε) = 1/ε, which leads to the same label complexity as passive learning. [sent-215, score-0.378]
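A short worked calculation (mine, not taken from the paper; it assumes the uniform marginal on [0, 1] and an optimal interval of negligible width) shows where the 1/ε comes from and contrasts it with the threshold class:

    For intervals $h_{[a,b]}(x) = \mathbf{1}[a \le x \le b]$ with $X \sim \mathrm{Unif}[0,1]$ and $h^*$ an
    interval of width close to $0$, every interval of probability mass at most $r$ lies in $B(h^*, r)$,
    so every $x \in [0,1]$ is in the region of disagreement:
    \[
      \mathrm{DIS}(B(h^*, r)) = [0,1], \qquad
      \theta(\varepsilon) = \sup_{r \ge \varepsilon} \frac{\Pr(X \in \mathrm{DIS}(B(h^*, r)))}{r}
                          = \sup_{r \ge \varepsilon} \frac{1}{r} = \frac{1}{\varepsilon}.
    \]
    For thresholds $h_t(x) = \mathbf{1}[x \ge t]$, by contrast, $\mathrm{DIS}(B(h^*, r)) = (t^* - r,\, t^* + r) \cap [0,1]$
    has mass at most $2r$, so $\theta(\varepsilon) \le 2$ for all $\varepsilon > 0$.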
61 In this section we will show that the disagreement coefficient θ(ε) for smooth problems is small. [sent-217, score-0.446]
62 In particular, we establish both upper bounds (Theorem 9 and Theorem 10) and lower bounds (Theorem 13) for the disagreement coefficient of smooth problems. [sent-218, score-0.616]
63 Finally we will combine our upper bounds on the disagreement coefficient with the label complexity result of Theorem 4 and show that active learning is strictly superior to passive learning for smooth problems. [sent-219, score-1.285]
64 The key points in the theorem are: the classification boundaries are smooth; and the density is bounded from above and below by constants times a smooth function. [sent-223, score-0.288]
65 Hence only the points close to the classification boundary of h∗ can be in DIS(B(h∗ , ε)), which leads to a small disagreement coefficient. [sent-230, score-0.45]
66 There exist a function Φ̃_h(x_1, . . . , x_d) and two constants 0 < a ≤ b such that a|Φ̃_h| ≤ |Φ_h| ≤ b|Φ̃_h|. [sent-280, score-0.276]
67 To see this, remember that f_h and f_{h∗} are αth order smooth functions, and the density p is upper and lower bounded by constants times an αth order smooth function g(x_1, . . . , x_d). [sent-281, score-0.516]
68 Now consider the region of disagreement of B(h∗ , r). [sent-302, score-0.366]
69 In the next theorem, we give lower bounds on the disagreement coefficient for finitely smooth problems under the condition that the marginal distribution D_X is the uniform distribution. [sent-310, score-0.558]
70 Then the disagreement coefficient has the following lower bound: θ(ε) = Ω( (1/ε)^{d/(α+d)} ). [sent-315, score-0.366]
71 For infinite smoothness, we do not know any lower bound for the disagreement coefficient larger than the trivial Ω(1). [sent-376, score-0.366]
72 Label Complexity for Smooth Problems: Now we combine our results (Theorem 9 and Theorem 10) with the label complexity bounds for active learning (Theorem 4 and (4)) and show that active learning is strictly superior to passive learning for smooth problems. [sent-379, score-1.174]
73 Remember that under Tsybakov's noise conditions the label complexity of active learning is (see Theorem 4 and (4)) O( θ(ε^{1/κ}) (1/ε)^{2 − (2−p)/κ} (log(1/ε) + log(1/δ)) ). [sent-380, score-0.6]
74 While for passive learning it is (see (5)) O( (1/ε)^{2 − (1−p)/κ} (log(1/δ) + log log(1/ε)) ). We see that if θ(ε^{1/κ}) = o( (1/ε)^{1/κ} ), [sent-381, score-0.325]
75 then active learning requires strictly fewer labels than passive learning. [sent-382, score-0.479]
76 Then the active learning algorithms A² and DHM have label complexity strictly smaller than that of passive learning for αth order smooth problems whenever α > d. [sent-386, score-0.823]
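A quick check (my own calculation, relying on the bound of Theorem 9 and on the condition stated just above) of why finite smoothness suffices:

    With $\theta(\varepsilon) = O\big((1/\varepsilon)^{d/(\alpha+d)}\big)$ from Theorem 9,
    \[
      \theta\big(\varepsilon^{1/\kappa}\big)
        = O\Big((1/\varepsilon)^{\frac{d}{\kappa(\alpha+d)}}\Big)
        = o\Big((1/\varepsilon)^{1/\kappa}\Big)
      \quad\text{since}\quad \frac{d}{\alpha+d} < 1,
    \]
    so active learning saves a polynomial factor in $1/\varepsilon$ over passive learning; the corollary's extra
    requirement $\alpha > d$ presumably reflects the regime in which the complexity exponent $p = d/\alpha$ is
    below one, where the bounds (4) and (5) apply.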
77 3 Discussion In this section we discuss and compare our results to a closely related work due to Castro and Nowak (2008), which also studied the label complexity of smooth problems under Tsybakov’s noise condition. [sent-388, score-0.393]
78 Under these assumptions, Castro and Nowak showed that an active learning algorithm they attributed to Burnashev and Zigangirov (1974) (referred to as BZ), which is essentially a Bayesian binary search algorithm, has label complexity Õ( (1/ε)^{2 − 2/κ} ). [sent-396, score-0.465]
79 This model is called membership query, making stronger assumptions than the pool-based active learning model. [sent-399, score-0.283]
80 They assume that on every vertical line segment {(x_1, . . . , x_d, x_{d+1}) : x_{d+1} ∈ [0, 1]} in [0, 1]^{d+1}, [sent-407, score-0.276]
81 where (x_1, x_2, . . . , x_d) ∈ [0, 1]^d is fixed, the one-dimensional conditional distribution η_{x_1,...,x_d} satisfies a one-dimensional Tsybakov noise condition. [sent-410, score-0.276]
82 Based on these assumptions, they proposed the following active learning algorithm: choose M vertical line segments in [0, 1]^{d+1}. [sent-414, score-0.283]
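The sketch below (Python; my simplification for illustration, with plain noise-free bisection standing in for the BZ subroutine and a polynomial fit standing in for their estimator) shows the overall shape of such a line-search strategy for d = 1: run a one-dimensional active search on each of the M vertical lines to locate where the label flips, then fit a smooth function through the M estimated boundary points.

    import numpy as np

    def boundary_on_line(x1, query_label, n_queries=12):
        # 1-d active search for the label flip along the vertical line at x1
        # (noise-free bisection; the BZ subroutine handles the noisy case).
        lo, hi = 0.0, 1.0
        for _ in range(n_queries):
            mid = (lo + hi) / 2
            if query_label(x1, mid) == 1:   # above the boundary
                hi = mid
            else:                           # below the boundary
                lo = mid
        return (lo + hi) / 2

    def fit_boundary(M, query_label, degree=5):
        xs = np.linspace(0.0, 1.0, M)
        ys = np.array([boundary_on_line(x, query_label) for x in xs])  # M * n_queries labels in total
        return np.polynomial.Polynomial.fit(xs, ys, deg=degree)        # smooth boundary estimate

    # Usage with a smooth true boundary f(x1) = 0.5 + 0.2 sin(2*pi*x1):
    oracle = lambda x1, x2: int(x2 >= 0.5 + 0.2 * np.sin(2 * np.pi * x1))
    boundary_hat = fit_boundary(M=20, query_label=oracle)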
83 Comparing the label complexity of this algorithm, Õ( (1/ε)^{2 − (2 − d/α)/κ} ), with that obtained from Theorem 4 and Proposition 8, Õ( θ(ε_0) (1/ε)^{2 − (2 − d/α)/κ} ), one sees that their label complexity is smaller by a factor of θ(ε_0). [sent-421, score-0.364]
84 It seems that the disagreement coefficient of smooth problems does not play a role in their label complexity formula. [sent-422, score-0.71]
85 Recall that the disagreement coefficient of the one-dimensional threshold learning problem is at most 2 for all ε > 0, so there seems to be no θ(ε) term in the final label complexity formula. [sent-428, score-0.594]
86 But by Theorem 4 and (4) we know that both A² and DHM (with slight modifications) have the same label complexity and convergence property, since the disagreement coefficient for this threshold problem is θ(ε) = 2 for all ε > 0. [sent-431, score-0.594]
87 But under the ordinary Tsybakov condition, the largest κ on the lines can be arbitrarily large, resulting in a label complexity equal to that of passive learning. [sent-439, score-0.378]
88 ... where now the domain of f is (x_1, . . . , x_{d−1}). [sent-477, score-0.276]
89 For any fixed (x_1, . . . , x_{d−1}), consider g_d as a function of the single variable x_d. [sent-524, score-0.321]
90 Since f is the first order derivative of g_d, it is easy to check that g_d has derivatives up to order α + 1 with respect to x_d, and its (α + 1)th order derivative is Hölder continuous with constant C. [sent-525, score-0.421]
91 Now for r sufficiently small, take n = log(1/r) / log log(1/r). [sent-657, score-0.387]
92 By Stirling's formula, r · n! is of order r √(2πn) n^n e^{−n}, which with this choice of n tends to zero as r → 0, and therefore M′_n = max(M_0 n!, M_n) stays bounded for r small enough. [sent-662, score-0.559]
93 Thus we have, by Theorem 17, ‖Φ‖_∞ ≤ M_d ≤ C n^d M_0^{1 − d/n} M′_n^{d/n} ≤ C r log^{2d}(1/r). [sent-664, score-0.301]
94 Conclusion This paper studies the disagreement coefficient of smooth problems and extends our previous results (Wang, 2009). [sent-707, score-0.528]
95 Compared to the worst case θ(ε) = 1/ε, for which active learning has the same label complexity as passive learning, the disagreement coefficient is θ(ε) = O( (1/ε)^{d/(α+d)} ) for αth order (α < ∞) smooth problems, and θ(ε) = O( log^{2d}(1/ε) ) for infinitely smooth problems. [sent-708, score-1.351]
96 Combining these with the bounds on the label complexity in terms of the disagreement coefficient, we give sufficient conditions under which the active learning algorithms A² and DHM are superior to passive learning under Tsybakov's noise condition. [sent-709, score-1.144]
97 This would include a substantially richer class of problems which can benefit from active learning. [sent-719, score-0.283]
98 For infinitely smooth problems we proved that the disagreement coefficient can be upper and lower bounded by O( log^{2d}(1/ε) ) and Ω(1), respectively. [sent-721, score-0.556]
99 A bound on the label complexity of agnostic active learning. [sent-793, score-0.686]
100 Rademacher complexities and bounding the excess risk in active learning. [sent-820, score-0.283]
wordName wordTfidf (topN-words)
[('disagreement', 0.366), ('active', 0.283), ('xd', 0.276), ('mn', 0.268), ('agnostic', 0.221), ('mk', 0.207), ('passive', 0.196), ('cnk', 0.18), ('dx', 0.173), ('smooth', 0.162), ('tsybakov', 0.161), ('dhm', 0.152), ('moothness', 0.152), ('oefficient', 0.152), ('label', 0.143), ('dis', 0.138), ('fh', 0.138), ('isagreement', 0.13), ('hc', 0.107), ('mnn', 0.097), ('vc', 0.097), ('erd', 0.094), ('th', 0.086), ('boundary', 0.084), ('coef', 0.083), ('fc', 0.078), ('tt', 0.074), ('boundaries', 0.074), ('zd', 0.072), ('dasgupta', 0.072), ('hypothesis', 0.068), ('dk', 0.067), ('wang', 0.066), ('castro', 0.063), ('hy', 0.063), ('pr', 0.056), ('dxd', 0.055), ('gorny', 0.055), ('hanneke', 0.053), ('nitely', 0.053), ('theorem', 0.052), ('noise', 0.049), ('minh', 0.047), ('fragment', 0.047), ('threshold', 0.046), ('gd', 0.045), ('unlabeled', 0.045), ('pool', 0.044), ('exponent', 0.044), ('log', 0.043), ('vaart', 0.043), ('request', 0.042), ('ub', 0.042), ('wellner', 0.042), ('ari', 0.042), ('bz', 0.042), ('chicheng', 0.042), ('dud', 0.042), ('prx', 0.042), ('une', 0.042), ('yanqi', 0.042), ('ziteng', 0.042), ('sup', 0.04), ('complexity', 0.039), ('lb', 0.039), ('classi', 0.039), ('superior', 0.038), ('earning', 0.038), ('nowak', 0.037), ('gt', 0.037), ('beygelzimer', 0.036), ('hypotheses', 0.036), ('gi', 0.035), ('ud', 0.034), ('hypercube', 0.034), ('bracketing', 0.032), ('derivatives', 0.031), ('smoothness', 0.031), ('mm', 0.031), ('xb', 0.031), ('bounds', 0.03), ('er', 0.03), ('kai', 0.029), ('des', 0.029), ('guess', 0.029), ('upper', 0.028), ('der', 0.028), ('burnashev', 0.028), ('cnd', 0.028), ('ergt', 0.028), ('learnh', 0.028), ('mitrinovi', 0.028), ('balcan', 0.027), ('crucially', 0.027), ('dai', 0.026), ('remember', 0.026), ('rademacher', 0.024), ('lder', 0.024), ('liwei', 0.024), ('polylog', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combined with existing results, this shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function
2 0.11759889 12 jmlr-2011-Bayesian Co-Training
Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao
Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing
3 0.078935355 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
4 0.068470873 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
Author: Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor
Abstract: It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exist a wide variety of methods coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated which is in contrast to the independent and identically distributed (i.i.d.) case. Finally, some extensions are proposed to use the proposed criterion in support vector machines and least squares support vector machines for regression. Keywords: nonparametric regression, correlated errors, bandwidth choice, cross-validation, shortrange dependence, bimodal kernel
5 0.068158098 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
Author: Daniil Ryabko
Abstract: A sequence x1 , . . . , xn , . . . of discrete-valued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, one is required to give conditional probabilities of the next observation. The realizable case is when the measure µ belongs to an arbitrary but known class C of process measures. The non-realizable case is when µ is completely arbitrary, but the prediction performance is measured with respect to a given set C of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions. We show that if the quality of prediction is measured using the total variation distance, then these problems coincide, while if it is measured using the expected average KL divergence, then they are different. For some of the formalizations we also show that when a solution exists it can be obtained as a Bayes mixture over a countable subset of C . We also obtain several characterization of those sets C for which solutions to the considered problems exist. As an illustration to the general results obtained, we show that a solution to the non-realizable case of the sequence prediction problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary processes. It should be emphasized that the framework is completely general: the processes measures considered are not required to be i.i.d., mixing, stationary, or to belong to any parametric family. Keywords: sequence prediction, time series, online prediction, realizable sequence prediction, non-realizable sequence prediction
6 0.063579097 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
7 0.054831851 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
8 0.05371765 61 jmlr-2011-Logistic Stick-Breaking Process
9 0.050853677 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
10 0.049779028 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
11 0.045623854 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
12 0.044743177 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
13 0.04451834 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
14 0.04426346 58 jmlr-2011-Learning from Partial Labels
15 0.043445017 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
16 0.04252933 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
17 0.040824343 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
18 0.040645789 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
19 0.037112359 35 jmlr-2011-Forest Density Estimation
20 0.036577825 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
topicId topicWeight
[(0, 0.188), (1, -0.026), (2, -0.011), (3, 0.056), (4, 0.092), (5, -0.031), (6, -0.013), (7, 0.043), (8, -0.177), (9, 0.014), (10, -0.031), (11, 0.031), (12, 0.15), (13, -0.007), (14, 0.094), (15, 0.025), (16, -0.06), (17, -0.202), (18, -0.24), (19, 0.034), (20, -0.114), (21, -0.108), (22, 0.164), (23, 0.001), (24, -0.306), (25, -0.007), (26, -0.05), (27, -0.334), (28, 0.138), (29, -0.084), (30, 0.088), (31, -0.03), (32, 0.104), (33, 0.137), (34, 0.061), (35, -0.047), (36, 0.058), (37, 0.021), (38, -0.114), (39, -0.115), (40, -0.008), (41, -0.017), (42, -0.056), (43, -0.006), (44, 0.033), (45, 0.027), (46, 0.044), (47, -0.0), (48, -0.057), (49, -0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.96664602 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combined with existing results, this shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function
2 0.59284759 12 jmlr-2011-Bayesian Co-Training
Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao
Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing
3 0.42852768 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
4 0.401997 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
Author: Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor
Abstract: It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exist a wide variety of methods coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated which is in contrast to the independent and identically distributed (i.i.d.) case. Finally, some extensions are proposed to use the proposed criterion in support vector machines and least squares support vector machines for regression. Keywords: nonparametric regression, correlated errors, bandwidth choice, cross-validation, shortrange dependence, bimodal kernel
5 0.33016941 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
Author: Daniil Ryabko
Abstract: A sequence x1 , . . . , xn , . . . of discrete-valued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, one is required to give conditional probabilities of the next observation. The realizable case is when the measure µ belongs to an arbitrary but known class C of process measures. The non-realizable case is when µ is completely arbitrary, but the prediction performance is measured with respect to a given set C of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions. We show that if the quality of prediction is measured using the total variation distance, then these problems coincide, while if it is measured using the expected average KL divergence, then they are different. For some of the formalizations we also show that when a solution exists it can be obtained as a Bayes mixture over a countable subset of C . We also obtain several characterization of those sets C for which solutions to the considered problems exist. As an illustration to the general results obtained, we show that a solution to the non-realizable case of the sequence prediction problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary processes. It should be emphasized that the framework is completely general: the processes measures considered are not required to be i.i.d., mixing, stationary, or to belong to any parametric family. Keywords: sequence prediction, time series, online prediction, realizable sequence prediction, non-realizable sequence prediction
6 0.30023947 58 jmlr-2011-Learning from Partial Labels
7 0.2977457 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
8 0.26339275 61 jmlr-2011-Logistic Stick-Breaking Process
9 0.26278463 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
10 0.26141506 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
11 0.2420812 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
12 0.23162158 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
13 0.21453564 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
14 0.19901663 35 jmlr-2011-Forest Density Estimation
15 0.19869566 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
16 0.19680329 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
17 0.19121991 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
18 0.1906583 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
19 0.17791994 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
20 0.17039022 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
topicId topicWeight
[(4, 0.061), (6, 0.027), (9, 0.03), (10, 0.025), (24, 0.066), (31, 0.063), (32, 0.039), (41, 0.031), (65, 0.021), (73, 0.018), (78, 0.064), (80, 0.364), (90, 0.018), (98, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.7011565 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combined with existing results, this shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function
2 0.33513618 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama
Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel
3 0.32629189 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
4 0.32095286 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
5 0.32026801 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
6 0.3201493 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
7 0.31954685 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
8 0.31756052 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
9 0.31719688 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
10 0.31677547 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
11 0.31674287 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
13 0.31486037 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
14 0.31462827 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
15 0.31302318 59 jmlr-2011-Learning with Structured Sparsity
16 0.31298757 104 jmlr-2011-X-Armed Bandits
17 0.31219491 28 jmlr-2011-Double Updating Online Learning
18 0.31076586 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
19 0.31045783 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
20 0.31039533 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints