jmlr jmlr2011 jmlr2011-85 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combined with existing results, this shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function
Reference: text
sentIndex sentText sentNum sentScore
1 China Editor: Rocco Servedio Abstract We study pool-based active learning in the presence of noise, that is, the agnostic setting. [sent-6, score-0.504]
2 It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. [sent-7, score-0.572]
3 Although there are many cases on which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. [sent-8, score-0.566]
4 Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. [sent-9, score-0.831]
5 In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. [sent-10, score-0.774]
6 We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. [sent-11, score-0.586]
7 Combined with existing results, this shows that active learning is superior to passive supervised learning for smooth problems. [sent-12, score-0.679]
8 Keywords: active learning, disagreement coefficient, label complexity, smooth function 1. [sent-13, score-0.954]
9 The simplest example that demonstrates the potential of active learning is to learn the optimal threshold on an interval. [sent-20, score-0.329]
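To make this threshold example concrete, here is a minimal sketch (Python; the oracle query_label and the other names are illustrative, not from the paper) of noise-free active learning by binary search over a pool on [0, 1]: it localizes the optimal threshold using on the order of log(1/ε) label requests, whereas passive learning needs on the order of 1/ε labels.

    import numpy as np

    def active_threshold(pool, query_label, eps=1e-3):
        # Binary search over a sorted pool of unlabeled points in [0, 1].
        # Assumes noiseless labels y = 1[x >= true threshold] and that the
        # threshold lies strictly inside the range of the pool.
        pool = np.sort(np.asarray(pool))
        lo, hi = 0, len(pool) - 1
        queries = 0
        while lo + 1 < hi and pool[hi] - pool[lo] > eps:
            mid = (lo + hi) // 2
            queries += 1
            if query_label(pool[mid]) == 1:   # threshold is to the left of pool[mid]
                hi = mid
            else:                             # threshold is to the right of pool[mid]
                lo = mid
        return (pool[lo] + pool[hi]) / 2, queries

    # Usage: roughly log2(n) label requests instead of the ~1/eps a passive learner needs.
    rng = np.random.default_rng(0)
    pool = rng.uniform(0, 1, 10_000)
    estimate, used = active_threshold(pool, query_label=lambda x: int(x >= 0.37))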
10 If the data are distributed on the unit sphere of R^d, and the distribution has a density function upper and lower bounded by λ and 1/λ respectively, where λ is some constant, then active learning can still give exponential savings in the label complexity (Dasgupta, 2005). [sent-25, score-0.493]
11 However, there are also very simple problems for which active learning does not help. [sent-26, score-0.283]
12 In this case, for any active learning algorithm there exists a distribution (i. [sent-29, score-0.283]
13 Of more interest, and more realistic, is the agnostic setting, where the best classifier in the hypothesis space has a non-zero error ν. [sent-35, score-0.289]
14 For agnostic active learning, there is no active learning algorithm that can always reduce label requests, due to a lower bound of Ω(ν²/ε²) on the label complexity (Kääriäinen, 2006). [sent-36, score-1.112]
15 Previous results have shown that whether active learning helps relies crucially on the disagreement coefficient of the learning problem (Hanneke, 2007). [sent-37, score-0.676]
16 The disagreement coefficient depends on the distribution of the instance-label pairs and the hypothesis space and often describes the intrinsic difficulty of the active learning problem. [sent-38, score-0.717]
17 In particular, it has been shown that the label complexity of two important agnostic active learning algorithms, A² (Balcan et al., 2006) and the algorithm of Dasgupta et al. [sent-39, score-0.686]
18 (2007) (which will be referred to as DHM), is characterized by the disagreement coefficient. [sent-41, score-0.366]
19 If the disagreement coefficient is small, active learning usually has smaller label complexity than passive learning. [sent-42, score-1.027]
20 In this paper, we study the disagreement coefficient for smooth problems. [sent-43, score-0.528]
21 Specifically we analyze the disagreement coefficient for learning problems whose classification boundaries are smooth. [sent-44, score-0.44]
22 Under some mild assumptions on the distribution, we show that the magnitude of the disagreement coefficient depends on the order of smoothness. [sent-46, score-0.366]
23 Combining with known upper bounds on the label complexity in terms of the disagreement coefficient, we give a sufficient condition under which active learning is strictly superior to passive learning. [sent-48, score-1.123]
24 1 Related Works Our work is closely related to Castro and Nowak (2008), which proved label complexity bounds for problems with smooth classification boundaries under Tsybakov's noise condition (Tsybakov, 2004). [sent-50, score-0.507]
25 He gave conditions under which the disagreement coefficient is always bounded from above by a constant. [sent-56, score-0.366]
26 In our active learning model, the algorithm has access to a pool of unlabeled examples from DX . [sent-62, score-0.372]
27 A² (Balcan et al., 2006) is the first rigorous agnostic active learning algorithm. [sent-84, score-0.504]
28 It can be viewed as a robust version of the active learning algorithm due to Cohn et al. [sent-85, score-0.283]
29 It was shown that A2 is never much worse than passive learning in terms of the label complexity. [sent-88, score-0.339]
30 One is the current version space V, consisting of hypotheses that, with statistical confidence, are not too bad compared to h∗; the other is the region of disagreement DIS(V), which is the set of all x ∈ X for which there are hypotheses in V that disagree on x. [sent-92, score-0.438]
31 This process, and in turn the label complexity of A2 , are nicely characterized by the disagreement coefficient θ introduced in Hanneke (2007). [sent-101, score-0.548]
32 The disagreement coefficient θ(ε) is defined as θ(ε) = sup_{r≥ε} Pr_{X∼D_X}(X ∈ DIS(B(h∗, r))) / r = sup_{r≥ε} ∆(B(h∗, r)) / r, where h∗ = arg min_{h∈H} er_D(h). [sent-115, score-0.446]
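As a concrete illustration of this definition (my own, not from the paper), the following Python sketch Monte-Carlo estimates sup_{r≥ε} Pr_{X∼D_X}(X ∈ DIS(B(h∗, r)))/r on a grid of r values for one-dimensional threshold classifiers under the uniform marginal on [0, 1]; for this class the ratio stays near 2, in line with the threshold example discussed later in the text.

    import numpy as np

    def disagreement_coefficient_thresholds(t_star, eps, n_mc=200_000, seed=0):
        # Empirical theta(eps) for h_t(x) = 1[x >= t] with X ~ Uniform[0, 1].
        # Under this marginal Pr(h_t != h_s) = |t - s|, so B(h*, r) is the set of
        # thresholds within distance r of t_star and DIS(B(h*, r)) is an interval.
        rng = np.random.default_rng(seed)
        X = rng.uniform(0, 1, n_mc)
        best = 0.0
        for r in np.geomspace(eps, 1.0, 50):        # sup over r >= eps, on a grid
            lo, hi = max(0.0, t_star - r), min(1.0, t_star + r)
            mass = np.mean((X > lo) & (X < hi))     # Pr(X in DIS(B(h*, r)))
            best = max(best, mass / r)
        return best

    print(disagreement_coefficient_thresholds(t_star=0.4, eps=0.01))   # approximately 2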
33 The following is an upper bound on the label complexity of A² in terms of the disagreement coefficient θ(ε) (Hanneke, 2007). [sent-117, score-0.576]
34 (The bound above is for a problem with ν = 0.) Another important agnostic active learning algorithm is DHM. [sent-121, score-0.504]
35 DHM reduces active learning to a series of constrained supervised learning problems. [sent-123, score-0.283]
36 If we can, we put the data point and the confidently guessed label (x, ŷ) into the guessed set G; otherwise, we request the true label y of x and put (x, y) into the true set T. [sent-125, score-0.328]
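A heavily simplified sketch of this guess-or-query loop is given below (Python). It is not the actual DHM algorithm: the statistical test is abstracted into a caller-supplied slack threshold_fn, and erm_with_constraint is a hypothetical routine returning the best empirical error achievable on the current labeled data when x is forced to carry a given label.

    def dhm_style_loop(unlabeled_pool, query_label, erm_with_constraint, threshold_fn):
        # Sketch of a DHM-style loop: infer the label when the constrained ERM errors
        # differ by more than the slack delta_t, otherwise request the true label.
        G, T = [], []                                  # guessed set and true (queried) set
        for t, x in enumerate(unlabeled_pool, start=1):
            S = G + T                                  # labeled data collected so far
            err_plus = erm_with_constraint(S, x, +1)   # best empirical error with x forced to +1
            err_minus = erm_with_constraint(S, x, -1)  # best empirical error with x forced to -1
            delta_t = threshold_fn(t)                  # statistical slack (DHM uses a specific bound)
            if err_plus - err_minus > delta_t:         # +1 is implausible: confidently guess -1
                G.append((x, -1))
            elif err_minus - err_plus > delta_t:       # -1 is implausible: confidently guess +1
                G.append((x, +1))
            else:                                      # cannot decide: request the true label
                T.append((x, query_label(x)))
        return G, T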
37 They also showed that DHM is never much worse than passive learning, and it has label complexity Õ( θ(ν + ε) (1 + ν²/ε²) · VC(H) · polylog(1/ε, 1/δ) ).  (2) [sent-136, score-0.406]
38 From (1) and (2) it can be seen that if ε > ν, the term ν²/ε² is upper bounded by 1 and the label complexity of the active learning algorithms crucially depends on the disagreement coefficient θ. [sent-137, score-0.886]
39 In fact, this bound cannot be improved: it is known that for some hypothesis space H, for every active learning algorithm A, there is a learning problem (to be concrete, an h∗) such that the label complexity of A is at least Ω(ν²/ε²) (Kääriäinen, 2006). [sent-139, score-0.533]
40 Thus Ω(ν²/ε²) is a minimax lower bound on the label complexity of agnostic active learning algorithms. [sent-140, score-0.686]
41 Although no active learning algorithm is superior to passive learning in all agnostic settings, it turns out that if the disagreement coefficient is small, active learning does always help under a finer parametrization of the noise distribution, known as Tsybakov’s noise condition (Tsybakov, 2004). [sent-141, score-1.485]
42 Under Tsybakov’s noise condition, Hanneke (2009, 2011) proved that a variant of the DHM algorithm (by choosing the threshold ∆t based on local Rademacher complexity (Koltchinskii, 2006)) has the following asymptotic label complexity. [sent-147, score-0.277]
43 Inspired by this result, Koltchinskii (2010) further proved that under similar conditions a variant of the A² algorithm has label complexity O( θ(ε^{1/κ}) (1/ε)^{2 − (2−p)/κ} (log(1/δ) + log log(1/ε)) ).  (4) [sent-154, score-0.311]
44 In contrast, the sample complexity for passive learning under the same conditions is known to be (Tsybakov, 2004) O( (1/ε)^{2 − (1−p)/κ} (log(1/δ) + log log(1/ε)) ),  (5) and it is also a minimax lower bound. [sent-158, score-0.364]
45 Comparing (3), (4) and (5) one can see that whether active learning is strictly superior to passive learning entirely depends on how small the disagreement coefficient θ(ε) is. [sent-159, score-0.883]
46 (2010) developed an active learning algorithm which does not need to keep the version space and therefore is computationally efficient. [sent-165, score-0.283]
47 There are also upper bounds on the label complexity of these two algorithms in terms of the disagreement coefficient. [sent-166, score-0.606]
48 Finally, for a comprehensive survey of the theoretical research on active learning, please see the excellent tutorial (Dasgupta and Langford, 2009). [sent-167, score-0.283]
49 Main Results As described in the previous section, whether active learning helps largely depends on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem using a given hypothesis space. [sent-169, score-0.717]
50 So it is important to understand whether the disagreement coefficient is small for learning problems of practical and theoretical interest. [sent-170, score-0.366]
51 In this section we give bounds on the disagreement coefficient for problems that have smooth classification boundaries, under additional assumptions on the distribution. [sent-171, score-0.558]
52 Such smooth problems are often referred to as the boundary fragment class and have been extensively studied in passive learning and especially in the theory of empirical processes. [sent-172, score-0.489]
53 2 contains the main results, where we establish upper and lower bounds for the disagreement coefficient of smooth problems. [sent-176, score-0.586]
54 For a multi-index k = (k_1, . . . , k_d) with |k| = Σ_{i=1}^d k_i, let D^k = ∂^{|k|} / (∂^{k_1}x_1 ⋯ ∂^{k_d}x_d) be the differential operator. [sent-184, score-0.276]
55 With the definitions of smoothness, we introduce the hypothesis space we use in active learning algorithms. [sent-193, score-0.351]
56 Definition 7 (Hypotheses with Smooth Classification Boundaries) A set of hypotheses H_C^α defined on X = [0, 1]^{d+1} is said to have αth order smooth classification boundaries if, for every h ∈ H_C^α, the classification boundary is an αth order smooth function on [0, 1]^d. [sent-194, score-0.444]
57 Similarly, a hypothesis space H_C^∞ is said to have infinitely smooth classification boundaries if, for every h ∈ H_C^∞, the classification boundary is the graph of an infinitely smooth function on [0, 1]^d. [sent-203, score-0.476]
58 2 Disagreement Coefficient The disagreement coefficient θ plays an important role in the label complexity of active learning algorithms. [sent-213, score-0.831]
59 In fact, the previous negative examples for which active learning does not work all have a large θ. [sent-214, score-0.283]
60 For instance, in the interval learning problem θ(ε) = 1/ε, which leads to the same label complexity as passive learning. [sent-215, score-0.378]
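A short worked calculation (mine, not taken from the paper; it assumes the uniform marginal on [0, 1] and an optimal interval of negligible width) shows where the 1/ε comes from and contrasts it with the threshold class:

    For intervals $h_{[a,b]}(x) = \mathbf{1}[a \le x \le b]$ with $X \sim \mathrm{Unif}[0,1]$ and $h^*$ an
    interval of width close to $0$, every interval of probability mass at most $r$ lies in $B(h^*, r)$,
    so every $x \in [0,1]$ is in the region of disagreement:
    \[
      \mathrm{DIS}(B(h^*, r)) = [0,1], \qquad
      \theta(\varepsilon) = \sup_{r \ge \varepsilon} \frac{\Pr(X \in \mathrm{DIS}(B(h^*, r)))}{r}
                          = \sup_{r \ge \varepsilon} \frac{1}{r} = \frac{1}{\varepsilon}.
    \]
    For thresholds $h_t(x) = \mathbf{1}[x \ge t]$, by contrast, $\mathrm{DIS}(B(h^*, r)) = (t^* - r,\, t^* + r) \cap [0,1]$
    has mass at most $2r$, so $\theta(\varepsilon) \le 2$ for all $\varepsilon > 0$.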
61 In this section we will show that the disagreement coefficient θ(ε) for smooth problems is small. [sent-217, score-0.446]
62 In particular, we establish both upper bounds (Theorem 9 and Theorem 10) and lower bounds (Theorem 13) for the disagreement coefficient of smooth problems. [sent-218, score-0.616]
63 Finally we will combine our upper bounds on the disagreement coefficient with the label complexity result of Theorem 4 and show that active learning is strictly superior to passive learning for smooth problems. [sent-219, score-1.285]
64 The key points in the theorem are: the classification boundaries are smooth; and the density is bounded from above and below by constants times a smooth function. [sent-223, score-0.288]
65 Hence only the points close to the classification boundary of h∗ can be in DIS(B(h∗ , ε)), which leads to a small disagreement coefficient. [sent-230, score-0.45]
66 There exist a function Φ̃_h(x_1, . . . , x_d) and two constants 0 < a ≤ b such that a|Φ̃_h| ≤ |Φ_h| ≤ b|Φ̃_h|. [sent-280, score-0.276]
67 To see this, remember that f_h and f_{h∗} are αth order smooth functions, and the density p is upper and lower bounded by constants times an αth order smooth function g(x_1, . . . , x_d). [sent-281, score-0.516]
68 Now consider the region of disagreement of B(h∗ , r). [sent-302, score-0.366]
69 In the next theorem, we give lower bounds on the disagreement coefficient for finitely smooth problems under the condition that the marginal distribution D_X is the uniform distribution. [sent-310, score-0.558]
70 Then the disagreement coefficient has the following lower bound: θ(ε) = Ω( (1/ε)^{d/(α+d)} ). [sent-315, score-0.366]
71 For infinite smoothness, we do not know any lower bound for the disagreement coefficient larger than the trivial Ω(1). [sent-376, score-0.366]
72 Label Complexity for Smooth Problems: Now we combine our results (Theorem 9 and Theorem 10) with the label complexity bounds for active learning (Theorem 4 and (4)) and show that active learning is strictly superior to passive learning for smooth problems. [sent-379, score-1.174]
73 Remember that under Tsybakov's noise conditions the label complexity of active learning is (see Theorem 4 and (4)) O( θ(ε^{1/κ}) (1/ε)^{2 − (2−p)/κ} (log(1/ε) + log(1/δ)) ). [sent-380, score-0.6]
74 While for passive learning it is (see (5)) O( (1/ε)^{2 − (1−p)/κ} (log(1/δ) + log log(1/ε)) ). We see that if θ(ε^{1/κ}) = o( (1/ε)^{1/κ} ), [sent-381, score-0.325]
75 then active learning requires strictly fewer labels than passive learning. [sent-382, score-0.479]
76 Then the active learning algorithms A² and DHM have label complexity strictly smaller than that of passive learning for αth order smooth problems whenever α > d. [sent-386, score-0.823]
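A quick check (my own calculation, relying on the bound of Theorem 9 and on the condition stated just above) of why finite smoothness suffices:

    With $\theta(\varepsilon) = O\big((1/\varepsilon)^{d/(\alpha+d)}\big)$ from Theorem 9,
    \[
      \theta\big(\varepsilon^{1/\kappa}\big)
        = O\Big((1/\varepsilon)^{\frac{d}{\kappa(\alpha+d)}}\Big)
        = o\Big((1/\varepsilon)^{1/\kappa}\Big)
      \quad\text{since}\quad \frac{d}{\alpha+d} < 1,
    \]
    so active learning saves a polynomial factor in $1/\varepsilon$ over passive learning; the corollary's extra
    requirement $\alpha > d$ presumably reflects the regime in which the complexity exponent $p = d/\alpha$ is
    below one, where the bounds (4) and (5) apply.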
77 3 Discussion In this section we discuss and compare our results to a closely related work due to Castro and Nowak (2008), which also studied the label complexity of smooth problems under Tsybakov’s noise condition. [sent-388, score-0.393]
78 Under these assumptions, Castro and Nowak showed that an active learning algorithm they attributed to Burnashev and Zigangirov (1974) (referred to as BZ), which is essentially a Bayesian binary search algorithm, has label complexity Õ( (1/ε)^{2 − 2/κ} ). [sent-396, score-0.465]
79 This model is called membership query, making stronger assumptions than the pool-based active learning model. [sent-399, score-0.283]
80 They assume that on every vertical line segment {(x_1, . . . , x_d, x_{d+1}) : x_{d+1} ∈ [0, 1]} in [0, 1]^{d+1}, [sent-407, score-0.276]
81 where (x_1, x_2, . . . , x_d) ∈ [0, 1]^d is fixed, the one-dimensional conditional distribution η_{x_1,...,x_d} satisfies a one-dimensional Tsybakov noise condition. [sent-410, score-0.276]
82 Based on these assumptions, they proposed the following active learning algorithm: choose M vertical line segments in [0, 1]^{d+1}. [sent-414, score-0.283]
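The sketch below (Python; my simplification for illustration, with plain noise-free bisection standing in for the BZ subroutine and a polynomial fit standing in for their estimator) shows the overall shape of such a line-search strategy for d = 1: run a one-dimensional active search on each of the M vertical lines to locate where the label flips, then fit a smooth function through the M estimated boundary points.

    import numpy as np

    def boundary_on_line(x1, query_label, n_queries=12):
        # 1-d active search for the label flip along the vertical line at x1
        # (noise-free bisection; the BZ subroutine handles the noisy case).
        lo, hi = 0.0, 1.0
        for _ in range(n_queries):
            mid = (lo + hi) / 2
            if query_label(x1, mid) == 1:   # above the boundary
                hi = mid
            else:                           # below the boundary
                lo = mid
        return (lo + hi) / 2

    def fit_boundary(M, query_label, degree=5):
        xs = np.linspace(0.0, 1.0, M)
        ys = np.array([boundary_on_line(x, query_label) for x in xs])  # M * n_queries labels in total
        return np.polynomial.Polynomial.fit(xs, ys, deg=degree)        # smooth boundary estimate

    # Usage with a smooth true boundary f(x1) = 0.5 + 0.2 sin(2*pi*x1):
    oracle = lambda x1, x2: int(x2 >= 0.5 + 0.2 * np.sin(2 * np.pi * x1))
    boundary_hat = fit_boundary(M=20, query_label=oracle)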
83 Comparing the label complexity of this algorithm, Õ( (1/ε)^{2 − (2 − d/α)/κ} ), with that obtained from Theorem 4 and Proposition 8, Õ( θ(ε_0) (1/ε)^{2 − (2 − d/α)/κ} ), one sees that their label complexity is smaller by a factor of θ(ε_0). [sent-421, score-0.364]
84 It seems that the disagreement coefficient of smooth problems does not play a role in their label complexity formula. [sent-422, score-0.71]
85 Recall that the disagreement coefficient of the one-dimensional threshold learning problem is at most 2 for all ε > 0, so there seems to be no θ(ε) term in the final label complexity formula. [sent-428, score-0.594]
86 But by Theorem 4 and (4) we know that both A² and DHM (with slight modifications) have the same label complexity and convergence property, since the disagreement coefficient for this threshold problem is θ(ε) = 2 for all ε > 0. [sent-431, score-0.594]
87 But under the ordinary Tsybakov condition, the largest κ on the lines can be arbitrarily large, resulting in a label complexity equal to that of passive learning. [sent-439, score-0.378]
88 ... where now the domain of f is (x_1, . . . , x_{d−1}). [sent-477, score-0.276]
89 For any fixed (x_1, . . . , x_{d−1}), consider g_d as a function of the single variable x_d. [sent-524, score-0.321]
90 Since f is the first order derivative of g_d, it is easy to check that g_d has derivatives up to order α + 1 with respect to x_d, and its (α + 1)th order derivative is Hölder continuous with constant C. [sent-525, score-0.421]
91 Now for r sufficiently small, take n = log(1/r) / log log(1/r). [sent-657, score-0.387]
92 By Stirling's formula, r · n! is of order r √(2πn) n^n e^{−n}, which with this choice of n tends to zero as r → 0, and therefore M′_n = max(M_0 n!, M_n) stays bounded for r small enough. [sent-662, score-0.559]
93 Thus we have, by Theorem 17, ‖Φ‖_∞ ≤ M_d ≤ C n^d M_0^{1 − d/n} M′_n^{d/n} ≤ C r log^{2d}(1/r). [sent-664, score-0.301]
94 Conclusion This paper studies the disagreement coefficient of smooth problems and extends our previous results (Wang, 2009). [sent-707, score-0.528]
95 Compared to the worst case θ(ε) = 1/ε, for which active learning has the same label complexity as passive learning, the disagreement coefficient is θ(ε) = O( (1/ε)^{d/(α+d)} ) for αth order (α < ∞) smooth problems, and θ(ε) = O( log^{2d}(1/ε) ) for infinitely smooth problems. [sent-708, score-1.351]
96 Combining these with the bounds on the label complexity in terms of the disagreement coefficient, we give sufficient conditions under which the active learning algorithms A² and DHM are superior to passive learning under Tsybakov's noise condition. [sent-709, score-1.144]
97 This would include a substantially richer class of problems which can benefit from active learning. [sent-719, score-0.283]
98 For infinitely smooth problems we proved that the disagreement coefficient can be upper and lower bounded by O( log^{2d}(1/ε) ) and Ω(1), respectively. [sent-721, score-0.556]
99 A bound on the label complexity of agnostic active learning. [sent-793, score-0.686]
100 Rademacher complexities and bounding the excess risk in active learning. [sent-820, score-0.283]
wordName wordTfidf (topN-words)
[('disagreement', 0.366), ('active', 0.283), ('xd', 0.276), ('mn', 0.268), ('agnostic', 0.221), ('mk', 0.207), ('passive', 0.196), ('cnk', 0.18), ('dx', 0.173), ('smooth', 0.162), ('tsybakov', 0.161), ('dhm', 0.152), ('moothness', 0.152), ('oefficient', 0.152), ('label', 0.143), ('dis', 0.138), ('fh', 0.138), ('isagreement', 0.13), ('hc', 0.107), ('mnn', 0.097), ('vc', 0.097), ('erd', 0.094), ('th', 0.086), ('boundary', 0.084), ('coef', 0.083), ('fc', 0.078), ('tt', 0.074), ('boundaries', 0.074), ('zd', 0.072), ('dasgupta', 0.072), ('hypothesis', 0.068), ('dk', 0.067), ('wang', 0.066), ('castro', 0.063), ('hy', 0.063), ('pr', 0.056), ('dxd', 0.055), ('gorny', 0.055), ('hanneke', 0.053), ('nitely', 0.053), ('theorem', 0.052), ('noise', 0.049), ('minh', 0.047), ('fragment', 0.047), ('threshold', 0.046), ('gd', 0.045), ('unlabeled', 0.045), ('pool', 0.044), ('exponent', 0.044), ('log', 0.043), ('vaart', 0.043), ('request', 0.042), ('ub', 0.042), ('wellner', 0.042), ('ari', 0.042), ('bz', 0.042), ('chicheng', 0.042), ('dud', 0.042), ('prx', 0.042), ('une', 0.042), ('yanqi', 0.042), ('ziteng', 0.042), ('sup', 0.04), ('complexity', 0.039), ('lb', 0.039), ('classi', 0.039), ('superior', 0.038), ('earning', 0.038), ('nowak', 0.037), ('gt', 0.037), ('beygelzimer', 0.036), ('hypotheses', 0.036), ('gi', 0.035), ('ud', 0.034), ('hypercube', 0.034), ('bracketing', 0.032), ('derivatives', 0.031), ('smoothness', 0.031), ('mm', 0.031), ('xb', 0.031), ('bounds', 0.03), ('er', 0.03), ('kai', 0.029), ('des', 0.029), ('guess', 0.029), ('upper', 0.028), ('der', 0.028), ('burnashev', 0.028), ('cnd', 0.028), ('ergt', 0.028), ('learnh', 0.028), ('mitrinovi', 0.028), ('balcan', 0.027), ('crucially', 0.027), ('dai', 0.026), ('remember', 0.026), ('rademacher', 0.024), ('lder', 0.024), ('liwei', 0.024), ('polylog', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combined with existing results, this shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function
2 0.11759889 12 jmlr-2011-Bayesian Co-Training
Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao
Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing
3 0.078935355 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
4 0.068470873 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
Author: Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor
Abstract: It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exist a wide variety of methods coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated which is in contrast to the independent and identically distributed (i.i.d.) case. Finally, some extensions are proposed to use the proposed criterion in support vector machines and least squares support vector machines for regression. Keywords: nonparametric regression, correlated errors, bandwidth choice, cross-validation, shortrange dependence, bimodal kernel
5 0.068158098 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
Author: Daniil Ryabko
Abstract: A sequence x1 , . . . , xn , . . . of discrete-valued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, one is required to give conditional probabilities of the next observation. The realizable case is when the measure µ belongs to an arbitrary but known class C of process measures. The non-realizable case is when µ is completely arbitrary, but the prediction performance is measured with respect to a given set C of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions. We show that if the quality of prediction is measured using the total variation distance, then these problems coincide, while if it is measured using the expected average KL divergence, then they are different. For some of the formalizations we also show that when a solution exists it can be obtained as a Bayes mixture over a countable subset of C . We also obtain several characterization of those sets C for which solutions to the considered problems exist. As an illustration to the general results obtained, we show that a solution to the non-realizable case of the sequence prediction problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary processes. It should be emphasized that the framework is completely general: the processes measures considered are not required to be i.i.d., mixing, stationary, or to belong to any parametric family. Keywords: sequence prediction, time series, online prediction, realizable sequence prediction, non-realizable sequence prediction
6 0.063579097 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
7 0.054831851 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
8 0.05371765 61 jmlr-2011-Logistic Stick-Breaking Process
9 0.050853677 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
10 0.049779028 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
11 0.045623854 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
12 0.044743177 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
13 0.04451834 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
14 0.04426346 58 jmlr-2011-Learning from Partial Labels
15 0.043445017 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
16 0.04252933 8 jmlr-2011-Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
17 0.040824343 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
18 0.040645789 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
19 0.037112359 35 jmlr-2011-Forest Density Estimation
20 0.036577825 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
topicId topicWeight
[(0, 0.188), (1, -0.026), (2, -0.011), (3, 0.056), (4, 0.092), (5, -0.031), (6, -0.013), (7, 0.043), (8, -0.177), (9, 0.014), (10, -0.031), (11, 0.031), (12, 0.15), (13, -0.007), (14, 0.094), (15, 0.025), (16, -0.06), (17, -0.202), (18, -0.24), (19, 0.034), (20, -0.114), (21, -0.108), (22, 0.164), (23, 0.001), (24, -0.306), (25, -0.007), (26, -0.05), (27, -0.334), (28, 0.138), (29, -0.084), (30, 0.088), (31, -0.03), (32, 0.104), (33, 0.137), (34, 0.061), (35, -0.047), (36, 0.058), (37, 0.021), (38, -0.114), (39, -0.115), (40, -0.008), (41, -0.017), (42, -0.056), (43, -0.006), (44, 0.033), (45, 0.027), (46, 0.044), (47, -0.0), (48, -0.057), (49, -0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.96664602 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combined with existing results, this shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function
2 0.59284759 12 jmlr-2011-Bayesian Co-Training
Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao
Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing
3 0.42852768 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
4 0.401997 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
Author: Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor
Abstract: It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exist a wide variety of methods coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated which is in contrast to the independent and identically distributed (i.i.d.) case. Finally, some extensions are proposed to use the proposed criterion in support vector machines and least squares support vector machines for regression. Keywords: nonparametric regression, correlated errors, bandwidth choice, cross-validation, shortrange dependence, bimodal kernel
5 0.33016941 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
Author: Daniil Ryabko
Abstract: A sequence x1 , . . . , xn , . . . of discrete-valued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, one is required to give conditional probabilities of the next observation. The realizable case is when the measure µ belongs to an arbitrary but known class C of process measures. The non-realizable case is when µ is completely arbitrary, but the prediction performance is measured with respect to a given set C of process measures. We are interested in the relations between these problems and between their solutions, as well as in characterizing the cases when a solution exists and finding these solutions. We show that if the quality of prediction is measured using the total variation distance, then these problems coincide, while if it is measured using the expected average KL divergence, then they are different. For some of the formalizations we also show that when a solution exists it can be obtained as a Bayes mixture over a countable subset of C . We also obtain several characterization of those sets C for which solutions to the considered problems exist. As an illustration to the general results obtained, we show that a solution to the non-realizable case of the sequence prediction problem exists for the set of all finite-memory processes, but does not exist for the set of all stationary processes. It should be emphasized that the framework is completely general: the processes measures considered are not required to be i.i.d., mixing, stationary, or to belong to any parametric family. Keywords: sequence prediction, time series, online prediction, realizable sequence prediction, non-realizable sequence prediction
6 0.30023947 58 jmlr-2011-Learning from Partial Labels
7 0.2977457 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
8 0.26339275 61 jmlr-2011-Logistic Stick-Breaking Process
9 0.26278463 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms
10 0.26141506 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
11 0.2420812 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
12 0.23162158 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing
13 0.21453564 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
14 0.19901663 35 jmlr-2011-Forest Density Estimation
15 0.19869566 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
16 0.19680329 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
17 0.19121991 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
18 0.1906583 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
19 0.17791994 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
20 0.17039022 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
topicId topicWeight
[(4, 0.061), (6, 0.027), (9, 0.03), (10, 0.025), (24, 0.066), (31, 0.063), (32, 0.039), (41, 0.031), (65, 0.021), (73, 0.018), (78, 0.064), (80, 0.364), (90, 0.018), (98, 0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.7011565 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
Author: Liwei Wang
Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples on which no active learning algorithm has an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combined with existing results, this shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function
2 0.33513618 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods
Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama
Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel
3 0.32629189 91 jmlr-2011-The Sample Complexity of Dictionary Learning
Author: Daniel Vainsencher, Shie Mannor, Alfred M. Bruckstein
Abstract: A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of O(√(np ln(mλ)/m)), where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order O(√(np ln(mk)/m)) under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements. Keywords: dictionary learning, generalization bound, sparse representation
4 0.32095286 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky
Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types
5 0.32026801 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
6 0.3201493 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
7 0.31954685 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
8 0.31756052 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
9 0.31719688 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
10 0.31677547 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
11 0.31674287 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
13 0.31486037 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
14 0.31462827 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
15 0.31302318 59 jmlr-2011-Learning with Structured Sparsity
16 0.31298757 104 jmlr-2011-X-Armed Bandits
17 0.31219491 28 jmlr-2011-Double Updating Online Learning
18 0.31076586 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
19 0.31045783 37 jmlr-2011-Group Lasso Estimation of High-dimensional Covariance Matrices
20 0.31039533 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints