nips nips2008 nips2008-245 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Aarti Singh, Robert Nowak, Xiaojin Zhu
Abstract: Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
Reference: text
sentIndex sentText sentNum sentScore
1 However, in other situations unlabeled data do not seem to help. [sent-7, score-0.412]
2 Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. [sent-8, score-0.377]
3 We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. [sent-10, score-0.603]
4 We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates. [sent-11, score-0.271]
5 Semisupervised learning (SSL) aims to capitalize on the abundance of unlabeled data to improve learning performance. [sent-13, score-0.422]
6 Empirical evidence suggests that in certain favorable situations unlabeled data can help, while in other situations it does not. [sent-14, score-0.478]
7 It is wellaccepted that unlabeled data can help only if there exists a link between the marginal data distribution and the target function to be learnt. [sent-16, score-0.488]
8 Knowledge of these sets, which can be gleaned from unlabeled data, simplifies the learning task. [sent-18, score-0.346]
9 In this paper, we bridge the gap between these seemingly conflicting views and develop a minimax framework based on finite sample bounds to identify situations in which unlabeled data help to improve learning. [sent-21, score-0.722]
10 Our results quantify both the amount of improvement possible using SSL and the relative value of unlabeled data. [sent-22, score-0.408]
11 We focus on learning under a cluster assumption that is formalized in the next section, and establish that there exist nonparametric classes of distributions, denoted PXY , for which the decision sets (over which the target function is smooth) are discernable from unlabeled data. [sent-23, score-0.896]
12 Moreover, we show that there exist clairvoyant supervised learners that, given perfect knowledge of the decision sets denoted by D, can significantly outperform any generic supervised learner fn in these situations. [sent-24, score-1.17]
13 Figure 1: (a) Two separated high density sets with different labels that (b) cannot be discerned if the sample size is too small, but (c) can be estimated if sample density is high enough. [sent-26, score-0.437]
14 That is, if R denotes a risk of interest, n denotes the labeled data sample size, fD,n denotes the clairvoyant supervised learner, and E denotes expectation with respect to training data, then sup_{PXY} E[R(fD,n)] < inf_{fn} sup_{PXY} E[R(fn)]. [sent-28, score-0.623]
15 Based on this, we establish that there also exist semi-supervised learners, denoted fm,n, that use m unlabeled examples in addition to the n labeled examples to estimate the decision sets, and that perform as well as fD,n, provided that m grows appropriately relative to n. [sent-29, score-0.722]
16 Specifically, if the error bound for fD,n decays polynomially (exponentially) in n, then the number of unlabeled data m needs to grow polynomially (exponentially) with the number of labeled data n. [sent-30, score-0.69]
17 Then we examine a concrete instantiation of these general results in the regression setting by deriving minimax lower bounds on the performance of any supervised learner and comparing these to upper bounds on the errors of fD,n and fm,n. [sent-32, score-0.755]
18 If this mixture is identifiable, then the classification problem may reduce to a simple hypothesis testing problem for which the error converges exponentially fast in the number of labeled examples. [sent-34, score-0.241]
19 The ideas in this paper are similar, except that we do not require identifiability of the mixture component densities, and show that it suffices to only approximately learn the decision sets over which the label is smooth. [sent-35, score-0.359]
20 Rigollet [3] establishes that for a fixed collection of distributions satisfying a cluster assumption, unlabeled data do not provide an improvement in convergence rate. [sent-37, score-0.47]
21 This setting is illustrated in Fig. 1, where the distribution is supported on two component sets separated by a margin γ and the target function is smooth over each component. [sent-42, score-0.377]
22 Given a finite sample of data, these decision sets may or may not be discernable depending on the sampling density (see Fig. 1). [sent-43, score-0.563]
23 If γ is fixed (this is similar to fixing the class of cluster-based distributions in [3] or the manifold in [4, 10]), then given enough labeled data a supervised learner can achieve optimal performance (since, eventually, it operates in regime (c) of Fig. 1). [sent-45, score-0.596]
24 Thus, in this example, there is no improvement due to unlabeled data in terms of the rate of error convergence for a fixed collection of distributions. [sent-47, score-0.457]
25 However, since the true separation between the component sets is unknown, given a finite sample of data, there always exists a distribution for which these sets are indiscernible (e. [sent-48, score-0.342]
26 We claim that meaningful characterizations of SSL performance and quantifications of the value of unlabeled data require finite sample error bounds, and that rates of convergence and asymptotic analysis may not capture the distinctions between SSL and supervised learning. [sent-52, score-0.617]
27 Simply stated, if the component density sets are discernable from a finite sample size m of unlabeled data but not from a finite sample size n < m of labeled data, then SSL can provide better performance than supervised learning. [sent-53, score-1.127]
28 We also show that there are certain plausible situations in which SSL yields rates of convergence that cannot be achieved by any supervised learner. [sent-54, score-0.249]
29 2 Characterization of model distributions under the cluster assumption. Based on the cluster assumption [7, 3, 4], we define the following collection of joint distributions PXY (γ) = PX × PY |X indexed by a margin parameter γ. [sent-57, score-0.235]
30 The marginal density p(x) = Σ_{k=1}^{K} ak pk(x) is the mixture of a finite, but unknown, number of component densities {pk}_{k=1}^{K}, where K < ∞. [sent-59, score-0.452]
31 pk is bounded from above and below, 0 < b ≤ pk ≤ B. [sent-76, score-0.243]
32 With probability ak , X is drawn from pk and Y is drawn from pk (Y |X = x). [sent-81, score-0.253]
33 In the supervised setting, we assume access to n labeled data L = {Xi, Yi}_{i=1}^{n} drawn i.i.d. [sent-82, score-0.305]
34 according to PXY ∈ PXY (γ), and in the semi-supervised setting, we assume access to m additional unlabeled data U = {Xi}_{i=n+1}^{n+m} drawn i.i.d. according to PX. [sent-84, score-0.346]
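To make the sampling model above concrete, the sketch below draws labeled and unlabeled data from a two-component mixture on [0, 1]^2 whose supports are separated along one coordinate by a margin γ, with a smooth regression function on each component. The uniform component densities, the particular regression functions, the noise level, and the mixing weights are illustrative assumptions, not choices taken from the paper; only the overall structure (mixture components separated by γ, smooth target on each decision set, n labeled plus m ≫ n unlabeled draws) mirrors the setup described here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_xy(n, gamma=0.1):
    """Draw n (X, Y) pairs from a two-component mixture on [0, 1]^2 whose
    supports are separated along the first coordinate by a margin gamma > 0."""
    a = np.array([0.5, 0.5])                      # mixing weights a_k (illustrative)
    k = rng.choice(2, size=n, p=a)                # component index for each draw
    x = rng.uniform(size=(n, 2))
    # squeeze the first coordinate into the chosen component's support
    x[:, 0] = np.where(k == 0,
                       x[:, 0] * (0.5 - gamma / 2),
                       0.5 + gamma / 2 + x[:, 0] * (0.5 - gamma / 2))
    # a smooth regression function on each decision set, plus a little noise
    f = np.where(k == 0, np.sin(np.pi * x[:, 1]), 1.0 + np.cos(np.pi * x[:, 1]))
    y = f + 0.1 * rng.standard_normal(n)
    return x, y

X_lab, Y_lab = sample_xy(50)        # n labeled examples
X_unl, _ = sample_xy(5000)          # m >> n unlabeled examples (labels discarded)
```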
35 The cluster assumption is that the target function will be smooth on each set D ∈ D, hence the sets in D are called decision sets. [sent-89, score-0.481]
36 The collection PXY is indexed by a margin parameter γ, which denotes the minimum width of a decision set or separation between the component support sets Ck . [sent-91, score-0.488]
37 djk := min_{p,q ∈ {1,...,K}} ‖gj^{(q)} − gk^{(p)}‖_∞ where j ≠ k (1), and dkk := ‖gk^{(1)} − gk^{(2)}‖_∞ (2), with σ denoting the sign of the margin. [sent-99, score-0.285]
38 Boundary fragment sets capture the salient characteristics of more general decision sets since, locally, the boundaries of general sets are like fragments in a certain orientation. [sent-102, score-0.507]
39 3 Learning Decision Sets. Ideally, we would like to break a given learning task into separate subproblems on each D ∈ D since the target function is smooth on each decision set. [sent-103, score-0.344]
40 Note that the marginal density p is also smooth within each decision set, but exhibits jumps at the boundaries since the component densities are bounded away from zero. [sent-104, score-0.65]
41 Hence, the collection D can be learnt from unlabeled data as follows: 1) Marginal density estimation — The procedure is based on the sup-norm kernel density estimator proposed in [14]. [sent-105, score-0.631]
42 Consider a uniform square grid over the domain X = [0, 1]d with spacing 2hm , where hm = κ0 ((log m)2 /m)1/d and κ0 > 0 is a constant. [sent-106, score-0.24]
43 Let G denote the kernel and Hm = hm I; then the estimator of p(x) is p̂(x) = (1/(m hm^d)) Σ_{i=1}^{m} G(Hm^{-1}(Xi − [x])), where [x] denotes the grid point closest to x. [sent-108, score-0.233]
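A minimal sketch of this density-estimation step is given below, assuming a boxcar kernel as a stand-in for the kernel G and constants of [14]; it is not the exact estimator analyzed in the paper. The estimate is computed on the grid of spacing 2hm and evaluated at any x by snapping to the nearest grid point [x].

```python
import numpy as np

def grid_kde(U, kappa0=1.0):
    """Kernel density estimate on a uniform grid of spacing 2*h_m over [0, 1]^d,
    with h_m = kappa0 * ((log m)^2 / m)^(1/d). A boxcar kernel replaces G."""
    m, d = U.shape
    h = kappa0 * ((np.log(m) ** 2) / m) ** (1.0 / d)
    axes = [np.arange(h, 1.0, 2 * h) for _ in range(d)]          # grid centres
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), -1).reshape(-1, d)
    # count the unlabeled points falling in the cell of half-width h around
    # each grid point, then normalise by the cell volume and sample size
    counts = np.array([np.sum(np.all(np.abs(U - g) <= h, axis=1)) for g in grid])
    p_hat_grid = counts / (m * (2 * h) ** d)

    def p_hat(x):
        """Evaluate the estimate at x via the closest grid point [x]."""
        j = np.argmin(np.linalg.norm(grid - np.asarray(x), axis=1))
        return p_hat_grid[j]

    return p_hat, h
```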
44 2) Decision set estimation — Two points x1, x2 are said to be connected, denoted x1 ↔ x2, if there exists a sequence of points x1 = z1, z2, . . . , zl−1, zl = x2 such that z2, . . . , zl−1 ∈ U, ‖zj − zj+1‖ ≤ 2√d hm, and for all points that satisfy ‖zi − zj‖ ≤ hm log m, |p̂(zi) − p̂(zj)| ≤ δm := (log m)−1/3. [sent-115, score-0.781]
45 That is, there exists a sequence of 2√d hm-dense unlabeled data points between x1 and x2 such that the marginal density varies smoothly along the sequence. [sent-116, score-0.848]
46 All points that are pairwise connected specify an empirical decision set. [sent-117, score-0.241]
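A hedged sketch of this grouping step follows. It links unlabeled points that are within 2√d·hm of each other and whose estimated densities differ by at most δm, then takes connected components as the empirical decision sets. This simplifies the definition above in one respect: the density-smoothness test is applied only to adjacent pairs in the proximity graph rather than to every pair within hm·log m along a sequence, so it is an approximation of the procedure, not a faithful implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def empirical_decision_sets(U, p_hat, h_m):
    """Label each unlabeled point with the connected component of the graph that
    links points closer than 2*sqrt(d)*h_m whose estimated densities differ by
    at most delta_m = (log m)^(-1/3)."""
    m, d = U.shape
    delta_m = np.log(m) ** (-1.0 / 3.0)
    radius = 2.0 * np.sqrt(d) * h_m
    pairs = cKDTree(U).query_pairs(radius, output_type="ndarray")
    p_vals = np.array([p_hat(u) for u in U])

    parent = np.arange(m)                         # union-find forest over the points
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]         # path halving
            i = parent[i]
        return i

    for i, j in pairs:
        if abs(p_vals[i] - p_vals[j]) <= delta_m:
            parent[find(i)] = find(j)             # merge the two components

    return np.array([find(i) for i in range(m)])  # same label = same empirical set
```

Two points are then treated as connected (x1 ↔ x2) exactly when they receive the same component label.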
47 The following lemma shows that if the margin is large relative to the average spacing m−1/d between unlabeled data points, then with high probability, two points are connected if and only if they lie in the same decision set D ∈ D, provided the points are not too close to the decision boundaries. [sent-120, score-1.113]
48 Let ∂D denote the boundary of D and define the set of boundary points as B = {x : inf_{z ∈ ∂D, D ∈ D} ‖x − z‖ ≤ 2√d hm}. [sent-123, score-0.37]
49 4 SSL Performance and the Value of Unlabeled Data. We now state our main result, which characterizes the performance of SSL relative to supervised learning and follows as a corollary to the lemma stated above. [sent-125, score-0.327]
50 Suppose there exists a clairvoyant supervised learner fD,n, with perfect knowledge of the decision sets D, for which the following finite sample upper bound holds: sup_{PXY(γ)} E[E(fD,n)] ≤ ǫ2(n). [sent-129, score-1.006]
51 Then there exists a semi-supervised learner fm,n such that if |γ| > Co (m/(log m)2)−1/d, sup_{PXY(γ)} E[E(fm,n)] ≤ ǫ2(n) + O(1/m + n (m/(log m)2)−1/d). [sent-130, score-0.241]
52 This result captures the essence of the relative characterization of semi-supervised and supervised learning for the margin based model distributions. [sent-131, score-0.299]
53 Further, suppose that the following finite sample lower bound holds for any supervised learner: inf_{fn} sup_{PXY(γ)} E[E(fn)] ≥ ǫ1(n). [sent-134, score-0.324]
54 If ǫ2(n) < ǫ1(n), then there exists a clairvoyant supervised learner with perfect knowledge of the decision sets that outperforms any supervised learner that does not have this knowledge. [sent-135, score-1.348]
55 Hence, Corollary 1 implies that SSL can provide better performance than any supervised learner provided (i) m ≫ n so that (n/ǫ2(n))d = O(m/(log m)2), and (ii) knowledge of the decision sets simplifies the supervised learning task, so that ǫ2(n) < ǫ1(n). [sent-136, score-0.84]
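To get a feel for what condition (i) asks of m, the rough calculation below searches for the smallest m satisfying (n/ǫ2(n))d ≤ C·m/(log m)2 under a polynomial and an exponential choice of ǫ2(n). The constant C, the specific choices of ǫ2, and the search granularity are all illustrative assumptions, not values from the paper.

```python
import numpy as np

def required_m(n, eps2, d, C=1.0, growth=1.5):
    """Smallest m (up to the multiplicative search step) with
    (n / eps2(n))**d <= C * m / (log m)**2."""
    target = (n / eps2(n)) ** d
    m = max(n, 3)                      # start the search at m = n
    while C * m / np.log(m) ** 2 < target:
        m = int(m * growth)
    return m

n, d = 500, 2
print(required_m(n, lambda k: k ** -0.5, d))            # polynomial eps2(n): m grows polynomially in n
print(required_m(n, lambda k: np.exp(-k / 100.0), d))   # exponential eps2(n): m blows up much faster
```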
56 However, if γ is small relative to the average spacing n−1/d between labeled data points, then ǫ1 (n) = cn−1/d where c > 0 is a constant. [sent-139, score-0.245]
57 This lower bound follows from the minimax lower bound proofs for regression in the next section. [sent-140, score-0.419]
58 Recall that pk (Y |X = x) is the conditional density on the k-th component and let Ek denote expectation with respect to the corresponding conditional distribution. [sent-144, score-0.266]
59 While a spatially uniform estimator suffices when the decision sets are discernable, we use the following spatially adaptive estimator proposed in Section 4. [sent-156, score-0.354]
60 This ensures that when the decision sets are indiscernible using unlabeled data, the semi-supervised learner still achieves an error bound that is, up to logarithmic factors, no worse than the minimax lower bound for supervised learners. [sent-158, score-1.42]
61 It is shown in [12] that this estimator yields a finite sample error bound of n−2α/(2α+d) for Hölder-α smooth functions, and max{n−2α/(2α+d), n−1/d} for piecewise Hölder-α functions, ignoring logarithmic factors. [sent-161, score-0.355]
62 Using these results from [12] and Corollary 1, we now state finite sample upper bounds on the semisupervised learner (SSL) described above. [sent-162, score-0.355]
63 Also, we derive finite sample minimax lower bounds on the performance of any supervised learner (SL). [sent-163, score-0.689]
64 A sketch of the proofs is given below. If the component marginal densities and regression functions have different smoothnesses, say α and β, the same analysis holds except that f(x) is Hölder-min(α, β) smooth on each D ∈ D. [sent-165, score-0.334]
65 If d < 2α/(2α − 1), then the supervised learning error due to not resolving the decision sets (which behaves like n−1/d) is smaller than the error incurred in estimating the target function itself (which behaves like n−2α/(2α+d)). [sent-169, score-0.586]
66 Thus, when d < 2α/(2α − 1), the supervised regression error is dominated by the error in smooth regions and there appears to be no benefit to using a semi-supervised learner. [sent-170, score-0.435]
67 However, if γ > 0 is small relative to n−1/d , but large with respect to the spacing between unlabeled data points m−1/d , then the proposed semi-supervised learner provides improved error bounds compared to any supervised learner. [sent-174, score-1.006]
68 If |γ| is smaller than m−1/d , the decision sets are not discernable with unlabeled data and SSL provides no gain. [sent-175, score-0.762]
69 However, notice that the performance of the semi-supervised learner is no worse than the minimax lower bound for supervised learners. [sent-176, score-0.654]
70 In the γ < 0 case, if −γ is larger than m−1/d, then the semi-supervised learner can discern the decision sets and achieves smaller error bounds, whereas these sets cannot be as accurately discerned by any supervised learner. [sent-177, score-0.899]
71 For the overlap case (γ < 0), supervised learners are always limited by the error incurred due to averaging across decision sets (n−1/d ). [sent-178, score-0.528]
72 The theoretical characterization we present explains why in certain situations unlabeled data can help to improve learning, while in other situations they may not. [sent-181, score-0.478]
73 We demonstrate that there exist general situations under which semi-supervised learning can be significantly superior to supervised learning in terms of achieving smaller finite sample error bounds than any supervised learner, and sometimes in terms of a better rate of error convergence. [sent-182, score-0.651]
74 Moreover, our results also provide a quantification of the relative value of unlabeled to labeled data. [sent-183, score-0.5]
75 Since the component densities are bounded from below and above, define pmin := b min_k ak ≤ p(x) ≤ B =: pmax. [sent-187, score-0.294]
76 This implies Σ_{i=1}^{m} G(Hm^{-1}(Xi − x)) > 0, and ∃ Xi ∈ U within √d hm of x. [sent-201, score-0.241]
77 Using the density estimation results, we now show that if |γ| > 6√d hm, then for all p ∈ PX, all pairs of points x1, x2 ∈ supp(p)\B and all D ∈ D, for large enough m, with probability > 1 − 1/m, we have x1, x2 ∈ D if and only if x1 ↔ x2. [sent-202, score-0.438]
78 In the latter case, the marginal density p(x) jumps by at least pmin one or more times along all sequences connecting x1 and x2 . [sent-205, score-0.269]
79 Suppose the first jump occurs where decision set D ends and another decision set D′ ≠ D begins (in the sequence). [sent-206, score-0.38]
80 Then since D′ is at least |γ| > 6√d hm wide, by Corollary 2, for all sequences connecting x1 and x2 through unlabeled data points, there exist points zi, zj in the sequence that lie in D \ B, D′ \ B, respectively, and ‖zi − zj‖ ≤ hm log m. [sent-207, score-1.493]
81 Since the density on each decision set is Hölder-α smooth, we have |p(zi) − p(zj)| ≥ pmin − O((hm log m)min(1,α)). [sent-208, score-0.43]
82 Since zi, zj ∉ B, using Theorem 1, |p̂(zi) − p̂(zj)| ≥ |p(zi) − p(zj)| − 2ǫm > δm for large enough m. [sent-209, score-0.299]
83 x1, x2 ∈ D ⇒ x1 ↔ x2: Since D has width at least |γ| > 6√d hm, there exists a region of width > 2√d hm contained in D \ B, and Corollary 2 implies that with probability > 1 − 1/m, there exist sequence(s) contained in D \ B connecting x1 and x2 through 2√d hm-dense unlabeled data points. [sent-212, score-1.277]
84 Since the sequence is contained in D and the density on D is Hölder-α smooth, we have for all points zi, zj in the sequence that satisfy ‖zi − zj‖ ≤ hm log m, |p(zi) − p(zj)| ≤ O((hm log m)min(1,α)). [sent-213, score-0.963]
85 Since zi, zj ∉ B, using Theorem 1, |p̂(zi) − p̂(zj)| ≤ |p(zi) − p(zj)| + 2ǫm ≤ δm for large enough m. [sent-214, score-0.299]
86 Now observe that fD,n essentially uses the clairvoyant knowledge of the decision sets D to discern which labeled points X1, . . . , Xn lie in the same decision set as a given test point X. [sent-225, score-0.665]
87 Thus, we can define a semi-supervised learner fm,n to be the same as fD,n except that instead of using clairvoyant knowledge of whether X, Xi ∈ D, fm,n is based on whether X ↔ Xi . [sent-230, score-0.371]
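A hedged sketch of such an fm,n appears below: for a query point x it keeps only the labeled points that the empirical connectivity test declares connected to x and predicts with a plain local average over them. The average is a simplistic stand-in for the spatially adaptive estimator of [12] that the paper actually uses, and the connectivity test is an assumption built on the decision-set labels from the earlier sketch (comparing the labels of the nearest unlabeled neighbours of x and Xi).

```python
import numpy as np

def make_connected(U, labels):
    """Connectivity test derived from the empirical decision sets: two points are
    declared connected when their nearest unlabeled neighbours share a label."""
    def connected(a, b):
        ia = int(np.argmin(np.linalg.norm(U - np.asarray(a), axis=1)))
        ib = int(np.argmin(np.linalg.norm(U - np.asarray(b), axis=1)))
        return labels[ia] == labels[ib]
    return connected

def f_mn_predict(x, X_lab, Y_lab, connected):
    """Semi-supervised prediction at x: average the labeled responses whose inputs
    are judged connected to x; fall back to the global average if none are."""
    mask = np.array([connected(x, xi) for xi in X_lab])
    if not mask.any():
        mask[:] = True
    return Y_lab[mask].mean()
```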
88 Since E[(f(X) − fD,n(X))2] = Σ_{D∈D} E[(f(X) − fD,n(X))2 | X ∈ D] P(D), taking expectation over nD ∼ Binomial(n, P(D)) and summing over all decision sets, recalling that |D| is a finite constant, the overall error of fD,n scales as n−2α/(2α+d), ignoring logarithmic factors. [sent-236, score-0.314]
89 If |γ| < Co (m/(log m)2 )−1/d , the decision sets are not discernable using unlabeled data. [sent-239, score-0.762]
90 Since the regression function is piecewise Hölder-α smooth on each empirical decision set, using Theorem 9 in [12] and similar analysis, an upper bound of max{n−2α/(2α+d), n−1/d} follows, which scales as n−1/d when d ≥ 2α/(2α − 1). [sent-240, score-0.47]
91 2) Supervised Learning Lower Bound: The formal minimax proof requires construction of a finite subset of distributions in PXY (γ) that are the hardest cases to distinguish based on a finite number of labeled data n, and relies on a Hellinger version of Assouad’s Lemma (Theorem 2. [sent-241, score-0.292]
92 Here we present the simple intuition behind the minimax lower bound of n−1/d when γ < co n−1/d . [sent-244, score-0.463]
93 In this case the decision boundaries can only be localized to an accuracy of n−1/d , the average spacing between labeled data points. [sent-245, score-0.446]
94 Since the boundaries are Lipschitz, the expected volume that is incorrectly assigned to any decision set is > c1 n−1/d , where c1 > 0 is a constant. [sent-246, score-0.233]
95 Thus, if the expected excess risk at a point that is incorrectly assigned to a decision set can be greater than a constant c2 > 0, then the overall expected excess risk is > c1 c2 n−1/d . [sent-247, score-0.372]
96 If γ > co n−1/d , the decision sets can be accurately discerned from the labeled data alone. [sent-249, score-0.651]
97 In this case, it follows that the minimax lower bound is equal to the minimax lower bound for Hölder-α smooth regression functions, which is cn−2α/(d+2α), where c > 0 is a constant [17]. [sent-250, score-0.698]
98 : A PAC-style model for learning from labeled and unlabeled data. [sent-254, score-0.468]
99 : The relative value of labeled and unlabeled samples in pattern recognition. [sent-293, score-0.5]
100 : The asymptotic minimax constant for sup-norm loss in nonparametric density estimation. [sent-325, score-0.273]
wordName wordTfidf (topN-words)
[('unlabeled', 0.346), ('ssl', 0.335), ('dhm', 0.251), ('learner', 0.204), ('pxy', 0.201), ('co', 0.196), ('decision', 0.19), ('supervised', 0.183), ('zj', 0.176), ('minimax', 0.17), ('clairvoyant', 0.167), ('hm', 0.149), ('discernable', 0.146), ('suppxy', 0.126), ('labeled', 0.122), ('px', 0.11), ('smooth', 0.109), ('pk', 0.104), ('density', 0.103), ('ck', 0.1), ('gk', 0.095), ('spacing', 0.091), ('zi', 0.09), ('supp', 0.089), ('margin', 0.084), ('sets', 0.08), ('pmin', 0.073), ('wisconsin', 0.071), ('corollary', 0.069), ('madison', 0.067), ('situations', 0.066), ('bound', 0.064), ('log', 0.064), ('fn', 0.063), ('castelli', 0.063), ('discerned', 0.063), ('marginal', 0.06), ('component', 0.059), ('cluster', 0.057), ('polynomially', 0.057), ('nite', 0.056), ('bounds', 0.055), ('regression', 0.055), ('yes', 0.055), ('discern', 0.055), ('lipschitz', 0.055), ('manifold', 0.054), ('icting', 0.054), ('url', 0.054), ('xd', 0.052), ('semisupervised', 0.052), ('piecewise', 0.052), ('densities', 0.051), ('points', 0.051), ('pen', 0.05), ('excess', 0.047), ('nowak', 0.047), ('target', 0.045), ('ak', 0.045), ('exponentially', 0.045), ('zl', 0.044), ('risk', 0.044), ('error', 0.044), ('sample', 0.044), ('fk', 0.044), ('lemma', 0.043), ('boundaries', 0.043), ('xi', 0.043), ('abundance', 0.042), ('indiscernible', 0.042), ('korostelev', 0.042), ('mhd', 0.042), ('nhm', 0.042), ('estimator', 0.042), ('sl', 0.041), ('bridge', 0.041), ('py', 0.041), ('width', 0.038), ('collection', 0.037), ('exists', 0.037), ('perfect', 0.037), ('djk', 0.037), ('lie', 0.035), ('bounded', 0.035), ('boundary', 0.034), ('capitalize', 0.034), ('fragment', 0.034), ('rigollet', 0.034), ('enough', 0.033), ('lower', 0.033), ('singh', 0.033), ('connecting', 0.033), ('relative', 0.032), ('exist', 0.032), ('pmax', 0.031), ('wasserman', 0.031), ('learners', 0.031), ('attempts', 0.031), ('improvement', 0.03), ('mixture', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
Author: Aarti Singh, Robert Nowak, Xiaojin Zhu
Abstract: Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
2 0.24407852 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
Author: Liu Yang, Rong Jin, Rahul Sukthankar
Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.
3 0.15878379 128 nips-2008-Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
Author: Doug Downey, Oren Etzioni
Abstract: Is accurate classification possible in the absence of hand-labeled data? This paper introduces the Monotonic Feature (MF) abstraction—where the probability of class membership increases monotonically with the MF’s value. The paper proves that when an MF is given, PAC learning is possible with no hand-labeled data under certain assumptions. We argue that MFs arise naturally in a broad range of textual classification applications. On the classic “20 Newsgroups” data set, a learner given an MF and unlabeled data achieves classification accuracy equal to that of a state-of-the-art semi-supervised learner relying on 160 hand-labeled examples. Even when MFs are not given as input, their presence or absence can be determined from a small amount of hand-labeled data, which yields a new semi-supervised learning method that reduces error by 15% on the 20 Newsgroups data. 1
4 0.15778059 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Author: Yi Zhang, Artur Dubrawski, Jeff G. Schneider
Abstract: In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used.
5 0.14140566 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
Author: Sudheendra Vijayanarasimhan, Kristen Grauman
Abstract: We introduce a framework for actively learning visual categories from a mixture of weakly and strongly labeled image examples. We propose to allow the categorylearner to strategically choose what annotations it receives—based on both the expected reduction in uncertainty as well as the relative costs of obtaining each annotation. We construct a multiple-instance discriminative classifier based on the initial training data. Then all remaining unlabeled and weakly labeled examples are surveyed to actively determine which annotation ought to be requested next. After each request, the current classifier is incrementally updated. Unlike previous work, our approach accounts for the fact that the optimal use of manual annotation may call for a combination of labels at multiple levels of granularity (e.g., a full segmentation on some images and a present/absent flag on others). As a result, it is possible to learn more accurate category models with a lower total expenditure of manual annotation effort. 1
6 0.14133438 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
7 0.10777781 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
8 0.1066789 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
9 0.10216585 65 nips-2008-Domain Adaptation with Multiple Sources
10 0.098252974 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
11 0.086329646 193 nips-2008-Regularized Co-Clustering with Dual Supervision
12 0.085810237 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
13 0.084136993 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
14 0.082227811 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
15 0.077907279 170 nips-2008-Online Optimization in X-Armed Bandits
16 0.077102154 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
17 0.075918972 228 nips-2008-Support Vector Machines with a Reject Option
18 0.074962065 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions
19 0.074549273 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
20 0.070693046 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
topicId topicWeight
[(0, -0.235), (1, -0.067), (2, -0.114), (3, -0.03), (4, -0.101), (5, 0.035), (6, 0.046), (7, 0.021), (8, -0.081), (9, 0.066), (10, 0.131), (11, 0.098), (12, -0.027), (13, -0.195), (14, -0.16), (15, 0.091), (16, -0.078), (17, 0.236), (18, 0.05), (19, -0.015), (20, -0.034), (21, 0.005), (22, -0.031), (23, -0.179), (24, 0.064), (25, -0.09), (26, -0.06), (27, -0.008), (28, 0.157), (29, 0.057), (30, -0.029), (31, 0.046), (32, -0.168), (33, 0.022), (34, -0.025), (35, -0.01), (36, 0.039), (37, -0.055), (38, 0.042), (39, 0.027), (40, -0.051), (41, -0.043), (42, -0.085), (43, 0.024), (44, -0.029), (45, -0.011), (46, 0.063), (47, -0.048), (48, -0.054), (49, 0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.95098454 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
Author: Aarti Singh, Robert Nowak, Xiaojin Zhu
Abstract: Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
2 0.74933308 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
Author: Liu Yang, Rong Jin, Rahul Sukthankar
Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.
3 0.74201703 128 nips-2008-Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
Author: Doug Downey, Oren Etzioni
Abstract: Is accurate classification possible in the absence of hand-labeled data? This paper introduces the Monotonic Feature (MF) abstraction—where the probability of class membership increases monotonically with the MF’s value. The paper proves that when an MF is given, PAC learning is possible with no hand-labeled data under certain assumptions. We argue that MFs arise naturally in a broad range of textual classification applications. On the classic “20 Newsgroups” data set, a learner given an MF and unlabeled data achieves classification accuracy equal to that of a state-of-the-art semi-supervised learner relying on 160 hand-labeled examples. Even when MFs are not given as input, their presence or absence can be determined from a small amount of hand-labeled data, which yields a new semi-supervised learning method that reduces error by 15% on the 20 Newsgroups data. 1
4 0.61924988 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
Author: Massih Amini, Nicolas Usunier, François Laviolette
Abstract: We propose two transductive bounds on the risk of majority votes that are estimated over partially labeled training sets. The first one involves the margin distribution of the classifier and a risk bound on its associate Gibbs classifier. The bound is tight when so is the Gibbs’s bound and when the errors of the majority vote classifier is concentrated on a zone of low margin. In semi-supervised learning, considering the margin as an indicator of confidence constitutes the working hypothesis of algorithms which search the decision boundary on low density regions. Following this assumption, we propose to bound the error probability of the voted classifier on the examples for whose margins are above a fixed threshold. As an application, we propose a self-learning algorithm which iteratively assigns pseudo-labels to the set of unlabeled training examples that have their margin above a threshold obtained from this bound. Empirical results on different datasets show the effectiveness of our approach compared to the same algorithm and the TSVM in which the threshold is fixed manually. 1
5 0.58316147 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
Author: Yi Zhang, Artur Dubrawski, Jeff G. Schneider
Abstract: In this paper, we address the question of what kind of knowledge is generally transferable from unlabeled text. We suggest and analyze the semantic correlation of words as a generally transferable structure of the language and propose a new method to learn this structure using an appropriately chosen latent variable model. This semantic correlation contains structural information of the language space and can be used to control the joint shrinkage of model parameters for any specific task in the same space through regularization. In an empirical study, we construct 190 different text classification tasks from a real-world benchmark, and the unlabeled documents are a mixture from all these tasks. We test the ability of various algorithms to use the mixed unlabeled text to enhance all classification tasks. Empirical results show that the proposed approach is a reliable and scalable method for semi-supervised learning, regardless of the source of unlabeled data, the specific task to be enhanced, and the prediction model used.
6 0.53969109 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
7 0.52868956 126 nips-2008-Localized Sliced Inverse Regression
8 0.51736873 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
9 0.49861303 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
10 0.45005465 193 nips-2008-Regularized Co-Clustering with Dual Supervision
11 0.43631271 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
12 0.38920361 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
13 0.3815434 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
14 0.37998191 228 nips-2008-Support Vector Machines with a Reject Option
15 0.37397158 65 nips-2008-Domain Adaptation with Multiple Sources
16 0.37078097 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
17 0.36399585 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
18 0.3576535 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
19 0.35607222 101 nips-2008-Human Active Learning
20 0.35138038 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
topicId topicWeight
[(4, 0.03), (6, 0.1), (7, 0.074), (12, 0.061), (28, 0.132), (57, 0.056), (59, 0.023), (63, 0.027), (66, 0.222), (71, 0.029), (77, 0.051), (83, 0.107)]
simIndex simValue paperId paperTitle
1 0.88685107 18 nips-2008-An Efficient Sequential Monte Carlo Algorithm for Coalescent Clustering
Author: Dilan Gorur, Yee W. Teh
Abstract: We propose an efficient sequential Monte Carlo inference scheme for the recently proposed coalescent clustering model [1]. Our algorithm has a quadratic runtime while those in [1] is cubic. In experiments, we were surprised to find that in addition to being more efficient, it is also a better sequential Monte Carlo sampler than the best in [1], when measured in terms of variance of estimated likelihood and effective sample size. 1
2 0.81464159 95 nips-2008-Grouping Contours Via a Related Image
Author: Praveen Srinivasan, Liming Wang, Jianbo Shi
Abstract: Contours have been established in the biological and computer vision literature as a compact yet descriptive representation of object shape. While individual contours provide structure, they lack the large spatial support of region segments (which lack internal structure). We present a method for further grouping of contours in an image using their relationship to the contours of a second, related image. Stereo, motion, and similarity all provide cues that can aid this task; contours that have similar transformations relating them to their matching contours in the second image likely belong to a single group. To find matches for contours, we rely only on shape, which applies directly to all three modalities without modification, in contrast to the specialized approaches developed for each independently. Visually salient contours are extracted in each image, along with a set of candidate transformations for aligning subsets of them. For each transformation, groups of contours with matching shape across the two images are identified to provide a context for evaluating matches of individual contour points across the images. The resulting contexts of contours are used to perform a final grouping on contours in the original image while simultaneously finding matches in the related image, again by shape matching. We demonstrate grouping results on image pairs consisting of stereo, motion, and similar images. Our method also produces qualitatively better results against a baseline method that does not use the inferred contexts. 1
same-paper 3 0.80919892 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
Author: Aarti Singh, Robert Nowak, Xiaojin Zhu
Abstract: Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
4 0.68533254 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
5 0.6791774 202 nips-2008-Robust Regression and Lasso
Author: Huan Xu, Constantine Caramanis, Shie Mannor
Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1
6 0.6789766 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
7 0.6783604 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
8 0.67830944 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
9 0.67583686 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
10 0.6746487 75 nips-2008-Estimating vector fields using sparse basis field expansions
11 0.67454392 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
12 0.67438173 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
13 0.67279935 62 nips-2008-Differentiable Sparse Coding
14 0.67056972 143 nips-2008-Multi-label Multiple Kernel Learning
15 0.66699356 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
16 0.66564816 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
17 0.66224086 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
18 0.6618253 196 nips-2008-Relative Margin Machines
19 0.66142267 228 nips-2008-Support Vector Machines with a Reject Option
20 0.66107178 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features