nips nips2007 nips2007-201 knowledge-graph by maker-knowledge-mining

201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect


Source: pdf

Author: Kaushik Sinha, Mikhail Belkin

Abstract: Semi-supervised learning, i.e. learning from both labeled and unlabeled data, has received significant attention in the machine learning literature in recent years. Still, our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and best understood situation is when the data is described by an identifiable mixture model, and where each class comes from a pure component. This natural setup and its implications were analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect the data to come from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation; for example, “cluster” and “manifold” assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5], has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution which can be modeled, but not perfectly, by an identifiable mixture distribution. This seems applicable to many situations, for example when a mixture of Gaussians is used to model the data. The contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 learning from both labeled and unlabeled data has received significant attention in the machine learning literature in recent years. [sent-9, score-0.576]

2 Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. [sent-10, score-0.365]

3 One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. [sent-13, score-0.651]

4 the contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model. [sent-19, score-0.602]

5 learning from labeled and unlabeled data, has drawn significant attention. [sent-22, score-0.576]

6 The ubiquity and easy availability of unlabeled data together with the increased computational power of modern computers, make the paradigm attractive in various applications, while connections to natural learning make it also conceptually intriguing. [sent-23, score-0.323]

7 The unlabeled data comes from the marginal distribution p(x). [sent-27, score-0.339]

8 Thus the usefulness of unlabeled data is tied to how much information about the joint distribution can be extracted from the marginal distribution. [sent-28, score-0.35]

9 Therefore, in order to make unlabeled data useful, an assumption on the connection between these distributions needs to be made. [sent-29, score-0.339]

10 However, while these assumptions have motivated several algorithms and have been shown to hold empirically, few theoretical results on the value of unlabeled data in the non-parametric setting are available so far. [sent-33, score-0.362]

11 We note the work of Balcan and Blum ([2]), which attempts to unify several frameworks by introducing a notion of compatibility between labeled and unlabeled data. [sent-34, score-0.576]

12 The study of usefulness of unlabeled data under this assumption was undertaken by Castelli and Cover ([5]) and Ratsaby and Venkatesh ([11]). [sent-39, score-0.366]

13 Among several important conclusions from their study was the fact that, under a certain range of conditions, labeled data is exponentially more important for approximating the Bayes optimal classifier than unlabeled data. [sent-40, score-0.633]

14 Roughly speaking, unlabeled data may be used to identify the parameters of each mixture component, after which the class attribution can be established exponentially fast using only a few labeled examples. [sent-41, score-0.848]

15 In this paper we investigate the limits of usefulness of unlabeled data as a function of how far the best fitting model strays from the underlying probability distribution. [sent-44, score-0.404]

16 We then describe how the relative value of labeled and unlabeled data changes when the true distribution is a perturbation of a parametric model. [sent-46, score-0.855]

17 Finally we discuss various regimes of usability for labeled and unlabeled data and represent our findings in Fig 1. [sent-47, score-0.612]

18 In what follows, we review some theoretical results that describe the behavior of the excess error probability as a function of the number of labeled and unlabeled examples. [sent-50, score-0.75]

19 We will denote the number of labeled examples by l and the number of unlabeled examples by u. [sent-51, score-0.872]

20 (Ratsaby and Venkatesh [11]) In a two-class identifiable mixture model, let the equiprobable class densities p1(x), p2(x) be d-dimensional Gaussians with unit covariance matrices. [sent-56, score-0.455]

21 Then for sufficiently small ε > 0 and arbitrary δ > 0, given l = O(log δ⁻¹) labeled and u = O((d²/(ε³δ))(d log(1/ε) + log(1/δ))) unlabeled examples respectively, with confidence at least 1 − δ, the probability of error satisfies Perror ≤ PBayes(1 + cε) for some positive constant c. [sent-57, score-0.803]

22 Since the mixture is identifiable, parameters can be estimated from unlabeled examples alone. [sent-58, score-0.645]

23 Therefore, unlabeled examples are used to estimate the mixture and hence the two decision regions. [sent-60, score-0.666]

24 Once the decision regions are established, labeled examples are used to label them. [sent-61, score-0.546]

25 An equivalent form of the above result in terms of labeled and unlabeled examples is Perror − PBayes = O(d/u^(1/3)) + O(exp(−l)). [sent-62, score-0.724]

26 For a fixed dimension d, this indicates that labeled examples are exponentially more valuable than unlabeled examples in reducing the excess probability of error; however, when d is not fixed, higher dimensions slow these rates. [sent-63, score-1.11]

27 (Cover and Castelli [5]) In a two-class mixture model, let p1(x), p2(x) be the parametric class densities and let h(η) be the prior over the unknown mixing parameter η. [sent-67, score-0.512]

28 In particular this implies that, if u·e^(−Dl) → 0 and l = o(u), the excess error is essentially determined by the number of unlabeled examples. [sent-70, score-0.461]

29 On the other hand, if u grows faster than e^(Dl), then the excess error is determined by the number of labeled examples. [sent-71, score-0.391]
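A rough numerical reading of these two regimes can be sketched as follows (the constant D here is a made-up placeholder for the divergence-type quantity in [5], and the finite-sample cutoffs are only illustrative of the asymptotic conditions):

```python
import math

D = 0.5  # hypothetical divergence-like constant; the paper's D is model-dependent

def dominant_sample_type(l, u):
    """Heuristic finite-sample reading of the Cover-Castelli regimes:
    which sample size governs the excess error for l labeled and u
    unlabeled examples."""
    # u e^{-Dl} small and l = o(u): unlabeled examples dominate the excess error
    if u * math.exp(-D * l) < 1 and l < u:
        return "unlabeled"
    # u grows faster than e^{Dl}: labeled examples dominate
    if u > math.exp(D * l):
        return "labeled"
    return "indeterminate"

regime_a = dominant_sample_type(100, 1000)   # l large relative to log u
regime_b = dominant_sample_type(5, 10**6)    # u overwhelms e^{Dl}
```

Note that the two branches are complementary in u: u·e^(−Dl) < 1 is exactly u < e^(Dl), so the classifier switches regime as u crosses e^(Dl).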

30 Both results indicate that if the parametric model assumptions are satisfied, labeled examples are exponentially more valuable than unlabeled examples in reducing the excess probability of error. [sent-74, score-1.277]

31 • Type-A uncertainty for perfect model with imperfect information: Individual class densities follow the assumed parametric model. [sent-79, score-0.438]

32 • Type-B uncertainty for imperfect model: Individual class densities do not follow the assumed parametric model. [sent-82, score-0.416]

33 We will denote the mixing parameter by t, the individual parametric class densities by f1(x|θ1), f2(x|θ2) respectively, and the resulting mixture density by t·f1(x|θ1) + (1 − t)·f2(x|θ2). [sent-89, score-0.524]

34 The class of such mixtures is identifiable and hence, using unlabeled examples alone, θ can be estimated by θ̂ ∈ R^(2d). [sent-92, score-0.561]

35 Type-A Uncertainty: Perfect Model, Imperfect Information. Due to the finiteness of unlabeled examples, density parameters cannot be estimated arbitrarily close to the true parameters in terms of Euclidean norm. [sent-97, score-0.465]

36 Clearly, how close they can be estimated depends on the number of unlabeled examples u, the dimension d, and the confidence probability δ. [sent-98, score-0.534]

37 Thus, Type-A uncertainty inherently gives rise to a perturbation size ε₁(u, d, δ); a fixed u defines a perturbation size ε₁(d, δ). [sent-99, score-0.358]

38 From [11] it is clear that only a few labeled examples are enough to label these two estimated decision regions reasonably well with high probability. [sent-101, score-0.588]

39 But what happens if the number of labeled examples available is greater than l ∗ ? [sent-103, score-0.401]

40 Since the individual densities follow the parametric model exactly, these extra labeled examples can be used to estimate the density parameters and hence the decision regions. [sent-104, score-0.882]

41 We adopt the following strategy to utilize labeled and unlabeled examples in order to learn a classification rule. [sent-108, score-0.751]

42 Given u unlabeled examples and confidence probability δ > 0, use the maximum likelihood estimation method to learn the parameters of the mixture model such that the estimates θ̂1, θ̂2 are only ε₁(u, d, δ) = O*(d/u^(1/3)) close to the actual parameters with probability at least 1 − δ. [sent-110, score-0.522]

43 Use l* labeled examples to label the estimated decision regions with probability of incorrect labeling no greater than δ. [sent-112, score-0.588]
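A toy one-dimensional sketch of this strategy (hypothetical data and constants; a plain EM loop stands in for the maximum likelihood estimation step): unlabeled examples fit the two-component mixture, and a handful of labeled examples only attach class labels to the two estimated decision regions.

```python
import math
import random

random.seed(0)

# Hypothetical 1-D instance of the setup: two equiprobable unit-variance
# Gaussians with unknown means mu_true, hidden from the learner.
mu_true = (-2.0, 2.0)

def sample(n):
    # each point: (x, class); the class is ignored for unlabeled data
    return [(random.gauss(mu_true[c], 1.0), c)
            for c in (random.randint(0, 1) for _ in range(n))]

def em_two_gaussians(xs, iters=50):
    """Maximum-likelihood estimation (EM) of the two component means,
    assuming equal weights and unit variances."""
    m0, m1 = min(xs), max(xs)  # crude but sufficient initialization
    for _ in range(iters):
        # E-step: responsibility of component 0 for each point
        r0 = []
        for x in xs:
            p0 = math.exp(-0.5 * (x - m0) ** 2)
            p1 = math.exp(-0.5 * (x - m1) ** 2)
            r0.append(p0 / (p0 + p1))
        # M-step: weighted means
        m0 = sum(r * x for r, x in zip(r0, xs)) / sum(r0)
        m1 = sum((1 - r) * x for r, x in zip(r0, xs)) / sum(1 - r for r in r0)
    return sorted([m0, m1])

unlabeled = [x for x, _ in sample(2000)]
labeled = sample(10)  # only a few labeled examples

m0, m1 = em_two_gaussians(unlabeled)
boundary = (m0 + m1) / 2  # boundary between the two estimated decision regions

# Labeled examples are used ONLY to decide which region is which class.
votes = sum(1 for x, c in labeled if (x > boundary) == (c == 1))
flip = votes < len(labeled) / 2

def classify(x):
    return int((x > boundary) != flip)

test_set = sample(5000)
err = sum(classify(x) != c for x, c in test_set) / len(test_set)
```

With well-separated components, `err` lands close to the Bayes error Φ(−2) ≈ 0.023 even though only ten labels were used.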

44 For any arbitrary 1 > δ > 0, if strategy 1 is used with u unlabeled examples, then there exists a perturbation size ε₁(u, d, δ) > 0 and positive constants A, B such that, using l ≤ l* = L(24, 0.5, ε₁) [sent-118, score-0.658]

45 labeled examples, Perror − PBayes reduces exponentially fast in the number of labeled examples with probability at least (1 − δ/2). [sent-119, score-0.808]

46 If more labeled examples l > l* are provided, then with probability at least (1 − δ/2), Perror − PBayes asymptotically converges to zero at a rate O((d/l) log(d/δ)) as l → ∞. [sent-120, score-0.477]

47 If we represent the reduction rate of this excess error (Perror − PBayes) as a function of labeled examples by Ree(l), then this can be compactly represented as Ree(l) = O(exp(−l)) if l ≤ l*, and Ree(l) = O((d/l) log(d/δ)) if l > l*. After using l* labeled examples, Perror = PBayes + O(ε₁). [sent-121, score-0.994]
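The two-phase rate can be written down directly as a piecewise function (the constants d, δ and l* below are placeholders for illustration, not the values the theorem specifies via A, B and L(24, 0.5, ε₁)):

```python
import math

# Illustrative constants only; the theorem fixes l* via L(24, 0.5, eps1)
# and the polynomial phase via O((d/l) log(d/delta)).
d, delta = 5, 0.05
l_star = 20

def excess_error_reduction_rate(l):
    """Two-phase rate Ree(l): exponential while l <= l*, then polynomial."""
    if l <= l_star:
        return math.exp(-l)
    return (d / l) * math.log(d / delta)

fast_phase = [excess_error_reduction_rate(l) for l in (5, 10, 20)]
slow_phase = [excess_error_reduction_rate(l) for l in (50, 500, 5000)]
```

Within each phase the rate decreases in l, but the decrease after l* is only polynomial: going from 50 to 500 labeled examples shrinks the second-phase rate by a factor of 10, whereas in the first phase the same multiplicative gain costs only a few extra labels.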

48 Here the individual class densities do not follow the assumed parametric model exactly but are a perturbed version of the assumed model. [sent-124, score-0.424]

49 The uncertainty in this case is specified by the perturbation size ε₂, which roughly indicates to what extent the true class densities differ from those of the best fitting parametric model densities. [sent-125, score-0.573]

50 For any mixing parameter t ∈ (0, 1), let us consider a two-class mixture model with individual class densities p1(x), p2(x) respectively. [sent-126, score-0.426]

51 Suppose the best knowledge available about this mixture model is that the individual class densities approximately follow some parametric form from a class F. [sent-127, score-0.575]

52 Here, the Sobolev norm is used as a smoothness condition and implies that the true densities are smooth and not “too different” from the best fitting parametric model densities; in particular, if ||fi − pi||_(d/2, 2) ≤ ε₂ then ||fi − pi||₁ ≤ ε₂ and ||fi − pi||_∞ ≤ ε₂. [sent-129, score-0.55]

53 Thus, using only a very small number of labeled examples, Perror reduces exponentially fast to PBayes as the number of labeled examples increases. [sent-132, score-1.357]

54 However, due to the presence of the perturbation, Perror reduces exponentially fast in the number of labeled examples only up to PBayes + O(ε₂). [sent-133, score-0.713]

55 Beyond this point, the parametric model assumptions do not hold due to the presence of the perturbation, so some nonparametric technique must be used to estimate the actual decision boundary. [sent-134, score-0.583]

56 If more labeled examples l > l* are provided, Perror − PBayes asymptotically converges to zero at a rate O(1/√l) as l → ∞. [sent-144, score-0.477]

57 After using l* labeled examples, Perror = PBayes + O(ε₂). [sent-145, score-0.401]

58 Thus, from the above theorem, as labeled examples are added, initially the excess error reduces at a very fast rate (exponentially in the number of labeled examples) until Perror − PBayes = O(ε₂). [sent-146, score-0.958]

59 After that the excess error reduces only polynomially fast in the number of labeled examples. [sent-147, score-0.488]

60 However, in case of a specific class of parametric densities such a crude approximation may not be necessary. [sent-149, score-0.338]

61 However, the true model is an equiprobable mixture of perturbed versions of these individual class densities. [sent-152, score-0.324]

62 As before, given u unlabeled examples and l labeled examples, we want a strategy to learn a classification rule and to analyze the effect of these examples, and also of the perturbation size ε₂, in reducing the excess probability of error. [sent-153, score-1.354]

63 One option to achieve this is to use the unlabeled examples to estimate the true mixture density (1/2)p1 + (1/2)p2; however, the number of unlabeled examples required to estimate the mixture density using nonparametric kernel density estimation is exponential in the number of dimensions [10]. [sent-154, score-1.464]

64 A better option is to use the unlabeled examples to estimate the best fitting Gaussians within F. [sent-156, score-0.543]

65 The number of unlabeled examples needed to estimate such a mixture of Gaussians is only polynomial in the number of dimensions [10], and it is easy to show that the distance between the Bayesian decision function and the decision function due to the Gaussian approximation is at most ε₂ in || · ||₂. [sent-157, score-0.735]

66 Now suppose we use the following strategy to utilize labeled and unlabeled examples. [sent-159, score-0.603]

67 Assume the examples are distributed according to a mixture of equiprobable Gaussians with unit covariance matrices and apply the maximum likelihood estimation method to find the best Gaussian approximation of the densities. [sent-161, score-0.403]

68 Use a small number of labeled examples l* to label the two approximate decision regions correctly with high probability. [sent-163, score-0.561]

69 If more (l > l*) labeled examples are available, use them to learn a better decision function using some nonparametric technique. [sent-165, score-0.498]

70 For any ε > 0 and 0 < δ < 1, there exist positive constants A, B such that if strategy 2 is used with u = O((d²/(ε³δ))(d log(1/ε) + log(1/δ))) unlabeled and l* = L(0.5, 12, (ε + ε₂)) [sent-169, score-0.388]

71 labeled examples, then for l ≤ l*, Perror − PBayes reduces exponentially fast in the number of labeled examples with probability at least (1 − δ). [sent-170, score-0.956]

72 If more labeled examples l > l* are provided, Perror − PBayes asymptotically converges to zero at most at a rate O(1/√l) as l → ∞. [sent-171, score-0.463]

73 If we represent the reduction rate of this excess error (Perror − PBayes) as a function of labeled examples by Ree(l), then this can be compactly represented as Ree(l) = O(exp(−l)) if l ≤ l*, and Ree(l) = O(1/√l) if l > l*. After using l* labeled examples, Perror = PBayes + O(ε + ε₂). [sent-172, score-0.868]

74 Note that when the number of unlabeled examples is infinite, the parameters of the best fitting model can be estimated arbitrarily well, i.e., [sent-173, score-0.591]

75 ε → 0, and hence Perror − PBayes reduces exponentially fast in the number of labeled examples until Perror − PBayes = O(ε₂). [sent-175, score-0.555]

76 On the other hand, if ε = O(ε₂), Perror − PBayes still reduces exponentially fast in the number of labeled examples until Perror − PBayes = O(ε₂). [sent-176, score-0.555]

77 A more precise estimate of the parameters of the best fitting model, using more unlabeled examples, does not help reduce Perror − PBayes at the same exponential rate beyond Perror − PBayes = O(ε₂). [sent-178, score-0.661]

78 For a perturbation size ε₂ > 0, let the best fitting model for a mixture of equiprobable densities be a mixture of equiprobable d-dimensional spherical Gaussians with unit covariance matrices. [sent-182, score-0.837]

79 If, using u* = O((d²/(ε₂³δ))(d log(1/ε₂) + log(1/δ))) unlabeled examples, the parameters of the best fitting model can be estimated O(ε₂) close in the Euclidean norm sense, then any additional unlabeled examples u > u* do not help in reducing the excess error. [sent-183, score-1.306]

80 3 Discussion on different rates of convergence: In this section we discuss the effect of the perturbation size ε₂ on the behavior of Perror − PBayes and its effect on controlling the value of labeled and unlabeled examples. [sent-184, score-0.772]

81 Different combinations of the number of labeled and unlabeled examples give rise to four different regions where Perror − PBayes behaves differently, as shown in Figure 1, where the x axis corresponds to the number of unlabeled examples and the y axis corresponds to the number of labeled examples. [sent-185, score-1.519]

82 Let u* be the number of unlabeled examples required to estimate the parameters of the best fitting model O(ε₂) close in the Euclidean norm sense. [sent-186, score-0.609]

83 When u > u*, unlabeled examples have no role to play in reducing Perror − PBayes, as shown in region II and part of region III in Figure 1. [sent-189, score-0.537]

84 For u ≤ u*, unlabeled examples become useful only in regions I and IV. [sent-190, score-0.506]

85 When u* unlabeled examples are available to estimate the parameters of the best fitting model O(ε₂) close, let l* be the number of labeled examples required to label the estimated decision regions so that Perror − PBayes = O(ε₂). [sent-191, score-1.155]
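With the thresholds u* and l* in hand, the four regimes can be sketched as a simple lookup. This partition is only one plausible reading of the surrounding discussion, not a reproduction of the paper's Figure 1:

```python
def region(l, u, l_star, u_star):
    """One plausible reading of the four (l, u) regimes around Figure 1:
    u* is the point beyond which extra unlabeled examples stop helping,
    and l* the number of labels needed merely to label the estimated
    decision regions."""
    if u <= u_star:
        return "I" if l <= l_star else "IV"
    return "II" if l <= l_star else "III"
```

For instance, with hypothetical thresholds l* = 10 and u* = 100, a learner with 5 labels and 5 unlabeled points sits in region I, while one with 50 labels and 500 unlabeled points sits in region III, where excess labels are spent nonparametrically.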

86 Behavior of Perror − PBayes for different numbers of labeled and unlabeled examples. [sent-194, score-0.724]

87 3.1 Behavior of Perror − PBayes in Region I: In this region, u ≤ u* unlabeled examples estimate the decision regions and l*_u labeled examples, where l*_u depends on u, are required to correctly label these estimated regions. [sent-195, score-1.008]

88 This rate can be interpreted as the rate at which unlabeled examples estimate the parameters of the best fitting model and the rate at which labeled examples correctly label these estimated decision regions. [sent-197, score-1.259]

89 Instead of using this large number of labeled examples to label poorly estimated decision regions, they can be used to estimate the parameters of the best fitting model; as will be seen next, this is precisely what happens in region III. [sent-199, score-0.67]

90 Using u* unlabeled examples, the parameters of the best fitting model can be estimated O(ε₂) close in the Euclidean norm sense, and a more precise estimate of the best fitting model parameters using more unlabeled examples u > u* does not help in reducing Perror − PBayes. [sent-204, score-1.248]

91 Thus, unlabeled examples have no role to play in this region, and for a small number of labeled examples l ≤ l*, Perror − PBayes reduces at a rate O(exp(−l)). [sent-205, score-1.015]

92 The number of labeled examples available in this region is greater than what is required merely to label the decision regions estimated using u unlabeled examples, and hence these excess labeled examples can be used to estimate the model parameters. [sent-209, score-1.624]

93 Note that once the parameters have been estimated O(ε₂) close to the parameters of the best fitting model using labeled examples, the parametric model assumptions are no longer valid. [sent-210, score-0.601]

94 Also note that, depending on the number of unlabeled examples u ≤ u*, l* and l₁* are not fixed numbers but will depend on u. [sent-212, score-0.471]

95 In the presence of labeled examples alone, using Theorem 2. [sent-213, score-0.417]

96 Since parameters are being estimated using both labeled and unlabeled examples, the effective rate at which Perror − PBayes reduces in this region can be thought of as the mean of the two. [sent-215, score-0.785]

97 In either case, since the parameters of the best fitting model have been estimated O(ε₂) close, the parametric model assumptions are no longer valid and excess labeled examples must be used in a nonparametric way. [sent-218, score-0.925]

98 A PAC-style model for learning from labeled and unlabeled data. [sent-238, score-0.598]

99 The complexity of learning from a mixture of labeled and unlabeled examples. [sent-283, score-0.684]

100 Learning from a mixture of labeled and unlabeled examples with parametric side information. [sent-289, score-0.969]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pbayes', 0.59), ('perror', 0.525), ('unlabeled', 0.323), ('labeled', 0.253), ('examples', 0.148), ('perturbation', 0.142), ('parametric', 0.137), ('densities', 0.136), ('excess', 0.116), ('mixture', 0.108), ('tting', 0.089), ('equiprobable', 0.08), ('decision', 0.069), ('reduces', 0.062), ('exponentially', 0.057), ('castelli', 0.057), ('spherical', 0.055), ('regions', 0.049), ('class', 0.048), ('gaussians', 0.047), ('rate', 0.046), ('estimated', 0.042), ('imperfect', 0.042), ('ratsaby', 0.039), ('uncertainty', 0.038), ('perturbed', 0.037), ('mixing', 0.035), ('region', 0.035), ('fast', 0.035), ('classi', 0.035), ('ree', 0.033), ('best', 0.032), ('sobolev', 0.031), ('density', 0.031), ('reducing', 0.031), ('lu', 0.029), ('individual', 0.029), ('boundary', 0.028), ('nonparametric', 0.028), ('dl', 0.028), ('strategy', 0.027), ('label', 0.027), ('usefulness', 0.027), ('columbus', 0.026), ('imperfection', 0.026), ('venkatesh', 0.026), ('identi', 0.026), ('euclidean', 0.024), ('parameters', 0.024), ('assumptions', 0.024), ('ohio', 0.023), ('oh', 0.023), ('theorem', 0.023), ('option', 0.022), ('pi', 0.022), ('model', 0.022), ('error', 0.022), ('fi', 0.021), ('behavior', 0.021), ('norm', 0.021), ('situation', 0.021), ('close', 0.021), ('regimes', 0.021), ('dimensional', 0.021), ('exp', 0.021), ('ers', 0.02), ('cover', 0.02), ('differentiability', 0.019), ('log', 0.019), ('manifold', 0.019), ('estimate', 0.018), ('valuable', 0.018), ('non', 0.018), ('covariance', 0.018), ('size', 0.018), ('balcan', 0.017), ('niteness', 0.017), ('help', 0.017), ('crude', 0.017), ('unit', 0.017), ('rd', 0.016), ('presence', 0.016), ('blum', 0.016), ('corollary', 0.016), ('asymptotically', 0.016), ('slower', 0.016), ('assumption', 0.016), ('comes', 0.016), ('belkin', 0.015), ('correctly', 0.015), ('follow', 0.015), ('compactly', 0.015), ('convergence', 0.015), ('theoretical', 0.015), ('er', 0.015), ('represent', 0.015), ('annual', 0.014), ('minimax', 0.014), ('separation', 0.014), 
('provided', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

Author: Kaushik Sinha, Mikhail Belkin


2 0.1689188 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1

3 0.16572024 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely based on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning. 1

4 0.11944968 166 nips-2007-Regularized Boost for Semi-Supervised Learning

Author: Ke Chen, Shihai Wang

Abstract: Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smoothness regularizer to semi-supervised boosting algorithms based on the universal optimization framework of margin cost functionals. Our regularizer is applicable to existing semi-supervised boosting algorithms to improve their generalization and speed up their training. Comparative results on synthetic, benchmark and real world tasks demonstrate the effectiveness of our local smoothness regularizer. We discuss relevant issues and relate our regularizer to previous work. 1

5 0.10623258 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.

6 0.09578716 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

7 0.088403799 175 nips-2007-Semi-Supervised Multitask Learning

8 0.076921485 32 nips-2007-Bayesian Co-Training

9 0.072824083 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

10 0.068336651 110 nips-2007-Learning Bounds for Domain Adaptation

11 0.059561186 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

12 0.059040595 136 nips-2007-Multiple-Instance Active Learning

13 0.048438173 15 nips-2007-A general agnostic active learning algorithm

14 0.041907605 135 nips-2007-Multi-task Gaussian Process Prediction

15 0.040891156 159 nips-2007-Progressive mixture rules are deviation suboptimal

16 0.039756633 61 nips-2007-Convex Clustering with Exemplar-Based Models

17 0.039403349 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

18 0.039201938 39 nips-2007-Boosting the Area under the ROC Curve

19 0.036956731 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

20 0.03422828 200 nips-2007-The Tradeoffs of Large Scale Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.135), (1, 0.035), (2, -0.097), (3, 0.096), (4, 0.027), (5, 0.108), (6, 0.014), (7, -0.098), (8, 0.106), (9, 0.066), (10, 0.028), (11, 0.042), (12, -0.156), (13, -0.104), (14, 0.076), (15, 0.105), (16, -0.008), (17, 0.058), (18, 0.172), (19, -0.015), (20, 0.033), (21, -0.075), (22, 0.086), (23, -0.109), (24, 0.077), (25, 0.075), (26, 0.094), (27, -0.073), (28, 0.035), (29, -0.088), (30, 0.002), (31, 0.1), (32, 0.12), (33, 0.005), (34, 0.009), (35, 0.004), (36, -0.046), (37, 0.014), (38, -0.052), (39, 0.004), (40, -0.04), (41, 0.04), (42, 0.042), (43, 0.016), (44, 0.035), (45, -0.078), (46, -0.027), (47, 0.045), (48, -0.053), (49, -0.079)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95694679 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

Author: Kaushik Sinha, Mikhail Belkin


2 0.72063977 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

Author: Larry Wasserman, John D. Lafferty

Abstract: Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been largely based on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on regularization using graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.

3 0.69633412 69 nips-2007-Discriminative Batch Mode Active Learning

Author: Yuhong Guo, Dale Schuurmans

Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can apply a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning approach is more effective than current state-of-the-art batch mode active learning algorithms.

4 0.66355699 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition

Author: Maryam Mahdaviani, Tanzeem Choudhury

Abstract: We present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs – a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. The objective function of sVEB combines the unlabeled conditional entropy with the labeled conditional pseudo-likelihood. It reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real-world inference systems. Experiments on synthetic data and real activity traces collected from wearable sensors illustrate that sVEB benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised approaches.

5 0.55914617 166 nips-2007-Regularized Boost for Semi-Supervised Learning

Author: Ke Chen, Shihai Wang

Abstract: Semi-supervised inductive learning concerns how to learn a decision rule from a data set containing both labeled and unlabeled data. Several boosting algorithms have been extended to semi-supervised learning with various strategies. To our knowledge, however, none of them takes local smoothness constraints among data into account during ensemble learning. In this paper, we introduce a local smoothness regularizer to semi-supervised boosting algorithms based on the universal optimization framework of margin cost functionals. Our regularizer is applicable to existing semi-supervised boosting algorithms to improve their generalization and speed up their training. Comparative results on synthetic, benchmark and real-world tasks demonstrate the effectiveness of our local smoothness regularizer. We discuss relevant issues and relate our regularizer to previous work.

6 0.52876294 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images

7 0.51159555 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

8 0.50997138 175 nips-2007-Semi-Supervised Multitask Learning

9 0.47109675 15 nips-2007-A general agnostic active learning algorithm

10 0.45986953 136 nips-2007-Multiple-Instance Active Learning

11 0.42511731 32 nips-2007-Bayesian Co-Training

12 0.40762174 110 nips-2007-Learning Bounds for Domain Adaptation

13 0.40222675 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

14 0.37319478 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection

15 0.3450315 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

16 0.31173041 44 nips-2007-Catching Up Faster in Bayesian Model Selection and Model Averaging

17 0.29023555 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

18 0.28662291 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l_\alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections

19 0.28429869 103 nips-2007-Inferring Elapsed Time from Stochastic Neural Processes

20 0.28094724 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.028), (13, 0.017), (16, 0.036), (18, 0.018), (19, 0.013), (21, 0.109), (22, 0.221), (31, 0.027), (34, 0.023), (35, 0.028), (47, 0.087), (83, 0.179), (85, 0.035), (90, 0.06)]
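The LDA representation above is a sparse map from topicId to topicWeight. As with the tfidf-style scores, the page does not document how the LDA-based simValue is obtained; a plausible sketch is cosine similarity over these sparse maps (the function name and the truncated weights below are illustrative assumptions, not the site's actual pipeline):

```python
import math

def sparse_cosine(p, q):
    """Cosine similarity between two sparse {topicId: weight} maps."""
    dot = sum(w * q[t] for t, w in p.items() if t in q)
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # an empty topic distribution matches nothing
    return dot / (norm_p * norm_q)

# A few of the topic weights listed above (truncated for illustration).
this_paper = {21: 0.109, 22: 0.221, 47: 0.087, 83: 0.179, 90: 0.06}

# Identical topic distributions give similarity 1.0, and papers with
# disjoint topic supports give 0.0 - the intuition behind the ranked
# list that follows.
assert abs(sparse_cosine(this_paper, this_paper) - 1.0) < 1e-9
assert sparse_cosine({1: 0.5}, {2: 0.5}) == 0.0
```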

similar papers list:

simIndex simValue paperId paperTitle

1 0.88040859 174 nips-2007-Selecting Observations against Adversarial Objectives

Author: Andreas Krause, Brendan McMahan, Carlos Guestrin, Anupam Gupta

Abstract: In many applications, one has to actively select among a set of expensive observations before making an informed decision. Often, we want to select observations which perform well when evaluated with an objective function chosen by an adversary. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for the case where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We evaluate our algorithm on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs favorably compared to SDP-based algorithms.

same-paper 2 0.81333786 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect

Author: Kaushik Sinha, Mikhail Belkin

Abstract: (same abstract as above)

3 0.73187256 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning

Author: Kai Yu, Wei Chu

Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity.

4 0.72277278 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes

Author: Geoffrey E. Hinton, Ruslan Salakhutdinov

Abstract: We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by [7]. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.

5 0.72238755 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

Author: Charles L. Isbell, Michael P. Holmes, Alexander G. Gray

Abstract: Machine learning contains many computational bottlenecks in the form of nested summations over datasets. Kernel estimators and other methods are burdened by these expensive computations. Exact evaluation is typically O(n^2) or higher, which severely limits application to large datasets. We present a multi-stage stratified Monte Carlo method for approximating such summations with probabilistic relative error control. The essential idea is fast approximation by sampling in trees. This method differs from many previous scalability techniques (such as standard multi-tree methods) in that its error is stochastic, but we derive conditions for error control and demonstrate that they work. Further, we give a theoretical sample complexity for the method that is independent of dataset size, and show that this appears to hold in experiments, where speedups reach as high as 10^14, many orders of magnitude beyond the previous state of the art.

6 0.71954483 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

7 0.71588981 156 nips-2007-Predictive Matrix-Variate t Models

8 0.71513194 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search

9 0.71365911 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)

10 0.71297598 175 nips-2007-Semi-Supervised Multitask Learning

11 0.7118066 128 nips-2007-Message Passing for Max-weight Independent Set

12 0.71161497 69 nips-2007-Discriminative Batch Mode Active Learning

13 0.71109217 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC

14 0.71082324 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine

15 0.71061653 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

16 0.71046001 187 nips-2007-Structured Learning with Approximate Inference

17 0.70897752 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data

18 0.70851779 18 nips-2007-A probabilistic model for generating realistic lip movements from speech

19 0.70800942 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs

20 0.70744985 63 nips-2007-Convex Relaxations of Latent Variable Training