nips nips2004 nips2004-164 knowledge-graph by maker-knowledge-mining

164 nips-2004-Semi-supervised Learning by Entropy Minimization


Source: pdf

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables unlabeled data to be incorporated into standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performance is clearly in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we illustrate that the method can also be far superior to manifold learning in high-dimensional spaces. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. [sent-5, score-0.731]

2 In this framework, we motivate minimum entropy regularization, which enables unlabeled data to be incorporated into standard supervised learning. [sent-6, score-1.034]

3 A series of experiments illustrates that the proposed solution benefits from unlabeled data. [sent-8, score-0.598]

4 The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. [sent-9, score-0.26]

5 The performance is clearly in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. [sent-10, score-1.283]

6 1 Introduction In the classical supervised learning classification framework, a decision rule is to be learned from a learning set Ln = {(xi, yi)}, i = 1, …, n, where each example is described by a pattern xi ∈ X and by the supervisor’s response yi ∈ Ω = {ω1, …, ωK}. [sent-12, score-0.162]

7 In the terminology used here, semi-supervised learning refers to learning a decision rule on X from labeled and unlabeled data. [sent-17, score-0.763]

8 In the probabilistic framework, semi-supervised learning can be modeled as a missing data problem, which can be addressed by generative models such as mixture models thanks to the EM algorithm and extensions thereof [6]. [sent-23, score-0.229]

9 Generative models apply to the joint density of patterns and class (X, Y ). [sent-24, score-0.16]

10 These difficulties have led to proposals aiming at processing unlabeled data in the framework of supervised classification [1, 5, 11]. [sent-32, score-0.695]

11 Here, we propose an estimation principle applicable to any probabilistic classifier, aiming at making the most of unlabeled data when they are beneficial, while controlling their contribution so as to provide robustness to the learning scheme. [sent-33, score-0.669]

12 Derivation of the Criterion: Likelihood. We first recall how the semi-supervised learning problem fits into standard supervised learning by using the maximum (conditional) likelihood estimation principle. [sent-35, score-0.139]

13 We assume that labeling is missing at random, that is, for all unlabeled examples, P(z|x, ωk) = P(z|x, ωℓ) for any (ωk, ωℓ) pair, which implies P(ωk|x, z) = zk P(ωk|x) / Σ_{ℓ=1}^{K} zℓ P(ωℓ|x) (1). [sent-40, score-0.653]
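
A minimal numpy sketch of this combination rule; the function name and the toy probabilities are illustrative, not taken from the paper:

```python
import numpy as np

def posterior_given_partial_label(p_class_given_x, z):
    """Equation (1): P(wk | x, z) = zk P(wk | x) / sum_l zl P(wl | x).

    p_class_given_x : shape (K,), the model's P(wk | x).
    z               : binary indicator of shape (K,); one-hot for a labeled
                      example, all ones for an unlabeled one (any label allowed).
    """
    num = z * p_class_given_x
    return num / num.sum()

p = np.array([0.2, 0.5, 0.3])                                   # toy P(wk | x), K = 3
print(posterior_given_partial_label(p, np.array([0, 1, 0])))    # labeled as class 2 -> [0, 1, 0]
print(posterior_given_partial_label(p, np.array([1, 1, 1])))    # unlabeled -> unchanged
```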

14 This criterion is a concave function of fk (xi ; θ), and for simple models such as the ones provided by logistic regression, it is also concave in θ, so that the global solution can be obtained by numerical optimization. [sent-42, score-0.688]

15 Provided fk (xi ; θ) sum to one, the likelihood is not affected by unlabeled data: unlabeled data convey no information. [sent-44, score-1.512]

16 In the maximum a posteriori (MAP) framework, Seeger remarks that unlabeled data are useless regarding discrimination when the priors on P (X) and P (Y |X) factorize [10]: observing x does not inform about y, unless the modeler assumes so. [sent-45, score-0.625]

17 Benefitting from unlabeled data requires assumptions of some sort on the relationship between X and Y . [sent-46, score-0.565]

18 As there is no such thing as a universally relevant prior, we should look for an induction bias exploiting unlabeled data when the latter are known to convey information. [sent-48, score-0.645]

19 Theory provides little support for the numerous experimental findings [5, 7, 8] showing that unlabeled examples can help the learning process. [sent-51, score-0.681]

20 Semi-supervised learning, in the terminology used here, does not fit the distribution-free frameworks: no positive statement can be made without distributional assumptions, as for some distributions P (X, Y ) unlabeled data are non-informative while supervised learning is an easy task. [sent-53, score-0.679]

21 In this regard, generalizing from labeled and unlabeled data may differ from transductive inference. [sent-54, score-0.723]

22 In parametric statistics, theory has shown the benefit of unlabeled examples, either for specific distributions [9], or for mixtures of the form P (x) = pP (x|ω1 ) + (1 − p)P (x|ω2 ) where the estimation problem is essentially reduced to the one of estimating the mixture parameter p [4]. [sent-55, score-0.671]

23 These studies conclude that the (asymptotic) information content of unlabeled examples decreases as classes overlap. [sent-56, score-0.688]

24 Thus, the assumption that classes are well separated is sensible if we expect to take advantage of unlabeled examples. [sent-57, score-0.634]

25 The conditional entropy H(Y |X) is a measure of class overlap, which is invariant to the parameterization of the model. [sent-58, score-0.405]

26 This measure is related to the usefulness of unlabeled data where labeling is indeed ambiguous. [sent-59, score-0.596]

27 Hence, we will measure the conditional entropy of class labels conditioned on the observed variables, H(Y|X, Z) = −E_{XYZ}[log P(Y|X, Z)] (3), where E_X denotes the expectation with respect to X. [sent-60, score-0.449]

28 Stating that we expect a high conditional entropy does not uniquely define the form of the prior distribution, but the latter can be derived by resorting to the maximum entropy principle. [sent-62, score-0.688]

29 Computing H(Y|X, Z) requires a model of P(X, Y, Z), whereas the choice of the diagnosis paradigm is motivated by the possibility of limiting modeling to conditional probabilities. [sent-64, score-0.136]

30 This substitution, which can be interpreted as “modeling” P(X, Z) by its empirical distribution, yields Hemp(Y|X, Z; Ln) = −(1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} P(ωk|xi, zi) log P(ωk|xi, zi) (5). [sent-66, score-0.164]
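
A short sketch of this plug-in estimate (5); the clipping constant is an implementation detail added here to avoid log 0, and the toy posterior matrix is illustrative:

```python
import numpy as np

def empirical_conditional_entropy(post):
    """Plug-in estimate (5): -(1/n) sum_i sum_k P(wk | xi, zi) log P(wk | xi, zi).

    post : array of shape (n, K) holding P(wk | xi, zi) for each example.
    """
    p = np.clip(post, 1e-12, 1.0)            # guard against log(0); 0*log(0) is treated as 0
    return -np.mean(np.sum(p * np.log(p), axis=1))

post = np.array([[1.0, 0.0],                 # labeled example: one-hot, contributes zero entropy
                 [0.9, 0.1],                 # confident prediction on an unlabeled example
                 [0.5, 0.5]])                # ambiguous prediction on an unlabeled example
print(empirical_conditional_entropy(post))   # small only when unlabeled predictions are confident
```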

31 3 Entropy Regularization Recalling that fk(x; θ) denotes the model of P(ωk|x), the model of P(ωk|x, z) (1) is defined as follows: gk(x, z; θ) = zk fk(x; θ) / Σ_{ℓ=1}^{K} zℓ fℓ(x; θ). [sent-69, score-0.654]

32 For labeled data, gk(x, z; θ) = zk, and for unlabeled data, gk(x, z; θ) = fk(x; θ). [sent-70, score-0.565]

33 From now on, we drop the reference to parameter θ in fk and gk to lighten notation. [sent-71, score-0.367]

34 Here, maximum entropy refers to the construction principle which makes it possible to derive distributions from constraints, not to the content of priors regarding entropy. [sent-73, score-0.343]

35 While L(θ; Ln ) is only sensitive to labeled data, Hemp (Y |X, Z; Ln ) is only affected by the value of fk (x) on unlabeled data. [sent-75, score-0.956]

36 Note that the approximation Hemp (5) of H (3) breaks down for wiggly functions fk (·) with abrupt changes between data points (where P (X) is bounded from below). [sent-76, score-0.23]

37 As a result, it is important to constrain fk (·) in order to enforce the closeness of the two functionals. [sent-77, score-0.23]

38 In the following experimental section, we imposed a smoothness constraint on fk (·) by adding to the criterion C (6) a penalizer with its corresponding Lagrange multiplier ν. [sent-78, score-0.322]

39 [1] analyzed this technique and showed that it is equivalent to a version of the classification EM algorithm, which minimizes the likelihood deprived of the entropy of the partition. [sent-81, score-0.339]

40 In the context of conditional likelihood with labeled and unlabeled examples, the criterion is Σ_{i=1}^{n} [ log( Σ_{k=1}^{K} zik fk(xi) ) + Σ_{k=1}^{K} gk(xi) log gk(xi) ], which is recognized as an instance of the criterion (6) with λ = 1. [sent-82, score-1.514]

41 Self-confident logistic regression [5] is another algorithm optimizing the criterion for λ = 1. [sent-83, score-0.489]

42 Minimum entropy methods Minimum entropy regularizers have been used in other contexts to encode learnability priors (e. [sent-85, score-0.595]

43 However, we stress that for unlabeled data, the regularizer agrees with the complete likelihood provided P (X) is small near the decision surface. [sent-92, score-0.732]

44 Indeed, whereas a generative model would maximize log P (X) on the unlabeled data, our criterion minimizes the conditional entropy on the same points. [sent-93, score-1.058]

45 with weight decay), the conditional entropy is prevented from being too small close to the decision surface. [sent-96, score-0.394]

46 Our goal is to check to what extent supervised learning can be improved by unlabeled examples, and if minimum entropy can compete with generative models which are usually advocated in this framework. [sent-100, score-1.154]

47 The minimum entropy regularizer is applied to the logistic regression model. [sent-101, score-0.9]
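
A rough sketch of what such a criterion can look like for a softmax (multinomial logistic) model: conditional log-likelihood on the labeled points, minus λ times the entropy of the predictions on the unlabeled points, minus a weight-decay penalty standing in for the smoothness constraint. All names, the toy data, the λ and decay values, and the exact scaling of the entropy term are choices made here for illustration, not the paper's settings:

```python
import numpy as np
from scipy.optimize import minimize

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def neg_criterion(theta, X_lab, y_lab, X_unl, lam, decay, K):
    """Negative of: log-likelihood(labeled) - lam * entropy(unlabeled) - decay * ||W||^2."""
    W = theta.reshape(K, X_lab.shape[1])                 # no intercept, for brevity
    f_lab = softmax(X_lab @ W.T)
    loglik = np.sum(np.log(f_lab[np.arange(len(y_lab)), y_lab] + 1e-12))
    f_unl = softmax(X_unl @ W.T)
    entropy = -np.sum(f_unl * np.log(f_unl + 1e-12))     # summed over unlabeled points
    return -(loglik - lam * entropy - decay * np.sum(W ** 2))

# toy data: two well-separated Gaussian classes, few labels, many unlabeled points
rng = np.random.default_rng(0)
K, d, nl, nu = 2, 2, 20, 200
y_lab = rng.integers(0, K, nl)
X_lab = rng.normal(size=(nl, d)) + 2.0 * (2 * y_lab[:, None] - 1)
X_unl = rng.normal(size=(nu, d)) + 2.0 * (2 * rng.integers(0, K, nu)[:, None] - 1)

res = minimize(neg_criterion, np.zeros(K * d),
               args=(X_lab, y_lab, X_unl, 0.1, 1e-3, K), method="L-BFGS-B")
print("fitted weights:\n", res.x.reshape(K, d))
```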

48 It is compared to logistic regression fitted by maximum likelihood (ignoring unlabeled data) and logistic regression with all labels known. [sent-102, score-1.517]

49 The former shows what has been gained by handling unlabeled data, and the latter provides the “crystal ball” performance obtained by correctly guessing all labels. [sent-103, score-0.624]

50 All hyper-parameters (weight-decay for all logistic regression models plus the λ parameter (6) for minimum entropy) are tuned by ten-fold cross-validation. [sent-104, score-0.59]

51 Minimum entropy logistic regression is also compared to the classic EM algorithm for Gaussian mixture models (two means and one common covariance matrix estimated by maximum likelihood on labeled and unlabeled examples, see e. [sent-105, score-1.553]

52 Bad local maxima of the likelihood function are avoided by initializing EM with the parameters of the true distribution when the latter is a Gaussian mixture, or with maximum likelihood parameters on the (fully labeled) test sample when the distribution departs from the model. [sent-108, score-0.309]

53 Furthermore, this initialization prevents interference that may result from the “pseudo-labels” given to unlabeled examples at the first E-step. [sent-110, score-0.695]
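
A compact sketch of the generative baseline described above: a two-class Gaussian mixture with one shared covariance matrix, fit by EM in which labeled examples keep their class responsibilities clamped and unlabeled examples get soft responsibilities at each E-step. Initialization and stopping are simplified here relative to the protocol above, and the toy data are illustrative:

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    d = X.shape[1]
    diff = X - mean
    quad = np.sum(diff @ np.linalg.inv(cov) * diff, axis=1)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def em_semisup_gmm(X_lab, y_lab, X_unl, n_iter=50):
    """Two Gaussian classes with a common covariance; labeled responsibilities stay clamped."""
    X = np.vstack([X_lab, X_unl])
    n, d = X.shape
    nl = len(y_lab)
    resp = np.zeros((n, 2))
    resp[np.arange(nl), y_lab] = 1.0          # labeled rows: fixed, one-hot
    resp[nl:] = 0.5                           # unlabeled rows: uniform start
    for _ in range(n_iter):
        # M-step: class proportions, two means, one shared covariance (weighted MLE)
        pi = resp.mean(axis=0)
        means = (resp.T @ X) / resp.sum(axis=0)[:, None]
        cov = np.zeros((d, d))
        for k in range(2):
            diff = X - means[k]
            cov += (resp[:, k][:, None] * diff).T @ diff
        cov /= n
        # E-step: update responsibilities of the unlabeled rows only
        lik = np.column_stack([pi[k] * gaussian_pdf(X_unl, means[k], cov) for k in range(2)])
        resp[nl:] = lik / lik.sum(axis=1, keepdims=True)
    return pi, means, cov

rng = np.random.default_rng(1)
X_lab = np.vstack([rng.normal(-2, 1, (25, 2)), rng.normal(2, 1, (25, 2))])
y_lab = np.array([0] * 25 + [1] * 25)
X_unl = np.vstack([rng.normal(-2, 1, (250, 2)), rng.normal(2, 1, (250, 2))])
pi, means, cov = em_semisup_gmm(X_lab, y_lab, X_unl)
print("estimated means:\n", means)
```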

54 The learning sets comprise nl labeled examples, (nl = 50, 100, 200) and nu unlabeled examples, (nu = nl × (1, 3, 10, 30, 100)). [sent-126, score-1.117]

55 This benchmark provides a comparison for the algorithms in a situation where unlabeled data are known to convey information. [sent-129, score-0.642]

56 The logistic regression model is only compatible with the joint distribution, which is a weaker fulfillment than correctness. [sent-131, score-0.462]

57 The overall error rates (averaged over all settings) are in favor of minimum entropy logistic regression (14. [sent-133, score-0.984]

58 3 %) does worse on average than logistic regression (14. [sent-138, score-0.426]

59 The plots represent the error rates (averaged over nl ) versus Bayes error rate and the nu /nl ratio. [sent-146, score-0.482]

60 The first plot shows that, as asymptotic theory suggests [4, 9], unlabeled examples are mostly informative when the Bayes error is low. [sent-147, score-0.807]

61 This observation validates the relevance of the minimum entropy assumption. [sent-148, score-0.415]

62 Mixture models are outperformed by the simple logistic regression model when the sample size is low, since their number of parameters grows quadratically (vs. [sent-150, score-0.49]

63 The second plot shows that the minimum entropy model quickly takes advantage of unlabeled data when classes are well separated. [sent-152, score-1.015]

64 With nu = 3nl , the model considerably improves upon the one discarding unlabeled data. [sent-153, score-0.745]

65 At this stage, the generative models do not perform well, as the number of available examples is low compared to the number of parameters in the model. [sent-154, score-0.208]

66 However, for very large sample sizes, with 100 times more unla- Figure 1: Left: test error vs. [sent-155, score-0.153]

67 Bayes error rate for nu /nl = 10; right: test error vs. [sent-156, score-0.371]

68 Test errors of minimum entropy logistic regression (◦) and mixture models (+). [sent-159, score-0.95]

69 The errors of logistic regression (dashed), and logistic regression with all labels known (dash-dotted) are shown for reference. [sent-160, score-0.896]

70 beled examples than labeled examples, the generative approach eventually becomes more accurate than the diagnosis approach. [sent-161, score-0.367]

71 Misspecified joint density model In a second series of experiments, the setup is slightly modified by letting the class-conditional densities be corrupted by outliers. [sent-162, score-0.164]

72 For each class, the examples are generated from a mixture of two Gaussians centered on the same mean: a unit variance component gathers 98 % of examples, while the remaining 2 % are generated from a large variance component, where each variable has a standard deviation of 10. [sent-163, score-0.165]
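
A small sketch of this corrupted class-conditional density: each class is a mixture of two Gaussians sharing the class mean, with 98 % of points from a unit-variance component and 2 % from a component whose variables have standard deviation 10. The dimension and class means below are illustrative, as they are not specified in this excerpt:

```python
import numpy as np

def sample_corrupted_class(n, mean, outlier_frac=0.02, outlier_std=10.0, rng=None):
    """Mixture of two Gaussians centered on the same mean: 98 % unit variance, 2 % std 10."""
    rng = np.random.default_rng() if rng is None else rng
    is_outlier = rng.random(n) < outlier_frac
    std = np.where(is_outlier, outlier_std, 1.0)[:, None]
    return mean + std * rng.normal(size=(n, len(mean)))

rng = np.random.default_rng(2)
mean1, mean2 = np.zeros(2), np.full(2, 2.5)          # illustrative class means in 2-D
X1 = sample_corrupted_class(500, mean1, rng=rng)     # class w1
X2 = sample_corrupted_class(500, mean2, rng=rng)     # class w2
print(X1.std(axis=0))                                # inflated well above 1 by the 2 % outliers
```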

73 The generative model dramatically suffers from the misspecification and behaves worse than logistic regression for all sample sizes. [sent-166, score-0.546]

74 The unlabeled examples at first have a beneficial effect on test error, but then have a detrimental effect when they overwhelm the number of labeled examples. [sent-167, score-0.818]

75 On the other hand, the diagnosis models behave smoothly as in the previous case, and the performance of the minimum entropy criterion improves. [sent-168, score-0.587]

76 Average test errors for minimum entropy logistic regression (◦) and mixture models (+). [sent-172, score-1.001]

77 The test error rates of logistic regression (dotted), and logistic regression with all labels known (dash-dotted) are shown for reference. [sent-173, score-1.05]

78 Left: experiment with outliers; right: experiment with uninformative unlabeled data. [sent-174, score-0.565]

79 The last series of experiments illustrates the robustness with respect to the cluster assumption, by testing on distributions where unlabeled examples are not informative, and where a low density P(X) does not indicate a boundary region. [sent-175, score-0.811]

80 The data are drawn from two Gaussian clusters as in the first series of experiments, but the label is now independent of the clustering: an example x belongs to class ω1 if x2 > x1 and belongs to class ω2 otherwise; the Bayes decision boundary now separates each cluster in its middle. [sent-176, score-0.326]
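
A sketch of this cluster-assumption-violating setup: points come from two Gaussian clusters, but the label depends only on whether x2 > x1, independently of the generating cluster; the cluster centers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [3.0, 3.0]])        # two clusters (illustrative centers)
which = rng.integers(0, 2, 1000)                    # cluster membership
X = centers[which] + rng.normal(size=(1000, 2))     # patterns drawn from the two clusters
y = (X[:, 1] > X[:, 0]).astype(int)                 # label: 1 when x2 > x1 (class w1), 0 otherwise
print(np.mean(y == which))                          # about 0.5: clusters carry no label information
```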

81 The right-hand-side plot of Figure 1 shows that the favorable initialization of EM does not prevent the model from being fooled by unlabeled data: its test error steadily increases with the amount of unlabeled data. [sent-179, score-1.323]

82 Comparison with manifold transduction Although our primary goal is to infer a decision function, we also provide comparisons with a transduction algorithm of the “manifold family”. [sent-181, score-0.242]

83 Table 1: Error rates (%) of minimum entropy (ME) vs. [sent-188, score-0.448]

84 23, nl = 50, and a) pure Gaussian clusters, b) Gaussian clusters corrupted by outliers, c) class boundary separating one Gaussian cluster; nu = 50, 150, 500, 1500. [sent-190, score-0.569]

85 The results are extremely poor for the consistency method, whose error is way above minimum entropy, and which does not show any sign of improvement as the sample of unlabeled data grows. [sent-238, score-0.9]

86 Furthermore, when classes do not correspond to clusters, the consistency method performs random class assignments. [sent-239, score-0.17]

87 In this situation, local methods suffer from the “curse of dimensionality”, and many more unlabeled examples would be required to get sensible results. [sent-241, score-0.714]

88 We tested kernelized logistic regression (Gaussian kernel), its minimum entropy version, nearest neighbor and the consistency method. [sent-247, score-0.99]

89 3 % test error (compared to 86 % error for random assignments). [sent-251, score-0.191]

90 3 % test error, and kernelized logistic regression (ignoring unlabeled examples) improved to reach 53. [sent-254, score-1.042]

91 The scale parameter chosen for kernelized logistic regression (by ten-fold cross-validation) amounts to using a global classifier. [sent-260, score-0.535]

92 5 Discussion We propose to tackle the semi-supervised learning problem in the supervised learning framework by using the minimum entropy regularizer. [sent-264, score-0.5]

93 This regularizer is motivated by theory, which shows that unlabeled examples are mostly beneficial when classes have small overlap. [sent-265, score-0.787]

94 The MAP framework provides a means to control the weight of unlabeled examples, and thus to depart from optimism when unlabeled data tend to harm classification. [sent-266, score-1.189]

95 Our proposal encompasses self-learning as a particular case, as minimizing entropy increases the confidence of the classifier output. [sent-267, score-0.283]

96 It also approaches the solution of transductive large margin classifiers in another limiting case, as minimizing entropy is a means to drive the decision boundary away from learning examples. [sent-268, score-0.442]

97 The minimum entropy regularizer can be applied to both local and global classifiers. [sent-269, score-0.533]

98 Also, our experiments suggest that the minimum entropy regularization may be a serious contender to generative models. [sent-271, score-0.558]

99 The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. [sent-293, score-0.679]

100 Text classification from labeled and unlabeled documents using EM. [sent-316, score-0.679]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('unlabeled', 0.565), ('entropy', 0.283), ('logistic', 0.267), ('fk', 0.23), ('nu', 0.18), ('regression', 0.159), ('misspeci', 0.155), ('gk', 0.137), ('minimum', 0.132), ('hemp', 0.129), ('nl', 0.129), ('labeled', 0.114), ('zik', 0.09), ('ln', 0.089), ('examples', 0.088), ('generative', 0.088), ('bene', 0.083), ('zi', 0.082), ('em', 0.077), ('mixture', 0.077), ('diagnosis', 0.077), ('kernelized', 0.077), ('manifold', 0.076), ('consistency', 0.072), ('error', 0.07), ('criterion', 0.063), ('class', 0.063), ('cm', 0.061), ('regularizer', 0.059), ('conditional', 0.059), ('transduction', 0.057), ('zk', 0.057), ('xi', 0.056), ('likelihood', 0.056), ('bayes', 0.056), ('regularization', 0.055), ('supervised', 0.054), ('decision', 0.052), ('amini', 0.052), ('supervisor', 0.052), ('test', 0.051), ('clusters', 0.049), ('convey', 0.049), ('affected', 0.047), ('aiming', 0.045), ('labels', 0.044), ('transductive', 0.044), ('informative', 0.044), ('initialization', 0.042), ('maximizer', 0.041), ('favor', 0.04), ('mostly', 0.04), ('pictures', 0.038), ('classi', 0.036), ('ts', 0.036), ('boundary', 0.036), ('joint', 0.036), ('classes', 0.035), ('nigam', 0.034), ('demanding', 0.034), ('sensible', 0.034), ('zhou', 0.034), ('facial', 0.034), ('gaussian', 0.033), ('rates', 0.033), ('pp', 0.033), ('corrupted', 0.033), ('setup', 0.033), ('series', 0.033), ('sample', 0.032), ('prior', 0.032), ('concave', 0.032), ('aa', 0.032), ('terminology', 0.032), ('global', 0.032), ('models', 0.032), ('regarding', 0.031), ('normal', 0.031), ('latter', 0.031), ('framework', 0.031), ('ratio', 0.031), ('labeling', 0.031), ('performances', 0.03), ('favorable', 0.03), ('cluster', 0.03), ('robustness', 0.03), ('management', 0.029), ('lagrange', 0.029), ('multiplier', 0.029), ('avoided', 0.029), ('density', 0.029), ('poor', 0.029), ('priors', 0.029), ('estimation', 0.029), ('provides', 0.028), ('statement', 0.028), ('maxima', 0.027), ('truly', 0.027), ('limiting', 0.027), ('local', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.000001 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables unlabeled data to be incorporated into standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performance is clearly in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we illustrate that the method can also be far superior to manifold learning in high-dimensional spaces. 1

2 0.27144703 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie

Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.

3 0.19256802 54 nips-2004-Distributed Information Regularization on Graphs

Author: Adrian Corduneanu, Tommi S. Jaakkola

Abstract: We provide a principle for semi-supervised learning based on optimizing the rate of communicating labels for unlabeled points with side information. The side information is expressed in terms of identities of sets of points or regions with the purpose of biasing the labels in each region to be the same. The resulting regularization objective is convex, has a unique solution, and the solution can be found with a pair of local propagation operations on graphs induced by the regions. We analyze the properties of the algorithm and demonstrate its performance on document classification tasks. 1

4 0.15740164 115 nips-2004-Maximum Margin Clustering

Author: Linli Xu, James Neufeld, Bryce Larson, Dale Schuurmans

Abstract: We propose a new method for clustering based on finding maximum margin hyperplanes through data. By reformulating the problem in terms of the implied equivalence relation matrix, we can pose the problem as a convex integer program. Although this still yields a difficult computational problem, the hard-clustering constraints can be relaxed to a soft-clustering formulation which can be feasibly solved with a semidefinite program. Since our clustering technique only depends on the data through the kernel matrix, we can easily achieve nonlinear clusterings in the same manner as spectral clustering. Experimental results show that our maximum margin clustering technique often obtains more accurate results than conventional clustering methods. The real benefit of our approach, however, is that it leads naturally to a semi-supervised training method for support vector machines. By maximizing the margin simultaneously on labeled and unlabeled training data, we achieve state of the art performance by using a single, integrated learning principle. 1

5 0.14474435 166 nips-2004-Semi-supervised Learning via Gaussian Processes

Author: Neil D. Lawrence, Michael I. Jordan

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1

6 0.13795042 23 nips-2004-Analysis of a greedy active learning strategy

7 0.12814201 136 nips-2004-On Semi-Supervised Classification

8 0.12422726 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

9 0.11948944 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

10 0.10795794 70 nips-2004-Following Curved Regularized Optimization Solution Paths

11 0.099739447 168 nips-2004-Semigroup Kernels on Finite Sets

12 0.094685636 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment

13 0.093133248 178 nips-2004-Support Vector Classification with Input Data Uncertainty

14 0.092402175 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images

15 0.091356851 138 nips-2004-Online Bounds for Bayesian Algorithms

16 0.091121465 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

17 0.090215124 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

18 0.090042159 42 nips-2004-Computing regularization paths for learning multiple kernels

19 0.089793272 131 nips-2004-Non-Local Manifold Tangent Learning

20 0.087677136 163 nips-2004-Semi-parametric Exponential Family PCA


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.275), (1, 0.117), (2, -0.063), (3, 0.082), (4, 0.018), (5, 0.073), (6, -0.042), (7, 0.145), (8, -0.018), (9, 0.02), (10, 0.139), (11, 0.197), (12, 0.061), (13, -0.163), (14, -0.066), (15, -0.002), (16, 0.077), (17, -0.202), (18, 0.062), (19, -0.152), (20, 0.065), (21, 0.137), (22, 0.183), (23, 0.05), (24, -0.021), (25, 0.146), (26, 0.007), (27, 0.06), (28, -0.061), (29, 0.021), (30, 0.183), (31, 0.043), (32, 0.101), (33, 0.044), (34, 0.05), (35, -0.044), (36, 0.167), (37, 0.053), (38, -0.042), (39, 0.09), (40, 0.089), (41, -0.023), (42, 0.038), (43, -0.038), (44, 0.024), (45, 0.009), (46, 0.0), (47, -0.035), (48, -0.032), (49, -0.033)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96682632 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables unlabeled data to be incorporated into standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performance is clearly in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we illustrate that the method can also be far superior to manifold learning in high-dimensional spaces. 1

2 0.79417628 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

Author: Saharon Rosset, Ji Zhu, Hui Zou, Trevor J. Hastie

Abstract: We consider the situation in semi-supervised learning, where the “label sampling” mechanism stochastically depends on the true response (as well as potentially on the features). We suggest a method of moments for estimating this stochastic dependence using the unlabeled data. This is potentially useful for two distinct purposes: a. As an input to a supervised learning procedure which can be used to “de-bias” its results using labeled data only and b. As a potentially interesting learning task in itself. We present several examples to illustrate the practical usefulness of our method.

3 0.67972571 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

Author: Omid Madani, David M. Pennock, Gary W. Flake

Abstract: In the context of binary classification, we define disagreement as a measure of how often two independently-trained models differ in their classification of unlabeled data. We explore the use of disagreement for error estimation and model selection. We call the procedure co-validation, since the two models effectively (in)validate one another by comparing results on unlabeled data, which we assume is relatively cheap and plentiful compared to labeled data. We show that per-instance disagreement is an unbiased estimate of the variance of error for that instance. We also show that disagreement provides a lower bound on the prediction (generalization) error, and a tight upper bound on the “variance of prediction error”, or the variance of the average error across instances, where variance is measured across training sets. We present experimental results on several data sets exploring co-validation for error estimation and model selection. The procedure is especially effective in active learning settings, where training sets are not drawn at random and cross validation overestimates error. 1

4 0.67295045 54 nips-2004-Distributed Information Regularization on Graphs

Author: Adrian Corduneanu, Tommi S. Jaakkola

Abstract: We provide a principle for semi-supervised learning based on optimizing the rate of communicating labels for unlabeled points with side information. The side information is expressed in terms of identities of sets of points or regions with the purpose of biasing the labels in each region to be the same. The resulting regularization objective is convex, has a unique solution, and the solution can be found with a pair of local propagation operations on graphs induced by the regions. We analyze the properties of the algorithm and demonstrate its performance on document classification tasks. 1

5 0.65268528 166 nips-2004-Semi-supervised Learning via Gaussian Processes

Author: Neil D. Lawrence, Michael I. Jordan

Abstract: We present a probabilistic approach to learning a Gaussian Process classifier in the presence of unlabeled data. Our approach involves a “null category noise model” (NCNM) inspired by ordered categorical noise models. The noise model reflects an assumption that the data density is lower between the class-conditional densities. We illustrate our approach on a toy problem and present comparative results for the semi-supervised classification of handwritten digits. 1

6 0.61062807 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

7 0.60422391 23 nips-2004-Analysis of a greedy active learning strategy

8 0.54865652 136 nips-2004-On Semi-Supervised Classification

9 0.46240044 115 nips-2004-Maximum Margin Clustering

10 0.45855665 158 nips-2004-Sampling Methods for Unsupervised Learning

11 0.41791561 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

12 0.36291564 96 nips-2004-Learning, Regularization and Ill-Posed Inverse Problems

13 0.35068318 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

14 0.34138605 163 nips-2004-Semi-parametric Exponential Family PCA

15 0.33997276 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

16 0.3341541 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

17 0.33036762 8 nips-2004-A Machine Learning Approach to Conjoint Analysis

18 0.32662076 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

19 0.32652783 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

20 0.32456797 127 nips-2004-Neighbourhood Components Analysis


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.102), (15, 0.091), (26, 0.047), (31, 0.392), (33, 0.169), (35, 0.017), (39, 0.012), (50, 0.055), (87, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.8900491 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables unlabeled data to be incorporated into standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performance is clearly in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we illustrate that the method can also be far superior to manifold learning in high-dimensional spaces. 1

2 0.88548082 137 nips-2004-On the Adaptive Properties of Decision Trees

Author: Clayton Scott, Robert Nowak

Abstract: Decision trees are surprisingly adaptive in three important respects: They automatically (1) adapt to favorable conditions near the Bayes decision boundary; (2) focus on data distributed on lower dimensional manifolds; (3) reject irrelevant features. In this paper we examine a decision tree based on dyadic splits that adapts to each of these conditions to achieve minimax optimal rates of convergence. The proposed classifier is the first known to achieve these optimal rates while being practical and implementable. 1

3 0.82761663 66 nips-2004-Exponential Family Harmoniums with an Application to Information Retrieval

Author: Max Welling, Michal Rosen-zvi, Geoffrey E. Hinton

Abstract: Directed graphical models with one layer of observed random variables and one or more layers of hidden random variables have been the dominant modelling paradigm in many research fields. Although this approach has met with considerable success, the causal semantics of these models can make it difficult to infer the posterior distribution over the hidden variables. In this paper we propose an alternative two-layer model based on exponential family distributions and the semantics of undirected models. Inference in these “exponential family harmoniums” is fast while learning is performed by minimizing contrastive divergence. A member of this family is then studied as an alternative probabilistic model for latent semantic indexing. In experiments it is shown that they perform well on document retrieval tasks and provide an elegant solution to searching with keywords.

4 0.78136492 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

Author: Anton Schwaighofer, Volker Tresp, Kai Yu

Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nystr¨ m method, which results in a complex, data o driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance. 1

5 0.62840432 69 nips-2004-Fast Rates to Bayes for Kernel Machines

Author: Ingo Steinwart, Clint Scovel

Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1

6 0.62549168 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

7 0.62351614 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes

8 0.6052416 119 nips-2004-Mistake Bounds for Maximum Entropy Discrimination

9 0.60293519 158 nips-2004-Sampling Methods for Unsupervised Learning

10 0.59735882 87 nips-2004-Integrating Topics and Syntax

11 0.59560812 62 nips-2004-Euclidean Embedding of Co-Occurrence Data

12 0.58967382 23 nips-2004-Analysis of a greedy active learning strategy

13 0.58873308 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

14 0.5883925 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

15 0.58709228 33 nips-2004-Brain Inspired Reinforcement Learning

16 0.58647799 77 nips-2004-Hierarchical Clustering of a Mixture Model

17 0.5857532 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

18 0.58496028 181 nips-2004-Synergies between Intrinsic and Synaptic Plasticity in Individual Model Neurons

19 0.58028042 163 nips-2004-Semi-parametric Exponential Family PCA

20 0.57981116 131 nips-2004-Non-Local Manifold Tangent Learning