nips nips2008 nips2008-91 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Steven J. Phillips, Miroslav Dudík
Abstract: We apply robust Bayesian decision theory to improve both generative and discriminative learners under bias in class proportions in labeled training data, when the true class proportions are unknown. For the generative case, we derive an entropy-based weighting that maximizes expected log likelihood under the worst-case true class proportions. For the discriminative case, we derive a multinomial logistic model that minimizes worst-case conditional log loss. We apply our theory to the modeling of species geographic distributions from presence data, an extreme case of labeling bias since there is no absence data. On a benchmark dataset, we find that entropy-based weighting offers an improvement over constant estimates of class proportions, consistently reducing log loss on unbiased test data.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We apply robust Bayesian decision theory to improve both generative and discriminative learners under bias in class proportions in labeled training data, when the true class proportions are unknown. [sent-5, score-0.804]
2 For the generative case, we derive an entropy-based weighting that maximizes expected log likelihood under the worst-case true class proportions. [sent-6, score-0.307]
3 For the discriminative case, we derive a multinomial logistic model that minimizes worst-case conditional log loss. [sent-7, score-0.34]
4 We apply our theory to the modeling of species geographic distributions from presence data, an extreme case of labeling bias since there is no absence data. [sent-8, score-0.717]
5 On a benchmark dataset, we find that entropy-based weighting offers an improvement over constant estimates of class proportions, consistently reducing log loss on unbiased test data. [sent-9, score-0.241]
6 Thus, class proportions in labeled data may significantly differ from true class proportions. [sent-11, score-0.346]
7 In an extreme case, labeled data for an entire class might be missing (for example, negative experimental results are typically not published). [sent-12, score-0.2]
8 Several techniques address labeling bias in the context of cost-sensitive learning and learning from imbalanced data [5, 11, 2]. [sent-14, score-0.176]
9 If the labeling bias is known or can be estimated, and all classes appear in the training set, a model trained on biased data can be corrected by reweighting [5]. [sent-15, score-0.244]
10 When the labeling bias is unknown, a model is often selected using threshold-independent analysis such as ROC curves [11]. [sent-16, score-0.176]
11 Here, we are concerned with situations when the labeling bias is unknown and some classes may be missing, but we have access to unlabeled data. [sent-18, score-0.417]
12 We will be concerned with minimizing joint and conditional log loss, or equivalently, maximizing joint and conditional log likelihood. [sent-20, score-0.331]
13 The data consists of a set of locations within some region (for example, the Australian wet tropics) where a species (such as the golden bowerbird) was observed, and a set of features such as precipitation and temperature, describing environmental conditions at each location. [sent-22, score-0.452]
14 Species distribution modeling suffers from extreme imbalance in training data: we often only have information about species presence (positive examples), but no information about species absence (negative examples). [sent-23, score-0.791]
15 We do, however, have unlabeled data, obtained either by randomly sampling locations from the region [4], or by pooling presence data for several species collected with similar methods to yield a representative sample of locations that biologists have surveyed [13]. [sent-24, score-0.744]
16 Previous statistical methods for species distribution modeling can be divided into three main approaches. [sent-25, score-0.302]
17 The first interprets all unlabeled data as examples of species absence and learns a rule to discriminate them from presences [19, 4]. [sent-26, score-0.703]
18 The second embeds a discriminative learner in the EM algorithm in order to infer presences and absences in unlabeled data; this explicitly requires knowledge of true class probabilities [17]. [sent-27, score-0.465]
19 When using the first approach, the training data is commonly reweighted so that positive and negative examples have the same weight [4]; this models a quantity monotonically related to the conditional probability of presence [13], with the relationship depending on the true class probabilities. [sent-29, score-0.363]
20 If we use y to denote the binary variable indicating presence and x to denote a location on the map, then the first two approaches yield models of conditional probability p(y = 1|x), given estimates of true class probabilities. [sent-30, score-0.341]
21 On the other hand, the main instantiation of the third approach, maximum entropy density estimation (maxent) [14], yields a model of the distribution p(x | y = 1). [sent-31, score-0.122]
22 To convert this to an estimate of p(y = 1|x) (as is usually required, and necessary for measuring conditional log loss on which we focus here) again requires knowledge of the class probabilities p(y = 1) and p(y = 0). [sent-32, score-0.303]
23 Thus, existing discriminative approaches (the first and second) as well as generative approaches (the third) require estimates of true class probabilities. [sent-33, score-0.331]
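To make the role of the class prior concrete, here is a minimal Python sketch (the function name, density values, and priors are hypothetical illustrations, not taken from the paper) that applies Bayes' rule to turn class-conditional density values into p(y = 1 | x); the prediction shifts with the assumed prior, which is exactly the quantity the paper treats as unknown.

```python
import numpy as np

def conditional_from_generative(p_x_given_pos, p_x_given_neg, prior_pos):
    """Bayes' rule: p(y=1|x) from class-conditional densities and an assumed class prior.

    p_x_given_pos, p_x_given_neg: arrays of p(x|y=1) and p(x|y=0) at query locations x.
    prior_pos: assumed p(y=1); the paper's point is that this quantity is unknown.
    """
    joint_pos = prior_pos * p_x_given_pos
    joint_neg = (1.0 - prior_pos) * p_x_given_neg
    return joint_pos / (joint_pos + joint_neg)

# Hypothetical density values at three locations: the output changes with the assumed prior.
px_pos = np.array([0.030, 0.001, 0.010])
px_neg = np.array([0.005, 0.005, 0.005])
for prior in (0.1, 0.5):
    print(prior, conditional_from_generative(px_pos, px_neg, prior))
```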
24 We apply robust Bayesian decision theory, which is closely related to the maximum entropy principle [6], to derive conditional probability estimates p(y | x) that perform well under a wide range of test distributions. [sent-34, score-0.335]
25 Our approach can be used to derive robust estimates of class probabilities p(y) which are then used to reweight discriminative models or to convert generative models into discriminative ones. [sent-35, score-0.584]
26 We present a treatment for the general multiclass problem, but our experiments focus on one-class estimation and species distribution modeling in particular. [sent-36, score-0.302]
27 Using an extensive evaluation on real-world data, we show improvement in both generative and discriminative techniques. [sent-37, score-0.175]
28 Throughout this paper we assume that the difficulty of uncovering the true class label depends on the class label y alone, but is independent of the example x. [sent-38, score-0.209]
29 A related set of techniques estimates and corrects for the bias in sample selection, also known as covariate shift [9, 16, 18, 1, 13]. [sent-40, score-0.118]
30 The input consists of labeled examples (x1, y1), … [sent-43, score-0.174]
31 Each example x is described by a set of features fj : X → R, indexed by j ∈ J. [sent-50, score-0.129]
32 In species distribution modeling from occurrence data, the space X corresponds to locations on the map, features are various functions derived from the environmental variables, and the set Y contains two classes: presence (y = 1) and absence (y = 0) for a particular species. [sent-52, score-0.626]
33 Labeled examples are, e.g., recorded presence locations of the golden bowerbird, while unlabeled examples are locations that have been surveyed by biologists, but neither presence nor absence was recorded. [sent-55, score-0.709]
34 The unlabeled examples can be obtained as presence locations of species observed by a similar protocol, for example other birds [13]. [sent-56, score-0.641]
35 We posit a joint density π(x, y) and assume that examples are generated by the following process. [sent-57, score-0.187]
36 In our example, species presence is revealed with an unknown fixed probability whereas absence is revealed with probability zero (i.e., absences are never recorded). [sent-61, score-0.551]
37 1 Robust Bayesian Estimation, Maximum Entropy, and Logistic Regression. Robust Bayesian decision theory formulates an estimation problem as a zero-sum game between a decision maker and nature [6]. [sent-65, score-0.119]
38 In our case, the decision maker chooses an estimate p(x, y) while nature selects a joint density π(x, y). [sent-66, score-0.215]
39 Using the available data, the decision maker forms a set P in which he believes nature’s choice lies, and tries to minimize worst-case loss under nature’s choice. [sent-67, score-0.136]
40 In this paper we are interested in minimizing the worst-case log loss relative to a fixed default estimate ν (equivalently, maximizing the worst-case log likelihood ratio): min_{p ∈ Δ} max_{π ∈ P} E_π[ ln( p(X,Y) / ν(X,Y) ) ]. (1) [sent-68, score-0.404]
41 Here, Δ is the simplex of joint densities and E_π is a shorthand for E_{X,Y∼π}. [sent-69, score-0.138]
42 Eq. (1) is often equivalent to the minimum relative entropy problem min_{p ∈ P} RE(p ‖ ν), (2) where RE(p ‖ q) = E_p[ln(p(X,Y)/q(X,Y))] is the relative entropy, or Kullback-Leibler divergence, which measures the discrepancy between distributions p and q. [sent-72, score-0.182]
43 The formulation intuitively says that we should choose the density p which is closest to ν while respecting constraints P. [sent-73, score-0.158]
44 When ν is uniform, minimizing relative entropy is equivalent to maximizing the entropy H(p) = E_p[−ln p(X,Y)]. [sent-74, score-0.288]
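This equivalence is a one-line calculation; the following standard derivation (not specific to this paper) assumes ν is uniform on the finite set X × Y:

```latex
\mathrm{RE}(p \,\|\, \nu)
  = \mathbb{E}_p\!\left[\ln \frac{p(X,Y)}{1/|X \times Y|}\right]
  = \ln|X \times Y| + \mathbb{E}_p\!\left[\ln p(X,Y)\right]
  = \ln|X \times Y| - H(p).
```

Since ln|X × Y| does not depend on p, minimizing RE(p ‖ ν) over p ∈ P is the same as maximizing H(p) over P.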
45 Hence, the approach is mainly referred to as maximum entropy [10] or maxent for short. [sent-75, score-0.432]
46 The next theorem outlines the equivalence of robust Bayes and maxent for the case considered in this paper. [sent-76, score-0.532]
47 Let X × Y be a finite sample space, ν a density on X × Y and P ⊆ Δ a closed convex set containing at least one density absolutely continuous w.r.t. ν. [sent-80, score-0.167]
48 For the case without labeling bias, the set P is usually described in terms of equality constraints on moments of the joint distribution (feature expectations). [sent-86, score-0.401]
49 When features are functions of x, but the goal is to discriminate among classes y, it is natural to consider a derived set of features which are versions of fj (x) active solely in individual classes y (see for instance [8]). [sent-88, score-0.308]
50 When the number of samples is too small or the number of features too large then equality constraints lead to overfitting because the true distribution does not match empirical averages exactly. [sent-90, score-0.33]
51 Overfitting is alleviated by relaxing the constraints so that feature expectations are only required to lie within a certain distance of sample averages [3]. [sent-91, score-0.251]
52 The solution of Eq. (2) with equality or relaxed constraints can be shown to lie in an exponential family parameterized by λ = (λ_y)_{y∈Y}, λ_y ∈ R^J, and containing the densities q_λ(x, y) ∝ ν(x, y) e^{λ_y · f(x)}. [sent-93, score-0.332]
53 The solution of Eq. (2) is the unique density which minimizes the empirical log loss (1/m) Σ_{i≤m} −ln q_λ(x_i, y_i), (3) possibly with an additional ℓ1-regularization term accounting for slacks in the equality constraints. [sent-95, score-0.385]
54 In addition to constraints on moments of the joint distribution, it is possible to introduce constraints on marginals of p. [sent-97, score-0.327]
55 The most common implementations of maxent impose the marginal constraints p(x) = π̃^lab(x), where π̃^lab is the empirical distribution over labeled examples. [sent-98, score-0.751]
56 The solution then takes the form q_λ(x, y) = π̃^lab(x) q_λ(y | x), where q_λ(y | x) is the multinomial logistic model q_λ(y | x) ∝ ν(y | x) e^{λ_y · f(x)}. [sent-99, score-0.162]
57 As before, the maxent solution is the unique density of this form which minimizes the empirical log loss (Eq. 3). [sent-100, score-0.557]
58 Minimization of Eq. (3) is equivalent to the minimization of the conditional log loss (1/m) Σ_{i≤m} −ln q_λ(y_i | x_i). [sent-103, score-0.261]
59 Since it only models the labeling process π(y | x), but not the sample generation π(x), it is known as discriminative training. [sent-105, score-0.207]
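For concreteness, here is a minimal Python sketch of the conditional log loss above for a multinomial logistic model with a uniform default ν(y | x); the array shapes, the uniform default, and the optional ℓ1 term are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def conditional_log_loss(lmbda, F, y, l1_weights=None):
    """Empirical conditional log loss (1/m) sum_i -ln q_lambda(y_i | x_i)
    for q_lambda(y | x) proportional to exp(lambda_y . f(x)) (uniform default nu).

    lmbda: (K, J) array, one weight vector lambda_y per class.
    F:     (m, J) feature matrix with rows f(x_i).
    y:     (m,) integer class labels.
    l1_weights: optional (K, J) array of per-coefficient l1 penalties beta_j^y.
    """
    scores = F @ lmbda.T                         # (m, K): lambda_y . f(x_i)
    log_z = np.logaddexp.reduce(scores, axis=1)  # log normalizer over classes
    log_q = scores[np.arange(len(y)), y] - log_z
    loss = -log_q.mean()
    if l1_weights is not None:
        loss += np.sum(l1_weights * np.abs(lmbda))
    return loss
```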
60 The case with equality constraints p(y) = π̃^lab(y) has been analyzed, for example, by [8]. [sent-106, score-0.262]
61 Log loss can be minimized for each class separately, i.e., by fitting each class-conditional distribution p(x | y) on the labeled examples of that class. [sent-108, score-0.128]
62 The joint estimate qλ (x, y) can be used to derive the conditional distribution qλ (y | x). [sent-111, score-0.17]
63 Since this approach estimates the sample generating distributions of individual classes, it is known as generative training. [sent-112, score-0.124]
64 The two approaches presented in this paper can be viewed as generalizations of generative and discriminative training with two additional components: availability of unlabeled examples and lack of information about class probabilities. [sent-114, score-0.485]
65 The former will influence the choice of the default ν, the latter the form of constraints P. [sent-115, score-0.146]
66 2 Generative Training: Entropy-weighted Maxent. When the number of labeled and unlabeled examples is sufficiently large, it is reasonable to assume that the empirical distribution π̃(x) over all examples (labeled and unlabeled) is a faithful representation of π(x). [sent-117, score-0.412]
67 Thus, we consider defaults with ν(x) = π̃(x), shown to work well in species distribution modeling [13]. [sent-118, score-0.325]
68 Constraints on moments of the joint distribution, such as those in the previous section, will misspecify true moments in the presence of labeling bias. [sent-124, score-0.402]
69 However, as discussed earlier, labeled examples from each class y approximate conditional distributions π(x | y). [sent-125, score-0.309]
70 Thus, instead of constraining joint expectations, we constrain conditional expectations Ep [fj (X) | y]. [sent-126, score-0.22]
71 In general, we consider robust Bayes and maxent problems with the set P of the form P = {p ∈ Δ : p^y_X ∈ P^y_X}, where p^y_X denotes the |X|-dimensional vector of conditional probabilities p(x | y) and P^y_X expresses the constraints on p^y_X. [sent-127, score-1.95]
72 For example, relaxed constraints for class y are expressed as ∀j : |E_p[f_j(X) | y] − μ̃^y_j| ≤ β^y_j, (4) where μ̃^y_j is the empirical average of f_j among labeled examples in class y and the β^y_j are estimates of the deviations of these averages from the true expectations. [sent-128, score-0.718]
73 Similar to [14], we use standard-error-like deviation estimates β^y_j = β σ̃^y_j / √m_y, where β is a single tuning constant, σ̃^y_j is the empirical standard deviation of f_j among labeled examples in class y, and m_y is the number of labeled examples in class y. [sent-129, score-0.69]
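A sketch of how these widths might be computed for one class from its labeled examples (the function and variable names are hypothetical; beta is the single tuning constant mentioned above):

```python
import numpy as np

def relaxation_widths(F_labeled, beta=1.0):
    """Standard-error-like widths beta_j^y = beta * sigma_j^y / sqrt(m_y) for one class y.

    F_labeled: (m_y, J) feature values f_j(x_i) at the labeled examples of class y.
    Returns the empirical means mu~_j^y and the widths used in constraint (4).
    """
    m_y = F_labeled.shape[0]
    mu = F_labeled.mean(axis=0)      # empirical averages mu~_j^y
    sigma = F_labeled.std(axis=0)    # empirical standard deviations sigma~_j^y
    return mu, beta * sigma / np.sqrt(m_y)
```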
74 The next theorem and the following corollary show that robust Bayes (and also maxent) with the constraint set P of the form above yield estimators similar to generative training. [sent-131, score-0.291]
75 In addition to the notation p^y_X for conditional densities, we use the notation p_Y and p_X to denote the vectors of marginal probabilities p(y) and p(x), respectively. [sent-132, score-0.739]
76 Let P^y_X, y ∈ Y, be closed convex sets of densities over X and P = {p ∈ Δ : p^y_X ∈ P^y_X}. [sent-135, score-0.773]
77 If P contains at least one density absolutely continuous w.r.t. ν, then robust Bayes and maxent over P are equivalent. [sent-139, score-0.09]
78 The solution p̂ has the form p̂(x, y) = p̂(y) p̂(x | y), where the class-conditional densities p̂^y_X minimize RE(p^y_X ‖ π̃_X) among p^y_X ∈ P^y_X, and p̂(y) ∝ ν(y) e^{−RE(p̂^y_X ‖ π̃_X)}. [sent-140, score-1.391]
79 It is not too difficult to verify that the set P is a closed convex set of joint densities, so the equivalence of robust Bayes and maxent follows from Theorem 1. [sent-142, score-0.567]
80 To prove the remainder, we rewrite the maxent objective as RE(p ‖ ν) = RE(p_Y ‖ ν_Y) + Σ_y p(y) RE(p^y_X ‖ π̃_X). [sent-143, score-0.361]
81 The maxent problem is then equivalent to min_{p_Y} [ RE(p_Y ‖ ν_Y) + Σ_y p(y) min_{p^y_X ∈ P^y_X} RE(p^y_X ‖ π̃_X) ] = min_{p_Y} [ Σ_y p(y) ln(p(y)/ν(y)) + Σ_y p(y) RE(p̂^y_X ‖ π̃_X) ] = min_{p_Y} Σ_y p(y) ln( p(y) / (ν(y) e^{−RE(p̂^y_X ‖ π̃_X)}) ) = const. + min_{p_Y} RE(p_Y ‖ p̂_Y), which is attained at p_Y = p̂_Y. [sent-144, score-0.39]
82 Theorem 2 generalizes to the case when, in addition to constraining p^y_X to lie in P^y_X, we also constrain p_Y to lie in a closed convex set P_Y. [sent-147, score-0.828]
83 Unlike generative training without labeling bias, the class-conditional densities in the theorem above influence the class probabilities. [sent-149, score-0.422]
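A sketch of the resulting entropy-based weighting of the class prior, assuming the fitted class-conditional densities and the empirical distribution π̃ are available as arrays over a common set of points (all names and shapes are illustrative):

```python
import numpy as np

def entropy_weighted_prior(nu_y, p_x_given_y, pi_tilde):
    """p_hat(y) proportional to nu(y) * exp(-RE(p^y_X || pi~_X)), as in Theorem 2.

    nu_y:        (K,) default class prior nu(y).
    p_x_given_y: (K, N) estimated class-conditional densities on N common points.
    pi_tilde:    (N,) empirical distribution over the same points (assumed positive).
    """
    # RE(p^y || pi~) = sum_x p^y(x) ln(p^y(x) / pi~(x)), with the convention 0 ln 0 = 0.
    ratio = np.where(p_x_given_y > 0, p_x_given_y / pi_tilde, 1.0)
    re = np.sum(p_x_given_y * np.log(ratio), axis=1)
    w = nu_y * np.exp(-re)
    return w / w.sum()
```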
84 If the sets P^y_X are specified by the relaxed constraints of Eq. (4), then robust Bayes and maxent are equivalent. [sent-155, score-0.451]
85 The class-conditional densities p̂(x | y) of the solution take the form q_λ(x | y) ∝ π̃(x) e^{λ_y · f(x)} (6) and solve the single-class regularized maximum likelihood problems min_{λ_y} [ Σ_{i : y_i = y} −ln q_λ(x_i | y) + m_y Σ_{j∈J} β^y_j |λ^y_j| ]. [sent-156, score-0.287]
86 In one-class estimation problems, there are two classes (0 and 1), but we only have access to labeled examples from one class (e.g., only the positive class). [sent-158, score-0.336]
87 In species distribution modeling, we only have access to presence records of the species. [sent-161, score-0.4]
88 Based on labeled examples, we derive a set of constraints on p(x | y = 1), but leave p(x | y = 0) unconstrained. [sent-162, score-0.235]
89 Assume without loss of generality that the examples x1, …, xM are distinct. [sent-164, score-0.132]
90 Thus, π̃(x) = 1/M on the examples and zero elsewhere, and RE(p̂_ME ‖ π̃_X) = −H(p̂_ME) + ln M. [sent-168, score-0.198]
91 Plugging these into Theorem 2, we can derive the conditional estimate p̂(y = 1 | x) across all unlabeled examples x: p̂(y = 1 | x) = ν(y = 1) p̂_ME(x) e^{H(p̂_ME)} / [ ν(y = 1) p̂_ME(x) e^{H(p̂_ME)} + ν(y = 0) ]. [sent-169, score-0.269]
92 This model has the same coefficients as p̂_ME, with the intercept chosen so that "typical" examples x under p̂_ME (examples with log probability close to the expected log probability) yield predictions close to the default. [sent-172, score-0.2]
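A sketch of this one-class conversion, following the formula for p̂(y = 1 | x) given above; it assumes p̂_ME is supplied as a probability vector over the M example locations and that ν(y = 1) is a chosen default (here 0.5), so it is an illustration rather than the authors' implementation:

```python
import numpy as np

def oneclass_conditional(p_me, nu_pos=0.5):
    """Entropy-weighted conversion of a maxent density p_ME into p_hat(y=1 | x).

    p_me:   (M,) maxent density over the M (labeled + unlabeled) example locations.
    nu_pos: default probability nu(y=1); nu(y=0) = 1 - nu_pos.
    """
    # H(p_ME); clipping keeps zero entries from producing -inf * 0 issues.
    entropy = -np.sum(p_me * np.log(np.clip(p_me, 1e-300, None)))
    num = nu_pos * p_me * np.exp(entropy)
    return num / (num + (1.0 - nu_pos))
```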
93 The set of constraints P will now also include equality constraints on p(x). [sent-175, score-0.288]
94 The next theorem is an analog of Corollary 3 for discriminative training. [sent-177, score-0.153]
95 It follows from a combination of Theorem 1 and the duality of maxent with maximum likelihood [3]. [sent-178, score-0.385]
96 If the set P is non-empty, then robust Bayes and maxent over P are equivalent. [sent-184, score-0.451]
97 For the solution p̂, p̂(x) = π̃(x) and p̂(y | x) takes the form q_λ(y | x) ∝ ν(y) e^{λ_y·f(x) − λ_y·μ̃^y + Σ_j β^y_j |λ^y_j|} (9) and solves the regularized "logistic regression" problem min_λ [ (1/M) Σ_{i≤M} Σ_{y∈Y} −π̄(y | x_i) ln q_λ(y | x_i) + Σ_{y∈Y} π̄(y) Σ_{j∈J} ( β^y_j |λ^y_j| + (μ̄^y_j − μ̃^y_j) λ^y_j ) ]. [sent-185, score-0.178]
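A sketch that evaluates the conditional model of Eq. (9) for a given λ; the array shapes, the numerical-stability shift, and the variable names are assumptions for illustration (fitting λ itself would require solving the regularized problem above):

```python
import numpy as np

def robust_conditional(lmbda, F, nu_y, beta, mu_tilde):
    """Evaluate Eq. (9): q_lambda(y|x) proportional to
    nu(y) * exp(lambda_y.f(x) - lambda_y.mu~^y + sum_j beta_j^y |lambda_j^y|).

    lmbda:    (K, J) weights lambda_y.      F:    (M, J) features f(x_i).
    nu_y:     (K,) default class prior.     beta: (K, J) relaxation widths beta_j^y.
    mu_tilde: (K, J) empirical class means mu~_j^y of the features.
    """
    offset = -(lmbda * mu_tilde).sum(axis=1) + (beta * np.abs(lmbda)).sum(axis=1)  # per-class shift
    scores = np.log(nu_y) + F @ lmbda.T + offset    # (M, K) unnormalized log q
    scores -= scores.max(axis=1, keepdims=True)     # numerical stability
    q = np.exp(scores)
    return q / q.sum(axis=1, keepdims=True)
```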
98 A natural choice of π̄ is the "pseudo-empirical" distribution, which views all unlabeled examples as negatives. [sent-198, score-0.208]
99 Pseudo-empirical means of class 1 match empirical averages of class 1 exactly, whereas pseudo-empirical means of class 0 can be arbitrary because they are unconstrained. [sent-199, score-0.334]
100 The lack of constraints on class 0 forces the corresponding λ_0 to equal zero. [sent-200, score-0.186]
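A sketch of how the pseudo-empirical labeling π̄ and its class feature means might be constructed for the one-class case (indices, names, and shapes are hypothetical):

```python
import numpy as np

def pseudo_empirical(F, labeled_pos_idx):
    """Pseudo-empirical labeling pi_bar: labeled presences are class 1, every other example class 0.

    F: (M, J) features over all M examples; labeled_pos_idx: indices of the labeled presences.
    Returns pi_bar(y | x_i) as an (M, 2) array and the per-class feature means under pi_bar.
    """
    M = F.shape[0]
    pbar = np.zeros((M, 2))
    pbar[:, 0] = 1.0                   # treat every example as a "negative" by default
    pbar[labeled_pos_idx, 0] = 0.0
    pbar[labeled_pos_idx, 1] = 1.0     # labeled presences become positives
    # Class means of the features under pi_bar; the class-1 row matches the empirical presence means.
    weights = pbar / pbar.sum(axis=0, keepdims=True)
    return pbar, weights.T @ F
```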
wordName wordTfidf (topN-words)
[('py', 0.653), ('maxent', 0.361), ('species', 0.268), ('pme', 0.155), ('unlabeled', 0.125), ('ln', 0.115), ('constraints', 0.107), ('labeling', 0.107), ('fj', 0.105), ('px', 0.105), ('presences', 0.103), ('discriminative', 0.1), ('presence', 0.094), ('labeled', 0.091), ('robust', 0.09), ('densities', 0.085), ('examples', 0.083), ('logistic', 0.081), ('lab', 0.081), ('class', 0.079), ('bowerbird', 0.078), ('generative', 0.075), ('equality', 0.074), ('absence', 0.074), ('entropy', 0.071), ('locations', 0.071), ('proportions', 0.069), ('bias', 0.069), ('bayes', 0.065), ('moments', 0.06), ('golden', 0.058), ('expectations', 0.057), ('conditional', 0.056), ('maker', 0.055), ('ep', 0.054), ('re', 0.054), ('joint', 0.053), ('theorem', 0.053), ('density', 0.051), ('loss', 0.049), ('estimates', 0.049), ('xm', 0.045), ('altitude', 0.045), ('classes', 0.045), ('averages', 0.044), ('lie', 0.043), ('eh', 0.041), ('biologists', 0.041), ('geographic', 0.041), ('phillips', 0.041), ('revealed', 0.041), ('log', 0.041), ('min', 0.04), ('surveyed', 0.039), ('default', 0.039), ('access', 0.038), ('corollary', 0.038), ('solely', 0.038), ('ave', 0.037), ('derive', 0.037), ('yield', 0.035), ('uencing', 0.035), ('closed', 0.035), ('modeling', 0.034), ('unknown', 0.033), ('park', 0.032), ('decision', 0.032), ('environmental', 0.031), ('minimizing', 0.031), ('extreme', 0.03), ('occurrence', 0.03), ('absolutely', 0.03), ('probabilities', 0.03), ('empirical', 0.03), ('equivalence', 0.028), ('true', 0.028), ('constraining', 0.028), ('discriminate', 0.027), ('constrain', 0.026), ('roc', 0.026), ('minimizes', 0.025), ('likelihood', 0.024), ('features', 0.024), ('convert', 0.024), ('estimate', 0.024), ('match', 0.023), ('training', 0.023), ('weighting', 0.023), ('relaxed', 0.023), ('bayesian', 0.023), ('regularized', 0.023), ('quotes', 0.023), ('affordable', 0.023), ('uncovering', 0.023), ('rarer', 0.023), ('simplistic', 0.023), ('defaults', 0.023), ('dud', 0.023), ('imputation', 0.023), ('interprets', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
Author: Steven J. Phillips, Miroslav Dudík
Abstract: We apply robust Bayesian decision theory to improve both generative and discriminative learners under bias in class proportions in labeled training data, when the true class proportions are unknown. For the generative case, we derive an entropybased weighting that maximizes expected log likelihood under the worst-case true class proportions. For the discriminative case, we derive a multinomial logistic model that minimizes worst-case conditional log loss. We apply our theory to the modeling of species geographic distributions from presence data, an extreme case of labeling bias since there is no absence data. On a benchmark dataset, we find that entropy-based weighting offers an improvement over constant estimates of class proportions, consistently reducing log loss on unbiased test data. 1
2 0.33812177 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
Author: Erik B. Sudderth, Michael I. Jordan
Abstract: We develop a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases. Examining a large set of manually segmented scenes, we show that object frequencies and segment sizes both follow power law distributions, which are well modeled by the Pitman–Yor (PY) process. This nonparametric prior distribution leads to learning algorithms which discover an unknown set of objects, and segmentation methods which automatically adapt their resolution to each image. Generalizing previous applications of PY processes, we use Gaussian processes to discover spatially contiguous segments which respect image boundaries. Using a novel family of variational approximations, our approach produces segmentations which compare favorably to state-of-the-art methods, while simultaneously discovering categories shared among natural scenes. 1
3 0.14133438 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
Author: Aarti Singh, Robert Nowak, Xiaojin Zhu
Abstract: Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
4 0.13611455 9 nips-2008-A mixture model for the evolution of gene expression in non-homogeneous datasets
Author: Gerald Quon, Yee W. Teh, Esther Chan, Timothy Hughes, Michael Brudno, Quaid D. Morris
Abstract: We address the challenge of assessing conservation of gene expression in complex, non-homogeneous datasets. Recent studies have demonstrated the success of probabilistic models in studying the evolution of gene expression in simple eukaryotic organisms such as yeast, for which measurements are typically scalar and independent. Models capable of studying expression evolution in much more complex organisms such as vertebrates are particularly important given the medical and scientific interest in species such as human and mouse. We present Brownian Factor Phylogenetic Analysis, a statistical model that makes a number of significant extensions to previous models to enable characterization of changes in expression among highly complex organisms. We demonstrate the efficacy of our method on a microarray dataset profiling diverse tissues from multiple vertebrate species. We anticipate that the model will be invaluable in the study of gene expression patterns in other diverse organisms as well, such as worms and insects. 1
5 0.11154523 153 nips-2008-Nonlinear causal discovery with additive noise models
Author: Patrik O. Hoyer, Dominik Janzing, Joris M. Mooij, Jan R. Peters, Bernhard Schölkopf
Abstract: The discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuous-valued data linear acyclic causal models with additive noise are often used because these models are well understood and there are well-known methods to fit them to data. In reality, of course, many causal relationships are more or less nonlinear, raising some doubts as to the applicability and usefulness of purely linear methods. In this contribution we show that the basic linear framework can be generalized to nonlinear models. In this extended framework, nonlinearities in the data-generating process are in fact a blessing rather than a curse, as they typically provide information on the underlying causal system and allow more aspects of the true data-generating mechanisms to be identified. In addition to theoretical results we show simulations and some simple real data experiments illustrating the identification power provided by nonlinearities. 1
6 0.10420334 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints
7 0.099111147 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
8 0.08839938 62 nips-2008-Differentiable Sparse Coding
9 0.080896743 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
10 0.075759709 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
11 0.074597411 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
12 0.073376253 202 nips-2008-Robust Regression and Lasso
13 0.073225908 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
14 0.073216572 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
15 0.070089556 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
16 0.067844003 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
17 0.063950516 226 nips-2008-Supervised Dictionary Learning
18 0.063291408 228 nips-2008-Support Vector Machines with a Reject Option
19 0.063256264 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
20 0.0598916 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
topicId topicWeight
[(0, -0.187), (1, -0.087), (2, -0.036), (3, -0.055), (4, -0.054), (5, 0.034), (6, -0.018), (7, 0.043), (8, -0.036), (9, 0.098), (10, 0.035), (11, 0.043), (12, 0.073), (13, -0.249), (14, -0.039), (15, -0.028), (16, -0.101), (17, 0.1), (18, -0.029), (19, 0.035), (20, -0.034), (21, 0.182), (22, -0.036), (23, -0.077), (24, -0.043), (25, -0.125), (26, 0.076), (27, 0.126), (28, 0.148), (29, -0.032), (30, -0.034), (31, -0.358), (32, -0.027), (33, 0.19), (34, -0.054), (35, 0.032), (36, 0.122), (37, -0.07), (38, 0.094), (39, 0.008), (40, 0.188), (41, -0.062), (42, 0.128), (43, -0.007), (44, 0.05), (45, 0.045), (46, -0.028), (47, -0.166), (48, 0.014), (49, 0.063)]
simIndex simValue paperId paperTitle
same-paper 1 0.93985671 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
Author: Steven J. Phillips, Miroslav Dudík
Abstract: We apply robust Bayesian decision theory to improve both generative and discriminative learners under bias in class proportions in labeled training data, when the true class proportions are unknown. For the generative case, we derive an entropybased weighting that maximizes expected log likelihood under the worst-case true class proportions. For the discriminative case, we derive a multinomial logistic model that minimizes worst-case conditional log loss. We apply our theory to the modeling of species geographic distributions from presence data, an extreme case of labeling bias since there is no absence data. On a benchmark dataset, we find that entropy-based weighting offers an improvement over constant estimates of class proportions, consistently reducing log loss on unbiased test data. 1
2 0.67458832 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
Author: Erik B. Sudderth, Michael I. Jordan
Abstract: We develop a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases. Examining a large set of manually segmented scenes, we show that object frequencies and segment sizes both follow power law distributions, which are well modeled by the Pitman–Yor (PY) process. This nonparametric prior distribution leads to learning algorithms which discover an unknown set of objects, and segmentation methods which automatically adapt their resolution to each image. Generalizing previous applications of PY processes, we use Gaussian processes to discover spatially contiguous segments which respect image boundaries. Using a novel family of variational approximations, our approach produces segmentations which compare favorably to state-of-the-art methods, while simultaneously discovering categories shared among natural scenes. 1
3 0.42658409 155 nips-2008-Nonparametric regression and classification with joint sparsity constraints
Author: Han Liu, Larry Wasserman, John D. Lafferty
Abstract: We propose new families of models and algorithms for high-dimensional nonparametric learning with joint sparsity constraints. Our approach is based on a regularization method that enforces common sparsity patterns across different function components in a nonparametric additive model. The algorithms employ a coordinate descent approach that is based on a functional soft-thresholding operator. The framework yields several new models, including multi-task sparse additive models, multi-response sparse additive models, and sparse additive multi-category logistic regression. The methods are illustrated with experiments on synthetic data and gene microarray data. 1
4 0.42052013 153 nips-2008-Nonlinear causal discovery with additive noise models
Author: Patrik O. Hoyer, Dominik Janzing, Joris M. Mooij, Jan R. Peters, Bernhard Schölkopf
Abstract: The discovery of causal relationships between a set of observed variables is a fundamental problem in science. For continuous-valued data linear acyclic causal models with additive noise are often used because these models are well understood and there are well-known methods to fit them to data. In reality, of course, many causal relationships are more or less nonlinear, raising some doubts as to the applicability and usefulness of purely linear methods. In this contribution we show that the basic linear framework can be generalized to nonlinear models. In this extended framework, nonlinearities in the data-generating process are in fact a blessing rather than a curse, as they typically provide information on the underlying causal system and allow more aspects of the true data-generating mechanisms to be identified. In addition to theoretical results we show simulations and some simple real data experiments illustrating the identification power provided by nonlinearities. 1
5 0.41993326 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
Author: Aarti Singh, Robert Nowak, Xiaojin Zhu
Abstract: Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
6 0.38924253 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
7 0.36090195 9 nips-2008-A mixture model for the evolution of gene expression in non-homogeneous datasets
8 0.33053428 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice
9 0.31273639 236 nips-2008-The Mondrian Process
10 0.30593824 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
11 0.30116841 41 nips-2008-Breaking Audio CAPTCHAs
12 0.28839135 128 nips-2008-Look Ma, No Hands: Analyzing the Monotonic Feature Abstraction for Text Classification
13 0.27273342 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
14 0.26182586 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
15 0.25821832 228 nips-2008-Support Vector Machines with a Reject Option
16 0.25349924 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
17 0.25004953 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
18 0.23466463 233 nips-2008-The Gaussian Process Density Sampler
19 0.23400116 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
20 0.2331789 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
topicId topicWeight
[(6, 0.105), (7, 0.064), (12, 0.047), (15, 0.013), (19, 0.239), (28, 0.135), (57, 0.064), (59, 0.019), (63, 0.022), (64, 0.014), (71, 0.026), (77, 0.037), (78, 0.012), (83, 0.111)]
simIndex simValue paperId paperTitle
1 0.83763093 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss
Author: Ingo Steinwart, Andreas Christmann
Abstract: In this paper lower and upper bounds for the number of support vectors are derived for support vector machines (SVMs) based on the -insensitive loss function. It turns out that these bounds are asymptotically tight under mild assumptions on the data generating distribution. Finally, we briefly discuss a trade-off in between sparsity and accuracy if the SVM is used to estimate the conditional median. 1
same-paper 2 0.79782122 91 nips-2008-Generative and Discriminative Learning with Unknown Labeling Bias
Author: Steven J. Phillips, Miroslav Dudík
Abstract: We apply robust Bayesian decision theory to improve both generative and discriminative learners under bias in class proportions in labeled training data, when the true class proportions are unknown. For the generative case, we derive an entropybased weighting that maximizes expected log likelihood under the worst-case true class proportions. For the discriminative case, we derive a multinomial logistic model that minimizes worst-case conditional log loss. We apply our theory to the modeling of species geographic distributions from presence data, an extreme case of labeling bias since there is no absence data. On a benchmark dataset, we find that entropy-based weighting offers an improvement over constant estimates of class proportions, consistently reducing log loss on unbiased test data. 1
3 0.75677806 20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning
Author: Zenglin Xu, Rong Jin, Irwin King, Michael Lyu
Abstract: We consider the problem of multiple kernel learning (MKL), which can be formulated as a convex-concave problem. In the past, two efficient methods, i.e., Semi-Infinite Linear Programming (SILP) and Subgradient Descent (SD), have been proposed for large-scale multiple kernel learning. Despite their success, both methods have their own shortcomings: (a) the SD method utilizes the gradient of only the current solution, and (b) the SILP method does not regularize the approximate solution obtained from the cutting plane model. In this work, we extend the level method, which was originally designed for optimizing non-smooth objective functions, to convex-concave optimization, and apply it to multiple kernel learning. The extended level method overcomes the drawbacks of SILP and SD by exploiting all the gradients computed in past iterations and by regularizing the solution via a projection to a level set. Empirical study with eight UCI datasets shows that the extended level method can significantly improve efficiency by saving on average 91.9% of computational time over the SILP method and 70.3% over the SD method. 1
4 0.66373479 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
5 0.6599673 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
Author: Aarti Singh, Robert Nowak, Xiaojin Zhu
Abstract: Empirical evidence shows that in favorable situations semi-supervised learning (SSL) algorithms can capitalize on the abundance of unlabeled training data to improve the performance of a learning task, in the sense that fewer labeled training data are needed to achieve a target error bound. However, in other situations unlabeled data do not seem to help. Recent attempts at theoretically characterizing SSL gains only provide a partial and sometimes apparently conflicting explanations of whether, and to what extent, unlabeled data can help. In this paper, we attempt to bridge the gap between the practice and theory of semi-supervised learning. We develop a finite sample analysis that characterizes the value of unlabeled data and quantifies the performance improvement of SSL compared to supervised learning. We show that there are large classes of problems for which SSL can significantly outperform supervised learning, in finite sample regimes and sometimes also in terms of error convergence rates.
6 0.65726084 202 nips-2008-Robust Regression and Lasso
7 0.65685421 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
8 0.65664715 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models
9 0.65635389 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
10 0.65383661 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
11 0.65346944 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
12 0.65309238 75 nips-2008-Estimating vector fields using sparse basis field expansions
13 0.65259242 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
14 0.65245962 62 nips-2008-Differentiable Sparse Coding
15 0.64964962 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
16 0.64956748 143 nips-2008-Multi-label Multiple Kernel Learning
17 0.64740425 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
18 0.64680958 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis
19 0.64591521 32 nips-2008-Bayesian Kernel Shaping for Learning Control
20 0.64577997 95 nips-2008-Grouping Contours Via a Related Image