nips nips2006 nips2006-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaptation. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. [sent-6, score-0.894]
2 Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? [sent-7, score-0.799]
3 Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. [sent-8, score-0.538]
4 We formalize this intuition theoretically with a generalization bound for domain adaptation. [sent-9, score-0.582]
5 Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. [sent-10, score-0.742]
6 It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. [sent-11, score-0.845]
7 1 Introduction We are all familiar with the situation in which someone learns to perform a task on training examples drawn from some domain (the source domain), but then needs to perform the same task on a related domain (the target domain). [sent-12, score-1.345]
8 In this situation, we expect the task performance in the target domain to depend on both the performance in the source domain and the similarity between the two domains. [sent-13, score-1.198]
9 For example, we might want to adapt for a new user (the target domain) a spam filter trained on the email of a group of previous users (the source domain), under the assumption that users generally agree on what is spam and what is not. [sent-15, score-0.561]
10 Intuitively, one might expect that the closer the two distributions are, the better the filter trained on the source domain will do on the target domain. [sent-17, score-0.839]
11 Many other instances of this situation arise in natural language processing. [sent-18, score-0.184]
12 In general, labeled data for tasks like part-of-speech tagging, parsing, or information extraction are drawn from a limited set of document types and genres in a given language because of availability, cost, and project goals. [sent-19, score-0.244]
13 Nevertheless, part-of-speech, syntactic structure, or entity mention decisions are to a large extent stable across different types and genres since they depend on general properties of the language under consideration. [sent-21, score-0.171]
14 However, the standard assumption that training and test data are drawn from the same distribution does not hold for domain adaptation [5, 7, 13, 6]. [sent-24, score-0.595]
15 For the situations we outlined above, the challenge is the difference in instance distribution between the source and target domains. [sent-25, score-0.389]
16 We will approach this challenge by investigating how a common representation between the two domains can make the two domains appear to have similar distributions, enabling effective domain adaptation. [sent-26, score-0.763]
17 We formalize this intuition with a bound on the target generalization error of a classifier trained from labeled data in the source domain. [sent-27, score-0.704]
18 The bound is stated in terms of a representation function, and it shows that a representation function should be designed to minimize domain divergence, as well as classifier error. [sent-28, score-0.689]
19 While many authors have analyzed adaptation from multiple sets of labeled training data [3, 5, 7, 13], our theory applies to the setting in which the target domain has no labeled training data, but plentiful unlabeled data exists for both target and source domains. [sent-29, score-1.508]
20 We show experimentally that the heuristic choices made by the recently proposed structural correspondence learning algorithm [6] do lead to lower values of the relevant quantities in our theoretical analysis, providing insight as to why this algorithm achieves its empirical success. [sent-32, score-0.356]
21 Our theory also points to an interesting new algorithm for domain adaptation: one which directly minimizes a tradeoff between source-target similarity and source training error. [sent-33, score-0.729]
22 The remainder of this paper is structured as follows: In the next section we formally define domain adaptation. [sent-34, score-0.404]
23 Section 5 shows how the bound behaves for the structural correspondence learning representation [6] on natural language data. [sent-37, score-0.423]
24 We discuss our findings, including a new algorithm for domain adaptation based on our theory, in section 6 and conclude in section 7. [sent-38, score-0.595]
25 A learning problem is specified by two parameters: a distribution D over X and a (stochastic) target function f : X → [0, 1]. [sent-42, score-0.173]
26 A representation R induces a distribution over Z and a (stochastic) target function from Z to [0, 1] as follows: Pr_D̃[B] def= Pr_D[R⁻¹(B)] for any B ⊆ Z such that R⁻¹(B) is D-measurable, and f̃(z) def= E_D[f(x) | R(x) = z]. [sent-45, score-0.333]
27 In summary, our learning setting is defined by fixed but unknown D and f , and our choice of representation function R and hypothesis class H ⊆ {g : Z → {0, 1}} of deterministic hypotheses to be used to approximate the function f . [sent-49, score-0.172]
28 1 Domain Adaptation We now formalize the problem of domain adaptation. [sent-51, score-0.438]
29 A domain is a distribution D on the instance set X . [sent-52, score-0.435]
30 Unlike in inductive transfer, where the tasks we wish to perform may be related but different, in domain adaptation we perform the same task in multiple domains. [sent-55, score-0.627]
31 This is quite common in natural language processing, where we might be performing the same syntactic analysis task, such as tagging or parsing, but on domains with very different vocabularies [6, 11]. [sent-56, score-0.415]
32 We assume two domains, a source domain and a target domain. [sent-58, score-0.762]
33 We denote by DS the source distribution of instances and by D̃S the induced distribution over the feature space Z. [sent-59, score-0.33]
34 We use parallel notation, DT and D̃T, for the target domain. [sent-60, score-0.173]
35 3 Generalization Bounds for Domain Adaptation We now proceed to develop a bound on the target domain generalization performance of a classifier trained in the source domain. [sent-65, score-0.943]
36 The first term bounds the performance of the classifier on the source domain. [sent-67, score-0.185]
37 The second term is a measure of the divergence between the induced source marginal D̃S and the induced target marginal D̃T. [sent-68, score-0.479]
38 Unfortunately the variational distance between real-valued distributions cannot be computed from finite samples [2, 9] and therefore is not useful to us when investigating representations for domain adaptation on real-world data. [sent-71, score-0.743]
39 A key part of our theory is the observation that in many realistic domain adaptation scenarios, we do not need such a powerful measure as variational distance. [sent-72, score-0.65]
40 Instead we can restrict our notion of domain distance to be measured only with respect to functions in our hypothesis class. [sent-73, score-0.484]
41 Given a domain X and a collection A of subsets of X , let D, D′ be probability distributions over X , such that every set in A is measurable with respect to both distributions. [sent-76, score-0.444]
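For reference, the distance introduced here is the A-distance of [9], restated here in our own paraphrase using the notation of this section:

    d_{\mathcal{A}}(D, D') \;=\; 2 \sup_{A \in \mathcal{A}} \bigl| \Pr_D[A] - \Pr_{D'}[A] \bigr|.

Restricting the collection A to the characteristic sets of hypotheses in H (the subsets of Z on which each h ∈ H predicts 1) yields the hypothesis-class-restricted distance used in the rest of this section.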
42 This embodies our domain adaptation assumption, and we will assume that our induced labeling function f̃ is λ-close to our hypothesis class H for a small λ. [sent-80, score-0.793]
43 If H has bounded capacity (e.g., a finite VC-dimension), then uniform convergence theory tells us that whenever f̃ is not λ-close to H, large training samples have poor empirical error for every h ∈ H. [sent-84, score-0.218]
44 If the training data is generated by some DS and we wish to use some H as a family of predictors for labels in the target domain T, then one can construct a function which agrees with some h ∈ H with respect to D̃S and yet is far from H with respect to D̃T. [sent-86, score-0.224]
45 Nonetheless we believe that such examples do not occur for realistic domain adaptation problems when the hypothesis class H is sufficiently rich, since for most domain adaptation problems of interest the labeling function is ’similarly simple’ for both the source and target domains. [sent-87, score-1.705]
46 2 Bound on the target domain error We require one last piece of notation before we state and prove the main theorems of this work: the correspondence between functions and characteristic subsets. [sent-89, score-0.758]
47 Theorem 1 Let R be a fixed representation function from X to Z and H be a hypothesis space of VC-dimension d. [sent-93, score-0.172]
48 Using a result from [9], we can state a computable bound for the error on the target domain. [sent-108, score-0.324]
49 Theorem 2 Let R be a fixed representation function from X to Z and H be a hypothesis space of VC-dimension d. [sent-109, score-0.172]
50 The fourth term is the sample A-distance between domains for hypothesis class H. [sent-118, score-0.189]
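Schematically, and hiding the exact constants (this is a paraphrase of the bound's shape rather than its precise statement), the bound has the form

    \epsilon_T(h) \;\le\; \hat{\epsilon}_S(h) \;+\; \mathrm{complexity}(d, m, \delta) \;+\; d_{\mathcal{A}}(\tilde{U}_S, \tilde{U}_T) \;+\; \lambda,

where \hat{\epsilon}_S(h) is the empirical source error, the complexity term depends on the VC-dimension d and the sample sizes, d_{\mathcal{A}}(\tilde{U}_S, \tilde{U}_T) is the sample A-distance between the unlabeled source and target samples, and λ reflects how well the hypothesis class can fit the induced labeling function.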
51 Looking at the two terms, we see that a good representation R is one which achieves low values for both training error and domain A-distance simultaneously. [sent-119, score-0.679]
52 Then h is the classifier which achieves minimum error on the binary classification problem of discriminating between points generated by the two distributions. [sent-126, score-0.212]
53 Define the error of a classifier h on the task of discriminating between points sampled from different distributions as err(h) = (1/2m′) Σ_{i=1}^{2m′} |h(zi) − I_{zi∈ŨS}|, where I_{zi∈ŨS} is the indicator function for points lying in the sample ŨS. [sent-128, score-0.236]
54 It is important to note that this does not provide us with a valid upper bound on the target error, but as we will see it nonetheless provides us with useful insights about representations for domain adaptation. [sent-132, score-0.766]
55 In the subsequent experiments section, we train a linear classifier to discriminate between points sampled from the two domains and use it as a proxy for the A-distance. [sent-133, score-0.172]
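As a concrete illustration of this proxy, the following is a minimal sketch (not the authors' code) of the computation described above: train a linear classifier to separate the unlabeled source sample from the unlabeled target sample and convert its held-out error into a distance estimate. The use of scikit-learn and the specific conversion 2(1 − 2·err) are assumptions of the sketch.

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split

    def proxy_a_distance(source_z, target_z, seed=0):
        """Estimate a proxy for the A-distance between two projected samples.
        source_z, target_z: arrays of shape (m', d) of instances already mapped by R."""
        X = np.vstack([source_z, target_z])
        y = np.concatenate([np.ones(len(source_z)), np.zeros(len(target_z))])  # 1 = source
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, random_state=seed, stratify=y)
        clf = SGDClassifier(loss="modified_huber", random_state=seed).fit(X_tr, y_tr)
        err = np.mean(clf.predict(X_te) != y_te)  # err(h) of the domain discriminator
        return 2.0 * (1.0 - 2.0 * err)            # near 2: domains easy to separate; near 0: hard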
56 5 Natural Language Experiments In this section we use our theory to analyze different representations for the task of adapting a part of speech tagger from the financial to biomedical domains [6]. [sent-135, score-0.446]
57 As we shall see, representations which minimize both relevant terms of the bound also have small empirical error. [sent-139, score-0.163]
58 Part of speech (PoS) tagging is the task of labeling a word in context with its grammatical function. [sent-140, score-0.327]
59 PoS tagging is a common preprocessing step in many pipelined natural language processing systems and is described in more detail in [6]. [sent-142, score-0.268]
60 The authors of [6] empirically investigate methods for adapting a part-of-speech tagger from financial news (the Wall Street Journal, henceforth also WSJ) to biomedical abstracts (MEDLINE). [sent-144, score-0.258]
61 As in their investigation, we treat the financial data as our source, for which we have labeled training data, and the biomedical abstracts as our target, for which we have no labeled training data. [sent-146, score-0.404]
62 The representations we consider in this section are all linear projections of the original feature space into Rd . [sent-147, score-0.23]
63 Now at train time we apply the projection to the binary feature vector representation of each instance and learn a linear classifier in the d-dimensional projected space. [sent-150, score-0.263]
64 At test time we apply the projection to the binary feature vector representation and classify in the d-dimensional projected space. [sent-151, score-0.232]
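A minimal sketch of this recipe, under the assumption that instances arrive as binary feature vectors in R^D and that theta is a fixed d x D projection matrix (obtained from random projections, SCL, or any other source); the LinearSVC here is an illustrative stand-in for whichever margin-based linear learner is used.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_in_projected_space(theta, X_source, y_source):
        """theta: (d, D) projection; X_source: (n, D) binary features; y_source: labels."""
        Z_source = X_source @ theta.T        # map each instance into the d-dimensional space
        return LinearSVC().fit(Z_source, y_source)

    def predict_in_projected_space(clf, theta, X):
        return clf.predict(X @ theta.T)      # apply the same projection at test time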
65 (a) Plot of the SCL representation for financial (squares) vs. biomedical (circles); (b) plot of the SCL representation for nouns (diamonds) vs. verbs (triangles). [sent-159, score-0.232]
66 Figure 1: 2D plots of SCL representations for the (a) A-distance and (b) empirical risk parts of theorem 2. Random projections approximate distances in the original high-dimensional space well, as long as d is sufficiently large. [sent-160, score-0.299]
67 The authors of [6] describe a heuristic method for domain adaptation that they call structural correspondence learning (henceforth also SCL). [sent-164, score-0.356]
68 SCL uses unlabeled data from both domains to induce correspondences among features in the two domains. [sent-165, score-0.172]
69 The intuition is that by capturing these important correlations, features from the source and target domains which behave similarly for PoS tagging will be represented similarly in the projected space. [sent-169, score-0.701]
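The following is a rough sketch of an SCL-style projection in the spirit of [6], not the authors' implementation: the choice of pivot features, the pivot predictors, and the plain SVD are all simplifying assumptions of this sketch.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def scl_projection(X_unlabeled, pivot_idx, d=25):
        """X_unlabeled: (n, D) binary features pooled from both domains;
        pivot_idx: indices of frequent features occurring in both domains
        (each pivot is assumed to be neither always present nor always absent)."""
        weight_vectors = []
        for p in pivot_idx:
            y = X_unlabeled[:, p]               # target: does the instance contain pivot p?
            X_masked = X_unlabeled.copy()
            X_masked[:, p] = 0                  # hide the pivot itself from its predictor
            clf = SGDClassifier(loss="modified_huber").fit(X_masked, y)
            weight_vectors.append(clf.coef_.ravel())   # how non-pivot features predict p
        W = np.array(weight_vectors).T          # (D, num_pivots) correspondence matrix
        U, _, _ = np.linalg.svd(W, full_matrices=False)
        return U[:, :d].T                       # theta: (d, D) low-dimensional projection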
70 3 Results We use as our source data set 100 sentences (about 2500 words) of PoS-tagged Wall Street Journal text. [sent-171, score-0.185]
71 The target domain test set is the same set as in [6]. [sent-172, score-0.577]
72 We use one million words (500 thousand from each domain) of unlabeled data to estimate the A-distance between the financial and biomedical domains. [sent-173, score-0.165]
73 The results in this section are intended to illustrate the different parts of theorem 2 and how they can affect the target domain generalization error. [sent-174, score-0.701]
74 In this case we use the Huber loss as a proxy for the empirical training error. [sent-179, score-0.182]
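For concreteness, here is one common form of the modified Huber loss that could serve as such a margin-based proxy; whether this exact variant matches the one used in the experiments is an assumption.

    import numpy as np

    def modified_huber_loss(y, scores):
        """y in {-1, +1}; scores = w·x for each training instance; returns the mean loss."""
        z = y * scores                                        # signed margin
        per_example = np.where(z >= -1.0,
                               np.maximum(0.0, 1.0 - z) ** 2, # quadratic near the margin
                               -4.0 * z)                      # linear for badly misclassified points
        return per_example.mean()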
75 Figure 1(a) shows one hundred random instances from the financial and biomedical data sets, projected onto the space spanned by the two best discriminating projections from the SCL projection matrix. [sent-180, score-0.535]
76 On the other hand, figure 1(b) shows the best two discriminating components for the task of discriminating between nouns and verbs. [sent-184, score-0.298]
77 Thus these pictures lead us to believe that SCL finds a representation which results in both small empirical classification error and small A-distance. [sent-186, score-0.204]
78 (a) Plot of the random projections representation for financial (squares) vs. biomedical (circles). [sent-188, score-0.221]
79 (b) Comparison of the bound terms vs. target-domain error for each representation. [sent-189, score-0.203]
80 Representations are linear projections of the original feature space. [sent-191, score-0.171]
81 Huber loss is the labeled training loss after training, and the A-distance is approximated as described in the previous subsection. [sent-192, score-0.208]
82 Error refers to tagging error for the full tagset on the target domain. [sent-193, score-0.403]
83 Figure 2: (a) 2D plot of the random projection representation and (b) results summary on the large data set. Figure 2(a) shows one hundred random instances projected onto the best two discriminating projections for WSJ vs. MEDLINE. [sent-203, score-0.555]
84 Thus theorem 2 predicts that using two random projections as a representation will perform poorly, since it minimizes only the A-distance and not the empirical error. [sent-210, score-0.332]
85 The identity representation achieves very low Huber loss (corresponding to empirical error). [sent-212, score-0.274]
86 The random projections method achieves low A-distance but high Huber loss, and the classifier which uses this representation achieves error rates much higher than a classifier which uses the identity representation. [sent-215, score-0.401]
87 Finally, the structural correspondence learning representation achieves low Huber loss and low A-distance, and the error rate is the lowest of the three representations. [sent-216, score-0.438]
88 6 Discussion and Future Work Our theory demonstrates an important tradeoff inherent in designing good representations for domain adaptation. [sent-217, score-0.552]
89 A good representation enables a classifier to achieve a low error rate on the source domain while also minimizing the A-distance between the induced marginal distributions of the two domains. [sent-218, score-0.846]
90 Unlike previous analyses, though, we bound the target domain error using a finite source domain labeled sample and finite source and target domain unlabeled samples. [sent-222, score-2.223]
91 Our experiments illustrate the utility of our bound on target domain error, but they do not explore the accuracy of our approximate H-distance. [sent-223, score-0.71]
92 Finally our theory points toward an interesting new direction for domain adaptation. [sent-225, score-0.491]
93 7 Conclusions We presented an analysis of representations for domain adaptation. [sent-229, score-0.463]
94 It is reasonable to think that a good representation is the key to effective domain adaptation, and our theory backs up that intuition. [sent-230, score-0.551]
95 Theorem 2 gives an upper bound on the generalization of a classifier trained on a source domain and applied in a target domain. [sent-231, score-0.943]
96 The bound depends on the representation and explicitly demonstrates the tradeoff between low empirical source domain error and a small difference between distributions. [sent-232, score-0.962]
97 Under the assumption that the labeling function f̃ is close to our hypothesis class H, we can compute the bound from finite samples. [sent-233, score-0.258]
98 Our experiments indicate that the heuristic structural correspondence learning method [6] does in fact simultaneously achieve low A-distance as well as a low margin-based loss. [sent-237, score-0.28]
99 Finally we note that our theory points to an interesting new algorithm for domain adaptation. [sent-239, score-0.459]
100 Instead of making heuristic choices, we are investigating algorithms which directly minimize a combination of the A-distance and the empirical training margin. [sent-240, score-0.232]
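To make this suggestion concrete, here is a hedged illustration (not taken from the paper) of what such a combined objective might look like: a hinge loss on the labeled source data, which rewards a large margin, plus a trade-off term measuring the distance between the projected unlabeled samples. How to optimize it, and which distance estimator to plug in, are left open.

    import numpy as np

    def joint_objective(theta, w, X_src, y_src, X_src_unl, X_tgt_unl,
                        domain_distance_fn, trade_off=1.0):
        """domain_distance_fn: any estimate of distance between two projected samples,
        e.g. the proxy A-distance sketch shown earlier."""
        Z_src = X_src @ theta.T
        margins = y_src * (Z_src @ w)                         # y in {-1, +1}, w a weight vector in R^d
        source_loss = np.maximum(0.0, 1.0 - margins).mean()   # hinge loss on the labeled source set
        dist = domain_distance_fn(X_src_unl @ theta.T, X_tgt_unl @ theta.T)
        return source_loss + trade_off * dist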
wordName wordTfidf (topN-words)
[('domain', 0.404), ('ds', 0.272), ('scl', 0.257), ('zh', 0.257), ('adaptation', 0.191), ('source', 0.185), ('dt', 0.184), ('prd', 0.18), ('tagging', 0.18), ('target', 0.173), ('huber', 0.143), ('dh', 0.143), ('projections', 0.129), ('nancial', 0.119), ('discriminating', 0.114), ('domains', 0.109), ('biomedical', 0.102), ('blitzer', 0.102), ('bound', 0.101), ('er', 0.1), ('representation', 0.092), ('language', 0.088), ('classi', 0.084), ('labeled', 0.081), ('hypothesis', 0.08), ('correspondence', 0.078), ('medline', 0.077), ('labeling', 0.077), ('pos', 0.076), ('heuristic', 0.07), ('wsj', 0.067), ('structural', 0.064), ('unlabeled', 0.063), ('empirical', 0.062), ('instances', 0.062), ('representations', 0.059), ('ut', 0.058), ('theory', 0.055), ('projected', 0.054), ('characteristic', 0.053), ('training', 0.051), ('izi', 0.051), ('kifer', 0.051), ('prds', 0.051), ('prdt', 0.051), ('tagger', 0.051), ('zg', 0.051), ('error', 0.05), ('investigating', 0.049), ('theorem', 0.049), ('achieves', 0.048), ('crammer', 0.047), ('users', 0.045), ('arriaga', 0.045), ('discriminator', 0.045), ('genres', 0.045), ('noun', 0.045), ('pivot', 0.045), ('projection', 0.044), ('hyperplane', 0.044), ('generalization', 0.043), ('feature', 0.042), ('induced', 0.041), ('distributions', 0.04), ('divergence', 0.039), ('speech', 0.038), ('abstracts', 0.038), ('syntactic', 0.038), ('nouns', 0.038), ('spam', 0.038), ('tag', 0.038), ('rd', 0.038), ('loss', 0.038), ('trained', 0.037), ('choices', 0.034), ('formalize', 0.034), ('def', 0.034), ('situation', 0.034), ('low', 0.034), ('tradeoff', 0.034), ('poorly', 0.033), ('ez', 0.033), ('focs', 0.033), ('sugiyama', 0.033), ('toward', 0.032), ('task', 0.032), ('illustrate', 0.032), ('instance', 0.031), ('err', 0.031), ('wall', 0.031), ('proxy', 0.031), ('shai', 0.031), ('hundred', 0.03), ('covariate', 0.03), ('drawn', 0.03), ('plot', 0.03), ('henceforth', 0.029), ('nonetheless', 0.029), ('pereira', 0.029), ('stochastic', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 33 nips-2006-Analysis of Representations for Domain Adaptation
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
2 0.15340489 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: We1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
3 0.12414689 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
Author: Philip M. Long, Rocco Servedio
Abstract: We consider the well-studied problem of learning decision lists using few examples when many irrelevant features are present. We show that smooth boosting algorithms such as MadaBoost can efficiently learn decision lists of length k over n boolean variables using poly(k, log n) many examples provided that the marginal distribution over the relevant variables is “not too concentrated” in an L 2 -norm sense. Using a recent result of H˚ stad, we extend the analysis to obtain a similar a (though quantitatively weaker) result for learning arbitrary linear threshold functions with k nonzero coefficients. Experimental results indicate that the use of a smooth boosting algorithm, which plays a crucial role in our analysis, has an impact on the actual performance of the algorithm.
5 0.097122028 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of , and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of (in contrast with the metric dimension).
6 0.096486218 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
7 0.09505751 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
8 0.094266444 110 nips-2006-Learning Dense 3D Correspondence
9 0.088626467 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
10 0.088258766 150 nips-2006-On Transductive Regression
11 0.088018648 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
12 0.08716017 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
13 0.087013252 193 nips-2006-Tighter PAC-Bayes Bounds
14 0.081915714 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
15 0.074209496 50 nips-2006-Chained Boosting
16 0.073144786 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data
17 0.070985451 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
18 0.07076855 46 nips-2006-Blind source separation for over-determined delayed mixtures
19 0.069807142 141 nips-2006-Multiple timescales and uncertainty in motor adaptation
20 0.069186442 65 nips-2006-Denoising and Dimension Reduction in Feature Space
topicId topicWeight
[(0, -0.234), (1, 0.045), (2, -0.024), (3, -0.044), (4, -0.078), (5, 0.146), (6, -0.042), (7, 0.023), (8, -0.012), (9, -0.008), (10, 0.028), (11, 0.003), (12, -0.063), (13, 0.02), (14, 0.087), (15, -0.167), (16, 0.087), (17, -0.157), (18, -0.001), (19, -0.103), (20, -0.074), (21, 0.055), (22, 0.025), (23, 0.018), (24, -0.146), (25, 0.082), (26, -0.017), (27, 0.16), (28, 0.012), (29, -0.054), (30, 0.008), (31, -0.186), (32, -0.139), (33, 0.107), (34, -0.203), (35, 0.012), (36, 0.001), (37, -0.021), (38, -0.05), (39, 0.072), (40, 0.04), (41, -0.033), (42, 0.105), (43, -0.044), (44, -0.021), (45, -0.036), (46, -0.033), (47, 0.011), (48, 0.136), (49, 0.109)]
simIndex simValue paperId paperTitle
same-paper 1 0.95436549 33 nips-2006-Analysis of Representations for Domain Adaptation
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
2 0.81982356 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: We1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
3 0.60651857 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of , and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of (in contrast with the metric dimension).
4 0.54306501 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
Author: Steffen Bickel, Tobias Scheffer
Abstract: We study a setting that is motivated by the problem of filtering spam messages for many users. Each user receives messages according to an individual, unknown distribution, reflected only in the unlabeled inbox. The spam filter for a user is required to perform well with respect to this distribution. Labeled messages from publicly available sources can be utilized, but they are governed by a distinct distribution, not adequately representing most inboxes. We devise a method that minimizes a loss function with respect to a user’s personal distribution based on the available biased sample. A nonparametric hierarchical Bayesian model furthermore generalizes across users by learning a common prior which is imposed on new email accounts. Empirically, we observe that bias-corrected learning outperforms naive reliance on the assumption of independent and identically distributed data; Dirichlet-enhanced generalization across users outperforms a single (“one size fits all”) filter as well as independent filters for all users. 1
Author: Philip M. Long, Rocco Servedio
Abstract: We consider the well-studied problem of learning decision lists using few examples when many irrelevant features are present. We show that smooth boosting algorithms such as MadaBoost can efficiently learn decision lists of length k over n boolean variables using poly(k, log n) many examples provided that the marginal distribution over the relevant variables is “not too concentrated” in an L 2 -norm sense. Using a recent result of H˚ stad, we extend the analysis to obtain a similar a (though quantitatively weaker) result for learning arbitrary linear threshold functions with k nonzero coefficients. Experimental results indicate that the use of a smooth boosting algorithm, which plays a crucial role in our analysis, has an impact on the actual performance of the algorithm.
6 0.50725144 116 nips-2006-Learning from Multiple Sources
7 0.47529674 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
8 0.45779604 150 nips-2006-On Transductive Regression
9 0.45447165 158 nips-2006-PG-means: learning the number of clusters in data
10 0.44785017 193 nips-2006-Tighter PAC-Bayes Bounds
11 0.42488658 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
12 0.42228332 110 nips-2006-Learning Dense 3D Correspondence
13 0.40892616 5 nips-2006-A Kernel Method for the Two-Sample-Problem
14 0.4044964 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
15 0.39765462 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
16 0.3874895 155 nips-2006-Optimal Single-Class Classification Strategies
17 0.38264361 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
18 0.38107407 50 nips-2006-Chained Boosting
19 0.37843803 141 nips-2006-Multiple timescales and uncertainty in motor adaptation
20 0.37124816 131 nips-2006-Mixture Regression for Covariate Shift
topicId topicWeight
[(1, 0.105), (3, 0.029), (7, 0.095), (9, 0.025), (12, 0.306), (20, 0.013), (22, 0.085), (44, 0.104), (57, 0.071), (65, 0.058), (69, 0.027)]
simIndex simValue paperId paperTitle
1 0.88491803 39 nips-2006-Balanced Graph Matching
Author: Timothee Cour, Praveen Srinivasan, Jianbo Shi
Abstract: Graph matching is a fundamental problem in Computer Vision and Machine Learning. We present two contributions. First, we give a new spectral relaxation technique for approximate solutions to matching problems, that naturally incorporates one-to-one or one-to-many constraints within the relaxation scheme. The second is a normalization procedure for existing graph matching scoring functions that can dramatically improve the matching accuracy. It is based on a reinterpretation of the graph matching compatibility matrix as a bipartite graph on edges for which we seek a bistochastic normalization. We evaluate our two contributions on a comprehensive test set of random graph matching problems, as well as on image correspondence problem. Our normalization procedure can be used to improve the performance of many existing graph matching algorithms, including spectral matching, graduated assignment and semidefinite programming. 1
2 0.85838103 122 nips-2006-Learning to parse images of articulated bodies
Author: Deva Ramanan
Abstract: We consider the machine vision task of pose estimation from static images, specifically for the case of articulated objects. This problem is hard because of the large number of degrees of freedom to be estimated. Following a established line of research, pose estimation is framed as inference in a probabilistic model. In our experience however, the success of many approaches often lie in the power of the features. Our primary contribution is a novel casting of visual inference as an iterative parsing process, where one sequentially learns better and better features tuned to a particular image. We show quantitative results for human pose estimation on a database of over 300 images that suggest our algorithm is competitive with or surpasses the state-of-the-art. Since our procedure is quite general (it does not rely on face or skin detection), we also use it to estimate the poses of horses in the Weizmann database. 1
same-paper 3 0.80829906 33 nips-2006-Analysis of Representations for Domain Adaptation
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
4 0.58902663 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
Author: Kilian Q. Weinberger, Fei Sha, Qihui Zhu, Lawrence K. Saul
Abstract: In many areas of science and engineering, the problem arises how to discover low dimensional representations of high dimensional data. Recently, a number of researchers have converged on common solutions to this problem using methods from convex optimization. In particular, many results have been obtained by constructing semidefinite programs (SDPs) with low rank solutions. While the rank of matrix variables in SDPs cannot be directly constrained, it has been observed that low rank solutions emerge naturally by computing high variance or maximal trace solutions that respect local distance constraints. In this paper, we show how to solve very large problems of this type by a matrix factorization that leads to much smaller SDPs than those previously studied. The matrix factorization is derived by expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. The smaller SDPs obtained from this matrix factorization yield very good approximations to solutions of the original problem. Moreover, these approximations can be further refined by conjugate gradient descent. We illustrate the approach on localization in large scale sensor networks, where optimizations involving tens of thousands of nodes can be solved in just a few minutes. 1
5 0.58649826 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
6 0.58596379 80 nips-2006-Fundamental Limitations of Spectral Clustering
7 0.57872218 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
8 0.57705939 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
9 0.57308245 20 nips-2006-Active learning for misspecified generalized linear models
10 0.57298774 163 nips-2006-Prediction on a Graph with a Perceptron
11 0.56848198 110 nips-2006-Learning Dense 3D Correspondence
12 0.56529599 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
13 0.56444705 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
14 0.56245399 175 nips-2006-Simplifying Mixture Models through Function Approximation
15 0.56033373 117 nips-2006-Learning on Graph with Laplacian Regularization
16 0.55878186 21 nips-2006-AdaBoost is Consistent
17 0.55845034 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
18 0.55805582 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
19 0.55767155 5 nips-2006-A Kernel Method for the Two-Sample-Problem
20 0.55694002 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds