nips nips2006 nips2006-33 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaptation. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. [sent-6, score-0.894]
2 Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? [sent-7, score-0.799]
3 Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. [sent-8, score-0.538]
4 We formalize this intuition theoretically with a generalization bound for domain adaptation. [sent-9, score-0.582]
5 Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. [sent-10, score-0.742]
6 It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. [sent-11, score-0.845]
7 1 Introduction We are all familiar with the situation in which someone learns to perform a task on training examples drawn from some domain (the source domain), but then needs to perform the same task on a related domain (the target domain). [sent-12, score-1.345]
8 In this situation, we expect the task performance in the target domain to depend on both the performance in the source domain and the similarity between the two domains. [sent-13, score-1.198]
9 For example, we might want to adapt for a new user (the target domain) a spam filter trained on the email of a group of previous users (the source domain), under the assumption that users generally agree on what is spam and what is not. [sent-15, score-0.561]
10 Intuitively, one might expect that the closer the two distributions are, the better the filter trained on the source domain will do on the target domain. [sent-17, score-0.839]
11 Many other instances of this situation arise in natural language processing. [sent-18, score-0.184]
12 In general, labeled data for tasks like part-of-speech tagging, parsing, or information extraction are drawn from a limited set of document types and genres in a given language because of availability, cost, and project goals. [sent-19, score-0.244]
13 Nevertheless, part-of-speech, syntactic structure, or entity mention decisions are to a large extent stable across different types and genres since they depend on general properties of the language under consideration. [sent-21, score-0.171]
14 However, the standard assumption that training and test data are drawn from the same distribution does not hold for domain adaptation [5, 7, 13, 6]. [sent-24, score-0.595]
15 For the situations we outlined above, the challenge is the difference in instance distribution between the source and target domains. [sent-25, score-0.389]
16 We will approach this challenge by investigating how a common representation between the two domains can make the two domains appear to have similar distributions, enabling effective domain adaptation. [sent-26, score-0.763]
17 We formalize this intuition with a bound on the target generalization error of a classifier trained from labeled data in the source domain. [sent-27, score-0.704]
18 The bound is stated in terms of a representation function, and it shows that a representation function should be designed to minimize domain divergence, as well as classifier error. [sent-28, score-0.689]
19 While many authors have analyzed adaptation from multiple sets of labeled training data [3, 5, 7, 13], our theory applies to the setting in which the target domain has no labeled training data, but plentiful unlabeled data exists for both target and source domains. [sent-29, score-1.508]
20 We show experimentally that the heuristic choices made by the recently proposed structural correspondence learning algorithm [6] do lead to lower values of the relevant quantities in our theoretical analysis, providing insight as to why this algorithm achieves its empirical success. [sent-32, score-0.356]
21 Our theory also points to an interesting new algorithm for domain adaptation: one which directly minimizes a tradeoff between source-target similarity and source training error. [sent-33, score-0.729]
22 The remainder of this paper is structured as follows: In the next section we formally define domain adaptation. [sent-34, score-0.404]
23 Section 5 shows how the bound behaves for the structural correspondence learning representation [6] on natural language data. [sent-37, score-0.423]
24 We discuss our findings, including a new algorithm for domain adaptation based on our theory, in section 6 and conclude in section 7. [sent-38, score-0.595]
25 A learning problem is specified by two parameters: a distribution D over X and a (stochastic) target function f : X → [0, 1]. [sent-42, score-0.173]
26 A representation R induces a distribution over Z and a (stochastic) target function from Z to [0, 1] as follows: Pr_D̃[B] def= Pr_D[R⁻¹(B)] for any B ⊆ Z such that R⁻¹(B) is D-measurable, and f̃(z) def= E_D[f(x) | R(x) = z]. [sent-45, score-0.333]
27 In summary, our learning setting is defined by fixed but unknown D and f , and our choice of representation function R and hypothesis class H ⊆ {g : Z → {0, 1}} of deterministic hypotheses to be used to approximate the function f . [sent-49, score-0.172]
28 1 Domain Adaptation We now formalize the problem of domain adaptation. [sent-51, score-0.438]
29 A domain is a distribution D on the instance set X . [sent-52, score-0.435]
30 Unlike in inductive transfer, where the tasks we wish to perform may be related but different, in domain adaptation we perform the same task in multiple domains. [sent-55, score-0.627]
31 This is quite common in natural language processing, where we might be performing the same syntactic analysis task, such as tagging or parsing, but on domains with very different vocabularies [6, 11]. [sent-56, score-0.415]
32 We assume two domains, a source domain and a target domain. [sent-58, score-0.762]
33 We denote by DS the source distribution of instances and by D̃S the induced distribution over the feature space Z. [sent-59, score-0.33]
34 We use parallel notation, DT and D̃T, for the target domain. [sent-60, score-0.173]
35 3 Generalization Bounds for Domain Adaptation We now proceed to develop a bound on the target domain generalization performance of a classifier trained in the source domain. [sent-65, score-0.943]
36 The first term bounds the performance of the classifier on the source domain. [sent-67, score-0.185]
37 The second term is a measure of the divergence between the induced source marginal D̃S and the induced target marginal D̃T. [sent-68, score-0.479]
38 Unfortunately the variational distance between real-valued distributions cannot be computed from finite samples [2, 9] and therefore is not useful to us when investigating representations for domain adaptation on real-world data. [sent-71, score-0.743]
39 A key part of our theory is the observation that in many realistic domain adaptation scenarios, we do not need such a powerful measure as variational distance. [sent-72, score-0.65]
40 Instead we can restrict our notion of domain distance to be measured only with respect to functions in our hypothesis class. [sent-73, score-0.484]
41 Given a domain X and a collection A of subsets of X , let D, D′ be probability distributions over X , such that every set in A is measurable with respect to both distributions. [sent-76, score-0.444]
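For reference, the distance introduced here is the A-distance of [9], restated here in our own paraphrase using the notation of this section:

    d_{\mathcal{A}}(D, D') \;=\; 2 \sup_{A \in \mathcal{A}} \bigl| \Pr_D[A] - \Pr_{D'}[A] \bigr|.

Restricting the collection A to the characteristic sets of hypotheses in H (the subsets of Z on which each h ∈ H predicts 1) yields the hypothesis-class-restricted distance used in the rest of this section.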
42 This embodies our domain adaptation assumption, and we will assume that our induced labeling function f̃ is λ-close to our hypothesis class H for a small λ. [sent-80, score-0.793]
43 If H has bounded capacity (e.g., a finite VC-dimension), then uniform convergence theory tells us that whenever f̃ is not λ-close to H, large training samples have poor empirical error for every h ∈ H. [sent-84, score-0.218]
44 If the training data is generated by some DS and we wish to use some H as a family of predictors for labels in the target domain T, then one can construct a function which agrees with some h ∈ H with respect to D̃S and yet is far from H with respect to D̃T. [sent-86, score-0.224]
45 Nonetheless we believe that such examples do not occur for realistic domain adaptation problems when the hypothesis class H is sufficiently rich, since for most domain adaptation problems of interest the labeling function is ’similarly simple’ for both the source and target domains. [sent-87, score-1.705]
46 2 Bound on the target domain error We require one last piece of notation before we state and prove the main theorems of this work: the correspondence between functions and characteristic subsets. [sent-89, score-0.758]
47 Theorem 1 Let R be a fixed representation function from X to Z and H be a hypothesis space of VC-dimension d. [sent-93, score-0.172]
48 Using a result from [9], we can state a computable bound for the error on the target domain. [sent-108, score-0.324]
49 Theorem 2 Let R be a fixed representation function from X to Z and H be a hypothesis space of VC-dimension d. [sent-109, score-0.172]
50 The fourth term is the sample A-distance between domains for hypothesis class H. [sent-118, score-0.189]
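Schematically, and hiding the exact constants (this is a paraphrase of the bound's shape rather than its precise statement), the bound has the form

    \epsilon_T(h) \;\le\; \hat{\epsilon}_S(h) \;+\; \mathrm{complexity}(d, m, \delta) \;+\; d_{\mathcal{A}}(\tilde{U}_S, \tilde{U}_T) \;+\; \lambda,

where \hat{\epsilon}_S(h) is the empirical source error, the complexity term depends on the VC-dimension d and the sample sizes, d_{\mathcal{A}}(\tilde{U}_S, \tilde{U}_T) is the sample A-distance between the unlabeled source and target samples, and λ reflects how well the hypothesis class can fit the induced labeling function.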
51 Looking at the two terms, we see that a good representation R is one which achieves low values for both training error and domain A-distance simultaneously. [sent-119, score-0.679]
52 Then h is the classifier which achieves minimum error on the binary classification problem of discriminating between points generated by the two distributions. [sent-126, score-0.212]
53 Define the error of a classifier h on the task of discriminating between points sampled from different distributions as err(h) = (1/2m′) Σ_{i=1}^{2m′} |h(zi) − I_{zi∈ŨS}|, where I_{zi∈ŨS} is the indicator function for points lying in the sample ŨS. [sent-128, score-0.236]
54 It is important to note that this does not provide us with a valid upper bound on the target error, but as we will see it nonetheless provides us with useful insights about representations for domain adaptation. [sent-132, score-0.766]
55 In the subsequent experiments section, we train a linear classifier to discriminate between points sampled from the two domains and use it as a proxy for the A-distance. [sent-133, score-0.172]
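As a concrete illustration of this proxy, the following is a minimal sketch (not the authors' code) of the computation described above: train a linear classifier to separate the unlabeled source sample from the unlabeled target sample and convert its held-out error into a distance estimate. The use of scikit-learn and the specific conversion 2(1 − 2·err) are assumptions of the sketch.

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split

    def proxy_a_distance(source_z, target_z, seed=0):
        """Estimate a proxy for the A-distance between two projected samples.
        source_z, target_z: arrays of shape (m', d) of instances already mapped by R."""
        X = np.vstack([source_z, target_z])
        y = np.concatenate([np.ones(len(source_z)), np.zeros(len(target_z))])  # 1 = source
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.5, random_state=seed, stratify=y)
        clf = SGDClassifier(loss="modified_huber", random_state=seed).fit(X_tr, y_tr)
        err = np.mean(clf.predict(X_te) != y_te)  # err(h) of the domain discriminator
        return 2.0 * (1.0 - 2.0 * err)            # near 2: domains easy to separate; near 0: hard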
56 5 Natural Language Experiments In this section we use our theory to analyze different representations for the task of adapting a part of speech tagger from the financial to biomedical domains [6]. [sent-135, score-0.446]
57 As we shall see, representations which minimize both relevant terms of the bound also have small empirical error. [sent-139, score-0.163]
58 Part of speech (PoS) tagging is the task of labeling a word in context with its grammatical function. [sent-140, score-0.327]
59 PoS tagging is a common preprocessing step in many pipelined natural language processing systems and is described in more detail in [6]. [sent-142, score-0.268]
60 The authors of [6] empirically investigate methods for adapting a part-of-speech tagger from financial news (the Wall Street Journal, henceforth also WSJ) to biomedical abstracts (MEDLINE). [sent-144, score-0.258]
61 As in their investigation, we treat the financial data as our source, for which we have labeled training data, and the biomedical abstracts as our target, for which we have no labeled training data. [sent-146, score-0.404]
62 The representations we consider in this section are all linear projections of the original feature space into Rd . [sent-147, score-0.23]
63 Now at train time we apply the projection to the binary feature vector representation of each instance and learn a linear classifier in the d-dimensional projected space. [sent-150, score-0.263]
64 At test time we apply the projection to the binary feature vector representation and classify in the d-dimensional projected space. [sent-151, score-0.232]
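A minimal sketch of this recipe, under the assumption that instances arrive as binary feature vectors in R^D and that theta is a fixed d x D projection matrix (obtained from random projections, SCL, or any other source); the LinearSVC here is an illustrative stand-in for whichever margin-based linear learner is used.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_in_projected_space(theta, X_source, y_source):
        """theta: (d, D) projection; X_source: (n, D) binary features; y_source: labels."""
        Z_source = X_source @ theta.T        # map each instance into the d-dimensional space
        return LinearSVC().fit(Z_source, y_source)

    def predict_in_projected_space(clf, theta, X):
        return clf.predict(X @ theta.T)      # apply the same projection at test time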
65 (a) Plot of the SCL representation for financial (squares) vs. biomedical (circles); (b) plot of the SCL representation for nouns (diamonds) vs. verbs (triangles). [sent-159, score-0.232]
66 Figure 1: 2D plots of SCL representations for the (a) A-distance and (b) empirical risk parts of theorem 2. Random projections approximate distances in the original high-dimensional space well, as long as d is sufficiently large. [sent-160, score-0.299]
67 The authors of [6] describe a heuristic method for domain adaptation that they call structural correspondence learning (henceforth also SCL). [sent-164, score-0.356]
68 SCL uses unlabeled data from both domains to induce correspondences among features in the two domains. [sent-165, score-0.172]
69 The intuition is that by capturing these important correlations, features from the source and target domains which behave similarly for PoS tagging will be represented similarly in the projected space. [sent-169, score-0.701]
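The following is a rough sketch of an SCL-style projection in the spirit of [6], not the authors' implementation: the choice of pivot features, the pivot predictors, and the plain SVD are all simplifying assumptions of this sketch.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def scl_projection(X_unlabeled, pivot_idx, d=25):
        """X_unlabeled: (n, D) binary features pooled from both domains;
        pivot_idx: indices of frequent features occurring in both domains
        (each pivot is assumed to be neither always present nor always absent)."""
        weight_vectors = []
        for p in pivot_idx:
            y = X_unlabeled[:, p]               # target: does the instance contain pivot p?
            X_masked = X_unlabeled.copy()
            X_masked[:, p] = 0                  # hide the pivot itself from its predictor
            clf = SGDClassifier(loss="modified_huber").fit(X_masked, y)
            weight_vectors.append(clf.coef_.ravel())   # how non-pivot features predict p
        W = np.array(weight_vectors).T          # (D, num_pivots) correspondence matrix
        U, _, _ = np.linalg.svd(W, full_matrices=False)
        return U[:, :d].T                       # theta: (d, D) low-dimensional projection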
70 3 Results We use as our source data set 100 sentences (about 2500 words) of PoS-tagged Wall Street Journal text. [sent-171, score-0.185]
71 The target domain test set is the same set as in [6]. [sent-172, score-0.577]
72 We use one million words (500 thousand from each domain) of unlabeled data to estimate the A-distance between the financial and biomedical domains. [sent-173, score-0.165]
73 The results in this section are intended to illustrate the different parts of theorem 2 and how they can affect the target domain generalization error. [sent-174, score-0.701]
74 In this case we use the Huber loss as a proxy for the empirical training error. [sent-179, score-0.182]
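For concreteness, here is one common form of the modified Huber loss that could serve as such a margin-based proxy; whether this exact variant matches the one used in the experiments is an assumption.

    import numpy as np

    def modified_huber_loss(y, scores):
        """y in {-1, +1}; scores = w·x for each training instance; returns the mean loss."""
        z = y * scores                                        # signed margin
        per_example = np.where(z >= -1.0,
                               np.maximum(0.0, 1.0 - z) ** 2, # quadratic near the margin
                               -4.0 * z)                      # linear for badly misclassified points
        return per_example.mean()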
75 Figure 1(a) shows one hundred random instances from the financial and biomedical data sets, projected onto the space spanned by the two best discriminating projections from the SCL projection matrix. [sent-180, score-0.535]
76 On the other hand, figure 1(b) shows the best two discriminating components for the task of discriminating between nouns and verbs. [sent-184, score-0.298]
77 Thus these pictures lead us to believe that SCL finds a representation which results in both small empirical classification error and small A-distance. [sent-186, score-0.204]
78 (a) Plot of the random projections representation for financial (squares) vs. biomedical (circles). [sent-188, score-0.221]
79 (b) Comparison of the bound terms vs. target-domain error for each representation. [sent-189, score-0.203]
80 Representations are linear projections of the original feature space. [sent-191, score-0.171]
81 Huber loss is the labeled training loss after training, and the A-distance is approximated as described in the previous subsection. [sent-192, score-0.208]
82 Error refers to tagging error for the full tagset on the target domain. [sent-193, score-0.403]
83 Figure 2: (a) 2D plot of the random projection representation and (b) results summary on the large data set. Figure 2(a) shows one hundred random instances projected onto the best two discriminating projections for WSJ vs. MEDLINE. [sent-203, score-0.555]
84 Thus theorem 2 predicts that using two random projections as a representation will perform poorly, since it minimizes only the A-distance and not the empirical error. [sent-210, score-0.332]
85 The identity representation achieves very low Huber loss (corresponding to empirical error). [sent-212, score-0.274]
86 The random projections method achieves low A-distance but high Huber loss, and the classifier which uses this representation achieves error rates much higher than a classifier which uses the identity representation. [sent-215, score-0.401]
87 Finally, the structural correspondence learning representation achieves low Huber loss and low A-distance, and the error rate is the lowest of the three representations. [sent-216, score-0.438]
88 6 Discussion and Future Work Our theory demonstrates an important tradeoff inherent in designing good representations for domain adaptation. [sent-217, score-0.552]
89 A good representation enables a classifier to achieve a low error rate on the source domain while also minimizing the A-distance between the induced marginal distributions of the two domains. [sent-218, score-0.846]
90 Unlike previous analyses, though, we bound the target domain error using a finite source domain labeled sample and finite source and target domain unlabeled samples. [sent-222, score-2.223]
91 Our experiments illustrate the utility of our bound on target domain error, but they do not explore the accuracy of our approximate H-distance. [sent-223, score-0.71]
92 Finally our theory points toward an interesting new direction for domain adaptation. [sent-225, score-0.491]
93 7 Conclusions We presented an analysis of representations for domain adaptation. [sent-229, score-0.463]
94 It is reasonable to think that a good representation is the key to effective domain adaptation, and our theory backs up that intuition. [sent-230, score-0.551]
95 Theorem 2 gives an upper bound on the generalization of a classifier trained on a source domain and applied in a target domain. [sent-231, score-0.943]
96 The bound depends on the representation and explicitly demonstrates the tradeoff between low empirical source domain error and a small difference between distributions. [sent-232, score-0.962]
97 Under the assumption that the labeling function f̃ is close to our hypothesis class H, we can compute the bound from finite samples. [sent-233, score-0.258]
98 Our experiments indicate that the heuristic structural correspondence learning method [6] does in fact simultaneously achieve low A-distance as well as a low margin-based loss. [sent-237, score-0.28]
99 Finally we note that our theory points to an interesting new algorithm for domain adaptation. [sent-239, score-0.459]
100 Instead of making heuristic choices, we are investigating algorithms which directly minimize a combination of the A-distance and the empirical training margin. [sent-240, score-0.232]
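To make this suggestion concrete, here is a hedged illustration (not taken from the paper) of what such a combined objective might look like: a hinge loss on the labeled source data, which rewards a large margin, plus a trade-off term measuring the distance between the projected unlabeled samples. How to optimize it, and which distance estimator to plug in, are left open.

    import numpy as np

    def joint_objective(theta, w, X_src, y_src, X_src_unl, X_tgt_unl,
                        domain_distance_fn, trade_off=1.0):
        """domain_distance_fn: any estimate of distance between two projected samples,
        e.g. the proxy A-distance sketch shown earlier."""
        Z_src = X_src @ theta.T
        margins = y_src * (Z_src @ w)                         # y in {-1, +1}, w a weight vector in R^d
        source_loss = np.maximum(0.0, 1.0 - margins).mean()   # hinge loss on the labeled source set
        dist = domain_distance_fn(X_src_unl @ theta.T, X_tgt_unl @ theta.T)
        return source_loss + trade_off * dist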
wordName wordTfidf (topN-words)
[('domain', 0.404), ('ds', 0.272), ('scl', 0.257), ('zh', 0.257), ('adaptation', 0.191), ('source', 0.185), ('dt', 0.184), ('prd', 0.18), ('tagging', 0.18), ('target', 0.173), ('huber', 0.143), ('dh', 0.143), ('projections', 0.129), ('nancial', 0.119), ('discriminating', 0.114), ('domains', 0.109), ('biomedical', 0.102), ('blitzer', 0.102), ('bound', 0.101), ('er', 0.1), ('representation', 0.092), ('language', 0.088), ('classi', 0.084), ('labeled', 0.081), ('hypothesis', 0.08), ('correspondence', 0.078), ('medline', 0.077), ('labeling', 0.077), ('pos', 0.076), ('heuristic', 0.07), ('wsj', 0.067), ('structural', 0.064), ('unlabeled', 0.063), ('empirical', 0.062), ('instances', 0.062), ('representations', 0.059), ('ut', 0.058), ('theory', 0.055), ('projected', 0.054), ('characteristic', 0.053), ('training', 0.051), ('izi', 0.051), ('kifer', 0.051), ('prds', 0.051), ('prdt', 0.051), ('tagger', 0.051), ('zg', 0.051), ('error', 0.05), ('investigating', 0.049), ('theorem', 0.049), ('achieves', 0.048), ('crammer', 0.047), ('users', 0.045), ('arriaga', 0.045), ('discriminator', 0.045), ('genres', 0.045), ('noun', 0.045), ('pivot', 0.045), ('projection', 0.044), ('hyperplane', 0.044), ('generalization', 0.043), ('feature', 0.042), ('induced', 0.041), ('distributions', 0.04), ('divergence', 0.039), ('speech', 0.038), ('abstracts', 0.038), ('syntactic', 0.038), ('nouns', 0.038), ('spam', 0.038), ('tag', 0.038), ('rd', 0.038), ('loss', 0.038), ('trained', 0.037), ('choices', 0.034), ('formalize', 0.034), ('def', 0.034), ('situation', 0.034), ('low', 0.034), ('tradeoff', 0.034), ('poorly', 0.033), ('ez', 0.033), ('focs', 0.033), ('sugiyama', 0.033), ('toward', 0.032), ('task', 0.032), ('illustrate', 0.032), ('instance', 0.031), ('err', 0.031), ('wall', 0.031), ('proxy', 0.031), ('shai', 0.031), ('hundred', 0.03), ('covariate', 0.03), ('drawn', 0.03), ('plot', 0.03), ('henceforth', 0.029), ('nonetheless', 0.029), ('pereira', 0.029), ('stochastic', 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 33 nips-2006-Analysis of Representations for Domain Adaptation
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
2 0.15340489 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: We1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
3 0.12414689 116 nips-2006-Learning from Multiple Sources
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We consider the problem of learning accurate models from multiple sources of “nearby” data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest. 1
Author: Philip M. Long, Rocco Servedio
Abstract: We consider the well-studied problem of learning decision lists using few examples when many irrelevant features are present. We show that smooth boosting algorithms such as MadaBoost can efficiently learn decision lists of length k over n boolean variables using poly(k, log n) many examples provided that the marginal distribution over the relevant variables is “not too concentrated” in an L 2 -norm sense. Using a recent result of H˚ stad, we extend the analysis to obtain a similar a (though quantitatively weaker) result for learning arbitrary linear threshold functions with k nonzero coefficients. Experimental results indicate that the use of a smooth boosting algorithm, which plays a crucial role in our analysis, has an impact on the actual performance of the algorithm.
5 0.097122028 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of , and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of (in contrast with the metric dimension).
6 0.096486218 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
7 0.09505751 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
8 0.094266444 110 nips-2006-Learning Dense 3D Correspondence
9 0.088626467 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
10 0.088258766 150 nips-2006-On Transductive Regression
11 0.088018648 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
12 0.08716017 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
13 0.087013252 193 nips-2006-Tighter PAC-Bayes Bounds
14 0.081915714 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
15 0.074209496 50 nips-2006-Chained Boosting
16 0.073144786 12 nips-2006-A Probabilistic Algorithm Integrating Source Localization and Noise Suppression of MEG and EEG data
17 0.070985451 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
18 0.07076855 46 nips-2006-Blind source separation for over-determined delayed mixtures
19 0.069807142 141 nips-2006-Multiple timescales and uncertainty in motor adaptation
20 0.069186442 65 nips-2006-Denoising and Dimension Reduction in Feature Space
topicId topicWeight
[(0, -0.234), (1, 0.045), (2, -0.024), (3, -0.044), (4, -0.078), (5, 0.146), (6, -0.042), (7, 0.023), (8, -0.012), (9, -0.008), (10, 0.028), (11, 0.003), (12, -0.063), (13, 0.02), (14, 0.087), (15, -0.167), (16, 0.087), (17, -0.157), (18, -0.001), (19, -0.103), (20, -0.074), (21, 0.055), (22, 0.025), (23, 0.018), (24, -0.146), (25, 0.082), (26, -0.017), (27, 0.16), (28, 0.012), (29, -0.054), (30, 0.008), (31, -0.186), (32, -0.139), (33, 0.107), (34, -0.203), (35, 0.012), (36, 0.001), (37, -0.021), (38, -0.05), (39, 0.072), (40, 0.04), (41, -0.033), (42, 0.105), (43, -0.044), (44, -0.021), (45, -0.036), (46, -0.033), (47, 0.011), (48, 0.136), (49, 0.109)]
simIndex simValue paperId paperTitle
same-paper 1 0.95436549 33 nips-2006-Analysis of Representations for Domain Adaptation
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
2 0.81982356 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
Author: Ping Li, Kenneth W. Church, Trevor J. Hastie
Abstract: We1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.
3 0.60651857 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of , and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of (in contrast with the metric dimension).
4 0.54306501 68 nips-2006-Dirichlet-Enhanced Spam Filtering based on Biased Samples
Author: Steffen Bickel, Tobias Scheffer
Abstract: We study a setting that is motivated by the problem of filtering spam messages for many users. Each user receives messages according to an individual, unknown distribution, reflected only in the unlabeled inbox. The spam filter for a user is required to perform well with respect to this distribution. Labeled messages from publicly available sources can be utilized, but they are governed by a distinct distribution, not adequately representing most inboxes. We devise a method that minimizes a loss function with respect to a user’s personal distribution based on the available biased sample. A nonparametric hierarchical Bayesian model furthermore generalizes across users by learning a common prior which is imposed on new email accounts. Empirically, we observe that bias-corrected learning outperforms naive reliance on the assumption of independent and identically distributed data; Dirichlet-enhanced generalization across users outperforms a single (“one size fits all”) filter as well as independent filters for all users. 1
Author: Philip M. Long, Rocco Servedio
Abstract: We consider the well-studied problem of learning decision lists using few examples when many irrelevant features are present. We show that smooth boosting algorithms such as MadaBoost can efficiently learn decision lists of length k over n boolean variables using poly(k, log n) many examples provided that the marginal distribution over the relevant variables is “not too concentrated” in an L 2 -norm sense. Using a recent result of H˚ stad, we extend the analysis to obtain a similar a (though quantitatively weaker) result for learning arbitrary linear threshold functions with k nonzero coefficients. Experimental results indicate that the use of a smooth boosting algorithm, which plays a crucial role in our analysis, has an impact on the actual performance of the algorithm.
6 0.50725144 116 nips-2006-Learning from Multiple Sources
7 0.47529674 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
8 0.45779604 150 nips-2006-On Transductive Regression
9 0.45447165 158 nips-2006-PG-means: learning the number of clusters in data
10 0.44785017 193 nips-2006-Tighter PAC-Bayes Bounds
11 0.42488658 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis
12 0.42228332 110 nips-2006-Learning Dense 3D Correspondence
13 0.40892616 5 nips-2006-A Kernel Method for the Two-Sample-Problem
14 0.4044964 191 nips-2006-The Robustness-Performance Tradeoff in Markov Decision Processes
15 0.39765462 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
16 0.3874895 155 nips-2006-Optimal Single-Class Classification Strategies
17 0.38264361 172 nips-2006-Scalable Discriminative Learning for Natural Language Parsing and Translation
18 0.38107407 50 nips-2006-Chained Boosting
19 0.37843803 141 nips-2006-Multiple timescales and uncertainty in motor adaptation
20 0.37124816 131 nips-2006-Mixture Regression for Covariate Shift
topicId topicWeight
[(1, 0.105), (3, 0.029), (7, 0.095), (9, 0.025), (12, 0.306), (20, 0.013), (22, 0.085), (44, 0.104), (57, 0.071), (65, 0.058), (69, 0.027)]
simIndex simValue paperId paperTitle
1 0.88491803 39 nips-2006-Balanced Graph Matching
Author: Timothee Cour, Praveen Srinivasan, Jianbo Shi
Abstract: Graph matching is a fundamental problem in Computer Vision and Machine Learning. We present two contributions. First, we give a new spectral relaxation technique for approximate solutions to matching problems, that naturally incorporates one-to-one or one-to-many constraints within the relaxation scheme. The second is a normalization procedure for existing graph matching scoring functions that can dramatically improve the matching accuracy. It is based on a reinterpretation of the graph matching compatibility matrix as a bipartite graph on edges for which we seek a bistochastic normalization. We evaluate our two contributions on a comprehensive test set of random graph matching problems, as well as on image correspondence problem. Our normalization procedure can be used to improve the performance of many existing graph matching algorithms, including spectral matching, graduated assignment and semidefinite programming. 1
2 0.85838103 122 nips-2006-Learning to parse images of articulated bodies
Author: Deva Ramanan
Abstract: We consider the machine vision task of pose estimation from static images, specifically for the case of articulated objects. This problem is hard because of the large number of degrees of freedom to be estimated. Following a established line of research, pose estimation is framed as inference in a probabilistic model. In our experience however, the success of many approaches often lie in the power of the features. Our primary contribution is a novel casting of visual inference as an iterative parsing process, where one sequentially learns better and better features tuned to a particular image. We show quantitative results for human pose estimation on a database of over 300 images that suggest our algorithm is competitive with or surpasses the state-of-the-art. Since our procedure is quite general (it does not rely on face or skin detection), we also use it to estimate the poses of horses in the Weizmann database. 1
same-paper 3 0.80829906 33 nips-2006-Analysis of Representations for Domain Adaptation
Author: Shai Ben-David, John Blitzer, Koby Crammer, Fernando Pereira
Abstract: Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set. 1
4 0.58902663 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
Author: Kilian Q. Weinberger, Fei Sha, Qihui Zhu, Lawrence K. Saul
Abstract: In many areas of science and engineering, the problem arises how to discover low dimensional representations of high dimensional data. Recently, a number of researchers have converged on common solutions to this problem using methods from convex optimization. In particular, many results have been obtained by constructing semidefinite programs (SDPs) with low rank solutions. While the rank of matrix variables in SDPs cannot be directly constrained, it has been observed that low rank solutions emerge naturally by computing high variance or maximal trace solutions that respect local distance constraints. In this paper, we show how to solve very large problems of this type by a matrix factorization that leads to much smaller SDPs than those previously studied. The matrix factorization is derived by expanding the solution of the original problem in terms of the bottom eigenvectors of a graph Laplacian. The smaller SDPs obtained from this matrix factorization yield very good approximations to solutions of the original problem. Moreover, these approximations can be further refined by conjugate gradient descent. We illustrate the approach on localization in large scale sensor networks, where optimizations involving tens of thousands of nodes can be solved in just a few minutes. 1
5 0.58649826 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
6 0.58596379 80 nips-2006-Fundamental Limitations of Spectral Clustering
7 0.57872218 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
8 0.57705939 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data
9 0.57308245 20 nips-2006-Active learning for misspecified generalized linear models
10 0.57298774 163 nips-2006-Prediction on a Graph with a Perceptron
11 0.56848198 110 nips-2006-Learning Dense 3D Correspondence
12 0.56529599 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
13 0.56444705 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
14 0.56245399 175 nips-2006-Simplifying Mixture Models through Function Approximation
15 0.56033373 117 nips-2006-Learning on Graph with Laplacian Regularization
16 0.55878186 21 nips-2006-AdaBoost is Consistent
17 0.55845034 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
18 0.55805582 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
19 0.55767155 5 nips-2006-A Kernel Method for the Two-Sample-Problem
20 0.55694002 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds