nips nips2007 nips2007-110 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. [sent-4, score-1.434]
2 In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. [sent-5, score-1.091]
3 The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. [sent-6, score-1.007]
4 Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. [sent-7, score-1.704]
5 We have ample data drawn from a source domain to train a model, but little or no training data from the target domain where we wish to use the model [17, 3, 10, 5, 9]. [sent-9, score-1.382]
6 Those studies focus on the case where training data is drawn from a source domain and test data is drawn from a different target domain. [sent-15, score-1.181]
7 We generalize this approach to the case where we have some labeled data from the target domain in addition to a large amount of labeled source data. [sent-16, score-1.186]
8 Our main result is a uniform convergence bound on the true target risk of a model trained to minimize a convex combination of empirical source and target risks. [sent-17, score-1.672]
9 The bound describes an intuitive tradeoff between the quantity of the source data and the accuracy of the target data, and under relatively weak assumptions we can compute it from finite labeled and unlabeled samples of the source and target distributions. [sent-18, score-1.991]
10 We use the task of sentiment classification to demonstrate that our bound makes correct predictions about model error with respect to a distance measure between source and target domains and the number of training instances. [sent-19, score-1.335]
11 Several authors have empirically studied a special case of this in which each instance is weighted separately in the loss function, and instance weights are set to approximate the target domain distribution [10, 5, 9, 11]. [sent-21, score-0.696]
12 We give a uniform convergence bound for algorithms that minimize a convex combination of multiple empirical source risks and we show that these algorithms can outperform standard empirical risk minimization. [sent-22, score-1.063]
13 2 A Rigorous Model of Domain Adaptation We formalize domain adaptation for binary classification as follows. [sent-23, score-0.398]
14 A domain is a pair consisting of a distribution D on X and a labeling function f : X → [0, 1]. [sent-24, score-0.277]
15 1 Initially we consider two domains, a source domain DS , fS and a target domain DT , fT . [sent-25, score-1.253]
16 We write the empirical risk of a hypothesis on the source domain as $\hat{\epsilon}_S(h)$. [sent-29, score-1.08]
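The underlying risk definitions are elided by this extraction; consistent with the notation above (our rendering, following the paper's conventions), they are:

```latex
% Risk of h with respect to a labeling f under the source distribution:
\epsilon_S(h, f) = \mathbb{E}_{x \sim \mathcal{D}_S}\bigl[\,\lvert h(x) - f(x) \rvert\,\bigr],
\qquad \epsilon_S(h) = \epsilon_S(h, f_S),
% with \hat{\epsilon}_S(h) the corresponding empirical risk on the training sample.
```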
17 We use the parallel notation $\epsilon_T(h, f)$, $\epsilon_T(h)$, and $\hat{\epsilon}_T(h)$ for the target domain. [sent-30, score-0.363]
18 We measure the distance between two distributions $\mathcal{D}$ and $\mathcal{D}'$ using a hypothesis class-specific distance measure. [sent-31, score-0.295]
19 Let H be a hypothesis class for instance space X , and AH be the set of subsets of X that are the support of some hypothesis in H. [sent-32, score-0.372]
20 For a hypothesis space H, we define the symmetric difference hypothesis space H∆H as H∆H = {h(x) ⊕ h′ (x) : h, h′ ∈ H} , where ⊕ is the XOR operator. [sent-37, score-0.338]
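For reference, the induced distance used below has the form of the A-distance of Ben-David et al.; this rendering is our reconstruction, not text recovered from the extraction:

```latex
d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}, \mathcal{D}')
  = 2 \sup_{A \in \mathcal{A}_{\mathcal{H}\Delta\mathcal{H}}}
      \bigl\lvert \Pr_{\mathcal{D}}[A] - \Pr_{\mathcal{D}'}[A] \bigr\rvert .
```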
21 The ideal hypothesis minimizes the combined source and target risk: $h^* = \operatorname{argmin}_{h \in \mathcal{H}} \bigl[\epsilon_S(h) + \epsilon_T(h)\bigr]$. [sent-42, score-1.116]
22 We denote the combined risk of the ideal hypothesis by $\lambda = \epsilon_S(h^*) + \epsilon_T(h^*)$. [sent-43, score-0.398]
23 When the ideal hypothesis performs poorly, we cannot expect to learn a good target classifier by minimizing source error. [sent-45, score-1.08]
24 When the ideal hypothesis performs well, on the other hand, we can reasonably approximate target risk using source risk and the distance between DS and DT. [sent-47, score-1.222]
25 We illustrate the kind of result available in this setting with the following bound on the target risk in terms of the source risk, the difference between labeling functions fS and fT , and the distance between the distributions DS and DT . [sent-48, score-1.239]
26 1 This notion of domain is not the domain of a function. [sent-51, score-0.412]
27 Theorem 1 Let H be a hypothesis space of VC-dimension d and let $U_S$, $U_T$ be unlabeled samples of size $m'$ each, drawn from $\mathcal{D}_S$ and $\mathcal{D}_T$, respectively. [sent-54, score-0.319]
28 Let $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}$ be the empirical distance on $U_S$, $U_T$ induced by the symmetric difference hypothesis space. [sent-55, score-0.3]
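The inequality of Theorem 1 did not survive the extraction. From the surrounding discussion and the published version it has the following form; treat the constants as indicative rather than authoritative:

```latex
% With probability at least 1 - \delta over the samples, for every h \in \mathcal{H}:
\epsilon_T(h) \le \epsilon_S(h)
  + \tfrac{1}{2}\,\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_S, U_T)
  + 4\sqrt{\frac{2d\log(2m') + \log(4/\delta)}{m'}}
  + \lambda .
```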
29 When the combined error of the ideal hypothesis is large, there is no classifier that performs well on both the source and target domains, so we cannot hope to find a good target hypothesis by training only on the source domain. [sent-60, score-2.219]
30 On the other hand, for small λ (the most relevant case for domain adaptation), Theorem 1 shows that source error and unlabeled H∆H-distance are important quantities for computing target error. [sent-61, score-1.18]
31 3 A Learning Bound Combining Source and Target Data Theorem 1 shows how to relate source and target risk. [sent-62, score-0.841]
32 We now proceed to give a learning bound for empirical risk minimization using combined source and target training data. [sent-63, score-1.233]
33 At train time a learner receives a sample S = (ST , SS ) of m instances, where ST consists of βm instances drawn independently from DT and SS consists of (1−β)m instances drawn independently from DS . [sent-66, score-0.32]
34 The goal of a learner is to find a hypothesis that minimizes target risk ǫT (h). [sent-67, score-0.727]
35 When β is small, as in domain adaptation, minimizing empirical target risk may not be the best choice. [sent-68, score-0.796]
36 We analyze learners that instead minimize a convex combination of empirical source and target risk: $\hat{\epsilon}_\alpha(h) = \alpha\,\hat{\epsilon}_T(h) + (1 - \alpha)\,\hat{\epsilon}_S(h)$. We denote by $\epsilon_\alpha(h)$ the corresponding weighted combination of true source and target risks, measured with respect to $\mathcal{D}_S$ and $\mathcal{D}_T$. [sent-69, score-1.911]
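As a concrete illustration (a minimal sketch, not the authors' code; the callable `h` and the array names are our assumptions), this quantity can be computed directly for the 0/1 loss:

```python
import numpy as np

def empirical_alpha_risk(h, X_T, y_T, X_S, y_S, alpha):
    """Convex combination of empirical risks for the 0/1 loss:
    eps_hat_alpha(h) = alpha * eps_hat_T(h) + (1 - alpha) * eps_hat_S(h)."""
    err_T = np.mean(h(X_T) != y_T)  # empirical target risk, eps_hat_T(h)
    err_S = np.mean(h(X_S) != y_S)  # empirical source risk, eps_hat_S(h)
    return alpha * err_T + (1 - alpha) * err_S
```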
37 We bound the target risk of a domain adaptation algorithm that minimizes $\hat{\epsilon}_\alpha(h)$. [sent-70, score-1.036]
38 First we bound the difference between the target risk $\epsilon_T(h)$ and the weighted risk $\epsilon_\alpha(h)$. [sent-72, score-0.814]
39 Then we bound the difference between the true and empirical weighted risks, $\epsilon_\alpha(h)$ and $\hat{\epsilon}_\alpha(h)$. [sent-73, score-0.262]
40 The lemma shows that as α approaches 1, we rely increasingly on the target data, and the distance between domains matters less and less. [sent-77, score-0.529]
41 When α = 0 (that is, we ignore target data), the bound is identical to that of Theorem 1, but with an empirical estimate for the source error. [sent-88, score-1.051]
42 Similarly when α = 1 (that is, we use only target data), the bound is the standard learning bound using only target data. [sent-89, score-0.936]
43 Finally note that by choosing different values of α, the bound allows us to effectively trade off the small amount of target data against the large amount of less relevant source data. [sent-91, score-1.03]
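The shape of this trade-off can be sketched numerically. In the sketch below we assume the complexity term scales as sqrt(α²/β + (1−α)²/(1−β))/√m, the effective-sample-size factor a weighted Hoeffding argument yields, and `C` stands in for the VC-dimension/confidence constant, which we do not reproduce here:

```python
import numpy as np

def bound_shape(alphas, beta, m, dist_plus_lambda, C=1.0):
    """Shape of the Theorem 2 trade-off: a variance term that grows as the
    weighted sample becomes effectively smaller, plus a bias term
    2*(1 - alpha)*(dist/2 + lambda) incurred by leaning on source data."""
    var = C * np.sqrt(alphas**2 / beta + (1 - alphas)**2 / (1 - beta)) / np.sqrt(m)
    bias = 2 * (1 - alphas) * dist_plus_lambda
    return var + bias

alphas = np.linspace(0.0, 1.0, 101)
vals = bound_shape(alphas, beta=0.05, m=2000, dist_plus_lambda=0.3)
alpha_star = alphas[np.argmin(vals)]  # alpha minimizing the sketched bound
```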
44 Let ζ(US , UT ) be our approximation to dH∆H , computed from source and target unlabeled data. [sent-99, score-0.93]
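One standard way to build such an approximation from unlabeled samples, in the spirit of ζ(US, UT) though not necessarily the authors' exact procedure, is to train a classifier to separate source from target points and convert its held-out error into a distance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_distance(U_S, U_T):
    """Proxy for the H-Delta-H distance: if no classifier can tell the two
    unlabeled samples apart (error 0.5), the distance is 0; if they are
    perfectly separable (error 0), it is maximal."""
    X = np.vstack([U_S, U_T])
    d = np.concatenate([np.zeros(len(U_S)), np.ones(len(U_T))])  # domain labels
    X_tr, X_te, d_tr, d_te = train_test_split(X, d, test_size=0.5, random_state=0)
    err = 1.0 - LogisticRegression(max_iter=1000).fit(X_tr, d_tr).score(X_te, d_te)
    return 2.0 * (1.0 - 2.0 * err)
```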
45 Because of the large number of potential genres, sentiment classification is an ideal area for domain adaptation. [sent-106, score-0.428]
46 We chose the “apparel” domain as our target domain, and all of the plots on the right-hand side of Figure 1 are for this domain. [sent-111, score-0.608]
47 We obtain empirical curves for the error as a function of α by training a classifier using a weighted hinge loss. [sent-112, score-0.268]
48 Suppose the target domain has weight α and there are βm target training instances. [sent-113, score-0.992]
49 Then we scale the loss of a target training instance by α/β and the loss of a source training instance by (1 − α)/(1 − β). [sent-114, score-1.091]
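A minimal sketch of this weighting scheme (LinearSVC with per-sample weights is an illustrative stand-in for the classifier actually used in the experiments):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_alpha_weighted(X_T, y_T, X_S, y_S, alpha):
    """Weighted hinge loss: target instances get weight alpha/beta and
    source instances (1 - alpha)/(1 - beta), where beta is the fraction
    of training instances that come from the target domain."""
    m_T, m_S = len(y_T), len(y_S)
    beta = m_T / (m_T + m_S)
    w = np.concatenate([np.full(m_T, alpha / beta),
                        np.full(m_S, (1 - alpha) / (1 - beta))])
    X, y = np.vstack([X_T, X_S]), np.concatenate([y_T, y_S])
    return LinearSVC().fit(X, y, sample_weight=w)
```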
50 [Figure 1 panel residue; recoverable panel title: (f) source = dvd, vary mS, mT = 2500] [sent-133, score-0.521]
51 Figure 1: Comparing the bound with test error for sentiment classification. [sent-149, score-0.301]
52 Curves in (c) and (d) represent different numbers of target instances. [sent-154, score-0.363]
53 Curves in (e) and (f) represent different numbers of source instances. [sent-155, score-0.478]
54 Figure 1 shows a series of plots of equation 1 (on the top) coupled with corresponding plots of test error (on the bottom) as a function of α for different amounts of source and target data and different distances between domains. [sent-156, score-0.963]
55 In each pair of plots, a single parameter (distance, number of target instances mT , or number of source instances mS ) is varied while the other two are held constant. [sent-157, score-1.087]
56 The plots on the top part of Figure 1 are not meant to be numerical proxies for the true error (for the source domains “books” and “dvd”, the distance alone is well above 1). [sent-159, score-0.694]
57 Furthermore, the value of α which minimizes the bound also has a low empirical error for each corresponding curve. [sent-164, score-0.253]
58 This suggests that choosing α to minimize the bound of Theorem 2 and subsequently training a classifier to minimize the empirical error $\hat{\epsilon}_\alpha(h)$ can work well in practice, provided we have a reasonable measure of complexity. [sent-165, score-0.347]
59 Figures 1a and 1b show that more distant source domains result in higher target error. [sent-166, score-0.935]
60 Figures 1c and 1d illustrate that for more target data, we have not only lower error in general, but also a higher minimizing α. [sent-167, score-0.407]
61 Finally, Figures 1e and 1f depict the limitation of distant source data. [sent-168, score-0.531]
62 With enough target data, no matter how much source data we include, we always prefer to use only the target data. [sent-169, score-1.204]
63 This is reflected in our bound as a phase transition in the value of the optimal α (governing the tradeoff between source and target data). [sent-170, score-1.036]
64 The value of α which minimizes the bound is indicated by the intensity, where black means α = 1 (corresponding to ignoring source and learning only from target data). [sent-176, score-0.982]
65 The x-axis shows the number of source instances (log-scale). [sent-179, score-0.601]
66 With more target instances than this, it is more effective to ignore even an infinite amount of source data. [sent-182, score-1.03]
67 5 Learning from Multiple Sources We now explore an extension of our theory to the case of multiple source domains. [sent-183, score-0.501]
68 Each source Sj is associated with an unknown underlying distribution Dj over input points and an unknown labeling function fj . [sent-185, score-0.549]
69 From each source Sj , we are given mj labeled training instances, and our goal is to use these instances to train a model to perform well on a target domain DT , fT , which may or may not be one of the sources. [sent-186, score-1.336]
70 This setting is motivated by several new domain adaptation algorithms [10, 5, 11, 9] that weight the loss from training instances depending on how “far” they are from the target domain. [sent-187, score-0.95]
71 That is, each training instance is its own source domain. [sent-188, score-0.572]
72 As in the previous sections, we will examine algorithms that minimize convex combinations of training errors over the labeled examples from each source domain. [sent-189, score-0.666]
73 Given a vector α = (α1, · · · , αN) of domain weights with $\sum_j \alpha_j = 1$, we define the empirical α-weighted error of a function h as $\hat{\epsilon}_\alpha(h) = \sum_{j=1}^{N} \alpha_j\,\hat{\epsilon}_j(h) = \sum_{j=1}^{N} \frac{\alpha_j}{m_j} \sum_{x \in S_j} \lvert h(x) - f_j(x) \rvert$. [sent-191, score-0.369]
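A direct transcription of this quantity (a sketch; representing the sources as a list of labeled samples is our choice, not the paper's):

```python
import numpy as np

def multi_source_weighted_error(h, sources, alphas):
    """Empirical alpha-weighted error over N sources:
    eps_hat_alpha(h) = sum_j alpha_j * eps_hat_j(h),
    where sources is a list of (X_j, y_j) pairs."""
    assert np.isclose(np.sum(alphas), 1.0)
    return sum(a_j * np.mean(h(X_j) != y_j)
               for a_j, (X_j, y_j) in zip(alphas, sources))
```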
74 Let Dα be a mixture of the N source distributions with mixing weights equal to the components of α. [sent-193, score-0.5]
75 Finally, analogous to λ in the single-source setting, we define the error of the multi-source ideal hypothesis for a weighting α as $\gamma_\alpha = \min_{h}\bigl\{\epsilon_T(h) + \epsilon_\alpha(h)\bigr\} = \min_{h}\bigl\{\epsilon_T(h) + \sum_{j=1}^{N} \alpha_j\,\epsilon_j(h)\bigr\}$. [sent-194, score-0.385]
76 The following theorem gives a learning bound for empirical risk minimization using the empirical α-weighted error. [sent-195, score-0.471]
77 Theorem 3 Suppose we are given mj labeled instances from source Sj for j = 1, . . . , N. [sent-196, score-0.707]
78 (b) Learning by uniformly weighting the training data causes us to learn a suboptimal decision boundary, (c) but by weighting the males more highly, we can match the target data and learn an optimal classifier. [sent-207, score-0.729]
79 The first part bounds the difference between the α-weighted error and the target error, similarly to Lemma 1. [sent-210, score-0.53]
80 Theorem 3 reduces to Theorem 2 when we have only two sources, one of which is the target domain (that is, we have some small number of target instances). [sent-212, score-0.932]
81 It is more general, though, because by manipulating α we can effectively change the source domain. [sent-213, score-0.478]
82 First, we demand that there exists a hypothesis h∗ which has low error on both the α-weighted convex combination of sources and the target domain. [sent-215, score-0.738]
83 Second, we measure distance between the target and a mixture of sources, rather than between the target and a single source. [sent-216, score-0.811]
84 This can happen if some non-uniform weighting of sources accurately approximates the target domain. [sent-218, score-0.559]
85 In this example, we can find the optimal classifier by weighting the “males” and “females” components of the source to match the target. [sent-221, score-0.58]
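The reweighting in this example amounts to scaling each source subpopulation by the ratio of its target and source proportions; a toy sketch with illustrative names and numbers:

```python
import numpy as np

def match_target_weights(groups, src_props, tgt_props):
    """Weight each source instance by (target proportion)/(source proportion)
    of its group, so the weighted source mixture matches the target."""
    ratio = {g: tgt_props[g] / src_props[g] for g in src_props}
    return np.array([ratio[g] for g in groups])

# e.g. source is 50/50 males/females but the target is 80/20:
w = match_target_weights(["m", "f", "m"],
                         src_props={"m": 0.5, "f": 0.5},
                         tgt_props={"m": 0.8, "f": 0.2})
```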
86 While we do not explicitly address the relationship in this paper, we note that domain adaptation is closely related to the setting of covariate shift, which has been studied in statistics. [sent-224, score-0.373]
87 [9] considered weighting instances using a transfer-aware variant of boosting, but the learning bounds they give are no stronger than bounds which completely ignore the source data. [sent-231, score-0.832]
88 [8] consider learning when the marginal distribution on instances is the same across sources but the labeling function may change. [sent-233, score-0.288]
89 They consider only including or discarding a source entirely. [sent-236, score-0.478]
90 They place a source-centered prior on the parameters of a model learned in the target domain. [sent-238, score-0.363]
91 de/projects/different06/) contains a good set of references on domain adaptation and related topics. [sent-242, score-0.373]
92 Like our model, the divergence prior also emphasizes the tradeoff between source and target. [sent-243, score-0.549]
93 In our model, though, we measure the divergence (and consequently the bias) of the source domain from unlabeled data. [sent-244, score-0.808]
94 This allows us to choose the best tradeoff between source and target labeled data. [sent-245, score-0.958]
95 7 Conclusion In this work we investigate the task of domain adaptation when we have a large amount of training data from a source domain but wish to apply a model in a target domain with a much smaller amount of training data. [sent-246, score-1.836]
96 Our main result is a uniform convergence learning bound for algorithms which minimize convex combinations of source and target empirical risk. [sent-247, score-1.12]
97 Our bound reflects the trade-off between the size of the source data and the accuracy of the target data, and we give a simple approximation to it that is computable from finite labeled and unlabeled samples. [sent-248, score-1.09]
98 Our theory also extends in a straightforward manner to a multi-source setting, which we believe helps to explain the success of recent empirical work in domain adaptation. [sent-250, score-0.274]
99 We also plan to investigate algorithms that choose a convex combination of multiple sources to minimize the bound in Theorem 3. [sent-253, score-0.325]
100 Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. [sent-290, score-0.319]
wordName wordTfidf (topN-words)
[('source', 0.478), ('target', 0.363), ('dh', 0.303), ('mt', 0.217), ('domain', 0.206), ('hypothesis', 0.169), ('adaptation', 0.167), ('risk', 0.159), ('sentiment', 0.152), ('ut', 0.131), ('ds', 0.131), ('instances', 0.123), ('ms', 0.121), ('dt', 0.107), ('bound', 0.105), ('weighting', 0.102), ('sources', 0.094), ('unlabeled', 0.089), ('crammer', 0.087), ('females', 0.087), ('fs', 0.08), ('separator', 0.076), ('blitzer', 0.076), ('dvd', 0.076), ('males', 0.076), ('theorem', 0.071), ('labeling', 0.071), ('domains', 0.07), ('ft', 0.07), ('ideal', 0.07), ('empirical', 0.068), ('classi', 0.066), ('vc', 0.065), ('ah', 0.065), ('dist', 0.065), ('distance', 0.063), ('risks', 0.061), ('er', 0.061), ('training', 0.06), ('labeled', 0.055), ('genres', 0.052), ('mj', 0.051), ('bounds', 0.046), ('sj', 0.044), ('error', 0.044), ('apparel', 0.044), ('argminh', 0.044), ('jiang', 0.044), ('vary', 0.043), ('books', 0.04), ('plots', 0.039), ('convex', 0.038), ('kitchen', 0.038), ('dai', 0.038), ('prd', 0.038), ('electronics', 0.038), ('drawn', 0.037), ('ignore', 0.037), ('minimizes', 0.036), ('tradeoff', 0.036), ('curves', 0.036), ('minimize', 0.035), ('divergence', 0.035), ('instance', 0.034), ('appendix', 0.034), ('uniform', 0.033), ('proof', 0.033), ('lemma', 0.033), ('wish', 0.032), ('hinge', 0.032), ('rademacher', 0.032), ('loss', 0.031), ('stars', 0.03), ('combination', 0.03), ('ss', 0.029), ('depict', 0.029), ('phase', 0.029), ('amount', 0.029), ('weighted', 0.028), ('hypotheses', 0.028), ('speaker', 0.028), ('acl', 0.028), ('bickel', 0.028), ('darpa', 0.027), ('us', 0.026), ('poorly', 0.026), ('triangle', 0.026), ('hope', 0.025), ('formalize', 0.025), ('huang', 0.025), ('transition', 0.025), ('samples', 0.024), ('distant', 0.024), ('language', 0.024), ('lemmas', 0.023), ('inequality', 0.023), ('multiple', 0.023), ('log', 0.022), ('minimizer', 0.022), ('mixture', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
2 0.13133904 129 nips-2007-Mining Internet-Scale Software Repositories
Author: Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre F. Baldi
Abstract: Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised author-topic, probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.84– roughly 10-30% better than previous approaches based on text alone. Supplementary material may be found at: http://sourcerer.ics.uci.edu/nips2007/nips07.html. 1
3 0.12185959 69 nips-2007-Discriminative Batch Mode Active Learning
Author: Yuhong Guo, Dale Schuurmans
Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1
4 0.11942774 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
Author: Yuanqing Lin, Jingdong Chen, Youngmoo Kim, Daniel D. Lee
Abstract: Speech dereverberation remains an open problem after more than three decades of research. The most challenging step in speech dereverberation is blind channel identification (BCI). Although many BCI approaches have been developed, their performance is still far from satisfactory for practical applications. The main difficulty in BCI lies in finding an appropriate acoustic model, which not only can effectively resolve solution degeneracies due to the lack of knowledge of the source, but also robustly models real acoustic environments. This paper proposes a sparse acoustic room impulse response (RIR) model for BCI, that is, an acoustic RIR can be modeled by a sparse FIR filter. Under this model, we show how to formulate the BCI of a single-input multiple-output (SIMO) system into a l1 norm regularized least squares (LS) problem, which is convex and can be solved efficiently with guaranteed global convergence. The sparseness of solutions is controlled by l1 -norm regularization parameters. We propose a sparse learning scheme that infers the optimal l1 -norm regularization parameters directly from microphone observations under a Bayesian framework. Our results show that the proposed approach is effective and robust, and it yields source estimates in real acoustic environments with high fidelity to anechoic chamber measurements.
5 0.11845171 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to · 1 . This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . 1
6 0.089685321 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
7 0.088329576 147 nips-2007-One-Pass Boosting
8 0.087969199 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
9 0.084324785 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
10 0.080365978 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
11 0.079372071 84 nips-2007-Expectation Maximization and Posterior Constraints
12 0.078313388 136 nips-2007-Multiple-Instance Active Learning
13 0.07245294 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
14 0.071128726 62 nips-2007-Convex Learning with Invariances
15 0.070611678 74 nips-2007-EEG-Based Brain-Computer Interaction: Improved Accuracy by Automatic Single-Trial Error Detection
16 0.068336651 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect
17 0.067878082 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
18 0.067641608 159 nips-2007-Progressive mixture rules are deviation suboptimal
19 0.065594152 187 nips-2007-Structured Learning with Approximate Inference
20 0.065441623 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
topicId topicWeight
[(0, -0.22), (1, -0.005), (2, -0.066), (3, 0.076), (4, 0.075), (5, 0.057), (6, 0.125), (7, -0.047), (8, 0.03), (9, -0.035), (10, 0.074), (11, -0.008), (12, -0.125), (13, -0.105), (14, 0.023), (15, -0.118), (16, 0.075), (17, 0.005), (18, 0.034), (19, -0.104), (20, -0.024), (21, -0.089), (22, -0.004), (23, -0.135), (24, 0.007), (25, -0.034), (26, 0.115), (27, -0.031), (28, 0.085), (29, -0.224), (30, -0.075), (31, -0.196), (32, -0.094), (33, -0.01), (34, 0.063), (35, 0.144), (36, -0.115), (37, 0.058), (38, -0.158), (39, 0.063), (40, -0.178), (41, -0.012), (42, -0.015), (43, -0.048), (44, -0.006), (45, -0.004), (46, -0.151), (47, -0.108), (48, 0.026), (49, -0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.97829133 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
2 0.66270047 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to · 1 . This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . 1
3 0.64581919 129 nips-2007-Mining Internet-Scale Software Repositories
Author: Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre F. Baldi
Abstract: Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised author-topic, probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.84– roughly 10-30% better than previous approaches based on text alone. Supplementary material may be found at: http://sourcerer.ics.uci.edu/nips2007/nips07.html. 1
4 0.57947826 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
Author: Yuanqing Lin, Jingdong Chen, Youngmoo Kim, Daniel D. Lee
Abstract: Speech dereverberation remains an open problem after more than three decades of research. The most challenging step in speech dereverberation is blind channel identification (BCI). Although many BCI approaches have been developed, their performance is still far from satisfactory for practical applications. The main difficulty in BCI lies in finding an appropriate acoustic model, which not only can effectively resolve solution degeneracies due to the lack of knowledge of the source, but also robustly models real acoustic environments. This paper proposes a sparse acoustic room impulse response (RIR) model for BCI, that is, an acoustic RIR can be modeled by a sparse FIR filter. Under this model, we show how to formulate the BCI of a single-input multiple-output (SIMO) system into a l1 norm regularized least squares (LS) problem, which is convex and can be solved efficiently with guaranteed global convergence. The sparseness of solutions is controlled by l1 -norm regularization parameters. We propose a sparse learning scheme that infers the optimal l1 -norm regularization parameters directly from microphone observations under a Bayesian framework. Our results show that the proposed approach is effective and robust, and it yields source estimates in real acoustic environments with high fidelity to anechoic chamber measurements.
5 0.4977068 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
Author: Andrea Lecchini-visintini, John Lygeros, Jan Maciejowski
Abstract: Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory. Optimization is the general problem of finding a value of a vector of variables θ that maximizes (or minimizes) some scalar criterion U (θ). The set of all possible values of the vector θ is called the optimization domain. The elements of θ can be discrete or continuous variables. In the first case the optimization domain is usually finite, such as in the well-known traveling salesman problem; in the second case the optimization domain is a continuous set. An important example of a continuous optimization domain is the set of 3-D configurations of a sequence of amino-acids in the problem of finding the minimum energy folding of the corresponding protein [1]. In principle, any optimization problem on a finite domain can be solved by an exhaustive search. However, this is often beyond computational capacity: the optimization domain of the traveling salesman problem with 100 cities contains more than 10155 possible tours. An efficient algorithm to solve the traveling salesman and many similar problems has not yet been found and such problems remain reliably solvable only in principle [2]. Statistical mechanics has inspired widely used methods for finding good approximate solutions in hard discrete optimization problems which defy efficient exact solutions [3, 4, 5, 6]. Here a key idea has been that of simulated annealing [3]: a random search based on the Metropolis-Hastings algorithm, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers. Convergence and finite-time performance of simulated annealing on finite domains has been evaluated in many works, e.g. [7, 8, 9, 10]. On continuous domains, most popular optimization methods perform a local gradient-based search and in general converge to local optimizers; with the notable exception of convex criteria where convergence to the unique global optimizer occurs [11]. Simulated annealing performs a global search and can be easily implemented on continuous domains. Hence it can be considered a powerful complement to local methods. In this paper, we introduce for the first time rigorous guarantees on the finite-time performance of simulated annealing on continuous domains. We will show that it is possible to derive simulated annealing algorithms which, with an arbitrarily high level of confidence, find an approximate solution to the problem of optimizing a function of continuous variables, within a specified tolerance to the global optimal solution after a known finite number of steps. 
Rigorous guarantees on the finite-time performance of simulated annealing in the optimization of functions of continuous variables have never been obtained before; the only results available state that simulated annealing converges to a global optimizer as the number of steps grows to infinity, e.g. [12, 13, 14, 15]. The background of our work is twofold. On the one hand, our notion of approximate solution to a global optimization problem is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory [16, 17]. We actually maintain an important aspect of statistical learning theory which is that we do not introduce any particular assumption on the optimization criterion, i.e. our results hold regardless of what U is. On the other hand, we ground our results on the theory of convergence, with quantitative bounds on the distance to the target distribution, of the Metropolis-Hastings algorithm and Markov Chain Monte Carlo (MCMC) methods, which has been one of the main achievements of recent research in statistics [18, 19, 20, 21]. In this paper, we will not develop any ready-to-use optimization algorithm. We will instead introduce a general formulation of the simulated annealing method which allows one to derive new simulated annealing algorithms with rigorous finite-time guarantees on the basis of existing theory. The Metropolis-Hastings algorithm and the general family of MCMC methods have many degrees of freedom. The choice and comparison of specific algorithms goes beyond the scope of the paper. The paper is organized in the following sections. In Simulated annealing we introduce the method and fix the notation. In Convergence we recall the reasons why finite-time guarantees for simulated annealing on continuous domains have not been obtained before. In Finite-time guarantees we present the main result of the paper. In Conclusions we state our findings and conclude the paper. 1 Simulated annealing The original formulation of simulated annealing was inspired by the analogy between the stochastic evolution of the thermodynamic state of an annealing material towards the configurations of minimal energy and the search for the global minimum of an optimization criterion [3]. In the procedure, the optimization criterion plays the role of the energy and the state of the annealed material is simulated by the evolution of the state of an inhomogeneous Markov chain. The state of the chain evolves according to the Metropolis-Hastings algorithm in order to simulate the Boltzmann distribution of thermodynamic equilibrium. The Boltzmann distribution is simulated for a decreasing sequence of temperatures (“cooling”). The target distribution of the cooling procedure is the limiting Boltzmann distribution, for the temperature that tends to zero, which takes non-zero values only on the set of global minimizers [7]. The original formulation of the method was for a finite domain. However, simulated annealing can be generalized straightforwardly to a continuous domain because the Metropolis-Hastings algorithm can be used with almost no differences on discrete and continuous domains The main difference is that on a continuous domain the equilibrium distributions are specified by probability densities. 
On a continuous domain, Markov transition kernels in which the distribution of the elements visited by the chain converges to an equilibrium distribution with the desired density can be constructed using the Metropolis-Hastings algorithm and the general family of MCMC methods [22]. We point out that Boltzmann distributions are not the only distributions which can be adopted as equilibrium distributions in simulated annealing [7]. In this paper it is convenient for us to adopt a different type of equilibrium distribution in place of Boltzmann distributions. 1.1 Our setting The optimization criterion is U : Θ → [0, 1], with Θ ⊂ RN . The assumption that U takes values in the interval [0, 1] is a technical one. It does not imply any serious loss of generality. In general, any bounded optimization criterion can be scaled to take values in [0, 1]. We assume that the optimization task is to find a global maximizer; this can be done without loss of generality. We also assume that Θ is a bounded set. We consider equilibrium distributions defined by probability density functions proportional to [U (θ) + δ]J where J and δ are two strictly positive parameters. We use π (J) to denote an equilibrium distribution, i.e. π (J) (dθ) ∝ [U (θ) + δ]J πLeb (dθ) where πLeb is the standard Lebesgue measure. Here, J −1 plays the role of the temperature: if the function U (θ) plus δ is taken to a positive power J then as J increases (i.e. as J −1 decreases) [U (θ) + δ]J becomes increasingly peaked around the global maximizers. The parameter δ is an offset which guarantees that the equilibrium densities are always strictly positive, even if U takes zero values on some elements of the domain. The offset δ is chosen by the user and we show later that our results allow one to make an optimal selection of δ. The zero-temperature distribution is the limiting distribution, for J → ∞, which takes non-zero values only on the set of global maximizers. It is denoted by π (∞) . In the generic formulation of the method, the Markov transition kernel of the k-th step of the inhomogeneous chain has equilibrium distribution π (Jk ) where {Jk }k=1,2,... is the “cooling schedule”. The cooling schedule is a non-decreasing sequence of positive numbers according to which the equilibrium distribution become increasingly sharpened during the evolution of the chain. We use θk to denote the state of the chain and Pθk to denote its probability distribution. The distribution Pθk obviously depends on the initial condition θ0 . However, in this work, we don’t need to make this dependence explicit in the notation. Remark 1: If, given an element θ in Θ, the value U (θ) can be computed directly, we say that U is a deterministic criterion, e.g. the energy landscape in protein structure prediction [1]. In problems involving random variables, the value U (θ) may be the expected value U (θ) = g(x, θ)px (x; θ)dx of some function g which depends on both the optimization variable θ, and on some random variable x which has probability density px (x; θ) (which may itself depend on θ). In such problems it is usually not possible to compute U (θ) directly, either because evaluation of the integral requires too much computation, or because no analytical expression for px (x; θ) is available. Typically one must perform stochastic simulations in order to obtain samples of x for a given θ, hence obtain sample values of g(x, θ), and thus construct a Monte Carlo estimate of U (θ). 
The Bayesian design of clinical trials is an important application area where such expected-value criteria arise [23]. The authors of this paper investigate the optimization of expected-value criteria motivated by problems of aircraft routing [24]. In the particular case that px (x; θ) does not depend on θ, the optimization task is often called “empirical risk minimization”, and is studied extensively in statistical learning theory [16, 17]. The results of this paper apply in the same way to the optimization of both deterministic and expected-value criteria. The MCMC method developed by M¨ ller [25, 26] allows one u to construct simulated annealing algorithms for the optimization of expected-value criteria. M¨ ller u [25, 26] employs the same equilibrium distributions as those described in our setting; in his context J is restricted to integer values. 2 Convergence The rationale of simulated annealing is as follows: if the temperature is kept constant, say Jk = J, then the distribution of the state of the chain Pθk tends to the equilibrium distribution π (J) ; if J → ∞ then the equilibrium distribution π (J) tends to the zero-temperature distribution π (∞) ; as a result, if the cooling schedule Jk tends to infinity, one obtains that Pθk “follows” π (Jk ) and that π (Jk ) tends to π (∞) and eventually that the distribution of the state of the chain Pθk tends to π (∞) . The theory shows that, under conditions on the cooling schedule and the Markov transition kernels, the distribution of the state of the chain Pθk actually converges to the target zero-temperature distribution π (∞) as k → ∞ [12, 13, 14, 15]. Convergence to the zero-temperature distribution implies that asymptotically the state of the chain eventually coincides with a global optimizer with probability one. The difficulty which must be overcome in order to obtain finite step results on simulated annealing algorithms on a continuous domain is that usually, in an optimization problem defined over continuous variables, the set of global optimizers has zero Lebesgue measure (e.g. a set of isolated points). If the set of global optimizers has zero measure then the set of global optimizers has null probability according to the equilibrium distributions π (J) for any finite J and, as a consequence, according to the distributions Pθk for any finite k. Put another way, the probability that the state of the chain visits the set of global optimizers is constantly zero after any finite number of steps. Hence the confidence of the fact that the solution provided by the algorithm in finite time coincides with a global optimizer is also constantly zero. Notice that this is not the case for a finite domain, where the set of global optimizers is of non-null measure with respect to the reference counting measure [7, 8, 9, 10]. It is instructive to look at the issue also in terms of the rate of convergence to the target zerotemperature distribution. On a discrete domain, the distribution of the state of the chain at each step and the zero-temperature distribution are both standard discrete distributions. It is then possible to define a distance between them and study the rate of convergence of this distance to zero. This analysis allows one to obtain results on the finite-time behavior of simulated annealing [7, 8]. On a continuous domain and for a set of global optimizers of measure zero, the target zero-temperature distribution π (∞) ends up being a mixture of probability masses on the set of global optimizers. 
In this situation, although the distribution of the state of the chain Pθk still converges asymptotically to π (∞) , it is not possible to introduce a sensible distance between the two distributions and a rate of convergence to the target distribution cannot even be defined (weak convergence), see [12, Theorem 3.3]. This is the reason that until now there have been no guarantees on the performance of simulated annealing on a continuous domain after a finite number of computations: by adopting the zero-temperature distribution π (∞) as the target distribution it is only possible to prove asymptotic convergence in infinite time to a global optimizer. Remark 2: The standard distance between two distributions, say µ1 and µ2 , on a continuous support is the total variation norm µ1 − µ2 T V = supA |µ1 (A) − µ2 (A)|, see e.g. [21]. In simulated annealing on a continuous domain the distribution of the state of the chain Pθk is absolutely continuous with respect to the Lebesgue measure (i.e. πLeb (A) = 0 ⇒ Pθk (A) = 0), by construction for any finite k. Hence if the set of global optimizers has zero Lebesgue measure then it has zero measure also according to Pθk . The set of global optimizers has however measure 1 according to π (∞) . The distance Pθk − π (∞) T V is then constantly 1 for any finite k. It is also worth mentioning that if the set of global optimizers has zero measure then asymptotic convergence to the zero-temperature distribution π (∞) can be proven only under the additional assumptions of continuity and differentiability of U [12, 13, 14, 15]. 3 Finite-time guarantees In general, optimization algorithms for problems defined on continuous variables can only find approximate solutions in finite time [27]. Given an element θ of a continuous domain how can we assess how good it is as an approximate solution to an optimization problem? Here we introduce the concept of approximate global optimizer to answer this question. The definition is given for a maximization problem in a continuous but bounded domain. We use two parameters: the value imprecision ǫ (greater than or equal to 0) and the residual domain α (between 0 and 1) which together determine the level of approximation. We say that θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if the function U takes values strictly greater than U (θ) + ǫ only on a subset of values of θ no larger than an α portion of the optimization domain. The formal definition is as follows. Definition 1 Let U : Θ → R be an optimization criterion where Θ ⊂ RN is bounded. Let πLeb denote the standard Lebesgue measure. Let ǫ ≥ 0 and α ∈ [0, 1] be given numbers. Then θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ) . In other words, the value U (θ) is within ǫ of a value which is greater than the values that U takes on at least a 1 − α portion of the domain. The smaller ǫ and α are, the better is the approximation of a true global optimizer. If both α and ǫ are equal to zero then U (θ) coincides with the essential supremum of U . Our definition of approximate global optimizer carries an important property, which holds regardless of what the criterion U is: if ǫ and α have non-zero values then the set of approximate global optimizers always has non-zero Lebesgue measure. It follows that the probability that the chain visits the set of approximate global optimizers can be non-zero. 
Hence, it is sensible to study the confidence of the fact that the solution found by simulated annealing in finite time is an approximate global optimizer. Remark 3: The intuition that our notion of approximate global optimizer can be used to obtain formal guarantees on the finite-time performance of optimization methods based on a stochastic search of the domain is already apparent in the work of Vidyasagar [17, 28]. Vidyasagar [17, 28] introduces a similar definition and obtains rigorous finite-time guarantees in the optimization of ex- pected value criteria based on uniform independent sampling of the domain. Notably, the number of independent samples required to guarantee some desired accuracy and confidence turns out to be polynomial in the values of the desired imprecision, residual domain and confidence. Although the method of Vidyasagar is not highly sophisticated, it has had considerable success in solving difficult control system design applications [28, 29]. Its appeal stems from its rigorous finite-time guarantees which exist without the need for any particular assumption on the optimization criterion. Here we show that finite-time guarantees for simulated annealing can be obtained by selecting a distribution π (J) with a finite J as the target distribution in place of the zero-temperature distribution π (∞) . The fundamental result is the following theorem which allows one to select in a rigorous way δ and J in the target distribution π (J) . It is important to stress that the result holds universally for any optimization criterion U on a bounded domain. The only minor requirement is that U takes values in [0, 1]. Theorem 1 Let U : Θ → [0, 1] be an optimization criterion where Θ ⊂ RN is bounded. Let J ≥ 1 and δ > 0 be given numbers. Let θ be a multivariate random variable with distribution π (J) (dθ) ∝ [U (θ) + δ]J πLeb (dθ). Let α ∈ (0, 1] and ǫ ∈ [0, 1] be given numbers and define σ= 1 1+δ 1+ ǫ+1+δ J 1 1+δ 1+δ −1 α ǫ+δ δ . (1) Then the statement “θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ. Proof. See Appendix A. The importance of the choice of a target distribution π (J) with a finite J is that π (J) is absolutely continuous with respect to the Lebesgue measure. Hence, the distance Pθk − π (J) TV between the distribution of the state of the chain Pθk and the target distribution π (J) is a meaningful quantity. Convergence of the Metropolis-Hastings algorithm and MCMC methods in total variation norm is a well studied problem. The theory provides simple conditions under which one derives upper bounds on the distance to the target distribution which are known at each step of the chain and decrease monotonically to zero as the number of steps of the chain grows. The theory has been developed mainly for homogeneous chains [18, 19, 20, 21]. In the case of simulated annealing, the factor that enables us to employ these results is the absolute continuity of the target distribution π (J) with respect to the Lebesgue measure. However, simulated annealing involves the simulation of inhomogeneous chains. In this respect, another important fact is that the choice of a target distribution π (J) with a finite J implies that the inhomogeneous Markov chain can in fact be formed by a finite sequence of homogeneous chains (i.e. the cooling schedule {Jk }k=1,2,... can be chosen to be a sequence that takes only a finite set of values). 
In turn, this allows one to apply the theory of homogeneous MCMC methods to study the convergence of Pθk to π (J) in total variation norm. On a bounded domain, simple conditions on the ‘proposal distribution’ in the iteration of the simulated annealing algorithm allows one to obtain upper bounds on Pθk − π (J) TV that decrease geometrically to zero as k → ∞, without the need for any additional assumption on U [18, 19, 20, 21]. It is then appropriate to introduce the following finite-time result. Theorem 2 Let the notation and assumptions of Theorem 1 hold. Let θk , with distribution Pθk , be the state of the inhomogeneous chain of a simulated annealing algorithm with target distribution π (J) . Then the statement “θk is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ − Pθk − π (J) TV . The proof of the theorem follows directly from the definition of the total variation norm. It follows that if simulated annealing is implemented with an algorithm which converges in total variation distance to a target distribution π (J) with a finite J, then one can state with confidence arbitrarily close to 1 that the solution found by the algorithm after the known appropriate finite number of steps is an approximate global optimizer with the desired approximation level. For given non-zero values of ǫ, α the value of σ given by (1) can be made arbitrarily close to 1 by choice of J; while the distance Pθk − π (J) TV can be made arbitrarily small by taking the known sufficient number of steps. It can be shown that there exists the possibility of making an optimal choice of δ and J in the target distribution π (J) . In fact, for given ǫ and α and a given value of J there exists an optimal choice of δ which maximizes the value of σ given by (1). Hence, it is possible to obtain a desired σ with the smallest possible J. The advantage of choosing the smallest J, consistent with the required approximation and confidence, is that it will decrease the number of steps required to achieve the desired reduction of Pθk − π (J) TV . 4 Conclusions We have introduced a new formulation of simulated annealing which admits rigorous finite-time guarantees in the optimization of functions of continuous variables. First, we have introduced the notion of approximate global optimizer. Then, we have shown that simulated annealing is guaranteed to find approximate global optimizers, with the desired confidence and the desired level of accuracy, in a known finite number of steps, if a proper choice of the target distribution is made and conditions for convergence in total variation norm are met. The results hold for any optimization criterion on a bounded domain with the only minor requirement that it takes values between 0 and 1. In this framework, simulated annealing algorithms with rigorous finite-time guarantees can be derived by studying the choice of the proposal distribution and of the cooling schedule, in the generic iteration of simulated annealing, in order to ensure convergence to the target distribution in total variation norm. To do this, existing theory of convergence of the Metropolis-Hastings algorithm and MCMC methods on continuous domains can be used [18, 19, 20, 21]. 
Vidyasagar [17, 28] has introduced a similar definition of approximate global optimizer and has shown that approximate optimizers with desired accuracy and confidence can be obtained with a number of uniform independent samples of the domain which is polynomial in the accuracy and confidence parameters. In general, algorithms developed with the MCMC methodology can be expected to be equally or more efficient than uniform independent sampling. Acknowledgments Work supported by EPSRC, Grant EP/C014006/1, and by the European Commission under projects HYGEIA FP6-NEST-4995 and iFly FP6-TREN-037180. We thank S. Brooks, M. Vidyasagar and D. M. Wolpert for discussions and useful comments on the paper. A Proof of Theorem 1 Let α ∈ (0, 1] and ρ ∈ (0, 1] be given numbers. Let Uδ (θ) := U (θ) + δ. Let πδ be a normalized ¯ measure such that πδ (dθ) ∝ Uδ (θ)πLeb (dθ). In the first part of the proof we find a lower bound on the probability that θ belongs to the set {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} . ¯ Let yα := inf{y : πδ {θ ∈ Θ : Uδ (θ) ≤ y} ≥ 1 − α}. To start with we show that the set ¯ ¯ {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} coincides with {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Notice ¯ ¯ that the quantity πδ {θ ∈ Θ : Uδ (θ) ≤ y} is a right-continuous non-decreasing function of y because it has the form of a distribution function (see e.g. [30, p.162] and [17, Lemma 11.1]). Therefore we have πδ {θ ∈ Θ : Uδ (θ) ≤ yα } ≥ 1 − α and ¯ ¯ y ≥ ρ yα ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) ≤ y} ≥ 1 − α ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} ≤ α . ¯ ¯ ¯ Moreover, y < ρ yα ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) ≤ y} < 1 − α ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} > α ¯ ¯ ¯ and taking the contrapositive one obtains πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} ≤ α ⇒ y ≥ ρ yα . ¯ ¯ ′ Therefore {θ ∈ Θ : Uδ (θ) ≥ ρ yα } ≡ {θ ∈ Θ : πδ {θ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α}. ¯ ¯ We now derive a lower bound on π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Let us introduce the notation ¯ ¯¯ Aα := {θ ∈ Θ : Uδ (θ) < yα }, Aα := {θ ∈ Θ : Uδ (θ) ≥ yα }, Bα,ρ := {θ ∈ Θ : Uδ (θ) < ρ yα } ¯ ¯ ¯ ¯ ¯ ¯α,ρ := {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Notice that Bα,ρ ⊆ Aα and Aα ⊆ Bα,ρ . The quantity ¯¯ ¯¯ and B ¯ ¯ ¯ ¯ πδ {θ ∈ Θ : Uδ (θ) < y} as a function of y is the left-continuous version of πδ {θ ∈ Θ : Uδ (θ) ≤ ¯¯ y}[30, p.162]. Hence, the definition of yα implies πδ (Aα ) ≤ 1 − α and πδ (Aα ) ≥ α. Notice that ¯ ¯ ¯ ¯ δπLeb (Aα ) ¯ ≤ 1−α, ¯ πδ (Aα ) ≤ 1 − α ⇒ ¯ ¯ Uδ (θ)πLeb (dθ) Θ ¯¯ πδ (Aα ) ≥ α ¯ ¯¯ (1 + δ)πLeb (Aα ) ≥ α. ¯ U (θ)πLeb (dθ) Θ δ ⇒ ¯¯ Hence, πLeb (Aα ) > 0 and πLeb (Aα ) 1−α1+δ ¯ ¯ . ¯α ) ≤ α ¯ δ πLeb (A ¯ ¯¯ ¯¯ Notice that πLeb (Aα ) > 0 implies πLeb (Bα,ρ ) > 0. We obtain π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yα } = ¯ 1+ 1 ≥ Uδ (θ)J πLeb (dθ) Bα,ρ ¯ ¯¯ Bα,ρ ≥ 1+ J ρ J yα ¯ J yα ¯ Uδ (θ)J πLeb (dθ) 1+ 1 Uδ (θ)J πLeb (dθ) Bα,ρ ¯ ¯¯ Aα Uδ (θ)J πLeb (dθ) 1 1 1 ≥ ≥ . 1−α1+δ ¯ πLeb (Aα ) πLeb (Bα,ρ ) ¯ ¯ 1 + ρJ 1 + ρJ ¯¯ ¯¯ α ¯ δ πLeb (Aα ) πLeb (Aα ) Since {θ ∈ Θ : Uδ (θ) ≥ ρ yα } ≡ {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} the first part of ¯ ¯ the proof is complete. In the second part of the proof we show that the set {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} is contained in the set of approximate global optimizers of U with value imprecision ¯ ǫ := (ρ−1 − 1)(1 + δ) and residual domain α := 1+δ α. Hence, we show that {θ ∈ Θ : πδ {θ′ ∈ ˜ ˜ ǫ+δ ¯ ˜ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} ⊆ {θ ∈ Θ : πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ)}. 
We have

U(θ′) > U(θ) + ǫ̃ ⇔ ρ Uδ(θ′) > ρ [Uδ(θ) + ǫ̃] ⇒ ρ Uδ(θ′) > Uδ(θ),

where the last implication is proven by noticing that ρ [Uδ(θ) + ǫ̃] ≥ Uδ(θ) ⇔ 1 − ρ ≥ U(θ)(1 − ρ) and U(θ) ∈ [0, 1]. Hence,

{θ′ ∈ Θ : ρ Uδ(θ′) > Uδ(θ)} ⊇ {θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃}.

Therefore πδ{θ′ ∈ Θ : ρ Uδ(θ′) > Uδ(θ)} ≤ ᾱ ⇒ πδ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ ᾱ. Let Qθ,ǫ̃ := {θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} and notice that

πδ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} = (∫Qθ,ǫ̃ U(θ′) πLeb(dθ′) + δ πLeb(Qθ,ǫ̃)) / (∫Θ U(θ′) πLeb(dθ′) + δ πLeb(Θ)).

We obtain

πδ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ ᾱ ⇒ ǫ̃ πLeb(Qθ,ǫ̃) + δ πLeb(Qθ,ǫ̃) ≤ ᾱ (1 + δ) πLeb(Θ) ⇒ πLeb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ πLeb(Θ).

Hence we can conclude that πδ{θ′ ∈ Θ : ρ Uδ(θ′) > Uδ(θ)} ≤ ᾱ ⇒ πLeb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ πLeb(Θ), and the second part of the proof is complete.

We have shown that, given ᾱ ∈ (0, 1] and ρ ∈ (0, 1], with ǫ̃ := (ρ⁻¹ − 1)(1 + δ), α̃ := ((1 + δ)/(ǫ̃ + δ)) ᾱ and

σ := 1 / (1 + ρ^J ((1 − ᾱ)/ᾱ)((1 + δ)/δ)) = 1 / (1 + ((1 + δ)/(ǫ̃ + 1 + δ))^J ((1 + δ)/(α̃ (ǫ̃ + δ)) − 1)(1 + δ)/δ),

the statement "θ is an approximate global optimizer of U with value imprecision ǫ̃ and residual domain α̃" holds with probability at least σ. Notice that ǫ̃ ∈ [0, 1] and α̃ ∈ (0, 1] are linked through a bijective relation to ρ ∈ [(1 + δ)/(2 + δ), 1] and ᾱ ∈ (0, (ǫ̃ + δ)/(1 + δ)]. The statement of the theorem is eventually obtained by expressing σ as a function of the desired ǫ = ǫ̃ and α = α̃.
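The bound lends itself to a direct numerical check. The sketch below, our own illustration rather than anything from the paper, draws approximately from π(J) ∝ (U + δ)^J on a one-dimensional toy problem by importance resampling and compares the empirical probability that a draw is an approximate global optimizer against σ. The toy criterion U, the parameter values, and the resampling scheme are all assumptions made for the example, and the estimate is crude because the weights concentrate sharply for large J.

```python
import numpy as np

rng = np.random.default_rng(0)

def U(th):
    # Toy criterion on Theta = [0, 1] with values in [0, 1], as required.
    return 0.5 + 0.4 * np.sin(6 * np.pi * th) * np.exp(-3.0 * th)

eps, alpha, delta, J = 0.05, 0.05, 0.5, 200

# Theoretical lower bound sigma (same closed form as in the earlier sketch).
rho_term = ((1 + delta) / (eps + 1 + delta)) ** J
alpha_term = ((1 + delta) / (alpha * (eps + delta)) - 1) * (1 + delta) / delta
sigma = 1.0 / (1.0 + rho_term * alpha_term)

# Draw from pi^(J)(dtheta) proportional to (U(theta) + delta)^J by
# self-normalized importance resampling of uniform proposals; working with
# log-weights avoids overflow for large J.
base = rng.uniform(0.0, 1.0, size=200_000)
logw = J * np.log(U(base) + delta)
w = np.exp(logw - logw.max())
draws = rng.choice(base, size=2_000, p=w / w.sum())

# theta is an approximate global optimizer with value imprecision eps and
# residual domain alpha if Leb{U > U(theta) + eps} <= alpha * Leb(Theta).
grid = np.linspace(0.0, 1.0, 20_001)
Ugrid = U(grid)
success = np.array([np.mean(Ugrid > U(t) + eps) <= alpha for t in draws])
print("empirical success probability:", success.mean())
print("theoretical lower bound sigma:", sigma)
```

On this toy problem the empirical success probability should sit at or above the theoretical lower bound, consistent with the fact that σ bounds the probability under exact sampling from π(J).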
References

[1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003.
[2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759–764, 2005.
[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
[4] E. Bonomi and J. Lutton. The N-city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551–568, 1984.
[5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605–1620, 1986.
[6] M. Mézard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability Problems. Science, 297:812–815, 2002.
[7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987.
[8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747–771, 1986.
[9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311–329, 1988.
[10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199–216, 2006.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866–893, 1991.
[13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419–436, 1991.
[14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395–406, 1996.
[15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121–133, 2000.
[16] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, US, 1995.
[17] M. Vidyasagar. Learning and Generalization: With Applications to Neural Networks. Springer-Verlag, London, second edition, 2003.
[18] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[19] J. S. Rosenthal. Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo. J. Am. Stat. Assoc., 90(430):558–566, 1995.
[20] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat., 24(1):101–121, 1996.
[21] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Prob. Surv., 1:20–71, 2004.
[22] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition, 2004.
[23] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles. Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons, Chichester, UK, 2004.
[24] A. Lecchini-Visintini, W. Glover, J. Lygeros, and J. M. Maciejowski. Monte Carlo Optimization for Conflict Resolution in Air Traffic Control. IEEE Trans. Intell. Transp. Syst., 7(4):470–482, 2006.
[25] P. Müller. Simulation based optimal design. In J. O. Berger, J. M. Bernardo, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6: proceedings of the Sixth Valencia International Meeting, pages 459–474. Oxford: Clarendon Press, 1999.
[26] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by Inhomogeneous Markov Chain Simulation. J. Am. Stat. Assoc., 99(467):788–798, 2004.
[27] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer-Verlag, New York, 1998.
[28] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515–1528, 2001.
[29] R. Tempo, G. Calafiore, and F. Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems. Springer-Verlag, London, 2005.
[30] B. V. Gnedenko. Theory of Probability. Chelsea, New York, fourth edition, 1968.
6 0.47899571 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
7 0.46731311 159 nips-2007-Progressive mixture rules are deviation suboptimal
8 0.44055551 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect
9 0.42151466 150 nips-2007-Optimal models of sound localization by barn owls
10 0.42048597 15 nips-2007-A general agnostic active learning algorithm
11 0.41002601 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
12 0.39076787 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
13 0.38941213 69 nips-2007-Discriminative Batch Mode Active Learning
14 0.38317326 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
15 0.34697065 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
16 0.33753034 136 nips-2007-Multiple-Instance Active Learning
17 0.33669597 62 nips-2007-Convex Learning with Invariances
18 0.33525619 85 nips-2007-Experience-Guided Search: A Theory of Attentional Control
19 0.32647955 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
20 0.31847432 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
topicId topicWeight
[(5, 0.022), (13, 0.038), (16, 0.013), (18, 0.013), (21, 0.055), (34, 0.012), (35, 0.021), (47, 0.054), (49, 0.014), (83, 0.608), (85, 0.012), (87, 0.012), (90, 0.049)]
simIndex simValue paperId paperTitle
1 0.99840128 159 nips-2007-Progressive mixture rules are deviation suboptimal
Author: Jean-Yves Audibert
Abstract: We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies E R(ĝ) ≤ min_{g∈G} R(g) + Cst (log |G|)/n, (1) where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. 1
same-paper 2 0.99730355 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
3 0.99681515 109 nips-2007-Kernels on Attributed Pointsets with Applications
Author: Mehul Parsana, Sourangshu Bhattacharya, Chiru Bhattacharya, K. Ramakrishnan
Abstract: This paper introduces kernels on attributed pointsets, which are sets of vectors embedded in a Euclidean space. The embedding gives the notion of neighborhood, which is used to define positive semidefinite kernels on pointsets. Two novel kernels on neighborhoods are proposed, one evaluating the attribute similarity and the other evaluating shape similarity. The shape similarity function is motivated by spectral graph matching techniques. The kernels are tested on three real life applications: face recognition, photo album tagging, and shot annotation in video sequences, with encouraging results. 1
4 0.99650019 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
Author: Simon Osindero, Geoffrey E. Hinton
Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1
5 0.9950949 61 nips-2007-Convex Clustering with Exemplar-Based Models
Author: Danial Lashkari, Polina Golland
Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1
6 0.99377334 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
7 0.94627756 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
8 0.93449122 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
9 0.91923785 40 nips-2007-Bundle Methods for Machine Learning
10 0.91684288 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
11 0.91610676 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
12 0.91475797 116 nips-2007-Learning the structure of manifolds using random projections
13 0.91394275 21 nips-2007-Adaptive Online Gradient Descent
14 0.90887642 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
15 0.90253413 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
16 0.9021073 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
17 0.9019472 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
18 0.89908791 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
19 0.89808255 147 nips-2007-One-Pass Boosting
20 0.89762127 46 nips-2007-Cluster Stability for Finite Samples