nips nips2007 nips2007-110 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. [sent-4, score-1.434]
2 In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. [sent-5, score-1.091]
3 The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. [sent-6, score-1.007]
4 Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. [sent-7, score-1.704]
5 We have ample data drawn from a source domain to train a model, but little or no training data from the target domain where we wish to use the model [17, 3, 10, 5, 9]. [sent-9, score-1.382]
6 Those studies focus on the case where training data is drawn from a source domain and test data is drawn from a different target domain. [sent-15, score-1.181]
7 We generalize this approach to the case where we have some labeled data from the target domain in addition to a large amount of labeled source data. [sent-16, score-1.186]
8 Our main result is a uniform convergence bound on the true target risk of a model trained to minimize a convex combination of empirical source and target risks. [sent-17, score-1.672]
9 The bound describes an intuitive tradeoff between the quantity of the source data and the accuracy of the target data, and under relatively weak assumptions we can compute it from finite labeled and unlabeled samples of the source and target distributions. [sent-18, score-1.991]
10 We use the task of sentiment classification to demonstrate that our bound makes correct predictions about model error with respect to a distance measure between source and target domains and the number of training instances. [sent-19, score-1.335]
11 Several authors have empirically studied a special case of this in which each instance is weighted separately in the loss function, and instance weights are set to approximate the target domain distribution [10, 5, 9, 11]. [sent-21, score-0.696]
12 We give a uniform convergence bound for algorithms that minimize a convex combination of multiple empirical source risks and we show that these algorithms can outperform standard empirical risk minimization. [sent-22, score-1.063]
13 2 A Rigorous Model of Domain Adaptation We formalize domain adaptation for binary classification as follows. [sent-23, score-0.398]
14 A domain is a pair consisting of a distribution D on X and a labeling function f : X → [0, 1]. [sent-24, score-0.277]
15 1 Initially we consider two domains, a source domain DS , fS and a target domain DT , fT . [sent-25, score-1.253]
16 We write the empirical risk of a hypothesis on the source domain as $\hat{\epsilon}_S(h)$. [sent-29, score-1.08]
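The underlying risk definitions are elided by this extraction; consistent with the notation above (our rendering, following the paper's conventions), they are:

```latex
% Risk of h with respect to a labeling f under the source distribution:
\epsilon_S(h, f) = \mathbb{E}_{x \sim \mathcal{D}_S}\bigl[\,\lvert h(x) - f(x) \rvert\,\bigr],
\qquad \epsilon_S(h) = \epsilon_S(h, f_S),
% with \hat{\epsilon}_S(h) the corresponding empirical risk on the training sample.
```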
17 We use the parallel notation $\epsilon_T(h, f)$, $\epsilon_T(h)$, and $\hat{\epsilon}_T(h)$ for the target domain. [sent-30, score-0.363]
18 We measure the distance between two distributions $\mathcal{D}$ and $\mathcal{D}'$ using a hypothesis class-specific distance measure. [sent-31, score-0.295]
19 Let H be a hypothesis class for instance space X , and AH be the set of subsets of X that are the support of some hypothesis in H. [sent-32, score-0.372]
20 For a hypothesis space H, we define the symmetric difference hypothesis space H∆H as H∆H = {h(x) ⊕ h′ (x) : h, h′ ∈ H} , where ⊕ is the XOR operator. [sent-37, score-0.338]
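For reference, the induced distance used below has the form of the A-distance of Ben-David et al.; this rendering is our reconstruction, not text recovered from the extraction:

```latex
d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}, \mathcal{D}')
  = 2 \sup_{A \in \mathcal{A}_{\mathcal{H}\Delta\mathcal{H}}}
      \bigl\lvert \Pr_{\mathcal{D}}[A] - \Pr_{\mathcal{D}'}[A] \bigr\rvert .
```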
21 The ideal hypothesis minimizes the combined source and target risk: $h^* = \operatorname{argmin}_{h \in \mathcal{H}} \bigl[\epsilon_S(h) + \epsilon_T(h)\bigr]$. [sent-42, score-1.116]
22 We denote the combined risk of the ideal hypothesis by $\lambda = \epsilon_S(h^*) + \epsilon_T(h^*)$. [sent-43, score-0.398]
23 When the ideal hypothesis performs poorly, we cannot expect to learn a good target classifier by minimizing source error. [sent-45, score-1.08]
24 When the ideal hypothesis performs well, on the other hand, we can reasonably approximate target risk using source risk and the distance between DS and DT. [sent-47, score-1.222]
25 We illustrate the kind of result available in this setting with the following bound on the target risk in terms of the source risk, the difference between labeling functions fS and fT , and the distance between the distributions DS and DT . [sent-48, score-1.239]
26 1 This notion of domain is not the domain of a function. [sent-51, score-0.412]
27 Theorem 1 Let H be a hypothesis space of VC-dimension d and let $U_S$, $U_T$ be unlabeled samples of size $m'$ each, drawn from $\mathcal{D}_S$ and $\mathcal{D}_T$, respectively. [sent-54, score-0.319]
28 Let $\hat{d}_{\mathcal{H}\Delta\mathcal{H}}$ be the empirical distance on $U_S$, $U_T$ induced by the symmetric difference hypothesis space. [sent-55, score-0.3]
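The inequality of Theorem 1 did not survive the extraction. From the surrounding discussion and the published version it has the following form; treat the constants as indicative rather than authoritative:

```latex
% With probability at least 1 - \delta over the samples, for every h \in \mathcal{H}:
\epsilon_T(h) \le \epsilon_S(h)
  + \tfrac{1}{2}\,\hat{d}_{\mathcal{H}\Delta\mathcal{H}}(U_S, U_T)
  + 4\sqrt{\frac{2d\log(2m') + \log(4/\delta)}{m'}}
  + \lambda .
```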
29 When the combined error of the ideal hypothesis is large, there is no classifier that performs well on both the source and target domains, so we cannot hope to find a good target hypothesis by training only on the source domain. [sent-60, score-2.219]
30 On the other hand, for small λ (the most relevant case for domain adaptation), Theorem 1 shows that source error and unlabeled H∆H-distance are important quantities for computing target error. [sent-61, score-1.18]
31 3 A Learning Bound Combining Source and Target Data Theorem 1 shows how to relate source and target risk. [sent-62, score-0.841]
32 We now proceed to give a learning bound for empirical risk minimization using combined source and target training data. [sent-63, score-1.233]
33 At train time a learner receives a sample S = (ST , SS ) of m instances, where ST consists of βm instances drawn independently from DT and SS consists of (1−β)m instances drawn independently from DS . [sent-66, score-0.32]
34 The goal of a learner is to find a hypothesis that minimizes target risk ǫT (h). [sent-67, score-0.727]
35 When β is small, as in domain adaptation, minimizing empirical target risk may not be the best choice. [sent-68, score-0.796]
36 We analyze learners that instead minimize a convex combination of empirical source and target risk: $\hat{\epsilon}_\alpha(h) = \alpha\,\hat{\epsilon}_T(h) + (1 - \alpha)\,\hat{\epsilon}_S(h)$. We denote by $\epsilon_\alpha(h)$ the corresponding weighted combination of true source and target risks, measured with respect to $\mathcal{D}_S$ and $\mathcal{D}_T$. [sent-69, score-1.911]
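As a concrete illustration (a minimal sketch, not the authors' code; the callable `h` and the array names are our assumptions), this quantity can be computed directly for the 0/1 loss:

```python
import numpy as np

def empirical_alpha_risk(h, X_T, y_T, X_S, y_S, alpha):
    """Convex combination of empirical risks for the 0/1 loss:
    eps_hat_alpha(h) = alpha * eps_hat_T(h) + (1 - alpha) * eps_hat_S(h)."""
    err_T = np.mean(h(X_T) != y_T)  # empirical target risk, eps_hat_T(h)
    err_S = np.mean(h(X_S) != y_S)  # empirical source risk, eps_hat_S(h)
    return alpha * err_T + (1 - alpha) * err_S
```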
37 We bound the target risk of a domain adaptation algorithm that minimizes $\hat{\epsilon}_\alpha(h)$. [sent-70, score-1.036]
38 First we bound the difference between the target risk $\epsilon_T(h)$ and the weighted risk $\epsilon_\alpha(h)$. [sent-72, score-0.814]
39 Then we bound the difference between the true and empirical weighted risks, $\epsilon_\alpha(h)$ and $\hat{\epsilon}_\alpha(h)$. [sent-73, score-0.262]
40 The lemma shows that as α approaches 1, we rely increasingly on the target data, and the distance between domains matters less and less. [sent-77, score-0.529]
41 When α = 0 (that is, we ignore target data), the bound is identical to that of Theorem 1, but with an empirical estimate for the source error. [sent-88, score-1.051]
42 Similarly when α = 1 (that is, we use only target data), the bound is the standard learning bound using only target data. [sent-89, score-0.936]
43 Finally note that by choosing different values of α, the bound allows us to effectively trade off the small amount of target data against the large amount of less relevant source data. [sent-91, score-1.03]
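The shape of this trade-off can be sketched numerically. In the sketch below we assume the complexity term scales as sqrt(α²/β + (1−α)²/(1−β))/√m, the effective-sample-size factor a weighted Hoeffding argument yields, and `C` stands in for the VC-dimension/confidence constant, which we do not reproduce here:

```python
import numpy as np

def bound_shape(alphas, beta, m, dist_plus_lambda, C=1.0):
    """Shape of the Theorem 2 trade-off: a variance term that grows as the
    weighted sample becomes effectively smaller, plus a bias term
    2*(1 - alpha)*(dist/2 + lambda) incurred by leaning on source data."""
    var = C * np.sqrt(alphas**2 / beta + (1 - alphas)**2 / (1 - beta)) / np.sqrt(m)
    bias = 2 * (1 - alphas) * dist_plus_lambda
    return var + bias

alphas = np.linspace(0.0, 1.0, 101)
vals = bound_shape(alphas, beta=0.05, m=2000, dist_plus_lambda=0.3)
alpha_star = alphas[np.argmin(vals)]  # alpha minimizing the sketched bound
```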
44 Let ζ(US , UT ) be our approximation to dH∆H , computed from source and target unlabeled data. [sent-99, score-0.93]
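One standard way to build such an approximation from unlabeled samples, in the spirit of ζ(US, UT) though not necessarily the authors' exact procedure, is to train a classifier to separate source from target points and convert its held-out error into a distance:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def proxy_distance(U_S, U_T):
    """Proxy for the H-Delta-H distance: if no classifier can tell the two
    unlabeled samples apart (error 0.5), the distance is 0; if they are
    perfectly separable (error 0), it is maximal."""
    X = np.vstack([U_S, U_T])
    d = np.concatenate([np.zeros(len(U_S)), np.ones(len(U_T))])  # domain labels
    X_tr, X_te, d_tr, d_te = train_test_split(X, d, test_size=0.5, random_state=0)
    err = 1.0 - LogisticRegression(max_iter=1000).fit(X_tr, d_tr).score(X_te, d_te)
    return 2.0 * (1.0 - 2.0 * err)
```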
45 Because of the large number of potential genres, sentiment classification is an ideal area for domain adaptation. [sent-106, score-0.428]
46 We chose the “apparel” domain as our target domain, and all of the plots on the right-hand side of Figure 1 are for this domain. [sent-111, score-0.608]
47 We obtain empirical curves for the error as a function of α by training a classifier using a weighted hinge loss. [sent-112, score-0.268]
48 Suppose the target domain has weight α and there are βm target training instances. [sent-113, score-0.992]
49 Then we scale the loss of a target training instance by α/β and the loss of a source training instance by (1 − α)/(1 − β). [sent-114, score-1.091]
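A minimal sketch of this weighting scheme (LinearSVC with per-sample weights is an illustrative stand-in for the classifier actually used in the experiments):

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_alpha_weighted(X_T, y_T, X_S, y_S, alpha):
    """Weighted hinge loss: target instances get weight alpha/beta and
    source instances (1 - alpha)/(1 - beta), where beta is the fraction
    of training instances that come from the target domain."""
    m_T, m_S = len(y_T), len(y_S)
    beta = m_T / (m_T + m_S)
    w = np.concatenate([np.full(m_T, alpha / beta),
                        np.full(m_S, (1 - alpha) / (1 - beta))])
    X, y = np.vstack([X_T, X_S]), np.concatenate([y_T, y_S])
    return LinearSVC().fit(X, y, sample_weight=w)
```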
50 [Figure 1 panel residue; recoverable panel title: (f) source = dvd, vary mS, mT = 2500] [sent-133, score-0.521]
51 Figure 1: Comparing the bound with test error for sentiment classification. [sent-149, score-0.301]
52 Curves in (c) and (d) represent different numbers of target instances. [sent-154, score-0.363]
53 Curves in (e) and (f) represent different numbers of source instances. [sent-155, score-0.478]
54 Figure 1 shows a series of plots of equation 1 (on the top) coupled with corresponding plots of test error (on the bottom) as a function of α for different amounts of source and target data and different distances between domains. [sent-156, score-0.963]
55 In each pair of plots, a single parameter (distance, number of target instances mT , or number of source instances mS ) is varied while the other two are held constant. [sent-157, score-1.087]
56 The plots on the top part of Figure 1 are not meant to be numerical proxies for the true error (for the source domains “books” and “dvd”, the distance alone is well above 1). [sent-159, score-0.694]
57 Furthermore, the value of α which minimizes the bound also has a low empirical error for each corresponding curve. [sent-164, score-0.253]
58 This suggests that choosing α to minimize the bound of Theorem 2 and subsequently training a classifier to minimize the empirical error $\hat{\epsilon}_\alpha(h)$ can work well in practice, provided we have a reasonable measure of complexity. [sent-165, score-0.347]
59 Figures 1a and 1b show that more distant source domains result in higher target error. [sent-166, score-0.935]
60 Figures 1c and 1d illustrate that for more target data, we have not only lower error in general, but also a higher minimizing α. [sent-167, score-0.407]
61 Finally, Figures 1e and 1f depict the limitation of distant source data. [sent-168, score-0.531]
62 With enough target data, no matter how much source data we include, we always prefer to use only the target data. [sent-169, score-1.204]
63 This is reflected in our bound as a phase transition in the value of the optimal α (governing the tradeoff between source and target data). [sent-170, score-1.036]
64 The value of α which minimizes the bound is indicated by the intensity, where black means α = 1 (corresponding to ignoring source and learning only from target data). [sent-176, score-0.982]
65 The x-axis shows the number of source instances (log-scale). [sent-179, score-0.601]
66 With more target instances than this, it is more effective to ignore even an infinite amount of source data. [sent-182, score-1.03]
67 5 Learning from Multiple Sources We now explore an extension of our theory to the case of multiple source domains. [sent-183, score-0.501]
68 Each source Sj is associated with an unknown underlying distribution Dj over input points and an unknown labeling function fj . [sent-185, score-0.549]
69 From each source Sj , we are given mj labeled training instances, and our goal is to use these instances to train a model to perform well on a target domain DT , fT , which may or may not be one of the sources. [sent-186, score-1.336]
70 This setting is motivated by several new domain adaptation algorithms [10, 5, 11, 9] that weight the loss from training instances depending on how “far” they are from the target domain. [sent-187, score-0.95]
71 That is, each training instance is its own source domain. [sent-188, score-0.572]
72 As in the previous sections, we will examine algorithms that minimize convex combinations of training errors over the labeled examples from each source domain. [sent-189, score-0.666]
73 Given a vector α = (α1, · · · , αN) of domain weights with $\sum_j \alpha_j = 1$, we define the empirical α-weighted error of a function h as $\hat{\epsilon}_\alpha(h) = \sum_{j=1}^{N} \alpha_j\,\hat{\epsilon}_j(h) = \sum_{j=1}^{N} \frac{\alpha_j}{m_j} \sum_{x \in S_j} \lvert h(x) - f_j(x) \rvert$. [sent-191, score-0.369]
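A direct transcription of this quantity (a sketch; representing the sources as a list of labeled samples is our choice, not the paper's):

```python
import numpy as np

def multi_source_weighted_error(h, sources, alphas):
    """Empirical alpha-weighted error over N sources:
    eps_hat_alpha(h) = sum_j alpha_j * eps_hat_j(h),
    where sources is a list of (X_j, y_j) pairs."""
    assert np.isclose(np.sum(alphas), 1.0)
    return sum(a_j * np.mean(h(X_j) != y_j)
               for a_j, (X_j, y_j) in zip(alphas, sources))
```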
74 Let Dα be a mixture of the N source distributions with mixing weights equal to the components of α. [sent-193, score-0.5]
75 Finally, analogous to λ in the single-source setting, we define the error of the multi-source ideal hypothesis for a weighting α as $\gamma_\alpha = \min_{h}\bigl\{\epsilon_T(h) + \epsilon_\alpha(h)\bigr\} = \min_{h}\bigl\{\epsilon_T(h) + \sum_{j=1}^{N} \alpha_j\,\epsilon_j(h)\bigr\}$. [sent-194, score-0.385]
76 The following theorem gives a learning bound for empirical risk minimization using the empirical α-weighted error. [sent-195, score-0.471]
77 Theorem 3 Suppose we are given mj labeled instances from source Sj for j = 1, . . . , N. [sent-196, score-0.707]
78 (b) Learning by uniformly weighting the training data causes us to learn a suboptimal decision boundary, (c) but by weighting the males more highly, we can match the target data and learn an optimal classifier. [sent-207, score-0.729]
79 The first part bounds the difference between the α-weighted error and the target error, similarly to Lemma 1. [sent-210, score-0.53]
80 Theorem 3 reduces to Theorem 2 when we have only two sources, one of which is the target domain (that is, we have some small number of target instances). [sent-212, score-0.932]
81 It is more general, though, because by manipulating α we can effectively change the source domain. [sent-213, score-0.478]
82 First, we demand that there exists a hypothesis h∗ which has low error on both the α-weighted convex combination of sources and the target domain. [sent-215, score-0.738]
83 Second, we measure distance between the target and a mixture of sources, rather than between the target and a single source. [sent-216, score-0.811]
84 This can happen if some non-uniform weighting of sources accurately approximates the target domain. [sent-218, score-0.559]
85 In this example, we can find the optimal classifier by weighting the “males” and “females” components of the source to match the target. [sent-221, score-0.58]
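The reweighting in this example amounts to scaling each source subpopulation by the ratio of its target and source proportions; a toy sketch with illustrative names and numbers:

```python
import numpy as np

def match_target_weights(groups, src_props, tgt_props):
    """Weight each source instance by (target proportion)/(source proportion)
    of its group, so the weighted source mixture matches the target."""
    ratio = {g: tgt_props[g] / src_props[g] for g in src_props}
    return np.array([ratio[g] for g in groups])

# e.g. source is 50/50 males/females but the target is 80/20:
w = match_target_weights(["m", "f", "m"],
                         src_props={"m": 0.5, "f": 0.5},
                         tgt_props={"m": 0.8, "f": 0.2})
```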
86 While we do not explicitly address the relationship in this paper, we note that domain adaptation is closely related to the setting of covariate shift, which has been studied in statistics. [sent-224, score-0.373]
87 [9] considered weighting instances using a transfer-aware variant of boosting, but the learning bounds they give are no stronger than bounds which completely ignore the source data. [sent-231, score-0.832]
88 [8] consider learning when the marginal distribution on instances is the same across sources but the labeling function may change. [sent-233, score-0.288]
89 They consider only including or discarding a source entirely. [sent-236, score-0.478]
90 They place a source-centered prior on the parameters of a model learned in the target domain. [sent-238, score-0.363]
91 de/projects/different06/) contains a good set of references on domain adaptation and related topics. [sent-242, score-0.373]
92 Like our model, the divergence prior also emphasizes the tradeoff between source and target. [sent-243, score-0.549]
93 In our model, though, we measure the divergence (and consequently the bias) of the source domain from unlabeled data. [sent-244, score-0.808]
94 This allows us to choose the best tradeoff between source and target labeled data. [sent-245, score-0.958]
95 7 Conclusion In this work we investigate the task of domain adaptation when we have a large amount of training data from a source domain but wish to apply a model in a target domain with a much smaller amount of training data. [sent-246, score-1.836]
96 Our main result is a uniform convergence learning bound for algorithms which minimize convex combinations of source and target empirical risk. [sent-247, score-1.12]
97 Our bound reflects the trade-off between the size of the source data and the accuracy of the target data, and we give a simple approximation to it that is computable from finite labeled and unlabeled samples. [sent-248, score-1.09]
98 Our theory also extends in a straightforward manner to a multi-source setting, which we believe helps to explain the success of recent empirical work in domain adaptation. [sent-250, score-0.274]
99 We also plan to investigate algorithms that choose a convex combination of multiple sources to minimize the bound in Theorem 3. [sent-253, score-0.325]
100 Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification. [sent-290, score-0.319]
wordName wordTfidf (topN-words)
[('source', 0.478), ('target', 0.363), ('dh', 0.303), ('mt', 0.217), ('domain', 0.206), ('hypothesis', 0.169), ('adaptation', 0.167), ('risk', 0.159), ('sentiment', 0.152), ('ut', 0.131), ('ds', 0.131), ('instances', 0.123), ('ms', 0.121), ('dt', 0.107), ('bound', 0.105), ('weighting', 0.102), ('sources', 0.094), ('unlabeled', 0.089), ('crammer', 0.087), ('females', 0.087), ('fs', 0.08), ('separator', 0.076), ('blitzer', 0.076), ('dvd', 0.076), ('males', 0.076), ('theorem', 0.071), ('labeling', 0.071), ('domains', 0.07), ('ft', 0.07), ('ideal', 0.07), ('empirical', 0.068), ('classi', 0.066), ('vc', 0.065), ('ah', 0.065), ('dist', 0.065), ('distance', 0.063), ('risks', 0.061), ('er', 0.061), ('training', 0.06), ('labeled', 0.055), ('genres', 0.052), ('mj', 0.051), ('bounds', 0.046), ('sj', 0.044), ('error', 0.044), ('apparel', 0.044), ('argminh', 0.044), ('jiang', 0.044), ('vary', 0.043), ('books', 0.04), ('plots', 0.039), ('convex', 0.038), ('kitchen', 0.038), ('dai', 0.038), ('prd', 0.038), ('electronics', 0.038), ('drawn', 0.037), ('ignore', 0.037), ('minimizes', 0.036), ('tradeoff', 0.036), ('curves', 0.036), ('minimize', 0.035), ('divergence', 0.035), ('instance', 0.034), ('appendix', 0.034), ('uniform', 0.033), ('proof', 0.033), ('lemma', 0.033), ('wish', 0.032), ('hinge', 0.032), ('rademacher', 0.032), ('loss', 0.031), ('stars', 0.03), ('combination', 0.03), ('ss', 0.029), ('depict', 0.029), ('phase', 0.029), ('amount', 0.029), ('weighted', 0.028), ('hypotheses', 0.028), ('speaker', 0.028), ('acl', 0.028), ('bickel', 0.028), ('darpa', 0.027), ('us', 0.026), ('poorly', 0.026), ('triangle', 0.026), ('hope', 0.025), ('formalize', 0.025), ('huang', 0.025), ('transition', 0.025), ('samples', 0.024), ('distant', 0.024), ('language', 0.024), ('lemmas', 0.023), ('inequality', 0.023), ('multiple', 0.023), ('log', 0.022), ('minimizer', 0.022), ('mixture', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
2 0.13133904 129 nips-2007-Mining Internet-Scale Software Repositories
Author: Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre F. Baldi
Abstract: Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised author-topic, probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.84– roughly 10-30% better than previous approaches based on text alone. Supplementary material may be found at: http://sourcerer.ics.uci.edu/nips2007/nips07.html. 1
3 0.12185959 69 nips-2007-Discriminative Batch Mode Active Learning
Author: Yuhong Guo, Dale Schuurmans
Abstract: Active learning sequentially selects unlabeled instances to label with the goal of reducing the effort needed to learn a good classifier. Most previous studies in active learning have focused on selecting one unlabeled instance to label at a time while retraining in each iteration. Recently a few batch mode active learning approaches have been proposed that select a set of most informative unlabeled instances in each iteration under the guidance of heuristic scores. In this paper, we propose a discriminative batch mode active learning approach that formulates the instance selection task as a continuous optimization problem over auxiliary instance selection variables. The optimization is formulated to maximize the discriminative classification performance of the target classifier, while also taking the unlabeled data into account. Although the objective is not convex, we can manipulate a quasi-Newton method to obtain a good local solution. Our empirical studies on UCI datasets show that the proposed active learning is more effective than current state-of-the art batch mode active learning algorithms. 1
4 0.11942774 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
Author: Yuanqing Lin, Jingdong Chen, Youngmoo Kim, Daniel D. Lee
Abstract: Speech dereverberation remains an open problem after more than three decades of research. The most challenging step in speech dereverberation is blind channel identification (BCI). Although many BCI approaches have been developed, their performance is still far from satisfactory for practical applications. The main difficulty in BCI lies in finding an appropriate acoustic model, which not only can effectively resolve solution degeneracies due to the lack of knowledge of the source, but also robustly models real acoustic environments. This paper proposes a sparse acoustic room impulse response (RIR) model for BCI, that is, an acoustic RIR can be modeled by a sparse FIR filter. Under this model, we show how to formulate the BCI of a single-input multiple-output (SIMO) system into a l1 norm regularized least squares (LS) problem, which is convex and can be solved efficiently with guaranteed global convergence. The sparseness of solutions is controlled by l1 -norm regularization parameters. We propose a sparse learning scheme that infers the optimal l1 -norm regularization parameters directly from microphone observations under a Bayesian framework. Our results show that the proposed approach is effective and robust, and it yields source estimates in real acoustic environments with high fidelity to anechoic chamber measurements.
5 0.11845171 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to · 1 . This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . 1
6 0.089685321 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
7 0.088329576 147 nips-2007-One-Pass Boosting
8 0.087969199 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
9 0.084324785 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
10 0.080365978 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
11 0.079372071 84 nips-2007-Expectation Maximization and Posterior Constraints
12 0.078313388 136 nips-2007-Multiple-Instance Active Learning
13 0.07245294 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
14 0.071128726 62 nips-2007-Convex Learning with Invariances
15 0.070611678 74 nips-2007-EEG-Based Brain-Computer Interaction: Improved Accuracy by Automatic Single-Trial Error Detection
16 0.068336651 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect
17 0.067878082 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
18 0.067641608 159 nips-2007-Progressive mixture rules are deviation suboptimal
19 0.065594152 187 nips-2007-Structured Learning with Approximate Inference
20 0.065441623 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations
topicId topicWeight
[(0, -0.22), (1, -0.005), (2, -0.066), (3, 0.076), (4, 0.075), (5, 0.057), (6, 0.125), (7, -0.047), (8, 0.03), (9, -0.035), (10, 0.074), (11, -0.008), (12, -0.125), (13, -0.105), (14, 0.023), (15, -0.118), (16, 0.075), (17, 0.005), (18, 0.034), (19, -0.104), (20, -0.024), (21, -0.089), (22, -0.004), (23, -0.135), (24, 0.007), (25, -0.034), (26, 0.115), (27, -0.031), (28, 0.085), (29, -0.224), (30, -0.075), (31, -0.196), (32, -0.094), (33, -0.01), (34, 0.063), (35, 0.144), (36, -0.115), (37, 0.058), (38, -0.158), (39, 0.063), (40, -0.178), (41, -0.012), (42, -0.015), (43, -0.048), (44, -0.006), (45, -0.004), (46, -0.151), (47, -0.108), (48, 0.026), (49, -0.055)]
simIndex simValue paperId paperTitle
same-paper 1 0.97829133 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to a different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
2 0.66270047 101 nips-2007-How SVMs can estimate quantiles and the median
Author: Andreas Christmann, Ingo Steinwart
Abstract: We investigate quantile regression based on the pinball loss and the ǫ-insensitive loss. For the pinball loss a condition on the data-generating distribution P is given that ensures that the conditional quantiles are approximated with respect to · 1 . This result is then used to derive an oracle inequality for an SVM based on the pinball loss. Moreover, we show that SVMs based on the ǫ-insensitive loss estimate the conditional median only under certain conditions on P . 1
3 0.64581919 129 nips-2007-Mining Internet-Scale Software Repositories
Author: Erik Linstead, Paul Rigor, Sushil Bajracharya, Cristina Lopes, Pierre F. Baldi
Abstract: Large repositories of source code create new challenges and opportunities for statistical machine learning. Here we first develop Sourcerer, an infrastructure for the automated crawling, parsing, and database storage of open source software. Sourcerer allows us to gather Internet-scale source code. For instance, in one experiment, we gather 4,632 java projects from SourceForge and Apache totaling over 38 million lines of code from 9,250 developers. Simple statistical analyses of the data first reveal robust power-law behavior for package, SLOC, and lexical containment distributions. We then develop and apply unsupervised author-topic, probabilistic models to automatically discover the topics embedded in the code and extract topic-word and author-topic distributions. In addition to serving as a convenient summary for program function and developer activities, these and other related distributions provide a statistical and information-theoretic basis for quantifying and analyzing developer similarity and competence, topic scattering, and document tangling, with direct applications to software engineering. Finally, by combining software textual content with structural information captured by our CodeRank approach, we are able to significantly improve software retrieval performance, increasing the AUC metric to 0.84– roughly 10-30% better than previous approaches based on text alone. Supplementary material may be found at: http://sourcerer.ics.uci.edu/nips2007/nips07.html. 1
4 0.57947826 37 nips-2007-Blind channel identification for speech dereverberation using l1-norm sparse learning
Author: Yuanqing Lin, Jingdong Chen, Youngmoo Kim, Daniel D. Lee
Abstract: Speech dereverberation remains an open problem after more than three decades of research. The most challenging step in speech dereverberation is blind channel identification (BCI). Although many BCI approaches have been developed, their performance is still far from satisfactory for practical applications. The main difficulty in BCI lies in finding an appropriate acoustic model, which not only can effectively resolve solution degeneracies due to the lack of knowledge of the source, but also robustly models real acoustic environments. This paper proposes a sparse acoustic room impulse response (RIR) model for BCI, that is, an acoustic RIR can be modeled by a sparse FIR filter. Under this model, we show how to formulate the BCI of a single-input multiple-output (SIMO) system into a l1 norm regularized least squares (LS) problem, which is convex and can be solved efficiently with guaranteed global convergence. The sparseness of solutions is controlled by l1 -norm regularization parameters. We propose a sparse learning scheme that infers the optimal l1 -norm regularization parameters directly from microphone observations under a Bayesian framework. Our results show that the proposed approach is effective and robust, and it yields source estimates in real acoustic environments with high fidelity to anechoic chamber measurements.
5 0.4977068 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
Author: Andrea Lecchini-visintini, John Lygeros, Jan Maciejowski
Abstract: Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory. Optimization is the general problem of finding a value of a vector of variables θ that maximizes (or minimizes) some scalar criterion U (θ). The set of all possible values of the vector θ is called the optimization domain. The elements of θ can be discrete or continuous variables. In the first case the optimization domain is usually finite, such as in the well-known traveling salesman problem; in the second case the optimization domain is a continuous set. An important example of a continuous optimization domain is the set of 3-D configurations of a sequence of amino-acids in the problem of finding the minimum energy folding of the corresponding protein [1]. In principle, any optimization problem on a finite domain can be solved by an exhaustive search. However, this is often beyond computational capacity: the optimization domain of the traveling salesman problem with 100 cities contains more than 10155 possible tours. An efficient algorithm to solve the traveling salesman and many similar problems has not yet been found and such problems remain reliably solvable only in principle [2]. Statistical mechanics has inspired widely used methods for finding good approximate solutions in hard discrete optimization problems which defy efficient exact solutions [3, 4, 5, 6]. Here a key idea has been that of simulated annealing [3]: a random search based on the Metropolis-Hastings algorithm, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers. Convergence and finite-time performance of simulated annealing on finite domains has been evaluated in many works, e.g. [7, 8, 9, 10]. On continuous domains, most popular optimization methods perform a local gradient-based search and in general converge to local optimizers; with the notable exception of convex criteria where convergence to the unique global optimizer occurs [11]. Simulated annealing performs a global search and can be easily implemented on continuous domains. Hence it can be considered a powerful complement to local methods. In this paper, we introduce for the first time rigorous guarantees on the finite-time performance of simulated annealing on continuous domains. We will show that it is possible to derive simulated annealing algorithms which, with an arbitrarily high level of confidence, find an approximate solution to the problem of optimizing a function of continuous variables, within a specified tolerance to the global optimal solution after a known finite number of steps. 
Rigorous guarantees on the finite-time performance of simulated annealing in the optimization of functions of continuous variables have never been obtained before; the only results available state that simulated annealing converges to a global optimizer as the number of steps grows to infinity, e.g. [12, 13, 14, 15]. The background of our work is twofold. On the one hand, our notion of approximate solution to a global optimization problem is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory [16, 17]. We actually maintain an important aspect of statistical learning theory which is that we do not introduce any particular assumption on the optimization criterion, i.e. our results hold regardless of what U is. On the other hand, we ground our results on the theory of convergence, with quantitative bounds on the distance to the target distribution, of the Metropolis-Hastings algorithm and Markov Chain Monte Carlo (MCMC) methods, which has been one of the main achievements of recent research in statistics [18, 19, 20, 21]. In this paper, we will not develop any ready-to-use optimization algorithm. We will instead introduce a general formulation of the simulated annealing method which allows one to derive new simulated annealing algorithms with rigorous finite-time guarantees on the basis of existing theory. The Metropolis-Hastings algorithm and the general family of MCMC methods have many degrees of freedom. The choice and comparison of specific algorithms goes beyond the scope of the paper. The paper is organized in the following sections. In Simulated annealing we introduce the method and fix the notation. In Convergence we recall the reasons why finite-time guarantees for simulated annealing on continuous domains have not been obtained before. In Finite-time guarantees we present the main result of the paper. In Conclusions we state our findings and conclude the paper. 1 Simulated annealing The original formulation of simulated annealing was inspired by the analogy between the stochastic evolution of the thermodynamic state of an annealing material towards the configurations of minimal energy and the search for the global minimum of an optimization criterion [3]. In the procedure, the optimization criterion plays the role of the energy and the state of the annealed material is simulated by the evolution of the state of an inhomogeneous Markov chain. The state of the chain evolves according to the Metropolis-Hastings algorithm in order to simulate the Boltzmann distribution of thermodynamic equilibrium. The Boltzmann distribution is simulated for a decreasing sequence of temperatures (“cooling”). The target distribution of the cooling procedure is the limiting Boltzmann distribution, for the temperature that tends to zero, which takes non-zero values only on the set of global minimizers [7]. The original formulation of the method was for a finite domain. However, simulated annealing can be generalized straightforwardly to a continuous domain because the Metropolis-Hastings algorithm can be used with almost no differences on discrete and continuous domains The main difference is that on a continuous domain the equilibrium distributions are specified by probability densities. 
On a continuous domain, Markov transition kernels in which the distribution of the elements visited by the chain converges to an equilibrium distribution with the desired density can be constructed using the Metropolis-Hastings algorithm and the general family of MCMC methods [22]. We point out that Boltzmann distributions are not the only distributions which can be adopted as equilibrium distributions in simulated annealing [7]. In this paper it is convenient for us to adopt a different type of equilibrium distribution in place of Boltzmann distributions. 1.1 Our setting The optimization criterion is U : Θ → [0, 1], with Θ ⊂ RN . The assumption that U takes values in the interval [0, 1] is a technical one. It does not imply any serious loss of generality. In general, any bounded optimization criterion can be scaled to take values in [0, 1]. We assume that the optimization task is to find a global maximizer; this can be done without loss of generality. We also assume that Θ is a bounded set. We consider equilibrium distributions defined by probability density functions proportional to [U (θ) + δ]J where J and δ are two strictly positive parameters. We use π (J) to denote an equilibrium distribution, i.e. π (J) (dθ) ∝ [U (θ) + δ]J πLeb (dθ) where πLeb is the standard Lebesgue measure. Here, J −1 plays the role of the temperature: if the function U (θ) plus δ is taken to a positive power J then as J increases (i.e. as J −1 decreases) [U (θ) + δ]J becomes increasingly peaked around the global maximizers. The parameter δ is an offset which guarantees that the equilibrium densities are always strictly positive, even if U takes zero values on some elements of the domain. The offset δ is chosen by the user and we show later that our results allow one to make an optimal selection of δ. The zero-temperature distribution is the limiting distribution, for J → ∞, which takes non-zero values only on the set of global maximizers. It is denoted by π (∞) . In the generic formulation of the method, the Markov transition kernel of the k-th step of the inhomogeneous chain has equilibrium distribution π (Jk ) where {Jk }k=1,2,... is the “cooling schedule”. The cooling schedule is a non-decreasing sequence of positive numbers according to which the equilibrium distribution become increasingly sharpened during the evolution of the chain. We use θk to denote the state of the chain and Pθk to denote its probability distribution. The distribution Pθk obviously depends on the initial condition θ0 . However, in this work, we don’t need to make this dependence explicit in the notation. Remark 1: If, given an element θ in Θ, the value U (θ) can be computed directly, we say that U is a deterministic criterion, e.g. the energy landscape in protein structure prediction [1]. In problems involving random variables, the value U (θ) may be the expected value U (θ) = g(x, θ)px (x; θ)dx of some function g which depends on both the optimization variable θ, and on some random variable x which has probability density px (x; θ) (which may itself depend on θ). In such problems it is usually not possible to compute U (θ) directly, either because evaluation of the integral requires too much computation, or because no analytical expression for px (x; θ) is available. Typically one must perform stochastic simulations in order to obtain samples of x for a given θ, hence obtain sample values of g(x, θ), and thus construct a Monte Carlo estimate of U (θ). 
The Bayesian design of clinical trials is an important application area where such expected-value criteria arise [23]. The authors of this paper investigate the optimization of expected-value criteria motivated by problems of aircraft routing [24]. In the particular case that px (x; θ) does not depend on θ, the optimization task is often called “empirical risk minimization”, and is studied extensively in statistical learning theory [16, 17]. The results of this paper apply in the same way to the optimization of both deterministic and expected-value criteria. The MCMC method developed by M¨ ller [25, 26] allows one u to construct simulated annealing algorithms for the optimization of expected-value criteria. M¨ ller u [25, 26] employs the same equilibrium distributions as those described in our setting; in his context J is restricted to integer values. 2 Convergence The rationale of simulated annealing is as follows: if the temperature is kept constant, say Jk = J, then the distribution of the state of the chain Pθk tends to the equilibrium distribution π (J) ; if J → ∞ then the equilibrium distribution π (J) tends to the zero-temperature distribution π (∞) ; as a result, if the cooling schedule Jk tends to infinity, one obtains that Pθk “follows” π (Jk ) and that π (Jk ) tends to π (∞) and eventually that the distribution of the state of the chain Pθk tends to π (∞) . The theory shows that, under conditions on the cooling schedule and the Markov transition kernels, the distribution of the state of the chain Pθk actually converges to the target zero-temperature distribution π (∞) as k → ∞ [12, 13, 14, 15]. Convergence to the zero-temperature distribution implies that asymptotically the state of the chain eventually coincides with a global optimizer with probability one. The difficulty which must be overcome in order to obtain finite step results on simulated annealing algorithms on a continuous domain is that usually, in an optimization problem defined over continuous variables, the set of global optimizers has zero Lebesgue measure (e.g. a set of isolated points). If the set of global optimizers has zero measure then the set of global optimizers has null probability according to the equilibrium distributions π (J) for any finite J and, as a consequence, according to the distributions Pθk for any finite k. Put another way, the probability that the state of the chain visits the set of global optimizers is constantly zero after any finite number of steps. Hence the confidence of the fact that the solution provided by the algorithm in finite time coincides with a global optimizer is also constantly zero. Notice that this is not the case for a finite domain, where the set of global optimizers is of non-null measure with respect to the reference counting measure [7, 8, 9, 10]. It is instructive to look at the issue also in terms of the rate of convergence to the target zerotemperature distribution. On a discrete domain, the distribution of the state of the chain at each step and the zero-temperature distribution are both standard discrete distributions. It is then possible to define a distance between them and study the rate of convergence of this distance to zero. This analysis allows one to obtain results on the finite-time behavior of simulated annealing [7, 8]. On a continuous domain and for a set of global optimizers of measure zero, the target zero-temperature distribution π (∞) ends up being a mixture of probability masses on the set of global optimizers. 
In this situation, although the distribution of the state of the chain Pθk still converges asymptotically to π (∞) , it is not possible to introduce a sensible distance between the two distributions and a rate of convergence to the target distribution cannot even be defined (weak convergence), see [12, Theorem 3.3]. This is the reason that until now there have been no guarantees on the performance of simulated annealing on a continuous domain after a finite number of computations: by adopting the zero-temperature distribution π (∞) as the target distribution it is only possible to prove asymptotic convergence in infinite time to a global optimizer. Remark 2: The standard distance between two distributions, say µ1 and µ2 , on a continuous support is the total variation norm µ1 − µ2 T V = supA |µ1 (A) − µ2 (A)|, see e.g. [21]. In simulated annealing on a continuous domain the distribution of the state of the chain Pθk is absolutely continuous with respect to the Lebesgue measure (i.e. πLeb (A) = 0 ⇒ Pθk (A) = 0), by construction for any finite k. Hence if the set of global optimizers has zero Lebesgue measure then it has zero measure also according to Pθk . The set of global optimizers has however measure 1 according to π (∞) . The distance Pθk − π (∞) T V is then constantly 1 for any finite k. It is also worth mentioning that if the set of global optimizers has zero measure then asymptotic convergence to the zero-temperature distribution π (∞) can be proven only under the additional assumptions of continuity and differentiability of U [12, 13, 14, 15]. 3 Finite-time guarantees In general, optimization algorithms for problems defined on continuous variables can only find approximate solutions in finite time [27]. Given an element θ of a continuous domain how can we assess how good it is as an approximate solution to an optimization problem? Here we introduce the concept of approximate global optimizer to answer this question. The definition is given for a maximization problem in a continuous but bounded domain. We use two parameters: the value imprecision ǫ (greater than or equal to 0) and the residual domain α (between 0 and 1) which together determine the level of approximation. We say that θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if the function U takes values strictly greater than U (θ) + ǫ only on a subset of values of θ no larger than an α portion of the optimization domain. The formal definition is as follows. Definition 1 Let U : Θ → R be an optimization criterion where Θ ⊂ RN is bounded. Let πLeb denote the standard Lebesgue measure. Let ǫ ≥ 0 and α ∈ [0, 1] be given numbers. Then θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α if πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ) . In other words, the value U (θ) is within ǫ of a value which is greater than the values that U takes on at least a 1 − α portion of the domain. The smaller ǫ and α are, the better is the approximation of a true global optimizer. If both α and ǫ are equal to zero then U (θ) coincides with the essential supremum of U . Our definition of approximate global optimizer carries an important property, which holds regardless of what the criterion U is: if ǫ and α have non-zero values then the set of approximate global optimizers always has non-zero Lebesgue measure. It follows that the probability that the chain visits the set of approximate global optimizers can be non-zero. 
Hence, it is sensible to study the confidence of the fact that the solution found by simulated annealing in finite time is an approximate global optimizer. Remark 3: The intuition that our notion of approximate global optimizer can be used to obtain formal guarantees on the finite-time performance of optimization methods based on a stochastic search of the domain is already apparent in the work of Vidyasagar [17, 28]. Vidyasagar [17, 28] introduces a similar definition and obtains rigorous finite-time guarantees in the optimization of ex- pected value criteria based on uniform independent sampling of the domain. Notably, the number of independent samples required to guarantee some desired accuracy and confidence turns out to be polynomial in the values of the desired imprecision, residual domain and confidence. Although the method of Vidyasagar is not highly sophisticated, it has had considerable success in solving difficult control system design applications [28, 29]. Its appeal stems from its rigorous finite-time guarantees which exist without the need for any particular assumption on the optimization criterion. Here we show that finite-time guarantees for simulated annealing can be obtained by selecting a distribution π (J) with a finite J as the target distribution in place of the zero-temperature distribution π (∞) . The fundamental result is the following theorem which allows one to select in a rigorous way δ and J in the target distribution π (J) . It is important to stress that the result holds universally for any optimization criterion U on a bounded domain. The only minor requirement is that U takes values in [0, 1]. Theorem 1 Let U : Θ → [0, 1] be an optimization criterion where Θ ⊂ RN is bounded. Let J ≥ 1 and δ > 0 be given numbers. Let θ be a multivariate random variable with distribution π (J) (dθ) ∝ [U (θ) + δ]J πLeb (dθ). Let α ∈ (0, 1] and ǫ ∈ [0, 1] be given numbers and define σ= 1 1+δ 1+ ǫ+1+δ J 1 1+δ 1+δ −1 α ǫ+δ δ . (1) Then the statement “θ is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ. Proof. See Appendix A. The importance of the choice of a target distribution π (J) with a finite J is that π (J) is absolutely continuous with respect to the Lebesgue measure. Hence, the distance Pθk − π (J) TV between the distribution of the state of the chain Pθk and the target distribution π (J) is a meaningful quantity. Convergence of the Metropolis-Hastings algorithm and MCMC methods in total variation norm is a well studied problem. The theory provides simple conditions under which one derives upper bounds on the distance to the target distribution which are known at each step of the chain and decrease monotonically to zero as the number of steps of the chain grows. The theory has been developed mainly for homogeneous chains [18, 19, 20, 21]. In the case of simulated annealing, the factor that enables us to employ these results is the absolute continuity of the target distribution π (J) with respect to the Lebesgue measure. However, simulated annealing involves the simulation of inhomogeneous chains. In this respect, another important fact is that the choice of a target distribution π (J) with a finite J implies that the inhomogeneous Markov chain can in fact be formed by a finite sequence of homogeneous chains (i.e. the cooling schedule {Jk }k=1,2,... can be chosen to be a sequence that takes only a finite set of values). 
In turn, this allows one to apply the theory of homogeneous MCMC methods to study the convergence of Pθk to π (J) in total variation norm. On a bounded domain, simple conditions on the ‘proposal distribution’ in the iteration of the simulated annealing algorithm allows one to obtain upper bounds on Pθk − π (J) TV that decrease geometrically to zero as k → ∞, without the need for any additional assumption on U [18, 19, 20, 21]. It is then appropriate to introduce the following finite-time result. Theorem 2 Let the notation and assumptions of Theorem 1 hold. Let θk , with distribution Pθk , be the state of the inhomogeneous chain of a simulated annealing algorithm with target distribution π (J) . Then the statement “θk is an approximate global optimizer of U with value imprecision ǫ and residual domain α” holds with probability at least σ − Pθk − π (J) TV . The proof of the theorem follows directly from the definition of the total variation norm. It follows that if simulated annealing is implemented with an algorithm which converges in total variation distance to a target distribution π (J) with a finite J, then one can state with confidence arbitrarily close to 1 that the solution found by the algorithm after the known appropriate finite number of steps is an approximate global optimizer with the desired approximation level. For given non-zero values of ǫ, α the value of σ given by (1) can be made arbitrarily close to 1 by choice of J; while the distance Pθk − π (J) TV can be made arbitrarily small by taking the known sufficient number of steps. It can be shown that there exists the possibility of making an optimal choice of δ and J in the target distribution π (J) . In fact, for given ǫ and α and a given value of J there exists an optimal choice of δ which maximizes the value of σ given by (1). Hence, it is possible to obtain a desired σ with the smallest possible J. The advantage of choosing the smallest J, consistent with the required approximation and confidence, is that it will decrease the number of steps required to achieve the desired reduction of Pθk − π (J) TV . 4 Conclusions We have introduced a new formulation of simulated annealing which admits rigorous finite-time guarantees in the optimization of functions of continuous variables. First, we have introduced the notion of approximate global optimizer. Then, we have shown that simulated annealing is guaranteed to find approximate global optimizers, with the desired confidence and the desired level of accuracy, in a known finite number of steps, if a proper choice of the target distribution is made and conditions for convergence in total variation norm are met. The results hold for any optimization criterion on a bounded domain with the only minor requirement that it takes values between 0 and 1. In this framework, simulated annealing algorithms with rigorous finite-time guarantees can be derived by studying the choice of the proposal distribution and of the cooling schedule, in the generic iteration of simulated annealing, in order to ensure convergence to the target distribution in total variation norm. To do this, existing theory of convergence of the Metropolis-Hastings algorithm and MCMC methods on continuous domains can be used [18, 19, 20, 21]. 
Vidyasagar [17, 28] has introduced a similar definition of approximate global optimizer and has shown that approximate optimizers with desired accuracy and confidence can be obtained with a number of uniform independent samples of the domain which is polynomial in the accuracy and confidence parameters. In general, algorithms developed with the MCMC methodology can be expected to be equally or more efficient than uniform independent sampling. Acknowledgments Work supported by EPSRC, Grant EP/C014006/1, and by the European Commission under projects HYGEIA FP6-NEST-4995 and iFly FP6-TREN-037180. We thank S. Brooks, M. Vidyasagar and D. M. Wolpert for discussions and useful comments on the paper. A Proof of Theorem 1 Let α ∈ (0, 1] and ρ ∈ (0, 1] be given numbers. Let Uδ (θ) := U (θ) + δ. Let πδ be a normalized ¯ measure such that πδ (dθ) ∝ Uδ (θ)πLeb (dθ). In the first part of the proof we find a lower bound on the probability that θ belongs to the set {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} . ¯ Let yα := inf{y : πδ {θ ∈ Θ : Uδ (θ) ≤ y} ≥ 1 − α}. To start with we show that the set ¯ ¯ {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} coincides with {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Notice ¯ ¯ that the quantity πδ {θ ∈ Θ : Uδ (θ) ≤ y} is a right-continuous non-decreasing function of y because it has the form of a distribution function (see e.g. [30, p.162] and [17, Lemma 11.1]). Therefore we have πδ {θ ∈ Θ : Uδ (θ) ≤ yα } ≥ 1 − α and ¯ ¯ y ≥ ρ yα ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) ≤ y} ≥ 1 − α ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} ≤ α . ¯ ¯ ¯ Moreover, y < ρ yα ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) ≤ y} < 1 − α ⇒ πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} > α ¯ ¯ ¯ and taking the contrapositive one obtains πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > y} ≤ α ⇒ y ≥ ρ yα . ¯ ¯ ′ Therefore {θ ∈ Θ : Uδ (θ) ≥ ρ yα } ≡ {θ ∈ Θ : πδ {θ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α}. ¯ ¯ We now derive a lower bound on π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Let us introduce the notation ¯ ¯¯ Aα := {θ ∈ Θ : Uδ (θ) < yα }, Aα := {θ ∈ Θ : Uδ (θ) ≥ yα }, Bα,ρ := {θ ∈ Θ : Uδ (θ) < ρ yα } ¯ ¯ ¯ ¯ ¯ ¯α,ρ := {θ ∈ Θ : Uδ (θ) ≥ ρ yα }. Notice that Bα,ρ ⊆ Aα and Aα ⊆ Bα,ρ . The quantity ¯¯ ¯¯ and B ¯ ¯ ¯ ¯ πδ {θ ∈ Θ : Uδ (θ) < y} as a function of y is the left-continuous version of πδ {θ ∈ Θ : Uδ (θ) ≤ ¯¯ y}[30, p.162]. Hence, the definition of yα implies πδ (Aα ) ≤ 1 − α and πδ (Aα ) ≥ α. Notice that ¯ ¯ ¯ ¯ δπLeb (Aα ) ¯ ≤ 1−α, ¯ πδ (Aα ) ≤ 1 − α ⇒ ¯ ¯ Uδ (θ)πLeb (dθ) Θ ¯¯ πδ (Aα ) ≥ α ¯ ¯¯ (1 + δ)πLeb (Aα ) ≥ α. ¯ U (θ)πLeb (dθ) Θ δ ⇒ ¯¯ Hence, πLeb (Aα ) > 0 and πLeb (Aα ) 1−α1+δ ¯ ¯ . ¯α ) ≤ α ¯ δ πLeb (A ¯ ¯¯ ¯¯ Notice that πLeb (Aα ) > 0 implies πLeb (Bα,ρ ) > 0. We obtain π (J) {θ ∈ Θ : Uδ (θ) ≥ ρ yα } = ¯ 1+ 1 ≥ Uδ (θ)J πLeb (dθ) Bα,ρ ¯ ¯¯ Bα,ρ ≥ 1+ J ρ J yα ¯ J yα ¯ Uδ (θ)J πLeb (dθ) 1+ 1 Uδ (θ)J πLeb (dθ) Bα,ρ ¯ ¯¯ Aα Uδ (θ)J πLeb (dθ) 1 1 1 ≥ ≥ . 1−α1+δ ¯ πLeb (Aα ) πLeb (Bα,ρ ) ¯ ¯ 1 + ρJ 1 + ρJ ¯¯ ¯¯ α ¯ δ πLeb (Aα ) πLeb (Aα ) Since {θ ∈ Θ : Uδ (θ) ≥ ρ yα } ≡ {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} the first part of ¯ ¯ the proof is complete. In the second part of the proof we show that the set {θ ∈ Θ : πδ {θ′ ∈ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} is contained in the set of approximate global optimizers of U with value imprecision ¯ ǫ := (ρ−1 − 1)(1 + δ) and residual domain α := 1+δ α. Hence, we show that {θ ∈ Θ : πδ {θ′ ∈ ˜ ˜ ǫ+δ ¯ ˜ Θ : ρ Uδ (θ′ ) > Uδ (θ)} ≤ α} ⊆ {θ ∈ Θ : πLeb {θ′ ∈ Θ : U (θ′ ) > U (θ) + ǫ} ≤ α πLeb (Θ)}. 
We have

U(θ′) > U(θ) + ǫ̃ ⇔ ρ Uδ(θ′) > ρ [Uδ(θ) + ǫ̃] ⇒ ρ Uδ(θ′) > Uδ(θ),

where the last implication is proven by noticing that ρ [Uδ(θ) + ǫ̃] ≥ Uδ(θ) ⇔ 1 − ρ ≥ U(θ)(1 − ρ) and U(θ) ∈ [0, 1]. Hence,

{θ′ ∈ Θ : ρ Uδ(θ′) > Uδ(θ)} ⊇ {θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃}.

Therefore πδ{θ′ ∈ Θ : ρ Uδ(θ′) > Uδ(θ)} ≤ ᾱ ⇒ πδ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ ᾱ. Let Qθ,ǫ̃ := {θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} and notice that

πδ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} = (∫Qθ,ǫ̃ U(θ′) πLeb(dθ′) + δ πLeb(Qθ,ǫ̃)) / (∫Θ U(θ′) πLeb(dθ′) + δ πLeb(Θ)).

We obtain

πδ{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ ᾱ ⇒ ǫ̃ πLeb(Qθ,ǫ̃) + δ πLeb(Qθ,ǫ̃) ≤ ᾱ (1 + δ) πLeb(Θ) ⇒ πLeb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ πLeb(Θ).

Hence we can conclude that πδ{θ′ ∈ Θ : ρ Uδ(θ′) > Uδ(θ)} ≤ ᾱ ⇒ πLeb{θ′ ∈ Θ : U(θ′) > U(θ) + ǫ̃} ≤ α̃ πLeb(Θ), and the second part of the proof is complete.

We have shown that, given ᾱ ∈ (0, 1] and ρ ∈ (0, 1], with ǫ̃ := (ρ⁻¹ − 1)(1 + δ), α̃ := ((1 + δ)/(ǫ̃ + δ)) ᾱ and

σ := 1 / (1 + ρ^J ((1 − ᾱ)/ᾱ)((1 + δ)/δ)) = 1 / (1 + ((1 + δ)/(ǫ̃ + 1 + δ))^J ((1 + δ)/(α̃ (ǫ̃ + δ)) − 1)(1 + δ)/δ),

the statement "θ is an approximate global optimizer of U with value imprecision ǫ̃ and residual domain α̃" holds with probability at least σ. Notice that ǫ̃ ∈ [0, 1] and α̃ ∈ (0, 1] are linked through a bijective relation to ρ ∈ [(1 + δ)/(2 + δ), 1] and ᾱ ∈ (0, (ǫ̃ + δ)/(1 + δ)]. The statement of the theorem is eventually obtained by expressing σ as a function of the desired ǫ = ǫ̃ and α = α̃.
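The bound lends itself to a direct numerical check. The sketch below, our own illustration rather than anything from the paper, draws approximately from π(J) ∝ (U + δ)^J on a one-dimensional toy problem by importance resampling and compares the empirical probability that a draw is an approximate global optimizer against σ. The toy criterion U, the parameter values, and the resampling scheme are all assumptions made for the example, and the estimate is crude because the weights concentrate sharply for large J.

```python
import numpy as np

rng = np.random.default_rng(0)

def U(th):
    # Toy criterion on Theta = [0, 1] with values in [0, 1], as required.
    return 0.5 + 0.4 * np.sin(6 * np.pi * th) * np.exp(-3.0 * th)

eps, alpha, delta, J = 0.05, 0.05, 0.5, 200

# Theoretical lower bound sigma (same closed form as in the earlier sketch).
rho_term = ((1 + delta) / (eps + 1 + delta)) ** J
alpha_term = ((1 + delta) / (alpha * (eps + delta)) - 1) * (1 + delta) / delta
sigma = 1.0 / (1.0 + rho_term * alpha_term)

# Draw from pi^(J)(dtheta) proportional to (U(theta) + delta)^J by
# self-normalized importance resampling of uniform proposals; working with
# log-weights avoids overflow for large J.
base = rng.uniform(0.0, 1.0, size=200_000)
logw = J * np.log(U(base) + delta)
w = np.exp(logw - logw.max())
draws = rng.choice(base, size=2_000, p=w / w.sum())

# theta is an approximate global optimizer with value imprecision eps and
# residual domain alpha if Leb{U > U(theta) + eps} <= alpha * Leb(Theta).
grid = np.linspace(0.0, 1.0, 20_001)
Ugrid = U(grid)
success = np.array([np.mean(Ugrid > U(t) + eps) <= alpha for t in draws])
print("empirical success probability:", success.mean())
print("theoretical lower bound sigma:", sigma)
```

On this toy problem the empirical success probability should sit at or above the theoretical lower bound, consistent with the fact that σ bounds the probability under exact sampling from π(J).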
References

[1] D. J. Wales. Energy Landscapes. Cambridge University Press, Cambridge, UK, 2003.
[2] D. Achlioptas, A. Naor, and Y. Peres. Rigorous location of phase transitions in hard optimization problems. Nature, 435:759–764, 2005.
[3] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
[4] E. Bonomi and J. Lutton. The N-city travelling salesman problem: statistical mechanics and the Metropolis algorithm. SIAM Rev., 26(4):551–568, 1984.
[5] Y. Fu and P. W. Anderson. Application of statistical mechanics to NP-complete problems in combinatorial optimization. J. Phys. A: Math. Gen., 19(9):1605–1620, 1986.
[6] M. Mézard, G. Parisi, and R. Zecchina. Analytic and Algorithmic Solution of Random Satisfiability Problems. Science, 297:812–815, 2002.
[7] P. M. J. van Laarhoven and E. H. L. Aarts. Simulated Annealing: Theory and Applications. D. Reidel Publishing Company, Dordrecht, Holland, 1987.
[8] D. Mitra, F. Romeo, and A. Sangiovanni-Vincentelli. Convergence and finite-time behavior of simulated annealing. Adv. Appl. Prob., 18:747–771, 1986.
[9] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Res., 13:311–329, 1988.
[10] J. Hannig, E. K. P. Chong, and S. R. Kulkarni. Relative Frequencies of Generalized Simulated Annealing. Math. Oper. Res., 31(1):199–216, 2006.
[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, UK, 2004.
[12] H. Haario and E. Saksman. Simulated annealing process in general state space. Adv. Appl. Prob., 23:866–893, 1991.
[13] S. B. Gelfand and S. K. Mitter. Simulated Annealing Type Algorithms for Multivariate Optimization. Algorithmica, 6:419–436, 1991.
[14] C. Tsallis and D. A. Stariolo. Generalized simulated annealing. Physica A, 233:395–406, 1996.
[15] M. Locatelli. Simulated Annealing Algorithms for Continuous Global Optimization: Convergence Conditions. J. Optimiz. Theory App., 104(1):121–133, 2000.
[16] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, US, 1995.
[17] M. Vidyasagar. Learning and Generalization: With Applications to Neural Networks. Springer-Verlag, London, second edition, 2003.
[18] S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Springer-Verlag, London, 1993.
[19] J. S. Rosenthal. Minorization Conditions and Convergence Rates for Markov Chain Monte Carlo. J. Am. Stat. Assoc., 90(430):558–566, 1995.
[20] K. L. Mengersen and R. L. Tweedie. Rates of convergence of the Hastings and Metropolis algorithms. Ann. Stat., 24(1):101–121, 1996.
[21] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Prob. Surv., 1:20–71, 2004.
[22] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, second edition, 2004.
[23] D. J. Spiegelhalter, K. R. Abrams, and J. P. Myles. Bayesian approaches to clinical trials and health-care evaluation. John Wiley & Sons, Chichester, UK, 2004.
[24] A. Lecchini-Visintini, W. Glover, J. Lygeros, and J. M. Maciejowski. Monte Carlo Optimization for Conflict Resolution in Air Traffic Control. IEEE Trans. Intell. Transp. Syst., 7(4):470–482, 2006.
[25] P. Müller. Simulation based optimal design. In J. O. Berger, J. M. Bernardo, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6: proceedings of the Sixth Valencia International Meeting, pages 459–474. Oxford: Clarendon Press, 1999.
[26] P. Müller, B. Sansó, and M. De Iorio. Optimal Bayesian design by Inhomogeneous Markov Chain Simulation. J. Am. Stat. Assoc., 99(467):788–798, 2004.
[27] L. Blum, F. Cucker, M. Shub, and S. Smale. Complexity and Real Computation. Springer-Verlag, New York, 1998.
[28] M. Vidyasagar. Randomized algorithms for robust controller synthesis using statistical learning theory. Automatica, 37(10):1515–1528, 2001.
[29] R. Tempo, G. Calafiore, and F. Dabbene. Randomized Algorithms for Analysis and Control of Uncertain Systems. Springer-Verlag, London, 2005.
[30] B. V. Gnedenko. Theory of Probability. Chelsea, New York, fourth edition, 1968.
6 0.47899571 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
7 0.46731311 159 nips-2007-Progressive mixture rules are deviation suboptimal
8 0.44055551 201 nips-2007-The Value of Labeled and Unlabeled Examples when the Model is Imperfect
9 0.42151466 150 nips-2007-Optimal models of sound localization by barn owls
10 0.42048597 15 nips-2007-A general agnostic active learning algorithm
11 0.41002601 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
12 0.39076787 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
13 0.38941213 69 nips-2007-Discriminative Batch Mode Active Learning
14 0.38317326 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
15 0.34697065 184 nips-2007-Stability Bounds for Non-i.i.d. Processes
16 0.33753034 136 nips-2007-Multiple-Instance Active Learning
17 0.33669597 62 nips-2007-Convex Learning with Invariances
18 0.33525619 85 nips-2007-Experience-Guided Search: A Theory of Attentional Control
19 0.32647955 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
20 0.31847432 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
topicId topicWeight
[(5, 0.022), (13, 0.038), (16, 0.013), (18, 0.013), (21, 0.055), (34, 0.012), (35, 0.021), (47, 0.054), (49, 0.014), (83, 0.608), (85, 0.012), (87, 0.012), (90, 0.049)]
simIndex simValue paperId paperTitle
1 0.99840128 159 nips-2007-Progressive mixture rules are deviation suboptimal
Author: Jean-Yves Audibert
Abstract: We consider the learning task consisting in predicting as well as the best function in a finite reference set G up to the smallest possible additive term. If R(g) denotes the generalization error of a prediction function g, under reasonable assumptions on the loss function (typically satisfied by the least square loss when the output is bounded), it is known that the progressive mixture rule ĝ satisfies E R(ĝ) ≤ min_{g∈G} R(g) + Cst (log |G|)/n, (1) where n denotes the size of the training set, and E denotes the expectation w.r.t. the training set distribution. This work shows that, surprisingly, for appropriate reference sets G, the deviation convergence rate of the progressive mixture rule is no better than Cst/√n: it fails to achieve the expected Cst/n. We also provide an algorithm which does not suffer from this drawback, and which is optimal in both deviation and expectation convergence rates. 1
same-paper 2 0.99730355 110 nips-2007-Learning Bounds for Domain Adaptation
Author: John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, Jennifer Wortman
Abstract: Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization. 1
3 0.99681515 109 nips-2007-Kernels on Attributed Pointsets with Applications
Author: Mehul Parsana, Sourangshu Bhattacharya, Chiru Bhattacharya, K. Ramakrishnan
Abstract: This paper introduces kernels on attributed pointsets, which are sets of vectors embedded in a Euclidean space. The embedding gives the notion of neighborhood, which is used to define positive semidefinite kernels on pointsets. Two novel kernels on neighborhoods are proposed, one evaluating the attribute similarity and the other evaluating shape similarity. The shape similarity function is motivated by spectral graph matching techniques. The kernels are tested on three real life applications: face recognition, photo album tagging, and shot annotation in video sequences, with encouraging results. 1
4 0.99650019 132 nips-2007-Modeling image patches with a directed hierarchy of Markov random fields
Author: Simon Osindero, Geoffrey E. Hinton
Abstract: We describe an efficient learning procedure for multilayer generative models that combine the best aspects of Markov random fields and deep, directed belief nets. The generative models can be learned one layer at a time and when learning is complete they have a very fast inference procedure for computing a good approximation to the posterior distribution in all of the hidden layers. Each hidden layer has its own MRF whose energy function is modulated by the top-down directed connections from the layer above. To generate from the model, each layer in turn must settle to equilibrium given its top-down input. We show that this type of model is good at capturing the statistics of patches of natural images. 1
5 0.9950949 61 nips-2007-Convex Clustering with Exemplar-Based Models
Author: Danial Lashkari, Polina Golland
Abstract: Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with guaranteed convergence to the globally optimal solution. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering. 1
6 0.99377334 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis
7 0.94627756 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
8 0.93449122 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
9 0.91923785 40 nips-2007-Bundle Methods for Machine Learning
10 0.91684288 54 nips-2007-Computational Equivalence of Fixed Points and No Regret Algorithms, and Convergence to Equilibria
11 0.91610676 120 nips-2007-Linear programming analysis of loopy belief propagation for weighted matching
12 0.91475797 116 nips-2007-Learning the structure of manifolds using random projections
13 0.91394275 21 nips-2007-Adaptive Online Gradient Descent
14 0.90887642 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
15 0.90253413 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
16 0.9021073 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
17 0.9019472 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
18 0.89908791 204 nips-2007-Theoretical Analysis of Heuristic Search Methods for Online POMDPs
19 0.89808255 147 nips-2007-One-Pass Boosting
20 0.89762127 46 nips-2007-Cluster Stability for Finite Samples