nips nips2012 nips2012-142 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Chao Zhang, Lei Zhang, Jieping Ye
Abstract: In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. We consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. In particular, we use the integral probability metric to measure the difference between two domains. Then, we develop the specific Hoeffding-type deviation inequality and symmetrization inequality for either kind of domain adaptation to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. [sent-8, score-0.621]
2 We consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. [sent-9, score-2.979]
3 In particular, we use the integral probability metric to measure the difference between two domains. [sent-10, score-0.207]
4 Then, we develop the specific Hoeffding-type deviation inequality and symmetrization inequality for either kind of domain adaptation to achieve the corresponding generalization bound based on the uniform entropy number. [sent-11, score-1.485]
5 By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. [sent-12, score-0.995]
6 Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. [sent-13, score-0.133]
7 1 Introduction In statistical learning theory, one of the major concerns is to obtain the generalization bound of a learning process, which measures the probability that a function, chosen from a function class by an algorithm, has a sufficiently small error (cf. [sent-15, score-0.249]
8 Generalization bounds have been widely used to study the consistency of the learning process [3], the asymptotic convergence of the empirical process [4] and the learnability of learning models [5]. [sent-17, score-0.457]
9 Generally, there are three essential aspects to obtain generalization bounds of a specific learning process: complexity measures of function classes, deviation (or concentration) inequalities and symmetrization inequalities related to the learning process (cf. [sent-18, score-0.657]
10 It is noteworthy that the aforementioned results of statistical learning theory are all built under the assumption that training and test data are drawn from the same distribution (briefly, the assumption of same distribution). [sent-20, score-0.168]
11 Domain adaptation has recently been proposed to handle this situation; it aims to apply a learning model, trained by using the samples drawn from a certain domain (source domain), to the samples drawn from another domain (target domain) with a different distribution (cf. [sent-22, score-1.351]
12 This paper is mainly concerned with two variants of domain adaptation. [sent-24, score-0.396]
13 In the first variant, the learner receives training data from several source domains, known as domain adaptation with multiple sources (cf. [sent-25, score-1.207]
14 In the second variant, the learner minimizes a convex combination of the empirical source and target risks, termed domain adaptation combining source and target data (cf. [sent-27, score-1.416]
15 1 Overview of Main Results In this paper, we present a new framework to study generalization bounds of the learning processes for domain adaptation with multiple sources and domain adaptation combining source and target data, respectively. [sent-30, score-2.34]
16 Based on the resultant bounds, we then study the asymptotic behavior of the learning process for the two kinds of domain adaptation, respectively. [sent-31, score-0.72]
17 There are three major aspects in the framework: the quantity that measures the difference between two domains, the deviation inequalities and the symmetrization inequalities that are both designed for the situation of domain adaptation. [sent-32, score-0.945]
18 Generally, in order to obtain the generalization bounds of a learning process, it is necessary to obtain the corresponding deviation (or concentration) inequalities. [sent-33, score-0.253]
19 For either kind of domain adaptation, we use a martingale method to develop the related Hoeffding-type deviation inequality. [sent-34, score-0.503]
20 Moreover, in the situation of domain adaptation, since the source domain differs from the target domain, the desired symmetrization inequality for domain adaptation should incorporate some quantity to reflect the difference. [sent-35, score-2.262]
21 We then obtain the related symmetrization inequality incorporating the integral probability metric that measures the difference between the distributions of the source and the target domains. [sent-36, score-0.807]
22 Next, we present the generalization bounds based on the uniform entropy number for both kinds of domain adaptation. [sent-37, score-0.774]
23 Finally, based on the resultant bounds, we give a rigorous theoretical analysis of the asymptotic convergence and the rate of convergence of the learning processes for both types of domain adaptation. [sent-38, score-0.777]
24 Section 3 introduces the integral probability metric that measures the difference between the distributions of two domains. [sent-44, score-0.271]
25 We introduce the uniform entropy number for the situation of multiple sources in Section 4. [sent-45, score-0.457]
26 In Section 5, we present the generalization bounds for domain adaptation with multiple sources, and then analyze the asymptotic behavior of the learning process for this type of domain adaptation. [sent-46, score-1.686]
27 In the supplement (part A), we discuss the relationship between the integral probability metric DF (S, T ) and the other quantities proposed in the existing works including the H-divergence and the discrepancy distance. [sent-48, score-0.405]
28 Proofs of main results of this paper are provided in the supplement (part B). [sent-49, score-0.103]
29 We study domain adaptation combining source and target data in the supplement (part C) and then give a comparison with the existing works on the theoretical analysis of domain adaptation in the supplement (part D). [sent-50, score-2.09]
30 2 Problem Setup We denote Z^(Sk) := X^(Sk) × Y^(Sk) ⊂ R^I × R^J (1 ≤ k ≤ K) and Z^(T) := X^(T) × Y^(T) ⊂ R^I × R^J as the k-th source domain and the target domain, respectively. [sent-51, score-0.653]
31 In the situation of domain adaptation with multiple sources, the distributions D^(Sk) (1 ≤ k ≤ K) and D^(T) differ from each other, or the labeling functions g∗^(Sk) (1 ≤ k ≤ K) and g∗^(T) differ from each other, or both cases occur. [sent-55, score-0.952]
32 Due to the page limitation, the discussion on domain adaptation combining source and target data is provided in the supplement (part C). [sent-59, score-1.211]
33 Due to the page limitation, we only present the generalization bounds for domain adaptation with multiple sources; the discussions of the corresponding deviation inequalities and symmetrization inequalities are provided in the supplement (part B) along with the proofs of the main results. [sent-60, score-1.744]
34 Sample sets Z^(k)_{Nk} = {z^(k)_n}_{n=1}^{Nk} are drawn from each source domain Z^(Sk) (1 ≤ k ≤ K), but little or no labeled samples are drawn from the target domain Z^(T). [sent-61, score-1.115]
35 The goal is that the learned function gw approximates the labeling function g∗ given in (2) as precisely as possible. [sent-64, score-0.231]
36 By (1) and (2), given sample sets {Z^(k)_{Nk}}_{k=1}^{K} drawn from the multiple sources {Z^(Sk)}_{k=1}^{K} respectively, we briefly denote, for any f ∈ F, E^(T) f := ∫ f(z^(T)) dP(z^(T)) and E^(S)_w f := Σ_{k=1}^{K} (wk/Nk) Σ_{n=1}^{Nk} f(z^(k)_n). (6) [sent-69, score-0.495]
37 Thus, we can equivalently rewrite the generalization bound (4) for domain adaptation with multiple sources as sup_{f∈F} |E^(T) f − E^(S)_w f|. [sent-70, score-1.299]
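To make the quantities in (6) concrete, the following minimal Python sketch (not from the paper; the toy data, the finite function class and all names are illustrative) computes the weighted multi-source empirical expectation E^(S)_w f and the gap sup_{f∈F} |E^(T) f − E^(S)_w f| over a small finite class, with E^(T) f approximated by an average over target samples.

    import numpy as np

    def weighted_source_mean(f, source_samples, w):
        # E^(S)_w f = sum_k (w_k / N_k) * sum_n f(z^(k)_n) = sum_k w_k * mean_n f(z^(k)_n)
        return sum(wk * np.mean(f(Zk)) for wk, Zk in zip(w, source_samples))

    def multi_source_gap(functions, source_samples, target_samples, w):
        # sup over a finite class F of |E^(T) f - E^(S)_w f|, with E^(T) f
        # approximated by a Monte-Carlo average over the target samples.
        gaps = []
        for f in functions:
            e_target = np.mean(f(target_samples))
            e_source = weighted_source_mean(f, source_samples, w)
            gaps.append(abs(e_target - e_source))
        return max(gaps)

    # toy usage: two 1-d sources, one target, F = a few bounded functions
    rng = np.random.default_rng(0)
    sources = [rng.normal(0.5, 1.0, size=1000), rng.normal(2.0, 1.0, size=1500)]
    target = rng.normal(0.0, 1.0, size=4000)
    w = [len(s) / sum(len(t) for t in sources) for s in sources]   # w_k proportional to N_k
    F = [np.tanh, np.sin, lambda z: np.clip(z, -1.0, 1.0)]
    print(multi_source_gap(F, sources, target, w))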
38 As noted in the existing works (cf. [13, 16, 17, 18, 19, 20]), one of the major challenges in the theoretical analysis of domain adaptation is how to measure the distance between the source domain Z^(S) and the target domain Z^(T). [sent-73, score-1.85]
39 Recall that, if Z^(S) differs from Z^(T), there are three possibilities: D^(S) differs from D^(T), or g∗^(S) differs from g∗^(T), or both occur. [sent-74, score-0.117]
40 In [13, 18], the H-divergence was introduced to derive the generalization bounds based on the VC dimension under the condition of “λ-close”. [sent-76, score-0.2]
41 The authors of [20] obtained generalization bounds based on the Rademacher complexity by using the discrepancy distance. [sent-78, score-0.277]
42 Both quantities are aimed to measure the difference between two distributions D(S) and D(T ) . [sent-79, score-0.13]
43 In this paper, we use the following quantity to measure the difference between the distributions of the source and the target domains (Definition 3.1). [sent-82, score-0.38]
44 We define DF(S, T) := sup_{f∈F} |E^(S) f − E^(T) f|, (8) where the expectations E^(S) and E^(T) are taken with respect to the distributions of Z^(S) and Z^(T), respectively. [sent-85, score-0.109]
45 The quantity DF(S, T) is termed the integral probability metric, which plays an important role in probability theory for measuring the difference between two probability distributions (cf. [sent-86, score-0.38]
46 The work [27] further investigated this quantity and proposed empirical methods to compute the integral probability metric in practice. [sent-89, score-0.204]
47 As mentioned by Müller [page 432, 25], the quantity DF(S, T) is a semimetric, and it is a metric if and only if the function class F separates the set of all signed measures with µ(Z) = 0. [sent-90, score-0.226]
48 By Definition 3.1, given a non-trivial function class F, the quantity DF(S, T) is equal to zero if the domains Z^(S) and Z^(T) have the same distribution. [sent-92, score-0.156]
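To illustrate Definition 3.1, here is a hedged sketch of a plug-in estimate of DF(S, T) from samples. The function classes below (a unit-ball linear class, for which the supremum has a closed form, and an arbitrary finite class) are illustrative choices only, not the classes analyzed in the paper.

    import numpy as np

    def ipm_linear_unit_ball(source, target):
        # Plug-in estimate of DF(S, T) for the illustrative class
        # F = {z -> <theta, z> : ||theta|| <= 1}; here the supremum has the
        # closed form ||mean(S) - mean(T)|| (norm of the mean discrepancy).
        return np.linalg.norm(source.mean(axis=0) - target.mean(axis=0))

    def ipm_finite_class(source, target, functions):
        # Plug-in estimate of DF(S, T) when F is a finite class; each f maps an
        # (n, d) sample array to n real values.
        return max(abs(np.mean(f(source)) - np.mean(f(target))) for f in functions)

    rng = np.random.default_rng(1)
    S = rng.normal(0.5, 1.0, size=(2000, 5))   # toy "source" sample
    T = rng.normal(0.0, 1.0, size=(4000, 5))   # toy "target" sample
    F = [lambda Z: np.tanh(Z[:, 0]),
         lambda Z: np.sin(Z).sum(axis=1),
         lambda Z: np.linalg.norm(Z, axis=1)]
    print(ipm_linear_unit_ball(S, T))
    print(ipm_finite_class(S, T, F))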
49 4 The Uniform Entropy Number Generally, the generalization bound of a certain learning process is achieved by incorporating a complexity measure of the function class, e.g., the covering number. [sent-94, score-0.258]
50 The results of this paper are based on the uniform entropy number that is derived from the concept of the covering number and we refer to [22] for more details. [sent-97, score-0.22]
51 The covering number of a function class F is defined as follows. [sent-98, score-0.107]
52 Definition 4.1. Let F be a function class and let d be a metric on F. [sent-99, score-0.097]
53 For any ξ > 0, the covering number of F at radius ξ with respect to the metric d, denoted by N(F, ξ, d), is the minimum size of a cover of F at radius ξ. [sent-100, score-0.15]
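As a small illustration of Definition 4.1 (not part of the paper), the sketch below upper-bounds a covering number by greedy covering when F is represented by its values on a finite sample and d is taken to be the empirical ℓ1 distance; the ridge-function class and every parameter are made up for the example.

    import numpy as np

    def greedy_covering_number(values, xi):
        # Greedy upper bound on N(F, xi, d): `values` is an (m, n) array whose
        # i-th row holds f_i evaluated at n sample points, and d(f, g) is the
        # empirical l1 distance (1/n) * sum_j |f(z_j) - g(z_j)|.
        m, _ = values.shape
        uncovered = set(range(m))
        centers = []
        while uncovered:
            i = next(iter(uncovered))               # pick any uncovered function as a center
            centers.append(i)
            dists = np.abs(values - values[i]).mean(axis=1)
            uncovered -= {j for j in range(m) if dists[j] <= xi}
        return len(centers)

    # toy usage: 200 ridge functions z -> tanh(<theta, z>) evaluated on 1000 points
    rng = np.random.default_rng(2)
    Z = rng.normal(size=(1000, 5))
    thetas = rng.normal(size=(200, 5))
    vals = np.tanh(Z @ thetas.T).T                  # shape (200, 1000)
    print(greedy_covering_number(vals, xi=0.1))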
54 Following [22], one can set d as the ℓ1(Z^N_1) norm and then derive the generalization bound of the i.i.d. [sent-103, score-0.162]
55 learning process by incorporating the expectation of the covering number, i.e., E N(F, ξ, ℓ1(Z^N_1)). [sent-106, score-0.176]
56 However, in the situation of domain adaptation, we only know the information of the source domains, while the expectation E N(F, ξ, ℓ1(Z^N_1)) depends on the distributions of both the source and target domains because z = (x, y). [sent-109, score-0.957]
57 Therefore, the covering number is no longer applicable to our framework to obtain the generalization bounds for domain adaptation. [sent-110, score-0.676]
58 By contrast, the uniform entropy number is distribution-free, and thus we choose it as the complexity measure of the function class to derive the generalization bounds for domain adaptation. [sent-111, score-0.763]
59 For any 1 ≤ k ≤ K, given a sample set Z^(k)_{Nk} := {z^(k)_n}_{n=1}^{Nk} drawn from Z^(Sk), we denote Z̄^(k)_{Nk} := {z̄^(k)_n}_{n=1}^{Nk} as the ghost-sample set drawn from Z^(Sk), such that the ghost sample z̄^(k)_n has the same distribution as z^(k)_n for any 1 ≤ n ≤ Nk and any 1 ≤ k ≤ K. [sent-113, score-0.14]
60 Moreover, given any f ∈ F and any w = (w1, · · · , wK) ∈ [0, 1]^K with Σ_{k=1}^K wk = 1, we introduce a variant of the ℓ1 norm: ‖f‖_{ℓ1^w({Z^{2Nk}_1}_{k=1}^K)} := Σ_{k=1}^K (wk/Nk) Σ_{n=1}^{Nk} ( |f(z^(k)_n)| + |f(z̄^(k)_n)| ). [sent-115, score-0.44]
61 In the situation of domain adaptation with multiple sources, setting the metric d as ℓ1^w({Z^{2Nk}_1}_{k=1}^K), we then define the uniform entropy number of F with respect to this metric as ln N1^w(F, ξ, 2Σ_{k=1}^K Nk) := sup_{{Z^{2Nk}_1}_{k=1}^K} ln N(F, ξ, ℓ1^w({Z^{2Nk}_1}_{k=1}^K)). (10) [sent-117, score-1.483]
62 5 Domain Adaptation with Multiple Sources In this section, we present the generalization bound for domain adaptation with multiple sources. [sent-118, score-1.024]
63 Based on the resultant bound, we then analyze the asymptotic convergence and the rate of convergence of the learning process for this kind of domain adaptation. [sent-119, score-0.897]
64 5.1 Generalization Bounds for Domain Adaptation with Multiple Sources Based on the aforementioned uniform entropy number, the generalization bound for domain adaptation with multiple sources is presented in the following theorem (Theorem 5.1). [sent-121, score-1.388]
65 Let w = (w1, · · · , wK) ∈ [0, 1]^K with Σ_{k=1}^K wk = 1. [sent-123, score-0.206]
66 In the above theorem, we show that the generalization bound sup_{f∈F} |E^(T) f − E^(S)_w f| can be bounded by the right-hand side of (11). [sent-125, score-0.216]
67 The two results will coincide if any source domain and the target domain match, i. [sent-130, score-1.049]
68 In order to prove this result, we develop the related Hoeffding-type deviation inequality and the symmetrization inequality for domain adaptation with multiple sources, respectively. [sent-133, score-1.219]
69 The detailed proof is provided in the supplement (part B). [sent-134, score-0.103]
70 By using the resultant bound (11), we can analyze the asymptotic behavior of the learning process for domain adaptation with multiple sources. [sent-135, score-1.219]
71 5.2 Asymptotic Convergence In statistical learning theory, it is well known that the complexity of the function class is the main factor determining the asymptotic convergence of the learning process under the assumption of same distribution (cf. [sent-137, score-0.314]
72 Theorem 5.1 directly leads to the following theorem on the asymptotic convergence of the learning process for domain adaptation with multiple sources. [sent-140, score-1.155]
73 Let w = (w1, · · · , wK) ∈ [0, 1]^K with Σ_{k=1}^K wk = 1. [sent-142, score-0.206]
74 If the following condition holds: lim_{N1,··· ,NK→+∞} ln N1^w(F, ξ'/8, 2Σ_{k=1}^K Nk) / ((Π_{k=1}^K Nk) / (32(b−a)^2 Σ_{k=1}^K wk^2 Π_{i≠k} Ni)) < +∞, (14) with ξ' = ξ − DF^(w)(S, T), [sent-143, score-0.205]
75 then we have, for any ξ > DF^(w)(S, T), lim_{N1,··· ,NK→+∞} Pr{ sup_{f∈F} |E^(T) f − E^(S)_w f| > ξ } = 0. (15) As shown in Theorem 5. [sent-144, score-0.231]
76 This is partially in accordance with the classical result on the asymptotic convergence of the learning process under the assumption of same distribution (cf. [sent-146, score-0.329]
77 [22]): the probability of the event that "sup_{f∈F} Ef − EN f > ξ" will converge to zero for any ξ > 0, if the uniform entropy number ln N1(F, ξ, N) satisfies lim_{N→+∞} ln N1(F, ξ, N)/N < +∞. (16) [sent-149, score-0.39]
78 Note that in the learning process of domain adaptation with multiple sources, the uniform convergence of the empirical risk on the source domains to the expected risk on the target domain may not hold, because the limit (15) does not hold for any ξ > 0 but only for any ξ > DF^(w)(S, T). [sent-150, score-1.921]
79 6 Numerical Experiments We have performed numerical experiments to verify the theoretical analysis of the asymptotic convergence of the learning process for domain adaptation with multiple sources. [sent-155, score-1.155]
80 In the experiments, there are two source domains and one target domain. [sent-158, score-0.32]
81 The experiment data are generated in the following way: for the target domain Z^(T) = X^(T) × Y^(T) ⊂ R^100 × R, we consider X^(T) as a Gaussian distribution N(0, 1) and draw {x^(T)_n}_{n=1}^{NT} (NT = 4000) from X^(T) randomly and independently. [sent-159, score-0.503]
82 We draw β and the noise term R from Gaussian distributions (with R ∼ N(0, 0.01)) respectively, and then generate y_n ∈ Y^(T) as follows: y^(T)_n = ⟨x^(T)_n, β⟩ + R. (18) [sent-163, score-0.162]
83 The derived {(x^(T)_n, y^(T)_n)}_{n=1}^{NT} (NT = 4000) are the samples of the target domain Z^(T) and will be used as the test data. [sent-164, score-0.584]
84 In a similar way, we derive the sample set {(x^(1)_n, y^(1)_n)}_{n=1}^{N1} (N1 = 2000) of the source domain Z^(S1) = X^(1) × Y^(1) ⊂ R^100 × R: for any 1 ≤ n ≤ N1, y^(1)_n = ⟨x^(1)_n, β⟩ + R, (19) where x^(1)_n ∼ N(0. [sent-165, score-0.74]
85 For the source domain Z^(S2) = X^(2) × Y^(2) ⊂ R^100 × R, the samples {(x^(2)_n, y^(2)_n)}_{n=1}^{N2} (N2 = 2000) are generated in the following way: for any 1 ≤ n ≤ N2, y^(2)_n = ⟨x^(2)_n, β⟩ + R, (20) where x^(2)_n ∼ N(2, 5), β ∼ N(1, 5) and R ∼ N(0, 0. [sent-168, score-0.74]
86 In this experiment, we use the method of least squares regression to minimize the empirical risk E^(S)_w(ℓ ◦ g) = (w/N1) Σ_{n=1}^{N1} ℓ(g(x^(1)_n), y^(1)_n) + ((1−w)/N2) Σ_{n=1}^{N2} ℓ(g(x^(2)_n), y^(2)_n) (21) for different combination coefficients w ∈ {0. [sent-170, score-0.252]
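The following Python sketch is a rough reconstruction of this experiment, not the authors' code: it fills in assumed values wherever the extraction truncates the paper's parameters (the source-1 input distribution, the source noise levels, and the grid of w), reads N(µ, σ²) as mean and variance, and reports the discrepancy between the weighted empirical source risk and the empirical target risk for several values of w.

    import numpy as np

    rng = np.random.default_rng(3)
    d, N1, N2, NT = 100, 2000, 2000, 4000
    beta = rng.normal(1.0, np.sqrt(5.0), size=d)        # beta ~ N(1, 5), read as mean/variance

    def make_domain(mean, std, n, noise_std):
        X = rng.normal(mean, std, size=(n, d))
        y = X @ beta + rng.normal(0.0, noise_std, size=n)
        return X, y

    # illustrative parameters; the paper's exact values are truncated in the extraction
    X1, y1 = make_domain(0.5, 1.0, N1, 0.1)             # source domain 1 (assumed)
    X2, y2 = make_domain(2.0, np.sqrt(5.0), N2, 0.1)    # source domain 2, x ~ N(2, 5)
    XT, yT = make_domain(0.0, 1.0, NT, 0.1)             # target domain, x ~ N(0, 1) (test data)

    def weighted_least_squares(w):
        # minimize (w/N1) * sum l(g(x1), y1) + ((1-w)/N2) * sum l(g(x2), y2), squared loss,
        # by rescaling rows with the square roots of the per-sample weights.
        sw = np.concatenate([np.full(N1, w / N1), np.full(N2, (1.0 - w) / N2)])
        Xw = np.vstack([X1, X2]) * np.sqrt(sw)[:, None]
        yw = np.concatenate([y1, y2]) * np.sqrt(sw)
        g, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
        return g

    for w in (0.1, 0.3, 0.5, 0.7, 0.9):
        g = weighted_least_squares(w)
        src = (w / N1) * np.sum((X1 @ g - y1) ** 2) + ((1 - w) / N2) * np.sum((X2 @ g - y2) ** 2)
        tgt = np.mean((XT @ g - yT) ** 2)
        print(f"w={w:.1f}  |E_w^(S) - E_NT^(T)| = {abs(src - tgt):.4f}")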
87 When w = 0.5, the discrepancy |Ew f − ENT f| has the fastest rate of convergence, and the rate becomes slower as w moves further away from 0.5. [sent-185, score-0.169]
88 Recalling (17), we have shown that w = N2 /(N1 + N2 ) will provide the fastest rate of convergence and this proposition is supported by the experiment results shown in Fig. [sent-189, score-0.129]
89 7 Conclusion In this paper, we present a new framework to study the generalization bounds of the learning process for domain adaptation. [sent-191, score-0.659]
90 We use the integral probability metric to measure the difference between the distributions of two domains. [sent-192, score-0.236]
91 Then, we use a martingale method to develop the specific deviation inequality and the symmetrization inequality incorporating the integral probability metric. [sent-193, score-0.528]
92 Next, we utilize the resultant deviation inequality and symmetrization inequality to derive the generalization bound based on the uniform entropy number. [sent-194, score-0.749]
93 By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. [sent-195, score-0.995]
94 (Figure 1: Domain Adaptation with Multiple Sources; the horizontal axis is N1 + N2.) We point out that the asymptotic convergence of the learning process is determined by the complexity of the function class F measured by the uniform entropy number. [sent-206, score-0.428]
95 This is partially in accordance with the classical result on the asymptotic convergence of the learning process under the assumption of same distribution (cf. [sent-207, score-0.329]
96 Moreover, the rate of convergence of this learning process is equal to that of the learning process under the assumption of same distribution. [sent-211, score-0.245]
97 Following [22], the generalization bound (11) can lead to a result based on the fat-shattering dimension. [sent-216, score-0.162]
98 In our future work, we will attempt to find a new distance between two distributions and develop the generalization bounds based on other complexity measures, e.g. [sent-220, score-0.229]
99 Rademacher complexities, and analyze other theoretical properties of domain adaptation. [sent-222, score-0.428]
100 Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. [sent-277, score-0.405]
wordName wordTfidf (topN-words)
[('adaptation', 0.405), ('domain', 0.396), ('df', 0.268), ('nk', 0.244), ('gw', 0.207), ('wk', 0.206), ('sources', 0.195), ('symmetrization', 0.188), ('sk', 0.156), ('source', 0.15), ('asymptotic', 0.133), ('ew', 0.125), ('generalization', 0.123), ('target', 0.107), ('znk', 0.104), ('supplement', 0.103), ('ln', 0.1), ('resultant', 0.09), ('wortman', 0.084), ('integral', 0.084), ('mansour', 0.082), ('yn', 0.081), ('entropy', 0.08), ('covering', 0.08), ('sup', 0.08), ('blitzer', 0.079), ('bounds', 0.077), ('discrepancy', 0.077), ('zn', 0.074), ('crammer', 0.072), ('ent', 0.071), ('metric', 0.07), ('quantity', 0.066), ('convergence', 0.065), ('risk', 0.065), ('en', 0.064), ('domains', 0.063), ('rostamizadeh', 0.063), ('process', 0.063), ('multiple', 0.061), ('situation', 0.061), ('uniform', 0.06), ('inequalities', 0.059), ('pereira', 0.059), ('inequality', 0.058), ('nt', 0.055), ('supf', 0.054), ('noteworthy', 0.054), ('deviation', 0.053), ('rademacher', 0.05), ('quantities', 0.046), ('mohri', 0.042), ('accordance', 0.042), ('nanjing', 0.042), ('bousquet', 0.04), ('bound', 0.039), ('differs', 0.039), ('kinds', 0.038), ('vc', 0.038), ('sriperumbudur', 0.036), ('fastest', 0.036), ('measures', 0.035), ('mendelson', 0.034), ('incorporating', 0.033), ('drawn', 0.033), ('xn', 0.032), ('theorem', 0.032), ('analyze', 0.032), ('numerical', 0.032), ('vapnik', 0.032), ('learnability', 0.031), ('nyi', 0.031), ('kulesza', 0.031), ('distributions', 0.029), ('aforementioned', 0.029), ('meanwhile', 0.029), ('martingale', 0.029), ('metrics', 0.029), ('termed', 0.028), ('ller', 0.028), ('difference', 0.028), ('variant', 0.028), ('rate', 0.028), ('aimed', 0.027), ('kearns', 0.027), ('class', 0.027), ('meeting', 0.026), ('assumption', 0.026), ('part', 0.026), ('page', 0.025), ('probability', 0.025), ('kind', 0.025), ('combining', 0.025), ('lim', 0.025), ('recalling', 0.025), ('empirical', 0.025), ('labeling', 0.024), ('acl', 0.024), ('linguistics', 0.024), ('minimizes', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 142 nips-2012-Generalization Bounds for Domain Adaptation
Author: Chao Zhang, Lei Zhang, Jieping Ye
Abstract: In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. We consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. In particular, we use the integral probability metric to measure the difference between two domains. Then, we develop the specific Hoeffding-type deviation inequality and symmetrization inequality for either kind of domain adaptation to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results. 1
2 0.24319728 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
Author: Sander M. Bohte
Abstract: Neural adaptation underlies the ability of neurons to maximize encoded information over a wide dynamic range of input stimuli. Recent spiking neuron models like the adaptive Spike Response Model implement adaptation as additive fixed-size fast spike-triggered threshold dynamics and slow spike-triggered currents. Such adaptation accurately models neural spiking behavior over a limited dynamic input range. To extend efficient coding over large changes in dynamic input range, we propose a multiplicative adaptive Spike Response Model where the spike-triggered adaptation dynamics are scaled multiplicatively by the adaptation state at the time of spiking. We show that, unlike the additive adaptation model, the firing rate in our multiplicative adaptation model saturates to a realistic maximum spike-rate regardless of input magnitude. Additionally, when simulating variance switching experiments, the model quantitatively fits experimental data over a wide dynamic range. Dynamic threshold models of adaptation furthermore suggest a straightforward interpretation of neural activity in terms of dynamic differential signal encoding with shifted and weighted exponential kernels. We show that when thus encoding rectified filtered stimulus signals, the multiplicative adaptive Spike Response Model achieves a high coding efficiency and maintains this efficiency over changes in the dynamic signal range of several orders of magnitude, without changing model parameters. 1
3 0.18896498 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video
Author: Kevin Tang, Vignesh Ramanathan, Li Fei-fei, Daphne Koller
Abstract: Typical object detectors trained on images perform poorly on video, as there is a clear distinction in domain between the two types of data. In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos. We treat the problem as one of unsupervised domain adaptation, in which we are given labeled data from the source domain (image), but only unlabeled data from the target domain (video). Our approach, self-paced domain adaptation, seeks to iteratively adapt the detector by re-training the detector with automatically discovered target domain examples, starting with the easiest first. At each iteration, the algorithm adapts by considering an increased number of target domain examples, and a decreased number of source domain examples. To discover target domain examples from the vast amount of video data, we introduce a simple, robust approach that scores trajectory tracks instead of bounding boxes. We also show how rich and expressive features specific to the target domain can be incorporated under the same framework. We show promising results on the 2011 TRECVID Multimedia Event Detection [1] and LabelMe Video [2] datasets that illustrate the benefit of our approach to adapt object detectors to video. 1
4 0.14971308 36 nips-2012-Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions
Author: Alexandra Carpentier, Rémi Munos
Abstract: We consider the problem of adaptive stratified sampling for Monte Carlo integration of a differentiable function given a finite number of evaluations to the function. We construct a sampling scheme that samples more often in regions where the function oscillates more, while allocating the samples such that they are well spread on the domain (this notion shares similitude with low discrepancy). We prove that the estimate returned by the algorithm is almost similarly accurate as the estimate that an optimal oracle strategy (that would know the variations of the function everywhere) would return, and provide a finite-sample analysis. 1
5 0.11719113 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
Author: Guillermo Canas, Lorenzo Rosasco
Abstract: We study the problem of estimating, in the sense of optimal transport metrics, a measure which is assumed supported on a manifold embedded in a Hilbert space. By establishing a precise connection between optimal transport metrics, optimal quantization, and learning theory, we derive new probabilistic bounds for the performance of a classic algorithm in unsupervised learning (k-means), when used to produce a probability measure derived from the data. In the course of the analysis, we arrive at new lower bounds, as well as probabilistic upper bounds on the convergence rate of empirical to population measures, which, unlike existing bounds, are applicable to a wide class of measures. 1
6 0.11561385 328 nips-2012-Submodular-Bregman and the Lovász-Bregman Divergences with Applications
7 0.11198998 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
8 0.11148499 60 nips-2012-Bayesian nonparametric models for ranked data
9 0.10484522 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
10 0.10480824 5 nips-2012-A Conditional Multinomial Mixture Model for Superset Label Learning
11 0.099623948 179 nips-2012-Learning Manifolds with K-Means and K-Flats
12 0.094327845 324 nips-2012-Stochastic Gradient Descent with Only One Projection
13 0.086731337 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
14 0.085901693 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization
15 0.075053319 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
16 0.070295803 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
17 0.069854759 361 nips-2012-Volume Regularization for Binary Classification
18 0.068060949 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
19 0.067491427 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
20 0.065554649 262 nips-2012-Optimal Neural Tuning Curves for Arbitrary Stimulus Distributions: Discrimax, Infomax and Minimum $L p$ Loss
topicId topicWeight
[(0, 0.17), (1, -0.003), (2, 0.055), (3, -0.007), (4, 0.027), (5, 0.07), (6, 0.053), (7, 0.117), (8, 0.039), (9, -0.007), (10, 0.018), (11, 0.007), (12, -0.016), (13, -0.162), (14, -0.033), (15, -0.062), (16, -0.043), (17, -0.016), (18, -0.011), (19, 0.037), (20, 0.061), (21, -0.037), (22, 0.126), (23, -0.011), (24, 0.046), (25, 0.12), (26, 0.061), (27, 0.131), (28, -0.083), (29, -0.128), (30, 0.118), (31, 0.225), (32, -0.088), (33, -0.25), (34, -0.183), (35, 0.019), (36, -0.006), (37, -0.092), (38, 0.008), (39, -0.084), (40, -0.021), (41, 0.058), (42, 0.12), (43, 0.11), (44, -0.014), (45, 0.103), (46, -0.002), (47, 0.062), (48, -0.017), (49, -0.079)]
simIndex simValue paperId paperTitle
same-paper 1 0.97838062 142 nips-2012-Generalization Bounds for Domain Adaptation
Author: Chao Zhang, Lei Zhang, Jieping Ye
Abstract: In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. We consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. In particular, we use the integral probability metric to measure the difference between two domains. Then, we develop the specific Hoeffding-type deviation inequality and symmetrization inequality for either kind of domain adaptation to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results. 1
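The entry above measures the difference between two domains with an integral probability metric (IPM). As a concrete illustration, the sketch below computes one widely used IPM instance, the maximum mean discrepancy (MMD) with an RBF kernel, between source- and target-domain samples; the kernel, its bandwidth gamma, and the toy data are assumptions made for illustration only, not the construction used in the paper.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2).
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(source, target, gamma=1.0):
    # Biased (V-statistic) estimate of squared MMD between the source- and
    # target-domain samples: one concrete instance of an integral probability
    # metric, taken over the unit ball of the RBF kernel's RKHS.
    k_ss = rbf_kernel(source, source, gamma)
    k_tt = rbf_kernel(target, target, gamma)
    k_st = rbf_kernel(source, target, gamma)
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()

# Toy usage: two domains whose means differ slightly.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 5))
target = rng.normal(0.5, 1.0, size=(200, 5))
print(mmd2(source, target, gamma=0.5))
```

A small value indicates the two domains are close with respect to the chosen function class, which is the kind of domain-difference quantity the bounds described above depend on.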
2 0.77169335 36 nips-2012-Adaptive Stratified Sampling for Monte-Carlo integration of Differentiable functions
Author: Alexandra Carpentier, Rémi Munos
Abstract: We consider the problem of adaptive stratified sampling for Monte Carlo integration of a differentiable function given a finite number of evaluations to the function. We construct a sampling scheme that samples more often in regions where the function oscillates more, while allocating the samples such that they are well spread on the domain (this notion shares similitude with low discrepancy). We prove that the estimate returned by the algorithm is almost similarly accurate as the estimate that an optimal oracle strategy (that would know the variations of the function everywhere) would return, and provide a finite-sample analysis. 1
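The entry above allocates Monte Carlo samples according to how much the integrand oscillates locally. The toy sketch below conveys that idea with a pilot-then-allocate scheme on [0, 1]; the stratum boundaries, pilot size, and proportional-to-standard-deviation allocation are illustrative assumptions, not the authors' algorithm or its guarantees.

```python
import numpy as np

def adaptive_stratified_mc(f, n_strata=10, n_pilot=5, n_total=1000, seed=0):
    # Toy adaptive stratified Monte Carlo on [0, 1]: a small pilot sample per
    # stratum estimates the local standard deviation of f, and the remaining
    # budget is allocated proportionally, so oscillating regions get more samples.
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_strata + 1)
    pilot_std = np.empty(n_strata)
    for i in range(n_strata):
        x = rng.uniform(edges[i], edges[i + 1], n_pilot)
        pilot_std[i] = f(x).std()
    weights = pilot_std + 1e-12                      # avoid an all-zero allocation
    budget = n_total - n_strata * n_pilot
    alloc = np.maximum(1, np.round(budget * weights / weights.sum())).astype(int)
    estimate = 0.0
    for i in range(n_strata):
        x = rng.uniform(edges[i], edges[i + 1], alloc[i])
        estimate += (edges[i + 1] - edges[i]) * f(x).mean()
    return estimate

# Example: an integrand that oscillates much more near x = 1 than near x = 0.
print(adaptive_stratified_mc(lambda x: np.sin(20.0 * x * x), n_total=2000))
```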
3 0.61626202 184 nips-2012-Learning Probability Measures with respect to Optimal Transport Metrics
Author: Guillermo Canas, Lorenzo Rosasco
Abstract: We study the problem of estimating, in the sense of optimal transport metrics, a measure which is assumed supported on a manifold embedded in a Hilbert space. By establishing a precise connection between optimal transport metrics, optimal quantization, and learning theory, we derive new probabilistic bounds for the performance of a classic algorithm in unsupervised learning (k-means), when used to produce a probability measure derived from the data. In the course of the analysis, we arrive at new lower bounds, as well as probabilistic upper bounds on the convergence rate of empirical to population measures, which, unlike existing bounds, are applicable to a wide class of measures. 1 Introduction and Motivation In this paper we study the problem of learning from random samples a probability distribution supported on a manifold, when the learning error is measured using transportation metrics. The problem of learning a probability distribution is classic in statistics, and is typically analyzed for distributions in X = Rd that have a density with respect to the Lebesgue measure, with total variation, and L2 among the common distances used to measure closeness of two densities (see for instance [10, 32] and references therein.) The setting in which the data distribution is supported on a low dimensional manifold embedded in a high dimensional space has only been considered more recently. In particular, kernel density estimators on manifolds have been described in [36], and their pointwise consistency, as well as convergence rates, have been studied in [25, 23, 18]. A discussion on several topics related to statistics on a Riemannian manifold can be found in [26]. Interestingly, the problem of approximating measures with respect to transportation distances has deep connections with the fields of optimal quantization [14, 16], optimal transport [35] and, as we point out in this work, with unsupervised learning (see Sec. 4.) In fact, as described in the sequel, some of the most widely-used algorithms for unsupervised learning, such as k-means (but also others such as PCA and k-flats), can be shown to be performing exactly the task of estimating the data-generating measure in the sense of the 2-Wasserstein distance. This close relation between learning theory, and optimal transport and quantization seems novel and of interest in its own right. Indeed, in this work, techniques from the above three fields are used to derive the new probabilistic bounds described below. Our technical contribution can be summarized as follows: (a) we prove uniform lower bounds for the distance between a measure and estimates based on discrete sets (such as the empirical measure or measures derived from algorithms such as kmeans); (b) we provide new probabilistic bounds for the rate of convergence of empirical to population measures which, unlike existing probabilistic bounds, hold for a very large class of measures; 1 (c) we provide probabilistic bounds for the rate of convergence of measures derived from k-means to the data measure. The structure of the paper is described at the end of Section 2, where we discuss the exact formulation of the problem as well as related previous works. 2 Setup and Previous work Consider the problem of learning a probability measure ρ supported on a space M, from an i.i.d. sample Xn = (x1 , . . . , xn ) ∼ ρn of size n. 
We assume M to be a compact, smooth d-dimensional manifold of bounded curvature, with C 1 metric and volume measure λM , embedded in the unit ball of a separable Hilbert space X with inner product ·, · , induced norm · , and distance d (for d instance M = B2 (1) the unit ball in X = Rd .) Following [35, p. 94], let Pp (M) denote the Wasserstein space of order 1 ≤ p < ∞: Pp (M) := x p dρ(x) < ∞ ρ ∈ P (M) : M of probability measures P (M) supported on M, with finite p-th moment. The p-Wasserstein distance 1/p Wp (ρ, µ) = inf [E X − Y p ] : Law(X) = ρ, Law(Y ) = µ (1) X,Y where the random variables X and Y are distributed according to ρ and µ respectively, is the optimal expected cost of transporting points generated from ρ to those generated from µ, and is guaranteed to be finite in Pp (M) [35, p. 95]. The space Pp (M) with the Wp metric is itself a complete separable metric space [35]. We consider here the problem of learning probability measures ρ ∈ P2 (M), where the performance is measured by the distance W2 . There are many possible choices of distances between probability measures [13]. Among them, Wp metrizes weak convergence (see [35] theorem 6.9), that is, in Pp (M), a sequence (µi )i∈N of measures converges weakly to µ iff Wp (µi , µ) → 0 and their p-th order moments converge to that of µ. There are other distances, such as the L´ vy-Prokhorov, or the weak-* distance, that also metrize e weak convergence. However, as pointed out by Villani in his excellent monograph [35, p. 98], 1. “Wasserstein distances are rather strong, [...]a definite advantage over the weak-* distance”. 2. “It is not so difficult to combine information on convergence in Wasserstein distance with some smoothness bound, in order to get convergence in stronger distances.” Wasserstein distances have been used to study the mixing and convergence of Markov chains [22], as well as concentration of measure phenomena [20]. To this list we would add the important fact that existing and widely-used algorithms for unsupervised learning can be easily extended (see Sec. 4) to compute a measure ρ that minimizes the distance W2 (ˆn , ρ ) to the empirical measure ρ n ρn := ˆ 1 δx , n i=1 i a fact that will allow us to prove, in Sec. 5, bounds on the convergence of a measure induced by k-means to the population measure ρ. The most useful versions of Wasserstein distance are p = 1, 2, with p = 1 being the weaker of the two (by H¨ lder’s inequality, p ≤ q ⇒ Wp ≤ Wq .) In particular, “results in W2 distance are usually o stronger, and more difficult to establish than results in W1 distance” [35, p. 95]. A discussion of p = ∞ would take us out of topic, since its behavior is markedly different. 2.1 Closeness of Empirical and Population Measures By the strong law of large numbers, the empirical measure converges almost surely to the population measure: ρn → ρ in the sense of the weak topology [34]. Since weak convergence and convergence ˆ in Wp plus convergence of p-th moments are equivalent in Pp (M), this means that, in the Wp sense, the empirical measure ρn converges to ρ, as n → ∞. A fundamental question is therefore how fast ˆ the rate of convergence of ρn → ρ is. ˆ 2 2.1.1 Convergence in expectation The rate of convergence of ρn → ρ in expectation has been widely studied in the past, resultˆ ing in upper bounds of order EW2 (ρ, ρn ) = O(n−1/(d+2) ) [19, 8], and lower bounds of order ˆ EW2 (ρ, ρn ) = Ω(n−1/d ) [29] (both assuming that the absolutely continuous part of ρ is ρA = 0, ˆ with possibly better rates otherwise). 
More recently, an upper bound of order EWp (ρ, ρn ) = O(n−1/d ) has been proposed [2] by proving ˆ a bound for the Optimal Bipartite Matching (OBM) problem [1], and relating this problem to the expected distance EWp (ρ, ρn ). In particular, given two independent samples Xn , Yn , the OBM ˆ problem is that of finding a permutation σ that minimizes the matching cost n−1 xi −yσ(i) p [24, p ˆ ˆ ˆ 30]. It is not hard to show that the optimal matching cost is Wp (ˆXn , ρYn ) , where ρXn , ρYn are ρ the empirical measures associated to Xn , Yn . By Jensen’s inequality, the triangle inequality, and (a + b)p ≤ 2p−1 (ap + bp ), it holds EWp (ρ, ρn )p ≤ EWp (ˆXn , ρYn )p ≤ 2p−1 EWp (ρ, ρn )p , ˆ ρ ˆ ˆ and therefore a bound of order O(n−p/d ) for the OBM problem [2] implies a bound EWp (ρ, ρn ) = ˆ O(n−1/d ). The matching lower bound is only known for a special case: ρA constant over a bounded set of non-null measure [2] (e.g. ρA uniform.) Similar results, with matching lower bounds are found for W1 in [11]. 2.1.2 Convergence in probability Results for convergence in probability, one of the main results of this work, appear to be considerably harder to obtain. One fruitful avenue of analysis has been the use of so-called transportation, or Talagrand inequalities Tp , which can be used to prove concentration inequalities on Wp [20]. In particular, we say that ρ satisfies a Tp (C) inequality with C > 0 iff Wp (ρ, µ)2 ≤ CH(µ|ρ), ∀µ ∈ Pp (M), where H(·|·) is the relative entropy [20]. As shown in [6, 5], it is possible to obtain probabilistic upper bounds on Wp (ρ, ρn ), with p = 1, 2, if ρ is known to satisfy a Tp inequality ˆ of the same order, thereby reducing the problem of bounding Wp (ρ, ρn ) to that of obtaining a Tp ˆ inequality. Note that, by Jensen’s inequality, and as expected from the behavior of Wp , the inequality T2 is stronger than T1 [20]. While it has been shown that ρ satisfies a T1 inequality iff it has a finite square-exponential moment 2 (E[eα x ] finite for some α > 0) [4, 7], no such general conditions have been found for T2 . As an example, consider that, if M is compact with diameter D then, by theorem 6.15 of [35], and the celebrated Csisz´ r-Kullback-Pinsker inequality [27], for all ρ, µ ∈ Pp (M), it is a Wp (ρ, µ)2p ≤ (2D)2p ρ − µ where · does not. TV 2 TV ≤ 22p−1 D2p H(µ|ρ), is the total variation norm. Clearly, this implies a Tp=1 inequality, but for p ≥ 2 it The T2 inequality has been shown by Talagrand to be satisfied by the Gaussian distribution [31], and then slightly more generally by strictly log-concave measures (see [20, p. 123], and [3].) However, as noted in [6], “contrary to the T1 case, there is no hope to obtain T2 inequalities from just integrability or decay estimates.” Structure of this paper. In this work we obtain bounds in probability (learning rates) for the problem of learning a probability measure in the sense of W2 . We begin by establishing (lower) bounds for the convergence of empirical to population measures, which serve to set up the problem and introduce the connection between quantization and measure learning (sec. 3.) We then describe how existing unsupervised learning algorithms that compute a set (k-means, k-flats, PCA,. . . ) can be easily extended to produce a measure (sec. 4.) Due to its simplicity and widespread use, we focus here on k-means. 
Since the two measure estimates that we consider are the empirical measure, and the measure induced by k-means, we next set out to prove upper bounds on their convergence to the data-generating measure (sec. 5.) We arrive at these bounds by means of intermediate measures, which are related to the problem of optimal quantization. The bounds apply in a very broad setting (unlike existing bounds based on transportation inequalities, they are not restricted to log-concave measures [20, 3].) 3 3 Learning probability measures, optimal transport and quantization We address the problem of learning a probability measure ρ when the only observations we have at our disposal are n i.i.d. samples Xn = (x1 , . . . , xn ). We begin by establishing some notation and useful intermediate results. Given a closed set S ⊆ X , let {Vq : q ∈ S} be a Borel Voronoi partition of X composed of sets Vq closest to each q ∈ S, that is, such that each Vq ⊆ {x ∈ X : x − q = minr∈S x − r } is measurable (see for instance [15].) Consider the projection function πS : X → S mapping each x ∈ Vq to q. By virtue of {Vq }q∈S being a Borel Voronoi partition, the map πS is measurable [15], and it is d (x, πS (x)) = minq∈S x − q for all x ∈ X . For any ρ ∈ Pp (M), let πS ρ be the pushforward, or image measure of ρ under the mapping πS , −1 which is defined to be (πS ρ)(A) := ρ(πS (A)) for all Borel measurable sets A. From its definition, it is clear that πS ρ is supported on S. We now establish a connection between the expected distance to a set S, and the distance between ρ and the set’s induced pushforward measure. Notice that, for discrete sets S, the expected Lp distance to S is exactly the expected quantization error Ep,ρ (S) := Ex∼ρ d(x, S)p = Ex∼ρ x − πS (x) p incurred when encoding points x drawn from ρ by their closest point πS (x) in S [14]. This close connection between optimal quantization and Wasserstein distance has been pointed out in the past in the statistics [28], optimal quantization [14, p. 33], and approximation theory [16] literatures. The following two lemmas are key tools in the reminder of the paper. The first highlights the close link between quantization and optimal transport. Lemma 3.1. For closed S ⊆ X , ρ ∈ Pp (M), 1 ≤ p < ∞, it holds Ex∼ρ d(x, S)p = Wp (ρ, πS ρ)p . Note that the key element in the above lemma is that the two measures in the expression Wp (ρ, πS ρ) must match. When there is a mismatch, the distance can only increase. That is, Wp (ρ, πS µ) ≥ Wp (ρ, πS ρ) for all µ ∈ Pp (M). In fact, the following lemma shows that, among all the measures with support in S, πS ρ is closest to ρ. Lemma 3.2. For closed S ⊆ X , and all µ ∈ Pp (M) with supp(µ) ⊆ S, 1 ≤ p < ∞, it holds Wp (ρ, µ) ≥ Wp (ρ, πS ρ). When combined, lemmas 3.1 and 3.2 indicate that the behavior of the measure learning problem is limited by the performance of the optimal quantization problem. For instance, Wp (ρ, ρn ) can only ˆ be, in the best-case, as low as the optimal quantization cost with codebook of size n. The following section makes this claim precise. 3.1 Lower bounds Consider the situation depicted in fig. 1, in which a sample X4 = {x1 , x2 , x3 , x4 } is drawn from a distribution ρ which we assume here to be absolutely continuous on its support. As shown, the projection map πX4 sends points x to their closest point in X4 . The resulting Voronoi decomposition of supp(ρ) is drawn in shades of blue. 
By lemma 5.2 of [9], the pairwise intersections of Voronoi regions have null ambient measure, and since ρ is absolutely continuous, the pushforward measure 4 can be written in this case as πX4 ρ = j=1 ρ(Vxj )δxj , where Vxj is the Voronoi region of xj . Note that, even for finite sets S, this particular decomposition is not always possible if the {Vq }q∈S form a Borel Voronoi tiling, instead of a Borel Voronoi partition. If, for instance, ρ has an atom falling on two Voronoi regions in a tiling, then both regions would count the atom as theirs, and double-counting would imply q ρ(Vq ) > 1. The technicalities required to correctly define a Borel Voronoi partition are such that, in general, it is simpler to write πS ρ, even though (if S is discrete) this measure can clearly be written as a sum of deltas with appropriate masses. By lemma 3.1, the distance Wp (ρ, πX4 ρ)p is the (expected) quantization cost of ρ when using X4 as codebook. Clearly, this cost can never be lower than the optimal quantization cost of size 4. This reasoning leads to the following lower bound between empirical and population measures. 4 Theorem 3.3. For ρ ∈ Pp (M) with absolutely continuous part ρA = 0, and 1 ≤ p < ∞, it holds Wp (ρ, ρn ) = Ω(n−1/d ) uniformly over ρn , where the constants depend on d and ρA only. ˆ ˆ Proof: Let Vn,p (ρ) := inf S⊂M,|S|=n Ex∼ρ d(x, S)p be the optimal quantization cost of ρ of order p with n centers. Since ρA = 0, and since ρ has a finite (p + δ)-th order moment, for some δ > 0 (since it is supported on the unit ball), then it is Vn,p (ρ) = Θ(n−p/d ), with constants depending on d and ρA (see [14, p. 78] and [16].) Since supp(ˆn ) = Xn , it follows that ρ Wp (ρ, ρn )p ˆ ≥ lemma 3.2 Wp (ρ, πXn ρ)p = lemma 3.1 Ex∼ρ d(x, Xn )p ≥ Vn,p (ρ) = Θ(n−p/d ) Note that the bound of theorem 3.3 holds for ρn derived from any sample Xn , and is therefore ˆ stronger than the existing lower bounds on the convergence rates of EWp (ρ, ρn ) → 0. In particular, ˆ it trivially induces the known lower bound Ω(n−1/d ) on the rate of convergence in expectation. 4 Unsupervised learning algorithms for learning a probability measure As described in [21], several of the most widely used unsupervised learning algorithms can be ˆ interpreted to take as input a sample Xn and output a set Sk , where k is typically a free parameter of the algorithm, such as the number of means in k-means1 , the dimension of affine spaces in PCA, n ˆ etc. Performance is measured by the empirical quantity n−1 i=1 d(xi , Sk )2 , which is minimized among all sets in some class (e.g. sets of size k, affine spaces of dimension k,. . . ) This formulation is general enough to encompass k-means and PCA, but also k-flats, non-negative matrix factorization, and sparse coding (see [21] and references therein.) Using the discussion of Sec. 3, we can establish a clear connection between unsupervised learning and the problem of learning probability measures with respect to W2 . Consider as a running example the k-means problem, though the argument is general. Given an input Xn , the k-means problem is ˆ ˆ to find a set |Sk | = k minimizing its average distance from points in Xn . By associating to Sk the pushforward measure πSk ρn , we find that ˆ ˆ 1 n n ˆ ˆ d(xi , Sk )2 = Ex∼ρn d(x, Sk )2 ˆ i=1 = lemma 3.1 W2 (ˆn , πSk ρn )2 . ρ ˆ ˆ (2) Since k-means minimizes equation 2, it also finds the measure that is closest to ρn , among those ˆ with support of size k. 
This connection between k-means and W2 measure approximation was, to the best of the authors’ knowledge, first suggested by Pollard [28] though, as mentioned earlier, the argument carries over to many other unsupervised learning algorithms. Unsupervised measure learning algorithms. We briefly clarify the steps involved in using an existing unsupervised learning algorithm for probability measure learning. Let Uk be a parametrized algorithm (e.g. k-means) that takes a sample Xn and outputs a set Uk (Xn ). The measure learning algorithm Ak : Mn → Pp (M) corresponding to Uk is defined as follows: ˆ 1. Ak takes a sample Xn and outputs the measure πSk ρn , supported on Sk = Uk (Xn ); ˆ ˆ 2. since ρn is discrete, then so must πSk ρn be, and thus Ak (Xn ) = ˆ ˆ ˆ 1 n n ˆ i=1 δπSk (xi ) ; 3. in practice, we can simply store an n-vector πSk (x1 ), . . . , πSk (xn ) , from which Ak (Xn ) ˆ ˆ can be reconstructed by placing atoms of mass 1/n at each point. In the case that Uk is the k-means algorithm, only k points and k masses need to be stored. Note that any algorithm A that attempts to output a measure A (Xn ) close to ρn can be cast in the ˆ above framework. Indeed, if S is the support of A (Xn ) then, by lemma 3.2, πS ρn is the measure ˆ closest to ρn with support in S . This effectively reduces the problem of learning a measure to that of ˆ 1 In a slight abuse of notation, we refer to the k-means algorithm here as an ideal algorithm that solves the k-means problem, even though in practice an approximation algorithm may be used. 5 finding a set, and is akin to how the fact that every optimal quantizer is a nearest-neighbor quantizer (see [15], [12, p. 350], and [14, p. 37–38]) reduces the problem of finding an optimal quantizer to that of finding an optimal quantizing set. Clearly, the minimum of equation 2 over sets of size k (the output of k-means) is monotonically ˆ ˆ non-increasing with k. In particular, since Sn = Xn and πSn ρn = ρn , it is Ex∼ρn d(x, Sn )2 = ˆ ˆ ˆ ˆ 2 W2 (ˆn , πSn ρn ) = 0. That is, we can always make the learned measure arbitrarily close to ρn ρ ˆ ˆ ˆ by increasing k. However, as pointed out in Sec. 2, the problem of measure learning is concerned with minimizing the 2-Wasserstein distance W2 (ρ, πSk ρn ) to the data-generating measure. The ˆ ˆ actual performance of k-means is thus not necessarily guaranteed to behave in the same way as the empirical one, and the question of characterizing its behavior as a function of k and n naturally arises. ˆ Finally, we note that, while it is Ex∼ρn d(x, Sk )2 = W2 (ˆn , πSk ρn )2 (the empirical performances ρ ˆ ˆ ˆ are the same in the optimal quantization, and measure learning problem formulations), the actual performances satisfy ˆ Ex∼ρ d(x, Sk )2 = W2 (ρ, π ˆ ρ)2 ≤ W2 (ρ, π ˆ ρn )2 , 1 ≤ k ≤ n. ˆ lemma 3.1 Sk lemma 3.2 Sk Consequently, with the identification between sets S and measures πS ρn , the measure learning ˆ problem is, in general, harder than the set-approximation problem (for example, if M = Rd and ρ is absolutely continuous over a set of non-null volume, it is not hard to show that the inequality is ˆ almost surely strict: Ex∼ρ d(x, Sk )2 < W2 (ρ, πSk ρn )2 for 1 < k < n.) ˆ ˆ In the remainder, we characterize the performance of k-means on the measure learning problem, for varying k, n. 
Although other unsupervised learning algorithms could have been chosen as basis for our analysis, k-means is one of the oldest and most widely used, and the one for which the deep connection between optimal quantization and measure approximation is most clearly manifested. Note that, by setting k = n, our analysis includes the problem of characterizing the behavior of the distance W2 (ρ, ρn ) between empirical and population measures which, as indicated in Sec. 2.1, ˆ is a fundamental question in statistics (i.e. the speed of convergence of empirical to population measures.) 5 Learning rates In order to analyze the performance of k-means as a measure learning algorithm, and the convergence of empirical to population measures, we propose the decomposition shown in fig. 2. The diagram includes all the measures considered in the paper, and shows the two decompositions used to prove upper bounds. The upper arrow (green), illustrates the decomposition used to bound the distance W2 (ρ, ρn ). This decomposition uses the measures πSk ρ and πSk ρn as intermediates to arrive ˆ ˆ at ρn , where Sk is a k-point optimal quantizer of ρ, that is, a set Sk minimizing Ex∼ρ d(x, S)2 over ˆ all sets of size |S| = k. The lower arrow (blue) corresponds to the decomposition of W2 (ρ, πSk ρn ) ˆ ˆ (the performance of k-means), whereas the labelled black arrows correspond to individual terms in the bounds. We begin with the (slightly) simpler of the two results. 5.1 Convergence rates for the empirical to population measures Let Sk be the optimal k-point quantizer of ρ of order two [14, p. 31]. By the triangle inequality and the identity (a + b + c)2 ≤ 3(a2 + b2 + c2 ), it follows that W2 (ρ, ρn )2 ≤ 3 W2 (ρ, πSk ρ)2 + W2 (πSk ρ, πSk ρn )2 + W2 (πSk ρn , ρn )2 . ˆ ˆ ˆ ˆ (3) This is the decomposition depicted in the upper arrow of fig. 2. By lemma 3.1, the first term in the sum of equation 3 is the optimal k-point quantization error of ρ over a d-manifold M which, using recent techniques from [16] (see also [17, p. 491]), is shown in the proof of theorem 5.1 (part a) to be of order Θ(k −2/d ). The remaining terms, b) and c), are slightly more technical and are bounded in the proof of theorem 5.1. Since equation 3 holds for all 1 ≤ k ≤ n, the best bound on W2 (ρ, ρn ) can be obtained by optimizˆ ing the right-hand side over all possible values of k, resulting in the following probabilistic bound for the rate of convergence of the empirical to population measures. 6 x2 x W2 (ρ, ρn ) ˆ supp ρ x1 π{x1 ,x2 ,x3 ,x4 } ρ a) x3 πSk ρ b) πSk ρn ˆ c) d) ρn ˆ πSk ρn ˆ ˆ W2 (ρ, πSk ρn ) ˆ ˆ x4 Figure 1: A sample {x1 , x2 , x3 , x4 } is drawn from a distribution ρ with support in supp ρ. The projection map π{x1 ,x2 ,x3 ,x4 } sends points x to their closest one in the sample. The induced Voronoi tiling is shown in shades of blue. Figure 2: The measures considered in this paper are linked by arrows for which upper bounds for their distance are derived. Bounds for the quantities of interest W2 (ρ, ρn )2 , and W2 (ρ, πSk ρn )2 , ˆ ˆ ˆ are decomposed by following the top and bottom colored arrows. Theorem 5.1. Given ρ ∈ Pp (M) with absolutely continuous part ρA = 0, sufficiently large n, and τ > 0, it holds W2 (ρ, ρn ) ≤ C · m(ρA ) · n−1/(2d+4) · τ, ˆ where m(ρA ) := 5.2 M 2 with probability 1 − e−τ . ρA (x)d/(d+2) dλM (x), and C depends only on d. 
Learning rates of k-means The key element in the proof of theorem 5.1 is that the distance between population and empirical measures can be bounded by choosing an intermediate optimal quantizing measure of an appropriate size k. In the analysis, the best bounds are obtained for k smaller than n. If the output of k-means is close to an optimal quantizer (for instance if sufficient data is available), then we would similarly expect that the best bounds for k-means correspond to a choice of k < n. The decomposition of the bottom (blue) arrow in figure 2 leads to the following bound in probability. Theorem 5.2. Given ρ ∈ Pp (M) with absolutely continuous part ρA = 0, and τ > 0, then for all sufficiently large n, and letting k = C · m(ρA ) · nd/(2d+4) , it holds W2 (ρ, πSk ρn ) ≤ C · m(ρA ) · n−1/(2d+4) · τ, ˆ ˆ where m(ρA ) := M 2 with probability 1 − e−τ . ρA (x)d/(d+2) dλM (x), and C depends only on d. Note that the upper bounds in theorem 5.1 and 5.2 are exactly the same. Although this may appear ˆ surprising, it stems from the following fact. Since S = Sk is a minimizer of W2 (πS ρn , ρn )2 , the ˆ ˆ bound d) of figure 2 satisfies: W2 (πSk ρn , ρn )2 ≤ W2 (πSk ρn , ρn )2 ˆ ˆ ˆ ˆ ˆ and therefore (by the definition of c), the term d) is of the same order as c). It follows then that adding term d) to the bound only affects the constants, but otherwise leaves it unchanged. Since d) is the term that takes the output measure of k-means to the empirical measure, this implies that the rate of convergence of k-means (for suitably chosen k) cannot be worse than that of ρn → ρ. ˆ Conversely, bounds for ρn → ρ are obtained from best rates of convergence of optimal quantizers, ˆ whose convergence to ρ cannot be slower than that of k-means (since the quantizers that k-means produces are suboptimal.) 7 Since the bounds obtained for the convergence of ρn → ρ are the same as those for k-means with ˆ k of order k = Θ(nd/(2d+4) ), this suggests that estimates of ρ that are as accurate as those derived from an n point-mass measure ρn can be derived from k point-mass measures with k ˆ n. Finally, we note that the introduced bounds are currently limited by the statistical bound sup |W2 (πS ρn , ρn )2 − W2 (πS ρ, ρ)2 | ˆ ˆ |S|=k = sup |Ex∼ρn d(x, S)2 − Ex∼ρ d(x, S)2 | ˆ lemma 3.1 |S|=k (4) (see for instance [21]), for which non-matching lower bounds are known. This means that, if better upper bounds can be obtained for equation 4, then both bounds in theorems 5.1 and 5.2 would automatically improve (would become closer to the lower bound.) References [1] M. Ajtai, J. Komls, and G. Tusndy. On optimal matchings. Combinatorica, 4:259–264, 1984. [2] Franck Barthe and Charles Bordenave. Combinatorial optimization over two random point sets. Technical Report arXiv:1103.2734, Mar 2011. [3] Gordon Blower. The Gaussian isoperimetric inequality and transportation. Positivity, 7:203–224, 2003. [4] S. G. Bobkov and F. G¨ tze. Exponential integrability and transportation cost related to logarithmic o Sobolev inequalities. Journal of Functional Analysis, 163(1):1–28, April 1999. [5] Emmanuel Boissard. Simple bounds for the convergence of empirical and occupation measures in 1wasserstein distance. Electron. J. Probab., 16(83):2296–2333, 2011. [6] F. Bolley, A. Guillin, and C. Villani. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137(3):541–593, 2007. [7] F. Bolley and C. Villani. 
Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des Sciences de Toulouse, 14(3):331–352, 2005. [8] Claire Caillerie, Frédéric Chazal, Jérôme Dedecker, and Bertrand Michel. Deconvolution for the Wasserstein metric and geometric inference. Rapport de recherche RR-7678, INRIA, July 2011. [9] Kenneth L. Clarkson. Building triangulations using ε-nets. In Proceedings of the thirty-eighth annual ACM symposium on Theory of computing, STOC '06, pages 326–335, New York, NY, USA, 2006. ACM. [10] Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer Series in Statistics. Springer-Verlag, New York, 2001. [11] V. Dobrić and J. Yukich. Asymptotics for transportation cost in high dimensions. Journal of Theoretical Probability, 8:97–118, 1995. [12] A. Gersho and R.M. Gray. Vector Quantization and Signal Compression. Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, 1992. [13] Alison L. Gibbs and Francis E. Su. On choosing and bounding probability metrics. International Statistical Review, 70:419–435, 2002. [14] Siegfried Graf and Harald Luschgy. Foundations of quantization for probability distributions. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2000. [15] Siegfried Graf, Harald Luschgy, and Gilles Pagès. Distortion mismatch in the quantization of probability measures. ESAIM: Probability and Statistics, 12:127–153, 2008. [16] Peter M. Gruber. Optimum quantization and its applications. Adv. Math, 186:2004, 2002. [17] P.M. Gruber. Convex and discrete geometry. Grundlehren der mathematischen Wissenschaften. Springer, 2007. [18] Guillermo Henry and Daniela Rodriguez. Kernel density estimation on Riemannian manifolds: Asymptotic results. J. Math. Imaging Vis., 34(3):235–239, July 2009. [19] Joseph Horowitz and Rajeeva L. Karandikar. Mean rates of convergence of empirical measures in the Wasserstein metric. J. Comput. Appl. Math., 55(3):261–273, November 1994. [20] M. Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and Monographs. American Mathematical Society, 2001. [21] A. Maurer and M. Pontil. K-dimensional coding schemes in Hilbert spaces. IEEE Transactions on Information Theory, 56(11):5839–5846, Nov. 2010. [22] Yann Ollivier. Ricci curvature of Markov chains on metric spaces. J. Funct. Anal., 256(3):810–864, 2009. [23] Arkadas Ozakin and Alexander Gray. Submanifold density estimation. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1375–1382. 2009. [24] C. Papadimitriou. The probabilistic analysis of matching heuristics. In Proc. of the 15th Allerton Conf. on Communication, Control and Computing, pages 368–378, 1978. [25] Bruno Pelletier. Kernel density estimation on Riemannian manifolds. Statist. Probab. Lett., 73(3):297–304, 2005. [26] Xavier Pennec. Intrinsic statistics on Riemannian manifolds: Basic tools for geometric measurements. J. Math. Imaging Vis., 25(1):127–154, July 2006. [27] M. S. Pinsker. Information and information stability of random variables and processes. San Francisco: Holden-Day, 1964. [28] David Pollard. Quantization and the method of k-means. IEEE Transactions on Information Theory, 28(2):199–204, 1982. [29] S.T. Rachev. Probability metrics and the stability of stochastic models. Wiley series in probability and mathematical statistics: Applied probability and statistics. Wiley, 1991. [30] J.M.
Steele. Probability Theory and Combinatorial Optimization. CBMS-NSF Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, 1997. [31] M. Talagrand. Transportation cost for Gaussian and other product measures. Geometric and Functional Analysis, 6:587–600, 1996. [32] Alexandre B. Tsybakov. Introduction to nonparametric estimation. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, Translated by Vladimir Zaiats. [33] A.W. van der Vaart and J.A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, 1996. [34] V. S. Varadarajan. On the convergence of sample probability distributions. Sankhyā: The Indian Journal of Statistics, 19(1/2):23–26, Feb. 1958. [35] C. Villani. Optimal Transport: Old and New. Grundlehren der Mathematischen Wissenschaften. Springer, 2009. [36] P. Vincent and Y. Bengio. Manifold Parzen Windows. In Advances in Neural Information Processing Systems 22, pages 849–856. 2003.
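The listing above (entry 3) casts k-means as a measure-learning algorithm: the output centers Sk support a pushforward measure whose atoms are weighted by cluster proportions, and, by the quoted lemma 3.1, the empirical W2 distance to the data measure is the root-mean-square distance of the sample to the centers. A minimal sketch of that construction follows; the synthetic data, the use of scikit-learn's KMeans as an (approximate) k-means solver, and the particular choice k of order n^{d/(2d+4)} for d = 2 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))                     # sample X_n drawn from rho
n, d = X.shape
k = max(1, int(round(n ** (d / (2 * d + 4)))))     # k of order n^{d/(2d+4)}

km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Pushforward measure pi_{S_k} rho_n: an atom at each center, weighted by the
# fraction of sample points projected onto it.
atoms = km.cluster_centers_
weights = np.bincount(km.labels_, minlength=k) / n

# By the quoted lemma 3.1, W2(rho_n, pi_{S_k} rho_n)^2 equals the mean squared
# distance of the sample to S_k, i.e. the k-means objective divided by n.
w2_empirical = np.sqrt(km.inertia_ / n)
print(k, weights.sum(), w2_empirical)
```

Increasing k drives this empirical distance to zero, while the quoted theorems concern the distance to the population measure, which is controlled with k chosen well below n.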
4 0.56739563 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model
Author: Sander M. Bohte
Abstract: Neural adaptation underlies the ability of neurons to maximize encoded information over a wide dynamic range of input stimuli. Recent spiking neuron models like the adaptive Spike Response Model implement adaptation as additive fixed-size fast spike-triggered threshold dynamics and slow spike-triggered currents. Such adaptation accurately models neural spiking behavior over a limited dynamic input range. To extend efficient coding over large changes in dynamic input range, we propose a multiplicative adaptive Spike Response Model where the spike-triggered adaptation dynamics are scaled multiplicatively by the adaptation state at the time of spiking. We show that, unlike the additive adaptation model, the firing rate in our multiplicative adaptation model saturates to a realistic maximum spike-rate regardless of input magnitude. Additionally, when simulating variance switching experiments, the model quantitatively fits experimental data over a wide dynamic range. Dynamic threshold models of adaptation furthermore suggest a straightforward interpretation of neural activity in terms of dynamic differential signal encoding with shifted and weighted exponential kernels. We show that when thus encoding rectified filtered stimulus signals, the multiplicative adaptive Spike Response Model achieves a high coding efficiency and maintains this efficiency over changes in the dynamic signal range of several orders of magnitude, without changing model parameters. 1
5 0.53619754 179 nips-2012-Learning Manifolds with K-Means and K-Flats
Author: Guillermo Canas, Tomaso Poggio, Lorenzo Rosasco
Abstract: We study the problem of estimating a manifold from random samples. In particular, we consider piecewise constant and piecewise linear estimators induced by k-means and k-flats, and analyze their performance. We extend previous results for k-means in two separate directions. First, we provide new results for k-means reconstruction on manifolds and, secondly, we prove reconstruction bounds for higher-order approximation (k-flats), for which no known results were previously available. While the results for k-means are novel, some of the technical tools are well-established in the literature. In the case of k-flats, both the results and the mathematical tools are new. 1
6 0.5039162 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
7 0.50227249 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video
8 0.43161061 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning
9 0.4288137 285 nips-2012-Query Complexity of Derivative-Free Optimization
10 0.40656289 73 nips-2012-Coding efficiency and detectability of rate fluctuations with non-Poisson neuronal firing
11 0.38678101 338 nips-2012-The Perturbed Variation
12 0.37290135 221 nips-2012-Multi-Stage Multi-Task Feature Learning
13 0.37232754 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
14 0.35359171 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation
15 0.35078898 21 nips-2012-A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
16 0.34945118 291 nips-2012-Reducing statistical time-series problems to binary classification
17 0.34387302 222 nips-2012-Multi-Task Averaging
18 0.33744401 271 nips-2012-Pointwise Tracking the Optimal Regression Function
19 0.33478585 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
20 0.33466759 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL
topicId topicWeight
[(0, 0.016), (21, 0.068), (38, 0.166), (39, 0.02), (42, 0.036), (53, 0.013), (54, 0.021), (55, 0.036), (63, 0.158), (74, 0.034), (76, 0.231), (80, 0.069), (92, 0.039)]
simIndex simValue paperId paperTitle
1 0.94352192 89 nips-2012-Coupling Nonparametric Mixtures via Latent Dirichlet Processes
Author: Dahua Lin, John W. Fisher
Abstract: Mixture distributions are often used to model complex data. In this paper, we develop a new method that jointly estimates mixture models over multiple data sets by exploiting the statistical dependencies between them. Specifically, we introduce a set of latent Dirichlet processes as sources of component models (atoms), and for each data set, we construct a nonparametric mixture model by combining sub-sampled versions of the latent DPs. Each mixture model may acquire atoms from different latent DPs, while each atom may be shared by multiple mixtures. This multi-to-multi association distinguishes the proposed method from previous ones that require the model structure to be a tree or a chain, allowing more flexible designs. We also derive a sampling algorithm that jointly infers the model parameters and present experiments on both document analysis and image modeling. 1
2 0.91959721 78 nips-2012-Compressive Sensing MRI with Wavelet Tree Sparsity
Author: Chen Chen, Junzhou Huang
Abstract: In Compressive Sensing Magnetic Resonance Imaging (CS-MRI), one can reconstruct a MR image with good quality from only a small number of measurements. This can significantly reduce MR scanning time. According to structured sparsity theory, the measurements can be further reduced to O(K + log n) for tree-sparse data instead of O(K + K log n) for standard K-sparse data with length n. However, few of existing algorithms have utilized this for CS-MRI, while most of them model the problem with total variation and wavelet sparse regularization. On the other side, some algorithms have been proposed for tree sparse regularization, but few of them have validated the benefit of wavelet tree structure in CS-MRI. In this paper, we propose a fast convex optimization algorithm to improve CS-MRI. Wavelet sparsity, gradient sparsity and tree sparsity are all considered in our model for real MR images. The original complex problem is decomposed into three simpler subproblems then each of the subproblems can be efficiently solved with an iterative scheme. Numerous experiments have been conducted and show that the proposed algorithm outperforms the state-of-the-art CS-MRI algorithms, and gain better reconstructions results on real MR images than general tree based solvers or algorithms. 1
same-paper 3 0.91332799 142 nips-2012-Generalization Bounds for Domain Adaptation
Author: Chao Zhang, Lei Zhang, Jieping Ye
Abstract: In this paper, we provide a new framework to study the generalization bound of the learning process for domain adaptation. We consider two kinds of representative domain adaptation settings: one is domain adaptation with multiple sources and the other is domain adaptation combining source and target data. In particular, we use the integral probability metric to measure the difference between two domains. Then, we develop the specific Hoeffding-type deviation inequality and symmetrization inequality for either kind of domain adaptation to achieve the corresponding generalization bound based on the uniform entropy number. By using the resultant generalization bound, we analyze the asymptotic convergence and the rate of convergence of the learning process for domain adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the learning process. The numerical experiments support our results. 1
4 0.89578664 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
Author: Ke Jiang, Brian Kulis, Michael I. Jordan
Abstract: Sampling and variational inference techniques are two standard methods for inference in probabilistic models, but for many problems, neither approach scales effectively to large-scale data. An alternative is to relax the probabilistic model into a non-probabilistic formulation which has a scalable associated algorithm. This can often be fulfilled by performing small-variance asymptotics, i.e., letting the variance of particular distributions in the model go to zero. For instance, in the context of clustering, such an approach yields connections between the kmeans and EM algorithms. In this paper, we explore small-variance asymptotics for exponential family Dirichlet process (DP) and hierarchical Dirichlet process (HDP) mixture models. Utilizing connections between exponential family distributions and Bregman divergences, we derive novel clustering algorithms from the asymptotic limit of the DP and HDP mixtures that features the scalability of existing hard clustering methods as well as the flexibility of Bayesian nonparametric models. We focus on special cases of our analysis for discrete-data problems, including topic modeling, and we demonstrate the utility of our results by applying variants of our algorithms to problems arising in vision and document analysis. 1
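The entry above derives hard-clustering algorithms as small-variance limits of DP mixture models. The sketch below is one DP-means-style procedure in the spirit of that limit (not a reproduction of the authors' derivation or its exponential-family generality): a point whose squared distance to every current center exceeds a penalty spawns a new cluster, so the number of clusters is data-driven. The penalty lam, iteration count, and toy data are assumptions.

```python
import numpy as np

def dp_means(X, lam, n_iter=25):
    # DP-means-style hard clustering: like k-means, but a point whose squared
    # distance to every current center exceeds lam starts a new cluster.
    centers = [X.mean(axis=0)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for idx, x in enumerate(X):
            d2 = np.array([np.sum((x - c) ** 2) for c in centers])
            j = int(d2.argmin())
            if d2[j] > lam:
                centers.append(x.copy())
                j = len(centers) - 1
            labels[idx] = j
        # Update each center as the mean of its members (unchanged if empty).
        centers = [X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                   for j in range(len(centers))]
    return np.array(centers), labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in (0.0, 3.0, 6.0)])
centers, labels = dp_means(X, lam=2.0)
print(len(centers))                                 # number of clusters found
```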
5 0.87578571 123 nips-2012-Exponential Concentration for Mutual Information Estimation with Application to Forests
Author: Han Liu, Larry Wasserman, John D. Lafferty
Abstract: We prove a new exponential concentration inequality for a plug-in estimator of the Shannon mutual information. Previous results on mutual information estimation only bounded expected error. The advantage of having the exponential inequality is that, combined with the union bound, we can guarantee accurate estimators of the mutual information for many pairs of random variables simultaneously. As an application, we show how to use such a result to optimally estimate the density function and graph of a distribution which is Markov to a forest graph. 1
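The entry above concerns plug-in estimation of Shannon mutual information. The sketch below is the simplest histogram plug-in variant (the paper's estimator is kernel-based, so the binning here is purely an illustrative assumption): estimate the joint distribution on a grid, then plug it into I(X;Y) = sum p(x,y) log[p(x,y)/(p(x)p(y))].

```python
import numpy as np

def plugin_mutual_information(x, y, bins=20):
    # Histogram plug-in estimate of I(X; Y) in nats.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)              # marginal of X
    py = pxy.sum(axis=0, keepdims=True)              # marginal of Y
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x + 0.5 * rng.normal(size=5000)                  # dependent pair: I(X;Y) > 0
print(plugin_mutual_information(x, y))
```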
6 0.86756408 18 nips-2012-A Simple and Practical Algorithm for Differentially Private Data Release
7 0.86605108 75 nips-2012-Collaborative Ranking With 17 Parameters
8 0.8654142 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC
9 0.86526644 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
10 0.86481607 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation
11 0.86433607 222 nips-2012-Multi-Task Averaging
12 0.86422861 291 nips-2012-Reducing statistical time-series problems to binary classification
13 0.86359513 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
14 0.86345947 76 nips-2012-Communication-Efficient Algorithms for Statistical Optimization
15 0.86169648 338 nips-2012-The Perturbed Variation
16 0.86149985 20 nips-2012-A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
17 0.86020857 221 nips-2012-Multi-Stage Multi-Task Feature Learning
18 0.85946488 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability
19 0.85934848 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points
20 0.85848957 254 nips-2012-On the Sample Complexity of Robust PCA