jmlr jmlr2010 jmlr2010-106 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory
Reference: text
sentIndex sentText sentNum sentScore
1 In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. [sent-6, score-0.252]
2 However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. [sent-7, score-0.226]
3 This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes. [sent-10, score-0.215]
4 We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. [sent-14, score-0.258]
5 Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory. [sent-30, score-0.255]
6 Introduction Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, such as the VC-dimension, covering numbers, or Rademacher complexity. [sent-31, score-0.175]
7 In contrast, the notion of algorithmic stability can be used to derive bounds that are tailored to specific learning algorithms and exploit their particular properties. [sent-33, score-0.192]
8 Algorithmic stability has been used effectively in the past to derive tight generalization bounds (Bousquet and Elisseeff, 2001, 2002; Kearns and Ron, 1997). [sent-36, score-0.236]
9 But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). [sent-37, score-0.226]
10 This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes. [sent-49, score-0.215]
11 We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. [sent-54, score-0.258]
12 However, our analysis somewhat differs from previous uses of this technique in that the blocks of points we consider are not necessarily of equal size. [sent-63, score-0.143]
13 For our analysis of stationary ϕ-mixing sequences, we make use of a generalized version of McDiarmid’s inequality given by Kontorovich and Ramanan (2008) that holds for ϕ-mixing sequences. [sent-64, score-0.211]
14 Our generalization bounds for stationary β-mixing sequences cover a more general non-i.i.d. scenario. [sent-66, score-0.305]
15 They use the standard McDiarmid's inequality; however, unlike the ϕ-mixing case, the β-mixing bound presented here is not a purely exponential bound and contains an additive term depending on the mixing coefficient. [sent-69, score-0.281]
16 The stability bounds we give for SVR, SVMs, and many other kernel regularization-based and relative entropy-based regularization algorithms can thus be viewed as the first theoretical basis for their use in such scenarios. [sent-81, score-0.221]
17 Section 3 gives our main generalization bounds for stationary ϕ-mixing sequences based on stability, as well as the illustration of their application to general kernel regularization-based algorithms, including SVR, KRR, and SVMs, and to relative entropy-based regularization algorithms. [sent-87, score-0.35]
18 Finally, Section 4 presents the first known stability bounds for the more general stationary β-mixing scenario. [sent-88, score-0.311]
19 Preliminaries We first introduce some standard definitions for dependent observations in mixing theory (Doukhan, 1994) and then briefly discuss the learning scenarios in the non-i.i.d. case. [sent-93, score-0.203]
20 Definitions. Definition 1 A sequence of random variables Z = {Zt}_{t=−∞}^{+∞} is said to be stationary if for any t and non-negative integers m and k, the random vectors (Zt, . . . , Zt+m) and (Zt+k, . . . , Zt+m+k) have the same distribution. [sent-101, score-0.167]
21 The following is a standard definition giving a measure of the dependence of the random variables Zt within a stationary sequence. [sent-111, score-0.151]
22 Definition 2 Let Z = {Zt}_{t=−∞}^{+∞} be a stationary sequence of random variables. [sent-113, score-0.167]
23 The sequence is said to be algebraically mixing if there exist real numbers β0 > 0 (resp. ϕ0 > 0) and r > 0 such that β(k) ≤ β0/k^r (resp. ϕ(k) ≤ ϕ0/k^r) for all k, and exponentially mixing if there exist real numbers β0 (resp. ϕ0) and β1 (resp. ϕ1) such that β(k) ≤ β0 exp(−β1 k^r) (resp. ϕ(k) ≤ ϕ0 exp(−ϕ1 k^r)) for all k. [sent-120, score-0.143]
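To make the two decay regimes concrete, here is a minimal Python sketch comparing algebraic and exponential decay of a mixing coefficient; the constants β0 = 1, β1 = 0.5 and the exponents are arbitrary illustrative choices, not values from the paper.

```python
import math

def algebraic_beta(k, beta0=1.0, r=2.0):
    # Algebraic (polynomial) decay: beta(k) <= beta0 / k**r
    return beta0 / (k ** r)

def exponential_beta(k, beta0=1.0, beta1=0.5, r=1.0):
    # Exponential decay: beta(k) <= beta0 * exp(-beta1 * k**r)
    return beta0 * math.exp(-beta1 * (k ** r))

for k in (1, 5, 10, 50):
    print(k, algebraic_beta(k), exponential_beta(k))
```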
24 The ϕ-mixing bounds are based on a concentration inequality that applies to ϕ-mixing processes only. [sent-131, score-0.166]
25 Except for the use of this concentration bound and Lemmas 5 and 6, all of the intermediate proofs and results needed to derive a ϕ-mixing bound in Section 3 are given in the more general case of β-mixing sequences. [sent-132, score-0.138]
26 Some results have also been obtained in the more general context of α-mixing but they seem to require the stronger condition of exponential mixing (Modha and Masry, 1998). [sent-139, score-0.143]
27 Algebraic mixing is a standard assumption for mixing coefficients that has been adopted in previous studies of learning in the presence of dependent observations (Yu, 1994; Meir, 2000; Vidyasagar, 2003; Lozano et al. [sent-141, score-0.322]
28 Let us also point out that mixing assumptions can be checked in some cases such as with Gaussian or Markov processes (Meir, 2000) and that mixing parameters can also be estimated in such cases. [sent-143, score-0.286]
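As an example of a Markov process whose mixing behavior is known, the sketch below simulates a stationary Gaussian AR(1) process; such a process with |a| < 1 is a standard example of a geometrically (exponentially) β-mixing sequence. The code is illustrative only and is not taken from the paper.

```python
import numpy as np

def simulate_ar1(m, a=0.7, sigma=1.0, seed=0):
    # Z_t = a * Z_{t-1} + sigma * eps_t with |a| < 1: stationary and geometrically beta-mixing.
    rng = np.random.default_rng(seed)
    z = np.empty(m)
    # Draw Z_0 from the stationary law N(0, sigma^2 / (1 - a^2)) so the whole sequence is stationary.
    z[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - a ** 2))
    for t in range(1, m):
        z[t] = a * z[t - 1] + rng.normal(0.0, sigma)
    return z

sample = simulate_ar1(1000)
```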
29 The lemma gives a measure of the difference between the distribution of µ blocks where the blocks are independent in one case and dependent in the other case. [sent-158, score-0.319]
30 We will consider here the more general case of dependent samples drawn from a stationary mixing sequence Z over X ×Y . [sent-184, score-0.376]
31 In the most general version, future samples depend on the training sample S and thus the generalization error or true error of the hypothesis hS trained on S must be measured by its expected error conditioned on the sample S: R(hS) = E[c(hS, z) | S]. [sent-193, score-0.165]
32 Let us also briefly discuss the more general scenario of non-stationary mixing sequences, that is one where the distribution may change over time. [sent-203, score-0.193]
33 ϕ-Mixing Generalization Bounds and Applications This section gives generalization bounds for β-stable algorithms over a mixing stationary distribution. [sent-211, score-0.401]
34 It was later used successfully by Bousquet and Elisseeff (2001, 2002) to show algorithm-specific stability bounds for i.i.d. samples. [sent-219, score-0.176]
35 The use of stability in conjunction with McDiarmid’s inequality will allow us to derive generalization bounds. [sent-229, score-0.226]
36 McDiarmid's inequality is an exponential concentration bound of the form Pr[|Φ − E[Φ]| ≥ ε] ≤ exp(−mε²/τ²), where the probability is over a sample of size m and where τ/m is the Lipschitz parameter of Φ, with τ a function of m. [sent-230, score-0.182]
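For reference, the standard bounded-differences statement of McDiarmid's inequality for independent variables is recalled below (a textbook form, not a quotation from the paper); taking every Lipschitz constant equal to τ/m recovers a bound of the shape displayed above, up to constant factors.

```latex
\Pr\Big[\,\big|\Phi(Z_1,\dots,Z_m)-\mathbb{E}[\Phi]\big|\ge \varepsilon\,\Big]
  \;\le\; 2\exp\!\left(\frac{-2\varepsilon^2}{\sum_{i=1}^{m} c_i^2}\right),
\qquad\text{where}\quad
\sup_{z_1,\dots,z_m,\,z_i'}\big|\Phi(\dots,z_i,\dots)-\Phi(\dots,z_i',\dots)\big|\le c_i .
```

With c_i = τ/m for every i, the sum ∑ c_i² equals τ²/m and the right-hand side becomes 2 exp(−2mε²/τ²).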
37 Let us first take a brief look at the problem faced when attempting to give stability bounds for dependent sequences and give some idea of our solution for that problem. [sent-242, score-0.259]
38 Although there is no dependence between blocks of points in (b), the distribution within each block remains the same as in (a) and thus points within a block remain dependent. [sent-268, score-0.334]
39 This consists of eliminating from the original dependent sequence several blocks of contiguous points, leaving us with some remaining blocks of points. [sent-270, score-0.317]
40 Instead of these dependent blocks, we then consider independent blocks of points, each with the same size and the same distribution (within each block) as the dependent ones. [sent-271, score-0.188]
41 By Lemma 3, for a β-mixing distribution, the expected value of a random variable defined over the dependent blocks is close to the one based on these independent blocks. [sent-272, score-0.152]
42 This results in a sequence of three blocks of contiguous points. [sent-288, score-0.165]
43 In the next step, we apply the independent block lemma, which then allows us to treat each of these blocks as independent, modulo the addition of a mixing term. [sent-290, score-0.333]
44 We first present a lemma that relates the expected value of the generalization error in that scenario and the same expectation in the scenario where the test point is independent of the training sample. [sent-297, score-0.211]
45 The block Sb is assumed to have exactly the same distribution as the corresponding block of the same size in S. [sent-300, score-0.148]
46 Then, for any sample S of size m drawn from a ϕ-mixing stationary distribution and for any b ∈ {0, . [sent-302, score-0.182]
47 The bound would then contain the mixing term ϕ(k + b) instead of ϕ(b). [sent-310, score-0.187]
48 Let S = (z1, . . . , zm) and S^i = (z1, . . . , z′i, . . . , zm) be two sequences drawn from a ϕ-mixing stationary process that differ only in point zi for some i ∈ {1, . . . , m}. [sent-324, score-0.414]
49 Figure 2: Illustration of the sequences derived from S that are considered in the proofs. [sent-335, score-0.241]
50 Lemma 7 Let hS be the hypothesis returned by a β-stable algorithm trained on a sample S drawn from a β-mixing stationary distribution. [sent-344, score-0.287]
51 To deal with independent block sequences defined with respect to the same hypothesis, we will consider the sequence Si,b = Si ∩ Sb , which is illustrated by Figure 2(a-c). [sent-347, score-0.153]
52 As before, we will consider a sequence S̃i,b with a similar set of blocks, each with the same distribution as the corresponding blocks in Si,b, but such that the blocks are independent, as seen in Figure 2(d). [sent-349, score-0.38]
53 Since three blocks of at most b points are removed from each hypothesis, by the β-stability of the learning algorithm, the following holds: E_S[Φ(S)] = E_S[R̂(hS) − R(hS)] = E_{S,z}[(1/m) ∑_{i=1}^m c(hS, zi) − c(hS, z)] ≤ E_{Si,b,z}[(1/m) ∑_{i=1}^m c(hSi,b, zi) − c(hSi,b, z)] + 6bβ. [sent-350, score-0.337]
54 m i=1 The application of Lemma 3 to the difference of two cost functions also bounded by M as in the right-hand side leads to E[Φ(S)] ≤ E S Si,b ,z 1 m ∑ c(h , zi ) − c(hSi,b , z) + 6bβ + 3β(b)M. [sent-351, score-0.14]
55 Now, since the points z and zi are independent and since the distribution is stationary, they have the same distribution and we can replace zi with z in the empirical cost. [sent-352, score-0.221]
56 3 ϕ-Mixing Concentration Bound We are now prepared to make use of a concentration inequality to provide a generalization bound in the ϕ-mixing scenario. [sent-357, score-0.207]
57 , Zm be arbitrary random variables taking values in Z and let Φ : Z^m → R be a measurable function satisfying, for all zi, z′i ∈ Z, i = 1, . [sent-386, score-0.144]
58 , Zm be independent random variables taking values in Z and let Φ : Z^m → R be a measurable function satisfying, for all zi, z′i ∈ Z, i = 1, . [sent-422, score-0.144]
59 To simplify presentation, we are adapting it to the case of stationary ϕ-mixing sequences by using the following straightforward inequality for a stationary process: 2ϕ(j − i) ≥ η_ij. [sent-450, score-0.37]
60 Stability Bound) Let hS denote the hypothesis returned by a β-stable algorithm trained on a sample S drawn from a ϕ-mixing stationary distribution, and let c be a measurable non-negative cost function upper bounded by M > 0; then for any b ∈ {0, . [sent-458, score-0.377]
61 The theorem gives a general stability bound for ϕ-mixing stationary sequences. [sent-463, score-0.323]
62 Proof For an algebraically mixing sequence, the value of b minimizing the bound of Theorem 11 satisfies the equation βb∗ = rMϕ(b∗ ). [sent-469, score-0.287]
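For algebraic mixing ϕ(b) = ϕ0/b^r, the condition βb* = rMϕ(b*) can be solved in closed form, b* = (rMϕ0/β)^(1/(r+1)); the sketch below evaluates it for made-up values of β, M, ϕ0 and r (these constants are illustrative assumptions, not values from the paper).

```python
def optimal_block_size(beta, M, phi0, r):
    # Solve beta * b = r * M * phi0 / b**r for b, i.e. b* = (r * M * phi0 / beta) ** (1 / (r + 1)).
    return (r * M * phi0 / beta) ** (1.0 / (r + 1))

# Hypothetical values: 1/m-stability with m = 10000, loss bounded by M = 1, phi0 = 1, r = 2.
m, M, phi0, r = 10000, 1.0, 1.0, 2.0
b_star = optimal_block_size(beta=1.0 / m, M=M, phi0=phi0, r=r)
print(round(b_star, 2))  # grows like m**(1/(r+1))
```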
63 In the case of a zero mixing coefficient (ϕ = 0 and b = 0), the bounds of Theorem 11 coincide with the i.i.d. stability bounds. [sent-478, score-0.206]
64 In the case of algebraically mixing sequences with r > 1, as assumed in Theorem 12, β ≤ O(1/m) implies ϕ(b) ≈ ϕ0(β/(rϕ0M))^(r/(r+1)) < O(1/√m). [sent-484, score-0.29]
65 More specifically, for the scenario of algebraic mixing with 1/m-stability, the following bound holds with probability at least 1 − δ: R(hS) − R̂(hS) ≤ O(√(log(1/δ) / m^((r−1)/(r+1)))). [sent-485, score-0.286]
66 Similar bounds can be given for the exponential mixing setting (ϕ(k) = ϕ0 exp(−ϕ1 k^r)). [sent-488, score-0.225]
67 5 Applications We now present the application of our stability bounds for algebraically ϕ-mixing sequences to several algorithms, including the family of kernel-based regularization algorithms and that of relative entropy-based regularization algorithms. [sent-492, score-0.363]
68 The application of our learning bounds will benefit from the previous analysis of the stability of these algorithms by Bousquet and Elisseeff (2002). [sent-493, score-0.176]
69 1 Kernel-Based Regularization Algorithms We first apply our bounds to a family of algorithms minimizing a regularized objective function based on the norm ‖·‖_K in a reproducing kernel Hilbert space, where K is a positive definite symmetric kernel: argmin_{h∈H} (1/m) ∑_{i=1}^m c(h, zi) + λ‖h‖²_K. (7) [sent-496, score-0.2]
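As one concrete member of this family, Kernel Ridge Regression takes the squared loss c(h, z) = (h(x) − y)² in (7); the minimal sketch below gives its closed-form dual solution. The Gaussian kernel and the function names are illustrative assumptions, not code from the paper.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=0.1, gamma=1.0):
    # Minimizer of (1/m) * sum_i (h(x_i) - y_i)**2 + lam * ||h||_K^2 has the form
    # h(x) = sum_j alpha_j K(x, x_j) with alpha = (K + lam * m * I)^{-1} y.
    m = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def krr_predict(X_train, alpha, X_test, gamma=1.0):
    return gaussian_kernel(X_test, X_train, gamma) @ alpha
```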
70 Let hS denote the hypothesis returned by the algorithm when trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution. [sent-530, score-0.387]
71 As shown by Bousquet and Elisseeff (2002, Theorem 24), a relative entropy-based regularization algorithm defined by (9) with bounded loss c′(·) ≤ M is β-stable with the following bound on the stability coefficient: β ≤ M²/(λm). [sent-555, score-0.201]
72 Corollary 16 Let hS be the hypothesis solution of the optimization (9) trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution with the internal cost function c′ bounded by M. [sent-557, score-0.396]
73 The next section gives stability-based generalization bounds that hold even in the scenario of β-mixing sequences. [sent-569, score-0.173]
74 β-Mixing Generalization Bounds In this section, we prove a stability-based generalization bound that only requires the training sequence to be drawn from a β-mixing stationary distribution. [sent-571, score-0.301]
75 It contains an additive term, which depends on the mixing coefficient. [sent-574, score-0.143]
76 zk drawn independently of S, the following equality holds: E_Z[(1/|Z|) ∑_{z∈Z} c(hS, z)] = (1/k) ∑_{i=1}^k E_Z[c(hS, zi)] = (1/k) ∑_{i=1}^k E_{zi}[c(hS, zi)] = E_z[c(hS, z)], since, by stationarity, E_{zi}[c(hS, zi)] = E_{zj}[c(hS, zj)] for all 1 ≤ i, j ≤ k. [sent-581, score-0.418]
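A quick numerical sanity check of this stationarity argument (illustrative only: the AR(1) process and the cost c(z) = z², standing in for c(hS, ·), are arbitrary choices): the mean cost over blocks of k consecutive points agrees with the marginal expected cost.

```python
import numpy as np

rng = np.random.default_rng(1)
a, sigma, n, k = 0.7, 1.0, 200000, 10
z = np.empty(n)
z[0] = rng.normal(0.0, sigma / np.sqrt(1.0 - a ** 2))  # start in the stationary law
for t in range(1, n):
    z[t] = a * z[t - 1] + rng.normal(0.0, sigma)

cost = z ** 2                                   # stands in for c(h_S, z_t)
block_means = cost[: (n // k) * k].reshape(-1, k).mean(axis=1)
print(block_means.mean(), cost.mean())          # both estimate E[c(h_S, z)] under stationarity
```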
77 To derive a generalization bound for the β-mixing scenario, we apply McDiarmid’s inequality (Theorem 10) to Φ defined over a sequence of independent blocks. [sent-585, score-0.189]
78 From a sample S made of a sequence of m points, we construct two sequences of blocks Sa and Sb , each containing µ blocks. [sent-596, score-0.212]
79 Each block in Sa contains a points and each block in Sb contains b points (see Figure 3). [sent-597, score-0.202]
80 , µ}, such that the points within each block are drawn according to the same original β-mixing distribution, and shall denote by Sa the block sequence (Z1^(a), . . . , Zµ^(a)). [sent-623, score-0.237]
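A small sketch of this block construction: a sample of m = µ(a + b) points is split into alternating blocks, the a-blocks collected into Sa and the b-blocks into Sb. The exact alternating layout below is an assumption made for illustration; see Figure 3 of the paper for the precise arrangement.

```python
def split_into_blocks(m, a, b):
    # Partition indices 0..m-1 into mu alternating blocks of sizes a and b,
    # collecting the a-blocks into S_a and the b-blocks into S_b.
    assert m % (a + b) == 0, "assume m = mu * (a + b) for simplicity"
    mu = m // (a + b)
    S_a, S_b = [], []
    for i in range(mu):
        start = i * (a + b)
        S_a.append(list(range(start, start + a)))
        S_b.append(list(range(start + a, start + a + b)))
    return S_a, S_b

S_a, S_b = split_into_blocks(m=12, a=3, b=1)  # mu = 3 blocks of 3 points, separated by gaps of 1
```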
81 Lemma 17 Let Sa be an independent block sequence as defined above; then the following bound holds for the expectation of |Φ(Sa)|: E_{Sa}[|Φ(Sa)|] ≤ 2aβ. [sent-641, score-0.173]
82 Proof Since the blocks Z^(a) are independent, we can replace any one of them with any other block Z drawn from the same distribution. [sent-642, score-0.22]
83 Then, for a sample S drawn from a β-mixing stationary distribution, the following bound holds: Pr_S[|Φ(S)| ≥ ε] ≤ Pr_{Sa}[|Φ(Sa)| − E[|Φ(Sa)|] ≥ ε′_0] + (µ − 1)β(b), where ε′_0 = ε − µbM − 2µbβ − E_{Sa}[|Φ(Sa)|]. [sent-651, score-0.226]
84 We can therefore apply McDiarmid's inequality by viewing the blocks as i.i.d. points. [sent-665, score-0.169]
85 Using these bounds in conjunction with McDiarmid's inequality yields Pr_{Sa}[|Φ(Sa)| − E_{Sa}[|Φ(Sa)|] ≥ ε′_0] ≤ exp(−2ε′_0² m² / (µ(4aβm + (a + b)M)²)). [sent-675, score-0.152]
86 There is a trade-off between selecting a large enough value for b to ensure that the mixing term decreases and choosing a large enough value of µ to minimize the remaining terms of the bound. [sent-680, score-0.143]
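A rough numerical illustration of this trade-off under algebraic mixing β(b) = b^(−r): increasing the gap size b shrinks the mixing term (µ − 1)β(b) but reduces the number of blocks µ = m/(a + b), which weakens the concentration part of the bound. The quantities below are only these two competing terms (with 1/√µ as a stand-in for the concentration term) and made-up values of m, a and r; they are not the full bound of the paper.

```python
def competing_terms(m, a, b, r):
    # mu blocks of size (a + b); mixing term (mu - 1) * b**(-r) versus 1/sqrt(mu),
    # the latter standing in for the concentration term that improves with more blocks.
    mu = m // (a + b)
    mixing_term = (mu - 1) * b ** (-float(r))
    concentration_proxy = 1.0 / mu ** 0.5
    return mu, mixing_term, concentration_proxy

for b in (1, 5, 20, 80):
    print(b, competing_terms(m=10000, a=20, b=b, r=2))
```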
87 The exact choice of parameters will depend on the type of mixing that is assumed (e.g., algebraic or exponential). [sent-681, score-0.143]
88 In the case of a fast mixing distribution, it is possible to select the values of the parameters to retrieve a bound as in the i.i.d. case. [sent-687, score-0.187]
89 In the following, we examine slower-mixing algebraic β-mixing distributions, which are thus not close to the i.i.d. scenario. [sent-695, score-0.169]
90 For algebraic mixing, the mixing parameter is defined as β(b) = b^−r. [sent-699, score-0.169]
91 Then, for any sample S of size m drawn according to an algebraic β-mixing stationary distribution, and δ ≥ 0 such that δ′ ≥ 0, the following generalization bound holds with probability at least (1 − δ): |R(hS) − R̂(hS)| < O(m^(1/(2(r+1)) − 1/4) √(log(1/δ′))). [sent-716, score-0.335]
92 Furthermore, as expected, a larger mixing parameter r leads to a more favorable bound. [sent-718, score-0.143]
93 Conclusion We presented stability bounds for both ϕ-mixing and β-mixing stationary sequences. [sent-720, score-0.311]
94 Since they are algorithm-specific, these bounds can often be tighter than other generalization bounds based on general complexity measures for families of hypotheses. [sent-728, score-0.186]
95 These stability bounds complement general data-dependent learning bounds we have shown elsewhere for stationary β-mixing sequences using the notion of Rademacher complexity (Mohri and Rostamizadeh, 2009). [sent-733, score-0.421]
96 The stability bounds we presented can be used to analyze the properties of stable algorithms when used in the non-i.i.d. setting. [sent-734, score-0.203]
97 Of course, some mixing properties of the distributions need to be known to take advantage of the information supplied by our generalization bounds. [sent-738, score-0.203]
98 In some problems, it is possible to estimate the shape of the mixing coefficients. [sent-739, score-0.143]
99 On the consistency in nonparametric estimation under mixing assumptions. [sent-797, score-0.143]
100 Rates of convergence for empirical processes of stationary mixing sequences. [sent-829, score-0.278]
wordName wordTfidf (topN-words)
[('hs', 0.655), ('sa', 0.363), ('pr', 0.206), ('sb', 0.186), ('hsb', 0.151), ('mixing', 0.143), ('hsa', 0.141), ('rostamizadeh', 0.137), ('stationary', 0.135), ('ohri', 0.131), ('hsi', 0.12), ('blocks', 0.116), ('stability', 0.113), ('mcdiarmid', 0.107), ('zm', 0.105), ('algebraically', 0.1), ('zi', 0.097), ('kontorovich', 0.094), ('tability', 0.093), ('rocesses', 0.086), ('bm', 0.082), ('rfs', 0.08), ('ramanan', 0.077), ('ounds', 0.075), ('elisseeff', 0.075), ('block', 0.074), ('bousquet', 0.069), ('meir', 0.069), ('bounds', 0.063), ('svr', 0.062), ('generalization', 0.06), ('inequality', 0.053), ('hypothesis', 0.052), ('lemma', 0.051), ('scenario', 0.05), ('lozano', 0.05), ('concentration', 0.05), ('sequences', 0.047), ('zt', 0.045), ('bound', 0.044), ('mohri', 0.044), ('ez', 0.042), ('afshin', 0.04), ('bfs', 0.04), ('vidyasagar', 0.04), ('bn', 0.037), ('dependent', 0.036), ('returned', 0.034), ('sequence', 0.032), ('mehryar', 0.031), ('krr', 0.031), ('theorem', 0.031), ('measurable', 0.03), ('bf', 0.03), ('chazottes', 0.03), ('drawn', 0.03), ('points', 0.027), ('stable', 0.027), ('algebraic', 0.026), ('fs', 0.026), ('yh', 0.026), ('kernel', 0.025), ('bounded', 0.024), ('yu', 0.024), ('inequalities', 0.024), ('scenarios', 0.024), ('holds', 0.023), ('ridge', 0.023), ('hypotheses', 0.023), ('si', 0.023), ('statement', 0.022), ('regularization', 0.02), ('mattera', 0.02), ('modha', 0.02), ('cr', 0.02), ('saunders', 0.02), ('straightforwardly', 0.02), ('trained', 0.019), ('mu', 0.019), ('kr', 0.019), ('cost', 0.019), ('exp', 0.018), ('vladimir', 0.018), ('corollary', 0.018), ('proof', 0.017), ('contiguous', 0.017), ('courant', 0.017), ('weakening', 0.017), ('coef', 0.017), ('bernstein', 0.017), ('let', 0.017), ('sample', 0.017), ('ci', 0.016), ('rademacher', 0.016), ('dependence', 0.016), ('algorithmic', 0.016), ('ron', 0.016), ('sii', 0.015), ('lipschitz', 0.015), ('reproducing', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory
2 0.28214121 60 jmlr-2010-Learnability, Stability and Uniform Convergence
Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan
Abstract: The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression. Keywords: statistical learning theory, learnability, uniform convergence, stability, stochastic convex optimization
3 0.16233987 82 jmlr-2010-On Learning with Integral Operators
Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito
Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods
4 0.083748259 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
Author: Liva Ralaivola, Marie Szafranski, Guillaume Stempfel
Abstract: PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and it is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first—to the best of our knowledge—PAC-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph that encodes the dependencies within the data, in sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to get new PAC-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as AUC) and classifiers trained on data distributed according to a stationary β-mixing process. In the way, we show how our approach seamlessly allows us to deal with U-processes. As a side note, we also provide a PAC-Bayes generalization bound for classifiers learned on data from stationary ϕ-mixing distributions. Keywords: PAC-Bayes bounds, non IID data, ranking, U-statistics, mixing processes c 2010 Liva Ralaivola, Marie Szafranski and Guillaume Stempfel. R ALAIVOLA , S ZAFRANSKI AND S TEMPFEL
5 0.06187303 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar
Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing
6 0.058098506 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
7 0.054114006 61 jmlr-2010-Learning From Crowds
8 0.053996116 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
9 0.043470312 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
10 0.041211218 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
11 0.039725635 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
12 0.038435876 72 jmlr-2010-Matrix Completion from Noisy Entries
13 0.038105696 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
14 0.03637448 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
15 0.035602909 50 jmlr-2010-Image Denoising with Kernels Based on Natural Image Relations
16 0.035330553 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
17 0.033959385 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
18 0.033137739 45 jmlr-2010-High-dimensional Variable Selection with Sparse Random Projections: Measurement Sparsity and Statistical Efficiency
19 0.031493228 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models
20 0.031033995 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
topicId topicWeight
[(0, -0.157), (1, -0.082), (2, 0.052), (3, -0.029), (4, -0.149), (5, -0.04), (6, -0.158), (7, 0.294), (8, 0.241), (9, 0.103), (10, -0.181), (11, 0.005), (12, -0.262), (13, 0.323), (14, -0.017), (15, 0.107), (16, 0.292), (17, 0.111), (18, 0.065), (19, -0.071), (20, 0.105), (21, -0.173), (22, 0.056), (23, -0.057), (24, -0.043), (25, -0.012), (26, -0.047), (27, 0.046), (28, 0.029), (29, 0.029), (30, -0.002), (31, -0.066), (32, -0.021), (33, 0.029), (34, -0.059), (35, 0.031), (36, -0.012), (37, -0.005), (38, 0.011), (39, 0.024), (40, 0.018), (41, -0.024), (42, 0.029), (43, 0.001), (44, 0.019), (45, -0.0), (46, -0.014), (47, -0.014), (48, 0.004), (49, -0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.95942396 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory
2 0.86826265 60 jmlr-2010-Learnability, Stability and Uniform Convergence
Author: Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, Karthik Sridharan
Abstract: The problem of characterizing learnability is the most basic question of statistical learning theory. A fundamental and long-standing answer, at least for the case of supervised classification and regression, is that learnability is equivalent to uniform convergence of the empirical risk to the population risk, and that if a problem is learnable, it is learnable via empirical risk minimization. In this paper, we consider the General Learning Setting (introduced by Vapnik), which includes most statistical learning problems as special cases. We show that in this setting, there are non-trivial learning problems where uniform convergence does not hold, empirical risk minimization fails, and yet they are learnable using alternative mechanisms. Instead of uniform convergence, we identify stability as the key necessary and sufficient condition for learnability. Moreover, we show that the conditions for learnability in the general setting are significantly more complex than in supervised classification and regression. Keywords: statistical learning theory, learnability, uniform convergence, stability, stochastic convex optimization
3 0.45136026 82 jmlr-2010-On Learning with Integral Operators
Author: Lorenzo Rosasco, Mikhail Belkin, Ernesto De Vito
Abstract: A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold: 1. We use a technique based on a concentration inequality for Hilbert spaces to provide new much simplified proofs for a number of results in spectral approximation. 2. Using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from von Luxburg et al. (2008). Keywords: spectral convergence, empirical operators, learning integral operators, perturbation methods
4 0.33410445 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
Author: Liva Ralaivola, Marie Szafranski, Guillaume Stempfel
Abstract: PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and it is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first—to the best of our knowledge—PAC-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph that encodes the dependencies within the data, in sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to get new PAC-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as AUC) and classifiers trained on data distributed according to a stationary β-mixing process. In the way, we show how our approach seamlessly allows us to deal with U-processes. As a side note, we also provide a PAC-Bayes generalization bound for classifiers learned on data from stationary ϕ-mixing distributions. Keywords: PAC-Bayes bounds, non IID data, ranking, U-statistics, mixing processes c 2010 Liva Ralaivola, Marie Szafranski and Guillaume Stempfel. R ALAIVOLA , S ZAFRANSKI AND S TEMPFEL
5 0.19431169 79 jmlr-2010-Near-optimal Regret Bounds for Reinforcement Learning
Author: Thomas Jaksch, Ronald Ortner, Peter Auer
Abstract: For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s, s′ there is a policy which moves from s to s′ in at most D steps (on average). √ ˜ We present a reinforcement learning algorithm with total regret O(DS AT ) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of √ Ω( DSAT ) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the number of suboptimal steps taken by our algorithm. This bound can be used to achieve a (gap-dependent) regret bound that is logarithmic in T . Finally, we also consider a setting where the MDP is allowed to change a fixed number of ℓ times. We present a modification of our algorithm that is able to deal with this setting and show a √ ˜ regret bound of O(ℓ1/3 T 2/3 DS A). Keywords: undiscounted reinforcement learning, Markov decision process, regret, online learning, sample complexity
6 0.19353074 98 jmlr-2010-Regularized Discriminant Analysis, Ridge Regression and Beyond
7 0.18121511 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods
8 0.17997722 61 jmlr-2010-Learning From Crowds
9 0.17865504 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
10 0.17792472 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models
11 0.15578538 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
12 0.143186 72 jmlr-2010-Matrix Completion from Noisy Entries
13 0.14212942 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
14 0.14142285 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
15 0.13570644 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
16 0.13524771 85 jmlr-2010-On the Foundations of Noise-free Selective Classification
17 0.13188459 99 jmlr-2010-Restricted Eigenvalue Properties for Correlated Gaussian Designs
18 0.13126123 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
19 0.12715678 88 jmlr-2010-Optimal Search on Clustered Structural Constraint for Learning Bayesian Network Structure
20 0.12529749 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
topicId topicWeight
[(1, 0.018), (3, 0.053), (4, 0.012), (8, 0.02), (15, 0.02), (21, 0.022), (32, 0.052), (36, 0.068), (37, 0.055), (73, 0.423), (75, 0.095), (85, 0.049)]
simIndex simValue paperId paperTitle
same-paper 1 0.64850199 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: Most generalization bounds in learning theory are based on some measure of the complexity of the hypothesis class used, independently of any algorithm. In contrast, the notion of algorithmic stability can be used to derive tight generalization bounds that are tailored to specific learning algorithms by exploiting their particular properties. However, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed. In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence. This paper studies the scenario where the observations are drawn from a stationary ϕ-mixing or β-mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes that implies a dependence between observations weakening over time. We prove novel and distinct stability-based generalization bounds for stationary ϕ-mixing and β-mixing sequences. These bounds strictly generalize the bounds given in the i.i.d. case and apply to all stable learning algorithms, thereby extending the use of stability-bounds to non-i.i.d. scenarios. We also illustrate the application of our ϕ-mixing generalization bounds to general classes of learning algorithms, including Support Vector Regression, Kernel Ridge Regression, and Support Vector Machines, and many other kernel regularization-based and relative entropy-based regularization algorithms. These novel bounds can thus be viewed as the first theoretical basis for the use of these algorithms in non-i.i.d. scenarios. Keywords: learning in non-i.i.d. scenarios, weakly dependent observations, mixing distributions, algorithmic stability, generalization bounds, learning theory
2 0.5932374 76 jmlr-2010-Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes
Author: Pradeep Ravikumar, Alekh Agarwal, Martin J. Wainwright
Abstract: The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on “tree-based” linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences. As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems. Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. We establish convergence guarantees for our algorithms, and illustrate their performance via some simulations. We also develop two classes of rounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions. Our deterministic rounding schemes use a “re-parameterization” property of our algorithms so that when the LP solution is integral, the MAP solution can be obtained even before the LP-solver converges to the optimum. We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general. We analyze the performance of and report simulations comparing these rounding schemes. Keywords: graphical models, MAP Estimation, LP relaxation, proximal minimization, rounding schemes
3 0.31750587 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming
Author: Ming Yuan
Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity
4 0.31654117 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond
Author: Yevgeny Seldin, Naftali Tishby
Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily
5 0.31042114 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions
Author: Shiliang Sun, John Shawe-Taylor
Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory
7 0.30574793 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks
8 0.30502877 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
9 0.30483529 102 jmlr-2010-Semi-Supervised Novelty Detection
10 0.30410576 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning
11 0.30408162 60 jmlr-2010-Learnability, Stability and Uniform Convergence
12 0.3039819 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
13 0.30350766 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes
14 0.30211899 2 jmlr-2010-A Convergent Online Single Time Scale Actor Critic Algorithm
15 0.3017799 69 jmlr-2010-Lp-Nested Symmetric Distributions
16 0.30157128 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
17 0.30098385 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
18 0.30009171 109 jmlr-2010-Stochastic Composite Likelihood
19 0.29859442 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
20 0.29796273 104 jmlr-2010-Sparse Spectrum Gaussian Process Regression