nips nips2008 nips2008-189 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents the first Rademacher complexity-based error bounds for non-i.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of non-i.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and leads to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.
Reference: text
sentIndex sentText sentNum sentScore
1 This paper presents the first Rademacher complexity-based error bounds for non-i.i.d. settings. [sent-8, score-0.151]
2 They are a generalization of similar existing bounds derived for the i.i.d. case. [sent-11, score-0.238]
3 Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of non-i.i.d. settings. [sent-15, score-0.413]
4 They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. [sent-19, score-0.366]
5 In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. [sent-20, score-0.235]
6 The empirical Rademacher complexity can be estimated from such finite samples and lead to tighter generalization bounds. [sent-21, score-0.312]
7 We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence. [sent-22, score-0.204]
8 This dependence may appear more clearly in time series prediction or when the samples are drawn from a Markov chain, but various degrees of time-dependence can also affect other learning problems. [sent-36, score-0.124]
9 A standard assumption for non-i.i.d. processes in machine learning, adopted in most previous studies, is that of observations drawn from a stationary mixing sequence, which implies a dependence between observations that diminishes with time [7,9,10,14,15]. [sent-40, score-0.318]
10 The pioneering work of Yu [15] led to VC-dimension bounds for stationary β-mixing sequences. [sent-41, score-0.311]
11 Similarly, Meir [9] gave bounds based on covering numbers for time series prediction. [sent-42, score-0.216]
12 Generalization bounds have also been derived for stable algorithms with weakly dependent observations [10]. [sent-49, score-0.208]
13 This paper gives data-dependent generalization bounds for stationary β-mixing sequences. [sent-52, score-0.361]
14 Our bounds are based on the notion of Rademacher complexity. [sent-53, score-0.17]
15 They extend to the non-i.i.d. case the Rademacher complexity bounds derived in the i.i.d. setting. [sent-57, score-0.313]
16 To the best of our knowledge, these are the first Rademacher complexity bounds derived for non-i.i.d. settings. [sent-61, score-0.313]
17 Our proofs make use of the so-called independent block technique due to Yu [15] and Bernstein and extend the applicability of the notion of Rademacher complexity to non-i.i.d. settings. [sent-65, score-0.254]
18 Our generalization bounds benefit from all the advantageous properties of Rademacher complexity as in the i.i.d. case. [sent-69, score-0.36]
19 But, perhaps the most crucial advantage of bounds based on the empirical Rademacher complexity is that they are data-dependent: they measure the complexity of a class of hypotheses based on the training sample and thus better capture the properties of the distribution that has generated the data. [sent-76, score-0.592]
20 The empirical Rademacher complexity can be estimated from finite samples and lead to tighter bounds. [sent-77, score-0.245]
21 Furthermore, the Rademacher complexity of large hypothesis sets such as kernel-based hypotheses, decision trees, and convex neural networks can sometimes be bounded in specific ways [2]. [sent-78, score-0.238]
22 For example, the Rademacher complexity of kernel-based hypotheses can be bounded in terms of the trace of the kernel matrix. [sent-79, score-0.298]
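As an illustration of the trace-based bound mentioned above, the following sketch evaluates the quantity (2/m) sqrt(Tr[K]) for a Gaussian kernel Gram matrix. The kernel choice, bandwidth, and data are hypothetical, and the constant follows the standard data-dependent bound for kernel hypotheses of bounded norm; it is a minimal sketch, not a restatement of any specific result in this paper.

```python
import numpy as np

def gaussian_kernel_matrix(X, bandwidth=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_rademacher_trace_bound(K):
    """Standard data-dependent bound (2/m) * sqrt(Tr[K]) on the empirical
    Rademacher complexity of bounded-norm kernel hypotheses."""
    m = K.shape[0]
    return 2.0 / m * np.sqrt(np.trace(K))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # hypothetical sample of m = 200 points
K = gaussian_kernel_matrix(X)
print(kernel_rademacher_trace_bound(K))  # here Tr[K] = m, so the bound decays like 2/sqrt(m)
```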
23 Section 3 introduces the idea of independent blocks and proves a bound on the expected deviation of the error from its empirical estimate. [sent-84, score-0.259]
24 In Section 4, we present our main Rademacher generalization bounds and discuss their properties. [sent-85, score-0.218]
25 The learning scenario we will consider is based on stationary β-mixing processes. [sent-97, score-0.178]
26 A sequence of random variables Z = {Z_t}_{t=-∞}^{+∞} is said to be stationary if, for any t and non-negative integers m and k, the random vectors (Z_t, ..., Z_{t+m}) and (Z_{t+k}, ..., Z_{t+m+k}) have the same distribution. [sent-99, score-0.208]
27 Thus, the index t, or time, does not affect the distribution of a variable Z_t in a stationary sequence (note that this does not imply independence). [sent-106, score-0.213]
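To make the distinction between stationarity and independence concrete, the sketch below simulates a stationary Gaussian AR(1) sequence: every Z_t has the same marginal distribution, yet nearby values remain correlated. The process and its parameters are illustrative choices, not part of the paper.

```python
import numpy as np

def stationary_ar1(m, phi=0.8, rng=None):
    """Simulate Z_t = phi * Z_{t-1} + eps_t, initialized from the stationary law,
    so every Z_t ~ N(0, 1/(1 - phi^2)) while consecutive values stay dependent."""
    rng = np.random.default_rng(rng)
    stat_var = 1.0 / (1.0 - phi ** 2)
    z = np.empty(m)
    z[0] = rng.normal(scale=np.sqrt(stat_var))   # draw Z_0 from the stationary distribution
    for t in range(1, m):
        z[t] = phi * z[t - 1] + rng.normal()
    return z

z = stationary_ar1(10_000)
print(np.corrcoef(z[:-1], z[1:])[0, 1])   # close to phi: adjacent points are dependent
print(z[:5000].var(), z[5000:].var())     # similar variances: the marginal law does not drift
```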
28 Let Z = {Z_t}_{t=-∞}^{+∞} be a stationary sequence of random variables. [sent-108, score-0.182]
29 Then, for any positive integer k, the β-mixing coefficient of the stochastic process Z is defined as β(k) = sup_n E_B[ sup_A | Pr[A | B] − Pr[A] | ], where the expectation is taken with respect to the σ-algebra generated by the past Z_{-∞}^{n} and the supremum is over events A in the σ-algebra generated by the future Z_{n+k}^{+∞}. [sent-110, score-0.348]
30 It is said to be algebraically β-mixing if there exist real numbers β0 > 0 and r > 0 such that β(k) ≤ β0/k^r for all k, and exponentially mixing if there exist real numbers β0 and β1 such that β(k) ≤ β0 exp(−β1 k^r) for all k. [sent-112, score-0.2]
31 Thus, a sequence of random variables is mixing when the dependence of an event on those occurring k units of time in the past weakens as a function of k. [sent-113, score-0.177]
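The two decay regimes just defined can be contrasted numerically. The sketch below simply evaluates the algebraic and exponential envelopes β0/k^r and β0 exp(−β1 k^r) for illustrative parameter values; nothing here is taken from the paper beyond the two definitions above.

```python
import numpy as np

def beta_algebraic(k, beta0=1.0, r=2.0):
    """Algebraic mixing envelope: beta(k) <= beta0 / k**r."""
    return beta0 / k ** r

def beta_exponential(k, beta0=1.0, beta1=0.5, r=1.0):
    """Exponential mixing envelope: beta(k) <= beta0 * exp(-beta1 * k**r)."""
    return beta0 * np.exp(-beta1 * k ** r)

for k in (1, 2, 5, 10, 50, 100):
    # the exponential envelope shrinks much faster, so a given block gap buys more "independence"
    print(k, beta_algebraic(k), beta_exponential(k))
```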
32 Rademacher Complexity: Our generalization bounds will be based on the following measure of the complexity of a class of functions. [sent-115, score-0.36]
33 Given a sample S ∈ X^m, the empirical Rademacher complexity of a set of real-valued functions H defined over a set X is defined as follows: R_S(H) = (2/m) E_σ[ sup_{h∈H} | Σ_{i=1}^m σ_i h(x_i) | ], where S = (x_1, ..., x_m) and the σ_i are independent uniform random variables taking values in {−1, +1}. [sent-117, score-0.38]
34 The Rademacher complexity of a hypothesis set H is defined as the expectation of R_S(H) over all samples of size m: R_m(H) = E_S[ R_S(H) | |S| = m ]. [sent-125, score-0.277]
35 The definition of the Rademacher complexity depends on the distribution according to which samples S of size m are drawn, which in general is a dependent β-mixing distribution D. [sent-126, score-0.207]
36 The Rademacher complexity measures the ability of a class of functions to fit noise. [sent-131, score-0.168]
37 The empirical Rademacher complexity has the added advantage that it is data-dependent and can be measured from finite samples. [sent-132, score-0.178]
38 This can lead to tighter bounds than those based on other measures of complexity such as the VC-dimension [2, 4, 5]. [sent-133, score-0.358]
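Because the empirical Rademacher complexity involves only the training sample and random signs, it can be approximated by Monte Carlo. The sketch below does so for a small finite class of threshold functions; the class, the sample, and the absolute value inside the supremum follow the definition given above, but all concrete choices are illustrative.

```python
import numpy as np

def empirical_rademacher(sample, hypotheses, n_draws=2000, rng=None):
    """Monte Carlo estimate of (2/m) E_sigma[ sup_h | sum_i sigma_i h(x_i) | ]."""
    rng = np.random.default_rng(rng)
    m = len(sample)
    # Precompute h(x_i) for every hypothesis: shape (n_hypotheses, m).
    values = np.array([[h(x) for x in sample] for h in hypotheses])
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher signs
        total += np.abs(values @ sigma).max()     # supremum over the finite class
    return 2.0 / m * total / n_draws

rng = np.random.default_rng(1)
sample = rng.normal(size=50)
# Hypothetical finite class of threshold classifiers x -> sign(x - t).
thresholds = np.linspace(-2, 2, 21)
hypotheses = [lambda x, t=t: 1.0 if x >= t else -1.0 for t in thresholds]
print(empirical_rademacher(sample, hypotheses))
```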
39 We will denote by R_S(h) the empirical average of a hypothesis h : X → R and by R(h) its expectation over a sample S drawn according to a stationary β-mixing distribution: R_S(h) = (1/m) Σ_{i=1}^m h(z_i) and R(h) = E_S[R_S(h)]. [sent-134, score-0.369]
40 The following proposition shows that this expectation is independent of the size of the sample S, as in the i.i.d. case. [sent-135, score-0.145]
41 For any sample S of size m drawn from a stationary distribution D, the following holds: E_{S∼D^m}[R_S(h)] = E_{z∼D}[h(z)]. [sent-140, score-0.208]
42 Proof Components: Our proof makes use of McDiarmid's inequality [8] to show that the empirical average closely estimates its expectation. [sent-147, score-0.317]
43 To derive a Rademacher generalization bound, we apply McDiarmid's inequality to the following random variable, which is the quantity we wish to bound: Φ(S) = sup_{h∈H} [ R(h) − R_S(h) ]. [sent-148, score-0.399]
44 McDiarmid's inequality bounds the deviation of Φ from its mean; thus, we must also bound the expectation E[Φ]. [sent-149, score-0.397]
45 However, we immediately face two obstacles: both McDiarmid's inequality and the standard bound on E[Φ] hold only for i.i.d. samples. [sent-150, score-0.26]
46 Independent Blocks: We derive Rademacher generalization bounds for the case where training and test points are drawn from a stationary β-mixing sequence. [sent-160, score-0.439]
47 As in previous analyses [7, 9, 10, 15], we use a technique transferring the original problem based on dependent points to one based on a sequence of independent blocks. [sent-164, score-0.131]
48 The method consists of first splitting a sequence S into two subsequences S0 and S1, each made of µ blocks of a consecutive points, where a denotes the block size. [sent-165, score-0.135]
49 Instead of the original sequence of odd blocks S0, we will be working with a sequence S̃0 of independent blocks of equal size a, to which standard i.i.d. tools can be applied. [sent-182, score-0.329]
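A minimal sketch of the splitting step, under the assumption that m = 2µa exactly: the sample is cut into 2µ consecutive blocks of size a, with odd-numbered blocks forming S0 and even-numbered blocks forming S1. The function name and return format are illustrative.

```python
import numpy as np

def split_into_block_subsequences(sample, block_size):
    """Split a sequence of length m = 2 * mu * block_size into mu odd blocks (S0)
    and mu even blocks (S1), each block being block_size consecutive points."""
    sample = np.asarray(sample)
    assert len(sample) % (2 * block_size) == 0, "expects m = 2 * mu * a exactly"
    blocks = sample.reshape(-1, block_size)   # 2*mu blocks of a consecutive points
    s0 = blocks[0::2]                          # odd-numbered blocks
    s1 = blocks[1::2]                          # even-numbered blocks
    return s0, s1

z = np.arange(24)                                          # toy sequence with m = 24
s0, s1 = split_into_block_subsequences(z, block_size=3)    # mu = 4, a = 3
print(s0)   # blocks [0..2], [6..8], [12..14], [18..20]
print(s1)   # blocks [3..5], [9..11], [15..17], [21..23]
```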
50 As shown by Yu [15], for a sufficiently large spacing a between blocks and a sufficiently fast mixing distribution, the expectation of a bounded measurable function h is essentially unchanged if we work with S̃0 instead of S0. [sent-190, score-0.319]
51 Let h be a measurable function bounded by M ≥ 0 defined over the blocks Z_k; then the following holds: | E_{S0}[h] − E_{S̃0}[h] | ≤ (µ − 1) M β(a), where E_{S0} denotes the expectation with respect to the dependent block sequence S0 and E_{S̃0} the expectation with respect to the independent block sequence S̃0. [sent-192, score-0.257]
52 We denote by D̃ the distribution corresponding to the independent blocks Z_k. [sent-193, score-0.148]
53 Also, to work with block sequences, we extend some of our definitions: we define the extension h_a : Z^a → R of any hypothesis h ∈ H to a block hypothesis by h_a(B) = (1/a) Σ_{i=1}^a h(z_i) for any block B = (z_1, ..., z_a) ∈ Z^a. [sent-194, score-1.02]
54 We define H_a as the set of all block-based hypotheses h_a generated from h ∈ H. [sent-197, score-0.516]
55 Concentration Inequality: McDiarmid's inequality requires the sample to be i.i.d. [sent-204, score-0.148]
56 Thus, we first show that Pr[Φ(S)] can be bounded in terms of independent blocks and then apply McDiarmid’s inequality to the independent blocks. [sent-207, score-0.342]
57 Let S denote a sample of size m drawn according to a stationary β-mixing distribution and let S̃0 denote a sequence of independent blocks. [sent-210, score-0.307]
58 The second inequality holds by the union bound and the fact that Φ(S0) or Φ(S1) must surpass ε for their sum to surpass 2ε. [sent-217, score-0.319]
59 We can now apply McDiarmid's inequality to the independent blocks of Lemma 1. [sent-219, score-0.268]
60 For the same assumptions as in Lemma 1, the following bound holds for all ε > E_{S̃0}[Φ(S̃0)]: Pr_S[Φ(S) > ε] ≤ 2 exp( −2µ ε'^2 / M^2 ) + 2(µ − 1)β(a), where ε' = ε − E_{S̃0}[Φ(S̃0)]. [sent-221, score-0.154]
61 Φ(S̃0) can be written in terms of h_a as Φ(S̃0) = sup_{h∈H} [ R(h_a) − R_{S̃0}(h_a) ], where R_{S̃0}(h_a) = (1/µ) Σ_{k=1}^µ h_a(Z_k). [sent-227, score-0.846]
62 Thus, changing a block Z_k of the sample S̃0 can change Φ(S̃0) by at most M/µ, and McDiarmid's inequality yields Pr[ Φ(S̃0) − E_{S̃0}[Φ(S̃0)] > ε' ] ≤ exp( −2ε'^2 / ( Σ_{i=1}^µ (M/µ)^2 ) ) = exp( −2µ ε'^2 / M^2 ). [sent-228, score-0.143]
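To get a feel for how the two terms of Proposition 2 trade off, the sketch below evaluates the right-hand side 2 exp(−2µε'^2/M^2) + 2(µ−1)β(a) under an algebraically mixing envelope β(a) = β0/a^r. All numerical choices are illustrative assumptions, not values from the paper.

```python
import numpy as np

def proposition2_bound(eps_prime, mu, a, M=1.0, beta0=1.0, r=2.0):
    """Right-hand side of the deviation bound: a McDiarmid-style term over mu
    independent blocks plus a penalty 2(mu - 1) beta(a) for de-blocking."""
    beta_a = beta0 / a ** r                     # algebraic mixing envelope
    return 2.0 * np.exp(-2.0 * mu * eps_prime ** 2 / M ** 2) + 2.0 * (mu - 1) * beta_a

m = 1_000_000
for a in (100, 500, 2000, 8000):                # larger blocks mean fewer of them
    mu = m // (2 * a)
    print(a, mu, proposition2_bound(eps_prime=0.1, mu=mu, a=a))
```

The printout illustrates the trade-off: increasing the block size a shrinks the mixing penalty but reduces the number of blocks µ, which weakens the exponential term, so an intermediate a gives the smallest value.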
63 The following inequality holds for the expectation E_{S̃0}[Φ(S̃0)] defined in terms of an independent block sequence: E_{S̃0}[Φ(S̃0)] ≤ R_{D̃}(H). [sent-238, score-0.32]
64 By the convexity of the supremum function and Jensen's inequality, E_{S̃0}[Φ(S̃0)] can be bounded in terms of empirical averages over two samples: E_{S̃0}[Φ(S̃0)] = E_{S̃0}[ sup_{h∈H} ( E_{S̃0'}[R_{S̃0'}(h)] − R_{S̃0}(h) ) ] ≤ E_{S̃0,S̃0'}[ sup_{h∈H} ( R_{S̃0'}(h) − R_{S̃0}(h) ) ]. [sent-240, score-0.494]
65 We now proceed with a standard symmetrization argument, with the independent blocks thought of as i.i.d. points. [sent-241, score-0.13]
66 Starting from the bound above, E_{S̃0}[Φ(S̃0)] ≤ E_{S̃0,S̃0'}[ sup_{h_a∈H_a} (1/µ) Σ_{i=1}^µ ( h_a(Z_i') − h_a(Z_i) ) ] by the definition of R. [sent-244, score-3.408]
67 Introducing Rademacher variables σ_i, this expectation equals E_{S̃0,S̃0',σ}[ sup_{h_a∈H_a} (1/µ) Σ_{i=1}^µ σ_i ( h_a(Z_i') − h_a(Z_i) ) ]. [sent-245, score-0.846]
68 By the sub-additivity of the supremum, it is bounded by 2 E_{S̃0,σ}[ sup_{h_a∈H_a} (1/µ) Σ_{i=1}^µ σ_i h_a(Z_i) ].
69 With probability 1/2, σ_i = 1 and the difference h_a(Z_i') − h_a(Z_i) is left unchanged; and, with probability 1/2, σ_i = −1 and Z_i and Z_i' are permuted. [sent-250, score-0.846]
70 Since the blocks Z_i and Z_i' are independent, taking the expectation over σ leaves the expectation unchanged. [sent-251, score-0.198]
71 The inequality follows from the sub-additivity of the supremum function and the linearity of expectation. [sent-252, score-0.152]
72 We now relate the Rademacher block sequence to a sequence over independent points. [sent-254, score-0.171]
73 The right-hand side of the inequality just presented can be rewritten as 2 E_{S̃0,σ}[ sup_{h_a∈H_a} (1/µ) Σ_{i=1}^µ σ_i h_a(Z_i) ] = 2 E_{S̃0,σ}[ sup_{h∈H} (1/µ) Σ_{i=1}^µ σ_i (1/a) Σ_{j=1}^a h(z_j^(i)) ], where z_j^(i) denotes the jth point of the ith block.
74 We denote by S_µ the sample constructed from the jth point of each independent block Z_i, i ∈ [1, µ]. [sent-259, score-0.152]
75 The remaining quantity, modulo absolute values, is the Rademacher complexity over µ independent points. [sent-263, score-0.204]
76 General Bounds: This section presents and analyzes our main Rademacher complexity generalization bounds for stationary β-mixing sequences. [sent-268, score-0.503]
77 Setting the right-hand side of Proposition 2 to δ and using Lemma 2 to bound E_{S̃0}[Φ(S̃0)] with the Rademacher complexity R_{D̃}(H) shows the result. [sent-273, score-0.243]
78 As pointed out earlier, a key advantage of the Rademacher complexity is that it can be measured from data, assuming that the computation of the minimal empirical error can be done effectively and efficiently. [sent-274, score-0.178]
79 The following theorem gives a bound precisely with respect to the empirical Rademacher complexity RSµ . [sent-276, score-0.284]
80 Under the same assumptions as in Theorem 1, for any µ, a > 0 with 2µa = m and δ > 4(µ − 1)β(a), with probability at least 1 − δ, the following inequality holds for all hypotheses h ∈ H: R(h) ≤ R_S(h) + R_{S_µ}(H) + 3M sqrt( log(4/δ') / (2µ) ), where δ' = δ − 4(µ − 1)β(a). [sent-278, score-0.292]
81 Now, we can apply McDiarmid's inequality to R_{D̃}(H) − R_{S_µ}(H), which is defined over µ i.i.d. points. [sent-282, score-0.196]
82 Changing a point of S_µ can affect R_{S_µ} by at most 2M/µ; thus, McDiarmid's inequality, combined with the independent-block argument, gives Pr[ R_{D̃}(H) − R_{S_µ}(H) > ε ] ≤ exp( −µ ε^2 / (2M^2) ) + (µ − 1)β(2a − 1). [sent-286, score-0.151]
83 The result follows from this inequality combined with the statement of Theorem 1 for a confidence parameter δ/2. [sent-289, score-0.138]
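The sketch below plugs numbers into a data-dependent bound of this form, with the slack term written as 3M sqrt(log(4/δ')/(2µ)) as reconstructed above; the exact constants should be treated as a best-effort reading of the extracted statement rather than a verbatim restatement. All inputs (empirical error, complexity estimate, mixing parameters) are hypothetical.

```python
import numpy as np

def theorem2_style_bound(emp_error, emp_rademacher, m, a, M=1.0,
                         delta=0.05, beta0=1.0, r=2.0):
    """Evaluate a bound of the form
    R(h) <= R_S(h) + R_{S_mu}(H) + 3 M sqrt(log(4/delta') / (2 mu)),
    with delta' = delta - 4 (mu - 1) beta(a) and beta(a) = beta0 / a**r."""
    mu = m // (2 * a)
    delta_prime = delta - 4.0 * (mu - 1) * beta0 / a ** r
    if delta_prime <= 0:
        return np.inf                      # block size too small for this confidence level
    slack = 3.0 * M * np.sqrt(np.log(4.0 / delta_prime) / (2.0 * mu))
    return emp_error + emp_rademacher + slack

# hypothetical numbers: empirical error 0.08, estimated block complexity 0.05
for a in (200, 1000, 5000):
    print(a, theorem2_style_bound(0.08, 0.05, m=2_000_000, a=a))
```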
84 This theorem can be used to derive generalization bounds for a variety of hypothesis sets and learning settings. [sent-290, score-0.325]
85 In the next section, we present margin bounds for kernel-based classification. [sent-291, score-0.204]
86 For any hypothesis h and margin ρ > 0, let R_S^ρ(h) denote the average amount by which y h(x) deviates from ρ over a sample S: R_S^ρ(h) = (1/m) Σ_{i=1}^m ( ρ − y_i h(x_i) )_+. [sent-294, score-0.173]
87 For any h ∈ H, let h̄ denote the corresponding hypothesis defined over Z by ∀z = (x, y) ∈ Z, h̄(z) = −y h(x); and let H̄_K denote the hypothesis set {z ∈ Z → h̄(z) : h ∈ H_K}. [sent-303, score-0.148]
88 The result is then obtained by applying Theorem 2 to (L − 1) ∘ h̄, using R((L − 1) ∘ h̄) = R(L ∘ h̄) − 1 and R_S((L − 1) ∘ h̄) = R_S(L ∘ h̄) − 1, and using the known bound for the empirical Rademacher complexity of kernel-based classifiers [2, 11]: R_S(H̄_K) ≤ (2/|S|) sqrt(Tr[K]). [sent-307, score-0.253]
89 In order to show that this bound converges, we must appropriately choose the parameter µ, or equivalently a, which will depend on the mixing parameter β. [sent-308, score-0.166]
90 In the case of algebraic mixing and using the straightforward bound Tr[K] ≤ m R^2 for the kernel trace, where R is the radius of the ball that contains the data, the following corollary holds. [sent-309, score-0.24]
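Since the corollary's analytic choice of a is not legible from the extracted text, the sketch below simply searches over candidate block sizes for an algebraically mixing process and compares two generic penalty terms that appear in bounds of this kind: a kernel-complexity term 2R/(ρ sqrt(µ)), obtained from the crude bound Tr[K] ≤ µR^2 over the µ-point sub-sample, and a confidence term 3 sqrt(log(4/δ')/(2µ)). This is an illustrative numerical trade-off under stated assumptions, not the paper's corollary.

```python
import numpy as np

def margin_bound_terms(m, a, rho=0.5, R=1.0, delta=0.05, beta0=1.0, r=2.0):
    """Illustrative penalty terms for a kernel margin bound over mu = m/(2a) blocks:
    a complexity term 2 R / (rho * sqrt(mu)) from Tr[K] <= mu * R**2, and a
    confidence term 3 sqrt(log(4/delta') / (2 mu)) with delta' = delta - 4(mu-1) beta(a)."""
    mu = m // (2 * a)
    delta_prime = delta - 4.0 * (mu - 1) * beta0 / a ** r
    if delta_prime <= 0:
        return None                         # mixing penalty exceeds the confidence budget
    complexity = 2.0 * R / (rho * np.sqrt(mu))
    confidence = 3.0 * np.sqrt(np.log(4.0 / delta_prime) / (2.0 * mu))
    return complexity, confidence

m = 500_000
candidates = [a for a in (10, 30, 100, 300, 1000) if margin_bound_terms(m, a)]
best = min(candidates, key=lambda a: sum(margin_bound_terms(m, a)))
print(best, margin_bound_terms(m, best))    # larger a shrinks beta(a) but also shrinks mu
```

Searching numerically rather than fixing a closed-form exponent keeps the sketch agnostic about the corollary's exact rate while still showing why the block size must grow with the sample size.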
91 A tighter estimate of the trace of the kernel matrix, possibly derived from data, would provide a better bound, as would stronger mixing assumptions, e.g., exponential mixing. [sent-314, score-0.173]
92 Finally, we note that as r → ∞ and β0 → 0, that is, as the dependence between points vanishes, the right-hand side of the bound approaches O( R_S^ρ + 1/sqrt(m) ), which coincides with the asymptotic behavior in the i.i.d. case. [sent-317, score-0.173]
93 Conclusion: We presented the first Rademacher complexity error bounds for dependent samples generated by a stationary β-mixing process, a generalization of similar existing bounds derived for the i.i.d. case. [sent-321, score-0.739]
94 We also gave the first margin bounds for kernel-based classification in this non-i.i.d. setting. [sent-325, score-0.221]
95 Similar margin bounds can be obtained for the regression setting by using Theorem 2 and the properties of the empirical Rademacher complexity, as in the i.i.d. case. [sent-329, score-0.24]
96 Moreover, bounds based on other complexity measures such as the VC-dimension or covering numbers can be retrieved from our framework. [sent-336, score-0.367]
97 Our framework and the bounds presented could serve as the basis for the design of regularization-based algorithms for dependent samples generated by a stationary β-mixing process. [sent-337, score-0.359]
98 Rademacher and Gaussian complexities: Risk bounds and structural results. [sent-348, score-0.151]
99 Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. [sent-374, score-0.168]
100 Rates of convergence for empirical processes of stationary mixing sequences. [sent-412, score-0.27]
wordName wordTfidf (topN-words)
[('rademacher', 0.569), ('ha', 0.423), ('rs', 0.377), ('pr', 0.179), ('sup', 0.174), ('bounds', 0.151), ('mcdiarmid', 0.147), ('zi', 0.144), ('stationary', 0.143), ('complexity', 0.142), ('inequality', 0.12), ('zk', 0.103), ('yh', 0.098), ('blocks', 0.096), ('hypotheses', 0.093), ('mixing', 0.091), ('zt', 0.08), ('bound', 0.075), ('generalization', 0.067), ('rd', 0.065), ('block', 0.059), ('hypothesis', 0.056), ('holds', 0.056), ('stationarity', 0.055), ('zj', 0.055), ('margin', 0.053), ('expectation', 0.051), ('corollary', 0.05), ('pac', 0.049), ('mohri', 0.047), ('bounded', 0.04), ('algebraically', 0.039), ('lozano', 0.039), ('reversing', 0.039), ('tighter', 0.039), ('hk', 0.039), ('sequence', 0.039), ('convexity', 0.038), ('drawn', 0.037), ('dependent', 0.037), ('empirical', 0.036), ('lemma', 0.035), ('scenario', 0.035), ('koltchinskii', 0.034), ('surpass', 0.034), ('independent', 0.034), ('supremum', 0.032), ('equality', 0.032), ('proposition', 0.032), ('jth', 0.031), ('theorem', 0.031), ('affect', 0.031), ('dependence', 0.028), ('sample', 0.028), ('samples', 0.028), ('modulo', 0.028), ('mercer', 0.026), ('courant', 0.026), ('steinwart', 0.026), ('said', 0.026), ('yu', 0.026), ('covering', 0.026), ('measures', 0.026), ('side', 0.026), ('odd', 0.025), ('consistency', 0.025), ('algebraic', 0.024), ('trace', 0.023), ('coincides', 0.023), ('assumptions', 0.023), ('ny', 0.023), ('unchanged', 0.022), ('numbers', 0.022), ('street', 0.022), ('tr', 0.021), ('points', 0.021), ('york', 0.02), ('derive', 0.02), ('derived', 0.02), ('identically', 0.019), ('notion', 0.019), ('measurable', 0.019), ('adopted', 0.019), ('event', 0.019), ('proves', 0.018), ('apply', 0.018), ('denote', 0.018), ('statement', 0.018), ('let', 0.018), ('classi', 0.017), ('proof', 0.017), ('gave', 0.017), ('xm', 0.017), ('obstacles', 0.017), ('afshin', 0.017), ('nystar', 0.017), ('rostami', 0.017), ('rostamizadeh', 0.017), ('pioneering', 0.017), ('ezi', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents the first Rademacher complexity-based error bounds for non-i.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of non-i.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and leads to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.
2 0.39493349 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
Author: Sham M. Kakade, Karthik Sridharan, Ambuj Tewari
Abstract: This work characterizes the generalization ability of algorithms whose predictions are linear in the input vector. To this end, we provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which directly lead to a number of generalization bounds. This derivation provides simplified proofs of a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either L2 or L1 constraints), margin bounds (including both L2 and L1 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and upper bounds on L2 covering numbers (with Lp norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds. Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction, up to a constant factor of 2. 1
3 0.16689405 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
Author: Mohak Shah
Abstract: We derive risk bounds for the randomized classifiers in Sample Compression setting where the classifier-specification utilizes two sources of information viz. the compression set and the message string. By extending the recently proposed Occam’s Hammer principle to the data-dependent settings, we derive point-wise versions of the bounds on the stochastic sample compressed classifiers and also recover the corresponding classical PAC-Bayes bound. We further show how these compare favorably to the existing results.
4 0.15332706 111 nips-2008-Kernel Change-point Analysis
Author: Zaïd Harchaoui, Eric Moulines, Francis R. Bach
Abstract: We introduce a kernel-based method for change-point analysis within a sequence of temporal observations. Change-point analysis of an unlabelled sample of observations consists in, first, testing whether a change in the distribution occurs within the sample, and second, if a change occurs, estimating the change-point instant after which the distribution of the observations switches from one distribution to another different distribution. We propose a test statistic based upon the maximum kernel Fisher discriminant ratio as a measure of homogeneity between segments. We derive its limiting distribution under the null hypothesis (no change occurs), and establish the consistency under the alternative hypothesis (a change occurs). This allows to build a statistical hypothesis testing procedure for testing the presence of a change-point, with a prescribed false-alarm probability and detection probability tending to one in the large-sample setting. If a change actually occurs, the test statistic also yields an estimator of the change-point location. Promising experimental results in temporal segmentation of mental tasks from BCI data and pop song indexation are presented. 1
5 0.1375373 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
6 0.096820801 85 nips-2008-Fast Rates for Regularized Objectives
7 0.08780054 196 nips-2008-Relative Margin Machines
8 0.081401646 65 nips-2008-Domain Adaptation with Multiple Sources
9 0.079195388 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
10 0.077102154 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
11 0.073121108 40 nips-2008-Bounds on marginal probability distributions
12 0.067698911 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
13 0.066383995 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
14 0.065594755 239 nips-2008-Tighter Bounds for Structured Estimation
15 0.062012378 99 nips-2008-High-dimensional support union recovery in multivariate regression
16 0.061042871 179 nips-2008-Phase transitions for high-dimensional joint support recovery
17 0.059488826 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
18 0.0580904 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
19 0.057497699 195 nips-2008-Regularized Policy Iteration
20 0.054548435 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
topicId topicWeight
[(0, -0.165), (1, -0.01), (2, -0.177), (3, 0.035), (4, -0.112), (5, 0.051), (6, -0.01), (7, -0.058), (8, -0.008), (9, 0.037), (10, 0.02), (11, 0.21), (12, 0.25), (13, 0.045), (14, -0.09), (15, 0.202), (16, 0.12), (17, 0.035), (18, -0.014), (19, -0.085), (20, -0.05), (21, -0.064), (22, 0.146), (23, -0.134), (24, 0.076), (25, 0.113), (26, -0.06), (27, 0.202), (28, -0.046), (29, -0.065), (30, 0.091), (31, 0.144), (32, -0.111), (33, 0.063), (34, -0.068), (35, -0.021), (36, 0.079), (37, -0.101), (38, -0.073), (39, -0.052), (40, -0.032), (41, 0.092), (42, 0.146), (43, 0.086), (44, -0.102), (45, 0.102), (46, -0.054), (47, 0.08), (48, 0.146), (49, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.97174126 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents the first Rademacher complexity-based error bounds for non-i.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of non-i.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and leads to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.
2 0.78459632 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
Author: Sham M. Kakade, Karthik Sridharan, Ambuj Tewari
Abstract: This work characterizes the generalization ability of algorithms whose predictions are linear in the input vector. To this end, we provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which directly lead to a number of generalization bounds. This derivation provides simplified proofs of a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either L2 or L1 constraints), margin bounds (including both L2 and L1 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and upper bounds on L2 covering numbers (with Lp norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds. Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction, up to a constant factor of 2. 1
3 0.54594976 111 nips-2008-Kernel Change-point Analysis
Author: Zaïd Harchaoui, Eric Moulines, Francis R. Bach
Abstract: We introduce a kernel-based method for change-point analysis within a sequence of temporal observations. Change-point analysis of an unlabelled sample of observations consists in, first, testing whether a change in the distribution occurs within the sample, and second, if a change occurs, estimating the change-point instant after which the distribution of the observations switches from one distribution to another different distribution. We propose a test statistic based upon the maximum kernel Fisher discriminant ratio as a measure of homogeneity between segments. We derive its limiting distribution under the null hypothesis (no change occurs), and establish the consistency under the alternative hypothesis (a change occurs). This allows to build a statistical hypothesis testing procedure for testing the presence of a change-point, with a prescribed false-alarm probability and detection probability tending to one in the large-sample setting. If a change actually occurs, the test statistic also yields an estimator of the change-point location. Promising experimental results in temporal segmentation of mental tasks from BCI data and pop song indexation are presented. 1
4 0.52548122 250 nips-2008-Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning
Author: Ali Rahimi, Benjamin Recht
Abstract: Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?” asked Minsky. “I am training a randomly wired neural net to play tic-tac-toe,” Sussman replied. “Why is the net wired randomly?” asked Minsky. Sussman replied, “I do not want it to have any preconceptions of how to play.” Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty,” replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities. 1
5 0.50422192 85 nips-2008-Fast Rates for Regularized Objectives
Author: Karthik Sridharan, Shai Shalev-shwartz, Nathan Srebro
Abstract: We study convergence properties of empirical minimization of a stochastic strongly convex objective, where the stochastic component is linear. We show that the value attained by the empirical minimizer converges to the optimal value with rate 1/n. The result applies, in particular, to the SVM objective. Thus, we obtain a rate of 1/n on the convergence of the SVM objective (with fixed regularization parameter) to its infinite data limit. We demonstrate how this is essential for obtaining certain type of oracle inequalities for SVMs. The results extend also to approximate minimization as well as to strong convexity with respect to an arbitrary norm, and so also to objectives regularized using other p norms. 1
6 0.47272691 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
7 0.42853728 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning
8 0.41914535 2 nips-2008-A Convex Upper Bound on the Log-Partition Function for Binary Distributions
9 0.35729218 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
10 0.35066834 40 nips-2008-Bounds on marginal probability distributions
11 0.34495261 238 nips-2008-Theory of matching pursuit
12 0.32369772 196 nips-2008-Relative Margin Machines
13 0.31781945 239 nips-2008-Tighter Bounds for Structured Estimation
14 0.29784015 37 nips-2008-Biasing Approximate Dynamic Programming with a Lower Discount Factor
15 0.29748631 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
16 0.28800818 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
17 0.27638406 185 nips-2008-Privacy-preserving logistic regression
18 0.27426451 65 nips-2008-Domain Adaptation with Multiple Sources
19 0.26326746 236 nips-2008-The Mondrian Process
20 0.25864294 129 nips-2008-MAS: a multiplicative approximation scheme for probabilistic inference
topicId topicWeight
[(6, 0.124), (7, 0.038), (12, 0.035), (28, 0.212), (42, 0.198), (57, 0.048), (63, 0.013), (71, 0.132), (77, 0.037), (83, 0.044)]
simIndex simValue paperId paperTitle
1 0.85136133 57 nips-2008-Deflation Methods for Sparse PCA
Author: Lester W. Mackey
Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1
same-paper 2 0.8477968 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents the first Rademacher complexity-based error bounds for non-i.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of non-i.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and leads to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.
3 0.80582923 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization
Author: Sham M. Kakade, Karthik Sridharan, Ambuj Tewari
Abstract: This work characterizes the generalization ability of algorithms whose predictions are linear in the input vector. To this end, we provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which directly lead to a number of generalization bounds. This derivation provides simplified proofs of a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either L2 or L1 constraints), margin bounds (including both L2 and L1 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and upper bounds on L2 covering numbers (with Lp norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds. Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction, up to a constant factor of 2. 1
4 0.77958477 131 nips-2008-MDPs with Non-Deterministic Policies
Author: Mahdi M. Fard, Joelle Pineau
Abstract: Markov Decision Processes (MDPs) have been extensively studied and used in the context of planning and decision-making, and many methods exist to find the optimal policy for problems modelled as MDPs. Although finding the optimal policy is sufficient in many domains, in certain applications such as decision support systems where the policy is executed by a human (rather than a machine), finding all possible near-optimal policies might be useful as it provides more flexibility to the person executing the policy. In this paper we introduce the new concept of non-deterministic MDP policies, and address the question of finding near-optimal non-deterministic policies. We propose two solutions to this problem, one based on a Mixed Integer Program and the other one based on a search algorithm. We include experimental results obtained from applying this framework to optimize treatment choices in the context of a medical decision support system. 1
5 0.77952325 11 nips-2008-A spatially varying two-sample recombinant coalescent, with applications to HIV escape response
Author: Alexander Braunstein, Zhi Wei, Shane T. Jensen, Jon D. Mcauliffe
Abstract: Statistical evolutionary models provide an important mechanism for describing and understanding the escape response of a viral population under a particular therapy. We present a new hierarchical model that incorporates spatially varying mutation and recombination rates at the nucleotide level. It also maintains separate parameters for treatment and control groups, which allows us to estimate treatment effects explicitly. We use the model to investigate the sequence evolution of HIV populations exposed to a recently developed antisense gene therapy, as well as a more conventional drug therapy. The detection of biologically relevant and plausible signals in both therapy studies demonstrates the effectiveness of the method. 1
6 0.77682698 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
7 0.77274668 85 nips-2008-Fast Rates for Regularized Objectives
8 0.77052939 220 nips-2008-Spike Feature Extraction Using Informative Samples
9 0.76740056 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization
10 0.76205963 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
11 0.7545166 162 nips-2008-On the Design of Loss Functions for Classification: theory, robustness to outliers, and SavageBoost
12 0.74777937 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging
13 0.7450496 202 nips-2008-Robust Regression and Lasso
14 0.74079782 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
15 0.7395823 239 nips-2008-Tighter Bounds for Structured Estimation
16 0.73683447 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
17 0.73518813 196 nips-2008-Relative Margin Machines
18 0.73308903 15 nips-2008-Adaptive Martingale Boosting
19 0.7319752 104 nips-2008-Improved Moves for Truncated Convex Models
20 0.73141676 228 nips-2008-Support Vector Machines with a Reject Option