nips nips2007 nips2007-184 knowledge-graph by maker-knowledge-mining

184 nips-2007-Stability Bounds for Non-i.i.d. Processes


Source: pdf

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weakens over time. It proves novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. [sent-8, score-0.387]

2 A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. [sent-9, score-0.179]

3 But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). [sent-10, score-0.542]

4 The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. [sent-15, score-0.124]

5 This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weaken over time. [sent-16, score-0.6]

6 It proves novel stability-based generalization bounds that hold even with this more general setting. [sent-17, score-0.305]

7 These bounds strictly generalize the bounds given in the i.i.d. case. [sent-18, score-0.358]

8 1 Introduction The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds [2–4,6]. [sent-23, score-0.566]

9 A learning algorithm is stable when the hypotheses it outputs differ in a limited way when small changes are made to the training set. [sent-24, score-0.124]

10 A key advantage of stability bounds is that they are tailored to specific learning algorithms, exploiting their particular properties. [sent-25, score-0.424]

11 But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). [sent-10, score-0.542]

12 The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. [sent-36, score-0.124]

13 This paper studies the scenario where the observations are drawn from a stationary mixing sequence, a widely adopted assumption in the study of non-i.i.d. processes. [sent-38, score-0.57]

14 Our proofs are also based on the independent block technique commonly used in such contexts [17] and a generalized version of McDiarmid’s inequality [7]. [sent-42, score-0.189]

15 We prove novel stability-based generalization bounds that hold even with this more general setting. [sent-43, score-0.326]

16 These bounds strictly generalize the bounds given in the i.i.d. case. [sent-44, score-0.358]

17 Algorithms such as support vector regression (SVR) [14, 15] have been used in the context of time series prediction, in which the i.i.d. assumption does not hold. [sent-52, score-0.14]

18 The stability bounds we give for SVR and many other kernel regularization-based algorithms can thus be viewed as the first theoretical basis for their use in such scenarios. [sent-60, score-0.466]

19 Section 3 gives our main generalization bounds based on stability, including the full proof and analysis. [sent-65, score-0.267]

20 In Section 4, we apply these bounds to general kernel regularization-based algorithms, including Support Vector Regression and Kernel Ridge Regression. [sent-66, score-0.221]

21 2 Preliminaries We first introduce some standard definitions for dependent observations in mixing theory [5] and then briefly discuss the learning scenarios in the non-i.i.d. case. [sent-67, score-0.287]

22 A sequence of random variables Z = {Zt}, t = −∞, . . . , +∞, is said to be stationary if for any t and non-negative integers m and k, the random vectors (Zt, . . . , Zt+m) and (Zt+k, . . . , Zt+m+k) have the same distribution. [sent-76, score-0.256]

23 Thus, the index t, or time, does not affect the distribution of a variable Zt in a stationary sequence. [sent-83, score-0.166]

24 The following is a standard definition giving a measure of the dependence of the random variables Zt within a stationary sequence. [sent-86, score-0.21]

25 Let Z = {Zt}, t = −∞, . . . , +∞, be a stationary sequence of random variables. [sent-89, score-0.207]

26 It is said to be algebraically β-mixing (algebraically ϕ-mixing) if there exist real numbers β0 > 0 (resp. ϕ0 > 0) and r > 0 such that β(k) ≤ β0/k^r (resp. [sent-94, score-0.242]

27 ϕ(k) ≤ ϕ0/k^r) for all k, exponentially mixing if there exist real numbers β0 (resp. [sent-96, score-0.191]

28 We will be using a concentration inequality that leads to simple bounds but that applies to φ-mixing processes only. [sent-102, score-0.344]

29 This is a standard assumption adopted in previous studies of learning in the presence of dependent observations [8, 10, 16, 17]. [sent-104, score-0.179]

30 Several results have also been obtained in the more general context of α-mixing but they seem to require the stronger condition of exponential mixing [11]. [sent-106, score-0.168]

31 The mixing parameters can also be estimated in such cases. [sent-108, score-0.168]

32 Most previous studies use a technique originally introduced by [1] based on independent blocks of equal size [8, 10, 17]. [sent-109, score-0.158]

33 This technique is particularly relevant when dealing with stationary β-mixing. [sent-110, score-0.166]

34 Let µ ≥ 1 and suppose that h is a measurable function, with absolute value bounded by M, on a product probability space (∏_{j=1}^{µ} Ωj, ∏_{i=1}^{µ} σ_{ri}^{si}), where ri ≤ si ≤ ri+1 for all i. [sent-116, score-0.293]

35 Then inequality (2) holds. The lemma gives a measure of the difference between the distribution of µ blocks where the blocks are independent in one case and dependent in the other case. [sent-123, score-0.377]

36 For a fixed learning algorithm, we denote by hS the hypothesis it returns when trained on the sample S. [sent-134, score-0.16]

37 The error of a hypothesis on a pair z ∈ X ×Y is measured in terms of a cost function c : Y ×Y → R+ . [sent-135, score-0.147]

38 Thus, c(h(x), y) measures the error of a hypothesis h on a pair (x, y), c(h(x), y) = (h(x) − y)^2 in the standard regression cases. [sent-136, score-0.16]

39 We will use the shorthand c(h, z) := c(h(x), y) for a hypothesis h and z = (x, y) ∈ X × Y and will assume that c is upper bounded by a constant M > 0. [sent-137, score-0.122]

40 We denote by R̂(h) the empirical error of a hypothesis h for a training sample S = (z1, . . . , zm): R̂(h) = (1/m) Σ_{i=1}^m c(h, zi). [sent-138, score-0.166]

41 We will consider here the more general case of dependent samples drawn from a stationary mixing sequence Z over X × Y . [sent-149, score-0.487]

42 In the most general version, future samples depend on the training sample S and thus the generalization error or true error of the hypothesis hS trained on S must be measured by its expected error conditioned on the sample S: R(hS ) = E[c(hS , z) | S]. [sent-155, score-0.396]

43 A somewhat less realistic version is one where the samples are dependent, but the test points are assumed to be independent of the training sample S. [sent-157, score-0.199]

44 The generalization error of the hypothesis hS trained on S is then: R(hS ) = E[c(hS , z) | S] = E[c(hS , z)]. [sent-158, score-0.24]

45 This setting seems less natural since if samples are dependent, then future test points must also depend on the training points, even if that dependence is relatively weak due to the time interval after which test points are drawn. [sent-159, score-0.142]

46 Nevertheless, it is this somewhat less realistic setting that has been studied by all previous machine learning studies that we are aware of [8, 10, 16, 17], even when examining specifically a time series prediction problem [10]. [sent-160, score-0.142]

47 Thus, the bounds derived in these studies cannot be applied to the more general setting. [sent-161, score-0.216]

48 3 Stability Bounds This section gives generalization bounds for β̂-stable algorithms over a mixing stationary distribution. [sent-168, score-0.601]

49 The first two sections present our main proofs which hold for β-mixing stationary distributions. [sent-169, score-0.23]

50 In the third section, we will be using a concentration inequality that applies to φ-mixing processes only. [sent-170, score-0.165]

51 It has later been used successfully by [2, 3] to show algorithm-specific stability bounds for i.i.d. samples. [sent-172, score-0.424]

52 A learning algorithm is said to be (uniformly) β̂-stable if the hypotheses it returns for any two training samples S and S′ that differ by a single point satisfy |c(hS, z) − c(hS′, z)| ≤ β̂ for all z ∈ X × Y. (6) [sent-179, score-0.159]

53 Many generalization error bounds rely on McDiarmid’s inequality. [sent-180, score-0.291]

54 Instead, we will use a theorem that extends McDiarmid’s inequality to general mixing distributions (Theorem 1, Section 3). [sent-185, score-0.3]

55 To obtain a stability-based generalization bound, we will apply this theorem to Φ(S) = R̂(hS) − R(hS). [sent-187, score-0.146]

56 We first present a lemma that relates the expected value of the generalization error in that scenario and the same expectation in the scenario where the test point is independent of the training sample. [sent-196, score-0.408]

57 We denote by R(hS ) = Ez [c(hS , z)|S] the expectation in the dependent case and by R(hSb ) = Ez [c(hSb , z)] that expectation when the test points are assumed independent of the training, with Sb denoting a sequence similar to S but with the last b points removed. [sent-197, score-0.201]

58 The block Sb is assumed to have exactly the same distribution as the corresponding block of the same size in S. [sent-199, score-0.164]

59 Then, for any sample S of size m drawn from a β-mixing stationary distribution and for any b ∈ {0, . [sent-202, score-0.235]

60 The other side of the inequality of the lemma can be shown following the same steps. [sent-209, score-0.206]

61 1 The standard variable used for the stability coefficient is β. [sent-211, score-0.245]

62 Figure 1: Illustration of the sequences derived from S that are considered in the proofs. [sent-213, score-0.353]

63 Let S = (z1, . . . , zm) and S^i = (z1, . . . , z′i, . . . , zm) be two sequences drawn from a β-mixing stationary process that differ only in point i ∈ [1, m], and let hS and hS^i be the hypotheses returned by a β̂-stable algorithm when trained on each of these samples. [sent-221, score-0.506]

64 Bounding the difference of costs on agreeing points with β̂ and the one point that disagrees with M yields |R̂(hS) − R̂(hS^i)| ≤ (1/m) Σ_{j≠i} |c(hS, zj) − c(hS^i, zj)| + (1/m) |c(hS, zi) − c(hS^i, z′i)| ≤ β̂ + M/m. (11) [sent-225, score-0.231]

65 Now, applying Lemma 2 to both generalization error terms and using β̂-stability results in |R(hS) − R(hS^i)| ≤ |R(hSb) − R(hSb^i)| + 2bβ̂ + 2β(b)M = |Ez[c(hSb, z) − c(hSb^i, z)]| + 2bβ̂ + 2β(b)M ≤ β̂ + 2bβ̂ + 2β(b)M. (12) [sent-226, score-0.42]

66 Let hS be the hypothesis returned by a β-stable algorithm trained on a sample S drawn from a stationary β-mixing distribution. [sent-232, score-0.42]

67 Let Si be the sequence S with the b points before and after point zi removed. [sent-236, score-0.222]

68 Let S̃i denote a similar set of three blocks, each with the same distribution as the corresponding block in Si, but such that the three blocks are independent. [sent-239, score-0.265]

69 In particular, the middle block, reduced to one point zi, is independent of the two others. [sent-240, score-0.243]

70 By the β̂-stability of the algorithm, E_S[R̂(hS)] = E_S[(1/m) Σ_{i=1}^m c(hS, zi)] ≤ E_{Si}[(1/m) Σ_{i=1}^m c(hSi, zi)] + 2bβ̂. (14) [sent-241, score-0.308]

71 Applying Lemma 1 to the first term of the right-hand side yields E_S[R̂(hS)] ≤ E_{S̃i}[(1/m) Σ_{i=1}^m c(hS̃i, zi)] + 2bβ̂ + 2β(b)M. (15) [sent-242, score-0.183]

72 Combining the independent block sequences associated to R(hS) and R̂(hS) will help us prove the lemma in a way similar to the i.i.d. case. [sent-243, score-0.258]

73 To deal with independent block sequences defined with respect to the same hypothesis, we will consider the sequence Si,b = Si ∩ Sb , which is illustrated by Figure 1(c). [sent-248, score-0.175]

74 As before, we will consider a sequence Si,b with a similar set of blocks each with the same distribution as the corresponding blocks in Si,b , but such that the blocks are independent. [sent-250, score-0.338]

75 Since three blocks of at most b points are removed from each hypothesis, by the β̂-stability of the learning algorithm, the following holds: E_S[Φ(S)] = E_S[R̂(hS) − R(hS)] = E_{S,z}[(1/m) Σ_{i=1}^m c(hS, zi) − c(hS, z)] (16) ≤ E_{Si,b,z}[(1/m) Σ_{i=1}^m c(hSi,b, zi) − c(hSi,b, z)] + 6bβ̂. (17) [sent-251, score-0.434]

76 Now, the application of Lemma 1 to the difference of two cost functions, also bounded by M, as in the right-hand side leads to E_S[Φ(S)] ≤ E_{S̃i,b,z̃}[(1/m) Σ_{i=1}^m c(hS̃i,b, zi) − c(hS̃i,b, z̃)] + 6bβ̂ + 3β(b)M. [sent-252, score-0.279]

77 The other side of the inequality in the statement of the lemma can be shown following the same steps. [sent-255, score-0.245]

78 We will use the following theorem which extends McDiarmid’s inequality to ϕ-mixing distributions. [sent-258, score-0.132]

79 The theorem gives a general stability bound for ϕ-mixing stationary sequences. [sent-272, score-0.527]

80 If we further assume that the sequence is algebraically ϕ-mixing, that is, for all k, ϕ(k) = ϕ0 k^−r for some r > 1, then we can solve for the value of b to optimize the bound. [sent-273, score-0.211]

81 For an algebraically mixing sequence, the value of b minimizing the bound of Theorem 2 satisfies β̂b = rMϕ(b), which gives b = (β̂/(rϕ0M))^(−1/(r+1)) and ϕ(b) = ϕ0 (β̂/(rϕ0M))^(r/(r+1)). [sent-281, score-0.396]

82 In the case of a zero mixing coefficient (ϕ = 0 and b = 0), the bounds of Theorem 2 and Theorem 3 coincide with the i.i.d. bounds. [sent-285, score-0.347]

83 In order for the right-hand side of these bounds to converge, we must have β̂ = o(1/√m) and ϕ(b) = o(1/√m). [sent-289, score-0.208]

84 In the case of algebraically mixing sequences with r > 1 assumed in Theorem 3, β̂ ≤ O(1/m) implies ϕ(b) = ϕ0 (β̂/(rϕ0M))^(r/(r+1)) < O(1/√m). [sent-291, score-0.413]

85 4 Application We now present the application of our stability bounds to several algorithms in the case of an algebraically mixing sequence. [sent-293, score-0.787]

86 Let hS denote the hypothesis returned by the algorithm when trained on a sample S drawn from an algebraically ϕ-mixing stationary distribution. [sent-298, score-0.59]

87 Then, with probability at least 1 − δ, the following generalization bounds hold for a. [sent-299, score-0.305]

88 Plugging these values of β̂ into the bound of Theorem 3 and using the lower bound on r, r > 1, yields the statement of the corollary. [sent-306, score-0.155]

89 These bounds give, to the best of our knowledge, the first stability-based generalization bounds for SVR and KRR in a non-i.i.d. scenario. [sent-307, score-0.446]

90 Similar bounds can be obtained for other families of algorithms such as maximum entropy discrimination, which can be shown to have comparable stability properties [3]. [sent-311, score-0.424]

91 Our bounds have the same convergence behavior as those derived by [3] in the i.i.d. case. [sent-312, score-0.179]

92 It would be interesting to give a quantitative comparison of our bounds and the generalization bounds of [10] based on covering numbers for mixing stationary distributions, in the scenario where test points are independent of the training sample. [sent-321, score-0.98]

93 In general, because the bounds of [10] are not algorithm-dependent, one can expect tighter bounds using stability, provided that a tight bound is given on the stability coefficient. [sent-322, score-0.708]

94 For a fixed λ, the asymptotic behavior of our stability bounds for SVR and KRR is tight. [sent-324, score-0.424]

95 5 Conclusion Our stability bounds for mixing stationary sequences apply to large classes of algorithms, including SVR and KRR, extending to weakly dependent observations the existing bounds for the i.i.d. case. [sent-325, score-1.098]

96 Since they are algorithm-specific, these bounds can often be tighter than other generalization bounds. [sent-329, score-0.288]

97 Weaker notions of stability might help further improve or refine them. [sent-330, score-0.245]

98 Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. [sent-370, score-0.166]

99 On the consistency in nonparametric estimation under mixing assumptions. [sent-385, score-0.168]

100 Rates of convergence for empirical processes of stationary mixing sequences. [sent-423, score-0.36]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hs', 0.704), ('stability', 0.245), ('hsb', 0.217), ('bounds', 0.179), ('svr', 0.17), ('algebraically', 0.17), ('mixing', 0.168), ('stationary', 0.166), ('zi', 0.154), ('lemma', 0.103), ('blocks', 0.099), ('krr', 0.094), ('zt', 0.091), ('generalization', 0.088), ('hypothesis', 0.087), ('si', 0.087), ('mcdiarmid', 0.08), ('inequality', 0.074), ('scenario', 0.074), ('zm', 0.072), ('sb', 0.072), ('pr', 0.072), ('ridge', 0.068), ('block', 0.067), ('bound', 0.058), ('theorem', 0.058), ('returned', 0.057), ('dependent', 0.054), ('zj', 0.051), ('regression', 0.049), ('said', 0.049), ('ri', 0.048), ('sequences', 0.045), ('dependence', 0.044), ('hsi', 0.043), ('kontorovich', 0.043), ('kernel', 0.042), ('sequence', 0.041), ('trained', 0.041), ('holds', 0.04), ('statement', 0.039), ('hold', 0.038), ('series', 0.038), ('hypotheses', 0.038), ('studies', 0.037), ('observations', 0.037), ('drawn', 0.037), ('nitions', 0.036), ('lipschitz', 0.036), ('measurable', 0.036), ('cost', 0.036), ('bounded', 0.035), ('stable', 0.035), ('courant', 0.034), ('illustrates', 0.033), ('concentration', 0.033), ('qi', 0.033), ('mohri', 0.032), ('applies', 0.032), ('sample', 0.032), ('covering', 0.031), ('assumed', 0.03), ('support', 0.03), ('side', 0.029), ('assumption', 0.029), ('ez', 0.029), ('algorithmic', 0.028), ('scenarios', 0.028), ('ki', 0.028), ('differ', 0.028), ('points', 0.027), ('zk', 0.027), ('stock', 0.027), ('corollary', 0.027), ('processes', 0.026), ('tight', 0.026), ('mercer', 0.026), ('diagnosis', 0.026), ('proofs', 0.026), ('sup', 0.026), ('classes', 0.025), ('coef', 0.025), ('application', 0.025), ('bousquet', 0.024), ('plugging', 0.024), ('du', 0.024), ('error', 0.024), ('street', 0.023), ('numbers', 0.023), ('realistic', 0.023), ('prediction', 0.023), ('training', 0.023), ('analyses', 0.023), ('let', 0.022), ('adopted', 0.022), ('independent', 0.022), ('tighter', 0.021), ('somewhat', 0.021), ('prove', 0.021), ('samples', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 184 nips-2007-Stability Bounds for Non-i.i.d. Processes

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weakens over time. It proves novel stability-based generalization bounds that hold even in this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.

2 0.18661831 46 nips-2007-Cluster Stability for Finite Samples

Author: Ohad Shamir, Naftali Tishby

Abstract: Over the past few years, the notion of stability in data clustering has received growing attention as a cluster validation criterion in a sample-based framework. However, recent work has shown that as the sample size increases, any clustering model will usually become asymptotically stable. This led to the conclusion that stability is lacking as a theoretical and practical tool. The discrepancy between this conclusion and the success of stability in practice has remained an open question, which we attempt to address. Our theoretical approach is that stability, as used by cluster validation algorithms, is similar in certain respects to measures of generalization in a model-selection framework. In such cases, the model chosen governs the convergence rate of generalization bounds. By arguing that these rates are more important than the sample size, we are led to the prediction that stability-based cluster validation algorithms should not degrade with increasing sample size, despite the asymptotic universal stability. This prediction is substantiated by a theoretical analysis as well as some empirical results. We conclude that stability remains a meaningful cluster validation criterion over finite samples. 1

3 0.15479878 108 nips-2007-Kernel Measures of Conditional Dependence

Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf

Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1

4 0.076228611 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

Author: Byron Boots, Geoffrey J. Gordon, Sajid M. Siddiqi

Abstract: Stability is a desirable characteristic for linear dynamical systems, but it is often ignored by algorithms that learn these systems from data. We propose a novel method for learning stable linear dynamical systems: we formulate an approximation of the problem as a convex program, start with a solution to a relaxed version of the program, and incrementally add constraints to improve stability. Rather than continuing to generate constraints until we reach a feasible solution, we test stability at each step; because the convex program is only an approximation of the desired problem, this early stopping rule can yield a higher-quality solution. We apply our algorithm to the task of learning dynamic textures from image sequences as well as to modeling biosurveillance drug-sales data. The constraint generation approach leads to noticeable improvement in the quality of simulated sequences. We compare our method to those of Lacy and Bernstein [1, 2], with positive results in terms of accuracy, quality of simulated sequences, and efficiency. 1

5 0.067311212 21 nips-2007-Adaptive Online Gradient Descent

Author: Elad Hazan, Alexander Rakhlin, Peter L. Bartlett

Abstract: We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates √ between T and log T . Furthermore, we show strong optimality of the algorithm. Finally, we provide an extension of our results to general norms. 1

6 0.066991985 7 nips-2007-A Kernel Statistical Test of Independence

7 0.064431675 110 nips-2007-Learning Bounds for Domain Adaptation

8 0.056060601 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

9 0.054503396 160 nips-2007-Random Features for Large-Scale Kernel Machines

10 0.053388562 147 nips-2007-One-Pass Boosting

11 0.053350791 15 nips-2007-A general agnostic active learning algorithm

12 0.052070212 118 nips-2007-Learning with Transformation Invariant Kernels

13 0.051750973 91 nips-2007-Fitted Q-iteration in continuous action-space MDPs

14 0.051164027 20 nips-2007-Adaptive Embedded Subgraph Algorithms using Walk-Sum Analysis

15 0.050968084 159 nips-2007-Progressive mixture rules are deviation suboptimal

16 0.049739577 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning

17 0.049013108 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information

18 0.046902206 186 nips-2007-Statistical Analysis of Semi-Supervised Regression

19 0.043565858 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images

20 0.04279957 17 nips-2007-A neural network implementing optimal state estimation based on dynamic spike train decoding


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.162), (1, -0.024), (2, -0.047), (3, 0.072), (4, 0.035), (5, -0.037), (6, 0.019), (7, -0.01), (8, -0.104), (9, -0.002), (10, 0.08), (11, 0.049), (12, 0.06), (13, -0.065), (14, -0.082), (15, -0.042), (16, 0.147), (17, -0.016), (18, -0.006), (19, 0.011), (20, 0.121), (21, 0.09), (22, -0.046), (23, -0.122), (24, -0.035), (25, 0.05), (26, 0.181), (27, -0.096), (28, 0.018), (29, -0.147), (30, 0.005), (31, 0.016), (32, -0.103), (33, 0.008), (34, 0.025), (35, -0.082), (36, -0.005), (37, -0.174), (38, -0.017), (39, -0.046), (40, 0.037), (41, 0.03), (42, 0.061), (43, -0.08), (44, -0.15), (45, 0.18), (46, 0.152), (47, -0.132), (48, 0.153), (49, 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96010882 184 nips-2007-Stability Bounds for Non-i.i.d. Processes

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weaken over time. It proves novel stability-based generalization bounds that hold even with this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.

2 0.64774406 46 nips-2007-Cluster Stability for Finite Samples

Author: Ohad Shamir, Naftali Tishby

Abstract: Over the past few years, the notion of stability in data clustering has received growing attention as a cluster validation criterion in a sample-based framework. However, recent work has shown that as the sample size increases, any clustering model will usually become asymptotically stable. This led to the conclusion that stability is lacking as a theoretical and practical tool. The discrepancy between this conclusion and the success of stability in practice has remained an open question, which we attempt to address. Our theoretical approach is that stability, as used by cluster validation algorithms, is similar in certain respects to measures of generalization in a model-selection framework. In such cases, the model chosen governs the convergence rate of generalization bounds. By arguing that these rates are more important than the sample size, we are led to the prediction that stability-based cluster validation algorithms should not degrade with increasing sample size, despite the asymptotic universal stability. This prediction is substantiated by a theoretical analysis as well as some empirical results. We conclude that stability remains a meaningful cluster validation criterion over finite samples. 1

3 0.58320498 108 nips-2007-Kernel Measures of Conditional Dependence

Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf

Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1

4 0.4804641 15 nips-2007-A general agnostic active learning algorithm

Author: Sanjoy Dasgupta, Claire Monteleoni, Daniel J. Hsu

Abstract: We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner [1] to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm’s label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally. 1

5 0.45063874 7 nips-2007-A Kernel Statistical Test of Independence

Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola

Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.

6 0.41961974 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems

7 0.41050208 159 nips-2007-Progressive mixture rules are deviation suboptimal

8 0.39653414 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis

9 0.38523698 194 nips-2007-The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information

10 0.36095688 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin

11 0.35483095 101 nips-2007-How SVMs can estimate quantiles and the median

12 0.30279052 110 nips-2007-Learning Bounds for Domain Adaptation

13 0.28945354 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets

14 0.27808392 13 nips-2007-A Unified Near-Optimal Estimator For Dimension Reduction in $l \alpha$ ($0<\alpha\leq 2$) Using Stable Random Projections

15 0.27027297 207 nips-2007-Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

16 0.26277617 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations

17 0.25264013 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

18 0.25119045 160 nips-2007-Random Features for Large-Scale Kernel Machines

19 0.25115123 21 nips-2007-Adaptive Online Gradient Descent

20 0.25058684 139 nips-2007-Nearest-Neighbor-Based Active Learning for Rare Category Detection


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.031), (13, 0.034), (16, 0.018), (18, 0.01), (19, 0.012), (21, 0.049), (34, 0.029), (35, 0.016), (46, 0.011), (47, 0.045), (49, 0.014), (83, 0.117), (85, 0.016), (90, 0.505)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.93631929 8 nips-2007-A New View of Automatic Relevance Determination

Author: David P. Wipf, Srikantan S. Nagarajan

Abstract: Automatic relevance determination (ARD) and the closely-related sparse Bayesian learning (SBL) framework are effective tools for pruning large numbers of irrelevant features leading to a sparse explanatory subset. However, popular update rules used for ARD are either difficult to extend to more general problems of interest or are characterized by non-ideal convergence properties. Moreover, it remains unclear exactly how ARD relates to more traditional MAP estimation-based methods for learning sparse representations (e.g., the Lasso). This paper furnishes an alternative means of expressing the ARD cost function using auxiliary functions that naturally addresses both of these issues. First, the proposed reformulation of ARD can naturally be optimized by solving a series of re-weighted 1 problems. The result is an efficient, extensible algorithm that can be implemented using standard convex programming toolboxes and is guaranteed to converge to a local minimum (or saddle point). Secondly, the analysis reveals that ARD is exactly equivalent to performing standard MAP estimation in weight space using a particular feature- and noise-dependent, non-factorial weight prior. We then demonstrate that this implicit prior maintains several desirable advantages over conventional priors with respect to feature selection. Overall these results suggest alternative cost functions and update procedures for selecting features and promoting sparse solutions in a variety of general situations. In particular, the methodology readily extends to handle problems such as non-negative sparse coding and covariance component estimation. 1

2 0.92706358 85 nips-2007-Experience-Guided Search: A Theory of Attentional Control

Author: David Baldwin, Michael C. Mozer

Abstract: People perform a remarkable range of tasks that require search of the visual environment for a target item among distractors. The Guided Search model (Wolfe, 1994, 2007), or GS, is perhaps the best developed psychological account of human visual search. To prioritize search, GS assigns saliency to locations in the visual field. Saliency is a linear combination of activations from retinotopic maps representing primitive visual features. GS includes heuristics for setting the gain coefficient associated with each map. Variants of GS have formalized the notion of optimization as a principle of attentional control (e.g., Baldwin & Mozer, 2006; Cave, 1999; Navalpakkam & Itti, 2006; Rao et al., 2002), but every GS-like model must be ’dumbed down’ to match human data, e.g., by corrupting the saliency map with noise and by imposing arbitrary restrictions on gain modulation. We propose a principled probabilistic formulation of GS, called Experience-Guided Search (EGS), based on a generative model of the environment that makes three claims: (1) Feature detectors produce Poisson spike trains whose rates are conditioned on feature type and whether the feature belongs to a target or distractor; (2) the environment and/or task is nonstationary and can change over a sequence of trials; and (3) a prior specifies that features are more likely to be present for target than for distractors. Through experience, EGS infers latent environment variables that determine the gains for guiding search. Control is thus cast as probabilistic inference, not optimization. We show that EGS can replicate a range of human data from visual search, including data that GS does not address. 1

3 0.90997452 119 nips-2007-Learning with Tree-Averaged Densities and Distributions

Author: Sergey Kirshner

Abstract: We utilize the ensemble of trees framework, a tractable mixture over superexponential number of tree-structured distributions [1], to develop a new model for multivariate density estimation. The model is based on a construction of treestructured copulas – multivariate distributions with uniform on [0, 1] marginals. By averaging over all possible tree structures, the new model can approximate distributions with complex variable dependencies. We propose an EM algorithm to estimate the parameters for these tree-averaged models for both the real-valued and the categorical case. Based on the tree-averaged framework, we propose a new model for joint precipitation amounts data on networks of rain stations. 1

same-paper 4 0.90697354 184 nips-2007-Stability Bounds for Non-i.i.d. Processes

Author: Mehryar Mohri, Afshin Rostamizadeh

Abstract: The notion of algorithmic stability has been used effectively in the past to derive tight generalization bounds. A key advantage of these bounds is that they are designed for specific learning algorithms, exploiting their particular properties. But, as in much of learning theory, existing stability analyses and bounds apply only in the scenario where the samples are independently and identically distributed (i.i.d.). In many machine learning applications, however, this assumption does not hold. The observations received by the learning algorithm often have some inherent temporal dependence, which is clear in system diagnosis or time series prediction problems. This paper studies the scenario where the observations are drawn from a stationary mixing sequence, which implies a dependence between observations that weaken over time. It proves novel stability-based generalization bounds that hold even with this more general setting. These bounds strictly generalize the bounds given in the i.i.d. case. It also illustrates their application in the case of several general classes of learning algorithms, including Support Vector Regression and Kernel Ridge Regression.

5 0.82067996 182 nips-2007-Sparse deep belief net model for visual area V2

Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng

Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from the Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1

6 0.59486979 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions

7 0.58420378 202 nips-2007-The discriminant center-surround hypothesis for bottom-up saliency

8 0.56384861 156 nips-2007-Predictive Matrix-Variate t Models

9 0.54089415 128 nips-2007-Message Passing for Max-weight Independent Set

10 0.52834761 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

11 0.52319556 185 nips-2007-Stable Dual Dynamic Programming

12 0.52135229 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models

13 0.51771736 96 nips-2007-Heterogeneous Component Analysis

14 0.51479459 63 nips-2007-Convex Relaxations of Latent Variable Training

15 0.51372421 113 nips-2007-Learning Visual Attributes

16 0.51013482 49 nips-2007-Colored Maximum Variance Unfolding

17 0.50912863 155 nips-2007-Predicting human gaze using low-level saliency combined with face detection

18 0.5078935 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI

19 0.50438648 7 nips-2007-A Kernel Statistical Test of Independence

20 0.49935216 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations