nips nips2008 nips2008-165 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ohad Shamir, Naftali Tishby
Abstract: unknown-abstract
Reference: text
sentIndex sentText sentNum sentScore
1 Formally, define the random variable $d_D^m(A_k(S_1), A_k(S_2)) := \sqrt{m}\, d_D(A_k(S_1), A_k(S_2)) = \sqrt{m}\, \Pr_{x \sim D}\big(\arg\max_i f_{\hat\theta,i}(x) \neq \arg\max_i f_{\hat\theta',i}(x)\big)$, where $\hat\theta, \hat\theta' \in \Theta$ are the solutions returned by $A_k(S_1), A_k(S_2)$, and $S_1, S_2$ are random samples, each of size $m$, drawn i.i.d. [sent-7, score-0.421]
2 The scaling by the square root of the sample size will allow us to analyze the non-trivial asymptotic behavior of these distance measures, which without scaling simply converge to zero in probability as m → ∞. [sent-10, score-0.122]
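To make the $\sqrt{m}$ scaling concrete, here is a small, self-contained simulation sketch (not from the paper; the two-component mixture, the use of scikit-learn's KMeans, and the label-matching step are all assumptions made for illustration). It estimates the distance between the clusterings of two independent samples of size $m$ on a held-out evaluation set, and shows that the raw distance shrinks with $m$ while the $\sqrt{m}$-scaled version stabilizes:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def sample(m):
    """Draw m points from a two-component 1-D Gaussian mixture (illustrative choice)."""
    z = rng.integers(0, 2, size=m)
    return rng.normal(loc=3.0 * z, scale=1.0).reshape(-1, 1)

def clustering_distance(m, n_eval=200_000):
    """Estimate d_D(A_k(S1), A_k(S2)) for k = 2: the probability mass of points that the
    two fitted clusterings assign differently (taking the better of the two possible label
    permutations, since k-means labels are arbitrary)."""
    km1 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sample(m))
    km2 = KMeans(n_clusters=2, n_init=10, random_state=1).fit(sample(m))
    x_eval = sample(n_eval)                      # fresh points standing in for D
    l1, l2 = km1.predict(x_eval), km2.predict(x_eval)
    disagree = np.mean(l1 != l2)
    return min(disagree, 1.0 - disagree)         # best label permutation for k = 2

for m in [100, 400, 1600, 6400]:
    d = np.mean([clustering_distance(m) for _ in range(5)])
    print(f"m={m:5d}  d={d:.4f}  sqrt(m)*d={np.sqrt(m) * d:.3f}")
# The raw distance d shrinks as m grows, while sqrt(m)*d hovers around a roughly constant value.
```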
3 We will also need to define the following variant of $d_D^m(A_k(S_1), A_k(S_2))$, where we restrict ourselves to the mass in some subset of $\mathbb{R}^n$. [sent-13, score-0.429]
4 Formally, we define the restricted distance between two clusterings, with respect to a set $B \subseteq \mathbb{R}^n$, as $d_D^m(A_k(S_1), A_k(S_2), B) := \sqrt{m}\, \Pr_{x \sim D}\big(\arg\max_i f_{\hat\theta,i}(x) \neq \arg\max_i f_{\hat\theta',i}(x) \;\wedge\; x \in B\big)$. (1) [sent-14, score-0.385]
5 In particular, $d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j}))$ refers to the mass which switches clusters and is also inside an $r/\sqrt{m}$-neighborhood of the limit cluster boundaries (where the boundaries are defined with respect to $f_{\theta_0}(\cdot)$). [sent-15, score-0.889]
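The neighborhood notation $B_\rho(F)$ is used repeatedly but is not defined in the sentences extracted here; a natural reading (an assumption on our part, consistent with how it is used) is the $\rho$-neighborhood of the boundary set:

```latex
% Assumed reading of the neighborhood notation (not stated explicitly in the extracted text):
B_{\rho}(F) \;:=\; \bigl\{\, x \in \mathbb{R}^n \;:\; \operatorname{dist}(x, F) \le \rho \,\bigr\},
\qquad\text{so that}\quad
B_{r/\sqrt{m}}\bigl(\textstyle\bigcup_{i,j} F_{\theta_0,i,j}\bigr)
\;\text{is a union of strips of width } O(r/\sqrt{m})
\text{ around the limit cluster boundaries.}
```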
6 Once again, when S1 , S2 are random samples, we can think of it as a random variable with respect to drawing and clustering S1 , S2 . [sent-16, score-0.139]
7 Consistency Condition: $\hat\theta$ converges in probability (over drawing and clustering a sample of size $m$, $m \to \infty$) to some $\theta_0 \in \Theta$. [sent-19, score-0.223]
8 Central Limit Condition: $\sqrt{m}(\hat\theta - \theta_0)$ converges in distribution to a multivariate zero mean Gaussian random variable $Z$. [sent-25, score-0.098]
9 Both $f_{\theta_0}(x)$ and $(\partial/\partial\theta) f_{\theta_0}(x)$ are twice differentiable with respect to any $x \in \mathcal{X}$, with a uniformly bounded second derivative. [sent-28, score-0.108]
10 (d) Minimal Parametric Stability: It holds for some $\delta > 0$ that $\Pr\big(d_D^m(A_k(S_1), A_k(S_2)) \neq d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j}))\big) = O(r^{-3-\delta}) + o(1)$, where $o(1) \to 0$ as $m \to \infty$. [sent-32, score-0.77]
11 Namely, the mass of $D$ which switches between clusters is with high probability inside thin strips around the limit cluster boundaries, and this high probability increases at least polynomially as the width of the strips increases (see below for a further discussion of this). [sent-33, score-0.749]
12 The regularity assumptions are relatively mild, and can usually be inferred based on the consistency and central limit conditions, as well as the specific clustering framework that we are considering. [sent-34, score-0.208]
13 For example, condition 3c and the assumptions on $F_{\theta_0,i,j}$ in condition 3b are fulfilled in a clustering framework where the clusters are separated by hyperplanes. [sent-35, score-0.192]
14 As to condition 3d, suppose our clustering framework is such that the cluster boundaries depend on $\hat\theta$ in a smooth manner. [sent-36, score-0.328]
15 Then the asymptotic normality of $\hat\theta$, with variance $O(1/m)$, and the compactness of $\mathcal{X}$, will generally imply that the cluster boundaries obtained from clustering a sample are contained with high probability inside strips of width $O(1/\sqrt{m})$ around the limit cluster boundaries. [sent-37, score-0.897]
16 More specifically, the asymptotic probability of this happening for strips of width $r/\sqrt{m}$ will be exponentially high in $r$, due to the asymptotic normality of $\hat\theta$. [sent-38, score-0.298]
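A hedged, one-line way to see why this probability is exponentially high in $r$ (our own gloss, assuming the central limit condition and a Lipschitz-smooth dependence of the boundaries on the parameters; it is not a quote from the paper):

```latex
% Sketch: if \sqrt{m}(\hat\theta - \theta_0) is asymptotically Gaussian and the boundaries move
% Lipschitz-continuously with \theta, then for some constants c, c' > 0,
\Pr\Bigl(\text{fitted boundaries leave } B_{r/\sqrt{m}}\bigl(\textstyle\bigcup_{i,j} F_{\theta_0,i,j}\bigr)\Bigr)
\;\le\; \Pr\bigl(\|\sqrt{m}(\hat\theta - \theta_0)\| > c\, r\bigr) + o(1)
\;\le\; \exp(-c' r^{2}) + o(1).
```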
17 As a result, the mass which switches between clusters, when we compare two independent clusterings, will be in those strips with probability exponentially high in r. [sent-39, score-0.263]
18 Therefore, condition 3d will hold by a large margin, since only polynomially high probability is required there. [sent-40, score-0.088]
19 Throughout the proofs, we will sometimes use the stochastic order notation $O_p(\cdot)$ and $o_p(\cdot)$ (cf. [sent-43, score-0.093]
20 We write $X_m = o_p(Y_m)$ to mean that $\Pr(\|X_m\| \ge \epsilon \|Y_m\|) \to 0$ for each $\epsilon > 0$. [sent-47, score-0.093]
21 For example, $X_m = o_p(1)$ means that $X_m \to 0$ in probability. [sent-49, score-0.093]
22 When we write for example $X_m = Y_m + o_p(1)$, we mean that $X_m - Y_m = o_p(1)$. [sent-50, score-0.186]
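Only $o_p(\cdot)$ is spelled out in the extracted sentences; for completeness, the companion definition of $O_p(\cdot)$ (the standard one, cf. van der Vaart, not a quote from the paper) is:

```latex
% Standard definition of stochastic boundedness:
X_m = O_p(Y_m)
\;\Longleftrightarrow\;
\forall \epsilon > 0 \;\; \exists M, m_0 :\;
\Pr\bigl(\|X_m\| \ge M \|Y_m\|\bigr) \le \epsilon
\quad \text{for all } m \ge m_0 .
```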
23 C Proof of Proposition 1. By condition 3a, $f_{\hat\theta}(x)$ has a first order Taylor expansion with respect to any $\hat\theta$ close enough to $\theta_0$, with a remainder term uniformly bounded for any $x$: $f_{\hat\theta}(x) = f_{\theta_0}(x) + \big(\tfrac{\partial}{\partial\theta} f_{\theta_0}(x)\big)^{\top} (\hat\theta - \theta_0) + o(\|\hat\theta - \theta_0\|)$. (2) [sent-51, score-0.164]
24 By the asymptotic normality assumption, $\hat\theta - \theta_0 = O_p(1/\sqrt{m})$. Therefore, we get from Eq. [sent-52, score-0.139]
25 (2) that $\sqrt{m}\big(f_{\hat\theta}(x) - f_{\theta_0}(x)\big) = \big(\tfrac{\partial}{\partial\theta} f_{\theta_0}(x)\big)^{\top} \big(\sqrt{m}(\hat\theta - \theta_0)\big) + o_p(1)$, (3) where the remainder term $o_p(1)$ does not depend on $x$. [sent-54, score-0.216]
26 By regularity condition 3a and compactness of $\mathcal{X}$, $(\partial/\partial\theta) f_{\theta_0}(\cdot)$ is a uniformly bounded vector-valued function from $\mathcal{X}$ to the Euclidean space in which $\Theta$ resides. [sent-55, score-0.167]
27 As a result, the mapping $\theta \mapsto \big((\partial/\partial\theta) f_{\theta_0}(\cdot)\big)^{\top} \theta$ is a mapping from $\Theta$, with the metric induced by the Euclidean space in which it resides, to the space of all uniformly bounded $\mathbb{R}^k$-valued functions on $\mathcal{X}$. [sent-56, score-0.119]
28 We also know that $\sqrt{m}(\hat\theta - \theta_0)$ converges in distribution to a multivariate Gaussian random variable $Z$. [sent-59, score-0.098]
29 The purpose of the stability estimator $\hat\eta^k_{m,q}$, scaled by $\sqrt{m}$, boils down to trying to assess the "expected" value of the random variable $d_D^m(A_k(S_1), A_k(S_2))$: we estimate $q$ instantiations of $d_D^m(A_k(S_1), A_k(S_2))$ and take their average. [sent-67, score-0.832]
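One plausible way to write the relation described here (the exact definition of $\hat\eta^k_{m,q}$ is not among the extracted sentences, so take this as our reading rather than the paper's formula), with $\{S_1^i, S_2^i\}_{i=1}^{q}$ denoting the $q$ clustered subsample pairs:

```latex
\sqrt{m}\,\hat\eta^{k}_{m,q}
\;\approx\;
\frac{1}{q}\sum_{i=1}^{q} d_D^m\bigl(A_k(S_1^i), A_k(S_2^i)\bigr),
\qquad\text{an empirical average of } q \text{ instantiations of the scaled distance.}
```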
30 The reason is that the convergence tools at our disposal deal with convergence in distribution of random variables, but convergence in distribution does not necessarily imply convergence of expectations. [sent-71, score-0.132]
31 In other words, we can try to analyze the asymptotic distribution of $d_D^m(A_k(S_1), A_k(S_2))$, but the expected value of this asymptotic distribution is not necessarily the same as $\lim_{m\to\infty} E\, d_D^m(A_k(S_1), A_k(S_2))$. [sent-72, score-0.549]
32 Here is the basic idea: instead of analyzing the asymptotic expectation of $d_D^m(A_k(S_1), A_k(S_2))$, we analyze the asymptotic expectation of a different random variable, $d_D^m(A_k(S_1), A_k(S_2), B)$, which was formally defined in Eq. [sent-74, score-0.952]
33 Informally, recall that $d_D^m(A_k(S_1), A_k(S_2))$ is the mass of the underlying distribution $D$ which switches between clusters, when we draw and cluster two independent samples of size $m$. [sent-76, score-0.63]
34 A, we will pick $B$ to be $B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j})$ for some $r > 0$, giving the random variable $d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j}))$. [sent-79, score-0.385]
35 In words, this constitutes strips of width $r/\sqrt{m}$ around the limit cluster boundaries. [sent-80, score-0.344]
36 Writing the above expression for $B$ as $B_{r/\sqrt{m}}$, we have that if $r$ is large enough, then $d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ is equal to $d_D^m(A_k(S_1), A_k(S_2))$ with very high probability over drawing and clustering a pair of samples, for any large enough sample size $m$. [sent-81, score-0.96]
37 Basically, this is because the fluctuations of the cluster boundaries, based on drawing and clustering a random sample of size m, cannot be too large, and therefore the mass which switches clusters is concentrated around the limit cluster boundaries, if m is large enough. [sent-82, score-0.639]
38 The advantage of the 'surrogate' random variable $d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ is that it is bounded for any finite $r$, unlike $d_D^m(A_k(S_1), A_k(S_2))$. [sent-83, score-0.843]
39 With bounded random variables, convergence in distribution does imply convergence of expectations, and as a result we are able to calculate $\lim_{m\to\infty} E\, d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ explicitly. [sent-84, score-0.143]
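The fact being invoked is standard (a consequence of the portmanteau / bounded convergence theorem); stated here for reference:

```latex
% Uniform boundedness upgrades weak convergence to convergence of means:
\text{If } X_m \xrightarrow{\;d\;} X \text{ and } |X_m| \le C \text{ almost surely for all } m,
\text{ then } \mathbb{E}[X_m] \longrightarrow \mathbb{E}[X].
```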
40 Our goal is to perform this calculation without going through an intermediate step of explicitly characterizing the distribution of $d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$. [sent-98, score-0.385]
41 This is because the distribution might be highly dependent on the specific clustering framework, and thus it is unsuitable for the level of generality which we aim at (in other words, we do not wish to assume a specific clustering framework). [sent-99, score-0.162]
42 The idea is as follows: recall that $d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$ is the mass of the underlying distribution $D$, inside strips of width $r/\sqrt{m}$ around the limit cluster boundaries, which switches clusters when we draw and cluster two independent samples of size $m$. [sent-100, score-1.058]
43 Then we can write $E\, d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}})$, by Fubini's theorem, as: $E\, d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}) = \sqrt{m} \int_{B_{r/\sqrt{m}}} \Pr(A_x)\, p(x)\, dx$. [sent-102, score-0.385]
44 5, which considers what happens to the integral above inside a single strip near one of the limit cluster boundaries $F_{\theta_0,i,j}$. [sent-104, score-0.37]
45 5 can be combined to give the asymptotic value of Eq. [sent-106, score-0.082]
46 The bottom line is that we can simply sum the contributions from each strip, because the intersection of these different strips is asymptotically negligible. [sent-108, score-0.127]
47 (5), and transforms it to an expression composed of a constant value, and a remainder term which converges to 0 as m → ∞. [sent-115, score-0.131]
48 The first step is rewriting everything using the asymptotic Gaussian distribution of the cluster association function $f_{\hat\theta}(x)$ for each $x$, plus remainder terms (Eq. [sent-117, score-0.281]
49 Since we are integrating over $x$, special care is given to show that the convergence to the asymptotic distribution is uniform for all $x$ in the domain of integration. [sent-119, score-0.123]
50 The second step is to rewrite the integral (which is over a strip around the cluster boundary) as a double integral along the cluster boundary itself, and along a normal segment at any point on the cluster boundary (Eq. [sent-120, score-0.584]
51 Since the strips become arbitrarily small as m → ∞, the third step consists of rewriting everything in terms of a Taylor expansion around each point on the cluster boundary (Eq. [sent-122, score-0.357]
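To make the second step above concrete, here is one standard way (a sketch under our own smoothness assumptions, not necessarily the paper's exact parameterization) to rewrite an integral over a thin strip as a double integral over the boundary and the normal direction:

```latex
% Sketch: for a smooth boundary piece F with unit normal n(y) and strip half-width \rho = r/\sqrt{m},
\int_{B_{\rho}(F)} g(x)\, p(x)\, dx
\;=\;
\int_{F} \int_{-\rho}^{\rho} g\bigl(y + t\, n(y)\bigr)\, p\bigl(y + t\, n(y)\bigr)\,
J(y, t)\, dt\, d\sigma(y),
\qquad J(y, t) = 1 + O(t),
```

where $d\sigma$ is the surface measure on $F$ and $J$ is the Jacobian of the tube parameterization; since $\rho \to 0$, the Jacobian correction is asymptotically negligible.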
52 1 below), characterizing the asymptotic expected value of $d_D^m(A_k(S_1), A_k(S_2), B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j}))$. [sent-129, score-0.467]
53 This is simply by definition of $A_x$: the probability that under one clustering, based on a random sample, $x$ is more associated with cluster $i$, and that under a second clustering, based on another independent random sample, $x$ is more associated with cluster $j$. [sent-135, score-0.319]
54 However, notice that any point $x$ in $B_{r/\sqrt{m}}(F_{\theta_0,i,j})$ (for some $i, j$) is much closer to $F_{\theta_0,i,j}$ than to any other cluster boundary. [sent-137, score-0.15]
55 Therefore, if x does switch clusters, then it is highly likely to switch between cluster i and cluster j. [sent-139, score-0.262]
56 Formally, by regularity condition 3d (which ensures that the cluster boundaries experience at most $O(1/\sqrt{m})$ fluctuations), we have that uniformly for any $x$, $\Pr(A_x) = 2 \Pr\big(f_{\hat\theta,i}(x) - f_{\hat\theta,j}(x) < 0\big) \Pr\big(f_{\hat\theta,i}(x) - f_{\hat\theta,j}(x) > 0\big) + o(1)$, where $o(1)$ converges to $0$ as $m \to \infty$. [sent-140, score-0.389]
57 Therefore, the probability in the lemma above can be written as $\Pr\big(\big|\frac{1}{q}\sum_{i=1}^{q} (Z_i - E[Z_i])\big| \ge \nu\big)$, where the $Z_i$ are bounded in $[0, r]$ with probability 1. [sent-163, score-0.13]
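For variables bounded in $[0, r]$, Hoeffding's inequality gives exactly the kind of tail that reappears later in the proof (the $\exp(-2q\nu^2/r^2)$ term); stated here for reference:

```latex
% Hoeffding's inequality for i.i.d. Z_1, ..., Z_q with 0 <= Z_i <= r:
\Pr\Bigl(\Bigl|\frac{1}{q}\sum_{i=1}^{q} \bigl(Z_i - \mathbb{E}[Z_i]\bigr)\Bigr| \ge \nu\Bigr)
\;\le\; 2\exp\Bigl(-\frac{2 q \nu^{2}}{r^{2}}\Bigr).
```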
58 Let $A^r_m$ be the event that for all subsample pairs $\{S_1^i, S_2^i\}$, $d_D^m(A_k(S_1^i), A_k(S_2^i), B_{r/\sqrt{m}}(\cup_{i,j} F_{\theta_0,i,j})) = d_D^m(A_k(S_1^i), A_k(S_2^i))$. [sent-166, score-0.827]
59 Namely, this is the event that for all subsample pairs, the mass which switches clusters when we compare the two resulting clusterings is always in an $r/\sqrt{m}$-neighborhood of the limit cluster boundaries. [sent-167, score-0.434]
60 Since $p(\cdot)$ is bounded, we have that $d_D^m(r)$ is deterministically bounded by $O(r)$, with implicit constants depending only on $D$ and $\theta_0$. [sent-168, score-0.44]
61 As a result, for any $\epsilon > 0$, $\Pr\big(\big|\sqrt{m}\,\hat\eta^k_{m,q} - \mathrm{instab}(A_k, D)\big| > \epsilon\big) \le \Pr\big(\big|\frac{1}{q}\sum_{i=1}^{q} d_D^m(A_k(S_1^i), A_k(S_2^i)) - \mathrm{instab}(A_k, D)\big| > \frac{\epsilon}{2}\big) + \Pr\big(\big|\sqrt{m}\,\hat\eta^k_{m,q} - \mathrm{instab}(A_k, D)\big| > \epsilon \,\big|\, \big|\frac{1}{q}\sum_{i=1}^{q} d_D^m(A_k(S_1^i), A_k(S_2^i)) - \mathrm{instab}(A_k, D)\big| \le \frac{\epsilon}{2}\big)$. [sent-172, score-0.77]
62 We start by analyzing the conditional probability, forming the second summand in Eq. [sent-178, score-0.112]
63 Recall that $\hat\eta^k_{m,q}$, after clustering the $q$ subsample pairs $\{S_1^i, S_2^i\}_{i=1}^{q}$, uses an additional i.i.d. [sent-180, score-0.107]
64 Thus, conditioned on the event appearing in the second summand of Eq. [sent-184, score-0.184]
65 Using a relative entropy version of Hoeffding’s bound [5], we have that the second summand in Eq. [sent-192, score-0.131]
66 (22) is upper bounded by: $\exp\big(-m\, D_{kl}\big[\tfrac{v + \epsilon/2}{\sqrt{m}} \,\big\|\, \tfrac{v}{\sqrt{m}}\big]\big) + \exp\big(-m\, D_{kl}\big[\max\big(0, \tfrac{v - \epsilon/2}{\sqrt{m}}\big) \,\big\|\, \tfrac{v}{\sqrt{m}}\big]\big)$, (23) where $D_{kl}[p\|q] := p \log(p/q) + (1-p) \log((1-p)/(1-q))$ for any $q \in (0, 1)$ and any $p \in [0, 1]$. [sent-193, score-0.133]
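As a quick sanity check on this kind of relative-entropy (Chernoff-Hoeffding) bound, here is a small sketch (our own illustration, not code from the paper; `p`, `t`, and `m` are hypothetical values) comparing the KL-based tail bound with the usual quadratic Hoeffding bound for a binomial mean:

```python
import math

def d_kl(p, q):
    """Binary relative entropy D_kl[p || q], with the 0*log(0) = 0 convention."""
    terms = []
    if p > 0:
        terms.append(p * math.log(p / q))
    if p < 1:
        terms.append((1 - p) * math.log((1 - p) / (1 - q)))
    return sum(terms)

def kl_upper_tail(p, t, m):
    """Chernoff-Hoeffding: Pr(mean of m Bernoulli(p) >= p + t) <= exp(-m * D_kl[p + t || p])."""
    return math.exp(-m * d_kl(min(p + t, 1.0), p))

def quadratic_upper_tail(t, m):
    """Plain Hoeffding: Pr(mean >= p + t) <= exp(-2 m t^2)."""
    return math.exp(-2 * m * t * t)

p, t, m = 0.01, 0.02, 500
print(kl_upper_tail(p, t, m), quadratic_upper_tail(t, m))
# The KL form is never weaker, and is much tighter when p is far from 1/2.
```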
67 (23) can be upper bounded by a quantity which converges to 0 as m → ∞. [sent-195, score-0.153]
68 (22), using the triangle inequality and switching sides allows us to upper bound it by: $\Pr\big(\big|\frac{1}{q}\sum_{i=1}^{q} d_D^m(A_k(S_1^i), A_k(S_2^i)) - E[d_D^m(r) \mid A^r_m]\big| \ge \frac{\epsilon}{2} - \big|E[d_D^m(r) \mid A^r_m] - E[d_D^m(r)]\big| - \big|E\, d_D^m(r) - \mathrm{instab}(A_k, D)\big|\big)$. (24) By the definition of $\mathrm{instab}(A_k, D)$ as appearing in Thm. [sent-199, score-0.446]
69 (24) by $\Pr\big(\big|\frac{1}{q}\sum_{i=1}^{q} d_D^m(A_k(S_1^i), A_k(S_2^i)) - E[d_D^m(r) \mid A^r_m]\big| \ge \frac{\epsilon}{2} - (1 - \Pr(A^r_m)) O(r) - O(\exp(-r^2)) - o(1)\big)$, (26) where $o(1) \to 0$ as $m \to \infty$. [sent-205, score-0.385]
70 6, we have that for any $\nu > 0$, $\Pr\big(\big|\frac{1}{q}\sum_{i=1}^{q} d_D^m(A_k(S_1^i), A_k(S_2^i)) - E[d_D^m(r) \mid A^r_m]\big| > \nu\big) \le (1 - \Pr(A^r_m)) \cdot 1 + \Pr(A^r_m)\, \Pr\big(\big|\frac{1}{q}\sum_{i=1}^{q} d_D^m(A_k(S_1^i), A_k(S_2^i)) - E[d_D^m(r) \mid A^r_m]\big| > \nu \,\big|\, A^r_m\big) \le (1 - \Pr(A^r_m)) + 2 \Pr(A^r_m) \exp\big(-\frac{2 q \nu^2}{r^2}\big)$. [sent-207, score-0.8]
71 6 can be applied because $d_D^m(A_k(S_1^i), A_k(S_2^i)) = d_D^m(r)$ for any $i$, if $A^r_m$ occurs. [sent-209, score-0.77]
72 (26) is upper bounded by $(1 - \Pr(A^r_m)) + 2 \Pr(A^r_m) \exp\Big(-\frac{2q\big(\frac{\epsilon}{2} - (1 - \Pr(A^r_m)) O(r) - O(\exp(-r^2)) - o(1)\big)^2}{r^2}\Big)$. [sent-212, score-0.103]
73 (29) Let $g_m(r) := \Pr_{S_1, S_2 \sim D^m}\big(d_D^m(r) \neq d_D^m(A_k(S_1), A_k(S_2))\big)$ and $g(r) := \lim_{m\to\infty} g_m(r)$. By regularity condition 3d, $g(r) = O(r^{-3-\delta})$ for some $\delta > 0$. [sent-213, score-0.528]
74 (29) converges to $\big(1 - (1 - g(r))^q\big) + 2 (1 - g(r))^q \exp\Big(-\frac{2q\big(\frac{\epsilon}{2} - (1 - (1 - g(r))^q) O(r) - O(\exp(-r^2))\big)^2}{r^2}\Big)$. (30) [sent-216, score-0.11]
75 (30), we get an upper bound $O\big(q^{1 - \frac{3+\delta}{2+\delta/2}}\big) + \exp\Big(-2 q^{1 - \frac{1}{1+\delta/4}} \big(\frac{\epsilon}{2} - O\big(q^{-\frac{\delta}{4+\delta}}\big) - O\big(\exp(-q^{\frac{1}{1+\delta/4}})\big)\big)^2\Big)$. [sent-223, score-0.088]
76 Since $\delta > 0$, it can be verified that the first summand asymptotically dominates the second summand (as $q \to \infty$), and can be bounded in turn by $o(q^{-1/2})$. [sent-224, score-0.296]
77 Summarizing, we have that the first summand in Eq. [sent-225, score-0.112]
78 (22) converges to $o(q^{-1/2})$ as $m \to \infty$, and the second summand in Eq. [sent-226, score-0.192]
79 3 is the following general central limit theorem for Z-estimators (Thm. [sent-232, score-0.111]
80 If $\Psi(\theta_0) = 0$ and $\sqrt{m}\, \Psi_m(\hat\theta) \to 0$ in probability, and $\hat\theta$ converges in probability to $\theta_0$, then $\sqrt{m}(\hat\theta - \theta_0)$ converges in distribution to $-\dot\Psi_{\theta_0}^{-1} Z$. [sent-241, score-0.181]
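The usual one-line heuristic behind this Z-estimator result (a sketch only, suppressing the Fréchet-differentiability and Donsker conditions that the full theorem requires):

```latex
% Expand the estimating equation around \theta_0; use \Psi(\theta_0) = 0 so \Psi_m(\theta_0) = (\Psi_m - \Psi)(\theta_0):
0 \approx \Psi_m(\hat\theta)
\approx \Psi_m(\theta_0) + \dot\Psi_{\theta_0}(\hat\theta - \theta_0)
\;\Longrightarrow\;
\sqrt{m}(\hat\theta - \theta_0)
\approx -\dot\Psi_{\theta_0}^{-1}\, \sqrt{m}\bigl(\Psi_m - \Psi\bigr)(\theta_0)
\xrightarrow{\;d\;} -\dot\Psi_{\theta_0}^{-1} Z .
```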
81 It is important to note that the theorem is stronger than what we actually need, since we only consider finite dimensional Euclidean spaces, while the theorem deals with possibly infinite dimensional Banach spaces. [sent-247, score-0.14]
82 In principle, it is possible to use this theorem to prove central limit theorems in infinite dimensional settings, for example in kernel clustering where the associated reproducing kernel Hilbert space is infinite dimensional. [sent-248, score-0.331]
83 We will prove this in a slightly more complicated way than necessary, which also treats the case of kernel clustering where H is infinite-dimensional. [sent-263, score-0.126]
84 Intuitively, a set of real functions {f (·)} from X (with any probability distribution D) to R is called Donsker if it satisfies a uniform central limit theorem. [sent-267, score-0.125]
85 Without getting too much into the details (footnote 1: a linear operator is automatically continuous in finite dimensional spaces, not necessarily in infinite dimensional spaces), [sent-268, score-0.088]
86 this means that $(f(x_1) + \cdots + f(x_m))/\sqrt{m}$ converges in distribution (as $m \to \infty$) to a Gaussian random variable, and the convergence is uniform over all $f(\cdot)$ in the set, in an appropriately defined sense. [sent-274, score-0.139]
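For reference, the standard formalization of the "uniform central limit theorem" alluded to here (our summary of the usual definition, cf. van der Vaart and Wellner, not the paper's wording):

```latex
% A class \mathcal{F} of measurable functions is D-Donsker if the empirical process
\mathbb{G}_m f \;:=\; \sqrt{m}\,\bigl(P_m f - P f\bigr)
\;=\; \frac{1}{\sqrt{m}} \sum_{i=1}^{m} \bigl(f(x_i) - \mathbb{E}_D f\bigr),
\qquad x_i \overset{\text{i.i.d.}}{\sim} D,
% converges weakly in \ell^\infty(\mathcal{F}) to a tight Gaussian process \mathbb{G}
% with covariance \operatorname{Cov}(\mathbb{G}f, \mathbb{G}g) = \operatorname{Cov}_D(f(x), g(x)).
```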
87 (32) Notice that the first class is a set of bounded constant functions, while the third class is a set of indicator functions for all possible clusters. [sent-281, score-0.091]
88 1 in [3] (and its preceding discussion), the class is Donsker if any function in the class is differentiable to a sufficiently high order. [sent-288, score-0.087]
89 For finite dimensional kernel clustering, it is enough to show that $\{\langle \cdot, \phi(h)\rangle\}_{h \in \mathcal{X}}$ is Donsker (namely, the same class of functions after performing the transformation from $\mathcal{X}$ to $\phi(\mathcal{X})$). [sent-293, score-0.116]
90 In infinite dimensional kernel clustering, our class of functions can be written as $\{k(\cdot, h)\}_{h \in \mathcal{X}}$, where $k(\cdot, \cdot)$ is the kernel function, so it is Donsker if the kernel function is differentiable to a sufficiently high order. [sent-295, score-0.188]
91 15 in [3] (and its preceding discussion), it suffices that the boundary of each possible cluster is composed of a finite number of smooth surfaces (differentiable to a high enough order) in some Euclidean space. [sent-300, score-0.238]
92 This will still be true for infinite dimensional kernel clustering, if we can guarantee that any cluster in any solution close enough to θ0 in Θ will have smooth boundaries. [sent-303, score-0.229]
93 For example, universal kernels (such as the Gaussian kernel) are capable of inducing cluster boundaries arbitrarily close in form to any continuous function, and thus our line of attack will not work in such cases. [sent-305, score-0.238]
94 In a sense, this is not too surprising, since these kernels correspond to very ’rich’ hypothesis classes, and it is not clear if a precise characterization of their stability properties, via central limit theorems, is at all possible. [sent-306, score-0.11]
95 As to the asymptotic distribution of $\sqrt{m}(\Psi_m - \Psi)(\theta_0)$, since $\Psi(\theta_0) = 0$ by assumption, we have that for any $i \in \{1, \ldots, k\}$, . [sent-313, score-0.082]
96 As a result, by the standard central limit theorem, $\sqrt{m}(\Psi^i_m - \Psi^i)(\theta_0)$ converges in distribution to a zero mean Gaussian random vector $Y$, with covariance matrix $V_i = \int_{C_{\theta_0,i}} p(x)\,(\phi(x) - \theta_{0,i})(\phi(x) - \theta_{0,i})^{\top}\, dx$. [sent-325, score-0.183]
97 Therefore, $\sqrt{m}(\Psi_m - \Psi)(\theta_0)$ converges in distribution to a zero mean Gaussian random vector, whose covariance matrix $V$ is composed of $k$ diagonal blocks $(V_1, \ldots$ . [sent-327, score-0.119]
98 1 to get that $\sqrt{m}(\hat\theta - \theta_0)$ converges in distribution to a zero mean Gaussian random vector of the form $-\dot\Psi_{\theta_0}^{-1} Y$, which is a Gaussian random vector with a covariance matrix of $\dot\Psi_{\theta_0}^{-1} V \dot\Psi_{\theta_0}^{-1}$. [sent-333, score-0.137]
99 Under the assumptions we have made, the model $\hat\theta$ returned by the algorithm satisfies $\Psi_m(\hat\theta) = 0$, and $\hat\theta$ converges in probability to some $\theta_0$ for which $\Psi(\theta_0) = 0$. [sent-341, score-0.101]
100 The asymptotic normality of $\sqrt{m}(\hat\theta - \theta_0)$ is now an immediate consequence of central limit theorems for 'maximum likelihood' Z-estimators, such as Thm. [sent-342, score-0.228]
wordName wordTfidf (topN-words)
[('ak', 0.689), ('dm', 0.385), ('pr', 0.262), ('instab', 0.224), ('edm', 0.154), ('cluster', 0.131), ('donsker', 0.126), ('limm', 0.126), ('si', 0.121), ('summand', 0.112), ('strips', 0.11), ('op', 0.093), ('boundaries', 0.086), ('asymptotic', 0.082), ('clustering', 0.081), ('converges', 0.08), ('xm', 0.074), ('switches', 0.07), ('ym', 0.065), ('bounded', 0.055), ('limit', 0.054), ('banach', 0.052), ('clusters', 0.051), ('argmaxf', 0.049), ('dimensional', 0.044), ('mass', 0.044), ('ax', 0.044), ('dd', 0.042), ('regularity', 0.042), ('euclidean', 0.04), ('boundary', 0.039), ('strip', 0.037), ('normality', 0.036), ('hoeffding', 0.035), ('bregman', 0.035), ('differentiable', 0.033), ('inside', 0.033), ('lemma', 0.033), ('width', 0.031), ('event', 0.031), ('central', 0.031), ('remainder', 0.03), ('condition', 0.03), ('exp', 0.03), ('proof', 0.03), ('integral', 0.029), ('enough', 0.029), ('namely', 0.029), ('mdkl', 0.028), ('vaart', 0.028), ('clusterings', 0.027), ('theorem', 0.026), ('nite', 0.026), ('gm', 0.026), ('subsample', 0.026), ('imply', 0.026), ('kernel', 0.025), ('stability', 0.025), ('theorems', 0.025), ('appearing', 0.024), ('proofs', 0.024), ('neighborhood', 0.023), ('proposition', 0.023), ('qg', 0.022), ('mapping', 0.022), ('drawing', 0.022), ('convergence', 0.022), ('zi', 0.021), ('probability', 0.021), ('composed', 0.021), ('arbitrarily', 0.021), ('everything', 0.021), ('get', 0.021), ('prove', 0.02), ('compactness', 0.02), ('dkl', 0.02), ('uniformly', 0.02), ('bound', 0.019), ('uniform', 0.019), ('law', 0.019), ('lim', 0.019), ('derivative', 0.019), ('polynomially', 0.019), ('instantiations', 0.019), ('sample', 0.019), ('notice', 0.019), ('xi', 0.018), ('high', 0.018), ('conditions', 0.018), ('fr', 0.018), ('rn', 0.018), ('class', 0.018), ('random', 0.018), ('upper', 0.018), ('around', 0.018), ('asymptotically', 0.017), ('veri', 0.017), ('conditioned', 0.017), ('summarizing', 0.017), ('rewriting', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
Author: Ohad Shamir, Naftali Tishby
Abstract: unknown-abstract
2 0.11756263 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
Author: Chunhua Shen, Alan Welsh, Lei Wang
Abstract: In this work, we consider the problem of learning a positive semidefinite matrix. The critical issue is how to preserve positive semidefiniteness during the course of learning. Our algorithm is mainly inspired by LPBoost [1] and the general greedy convex optimization framework of Zhang [2]. We demonstrate the essence of the algorithm, termed PSDBoost (positive semidefinite Boosting), by focusing on a few different applications in machine learning. The proposed PSDBoost algorithm extends traditional Boosting algorithms in that its parameter is a positive semidefinite matrix with trace being one instead of a classifier. PSDBoost is based on the observation that any trace-one positive semidefinite matrix can be decomposed into linear convex combinations of trace-one rank-one matrices, which serve as base learners of PSDBoost. Numerical experiments are presented. 1
3 0.11034346 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
Author: Peter Carbonetto, Mark Schmidt, Nando D. Freitas
Abstract: The stochastic approximation method is behind the solution to many important, actively-studied problems in machine learning. Despite its farreaching application, there is almost no work on applying stochastic approximation to learning problems with general constraints. The reason for this, we hypothesize, is that no robust, widely-applicable stochastic approximation method exists for handling such problems. We propose that interior-point methods are a natural solution. We establish the stability of a stochastic interior-point approximation method both analytically and empirically, and demonstrate its utility by deriving an on-line learning algorithm that also performs feature selection via L1 regularization. 1
4 0.079195388 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes
Author: Mehryar Mohri, Afshin Rostamizadeh
Abstract: This paper presents the first Rademacher complexity-based error bounds for noni.i.d. settings, a generalization of similar existing bounds derived for the i.i.d. case. Our bounds hold in the scenario of dependent samples generated by a stationary β-mixing process, which is commonly adopted in many previous studies of noni.i.d. settings. They benefit from the crucial advantages of Rademacher complexity over other measures of the complexity of hypothesis classes. In particular, they are data-dependent and measure the complexity of a class of hypotheses based on the training sample. The empirical Rademacher complexity can be estimated from such finite samples and lead to tighter generalization bounds. We also present the first margin bounds for kernel-based classification in this non-i.i.d. setting and briefly study their convergence.
5 0.07496994 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
Author: Ralf M. Haefner, Bruce G. Cumming
Abstract: A crucial part of developing mathematical models of information processing in the brain is the quantification of their success. One of the most widely-used metrics yields the percentage of the variance in the data that is explained by the model. Unfortunately, this metric is biased due to the intrinsic variability in the data. We derive a simple analytical modification of the traditional formula that significantly improves its accuracy (as measured by bias) with similar or better precision (as measured by mean-square error) in estimating the true underlying Variance Explained by the model class. Our estimator advances on previous work by a) accounting for overfitting due to free model parameters mitigating the need for a separate validation data set, b) adjusting for the uncertainty in the noise estimate and c) adding a conditioning term. We apply our new estimator to binocular disparity tuning curves of a set of macaque V1 neurons and find that on a population level almost all of the variance unexplained by Gabor functions is attributable to noise. 1
6 0.074549273 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
7 0.074243002 104 nips-2008-Improved Moves for Truncated Convex Models
8 0.07373628 106 nips-2008-Inferring rankings under constrained sensing
10 0.069085933 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
11 0.062137984 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
12 0.061414771 238 nips-2008-Theory of matching pursuit
13 0.059745479 117 nips-2008-Learning Taxonomies by Dependence Maximization
14 0.059017681 48 nips-2008-Clustering via LP-based Stabilities
15 0.058887511 190 nips-2008-Reconciling Real Scores with Binary Comparisons: A New Logistic Based Model for Ranking
16 0.057633728 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models
17 0.055849425 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
18 0.052813068 132 nips-2008-Measures of Clustering Quality: A Working Set of Axioms for Clustering
19 0.052685838 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
20 0.049179703 107 nips-2008-Influence of graph construction on graph-based clustering measures
topicId topicWeight
[(0, -0.138), (1, -0.007), (2, -0.049), (3, 0.057), (4, 0.002), (5, -0.019), (6, 0.047), (7, -0.086), (8, 0.013), (9, -0.034), (10, 0.01), (11, -0.0), (12, 0.087), (13, 0.072), (14, -0.058), (15, 0.144), (16, -0.013), (17, 0.078), (18, 0.089), (19, -0.046), (20, 0.003), (21, 0.019), (22, -0.014), (23, 0.066), (24, 0.014), (25, 0.01), (26, -0.068), (27, -0.033), (28, 0.062), (29, 0.055), (30, 0.067), (31, -0.079), (32, -0.09), (33, -0.148), (34, -0.006), (35, -0.067), (36, 0.029), (37, -0.085), (38, 0.106), (39, 0.068), (40, -0.092), (41, -0.153), (42, -0.06), (43, -0.054), (44, -0.281), (45, 0.065), (46, -0.034), (47, -0.085), (48, 0.071), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.97158617 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
Author: Ohad Shamir, Naftali Tishby
Abstract: unknown-abstract
2 0.53942609 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
Author: Chunhua Shen, Alan Welsh, Lei Wang
Abstract: In this work, we consider the problem of learning a positive semidefinite matrix. The critical issue is how to preserve positive semidefiniteness during the course of learning. Our algorithm is mainly inspired by LPBoost [1] and the general greedy convex optimization framework of Zhang [2]. We demonstrate the essence of the algorithm, termed PSDBoost (positive semidefinite Boosting), by focusing on a few different applications in machine learning. The proposed PSDBoost algorithm extends traditional Boosting algorithms in that its parameter is a positive semidefinite matrix with trace being one instead of a classifier. PSDBoost is based on the observation that any trace-one positive semidefinite matrix can be decomposed into linear convex combinations of trace-one rank-one matrices, which serve as base learners of PSDBoost. Numerical experiments are presented. 1
3 0.43962052 25 nips-2008-An interior-point stochastic approximation method and an L1-regularized delta rule
Author: Peter Carbonetto, Mark Schmidt, Nando D. Freitas
Abstract: The stochastic approximation method is behind the solution to many important, actively-studied problems in machine learning. Despite its farreaching application, there is almost no work on applying stochastic approximation to learning problems with general constraints. The reason for this, we hypothesize, is that no robust, widely-applicable stochastic approximation method exists for handling such problems. We propose that interior-point methods are a natural solution. We establish the stability of a stochastic interior-point approximation method both analytically and empirically, and demonstrate its utility by deriving an on-line learning algorithm that also performs feature selection via L1 regularization. 1
4 0.4278309 48 nips-2008-Clustering via LP-based Stabilities
Author: Nikos Komodakis, Nikos Paragios, Georgios Tziritas
Abstract: A novel center-based clustering algorithm is proposed in this paper. We first formulate clustering as an NP-hard linear integer program and we then use linear programming and the duality theory to derive the solution of this optimization problem. This leads to an efficient and very general algorithm, which works in the dual domain, and can cluster data based on an arbitrary set of distances. Despite its generality, it is independent of initialization (unlike EM-like methods such as K-means), has guaranteed convergence, can automatically determine the number of clusters, and can also provide online optimality bounds about the quality of the estimated clustering solutions. To deal with the most critical issue in a centerbased clustering algorithm (selection of cluster centers), we also introduce the notion of stability of a cluster center, which is a well defined LP-based quantity that plays a key role to our algorithm’s success. Furthermore, we also introduce, what we call, the margins (another key ingredient in our algorithm), which can be roughly thought of as dual counterparts to stabilities and allow us to obtain computationally efficient approximations to the latter. Promising experimental results demonstrate the potentials of our method.
5 0.41563034 29 nips-2008-Automatic online tuning for fast Gaussian summation
Author: Vlad I. Morariu, Balaji V. Srinivasan, Vikas C. Raykar, Ramani Duraiswami, Larry S. Davis
Abstract: Many machine learning algorithms require the summation of Gaussian kernel functions, an expensive operation if implemented straightforwardly. Several methods have been proposed to reduce the computational complexity of evaluating such sums, including tree and analysis based methods. These achieve varying speedups depending on the bandwidth, dimension, and prescribed error, making the choice between methods difficult for machine learning tasks. We provide an algorithm that combines tree methods with the Improved Fast Gauss Transform (IFGT). As originally proposed the IFGT suffers from two problems: (1) the Taylor series expansion does not perform well for very low bandwidths, and (2) parameter selection is not trivial and can drastically affect performance and ease of use. We address the first problem by employing a tree data structure, resulting in four evaluation methods whose performance varies based on the distribution of sources and targets and input parameters such as desired accuracy and bandwidth. To solve the second problem, we present an online tuning approach that results in a black box method that automatically chooses the evaluation method and its parameters to yield the best performance for the input data, desired accuracy, and bandwidth. In addition, the new IFGT parameter selection approach allows for tighter error bounds. Our approach chooses the fastest method at negligible additional cost, and has superior performance in comparisons with previous approaches. 1
6 0.40605789 106 nips-2008-Inferring rankings under constrained sensing
7 0.40271485 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph
8 0.39995036 149 nips-2008-Near-minimax recursive density estimation on the binary hypercube
9 0.38804361 132 nips-2008-Measures of Clustering Quality: A Working Set of Axioms for Clustering
10 0.37042513 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling
11 0.35962909 188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees
12 0.34872395 104 nips-2008-Improved Moves for Truncated Convex Models
13 0.34552959 199 nips-2008-Risk Bounds for Randomized Sample Compressed Classifiers
14 0.34202677 24 nips-2008-An improved estimator of Variance Explained in the presence of noise
15 0.33904073 15 nips-2008-Adaptive Martingale Boosting
16 0.33899003 190 nips-2008-Reconciling Real Scores with Binary Comparisons: A New Logistic Based Model for Ranking
17 0.33410886 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
18 0.32789487 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
19 0.32172102 47 nips-2008-Clustered Multi-Task Learning: A Convex Formulation
20 0.32032487 117 nips-2008-Learning Taxonomies by Dependence Maximization
topicId topicWeight
[(6, 0.066), (7, 0.061), (12, 0.024), (28, 0.248), (57, 0.065), (59, 0.019), (63, 0.022), (71, 0.018), (77, 0.066), (80, 0.242), (83, 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 0.87433994 165 nips-2008-On the Reliability of Clustering Stability in the Large Sample Regime
Author: Ohad Shamir, Naftali Tishby
Abstract: unknown-abstract
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar
Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of study the performance of the ℓ1 -regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s and the maximum node degree d, are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1 -regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d2 log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.
3 0.7508657 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
Author: David Sontag, Amir Globerson, Tommi S. Jaakkola
Abstract: We propose a new class of consistency constraints for Linear Programming (LP) relaxations for finding the most probable (MAP) configuration in graphical models. Usual cluster-based LP relaxations enforce joint consistency on the beliefs of a cluster of variables, with computational cost increasing exponentially with the size of the clusters. By partitioning the state space of a cluster and enforcing consistency only across partitions, we obtain a class of constraints which, although less tight, are computationally feasible for large clusters. We show how to solve the cluster selection and partitioning problem monotonically in the dual LP, using the current beliefs to guide these choices. We obtain a dual message passing algorithm and apply it to protein design problems where the variables have large state spaces and the usual cluster-based relaxations are very costly. The resulting method solves many of these problems exactly, and significantly faster than a method that does not use partitioning. 1
4 0.75058979 231 nips-2008-Temporal Dynamics of Cognitive Control
Author: Jeremy Reynolds, Michael C. Mozer
Abstract: Cognitive control refers to the flexible deployment of memory and attention in response to task demands and current goals. Control is often studied experimentally by presenting sequences of stimuli, some demanding a response, and others modulating the stimulus-response mapping. In these tasks, participants must maintain information about the current stimulus-response mapping in working memory. Prominent theories of cognitive control use recurrent neural nets to implement working memory, and optimize memory utilization via reinforcement learning. We present a novel perspective on cognitive control in which working memory representations are intrinsically probabilistic, and control operations that maintain and update working memory are dynamically determined via probabilistic inference. We show that our model provides a parsimonious account of behavioral and neuroimaging data, and suggest that it offers an elegant conceptualization of control in which behavior can be cast as optimal, subject to limitations on learning and the rate of information processing. Moreover, our model provides insight into how task instructions can be directly translated into appropriate behavior and then efficiently refined with subsequent task experience. 1
5 0.75029194 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
Author: Bernhard Nessler, Michael Pfeiffer, Wolfgang Maass
Abstract: Uncertainty is omnipresent when we perceive or interact with our environment, and the Bayesian framework provides computational methods for dealing with it. Mathematical models for Bayesian decision making typically require datastructures that are hard to implement in neural networks. This article shows that even the simplest and experimentally best supported type of synaptic plasticity, Hebbian learning, in combination with a sparse, redundant neural code, can in principle learn to infer optimal Bayesian decisions. We present a concrete Hebbian learning rule operating on log-probability ratios. Modulated by reward-signals, this Hebbian plasticity rule also provides a new perspective for understanding how Bayesian inference could support fast reinforcement learning in the brain. In particular we show that recent experimental results by Yang and Shadlen [1] on reinforcement learning of probabilistic inference in primates can be modeled in this way. 1
6 0.74973089 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
7 0.74825805 112 nips-2008-Kernel Measures of Independence for non-iid Data
8 0.74789768 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
9 0.74732989 107 nips-2008-Influence of graph construction on graph-based clustering measures
10 0.74723369 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
11 0.7471683 195 nips-2008-Regularized Policy Iteration
12 0.74677157 40 nips-2008-Bounds on marginal probability distributions
13 0.74672067 101 nips-2008-Human Active Learning
14 0.74552476 138 nips-2008-Modeling human function learning with Gaussian processes
15 0.74462366 211 nips-2008-Simple Local Models for Complex Dynamical Systems
16 0.74442476 152 nips-2008-Non-stationary dynamic Bayesian networks
17 0.74372435 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
18 0.74343085 77 nips-2008-Evaluating probabilities under high-dimensional latent variable models
19 0.74326718 118 nips-2008-Learning Transformational Invariants from Natural Movies
20 0.74276286 113 nips-2008-Kernelized Sorting