nips nips2013 nips2013-197 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Matus Telgarsky, Sanjoy Dasgupta
Abstract: Suppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with p ≥ 4 bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as $m^{\min\{-1/4,\,-1/2+2/p\}}$. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of k-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for k-means instances possessing some cluster structure.
Reference: text
sentIndex sentText sentNum sentScore
1 Suppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? [sent-3, score-0.337]
2 This question is resolved here for distributions with p ≥ 4 bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as $m^{\min\{-1/4,\,-1/2+2/p\}}$. [sent-4, score-0.243]
3 The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. [sent-5, score-0.355]
4 To further demonstrate this mechanism, a soft clustering variant of k-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. [sent-6, score-0.318]
5 1 Introduction Suppose a set of k centers $\{p_i\}_{i=1}^k$ is selected by approximate minimization of k-means cost; how does the fit over the sample compare with the fit over the distribution? [sent-8, score-0.316]
6 Concretely: given m points sampled from a source distribution ρ, what can be said about the quantities $\frac{1}{m}\sum_{j=1}^m \min_i \|x_j - p_i\|_2^2$ and $\frac{1}{m}\sum_{j=1}^m \ln\big(\sum_{i=1}^k \alpha_i\, p_{\theta_i}(x_j)\big)$? [sent-9, score-0.214]
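As a purely illustrative sketch (not code from the paper), both sample quantities can be computed directly; the names X, P, alphas, mus, and covs are assumptions of this example rather than notation from the source.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kmeans_sample_cost(X, P):
    """(1/m) * sum_j min_i ||x_j - p_i||_2^2 for points X (m x d) and centers P (k x d)."""
    sq_dists = ((X[:, None, :] - P[None, :, :]) ** 2).sum(axis=2)  # shape (m, k)
    return sq_dists.min(axis=1).mean()

def gmm_sample_loglik(X, alphas, mus, covs):
    """(1/m) * sum_j ln( sum_i alpha_i * N(x_j; mu_i, Sigma_i) )."""
    weighted = np.column_stack([
        a * multivariate_normal(mean=mu, cov=S).pdf(X)
        for a, mu, S in zip(alphas, mus, covs)
    ])  # shape (m, k)
    return np.log(weighted.sum(axis=1)).mean()
```

The question posed in the text is how far such empirical averages can be from their expectations under ρ, uniformly over the parameters.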
7 For k-means, there is firstly a consistency result: under some identifiability conditions, the global minimizer over the sample will converge to the global minimizer over the distribution as the sample size m increases [1]. [sent-15, score-0.13]
8 The task here is thus: to provide finite sample guarantees for these problems, but eschewing boundedness, subgaussianity, and similar assumptions in favor of moment assumptions. [sent-18, score-0.176]
9 1 Contribution The results here are of the following form: given m examples from a distribution with a few bounded moments, and any set of parameters beating some fixed cost c, the corresponding deviations in cost (as in eq. [sent-20, score-0.43]
10 1), p ≥ 4 moments suffice, and the rate is $O(m^{\min\{-1/4,\,-1/2+2/p\}})$. [sent-27, score-0.143]
11 1), p ≥ 8 moments suffice, and the rate is $O(m^{-1/2+3/p})$. [sent-30, score-0.143]
12 For instance, suppose k centers are output by Lloyd’s method. [sent-32, score-0.277]
13 While Lloyd’s method carries no optimality guarantees, the results here hold for the output of Lloyd’s method simply by setting c to be the variance of the data, equivalently the k-means cost with a single center placed at the mean. [sent-33, score-0.119]
14 • The k-means and Gaussian mixture costs are only well-defined when the source distribution has p ≥ 2 moments. [sent-34, score-0.109]
15 The condition of p ≥ 4 moments, meaning the variance has a variance, allows consideration of many heavy-tailed distributions, which are ruled out by boundedness and subgaussianity assumptions. [sent-35, score-0.272]
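Returning to the Lloyd's-method remark above: a minimal sketch of that setup, assuming squared-Euclidean k-means and NumPy arrays; all names are illustrative, and the loop below is the textbook alternation rather than anything specific to the paper.

```python
import numpy as np

def lloyd(X, k, iters=50, seed=0):
    """Plain Lloyd's method: alternate nearest-center assignment and mean updates."""
    rng = np.random.default_rng(seed)
    P = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = ((X[:, None, :] - P[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                P[i] = X[labels == i].mean(axis=0)
    return P

def reference_cost(X):
    """k-means cost of a single center placed at the sample mean (the data variance)."""
    return ((X - X.mean(axis=0)) ** 2).sum(axis=1).mean()
```

Setting c = reference_cost(X) then covers every output whose empirical cost is at most the variance, which is how the text proposes to handle heuristics carrying no optimality guarantee.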
16 The main technical byproduct of the proof is a mechanism to deal with the unboundedness of the cost function; this technique will be detailed in Section 3, but the difficulty and its resolution can be easily sketched here. [sent-36, score-0.243]
17 For a single set of centers P , the deviations in eq. [sent-37, score-0.429]
18 But this does not immediately grant deviation bounds on another set of centers P 0 , even if P and P 0 are very close: for instance, the difference between the two costs will grow as successively farther and farther away points are considered. [sent-40, score-0.437]
19 The resolution is to simply note that there is so little probability mass in those far reaches that the cost there is irrelevant. [sent-41, score-0.172]
20 $x \mapsto \|x - p\|_2^2$ is integrable); the dominated convergence theorem grants $\int_{B_i^c} \|x - p\|_2^2\, d\rho(x) \to 0$, so some ball $B_i$ satisfies $\int_{B_i^c} \|x - p\|_2^2\, d\rho(x) \le 1/1024$. [sent-43, score-0.488]
21 Now consider some other center $p'$ whose distance to $p$ is dominated by $\|x - p\|_2$ over $B_i^c$. [sent-46, score-0.518]
22 Then $\int_{B_i^c} \|x - p'\|_2^2\, d\rho(x) \le \int_{B_i^c} (\|x - p\|_2 + \|p - p'\|_2)^2\, d\rho(x) \le 4 \int_{B_i^c} \|x - p\|_2^2\, d\rho(x) \le 1/256$. [sent-47, score-0.244]
23 In this way, a single center may control the outer deviations of whole swaths of other centers. [sent-48, score-0.492]
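A quick numerical illustration of this domination (illustrative only; the distribution, radius, and centers below are arbitrary choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_t(df=5, size=(200_000, 2))   # heavy-ish tails, but 4 moments exist
p = np.zeros(2)
p_prime = p + np.array([0.5, -0.5])           # a nearby alternative center

R = 8.0                                        # radius of the ball B_i around p
outside = np.linalg.norm(X - p, axis=1) > R    # indicator of the far region B_i^c

tail_p  = (np.linalg.norm(X - p,       axis=1) ** 2 * outside).mean()
tail_pp = (np.linalg.norm(X - p_prime, axis=1) ** 2 * outside).mean()
print(tail_pp / max(tail_p, 1e-12))            # stays well below the factor 4 above
```

Since ‖p − p′‖₂ is tiny compared to ‖x − p‖₂ outside the ball, the empirical ratio falls far below the worst-case factor of 4.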
24 The general strategy is thus to split consideration into outer deviations, and local deviations. [sent-51, score-0.325]
25 The local deviations may be controlled by standard techniques. [sent-52, score-0.152]
26 To control outer deviations, a single pair of dominating costs — a lower bound and an upper bound — is controlled. [sent-53, score-0.364]
27 This technique can be found in the proof of the consistency of k-means due to Pollard [1]. [sent-54, score-0.118]
28 The core deviation technique, termed outer bracketing (to connect it to the bracketing technique from empirical process theory), is presented along with the deviations of k-means in Section 3. [sent-58, score-1.072]
29 The technique is then applied in Section 5 to a soft clustering variant, namely log likelihood of Gaussian mixtures having bounded spectra. [sent-59, score-0.273]
30 As a reprieve between these two heavier bracketing sections, Section 4 provides a simple refinement for k-means which can adapt to cluster structure. [sent-60, score-0.322]
31 All proofs are deferred to the appendices; however, the construction and application of outer brackets are sketched in the text. [sent-61, score-0.461]
32 2 Related Work As referenced earlier, Pollard’s work deserves special mention, both since it can be seen as the origin of the outer bracketing technique, and since it handled k-means under similarly slight assumptions (just two moments, rather than the four here) [1, 7]. [sent-63, score-0.559]
33 The present work hopes to be a spiritual successor, providing finite sample guarantees, and adapting the technique to a soft clustering problem. [sent-64, score-0.189]
34 In the machine learning community, statistical guarantees for clustering have been extensively studied under the topic of clustering stability [4, 8, 9, 10]. [sent-65, score-0.234]
35 The technical component of these works frequently involves finite sample guarantees, which in the works listed here make a boundedness assumption, or something similar (for instance, the work of Shamir and Tishby [9] requires the cost function to satisfy a bounded differences condition). [sent-67, score-0.25]
36 Amongst these finite sample guarantees, the finite sample guarantees due to Rakhlin and Caponnetto [4] are similar to the development here after the invocation of the outer bracket: namely, a covering argument controls deviations over a bounded set. [sent-68, score-0.634]
37 The results of Shamir and Tishby [10] do not make a boundedness assumption, but the main results are not finite sample guarantees; in particular, they rely on asymptotic results due to Pollard [7]. [sent-69, score-0.125]
38 There are many standard tools which may be applied to the problems here, particularly if a boundedness assumption is made [11, 12]; for instance, Lugosi and Zeger [2] use tools from VC theory to handle k-means in the bounded case. [sent-70, score-0.214]
39 Another interesting work, by Ben-David [3], develops specialized tools to measure the complexity of certain clustering problems; when applied to the problems of the type considered here, a boundedness assumption is made. [sent-71, score-0.239]
40 A few of the above works provide some negative results and related commentary on the topic of uniform deviations for distributions with unbounded support [10, Theorem 3 and subsequent discussion] [3, Page 5 above Definition 2]. [sent-72, score-0.152]
41 The primary “loophole” here is to constrain consideration to those solutions beating some reference score c. [sent-73, score-0.131]
42 (The secondary loophole is to make moment assumptions; these sufficiently constrain the structure of the distribution to provide rates. [sent-76, score-0.135]
43 Loosely stated, a bracket is simply a pair of functions which sandwich some set of functions; the bracketing entropy is then (the logarithm of) the number of brackets needed to control a particular set of functions. [sent-79, score-0.745]
44 In the present work, brackets are paired with sets which identify the far away regions they are meant to control; furthermore, while there is potential for the use of many outer brackets, the approach here is able to make use of just a single outer bracket. [sent-80, score-0.771]
45 The name bracket is suitable, as opposed to cover, since the bracketing elements need not be members of the function class being dominated. [sent-81, score-0.555]
46 (By contrast, Pollard’s use in the proof of the consistency of k-means was more akin to covering, in that remote fluctuations were compared to that of a single center placed at the origin [1]. [sent-82, score-0.125]
The source probability measure will be ρ, and when a finite sample of size m is available, $\hat\rho$ is the corresponding empirical measure. [sent-84, score-0.14]
Probability measure ρ has order-p moment bound M with respect to norm $\|\cdot\|$ when $\mathbb{E}_\rho \|X - \mathbb{E}_\rho(X)\|^l \le M$ for $1 \le l \le p$. [sent-90, score-0.217]
For example, the typical setting of k-means uses norm $\|\cdot\|_2$, and at least two moments are needed for the cost over ρ to be finite; the condition here of needing 4 moments can be seen as naturally arising via Chebyshev’s inequality. [sent-91, score-0.462]
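To spell out that Chebyshev-type step (a standard argument, not quoted from the paper), the order-p moment bound defined above controls the tail mass directly, via Markov's inequality applied to $\|X - \mathbb{E}_\rho(X)\|^p$:

$$\Pr\big[\|X - \mathbb{E}_\rho(X)\| \ge t\big] \;=\; \Pr\big[\|X - \mathbb{E}_\rho(X)\|^p \ge t^p\big] \;\le\; \frac{\mathbb{E}_\rho\|X - \mathbb{E}_\rho(X)\|^p}{t^p} \;\le\; \frac{M}{t^p}.$$

With p ≥ 4 this tail decays fast enough that the squared distances accumulated far from the mean remain controlled, which is the role the moment assumption plays in the outer bracket.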
Of course, the availability of higher moments is beneficial, dropping the rates here from $m^{-1/4}$ down to $m^{-1/2}$. [sent-92, score-0.182]
$\mathbb{R}$ which is strongly convex and has Lipschitz gradients with respective constants $r_1, r_2$ with respect to norm $\|\cdot\|$, the hard k-means cost of a single point x according to a set of centers P is $\phi_f(x; P) := \min_{p \in P} B_f(x, p)$. [sent-111, score-0.474]
The corresponding k-means cost of a set of points (or distribution) is thus computed as $\mathbb{E}_\nu(\phi_f(X; P))$, and let $\mathcal{H}_f(\nu; c, k)$ denote all sets of at most k centers beating cost c, meaning $\mathcal{H}_f(\nu; c, k) := \{P : |P| \le k,\ \mathbb{E}_\nu(\phi_f(X; P)) \le c\}$. [sent-112, score-0.509]
For example, choosing norm $\|\cdot\|_2$ and convex function $f(x) = \|x\|_2^2$ (which has $r_1 = r_2 = 2$), the corresponding Bregman divergence is $B_f(x, y) = \|x - y\|_2^2$, and $\mathbb{E}_{\hat\rho}(\phi_f(X; P))$ denotes the vanilla k-means cost of some finite point set encoded in the empirical measure $\hat\rho$. [sent-113, score-0.481]
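A minimal sketch of this Bregman form of the hard cost (illustrative helper names; with f(x) = ‖x‖₂² it reduces to ordinary squared-Euclidean k-means):

```python
import numpy as np

def bregman(f, grad_f, x, y):
    """B_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - grad_f(y) @ (x - y)

def hard_cost(X, P, f, grad_f):
    """Average of phi_f(x; P) = min_{p in P} B_f(x, p) over the sample X."""
    return float(np.mean([min(bregman(f, grad_f, x, p) for p in P) for x in X]))

# Choosing f(x) = ||x||_2^2 (so r1 = r2 = 2) recovers the vanilla k-means cost:
f = lambda x: float(x @ x)
grad_f = lambda x: 2.0 * x
```

For this choice, bregman(f, grad_f, x, p) expands to ‖x − p‖₂², so hard_cost agrees with the empirical k-means cost computed earlier.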
The hard clustering guarantees will work with $\mathcal{H}_f(\nu; c, k)$, where ν can be either the source distribution ρ, or its empirical counterpart $\hat\rho$. [sent-114, score-0.176]
Given Gaussian mixture parameters $(\alpha, \Theta) = (\{\alpha_i\}_{i=1}^k, \{\theta_i\}_{i=1}^k)$ with $\alpha \ge 0$ and $\sum_i \alpha_i = 1$ (written $\alpha \in \Delta$), the Gaussian mixture cost at a point x is the log-likelihood $\ln\big(\sum_{i=1}^k \alpha_i\, p_{\theta_i}(x)\big)$, where each $p_{\theta_i}$ is a Gaussian density with normalizer $\sqrt{(2\pi)^d |\Sigma_i|}$. [sent-121, score-0.177]
While a condition of the form $\Sigma \succeq \sigma_1 I$ is typically enforced in practice (say, with a Bayesian prior, or by ignoring updates which shrink the covariance beyond this point), the condition $\Sigma \preceq \sigma_2 I$ is potentially violated. [sent-123, score-0.125]
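One common way to keep EM's covariance updates inside such a spectral band (a sketch of standard practice, not the paper's algorithm) is to clip eigenvalues into [σ₁, σ₂]:

```python
import numpy as np

def clip_covariance(Sigma, sigma1, sigma2):
    """Clip the eigenvalues of a symmetric covariance into [sigma1, sigma2],
    so that sigma1*I <= result <= sigma2*I."""
    vals, vecs = np.linalg.eigh(Sigma)
    vals = np.clip(vals, sigma1, sigma2)
    return (vecs * vals) @ vecs.T
```

Applying this after every M-step enforces both the lower bound discussed above and the upper bound needed by the analysis here.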
Let real c ≥ 0 and probability measure ρ be given with order-p moment bound M with respect to $\|\cdot\|_2$, where p ≥ 4 is a positive multiple of 4. [sent-129, score-0.165]
Then with probability at least $1 - 3\delta$ over the draw of a sample of size $m \ge \max\{(p/(2^{p/4+2}e))^2,\ 9\ln(1/\delta)\}$, every set of centers $P \in \mathcal{H}_f(\hat\rho; c, k) \cup \mathcal{H}_f(\rho; c, k)$ satisfies a uniform deviation bound $\big|\int \phi_f(x; P)\, d\rho(x) - \int \phi_f(x; P)\, d\hat\rho(x)\big| = O\big(m^{-1/2 + \min\{1/4,\, 2/p\}}\big)$. [sent-131, score-0.356]
(The explicit constant, given in the theorem statement, combines the quantity $72c_1 + 32M_1$, a logarithmic covering term of the form $\ln((mN_1)^{dk}/\delta)$, and moment terms carrying exponents $p/4$ and $4/p$.) [sent-132, score-0.154]
60 1, catches the mass of points discarded due to the outer bracket, as well as the resolution of the (inner) cover. [sent-140, score-0.329]
Let probability measure ρ be given with order-p moment bound M where p ≥ 4, a convex function f with corresponding constants $r_1$ and $r_2$, reals c and ε > 0, and integer $1 \le p' \le p/2 - 1$ be given. [sent-146, score-0.231]
Then with probability at least $1 - 3\delta$ over the draw of a sample of size $m \ge \max\{p'/(e2^{p'}\varepsilon),\ 9\ln(1/\delta)\}$, every set of centers $P \in \mathcal{H}_f(\rho; c, k) \cup \mathcal{H}_f(\hat\rho; c, k)$ satisfies a deviation bound combining a covering term of order $\sqrt{\tfrac{2}{m}\ln\tfrac{2|N|^k}{\delta}}$ with moment terms carrying exponent $1/p'$ (the full expression appears in the statement). [sent-148, score-0.51]
63 1 Compactification via Outer Brackets The outer bracket is defined as follows. [sent-150, score-0.532]
An outer bracket for probability measure ν at scale ε consists of two triples, one each for lower and upper bounds. [sent-153, score-0.573]
Direct from the definition (if $\phi \in Z_u$ then $\phi \le u$, and symmetrically for $Z_\ell$ and $\ell$), given bracketing functions $(\ell, u)$, a bracketed function $\phi_f(\cdot; P)$, and the bracketing set $B := B_u \cup B_\ell$, $$-\varepsilon \;\le\; \int_{B^c} \ell(x)\, d\nu(x) \;\le\; \int_{B^c} \phi_f(x; P)\, d\nu(x) \;\le\; \int_{B^c} u(x)\, d\nu(x) \;\le\; \varepsilon; \tag{3.4}$$ [sent-158, score-0.582]
in other words, as intended, this mechanism allows deviations on $B^c$ to be discarded. [sent-159, score-0.184]
Thus to uniformly control the deviations of the dominated functions $Z := Z_u \cup Z_\ell$ over the set $B^c$, it suffices to simply control the deviations of the pair $(\ell, u)$. [sent-160, score-0.368]
The following lemma shows that a bracket exists for $\{\phi_f(\cdot; P) : P \in \mathcal{H}_f(\nu; c, k)\}$ and compact B, and moreover that this allows sampled points and candidate centers in far reaches to be deleted. [sent-161, score-0.603]
Set $M' := 2^{p'}\varepsilon$, $\ell(x) := 0$, $u(x) := 4r_2\|x - \mathbb{E}(X)\|^2$, and $\varepsilon_{\hat\rho} := \varepsilon$ plus a $\sqrt{\,\cdot\,/(2m)}$-type term (as in the lemma statement); the following statements hold with probability at least $1 - 2\delta$ over a draw of size $m \ge \max\{p'/(M'e),\ 9\ln(1/\delta)\}$. [sent-166, score-0.284]
$(u, \ell)$ is an outer bracket for ρ at scale $\varepsilon_\rho := \varepsilon$ with sets $B_\ell = B_u = B$ and $Z_\ell = Z_u = \{\phi_f(\cdot; P) : P \in \mathcal{H}_f(\hat\rho; c, k) \cup \mathcal{H}_f(\rho; c, k)\}$, and furthermore the pair $(u, \ell)$ is also an outer bracket for $\hat\rho$ at scale $\varepsilon_{\hat\rho}$ with the same sets. [sent-168, score-1.064]
It is not possible for an element of $\mathcal{H}_f(\hat\rho; c, k) \cup \mathcal{H}_f(\rho; c, k)$ to have all centers far from $B_0$, since otherwise the cost is larger than c. [sent-176, score-0.388]
Consequently, at least one center lies near to $B_0$; this reasoning was also the first step in the k-means consistency proof due to Pollard [1]. [sent-179, score-0.125]
73 To control the size of B0 , as well as the size of B, and moreover the deviations of the bracket over B, the moment tools from Appendix A are used. [sent-187, score-0.581]
74 2, the above bracketing allows the removal of points and centers outside of a compact set (in particular, the pair of compact sets B and C, respectively). [sent-189, score-0.568]
75 These moment conditions, however, do not necessarily reflect any natural cluster structure in the source distribution. [sent-193, score-0.183]
Real number R and compact set C are a clamp for probability measure ν, family of centers Z, and cost $\phi_f$ at scale ε > 0 if every P ∈ Z satisfies $|\mathbb{E}_\nu(\phi_f(X; P)) - \mathbb{E}_\nu(\min\{\phi_f(X; P \cap C),\, R\})| \le \varepsilon$. [sent-198, score-0.476]
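For concreteness, the clamped quantity inside this definition can be written as a small helper (an illustrative sketch; here C is represented by a membership test, and phi is any per-point cost such as the squared-Euclidean one):

```python
import numpy as np

def clamped_cost(x, P, R, in_C, phi):
    """min{ phi_f(x; P ∩ C), R }: keep only centers inside the clamping set C,
    and cap the per-point cost at the clamping value R."""
    kept = [p for p in P if in_C(p)]
    if not kept:   # the min over an empty set is +infinity, so the cap R is returned
        return R
    return min(min(phi(x, p) for p in kept), R)

# Example choices: squared-Euclidean cost and a ball-shaped clamping set of radius 10.
phi = lambda x, p: float(np.sum((np.asarray(x) - np.asarray(p)) ** 2))
in_C = lambda p, radius=10.0: float(np.linalg.norm(p)) <= radius
```

Averaging clamped_cost over a sample (or integrating against ρ) gives the clamped expectation that the definition compares with the unclamped one.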
77 Note that this definition is similar to the second part of the outer bracket guarantee in Lemma 3. [sent-199, score-0.532]
78 If the distribution has bounded support, then choosing a clamping value R and clamping set C respectively slightly larger than the support size and set is sufficient: as was reasoned in the construction of outer brackets, if no centers are close to the support, then the cost is bad. [sent-203, score-0.926]
79 Correspondingly, the clamped set of functions Z should again be choices of centers whose cost is not too high. [sent-204, score-0.356]
For a more interesting example, suppose ρ is supported on k small balls of radius $R_1$, where the distance between their respective centers is some $R_2 \ge R_1$. [sent-205, score-0.358]
81 Then by reasoning similar to the bounded case, all choices of centers achieving a good cost will place centers near to each ball, and thus the clamping value can be taken closer to R1 . [sent-206, score-0.807]
Of course, the above gave the existence of clamps under favorable conditions. [sent-207, score-0.097]
83 The following shows that outer brackets can be used to show the existence of clamps in general. [sent-208, score-0.523]
84 In fact, the proof is very short, and follows the scheme laid out in the bounded example above: outer bracketing allows the restriction of consideration to a bounded set, and some algebra from there gives a conservative upper bound for the clamping value. [sent-209, score-0.901]
Then (C, R) is a clamp for measure ρ and centers $\mathcal{H}_f(\rho; c, k)$ at scale ε, and with probability at least $1 - 3\delta$ over a draw of size $m \ge \max\{p'/(M'e),\ 9\ln(1/\delta)\}$, it is also a clamp for $\hat\rho$ and centers $\mathcal{H}_f(\hat\rho; c, k)$ at scale $\varepsilon_{\hat\rho}$. [sent-214, score-0.556]
The general guarantee using clamps is as follows. [sent-215, score-0.097]
Let (R, C) be a clamp for probability measure ρ and empirical counterpart $\hat\rho$ over some center class Z and cost $\phi_f$ at respective scales $\varepsilon_\rho$ and $\varepsilon_{\hat\rho}$, where f has corresponding convexity constants $r_1$ and $r_2$. [sent-221, score-0.27]
Suppose C is contained within a ball of radius $R_C$, let ε > 0 be given, define scale parameter $\tau := \min\{\sqrt{\varepsilon r_1/(2r_2)},\ \varepsilon/(2r_2 R_3)\}$, and let N be a cover of C by $\|\cdot\|$-balls of radius τ (as per Lemma B. [sent-222, score-0.209]
Then with probability at least $1 - \delta$ over the draw of a sample of size $m \ge p'/(M'e)$, every set of centers P ∈ Z satisfies $\big|\int \phi_f(x; P)\, d\rho(x) - \int \phi_f(x; P)\, d\hat\rho(x)\big| \le 2\varepsilon + \varepsilon_\rho + \varepsilon_{\hat\rho} + R\sqrt{\tfrac{1}{2m}\ln\tfrac{2|N|^k}{\delta}}$. [sent-224, score-0.51]
Before adjourning this section, note that clamps and outer brackets disagree on the treatment of the outer regions: the former replaces the cost there with the fixed value R, whereas the latter uses the value 0. [sent-225, score-0.87]
This is not a problem with outer bracketing, since the same points (namely $B^c$) are ignored by every set of centers. [sent-227, score-0.303]
5 Mixtures of Gaussians Before turning to the deviation bound, it is a good place to discuss the condition $\sigma_1 I \preceq \Sigma \preceq \sigma_2 I$, which must be met by every covariance matrix of every constituent Gaussian in a mixture. [sent-228, score-0.117]
The lower bound $\sigma_1 I \preceq \Sigma$, as discussed previously, is fairly common in practice, arising either via a Bayesian prior, or by implementing EM with an explicit condition that covariance updates are discarded when the eigenvalues fall below some threshold. [sent-229, score-0.112]
Concretely, consider a Gaussian over $\mathbb{R}^2$ with covariance $\Sigma = \mathrm{diag}(\sigma, 1)$; as σ decreases, the Gaussian has large density on a line, but low density elsewhere. [sent-237, score-0.101]
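A tiny numerical illustration of that degeneracy (not from the paper; the specific σ values are arbitrary): the peak density of N(0, diag(σ, 1)) is 1/(2π√σ), which blows up as σ shrinks, so an unconstrained likelihood can be driven arbitrarily high.

```python
import numpy as np

for sigma in [1.0, 1e-2, 1e-4, 1e-6]:
    peak = 1.0 / (2 * np.pi * np.sqrt(sigma * 1.0))  # density at the mean of N(0, diag(sigma, 1))
    print(f"sigma={sigma:g}  peak density = {peak:.3e}")
```

This is exactly the behavior prevented by the lower bound $\sigma_1 I \preceq \Sigma$ discussed above.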
In both the hard and soft clustering analyses here, a crucial early step allows the assertion that good scores in some region mean the relevant parameter is nearby. [sent-240, score-0.117]
Let probability measure ρ be given with order-p moment bound M according to norm $\|\cdot\|_2$ where p ≥ 8 is a positive multiple of 4, covariance bounds $0 < \sigma_1 \le \sigma_2$ with $\sigma_1 \le 1$ for simplicity, and real $c \ge 1/2$ be given. [sent-245, score-0.252]
The proof follows the scheme of the hard clustering analysis. [sent-247, score-0.104]
Superficially, a second distinction with the hard clustering case is that faraway Gaussians cannot be entirely ignored on local regions; the influence is limited, however, and the analysis proceeds similarly in each case. [sent-249, score-0.183]
99 Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. [sent-255, score-0.12]
100 Estimates of the moments of sums of independent random variables. [sent-323, score-0.143]
wordName wordTfidf (topN-words)
[('hf', 0.385), ('bracketing', 0.291), ('centers', 0.277), ('outer', 0.268), ('bracket', 0.264), ('kx', 0.244), ('pollard', 0.194), ('brackets', 0.158), ('ln', 0.154), ('deviations', 0.152), ('moments', 0.143), ('clamping', 0.128), ('bf', 0.118), ('rb', 0.111), ('clamps', 0.097), ('smog', 0.097), ('moment', 0.092), ('bregman', 0.088), ('boundedness', 0.086), ('cost', 0.079), ('clamp', 0.079), ('zu', 0.079), ('beating', 0.074), ('wellner', 0.073), ('clustering', 0.071), ('rc', 0.07), ('bi', 0.07), ('source', 0.06), ('bu', 0.058), ('consideration', 0.057), ('gaussian', 0.054), ('norm', 0.052), ('consistency', 0.052), ('ces', 0.051), ('radius', 0.051), ('mixture', 0.049), ('bc', 0.049), ('mmin', 0.049), ('subgaussianity', 0.049), ('concretely', 0.048), ('shamir', 0.048), ('stability', 0.047), ('jon', 0.047), ('chebyshev', 0.047), ('soft', 0.046), ('bounded', 0.046), ('covering', 0.045), ('ball', 0.045), ('away', 0.045), ('condition', 0.045), ('guarantees', 0.045), ('lloyd', 0.044), ('loophole', 0.043), ('shai', 0.042), ('tools', 0.041), ('likelihood', 0.041), ('measure', 0.041), ('draw', 0.04), ('center', 0.04), ('vaart', 0.039), ('farther', 0.039), ('naftali', 0.039), ('phane', 0.039), ('availability', 0.039), ('sample', 0.039), ('lugosi', 0.038), ('gao', 0.037), ('tishby', 0.037), ('deviation', 0.037), ('gaussians', 0.036), ('mixtures', 0.036), ('sketched', 0.035), ('ruled', 0.035), ('covariance', 0.035), ('convex', 0.035), ('nition', 0.035), ('conditions', 0.035), ('ignored', 0.035), ('lastly', 0.033), ('technique', 0.033), ('density', 0.033), ('proof', 0.033), ('ohad', 0.032), ('boucheron', 0.032), ('control', 0.032), ('far', 0.032), ('bound', 0.032), ('mechanism', 0.032), ('cover', 0.032), ('constants', 0.031), ('cluster', 0.031), ('corollary', 0.031), ('resolution', 0.031), ('divergence', 0.03), ('appendix', 0.03), ('lipschitz', 0.03), ('mass', 0.03), ('lemma', 0.03), ('balls', 0.03), ('kp', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
Author: Matus Telgarsky, Sanjoy Dasgupta
Abstract: Suppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with p ≥ 4 bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as $m^{\min\{-1/4,\,-1/2+2/p\}}$. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of k-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for k-means instances possessing some cluster structure. 1
2 0.13622871 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields
Author: Mijung Park, Jonathan W. Pillow
Abstract: The receptive field (RF) of a sensory neuron describes how the neuron integrates sensory stimuli over time and space. In typical experiments with naturalistic or flickering spatiotemporal stimuli, RFs are very high-dimensional, due to the large number of coefficients needed to specify an integration profile across time and space. Estimating these coefficients from small amounts of data poses a variety of challenging statistical and computational problems. Here we address these challenges by developing Bayesian reduced rank regression methods for RF estimation. This corresponds to modeling the RF as a sum of space-time separable (i.e., rank-1) filters. This approach substantially reduces the number of parameters needed to specify the RF, from 1K-10K down to mere 100s in the examples we consider, and confers substantial benefits in statistical power and computational efficiency. We introduce a novel prior over low-rank RFs using the restriction of a matrix normal prior to the manifold of low-rank matrices, and use “localized” row and column covariances to obtain sparse, smooth, localized estimates of the spatial and temporal RF components. We develop two methods for inference in the resulting hierarchical model: (1) a fully Bayesian method using blocked-Gibbs sampling; and (2) a fast, approximate method that employs alternating ascent of conditional marginal likelihoods. We develop these methods for Gaussian and Poisson noise models, and show that low-rank estimates substantially outperform full rank estimates using neural data from retina and V1. 1
3 0.12400406 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
Author: Ilya O. Tolstikhin, Yevgeny Seldin
Abstract: We present a PAC-Bayes-Empirical-Bernstein inequality. The inequality is based on a combination of the PAC-Bayesian bounding technique with an Empirical Bernstein bound. We show that when the empirical variance is significantly smaller than the empirical loss the PAC-Bayes-Empirical-Bernstein inequality is significantly tighter than the PAC-Bayes-kl inequality of Seeger (2002) and otherwise it is comparable. Our theoretical analysis is confirmed empirically on a synthetic example and several UCI datasets. The PAC-Bayes-Empirical-Bernstein inequality is an interesting example of an application of the PAC-Bayesian bounding technique to self-bounding functions. 1
4 0.088872567 63 nips-2013-Cluster Trees on Manifolds
Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman
Abstract: unkown-abstract
5 0.079891741 158 nips-2013-Learning Multiple Models via Regularized Weighting
Author: Daniel Vainsencher, Shie Mannor, Huan Xu
Abstract: We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases. A common approach to solving new MML problems is to generalize Lloyd’s algorithm for clustering (or Expectation-Maximization for soft clustering). However this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models. We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufficiently spread out. This enhances robustness by making assumptions on class balance. We further provide generalization bounds and explain how the new iterations may be computed efficiently. We demonstrate the robustness benefits of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a fixed percentage of adversarial unbounded outliers. 1
6 0.079035483 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
7 0.073053963 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
8 0.07290516 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
9 0.067817599 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
10 0.065536246 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning
11 0.06497895 91 nips-2013-Dirty Statistical Models
12 0.063897304 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
13 0.062063348 344 nips-2013-Using multiple samples to learn mixture models
14 0.061113976 230 nips-2013-Online Learning with Costly Features and Labels
15 0.059759825 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity
16 0.057706162 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
17 0.057491653 36 nips-2013-Annealing between distributions by averaging moments
18 0.053483825 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
19 0.053070564 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
20 0.051145565 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization
topicId topicWeight
[(0, 0.166), (1, 0.038), (2, 0.067), (3, 0.02), (4, 0.022), (5, 0.063), (6, 0.022), (7, 0.008), (8, -0.004), (9, 0.016), (10, 0.021), (11, 0.058), (12, 0.008), (13, 0.001), (14, 0.021), (15, 0.005), (16, 0.031), (17, 0.002), (18, 0.003), (19, 0.046), (20, -0.043), (21, 0.03), (22, 0.031), (23, -0.024), (24, -0.123), (25, -0.033), (26, -0.12), (27, 0.015), (28, -0.128), (29, -0.071), (30, -0.041), (31, -0.038), (32, 0.067), (33, -0.04), (34, -0.007), (35, 0.002), (36, -0.026), (37, -0.074), (38, 0.046), (39, -0.059), (40, -0.004), (41, -0.008), (42, -0.072), (43, -0.079), (44, 0.014), (45, 0.058), (46, 0.012), (47, -0.049), (48, 0.016), (49, -0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.92975527 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
Author: Matus Telgarsky, Sanjoy Dasgupta
Abstract: Suppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with p ≥ 4 bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as $m^{\min\{-1/4,\,-1/2+2/p\}}$. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of k-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for k-means instances possessing some cluster structure. 1
2 0.69937587 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
Author: Martin Azizyan, Aarti Singh, Larry Wasserman
Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1
3 0.66674113 63 nips-2013-Cluster Trees on Manifolds
Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman
Abstract: unkown-abstract
4 0.64534605 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
Author: Samory Kpotufe, Vikas Garg
Abstract: We present the first result for kernel regression where the procedure adapts locally at a point x to both the unknown local dimension of the metric space X and the unknown Hölder-continuity of the regression function at x. The result holds with high probability simultaneously at all points x in a general metric space X of unknown structure. 1
5 0.63935465 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
Author: Maria-Florina Balcan, Steven Ehrlich, Yingyu Liang
Abstract: This paper provides new algorithms for distributed clustering for two popular center-based objectives, k-median and k-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by [13], we reduce the problem of finding a clustering with low cost to the problem of finding a coreset of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. Experimental results on large scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms. 1
6 0.62743515 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
8 0.61911392 271 nips-2013-Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima
9 0.58694088 158 nips-2013-Learning Multiple Models via Regularized Weighting
10 0.58434981 344 nips-2013-Using multiple samples to learn mixture models
11 0.57940215 252 nips-2013-Predictive PAC Learning and Process Decompositions
12 0.57331681 112 nips-2013-Estimation Bias in Multi-Armed Bandit Algorithms for Search Advertising
13 0.53619272 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions
14 0.53533739 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
15 0.52834314 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
16 0.52686238 70 nips-2013-Contrastive Learning Using Spectral Methods
17 0.52318168 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
18 0.52258253 130 nips-2013-Generalizing Analytic Shrinkage for Arbitrary Covariance Structures
19 0.51142865 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
20 0.50491256 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
topicId topicWeight
[(2, 0.016), (16, 0.037), (33, 0.155), (34, 0.075), (41, 0.033), (49, 0.028), (56, 0.189), (70, 0.027), (79, 0.256), (85, 0.028), (89, 0.03), (93, 0.043), (95, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.82983011 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
Author: Matus Telgarsky, Sanjoy Dasgupta
Abstract: Suppose k centers are fit to m points by heuristically minimizing the k-means cost; what is the corresponding fit over the source distribution? This question is resolved here for distributions with p ≥ 4 bounded moments; in particular, the difference between the sample cost and distribution cost decays with m and p as $m^{\min\{-1/4,\,-1/2+2/p\}}$. The essential technical contribution is a mechanism to uniformly control deviations in the face of unbounded parameter sets, cost functions, and source distributions. To further demonstrate this mechanism, a soft clustering variant of k-means cost is also considered, namely the log likelihood of a Gaussian mixture, subject to the constraint that all covariance matrices have bounded spectrum. Lastly, a rate with refined constants is provided for k-means instances possessing some cluster structure. 1
2 0.7945711 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
Author: Martin Azizyan, Aarti Singh, Larry Wasserman
Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1
3 0.74148774 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
Author: Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori
Abstract: We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efficiently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classification label and localized latent paths. 1
4 0.72387052 344 nips-2013-Using multiple samples to learn mixture models
Author: Jason Lee, Ran Gilad-Bachrach, Rich Caruana
Abstract: In the mixture models problem it is assumed that there are K distributions θ1 , . . . , θK and one gets to observe a sample from a mixture of these distributions with unknown coefficients. The goal is to associate instances with their generating distributions, or to identify the parameters of the hidden distributions. In this work we make the assumption that we have access to several samples drawn from the same K underlying distributions, but with different mixing weights. As with topic modeling, having multiple samples is often a reasonable assumption. Instead of pooling the data into one sample, we prove that it is possible to use the differences between the samples to better recover the underlying structure. We present algorithms that recover the underlying structure under milder assumptions than the current state of art when either the dimensionality or the separation is high. The methods, when applied to topic modeling, allow generalization to words not present in the training data. 1
5 0.72036624 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
Author: Shiau Hong Lim, Huan Xu, Shie Mannor
Abstract: An important challenge in Markov decision processes is to ensure robustness with respect to unexpected or adversarial system behavior while taking advantage of well-behaving parts of the system. We consider a problem setting where some unknown parts of the state space can have arbitrary transitions while other parts are purely stochastic. We devise an algorithm that is adaptive to potentially adversarial behavior and show that it achieves similar regret bounds as the purely stochastic case. 1
6 0.71504247 249 nips-2013-Polar Operators for Structured Sparse Estimation
7 0.71447784 230 nips-2013-Online Learning with Costly Features and Labels
9 0.71088278 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
10 0.7097711 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
11 0.70924801 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
12 0.70866668 252 nips-2013-Predictive PAC Learning and Process Decompositions
13 0.70836687 355 nips-2013-Which Space Partitioning Tree to Use for Search?
14 0.70808822 292 nips-2013-Sequential Transfer in Multi-armed Bandit with Finite Set of Models
15 0.70797181 95 nips-2013-Distributed Exploration in Multi-Armed Bandits
16 0.7072928 224 nips-2013-On the Sample Complexity of Subspace Learning
17 0.70719564 66 nips-2013-Computing the Stationary Distribution Locally
18 0.70687926 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
19 0.70573068 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
20 0.70443201 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings