nips nips2013 nips2013-252 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Cosma Shalizi, Aryeh Kontorovitch
Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). [sent-5, score-0.565]
2 A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. [sent-6, score-0.623]
3 In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. [sent-7, score-0.331]
4 In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. [sent-9, score-0.655]
5 We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest. [sent-10, score-0.679]
6 As recently as 1997, Vidyasagar [1] named extending learning theory to stochastic processes of dependent variables as a major open problem. [sent-12, score-0.339]
7 Since then, considerable progress has been made for specific classes of processes, particularly strongly-mixing sequences and exchangeable sequences. [sent-13, score-0.217]
8 ) Our goals in this paper are to point out that many practically-important classes of stochastic processes possess a special sort of structure, namely they are convex combinations of simpler, extremal processes. [sent-15, score-0.585]
9 3] to avoid technicalities, implying 1 Absolutely regular processes are ones where the joint distribution of past and future events approaches independence, in L1 , as the separation between events goes to infinity; see §4 below for a precise statement and extensive discussion. [sent-22, score-0.785]
10 Absolutely regular sequences are now more commonly called “β-mixing”, but we use the older name to avoid confusion with the other sort of “mixing”. [sent-23, score-0.323]
11 (We believe our ideas apply to stochastic processes with multidimensional index sets as well, j but use sequences here. [sent-25, score-0.394]
12 Any ∞ ∞ such stochastic process can be represented through the shift map τ : X → X (which just drops the first coordinate, (τ x)t = xt+1 ), and a suitable distribution of initial conditions. [sent-31, score-0.089]
13 9], stationary processes, whose distributions are invariant over time, proved to be convex combinations of stationary ergodic measures, ones where all invariant sets have either probability 0 or probability 1. [sent-37, score-0.641]
14 1], exchangeable sequences, which are invariant under permutation, are expressed as mixtures of IID sequences2 . [sent-39, score-0.302]
15 Similar results are now also known for asymptotically mean stationary sequences [18, §8. [sent-40, score-0.228]
16 4], for partially-exchangeable sequences [22], for stationary random fields, and even for infinite exchangeable arrays (including networks and graphs) [21, ch. [sent-41, score-0.319]
17 The probability law ρ of the composite but symmetric process is a convex combination of the simpler, extremal processes µ ∈ M with the same symmetry. [sent-45, score-0.65]
18 The infinite-dimensional distribution of these extremal processes are, naturally, mutually singular. [sent-46, score-0.493]
19 Sample paths from the composite process are generated hierarchically, first by picking an extremal process µ from M according to a some measure π supported on M, and then generating a sample path from µ. [sent-48, score-0.445]
20 Each realization of the composite process therefore gives information about only a single extremal process, as this is an invariant along each sample path. [sent-50, score-0.459]
21 The goal in PAC theory is to find a sample complexity function3 s( , δ, F, µ) such that, whenever n ≥ s, Pµ sup f ∈F 1 n n f (Xt ) − Eµ [f ] ≥ ≤δ (1) t=1 That is, PAC theory seeks finite-sample uniform law of large numbers for F. [sent-57, score-0.182]
22 Because of its importance, it will be convenient to abbreviate the supremum, sup f ∈F 1 n n f (Xt ) − Eµ [f ] ≡ Γn t=1 2 This is actually a special case of the ergodic decomposition [21, pp. [sent-58, score-0.38]
23 Standard PAC is defined as distribution-free, but here we maintain the dependence on µ for consistency with future notation. [sent-60, score-0.108]
24 What one has in mind, of course, is that there is a space H of predictive models (classifiers, regressions, . [sent-64, score-0.113]
25 If Γn → 0 in probability for this “loss function” class, then empirical risk ˆ minimization is reliable. [sent-70, score-0.105]
26 That is, the function fn which minimizes the empirical risk n−1 t f (Xt ) has an expected risk in the future which is close to the best attainable risk over all of F, R(F, µ) = inf f ∈F Eµ [f ]. [sent-71, score-0.3]
27 Indeed, since when n ≥ s, with high (≥ 1 − δ) probability all functions have ˆ empirical risks within of their true risks, with high probability the true risk Eµ fn is within 2 of R(F, µ). [sent-72, score-0.141]
28 Although empirical risk minimization is not the only conceivable learning strategy, it is, in a sense, a canonical one (computational considerations aside). [sent-73, score-0.131]
29 The latter is an immediate consequence of the VC-dimension characterization of PAC learnability: Theorem 1 Suppose that the concept class H is PAC learnable from IID samples. [sent-74, score-0.143]
30 This implies that the empirical risk minimizer is a PAC learner for H. [sent-78, score-0.143]
31 First, in general there is no single value for this — n it truly is a function of the past X1 , or at least some part of it. [sent-83, score-0.108]
32 ) The other, and related, problem with this idea of predictive PAC is that it presents learning with a perpetually moving target. [sent-85, score-0.113]
33 Whether the function which minimizes the empirical risk is going to do well by this criterion involves rather arbitrary details of the process. [sent-86, score-0.105]
34 To truly pursue this approach, one would have to actually learn the predictive dependence structure of the process, a quite formidable undertaking, though perhaps not hopeless [26]. [sent-87, score-0.205]
35 Both of these complications are exacerbated when the process producing the data is actually a mixture over simpler processes, as we have seen is very common in interesting applied settings. [sent-88, score-0.157]
36 This n is because, in addition to whatever dependence may be present within each extremal process, X1 n gives (partial) information about what that process is. [sent-89, score-0.342]
37 This averaging over the posterior produces additional dependence between past and future. [sent-91, score-0.178]
38 ) As we show in §4 below, continuous mixtures of absolutely regular processes are far from being absolutely regular themselves. [sent-93, score-1.477]
39 What a learner should care about is not averaging over some hypothetical prior distribution of inaccessible alternative dynamical systems, but rather what will happen in the future of the particular realization which provided the training data, and must continue to provide the testing data. [sent-96, score-0.183]
40 To get a sense of what this is means, 3 notice that for an ergodic µ, 1 m→∞ m m n E [f (Xn+i ) | X1 ] Eµ [f ] = lim i=1 (from [28, Cor. [sent-97, score-0.349]
41 That is, matching expectations under the process measure µ means controlling the long-run average behavior, and not just the one-step-ahead expectation suggested in [3, 2]. [sent-101, score-0.101]
42 Empirical risk minimization now makes sense: it is attempting to find a model which will work well not just at the next step (which may be inherently unstable), but will continue to work well, on average, indefinitely far into the future. [sent-102, score-0.105]
43 ∞ Definition 1 Let X1 be a stochastic process with law µ, and let I be the σ-field generated by all the invariant events. [sent-104, score-0.206]
44 As is well-known, distribution-free predictive PAC learning, in this sense, is possible for IID processes (F must have finite VC dimension). [sent-106, score-0.428]
45 For an ergodic µ, [6] shows that s( , δ, F, µ) exist and is finite if and only, once again, F has a finite VC dimension; this implies predictive PAC learning, but not distribution-free predictive PAC. [sent-107, score-0.501]
46 Since ergodic processes can converge arbitrarily slowly, some extra condition must be imposed on P to ensure that dependence decays fast enough for each µ. [sent-108, score-0.657]
47 We may apply these familiar results straightforwardly, because, when µ is ergodic, all invariant sets have either measure 0 or measure 1, conditioning on I has no effect, and Eµ [f | I] = Eµ [f ]. [sent-110, score-0.092]
48 Theorem 2 Suppose that distribution-free prediction PAC learning is possible for a class of functions F and a class of processes M, with sample-complexity function s( , δ, F, P). [sent-112, score-0.28]
49 Then the class of processes P formed by taking convex mixtures from M admits distribution-free PAC learning with the same sample complexity function. [sent-113, score-0.461]
50 Then by the law of total expectation, Pρ (Γn ≥ ) = Eρ [Pρ (Γn ≥ | µ)] = Eρ [Pµ (Γn ≥ )] ≤ Eρ [δ] = δ (2) (3) (4) In words, if the same bound holds for each component of the mixture, then it still holds after averaging over the mixture. [sent-115, score-0.11]
51 The same paper also lays out the idea of sensitive dependence on initial conditions, and the kernel trick of turning nonlinear problems into linear ones by projecting into infinite-dimensional feature spaces. [sent-120, score-0.099]
52 5 4 A useful consequence of this innocent-looking result is that it lets us immediately apply PAC learning results for extremal processes to composite processes, without any penalty in the sample complexity. [sent-121, score-0.571]
53 In [3], Pestov asks whether “one [can] maintain the initial sample complexity” when going from IID to exchangeable sequences; the answer is yes, if one picks the right target. [sent-127, score-0.165]
54 This holds whenever the data-generating process is a mixture of extremal processes for which learning is possible. [sent-128, score-0.625]
55 A particularly important special case of this are the absolutely regular processes. [sent-129, score-0.545]
56 ∞ To be precise, let X−∞ be a two-sided6 stationary process. [sent-132, score-0.102]
57 The β-dependence coefficient at lag k is 0 ∞ β(k) ≡ P−∞ ⊗ Pk − P−(1:k) TV (5) ∞ 0 where P−(1:k) is the joint distribution of X−∞ and Xk , the total variation distance between the actual joint distribution of the past and future, and the product of their marginals. [sent-133, score-0.083]
58 Equivalently [31, 32] β(k) = E sup ∞ B∈σ(Xk ) 0 P B | X−∞ − P (B) (6) which, roughly, is the expected total variation distance between the marginal distribution of the future and its distribution conditional on the past. [sent-134, score-0.107]
59 As is well known, β(k) is non-increasing in k, and of course ≥ 0, so β(k) must have a limit as k → ∞; it will be convenient to abbreviate this as β(∞). [sent-135, score-0.101]
60 When β(∞) = 0, the process is said to be beta mixing, or absolutely regular. [sent-136, score-0.403]
61 The importance of absolutely regular processes for learning comes essentially from a result due to n Yu [4]. [sent-138, score-0.825]
62 Let X1 be a part of a sample path from an absolutely regular process µ, whose dependence coefficients are β(k). [sent-139, score-0.739]
63 Since in particular the event C could be taken to be {Γn > }, this approximation result allows distribution-free learning bounds for IID processes to translate directly into distribution-free learning bounds for absolutely regular processes with bounded β coefficients. [sent-144, score-1.105]
64 6 We have worked with one-sided processes so far, but the devices for moving between the two representations are standard, and this definition is more easily stated in its two-sided version. [sent-145, score-0.28]
65 5 If M contains only absolutely regular processes, then a measure π on M creates a ρ which is a mixture of absolutely regular processes, or a MAR process. [sent-146, score-1.16]
66 It is easy to see that absolute regularity of the component processes (βµ (k) → 0) does not imply absolute regularity of the mixture processes (βρ (k) → 0). [sent-147, score-0.66]
67 To see this, suppose that M consists of two processes µ0 , which puts unit probability mass on the sequence of all zeros, and µ1 , which puts unit probability on the sequence of all ones; and that π gives these two equal probability. [sent-148, score-0.344]
68 Then βµi (k) = 0 for both i, but the past and the future of ρ are not independent of each other. [sent-149, score-0.124]
69 More generally, suppose π is supported on just two absolutely regular processes, µ and µ . [sent-150, score-0.545]
70 For each such µ, there exists a set of typical sequences Tµ ⊂ X ∞ , in which the infinite sample path of µ lies almost surely7 , and these sets do not overlap8 , Tµ ∩ Tµ = ∅. [sent-151, score-0.152]
71 This implies that ρ(Tµ ) = π(µ), but 1 0 0 ρ(Tµ | X−∞ ) = 0 X−∞ ∈ Tµ otherwise Thus the change in probability of Tµ due to conditioning on the past is π(µ1 ) if the selected component was µ2 , and 1 − π(µ1 ) = π(µ2 ) if the selected component is µ1 . [sent-152, score-0.17]
72 Similar reasoning when π is supported on q extremal processes shows q πi (1 − πi ) βρ (∞) = i=1 so that the general case is βρ (∞) = [1 − π({µ})]dπ(µ) This implies that if π has no atoms, βρ (∞) = 1 always. [sent-154, score-0.493]
73 Since βρ (k) is non-increasing, βρ (k) = 1 for all k, for a continuous mixture of absolutely regular processes. [sent-155, score-0.615]
74 Conceptually, this arises because of the use of infinite-dimensional distributions for both past and future in the definition of the βdependence coefficient. [sent-156, score-0.124]
75 Having seen an infinite past is sufficient, for an ergodic process, to identify the process, and of course the future must be a continuation of this past. [sent-157, score-0.492]
76 MARs thus display a rather odd separation between the properties of individual sample paths, which approach independence asymptotically in time, and the ensemble-level behavior, where there is ineradicable dependence, and indeed maximal dependence for continuous mixtures. [sent-158, score-0.17]
77 Any one realization of a MAR, no matter how long, is indistinguishable from a realization of an absolutely regular process which is a component of the mixture. [sent-159, score-0.748]
78 The distinction between a MAR and a single absolutely regular process only becomes apparent at the level of ensembles of paths. [sent-160, score-0.607]
79 Bernoulli shifts are stationary and ergodic, but not absolutely regular9 . [sent-164, score-0.443]
80 Roughly speaking, given the infinite past of a MAR, events in the future become asymptotically independent as the separation between them increases10 . [sent-166, score-0.25]
81 A more precise statement needs to control ˜ the approach to independence of the component processes in a MAR. [sent-167, score-0.31]
82 Then if we con0 dition on finite histories X−n and let n grow, widely separated future events become asymptotically independent. [sent-169, score-0.138]
83 n−1 ∞ −1 t For each B ∈ B, the set Tµ,B = x ∈ X : limn→∞ n t=0 1B (τ x) = µ(B) exists, is measurable, is dynamically invariant, and, by Bikrhoff’s ergodic theorem, µ(Tµ,B ) = 1 [18, §7. [sent-171, score-0.275]
84 9 They are, however, mixing in the sense of ergodic theory [28]. [sent-176, score-0.377]
85 10 0 ρ-almost-surely, X−∞ belongs to the typical set of a unique absolutely regular process, say µ, and so the 0 l ∞ 0 l ∞ posterior concentrates on that µ, π(· | X−∞ ) = δµ . [sent-177, score-0.545]
86 Hence ρ(X1 , Xl+k | X−∞ ) = µ(X−∞ , Xl+k ), which l ∞ tends to µ(X−∞ )µ(Xl+k ) as k → ∞ because µ is absolutely regular. [sent-178, score-0.341]
87 7, expressing ρ in terms of π and the extremal processes: lim lim E sup k→∞ n→∞ l sup ∞ B∈σ(Xk+l ) l 0 0 0 µ(B | X1 , X−n ) − µ(B | X−n ) dπ(µ | X−n ) 0 0 l 0 l As n → ∞, µ(B | X−n ) → µ(B | X−∞ ), and µ(B | X1 , X−n ) → µ(B | X−∞ ). [sent-182, score-0.431]
88 If it is not a uniform MAR, then no matter what function β(k) tending to zero we propose, ˜ the set of µ with βµ ≥ β must have positive π measure, i. [sent-186, score-0.093]
89 , a positive-measure set of processes must converge arbitrarily slowly. [sent-188, score-0.315]
90 If ρ is not a MAR at all, we know from the ergodic decomposition of stationary processes that it must be a mixture of ergodic processes, and so it must give positive π weight to processes which are not absolutely regular at all, i. [sent-191, score-1.897]
91 The witnessing events B for these processes a fortiori drive the limit in Eq. [sent-194, score-0.38]
92 5 Discussion and future work We have shown that with the right conditioning, a natural and useful notion of predictive PAC emerges. [sent-196, score-0.154]
93 This notion is natural in the sense that for a learner sampling from a mixture of ergodic processes, the only thing that matters is the future behavior of the component he is “stuck” in — and certainly not the process average over all the components. [sent-197, score-0.547]
94 This insight enables us to adapt the recent PAC bounds for mixing processes to mixtures of such processes. [sent-198, score-0.458]
95 An interesting question then is to characterize those processes that are convex mixtures of a given kind of ergodic process (de Finetti’s theorem was the first such characterization). [sent-199, score-0.724]
96 In this paper, we have addressed this question for mixtures of uniformly absolutely regular processes. [sent-200, score-0.652]
97 Another fascinating question is how to extend predictive PAC to real-valued functions [33, 34]. [sent-201, score-0.113]
98 Predictive PAC learnability: A paradigm for learning from exchangeable input data. [sent-209, score-0.13]
99 Rates of convergence for empirical processes of stationary mixing sequences. [sent-216, score-0.481]
100 Rates of uniform convergence of empirical means with mixing processes. [sent-259, score-0.128]
wordName wordTfidf (topN-words)
[('pac', 0.394), ('absolutely', 0.341), ('url', 0.284), ('processes', 0.28), ('ergodic', 0.275), ('extremal', 0.213), ('regular', 0.204), ('mar', 0.194), ('doi', 0.149), ('iid', 0.145), ('exchangeable', 0.13), ('learnable', 0.116), ('predictive', 0.113), ('mixtures', 0.107), ('stationary', 0.102), ('http', 0.099), ('sequences', 0.087), ('past', 0.083), ('risk', 0.077), ('finetti', 0.077), ('mixing', 0.071), ('mixture', 0.07), ('dependence', 0.067), ('sup', 0.066), ('invariant', 0.065), ('process', 0.062), ('roof', 0.058), ('events', 0.058), ('countable', 0.055), ('law', 0.052), ('laws', 0.048), ('symmetries', 0.048), ('bert', 0.048), ('berti', 0.048), ('mathukumalli', 0.048), ('pestov', 0.048), ('learnability', 0.045), ('measurable', 0.044), ('lim', 0.043), ('composite', 0.043), ('afshin', 0.042), ('witnessing', 0.042), ('ben', 0.041), ('future', 0.041), ('generalization', 0.041), ('realization', 0.041), ('xl', 0.039), ('vc', 0.039), ('expectations', 0.039), ('berlin', 0.039), ('admits', 0.039), ('asymptotically', 0.039), ('abbreviate', 0.039), ('andrzej', 0.039), ('lise', 0.039), ('learner', 0.038), ('risks', 0.036), ('mehryar', 0.036), ('xk', 0.036), ('must', 0.035), ('sample', 0.035), ('goals', 0.033), ('neumann', 0.033), ('mohri', 0.033), ('ones', 0.032), ('sort', 0.032), ('puts', 0.032), ('came', 0.032), ('dependent', 0.032), ('sense', 0.031), ('continuation', 0.031), ('subtle', 0.031), ('schuurmans', 0.031), ('path', 0.03), ('component', 0.03), ('nite', 0.029), ('matter', 0.029), ('uniform', 0.029), ('xt', 0.029), ('separation', 0.029), ('david', 0.029), ('decompositions', 0.029), ('empirical', 0.028), ('attempting', 0.028), ('edition', 0.028), ('coef', 0.028), ('averaging', 0.028), ('olivier', 0.028), ('conditioning', 0.027), ('course', 0.027), ('xn', 0.027), ('stochastic', 0.027), ('characterization', 0.027), ('annals', 0.026), ('considerations', 0.026), ('truly', 0.025), ('bernoulli', 0.025), ('simpler', 0.025), ('corollary', 0.025), ('measures', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 252 nips-2013-Predictive PAC Learning and Process Decompositions
Author: Cosma Shalizi, Aryeh Kontorovitch
Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest. 1
2 0.11314946 277 nips-2013-Restricting exchangeable nonparametric distributions
Author: Sinead A. Williamson, Steve N. MacEachern, Eric Xing
Abstract: Distributions over matrices with exchangeable rows and infinitely many columns are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly-suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well-modeled by the original distribution. 1
3 0.085806221 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
Author: Martin Azizyan, Aarti Singh, Larry Wasserman
Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1
4 0.07944949 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
5 0.07942263 171 nips-2013-Learning with Noisy Labels
Author: Nagarajan Natarajan, Inderjit Dhillon, Pradeep Ravikumar, Ambuj Tewari
Abstract: In this paper, we theoretically study the problem of binary classification in the presence of random classification noise — the learner, instead of seeing the true labels, sees labels that have independently been flipped with some small probability. Moreover, random label noise is class-conditional — the flip probability depends on the class. We provide two approaches to suitably modify any given surrogate loss function. First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical risk minimization in the presence of iid data with noisy labels. If the loss function satisfies a simple symmetry condition, we show that the method leads to an efficient algorithm for empirical minimization. Second, by leveraging a reduction of risk minimization under noisy labels to classification with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong empirical risk bounds. This approach has a very remarkable consequence — methods used in practice such as biased SVM and weighted logistic regression are provably noise-tolerant. On a synthetic non-separable dataset, our methods achieve over 88% accuracy even when 40% of the labels are corrupted, and are competitive with respect to recently proposed methods for dealing with label noise in several benchmark datasets.
6 0.072295628 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
7 0.070127182 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
8 0.063289136 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
9 0.062090579 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing
10 0.060947239 109 nips-2013-Estimating LASSO Risk and Noise Level
11 0.060305052 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
12 0.058947958 40 nips-2013-Approximate Inference in Continuous Determinantal Processes
13 0.057913125 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
14 0.057592399 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
15 0.055811927 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
16 0.055530556 66 nips-2013-Computing the Stationary Distribution Locally
17 0.055340435 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
18 0.055064406 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
19 0.054505795 344 nips-2013-Using multiple samples to learn mixture models
20 0.050358132 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
topicId topicWeight
[(0, 0.157), (1, 0.025), (2, 0.031), (3, -0.005), (4, 0.005), (5, 0.069), (6, 0.061), (7, 0.017), (8, 0.007), (9, -0.009), (10, 0.019), (11, -0.022), (12, -0.032), (13, 0.008), (14, 0.028), (15, 0.004), (16, 0.021), (17, -0.007), (18, 0.012), (19, 0.036), (20, -0.037), (21, 0.046), (22, 0.014), (23, -0.056), (24, -0.064), (25, 0.058), (26, -0.035), (27, 0.001), (28, -0.06), (29, -0.042), (30, -0.049), (31, 0.033), (32, -0.034), (33, 0.036), (34, -0.098), (35, 0.023), (36, -0.02), (37, 0.09), (38, 0.071), (39, 0.06), (40, -0.06), (41, -0.04), (42, 0.034), (43, -0.03), (44, 0.033), (45, 0.076), (46, 0.009), (47, -0.096), (48, -0.022), (49, -0.092)]
simIndex simValue paperId paperTitle
same-paper 1 0.95357901 252 nips-2013-Predictive PAC Learning and Process Decompositions
Author: Cosma Shalizi, Aryeh Kontorovitch
Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest. 1
2 0.67126942 199 nips-2013-More data speeds up training time in learning halfspaces over sparse vectors
Author: Amit Daniely, Nati Linial, Shai Shalev-Shwartz
Abstract: The increased availability of data in recent years has led several authors to ask whether it is possible to use data as a computational resource. That is, if more data is available, beyond the sample complexity limit, is it possible to use the extra examples to speed up the computation time required to perform the learning task? We give the first positive answer to this question for a natural supervised learning problem — we consider agnostic PAC learning of halfspaces over 3-sparse vectors in {−1, 1, 0}n . This class is inefficiently learnable using O n/ 2 examples. Our main contribution is a novel, non-cryptographic, methodology for establishing computational-statistical gaps, which allows us to show that, under a widely believed assumption that refuting random 3CNF formulas is hard, it is impossible to efficiently learn this class using only O n/ 2 examples. We further show that under stronger hardness assumptions, even O n1.499 / 2 examples do not suffice. On the other hand, we show a new algorithm that learns this class efficiently ˜ using Ω n2 / 2 examples. This formally establishes the tradeoff between sample and computational complexity for a natural supervised learning problem. 1
3 0.67057443 277 nips-2013-Restricting exchangeable nonparametric distributions
Author: Sinead A. Williamson, Steve N. MacEachern, Eric Xing
Abstract: Distributions over matrices with exchangeable rows and infinitely many columns are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly-suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well-modeled by the original distribution. 1
4 0.60974318 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations
Author: Isik B. Fidaner, Taylan Cemgil
Abstract: Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.
5 0.60861689 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
Author: Anirban Roychowdhury, Ke Jiang, Brian Kulis
Abstract: Small-variance asymptotics provide an emerging technique for obtaining scalable combinatorial algorithms from rich probabilistic models. We present a smallvariance asymptotic analysis of the Hidden Markov Model and its infinite-state Bayesian nonparametric extension. Starting with the standard HMM, we first derive a “hard” inference algorithm analogous to k-means that arises when particular variances in the model tend to zero. This analysis is then extended to the Bayesian nonparametric case, yielding a simple, scalable, and flexible algorithm for discrete-state sequence data with a non-fixed number of states. We also derive the corresponding combinatorial objective functions arising from our analysis, which involve a k-means-like term along with penalties based on state transitions and the number of states. A key property of such algorithms is that— particularly in the nonparametric setting—standard probabilistic inference algorithms lack scalability and are heavily dependent on good initialization. A number of results on synthetic and real data sets demonstrate the advantages of the proposed framework. 1
6 0.58666998 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
7 0.57471466 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
8 0.56346619 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
9 0.55834633 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
10 0.55627269 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties
11 0.53383309 52 nips-2013-Bayesian inference as iterated random functions with applications to sequential inference in graphical models
12 0.53296983 344 nips-2013-Using multiple samples to learn mixture models
13 0.52340221 324 nips-2013-The Fast Convergence of Incremental PCA
14 0.52276367 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
15 0.515338 269 nips-2013-Regression-tree Tuning in a Streaming Setting
16 0.51203406 181 nips-2013-Machine Teaching for Bayesian Learners in the Exponential Family
17 0.50523543 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
18 0.48542526 134 nips-2013-Graphical Models for Inference with Missing Data
19 0.47871801 327 nips-2013-The Randomized Dependence Coefficient
20 0.47627318 332 nips-2013-Tracking Time-varying Graphical Structure
topicId topicWeight
[(2, 0.046), (16, 0.088), (33, 0.102), (34, 0.081), (41, 0.038), (49, 0.031), (56, 0.154), (70, 0.04), (85, 0.034), (89, 0.048), (93, 0.049), (95, 0.026), (97, 0.191)]
simIndex simValue paperId paperTitle
same-paper 1 0.84868824 252 nips-2013-Predictive PAC Learning and Process Decompositions
Author: Cosma Shalizi, Aryeh Kontorovitch
Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest. 1
2 0.81768084 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar
Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n2 p). Our algorithm Subsampled Randomized Hadamard Transform- Dual Ridge Regression (SRHT-DRR) runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets. 1
3 0.79928744 186 nips-2013-Matrix factorization with binary components
Author: Martin Slawski, Matthias Hein, Pavlo Lutsik
Abstract: Motivated by an application in computational biology, we consider low-rank matrix factorization with {0, 1}-constraints on one of the factors and optionally convex constraints on the second one. In addition to the non-convexity shared with other matrix factorization schemes, our problem is further complicated by a combinatorial constraint set of size 2m·r , where m is the dimension of the data points and r the rank of the factorization. Despite apparent intractability, we provide − in the line of recent work on non-negative matrix factorization by Arora et al. (2012)− an algorithm that provably recovers the underlying factorization in the exact case with O(mr2r + mnr + r2 n) operations for n datapoints. To obtain this result, we use theory around the Littlewood-Offord lemma from combinatorics.
4 0.7337243 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
5 0.73104841 355 nips-2013-Which Space Partitioning Tree to Use for Search?
Author: Parikshit Ram, Alexander Gray
Abstract: We consider the task of nearest-neighbor search with the class of binary-spacepartitioning trees, which includes kd-trees, principal axis trees and random projection trees, and try to rigorously answer the question “which tree to use for nearestneighbor search?” To this end, we present the theoretical results which imply that trees with better vector quantization performance have better search performance guarantees. We also explore another factor affecting the search performance – margins of the partitions in these trees. We demonstrate, both theoretically and empirically, that large margin partitions can improve tree search performance. 1 Nearest-neighbor search Nearest-neighbor search is ubiquitous in computer science. Several techniques exist for nearestneighbor search, but most algorithms can be categorized into two following groups based on the indexing scheme used – (1) search with hierarchical tree indices, or (2) search with hash-based indices. Although multidimensional binary space-partitioning trees (or BSP-trees), such as kd-trees [1], are widely used for nearest-neighbor search, it is believed that their performances degrade with increasing dimensions. Standard worst-case analyses of search with BSP-trees in high dimensions usually lead to trivial guarantees (such as, an Ω(n) search time guarantee for a single nearest-neighbor query in a set of n points). This is generally attributed to the “curse of dimensionality” – in the worst case, the high dimensionality can force the search algorithm to visit every node in the BSP-tree. However, these BSP-trees are very simple and intuitive, and still used in practice with success. The occasional favorable performances of BSP-trees in high dimensions are attributed to the low “intrinsic” dimensionality of real data. However, no clear relationship between the BSP-tree search performance and the intrinsic data properties is known. We present theoretical results which link the search performance of BSP-trees to properties of the data and the tree. This allows us to identify implicit factors influencing BSP-tree search performance — knowing these driving factors allows us to develop successful heuristics for BSP-trees with improved search performance. Each node in a BSP-tree represents a region of the space and each non-leaf node has a left and right child representing a disjoint partition of this region with some separating hyperplane and threshold (w, b). A search query on this tree is usually answered with a depth-first branch-and-bound algorithm. Algorithm 1 presents a simplified version where a search query is answered with a small set of neighbor candidates of any desired size by performing a greedy depth-first tree traversal to a specified depth. This is known as defeatist tree search. We are not aware of any data-dependent analysis of the quality of the results from defeatist BSP-tree search. However, Verma et al. (2009) [2] presented adaptive data-dependent analyses of some BSP-trees for the task of vector quantization. These results show precise connections between the quantization performance of the BSP-trees and certain properties of the data (we will present these data properties in Section 2). 1 Algorithm 1 BSP-tree search Input: BSP-tree T on set S, Query q, Desired depth l Output: Candidate neighbor p current tree depth lc ← 0 current tree node Tc ← T while lc < l do if Tc .w, q + Tc .b ≤ 0 then Tc ← Tc .left child else Tc ← Tc .right child end if Increment depth lc ← lc + 1 end while p ← arg minr∈Tc ∩S q − r . (a) kd-tree (b) RP-tree (c) MM-tree Figure 1: Binary space-partitioning trees. We establish search performance guarantees for BSP-trees by linking their nearest-neighbor performance to their vector quantization performance and utilizing the recent guarantees on the BSP-tree vector quantization. Our results provide theoretical evidence, for the first time, that better quantization performance implies better search performance1 . These results also motivate the use of large margin BSP-trees, trees that hierarchically partition the data with a large (geometric) margin, for better nearest-neighbor search performance. After discussing some existing literature on nearestneighbor search and vector quantization in Section 2, we discuss our following contributions: • We present performance guarantees for Algorithm 1 in Section 3, linking search performance to vector quantization performance. Specifically, we show that for any balanced BSP-tree and a depth l, under some conditions, the worst-case search error incurred by the neighbor candidate returned by Algorithm 1 is proportional to a factor which is 2l/2 exp(−l/2β) , (n/2l )1/O(d) − 2 where β corresponds to the quantization performance of the tree (smaller β implies smaller quantization error) and d is closely related to the doubling dimension of the dataset (as opposed to the ambient dimension D of the dataset). This implies that better quantization produces better worst-case search results. Moreover, this result implies that smaller l produces improved worstcase performance (smaller l does imply more computation, hence it is intuitive to expect less error at the cost of computation). Finally, there is also the expected dependence on the intrinsic dimensionality d – increasing d implies deteriorating worst-case performance. The theoretical results are empirically verified in this section as well. • In Section 3, we also show that the worst-case search error for Algorithm 1 with a BSP-tree T is proportional to (1/γ) where γ is the smallest margin size of all the partitions in T . • We present the quantization performance guarantee of a large margin BSP tree in Section 4. O These results indicate that for a given dataset, the best BSP-tree for search is the one with the best combination of low quantization error and large partition margins. We conclude with this insight and related unanswered questions in Section 5. 2 Search and vector quantization Binary space-partitioning trees (or BSP-trees) are hierarchical data structures providing a multiresolution view of the dataset indexed. There are several space-partitioning heuristics for a BSPtree construction. A tree is constructed by recursively applying a heuristic partition. The most popular kd-tree uses axis-aligned partitions (Figure 1(a)), often employing a median split along the coordinate axis of the data in the tree node with the largest spread. The principal-axis tree (PA-tree) partitions the space at each node at the median along the principal eigenvector of the covariance matrix of the data in that node [3, 4]. Another heuristic partitions the space based on a 2-means clustering of the data in the node to form the two-means tree (2M-tree) [5, 6]. The random-projection tree (RP-tree) partitions the space by projecting the data along a random standard normal direction and choosing an appropriate splitting threshold [7] (Figure 1(b)). The max-margin tree (MM-tree) is built by recursively employing large margin partitions of the data [8] (Figure 1(c)). The unsupervised large margin splits are usually performed using max-margin clustering techniques [9]. Search. Nearest-neighbor search with a BSP-tree usually involves a depth-first branch-and-bound algorithm which guarantees the search approximation (exact search is a special case of approximate search with zero approximation) by a depth-first traversal of the tree followed by a backtrack up the tree as required. This makes the tree traversal unpredictable leading to trivial worst-case runtime 1 This intuitive connection is widely believed but never rigorously established to the best of our knowledge. 2 guarantees. On the other hand, locality-sensitive hashing [10] based methods approach search in a different way. After indexing the dataset into hash tables, a query is answered by selecting candidate points from these hash tables. The candidate set size implies the worst-case search time bound. The hash table construction guarantees the set size and search approximation. Algorithm 1 uses a BSPtree to select a candidate set for a query with defeatist tree search. For a balanced tree on n points, the candidate set size at depth l is n/2l and the search runtime is O(l + n/2l ), with l ≤ log2 n. For any choice of the depth l, we present the first approximation guarantee for this search process. Defeatist BSP-tree search has been explored with the spill tree [11], a binary tree with overlapping sibling nodes unlike the disjoint nodes in the usual BSP-tree. The search involves selecting the candidates in (all) the leaf node(s) which contain the query. The level of overlap guarantees the search approximation, but this search method lacks any rigorous runtime guarantee; it is hard to bound the number of leaf nodes that might contain any given query. Dasgupta & Sinha (2013) [12] show that the probability of finding the exact nearest neighbor with defeatist search on certain randomized partition trees (randomized spill trees and RP-trees being among them) is directly proportional to the relative contrast of the search task [13], a recently proposed quantity which characterizes the difficulty of a search problem (lower relative contrast makes exact search harder). Vector Quantization. Recent work by Verma et al., 2009 [2] has established theoretical guarantees for some of these BSP-trees for the task of vector quantization. Given a set of points S ⊂ RD of n points, the task of vector quantization is to generate a set of points M ⊂ RD of size k n with low average quantization error. The optimal quantizer for any region A is given by the mean µ(A) of the data points lying in that region. The quantization error of the region A is then given by VS (A) = 1 |A ∩ S| x − µ(A) 2 2 , (1) x∈A∩S and the average quantization error of a disjoint partition of region A into Al and Ar is given by: VS ({Al , Ar }) = (|Al ∩ S|VS (Al ) + |Ar ∩ S|VS (Ar )) /|A ∩ S|. (2) Tree-based structured vector quantization is used for efficient vector quantization – a BSP-tree of depth log2 k partitions the space containing S into k disjoint regions to produce a k-quantization of S. The theoretical results for tree-based vector quantization guarantee the improvement in average quantization error obtained by partitioning any single region (with a single quantizer) into two disjoints regions (with two quantizers) in the following form (introduced by Freund et al. (2007) [14]): Definition 2.1. For a set S ⊂ RD , a region A partitioned into two disjoint regions {Al , Ar }, and a data-dependent quantity β > 1, the quantization error improvement is characterized by: VS ({Al , Ar }) < (1 − 1/β) VS (A). (3) Tree PA-tree RP-tree kd-tree 2M-tree MM-tree∗ Definition of β . D O( 2 ) : = i=1 λi /λ1 O(dc ) × optimal (smallest possible) . D 2 O(ρ) : ρ = i=1 λi /γ The quantization performance depends inversely on the data-dependent quantity β – lower β implies bet- Table 1: β for various trees. λ1 , . . . , λD are ter quantization. We present the definition of β for the sorted eigenvalues of the covariance matrix different BSP-trees in Table 1. For the PA-tree, β of A ∩ S in descending order, and dc < D is depends on the ratio of the sum of the eigenval- the covariance dimension of A ∩ S. The results ues of the covariance matrix of data (A ∩ S) to the for PA-tree and 2M-tree are due to Verma et al. principal eigenvalue. The improvement rate β for (2009) [2]. The PA-tree result can be improved to the RP-tree depends on the covariance dimension O( ) from O( 2 ) with an additional assumption of the data in the node A (β = O(dc )) [7], which [2]. The RP-tree result is in Freund et al. (2007) roughly corresponds to the lowest dimensionality of [14], which also has the precise definition of dc . an affine plane that captures most of the data covari- We establish the result for MM-tree in Section 4. ance. The 2M-tree does not have an explicit β but γ is the margin size of the large margin partition. it has the optimal theoretical improvement rate for a No such guarantee for kd-trees is known to us. single partition because the 2-means clustering objective is equal to |Al |V(Al ) + |Ar |V(Ar ) and minimizing this objective maximizes β. The 2means problem is NP-hard and an approximate solution is used in practice. These theoretical results are valid under the condition that there are no outliers in A ∩ S. This is characterized as 2 maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η > 0. This notion of the absence of outliers was first introduced for the theoretical analysis of the RP-trees [7]. Verma et al. (2009) [2] describe outliers as “points that are much farther away from the mean than the typical distance-from-mean”. In this situation, an alternate type of partition is used to remove these outliers that are farther away 3 from the mean than expected. For η ≥ 8, this alternate partitioning is guaranteed to reduce the data diameter (maxx,y∈A∩S x − y ) of the resulting nodes by a constant fraction [7, Lemma 12], and can be used until a region contain no outliers, at which point, the usual hyperplane partition can be used with their respective theoretical quantization guarantees. The implicit assumption is that the alternate partitioning scheme is employed rarely. These results for BSP-tree quantization performance indicate that different heuristics are adaptive to different properties of the data. However, no existing theoretical result relates this performance of BSP-trees to their search performance. Making the precise connection between the quantization performance and the search performance of these BSP-trees is a contribution of this paper. 3 Approximation guarantees for BSP-tree search In this section, we formally present the data and tree dependent performance guarantees on the search with BSP-trees using Algorithm 1. The quality of nearest-neighbor search can be quantized in two ways – (i) distance error and (ii) rank of the candidate neighbor. We present guarantees for both notions of search error2 . For a query q and a set of points S and a neighbor candidate p ∈ S, q−p distance error (q) = minr∈S q−r − 1, and rank τ (q) = |{r ∈ S : q − r < q − p }| + 1. Algorithm 1 requires the query traversal depth l as an input. The search runtime is O(l + (n/2l )). The depth can be chosen based on the desired runtime. Equivalently, the depth can be chosen based on the desired number of candidates m; for a balanced binary tree on a dataset S of n points with leaf nodes containing a single point, the appropriate depth l = log2 n − log2 m . We will be building on the existing results on vector quantization error [2] to present the worst case error guarantee for Algorithm 1. We need the following definitions to precisely state our results: Definition 3.1. An ω-balanced split partitioning a region A into disjoint regions {A1 , A2 } implies ||A1 ∩ S| − |A2 ∩ S|| ≤ ω|A ∩ S|. For a balanced tree corresponding to recursive median splits, such as the PA-tree and the kd-tree, ω ≈ 0. Non-zero values of ω 1, corresponding to approximately balanced trees, allow us to potentially adapt better to some structure in the data at the cost of slightly losing the tree balance. For the MM-tree (discussed in detail in Section 4), ω-balanced splits are enforced for any specified value of ω. Approximately balanced trees have a depth bound of O(log n) [8, Theorem 3.1]. For l a tree with ω-balanced splits, the worst case runtime of Algorithm 1 is O l + 1+ω n . For the 2 2M-tree, ω-balanced splits are not enforced. Hence the actual value of ω could be high for a 2M-tree. Definition 3.2. Let B 2 (p, ∆) = {r ∈ S : p − r < ∆} denote the points in S contained in a ball of radius ∆ around some p ∈ S with respect to the 2 metric. The expansion constant of (S, 2 ) is defined as the smallest c ≥ 2 such B 2 (p, 2∆) ≤ c B 2 (p, ∆) ∀p ∈ S and ∀∆ > 0. Bounded expansion constants correspond to growth-restricted metrics [15]. The expansion constant characterizes the data distribution, and c ∼ 2O(d) where d is the doubling dimension of the set S with respect to the 2 metric. The relationship is exact for points on a D-dimensional grid (i.e., c = Θ(2D )). Equipped with these definitions, we have the following guarantee for Algorithm 1: 2 1 Theorem 3.1. Consider a dataset S ⊂ RD of n points with ψ = 2n2 x,y∈S x − y , the BSP tree T built on S and a query q ∈ RD with the following conditions : (C1) (C2) (C3) (C4) Let (A ∩ (S ∪ {q}), 2 ) have an expansion constant at most c for any convex set A ⊂ RD . ˜ Let T be complete till a depth L < log2 n /(1 − log2 (1 − ω)) with ω-balanced splits. c ˜ Let β ∗ correspond to the worst quantization error improvement rate over all splits in T . 2 For any node A in the tree T , let maxx,y∈A∩S x − y ≤ ηVS (A) for a fixed η ≥ 8. For α = 1/(1 − ω), the upper bound du on the distance of q to the neighbor candidate p returned by Algorithm 1 with depth l ≤ L is given by √ 2 ηψ · (2α)l/2 · exp(−l/2β ∗ ) q − p ≤ du = . (4) 1/ log2 c ˜ (n/(2α)l ) −2 2 The distance error corresponds to the relative error in terms of the actual distance values. The rank is one more than the number of points in S which are better neighbor candidates than p. The nearest-neighbor of q has rank 1 and distance error 0. The appropriate notion of error depends on the search application. 4 Now η is fixed, and ψ is fixed for a dataset S. Then, for a fixed ω, this result implies that between two types of BSP-trees on the same set and the same query, Algorithm 1 has a better worst-case guarantee on the candidate-neighbor distance for the tree with better quantization performance (smaller β ∗ ). Moreover, for a particular tree with β ∗ ≥ log2 e, du is non-decreasing in l. This is expected because as we traverse down the tree, we can never reduce the candidate neighbor distance. At the root level (l = 0), the candidate neighbor is the nearest-neighbor. As we descend down the tree, the candidate neighbor distance will worsen if a tree split separates the query from its closer neighbors. This behavior is implied in Equation (4). For a chosen depth l in Algorithm 1, the candidate 1/ log2 c ˜ , implying deteriorating bounds du neighbor distance is inversely proportional to n/(2α)l with increasing c. Since log2 c ∼ O(d), larger intrinsic dimensionality implies worse guarantees as ˜ ˜ expected from the curse of dimensionality. To prove Theorem 3.1, we use the following result: Lemma 3.1. Under the conditions of Theorem 3.1, for any node A at a depth l in the BSP-tree T l on S, VS (A) ≤ ψ (2/(1 − ω)) exp(−l/β ∗ ). This result is obtained by recursively applying the quantization error improvement in Definition 2.1 over l levels of the tree (the proof is in Appendix A). Proof of Theorem 3.1. Consider the node A at depth l in the tree containing q, and let m = |A ∩ S|. Let D = maxx,y∈A∩S x − y , let d = minx∈A∩S q − x , and let B 2 (q, ∆) = {x ∈ A ∩ (S ∪ {q}) : q − x < ∆}. Then, by the Definition 3.2 and condition C1, D+d D+d D+2d B (q, D + d) ≤ clog2 d |B (q, d)| = clog2 d ≤ clog2 ( d ) , ˜ ˜ ˜ 2 2 where the equality follows from the fact that B 2 (q, d) = {q}. Now B 2 (q, D + d) ≥ m. Using ˜ ˜ this above gives us m1/ log2 c ≤ (D/d) + 2. By condition C2, m1/ log2 c > 2. Hence we have 1/ log2 c ˜ d ≤ D/(m − 2). By construction and condition C4, D ≤ ηVS (A). Now m ≥ n/(2α)l . Plugging this above and utilizing Lemma 3.1 gives us the statement of Theorem 3.1. Nearest-neighbor search error guarantees. Equipped with the bound on the candidate-neighbor distance, we bound the worst-case nearest-neighbor search errors as follows: Corollary 3.1. Under the conditions of Theorem 3.1, for any query q at a desired depth l ≤ L in Algorithm 1, the distance error (q) is bounded as (q) ≤ (du /d∗ ) − 1, and the rank τ (q) is q u ∗ bounded as τ (q) ≤ c log2 (d /dq ) , where d∗ = minr∈S q − r . ˜ q Proof. The distance error bound follows from the definition of distance error. Let R = {r ∈ S : q − r < du }. By definition, τ (q) ≤ |R| + 1. Let B 2 (q, ∆) = {x ∈ (S ∪ {q}) : q − x < ∆}. Since B 2 (q, du ) contains q and R, and q ∈ S, |B 2 (q, du )| = |R| + 1 ≥ τ (q). From Definition / 3.2 and Condition C1, |B 2 (q, du )| ≤ c log2 (d ˜ |{q}| = 1 gives us the upper bound on τ (q). u /d∗ ) q |B 2 (q, d∗ )|. Using the fact that |B 2 (q, d∗ )| = q q The upper bounds on both forms of search error are directly proportional to du . Hence, the BSPtree with better quantization performance has better search performance guarantees, and increasing traversal depth l implies less computation but worse performance guarantees. Any dependence of this approximation guarantee on the ambient data dimensionality is subsumed by the dependence on β ∗ and c. While our result bounds the worst-case performance of Algorithm 1, an average case ˜ performance guarantee on the distance error is given by Eq (q) ≤ du Eq 1/d∗ −1, and on the rank q u − log d∗ is given by E τ (q) ≤ c log2 d ˜ E c ( 2 q ) , since the expectation is over the queries q and du q q does not depend on q. For the purposes of relative comparison among BSP-trees, the bounds on the expected error depend solely on du since the term within the expectation over q is tree independent. Dependence of the nearest-neighbor search error on the partition margins. The search error bounds in Corollary 3.1 depend on the true nearest-neighbor distance d∗ of any query q of which we q have no prior knowledge. However, if we partition the data with a large margin split, then we can say that either the candidate neighbor is the true nearest-neighbor of q or that d∗ is greater than the q size of the margin. We characterize the influence of the margin size with the following result: Corollary 3.2. Consider the conditions of Theorem 3.1 and a query q at a depth l ≤ L in Algorithm 1. Further assume that γ is the smallest margin size on both sides of any partition in the tree T .uThen the distance error is bounded as (q) ≤ du /γ − 1, and the rank is bounded as τ (q) ≤ c log2 (d /γ) . ˜ This result indicates that if the split margins in a BSP-tree can be increased without adversely affecting its quantization performance, the BSP-tree will have improved nearest-neighbor error guarantees 5 for the Algorithm 1. This motivated us to consider the max-margin tree [8], a BSP-tree that explicitly maximizes the margin of the split for every split in the tree. Explanation of the conditions in Theorem 3.1. Condition C1 implies that for any convex set A ⊂ RD , ((A ∩ (S ∪ {q})), 2 ) has an expansion constant at most c. A bounded c implies that no ˜ ˜ subset of (S ∪ {q}), contained in a convex set, has a very high expansion constant. This condition implies that ((S ∪ {q}), 2 ) also has an expansion constant at most c (since (S ∪ {q}) is contained in ˜ its convex hull). However, if (S ∪ {q}, 2 ) has an expansion constant c, this does not imply that the data lying within any convex set has an expansion constant at most c. Hence a bounded expansion constant assumption for (A∩(S ∪{q}), 2 ) for every convex set A ⊂ RD is stronger than a bounded expansion constant assumption for (S ∪ {q}, 2 )3 . Condition C2 ensures that the tree is complete so that for every query q and a depth l ≤ L, there exists a large enough tree node which contains q. Condition C3 gives us the worst quantization error improvement rate over all the splits in the tree. 2 Condition C4 implies that the squared data diameter of any node A (maxx,y∈A∩S x − y ) is within a constant factor of its quantization error VS (A). This refers to the assumption that the node A contains no outliers as described in Section 3 and only hyperplane partitions are used and their respective quantization improvement guarantees presented in Section 2 (Table 1) hold. By placing condition C4, we ignore the alternate partitioning scheme used to remove outliers for simplicity of analysis. If we allow a small fraction of the partitions in the tree to be this alternate split, a similar result can be obtained since the alternate split is the same for all BSP-tree. For two different kinds of hyperplane splits, if alternate split is invoked the same number of times in the tree, the difference in their worst-case guarantees for both the trees would again be governed by their worstcase quantization performance (β ∗ ). However, for any fixed η, a harder question is whether one type of hyperplane partition violates the inlier condition more often than another type of partition, resulting in more alternate partitions. And we do not yet have a theoretical answer for this4 . Empirical validation. We examine our theoretical results with 4 datasets – O PTDIGITS (D = 64, n = 3823, 1797 queries), T INY I MAGES (D = 384, n = 5000, 1000 queries), MNIST (D = 784, n = 6000, 1000 queries), I MAGES (D = 4096, n = 500, 150 queries). We consider the following BSP-trees: kd-tree, random-projection (RP) tree, principal axis (PA) tree, two-means (2M) tree and max-margin (MM) tree. We only use hyperplane partitions for the tree construction. This is because, firstly, the check for the presence of outliers (∆2 (A) > ηVS (A)) can be computationally S expensive for large n, and, secondly, the alternate partition is mostly for the purposes of obtaining theoretical guarantees. The implementation details for the different tree constructions are presented in Appendix C. The performance of these BSP-trees are presented in Figure 2. Trees with missing data points for higher depth levels (for example, kd-tree in Figure 2(a) and 2M-tree in Figures 2 (b) & (c)) imply that we were unable to grow complete BSP-trees beyond that depth. The quantization performance of the 2M-tree, PA-tree and MM-tree are significantly better than the performance of the kd-tree and RP-tree and, as suggested by Corollary 3.1, this is also reflected in their search performance. The MM-tree has comparable quantization performance to the 2M-tree and PA-tree. However, in the case of search, the MM-tree outperforms PA-tree in all datasets. This can be attributed to the large margin partitions in the MM-tree. The comparison to 2M-tree is not as apparent. The MM-tree and PA-tree have ω-balanced splits for small ω enforced algorithmically, resulting in bounded depth and bounded computation of O(l + n(1 + ω)l /2l ) for any given depth l. No such balance constraint is enforced in the 2-means algorithm, and hence, the 2M-tree can be heavily unbalanced. The absence of complete BSP 2M-tree beyond depth 4 and 6 in Figures 2 (b) & (c) respectively is evidence of the lack of balance in the 2M-tree. This implies possibly more computation and hence lower errors. Under these conditions, the MM-tree with an explicit balance constraint performs comparably to the 2M-tree (slightly outperforming in 3 of the 4 cases) while still maintaining a balanced tree (and hence returning smaller candidate sets on average). 3 A subset of a growth-restricted metric space (S, 2 ) may not be growth-restricted. However, in our case, we are not considering all subsets; we only consider subsets of the form (A ∩ S) where A ⊂ RD is a convex set. So our condition does not imply that all subsets of (S, 2 ) are growth-restricted. 4 We empirically explore the effect of the tree type on the violation of the inlier condition (C4) in Appendix B. The results imply that for any fixed value of η, almost the same number of alternate splits would be invoked for the construction of different types of trees on the same dataset. Moreover, with η ≥ 8, for only one of the datasets would a significant fraction of the partitions in the tree (of any type) need to be the alternate partition. 6 (a) O PTDIGITS (b) T INY I MAGES (c) MNIST (d) I MAGES Figure 2: Performance of BSP-trees with increasing traversal depth. The top row corresponds to quantization performance of existing trees and the bottom row presents the nearest-neighbor error (in terms of mean rank τ of the candidate neighbors (CN)) of Algorithm 1 with these trees. The nearest-neighbor search error graphs are also annotated with the mean distance-error of the CN (please view in color). 4 Large margin BSP-tree We established that the search error depends on the quantization performance and the partition margins of the tree. The MM-tree explicitly maximizes the margin of every partition and empirical results indicate that it has comparable performance to the 2M-tree and PA-tree in terms of the quantization performance. In this section, we establish a theoretical guarantee for the MM-tree quantization performance. The large margin split in the MM-tree is obtained by performing max-margin clustering (MMC) with 2 clusters. The task of MMC is to find the optimal hyperplane (w∗ , b∗ ) from the following optimization problem5 given a set of points S = {x1 , x2 , . . . , xm } ⊂ RD : min w,b,ξi s.t. 1 w 2 m 2 2 ξi +C (5) i=1 | w, xi + b| ≥ 1 − ξi , ξi ≥ 0 ∀i = 1, . . . , m (6) m sgn( w, xi + b) ≤ ωm. −ωm ≤ (7) i=1 MMC finds a soft max-margin split in the data to obtain two clusters separated by a large (soft) margin. The balance constraint (Equation (7)) avoids trivial solutions and enforces an ω-balanced split. The margin constraints (Equation (6)) enforce a robust separation of the data. Given a solution to the MMC, we establish the following quantization error improvement rate for the MM-tree: Theorem 4.1. Given a set of points S ⊂ RD and a region A containing m points, consider an ω-balanced max-margin split (w, b) of the region A into {Al , Ar } with at most αm support vectors and a split margin of size γ = 1/ w . Then the quantization error improvement is given by: γ 2 (1 − α)2 VS ({Al , Ar }) ≤ 1 − D i=1 1−ω 1+ω λi VS (A), (8) where λ1 , . . . , λD are the eigenvalues of the covariance matrix of A ∩ S. The result indicates that larger margin sizes (large γ values) and a smaller number of support vectors (small α) implies better quantization performance. Larger ω implies smaller improvement, but ω is √ generally restricted algorithmically in MMC. If γ = O( λ1 ) then this rate matches the best possible quantization performance of the PA-tree (Table 1). We do assume that we have a feasible solution to the MMC problem to prove this result. We use the following result to prove Theorem 4.1: Proposition 4.1. [7, Lemma 15] Give a set S, for any partition {A1 , A2 } of a set A, VS (A) − VS ({A1 , A2 }) = |A1 ∩ S||A2 ∩ S| µ(A1 ) − µ(A2 ) |A ∩ S|2 2 , (9) where µ(A) is the centroid of the points in the region A. 5 This is an equivalent formulation [16] to the original form of max-margin clustering proposed by Xu et al. (2005) [9]. The original formulation also contains the labels yi s and optimizes over it. We consider this form of the problem since it makes our analysis easier to follow. 7 This result [7] implies that the improvement in the quantization error depends on the distance between the centroids of the two regions in the partition. Proof of Theorem 4.1. For a feasible solution (w, b, ξi |i=1,...,m ) to the MMC problem, m m | w, xi + b| ≥ m − ξi . i=1 i=1 Let xi = w, xi +b and mp = |{i : xi > 0}| and mn = |{i : xi ≤ 0}| and µp = ( ˜ ˜ ˜ ˜ and µn = ( i : xi ≤0 xi )/mn . Then mp µp − mn µn ≥ m − i ξi . ˜ ˜ ˜ ˜ ˜ i : xi >0 ˜ xi )/mp ˜ Without loss of generality, we assume that mp ≥ mn . Then the balance constraint (Equation (7)) 2 tells us that mp ≤ m(1 + ω)/2 and mn ≥ m(1 − ω)/2. Then µp − µn + ω(˜p + µn ) ≥ 2 − m i ξi . ˜ ˜ µ ˜ 2 Since µp > 0 and µn ≤ 0, |˜p + µn | ≤ (˜p − µn ). Hence (1 + ω)(˜p − µn ) ≥ 2 − m i ξi . For ˜ µ ˜ µ ˜ µ ˜ an unsupervised split, the data is always separable since there is no misclassification. This implies ∗ that ξi ≤ 1∀i. Hence, µp − µn ≥ ˜ ˜ 2− 2 |{i : ξi > 0}| /(1 + ω) ≥ 2 m 1−α 1+ω , (10) since the term |{i : ξi > 0}| corresponds to the number of support vectors in the solution. Cauchy-Schwartz implies that µ(Al ) − µ(Ar ) ≥ | w, µ(Al ) − µ(Ar ) |/ w = (˜p − µn )γ, µ ˜ since µn = w, µ(Al ) + b and µp = w, µ(Ar ) + b. From Equation (10), we can say ˜ ˜ 2 2 2 that µ(Al ) − µ(Ar ) ≥ 4γ 2 (1 − α) / (1 + ω) . Also, for ω-balanced splits, |Al ||Ar | ≥ (1 − ω 2 )m2 /4. Combining these into Equation (9) from Proposition 4.1, we have VS (A) − VS ({Al , Ar }) ≥ (1 − ω 2 )γ 2 1−α 1+ω 2 = γ 2 (1 − α)2 1−ω 1+ω . (11) Let Cov(A ∩ S) be the covariance matrix of the data contained in region A and λ1 , . . . , λD be the eigenvalues of Cov(A ∩ S). Then, we have: VS (A) = 1 |A ∩ S| D x − µ(A) 2 = tr (Cov(A ∩ S)) = λi . i=1 x∈A∩S Then dividing Equation (11) by VS (A) gives us the statement of the theorem. 5 Conclusions and future directions Our results theoretically verify that BSP-trees with better vector quantization performance and large partition margins do have better search performance guarantees as one would expect. This means that the best BSP-tree for search on a given dataset is the one with the best combination of good quantization performance (low β ∗ in Corollary 3.1) and large partition margins (large γ in Corollary 3.2). The MM-tree and the 2M-tree appear to have the best empirical performance in terms of the search error. This is because the 2M-tree explicitly minimizes β ∗ while the MM-tree explicitly maximizes γ (which also implies smaller β ∗ by Theorem 4.1). Unlike the 2M-tree, the MM-tree explicitly maintains an approximately balanced tree for better worst-case search time guarantees. However, the general dimensional large margin partitions in the MM-tree construction can be quite expensive. But the idea of large margin partitions can be used to enhance any simpler space partition heuristic – for any chosen direction (such as along a coordinate axis or along the principal eigenvector of the data covariance matrix), a one dimensional large margin split of the projections of the points along the chosen direction can be obtained very efficiently for improved search performance. This analysis of search could be useful beyond BSP-trees. Various heuristics have been developed to improve locality-sensitive hashing (LSH) [10]. The plain-vanilla LSH uses random linear projections and random thresholds for the hash-table construction. The data can instead be projected along the top few eigenvectors of the data covariance matrix. This was (empirically) improved upon by learning an orthogonal rotation of the projected data to minimize the quantization error of each bin in the hash-table [17]. A nonlinear hash function can be learned using a restricted Boltzmann machine [18]. If the similarity graph of the data is based on the Euclidean distance, spectral hashing [19] uses a subset of the eigenvectors of the similarity graph Laplacian. Semi-supervised hashing [20] incorporates given pairwise semantic similarity and dissimilarity constraints. The structural SVM framework has also been used to learn hash functions [21]. Similar to the choice of an appropriate BSP-tree for search, the best hashing scheme for any given dataset can be chosen by considering the quantization performance of the hash functions and the margins between the bins in the hash tables. We plan to explore this intuition theoretically and empirically for LSH based search schemes. 8 References [1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Transactions in Mathematical Software, 1977. [2] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009. [3] R.F. Sproull. Refinements to Nearest-Neighbor Searching in k-dimensional Trees. Algorithmica, 1991. [4] J. McNames. A Fast Nearest-Neighbor Algorithm based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001. [5] K. Fukunaga and P. M. Nagendra. A Branch-and-Bound Algorithm for Computing k-NearestNeighbors. IEEE Transactions on Computing, 1975. [6] D. Nister and H. Stewenius. Scalable Recognition with a Vocabulary Tree. In IEEE Conference on Computer Vision and Pattern Recognition, 2006. [7] S. Dasgupta and Y. Freund. Random Projection trees and Low Dimensional Manifolds. In Proceedings of ACM Symposium on Theory of Computing, 2008. [8] P. Ram, D. Lee, and A. G. Gray. Nearest-neighbor Search on a Time Budget via Max-Margin Trees. In SIAM International Conference on Data Mining, 2012. [9] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 2005. [10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of ACM Symposium on Theory of Computing, 1998. [11] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Advances in Neural Information Proceedings Systems, 2005. [12] S. Dasgupta and K. Sinha. Randomized Partition Trees for Exact Nearest Neighbor Search. In Proceedings of the Conference on Learning Theory, 2013. [13] J. He, S. Kumar and S. F. Chang. On the Difficulty of Nearest Neighbor Search. In Proceedings of the International Conference on Machine Learning, 2012. [14] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007. [15] D. R. Karger and M. Ruhl. Finding Nearest Neighbors in Growth-Restricted Metrics. In Proceedings of ACM Symposium on Theory of Computing, 2002. [16] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In SIAM International Conference on Data Mining, 2008. [17] Y. Gong and S. Lazebnik. Iterative Quantization: A Procrustean Approach to Learning Binary Codes. In IEEE Conference on Computer Vision and Pattern Recognition, 2011. [18] R. Salakhutdinov and G. Hinton. Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure. In Artificial Intelligence and Statistics, 2007. [19] Y. Weiss, A. Torralba, and R. Fergus. Spectral Hashing. Advances of Neural Information Processing Systems, 2008. [20] J. Wang, S. Kumar, and S. Chang. Semi-Supervised Hashing for Scalable Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, 2010. [21] M. Norouzi and D. J. Fleet. Minimal Loss Hashing for Compact Binary Codes. In Proceedings of the International Conference on Machine Learning, 2011. [22] S. Lloyd. Least Squares Quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982. 9
6 0.72979176 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
7 0.72604316 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
8 0.72191811 66 nips-2013-Computing the Stationary Distribution Locally
9 0.71726876 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
10 0.71712387 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
11 0.71688467 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
12 0.71468115 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion
13 0.71377939 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
14 0.71330839 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
15 0.71309161 230 nips-2013-Online Learning with Costly Features and Labels
16 0.71228534 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
17 0.7114287 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
18 0.7109611 318 nips-2013-Structured Learning via Logistic Regression
19 0.71087062 250 nips-2013-Policy Shaping: Integrating Human Feedback with Reinforcement Learning
20 0.7101326 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences