nips nips2013 nips2013-269 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Samory Kpotufe, Francesco Orabona
Abstract: We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time O(log n) at any time step n while achieving a nearly-optimal regression rate of Õ(n^{-2/(2+d)}) in terms of the unknown metric dimension d. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. [sent-3, score-0.291]
2 We prove that it is possible to maintain such a structure in time O(log n) at any time step n while achieving a nearly-optimal regression rate of Õ(n^{-2/(2+d)}) in terms of the unknown metric dimension d. [sent-4, score-0.451]
3 Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting. [sent-5, score-0.222]
4 1 Introduction Traditional nonparametric regression such as kernel or k-NN can be expensive to estimate given modern large training data sizes. [sent-6, score-0.199]
5 It is therefore common to resort to cheaper methods such as tree-based regression which precompute the regression estimates over a partition of the data space [7]. [sent-7, score-0.341]
6 Given a future query x, the estimate fn (x) simply consists of finding the closest cell of the partition by traversing an appropriate tree-structure and returning the precomputed estimate. [sent-8, score-0.725]
7 The partition and precomputed estimates depend on the training data and are usually maintained in batch-mode. [sent-9, score-0.193]
8 We are interested in maintaining such a partition and estimates in a real-world setting where the training data arrives sequentially over time. [sent-10, score-0.307]
9 Our constraints are that of fast-update at every time step, while maintaining a near-minimax regression error-rate at any point in time. [sent-11, score-0.186]
10 The minimax-optimal binwidth ε_n is known to be of the form O(n^{-1/(2+d)}), assuming training data of size n from a metric space of unknown dimension d and an unknown Lipschitz target function f. [sent-14, score-0.474]
11 Thus, the dimension d is the most important problem variable entering the rate (and the tuning of ε_n), while other problem variables such as the Lipschitz properties of f are less crucial in comparison. [sent-16, score-0.145]
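As a reminder of where this form of ε_n comes from (a standard bias-variance balance, not a calculation specific to this paper; Δ²_Y below denotes a bound on the noise variance, a quantity that appears later in the text): a cell of width ε incurs squared bias of order λ²ε² for a λ-Lipschitz f, while averaging the roughly nε^d responses falling in a cell gives variance of order Δ²_Y/(nε^d); balancing the two terms recovers the binwidth and rate quoted above.

```latex
% Heuristic bias-variance balance (a sketch, not the paper's formal argument):
\[
\lambda^2 \epsilon^2 \;\asymp\; \frac{\Delta_Y^2}{n\,\epsilon^{d}}
\quad\Longrightarrow\quad
\epsilon_n \;\asymp\; \Big(\frac{\Delta_Y^2}{\lambda^2 n}\Big)^{1/(2+d)} = O\big(n^{-1/(2+d)}\big),
\qquad
\lambda^2\epsilon_n^2 = O\big(n^{-2/(2+d)}\big).
\]
```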
12 The main focus of this work is therefore that of adapting to the unknown d while maintaining fast partition estimates in a streaming setting. [sent-17, score-0.377]
13 A first idea would be to start with an initial dimension-estimation phase (where the regression estimates are suboptimal), and then use the estimated dimension for subsequent data in a following phase, which leaves only the problem of maintaining partition estimates over time. [sent-18, score-0.712]
14 However, while this sounds reasonable, it is generally unclear when to confidently stop such an initial phase since this would depend on the unknown d and the distribution of the data. [sent-19, score-0.239]
15 Our solution is to interleave dimension estimation with regression updates as the data arrives sequentially. [sent-20, score-0.315]
16 However the algorithm never relies on the estimated dimensions and views them rather as guesses di . [sent-21, score-0.177]
17 Even if d_i ≠ d, it is kept as long as it does not hurt the regression performance. [sent-22, score-0.274]
18 The guess d_i is discarded once we detect that it hurts the regression; a new d_{i+1} is then estimated and a new Phase i+1 is started. [sent-23, score-0.43]
19 The decision to discard d_i relies on monitoring quantities that play into the tradeoff between regression variance and bias; more precisely, we monitor the size of the partition relative to the current binwidth. (SK and FO contributed equally to this paper.) [sent-24, score-0.427]
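A minimal Python sketch of the kind of test this describes (the constant C and the exact form of the threshold are illustrative assumptions, not the paper's precise rule): if the data truly had dimension at most d_i, a set of centers that are pairwise more than ε_t/2 apart could not exceed roughly ε_t^{-d_i} points, so a partition that grows beyond that signals that the variance term is getting too large and the guess should be discarded.

```python
def dimension_guess_still_ok(num_centers: int, binwidth: float,
                             d_guess: float, C: float = 1.0) -> bool:
    """Return False when the partition has grown too large for the current
    dimension guess, i.e. when a new phase with a new guess should start.

    The threshold C * binwidth**(-d_guess) is an assumption for illustration:
    in a space of dimension <= d_guess, centers that are pairwise more than
    binwidth/2 apart number at most on this order.
    """
    return num_centers <= C * binwidth ** (-d_guess)
```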
20 We note that the idea can be applied to other forms of regression where other quantities control the regression variance and bias (see longer version of the paper). [sent-26, score-0.277]
21 Technical Overview of Results We assume that training data (x_i, Y_i) is sampled sequentially over time, where x_i belongs to a general metric space X of unknown dimension d and Y_i is real-valued. [sent-28, score-0.513]
22 Our algorithm (Section 3) maintains regression estimates for all training samples x^n = {x_t}_{t=1}^n arriving over time, while constantly updating a partition of the data and the partition binwidth. [sent-31, score-0.459]
23 At any time t = n, all updates are provably of order log n, with constants depending on the unknown dimension d of X. [sent-32, score-0.245]
24 At time t = n, the estimate for a query point x is given by the precomputed estimate for the closest point to x in x^n, which can be found fast using an off-the-shelf similarity search structure, such as those of [2, 10]. [sent-33, score-0.288]
25 We prove that the L2 error of the algorithm is Õ(n^{-2/(2+d)}), nearly optimal in terms of the unknown dimension d of the metric space X. [sent-34, score-0.337]
26 Finally, we prove a new lower-bound for regression on a generic metric space X, where the worst-case distribution is the same as n increases. [sent-35, score-0.235]
27 Thus, our lower-bound is more appropriate to the streaming setting where the data arrives over time from the same distribution. [sent-37, score-0.199]
28 2 Related Work Although various interesting heuristics have been proposed for maintaining tree-based learners in streaming settings (see e. [sent-40, score-0.191]
29 This is however an important problem given the growing size of modern datasets, and given that in many modern applications, training data is actually acquired sequentially over time and incremental updates have to be efficient (see e. [sent-43, score-0.272]
30 the known dimension d, while efficiently tuning regression bandwidth. [sent-51, score-0.242]
31 1 Preliminaries Notions of metric dimension We consider the following notion of dimension which extends traditional notions of dimension (e. [sent-55, score-0.528]
32 Euclidean dimension and manifold dimension) to general metric spaces [4]. [sent-57, score-0.257]
33 We assume that the space X has diameter at most 1 under a metric ρ. [sent-62, score-0.224]
34 The metric measure space (X, µ, ρ) has metric measure-dimension d if there exist Č_µ, Ĉ_µ such that, for all ε > 0 and all x ∈ X, Č_µ ε^d ≤ µ(B(x, ε)) ≤ Ĉ_µ ε^d. [sent-64, score-0.339]
35 The assumption of finite metric-measure dimension ensures that the measure µ has mass everywhere on the space (X, ρ). [sent-65, score-0.182]
36 This assumption is a generalization (to a metric space) of common assumptions where the measure has an upper- and lower-bounded density on a compact Euclidean space; however, it is more general in that it does not require the measure µ to have a density (relative to any reference measure). [sent-66, score-0.196]
37 The metric-measure dimension implies the following other notion of metric dimension. [sent-67, score-0.257]
38 The metric space (X, ρ) has metric dimension d if there exists Ĉ_ρ such that, for all ε > 0, X has an ε-cover of size at most Ĉ_ρ ε^{-d}. [sent-69, score-0.395]
39 The relation between the two notions of dimension is stated in the following lemma of [9], which allows us to use either notion as needed. [sent-70, score-0.215]
40 If (X, µ, ρ) has metric-measure dimension d, then there exist Č_ρ, Ĉ_ρ such that, for all ε > 0, any ball B(x, r) centered on (X, ρ) has an εr-cover of size in [Č_ρ ε^{-d}, Ĉ_ρ ε^{-d}]. [sent-72, score-0.146]
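The excerpt does not spell out how a fresh guess d_{i+1} is produced once the old one is discarded; one natural way to guess a dimension from the covering-number definition above is to compare greedy ε-net sizes at two scales. The sketch below is an illustration of that idea (the estimator, the scale ε, and the Euclidean metric are assumptions, not the paper's construction).

```python
import numpy as np

def greedy_eps_net(points, eps):
    """Greedy eps-net: keep a point only if it lies more than eps away
    from every center kept so far."""
    centers = []
    for p in points:
        p = np.asarray(p, dtype=float)
        if all(np.linalg.norm(p - c) > eps for c in centers):
            centers.append(p)
    return centers

def covering_dimension_guess(points, eps=0.25):
    """Guess d from the growth of covering numbers, N(eps) ~ C * eps**(-d),
    so that d ~ log(N(eps/2) / N(eps)) / log 2."""
    n_coarse = max(len(greedy_eps_net(points, eps)), 1)
    n_fine = max(len(greedy_eps_net(points, eps / 2.0)), 1)
    return max(1.0, np.log(n_fine / n_coarse) / np.log(2.0))
```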
41 The input xt belongs to a metric measure space (X , ρ, µ) of diameter at most 1, and of metric measure dimension d. [sent-80, score-0.885]
42 L2 error: Our main performance result bounds the excess L2 risk E_{x^n,Y^n} ‖f_n − f‖²_{2,µ} ≜ E_{x^n,Y^n} E_X |f_n(X) − f(X)|². [sent-87, score-0.655]
43 E xn ,Y n X We will often also be interested in the average error on the training sample: recall that at any time t, an estimate ft (xs ) of f is produced for every xs ∈ xt . [sent-88, score-0.941]
44 The average error on x^n at t = n is denoted E_n E_{Y^n} |f_n(X) − f(X)|² ≜ (1/n) Σ_{t=1}^n E_{Y^n} |f_n(x_t) − f(x_t)|². [sent-89, score-0.189]
45 Algorithm The procedure (Algorithm 1) works by partitioning the data into small regions of size roughly ε_t/2 at any time t, and computing the regression estimate at the center of each region. [sent-91, score-0.201]
46 All points falling in the same region (identified by a center point) are assigned the same regression estimate: the average Y value of all points in the region. [sent-92, score-0.321]
47 The procedure works in phases, where each Phase i corresponds to a guess d_i of the metric dimension d. [sent-93, score-0.404]
48 Where ε_t might have been set to t^{-1/(2+d)} if we knew d, we instead set it to t_i^{-1/(2+d_i)}, where t_i is the current time step within Phase i. [sent-94, score-0.149]
49 We ensure that in each phase our guess d_i does not hurt the variance-bias tradeoff of the estimates: this is done by monitoring the size of the partition (|X_i| in the algorithm), which controls the variance (see analysis in Section 4), relative to the binwidth ε_t, which controls the bias. [sent-95, score-0.655]
50 Whenever |X_i| is too large relative to ε_t, the variance of the procedure is likely too large, so we start a new phase with a new guess d_{i+1}. [sent-96, score-0.453]
51 Since the algorithm maintains at any time n an estimate fn (xt ) for all xt ∈ xn , for any query point x ∈ X , we simply return fn (x) = fn (xt ) where xt is the closest point to x in xn . [sent-97, score-2.57]
52 Despite having to adaptively tune to the unknown d, the main computation at each time step consists of just a 2-approximate nearest neighbor search for the closest center. [sent-98, score-0.202]
53 1: Initialize: i = 1, di = 1, ti = 0, Centers Xi = ∅ 2: for t = 1, 2, . [sent-111, score-0.24]
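Putting the pieces of the description above together, here is a hedged end-to-end sketch in Python. The phase-change threshold (constant C), the clearing of the centers at a phase change, the way the new guess is produced, and the brute-force nearest-center search are all simplifying assumptions; the paper obtains its O(log n) updates by using an off-the-shelf 2-approximate nearest neighbor structure rather than the linear scan below.

```python
import numpy as np

class StreamingPartitionRegressor:
    """Sketch of a phase-based streaming partition regressor (see caveats above)."""

    def __init__(self, C=4.0):
        self.C = C              # phase-change constant (assumed)
        self.d_guess = 1.0      # current dimension guess d_i
        self.t_in_phase = 0     # time steps elapsed within the current phase
        self.centers = []       # center locations X_i
        self.sums = []          # running sum of the Y values assigned to each center
        self.counts = []        # number of points assigned to each center

    def _binwidth(self):
        # eps_{t_i} = t_i^{-1/(2 + d_i)}, with t_i the step count inside the phase
        return self.t_in_phase ** (-1.0 / (2.0 + self.d_guess))

    def _nearest_center(self, x):
        dists = [np.linalg.norm(x - c) for c in self.centers]
        j = int(np.argmin(dists))
        return j, dists[j]

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        self.t_in_phase += 1
        eps = self._binwidth()
        j, dist = self._nearest_center(x) if self.centers else (-1, np.inf)
        if dist <= eps / 2.0:
            # assign x to its nearest center and update that center's running mean
            self.sums[j] += y
            self.counts[j] += 1
        else:
            # x is far from all current centers, so it becomes a new center
            self.centers.append(x)
            self.sums.append(float(y))
            self.counts.append(1)
        # Phase change: the partition is too large for the current guess d_i,
        # hence the variance is likely too large; start Phase i+1 with a new guess.
        if len(self.centers) > self.C * eps ** (-self.d_guess):
            self.d_guess += 1.0   # new guess (could instead be re-estimated from the data)
            self.t_in_phase = 0
            self.centers, self.sums, self.counts = [], [], []

    def predict(self, x):
        # f_n(x) = precomputed estimate at the closest center seen so far
        if not self.centers:
            return 0.0
        j, _ = self._nearest_center(x)
        return self.sums[j] / self.counts[j]
```

A stream would be consumed by calling model.update(x, y) for each arriving pair, with model.predict(x) answering queries at any time step.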
54 The main computation at time n consists of finding the 2-approximate nearest neighbor of x_n in X_i and updating the data structure for the nearest neighbor search. [sent-117, score-0.306]
55 The main difficulty is in the fact that the data is partitioned in a complicated way due to the ever-changing binwidth ε_t: every ball around a center can eventually contain both points assigned to the center and points not assigned to the center, and can in fact contain other centers (see Figure 1). [sent-122, score-0.535]
56 This makes it hard to get a handle on the number of points assigned to a single center xt (contributing to the variance of fn (xt )) and the distance between points assigned to the same center (contributing to the bias). [sent-123, score-1.188]
57 The problem is handled by first looking at the average error over points in xn , which is less difficult. [sent-125, score-0.226]
58 Suppose the space (X , µ, ρ) has metric-measure dimension d. [sent-127, score-0.153]
59 For any x ∈ X , define fn (x) = fn (xt ) where xt is the closest point to x in xn . [sent-128, score-1.547]
60 Then at any time t = n, we have, for some C independent of n, E_{x^n,Y^n} ‖f_n − f‖²_{2,µ} ≤ C(d log n) · sup_{x^n} E_n E_{Y^n} |f_n(X) − f(X)|² + Cλ² ((d log n)/n)^{2/d} + Δ²_Y/n. [sent-129, score-1.279]
61 The convergence rate is therefore Õ(n^{-2/(2+d)}), nearly optimal in terms of the unknown d (up to a log n factor). [sent-131, score-0.213]
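Taking the theorem's bound in the form reconstructed above (the exact form of the lower-order terms may differ in the original), a minimal sketch of how the stated rate follows is to plug in the average-error bound of Corollary 1 in Section 4, which is of order B d^{2/(2+d)} n^{-2/(2+d)}, and note that the remaining terms decay strictly faster:

```latex
\[
\mathbb{E}_{x^n,Y^n}\|f_n - f\|_{2,\mu}^2
\;\le\; C(d\log n)\cdot B\,d^{2/(2+d)}\,n^{-2/(2+d)}
\;+\; C\lambda^2\Big(\tfrac{d\log n}{n}\Big)^{2/d}
\;+\; \tfrac{\Delta_Y^2}{n}
\;=\; \tilde{O}\big(n^{-2/(2+d)}\big),
\]
% since 2/d > 2/(2+d) and 1 > 2/(2+d) for every d >= 1, the last two terms
% are of lower order than n^{-2/(2+d)} up to logarithmic factors.
```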
62 As alluded to above, the proof of Theorem 1 proceeds by first bounding the average error E_n E_{Y^n} |f_n(X) − f(X)|² on the sample x^n. [sent-135, score-0.214]
[Figure: Normalized RMSE on the test set, comparing the Incremental Tree-based method with fixed-dimension settings (d = 1, 4, 8 and d = 1, 5, 10).]
65 This shows a sense in which the algorithm is robust to bad dimension estimates: the average error is of the optimal form in terms of d, even though the data could trick us into picking a bad guess di of d. [sent-165, score-0.46]
66 An algorithm that estimates the dimension in a first phase would end up using the suboptimal setting d = 1, while our algorithm robustly updates its estimate over time. [sent-168, score-0.344]
67 As mentioned in the introduction, the same insights can be applied to other forms of regression in a streaming setting. [sent-169, score-0.268]
68 We show in the longer version of the paper a procedure more akin to kernel regression, which employs other quantities (appropriate to the method) to control the bias-variance tradeoff while deciding whether to keep or reject the guess d_i. [sent-170, score-0.314]
69 First, the lower-bound holds for any given metric measure space (X , ρ, µ) with finite measure-dimension: we constrain the worst-case distribution to have the marginal µ that nature happens to choose. [sent-174, score-0.201]
70 Let (X , µ, ρ) be a metric space of diameter 1, and metric-measure dimension d. [sent-190, score-0.343]
71 There exists an indexing subsequence {n_t}_{t∈ℕ}, n_{t+1} > n_t, such that inf_{{f_n}} sup_{D_{µ,λ}} lim_{t→∞} E_{X^{n_t},Y^{n_t}} ‖f_{n_t} − f‖²_{2,µ} / β_{n_t} = ∞, where the infimum is taken over all sequences {f_n} of estimators f_n : X^n, Y^n → L_{2,µ}. [sent-193, score-0.862]
72 By the statement of the theorem, if we pick any rate β_n faster than n^{-2/(2+d)}, then there exists a distribution with marginal µ for which E ‖f_n − f‖² / β_n either diverges or tends to ∞. [sent-194, score-0.548]
73 4 Analysis We first analyze the average error of the algorithm over the data xn in Section 4. [sent-195, score-0.189]
74 1 Bounds on Average Error We start by bounding the average error on the sample x^n at time n, that is, we upper-bound E_n E_{Y^n} |f_n(X) − f(X)|². [sent-200, score-0.212]
75 We bound the error in a given phase (Lemma 4), then combine these errors over all phases to obtain the final bounds (Corollary 1). [sent-202, score-0.305]
76 Nevertheless, if n_i is the number of steps in Phase i, we will see that both the average squared bias and the variance can be bounded by roughly n_i^{-2/(2+d_i)}. [sent-205, score-0.664]
77 Finally, the algorithm ensures that the guess di is always an under-estimate of the unknown dimension d, as proven in Lemma 3 (proof in the supplemental appendix), so integrating over all phases yields an adaptive bound on the average error. [sent-206, score-0.594]
78 We assume throughout this section that the space (X, ρ) has metric dimension d for some Ĉ_ρ (see the definition of metric dimension above). [sent-207, score-0.153]
79 The following invariants hold throughout the procedure for all phases i ≥ 1 of Algorithm 1: • i ≤ di ≤ d. [sent-211, score-0.264]
80 Consider Phase i ≥ 1 and suppose this phase lasts ni steps. [sent-215, score-0.485]
81 We have the following bound: E_{n_i} E_{Y^n} |f_n(X) − f(X)|² ≤ Ĉ (4^d Δ²_Y + 12λ²) n_i^{-2/(2+d)}, where E_{n_i} denotes the average over the points of Phase i. [sent-217, score-0.29]
82 Suppose X_i(X) = x_s for some s ∈ [n]; we let n_{x_s} denote the number of points assigned to the center x_s. [sent-220, score-1.135]
83 We use the notation xt → xs to say that xt is assigned to center xs . [sent-221, score-1.519]
84 First fix X ∈ {x_t : t ∈ Phase i} and let x_s = X_i(X) be its assigned center. [sent-222, score-0.377]
85 We proceed with the following standard bias-variance decomposition, writing f̃_n(X) ≜ E_{Y^n} f_n(X) = (1/n_{x_s}) Σ_{x_t → x_s} f(x_t): E_{Y^n} |f_n(X) − f(X)|² = E_{Y^n} |f_n(X) − f̃_n(X)|² + |f̃_n(X) − f(X)|². [sent-224, score-1.873]
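As a hedged sketch of how the two terms are typically controlled (the paper's exact constants may differ, and Δ²_Y is taken here as a bound on the conditional variance of Y): the variance term is the variance of an average of n_{x_s} responses, and the bias term follows from the λ-Lipschitzness of f together with Jensen's inequality, with the distances ρ(x_t, X) in turn controlled by the binwidths ε_t in force when each x_t was assigned, which is where the bound on Σ_t ε_t² below enters.

```latex
\[
\mathbb{E}_{Y^n}\big|f_n(X)-\tilde f_n(X)\big|^2 \;\le\; \frac{\Delta_Y^2}{n_{x_s}},
\qquad
\big|\tilde f_n(X)-f(X)\big|^2
\;\le\;\Big(\frac{1}{n_{x_s}}\sum_{x_t\to x_s}\big|f(x_t)-f(X)\big|\Big)^{2}
\;\le\;\frac{\lambda^2}{n_{x_s}}\sum_{x_t\to x_s}\rho(x_t,X)^2 .
\]
```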
86 To bound the last term, we have Σ_{t ∈ Phase i} ε_t² = Σ_{t_i ∈ [n_i]} t_i^{-2/(2+d_i)} ≤ ∫_0^{n_i} τ^{-2/(2+d_i)} dτ ≤ 3 n_i^{1-2/(2+d_i)}. [sent-231, score-0.315]
87 Combining with the previous derivation and with both statements of Lemma 3, we get E_{n_i} E_{Y^n} |f_n(X) − f(X)|² ≤ Δ²_Y · |X_i|/n_i + 12λ² n_i^{-2/(2+d_i)} ≤ Ĉ (4^d Δ²_Y + 12λ²) n_i^{-2/(2+d)}. [sent-232, score-0.933]
88 We have: E_n E_{Y^n} |f_n(X) − f(X)|² ≤ (B/n) Σ_{i=1}^I n_i · n_i^{-2/(2+d)} = (B/n) Σ_{i=1}^I n_i^{d/(2+d)} ≤ (B I/n) ((1/I) Σ_{i=1}^I n_i)^{d/(2+d)} = B · I^{2/(2+d)} n^{-2/(2+d)} ≤ B · d^{2/(2+d)} n^{-2/(2+d)}, where in the second inequality we use Jensen's inequality, and in the last inequality Lemma 3. [sent-239, score-1.23]
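For concreteness, the Jensen step uses the concavity of x ↦ x^{d/(2+d)} (the exponent lies in (0, 1)), and the final step uses the phase-count bound I ≤ d implied by Lemma 3:

```latex
\[
\frac{1}{I}\sum_{i=1}^{I} n_i^{\,d/(2+d)}
\;\le\;\Big(\frac{1}{I}\sum_{i=1}^{I} n_i\Big)^{d/(2+d)}
=\Big(\frac{n}{I}\Big)^{d/(2+d)}
\quad\Longrightarrow\quad
\frac{B}{n}\sum_{i=1}^{I} n_i^{\,d/(2+d)}
\;\le\; \frac{B\,I}{n}\Big(\frac{n}{I}\Big)^{d/(2+d)}
= B\,I^{2/(2+d)}\,n^{-2/(2+d)} .
\]
```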
89 2 Bound on L2 Error We need the following lemma, whose proof is in the supplemental appendix, which bounds the probability that a ρ-ball of a given radius contains a sample from xn . [sent-241, score-0.185]
90 Suppose (X , ρ, µ) has metric measure dimension d. [sent-244, score-0.286]
91 Since for every B ∈ B, µ(B) ≥ C_1 ε^d ≥ α_{n,δ}/n, we have by Lemma 5 that, with probability at least 1 − δ, all B ∈ B contain a point from x^n. [sent-256, score-0.166]
92 In other words, the event E that x^n forms an ε-cover of X is (1 − δ)-likely. [sent-257, score-0.159]
93 Suppose x_t is the closest point in x^n to x ∈ X. [sent-258, score-0.511]
94 The larger the dimension d, the more finely we can discretize the space and the more complex the class F can be, subject to the Lipschitz constraint. [sent-266, score-0.203]
95 The functions in F then have to vary sufficiently in each subset of the space X according to the corresponding n_i. [sent-274, score-0.324]
96 The lack of regularity makes it unclear a priori how to divide such a space into such subsets (Figure 3: lower-bound proof idea). [sent-279, score-0.145]
97 Last, we have to ensure that the functions f ∈ F resulting from our discretization of a general metric space X are in fact Lipschitz. [sent-281, score-0.172]
98 5 Conclusions We presented an efficient and nearly minimax optimal approach to nonparametric regression in a streaming setting. [sent-284, score-0.289]
99 The streaming setting is gaining more attention as modern data sizes are getting larger, and as data is being acquired online in many applications. [sent-285, score-0.215]
100 We leave open the question of optimal adaptation to the smoothness of the unknown function, while we efficiently solve the equally or more important question of adapting to the unknown dimension of the data, which generally has a stronger effect on the convergence rate. [sent-287, score-0.243]
wordName wordTfidf (topN-words)
[('fn', 0.518), ('xs', 0.377), ('xt', 0.319), ('ni', 0.29), ('nxs', 0.217), ('di', 0.177), ('phase', 0.164), ('metric', 0.138), ('xn', 0.137), ('streaming', 0.125), ('dimension', 0.119), ('xr', 0.109), ('regression', 0.097), ('guess', 0.089), ('phases', 0.087), ('partition', 0.079), ('fixed', 0.074), ('bindwidth', 0.072), ('nt', 0.069), ('center', 0.067), ('xi', 0.067), ('maintaining', 0.066), ('lemma', 0.063), ('ti', 0.063), ('assigned', 0.06), ('ey', 0.057), ('closest', 0.055), ('cube', 0.053), ('diameter', 0.052), ('unknown', 0.051), ('arrives', 0.051), ('discretize', 0.05), ('centers', 0.05), ('binwidth', 0.048), ('teapot', 0.048), ('incremental', 0.048), ('precomputed', 0.047), ('sequentially', 0.044), ('yt', 0.044), ('euclidean', 0.042), ('nearest', 0.04), ('nonparametric', 0.04), ('lipschitz', 0.04), ('acquired', 0.039), ('bias', 0.038), ('points', 0.037), ('regularity', 0.037), ('subsequence', 0.035), ('inequality', 0.035), ('estimates', 0.034), ('space', 0.034), ('emphasized', 0.034), ('contributing', 0.034), ('neighbor', 0.033), ('sup', 0.033), ('notions', 0.033), ('training', 0.033), ('toyota', 0.032), ('suppose', 0.031), ('technological', 0.031), ('partitioning', 0.031), ('pick', 0.03), ('francesco', 0.029), ('modern', 0.029), ('measure', 0.029), ('error', 0.029), ('contain', 0.029), ('rmse', 0.029), ('minimax', 0.027), ('updates', 0.027), ('ball', 0.027), ('belongs', 0.027), ('tuning', 0.026), ('monitoring', 0.026), ('query', 0.026), ('jensen', 0.025), ('chicago', 0.025), ('log', 0.025), ('tradeoff', 0.025), ('bound', 0.025), ('proof', 0.025), ('unclear', 0.024), ('insights', 0.024), ('behind', 0.024), ('variance', 0.023), ('adversarial', 0.023), ('time', 0.023), ('picking', 0.023), ('average', 0.023), ('employs', 0.023), ('supplemental', 0.023), ('adapting', 0.022), ('forms', 0.022), ('online', 0.022), ('pakdd', 0.021), ('fteenth', 0.021), ('kohler', 0.021), ('krzyzak', 0.021), ('samory', 0.021), ('interleave', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 269 nips-2013-Regression-tree Tuning in a Streaming Setting
Author: Samory Kpotufe, Francesco Orabona
Abstract: We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time O(log n) at any time step n while achieving a nearly-optimal regression rate of Õ(n^{-2/(2+d)}) in terms of the unknown metric dimension d. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting.
2 0.19939046 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
Author: Martin Azizyan, Aarti Singh, Larry Wasserman
Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1
3 0.17076066 313 nips-2013-Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization
Author: Julien Mairal
Abstract: Majorization-minimization algorithms consist of iteratively minimizing a majorizing surrogate of an objective function. Because of its simplicity and its wide applicability, this principle has been very popular in statistics and in signal processing. In this paper, we intend to make this principle scalable. We introduce a stochastic majorization-minimization scheme which is able to deal with largescale or possibly infinite data sets. When applied to convex optimization problems under suitable assumptions, we show that it achieves an expected convergence √ rate of O(1/ n) after n iterations, and of O(1/n) for strongly convex functions. Equally important, our scheme almost surely converges to stationary points for a large class of non-convex problems. We develop several efficient algorithms based on our framework. First, we propose a new stochastic proximal gradient method, which experimentally matches state-of-the-art solvers for large-scale ℓ1 logistic regression. Second, we develop an online DC programming algorithm for non-convex sparse estimation. Finally, we demonstrate the effectiveness of our approach for solving large-scale structured matrix factorization problems. 1
4 0.1676438 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models
Author: Jonas Peters, Dominik Janzing, Bernhard Schölkopf
Abstract: Causal inference uses observational data to infer the causal structure of the data generating system. We study a class of restricted Structural Equation Models for time series that we call Time Series Models with Independent Noise (TiMINo). These models require independent residual time series, whereas traditional methods like Granger causality exploit the variance of residuals. This work contains two main contributions: (1) Theoretical: By restricting the model class (e.g. to additive noise) we provide general identifiability results. They cover lagged and instantaneous effects that can be nonlinear and unfaithful, and non-instantaneous feedbacks between the time series. (2) Practical: If there are no feedback loops between time series, we propose an algorithm based on non-linear independence tests of time series. We show empirically that when the data are causally insufficient or the model is misspecified, the method avoids incorrect answers. We extend the theoretical and the algorithmic part to situations in which the time series have been measured with different time delays. TiMINo is applied to artificial and real data and code is provided. 1
5 0.16363631 217 nips-2013-On Poisson Graphical Models
Author: Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu
Abstract: Undirected graphical models, such as Gaussian graphical models, Ising, and multinomial/categorical graphical models, are widely used in a variety of applications for modeling distributions over a large number of variables. These standard instances, however, are ill-suited to modeling count data, which are increasingly ubiquitous in big-data settings such as genomic sequencing data, user-ratings data, spatial incidence data, climate studies, and site visits. Existing classes of Poisson graphical models, which arise as the joint distributions that correspond to Poisson distributed node-conditional distributions, have a major drawback: they can only model negative conditional dependencies for reasons of normalizability given its infinite domain. In this paper, our objective is to modify the Poisson graphical model distribution so that it can capture a rich dependence structure between count-valued variables. We begin by discussing two strategies for truncating the Poisson distribution and show that only one of these leads to a valid joint distribution. While this model can accommodate a wider range of conditional dependencies, some limitations still remain. To address this, we investigate two additional novel variants of the Poisson distribution and their corresponding joint graphical model distributions. Our three novel approaches provide classes of Poisson-like graphical models that can capture both positive and negative conditional dependencies between count-valued variables. One can learn the graph structure of our models via penalized neighborhood selection, and we demonstrate the performance of our methods by learning simulated networks as well as a network from microRNA-sequencing data. 1
6 0.14770578 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
7 0.14081553 324 nips-2013-The Fast Convergence of Incremental PCA
8 0.13339944 89 nips-2013-Dimension-Free Exponentiated Gradient
9 0.12712795 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
10 0.12331086 188 nips-2013-Memory Limited, Streaming PCA
11 0.11459081 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap
12 0.10998399 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
13 0.10581016 24 nips-2013-Actor-Critic Algorithms for Risk-Sensitive MDPs
14 0.10405155 348 nips-2013-Variational Policy Search via Trajectory Optimization
15 0.1004426 48 nips-2013-Bayesian Inference and Learning in Gaussian Process State-Space Models with Particle MCMC
16 0.099571869 4 nips-2013-A Comparative Framework for Preconditioned Lasso Algorithms
17 0.099259637 230 nips-2013-Online Learning with Costly Features and Labels
18 0.098955214 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints
19 0.097843915 247 nips-2013-Phase Retrieval using Alternating Minimization
20 0.09766873 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations
topicId topicWeight
[(0, 0.22), (1, -0.001), (2, 0.104), (3, -0.039), (4, -0.026), (5, 0.11), (6, 0.047), (7, 0.117), (8, -0.058), (9, -0.081), (10, -0.055), (11, -0.145), (12, -0.022), (13, 0.084), (14, -0.011), (15, 0.014), (16, 0.104), (17, 0.08), (18, 0.055), (19, 0.177), (20, 0.115), (21, 0.179), (22, 0.089), (23, 0.092), (24, -0.211), (25, -0.022), (26, -0.138), (27, 0.03), (28, -0.072), (29, -0.037), (30, -0.048), (31, -0.119), (32, -0.084), (33, 0.086), (34, 0.041), (35, 0.06), (36, 0.119), (37, 0.075), (38, -0.073), (39, 0.037), (40, -0.038), (41, -0.075), (42, 0.045), (43, -0.008), (44, 0.107), (45, -0.072), (46, 0.028), (47, -0.073), (48, 0.01), (49, -0.077)]
simIndex simValue paperId paperTitle
same-paper 1 0.95927364 269 nips-2013-Regression-tree Tuning in a Streaming Setting
Author: Samory Kpotufe, Francesco Orabona
Abstract: We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time O(log n) at any time step n while achieving a nearly-optimal regression rate of Õ(n^{-2/(2+d)}) in terms of the unknown metric dimension d. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting.
2 0.62512386 62 nips-2013-Causal Inference on Time Series using Restricted Structural Equation Models
Author: Jonas Peters, Dominik Janzing, Bernhard Schölkopf
Abstract: Causal inference uses observational data to infer the causal structure of the data generating system. We study a class of restricted Structural Equation Models for time series that we call Time Series Models with Independent Noise (TiMINo). These models require independent residual time series, whereas traditional methods like Granger causality exploit the variance of residuals. This work contains two main contributions: (1) Theoretical: By restricting the model class (e.g. to additive noise) we provide general identifiability results. They cover lagged and instantaneous effects that can be nonlinear and unfaithful, and non-instantaneous feedbacks between the time series. (2) Practical: If there are no feedback loops between time series, we propose an algorithm based on non-linear independence tests of time series. We show empirically that when the data are causally insufficient or the model is misspecified, the method avoids incorrect answers. We extend the theoretical and the algorithmic part to situations in which the time series have been measured with different time delays. TiMINo is applied to artificial and real data and code is provided. 1
3 0.5930928 217 nips-2013-On Poisson Graphical Models
Author: Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu
Abstract: Undirected graphical models, such as Gaussian graphical models, Ising, and multinomial/categorical graphical models, are widely used in a variety of applications for modeling distributions over a large number of variables. These standard instances, however, are ill-suited to modeling count data, which are increasingly ubiquitous in big-data settings such as genomic sequencing data, user-ratings data, spatial incidence data, climate studies, and site visits. Existing classes of Poisson graphical models, which arise as the joint distributions that correspond to Poisson distributed node-conditional distributions, have a major drawback: they can only model negative conditional dependencies for reasons of normalizability given its infinite domain. In this paper, our objective is to modify the Poisson graphical model distribution so that it can capture a rich dependence structure between count-valued variables. We begin by discussing two strategies for truncating the Poisson distribution and show that only one of these leads to a valid joint distribution. While this model can accommodate a wider range of conditional dependencies, some limitations still remain. To address this, we investigate two additional novel variants of the Poisson distribution and their corresponding joint graphical model distributions. Our three novel approaches provide classes of Poisson-like graphical models that can capture both positive and negative conditional dependencies between count-valued variables. One can learn the graph structure of our models via penalized neighborhood selection, and we demonstrate the performance of our methods by learning simulated networks as well as a network from microRNA-sequencing data. 1
4 0.59130299 324 nips-2013-The Fast Convergence of Incremental PCA
Author: Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund
Abstract: We consider a situation in which we see samples Xn ∈ Rd drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both. 1
5 0.58037221 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
Author: Martin Azizyan, Aarti Singh, Larry Wasserman
Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1
6 0.57311434 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
7 0.53177994 39 nips-2013-Approximate Gaussian process inference for the drift function in stochastic differential equations
8 0.52239007 252 nips-2013-Predictive PAC Learning and Process Decompositions
9 0.50538784 247 nips-2013-Phase Retrieval using Alternating Minimization
10 0.4934282 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions
11 0.48991087 344 nips-2013-Using multiple samples to learn mixture models
12 0.4893479 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces
13 0.48861042 127 nips-2013-Generalized Denoising Auto-Encoders as Generative Models
14 0.48654696 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations
15 0.48266256 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap
16 0.46830425 89 nips-2013-Dimension-Free Exponentiated Gradient
17 0.46459627 63 nips-2013-Cluster Trees on Manifolds
18 0.463581 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
19 0.45930171 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
20 0.45150048 266 nips-2013-Recurrent linear models of simultaneously-recorded neural populations
topicId topicWeight
[(2, 0.012), (16, 0.058), (33, 0.143), (34, 0.068), (36, 0.018), (39, 0.119), (41, 0.027), (49, 0.036), (56, 0.268), (70, 0.045), (85, 0.037), (89, 0.031), (93, 0.047), (95, 0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.94785041 269 nips-2013-Regression-tree Tuning in a Streaming Setting
Author: Samory Kpotufe, Francesco Orabona
Abstract: We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time O(log n) at any time step n while achieving a nearly-optimal regression rate of Õ(n^{-2/(2+d)}) in terms of the unknown metric dimension d. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting.
2 0.91768861 17 nips-2013-A multi-agent control framework for co-adaptation in brain-computer interfaces
Author: Josh S. Merel, Roy Fox, Tony Jebara, Liam Paninski
Abstract: In a closed-loop brain-computer interface (BCI), adaptive decoders are used to learn parameters suited to decoding the user’s neural response. Feedback to the user provides information which permits the neural tuning to also adapt. We present an approach to model this process of co-adaptation between the encoding model of the neural signal and the decoding algorithm as a multi-agent formulation of the linear quadratic Gaussian (LQG) control problem. In simulation we characterize how decoding performance improves as the neural encoding and adaptive decoder optimize, qualitatively resembling experimentally demonstrated closed-loop improvement. We then propose a novel, modified decoder update rule which is aware of the fact that the encoder is also changing and show it can improve simulated co-adaptation dynamics. Our modeling approach offers promise for gaining insights into co-adaptation as well as improving user learning of BCI control in practical settings.
3 0.91072279 76 nips-2013-Correlated random features for fast semi-supervised learning
Author: Brian McWilliams, David Balduzzi, Joachim Buhmann
Abstract: This paper presents Correlated Nystr¨ m Views (XNV), a fast semi-supervised alo gorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contains accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude. 1
4 0.91061753 3 nips-2013-A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables
Author: Jing Xiang, Seyoung Kim
Abstract: We address the problem of learning a sparse Bayesian network structure for continuous variables in a high-dimensional space. The constraint that the estimated Bayesian network structure must be a directed acyclic graph (DAG) makes the problem challenging because of the huge search space of network structures. Most previous methods were based on a two-stage approach that prunes the search space in the first stage and then searches for a network structure satisfying the DAG constraint in the second stage. Although this approach is effective in a lowdimensional setting, it is difficult to ensure that the correct network structure is not pruned in the first stage in a high-dimensional setting. In this paper, we propose a single-stage method, called A* lasso, that recovers the optimal sparse Bayesian network structure by solving a single optimization problem with A* search algorithm that uses lasso in its scoring system. Our approach substantially improves the computational efficiency of the well-known exact methods based on dynamic programming. We also present a heuristic scheme that further improves the efficiency of A* lasso without significantly compromising the quality of solutions. We demonstrate our approach on data simulated from benchmark Bayesian networks and real data. 1
5 0.91015613 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
Author: Tzu-Kuo Huang, Jeff Schneider
Abstract: Learning dynamic models from observed data has been a central issue in many scientific studies or engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such Alzheimer’s, or certain biological processes. Existing methods for learning dynamic model from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method [2] to provably recover firstorder Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. Preliminary simulation results confirm our theoretical findings. 1
6 0.9079085 103 nips-2013-Efficient Exploration and Value Function Generalization in Deterministic Systems
7 0.90726161 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
8 0.90557945 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting
9 0.90553713 106 nips-2013-Eluder Dimension and the Sample Complexity of Optimistic Exploration
10 0.90536499 230 nips-2013-Online Learning with Costly Features and Labels
11 0.90375441 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
12 0.89961749 252 nips-2013-Predictive PAC Learning and Process Decompositions
13 0.8984043 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings
14 0.89818877 118 nips-2013-Fast Determinantal Point Process Sampling with Application to Clustering
15 0.89743757 66 nips-2013-Computing the Stationary Distribution Locally
16 0.89696735 249 nips-2013-Polar Operators for Structured Sparse Estimation
17 0.89463389 32 nips-2013-Aggregating Optimistic Planning Trees for Solving Markov Decision Processes
18 0.8946141 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
19 0.89168364 42 nips-2013-Auditing: Active Learning with Outcome-Dependent Query Costs
20 0.8907451 235 nips-2013-Online learning in episodic Markovian decision processes by relative entropy policy search