nips nips2011 nips2011-144 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Tzu-kuo Huang, Jeff G. Schneider
Abstract: Vector Auto-regressive models (VAR) are useful tools for analyzing time series data. In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. [sent-6, score-0.526]
2 In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. [sent-7, score-0.19]
3 In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. [sent-8, score-0.14]
4 We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. [sent-9, score-0.629]
5 Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods. [sent-10, score-0.14]
6 Recently, researchers in computational biology applied VAR models in the analysis of genomic time series [12], and found interesting results that were unknown previously. [sent-13, score-0.139]
7 In some situations, the dynamic process of interest may evolve slowly over time, such as the progression of Alzheimer’s or Parkinson’s diseases, and researchers may need to spend months or even years tracking the dynamic process to obtain enough time series data for analysis. [sent-15, score-0.344]
8 In other situations, the dynamic process of interest may not be able to undergo repetitive measurements, so researchers have to measure multiple instances of the same process while maintaining synchronization among these instances. [sent-16, score-0.244]
9 In their study, [19] measured expression profiles of yeast genes along consecutive metabolic cycles. [sent-18, score-0.148]
10 In order to obtain reliable time series data, they spent a lot of effort developing a stable environment to synchronize the cells during the metabolic cycles. [sent-20, score-0.172]
11 While obtaining reliable time series can be difficult, we observe that it is often easier to collect non-sequence samples, or snapshots of the dynamic process of interest. [sent-24, score-0.408]
12 Or in gene expression analysis, current technology already enables large-scale collection of static gene expression data. [sent-27, score-0.275]
13 Previously [6] investigated ways to extract dynamics from such static gene expression data, and more recently [8, 9] proposed methods for learning first-order dynamic models from general non-sequence data. [sent-28, score-0.277]
14 We assume that a small number of sequence samples and a large number of non-sequence samples are available. [sent-31, score-0.165]
15 Our aim is to rely on the few sequence samples to obtain a rough estimate of the model, while refining this rough estimate using the non-sequence samples. [sent-32, score-0.199]
16 We consider the following first-order p-dimensional vector auto-regressive model: x^{t+1} = x^t A + ǫ^{t+1}, (1) where x^t ∈ R^{1×p} is the state vector at time t, A ∈ R^{p×p} is the transition matrix, and ǫ^t is a white-noise process with a time-invariant variance σ²I. [sent-33, score-0.21]
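The following is a minimal sketch, not from the paper, showing how a stable instance of model (1) can be simulated; the function name, the spectral rescaling used to enforce stability, and all parameter values are illustrative assumptions.

```python
import numpy as np

def simulate_var1(A, sigma, T, rng):
    """Simulate x^{t+1} = x^t A + eps^{t+1} with isotropic Gaussian noise (variance sigma^2 I)."""
    p = A.shape[0]
    X = np.zeros((T, p))
    X[0] = rng.normal(scale=sigma, size=p)        # arbitrary starting state
    for t in range(T - 1):
        X[t + 1] = X[t] @ A + rng.normal(scale=sigma, size=p)
    return X

rng = np.random.default_rng(0)
p, sigma = 10, 1.0
A = rng.normal(size=(p, p))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # rescale so every eigenvalue has modulus < 1
X_seq = simulate_var1(A, sigma, T=200, rng=rng)   # rows are the states x^1, ..., x^T
```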
17 Given a sequence sample, a common estimation method for A is the least-square estimator, whose properties have been studied extensively. [sent-34, score-0.137]
18 We assume the model is stable, i.e., the eigenvalues of A have modulus less than one. [sent-39, score-0.234]
19 As a result, the process (1) has a stationary distribution, whose covariance Q is determined by the following discrete-time Lyapunov equation: A⊤QA + σ²I = Q. (2) [sent-40, score-0.34]
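Given a stable A and noise variance σ², the stationary covariance Q in (2) can be computed with a standard discrete-time Lyapunov solver. The sketch below assumes SciPy; scipy.linalg.solve_discrete_lyapunov(a, q) solves a X a⊤ − X + q = 0, so passing a = A⊤ matches equation (2).

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def stationary_covariance(A, sigma2):
    """Solve A^T Q A + sigma2*I = Q for the stationary covariance Q of model (1)."""
    p = A.shape[0]
    return solve_discrete_lyapunov(A.T, sigma2 * np.eye(p))
```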
20 If the noise process ǫt follows a normal distribution, the stationary distribution also follows a normal distribution, with covariance Q determined as above. [sent-44, score-0.34]
21 What motivates this work is that the estimation of Q requires only samples drawn from the stationary distribution rather than sequence data. [sent-46, score-0.315]
22 We describe the proposed methods in Section 2, and demonstrate their performance through experiments on synthetic and video data in Section 3. [sent-49, score-0.14]
23 Our finding in short is that when the amount of sequence data is small and our VAR model assumption is valid, the proposed methods of incorporating non-sequence data into estimation significantly improve over standard methods, which use only the sequence data. [sent-50, score-0.29]
24 Given a sequence sample {x^t}_{t=1}^T, the standard least-square estimator for the transition matrix A is the solution to the following minimization problem: min_A ‖Y − XA‖²_F, (3) where Y⊤ := [(x²)⊤ (x³)⊤ ··· (x^T)⊤], X⊤ := [(x¹)⊤ (x²)⊤ ··· (x^{T−1})⊤], and ‖·‖_F denotes the matrix Frobenius norm. [sent-53, score-0.155]
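A minimal sketch of the least-square estimator (3), with an optional ridge term that is useful in the p > T regime discussed next; the function name and interface are ours, and X_seq is assumed to hold the sequence x^1, ..., x^T as rows.

```python
import numpy as np

def least_square_var(X_seq, lam_ridge=0.0):
    """Closed-form solution of min_A ||Y - X A||_F^2 + lam_ridge * ||A||_F^2."""
    X, Y = X_seq[:-1], X_seq[1:]                  # X stacks x^1..x^{T-1}, Y stacks x^2..x^T
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam_ridge * np.eye(p), X.T @ Y)
```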
25 When p > T, which is often the case in modern time series modeling tasks, the least-square problem (3) has multiple solutions all achieving zero squared error, and the resulting estimator overfits the data. [sent-54, score-0.14]
26 A common remedy is adding a penalty term on A to (3) and minimizing the resulting regularized sum of squared errors. [sent-55, score-0.15]
27 Usual penalty terms include the ridge penalty ‖A‖²_F and the sparse penalty ‖A‖₁ := Σ_{i,j} |A_{ij}|. [sent-56, score-0.862]
28 Now suppose we also have a set of non-sequence observations {z_i}_{i=1}^n drawn independently from the stationary distribution of (1). [sent-57, score-0.144]
29 As described in Section 1, the size n of the non-sequence sample can usually be much larger than the size T of the sequence data. [sent-59, score-0.138]
30 More precisely, in addition to the usual ridge or sparse penalty terms, we also consider the following regularization: ‖A⊤QA + σ²I − Q‖²_F, (4) which we refer to as the Lyapunov penalty. [sent-64, score-0.562]
31 To compare (4) with the ridge penalty and the sparse penalty, we consider (3) as a multiple-response regression problem and view the i-th column of A as the regression coefficient vector for the i-th output dimension. [sent-65, score-0.658]
32 From this viewpoint, we immediately see that both the ridge and the sparse penalizations treat the p regression problems as unrelated. [sent-66, score-0.495]
33 In contrast, the Lyapunov penalty incorporates relations between pairs of columns of A by using a covariance estimate Q. [sent-67, score-0.344]
34 In other words, although the non-sequence sample does not provide direct information about the individual regression problems, it does reveal how the regression problems are related to one another. [sent-68, score-0.137]
35 To illustrate how the Lyapunov penalty may help to improve learning, we give an example in Figure 1. [sent-69, score-0.15]
36 We generate a sequence of 4 points, draw a non-sequence sample of 20 points independently from the stationary distribution and obtain the sample covariance Q. [sent-75, score-0.483]
37 We fix the second column of A but vary the first, and plot in Figure 1(a) the resulting level sets of the sum of squared errors on the sequence (SSE) and the ridge penalty (Ridge), and in Figure 1(b) the level sets of the Lyapunov penalty (Lyap). [sent-76, score-0.749]
38 To see the behavior of the ridge regression, we trace out a path of the ridge regression solution by varying the penalization parameter, as indicated by the red-to-black curve in Figure 1(a). [sent-78, score-0.874]
39 This path is quite far from the true model, due to insufficient sequence data. [sent-79, score-0.136]
40 Thus, neither ridge regression nor the Lyapunov penalty can be used on its own to estimate the true model well. [sent-81, score-0.623]
41 This demonstrates how the ridge regression and the Lyapunov penalty may complement each other: the former by itself gives an inaccurate estimation of the true model, but is just enough to identify a good model from the many candidate local minima provided by the latter. [sent-83, score-0.662]
42 In the following we describe our proposed methods for incorporating the Lyapunov penalty (4) into ridge and sparse least-square estimation. [sent-84, score-0.618]
43 We also discuss robust estimation for the covariance Q. [sent-85, score-0.2]
44 2.1 Ridge and Lyapunov penalty. Here we estimate A by solving the following problem: min_A (1/2)‖Y − XA‖²_F + (λ₁/2)‖A‖²_F + (λ₂/4)‖A⊤QA + σ²I − Q‖²_F, (6) where Q is a covariance estimate obtained from the non-sequence sample. [sent-87, score-0.378]
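A plain gradient-descent sketch for objective (6), not the paper's own optimizer. The gradient of the Lyapunov term evaluates to λ₂ Q A (A⊤QA + σ²I − Q); the ridge warm start, fixed step size, and iteration count are illustrative assumptions.

```python
import numpy as np

def ridge_lyapunov_estimate(X_seq, Q, sigma2, lam1, lam2, step=1e-3, n_iter=2000):
    """Gradient descent on 0.5*||Y - XA||_F^2 + 0.5*lam1*||A||_F^2
    + 0.25*lam2*||A^T Q A + sigma2*I - Q||_F^2 (objective (6))."""
    X, Y = X_seq[:-1], X_seq[1:]
    p = X.shape[1]
    I = np.eye(p)
    A = np.linalg.solve(X.T @ X + lam1 * I, X.T @ Y)           # ridge warm start (lam1 > 0)
    for _ in range(n_iter):
        R = A.T @ Q @ A + sigma2 * I - Q                       # Lyapunov residual
        grad = X.T @ (X @ A - Y) + lam1 * A + lam2 * Q @ A @ R
        A = A - step * grad
    return A
```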
45 2.2 Sparse and Lyapunov penalty. Sparse learning for vector auto-regressive models has become a useful tool in many modern time series modeling tasks, where the number p of states in the system is usually larger than the length T of the time series. [sent-95, score-0.235]
46 For example, an important problem in computational biology is to understand the progression of certain biological processes from some measurements, such as temporal gene expression data. [sent-96, score-0.233]
47 Instead of adding a sparse penalty on A to the objective function, we impose a constraint on the ℓ1 norm of A. [sent-100, score-0.21]
48 Both the penalty and the constraint formulations have been considered in the sparse learning literature, and shown to be equivalent in the case of a convex objective. [sent-101, score-0.21]
49 In contrast, the penalty formulation leads to a non-smooth and non-convex optimization problem, which is difficult to solve with standard methods for sparse learning. [sent-103, score-0.21]
50 [2] proved that there always exists η^{(k)} = β^{r_k} > 0 satisfying (13), and every limit point of {A^{(k)}}_{k=0}^∞ is a stationary point of (9). [sent-112, score-0.172]
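The constrained (ℓ1-ball) formulation can be sketched with projected gradient descent and an Armijo-style backtracking step η^{(k)} = β^{r_k}; this is a sketch in the spirit of the procedure above, not the paper's exact algorithm, and the projection routine, objective split, and constants are assumptions.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of a flattened array onto {x : ||x||_1 <= radius}, radius > 0."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def sparse_lyapunov_estimate(X_seq, Q, sigma2, lam2, l1_radius, beta=0.5, n_iter=500):
    """Projected gradient with Armijo backtracking for
    min_{||A||_1 <= l1_radius} 0.5*||Y - XA||_F^2 + 0.25*lam2*||A^T Q A + sigma2*I - Q||_F^2."""
    X, Y = X_seq[:-1], X_seq[1:]
    p = X.shape[1]
    I = np.eye(p)

    def objective(A):
        R = A.T @ Q @ A + sigma2 * I - Q
        return 0.5 * np.sum((Y - X @ A) ** 2) + 0.25 * lam2 * np.sum(R ** 2)

    def gradient(A):
        R = A.T @ Q @ A + sigma2 * I - Q
        return X.T @ (X @ A - Y) + lam2 * Q @ A @ R

    A = np.zeros((p, p))
    for _ in range(n_iter):
        g, f_old, eta = gradient(A), objective(A), 1.0
        while True:                                            # backtrack: eta = beta^r
            A_new = project_l1_ball((A - eta * g).ravel(), l1_radius).reshape(p, p)
            if objective(A_new) <= f_old - 1e-4 * np.sum(g * (A - A_new)):
                break
            eta *= beta
        A = A_new
    return A
```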
51 2.3 Robust estimation of covariance matrices. To obtain a good estimator for A using the proposed methods, we need a good estimator for the covariance of the stationary distribution of (1). [sent-123, score-0.645]
52 Given an independent sample {z_i}_{i=1}^n drawn from the stationary distribution, the sample covariance is defined as S := (1/(n−1)) Σ_{i=1}^n (z_i − z̄)⊤(z_i − z̄), where z̄ := (1/n) Σ_{i=1}^n z_i. (14) [sent-124, score-0.435]
53 Although unbiased, the sample covariance is known to be vulnerable to outliers, and ill-conditioned when the number of sample points n is smaller than the dimension p. [sent-125, score-0.242]
54 Most of these results originate from the idea of shrinkage estimators, which shrink the covariance matrix towards some target covariance with a simple structure, such as a diagonal matrix. [sent-128, score-0.418]
55 It has been shown, e.g., in [17, 10], that shrinking the sample covariance can achieve a smaller mean-squared error (MSE). [sent-131, score-0.201]
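The paper's specific robust covariance estimator (its equations (18) and (19)) is not recoverable from this extract, so the sketch below uses a generic shrinkage toward a scaled identity as an illustrative stand-in; the shrinkage weight alpha is supplied by the user (data-driven choices in the Ledoit-Wolf spirit, e.g. sklearn.covariance.LedoitWolf, are an alternative).

```python
import numpy as np

def shrinkage_covariance(Z, alpha):
    """Shrink the sample covariance of the n x p non-sequence sample Z toward mu*I, alpha in [0, 1]."""
    Zc = Z - Z.mean(axis=0)
    S = Zc.T @ Zc / (Z.shape[0] - 1)              # unbiased sample covariance, as in (14)
    mu = np.trace(S) / S.shape[0]                 # scale of the identity target
    return (1.0 - alpha) * S + alpha * mu * np.eye(S.shape[0])
```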
56 [Figure 2: Testing performances and eigenvalues in modulus for the dense model, panels (a)-(d).] 3 Experiments. To evaluate the proposed methods, we conduct experiments on synthetic and video data. [sent-136, score-0.672]
57 In both sets of experiments we use the following two performance measures for a learnt model A: Normalized error: (1/(T−1)) Σ_{t=1}^{T−1} ‖x^{t+1} − x^t A‖² / ‖x^{t+1} − x^t‖². [sent-137, score-0.152]
58 Cosine score: (1/(T−1)) Σ_{t=1}^{T−1} (x^{t+1} − x^t)⊤(x^t A − x^t) / (‖x^{t+1} − x^t‖ ‖x^t A − x^t‖). [sent-138, score-0.768]
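The two measures translate directly into code; this is a straightforward transcription with our own function names, assuming X_seq holds the testing sequence as rows.

```python
import numpy as np

def normalized_error(X_seq, A):
    """Mean of ||x^{t+1} - x^t A||^2 / ||x^{t+1} - x^t||^2 over the sequence."""
    X, Y = X_seq[:-1], X_seq[1:]
    return np.mean(np.sum((Y - X @ A) ** 2, axis=1) / np.sum((Y - X) ** 2, axis=1))

def cosine_score(X_seq, A):
    """Mean cosine between observed changes x^{t+1} - x^t and predicted changes x^t A - x^t."""
    X, Y = X_seq[:-1], X_seq[1:]
    d_true, d_pred = Y - X, X @ A - X
    norms = np.linalg.norm(d_true, axis=1) * np.linalg.norm(d_pred, axis=1)
    return np.mean(np.sum(d_true * d_pred, axis=1) / norms)
```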
59 To give an idea of how a good estimate Â would perform under these two measures, we point out that a constant prediction x̂^{t+1} = x^t leads to a normalized error of 1, and a random-walk prediction x̂^{t+1} = x^t + ǫ^{t+1}, ǫ^{t+1} being a white-noise process, results in a nearly-zero cosine score. [sent-139, score-0.516]
60 Thus, when the true model is more than a simple random walk, a good estimate Â should achieve a normalized error much smaller than 1 and a cosine score well above 0. [sent-140, score-0.344]
61 We also note that the cosine score is upper-bounded by 1. [sent-141, score-0.233]
62 In experiments on synthetic data we have the true transition matrix A, so we consider a third criterion, the matrix error: ‖Â − A‖_F / ‖A‖_F. [sent-142, score-0.177]
63 To choose the hyper-parameters λ1 , λ2 and σ 2 , we split the training sequence into two halves and use the second half as the validation sequence. [sent-144, score-0.192]
64 Once we find the best hyper-parameters according to the validation performance, we train a model on the full training sequence and predict on the testing sequence. [sent-145, score-0.227]
65 Both models are stable, and the stationary distribution for each model is a zero-mean Gaussian. [sent-157, score-0.144]
66 We obtain the covariance Q of each stationary distribution by solving the Lyapunov equation (2). [sent-158, score-0.329]
67 For a single experiment, we generate a training sequence and a testing sequence, both initialized from the stationary distribution, and draw a non-sequence sample independently from the stationary distribution. [sent-159, score-0.522]
68 Under each combination of T and n, we compare the proposed Lyapunov penalization method with the baseline approach of penalized least square, which uses only the sequence data. [sent-161, score-0.289]
69 We run 10 such experiments for the dense model and 5 for the sparse model, and report the overall performances of both the proposed and the baseline methods. [sent-163, score-0.18]
70 The ridge regression approach and the proposed Lyapunov penalization method (6) are abbreviated as Ridge and Lyap, respectively. [sent-167, score-0.588]
71 For normalized error and cosine score, we also report the performance of the true A on testing sequences. [sent-168, score-0.338]
72 We observe that Lyap improves over Ridge more significantly when the training sequence length T is small (≤ 200) and the non-sequence sample size n is large (≥ 400). [sent-169, score-0.184]
73 But with the true stationary covariance Q, Lyap outperforms Ridge significantly for all T . [sent-171, score-0.343]
74 When n is small, the covariance estimate Q is far from the true Q and the Lyapunov penalty does not provide useful information about A. [sent-172, score-0.383]
75 We note that if instead of the robust covariance estimate in (18) and (19) we use the sample covariance, the performance of Lyap can be marginally worse than Ridge when n is small. [sent-175, score-0.235]
76 As a qualitative assessment of the estimated transition matrices, in Figure 2(d) we plot the eigenvalues in modulus of the true A and the A’s obtained by different methods when T = 50 and n = 1600. [sent-177, score-0.319]
77 Experimental results for the sparse model. We give boxplots of the performance measures in the 5 experiments in Figures 3(a) to 3(c), and the eigenvalues in modulus of the true A and some A’s in Figure 3(d). [sent-182, score-0.395]
78 A major difference lies in the impact of the Lyapunov penalization on the spectrum of A, as revealed in Figure 3(d). [sent-186, score-0.156]
79 When T is as small as 25, the sparse least-square method shrinks all the eigenvalues but still keeps most of them non-zero, while Lyap with a non-sequence sample of size 1600 over-estimates the first few largest eigenvalues in modulus but shrinks the rest to have very small modulus. [sent-187, score-0.454]
80 We may thus need an even better covariance estimate for the sparse model. [sent-189, score-0.254]
81 [Figure 4: Results on the pendulum video data: (a) the pendulum, (b) normalized error, (c) cosine score, each for T = 6, 10, 20, 50.] [sent-196, score-0.184]
82 3.2 Video Data. We test our methods using a video sequence of a periodically swinging pendulum³, which consists of 500 frames of 75-by-80 grayscale images. [sent-197, score-0.223]
83 To check whether a VAR model is a suitable choice, we estimate a transition matrix from the first 400 frames by ridge regression while choosing the penalization parameter on the next 50 frames, and predict on the last 50 frames. [sent-202, score-0.684]
84 0156, and the testing normalized error and cosine score are 0. [sent-204, score-0.344]
85 97, respectively, suggesting that the dynamics of the video sequence is well-captured by a VAR model. [sent-206, score-0.195]
86 We compare the proposed method (6) with ridge regression for training sequences of length T ∈ {6, 10, 20, 50}, and treat the last 50 frames as the testing sequence. [sent-207, score-0.617]
87 For both methods, we split the training sequence into two halves and use the second half as a validation sequence. [sent-208, score-0.192]
88 For the proposed method, we simulate a non-sequence sample by randomly choosing 300 frames from between the (T + 1)-st frame and the 450-th frame without replacement. [sent-209, score-0.189]
89 The testing normalized errors and cosine scores of both methods are given in Figures 4(b) and 4(c). [sent-211, score-0.299]
90 For the proposed method, we report the mean performance measures over the 10 simulated non-sequence samples, along with standard deviations. [sent-212, score-0.206]
91 When T ≤ 20, which is close to the period, the proposed method outperforms ridge regression very significantly, except at T = 10, where the cosine score of Lyap is only barely better than that of Ridge. [sent-213, score-0.664]
92 As for λ1 and λ2, their values decrease respectively from 512 and 2,048 to less than 32 as T increases, but since we fix the amount of non-sequence data, the interaction between their value changes is less clear than on the synthetic data. [sent-216, score-0.155]
93 4 Conclusion. We propose to improve penalized least-square estimation of VAR models by incorporating non-sequence data, which are assumed to be samples drawn from the stationary distribution of the underlying VAR model. [sent-217, score-0.399]
94 We construct a novel penalization term based on the discrete-time Lyapunov equation concerning the covariance (estimate) of the stationary distribution. [sent-218, score-0.451]
95 Preliminary experimental results demonstrate that our methods can improve significantly over standard penalized least-square methods when there is only a small amount of sequence data but abundant non-sequence data, and when the model assumption is valid. [sent-219, score-0.136]
96 Also, we may consider noise processes ǫt with more general covariances, and incorporate the noise covariance estimation into the proposed Lyapunov penalization scheme. [sent-221, score-0.353]
97 3 A similar video sequence has been used in [16]. [sent-223, score-0.168]
98 Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. [sent-284, score-0.227]
99 A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. [sent-312, score-0.274]
100 Estimation of a covariance matrix using the reference prior. [sent-344, score-0.187]
wordName wordTfidf (topN-words)
[('lyapunov', 0.472), ('lyap', 0.443), ('ridge', 0.352), ('cosine', 0.188), ('covariance', 0.16), ('var', 0.16), ('penalty', 0.15), ('stationary', 0.144), ('modulus', 0.139), ('xt', 0.128), ('penalization', 0.122), ('nonsequence', 0.117), ('rij', 0.113), ('sequence', 0.097), ('qa', 0.096), ('eigenvalues', 0.095), ('sse', 0.094), ('alsq', 0.093), ('testing', 0.073), ('video', 0.071), ('wkij', 0.07), ('gene', 0.069), ('xa', 0.068), ('dynamic', 0.067), ('series', 0.061), ('sparse', 0.06), ('dense', 0.058), ('estimator', 0.055), ('frames', 0.055), ('expression', 0.054), ('zi', 0.049), ('regression', 0.048), ('metabolic', 0.048), ('shrinkage', 0.047), ('ledoit', 0.047), ('yeast', 0.046), ('transition', 0.046), ('score', 0.045), ('biology', 0.043), ('progression', 0.042), ('armijo', 0.041), ('parkinson', 0.041), ('sample', 0.041), ('reliable', 0.04), ('estimation', 0.04), ('true', 0.039), ('penalized', 0.039), ('synthetic', 0.038), ('normalized', 0.038), ('alzheimer', 0.038), ('wiesel', 0.038), ('boxplots', 0.038), ('nance', 0.038), ('halves', 0.038), ('process', 0.036), ('repetitive', 0.035), ('synchronization', 0.035), ('abbreviated', 0.035), ('researchers', 0.035), ('treat', 0.035), ('estimate', 0.034), ('validation', 0.034), ('samples', 0.034), ('spectrum', 0.034), ('arc', 0.034), ('pendulum', 0.034), ('modelling', 0.033), ('minima', 0.033), ('gradient', 0.032), ('bern', 0.032), ('snapshots', 0.032), ('collect', 0.032), ('projection', 0.031), ('frame', 0.031), ('proposed', 0.031), ('descent', 0.031), ('performances', 0.031), ('mij', 0.03), ('sij', 0.03), ('figures', 0.029), ('static', 0.029), ('rk', 0.028), ('dynamics', 0.027), ('matrix', 0.027), ('insuf', 0.026), ('equation', 0.025), ('incorporating', 0.025), ('temporal', 0.025), ('measurements', 0.024), ('modern', 0.024), ('shrink', 0.024), ('huang', 0.024), ('measures', 0.024), ('observe', 0.023), ('scienti', 0.023), ('covariances', 0.023), ('stable', 0.023), ('training', 0.023), ('wij', 0.023), ('contrary', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
Author: Tzu-kuo Huang, Jeff G. Schneider
Abstract: Vector Auto-regressive models (VAR) are useful tools for analyzing time series data. In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods. 1
2 0.10684612 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
3 0.095743723 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik
Abstract: The 1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.
4 0.088932522 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
Author: Dan Garber, Elad Hazan
Abstract: In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis alongside with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric. 1
5 0.082601532 258 nips-2011-Sparse Bayesian Multi-Task Learning
Author: Shengbo Guo, Onno Zoeter, Cédric Archambeau
Abstract: We propose a new sparse Bayesian model for multi-task regression and classification. The model is able to capture correlations between tasks, or more specifically a low-rank approximation of the covariance matrix, while being sparse in the features. We introduce a general family of group sparsity inducing priors based on matrix-variate Gaussian scale mixtures. We show the amount of sparsity can be learnt from the data by combining an approximate inference approach with type II maximum likelihood estimation of the hyperparameters. Empirical evaluations on data sets from biology and vision demonstrate the applicability of the model, where on both regression and classification tasks it achieves competitive predictive performance compared to previously proposed methods. 1
6 0.079039767 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
7 0.0758214 301 nips-2011-Variational Gaussian Process Dynamical Systems
8 0.075809009 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
9 0.074353777 220 nips-2011-Prediction strategies without loss
10 0.072149947 4 nips-2011-A Convergence Analysis of Log-Linear Training
11 0.06965299 204 nips-2011-Online Learning: Stochastic, Constrained, and Smoothed Adversaries
12 0.06825728 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
13 0.06301713 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
14 0.06228238 259 nips-2011-Sparse Estimation with Structured Dictionaries
15 0.062029447 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
16 0.060644429 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
17 0.060097378 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
18 0.05993136 145 nips-2011-Learning Eigenvectors for Free
19 0.056953698 303 nips-2011-Video Annotation and Tracking with Active Learning
20 0.054682188 131 nips-2011-Inference in continuous-time change-point models
topicId topicWeight
[(0, 0.175), (1, -0.012), (2, 0.008), (3, -0.085), (4, -0.039), (5, 0.006), (6, 0.041), (7, -0.0), (8, 0.064), (9, 0.042), (10, 0.085), (11, -0.075), (12, 0.05), (13, 0.026), (14, -0.03), (15, 0.036), (16, 0.017), (17, 0.019), (18, 0.086), (19, -0.053), (20, 0.03), (21, 0.016), (22, 0.029), (23, 0.036), (24, -0.08), (25, -0.082), (26, -0.079), (27, 0.012), (28, 0.045), (29, -0.046), (30, -0.052), (31, -0.005), (32, 0.059), (33, -0.026), (34, -0.092), (35, 0.006), (36, 0.081), (37, 0.0), (38, -0.092), (39, 0.029), (40, -0.025), (41, 0.023), (42, -0.007), (43, 0.032), (44, 0.01), (45, -0.018), (46, -0.018), (47, -0.044), (48, 0.028), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.92783666 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
Author: Tzu-kuo Huang, Jeff G. Schneider
Abstract: Vector Auto-regressive models (VAR) are useful tools for analyzing time series data. In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods. 1
2 0.7508899 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
Author: Cho-jui Hsieh, Inderjit S. Dhillon, Pradeep K. Ravikumar, Mátyás A. Sustik
Abstract: The 1 regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix, or alternatively the underlying graph structure of a Gaussian Markov Random Field, from very limited samples. We propose a novel algorithm for solving the resulting optimization problem which is a regularized log-determinant program. In contrast to other state-of-the-art methods that largely use first order gradient information, our algorithm is based on Newton’s method and employs a quadratic approximation, but with some modifications that leverage the structure of the sparse Gaussian MLE problem. We show that our method is superlinearly convergent, and also present experimental results using synthetic and real application data that demonstrate the considerable improvements in performance of our method when compared to other state-of-the-art methods.
3 0.68391997 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise
Author: Oliver Stegle, Christoph Lippert, Joris M. Mooij, Neil D. Lawrence, Karsten M. Borgwardt
Abstract: Inference in matrix-variate Gaussian models has major applications for multioutput prediction and joint learning of row and column covariances from matrixvariate data. Here, we discuss an approach for efficient inference in such models that explicitly account for iid observation noise. Computational tractability can be retained by exploiting the Kronecker product between row and column covariance matrices. Using this framework, we show how to generalize the Graphical Lasso in order to learn a sparse inverse covariance between features while accounting for a low-rank confounding covariance between samples. We show practical utility on applications to biology, where we model covariances with more than 100,000 dimensions. We find greater accuracy in recovering biological network structures and are able to better reconstruct the confounders. 1
4 0.62874305 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
Author: Miles Lopes, Laurent Jacob, Martin J. Wainwright
Abstract: We consider the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing for the data dimension p to exceed the sample size n. Our contribution is a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T 2 statistic. Working within a high-dimensional framework that allows (p, n) → ∞, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Using ROC curves generated from simulated data, we demonstrate superior performance against competing tests in the parameter regimes anticipated by our theoretical results. Lastly, we illustrate an advantage of our procedure with comparisons on a high-dimensional gene expression dataset involving the discrimination of different types of cancer. 1
Author: Po-ling Loh, Martin J. Wainwright
Abstract: Although the standard formulations of prediction problems involve fully-observed and noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing data, possibly involving dependencies. We study these issues in the context of high-dimensional sparse linear regression, and propose novel estimators for the cases of noisy, missing, and/or dependent data. Many standard approaches to noisy or missing data, such as those using the EM algorithm, lead to optimization problems that are inherently non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. While our approach also involves optimizing non-convex programs, we are able to both analyze the statistical error associated with any global optimum, and prove that a simple projected gradient descent algorithm will converge in polynomial time to a small neighborhood of the set of global minimizers. On the statistical side, we provide non-asymptotic bounds that hold with high probability for the cases of noisy, missing, and/or dependent data. On the computational side, we prove that under the same types of conditions required for statistical consistency, the projected gradient descent algorithm will converge at geometric rates to a near-global minimizer. We illustrate these theoretical predictions with simulations, showing agreement with the predicted scalings. 1
6 0.56124896 145 nips-2011-Learning Eigenvectors for Free
7 0.54321253 258 nips-2011-Sparse Bayesian Multi-Task Learning
8 0.54087651 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
9 0.53979796 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
10 0.52643365 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
11 0.524975 39 nips-2011-Approximating Semidefinite Programs in Sublinear Time
12 0.51436657 4 nips-2011-A Convergence Analysis of Log-Linear Training
13 0.50142545 68 nips-2011-Demixed Principal Component Analysis
14 0.49291825 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
15 0.49042299 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
16 0.48537546 131 nips-2011-Inference in continuous-time change-point models
17 0.48269176 301 nips-2011-Variational Gaussian Process Dynamical Systems
18 0.47830796 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
19 0.47672218 289 nips-2011-Trace Lasso: a trace norm regularization for correlated designs
20 0.45525882 65 nips-2011-Convergent Fitted Value Iteration with Linear Function Approximation
topicId topicWeight
[(0, 0.024), (4, 0.043), (20, 0.034), (26, 0.019), (27, 0.012), (31, 0.077), (33, 0.021), (43, 0.081), (45, 0.089), (57, 0.056), (65, 0.013), (74, 0.071), (83, 0.059), (84, 0.028), (85, 0.242), (99, 0.053)]
simIndex simValue paperId paperTitle
1 0.78977019 35 nips-2011-An ideal observer model for identifying the reference frame of objects
Author: Joseph L. Austerweil, Abram L. Friesen, Thomas L. Griffiths
Abstract: The object people perceive in an image can depend on its orientation relative to the scene it is in (its reference frame). For example, the images of the symbols × and + differ by a 45 degree rotation. Although real scenes have multiple images and reference frames, psychologists have focused on scenes with only one reference frame. We propose an ideal observer model based on nonparametric Bayesian statistics for inferring the number of reference frames in a scene and their parameters. When an ambiguous image could be assigned to two conflicting reference frames, the model predicts two factors should influence the reference frame inferred for the image: The image should be more likely to share the reference frame of the closer object (proximity) and it should be more likely to share the reference frame containing the most objects (alignment). We confirm people use both cues using a novel methodology that allows for easy testing of human reference frame inference. 1
same-paper 2 0.76428831 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
Author: Tzu-kuo Huang, Jeff G. Schneider
Abstract: Vector Auto-regressive models (VAR) are useful tools for analyzing time series data. In quite a few modern time series modelling tasks, the collection of reliable time series turns out to be a major challenge, either due to the slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In those situations, however, we observe that it is often easier to collect a large amount of non-sequence samples, or snapshots of the dynamic process of interest. In this work, we assume a small amount of time series data are available, and propose methods to incorporate non-sequence data into penalized least-square estimation of VAR models. We consider non-sequence data as samples drawn from the stationary distribution of the underlying VAR model, and devise a novel penalization scheme based on the Lyapunov equation concerning the covariance of the stationary distribution. Experiments on synthetic and video data demonstrate the effectiveness of the proposed methods. 1
3 0.76064956 299 nips-2011-Variance Penalizing AdaBoost
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: This paper proposes a novel boosting algorithm called VadaBoost which is motivated by recent empirical Bernstein bounds. VadaBoost iteratively minimizes a cost function that balances the sample mean and the sample variance of the exponential loss. Each step of the proposed algorithm minimizes the cost efficiently by providing weighted data to a weak learner rather than requiring a brute force evaluation of all possible weak learners. Thus, the proposed algorithm solves a key limitation of previous empirical Bernstein boosting methods which required brute force enumeration of all possible weak learners. Experimental results confirm that the new algorithm achieves the performance improvements of EBBoost yet goes beyond decision stumps to handle any weak learner. Significant performance gains are obtained over AdaBoost for arbitrary weak learners including decision trees (CART). 1
4 0.63132006 271 nips-2011-Statistical Tests for Optimization Efficiency
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
5 0.60774505 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
Author: Kamiar R. Rad, Liam Paninski
Abstract: Many fundamental questions in theoretical neuroscience involve optimal decoding and the computation of Shannon information rates in populations of spiking neurons. In this paper, we apply methods from the asymptotic theory of statistical inference to obtain a clearer analytical understanding of these quantities. We find that for large neural populations carrying a finite total amount of information, the full spiking population response is asymptotically as informative as a single observation from a Gaussian process whose mean and covariance can be characterized explicitly in terms of network and single neuron properties. The Gaussian form of this asymptotic sufficient statistic allows us in certain cases to perform optimal Bayesian decoding by simple linear transformations, and to obtain closed-form expressions of the Shannon information carried by the network. One technical advantage of the theory is that it may be applied easily even to non-Poisson point process network models; for example, we find that under some conditions, neural populations with strong history-dependent (non-Poisson) effects carry exactly the same information as do simpler equivalent populations of non-interacting Poisson neurons with matched firing rates. We argue that our findings help to clarify some results from the recent literature on neural decoding and neuroprosthetic design.
6 0.60206485 258 nips-2011-Sparse Bayesian Multi-Task Learning
7 0.59823585 276 nips-2011-Structured sparse coding via lateral inhibition
8 0.59668678 183 nips-2011-Neural Reconstruction with Approximate Message Passing (NeuRAMP)
9 0.59572512 273 nips-2011-Structural equations and divisive normalization for energy-dependent component analysis
10 0.59452707 186 nips-2011-Noise Thresholds for Spectral Clustering
11 0.59409118 102 nips-2011-Generalised Coupled Tensor Factorisation
12 0.5937224 219 nips-2011-Predicting response time and error rates in visual search
13 0.59260672 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
14 0.59133881 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
15 0.59071028 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
16 0.58890301 236 nips-2011-Regularized Laplacian Estimation and Fast Eigenvector Approximation
17 0.58509696 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
18 0.58507752 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
19 0.58443534 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure
20 0.58425575 83 nips-2011-Efficient inference in matrix-variate Gaussian models with \iid observation noise