nips nips2007 nips2007-179 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Han Liu, Larry Wasserman, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Our methods combine ideas from sparse linear modeling and additive nonparametric regression. [sent-2, score-0.426]
2 We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. [sent-3, score-0.082]
3 A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data. [sent-4, score-0.233]
4 1 Introduction: Substantial progress has been made recently on the problem of fitting high-dimensional linear regression models of the form $Y_i = X_i^T \beta + \epsilon_i$, for $i = 1, \dots, n$. [sent-5, score-0.09]
5 Finding an estimate of $\beta$ when $p > n$ that is both statistically well-behaved and computationally efficient has proved challenging; however, the lasso estimator (Tibshirani, 1996) has been remarkably successful. [sent-10, score-0.157]
6 The lasso estimator $\hat\beta$ minimizes the $\ell_1$-penalized sum of squares $\sum_i (Y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|$ (1), with the $\ell_1$ penalty $\|\beta\|_1$ encouraging sparse solutions in which many components $\beta_j$ are zero. [sent-11, score-0.338]
7 The empirical success of this estimator has recently been backed up by results confirming that it has strong theoretical properties; see (Greenshtein and Ritov, 2004; Zhao and Yu, 2007; Meinshausen and Yu, 2006; Wainwright, 2006). [sent-12, score-0.074]
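As a concrete reference point for the sparse linear case, here is a minimal coordinate-descent sketch of the penalized criterion in (1); this is our own illustration (the function names and the plain cyclic coordinate-descent scheme are our choices, not the paper's), and it assumes the columns of X have been standardized.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Illustrative coordinate descent for
    sum_i (y_i - x_i^T beta)^2 + lam * sum_j |beta_j|,
    assuming the columns of X are standardized."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)                    # precomputed column norms
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta
```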
8 The nonparametric regression model $Y_i = m(X_i) + \epsilon_i$, where $m$ is a general smooth function, relaxes the strong assumptions made by a linear model, but is much more challenging in high dimensions. [sent-13, score-0.203]
9 An extension of the additive model is the functional ANOVA model $Y_i = \sum_{1 \le j \le p} m_j(X_{ij}) + \sum_{j < k} m_{j,k}(X_{ij}, X_{ik}) + \cdots + \epsilon_i$ (3), which allows interactions among the variables. [sent-15, score-0.246]
10 Unfortunately, additive models only have good statistical and computational behavior when the number of variables p is not large relative to the sample size n. [sent-16, score-0.247]
11 In this paper we introduce sparse additive models (SpAM) that extend the advantages of sparse linear models to the additive, nonparametric setting. [sent-17, score-0.515]
12 The underlying model is the same as in (2), but constraints are placed on the component functions $\{m_j\}_{1 \le j \le p}$ to simultaneously encourage smoothness of each component and sparsity across components; the penalty is similar to that used by the COSSO of Lin and Zhang (2006). [sent-18, score-0.361]
13 The SpAM estimation procedure we introduce allows the use of arbitrary nonparametric smoothing techniques, and in the case where the underlying component functions are linear, it reduces to the lasso. [sent-19, score-0.305]
14 It naturally extends to classification problems using generalized additive models. [sent-20, score-0.222]
15 We first present a population version of the procedure that intuitively suggests how sparsity is achieved. [sent-23, score-0.179]
16 In the following section we derive a backfitting procedure for solving this optimization problem in the finite sample setting. [sent-25, score-0.1]
17 To motivate our approach, we first consider a formulation that scales each component function $g_j$ by a scalar $\beta_j$, and then imposes an $\ell_1$ constraint on $\beta = (\beta_1, \dots, \beta_p)$. [sent-26, score-0.142]
18 We consider functions $f(x_1, \dots, x_p)$ that have an additive form: $f(x) = \sum_j f_j(x_j)$. [sent-40, score-0.222]
19 The standard additive model optimization problem, in the population setting, is $\min_{f_j \in \mathcal{H}_j,\, 1 \le j \le p} \mathbb{E}\big(Y - \sum_{j=1}^p f_j(X_j)\big)^2$ (5), where $m(x) = \mathbb{E}(Y \mid X = x)$ is the unknown regression function. [sent-41, score-0.358]
20 Now consider the following modification of this problem that imposes additional constraints: (P) $\min_{\beta \in \mathbb{R}^p,\, g_j \in \mathcal{H}_j} \mathbb{E}\big(Y - \sum_{j=1}^p \beta_j g_j(X_j)\big)^2$ (6a), subject to $\sum_{j=1}^p |\beta_j| \le L$ (6b) and $\mathbb{E}(g_j^2) = 1$, $j = 1, \dots, p$. [sent-42, score-0.033]
21 Intuitively, the constraint that $\beta$ lies in the $\ell_1$-ball $\{\beta : \|\beta\|_1 \le L\}$ encourages sparsity of the estimated $\beta$, just as for the parametric lasso. [sent-49, score-0.235]
22 When $\beta$ is sparse, the estimated additive function $f(x) = \sum_{j=1}^p f_j(x_j) = \sum_{j=1}^p \beta_j g_j(x_j)$ will also be sparse, meaning that many of the component functions $f_j(\cdot) = \beta_j g_j(\cdot)$ are identically zero. [sent-50, score-0.323]
23 While this optimization problem makes plain the role of $\ell_1$ regularization of $\beta$ in achieving sparsity, it has the unfortunate drawback of not being convex. [sent-52, score-0.072]
24 More specifically, while the optimization problem is convex in $\beta$ and $\{g_j\}$ separately, it is not convex in $\beta$ and $\{g_j\}$ jointly. [sent-53, score-0.107]
25 However, consider the following related optimization problem: (Q) $\min_{f_j \in \mathcal{H}_j} \mathbb{E}\big(Y - \sum_{j=1}^p f_j(X_j)\big)^2$ (7a), subject to $\sum_{j=1}^p \sqrt{\mathbb{E}(f_j^2(X_j))} \le L$ (7b) and $\mathbb{E}(f_j) = 0$, $j = 1, \dots, p$. [sent-54, score-0.045]
26 Moreover, the solutions to problems (P) and (Q) are equivalent: if $(\beta^*, g^*)$ optimizes (P), then $f_j^* = \beta_j^* g_j^*$ optimizes (Q); if $f^*$ optimizes (Q), then $\beta^* = (\|f_1^*\|_2, \dots, \|f_p^*\|_2)^T$ and $g_j^* = f_j^* / \|f_j^*\|_2$ optimize (P). [sent-59, score-0.25]
27 While optimization problem (Q) has the important virtue of being convex, the way it encourages sparsity is not intuitive; the following observation provides some insight. [sent-60, score-0.227]
28 Then the projection $\pi_{12} C$ onto the first two components is an $\ell_2$ ball. [sent-62, score-0.077]
29 However, the projection $\pi_{13} C$ onto the first and third components is an $\ell_1$ ball. [sent-63, score-0.109]
30 In this way, it can be seen that the constraint $\sum_j \|f_j\|_2 \le L$ acts as an $\ell_1$ constraint across components to encourage sparsity, while it acts as an $\ell_2$ constraint within components to encourage smoothness, as in a ridge regression penalty. [sent-64, score-0.448]
31 It is thus crucial that the norm $\|f_j\|_2$ appears in the constraint, and not its square $\|f_j\|_2^2$. [sent-65, score-0.029]
32 For the purposes of sparsity, this constraint could be replaced by $\sum_j \|f_j\|_q \le L$ for any $q \ge 1$. [sent-66, score-0.045]
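To make the "ℓ1 across components, ℓ2 within components" behavior concrete, the toy calculation below (ours, with made-up numbers) applies the group soft-thresholding factor $[1 - \lambda/\|f_j\|_2]_+$ that a sum-of-norms constraint of this kind induces: a component whose empirical norm falls below the threshold is zeroed out entirely, while larger components are only shrunk.

```python
import numpy as np

# Toy "component functions" evaluated at four sample points (made-up numbers).
components = {
    "f1": np.array([0.9, -1.1, 0.8, -0.7]),
    "f2": np.array([0.05, -0.02, 0.04, 0.01]),   # a small, nearly irrelevant component
    "f3": np.array([1.5, 1.2, -1.4, 1.3]),
}
lam = 0.3

for name, f in components.items():
    norm = np.sqrt(np.mean(f ** 2))        # empirical L2 norm of the component
    scale = max(0.0, 1.0 - lam / norm)     # group soft-threshold factor [1 - lam/norm]_+
    status = "zeroed" if scale == 0.0 else "shrunk by %.2f" % scale
    print(f"{name}: norm={norm:.3f} -> {status}")
```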
33 When the component functions are linear in the observed covariate values $(x_{1j}, \dots, x_{nj})$, the optimization problem reduces to the lasso. [sent-73, score-0.045]
34 The use of scaling coefficients together with a nonnegative garrote penalty, similar to our problem (P), is considered by Yuan (2007). [sent-74, score-0.072]
35 However, the component functions $g_j$ are fixed, so that the procedure is not asymptotically consistent. [sent-75, score-0.094]
36 The form of the optimization problem (Q) is similar to that of the COSSO for smoothing spline ANOVA models (Lin and Zhang, 2006); however, our method differs significantly from the COSSO, as discussed below. [sent-76, score-0.141]
37 3 A Backfitting Algorithm for SpAM: We now derive a coordinate descent algorithm for fitting a sparse additive model. [sent-78, score-0.311]
38 We write the Lagrangian for the optimization problem (Q) as $\mathcal{L}(f, \lambda, \mu) = \frac{1}{2}\mathbb{E}\big(Y - \sum_{j=1}^p f_j(X_j)\big)^2 + \lambda \sum_{j=1}^p \sqrt{\mathbb{E}(f_j^2(X_j))} + \sum_j \mu_j \mathbb{E}(f_j)$. [sent-80, score-0.045]
39 Letting $P_j = \mathbb{E}[R_j \mid X_j]$ denote the projection of the residual onto $\mathcal{H}_j$, the solution satisfies $\big(1 + \lambda/\sqrt{\mathbb{E}(f_j^2)}\big) f_j = P_j$ if $\sqrt{\mathbb{E}(P_j^2)} > \lambda$, and $f_j = 0$ otherwise (conditions (10) and (11)). [sent-84, score-0.112]
40 The SpAM backfitting algorithm takes as input the data $(X_i, Y_i)$ and a regularization parameter $\lambda$; then, for each $j = 1, \dots, p$: compute the residual $R_j = Y - \sum_{k \neq j} f_k(X_k)$; estimate the projection $P_j = \mathbb{E}[R_j \mid X_j]$ by smoothing, $\hat P_j = \mathcal{S}_j R_j$; estimate the norm $\hat s_j = \sqrt{\mathbb{E}[P_j^2]}$ using, for example, (15) or (35); soft-threshold, $\hat f_j = \big[1 - \lambda/\hat s_j\big]_+ \hat P_j$; and center, $\hat f_j \leftarrow \hat f_j - \mathrm{mean}(\hat f_j)$. [sent-92, score-0.061]
41 Output: the component functions $\hat f_j$ and the estimator $\hat m(X_i) = \sum_j \hat f_j(X_{ij})$. [sent-93, score-0.074]
42 Condition (11), in turn, implies $\big(1 + \lambda/\sqrt{\mathbb{E}(f_j^2)}\big)\sqrt{\mathbb{E}(f_j^2)} = \sqrt{\mathbb{E}(P_j^2)}$, or $\sqrt{\mathbb{E}(f_j^2)} = \sqrt{\mathbb{E}(P_j^2)} - \lambda$. [sent-95, score-0.025]
43 In the finite sample case, as in standard backfitting (Hastie and Tibshirani, 1999), we estimate the projection $\mathbb{E}[R_j \mid X_j]$ by a smooth of the residuals: $\hat P_j = \mathcal{S}_j R_j$ (14), where $\mathcal{S}_j$ is a linear smoother, such as a local linear or kernel smoother. [sent-97, score-0.084]
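Putting the pieces together, the following is a minimal sketch of the backfitting loop with a Nadaraya-Watson kernel smoother standing in for $\mathcal{S}_j$; it is our own illustrative implementation (the function names, bandwidth, fixed number of sweeps, and the synthetic usage example are assumptions), not the authors' code.

```python
import numpy as np

def kernel_smoother_matrix(x, bandwidth=0.1):
    """Nadaraya-Watson smoother matrix S for one covariate x of shape (n,),
    so that S @ r estimates E[r | x] at the observed points."""
    d = (x[:, None] - x[None, :]) / bandwidth
    K = np.exp(-0.5 * d ** 2)
    return K / K.sum(axis=1, keepdims=True)

def spam_backfit(X, y, lam, bandwidth=0.1, n_sweeps=20):
    """Illustrative SpAM backfitting: smooth, soft-threshold, center."""
    n, p = X.shape
    S = [kernel_smoother_matrix(X[:, j], bandwidth) for j in range(p)]
    f = np.zeros((n, p))                # fitted component functions at the data points
    y_c = y - y.mean()                  # work with a centered response
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y_c - f.sum(axis=1) + f[:, j]   # partial residual R_j
            p_j = S[j] @ r_j                      # smoothed projection estimate P_j
            s_j = np.sqrt(np.mean(p_j ** 2))      # estimate of sqrt(E[P_j^2])
            if s_j <= lam:
                f[:, j] = 0.0                     # soft-threshold zeroes the component
            else:
                f_j = (1.0 - lam / s_j) * p_j
                f[:, j] = f_j - f_j.mean()        # center
    return f

# Usage sketch on synthetic data (settings are ours):
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))
y = np.sin(2 * np.pi * X[:, 0]) + (X[:, 1] - 0.5) ** 2 + 0.1 * rng.standard_normal(200)
f_hat = spam_backfit(X, y, lam=0.05)
print("selected:", [j for j in range(10) if np.abs(f_hat[:, j]).max() > 1e-8])
```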
44 While the motivating optimization problem (Q) is similar to that considered in the COSSO (Lin and Zhang, 2006) for smoothing splines, the SpAM backfitting algorithm decouples smoothing and sparsity, through a combination of soft-thresholding and smoothing. [sent-102, score-0.237]
45 In particular, SpAM backfitting can be carried out with any nonparametric smoother; it is not restricted to splines. [sent-103, score-0.115]
46 Moreover, by iteratively estimating over the components and using soft thresholding, our procedure is simple to implement and scales to high dimensions. [sent-104, score-0.081]
47 3.1 SpAM for Nonparametric Logistic Regression: The SpAM backfitting procedure can be extended to nonparametric logistic regression for classification. [sent-106, score-0.296]
48 The additive logistic model is $P(Y = 1 \mid X) \equiv p(X; f) = \frac{\exp\big(\sum_{j=1}^p f_j(X_j)\big)}{1 + \exp\big(\sum_{j=1}^p f_j(X_j)\big)}$ (16), where $Y \in \{0, 1\}$, and the population log-likelihood is $\ell(f) = \mathbb{E}\big[Y f(X) - \log\big(1 + \exp f(X)\big)\big]$. [sent-107, score-0.342]
49 Recall that in the local scoring algorithm for generalized additive models (Hastie and Tibshirani, 1999), in the logistic case one runs the backfitting procedure within Newton's method. [sent-108, score-0.369]
50 The weighted smooth is given by $\hat P_j = \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w}$. [sent-110, score-0.027]
51 As in the unregularized case, this condition is nonlinear in $f$, and so we linearize the gradient of the log-likelihood around $f^0$. [sent-112, score-0.039]
52 This yields the linearized condition $\mathbb{E}\big[w(X)(f(X) - Z) \mid X_j\big] + \lambda v_j = 0$. [sent-113, score-0.039]
53 When $\mathbb{E}(f_j^2) \neq 0$, this implies the condition $\Big(\mathbb{E}[w \mid X_j] + \frac{\lambda}{\sqrt{\mathbb{E}(f_j^2)}}\Big) f_j(X_j) = \mathbb{E}(w R_j \mid X_j)$ (20). [sent-114, score-0.064]
54 In the finite sample case, in terms of the smoothing matrix $\mathcal{S}_j$, this becomes $\hat f_j = \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w + \lambda/\sqrt{\mathbb{E}(f_j^2)}}$ (21). [sent-115, score-0.121]
55 If $\|\mathcal{S}_j(w R_j)\|_2 < \lambda$, then $\hat f_j = 0$. [sent-116, score-0.04]
56 Otherwise, this implicit, nonlinear equation for $f_j$ cannot be solved explicitly, so we propose to iterate until convergence: $\hat f_j \leftarrow \frac{\mathcal{S}_j(w R_j)}{\mathcal{S}_j w + \lambda\sqrt{n}/\|\hat f_j\|_2}$ (22). [sent-117, score-0.066]
57 When $\lambda = 0$, this yields the standard local scoring update (18). [sent-118, score-0.067]
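One local-scoring pass with the thresholded update (21)-(22) can be sketched as follows; this is our own illustration (the smoother matrices, the initialization of a zeroed component from the unpenalized weighted smooth, and the fixed number of inner iterations are assumptions), not the authors' implementation.

```python
import numpy as np

def logistic_spam_sweep(S, f, y, lam, n_inner=10, eps=1e-6):
    """One Newton / local-scoring sweep for logistic SpAM (illustrative).
    S: list of p smoother matrices (n x n); f: (n, p) current component fits; y in {0, 1}."""
    n, p = f.shape
    eta = f.sum(axis=1)
    prob = 1.0 / (1.0 + np.exp(-eta))            # p(X; f)
    w = np.maximum(prob * (1.0 - prob), eps)     # Newton weights
    z = eta + (y - prob) / w                     # working (linearized) response
    for j in range(p):
        r_j = z - eta + f[:, j]                  # partial working residual
        num = S[j] @ (w * r_j)                   # S_j(w R_j)
        den = np.maximum(S[j] @ w, eps)          # S_j w
        if np.linalg.norm(num) < lam:
            f[:, j] = 0.0                        # thresholded to zero
        else:
            f_j = f[:, j] if np.linalg.norm(f[:, j]) > 0 else num / den
            for _ in range(n_inner):             # iterate the implicit update
                f_j = num / (den + lam * np.sqrt(n) / np.linalg.norm(f_j))
            f[:, j] = f_j - f_j.mean()
        eta = f.sum(axis=1)                      # refresh the additive predictor
    return f
```

Wrapping this sweep in an outer loop until the component fits stabilize corresponds to running backfitting inside Newton's method, as described above.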
58 An example of logistic SpAM is given in Section 5. [sent-119, score-0.09]
59 4 Properties of SpAM. 4.1 SpAM is Persistent: The notion of risk consistency, or persistence, was studied by Juditsky and Nemirovski (2000) and Greenshtein and Ritov (2004) in the context of linear models. [sent-121, score-0.027]
60 Let $(X, Y)$ denote a new pair (independent of the observed data) and define the predictive risk when predicting $Y$ with $f(X)$ by $R(f) = \mathbb{E}(Y - f(X))^2$ (23). [sent-122, score-0.027]
61 Since we consider predictors of the form $f(x) = \sum_j \beta_j g_j(x_j)$, we also write the risk as $R(\beta, g)$, where $\beta = (\beta_1, \dots, \beta_p)$. [sent-123, score-0.027]
62 Following Greenshtein and Ritov (2004), we say that an estimator $\hat m_n$ is persistent relative to a class of functions $\mathcal{M}_n$ if $R(\hat m_n) - R(m_n^*) \xrightarrow{P} 0$ (24), where $m_n^* = \operatorname{argmin}_{f \in \mathcal{M}_n} R(f)$ is the predictive oracle. [sent-130, score-0.15]
63 Greenshtein and Ritov (2004) showed that the lasso is persistent for the class of linear models $\mathcal{M}_n = \{f(x) = x^T \beta : \|\beta\|_1 \le L_n\}$ if $L_n = o\big((n/\log n)^{1/4}\big)$. [sent-131, score-0.159]
64 Then SpAM is persistent relative to the class of additive models $\mathcal{M}_n = \big\{f(x) = \sum_{j=1}^p \beta_j g_j(x_j) : \|\beta\|_1 \le L_n\big\}$ if $L_n = o\big(n^{(1-\xi)/4}\big)$. [sent-136, score-0.298]
65 We show a similar result for SpAM with the sparse backfitting procedure. [sent-139, score-0.089]
66 For the purpose of analysis, we use orthogonal function regression as the smoothing procedure. [sent-140, score-0.181]
67 We truncate the basis to finite dimension $d_n$, and let $d_n \to \infty$ such that $d_n/n \to 0$. [sent-145, score-0.24]
68 The SpAM optimization problem can then be written as $\min_\beta \frac{1}{2n}\big\|Y - \sum_{j=1}^p \Psi_j \beta_j\big\|^2 + \lambda_n \sum_{j=1}^p \sqrt{\frac{1}{n}\beta_j^T \Psi_j^T \Psi_j \beta_j}$ (25), where $\Psi_j$ denotes the matrix of truncated basis functions evaluated at the data for the $j$-th covariate and each $\beta_j$ is a $d$-dimensional vector. [sent-151, score-0.045]
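In this truncated-basis form, (25) has the structure of a group lasso over the coefficient blocks $\beta_j$. The sketch below is ours and additionally assumes each basis matrix satisfies $\Psi_j^T \Psi_j / n = I$ (as with the orthogonal bases used for the analysis), so the penalty reduces to $\lambda_n \sum_j \|\beta_j\|_2$ and a proximal-gradient step becomes block soft-thresholding.

```python
import numpy as np

def spam_basis_fit(Psis, y, lam, n_iter=500):
    """Illustrative proximal-gradient (ISTA) solver for the truncated-basis SpAM
    objective, under the assumption Psi_j.T @ Psi_j / n = I, so the penalty
    is lam * sum_j ||beta_j||_2 (a group lasso)."""
    n = len(y)
    betas = [np.zeros(Psi.shape[1]) for Psi in Psis]
    # Safe step size from an upper bound on the Lipschitz constant of the smooth part.
    L = sum(np.linalg.norm(Psi, 2) ** 2 for Psi in Psis) / n
    step = 1.0 / L
    for _ in range(n_iter):
        resid = y - sum(Psi @ b for Psi, b in zip(Psis, betas))
        for j, Psi in enumerate(Psis):
            z = betas[j] + step * (Psi.T @ resid) / n        # gradient step on block j
            nz = np.linalg.norm(z)
            betas[j] = max(0.0, 1.0 - step * lam / nz) * z if nz > 0 else z
    return betas
```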
69 Let $\hat S_n = \{j : \hat\beta_j \neq 0\}$ denote the estimated set of variables from the minimizer $\hat\beta_n$ of (25). [sent-153, score-0.037]
70 5 Experiments: In this section we present experimental results for SpAM applied to both synthetic and real data, including regression and classification examples that illustrate the behavior of the algorithm in various conditions. [sent-158, score-0.061]
71 We first use simulated data to investigate the performance of the SpAM backfitting algorithm, where the true sparsity pattern is known. [sent-159, score-0.147]
72 If not explicitly stated otherwise, the data are always rescaled to lie in the d-dimensional cube $[0, 1]^d$, and a kernel smoother with a Gaussian kernel is used. [sent-161, score-0.044]
73 To tune the penalization parameter $\lambda$, we use a $C_p$ statistic, which is defined as $C_p(\hat f) = \frac{1}{n}\sum_{i=1}^n \big(Y_i - \sum_{j=1}^p \hat f_j(X_{ij})\big)^2 + \frac{2\hat\sigma^2}{n}\sum_{j=1}^p \mathrm{trace}(\mathcal{S}_j)\,\mathbf{1}[\hat f_j \neq 0]$ (29), where $\mathcal{S}_j$ is the smoothing matrix for the $j$-th dimension and $\hat\sigma^2$ is the estimated variance. [sent-162, score-0.218]
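As a direct transcription of the reconstructed criterion (29), a scoring helper might look like the following (names are ours; the noise variance is assumed to be estimated elsewhere), with $\lambda$ then chosen by minimizing this score over a grid of candidate values.

```python
import numpy as np

def spam_cp(y, f_hat, smoother_mats, sigma2):
    """C_p score for a fitted SpAM model.
    y: (n,) responses; f_hat: (n, p) fitted components evaluated at the data;
    smoother_mats: list of p smoother matrices S_j; sigma2: estimated noise variance."""
    n, p = f_hat.shape
    rss = np.mean((y - f_hat.sum(axis=1)) ** 2)
    active = [j for j in range(p) if np.any(np.abs(f_hat[:, j]) > 1e-12)]
    df = sum(np.trace(smoother_mats[j]) for j in active)
    return rss + 2.0 * sigma2 * df / n
```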
75 Figure 2: (Simulated data) Upper left: The empirical $\ell_2$ norm of the estimated components plotted against the tuning parameter $\lambda$; the value on the x-axis is proportional to $\sum_j \|\hat f_j\|_2$. [sent-235, score-0.117]
76 Upper center: The $C_p$ scores against the tuning parameter $\lambda$; the dashed vertical line corresponds to the value of $\lambda$ with the smallest $C_p$ score. [sent-236, score-0.025]
77 Upper right: The proportion of 200 trials where the correct relevant variables are selected, as a function of sample size n. [sent-237, score-0.025]
78 Lower (from left to right): Estimated (solid lines) versus true additive component functions (dashed lines) for the first 6 dimensions; the remaining components are zero. [sent-238, score-0.337]
79 5.2 Boston Housing: The Boston housing data were collected to study house values in the suburbs of Boston; there are altogether 506 observations with 10 covariates. [sent-240, score-0.094]
80 To explore the sparsistency properties of our method, we add 20 irrelevant variables. [sent-243, score-0.075]
81 Ten of them are randomly drawn from Uniform(0, 1); the remaining ten are a random permutation of the original ten covariates, so that they have the same empirical densities. [sent-244, score-0.062]
82 It correctly zeros out both types of irrelevant variables. [sent-247, score-0.085]
83 The importance of the variables nox and b is borderline. [sent-249, score-0.054]
84 However, using $C_p$ as the selection criterion, the variables indus, age, dist, and tax are estimated to be irrelevant, a result not seen in other studies. [sent-252, score-0.12]
85 5.3 SpAM for Spam: Here we consider an email spam classification problem, using the logistic SpAM backfitting algorithm from Section 3. [sent-254, score-0.902]
86 This problem was previously studied by Hastie et al. (2001), using a set of 3,065 emails as a training set and conducting hypothesis tests to choose significant variables; there are a total of 4,601 observations with p = 57 attributes, all numeric. [sent-257, score-0.043]
87 The attributes measure the percentage of specific words or characters in the email, the average and maximum run lengths of upper case letters, and the total number of such letters. [sent-258, score-0.026]
88 To demonstrate how SpAM performs well with sparse data, we only sample n = 300 emails as the training set, with the remaining 4301 data points used as the test set. [sent-259, score-0.157]
89 We also use the test data as the hold-out set to tune the penalization parameter λ. [sent-260, score-0.085]
90 The results of a typical run of logistic SpAM are summarized in Figure 4, using plug-in bandwidths. [sent-261, score-0.09]
91 Figure 3: (Boston housing) Left: The empirical $\ell_2$ norm of the estimated components versus the regularization parameter $\lambda$. [sent-323, score-0.144]
92 Center: The $C_p$ scores against $\lambda$; the dashed vertical line corresponds to the best $C_p$ score. [sent-324, score-0.025]
93 Figure 4: (Email spam) Classification accuracies and variable selection for logistic SpAM, plotted against the penalization parameter. [sent-334, score-0.176]
94 Greenshtein, E. and Ritov, Y. (2004). Persistency in high dimensional linear predictor-selection and the virtue of over-parametrization. Bernoulli. [sent-339, score-0.058]
95 Wainwright, M. J. (2006). Sharp thresholds for high-dimensional and noisy recovery of sparsity. Technical report, UC Berkeley. [sent-388, score-0.044]
wordName wordTfidf (topN-words)
[('spam', 0.752), ('additive', 0.222), ('tting', 0.13), ('sparsity', 0.119), ('back', 0.115), ('nonparametric', 0.115), ('cosso', 0.108), ('rdle', 0.108), ('boston', 0.1), ('smoothing', 0.096), ('greenshtein', 0.094), ('housing', 0.094), ('logistic', 0.09), ('sparse', 0.089), ('pj', 0.086), ('ritov', 0.086), ('lasso', 0.083), ('ibshirani', 0.081), ('sjw', 0.081), ('dn', 0.08), ('persistent', 0.076), ('estimator', 0.074), ('sparsistent', 0.071), ('anova', 0.071), ('supp', 0.064), ('mn', 0.064), ('component', 0.064), ('yi', 0.062), ('regression', 0.061), ('email', 0.06), ('penalization', 0.057), ('covariates', 0.057), ('lin', 0.056), ('tibshirani', 0.055), ('astie', 0.054), ('lstat', 0.054), ('nox', 0.054), ('ptratio', 0.054), ('tax', 0.054), ('irrelevant', 0.051), ('hastie', 0.051), ('components', 0.051), ('optimizes', 0.05), ('zhang', 0.05), ('encourage', 0.048), ('garrote', 0.047), ('cmax', 0.047), ('cmin', 0.047), ('constraint', 0.045), ('optimization', 0.045), ('recovery', 0.044), ('smoother', 0.044), ('emails', 0.043), ('penalty', 0.041), ('fj', 0.04), ('condition', 0.039), ('wainwright', 0.038), ('age', 0.038), ('estimated', 0.037), ('cp', 0.036), ('zeros', 0.034), ('subgradient', 0.034), ('encourages', 0.034), ('imposes', 0.033), ('projection', 0.032), ('convex', 0.031), ('ten', 0.031), ('population', 0.03), ('procedure', 0.03), ('norm', 0.029), ('selection', 0.029), ('virtue', 0.029), ('dimensional', 0.029), ('simulated', 0.028), ('stationary', 0.028), ('norms', 0.028), ('tune', 0.028), ('scoring', 0.027), ('acts', 0.027), ('risk', 0.027), ('lagrangian', 0.027), ('residual', 0.027), ('regularization', 0.027), ('smooth', 0.027), ('iterate', 0.026), ('attributes', 0.026), ('onto', 0.026), ('sn', 0.026), ('sample', 0.025), ('implies', 0.025), ('nonnegative', 0.025), ('smoothness', 0.025), ('dashed', 0.025), ('uc', 0.024), ('orthogonal', 0.024), ('functional', 0.024), ('riedman', 0.024), ('pradeep', 0.024), ('ravikumar', 0.024), ('sparsistency', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 179 nips-2007-SpAM: Sparse Additive Models
Author: Han Liu, Larry Wasserman, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data. 1
2 0.16962747 62 nips-2007-Convex Learning with Invariances
Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola
Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1
3 0.11461718 99 nips-2007-Hierarchical Penalization
Author: Marie Szafranski, Yves Grandvalet, Pierre Morizet-mahoudeaux
Abstract: Hierarchical penalization is a generic framework for incorporating prior information in the fitting of statistical models, when the explicative variables are organized in a hierarchical structure. The penalizer is a convex functional that performs soft selection at the group level, and shrinks variables within each group. This favors solutions with few leading terms in the final combination. The framework, originally derived for taking prior knowledge into account, is shown to be useful in linear regression, when several parameters are used to model the influence of one feature, or in kernel regression, for learning multiple kernels. Keywords – Optimization: constrained and convex optimization. Supervised learning: regression, kernel methods, sparsity and feature selection. 1
4 0.10607809 53 nips-2007-Compressed Regression
Author: Shuheng Zhou, Larry Wasserman, John D. Lafferty
Abstract: Recent research has studied the role of sparsity in high dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. In this paper we study a variant of this problem where the original n input variables are compressed by a random linear transformation to m n examples in p dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for 1 -regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence.” In addition, we show that 1 -regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence.” Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the rate of information communicated between the compressed and uncompressed data that decay to zero. 1
5 0.075386167 43 nips-2007-Catching Change-points with Lasso
Author: Céline Levy-leduc, Zaïd Harchaoui
Abstract: We propose a new approach for dealing with the estimation of the location of change-points in one-dimensional piecewise constant signals observed in white noise. Our approach consists in reframing this task in a variable selection context. We use a penalized least-squares criterion with a 1 -type penalty for this purpose. We prove some theoretical results on the estimated change-points and on the underlying piecewise constant estimated function. Then, we explain how to implement this method in practice by combining the LAR algorithm and a reduced version of the dynamic programming algorithm and we apply it to synthetic and real data. 1
6 0.071430095 8 nips-2007-A New View of Automatic Relevance Determination
7 0.07042633 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
8 0.065604582 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
10 0.058244042 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
11 0.054054819 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
12 0.053903904 145 nips-2007-On Sparsity and Overcompleteness in Image Models
13 0.051284675 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
14 0.04501674 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
15 0.042345461 32 nips-2007-Bayesian Co-Training
16 0.042306274 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis
17 0.040877957 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
18 0.040150948 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
19 0.039359689 195 nips-2007-The Generalized FITC Approximation
20 0.038795453 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
topicId topicWeight
[(0, -0.155), (1, 0.035), (2, -0.053), (3, 0.064), (4, -0.008), (5, 0.027), (6, -0.018), (7, 0.001), (8, -0.065), (9, 0.068), (10, 0.117), (11, -0.026), (12, 0.086), (13, 0.001), (14, 0.078), (15, 0.022), (16, 0.073), (17, 0.075), (18, 0.014), (19, -0.127), (20, -0.059), (21, -0.042), (22, 0.106), (23, 0.011), (24, 0.033), (25, 0.071), (26, -0.094), (27, 0.053), (28, -0.091), (29, 0.09), (30, 0.05), (31, 0.037), (32, -0.055), (33, 0.128), (34, 0.004), (35, 0.016), (36, -0.088), (37, 0.049), (38, 0.148), (39, 0.008), (40, -0.086), (41, 0.017), (42, -0.071), (43, -0.164), (44, -0.04), (45, -0.064), (46, -0.063), (47, -0.053), (48, -0.112), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.92676479 179 nips-2007-SpAM: Sparse Additive Models
Author: Han Liu, Larry Wasserman, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data. 1
2 0.79989535 53 nips-2007-Compressed Regression
Author: Shuheng Zhou, Larry Wasserman, John D. Lafferty
Abstract: Recent research has studied the role of sparsity in high dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. In this paper we study a variant of this problem where the original n input variables are compressed by a random linear transformation to m n examples in p dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for 1 -regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence.” In addition, we show that 1 -regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence.” Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the rate of information communicated between the compressed and uncompressed data that decay to zero. 1
3 0.67078161 99 nips-2007-Hierarchical Penalization
Author: Marie Szafranski, Yves Grandvalet, Pierre Morizet-mahoudeaux
Abstract: Hierarchical penalization is a generic framework for incorporating prior information in the fitting of statistical models, when the explicative variables are organized in a hierarchical structure. The penalizer is a convex functional that performs soft selection at the group level, and shrinks variables within each group. This favors solutions with few leading terms in the final combination. The framework, originally derived for taking prior knowledge into account, is shown to be useful in linear regression, when several parameters are used to model the influence of one feature, or in kernel regression, for learning multiple kernels. Keywords – Optimization: constrained and convex optimization. Supervised learning: regression, kernel methods, sparsity and feature selection. 1
4 0.65769881 43 nips-2007-Catching Change-points with Lasso
Author: Céline Levy-leduc, Zaïd Harchaoui
Abstract: We propose a new approach for dealing with the estimation of the location of change-points in one-dimensional piecewise constant signals observed in white noise. Our approach consists in reframing this task in a variable selection context. We use a penalized least-squares criterion with a 1 -type penalty for this purpose. We prove some theoretical results on the estimated change-points and on the underlying piecewise constant estimated function. Then, we explain how to implement this method in practice by combining the LAR algorithm and a reduced version of the dynamic programming algorithm and we apply it to synthetic and real data. 1
5 0.57931882 62 nips-2007-Convex Learning with Invariances
Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola
Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1
6 0.50077081 8 nips-2007-A New View of Automatic Relevance Determination
7 0.46610758 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
8 0.45304167 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
9 0.42143285 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
10 0.37932348 89 nips-2007-Feature Selection Methods for Improving Protein Structure Prediction with Rosetta
11 0.35877663 159 nips-2007-Progressive mixture rules are deviation suboptimal
12 0.34382743 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
14 0.33985043 118 nips-2007-Learning with Transformation Invariant Kernels
15 0.3343133 178 nips-2007-Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains
16 0.33217421 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
17 0.32097632 145 nips-2007-On Sparsity and Overcompleteness in Image Models
18 0.3170763 88 nips-2007-Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition
19 0.3080543 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
20 0.30493194 40 nips-2007-Bundle Methods for Machine Learning
topicId topicWeight
[(4, 0.288), (5, 0.048), (13, 0.033), (16, 0.039), (19, 0.026), (21, 0.073), (31, 0.013), (34, 0.02), (35, 0.038), (47, 0.109), (49, 0.032), (83, 0.1), (85, 0.027), (87, 0.013), (90, 0.058)]
simIndex simValue paperId paperTitle
1 0.77591354 74 nips-2007-EEG-Based Brain-Computer Interaction: Improved Accuracy by Automatic Single-Trial Error Detection
Author: Pierre Ferrez, José Millán
Abstract: Brain-computer interfaces (BCIs), as any other interaction modality based on physiological signals and body channels (e.g., muscular activity, speech and gestures), are prone to errors in the recognition of subject’s intent. An elegant approach to improve the accuracy of BCIs consists in a verification procedure directly based on the presence of error-related potentials (ErrP) in the EEG recorded right after the occurrence of an error. Six healthy volunteer subjects with no prior BCI experience participated in a new human-robot interaction experiment where they were asked to mentally move a cursor towards a target that can be reached within a few steps using motor imagination. This experiment confirms the previously reported presence of a new kind of ErrP. These “Interaction ErrP” exhibit a first sharp negative peak followed by a positive peak and a second broader negative peak (∼290, ∼350 and ∼470 ms after the feedback, respectively). But in order to exploit these ErrP we need to detect them in each single trial using a short window following the feedback associated to the response of the classifier embedded in the BCI. We have achieved an average recognition rate of correct and erroneous single trials of 81.8% and 76.2%, respectively. Furthermore, we have achieved an average recognition rate of the subject’s intent while trying to mentally drive the cursor of 73.1%. These results show that it’s possible to simultaneously extract useful information for mental control to operate a brain-actuated device as well as cognitive states such as error potentials to improve the quality of the braincomputer interaction. Finally, using a well-known inverse model (sLORETA), we show that the main focus of activity at the occurrence of the ErrP are, as expected, in the pre-supplementary motor area and in the anterior cingulate cortex. 1
same-paper 2 0.74279273 179 nips-2007-SpAM: Sparse Additive Models
Author: Han Liu, Larry Wasserman, John D. Lafferty, Pradeep K. Ravikumar
Abstract: We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data. 1
3 0.7290253 39 nips-2007-Boosting the Area under the ROC Curve
Author: Phil Long, Rocco Servedio
Abstract: We show that any weak ranker that can achieve an area under the ROC curve slightly better than 1/2 (which can be achieved by random guessing) can be efficiently boosted to achieve an area under the ROC curve arbitrarily close to 1. We further show that this boosting can be performed even in the presence of independent misclassification noise, given access to a noise-tolerant weak ranker.
4 0.54841626 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
Author: José M. Hernández-lobato, Tjeerd Dijkstra, Tom Heskes
Abstract: We introduce a hierarchical Bayesian model for the discovery of putative regulators from gene expression data only. The hierarchy incorporates the knowledge that there are just a few regulators that by themselves only regulate a handful of genes. This is implemented through a so-called spike-and-slab prior, a mixture of Gaussians with different widths, with mixing weights from a hierarchical Bernoulli model. For efficient inference we implemented expectation propagation. Running the model on a malaria parasite data set, we found four genes with significant homology to transcription factors in an amoebe, one RNA regulator and three genes of unknown function (out of the top ten genes considered).
5 0.53775436 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Author: Matthias Bethge, Philipp Berens
Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model, that makes this type of analysis feasible for very high-dimensional data—the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high dimensional measurements of more than thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that are not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible. 1
6 0.52869475 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
7 0.52767378 158 nips-2007-Probabilistic Matrix Factorization
8 0.52665657 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
9 0.52460623 53 nips-2007-Compressed Regression
10 0.52384114 86 nips-2007-Exponential Family Predictive Representations of State
11 0.52291042 195 nips-2007-The Generalized FITC Approximation
12 0.52237856 24 nips-2007-An Analysis of Inference with the Universum
13 0.52203768 100 nips-2007-Hippocampal Contributions to Control: The Third Way
14 0.5213964 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
15 0.52089489 28 nips-2007-Augmented Functional Time Series Representation and Forecasting with Gaussian Processes
16 0.52074379 63 nips-2007-Convex Relaxations of Latent Variable Training
17 0.51953858 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
18 0.51945847 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
19 0.5191828 156 nips-2007-Predictive Matrix-Variate t Models
20 0.51914144 174 nips-2007-Selecting Observations against Adversarial Objectives