nips nips2009 nips2009-105 knowledge-graph by maker-knowledge-mining

105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction


Source: pdf

Author: Grzegorz Swirszcz, Naoki Abe, Aurelie C. Lozano

Abstract: We consider the problem of variable group selection for least squares regression, namely, that of selecting groups of variables for best regression performance, leveraging and adhering to a natural grouping structure within the explanatory variables. We show that this problem can be efficiently addressed by using a certain greedy style algorithm. More precisely, we propose the Group Orthogonal Matching Pursuit algorithm (Group-OMP), which extends the standard OMP procedure (also referred to as “forward greedy feature selection algorithm” for least squares regression) to perform stage-wise group variable selection. We prove that under certain conditions Group-OMP can identify the correct (groups of) variables. We also provide an upperbound on the l∞ norm of the difference between the estimated regression coefficients and the true coefficients. Experimental results on simulated and real world datasets indicate that Group-OMP compares favorably to Group Lasso, OMP and Lasso, both in terms of variable selection and prediction accuracy. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: We consider the problem of variable group selection for least squares regression, namely, that of selecting groups of variables for best regression performance, leveraging and adhering to a natural grouping structure within the explanatory variables. [sent-4, score-0.279]

2 We show that this problem can be efficiently addressed by using a certain greedy style algorithm. [sent-5, score-0.016]

3 More precisely, we propose the Group Orthogonal Matching Pursuit algorithm (Group-OMP), which extends the standard OMP procedure (also referred to as “forward greedy feature selection algorithm” for least squares regression) to perform stage-wise group variable selection. [sent-6, score-0.167]

4 We also provide an upper bound on the l∞ norm of the difference between the estimated regression coefficients and the true coefficients. [sent-8, score-0.042]

5 Experimental results on simulated and real world datasets indicate that Group-OMP compares favorably to Group Lasso, OMP and Lasso, both in terms of variable selection and prediction accuracy. [sent-9, score-0.063]

6 1 Introduction: We address the problem of variable selection for regression, where a natural grouping structure exists within the explanatory variables, and the goal is to select the correct group of variables, rather than the individual variables. [sent-10, score-0.165]

7 Such settings arise, e.g., in multifactor ANOVA, in generalized additive models, in time series data analysis, where lagged variables belonging to the same time series may form a natural group, and in gene expression analysis from microarray data, where genes belonging to the same functional cluster may be considered as a group. [sent-13, score-0.033]

8 In these settings, selecting the right groups of variables is often more relevant to the subsequent use of estimated models, which may involve interpreting the models and making decisions based on them. [sent-14, score-0.082]

9 Recently, several methods have been proposed to address this variable group selection problem, in the context of linear regression [12, 15]. [sent-15, score-0.149]

10 These methods are based on extending the Lasso formulation [8] by modifying the l1 penalty to account for the group structure. [sent-16, score-0.091]

11 Here XG1 , . . . , XGJ are the natural groupings within the variables of X and βGj are the coefficient vectors for the variables in groups Gj . [sent-20, score-0.115]

12 Zhao et al. [15] considered a more general penalty class, the Composite Absolute Penalties family T(β) = Σ_{j=1}^{J} ‖βGj‖_{lj}^{l0}, of which the Group Lasso penalty is a special instance. [sent-21, score-0.026]

13 This development opens up a new direction of research, namely that of extending the existing regression methods with variable selection to the variable group selection problem and investigating to what extent they carry over to the new scenario. [sent-22, score-0.2]

14 Specifically we propose the “Group Orthogonal Matching Pursuit” algorithm (Group-OMP), which extends the OMP algorithm to leverage variable groupings, and prove that, under certain conditions, Group-OMP can identify the correct (groups of) variables when the sample size tends to infinity. [sent-24, score-0.06]

15 We also provide an upper bound on the l∞ norm of the difference between the estimated regression coefficients and the true coefficients. [sent-25, score-0.042]

16 Hence our results generalize those of Zhang [13], which established consistency of the standard OMP algorithm. [sent-26, score-0.017]

17 Our results indicate that Group-OMP favorably compares to the Group Lasso, OMP and Lasso algorithms, both in terms of the accuracy of prediction and that of variable selection. [sent-31, score-0.04]

18 The consistency results are then stated in Section 3. [sent-35, score-0.017]

19 2 Group Orthogonal Matching Pursuit: Consider the general regression problem y = X β̄ + ν, where y ∈ Rn is the response vector. [sent-38, score-0.02]

20 X = [f1 , . . . , fd ] ∈ Rn×d is the matrix of feature (or variable) vectors fj ∈ Rn , β̄ ∈ Rd is the coefficient vector, and ν ∈ Rn is the noise vector. [sent-41, score-0.021]

21 The noise components νi , i = 1, . . . , n, are independent Gaussian variables with mean 0 and variance σ². [sent-45, score-0.02]

22 For any G ⊆ {1, . . . , d}, let XG denote the restriction of X to the set of variables {fj , j ∈ G}, where the columns fj are arranged in ascending order. [sent-49, score-0.026]

23 Similarly for any vector β ∈ Rd of regression coefficients, denote βG its restriction to G, with reordering in ascending order. [sent-50, score-0.035]

24 Suppose that a natural grouping structure exists within the variables of X consisting of J groups XG1 , . . . , XGJ . [sent-51, score-0.096]

25 Then the above regression problem can be decomposed with respect to the groups, i.e., y = Σ_{j=1}^{J} XGj β̄Gj + ν. [sent-58, score-0.02]

26 For any such G and v ∈ Rn , denote by β̂X(G, v) the coefficients resulting from applying ordinary least squares (OLS) with the non-zero coefficients restricted to G, i.e., the minimizer of ‖Xβ − v‖₂² over β with supp(β) ⊆ G. [sent-65, score-0.024]

27 Given the above setup, the Group-OMP procedure we propose is described in Figure 1, which extends the OMP procedure to deal with group selection. [sent-68, score-0.088]

28 Note that this procedure picks the best group in each iteration, with respect to reduction of the residual error, and then re-estimates the coefficients β(k), as in OMP. [sent-69, score-0.078]

29 We recall that this re-estimation step is what distinguishes OMP, and our group version, from standard boosting-like procedures. [sent-70, score-0.078]

30 • Output: the selected groups G(k) and the regression coefficients β(k). [sent-80, score-0.082]
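
Since the extracted text does not reproduce Figure 1, the following is a minimal Python/NumPy sketch of the Group-OMP iteration as it is described above: at each step the unselected group whose span best explains the current residual is added, and all selected coefficients are then re-estimated by OLS. The function name, the grouping format, and the simple stopping rule (a cap on the number of groups plus a tolerance on the residual-error reduction) are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def group_omp(X, y, groups, max_groups=None, tol=1e-10):
    """Sketch of Group-OMP (illustrative, not the paper's reference code).

    X : (n, d) design matrix; y : (n,) response;
    groups : list of column-index arrays forming a partition of {0, ..., d-1}.
    Returns the list of selected group indices and the refitted coefficients.
    """
    n, d = X.shape
    if max_groups is None:
        max_groups = len(groups)
    selected, beta = [], np.zeros(d)
    residual = y.copy()
    for _ in range(max_groups):
        best_j, best_gain = None, tol
        for j, G in enumerate(groups):
            if j in selected:
                continue
            # Squared-residual reduction from regressing the current
            # residual on the columns of group G alone.
            proj = X[:, G] @ np.linalg.lstsq(X[:, G], residual, rcond=None)[0]
            gain = proj @ proj
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:          # no remaining group helps: stop
            break
        selected.append(best_j)
        support = np.concatenate([groups[j] for j in selected])
        # Re-estimation of all selected coefficients by OLS: the step that
        # distinguishes OMP-style methods from boosting-like procedures.
        beta = np.zeros(d)
        beta[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
        residual = y - X @ beta
    return selected, beta
```

In practice the number of iterations (equivalently, the stopping threshold) is the tuning parameter that the oracle or holdout procedures described later would select.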

31 3 Consistency Results. 3.1 Notation: Let Ggood denote the set of all the groups included in the true model. [sent-89, score-0.062]

32 Similarly we call Gbad the set of all the groups which are not included. [sent-91, score-0.062]

33 We let ggood and gbad denote the sets of “good indices” and “bad indices”, i.e., the indices belonging to the groups in Ggood and Gbad , respectively. [sent-92, score-0.979]

34 Furthermore, the elements of Ggood are groups of indices, and |Ggood | is the number of groups in Ggood , while ggood is defined in terms of individual indices. [sent-96, score-1.024]

35 ggood is the set of indices corresponding to the groups in Ggood . [sent-98, score-0.972]

36 ρX(Ggood) = inf_β { ‖Xβ‖₂² / ‖β‖₂² : supp(β) ⊂ ggood }. [sent-103, score-0.9]

37 Here and throughout the paper we let A∗ denote the conjugate transpose of the matrix A (which, for a real matrix A, coincides with its transpose) and A+ denote the Moore–Penrose pseudoinverse of the matrix A. [sent-104, score-0.024]

38 For u = (u1 , . . . , u|ggood| ) and v = (v1 , . . . , v|gbad| ) we define ‖u‖good(2,1) = Σ_{Gi∈Ggood} (Σ_{j∈Gi} uj²)^{1/2} and ‖v‖bad(2,1) = Σ_{Gi∈Gbad} (Σ_{j∈Gi} vj²)^{1/2}, and then for any matrix A ∈ R^{|ggood|×|gbad|} we let ‖A‖good/bad(2,1) = sup_{‖v‖bad(2,1)=1} ‖Av‖good(2,1). [sent-114, score-0.022]

39 We then define µX(Ggood) = ‖X⁺ggood Xgbad‖good/bad(2,1). [sent-115, score-0.057]
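
As a concrete illustration (not code from the paper), the sketch below evaluates ρX(Ggood) exactly, as the smallest eigenvalue of X∗ggood Xggood, together with a simple computable upper bound on µX(Ggood); the exact good/bad (2,1) operator norm needs an extra maximisation over the unit sphere of each bad group, so only a block-wise spectral-norm bound is computed here, and the helper name and grouping format are assumptions.

```python
import numpy as np

def rho_and_mu_bound(X, good_groups, bad_groups):
    """rho_X(Ggood) exactly, and an upper bound on mu_X(Ggood).

    good_groups / bad_groups : lists of column-index arrays of X.
    The bound uses ||A||_{good/bad(2,1)} <= max over bad groups of the sum,
    over good groups, of the spectral norms of the corresponding blocks of
    A = X_ggood^+ X_gbad.
    """
    ggood = np.concatenate(good_groups)
    gbad = np.concatenate(bad_groups)
    Xg, Xb = X[:, ggood], X[:, gbad]

    # rho_X(Ggood) = inf { ||X b||^2 / ||b||^2 : supp(b) inside ggood }.
    rho = np.linalg.eigvalsh(Xg.T @ Xg).min()

    A = np.linalg.pinv(Xg) @ Xb                   # |ggood| x |gbad|
    # Row / column positions of each group inside A.
    good_rows = np.split(np.arange(len(ggood)),
                         np.cumsum([len(g) for g in good_groups])[:-1])
    bad_cols = np.split(np.arange(len(gbad)),
                        np.cumsum([len(g) for g in bad_groups])[:-1])
    mu_bound = max(
        sum(np.linalg.norm(A[np.ix_(r, c)], 2) for r in good_rows)
        for c in bad_cols
    )
    return rho, mu_bound
```

If this bound is below 1 (and ρX(Ggood) > 0), the exact recovery condition µX(Ggood) < 1 discussed below is certainly satisfied.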

40 3.2 The Noiseless Case: We first focus on the noiseless case (i.e., ν = 0). [sent-117, score-0.015]

41 In the noiseless case, we have r0 = −y ∈ Span(Ggood ). [sent-121, score-0.015]

42 The following theorem and its corollary provide a condition which guarantees that Group-OMP does not make a mistake at the next iteration, given that it has not made any mistakes up to that point. [sent-123, score-0.048]

43 By induction on k, it implies that Group-OMP never makes a mistake. [sent-124, score-0.019]

44 Reorder the groups in such a way that Ggood = {G1 , . . . , Gm }. [sent-126, score-0.062]

45 Reorder the groups in such a way that Ggood = {G1 , . . . , Gm }. [sent-142, score-0.062]

46 By definition, if Φ∗(x) = 0 (the zero element of V^Φ), then x must be orthogonal to each of the subspaces spanned by XGi , i = 1, . . . , m. [sent-197, score-0.014]

47 The choice of symbol is not coincidental: the matrix of this mapping is indeed a pseudoinverse of the matrix (XG1 | XG2 | . . . ). [sent-202, score-0.012]

48 We have ‖Ψ∗(r)‖(2,∞) / ‖Φ∗(r)‖(2,∞) = ‖Ψ∗((Φ∗)⁺Φ∗(r))‖(2,∞) / ‖Φ∗(r)‖(2,∞) ≤ ‖Ψ∗ ◦ (Φ∗)⁺‖(2,∞), where the last term is the norm of the operator Ψ∗ ◦ (Φ∗)⁺ : V^Φ → V^Ψ. [sent-206, score-0.012]

49 We prove it for V^Ψ; the proof for V^Φ is identical. [sent-231, score-0.018]

50 ‖v∗‖ = sup_{v∈V^Ψ, ‖v‖(2,∞)=1} |v∗(v)| = Σ_{i=m+1}^{J} sup_{vi∈R^{ni}, ‖vi‖₂=1} |⟨vi∗, vi⟩| = Σ_{i=m+1}^{J} ‖vi∗‖₂, where the last equality follows from sup_{vi∈R^{ni}, ‖vi‖₂=1} |⟨vi∗, vi⟩| = ‖vi∗‖₂. [sent-239, score-0.472]

51 A fundamental fact from Functional Analysis states that a (Hermitian) conjugation is an isometric isomorphism. [sent-241, score-0.012]

52 Intuitively, the condition µX(Ggood) < 1 guarantees that no bad group “mimics” any good group too well. [sent-255, score-0.026]

53 3.3 The Noisy Case: The following theorem extends the results of Theorem 1 to deal with non-zero Gaussian noise ν. [sent-259, score-0.031]

54 It shows that under certain conditions the Group-OMP algorithm does not select bad groups. [sent-260, score-0.022]

55 We thus obtain the following theorem which states the main consistency result for Group-OMP. [sent-265, score-0.038]

56 To prove the theorem a series of lemmas are needed, whose proofs are omitted due to space constraint, as they can be derived using arguments similar to Zhang [13] for the standard OMP case. [sent-271, score-0.031]

57 The following lemma gives a lower bound on the correlation between the good groups and the residuals from the OLS prediction where the coefficients have been restricted to a set of good groups. [sent-272, score-0.106]

58 The following lemma relates the parameter β̂X(Ggood), which is estimated by OLS given that the set of good groups has been correctly identified, to the true parameter β̄. [sent-279, score-0.085]

59 The following lemma provides an upper bound on the correlation of the bad features to the residuals from the prediction by OLS given that the set of good groups has been correctly identified. [sent-282, score-0.128]

60 We first prove, by induction on k, that for each iteration k before the Group-OMP algorithm stops, G(k−1) ⊂ Ggood. [sent-287, score-0.029]

61 One recovers the same dependency on n by considering X̃ = √n X, β̃(k) = β(k)/√n, β̃ = β̄/√n, defining (as in [13]) ρ̃X(Ggood) = inf_β { (1/n) ‖Xβ‖₂² / ‖β‖₂² : supp(β) ⊂ ggood }, and noting that ρ̃X(Ggood) = ρX̃(Ggood) and β̂X̃(Ggood, y) = β̂X(Ggood, y)/√n. [sent-291, score-0.9]

62 Lemma 5 together with the condition on ε of Theorem 2 implies that with probability at least 1 − η, max_{Gj∈Ggood} ‖X∗Gj(Xβ̄ − y)‖₂ ≤ σ√(2d ln(2d/η)) < ε(1 − µX(Ggood)). (7) [sent-302, score-0.026]

63 Lemma 3 together with the definition of ρX(Ggood) implies max_{Gj∈Ggood} ‖X∗Gj(y − Xβ(k−1))‖₂ ≥ ρX(Ggood) ‖β(k−1) − β̄‖₂ / √(|Ggood \ G(k−1)|). (8) We then have to deal with the following cases. [sent-303, score-0.016]

64 It follows that max_{Gj∈Ggood} ‖X∗Gj(y − Xβ(k−1))‖₂ > ε > max_{Gj} ‖X∗Gj(Xβ̄ − y)‖₂ / (1 − µX(Ggood)), (9) where the last inequality follows from Eq. (7). [sent-305, score-0.024]

65 The above cases imply that if the algorithm does not stop we have Gi(k) ∈ Ggood, and hence G(k) ⊆ Ggood; and if the algorithm stops we have ‖β(k−1) − β̄‖₂ ≤ ε √(|Ggood \ G(k−1)|) / ρX(Ggood). [sent-325, score-0.024]

66 Thus, by G(k−1) ⊆ Ggood and induction, if the Group-OMP algorithm stops at iteration k, we have that ‖β(k−1) − β̄‖₂ ≤ ε √(|Ggood \ G(k−1)|) / ρX(Ggood). [sent-326, score-0.032]

67 Lemma 4 implies that (C3) holds, and together with the theorem's condition on ε it also implies that with probability at least 1 − η, we have ‖β̂X(Ggood, y) − β̂X(Ggood, Ey)‖∞ ≤ σ√(2 ln(2|Ggood|/η) / ρX(Ggood)) < ε / √(ρX(Ggood)). [sent-328, score-0.026]

68 Comparison with OMP will test the effect of “grouping” OMP, while Group Lasso is included as a representative existing method of group variable selection. [sent-333, score-0.098]

69 We compare the performance of these methods in terms of the accuracy of variable selection, variable group selection and prediction. [sent-334, score-0.157]

70 As a measure of variable (group) selection accuracy we use the F1 measure, defined as F1 = 2PR/(P + R), where P denotes the precision and R denotes the recall. [sent-335, score-0.059]

71 For computing variable group F1 for a variable selection method, we consider a group to be selected if any of the variables in the group is selected. [sent-336, score-0.082]

72 As a measure of prediction accuracy, we use the model error, defined as Model error = (β̂ − β̄)∗ E(X∗X) (β̂ − β̄), where β̄ are the true model coefficients and β̂ the estimated coefficients. [sent-337, score-0.012]
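
The two evaluation measures just described can be computed as in the short sketch below; the function names and argument layout are illustrative assumptions, and the group-level F1 uses the convention stated above (a group counts as selected as soon as any of its variables is selected).

```python
import numpy as np

def group_f1(selected_vars, groups, true_groups):
    """Group-level F1 = 2PR/(P + R) for a set of selected variable indices."""
    sel = {j for j, G in enumerate(groups)
           if any(v in selected_vars for v in G)}
    true = set(true_groups)
    if not sel or not true:
        return 0.0
    p = len(sel & true) / len(sel)     # precision
    r = len(sel & true) / len(true)    # recall
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

def model_error(beta_hat, beta_true, cov_X):
    """Model error (beta_hat - beta_true)^T E(X^T X) (beta_hat - beta_true);
    cov_X is E(X^T X), which is known here because the data are simulated."""
    d = beta_hat - beta_true
    return float(d @ cov_X @ d)
```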

73 So the tuning parameter for Lasso and Group Lasso is the penalty parameter λ. [sent-339, score-0.021]

74 For the oracle estimate, the tuning parameter is chosen so as to minimize the model error. [sent-344, score-0.06]

75 The holdout-validated estimate is a practical version of the oracle estimate, obtained by selecting the tuning parameter by minimizing the average squared error on a validation set. [sent-346, score-0.06]
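
For Group-OMP the tuning parameter is simply the number of iterations, so the holdout-validated estimate can be sketched as below by re-running the (hypothetical) group_omp routine from the earlier sketch with an increasing cap and keeping the iterate with the smallest average squared error on the validation set; the function name and interface are assumptions for illustration.

```python
import numpy as np

def holdout_select(X_tr, y_tr, X_val, y_val, groups, max_k=None):
    """Pick the Group-OMP stopping iteration k on a validation set.

    Assumes the group_omp sketch defined earlier; re-fitting from scratch
    for every k is wasteful but keeps the example short."""
    if max_k is None:
        max_k = len(groups)
    best_k = 0
    best_beta = np.zeros(X_tr.shape[1])
    best_err = np.mean(y_val ** 2)            # null model, beta = 0
    for k in range(1, max_k + 1):
        _, beta = group_omp(X_tr, y_tr, groups, max_groups=k)
        err = np.mean((y_val - X_val @ beta) ** 2)
        if err < best_err:
            best_k, best_err, best_beta = k, err, beta
    return best_k, best_beta
```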

76 Experiment 1: We use an additive model with categorical variables taken from [12](model I). [sent-348, score-0.033]

77 Then let (X_{2(i−1)+1} , X_{2i}) = (I(Wi = 1), I(Wi = 0)); these are the variables that the estimation methods use as the explanatory variables, with the following variable groups: Gi = {2i − 1, 2i} (i = 1, . . . , J). [sent-366, score-0.062]
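
How the categorical factors Wi are generated from the latent variables is only partially recoverable from the extracted text, so the sketch below takes a matrix of categorical levels as given (three levels 0/1/2 are assumed purely for illustration) and shows only the indicator coding and grouping described above.

```python
import numpy as np

def indicator_groups(W):
    """Code each categorical factor W_i as (I(W_i = 1), I(W_i = 0)) and
    treat each pair of indicator columns as one variable group."""
    n, J = W.shape
    X = np.empty((n, 2 * J))
    groups = []
    for i in range(J):
        X[:, 2 * i] = (W[:, i] == 1).astype(float)
        X[:, 2 * i + 1] = (W[:, i] == 0).astype(float)
        groups.append(np.array([2 * i, 2 * i + 1]))
    return X, groups
```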

78 We ran 100 runs, each with 50 observations for training and 25 for validation. [sent-370, score-0.012]

79 Experiment 2: We use an additive model with continuous variables taken from [12](model III), where the groups correspond to the expansion of each variable into a third-order polynomial. [sent-371, score-0.115]

80 Then let the explanatory variables be (X_{3(i−1)+1} , X_{3(i−1)+2} , X_{3i}) = (Wi³, Wi², Wi), with the variable groups Gi = {3(i − 1) + 1, 3(i − 1) + 2, 3i} (i = 1, . . . , J). [sent-388, score-0.06]

81 Experiment 3: We use an additive model with continuous variables similar to that of [16]. [sent-393, score-0.033]

82 The true model is Y = 3 Σ_{i=1}^{5} Xi + 4 Σ_{i=6}^{10} Xi + 2 Σ_{i=11}^{15} Xi + ν, where ν ∼ N(0, σ = 15) and the groups are Gk = {5(k − 1) + 1, . . . , 5k}. [sent-409, score-0.062]
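
A data-generating sketch in the spirit of Experiment 3 is given below; the extracted text does not specify how the Xi are drawn (only that the design follows [16]), so independent standard normal features and a total of 20 variables (one inactive group beyond the 15 active variables) are assumptions made purely for illustration.

```python
import numpy as np

def make_experiment3(n, p=20, seed=None):
    """Groups of five consecutive variables; the first three groups are
    active with coefficients 3, 4 and 2, and the noise has sigma = 15."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))           # assumed feature distribution
    beta = np.zeros(p)
    beta[0:5], beta[5:10], beta[10:15] = 3.0, 4.0, 2.0
    y = X @ beta + rng.normal(0.0, 15.0, size=n)
    groups = [np.arange(5 * k, 5 * (k + 1)) for k in range(p // 5)]
    return X, y, beta, groups
```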

83 Experiment 4: We use an additive model with continuous variables taken from [15]. [sent-417, score-0.033]

84 Consider 10 measurements of each of these hidden variables such that Xi = (0. [sent-425, score-0.02]

85 We note that F1 (Var) and F1 (Group) are identical for the grouped methods for Experiments 1, 2 and 4, since in these the groups have equal size. [sent-454, score-0.071]

86 In particular, Group-OMP does better than OMP not only for variable group selection, but also for variable selection and predictive accuracy. (Footnote: Other ways of translating variable selection to variable group selection are possible, but the F1 measure is relatively robust with respect to this choice.) [sent-456, score-0.04]

87 Table 1: Average F1 score at the variable level and group level, and model error for the models output by Ordinary Least Squares, Lasso, OMP, Group Lasso, and Group-OMP. [sent-665, score-0.098]

88 That is, the advantage of Group-OMP over OMP holds not only for variable group selection, but also for variable selection and predictive accuracy. [sent-685, score-0.02]

89 Against Group Lasso, Group-OMP does better in all four experiments with respect to variable (group) selection when using the oracle, while it does worse in one case when using holdout validation. [sent-686, score-0.132]

90 The continuous variables appear to have non-linear effects on the target value, so for each such variable, say Xi, we consider its third-order polynomial expansion. [sent-690, score-0.02]

91 That is, Xi, Xi², and Xi³ are taken together and considered as a variable group. [sent-692, score-0.02]
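
This per-variable polynomial grouping can be sketched as follows; the helper name is an assumption, and centring/standardisation details, which the extracted text does not specify, are omitted.

```python
import numpy as np

def poly_group_expand(X_cont):
    """Expand each continuous column x into (x, x^2, x^3) and treat the
    three resulting columns as one variable group."""
    n, p = X_cont.shape
    cols, groups = [], []
    for i in range(p):
        x = X_cont[:, i]
        cols.extend([x, x ** 2, x ** 3])
        groups.append(np.arange(3 * i, 3 * (i + 1)))
    return np.column_stack(cols), groups
```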

92 We ran 100 runs, where for each run we select at random half of the instances as training examples, one quarter as validation set, and the remaining quarter as test examples. [sent-693, score-0.038]

93 The penalty parameter was chosen with holdout validation for all methods. [sent-694, score-0.094]

94 The average test set prediction error and the average number of selected original variables (i.e., original variables whose corresponding groups are selected) are reported for each method. [sent-695, score-0.032]

95 These results confirm that Group-OMP has the highest prediction accuracy among the comparison methods, and also leads to the sparsest model. [sent-698, score-0.02]

96 5 Concluding Remarks: In addition to its merits in terms of consistency and accuracy, Group-OMP is particularly attractive due to its computational efficiency (the entire path is computed in J rounds, where J is the number of groups). [sent-699, score-0.017]

97 Tropp, J. A., Gilbert, A. C., Strauss, M. J., Algorithms for simultaneous sparse approximation, Part I: Greedy pursuit, Signal Proc., 2006. [sent-749, score-0.016]

98 Yuan, M., Lin, Y., Model selection and estimation in regression with grouped variables, J. Royal Statist. Soc. B, 2006. [sent-757, score-0.06]

99 Zhang, T., On the consistency of feature selection using greedy least squares regression, J. Machine Learning Research, 2009. [sent-763, score-0.076]

100 Zhao, P., Rocha, G., Yu, B., Grouped and hierarchical model selection through composite absolute penalties, Manuscript, 2006. [sent-769, score-0.031]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ggood', 0.9), ('xgj', 0.266), ('omp', 0.143), ('gj', 0.104), ('lasso', 0.097), ('maxgj', 0.086), ('holdout', 0.081), ('gbad', 0.079), ('group', 0.078), ('xgm', 0.065), ('rdm', 0.063), ('groups', 0.062), ('gi', 0.061), ('ols', 0.058), ('oracle', 0.052), ('xggood', 0.043), ('vi', 0.032), ('selection', 0.031), ('coef', 0.031), ('rdj', 0.029), ('xgi', 0.029), ('supp', 0.027), ('pursuit', 0.026), ('stops', 0.024), ('lemma', 0.023), ('gm', 0.023), ('bad', 0.022), ('cients', 0.022), ('ln', 0.022), ('explanatory', 0.022), ('rni', 0.022), ('xgbad', 0.022), ('zi', 0.021), ('vj', 0.021), ('theorem', 0.021), ('regression', 0.02), ('variables', 0.02), ('variable', 0.02), ('hang', 0.02), ('vm', 0.02), ('rn', 0.017), ('housing', 0.017), ('consistency', 0.017), ('wi', 0.017), ('span', 0.016), ('greedy', 0.016), ('boston', 0.015), ('noiseless', 0.015), ('ascending', 0.015), ('idj', 0.014), ('ropp', 0.014), ('xg', 0.014), ('grouping', 0.014), ('stopping', 0.014), ('orthogonal', 0.014), ('sup', 0.014), ('penalty', 0.013), ('matching', 0.013), ('additive', 0.013), ('groupings', 0.013), ('quarter', 0.013), ('reorder', 0.013), ('prediction', 0.012), ('norm', 0.012), ('squares', 0.012), ('exp', 0.012), ('ran', 0.012), ('ordinary', 0.012), ('conjugation', 0.012), ('pseudoinverse', 0.012), ('gk', 0.011), ('fj', 0.011), ('induction', 0.011), ('runs', 0.011), ('condition', 0.01), ('indices', 0.01), ('dictionaries', 0.01), ('upperbound', 0.01), ('extends', 0.01), ('cov', 0.01), ('fd', 0.01), ('xi', 0.01), ('prove', 0.01), ('signal', 0.009), ('mistake', 0.009), ('residuals', 0.009), ('dm', 0.009), ('penalties', 0.009), ('grouped', 0.009), ('max', 0.008), ('iteration', 0.008), ('implies', 0.008), ('proof', 0.008), ('inequality', 0.008), ('tuning', 0.008), ('ey', 0.008), ('rd', 0.008), ('mistakes', 0.008), ('holds', 0.008), ('accuracy', 0.008)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction

Author: Grzegorz Swirszcz, Naoki Abe, Aurelie C. Lozano

Abstract: We consider the problem of variable group selection for least squares regression, namely, that of selecting groups of variables for best regression performance, leveraging and adhering to a natural grouping structure within the explanatory variables. We show that this problem can be efficiently addressed by using a certain greedy style algorithm. More precisely, we propose the Group Orthogonal Matching Pursuit algorithm (Group-OMP), which extends the standard OMP procedure (also referred to as “forward greedy feature selection algorithm” for least squares regression) to perform stage-wise group variable selection. We prove that under certain conditions Group-OMP can identify the correct (groups of) variables. We also provide an upperbound on the l∞ norm of the difference between the estimated regression coefficients and the true coefficients. Experimental results on simulated and real world datasets indicate that Group-OMP compares favorably to Group Lasso, OMP and Lasso, both in terms of variable selection and prediction accuracy. 1

2 0.098146193 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis

Author: Sundeep Rangan, Alyson K. Fletcher

Abstract: A well-known analysis of Tropp and Gilbert shows that orthogonal matching pursuit (OMP) can recover a k-sparse n-dimensional real vector from m = 4k log(n) noise-free linear measurements obtained through a random Gaussian measurement matrix with a probability that approaches one as n → ∞. This work strengthens this result by showing that a lower number of measurements, m = 2k log(n − k), is in fact sufficient for asymptotic recovery. More generally, when the sparsity level satisfies kmin ≤ k ≤ kmax but is unknown, m = 2kmax log(n − kmin ) measurements is sufficient. Furthermore, this number of measurements is also sufficient for detection of the sparsity pattern (support) of the vector with measurement errors provided the signal-to-noise ratio (SNR) scales to infinity. The scaling m = 2k log(n − k) exactly matches the number of measurements required by the more complex lasso method for signal recovery in a similar SNR scaling.

3 0.05933246 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

Author: Shuheng Zhou

Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Candès-Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model. 1

4 0.049064163 67 nips-2009-Directed Regression

Author: Yi-hao Kao, Benjamin V. Roy, Xiang Yan

Abstract: When used to guide decisions, linear regression analysis typically involves estimation of regression coefficients via ordinary least squares and their subsequent use to make decisions. When there are multiple response variables and features do not perfectly capture their relationships, it is beneficial to account for the decision objective when computing regression coefficients. Empirical optimization does so but sacrifices performance when features are well-chosen or training data are insufficient. We propose directed regression, an efficient algorithm that combines merits of ordinary least squares and empirical optimization. We demonstrate through a computational study that directed regression can generate significant performance gains over either alternative. We also develop a theory that motivates the algorithm. 1

5 0.044339173 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

Author: Sahand Negahban, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar

Abstract: High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless p/n → 0, a line of recent work has studied models with various types of structure (e.g., sparse vectors; block-structured matrices; low-rank matrices; Markov assumptions). In such settings, a general approach to estimation is to solve a regularized convex program (known as a regularized M-estimator) which combines a loss function (measuring how well the model fits the data) with some regularization function that encourages the assumed structure. The goal of this paper is to provide a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive several existing results, and also to obtain several new results on consistency and convergence rates. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure the corresponding regularized M-estimators have fast convergence rates. 1

6 0.042480823 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

7 0.040739652 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

8 0.039392024 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

9 0.039086934 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization

10 0.037123736 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

11 0.036273379 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

12 0.029462591 224 nips-2009-Sparse and Locally Constant Gaussian Graphical Models

13 0.028168468 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs

14 0.025373891 157 nips-2009-Multi-Label Prediction via Compressed Sensing

15 0.0253109 55 nips-2009-Compressed Least-Squares Regression

16 0.02511174 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models

17 0.025000758 213 nips-2009-Semi-supervised Learning using Sparse Eigenfunction Bases

18 0.023025377 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints

19 0.022030724 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior

20 0.021933693 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.065), (1, 0.043), (2, 0.0), (3, 0.029), (4, -0.005), (5, -0.031), (6, 0.049), (7, -0.071), (8, 0.052), (9, 0.026), (10, -0.006), (11, -0.013), (12, -0.002), (13, 0.016), (14, 0.031), (15, 0.048), (16, 0.015), (17, -0.022), (18, -0.001), (19, 0.031), (20, 0.018), (21, 0.038), (22, -0.028), (23, 0.045), (24, 0.023), (25, 0.028), (26, -0.02), (27, 0.011), (28, -0.04), (29, 0.02), (30, -0.045), (31, 0.06), (32, -0.051), (33, 0.031), (34, 0.041), (35, 0.016), (36, -0.039), (37, 0.018), (38, 0.058), (39, -0.06), (40, 0.105), (41, 0.003), (42, -0.069), (43, -0.038), (44, 0.035), (45, 0.009), (46, -0.005), (47, 0.081), (48, -0.027), (49, -0.119)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89477003 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction

Author: Grzegorz Swirszcz, Naoki Abe, Aurelie C. Lozano

Abstract: We consider the problem of variable group selection for least squares regression, namely, that of selecting groups of variables for best regression performance, leveraging and adhering to a natural grouping structure within the explanatory variables. We show that this problem can be efficiently addressed by using a certain greedy style algorithm. More precisely, we propose the Group Orthogonal Matching Pursuit algorithm (Group-OMP), which extends the standard OMP procedure (also referred to as “forward greedy feature selection algorithm” for least squares regression) to perform stage-wise group variable selection. We prove that under certain conditions Group-OMP can identify the correct (groups of) variables. We also provide an upperbound on the l∞ norm of the difference between the estimated regression coefficients and the true coefficients. Experimental results on simulated and real world datasets indicate that Group-OMP compares favorably to Group Lasso, OMP and Lasso, both in terms of variable selection and prediction accuracy. 1

2 0.56623656 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation

Author: Shuheng Zhou

Abstract: Given n noisy samples with p dimensions, where n ≪ p, we show that the multistep thresholding procedure can accurately estimate a sparse vector β ∈ Rp in a linear model, under the restricted eigenvalue conditions (Bickel-Ritov-Tsybakov 09). Thus our conditions for model selection consistency are considerably weaker than what has been achieved in previous works. More importantly, this method allows very significant values of s, which is the number of non-zero elements in the true parameter. For example, it works for cases where the ordinary Lasso would have failed. Finally, we show that if X obeys a uniform uncertainty principle and if the true parameter is sufficiently sparse, the Gauss-Dantzig selector (Candès-Tao 07) achieves the ℓ2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle which would supply perfect information about which coordinates are non-zero and which are above the noise level, while selecting a sufficiently sparse model. 1

3 0.56536686 173 nips-2009-Nonparametric Greedy Algorithms for the Sparse Learning Problem

Author: Han Liu, Xi Chen

Abstract: This paper studies the forward greedy strategy in sparse nonparametric regression. For additive models, we propose an algorithm called additive forward regression; for general multivariate models, we propose an algorithm called generalized forward regression. Both algorithms simultaneously conduct estimation and variable selection in nonparametric settings for the high dimensional sparse learning problem. Our main emphasis is empirical: on both simulated and real data, these two simple greedy methods can clearly outperform several state-ofthe-art competitors, including LASSO, a nonparametric version of LASSO called the sparse additive model (SpAM) and a recently proposed adaptive parametric forward-backward algorithm called Foba. We also provide some theoretical justifications of specific versions of the additive forward regression. 1

4 0.54816675 36 nips-2009-Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

Author: Sundeep Rangan, Vivek Goyal, Alyson K. Fletcher

Abstract: The replica method is a non-rigorous but widely-accepted technique from statistical physics used in the asymptotic analysis of large, random, nonlinear problems. This paper applies the replica method to non-Gaussian maximum a posteriori (MAP) estimation. It is shown that with random linear measurements and Gaussian noise, the asymptotic behavior of the MAP estimate of an n-dimensional vector “decouples” as n scalar MAP estimators. The result is a counterpart to Guo and Verdú's replica analysis of minimum mean-squared error estimation. The replica MAP analysis can be readily applied to many estimators used in compressed sensing, including basis pursuit, lasso, linear estimation with thresholding, and zero norm-regularized estimation. In the case of lasso estimation the scalar estimator reduces to a soft-thresholding operator, and for zero norm-regularized estimation it reduces to a hard-threshold. Among other benefits, the replica method provides a computationally-tractable method for exactly computing various performance metrics including mean-squared error and sparsity pattern recovery probability.

5 0.54714805 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis

Author: Sundeep Rangan, Alyson K. Fletcher

Abstract: A well-known analysis of Tropp and Gilbert shows that orthogonal matching pursuit (OMP) can recover a k-sparse n-dimensional real vector from m = 4k log(n) noise-free linear measurements obtained through a random Gaussian measurement matrix with a probability that approaches one as n → ∞. This work strengthens this result by showing that a lower number of measurements, m = 2k log(n − k), is in fact sufficient for asymptotic recovery. More generally, when the sparsity level satisfies kmin ≤ k ≤ kmax but is unknown, m = 2kmax log(n − kmin ) measurements is sufficient. Furthermore, this number of measurements is also sufficient for detection of the sparsity pattern (support) of the vector with measurement errors provided the signal-to-noise ratio (SNR) scales to infinity. The scaling m = 2k log(n − k) exactly matches the number of measurements required by the more complex lasso method for signal recovery in a similar SNR scaling.

6 0.5096947 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

7 0.49454919 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors

8 0.49397358 67 nips-2009-Directed Regression

9 0.41181305 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

10 0.4012287 64 nips-2009-Data-driven calibration of linear estimators with minimal penalties

11 0.39384308 55 nips-2009-Compressed Least-Squares Regression

12 0.35186633 35 nips-2009-Approximating MAP by Compensating for Structural Relaxations

13 0.34719646 54 nips-2009-Compositionality of optimal control laws

14 0.32837176 7 nips-2009-A Data-Driven Approach to Modeling Choice

15 0.31096631 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

16 0.30387744 144 nips-2009-Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

17 0.30350468 166 nips-2009-Noisy Generalized Binary Search

18 0.30226541 117 nips-2009-Inter-domain Gaussian Processes for Sparse Inference using Inducing Features

19 0.29862863 176 nips-2009-On Invariance in Hierarchical Models

20 0.2928738 108 nips-2009-Heterogeneous multitask learning with joint sparsity constraints


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.011), (24, 0.049), (25, 0.088), (35, 0.03), (36, 0.082), (39, 0.048), (58, 0.071), (61, 0.013), (69, 0.329), (71, 0.025), (77, 0.035), (86, 0.046), (91, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72019583 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction

Author: Grzegorz Swirszcz, Naoki Abe, Aurelie C. Lozano

Abstract: We consider the problem of variable group selection for least squares regression, namely, that of selecting groups of variables for best regression performance, leveraging and adhering to a natural grouping structure within the explanatory variables. We show that this problem can be efficiently addressed by using a certain greedy style algorithm. More precisely, we propose the Group Orthogonal Matching Pursuit algorithm (Group-OMP), which extends the standard OMP procedure (also referred to as “forward greedy feature selection algorithm” for least squares regression) to perform stage-wise group variable selection. We prove that under certain conditions Group-OMP can identify the correct (groups of) variables. We also provide an upperbound on the l∞ norm of the difference between the estimated regression coefficients and the true coefficients. Experimental results on simulated and real world datasets indicate that Group-OMP compares favorably to Group Lasso, OMP and Lasso, both in terms of variable selection and prediction accuracy. 1

2 0.67184305 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification

Author: Wenming Zheng, Zhouchen Lin

Abstract: The method of common spatio-spectral patterns (CSSPs) is an extension of common spatial patterns (CSPs) by utilizing the technique of delay embedding to alleviate the adverse effects of noises and artifacts on the electroencephalogram (EEG) classification. Although the CSSPs method has shown to be more powerful than the CSPs method in the EEG classification, this method is only suitable for two-class EEG classification problems. In this paper, we generalize the two-class CSSPs method to multi-class cases. To this end, we first develop a novel theory of multi-class Bayes error estimation and then present the multi-class CSSPs (MCSSPs) method based on this Bayes error theoretical framework. By minimizing the estimated closed-form Bayes error, we obtain the optimal spatio-spectral filters of MCSSPs. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on the BCI competition 2005 data set. The experimental results show that our method significantly outperforms the previous multi-class CSPs (MCSPs) methods in the EEG classification.

3 0.64696497 243 nips-2009-The Ordered Residual Kernel for Robust Motion Subspace Clustering

Author: Tat-jun Chin, Hanzi Wang, David Suter

Abstract: We present a novel and highly effective approach for multi-body motion segmentation. Drawing inspiration from robust statistical model fitting, we estimate putative subspace hypotheses from the data. However, instead of ranking them we encapsulate the hypotheses in a novel Mercer kernel which elicits the potential of two point trajectories to have emerged from the same subspace. The kernel permits the application of well-established statistical learning methods for effective outlier rejection, automatic recovery of the number of motions and accurate segmentation of the point trajectories. The method operates well under severe outliers arising from spurious trajectories or mistracks. Detailed experiments on a recent benchmark dataset (Hopkins 155) show that our method is superior to other stateof-the-art approaches in terms of recovering the number of motions, segmentation accuracy, robustness against gross outliers and computational efficiency. 1 Introduction1 Multi-body motion segmentation concerns the separation of motions arising from multiple moving objects in a video sequence. The input data is usually a set of points on the surface of the objects which are tracked throughout the video sequence. Motion segmentation can serve as a useful preprocessing step for many computer vision applications. In recent years the case of rigid (i.e. nonarticulated) objects for which the motions could be semi-dependent on each other has received much attention [18, 14, 19, 21, 22, 17]. Under this domain the affine projection model is usually adopted. Such a model implies that the point trajectories from a particular motion lie on a linear subspace of at most four, and trajectories from different motions lie on distinct subspaces. Thus multi-body motion segmentation is reduced to the problem of subspace segmentation or clustering. To realize practical algorithms, motion segmentation approaches should possess four desirable attributes: (1) Accuracy in classifying the point trajectories to the motions they respectively belong to. This is crucial for success in the subsequent vision applications, e.g. object recognition, 3D reconstruction. (2) Robustness against inlier noise (e.g. slight localization error) and gross outliers (e.g. mistracks, spurious trajectories), since getting imperfect data is almost always unavoidable in practical circumstances. (3) Ability to automatically deduce the number of motions in the data. This is pivotal to accomplish fully automated vision applications. (4) Computational efficiency. This is integral for the processing of video sequences which are usually large amounts of data. Recent work on multi-body motion segmentation can roughly be divided into algebraic or factorization methods [3, 19, 20], statistical methods [17, 7, 14, 6, 10] and clustering methods [22, 21, 5]. Notable approaches include Generalized PCA (GPCA) [19, 20], an algebraic method based on the idea that one can fit a union of m subspaces with a set of polynomials of degree m. Statistical methods often employ concepts such random hypothesis generation [4, 17], Expectation-Maximization [14, 6] 1 This work was supported by the Australian Research Council (ARC) under the project DP0878801. 1 and geometric model selection [7, 8]. Clustering based methods [22, 21, 5] are also gaining attention due to their effectiveness. They usually include a dimensionality reduction step (e.g. manifold learning [5]) followed by a clustering of the point trajectories (e.g. via spectral clustering in [21]). 
A recent benchmark [18] indicated that Local Subspace Affinity (LSA) [21] gave the best performance in terms of classification accuracy, although their result was subsequently surpassed by [5, 10]. However, we argue that most of the previous approaches do not simultaneously fulfil the qualities desirable of motion segmentation algorithms. Most notably, although some of the approaches have the means to estimate the number of motions, they are generally unreliable in this respect and require manual input of this parameter. In fact this prior knowledge was given to all the methods compared in [18]2 . Secondly, most of the methods (e.g. [19, 5]) do not explicitly deal with outliers. They will almost always breakdown when given corrupted data. These deficiencies reduce the usefulness of available motion segmentation algorithms in practical circumstances. In this paper we attempt to bridge the gap between experimental performance and practical usability. Our previous work [2] indicates that robust multi-structure model fitting can be achieved effectively with statistical learning. Here we extend this concept to motion subspace clustering. Drawing inspiration from robust statistical model fitting [4], we estimate random hypotheses of motion subspaces in the data. However, instead of ranking these hypotheses we encapsulate them in a novel Mercer kernel. The kernel can function reliably despite overwhelming sampling imbalance, and it permits the application of non-linear dimensionality reduction techniques to effectively identify and reject outlying trajectories. This is then followed by Kernel PCA [11] to maximize the separation between groups and spectral clustering [13] to recover the number of motions and clustering. Experiments on the Hopkins 155 benchmark dataset [18] show that our method is superior to other approaches in terms of the qualities described above, including computational efficiency. 1.1 Brief review of affine model multi-body motion segmentation Let {tf p ∈ R2 }f =1,...,F be the set of 2D coordinates of P trajectories tracked across F frames. In p=1,...,P multi-body motion segmentation the tf p ’s correspond to points on the surface of rigid objects which are moving. The goal is to separate the trajectories into groups corresponding to the motion they belong to. In other words, if we arrange the coordinates in the following data matrix   t11 · · · t1P  . .  ∈ R2F ×P , .. .  T= . (1) . . . tF 1 . . . tF P the goal is to find the permutation Γ ∈ RP ×P such that the columns of T · Γ are arranged according to the respective motions they belong to. It turns out that under affine projection [1, 16] trajectories from the same motion lie on a distinct subspace in R2F , and each of these motion subspaces is of dimensions 2, 3 or 4. Thus motion segmentation can be accomplished via clustering subspaces in R2F . See [1, 16] for more details. Realistically actual motion sequences might contain trajectories which do not correspond to valid objects or motions. These trajectories behave as outliers in the data and, if not taken into account, can be seriously detrimental to subspace clustering algorithms. 2 The Ordered Residual Kernel (ORK) First, we take a statistical model fitting point of view to motion segmentation. Let {xi }i=1,...,N be the set of N samples on which we want to perform model fitting. We randomly draw p-subsets from the data and use it to fit a hypothesis of the model, where p is the number of parameters that define the model. 
In motion segmentation, the xi ’s are the columns of matrix T, and p = 4 since the model is a four-dimensional subspace3 . Assume that M of such random hypotheses are drawn. i i For each data point xi compute its absolute residual set ri = {r1 , . . . , rM } as measured to the M hypotheses. For motion segmentation, the residual is the orthogonal distance to a hypothesis 2 As confirmed through private contact with the authors of [18]. Ideally we should also consider degenerate motions with subspace dimensions 2 or 3, but previous work [18] using RANSAC [4] and our results suggest this is not a pressing issue for the Hopkins 155 dataset. 3 2 i i subspace. We sort the elements in ri to obtain the sorted residual set ˜i = {rλi , . . . , rλi }, where r 1 M i i the permutation {λi , . . . , λi } is obtained such that rλi ≤ · · · ≤ rλi . Define the following 1 M 1 M ˜ θi := {λi , . . . , λi } 1 M (2) ˜ as the sorted hypothesis set of point xi , i.e. θi depicts the order in which xi becomes the inlier of the M hypotheses as a fictitious inlier threshold is increased from 0 to ∞. We define the Ordered Residual Kernel (ORK) between two data points as 1 kr (xi1 , xi2 ) := ˜ Z M/h t ˜ ˜ zt · k∩ (θi1 , θi2 ), (3) t=1 M/h where zt = 1 are the harmonic series and Z = t=1 zt is the (M/h)-th harmonic number. t Without lost of generality assume that M is wholly divisible by h. Step size h is used to obtain the Difference of Intersection Kernel (DOIK) 1 ˜1:α t ˜ ˜ ˜1:α ˜1:α ˜1:α k∩ (θi1 , θi2 ) := (|θi1 t ∩ θi2 t | − |θi1 t−1 ∩ θi2 t−1 |) (4) h ˜a:b where αt = t · h and αt−1 = (t − 1) · h. Symbol θi indicates the set formed by the a-th to ˜i . Since the contents of the sorted hypotheses set are merely permutations of the b-th elements of θ {1 . . . M }, i.e. there are no repeating elements, 0 ≤ kr (xi1 , xi2 ) ≤ 1. ˜ (5) Note that kr is independent of the type of model to be fitted, thus it is applicable to generic statistical ˜ model fitting problems. However, we concentrate on motion subspaces in this paper. Let τ be a fictitious inlier threshold. The kernel kr captures the intuition that, if τ is low, two ˜ points arising from the same subspace will have high normalized intersection since they share many common hypotheses which correspond to that subspace. If τ is high, implausible hypotheses fitted on outliers start to dominate and decrease the normalized intersection. Step size h allows us to quantify the rate of change of intersection if τ is increased from 0 to ∞, and since zt is decreasing, kr will evaluate to a high value for two points from the same subspace. In contrast, kr is always low ˜ ˜ for points not from the same subspace or that are outliers. Proof of satisfying Mercer’s condition. Let D be a fixed domain, and P(D) be the power set of D, i.e. the set of all subsets of D. Let S ⊆ P(D), and p, q ∈ S. If µ is a measure on D, then k∩ (p, q) = µ(p ∩ q), (6) called the intersection kernel, is provably a valid Mercer kernel [12]. The DOIK can be rewritten as t ˜ ˜ k∩ (θi1 , θi2 ) = 1 ˜(αt−1 +1):αt ˜(αt−1 +1):αt (|θ ∩ θi2 | h i1 ˜1:(α ) ˜(α +1):αt | + |θ (αt−1 +1):αt ∩ θ 1:(αt−1 ) |). ˜ ˜ +|θi1 t−1 ∩ θi2 t−1 i1 i2 (7) If we let D = {1 . . . M } be the set of all possible hypothesis indices and µ be uniform on D, each term in Eq. (7) is simply an intersection kernel multiplied by |D|/h. 
Since multiplying a kernel with a positive constant and adding two kernels respectively produce valid Mercer kernels [12], the DOIK and ORK are also valid Mercer kernels.• Parameter h in kr depends on the number of random hypotheses M , i.e. step size h can be set as a ˜ ratio of M . The value of M can be determined based on the size of the p-subset and the size of the data N (e.g. [23, 15]), and thus h is not contingent on knowledge of the true inlier noise scale or threshold. Moreover, our experiments in Sec. 4 show that segmentation performance is relatively insensitive to the settings of h and M . 2.1 Performance under sampling imbalance Methods based on random sampling (e.g. RANSAC [4]) are usually affected by unbalanced datasets. The probability of simultaneously retrieving p inliers from a particular structure is tiny if points 3 from that structure represent only a small minority in the data. In an unbalanced dataset the “pure” p-subsets in the M randomly drawn samples will be dominated by points from the majority structure in the data. This is a pronounced problem in motion sequences, since there is usually a background “object” whose point trajectories form a large majority in the data. In fact, for motion sequences from the Hopkins 155 dataset [18] with typically about 300 points per sequence, M has to be raised to about 20,000 before a pure p-subset from the non-background objects is sampled. However, ORK can function reliably despite serious sampling imbalance. This is because points from the same subspace are roughly equi-distance to the sampled hypotheses in their vicinity, even though these hypotheses might not pass through that subspace. Moreover, since zt in Eq. (3) is decreasing only residuals/hypotheses in the vicinity of a point are heavily weighted in the intersection. Fig. 1(a) illustrates this condition. Results in Sec. 4 show that ORK excelled even with M = 1, 000. (a) Data in R2F . (b) Data in RKHS Fkr . ˜ Figure 1: (a) ORK under sampling imbalance. (b) Data in RKHS induced by ORK. 3 Multi-Body Motion Segmentation using ORK In this section, we describe how ORK is used for multi-body motion segmentation. 3.1 Outlier rejection via non-linear dimensionality reduction Denote by Fkr the Reproducing Kernel Hilbert Space (RKHS) induced by kr . Let matrix A = ˜ ˜ [φ(x1 ) . . . φ(xN )] contain the input data after it is mapped to Fkr . The kernel matrix K = AT A is ˜ computed using the kernel function kr as ˜ Kp,q = φ(xp ), φ(xq ) = kr (xp , xq ), p, q ∈ {1 . . . N }. ˜ (8) Since kr is a valid Mercer kernel, K is guaranteed to be positive semi-definite [12]. Let K = ˜ Q∆QT be the eigenvalue decomposition (EVD) of K. Then the rank-n Kernel Singular Value Decomposition (Kernel SVD) [12] of A is 1 1 An = [AQn (∆n )− 2 ][(∆n ) 2 ][(Qn )T ] ≡ Un Σn (Vn )T . n n (9) n Via the Matlab notation, Q = Q:,1:n and ∆ = ∆1:n,1:n . The left singular vectors U is an orthonormal basis for the n-dimensional principal subspace of the whole dataset in Fkr . Projecting ˜ the data onto the principal subspace yields 1 1 B = [AQn (∆n )− 2 ]T A = (∆n ) 2 (Qn )T , (10) n×N where B = [b1 . . . bN ] ∈ R is the reduced dimension version of A. Directions of the principal subspace are dominated by inlier points, since kr evaluates to a high value generally for them, but ˜ always to a low value for gross outliers. Moreover the kernel ensures that points from the same subspace are mapped to the same cluster and vice versa. Fig. 1(b) illustrates this condition. Fig. 
2(a)(left) shows the first frame of sequence “Cars10” from the Hopkins 155 dataset [18] with 100 false trajectories of Brownian motion added to the original data (297 points). The corresponing RKHS norm histogram for n = 3 is displayed in Fig. 2(b). The existence of two distinct modes, 4 15 Outlier mode Bin count Inlier mode 10 5 0 (a) (left) Before and (right) after outlier removal. Blue dots are inliers while red dots are added outliers. 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 Vector norm in principal subspace 0.18 0.2 (b) Actual norm histogram of “cars10”. Figure 2: Demonstration of outlier rejection on sequence “cars10” from Hopkins 155. corresponding respectively to inliers and outliers, is evident. We exploit this observation for outlier rejection by discarding data with low norms in the principal subspace. The cut-off threshold ψ can be determined by analyzing the shape of the distribution. For instance we can fit a 1D Gaussian Mixture Model (GMM) with two components and set ψ as the point of equal Mahalanobis distance between the two components. However, our experimentation shows that an effective threshold can be obtained by simply setting ψ as the average value of all the norms, i.e. ψ= 1 N N bi . (11) i=1 This method was applied uniformly on all the sequences in our experiments in Sec. 4. Fig. 2(a)(right) shows an actual result of the method on Fig. 2(a)(left). 3.2 Recovering the number of motions and subspace clustering After outlier rejection, we further take advantage of the mapping induced by ORK for recovering the number of motions and subspace clustering. On the remaining data, we perform Kernel PCA [11] to seek the principal components which maximize the variance of the data in the RKHS, as Fig. 1(b) illustrates. Let {yi }i=1,...,N ′ be the N ′ -point subset of the input data that remains after outlier removal, where N ′ < N . Denote by C = [φ(y1 ) . . . φ(yN ′ )] the data matrix after mapping the data ˜ to Fkr , and by symbol C the result of adjusting C with the empirical mean of {φ(y1 ), . . . , φ(yN ′ )}. ˜ ˜ ˜ ˜ The centered kernel matrix K′ = CT C [11] can be obtained as 1 ˜ K′ = ν T K′ ν, ν = [IN ′ − ′ 1N ′ ,N ′ ], (12) N where K′ = CT C is the uncentered kernel matrix, Is and 1s,s are respectively the s × s identity ˜ ˜ matrix and a matrix of ones. If K′ = RΩRT is the EVD of K′ , then we obtain first-m kernel m ˜ principal components P of C as the first-m left singular vectors of C , i.e. 1 ˜ Pm = CRm (Ωm )− 2 , (13) where Rm = R:,1:m and Ω1:m,1:m ; see Eq. (9). Projecting the data on the principal components yields 1 D = [d1 . . . dN ′ ] = (Ωm ) 2 (Rm )T , (14) ′ where D ∈ Rm×N . The affine subspace span(Pm ) maximizes the spread of the centered data in the RKHS, and the projection D offers an effective representation for clustering. Fig. 3(a) shows the Kernel PCA projection results for m = 3 on the sequence in Fig. 2(a). The number of clusters in D is recovered via spectral clustering. More specifically we apply the Normalized Cut (Ncut) [13] algorithm. A fully connected graph is first derived from the data, where ′ ′ its weighted adjacency matrix W ∈ RN ×N is obtained as Wp,q = exp(− dp − dq 2 /2δ 2 ), (15) and δ is taken as the average nearest neighbour distance in the Euclidean sense among the vectors in D. The Laplacian matrix [13] is then derived from W and eigendecomposed. Under Ncut, 5 0.1 0.05 0 −0.05 −0.1 0.1 −0.15 0.15 0.08 0.1 0.05 0 −0.05 −0.1 0.06 (a) Kernel PCA and Ncut results. (b) W matrix. (c) Final result for “cars10”. 
Figure 3: Actual results on the motion sequence in Fig. 2(a)(left). the number of clusters is revealed as the number of eigenvalues of the Laplacian that are zero or numerically insignificant. With this knowledge, a subsequent k-means step is then performed to cluster the points. Fig. 3(b) shows W for the input data in Fig. 2(a)(left) after outlier removal. It can be seen that strong affinity exists between points from the same cluster, thus allowing accurate clustering. Figs. 3(a) and 3(c) illustrate the final clustering result for the data in Fig. 2(a)(left). There are several reasons why spectral clustering under our framework is more successful than previous methods. Firstly, we perform an effective outlier rejection step that removes bad trajectories that can potentially mislead the clustering. Secondly, the mapping induced by ORK deliberately separates the trajectories based on their cluster membership. Finally, we perform Kernel PCA to maximize the variance of the data. Effectively this also improves the separation of clusters, thus facilitating an accurate recovery of the number of clusters and also the subsequent segmentation. This distinguishes our work from previous clustering based methods [21, 5] which tend to operate without maximizing the between-class scatter. Results in Sec. 4 validate our claims. 4 Results Henceforth we indicate the proposed method as “ORK”. We leverage on a recently published benchmark on affine model motion segmentation [18] as a basis of comparison. The benchmark was evaluated on the Hopkins 155 dataset4 which contains 155 sequences with tracked point trajectories. A total of 120 sequences have two motions while 35 have three motions. The sequences contain degenerate and non-degenerate motions, independent and partially dependent motions, articulated motions, nonrigid motions etc. In terms of video content three categories exist: Checkerboard sequences, traffic sequences (moving cars, trucks) and articulated motions (moving faces, people). 4.1 Details on benchmarking Four major algorithms were compared in [18]: Generalized PCA (GPCA) [19], Local Subspace Affinity (LSA) [21], Multi-Stage Learning (MSL) [14] and RANSAC [17]. Here we extend the benchmark with newly reported results from Locally Linear Manifold Clustering (LLMC) [5] and Agglomerative Lossy Compression (ALC) [10, 9]. We also compare our method against Kanatani and Matsunaga’s [8] algorithm (henceforth, the “KM” method) in estimating the number of independent motions in the video sequences. Note that KM per se does not perform motion segmentation. For the sake of objective comparisons we use only implementations available publicly5. Following [18], motion segmentation performance is evaluated in terms of the labelling error of the point trajectories, where each point in a sequence has a ground truth label, i.e. number of mislabeled points . (16) classification error = total number of points Unlike [18], we also emphasize on the ability of the methods in recovering the number of motions. However, although the methods compared in [18] (except RANSAC) theoretically have the means to 4 Available at http://www.vision.jhu.edu/data/hopkins155/. For MSL and KM, see http://www.suri.cs.okayama-u.ac.jp/e-program-separate.html/. For GPCA, LSA and RANSAC, refer to the url for the Hopkins 155 dataset. 5 6 do so, their estimation of the number of motions is generally unrealiable and the benchmark results in [18] were obtained by revealing the actual number of motions to the algorithms. 
Unlike [18], we also emphasize the ability of the methods to recover the number of motions. However, although the methods compared in [18] (except RANSAC) theoretically have the means to do so, their estimation of the number of motions is generally unreliable, and the benchmark results in [18] were obtained by revealing the actual number of motions to the algorithms. A similar initialization exists in [5, 10], where the results were obtained by giving LLMC and ALC this knowledge a priori (for LLMC, this was given at least to the variant LLMC 4m during dimensionality reduction [5], where m is the true number of motions). In the following subsections, where variants exist for the compared algorithms, we use results from the best performing variant. In the following, the number of random hypotheses M and the step size h for ORK are fixed at 1000 and 300 respectively, and, unlike the others, ORK is not given knowledge of the number of motions.

4.2 Data without gross outliers

We apply ORK to the Hopkins 155 dataset. Since ORK uses random sampling, we repeat it 100 times for each sequence and average the results. Table 1 reports the obtained classification error alongside those of previously proposed methods. ORK (column 9) gives comparable results to the other methods for sequences with 2 motions (mean = 7.83%, median = 0.41%). For sequences with 3 motions, ORK (mean = 12.62%, median = 4.75%) outperforms GPCA and RANSAC, but is slightly less accurate than the others. However, bear in mind that, unlike the other methods, ORK is not given prior knowledge of the true number of motions and has to estimate this independently.

Column      1      2      3      4      5        6      8      9      10
Method      REF    GPCA   LSA    MSL    RANSAC   LLMC   ALC    ORK    ORK∗
Sequences with 2 motions
Mean        2.03   4.59   3.45   4.14   5.56     3.62   3.03   7.83   1.27
Median      0.00   0.38   0.59   0.00   1.18     0.00   0.00   0.41   0.00
Sequences with 3 motions
Mean        5.08   28.66  9.73   8.23   22.94    8.85   6.26   12.62  2.09
Median      2.40   28.26  2.33   1.76   22.03    3.19   1.02   4.75   0.05

Table 1: Classification error (%) on Hopkins 155 sequences. REF represents the reference/control method which operates based on knowledge of the ground truth segmentation. Refer to [18] for details.

We also separately investigate the accuracy of ORK in estimating the number of motions, and compare it against KM [8], which was proposed for this purpose. Note that such an experiment was not attempted in [18], since the approaches compared therein generally do not perform reliably in estimating the number of motions. The results in Table 2 (columns 1–2) show that for sequences with two motions, KM (80.83%) outperforms ORK (67.37%) by about 13 percentage points. However, for sequences with three motions, ORK (49.66%) vastly outperforms KM (14.29%), achieving more than three times its accuracy. The overall accuracy of KM (65.81%) is slightly better than that of ORK (63.37%), but this is mostly because sequences with two motions form the majority of the dataset (120 out of 155). This leads us to conclude that ORK is actually the superior method here.

Dataset      Hopkins 155           Hopkins 155 + Outliers
Column       1         2           3         4
Method       KM        ORK         KM        ORK
2 motions    80.83%    67.37%      00.00%    47.58%
3 motions    14.29%    49.66%      100.00%   50.00%
Overall      65.81%    63.37%      22.58%    48.13%

Table 2: Accuracy in determining the number of motions in a sequence. Note that in the experiment with outliers (columns 3–4), KM returns a constant number of 3 motions for all sequences.

We re-evaluate the performance of ORK by considering only the results on sequences where the number of motions is estimated correctly by ORK (about 98 sequences, i.e. 63.37% of the cases). The results are tabulated under ORK∗ (column 10) in Table 1. It can be seen that when ORK estimates the number of motions correctly, it is significantly more accurate than the other methods. Finally, we compare the speed of the methods in Table 3. ORK was implemented and run in Matlab on a Dual Core Pentium 3.00GHz machine with 4GB of main memory (this is much less powerful than the 8 Core Xeon 3.66GHz with 32GB of memory used in [18] for the other methods in Table 3). The results show that ORK is comparable to LSA, much faster than MSL and ALC, but slower than GPCA and RANSAC. Timing results for LLMC are not available in the literature.

Method       GPCA     LSA       MSL       RANSAC   ALC        ORK
2 motions    324ms    7.584s    11h 4m    175ms    10m 32s    4.249s
3 motions    738ms    15.956s   1d 23h    258ms    10m 32s    8.479s

Table 3: Average computation time on Hopkins 155 sequences.

4.3 Data with gross outliers

We next examine the ability of the proposed method to deal with gross outliers in motion data. For each sequence in Hopkins 155, we add 100 gross outliers by creating trajectories corresponding to mistracks or spuriously occurring points. These are created by randomly initializing 100 locations in the first frame and allowing them to drift throughout the sequence according to Brownian motion. The corrupted sequences are then subjected to the algorithms for motion segmentation. Since only ORK is capable of rejecting outliers, the classification error of Eq. (16) is evaluated on the inlier points only.
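For reference, the false trajectories described above can be generated along the lines of the sketch below. The frame size, the step scale sigma, and the convention of stacking each frame's (x, y) coordinates into a 2F × N measurement matrix are illustrative assumptions; original_trajectories in the usage comment is a hypothetical array holding the genuine tracked points.

```python
import numpy as np

def brownian_outlier_trajectories(n_frames, n_outliers=100,
                                  width=640, height=480, sigma=2.0, rng=None):
    """Create false point trajectories: a random location in the first
    frame, then a Brownian (random-walk) drift in every subsequent frame.
    Returns a (2 * n_frames, n_outliers) array, with the x and y rows of
    each frame stacked in order."""
    rng = np.random.default_rng() if rng is None else rng
    start = np.column_stack([rng.uniform(0, width, n_outliers),
                             rng.uniform(0, height, n_outliers)])       # (n_outliers, 2)
    steps = rng.normal(0.0, sigma, size=(n_frames - 1, n_outliers, 2))  # per-frame drift
    path = np.concatenate([start[None], start[None] + np.cumsum(steps, axis=0)], axis=0)
    return path.transpose(0, 2, 1).reshape(2 * n_frames, n_outliers)

# corrupted = np.hstack([original_trajectories, brownian_outlier_trajectories(n_frames)])
```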
The results in Table 4 illustrate that ORK (column 4) is the most accurate method by a large margin. Despite being given the true number of motions a priori, GPCA, LSA and RANSAC are unable to provide satisfactory segmentation results.

Column      1       2       3        4       5
Method      GPCA    LSA     RANSAC   ORK     ORK∗
Sequences with 2 motions
Mean        28.66   24.25   30.64    16.50   1.62
Median      30.96   26.51   32.36    10.54   0.00
Sequences with 3 motions
Mean        40.61   30.94   42.24    19.99   2.68
Median      41.30   27.68   43.43    8.49    0.09

Table 4: Classification error (%) on Hopkins 155 sequences with 100 gross outliers per sequence.

In terms of estimating the number of motions, as shown in column 4 of Table 2, the overall accuracy of ORK is reduced to 48.13%. This is due mainly to the deterioration in accuracy on sequences with two motions (47.58%), although the accuracy on sequences with three motions is maintained (50.00%). This is not a surprising result, since sequences with three motions generally have more (inlying) point trajectories than sequences with two motions, so the outlier rates for sequences with three motions are lower (recall that a fixed number of 100 false trajectories is added to every sequence). On the other hand, the KM method (column 3) is completely overwhelmed by the outliers: for all the sequences with outliers it returned a constant “3” as the number of motions. We again re-evaluate ORK by considering results from sequences (now with gross outliers) where the number of motions is correctly estimated (about 75 sequences, i.e. 48.13% of the cases). The results tabulated under ORK∗ (column 5) in Table 4 show that the proposed method can accurately segment the point trajectories without being influenced by the gross outliers.

5 Conclusions

In this paper we propose a novel and highly effective approach for multi-body motion segmentation. Our idea is based on encapsulating random hypotheses in a novel Mercer kernel and statistical learning. We evaluated our method on the Hopkins 155 dataset, with results showing that the idea is superior to other state-of-the-art approaches. It is by far the most accurate in terms of estimating the number of motions, and it excels in segmentation accuracy despite lacking prior knowledge of the number of motions. The proposed idea is also highly robust towards outliers in the input data. Acknowledgements.
We are grateful to the authors of [18], especially René Vidal, for discussions and insights which have been immensely helpful. References [1] T. Boult and L. Brown. Factorization-based segmentation of motions. In IEEE Workshop on Motion Understanding, 1991. [2] T.-J. Chin, H. Wang, and D. Suter. Robust fitting of multiple structures: The statistical learning approach. In ICCV, 2009. [3] J. Costeira and T. Kanade. A multibody factorization method for independently moving objects. IJCV, 29(3):159–179, 1998. [4] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM, 24:381–395, 1981. [5] A. Goh and R. Vidal. Segmenting motions of different types by unsupervised manifold clustering. In CVPR, 2007. [6] A. Gruber and Y. Weiss. Multibody factorization with uncertainty and missing data using the EM algorithm. In CVPR, 2004. [7] K. Kanatani. Motion segmentation by subspace separation and model selection. In ICCV, 2001. [8] K. Kanatani and C. Matsunaga. Estimating the number of independent motions for multibody segmentation. In ACCV, 2002. [9] Y. Ma, H. Derksen, W. Hong, and J. Wright. Segmentation of multivariate mixed data via lossy coding and compression. TPAMI, 29(9):1546–1562, 2007. [10] S. Rao, R. Tron, Y. Ma, and R. Vidal. Motion segmentation via robust subspace separation in the presence of outlying, incomplete, or corrupted trajectories. In CVPR, 2008. [11] B. Schölkopf, A. Smola, and K. R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998. [12] J. Shawe-Taylor and N. Cristianini. Kernel methods for pattern analysis. Cambridge University Press, 2004. [13] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):888–905, 2000. [14] Y. Sugaya and K. Kanatani. Geometric structure of degeneracy for multi-body motion segmentation. In Workshop on Statistical Methods in Video Processing, 2004. [15] R. Toldo and A. Fusiello. Robust multiple structures estimation with J-Linkage. In ECCV, 2008. [16] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography. IJCV, 9(2):137–154, 1992. [17] P. Torr. Geometric motion segmentation and model selection. Phil. Trans. Royal Society of London, 356(1740):1321–1340, 1998. [18] R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In CVPR, 2007. [19] R. Vidal and R. Hartley. Motion segmentation with missing data by PowerFactorization and Generalized PCA. In CVPR, 2004. [20] R. Vidal, Y. Ma, and S. Sastry. Generalized Principal Component Analysis (GPCA). TPAMI, 27(12):1–15, 2005. [21] J. Yan and M. Pollefeys. A general framework for motion segmentation: independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In ECCV, 2006. [22] L. Zelnik-Manor and M. Irani. Degeneracies, dependencies and their implications on multi-body and multi-sequence factorization. In CVPR, 2003. [23] W. Zhang and J. Kosecká. Nonparametric estimation of multiple structures with outliers. In Dynamical Vision, ICCV 2005 and ECCV 2006 Workshops, 2006.

4 0.44442531 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

Author: Zhirong Yang, Irwin King, Zenglin Xu, Erkki Oja

Abstract: Stochastic Neighbor Embedding (SNE) has been shown to be quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in finding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of t-SNE to accommodate various heavy-tailed embedding similarity functions. With this generalization, we are presented with two difficulties. The first is how to select the best embedding similarity among all heavy-tailed functions and the second is how to optimize the objective function once the heavy-tailed function has been selected. Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. Based on this finding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a fixed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast as, and performs as well as, the best known t-SNE implementation, and the other for semi-supervised visualization showing quantitative superiority using the homogeneity measure as well as qualitative advantage in cluster separation over t-SNE.

5 0.44002143 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis

Author: Sundeep Rangan, Alyson K. Fletcher

Abstract: A well-known analysis of Tropp and Gilbert shows that orthogonal matching pursuit (OMP) can recover a k-sparse n-dimensional real vector from m = 4k log(n) noise-free linear measurements obtained through a random Gaussian measurement matrix with a probability that approaches one as n → ∞. This work strengthens this result by showing that a lower number of measurements, m = 2k log(n − k), is in fact sufficient for asymptotic recovery. More generally, when the sparsity level satisfies kmin ≤ k ≤ kmax but is unknown, m = 2kmax log(n − kmin ) measurements is sufficient. Furthermore, this number of measurements is also sufficient for detection of the sparsity pattern (support) of the vector with measurement errors provided the signal-to-noise ratio (SNR) scales to infinity. The scaling m = 2k log(n − k) exactly matches the number of measurements required by the more complex lasso method for signal recovery in a similar SNR scaling.

6 0.43737462 225 nips-2009-Sparsistent Learning of Varying-coefficient Models with Structural Changes

7 0.43684742 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

8 0.43401769 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

9 0.4328388 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

10 0.43202549 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

11 0.431582 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models

12 0.43064985 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction

13 0.43055937 97 nips-2009-Free energy score space

14 0.43015045 20 nips-2009-A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

15 0.42999166 25 nips-2009-Adaptive Design Optimization in Experiments with People

16 0.4295361 237 nips-2009-Subject independent EEG-based BCI decoding

17 0.42939815 122 nips-2009-Label Selection on Graphs

18 0.42810676 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output

19 0.42810476 157 nips-2009-Multi-Label Prediction via Compressed Sensing

20 0.42802978 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms