jmlr jmlr2013 jmlr2013-50 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk
Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation
Reference: text
sentIndex sentText sentNum sentScore
1 Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. [sent-12, score-0.875]
2 In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. [sent-13, score-0.566]
3 Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation 1. [sent-15, score-0.517]
4 1 Unions of Subspaces In many cases, a linear subspace model is sufficient to characterize the intrinsic structure of an ensemble; however, in many emerging applications, a single subspace is not enough. [sent-26, score-0.51]
5 Instead, ensembles can be modeled as living on a union of subspaces or a union of affine planes of mixed or equal dimension. [sent-27, score-0.535]
6 A set of points Y = {y1, . . . , yd}, each of dimension n, lives on a union of p subspaces if Y ⊂ U = ∪_{i=1}^p Si, where Si is a subspace of R^n. [sent-31, score-0.72]
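To make the union-of-subspaces model concrete, the sketch below draws unit-norm points from a union of random low-dimensional subspaces of R^n; it is a generic illustration of the model above (with placeholder dimensions and names), not the specific coefficient model (M1) used in the paper's experiments.

```python
import numpy as np

def sample_union_of_subspaces(n=50, k=5, p=3, d=30, rng=None):
    """Draw d unit-norm points from each of p random k-dimensional
    subspaces of R^n; returns the points as columns of Y along with
    the ground-truth subspace label of each column."""
    rng = np.random.default_rng(rng)
    points, labels = [], []
    for i in range(p):
        # Random orthonormal basis for the i-th subspace S_i.
        basis, _ = np.linalg.qr(rng.standard_normal((n, k)))
        X = basis @ rng.standard_normal((k, d))        # points living in S_i
        X /= np.linalg.norm(X, axis=0, keepdims=True)  # normalize to unit norm
        points.append(X)
        labels.extend([i] * d)
    return np.hstack(points), np.asarray(labels)

Y, labels = sample_union_of_subspaces()
print(Y.shape, labels.shape)  # (50, 90) (90,)
```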
7 , 2010) are all well-approximated by a union of low-dimensional subspaces or a union of affine hyperplanes. [sent-33, score-0.495]
8 2 Exact Feature Selection Unions of subspaces provide a natural extension to single subspace models, but providing an extension of PCA that leads to provable guarantees for learning multiple subspaces is challenging. [sent-37, score-1.017]
9 This is due to the fact that subspace clustering—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. [sent-38, score-0.595]
10 However, if we can sift through the points in the ensemble and identify subsets of points that lie along or near the same subspace, then a local subspace estimate1 formed from any such set is guaranteed to coincide with one of the true subspaces present in the ensemble (Vidal et al. [sent-39, score-0.931]
11 Methods that use sets of NNs to learn a union of subspaces include: local subspace affinity (LSA) (Yan and Pollefeys, 2006), spectral clustering based on locally linear approximations (Arias-Castro et al. [sent-43, score-0.78]
12 When the subspaces present in the ensemble are non-intersecting and densely sampled, NN-based approaches provide high rates of EFS and in turn, provide accurate local estimates of the subspaces present in the ensemble. [sent-48, score-0.824]
13 However, such approaches quickly fail as the intersection between pairs of subspaces increases and as the number of points in each subspace decreases; in both of these cases, the Euclidean distance between points becomes a poor predictor of whether points belong to the same subspace. [sent-49, score-0.806]
14 The main result of our analysis is a new geometric condition (Theorem 1) for EFS with OMP that highlights the tradeoff between the mutual coherence (the similarity between points living in different subspaces) and the covering radius of the points within the same subspace. [sent-66, score-0.802]
15 The distance between a pair of subspaces is typically measured with respect to the principal angles between the subspaces or other related distances on the Grassmannian manifold. [sent-69, score-0.95]
16 The vector in the normalized subspace (unit circle) that attains the covering radius (deep hole) is marked with a star: when compared with the convex hull, the deep hole coincides with the maximal gap between the convex hull and the set of all vectors that live in the normalized subspace. [sent-72, score-0.565]
17 After introducing a general geometric condition for EFS, we extend this analysis to the case where the data live on what we refer to as a bounded union of subspaces (Theorem 3). [sent-76, score-0.473]
18 In particular, we show that when the points living in a particular subspace are incoherent with the principal vectors that support pairs of subspaces in the ensemble, EFS can be guaranteed, even when non-trivial intersections exist between subspaces in the ensemble. [sent-77, score-1.264]
19 4, we study the performance of OMP, BP, and NN-based subspace clustering on real data, where the goal is to cluster a collection of images into their respective illumination subspaces. [sent-82, score-0.432]
20 In the case of very sparsely sampled subspaces, where the subspace dimension equals 5 and the number of points per subspace equals 16, we obtain a 10% and 30% improvement in classification accuracy with OMP (90%), when compared with BP (80%) and NN (60%). [sent-84, score-0.56]
21 We introduce a signal model for unions of subspaces, detail the sparse subspace clustering (SSC) algorithm (Elhamifar and Vidal, 2010), and then go on to introduce the use of OMP for feature selection and subspace clustering (Algorithm 2); we end with a motivating example. [sent-88, score-0.93]
22 We introduce sufficient conditions for EFS to occur with OMP for general unions of subspaces in Theorem 1, disjoint unions in Corollary 1, and bounded unions in Theorem 3. [sent-91, score-0.903]
23 Sparse Feature Selection for Subspace Clustering In this section, we introduce a signal model for unions of subspaces, detail the sparse subspace clustering (SSC) method (Elhamifar and Vidal, 2009), and introduce an OMP-based method for sparse subspace clustering (SSC-OMP). [sent-111, score-0.96]
24 Let Yi denote the set of points in the ith subspace cluster and let Y = ∪_{i=1}^p Yi denote the union of p subspace clusters. [sent-117, score-0.658]
25 Let Y = { y1/‖y1‖2 , y2/‖y2‖2 , · · · , yd/‖yd‖2 } denote the resulting set of unit norm points and let Yi be the set of unit norm points that lie in the span of subspace Si. [sent-119, score-0.462]
26 2 Sparse Subspace Clustering The sparse subspace clustering (SSC) algorithm (Elhamifar and Vidal, 2009) proceeds by solving the following basis pursuit (BP) problem for each point in Y: c∗_i = arg min_{c∈R^d} ‖c‖_1 subject to y_i = ∑_{j≠i} c(j) y_j. [sent-125, score-0.471]
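As a rough illustration of this program (not the authors' implementation), the ℓ1 problem for a single point can be recast as a linear program by splitting c into nonnegative parts; the sketch below uses SciPy's generic LP solver and assumes the equality constraint is feasible, that is, noiseless data with enough points per subspace so that y_i is exactly representable by the remaining points. In practice a noise-tolerant variant such as BPDN is used instead.

```python
import numpy as np
from scipy.optimize import linprog

def ssc_l1_coefficients(Y, i):
    """Solve min ||c||_1 subject to y_i = sum_{j != i} c(j) y_j for column i of Y,
    via the standard LP reformulation c = u - v with u, v >= 0."""
    n, d = Y.shape
    yi = Y[:, i]
    Yrest = np.delete(Y, i, axis=1)            # Y_{-i}: all points except y_i
    m = d - 1
    cost = np.ones(2 * m)                      # objective: sum(u) + sum(v) = ||c||_1
    A_eq = np.hstack([Yrest, -Yrest])          # enforce Y_{-i} (u - v) = y_i
    res = linprog(cost, A_eq=A_eq, b_eq=yi, bounds=[(0, None)] * (2 * m))
    if not res.success:
        raise RuntimeError(res.message)
    c = res.x[:m] - res.x[m:]
    return np.insert(c, i, 0.0)                # zero at position i, to align with Y
```

The nonzero entries of the returned coefficient vector are the selected features for point i; stacking these vectors and symmetrizing the result gives the affinity matrix that SSC passes to spectral clustering.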
27 By varying the number of unique illumination conditions that we collect (number of points per subspace), we can manipulate the sampling density of the subspaces in the ensemble. [sent-156, score-0.505]
28 EFS provides a natural condition for studying the performance of both subspace consensus and spectral clustering methods because, when EFS occurs, the resulting local subspace estimate coincides with one of the true subspaces contained within the data. [sent-190, score-0.978]
29 Definition 1 (Exact Feature Selection) Let Yk = {y : (I − Pk )y = 0, y ∈ Y } index the set of points in Y that live in the span of subspace Sk , where Pk is a projector onto the span of subspace Sk . [sent-192, score-0.639]
30 1 Preliminaries Our main geometric result in Theorem 1 below requires measures of both the distance between points in different subspace clusters and within the same subspace cluster. [sent-202, score-0.56]
31 A natural measure of the similarity between points living in different subspaces is the mutual coherence. [sent-203, score-0.527]
32 In words, the mutual coherence µc(Yi, Yj) = max_{u∈Yi, v∈Yj} |⟨u, v⟩| provides a point-wise measure of the normalized inner product (coherence) between all pairs of points that lie in two different subspace clusters. [sent-207, score-0.523]
33 Let µc(Yi) denote the maximum mutual coherence between the points in Yi and all other subspace clusters in the ensemble, where µc(Yi) = max_{j≠i} µc(Yi, Yj). [sent-208, score-0.489]
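These two definitions translate directly into code; the sketch below assumes each subspace cluster is stored as a matrix whose columns are the unit-norm points, and the helper names are placeholders.

```python
import numpy as np

def mutual_coherence(Yi, Yj):
    """mu_c(Y_i, Y_j): maximum absolute inner product between the
    unit-norm columns of Yi and the unit-norm columns of Yj."""
    return float(np.max(np.abs(Yi.T @ Yj)))

def max_mutual_coherence(clusters, i):
    """mu_c(Y_i): maximum mutual coherence between cluster i and all other clusters."""
    return max(mutual_coherence(clusters[i], clusters[j])
               for j in range(len(clusters)) if j != i)
```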
34 The first principal angle θ∗_{ij} between subspaces Si and Sj is the smallest angle between a pair of unit vectors (u1, v1) drawn from Si × Sj. [sent-210, score-0.657]
35 To measure how well points in the same subspace cluster cover the subspace they live on, we will study the covering radius of each normalized subspace cluster relative to the projective distance. [sent-215, score-0.47]
36 The covering radius of the normalized subspace cluster Yi can be interpreted as the size of the largest open ball that can be placed in the set of all unit norm vectors that lie in the span of Si, without touching a point in Yi. [sent-217, score-0.506]
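The covering radius has no simple closed form, but it can be estimated by Monte Carlo: sample many unit vectors in span(Si) and record the worst-case distance to the nearest point of Yi (the deepest hole). The sketch below assumes unit-norm columns and uses sqrt(1 − ⟨u, y⟩²) as a stand-in for the projective distance, since the exact definition was lost in the extraction above.

```python
import numpy as np

def covering_radius_estimate(Yi, n_samples=20000, rng=None):
    """Monte Carlo estimate of cover(Y_i): the deepest hole among unit vectors
    in span(S_i), measured here by dist(u, y) = sqrt(1 - <u, y>^2).
    Assumes the columns of Yi are unit norm."""
    rng = np.random.default_rng(rng)
    # Orthonormal basis of span(S_i), estimated from the points via the SVD.
    U_basis, s, _ = np.linalg.svd(Yi, full_matrices=False)
    k = int(np.sum(s > 1e-10 * s[0]))
    basis = U_basis[:, :k]
    # Random unit-norm vectors living in span(S_i).
    U = basis @ rng.standard_normal((k, n_samples))
    U /= np.linalg.norm(U, axis=0, keepdims=True)
    # For each sampled direction, distance to its nearest point in Y_i.
    nearest = np.abs(Yi.T @ U).max(axis=0)
    return float(np.sqrt(np.clip(1.0 - nearest**2, 0.0, None)).max())
```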
37 A sufficient condition for Algorithm 1 to return exact feature sets for all points in Yi is that the mutual coherence satisfies µc(Yi) < 1 − ε²/4 − (ε√12/4) max_{j≠i} cos(θ∗_{ij}),  (7)  where θ∗_{ij} is the minimum principal angle defined in (4). [sent-240, score-0.453]
38 In words, this condition requires that the mutual coherence between points in different subspaces is less than the difference of two terms that both depend on the covering radius of points along a single subspace. [sent-241, score-0.782]
39 The second term on the RHS of (7) is the product of the cosine of the minimum principal angle between pairs of subspaces in the ensemble and the covering diameter ε of the points in Yi . [sent-243, score-0.779]
40 When subspaces in the ensemble intersect, that is, cos(θ∗_{ij}) = 1, condition (7) in Theorem 1 can be simplified as µc(Yi) < 1 − ε²/4 − ε√12/4 ≈ 1 − ε²/4 − 0.86 ε. [sent-244, score-0.443]
41 In this case, EFS can be guaranteed for intersecting subspaces as long as the points in distinct subspace clusters are bounded away from intersections between subspaces. [sent-246, score-0.686]
42 When the covering radius shrinks to zero, Theorem 1 requires that µc < 1, or that points from different subspaces do not lie exactly in the subspace intersection, that is, are identifiable from one another. [sent-247, score-0.833]
43 3 EFS for Disjoint Subspaces When the subspaces in the ensemble are disjoint, that is, cos(θ∗_{ij}) < 1, Theorem 1 can be simplified further by using the bound for the mutual coherence in (5). [sent-250, score-0.608]
44 Corollary 1 Let θ∗_{ij} denote the first principal angle between a pair of disjoint subspaces Si and Sj, and let ε denote the maximal covering diameter of the points in Yi. [sent-252, score-0.737]
45 A union of two disjoint subspaces of different dimension: the convex hull of a set of points (red circles) living on a 2D subspace is shaded (green). [sent-255, score-0.808]
46 To be precise, we require that the normalized inner product of the residual signal s and all points outside of the correct subspace cluster satisfies max_{y∈Y−i} |⟨s, y⟩| / ‖s‖2 < r(Yi)  (8)  at each iteration of Algorithm 1. [sent-261, score-0.466]
47 A geometric interpretation of the EFS condition in Theorem 1 is that the orthogonal projection of all points outside of a subspace must lie within the antipodal convex hull of the set of normalized points that span the subspace. [sent-263, score-0.496]
48 In Figure 3, we provide a geometric visualization of the EFS condition for a union of disjoint subspaces (union of a 1D subspace with a 2D subspace). [sent-269, score-0.693]
49 In (a), we show an example where EFS is guaranteed because the projection of the points outside of the 2D subspace lie well within the antipodal convex hull of the points along the normalized 2D subspace (ring). [sent-270, score-0.749]
50 In (b), we show an example where EFS is not guaranteed because the projection of the points outside of the 2D subspace lie outside of the antipodal convex hull of the points along the normalized 2D subspace (ring). [sent-271, score-0.749]
51 This suggests that the cosine of the minimum principal angle of the union must go to zero, that is, the union must consist of orthogonal subspaces, as the subspace dimension increases. [sent-279, score-0.54]
52 Rather, we require that there are enough points in each subspace to achieve a sufficiently small covering; in which case, EFS can be guaranteed for subspaces of any dimension. [sent-281, score-0.686]
53 2 Exact Recovery Conditions for Sparse Recovery To provide further intuition about EFS in endogenous sparse recovery, we will compare the geometry underlying the EFS condition with the geometry of the exact recovery condition (ERC) for sparse signal recovery methods (Tropp, 2004, 2006). [sent-298, score-0.432]
54 In particular, the EFS condition requires that the projection of atoms in an incorrect subspace cluster Y−i onto Si must be incoherent with any deep holes in Yi along Si . [sent-314, score-0.449]
55 In contrast, we require that the points within a subspace cluster exhibit local coherence in order to produce a small covering radius. [sent-315, score-0.532]
56 1 Subspace Distances To characterize the “distance” between pairs of subspaces in the ensemble, the principal angles between subspaces will prove useful. [sent-319, score-0.93]
57 As we saw in the previous section, the first principal angle θ0 between subspaces S1 and S2 of dimension k1 and k2 is defined as the smallest angle between a pair of unit vectors (u1 , v1 ) drawn from S1 × S2 . [sent-320, score-0.657]
58 The second principal angle θ1 is defined much like the first, except that the second set of principal vectors that define the second principal angle are required to be orthogonal to the first set of principal vectors (u∗ , v∗ ). [sent-322, score-0.59]
59 Rather, a computationally efficient way to compute the principal angles between two subspaces Si and Sj is to first compute the singular values of the matrix G = Φi^T Φj, where Φi ∈ R^{n×ki} is an ONB that spans subspace Si. [sent-326, score-0.853]
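Following this recipe, the sketch below estimates an ONB for each cluster from its points, forms G = Φi^T Φj, and returns the cross-spectra (the singular values of G) together with the corresponding principal angles; the tolerance and helper names are placeholders.

```python
import numpy as np

def principal_angles(Yi, Yj, tol=1e-10):
    """Cross-spectra and principal angles between span(Y_i) and span(Y_j)."""
    def onb(Y):  # orthonormal basis Phi spanning the column space of Y
        U, s, _ = np.linalg.svd(Y, full_matrices=False)
        return U[:, s > tol * s[0]]
    Phi_i, Phi_j = onb(Yi), onb(Yj)
    G = Phi_i.T @ Phi_j
    sigma = np.linalg.svd(G, compute_uv=False)       # cross-spectra, values in [0, 1]
    theta = np.arccos(np.clip(sigma, 0.0, 1.0))      # principal angles, in radians
    return sigma, theta
```

Entries of the returned cross-spectra that equal one flag shared directions, so the overlap q discussed below is simply the number of singular values numerically equal to one, and the first principal angle is the smallest entry of theta.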
60 A pair of subspaces is said to be disjoint if the minimum principal angle is greater than zero. [sent-330, score-0.572]
61 Non-disjoint or intersecting subspaces are defined as subspaces with minimum principal angle equal to zero. [sent-331, score-0.933]
62 The dimension of the intersection between two subspaces is equivalent to the number of principal angles equal to zero or equivalently, the number of entries of the cross-spectra that are equal to one. [sent-332, score-0.569]
63 We define the overlap between two subspaces as the rank(G) or equivalently, q = ‖σij‖0, where q ≥ dim(Si ∩ Sj). [sent-333, score-0.44]
64 2 Sufficient Conditions for EFS from Bounded Unions The sufficient conditions for EFS in Theorem 1 and Corollary 1 reveal an interesting relationship between the covering radius, mutual coherence, and the minimum principal angle between pairs of subspaces in the ensemble. [sent-335, score-0.685]
65 To make this connection more apparent, we will make additional assumptions about the distribution of points in the ensemble, namely that the data set produces a bounded union of subspaces relative to the principal vectors supporting pairs of subspaces in the ensemble. [sent-337, score-0.993]
66 When the points in each subspace are incoherent with the principal vectors in the columns of U and V, we say that the ensemble Y is a bounded union of subspaces. [sent-342, score-0.581]
67 This property requires that the inner products between the points in a subspace and the set of principal vectors that span non-orthogonal directions between a pair of subspaces are bounded by a fixed constant. [sent-344, score-0.852]
68 When the points in each subspace are distributed such that (11) holds, we can rewrite the mutual coherence between any two points from different subspaces to reveal its dependence on higher-order principal angles. [sent-345, score-1.006]
69 Let σi j denote the cross-spectra of the subspaces Si and S j and let ε denote the covering diameter of Yi . [sent-352, score-0.496]
70 One way to guarantee that the ensemble has a small bounding constant is to constrain the total amount of energy that points in Y j have in the q-dimensional subspace spanned by the principal vectors in V . [sent-355, score-0.51]
71 Instead of requiring incoherence with all principal vectors, the data must be sufficiently incoherent with the principal vectors that correspond to small principal angles (or large values of the cross-spectra). [sent-359, score-0.451]
72 We show that when the data set is sparsely sampled (larger covering radius), reducing the amount of energy that points contain in subspace intersections does in fact increase the probability that points admit EFS. [sent-365, score-0.483]
73 Experimental Results In our theoretical analysis of EFS in Sections 3 and 4, we revealed an intimate connection between the covering radius of subspaces and the principal angles between pairs of subspaces in the ensemble. [sent-370, score-1.046]
74 1 Generative Model for Synthetic Data In order to study EFS for unions of subspaces with varied cross-spectra, we will generate synthetic data from unions of overlapping block sparse signals. [sent-374, score-0.789]
75 2503 DYER , S ANKARANARAYANAN AND BARANIUK Figure 4: Generating unions of subspaces from shift-invariant dictionaries. [sent-383, score-0.555]
76 We can leverage the banded structure of shift-invariant dictionaries, for example, dictionary matrices with localized Toeplitz structure, to generate subspaces with structured cross-spectra as follows. [sent-386, score-0.472]
77 Since σ ∈ (0, 1]^k, the worst-case pair of subspaces with overlap equal to q is obtained when we pair q identical atoms with k − q orthogonal atoms. [sent-392, score-0.559]
78 , 2010), we introduce the idea of using shift-invariant dictionaries to create structured unions of subspaces for the first time here. [sent-411, score-0.645]
79 As our theory predicts, the oversampling ratio has a strong impact on the degree of overlap between subspaces that can be tolerated before EFS no longer occurs. [sent-425, score-0.512]
80 In contrast, when the subspaces are critically sampled, that is, the number of points per subspace d ≈ k, only a small amount of overlap can be tolerated, where δ < 0. [sent-430, score-0.745]
81 The probability of EFS for a union of two subspaces of dimension k = 20 (left column) and k = 50 (right column). [sent-472, score-0.438]
82 Second, we compare the phase transitions for unions of orthoblock sparse signals as we vary the overlap and oversampling ratio. [sent-486, score-0.482]
83 Along the bottom row of Figure 6, we show the probability of EFS for OMP and NN for each of these three subspace unions as we vary the overlap q. [sent-515, score-0.488]
84 Each subspace cluster is generated by sampling d = 100 points from each subspace according to the coefficient model (M1). [sent-517, score-0.601]
85 This study provides a number of interesting insights into the role that higher-order principal angles between subspaces play in feature selection for both sparse recovery methods and NN. [sent-518, score-0.741]
86 For this experiment, we consider unions of orthoblock sparse signals living on subspaces of dimension k = 50 and vary ρ ∈ [0. [sent-560, score-0.772]
87 4 Clustering Illumination Subspaces In this section, we compare the performance of sparse recovery methods, that is, BP and OMP, with NN for clustering unions of illumination subspaces arising from a collection of images of faces under different lighting conditions. [sent-567, score-0.853]
88 In Table 1, we display the percentage of points that resulted in EFS and the classification error for all pairs of 38 subspaces in the Yale B database. [sent-581, score-0.45]
89 While both sparse recovery methods (BPDN and OMP) admit EFS rates that are comparable to NN on the full data set, we find that sparse recovery methods provide higher rates of EFS than NN when the sampling of each subspace is sparse, that is, the half and quarter data sets. [sent-584, score-0.611]
90 The applicability and utility of endogenous sparse recovery in subspace learning draws into question whether we can use endogenous sparse recovery for other tasks, including approximation and compression. [sent-598, score-0.695]
91 —the underlying subspaces that the signals occupy must be learned directly from the data. [sent-606, score-0.446]
92 The methods that we have described for learning union of subspaces from ensembles of data can be used in the context of learning block sparse and other structured sparse signal models. [sent-607, score-0.618]
93 Thus, a study of the principal angles formed by different subsets of atoms from a dictionary might provide new insights into the performance of sparse recovery methods with coherent dictionaries and for compressive sensing from structured matrices. [sent-611, score-0.602]
94 Our study of EFS from unions with structured cross-spectra suggests that the decay of the cross-spectra between different data classes provides a powerful predictor of the performance of sparse recovery methods from data living on a union of low-dimensional subspaces. [sent-633, score-0.461]
95 In particular, we found that for dense samplings of each subspace, the performance of NN is comparable to sparse recovery methods; however, when each subspace is more sparsely sampled, sparse recovery methods provide significant gains over NN methods. [sent-642, score-0.579]
96 This result suggests that endogenous sparse recovery provides a powerful strategy for clustering when the sampling of subspace clusters is sparse. [sent-643, score-0.537]
97 The normalized residual at the mth step is computed as sm = (I − PΛ)yi / ‖(I − PΛ)yi‖2, where PΛ = YΛ YΛ† ∈ R^{n×n} is a projector onto the subspace spanned by the points in the current feature set Λ, where |Λ| = m − 1. [sent-652, score-0.45]
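This projector-based residual update is the core of the greedy selection step; the following is a minimal sketch of OMP-style feature selection for a single point, in the spirit of Algorithm 1 but not the authors' code, using a least-squares projection onto the points selected so far and stopping after k selections or once the residual is negligible.

```python
import numpy as np

def omp_feature_selection(Y, i, k, tol=1e-6):
    """Greedily select up to k columns of Y (excluding column i) that best
    represent y_i; returns the list of selected column indices (the feature set)."""
    yi = Y[:, i]
    candidates = [j for j in range(Y.shape[1]) if j != i]
    selected = []
    residual = yi.copy()
    for _ in range(k):
        if np.linalg.norm(residual) <= tol:
            break
        s = residual / np.linalg.norm(residual)          # normalized residual s_m
        # Select the candidate point most coherent with the current residual.
        scores = np.abs(Y[:, candidates].T @ s)
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
        # Orthogonal projection of y_i onto the span of the selected points:
        # residual = (I - P_Lambda) y_i, computed via least squares.
        coeffs, *_ = np.linalg.lstsq(Y[:, selected], yi, rcond=None)
        residual = yi - Y[:, selected] @ coeffs
    return selected
```

Exact feature selection holds for point i when every index in the returned set carries the same subspace label as yi.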
98 Thus, we can bound the RHS as follows: max_{yj∉Yk} |⟨sm, yj⟩| = max_{yj∉Yk} |⟨z + e, yj⟩| ≤ max_{yj∉Yk} ( |⟨z, yj⟩| + |⟨e, yj⟩| ) ≤ µc + max_{yj∉Yk} |⟨e, yj⟩| ≤ µc + cos(θ0) ‖e‖2 / ‖yi‖2, where θ0 is the minimum principal angle between Sk and all other subspaces in the ensemble. [sent-660, score-0.723]
99 Let us assume that the points in Yk admit an ε-covering of the subspace cluster Sk, or that cover(Yk) = ε/2. [sent-674, score-0.474]
100 2 Proof of Theorem 3 To prove Theorem 3, we will assume that the union of subspaces is bounded in accordance with (11). [sent-679, score-0.438]
wordName wordTfidf (topN-words)
[('efs', 0.623), ('subspaces', 0.381), ('omp', 0.287), ('subspace', 0.255), ('unions', 0.174), ('dyer', 0.122), ('coherence', 0.109), ('principal', 0.105), ('bp', 0.104), ('recovery', 0.102), ('nn', 0.096), ('elhamifar', 0.093), ('ankaranarayanan', 0.087), ('vidal', 0.087), ('yk', 0.079), ('baraniuk', 0.079), ('atoms', 0.079), ('covering', 0.077), ('illumination', 0.074), ('lustering', 0.072), ('yi', 0.072), ('reedy', 0.067), ('ubspace', 0.066), ('angle', 0.066), ('signals', 0.065), ('bpdn', 0.064), ('dictionary', 0.063), ('angles', 0.063), ('ensemble', 0.062), ('dictionaries', 0.062), ('clustering', 0.062), ('sparse', 0.06), ('inradius', 0.06), ('overlap', 0.059), ('endogenous', 0.058), ('soltanolkotabi', 0.058), ('union', 0.057), ('mutual', 0.056), ('si', 0.056), ('eature', 0.055), ('orthoblock', 0.052), ('election', 0.052), ('oversampling', 0.05), ('points', 0.05), ('residual', 0.047), ('cluster', 0.041), ('antipodal', 0.041), ('onb', 0.041), ('living', 0.04), ('radius', 0.039), ('diameter', 0.038), ('af', 0.036), ('live', 0.035), ('nity', 0.035), ('incoherent', 0.033), ('cos', 0.032), ('signal', 0.032), ('admit', 0.032), ('erc', 0.031), ('lie', 0.031), ('feature', 0.03), ('ssc', 0.03), ('cand', 0.029), ('rice', 0.029), ('structured', 0.028), ('yd', 0.027), ('spans', 0.027), ('hole', 0.027), ('atom', 0.027), ('rn', 0.026), ('jv', 0.025), ('rhs', 0.025), ('spectral', 0.025), ('hull', 0.025), ('sm', 0.023), ('mailh', 0.023), ('mth', 0.023), ('gap', 0.023), ('span', 0.022), ('normalized', 0.022), ('ratio', 0.022), ('singular', 0.022), ('pursuit', 0.022), ('phase', 0.022), ('deep', 0.021), ('incoherence', 0.021), ('tropp', 0.021), ('coef', 0.02), ('intersection', 0.02), ('pair', 0.02), ('formed', 0.02), ('along', 0.02), ('yc', 0.02), ('compressive', 0.02), ('vectors', 0.019), ('energy', 0.019), ('sk', 0.019), ('display', 0.019), ('max', 0.019), ('exact', 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk
Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation
2 0.11899038 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
Author: Antony Joseph
Abstract: The performance of orthogonal matching pursuit (OMP) for variable selection is analyzed for random designs. When contrasted with the deterministic case, since the performance is here measured after averaging over the distribution of the design matrix, one can have far less stringent sparsity constraints on the coefficient vector. We demonstrate that for exact sparse vectors, the performance of the OMP is similar to known results on the Lasso algorithm (Wainwright, 2009). Moreover, variable selection under a more relaxed sparsity assumption on the coefficient vector, whereby one has only control on the ℓ1 norm of the smaller coefficients, is also analyzed. As consequence of these results, we also show that the coefficient estimate satisfies strong oracle type inequalities. Keywords: high dimensional regression, greedy algorithms, Lasso, compressed sensing
3 0.071226202 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
Author: Tony Cai, Jianqing Fan, Tiefeng Jiang
Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere
4 0.064794421 96 jmlr-2013-Regularization-Free Principal Curve Estimation
Author: Samuel Gerber, Ross Whitaker
Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection
5 0.06004997 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
Author: Shusen Wang, Zhihua Zhang
Abstract: The CUR matrix decomposition and the Nystr¨ m approximation are two important low-rank matrix o approximation techniques. The Nystr¨ m method approximates a symmetric positive semidefinite o matrix in terms of a small number of its columns, while CUR approximates an arbitrary data matrix by a small number of its columns and rows. Thus, CUR decomposition can be regarded as an extension of the Nystr¨ m approximation. o In this paper we establish a more general error bound for the adaptive column/row sampling algorithm, based on which we propose more accurate CUR and Nystr¨ m algorithms with expected o relative-error bounds. The proposed CUR and Nystr¨ m algorithms also have low time complexity o and can avoid maintaining the whole data matrix in RAM. In addition, we give theoretical analysis for the lower error bounds of the standard Nystr¨ m method and the ensemble Nystr¨ m method. o o The main theoretical results established in this paper are novel, and our analysis makes no special assumption on the data matrices. Keywords: large-scale matrix computation, CUR matrix decomposition, the Nystr¨ m method, o randomized algorithms, adaptive sampling
6 0.053077403 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
7 0.047439359 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
8 0.045017313 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
9 0.041406725 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
10 0.039327078 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
11 0.039061025 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
12 0.038375385 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
13 0.037159707 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
14 0.03701688 90 jmlr-2013-Quasi-Newton Method: A New Direction
15 0.036999237 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows
16 0.036709487 120 jmlr-2013-Variational Algorithms for Marginal MAP
17 0.035007559 68 jmlr-2013-Machine Learning with Operational Costs
18 0.034465194 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
19 0.032659486 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
20 0.031980772 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
topicId topicWeight
[(0, -0.157), (1, 0.051), (2, 0.009), (3, -0.048), (4, 0.072), (5, 0.025), (6, -0.013), (7, -0.004), (8, 0.145), (9, 0.066), (10, 0.057), (11, 0.019), (12, 0.02), (13, 0.114), (14, 0.052), (15, 0.029), (16, -0.202), (17, -0.117), (18, -0.279), (19, -0.064), (20, 0.208), (21, -0.037), (22, -0.198), (23, 0.071), (24, 0.012), (25, 0.121), (26, 0.192), (27, 0.037), (28, 0.001), (29, 0.184), (30, 0.06), (31, 0.117), (32, -0.038), (33, 0.003), (34, -0.09), (35, -0.078), (36, 0.055), (37, -0.179), (38, -0.082), (39, 0.055), (40, -0.136), (41, -0.015), (42, -0.081), (43, -0.008), (44, 0.089), (45, -0.095), (46, -0.029), (47, 0.021), (48, 0.033), (49, 0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.95360649 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk
Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation
2 0.62459373 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit
Author: Antony Joseph
Abstract: The performance of orthogonal matching pursuit (OMP) for variable selection is analyzed for random designs. When contrasted with the deterministic case, since the performance is here measured after averaging over the distribution of the design matrix, one can have far less stringent sparsity constraints on the coefficient vector. We demonstrate that for exact sparse vectors, the performance of the OMP is similar to known results on the Lasso algorithm (Wainwright, 2009). Moreover, variable selection under a more relaxed sparsity assumption on the coefficient vector, whereby one has only control on the ℓ1 norm of the smaller coefficients, is also analyzed. As consequence of these results, we also show that the coefficient estimate satisfies strong oracle type inequalities. Keywords: high dimensional regression, greedy algorithms, Lasso, compressed sensing
3 0.39638609 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres
Author: Tony Cai, Jianqing Fan, Tiefeng Jiang
Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere
4 0.31038493 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
Author: Michael Chertkov, Adam B. Yedidia
Abstract: We discuss schemes for exact and approximate computations of permanents, and compare them with each other. Specifically, we analyze the belief propagation (BP) approach and its fractional belief propagation (FBP) generalization for computing the permanent of a non-negative matrix. Known bounds and Conjectures are verified in experiments, and some new theoretical relations, bounds and Conjectures are proposed. The fractional free energy (FFE) function is parameterized by a scalar parameter γ ∈ [−1; 1], where γ = −1 corresponds to the BP limit and γ = 1 corresponds to the exclusion principle (but ignoring perfect matching constraints) mean-field (MF) limit. FFE shows monotonicity and continuity with respect to γ. For every non-negative matrix, we define its special value γ∗ ∈ [−1; 0] to be the γ for which the minimum of the γ-parameterized FFE function is equal to the permanent of the matrix, where the lower and upper bounds of the γ-interval corresponds to respective bounds for the permanent. Our experimental analysis suggests that the distribution of γ∗ varies for different ensembles but γ∗ always lies within the [−1; −1/2] interval. Moreover, for all ensembles considered, the behavior of γ∗ is highly distinctive, offering an empirical practical guidance for estimating permanents of non-negative matrices via the FFE approach. Keywords: permanent, graphical models, belief propagation, exact and approximate algorithms, learning flows
5 0.29938242 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos
Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm
6 0.29342875 96 jmlr-2013-Regularization-Free Principal Curve Estimation
7 0.28872254 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
8 0.26286498 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems
9 0.25823531 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
10 0.25734618 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach
11 0.22970314 98 jmlr-2013-Segregating Event Streams and Noise with a Markov Renewal Process Model
12 0.22734286 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling
13 0.20616373 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
14 0.19923918 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning
15 0.19700615 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds
16 0.19284296 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
17 0.17797579 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
18 0.17636307 120 jmlr-2013-Variational Algorithms for Marginal MAP
19 0.16395858 33 jmlr-2013-Dimension Independent Similarity Computation
20 0.16334407 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
topicId topicWeight
[(0, 0.026), (5, 0.114), (6, 0.045), (10, 0.098), (14, 0.044), (20, 0.025), (23, 0.04), (27, 0.338), (68, 0.042), (70, 0.024), (75, 0.041), (85, 0.017), (87, 0.029), (94, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.72289455 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering
Author: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk
Abstract: Unions of subspaces provide a powerful generalization of single subspace models for collections of high-dimensional data; however, learning multiple subspaces from data is challenging due to the fact that segmentation—the identification of points that live in the same subspace—and subspace estimation must be performed simultaneously. Recently, sparse recovery methods were shown to provide a provable and robust strategy for exact feature selection (EFS)—recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with ℓ1 -minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the data set is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble. Keywords: subspace clustering, unions of subspaces, hybrid linear models, sparse approximation, structured sparsity, nearest neighbors, low-rank approximation
2 0.43687338 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
Author: Wendelin Böhmer, Steffen Grünewälder, Yun Shen, Marek Musial, Klaus Obermayer
Abstract: Linear reinforcement learning (RL) algorithms like least-squares temporal difference learning (LSTD) require basis functions that span approximation spaces of potential value functions. This article investigates methods to construct these bases from samples. We hypothesize that an ideal approximation spaces should encode diffusion distances and that slow feature analysis (SFA) constructs such spaces. To validate our hypothesis we provide theoretical statements about the LSTD value approximation error and induced metric of approximation spaces constructed by SFA and the state-of-the-art methods Krylov bases and proto-value functions (PVF). In particular, we prove that SFA minimizes the average (over all tasks in the same environment) bound on the above approximation error. Compared to other methods, SFA is very sensitive to sampling and can sometimes fail to encode the whole state space. We derive a novel importance sampling modification to compensate for this effect. Finally, the LSTD and least squares policy iteration (LSPI) performance of approximation spaces constructed by Krylov bases, PVF, SFA and PCA is compared in benchmark tasks and a visual robot navigation experiment (both in a realistic simulation and with a robot). The results support our hypothesis and suggest that (i) SFA provides subspace-invariant features for MDPs with self-adjoint transition operators, which allows strong guarantees on the approximation error, (ii) the modified SFA algorithm is best suited for LSPI in both discrete and continuous state spaces and (iii) approximation spaces encoding diffusion distances facilitate LSPI performance. Keywords: reinforcement learning, diffusion distance, proto value functions, slow feature analysis, least-squares policy iteration, visual robot navigation c 2013 Wendelin B¨ hmer, Steffen Gr¨ new¨ lder, Yun Shen, Marek Musial and Klaus Obermayer. o u a ¨ ¨ ¨ B OHMER , G R UNEW ALDER , S HEN , M USIAL AND O BERMAYER
3 0.43660685 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding
Author: Ery Arias-Castro, Bruno Pelletier
Abstract: Maximum Variance Unfolding is one of the main methods for (nonlinear) dimensionality reduction. We study its large sample limit, providing specific rates of convergence under standard assumptions. We find that it is consistent when the underlying submanifold is isometric to a convex subset, and we provide some simple examples where it fails to be consistent. Keywords: maximum variance unfolding, isometric embedding, U-processes, empirical processes, proximity graphs.
Author: Alberto N. Escalante-B., Laurenz Wiskott
Abstract: Supervised learning from high-dimensional data, for example, multimedia data, is a challenging task. We propose an extension of slow feature analysis (SFA) for supervised dimensionality reduction called graph-based SFA (GSFA). The algorithm extracts a label-predictive low-dimensional set of features that can be post-processed by typical supervised algorithms to generate the final label or class estimation. GSFA is trained with a so-called training graph, in which the vertices are the samples and the edges represent similarities of the corresponding labels. A new weighted SFA optimization problem is introduced, generalizing the notion of slowness from sequences of samples to such training graphs. We show that GSFA computes an optimal solution to this problem in the considered function space and propose several types of training graphs. For classification, the most straightforward graph yields features equivalent to those of (nonlinear) Fisher discriminant analysis. Emphasis is on regression, where four different graphs were evaluated experimentally with a subproblem of face detection on photographs. The method proposed is promising particularly when linear models are insufficient as well as when feature selection is difficult. Keywords: slow feature analysis, feature extraction, classification, regression, pattern recognition, training graphs, nonlinear dimensionality reduction, supervised learning, implicitly supervised, high-dimensional data, image analysis
Author: Daniil Ryabko, Jérémie Mary
Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions
6 0.43172947 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
7 0.43050206 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
8 0.42691255 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
9 0.42580712 59 jmlr-2013-Large-scale SVD and Manifold Learning
10 0.42465097 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
11 0.42446569 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning
12 0.42220679 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
13 0.42215708 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
14 0.42202547 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
15 0.42130658 41 jmlr-2013-Experiment Selection for Causal Discovery
16 0.42059192 13 jmlr-2013-Approximating the Permanent with Fractional Belief Propagation
17 0.4200432 22 jmlr-2013-Classifying With Confidence From Incomplete Information
18 0.41924256 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
19 0.41900721 86 jmlr-2013-Parallel Vector Field Embedding
20 0.41891941 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression