nips nips2011 nips2011-149 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Zhen J. Xiang, Hao Xu, Peter J. Ramadge
Abstract: Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. [sent-3, score-0.268]
2 But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. [sent-4, score-0.259]
3 First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. [sent-6, score-0.942]
4 Second, we study the properties of random projections in the context of learning sparse representations. [sent-7, score-0.337]
5 Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. [sent-8, score-1.046]
6 Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. [sent-9, score-0.359]
7 1 Introduction Consider approximating a p-dimensional data point x by a linear combination x ≈ Bw of m (possibly linearly dependent) codewords in a dictionary B = [b1, b2, ..., bm]. [sent-10, score-0.737]
8 Sparse representation, in which x is approximated as a weighted sum of only a few codewords in the dictionary, has recently attracted much attention [1]. [sent-16, score-0.478]
9 As a further refinement, when there are many data points xj , the dictionary B can be optimized to make the representations wj as sparse as possible. [sent-17, score-0.63]
10 Given data X = [x1, x2, ..., xn] ∈ R^{p×n}, we want to learn a dictionary B = [b1, b2, ..., bm] ∈ R^{p×m} [sent-22, score-0.259]
11 and sparse representation weights W = [w1, w2, ..., wn] ∈ R^{m×n} [sent-25, score-0.267]
12 so that each data point xj is well approximated by Bwj, with wj a sparse vector: min_{B,W} (1/2)||X − BW||_F^2 + λ||W||_1  (1)  s.t. ||bi||_2 ≤ 1, 1 ≤ i ≤ m. [sent-28, score-0.32]
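Problem (1) is typically attacked by alternating between a Lasso step in W and a constrained update of B. The following is a minimal sketch of that loop (our own illustration, not the authors' code); the use of scikit-learn's Lasso and the simple projected least-squares B-step are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def dictionary_learning(X, m, lam, n_iter=20, seed=0):
    """Rough alternating minimization for
    min_{B,W} 0.5*||X - B W||_F^2 + lam*||W||_1  s.t. ||b_i||_2 <= 1."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    B = rng.standard_normal((p, m))
    B /= np.linalg.norm(B, axis=0, keepdims=True)        # start from unit-norm codewords
    W = np.zeros((m, n))
    # sklearn's Lasso minimizes (1/(2p))||x - Bw||^2 + alpha*||w||_1, so alpha = lam / p
    lasso = Lasso(alpha=lam / p, fit_intercept=False, max_iter=2000)
    for _ in range(n_iter):
        # W-step: one Lasso problem per data point, with the dictionary fixed
        for j in range(n):
            lasso.fit(B, X[:, j])
            W[:, j] = lasso.coef_
        # B-step: unconstrained least squares, then project columns onto the unit ball
        B = X @ W.T @ np.linalg.pinv(W @ W.T + 1e-8 * np.eye(m))
        norms = np.maximum(np.linalg.norm(B, axis=0), 1.0)
        B /= norms
    return B, W
```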
13 Unlike approaches that use a fixed, off-the-shelf dictionary (e.g., Fourier, wavelet, DCT), problem (1) assumes minimal prior knowledge and uses sparsity as a cue to learn a dictionary adapted to the data. [sent-41, score-0.288]
14 In many other approaches (including [2–4]), although the codewords in B are cleverly chosen, the new representation w is simply a linear mapping of x. [sent-45, score-0.565]
15 As a final point, we note that the human visual cortex uses similar mechanisms to encode visual scenes [7] and sparse representation has exhibited superior performance on difficult computer vision problems such as face [8] and object [9] recognition. [sent-49, score-0.287]
16 First (§2), we explore dictionary screening [13, 14], to select a subset of codewords to use in each Lasso optimization. [sent-57, score-1.037]
17 We derive two new screening tests that are significantly better than existing tests when the data points and codewords are highly correlated, a typical scenario in sparse representation applications [15]. [sent-58, score-1.359]
18 We also provide simple geometric intuition for guiding the derivation of screening tests. [sent-59, score-0.329]
19 Second (§3), we examine projecting data onto a lower dimensional space so that we can control information flow in our hierarchical framework and solve sparse representation problems with smaller p. [sent-60, score-0.359]
20 We identify an important property of the data that’s implicitly assumed in sparse representation problems (scale indifference) and study how random projection preserves this property. [sent-61, score-0.35]
21 Finally (§4), we develop a framework for learning a hierarchical dictionary (similar in spirit to [17] and DBN [18]). [sent-63, score-0.38]
22 To do so we exploit our results on screening and random projection and impose a zero-tree like structured sparsity constraint on the representation. [sent-64, score-0.488]
23 2 Reducing the Dictionary By Screening In this section we assume that all data points and codewords are normalized: ||xj||_2 = ||bi||_2 = 1, 1 ≤ j ≤ n, 1 ≤ i ≤ m (we discuss the implications of this assumption in §3). [sent-72, score-0.927]
24 Each subproblem is then of the form: min_{w1,...,wm} (1/2)||x − Σ_{i=1}^m wi bi||_2^2 + λ Σ_{i=1}^m |wi|.  (2) [sent-79, score-0.426]
25 To address the challenge of solving (2) for large m, we first explore simple screening tests that identify and discard codewords bi guaranteed to have optimal solution wi = 0. [sent-83, score-1.462]
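To make the effect of screening concrete, here is a small sketch (our illustration; the `keep` mask would come from a screening test such as the sphere tests derived below): discarded codewords are removed before solving (2), and their weights are set to zero afterwards.

```python
import numpy as np
from sklearn.linear_model import Lasso

def solve_subproblem(x, B, lam, keep=None):
    """Solve (2) for one data point x, optionally on a screened (reduced) dictionary.
    keep: boolean mask over codewords that screening did not discard."""
    p, m = B.shape
    if keep is None:
        keep = np.ones(m, dtype=bool)
    w = np.zeros(m)
    # sklearn's Lasso minimizes (1/(2p))||x - Bw||^2 + alpha*||w||_1, so alpha = lam / p
    model = Lasso(alpha=lam / p, fit_intercept=False, max_iter=5000)
    model.fit(B[:, keep], x)
    w[keep] = model.coef_          # discarded codewords keep w_i = 0
    return w
```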
26 El Ghaoui’s SAFE rule [13] is an example of a simple screening test. [sent-84, score-0.329]
27 We introduce some simple geometric intuition for screening and use this to derive new tests that are significantly better than existing tests for the type of problems of interest here. [sent-85, score-0.599]
28 The dual of (2) is: max_θ (1/2)||x||_2^2 − (λ^2/2)||θ − x/λ||_2^2  s.t. |θ^T bi| ≤ 1, ∀i = 1, 2, ..., m.  (3) [sent-88, score-0.339]
29 The optimal solution w̃ = [w̃1, w̃2, ..., w̃m] of (2) and the optimal solution θ̃ of the dual problem (3) satisfy: x = Σ_{i=1}^m w̃i bi + λθ̃, with θ̃^T bi ∈ {sign(w̃i)} if w̃i ≠ 0 and θ̃^T bi ∈ [−1, 1] if w̃i = 0.  (4) [sent-95, score-1.062]
30 Since ||x||_2 = ||bi||_2 = 1, x and all bi lie on the unit sphere S^{p−1} (Fig. 1). [sent-97, score-0.785]
31 (c) The solid red, dotted blue and solid magenta circles leading to sphere tests ST1/SAFE, ST2, ST3, respectively. [sent-117, score-0.362]
32 By (4), if θ is not on P (bi ) or P (−bi ), then wi = 0 and we can safely discard bi from problem (2). [sent-125, score-0.492]
33 Let λmax = max_i |x^T bi| and b∗ ∈ {±bi}_{i=1}^m be selected so that λmax = x^T b∗. [sent-126, score-0.339]
34 These observations can be used for screening as follows. [sent-135, score-0.329]
35 If we know that θ is within a region R, then we can discard those bi for which the tangent hyperplanes P (bi ) and P (−bi ) don’t intersect R, since by (4) the corresponding wi will be 0. [sent-136, score-0.526]
36 In particular, if R is a closed ball, i.e., {θ : ||θ − q||_2 ≤ r}, then one can discard all bi for which |q^T bi| is smaller than a threshold determined by the common tangent hyperplanes of the spheres ||θ − q||_2 = r and S^{p−1}. [sent-142, score-0.778]
37 Lemma 1: If the solution θ̃ of (3) satisfies ||θ̃ − q||_2 ≤ r, then |q^T bi| < 1 − r ⇒ w̃i = 0. [sent-145, score-0.426]
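The step behind Lemma 1 is a one-line bound; the following LaTeX sketch is our paraphrase of the argument, using ||bi||_2 = 1.

```latex
% If \|\tilde\theta - q\|_2 \le r and |q^T b_i| < 1 - r, the i-th dual constraint is inactive:
\[
|\tilde\theta^{T} b_i|
  \le |q^{T} b_i| + |(\tilde\theta - q)^{T} b_i|
  \le |q^{T} b_i| + \|\tilde\theta - q\|_2 \, \|b_i\|_2
  <   (1 - r) + r = 1,
\]
% so by the optimality conditions (4) we must have \tilde{w}_i = 0, and b_i can be discarded.
```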
38 Plugging in q = x/λ and r = 1/λ − 1/λmax into Lemma 1 yields El Ghaoui’s SAFE rule: Sphere Test #1 (ST1/SAFE): If |x^T bi| < λ − 1 + λ/λmax, then w̃i = 0. [sent-150, score-0.426]
39 ST1/SAFE is less effective when the codewords are very similar to the data points, a frequent situation in applications [15]. [sent-153, score-0.507]
40 Plugging q = x/λmax and r = 2√(1/λmax^2 − 1)·(λmax/λ − 1) into Lemma 1 yields our first new test: Sphere Test #2 (ST2): If |x^T bi| < λmax(1 − 2√(1/λmax^2 − 1)·(λmax/λ − 1)), then w̃i = 0. [sent-166, score-0.534]
41 Since ST2 and ST1/SAFE both test |x^T bi| against thresholds, we can compare the tests by plotting their thresholds. [sent-167, score-0.582]
42 When λmax > √3/2, if ST1/SAFE discards bi, then ST2 also discards bi. [sent-175, score-0.898]
43 Finally, to use the closed ball constraint (5), we plug q = x/λ − (λmax/λ − 1)b∗ and r = √(1/λmax^2 − 1)·(λmax/λ − 1) into Lemma 1 to obtain a second new test: Sphere Test #3 (ST3): If |x^T bi − (λmax − λ)b∗^T bi| < λ(1 − √(1/λmax^2 − 1)·(λmax/λ − 1)), then w̃i = 0. [sent-176, score-1.041]
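A compact sketch of the three tests, using the formulas as reconstructed above (the exact radii are our reading of the garbled text, so treat them as assumptions). Note that all three tests need only the inner products x^T bi, λ and λmax, plus b∗^T bi for ST3.

```python
import numpy as np

def sphere_tests(x, B, lam):
    """Return boolean masks of codewords that ST1/SAFE, ST2, ST3 allow us to discard.
    Assumes ||x||_2 = ||b_i||_2 = 1 and lam <= lam_max."""
    u = B.T @ x                               # u[i] = x^T b_i
    lam_max = np.max(np.abs(u))
    i_star = int(np.argmax(np.abs(u)))
    b_star = np.sign(u[i_star]) * B[:, i_star]
    v = B.T @ b_star                          # v[i] = b_*^T b_i
    au = np.abs(u)

    # ST1 / SAFE (El Ghaoui et al.): discard if |x^T b_i| < lam - 1 + lam/lam_max
    st1 = au < (lam - 1.0 + lam / lam_max)

    # ST2 (as reconstructed): radius r2 = 2*sqrt(1/lam_max^2 - 1)*(lam_max/lam - 1)
    r2 = 2.0 * np.sqrt(max(1.0 / lam_max**2 - 1.0, 0.0)) * (lam_max / lam - 1.0)
    st2 = au < lam_max * (1.0 - r2)

    # ST3 (as reconstructed): center q = x/lam - (lam_max/lam - 1) b_*, radius r3
    r3 = np.sqrt(max(1.0 / lam_max**2 - 1.0, 0.0)) * (lam_max / lam - 1.0)
    st3 = np.abs(u - (lam_max - lam) * v) < lam * (1.0 - r3)

    return st1, st2, st3
```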
44 Given any x, b∗ and λ, if ST2 discards bi , then ST3 also discards bi . [sent-180, score-0.898]
45 The first pass holds x, u, bi ∈ Rp in memory at once and computes u(i) = xT bi . [sent-185, score-0.678]
46 The second pass holds u, b∗, bi in memory at once, computes b∗^T bi and executes the test. [sent-187, score-0.678]
47 3 Random Projections of the Data In §4 we develop a framework for learning a hierarchical dictionary and this involves the use of random data projections to control information flow to the levels of the hierarchy. [sent-188, score-0.613]
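A sketch of the two-pass pattern just described, under an assumed on-disk layout (one unit-norm codeword per .npy file; the layout and function name are ours): the first pass accumulates u(i) = x^T bi and finds b∗, the second pass re-reads each bi to compute b∗^T bi and apply ST3.

```python
import numpy as np

def two_pass_st3(x, codeword_files, lam):
    """Two-pass ST3 screening that never holds more than a few p-vectors in memory.
    codeword_files: list of paths, each storing one unit-norm codeword b_i (hypothetical layout)."""
    # Pass 1: u[i] = x^T b_i, and locate b_* achieving lam_max = max_i |x^T b_i|
    u = np.array([float(x @ np.load(f)) for f in codeword_files])
    lam_max = np.max(np.abs(u))
    i_star = int(np.argmax(np.abs(u)))
    b_star = np.sign(u[i_star]) * np.load(codeword_files[i_star])

    # Pass 2: compute b_*^T b_i codeword by codeword and apply the ST3 threshold
    r3 = np.sqrt(max(1.0 / lam_max**2 - 1.0, 0.0)) * (lam_max / lam - 1.0)
    thresh = lam * (1.0 - r3)
    keep = np.zeros(len(codeword_files), dtype=bool)
    for i, f in enumerate(codeword_files):
        v_i = float(b_star @ np.load(f))
        keep[i] = abs(u[i] - (lam_max - lam) * v_i) >= thresh   # keep if not discarded
    return keep
```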
48 Here we lay some groundwork by studying basic properties of random projections in learning sparse representations. [sent-190, score-0.337]
49 We first revisit the normalization assumption ||xj||_2 = ||bi||_2 = 1, 1 ≤ j ≤ n, 1 ≤ i ≤ m in §2. [sent-191, score-0.387]
50 The assumption that all codewords are normalized, ||bi||_2 = 1, is necessary for (1) to be meaningful; otherwise we can increase the scale of bi and inversely scale the ith row of W to lower the loss. [sent-192, score-1.185]
51 To see this, assume that the data {xj}_{j=1}^n are samples from an underlying low dimensional smooth manifold X and that one desires a correspondence between codewords and local regions on X. [sent-194, score-0.507]
52 Intuitively, SI means that X doesn’t contain points differing only in scale and it implies that points x1 , x2 from distinct regions on X will use different codewords in their representation. [sent-197, score-0.544]
53 Since a random projection of the original data doesn’t preserve the normalization ||xj||_2 = 1, it’s important for the random projection to preserve the SI property so that it is reasonable to renormalize the projected data. [sent-202, score-0.266]
54 If X satisfies SI and has a κ-sparse representation using dictionary B, then the projected data T(X) satisfies SI if (2κ − 1)M(TB) < 1, where M(·) is the mutual coherence of a matrix. [sent-228, score-0.413]
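A small sketch of this condition (helper names and toy sizes are ours): M(·) is the largest absolute inner product between distinct normalized columns, and the check is simply (2κ − 1)·M(TB) < 1.

```python
import numpy as np

def mutual_coherence(A):
    """M(A): maximum absolute cosine between distinct columns of A."""
    A = A / np.linalg.norm(A, axis=0, keepdims=True)
    G = np.abs(A.T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

def si_preserved(T, B, kappa):
    """Sufficient condition from the text for T(X) to inherit scale indifference."""
    return (2 * kappa - 1) * mutual_coherence(T @ B) < 1.0

# Toy example (sizes are assumptions): a d x p Gaussian random projection
rng = np.random.default_rng(0)
p, m, d, kappa = 64, 128, 32, 3
B = rng.standard_normal((p, m)); B /= np.linalg.norm(B, axis=0)
T = rng.standard_normal((d, p)) / np.sqrt(d)
print(si_preserved(T, B, kappa))
```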
55 Let wj be the new representation of xj and µj = ||Txj − Bwj||_2 be the length of the residual (j = 1, 2). [sent-233, score-0.245]
56 Therefore the distances between the sparse representation weights reflect the original data point distances. [sent-236, score-0.296]
57 4 Learning a Hierarchical Dictionary Our hierarchical framework decomposes a large dictionary learning problem into a sequence of smaller hierarchically structured dictionary learning problems. [sent-238, score-0.61]
58 High levels of the tree give coarse representations, deeper levels give more detailed representations, and the codewords at the leaves form the final dictionary. [sent-240, score-0.528]
59 The tree is grown top-down in l levels by refining the dictionary at the previous level to give the dictionary at the next level. [sent-241, score-0.538]
60 We also enforce a zero-tree constraint on the sparse representation weights so that the zero weights in the previous level will force the corresponding weights in the next level to be zero. [sent-243, score-0.514]
61 At each stage we combine this zero-tree constraint with our new screening tests to reduce the size of Lasso problems that must be solved. [sent-244, score-0.533]
62 At level k we learn a dictionary Bk ∈ R^{dk×mk} and weights Wk ∈ R^{mk×n} by solving a small sparse representation problem similar to (1): [sent-247, score-0.554]
63 min_{Bk,Wk} (1/2)||Tk X − Bk Wk||_F^2 + λk ||Wk||_1  (9)  s.t. ||bi^{(k)}||_2^2 ≤ 1, ∀i = 1, 2, ..., mk. [sent-249, score-0.339]
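A hedged sketch of one level of the hierarchy (the Gaussian choice of Tk, the renormalization step, and the reuse of the dictionary_learning sketch given earlier are all assumptions): project the data with Tk, then solve the smaller problem (9).

```python
import numpy as np

def learn_level(X, d_k, m_k, lam_k, seed=0):
    """One level of the hierarchy: random projection T_k, then problem (9).
    Relies on the dictionary_learning sketch defined earlier (an assumption)."""
    rng = np.random.default_rng(seed)
    p = X.shape[0]
    T_k = rng.standard_normal((d_k, p)) / np.sqrt(d_k)   # d_k x p random projection
    Z = T_k @ X
    Z = Z / np.linalg.norm(Z, axis=0, keepdims=True)     # renormalize projected data (valid when SI is preserved)
    B_k, W_k = dictionary_learning(Z, m_k, lam_k, seed=seed)
    return T_k, B_k, W_k
```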
64 Here bi^{(k)} is the ith column of matrix Bk and mk is assumed to be a multiple of mk−1, so the number of codewords mk increases with k. [sent-253, score-1.216]
65 The ith group has mk/mk−1 weights, {wj^{(k)}(r·mk−1 + i), 0 ≤ r < mk/mk−1}, and has weight wj^{(k−1)}(i) as its parent weight. [sent-260, score-0.509]
66 So for every level k ≥ 2, data point j (1 ≤ j ≤ n), group i (1 ≤ i ≤ mk−1), and weight wj^{(k)}(r·mk−1 + i) (0 ≤ r < mk/mk−1), we enforce: wj^{(k−1)}(i) = 0 ⇒ wj^{(k)}(r·mk−1 + i) = 0.  (10) [sent-262, score-0.572]
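The bookkeeping behind (10) reduces to an index computation; the helper below is our own sketch of it under the group layout just described.

```python
import numpy as np

def allowed_children(w_parent, m_prev, m_k):
    """Indices of level-k codewords data point j may use under the zero-tree rule (10).
    w_parent: length-m_prev parent weight vector w_j^{(k-1)}."""
    ratio = m_k // m_prev
    active = np.flatnonzero(w_parent != 0)                 # parents with nonzero weight
    # children of parent i at level k are indices r*m_prev + i, for r = 0, ..., ratio-1
    return (active[None, :] + m_prev * np.arange(ratio)[:, None]).ravel()
```

With this mask, the level-k Lasso for data point j runs over only (mk/mk−1)·||wj^{(k−1)}||_0 columns, matching the reduction noted below.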
67 In addition, (10) means that the weights of the previous layer select a small subset of codewords to enter the Lasso optimization. [sent-264, score-0.652]
68 When solving for wj^{(k)}, (10) reduces the number of codewords from mk to (mk/mk−1)·||wj^{(k−1)}||_0, a considerable reduction since wj^{(k−1)} is sparse. [sent-265, score-0.883]
69 Thus the screening rules in §2 and the imposed screening rule (10) work together to reduce the effective dictionary size. [sent-266, score-1.038]
70 The tree structure in the weights introduces a similar hierarchical tree structure in the dictionaries {Bk}_{k=1}^l: the codewords {b^{(k)}_{r·mk−1+i}, 0 ≤ r < mk/mk−1} are the children of codeword b^{(k−1)}_i. [sent-267, score-1.357]
71 When k > 1, the mk codewords in layer k are naturally divided into mk−1 groups, so we can solve Bk by optimizing each group sequentially. [sent-269, score-0.79]
72 For i = 1, 2, ..., mk−1, let B = [b^{(k)}_{r·mk−1+i}]_{r=0}^{mk/mk−1 − 1} denote the codewords in group i. [sent-274, score-0.478]
73 Denote the remaining codewords and weights by B and W . [sent-280, score-0.525]
74 5 Experiments We tested our framework on: (a) the COIL rotational image data set [23], (b) the MNIST digit classification data set [24], and (c) the extended Yale B face recognition data set [25] [26]. [sent-290, score-0.381]
75 We ran the traditional sparse representation algorithm to compare the three screening tests in §2. [sent-297, score-0.741]
76 As shown in Fig. 2(c), ST3 discards a larger fraction of codewords than ST2, and ST2 discards a larger fraction than ST1/SAFE. [sent-300, score-0.698]
77 This helps the second layer discard more codewords when the tree constraint (10) is imposed. [sent-307, score-0.79]
78 Fig. 2(b) illustrates this constraint: the 16 second-layer codewords are organized in 4 groups of 4 (only 2 groups shown). [sent-309, score-0.667]
79 This imposed constraint discards many more codewords in the screening stage than any of the three tests in §2. [sent-311, score-1.161]
80 (Figure 2 residue: the panels plot the average percentage of codewords discarded in the prescreening, comparing ST1/SAFE, ST2, and ST3 on the original and the projected data, and the second-layer screening with and without the zero-tree constraint; the remaining axis labels belong to Figure 3: average encoding time (ms), classification accuracy (%) and recognition rate (%) on the testing set, versus the number of random projections as a percentage of image size.)
85 (c): Comparison of the three screening tests for sparse representation. [sent-346, score-0.597]
86 (d): Screening performance in the second layer of our hierarchical framework using combinations of screening criteria. [sent-347, score-0.606]
87 The imposed constraint (10) helps to discard significantly more codewords when λ is small. [sent-348, score-0.653]
88 (Figure 3 legend: traditional sparse representation, our hierarchical framework, our framework with PCA projections, linear classifier; x-axis: number of random projections as a percentage of image size.)
89 Figure 3: Left: MNIST: The tradeoff between classification accuracy and average encoding time for various sparse representation methods. [sent-353, score-0.524]
90 The performance of traditional sparse representation is consistent with [9]. [sent-356, score-0.277]
91 Traditional sparse representation has the best accuracy and is very close to a similar method SRC in [8] (SRC’s recognition rate is cited from [8] but data on encoding time is not available). [sent-358, score-0.345]
92 Using PCA projections in our framework yields worse performance since these projections do not spread information across the layers. [sent-360, score-0.459]
93 The sparse representation gives a multiresolution representation of the rotational pattern: the first layer encodes rough orientation and the second layer refines it. [sent-364, score-0.623]
94 We ran the traditional sparse representation algorithm for dictionary size m ∈ {64, 128, 192, 256} and λ ∈ Λ = {0. [sent-372, score-0.507]
95 The plot shows that compared to the traditional sparse representation, our hierarchical framework achieves roughly a 1% accuracy improvement given the same encoding time and a roughly 2X speedup given the same accuracy. [sent-387, score-0.398]
96 In this experiment we start with the randomly projected data (p ∈ {32, 64, 128, 256} random projections of the original 192x128 data) and use this data as follows: (a) learn a traditional non-hierarchical sparse representation; (b) our framework, i.e., [sent-391, score-0.548]
97 sample the data in two stages using orthogonal random projections and learn a 2-layer hierarchical sparse representation; (c) use PCA projections to replace the random projections in (b); (d) directly apply a linear classifier without first learning a sparse representation. [sent-393, score-1.162]
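A brief sketch of option (b)'s projection step as we read it (orthonormalization via QR and an even split of the p projections between the two stages are our assumptions, not details from the paper):

```python
import numpy as np

def two_stage_projections(p_orig, p_total, seed=0):
    """Split p_total orthogonal random projections of R^{p_orig} into two stages."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((p_orig, p_total))
    Q, _ = np.linalg.qr(G)              # orthonormal columns, shape (p_orig, p_total)
    T = Q.T                             # p_total x p_orig, mutually orthogonal rows
    half = p_total // 2                 # assumed split; the paper's split may differ
    return T[:half], T[half:]           # T_1 for layer 1, T_2 for layer 2
```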
98 The PCA variant of our framework has worse performance because the first (3/8)·p projections contain too much information, leaving the second layer too little information (which also drags down the speed for lack of sparsity and structure). [sent-402, score-0.411]
99 As expected, λmax is large, a situation that favors our new screening tests (ST2, ST3). [sent-410, score-0.464]
100 We have shown that under certain conditions, random projection preserves the scale indifference (SI) property with high probability, thus providing an opportunity to learn informative sparse representations with data of fewer dimensions. [sent-412, score-0.401]
wordName wordTfidf (topN-words)
[('codewords', 0.478), ('bi', 0.339), ('screening', 0.329), ('dictionary', 0.23), ('projections', 0.204), ('mk', 0.185), ('tests', 0.135), ('sparse', 0.133), ('layer', 0.127), ('discards', 0.11), ('wj', 0.11), ('max', 0.108), ('sphere', 0.107), ('hierarchical', 0.099), ('bk', 0.093), ('rmk', 0.091), ('si', 0.091), ('ghaoui', 0.09), ('wi', 0.087), ('representation', 0.087), ('tk', 0.078), ('safe', 0.071), ('el', 0.071), ('constraint', 0.069), ('face', 0.067), ('projected', 0.067), ('discard', 0.066), ('ball', 0.065), ('submanifold', 0.063), ('indifference', 0.062), ('rotational', 0.062), ('projection', 0.061), ('tx', 0.06), ('layers', 0.059), ('dictionaries', 0.059), ('encoding', 0.058), ('doesn', 0.057), ('traditional', 0.057), ('pca', 0.055), ('src', 0.055), ('framework', 0.051), ('codeword', 0.05), ('tree', 0.05), ('discarded', 0.05), ('xj', 0.048), ('lemma', 0.048), ('classifier', 0.047), ('mq', 0.047), ('riemannian', 0.047), ('representations', 0.047), ('weights', 0.047), ('wk', 0.046), ('mnist', 0.046), ('solid', 0.045), ('lasso', 0.044), ('ow', 0.043), ('discarding', 0.043), ('brmk', 0.042), ('bwj', 0.042), ('coil', 0.042), ('rdk', 0.042), ('image', 0.042), ('preserves', 0.04), ('feasible', 0.04), ('rp', 0.04), ('xt', 0.04), ('imposed', 0.04), ('recognition', 0.038), ('wavelet', 0.037), ('dual', 0.036), ('testing', 0.035), ('satis', 0.035), ('closed', 0.034), ('tangent', 0.034), ('princeton', 0.034), ('digit', 0.034), ('bw', 0.034), ('nadler', 0.034), ('zhen', 0.034), ('lighting', 0.034), ('points', 0.033), ('shaded', 0.033), ('percentage', 0.032), ('wright', 0.032), ('rows', 0.032), ('sp', 0.031), ('mairal', 0.031), ('groups', 0.031), ('supplemental', 0.03), ('magenta', 0.03), ('sparsity', 0.029), ('learn', 0.029), ('ith', 0.029), ('data', 0.029), ('wakin', 0.029), ('xiang', 0.029), ('level', 0.028), ('thresholds', 0.028), ('enforce', 0.028), ('challenge', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999928 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Author: Zhen J. Xiang, Hao Xu, Peter J. Ramadge
Abstract: Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. 1
2 0.23142651 178 nips-2011-Multiclass Boosting: Theory and Algorithms
Author: Mohammad J. Saberian, Nuno Vasconcelos
Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1
3 0.214562 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
Author: Ioannis A. Gkioulekas, Todd Zickler
Abstract: We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. 1
4 0.17597553 259 nips-2011-Sparse Estimation with Structured Dictionaries
Author: David P. Wipf
Abstract: In the vast majority of recent work on sparse estimation algorithms, performance has been evaluated using ideal or quasi-ideal dictionaries (e.g., random Gaussian or Fourier) characterized by unit ℓ2 norm, incoherent columns or features. But in reality, these types of dictionaries represent only a subset of the dictionaries that are actually used in practice (largely restricted to idealized compressive sensing applications). In contrast, herein sparse estimation is considered in the context of structured dictionaries possibly exhibiting high coherence between arbitrary groups of columns and/or rows. Sparse penalized regression models are analyzed with the purpose of finding, to the extent possible, regimes of dictionary invariant performance. In particular, a Type II Bayesian estimator with a dictionarydependent sparsity penalty is shown to have a number of desirable invariance properties leading to provable advantages over more conventional penalties such as the ℓ1 norm, especially in areas where existing theoretical recovery guarantees no longer hold. This can translate into improved performance in applications such as model selection with correlated features, source localization, and compressive sensing with constrained measurement directions. 1
5 0.13871041 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and contrast normalization. We investigate the architecture of HMP, and show that all three components are critical for good performance. To speed up the orthogonal matching pursuit, we propose a batch tree orthogonal matching pursuit that is particularly suitable to encode a large number of observations that share the same large dictionary. HMP is scalable and can efficiently handle full-size images. In addition, HMP enables linear support vector machines (SVM) to match the performance of nonlinear SVM while being scalable to large datasets. We compare HMP with many state-of-the-art algorithms including convolutional deep belief networks, SIFT based single layer sparse coding, and kernel based feature learning. HMP consistently yields superior accuracy on three types of image classification problems: object recognition (Caltech-101), scene recognition (MIT-Scene), and static event recognition (UIUC-Sports). 1
6 0.13308486 303 nips-2011-Video Annotation and Tracking with Active Learning
7 0.12774011 276 nips-2011-Structured sparse coding via lateral inhibition
8 0.12279493 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
9 0.11185341 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
10 0.11050227 261 nips-2011-Sparse Filtering
11 0.097573861 244 nips-2011-Selecting Receptive Fields in Deep Networks
12 0.093080603 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
13 0.091448285 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
14 0.089594699 64 nips-2011-Convergent Bounds on the Euclidean Distance
15 0.087140486 258 nips-2011-Sparse Bayesian Multi-Task Learning
16 0.087061115 168 nips-2011-Maximum Margin Multi-Instance Learning
17 0.081071362 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data
18 0.079798512 239 nips-2011-Robust Lasso with missing and grossly corrupted observations
19 0.078795366 250 nips-2011-Shallow vs. Deep Sum-Product Networks
20 0.076455928 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
topicId topicWeight
[(0, 0.252), (1, 0.065), (2, -0.07), (3, -0.071), (4, -0.014), (5, 0.11), (6, 0.065), (7, 0.162), (8, -0.109), (9, -0.051), (10, -0.008), (11, 0.022), (12, -0.024), (13, -0.005), (14, -0.018), (15, -0.027), (16, -0.033), (17, -0.012), (18, -0.01), (19, -0.059), (20, 0.247), (21, -0.014), (22, 0.059), (23, 0.01), (24, 0.001), (25, -0.203), (26, -0.063), (27, 0.018), (28, -0.116), (29, 0.023), (30, -0.107), (31, -0.067), (32, -0.004), (33, -0.05), (34, 0.001), (35, 0.039), (36, 0.002), (37, -0.066), (38, -0.026), (39, 0.173), (40, 0.045), (41, -0.06), (42, 0.006), (43, -0.073), (44, -0.064), (45, -0.071), (46, 0.105), (47, 0.212), (48, -0.019), (49, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.95253986 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Author: Zhen J. Xiang, Hao Xu, Peter J. Ramadge
Abstract: Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. 1
2 0.65300804 259 nips-2011-Sparse Estimation with Structured Dictionaries
Author: David P. Wipf
Abstract: In the vast majority of recent work on sparse estimation algorithms, performance has been evaluated using ideal or quasi-ideal dictionaries (e.g., random Gaussian or Fourier) characterized by unit ℓ2 norm, incoherent columns or features. But in reality, these types of dictionaries represent only a subset of the dictionaries that are actually used in practice (largely restricted to idealized compressive sensing applications). In contrast, herein sparse estimation is considered in the context of structured dictionaries possibly exhibiting high coherence between arbitrary groups of columns and/or rows. Sparse penalized regression models are analyzed with the purpose of finding, to the extent possible, regimes of dictionary invariant performance. In particular, a Type II Bayesian estimator with a dictionarydependent sparsity penalty is shown to have a number of desirable invariance properties leading to provable advantages over more conventional penalties such as the ℓ1 norm, especially in areas where existing theoretical recovery guarantees no longer hold. This can translate into improved performance in applications such as model selection with correlated features, source localization, and compressive sensing with constrained measurement directions. 1
3 0.63256252 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
Author: Ioannis A. Gkioulekas, Todd Zickler
Abstract: We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. 1
4 0.5688507 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and contrast normalization. We investigate the architecture of HMP, and show that all three components are critical for good performance. To speed up the orthogonal matching pursuit, we propose a batch tree orthogonal matching pursuit that is particularly suitable to encode a large number of observations that share the same large dictionary. HMP is scalable and can efficiently handle full-size images. In addition, HMP enables linear support vector machines (SVM) to match the performance of nonlinear SVM while being scalable to large datasets. We compare HMP with many state-of-the-art algorithms including convolutional deep belief networks, SIFT based single layer sparse coding, and kernel based feature learning. HMP consistently yields superior accuracy on three types of image classification problems: object recognition (Caltech-101), scene recognition (MIT-Scene), and static event recognition (UIUC-Sports). 1
5 0.55521196 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
Author: Nobuyuki Morioka, Shin'ichi Satoh
Abstract: Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision. For visual object category recognition, 1 regularized sparse coding is combined with the spatial pyramid representation to obtain state-of-the-art performance. However, because of its iterative optimization, applying sparse coding onto every local feature descriptor extracted from an image database can become a major bottleneck. To overcome this computational challenge, this paper presents “Generalized Lasso based Approximation of Sparse coding” (GLAS). By representing the distribution of sparse coefficients with slice transform, we fit a piece-wise linear mapping function with the generalized lasso. We also propose an efficient post-refinement procedure to perform mutual inhibition between bases which is essential for an overcomplete setting. The experiments show that GLAS obtains a comparable performance to 1 regularized sparse coding, yet achieves a significant speed up demonstrating its effectiveness for large-scale visual recognition problems. 1
6 0.55139631 276 nips-2011-Structured sparse coding via lateral inhibition
7 0.52508748 143 nips-2011-Learning Anchor Planes for Classification
8 0.50074661 64 nips-2011-Convergent Bounds on the Euclidean Distance
9 0.47521588 178 nips-2011-Multiclass Boosting: Theory and Algorithms
10 0.45691997 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
11 0.45282936 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
12 0.41524625 287 nips-2011-The Manifold Tangent Classifier
13 0.40123609 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
14 0.40115136 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data
15 0.3869366 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices
16 0.38620955 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
17 0.37795013 261 nips-2011-Sparse Filtering
18 0.3746964 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
20 0.37243721 51 nips-2011-Clustered Multi-Task Learning Via Alternating Structure Optimization
topicId topicWeight
[(0, 0.035), (4, 0.055), (20, 0.036), (26, 0.025), (31, 0.069), (33, 0.029), (43, 0.067), (45, 0.168), (57, 0.061), (65, 0.031), (74, 0.078), (83, 0.072), (92, 0.142), (99, 0.064)]
simIndex simValue paperId paperTitle
1 0.91095281 209 nips-2011-Orthogonal Matching Pursuit with Replacement
Author: Prateek Jain, Ambuj Tewari, Inderjit S. Dhillon
Abstract: In this paper, we consider the problem of compressed sensing where the goal is to recover all sparse vectors using a small number of fixed linear measurements. For this problem, we propose a novel partial hard-thresholding operator that leads to a general family of iterative algorithms. While one extreme of the family yields well known hard thresholding algorithms like IHT and HTP [17, 10], the other end of the spectrum leads to a novel algorithm that we call Orthogonal Matching Pursuit with Replacement (OMPR). OMPR, like the classic greedy algorithm OMP, adds exactly one coordinate to the support at each iteration, based on the correlation with the current residual. However, unlike OMP, OMPR also removes one coordinate from the support. This simple change allows us to prove that OMPR has the best known guarantees for sparse recovery in terms of the Restricted Isometry Property (a condition on the measurement matrix). In contrast, OMP is known to have very weak performance guarantees under RIP. Given its simple structure, we are able to extend OMPR using locality sensitive hashing to get OMPR-Hash, the first provably sub-linear (in dimensionality) algorithm for sparse recovery. Our proof techniques are novel and flexible enough to also permit the tightest known analysis of popular iterative algorithms such as CoSaMP and Subspace Pursuit. We provide experimental results on large problems providing recovery for vectors of size up to million dimensions. We demonstrate that for large-scale problems our proposed methods are more robust and faster than existing methods.
2 0.90625167 9 nips-2011-A More Powerful Two-Sample Test in High Dimensions using Random Projection
Author: Miles Lopes, Laurent Jacob, Martin J. Wainwright
Abstract: We consider the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing for the data dimension p to exceed the sample size n. Our contribution is a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T 2 statistic. Working within a high-dimensional framework that allows (p, n) → ∞, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Using ROC curves generated from simulated data, we demonstrate superior performance against competing tests in the parameter regimes anticipated by our theoretical results. Lastly, we illustrate an advantage of our procedure with comparisons on a high-dimensional gene expression dataset involving the discrimination of different types of cancer. 1
same-paper 3 0.89665198 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
Author: Zhen J. Xiang, Hao Xu, Peter J. Ramadge
Abstract: Learning sparse representations on data adaptive dictionaries is a state-of-the-art method for modeling data. But when the dictionary is large and the data dimension is high, it is a computationally challenging problem. We explore three aspects of the problem. First, we derive new, greatly improved screening tests that quickly identify codewords that are guaranteed to have zero weights. Second, we study the properties of random projections in the context of learning sparse representations. Finally, we develop a hierarchical framework that uses incremental random projections and screening to learn, in small stages, a hierarchically structured dictionary for sparse representations. Empirical results show that our framework can learn informative hierarchical sparse representations more efficiently. 1
4 0.87401962 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
Author: Zuoguan Wang, Gerwin Schalk, Qiang Ji
Abstract: Brain-computer interfaces (BCIs) use brain signals to convey a user’s intent. Some BCI approaches begin by decoding kinematic parameters of movements from brain signals, and then proceed to using these signals, in absence of movements, to allow a user to control an output. Recent results have shown that electrocorticographic (ECoG) recordings from the surface of the brain in humans can give information about kinematic parameters (e.g., hand velocity or finger flexion). The decoding approaches in these demonstrations usually employed classical classification/regression algorithms that derive a linear mapping between brain signals and outputs. However, they typically only incorporate little prior information about the target kinematic parameter. In this paper, we show that different types of anatomical constraints that govern finger flexion can be exploited in this context. Specifically, we incorporate these constraints in the construction, structure, and the probabilistic functions of a switched non-parametric dynamic system (SNDS) model. We then apply the resulting SNDS decoder to infer the flexion of individual fingers from the same ECoG dataset used in a recent study. Our results show that the application of the proposed model, which incorporates anatomical constraints, improves decoding performance compared to the results in the previous work. Thus, the results presented in this paper may ultimately lead to neurally controlled hand prostheses with full fine-grained finger articulation. 1
5 0.83656102 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
6 0.83025128 150 nips-2011-Learning a Distance Metric from a Network
7 0.82881546 271 nips-2011-Statistical Tests for Optimization Efficiency
8 0.82877785 263 nips-2011-Sparse Manifold Clustering and Embedding
9 0.82826781 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
10 0.82622892 118 nips-2011-High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity
11 0.82601607 185 nips-2011-Newtron: an Efficient Bandit algorithm for Online Multiclass Prediction
12 0.82481778 244 nips-2011-Selecting Receptive Fields in Deep Networks
13 0.82437867 276 nips-2011-Structured sparse coding via lateral inhibition
14 0.82349414 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
15 0.82331842 265 nips-2011-Sparse recovery by thresholded non-negative least squares
16 0.82028913 220 nips-2011-Prediction strategies without loss
17 0.81982207 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
18 0.81879985 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
19 0.81868124 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
20 0.81865656 80 nips-2011-Efficient Online Learning via Randomized Rounding