nips nips2011 nips2011-143 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari
Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1
Reference: text
sentIndex sentText sentNum sentScore
1 It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. [sent-15, score-1.824]
2 In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. [sent-16, score-1.603]
3 Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. [sent-17, score-1.039]
4 In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). [sent-18, score-0.26]
5 We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. [sent-19, score-0.906]
6 1 Introduction Local Coordinate Coding (LCC) [18] is a coding scheme that encodes the data locally so that any non-linear (α, β, p)-Lipschitz smooth function (see Definition 1 in Section 2 for details) over the data manifold can be approximated using linear functions. [sent-22, score-0.285]
7 There are two components in this method: (1) a set of anchor points which determine the local coordinates, and (2) the coding for each data point based on the local coordinates given the anchor points. [sent-23, score-1.763]
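As a rough illustration of these two components, the following Python sketch codes a point against a given set of anchor points using affine least-squares weights over its k nearest anchors; this follows the closed-form construction popularized by LLC [16] rather than the exact LCC objective of [18], and the helper name, the choice of k, and the ridge term are assumptions.

```python
import numpy as np

def local_coding(x, anchors, k=5, reg=1e-4):
    """Encode x with affine weights over its k nearest anchor points
    (an illustrative sketch in the spirit of LCC/LLC, not the exact objective of [18])."""
    # distances from x to every anchor point
    dists = np.linalg.norm(anchors - x, axis=1)
    idx = np.argsort(dists)[:k]             # indices of the k nearest anchors
    V = anchors[idx]                        # (k, d) local anchor points

    # solve min_w ||x - w^T V||^2  s.t.  sum(w) = 1 via the shifted Gram system
    G = (V - x) @ (V - x).T                 # (k, k) local covariance
    G += reg * np.trace(G) * np.eye(k)      # small ridge term for numerical stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()                            # enforce the sum-to-one constraint

    gamma = np.zeros(len(anchors))
    gamma[idx] = w                          # coding is non-zero only on the k nearest anchors
    return gamma
```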
8 locality-constrained linear coding (LLC) [16]) in the VOC 2009 challenge [7]. [sent-27, score-0.17]
9 This has been demonstrated in [18] where LCC has been tested on MNIST dataset, using from 512 to 4096 anchor points learned from sparse coding, the error rate decreased from 2. [sent-29, score-0.815]
10 This situation could become a serious problem when the distribution of the data points is sparse in the feature space, i.e. when there exist holes in the data distribution. [sent-32, score-0.07]
11 As a result of this, many redundant anchor points will be distributed in the holes with little information. [sent-37, score-0.819]
12 By using many anchor points, the computational complexity of the classifier at both training and test time increases significantly, defeating the original purpose of using LCC. [sent-38, score-0.752]
13 So far several approaches have been proposed for problems closely related to anchor point learning such as dictionary learning or codebook learning. [sent-39, score-0.776]
14 [12] proposed learning the anchor points for sparse coding using the Lagrange dual. [sent-42, score-0.947]
15 [16] proposed locality-constrained linear coding (LLC), which is a fast implementation of LCC, and an online incremental codebook learning algorithm using a coordinate descent method, whose performance is very close to that of K-Means. [sent-48, score-0.289]
16 However, none of these algorithms can deal with holes of sparse data as they need many anchor points. [sent-49, score-0.775]
17 In this paper, we propose a method to approximate any non-linear (α, β, p)-Lipschitz smooth function using an orthogonal coordinate coding (OCC) scheme on a set of orthogonal basis vectors. [sent-50, score-0.61]
18 Each basis vector v ∈ Rd defines a family of anchor planes, each of which can be considered as consisting of an infinite number of anchor points, and the nearest point on each anchor plane to a data point x ∈ Rd is used for coding, as illustrated in Figure 1. [sent-51, score-2.382]
19 The data point x will be encoded based on the margin x^T v between x and an anchor plane defined by v, where (·)^T denotes the matrix transpose operator. [sent-52, score-0.801]
20 The benefits of using anchor planes are: • A few anchor planes can replace many anchor points while preserving similar locality of anchor points. [sent-53, score-3.214]
21 This sparsity may lead to a better generalization since many anchor points will overfit the data easily. [sent-54, score-0.803]
22 • The learned orthogonal basis vectors can fit naturally into locally linear SVMs (such as [9,10,11,19,21]) which we describe below. [sent-56, score-0.314]
23 Theoretically we show that using OCC any (α, β, p)-Lipschitz smooth non-linear function can be linearized with a fixed upper-bound approximation error. [sent-57, score-0.045]
24 In practice by minimizing this upper bound, the orthogonal basis vectors can be learned using singular value decomposition (SVD). [sent-58, score-0.3]
25 Linear support vector machines have become popular for solving classification tasks due to their fast and simple online application to large scale data sets. [sent-60, score-0.049]
26 A recent trend is to create classifiers locally based on a set of linear SVMs [9,10,11,19,21]. [sent-64, score-0.049]
27 For instance, in [20] SVMs are trained based only on the N nearest neighbors of each data point, and in [9] multiple kernel learning was applied locally. [sent-65, score-0.046]
28 Ladicky and Torr [11] proposed a novel locally linear SVM classifier (LL-SVM) with smooth decision boundary and bounded curvature. [sent-67, score-0.14]
29 They show how the functions defining the classifier can be approximated using local codings, and how this model can be optimized in an online fashion by performing stochastic gradient descent with the same convergence guarantees as the standard gradient descent method for linear SVMs. [sent-68, score-0.053]
30 min_{W,b} λ||W||_2^2 + (1/|S|) Σ_{k∈S} ξ_k (1), subject to ∀k ∈ S: ξ_k ≥ 1 − y_k (γ_{x_k}^T W x_k + γ_{x_k}^T b), ξ_k ≥ 0, where ∀k, x_k ∈ R^d is a training vector, y_k ∈ {−1, 1} is its label, γ_{x_k} ∈ R^N is its local coding, λ ≥ 0 is a pre-defined scalar, and W ∈ R^{N×d} and b ∈ R^N are the model parameters. [sent-71, score-0.069]
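To make Eq. (1) concrete, here is a hedged Python sketch that evaluates the objective for given parameters by plugging the tightest feasible slacks into the constraints; the array shapes and the names W, b, Gamma, lam are assumptions based on the formulation above, not code from [11].

```python
import numpy as np

def llsvm_objective(W, b, X, y, Gamma, lam):
    """Evaluate the LL-SVM objective of Eq. (1) (sketch).

    W     : (N, d) weight matrix, one row per local coordinate
    b     : (N,) bias vector
    X     : (n, d) training vectors x_k
    y     : (n,) labels in {-1, +1}
    Gamma : (n, N) local codings gamma_{x_k}, one row per training vector
    lam   : pre-defined scalar lambda >= 0
    """
    # decision value gamma_{x_k}^T W x_k + gamma_{x_k}^T b for every k
    scores = np.einsum('kn,nd,kd->k', Gamma, W, X) + Gamma @ b
    # slack xi_k = max(0, 1 - y_k * score_k), the tightest feasible choice
    xi = np.maximum(0.0, 1.0 - y * scores)
    return lam * np.sum(W ** 2) + xi.mean()
```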
31 As demonstrated in our experiments, the choices of the local coding methods are very important for LL-SVM, and an improper choice will hurt its performance. [sent-72, score-0.193]
32 In Section 2 we first recall some definitions and lemmas of LCC, then introduce OCC for non-linear function approximation together with its upper bound on the localization error, and compare OCC with LCC in terms of geometric interpretation and optimization. [sent-74, score-0.105]
33 2 Anchor Plane Learning. In this section, we introduce our Orthogonal Coordinate Coding (OCC) based on some orthogonal basis vectors. [sent-77, score-0.206]
34 1 Definition A d-dimensional anchor point in LCC; a d-dimensional basis vector which defines a family of anchor planes in OCC. [sent-81, score-1.667]
35 A subset in d-dimensional space containing all the anchor points (∀v, v ∈ C) in LCC; a subset in d-dimensional space containing all the basis vectors in OCC. [sent-82, score-0.919]
36 The anchor point (or basis vector) matrix with v ∈ C as columns. [sent-83, score-0.828]
37 The local coding of a data point x ∈ Rd using the anchor point (or basis vector) v. [sent-84, score-1.037]
38 The physical approximation vector of a data point x. [sent-85, score-0.051]
39 The coding vector of data point x containing all γv (x) in order γ x = [γv (x)]v∈C . [sent-86, score-0.183]
40 A coordinate coding is a pair (γ, C), where C ⊂ Rd is a set of anchor points, and γ is a map of x ∈ Rd to [γv(x)]_{v∈C} ∈ R^{|C|} such that Σ_{v∈C} γv(x) = 1. [sent-94, score-0.979]
41 Moreover, for all x ∈ Rd, we define the corresponding coding norm as ||x||_γ = (Σ_{v∈C} γv(x)^2)^{1/2}. [sent-96, score-0.157]
42 Let (γ, C) be an arbitrary coordinate coding on Rd . [sent-98, score-0.246]
43 We have for all x ∈ Rd: |f(x) − Σ_{v∈C} γv(x) f(v)| ≤ α ||x − γ(x)|| + β Σ_{v∈C} |γv(x)| ||v − γ(x)||^{1+p} (2). As explained in [18], a good coding scheme for non-linear function approximation should make x close to its physical approximation γ(x) (i.e. [sent-100, score-0.193]
44 smaller data reconstruction error ||x − γ(x)||) and should be localized (i.e. [sent-102, score-0.05]
45 smaller localization error Σ_{v∈C} |γv(x)| ||v − γ(x)||^{1+p}). [sent-104, score-0.071]
46 Given α, β, p, and coding (γ, C), we define Q_{α,β,p}(γ, C) = E_x[ α ||x − γ(x)|| + β Σ_{v∈C} |γv(x)| ||v − γ(x)||^{1+p} ] (3). The localization measure is equivalent to the expectation of the upper bound of the approximation error. [sent-107, score-0.169]
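The localization measure in Eq. (3) can be estimated on a finite sample as in the sketch below; the physical approximation γ(x) is taken to be Σ_v γv(x) v as in LCC [18], and the function and argument names are assumptions.

```python
import numpy as np

def localization_measure(X, C, coding_fn, alpha, beta, p):
    """Monte-Carlo estimate of Q_{alpha,beta,p}(gamma, C) in Eq. (3).

    X         : (n, d) sample used to approximate the expectation E_x
    C         : (M, d) anchor points (or basis vectors) v, one per row
    coding_fn : callable mapping a point x to its (M,) coding [gamma_v(x)]
    """
    total = 0.0
    for x in X:
        gamma = coding_fn(x)                       # (M,) coding of x
        approx = gamma @ C                         # physical approximation gamma(x)
        recon = np.linalg.norm(x - approx)         # data reconstruction error
        local = np.sum(np.abs(gamma) *
                       np.linalg.norm(C - approx, axis=1) ** (1 + p))
        total += alpha * recon + beta * local
    return total / len(X)
```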
47 2.2 Orthogonal Coordinate Coding. In the following sections, we will follow the notations in Table 1, and define our orthogonal coordinate coding (OCC) as below. [sent-109, score-0.386]
48 An orthogonal coordinate coding is a pair (γ, C), where C ⊂ Rd contains |C| orthogonal basis vectors, that is, ∀u, v ∈ C, if u ≠ v, then u^T v = 0, and coding γ is a map of x ∈ Rd to [γv(x)]_{v∈C} ∈ R^{|C|} such that γv(x) ∝ (x^T v)/||v||^2 and Σ_{v∈C} |γv(x)| = 1. [sent-111, score-0.733]
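A minimal sketch of this coding, assuming the proportionality in Definition 4 is resolved by normalizing the raw projections x^T v / ||v||^2 onto the ℓ1 unit ball; it also assumes x is not orthogonal to every basis vector.

```python
import numpy as np

def occ_code(x, C):
    """Orthogonal coordinate coding of a point x (sketch of Definition 4).

    x : (d,) data point
    C : (M, d) mutually orthogonal basis vectors, one per row
    returns an (M,) coding whose absolute values sum to 1
    """
    raw = (C @ x) / np.sum(C ** 2, axis=1)   # gamma_v(x) proportional to x^T v / ||v||^2
    return raw / np.sum(np.abs(raw))         # project onto the L1 unit ball
```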
49 Figure 1: Comparison of the geometric views on (a) LCC and (b) OCC, where the white and red dots denote the data and anchor points, respectively. [sent-112, score-0.746]
50 In LCC, the anchor points are distributed among the data space and several nearest neighbors around the data are selected for data reconstruction, while in OCC the anchor points are located on the anchor plane defined by the normal vector (i.e. [sent-113, score-2.44]
51 coordinate, basis vector) v and only the closest point to each data point on the anchor plane is selected for coding. [sent-115, score-0.896]
52 Compared to Definition 2, there are two changes in OCC: (1) instead of anchor points we use a set of orthogonal basis vectors, which defines a set of anchor planes, and (2) the coding for each data point is defined on the ℓ1-norm unit ball, which removes the scaling factors in both x and v. [sent-117, score-1.912]
53 Notice that, given the data matrix, the maximum number of orthogonal basis vectors which can be used to represent all the data precisely is equal to the rank of the data matrix; hence the maximum value of |C| is equal to the rank of the data matrix as well. [sent-118, score-0.305]
54 Intuitively, in both methods anchor points try to encode data locally. [sent-120, score-0.817]
55 In LCC anchor points are distributed over the whole data space such that each data point can be covered by certain anchor points in a local region, and their distribution cannot be described using regular shapes. [sent-122, score-1.632]
56 On the contrary, the anchor points in OCC are located on the anchor plane defined by a basis vector. [sent-123, score-1.661]
57 In fact, each anchor plane can be considered as an infinite number of anchor points, and for each data point only its closest point on each anchor plane is utilized for reconstruction. [sent-124, score-2.322]
58 Therefore, intuitively the number of anchor planes needed in OCC should be much smaller than the number of anchor points in LCC. [sent-125, score-1.629]
59 Let (γ, C) be an orthogonal coordinate coding on Rd where C ⊂ Rd with size |C| = M . [sent-127, score-0.37]
60 Without loss of generality, assuming ∀x ∈ R^d, ||x|| ≤ 1 and ∀v ∈ C, 1 ≤ ||v|| ≤ h (h ≥ 1), the localization error in Lemma 1 is bounded by: Σ_{v∈C} |γv(x)| ||v − γ(x)||^{1+p} ≤ (1 + M) h^{1+p}. [sent-129, score-0.071]
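For intuition, here is a small numerical sanity check of this bound under the stated assumptions (||x|| ≤ 1, 1 ≤ ||v|| ≤ h); it additionally assumes, as in LCC, that the physical approximation is γ(x) = Σ_v γv(x) v, and all constants, shapes, and names are illustrative rather than taken from the paper.

```python
import numpy as np

def check_localization_bound(d=20, M=5, p=1.0, h=2.0, trials=1000, seed=0):
    """Compare the empirical localization error with the (1 + M) h^(1+p) bound (sketch)."""
    rng = np.random.default_rng(seed)
    # M random orthogonal directions, scaled to norms in [1, h]
    Q, _ = np.linalg.qr(rng.standard_normal((d, M)))
    C = (Q * rng.uniform(1.0, h, size=M)).T          # (M, d) basis vectors as rows
    worst = 0.0
    for _ in range(trials):
        x = rng.standard_normal(d)
        x /= max(np.linalg.norm(x), 1.0)             # enforce ||x|| <= 1
        raw = (C @ x) / np.sum(C ** 2, axis=1)       # OCC projections
        gamma = raw / np.sum(np.abs(raw))            # L1-normalized coding
        approx = gamma @ C                           # physical approximation gamma(x)
        err = np.sum(np.abs(gamma) *
                     np.linalg.norm(C - approx, axis=1) ** (1 + p))
        worst = max(worst, err)
    return worst, (1 + M) * h ** (1 + p)             # empirical worst case vs. bound
```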
61 2.3 Learning Orthogonal Basis Vectors. Instead of optimizing Definition 3, LCC simplifies the localization error term by assuming γ(x) = x and p = 1. [sent-133, score-0.071]
62 Since Theorem 1 proves that the localization error per data point is bounded by a constant given an OCC, in practice we only need to minimize the data reconstruction error in order to minimize the upper bound of the localization measure. [sent-140, score-0.205]
63 This optimization problem is quite similar to sparse coding [12], except that there is an orthogonality constraint on the basis vectors. [sent-144, score-0.363]
64 We only need to use a few top eigenvectors as our orthogonal basis vectors for coding, and the search space is far smaller than that of generating anchor points. [sent-151, score-0.986]
65 Since we have the orthogonal basis vectors in C, we can easily derive the formulation for calculating γ̃_x, the value of γ_x before normalization, that is, γ̃_x = (C^T C)^{-1} C^T x. [sent-153, score-0.268]
66 Letting {v̄} and {σ_v} be the corresponding singular vectors and singular values, based on the orthogonality of the basis vectors we have γ̃_v(x) = (v̄^T x)/σ_v, which is a variant of the coding definition in Definition 4. [sent-154, score-0.379]
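Putting the last two statements together, a hedged sketch of the SVD step and of the unnormalized coding; taking the basis vectors to be the top-M right singular vectors of the training matrix scaled by their singular values (so that v = σ_v v̄) is an assumption chosen to be consistent with γ̃_v(x) = v̄^T x / σ_v above, and the helper names are ours.

```python
import numpy as np

def learn_occ_basis(X, M):
    """Learn M orthogonal basis vectors from the training matrix via SVD (sketch).

    X : (n, d) training data, one point per row
    Returns C with the basis vectors as columns, here v = sigma_v * vbar for the
    top-M unit right singular vectors vbar and singular values sigma_v.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Vbar = Vt[:M].T            # (d, M) top-M unit singular vectors
    C = Vbar * s[:M]           # scale each column by its singular value
    return C, Vbar, s[:M]

def occ_code_unnormalized(x, C):
    """gamma_tilde_x = (C^T C)^{-1} C^T x, the coding before L1 normalization."""
    return np.linalg.solve(C.T @ C, C.T @ x)

# Because the columns of C are orthogonal, C^T C is diagonal and the unnormalized
# coding reduces componentwise to vbar^T x / sigma_v, matching the text:
#   occ_code_unnormalized(x, C)  ==  (Vbar.T @ x) / s[:M]   (up to round-off)
```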
67 3 Modeling Classification Decision Boundary in SVM. Given a set of data {(x_i, y_i)} where y_i ∈ {−1, 1} is the label of x_i, the decision boundary for binary classification of a linear SVM is f(x) = w^T x + b, where w is the normal vector of the decision hyperplane (i.e. the weight vector) and b is the bias term. [sent-156, score-0.133]
68 Here, we assume that the decision boundary is an (α, β, p)-Lipschitz smooth function. [sent-159, score-0.091]
69 Since in LCC each data point is encoded by some anchor points on the data manifold, it can model the decision boundary of an SVM directly using f(x) ≈ Σ_{v∈C} γv(x) f(v). [sent-160, score-0.873]
70 Then by taking γ x as the input data of a linear SVM, f (v)’s can be learned to approximate the decision boundary f . [sent-161, score-0.095]
71 However, OCC learns a set of orthogonal basis vectors, rather than anchor points, and the corresponding coding for the data. [sent-162, score-1.096]
72 This makes OCC suitable to model the normal vectors of decision hyperplanes in SVMs locally with LL-SVM. [sent-163, score-0.12]
73 Given data x and an orthogonal coordinate coding (γ, C), the decision boundary in LL-SVM can be formulated as follows¹. [sent-164, score-0.44]
74 In the view of kernel SVMs, we actually define another kernel K based on x and γ x as shown below. [sent-166, score-0.052]
75 Notice that our kernel involves the latent semantic kernel [6], which is defined based on a set of orthogonal basis vectors. [sent-168, score-0.232]
76 USPS contains 7291 training and 2007 test gray-scale images with resolution 16 x 16, directly stored as 256 dimensional vectors, and the label of each image still corresponds to one of the 10 digits from 0 to 9. [sent-173, score-0.062]
77 LETTER contains 16000 training and 4000 testing images, each of which is represented as a relatively short 16 dimensional vector, and the label of each image corresponds to one of the 26 letters from A to Z. [sent-174, score-0.05]
78 We tried to learn our basis vectors in two ways: (1) SVD is applied directly to the entire training data matrix, or (2) SVD is applied separately to the data matrix consisting of all the positive training data. [sent-178, score-0.193]
79 Then the coding for each data point is calculated as explained in Section 2. [sent-184, score-0.17]
80 Next, all the training raw features and their coding vectors are taken as the input to train the model (W, b) of LL-SVM. [sent-186, score-0.237]
81 For each test point x, we calculate its coding in the same way and classify it based on its decision values, that is, y(x) = arg max_y (γ_x^T W_y x + b_y). [sent-187, score-0.195]
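A short sketch of this decision rule; the per-class storage (one (N, d) matrix W_y and one bias b_y per class) and the scalar form of the bias are assumptions about how the learned LL-SVM model is kept, not the authors' implementation.

```python
import numpy as np

def classify(x, gamma_x, W_list, b_list):
    """Predict y(x) = argmax_y (gamma_x^T W_y x + b_y) (sketch).

    x       : (d,) raw feature vector of a test point
    gamma_x : (N,) its coding vector
    W_list  : per-class (N, d) weight matrices W_y
    b_list  : per-class bias terms b_y (scalars here; an assumption)
    """
    scores = [gamma_x @ (W @ x) + b for W, b in zip(W_list, b_list)]
    return int(np.argmax(scores))
```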
82 Figure 2 shows the comparison of classification error rates among G-OCC + LIB-LLSVM, G-OCC + PEG-LLSVM, C-OCC + LIB-LLSVM, and C-OCC + PEG-LLSVM on MNIST (left), USPS (middle), and LETTER (right), respectively, using different numbers of orthogonal basis vectors. [sent-188, score-0.234]
83 Eq. (9) is slightly different from the original formulation in [11], as it ignores the different bias term for each orthogonal basis vector. [sent-190, score-0.206]
84 Figure 2: Performance comparison among the 4 different combinations of OCC + LL-SVM on MNIST (left), USPS (middle), and LETTER (right) using different numbers of orthogonal basis vectors. [sent-194, score-0.221]
85 behaves similarly with the increase of the number of orthogonal basis vectors. [sent-196, score-0.206]
86 The parameters of the RBF kernel used in the kernel SVMs are the same as in [2]. [sent-199, score-0.052]
87 From Table 2, we can see that applying a linear SVM directly to the OCC codes works slightly better than applying it to the raw features, and when OCC is combined with LL-SVM, the performance is boosted significantly while the number of anchor points needed in LL-SVM is reduced. [sent-202, score-0.832]
88 On MNIST we can see that our non-linear function approximation is better than LCC, improved LCC, LLC, and LL-SVM; on USPS ours is better than both LLC and LL-SVM; but on LETTER ours is worse than LLC (4096 anchor points) and LL-SVM (100 anchor points). [sent-203, score-1.477]
89 All of these results demonstrate that OCC is quite suitable for modeling the non-linear normal vectors using linear SVMs for classification on high dimensional data. [sent-208, score-0.09]
90 In summary, our encoding scheme uses a much smaller number of basis vectors than the anchor points in LCC while achieving better test accuracy, which translates to higher performance in terms of both generalization and computational efficiency. [sent-209, score-0.919]
91 5 Conclusion In this paper, we propose orthogonal coordinate coding (OCC) to encode high dimensional data based on a set of anchor planes defined by a set of orthogonal basis vectors. [sent-216, score-1.46]
92 Theoretically we prove that our OCC can guarantee a fixed upper bound of approximation error for any (α, β, p)-Lipschitz smooth function, and we can easily learn the orthogonal basis vectors using SVD to minimize the localization measure. [sent-217, score-0.381]
93 In the future, we would like to learn the orthogonal basis vectors using semi-definite programming to guarantee the orthogonality. [sent-219, score-0.253]
94 The numbers of anchor planes in the brackets are the ones which return the best result on each dataset. [sent-226, score-0.854]
95 Table 2 (fragment): classification error rates on MNIST, USPS, and LETTER, with rows including Linear SVM + G-OCC (# basis vectors), PEG-LLSVM + C-OCC (# basis vectors), Linear SVM (10 passes) [1], Linear SVM + improved LCC (512 and 4096 anchor points) [19], and Linear SVM + LLC (512 anchor points) [16].
99 Notice that for PEG-LLSVM, 10^6 random data points are used for training. [sent-279, score-0.07]
100 Bordes, A., Ertekin, S., Weston, J. & Bottou, L. (2005) Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6. [sent-342, score-0.04]
wordName wordTfidf (topN-words)
[('anchor', 0.733), ('occ', 0.429), ('lcc', 0.335), ('coding', 0.157), ('orthogonal', 0.124), ('usps', 0.111), ('planes', 0.106), ('coordinate', 0.089), ('basis', 0.082), ('mnist', 0.079), ('llc', 0.077), ('svm', 0.068), ('letter', 0.068), ('rd', 0.067), ('svms', 0.059), ('localization', 0.058), ('points', 0.057), ('vectors', 0.047), ('proceeding', 0.044), ('plane', 0.042), ('svd', 0.042), ('torr', 0.039), ('locally', 0.036), ('smooth', 0.034), ('boundary', 0.032), ('classi', 0.03), ('passes', 0.029), ('holes', 0.029), ('bordes', 0.029), ('sx', 0.027), ('kernel', 0.026), ('local', 0.026), ('decision', 0.025), ('gallinari', 0.024), ('kecman', 0.024), ('mcsvm', 0.024), ('mstruct', 0.024), ('pegasos', 0.024), ('liblinear', 0.024), ('singular', 0.023), ('libsvm', 0.022), ('oxford', 0.021), ('yu', 0.02), ('xt', 0.02), ('nearest', 0.02), ('pass', 0.019), ('training', 0.019), ('ladicky', 0.019), ('manifold', 0.019), ('dimensional', 0.018), ('brooks', 0.018), ('coordinates', 0.018), ('jmlr', 0.017), ('zhang', 0.016), ('notations', 0.016), ('sv', 0.016), ('amir', 0.016), ('codebook', 0.016), ('notice', 0.016), ('nition', 0.015), ('gong', 0.015), ('calculating', 0.015), ('table', 0.015), ('numbers', 0.015), ('bottou', 0.014), ('raw', 0.014), ('online', 0.014), ('dictionary', 0.014), ('rbf', 0.014), ('located', 0.014), ('encode', 0.014), ('physical', 0.014), ('linear', 0.013), ('data', 0.013), ('weston', 0.013), ('code', 0.013), ('theoretically', 0.013), ('locality', 0.013), ('error', 0.013), ('point', 0.013), ('label', 0.013), ('normal', 0.012), ('digits', 0.012), ('icml', 0.012), ('yk', 0.012), ('mairal', 0.012), ('reconstruction', 0.012), ('learned', 0.012), ('upper', 0.012), ('localized', 0.012), ('approximation', 0.011), ('library', 0.011), ('tk', 0.011), ('lemmas', 0.011), ('mathematically', 0.011), ('chang', 0.011), ('machines', 0.011), ('multiclass', 0.011), ('support', 0.011), ('improper', 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 143 nips-2011-Learning Anchor Planes for Classification
Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari
Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1
2 0.060408153 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
Author: Liefeng Bo, Xiaofeng Ren, Dieter Fox
Abstract: Extracting good representations from images is essential for many computer vision tasks. In this paper, we propose hierarchical matching pursuit (HMP), which builds a feature hierarchy layer-by-layer using an efficient matching pursuit encoder. It includes three modules: batch (tree) orthogonal matching pursuit, spatial pyramid max pooling, and contrast normalization. We investigate the architecture of HMP, and show that all three components are critical for good performance. To speed up the orthogonal matching pursuit, we propose a batch tree orthogonal matching pursuit that is particularly suitable to encode a large number of observations that share the same large dictionary. HMP is scalable and can efficiently handle full-size images. In addition, HMP enables linear support vector machines (SVM) to match the performance of nonlinear SVM while being scalable to large datasets. We compare HMP with many state-of-the-art algorithms including convolutional deep belief networks, SIFT based single layer sparse coding, and kernel based feature learning. HMP consistently yields superior accuracy on three types of image classification problems: object recognition (Caltech-101), scene recognition (MIT-Scene), and static event recognition (UIUC-Sports). 1
3 0.059344713 261 nips-2011-Sparse Filtering
Author: Jiquan Ngiam, Zhenghao Chen, Sonia A. Bhaskar, Pang W. Koh, Andrew Y. Ng
Abstract: Unsupervised feature learning has been shown to be effective at learning representations that perform well on image, video and audio classification. However, many existing feature learning algorithms are hard to use and require extensive hyperparameter tuning. In this work, we present sparse filtering, a simple new algorithm which is efficient and only has one hyperparameter, the number of features to learn. In contrast to most other feature learning methods, sparse filtering does not explicitly attempt to construct a model of the data distribution. Instead, it optimizes a simple cost function – the sparsity of 2 -normalized features – which can easily be implemented in a few lines of MATLAB code. Sparse filtering scales gracefully to handle high-dimensional inputs, and can also be used to learn meaningful features in additional layers with greedy layer-wise stacking. We evaluate sparse filtering on natural images, object classification (STL-10), and phone classification (TIMIT), and show that our method works well on a range of different modalities. 1
4 0.057665933 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
Author: Nobuyuki Morioka, Shin'ichi Satoh
Abstract: Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision. For visual object category recognition, 1 regularized sparse coding is combined with the spatial pyramid representation to obtain state-of-the-art performance. However, because of its iterative optimization, applying sparse coding onto every local feature descriptor extracted from an image database can become a major bottleneck. To overcome this computational challenge, this paper presents “Generalized Lasso based Approximation of Sparse coding” (GLAS). By representing the distribution of sparse coefficients with slice transform, we fit a piece-wise linear mapping function with the generalized lasso. We also propose an efficient post-refinement procedure to perform mutual inhibition between bases which is essential for an overcomplete setting. The experiments show that GLAS obtains a comparable performance to 1 regularized sparse coding, yet achieves a significant speed up demonstrating its effectiveness for large-scale visual recognition problems. 1
5 0.057545312 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
Author: Inderjit S. Dhillon, Pradeep K. Ravikumar, Ambuj Tewari
Abstract: Increasingly, optimization problems in machine learning, especially those arising from bigh-dimensional statistical estimation, bave a large number of variables. Modem statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structore to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a snite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives; as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structore, we are not only able to significantly speed up the vanilla greedy method, hot also outperform cyclic descent when the problem size becomes large. Our resnlts indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.
6 0.050752301 287 nips-2011-The Manifold Tangent Classifier
7 0.050092313 276 nips-2011-Structured sparse coding via lateral inhibition
8 0.048834421 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
9 0.042356014 271 nips-2011-Statistical Tests for Optimization Efficiency
10 0.041384518 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
11 0.040913489 4 nips-2011-A Convergence Analysis of Log-Linear Training
12 0.04033361 207 nips-2011-Optimal learning rates for least squares SVMs using Gaussian kernels
13 0.039988261 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
14 0.038571078 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
15 0.038503379 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
16 0.037112258 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
17 0.033548757 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
18 0.031111326 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
19 0.030704947 178 nips-2011-Multiclass Boosting: Theory and Algorithms
20 0.030484475 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
topicId topicWeight
[(0, 0.096), (1, 0.027), (2, -0.034), (3, -0.027), (4, -0.006), (5, 0.058), (6, 0.046), (7, 0.037), (8, -0.039), (9, -0.021), (10, -0.014), (11, -0.011), (12, 0.012), (13, 0.015), (14, -0.002), (15, 0.029), (16, -0.028), (17, 0.064), (18, -0.029), (19, 0.017), (20, 0.024), (21, 0.023), (22, 0.025), (23, -0.005), (24, -0.033), (25, 0.01), (26, -0.014), (27, -0.027), (28, -0.063), (29, 0.054), (30, -0.015), (31, -0.033), (32, 0.019), (33, -0.013), (34, -0.086), (35, -0.081), (36, 0.004), (37, -0.0), (38, -0.073), (39, 0.055), (40, 0.039), (41, 0.007), (42, 0.044), (43, 0.037), (44, 0.103), (45, 0.034), (46, -0.002), (47, 0.051), (48, 0.022), (49, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.88861269 143 nips-2011-Learning Anchor Planes for Classification
Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari
Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1
2 0.59416318 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
Author: Inderjit S. Dhillon, Pradeep K. Ravikumar, Ambuj Tewari
Abstract: Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives; as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.
3 0.56885421 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
Author: Nobuyuki Morioka, Shin'ichi Satoh
Abstract: Sparse coding, a method of explaining sensory data with as few dictionary bases as possible, has attracted much attention in computer vision. For visual object category recognition, 1 regularized sparse coding is combined with the spatial pyramid representation to obtain state-of-the-art performance. However, because of its iterative optimization, applying sparse coding onto every local feature descriptor extracted from an image database can become a major bottleneck. To overcome this computational challenge, this paper presents “Generalized Lasso based Approximation of Sparse coding” (GLAS). By representing the distribution of sparse coefficients with slice transform, we fit a piece-wise linear mapping function with the generalized lasso. We also propose an efficient post-refinement procedure to perform mutual inhibition between bases which is essential for an overcomplete setting. The experiments show that GLAS obtains a comparable performance to 1 regularized sparse coding, yet achieves a significant speed up demonstrating its effectiveness for large-scale visual recognition problems. 1
4 0.55796522 276 nips-2011-Structured sparse coding via lateral inhibition
Author: Arthur D. Szlam, Karol Gregor, Yann L. Cun
Abstract: This work describes a conceptually simple method for structured sparse coding and dictionary design. Supposing a dictionary with K atoms, we introduce a structure as a set of penalties or interactions between every pair of atoms. We describe modifications of standard sparse coding algorithms for inference in this setting, and describe experiments showing that these algorithms are efficient. We show that interesting dictionaries can be learned for interactions that encode tree structures or locally connected structures. Finally, we show that our framework allows us to learn the values of the interactions from the data, rather than having them pre-specified. 1
5 0.55251414 111 nips-2011-Hashing Algorithms for Large-Scale Learning
Author: Ping Li, Anshumali Shrivastava, Joshua L. Moore, Arnd C. König
Abstract: Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
6 0.51371157 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
7 0.50083697 287 nips-2011-The Manifold Tangent Classifier
8 0.49028596 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
9 0.4879393 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
10 0.47820589 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
11 0.46393332 261 nips-2011-Sparse Filtering
12 0.4544948 64 nips-2011-Convergent Bounds on the Euclidean Distance
13 0.44820949 271 nips-2011-Statistical Tests for Optimization Efficiency
14 0.43738353 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
15 0.43007252 262 nips-2011-Sparse Inverse Covariance Matrix Estimation Using Quadratic Approximation
16 0.40290254 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
17 0.39917514 4 nips-2011-A Convergence Analysis of Log-Linear Training
18 0.39723313 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
19 0.39201146 62 nips-2011-Continuous-Time Regression Models for Longitudinal Networks
20 0.39144769 147 nips-2011-Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors
topicId topicWeight
[(0, 0.03), (4, 0.024), (20, 0.036), (26, 0.014), (31, 0.041), (33, 0.024), (43, 0.048), (45, 0.543), (57, 0.029), (65, 0.012), (74, 0.032), (83, 0.024), (99, 0.031)]
simIndex simValue paperId paperTitle
1 0.9964987 142 nips-2011-Large-Scale Sparse Principal Component Analysis with Application to Text Data
Author: Youwei Zhang, Laurent E. Ghaoui
Abstract: Sparse PCA provides a linear combination of small number of features that maximizes variance across data. Although Sparse PCA has apparent advantages compared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide experimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models. 1
2 0.99452537 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
Author: Mark Schmidt, Nicolas L. Roux, Francis R. Bach
Abstract: We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems. 1
3 0.99126595 28 nips-2011-Agnostic Selective Classification
Author: Yair Wiener, Ran El-Yaniv
Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this rejected region of the domain is shown to be diminishing at rate O(Bθ(√(1/m))^β), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1
4 0.99100035 272 nips-2011-Stochastic convex optimization with bandit feedback
Author: Alekh Agarwal, Dean P. Foster, Daniel J. Hsu, Sham M. Kakade, Alexander Rakhlin
Abstract: This paper addresses the problem of minimizing a convex, Lipschitz function f over a convex, compact set X under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value f (x) at any query point x ∈ X . We demonstrate a generalization of the ellipsoid algorithm that incurs O(poly(d)√T) regret. Since any algorithm has regret at least Ω(√T) on this problem, our algorithm is optimal in terms of the scaling with T. 1
same-paper 5 0.98879904 143 nips-2011-Learning Anchor Planes for Classification
Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari
Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1
6 0.98728603 293 nips-2011-Understanding the Intrinsic Memorability of Images
7 0.98343605 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
8 0.9479695 190 nips-2011-Nonlinear Inverse Reinforcement Learning with Gaussian Processes
9 0.91776609 19 nips-2011-Active Classification based on Value of Classifier
10 0.90899539 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
11 0.90428203 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
12 0.90354961 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
13 0.89644378 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
14 0.89260948 220 nips-2011-Prediction strategies without loss
15 0.89007252 78 nips-2011-Efficient Methods for Overlapping Group Lasso
16 0.889404 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
17 0.8868413 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent
18 0.87934351 161 nips-2011-Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation
19 0.87922764 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
20 0.87871486 169 nips-2011-Maximum Margin Multi-Label Structured Prediction