nips nips2010 nips2010-59 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuanqing Lin, Tong Zhang, Shenghuo Zhu, Kai Yu
Abstract: This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. Empirically, it is shown that the deep coding approach yields improved performance in benchmark datasets.
Reference: text
sentIndex sentText sentNum sentScore
1 The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. [sent-2, score-0.133]
2 Empirically, it is shown that the deep coding approach yields improved performance in benchmark datasets. [sent-3, score-0.804]
3 1 Introduction Sparse coding has attracted significant attention in recent years because it has been shown to be effective for some classification problems [12, 10, 9, 13, 11, 14, 2, 5]. [sent-4, score-0.729]
4 In particular, it has been empirically observed that high-dimensional sparse coding plus linear classifier is successful for image classification tasks such as PASCAL 2009 [7, 15]. [sent-5, score-0.852]
5 Specifically, LCC learns a nonlinear function in high dimension by forming an adaptive set of basis functions on the data manifold, and it has nonlinear approximation power. [sent-7, score-0.372]
6 A recent extension of LCC with added local tangent directions [16] demonstrated the possibility to achieve locally quadratic approximation power when the underlying data manifold is relatively flat. [sent-8, score-0.32]
7 This also indicates that the nonlinear function approximation view of sparse coding not only yields deeper theoretical understanding of its success, but also leads to improved algorithms based on refined analysis. [sent-9, score-1.08]
8 This paper follows the same idea, where we propose a principled extension of single-layer sparse coding based on a theoretical analysis of a two-level coding scheme. [sent-10, score-1.591]
9 Such an extension draws a connection to deep belief networks (DBNs) [8], and hence we call this approach a deep coding network. [sent-12, score-0.937]
10 Hierarchical sparse coding has two main advantages over its single-layer counterpart. [sent-13, score-0.811]
11 Second, it is computationally more efficient than flat coding because we only need to look at locations in the second (or higher) layer corresponding to basis functions with nonzero coefficients in the first (or previous) layer. [sent-16, score-1.09]
12 Since sparse coding produces many zero coefficients, the hierarchical structure eliminates much of the coding computation. [sent-17, score-1.554]
13 Moreover, instead of fitting a single model with many variables as in a flat single-layer approach, our proposal of multi-layer coding requires fitting many small models separately, each with a small number of parameters. [sent-18, score-1.027]
14 2 Sparse Coding and Nonlinear Function Approximation This section reviews the nonlinear function approximation results of the single-layer coding scheme in [17], and then presents our multi-layer extension. [sent-22, score-0.958]
15 Since the result of [17] requires a modification of the traditional sparse coding scheme called local coordinate coding (LCC), our analysis will rely on a similar modification. [sent-23, score-1.838]
16 Specifically, it was theoretically shown in [17] that a specific coding scheme called Local Coordinate Coding can take advantage of the underlying data manifold geometric structure in order to learn a nonlinear function in high dimension and alleviate the curse of dimensionality problem. [sent-27, score-1.04]
17 The main idea of LCC, described in [17], is to locally embed points on the underlying data manifold into a lower dimensional space, expressed as coordinates with respect to a set of anchor points. [sent-28, score-0.168]
18 The main theoretical observation was relatively simple: it was shown in [17] that on the data manifold, a nonlinear function can be effectively approximated by a globally linear function with respect to the local coordinate coding. [sent-29, score-0.377]
19 For local constant approximation, the error term $\alpha \|x - x'\|$ is first order in $x - x'$; for local linear approximation, the error term $\beta \|x - x'\|^2$ is second order in $x - x'$; for local quadratic approximation, the error term $\nu \|x - x'\|^3$ is third order in $x - x'$. [sent-42, score-0.345]
20 Similar to the single-layer coordinate coding in [17], here we define a two-layer coordinate coding as follows. [sent-44, score-1.678]
21 Definition 2 (Coordinate Coding) A single-layer coordinate coding is a pair $(\gamma^1, C^1)$, where $C^1 \subset \mathbb{R}^d$ is a set of anchor points (aka basis functions), and $\gamma^1$ is a map of $x \in \mathbb{R}^d$ to $[\gamma^1_v(x)]_{v \in C^1} \in \mathbb{R}^{|C^1|}$ such that $\sum_{v \in C^1} \gamma^1_v(x) = 1$. [sent-46, score-0.979]
22 We write $h_{\gamma^1,C^1}(x) = \sum_{v \in C^1} \gamma^1_v(x)\, v$ for the induced approximation of $x$. A two-layer coordinate coding $(\gamma, C)$ consists of coordinate coding systems $\{(\gamma^1, C^1)\} \cup \{(\gamma^{2,v}, C^{2,v}) : v \in C^1\}$. [sent-48, score-1.678]
23 The pair $(\gamma^1, C^1)$ is the first-layer coordinate coding, and the $(\gamma^{2,v}, C^{2,v})$ are second-layer coordinate-coding pairs that refine the first-layer coding for every first-layer anchor point $v \in C^1$. [sent-49, score-1.799]
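To make the two-layer structure concrete, here is a minimal sketch (an illustration only, not the paper's implementation: the k-nearest-anchor coder, the shapes, and all names are assumptions) of how a two-layer code could be assembled; note that a second-layer codebook is only visited when its first-layer anchor receives a nonzero coefficient, which is also where the computational saving mentioned in the introduction comes from.

```python
import numpy as np

def knn_code(x, anchors, k=3):
    """Toy coordinate coding: put weight on the k nearest anchors, inversely
    proportional to distance and normalized to sum to one (so the sum-to-one
    requirement of the coordinate-coding definition holds)."""
    dist = np.linalg.norm(anchors - x, axis=1)
    idx = np.argsort(dist)[:k]
    w = 1.0 / (dist[idx] + 1e-8)
    code = np.zeros(len(anchors))
    code[idx] = w / w.sum()
    return code

def two_layer_code(x, C1, C2):
    """C1: (L1, d) first-layer anchors; C2[j]: (L2, d) codebook refining anchor j.
    Second-layer codebooks are consulted only for anchors with nonzero weight."""
    g1 = knn_code(x, C1)
    g2 = {j: knn_code(x, C2[j]) for j in np.flatnonzero(g1)}
    return g1, g2

# Tiny usage example with random data; the shapes are purely illustrative.
rng = np.random.default_rng(0)
C1 = rng.normal(size=(8, 5))                                       # 8 first-layer anchors in R^5
C2 = {j: C1[j] + 0.1 * rng.normal(size=(4, 5)) for j in range(8)}  # 4 refinements per anchor
g1, g2 = two_layer_code(rng.normal(size=5), C1, C2)
```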
24 The performance of LCC is characterized in [17] using the following nonlinear function approximation result. [sent-50, score-0.153]
25 Lemma 1 (Single-layer LCC Nonlinear Function Approximation) Let $(\gamma^1, C^1)$ be an arbitrary single-layer coordinate coding scheme on $\mathbb{R}^d$. [sent-52, score-0.942]
26 We have for all $x \in \mathbb{R}^d$: $\big| f(x) - \sum_{v \in C^1} w_v \gamma^1_v(x) \big| \le \alpha \big\| x - h_{\gamma^1,C^1}(x) \big\| + \beta \sum_{v \in C^1} |\gamma^1_v(x)| \, \|v - x\|^2$, (1) where $w_v = f(v)$ for $v \in C^1$. [sent-54, score-0.286]
27 This result shows that a high dimensional nonlinear function can be globally approximated by a linear function with respect to the single-layer coding $[\gamma^1_v(x)]$, with unknown linear coefficients $[w_v]_{v \in C^1} = [f(v)]_{v \in C^1}$, where the approximation error on the right hand side is second order. [sent-55, score-0.97]
28 This bound directly suggests the following learning method: for each $x$, we use its coding $[\gamma^1_v(x)] \in \mathbb{R}^{|C^1|}$ as features. [sent-56, score-0.702]
29 We then learn a linear function of the form $\sum_{v} w_v \gamma^1_v(x)$ using a standard linear learning method such as SVM, where $[w_v]$ is the unknown coefficient vector to be learned. [sent-57, score-0.164]
30 The optimal coding can be learned using unlabeled data by optimizing the right hand side of (1) over unlabeled data. [sent-58, score-0.854]
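As a concrete illustration of this recipe — code each point, then fit a linear model on the codes — the sketch below uses a toy nearest-anchor coder and scikit-learn's LinearSVC on synthetic data; the coder, the data, and the anchor selection are placeholders rather than the paper's learned coding.

```python
import numpy as np
from sklearn.svm import LinearSVC

def knn_codes(X, anchors, k=3):
    """Toy stand-in for the learned map gamma^1: weights on the k nearest anchors,
    normalized to sum to one. X: (n, d), anchors: (L1, d) -> codes: (n, L1)."""
    codes = np.zeros((X.shape[0], anchors.shape[0]))
    for i, x in enumerate(X):
        dist = np.linalg.norm(anchors - x, axis=1)
        idx = np.argsort(dist)[:k]
        w = 1.0 / (dist[idx] + 1e-8)
        codes[i, idx] = w / w.sum()
    return codes

# Synthetic two-class data and "anchors" sampled from it, purely for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(2, 1, (100, 10))])
y = np.repeat([0, 1], 100)
anchors = X[rng.choice(len(X), 32, replace=False)]    # pretend these were learned

features = knn_codes(X, anchors)                      # [gamma^1_v(x)] used as features
clf = LinearSVC(C=1.0).fit(features, y)               # learns the coefficients w_v
print("training accuracy:", clf.score(features, y))
```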
31 Lemma 2 (Two-layer LCC Nonlinear Function Approximation) Let $(\gamma, C) = \{(\gamma^1, C^1)\} \cup \{(\gamma^{2,v}, C^{2,v}) : v \in C^1\}$ be an arbitrary two-layer coordinate coding on $\mathbb{R}^d$. [sent-62, score-0.839]
32 We have for all $x \in \mathbb{R}^d$: $\big| f(x) - \sum_{v \in C^1} w_v \gamma^1_v(x) - \sum_{v \in C^1} \gamma^1_v(x) \sum_{u \in C^{2,v}} w_{v,u} \gamma^{2,v}_u(x) \big| \le 0.5\,\alpha \sum_{v \in C^1} |\gamma^1_v(x)| \, \big\| x - h_{\gamma^{2,v},C^{2,v}}(x) \big\| + \nu \sum_{v \in C^1} |\gamma^1_v(x)| \, \|x - v\|^3$, (2) where $w_v = f(v)$ for $v \in C^1$ and $w_{v,u} = 0.\ldots$ (the definition of $w_{v,u}$ is truncated in the extracted text). [sent-64, score-0.143] [sent-66, score-0.143]
34 The coding can be learned from unlabeled data by minimizing the right hand side of (2) or (3). [sent-71, score-0.821]
35 Compared with the single-layer coding, we note that the second term on the right hand side of (1) is replaced by the third term on the right hand side of (2). [sent-72, score-0.242]
36 That is, the linear approximation power of the single-layer coding scheme (with a quadratic error term) becomes quadratic approximation power of the two-layer coding scheme (with a cubic error term). [sent-73, score-1.898]
37 The first term on the right hand side of (1) is replaced by the first two terms on the right hand side of (2). [sent-74, score-0.165]
38 If the manifold is relatively flat, then the error terms $\|x - h_{\gamma^1,C^1}(x)\|$ and $\|x - h_{\gamma^{2,v},C^{2,v}}(x)\|$ will be relatively small in comparison to the second term on the right hand side of (1). [sent-75, score-0.253]
39 In such a case, the two-layer coding scheme can potentially improve on the single-layer system significantly. [sent-76, score-0.805]
40 This result is similar to that of [16], where the second layer uses local PCA instead of another layer of nonlinear coding. [sent-77, score-0.753]
41 The bound in (2) shows the potential of the two-layer coding scheme in achieving higher order approximation power than single layer coding. [sent-80, score-1.206]
42 In such a case, the bound in (3) shows that the performance of the two-level coding is still comparable to that of the one-level coding scheme in (1). [sent-83, score-1.507]
43 This is the situation where the first layer is mainly used to partition the space (its own approximation accuracy being less important), while the main approximation power is achieved with the second layer. [sent-84, score-0.432]
44 The main advantage of two-layer coding in this case is to save computation. [sent-85, score-0.702]
45 This is because instead of solving a single layer coding system with many parameters, we can solve many smaller coding systems, each with a small number of parameters. [sent-86, score-1.706]
46 This is the situation when including nonlinearity in the second layer becomes useful, which means that the deep-coding network approach in this paper has some advantage over [16], which can only approximate a linear function with local PCA in the second layer. [sent-87, score-0.375]
47 While the two bounds (2) and (3) consider different scenarios depending on the relative size of the first layer versus the second layer, in reality it is difficult to differentiate and usually both bounds play a role at the same time. [sent-90, score-0.302]
48 $\ldots, v_{j,L_2}\}$, and $\gamma^{2,v_j}_{v_{j,k}}(X_i) = \gamma^i_{j,k}$, where $L_1$ is the size of the first-layer codebook, and $L_2$ is the size of each individual codebook at the second layer. [sent-100, score-0.238]
49 We take a layer-by-layer approach for training, where the second layer is regarded as a refinement of the first layer, which is consistent with Lemma 2. [sent-101, score-0.323]
50 In the first layer, we learn a simple sparse coding model with all data: $[\gamma^1, C^1] = \arg\min_{\gamma, v} \frac{1}{2}\sum_{i=1}^{n}\big\|X_i - \sum_{j=1}^{L_1}\gamma^i_j v_j\big\|^2$ subject to $\gamma^i_j \ge 0$, $\sum_j \gamma^i_j = 1$, $\|v_j\|^2 \le \kappa$, (4) where $\kappa$ is some constant, e.g. [sent-103, score-0.888]
51 For convenience, we not only enforce the sum-to-one constraint on the sparse coefficients, but also impose nonnegativity constraints so that $\sum_j |\gamma^i_j| = \sum_j \gamma^i_j = 1$ for all $i$. [sent-106, score-0.198]
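For each data point, the feasible set in (4) is therefore exactly the probability simplex. A minimal sketch of the standard sort-based Euclidean projection onto that set is given below; it only illustrates the constraint and is not the solver used in the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {g : g >= 0, sum(g) = 1}, via the usual
    sort-and-threshold algorithm (O(L log L) for a length-L coefficient vector)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

g = project_simplex(np.array([0.8, 0.5, -0.3]))   # -> [0.65, 0.35, 0.0]
assert abs(g.sum() - 1.0) < 1e-12 and (g >= 0).all()
```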
52 where $s$ is a scaling factor that balances the coding from the two different layers. [sent-119, score-0.702]
53 2 Multi-layer Extension The two-level coding scheme can be easily extended to the third and higher layers. [sent-121, score-0.864]
54 For example, at the third layer, for each base $v_{j,k}$, the third-layer coding is to solve the following weighted optimization: $[\gamma_3^{j,k}, C_3^{j,k}] = \arg\min_{\gamma, v} \frac{1}{2}\sum_{i=1}^{n}\gamma^i_{j,k}\big\|X_i - \sum_{l=1}^{L_3}\gamma^i_{j,k,l} v_{j,k,l}\big\|^2 + \lambda_3 \sum_{i,l}\gamma^i_{j,k,l}$, subject to $\gamma^i_{j,k,l} \ge 0$, $\|v_{j,k,l}\|^2 \le 1$. [sent-122, score-0.722]
55 3 Optimization The optimization problems in Equations (4) to (6) can be generally solved by alternating the following two steps: 1) given the current codebook estimate v, compute the optimal sparse coefficients γ; 2) given the new estimates of the sparse coefficients, optimize the codebooks. [sent-124, score-0.254]
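A minimal sketch of this alternation for the first-layer problem (4) is given below. It is a simplification for illustration only: the coefficient step uses projected gradient with a simplex projection, and the codebook step refits by least squares and then rescales basis norms, whereas the paper relies on active-set and projected Newton solvers; all names and step choices here are assumptions.

```python
import numpy as np

def project_simplex(v):
    # Sort-based projection onto {g >= 0, sum(g) = 1}, as in the earlier sketch.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1.0), 0.0)

def code_step(X, V, n_iter=50):
    """Step 1: with the codebook V (L1, d) fixed, update the codes G (n, L1)
    by projected gradient on 0.5 * ||X - G V||^2."""
    G = np.full((X.shape[0], V.shape[0]), 1.0 / V.shape[0])
    lr = 1.0 / (np.linalg.norm(V, 2) ** 2 + 1e-12)     # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = (G @ V - X) @ V.T
        G = np.apply_along_axis(project_simplex, 1, G - lr * grad)
    return G

def codebook_step(X, G, kappa=1.0):
    """Step 2: with the codes fixed, refit the codebook by least squares, then
    rescale rows so that ||v_j||^2 <= kappa (a heuristic, not the exact constrained update)."""
    V, *_ = np.linalg.lstsq(G, X, rcond=None)
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    return V * np.minimum(1.0, np.sqrt(kappa) / (norms + 1e-12))

def learn_first_layer(X, L1=16, n_outer=10, seed=0):
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), L1, replace=False)]       # initialize anchors from the data
    for _ in range(n_outer):
        G = code_step(X, V)
        V = codebook_step(X, G)
    return G, V
```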
56 The optimization problem in (5) contains only nonnegative constraints (but not the sum-to-one constraint), for which we employ a pathwise projected Newton (PPN) method [3] that optimizes a block of coordinates per iteration instead of one coordinate at a time in the active set method. [sent-131, score-0.278]
57 As a result, in typical sparse coding settings (for example, in the experiments that we will present shortly in Section 4), the PPN method is able to give the exact solution of a medium-size (e.g. [sent-132, score-0.811]
58 A significant advantage of the second layer optimization in our proposal is parallelization. [sent-139, score-0.325]
59 As shown in (5), the second-layer sparse coding is decomposed into L1 independent coding problems, and thus can be naturally parallelized. [sent-140, score-1.513]
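The sketch below illustrates that decomposition with Python's multiprocessing standing in for the Hadoop setup used in the experiments; the per-anchor fit_one_codebook routine is a deliberately crude placeholder (it just seeds a small codebook from the points that activate the anchor), and all names and shapes are assumptions.

```python
import numpy as np
from multiprocessing import Pool

def fit_one_codebook(args):
    """One of the L1 independent subproblems in (5): learn a small codebook for the
    data weighted by anchor j's first-layer coefficients. As a placeholder, we just
    sample L2 of the points that activate anchor j instead of actually optimizing."""
    j, X, gamma_j, L2, seed = args
    rng = np.random.default_rng(seed + j)
    active = np.flatnonzero(gamma_j > 1e-8)            # only these points matter for anchor j
    if active.size == 0:
        return j, np.zeros((L2, X.shape[1]))
    picked = rng.choice(active, size=min(L2, active.size), replace=False)
    return j, X[picked]

def fit_second_layer(X, Gamma1, L2=4, n_workers=4, seed=0):
    """Gamma1: (n, L1) first-layer codes. The L1 subproblems share no variables, so they
    can be farmed out to a local worker pool (or, at larger scale, to a Hadoop cluster).
    Run under `if __name__ == "__main__":` when executing as a script."""
    jobs = [(j, X, Gamma1[:, j], L2, seed) for j in range(Gamma1.shape[1])]
    with Pool(n_workers) as pool:
        return dict(pool.map(fit_one_codebook, jobs))  # anchor index -> its codebook
```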
60 1 MNIST dataset We first demonstrate the effectiveness of the proposed deep coding scheme on the popular MNIST benchmark data [1]. [sent-143, score-0.907]
61 In our experiments with the deep coding network, the entire training set is used to learn the first-layer coding, with a codebook of size 64. [sent-145, score-1.063]
62 For each of the 64 bases in the first layer, a second-layer codebook was learned – the deep coding scheme presented in the paper ensures that the codebook learning can be done independently. [sent-146, score-1.562]
63 We implemented a Hadoop parallel program that solved the 64 codebook learning tasks in about an hour – which would have taken 64 hours on a single machine. [sent-147, score-0.261]
64 This shows that easy parallelization is a very attractive aspect of the proposed deep coding scheme, especially for large scale problems. [sent-148, score-0.804]
65 Table 1 shows the performance of the deep coding network on MNIST compared to some previous coding schemes. [sent-149, score-1.532]
66 First, adding an extra layer yields a significant improvement in classification; e.g. [sent-151, score-0.302]
67 for L1 = 512, the classification error rate for single layer LCC is 2. [sent-153, score-0.324]
68 98% [16] (the extended LCC method in [16] may also be regarded as a two-layer method, but its second layer is linear); the two-layer coding scheme here significantly improves the performance, with a classification error rate of 1. [sent-155, score-1.471]
69 Second, the two-layer coding is less prone to overfitting than its single-layer counterpart. [sent-157, score-0.702]
70 In fact, for the single-layer coding, our experiment shows that further increasing the codebook size will cause overfitting (e. [sent-158, score-0.238]
71 In contrast, the performance of two-layer coding still improves when the second-layer codebook is as large as 512 (and the total codebook size is 64 × 512 = 32768, which is very high-dimensional considering that the total number of training examples is only 60,000). [sent-162, score-1.178]
72 This property is especially desirable when a high-dimensional representation is preferred, as in the case of using sparse coding plus a linear classifier for classification. [sent-163, score-0.811]
73 Figure 1 shows some first-layer bases and their associated second-layer bases. [sent-164, score-0.179]
74 We can see that the second-layer bases provide deeper details that help to further explain their first-layer parent basis; on the other hand, the parent first-layer basis provides an informative context for its child second-layer bases. [sent-165, score-0.743]
75 For the example in Figure 1 where the first-layer basis is like Digit 7, this basis can come from Digit 7, Digit 9, or even Digit 4. [sent-167, score-0.172]
76 Then, its second-layer bases help to further explain the meaning of the first-layer basis: in its associated second-layer bases, the first two bases in that row are parts of Digit 9 while the last basis in that row is a part of Digit ’4’. [sent-168, score-0.488]
77 Meanwhile, the first-layer 7-like basis provides important context for its second-layer part-like bases – without the first-layer basis, the fragmented parts (like the first two second-layer bases in that row) may not be very informative. [sent-169, score-0.444]
78 The zoomed-in details contained in deeper bases significantly help a classifier to resolve difficult examples, and interestingly, coarser details provide useful context for finer details. [sent-170, score-0.271]
79 Single-layer sparse coding entries (local coordinate coding, number of bases L1 = 512 or 1024) and two-layer sparse coding entries (L1 = 64, number of second-layer bases L2 = 64 or 128) make up the rows and columns of Table 1; the numeric error rates were lost in extraction. [sent-171, score-2.131] [sent-175, score-0.99]
81 Table 1: The classification error rate (in %) on the MNIST dataset with different sparse coding schemes. [sent-183, score-0.833]
82 Figure 1: Example of bases from a two-layer coding network on MNIST data (panels: first-layer bases and their associated second-layer bases). [sent-184, score-1.265]
83 The colorbar is the same for all images, but the range it represents differs from image to image – generally, the background color of an image represents zero, and the colors above and below that color represent positive and negative values, respectively. [sent-186, score-0.142]
84 Among different methods, one particularly effective approach is to use sparse coding to derive a codebook of low-level features (such as SIFT) and represent an image as a bag of visual words [15]. [sent-192, score-1.138]
85 Here, we intend to learn two-layer hierarchical codebooks instead of a single flat codebook for the bag-of-words image representation. [sent-193, score-0.404]
86 Then, the SIFT descriptors from all images (both training and validation images) were utilized to learn first-layer codebooks with different dimensions, L1 = 512, 1024 and 2048. [sent-195, score-0.137]
87 Then, given a first-layer codebook, for each basis in the codebook we learned its second-layer codebook of size 64 by solving the weighted optimization in (5). [sent-196, score-0.324]
88 Again, the second-layer codebook learning was done in parallel using Hadoop. [sent-197, score-0.261]
89 With the first-layer and second-layer codebooks, each SIFT feature was coded into a very high-dimensional space; using L1 = 1024 as an example, the coding dimension is very large (the exact figure quoted here was lost in extraction). [sent-198, score-2.682]
90 Table 2: Average precision (in %) of classification on the PASCAL07 dataset using different sparse coding schemes; the table compares single-layer sparse coding against two-layer sparse coding (L2 = 64) across first-layer dimensions L1 = 512, 1024, and 2048, but the numeric entries were lost in extraction. [sent-204, score-0.811]
91 It is clear that the two-layer sparse coding performs significantly better than its single-layer counterpart. [sent-209, score-0.811]
92 We would like to point out that, although we simply employed max-pooling in the experiments, it may not be the best pooling strategy for the hierarchical coding scheme presented in this paper. [sent-210, score-0.869]
93 We believe a better pooling scheme needs to take the hierarchical structure into account, but this remains an open problem and is part of our future work. [sent-211, score-0.167]
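For reference, the kind of max-pooling used in these experiments reduces the per-descriptor codes of one image to a single image-level vector by a coordinate-wise maximum; a minimal sketch is below (the shapes and the whole-image pooling region are illustrative assumptions — pooling could equally be done over spatial cells).

```python
import numpy as np

def max_pool(codes):
    """codes: (num_descriptors, code_dim) sparse codes of one image's SIFT descriptors.
    The coordinate-wise max over descriptors gives one image-level feature vector."""
    return codes.max(axis=0)

# Illustrative shapes only: 500 descriptors, a 1024-dimensional code per descriptor
# (a two-layer code would be far higher-dimensional).
rng = np.random.default_rng(0)
codes = rng.random((500, 1024)) * (rng.random((500, 1024)) < 0.01)  # mostly zeros
image_feature = max_pool(codes)     # shape (1024,), then fed to a linear classifier
```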
94 5 Conclusion This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. [sent-212, score-2.089]
95 The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. [sent-213, score-0.133]
96 There are two main advantages of multi-layer coding: it can potentially achieve better performance because the deeper layers provide more details and structure, and it is computationally more efficient because the coding is decomposed into smaller problems. [sent-214, score-0.831]
97 Experiments showed that the performance of two-layer coding can significantly improve on that of single-layer coding. [sent-215, score-0.702]
98 As for future directions, it will be interesting to explore the deep coding network with more than two layers. [sent-216, score-0.83]
99 However, for more complicated data, deeper coding with multiple layers may be an effective way for gaining finer and finer features. [sent-219, score-0.858]
100 Linear spatial pyramid matching using sparse coding for image classification. [sent-296, score-0.876]
wordName wordTfidf (topN-words)
[('coding', 0.702), ('layer', 0.302), ('lcc', 0.25), ('codebook', 0.238), ('bases', 0.179), ('wv', 0.143), ('coordinate', 0.137), ('sparse', 0.109), ('scheme', 0.103), ('deep', 0.102), ('nonlinear', 0.102), ('deeper', 0.092), ('basis', 0.086), ('mnist', 0.076), ('lipschitz', 0.075), ('nonnegative', 0.067), ('digit', 0.064), ('codebooks', 0.063), ('kai', 0.061), ('ner', 0.059), ('manifold', 0.058), ('anchor', 0.054), ('rd', 0.053), ('pascal', 0.052), ('approximation', 0.051), ('cients', 0.049), ('local', 0.047), ('sift', 0.046), ('coef', 0.046), ('hadoop', 0.044), ('ppn', 0.044), ('secondlayer', 0.044), ('yihong', 0.044), ('newton', 0.043), ('quadratic', 0.043), ('image', 0.041), ('hierarchical', 0.041), ('classi', 0.04), ('rajat', 0.039), ('traditional', 0.038), ('layers', 0.037), ('approximated', 0.036), ('tong', 0.036), ('alexis', 0.036), ('cookbook', 0.036), ('honglak', 0.036), ('tting', 0.035), ('hand', 0.034), ('unlabeled', 0.033), ('side', 0.032), ('projected', 0.032), ('lemma', 0.031), ('locally', 0.031), ('relatively', 0.031), ('dimension', 0.031), ('extension', 0.031), ('images', 0.03), ('hessian', 0.028), ('yu', 0.028), ('vj', 0.028), ('power', 0.028), ('battle', 0.027), ('effective', 0.027), ('network', 0.026), ('raina', 0.026), ('smoothness', 0.025), ('dimensional', 0.025), ('term', 0.025), ('theoretical', 0.024), ('pyramid', 0.024), ('pooling', 0.023), ('curse', 0.023), ('principled', 0.023), ('parallel', 0.023), ('descriptors', 0.023), ('proposal', 0.023), ('row', 0.022), ('smooth', 0.022), ('ected', 0.022), ('re', 0.022), ('constraints', 0.022), ('error', 0.022), ('andrew', 0.021), ('visual', 0.021), ('regarded', 0.021), ('learn', 0.021), ('parent', 0.02), ('proposes', 0.02), ('third', 0.02), ('employ', 0.02), ('higher', 0.02), ('right', 0.02), ('extended', 0.019), ('dennis', 0.019), ('popularized', 0.019), ('colorbar', 0.019), ('airplanes', 0.019), ('lewicki', 0.019), ('rstlayer', 0.019), ('xi', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 59 nips-2010-Deep Coding Network
Author: Yuanqing Lin, Tong Zhang, Shenghuo Zhu, Kai Yu
Abstract: This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. Empirically, it is shown that the deep coding approach yields improved performance in benchmark datasets.
2 0.2266365 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
Author: Koray Kavukcuoglu, Pierre Sermanet, Y-lan Boureau, Karol Gregor, Michael Mathieu, Yann L. Cun
Abstract: We propose an unsupervised method for learning multi-stage hierarchies of sparse convolutional features. While sparse coding has become an increasingly popular method for learning visual features, it is most often trained at the patch level. Applying the resulting filters convolutionally results in highly redundant codes because overlapping patches are encoded in isolation. By training convolutionally over large image windows, our method reduces the redudancy between feature vectors at neighboring locations and improves the efficiency of the overall representation. In addition to a linear decoder that reconstructs the image from sparse features, our method trains an efficient feed-forward encoder that predicts quasisparse features from the input. While patch-based training rarely produces anything but oriented edge detectors, we show that convolutional training produces highly diverse filters, including center-surround filters, corner detectors, cross detectors, and oriented grating detectors. We show that using these filters in multistage convolutional network architecture improves performance on a number of visual recognition and detection tasks. 1
3 0.18488233 76 nips-2010-Energy Disaggregation via Discriminative Sparse Coding
Author: J. Z. Kolter, Siddharth Batra, Andrew Y. Ng
Abstract: Energy disaggregation is the task of taking a whole-home energy signal and separating it into its component appliances. Studies have shown that having devicelevel energy information can cause users to conserve significant amounts of energy, but current electricity meters only report whole-home data. Thus, developing algorithmic methods for disaggregation presents a key technical challenge in the effort to maximize energy conservation. In this paper, we examine a large scale energy disaggregation task, and apply a novel extension of sparse coding to this problem. In particular, we develop a method, based upon structured prediction, for discriminatively training sparse coding algorithms specifically to maximize disaggregation performance. We show that this significantly improves the performance of sparse coding algorithms on the energy task and illustrate how these disaggregation results can provide useful information about energy usage. 1
4 0.16856603 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: We propose a class of sparse coding models that utilizes a Laplacian Scale Mixture (LSM) prior to model dependencies among coefficients. Each coefficient is modeled as a Laplacian distribution with a variable scale parameter, with a Gamma distribution prior over the scale parameter. We show that, due to the conjugacy of the Gamma prior, it is possible to derive efficient inference procedures for both the coefficients and the scale parameter. When the scale parameters of a group of coefficients are combined into a single variable, it is possible to describe the dependencies that occur due to common amplitude fluctuations among coefficients, which have been shown to constitute a large fraction of the redundancy in natural images [1]. We show that, as a consequence of this group sparse coding, the resulting inference of the coefficients follows a divisive normalization rule, and that this may be efficiently implemented in a network architecture similar to that which has been proposed to occur in primary visual cortex. We also demonstrate improvements in image coding and compressive sensing recovery using the LSM model. 1
5 0.15931979 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
Author: Grégoire Montavon, Klaus-Robert Müller, Mikio L. Braun
Abstract: Deep networks can potentially express a learning problem more efficiently than local learning machines. While deep networks outperform local learning machines on some problems, it is still unclear how their nice representation emerges from their complex structure. We present an analysis based on Gaussian kernels that measures how the representation of the learning problem evolves layer after layer as the deep network builds higher-level abstract representations of the input. We use this analysis to show empirically that deep networks build progressively better representations of the learning problem and that the best representations are obtained when the deep network discriminates only in the last layers. 1
6 0.15388536 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
7 0.15043983 268 nips-2010-The Neural Costs of Optimal Control
8 0.14065656 65 nips-2010-Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform
9 0.12739709 141 nips-2010-Layered image motion with explicit occlusions, temporal consistency, and depth ordering
10 0.10829341 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
11 0.10572756 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
12 0.10571076 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication
13 0.08888337 271 nips-2010-Tiled convolutional neural networks
14 0.088485666 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
15 0.087499008 96 nips-2010-Fractionally Predictive Spiking Neurons
16 0.084087811 133 nips-2010-Kernel Descriptors for Visual Recognition
17 0.082479075 99 nips-2010-Gated Softmax Classification
18 0.080111578 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
19 0.078721106 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
20 0.075831808 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
topicId topicWeight
[(0, 0.186), (1, 0.083), (2, -0.148), (3, -0.021), (4, 0.111), (5, -0.066), (6, 0.011), (7, 0.169), (8, -0.145), (9, -0.01), (10, 0.112), (11, -0.062), (12, 0.003), (13, -0.195), (14, -0.169), (15, -0.144), (16, -0.067), (17, 0.04), (18, -0.131), (19, -0.104), (20, 0.116), (21, -0.096), (22, -0.105), (23, -0.038), (24, -0.063), (25, -0.029), (26, -0.093), (27, -0.025), (28, -0.022), (29, 0.2), (30, -0.013), (31, 0.091), (32, -0.106), (33, 0.078), (34, -0.074), (35, 0.029), (36, 0.082), (37, 0.06), (38, -0.055), (39, -0.038), (40, 0.036), (41, -0.01), (42, 0.09), (43, 0.017), (44, 0.136), (45, -0.056), (46, 0.085), (47, -0.072), (48, -0.031), (49, -0.081)]
simIndex simValue paperId paperTitle
same-paper 1 0.97542661 59 nips-2010-Deep Coding Network
Author: Yuanqing Lin, Tong Zhang, Shenghuo Zhu, Kai Yu
Abstract: This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. Empirically, it is shown that the deep coding approach yields improved performance in benchmark datasets.
2 0.82863462 76 nips-2010-Energy Disaggregation via Discriminative Sparse Coding
Author: J. Z. Kolter, Siddharth Batra, Andrew Y. Ng
Abstract: Energy disaggregation is the task of taking a whole-home energy signal and separating it into its component appliances. Studies have shown that having devicelevel energy information can cause users to conserve significant amounts of energy, but current electricity meters only report whole-home data. Thus, developing algorithmic methods for disaggregation presents a key technical challenge in the effort to maximize energy conservation. In this paper, we examine a large scale energy disaggregation task, and apply a novel extension of sparse coding to this problem. In particular, we develop a method, based upon structured prediction, for discriminatively training sparse coding algorithms specifically to maximize disaggregation performance. We show that this significantly improves the performance of sparse coding algorithms on the energy task and illustrate how these disaggregation results can provide useful information about energy usage. 1
3 0.77262533 143 nips-2010-Learning Convolutional Feature Hierarchies for Visual Recognition
Author: Koray Kavukcuoglu, Pierre Sermanet, Y-lan Boureau, Karol Gregor, Michael Mathieu, Yann L. Cun
Abstract: We propose an unsupervised method for learning multi-stage hierarchies of sparse convolutional features. While sparse coding has become an increasingly popular method for learning visual features, it is most often trained at the patch level. Applying the resulting filters convolutionally results in highly redundant codes because overlapping patches are encoded in isolation. By training convolutionally over large image windows, our method reduces the redudancy between feature vectors at neighboring locations and improves the efficiency of the overall representation. In addition to a linear decoder that reconstructs the image from sparse features, our method trains an efficient feed-forward encoder that predicts quasisparse features from the input. While patch-based training rarely produces anything but oriented edge detectors, we show that convolutional training produces highly diverse filters, including center-surround filters, corner detectors, cross detectors, and oriented grating detectors. We show that using these filters in multistage convolutional network architecture improves performance on a number of visual recognition and detection tasks. 1
4 0.67211068 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: We propose a class of sparse coding models that utilizes a Laplacian Scale Mixture (LSM) prior to model dependencies among coefficients. Each coefficient is modeled as a Laplacian distribution with a variable scale parameter, with a Gamma distribution prior over the scale parameter. We show that, due to the conjugacy of the Gamma prior, it is possible to derive efficient inference procedures for both the coefficients and the scale parameter. When the scale parameters of a group of coefficients are combined into a single variable, it is possible to describe the dependencies that occur due to common amplitude fluctuations among coefficients, which have been shown to constitute a large fraction of the redundancy in natural images [1]. We show that, as a consequence of this group sparse coding, the resulting inference of the coefficients follows a divisive normalization rule, and that this may be efficiently implemented in a network architecture similar to that which has been proposed to occur in primary visual cortex. We also demonstrate improvements in image coding and compressive sensing recovery using the LSM model. 1
5 0.6404897 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
Author: Taehwan Kim, Gregory Shakhnarovich, Raquel Urtasun
Abstract: Sparse coding has recently become a popular approach in computer vision to learn dictionaries of natural images. In this paper we extend the sparse coding framework to learn interpretable spatio-temporal primitives. We formulated the problem as a tensor factorization problem with tensor group norm constraints over the primitives, diagonal constraints on the activations that provide interpretability as well as smoothness constraints that are inherent to human motion. We demonstrate the effectiveness of our approach to learn interpretable representations of human motion from motion capture data, and show that our approach outperforms recently developed matching pursuit and sparse coding algorithms. 1
6 0.58891881 56 nips-2010-Deciphering subsampled data: adaptive compressive sampling as a principle of brain communication
7 0.57990313 271 nips-2010-Tiled convolutional neural networks
8 0.50964206 37 nips-2010-Basis Construction from Power Series Expansions of Value Functions
9 0.48880786 140 nips-2010-Layer-wise analysis of deep networks with Gaussian kernels
10 0.48540029 65 nips-2010-Divisive Normalization: Justification and Effectiveness as Efficient Coding Transform
11 0.46099061 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
12 0.45151493 111 nips-2010-Hallucinations in Charles Bonnet Syndrome Induced by Homeostasis: a Deep Boltzmann Machine Model
13 0.4118745 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
14 0.40808693 268 nips-2010-The Neural Costs of Optimal Control
15 0.37126002 209 nips-2010-Pose-Sensitive Embedding by Nonlinear NCA Regression
16 0.36582682 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
17 0.35602009 99 nips-2010-Gated Softmax Classification
18 0.3322477 96 nips-2010-Fractionally Predictive Spiking Neurons
19 0.32412603 3 nips-2010-A Bayesian Framework for Figure-Ground Interpretation
20 0.31872773 31 nips-2010-An analysis on negative curvature induced by singularity in multi-layer neural-network learning
topicId topicWeight
[(13, 0.041), (17, 0.014), (27, 0.056), (30, 0.047), (35, 0.381), (45, 0.205), (50, 0.031), (52, 0.033), (60, 0.035), (77, 0.035), (90, 0.034)]
simIndex simValue paperId paperTitle
1 0.89930969 170 nips-2010-Moreau-Yosida Regularization for Grouped Tree Structure Learning
Author: Jun Liu, Jieping Ye
Abstract: We consider the tree structured group Lasso where the structure over the features can be represented as a tree with leaf nodes as features and internal nodes as clusters of the features. The structured regularization with a pre-defined tree structure is based on a group-Lasso penalty, where one group is defined for each node in the tree. Such a regularization can help uncover the structured sparsity, which is desirable for applications with some meaningful tree structures on the features. However, the tree structured group Lasso is challenging to solve due to the complex regularization. In this paper, we develop an efficient algorithm for the tree structured group Lasso. One of the key steps in the proposed algorithm is to solve the Moreau-Yosida regularization associated with the grouped tree structure. The main technical contributions of this paper include (1) we show that the associated Moreau-Yosida regularization admits an analytical solution, and (2) we develop an efficient algorithm for determining the effective interval for the regularization parameter. Our experimental results on the AR and JAFFE face data sets demonstrate the efficiency and effectiveness of the proposed algorithm.
2 0.82463205 97 nips-2010-Functional Geometry Alignment and Localization of Brain Areas
Author: Georg Langs, Yanmei Tie, Laura Rigolo, Alexandra Golby, Polina Golland
Abstract: Matching functional brain regions across individuals is a challenging task, largely due to the variability in their location and extent. It is particularly difficult, but highly relevant, for patients with pathologies such as brain tumors, which can cause substantial reorganization of functional systems. In such cases spatial registration based on anatomical data is only of limited value if the goal is to establish correspondences of functional areas among different individuals, or to localize potentially displaced active regions. Rather than rely on spatial alignment, we propose to perform registration in an alternative space whose geometry is governed by the functional interaction patterns in the brain. We first embed each brain into a functional map that reflects connectivity patterns during a fMRI experiment. The resulting functional maps are then registered, and the obtained correspondences are propagated back to the two brains. In application to a language fMRI experiment, our preliminary results suggest that the proposed method yields improved functional correspondences across subjects. This advantage is pronounced for subjects with tumors that affect the language areas and thus cause spatial reorganization of the functional regions. 1
same-paper 3 0.82452035 59 nips-2010-Deep Coding Network
Author: Yuanqing Lin, Zhang Tong, Shenghuo Zhu, Kai Yu
Abstract: This paper proposes a principled extension of the traditional single-layer flat sparse coding scheme, where a two-layer coding scheme is derived based on theoretical analysis of nonlinear functional approximation that extends recent results for local coordinate coding. The two-layer approach can be easily generalized to deeper structures in a hierarchical multiple-layer manner. Empirically, it is shown that the deep coding approach yields improved performance in benchmark datasets.
4 0.81068122 73 nips-2010-Efficient and Robust Feature Selection via Joint ℓ2,1-Norms Minimization
Author: Feiping Nie, Heng Huang, Xiao Cai, Chris H. Ding
Abstract: Feature selection is an important component of many machine learning applications. Especially in many bioinformatics tasks, efficient and robust feature selection methods are desired to extract meaningful features and eliminate noisy ones. In this paper, we propose a new robust feature selection method with emphasizing joint 2,1 -norm minimization on both loss function and regularization. The 2,1 -norm based loss function is robust to outliers in data points and the 2,1 norm regularization selects features across all data points with joint sparsity. An efficient algorithm is introduced with proved convergence. Our regression based objective makes the feature selection process more efficient. Our method has been applied into both genomic and proteomic biomarkers discovery. Extensive empirical studies are performed on six data sets to demonstrate the performance of our feature selection method. 1
5 0.72533536 149 nips-2010-Learning To Count Objects in Images
Author: Victor Lempitsky, Andrew Zisserman
Abstract: We propose a new supervised learning framework for visual object counting tasks, such as estimating the number of cells in a microscopic image or the number of humans in surveillance video frames. We focus on the practically-attractive case when the training images are annotated with dots (one dot per object). Our goal is to accurately estimate the count. However, we evade the hard task of learning to detect and localize individual object instances. Instead, we cast the problem as that of estimating an image density whose integral over any image region gives the count of objects within that region. Learning to infer such density can be formulated as a minimization of a regularized risk quadratic cost function. We introduce a new loss function, which is well-suited for such learning, and at the same time can be computed efficiently via a maximum subarray algorithm. The learning can then be posed as a convex quadratic program solvable with cutting-plane optimization. The proposed framework is very flexible as it can accept any domain-specific visual features. Once trained, our system provides accurate object counts and requires a very small time overhead over the feature extraction step, making it a good candidate for applications involving real-time processing or dealing with huge amount of visual data. 1
6 0.68460989 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
7 0.66116941 109 nips-2010-Group Sparse Coding with a Laplacian Scale Mixture Prior
8 0.64678228 26 nips-2010-Adaptive Multi-Task Lasso: with Application to eQTL Detection
9 0.63197976 76 nips-2010-Energy Disaggregation via Discriminative Sparse Coding
10 0.62079155 249 nips-2010-Spatial and anatomical regularization of SVM for brain image analysis
11 0.61797309 246 nips-2010-Sparse Coding for Learning Interpretable Spatio-Temporal Primitives
12 0.61709088 181 nips-2010-Network Flow Algorithms for Structured Sparsity
13 0.6168558 200 nips-2010-Over-complete representations on recurrent neural networks can support persistent percepts
14 0.61569375 13 nips-2010-A Primal-Dual Message-Passing Algorithm for Approximated Large Scale Structured Prediction
15 0.6142534 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
16 0.61087012 12 nips-2010-A Primal-Dual Algorithm for Group Sparse Regularization with Overlapping Groups
17 0.61066777 123 nips-2010-Individualized ROI Optimization via Maximization of Group-wise Consistency of Structural and Functional Profiles
18 0.60315675 258 nips-2010-Structured sparsity-inducing norms through submodular functions
19 0.60239077 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
20 0.60236657 5 nips-2010-A Dirty Model for Multi-task Learning