
287 nips-2011-The Manifold Tangent Classifier

Source: pdf

Author: Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller

Abstract: We combine three important ideas present in previous work for building classifiers: the semi-supervised hypothesis (the input distribution contains information about the classifier), the unsupervised manifold hypothesis (data density concentrates near low-dimensional manifolds), and the manifold hypothesis for classification (different classes correspond to disjoint manifolds separated by low density). We exploit a novel algorithm for capturing manifold structure (high-order contractive auto-encoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local direction changes along the manifold. Record-breaking classification results are obtained.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We exploit a novel algorithm for capturing manifold structure (high-order contractive auto-encoders) and we show how it builds a topological atlas of charts, each chart being characterized by the principal singular vectors of the Jacobian of a representation mapping. [sent-5, score-0.673]

2 This representation learning algorithm can be stacked to yield a deep architecture, and we combine it with a domain knowledge-free version of the TangentProp algorithm to encourage the classifier to be insensitive to local direction changes along the manifold. [sent-6, score-0.391]

3 The (unsupervised) manifold hypothesis, according to which real world data presented in high dimensional spaces is likely to concentrate in the vicinity of non-linear sub-manifolds of much lower dimensionality (Cayton, 2005; Narayanan and Mitter, 2010). [sent-20, score-0.262]

4 The manifold hypothesis for classification, according to which points of different classes are likely to concentrate along different sub-manifolds, separated by low density regions of the input space. [sent-22, score-0.341]

5 , 2011a), based on the idea of encouraging the learned representation to be robust to small variations of the input, was shown to be very effective for unsupervised feature learning. [sent-24, score-0.184]

6 Its successful application in the pre-training of deep neural networks is yet another illustration of what can be gained by adopting hypothesis 1. [sent-25, score-0.21]

7 Most of the directions to which the representation is substantially sensitive are thought to be directions tangent to the data-supporting manifold (those that locally define its tangent space). [sent-29, score-1.325]

8 The present work follows through on this interpretation, and investigates whether it is possible to use this information, that is presumably captured about manifold structure, to further improve classification performance by leveraging hypothesis 3. [sent-30, score-0.341]

9 To that end, we extract a set of basis vectors for the local tangent space at each training point from the Contractive Auto-Encoder’s learned parameters. [sent-31, score-0.576]

10 This is obtained with a Singular Value Decomposition (SVD) of the Jacobian of the encoder that maps each input to its learned representation. [sent-32, score-0.164]

11 Based on hypothesis 3, we then adopt the “generic prior” that class labels are likely to be insensitive to most directions within these local tangent spaces (ex: small translations, rotations or scalings usually do not change an image’s class). [sent-33, score-0.707]

12 Supervised classification algorithms that have been devised to efficiently exploit tangent directions given as domain-specific prior-knowledge (Simard et al. [sent-34, score-0.58]

13 , 1992, 1993), can readily be used instead with our learned tangent spaces. [sent-35, score-0.443]

14 In particular, we will show record-breaking improvements by using TangentProp for fine-tuning CAE-pre-trained deep neural networks. [sent-36, score-0.131]

15 To the best of our knowledge this is the first time that the implicit relationship between an unsupervised learned mapping and the tangent space of a manifold is rendered explicit and successfully exploited for the training of a classifier. [sent-37, score-0.842]

16 convolutional networks and tangent distance based on a-priori known transformations). [sent-42, score-0.43]

17 2 Contractive auto-encoders (CAE) We consider the problem of the unsupervised learning of a non-linear feature extractor from a dataset D = {x1 , . [sent-44, score-0.108]

18 It learns an encoder function h, that maps an input $x \in \mathbb{R}^d$ to a hidden representation $h(x) \in \mathbb{R}^{d_h}$, jointly with a decoder function g, that maps h back to the input space as $r = g(h(x))$, the reconstruction of x. [sent-54, score-0.246]

19 The encoder and decoder’s parameters θ are learned by stochastic gradient descent to minimize the average reconstruction error L(x, g(h(x))) for the examples of the training set. [sent-55, score-0.276]

20 We will use the most common forms of encoder, decoder, and reconstruction error. Encoder: $h(x) = s(Wx + b_h)$, where $s$ is the element-wise logistic sigmoid $s(z) = \frac{1}{1 + e^{-z}}$. [sent-57, score-0.146]

21 Parameters are a $d_h \times d$ weight matrix $W$ and bias vector $b_h \in \mathbb{R}^{d_h}$. [sent-58, score-0.161]
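As a rough illustration of the encoder just described, the following NumPy sketch evaluates $h(x) = s(Wx + b_h)$ on a random input. The sizes, the random initialisation, and the function names `sigmoid` and `encode` are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic sigmoid s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b_h):
    """Encoder h(x) = s(W x + b_h), with W of shape (d_h, d) and b_h of shape (d_h,)."""
    return sigmoid(W @ x + b_h)

# Illustrative sizes and random parameters (assumptions, not values from the paper).
rng = np.random.default_rng(0)
d, d_h = 5, 3
W = 0.1 * rng.standard_normal((d_h, d))
b_h = np.zeros(d_h)
x = rng.standard_normal(d)

h = encode(x, W, b_h)  # hidden representation, each component lies in (0, 1)
```

Because the sigmoid is applied element-wise, every hidden unit stays strictly between 0 and 1, which is the range the later discussion of the atlas relies on.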

22 2 First order and higher order contractive auto-encoders More recently, Rifai et al. [sent-80, score-0.189]

23 (2011a) introduced the Contractive Auto-Encoder (CAE), that encourages robustness of representation h(x) to small variations of a training input x, by penalizing its sensitivity to that input, measured as the Frobenius norm of the encoder’s Jacobian $J(x) = \frac{\partial h}{\partial x}(x)$. [sent-81, score-0.14]

24 Note that, with the traditional sigmoid encoder form given above, one can easily obtain the Jacobian of the encoder. [sent-83, score-0.169]
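For a sigmoid encoder the Jacobian has the closed form $J(x) = \mathrm{diag}\big(h(x) \odot (1 - h(x))\big)\, W$, which follows from the chain rule. The sketch below computes it together with the squared Frobenius norm used as the contractive penalty, and includes a finite-difference helper purely as a sanity check. All function names and the random test data are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder_jacobian(x, W, b_h):
    """Closed-form Jacobian of h(x) = sigmoid(W x + b_h); shape (d_h, d)."""
    h = sigmoid(W @ x + b_h)
    # Row j of W is scaled by the sigmoid derivative h_j * (1 - h_j).
    return (h * (1.0 - h))[:, None] * W

def contractive_penalty(x, W, b_h):
    """Squared Frobenius norm of the encoder Jacobian at x."""
    J = encoder_jacobian(x, W, b_h)
    return float(np.sum(J ** 2))

def finite_difference_jacobian(x, W, b_h, step=1e-6):
    """Central finite differences; used only to check the closed form."""
    d = x.shape[0]
    J = np.zeros((W.shape[0], d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = step
        J[:, i] = (sigmoid(W @ (x + e) + b_h) - sigmoid(W @ (x - e) + b_h)) / (2.0 * step)
    return J

# Random test data (an assumption for illustration).
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
b_h = rng.standard_normal(3)
x = rng.standard_normal(5)
J = encoder_jacobian(x, W, b_h)
```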

25 , 2011b) variant with the following optimization objective: $\mathcal{J}_{CAE+H}(\theta) = \sum_{x \in D} \Big( L(x, g(h(x))) + \lambda \|J(x)\|^2 + \gamma \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)} \big[ \|J(x) - J(x + \epsilon)\|^2 \big] \Big)$ (4), where γ is an additional regularization hyper-parameter that controls how strongly we penalize local variations of the Jacobian, i. [sent-89, score-0.134]
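The expectation over the noise in the CAE+H objective can be approximated by sampling. The sketch below estimates the two regularization terms for a single input with a small number of Gaussian corruptions. The function name `cae_h_regularizer`, the number of samples, and the hyper-parameter values are assumptions for illustration; the reconstruction term is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder_jacobian(x, W, b_h):
    """Jacobian of the sigmoid encoder h(x) = sigmoid(W x + b_h)."""
    h = sigmoid(W @ x + b_h)
    return (h * (1.0 - h))[:, None] * W

def cae_h_regularizer(x, W, b_h, lam, gamma, sigma, n_samples, rng):
    """Monte Carlo estimate of lam * ||J(x)||^2 + gamma * E[||J(x) - J(x + eps)||^2]."""
    J = encoder_jacobian(x, W, b_h)
    contractive = np.sum(J ** 2)
    variation = 0.0
    for _ in range(n_samples):
        eps = sigma * rng.standard_normal(x.shape)  # eps ~ N(0, sigma^2 I)
        J_eps = encoder_jacobian(x + eps, W, b_h)
        variation += np.sum((J - J_eps) ** 2)
    variation /= n_samples
    return float(lam * contractive + gamma * variation)

# Hypothetical parameters and hyper-parameter values, chosen only for the demo.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
b_h = rng.standard_normal(3)
x = rng.standard_normal(5)
value = cae_h_regularizer(x, W, b_h, lam=0.1, gamma=1.0, sigma=0.1, n_samples=4, rng=rng)
```

With `sigma=0` or `gamma=0` the estimate reduces to the first-order contractive term alone.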

26 3 Characterizing the tangent bundle captured by a CAE Rifai et al. [sent-95, score-0.475]

27 The geometric interpretation is that these directions span the local tangent space of the underlying manifold that supports the data. [sent-97, score-0.869]

28 The tangent bundle of a smooth manifold is the manifold along with the set of tangent planes taken at all points on it. [sent-98, score-1.32]

29 Each such tangent plane can be equipped with a local Euclidean coordinate system or chart. [sent-99, score-0.53]

30 In topology, an atlas is a collection of such charts (like the locally Euclidean map in each page of a geographic atlas). [sent-100, score-0.226]

31 Even though the set of charts may form a non-Euclidean manifold (e. [sent-101, score-0.355]

32 1 Conditions for the feature mapping to define an atlas on a manifold In order to obtain a proper atlas of charts, h must be a diffeomorphism. [sent-105, score-0.474]

33 It must be smooth ($C^\infty$) and invertible on open Euclidean balls on the manifold M around the training points. [sent-106, score-0.364]

34 Injectivity (different values of h(x) correspond to different values of x) on the training examples is encouraged by minimizing reconstruction error (otherwise we cannot distinguish training examples $x_i$ and $x_j$ by only looking at $h(x_i)$ and $h(x_j)$). [sent-108, score-0.181]

35 With this condition satisfied, mapping h is injective in the subspace spanned by the variations in the training set. [sent-111, score-0.14]

36 If we limit the domain of h to $h(X) \subset (0, 1)^{d_h}$ comprising values obtainable by h applied to some set X, then we obtain surjectivity by definition, hence bijectivity of h between the training set D and h(D). [sent-112, score-0.102]

37 Let $M_x$ be an open ball on the manifold M around training example x. [sent-113, score-0.364]

38 By smoothness of the manifold M and of mapping h, we obtain bijectivity locally around the training examples (on the manifold) as well, i. [sent-114, score-0.424]

39 2 Obtaining an atlas from the learned feature mapping Now that we have necessary conditions for local invertibility of h(x) for x ∈ D, let us consider how to define the local chart around x from the nature of h. [sent-118, score-0.427]

40 Because h must be sensitive to changes from an example $x_i$ to one of its neighbors $x_j$, but insensitive to other changes (because of the CAE penalty), we expect that this will be reflected in the spectrum of the Jacobian matrix $J(x) = \frac{\partial h(x)}{\partial x}$ at each training point x. [sent-119, score-0.13]

41 In the ideal case where J(x) has rank k, h(x + v) differs from h(x) only if v is in the span of the singular vectors of J(x) with non-zero singular value. [sent-120, score-0.136]

42 Hence, we define a local chart around x using the Singular Value Decomposition of $J^T(x) = U(x) S(x) V^T(x)$ (where $U(x)$ and $V(x)$ are orthogonal and $S(x)$ is diagonal). [sent-122, score-0.212]

43 The tangent plane $H_x$ at x is given by the span of the set of principal singular vectors $B_x$: $B_x = \{U_{\cdot k}(x) \mid S_{kk}(x) > \epsilon\}$ and $H_x = \{x + v \mid v \in \mathrm{span}(B_x)\}$, where $U_{\cdot k}(x)$ is the k-th column of $U(x)$, and $\mathrm{span}(\{z_k\}) = \{x \mid x = \sum_k w_k z_k,\ w_k \in \mathbb{R}\}$. [sent-123, score-0.58]
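The construction of $B_x$ can be sketched directly with `numpy.linalg.svd`: decompose $J^T(x)$ and keep the left singular vectors whose singular value exceeds a threshold. The function name `tangent_basis`, the threshold value, and the synthetic rank-2 Jacobian are assumptions chosen for this example.

```python
import numpy as np

def tangent_basis(J, eps=1e-8):
    """Return the columns U[:, k] of the SVD of J^T whose singular value exceeds eps.

    J has shape (d_h, d), so J^T has shape (d, d_h) and the returned
    basis vectors live in the d-dimensional input space.
    """
    U, S, _ = np.linalg.svd(J.T, full_matrices=False)
    return U[:, S > eps]

# Synthetic Jacobian of rank 2 (an assumption for illustration).
rng = np.random.default_rng(0)
d, d_h, k = 6, 4, 2
J = rng.standard_normal((d_h, k)) @ rng.standard_normal((k, d))

B_x = tangent_basis(J)  # shape (d, k): orthonormal basis of the estimated tangent space
```

Moving the input along any direction orthogonal to the columns of `B_x` leaves the linearised representation unchanged, which matches the rank-k ideal case described in the text.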

44 We can thus define an atlas $\mathcal{A}$ captured by h, based on the local linear approximation around each example: $\mathcal{A} = \{(M_x, \phi_x) \mid x \in D,\ \phi_x(\tilde{x}) = B_x(\tilde{x} - x)\}$ (5). [sent-124, score-0.203]

45 Note that this way of obtaining an atlas can also be applied to subsequent layers of a deep network. [sent-125, score-0.237]

46 It is thus possible to use a greedy layer-wise strategy to initialize a network with CAEs (Rifai et al. [sent-126, score-0.113]

47 , 2011a) and obtain an atlas that corresponds to the nonlinear features computed at any layer. [sent-127, score-0.106]
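For a stack of sigmoid encoders, the Jacobian of the top representation with respect to the input is the product of the per-layer Jacobians (chain rule). The sketch below accumulates that product during a forward pass. The function name `stacked_encoder_jacobian`, the layer sizes, and the random parameters are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stacked_encoder_jacobian(x, layers):
    """Forward pass through stacked sigmoid encoders, accumulating d h_K / d x.

    `layers` is a list of (W, b) pairs; the result has shape (d_K, d).
    """
    h = x
    J_total = np.eye(x.shape[0])
    for W, b in layers:
        h = sigmoid(W @ h + b)
        J_layer = (h * (1.0 - h))[:, None] * W  # Jacobian of this layer w.r.t. its input
        J_total = J_layer @ J_total             # chain rule: later layers multiply on the left
    return h, J_total

# Two hypothetical layers with random parameters.
rng = np.random.default_rng(0)
layers = [
    (rng.standard_normal((4, 5)), rng.standard_normal(4)),
    (rng.standard_normal((3, 4)), rng.standard_normal(3)),
]
x = rng.standard_normal(5)
h_top, J_top = stacked_encoder_jacobian(x, layers)
```

An SVD of the transpose of `J_top` would then give tangent directions in input space for the top-layer features, in the same way as for a single layer.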

48 4 Exploiting the learned tangent directions for classification Using the previously defined charts for every point of the training set, we propose to use this additional information provided by unsupervised learning to improve the performance of the supervised task. [sent-128, score-0.83]

49 In this we adopt the manifold hypothesis for classification mentioned in the introduction. [sent-129, score-0.341]

50 1 CAE-based tangent distance One way of achieving this is to use a nearest neighbor classifier with a similarity criterion defined as the shortest distance between two hyperplanes (Simard et al. [sent-131, score-0.539]

51 The tangents extracted at each point will allow us to shrink the distances between two samples when they can approximate each other by a linear combination of their local tangents. [sent-133, score-0.437]

52 (1993), we define the tangent distance between two points x and y as the distance between the two hyperplanes $H_x, H_y \subset \mathbb{R}^d$ spanned respectively by $B_x$ and $B_y$. [sent-135, score-0.492]

53 Using the usual definition of distance between two spaces, $d(H_x, H_y) = \inf\{\|z - w\|_2 \mid (z, w) \in H_x \times H_y\}$, we obtain the solution for this convex problem by solving a system of linear equations (Simard et al. [sent-136, score-0.109]
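One way to compute such a distance, sketched here under the assumption that each hyperplane is parametrised as $x + B_x a$ and $y + B_y b$, is to solve the least-squares problem $\min_{a,b} \|x + B_x a - y - B_y b\|_2$ and take the norm of the residual. The example uses `numpy.linalg.lstsq` instead of writing out the linear system explicitly; the function name `tangent_distance` and the toy data are hypothetical.

```python
import numpy as np

def tangent_distance(x, B_x, y, B_y):
    """Distance between the affine subspaces {x + B_x a} and {y + B_y b}."""
    # x + B_x a - (y + B_y b) = [B_x, -B_y] @ [a; b] - (y - x)
    M = np.hstack([B_x, -B_y])
    coeffs, *_ = np.linalg.lstsq(M, y - x, rcond=None)
    residual = M @ coeffs - (y - x)
    return float(np.linalg.norm(residual))

# Toy example in 3-D: a line along the first axis through x,
# and a line along the second axis through y.
x = np.array([0.0, 0.0, 0.0])
B_x = np.array([[1.0], [0.0], [0.0]])
y = np.array([3.0, 4.0, 2.0])
B_y = np.array([[0.0], [1.0], [0.0]])

dist = tangent_distance(x, B_x, y, B_y)
euclid = float(np.linalg.norm(x - y))
```

In this toy case the two lines can be aligned in the first two coordinates, so only the gap of 2 along the third axis remains, whereas the plain Euclidean distance between x and y is $\sqrt{29}$.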

54 This procedure corresponds to allowing the considered points x and y to move along the directions spanned by their associated local charts. [sent-138, score-0.199]

55 2 CAE-based tangent propagation Nearest neighbor techniques are often impractical for large scale datasets because their computational requirements scale linearly with n for each test case. [sent-142, score-0.439]

56 We can also leverage the extracted local charts when training a neural network. [sent-144, score-0.271]

57 Following the tangent propagation approach of Simard et al. [sent-145, score-0.516]

58 3 The Manifold Tangent Classifier (MTC) Putting it all together, here is the high level summary of how we build and train a deep network: 1. [sent-148, score-0.131]

59 Finetune the whole network for supervised classification with an added tangent propagation penalty (Eq. [sent-157, score-0.562]

60 We call this deep learning algorithm the Manifold Tangent Classifier (MTC). [sent-159, score-0.131]
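The tangent propagation penalty is elided in this extract. A common form, assumed here, is $\Omega(x) = \sum_{u \in B_x} \big\| \frac{\partial o}{\partial x}(x)\, u \big\|^2$, where $o$ is the network output. The sketch below approximates each directional derivative by central finite differences, so it works for any output function; in practice the derivative would normally be obtained by backpropagation. The names `tangent_prop_penalty` and `output_fn` and the linear toy model are hypothetical.

```python
import numpy as np

def tangent_prop_penalty(output_fn, x, B_x, step=1e-5):
    """Sum over tangent vectors u (columns of B_x) of ||d o(x) / d x @ u||^2.

    The directional derivative is estimated by central finite differences,
    which is an assumption of this sketch rather than the paper's procedure.
    """
    penalty = 0.0
    for u in B_x.T:
        directional = (output_fn(x + step * u) - output_fn(x - step * u)) / (2.0 * step)
        penalty += float(np.sum(directional ** 2))
    return penalty

# Toy linear "network" o(x) = A x (an assumption for illustration).
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])
output_fn = lambda z: A @ z

x = np.array([0.5, -1.0, 2.0])
B_x = np.array([[1.0, 0.0],
                [0.0, 0.0],
                [0.0, 1.0]])  # tangent directions e1 and e3

omega = tangent_prop_penalty(output_fn, x, B_x)
```

Here the output ignores the third coordinate, so the tangent e3 contributes nothing and only e1 is penalised. Adding a multiple of this term to the supervised loss pushes the classifier toward insensitivity along the learned tangents.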

61 Alternatively, instead of step 3, one can use the tangent vectors in $B_{x_i}$ in a tangent distance nearest neighbors classifier. [sent-160, score-0.828]

62 , 2000) have been proposed which can automatically discover the main directions of variation around each training point, i. [sent-162, score-0.207]

63 See Bengio and Monperrus (2005) for a critique of local non-parametric manifold algorithms: they might require a number of training examples which grows exponentially with manifold dimension and curvature (more crooks and valleys in the manifold will require more examples). [sent-168, score-0.919]

64 One attempt to generalize the manifold shape non-locally (Bengio et al. [sent-169, score-0.339]

65 , 2006) is based on explicitly predicting the tangent plane associated to any given point x, as a parametrized function of x. [sent-170, score-0.434]

66 they use pairs or tuples of points, with the goal to explicitly model the tangent space, while it is 1 (K) J is the product of the Jacobians of each encoder (see Eq. [sent-173, score-0.517]

67 This is achieved in $O(d_M \times d \times d_h)$ per training example. [sent-176, score-0.123]

68 For comparison, the cost of a forward propagation through a single MLP layer is $O(d \times d_h)$ per example. [sent-177, score-0.142]

69 , 1992) algorithms were initially designed to exploit prior domain-knowledge of directions of invariance (ex: knowledge that the class of an image should be invariant to small translations, rotations or scalings in the image plane). [sent-188, score-0.178]

70 However any algorithm able to output a chart for a training point might potentially be used, as we do here, to provide directions to a Tangent distance or TangentProp (Simard et al. [sent-189, score-0.398]

71 Our approach is nevertheless unique as the CAE’s unsupervised feature learning capabilities are used simultaneously to provide a good initialization of deep network layers and a coherent non-local predictor of tangent spaces. [sent-191, score-0.633]

72 Whereas TangentProp attempts to make the output insensitive to selected directions of change, the double backpropagation penalty term attempts to make the error at a training example invariant to changes in all directions. [sent-193, score-0.341]

73 In addition to minimizing a supervised prediction error, it encourages each layer of representation of a deep architecture to be invariant when the training example is changed from x to a near neighbor of x in the training set. [sent-199, score-0.439]

74 This algorithm works implicitly under the hypothesis that the variable y to predict from x is invariant to the local directions of change present between nearest neighbors. [sent-200, score-0.289]

75 This is consistent with the manifold hypothesis for classification (hypothesis 3 mentioned in the introduction). [sent-201, score-0.341]

76 Instead of removing variability along the local directions of variation, the Contractive Auto-Encoder (Rifai et al. [sent-202, score-0.246]

77 6 Experiments We conducted experiments to evaluate our approach and the quality of the manifold tangents learned by the CAE, using a range of datasets from different domains: MNIST is a dataset of 28 × 28 images of handwritten digits. [sent-204, score-0.635]

78 We investigate whether leveraging the CAE learned tangents leads to better classification performance on these problems, using the following methodology: Optimal hyper-parameters for (a stack of) CAEs are selected by cross-validation on a disjoint validation set extracted from the training set. [sent-217, score-0.537]

79 The quality of the feature extractor and tangents captured by the CAEs is evaluated by initializing a neural network (MLP) with the same parameters and fine-tuning it by backpropagation on the supervised classification task. [sent-218, score-0.486]

80 The optimal strength of the supervised TangentProp penalty and number of tangents $d_M$ are also cross-validated. [sent-219, score-0.415]

81 Results Figure 1 shows a visualization of the tangents learned by the CAE. [sent-220, score-0.373]

82 On MNIST, the tangents mostly correspond to small geometrical transformations like translations and rotations. [sent-221, score-0.36]

83 On CIFAR-10, the Figure 1: Visualisation of the tangents learned by the CAE for MNIST, CIFAR-10 and RCV1 (top to bottom). [sent-222, score-0.373]

84 On RCV1, we show the tangents of a document with the topic "Trading & Markets" (MCAT) with the negative terms in red (-) and the positive terms in green (+). [sent-224, score-0.328]

85 Figure 2: Tangents extracted by local PCA on CIFAR-10. [sent-225, score-0.109]

86 The tangents on RCV1-v2 correspond to the addition or removal of similar words and removal of irrelevant words. [sent-228, score-0.328]

87 We also note that extracting the tangents of the model is a way to visualize what the model has learned about the structure of the manifold. [sent-229, score-0.373]

88 Interestingly, we see that hypothesis 3 holds for these datasets because most tangents do not change the class of the example. [sent-230, score-0.407]

89 The KNN is trained on the raw input vector using the Euclidean distance while the K-layer+KNN is computed on the representation learned by a K-layer CAE. [sent-232, score-0.107]

90 The KNN+Tangents uses at every sample the local charts extracted from the 1-layer CAE to compute tangent distance. [sent-233, score-0.6]

91 We use KNN using tangent distance to evaluate the quality of the learned tangents more objectively. [sent-245, score-0.803]

92 Table 1 shows that using the tangents extracted from a CAE always leads to better performance than a traditional KNN. [sent-246, score-0.373]

93 2, the tangents extracted by the CAE can be used for fine-tuning the multilayer perceptron using tangent propagation, yielding our Manifold Tangent Classifier (MTC). [sent-248, score-0.771]

94 (2008), the unsupervised feature extractor is trained on the full training set and the supervised classifier is trained on a restricted labeled set. [sent-251, score-0.229]

95 Table 2 shows our results for a single hidden layer MLP initialized with CAE+H pretraining (noted CAE for brevity) and for the same classifier fine-tuned with tangent propagation (i. [sent-252, score-0.486]

96 It shows that the features extracted Table 2: Semi-supervised classification error on the MNIST test set with 100, 600, 1000 and 3000 labeled training examples. [sent-259, score-0.114]

97 The CAE in this figure is a two-layer deep network with 2000 units per layer pretrained with the CAE+H objective. [sent-307, score-0.214]

98 The MTC uses the same stack of CAEs trained with tangent propagation using 15 tangents. [sent-308, score-0.489]

99 7 Conclusion In this work, we have shown a new way to characterize a manifold by extracting a local chart at each data point based on the unsupervised feature mapping built with a deep learning approach. [sent-321, score-0.64]

100 Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. [sent-520, score-0.231]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cae', 0.442), ('tangent', 0.398), ('tangents', 0.328), ('manifold', 0.262), ('mtc', 0.197), ('rifai', 0.18), ('deep', 0.131), ('jacobian', 0.13), ('simard', 0.124), ('encoder', 0.119), ('chart', 0.115), ('tangentprop', 0.115), ('contractive', 0.112), ('atlas', 0.106), ('knn', 0.106), ('directions', 0.105), ('bengio', 0.101), ('charts', 0.093), ('vincent', 0.081), ('hypothesis', 0.079), ('et', 0.077), ('classi', 0.073), ('lecun', 0.073), ('training', 0.069), ('unsupervised', 0.068), ('bx', 0.068), ('caes', 0.066), ('cnn', 0.066), ('ird', 0.066), ('local', 0.064), ('mnist', 0.063), ('insensitive', 0.061), ('hx', 0.056), ('dh', 0.054), ('ranzato', 0.054), ('decoder', 0.054), ('bh', 0.053), ('supervised', 0.052), ('sigmoid', 0.05), ('hinton', 0.05), ('stack', 0.05), ('er', 0.049), ('singular', 0.048), ('forest', 0.048), ('layer', 0.047), ('mx', 0.047), ('learned', 0.045), ('extracted', 0.045), ('mlp', 0.043), ('covertype', 0.043), ('reconstruction', 0.043), ('variations', 0.041), ('invariant', 0.041), ('propagation', 0.041), ('span', 0.04), ('parzen', 0.04), ('hy', 0.04), ('extractor', 0.04), ('bxi', 0.04), ('muller', 0.037), ('injectivity', 0.037), ('plane', 0.036), ('weston', 0.036), ('network', 0.036), ('penalty', 0.035), ('dm', 0.034), ('pressure', 0.034), ('around', 0.033), ('bijectivity', 0.033), ('cayton', 0.033), ('charting', 0.033), ('dauphin', 0.033), ('drucker', 0.033), ('glorot', 0.033), ('goodfellow', 0.033), ('irdh', 0.033), ('jcae', 0.033), ('mitter', 0.033), ('monperrus', 0.033), ('nw', 0.033), ('trebar', 0.033), ('translations', 0.032), ('kavukcuoglu', 0.032), ('distance', 0.032), ('coordinate', 0.032), ('backpropagation', 0.03), ('salakhutdinov', 0.03), ('spanned', 0.03), ('representation', 0.03), ('penalize', 0.029), ('wk', 0.029), ('lcc', 0.029), ('denker', 0.029), ('brand', 0.029), ('steele', 0.029), ('hj', 0.029), ('larochelle', 0.029), ('generic', 0.027), ('hypotheses', 0.027), ('locally', 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 287 nips-2011-The Manifold Tangent Classifier

Author: Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller

2 0.16768077 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

Author: Binbin Lin, Chiyuan Zhang, Xiaofei He

Abstract: This paper studies the problem of semi-supervised learning from the vector field perspective. Much of the existing work uses the graph Laplacian to ensure the smoothness of the prediction function on the data manifold. However, beyond smoothness, it is suggested by recent theoretical work that we should ensure second order smoothness for achieving faster rates of convergence for semi-supervised regression problems. To achieve this goal, we show that the second order smoothness measures the linearity of the function, and the gradient field of a linear function has to be a parallel vector field. Consequently, we propose to find a function which minimizes the empirical error, and simultaneously requires its gradient field to be as parallel as possible. We give a continuous objective function on the manifold and discuss how to discretize it by using random points. The discretized optimization problem turns out to be a sparse linear system which can be solved very efficiently. The experimental results have demonstrated the effectiveness of our proposed approach.

3 0.1206308 263 nips-2011-Sparse Manifold Clustering and Embedding

Author: Ehsan Elhamifar, René Vidal

Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds.

4 0.11785769 250 nips-2011-Shallow vs. Deep Sum-Product Networks

Author: Olivier Delalleau, Yoshua Bengio

Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efﬁciently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1 Introduction and prior work Many learning algorithms are based on searching a family of functions so as to identify one member of said family which minimizes a training criterion. The choice of this family of functions and how members of that family are parameterized can be a crucial one. Although there is no universally optimal choice of parameterization or family of functions (or “architecture”), as demonstrated by the no-free-lunch results [37], it may be the case that some architectures are appropriate (or inappropriate) for a large class of learning tasks and data distributions, such as those related to Artiﬁcial Intelligence (AI) tasks [4]. Different families of functions have different characteristics that can be appropriate or not depending on the learning task of interest. One of the characteristics that has spurred much interest and research in recent years is depth of the architecture. In the case of a multi-layer neural network, depth corresponds to the number of (hidden and output) layers. A ﬁxedkernel Support Vector Machine is considered to have depth 2 [4] and boosted decision trees to have depth 3 [7]. 
Here we use the word circuit or network to talk about a directed acyclic graph, where each node is associated with some output value which can be computed based on the values associated with its predecessor nodes. The arguments of the learned function are set at the input nodes of the circuit (which have no predecessor) and the outputs of the function are read off the output nodes of the circuit. Different families of functions correspond to different circuits and allowed choices of computations in each node. Learning can be performed by changing the computation associated with a node, or rewiring the circuit (possibly changing the number of nodes). The depth of the circuit is the length of the longest path in the graph from an input node to an output node. Deep Learning algorithms [3] are tailored to learning circuits with variable depth, typically greater than depth 2. They are based on the idea of multiple levels of representation, with the intuition that the raw input can be represented at different levels of abstraction, with more abstract features of the input or more abstract explanatory factors represented by deeper circuits. These algorithms are often based on unsupervised learning, opening the door to semi-supervised learning and efﬁcient 1 use of large quantities of unlabeled data [3]. Analogies with the structure of the cerebral cortex (in particular the visual cortex) [31] and similarities between features learned with some Deep Learning algorithms and those hypothesized in the visual cortex [17] further motivate investigations into deep architectures. It has been suggested that deep architectures are more powerful in the sense of being able to more efﬁciently represent highly-varying functions [4, 3]. In this paper, we measure “efﬁciency” in terms of the number of computational units in the network. 
An efﬁcient representation is important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a ﬁxed amount of training samples and computational power, better generalization is expected. The ﬁrst successful algorithms for training deep architectures appeared in 2006, with efﬁcient training procedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting the general idea of greedy layer-wise pre-training [6]. Since then, these ideas have been investigated further and applied in many settings, demonstrating state-of-the-art learning performance in object recognition [16, 28, 18, 15] and segmentation [20], audio classiﬁcation [19, 10], natural language processing [9, 36, 21, 32], collaborative ﬁltering [30], modeling textures [24], modeling motion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22]. Poon and Domingos [25] introduced deep sum-product networks as a method to compute partition functions of tractable graphical models. These networks are analogous to traditional artiﬁcial neural networks but with nodes that compute either products or weighted sums of their inputs. Analogously to neural networks, we deﬁne “hidden” nodes as those nodes that are neither input nodes nor output nodes. If the nodes are organized in layers, we deﬁne the “hidden” layers to be those that are neither the input layer nor the output layer. Poon and Domingos [25] report experiments with networks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief Networks [14, 3], where the number of hidden layers is usually on the order of three to ﬁve. Whether such deep architectures have theoretical advantages compared to so-called “shallow” architectures (i.e. those with a single hidden layer) remains an open question. 
After all, in the case of a sum-product network, the output value can always be written as a sum of products of input variables (possibly raised to some power by allowing multiple connections from the same input), and consequently it is easily rewritten as a shallow network with a sum output unit and product hidden units. The argument supported by our theoretical analysis is that a deep architecture is able to compute some functions much more efﬁciently than a shallow one. Until recently, very few theoretical results supported the idea that deep architectures could present an advantage in terms of representing some functions more efﬁciently. Most related results originate from the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the proof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of gates [1, 38], and more generally that there exist functions computable with a polynomial-size depthk circuit that would require exponential size when restricted to depth k − 1 [11]. Another recent result on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that bounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions from the uniform distribution (i.e. they are “fooled” by such input distributions). In particular, Braverman’s result suggests that shallow circuits can in general be fooled more easily than deep ones, i.e., that they would have more difﬁculty efﬁciently representing high-order dependencies (those involving many input variables). It is not obvious that circuit complexity results (that typically consider only boolean or at least discrete nodes) are directly applicable in the context of typical machine learning algorithms such as neural networks (that compute continuous representations of their input). Orponen [23] surveys theoretical results in computational complexity that are relevant to learning algorithms. 
For instance, H˚ stad and Goldmann [12] extended some results to the case of networks of linear threshold units a with positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity issues in networks of Gaussian radial basis functions and decision trees, showing intrinsic limitations of these architectures e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally discuss the advantages of depth in boolean circuit in the context of learning architectures. Bengio [3] suggests that some polynomials could be represented more efﬁciently by deep sumproduct networks, but without providing any formal statement or proofs. This work partly addresses this void by demonstrating families of circuits for which a deep architecture can be exponentially more efﬁcient than a shallow one in the context of real-valued polynomials. Note that we do not address in this paper the problem of learning these parameters: even if an efﬁcient deep representation exists for the function we seek to approximate, in general there is no 2 guarantee for standard optimization algorithms to easily converge to this representation. This paper focuses on the representational power of deep sum-product circuits compared to shallow ones, and studies it by considering particular families of target functions (to be represented by the learner). We ﬁrst formally deﬁne sum-product networks. We consider two families of functions represented by deep sum-product networks (families F and G). For each family, we establish a lower bound on the minimal number of hidden units a depth-2 sum-product network would require to represent a function of this family, showing it is much less efﬁcient than the deep representation. 2 Sum-product networks Deﬁnition 1. A sum-product network is a network composed of units that either compute the product of their inputs or a weighted sum of their inputs (where weights are strictly positive). 
Here, we restrict our definition of the generic term "sum-product network" to networks whose summation units have positive incoming weights (see footnote 1), while others are called "negative-weight" networks.

Definition 2. A "negative-weight" sum-product network may contain summation units whose weights are non-positive (i.e. less than or equal to zero).

Finally, we formally define what we mean by deep vs. shallow networks in the rest of the paper.

Definition 3. A "shallow" sum-product network contains a single hidden layer (i.e. a total of three layers when counting the input and output layers, and a depth equal to two).

Definition 4. A "deep" sum-product network contains more than one hidden layer (i.e. a total of at least four layers, and a depth at least three).

3 The family F

3.1 Definition

The first family of functions we study, denoted by F, is made of functions built from deep sum-product networks that alternate layers of product and sum units with two inputs each (details are provided below). The basic idea we use here is that composing layers (i.e. using a deep architecture) is equivalent to using a factorized representation of the polynomial function computed by the network. Such a factorized representation can be exponentially more compact than its expansion as a sum of products (which can be associated to a shallow network with product units in its hidden layer and a sum unit as output). This is what we formally show in what follows.

[Figure 1: Sum-product network computing the function f ∈ F such that $i = \lambda_{11} = \mu_{11} = 1$. The product units $\ell^1_1 = x_1 x_2$ and $\ell^1_2 = x_3 x_4$ feed the sum unit $\ell^2_1 = \lambda_{11}\ell^1_1 + \mu_{11}\ell^1_2 = x_1 x_2 + x_3 x_4 = f(x_1, x_2, x_3, x_4)$.]

Let $n = 4^i$, with i a positive integer value. Denote by $\ell^0$ the input layer containing scalar variables $\{x_1, \dots, x_n\}$, such that $\ell^0_j = x_j$ for $1 \le j \le n$.
Now define f ∈ F as any function computed by a sum-product network (deep for i ≥ 2) composed of alternating product and sum layers:
• $\ell^{2k+1}_j = \ell^{2k}_{2j-1} \cdot \ell^{2k}_{2j}$ for $0 \le k \le i-1$ and $1 \le j \le 2^{2(i-k)-1}$
• $\ell^{2k}_j = \lambda_{jk}\,\ell^{2k-1}_{2j-1} + \mu_{jk}\,\ell^{2k-1}_{2j}$ for $1 \le k \le i$ and $1 \le j \le 2^{2(i-k)}$
where the weights $\lambda_{jk}$ and $\mu_{jk}$ of the summation units are strictly positive. The output of the network is given by $f(x_1, \dots, x_n) = \ell^{2i}_1 \in \mathbb{R}$, the unique unit in the last layer. The corresponding (shallow) network for i = 1 and additive weights set to one is shown in Figure 1 (this architecture is also the basic building block of bigger networks for i > 1). Note that both the input size $n = 4^i$ and the network's depth 2i increase with parameter i.

Footnote 1: This condition is required by some of the proofs presented here.

3.2 Theoretical results

The main result of this section is presented below in Corollary 1, providing a lower bound on the minimum number of hidden units required by a shallow sum-product network to represent a function f ∈ F. The high-level proof sketch consists of the following steps:
(1) Count the number of unique products found in the polynomial representation of f (Lemma 1 and Proposition 1).
(2) Show that the only possible architecture for a shallow sum-product network to compute f is to have a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5).
(3) Conclude that the number of hidden units must be at least the number of unique products computed in step (1) (Lemma 6 and Corollary 1).

Lemma 1. Any element $\ell^k_j$ can be written as a (positively) weighted sum of products of input variables, such that each input variable $x_t$ is used in exactly one unit of $\ell^k$. Moreover, the number $m_k$ of products found in the sum computed by $\ell^k_j$ does not depend on j and obeys the following recurrence rule for k ≥ 0: if k + 1 is odd, then $m_{k+1} = m_k^2$, otherwise $m_{k+1} = 2 m_k$.

Proof. We prove the lemma by induction on k. It is obviously true for k = 0 since $\ell^0_j = x_j$.
Assuming this is true for some k ≥ 0, we consider two cases:
• If k + 1 is odd, then $\ell^{k+1}_j = \ell^k_{2j-1} \cdot \ell^k_{2j}$. By the inductive hypothesis, it is the product of two (positively) weighted sums of products of input variables, and no input variable can appear in both $\ell^k_{2j-1}$ and $\ell^k_{2j}$, so the result is also a (positively) weighted sum of products of input variables. Additionally, if the number of products in $\ell^k_{2j-1}$ and $\ell^k_{2j}$ is $m_k$, then $m_{k+1} = m_k^2$, since all products involved in the multiplication of the two units are different (since they use disjoint subsets of input variables), and the sums have positive weights. Finally, by the induction assumption, an input variable appears in exactly one unit of $\ell^k$. This unit is an input to a single unit of $\ell^{k+1}$, which will thus be the only unit of $\ell^{k+1}$ where this input variable appears.
• If k + 1 is even, then $\ell^{k+1}_j = \lambda_{jk}\,\ell^k_{2j-1} + \mu_{jk}\,\ell^k_{2j}$. Again, from the induction assumption, it must be a (positively) weighted sum of products of input variables, but with $m_{k+1} = 2 m_k$ such products. As in the previous case, an input variable will appear in the single unit of $\ell^{k+1}$ that has as input the single unit of $\ell^k$ in which this variable must appear.

Proposition 1. The number of products in the sum computed in the output unit $\ell^{2i}_1$ of a network computing a function in F is $m_{2i} = 2^{\sqrt{n}-1}$.

Proof. We first prove by induction on k ≥ 1 that for odd k, $m_k = 2^{2^{\frac{k+1}{2}}-2}$, and for even k, $m_k = 2^{2^{\frac{k}{2}}-1}$. This is obviously true for k = 1 since $2^{2^{\frac{1+1}{2}}-2} = 2^0 = 1$, and all units in $\ell^1$ are single products of the form $x_r x_s$. Assuming this is true for some k ≥ 1, then:
• if k + 1 is odd, then from Lemma 1 and the induction assumption, we have: $m_{k+1} = m_k^2 = \left(2^{2^{\frac{k}{2}}-1}\right)^2 = 2^{2^{\frac{k}{2}+1}-2} = 2^{2^{\frac{(k+1)+1}{2}}-2}$
• if k + 1 is even, then instead we have: $m_{k+1} = 2 m_k = 2 \cdot 2^{2^{\frac{k+1}{2}}-2} = 2^{2^{\frac{k+1}{2}}-1}$
which shows the desired result for k + 1, and thus concludes the induction proof.
Applying this result with k = 2i (which is even) yields $m_{2i} = 2^{2^{\frac{2i}{2}}-1} = 2^{\sqrt{2^{2i}}-1} = 2^{\sqrt{n}-1}$.

Lemma 2. The products computed in the output unit $\ell^{2i}_1$ can be split in two groups, one with products containing only variables $x_1, \dots, x_{n/2}$ and one containing only variables $x_{n/2+1}, \dots, x_n$.

Proof. This is obvious since the last unit is a "sum" unit that adds two terms whose inputs are these two groups of variables (see e.g. Fig. 1).

Lemma 3. The products computed in the output unit $\ell^{2i}_1$ involve more than one input variable.

Proof. It is straightforward to show by induction on k ≥ 1 that the products computed by $\ell^k_j$ all involve more than one input variable, thus it is true in particular for the output layer (k = 2i).

Lemma 4. Any shallow sum-product network computing f ∈ F must have a "sum" unit as output.

Proof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative. This unit must have more than one input, because in the case that it has only one input, the output would be either a (weighted) sum of input variables (which would violate Lemma 3), or a single product of input variables (which would violate Proposition 1), depending on the type (sum or product) of the single input hidden unit. Thus the last unit must compute a product of two or more hidden units. It can be re-written as a product of two factors, where each factor corresponds to either one hidden unit, or a product of multiple hidden units (it does not matter here which specific factorization is chosen among all possible ones). Regardless of the type (sum or product) of the hidden units involved, those two factors can thus be written as weighted sums of products of variables $x_t$ (with positive weights, and input variables potentially raised to powers above one). From Lemma 1, both $x_1$ and $x_n$ must be present in the final output, and thus they must appear in at least one of these two factors.
Without loss of generality, assume $x_1$ appears in the first factor. Variables $x_{n/2+1}, \dots, x_n$ then cannot be present in the second factor, since otherwise one product in the output would contain both $x_1$ and one of these variables (this product cannot cancel out since weights must be positive), violating Lemma 2. But with a similar reasoning, since as a result $x_n$ must appear in the first factor, variables $x_1, \dots, x_{n/2}$ cannot be present in the second factor either. Consequently, no input variable can be present in the second factor, leading to the desired contradiction.

Lemma 5. Any shallow sum-product network computing f ∈ F must have only multiplicative units in its hidden layer.

Proof. By contradiction, suppose there exists a "sum" unit in the hidden layer, written $s = \sum_{t \in S} \alpha_t x_t$ with S the set of input indices appearing in this sum, and $\alpha_t > 0$ for all t ∈ S. Since according to Lemma 4 the output unit must also be a sum (and have positive weights according to Definition 1), then the final output will also contain terms of the form $\beta_t x_t$ for t ∈ S, with $\beta_t > 0$. This violates Lemma 3, establishing the contradiction.

Lemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing f ∈ F must have at least $2^{\sqrt{n}-1}$ hidden units, if its output unit is a sum and its hidden units are products.

Proof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a product of input variables, i.e. its output can be written as $\sum_j w_j \prod_t x_t^{\gamma_{jt}}$ with $w_j \in \mathbb{R}$ and $\gamma_{jt} \in \{0, 1\}$. In order to compute a function in F, this shallow network thus needs a number of hidden units at least equal to the number of unique products in that function. From Proposition 1, this number is equal to $2^{\sqrt{n}-1}$.

Corollary 1. Any shallow sum-product network computing f ∈ F must have at least $2^{\sqrt{n}-1}$ hidden units.

Proof.
This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that hidden units are products), and 6 (showing the desired result for any shallow network with this specific structure, regardless of the sign of weights).

3.3 Discussion

Corollary 1 above shows that in order to compute some function in F with n inputs, the number of units in a shallow network has to be at least $2^{\sqrt{n}-1}$ (i.e. it grows exponentially in $\sqrt{n}$). On the other hand, the total number of units in the deep (for i > 1) network computing the same function, as described in Section 3.1, is equal to $1 + 2 + 4 + 8 + \dots + 2^{2i-1}$ (since all units are binary), which is also equal to $2^{2i} - 1 = n - 1$ (i.e. it grows only quadratically in $\sqrt{n}$). It shows that some deep sum-product network with n inputs and depth $O(\log n)$ can represent with $O(n)$ units what would require $O(2^{\sqrt{n}})$ units for a depth-2 network. Lemma 6 also shows a similar result regardless of the sign of the weights in the summation units of the depth-2 network, but assumes a specific architecture for this network (products in the hidden layer with a sum as output).

4 The family G

In this section we present similar results with a different family of functions, denoted by G. Compared to F, one important difference of deep sum-product networks built to define functions in G is that they can vary their input size independently of their depth. Their analysis thus provides additional insight when comparing the representational efficiency of deep vs. shallow sum-product networks in the case of a fixed dataset.

4.1 Definition

Networks in family G also alternate sum and product layers, but their units have as inputs all units from the previous layer except one. More formally, define the family $G = \cup_{n \ge 2,\, i \ge 0}\, G_{in}$ of functions represented by sum-product networks, where the sub-family $G_{in}$ is made of all sum-product networks with n input variables and 2i + 2 layers (including the input layer $\ell^0$), such that:
1. $\ell^1$ contains summation units; further layers alternate multiplicative and summation units.
2. Summation units have positive weights.
3. All layers are of size n, except the last layer $\ell^{2i+1}$ that contains a single sum unit that sums all units in the previous layer $\ell^{2i}$.
4. In each layer $\ell^k$ for $1 \le k \le 2i$, each unit $\ell^k_j$ takes as inputs $\{\ell^{k-1}_m \mid m \ne j\}$.

An example of a network belonging to $G_{1,3}$ (i.e. with three layers and three input variables) is shown in Figure 2.

[Figure 2: Sum-product network computing a function of $G_{1,3}$ (summation units' weights are all 1's). The first layer computes $\ell^1_1 = x_2 + x_3$, $\ell^1_2 = x_1 + x_3$, $\ell^1_3 = x_1 + x_2$; the product layer computes $\ell^2_1 = x_1^2 + x_1 x_2 + x_1 x_3 + x_2 x_3$, $\ell^2_2 = \dots$, $\ell^2_3 = x_3^2 + x_1 x_2 + x_1 x_3 + x_2 x_3$; the output is $\ell^3_1 = x_1^2 + x_2^2 + x_3^2 + 3(x_1 x_2 + x_1 x_3 + x_2 x_3) = g(x_1, x_2, x_3)$.]

4.2 Theoretical results

The main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden units of a shallow sum-product network computing g ∈ G. The proof sketch is as follows:
1. We show that the polynomial expansion of g must contain a large set of products (Proposition 2 and Corollary 2).
2. We use both the number of products in that set as well as their degree to establish the desired lower bound (Proposition 3).

We will also need the following lemma, which states that when n − 1 items each belong to n − 1 sets among a total of n sets, then we can associate to each item one of the sets it belongs to without using the same set for different items.

Lemma 7. Let $S_1, \dots, S_n$ be n sets (n ≥ 2) containing elements of $\{P_1, \dots, P_{n-1}\}$, such that for any q, $|\{r \mid P_q \in S_r\}| \ge n - 1$ (i.e. each element $P_q$ belongs to at least n − 1 sets). Then there exist $r_1, \dots, r_{n-1}$ different indices such that $P_q \in S_{r_q}$ for $1 \le q \le n - 1$.

Proof. Omitted due to lack of space (very easy to prove by construction).

Proposition 2.
For any $0 \le j \le i$, and any product of variables $P = \prod_{t=1}^{n} x_t^{\alpha_t}$ such that $\alpha_t \in \mathbb{N}$ and $\sum_t \alpha_t = (n-1)^j$, there exists a unit in $\ell^{2j}$ whose computed value, when expanded as a weighted sum of products, contains P among these products.

Proof. We prove this proposition by induction on j. First, for j = 0, this is obvious since any P of this form must be made of a single input variable $x_t$, which appears in $\ell^0_t = x_t$. Suppose now the proposition is true for some j < i. Consider a product $P = \prod_{t=1}^{n} x_t^{\alpha_t}$ such that $\alpha_t \in \mathbb{N}$ and $\sum_t \alpha_t = (n-1)^{j+1}$. P can be factored in n − 1 sub-products of degree $(n-1)^j$, i.e. written $P = P_1 \cdots P_{n-1}$ with $P_q = \prod_{t=1}^{n} x_t^{\beta_{qt}}$, $\beta_{qt} \in \mathbb{N}$ and $\sum_t \beta_{qt} = (n-1)^j$ for all q. By the induction hypothesis, each $P_q$ can be found in at least one unit $\ell^{2j}_{k_q}$. As a result, by property 4 (in the definition of family G), each $P_q$ will also appear in the additive layer $\ell^{2j+1}$, in at least n − 1 different units (the only sum unit that may not contain $P_q$ is the one that does not have $\ell^{2j}_{k_q}$ as input). By Lemma 7, we can thus find a set of units $\ell^{2j+1}_{r_q}$ such that for any $1 \le q \le n - 1$, the product $P_q$ appears in $\ell^{2j+1}_{r_q}$, with indices $r_q$ being different from each other. Let $1 \le s \le n$ be such that $s \ne r_q$ for all q. Then, from property 4 of family G, the multiplicative unit $\ell^{2(j+1)}_s$ computes the product $\prod_{q=1}^{n-1} \ell^{2j+1}_{r_q}$, and as a result, when expanded as a sum of products, it contains in particular $P_1 \cdots P_{n-1} = P$. The proposition is thus true for j + 1, and by induction, is true for all j ≤ i.

Corollary 2. The output $g_{in}$ of a sum-product network in $G_{in}$, when expanded as a sum of products, contains all products of variables of the form $\prod_{t=1}^{n} x_t^{\alpha_t}$ such that $\alpha_t \in \mathbb{N}$ and $\sum_t \alpha_t = (n-1)^i$.

Proof. Applying Proposition 2 with j = i, we obtain that all products of this form can be found in the multiplicative units of $\ell^{2i}$.
Since the output unit $\ell^{2i+1}_1$ computes a sum of these multiplicative units (weighted with positive weights), those products are also present in the output.

Proposition 3. A shallow negative-weight sum-product network computing $g_{in} \in G_{in}$ must have at least $(n-1)^i$ hidden units.

Proof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute $g_{in}$, assuming we allow multiplicative units in the hidden layer to use powers of their inputs in the product they compute (which we allow here for the proof to be more generic). However, it will require at least as many of these units as the number of unique products that can be found in the expansion of $g_{in}$. In particular, from Corollary 2, it will require at least the number of unique tuples of the form $(\alpha_1, \dots, \alpha_n)$ such that $\alpha_t \in \mathbb{N}$ and $\sum_{t=1}^{n} \alpha_t = (n-1)^i$. Denoting $d_{ni} = (n-1)^i$, this number is known to be equal to $\binom{n + d_{ni} - 1}{d_{ni}}$, and it is easy to verify it is higher than (or equal to) $d_{ni}$ for any n ≥ 2 and i ≥ 0. Now suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit, otherwise it would mean one could factor some input variable $x_t$ in the computed function output: this is not possible since by Corollary 2, for any variable $x_t$ there exist products in the output function that do not involve $x_t$. So all hidden units must be additive, and since the computed function contains products of degree $d_{ni}$, there must be at least $d_{ni}$ such hidden units.

4.3 Discussion

Proposition 3 shows that in order to compute the same function as $g_{in} \in G_{in}$, the number of units in the shallow network has to grow exponentially in i, i.e. in the network's depth (while the deep network's size grows linearly in i). The shallow network also needs to grow polynomially in the number of input variables n (with a degree equal to i), while the deep network grows only linearly in n.
It means that some deep sum-product network with n inputs and depth $O(i)$ can represent with $O(ni)$ units what would require $O((n-1)^i)$ units for a depth-2 network. Note that in the similar results found for family F, the depth-2 network computing the same function as a function in F had to be constrained to either have a specific combination of sum and hidden units (in Lemma 6) or to have non-negative weights (in Corollary 1). On the contrary, the result presented here for family G holds without requiring any of these assumptions.

5 Conclusion

We compared a deep sum-product network and a shallow sum-product network representing the same function, taken from two families of functions F and G. For both families, we have shown that the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, so as to represent the same functions. The deep version thus offers a much more compact representation of the same functions. This work focuses on two specific families of functions: finding more general parameterization of functions leading to similar results would be an interesting topic for future research. Another open question is whether it is possible to represent such functions only approximately (e.g. up to an error bound ε) with a much smaller shallow network. Results by Braverman [8] on boolean circuits suggest that similar results as those presented in this paper may still hold, but this topic has yet to be formally investigated in the context of sum-product networks. A related problem is also to look into functions defined only on discrete input variables: our proofs do not trivially extend to this situation because we cannot assume anymore that two polynomials yielding the same output values must have the same expansion coefficients (since the number of input combinations becomes finite).
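
The counting arguments for families F and G can be checked numerically on small instances. The sketch below is only an illustration, not code from the paper: it assumes all summation weights equal 1, represents a polynomial as a dictionary from exponent tuples to coefficients, and uses hypothetical helper names (`variables`, `poly_add`, `poly_mul`, `expand_F`, `expand_G`). It expands the deep networks symbolically and counts the distinct products that a shallow network with a sum output would need as hidden units.

```python
from itertools import product
from math import comb, isqrt

# A polynomial is a dict {exponent tuple: coefficient}.
# Assumption for this sketch: every summation weight is 1.

def variables(n):
    """Return the n input polynomials x_1, ..., x_n."""
    return [{tuple(int(t == j) for t in range(n)): 1} for j in range(n)]

def poly_add(p, q):
    """Sum of two polynomials."""
    out = dict(p)
    for mono, c in q.items():
        out[mono] = out.get(mono, 0) + c
    return out

def poly_mul(p, q):
    """Product of two polynomials (exponents add, coefficients multiply)."""
    out = {}
    for (m1, c1), (m2, c2) in product(p.items(), q.items()):
        mono = tuple(a + b for a, b in zip(m1, m2))
        out[mono] = out.get(mono, 0) + c1 * c2
    return out

def expand_F(i):
    """Expand the family-F network with n = 4**i inputs.

    Odd layers multiply adjacent pairs, even layers add them.
    Returns (expanded polynomial, number of non-input units).
    """
    layer = variables(4 ** i)
    units = 0
    for k in range(1, 2 * i + 1):
        op = poly_mul if k % 2 == 1 else poly_add
        layer = [op(a, b) for a, b in zip(layer[0::2], layer[1::2])]
        units += len(layer)
    (f,) = layer  # the last layer holds a single unit
    return f, units

def expand_G(n, i):
    """Expand the family-G network G_{in}.

    Unit j of layer k combines all units of layer k-1 except unit j:
    odd layers sum, even layers multiply; a final unit sums layer 2i.
    """
    layer = variables(n)
    for k in range(1, 2 * i + 1):
        op = poly_add if k % 2 == 1 else poly_mul
        new_layer = []
        for j in range(n):
            inputs = [layer[m] for m in range(n) if m != j]
            acc = inputs[0]
            for u in inputs[1:]:
                acc = op(acc, u)
            new_layer.append(acc)
        layer = new_layer
    out = layer[0]
    for u in layer[1:]:
        out = poly_add(out, u)
    return out

if __name__ == "__main__":
    for i in (1, 2):
        n = 4 ** i
        f, units = expand_F(i)
        # deep units (n - 1) versus distinct products, next to 2**(sqrt(n) - 1)
        print(n, units, len(f), 2 ** (isqrt(n) - 1))
    n, i = 3, 2
    d = (n - 1) ** i
    # distinct products in g versus the binomial count of degree-d monomials
    print(len(expand_G(n, i)), comb(n + d - 1, d))
```

For family F the number of distinct products should match $2^{\sqrt{n}-1}$ while the deep network uses n − 1 units; for family G the number of distinct products should match the binomial count used in the proof of Proposition 3. Symbolic expansion is itself exponential, so this is only practical for very small n and i.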
Acknowledgments The authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving this manuscript, as well as the anonymous reviewers for their careful reviews. This work was partially funded by NSERC, CIFAR, and the Canada Research Chairs. References [1] Ajtai, M. (1983). P1 1 -formulae on ﬁnite structures. Annals of Pure and Applied Logic, 24(1), 1–48. [2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–18. Lecture Notes in Computer Science 1180, Springer Verlag. [3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book. Now Publishers, 2009. [4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press. [5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS’05, pages 107–114. MIT Press, Cambridge, MA. [6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS 19, pages 153–160. MIT Press. [7] Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467. [8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115. [9] Collobert, R. and Weston, J. (2008). A uniﬁed architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167. [10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the meancovariance restricted boltzmann machine. In Advances in Neural Information Processing Systems (NIPS). 8 [11] H˚ stad, J. (1986). 
Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th a annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press. [12] H˚ stad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational a Complexity, 1, 113–129. [13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. [14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554. [15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’10. [16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML’07, pages 473–480. ACM. [17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07, pages 873–880. MIT Press, Cambridge, MA. [18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009. Montreal (Qc), Canada. [19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classiﬁcation using convolutional deep belief networks. In NIPS’09, pages 1096–1104. [20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta. [21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS’08, pages 1081–1088. [22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML’2009, pages 737–744. [23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110. 
[24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random ﬁeld. In NIPS’07, pages 1121–1128, Cambridge, MA. MIT Press. [25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI’2011, Barcelona, Spain. [26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML. [27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efﬁcient learning of sparse representations with an energy-based model. In NIPS’06, pages 1137–1144. MIT Press. [28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS’07, pages 1185–1192, Cambridge, MA. MIT Press. [29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier. [30] Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative ﬁltering. In ICML 2007, pages 791–798, New York, NY, USA. [31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56. [32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML’2011. [33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032. [34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS’06, pages 1345–1352. MIT Press, Cambridge, MA. [35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539. 
[36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA. [37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390. [38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10. 9

5 0.10156802 244 nips-2011-Selecting Receptive Fields in Deep Networks

Author: Adam Coates, Andrew Y. Ng

Abstract: Recent deep learning and unsupervised feature learning systems that learn from unlabeled data have achieved high performance in benchmarks by using extremely large architectures with many features (hidden units) at each layer. Unfortunately, for such large architectures the number of parameters can grow quadratically in the width of the network, thus necessitating hand-coded “local receptive ﬁelds” that limit the number of connections from lower level features to higher ones (e.g., based on spatial locality). In this paper we propose a fast method to choose these connections that may be incorporated into a wide variety of unsupervised training methods. Speciﬁcally, we choose local receptive ﬁelds that group together those low-level features that are most similar to each other according to a pairwise similarity metric. This approach allows us to harness the advantages of local receptive ﬁelds (such as improved scalability, and reduced data requirements) when we do not know how to specify such receptive ﬁelds by hand or where our unsupervised training algorithm has no obvious generalization to a topographic setting. We produce results showing how this method allows us to use even simple unsupervised training algorithms to train successful multi-layered networks that achieve state-of-the-art results on CIFAR and STL datasets: 82.0% and 60.1% accuracy, respectively. 1

6 0.090278976 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds

7 0.088750616 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion

8 0.087436765 5 nips-2011-A Denoising View of Matrix Completion

9 0.083427876 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

10 0.080714084 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

11 0.078813553 156 nips-2011-Learning to Learn with Compound HD Models

12 0.078581169 261 nips-2011-Sparse Filtering

13 0.068729289 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms

14 0.064263895 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

15 0.062886685 28 nips-2011-Agnostic Selective Classification

16 0.057722677 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

17 0.055127915 217 nips-2011-Practical Variational Inference for Neural Networks

18 0.054482002 276 nips-2011-Structured sparse coding via lateral inhibition

19 0.053829722 271 nips-2011-Statistical Tests for Optimization Efficiency

20 0.053297676 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.163), (1, 0.067), (2, -0.055), (3, -0.003), (4, -0.028), (5, 0.048), (6, 0.066), (7, 0.039), (8, -0.018), (9, -0.143), (10, -0.064), (11, 0.001), (12, 0.039), (13, -0.074), (14, 0.025), (15, -0.023), (16, -0.094), (17, 0.114), (18, -0.09), (19, 0.013), (20, 0.026), (21, 0.123), (22, 0.021), (23, 0.01), (24, 0.071), (25, -0.015), (26, -0.115), (27, -0.115), (28, -0.075), (29, -0.012), (30, 0.086), (31, 0.039), (32, -0.084), (33, -0.032), (34, 0.052), (35, 0.032), (36, -0.122), (37, 0.035), (38, 0.025), (39, 0.083), (40, -0.1), (41, 0.061), (42, 0.056), (43, 0.065), (44, 0.15), (45, -0.063), (46, 0.025), (47, 0.147), (48, 0.004), (49, -0.047)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91172004 287 nips-2011-The Manifold Tangent Classifier

Author: Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller

2 0.6802671 248 nips-2011-Semi-supervised Regression via Parallel Field Regularization

Author: Binbin Lin, Chiyuan Zhang, Xiaofei He

3 0.65950382 164 nips-2011-Manifold Precis: An Annealing Technique for Diverse Sampling of Manifolds

Author: Nitesh Shroff, Pavan Turaga, Rama Chellappa

Abstract: In this paper, we consider the Précis problem of sampling K representative yet diverse data points from a large dataset. This problem arises frequently in applications such as video and document summarization, exploratory data analysis, and pre-filtering. We formulate a general theory which encompasses not just traditional techniques devised for vector spaces, but also non-Euclidean manifolds, thereby enabling these techniques to be applied to shapes, human activities, textures and many other image and video based datasets. We propose intrinsic manifold measures for measuring the quality of a selection of points with respect to their representative power, and their diversity. We then propose efficient algorithms to optimize the cost function using a novel annealing-based iterative alternation algorithm. The proposed formulation is applicable to manifolds of known geometry as well as to manifolds whose geometry needs to be estimated from samples. Experimental results show the strength and generality of the proposed approach.

4 0.59629273 263 nips-2011-Sparse Manifold Clustering and Embedding

Author: Ehsan Elhamifar, René Vidal

5 0.58054179 5 nips-2011-A Denoising View of Matrix Completion

Author: Weiran Wang, Zhengdong Lu, Miguel Á. Carreira-Perpiñán

Abstract: In matrix completion, we are given a matrix where the values of only some of the entries are present, and we want to reconstruct the missing ones. Much work has focused on the assumption that the data matrix has low rank. We propose a more general assumption based on denoising, so that we expect that the value of a missing entry can be predicted from the values of neighboring points. We propose a nonparametric version of denoising based on local, iterated averaging with meanshift, possibly constrained to preserve local low-rank manifold structure. The few user parameters required (the denoising scale, number of neighbors and local dimensionality) and the number of iterations can be estimated by cross-validating the reconstruction error. Using our algorithms as a postprocessing step on an initial reconstruction (provided by e.g. a low-rank method), we show consistent improvements with synthetic, image and motion-capture data. Completing a matrix from a few given entries is a fundamental problem with many applications in machine learning, computer vision, network engineering, and data mining. Much interest in matrix completion has been caused by recent theoretical breakthroughs in compressed sensing [1, 2] as well as by the now celebrated Netﬂix challenge on practical prediction problems [3, 4]. Since completion of arbitrary matrices is not a well-posed problem, it is often assumed that the underlying matrix comes from a restricted class. Matrix completion models almost always assume a low-rank structure of the matrix, which is partially justiﬁed through factor models [4] and fast convex relaxation [2], and often works quite well when the observations are sparse and/or noisy. The low-rank structure of the matrix essentially asserts that all the column vectors (or the row vectors) live on a low-dimensional subspace. This assumption is arguably too restrictive for problems with richer structure, e.g. 
when each column of the matrix represents a snapshot of a seriously corrupted motion capture sequence (see section 3), for which a more ﬂexible model, namely a curved manifold, is more appropriate. In this paper, we present a novel view of matrix completion based on manifold denoising, which conceptually generalizes the low-rank assumption to curved manifolds. Traditional manifold denoising is performed on fully observed data [5, 6], aiming to send the data corrupted by noise back to the correct surface (deﬁned in some way). However, with a large proportion of missing entries, we may not have a good estimate of the manifold. Instead, we start with a poor estimate and improve it iteratively. Therefore the “noise” may be due not just to intrinsic noise, but mostly to inaccurately estimated missing entries. We show that our algorithm can be motivated from an objective purely based on denoising, and prove its convergence under some conditions. We then consider a more general case with a nonlinear low-dimensional manifold and use a stopping criterion that works successfully in practice. Our model reduces to a low-rank model when we require the manifold to be ﬂat, showing a relation with a recent thread of matrix completion models based on alternating projection [7]. In our experiments, we show that our denoising-based matrix completion model can make better use of the latent manifold structure on both artiﬁcial and real-world data sets, and yields superior recovery of the missing entries. The paper is organized as follows: section 1 reviews nonparametric denoising methods based on mean-shift updates, section 2 extends this to matrix completion by using denoising with constraints, section 3 gives experimental results, and section 4 discusses related work. 
1 Denoising with (manifold) blurring mean-shift algorithms (GBMS/MBMS)

In Gaussian blurring mean-shift (GBMS), denoising is performed in a nonparametric way by local averaging: each data point moves to the average of its neighbors (to a certain scale), and the process is repeated. We follow the derivation in [8]. Consider a dataset $\{\mathbf{x}_n\}_{n=1}^N \subset \mathbb{R}^D$ and define a Gaussian kernel density estimate

$$p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} G_\sigma(\mathbf{x}, \mathbf{x}_n) \tag{1}$$

with bandwidth $\sigma > 0$ and kernel $G_\sigma(\mathbf{x}, \mathbf{x}_n) \propto \exp\left(-\frac{1}{2} (\|\mathbf{x} - \mathbf{x}_n\| / \sigma)^2\right)$ (other kernels may be used, such as the Epanechnikov kernel, which results in sparse affinities). The (non-blurring) mean-shift algorithm rearranges the stationary point equation $\nabla p(\mathbf{x}) = \mathbf{0}$ into the iterative scheme $\mathbf{x}^{(\tau+1)} = \mathbf{f}(\mathbf{x}^{(\tau)})$ with

$$\mathbf{x}^{(\tau+1)} = \mathbf{f}(\mathbf{x}^{(\tau)}) = \sum_{n=1}^{N} p(n|\mathbf{x}^{(\tau)})\, \mathbf{x}_n, \qquad p(n|\mathbf{x}^{(\tau)}) = \frac{\exp\left(-\frac{1}{2} \|(\mathbf{x}^{(\tau)} - \mathbf{x}_n)/\sigma\|^2\right)}{\sum_{n'=1}^{N} \exp\left(-\frac{1}{2} \|(\mathbf{x}^{(\tau)} - \mathbf{x}_{n'})/\sigma\|^2\right)}. \tag{2}$$

This converges to a mode of $p$ from almost every initial $\mathbf{x} \in \mathbb{R}^D$, and can be seen as taking self-adapting step sizes along the gradient (since the mean shift $\mathbf{f}(\mathbf{x}) - \mathbf{x}$ is parallel to $\nabla p(\mathbf{x})$). This iterative scheme was originally proposed by [9] and it or variations of it have found widespread application in clustering [8, 10–12] and denoising of 3D point sets (surface fairing; [13, 14]) and manifolds in general [5, 6].

The blurring mean-shift algorithm applies one step of the previous scheme, initialized from every point, in parallel for all points. That is, given the dataset $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, for each $\mathbf{x}_n \in \mathbf{X}$ we obtain a new point $\tilde{\mathbf{x}}_n = \mathbf{f}(\mathbf{x}_n)$ by applying one step of the mean-shift algorithm, and then we replace $\mathbf{X}$ with the new dataset $\tilde{\mathbf{X}}$, which is a blurred (shrunk) version of $\mathbf{X}$. By iterating this process we obtain a sequence of datasets $\mathbf{X}^{(0)}, \mathbf{X}^{(1)}, \dots$ (and a corresponding sequence of kernel density estimates $p^{(0)}(\mathbf{x}), p^{(1)}(\mathbf{x}), \dots$) where $\mathbf{X}^{(0)}$ is the original dataset and $\mathbf{X}^{(\tau)}$ is obtained by blurring $\mathbf{X}^{(\tau-1)}$ with one mean-shift step.
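As a rough illustration of one blurring step, the sketch below applies eq. (2) to every point at once with NumPy. The function name `gbms_step`, the toy data and the bandwidth value are assumptions made for this example and are not taken from the paper. The kernel is left unnormalized, which is enough here because the normalization constant cancels in p(n|x).

```python
import numpy as np

def gbms_step(X, sigma):
    """One Gaussian blurring mean-shift step on a D x N data matrix X.

    Returns (X_new, P), where P[n, m] = p(n | x_m) as in eq. (2) and
    X_new[:, m] = sum_n p(n | x_m) * x_n, i.e. X_new = X @ P.
    """
    # Squared Euclidean distances between all pairs of columns, shape (N, N).
    sq_norms = np.sum(X * X, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against round-off negatives
    # Unnormalized Gaussian affinities exp(-0.5 * (||x_n - x_m|| / sigma)^2).
    W = np.exp(-0.5 * sq_dists / sigma**2)
    # Normalize each column so that it sums to 1.
    P = W / W.sum(axis=0, keepdims=True)
    return X @ P, P

# Toy data (hypothetical): 50 points in 2D.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))
X_blurred, P = gbms_step(X, sigma=0.5)
```

Each new point is a convex combination of the old ones, so repeated calls shrink the dataset. This is the behavior that makes early stopping necessary when the step is used for denoising.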
We can see this process as maximizing the following objective function [10] by taking parallel steps of the form (2) for each point:

$$E(\mathbf{X}) = \sum_{n=1}^{N} p(\mathbf{x}_n) = \frac{1}{N} \sum_{n,m=1}^{N} G_\sigma(\mathbf{x}_n, \mathbf{x}_m) \propto \sum_{n,m=1}^{N} e^{-\frac{1}{2} \|(\mathbf{x}_n - \mathbf{x}_m)/\sigma\|^2}. \tag{3}$$

This process eventually converges to a dataset $\mathbf{X}^{(\infty)}$ where all points are coincident: a completely denoised dataset where all structure has been erased. As shown by [8], this process can be stopped early to return clusters (= locally denoised subsets of points); the number of clusters obtained is controlled by the bandwidth $\sigma$. However, here we are interested in the denoising behavior of GBMS.

The GBMS step can be formulated in a matrix form reminiscent of spectral clustering [8] as $\tilde{\mathbf{X}} = \mathbf{X}\mathbf{P}$ where $\mathbf{X} = (\mathbf{x}_1, \dots, \mathbf{x}_N)$ is a $D \times N$ matrix of data points; $\mathbf{W}$ is the $N \times N$ matrix of Gaussian affinities $w_{nm} = G_\sigma(\mathbf{x}_n, \mathbf{x}_m)$; $\mathbf{D} = \operatorname{diag}\left(\sum_{n=1}^{N} w_{nm}\right)$ is the degree matrix; and $\mathbf{P} = \mathbf{W}\mathbf{D}^{-1}$ is an $N \times N$ stochastic matrix: $p_{nm} = p(n|\mathbf{x}_m) \in (0, 1)$ and $\sum_{n=1}^{N} p_{nm} = 1$. $\mathbf{P}$ (or rather its transpose) is the stochastic matrix of the random walk in a graph [15], which in GBMS represents the posterior probabilities of each point under the kernel density estimate (1). $\mathbf{P}$ is similar to the matrix $\mathbf{N} = \mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}}$ derived from the normalized graph Laplacian commonly used in spectral clustering, e.g. in the normalized cut [16]. Since, by the Perron-Frobenius theorem [17, ch. 8], all left eigenvalues of $\mathbf{P}(\mathbf{X})$ have magnitude less than 1 except for one that equals 1 and is associated with an eigenvector of constant entries, iterating $\tilde{\mathbf{X}} = \mathbf{X}\mathbf{P}(\mathbf{X})$ converges to the stationary distribution of each $\mathbf{P}(\mathbf{X})$, where all points coincide.

From this point of view, the product $\tilde{\mathbf{X}} = \mathbf{X}\mathbf{P}(\mathbf{X})$ can be seen as filtering the dataset $\mathbf{X}$ with a data-dependent low-pass filter $\mathbf{P}(\mathbf{X})$, which makes clear the denoising behavior.
This also suggests using other filters [12] $\tilde{\mathbf{X}} = \mathbf{X}\,\phi(\mathbf{P}(\mathbf{X}))$ as long as $\phi(1) = 1$ and $|\phi(r)| < 1$ for $r \in [0, 1)$, such as explicit schemes $\phi(\mathbf{P}) = (1 - \eta)\mathbf{I} + \eta\mathbf{P}$ for $\eta \in (0, 2]$, power schemes $\phi(\mathbf{P}) = \mathbf{P}^n$ for $n = 1, 2, 3, \dots$ or implicit schemes $\phi(\mathbf{P}) = ((1 + \eta)\mathbf{I} - \eta\mathbf{P})^{-1}$ for $\eta > 0$.

One important problem with GBMS is that it denoises equally in all directions. When the data lies on a low-dimensional manifold, denoising orthogonally to it removes out-of-manifold noise, but denoising tangentially to it perturbs intrinsic degrees of freedom of the data and causes shrinkage of the entire manifold (most strongly near its boundary). To prevent this, the manifold blurring mean-shift algorithm (MBMS) [5] first computes a predictor averaging step with GBMS, and then for each point $\mathbf{x}_n$ a corrector projective step removes the step direction that lies in the local tangent space of $\mathbf{x}_n$ (obtained from local PCA run on its $k$ nearest neighbors). In practice, both GBMS and MBMS must be stopped early to prevent excessive denoising and manifold distortions.

2 Blurring mean-shift denoising algorithms for matrix completion

We consider the natural extension of GBMS to the matrix completion case by adding the constraints given by the present values. We use the subindex notation $\mathbf{X}_{\mathcal{M}}$ and $\mathbf{X}_{\mathcal{P}}$ to indicate selection of the missing or present values of the matrix $\mathbf{X}_{D \times N}$, where $\mathcal{P} \subset \mathcal{U}$, $\mathcal{M} = \mathcal{U} \setminus \mathcal{P}$ and $\mathcal{U} = \{(d, n)\colon d = 1, \dots, D,\ n = 1, \dots, N\}$. The indices $\mathcal{P}$ and values $\bar{\mathbf{X}}_{\mathcal{P}}$ of the present matrix entries are the data of the problem. Then we have the following constrained optimization problem:

$$\max_{\mathbf{X}} E(\mathbf{X}) = \sum_{n,m=1}^{N} G_\sigma(\mathbf{x}_n, \mathbf{x}_m) \quad \text{s.t.} \quad \mathbf{X}_{\mathcal{P}} = \bar{\mathbf{X}}_{\mathcal{P}}. \tag{4}$$

This is similar to low-rank formulations for matrix completion that have the same constraints but use as objective function the reconstruction error with a low-rank assumption, e.g. $\|\mathbf{X} - \mathbf{A}\mathbf{B}\mathbf{X}\|^2$ with $\mathbf{A}_{D \times L}$, $\mathbf{B}_{L \times D}$ and $L < D$.

We initialize $\mathbf{X}_{\mathcal{M}}$ to the output of some other method for matrix completion, such as singular value projection (SVP; [7]).
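To make the notation of problem (4) concrete, the following sketch evaluates the objective for a small matrix and represents the index set of present entries as a boolean mask. It is only an illustration under assumptions: the names `objective`, `refill`, `X_given` and `present`, the zero fill of the missing entries, and the toy sizes are all hypothetical, and the kernel is the unnormalized Gaussian.

```python
import numpy as np

def objective(X, sigma):
    """E(X) = sum over n, m of exp(-0.5 * ||x_n - x_m||^2 / sigma^2); X is D x N."""
    diffs = X[:, :, None] - X[:, None, :]      # pairwise differences, shape (D, N, N)
    sq_dists = np.sum(diffs**2, axis=0)        # squared distances, shape (N, N)
    return np.exp(-0.5 * sq_dists / sigma**2).sum()

def refill(X, X_given, present):
    """Enforce the constraint of (4): copy the given values into the present entries."""
    X = X.copy()
    X[present] = X_given[present]
    return X

rng = np.random.default_rng(0)
D, N = 3, 20
X_given = rng.normal(size=(D, N))              # holds the given values (used only where present)
present = rng.random((D, N)) < 0.5             # True where an entry is present
missing = ~present                             # complement: the missing entries
X = np.where(present, X_given, 0.0)            # crude initial fill, stands in for e.g. SVP
E0 = objective(X, sigma=1.0)
```

Because every diagonal term of the double sum equals 1, `E0` lies between `N` and `N * N`; the upper bound is reached only when all points coincide.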
For simple constraints such as ours, gradient projection algorithms are attractive. The gradient of $E$ wrt $\mathbf{X}$ is a matrix of $D \times N$ whose $n$th column is:

$$\nabla_{\mathbf{x}_n} E(\mathbf{X}) = \frac{2}{\sigma^2} \sum_{m=1}^{N} e^{-\frac{1}{2} \|(\mathbf{x}_n - \mathbf{x}_m)/\sigma\|^2} (\mathbf{x}_m - \mathbf{x}_n) \propto \frac{2}{\sigma^2}\, p(\mathbf{x}_n) \left( -\mathbf{x}_n + \sum_{m=1}^{N} p(m|\mathbf{x}_n)\, \mathbf{x}_m \right) \tag{5}$$

and its projection on the constraint space is given by zeroing its entries having indices in $\mathcal{P}$; call $\Pi_{\mathcal{P}}$ this projection operator. Then, we have the following step of length $\alpha \ge 0$ along the projected gradient:

$$\mathbf{X}^{(\tau+1)} = \mathbf{X}^{(\tau)} + \alpha\, \Pi_{\mathcal{P}}\big(\nabla_{\mathbf{X}} E(\mathbf{X}^{(\tau)})\big) \iff \mathbf{X}_{\mathcal{M}}^{(\tau+1)} = \mathbf{X}_{\mathcal{M}}^{(\tau)} + \alpha\, \Big(\Pi_{\mathcal{P}}\big(\nabla_{\mathbf{X}} E(\mathbf{X}^{(\tau)})\big)\Big)_{\mathcal{M}} \tag{6}$$

which updates only the missing entries $\mathbf{X}_{\mathcal{M}}$. Since our search direction is an ascent direction and makes an angle with the gradient that is bounded away from $\pi/2$, and $E$ is lower bounded, continuously differentiable and has bounded Hessian (thus a Lipschitz continuous gradient) in $\mathbb{R}^{NL}$, by carrying out a line search that satisfies the Wolfe conditions, we are guaranteed convergence to a local stationary point, typically a maximizer [18, th. 3.2]. However, as reasoned later, we do not perform a line search at all; instead we fix the step size to the GBMS self-adapting step size, which results in a simple and faster algorithm consisting of carrying out a GBMS step on $\mathbf{X}$ (i.e., $\mathbf{X}^{(\tau+1)} = \mathbf{X}^{(\tau)} \mathbf{P}(\mathbf{X}^{(\tau)})$) and then refilling $\mathbf{X}_{\mathcal{P}}$ to the present values. While we describe the algorithm in this way for ease of explanation, in practice we do not actually compute the GBMS step for all $x_{dn}$ values, but only for the missing ones, which is all we need. Thus, our algorithm carries out GBMS denoising steps within the missing-data subspace. We can derive this result in a different way by starting from the unconstrained optimization problem $\max_{\mathbf{X}_{\mathcal{P}}} E(\mathbf{X}) = \sum_{n,m=1}^{N} G_\sigma(\mathbf{x}_n, \mathbf{x}_m)$ (equivalent to (4)), computing its gradient wrt $\mathbf{X}_{\mathcal{P}}$, equating it to zero and rearranging (in the same way the mean-shift algorithm is derived) to obtain a fixed-point iteration identical to our update above. Fig.
1 shows the pseudocode for our denoising-based matrix completion algorithms (using three nonparametric denoising algorithms: GBMS, MBMS and LTP).

Convergence and stopping criterion. As noted above, we have guaranteed convergence by simply satisfying standard line search conditions, but a line search is costly. At present we do not have a proof that the GBMS step size satisfies such conditions, or indeed that the new iterate $\mathbf{X}_{\mathcal{M}}^{(\tau+1)}$ increases or leaves unchanged the objective, although we have never encountered a counterexample. In fact, it turns out that none of the work about GBMS that we know about proves that either: [10] proves that $\varnothing(\mathbf{X}^{(\tau+1)}) \le \rho\, \varnothing(\mathbf{X}^{(\tau)})$ for $0 < \rho < 1$, where $\varnothing(\cdot)$ is the set diameter, while [8, 12] note that $\mathbf{P}(\mathbf{X})$ has a single eigenvalue of value 1 and all others of magnitude less than 1. While this shows that all points converge to the same location, which indeed is the global maximum of (3), it does not necessarily follow that each step increases $E$.

However, the question of convergence as $\tau \to \infty$ has no practical interest in a denoising setting, because achieving a total denoising almost never yields a good matrix completion. What we want is to achieve just enough denoising and stop the algorithm, as was the case with GBMS clustering, and as is the case in algorithms for image denoising. We propose to determine the optimal number of iterations, as well as the bandwidth $\sigma$ and any other parameters, by cross-validation. Specifically, we select a held-out set by picking a random subset of the present entries and considering them as missing; this allows us to evaluate an error between our completion for them and the ground truth. We stop iterating when this error increases.

This argument justifies an algorithmic, as opposed to an optimization, view of denoising-based matrix completion: apply a denoising step, refill the present values, iterate until the validation error increases. This allows very general definitions of denoising, and indeed a low-rank projection is a form of denoising where points are not allowed outside the linear manifold. Our formulation using the objective function (4) is still useful in that it connects our denoising assumption with the more usual low-rank assumption that has been used in much matrix completion work, and justifies the refilling step as resulting from the present-data constraints under a gradient-projection optimization.

GBMS ($k$, $\sigma$) with full or $k$-nn graph: given $\mathbf{X}_{D \times N}$, $\mathcal{M}$
    repeat
        for $n = 1, \dots, N$
            $\mathcal{N}_n \leftarrow \{1, \dots, N\}$ (full graph) or $k$ nearest neighbors of $\mathbf{x}_n$ ($k$-nn graph)
            $\partial\mathbf{x}_n \leftarrow -\mathbf{x}_n + \sum_{m \in \mathcal{N}_n} \frac{G_\sigma(\mathbf{x}_n, \mathbf{x}_m)}{\sum_{m' \in \mathcal{N}_n} G_\sigma(\mathbf{x}_n, \mathbf{x}_{m'})}\, \mathbf{x}_m$    (mean-shift step)
        end
        $\mathbf{X}_{\mathcal{M}} \leftarrow \mathbf{X}_{\mathcal{M}} + (\partial\mathbf{X})_{\mathcal{M}}$    (move points' missing entries)
    until validation error increases
    return $\mathbf{X}$

MBMS ($L$, $k$, $\sigma$) with full or $k$-nn graph: given $\mathbf{X}_{D \times N}$, $\mathcal{M}$
    repeat
        for $n = 1, \dots, N$
            $\mathcal{N}_n \leftarrow \{1, \dots, N\}$ (full graph) or $k$ nearest neighbors of $\mathbf{x}_n$ ($k$-nn graph)
            $\partial\mathbf{x}_n \leftarrow -\mathbf{x}_n + \sum_{m \in \mathcal{N}_n} \frac{G_\sigma(\mathbf{x}_n, \mathbf{x}_m)}{\sum_{m' \in \mathcal{N}_n} G_\sigma(\mathbf{x}_n, \mathbf{x}_{m'})}\, \mathbf{x}_m$    (mean-shift step)
            $\mathcal{X}_n \leftarrow$ $k$ nearest neighbors of $\mathbf{x}_n$
            $(\boldsymbol{\mu}_n, \mathbf{U}_n) \leftarrow \text{PCA}(\mathcal{X}_n, L)$    (estimate $L$-dim tangent space at $\mathbf{x}_n$)
            $\partial\mathbf{x}_n \leftarrow (\mathbf{I} - \mathbf{U}_n \mathbf{U}_n^T)\, \partial\mathbf{x}_n$    (subtract parallel motion)
        end
        $\mathbf{X}_{\mathcal{M}} \leftarrow \mathbf{X}_{\mathcal{M}} + (\partial\mathbf{X})_{\mathcal{M}}$    (move points' missing entries)
    until validation error increases
    return $\mathbf{X}$

LTP ($L$, $k$) with $k$-nn graph: given $\mathbf{X}_{D \times N}$, $\mathcal{M}$
    repeat
        for $n = 1, \dots, N$
            $\mathcal{X}_n \leftarrow$ $k$ nearest neighbors of $\mathbf{x}_n$
            $(\boldsymbol{\mu}_n, \mathbf{U}_n) \leftarrow \text{PCA}(\mathcal{X}_n, L)$    (estimate $L$-dim tangent space at $\mathbf{x}_n$)
            $\partial\mathbf{x}_n \leftarrow (\mathbf{I} - \mathbf{U}_n \mathbf{U}_n^T)(\boldsymbol{\mu}_n - \mathbf{x}_n)$    (project point onto tangent space)
        end
        $\mathbf{X}_{\mathcal{M}} \leftarrow \mathbf{X}_{\mathcal{M}} + (\partial\mathbf{X})_{\mathcal{M}}$    (move points' missing entries)
    until validation error increases
    return $\mathbf{X}$

Figure 1: Our denoising matrix completion algorithms, based on Manifold Blurring Mean Shift (MBMS) and its particular cases Local Tangent Projection (LTP, $k$-nn graph, $\sigma = \infty$) and Gaussian Blurring Mean Shift (GBMS, $L = 0$); see [5] for details. $\mathcal{N}_n$ contains all $N$ points (full graph) or only $\mathbf{x}_n$'s nearest neighbors ($k$-nn graph). The index $\mathcal{M}$ selects the components of its input corresponding to missing values. Parameters: denoising scale $\sigma$, number of neighbors $k$, local dimensionality $L$.

MBMS denoising for matrix completion. Following our algorithmic-based approach to denoising, we could consider generalized GBMS steps of the form $\tilde{\mathbf{X}} = \mathbf{X}\,\phi(\mathbf{P}(\mathbf{X}))$. For clustering, Carreira-Perpiñán [12] found an overrelaxed explicit step $\phi(\mathbf{P}) = (1 - \eta)\mathbf{I} + \eta\mathbf{P}$ with $\eta \approx 1.25$ to achieve similar clusterings but faster. Here, we focus instead on the MBMS variant of GBMS that allows only for orthogonal, not tangential, point motions (defined wrt their local tangent space as estimated by local PCA), with the goal of preserving low-dimensional manifold structure. MBMS has 3 user parameters: the bandwidth $\sigma$ (for denoising), and the latent dimensionality $L$ and the number of neighbors $k$ (for the local tangent space and the neighborhood graph). A special case of MBMS called local tangent projection (LTP) results by using a neighborhood graph and setting $\sigma = \infty$ (so only two user parameters are needed: $L$ and $k$). LTP can be seen as doing a low-rank matrix completion locally. LTP was found in [5] to have nearly as good performance as the best $\sigma$ in several problems. MBMS also includes as particular cases GBMS ($L = 0$), PCA ($k = N$, $\sigma = \infty$), and no denoising ($\sigma = 0$ or $L = D$).

Note that if we apply MBMS to a dataset that lies on a linear manifold of dimensionality $d$ using $L \ge d$ then no denoising occurs whatsoever because the GBMS updates lie on the $d$-dimensional manifold and are removed by the corrector step. In practice, even if the data are assumed noiseless, the reconstruction from a low-rank method will lie close to but not exactly on the $d$-dimensional manifold. However, this suggests using largish ranks for the low-rank method used to reconstruct $\mathbf{X}$ and lower $L$ values in the subsequent MBMS run.

In summary, this yields a matrix completion algorithm where we apply an MBMS step, refill the present values, and iterate until the validation error increases.
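The loop summarized above (denoising step, refill, stop when the validation error rises) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `mbms_step` and `complete` are hypothetical, each point is counted among its own k nearest neighbors, the neighbors are recomputed at every iteration, the step is computed for all entries and then applied only to the movable ones, and the row-mean initialization is a crude stand-in for a method such as SVP.

```python
import numpy as np

def mbms_step(X, sigma, k, L):
    """MBMS displacement for every column of the D x N matrix X (k-nn graph)."""
    D, N = X.shape
    sq = np.sum(X * X, axis=0)
    sq_dists = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    # Assumption: the k nearest neighbours include the point itself.
    nbrs = np.argsort(sq_dists, axis=1)[:, :k]
    dX = np.zeros_like(X)
    for n in range(N):
        idx = nbrs[n]
        # Predictor: Gaussian-weighted average of the neighbours minus x_n.
        w = np.exp(-0.5 * sq_dists[n, idx] / sigma**2)
        step = X[:, idx] @ (w / w.sum()) - X[:, n]
        if L > 0:
            # Corrector: local PCA, then remove the component of the step
            # that lies in the estimated L-dimensional tangent space.
            centred = X[:, idx] - X[:, idx].mean(axis=1, keepdims=True)
            U = np.linalg.svd(centred, full_matrices=False)[0][:, :L]
            step = step - U @ (U.T @ step)
        dX[:, n] = step
    return dX

def val_error(X, X_ref, held):
    """Root sum of squared errors on the held-out entries."""
    return np.sqrt(np.sum((X[held] - X_ref[held]) ** 2))

def complete(X_init, free, X_ref, held, sigma, k, L, max_iter=50):
    """Apply MBMS steps to the `free` entries until the validation error rises."""
    X, err = X_init.copy(), val_error(X_init, X_ref, held)
    for _ in range(max_iter):
        X_new = X.copy()
        X_new[free] += mbms_step(X, sigma, k, L)[free]  # fixed entries never move
        new_err = val_error(X_new, X_ref, held)
        if new_err > err:
            break
        X, err = X_new, new_err
    return X

# Toy data (hypothetical): a noisy closed curve in 3D with 30% missing entries.
rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=80)
X_true = np.vstack([np.cos(t), np.sin(t), 0.5 * np.cos(2 * t)])
X_true += 0.05 * rng.normal(size=X_true.shape)
missing = rng.random(X_true.shape) < 0.3
held = ~missing & (rng.random(X_true.shape) < 0.1)   # held-out present entries
free = missing | held                                # entries allowed to move
train = ~free
row_means = (X_true * train).sum(axis=1) / train.sum(axis=1)
X_init = np.where(train, X_true, row_means[:, None])
X_hat = complete(X_init, free, X_true, held, sigma=0.5, k=10, L=1)
```

Setting `L=0` skips the corrector, which under these assumptions reduces the step to GBMS on the k-nn graph. Note that `X_true` is read only at the held-out positions inside `complete`, mirroring the validation protocol described in the text.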
Again, in an actual implementation we compute the MBMS step only for the missing entries of $\mathbf{X}$. The shrinking problem of GBMS is less pronounced in our matrix completion setting, because we constrain some values not to change. Still, in agreement with [5], we find MBMS to be generally superior to GBMS.

Computational cost. With a full graph, the cost per iteration of GBMS and MBMS is $O(N^2 D)$ and $O(N^2 D + N(D + k)\min(D, k)^2)$, respectively. In practice with high-dimensional data, best denoising results are obtained using a neighborhood graph [5], so that the sums over points in eqs. (3) or (4) extend only to the neighbors. With a $k$-nearest-neighbor graph and if we do not update the neighbors at each iteration (which affects the result little), the respective cost per iteration is $O(NkD)$ and $O(NkD + N(D + k)\min(D, k)^2)$, thus linear in $N$. The graph is constructed on the initial $\mathbf{X}$ we use, consisting of the present values and an imputation for the missing ones achieved with a standard matrix completion method, and has a one-off cost of $O(N^2 D)$. The cost when we have a fraction $\mu = \frac{|\mathcal{M}|}{ND} \in [0, 1]$ of missing data is simply the above times $\mu$. Hence our mean-shift-based matrix completion algorithms run faster the more present data we have, and thus faster than the usual GBMS or MBMS case, where all data are effectively missing.

3 Experimental results

We compare with representative methods of several approaches: a low-rank matrix completion method, singular value projection (SVP [7], whose performance we found similar to that of alternating least squares, ALS [3, 4]); fitting a $D$-dimensional Gaussian model with EM and imputing the missing values of each $\mathbf{x}_n$ as the conditional mean $E\{\mathbf{x}_{n,\mathcal{M}_n} | \mathbf{x}_{n,\mathcal{P}_n}\}$ (we use the implementation of [19]); and the nonlinear method of [20] (nlPCA). We initialize GBMS and MBMS from some or all of these algorithms.
For methods with user parameters, we set them by cross-validation in the following way: we randomly select 10% of the present entries and pretend they are missing as well, we run the algorithm on the remaining 90% of the present values, and we evaluate the reconstruction at the 10% entries we kept earlier. We repeat this over different parameters’ values and pick the one with lowest reconstruction error. We then run the algorithm with these parameters values on the entire present data and report the (test) error with the ground truth for the missing values. 100D Swissroll We created a 3D swissroll data set with 3 000 points and lifted it to 100D with a random orthonormal mapping, and added a little noise (spherical Gaussian with stdev 0.1). We selected uniformly at random 6.76% of the entries to be present. We use the Gaussian model and SVP (ﬁxed rank = 3) as initialization for our algorithm. We typically ﬁnd that these initial X are very noisy (ﬁg. 3), with some reconstructed points lying between different branches of the manifold and causing a big reconstruction error. We ﬁxed L = 2 (the known dimensionality) for MBMS and cross-validated the other parameters: σ and k for MBMS and GBMS (both using k-nn graph), and the number of iterations τ to be used. Table 1 gives the performance of MBMS and GBMS for testing, along with their optimal parameters. Fig. 3 shows the results of different methods at a few iterations. MBMS initialized from the Gaussian model gives the most remarkable denoising effect. To show that there is a wide range of σ and number of iterations τ that give good performance with GBMS and MBMS, we ﬁx k = 50 and run the algorithm with varying σ values and plot the reconstruction error for missing entries over iterations in ﬁg. 2. 
Both GBMS and MBMS can achieve good denoising (and reconstruction), but MBMS is more robust, with good results occurring for a wide range of iterations, indicating it is able to preserve the manifold structure better.

Methods                      RSSE    mean   stdev
Gaussian                     168.1   2.63   1.59
  + GBMS (∞, 10, 0, 1)       165.8   2.57   1.61
  + MBMS (1, 20, 2, 25)      157.2   2.36   1.63
SVP                          156.8   1.94   2.10
  + GBMS (3, 50, 0, 1)       151.4   1.89   2.02
  + MBMS (3, 50, 2, 2)       151.8   1.87   2.05

Table 1: Swissroll data set: reconstruction errors obtained by different algorithms along with their optimal parameters ($\sigma$, $k$, $L$, no. iterations $\tau$). The three columns show the root sum of squared errors on missing entries, the mean, and the standard deviation of the pointwise reconstruction error, resp.

Methods                        RSSE   mean   stdev
nlPCA                          7.77   26.1   42.6
SVP                            6.99   21.8   39.3
  + GBMS (400, 140, 0, 1)      6.54   18.8   37.7
  + MBMS (500, 140, 9, 5)      6.03   17.0   34.9

Table 2: MNIST-7 data set: errors of the different algorithms and their optimal parameters ($\sigma$, $k$, $L$, no. iterations $\tau$). The three columns show the root sum of squared errors on missing entries ($\times 10^{-4}$), the mean, and the standard deviation of pixel errors, respectively.

Figure 2: Reconstruction error of GBMS/MBMS over iterations (each curve is a different $\sigma$ value). Panels: SVP + GBMS, SVP + MBMS, Gaussian + GBMS, Gaussian + MBMS. Horizontal axis: iteration $\tau$ (0 to 20); vertical axis: error (RSSE, 150 to 180). Curve labels: 0.3, 0.5, 1, 2, 3, 5, 8, 10, 15, 25, ∞.

Mocap data. We use the running-motion sequence 09 01 from the CMU mocap database with 148 samples (≈ 1.7 cycles) with 150 sensor readings (3D positions of 50 joints on a human body). The motion is intrinsically 1D, tracing a loop in 150D. We compare nlPCA, SVP, the Gaussian model, and MBMS initialized from the first three algorithms.
For nlPCA, we do a grid search for the weight decay coefficient while fixing its structure to be 2 × 10 × 150 units, and use an early stopping criterion. For SVP, we do grid search on {1, 2, 3, 5, 7, 10} for the rank. For MBMS ($L = 1$) and GBMS ($L = 0$), we do grid search for $\sigma$ and $k$. We report the reconstruction error as a function of the proportion of missing entries from 50% to 95%. For each missing-data proportion, we randomly select 5 different sets of present values and run all algorithms for them. Fig. 4 gives the mean errors of all algorithms. All methods perform well when missing-data proportion is small. nlPCA, being prone to local optima, is less stable than SVP and the Gaussian model, especially when the missing-data proportion is large. The Gaussian model gives the best and most stable initialization. At 95%, all methods fail to give an acceptable reconstruction, but up to 90% missing entries, MBMS and GBMS always beat the other algorithms. Fig. 4 shows selected reconstructions from all algorithms.

MNIST digit '7'. The MNIST digit '7' data set contains 6 265 greyscale (0–255) images of size 28 × 28. We create missing entries in a way reminiscent of run-length errors in transmission. We generate 16 to 26 rectangular boxes of an area approximately 25 pixels at random locations in each image and use them to black out pixels. In this way, we create a high dimensional data set (784 dimensions) with about 50% entries missing on average. Because of the loss of spatial correlations within the blocks, this missing data pattern is harder than random. The Gaussian model cannot handle such a big data set because it involves inverting large covariance matrices. nlPCA is also very slow and we cannot afford cross-validating its structure or the weight decay coefficient, so we picked a reasonable structure (10 × 30 × 784 units), used the default weight decay parameter in the code ($10^{-3}$), and allowed up to 500 iterations. We only use SVP as initialization for our algorithm.
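The block-occlusion pattern described above could be generated along the following lines. This is a sketch under assumptions: the text gives only the number of boxes (16 to 26) and their approximate area (25 pixels), so the way box shapes are drawn here (a random height between 2 and 12, with the width chosen to give roughly 25 pixels) and the function name `block_mask` are hypothetical choices, not the authors' procedure.

```python
import numpy as np

def block_mask(rng, height=28, width=28, min_boxes=16, max_boxes=26, area=25):
    """Boolean mask (True = missing) built from random rectangular boxes."""
    missing = np.zeros((height, width), dtype=bool)
    n_boxes = int(rng.integers(min_boxes, max_boxes + 1))  # inclusive upper end
    for _ in range(n_boxes):
        # Assumption: draw a box height, then pick the width that gives
        # roughly `area` pixels (between 20 and 28 for the default settings).
        h = int(rng.integers(2, 13))
        w = max(1, round(area / h))
        top = int(rng.integers(0, height - h + 1))
        left = int(rng.integers(0, width - w + 1))
        missing[top:top + h, left:left + w] = True          # boxes may overlap
    return missing

rng = np.random.default_rng(0)
mask = block_mask(rng)
image = np.full((28, 28), 255.0)          # placeholder image, not MNIST data
occluded = np.where(mask, 0.0, image)     # black out the masked pixels
```

Because boxes may overlap, the fraction of missing pixels in a single image varies; `mask.mean()` reports it for a given draw.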
Since the intrinsic dimension of MNIST is suspected to be not very high, we used rank 10 for SVP and $L = 9$ for MBMS. We also use the same $k = 140$ as in [5]. So we only had to choose $\sigma$ and the number of iterations via cross-validation. Table 2 shows the methods and their corresponding error. Fig. 5 shows some representative reconstructions from different algorithms, with present values refilled. The mean-shift averaging among closeby neighbors (a soft form of majority voting) helps to eliminate noise, unusual strokes and other artifacts created by SVP, which by their nature tend to occur in different image locations over the neighborhood of images.

Figure 3: Denoising effect of the different algorithms. For visualization, we project the 100D data to 3D with the projection matrix used for creating the data. Present values are refilled for all plots. Panels: SVP ($\tau = 0$), SVP + GBMS ($\tau = 1$), SVP + MBMS ($\tau = 2$), Gaussian ($\tau = 0$), Gaussian + GBMS ($\tau = 1$), Gaussian + MBMS ($\tau = 25$).

Figure 4: Left: mean of errors (RSSE) of 5 runs obtained by different algorithms for varying percentage of missing values (horizontal axis: % of missing data, 50 to 95; vertical axis: error, 0 to 7000). Errorbars shown only for Gaussian + MBMS to avoid clutter. Right: sample reconstructions when 85% of the data is missing, for frame 2 (leg distance), frame 10 (foot pose) and frame 147 (leg pose). Row 1: initialization. Row 2: init+GBMS. Row 3: init+MBMS. Color indicates different initialization: black, original data; red, nlPCA; blue, SVP; green, Gaussian.
4 Related work

Matrix completion is widely studied in theoretical compressed sensing [1, 2] as well as practical recommender systems [3, 4]. Most matrix completion models rely on a low-rank assumption, and cannot fully exploit a more complex structure of the problem, such as curved manifolds. Related work also includes multi-task learning in a broad sense, which extracts the common structure shared by multiple related objects and achieves simultaneous learning on them. This includes applications such as alignment of noise-corrupted images [21], recovery of images with occlusion [22], and even learning of multiple related regressors or classifiers [23]. Again, all these works are essentially based on a subspace assumption, and do not generalize to more complex situations.

A line of work based on a nonlinear low-rank assumption (with a latent variable $\mathbf{z}$ of dimensionality $L < D$) involves setting up a least-squares error function

$$\min_{\mathbf{f}, \mathbf{Z}} \sum_{n=1}^{N} \|\mathbf{x}_n - \mathbf{f}(\mathbf{z}_n)\|^2 = \sum_{n,d=1}^{N,D} (x_{dn} - f_d(\mathbf{z}_n))^2$$

where one ignores the terms for which $x_{dn}$ is missing, and estimates the function $\mathbf{f}$ and the low-dimensional data projections $\mathbf{Z}$ by alternating optimization. Linear functions $\mathbf{f}$ have been used in the homogeneity analysis literature [24], where this approach is called "missing data deleted". Nonlinear functions $\mathbf{f}$ have been used recently (neural nets [20]; Gaussian processes for collaborative filtering [25]). Better results are obtained if adding a projection term $\sum_{n=1}^{N} \|\mathbf{z}_n - \mathbf{F}(\mathbf{x}_n)\|^2$ and optimizing over the missing data as well [26].

Figure 5: Selected reconstructions of MNIST block-occluded digits '7' with different methods. Columns: Orig, Missing, nlPCA, SVP, GBMS, MBMS.

Prior to our denoising-based work there have been efforts to extend the low-rank models to smooth manifolds, mostly in the context of compressed sensing. Baraniuk and Wakin [27] show that certain random measurements, e.g.
random projection to a low-dimensional subspace, can preserve the metric of the manifold fairly well, if the intrinsic dimension and the curvature of the manifold are both small enough. However, these observations are not suitable for matrix completion and no algorithm is given for recovering the signal. Chen et al. [28] explicitly model a pre-determined manifold, and use this to regularize the signal when recovering the missing values. They estimate the manifold given complete data, while no complete data is assumed in our matrix completion setting. Another related work is [29], where the manifold modeled with Isomap is used in estimating the positions of satellite cameras in an iterative manner. Finally, our expectation that the value of a missing entry can be predicted from the values of neighboring points is similar to one category of collaborative ﬁltering methods that essentially use similar users/items to predict missing values [3, 4]. 5 Conclusion We have proposed a new paradigm for matrix completion, denoising, which generalizes the commonly used assumption of low rank. Assuming low-rank implies a restrictive form of denoising where the data is forced to have zero variance away from a linear manifold. More general definitions of denoising can potentially handle data that lives in a low-dimensional manifold that is nonlinear, or whose dimensionality varies (e.g. a set of manifolds), or that does not have low rank at all, and naturally they handle noise in the data. Denoising works because of the fundamental fact that a missing value can be predicted by averaging nearby present values. Although we motivate our framework from a constrained optimization point of view (denoise subject to respecting the present data), we argue for an algorithmic view of denoising-based matrix completion: apply a denoising step, reﬁll the present values, iterate until the validation error increases. 
In turn, this allows different forms of denoising, such as based on low-rank projection (earlier work) or local averaging with blurring mean-shift (this paper). Our nonparametric choice of mean-shift averaging further relaxes assumptions about the data and results in a simple algorithm with very few user parameters that afford user control (denoising scale, local dimensionality) but can be set automatically by cross-validation. Our algorithms are intended to be used as a postprocessing step over a user-provided initialization of the missing values, and we show they consistently improve upon existing algorithms. The MBMS-based algorithm bridges the gap between pure denoising (GBMS) and local low rank. Other definitions of denoising should be possible, for example using temporal as well as spatial neighborhoods, and even applicable to discrete data if we consider denoising as a majority voting among the neighbours of a vector (with suitable definitions of votes and neighborhood).

Acknowledgments

Work supported by NSF CAREER award IIS–0754089.

References

[1] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, December 2009.
[2] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Information Theory, 56(5):2053–2080, April 2010.
[3] Yehuda Koren. Factorization meets the neighborhood: A multifaceted collaborative filtering model. SIGKDD 2008, pages 426–434, Las Vegas, NV, August 24–27 2008.
[4] Robert Bell and Yehuda Koren. Scalable collaborative filtering with jointly derived neighborhood interpolation weights. ICDM 2007, pages 43–52, October 28–31 2007.
[5] Weiran Wang and Miguel Á. Carreira-Perpiñán. Manifold blurring mean shift algorithms for manifold denoising. CVPR 2010, pages 1759–1766, San Francisco, CA, June 13–18 2010.
[6] Matthias Hein and Markus Maier. Manifold denoising. NIPS 2006, 19:561–568. MIT Press, 2007.
[7] Prateek Jain, Raghu Meka, and Inderjit S. Dhillon. Guaranteed rank minimization via singular value projection. NIPS 2010, 23:937–945. MIT Press, 2011.
[8] Miguel Á. Carreira-Perpiñán. Fast nonparametric clustering with Gaussian blurring mean-shift. ICML 2006, pages 153–160. Pittsburgh, PA, June 25–29 2006.
[9] Keinosuke Fukunaga and Larry D. Hostetler. The estimation of the gradient of a density function, with application in pattern recognition. IEEE Trans. Information Theory, 21(1):32–40, January 1975.
[10] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Trans. PAMI, 17(8):790–799, 1995.
[11] Dorin Comaniciu and Peter Meer. Mean shift: A robust approach toward feature space analysis. IEEE Trans. PAMI, 24(5):603–619, May 2002.
[12] Miguel Á. Carreira-Perpiñán. Generalised blurring mean-shift algorithms for nonparametric clustering. CVPR 2008, Anchorage, AK, June 23–28 2008.
[13] Gabriel Taubin. A signal processing approach to fair surface design. SIGGRAPH 1995, pages 351–358.
[14] Mathieu Desbrun, Mark Meyer, Peter Schröder, and Alan H. Barr. Implicit fairing of irregular meshes using diffusion and curvature flow. SIGGRAPH 1999, pages 317–324.
[15] Fan R. K. Chung. Spectral Graph Theory. American Mathematical Society, Providence, RI, 1997.
[16] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888–905, August 2000.
[17] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1986.
[18] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer-Verlag, New York, second edition, 2006.
[19] Tapio Schneider. Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 14(5):853–871, March 2001.
[20] Matthias Scholz, Fatma Kaplan, Charles L. Guy, Joachim Kopka, and Joachim Selbig. Non-linear PCA: A missing data approach. Bioinformatics, 21(20):3887–3895, October 15 2005.
[21] Yigang Peng, Arvind Ganesh, John Wright, Wenli Xu, and Yi Ma. RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images. CVPR 2010, pages 763–770, 2010.
[22] A. M. Buchanan and A. W. Fitzgibbon. Damped Newton algorithms for matrix factorization with missing data. CVPR 2005, pages 316–322, San Diego, CA, June 20–25 2005.
[23] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. NIPS 2006, 19:41–48. MIT Press, 2007.
[24] Albert Gifi. Nonlinear Multivariate Analysis. John Wiley & Sons, 1990.
[25] Neil D. Lawrence and Raquel Urtasun. Non-linear matrix factorization with Gaussian processes. ICML 2009, Montreal, Canada, June 14–18 2009.
[26] Miguel Á. Carreira-Perpiñán and Zhengdong Lu. Manifold learning and missing data recovery through unsupervised regression. ICDM 2011, December 11–14 2011.
[27] Richard G. Baraniuk and Michael B. Wakin. Random projections of smooth manifolds. Foundations of Computational Mathematics, 9(1):51–77, February 2009.
[28] Minhua Chen, Jorge Silva, John Paisley, Chunping Wang, David Dunson, and Lawrence Carin. Compressive sensing on manifolds using a nonparametric mixture of factor analyzers: Algorithm and performance bounds. IEEE Trans. Signal Processing, 58(12):6140–6155, December 2010.
[29] Michael B. Wakin. A manifold lifting algorithm for multi-view compressive imaging. In Proc. 27th Conference on Picture Coding Symposium (PCS'09), pages 381–384, 2009.

6 0.53007078 230 nips-2011-RTRMC: A Riemannian trust-region method for low-rank matrix completion

7 0.49884385 143 nips-2011-Learning Anchor Planes for Classification

8 0.48743659 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network

9 0.47778451 71 nips-2011-Directed Graph Embedding: an Algorithm based on Continuous Limits of Laplacian-type Operators

10 0.47050661 250 nips-2011-Shallow vs. Deep Sum-Product Networks

11 0.4677335 244 nips-2011-Selecting Receptive Fields in Deep Networks

12 0.42063677 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification

13 0.41758609 156 nips-2011-Learning to Learn with Compound HD Models

14 0.41652986 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection

15 0.39200118 167 nips-2011-Maximum Covariance Unfolding : Manifold Learning for Bimodal Data

16 0.37468255 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

17 0.36419779 64 nips-2011-Convergent Bounds on the Euclidean Distance

18 0.36019713 254 nips-2011-Similarity-based Learning via Data Driven Embeddings

19 0.34853441 150 nips-2011-Learning a Distance Metric from a Network

20 0.34343466 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.064), (4, 0.052), (20, 0.051), (26, 0.021), (31, 0.051), (33, 0.014), (40, 0.013), (43, 0.048), (45, 0.112), (51, 0.289), (57, 0.022), (65, 0.049), (74, 0.05), (83, 0.028), (99, 0.054)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.78422445 3 nips-2011-A Collaborative Mechanism for Crowdsourcing Prediction Problems

Author: Jacob D. Abernethy, Rafael M. Frongillo

Abstract: Machine Learning competitions such as the Netflix Prize have proven reasonably successful as a method of “crowdsourcing” prediction tasks. But these competitions have a number of weaknesses, particularly in the incentive structure they create for the participants. We propose a new approach, called a Crowdsourced Learning Mechanism, in which participants collaboratively “learn” a hypothesis for a given prediction task. The approach draws heavily from the concept of a prediction market, where traders bet on the likelihood of a future event. In our framework, the mechanism continues to publish the current hypothesis, and participants can modify this hypothesis by wagering on an update. The critical incentive property is that a participant will profit an amount that scales according to how much her update improves performance on a released test set.

same-paper 2 0.73820341 287 nips-2011-The Manifold Tangent Classifier

Author: Salah Rifai, Yann N. Dauphin, Pascal Vincent, Yoshua Bengio, Xavier Muller

3 0.73258275 89 nips-2011-Estimating time-varying input signals and ion channel states from a single voltage trace of a neuron

Author: Ryota Kobayashi, Yasuhiro Tsubo, Petr Lansky, Shigeru Shinomoto

Abstract: State-of-the-art statistical methods in neuroscience have enabled us to fit mathematical models to experimental data and subsequently to infer the dynamics of hidden parameters underlying the observable phenomena. Here, we develop a Bayesian method for inferring the time-varying mean and variance of the synaptic input, along with the dynamics of each ion channel from a single voltage trace of a neuron. An estimation problem may be formulated on the basis of the state-space model with prior distributions that penalize large fluctuations in these parameters. After optimizing the hyperparameters by maximizing the marginal likelihood, the state-space model provides the time-varying parameters of the input signals and the ion channel states. The proposed method is tested not only on the simulated data from the Hodgkin-Huxley type models but also on experimental data obtained from a cortical slice in vitro.

4 0.67217767 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment

Author: Sebastian A. Kurtek, Anuj Srivastava, Wei Wu

Abstract: While signal estimation under random amplitudes, phase shifts, and additive noise is studied frequently, the problem of estimating a deterministic signal under random time-warpings has been relatively unexplored. We present a novel framework for estimating the unknown signal that utilizes the action of the warping group to form an equivalence relation between signals. First, we derive an estimator for the equivalence class of the unknown signal using the notion of Karcher mean on the quotient space of equivalence classes. This step requires the use of Fisher-Rao Riemannian metric and a square-root representation of signals to enable computations of distances and means under this metric. Then, we define a notion of the center of a class and show that the center of the estimated class is a consistent estimator of the underlying unknown signal. This estimation algorithm has many applications: (1) registration/alignment of functional data, (2) separation of phase/amplitude components of functional data, (3) joint demodulation and carrier estimation, and (4) sparse modeling of functional data. Here we demonstrate only (1) and (2): Given signals are temporally aligned using nonlinear warpings and, thus, separated into their phase and amplitude components. The proposed method for signal alignment is shown to have state of the art performance using Berkeley growth, handwritten signatures, and neuroscience spike train data.

5 0.51401985 87 nips-2011-Energetically Optimal Action Potentials

Author: Martin B. Stemmler, Biswa Sengupta, Simon Laughlin, Jeremy Niven

Abstract: Most action potentials in the nervous system take on the form of strong, rapid, and brief voltage deflections known as spikes, in stark contrast to other action potentials, such as in the heart, that are characterized by broad voltage plateaus. We derive the shape of the neuronal action potential from first principles, by postulating that action potential generation is strongly constrained by the brain’s need to minimize energy expenditure. For a given height of an action potential, the least energy is consumed when the underlying currents obey the bang-bang principle: the currents giving rise to the spike should be intense, yet short-lived, yielding spikes with sharp onsets and offsets. Energy optimality predicts features in the biophysics that are not per se required for producing the characteristic neuronal action potential: sodium currents should be extraordinarily powerful and inactivate with voltage; both potassium and sodium currents should have kinetics that have a bell-shaped voltage-dependence; and the cooperative action of multiple ‘gates’ should start the flow of current.

1 The paradox

Nerve cells communicate with each other over long distances using spike-like action potentials, which are brief electrical events traveling rapidly down axons and dendrites. Each action potential is caused by an accelerating influx of sodium or calcium ions, depolarizing the cell membrane by forty millivolts or more, followed by repolarization of the cell membrane caused by an efflux of potassium ions. As different species of ions are swapped across the membrane during the action potential, ion pumps shuttle the excess ions back and restore the ionic concentration gradients.
If we label each ionic species by $\alpha$, the work $\Delta E$ done to restore the ionic concentration gradients is

$$\Delta E = R T V \sum_{\alpha} \Delta[\alpha]_{\mathrm{in}} \ln \frac{[\alpha]_{\mathrm{out}}}{[\alpha]_{\mathrm{in}}}, \qquad (1)$$

where $R$ is the gas constant, $T$ is the temperature in Kelvin, $V$ is the cell volume, $[\alpha]_{\mathrm{in|out}}$ is the concentration of ion $\alpha$ inside or outside the cell, and $\Delta[\alpha]_{\mathrm{in}}$ is the concentration change inside the cell, which is assumed to be small relative to the total concentration. The sum $\sum_{\alpha} z_{\alpha} \Delta[\alpha] = 0$, where $z_{\alpha}$ is the charge on ion $\alpha$, as no net charge accumulates during the action potential and no net work is done by or on the electric field. Often, sodium (Na+) and potassium (K+) play the dominant role in generating action potentials, in which case $\Delta E = \Delta[\mathrm{Na}]_{\mathrm{in}} F V (E_{\mathrm{Na}} - E_{\mathrm{K}})$, where $F$ is Faraday’s constant, $E_{\mathrm{Na}} = (RT/F) \ln([\mathrm{Na}]_{\mathrm{out}}/[\mathrm{Na}]_{\mathrm{in}})$ is the reversal potential for Na+, at which no net sodium current flows, and $E_{\mathrm{K}} = (RT/F) \ln([\mathrm{K}]_{\mathrm{out}}/[\mathrm{K}]_{\mathrm{in}})$. This estimate of the work done does not include heat (due to loss through the membrane resistance) or the work done by the ion channel proteins in changing their conformational state during the action potential. Hence, the action potential’s energetic cost to the cell is directly proportional to $\Delta[\mathrm{Na}]_{\mathrm{in}}$; taking into account that each Na+ ion carries one elementary charge, the cost is also proportional to the charge $Q_{\mathrm{Na}}$ that accumulates inside the cell. A maximally efficient cell reduces the charge per spike to a minimum. If a cell fires action potentials at an average rate $f$, the cell’s Na/K pumps must move Na+ and K+ ions in opposite directions, against their respective concentration gradients, to counteract an average inward Na+ current of $f Q_{\mathrm{Na}}$. Exhaustive measurements on myocytes in the heart, which expend tremendous amounts of energy to keep the heart beating, indicate that Na/K pumps expel ∼0.5 µA/cm² of Na+ current at membrane potentials close to rest [1].
Most excitable cells, even when spiking, spend most of their time close to resting potential, and yet standard models for action potentials can easily lead to accumulating an ionic charge of up to 5 µC/cm² [2]; most of this accumulation occurs during a very brief time interval. If one were to take an isopotential nerve cell with the same density of ion pumps as in the heart, then such a cell would not be able to produce more than one action potential every ten seconds on average. The brain should be effectively silent. Clearly, this conflicts with what is known about the average firing rates of neurons in the brainstem or even the neocortex, which can sustain spiking up to at least 7 Hz [3]. Part of the discrepancy can be resolved by noting that nerve cells are not isopotential and that action potential generation occurs within a highly restricted area of the membrane. Even so, standard models of action potential generation waste extraordinary amounts of energy; recent evidence [4] points out that many mammalian cortical neurons are much more efficient. As nature places a premium on energy consumption, we will argue that one can predict both the shape of the action potential and the underlying biophysics of the nonlinear, voltage-dependent ionic conductances from the principle of minimal energy consumption. After reviewing the ionic basis of action potentials, we first sketch how to compute the minimal energy cost for an arbitrary spike shape, and then solve for the optimal action potential shape with a given height. Finally, we show how minimal energy consumption explains all the dynamical features in the standard Hodgkin-Huxley (HH) model for neuronal dynamics that distinguish the brain’s action potentials from other highly nonlinear oscillations in physics and chemistry.
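The arithmetic behind the paradox can be checked with a short sketch. The function names and the ion concentrations below are illustrative assumptions, not values from the paper; only the 5 µC/cm² charge per spike and the 0.5 µA/cm² pump current come from the text.

```python
import math

R = 8.314462618   # gas constant, J/(mol K)
F = 96485.33212   # Faraday constant, C/mol


def nernst_potential(c_out, c_in, temperature_k, z=1):
    """Reversal potential E = (RT / zF) ln(c_out / c_in), in volts."""
    return R * temperature_k / (z * F) * math.log(c_out / c_in)


def energy_per_spike(q_na, e_na, e_k):
    """Work W = Q_Na (E_Na - E_K); q_na in C/cm^2 gives W in J/cm^2."""
    return q_na * (e_na - e_k)


def max_sustainable_rate(q_na, pump_current):
    """Largest mean firing rate f (Hz) with f * Q_Na <= pump current."""
    return pump_current / q_na


# Hypothetical concentrations in mM (assumed for illustration only).
T = 310.0
e_na = nernst_potential(145.0, 12.0, T)
e_k = nernst_potential(4.0, 140.0, T)

# Values quoted in the text: 5 uC/cm^2 per spike, 0.5 uA/cm^2 pump current.
rate = max_sustainable_rate(5e-6, 0.5e-6)
work = energy_per_spike(5e-6, e_na, e_k)
```

Under these numbers the sustainable rate comes out at about 0.1 Hz, which is the "one action potential every ten seconds" quoted above.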
2 Ionic basis of the action potential

In an excitable cell, synaptic drive forces the membrane permeability to different ions to change rapidly in time, producing the dynamics of the action potential. The current density $I_{\alpha}$ carried by an ion species $\alpha$ is given by the Goldman-Hodgkin-Katz (GHK) current equation [5, 6, 2], which assumes that ions are driven independently across the membrane under the influence of a constant electric field. $I_{\alpha}$ depends upon the ion’s membrane permeability $P_{\alpha}$, its concentrations on either side of the membrane, $[\alpha]_{\mathrm{out}}$ and $[\alpha]_{\mathrm{in}}$, and the voltage across the membrane, $V$, according to:

$$I_{\alpha} = P_{\alpha} z_{\alpha}^{2} \frac{V F^{2}}{RT} \, \frac{[\alpha]_{\mathrm{out}} - [\alpha]_{\mathrm{in}} \exp(z_{\alpha} V F / RT)}{1 - \exp(z_{\alpha} V F / RT)}. \qquad (2)$$

To produce the fast currents that generate APs, a subset of the membrane’s ionic permeabilities $P_{\alpha}$ are gated by voltage. Changes in the permeability $P_{\alpha}$ are not instantaneous; the voltage-gated permeability is scaled mathematically by gating variables $m(t)$ and $h(t)$ with their own time dependence. After separating constant from time-dependent components in the permeability, the voltage-gated permeability obeys

$$P_{\alpha}(t) = m(t)^{r} h(t)^{s} \quad \text{such that} \quad 0 \le P_{\alpha}(t) \le \bar{P}_{\alpha},$$

where $r$ and $s$ are positive, and $\bar{P}_{\alpha}$ is the peak permeability to ion $\alpha$ when all channels for ion $\alpha$ are open. Gating is also referred to as activation, and the associated nonlinear permeabilities are called active. There are also passive, voltage-insensitive permeabilities that maintain the resting potential and depolarise the membrane to trigger action potentials. The simplest possible kinetics for the gating variables are first order, involving only a single derivative in time.
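As a minimal sketch, Eq. (2) can be evaluated numerically as follows. The function name, the SI units, and the default temperature are assumptions made for illustration; the small-voltage branch uses the analytic limit of Eq. (2) as V → 0.

```python
import math

R = 8.314462618   # gas constant, J/(mol K)
F = 96485.33212   # Faraday constant, C/mol


def ghk_current(perm, z, v, c_out, c_in, temperature_k=310.0):
    """GHK current density in the sign convention of Eq. (2).

    perm: permeability (m/s), z: valence, v: membrane voltage (V),
    c_out, c_in: concentrations (mol/m^3). Returns A/m^2.
    """
    u = z * v * F / (R * temperature_k)
    if abs(u) < 1e-9:
        # Limit u -> 0: u / (1 - exp(u)) -> -1.
        return -perm * z * F * (c_out - c_in)
    prefactor = perm * z ** 2 * v * F ** 2 / (R * temperature_k)
    return prefactor * (c_out - c_in * math.exp(u)) / (1.0 - math.exp(u))


# Hypothetical sodium-like example (permeability and concentrations assumed).
i_rest = ghk_current(1e-8, 1, 0.0, 145.0, 12.0)          # negative: inward
v_rev = R * 310.0 / F * math.log(145.0 / 12.0)           # Nernst potential
i_rev = ghk_current(1e-8, 1, v_rev, 145.0, 12.0)         # approximately zero
```

In this convention an inward cation current is negative, and the current vanishes at the Nernst potential of the ion, which is a useful sanity check on the signs.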
The steady state of each gating variable at a given voltage is determined by a Boltzmann function, to which the gating variables evolve:

$$\tau_{m} \frac{dm}{dt} = \sqrt[r]{\bar{P}_{\alpha}} \, m_{\infty}(V) - m(t) \quad \text{and} \quad \tau_{h} \frac{dh}{dt} = h_{\infty}(V) - h(t),$$

with $m_{\infty}(V) = \{1 + \exp((V - V_{m})/s_{m})\}^{-1}$ the Boltzmann function described by the slope $s_{m} > 0$ and the midpoint $V_{m}$; similarly, $h_{\infty}(V) = \{1 + \exp((V - V_{h})/s_{h})\}^{-1}$, but with $s_{h} < 0$. Scaling $m_{\infty}(V)$ by the $r$th root of the peak permeability $\bar{P}_{\alpha}$ is a matter of mathematical convenience. We will consider both voltage-independent and voltage-dependent time constants, either setting $\tau_{j} = \tau_{j,0}$ to be constant, where $j \in \{m(t), h(t)\}$, or imposing a bell-shaped voltage dependence $\tau_{j}(V) = \tau_{j,0} \, \mathrm{sech}[s_{j}(V - V_{j})]$. The synaptic, leak, and voltage-dependent currents drive the rate of change in the voltage across the membrane

$$C \frac{dV}{dt} = I_{\mathrm{syn}} + I_{\mathrm{leak}} + \sum_{\alpha} I_{\alpha},$$

where the synaptic permeability and leak permeability are held constant.

3 Resistive and capacitive components of the energy cost

By treating the action potential as the charging and discharging of the cell membrane capacitance, the action potentials measured at the mossy fibre synapse in rats [4] or in mouse thalamocortical neurons [7] were found to be highly energy-efficient: the nonlinear, active conductances inject only slightly more current than is needed to charge a capacitor to the peak voltage of the action potential. The implicit assumption made here is that one can neglect the passive loss of current through the membrane resistance, known as the leak. Any passive loss must be compensated by additional charge, making this loss the primary target of the selection pressure that has shaped the dynamics of action potentials.
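The first-order gating kinetics introduced in Section 2 can be sketched as below. The function names are hypothetical, the Boltzmann function is transcribed with the exponent exactly as written in the text (so whether it rises or falls with V depends on the sign of the slope passed in), and the relaxation is shown at a fixed voltage only, which is a simplification rather than the full coupled model.

```python
import math


def boltzmann(v, v_mid, slope):
    """Steady state {1 + exp((V - V_mid) / slope)}^-1, as written in the text."""
    return 1.0 / (1.0 + math.exp((v - v_mid) / slope))


def tau_bell(v, tau0, s, v_mid):
    """Bell-shaped time constant tau0 * sech(s * (V - V_mid))."""
    return tau0 / math.cosh(s * (v - v_mid))


def relax_gate(x0, x_inf, tau, dt, steps):
    """Integrate tau * dx/dt = x_inf - x at fixed voltage.

    Uses the exact exponential update for a constant target x_inf.
    """
    x = x0
    decay = math.exp(-dt / tau)
    for _ in range(steps):
        x = x_inf + (x - x_inf) * decay
    return x


# After one time constant the gate has covered a fraction 1 - 1/e of the gap.
x_after_tau = relax_gate(0.0, 1.0, 2.0, 0.01, 200)
```

The time constant peaks at the midpoint voltage and falls off on either side, which is what lets a gate latch during the rising and falling phases of a spike.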
On the other hand, the membrane capacitance at the site of AP initiation is generally modelled and experimentally confirmed [8] as being fairly constant around 1 µF/cm²; in contrast, the propagation, but not generation, of APs can be assisted by a reduction in the capacitance achieved by the myelin sheath that wraps some axons. As myelin would block the flow of ions, we posit that the specific capacitance cannot yield to selection pressure to minimise the work $W = Q_{\mathrm{Na}}(E_{\mathrm{Na}} - E_{\mathrm{K}})$ needed for AP generation. To address how the shape and dynamics of action potentials might have evolved to consume less energy, we first fix the action potential’s shape and solve for the minimum charge $Q_{\mathrm{Na}}$ ab initio, without treating the cell membrane as a pure capacitor. Regardless of the action potential’s particular time-course $V(t)$, voltage-dependent ionic conductances must transfer Na+ and K+ charge to elicit an action potential. Figure 1 shows a generic action potential and the associated ionic currents, comparing the latter to the minimal currents required. The passive equivalent circuit for the neuron consists of a resistor in parallel with a capacitor, driven by a synaptic current. To charge the membrane to the peak voltage, a neuron in a high-conductance state [9, 10] may well lose more charge through the resistor than is stored on the capacitor. For neurons in a low-conductance state and for rapid voltage deflections from the resting potential, membrane capacitance will be the primary determinant of the charge.

4 The norm of spikes

How close can voltage-gated channels with realistic properties come to the minimal currents? What time-course for the action potential leads to the smallest minimal currents? To answer these questions, we must solve a constrained optimization problem on the solutions to the nonlinear differential equations for the neuronal dynamics. To separate action potentials from mere small-amplitude oscillations in the voltage, we need to introduce a metric.
Smaller action potentials consume less energy, provided the underlying currents are optimal, yet signalling between neurons depends on the action potential’s voltage deflection reaching a minimum amplitude. Given the importance of the action potential’s amplitude, we define an $L^{p}$ norm on the voltage waveform $V(t)$ to emphasize the maximal voltage deflection:

$$\left\| V(t) - \langle V \rangle \right\|_{p} = \left( \int_{0}^{T} \left| V(t) - \langle V \rangle \right|^{p} dt \right)^{1/p},$$

where $\langle V \rangle$ is the average voltage. In the limit as $p \to \infty$, the norm simply becomes the difference between the action potential’s peak voltage and the mean voltage, whereas a finite $p$ ensures that the norm is differentiable. In parameter space, we will focus our attention on the manifold of action potentials with constant $L^{p}$ norm with $2 \le p < \infty$, which entails that the optimal action potential will have a finite, though possibly narrow width. To be close to the supremum norm, yet still have a norm that is well-behaved under differentiation, we decided to use $p = 16$.

[Figure 1 panels: (a) Generic Action Potential, V [mV] vs. t [ms], with the equivalent circuit (g_syn, g_Na, g_K, g_leak, C); (b) Active and Minimal Currents, current [µA/cm²] vs. t [ms], showing active and minimum I_Na and I_K; (c) Resistive vs. Capacitive Minimum Charge, Q_resistive/Q_capacitive vs. leak conductance [mS/cm²]. For a fixed action potential waveform $V(t)$: minimum $I_{\mathrm{Na}}(t) = -L_{V}(t)\,\theta(L_{V}(t))$ and minimum $I_{\mathrm{K}}(t) = -L_{V}(t)\,\theta(-L_{V}(t))$, with $L_{V}(t) \equiv C \dot{V}(t) + I_{\mathrm{leak}}[V(t)] + I_{\mathrm{syn}}[V(t)]$.]

Figure 1: To generate an action potential with an arbitrary time-course $V(t)$, the nonlinear, time-dependent permeabilities must deliver more charge than just to load the membrane capacitance: resistive losses must be compensated. (a) The action potential’s time-course in a generic HH model for a neuron, represented by the circuit diagram on the right. The peak of the action potential is ∼50 mV above the average potential. (b) The inward Na+ current, shown in green going in the negative direction, rapidly depolarizes the potential $V(t)$ and yields the upstroke of the action potential. Concurrently, the K+ current activates, displayed as a positive deflection, and leads to the downstroke in the potential $V(t)$. Inward and outward currents overlap significantly in time. The dotted lines within the region bounded by the solid lines represent the minimal Na+ current and the minimal K+ current needed to produce the $V(t)$ spike waveform in (a). By the law of current conservation, the sum of capacitive, resistive, and synaptic currents, denoted by $L_{V}(t) \equiv C \dot{V}(t) + I_{\mathrm{leak}}[V(t)] + I_{\mathrm{syn}}[V(t)]$, must be balanced by the active currents. If the cell’s passive properties, namely its capacitance and (leak) resistance, and the synaptic conductance are constant, we can deduce the minimal active currents needed to generate a specified $V(t)$. The minimal currents, by definition, do not overlap in time. Taking into account passive current flow, restoring the concentration gradients after the action potential requires 29 nJ/cm². By contrast, if the active currents were optimal, the cost would be 8.9 nJ/cm². (c) To depolarize from the minimum to the maximum of the AP, the synaptic voltage-gated currents must deliver a charge $Q_{\mathrm{capacitive}}$ to charge the membrane capacitance and a charge $Q_{\mathrm{resistive}}$ to compensate for the loss of current through leak channels. For a large leak conductance in the cell membrane, $Q_{\mathrm{resistive}}$ can be larger than $Q_{\mathrm{capacitive}}$.

5 Poincaré-Lindstedt perturbation of periodic dynamical orbits

Standard (secular) perturbation theory diverges for periodic orbits, so we apply the Poincaré-Lindstedt technique of expanding both in the period and the dynamics of the asymptotic orbit and then derive a set of adjoint sensitivity equations for the differential-algebraic system.
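The $L^{p}$ norm of Section 4 and the non-overlapping minimal currents defined with Figure 1 can be sketched on a sampled waveform as follows. The function names and the uniform-sampling Riemann sum are assumptions made for illustration, not the authors' implementation.

```python
def lp_norm(v, dt, p=16):
    """Approximate (integral |V - <V>|^p dt)^(1/p) by a Riemann sum.

    v: voltage samples over one period, dt: sample spacing.
    The peak deviation is factored out to avoid overflow for large p.
    """
    mean = sum(v) / len(v)
    peak = max(abs(x - mean) for x in v)
    if peak == 0.0:
        return 0.0
    total = sum((abs(x - mean) / peak) ** p for x in v) * dt
    return peak * total ** (1.0 / p)


def minimal_currents(lv):
    """Split L_V(t) into minimal currents that never overlap in time.

    I_Na = -L_V * theta(L_V)   (inward, negative)
    I_K  = -L_V * theta(-L_V)  (outward, positive)
    """
    i_na = [-x if x > 0 else 0.0 for x in lv]
    i_k = [-x if x < 0 else 0.0 for x in lv]
    return i_na, i_k


# Toy samples (assumed values, arbitrary units).
i_na, i_k = minimal_currents([1.0, -2.0, 0.0, 3.0])
norm16 = lp_norm([0.0, 0.0, 0.0, 4.0], dt=1.0, p=16)
```

For large p the norm is dominated by the largest deviation from the mean, which is why p = 16 behaves almost like the peak-minus-mean amplitude while remaining differentiable.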
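Solving once for the adjoint functions, we can easily compute the parameter gradient of any functional on the orbit, even for thousands of parameters. We start with a set of ordinary differential equations $\dot{\mathbf{x}} = \mathbf{F}(\mathbf{x}; \mathbf{p})$ for the neuron’s dynamics, an asymptotically periodic orbit $\mathbf{x}^{\gamma}(t)$ that describes the action potential, and a functional $G(\mathbf{x}; \mathbf{p})$ on the orbit, representing the energy consumption, for instance. The functional can be written as an integral

$$G(\mathbf{x}^{\gamma}; \mathbf{p}) = \int_{0}^{\omega(\mathbf{p})^{-1}} g(\mathbf{x}^{\gamma}(t); \mathbf{p}) \, dt$$

over some source term $g(\mathbf{x}^{\gamma}(t); \mathbf{p})$. Assume that locally perturbing a parameter $p \in \mathbf{p}$ induces a smooth change in the stable limit cycle, preserving its existence. Generally, a perturbation changes not only the limit cycle’s path in state space, but also the average speed with which this orbit is traversed; as a consequence, the value of the functional depends on this change in speed, to lowest order. For simplicity, consider a single, scalar parameter $p$. $G(\mathbf{x}^{\gamma}; p)$ is the solution to $\omega(p) \, \partial_{\tau}[G(\mathbf{x}^{\gamma}; p)] = g(\mathbf{x}^{\gamma}; p)$, where we have normalised time via $\tau = \omega(p) t$. Denoting partial derivatives by subscripts, we expand $p \to p + \epsilon$ to get the $O(\epsilon^{1})$ equation

$$d_{\tau}[G_{p}(\mathbf{x}^{\gamma}; p)] + \omega_{p} \, g(\mathbf{x}^{\gamma}; p) = g_{\mathbf{x}}(\mathbf{x}^{\gamma}; p) \, \mathbf{x}_{p} + g_{p}(\mathbf{x}^{\gamma}; p)$$

in a procedure known as the Poincaré-Lindstedt method. Hence,

$$\frac{dG}{dp} = \int_{0}^{\omega^{-1}} (g_{p} + g_{\mathbf{x}} \mathbf{x}_{p} - \omega_{p} g) \, dt,$$

where, once again by the Poincaré-Lindstedt method, $\mathbf{x}_{p}$ is the solution to

$$\dot{\mathbf{x}}_{p} = \mathbf{F}_{\mathbf{x}}(\mathbf{x}^{\gamma}) \, \mathbf{x}_{p} + \mathbf{F}_{p}(\mathbf{x}^{\gamma}) - \omega_{p} \mathbf{F}(\mathbf{x}^{\gamma}).$$

Following the approach described by Cao, Li, Petzold, and Serban (2003), introduce a Lagrange vector $\mathbf{A}^{G}(\mathbf{x})$ and consider the augmented objective function

$$I(\mathbf{x}^{\gamma}; p) = G(\mathbf{x}^{\gamma}; p) - \int_{0}^{\omega^{-1}} \mathbf{A}^{G}(\mathbf{x}^{\gamma}) \cdot \left( \mathbf{F}(\mathbf{x}^{\gamma}) - \dot{\mathbf{x}}^{\gamma} \right) dt,$$

which is identical to $G(\mathbf{x}^{\gamma}; p)$ as $\mathbf{F}(\mathbf{x}) - \dot{\mathbf{x}} = 0$. Then

$$\frac{dI(\mathbf{x}^{\gamma}; p)}{dp} = \int_{0}^{\omega^{-1}} (g_{p} + g_{\mathbf{x}} \mathbf{x}_{p} - \omega_{p} g) \, dt - \int_{0}^{\omega^{-1}} \mathbf{A}^{G} \cdot (\mathbf{F}_{p} + \mathbf{F}_{\mathbf{x}} \mathbf{x}_{p} - \omega_{p} \mathbf{F} - \dot{\mathbf{x}}_{p}) \, dt.$$

Integrating the $\mathbf{A}^{G}(\mathbf{x}) \cdot \dot{\mathbf{x}}_{p}$ term by parts and using periodicity, we get

$$\frac{dI(\mathbf{x}^{\gamma}; p)}{dp} = \int_{0}^{\omega^{-1}} \left[ g_{p} - \omega_{p} g - \mathbf{A}^{G} \cdot (\mathbf{F}_{p} - \omega_{p} \mathbf{F}) \right] dt - \int_{0}^{\omega^{-1}} \left( -g_{\mathbf{x}} + \dot{\mathbf{A}}^{G} + \mathbf{A}^{G} \cdot \mathbf{F}_{\mathbf{x}} \right) \mathbf{x}_{p} \, dt.$$

Table 1: Parameter limits.
Parameter | minimum | maximum
peak permeability $\bar{P}_{\mathrm{Na}}$ | 0.24 fm/s | 0.15 µm/s
peak permeability $\bar{P}_{\mathrm{K}}$ | 6.6 fm/s | 11 µm/s
midpoint voltage $V_{m} \vee V_{h}$ | −72 mV | 70 mV
slope $s_{m} \vee (-s_{h})$ | 3.33 mV | 200 mV
time constant $\tau_{m,0} \vee \tau_{h,0}$ | 5 µs | 200 ms
gating exponent $r \vee s$ | 0.2 | 5.0

We can let the second term vanish by making the vector $\mathbf{A}^{G}(\mathbf{x})$ obey

$$\dot{\mathbf{A}}^{G}(\mathbf{x}) = -\mathbf{F}_{\mathbf{x}}^{T}(\mathbf{x}; p) \, \mathbf{A}^{G}(\mathbf{x}) + g_{\mathbf{x}}(\mathbf{x}; p).$$

Label the homogeneous solution (obtained by setting $g_{\mathbf{x}}(\mathbf{x}^{\gamma}; p) = 0$) as $\mathbf{Z}(\mathbf{x})$. It is known that the term $\omega_{p}$ is given by $\omega_{p} = \omega \int_{0}^{\omega^{-1}} \mathbf{Z}(\mathbf{x}) \cdot \mathbf{F}_{p}(\mathbf{x}) \, dt$, provided $\mathbf{Z}(\mathbf{x})$ is normalised to satisfy $\mathbf{Z}(\mathbf{x}) \cdot \mathbf{F}(\mathbf{x}) = 1$. We can add any multiple of the homogeneous solution $\mathbf{Z}(\mathbf{x})$ to the inhomogeneous solution, so we can always make

$$\int_{0}^{\omega^{-1}} \mathbf{A}^{G}(\mathbf{x}) \cdot \mathbf{F}(\mathbf{x}) \, dt = G$$

by taking

$$\mathbf{A}^{G}(\mathbf{x}) \to \mathbf{A}^{G}(\mathbf{x}) - \mathbf{Z}(\mathbf{x}) \left( \int_{0}^{\omega^{-1}} \mathbf{A}^{G}(\mathbf{x}) \cdot \mathbf{F}(\mathbf{x}) \, dt - \omega G \right). \qquad (3)$$

This condition will make $\mathbf{A}^{G}(\mathbf{x})$ unique. Finally, with eq. (3) we get

$$\frac{dI(\mathbf{x}^{\gamma}; p)}{dp} = \frac{dG(\mathbf{x}^{\gamma}; p)}{dp} = \int_{0}^{\omega^{-1}} \left( g_{p} - \mathbf{A}^{G} \cdot \mathbf{F}_{p} \right) dt.$$

The first term in the integral gives rise to the partial derivative $\partial G(\mathbf{x}^{\gamma}; p) / \partial p$. In many cases, this term is either zero, can be made zero, or at least made independent of the dynamical variables.

The parameters for the neuron models are listed in Table 1 together with their minimum and maximum allowed values. For each parameter in the neuron model, an auxiliary parameter on the entire real line is introduced, and a mapping from the real line onto the finite range set by the biophysical limits is defined. Gradient descent on this auxiliary parameter space is performed by orthogonalizing the gradient $dQ_{\alpha}/d\mathbf{p}$ to the gradient $dL/d\mathbf{p}$ of the norm. To correct for drift off the constraint manifold of constant norm, illustrated in Fig. 3, steps of gradient ascent or descent on the $L^{p}$ norm are performed while keeping $Q_{\alpha}$ constant. The step size during gradient descent is adjusted to assure that $\Delta Q_{\alpha} < 0$ and that a periodic solution $\mathbf{x}^{\gamma}$ exists after adapting the parameters. The energy landscape is locally convex (Fig. 3).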
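The two ingredients of the constrained descent described above, a bounded reparameterization and a gradient orthogonalized to the norm gradient, can be sketched as follows. The text does not specify which mapping from the real line onto the allowed range is used, so the logistic map here is an assumption, as are all function names.

```python
import math


def to_bounded(u, lo, hi):
    """Map an auxiliary parameter u on the real line onto (lo, hi).

    A logistic map is assumed here; the paper does not name its mapping.
    """
    return lo + (hi - lo) / (1.0 + math.exp(-u))


def to_auxiliary(p, lo, hi):
    """Inverse of to_bounded (a logit)."""
    return math.log((p - lo) / (hi - p))


def project_out(grad_q, grad_l):
    """Remove from grad_q its component along grad_l.

    A small step along the result changes the charge Q while leaving
    the norm L unchanged to first order.
    """
    dot_ql = sum(a * b for a, b in zip(grad_q, grad_l))
    dot_ll = sum(b * b for b in grad_l)
    if dot_ll == 0.0:
        return list(grad_q)
    scale = dot_ql / dot_ll
    return [a - scale * b for a, b in zip(grad_q, grad_l)]


# Gating exponent limits from Table 1: 0.2 to 5.0.
mid_exponent = to_bounded(0.0, 0.2, 5.0)
direction = project_out([1.0, 1.0], [1.0, 0.0])
```

Because the projection is only first-order accurate, a finite step still drifts off the constant-norm manifold, which is why the text adds separate correction steps on the norm.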
6 Predicting the Hodgkin-Huxley model

We start with a single-compartment Goldman-Hodgkin-Katz model neuron containing voltage-gated Na+ and leak conductances (Figure 1). A tonic synaptic input to the model evokes repetitive firing of action potentials. We seek those parameters that minimize the ionic load for an action potential of constant norm, in other words, spikes whose height relative to the average voltage is fairly constant, subject to a trade-off with the spike width. The ionic load is directly proportional to the work $W$ performed by the ion flux. All parameters governing the ion channels’ voltage dependence and kinetics, including their time constants, mid-points, slopes, and peak values, are subject to change. The simplest model capable of generating an action potential must have two dynamical variables and two time scales: one for the upstroke and another for the downstroke.

[Figure 2 panels, each showing the Optimal Action Potential (V [mV] vs. t [ms], with an inset of τ_h and τ_n [ms] vs. V [mV]) and the Falling Phase Currents (current [µA/cm²] vs. t [ms]; I_K[V], excess I_Na[V], peak resurgence): (a) Transient Na Current Model, Q = 239 nC/cm², P_Na = m(t)h(t), P_K = n(t); (b) Voltage-dependent (In)activation Model, Q = 169 nC/cm², P_Na = m(t)h(t), P_K = n(t), τ_i = τ_i(V); (c) Cooperative Gating Model, Q = 156 nC/cm², P_Na = m(t)h(t), P_K = n(t)^s, τ_i = τ_i(V).]

Figure 2: Optimal spike shapes and currents for neuron models with different biophysical features. During optimization, the spikes were constrained to have constant norm $\| V(t) - \langle V \rangle \|_{16} = 92$ mV, which controls the height of the spike. Insets in the left column display the voltage-dependence of the optimized time constants for sodium inactivation and potassium activation; sodium activation is modeled as occurring instantaneously. (a) Model with voltage-dependent inactivation of Na+; time constants for the first order permeability kinetics are voltage-independent (inset). Inactivation turns off the Na+ current on the downstroke, but not completely: as the K+ current activates to repolarize the membrane, the inward Na+ current reactivates and counteracts the K+ current; the peak of the resurgent Na+ current is marked by a triangle. (b) Model with voltage-dependent time constants for the first order kinetics of activation and inactivation. The voltage dependence minimizes the resurgence of the Na+ current. (c) Power-law gating model with an inwardly rectifying potassium current replacing the leak current. The power law dependence introduces an effective delay in the onset of the K+ current, which further minimizes the overlap of Na+ and K+ currents in time.

[Figure 3 panels: Energy per Spike, E [nJ/cm²] as a function of V_K [mV], s_K [mV], and τ_K [ms]; Surface of Constant Norm Spikes, with example spike waveforms (V [mV] vs. t [ms]) at three marked points a, b, c.]

Figure 3: The energy required for an action potential as a function of three parameters governing potassium activation: the midpoint voltage $V_{\mathrm{K}}$, the slope $s_{\mathrm{K}}$, and the (maximum) time constant $\tau_{\mathrm{K}}$. The energy is the minimum work required to restore the ionic concentration gradients, as given by Eq. (1). Note that the energy within the constrained manifold of constant norm spikes is locally convex.

If both Na+ and K+ currents are persistent, current flows in opposite directions at the same time, so that, even at the optimum, the ionic load is 1200 nC/cm².
On the other hand, no voltage-gated K+ channels are even required for a spike, as long as Na+ channels activate on a fast time scale and inactivate on a slower time scale and the leak is powerful enough to repolarize the neuron. Even so, the load is still 520 nC/cm². While spikes require dynamics on two time scales, suppressing the overlap between inward and outward currents calls for a third time scale. The resulting dynamics are higher-dimensional and reduce the load to 239 nC/cm². Making the activation and inactivation time constants voltage-dependent permits ion channels to latch to an open or closed state during the rising and falling phase of the spike, reducing the ionic load to 189 nC/cm² (Fig. 2). The minimal Na+ and K+ currents are separated in time, yet dynamics that are linear in the activation variables cannot enforce a true delay between the offset of the Na+ current and the onset of the K+ current. If current flow depends on multiple gates that need to be activated simultaneously, optimization can use the nonlinearity of multiplication to introduce a delay in the rise of the K+ current that abolishes the overlap, and the ionic load drops to 156 nC/cm². Any number of kinetic schemes for the nonlinear permeabilities $P_{\alpha}$ can give rise to the same spike waveform $V(t)$, including the simplest two-dimensional one. Yet only the full Hodgkin-Huxley (HH) model, with its voltage-dependent kinetics that prevent the premature resurgence of inward current and cooperative gating that delays the onset of the outward current, minimizes the energetic cost. More complex models, in which voltage-dependent ion channels make transitions between multiple closed, inactivated, and open states, instantiate the energy-conserving features of the HH system at the molecular level.
Furthermore, features that are eliminated during optimization, such as a voltage-dependent inactivation of the outward potassium current, are also not part of the delayed rectifier potassium current in the Hodgkin-Huxley framework.

References

[1] Paul De Weer, David C. Gadsby, and R. F. Rakowski. Voltage dependence of the Na-K pump. Ann. Rev. Physiol., 50:225–241, 1988.
[2] B. Frankenhaeuser and A. F. Huxley. The action potential in the myelinated nerve fibre of Xenopus laevis as computed on the basis of voltage clamp data. J. Physiol., 171:302–315, 1964.
[3] Samuel S.-H. Wang, Jennifer R. Shultz, Mark J. Burish, Kimberly H. Harrison, Patrick R. Hof, Lex C. Towns, Matthew W. Wagers, and Krysta D. Wyatt. Functional trade-offs in white matter axonal scaling. J. Neurosci., 28(15):4047–4056, 2008.
[4] Henrik Alle, Arnd Roth, and Jörg R. P. Geiger. Energy-efficient action potentials in hippocampal mossy fibers. Science, 325(5946):1405–1408, 2009.
[5] D. E. Goldman. Potential, impedance and rectification in membranes. J. Gen. Physiol., 27:37–60, 1943.
[6] A. L. Hodgkin and B. Katz. The effect of sodium ions on the electrical activity of the giant axon of the squid. J. Physiol., 108:37–77, 1949.
[7] Brett C. Carter and Bruce P. Bean. Sodium entry during action potentials of mammalian neurons: Incomplete inactivation and reduced metabolic efficiency in fast-spiking neurons. Neuron, 64(6):898–909, 2009.
[8] Luc J. Gentet, Greg J. Stuart, and John D. Clements. Direct measurement of specific membrane capacitance in neurons. Biophys. J., 79:314–320, 2000.
[9] Alain Destexhe, Michael Rudolph, and Denis Paré. The high-conductance state of neocortical neurons in vivo. Nature Neurosci. Rev., 4:739–751, 2003.
[10] Bilal Haider and David A. McCormick. Rapid neocortical dynamics: Cellular and network mechanisms. Neuron, 62:171–189, 2009.

6 0.51115566 244 nips-2011-Selecting Receptive Fields in Deep Networks

7 0.50982243 263 nips-2011-Sparse Manifold Clustering and Embedding

8 0.50958377 186 nips-2011-Noise Thresholds for Spectral Clustering

9 0.50921434 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels

10 0.50213981 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

11 0.50169009 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition

12 0.49974358 265 nips-2011-Sparse recovery by thresholded non-negative least squares

13 0.49913701 80 nips-2011-Efficient Online Learning via Randomized Rounding

14 0.49888411 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms

15 0.49873406 29 nips-2011-Algorithms and hardness results for parallel large margin learning

16 0.49829537 124 nips-2011-ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning

17 0.4973965 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

18 0.49580342 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions

19 0.49534127 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

20 0.49508238 106 nips-2011-Generalizing from Several Related Classification Tasks to a New Unlabeled Sample