nips nips2012 nips2012-171 knowledge-graph by maker-knowledge-mining

171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

Source: pdf

Author: Matthew Der, Lawrence K. Saul

Abstract: We describe a latent variable model for supervised dimensionality reduction and distance metric learning. The model discovers linear projections of high dimensional data that shrink the distance between similarly labeled inputs and expand the distance between differently labeled ones. The model’s continuous latent variables locate pairs of examples in a latent space of lower dimensionality. The model differs significantly from classical factor analysis in that the posterior distribution over these latent variables is not always multivariate Gaussian. Nevertheless we show that inference is completely tractable and derive an Expectation-Maximization (EM) algorithm for parameter estimation. We also compare the model to other approaches in distance metric learning. The model’s main advantage is its simplicity: at each iteration of the EM algorithm, the distance metric is re-estimated by solving an unconstrained least-squares problem. Experiments show that these simple updates are highly effective.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We describe a latent variable model for supervised dimensionality reduction and distance metric learning. [sent-4, score-0.365]

2 The model discovers linear projections of high dimensional data that shrink the distance between similarly labeled inputs and expand the distance between differently labeled ones. [sent-5, score-0.423]

3 The model’s continuous latent variables locate pairs of examples in a latent space of lower dimensionality. [sent-6, score-0.355]

4 The model differs significantly from classical factor analysis in that the posterior distribution over these latent variables is not always multivariate Gaussian. [sent-7, score-0.261]

5 We also compare the model to other approaches in distance metric learning. [sent-9, score-0.144]

6 The model’s main advantage is its simplicity: at each iteration of the EM algorithm, the distance metric is re-estimated by solving an unconstrained least-squares problem. [sent-10, score-0.144]

7 1 Introduction: In this paper we propose a simple but new model to learn informative linear projections of multivariate data. [sent-12, score-0.066]

8 Our approach is rooted in the tradition of latent variable modeling, a popular methodology for discovering low dimensional structure in high dimensional data. [sent-13, score-0.212]

9 Two well-known examples of latent variable models are factor analyzers (FAs), which recover subspaces of high variance [1], and Gaussian mixture models (GMMs), which reveal clusters of high density [2]. [sent-14, score-0.312]

10 Here we describe a model that we call latent coincidence analysis (LCA). [sent-15, score-0.28]

11 The goal of LCA is to discover a latent space in which metric distances reflect meaningful notions of similarity and difference. [sent-16, score-0.23]

12 We apply LCA to two problems in distance metric learning, where the goal is to improve the performance of a classifier, typically a k-nearest neighbor (kNN) classifier [3], by a linear transformation of its input space. [sent-17, score-0.22]

13 Several previous methods have been proposed for this problem, including neighborhood component analysis (NCA) [4], large margin nearest neighbor classification (LMNN) [5], and information-theoretic metric learning (ITML) [6]. [sent-18, score-0.172]

14 NCA was conceived as a supervised counterpart to stochastic neighborhood embedding [7], an unsupervised method for dimensionality reduction. [sent-21, score-0.102]

15 Perhaps it is due to [...] Figure 1: Bayesian network for latent coincidence analysis. [sent-24, score-0.28]

16 The inputs $x, x' \in \mathbb{R}^d$ are mapped into Gaussian latent variables $z, z' \in \mathbb{R}^p$ whose statistics are parameterized by the linear transformation $W \in \mathbb{R}^{p \times d}$ and noise level $\sigma$. [sent-25, score-0.281]

17 Coincidence in the latent space at length scale κ is detected by the binary variable y ∈ {0, 1}. [sent-26, score-0.14]

18 Distance metric learning is a fundamental problem, and the more solutions we have, the better equipped we are to solve its myriad variations. [sent-30, score-0.086]

19 It is in this spirit that we revisit the problem of distance metric learning in the venerable tradition of latent variable modeling. [sent-31, score-0.316]

20 We believe that LCA, like factor analysis and Gaussian mixture modeling, is the simplest latent variable model that can be imagined for its purpose. [sent-32, score-0.224]

21 In particular, the inference in LCA (though not purely Gaussian) is tractable, and the distance metric is re-estimated at each iteration of its EM algorithm by a simple least-squares update. [sent-33, score-0.144]

22 This update has stronger guarantees of convergence than the gradient-based methods in NCA; it also sidesteps the large number of linear inequality constraints that appear in the optimizations for LMNN and ITML. [sent-34, score-0.059]

23 There are three observed variables: the inputs $x, x' \in \mathbb{R}^d$, which we always imagine to be observed in pairs, and the binary label $y \in \{0, 1\}$, which indicates if the inputs map (or are desired to be mapped) to nearby locations in a latent space of equal or reduced dimensionality $p \le d$. [sent-39, score-0.337]

24 These locations are in turn represented by the Gaussian latent variables $z, z' \in \mathbb{R}^p$. [sent-40, score-0.136]

25 The conditional distributions $P(z \mid x)$ and $P(z' \mid x')$ are parameterized by a linear transformation $W \in \mathbb{R}^{p \times d}$ (from the input space to the latent space) and a noise level $\sigma^2$. [sent-42, score-0.17]

26 Finally, the binary label $y \in \{0, 1\}$ is used to detect the coincidence of the variables $z, z'$ in the latent space. [sent-44, score-0.319]

27 Eq. (3) states that $y = 1$ with certainty if $z$ and $z'$ coincide at the exact same point in the latent space; otherwise, the probability in eq. [sent-47, score-0.151]

28 1 Inference: Inference in this model requires averaging over the Gaussian latent variables $z, z'$. [sent-52, score-0.136]

29 For inputs $(x, x')$, we denote the relative likelihood, or odds, of the event $y = 1$ by $\nu(x, x') = \frac{P(y = 1 \mid x, x')}{P(y = 0 \mid x, x')}$. (6) [sent-56, score-0.068]

30 As we shall see, the odds appear in the calculations for many useful forms of inference. [sent-57, score-0.095]
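The odds in eq. (6) can be evaluated in closed form once the model's conditional distributions are fixed. The sketch below assumes isotropic Gaussians, $P(z \mid x) = \mathcal{N}(Wx, \sigma^2 I)$, and a coincidence probability $P(y = 1 \mid z, z') = \exp(-\|z - z'\|^2 / 2\kappa^2)$. These forms agree with the description above (y = 1 with certainty when the latent points coincide), but the equations themselves are not reproduced in this summary, so they are assumptions, as are the function names.

```python
import numpy as np

def coincidence_prob(W, x, x_prime, sigma2, kappa2):
    """P(y=1 | x, x') with z, z' integrated out (assumed Gaussian forms)."""
    p = W.shape[0]                       # latent dimensionality
    diff = W @ (x - x_prime)             # projected difference of the inputs
    scale = kappa2 + 2.0 * sigma2
    return (kappa2 / scale) ** (p / 2.0) * np.exp(-diff @ diff / (2.0 * scale))

def odds(W, x, x_prime, sigma2, kappa2):
    """nu(x, x') = P(y=1 | x, x') / P(y=0 | x, x'), as in eq. (6)."""
    p1 = coincidence_prob(W, x, x_prime, sigma2, kappa2)
    return p1 / (1.0 - p1)               # requires sigma2 > 0 so that p1 < 1
```

Under these assumptions the probability of coincidence is largest when the projected inputs agree and decays with their projected distance.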

31 Note that the odds $\nu(x, x')$ has a complicated nonlinear dependence on the inputs $(x, x')$; the numerator in eq. [sent-58, score-0.156]

32 2) are the statistics of the posterior distribution $P(z, z' \mid x, x', y) = \frac{P(y \mid z, z')\, P(z \mid x)\, P(z' \mid x')}{P(y \mid x, x')}$. (7) [sent-61, score-0.075]

33 We note that the prior distribution $P(z, z' \mid x, x')$ is multivariate Gaussian, as is the posterior distribution $P(z, z' \mid x, x', y = 1)$ for positively labeled pairs of examples. [sent-63, score-0.251]

34 However, this is not true of the posterior distribution $P(z, z' \mid x, x', y = 0)$ for negatively labeled pairs. [sent-64, score-0.186]

35 In this respect, the model differs from classical factor analysis and other canonical models with Gaussian latent variables (e. [sent-65, score-0.162]

36 (7) for both positively ($y = 1$) and negatively ($y = 0$) labeled pairs of examples. [sent-69, score-0.134]

37 In particular, for the posterior means, we obtain: $E[z \mid x, x', y = 0] = Wx - \frac{\nu\sigma^2}{\kappa^2 + 2\sigma^2}\, W(x' - x)$, (8) $E[z \mid x, x', y = 1] = Wx + \frac{\sigma^2}{\kappa^2 + 2\sigma^2}\, W(x' - x)$, (9) where the coefficient $\nu$ in eq. [sent-70, score-0.075]

38 Note how the posterior means $E[z \mid x, x', y]$ in eqs. [sent-73, score-0.075]

39 (10) Analogous results hold for the prior and posterior means of the latent variable $z'$. [sent-75, score-0.215]

40 Intuitively, these calculations show that the expected values of $z$ and $z'$ move toward each other if the observed label indicates a coincidence ($y = 1$) and away from each other if not ($y = 0$). [sent-76, score-0.211]
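The behavior of the posterior means in eqs. (8)-(9) is easy to check numerically. The following is a minimal sketch, not the authors' code: it assumes the shift is applied to the projected difference $W(x' - x)$ and that the odds are computed from a closed-form $P(y = 1 \mid x, x')$ for an isotropic Gaussian model with coincidence probability $\exp(-\|z - z'\|^2 / 2\kappa^2)$. The function name is hypothetical.

```python
import numpy as np

def posterior_mean_z(W, x, x_prime, y, sigma2, kappa2):
    """E[z | x, x', y] following eqs. (8)-(9), under the stated assumptions."""
    p = W.shape[0]
    delta = W @ (x_prime - x)                    # direction from Wx toward Wx'
    scale = kappa2 + 2.0 * sigma2
    if y == 1:
        coeff = sigma2 / scale                   # eq. (9): pulled toward Wx'
    else:
        # Assumed closed form for P(y=1 | x, x'), used to obtain the odds nu.
        p1 = (kappa2 / scale) ** (p / 2.0) * np.exp(-delta @ delta / (2.0 * scale))
        nu = p1 / (1.0 - p1)
        coeff = -nu * sigma2 / scale             # eq. (8): pushed away from Wx'
    return W @ x + coeff * delta
```

A useful sanity check is that averaging the two posterior means with weights P(y = 1 | x, x') and P(y = 0 | x, x') recovers the prior mean Wx.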

41 For learning it is also necessary to compute second-order statistics of the posterior distribution. [sent-77, score-0.075]

42 For the posterior variances, straightforward calculations give: $E\left[\|z - \bar{z}\|^2 \mid x, x', y = 0\right] = p\sigma^2 \left[1 + \frac{\nu\sigma^2}{\kappa^2 + 2\sigma^2}\right]$, (11) $E\left[\|z - \bar{z}\|^2 \mid x, x', y = 1\right] = p\sigma^2 \left[1 - \frac{\sigma^2}{\kappa^2 + 2\sigma^2}\right]$. (12) For the latter, the statistics can be expressed as the differences of Gaussian integrals. [sent-78, score-0.103]

43 where $\bar{z}$ in these expressions denotes the posterior means in eqs. [sent-79, score-0.075]

44 (8–9), and again the coefficient $\nu$ is shorthand for the odds $\nu(x, x')$ in eq. [sent-80, score-0.108]

45 (13) Intuitively, we see that the posterior variance shrinks if the observed label indicates a coincidence ($y = 1$) and grows if not ($y = 0$). [sent-84, score-0.278]

46 The expressions for the posterior variance of the latent variable $z'$ are identical due to the model’s symmetry. [sent-85, score-0.235]

47 We assume that the data comes in the form of paired inputs $x, x'$, together with binary judgments $y \in \{0, 1\}$ of similarity or difference. [sent-88, score-0.068]

48 In particular, from a training set $\{(x_i, x'_i, y_i)\}_{i=1}^N$ of N such examples, we wish to learn the parameters that maximize the conditional log-likelihood $L(W, \sigma^2, \kappa^2) = \sum_{i=1}^N \log P(y_i \mid x_i, x'_i)$ (14) of observed coincidences ($y_i = 1$) and non-coincidences ($y_i = 0$). [sent-89, score-0.182]
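The conditional log-likelihood in eq. (14) can be evaluated for a whole training set at once. This sketch assumes the closed form $P(y = 1 \mid x, x') = \left(\kappa^2 / (\kappa^2 + 2\sigma^2)\right)^{p/2} \exp\left(-\|W(x - x')\|^2 / 2(\kappa^2 + 2\sigma^2)\right)$, which follows from an isotropic Gaussian model with coincidence probability $\exp(-\|z - z'\|^2 / 2\kappa^2)$; that form and the function name are assumptions, since the summary does not reproduce the corresponding equations.

```python
import numpy as np

def conditional_log_likelihood(W, X, X_prime, y, sigma2, kappa2):
    """Sum over i of log P(y_i | x_i, x'_i), eq. (14); pairs are stored row-wise."""
    p = W.shape[0]
    diffs = (X - X_prime) @ W.T                  # (N, p) projected differences
    sq = np.sum(diffs ** 2, axis=1)              # squared projected distances
    scale = kappa2 + 2.0 * sigma2
    log_p1 = 0.5 * p * np.log(kappa2 / scale) - sq / (2.0 * scale)
    log_p0 = np.log1p(-np.exp(log_p1))           # log(1 - P(y=1)); needs sigma2 > 0
    return float(np.sum(np.where(y == 1, log_p1, log_p0)))
```

Monitoring this quantity across EM iterations is a simple way to confirm that each update does not decrease the objective.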

49 We say that the data is incomplete or partially observed in the sense that the examples do not specify target values for the latent variables $z, z'$; instead, such target values must be inferred from the model’s posterior distribution. [sent-90, score-0.347]

50 In the absence of an analytical solution, we avail ourselves of the EM algorithm, an iterative procedure for maximum likelihood estimation in latent variable models [10]. [sent-96, score-0.14]

51 The EM algorithm consists of two steps, an E-step which computes statistics of the posterior distribution in eq. [sent-97, score-0.075]

52 Intuitively, the EM algorithm uses the posterior means in eqs. [sent-100, score-0.075]

53 (8–9) to “fill in” the missing values of the latent variables $z, z'$. [sent-101, score-0.136]

54 As shorthand, let $\bar{z}_i = E[z \mid x_i, x'_i, y_i]$ (15) and $\bar{z}'_i = E[z' \mid x_i, x'_i, y_i]$ (16) denote these posterior means for the ith example in the training set, as computed from the results in eq. [sent-102, score-0.354]

55 (17) gives the update rule: $W \leftarrow \left[\sum_{i=1}^N \left(\bar{z}_i x_i^\top + \bar{z}'_i {x'_i}^\top\right)\right] \left[\sum_{i=1}^N \left(x_i x_i^\top + x'_i {x'_i}^\top\right)\right]^{-1}$, (18) where the product in eq. [sent-106, score-0.272]
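The update in eq. (18) is an ordinary least-squares solve: it regresses the "filled-in" latent targets onto the inputs. A minimal NumPy sketch follows, assuming the posterior means and inputs are stored row-wise; the function and argument names are illustrative and not taken from the paper.

```python
import numpy as np

def update_W(Z_bar, Z_bar_prime, X, X_prime):
    """Least-squares M-step for W, following eq. (18).

    Rows of Z_bar / Z_bar_prime are posterior means of z / z';
    rows of X / X_prime are the paired inputs x / x'.
    """
    A = Z_bar.T @ X + Z_bar_prime.T @ X_prime    # (p, d): sum of z x^T terms
    B = X.T @ X + X_prime.T @ X_prime            # (d, d): sum of x x^T terms
    # Solve W B = A without forming an explicit inverse (B is symmetric).
    return np.linalg.solve(B, A.T).T
```

If the targets were generated exactly by some linear map, this update would recover that map in one step, which is a convenient correctness check.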

56 As shorthand, let $\varepsilon_i^2 = E\left[\|z - \bar{z}\|^2 \mid x_i, x'_i, y_i\right]$ (19) [Table residue; rows: # train, # test, # classes, # features (D), # inputs (d), LCA dim (p), Euclidean, PCA, LMNN, LCA; MNIST column: 60000, 10000, 10, 784, 164, 40, 2. [sent-109, score-0.187]

57 For efficiency we projected data sets of high dimensionality D down to their leading d principal components. [sent-142, score-0.113]

58 BBC [12] and Classic4 [13] are text corpora with labeled topics. [sent-144, score-0.086]

59 denote the posterior variance of z for the ith example in the training set, as computed from the results in eqs. [sent-148, score-0.118]

60 3 Applications: We explore two applications of LCA in which its linear transformation is used to preprocess the data for different models of multiway classification. [sent-165, score-0.096]

61 We assume that the original data consists of labeled examples $\{(x_i, c_i)\}$ of inputs $x_i \in \mathbb{R}^d$ and their class labels c_i ∈ {1, 2, . [sent-166, score-0.272]

62 For each application, we show how to instantiate LCA by creating a particular data set of labeled pairs, where the labels indicate whether the examples in each pair should be mapped closer together (y = 1) or farther apart (y = 0) in LCA’s latent space of dimensionality p ≤ d. [sent-170, score-0.435]

63 1 Gaussian mixture modeling: Gaussian mixture models (GMMs) offer perhaps the simplest parametric model of multiway classification. [sent-174, score-0.129]

64 In the most straightforward application of GMMs, the labeled examples in each class c are modeled by a single multivariate Gaussian distribution with mean $\mu_c$ and covariance matrix $\Sigma_c$. [sent-175, score-0.189]

65 Classification is also simple: for each unlabeled example, we use Bayes rule to compute the class with the highest posterior probability. [sent-176, score-0.125]
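As a concrete illustration of this baseline, the sketch below fits one Gaussian per class and classifies with Bayes rule in log space. It is a generic maximum likelihood classifier, not code from the paper; the small ridge term added to each covariance matrix and all function names are assumptions made for numerical stability and illustration.

```python
import numpy as np

def fit_class_gaussians(X, labels):
    """Estimate a log prior, mean, and covariance for each class."""
    classes = np.unique(labels)
    params = []
    for c in classes:
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)
        # Maximum likelihood covariance plus a small ridge (an assumption).
        cov = np.cov(Xc, rowvar=False, bias=True) + 1e-6 * np.eye(X.shape[1])
        params.append((np.log(len(Xc) / len(X)), mu, cov))
    return classes, params

def classify(x, classes, params):
    """Return the class with the highest posterior probability for input x."""
    scores = []
    for log_prior, mu, cov in params:
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        scores.append(log_prior - 0.5 * (logdet + diff @ np.linalg.solve(cov, diff)))
    return classes[int(np.argmax(scores))]
```

The constant term of the Gaussian log density is omitted because it is the same for every class and does not affect the argmax.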

66 In this case two simple options are: (i) to reduce the input’s dimensionality using principal component analysis (PCA) or linear discriminant analysis (LDA) [15], or (ii) to model each multivariate Gaussian distribution using factor analysis. [sent-179, score-0.156]

67 Factor analysis can be formulated as a latent variable model, and its parameters estimated by an EM algorithm [1]. [sent-181, score-0.14]

68 In particular, we use LCA to project each example $x_i$ into a lower dimensional space where we hope for two properties: (i) that it is closer to the mean of projected examples from the same class $y_i$, and (ii) that it is farther from the mean of projected examples from other classes $c \ne y_i$. [sent-185, score-0.391]

69 Let $\mu_c \in \mathbb{R}^d$ denote the mean of the labeled examples in class c. [sent-187, score-0.165]

70 Then we create a training set of labeled pairs $\{\mu_c, x_i, y_{ic}\}$ over all examples $x_i$ where $y_{ic} = 1$ if $y_i = c$ and $y_{ic} = 0$ if $y_i \ne c$. [sent-188, score-0.55]
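The construction of this pair set can be sketched in a few lines of NumPy: each class mean is paired with every example, and the label records whether the example belongs to that class. The function name and the row-wise array layout are illustrative assumptions.

```python
import numpy as np

def make_mean_pairs(X, labels):
    """Pair every class mean mu_c with every example x_i; y_ic = 1 iff labels[i] == c."""
    classes = np.unique(labels)
    means = np.stack([X[labels == c].mean(axis=0) for c in classes])
    left = np.repeat(means, len(X), axis=0)      # mu_c, repeated once per example
    right = np.tile(X, (len(classes), 1))        # all examples, once per class
    y = (np.repeat(classes, len(X)) == np.tile(labels, len(classes))).astype(int)
    return left, right, y
```

With C classes and n examples this produces C·n pairs, of which exactly n are labeled as coincidences.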

71 As we shall see, this decision rule for LCA often makes different predictions than Bayes rule in maximum likelihood GMMs. [sent-194, score-0.058]

72 2 shows that LCA generally outperforms these other methods; also, its largest gains occur in the regime of very aggressive dimensionality reduction $p \ll d$. [sent-199, score-0.081]

73 2 Distance metric learning: We can also apply LCA to learn a distance metric that improves kNN classification [4, 5, 6]. [sent-204, score-0.252]

74 In LMNN, each training example has k target neighbors, typically chosen as the k nearest neighbors in Euclidean space with the same class label. [sent-206, score-0.132]
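Target neighbors as described here can be computed directly from pairwise Euclidean distances restricted to same-class examples. This is a brute-force sketch for small data sets, assuming every class has at least k + 1 examples; the function name is hypothetical.

```python
import numpy as np

def target_neighbors(X, labels, k):
    """Indices of the k nearest same-class neighbors (Euclidean) for each row of X."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)                # an example is not its own neighbor
    masked = np.where(same, sq_dists, np.inf)    # exclude other classes
    # Assumes each class has more than k members; otherwise inf entries leak in.
    return np.argsort(masked, axis=1)[:, :k]
```

The pairwise distance matrix costs O(n²) memory, so a tree-based or approximate neighbor search would be preferable for large training sets.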

75 LMNN learns a metric to shrink the distances between examples and target neighbors while preserving (or increasing) the distances between examples from different classes. [sent-207, score-0.349]

76 Errors in kNN classification tend to occur when differently labeled examples are closer together than pairs of target neighbors. [sent-208, score-0.253]

77 Thus LMNN seeks to minimize the number of differently labeled examples that invade the perimeters established by target neighbors. [sent-209, score-0.183]

78 In LCA, we can view the matrix $W^\top W$ as a Mahalanobis distance metric for kNN classification. [sent-211, score-0.144]
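The equivalence behind this view is that a Mahalanobis distance with metric $M = W^\top W$ equals the Euclidean distance between the projected points $Wx$ and $Wx'$. The short sketch below computes both forms so they can be compared; the function names are illustrative.

```python
import numpy as np

def mahalanobis_sq(W, x, x_prime):
    """Squared distance (x - x')^T (W^T W) (x - x') under the metric M = W^T W."""
    diff = x - x_prime
    return float(diff @ (W.T @ W) @ diff)

def projected_sq(W, x, x_prime):
    """The same quantity, computed as a squared Euclidean distance after projecting by W."""
    proj = W @ (x - x_prime)
    return float(proj @ proj)
```

In practice the second form is the cheaper one: project the data once with W, then run a standard Euclidean kNN classifier in the p-dimensional space.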

79 The starting point of LCA is to create a training set of pairs of examples. [sent-212, score-0.066]

80 The plots show test set error versus dimensionality p. [sent-214, score-0.062]

81 these pairs, we wish the similarly labeled examples to coincide ($y = 1$) and the differently labeled examples to diverge ($y = 0$). [sent-217, score-0.321]

82 For the former, it is natural to choose all pairs of examples and their target neighbors. [sent-218, score-0.14]

83 For the latter, it is natural to choose all pairs of differently labeled examples. [sent-219, score-0.129]

84 Concretely, if there are c classes, each with m examples, then this approach creates a training set of $ckm$ pairs of similarly labeled examples (with $y = 1$) and $c(c - 1)m^2$ pairs of differently labeled examples (with $y = 0$). [sent-220, score-0.397]
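A quick calculation shows how unbalanced these two pair counts become. The helper below simply evaluates the two expressions, with k denoting the number of target neighbors per example; the function name and the example sizes are illustrative, not taken from the experiments.

```python
def pair_counts(c, m, k):
    """Number of similar (y=1) and different (y=0) pairs for c classes of m examples."""
    similar = c * k * m                  # each of the c*m examples has k target neighbors
    different = c * (c - 1) * m * m      # each example paired with all other-class examples
    return similar, different
```

For instance, 10 classes of 100 examples with k = 3 give 3,000 similar pairs but 900,000 different pairs, so the number of negatively labeled pairs grows quadratically with the class size.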

85 First, we do not include training examples without impostors. [sent-223, score-0.081]

86 Second, among the pairs of differently labeled examples, we only include each example with its current or previous impostors. [sent-224, score-0.129]

87 If so, we add the example, its target neighbors, and impostors into the training set. [sent-226, score-0.062]

88 Our short-cuts are similar in spirit to the optimizations in LMNN [5] as well as more general cutting plane strategies of constrained optimization [16]. [sent-228, score-0.061]

89 Recall that the parameter $\kappa^2$ determines the length scale at which projected examples are judged to coincide in the latent space. [sent-231, score-0.234]

90 These local parameters $\kappa^2$ are needed to account for the fact that different inputs may reside at very different distances from their target neighbors. [sent-234, score-0.133]

91 The plots show kNN classification error (training dotted, test solid) versus dimensionality p. [sent-239, score-0.062]

92 perform kNN classification using the Mahalanobis distance metric parameterized by the linear transformation W. [sent-244, score-0.196]

93 Additionally, we hold out a validation set to search for an optimal dimensionality p. [sent-249, score-0.062]

94 Advantageously, we often obtain our best result with LCA using a lower dimensionality p < d. [sent-253, score-0.062]

95 4 Discussion: In this paper we have introduced Latent Coincidence Analysis (LCA), a latent variable model for learning linear projections that map similar inputs closer together and different inputs farther apart. [sent-254, score-0.359]

96 On problems in mixture modeling and distance metric learning tasks, LCA performs competitively across a range of reduced dimensionalities. [sent-257, score-0.172]

97 To handle larger data sets, we plan to explore online strategies for distance metric learning [18], possibly based on Bayesian [19] or confidence-weighted updates [20]. [sent-260, score-0.205]

98 Finally, we will explore hybrid strategies between the mixture modeling in section 3. [sent-261, score-0.07]

99 2, where multiple (but not all) examples in each class are used as “anchors” for distance-based classification. [sent-263, score-0.079]

100 Distance metric learning for large margin nearest neighbor classification. [sent-299, score-0.149]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lca', 0.779), ('lmnn', 0.174), ('knn', 0.17), ('pca', 0.163), ('coincidence', 0.162), ('di', 0.161), ('erent', 0.125), ('erently', 0.123), ('latent', 0.118), ('bbc', 0.114), ('em', 0.101), ('gmms', 0.097), ('labeled', 0.086), ('metric', 0.086), ('posterior', 0.075), ('isolet', 0.071), ('mnist', 0.069), ('inputs', 0.068), ('odds', 0.067), ('dimensionality', 0.062), ('fa', 0.061), ('bal', 0.06), ('yic', 0.06), ('lda', 0.059), ('examples', 0.058), ('distance', 0.058), ('seg', 0.053), ('nca', 0.053), ('classi', 0.053), ('zi', 0.048), ('iris', 0.043), ('multiway', 0.043), ('pairs', 0.043), ('neighbor', 0.043), ('shorthand', 0.041), ('yi', 0.041), ('wx', 0.04), ('analyzers', 0.04), ('conceived', 0.04), ('lineages', 0.04), ('xi', 0.039), ('target', 0.039), ('optimizations', 0.039), ('farther', 0.036), ('wc', 0.035), ('transformation', 0.033), ('coincide', 0.033), ('tradition', 0.032), ('itml', 0.032), ('gaussian', 0.032), ('wxi', 0.031), ('simplest', 0.03), ('roweis', 0.029), ('rule', 0.029), ('neighbors', 0.029), ('calculations', 0.028), ('mixture', 0.028), ('closer', 0.027), ('editors', 0.027), ('shrink', 0.027), ('er', 0.027), ('principal', 0.026), ('distances', 0.026), ('factor', 0.026), ('projected', 0.025), ('negatively', 0.025), ('mahalanobis', 0.025), ('mapped', 0.025), ('multivariate', 0.024), ('dz', 0.024), ('training', 0.023), ('saul', 0.023), ('positively', 0.023), ('letters', 0.023), ('ers', 0.023), ('instantiate', 0.023), ('kulis', 0.023), ('strategies', 0.022), ('variable', 0.022), ('learn', 0.022), ('class', 0.021), ('numerator', 0.021), ('label', 0.021), ('variance', 0.02), ('dimensional', 0.02), ('explore', 0.02), ('projections', 0.02), ('update', 0.02), ('nearest', 0.02), ('updates', 0.019), ('la', 0.019), ('reduction', 0.019), ('parameterized', 0.019), ('variables', 0.018), ('discriminant', 0.018), ('marcel', 0.018), ('obermayer', 0.018), ('ban', 0.018), ('coincidences', 0.018), ('ction', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

Author: Matthew Der, Lawrence K. Saul

2 0.24061938 242 nips-2012-Non-linear Metric Learning

Author: Dor Kedem, Stephen Tyree, Fei Sha, Gert R. Lanckriet, Kilian Q. Weinberger

Abstract: In this paper, we introduce two novel metric learning algorithms, $\chi^2$-LMNN and GB-LMNN, which are explicitly designed to be non-linear and easy-to-use. The two approaches achieve this goal in fundamentally different ways: $\chi^2$-LMNN inherits the computational benefits of a linear mapping from linear metric learning, but uses a non-linear $\chi^2$-distance to explicitly capture similarities within histogram data sets; GB-LMNN applies gradient-boosting to learn non-linear mappings directly in function space and takes advantage of this approach’s robustness, speed, parallelizability and insensitivity towards the single additional hyperparameter. On various benchmark data sets, we demonstrate these methods not only match the current state-of-the-art in terms of kNN classification error, but in the case of $\chi^2$-LMNN, obtain best results in 19 out of 20 learning settings.

3 0.10322034 148 nips-2012-Hamming Distance Metric Learning

Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes.

4 0.089281283 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

Author: Jun Wang, Alexandros Kalousis, Adam Woznica

Abstract: We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of local unrelated metrics. While this “independence” approach delivers an increased flexibility its downside is the considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing on the linear combinations manifold regularization which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several large-scale classification problems, tens of thousands of instances, and compared it with several state of the art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a significant manner.

5 0.089044362 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

Abstract: Multi-metric learning techniques learn local metric tensors in different parts of a feature space. With such an approach, even simple classiﬁers can be competitive with the state-of-the-art because the distance measure locally adapts to the structure of the data. The learned distance measure is, however, non-metric, which has prevented multi-metric learning from generalizing to tasks such as dimensionality reduction and regression in a principled way. We prove that, with appropriate changes, multi-metric learning corresponds to learning the structure of a Riemannian manifold. We then show that this structure gives us a principled way to perform dimensionality reduction and regression according to the learned metrics. Algorithmically, we provide the ﬁrst practical algorithm for computing geodesics according to the learned metrics, as well as algorithms for computing exponential and logarithmic maps on the Riemannian manifold. Together, these tools let many Euclidean algorithms take advantage of multi-metric learning. We illustrate the approach on regression and dimensionality reduction tasks that involve predicting measurements of the human body from shape data. 1 Learning and Computing Distances Statistics relies on measuring distances. When the Euclidean metric is insufﬁcient, as is the case in many real problems, standard methods break down. This is a key motivation behind metric learning, which strives to learn good distance measures from data. In the most simple scenarios a single metric tensor is learned, but in recent years, several methods have proposed learning multiple metric tensors, such that different distance measures are applied in different parts of the feature space. This has proven to be a very powerful approach for classiﬁcation tasks [1, 2], but the approach has not generalized to other tasks. Here we consider the generalization of Principal Component Analysis (PCA) and linear regression; see Fig. 1 for an illustration of our approach. 
The main problem with generalizing multi-metric learning is that it is based on assumptions that make the feature space both non-smooth and non-metric. Speciﬁcally, it is often assumed that straight lines form geodesic curves and that the metric tensor stays constant along these lines. These assumptions are made because it is believed that computing the actual geodesics is intractable, requiring a discretization of the entire feature space [3]. We solve these problems by smoothing the transitions between different metric tensors, which ensures a metric space where geodesics can be computed. In this paper, we consider the scenario where the metric tensor at a given point in feature space is deﬁned as the weighted average of a set of learned metric tensors. In this model, we prove that the feature space becomes a chart for a Riemannian manifold. This ensures a metric feature space, i.e. dist(x, y) = 0 ⇔ x = y , dist(x, y) = dist(y, x) (symmetry), (1) dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality). To compute statistics according to the learned metric, we need to be able to compute distances, which implies that we need to compute geodesics. Based on the observation that geodesics are 1 (a) Local Metrics & Geodesics (b) Tangent Space Representation (c) First Principal Geodesic Figure 1: Illustration of Principal Geodesic Analysis. (a) Geodesics are computed between the mean and each data point. (b) Data is mapped to the Euclidean tangent space and the ﬁrst principal component is computed. (c) The principal component is mapped back to the feature space. smooth curves in Riemannian spaces, we derive an algorithm for computing geodesics that only requires a discretization of the geodesic rather than the entire feature space. Furthermore, we show how to compute the exponential and logarithmic maps of the manifold. With this we can map any point back and forth between a Euclidean tangent space and the manifold. 
This gives us a general strategy for incorporating the learned metric tensors in many Euclidean algorithms: map the data to the tangent of the manifold, perform the Euclidean analysis and map the results back to the manifold. Before deriving the algorithms (Sec. 3) we set the scene by an analysis of the shortcomings of current state-of-the-art methods (Sec. 2), which motivate our ﬁnal model. The model is general and can be used for many problems. Here we illustrate it with several challenging problems in 3D body shape modeling and analysis (Sec. 4). All proofs can be found in the supplementary material along with algorithmic details and further experimental results. 2 Background and Related Work Single-metric learning learns a metric tensor, M, such that distances are measured as dist2 (xi , xj ) = xi − xj 2 M ≡ (xi − xj )T M(xi − xj ) , (2) where M is a symmetric and positive deﬁnite D × D matrix. Classic approaches for ﬁnding such a metric tensor include PCA, where the metric is given by the inverse covariance matrix of the training data; and linear discriminant analysis (LDA), where the metric tensor is M = S−1 SB S−1 , with Sw W W and SB being the within class scatter and the between class scatter respectively [9]. A more recent approach tries to learn a metric tensor from triplets of data points (xi , xj , xk ), where the metric should obey the constraint that dist(xi , xj ) < dist(xi , xk ). Here the constraints are often chosen such that xi and xj belong to the same class, while xi and xk do not. Various relaxed versions of this idea have been suggested such that the metric can be learned by solving a semi-deﬁnite or a quadratic program [1, 2, 4–8]. Among the most popular approaches is the Large Margin Nearest Neighbor (LMNN) classiﬁer [5], which ﬁnds a linear transformation that satisﬁes local distance constraints, making the approach suitable for multi-modal classes. 
For many problems, a single global metric tensor is not enough, which motivates learning several local metric tensors. The classic work by Hastie and Tibshirani [9] advocates locally learning metric tensors according to LDA and using these as part of a kNN classiﬁer. In a somewhat similar fashion, Weinberger and Saul [5] cluster the training data and learn a separate metric tensor for each cluster using LMNN. A more extreme point of view was taken by Frome et al. [1, 2], who learn a diagonal metric tensor for every point in the training set, such that distance rankings are preserved. Similarly, Malisiewicz and Efros [6] ﬁnd a diagonal metric tensor for each training point such that the distance to a subset of the training data from the same class is kept small. Once a set of metric tensors {M1 , . . . , MR } has been learned, the distance dist(a, b) is measured according to (2) where “the nearest” metric tensor is used, i.e. R M(x) = r=1 wr (x) ˜ Mr , where wr (x) = ˜ ˜ j wj (x) 1 0 x − xr 2 r ≤ x − xj M otherwise 2 Mj , ∀j , (3) where x is either a or b depending on the algorithm. Note that this gives a non-metric distance function as it is not symmetric. To derive this equation, it is necessary to assume that 1) geodesics 2 −8 −8 Assumed Geodesics Location of Metric Tensors Test Points −6 −8 Actual Geodesics Location of Metric Tensors Test Points −6 Riemannian Geodesics Location of Metric Tensors Test Points −6 −4 −4 −4 −2 −2 −2 0 0 0 2 2 2 4 4 4 6 −8 6 −8 −6 −4 −2 0 (a) 2 4 6 −6 −4 −2 0 2 4 6 6 −8 −6 (b) −4 −2 (c) 0 2 4 6 (d) Figure 2: (a)–(b) An illustrative example where straight lines do not form geodesics and where the metric tensor does not stay constant along lines; see text for details. The background color is proportional to the trace of the metric tensor, such that light grey corresponds to regions where paths are short (M1 ), and dark grey corresponds to regions they are long (M2 ). (c) The suggested geometric model along with the geodesics. 
Again, background colour is proportional to the trace of the metric tensor; the colour scale is the same is used in (a) and (b). (d) An illustration of the exponential and logarithmic maps. form straight lines, and 2) the metric tensor stays constant along these lines [3]. Both assumptions are problematic, which we illustrate with a simple example in Fig. 2a–c. Assume we are given two metric tensors M1 = 2I and M2 = I positioned at x1 = (2, 2)T and x2 = (4, 4)T respectively. This gives rise to two regions in feature space in which x1 is nearest in the ﬁrst and x2 is nearest in the second, according to (3). This is illustrated in Fig. 2a. In the same ﬁgure, we also show the assumed straight-line geodesics between selected points in space. As can be seen, two of the lines goes through both regions, such that the assumption of constant metric tensors along the line is violated. Hence, it would seem natural to measure the length of the line, by adding the length of the line segments which pass through the different regions of feature space. This was suggested by Ramanan and Baker [3] who also proposed a polynomial time algorithm for measuring these line lengths. This gives a symmetric distance function. Properly computing line lengths according to the local metrics is, however, not enough to ensure that the distance function is metric. As can be seen in Fig. 2a the straight line does not form a geodesic as a shorter path can be found by circumventing the region with the “expensive” metric tensor M1 as illustrated in Fig. 2b. This issue makes it trivial to construct cases where the triangle inequality is violated, which again makes the line length measure non-metric. In summary, if we want a metric feature space, we can neither assume that geodesics are straight lines nor that the metric tensor stays constant along such lines. In practice, good results have been reported using (3) [1,3,5], so it seems obvious to ask: is metricity required? 
For kNN classifiers this does not appear to be the case, with many successes based on dissimilarities rather than distances [10]. We, however, want to generalize PCA and linear regression, which both seek to minimize the reconstruction error of points projected onto a subspace. As the notion of projection is hard to define sensibly in non-metric spaces, we consider metricity essential. In order to build a model with a metric feature space, we change the weights in (3) to be smooth functions. This imposes a well-behaved geometric structure on the feature space, which we take advantage of in order to perform statistical analysis according to the learned metrics. However, first we review the basics of Riemannian geometry as this provides the theoretical foundation of our work.

2.1 Geodesics and Riemannian Geometry

We start by defining Riemannian manifolds, which intuitively are smoothly curved spaces equipped with an inner product. Formally, they are smooth manifolds endowed with a Riemannian metric [11]:

Definition A Riemannian metric $\mathbf{M}$ on a manifold $\mathcal{M}$ is a smoothly varying inner product $\langle a, b \rangle_x = a^T \mathbf{M}(x) b$ in the tangent space $T_x\mathcal{M}$ of each point $x \in \mathcal{M}$.

Often Riemannian manifolds are represented by a chart; i.e. a parameter space for the curved surface. An example chart is the spherical coordinate system often used to represent spheres. While such charts are often flat spaces, the curvature of the manifold arises from the smooth changes in the metric. On a Riemannian manifold $\mathcal{M}$, the length of a smooth curve $c : [0, 1] \to \mathcal{M}$ is defined as the integral of the norm of the tangent vector (interpreted as speed) along the curve:

$$\mathrm{Length}(c) = \int_0^1 \|c'(\lambda)\|_{\mathbf{M}(c(\lambda))} \, d\lambda = \int_0^1 \sqrt{c'(\lambda)^T \mathbf{M}(c(\lambda)) \, c'(\lambda)} \, d\lambda, \tag{4}$$

where $c'$ denotes the derivative of $c$ and $\mathbf{M}(c(\lambda))$ is the metric tensor at $c(\lambda)$. A geodesic curve is then a length-minimizing curve connecting two given points $x$ and $y$, i.e.

$$c_{\mathrm{geo}} = \arg\min_{c} \mathrm{Length}(c) \quad \text{with} \quad c(0) = x \ \text{and} \ c(1) = y. \tag{5}$$

The distance between $x$ and $y$ is defined as the length of the geodesic. Given a tangent vector $v \in T_x\mathcal{M}$, there exists a unique geodesic $c_v(t)$ with initial velocity $v$ at $x$. The Riemannian exponential map, $\mathrm{Exp}_x$, maps $v$ to a point on the manifold along the geodesic $c_v$ at $t = 1$. This mapping preserves distances such that $\mathrm{dist}(c_v(0), c_v(1)) = \|v\|$. The inverse of the exponential map is the Riemannian logarithmic map, denoted $\mathrm{Log}_x$. Informally, the exponential and logarithmic maps move points back and forth between the manifold and the tangent space while preserving distances (see Fig. 2d for an illustration). This provides a general strategy for generalizing many Euclidean techniques to Riemannian domains: data points are mapped to the tangent space, where ordinary Euclidean techniques are applied, and the results are mapped back to the manifold.

3 A Metric Feature Space

With the preliminaries settled we define the new model. Let $\mathcal{C} = \mathbb{R}^D$ denote the feature space. We endow $\mathcal{C}$ with a metric tensor in every point $x$, which we define akin to (3),

$$\mathbf{M}(x) = \sum_{r=1}^{R} w_r(x) \mathbf{M}_r, \quad \text{where} \quad w_r(x) = \frac{\tilde{w}_r(x)}{\sum_{j=1}^{R} \tilde{w}_j(x)}, \tag{6}$$

with $w_r > 0$. The only difference from (3) is that we shall not restrict ourselves to binary weight functions $\tilde{w}_r$. We assume the metric tensors $\mathbf{M}_r$ have already been learned; Sec. 4 contains examples where they have been learned using LMNN [5] and LDA [9]. From the definition of a Riemannian metric, we trivially have the following result:

Lemma 1 The space $\mathcal{C} = \mathbb{R}^D$ endowed with the metric tensor from (6) is a chart of a Riemannian manifold, iff the weights $w_r(x)$ change smoothly with $x$.

Hence, by only considering smooth weight functions $\tilde{w}_r$ we get a well-studied geometric structure on the feature space, which ensures that it is metric. To illustrate the implications we return to the example in Fig. 2. We change the weight functions from binary to squared exponentials, which gives the feature space shown in Fig. 2c.
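The construction in (6) can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions: the function name `smooth_metric` is hypothetical, the weights take the squared-exponential form $\tilde{w}_r(x) = \exp(-\frac{\rho}{2}\|x - x_r\|^2_{\mathbf{M}_r})$ used later in Theorem 3, and the bandwidth `rho = 1.0` is an arbitrary choice. The two tensors are the ones from the Fig. 2 example.

```python
import numpy as np


def smooth_metric(x, centers, tensors, rho=1.0):
    """Weighted average of local metric tensors, following (6).

    centers: (R, D) array with the positions x_r of the local metrics.
    tensors: (R, D, D) array with symmetric positive definite M_r.
    rho:     bandwidth of the squared-exponential weights (an assumed value).
    Returns the metric tensor M(x) and the normalized weights w_r(x).
    """
    x = np.asarray(x, dtype=float)
    diffs = x - centers                                         # (R, D)
    sq_dist = np.einsum("rd,rde,re->r", diffs, tensors, diffs)  # ||x - x_r||^2 under M_r
    # Subtracting the smallest distance avoids underflow far from all centers;
    # the shift cancels when the weights are normalized.
    w_tilde = np.exp(-0.5 * rho * (sq_dist - sq_dist.min()))
    w = w_tilde / w_tilde.sum()
    return np.einsum("r,rde->de", w, tensors), w


# Toy setup from the Fig. 2 example: M1 = 2I at (2, 2) and M2 = I at (4, 4).
centers = np.array([[2.0, 2.0], [4.0, 4.0]])
tensors = np.stack([2.0 * np.eye(2), np.eye(2)])

M_near_x1, w_near_x1 = smooth_metric([2.0, 2.0], centers, tensors)
M_mid, w_mid = smooth_metric([3.0, 3.0], centers, tensors)
print(M_near_x1)
print(M_mid)
```

Because the normalized weights are positive and sum to one, every $\mathbf{M}(x)$ is a convex combination of the learned tensors and therefore stays symmetric positive definite, which is what makes it usable as a Riemannian metric.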
As can be seen, the metric tensor now changes smoothly, which also makes the geodesics smooth curves (a property we will use when computing the geodesics). It is worth noting that Ramanan and Baker [3] also consider the idea of smoothly averaging the metric tensor. They, however, only evaluate the metric tensor at the test point of their classifier and then assume straight line geodesics with a constant metric tensor. Such assumptions violate the premise of a smoothly changing metric tensor and, again, the distance measure becomes non-metric.

Lemma 1 shows that metric learning can be viewed as manifold learning. The main difference between our approach and techniques such as Isomap [12] is that, while Isomap learns an embedding of the data points, we learn the actual manifold structure. This gives us the benefit that we can compute geodesics as well as the exponential and logarithmic maps. These provide us with mappings back and forth between the manifold and Euclidean representation of the data, which preserve distances as well as possible. The availability of such mappings is in stark contrast to e.g. Isomap. In the next section we will derive a system of ordinary differential equations (ODEs) that geodesics in $\mathcal{C}$ have to satisfy, which provides us with algorithms for computing geodesics as well as exponential and logarithmic maps. With these we can generalize many Euclidean techniques.

3.1 Computing Geodesics, Maps and Statistics

At minima of (4) we know that the Euler-Lagrange equation must hold [11], i.e.

$$\frac{\partial L}{\partial c} = \frac{d}{d\lambda} \frac{\partial L}{\partial c'}, \quad \text{where} \quad L(\lambda, c, c') = c'(\lambda)^T \mathbf{M}(c(\lambda)) \, c'(\lambda). \tag{7}$$

As we have an explicit expression for the metric tensor we can compute (7) in closed form:

Theorem 2 Geodesic curves in $\mathcal{C}$ satisfy the following system of 2nd order ODEs

$$\mathbf{M}(c(\lambda)) \, c''(\lambda) = -\frac{1}{2} \left[ \frac{\partial \mathrm{vec}[\mathbf{M}(c(\lambda))]}{\partial c(\lambda)} \right]^T \left( c'(\lambda) \otimes c'(\lambda) \right), \tag{8}$$

where $\otimes$ denotes the Kronecker product and $\mathrm{vec}[\cdot]$ stacks the columns of a matrix into a vector [13].

Proof See supplementary material.
This result holds for any smooth weight functions $\tilde{w}_r$. We, however, still need to compute $\partial \mathrm{vec}[\mathbf{M}] / \partial c$, which depends on the specific choice of $\tilde{w}_r$. Any smooth weighting scheme is applicable, but we restrict ourselves to the obvious smooth generalization of (3) and use squared exponentials. From this assumption, we get the following result:

Theorem 3 For $\tilde{w}_r(x) = \exp\left( -\frac{\rho}{2} \|x - x_r\|^2_{\mathbf{M}_r} \right)$ the derivative of the metric tensor from (6) is

$$\frac{\partial \mathrm{vec}[\mathbf{M}(c)]}{\partial c} = \frac{\rho}{\left( \sum_{j=1}^{R} \tilde{w}_j \right)^2} \sum_{r=1}^{R} \tilde{w}_r \, \mathrm{vec}[\mathbf{M}_r] \sum_{j=1}^{R} \tilde{w}_j \left( (c - x_j)^T \mathbf{M}_j - (c - x_r)^T \mathbf{M}_r \right). \tag{9}$$

Proof See supplementary material.

Computing Geodesics. Any geodesic curve must be a solution to (8). Hence, to compute a geodesic between $x$ and $y$, we can solve (8) subject to the constraints

$$c(0) = x \quad \text{and} \quad c(1) = y. \tag{10}$$

This is a boundary value problem, which has a smooth solution. This allows us to solve the problem numerically using a standard three-stage Lobatto IIIa formula, which provides a fourth-order accurate $C^1$-continuous solution [14]. Ramanan and Baker [3] discuss the possibility of computing geodesics, but arrive at the conclusion that this is intractable based on the assumption that it requires discretizing the entire feature space. Our solution avoids discretizing the feature space by discretizing the geodesic curve instead. As this is always one-dimensional the approach remains tractable in high-dimensional feature spaces.

Computing Logarithmic Maps. Once a geodesic $c$ is found, it follows from the definition of the logarithmic map, $\mathrm{Log}_x(y)$, that it can be computed as

$$v = \mathrm{Log}_x(y) = \frac{c'(0)}{\|c'(0)\|} \mathrm{Length}(c). \tag{11}$$

In practice, we solve (8) by rewriting it as a system of first order ODEs, such that we compute both $c$ and $c'$ simultaneously (see supplementary material for details).

Computing Exponential Maps. Given a starting point $x$ on the manifold and a vector $v$ in the tangent space, the exponential map, $\mathrm{Exp}_x(v)$, finds the unique geodesic starting at $x$ with initial velocity $v$.
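The idea of discretizing the curve rather than the feature space, described under Computing Geodesics and Computing Logarithmic Maps, can be illustrated with a short sketch. Note that this is not the Lobatto IIIa boundary value solver the text refers to: as a simpler stand-in, it works directly from definitions (4) and (5) and minimizes the energy of a polyline with fixed end points using L-BFGS-B, relying on the standard fact that energy minimizers are constant-speed shortest curves. All function names, the bandwidth `rho = 1.0`, the number of segments, and the choice to take the norm in (11) under $\mathbf{M}(x)$ are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize


def metric(x, centers, tensors, rho):
    """Smoothly weighted metric tensor M(x) from (6), squared-exponential weights."""
    diffs = x - centers                                    # (R, D)
    sq = np.einsum("rd,rde,re->r", diffs, tensors, diffs)  # squared local distances
    w = np.exp(-0.5 * rho * (sq - sq.min()))               # shift cancels on normalizing
    return np.einsum("r,rde->de", w / w.sum(), tensors)


def segment_sq_lengths(points, centers, tensors, rho):
    """Squared length of each polyline segment, metric taken at the segment midpoint."""
    steps = np.diff(points, axis=0)
    mids = 0.5 * (points[1:] + points[:-1])
    return np.array([s @ metric(m, centers, tensors, rho) @ s
                     for s, m in zip(steps, mids)])


def curve_length(points, centers, tensors, rho):
    """Discrete version of Length(c) in (4)."""
    return np.sqrt(segment_sq_lengths(points, centers, tensors, rho)).sum()


def curve_energy(points, centers, tensors, rho):
    """Discrete curve energy; its minimizers are constant-speed shortest curves."""
    n_segments = len(points) - 1
    return n_segments * segment_sq_lengths(points, centers, tensors, rho).sum()


def discrete_geodesic(x, y, centers, tensors, rho=1.0, n_segments=20):
    """Return (optimized polyline, straight polyline), both with fixed end points."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    t = np.linspace(0.0, 1.0, n_segments + 1)[:, None]
    straight = (1.0 - t) * x + t * y                       # initial guess

    def objective(flat_interior):
        pts = np.vstack([x, flat_interior.reshape(-1, x.size), y])
        return curve_energy(pts, centers, tensors, rho)

    result = minimize(objective, straight[1:-1].ravel(), method="L-BFGS-B")
    geodesic = np.vstack([x, result.x.reshape(-1, x.size), y])
    return geodesic, straight


def log_map(geodesic, centers, tensors, rho=1.0):
    """Approximate Log_x(y) as in (11), using the first segment for c'(0)."""
    v0 = geodesic[1] - geodesic[0]
    M0 = metric(geodesic[0], centers, tensors, rho)
    speed = np.sqrt(v0 @ M0 @ v0)          # assumption: norm taken under M(x)
    return v0 / speed * curve_length(geodesic, centers, tensors, rho)


# Toy setup from the Fig. 2 example: M1 = 2I at (2, 2) and M2 = I at (4, 4).
centers = np.array([[2.0, 2.0], [4.0, 4.0]])
tensors = np.stack([2.0 * np.eye(2), np.eye(2)])
geodesic, straight = discrete_geodesic([0.0, 2.0], [4.0, 2.0], centers, tensors)
v = log_map(geodesic, centers, tensors)
print("straight length:", curve_length(straight, centers, tensors, 1.0))
print("geodesic length:", curve_length(geodesic, centers, tensors, 1.0))
print("Log_x(y):", v)
```

The number of unknowns grows with the number of curve points times the dimension $D$, not with a grid over the feature space, which mirrors the tractability argument made in the text. A local optimizer started from the straight line only finds a nearby critical point, so the result should be read as a candidate geodesic.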
As the geodesic must fulfill (8), we can compute the exponential map by solving this system of ODEs with the initial conditions

$$c(0) = x \quad \text{and} \quad c'(0) = v. \tag{12}$$

This initial value problem has a unique solution, which we find numerically using a standard Runge-Kutta scheme [15].

3.1.1 Generalizing PCA and Regression

At this stage, we know that the feature space is Riemannian and we know how to compute geodesics and exponential and logarithmic maps. We now seek to generalize PCA and linear regression, which becomes straightforward since solutions are available in Riemannian spaces [16, 17]. These generalizations can be summarized as mapping the data to the tangent space at the mean, performing standard Euclidean analysis in the tangent space and mapping the results back.

The first step is to compute the mean value on the manifold, which is defined as the point that minimizes the sum-of-squares distances to the data points. Pennec [18] provides an efficient gradient descent approach for computing this point, which we also summarize in the supplementary material.

The empirical covariance of a set of points is defined as the ordinary Euclidean covariance in the tangent space at the mean value [18]. With this in mind, it is not surprising that the principal components of a dataset have been generalized as the geodesics starting at the mean with initial velocity corresponding to the eigenvectors of the covariance [16],

$$\gamma_{v_d}(t) = \mathrm{Exp}_\mu(t v_d), \tag{13}$$

where $v_d$ denotes the $d$-th eigenvector of the covariance. This approach is called Principal Geodesic Analysis (PGA), and the geodesic curve $\gamma_{v_d}$ is called the principal geodesic. An illustration of the approach can be seen in Fig. 1 and more algorithmic details are in the supplementary material.

Linear regression has been generalized in a similar way [17] by performing regression in the tangent space at the mean and mapping the resulting line back to the manifold using the exponential map.
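The tangent-space recipe above (compute the mean on the manifold, map the data to the tangent space at the mean, run ordinary PCA there, and map principal directions back as in (13)) is easiest to see on a manifold whose exponential and logarithmic maps have closed forms. The sketch below therefore swaps in the unit sphere for the learned metric space; this substitution, the function names, the fixed iteration count and the synthetic data are all assumptions made for illustration, not the setup used in the experiments.

```python
import numpy as np


def sphere_exp(p, v):
    """Exponential map on the unit sphere at p for a tangent vector v."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return p
    return np.cos(norm) * p + np.sin(norm) * v / norm


def sphere_log(p, q):
    """Logarithmic map on the unit sphere: tangent vector at p pointing to q."""
    cos_theta = np.clip(p @ q, -1.0, 1.0)
    u = q - cos_theta * p                  # part of q orthogonal to p
    norm = np.linalg.norm(u)
    if norm < 1e-12:                       # q equals p (antipodal q is not handled)
        return np.zeros_like(p)
    return np.arccos(cos_theta) * u / norm


def karcher_mean(points, n_iter=50):
    """Gradient descent on the sum of squared geodesic distances to the data."""
    mu = points[0]
    for _ in range(n_iter):
        mean_tangent = np.mean([sphere_log(mu, q) for q in points], axis=0)
        mu = sphere_exp(mu, mean_tangent)
    return mu


def pga(points):
    """Tangent-space PCA at the manifold mean."""
    mu = karcher_mean(points)
    tangent = np.array([sphere_log(mu, q) for q in points])   # (N, D)
    cov = tangent.T @ tangent / len(points)                   # covariance in T_mu
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]                          # largest variance first
    return mu, eigval[order], eigvec[:, order]


def principal_geodesic(mu, v_d, t):
    """Point at parameter t on the principal geodesic, as in (13)."""
    return sphere_exp(mu, t * v_d)


# Synthetic data near the north pole, spread mostly along the first axis.
rng = np.random.default_rng(0)
north = np.array([0.0, 0.0, 1.0])
coords = rng.normal(scale=[0.3, 0.05], size=(200, 2))
points = np.array([sphere_exp(north, np.array([a, b, 0.0])) for a, b in coords])

mu, eigval, eigvec = pga(points)
print("mean:", mu)
print("leading direction:", eigvec[:, 0])
```

On the learned metric space the same steps apply, with `sphere_exp` and `sphere_log` replaced by numerical solutions of the initial value and boundary value problems described in Sec. 3.1.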
The idea of working in the tangent space is both efficient and convenient, but comes with an element of approximation as the logarithmic map is only guaranteed to preserve distances to the origin of the tangent space and not between all pairs of data points. Practical experience, however, indicates that this is a good tradeoff; see [19] for a more in-depth discussion of when the approximation is suitable.

4 Experiments

To illustrate the framework¹ we consider an example in human body analysis, and then we analyze the scalability of the approach. But first, to build intuition, Fig. 3a shows synthetically generated data samples from two classes. We sample random points $x_r$ and learn a local LDA metric [9] by considering all data points within a radius; this locally pushes the two classes apart. We combine the local metrics using (6), and Fig. 3b shows the data in the tangent space of the resulting manifold. As can be seen, the two classes are now globally further apart, which shows the effect of local metrics.

4.1 Human Body Shape

We consider a regression example concerning human body shape analysis. We study 986 female body laser scans from the CAESAR [20] data set; each shape is represented using the leading 35 principal components of the data learned using a SCAPE-like model [21, 22]. Each shape is associated with anthropometric measurements such as body height, shoe size, etc. We show results for shoulder to wrist distance and shoulder breadth, but results for more measurements are in the supplementary material. To predict the measurements from shape coefficients, we learn local metrics and perform linear regression according to these. As a further experiment, we use PGA to reduce the dimensionality of the shape coefficients according to the local metrics, and measure the quality of the reduction by performing linear regression to predict the measurements. As a baseline we use the corresponding Euclidean techniques. To learn the local metric we do the following.
First we whiten the data such that the variance captured by PGA will only be due to the change of metric; this allows easy visualization of the impact of the learned metrics. We then cluster the body shapes into equal-sized clusters according to the measurement and learn a LMNN metric for each cluster [5], which we associate with the mean of each class. These push the clusters apart, which introduces variance along the directions where the measurement changes. From this we construct a Riemannian manifold according to (6), compute the mean value on the manifold, map the data to the tangent space at the mean and perform linear regression in the tangent space.

¹ Our software implementation for computing geodesics and performing manifold statistics is available at http://ps.is.tue.mpg.de/project/Smooth Metric Learning

Figure 3 (axes: (c) Average Prediction Error against Dimensionality, Euclidean Model and Riemannian Model; (d) Running Time (sec.) against Dimensionality): Left panels: Synthetic data. (a) Samples from two classes along with illustratively sampled metric tensors from (6). (b) The data represented in the tangent of a manifold constructed from local LDA metrics learned at random positions. Right panels: Real data. (c) Average error of linearly predicted body measurements (mm). (d) Running time (sec) of the geodesic computation as a function of dimensionality.

Figure 4 (rows: Shoulder to wrist distance, Shoulder breadth; columns: Euclidean PCA, Tangent Space PCA (PGA), Regression Error, the latter plotting Prediction Error against Dimensionality for the Euclidean Model and the Riemannian Model): Left: body shape data in the first two principal components according to the Euclidean metric. Point color indicates cluster membership. Center: As on the left, but according to the Riemannian model. Right: regression error as a function of the dimensionality of the shape space; again the Euclidean metric and the Riemannian metric are compared.

As a first visualization we plot the data expressed in the leading two dimensions of PGA in Fig. 4; as can be seen, the learned metrics provide principal geodesics which are more strongly related to the measurements than the Euclidean model. In order to predict the measurements from the body shape, we perform linear regression, both directly in the shape space according to the Euclidean metric and in the tangent space of the manifold corresponding to the learned metrics (using the logarithmic map from (11)). We measure the prediction error using leave-one-out cross-validation. To further illustrate the power of the PGA model, we repeat this experiment for different dimensionalities of the data. The results are plotted in Fig. 4, showing that regression according to the learned metrics outperforms the Euclidean model.

To verify that the learned metrics improve accuracy, we average the prediction errors over all millimeter measurements. The result in Fig. 3c shows that much can be gained in lower dimensions by using the local metrics. To provide visual insights into the behavior of the learned metrics, we uniformly sample body shape along the first principal geodesic (in the range ±7 times the standard deviation) according to the different metrics. The results are available as a movie in the supplementary material, but are also shown in Fig. 5. As can be seen, the learned metrics pick up intuitive relationships between body shape and the measurements, e.g. shoulder to wrist distance is related to overall body size, while shoulder breadth is related to body weight.
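The evaluation protocol used above, linear regression scored by leave-one-out cross-validation, can be sketched in a few lines. The inputs stand in for either Euclidean shape coefficients or tangent-space coordinates $\mathrm{Log}_\mu(x_i)$; the function name and the synthetic data are hypothetical and unrelated to the CAESAR measurements.

```python
import numpy as np


def loo_linear_regression_error(X, y):
    """Mean absolute leave-one-out prediction error of an affine least-squares fit.

    X: (N, D) inputs, e.g. tangent-space coordinates of the data.
    y: (N,) targets, e.g. one anthropometric measurement per shape.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.hstack([X, np.ones((len(X), 1))])       # append a bias column
    errors = np.empty(len(X))
    for i in range(len(X)):
        keep = np.arange(len(X)) != i              # hold out sample i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        errors[i] = abs(A[i] @ coef - y[i])
    return errors.mean()


# Synthetic stand-in data: a noisy affine relationship in three dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y_clean = X @ np.array([1.0, -2.0, 0.5]) + 3.0
y_noisy = y_clean + rng.normal(scale=0.1, size=50)

print("noise-free error:", loo_linear_regression_error(X, y_clean))
print("noisy error:", loo_linear_regression_error(X, y_noisy))
```

Repeating the call for inputs truncated to the leading $d$ principal directions gives an error-against-dimensionality curve of the kind shown in the right column of Fig. 4.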
Figure 5 (rows: Shoulder to wrist distance, Shoulder breadth): Shapes corresponding to the mean (center) and ±7 times the standard deviations along the principal geodesics (left and right). Movies are available in the supplementary material.

4.2 Scalability

The human body data set is small enough (986 samples in 35 dimensions) that computing a geodesic only takes a few seconds. To show that the current unoptimized Matlab implementation can handle somewhat larger datasets, we briefly consider a dimensionality reduction task on the classic MNIST handwritten digit data set. We use the preprocessed data available with [3] where the original 28×28 gray scale images were deskewed and projected onto their leading 164 Euclidean principal components (which capture 95% of the variance in the original data). We learn one diagonal LMNN metric per class, which we associate with the mean of the class. From this we construct a Riemannian manifold from (6), compute the mean value on the manifold and compute geodesics between the mean and each data point; this is the computationally expensive part of performing PGA. Fig. 3d plots the average running time (sec) for the computation of geodesics as a function of the dimensionality of the training data. A geodesic can be computed in 100 dimensions in approximately 5 sec., whereas in 150 dimensions it takes about 30 sec. In this experiment, we train a PGA model on 60,000 data points, and test a nearest neighbor classifier in the tangent space as we decrease the dimensionality of the model. Compared to a Euclidean model, this gives a modest improvement in classification accuracy of 2.3 percent, when averaged across different dimensionalities. Plots of the results can be found in the supplementary material.

5 Discussion

This work shows that multi-metric learning techniques are indeed applicable outside the realm of kNN classifiers.
The idea of defining the metric tensor at any given point as the weighted average of a finite set of learned metrics is quite natural from a modeling point of view, which is also validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Specifically, we show how to use local metric tensors for both regression and dimensionality reduction tasks.

Others have attempted to solve non-classification problems using local metrics, but we feel that our approach is the first to have a solid theoretical backing. For example, Hastie and Tibshirani [9] use local LDA metrics for dimensionality reduction by averaging the local metrics and using the resulting metric as part of a Euclidean PCA, which essentially is a linear approach. Another approach was suggested by Hong et al. [23], who simply compute the principal components according to each metric separately, such that one low dimensional model is learned per metric.

The suggested approach is, however, not difficulty-free in its current implementation. Currently, we are using off-the-shelf numerical solvers for computing geodesics, which can be computationally demanding. While we managed to analyze medium-sized datasets, we believe that the run-time can be drastically improved by developing specialized numerical solvers. In the experiments, we learned local metrics using techniques specialized for classification tasks as this is all the current literature provides. We expect improvements by learning the metrics specifically for regression and dimensionality reduction, but doing so is currently an open problem.

Acknowledgments: Søren Hauberg is supported in part by the Villum Foundation, and Oren Freifeld is supported in part by NIH-NINDS EUREKA (R01-NS066311).

References

[1] Andrea Frome, Yoram Singer, and Jitendra Malik. Image retrieval and classification using local distance functions. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS), pages 417–424, Cambridge, MA, 2007. MIT Press.
[2] Andrea Frome, Fei Sha, Yoram Singer, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In International Conference on Computer Vision (ICCV), pages 1–8, 2007.
[3] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):794–806, 2011.
[4] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, pages 94–101. ACM, 2004.
[5] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.
[6] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[7] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13:1–26, 2012.
[8] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16 (NIPS), 2004.
[9] Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, June 1996.
[10] Elzbieta Pekalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, 2:175–211, 2002.
[11] Manfredo Perdigao do Carmo. Riemannian Geometry. Birkhäuser Boston, January 1992.
[12] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[13] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2007.
[14] Jacek Kierzenka and Lawrence F. Shampine. A BVP solver based on residual control and the Matlab PSE. ACM Transactions on Mathematical Software, 27(3):299–316, 2001.
[15] John R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6:19–26, 1980.
[16] P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal Geodesic Analysis for the study of Nonlinear Statistics of Shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004.
[17] Peter E. Jupp and John T. Kent. Fitting smooth paths to spherical data. Applied Statistics, 36(1):34–46, 1987.
[18] Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of Nonlinear Signal and Image Processing, pages 194–198, 1999.
[19] Stefan Sommer, François Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations. In European Conference on Computer Vision (ECCV), pages 43–56, 2010.
[20] Kathleen M. Robinette, Hein Daanen, and Eric Paquet. The CAESAR project: a 3-D surface anthropometry survey. In 3-D Digital Imaging and Modeling, pages 380–386, 1999.
[21] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. Scape: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005.
[22] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In A. Fitzgibbon et al. (Eds.), editor, European Conference on Computer Vision (ECCV), Part I, LNCS 7572, pages 1–14. Springer-Verlag, Oct 2012.
[23] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In International Conference on Computer Vision (ICCV), pages 906–913, 2011.

6 0.079162486 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

7 0.078331858 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

8 0.075107276 200 nips-2012-Local Supervised Learning through Space Partitioning

9 0.071418718 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

10 0.070725948 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

11 0.069120213 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

12 0.065196045 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

13 0.060045294 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

14 0.057068691 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

15 0.056993924 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

16 0.056877445 187 nips-2012-Learning curves for multi-task Gaussian process regression

17 0.056690831 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.052379981 289 nips-2012-Recognizing Activities by Attribute Dynamics

19 0.052127603 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

20 0.051998824 206 nips-2012-Majorization for CRFs and Latent Likelihoods

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.159), (1, 0.056), (2, -0.041), (3, -0.034), (4, -0.021), (5, -0.042), (6, 0.003), (7, 0.075), (8, 0.026), (9, -0.043), (10, 0.052), (11, -0.08), (12, 0.086), (13, 0.024), (14, -0.059), (15, 0.032), (16, -0.039), (17, 0.04), (18, 0.102), (19, 0.139), (20, -0.02), (21, -0.043), (22, -0.002), (23, 0.099), (24, 0.06), (25, -0.017), (26, 0.001), (27, -0.083), (28, 0.045), (29, 0.057), (30, -0.082), (31, -0.012), (32, 0.052), (33, 0.012), (34, -0.084), (35, 0.004), (36, 0.057), (37, 0.011), (38, 0.032), (39, 0.092), (40, -0.053), (41, -0.002), (42, -0.02), (43, 0.004), (44, 0.007), (45, 0.161), (46, -0.013), (47, 0.029), (48, -0.08), (49, -0.087)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91894156 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

Author: Matthew Der, Lawrence K. Saul

2 0.8131122 242 nips-2012-Non-linear Metric Learning

Author: Dor Kedem, Stephen Tyree, Fei Sha, Gert R. Lanckriet, Kilian Q. Weinberger

3 0.73304337 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

Author: Jun Wang, Alexandros Kalousis, Adam Woznica

4 0.60080922 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

5 0.60007173 279 nips-2012-Projection Retrieval for Classification

Author: Madalina Fiterau, Artur Dubrawski

Abstract: In many applications, classiﬁcation systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-ﬁnding task under the assumption that query speciﬁc information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efﬁciently solves this problem by ﬁnding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classiﬁcation models. Often, it is required that a test-point be ‘explained’ (classiﬁed) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classiﬁcation problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classiﬁed; the system displays the classiﬁcation result as well as an illustration of how the classiﬁcation decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classiﬁcation system that determines whether the scanned vehicle may carry a threat. 
Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional.

In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset $\{(x_1, y_1) \ldots (x_n, y_n)\} \in \mathcal{X}^n \times \{0, 1\}^n$ and a class of discriminators $\mathcal{H}$. The model will contain a set $\Pi$ of subspaces of $\mathcal{X}$; $\Pi \subseteq \boldsymbol{\Pi}$, where $\boldsymbol{\Pi}$ is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection $\pi_i \in \Pi$ corresponds one discriminator from a given hypothesis space, $h_i \in \mathcal{H}$. It will also contain a selection function $g : \mathcal{X} \to \Pi \times H$, which yields, for a query point $x$, the projection/discriminator pair with which this point will be classified. The notation $\pi(x)$ refers to the projection of the point $x$ onto the subspace $\pi$ while $h(\pi(x))$ represents the predicted label for $x$. Formally, we describe the model class as

$$\mathcal{M}_d = \Big\{ \Pi = \{\pi : \pi \in \boldsymbol{\Pi}, \ \dim(\pi) \le d\}, \quad H = \{h_i : h_i \in \mathcal{H}, \ h_i : \pi_i \to \mathcal{Y}, \ \forall i = 1 \ldots |\Pi|\}, \quad g \in \{f : \mathcal{X} \to \{1 \ldots |\Pi|\}\} \Big\},$$

where $\dim(\pi)$ represents the dimensionality of the subspace determined by the projection $\pi$.
Note that only projections up to size $d$ will be considered, where $d$ is a parameter specific to the application. The set $H$ contains one discriminator from the hypothesis class $\mathcal{H}$ for each projection. Intuitively, the aim is to minimize the expected classification error over $\mathcal{M}_d$; however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query $x$ in the space $\mathcal{X}$, $g(x)$ will yield the subspace $\pi_{g(x)}$ onto which the query is projected and the discriminator $h_{g(x)}$ for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows:

$$M^* = \arg\min_{M \in \mathcal{M}_d} \mathbb{E}_{\mathcal{X}, \mathcal{Y}} \left[ y \ne h_{g(x)}(\pi_{g(x)}(x)) \right]$$

There are limitations to the type of selection function $g$ that can be learned. A simple example for which $g$ can be recovered is a set of signal readings $x$ for which, if one of the readings $x^i$ exceeds a threshold $t_i$, the label can be predicted just based on $x^i$. A more complex one is a dataset containing regulatory variables, that is, for $x^i$ in the interval $[a_k, b_k]$ the label only depends on $(x_k^1 \ldots x_k^{n_k})$; datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption.

This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable.
Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to rely natively on the dimensions available in the original feature space. Feature selection methods, such as the lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class-dependent feature selection and context-specific classification, which are closely connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al. [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al. present a subspace selection method in the context of multitask learning [11]. Gu et al. propose a joint method for feature selection and subspace learning [7]; however, their classification model is not particularly query-specific. Alternatively, algorithms that replace complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. Algorithms specifically designed to yield understandable models are few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information.

2 Recovering informative projections with RECIP

To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes.
Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. $A(\pi)$ refers to the area of the feature space where the projection $\pi$ is selected:

$$A(\pi) = \{x \in \mathcal{X} : \pi_{g(x)} = \pi\}$$

The objective becomes

$$\min_{M \in \mathcal{M}_d} \mathbb{E}_{\mathcal{X} \times \mathcal{Y}}\left[\, y \neq h_{g(x)}(\pi_{g(x)}(x)) \,\right] = \min_{M \in \mathcal{M}_d} \sum_{\pi \in \Pi} p(A(\pi))\, \mathbb{E}\left[\, y \neq h_{g(x)}(\pi_{g(x)}(x)) \mid x \in A(\pi) \,\right].$$

The expected classification error over $A(\pi)$ is linked to the conditional entropy of $Y|X$. Fano's inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to:

$$\min_{M \in \mathcal{M}_d} \sum_{\pi \in \Pi} p(A(\pi))\, H(Y \mid \pi(X);\ X \in A(\pi)) \qquad (1)$$

The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result.

Proposition 2.1. Given a continuous variable $X \in \mathcal{X}$ and a binary variable Y, where X is sampled from the mixture model $f(x) = p(y = 0) f_0(x) + p(y = 1) f_1(x) = p_0 f_0(x) + p_1 f_1(x)$, then $H(Y|X) = -p_0 \log p_0 - p_1 \log p_1 - D_{KL}(f_0 \| f) - D_{KL}(f_1 \| f)$.

Next, we will use the nonparametric estimator presented in [13] for the Tsallis α-divergence. Given samples $U_i \sim U$, with $i = 1, \ldots, n$, and $V_j \sim V$, with $j = 1, \ldots, m$, the divergence is estimated as follows:

$$\hat{T}_\alpha(U \| V) = \frac{1}{1-\alpha}\left[\frac{1}{n}\sum_{i=1}^{n}\left(\frac{(n-1)\,\nu_k(U_i, U \setminus u_i)^d}{m\,\nu_k(U_i, V)^d}\right)^{1-\alpha} B(k, \alpha) - 1\right], \qquad (2)$$

where d is the dimensionality of the variables U and V and $\nu_k(z, Z)$ represents the distance from z to its k-th nearest neighbor in the set of points Z. For $\alpha \approx 1$ and $n \to \infty$, $\hat{T}_\alpha(u \| v) \approx D_{KL}(u \| v)$.
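The building block of estimator (2) is the k-nearest-neighbor distance $\nu_k$ and the per-sample distance ratio inside the sum. The following is a minimal sketch of those two pieces only; it assumes Euclidean distance and no duplicate points (a zero distance in the denominator would raise an error), omits the constant $B(k, \alpha)$, and uses hypothetical function names.

```python
import math

def knn_distance(z, points, k):
    """nu_k(z, Z): Euclidean distance from z to its k-th nearest neighbor in points."""
    dists = sorted(math.dist(z, p) for p in points)
    return dists[k - 1]

def ratio_terms(U, V, k, alpha):
    """Per-sample terms ((n-1) nu_k(U_i, U \\ u_i)^d / (m nu_k(U_i, V)^d))^(1-alpha)."""
    n, m, d = len(U), len(V), len(U[0])
    terms = []
    for i, u in enumerate(U):
        rest = U[:i] + U[i + 1:]          # leave the point itself out of its own sample
        num = (n - 1) * knn_distance(u, rest, k) ** d
        den = m * knn_distance(u, V, k) ** d
        terms.append((num / den) ** (1.0 - alpha))
    return terms

# Toy one-dimensional example: U is tightly clustered and far from V.
U = [(0.0,), (1.0,), (2.0,)]
V = [(10.0,), (11.0,)]
terms = ratio_terms(U, V, k=1, alpha=0.5)
```

Small terms indicate that a sample lies much closer to its own set than to the other set, which is the intuition the local entropy estimates build on.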
2.1 Local estimators of entropy

We will now plug (2) into the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation $X_0$ to represent the $n_0$ samples from X which have the labels Y equal to 0, and $X_1$ to represent the $n_1$ samples from X which have the labels set to 1. Also, $X_{y(x)}$ represents the set of samples that have labels equal to the label of x and $X_{\neg y(x)}$ the data that have labels opposite to the label of x.

$$\hat{H}(Y|X; X \in A) = -H(p_0) - H(p_1) - \hat{T}(f_0^x \| f^x) - \hat{T}(f_1^x \| f^x) + C, \qquad \alpha \approx 1$$

$$\hat{H}(Y|X; X \in A) \propto \frac{1}{n_0}\sum_{i=1}^{n_0} I[x_i \in A]\left(\frac{(n_0 - 1)\,\nu_k(x_i, X_0 \setminus x_i)^d}{n\,\nu_k(x_i, X \setminus x_i)^d}\right)^{1-\alpha} + \frac{1}{n_1}\sum_{i=1}^{n_1} I[x_i \in A]\left(\frac{(n_1 - 1)\,\nu_k(x_i, X_1 \setminus x_i)^d}{n\,\nu_k(x_i, X \setminus x_i)^d}\right)^{1-\alpha}$$

$$\propto \frac{1}{n_0}\sum_{i=1}^{n_0} I[x_i \in A]\left(\frac{(n_0 - 1)\,\nu_k(x_i, X_0 \setminus x_i)^d}{n\,\nu_k(x_i, X_1 \setminus x_i)^d}\right)^{1-\alpha} + \frac{1}{n_1}\sum_{i=1}^{n_1} I[x_i \in A]\left(\frac{(n_1 - 1)\,\nu_k(x_i, X_1 \setminus x_i)^d}{n\,\nu_k(x_i, X_0 \setminus x_i)^d}\right)^{1-\alpha}$$

$$\propto \frac{1}{n}\sum_{i=1}^{n} I[x_i \in A]\left(\frac{(n - 1)\,\nu_k(x_i, X_{y(x_i)} \setminus x_i)^d}{n\,\nu_k(x_i, X_{\neg y(x_i)} \setminus x_i)^d}\right)^{1-\alpha}$$

The estimator for the entropy of the data that is classified with projection $\pi$ is as follows:

$$\hat{H}(Y \mid \pi(X); X \in A(\pi)) \propto \frac{1}{n}\sum_{i=1}^{n} I[x_i \in A(\pi)]\left(\frac{(n - 1)\,\nu_k(\pi(x_i), \pi(X_{y(x_i)}) \setminus \pi(x_i))^d}{n\,\nu_k(\pi(x_i), \pi(X_{\neg y(x_i)} \setminus x_i))^d}\right)^{1-\alpha} \qquad (3)$$

From (3), and using the fact that $I[x_i \in A(\pi)] = I[\pi_{g(x_i)} = \pi]$, for which we use the notation $I[g(x_i) \to \pi]$, we estimate the objective as

$$\min_{M \in \mathcal{M}_d} \sum_{\pi \in \Pi} \frac{1}{n}\sum_{i=1}^{n} I[g(x_i) \to \pi]\left(\frac{(n - 1)\,\nu_k(\pi(x_i), \pi(X_{y(x_i)}) \setminus \pi(x_i))^d}{n\,\nu_k(\pi(x_i), \pi(X_{\neg y(x_i)} \setminus x_i))^d}\right)^{1-\alpha} \qquad (4)$$

Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection $\pi^*$ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d; there are $d^*$ such projections.

This procedure yields an extended set of features Z, which we name local entropy estimates:

$$Z_{ij} = \left(\frac{\nu_k(\pi_j(x_i), \pi_j(X_{y(x_i)}) \setminus \pi_j(x_i))}{\nu_k(\pi_j(x_i), \pi_j(X_{\neg y(x_i)}) \setminus \pi_j(x_i))}\right)^{d(1-\alpha)}, \qquad \alpha \approx 1, \quad j \in \{1 \ldots d^*\} \qquad (5)$$

For each training data point, we compute the best distance ratio among all the projections, which is simply $T_i = \min_{j \in [d^*]} Z_{ij}$. The objective can then be further rewritten as a function of the entropy estimates:

$$\min_{M \in \mathcal{M}_d} \sum_{i=1}^{n} \sum_{\pi_j \in \Pi} I[g(x_i) \to \pi_j]\, Z_{ij} \qquad (6)$$

From the definition of T, it is also clear that

$$\min_{M \in \mathcal{M}_d} \sum_{i=1}^{n} \sum_{\pi_j \in \Pi} I[g(x_i) \to \pi_j]\, Z_{ij} \;\ge\; \sum_{i=1}^{n} T_i. \qquad (7)$$

2.2 Projection selection as a combinatorial problem

Considering form (6) of the objective, and given that the estimates $Z_{ij}$ are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the best-performing classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be restricted in some way to allow interpretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over the matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under $\ell_1$ constraints.

2.3 Projection retrieval through regularized regression

To transform the projection retrieval into a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point $x_i$.
Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B.

$$\min_B \; \|T - (Z \odot B)\, J_{|\Pi|,1}\|_2^2 + \lambda \sum_{i=1}^{d^*} [B_i \neq 0] \qquad (8)$$

subject to $|B_k|_{\ell_1} = 1$, $k = 1, \ldots, n$, where $(Z \odot B)_{ij} = Z_{ij} B_{ij}$ and J is a matrix of ones. Here $B_i$ denotes a column of B and $B_k$ a row.

The problem with this optimization is that it is not convex. A typical workaround for this issue is to use the convex relaxation for $B_i \neq 0$, that is, the $\ell_1$ norm. This would transform the penalty term to $\sum_{i=1}^{d^*} |B_i|_{\ell_1}$. However, $\sum_{i=1}^{d^*} |B_i|_{\ell_1} = \sum_{k=1}^{n} |B_k|_{\ell_1} = n$, so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of $B\delta$, where $\delta$ is a $d^*$-size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, $\delta$ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, $\delta$ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighting in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of $\delta$. Once $\delta$ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1.

Algorithm 1: RECIP
  $\delta = [1 \ldots 1]$
  repeat
    $B = \arg\min_B \|T - \sum_{i=1}^{|\Pi|} \langle Z, B \rangle\|_2^2 + \lambda |B\delta|_{\ell_1}$ subject to $|B_k|_{\ell_1} = 1$, $k = 1 \ldots n$
    $\delta_i = |B_i|_{\ell_1}$, $i = 1 \ldots d^*$ (update the differential penalty)
    $\delta = 1 - \delta / |\delta|_{\ell_1}$
  until $\delta$ converges
  return $\Pi = \{\pi_i ;\ |B_i|_{\ell_1} > 0,\ \forall i = 1 \ldots d^*\}$

2.4 Lasso for projection selection

We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an $\ell_1$-penalized optimization on the extended feature set Z, with the objective T: $\min_\beta |T - Z\beta|^2 + \lambda |\beta|_{\ell_1}$.
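For concreteness, the $\ell_1$-penalized objective above can be minimized with iterative soft-thresholding. This is a minimal sketch under stated assumptions: the choice of solver (ISTA), the iteration count, and the function names are illustrative only, since the text does not say which solver was used.

```python
import math

def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal operator of the l1 penalty)."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def lasso_ista(Z, T, lam, n_iter=500):
    """Approximately minimize ||T - Z beta||^2 + lam * ||beta||_1 by ISTA."""
    n, p = len(Z), len(Z[0])
    # The gradient of the squared error is Lipschitz with constant at most
    # 2 * ||Z||_F^2, which gives a safe step size of 1 / L.
    L = 2.0 * sum(z * z for row in Z for z in row)
    beta = [0.0] * p
    if L == 0.0:
        return beta
    for _ in range(n_iter):
        resid = [sum(Z[i][j] * beta[j] for j in range(p)) - T[i] for i in range(n)]
        grad = [2.0 * sum(Z[i][j] * resid[i] for i in range(n)) for j in range(p)]
        beta = [soft_threshold(beta[j] - grad[j] / L, lam / L) for j in range(p)]
    return beta

# With an identity design the minimizer is soft-thresholding of T by lam / 2.
beta = lasso_ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.2], lam=1.0)
```

In the toy call, the second coefficient is driven exactly to zero, which is the sparsity pattern used to rank columns of Z.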
The lasso penalty on the coefficient vector encourages sparsity. For a high enough $\lambda$, the sparsity pattern in $\beta$ is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query-specific and will find one projection for all data. We improved it by allowing it to find projections iteratively: this robust version offers increased performance by reweighting the data, thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP's performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class.

2.5 The selection function

Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy.

$$(i^*, \theta^*) = \arg\min_{i, \theta} \left(\frac{\nu_k(\pi_i(x), \pi_i(X_\theta))}{\nu_k(\pi_i(x), \pi_i(X_{\neg\theta}))}\right)^{\dim(\pi_i)(1-\alpha)}, \qquad i = 1 \ldots d^*, \quad \alpha \approx 1 \qquad (9)$$

3 Experimental results

In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP.
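The selection rule of Section 2.5, equation (9), can be sketched as a search over selected projections and candidate labels. This is an illustration only: it assumes Euclidean distance, binary labels 0 and 1, no training point coinciding with the query in the opposite class, and arbitrary illustrative values of k and α; all names are hypothetical.

```python
import math

def knn_distance(z, points, k):
    """Distance from z to its k-th nearest neighbor in points (Euclidean)."""
    return sorted(math.dist(z, p) for p in points)[k - 1]

def project(x, subset):
    """Keep only the coordinates of x listed in subset."""
    return tuple(x[j] for j in subset)

def select_projection(x, X, y, projections, k=1, alpha=0.9):
    """Return the (projection, label) pair with the smallest distance ratio."""
    best = None
    for subset in projections:
        px = project(x, subset)
        for theta in (0, 1):
            same = [project(xi, subset) for xi, yi in zip(X, y) if yi == theta]
            other = [project(xi, subset) for xi, yi in zip(X, y) if yi != theta]
            ratio = knn_distance(px, same, k) / knn_distance(px, other, k)
            score = ratio ** (len(subset) * (1.0 - alpha))
            if best is None or score < best[0]:
                best = (score, subset, theta)
    return best[1], best[2]

# Feature 0 separates the classes; feature 1 is uninformative.
X = [(0.0, 5.0), (0.1, 1.0), (1.0, 5.1), (1.1, 0.9)]
y = [0, 0, 1, 1]
chosen, label = select_projection((0.05, 3.0), X, y, projections=[(0,), (1,)])
```

In the toy data the query is assigned to the single-feature projection on feature 0 with label 0, because its nearest same-class neighbor there is far closer than its nearest opposite-class neighbor.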
We expect RECIP's classification performance to be slightly, but not substantially, worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points.

A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of the available 2-dimensional subspaces $(x_k^1, x_k^2)$ with $k \in \{1 \ldots q\}$. The data in batch k also have the property that $x_k^1 > t_k$. This is done such that the group a point belongs to can be detected from $x_k^1$, thus $x_k^1$ is a regulatory variable. We control the amount of noise added to the synthetic data created in this way by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with the number of batches q ranging between 1 and 7. We kept the number of features low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features.

The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as α approaches 1. For the experiments on real-world data, d was set to n (all projections were considered).
For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d ≥ 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection $p_0$. However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and $p_0$ statistically indistinguishable for a large subset of data. The users will choose d according to the desired or acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d.

3.1 Recovering informative projections

Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). In Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is, when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing, resulting in a gradual drop of recall until only about 3 or 4 of the 7 subspaces known to be informative can be recovered.
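The precision and recall defined above can be computed by treating each projection as an unordered set of feature indices. A minimal sketch follows; the function name and the convention of returning 0.0 for an empty set are assumptions made here.

```python
def projection_precision_recall(recovered, informative):
    """Precision and recall of recovered projections against the known ones.

    Each projection is a collection of feature indices; order is ignored.
    """
    recovered = {frozenset(p) for p in recovered}
    informative = {frozenset(p) for p in informative}
    hits = len(recovered & informative)
    precision = hits / len(recovered) if recovered else 0.0
    recall = hits / len(informative) if informative else 0.0
    return precision, recall

# Two projections recovered, both correct, out of four known informative ones.
precision, recall = projection_precision_recall(
    recovered=[(0, 1), (2, 3)],
    informative=[(1, 0), (2, 3), (4, 5), (5, 6)],
)
```

This toy case mirrors the pattern reported for heavily overlapping batches: everything that is recovered is correct, but only part of the informative set is found.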
We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned.

3.2 Classification accuracy

Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared the predictive accuracy of RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption.

For reference, we have also classified the synthetic data using the K-Nearest-Neighbors algorithm with all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. This is because some of the features then used by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to the feature space, using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances.
3.3 RECIP case studies using real-world data

Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using the Cell dataset, containing a set of measurements such as the area and perimeter of biological cells, with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative.

In Spam data, the two most informative projections are 'Capital Length Total' (CLT)/'Capital Length Longest' (CLL) and CLT/'Frequency of word your' (FWY). Figure 1 shows these two projections, with the dots representing training points. The red dots represent points labeled as spam while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles, far fewer, are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user would need to attest to the validity of classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can therefore be flagged for further investigation.

Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. Mean/var obtained for each setting by repeating the experiment with datasets with different informative projections.

PRECISION
     Mean (by noise level)                    Variance (by noise level)
q    0       0.02    0.05    0.1     0.2      0       0.02    0.05    0.1     0.2
1    1       1       1       0.9286  0.9286   0       0       0       0.0306  0.0306
2    1       1       1       1       1        0       0       0       0       0
3    1       1       1       1       1        0       0       0       0       0
4    1       1       1       1       1        0       0       0       0       0
5    1       1       1       1       1        0       0       0       0       0
6    1       1       1       1       1        0       0       0       0       0
7    1       1       1       1       1        0       0       0       0       0

RECALL
     Mean (by noise level)                    Variance (by noise level)
q    0       0.02    0.05    0.1     0.2      0       0.02    0.05    0.1     0.2
1    1       1       1       1       1        0       0       0       0       0
2    1       1       1       1       1        0       0       0       0       0
3    1       1       0.9524  0.9524  1        0       0       0.0136  0.0136  0
4    0.9643  0.9643  0.9643  0.9643  0.9286   0.0077  0.0077  0.0077  0.0077  0.0128
5    0.7714  0.7429  0.8286  0.8571  0.7714   0.0163  0.0196  0.0049  0.0082  0.0278
6    0.6429  0.6905  0.6905  0.6905  0.6905   0.0113  0.0113  0.0272  0.0113  0.0113
7    0.6327  0.5918  0.5918  0.5714  0.551    0.0225  0.02    0.0258  0.0233  0.02

Table 2: RECIP Classification Accuracy on Artificial Data

CLASSIFICATION ACCURACY
     Mean (by noise level)                    Variance (by noise level)
q    0       0.02    0.05    0.1     0.2      0       0.02    0.05    0.1     0.2
1    0.9751  0.9731  0.9686  0.9543  0.9420   0.0000  0.0000  0.0000  0.0008  0.0007
2    0.9333  0.9297  0.9227  0.9067  0.8946   0.0001  0.0001  0.0001  0.0001  0.0001
3    0.9053  0.8967  0.8764  0.8640  0.8618   0.0004  0.0005  0.0016  0.0028  0.0007
4    0.8725  0.8685  0.8589  0.8454  0.8187   0.0020  0.0020  0.0019  0.0025  0.0032
5    0.8113  0.8009  0.8105  0.8105  0.7782   0.0042  0.0044  0.0033  0.0036  0.0044
6    0.7655  0.7739  0.7669  0.7632  0.7511   0.0025  0.0021  0.0026  0.0025  0.0027
7    0.7534  0.7399  0.7347  0.7278  0.7205   0.0034  0.0040  0.0042  0.0042  0.0045

CLASSIFICATION ACCURACY - KNOWN PROJECTIONS
     Mean (by noise level)                    Variance (by noise level)
q    0       0.02    0.05    0.1     0.2      0       0.02    0.05    0.1     0.2
1    0.9751  0.9731  0.9686  0.9637  0.9514   0.0000  0.0000  0.0000  0.0001  0.0000
2    0.9333  0.9297  0.9227  0.9067  0.8946   0.0001  0.0001  0.0001  0.0001  0.0001
3    0.9053  0.8967  0.8914  0.8777  0.8618   0.0004  0.0005  0.0005  0.0007  0.0007
4    0.8820  0.8781  0.8657  0.8541  0.8331   0.0011  0.0011  0.0014  0.0014  0.0020
5    0.8714  0.8641  0.8523  0.8429  0.8209   0.0015  0.0015  0.0018  0.0019  0.0023
6    0.8566  0.8497  0.8377  0.8285  0.8074   0.0014  0.0015  0.0016  0.0023  0.0021
7    0.8429  0.8371  0.8256  0.8122  0.7988   0.0015  0.0018  0.0018  0.0021  0.0020

Table 3: Accuracy of K-NN and RECIP

Dataset              KNN     RECIP
Breast Cancer Wis    0.8415  0.8275
Breast Tissue        1.0000  1.0000
Cell                 0.7072  0.7640
MiniBOONE*           0.7896  0.7396
Spam                 0.7680  0.7680
Vowel                0.9839  0.9839

Figure 1: Spam Dataset Selected Subspaces. Two panels, each titled "Informative Projection for the Spam dataset": Capital Run Length Longest against Capital Run Length Total, and Frequency of word 'your' against Capital Run Length Total.

Table 4: Classification accuracy using RECIP-learned projections - or known projections, in the lower section - within a voting model instead of a selection model

CLASSIFICATION ACCURACY - VOTING ENSEMBLE
     Mean (by noise level)                    Variance (by noise level)
q    0       0.02    0.05    0.1     0.2      0       0.02    0.05    0.1     0.2
1    0.9751  0.9731  0.9686  0.9317  0.9226   0.0000  0.0000  0.0000  0.0070  0.0053
2    0.7360  0.7354  0.7331  0.7303  0.7257   0.0002  0.0002  0.0001  0.0002  0.0001
3    0.7290  0.7266  0.7163  0.7166  0.7212   0.0002  0.0002  0.0008  0.0006  0.0002
4    0.6934  0.6931  0.6932  0.6904  0.6867   0.0008  0.0008  0.0008  0.0008  0.0009
5    0.6715  0.6602  0.6745  0.6688  0.6581   0.0013  0.0014  0.0013  0.0014  0.0013
6    0.6410  0.6541  0.6460  0.6529  0.6512   0.0008  0.0007  0.0010  0.0006  0.0005
7    0.6392  0.6342  0.6268  0.6251  0.6294   0.0009  0.0011  0.0012  0.0012  0.0012

CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS
     Mean (by noise level)                    Variance (by noise level)
q    0       0.02    0.05    0.1     0.2      0       0.02    0.05    0.1     0.2
1    0.9751  0.9731  0.9686  0.9637  0.9514   0.0000  0.0000  0.0000  0.0001  0.0000
2    0.7360  0.7354  0.7331  0.7303  0.7257   0.0002  0.0002  0.0001  0.0002  0.0001
3    0.7409  0.7385  0.7390  0.7353  0.7325   0.0010  0.0012  0.0010  0.0011  0.0010
4    0.7110  0.7109  0.7083  0.7067  0.7035   0.0041  0.0041  0.0042  0.0042  0.0043
5    0.7077  0.7070  0.7050  0.7034  0.7008   0.0015  0.0015  0.0015  0.0016  0.0016
6    0.6816  0.6807  0.6801  0.6790  0.6747   0.0008  0.0008  0.0008  0.0008  0.0009
7    0.6787  0.6783  0.6772  0.6767  0.6722   0.0008  0.0009  0.0009  0.0008  0.0008

Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method

CLASSIFICATION ACCURACY - KNN
     Mean (by noise level)                    Variance (by noise level)
q    0       0.02    0.05    0.1     0.2      0       0.02    0.05    0.1     0.2
1    0.7909  0.7843  0.7747  0.7652  0.7412   0.0002  0.0002  0.0002  0.0002  0.0002
2    0.7940  0.7911  0.7861  0.7790  0.7655   0.0001  0.0001  0.0001  0.0001  0.0001
3    0.7964  0.7939  0.7901  0.7854  0.7756   0.0000  0.0001  0.0001  0.0000  0.0000
4    0.7990  0.7972  0.7942  0.7904  0.7828   0.0001  0.0001  0.0001  0.0001  0.0001
5    0.8038  0.8024  0.8002  0.7970  0.7905   0.0001  0.0001  0.0001  0.0001  0.0001
6    0.8043  0.8032  0.8015  0.7987  0.7930   0.0001  0.0001  0.0001  0.0001  0.0001
7    0.8054  0.8044  0.8028  0.8004  0.7955   0.0001  0.0001  0.0001  0.0001  0.0001

4 Conclusion

This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high confidence regions known in conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process.

Acknowledgments

This material is based upon work supported by the NSF, under Grant No. IIS-0911032.

References

[1] Mark W. Craven and Jude W. Shavlik. Extracting Tree-Structured Representations of Trained Networks. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 24–30. The MIT Press, 1996.
[2] Pedro Domingos. Knowledge discovery via multiple models. Intelligent Data Analysis, 2:187–202, 1998.
[3] Eulanda M. Dos Santos, Robert Sabourin, and Patrick Maupin. A dynamic overproduce-and-choose strategy for the selection of classifier ensembles. Pattern Recogn., 41:2993–3009, October 2008.
[4] M. Feder and N. Merhav. Relations between entropy and error probability. Information Theory, IEEE Transactions on, 40(1):259–266, January 1994.
[5] Jerome H. Friedman, Ron Kohavi, and Yeogirl Yun. Lazy decision trees, 1996.
[6] A. Gammerman, V. Vovk, and V. Vapnik. Learning by transduction. In Uncertainty in Artificial Intelligence, pages 148–155. Morgan Kaufmann, 1998.
[7] Quanquan Gu, Zhenhui Li, and Jiawei Han. Joint feature selection and subspace learning, 2011.
[8] Bing Liu, Minqing Hu, and Wynne Hsu. Intuitive representation of decision trees using general rules and exceptions. In Proceedings of Seventeenth National Conference on Artificial Intelligence (AAAI-2000), July 30 - Aug 3, 2000, pages 615–620, 2000.
[9] Michael Mampaey, Nikolaj Tatti, and Jilles Vreeken. Tell me what I need to know: succinctly summarizing data with itemsets. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '11, pages 573–581, New York, NY, USA, 2011. ACM.
[10] Mario Marchand and Marina Sokolova. Learning with decision lists of data-dependent features. Journal of Machine Learning Research, 6, 2005.
[11] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, April 2010.
[12] Michael J. Pazzani, Subramani Mani, and W. Rodman Shankle. Beyond concise and colorful: Learning intelligible rules, 1997.
[13] B. Poczos and J. Schneider. On the estimation of alpha-divergences. AISTATS, 2011.
[14] Kai Ting, Jonathan Wells, Swee Tan, Shyh Teng, and Geoffrey Webb. Feature-subspace aggregating: ensembles for stable and unstable learners. Machine Learning, 82:375–397, 2011. 10.1007/s10994-010-5224-5.

6 0.5408392 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

7 0.53426957 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

8 0.50619066 200 nips-2012-Local Supervised Learning through Space Partitioning

9 0.47468704 248 nips-2012-Nonparanormal Belief Propagation (NPNBP)

10 0.47391763 148 nips-2012-Hamming Distance Metric Learning

11 0.46453232 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

12 0.45465106 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

13 0.44431093 105 nips-2012-Dynamic Pruning of Factor Graphs for Maximum Marginal Prediction

14 0.4427349 204 nips-2012-MAP Inference in Chains using Column Generation

15 0.43398148 145 nips-2012-Gradient Weights help Nonparametric Regressors

16 0.43053481 338 nips-2012-The Perturbed Variation

17 0.41724625 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

18 0.41631851 198 nips-2012-Learning with Target Prior

19 0.41349152 157 nips-2012-Identification of Recurrent Patterns in the Activation of Brain Networks

20 0.40827703 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.036), (12, 0.166), (21, 0.025), (27, 0.016), (38, 0.099), (42, 0.112), (54, 0.025), (55, 0.023), (59, 0.017), (74, 0.049), (76, 0.133), (80, 0.144), (92, 0.044)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8894136 239 nips-2012-Neuronal Spike Generation Mechanism as an Oversampling, Noise-shaping A-to-D converter

Author: Dmitri B. Chklovskii, Daniel Soudry

Abstract: We test the hypothesis that the neuronal spike generation mechanism is an analog-to-digital (AD) converter encoding rectified low-pass filtered summed synaptic currents into a spike train linearly decodable in postsynaptic neurons. Faithful encoding of an analog waveform by a binary signal requires that the spike generation mechanism has a sampling rate exceeding the Nyquist rate of the analog signal. Such oversampling is consistent with the experimental observation that the precision of the spike-generation mechanism is an order of magnitude greater than the cut-off frequency of low-pass filtering in dendrites. Additional improvement in the coding accuracy may be achieved by noise-shaping, a technique used in signal processing. If noise-shaping were used in neurons, it would reduce coding error relative to a Poisson spike generator for frequencies below Nyquist by introducing correlations into spike times. By using experimental data from three different classes of neurons, we demonstrate that biological neurons utilize noise-shaping. Therefore, the spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter.

The nature of the neural spike code remains a central problem in neuroscience [1-3]. In particular, no consensus exists on whether information is encoded in firing rates [4, 5] or individual spike timing [6, 7]. On the single-neuron level, evidence exists to support both points of view. On the one hand, post-synaptic currents are low-pass-filtered by dendrites with the cut-off frequency of approximately 30Hz [8], Figure 1B, providing ammunition for the firing rate camp: if the signal reaching the soma is slowly varying, why would precise spike timing be necessary? On the other hand, the ability of the spike-generation mechanism to encode harmonics of the injected current up to about 300Hz [9, 10], Figure 1B, points at its exquisite temporal precision [11].
Yet, in view of the slow variation of the somatic current, such precision may seem gratuitous and puzzling. The timescale mismatch between gradual variation of the somatic current and high precision of spike generation has been addressed previously. Existing explanations often rely on the population nature of the neural code [10, 12]. Although this is a distinct possibility, the question remains whether invoking population coding is necessary. Other possible explanations for the timescale mismatch include the possibility that some synaptic currents (for example, GABAergic) may be generated by synapses proximal to the soma and therefore not subject to low-pass filtering, or that the high frequency harmonics are so strong in the pre-synaptic spike that despite attenuation, their trace is still present. Although in some cases these explanations could apply, for the majority of synaptic inputs to typical neurons there is a glaring mismatch.

The perceived mismatch between the time scales of somatic currents and the spike-generation mechanism can be resolved naturally if one views spike trains as digitally encoding analog somatic currents [13-15], Figure 1A. Although somatic currents vary slowly, information that could be communicated by their analog amplitude far exceeds that of binary signals, such as all-or-none spikes, of the same sampling rate. Therefore, faithful digital encoding requires the sampling rate of the digital signal to be much higher than the cut-off frequency of the analog signal, so-called oversampling. Although the spike generation mechanism operates in continuous time, the high temporal precision of the spike-generation mechanism may be viewed as a manifestation of oversampling, which is needed for the digital encoding of the analog signal.
Therefore, the extra order of magnitude in temporal precision available to the spike-generation mechanism relative to somatic current, Figure 1B, is necessary to faithfully encode the amplitude of the analog signal, thus potentially reconciling the firing rate and the spike timing points of view [13-15].

Figure 1. Hybrid digital-analog operation of neuronal circuits. A. Post-synaptic currents are low-pass filtered and summed in dendrites (black) to produce a somatic current (blue). This analog signal is converted by the spike generation mechanism into a sequence of all-or-none spikes (green), a digital signal. Spikes propagate along an axon and are chemically transduced across synapses (gray) into post-synaptic currents (black), whose amplitude reflects synaptic weights, thus converting the digital signal back to analog. B. Frequency response function for dendrites (blue, adapted from [8]) and for the spike generation mechanism (green, adapted from [9]). Note the one order of magnitude gap between the cut-off frequencies. C. Amplitude of the summed postsynaptic currents depends strongly on spike timing. If the blue spike arrives just 5 ms later, as shown in red, the EPSCs sum to a value already 20% less. Therefore, the extra precision of the digital signal may be used to communicate the amplitude of the analog signal.

In signal processing, efficient AD conversion combines the principle of oversampling with that of noise-shaping, which utilizes correlations in the digital signal to allow more accurate encoding of the analog amplitude. This is exemplified by a family of AD converters called ΣΔ modulators [16], of which the basic one is analogous to an integrate-and-fire (IF) neuron [13-15]. The analogy between the basic ΣΔ modulator and the IF neuron led to the suggestion that neurons also use noise-shaping to encode the incoming analog current waveform in the digital spike train [13].
However, the hypothesis of noise-shaping AD conversion has never been tested experimentally in biological neurons. In this paper, by analyzing existing experimental datasets, we demonstrate that noise-shaping is present in three different classes of neurons from vertebrates and invertebrates. This lends support to the view that neurons act as oversampling and noise-shaping AD converters and accounts for the mismatch between the slowly varying somatic currents and precise spike timing. Moreover, we show that the degree of noise-shaping in biological neurons exceeds that used by basic ΣΔ modulators or IF neurons and propose viewing more complicated models in the noise-shaping framework. This paper is organized as follows: We review the principles of oversampling and noise-shaping in Section 2. In Section 3, we present experimental evidence for noise-shaping AD conversion in neurons. In Section 4 we argue that rectification of somatic currents may improve energy efficiency and/or implement de-noising.

2. Oversampling and noise-shaping in AD converters

To understand how oversampling can lead to more accurate encoding of the analog signal amplitude in a digital form, we first consider a Poisson spike encoder, whose rate of spiking is modulated by the signal amplitude, Figure 2A. Such an AD converter samples an analog signal at discrete time points and generates a spike with a probability given by the (normalized) signal amplitude. Because of the binary nature of spike trains, the resulting spike train encodes the signal with a large error even when the sampling is done at the Nyquist rate, i.e. the lowest rate for alias-free sampling. To reduce the encoding error, a Poisson encoder can sample at frequencies, $f_s$, higher than Nyquist, $f_N$; hence the term oversampling, Figure 2B. When combined with decoding by low-pass filtering (down to Nyquist) on the receiving end, this leads to a reduction of the error, which can be estimated as follows.
The number of samples over a Nyquist half-period ($1/(2f_N)$) is given by the oversampling ratio: $R = f_s/f_N$. As the normalized signal amplitude, $x \in [0, 1]$, stays roughly constant over the Nyquist half-period, it can be encoded by spikes generated with a fixed probability, $x$. For a Poisson process the variance in the number of spikes is equal to the mean, $\langle \Delta N^2 \rangle = \langle N \rangle = xR$. Therefore, the mean relative error of the signal decoded by averaging over the Nyquist half-period is

$$\frac{\langle \Delta x^2 \rangle^{1/2}}{x} = \frac{\langle \Delta N^2 \rangle^{1/2}}{\langle N \rangle} = x^{-1/2} R^{-1/2}, \qquad (1)$$

indicating that oversampling reduces transmission error. However, the weak dependence of the error on the oversampling frequency indicates diminishing returns on the investment in oversampling and motivates one to search for other ways to lower the error.

Figure 2. Oversampling and noise-shaping in AD conversion. A. Analog somatic current (blue) and its digital code (green). The difference between the green and the blue curves is the encoding error. B. Digital output of an oversampling Poisson encoder over one Nyquist half-period. C. Error power spectrum of a Nyquist (dark green) and oversampled (light green) Poisson encoder. Although the total error power is the same, the fraction surviving low-pass filtering during decoding (solid green) is smaller in the oversampled case. D. Basic ΣΔ modulator. E. Signal at the output of the integrator. F. Digital output of the ΣΔ modulator over one Nyquist period. G. Error power spectrum of the ΣΔ modulator (brown) is shifted to higher frequencies and low-pass filtered during decoding. The remaining error power (solid brown) is smaller than for the Poisson encoder.

To reduce encoding error beyond the ½ power of the oversampling ratio, the principle of noise-shaping was put forward [17]. To illustrate noise-shaping, consider a basic AD converter called ΣΔ [18], Figure 2D. In the basic ΣΔ modulator, the previous quantized signal is fed back and subtracted from the incoming signal and then the difference is integrated in time.
Rather than quantizing the input signal, as would be done in the Poisson encoder, the ΣΔ modulator quantizes the integral of the difference between the incoming analog signal and the previous quantized signal, Figure 2F. One can see that, in the oversampling regime, the quantization error of the basic ΣΔ modulator is significantly less than that of the Poisson encoder. As the variance in the number of spikes over the Nyquist period is less than one, the mean relative error of the signal is at most $x^{-1} R^{-1}$, which is better than the Poisson encoder.

To gain additional insight and understand the origin of the term noise-shaping, we repeat the above analysis in the Fourier domain. First, the Poisson encoder has a flat power spectrum up to the sampling frequency, Figure 2C. Oversampling preserves the total error power but extends the frequency range, resulting in lower error power below Nyquist. Second, a more detailed analysis of the basic ΣΔ modulator, where the dynamics is linearized by replacing the quantization device with a random noise injection [19], shows that the quantization noise is effectively differentiated. Taking the derivative in time is equivalent to multiplying the power spectrum of the quantization noise by frequency squared. Such reduction of noise power at low frequencies is an example of noise shaping, Figure 2G. Under the additional assumption of white quantization noise, such analysis yields:

$$\frac{\langle \Delta x^2 \rangle^{1/2}}{x} \sim x^{-1} R^{-3/2}, \qquad (2)$$

which for $R \gg 1$ is significantly better performance than for the Poisson encoder, Eq.(1).

As mentioned previously, the basic ΣΔ modulator, Figure 2D, in the continuous-time regime is nothing other than an IF neuron [13, 20, 21]. In the IF neuron, quantization is implemented by the spike generation mechanism and the negative feedback corresponds to the after-spike reset. Note that resetting the integrator to zero is strictly equivalent to subtraction only for continuous-time operation.
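The scaling arguments above can be checked numerically. The following is a minimal sketch, not the authors' code: it encodes a constant normalized amplitude with a Bernoulli-per-sample "Poisson" encoder and with a first-order sigma-delta (integrate-and-fire) encoder that subtracts the threshold after each spike, then decodes by averaging. All function names, the threshold of 1, the random initial integrator level, and the parameter values are illustrative assumptions. Note that Bernoulli sampling has spike-count variance x(1-x)R, slightly below the Poisson value xR used in the text.

```python
import math
import random


def poisson_encode(x, n_samples, rng):
    """Emit a spike at each sample with probability x (0 <= x <= 1)."""
    return [1 if rng.random() < x else 0 for _ in range(n_samples)]


def sigma_delta_encode(x, n_samples, rng):
    """First-order sigma-delta (integrate-and-fire) encoding of a constant x.

    The integrator starts at a random level (an assumption made here to
    average over phases), accumulates x at every sample, and on reaching
    the threshold of 1 emits a spike and subtracts the threshold rather
    than resetting to zero.
    """
    v = rng.random()
    spikes = []
    for _ in range(n_samples):
        v += x
        if v >= 1.0:
            spikes.append(1)
            v -= 1.0
        else:
            spikes.append(0)
    return spikes


def rms_relative_error(encode, x, n_samples, trials=2000, seed=0):
    """RMS relative error when x is decoded as the mean spike count per sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        decoded = sum(encode(x, n_samples, rng)) / n_samples
        total += ((decoded - x) / x) ** 2
    return math.sqrt(total / trials)


if __name__ == "__main__":
    x = 0.3
    for ratio in (16, 64, 256):
        print(ratio,
              round(rms_relative_error(poisson_encode, x, ratio), 4),
              round(rms_relative_error(sigma_delta_encode, x, ratio), 4))
```

In this sketch the sigma-delta spike count can differ from xR by less than one spike, so its relative error stays below 1/(xR), whereas the Bernoulli encoder's error shrinks only as the inverse square root of R.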
In discrete-time computer simulations, the integrator value may exceed the threshold, and, therefore, subtraction of the threshold value rather than reset must be used. Next, motivated by the ΣΔ-IF analogy, we look for the signs of noise-shaping AD conversion in real neurons.

3. Experimental evidence of noise-shaping AD conversion in real neurons

In order to determine whether noise-shaping AD conversion takes place in biological neurons, we analyzed three experimental datasets, where spike trains were generated by time-varying somatic currents: 1) rat somatosensory cortex L5 pyramidal neurons [9], 2) mouse olfactory mitral cells [22, 23], and 3) fruit fly olfactory receptor neurons [24]. In the first two datasets, the current was injected through an electrode in whole-cell patch clamp mode, while in the third, the recording was extracellular and the intrinsic somatic current could be measured because the glial compartment included only one active neuron.

Testing the noise-shaping AD conversion hypothesis is complicated by the fact that encoded and decoded signals are hard to measure accurately. First, as somatic current is rectified by the spike-generation mechanism, only its super-threshold component can be encoded faithfully, making it hard to know exactly what is being encoded. Second, decoding in the dendrites is not accessible in these single-neuron recordings. In view of these difficulties, we start by simply computing the power spectrum of the reconstruction error obtained by subtracting a scaled and shifted, but otherwise unaltered, spike train from the somatic current. The scaling factor was determined by the total weight of the decoding linear filter and the shift was optimized to maximize information capacity, see below. At frequencies below 20 Hz the error contains significantly lower power than the input signal, Figure 3, indicating that the spike generation mechanism may be viewed as an AD converter.
Furthermore, the error power spectrum of the biological neuron is below that of the Poisson encoder, thus indicating the presence of noise-shaping. For dataset 3 we also plot the error power spectrum of the IF neuron, the threshold of which is chosen to generate the same number of spikes as the biological neuron.

[Figure 3 plots: spectral power (a.u.) versus frequency (Hz) for the somatic current, biological neuron error, Poisson encoder error, and IF neuron error.]

Figure 3. Evidence of noise-shaping. Power spectra of the somatic current (blue), difference between the somatic current and the digital spike train of the biological neuron (black), of the Poisson encoder (green) and of the IF neuron (red). Left: dataset 1, right: dataset 3.

Although the simple analysis presented above indicates noise-shaping, subtracting the spike train from the input signal, Figure 3, does not accurately quantify the error when decoding involves additional filtering. An example of such additional encoding/decoding is predictive coding, which will be discussed below [25]. To take such a decoding filter into account, we computed a decoded waveform by convolving the spike train with the optimal linear filter, which predicts the somatic current from the spike train with the least mean squared error. Our linear decoding analysis lends additional support to the noise-shaping AD conversion hypothesis [13-15]. First, the optimal linear filter shape is similar to unitary post-synaptic currents, Figure 4B, thus supporting the view that dendrites reconstruct the somatic current of the presynaptic neuron by low-pass filtering the spike train in accordance with the noise-shaping principle [13]. Second, we found that linear decoding using an optimal filter accounts for 60-80% of the somatic current variance. Naturally, such prediction works better for neurons in the suprathreshold regime, i.e.
with high firing rates, an issue to which we return in Section 4. To avoid complications associated with rectification, for now we focus on neurons in the suprathreshold regime, verified by checking that the relationship between predicted and actual current is close to linear.

[Figure 4C,D plots: spectral power (a.u.) versus frequency (Hz) for the somatic current, biological neuron error, Poisson encoder error, and IF neuron error.]

Figure 4. Linear decoding of experimentally recorded spike trains. A. Waveform of somatic current (blue), resulting spike train (black), and the linearly decoded waveform (red) from dataset 1. B. Top: Optimal linear filter for the trace in A, which is representative of other datasets as well. Bottom: Typical EPSPs have a shape similar to the decoding filter (adapted from [26]). C-D. Power spectra of the somatic current (blue), the decoding error of the biological neuron (black), the Poisson encoder (green), and the IF neuron (red) for dataset 1 (C) and dataset 3 (D).

Next, we analyzed the spectral distribution of the reconstruction error calculated by subtracting the decoded spike train, i.e. the spike train convolved with the computed optimal linear filter, from the somatic current. We found that at low frequencies the error power is significantly lower than in the input signal, Figure 4C,D. This observation confirms that signals below the dendritic cut-off frequency of 20-30 Hz can be efficiently communicated using spike trains. To quantify the effect of noise-shaping we computed the information capacity of different encoders: $C = \sum_f \log_2 [S(f)/N(f)]$, where $S(f)$ and $N(f)$ are the power spectra of the somatic current and encoding error, respectively, and the sum is computed only over the frequencies for which $S(f) > N(f)$.
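The capacity comparison can be illustrated with a short sketch. It assumes the capacity is the sum of the base-2 logarithm of the signal-to-error power ratio over the frequency bins where the signal power exceeds the error power; the exact normalization used by the authors is not recoverable from the text, and the function name and the toy spectra below are hypothetical.

```python
import math


def information_capacity(signal_power, error_power):
    """Sum log2(S/N) over the frequency bins where S(f) > N(f) > 0.

    signal_power and error_power are equal-length sequences holding the
    power spectra sampled at the same frequencies.
    """
    if len(signal_power) != len(error_power):
        raise ValueError("spectra must be sampled at the same frequencies")
    return sum(
        math.log2(s / n)
        for s, n in zip(signal_power, error_power)
        if s > n > 0
    )


if __name__ == "__main__":
    # Toy spectra: a low-pass signal, a flat error spectrum, and a shaped
    # error spectrum with the same total power pushed to high frequencies.
    signal = [8.0, 4.0, 2.0, 1.0]
    flat_error = [1.0, 1.0, 1.0, 1.0]
    shaped_error = [0.25, 0.5, 1.0, 2.25]
    print(information_capacity(signal, flat_error))
    print(information_capacity(signal, shaped_error))
```

With these toy numbers both error spectra carry the same total power, yet the shaped one leaves more capacity because its power sits where the signal is weak.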
Because the plots in Figure 4C,D use a semi-logarithmic scale, the information capacity can be estimated from the area between the somatic current (blue) power spectrum and an error power spectrum. We find that the biological spike generation mechanism has higher information capacity than the Poisson encoder and IF neurons. Therefore, neurons act as AD converters with stronger noise-shaping than IF neurons.

We now return to the predictive nature of the spike generation mechanism. Given the causal nature of the spike generation mechanism, it is surprising that the optimal filters for all three datasets carry most of their weight following a spike, Figure 4B. This indicates that the spike generation mechanism is capable of making predictions, which are possible in these experiments because somatic currents are temporally correlated. We note that these observations make delay-free reconstruction of the signal possible, thus allowing fast operation of neural circuits [27]. The predictive nature of the encoder can be captured by a ΣΔ modulator embedded in a predictive coding feedback loop [28], Figure 5A. We verified by simulation that such a nested architecture generates a similar optimal linear filter with most of its weight in the time following a spike, Figure 5A right. Of course, such prediction is only possible for correlated inputs, implying that the shape of the optimal linear filter depends on the statistics of the inputs. The role of predictive coding is to reduce the dynamic range of the signal that enters the ΣΔ modulator, thus avoiding overloading. A possible biological implementation for such integrating feedback could be Ca2+ concentration and Ca2+-dependent potassium channels [25, 29].

Figure 5. Enhanced ΣΔ modulators. A. ΣΔ modulator combined with a predictive coder. In such a device, the optimal decoding filter computed for correlated inputs has most of its weight following a spike, similar to experimental measurements, Figure 4B. B.
Second-order ΣΔ modulator possesses stronger noise-shaping properties. Because such a circuit contains an internal state variable, it generates a non-periodic spike train in response to a constant input. Bottom trace shows a typical result of a simulation. Black: spikes; blue: input current.

4. Possible reasons for current rectification: energy efficiency and de-noising

We have shown that at high firing rates biological neurons encode somatic current into a linearly decodable spike train. However, at low firing rates linear decoding cannot faithfully reproduce the somatic current because of rectification in the spike generation mechanism. If the objective of spike generation is faithful AD conversion, why would such rectification exist? We see two potential reasons: energy efficiency and de-noising.

It is widely believed that minimizing metabolic costs is an important consideration in brain design and operation [30, 31]. Moreover, spikes are known to consume a significant fraction of the metabolic budget [30, 32], placing a premium on their total number. Thus, we can postulate that neuronal spike trains find a trade-off between the mean squared error in the decoded spike train relative to the input signal and the total number of spikes, as expressed by the following cost function over a time interval $T$:

$$E = \sum_{t \in T} \left[ \left( x_t - (w * s)_t \right)^2 + \lambda s_t \right], \qquad (3)$$

where $x$ is the analog input signal, $s$ is the binary spike sequence composed of zeros and ones, $w$ is the linear filter, and $\lambda$ is the relative cost of a spike. To demonstrate how solving Eq.(3) would lead to thresholding, let us consider a simplified version taken over a Nyquist period, during which the input signal stays constant:

$$E = (x - N)^2 + \lambda N, \qquad (4)$$

where $N$ is the number of spikes in the period and $x$ and $\lambda$ are normalized by $w$. Minimizing such a cost function reduces to choosing the lowest lying parabola for a given $x$, Figure 6A. Therefore, thresholding is a natural outcome of minimizing a cost function combining the decoding error and the energy cost, Eq.(3).
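The "lowest lying parabola" argument can be made concrete with a small sketch. It assumes the simplified cost over one Nyquist period has the form (x - N)^2 + lam * N, where N is the spike count and lam is a per-spike cost; this form, the parameter names, and the search bound max_spikes are assumptions for illustration, since the original equation is not legible in the text. Under this assumed form the optimal N jumps from 0 to 1 once x exceeds (1 + lam) / 2, which is a threshold.

```python
def spike_cost(x, n_spikes, lam):
    """Squared decoding error plus a metabolic cost proportional to spike count."""
    return (x - n_spikes) ** 2 + lam * n_spikes


def optimal_spike_count(x, lam, max_spikes=10):
    """Pick the spike count whose parabola lies lowest at input x."""
    return min(range(max_spikes + 1), key=lambda n: spike_cost(x, n, lam))


if __name__ == "__main__":
    lam = 0.5  # hypothetical per-spike cost; the 0 -> 1 threshold is (1 + lam) / 2
    for x in (-0.5, 0.2, 0.7, 0.8, 1.5, 1.8):
        print(x, optimal_spike_count(x, lam))
```

Inputs below the threshold, including negative ones, are mapped to zero spikes, which is the rectification discussed in the text.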
In addition to energy efficiency, there may be a computational reason for thresholding somatic current in neurons. To illustrate this point, we note that the cost function in Eq.(3) for continuous variables, $s_t$, may be viewed as a non-negative version of the L1-norm regularized linear regression called LASSO [33], which is commonly used for de-noising of sparse and Laplacian signals [34]. Such a cost function can be minimized by iteratively applying a gradient descent step and a shrinkage step [35], the latter being equivalent to thresholding (one-sided in the case of non-negative variables), Figure 6B,C. Therefore, neurons may be encoding a de-noised input signal.

Figure 6. Possible reasons for rectification in neurons. A. Cost function combining encoding error squared with metabolic expense vs. input signal for different values of the spike number N, Eq.(4). Note that the optimal number of spikes jumps from zero to one as a function of input. B. Estimating the most probable "clean" signal value for a continuous non-negative Laplacian signal and Gaussian noise, Eq.(3) (while setting w = 1). The parabolas (red) illustrate the quadratic log-likelihood term in (3) for different values of the measurement, s, while the linear function (blue) reflects the linear log-prior term in (3). C. The minimum of the combined cost function in B is at zero if s is below the threshold, and grows linearly with s if s is above it.

5. Discussion

In this paper, we demonstrated that the neuronal spike-generation mechanism can be viewed as an oversampling and noise-shaping AD converter, which encodes a rectified low-pass filtered somatic current as a digital spike train. Rectification by the spike generation mechanism may subserve both energy efficiency and de-noising. As the degree of noise-shaping in biological neurons exceeds that in IF neurons, or basic ΣΔ modulators, we suggest that neurons should be modeled by more advanced ΣΔ modulators, e.g. Figure 5B.
Interestingly, ΣΔ modulators can also be viewed as coders with error prediction feedback [19].

Many publications have studied various aspects of spike generation in neurons, yet we believe that the framework [13-15] we adopt is different, and we discuss its relationship to some of these studies. Our framework is different from previous proposals to cast neurons as predictors [36, 37] because a different quantity is being predicted. The possibility of perfect decoding from a spike train with infinite temporal precision has been proven in [38]. Here, we are concerned with the more practical issue of how reconstruction error scales with the oversampling ratio. Also, we consider linear decoding, which sets our work apart from [39]. Finally, previous experiments addressing noise-shaping [40] studied the power spectrum of the spike train rather than that of the encoding error.

Our work is aimed at understanding biological and computational principles of spike generation and decoding and is not meant as a substitute for the existing phenomenological spike-generation models [41], which allow efficient fitting of parameters and prediction of spike trains [42]. Yet, the theoretical framework [13-15] we adopt may assist in building better models of spike generation for a given somatic current waveform. First, having interpreted spike generation as AD conversion, we can draw on the rich experience in signal processing to attack the problem. Second, this framework suggests a natural metric to compare the performance of different spike generation models in the high firing rate regime: the mean squared error between the injected current waveform and the filtered version of the spike train produced by a model, provided the total number of spikes is the same as in the experimental data. The AD conversion framework adds justification to the previously proposed spike distance obtained by subtracting low-pass filtered spike trains [43].
As the framework [13-15] we adopt relies on viewing neuronal computation as an analog-digital hybrid, which requires AD and DA conversion at every step, one may wonder about the reason for such a hybrid scheme. Starting with the early days of computers, the analog mode is known to be advantageous for computation. For example, performing addition of many variables in one step is possible in the analog mode simply by Kirchhoff's law, but would require hundreds of logical gates in the digital mode [44]. However, the analog mode is vulnerable to noise build-up over many stages of computation and is inferior in precisely communicating information over long distances under a limited energy budget [30, 31]. While early analog computers were displaced by their digital counterparts, evolution combined analog and digital modes into a computational hybrid [44], thus necessitating efficient AD and DA conversion, which was the focus of the present study.

We are grateful to L. Abbott, S. Druckmann, D. Golomb, T. Hu, J. Magee, N. Spruston, B. Theilman for helpful discussions and comments on the manuscript, to X.-J. Wang, D. McCormick, K. Nagel, R. Wilson, K. Padmanabhan, N. Urban, S. Tripathy, H. Koendgen, and M. Giugliano for sharing their data. The work of D.S. was partially supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

References

1. Ferster, D. and N. Spruston, Cracking the neural code. Science, 1995. 270: p. 756-7.
2. Panzeri, S., et al., Sensory neural codes using multiplexed temporal scales. Trends Neurosci, 2010. 33(3): p. 111-20.
3. Stevens, C.F. and A. Zador, Neural coding: The enigma of the brain. Curr Biol, 1995. 5(12): p. 1370-1.
4. Shadlen, M.N. and W.T. Newsome, The variable discharge of cortical neurons: implications for connectivity, computation, and information coding. J Neurosci, 1998. 18(10): p. 3870-96.
5. Shadlen, M.N. and W.T. Newsome, Noise, neural codes and cortical organization. Curr Opin Neurobiol, 1994. 4(4): p. 569-79.
6. Singer, W. and C.M. Gray, Visual feature integration and the temporal correlation hypothesis. Annu Rev Neurosci, 1995. 18: p. 555-86.
7. Meister, M., Multineuronal codes in retinal signaling. Proc Natl Acad Sci U S A, 1996. 93(2): p. 609-14.
8. Cook, E.P., et al., Dendrite-to-soma input/output function of continuous time-varying signals in hippocampal CA1 pyramidal neurons. J Neurophysiol, 2007. 98(5): p. 2943-55.
9. Kondgen, H., et al., The dynamical response properties of neocortical neurons to temporally modulated noisy inputs in vitro. Cereb Cortex, 2008. 18(9): p. 2086-97.
10. Tchumatchenko, T., et al., Ultrafast population encoding by cortical neurons. J Neurosci, 2011. 31(34): p. 12171-9.
11. Mainen, Z.F. and T.J. Sejnowski, Reliability of spike timing in neocortical neurons. Science, 1995. 268(5216): p. 1503-6.
12. Mar, D.J., et al., Noise shaping in populations of coupled model neurons. Proc Natl Acad Sci U S A, 1999. 96(18): p. 10450-5.
13. Shin, J., Adaptive noise shaping neural spike encoding and decoding. Neurocomputing, 2001. 38-40: p. 369-381.
14. Shin, J., The noise shaping neural coding hypothesis: a brief history and physiological implications. Neurocomputing, 2002. 44: p. 167-175.
15. Shin, J.H., Adaptation in spiking neurons based on the noise shaping neural coding hypothesis. Neural Networks, 2001. 14(6-7): p. 907-919.
16. Schreier, R. and G.C. Temes, Understanding delta-sigma data converters. 2005, Piscataway, NJ: IEEE Press, Wiley. xii, 446 p.
17. Candy, J.C., A use of limit cycle oscillations to obtain robust analog-to-digital converters. IEEE Trans. Commun, 1974. COM-22: p. 298-305.
18. Inose, H., Y. Yasuda, and J. Murakami, A telemetering system by code modulation - Δ-Σ modulation. IRE Trans. Space Elect. Telemetry, 1962. SET-8: p. 204-209.
19. Spang, H.A. and P.M. Schultheiss, Reduction of quantizing noise by use of feedback. IRE Trans. Commun. Sys., 1962: p. 373-380.
20. Hovin, M., et al., Delta-Sigma modulation in single neurons, in IEEE International Symposium on Circuits and Systems. 2002.
21. Cheung, K.F. and P.Y.H. Tang, Sigma-Delta Modulation Neural Networks. Proc. IEEE Int Conf Neural Networks, 1993: p. 489-493.
22. Padmanabhan, K. and N. Urban, Intrinsic biophysical diversity decorrelates neuronal firing while increasing information content. Nat Neurosci, 2010. 13: p. 1276-82.
23. Urban, N. and S. Tripathy, Neuroscience: Circuits drive cell diversity. Nature, 2012. 488(7411): p. 289-90.
24. Nagel, K.I. and R.I. Wilson, personal communication.
25. Shin, J., C. Koch, and R. Douglas, Adaptive neural coding dependent on the time-varying statistics of the somatic input current. Neural Comp, 1999. 11: p. 1893-913.
26. Magee, J.C. and E.P. Cook, Somatic EPSP amplitude is independent of synapse location in hippocampal pyramidal neurons. Nat Neurosci, 2000. 3(9): p. 895-903.
27. Thorpe, S., D. Fize, and C. Marlot, Speed of processing in the human visual system. Nature, 1996. 381(6582): p. 520-2.
28. Tewksbury, S.K. and R.W. Hallock, Oversampled, linear predictive and noise-shaping coders of order N>1. IEEE Trans Circuits & Sys, 1978. CAS-25: p. 436-47.
29. Wang, X.J., et al., Adaptation and temporal decorrelation by single neurons in the primary visual cortex. J Neurophysiol, 2003. 89(6): p. 3279-93.
30. Attwell, D. and S.B. Laughlin, An energy budget for signaling in the grey matter of the brain. J Cereb Blood Flow Metab, 2001. 21(10): p. 1133-45.
31. Laughlin, S.B. and T.J. Sejnowski, Communication in neuronal networks. Science, 2003. 301(5641): p. 1870-4.
32. Lennie, P., The cost of cortical computation. Curr Biol, 2003. 13(6): p. 493-7.
33. Tibshirani, R., Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B-Methodological, 1996. 58(1): p. 267-288.
34. Chen, S.S.B., D.L. Donoho, and M.A. Saunders, Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 1998. 20(1): p. 33-61.
35. Elad, M., et al., Wide-angle view at iterated shrinkage algorithms. P Soc Photo-Opt Ins, 2007. 6701: p. 70102.
36. Deneve, S., Bayesian spiking neurons I: inference. Neural Comp, 2008. 20: p. 91.
37. Yu, A.J., Optimal Change-Detection and Spiking Neurons, in NIPS, B. Scholkopf, J. Platt, and T. Hofmann, Editors. 2006.
38. Lazar, A. and L. Toth, Perfect Recovery and Sensitivity Analysis of Time Encoded Bandlimited Signals. IEEE Transactions on Circuits and Systems, 2004. 51(10).
39. Pfister, J.P., P. Dayan, and M. Lengyel, Synapses with short-term plasticity are optimal estimators of presynaptic membrane potentials. Nat Neurosci, 2010. 13(10): p. 1271-5.
40. Chacron, M.J., et al., Experimental and theoretical demonstration of noise shaping by interspike interval correlations. Fluctuations and Noise in Biological, Biophysical, and Biomedical Systems III, 2005. 5841: p. 150-163.
41. Pillow, J., Likelihood-based approaches to modeling the neural code, in Bayesian Brain: Probabilistic Approaches to Neural Coding, K. Doya, et al., Editors. 2007, MIT Press.
42. Jolivet, R., et al., A benchmark test for a quantitative assessment of simple neuron models. J Neurosci Methods, 2008. 169(2): p. 417-24.
43. van Rossum, M.C., A novel spike distance. Neural Comput, 2001. 13(4): p. 751-63.
44. Sarpeshkar, R., Analog versus digital: extrapolating from electronics to neurobiology. Neural Computation, 1998. 10(7): p. 1601-38.

same-paper 2 0.8496592 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

Author: Matthew Der, Lawrence K. Saul

3 0.80737185 289 nips-2012-Recognizing Activities by Attribute Dynamics

Author: Weixin Li, Nuno Vasconcelos

Abstract: In this work, we consider the problem of modeling the dynamic structure of human activities in the attributes space. A video sequence is first represented in a semantic feature space, where each feature encodes the probability of occurrence of an activity attribute at a given time. A generative model, denoted the binary dynamic system (BDS), is proposed to learn both the distribution and dynamics of different activities in this space. The BDS is a non-linear dynamic system, which extends both the binary principal component analysis (PCA) and classical linear dynamic systems (LDS), by combining binary observation variables with a hidden Gauss-Markov state process. In this way, it integrates the representation power of semantic modeling with the ability of dynamic systems to capture the temporal structure of time-varying processes. An algorithm for learning BDS parameters, inspired by a popular LDS learning method from dynamic textures, is proposed. A similarity measure between BDSs, which generalizes the Binet-Cauchy kernel for LDS, is then introduced and used to design activity classifiers. The proposed method is shown to outperform similar classifiers derived from the kernel dynamic system (KDS) and state-of-the-art approaches for dynamics-based or attribute-based action recognition.

4 0.80192322 242 nips-2012-Non-linear Metric Learning

Author: Dor Kedem, Stephen Tyree, Fei Sha, Gert R. Lanckriet, Kilian Q. Weinberger

5 0.7967357 243 nips-2012-Non-parametric Approximate Dynamic Programming via the Kernel Method

Author: Nikhil Bhat, Vivek Farias, Ciamac C. Moallemi

Abstract: This paper presents a novel non-parametric approximate dynamic programming (ADP) algorithm that enjoys graceful approximation and sample complexity guarantees. In particular, we establish both theoretically and computationally that our proposal can serve as a viable alternative to state-of-the-art parametric ADP algorithms, freeing the designer from carefully specifying an approximation architecture. We accomplish this by developing a kernel-based mathematical program for ADP. Via a computational study on a controlled queueing network, we show that our procedure is competitive with parametric ADP approaches.

6 0.79246044 219 nips-2012-Modelling Reciprocating Relationships with Hawkes Processes

7 0.7837708 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

8 0.78239274 200 nips-2012-Local Supervised Learning through Space Partitioning

9 0.77717555 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

10 0.77419162 197 nips-2012-Learning with Recursive Perceptual Representations

11 0.7681641 279 nips-2012-Projection Retrieval for Classification

12 0.76768327 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

13 0.76605958 168 nips-2012-Kernel Latent SVM for Visual Recognition

14 0.76532662 292 nips-2012-Regularized Off-Policy TD-Learning

15 0.76457649 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

16 0.764229 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

17 0.76323611 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.76290679 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

19 0.76252055 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

20 0.76124334 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models