nips nips2011 nips2011-254 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Purushottam Kar, Prateek Jain
Abstract: We consider the problem of classification using similarity/distance functions over data. Specifically, we propose a framework for defining the goodness of a (dis)similarity function with respect to a given learning task and propose algorithms that have guaranteed generalization properties when working with such good functions. Our framework unifies and generalizes the frameworks proposed by [1] and [2]. An attractive feature of our framework is its adaptability to data - we do not promote a fixed notion of goodness but rather let data dictate it. We show, by giving theoretical guarantees, that the goodness criterion best suited to a problem can itself be learned, which makes our approach applicable to a variety of domains and problems. We propose a landmarking-based approach to obtaining a classifier from such learned goodness criteria. We then provide a novel diversity-based heuristic to perform task-driven selection of landmark points instead of random selection. We demonstrate the effectiveness of our goodness criteria learning method as well as the landmark selection heuristic on a variety of similarity-based learning datasets and benchmark UCI datasets, on which our method consistently outperforms existing approaches by a significant margin.
Reference: text
sentIndex sentText sentNum sentScore
1 Specifically, we propose a framework for defining the goodness of a (dis)similarity function with respect to a given learning task and propose algorithms that have guaranteed generalization properties when working with such good functions. [sent-6, score-0.287]
2 An attractive feature of our framework is its adaptability to data - we do not promote a fixed notion of goodness but rather let data dictate it. [sent-8, score-0.302]
3 We show, by giving theoretical guarantees, that the goodness criterion best suited to a problem can itself be learned, which makes our approach applicable to a variety of domains and problems. [sent-9, score-0.413]
4 We propose a landmarking-based approach to obtaining a classifier from such learned goodness criteria. [sent-10, score-0.269]
5 We then provide a novel diversity based heuristic to perform task-driven selection of landmark points instead of random selection. [sent-11, score-0.406]
6 We demonstrate the effectiveness of our goodness criteria learning method as well as the landmark selection heuristic on a variety of similarity-based learning datasets and benchmark UCI datasets on which our method consistently outperforms existing approaches by a significant margin. [sent-12, score-0.849]
7 While classical techniques like decision tree and linear perceptron cannot handle such data, several modern machine learning algorithms such as support vector machine (SVM) can be kernelized and are thereby capable of using kernels or similarity functions. [sent-16, score-0.204]
8 However, most of these algorithms require the similarity functions to be positive semi-definite (PSD), which essentially implies that the similarity stems from an (implicit) embedding of the data into a Hilbert space. [sent-17, score-0.411]
9 Unfortunately in many domains, the most natural notion of similarity does not satisfy this condition - moreover, verifying this condition is usually a non-trivial exercise. [sent-18, score-0.202]
10 Consequently, there have been efforts to develop algorithms that do not make assumptions about the PSD-ness of the similarity functions used. [sent-21, score-0.193]
11 The first approach tries to coerce a given similarity measure into a PSD one by either clipping or shifting the spectrum of the kernel matrix [4, 5]. [sent-23, score-0.192]
12 The third approach, which has been investigated recently in a series of papers [1, 2, 8, 9], uses the similarity function to embed the domain into a low dimensional Euclidean space. [sent-26, score-0.196]
13 More specifically, these algorithms choose landmark points in the domain which then give the embedding. [sent-27, score-0.323]
14 The model proposed by Balcan-Blum in [1] gives sufficient conditions for a similarity function to be well suited to such a landmarking approach. [sent-29, score-0.288]
15 in [2] on the other hand provide goodness conditions for dissimilarity functions that enable landmarking algorithms. [sent-31, score-0.419]
16 Informally, a similarity (or distance) function can be said to be good if points of similar labels are closer to each other than points of different labels in some sense. [sent-32, score-0.221]
17 Both the models described above restrict themselves to a fixed goodness criterion, which need not hold for the underlying data. [sent-33, score-0.269]
18 We observe that this might be too restrictive in many situations and present a framework that allows us to tune the goodness criterion itself to the classification problem at hand. [sent-34, score-0.315]
19 We first prove generalization bounds corresponding to landmarked embeddings under a fixed goodness criterion. [sent-36, score-0.44]
20 We then provide a uniform-convergence bound that enables us to learn the best goodness criterion for a given problem. [sent-37, score-0.331]
21 We further generalize our framework by giving the ability to incorporate any Lipschitz loss function into our goodness criterion which allows us to give guarantees for the use of various algorithms such as C-SVM and logistic regression on the landmarked space. [sent-38, score-0.538]
22 In contrast, we propose a general heuristic for selecting informative landmarks based on a novel notion of diversity which can then be applied to any instantiation of our model. [sent-43, score-0.491]
23 Finally, we apply our methods to a variety of benchmark datasets for similarity learning as well as ones from the UCI repository. [sent-44, score-0.282]
24 We empirically demonstrate that our learning model and landmark selection heuristic consistently offer significant improvements over the existing approaches. [sent-45, score-0.382]
25 In particular, for a small number of landmark points, which is a practically important scenario as it is expensive to compute similarity function values at test time, our method provides, on average, accuracy boosts of up to 5% over existing methods. [sent-46, score-0.547]
26 We also note that our methods can be applied on top of any strategy used to learn the similarity measure (eg. [sent-47, score-0.185]
27 Given a (potentially non-PSD) similarity function K : X × X → R, the goal is to learn a classifier X → {−1, +1} from a finite number of i.i.d. samples. [sent-52, score-0.185]
28 Now, learning a reasonable classifier seems unlikely if the given similarity function does not have any inherent “goodness” property. [sent-56, score-0.169]
29 Intuitively, the goodness of a similarity function should be its suitability to the classification task at hand. [sent-57, score-0.438]
30 For PSD kernels, the notion of goodness is defined in terms of the margin offered in the RKHS [13]. [sent-58, score-0.337]
31 However, a more basic requirement is that the similarity function should preserve affinities among similarly labeled points - that is to say, a good similarity function should not, on average, assign higher similarity values to dissimilarly labeled points than to similarly labeled points. [sent-59, score-0.666]
32 This intuitive notion of goodness turns out to be rather robust (footnote: throughout the paper, we use the terms embedding space and landmarked space interchangeably). [sent-60, score-0.504]
33 Results described in this section hold for distance functions as well; we present results with respect to similarity functions for the sake of simplicity. [sent-61, score-0.217]
34 Indeed, this notion is robust in the sense that all PSD kernels that offer a good margin in their respective RKHSs satisfy some form of this goodness criterion as well [14]. [sent-62, score-0.387]
35 Recently there has been some interest in studying different realizations of this general notion of goodness and developing corresponding algorithms that allow for efficient learning with similarity/distance functions. [sent-63, score-0.302]
36 Balcan-Blum in [1] present a goodness criterion in which a good similarity function is considered to be one that, for most points, assigns a greater average similarity to similarly labeled points than to dissimilarly labeled points. [sent-64, score-0.754]
37 More specifically, a similarity function is (ε, γ)-good if there exists a weighing function w : X → R such that at least a (1 − ε) probability mass of examples x ∼ D satisfies E_{x′∼D}[w(x′) K(x, x′) | ℓ(x′) = ℓ(x)] ≥ E_{x′∼D}[w(x′) K(x, x′) | ℓ(x′) ≠ ℓ(x)] + γ. (1) [sent-65, score-0.217]
38 Here, instead of average similarity, one considers an average weighted similarity to allow the definition to be more general. [sent-66, score-0.169]
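As an illustration of how criterion (1) can be checked on a finite sample, the following sketch estimates, for every point, the gap between its weighted average similarity to same-labeled and differently-labeled points. The function names and the default uniform weighting are illustrative assumptions, not part of the paper, and on a finite sample this is only an empirical surrogate for the population quantities in (1).

```python
import numpy as np

def goodness_margins(K, y, w=None):
    """For each point x, estimate E[w K | same label] - E[w K | different label].

    K : (n, n) similarity matrix, y : (n,) labels in {-1, +1},
    w : (n,) values of a candidate weighing function (defaults to all ones).
    Assumes both labels are present in the sample.
    """
    K, y = np.asarray(K), np.asarray(y)
    n = len(y)
    w = np.ones(n) if w is None else np.asarray(w)
    margins = np.empty(n)
    for i in range(n):
        same = (y == y[i])
        same[i] = False                      # exclude the point itself
        diff = (y != y[i])
        margins[i] = np.mean(w[same] * K[i, same]) - np.mean(w[diff] * K[i, diff])
    return margins

def is_eps_gamma_good(K, y, eps, gamma, w=None):
    """Check whether at least a (1 - eps) fraction of points has a margin of at least gamma."""
    return np.mean(goodness_margins(K, y, w) >= gamma) >= 1.0 - eps
```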
39 However, these notions of goodness with a single fixed criterion may be too restrictive in the sense that the data and the (dis)similarity function may not satisfy the underlying criterion. [sent-70, score-0.339]
40 Thus there is a need to make the goodness criterion more flexible and data-dependent. [sent-72, score-0.315]
41 To this end, we unify and generalize both the above criteria to give a notion of goodness that is more data dependent. [sent-73, score-0.339]
42 Although the above goodness criteria (1) and (2) seem disparate at first, they can be shown to be special cases of a generalized framework where an antisymmetric function is used to compare intra and inter-class affinities. [sent-74, score-0.351]
43 We use this observation to define our novel goodness criterion using arbitrary bounded antisymmetric functions which we refer to as transfer functions. [sent-75, score-0.568]
44 This allows us to define a family of goodness criteria of which (1) and (2) form special cases ((1) uses the identity function and (2) uses the sign function as transfer function). [sent-76, score-0.473]
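To make the unification concrete, one plausible reading of the family (reconstructed from the description above and from the embedding defined in the next section; the paper's exact statement may differ in its constants and conditioning) is the following: for a bounded antisymmetric transfer function f and a weighing function w, K is good if at least a (1 − ε) probability mass of points x satisfies

E_{x⁺, x⁻} [ w(x⁺, x⁻) · f( K(x, x⁺) − K(x, x⁻) ) ] ≥ γ,

where x⁺ is drawn from points sharing x's label and x⁻ from points with a different label. Taking f to be the identity recovers a criterion of the form (1), while taking f = sgn recovers the dissimilarity-style criterion of [2].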
45 Moreover, the resulting definition of a good similarity function is more flexible and data dependent. [sent-77, score-0.169]
46 In the rest of the paper we shall always assume that our similarity functions are normalized i. [sent-78, score-0.231]
47 We stress that the property of antisymmetry for the transfer function is crucial to the definition in order to provide a uniform treatment to points of all classes, as will be evident in the proof of Theorem 2. [sent-83, score-0.193]
48 As in [1, 2], our goodness criterion lends itself to a simple learning algorithm which consists of choosing a set of d random pairs of points from the domain, P = {(x_i^+, x_i^-)}_{i=1}^d, which we refer to as landmark pairs. (Footnote 3: We refer the reader to the supplementary material (Section 2) for a discussion. Footnote 4: Due to lack of space we relegate all proofs to the supplementary material.) [sent-84, score-0.43]
49 One then defines an embedding of the domain into a landmarked space using these d landmark pairs: Φ_L : X → R^d, Φ_L(x) = ( f(K(x, x_i^+) − K(x, x_i^-)) )_{i=1}^d ∈ R^d. [sent-85, score-0.869]
50 The advantage of performing this embedding is the guaranteed existence of a large margin classifier in the landmarked space, as shown below. [sent-86, score-0.256]
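The embedding itself is straightforward to compute once the landmark pairs and a transfer function are fixed. Below is a minimal sketch assuming the similarity function is available as a Python callable; all names are illustrative, and the sign transfer function is only a default for the example. A linear classifier trained on the resulting features then plays the role of the large margin classifier whose existence is guaranteed below.

```python
import numpy as np

def landmark_embedding(X, landmark_pairs, K, f=np.sign):
    """Map each point x to Phi_L(x) = (f(K(x, x_i^+) - K(x, x_i^-)))_{i=1..d}.

    X : iterable of points, landmark_pairs : list of (x_plus, x_minus) tuples,
    K : similarity function K(x, y) -> float, f : transfer function.
    """
    X = list(X)
    d = len(landmark_pairs)
    Phi = np.empty((len(X), d))
    for j, x in enumerate(X):
        for i, (xp, xm) in enumerate(landmark_pairs):
            Phi[j, i] = f(K(x, xp) - K(x, xm))
    return Phi
```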
51 Firstly, the existence of this classifier itself is predicated on the use of the correct transfer function, something which is unknown. [sent-90, score-0.186]
52 Secondly, even if an optimal transfer function is known, the above formulation cannot be converted into an efficient learning algorithm for discovering the (unknown) weights since the formulation seeks to minimize the number of misclassifications which is an intractable problem in general. [sent-91, score-0.207]
53 First of all we assume that for some fixed loss function L, given any transfer function and any set of landmark pairs, it is possible to obtain a large margin classifier in the corresponding landmarked space that minimizes L. [sent-93, score-0.658]
54 2, we shall show it to be valid for a large class of loss functions by incorporating surrogate loss functions into our goodness criterion. [sent-97, score-0.421]
55 Learning the transfer function: In this section we present results that allow us to learn a near-optimal transfer function from a family of transfer functions. [sent-99, score-0.517]
56 We shall assume, for some fixed loss function L, the existence of an efficient routine which we refer to as TRAIN that shall return, for any landmarked space indexed by a set of landmark pairs P, a large margin classifier minimizing L. [sent-100, score-0.657]
57 An immediate algorithm for choosing the best transfer function is to simply search the set of possible transfer functions (in an algorithmically efficient manner) and choose the one offering lowest training error. [sent-102, score-0.377]
58 We show here that, given enough landmark pairs, this simple technique, which we refer to as FTUNE (see Algorithm 2), is guaranteed to return a near-best transfer function. [sent-103, score-0.454]
59 Let P = {(x_i^+, x_i^-)}_{i=1}^d be a set of (random) landmark pairs. [sent-109, score-0.27]
60 For any transfer function f and arbitrary choice of landmark pairs P, let w_(g,f) be the best weighing function for this choice of transfer function and landmark pairs. [sent-114, score-0.978]
61 Then we can ensure the following result (note that the function g_(f,w)(x) is dictated by the choice of the set of landmark pairs P). [sent-120, score-0.298]
62 Theorem 3: Let F be a compact class of transfer functions with respect to the infinity norm, and let ε_1, δ > 0. [sent-121, score-0.191]
63 Thus, if we iterate over the set of transfer functions (or use some gradient-descent based optimization routine), we are bound to select a transfer function that is capable of giving a classifier that is close to the best. [sent-124, score-0.408]
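A concrete rendering of this search, using the family of ramp transfer functions mentioned in the experimental section and a linear SVM as a stand-in for the abstract TRAIN routine (both choices are illustrative assumptions rather than the paper's exact implementation), could look as follows. In practice one might select on held-out data; the surrogate-loss discussion that follows covers replacing the 0-1 loss with losses such as the hinge loss.

```python
import numpy as np
from sklearn.svm import LinearSVC   # stand-in for the abstract TRAIN routine

def ramp(slope):
    """Ramp transfer function: bounded, antisymmetric, indexed by a slope parameter."""
    return lambda t: float(np.clip(slope * t, -1.0, 1.0))

def embed(X, landmark_pairs, K, f):
    """Landmarked-space embedding Phi_L(x) = (f(K(x, x_i^+) - K(x, x_i^-)))_i."""
    return np.array([[f(K(x, xp) - K(x, xm)) for (xp, xm) in landmark_pairs] for x in X])

def ftune(X, y, landmark_pairs, K, slopes=(0.5, 1.0, 2.0, 5.0, 10.0)):
    """Iterate over candidate transfer functions and keep the one whose
    landmarked-space linear classifier attains the lowest training error."""
    best_err, best_f, best_clf = np.inf, None, None
    for s in slopes:
        f = ramp(s)
        Phi = embed(X, landmark_pairs, K, f)
        clf = LinearSVC().fit(Phi, y)
        err = 1.0 - clf.score(Phi, y)
        if err < best_err:
            best_err, best_f, best_clf = err, f, clf
    return best_f, best_clf
```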
64 Thus what we require is for the landmarked space to admit a classifier that has low error with respect to a loss function that can also be efficiently minimized on any training set. [sent-129, score-0.205]
65 In such a situation, minimizing the loss on a random training set would, with very high probability, give us weights that give similar performance guarantees as the ones used in the goodness criterion. [sent-130, score-0.341]
66 With a similar objective in mind, [1] offers variants of its goodness criterion tailored to the hinge loss function which can be efficiently optimized on large training sets (for example LIBSVM [17]). [sent-131, score-0.399]
67 Here we give a general notion of goodness that can be tailored to any arbitrary Lipschitz loss function. [sent-132, score-0.351]
68 Selecting informative landmarks: Recall that the generalization guarantees we described in the previous section rely on random selection of landmark pairs from a fixed distribution over the domain. [sent-138, score-0.71]
69 For typical domains such as computer vision, similarity function computation is an expensive task and hence selection of a small number of landmarks should lead to a significant improvement in the test times. [sent-140, score-0.585]
70 For this reason, we propose a landmark pair selection heuristic which we call DSELECT (see Algorithm 1). [sent-141, score-0.333]
71 Algorithm 2 (FTUNE). Require: a family of transfer functions F, a similarity function K and a loss function L. Ensure: an optimal transfer function f* ∈ F. [sent-145, score-0.56]
72 (Fragment of the algorithm listing: P_FTUNE ← P_FTUNE ∪ {(z1, z2)}; end for; return L (for BBS) or P_FTUNE (for FTUNE).) The DSELECT heuristic generalizes naturally to multi-class problems and can also be applied to the classification model of Balcan-Blum that uses landmark singletons instead of pairs. [sent-150, score-0.288]
73 Assuming K is a normalized similarity kernel, we call a set of points S ⊂ X diverse if the average inter-point similarity within S is small. [sent-152, score-0.364]
74 The key observation behind DSELECT is that a nondiverse set of landmarks would cause all data points to receive identical embeddings and linear separation would be impossible. [sent-154, score-0.414]
75 Small inter-landmark similarity, on the other hand, would imply that the landmarks are well-spread in the domain and can capture novel patterns in the data. [sent-155, score-0.397]
76 Here we use this notion to achieve a better embedding into the landmarked space. [sent-157, score-0.235]
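A minimal greedy sketch of this diversity idea is given below: starting from a random seed, repeatedly add the candidate point whose average similarity to the landmarks already chosen is smallest. This illustrates the notion of diversity only and is not a line-by-line rendering of Algorithm 1 (DSELECT); in the pair-based setting the selected points would then be arranged into landmark pairs, and the multiclass extension mentioned below runs the selection per class.

```python
import numpy as np

def greedy_diverse_landmarks(K_matrix, num_landmarks, rng=None):
    """Greedily pick a diverse set of landmark indices from an (n, n) similarity matrix."""
    rng = np.random.default_rng() if rng is None else rng
    K_matrix = np.asarray(K_matrix, dtype=float)
    n = K_matrix.shape[0]
    selected = [int(rng.integers(n))]                  # seed with a random point
    while len(selected) < num_landmarks:
        avg_sim = K_matrix[:, selected].mean(axis=1)   # average similarity to the current set
        avg_sim[selected] = np.inf                     # never re-pick a chosen landmark
        selected.append(int(np.argmin(avg_sim)))       # the most "diverse" remaining candidate
    return selected
```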
77 Experimental results demonstrate that the heuristic offers significant performance improvements over random landmark selection (see Figure 1). [sent-158, score-0.349]
78 One can easily extend Algorithm 1 to multiclass problems by selecting a fixed number of landmarks from each class. [sent-159, score-0.37]
79 We refer to our transfer function learning based formulation as FTUNE and its augmentation using DSELECT as FTUNE+D. [sent-163, score-0.225]
80 In multi-class classification scenarios we will use a one-vs-all formulation, which presents us with an opportunity to further exploit the transfer function by learning a separate transfer function per class (i.e., the FTUNE-M variant used in the experiments). [sent-164, score-0.354]
81 We take the class of ramp functions indexed by a slope parameter as our set of transfer functions. [sent-170, score-0.191]
82 Our goal in this section is two-fold: 1) to show that our FTUNE method is able to learn a more suitable transfer function for the underlying data than the existing methods BBS and DBOOST and 2) to show that our diversity based heuristic for landmark selection performs better than random selection. [sent-173, score-0.579]
83 To this end, we perform experiments on a few benchmark datasets for learning with similarity (non-PSD) functions [5] as well as on a variety of standard UCI datasets where the similarity function used is the Gaussian kernel function. [sent-174, score-0.562]
84 Figure 1: Accuracy obtained by FTUNE+D, FTUNE, BBS+D, BBS and DBOOST on four different datasets as the number of landmarks used increases. [sent-305, score-0.434]
85 Note that for small numbers of landmarks (30, 50), our diversity-based landmark selection criterion increases accuracy significantly for both BBS and our method FTUNE-S. [sent-306, score-0.796]
86 Similarity learning datasets: First, we conduct experiments on a few similarity learning datasets [5]; these datasets provide a (non-PSD) similarity matrix along with class labels. [sent-308, score-0.53]
87 We then apply our FTUNE-S, FTUNE+D-S, BBS+D methods along with BBS and DBOOST with varying number of landmark pairs. [sent-310, score-0.27]
88 Table 1 compares the accuracies achieved by our FTUNE+D-S method with those of BBS and DBOOST over different datasets when using landmark sets of sizes 30 and 300. [sent-313, score-0.412]
89 Next, we evaluate the effectiveness of our landmark selection criteria for both BBS and our method. [sent-318, score-0.329]
90 Note that in all the datasets, our diversity-based landmark selection criterion increases the classification accuracy by around 5-6% for small numbers of landmarks. [sent-320, score-0.426]
91 UCI benchmark datasets: We now compare our FTUNE method against existing methods on a variety of UCI datasets [19]. [sent-322, score-0.193]
92 So for lack of space we drop it from our presentation and only show results for FTUNE-S (FTUNE with a single transfer function) and FTUNE-M (FTUNE with one transfer function per class). [sent-324, score-0.334]
93 Similar to [2], we use the Gaussian kernel function as the similarity function for evaluating our method. [sent-325, score-0.192]
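For reference, the Gaussian-kernel similarity used in these UCI experiments can be written as a one-line function; the bandwidth value below is an assumption for illustration and would in practice be chosen by cross-validation.

```python
import numpy as np

def gaussian_similarity(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))
```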
94 We report accuracy values averaged over 20 runs for each method with varying number of landmark pairs. [sent-328, score-0.32]
95 Figure 2: Accuracy achieved by various methods on four different UCI repository datasets as the number of landmarks used increases. [sent-549, score-0.434]
96 Note that both FTUNE-S and FTUNE-M perform significantly better than BBS and DBOOST for small numbers of landmarks (30, 50). [sent-550, score-0.37]
97 Also, BBS performs reasonably well for small landmarking sizes while DBOOST performs well for large landmarking sizes. [sent-553, score-0.22]
98 Figure 2 shows accuracies obtained by various methods as the number of landmarks selected increases. [sent-556, score-0.432]
99 However, even in these cases, DSELECT (intuitively) removes redundancies in the landmark points thus allowing FTUNE to recover the best transfer function. [sent-563, score-0.463]
100 In contrast, for larger datasets like those in Table 2 (average size 13200), FTUNE is itself able to recover better transfer functions than the baseline methods and hence both FTUNE-S and FTUNE-M perform significantly better than the baselines. [sent-564, score-0.255]
wordName wordTfidf (topN-words)
[('ftune', 0.522), ('bbs', 0.408), ('landmarks', 0.37), ('landmark', 0.27), ('goodness', 0.269), ('dboost', 0.268), ('similarity', 0.169), ('transfer', 0.167), ('landmarked', 0.153), ('landmarking', 0.102), ('dselect', 0.089), ('india', 0.067), ('datasets', 0.064), ('accuracies', 0.062), ('classi', 0.051), ('facerec', 0.051), ('pftune', 0.051), ('accuracy', 0.05), ('embedding', 0.049), ('uci', 0.049), ('weighing', 0.048), ('psd', 0.048), ('diversity', 0.047), ('criterion', 0.046), ('antisymmetric', 0.045), ('vs', 0.045), ('er', 0.043), ('heuristic', 0.041), ('shall', 0.038), ('amazonbinary', 0.038), ('dissimilarly', 0.038), ('criteria', 0.037), ('margin', 0.035), ('isolet', 0.034), ('loss', 0.033), ('notion', 0.033), ('avrim', 0.029), ('wf', 0.029), ('benchmark', 0.029), ('cl', 0.028), ('pairs', 0.028), ('domain', 0.027), ('routine', 0.026), ('microsoft', 0.026), ('points', 0.026), ('auralsonar', 0.025), ('magic', 0.025), ('nursery', 0.025), ('patrol', 0.025), ('balcan', 0.025), ('dis', 0.024), ('notions', 0.024), ('functions', 0.024), ('domains', 0.024), ('dissimilarity', 0.024), ('labeled', 0.023), ('kernel', 0.023), ('hurdles', 0.022), ('selection', 0.022), ('inef', 0.022), ('bold', 0.021), ('prateek', 0.021), ('boosts', 0.021), ('faults', 0.021), ('upto', 0.021), ('augmentation', 0.021), ('variety', 0.02), ('guarantees', 0.02), ('formulation', 0.02), ('misclassi', 0.019), ('training', 0.019), ('existence', 0.019), ('offer', 0.019), ('generalizes', 0.018), ('kernels', 0.018), ('sgn', 0.018), ('embeddings', 0.018), ('working', 0.018), ('satellite', 0.018), ('giving', 0.017), ('refer', 0.017), ('sup', 0.017), ('capable', 0.017), ('suited', 0.017), ('consistently', 0.017), ('liblinear', 0.017), ('offers', 0.016), ('existing', 0.016), ('tailored', 0.016), ('blum', 0.016), ('select', 0.016), ('learn', 0.016), ('sizes', 0.016), ('letters', 0.016), ('libsvm', 0.016), ('nathan', 0.015), ('cheng', 0.015), ('validation', 0.015), ('lipschitz', 0.015), ('cf', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
Author: Purushottam Kar, Prateek Jain
Abstract: We consider the problem of classification using similarity/distance functions over data. Specifically, we propose a framework for defining the goodness of a (dis)similarity function with respect to a given learning task and propose algorithms that have guaranteed generalization properties when working with such good functions. Our framework unifies and generalizes the frameworks proposed by [1] and [2]. An attractive feature of our framework is its adaptability to data - we do not promote a fixed notion of goodness but rather let data dictate it. We show, by giving theoretical guarantees that the goodness criterion best suited to a problem can itself be learned which makes our approach applicable to a variety of domains and problems. We propose a landmarking-based approach to obtaining a classifier from such learned goodness criteria. We then provide a novel diversity based heuristic to perform task-driven selection of landmark points instead of random selection. We demonstrate the effectiveness of our goodness criteria learning method as well as the landmark selection heuristic on a variety of similarity-based learning datasets and benchmark UCI datasets on which our method consistently outperforms existing approaches by a significant margin. 1
2 0.091288164 291 nips-2011-Transfer from Multiple MDPs
Author: Alessandro Lazaric, Marcello Restelli
Abstract: Transfer reinforcement learning (RL) methods leverage on the experience collected on a set of source tasks to speed-up RL algorithms. A simple and effective approach is to transfer samples from source tasks and include them in the training set used to solve a target task. In this paper, we investigate the theoretical properties of this transfer method and we introduce novel algorithms adapting the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem.
3 0.056363974 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
Author: Qian Sun, Rita Chattopadhyay, Sethuraman Panchanathan, Jieping Ye
Abstract: Discriminative learning when training and test data belong to different distributions is a challenging and complex task. Often times we have very few or no labeled data from the test or target distribution but may have plenty of labeled data from multiple related sources with different distributions. The difference in distributions may be both in marginal and conditional probabilities. Most of the existing domain adaptation work focuses on the marginal probability distribution difference between the domains, assuming that the conditional probabilities are similar. However in many real world applications, conditional probability distribution differences are as commonplace as marginal probability differences. In this paper we propose a two-stage domain adaptation methodology which combines weighted data from multiple sources based on marginal probability differences (first stage) as well as conditional probability differences (second stage), with the target domain data. The weights for minimizing the marginal probability differences are estimated independently, while the weights for minimizing conditional probability differences are computed simultaneously by exploiting the potential interaction among multiple sources. We also provide a theoretical analysis on the generalization performance of the proposed multi-source domain adaptation formulation using the weighted Rademacher complexity measure. Empirical comparisons with existing state-of-the-art domain adaptation methods using three real-world datasets demonstrate the effectiveness of the proposed approach. 1
4 0.048136532 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li
Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1
5 0.04363231 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon
Abstract: We introduce P I C O D ES: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. P I C O D ES of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.
6 0.041450005 186 nips-2011-Noise Thresholds for Spectral Clustering
7 0.040075622 171 nips-2011-Metric Learning with Multiple Kernels
8 0.03991738 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
9 0.039201226 28 nips-2011-Agnostic Selective Classification
10 0.037458014 189 nips-2011-Non-parametric Group Orthogonal Matching Pursuit for Sparse Learning with Multiple Kernels
11 0.035639953 263 nips-2011-Sparse Manifold Clustering and Embedding
12 0.034509622 54 nips-2011-Co-regularized Multi-view Spectral Clustering
13 0.032906115 19 nips-2011-Active Classification based on Value of Classifier
14 0.032436352 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
15 0.031948179 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
16 0.031684101 157 nips-2011-Learning to Search Efficiently in High Dimensions
17 0.03141864 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
18 0.031237632 155 nips-2011-Learning to Agglomerate Superpixel Hierarchies
19 0.03080814 53 nips-2011-Co-Training for Domain Adaptation
20 0.030558178 178 nips-2011-Multiclass Boosting: Theory and Algorithms
topicId topicWeight
[(0, 0.095), (1, 0.01), (2, -0.044), (3, 0.004), (4, -0.014), (5, 0.008), (6, 0.012), (7, -0.048), (8, -0.036), (9, -0.012), (10, -0.069), (11, 0.069), (12, 0.04), (13, 0.021), (14, 0.003), (15, -0.012), (16, -0.051), (17, 0.049), (18, 0.032), (19, 0.009), (20, -0.053), (21, 0.089), (22, -0.049), (23, 0.012), (24, -0.002), (25, 0.053), (26, 0.008), (27, 0.017), (28, -0.023), (29, -0.028), (30, 0.001), (31, -0.036), (32, 0.005), (33, -0.022), (34, -0.001), (35, -0.06), (36, 0.0), (37, 0.009), (38, 0.083), (39, 0.019), (40, -0.014), (41, 0.008), (42, 0.078), (43, -0.044), (44, -0.009), (45, 0.015), (46, -0.039), (47, -0.014), (48, 0.015), (49, 0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.88568145 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
Author: Purushottam Kar, Prateek Jain
Abstract: We consider the problem of classification using similarity/distance functions over data. Specifically, we propose a framework for defining the goodness of a (dis)similarity function with respect to a given learning task and propose algorithms that have guaranteed generalization properties when working with such good functions. Our framework unifies and generalizes the frameworks proposed by [1] and [2]. An attractive feature of our framework is its adaptability to data - we do not promote a fixed notion of goodness but rather let data dictate it. We show, by giving theoretical guarantees that the goodness criterion best suited to a problem can itself be learned which makes our approach applicable to a variety of domains and problems. We propose a landmarking-based approach to obtaining a classifier from such learned goodness criteria. We then provide a novel diversity based heuristic to perform task-driven selection of landmark points instead of random selection. We demonstrate the effectiveness of our goodness criteria learning method as well as the landmark selection heuristic on a variety of similarity-based learning datasets and benchmark UCI datasets on which our method consistently outperforms existing approaches by a significant margin. 1
2 0.63783705 120 nips-2011-History distribution matching method for predicting effectiveness of HIV combination therapies
Author: Jasmina Bogojeska
Abstract: This paper presents an approach that predicts the effectiveness of HIV combination therapies by simultaneously addressing several problems affecting the available HIV clinical data sets: the different treatment backgrounds of the samples, the uneven representation of the levels of therapy experience, the missing treatment history information, the uneven therapy representation and the unbalanced therapy outcome representation. The computational validation on clinical data shows that, compared to the most commonly used approach that does not account for the issues mentioned above, our model has significantly higher predictive power. This is especially true for samples stemming from patients with longer treatment history and samples associated with rare therapies. Furthermore, our approach is at least as powerful for the remaining samples. 1
3 0.61269397 53 nips-2011-Co-Training for Domain Adaptation
Author: Minmin Chen, Kilian Q. Weinberger, John Blitzer
Abstract: Domain adaptation algorithms seek to generalize a model trained in a source domain to a new target domain. In many practical cases, the source and target distributions can differ substantially, and in some cases crucial target features may not have support in the source domain. In this paper we introduce an algorithm that bridges the gap between source and target domains by slowly adding to the training set both the target features and instances in which the current algorithm is the most confident. Our algorithm is a variant of co-training [7], and we name it CODA (Co-training for domain adaptation). Unlike the original co-training work, we do not assume a particular feature split. Instead, for each iteration of cotraining, we formulate a single optimization problem which simultaneously learns a target predictor, a split of the feature space into views, and a subset of source and target features to include in the predictor. CODA significantly out-performs the state-of-the-art on the 12-domain benchmark data set of Blitzer et al. [4]. Indeed, over a wide range (65 of 84 comparisons) of target supervision CODA achieves the best performance. 1
4 0.58966058 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers
Author: Luca Oneto, Davide Anguita, Alessandro Ghio, Sandro Ridella
Abstract: We derive here new generalization bounds, based on Rademacher Complexity theory, for model selection and error estimation of linear (kernel) classifiers, which exploit the availability of unlabeled samples. In particular, two results are obtained: the first one shows that, using the unlabeled samples, the confidence term of the conventional bound can be reduced by a factor of three; the second one shows that the unlabeled samples can be used to obtain much tighter bounds, by building localized versions of the hypothesis class containing the optimal classifier. 1
5 0.578417 291 nips-2011-Transfer from Multiple MDPs
Author: Alessandro Lazaric, Marcello Restelli
Abstract: Transfer reinforcement learning (RL) methods leverage on the experience collected on a set of source tasks to speed-up RL algorithms. A simple and effective approach is to transfer samples from source tasks and include them in the training set used to solve a target task. In this paper, we investigate the theoretical properties of this transfer method and we introduce novel algorithms adapting the transfer process on the basis of the similarity between source and target tasks. Finally, we report illustrative experimental results in a continuous chain problem.
6 0.57466722 12 nips-2011-A Two-Stage Weighting Framework for Multi-Source Domain Adaptation
7 0.53595233 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification
8 0.50015271 238 nips-2011-Relative Density-Ratio Estimation for Robust Distribution Comparison
9 0.48560366 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions
10 0.484079 111 nips-2011-Hashing Algorithms for Large-Scale Learning
11 0.47979176 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
12 0.47322881 77 nips-2011-Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression
13 0.46615115 277 nips-2011-Submodular Multi-Label Learning
14 0.45300338 27 nips-2011-Advice Refinement in Knowledge-Based SVMs
15 0.44689351 157 nips-2011-Learning to Search Efficiently in High Dimensions
16 0.44259116 181 nips-2011-Multiple Instance Learning on Structured Data
17 0.44135484 287 nips-2011-The Manifold Tangent Classifier
18 0.43864447 171 nips-2011-Metric Learning with Multiple Kernels
19 0.4364619 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
20 0.43533882 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
topicId topicWeight
[(0, 0.025), (4, 0.028), (20, 0.035), (26, 0.024), (31, 0.046), (33, 0.025), (43, 0.055), (45, 0.124), (57, 0.051), (64, 0.329), (65, 0.015), (74, 0.056), (83, 0.02), (99, 0.051)]
simIndex simValue paperId paperTitle
1 0.69387335 87 nips-2011-Energetically Optimal Action Potentials
Author: Martin B. Stemmler, Biswa Sengupta, Simon Laughlin, Jeremy Niven
Abstract: Most action potentials in the nervous system take on the form of strong, rapid, and brief voltage deflections known as spikes, in stark contrast to other action potentials, such as in the heart, that are characterized by broad voltage plateaus. We derive the shape of the neuronal action potential from first principles, by postulating that action potential generation is strongly constrained by the brain’s need to minimize energy expenditure. For a given height of an action potential, the least energy is consumed when the underlying currents obey the bang-bang principle: the currents giving rise to the spike should be intense, yet short-lived, yielding spikes with sharp onsets and offsets. Energy optimality predicts features in the biophysics that are not per se required for producing the characteristic neuronal action potential: sodium currents should be extraordinarily powerful and inactivate with voltage; both potassium and sodium currents should have kinetics that have a bell-shaped voltage-dependence; and the cooperative action of multiple ‘gates’ should start the flow of current.
Exhaustive measurements on myocytes in the heart, which expend tremendous amounts of energy to keep the heart beating, indicate that Na/K pumps expel ∼ 0.5 µA/cm2 of Na+ current at membrane potentials close to rest [1]. Most excitable cells, even when spiking, spend most of their time close to resting potential, and yet standard models for action potentials can easily lead to accumulating an ionic charge of up to 5 µC/cm2 [2]; most of this accumulation occurs during a very brief time interval. If one were to take an isopotential nerve cell with the same density of ion pumps as in the heart, then such a cell would not be able to produce more than an action potential once every ten seconds on average. The brain should be effectively silent. Clearly, this conflicts with what is known about the average firing rates of neurons in the brainstem or even the neocortex, which can sustain spiking up to at least 7 Hz [3]. Part of the discrepancy can be resolved by noting that nerve cells are not isopotential and that action potential generation occurs within a highly restricted area of the membrane. Even so, standard models of action potential generation waste extraordinary amounts of energy; recent evidence [4] points out that many mammalian cortical neurons are much more efficient. As nature places a premium on energy consumption, we will argue that one can predict both the shape of the action potential and the underlying biophysics of the nonlinear, voltage-dependent ionic conductances from the principle of minimal energy consumption. After reviewing the ionic basis of action potentials, we first sketch how to compute the minimal energy cost for an arbitrary spike shape, and then solve for the optimal action potential shape with a given height. Finally, we show how minimal energy consumption explains all the dynamical features in the standard HodgkinHuxley (HH) model for neuronal dynamics that distinguish the brain’s action potentials from other highly nonlinear oscillations in physics and chemistry. 2 Ionic basis of the action potential In an excitable cell, synaptic drive forces the membrane permeability to different ions to change rapidly in time, producing the dynamics of the action potential. The current density Iα carried by an ion species α is given by the Goldman-Hodgkin-Katz (GHK) current equation[5, 6, 2], which assumes that ions are driven independently across the membrane under the influence of a constant electric field. Iα depends upon the ions membrane permeability, Pα , its concentrations on either side of the membrane [α]out and [α]in and the voltage across the membrane, V , according to: Iα = Pα 2 zα V F 2 [α]out − [α]in exp (zα V F/RT ) , RT 1 − exp(zα V F/RT ) (2) To produce the fast currents that generate APs, a subset of the membranes ionic permeabilities Pα are gated by voltage. Changes in the permeability Pα are not instantaneous; the voltage-gated permeability is scaled mathematically by gating variables m(t) and h(t) with their own time dependence. After separating constant from time-dependent components in the permeability, the voltage-gated permeability obeys ¯ Pα (t) = m(t)r h(t)s such that 0 ≤ Pα (t) ≤ Pα , ¯ where r and s are positive, and Pα is the peak permeability to ion α when all channels for ion α are open. Gating is also referred to as activation, and the associated nonlinear permeabilities are called active. There are also passive, voltage-insensitive permeabilities that maintain the resting potential and depolarise the membrane to trigger action potentials. 
The simplest possible kinetics for the gating variables are first order, involving only a single derivative in time. The steady state of each gating variable at a given voltage is determined by a Boltzmann function, to which the gating variables evolve: dm r ¯ τm = Pα m∞ (V ) − m(t) dt dh and τh =h∞ (V ) − h(t), dt 2 −1 with m∞ (V ) = {1 + exp ((V − Vm )/sm )} the Boltzmann function described by the slope sm > −1 0 and the midpoint Vm ; similarly, h∞ (V ) = {1 + exp ((V − Vh )/sh )} , but with sh < 0. Scaling ¯ m∞ (V ) by the rth root of the peak permeability Pα is a matter of mathematical convenience. We will consider both voltage-independent and voltage-dependent time constants, either setting τj = τj,0 to be constant, where j ∈ {m(t), h(t)}, or imposing a bell-shaped voltage dependence τj (V ) = τj,0 sech [sj (V − Vj )] The synaptic, leak, and voltage-dependent currents drive the rate of change in the voltage across the membrane dV C = Isyn + Ileak + Iα , dt α where the synaptic permeability and leak permeability are held constant. 3 Resistive and capacitive components of the energy cost By treating the action potential as the charging and discharging of the cell membrane capacitance, the action potentials measured at the mossy fibre synapse in rats [4] or in mouse thalamocortical neurons [7] were found to be highly energy-efficient: the nonlinear, active conductances inject only slightly more current than is needed to charge a capacitor to the peak voltage of the action potential. The implicit assumption made here is that one can neglect the passive loss of current through the membrane resistance, known as the leak. Any passive loss must be compensated by additional charge, making this loss the primary target of the selection pressure that has shaped the dynamics of action potentials. On the other hand, the membrane capacitance at the site of AP initiation is generally modelled and experimentally confirmed [8] as being fairly constant around 1 µF/cm2 ; in contrast, the propagation, but not generation, of AP’s can be assisted by a reduction in the capacitance achieved by the myelin sheath that wraps some axons. As myelin would block the flow of ions, we posit that the specific capacitance cannot yield to selection pressure to minimise the work W = QNa (ENa − EK ) needed for AP generation. To address how the shape and dynamics of action potentials might have evolved to consume less energy, we first fix the action potential’s shape and solve for the minimum charge QNa ab initio, without treating the cell membrane as a pure capacitor. Regardless of the action potential’s particular time-course V (t), voltage-dependent ionic conductances must transfer Na+ and K+ charge to elicit an action potential. Figure 1 shows a generic action potential and the associated ionic currents, comparing the latter to the minimal currents required. The passive equivalent circuit for the neuron consists of a resistor in parallel with a capacitor, driven by a synaptic current. To charge the membrane to the peak voltage, a neuron in a high-conductance state [9, 10] may well lose more charge through the resistor than is stored on the capacitor. For neurons in a low-conductance state and for rapid voltage deflections from the resting potential, membrane capacitance will be the primary determinant of the charge. 4 The norm of spikes How close can voltage-gated channels with realistic properties come to the minimal currents? What time-course for the action potential leads to the smallest minimal currents? 
To answer these questions, we must solve a constrained optimization problem on the solutions to the nonlinear differential equations for the neuronal dynamics. To separate action potentials from mere small-amplitude oscillations in the voltage, we need to introduce a metric. Smaller action potentials consume less energy, provided the underlying currents are optimal, yet signalling between neurons depends on the action potential’s voltage deflection reaching a minimum amplitude. Given the importance of the action potential’s amplitude, we define an Lp norm on the voltage wave-form V (t) to emphasize the maximal voltage deflection: 1 p T V (t) − V p V (t) − V = 0 3 p dt , Generic Action Potential -10 + a V [mV] -20 -30 gsyn -40 -50 -60 0 2 4 6 8 t [ms] 10 12 14 16 gNa Active and Minimal Currents 100 gK + gleak C + + 80 2 current [µA/cm ] 60 b Active IK Minimum IK 40 20 0 -20 For a fixed action potential waveform V (t): Active INa Minimum INa -40 -60 Minimum INa (t) = −LV (t)θ(LV (t)) Minimum IK (t) = −LV (t)θ(−LV (t)) -80 -100 0 2 4 6 8 10 t [ms] 12 14 ˙ with LV (t) ≡ C V (t) + Ileak [V (t)] + Isyn [V (t)]. 16 c Qresistive/Qcapacitive Resistive vs. Capacitive Minimum Charge 1 0.5 0 0.2 0.4 0.6 0.8 1.0 1.2 leak conductance [mS/cm2] 1.4 Figure 1: To generate an action potential with an arbitrary time-course V (t), the nonlinear, timedependent permeabilities must deliver more charge than just to load the membrane capacitance— resistive losses must be compensated. (a) The action potential’s time-course in a generic HH model for a neuron, represented by the circuit diagram on the right. The peak of the action potential is ∼ 50 mV above the average potential. (b) The inward Na+ current, shown in green going in the negative direction, rapidly depolarizes the potential V (t) and yields the upstroke of the action potential. Concurrently, the K+ current activates, displayed as a positive deflection, and leads to the downstroke in the potential V (t). Inward and outward currents overlap significantly in time. The dotted lines within the region bounded by the solid lines represent the minimal Na+ current and the minimal K+ current needed to produce the V (t) spike waveform in (a). By the law of current conservation, the sum of capacitive, resistive, and synaptic currents, denoted by ˙ LV (t) ≡ C V (t) + Ileak [V (t)] + Isyn [V (t)], must be balanced by the active currents. If the cell’s passive properties, namely its capacitance and (leak) resistance, and the synaptic conductance are constant, we can deduce the minimal active currents needed to generate a specified V (t). The minimal currents, by definition, do not overlap in time. Taking into account passive current flow, restoring the concentration gradients after the action potential requires 29 nJ/cm2 . By contrast, if the active currents were optimal, the cost would be 8.9 nJ/cm2 . (c) To depolarize from the minimum to the maximum of the AP, the synaptic voltage-gated currents must deliver a charge Qcapacitive to charge the membrane capacitance and a charge Qresistive to compensate for the loss of current through leak channels. For a large leak conductance in the cell membrane, Qresistive can be larger than Qcapacitive . 4 where V is the average voltage. In the limit as p → ∞, the norm simply becomes the difference between the action potential’s peak voltage and the mean voltage, whereas a finite p ensures that the norm is differentiable. 
In parameter space, we will focus our attention to the manifold of action potentials with constant Lp norm with 2 p < ∞, which entails that the optimal action potential will have a finite, though possibly narrow width. To be close to the supremum norm, yet still have a norm that is well-behaved under differentiation, we decided to use p = 16. 5 Poincar´ -Lindstedt perturbation of periodic dynamical orbits e Standard (secular) perturbation theory diverges for periodic orbits, so we apply the PoincarLindstedt technique of expanding both in the period and the dynamics of the asymptotic orbit and then derive a set of adjoint sensitivity equations for the differential-algebraic system. Solving once for the adjoint functions, we can easily compute the parameter gradient of any functional on the orbit, even for thousands of parameters. ˙ We start with a set of ordinary differential equations x = F(x; p) for the neuron’s dynamics, an asymptotically periodic orbit xγ (t) that describes the action potential, and a functional G(x; p) on the orbit, representing the energy consumption, for instance. The functional can be written as an integral ω(p)−1 G(xγ ; p) = g(xγ (t); p) dt, 0 over some source term g(xγ (t); p). Assume that locally perturbing a parameter p ∈ p induces a smooth change in the stable limit cycle, preserving its existence. Generally, a perturbation changes not only the limit cycle’s path in state space, but also the average speed with which this orbit is traversed; as a consequence, the value of the functional depends on this change in speed, to lowest order. For simplicity, consider a single, scalar parameter p. G(xγ ; p) is the solution to ω(p)∂τ [G(xγ ; p)] = g(xγ ; p), where we have normalised time via τ = ω(p)t. Denoting partial derivatives by subscripts, we expand p → p + to get the O 1 equation dτ [Gp (xγ ; p)] + ωp g(xγ ; p) = gx (xγ ; p)xp + gp (xγ ; p) in a procedure known as the Poincar´ -Lindstedt method. Hence, e dG = dp ω −1 (gp + gx xp − ωp g) dt, 0 where, once again by the Poincar´ -Lindstedt method, xp is the solution to e ˙ xp =Fx (xγ )xp + Fp (xγ ) − ωp F (xγ ) . Following the approach described by Cao, Li, Petzold, and Serban (2003), introduce a Lagrange vector AG (x) and consider the augmented objective function ω −1 I(xγ ; p) = G(xγ ; p) − ˙ AG (xγ ). (F(xγ ) − xγ ) dt, 0 γ ˙ which is identical to G(x ; p) as F(x) − x = 0. Then dI(xγ ; p) = dp ω −1 ω −1 ˙ AG . (Fp + Fx xp − ωp F − xp ) dt. (gp + gx xp − ωp g) dt − 0 0 ˙ Integrating the AG (x).xp term by parts and using periodicity, we get dI(xγ ; p) = dp ω −1 ω −1 ˙ −gx + AG + AG .F xp dt. G gp − ωp g − A . (Fp − ωp F) dt − 0 0 5 Parameter ¯ peak permeability PNa ¯ peak permeability PK midpoint voltage Vm ∨ Vh slope sm ∨ (−sh ) time constant τm,0 ∨ τh,0 gating exponent r ∨ s minimum 0.24 fm/s 6.6 fm/s - 72 mV 3.33 mV 5 µs 0.2 maximum 0.15 µm/s 11 µm/s 70 mV 200 mV 200 ms 5.0 Table 1: Parameter limits. We can let the second term vanish by making the vector AG (x) obey ˙ AG (x) = −FT (x; p) AG (x) + gx (x; p). x Label the homogeneous solution (obtained by setting gx (xγ ; p) = 0) as Z(x). It is known that ω −1 the term ωp is given by ωp = ω 0 Z(x).Fp (x) dt, provided Z(x) is normalised to satisfy Z(x).F(x) = 1. We can add any multiple of the homogeneous solution Z(x) to the inhomogeneous solution, so we can always make ω −1 AG (x).F(x) dt = G 0 by taking ω −1 G G AG (x).F(x) dt − ωG . A (x) → A (x) − Z(x) (3) 0 This condition will make AG (x) unique. Finally, with eq. 
(3) we get dI(xγ ; p) dG(xγ ; p) = = dp dp ω −1 gp − AG . Fp dt. 0 The first term in the integral gives rise to the partial derivative ∂G(xγ ; p)/ ∂p. In many cases, this term is either zero, can be made zero, or at least made independent of the dynamical variables. The parameters for the neuron models are listed in Table 1 together with their minimum and maximum allowed values. For each parameter in the neuron model, an auxiliary parameter on the entire real line is introduced, and a mapping from the real line onto the finite range set by the biophysical limits is defined. Gradient descent on this auxiliary parameter space is performed by orthogonalizing the gradient dQα /dp to the gradient dL/dp of the norm. To correct for drift off the constraint manifold of constant norm, illustrated in Fig. 3, steps of gradient ascent or descent on the Lp norm are performed while keeping Qα constant. The step size during gradient descent is adjusted to assure that ∆Qα < 0 and that a periodic solution xγ exists after adapting the parameters. The energy landscape is locally convex (Fig. 3). 6 Predicting the Hodgkin-Huxley model We start with a single-compartment Goldman-Hodgkin-Katz model neuron containing voltage-gated Na+ and leak conductances (Figure 1). A tonic synaptic input to the model evokes repetitive firing of action potentials. We seek those parameters that minimize the ionic load for an action potential of constant norm—in other words, spikes whose height relative to the average voltage is fairly constant, subject to a trade-off with the spike width. The ionic load is directly proportional to the work W performed by the ion flux. All parameters governing the ion channels’ voltage dependence and kinetics, including their time constants, mid-points, slopes, and peak values, are subject to change. The simplest model capable of generating an action potential must have two dynamical variables and two time scales: one for the upstroke and another for the downstroke. If both Na+ and K+ currents 6 Transient Na Current Model Optimal Action Potential Falling Phase Currents 40 20 a τ [ms] 5 1 2 Q = 239 nC/cm PNa = m(t)h(t) PK = n(t) 0 -60 0 -20 τh τn 60 current [μA/cm2] V [mV] V [mV] -40 IK[V] Excess INa[V] Peak Resurgence 300 200 100 -60 -4 -2 0 0 4 40 20 τ [ms] 5 1 Q = 169 nC/cm2 PNa = m(t)h(t) PK = n(t) τi = τi(V) 0 -60 0 -20 τh τn 60 current [μA/cm2] 60 V [mV] -40 -60 -4 -2 0 2 t [ms] 0.5 0.75 IK[V] Excess INa[V] Peak Resurgence 200 100 0 4 0.25 40 5 1 PNa = m(t)h(t) s PK = n(t) τi = τi(V) 20 0 delay τ [ms] Q = 156 nC/cm2 current [μA/cm2] 60 -60 0 -20 τh τn 60 V [mV] -40 -60 t [ms] 0.5 t [ms] 0.75 Cooperative Gating Model Optimal Action Potential Falling Phase Currents V [mV] c 0.25 Voltage-dependent (In)activation Model Falling Phase Currents Optimal Action Potential V [mV] b 2 t [ms] -2 -1 0 t [ms] 1 750 500 250 0 0 2 IK[V] Excess INa[V] Peak Resurgence 0.2 t [ms] 0.4 Figure 2: Optimal spike shapes and currents for neuron models with different biophysical features. During optimization, the spikes were constrained to have constant norm V (t) − V 16 = 92 mV, which controls the height of the spike. Insets in the left column display the voltage-dependence of the optimized time constants for sodium inactivation and potassium activation; sodium activation is modeled as occurring instantaneously. (a) Model with voltage-dependent inactivation of Na+ ; time constants for the first order permeability kinetics are voltage-independent (inset). 
6 Predicting the Hodgkin-Huxley model

We start with a single-compartment Goldman-Hodgkin-Katz model neuron containing voltage-gated Na+ and leak conductances (Figure 1). A tonic synaptic input to the model evokes repetitive firing of action potentials. We seek those parameters that minimize the ionic load for an action potential of constant norm, that is, spikes whose height relative to the average voltage is fairly constant, subject to a trade-off with the spike width. The ionic load is directly proportional to the work W performed by the ion flux. All parameters governing the ion channels' voltage dependence and kinetics, including their time constants, mid-points, slopes, and peak values, are subject to change. The simplest model capable of generating an action potential must have two dynamical variables and two time scales: one for the upstroke and another for the downstroke.

[Figure 2 here. Panel rows: (a) Transient Na Current Model, (b) Voltage-dependent (In)activation Model, (c) Cooperative Gating Model; columns show the optimal action potential and the falling-phase currents (IK[V], excess INa[V], peak resurgence), with ionic loads Q = 239, 169, and 156 nC/cm2, respectively.]

Figure 2: Optimal spike shapes and currents for neuron models with different biophysical features. During optimization, the spikes were constrained to have constant norm ‖V(t) − V̄‖16 = 92 mV, which controls the height of the spike. Insets in the left column display the voltage dependence of the optimized time constants for sodium inactivation and potassium activation; sodium activation is modeled as occurring instantaneously. (a) Model with voltage-dependent inactivation of Na+; time constants for the first-order permeability kinetics are voltage-independent (inset). Inactivation turns off the Na+ current on the downstroke, but not completely: as the K+ current activates to repolarize the membrane, the inward Na+ current reactivates and counteracts the K+ current; the peak of the resurgent Na+ current is marked by a triangle. (b) Model with voltage-dependent time constants for the first-order kinetics of activation and inactivation. The voltage dependence minimizes the resurgence of the Na+ current. (c) Power-law gating model with an inwardly rectifying potassium current replacing the leak current. The power-law dependence introduces an effective delay in the onset of the K+ current, which further minimizes the overlap of Na+ and K+ currents in time.

[Figure 3 here. Panels: Energy per Spike (VE, nJ/cm2) and Surface of Constant Norm Spikes, plotted against the potassium-activation parameters VK, sK, and τK.]

Figure 3: The energy required for an action potential as a function of the three parameters governing potassium activation: the midpoint voltage VK, the slope sK, and the (maximum) time constant τK. The energy is the minimum work required to restore the ionic concentration gradients, as given by Eq. (1). Note that the energy within the constrained manifold of constant-norm spikes is locally convex.

If both Na+ and K+ currents are persistent, current flows in opposite directions at the same time, so that, even at the optimum, the ionic load is 1200 nC/cm2. On the other hand, no voltage-gated K+ channels are even required for a spike, as long as Na+ channels activate on a fast time scale and inactivate on a slower time scale and the leak is powerful enough to repolarize the neuron. Even so, the load is still 520 nC/cm2. While spikes require dynamics on two time scales, suppressing the overlap between inward and outward currents calls for a third time scale. The resulting dynamics are higher-dimensional and reduce the load to 239 nC/cm2. Making the activation and inactivation time constants voltage-dependent permits ion channels to latch to an open or closed state during the rising and falling phase of the spike, reducing the ionic load to 189 nC/cm2 (Fig. 2). The minimal Na+ and K+ currents are separated in time, yet dynamics that are linear in the activation variables cannot enforce a true delay between the offset of the Na+ current and the onset of the K+ current. If current flow depends on multiple gates that need to be activated simultaneously, optimization can use the nonlinearity of multiplication to introduce a delay in the rise of the K+ current that abolishes the overlap, and the ionic load drops to 156 nC/cm2. Any number of kinetic schemes for the nonlinear permeabilities Pα can give rise to the same spike waveform V(t), including the simplest two-dimensional one. Yet only the full Hodgkin-Huxley (HH) model, with its voltage-dependent kinetics that prevent the premature resurgence of inward current and cooperative gating that delays the onset of the outward current, minimizes the energetic cost. More complex models, in which voltage-dependent ion channels make transitions between multiple closed, inactivated, and open states, instantiate the energy-conserving features of the HH system at the molecular level. Furthermore, features that are eliminated during optimization, such as a voltage-dependent inactivation of the outward potassium current, are also not part of the delayed rectifier potassium current in the Hodgkin-Huxley framework.
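The comparison above rests on two quantities computed from the current traces: the total Na+ charge moved per spike (the ionic load) and the amount of charge wasted when inward and outward currents flow at the same time. A small sketch of how one might compute both from current time series is given below; the synthetic Gaussian current pulses, their amplitudes, and the overlap measure are assumptions for illustration, not output of the optimized models.

```python
import numpy as np

# Synthetic current traces (uA/cm^2) standing in for simulated membrane
# currents over one spike; inward Na+ current is negative, outward K+
# current is positive.
t = np.linspace(0.0, 4.0, 4001)                      # ms
dt = t[1] - t[0]
i_na = -400.0 * np.exp(-((t - 1.0) / 0.25) ** 2)     # hypothetical inward pulse
i_k = 250.0 * np.exp(-((t - 1.6) / 0.45) ** 2)       # hypothetical delayed outward pulse

# Ionic load: total Na+ charge moved per spike.  uA/cm^2 * ms = nC/cm^2.
# This charge must be pumped back by the Na+/K+-ATPase, which is what
# makes it a proxy for the metabolic cost of the spike.
q_na = np.sum(np.abs(i_na)) * dt

# One simple measure of wasted charge: inward and outward currents flowing
# simultaneously cancel electrically but still have to be pumped back.
overlap = np.sum(np.minimum(np.abs(i_na), np.maximum(i_k, 0.0))) * dt

print(f"Na+ load per spike : {q_na:7.1f} nC/cm^2")
print(f"overlapping charge : {overlap:7.1f} nC/cm^2")
```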
same-paper 2 0.68606627 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
Author: Purushottam Kar, Prateek Jain
3 0.46972173 263 nips-2011-Sparse Manifold Clustering and Embedding
Author: Ehsan Elhamifar, René Vidal
Abstract: We propose an algorithm called Sparse Manifold Clustering and Embedding (SMCE) for simultaneous clustering and dimensionality reduction of data lying in multiple nonlinear manifolds. Similar to most dimensionality reduction methods, SMCE finds a small neighborhood around each data point and connects each point to its neighbors with appropriate weights. The key difference is that SMCE finds both the neighbors and the weights automatically. This is done by solving a sparse optimization problem, which encourages selecting nearby points that lie in the same manifold and approximately span a low-dimensional affine subspace. The optimal solution encodes information that can be used for clustering and dimensionality reduction using spectral clustering and embedding. Moreover, the size of the optimal neighborhood of a data point, which can be different for different points, provides an estimate of the dimension of the manifold to which the point belongs. Experiments demonstrate that our method can effectively handle multiple manifolds that are very close to each other, manifolds with non-uniform sampling and holes, as well as estimate the intrinsic dimensions of the manifolds. 1 1.1
4 0.46919206 186 nips-2011-Noise Thresholds for Spectral Clustering
Author: Sivaraman Balakrishnan, Min Xu, Akshay Krishnamurthy, Aarti Singh
Abstract: Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating our results. 1
5 0.46563753 109 nips-2011-Greedy Model Averaging
Author: Dong Dai, Tong Zhang
Abstract: This paper considers the problem of combining multiple models to achieve a prediction accuracy not much worse than that of the best single model for least squares regression. It is known that if the models are mis-specified, model averaging is superior to model selection. Specifically, let n be the sample size, then the worst case regret of the former decays at the rate of O(1/n) while the worst √ case regret of the latter decays at the rate of O(1/ n). In the literature, the most important and widely studied model averaging method that achieves the optimal O(1/n) average regret is the exponential weighted model averaging (EWMA) algorithm. However this method suffers from several limitations. The purpose of this paper is to present a new greedy model averaging procedure that improves EWMA. We prove strong theoretical guarantees for the new procedure and illustrate our theoretical results with empirical examples. 1
6 0.46550831 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
7 0.46494332 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
8 0.46375966 265 nips-2011-Sparse recovery by thresholded non-negative least squares
9 0.46349144 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
10 0.46303117 244 nips-2011-Selecting Receptive Fields in Deep Networks
11 0.46295387 157 nips-2011-Learning to Search Efficiently in High Dimensions
12 0.46208525 251 nips-2011-Shaping Level Sets with Submodular Functions
13 0.46087688 271 nips-2011-Statistical Tests for Optimization Efficiency
14 0.46049955 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
15 0.45936882 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
16 0.45929772 16 nips-2011-A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
17 0.4585821 80 nips-2011-Efficient Online Learning via Randomized Rounding
18 0.45830563 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition
19 0.4579353 98 nips-2011-From Bandits to Experts: On the Value of Side-Observations
20 0.45783994 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning