nips nips2011 nips2011-141 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bin Zhao, Fei Li, Eric P. Xing
Abstract: Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale dataset such as the ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, become available. As human cognition of complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define loss function and select common set of features for related categories. An efficient optimization method based on proximal approximation and accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. [sent-8, score-1.109]
2 With the emergence of structured large-scale dataset such as the ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, become available. [sent-9, score-0.469]
3 As human cognition of complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. [sent-10, score-0.45]
4 In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. [sent-11, score-1.128]
5 Specifically, a category hierarchy is utilized to properly define loss function and select common set of features for related categories. [sent-12, score-0.327]
6 An efficient optimization method based on proximal approximation and accelerated parallel gradient method is introduced. [sent-13, score-0.327]
7 2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach. [sent-15, score-0.369]
8 1 Introduction Image categorization / object recognition has been one of the most important research problems in the computer vision community. [sent-16, score-0.25]
9 (A human being has little problem recognizing tens of thousands of visual categories, even with very little “training” data. [sent-18, score-0.12]
10 LabelMe [31] provides 30k labeled and segmented images, covering around 200 image categories. [sent-20, score-0.213]
11 Moreover, the newly released ImageNet [12] data set goes a big step further, in that it further increases the number of classes to over 15000, and has more than 1000 images for each class on average. [sent-21, score-0.138]
12 Similarly, TinyImage [36] contains 80 million 32 × 32 low resolution images, with each image loosely labeled with one of 75,062 English nouns. [sent-22, score-0.268]
13 Clearly, these are no longer artificial visual categorization problems created for machine learning, but instead more like a human-level cognition problem for real world object recognition with a much bigger set of objects. [sent-23, score-0.359]
14 1 (a) (b) (c) Figure 1: (a) Image category hierarchy in ImageNet; (b) Overlapping group structure; (3) Semantic relatedness measure between image categories. [sent-35, score-0.775]
15 The hierarchical semantic structure stemmed from the WordNet over image categories in the ImageNet data makes it distinctive from other existing large-scale dataset, and it reassembles how human cognitive system stores visual knowledge. [sent-36, score-0.817]
16 Figure 1(a) shows an example such as a tree hierarchy, where leaf nodes are individual categories, and each internal node denotes the cluster of categories corresponding to the leaf nodes in the subtree rooted at the given node. [sent-37, score-0.567]
17 As human cognition of complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. [sent-38, score-0.45]
18 It should be noted that our proposed method is applicable to any tree structure for image category, such as the category structure learned to capture visual appearance similarities between image classes [32, 17, 13]. [sent-40, score-0.771]
19 To the best of our knowledge, our attempt in this paper represents an initial foray to systematically utilizing information residing in concept hierarchy, for multi-way classification on super large-scale image data sets. [sent-41, score-0.336]
20 More precisely, our approach utilizes the concept hierarchy in two aspects: loss function and feature selection. [sent-42, score-0.186]
21 First, the loss function used in our formulation weighs differentially for different misclassification outcomes: misclassifying an image to a category that is close to its true identity should receive less penalty than misclassifying it to a totally unrelated one. [sent-43, score-0.643]
22 Second, in an image classification problem with thousands of categories, it is not realistic to assume that all of the classes share the same set of relevant features. [sent-44, score-0.369]
23 That is to say, a subset of highly related categories may share a common set of relevant features, whereas weakly related categories are less likely to be affected by the same features. [sent-45, score-0.534]
24 Consequently, the image categorization problem is formulated as augmented logistic regression with overlapping-group-lasso regularization. [sent-46, score-0.552]
25 To solve this optimization problem, we introduce the Accelerated Parallel ProximaL gradiEnT (APPLET) method, which tackles the non-smoothness of overlapping-group-lasso penalty via proximal gradient [20, 9], and the huge number of training samples by Map-Reduce parallel computing [10]. [sent-48, score-0.414]
26 Therefore, the contributions made in this paper are: (1) We incorporate the semantic relationships between object classes, into an augmented multi-class logistic regression formulation, regularized by the overlapping-group-lasso penalty. [sent-49, score-0.54]
27 (2) We propose a proximal gradient based method for solving the resulting non-smooth optimization problem, where the super large scale of the problem is tackled by map-reduce parallel computation. [sent-51, score-0.325]
28 Section 5 demonstrates the effectiveness of the proposed algorithm using millions of training images from 1000 categories, followed by conclusions in Section 6. [sent-56, score-0.134]
29 1 Category Structure Aware Image Categorization Motivation ImageNet organizes the different classes of images in a densely populated semantic hierarchy. [sent-58, score-0.355]
30 Specifically, image categories in ImageNet are interlinked by several types of relations, with the 2 “IS-A” relation being the most comprehensive and useful [11], resulting in a tree hierarchy over image categories. [sent-59, score-0.887]
31 For example, the ’husky’ category follows a path in the tree composed of ’working dog’, ’dog’, ’canine’, etc. [sent-60, score-0.249]
32 The distance between two nodes in the tree depicts the difference between the two corresponding image categories. [sent-61, score-0.306]
33 Consequently, in the category hierarchy in ImageNet, each internal node near the bottom of the tree shows that the image categories of its subtree are highly correlated, whereas the internal node near the root represents relatively weaker correlations among the categories in its subtree. [sent-62, score-1.256]
34 The class hierarchy provides a measure of relatedness between image classes. [sent-63, score-0.599]
35 Misclassifying an image to a category that is close to its true identity should receive less penalty than misclassifying it to a totally unrelated one. [sent-64, score-0.532]
36 Instead of using 0-1 loss as in conventional image categorization, which treats image categories equally and independently, our approach utilizes a loss function that is aware of the category hierarchy. [sent-66, score-0.928]
37 Moreover, highly related image categories are more likely to share common visual patterns. [sent-67, score-0.567]
38 For example, in Figure 1(a), husky and shepherd share similar object shape and texture. [sent-68, score-0.154]
39 Consequently, recognition of these related categories are more likely to be affected by the same features. [sent-69, score-0.283]
40 Multi-class logistic regression defines a weight vector wk for each class k ∈ {1, . [sent-76, score-0.233]
41 ,k} P (y|x, W), with the conditional likelihood computed as T exp(wyi xi ) P (yi |xi , W) = ∑ (1) T k exp(wk xi ) ∗ ∗ The optimal weight vectors W∗ = [w1 , . [sent-82, score-0.118]
42 1 Augmented Soft-Max Loss Function Using the tree hierarchy on image categories, we could calculate a semantic relatedness (a. [sent-88, score-0.874]
43 similarity) matrix S ∈ RK×K over all categories, where Sij measures the semantic relatedness of class i and j. [sent-91, score-0.451]
44 Using this modified softmax loss function, the image categorization problem could be formulated as [ ( ) ( )] N ∑ ∑∑ ∑ T T min log Sk,r exp(wr xi ) − log Syi ,r exp(wr xi ) + λΩ(W) (5) W i=1 r r k 3 2. [sent-95, score-0.505]
45 2 Semantic Relatedness Matrix To compute semantic relatedness matrix S in the above formulation, we first define a metric measuring the semantic distance between image categories. [sent-97, score-0.881]
46 A simple way to compute semantic distance in a structure such as the one provided by ImageNet is to utilize the paths connecting the two corresponding nodes to the root node. [sent-98, score-0.297]
47 We construct the semantic relatedness matrix S = exp(−κ(1− D)), where κ is a constant controlling the decay factor of semantic relatedness with respect to semantic distance. [sent-100, score-1.119]
48 Figure 1(c) shows the semantic relatedness matrix computed with κ = 5. [sent-101, score-0.451]
49 3 Tree-Guided Sparse Feature Coding In ImageNet, image categories are grouped at multiple granularity as a tree hierarchy. [sent-103, score-0.522]
50 1, the image categories in each internal node are likely to be influenced by a common set of features. [sent-105, score-0.542]
51 Specifically, given the tree hierarchy T = (V, E) over image categories, each node v ∈ V of tree T is associated with group Gv , composed of all leaf nodes in the subtree rooted at v, as illustrated in Figure 1(b). [sent-107, score-0.668]
52 , J} associated with categories in Gv , and each group Gv is associated with weight γv that reflects the strength of correlation within the group. [sent-115, score-0.286]
53 Specifically, we first reformulate the overlapping-group-lasso penalty Ω(W) into a max problem over auxiliary variables using dual norm, and then introduce its smooth lower bound [20, 9]. [sent-122, score-0.216]
54 Instead of optimizing the original non-smooth penalty, we run the accelerated gradient descent method [27] under a Map-Reduce framework [10] to optimize the smooth lower bound. [sent-123, score-0.22]
55 For each input j and group gi associated with wjgi , we introduce a vector of auxiliary variables αjgi ∈ R|gi | . [sent-130, score-0.205]
56 Since the dual norm of L2 norm is also an L2 norm, we can reformulate ||wjgi ||2 as ||wjgi ||2 = ∑ max||αjgi ||2 ≤1 αT i wjgi . [sent-131, score-0.163]
57 Following [9], the overlappinggroup-lasso penalty in (8) can be equivalently reformulated as ∑∑ Ω(W) = γi max αT i wjgi = max⟨CWT , A⟩ (10) jg j i ||αjgi ||2 ≤1 A∈O ∑ where i = 1, . [sent-151, score-0.282]
58 Moreover, the matrix C is defined with rows indexed by (s, gi ) such that s ∈ gi and i ∈ {1, . [sent-158, score-0.12]
59 Specifically, our smooth approximation function is defined as: fµ (W) = max⟨CWT , A⟩ − µd(A) A∈O (11) where µ is the positive smoothness parameter and d(A) is an arbitrary smooth strongly-convex function defined on O. [sent-167, score-0.124]
60 According to Theorem 1, fµ (W) is a smooth function for any µ > 0, with a simple form of gradient and can be viewed as a smooth approximation of f0 (W) with the maximum gap of µD. [sent-172, score-0.201]
61 2 Accelerated Parallel Gradient Method Given the smooth approximation of Ω(W) in (11) and the corresponding gradient presented in Theorem 1, we could apply gradient descent method to solve the problem. [sent-174, score-0.216]
62 wk could be calculated as follows [ ∑ ] N T T Syi ,k exp(wk xi ) ∂g(W) ∑ q Sk,q exp(wk xi ) = xi ∑ ∑ −∑ (14) T T ∂wk r q Sr,q exp(wr xi ) r Syi ,r exp(wr xi ) i=1 5 Therefore, the gradient of g(W) w. [sent-179, score-0.46]
63 ∂w1 ∂wK ˜ According to Theorem 1, the gradient of f (W) is given by ˜ ∇f (W) = ∇g(W) + λA∗T C (15) ˜ Although f (W) is a smooth function of W, it is represented as a summation over all training sam˜ ples. [sent-186, score-0.17]
64 Due to the huge number of samples in the training set, we adopt a Map-Reduce parallel framework [10] to compute ∇g(W) as shown in Eq. [sent-188, score-0.136]
65 In this paper, we adopt this accelerated gradient method , and the whole algorithm is shown in Algorithm 1. [sent-192, score-0.158]
66 Recent approaches transfer information across classes by regularizing the parameters of the classifiers across classes [37, 28, 15, 33, 34, 2, 26, 30]. [sent-201, score-0.182]
67 It is unclear how these approaches would perform on super large-scale data sets containing thousands of image categories. [sent-203, score-0.341]
68 Some of these approaches would encounter severe computational bottleneck when scaling up to thousands of classes [16]. [sent-204, score-0.124]
69 However, none of these approaches utilize the semantic relationships defined among image categories in ImageNet, which we argue is a crucial source of information for further improvement in such super large scale classification problem. [sent-207, score-0.851]
70 The number of images for each category ranges from 668 to 3047. [sent-210, score-0.204]
71 Our work, on the other hand, looks at this problem from a different angle, focusing on principled 6 methodology that explores the benefit of utilizing class structure in image categorization and proposing a model and related optimization technique to properly incorporate such information. [sent-214, score-0.353]
72 1 Image Features Each image is resized to have a max side length of 300 pixels. [sent-218, score-0.213]
73 We then perform k-means clustering on a random subset of 10 million SIFT descriptors to form a visual vocabulary of 1000 visual words. [sent-221, score-0.23]
74 Finally, a single feature vector is computed for each image using max pooling on a spatial pyramid [22]. [sent-223, score-0.256]
75 Consequently, each image is represented as a vector in a 21,000 dimensional space. [sent-225, score-0.213]
76 Specifically, for every image, each tested algorithm will produce a list of 5 object categories in the descending order of confidence. [sent-228, score-0.329]
77 012 Figure 2: Left: image classes with highest accuracy. [sent-277, score-0.288]
78 We use the conventional L2 regularized logistic regression [5] as baseline. [sent-282, score-0.145]
79 According to the classification results, we could clearly see the advantage of APPLET over conventional logistic regression, especially on the top-5 error rate. [sent-285, score-0.125]
80 It should be noted 7 that hierarchical error is measured by the height of the lowest common ancestor in the hierarchy, and moving up a level can more than double the number of descendants. [sent-289, score-0.141]
81 Moreover, Figure 2 shows the image categories with highest and lowest classification accuracy. [sent-294, score-0.507]
82 One key reason for introducing the augmented loss function is to ensure that predicted image class falls not too far from its true class on the semantic hierarchy. [sent-295, score-0.518]
83 As shown in Table 1, a systematic reduction in classification error using APPLET shows that acknowledging semantic relationships between image classes enables the system to discriminate at more informative semantic levels. [sent-300, score-0.801]
84 4 We present in Figure 3 how categorization performance scales with λ and κ. [sent-303, score-0.14]
85 According to Figure 3, APPLET achieves lowest categorization error around λ = 0. [sent-304, score-0.216]
86 When κ is too small, a large number of categories are mixed together, resulting in a much higher flat loss. [sent-330, score-0.251]
87 On the other hand, when κ ≥ 50, the semantic relatedness matrix is close to diagonal, resulting in treating all categories independently, and categorization performance becomes similar as LR. [sent-331, score-0.842]
88 6 Conclusions In this paper, we argue the positive effect of incorporating category hierarchy information in super large scale image categorization. [sent-332, score-0.585]
89 2 million training images from 1000 categories demonstrates the effectiveness and promise of our proposed approach. [sent-335, score-0.4]
90 Smoothing proximal gradient method for general structured sparse learning. [sent-396, score-0.177]
91 What does classifying more than 10,000 image categories tell us? [sent-414, score-0.495]
92 Fast and balanced: Efficient label tree learning for large scale object recognition. [sent-431, score-0.136]
93 Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. [sent-437, score-0.18]
94 Discriminative learning of relaxed hierarchy for large-scale visual recognition. [sent-455, score-0.223]
95 Large-scale image classification: fast feature extraction and svm training. [sent-505, score-0.213]
96 Incremental learning of object detectors using a visual shape alphabet. [sent-528, score-0.149]
97 Transfer learning for image classification with sparse prototype representations. [sent-540, score-0.213]
98 Labelme: A database and web-based tool for image annotation. [sent-547, score-0.213]
99 Learning to share visual appearance for multiclass object detection. [sent-552, score-0.181]
100 80 million tiny images: A large data set for nonparametric object and scene recognition. [sent-578, score-0.133]
wordName wordTfidf (topN-words)
[('applet', 0.419), ('categories', 0.251), ('relatedness', 0.234), ('semantic', 0.217), ('wr', 0.214), ('image', 0.213), ('imagenet', 0.21), ('syi', 0.176), ('gv', 0.154), ('hierarchy', 0.152), ('category', 0.141), ('categorization', 0.14), ('top', 0.115), ('alr', 0.11), ('jgi', 0.11), ('wjgi', 0.11), ('penalty', 0.101), ('proximal', 0.1), ('logistic', 0.092), ('wk', 0.088), ('cvpr', 0.082), ('lr', 0.081), ('accelerated', 0.081), ('super', 0.079), ('object', 0.078), ('misclassifying', 0.077), ('gradient', 0.077), ('classes', 0.075), ('jg', 0.071), ('visual', 0.071), ('parallel', 0.069), ('classi', 0.068), ('grouplr', 0.066), ('odometer', 0.066), ('hierarchical', 0.065), ('images', 0.063), ('smooth', 0.062), ('gi', 0.06), ('xi', 0.059), ('tree', 0.058), ('bt', 0.058), ('exp', 0.056), ('million', 0.055), ('augmented', 0.054), ('deng', 0.053), ('reformulate', 0.053), ('regression', 0.053), ('torralba', 0.052), ('flat', 0.05), ('path', 0.05), ('thousands', 0.049), ('laptop', 0.048), ('relationships', 0.046), ('fergus', 0.045), ('utilize', 0.045), ('node', 0.045), ('consequently', 0.044), ('cwt', 0.044), ('earthworm', 0.044), ('husky', 0.044), ('kappa', 0.044), ('residing', 0.044), ('setter', 0.044), ('singles', 0.044), ('socp', 0.044), ('volcano', 0.044), ('wjgv', 0.044), ('woodfrog', 0.044), ('lowest', 0.043), ('pyramid', 0.043), ('aware', 0.042), ('millions', 0.04), ('bullfrog', 0.039), ('cognition', 0.038), ('wt', 0.038), ('leaf', 0.038), ('berg', 0.036), ('huge', 0.036), ('sheer', 0.036), ('snake', 0.036), ('group', 0.035), ('nodes', 0.035), ('loss', 0.034), ('subtree', 0.034), ('cally', 0.034), ('lambda', 0.033), ('snps', 0.033), ('eccv', 0.033), ('cation', 0.033), ('error', 0.033), ('descriptors', 0.033), ('internal', 0.033), ('transfer', 0.032), ('recognition', 0.032), ('shared', 0.032), ('share', 0.032), ('moreover', 0.031), ('kim', 0.031), ('classifying', 0.031), ('training', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000011 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
Author: Bin Zhao, Fei Li, Eric P. Xing
Abstract: Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale dataset such as the ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, become available. As human cognition of complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define loss function and select common set of features for related categories. An efficient optimization method based on proximal approximation and accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach. 1
2 0.24221818 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen
Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1
3 0.19710582 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon
Abstract: We introduce P I C O D ES: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. P I C O D ES of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.
4 0.18592475 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
Author: Kristen Grauman, Fei Sha, Sung J. Hwang
Abstract: We introduce an approach to learn discriminative visual representations while exploiting external semantic knowledge about object category relationships. Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM). In this tree, we have one metric for each non-leaf node of the object hierarchy, and each metric is responsible for discriminating among its immediate subcategory children. Specifically, a Mahalanobis metric learned for a given node must satisfy the appropriate (dis)similarity constraints generated only among its subtree members’ training instances. To further exploit the semantics, we introduce a novel regularizer coupling the metrics that prefers a sparse disjoint set of features to be selected for each metric relative to its ancestor (supercategory) nodes’ metrics. Intuitively, this reflects that visual cues most useful to distinguish the generic classes (e.g., feline vs. canine) should be different than those cues most useful to distinguish their component fine-grained classes (e.g., Persian cat vs. Siamese cat). We validate our approach with multiple image datasets using the WordNet taxonomy, show its advantages over alternative metric learning approaches, and analyze the meaning of attribute features selected by our algorithm. 1
5 0.16806351 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li
Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1
6 0.15141681 168 nips-2011-Maximum Margin Multi-Instance Learning
7 0.14258587 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
8 0.12614292 159 nips-2011-Learning with the weighted trace-norm under arbitrary sampling distributions
9 0.12534322 165 nips-2011-Matrix Completion for Multi-label Image Classification
10 0.10803055 78 nips-2011-Efficient Methods for Overlapping Group Lasso
11 0.10758066 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
12 0.10148174 154 nips-2011-Learning person-object interactions for action recognition in still images
13 0.10107855 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
14 0.097802997 156 nips-2011-Learning to Learn with Compound HD Models
15 0.096096672 227 nips-2011-Pylon Model for Semantic Segmentation
16 0.089570999 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
17 0.087675095 114 nips-2011-Hierarchical Multitask Structured Output Learning for Large-scale Sequence Segmentation
18 0.08753784 63 nips-2011-Convergence Rates of Inexact Proximal-Gradient Methods for Convex Optimization
19 0.086714223 258 nips-2011-Sparse Bayesian Multi-Task Learning
20 0.084572159 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
topicId topicWeight
[(0, 0.257), (1, 0.14), (2, -0.187), (3, 0.141), (4, 0.056), (5, 0.064), (6, 0.05), (7, 0.018), (8, -0.035), (9, 0.023), (10, 0.042), (11, 0.031), (12, -0.111), (13, 0.103), (14, -0.057), (15, 0.069), (16, -0.097), (17, 0.006), (18, 0.078), (19, -0.037), (20, 0.017), (21, 0.025), (22, 0.008), (23, -0.054), (24, -0.032), (25, 0.075), (26, 0.085), (27, -0.092), (28, 0.063), (29, 0.02), (30, -0.001), (31, 0.12), (32, -0.083), (33, 0.012), (34, 0.005), (35, 0.011), (36, 0.097), (37, -0.016), (38, 0.023), (39, -0.046), (40, 0.046), (41, 0.101), (42, -0.134), (43, -0.065), (44, 0.052), (45, -0.064), (46, 0.01), (47, -0.013), (48, 0.051), (49, 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.95890099 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
Author: Bin Zhao, Fei Li, Eric P. Xing
Abstract: Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale dataset such as the ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, become available. As human cognition of complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define loss function and select common set of features for related categories. An efficient optimization method based on proximal approximation and accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach. 1
2 0.7797482 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon
Abstract: We introduce P I C O D ES: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. P I C O D ES of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.
3 0.73554718 293 nips-2011-Understanding the Intrinsic Memorability of Images
Author: Phillip Isola, Devi Parikh, Antonio Torralba, Aude Oliva
Abstract: Artists, advertisers, and photographers are routinely presented with the task of creating an image that a viewer will remember. While it may seem like image memorability is purely subjective, recent work shows that it is not an inexplicable phenomenon: variation in memorability of images is consistent across subjects, suggesting that some images are intrinsically more memorable than others, independent of a subjects’ contexts and biases. In this paper, we used the publicly available memorability dataset of Isola et al. [13], and augmented the object and scene annotations with interpretable spatial, content, and aesthetic image properties. We used a feature-selection scheme with desirable explaining-away properties to determine a compact set of attributes that characterizes the memorability of any individual image. We find that images of enclosed spaces containing people with visible faces are memorable, while images of vistas and peaceful scenes are not. Contrary to popular belief, unusual or aesthetically pleasing scenes do not tend to be highly memorable. This work represents one of the first attempts at understanding intrinsic image memorability, and opens a new domain of investigation at the interface between human cognition and computer vision. 1
4 0.7139895 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
Author: Vicente Ordonez, Girish Kulkarni, Tamara L. Berg
Abstract: We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state of the art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning. 1
5 0.70710069 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
Author: Kristen Grauman, Fei Sha, Sung J. Hwang
Abstract: We introduce an approach to learn discriminative visual representations while exploiting external semantic knowledge about object category relationships. Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM). In this tree, we have one metric for each non-leaf node of the object hierarchy, and each metric is responsible for discriminating among its immediate subcategory children. Specifically, a Mahalanobis metric learned for a given node must satisfy the appropriate (dis)similarity constraints generated only among its subtree members’ training instances. To further exploit the semantics, we introduce a novel regularizer coupling the metrics that prefers a sparse disjoint set of features to be selected for each metric relative to its ancestor (supercategory) nodes’ metrics. Intuitively, this reflects that visual cues most useful to distinguish the generic classes (e.g., feline vs. canine) should be different than those cues most useful to distinguish their component fine-grained classes (e.g., Persian cat vs. Siamese cat). We validate our approach with multiple image datasets using the WordNet taxonomy, show its advantages over alternative metric learning approaches, and analyze the meaning of attribute features selected by our algorithm. 1
6 0.70512521 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
7 0.69802439 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
8 0.66704988 168 nips-2011-Maximum Margin Multi-Instance Learning
9 0.65911907 165 nips-2011-Matrix Completion for Multi-label Image Classification
10 0.64661014 235 nips-2011-Recovering Intrinsic Images with a Global Sparsity Prior on Reflectance
11 0.62599695 290 nips-2011-Transfer Learning by Borrowing Examples for Multiclass Object Detection
12 0.62204868 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
13 0.59208554 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
14 0.58235484 166 nips-2011-Maximal Cliques that Satisfy Hard Constraints with Application to Deformable Object Model Learning
15 0.57891375 154 nips-2011-Learning person-object interactions for action recognition in still images
16 0.57488751 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
17 0.57319838 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
18 0.56705087 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing
19 0.55350536 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
20 0.54709381 19 nips-2011-Active Classification based on Value of Classifier
topicId topicWeight
[(0, 0.024), (4, 0.046), (20, 0.042), (26, 0.013), (31, 0.04), (33, 0.418), (43, 0.051), (45, 0.114), (57, 0.033), (65, 0.022), (74, 0.062), (83, 0.024), (99, 0.034)]
simIndex simValue paperId paperTitle
1 0.96029472 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
Author: Bogdan Alexe, Viviana Petrescu, Vittorio Ferrari
Abstract: We present a computationally efficient technique to compute the distance of highdimensional appearance descriptor vectors between image windows. The method exploits the relation between appearance distance and spatial overlap. We derive an upper bound on appearance distance given the spatial overlap of two windows in an image, and use it to bound the distances of many pairs between two images. We propose algorithms that build on these basic operations to efficiently solve tasks relevant to many computer vision applications, such as finding all pairs of windows between two images with distance smaller than a threshold, or finding the single pair with the smallest distance. In experiments on the PASCAL VOC 07 dataset, our algorithms accurately solve these problems while greatly reducing the number of appearance distances computed, and achieve larger speedups than approximate nearest neighbour algorithms based on trees [18] and on hashing [21]. For example, our algorithm finds the most similar pair of windows between two images while computing only 1% of all distances on average. 1
2 0.93185878 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes
Author: Yee W. Teh, Charles Blundell, Lloyd Elliott
Abstract: We propose a novel class of Bayesian nonparametric models for sequential data called fragmentation-coagulation processes (FCPs). FCPs model a set of sequences using a partition-valued Markov process which evolves by splitting and merging clusters. An FCP is exchangeable, projective, stationary and reversible, and its equilibrium distributions are given by the Chinese restaurant process. As opposed to hidden Markov models, FCPs allow for flexible modelling of the number of clusters, and they avoid label switching non-identifiability problems. We develop an efficient Gibbs sampler for FCPs which uses uniformization and the forward-backward algorithm. Our development of FCPs is motivated by applications in population genetics, and we demonstrate the utility of FCPs on problems of genotype imputation with phased and unphased SNP data. 1
same-paper 3 0.8606649 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
Author: Bin Zhao, Fei Li, Eric P. Xing
Abstract: Most previous research on image categorization has focused on medium-scale data sets, while large-scale image categorization with millions of images from thousands of categories remains a challenge. With the emergence of structured large-scale dataset such as the ImageNet, rich information about the conceptual relationships between images, such as a tree hierarchy among various image categories, become available. As human cognition of complex visual world benefits from underlying semantic relationships between object classes, we believe a machine learning system can and should leverage such information as well for better performance. In this paper, we employ such semantic relatedness among image categories for large-scale image categorization. Specifically, a category hierarchy is utilized to properly define loss function and select common set of features for related categories. An efficient optimization method based on proximal approximation and accelerated parallel gradient method is introduced. Experimental results on a subset of ImageNet containing 1.2 million images from 1000 categories demonstrate the effectiveness and promise of our proposed approach. 1
4 0.76755512 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models
Author: Skander Mensi, Richard Naud, Wulfram Gerstner
Abstract: Variability in single neuron models is typically implemented either by a stochastic Leaky-Integrate-and-Fire model or by a model of the Generalized Linear Model (GLM) family. We use analytical and numerical methods to relate state-of-theart models from both schools of thought. First we find the analytical expressions relating the subthreshold voltage from the Adaptive Exponential Integrate-andFire model (AdEx) to the Spike-Response Model with escape noise (SRM as an example of a GLM). Then we calculate numerically the link-function that provides the firing probability given a deterministic membrane potential. We find a mathematical expression for this link-function and test the ability of the GLM to predict the firing probability of a neuron receiving complex stimulation. Comparing the prediction performance of various link-functions, we find that a GLM with an exponential link-function provides an excellent approximation to the Adaptive Exponential Integrate-and-Fire with colored-noise input. These results help to understand the relationship between the different approaches to stochastic neuron models. 1 Motivation When it comes to modeling the intrinsic variability in simple neuron models, we can distinguish two traditional approaches. One approach is inspired by the stochastic Leaky Integrate-and-Fire (LIF) hypothesis of Stein (1967) [1], where a noise term is added to the system of differential equations implementing the leaky integration to a threshold. There are multiple versions of such a stochastic LIF [2]. How the noise affects the firing probability is also a function of the parameters of the neuron model. Therefore, it is important to take into account the refinements of simple neuron models in terms of subthreshold resonance [3, 4], spike-triggered adaptation [5, 6] and non-linear spike 1 initiation [7, 5]. All these improvements are encompassed by the Adaptive Exponential Integrateand-Fire model (AdEx [8, 9]). The other approach is to start with some deterministic dynamics for the the state of the neuron (for instance the instantaneous distance from the membrane potential to the threshold) and link the probability intensity of emitting a spike with a non-linear function of the state variable. Under some conditions, this type of model is part of a greater class of statistical models called Generalized Linear Models (GLM [10]). As a single neuron model, the Spike Response Model (SRM) with escape noise is a GLM in which the state variable is explicitly the distance between a deterministic voltage and the threshold. The original SRM could account for subthreshold resonance, refractory effects and spike-frequency adaptation [11]. Mathematically similar models were developed independently in the study of the visual system [12] where spike-frequency adaptation has also been modeled [13]. Recently, this approach has retained increased attention since the probabilistic framework can be linked with the Bayesian theory of neural systems [14] and because Bayesian inference can be applied to the population of neurons [15]. In this paper, we investigate the similarity and differences between the state-of-the-art GLM and the stochastic AdEx. The motivation behind this work is to relate the traditional threshold neuron models to Bayesian theory. Our results extend the work of Plesser and Gerstner (2000) [16] since we include the non-linearity for spike initiation and spike-frequency adaptation. We also provide relationships between the parameters of the AdEx and the equivalent GLM. These precise relationships can be used to relate analog implementations of threshold models [17] to the probabilistic models used in the Bayesian approach. The paper is organized as follows: We first describe the expressions relating the SRM state-variable to the parameters of the AdEx (Sect. 3.1) in the subthreshold regime. Then, we use numerical methods to find the non-linear link-function that models the firing probability (Sect. 3.2). We find a functional form for the SRM link-function that best describes the firing probability of a stochastic AdEx. We then compare the performance of this link-function with the often used exponential or linear-rectifier link-functions (also called half-wave linear rectifier) in terms of predicting the firing probability of an AdEx under complex stimulus (Sect. 3.3). We find that the exponential linkfunction yields almost perfect prediction. Finally, we explore the relations between the statistic of the noise and the sharpness of the non-linearity for spike initiation with the parameters of the SRM. 2 Presentation of the Models In this section we present the general formula for the stochastic AdEx model (Sect. 2.1) and the SRM (Sect 2.2). 2.1 The Stochastic Adaptive Exponential Integrate-and-Fire Model The voltage dynamics of the stochastic AdEx is given by: V −Θ ˙ τm V = El − V + ∆T exp − Rw + RI + R (1) ∆T τw w = a(V − El ) − w ˙ (2) where τm is the membrane time constant, El the reverse potential, R the membrane resistance, Θ is the threshold, ∆T is the shape factor and I(t) the input current which is chosen to be an Ornstein−Θ Uhlenbeck process with correlation time-constant of 5 ms. The exponential term ∆T exp( V∆T ) is a non-linear function responsible for the emission of spikes and is a diffusive white noise with standard deviation σ (i.e. ∼ N (0, σ)). Note that the diffusive white-noise does not imply white noise fluctuations of the voltage V (t), the probability distribution of V (t) will depend on ∆T and Θ. The second variable, w, describes the subthreshold as well as the spike-triggered adaptation both ˆ parametrized by the coupling strength a and the time constant τw . Each time tj the voltage goes to infinity, we assumed that a spike is emitted. Then the voltage is reset to a fixed value Vr and w is increased by a constant value b. 2.2 The Generalized Linear Model In the SRM, The voltage V (t) is given by the convolution of the injected current I(t) with the membrane filter κ(t) plus the additional kernel η(t) that acts after each spikes (here we split the 2 spike-triggered kernel in two η(t) = ηv (t) + ηw (t) for reasons that will become clear later): V (t) = ˆ ˆ ηv (t − tj ) + ηw (t − tj ) El + [κ ∗ I](t) + (3) ˆ {tj } ˆ Then at each time tj a spike is emitted which results in a change of voltage described by η(t) = ηv (t) + ηw (t). Given the deterministic voltage, (Eq. 3) a spike is emitted according to the firing intensity λ(V ): λ(t) = f (V (t)) (4) where f (·) is an arbitrary function called the link-function. Then the firing behavior of the SRM depends on the choice of the link-function and its parameters. The most common link-function used to model single neuron activities are the linear-rectifier and the exponential function. 3 Mapping In order to map the stochastic AdEx to the SRM we follow a two-step procedure. First we derive the filter κ(t) and the kernels ηv (t) and ηw (t) analytically as a function of AdEx parameters. Second, we derive the link-function of the SRM from the stochastic spike emission of the AdEx. Figure 1: Mapping of the subthreshold dynamics of an AdEx to an equivalent SRM. A. Membrane filter κ(t) for three different sets of parameters of the AdEx leading to over-damped, critically damped and under-damped cases (upper, middle and lower panel, respectively). B. Spike-Triggered η(t) (black), ηv (t) (light gray) and ηw (gray) for the three cases. C. Example of voltage trace produced when an AdEx is stimulated with a step of colored noise (black). The corresponding voltage from a SRM stimulated with the same current and where we forced the spikes to match those of the AdEx (red). D. Error in the subthreshold voltage (VAdEx − VGLM ) as a function of the mean voltage of the AdEx, for the three different cases: over-, critically and under-damped (light gray, gray and black, respectively) with ∆T = 1 mV. Red line represents the voltage threshold Θ. E. Root Mean Square Error (RMSE) ratio for the three cases with ∆T = 1 mV. The RMSE ratio is the RMSE between the deterministic VSRM and the stochastic VAdEx divided by the RMSE between repetitions of the stochastic AdEx voltage. The error bar shows a single standard deviation as the RMSE ratio is averaged accross multiple value of σ. 3.1 Subthreshold voltage dynamics We start by assuming that the non-linearity for spike initiation does not affect the mean subthreshold voltage of the stochastic AdEx (see Figure 1 D). This assumption is motivated by the small ∆T 3 observed in in-vitro recordings (from 0.5 to 2 mV [8, 9]) which suggest that the subthreshold dynamics are mainly linear except very close to Θ. Also, we expect that the non-linear link-function will capture some of the dynamics due to the non-linearity for spike initiation. Thus it is possible to rewrite the deterministic subthreshold part of the AdEx (Eq. 1-2 without and without ∆T exp((V − Θ)/∆T )) using matrices: ˙ x = Ax (5) with x = V w and A = − τ1 m a τw − gl1m τ − τ1 w (6) In this form, the dynamics of the deterministic AdEx voltage is a damped oscillator with a driving force. Depending on the eigenvalues of A the system could be over-damped, critically damped or under-damped. The filter κ(t) of the GLM is given by the impulse response of the system of coupled differential equations of the AdEx, described by Eq. 5 and 6. In other words, one has to derive the response of the system when stimulating with a Dirac-delta function. The type of damping gives three different qualitative shapes of the kernel κ(t), which are summarized in Table 3.1 and Figure 1 A. Since the three different filters also affect the nature of the stochastic voltage fluctuations, we will keep the distinction between over-damped, critically damped and under-damped scenarios throughout the paper. This means that our approach is valid for at least 3 types of diffusive voltage-noise (i.e. the white noise in Eq. 1 filtered by 3 different membrane filters κ(t)). To complete the description of the deterministic voltage, we need an expression for the spiketriggered kernels. The voltage reset at each spike brings a spike-triggered jump in voltage of magˆ nitude ∆ = Vr − V (t). This perturbation is superposed to the current fluctuations due to I(t) and can be mediated by a Delta-diract pulse of current. Thus we can write the voltage reset kernel by: ηv (t) = ∆ ∆ [δ ∗ κ] (t) = κ(t) κ(0) κ(0) (7) where δ(t) is the Dirac-delta function. The shape of this kernel depends on κ(t) and can be computed from Table 3.1 (see Figure 1 B). Finally, the AdEx mediates spike-frequency adaptation by the jump of the second variables w. From Eq. 2 we can see that this produces a current wspike (t) = b exp (−t/τw ) that can cumulate over subsequent spikes. The effect of this current on voltage is then given by the convolution of wspike (t) with the membrane filter κ(t). Thus in the SRM framework the spike-frequency adaptation is taken into account by: ηw (t) = [wspike ∗ κ](t) (8) Again the precise form of ηw (t) depends on κ(t) and can be computed from Table 3.1 (see Figure 1 B). At this point, we would like to verify our assumption that the non-linearity for spike emission can be neglected. Fig. 1 C and D shows that the error between the voltage from Eq. 3 and the voltage from the stochastic AdEx is generally small. Moreover, we see that the main contribution to the voltage prediction error is due to the mismatch close to the spikes. However the non-linearity for spike initiation may change the probability distribution of the voltage fluctuations, which in turn influences the probability of spiking. This will influence the choice of the link-function, as we will see in the next section. 3.2 Spike Generation Using κ(t), ηv (t) and ηw (t), we must relate the spiking probability of the stochastic AdEx as a function of its deterministic voltage. According to [2] the probability of spiking in time bin dt given the deterministic voltage V (t) is given by: p(V ) = prob{spike in [t, t + dt]} = 1 − exp (−f (V (t))dt) (9) where f (·) gives the firing intensity as a function of the deterministic V (t) (Eq. 3). Thus to extract the link-function f we have to compute the probability of spiking given V (t) for our SRM. To do so we apply the method proposed by Jolivet et al. (2004) [18], where the probability of spiking is simply given by the distribution of the deterministic voltage estimated at the spike times divided by the distribution of the SRM voltage when there is no spike (see figure 2 A). One can numerically compute these two quantities for our models using N repetitions of the same stimulus. 4 Table 1: Analytical expressions for the membrane filter κ(t) in terms of the parameters of the AdEx for over-, critically-, and under-damped cases. Membrane Filter: κ(t) over-damped if: (τm + τw )2 > 4τm τw (gl +a) gl κ(t) = k1 eλ1 t + k2 eλ2 t λ1 = 1 2τm τw (−(τm + τw ) + critically-damped if: (τm + τw )2 = 4τm τw (gl +a) gl κ(t) = (αt + β)eλt λ= under-damped if: (τm + τw )2 < 4τm τw (gl +a) gl κ(t) = (k1 cos (ωt) + k2 sin (ωt)) eλt −(τm +τw ) 2τm τw λ= −(τm +τw ) 2τm τw (τm + τw )2 − 4 τm τw (gl + a) gl λ2 = 1 2τm τw (−(τm + τw ) − α= τm −τw 2Cτm τw ω= τw −τm 2τm τw 2 − a g l τm τw (τm + τw )2 − 4 τm τw (gl + a) gl k1 = −(1+(τm λ2 )) Cτm (λ1 −λ2 ) k2 = 1+(τm λ1 ) Cτm (λ1 −λ2 ) β= 1 C k1 = k2 = 1 C −(1+τm λ) Cωτm The standard deviation σ of the noise and the parameter ∆T of the AdEx non-linearity may affect the shape of the link-function. We thus extract p(V ) for different σ and ∆T (Fig. 2 B). Then using visual heuristics and previous knowledge about the potential analytical expression of the link-funtion, we try to find a simple analytical function that captures p(V ) for a large range of combinations of σ and ∆T . We observed that the log(− log(p)) is close to linear in most studied conditions Fig. 2 B suggesting the following two distributions of p(V ): V − VT (10) p(V ) = 1 − exp − exp ∆V V − VT p(V ) = exp − exp − (11) ∆V Once we have p(V ), we can use Eq. 4 to obtain the equivalent SRM link-function, which leads to: −1 f (V ) = log (1 − p(V )) (12) dt Then the two potential link-functions of the SRM can be derived from Eq. 10 and Eq. 11 (respectively): V − VT f (V ) = λ0 exp (13) ∆V V − VT (14) f (V ) = −λ0 log 1 − exp − exp − ∆V 1 with λ0 = dt , VT the threshold of the SRM and ∆V the sharpness of the link-function (i.e. the parameters that governs the degree of the stochasticity). Note that the exact value of λ0 has no importance since it is redundant with VT . Eq. 13 is the standard exponential link-function, but we call Eq. 14 the log-exp-exp link-function. 3.3 Prediction The next point is to evaluate the fit quality of each link-function. To do this, we first estimate the parameters VT and ∆V of the GLM link-function that maximize the likelihood of observing a spike 5 Figure 2: SRM link-function. A. Histogram of the SRM voltage at the AdEx firing times (red) and at non-firing times (gray). The ratio of the two distributions gives p(V ) (Eq. 9, dashed lines). Inset, zoom to see the voltage histogram evaluated at the firing time (red). B. log(− log(p)) as a function of the SRM voltage for three different noise levels σ = 0.07, 0.14, 0.18 nA (pale gray, gray, black dots, respectively) and ∆T = 1 mV. The line is a linear fit corresponding to the log-exp-exp linkfunction and the dashed line corresponds to a fit with the exponential link-function. C. Same data and labeling scheme as B, but plotting f (V ) according to Eq. 12. The lines are produced with Eq. 14 with parameters fitted as described in B. and the dashed lines are produced with Eq. 13. Inset, same plot but on a semi-log(y) axis. train generated with an AdEx. Second we look at the predictive power of the resulting SRM in terms of Peri-Stimulus Time Histogram (PSTH). In other words we ask how close the spike trains generated with a GLM are from the spike train generated with a stochastic AdEx when both models are stimulated with the same input current. For any GLM with link-function f (V ) ≡ f (t|I, θ) and parameters θ regulating the shape of κ(t), ˆ ηv (t) and ηw (t), the Negative Log-Likelihood (NLL) of observing a spike-train {t} is given by: NLL = − log(f (t|I, θ)) − f (t|I, θ) (15) t ˆ t It has been shown that the negative log-likelihood is convex in the parameters if f is convex and logconcave [19]. It is easy to show that a linear-rectifier link-function, the exponential link-function and the log-exp-exp link-function all satisfy these conditions. This allows efficient estimation of ˆ ˆ the optimal parameters VT and ∆V using a simple gradient descent. One can thus estimate from a single AdEx spike train the optimal parameters of a given link-function, which is more efficient than the method used in Sect. 3.2. The minimal NLL resulting from the gradient descent gives an estimation of the fit quality. A better estimate of the fit quality is given by the distance between the PSTHs in response to stimuli not used for parameter fitting . Let ν1 (t) be the PSTH of the AdEx, and ν2 (t) be the PSTH of the fitted SRM, 6 Figure 3: PSTH prediction. A. Injected current. B. Voltage traces produced by an AdEx (black) and the equivalent SRM (red), when stimulated with the current in A. C. Raster plot for 20 realizations of AdEx (black tick marks) and equivalent SRM (red tick marks). D. PSTH of the AdEx (black) and the SRM (red) obtained by averaging 10,000 repetitions. E. Optimal log-likelihood for the three cases of the AdEx, using three different link-functions, a linear-rectifier (light gray), an exponential link-function (gray) and the link-function defined by Eq. 14 (dark gray), these values are obtained by averaging over 40 different combinations σ and ∆T (see Fig. 4). Error bars are one standard deviation, the stars denote a significant difference, two-sample t-test with α = 0.01. F. same as E. but for Md (Eq. 16). then we use Md ∈ [0, 1] as a measure of match: Md = 2 2 (ν1 (t) − ν2 (t)) dt ν1 (t)2 dt + ν2 (t)2 dt (16) Md = 1 means that it is impossible to differentiate the SRM from the AdEx in terms of their PSTHs, whereas a Md of 0 means that the two PSTHs are completely different. Thus Md is a normalized similarity measure between two PSTHs. In practice, Md is estimated from the smoothed (boxcar average of 1 ms half-width) averaged spike train of 1 000 repetitions for each models. We use both the NLL and Md to quantify the fit quality for each of the three damping cases and each of the three link-functions. Figure 3 shows the match between the stochastic AdEx used as a reference and the derived GLM when both are stimulated with the same input current (Fig. 3 A). The resulting voltage traces are almost identical (Fig. 3 B) and both models predict almost the same spike trains and so the same PSTHs (Fig. 3 C and D). More quantitalively, we see on Fig. 3 E and F, that the linear-rectifier fits significantly worse than both the exponential and log-exp-exp link-functions, both in terms of NLL and of Md . The exponential link-function performs as well as the log-exp-exp link-function, with a spike train similarity measure Md being almost 1 for both. Finally the likelihood-based method described above gives us the opportunity to look at the relationship between the AdEx parameters σ and ∆T that governs its spike emission and the parameters VT and ∆V of the link-function (Fig. 4). We observe that an increase of the noise level produces a flatter link-function (greater ∆V ) while an increase in ∆T also produces an increase in ∆V and VT (note that Fig. 4 shows ∆V and VT for the exponential link-function only, but equivalent results are obtained with the log-exp-exp link-function). 4 Discussion In Sect. 3.3 we have shown that it is possible to predict with almost perfect accuracy the PSTH of a stochastic AdEx model using an appropriate set of parameters in the SRM. Moreover, since 7 Figure 4: Influence of the AdEx parameters on the parameters of the exponential link-function. A. VT as a function of ∆T and σ. B. ∆V as a function of ∆T and σ. the subthreshold voltage of the AdEx also gives a good match with the deterministic voltage of the SRM, we expect that the AdEx and the SRM will not differ in higher moments of the spike train probability distributions beyond the PSTH. We therefore conclude that diffusive noise models of the type of Eq. 1-2 are equivalent to GLM of the type of Eq. 3-4. Once combined with similar results on other types of stochastic LIF (e.g. correlated noise), we could bridge the gap between the literature on GLM and the literature on diffusive noise models. Another noteworthy observation pertains to the nature of the link-function. The link-function has been hypothesized to be a linear-rectifier, an exponential, a sigmoidal or a Gaussian [16]. We have observed that for the AdEx the link-function follows Eq. 14 that we called the log-exp-exp linkfunction. Although the link-function is log-exp-exp for most of the AdEx parameters, the exponential link-function gives an equivalently good prediction of the PSTH. This can be explained by the fact that the difference between log-exp-exp and exponential link-functions happens mainly at low voltage (i.e. far from the threshold), where the probability of emitting a spike is so low (Figure 2 C, until -50 mv). Therefore, even if the exponential link-function overestimates the firing probability at these low voltages it rarely produces extra spikes. At voltages closer to the threshold, where most of the spikes are emitted, the two link-functions behave almost identically and hence produce the same PSTH. The Gaussian link-function can be seen as lying in-between the exponential link-function and the log-exp-exp link-function in Fig. 2. This means that the work of Plesser and Gerstner (2000) [16] is in agreement with the results presented here. The importance of the time-derivative of the ˙ voltage stressed by Plesser and Gerstner (leading to a two-dimensional link-function f (V, V )) was not studied here to remain consistent with the typical usage of GLM in neural systems [14]. Finally we restricted our study to exponential non-linearity for spike initiation and do not consider other cases such as the Quadratic Integrate-and-fire (QIF, [5]) or other polynomial functional shapes. We overlooked these cases for two reasons. First, there are many evidences that the non-linearity in neurons (estimated from in-vitro recordings of Pyramidal neurons) is well approximated by a single exponential [9]. Second, the exponential non-linearity of the AdEx only affects the subthreshold voltage at high voltage (close to threshold) and thus can be neglected to derive the filters κ(t) and η(t). Polynomial non-linearities on the other hand affect a larger range of the subthreshold voltage so that it would be difficult to justify the linearization of subthreshold dynamics essential to the method presented here. References [1] R. B. Stein, “Some models of neuronal variability,” Biophys J, vol. 7, no. 1, pp. 37–68, 1967. [2] W. Gerstner and W. Kistler, Spiking neuron models. Cambridge University Press New York, 2002. [3] E. Izhikevich, “Resonate-and-fire neurons,” Neural Networks, vol. 14, no. 883-894, 2001. [4] M. J. E. Richardson, N. Brunel, and V. Hakim, “From subthreshold to firing-rate resonance,” Journal of Neurophysiology, vol. 89, pp. 2538–2554, 2003. 8 [5] E. Izhikevich, “Simple model of spiking neurons,” IEEE Transactions on Neural Networks, vol. 14, pp. 1569–1572, 2003. [6] S. Mensi, R. Naud, M. Avermann, C. C. H. Petersen, and W. Gerstner, “Parameter extraction and classification of three neuron types reveals two different adaptation mechanisms,” Under review. [7] N. Fourcaud-Trocme, D. Hansel, C. V. Vreeswijk, and N. Brunel, “How spike generation mechanisms determine the neuronal response to fluctuating inputs,” Journal of Neuroscience, vol. 23, no. 37, pp. 11 628–11 640, 2003. [8] R. Brette and W. Gerstner, “Adaptive exponential integrate-and-fire model as an effective description of neuronal activity,” Journal of Neurophysiology, vol. 94, pp. 3637–3642, 2005. [9] L. Badel, W. Gerstner, and M. Richardson, “Dependence of the spike-triggered average voltage on membrane response properties,” Neurocomputing, vol. 69, pp. 1062–1065, 2007. [10] P. McCullagh and J. A. Nelder, Generalized linear models, 2nd ed. Chapman & Hall/CRC, 1998, vol. 37. [11] W. Gerstner, J. van Hemmen, and J. Cowan, “What matters in neuronal locking?” Neural computation, vol. 8, pp. 1653–1676, 1996. [12] D. Hubel and T. Wiesel, “Receptive fields and functional architecture of monkey striate cortex,” Journal of Physiology, vol. 195, pp. 215–243, 1968. [13] J. Pillow, L. Paninski, V. Uzzell, E. Simoncelli, and E. Chichilnisky, “Prediction and decoding of retinal ganglion cell responses with a probabilistic spiking model,” Journal of Neuroscience, vol. 25, no. 47, pp. 11 003–11 013, 2005. [14] K. Doya, S. Ishii, A. Pouget, and R. P. N. Rao, Bayesian brain: Probabilistic approaches to neural coding. The MIT Press, 2007. [15] S. Gerwinn, J. H. Macke, M. Seeger, and M. Bethge, “Bayesian inference for spiking neuron models with a sparsity prior,” in Advances in Neural Information Processing Systems, 2007. [16] H. Plesser and W. Gerstner, “Noise in integrate-and-fire neurons: From stochastic input to escape rates,” Neural Computation, vol. 12, pp. 367–384, 2000. [17] J. Schemmel, J. Fieres, and K. Meier, “Wafer-scale integration of analog neural networks,” in Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, june 2008, pp. 431 –438. [18] R. Jolivet, T. Lewis, and W. Gerstner, “Generalized integrate-and-fire models of neuronal activity approximate spike trains of a detailed model to a high degree of accuracy,” Journal of Neurophysiology, vol. 92, pp. 959–976, 2004. [19] L. Paninski, “Maximum likelihood estimation of cascade point-process neural encoding models,” Network: Computation in Neural Systems, vol. 15, pp. 243–262, 2004. 9
5 0.75979829 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
Author: Joseph Keshet, David A. McAllester
Abstract: We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized on a given sample. 1
6 0.72369903 165 nips-2011-Matrix Completion for Multi-label Image Classification
7 0.68810344 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
8 0.67899936 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
9 0.62827319 216 nips-2011-Portmanteau Vocabularies for Multi-Cue Image Representation
10 0.59772813 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
11 0.591308 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
12 0.59062284 168 nips-2011-Maximum Margin Multi-Instance Learning
13 0.58899045 126 nips-2011-Im2Text: Describing Images Using 1 Million Captioned Photographs
14 0.57886589 227 nips-2011-Pylon Model for Semantic Segmentation
15 0.56966072 154 nips-2011-Learning person-object interactions for action recognition in still images
16 0.55922037 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
17 0.55070728 127 nips-2011-Image Parsing with Stochastic Scene Grammar
18 0.54468006 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features
19 0.53322983 35 nips-2011-An ideal observer model for identifying the reference frame of objects
20 0.52803326 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation