nips nips2013 nips2013-226 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Brenden M. Lake, Ruslan Salakhutdinov, Josh Tenenbaum
Abstract: People can learn a new visual class from just one example, yet machine learning algorithms typically require hundreds or thousands of examples to tackle the same problems. Here we present a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image. We evaluated performance on a challenging one-shot classification task, where our model achieved a human-level error rate while substantially outperforming two deep learning models. We also tested the model on another conceptual task, generating new examples, by using a “visual Turing test” to show that our model produces human-like performance. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: People can learn a new visual class from just one example, yet machine learning algorithms typically require hundreds or thousands of examples to tackle the same problems. [sent-10, score-0.203]
2 Here we present a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image. [sent-11, score-0.242]
3 1 Introduction People can acquire a new concept from only the barest of experience – just one or a handful of examples in a high-dimensional space of raw perceptual input. [sent-14, score-0.185]
4 Although machine learning has tackled some of the same classification and recognition problems that people solve so effortlessly, the standard algorithms require hundreds or thousands of examples to reach good performance. [sent-15, score-0.346]
5 While the standard MNIST benchmark dataset for digit recognition has 6000 training examples per class [19], people can classify new images of a foreign handwritten character from just one example (Figure 1b) [23, 16, 17]. [sent-16, score-0.693]
6 Similarly, while classifiers are generally trained on hundreds of images per class, using benchmark datasets such as ImageNet [4] and CIFAR-10/100 [14], people can learn a new class from just one or a handful of examples (Figure 1: Can you learn a new concept from just one example?). [sent-17, score-0.468]
7 c) The learned concepts also support many other abilities such as generating examples and parsing. [sent-20, score-0.183]
8 Figure 2: Four alphabets from Omniglot, each with five characters drawn by four different people. [sent-25, score-0.296]
9 Additionally, while classification has received most of the attention in machine learning, people can generalize in a variety of other ways after learning a new concept. [sent-30, score-0.241]
10 Equipped with the concept “Segway” or a new handwritten character (Figure 1c), people can produce new examples, parse an object into its critical parts, and fill in a missing part of an image. [sent-31, score-0.671]
11 Given that people seem to succeed at both sides of the tradeoff, a central challenge is to explain this remarkable ability: What types of representations can be learned from just one or a few examples, and how can these representations support such flexible generalizations? [sent-36, score-0.241]
12 We selected simple visual concepts from the domain of handwritten characters, which offers a large number of novel, high-dimensional, and cognitively natural stimuli (Figure 2). [sent-39, score-0.212]
13 These characters are significantly more complex than the simple artificial stimuli most often modeled in psychological studies of concept learning (e. [sent-40, score-0.274]
14 , [6, 13]), yet they remain simple enough to hope that a computational model could see all the structure that people do, unlike domains such as natural scenes. [sent-42, score-0.241]
15 While similar in spirit to MNIST, rather than having 10 characters with 6000 examples each, it has over 1600 characters with 20 examples each – making it more like the “transpose” of MNIST. [sent-44, score-0.558]
16 These characters were selected from 50 different alphabets on www. [sent-45, score-0.296]
17 Since it was produced on Amazon’s Mechanical Turk, each image is paired with a movie ([x,y,time] coordinates) showing how that drawing was produced. [sent-52, score-0.185]
18 In addition to introducing new one-shot learning challenge problems, this paper also introduces Hierarchical Bayesian Program Learning (HBPL), a model that exploits the principles of compositionality and causality to learn a wide range of simple visual concepts from just a single example. [sent-53, score-0.308]
19 We compared the model with people and other competitive computational models for character recognition, including Deep Boltzmann Machines [25] and their Hierarchical Deep extension for learning with very few examples [26]. [sent-54, score-0.523]
20 In this test, both people and the model performed the same task side by side, and then other human participants judged which result was from a person and which was from a machine. [sent-57, score-0.476]
21 2 Hierarchical Bayesian Program Learning We introduce a new computational approach called Hierarchical Bayesian Program Learning (HBPL) that utilizes the principles of compositionality and causality to build a probabilistic generative model of handwritten characters. [sent-58, score-0.225]
22 It is compositional because characters are represented as stochastic motor programs where primitive structure is shared and re-used across characters at multiple levels, including strokes and sub-strokes. [sent-59, score-0.855]
23 Figure 3: An illustration of the HBPL model generating two character types (left and right), where the dotted line separates the type-level from the token-level variables. [sent-63, score-0.471]
24 Legend: number of strokes κ, relations R, primitive id z (color-coded to highlight sharing), control points x (open circles), scale y, start locations L, trajectories T, transformation A, noise ε and blur σb, and image I. [sent-64, score-0.587]
25 “structural description” to explain the image by freely combining these elementary parts and their spatial relations. [sent-65, score-0.185]
26 Unlike classic structural description models [27, 2], HBPL also reflects abstract causal structure about how characters are actually produced. [sent-66, score-0.258]
27 This type of causal representation is psychologically plausible, and it has been previously theorized to explain both behavioral and neuro-imaging data regarding human character perception and learning (e. [sent-67, score-0.324]
28 As in most previous “analysis by synthesis” models of characters, strokes are not modeled at the level of muscle movements, so that they are abstract enough to be completed by a hand, a foot, or an airplane writing in the sky. [sent-70, score-0.253]
29 But HBPL also learns a significantly more complex representation than earlier models, which used only one stroke (unless a second was added manually) [24, 10] or received on-line input data [9], sidestepping the challenging parsing problem needed to interpret complex characters. [sent-71, score-0.285]
30 The model distinguishes between character types (an ‘A’, ‘B’, etc.) and the tokens (particular drawings) produced from them. [sent-72, score-0.212]
31 Generating a character type. A character type ψ = {κ, S, R} is defined by a set of κ strokes S = {S1, ..., Sκ} and a set of spatial relations R = {R1, ..., Rκ} describing how strokes connect to one another. [sent-83, score-0.677]
32 The number of strokes is sampled from a multinomial P(κ) estimated from the empirical frequencies (Figure 4b), and the other conditional distributions are defined in the sections below. [sent-94, score-0.282]
33 All hyperparameters, including the library of primitives (top of Figure 3), were learned from a large “background set” of character drawings as described in Sections 2. [sent-95, score-0.359]
34 Each stroke is initiated by pressing the pen down and terminated by lifting the pen up. [sent-98, score-0.376]
35 In between, a stroke is a motor routine composed of simple movements called sub-strokes Si = {si1, ..., si ni}. [sent-99, score-0.356]
36 The discrete class zij ∈ N is an index into the library of primitive motor elements (top of Figure 3), and its distribution P(zi) = P(zi1) ∏_{j=2}^{ni} P(zij | zi(j−1)) is a first-order Markov Process that adds sub-strokes at each step until a special “stop” state is sampled that ends the stroke. [sent-104, score-0.3]
37 The five control points xij ∈ R10 (small open circles in Figure 3) are sampled from a Gaussian P (xij |zij ) = N (µzij , Σzij ) , but they live in an abstract space not yet embedded in the image frame. [sent-105, score-0.324]
38 The type-level scale yij of this space, relative to the image frame, is sampled from P (yij |zij ) = Gamma(αzij , βzij ). [sent-106, score-0.299]
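The type-level generative steps just described (number of strokes from an empirical multinomial, sub-stroke ids from a first-order Markov chain with a stop state, control points from per-primitive Gaussians, and scales from per-primitive Gammas) can be sketched as follows. This is a minimal illustration rather than the released HBPL code: the parameter arrays (P_kappa, P_start, P_trans, mu, Sigma, alpha, beta) are hypothetical placeholders for hyperparameters learned from the background set, and the Gamma rate parameterization is an assumption.

```python
import numpy as np

def sample_stroke_type(rng, P_start, P_trans, mu, Sigma, alpha, beta, stop_id):
    """Sample one stroke's type-level variables: sub-stroke ids z (first-order
    Markov chain ending at the 'stop' state), control points x (each in R^10,
    i.e. five 2D points per sub-stroke), and scales y. mu/Sigma/alpha/beta hold
    per-primitive Gaussian and Gamma parameters (placeholders here)."""
    z, x, y = [], [], []
    zij = rng.choice(len(P_start), p=P_start)                    # z_i1
    while zij != stop_id:
        z.append(int(zij))
        x.append(rng.multivariate_normal(mu[zij], Sigma[zij]))   # x_ij ~ N(mu_z, Sigma_z)
        y.append(rng.gamma(alpha[zij], 1.0 / beta[zij]))         # y_ij ~ Gamma(alpha_z, beta_z), rate parameterization assumed
        zij = rng.choice(len(P_trans[zij]), p=P_trans[zij])      # z_ij | z_i(j-1), first-order Markov step
    return z, np.array(x), np.array(y)

def sample_character_type(rng, P_kappa, stroke_params):
    """Sample kappa from the empirical multinomial, then kappa strokes."""
    kappa = int(rng.choice(len(P_kappa), p=P_kappa)) + 1
    return [sample_stroke_type(rng, *stroke_params) for _ in range(kappa)]
```

Used with, e.g., rng = np.random.default_rng(0) and fitted parameter arrays; relations between strokes (next paragraphs) would be sampled alongside.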
39 The spatial relation Ri specifies how the beginning of stroke Si connects to the previous strokes {S1, ..., Si−1}. [sent-108, score-0.25]
40 Relations can come in four types with probabilities θR , and each type has different sub-variables and dimensionalities: • Independent relations, Ri = {Ji , Li }, where the position of stroke i does not depend on previous strokes. [sent-119, score-0.278]
41 The variable Ji ∈ N is drawn from P (Ji ), a multinomial over a 2D image grid that depends on index i (Figure 4c). [sent-120, score-0.217]
42 Since the position Li ∈ R2 has to be real-valued, P (Li |Ji ) is then sampled uniformly at random from within the image cell Ji . [sent-121, score-0.242]
43 • Start or End relations, Ri = {ui}, where stroke i starts at either the beginning or end of a previous stroke ui, sampled uniformly at random from ui ∈ {1, ..., i − 1}. [sent-122, score-0.603]
44 • Along relations, Ri = {ui, vi, τi}, where stroke i begins along a previous stroke ui ∈ {1, ..., i − 1} at sub-stroke vi, at a point along its spline specified by τi. [sent-126, score-0.537]
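As a rough illustration of the relation types listed above, the sketch below samples one relation for stroke i. The grid geometry, parameter names, and the uniform placeholder prior over τ are assumptions, not the paper's exact parameterization.

```python
import numpy as np

def sample_relation(rng, i, n_sub, theta_R, P_J, grid_w, grid_h, cell):
    """Sample the spatial relation R_i for stroke i (zero-indexed).
    theta_R: probabilities of the four relation kinds; P_J[i]: multinomial over
    image-grid cells for the i-th stroke's start (cf. Figure 4c); n_sub[u]: the
    number of sub-strokes of previous stroke u. All names are illustrative."""
    kinds = ['independent', 'start', 'end', 'along']
    kind = kinds[rng.choice(4, p=theta_R)] if i > 0 else 'independent'  # stroke 0 has no predecessors
    if kind == 'independent':
        J = int(rng.choice(grid_w * grid_h, p=P_J[min(i, len(P_J) - 1)]))  # grid cell index J_i
        corner = np.array([(J % grid_w) * cell, (J // grid_w) * cell])
        L = corner + rng.uniform(0.0, cell, size=2)                        # L_i uniform within cell J_i
        return {'kind': kind, 'J': J, 'L': L}
    u = int(rng.integers(0, i))                                            # previous stroke u_i, uniform
    if kind in ('start', 'end'):
        return {'kind': kind, 'u': u}
    v = int(rng.integers(0, n_sub[u]))                                     # sub-stroke v_i of stroke u_i
    tau = float(rng.uniform(0.0, 1.0))                                     # attachment coordinate tau_i (placeholder prior)
    return {'kind': 'along', 'u': u, 'v': v, 'tau': tau}
```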
45 Generating a character token. The token-level variables, θ(m) = {L(m), x(m), y(m), R(m), A(m), σb(m), ε(m)}, are distributed as P(θ(m)|ψ) = P(L(m)|θ(m)\L(m), ψ) ∏_i [P(Ri(m)|Ri) P(yi(m)|yi) P(xi(m)|xi)] P(A(m), σb(m), ε(m)) (3), with details below. [sent-134, score-0.266]
46 A stroke trajectory Ti(m) (Figure 3) is a sequence of points in the image plane that represents the path of the pen. [sent-138, score-0.467]
47 To construct the trajectory Ti(m) (see illustration in Figure 3), the spline defined by the scaled control points yi(m) xi(m) ∈ R10 is evaluated to form a trajectory,1 which is shifted in the image plane to begin at Li(m). [sent-141, score-0.321]
48 Token-level relations must be exactly equal to their type-level counterparts, P(Ri(m)|Ri) = δ(Ri(m) − Ri), except for the “along” relation, which allows for token-level variability in the attachment along the spline using a truncated Gaussian P(τi(m)|τi) ∝ N(τi, στ2). [sent-143, score-0.209]
49 (Footnote 1) The number of spline evaluations is computed to be approximately 2 points for every 3 pixels of distance along the spline (with a minimum of 10 evaluations). [sent-152, score-0.244]
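A small sketch of how one sub-stroke's control points might be turned into a pen trajectory under the footnote's evaluation-count rule. Piecewise-linear evaluation of the control polygon stands in for the actual spline, so treat this as an approximation of the rendering step rather than the model's spline code.

```python
import numpy as np

def render_trajectory(ctrl_pts, scale, start_xy):
    """Turn one sub-stroke's 5 control points (length-10 vector -> 5x2) into a
    pen trajectory: scale, evaluate roughly 2 points per 3 pixels of length
    (at least 10), and shift so the trajectory begins at start location L_i."""
    pts = scale * np.asarray(ctrl_pts, dtype=float).reshape(5, 2)
    length = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
    n_eval = max(10, int(round(2.0 * length / 3.0)))
    t = np.linspace(0.0, 1.0, n_eval)
    seg = np.minimum((t * 4).astype(int), 3)           # which of the 4 control-polygon segments
    frac = t * 4 - seg
    traj = pts[seg] * (1 - frac)[:, None] + pts[seg + 1] * frac[:, None]
    return traj - traj[0] + np.asarray(start_xy, dtype=float)
```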
50 Figure 4: Learned hyperparameters. a) library of motor primitives; b) number of strokes (frequency histogram); c) stroke start positions. [sent-153, score-0.698]
51 b & c) Empirical distributions, where the heatmap in c) shows how the starting point differs by stroke number. [sent-156, score-0.25]
52 Image. [sent-157, score-0.25]
53 An image transformation A(m) ∈ R4 is sampled from P(A(m)) = N([1, 1, 0, 0], ΣA), where the first two elements control a global re-scaling and the second two control a global translation of the center of mass of T(m). [sent-158, score-0.214]
54 This grayscale image is then perturbed by two noise processes, which make the gradient more robust during optimization and encourage partial solutions during classification. [sent-160, score-0.225]
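The image-level step (affine warp of the trajectories followed by ink rendering, blur, and pixel noise) might look roughly like the sketch below. The model's actual differentiable ink model and noise processes are specified in the supplement; Gaussian blur and independent pixel flips are used here only as stand-ins, and the parameter names (Sigma_A, sigma_b, eps) are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def render_image(trajectories, im_size, rng, Sigma_A, sigma_b, eps):
    """trajectories: list of (n, 2) arrays of pen positions in image coordinates.
    Samples A^(m), warps the trajectories, drops ink, then blurs and flips pixels."""
    A = rng.multivariate_normal([1.0, 1.0, 0.0, 0.0], Sigma_A)   # [sx, sy, tx, ty]
    com = np.concatenate(trajectories, axis=0).mean(axis=0)      # center of mass of T^(m)
    canvas = np.zeros((im_size, im_size))
    for T in trajectories:
        warped = (T - com) * A[:2] + com + A[2:]                 # re-scale about the COM, then translate
        ij = np.clip(np.round(warped).astype(int), 0, im_size - 1)
        canvas[ij[:, 1], ij[:, 0]] = 1.0                         # crude ink model along the trajectory
    canvas = gaussian_filter(canvas, sigma_b)                    # blur sigma_b^(m)
    flips = rng.random(canvas.shape) < eps                       # pixel noise epsilon^(m)
    return np.where(flips, 1.0 - canvas, canvas)
```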
55 Learning high-level knowledge of motor programs. The Omniglot dataset was randomly split into a 30-alphabet “background” set and a 20-alphabet “evaluation” set, constrained such that the background set included the six most common alphabets as determined by Google hits. [sent-164, score-0.318]
56 Background images, paired with their motor data, were used to learn the hyperparameters of the HBPL model, including a set of 1000 primitive motor elements (Figure 4a) and position models for a drawing’s first, second, and third stroke, etc. [sent-165, score-0.316]
57 Details are provided in Section SI-4 for learning the models of primitives, positions, relations, token variability, and image transformations. [sent-168, score-0.239]
58 Inference. Posterior inference in this model is very challenging, since parsing an image I(m) requires exploring a large combinatorial space of different numbers and types of strokes, relations, and sub-strokes. [sent-170, score-0.22]
59 The approximate posterior is based on K high-probability parses, ψ[1], θ(m)[1], ..., ψ[K], θ(m)[K], which are the most promising candidates proposed by a fast, bottom-up image analysis, shown in Figure 5a and detailed in Section SI-5. [sent-174, score-0.185]
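The resulting discrete approximation weights the K parses by their unnormalized posterior scores. A minimal sketch of turning log scores into normalized mixture weights, assuming the scores themselves are computed upstream:

```python
import numpy as np

def parse_weights(log_scores):
    """Normalize K parses' unnormalized log posterior scores into mixture
    weights w_k for the discrete posterior approximation (log-sum-exp trick,
    so the best parse dominates but the others still contribute)."""
    log_scores = np.asarray(log_scores, dtype=float)
    w = np.exp(log_scores - log_scores.max())
    return w / w.sum()
```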
60 [Figure 5 panel content: binary image → thinned image → traced graph (raw and cleaned); train and test parses shown with their log-probability scores and planned stroke orders.]
63 Figure 5: Parsing a raw image. [sent-251, score-0.178]
64 a) The raw image (i) is processed by a thinning algorithm [18] (ii) and then analyzed as an undirected graph [20] (iii) where parses are guided random walks (Section SI-5). [sent-252, score-0.439]
65 b) The five best parses found for that image (top row) are shown with their log wj (Eq. 5). [sent-253, score-0.33]
66 Numbers inside circles denote stroke order and starting position, and smaller open circles denote sub-stroke breaks. [sent-254, score-0.326]
67 These five parses were re-fit to three different raw images of characters (left in image triplets), where the best parse (top right) and its associated image reconstruction (bottom right) are shown above its score (Eq. [sent-255, score-0.988]
68 Given an approximate posterior for a particular image, the model can evaluate the posterior predictive score of a new image by re-fitting the token-level variables (bottom of Figure 5b), as explained in Section 3. [sent-257, score-0.536]
69 Each trial (of 400 total) consists of a single test image of a new character compared to 20 new characters from the same alphabet, given just one image each produced by a typical drawer of that alphabet. [sent-262, score-0.832]
70 On each trial, as in Figure 1b, participants were shown an image of a new character and asked to click on another image that shows the same character. [sent-266, score-0.752]
71 To ensure classification was indeed “one shot,” participants completed just one randomly selected trial from each of the 10 within-alphabet classification tasks, so that characters never repeated across trials. [sent-267, score-0.376]
72 For a test image I(T) and 20 training images I(c) for c = 1, ..., 20, the model computes an approximate log posterior predictive score log P(I(T)|I(c)) for each class and classifies by choosing the class with the highest score. [sent-270, score-0.318]
73 While inference so far involves parses of I(c) refit to I(T), it also seems desirable to include parses of I(T) refit to I(c), namely P(I(c)|I(T)). [sent-278, score-0.29]
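Putting the classification rule together, a schematic version of the two-way score is sketched below. Here refit_score is a hypothetical stand-in for the expensive step that refits one image's top parses to another and returns an approximate log posterior predictive probability; the exact scoring code is not reproduced here.

```python
import numpy as np

def classify_one_shot(test_img, train_imgs, refit_score):
    """One-shot classification by approximate posterior predictive score.
    refit_score(a, b) should approximate log P(a | b); summing both directions
    implements the two-way variant motivated above (a sketch, not the exact code)."""
    scores = [refit_score(test_img, c_img) + refit_score(c_img, test_img)
              for c_img in train_imgs]
    return int(np.argmax(scores))   # index of the chosen training class
```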
74 The full HBPL model is compared to a transformation-based approach that models the variance in image tokens as just global scales, translations, and blur, which relates to congealing models [23]. [sent-287, score-0.226]
75 This HBPL model “without strokes” still benefits from good bottom-up image analysis (Figure 5) and a learned transformation model. [sent-288, score-0.185]
76 The Affine model is identical to HBPL during search, but during classification, only the warp A(m), blur σb(m), and noise ε(m) are re-optimized to a new image I(T) (changing the argument of the “max” in the scoring equation). [sent-289, score-0.213]
77 To evaluate classification performance, first the approximate posterior distribution over the DBMs’ top-level features was inferred for each image in the evaluation set, followed by performing 1-nearest neighbor in this feature space using cosine similarity. [sent-293, score-0.185]
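The baseline evaluation pipeline reduces to nearest-neighbor search in feature space. A sketch, assuming the top-level features have already been inferred upstream for the test and training images:

```python
import numpy as np

def one_nearest_neighbor_cosine(test_feats, train_feats):
    """1-NN in the deep models' top-level feature space with cosine similarity;
    returns the index of the nearest training image for each test image."""
    def norm(X):
        X = np.atleast_2d(np.asarray(X, dtype=float))
        return X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = norm(test_feats) @ norm(train_feats).T   # cosine similarities
    return np.argmax(sims, axis=1)
```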
78 To speed up learning of the DBM and HD models, the original images were down-sampled, so that each image was represented by 28x28 pixels with greyscale values from [0,1]. [sent-294, score-0.31]
79 To further reduce overfitting and learn more about the 2D image topology, which is built into some deep models like convolutional networks [19], the set of background characters was artificially enhanced by generating slight image translations (+/- 3 pixels), rotations (+/- 5 degrees), and scales (0. [sent-295, score-0.781]
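A sketch of the kind of augmentation described here. The translation and rotation ranges follow the text; the rescaling range and the crop/pad handling are placeholders, since the exact values are not recoverable from this excerpt.

```python
import numpy as np
from scipy.ndimage import shift, rotate, zoom

def augment(img, rng):
    """One random augmentation of a 28x28 background character: small
    translation (+/- 3 px), rotation (+/- 5 degrees), and a slight rescaling."""
    out = shift(img, rng.integers(-3, 4, size=2), order=1, mode='constant')
    out = rotate(out, rng.uniform(-5, 5), reshape=False, order=1, mode='constant')
    out = zoom(out, rng.uniform(0.9, 1.1), order=1)        # placeholder scale range
    # crop or pad back to 28x28 so all augmented images share one shape
    canvas = np.zeros((28, 28))
    h, w = out.shape
    top, left = max((h - 28) // 2, 0), max((w - 28) // 2, 0)
    crop = out[top:top + 28, left:left + 28]
    canvas[:crop.shape[0], :crop.shape[1]] = crop
    return np.clip(canvas, 0.0, 1.0)
```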
80 As predicted, people were skilled one-shot learners, with an average error rate of 4. [sent-308, score-0.241]
81 One-shot generation of new examples. Not only can people classify new examples, they can generate new examples – even from just one image. [sent-330, score-0.381]
82 While all generative classifiers can produce examples, it can be difficult to synthesize a range of compelling new examples in their raw form, especially since many models generate only features of raw stimuli (e. [sent-331, score-0.267]
83 We ran another Mechanical Turk task to produce nine new examples of 50 randomly selected handwritten character images from the evaluation set. [sent-334, score-0.516]
84 After correctly answering comprehension questions, 18 participants in the USA were asked to “draw a new example” of 25 characters, resulting in nine examples per character. [sent-336, score-0.335]
85 To simulate drawings from nine different people, each of the models generated nine samples after seeing exactly the same images people did, as described in Section SI-8 and shown in Figure 6. [sent-337, score-0.519]
86 Low-level image differences were minimized by re-rendering stroke trajectories in the same way for the models and people. [sent-338, score-0.47]
87 Figure 6: Generating new examples from just a single “target” image (left); columns show Example, People, HBPL, Affine, and HD. [sent-340, score-0.255]
88 Each grid shows nine new examples synthesized by people and the three computational models. [sent-341, score-0.407]
89 To compare the examples generated by people and the models, we ran a visual Turing test using 50 new participants in the USA on Mechanical Turk. [sent-343, score-0.59]
90 Participants were told that they would see a target image and two grids of 9 images (Figure 6), where one grid was drawn by people with their computer mice and the other grid was drawn by a computer program that “simulates how people draw a new character.” [sent-344, score-0.862]
91 Participants who tried to label drawings from people vs. [sent-352, score-0.302]
92 HBPL were only 56% correct, while those who tried to label people vs. [sent-353, score-0.241]
93 While both group means were significantly better than chance, a subject analysis revealed only 2 of 21 participants were better than chance for people vs. [sent-357, score-0.455]
94 HBPL, while 24 of 25 were significant for people vs. [sent-358, score-0.241]
95 Likewise, 8 of 50 items were above chance for people vs. [sent-360, score-0.285]
96 HBPL, while 48 of 50 items were above chance for people vs. [sent-361, score-0.285]
97 Since participants could easily detect the overly consistent Affine model, it seems the difficulty participants had in detecting HBPL’s exemplars was not due to task confusion. [sent-363, score-0.371]
98 Interestingly, participants did not significantly improve over the trials, even after seeing hundreds of images from the model. [sent-364, score-0.294]
99 If one were to incorporate this compositional and causal structure into a deep learning model, it could lead to better performance on our tasks. [sent-369, score-0.182]
100 Thus, we do not see our model as the final word on how humans learn concepts, but rather, as a suggestion for the type of structure that best captures how people learn rich concepts from very sparse data. [sent-370, score-0.373]
wordName wordTfidf (topN-words)
[('hbpl', 0.512), ('strokes', 0.253), ('stroke', 0.25), ('people', 0.241), ('character', 0.212), ('characters', 0.206), ('image', 0.185), ('participants', 0.17), ('parses', 0.145), ('zij', 0.137), ('ri', 0.122), ('motor', 0.106), ('spline', 0.104), ('parse', 0.1), ('alphabets', 0.09), ('deep', 0.089), ('images', 0.089), ('yij', 0.085), ('handwritten', 0.081), ('compositionality', 0.078), ('turing', 0.078), ('raw', 0.078), ('hd', 0.075), ('xij', 0.072), ('relations', 0.071), ('examples', 0.07), ('thinned', 0.067), ('cleaned', 0.067), ('concepts', 0.066), ('causality', 0.066), ('visual', 0.065), ('af', 0.065), ('nine', 0.064), ('pen', 0.063), ('drawings', 0.061), ('omniglot', 0.061), ('classi', 0.058), ('primitives', 0.058), ('wi', 0.055), ('token', 0.054), ('dbms', 0.054), ('li', 0.054), ('hierarchical', 0.053), ('causal', 0.052), ('boltzmann', 0.05), ('planning', 0.05), ('salakhutdinov', 0.049), ('mechanical', 0.049), ('generating', 0.047), ('train', 0.045), ('ji', 0.044), ('test', 0.044), ('chance', 0.044), ('si', 0.043), ('primitive', 0.043), ('alphabet', 0.043), ('program', 0.042), ('compelling', 0.041), ('compositional', 0.041), ('tokens', 0.041), ('brenden', 0.041), ('tui', 0.041), ('grayscale', 0.04), ('argmax', 0.039), ('circles', 0.038), ('ti', 0.038), ('lake', 0.037), ('ui', 0.037), ('concept', 0.037), ('traced', 0.036), ('segway', 0.036), ('background', 0.036), ('pixels', 0.036), ('psychology', 0.036), ('hundreds', 0.035), ('trajectories', 0.035), ('parsing', 0.035), ('variability', 0.034), ('judged', 0.033), ('greek', 0.033), ('shot', 0.033), ('learn', 0.033), ('trajectory', 0.032), ('cognitive', 0.032), ('imagenet', 0.032), ('bayesian', 0.032), ('grid', 0.032), ('human', 0.032), ('thinning', 0.031), ('comprehension', 0.031), ('exemplars', 0.031), ('psychological', 0.031), ('sampled', 0.029), ('scripts', 0.028), ('dbm', 0.028), ('blur', 0.028), ('turk', 0.028), ('perception', 0.028), ('position', 0.028), ('library', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 226 nips-2013-One-shot learning by inverting a compositional causal process
Author: Brenden M. Lake, Ruslan Salakhutdinov, Josh Tenenbaum
Abstract: People can learn a new visual class from just one example, yet machine learning algorithms typically require hundreds or thousands of examples to tackle the same problems. Here we present a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image. We evaluated performance on a challenging one-shot classification task, where our model achieved a human-level error rate while substantially outperforming two deep learning models. We also tested the model on another conceptual task, generating new examples, by using a “visual Turing test” to show that our model produces human-like performance. 1
2 0.19966991 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
Author: Yangqing Jia, Joshua T. Abbott, Joseph Austerweil, Thomas Griffiths, Trevor Darrell
Abstract: Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. We compare the performance of our system to several baseline algorithms, and show a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization. 1
3 0.11374186 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
Author: Michiel Hermans, Benjamin Schrauwen
Abstract: Time series often have a temporal hierarchy, with information that is spread out over multiple time scales. Common recurrent neural networks, however, do not explicitly accommodate such a hierarchy, and most research on them has been focusing on training algorithms rather than on their basic architecture. In this paper we study the effect of a hierarchy of recurrent neural networks on processing time series. Here, each layer is a recurrent network which receives the hidden state of the previous layer as input. This architecture allows us to perform hierarchical processing on difficult temporal tasks, and more naturally capture the structure of time series. We show that they reach state-of-the-art performance for recurrent networks in character-level language modeling when trained with simple stochastic gradient descent. We also offer an analysis of the different emergent time scales. 1
4 0.10532411 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs
Author: Vikash Mansinghka, Tejas D. Kulkarni, Yura N. Perov, Josh Tenenbaum
Abstract: The idea of computer vision as the Bayesian inverse problem to computer graphics has a long history and an appealing elegance, but it has proved difficult to directly implement. Instead, most vision tasks are approached via complex bottom-up processing pipelines. Here we show that it is possible to write short, simple probabilistic graphics programs that define flexible generative models and to automatically invert them to interpret real-world images. Generative probabilistic graphics programs (GPGP) consist of a stochastic scene generator, a renderer based on graphics software, a stochastic likelihood model linking the renderer’s output and the data, and latent variables that adjust the fidelity of the renderer and the tolerance of the likelihood. Representations and algorithms from computer graphics are used as the deterministic backbone for highly approximate and stochastic generative models. This formulation combines probabilistic programming, computer graphics, and approximate Bayesian computation, and depends only on generalpurpose, automatic inference techniques. We describe two applications: reading sequences of degraded and adversarially obscured characters, and inferring 3D road models from vehicle-mounted camera images. Each of the probabilistic graphics programs we present relies on under 20 lines of probabilistic code, and yields accurate, approximately Bayesian inferences about real-world images. 1
5 0.09781982 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
Author: Richard Socher, Milind Ganjoo, Christopher D. Manning, Andrew Ng
Abstract: This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state of the art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes’ accuracy high. 1
6 0.094575495 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
7 0.09170568 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
8 0.091233872 93 nips-2013-Discriminative Transfer Learning with Tree-based Priors
9 0.08039619 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
10 0.079348795 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
11 0.079266444 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty
12 0.07911969 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning
13 0.075725675 183 nips-2013-Mapping paradigm ontologies to and from the brain
14 0.074662231 166 nips-2013-Learning invariant representations and applications to face verification
15 0.07245931 251 nips-2013-Predicting Parameters in Deep Learning
16 0.071726196 84 nips-2013-Deep Neural Networks for Object Detection
17 0.070502706 331 nips-2013-Top-Down Regularization of Deep Belief Networks
19 0.068243891 83 nips-2013-Deep Fisher Networks for Large-Scale Image Classification
20 0.06663651 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
topicId topicWeight
[(0, 0.176), (1, 0.074), (2, -0.14), (3, -0.065), (4, 0.104), (5, -0.072), (6, -0.009), (7, 0.013), (8, -0.025), (9, 0.017), (10, -0.076), (11, -0.002), (12, -0.017), (13, 0.011), (14, -0.091), (15, 0.01), (16, -0.035), (17, -0.066), (18, -0.088), (19, 0.003), (20, 0.021), (21, -0.042), (22, -0.038), (23, 0.006), (24, -0.046), (25, 0.099), (26, 0.061), (27, 0.011), (28, 0.002), (29, -0.043), (30, -0.031), (31, 0.049), (32, -0.108), (33, -0.044), (34, -0.056), (35, -0.008), (36, -0.032), (37, 0.082), (38, 0.047), (39, 0.037), (40, -0.009), (41, 0.135), (42, -0.074), (43, 0.033), (44, 0.017), (45, -0.003), (46, -0.008), (47, -0.036), (48, 0.06), (49, -0.087)]
simIndex simValue paperId paperTitle
same-paper 1 0.93583846 226 nips-2013-One-shot learning by inverting a compositional causal process
Author: Brenden M. Lake, Ruslan Salakhutdinov, Josh Tenenbaum
Abstract: People can learn a new visual class from just one example, yet machine learning algorithms typically require hundreds or thousands of examples to tackle the same problems. Here we present a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image. We evaluated performance on a challenging one-shot classification task, where our model achieved a human-level error rate while substantially outperforming two deep learning models. We also tested the model on another conceptual task, generating new examples, by using a “visual Turing test” to show that our model produces human-like performance. 1
2 0.81764531 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
Author: Yangqing Jia, Joshua T. Abbott, Joseph Austerweil, Thomas Griffiths, Trevor Darrell
Abstract: Learning a visual concept from a small number of positive examples is a significant challenge for machine learning algorithms. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given set of visual examples. Recent work in cognitive science on Bayesian models of generalization addresses this challenge, but prior results assumed that objects were perfectly recognized. We present an algorithm for learning visual concepts directly from images, using probabilistic predictions generated by visual classifiers as the input to a Bayesian generalization model. As no existing challenge data tests this paradigm, we collect and make available a new, large-scale dataset for visual concept learning using the ImageNet hierarchy as the source of possible concepts, with human annotators to provide ground truth labels as to whether a new image is an instance of each concept using a paradigm similar to that used in experiments studying word learning in children. We compare the performance of our system to several baseline algorithms, and show a significant advantage results from combining visual classifiers with the ability to identify an appropriate level of abstraction using Bayesian generalization. 1
3 0.73999745 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning
Author: Chen-Ping Yu, Wen-Yu Hua, Dimitris Samaras, Greg Zelinsky
Abstract: Visual clutter, the perception of an image as being crowded and disordered, affects aspects of our lives ranging from object detection to aesthetics, yet relatively little effort has been made to model this important and ubiquitous percept. Our approach models clutter as the number of proto-objects segmented from an image, with proto-objects defined as groupings of superpixels that are similar in intensity, color, and gradient orientation features. We introduce a novel parametric method of clustering superpixels by modeling mixture of Weibulls on Earth Mover’s Distance statistics, then taking the normalized number of proto-objects following partitioning as our estimate of clutter perception. We validated this model using a new 90-image dataset of real world scenes rank ordered by human raters for clutter, and showed that our method not only predicted clutter extremely well (Spearman’s ρ = 0.8038, p < 0.001), but also outperformed all existing clutter perception models and even a behavioral object segmentation ground truth. We conclude that the number of proto-objects in an image affects clutter perception more than the number of objects or features. 1
4 0.71766108 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs
Author: Vikash Mansinghka, Tejas D. Kulkarni, Yura N. Perov, Josh Tenenbaum
Abstract: The idea of computer vision as the Bayesian inverse problem to computer graphics has a long history and an appealing elegance, but it has proved difficult to directly implement. Instead, most vision tasks are approached via complex bottom-up processing pipelines. Here we show that it is possible to write short, simple probabilistic graphics programs that define flexible generative models and to automatically invert them to interpret real-world images. Generative probabilistic graphics programs (GPGP) consist of a stochastic scene generator, a renderer based on graphics software, a stochastic likelihood model linking the renderer’s output and the data, and latent variables that adjust the fidelity of the renderer and the tolerance of the likelihood. Representations and algorithms from computer graphics are used as the deterministic backbone for highly approximate and stochastic generative models. This formulation combines probabilistic programming, computer graphics, and approximate Bayesian computation, and depends only on generalpurpose, automatic inference techniques. We describe two applications: reading sequences of degraded and adversarially obscured characters, and inferring 3D road models from vehicle-mounted camera images. Each of the probabilistic graphics programs we present relies on under 20 lines of probabilistic code, and yields accurate, approximately Bayesian inferences about real-world images. 1
5 0.71039927 166 nips-2013-Learning invariant representations and applications to face verification
Author: Qianli Liao, Joel Z. Leibo, Tomaso Poggio
Abstract: One approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformationinvariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identitypreserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well. 1
6 0.70116496 183 nips-2013-Mapping paradigm ontologies to and from the brain
8 0.65092218 84 nips-2013-Deep Neural Networks for Object Detection
9 0.64365572 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars
10 0.60952401 163 nips-2013-Learning a Deep Compact Image Representation for Visual Tracking
11 0.605811 356 nips-2013-Zero-Shot Learning Through Cross-Modal Transfer
12 0.60417169 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
13 0.60297805 351 nips-2013-What Are the Invariant Occlusive Components of Image Patches? A Probabilistic Generative Approach
14 0.58591592 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
15 0.57766491 138 nips-2013-Higher Order Priors for Joint Intrinsic Image, Objects, and Attributes Estimation
16 0.5666827 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty
17 0.55529553 21 nips-2013-Action from Still Image Dataset and Inverse Optimal Control to Learn Task Specific Visual Scanpaths
18 0.53753835 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
19 0.53625518 200 nips-2013-Multi-Prediction Deep Boltzmann Machines
20 0.53322905 119 nips-2013-Fast Template Evaluation with Vector Quantization
topicId topicWeight
[(16, 0.037), (33, 0.193), (34, 0.107), (41, 0.035), (49, 0.041), (56, 0.084), (70, 0.066), (71, 0.228), (85, 0.036), (89, 0.021), (93, 0.051), (95, 0.023)]
simIndex simValue paperId paperTitle
1 0.86450189 327 nips-2013-The Randomized Dependence Coefficient
Author: David Lopez-Paz, Philipp Hennig, Bernhard Schölkopf
Abstract: We introduce the Randomized Dependence Coefficient (RDC), a measure of nonlinear dependence between random variables of arbitrary dimension based on the Hirschfeld-Gebelein-R´ nyi Maximum Correlation Coefficient. RDC is defined in e terms of correlation of random non-linear copula projections; it is invariant with respect to marginal distribution transformations, has low computational cost and is easy to implement: just five lines of R code, included at the end of the paper. 1
same-paper 2 0.83551848 226 nips-2013-One-shot learning by inverting a compositional causal process
Author: Brenden M. Lake, Ruslan Salakhutdinov, Josh Tenenbaum
Abstract: People can learn a new visual class from just one example, yet machine learning algorithms typically require hundreds or thousands of examples to tackle the same problems. Here we present a Hierarchical Bayesian model based on compositionality and causality that can learn a wide range of natural (although simple) visual concepts, generalizing in human-like ways from just one image. We evaluated performance on a challenging one-shot classification task, where our model achieved a human-level error rate while substantially outperforming two deep learning models. We also tested the model on another conceptual task, generating new examples, by using a “visual Turing test” to show that our model produces human-like performance. 1
3 0.80276132 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato
Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. 1
Author: Nataliya Shapovalova, Michalis Raptis, Leonid Sigal, Greg Mori
Abstract: We propose a weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video. As part of the proposed approach, we develop a generalization of the Max-Path search algorithm which allows us to efficiently search over a structured space of multiple spatio-temporal paths while also incorporating context information into the model. Instead of using spatial annotations in the form of bounding boxes to guide the latent model during training, we utilize human gaze data in the form of a weak supervisory signal. This is achieved by incorporating eye gaze, along with the classification, into the structured loss within the latent SVM learning framework. Experiments on a challenging benchmark dataset, UCF-Sports, show that our model is more accurate, in terms of classification, and achieves state-of-the-art results in localization. In addition, our model can produce top-down saliency maps conditioned on the classification label and localized latent paths. 1
5 0.7354477 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
Author: Marius Pachitariu, Adam M. Packer, Noah Pettit, Henry Dalgleish, Michael Hausser, Maneesh Sahani
Abstract: Biological tissue is often composed of cells with similar morphologies replicated throughout large volumes and many biological applications rely on the accurate identification of these cells and their locations from image data. Here we develop a generative model that captures the regularities present in images composed of repeating elements of a few different types. Formally, the model can be described as convolutional sparse block coding. For inference we use a variant of convolutional matching pursuit adapted to block-based representations. We extend the KSVD learning algorithm to subspaces by retaining several principal vectors from the SVD decomposition instead of just one. Good models with little cross-talk between subspaces can be obtained by learning the blocks incrementally. We perform extensive experiments on simulated images and the inference algorithm consistently recovers a large proportion of the cells with a small number of false positives. We fit the convolutional model to noisy GCaMP6 two-photon images of spiking neurons and to Nissl-stained slices of cortical tissue and show that it recovers cell body locations without supervision. The flexibility of the block-based representation is reflected in the variability of the recovered cell shapes. 1
6 0.73429149 331 nips-2013-Top-Down Regularization of Deep Belief Networks
7 0.73176193 64 nips-2013-Compete to Compute
8 0.73165596 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
9 0.73097283 275 nips-2013-Reservoir Boosting : Between Online and Offline Ensemble Learning
10 0.73034942 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles
11 0.72861028 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes
12 0.72855079 349 nips-2013-Visual Concept Learning: Combining Machine Vision and Bayesian Generalization on Concept Hierarchies
13 0.7282266 301 nips-2013-Sparse Additive Text Models with Low Rank Background
14 0.72772247 201 nips-2013-Multi-Task Bayesian Optimization
15 0.72709441 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
16 0.72677696 251 nips-2013-Predicting Parameters in Deep Learning
17 0.72632366 304 nips-2013-Sparse nonnegative deconvolution for compressive calcium imaging: algorithms and phase transitions
18 0.72627777 334 nips-2013-Training and Analysing Deep Recurrent Neural Networks
19 0.72572833 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
20 0.72568995 200 nips-2013-Multi-Prediction Deep Boltzmann Machines