nips nips2011 nips2011-156 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Antonio Torralba, Joshua B. Tenenbaum, Ruslan Salakhutdinov
Abstract: We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. [sent-6, score-0.368]
2 Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). [sent-7, score-0.346]
3 We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets. [sent-9, score-0.316]
4 1 Introduction “Learning to learn”, or the ability to learn abstract representations that support transfer to novel but related tasks, lies at the core of many problems in computer vision, natural language processing, cognitive science, and machine learning. [sent-10, score-0.179]
5 For human learners, however, just one or a few examples are often sufficient to grasp a new category and make meaningful generalizations to novel instances [25, 16]. [sent-12, score-0.215]
6 The architecture we describe here takes a step towards this “one-shot learning” ability by learning several forms of abstract knowledge that support transfer of useful representations from previously learned concepts to novel ones. [sent-13, score-0.217]
7 We call our architectures compound HD models, where “HD” stands for “Hierarchical-Deep”, because they are derived by composing hierarchical nonparametric Bayesian models with deep networks, two influential approaches from the recent unsupervised learning literature with complementary strengths. [sent-14, score-0.481]
8 Recently introduced deep learning models, including Deep Belief Networks [5], Deep Boltzmann Machines [14], deep autoencoders [10], and others [12, 11], have been shown to learn useful distributed feature representations for many high-dimensional datasets. [sent-15, score-0.463]
9 The ability to automatically learn in multiple layers allows deep models to construct sophisticated domain-specific features without the need to rely on precise human-crafted input representations, which is increasingly important with the proliferation of data sets and application domains. [sent-16, score-0.385]
10 While the features learned by deep models can enable more rapid and accurate classification learning, deep networks themselves are not well suited to one-shot learning of novel classes. [sent-17, score-0.615]
11 All units and parameters at all levels of the network are engaged in representing any given input and are adjusted together during learning. [sent-18, score-0.148]
12 This ability is the hallmark of hierarchical Bayesian (HB) models, recently proposed in computer vision, statistics, and cognitive science [7, 25, 4, 13] for learning to learn from few examples. [sent-20, score-0.153]
13 Unlike deep networks, these HB models explicitly represent category hierarchies that admit sharing the appropriate abstract knowledge about the new class’s parameters via a prior abstracted from related classes. [sent-21, score-0.404]
14 HB approaches, however, have complementary weaknesses relative to deep networks. [sent-22, score-0.196]
15 They typically rely on hand-crafted features (e.g., GIST or SIFT features in computer vision, MFCC features in speech perception domains). [sent-25, score-0.214]
16 Moreover, many HB approaches often assume a fixed hierarchy for sharing parameters [17, 3] instead of learning the hierarchy in an unsupervised fashion. [sent-27, score-0.256]
17 In this work we investigate compound HD (hierarchical-deep) architectures that integrate these deep models with structured hierarchical Bayesian models. [sent-28, score-0.386]
18 In particular, we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM), coming to represent both a layered hierarchy of increasingly abstract features, and a tree-structured hierarchy of classes. [sent-29, score-0.534]
19 Our model depends minimally on domain-specific representations and achieves state-of-the-art one-shot learning performance by unsupervised discovery of three components: (a) low-level features that abstract from the raw high-dimensional sensory input (e. [sent-30, score-0.254]
20 pixels, or 3D joint angles); (b) high-level part-like features that express the distinctive perceptual structure of a specific class, in terms of class-specific correlations over low-level features; and (c) a hierarchy of super-classes for sharing abstract knowledge among related classes. [sent-32, score-0.231]
21 It contains a set of visible units v ∈ {0, 1}^D, and a sequence of layers of hidden units h^1 ∈ {0, 1}^{F_1}, h^2 ∈ {0, 1}^{F_2}, h^3 ∈ {0, 1}^{F_3}. [sent-36, score-0.402]
22 There are connections only between hidden units in adjacent layers, as well as between visible and hidden units in the first hidden layer. [sent-40, score-0.482]
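As a concrete illustration of that connectivity, here is a minimal numpy sketch of the layer-wise Gibbs conditionals of a three-hidden-layer binary DBM; the layer sizes, random weights, and omission of bias terms are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes for v in {0,1}^D and three hidden layers (biases omitted).
D, F1, F2, F3 = 784, 500, 500, 100
W1 = 0.01 * rng.standard_normal((D, F1))   # visible <-> 1st hidden layer
W2 = 0.01 * rng.standard_normal((F1, F2))  # 1st     <-> 2nd hidden layer
W3 = 0.01 * rng.standard_normal((F2, F3))  # 2nd     <-> top hidden layer

def gibbs_sweep(v, h2):
    """One sweep of the layer-wise conditionals: h1 and h3 are conditionally
    independent given (v, h2), and h2 is conditionally independent given (h1, h3)."""
    h1 = rng.binomial(1, sigmoid(v @ W1 + h2 @ W2.T))
    h3 = rng.binomial(1, sigmoid(h2 @ W3))
    h2 = rng.binomial(1, sigmoid(h1 @ W2 + h3 @ W3.T))
    return h1, h2, h3

v = rng.binomial(1, 0.5, size=D).astype(float)
h2 = rng.binomial(1, 0.5, size=F2).astype(float)
h1, h2, h3 = gibbs_sweep(v, h2)
```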
23 Multinomial DBMs: To allow DBMs to express more information and introduce more structured hierarchical priors, we will use a conditional multinomial distribution to model activities of the top-level units. [sent-50, score-0.304]
24 Specifically, we will use M softmax units, each with “1-of-K” encoding, so that each unit takes one of K discrete states. [Footnote 1: For clarity, we use three hidden layers.] [sent-51, score-0.192]
25 [Figure caption, middle panel: A different interpretation, in which the M softmax units are replaced by a single multinomial unit which is sampled M times.] [sent-54, score-0.365]
26 All M separate softmax units will share the same set of weights, connecting them to binary hidden units at the lower-level (Fig. [sent-57, score-0.488]
27 A key observation is that M separate copies of softmax units that all share the same set of weights can be viewed as a single multinomial unit that is sampled M times [15, 19]. [sent-59, score-0.365]
28 A pleasing property of using softmax units is that the mathematics underlying the learning algorithm for binary-binary DBMs remains the same. [sent-60, score-0.278]
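A small sketch of the observation above, with made-up values for K, M, and the shared logits: M tied 1-of-K softmax units and a single multinomial unit sampled M times induce the same distribution over the count vector, which is all the energy function depends on.

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 200, 50                       # number of top-level "words" and number of softmax draws
logits = rng.standard_normal(K)      # tied weights give every softmax unit the same logits
p = np.exp(logits - logits.max())
p /= p.sum()

# View 1: M separate softmax units with shared weights, each a 1-of-K draw.
counts_tied = np.bincount(rng.choice(K, size=M, p=p), minlength=K)

# View 2: one multinomial unit sampled M times.
counts_mult = rng.multinomial(M, p)

# Both count vectors are Multinomial(M, p), so the binary-binary learning rules carry over.
```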
29 One way to express what has been learned is the conditional model P (v, h1 , h2 |h3 ) and a prior term P (h3 ). [sent-62, score-0.131]
30 Rather than using the DBM's own top-level prior (i.e., a restricted Boltzmann machine) to model P(h^3), we can place a hierarchical Dirichlet process prior over h^3, which will allow us to learn category hierarchies and, more importantly, useful representations of classes that contain few training examples. [sent-68, score-0.451]
31 A Hierarchical Bayesian Topic Prior: In a typical hierarchical topic model, we observe a set of N documents, each of which is modeled as a mixture over topics that are shared among documents. [sent-71, score-0.181]
32 A topic t is a discrete distribution over K words with probability vector φt . [sent-73, score-0.123]
33 Each document n has its own distribution over topics given by probabilities θn . [sent-74, score-0.154]
34 In our compound HDP-DBM model, we will use a hierarchical topic model as a prior over the activities of the DBM’s top-level features. [sent-75, score-0.4]
35 Specifically, the term “document” will refer to the top-level multinomial unit h^3, and the M “words” in the document will represent the M samples, or active top-level features of the DBM, generated by this multinomial unit. [sent-76, score-0.262]
36 Words in each document are drawn by choosing a topic t with probability θnt , and then choosing a word w with probability φtw . [sent-77, score-0.147]
37 We will often refer to topics as our learned higher-level features, each of which defines a topic-specific distribution over the DBM's h^3 features. [sent-78, score-0.238]
38 Let h^3_{in} be the i-th word in document n, and x_{in} be its topic: θ_n | π ∼ Dir(απ), φ_t | τ ∼ Dir(βτ), x_{in} | θ_n ∼ Mult(θ_n), h^3_{in} | x_{in}, φ ∼ Mult(φ_{x_{in}}), (5) where π is the global distribution over topics, τ is the global distribution over K words, and α and β are concentration parameters. [sent-79, score-0.492]
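A sketch of this generative process using a finite number of topics T as a stand-in for the HDP (which would let T grow); all sizes and concentration values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, T, K = 20, 50, 10, 40          # documents, words per document, topics, vocabulary of h3 features
alpha, beta = 5.0, 10.0

pi = np.ones(T) / T                  # global distribution over topics (a stick-breaking draw in the full HDP)
tau = np.ones(K) / K                 # global distribution over the K "words"

phi = rng.dirichlet(beta * tau, size=T)      # phi_t | tau ~ Dir(beta * tau)
theta = rng.dirichlet(alpha * pi, size=N)    # theta_n | pi ~ Dir(alpha * pi)

x = np.stack([rng.choice(T, size=M, p=theta[n]) for n in range(N)])                    # x_in ~ Mult(theta_n)
h3 = np.stack([[rng.choice(K, p=phi[x[n, i]]) for i in range(M)] for n in range(N)])   # h3_in ~ Mult(phi_{x_in})
```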
39 Let us further assume that we are presented with a fixed two-level category hierarchy. [sent-80, score-0.123]
40 We represent such a partition by a vector z^b of length N, each entry of which is z^b_n ∈ {1, . . . }. [sent-84, score-0.212]
41 These partitions define a fixed two-level tree hierarchy (see Fig. [sent-94, score-0.13]
42 The hierarchical topic model can be readily extended to modeling the above hierarchy. [sent-97, score-0.181]
43 For each document n that belongs to basic category c, we place a common Dirichlet prior over θ_n with parameters π^(1)_c. [sent-98, score-0.259]
44 x_{in} | θ_n ∼ Mult(θ_n), φ_t | τ ∼ Dir(βτ), (6) for each document n = 1, . . . , N and for each word i = 1, . . . [sent-107, score-0.22]
45 For a fixed number of topics T , the above model represents a hierarchical extension of LDA. [sent-111, score-0.188]
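A sketch of that hierarchical sharing for a fixed T, with made-up tree sizes and concentration parameters: topic proportions are passed from the global level to super-categories, then to basic categories, then to individual documents.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 10                                   # fixed number of topics (finite version of the model)
a3, a2, a1 = 20.0, 20.0, 20.0            # concentration parameters at each level (illustrative)
eps = 1e-6                               # numerical floor so no Dirichlet parameter is exactly zero

pi_g = np.ones(T) / T                                        # global topic proportions
pi_super = rng.dirichlet(a3 * pi_g + eps, size=2)            # one proportion vector per super-category
pi_basic = rng.dirichlet(a2 * pi_super[0] + eps, size=4)     # basic categories under super-category 0
theta = rng.dirichlet(a1 * pi_basic[1] + eps, size=5)        # documents assigned to basic category 1
```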
46 Making use of the topic index variables x_{in}, we denote φ* = φ_{x_{in}} (see Eq. [sent-115, score-0.315]
47 If, however, we are not given any level-1 or level-2 category labels, we need to infer the distribution over the possible category structures. [sent-127, score-0.246]
48 We place a nonparametric two-level nested Chinese Restaurant Prior (CRP) [2] over z, which defines a prior over tree structures and is flexible enough to learn arbitrary hierarchies. [sent-128, score-0.274]
49 The main building block of the nested CRP is the Chinese restaurant process, a distribution over partitions of integers. [sent-129, score-0.142]
50 Its assignment to the basic-level category z^b_n, which is placed under a super-category z^s_n, is again recursively drawn from Eq. [sent-137, score-0.343]
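A sketch of the two-level draw, with made-up counts and concentration values: the standard CRP predictive rule is applied once for the super-category and once more, recursively, for the basic-level category nested under it.

```python
import numpy as np

rng = np.random.default_rng(4)

def crp_draw(counts, gamma):
    """Chinese restaurant process predictive rule: an existing table is chosen with
    probability proportional to its occupancy, a new table with probability ~ gamma."""
    probs = np.append(np.asarray(counts, dtype=float), gamma)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)      # index == len(counts) means "new table"

# Illustrative state of a two-level nested CRP.
super_counts = [6, 3]                           # members per existing super-category
basic_counts = {0: [4, 2], 1: [3]}              # basic-level categories nested under each super-category

z_super = crp_draw(super_counts, gamma=1.0)
if z_super == len(super_counts):
    z_basic = 0                                 # a new super-category starts with one new basic category
else:
    z_basic = crp_draw(basic_counts[z_super], gamma=1.0)
```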
51 The proposed model allows for both a nonparametric prior over a potentially unbounded number of global topics, or higher-level features, and a nonparametric prior that allows learning an arbitrary tree taxonomy. [sent-140, score-0.26]
52 Sampling HDP parameters: Given category assignment vectors z and the states of the top-level DBM features h^3, we use the posterior representation sampler of [20]. [sent-143, score-0.298]
53 In particular, the HDP sampler maintains the stick-breaking weights {θ_n}_{n=1}^N and {π^(1)_c, π^(2)_s, π^(3)_g}, and the topic indicator variables x (parameters φ can be integrated out). [sent-144, score-0.133]
54 The sampler alternates between: (a) sampling cluster indices x_{in} using Gibbs updates in the Chinese restaurant franchise (CRF) representation of the HDP; and (b) sampling the weights at all three levels conditioned on x using the usual posterior of a DP. [sent-145, score-0.371]
55 Sampling category assignments z: Given the current instantiation of the stick-breaking weights, using a defining property of a DP, for each input n we have: (θ^(1)_{1,n}, . . . [sent-146, score-0.123]
56 8), the posterior over the category assignment can be calculated as follows: p(z_n | θ_n, z_{−n}, π^(1)) ∝ p(θ_n | π^(1), z_n) p(z_n | z_{−n}), (10) where z_{−n} denotes the variables z for all observations other than n. [sent-153, score-0.263]
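A sketch of evaluating Eq. 10 for one document over a finite set of existing categories; the Dirichlet density plays the role of p(θ_n | π^(1), z_n) and CRP counts give p(z_n | z_{−n}). All values are made up, and the term for opening a brand-new category is omitted.

```python
import numpy as np
from scipy.stats import dirichlet

rng = np.random.default_rng(5)
T, C = 10, 3                              # topics (finite truncation) and existing basic categories
alpha = 5.0

pi_c = rng.dirichlet(np.ones(T), size=C)  # class-specific topic proportions pi_c^(1) (random stand-ins)
counts = np.array([50.0, 30.0, 20.0])     # documents already assigned to each category (CRP counts)
theta_n = rng.dirichlet(np.ones(T))       # current stick-breaking weights theta_n for document n

# Eq. 10: p(z_n = c | ...) is proportional to p(theta_n | pi_c^(1)) * p(z_n = c | z_-n).
# The prior term for an existing category is proportional to its count; the extra
# term for a brand-new category is not computed in this sketch.
log_post = np.array([dirichlet.logpdf(theta_n, alpha * pi_c[c]) + np.log(counts[c])
                     for c in range(C)])
post = np.exp(log_post - log_post.max())
post /= post.sum()
z_n = rng.choice(C, p=post)
```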
57 Sampling DBM's hidden units: Given the states of the DBM's top-level multinomial unit h^3_n, conditional samples from P(h^1_n, h^2_n | h^3_n, v_n) can be obtained by running a Gibbs sampler that alternates between sampling the states of h^1_n independently given h^2_n, and vice versa. [sent-155, score-0.222]
58 Conditioned on the topic assignments x_{in} and h^2_n, the states of the multinomial unit h^3_n for each input n are sampled using Gibbs conditionals: P(h^3_{in} | h^2_n, h^3_{−in}, x_n) ∝ P(h^2_n | h^3_n) P(h^3_{in} | x_{in}), (11) where the first term is given by the product of logistic functions (see Eq. [sent-156, score-0.402]
59 2, followed by running the full Gibbs sampler to get approximate samples from the posterior over the category assignments. [sent-163, score-0.191]
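A sketch of the Gibbs conditional in Eq. 11 for resampling a single top-level word; the sizes, weights, and current state are made up, and the bias terms and the h^1 contribution are dropped.

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

F2, K, M, T = 200, 200, 50, 10
W3 = 0.01 * rng.standard_normal((F2, K))       # weights between h2 and the K top-level features
phi = rng.dirichlet(np.ones(K), size=T)        # topic-specific distributions over the K features

h2 = rng.binomial(1, 0.5, size=F2).astype(float)
counts = rng.multinomial(M, np.ones(K) / K)    # current top-level word counts (the state of h3)
x_in = 7                                       # topic assignment of the word being resampled
k_old = int(np.flatnonzero(counts)[0])         # remove one occurrence of an existing word
counts_minus = counts.copy()
counts_minus[k_old] -= 1

base = W3 @ counts_minus                       # contribution of the remaining M-1 words to h2's input
log_probs = np.empty(K)
for k in range(K):                             # Eq. 11: P(h3_in = k | ...) ~ P(h2 | h3 with word k) * phi[x_in, k]
    act = base + W3[:, k]
    log_probs[k] = np.sum(h2 * np.log(sigmoid(act)) + (1 - h2) * np.log(sigmoid(-act))) + np.log(phi[x_in, k])

probs = np.exp(log_probs - log_probs.max())
probs /= probs.sum()
k_new = rng.choice(K, p=probs)
```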
60 In practice, for faster inference, we fix the learned topics φ_t and approximate the marginal likelihood that h^3_t belongs to category z_t by assuming that the document-specific DP can be well approximated by the class-specific DP, G_t ≈ G^(1)_{z_t} (see Fig. [sent-164, score-0.362]
61 1): P(h^3_t | z_t, G^(1), φ) = ∫_{G_t} P(h^3_t | φ, G_t) P(G_t | G^(1)_{z_t}) dG_t ≈ P(h^3_t | φ, G^(1)_{z_t}), (13) Combining this likelihood term with the nCRP prior P(z_t | z_{−t}) (Eq. [sent-165, score-0.143]
62 8) allows us to efficiently infer an approximate posterior over category assignments. [sent-166, score-0.153]
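A sketch of that approximate classification rule for a novel example, with made-up topics, class proportions, and nCRP counts; the branch that opens a brand-new class is omitted.

```python
import numpy as np

rng = np.random.default_rng(7)
T, K, M, C = 10, 200, 50, 5
phi = rng.dirichlet(np.ones(K), size=T)           # fixed learned topics (random stand-ins here)
pi_c = rng.dirichlet(np.ones(T), size=C)          # class-specific topic proportions, standing in for G^(1)_z
counts = np.array([40.0, 25.0, 20.0, 10.0, 5.0])  # nCRP counts of the existing basic-level classes
gamma = 1.0                                       # nCRP mass for a brand-new class (that branch is not scored)

words = rng.choice(K, size=M)                     # the M top-level "words" h3 of a test example

# Approximate Eq. 13: log P(h3 | z = c) ~= sum_i log sum_t pi_c[c, t] * phi[t, word_i],
# i.e. the document-specific DP is replaced by the class-specific one.
log_lik = np.array([np.sum(np.log(pi_c[c] @ phi[:, words])) for c in range(C)])
log_post = log_lik + np.log(counts / (counts.sum() + gamma))   # combine with the nCRP prior over classes
post = np.exp(log_post - log_post.max())
post /= post.sum()
z_new = int(rng.choice(C, p=post))
```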
63 [Figure panel labels: 1st-layer DBM features; 2nd-layer DBM features; HDP high-level features.] [sent-171, score-0.462]
64 [Figure 2 caption: A random subset of the 1st and 2nd layer DBM features, and higher-level class-sensitive HDP features/topics. Figure 3 caption: A typical partition of the 100 basic-level categories.] [sent-184, score-0.153]
65 5 Experiments We present experimental results on the CIFAR-100 [8], handwritten character [9], and human motion capture recognition datasets. [sent-185, score-0.199]
66 For all datasets, we first pretrain a DBM model in unsupervised fashion on raw sensory input (e. [sent-186, score-0.15]
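A minimal sketch of one standard way to do such unsupervised pretraining, greedy layer-wise CD-1 training of stacked binary RBMs; the data, layer sizes, and learning settings are toy values, and the DBM-specific modifications to this recipe used in the paper are not shown.

```python
import numpy as np

rng = np.random.default_rng(9)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pretrain_rbm(data, n_hidden, lr=0.05, epochs=5):
    """Train one binary RBM layer with one step of contrastive divergence (CD-1).
    `data` is a (n_samples, n_visible) binary matrix."""
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_v, b_h = np.zeros(n_vis), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W + b_h)
            h0 = rng.binomial(1, ph0)
            pv1 = sigmoid(h0 @ W.T + b_v)
            v1 = rng.binomial(1, pv1)
            ph1 = sigmoid(v1 @ W + b_h)
            W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
            b_v += lr * (v0 - v1)
            b_h += lr * (ph0 - ph1)
    return W, b_v, b_h

# Toy usage: pretrain a first layer on random binary "sensory input",
# then pretrain a second layer on the first layer's activations.
X = rng.binomial(1, 0.3, size=(200, 64)).astype(float)
W1, _, bh1 = pretrain_rbm(X, n_hidden=32)
H1 = sigmoid(X @ W1 + bh1)
W2, _, _ = pretrain_rbm(rng.binomial(1, H1).astype(float), n_hidden=16)
```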
67 Across all datasets, we also assume that the basic-level category labels are given, but no super-category labels are available. [sent-191, score-0.123]
68 The training set includes many examples of familiar categories but only a few examples of a novel class. [sent-192, score-0.225]
69 1 CIFAR-100 dataset The CIFAR-100 image dataset [8] contains 50,000 training and 10,000 test images of 100 object categories (100 per class), with 32 × 32 × 3 RGB pixels. [sent-201, score-0.243]
70 The first DBM layer contained 10,000 binary hidden units, and the second layer contained M=1000 softmax units, each defining a distribution over 10, 000 second layer features6 . [sent-205, score-0.564]
71 Fig. 2 displays a random subset of the 1st and 2nd layer DBM features, as well as higher-level class-sensitive features, or topics, learned by the HDP model. [sent-208, score-0.202]
72 To visualize a particular higher-level feature, we first sample M words from a fixed topic φt , followed by sampling RGB pixel values from the conditional DBM model. [sent-209, score-0.158]
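A rough sketch of that visualization loop; the sizes and weights are made up, a single deterministic top-down pass stands in for the conditional Gibbs chain, and real CIFAR pixels would be modeled with Gaussian rather than binary visible units.

```python
import numpy as np

rng = np.random.default_rng(8)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

D, F1, K, M = 3072, 1000, 500, 100               # pixels (32x32x3), first layer, top-level features, draws
W1 = 0.01 * rng.standard_normal((D, F1))
W2 = 0.01 * rng.standard_normal((F1, K))
phi_t = rng.dirichlet(np.ones(K))                # a fixed learned topic (here just a random stand-in)

h3 = rng.multinomial(M, phi_t).astype(float)     # sample M "words" from the topic -> top-level counts
h1 = rng.binomial(1, sigmoid(W2 @ h3))           # propagate down through the conditional DBM
v_mean = sigmoid(W1 @ h1)                        # mean pixel activations
image = v_mean.reshape(32, 32, 3)
```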
73 While DBM features capture mostly low-level structure, including edges and corners, the HDP features tend to capture higher-level structure, including contours, shapes, color components, and surface boundaries. [sent-210, score-0.288]
74 More importantly, features at all levels of the hierarchy evolve without incorporating any image-specific priors. [sent-211, score-0.201]
75 We trained the model on 99 classes containing 500 training images each, but only one training example of a “pear” class. [sent-215, score-0.182]
76 Hence the novel category can inherit . . . [Footnote 5: The dataset contains random images of natural scenes downloaded from the web. Footnote 6: We also experimented with a 3-layer DBM model, as well as various softmax parameters: M = 500 and M = 2000.] [sent-218, score-0.381]
77 [Figure 4 panel labels and legend: Learning with 3 examples; Shared HDP high-level features; Shape; Color; HDP-DBM / DBM / SVM; axis: Sorted Class Index.] [sent-220, score-0.153]
78 Figure 4: Left: Training examples along with eight most probable topics φ_t, ordered by hand. [sent-238, score-0.148]
79 [Table 1: recognition results for Tuned HDP-DBM, HDP-DBM, Flat HDP-DBM, DBM, DBN, SVM, 1-NN, and GIST on the CIFAR, Handwritten Characters, and Motion Capture datasets, as the number of training examples of the novel class varies (1, 3, 5, 10, 50); numeric entries omitted.] [sent-241, score-0.138]
80 Table 1 also shows that fine-tuning parameters of all layers jointly as well as learning super-category hierarchy significantly improves model performance. [sent-356, score-0.138]
81 2 Handwritten Characters The handwritten characters dataset [9] can be viewed as the “transpose” of MNIST. [sent-362, score-0.209]
82 Instead of containing 60,000 images of 10 digit classes, the dataset contains 30,000 images of 1500 characters (20 examples each) with 28 × 28 pixels. [sent-363, score-0.301]
83 These characters are from 50 alphabets from around the world, including Bengali, Cyrillic, Arabic, Sanskrit, Tagalog (see Fig. [sent-364, score-0.125]
84 We split the dataset into 15,000 training and 15,000 test images (10 examples of each class). [sent-366, score-0.175]
85 Similar to the CIFAR dataset, we pretrain a two-layer DBM model, with the first layer containing 1000 hidden units, and the second layer containing M=100 softmax units, each defining a distribution over 1000 second layer features. [sent-367, score-0.6]
86 2 displays a random subset of training images, along with the 1st and 2nd layer DBM features, as well as higher-level class-sensitive HDP features. [sent-369, score-0.208]
87 The HDP features tend to capture higher-level parts, many of which resemble pen “strokes”. [sent-370, score-0.144]
88 The HDP-DBM model significantly outperforms other methods, particularly when learning characters with few training examples. [sent-373, score-0.172]
89 [Figure 6 panel labels: Learned Super-Classes (by row); Learning with 3 examples; Sampled Novel Characters; Training Examples; Conditional Samples.] Figure 6: Left: Learned super-classes along with examples of novel characters, generated by the model for the same super-class. [sent-376, score-0.138]
90 Right: Three training examples along with 8 conditional samples. [sent-377, score-0.128]
91 3 Motion capture We next applied our model to human motion capture data consisting of sequences of 3D joint angles plus body orientation and translation [18]. [sent-382, score-0.159]
92 For the two-layer DBM model, the first layer contained 500 hidden units, with the second layer containing M =50 softmax units, each defining a distribution over 500 second layer features. [sent-387, score-0.564]
93 The difference is particularly large in the regime where we observe only a handful of training examples of a novel walking style. [sent-390, score-0.182]
94 6 Conclusions We developed a compositional architecture that learns an HDP prior over the activities of top-level features of the DBM model. [sent-391, score-0.308]
95 The resulting compound HDP-DBM model is able to learn low-level features from raw sensory input, high-level features, as well as a category hierarchy for parameter sharing. [sent-392, score-0.542]
96 Indeed, any other deep learning module, including Deep Belief Networks, sparse auto-encoders, or any other hierarchical Bayesian model can be adapted. [sent-395, score-0.282]
97 The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. [sent-409, score-0.339]
98 Modeling human transfer learning with the hierarchical dirichlet process. [sent-415, score-0.2]
99 Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. [sent-460, score-0.387]
100 80 million tiny images: a large dataset for nonparametric object and scene recognition. [sent-554, score-0.163]
wordName wordTfidf (topN-words)
[('dbm', 0.601), ('hdp', 0.3), ('xin', 0.22), ('deep', 0.196), ('units', 0.148), ('softmax', 0.13), ('dbms', 0.128), ('characters', 0.125), ('layer', 0.124), ('category', 0.123), ('gn', 0.112), ('zn', 0.11), ('features', 0.107), ('compound', 0.104), ('topics', 0.102), ('dir', 0.101), ('topic', 0.095), ('hierarchy', 0.094), ('boltzmann', 0.089), ('multinomial', 0.087), ('hierarchical', 0.086), ('mult', 0.083), ('dirichlet', 0.081), ('auroc', 0.08), ('pear', 0.073), ('zb', 0.073), ('crp', 0.065), ('gt', 0.065), ('hidden', 0.062), ('hd', 0.062), ('activities', 0.06), ('nested', 0.059), ('dp', 0.057), ('nonparametric', 0.057), ('compositional', 0.055), ('prior', 0.055), ('gzt', 0.054), ('wlm', 0.054), ('restaurant', 0.054), ('document', 0.052), ('handwritten', 0.05), ('images', 0.048), ('hb', 0.048), ('training', 0.047), ('sensory', 0.047), ('novel', 0.046), ('examples', 0.046), ('layers', 0.044), ('zt', 0.044), ('angles', 0.044), ('walking', 0.043), ('torralba', 0.043), ('chinese', 0.042), ('zc', 0.041), ('flat', 0.041), ('motion', 0.041), ('learned', 0.041), ('object', 0.04), ('categories', 0.04), ('classes', 0.04), ('cifar', 0.039), ('gist', 0.039), ('sampler', 0.038), ('learn', 0.038), ('unsupervised', 0.038), ('belief', 0.038), ('gibbs', 0.037), ('displays', 0.037), ('capture', 0.037), ('gc', 0.036), ('pretrain', 0.036), ('styles', 0.036), ('toplevel', 0.036), ('wjl', 0.036), ('tree', 0.036), ('conditional', 0.035), ('dataset', 0.034), ('concepts', 0.033), ('salakhutdinov', 0.033), ('transfer', 0.033), ('representations', 0.033), ('bayesian', 0.032), ('ncrp', 0.032), ('dinosaur', 0.032), ('tiny', 0.032), ('variational', 0.031), ('character', 0.031), ('nk', 0.031), ('architecture', 0.031), ('posterior', 0.03), ('sharing', 0.03), ('conditionals', 0.029), ('truck', 0.029), ('partition', 0.029), ('place', 0.029), ('conditioned', 0.029), ('raw', 0.029), ('cognitive', 0.029), ('networks', 0.029), ('words', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 156 nips-2011-Learning to Learn with Compound HD Models
Author: Antonio Torralba, Joshua B. Tenenbaum, Ruslan Salakhutdinov
Abstract: We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets. 1
2 0.20175479 250 nips-2011-Shallow vs. Deep Sum-Product Networks
Author: Olivier Delalleau, Yoshua Bengio
Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1
3 0.18258248 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
Author: David P. Reichert, Peggy Series, Amos J. Storkey
Abstract: It has been argued that perceptual multistability reflects probabilistic inference performed by the brain when sensory input is ambiguous. Alternatively, more traditional explanations of multistability refer to low-level mechanisms such as neuronal adaptation. We employ a Deep Boltzmann Machine (DBM) model of cortical processing to demonstrate that these two different approaches can be combined in the same framework. Based on recent developments in machine learning, we show how neuronal adaptation can be understood as a mechanism that improves probabilistic, sampling-based inference. Using the ambiguous Necker cube image, we analyze the perceptual switching exhibited by the model. We also examine the influence of spatial attention, and explore how binocular rivalry can be modeled with the same approach. Our work joins earlier studies in demonstrating how the principles underlying DBMs relate to cortical processing, and offers novel perspectives on the neural implementation of approximate probabilistic inference in the brain. 1
4 0.15659466 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation
Author: David Sontag, Dan Roy
Abstract: We consider the computational complexity of probabilistic inference in Latent Dirichlet Allocation (LDA). First, we study the problem of finding the maximum a posteriori (MAP) assignment of topics to words, where the document’s topic distribution is integrated out. We show that, when the effective number of topics per document is small, exact inference takes polynomial time. In contrast, we show that, when a document has a large number of topics, finding the MAP assignment of topics to words in LDA is NP-hard. Next, we consider the problem of finding the MAP topic distribution for a document, where the topic-word assignments are integrated out. We show that this problem is also NP-hard. Finally, we briefly discuss the problem of sampling from the posterior, showing that this is NP-hard in one restricted setting, but leaving open the general question. 1
5 0.14863013 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
Author: Dae I. Kim, Erik B. Sudderth
Abstract: Topic models are learned via a statistical model of variation within document collections, but designed to extract meaningful semantic structure. Desirable traits include the ability to incorporate annotations or metadata associated with documents; the discovery of correlated patterns of topic usage; and the avoidance of parametric assumptions, such as manual specification of the number of topics. We propose a doubly correlated nonparametric topic (DCNT) model, the first model to simultaneously capture all three of these properties. The DCNT models metadata via a flexible, Gaussian regression on arbitrary input features; correlations via a scalable square-root covariance representation; and nonparametric selection from an unbounded series of potential topics via a stick-breaking construction. We validate the semantic structure and predictive performance of the DCNT using a corpus of NIPS documents annotated by various metadata. 1
6 0.14667539 244 nips-2011-Selecting Receptive Fields in Deep Networks
7 0.14528532 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices
8 0.11241757 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
9 0.1093654 129 nips-2011-Improving Topic Coherence with Regularized Topic Models
10 0.10661249 261 nips-2011-Sparse Filtering
11 0.099735662 200 nips-2011-On the Analysis of Multi-Channel Neural Spike Data
12 0.097802997 141 nips-2011-Large-Scale Category Structure Aware Image Categorization
13 0.094585799 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
14 0.093728177 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
15 0.087824099 113 nips-2011-Hierarchical Matching Pursuit for Image Classification: Architecture and Fast Algorithms
16 0.082663126 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
17 0.081087045 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
18 0.080273598 104 nips-2011-Generalized Beta Mixtures of Gaussians
19 0.079394974 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
20 0.078813553 287 nips-2011-The Manifold Tangent Classifier
topicId topicWeight
[(0, 0.205), (1, 0.135), (2, -0.004), (3, 0.11), (4, 0.011), (5, -0.238), (6, 0.176), (7, 0.107), (8, -0.11), (9, -0.101), (10, 0.027), (11, -0.058), (12, 0.035), (13, -0.043), (14, -0.013), (15, -0.044), (16, -0.013), (17, 0.033), (18, -0.065), (19, 0.11), (20, 0.067), (21, 0.063), (22, -0.151), (23, -0.032), (24, -0.031), (25, -0.023), (26, 0.081), (27, -0.084), (28, 0.018), (29, 0.024), (30, 0.061), (31, 0.054), (32, -0.008), (33, 0.1), (34, 0.109), (35, 0.126), (36, -0.056), (37, 0.008), (38, -0.007), (39, 0.073), (40, 0.049), (41, 0.061), (42, 0.03), (43, -0.004), (44, 0.051), (45, -0.004), (46, 0.04), (47, 0.003), (48, 0.054), (49, -0.109)]
simIndex simValue paperId paperTitle
same-paper 1 0.93602133 156 nips-2011-Learning to Learn with Compound HD Models
Author: Antonio Torralba, Joshua B. Tenenbaum, Ruslan Salakhutdinov
Abstract: We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets. 1
2 0.73908764 250 nips-2011-Shallow vs. Deep Sum-Product Networks
Author: Olivier Delalleau, Yoshua Bengio
Abstract: We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and contribute to motivate recent research involving learning of deep sum-product networks, and more generally motivate research in Deep Learning. 1 Introduction and prior work Many learning algorithms are based on searching a family of functions so as to identify one member of said family which minimizes a training criterion. The choice of this family of functions and how members of that family are parameterized can be a crucial one. Although there is no universally optimal choice of parameterization or family of functions (or “architecture”), as demonstrated by the no-free-lunch results [37], it may be the case that some architectures are appropriate (or inappropriate) for a large class of learning tasks and data distributions, such as those related to Artificial Intelligence (AI) tasks [4]. Different families of functions have different characteristics that can be appropriate or not depending on the learning task of interest. One of the characteristics that has spurred much interest and research in recent years is depth of the architecture. In the case of a multi-layer neural network, depth corresponds to the number of (hidden and output) layers. A fixedkernel Support Vector Machine is considered to have depth 2 [4] and boosted decision trees to have depth 3 [7]. Here we use the word circuit or network to talk about a directed acyclic graph, where each node is associated with some output value which can be computed based on the values associated with its predecessor nodes. The arguments of the learned function are set at the input nodes of the circuit (which have no predecessor) and the outputs of the function are read off the output nodes of the circuit. Different families of functions correspond to different circuits and allowed choices of computations in each node. Learning can be performed by changing the computation associated with a node, or rewiring the circuit (possibly changing the number of nodes). The depth of the circuit is the length of the longest path in the graph from an input node to an output node. Deep Learning algorithms [3] are tailored to learning circuits with variable depth, typically greater than depth 2. They are based on the idea of multiple levels of representation, with the intuition that the raw input can be represented at different levels of abstraction, with more abstract features of the input or more abstract explanatory factors represented by deeper circuits. These algorithms are often based on unsupervised learning, opening the door to semi-supervised learning and efficient 1 use of large quantities of unlabeled data [3]. Analogies with the structure of the cerebral cortex (in particular the visual cortex) [31] and similarities between features learned with some Deep Learning algorithms and those hypothesized in the visual cortex [17] further motivate investigations into deep architectures. 
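The circuit view introduced above — a directed acyclic graph of units, where depth is the length of the longest path from an input node to the output node — is easy to make concrete. The sketch below is ours, not code from the paper; the dictionary encoding of the DAG is an assumption chosen purely for illustration.

```python
from functools import lru_cache

# Hypothetical encoding: each node maps to the list of its predecessor nodes.
# Input nodes have no predecessors; the circuit's depth is the longest path
# (counted in edges) from an input node to the output node.
circuit = {
    "x1": [], "x2": [], "x3": [], "x4": [],    # input nodes
    "h1": ["x1", "x2"], "h2": ["x3", "x4"],    # one hidden layer
    "out": ["h1", "h2"],                       # output node
}

@lru_cache(maxsize=None)
def depth(node):
    predecessors = circuit[node]
    if not predecessors:      # input node
        return 0
    return 1 + max(depth(p) for p in predecessors)

print(depth("out"))  # 2 -> a depth-2 ("shallow") architecture in this terminology
```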
It has been suggested that deep architectures are more powerful in the sense of being able to more efficiently represent highly-varying functions [4, 3]. In this paper, we measure “efficiency” in terms of the number of computational units in the network. An efficient representation is important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a fixed amount of training samples and computational power, better generalization is expected. The first successful algorithms for training deep architectures appeared in 2006, with efficient training procedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting the general idea of greedy layer-wise pre-training [6]. Since then, these ideas have been investigated further and applied in many settings, demonstrating state-of-the-art learning performance in object recognition [16, 28, 18, 15] and segmentation [20], audio classification [19, 10], natural language processing [9, 36, 21, 32], collaborative filtering [30], modeling textures [24], modeling motion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22]. Poon and Domingos [25] introduced deep sum-product networks as a method to compute partition functions of tractable graphical models. These networks are analogous to traditional artificial neural networks but with nodes that compute either products or weighted sums of their inputs. Analogously to neural networks, we define “hidden” nodes as those nodes that are neither input nodes nor output nodes. If the nodes are organized in layers, we define the “hidden” layers to be those that are neither the input layer nor the output layer. Poon and Domingos [25] report experiments with networks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief Networks [14, 3], where the number of hidden layers is usually on the order of three to five. Whether such deep architectures have theoretical advantages compared to so-called “shallow” architectures (i.e. those with a single hidden layer) remains an open question. After all, in the case of a sum-product network, the output value can always be written as a sum of products of input variables (possibly raised to some power by allowing multiple connections from the same input), and consequently it is easily rewritten as a shallow network with a sum output unit and product hidden units. The argument supported by our theoretical analysis is that a deep architecture is able to compute some functions much more efficiently than a shallow one. Until recently, very few theoretical results supported the idea that deep architectures could present an advantage in terms of representing some functions more efficiently. Most related results originate from the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the proof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of gates [1, 38], and more generally that there exist functions computable with a polynomial-size depthk circuit that would require exponential size when restricted to depth k − 1 [11]. Another recent result on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that bounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions from the uniform distribution (i.e. they are “fooled” by such input distributions). 
In particular, Braverman’s result suggests that shallow circuits can in general be fooled more easily than deep ones, i.e., that they would have more difficulty efficiently representing high-order dependencies (those involving many input variables). It is not obvious that circuit complexity results (that typically consider only boolean or at least discrete nodes) are directly applicable in the context of typical machine learning algorithms such as neural networks (that compute continuous representations of their input). Orponen [23] surveys theoretical results in computational complexity that are relevant to learning algorithms. For instance, H˚ stad and Goldmann [12] extended some results to the case of networks of linear threshold units a with positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity issues in networks of Gaussian radial basis functions and decision trees, showing intrinsic limitations of these architectures e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally discuss the advantages of depth in boolean circuit in the context of learning architectures. Bengio [3] suggests that some polynomials could be represented more efficiently by deep sumproduct networks, but without providing any formal statement or proofs. This work partly addresses this void by demonstrating families of circuits for which a deep architecture can be exponentially more efficient than a shallow one in the context of real-valued polynomials. Note that we do not address in this paper the problem of learning these parameters: even if an efficient deep representation exists for the function we seek to approximate, in general there is no 2 guarantee for standard optimization algorithms to easily converge to this representation. This paper focuses on the representational power of deep sum-product circuits compared to shallow ones, and studies it by considering particular families of target functions (to be represented by the learner). We first formally define sum-product networks. We consider two families of functions represented by deep sum-product networks (families F and G). For each family, we establish a lower bound on the minimal number of hidden units a depth-2 sum-product network would require to represent a function of this family, showing it is much less efficient than the deep representation. 2 Sum-product networks Definition 1. A sum-product network is a network composed of units that either compute the product of their inputs or a weighted sum of their inputs (where weights are strictly positive). Here, we restrict our definition of the generic term “sum-product network” to networks whose summation units have positive incoming weights1 , while others are called “negative-weight” networks. Definition 2. A “negative-weight“ sum-product network may contain summation units whose weights are non-positive (i.e. less than or equal to zero). Finally, we formally define what we mean by deep vs. shallow networks in the rest of the paper. Definition 3. A “shallow“ sum-product network contains a single hidden layer (i.e. a total of three layers when counting the input and output layers, and a depth equal to two). Definition 4. A “deep“ sum-product network contains more than one hidden layer (i.e. a total of at least four layers, and a depth at least three). 
3 The family F

3.1 Definition

The first family of functions we study, denoted by F, is made of functions built from deep sum-product networks that alternate layers of product and sum units with two inputs each (details are provided below). The basic idea we use here is that composing layers (i.e. using a deep architecture) is equivalent to using a factorized representation of the polynomial function computed by the network. Such a factorized representation can be exponentially more compact than its expansion as a sum of products (which can be associated to a shallow network with product units in its hidden layer and a sum unit as output). This is what we formally show in what follows.

[Figure 1: Sum-product network computing the function f ∈ F such that i = 1 and λ11 = µ11 = 1. The two product units compute ℓ^1_1 = x1 x2 and ℓ^1_2 = x3 x4, and the output sum unit computes ℓ^2_1 = λ11 ℓ^1_1 + µ11 ℓ^1_2 = x1 x2 + x3 x4 = f(x1, x2, x3, x4).]

Let n = 4^i, with i a positive integer value. Denote by ℓ^0 the input layer containing scalar variables {x1, . . . , xn}, such that ℓ^0_j = xj for 1 ≤ j ≤ n. Now define f ∈ F as any function computed by a sum-product network (deep for i ≥ 2) composed of alternating product and sum layers:

• ℓ^{2k+1}_j = ℓ^{2k}_{2j−1} · ℓ^{2k}_{2j} for 0 ≤ k ≤ i − 1 and 1 ≤ j ≤ 2^{2(i−k)−1}
• ℓ^{2k}_j = λ_{jk} ℓ^{2k−1}_{2j−1} + µ_{jk} ℓ^{2k−1}_{2j} for 1 ≤ k ≤ i and 1 ≤ j ≤ 2^{2(i−k)}

where the weights λ_{jk} and µ_{jk} of the summation units are strictly positive. The output of the network is given by f(x1, . . . , xn) = ℓ^{2i}_1 ∈ R, the unique unit in the last layer. The corresponding (shallow) network for i = 1 and additive weights set to one is shown in Figure 1 (this architecture is also the basic building block of bigger networks for i > 1). Note that both the input size n = 4^i and the network's depth 2i increase with parameter i. (Footnote 1: This condition is required by some of the proofs presented here.)

3.2 Theoretical results

The main result of this section is presented below in Corollary 1, providing a lower bound on the minimum number of hidden units required by a shallow sum-product network to represent a function f ∈ F. The high-level proof sketch consists in the following steps: (1) Count the number of unique products found in the polynomial representation of f (Lemma 1 and Proposition 1). (2) Show that the only possible architecture for a shallow sum-product network to compute f is to have a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5). (3) Conclude that the number of hidden units must be at least the number of unique products computed in step (1) (Lemma 6 and Corollary 1).

Lemma 1. Any element ℓ^k_j can be written as a (positively) weighted sum of products of input variables, such that each input variable xt is used in exactly one unit of ℓ^k. Moreover, the number m_k of products found in the sum computed by ℓ^k_j does not depend on j and obeys the following recurrence rule for k ≥ 0: if k + 1 is odd, then m_{k+1} = m_k^2, otherwise m_{k+1} = 2 m_k.

Proof. We prove the lemma by induction on k. It is obviously true for k = 0 since ℓ^0_j = xj. Assuming this is true for some k ≥ 0, we consider two cases:

• If k + 1 is odd, then ℓ^{k+1}_j = ℓ^k_{2j−1} · ℓ^k_{2j}. By the inductive hypothesis, it is the product of two (positively) weighted sums of products of input variables, and no input variable can appear in both ℓ^k_{2j−1} and ℓ^k_{2j}, so the result is also a (positively) weighted sum of products of input variables. 
Additionally, if the number of products in ℓk 2j−1 and ℓ2j is mk , then 2 mk+1 = mk , since all products involved in the multiplication of the two units are different (since they use disjoint subsets of input variables), and the sums have positive weights. Finally, by the induction assumption, an input variable appears in exactly one unit of ℓk . This unit is an input to a single unit of ℓk+1 , that will thus be the only unit of ℓk+1 where this input variable appears. k • If k + 1 is even, then ℓk+1 = λjk ℓk 2j−1 + µjk ℓ2j . Again, from the induction assumption, it j must be a (positively) weighted sum of products of input variables, but with mk+1 = 2mk such products. As in the previous case, an input variable will appear in the single unit of ℓk+1 that has as input the single unit of ℓk in which this variable must appear. 2i Proposition 1. The number of products in the sum computed in the output unit l1 of a network √ n−1 . computing a function in F is m2i = 2 Proof. We first prove by induction on k ≥ 1 that for odd k, mk = 22 k 22 1+1 2 2 k+1 2 −2 , and for even k, . This is obviously true for k = 1 since 2 = 2 = 1, and all units in ℓ1 are mk = 2 single products of the form xr xs . Assuming this is true for some k ≥ 1, then: −1 0 −2 • if k + 1 is odd, then from Lemma 1 and the induction assumption, we have: mk+1 = m2 = k 2 k 22 2 −1 k +1 = 22 2 • if k + 1 is even, then instead we have: mk+1 = 2mk = 2 · 22 k+1 2 −2 −2 = 22 = 22 (k+1)+1 2 (k+1) 2 −2 −1 which shows the desired result for k + 1, and thus concludes the induction proof. Applying this result with k = 2i (which is even) yields 2i m2i = 22 2 −1 √ =2 4 22i −1 √ =2 n−1 . 2i Lemma 2. The products computed in the output unit l1 can be split in two groups, one with products containing only variables x1 , . . . , x n and one containing only variables x n +1 , . . . , xn . 2 2 Proof. This is obvious since the last unit is a “sum“ unit that adds two terms whose inputs are these two groups of variables (see e.g. Fig. 1). 2i Lemma 3. The products computed in the output unit l1 involve more than one input variable. k Proof. It is straightforward to show by induction on k ≥ 1 that the products computed by lj all involve more than one input variable, thus it is true in particular for the output layer (k = 2i). Lemma 4. Any shallow sum-product network computing f ∈ F must have a “sum” unit as output. Proof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative. This unit must have more than one input, because in the case that it has only one input, the output would be either a (weighted) sum of input variables (which would violate Lemma 3), or a single product of input variables (which would violate Proposition 1), depending on the type (sum or product) of the single input hidden unit. Thus the last unit must compute a product of two or more hidden units. It can be re-written as a product of two factors, where each factor corresponds to either one hidden unit, or a product of multiple hidden units (it does not matter here which specific factorization is chosen among all possible ones). Regardless of the type (sum or product) of the hidden units involved, those two factors can thus be written as weighted sums of products of variables xt (with positive weights, and input variables potentially raised to powers above one). From Lemma 1, both x1 and xn must be present in the final output, and thus they must appear in at least one of these two factors. 
Without loss of generality, assume x1 appears in the first factor. Variables x n +1 , . . . , xn then cannot be present in the second factor, since otherwise one product in the output 2 would contain both x1 and one of these variables (this product cannot cancel out since weights must be positive), violating Lemma 2. But with a similar reasoning, since as a result xn must appear in the first factor, variables x1 , . . . , x n cannot be present in the second factor either. Consequently, no 2 input variable can be present in the second factor, leading to the desired contradiction. Lemma 5. Any shallow sum-product network computing f ∈ F must have only multiplicative units in its hidden layer. Proof. By contradiction, suppose there exists a “sum“ unit in the hidden layer, written s = t∈S αt xt with S the set of input indices appearing in this sum, and αt > 0 for all t ∈ S. Since according to Lemma 4 the output unit must also be a sum (and have positive weights according to Definition 1), then the final output will also contain terms of the form βt xt for t ∈ S, with βt > 0. This violates Lemma 3, establishing the contradiction. Lemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing f ∈ F √ must have at least 2 n−1 hidden units, if its output unit is a sum and its hidden units are products. Proof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a γ product of input variables, i.e. its output can be written as Σj wj Πt xt jt with wj ∈ R and γjt ∈ {0, 1}. In order to compute a function in F, this shallow network thus needs a number of hidden units at least equal to the number of unique products in that function. From Proposition 1, this √ number is equal to 2 n−1 . √ Corollary 1. Any shallow sum-product network computing f ∈ F must have at least 2 units. n−1 hidden Proof. This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that hidden units are products), and 6 (showing the desired result for any shallow network with this specific structure – regardless of the sign of weights). 5 3.3 Discussion Corollary 1 above shows that in order to compute some function in F with n inputs, the number of √ √ units in a shallow network has to be at least 2 n−1 , (i.e. grows exponentially in n). On another hand, the total number of units in the deep (for i > 1) network computing the same function, as described in Section 3.1, is equal to 1 + 2 + 4 + 8 + . . . + 22i−1 (since all units are binary), which is √ also equal to 22i − 1 = n − 1 (i.e. grows only quadratically in n). It shows that some deep sumproduct network with n inputs and depth O(log n) can represent with O(n) units what would √ require O(2 n ) units for a depth-2 network. Lemma 6 also shows a similar result regardless of the sign of the weights in the summation units of the depth-2 network, but assumes a specific architecture for this network (products in the hidden layer with a sum as output). 4 The family G In this section we present similar results with a different family of functions, denoted by G. Compared to F, one important difference of deep sum-product networks built to define functions in G is that they can vary their input size independently of their depth. Their analysis thus provides additional insight when comparing the representational efficiency of deep vs. shallow sum-product networks in the case of a fixed dataset. 
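Before turning to family G, the size gap stated in Section 3.3 can be checked numerically for small i. The sketch below builds the deep family-F network with all summation weights set to 1, expands it symbolically, and compares the number of deep units (n − 1) with the number of distinct products in the expansion (2^(√n − 1)), which lower-bounds the hidden-layer size of a shallow network. The polynomial encoding and function names are ours and assume unit weights; this is an illustrative sketch, not code from the paper.

```python
import itertools

def poly_var(name):
    # A polynomial is a dict mapping a frozenset of variable names to its coefficient;
    # in family F every variable appears at most once per product, so sets suffice.
    return {frozenset([name]): 1.0}

def poly_add(p, q):
    out = dict(p)
    for mono, c in q.items():
        out[mono] = out.get(mono, 0.0) + c
    return out

def poly_mul(p, q):
    out = {}
    for (m1, c1), (m2, c2) in itertools.product(p.items(), q.items()):
        mono = m1 | m2  # the two factors use disjoint variables in family F
        out[mono] = out.get(mono, 0.0) + c1 * c2
    return out

def family_f(i):
    """Deep family-F network on n = 4**i inputs with unit summation weights.
    Returns the expanded polynomial and the number of computational units."""
    layer = [poly_var(f"x{t}") for t in range(1, 4**i + 1)]
    units = 0
    for _ in range(i):
        layer = [poly_mul(layer[2*j], layer[2*j + 1]) for j in range(len(layer) // 2)]
        units += len(layer)                       # product layer
        layer = [poly_add(layer[2*j], layer[2*j + 1]) for j in range(len(layer) // 2)]
        units += len(layer)                       # sum layer (weights = 1)
    return layer[0], units

for i in (1, 2):
    expansion, deep_units = family_f(i)
    n = 4**i
    shallow_lower_bound = 2 ** (round(n**0.5) - 1)
    print(n, deep_units, len(expansion), shallow_lower_bound)
    # n = 4:  3 deep units,  2 distinct products  (= 2**(sqrt(4) - 1))
    # n = 16: 15 deep units, 8 distinct products  (= 2**(sqrt(16) - 1))
```

For i = 3 (n = 64) the same count already reaches 2^7 = 128 distinct products against 63 deep units, matching the gap discussed in Section 3.3 (exponential in √n for the shallow network versus linear in n for the deep one).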
4.1 Definition Networks in family G also alternate sum and product layers, but their units have as inputs all units from the previous layer except one. More formally, define the family G = ∪n≥2,i≥0 Gin of functions represented by sum-product networks, where the sub-family Gin is made of all sum-product networks with n input variables and 2i + 2 layers (including the input layer ℓ0 ), such that: 1. ℓ1 contains summation units; further layers alternate multiplicative and summation units. 2. Summation units have positive weights. 3. All layers are of size n, except the last layer ℓ2i+1 that contains a single sum unit that sums all units in the previous layer ℓ2i . k−1 4. In each layer ℓk for 1 ≤ k ≤ 2i, each unit ℓk takes as inputs {ℓm |m = j}. j An example of a network belonging to G1,3 (i.e. with three layers and three input variables) is shown in Figure 2. ℓ3 = x2 + x2 + x2 + 3(x1x2 + x1x3 + x2x3) = g(x1, x2, x3) 3 2 1 1 + ℓ2 = x2 + x1x2 × 1 1 +x1x3 + x2x3 ℓ1 = x2 + x3 1 × ℓ2 = . . . 2 × ℓ2 = x2 + x1x2 3 3 +x1x3 + x2x3 + + ℓ1 = x1 + x3 2 + ℓ1 = x1 + x2 3 x1 x2 x3 Figure 2: Sum-product network computing a function of G1,3 (summation units’ weights are all 1’s). 4.2 Theoretical results The main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden units of a shallow sum-product network computing g ∈ G. The proof sketch is as follows: 1. We show that the polynomial expansion of g must contain a large set of products (Proposition 2 and Corollary 2). 2. We use both the number of products in that set as well as their degree to establish the desired lower bound (Proposition 3). 6 We will also need the following lemma, which states that when n − 1 items each belong to n − 1 sets among a total of n sets, then we can associate to each item one of the sets it belongs to without using the same set for different items. Lemma 7. Let S1 , . . . , Sn be n sets (n ≥ 2) containing elements of {P1 , . . . , Pn−1 }, such that for any q, r, |{r|Pq ∈ Sr }| ≥ n − 1 (i.e. each element Pq belongs to at least n − 1 sets). Then there exist r1 , . . . , rn−1 different indices such that Pq ∈ Srq for 1 ≤ q ≤ n − 1. Proof. Omitted due to lack of space (very easy to prove by construction). Proposition 2. For any 0 ≤ j ≤ i, and any product of variables P = Πn xαt such that αt ∈ N and t=1 t j 2j whose computed value, when expanded as a weighted t αt = (n − 1) , there exists a unit in ℓ sum of products, contains P among these products. Proof. We prove this proposition by induction on j. First, for j = 0, this is obvious since any P of this form must be made of a single input variable xt , that appears in ℓ0 = xt . t Suppose now the proposition is true for some j < i. Consider a product P = Πn xαt such that t=1 t αt ∈ N and t αt = (n − 1)j+1 . P can be factored in n − 1 sub-products of degree (n − 1)j , β i.e. written P = P1 . . . Pn−1 with Pq = Πn xt qt , βqt ∈ N and t βqt = (n − 1)j for all q. By t=1 the induction hypothesis, each Pq can be found in at least one unit ℓ2j . As a result, by property 4 kq (in the definition of family G), each Pq will also appear in the additive layer ℓ2j+1 , in at least n − 1 different units (the only sum unit that may not contain Pq is the one that does not have ℓ2j as input). kq By Lemma 7, we can thus find a set of units ℓ2j+1 such that for any 1 ≤ q ≤ n − 1, the product rq Pq appears in ℓ2j+1 , with indices rq being different from each other. Let 1 ≤ s ≤ n be such that rq 2(j+1) s = rq for all q. 
Then, from property 4 of family G, the multiplicative unit ℓs computes the n−1 2j+1 product Πq=1 ℓrq , and as a result, when expanded as a sum of products, it contains in particular P1 . . . Pn−1 = P . The proposition is thus true for j + 1, and by induction, is true for all j ≤ i. Corollary 2. The output gin of a sum-product network in Gin , when expanded as a sum of products, contains all products of variables of the form Πn xαt such that αt ∈ N and t αt = (n − 1)i . t=1 t Proof. Applying Proposition 2 with j = i, we obtain that all products of this form can be found in the multiplicative units of ℓ2i . Since the output unit ℓ2i+1 computes a sum of these multiplicative 1 units (weighted with positive weights), those products are also present in the output. Proposition 3. A shallow negative-weight sum-product network computing gin ∈ Gin must have at least (n − 1)i hidden units. Proof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute gin , assuming we allow multiplicative units in the hidden layer in the hidden layer to use powers of their inputs in the product they compute (which we allow here for the proof to be more generic). However, it will require at least as many of these units as the number of unique products that can be found in the expansion of gin . In particular, from Corollary 2, it will require at least the number n of unique tuples of the form (α1 , . . . , αn ) such that αt ∈ N and t=1 αt = (n − 1)i . Denoting ni dni = (n − 1)i , this number is known to be equal to n+dni −1 , and it is easy to verify it is higher d than (or equal to) dni for any n ≥ 2 and i ≥ 0. Now suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit, otherwise it would mean one could factor some input variable xt in the computed function output: this is not possible since by Corollary 2, for any variable xt there exist products in the output function that do not involve xt . So all hidden units must be additive, and since the computed function contains products of degree dni , there must be at least dni such hidden units. 7 4.3 Discussion Proposition 3 shows that in order to compute the same function as gin ∈ Gin , the number of units in the shallow network has to grow exponentially in i, i.e. in the network’s depth (while the deep network’s size grows linearly in i). The shallow network also needs to grow polynomially in the number of input variables n (with a degree equal to i), while the deep network grows only linearly in n. It means that some deep sum-product network with n inputs and depth O(i) can represent with O(ni) units what would require O((n − 1)i ) units for a depth-2 network. Note that in the similar results found for family F, the depth-2 network computing the same function as a function in F had to be constrained to either have a specific combination of sum and hidden units (in Lemma 6) or to have non-negative weights (in Corollary 1). On the contrary, the result presented here for family G holds without requiring any of these assumptions. 5 Conclusion We compared a deep sum-product network and a shallow sum-product network representing the same function, taken from two families of functions F and G. For both families, we have shown that the number of units in the shallow network has to grow exponentially, compared to a linear growth in the deep network, so as to represent the same functions. The deep version thus offers a much more compact representation of the same functions. 
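As a concrete check of the family-G construction summarized above (Section 4.1, Figure 2), the following sketch evaluates the G_{1,3} network with all summation weights equal to 1 and verifies numerically that it matches the stated expansion x1² + x2² + x3² + 3(x1x2 + x1x3 + x2x3). Function and variable names are ours; this is an illustrative sketch, not code from the paper.

```python
import random

def g_network(x):
    """Family-G network for n = 3, i = 1 with all summation weights 1 (Figure 2):
    each unit of a layer takes as inputs all units of the previous layer
    except the one with the same index."""
    n = len(x)
    l1 = [sum(x[m] for m in range(n) if m != j) for j in range(n)]   # sum layer
    l2 = []                                                          # product layer
    for j in range(n):
        prod = 1.0
        for m in range(n):
            if m != j:
                prod *= l1[m]
        l2.append(prod)
    return sum(l2)                                                   # output sum unit

def g_closed_form(x):
    x1, x2, x3 = x
    return x1**2 + x2**2 + x3**2 + 3 * (x1*x2 + x1*x3 + x2*x3)

for _ in range(5):
    point = [random.uniform(-2, 2) for _ in range(3)]
    assert abs(g_network(point) - g_closed_form(point)) < 1e-9
print("network matches the expansion of Figure 2")
```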
This work focuses on two specific families of functions: finding more general parameterization of functions leading to similar results would be an interesting topic for future research. Another open question is whether it is possible to represent such functions only approximately (e.g. up to an error bound ε) with a much smaller shallow network. Results by Braverman [8] on boolean circuits suggest that similar results as those presented in this paper may still hold, but this topic has yet to be formally investigated in the context of sum-product networks. A related problem is also to look into functions defined only on discrete input variables: our proofs do not trivially extend to this situation because we cannot assume anymore that two polynomials yielding the same output values must have the same expansion coefficients (since the number of input combinations becomes finite). Acknowledgments The authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving this manuscript, as well as the anonymous reviewers for their careful reviews. This work was partially funded by NSERC, CIFAR, and the Canada Research Chairs. References [1] Ajtai, M. (1983). Σ¹₁-formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1–48. [2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, pages 1–18. Lecture Notes in Computer Science 1180, Springer Verlag. [3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1), 1–127. Also published as a book. Now Publishers, 2009. [4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press. [5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel machines. In NIPS’05, pages 107–114. MIT Press, Cambridge, MA. [6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In NIPS 19, pages 153–160. MIT Press. [7] Bengio, Y., Delalleau, O., and Simard, C. (2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467. [8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115. [9] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167. [10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems (NIPS). [11] Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press. [12] Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity, 1, 113–129. [13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507. [14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554. 
[15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS’10. [16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML’07, pages 473–480. ACM. [17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS’07, pages 873–880. MIT Press, Cambridge, MA. [18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009. Montreal (Qc), Canada. [19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS’09, pages 1096–1104. [20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta. [21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS’08, pages 1081–1088. [22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML’2009, pages 737–744. [23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110. [24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of markov random field. In NIPS’07, pages 1121–1128, Cambridge, MA. MIT Press. [25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI’2011, Barcelona, Spain. [26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML. [27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS’06, pages 1137–1144. MIT Press. [28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS’07, pages 1185–1192, Cambridge, MA. MIT Press. [29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier. [30] Salakhutdinov, R., Mnih, A., and Hinton, G. E. (2007). Restricted Boltzmann machines for collaborative filtering. In ICML 2007, pages 791–798, New York, NY, USA. [31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56. [32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML’2011. [33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032. [34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS’06, pages 1345–1352. MIT Press, Cambridge, MA. [35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539. [36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA. 
[37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390. [38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10. 9
3 0.6796506 184 nips-2011-Neuronal Adaptation for Sampling-Based Probabilistic Inference in Perceptual Bistability
Author: David P. Reichert, Peggy Series, Amos J. Storkey
Abstract: It has been argued that perceptual multistability reflects probabilistic inference performed by the brain when sensory input is ambiguous. Alternatively, more traditional explanations of multistability refer to low-level mechanisms such as neuronal adaptation. We employ a Deep Boltzmann Machine (DBM) model of cortical processing to demonstrate that these two different approaches can be combined in the same framework. Based on recent developments in machine learning, we show how neuronal adaptation can be understood as a mechanism that improves probabilistic, sampling-based inference. Using the ambiguous Necker cube image, we analyze the perceptual switching exhibited by the model. We also examine the influence of spatial attention, and explore how binocular rivalry can be modeled with the same approach. Our work joins earlier studies in demonstrating how the principles underlying DBMs relate to cortical processing, and offers novel perspectives on the neural implementation of approximate probabilistic inference in the brain. 1
4 0.59431893 115 nips-2011-Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices
Author: Xianxing Zhang, Lawrence Carin, David B. Dunson
Abstract: The nested Chinese restaurant process is extended to design a nonparametric topic-model tree for representation of human choices. Each tree path corresponds to a type of person, and each node (topic) has a corresponding probability vector over items that may be selected. The observed data are assumed to have associated temporal covariates (corresponding to the time at which choices are made), and we wish to impose that with increasing time it is more probable that topics deeper in the tree are utilized. This structure is imposed by developing a new “change point
5 0.59382606 93 nips-2011-Extracting Speaker-Specific Information with a Regularized Siamese Deep Network
Author: Ke Chen, Ahmad Salman
Abstract: Speech conveys different yet mixed information ranging from linguistic to speaker-specific components, and each of them should be exclusively used in a specific task. However, it is extremely difficult to extract a specific information component given the fact that nearly all existing acoustic representations carry all types of speech information. Thus, the use of the same representation in both speech and speaker recognition hinders a system from producing better performance due to interference of irrelevant information. In this paper, we present a deep neural architecture to extract speaker-specific information from MFCCs. As a result, a multi-objective loss function is proposed for learning speaker-specific characteristics and regularization via normalizing interference of non-speaker related information and avoiding information loss. With LDC benchmark corpora and a Chinese speech corpus, we demonstrate that a resultant speaker-specific representation is insensitive to text/languages spoken and environmental mismatches and hence outperforms MFCCs and other state-of-the-art techniques in speaker recognition. We discuss relevant issues and relate our approach to previous work. 1
6 0.57311881 92 nips-2011-Expressive Power and Approximation Errors of Restricted Boltzmann Machines
7 0.55542696 281 nips-2011-The Doubly Correlated Nonparametric Topic Model
8 0.53939593 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
9 0.50795162 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation
10 0.50688368 74 nips-2011-Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection
11 0.48508245 244 nips-2011-Selecting Receptive Fields in Deep Networks
12 0.47567683 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
13 0.46686009 197 nips-2011-On Tracking The Partition Function
14 0.44835514 287 nips-2011-The Manifold Tangent Classifier
15 0.44485992 110 nips-2011-Group Anomaly Detection using Flexible Genre Models
16 0.43585029 129 nips-2011-Improving Topic Coherence with Regularized Topic Models
17 0.41138977 304 nips-2011-Why The Brain Separates Face Recognition From Object Recognition
18 0.40810436 293 nips-2011-Understanding the Intrinsic Memorability of Images
19 0.40122634 221 nips-2011-Priors over Recurrent Continuous Time Processes
20 0.40023884 14 nips-2011-A concave regularization technique for sparse mixture models
topicId topicWeight
[(0, 0.053), (4, 0.068), (20, 0.043), (26, 0.013), (28, 0.21), (31, 0.098), (33, 0.084), (40, 0.011), (43, 0.046), (45, 0.085), (57, 0.043), (65, 0.029), (74, 0.062), (83, 0.035), (84, 0.02), (99, 0.025)]
simIndex simValue paperId paperTitle
1 0.84664661 110 nips-2011-Group Anomaly Detection using Flexible Genre Models
Author: Liang Xiong, Barnabás Póczos, Jeff G. Schneider
Abstract: An important task in exploring and analyzing real-world data sets is to detect unusual and interesting phenomena. In this paper, we study the group anomaly detection problem. Unlike traditional anomaly detection research that focuses on data points, our goal is to discover anomalous aggregated behaviors of groups of points. For this purpose, we propose the Flexible Genre Model (FGM). FGM is designed to characterize data groups at both the point level and the group level so as to detect various types of group anomalies. We evaluate the effectiveness of FGM on both synthetic and real data sets including images and turbulence data, and show that it is superior to existing approaches in detecting group anomalies. 1
same-paper 2 0.78682452 156 nips-2011-Learning to Learn with Compound HD Models
Author: Antonio Torralba, Joshua B. Tenenbaum, Ruslan Salakhutdinov
Abstract: We introduce HD (or “Hierarchical-Deep”) models, a new compositional learning architecture that integrates deep learning models with structured hierarchical Bayesian models. Specifically we show how we can learn a hierarchical Dirichlet process (HDP) prior over the activities of the top-level features in a Deep Boltzmann Machine (DBM). This compound HDP-DBM model learns to learn novel concepts from very few training examples, by learning low-level generic features, high-level features that capture correlations among low-level features, and a category hierarchy for sharing priors over the high-level features that are typical of different kinds of concepts. We present efficient learning and inference algorithms for the HDP-DBM model and show that it is able to learn new concepts from very few examples on CIFAR-100 object recognition, handwritten character recognition, and human motion capture datasets. 1
3 0.77026778 95 nips-2011-Fast and Accurate k-means For Large Datasets
Author: Michael Shindler, Alex Wong, Adam W. Meyerson
Abstract: Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o( nk) (where n is the number of data points; note that computing the cost, given a solution, takes 8(nk) time). We show that our algorithm compares favorably to existing algorithms - both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
4 0.69342154 226 nips-2011-Projection onto A Nonnegative Max-Heap
Author: Jun Liu, Liang Sun, Jieping Ye
Abstract: We consider the problem of computing the Euclidean projection of a vector of length p onto a non-negative max-heap—an ordered tree where the values of the nodes are all nonnegative and the value of any parent node is no less than the value(s) of its child node(s). This Euclidean projection plays a building block role in the optimization problem with a non-negative maxheap constraint. Such a constraint is desirable when the features follow an ordered tree structure, that is, a given feature is selected for the given regression/classification task only if its parent node is selected. In this paper, we show that such Euclidean projection problem admits an analytical solution and we develop a top-down algorithm where the key operation is to find the so-called maximal root-tree of the subtree rooted at each node. A naive approach for finding the maximal root-tree is to enumerate all the possible root-trees, which, however, does not scale well. We reveal several important properties of the maximal root-tree, based on which we design a bottom-up algorithm with merge for efficiently finding the maximal roottree. The proposed algorithm has a (worst-case) linear time complexity for a sequential list, and O(p2 ) for a general tree. We report simulation results showing the effectiveness of the max-heap for regression with an ordered tree structure. Empirical results show that the proposed algorithm has an expected linear time complexity for many special cases including a sequential list, a full binary tree, and a tree with depth 1. 1
5 0.65703166 266 nips-2011-Spatial distance dependent Chinese restaurant processes for image segmentation
Author: Soumya Ghosh, Andrei B. Ungureanu, Erik B. Sudderth, David M. Blei
Abstract: The distance dependent Chinese restaurant process (ddCRP) was recently introduced to accommodate random partitions of non-exchangeable data [1]. The ddCRP clusters data in a biased way: each data point is more likely to be clustered with other data that are near it in an external sense. This paper examines the ddCRP in a spatial setting with the goal of natural image segmentation. We explore the biases of the spatial ddCRP model and propose a novel hierarchical extension better suited for producing “human-like” segmentations. We then study the sensitivity of the models to various distance and appearance hyperparameters, and provide the first rigorous comparison of nonparametric Bayesian models in the image segmentation domain. On unsupervised image segmentation, we demonstrate that similar performance to existing nonparametric Bayesian models is possible with substantially simpler models and algorithms.
6 0.63756335 227 nips-2011-Pylon Model for Semantic Segmentation
7 0.63156158 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
8 0.63071185 127 nips-2011-Image Parsing with Stochastic Scene Grammar
9 0.63016212 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
10 0.6276601 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding
11 0.62735921 43 nips-2011-Bayesian Partitioning of Large-Scale Distance Data
12 0.62702823 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
13 0.62689018 180 nips-2011-Multiple Instance Filtering
14 0.626598 116 nips-2011-Hierarchically Supervised Latent Dirichlet Allocation
15 0.6260649 168 nips-2011-Maximum Margin Multi-Instance Learning
16 0.62457025 188 nips-2011-Non-conjugate Variational Message Passing for Multinomial and Binary Regression
17 0.62396687 66 nips-2011-Crowdclustering
18 0.62126082 35 nips-2011-An ideal observer model for identifying the reference frame of objects
19 0.62060547 99 nips-2011-From Stochastic Nonlinear Integrate-and-Fire to Generalized Linear Models
20 0.61934459 303 nips-2011-Video Annotation and Tracking with Active Learning