nips nips2012 nips2012-100 knowledge-graph by maker-knowledge-mining

100 nips-2012-Discriminative Learning of Sum-Product Networks


Source: pdf

Author: Robert Gens, Pedro Domingos

Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. [sent-6, score-0.326]

2 In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. [sent-8, score-0.25]

3 We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. [sent-9, score-0.368]

4 Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). [sent-10, score-0.567]

5 We test discriminative SPNs on standard image classification tasks. [sent-14, score-0.196]

6 We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. [sent-15, score-0.183]

7 They can be viewed as a new type of deep architecture, where sum layers alternate with product layers. [sent-22, score-0.262]

8 Deep networks have many layers of hidden variables, which greatly increases their representational power, but inference with even a single layer is generally intractable, and adding layers compounds the problem [3]. [sent-23, score-0.301]

9 SPNs are a deep architecture with full probabilistic semantics where inference is guaranteed to be tractable, under general conditions derived by Poon and Domingos [23]. [sent-24, score-0.276]

10 Poon and Domingos introduced an algorithm for generatively training SPNs, yet it is generally observed that discriminative training fares better. [sent-26, score-0.277]

11 By optimizing P(Y|X) instead of P(X, Y), conditional random fields retain joint inference over dependent label variables Y while allowing for flexible features over given inputs X [22]. [sent-27, score-0.19]

12 In this paper, discriminatively training SPNs will allow us to combine flexible features with fast, exact inference over high treewidth models. [sent-33, score-0.324]

13 With inference and learning that easily scale to many layers, SPNs can be viewed as a type of deep network. [sent-34, score-0.204]

14 Existing deep networks employ discriminative training with backpropagation through softmax layers or support vector machines over network variables. [sent-35, score-0.502]

15 Poon and Domingos showed that deep SPNs could be learned faster and more accurately than deep belief networks and deep Boltzmann machines on a generative image completion task [23]. [sent-37, score-0.508]

16 This paper contributes a discriminative training algorithm that could be used on its own or with generative pre-training. [sent-38, score-0.244]

17 For the first time we combine the advantages of SPNs with those of discriminative models. [sent-39, score-0.143]

18 We then provide a training algorithm, demonstrate how to compute the gradient of the conditional log-likelihood of an SPN using backpropagation, and explore variations of inference. [sent-41, score-0.149]

19 Finally, we show state-of-the-art results where a discriminatively-trained SPN achieves higher accuracy than SVMs and deep models on image classification tasks. [sent-42, score-0.186]

20 To distinguish random variables from indicator variables, we use roman font for the former and italic for the latter. [sent-48, score-0.153]

21 The network polynomial of Φ(x) is defined as $\sum_x \Phi(x)\,\Pi(x)$, where $\Pi(x)$ is the product of indicators that are one in state x. [sent-50, score-0.216]

22 For example, the network polynomial of the Bayesian network X1 → X2 is $P(x_1)P(x_2|x_1)\,x_1 x_2 + P(x_1)P(\bar{x}_2|x_1)\,x_1\bar{x}_2 + P(\bar{x}_1)P(x_2|\bar{x}_1)\,\bar{x}_1 x_2 + P(\bar{x}_1)P(\bar{x}_2|\bar{x}_1)\,\bar{x}_1\bar{x}_2$. [sent-51, score-0.149]

23 To compute P(X1 = true, X2 = false), we access the corresponding term of the network polynomial by setting indicators $x_1$ and $\bar{x}_2$ to one and the rest to zero. [sent-52, score-0.176]
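
A minimal sketch (not from the authors' code) of this network polynomial for the X1 → X2 example above; the parameter values are arbitrary illustrative numbers:

```python
# Network polynomial of the Bayesian network X1 -> X2, evaluated by setting
# indicator variables (illustrative parameter values).
p_x1 = 0.6                              # P(X1 = true)
p_x2_given = {True: 0.7, False: 0.2}    # P(X2 = true | X1)

def network_poly(x1, nx1, x2, nx2):
    """Sum over all joint states, each weighted by its product of indicators."""
    total = 0.0
    for v1 in (True, False):
        for v2 in (True, False):
            prob = (p_x1 if v1 else 1 - p_x1) * \
                   (p_x2_given[v1] if v2 else 1 - p_x2_given[v1])
            ind = (x1 if v1 else nx1) * (x2 if v2 else nx2)
            total += prob * ind
    return total

print(network_poly(1, 0, 0, 1))   # P(X1=true, X2=false) = 0.6 * 0.3 = 0.18
print(network_poly(1, 0, 1, 1))   # marginal P(X1=true) = 0.6 (X2 summed out)
```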

24 The network polynomial has size exponential in the number of variables, but in many cases it can be represented more compactly using a sum-product network [23, 14]. [sent-57, score-0.174]

25 Each edge (i, j) emanating from a sum node i has a non-negative weight wij . [sent-69, score-0.243]

26 The value of a product node is the product of the values of its children. [sent-70, score-0.229]

27 The value of a sum node is $\sum_{j \in Ch(i)} w_{ij} v_j$, where Ch(i) are the children of i and $v_j$ is the value of node j. [sent-71, score-0.402]
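
A minimal sketch of this bottom-up evaluation; the Leaf/Product/Sum classes are illustrative, not the authors' implementation, and are reused in the later sketches:

```python
# Illustrative SPN node classes with bottom-up evaluation.
class Leaf:
    def __init__(self, var, negated=False):
        self.var, self.negated = var, negated
    def value(self, indicators):
        # indicators[var] = (x, x_bar); evidence is encoded by which are set to 1
        x, x_bar = indicators[self.var]
        return x_bar if self.negated else x

class Product:
    def __init__(self, children):
        self.children = children
    def value(self, indicators):
        v = 1.0
        for c in self.children:              # product of the children's values
            v *= c.value(indicators)
        return v

class Sum:
    def __init__(self, weighted_children):   # list of (weight w_ij, child) pairs
        self.weighted_children = weighted_children
    def value(self, indicators):              # sum_j w_ij * v_j
        return sum(w * c.value(indicators) for w, c in self.weighted_children)
```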

28 Instead, we set the indicators so that the network sums out both X2 and X3 . [sent-92, score-0.174]

29 For the SPN over Boolean variables X1, X2, X3 in Figure 1, an indicator setting of S[1, 0, 1, 1, 1, 1] computes the sum over all states compatible with our evidence e = {X1 = true} and requires only one evaluation. [sent-93, score-0.171]

30 If a linear evaluation of an SPN with indicators set to represent evidence equals the exponential sum over all variable states consistent with that evidence, the SPN is valid. [sent-95, score-0.184]

31 (Poon & Domingos, 2011) A sum-product network S is valid iff $S(e) = \Phi_S(e)$ for all evidence e. [sent-97, score-0.139]

32 (Poon & Domingos, 2011) A sum-product network is complete iff all children of the same sum node have the same scope. [sent-100, score-0.337]

33 (Poon & Domingos, 2011) A sum-product network is consistent iff no variable appears negated in one child of a product node and non-negated in another. [sent-102, score-0.363]

34 The scope of a node is defined as the set of variables that have indicators among the node’s descendants. [sent-105, score-0.298]
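
A sketch of how completeness and consistency can be checked from node scopes, reusing the illustrative classes above (the decomposition into helper functions is my own):

```python
# Structural checks for completeness (sum nodes) and consistency (product nodes).
def children_of(node):
    if isinstance(node, Leaf):
        return []
    if isinstance(node, Product):
        return node.children
    return [c for _, c in node.weighted_children]

def signed_leaves(node):
    """(variable, negated) pairs for all indicators below the node."""
    if isinstance(node, Leaf):
        return {(node.var, node.negated)}
    out = set()
    for c in children_of(node):
        out |= signed_leaves(c)
    return out

def scope(node):
    """Variables whose indicators appear among the node's descendants."""
    return {var for var, _ in signed_leaves(node)}

def is_complete_and_consistent(node):
    kids = children_of(node)
    if isinstance(node, Sum):
        if len({frozenset(scope(c)) for c in kids}) > 1:
            return False              # incomplete: children differ in scope
    if isinstance(node, Product):
        leaf_sets = [signed_leaves(c) for c in kids]
        for i, si in enumerate(leaf_sets):
            for j, sj in enumerate(leaf_sets):
                if i != j and any((v, not neg) in sj for v, neg in si):
                    return False      # inconsistent: x in one child, x-bar in another
    return all(is_complete_and_consistent(c) for c in kids)
```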

35 If a sum node is incomplete, the SPN will undercount the true marginals. [sent-107, score-0.217]

36 Since an incomplete sum node has scope larger than a child, that child will be non-zero for more than one state of the sum. [sent-108, score-0.407]

37 If a product node is inconsistent, the SPN will overcount the marginals, as it will incorporate impossible states. [sent-111, score-0.189]

38 One method is to compute the likelihood gradient and optimize with gradient descent (GD). [sent-115, score-0.186]

39 They also show how to use expectation maximization (EM) by considering each sum node as the marginalization of a hidden variable [17]. [sent-116, score-0.244]

40 They found that online EM using most probable explanation (MPE or “hard”) inference worked the best for their image completion task. [sent-117, score-0.177]

41 Gradient diffusion is a key issue in training deep models. [sent-118, score-0.187]

42 It is commonly observed in neural networks that when the gradient is propagated to lower layers it becomes less informative [3]. [sent-119, score-0.17]

43 When every node in the network takes fractional responsibility for the errors of a top level node, it becomes difficult to steer parameters out of local minima. [sent-120, score-0.209]

44 In the next section we show how to discriminatively train SPNs with hard gradient descent. [sent-124, score-0.221]

45 We do not sum over states of given variables X when discriminatively training SPNs. [sent-127, score-0.226]

46 This means that one ignores X variables in the scope of a node when considering completeness and consistency. [sent-129, score-0.251]

47 Since adding a constant as a child to a product node cannot make that product inconsistent, a variable x can be the child of any product node in a valid SPN. [sent-130, score-0.624]

48 To maintain completeness, x can only be the child of a sum node that has scope outside of Y or H. [sent-131, score-0.339]

49 Algorithm 1 (Output: an SPN with learned weights): repeat { forall d ∈ D do UpdateWeights(S, Inference(S, x_d, y_d)) } until convergence or an early-stopping condition. The parameters of an SPN can be learned using an online procedure as in Algorithm 1, as proposed by Poon and Domingos. [sent-133, score-0.136]
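
A minimal sketch of this online loop; `inference` and `update_weights` are placeholders for whichever inference style (marginal or MPE) and update rule is chosen:

```python
# Online parameter learning in the style of Algorithm 1 (placeholder helpers).
def learn_spn(spn, data, inference, update_weights, max_epochs=50, tol=1e-4):
    prev_ll = float("-inf")
    for _ in range(max_epochs):
        ll = 0.0
        for x_d, y_d in data:
            stats = inference(spn, x_d, y_d)   # soft (marginal) or hard (MPE)
            update_weights(spn, stats)
            ll += stats["cond_log_lik"]
        if abs(ll - prev_ll) < tol:            # convergence / early stopping
            return spn
        prev_ll = ll
    return spn
```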

50 Poon and Domingos discussed generative gradient descent with marginal inference as well as EM with marginal and MPE inference. [sent-136, score-0.317]

51 In this section we will derive discriminative gradient descent with marginal and MPE inference, where hard gradient descent can also be used for generative training. [sent-137, score-0.517]

52 EM is not typically used for discriminative training as it requires modification to lower bound the conditional likelihood [25] and there may not be a closed form for the M-step. [sent-138, score-0.219]

53 After performing a bottom-up evaluation of the SPN, partial derivatives are passed from parent to child following the chain rule, as described in [15]. [sent-142, score-0.154]

54 The form of backpropagation presented takes time linear in the number of nodes in the SPN if product nodes have a bounded number of children. [sent-143, score-0.189]

55 Our gradient descent update then follows the direction of the partial derivative of the conditional log likelihood with learning rate η: $\Delta w = \eta \frac{\partial}{\partial w} \log P(y|x)$. [sent-144, score-0.219]

56 After each gradient step we optionally renormalize the weights of a sum node so they sum to one. [sent-145, score-0.358]
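
A sketch of this "soft" gradient over the illustrative node classes above (a simplified chain-rule stand-in for Algorithm 2, which is described below): two passes are needed, one with the label y clamped and one with Y summed out.

```python
# Backpropagation of partial derivatives through an SPN (illustrative).
def topological_order(root):
    """Parents before children (reverse DFS postorder on the DAG)."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for c in children_of(node):
            visit(c)
        order.append(node)
    visit(root)
    return list(reversed(order))

def backprop(root, indicators):
    """Return S(root) and dS/dw for every (sum node, child index) weight."""
    order = topological_order(root)
    values = {}
    for n in reversed(order):                  # children before parents
        if isinstance(n, Leaf):
            values[n] = n.value(indicators)
        elif isinstance(n, Product):
            v = 1.0
            for c in n.children:
                v *= values[c]
            values[n] = v
        else:
            values[n] = sum(w * values[c] for w, c in n.weighted_children)

    dS = {root: 1.0}                           # node -> dS_root / dS_node
    grads = {}                                 # (sum node, child index) -> dS/dw
    for node in order:                         # parents processed before children
        d_n = dS.get(node, 0.0)
        if isinstance(node, Sum):
            for i, (w, c) in enumerate(node.weighted_children):
                grads[(node, i)] = d_n * values[c]
                dS[c] = dS.get(c, 0.0) + d_n * w
        elif isinstance(node, Product):
            for c in node.children:
                prod_others = 1.0
                for k in node.children:        # product over the sibling values
                    if k is not c:
                        prod_others *= values[k]
                dS[c] = dS.get(c, 0.0) + d_n * prod_others
    return values[root], grads

def soft_gradient(root, ind_with_y, ind_sum_y, eta=0.1):
    """Delta w = eta * d/dw [log S(y, x) - log S(1, x)]."""
    S_y, g_y = backprop(root, ind_with_y)
    S_1, g_1 = backprop(root, ind_sum_y)
    return {k: eta * (g_y.get(k, 0.0) / S_y - g_1.get(k, 0.0) / S_1)
            for k in set(g_y) | set(g_1)}
```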

57 Algorithm 2: BackpropSPN Input: A valid SPN S, where Sn denotes the value of node n after bottom-up evaluation. [sent-150, score-0.175]

58 (For a product node n, the partial derivative passed to a child j involves the product over its siblings, $\prod_{k \in Ch(n)\setminus\{j\}} S_k$.) Discriminative Training with MPE Inference: There are several reasons why MPE inference is appealing for discriminatively training SPNs. [sent-152, score-0.225]

59 As discussed above, hard inference was crucial for overcoming gradient diffusion when generatively training SPNs. [sent-153, score-0.356]

60 The root node sums out the variable Y, the two sum nodes on the left sum out the hidden variable H1 , the two sum nodes on the right sum out H2 , and a circled ‘f’ denotes an input variable Xi . [sent-159, score-0.601]

61 To convert an SPN to an MPN, we replace each sum node by a max node, where weights on children are retained. [sent-163, score-0.253]

62 The gradient of the conditional log likelihood with MPE inference is then $\frac{\partial}{\partial w} \log \tilde{P}(y|x) = \frac{\partial}{\partial w} \log \max_{h} \Phi(Y{=}y, H{=}h \mid x) - \frac{\partial}{\partial w} \log \max_{y', h} \Phi(Y{=}y', H{=}h \mid x)$, where the two maximizations are computed by M[y, 1|x] and M[1, 1|x]. [sent-164, score-0.317]

63 Analogous to Viterbi decoding, the path starts at the root node and at each max (formerly sum) node it only travels to the max-valued child. [sent-167, score-0.322]

64 The value of the MPN takes the form of a product $\prod_{w_i \in W} w_i^{c_i}$, where $c_i$ is the number of times $w_i$ appears in W. [sent-170, score-0.417]

65 The partial derivatives of the MPN with respect to all nodes and weights are computed by Algorithm 2 modified to accommodate MPNs: (1) S becomes M; (2) when n is a sum node, the body of the forall loop is run once, with j as the max-valued child. [sent-171, score-0.257]

66 The hard gradient update is then $\Delta w_i = \eta \frac{\partial}{\partial w_i} \log \tilde{P}(y|x) = \eta \frac{\Delta c_i}{w_i}$. [sent-173, score-0.164]

The hard gradient for a training instance (x_d, y_d) is illustrated in Figure 2. [sent-174, score-0.291]
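
A sketch of this hard (MPE) gradient over the illustrative node classes above; the counts come from the two MPN evaluations M[y, 1|x] and M[1, 1|x], and the update is $\Delta w_i = \eta\,\Delta c_i / w_i$.

```python
# Hard (MPE) gradient over the illustrative node classes (sum nodes are
# evaluated as max nodes, i.e. the SPN is used as an MPN).
from collections import Counter

def max_value(node, indicators, cache):
    if node in cache:
        return cache[node]
    if isinstance(node, Leaf):
        v = node.value(indicators)
    elif isinstance(node, Product):
        v = 1.0
        for c in node.children:
            v *= max_value(c, indicators, cache)
    else:                                     # Sum node evaluated as a max node
        v = max(w * max_value(c, indicators, cache)
                for w, c in node.weighted_children)
    cache[node] = v
    return v

def weight_counts(root, indicators):
    """Counts c_i of weights on the max-valued subcircuit (Viterbi-style descent)."""
    cache, counts = {}, Counter()
    max_value(root, indicators, cache)
    def descend(node):
        if isinstance(node, Product):
            for c in node.children:
                descend(c)
        elif isinstance(node, Sum):           # travel only to the max-valued child
            w_best, c_best = max(node.weighted_children,
                                 key=lambda wc: wc[0] * cache[wc[1]])
            counts[(node, id(c_best))] += 1
            descend(c_best)
    descend(root)
    return counts

def hard_gradient_step(root, ind_label_clamped, ind_label_free, eta=0.1):
    """Apply Delta w_i = eta * Delta c_i / w_i, from M[y,1|x] and M[1,1|x]."""
    dc = weight_counts(root, ind_label_clamped)
    dc.subtract(weight_counts(root, ind_label_free))
    for (node, child_id), delta_c in dc.items():
        if delta_c:
            node.weighted_children = [
                (w + eta * delta_c / w, c) if id(c) == child_id else (w, c)
                for w, c in node.weighted_children]
```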

68 However, sparsity is not as important for SPNs as it is for Markov random fields, where a non-zero weight can have outsize impact on inference time; with SPNs inference is always linear with respect to model size. [sent-179, score-0.22]

69 The generative hard gradient can be used in place of online EM for datasets where it would be prohibitive to store inference results from past epochs. [sent-181, score-0.291]

70 For architectures that have high fan-in sum nodes, soft inference may be able to separate groups of modes faster than hard inference, which can only alter one child of a sum node at a time. [sent-182, score-0.612]

71 We observe the similarity between the updates of hard EM and hard gradient descent. [sent-183, score-0.209]

72 Experiments (Section 4): We have applied discriminative training of SPNs to image classification benchmarks. [sent-188, score-0.244]

73 CIFAR-10 and STL-10 are standard datasets for deep networks and unsupervised feature learning. [sent-189, score-0.157]

74 The procedure consists of extracting 4 × 10^5 6×6-pixel patches from the training-set images, ZCA whitening those patches [19], running k-means for 50 rounds, and then normalizing the dictionary to have zero mean and unit variance. [sent-194, score-0.154]

75 We then use the dictionary to extract K features at every 6×6-pixel site in the image (unit stride) with the “triangle” encoding $f_k(x) = \max\{0, \bar{z} - z_k\}$, where $z_k = \lVert x - c_k \rVert_2$, $c_k$ is the k-th item in the dictionary, and $\bar{z}$ is the average of the $z_k$. [sent-195, score-0.272]
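
A sketch of this feature-extraction step; whitening and dictionary learning are assumed to have been done already, and the array shapes and function names are illustrative:

```python
# "Triangle" encoding of 6x6 patches against a k-means dictionary (illustrative).
import numpy as np

def triangle_encode(patch_vec, dictionary):
    """f_k(x) = max(0, z_bar - z_k), with z_k = ||x - c_k||_2."""
    z = np.linalg.norm(dictionary - patch_vec[None, :], axis=1)   # (K,) distances
    return np.maximum(0.0, z.mean() - z)

def extract_features(image, dictionary, patch=6):
    """K features at every 6x6 site (unit stride); image is H x W x C."""
    H, W = image.shape[:2]
    K = dictionary.shape[0]
    feats = np.zeros((H - patch + 1, W - patch + 1, K))
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            x = image[i:i + patch, j:j + patch].reshape(-1)
            feats[i, j] = triangle_encode(x, dictionary)
    return feats
```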

76 We experiment with a simple architecture that allows for discriminative learning of local structure. [sent-197, score-0.215]

77 A part is a pattern of image patch features that can occur anywhere in the image. [sent-201, score-0.141]

78 The root of the SPN is a sum node with a child Sc for each class c in the dataset, multiplied by the indicator for that state of the class variable (Figure 3 shows the SPN architecture for the experiments). [sent-205, score-0.447]

79 Sc is a product over P nodes Scp (in Figure 3, hidden variable indicators are omitted for legibility). [sent-207, score-0.178]

80 Each Scp is a sum node over T nodes Scpt. [sent-208, score-0.268]
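
A skeleton of this architecture built from the illustrative node classes above; how each part node Scpt models the pooled image features is left as a user-supplied placeholder, since the paper's exact part model is not reproduced here:

```python
# Skeleton of the experimental SPN: root sum over classes, each class a
# product over P parts, each part a mixture over T instantiations.
def build_experiment_spn(num_classes, P, T, make_part_node):
    root_children = []
    for c in range(num_classes):
        S_cp_nodes = []
        for p in range(P):
            # S_cp: mixture (sum) over T part instantiations S_cpt
            S_cpt_nodes = [make_part_node(c, p, t) for t in range(T)]
            S_cp_nodes.append(Sum([(1.0 / T, n) for n in S_cpt_nodes]))
        # S_c: product over the P parts, times the class indicator
        S_c = Product(S_cp_nodes + [Leaf(var=("class", c))])
        root_children.append((1.0 / num_classes, S_c))
    return Sum(root_children)
```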

81 Coates and Ng [12] also learn higher-order structure, but whereas our method learns structure discriminatively in the context of a parts-based model, their unsupervised algorithm greedily groups features based on correlation and is unable to learn mixtures. [sent-212, score-0.138]

82 Other deep probabilistic architectures should be able to model high-level structure, but considering the difficulty in training these models with approximate inference, it is hard to make full use of their representational power. [sent-215, score-0.297]

83 Unlike the method of [18], which learns filters over predefined HOG image features, our SPN learns on top of learned image features that can model color and detailed patterns. [sent-217, score-0.218]

84 Generative SPN architectures on the same features produce unsatisfactory results as generative training is led astray by the large number of features, very few of which differentiate labels. [sent-218, score-0.177]

85 In the generative SPN paper [23], continuous variables are modeled with univariate Gaussians at the leaves (viewed as a sum node with infinite children but finite weight sum). [sent-219, score-0.362]

86 With discriminative training, X can be continuous because we always condition on it, which effectively folds it into the weights. [sent-220, score-0.189]

87 All networks are learned with stochastic gradient descent regularized by early stopping. [sent-221, score-0.194]

88 We found that using marginal inference for the root node and MPE inference for the rest of the network worked best. [sent-222, score-0.454]

89 We first compare discriminative SPNs with other methods as we vary the size of the dictionary K. [sent-228, score-0.213]

90 We hypothesize that this is because the SPN architecture allows us to discriminatively train large moveable parts, image structure that cannot be captured by larger dictionaries. [sent-232, score-0.205]

91 a 6x6 pixel dictionary item), from which the classifier may have trouble inferring the coordination of image parts. [sent-236, score-0.159]

92 Figure 4: Impact of dictionary size K with a 4×4 pooling grid (W = 3) on CIFAR-10 test accuracy. Table 3: Test accuracies on CIFAR-10. [sent-247, score-0.197]

93 This architecture achieves the highest published test accuracy on the CIFAR-10 dataset, remarkably using one fifth the number of features of the next best approach. [sent-269, score-0.157]

94 Conclusion (Section 5): Sum-product networks are a new class of probabilistic model where inference remains tractable despite high treewidth and many hidden layers. [sent-287, score-0.262]

95 Discriminative training allows for a wider variety of SPN architectures than generative training, because completeness and consistency do not have to be maintained over evidence variables. [sent-289, score-0.211]

96 We proposed both “soft” and “hard” gradient algorithms, using marginal inference in the “soft” case and MPE inference in the “hard” case. [sent-290, score-0.294]

97 The latter successfully combats the diffusion problem, allowing deep networks to be learned. [sent-291, score-0.189]

98 Experiments on image classification benchmarks illustrate the power of discriminative SPNs. [sent-292, score-0.196]

99 Future research directions include applying other discriminative learning paradigms to SPNs (e.g., max-margin methods). [sent-293, score-0.143]

100 Other directions include automatically learning SPN structure and applying discriminative SPNs to a variety of structured prediction problems. [sent-295, score-0.143]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('spn', 0.651), ('spns', 0.37), ('mpe', 0.252), ('domingos', 0.167), ('poon', 0.163), ('node', 0.149), ('discriminative', 0.143), ('deep', 0.107), ('coates', 0.104), ('wi', 0.102), ('mpn', 0.101), ('inference', 0.097), ('child', 0.09), ('indicators', 0.087), ('discriminatively', 0.08), ('forall', 0.074), ('ch', 0.074), ('gradient', 0.073), ('architecture', 0.072), ('ci', 0.071), ('dictionary', 0.07), ('pooling', 0.07), ('sum', 0.068), ('hard', 0.068), ('scp', 0.067), ('treewidth', 0.064), ('network', 0.06), ('wki', 0.055), ('wj', 0.054), ('generative', 0.053), ('sn', 0.053), ('image', 0.053), ('nodes', 0.051), ('jia', 0.051), ('xd', 0.05), ('scpt', 0.05), ('wkn', 0.05), ('em', 0.05), ('networks', 0.05), ('training', 0.048), ('backpropagation', 0.047), ('layers', 0.047), ('folds', 0.046), ('indicator', 0.044), ('architectures', 0.041), ('descent', 0.04), ('product', 0.04), ('completeness', 0.04), ('generatively', 0.038), ('sj', 0.038), ('children', 0.036), ('pixel', 0.036), ('gd', 0.035), ('features', 0.035), ('bo', 0.034), ('ecj', 0.034), ('fcpt', 0.034), ('subcircuit', 0.034), ('representational', 0.033), ('scope', 0.032), ('derivatives', 0.032), ('partial', 0.032), ('diffusion', 0.032), ('grid', 0.031), ('sk', 0.031), ('soft', 0.031), ('learned', 0.031), ('variables', 0.03), ('italic', 0.03), ('boolean', 0.029), ('evidence', 0.029), ('polynomial', 0.029), ('conditional', 0.028), ('sums', 0.027), ('maximizations', 0.027), ('traversed', 0.027), ('afrl', 0.027), ('font', 0.027), ('ren', 0.027), ('mk', 0.027), ('marginal', 0.027), ('hidden', 0.027), ('probable', 0.027), ('valid', 0.026), ('zk', 0.026), ('accuracy', 0.026), ('weight', 0.026), ('summations', 0.026), ('compactly', 0.025), ('published', 0.024), ('cvpr', 0.024), ('root', 0.024), ('tractable', 0.024), ('iff', 0.024), ('et', 0.024), ('bold', 0.023), ('log', 0.023), ('learns', 0.023), ('receptive', 0.023), ('roman', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 100 nips-2012-Discriminative Learning of Sum-Product Networks

Author: Robert Gens, Pedro Domingos

Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset. 1

2 0.62944859 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables

Author: Aaron Dennis, Dan Ventura

Abstract: The sum-product network (SPN) is a recently-proposed deep model consisting of a network of sum and product nodes, and has been shown to be competitive with state-of-the-art deep models on certain difficult tasks such as image completion. Designing an SPN network architecture that is suitable for the task at hand is an open question. We propose an algorithm for learning the SPN architecture from data. The idea is to cluster variables (as opposed to data instances) in order to identify variable subsets that strongly interact with one another. Nodes in the SPN network are then allocated towards explaining these interactions. Experimental evidence shows that learning the SPN architecture significantly improves its performance compared to using a previously-proposed static architecture. 1

3 0.15532835 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

Author: Stephen Bach, Matthias Broecheler, Lise Getoor, Dianne O'leary

Abstract: Probabilistic graphical models are powerful tools for analyzing constrained, continuous domains. However, finding most-probable explanations (MPEs) in these models can be computationally expensive. In this paper, we improve the scalability of MPE inference in a class of graphical models with piecewise-linear and piecewise-quadratic dependencies and linear constraints over continuous domains. We derive algorithms based on a consensus-optimization framework and demonstrate their superior performance over state of the art. We show empirically that in a large-scale voter-preference modeling problem our algorithms scale linearly in the number of dependencies and constraints. 1

4 0.093447961 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

Author: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Abstract: We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry. 1

5 0.091784045 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

Author: Ryan Kiros, Csaba Szepesvári

Abstract: The task of image auto-annotation, namely assigning a set of relevant tags to an image, is challenging due to the size and variability of tag vocabularies. Consequently, most existing algorithms focus on tag assignment and fix an often large number of hand-crafted features to describe image characteristics. In this paper we introduce a hierarchical model for learning representations of standard sized color images from the pixel level, removing the need for engineered feature representations and subsequent feature selection for annotation. We benchmark our model on the STL-10 recognition dataset, achieving state-of-the-art performance. When our features are combined with TagProp (Guillaumin et al.), we compete with or outperform existing annotation approaches that use over a dozen distinct handcrafted image descriptors. Furthermore, using 256-bit codes and Hamming distance for training TagProp, we exchange only a small reduction in performance for efficient storage and fast comparisons. Self-taught learning is used in all of our experiments and deeper architectures always outperform shallow ones. 1

6 0.082568854 197 nips-2012-Learning with Recursive Perceptual Representations

7 0.07429564 298 nips-2012-Scalable Inference of Overlapping Communities

8 0.069690682 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

9 0.068406343 182 nips-2012-Learning Networks of Heterogeneous Influence

10 0.067987114 91 nips-2012-Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images

11 0.067636773 193 nips-2012-Learning to Align from Scratch

12 0.064858213 87 nips-2012-Convolutional-Recursive Deep Learning for 3D Object Classification

13 0.063879073 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

14 0.062840037 180 nips-2012-Learning Mixtures of Tree Graphical Models

15 0.061591707 62 nips-2012-Burn-in, bias, and the rationality of anchoring

16 0.061591707 116 nips-2012-Emergence of Object-Selective Features in Unsupervised Feature Learning

17 0.058932953 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines

18 0.058410909 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

19 0.056215283 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

20 0.054971654 204 nips-2012-MAP Inference in Chains using Column Generation


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.177), (1, 0.057), (2, -0.127), (3, -0.031), (4, -0.012), (5, -0.053), (6, -0.013), (7, -0.128), (8, -0.145), (9, 0.076), (10, -0.011), (11, 0.08), (12, -0.014), (13, 0.084), (14, -0.17), (15, -0.124), (16, -0.087), (17, -0.029), (18, 0.106), (19, -0.091), (20, -0.116), (21, -0.027), (22, 0.318), (23, 0.05), (24, -0.031), (25, -0.151), (26, 0.006), (27, 0.213), (28, -0.13), (29, 0.152), (30, -0.019), (31, 0.118), (32, 0.346), (33, 0.224), (34, -0.035), (35, 0.083), (36, -0.093), (37, -0.096), (38, 0.025), (39, -0.038), (40, 0.07), (41, 0.117), (42, 0.023), (43, -0.04), (44, -0.029), (45, -0.027), (46, -0.028), (47, 0.067), (48, -0.061), (49, -0.076)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92755908 191 nips-2012-Learning the Architecture of Sum-Product Networks Using Clustering on Variables

Author: Aaron Dennis, Dan Ventura

Abstract: The sum-product network (SPN) is a recently-proposed deep model consisting of a network of sum and product nodes, and has been shown to be competitive with state-of-the-art deep models on certain difficult tasks such as image completion. Designing an SPN network architecture that is suitable for the task at hand is an open question. We propose an algorithm for learning the SPN architecture from data. The idea is to cluster variables (as opposed to data instances) in order to identify variable subsets that strongly interact with one another. Nodes in the SPN network are then allocated towards explaining these interactions. Experimental evidence shows that learning the SPN architecture significantly improves its performance compared to using a previously-proposed static architecture. 1

same-paper 2 0.91572875 100 nips-2012-Discriminative Learning of Sum-Product Networks

Author: Robert Gens, Pedro Domingos

Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset. 1

3 0.40205464 215 nips-2012-Minimizing Uncertainty in Pipelines

Author: Nilesh Dalvi, Aditya Parameswaran, Vibhor Rastogi

Abstract: In this paper, we consider the problem of debugging large pipelines by human labeling. We represent the execution of a pipeline using a directed acyclic graph of AND and OR nodes, where each node represents a data item produced by some operator in the pipeline. We assume that each operator assigns a confidence to each of its output data. We want to reduce the uncertainty in the output by issuing queries to a human, where a query consists of checking if a given data item is correct. In this paper, we consider the problem of asking the optimal set of queries to minimize the resulting output uncertainty. We perform a detailed evaluation of the complexity of the problem for various classes of graphs. We give efficient algorithms for the problem for trees, and show that, for a general dag, the problem is intractable. 1

4 0.38779086 170 nips-2012-Large Scale Distributed Deep Networks

Author: Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc'aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, Andrew Y. Ng

Abstract: Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly- sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm. 1

5 0.38573167 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization

Author: Stephen Bach, Matthias Broecheler, Lise Getoor, Dianne O'leary

Abstract: Probabilistic graphical models are powerful tools for analyzing constrained, continuous domains. However, finding most-probable explanations (MPEs) in these models can be computationally expensive. In this paper, we improve the scalability of MPE inference in a class of graphical models with piecewise-linear and piecewise-quadratic dependencies and linear constraints over continuous domains. We derive algorithms based on a consensus-optimization framework and demonstrate their superior performance over state of the art. We show empirically that in a large-scale voter-preference modeling problem our algorithms scale linearly in the number of dependencies and constraints. 1

6 0.36205041 182 nips-2012-Learning Networks of Heterogeneous Influence

7 0.35318035 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

8 0.34747508 39 nips-2012-Analog readout for optical reservoir computers

9 0.31120384 298 nips-2012-Scalable Inference of Overlapping Communities

10 0.31077611 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

11 0.31073567 93 nips-2012-Deep Spatio-Temporal Architectures and Learning for Protein Structure Prediction

12 0.2982704 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification

13 0.29147258 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

14 0.29048821 267 nips-2012-Perceptron Learning of SAT

15 0.28503034 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

16 0.28328437 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data

17 0.28175792 193 nips-2012-Learning to Align from Scratch

18 0.28049085 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

19 0.27232012 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

20 0.26472211 210 nips-2012-Memorability of Image Regions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.057), (21, 0.019), (38, 0.094), (42, 0.015), (54, 0.015), (55, 0.031), (74, 0.061), (76, 0.096), (80, 0.481), (92, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96536171 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speed and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

2 0.96293169 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

3 0.96161598 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

Author: James Scott, Jonathan W. Pillow

Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1

4 0.95820814 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

Author: Ashish Kapoor, Raajay Viswanathan, Prateek Jain

Abstract: In this paper, we present a Bayesian framework for multilabel classification using compressed sensing. The key idea in compressed sensing for multilabel classification is to first project the label vector to a lower dimensional space using a random transformation and then learn regression functions over these projections. Our approach considers both of these components in a single probabilistic model, thereby jointly optimizing over compression as well as learning tasks. We then derive an efficient variational inference scheme that provides joint posterior distribution over all the unobserved labels. The two key benefits of the model are that a) it can naturally handle datasets that have missing labels and b) it can also measure uncertainty in prediction. The uncertainty estimate provided by the model allows for active learning paradigms where an oracle provides information about labels that promise to be maximally informative for the prediction task. Our experiments show significant boost over prior methods in terms of prediction performance over benchmark datasets, both in the fully labeled and the missing labels case. Finally, we also highlight various useful active learning scenarios that are enabled by the probabilistic model. 1

same-paper 5 0.94854081 100 nips-2012-Discriminative Learning of Sum-Product Networks

Author: Robert Gens, Pedro Domingos

Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset. 1

6 0.92166638 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

7 0.83505404 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

8 0.81238592 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

9 0.78685534 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

10 0.78029627 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

11 0.77121264 200 nips-2012-Local Supervised Learning through Space Partitioning

12 0.76409209 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

13 0.76199973 293 nips-2012-Relax and Randomize : From Value to Algorithms

14 0.75601804 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

15 0.75276434 197 nips-2012-Learning with Recursive Perceptual Representations

16 0.73918247 279 nips-2012-Projection Retrieval for Classification

17 0.73264128 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

18 0.73211402 168 nips-2012-Kernel Latent SVM for Visual Recognition

19 0.7274 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

20 0.72131747 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification