nips nips2010 nips2010-213 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ning Chen, Jun Zhu, Eric P. Xing
Abstract: Learning from multi-view data is important in many applications, such as image classification and annotation. In this paper, we present a large-margin learning framework to discover a predictive latent subspace representation shared by multiple views. Our approach is based on an undirected latent space Markov network that fulfills a weak conditional independence assumption that multi-view observations and response variables are independent given a set of latent variables. We provide efficient inference and parameter estimation methods for the latent subspace model. Finally, we demonstrate the advantages of large-margin learning on real video and web image data for discovering predictive latent representations and improving the performance on image classification, annotation and retrieval.
Reference: text
sentIndex sentText sentNum sentScore
1 of CS & T, TNList Lab, State Key Lab of ITS, Tsinghua University, Beijing 100084 China; School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA. Abstract: Learning from multi-view data is important in many applications, such as image classification and annotation. [sent-8, score-0.081]
2 In this paper, we present a large-margin learning framework to discover a predictive latent subspace representation shared by multiple views. [sent-9, score-0.506]
3 Our approach is based on an undirected latent space Markov network that fulfills a weak conditional independence assumption that multi-view observations and response variables are independent given a set of latent variables. [sent-10, score-0.561]
4 We provide efficient inference and parameter estimation methods for the latent subspace model. [sent-11, score-0.386]
5 Finally, we demonstrate the advantages of large-margin learning on real video and web image data for discovering predictive latent representations and improving the performance on image classification, annotation and retrieval. [sent-12, score-0.794]
6 1 Introduction In many scientific and engineering applications, such as image annotation [28] and web-page classification [6], the available data usually come from diverse domains or are extracted from different aspects, which will be referred to as views. [sent-13, score-0.205]
7 Standard predictive methods, such as support vector machines, are built with all the variables available, without taking into consideration the presence of distinct views. [sent-14, score-0.092]
8 These methods can sacrifice predictive performance [7] and may also be incapable of performing view-level analysis [12], such as predicting the tags for image annotation and analyzing the underlying relationships among views. [sent-15, score-0.383]
9 To discover a subspace representation shared by multi-view data, the unsupervised canonical correlation analysis (CCA) [17] and its kernelized version [1] ignore the widely available supervised information, such as image categories. [sent-17, score-0.37]
10 Therefore, they could discover a subspace with weak predictive ability. [sent-18, score-0.254]
11 However, this deterministic approach cannot provide view-level predictions, such as image annotation, and it would also need a density estimator in order to apply the information criterion [9] to detect view disagreement. [sent-20, score-0.112]
12 Specifically, we propose a large-margin learning approach to discovering a predictive subspace representation for multi-view data. [sent-26, score-0.291]
13 The approach is based on a generic multi-view latent space Markov network (MN) that fulfills a weak conditional independence assumption that the data from different views and the response variables are conditionally independent given a set of latent variables. [sent-27, score-0.569]
14 , latent Dirichlet allocation (LDA) [5] and probabilistic CCA [3]) can also be designed to fulfill the conditional independence, the posterior inference can be hard because all the latent variables are coupled together given the input variables [26]. [sent-33, score-0.515]
15 Undirected latent variable models have shown promising performance in many applications [26, 20]. [sent-35, score-0.229]
16 In the multi-view MN, conditioned on latent variables, each view defines a joint distribution similar to that in a conditional random field (CRF) [18] and thus it can effectively extract latent topics from structured data. [sent-36, score-0.651]
17 For example, considering word ordering information could improve the quality of discovered latent topics [23] compared to a method (e.g., [sent-37, score-0.434]
18 LDA) solely based on the natural bag-of-words representation, and the spatial relationship among regions in an image is also useful for computer vision applications [15]. [sent-39, score-0.081]
19 To learn the multi-view latent space MN, we develop a large-margin approach, which jointly maximizes the data likelihood and minimizes the hinge-loss on training data. [sent-40, score-0.229]
20 The learning and inference problems are efficiently solved with a contrastive divergence method [25]. [sent-41, score-0.096]
21 Finally, we concentrate on one special case of the large-margin multi-view MN and extensively evaluate it on real video and web image datasets for image classification, annotation and retrieval tasks. [sent-42, score-0.451]
22 Our results show that the large-margin approach can achieve significant improvements in terms of prediction performance and discovered latent subspace representations. [sent-43, score-0.499]
23 Sec 2 and Sec 3 present the multi-view latent space MN and its large-margin training. [sent-45, score-0.229]
24 2 Multi-view Latent Space Markov Networks The unsupervised two-view latent space Markov network is shown in Fig. [sent-51, score-0.26]
25 1, which consists of two views of input data $X := \{X_n\}$ and $Z := \{Z_m\}$ and a set of latent variables $H := \{H_k\}$. [sent-52, score-0.273]
26 For ease of presentation, we assume that the variables on each view ... (Figure 1: Multi-view Markov networks with K latent variables.) [sent-53, score-0.26]
27 The model is constructed based on an underlying conditional independence assumption that given the latent variables H, the two views X and Z are independent. [sent-56, score-0.34]
28 Graphically, we can see that both the exponential family Harmonium (EFH) [26] and its extension of dual-wing Harmonium (DWH) [28] are special cases of multi-view latent space MNs. [sent-57, score-0.229]
29 Specifically, we first define marginal distributions of the data on each view and the latent variables. [sent-59, score-0.26]
30 For latent variables H, each component $h_k$ has an exponential family distribution and therefore the marginal distribution is: $p(h) = \prod_k p(h_k) = \prod_k \exp\{\lambda_k^\top \varphi(h_k) - C_k(\lambda_k)\}$, where $\varphi(h_k)$ is the feature vector of $h_k$ and $C_k$ is another log-partition function. [sent-62, score-0.595]
31 We can see that conditioned on the latent variables, both p(x|h) and p(z|h) are defined in the exponential form with a pairwise potential function, which is very similar to conditional random fields [18]. [sent-66, score-0.262]
32 Since the latent variables are not directly connected, the complexity of inferring the posterior distribution of H is the same as in EFH when all the input data are observed, as reflected in the factorized form of p(h|x, z). [sent-69, score-0.229]
33 Therefore, multi-view latent space MNs do not increase the complexity on testing if our task depends solely on the latent representation (i. [sent-70, score-0.458]
34 Up to now, we have focused on unsupervised multi-view latent space MNs, which are widely used for discovering latent subspace representations shared by multi-view data. [sent-79, score-0.759]
35 In this paper, however, we are more interested in the supervised setting where each input sample is associated with a supervised response variable, such as image categories. [sent-80, score-0.169]
36 Accordingly, our goal is to discover a predictive subspace by exploring the supervised information. [sent-81, score-0.298]
37 The supervised multi-view latent space MNs are defined similarly as above, but with an additional view of response variables Y . [sent-82, score-0.304]
38 However, the resultant TWH model does not yield improved performance compared to the naive method that combines an unsupervised DWH for discovering latent representations and an SVM for classification. [sent-94, score-0.374]
39 This observation further motivates us to develop a more discriminative learning approach to exploring the supervised information for discovering predictive latent subspace representations. [sent-95, score-0.598]
40 As we shall see, integrating the large-margin principle into one objective function for joint latent subspace model and prediction model learning can yield much better results, in terms of prediction performance and predictiveness of discovered latent subspace representations. [sent-96, score-0.92]
41 3 Parameter Estimation: a Large Margin Approach To learn the supervised multi-view latent space MNs, a natural method is the maximum likelihood estimation (MLE), which has been widely used to train directed [24, 30] and undirected latent variable models [26, 20, 28, 29]. [sent-97, score-0.564]
42 Here, we integrate the large-margin idea into the learning of supervised multi-view latent space MNs for multi-view data analysis, analogous to the development of MedLDA [31], which is a directed, single-view model. [sent-105, score-0.299]
43 This additional term introduces a regularization effect to the latent subspace model. [sent-142, score-0.362]
44 If the predicted label differs from the true label $y_d$, this term will be non-zero and it biases the model towards discovering a better representation for prediction. [sent-143, score-0.236]
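The behaviour of the hinge-loss term described above can be sketched with a standard multiclass hinge loss. This is only an illustration under stated assumptions: the function name `multiclass_hinge`, the 0/1 cost, and the plain list of per-class scores are hypothetical stand-ins, and the model's actual scores (computed from the latent representation) are not reproduced here.

```python
def multiclass_hinge(scores, true_label, cost=1.0):
    """Standard multiclass hinge loss (an assumed stand-in, not the paper's exact term).

    scores: list of per-class discriminant scores (hypothetical inputs).
    Returns max over y of [cost * 1(y != true_label) + scores[y] - scores[true_label]].
    The candidate y == true_label contributes 0, so the result is never negative.
    """
    true_score = scores[true_label]
    return max(
        (cost if y != true_label else 0.0) + s - true_score
        for y, s in enumerate(scores)
    )


# A confidently correct prediction incurs no loss; a wrong prediction does.
loss_correct = multiclass_hinge([2.0, 0.5, 0.1], true_label=0)
loss_wrong = multiclass_hinge([0.5, 2.0], true_label=0)
```

Note that with this form the loss is also positive when the true class wins by less than the margin, which is the usual large-margin behaviour.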
45 4 Application to Image Classification, Annotation and Retrieval We have developed the large-margin framework with a generic multi-view latent space MN to model structured data. [sent-144, score-0.26]
46 , image tags) and z is a vector of real-valued features (e. [sent-151, score-0.081]
47 Each xi is a Bernoulli variable that denotes whether the ith term of a dictionary appears or not in an image, and each zj is a real number that denotes the normalized color histogram of an image. [sent-154, score-0.267]
48 We assume that each real-valued hk follows a univariate Gaussian distribution. [sent-155, score-0.183]
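As a worked illustration of how a univariate Gaussian fits the exponential-family form used for $p(h_k)$, consider the unit-variance case. The unit variance and the symbols below are assumptions made for this sketch; the sentence above does not state them. Here the feature is $\varphi(h) = h$, the log-partition function is $C(\lambda) = \lambda^2/2$, and the factor that does not depend on $\lambda$ is absorbed into the base measure:

```latex
\mathcal{N}(h;\lambda,1)
  = \frac{1}{\sqrt{2\pi}} \exp\Big\{-\tfrac{1}{2}(h-\lambda)^2\Big\}
  = \underbrace{\frac{1}{\sqrt{2\pi}}\, e^{-h^2/2}}_{\text{base measure}}
    \exp\Big\{\lambda h - \tfrac{\lambda^2}{2}\Big\}
```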
49 For retrieval, the expectation v of each image is used to compute a similarity (e. [sent-165, score-0.081]
50 Then, tags with high probabilities are selected as annotation. [sent-170, score-0.086]
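The tag-selection step above can be sketched as a simple ranking. This is a minimal sketch assuming the per-tag probabilities have already been computed by the model; the function name `top_n_tags` and the example probabilities are hypothetical.

```python
def top_n_tags(tag_probs, n):
    """Return the n tags with the highest predicted probability.

    tag_probs: dict mapping tag -> probability (assumed to come from the model).
    """
    ranked = sorted(tag_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, _ in ranked[:n]]


# Hypothetical probabilities for one image.
annotation = top_n_tags(
    {"zebra": 0.9, "zoo": 0.7, "ocean": 0.05, "stripes": 0.6}, n=2
)
```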
51 5 Experiments We report empirical results on TRECVID2003 and Flickr image datasets. [sent-171, score-0.081]
52 Our results demonstrate that the large-margin approach can achieve significantly better performance on discovering predictive subspace representations and the tasks of image classification, annotation and retrieval. [sent-172, score-0.544]
53 1 Datasets and Features The first dataset is the TRECVID2003 video dataset [28], which contains 1078 manually labeled video shots that belong to 5 categories. [sent-174, score-0.092]
54 The second one is a subset selected from NUS-WIDE [10], which is a large image dataset constructed from Flickr web images. [sent-181, score-0.108]
55 , 64-dim color histogram, 144-dim color correlogram, 73-dim edge direction histogram, 128-dim wavelet texture and 225-dim block-wise color moments) and 500-dim bag-of-words representation based on SIFT [19] features. [sent-187, score-0.12]
56 The online tags are also downloaded for evaluating image annotation. [sent-189, score-0.167]
57 2 Discovering Predictive Latent Subspace Representations We first evaluate the predictive power of the discovered latent subspace representations. [sent-191, score-0.561]
58 2 shows the 2D embedding of the discovered 10-dim latent representations by three models (i. [sent-193, score-0.384]
59 We can clearly see that the latent subspace representations discovered by the large-margin based MMH show a strong grouping pattern for the images belonging to the same category, while images from different categories tend to be separated from each other in the 2D embedding space. [sent-197, score-0.648]
60 In contrast, the latent subspace representations discovered by the likelihood-based unsupervised DWH and supervised TWH do not show a clear grouping pattern, except for the first category. [sent-198, score-0.592]
61 These observations suggest that the large-margin based latent subspace model can discover more predictive or discriminative latent subspace representations, which will result in better prediction performance, as we shall see. [sent-200, score-0.938]
62 To quantitatively evaluate the predictiveness of the discovered latent subspace representations, we compute the pair-wise average KL-divergence between the per-class average distributions over latent topics [footnote 2]. [sent-201, score-0.727]
63 This again suggests that the latent subspace representations discovered by MMH are more discriminative or predictive. [sent-204, score-0.551]
64 Finally, we examine the predictive power of discovered latent topics. [sent-210, score-0.428]
65 3 shows five example topics discovered by the large-margin MMH on the Flickr image data. [sent-212, score-0.286]
66 For each topic Hk , we show the 5 top-ranked images that yield a high expected value of Hk , together with the associated tags. [sent-213, score-0.112]
67 Also, to qualitatively visualize the discriminative power of each topic among the 13 categories, we show the average probability of each category distributed on the particular topic. [sent-214, score-0.112]
68 From the results, we can see that many of the discovered topics are very predictive for one or several categories. [sent-215, score-0.297]
69 For example, topics 3 and 4 are discriminative in predicting the categories hawk and whales, respectively. [sent-216, score-0.327]
70 Similarly, topics 1 and 5 are good at predicting squirrel and zebra, respectively. [sent-217, score-0.332]
71 We also have some topics which are good at discriminating a subset of categories against another subset. [sent-218, score-0.132]
72 For example, the topic 2 is good at discriminating {squirrel, wolf, rabbit} against {tiger, whales, zebra}; but it is not very discriminative between squirrel and wolf. [sent-219, score-0.346]
73 The per-class average is computed by averaging the topic distributions of the images within the same class. [sent-221, score-0.112]
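The pair-wise average KL-divergence between per-class average topic distributions can be sketched as follows. This is a sketch under assumptions: each image is assumed to already have a normalized, non-negative distribution over topics, the average is taken over ordered pairs of distinct classes, and a small `eps` guards against zeros. The function names are hypothetical.

```python
import math
from itertools import permutations


def class_average(topic_dists):
    """Element-wise mean of the topic distributions of images in one class."""
    n = len(topic_dists)
    return [sum(col) / n for col in zip(*topic_dists)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mean_pairwise_kl(dists_by_class):
    """Average KL over all ordered pairs of distinct classes (needs >= 2 classes)."""
    averages = {c: class_average(d) for c, d in dists_by_class.items()}
    pairs = list(permutations(averages, 2))
    return sum(kl_divergence(averages[a], averages[b]) for a, b in pairs) / len(pairs)


# Hypothetical two-topic distributions for two classes.
score = mean_pairwise_kl({
    "zebra": [[0.9, 0.1], [0.8, 0.2]],
    "hawk": [[0.2, 0.8], [0.1, 0.9]],
})
```

A larger value indicates that the class-level topic profiles are further apart, which is how the measure is used as a proxy for predictiveness.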
80 Figure 3: Example latent topics discovered by a 60-topic MMH on the Flickr animal dataset (y-axis: probability; x-axis: category). [sent-256, score-0.594]
81 For each of the unsupervised DWH, GM-Mix, GM-LDA and CorrLDA, a downstream SVM is built with the same tool based on the discovered latent representations. [sent-263, score-0.367]
82 These results show that supervised information can help in discovering predictive latent space representations that are more suitable for prediction if the model is appropriately learned, e. [sent-268, score-0.509]
83 4 (b) shows the classification accuracy on the Flickr animal dataset. [sent-276, score-0.16]
84 With the similar large-margin principle, MMH is an important extension of MedLDA to the undirected latent subspace models and for multi-view data analysis. [sent-283, score-0.398]
85 2 Retrieval For image retrieval, each test image is treated as a query and training images are ranked based on their cosine similarity with the given query, which is computed based on latent subspace representations. [sent-286, score-0.558]
86 An image is considered relevant to the query if the two belong to the same category. [sent-287, score-0.081]
87 We evaluate the retrieval results by computing the average precision (AP) score and drawing precision-recall curves. [sent-288, score-0.086]
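The retrieval protocol above (rank training images by cosine similarity to the query, count same-category images as relevant, score with average precision) can be sketched as below. This is a minimal sketch: the latent vectors are hypothetical inputs, and the AP normalization by the number of relevant items in the ranking is the common definition, assumed rather than taken from the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_cosine(query, train_vecs):
    """Indices of training vectors, most similar to the query first."""
    return sorted(
        range(len(train_vecs)),
        key=lambda i: cosine(query, train_vecs[i]),
        reverse=True,
    )


def average_precision(ranked_relevance):
    """AP for a ranked list of booleans (True = same category as the query)."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical 2-dim latent representations and category labels.
train_vecs = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
train_labels = ["zebra", "zebra", "hawk"]
order = rank_by_cosine([1.0, 0.0], train_vecs)
ap = average_precision([train_labels[i] == "zebra" for i in order])
```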
89 Figure 4: Classification accuracy on the (a) TRECVID 2003 and (b) Flickr datasets and (c) the average precision curve and the two precision-recall curves for image retrieval on TRECVID data. Axes: # of latent topics in (a) and (b); recall in (c). [sent-327, score-0.167]
90 Finally, we report the annotation results on the Flickr dataset, with a dictionary of 1000 unique tags. [sent-369, score-0.124]
91 We also compare with the sLDA annotation model [24], which uses SIFT features and tags as inputs. [sent-373, score-0.124]
92 Figure 5: Top-N F1-measure. [sent-374, score-0.086]
93 With 60 latent topics, the top-N F-measure scores are shown in Fig. [sent-376, score-0.229]
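A top-N F-measure for a single image can be sketched as below. This is only an assumed per-image formulation: the source does not say how scores are averaged across images, and the function name `f1_at_n` and the example tags are hypothetical.

```python
def f1_at_n(predicted_tags, true_tags):
    """F1 between the top-N predicted tags and the ground-truth tags of one image."""
    predicted, truth = set(predicted_tags), set(true_tags)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Top-3 predictions against four ground-truth tags: precision 2/3, recall 1/2.
score = f1_at_n(["zebra", "zoo", "ocean"], ["zebra", "zoo", "africa", "stripes"])
```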
94 6 shows example images from all the 13 categories, where for each category the left image is generally of a good annotation quality and the right one is relatively worse. [sent-380, score-0.239]
95 6 Conclusions and Future Work We have presented a generic large-margin learning framework for discovering predictive latent subspace representations shared by structured multi-view data. [sent-381, score-0.622]
96 The inference and learning can be efficiently done with contrastive divergence methods. [sent-382, score-0.12]
97 Finally, we concentrate on a specialized model with applications to image classification, annotation and retrieval. [sent-383, score-0.238]
98 Extensive experiments on real video and web image datasets demonstrate the advantages of large-margin learning for both prediction and predictive latent subspace discovery. [sent-384, score-0.638]
99 NUS-WIDE: A real-world web image database from National University of Singapore. [sent-465, score-0.108]
100 MedLDA: Maximum margin supervised topic models for regression and classification. [sent-595, score-0.122]
wordName wordTfidf (topN-words)
[('mmh', 0.453), ('dwh', 0.263), ('squirrel', 0.234), ('latent', 0.229), ('twh', 0.219), ('hk', 0.183), ('zebra', 0.175), ('zj', 0.165), ('hawk', 0.161), ('animal', 0.16), ('wolf', 0.148), ('ickr', 0.146), ('subspace', 0.133), ('rabbit', 0.13), ('lion', 0.128), ('wildlife', 0.128), ('snake', 0.128), ('annotation', 0.124), ('cat', 0.121), ('whales', 0.117), ('discovered', 0.107), ('topics', 0.098), ('elephant', 0.094), ('predictive', 0.092), ('cow', 0.088), ('zoo', 0.088), ('tags', 0.086), ('tiger', 0.086), ('image', 0.081), ('topic', 0.078), ('mns', 0.077), ('medlda', 0.077), ('antler', 0.073), ('harmonium', 0.073), ('ocean', 0.071), ('yd', 0.07), ('sift', 0.069), ('discovering', 0.066), ('retrieval', 0.059), ('corrlda', 0.058), ('efh', 0.058), ('trecvid', 0.058), ('ep', 0.049), ('representations', 0.048), ('vy', 0.047), ('mn', 0.047), ('video', 0.046), ('contrastive', 0.046), ('views', 0.044), ('lda', 0.044), ('antlers', 0.044), ('bunny', 0.044), ('cute', 0.044), ('supervised', 0.044), ('classi', 0.04), ('color', 0.04), ('nature', 0.039), ('rhinge', 0.038), ('variational', 0.038), ('sec', 0.037), ('undirected', 0.036), ('gm', 0.036), ('xi', 0.034), ('discriminative', 0.034), ('categories', 0.034), ('independence', 0.034), ('images', 0.034), ('conditional', 0.033), ('svm', 0.033), ('lv', 0.033), ('concentrate', 0.033), ('unsupervised', 0.031), ('view', 0.031), ('structured', 0.031), ('wi', 0.031), ('eq', 0.031), ('uk', 0.03), ('hd', 0.03), ('prediction', 0.03), ('xing', 0.029), ('canonical', 0.029), ('africa', 0.029), ('kitten', 0.029), ('largemargin', 0.029), ('predictiveness', 0.029), ('vyd', 0.029), ('discover', 0.029), ('clamped', 0.029), ('histogram', 0.028), ('bird', 0.028), ('web', 0.027), ('precision', 0.027), ('divergence', 0.026), ('ful', 0.026), ('directed', 0.026), ('marine', 0.026), ('inference', 0.024), ('done', 0.024), ('slda', 0.024), ('shared', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
2 0.14315258 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
Author: Yangqing Jia, Mathieu Salzmann, Trevor Darrell
Abstract: Recent approaches to multi-view learning have shown that factorizing the information into parts that are shared across all views and parts that are private to each view could effectively account for the dependencies and independencies between the different input modalities. Unfortunately, these approaches involve minimizing non-convex objective functions. In this paper, we propose an approach to learning such factorized representations inspired by sparse coding techniques. In particular, we show that structured sparsity allows us to address the multiview learning problem by alternately solving two convex optimization problems. Furthermore, the resulting factorized latent spaces generalize over existing approaches in that they allow having latent dimensions shared between any subset of the views instead of between all the views only. We show that our approach outperforms state-of-the-art methods on the task of human pose estimation.
3 0.13349776 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
Author: Yang Wang, Greg Mori
Abstract: We propose a discriminative latent model for annotating images with unaligned object-level textual annotations. Instead of using the bag-of-words image representation currently popular in the computer vision community, our model explicitly captures more intricate relationships underlying visual and textual information. In particular, we model the mapping that translates image regions to annotations. This mapping allows us to relate image regions to their corresponding annotation terms. We also model the overall scene label as latent information. This allows us to cluster test images. Our training data consist of images and their associated annotations. But we do not have access to the ground-truth region-to-annotation mapping or the overall scene label. We develop a novel variant of the latent SVM framework to model them as latent variables. Our experimental results demonstrate the effectiveness of the proposed model compared with other baseline methods.
4 0.11182833 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models
Author: Jun Zhu, Li-jia Li, Li Fei-fei, Eric P. Xing
Abstract: Upstream supervised topic models have been widely used for complicated scene understanding. However, existing maximum likelihood estimation (MLE) schemes can make the prediction model learning independent of latent topic discovery and result in an imbalanced prediction rule for scene classification. This paper presents a joint max-margin and max-likelihood learning method for upstream scene understanding models, in which latent topic discovery and prediction model estimation are closely coupled and well-balanced. The optimization problem is efficiently solved with a variational EM procedure, which iteratively solves an online loss-augmented SVM. We demonstrate the advantages of the large-margin approach on both an 8-category sports dataset and the 67-class MIT indoor scene dataset for scene categorization.
5 0.10992537 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
Author: Matthew Blaschko, Andrea Vedaldi, Andrew Zisserman
Abstract: A standard approach to learning object category detectors is to provide strong supervision in the form of a region of interest (ROI) specifying each instance of the object in the training images [17]. In this work our goal is to learn from heterogeneous labels, in which some images are only weakly supervised, specifying only the presence or absence of the object or a weak indication of object location, whilst others are fully annotated. To this end we develop a discriminative learning approach and make two contributions: (i) we propose a structured output formulation for weakly annotated images where full annotations are treated as latent variables; and (ii) we propose to optimize a ranking objective function, allowing our method to more effectively use negatively labeled images to improve detection average precision performance. The method is demonstrated on the benchmark INRIA pedestrian detection dataset of Dalal and Triggs [14] and the PASCAL VOC dataset [17], and it is shown that for a significant proportion of weakly supervised images the performance achieved is very similar to the fully supervised (state of the art) results.
6 0.087494522 103 nips-2010-Generating more realistic images using gated MRF's
7 0.084301755 194 nips-2010-Online Learning for Latent Dirichlet Allocation
8 0.083482482 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
9 0.082000114 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
10 0.079195231 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
11 0.076931901 235 nips-2010-Self-Paced Learning for Latent Variable Models
12 0.076691292 286 nips-2010-Word Features for Latent Dirichlet Allocation
13 0.073368974 99 nips-2010-Gated Softmax Classification
14 0.064944342 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
15 0.06379272 149 nips-2010-Learning To Count Objects in Images
16 0.06253992 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
17 0.062406193 284 nips-2010-Variational bounds for mixed-data factor analysis
18 0.060034465 133 nips-2010-Kernel Descriptors for Visual Recognition
19 0.057078727 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models
20 0.056362588 101 nips-2010-Gaussian sampling by local perturbations
topicId topicWeight
[(0, 0.165), (1, 0.08), (2, -0.057), (3, -0.116), (4, -0.109), (5, 0.039), (6, 0.068), (7, 0.038), (8, -0.037), (9, -0.037), (10, -0.005), (11, 0.05), (12, 0.057), (13, -0.006), (14, -0.014), (15, 0.009), (16, 0.007), (17, 0.049), (18, 0.104), (19, 0.026), (20, 0.033), (21, 0.006), (22, -0.026), (23, -0.02), (24, -0.083), (25, 0.025), (26, -0.01), (27, 0.012), (28, 0.043), (29, -0.019), (30, 0.13), (31, -0.086), (32, 0.035), (33, 0.051), (34, -0.008), (35, -0.002), (36, 0.002), (37, -0.037), (38, -0.009), (39, 0.057), (40, 0.009), (41, -0.037), (42, -0.044), (43, 0.021), (44, -0.073), (45, 0.053), (46, -0.018), (47, 0.131), (48, -0.025), (49, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.93544251 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
2 0.71478283 262 nips-2010-Switched Latent Force Models for Movement Segmentation
Author: Mauricio Alvarez, Jan R. Peters, Neil D. Lawrence, Bernhard Schölkopf
Abstract: Latent force models encode the interaction between multiple related dynamical systems in the form of a kernel or covariance function. Each variable to be modeled is represented as the output of a differential equation and each differential equation is driven by a weighted sum of latent functions with uncertainty given by a Gaussian process prior. In this paper we consider employing the latent force model framework for the problem of determining robot motor primitives. To deal with discontinuities in the dynamical systems or the latent driving force we introduce an extension of the basic latent force model, that switches between different latent functions and potentially different dynamical systems. This creates a versatile representation for robot movements that can capture discrete changes and non-linearities in the dynamics. We give illustrative examples on both synthetic data and for striking movements recorded using a Barrett WAM robot as haptic input device. Our inspiration is robot motor primitives, but we expect our model to have wide application for dynamical systems including models for human motion capture data and systems biology.
3 0.71393025 89 nips-2010-Factorized Latent Spaces with Structured Sparsity
Author: Yangqing Jia, Mathieu Salzmann, Trevor Darrell
Abstract: Recent approaches to multi-view learning have shown that factorizing the information into parts that are shared across all views and parts that are private to each view could effectively account for the dependencies and independencies between the different input modalities. Unfortunately, these approaches involve minimizing non-convex objective functions. In this paper, we propose an approach to learning such factorized representations inspired by sparse coding techniques. In particular, we show that structured sparsity allows us to address the multiview learning problem by alternately solving two convex optimization problems. Furthermore, the resulting factorized latent spaces generalize over existing approaches in that they allow having latent dimensions shared between any subset of the views instead of between all the views only. We show that our approach outperforms state-of-the-art methods on the task of human pose estimation.
4 0.68018621 103 nips-2010-Generating more realistic images using gated MRF's
Author: Marc'aurelio Ranzato, Volodymyr Mnih, Geoffrey E. Hinton
Abstract: Probabilistic models of natural images are usually evaluated by measuring performance on rather indirect tasks, such as denoising and inpainting. A more direct way to evaluate a generative model is to draw samples from it and to check whether statistical properties of the samples match the statistics of natural images. This method is seldom used with high-resolution images, because current models produce samples that are very different from natural images, as assessed by even simple visual inspection. We investigate the reasons for this failure and we show that by augmenting existing models so that there are two sets of latent variables, one set modelling pixel intensities and the other set modelling image-specific pixel covariances, we are able to generate high-resolution images that look much more realistic than before. The overall model can be interpreted as a gated MRF where both pair-wise dependencies and mean intensities of pixels are modulated by the states of latent variables. Finally, we confirm that if we disallow weight-sharing between receptive fields that overlap each other, the gated MRF learns more efficient internal representations, as demonstrated in several recognition tasks.
1 Introduction and Prior Work
The study of the statistical properties of natural images has a long history and has influenced many fields, from image processing to computational neuroscience [1]. In this work we focus on probabilistic models of natural images. These models are useful for extracting representations [2, 3, 4] that can be used for discriminative tasks and they can also provide adaptive priors [5, 6, 7] that can be used in applications like denoising and inpainting. Our main focus, however, will be on improving the quality of the generative model, rather than exploring its possible applications. Markov Random Fields (MRF's) provide a very general framework for modelling natural images.
In an MRF, an image is assigned a probability which is a normalized product of potential functions, with each function typically being defined over a subset of the observed variables. In this work we consider a very versatile class of MRF’s in which potential functions are defined over both pixels and latent variables, thus allowing the states of the latent variables to modulate or gate the effective interactions between the pixels. This type of MRF, which we dub gated MRF, was proposed as an image model by Geman and Geman [8]. Welling et al. [9] showed how an MRF in this family (footnote 1) could be learned for small image patches, and their work was extended to high-resolution images by Roth and Black [6], who also demonstrated its success in some practical applications [7]. Besides their practical use, these models were specifically designed to match the statistical properties of natural images, and therefore it seems natural to evaluate them in those terms. Indeed, several authors [10, 7] have proposed that these models should be evaluated by generating images and checking whether the samples match the statistical properties observed in natural images. It is, therefore, very troublesome that none of the existing models can generate good samples, especially for high-resolution images (see for instance fig. 2 in [7], which is one of the best models of high-resolution images reported in the literature so far). In fact, as our experiments demonstrate, the generated samples from these models are more similar to random images than to natural images!
Footnote 1: Product of Student’s t models (without pooling) may not appear to have latent variables, but each potential can be viewed as an infinite mixture of zero-mean Gaussians where the inverse variance of the Gaussian is the latent variable.
When MRF’s with gated interactions are applied to small image patches, they actually seem to work moderately well, as demonstrated by several authors [11, 12, 13].
The generated patches have some coherent and elongated structure and, like natural image patches, they are predominantly very smooth with sudden outbreaks of strong structure. This is unsurprising because these models have a built-in assumption that images are very smooth with occasional strong violations of smoothness [8, 14, 15]. However, the extension of these patch-based models to high-resolution images by replicating filters across the image has proven to be difficult. The receptive fields that are learned no longer resemble Gabor wavelets but look random [6, 16] and the generated images lack any of the long range structure that is so typical of natural images [7]. The success of these methods in applications such as denoising is a poor measure of the quality of the generative model that has been learned: Setting the parameters to random values works almost as well for eliminating independent Gaussian noise [17], because this can be done quite well by just using a penalty for high-frequency variation. In this work, we show that the generative quality of these models can be drastically improved by jointly modelling both pixel mean intensities and pixel covariances. This can be achieved by using two sets of latent variables, one that gates pair-wise interactions between pixels and another one that sets the mean intensities of pixels, as we already proposed in some earlier work [4]. Here, we show that this modelling choice is crucial to make the gated MRF work well on high-resolution images. Finally, we show that the most widely used method of sharing weights in MRF’s for high-resolution images is overly constrained. Earlier work considered homogeneous MRF’s in which each potential is replicated at all image locations. This has the subtle effect of making learning very difficult because of strong correlations at nearby sites. 
Following Gregor and LeCun [18] and also Tang and Eliasmith [19], we keep the number of parameters under control by using local potentials, but unlike Roth and Black [6] we only share weights between potentials that do not overlap.
2 Augmenting Gated MRF’s with Mean Hidden Units
A Product of Student’s t (PoT) model [15] is a gated MRF defined on small image patches that can be viewed as modelling image-specific, pair-wise relationships between pixel values by using the states of its latent variables. It is very good at representing the fact that two pixels have very similar intensities and no good at all at modelling what these intensities are. Failure to model the mean also leads to impoverished modelling of the covariances when the input images have non-zero mean intensity. The covariance RBM (cRBM) [20] is another model that shares the same limitation since it only differs from PoT in the distribution of its latent variables: the posterior over the latent variables is a product of Bernoulli distributions instead of Gamma distributions as in PoT. We explain the fundamental limitation of these models by using a simple toy example: modelling two-pixel images using a cRBM with only one binary hidden unit, see fig. 1. This cRBM assumes that the conditional distribution over the input is a zero-mean Gaussian with a covariance that is determined by the state of the latent variable. Since the latent variable is binary, the cRBM can be viewed as a mixture of two zero-mean full covariance Gaussians. The latent variable uses the pairwise relationship between pixels to decide which of the two covariance matrices should be used to model each image. When the input data is pre-processed by making each image have zero mean intensity (the empirical histogram is shown in the first row and first column), most images lie near the origin because most of the time nearby pixels are strongly correlated.
Less frequently we encounter edge images that exhibit strong anti-correlation between the pixels, as shown by the long tails along the anti-diagonal line. A cRBM could model this data by using two Gaussians (first row and second column): one that is spherical and tight at the origin for smooth images and another one that has a covariance elongated along the anti-diagonal for structured images. If, however, the whole set of images is normalized by subtracting from every pixel the mean value of all pixels over all images (second row and first column), the cRBM fails at modelling structured images (second row and second column). It can fit a Gaussian to the smooth images by discovering the direction of strong correlation along the main diagonal, but it is very likely to fail to discover the direction of anti-correlation, which is crucial to represent discontinuities, because structured images with different mean intensity appear to be evenly spread over the whole input space. If the model has another set of latent variables that can change the means of the Gaussian distributions in the mixture (as explained more formally below and yielding the mPoT and mcRBM models), then the model can represent both changes of mean intensity and the correlational structure of pixels (see last column). The mean latent variables effectively subtract off the relevant mean from each data-point, letting the covariance latent variable capture the covariance structure of the data. As before, the covariance latent variable needs only to select between two covariance matrices.
Figure 1: In the first row, each image is zero mean. In the second row, the whole set of data points is centered but each image can have non-zero mean. The first column shows 8x8 images picked at random from natural images. The images in the second column are generated by a model that does not account for mean intensity. The images in the third column are generated by a model that has both “mean” and “covariance” hidden units. The contours in the first column show the negative log of the empirical distribution of (tiny) natural two-pixel images (x-axis being the first pixel and the y-axis the second pixel). The plots in the other columns are toy examples showing how each model could represent the empirical distribution using a mixture of Gaussians with components that have one of two possible covariances (corresponding to the state of a binary “covariance” latent variable). Models that can change the means of the Gaussians (mPoT and mcRBM) can better represent structured images (edge images lie along the anti-diagonal and are fitted by the Gaussians shown in red) while the other models (PoT and cRBM) fail, especially when each image can have non-zero mean.
In fact, experiments on real 8x8 image patches confirm these conjectures. Fig. 1 shows samples drawn from PoT and mPoT. mPoT (and similarly mcRBM [4]) is not only better at modelling zero-mean images but it can also represent images that have non-zero mean intensity well. We now describe mPoT, referring the reader to [4] for a detailed description of mcRBM. In PoT [9] the energy function is:
$$E^{PoT}(x, h^c) = \sum_i \left[ h^c_i \left( 1 + \frac{1}{2} (C_i^T x)^2 \right) + (1 - \gamma) \log h^c_i \right] \qquad (1)$$
where $x$ is a vectorized image patch, $h^c$ is a vector of Gamma “covariance” latent variables, $C$ is a filter bank matrix and $\gamma$ is a scalar parameter. The joint probability over input pixels and latent variables is proportional to $\exp(-E^{PoT}(x, h^c))$. Therefore, the conditional distribution over the input pixels is a zero-mean Gaussian with covariance equal to:
$$\Sigma^c = \left( C \, \mathrm{diag}(h^c) \, C^T \right)^{-1}. \qquad (2)$$
In order to make the mean of the conditional distribution non-zero, we define mPoT as the normalized product of the above zero-mean Gaussian that models the covariance and a spherical covariance Gaussian that models the mean.
The overall energy function becomes:
$$E^{mPoT}(x, h^c, h^m) = E^{PoT}(x, h^c) + E^m(x, h^m) \qquad (3)$$
where $h^m$ is another set of latent variables that are assumed to be Bernoulli distributed (but other distributions could be used). The new energy term is:
$$E^m(x, h^m) = \frac{1}{2} x^T x - \sum_j h^m_j W_j^T x \qquad (4)$$
yielding the following conditional distribution over the input pixels:
$$p(x | h^c, h^m) = N(\Sigma (W h^m), \Sigma), \qquad \Sigma = \left( (\Sigma^c)^{-1} + I \right)^{-1} \qquad (5)$$
with $\Sigma^c$ defined in eq. 2, so that the precision of eq. 5 is the sum of the precision $C \, \mathrm{diag}(h^c) \, C^T$ contributed by eq. 1 and the identity contributed by the quadratic term of eq. 4. As desired, the conditional distribution has non-zero mean (footnote 2).
Figure 2: Illustration of different choices of weight-sharing scheme for a RBM. Links converging to one latent variable are filters. Filters with the same color share the same parameters. Kinds of weight-sharing scheme: A) Global, B) Local, C) TConv and D) Conv. E) TConv applied to an image. Cells correspond to neighborhoods to which filters are applied. Cells with the same color share the same parameters. F) 256 filters learned by a Gaussian RBM with TConv weight-sharing scheme on high-resolution natural images. Each filter has size 16x16 pixels and it is applied every 16 pixels in both the horizontal and vertical directions. Filters in position (i, j) and (1, 1) are applied to neighborhoods that are (i, j) pixels away from each other. Best viewed in color.
Patch-based models like PoT have been extended to high-resolution images by using spatially localized filters [6]. While we can subtract off the mean intensity from independent image patches to successfully train PoT, we cannot do that on a high-resolution image because overlapping patches might have different means. Unfortunately, replicating potentials over the image while ignoring variations of mean intensity has been the leading strategy to date [6] (footnote 3). This is the major reason why generation of high-resolution images is so poor. Sec. 4 shows that generation can be drastically improved by explicitly accounting for variations of mean intensity, as performed by mPoT and mcRBM.
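The conditional distribution of eq. 5 can be checked numerically: adding the quadratic terms of eq. 1 and eq. 4 gives a precision of C diag(h^c) C^T + I for the pixels, and the linear term of eq. 4 gives W h^m. The NumPy sketch below computes the resulting mean and covariance for a two-pixel toy setting in the spirit of fig. 1. The function name `mpot_conditional`, the filter values and the gate values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mpot_conditional(C, W, h_c, h_m):
    """Mean and covariance of p(x | h_c, h_m) for a patch-based mPoT.

    C: (D, N) covariance filters stored as columns, h_c: (N,) positive gates.
    W: (D, M) mean filters stored as columns,       h_m: (M,) binary gates.
    """
    precision_c = C @ np.diag(h_c) @ C.T           # inverse of Sigma^c in eq. 2
    precision = precision_c + np.eye(C.shape[0])   # identity comes from the x^T x / 2 term of eq. 4
    sigma = np.linalg.inv(precision)
    mean = sigma @ (W @ h_m)
    return mean, sigma

# Hypothetical two-pixel model: one covariance filter along the anti-diagonal,
# one mean filter along the main diagonal.
C = np.array([[1.0], [-1.0]]) / np.sqrt(2.0)
W = np.array([[1.0], [1.0]])

mean_off, sigma = mpot_conditional(C, W, np.array([3.0]), np.array([0.0]))
mean_on, _ = mpot_conditional(C, W, np.array([3.0]), np.array([1.0]))
print(mean_off, mean_on)
print(sigma)
```

In this toy setting, switching the mean unit on moves the conditional mean along the main diagonal while the covariance stays the same, which is the behaviour the two-pixel example relies on.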
Footnote 2: The need to model the means was clearly recognized in [21] but they used conjunctive latent features that simultaneously represented a contribution to the “precision matrix” in a specific direction and the mean along that same direction.
Footnote 3: The success of PoT-like models in Bayesian denoising is not surprising since the noisy image effectively replaces the reconstruction term from the mean hidden units (see eq. 5), providing a set of noisy mean intensities that are cleaned up by the patterns of correlation enforced by the covariance latent variables.
3 Weight-Sharing Schemes
By integrating out the latent variables, we can write the density function of any gated MRF as a normalized product of potential functions (for mPoT refer to eq. 6). In this section we investigate different ways of constraining the parameters of the potentials of a generic MRF.
Global: The obvious way to extend a patch-based model like PoT to high-resolution images is to define potentials over the whole image; we call this scheme global. This is not practical because 1) the number of parameters grows about quadratically with the size of the image making training too slow, 2) we do not need to model interactions between very distant pairs of pixels since their dependence is negligible, and 3) we would not be able to use the model on images of different size.
Conv: The most popular way to handle big images is to define potentials on small subsets of variables (e.g., neighborhoods of size 5x5 pixels) and to replicate these potentials across space while sharing their parameters at each image location [23, 24, 6]. This yields a convolutional weight-sharing scheme, also called homogeneous field in the statistics literature. This choice is justified by the stationarity of natural images. This weight-sharing scheme is extremely concise in terms of number of parameters, but also rather inefficient in terms of latent representation.
First, if there are N filters at each location and these filters are stepped by one pixel then the internal representation is about N times overcomplete. The internal representation not only has a high computational cost, but it is also highly redundant. Since the input is mostly smooth and the parameters are the same across space, the latent variables are strongly correlated as well. This inefficiency turns out to be particularly harmful for a model like PoT, causing the learned filters to become “random” looking (see fig. 3-iii). A simple intuition follows from the equivalence between PoT and square ICA [15]. If the filter matrix C of eq. 1 is square and invertible, we can marginalize out the latent variables and write: $p(y) = \prod_i S(y_i)$, where $y_i = C_i^T x$ and $S$ is a Student’s t distribution. In other words, there is an underlying assumption that filter outputs are independent. However, if the filters of matrix C are shifted and overlapping versions of each other, this clearly cannot be true. Training PoT with the Conv weight-sharing scheme forces the model to find filters that make filter outputs as independent as possible, which explains the very high-frequency patterns that are usually discovered [6].
Local: The Global and Conv weight-sharing schemes are at the two extremes of a spectrum of possibilities. For instance, we can define potentials on a small subset of input variables but, unlike Conv, each potential can have its own set of parameters, as shown in fig. 2-B. This is called local, or inhomogeneous field. Compared to Conv the number of parameters increases only slightly but the number of latent variables required and their redundancy is greatly reduced. In fact, the model learns different receptive fields at different locations as a better strategy for representing the input, especially when the number of potentials is limited (see also fig. 2-F).
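Two of the scaling claims above are easy to verify with a few lines of arithmetic: the roughly quadratic growth of the Global parameter count with image size, and the roughly N-fold overcompleteness of Conv when N filters are stepped by one pixel. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, only filter positions that fit entirely inside the image are counted, and the image and filter sizes are example values.

```python
def global_param_count(n_pixels, n_latents):
    """Global scheme: every latent variable has its own filter over the whole image."""
    return n_pixels * n_latents

def conv_overcompleteness(image_side, filter_side, n_filters, stride=1):
    """Ratio of latent variables to pixels when filters are replicated every `stride` pixels."""
    positions_per_side = (image_side - filter_side) // stride + 1  # valid positions only
    n_latents = n_filters * positions_per_side ** 2
    return n_latents / float(image_side ** 2)

# Global: with one latent variable per pixel, 4x the pixels costs 16x the parameters.
small = global_param_count(28 * 28, 28 * 28)
large = global_param_count(56 * 56, 56 * 56)
growth = large / small

# Conv: 64 filters of size 8x8 stepped by one pixel on a 160x160 image,
# compared with the same filters stepped by their own width.
dense_ratio = conv_overcompleteness(160, 8, 64, stride=1)
tiled_ratio = conv_overcompleteness(160, 8, 64, stride=8)
print(growth, dense_ratio, tiled_ratio)
```

The dense ratio falls slightly short of N because filters cannot be centered on border pixels, which is why the text says "about" N times overcomplete.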
TConv: Local would not allow the model to be trained and tested on images of different resolution, and it might seem wasteful not to exploit the translation invariant property of images. We therefore advocate the use of a weight-sharing scheme that we call tiled-convolutional (TConv) shown in fig. 2-C and E [18]. Each filter tiles the image without overlaps with copies of itself (i.e. the stride equals the filter diameter). This reduces spatial redundancy of latent variables and allows the input images to have arbitrary size. At the same time, different filters do overlap with each other in order to avoid tiling artifacts. Fig. 2-F shows filters that were (jointly) learned by a Restricted Boltzmann Machine (RBM) [29] with Gaussian input variables using the TConv weight-sharing scheme.
4 Experiments
We train gated MRF’s with and without mean hidden units using different weight-sharing schemes. The training procedure is very similar in all cases. We perform approximate maximum likelihood by using Fast Persistent Contrastive Divergence (FPCD) [25] and we draw samples by using Hybrid Monte Carlo (HMC) [26]. Since all latent variables can be exactly marginalized out we can use HMC on the free energy (negative logarithm of the marginal distribution over the input pixels). For mPoT this is:
$$F^{mPoT}(x) = -\log p(x) + \mathrm{const.} = \sum_{k,i} \gamma \log\left( 1 + \frac{1}{2} (C_{ik}^T x_k)^2 \right) + \frac{1}{2} x^T x - \sum_{k,j} \log\left( 1 + \exp(W_{jk}^T x_k) \right) \qquad (6)$$
where the index $k$ runs over spatial locations and $x_k$ is the k-th image patch. FPCD keeps samples, called negative particles, that it uses to represent the model distribution. These particles are all updated after each weight update. For each mini-batch of data-points a) we compute the derivative of the free energy w.r.t. the training samples, b) we update the negative particles by running HMC for one HMC step consisting of 20 leapfrog steps.
We start at the previous set of negative particles and use as parameters the sum of the regular parameters and a small perturbation vector, c) we compute the derivative of the free energy at the negative particles, and d) we update the regular parameters by using the difference of gradients between step a) and c) while the perturbation vector is updated using the gradient from c) only. The perturbation is also strongly decayed to zero and is subject to a larger learning rate. The aim is to encourage the negative particles to explore the space more quickly by slightly and temporarily raising the energy at their current position. Note that the use of FPCD as opposed to other estimation methods (like Persistent Contrastive Divergence [27]) turns out to be crucial to achieve good mixing of the sampler even after training.
Figure 3: 160x160 samples drawn by A) mPoT-TConv, B) mHPoT-TConv, C) mcRBM-TConv and D) PoT-TConv. On the side also i) a subset of 8x8 “covariance” filters learned by mPoT-TConv (the plot below shows how the whole set of filters tile a small patch; each bar corresponds to a Gabor fit of a filter and colors identify filters applied at the same 8x8 location, each group is shifted by 2 pixels down the diagonal and a high-resolution image is tiled by replicating this pattern every 8 pixels horizontally and vertically), ii) a subset of 8x8 “mean” filters learned by the same mPoT-TConv, iii) filters learned by PoT-Conv and iv) by PoT-TConv.
We train on mini-batches of 32 samples using gray-scale images of approximate size 160x160 pixels randomly cropped from the Berkeley segmentation dataset [28]. We perform 160,000 weight updates, decreasing the learning rate by a factor of 4 by the end of training. The initial learning rate is set to 0.1 for the covariance filters (matrix C of eq. 1), 0.01 for the mean parameters (matrix W of eq. 4), and 0.001 for the other parameters (γ of eq. 1).
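The two ingredients of this training loop that are fully specified by the text are the free energy of eq. 6 and the bookkeeping of step d). The NumPy sketch below writes the free energy for a single patch (one spatial location k), its gradient with respect to the pixels as needed by the HMC leapfrog steps, and one possible reading of the parameter update. The function names, the decay constant and the learning rates in `fpcd_step` are assumptions made for illustration; the HMC sampler itself is omitted.

```python
import numpy as np

def free_energy(x, C, W, gamma):
    """Eq. 6 restricted to one patch: x is (D,), C is (D, N), W is (D, M)."""
    y = C.T @ x                                    # covariance filter outputs
    a = W.T @ x                                    # mean filter outputs
    return (gamma * np.log1p(0.5 * y ** 2).sum()
            + 0.5 * (x @ x)
            - np.logaddexp(0.0, a).sum())          # log(1 + exp(a)), computed stably

def free_energy_grad_x(x, C, W, gamma):
    """Gradient of the free energy with respect to the pixels."""
    y = C.T @ x
    a = W.T @ x
    gate = 1.0 / (1.0 + np.exp(-a))                # derivative of log(1 + exp(a))
    return C @ (gamma * y / (1.0 + 0.5 * y ** 2)) + x - W @ gate

def fpcd_step(theta, fast, grad_data, grad_neg, lr, fast_lr, fast_decay):
    """Step d): regular parameters follow the difference of free-energy gradients,
    the perturbation follows the negative-particle gradient only and is decayed.
    Negative particles would be sampled with parameters theta + fast."""
    theta = theta - lr * (grad_data - grad_neg)
    fast = fast_decay * fast + fast_lr * grad_neg  # raises the energy at the particles
    return theta, fast

# Small random instance to exercise the functions (sizes are arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(size=4)
C = rng.normal(size=(4, 6))
W = rng.normal(size=(4, 3))
print(free_energy(x, C, W, 1.5), free_energy_grad_x(x, C, W, 1.5))
```

Comparing `free_energy_grad_x` against finite differences of `free_energy` is a cheap sanity check before plugging the gradient into a sampler.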
During training we condition on the borders and initialize the negative particles at zero in order to avoid artifacts at the border of the image. We learn 8x8 filters and pre-multiply the covariance filters by a whitening transform retaining 99% of the variance; we also normalize the norm of the covariance filters to prevent some of them from decaying to zero during training (footnote 4). Whenever we use the TConv weight-sharing scheme the model learns covariance filters that mostly resemble localized and oriented Gabor functions (see fig. 3-i and iv), while the Conv weight-sharing scheme learns structured but poorly localized high-frequency patterns (see fig. 3-iii) [6]. The TConv models re-use the same 8x8 filters every 8 pixels and apply a diagonal offset of 2 pixels between neighboring filters with different weights in order to reduce tiling artifacts. There are 4 sets of filters, each with 64 filters for a total of 256 covariance filters (see bottom plot of fig. 3). Similarly, we have 4 sets of mean filters, each with 32 filters. These filters usually have non-zero mean and exhibit on-center off-surround and off-center on-surround patterns, see fig. 3-ii. In order to draw samples from the learned models, we run HMC for a long time (10,000 iterations, each composed of 20 leap-frog steps). Some samples of size 160x160 pixels are reported in fig. 3 A)-D). Without modelling the mean intensity, samples lack structure and do not seem much different from those that would be generated by a simple Gaussian model merely fitting the second order statistics (see fig. 3 in [1] and also fig. 2 in [7]). By contrast, structure, sharp boundaries and some simple texture emerge only from models that have mean latent variables, namely mcRBM, mPoT and mHPoT, which differs from mPoT by having a second layer pooling matrix on the squared covariance filter outputs [11]. A more quantitative comparison is reported in table 1.
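The TConv layout used in these experiments (8x8 filters re-used every 8 pixels, 4 sets, each set shifted by 2 pixels down the diagonal) can be made concrete by listing where each set of filters is applied. The sketch below is a minimal illustration with a hypothetical function name. It assumes that tiles which would extend past the image are simply dropped, which is a simplification: the paper conditions on the borders instead.

```python
def tconv_layout(image_side, filter_side=8, n_sets=4, offset=2):
    """Top-left corners of the tiles of each filter set, keeping only tiles inside the image."""
    last_start = image_side - filter_side
    layout = []
    for s in range(n_sets):
        first = s * offset                                   # diagonal shift of this set
        starts = range(first, last_start + 1, filter_side)   # stride equals the filter size
        layout.append([(r, c) for r in starts for c in starts])
    return layout

layout = tconv_layout(160)
tiles_per_set = [len(corners) for corners in layout]
print(tiles_per_set)
```

Within one set the tiles never overlap because consecutive corners are a full filter width apart, while tiles of different sets overlap by construction, which matches the description of TConv given earlier.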
We first compute marginal statistics of filter responses using the generated images, natural images from the test set, and random images. The statistics are the normalized histogram of individual filter responses to 24 Gabor filters (8 orientations and 3 scales). We then calculate the KL divergence between the histograms on random images and generated images and the KL divergence between the histograms on natural images and generated images. The table also reports the average difference of energies between random images and natural images. All results demonstrate that models that account for mean intensity generate images that are closer to natural images than to random images, whereas models that do not account for the mean (like the widely used PoT-Conv) produce samples that are actually closer to random images.
MODEL | F(R) − F(T) (10^4) | KL(R ‖ G) | KL(T ‖ G) | KL(R ‖ G) − KL(T ‖ G)
PoT - Conv | 2.9 | 0.3 | 0.6 | -0.3
PoT - TConv | 2.8 | 0.4 | 1.0 | -0.6
mPoT - TConv | 5.2 | 1.0 | 0.2 | 0.8
mHPoT - TConv | 4.9 | 1.7 | 0.8 | 0.9
mcRBM - TConv | 3.5 | 1.5 | 1.0 | 0.5
Table 1: Comparing MRF’s by measuring: difference of energy (negative log ratio of probabilities) between random images (R) and test natural images (T), the KL divergence between statistics of random images (R) and generated images (G), KL divergence between statistics of test natural images (T) and generated images (G), and difference of these two KL divergences. Statistics are computed using 24 Gabor filters.
Footnote 4: The code used in the experiments can be found at the first author’s web-page.
4.1 Discriminative Experiments on Weight-Sharing Schemes
In future work, we intend to use the features discovered by the generative model for recognition. To understand how the different weight sharing schemes affect recognition performance we have done preliminary tests using the discriminative performance of a simpler model on simpler data. We consider one of the simplest and most versatile models, namely the RBM [29].
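As a side note on the statistics behind table 1, the KL columns compare normalized histograms of filter responses. The sketch below shows one way such a comparison could be computed in plain Python. The function names, the binning range, the clamping of outliers into the edge bins and the small smoothing constant `eps` are all assumptions made for illustration; the paper does not specify these details, and the Gabor filtering step is omitted.

```python
import math

def normalized_histogram(values, lo, hi, n_bins):
    """Histogram of filter responses, normalized to sum to one."""
    counts = [0] * n_bins
    width = (hi - lo) / float(n_bins)
    for v in values:
        b = min(max(int((v - lo) / width), 0), n_bins - 1)  # clamp outliers into the edge bins
        counts[b] += 1
    total = float(len(values))
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two normalized histograms; eps avoids division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy responses standing in for one Gabor filter applied to two image sets.
hist_a = normalized_histogram([0.1, 0.2, 0.6, 0.9], 0.0, 1.0, 2)
hist_b = [0.9, 0.1]
print(hist_a, kl_divergence(hist_a, hist_b))
```

In the setting of table 1, the last column would be obtained by evaluating such a divergence once with random-image histograms and once with natural-image histograms against the generated-image histograms, and subtracting the two.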
Since we also aim to test the Global weight-sharing scheme we are constrained to using fairly low resolution datasets such as the MNIST dataset of handwritten digits [30] and the CIFAR 10 dataset of generic object categories [22]. The MNIST dataset has soft binary images of size 28x28 pixels, while the CIFAR 10 dataset has color images of size 32x32 pixels. CIFAR 10 has 10 classes, 5000 training samples per class and 1000 test samples per class. MNIST also has 10 classes with, on average, 6000 training samples per class and 1000 test samples per class. The energy function of the RBM trained on the CIFAR 10 dataset, modelling input pixels with 3 (R,G,B) Gaussian variables [31], is exactly the one shown in eq. 4; while the RBM trained on MNIST uses logistic units for the pixels and the energy function is again the same as before but without any quadratic term. All models are trained in an unsupervised way to approximately maximize the likelihood in the training set using Contrastive Divergence [32]. They are then used to represent each input image with a feature vector (mean of the posterior over the latent variables) which is fed to a multinomial logistic classifier for discrimination. Models are compared in terms of: 1) recognition accuracy, 2) convergence time and 3) dimensionality of the representation. In general, assuming filters much smaller than the input image and assuming equal number of latent variables, Conv, TConv and Local models process each sample faster than Global by a factor approximately equal to the ratio between the area of the image and the area of the filters, which can be very large in practice. In the first set of experiments reported on the left of fig. 4 we study the internal representation in terms of discrimination and dimensionality using the MNIST dataset. For each choice of dimensionality all models are trained using the same number of operations. 
This is set to the amount necessary to complete one epoch over the training set using the Global model. This experiment shows that: 1) Local outperforms all other weight-sharing schemes for a wide range of dimensionalities, 2) TConv does not perform as well as Local probably because the translation invariant assumption is clearly violated for these relatively small, centered, images, 3) Conv performs well only when the internal representation is very high dimensional (10 times overcomplete) otherwise it severely underfits, 4) Global performs well when the representation is compact but its performance degrades rapidly as this increases because it needs more than the allotted training time.
Figure 4: Experiments on MNIST using RBM’s with different weight-sharing schemes. Left: Error rate (%) as a function of the dimensionality of the latent representation, for Global, Local, TConv and Conv. Right: Error rate (%) as a function of the number of operations (normalized to those needed to perform one epoch in the Global model), for Global, Local and Conv; all models have a twice overcomplete latent representation.
The right hand side of fig. 4 shows how the recognition performance evolves as we increase the number of operations (or training time) using models that produce a twice overcomplete internal representation. With only very few filters Conv still underfits and it does not improve its performance by training for longer, but Global does improve and eventually it reaches the performance of Local. If we look at the crossing of the error rate at 2% we can see that Local is about 4 times faster than Global. To summarize, Local provides more compact representations than Conv, is much faster than Global while achieving similar performance in discrimination.
Also, Local can easily scale to larger images while Global cannot. Similar experiments are performed using the CIFAR 10 dataset [22] of natural images. Using the same protocol introduced in earlier work by Krizhevsky [22], the RBM’s are trained in an unsupervised way on a subset of the 80 million tiny images dataset [33] and then “fine-tuned” on the CIFAR 10 dataset by supervised back-propagation of the error through the linear classifier and feature extractor. All models produce an approximately 10,000 dimensional internal representation to make a fair comparison. Models using local filters learn 16x16 filters that are stepped every pixel. Again, we do not experiment with the TConv weight-sharing scheme because the image is not large enough to allow enough replicas. Similarly to fig. 3-iii, the Conv weight-sharing scheme was very difficult to train and did not produce Gabor-like features. Indeed, careful injection of sparsity and long training time seem necessary [31] for these RBM’s. By contrast, both Local and Global produce Gabor-like filters similar to those shown in fig. 2-F. The model trained with the Conv weight-sharing scheme yields an accuracy equal to 56.6%, while Local and Global yield much better performance, 63.6% and 64.8% [22], respectively. Although Local and Global have similar performance, training with the Local weight-sharing scheme took under an hour while using the Global weight-sharing scheme required more than a day.
5 Conclusions and Future Work
This work is motivated by the poor generative quality of currently popular MRF models of natural images. These models generate images that are actually more similar to white noise than to natural images. Our contribution is to recognize that current models can benefit from 1) the addition of a simple model of the mean intensities and from 2) the use of a less constrained weight-sharing scheme.
By augmenting these models with an extra set of latent variables that model mean intensity we can generate samples that look much more realistic: they are characterized by smooth regions, sharp boundaries and some simple high frequency texture. We validate our approach by comparing the statistics of filter outputs on natural images and generated images. In the future, we plan to integrate these MRF’s into deeper hierarchical models and to use their internal representation to perform object recognition in high-resolution images. The hope is to further improve generation by capturing longer range dependencies and to exploit this to better cope with missing values and ambiguous sensory inputs.
References
[1] E.P. Simoncelli. Statistical modeling of photographic images. Handbook of Image and Video Processing, pages 431–441, 2005.
[2] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[3] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[4] M. Ranzato and G.E. Hinton. Modeling pixel means and covariances using factorized third-order boltzmann machines. In CVPR, 2010.
[5] M.J. Wainwright and E.P. Simoncelli. Scale mixtures of gaussians and the statistics of natural images. In NIPS, 2000.
[6] S. Roth and M.J. Black. Fields of experts: A framework for learning image priors. In CVPR, 2005.
[7] U. Schmidt, Q. Gao, and S. Roth. A generative perspective on mrfs in low-level vision. In CVPR, 2010.
[8] S. Geman and D. Geman. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. PAMI, 6:721–741, 1984.
[9] M. Welling, G.E. Hinton, and S. Osindero. Learning sparse topographic representations with products of student-t distributions. In NIPS, 2003.
[10] S.C. Zhu and D. Mumford. Prior learning and gibbs reaction diffusion. PAMI, pages 1236–1250, 1997.
[11] S. Osindero, M. Welling, and G.E. Hinton. Topographic product models applied to natural scene statistics. Neural Comp., 18:344–381, 2006.
[12] S. Osindero and G.E. Hinton. Modeling image patches with a directed hierarchy of markov random fields. In NIPS, 2008.
[13] Y. Karklin and M.S. Lewicki. Emergence of complex cell properties by learning to generalize in natural scenes. Nature, 457:83–86, 2009.
[14] B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: a strategy employed by v1? Vision Research, 37:3311–3325, 1997.
[15] Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton. Energy-based models for sparse overcomplete representations. JMLR, 4:1235–1260, 2003.
[16] Y. Weiss and W.T. Freeman. What makes a good model of natural images? In CVPR, 2007.
[17] S. Roth and M.J. Black. Fields of experts. Int. Journal of Computer Vision, 82:205–229, 2009.
[18] K. Gregor and Y. LeCun. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010.
[19] C. Tang and C. Eliasmith. Deep networks for robust visual recognition. In ICML, 2010.
[20] M. Ranzato, A. Krizhevsky, and G.E. Hinton. Factored 3-way restricted boltzmann machines for modeling natural images. In AISTATS, 2010.
[21] N. Heess, C.K.I. Williams, and G.E. Hinton. Learning generative texture models with extended fields-of-experts. In BMVC, 2009.
[22] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009. MSc Thesis, Dept. of Comp. Science, Univ. of Toronto.
[23] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Acoustics Speech and Signal Proc., 37:328–339, 1989.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[25] T. Tieleman and G.E. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, 2009.
[26] R.M. Neal. Bayesian learning for neural networks. Springer-Verlag, 1996.
[27] T. Tieleman. Training restricted boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
[28] http://www.cs.berkeley.edu/projects/vision/grouping/segbench/.
[29] M. Welling, M. Rosen-Zvi, and G.E. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS, 2005.
[30] http://yann.lecun.com/exdb/mnist/.
[31] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proc. ICML, 2009.
[32] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.
[33] A. Torralba, R. Fergus, and W.T. Freeman. 80 million tiny images: a large dataset for non-parametric object and scene recognition. PAMI, 30:1958–1970, 2008.
5 0.66188848 6 nips-2010-A Discriminative Latent Model of Image Region and Object Tag Correspondence
Author: Yang Wang, Greg Mori
Abstract: We propose a discriminative latent model for annotating images with unaligned object-level textual annotations. Instead of using the bag-of-words image representation currently popular in the computer vision community, our model explicitly captures more intricate relationships underlying visual and textual information. In particular, we model the mapping that translates image regions to annotations. This mapping allows us to relate image regions to their corresponding annotation terms. We also model the overall scene label as latent information. This allows us to cluster test images. Our training data consist of images and their associated annotations. But we do not have access to the ground-truth region-to-annotation mapping or the overall scene label. We develop a novel variant of the latent SVM framework to model them as latent variables. Our experimental results demonstrate the effectiveness of the proposed model compared with other baseline methods.
6 0.65665901 242 nips-2010-Slice sampling covariance hyperparameters of latent Gaussian models
7 0.62495196 235 nips-2010-Self-Paced Learning for Latent Variable Models
8 0.56853813 284 nips-2010-Variational bounds for mixed-data factor analysis
9 0.5650866 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
10 0.56135303 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision
11 0.55879241 101 nips-2010-Gaussian sampling by local perturbations
12 0.54779238 99 nips-2010-Gated Softmax Classification
13 0.54412168 82 nips-2010-Evaluation of Rarity of Fingerprints in Forensics
14 0.52967662 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
15 0.52611589 137 nips-2010-Large Margin Learning of Upstream Scene Understanding Models
16 0.5100022 256 nips-2010-Structural epitome: a way to summarize one’s visual experience
17 0.48442239 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
18 0.47472981 60 nips-2010-Deterministic Single-Pass Algorithm for LDA
19 0.47045848 194 nips-2010-Online Learning for Latent Dirichlet Allocation
20 0.46930847 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
topicId topicWeight
[(13, 0.054), (17, 0.012), (27, 0.087), (30, 0.041), (35, 0.015), (45, 0.177), (50, 0.058), (52, 0.021), (60, 0.034), (72, 0.316), (77, 0.026), (78, 0.042), (90, 0.044)]
simIndex simValue paperId paperTitle
1 0.7682811 129 nips-2010-Inter-time segment information sharing for non-homogeneous dynamic Bayesian networks
Author: Dirk Husmeier, Frank Dondelinger, Sophie Lebre
Abstract: Conventional dynamic Bayesian networks (DBNs) are based on the homogeneous Markov assumption, which is too restrictive in many practical applications. Various approaches to relax the homogeneity assumption have recently been proposed, allowing the network structure to change with time. However, unless time series are very long, this flexibility leads to the risk of overfitting and inflated inference uncertainty. In the present paper we investigate three regularization schemes based on inter-segment information sharing, choosing different prior distributions and different coupling schemes between nodes. We apply our method to gene expression time series obtained during the Drosophila life cycle, and compare the predicted segmentation with other state-of-the-art techniques. We conclude our evaluation with an application to synthetic biology, where the objective is to predict a known in vivo regulatory network of five genes in yeast.
same-paper 2 0.72836852 213 nips-2010-Predictive Subspace Learning for Multi-view Data: a Large Margin Approach
Author: Ning Chen, Jun Zhu, Eric P. Xing
Abstract: Learning from multi-view data is important in many applications, such as image classification and annotation. In this paper, we present a large-margin learning framework to discover a predictive latent subspace representation shared by multiple views. Our approach is based on an undirected latent space Markov network that fulfills a weak conditional independence assumption that multi-view observations and response variables are independent given a set of latent variables. We provide efficient inference and parameter estimation methods for the latent subspace model. Finally, we demonstrate the advantages of large-margin learning on real video and web image data for discovering predictive latent representations and improving the performance on image classification, annotation and retrieval.
3 0.72553605 288 nips-2010-Worst-case bounds on the quality of max-product fixed-points
Author: Meritxell Vinyals, Jesús Cerquides, Alessandro Farinelli, Juan A. Rodríguez-Aguilar
Abstract: We study worst-case bounds on the quality of any fixed point assignment of the max-product algorithm for Markov Random Fields (MRF). We start by providing a bound independent of the MRF structure and parameters. Afterwards, we show how this bound can be improved for MRFs with specific structures such as bipartite graphs or grids. Our results provide interesting insight into the behavior of max-product. For example, we prove that max-product provides very good results (at least 90% optimal) on MRFs with large variable-disjoint cycles.
4 0.5845257 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions.
5 0.57144493 1 nips-2010-(RF)^2 -- Random Forest Random Field
Author: Nadia Payet, Sinisa Todorovic
Abstract: We combine random forest (RF) and conditional random field (CRF) into a new computational framework, called random forest random field (RF)^2. Inference of (RF)^2 uses the Swendsen-Wang cut algorithm, characterized by Metropolis-Hastings jumps. A jump from one state to another depends on the ratio of the proposal distributions, and on the ratio of the posterior distributions of the two states. Prior work typically resorts to a parametric estimation of these four distributions, and then computes their ratio. Our key idea is to instead directly estimate these ratios using RF. RF collects in leaf nodes of each decision tree the class histograms of training examples. We use these class histograms for a nonparametric estimation of the distribution ratios. We derive the theoretical error bounds of a two-class (RF)^2. (RF)^2 is applied to a challenging task of multiclass object recognition and segmentation over a random field of input image regions. In our empirical evaluation, we use only the visual information provided by image regions (e.g., color, texture, spatial layout), whereas the competing methods additionally use higher-level cues about the horizon location and 3D layout of surfaces in the scene. Nevertheless, (RF)^2 outperforms the state of the art on benchmark datasets, in terms of accuracy and computation time.
6 0.56624758 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors
7 0.56350994 194 nips-2010-Online Learning for Latent Dirichlet Allocation
8 0.56274831 98 nips-2010-Functional form of motion priors in human motion perception
9 0.56164092 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
10 0.56121325 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers
11 0.56096625 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
12 0.5595389 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
13 0.55941194 186 nips-2010-Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification
14 0.55841666 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average
15 0.5580619 158 nips-2010-Learning via Gaussian Herding
16 0.55786484 21 nips-2010-Accounting for network effects in neuronal responses using L1 regularized point process models
17 0.55770278 161 nips-2010-Linear readout from a neural population with partial correlation data
18 0.55694604 44 nips-2010-Brain covariance selection: better individual functional connectivity models using population prior
19 0.55666393 63 nips-2010-Distributed Dual Averaging In Networks
20 0.55564541 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone