nips nips2008 nips2008-191 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Leo Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan L. Yuille
Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomial-time parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful. In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. [sent-10, score-0.775]
2 Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomial-time parsing algorithms. [sent-11, score-0.603]
3 By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. [sent-12, score-0.521]
4 In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. [sent-14, score-1.392]
5 This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. [sent-15, score-0.784]
6 Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. [sent-17, score-0.522]
7 We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. [sent-19, score-0.322]
8 Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena. [sent-20, score-0.272]
9 1 Introduction Language and image understanding are two major tasks in artificial intelligence. [sent-21, score-0.272]
10 Natural language researchers have formalized this task in terms of parsing an input signal into a hierarchical representation. [sent-22, score-0.609]
11 Secondly, they have exploited the one-dimensional structure of language to obtain efficient polynomial-time parsing algorithms. [sent-32, score-0.456]
12 By contrast, the nature of images makes it much harder to design efficient image parsers which are capable of simultaneously performing segmentation (parsing an image into regions) and recognition (labeling the regions). [sent-35, score-1.143]
13 Firstly, it is unclear what hierarchical representations should be used to model images and there are no direct analogies to the syntactic categories and phrase structures that occur in speech. [sent-36, score-0.253]
14 Secondly, the inference problem is formidable due to the well-known complexity and ambiguity of segmentation and recognition. [sent-37, score-0.448]
15 Unlike most languages (Chinese is an exception), whose constituents are well-separated words, the boundaries between different image regions are usually highly unclear. [sent-38, score-0.421]
16 Exploring all the different image partitions results in combinatorial explosions because of the two-dimensional nature of images (which makes it impossible to order these partitions to enable dynamic programming). [sent-39, score-0.332]
17 Overall it has been hard to adapt methods from natural language parsing and apply them to vision despite the high-level conceptual similarities (except for restricted problems such as text [4]). [sent-40, score-0.541]
18 Attempts at image parsing must make trade-offs between the complexity of the models and the complexity of the computation (for inference and learning). [sent-41, score-0.676]
19 This yields simpler (discriminative) models which are less capable of representing complex image structures and long range interactions. [sent-48, score-0.315]
20 In this paper, we introduce Hierarchical Image Models (HIMs) for image parsing. [sent-57, score-0.272]
21 In particular, we introduce recursive segmentation and recognition templates which represent complex image knowledge and serve as elementary constituents analogous to those used in speech. [sent-59, score-1.163]
22 As in speech, we can recursively compose these constituents at lower levels to form more complex constituents at higher level. [sent-60, score-0.268]
23 Each node of the hierarchy corresponds to an image region (whose size depends on the level in the hierarchy). [sent-61, score-0.557]
24 The state of each node represents both the partitioning of the corresponding region into segments and the labeling of these segments. [sent-62, score-0.259]
25 Segmentations at the top levels of the hierarchy give coarse descriptions of the image which are refined by the segmentations at the lower levels. [sent-65, score-0.55]
26 Learning and inference (parsing) are made efficient by exploiting the hierarchical structure (and the absence of loops). [sent-66, score-0.248]
27 To illustrate the HIM we implement it for parsing images and we evaluate it on the public MSRC image dataset [8]. [sent-70, score-0.732]
28 We discuss ways that HIM’s can be extended naturally to model more complex image phenomena. [sent-72, score-0.272]
29 1 The Model We represent an image by a hierarchical graph defined by parent-child relationships. [sent-74, score-0.425]
30 The hierarchy corresponds to the image pyramid (with 5 layers in this paper). [sent-76, score-0.424]
31 The leaf nodes represent local image patches (27 × 27 in this paper). [sent-79, score-0.332]
32 Thus, the hierarchy is a quad tree and Ch(a) encodes all its vertical edges. [sent-82, score-0.228]
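The five-layer quad-tree structure just described can be sketched as follows; the node layout, field names, and image size are our own illustration (432 = 27 × 16, so that the leaf regions are 27 × 27 patches), not the paper's implementation.

```python
# Hypothetical sketch of the HIM's five-layer quad-tree hierarchy.
# Each node stores its region R(a); the children list plays the role of Ch(a).

def build_quadtree(x0, y0, w, h, level, max_level=4):
    """Recursively split a region R(a) into four children Ch(a)."""
    node = {"region": (x0, y0, w, h), "level": level, "children": []}
    if level < max_level:
        hw, hh = w // 2, h // 2
        for dy in (0, hh):
            for dx in (0, hw):
                node["children"].append(
                    build_quadtree(x0 + dx, y0 + dy, hw, hh, level + 1, max_level))
    return node

def count_nodes(node):
    """Total nodes in the subtree (1 + 4 + 16 + 64 + 256 = 341 for 5 layers)."""
    return 1 + sum(count_nodes(c) for c in node["children"])

root = build_quadtree(0, 0, 432, 432, level=0)  # leaves end up 27x27
```

With five layers the tree has 341 nodes, and every leaf covers a 27 × 27 patch, matching the patch size quoted above.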
33 The image region represented by node a is denoted by R(a). [sent-83, score-0.405]
34 A pixel in R(a), indexed by r, corresponds to an image pixel. [sent-84, score-0.351]
35 A configuration of the hierarchy is an assignment of state variables y = {ya} with ya = (sa, ca) at each node a, where s and c denote the region partition and object labeling, respectively; the pair (s, c) is called the “Segmentation and Recognition” (S-R) pair. [sent-86, score-0.545]
36 The middle panel shows a dictionary of 30 segmentation templates. [sent-91, score-0.473]
37 The last panel shows that the ground truth pixel labels (upper right panel) can be well approximated by composing a set of labeled segmentation templates (bottom right panel). [sent-95, score-0.974]
38 Figure 2: This figure illustrates how the segmentation templates and object labels (S-R pair) represent image regions in a coarse-to-fine way. [sent-96, score-1.155]
39 The left figure is the input image which is followed by global, mid-level and local S-R pairs. [sent-97, score-0.272]
40 The global S-R pair gives a coarse description of the object identity (horse), its background (grass), and its position in the image (central). [sent-98, score-0.503]
41 More precisely, each region R(a) is described by a segmentation template selected from a dictionary DS. [sent-103, score-0.779]
42 Each segmentation template consists of a partition of the region into K non-overlapping sub-parts, see figure 1. [sent-104, score-0.524]
43 In this paper K ≤ 3, |Ds | = 30, and the segmentation templates are designed by hand to cover the taxonomy of shape segmentations that happen in images, such as T-junctions, Y-junctions, and so on. [sent-105, score-0.741]
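A minimal sketch of what such a template dictionary might look like, storing each template as an integer part-mask over the region; the two non-trivial templates below are hypothetical examples, not entries from the paper's hand-designed 30-template dictionary DS.

```python
import numpy as np

# Each template partitions a region into at most K = 3 sub-parts,
# encoded as an integer mask whose values index the parts.

def vertical_split(size):
    """Two-part template: left half vs right half."""
    t = np.zeros((size, size), dtype=int)
    t[:, size // 2:] = 1
    return t

def t_junction(size):
    """Three-part template: top band, bottom-left, bottom-right."""
    t = np.zeros((size, size), dtype=int)
    t[size // 2:, : size // 2] = 1
    t[size // 2:, size // 2:] = 2
    return t

# A toy dictionary: one uniform template plus two partitioned ones.
DS = [np.zeros((27, 27), dtype=int), vertical_split(27), t_junction(27)]
```

The variable s in the text then simply indexes into such a dictionary.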
44 The variable s refers to the indexes of the segmentation templates in the dictionary, i. [sent-106, score-0.664]
45 The labeling of a pixel r in region R(a) is denoted by o_r ∈ {1, . . . , M}. [sent-118, score-0.281]
46 In other words, each level of the hierarchy has a separate labeling field. [sent-123, score-0.278]
47 A novel feature of this hierarchical representation is the multi-level S-R pairs, which explicitly model both the segmentation and labeling of their corresponding regions, while traditional vision approaches [8, 10, 11] use labeling only. [sent-125, score-0.927]
48 The S-R pairs defined in a hierarchical form provide a coarse-to-fine representation which captures the “gist” (semantical meaning) of image regions. [sent-126, score-0.468]
49 As one can see in figure 2, the global S-R pair gives a coarse description (the identities of objects and their spatial layout) of the whole image which is accurate enough to encode high level image properties in a compact form. [sent-127, score-0.715]
50 The four templates at the lower level further refine the interpretations. [sent-129, score-0.27]
51 The first term E1 (x, s, c) is an object specific data term which represents image features of regions. [sent-134, score-0.376]
52 We use 55 types of shape (spatial) filters (similar to [8]) to calculate the responses of 47 image features. [sent-140, score-0.272]
53 The second term (segmentation specific), E2(x, s, c) = − Σ_a α2 ψ2(x, s_a, c_a), is designed to favor segmentation templates in which the pixels belonging to the same partition have similar appearance. [sent-142, score-1.125]
54 The third term, defined as E3(s, c) = − Σ_{a, b=Pa(a)} α3 ψ3(s_a, c_a, s_b, c_b), couples the S-R pairs of parent and child nodes. [sent-149, score-0.273]
55 ψ3(s_a, c_a, s_b, c_b) is defined by the Hamming distance: ψ3(s_a, c_a, s_b, c_b) = (1/|R(a)|) Σ_{r∈R(a)} δ(o_r^a, o_r^b) (5), where δ(o_r^a, o_r^b) is the Kronecker delta, which equals one whenever o_r^a = o_r^b and zero otherwise. [sent-152, score-0.546]
56 The Hamming function ensures that the segmentation templates (and their labels) at different levels are glued together in a consistent hierarchical form. [sent-153, score-0.871]
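The consistency potential of Eq. (5) reduces to an agreement fraction once the parent's and child's S-R pairs have each been rendered into a per-pixel label map over R(a); this small sketch assumes such label maps as inputs, and the function name and toy arrays are ours.

```python
import numpy as np

# psi_3 from Eq. (5): the fraction of pixels in R(a) whose label under the
# child's S-R pair agrees with the label induced by the parent's S-R pair.
# Averaging the Kronecker delta over pixels is just a mean of equalities.

def psi3(labels_child, labels_parent):
    labels_child = np.asarray(labels_child)
    labels_parent = np.asarray(labels_parent)
    return np.mean(labels_child == labels_parent)

child = np.array([[1, 1], [2, 2]])
parent = np.array([[1, 1], [2, 1]])
score = psi3(child, parent)  # 3 of 4 pixels agree -> 0.75
```

A perfectly consistent parent-child pair scores 1.0, which is what the energy term rewards.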
57 (e.g., a cow is unlikely to appear next to an aeroplane): E4(c) = − Σ_a Σ_{i,j=1..M} α4(i, j) ψ4(i, j, c_a, c_a) − Σ_{a, b=Pa(a)} Σ_{i,j=1..M} α4(i, j) ψ4(i, j, c_a, c_b) [sent-158, score-0.323]
58 where ψ4(i, j, c_a, c_b) is an indicator function which equals one when both i ≡ c_a and j ≡ c_b hold (i ≡ c_a means class i is a component of c_a) and zero otherwise. [sent-162, score-0.68]
59 The first term encodes the classes within a single template while the second term encodes the classes in the two templates of parent-child nodes. [sent-166, score-0.47]
60 The fifth term E5(s) = − Σ_a α5 ψ5(s_a), where ψ5(s_a) = log p(s_a), encodes the generic prior of the segmentation template. [sent-168, score-0.394]
61 Similarly, the sixth term E6(s, c) = − Σ_a Σ_{j≡c_a} α6 ψ6(s_a, j), where ψ6(s_a, j) = log p(s_a, j), models the co-occurrence of the segmentation templates and the object classes. [sent-169, score-0.768]
62 HIM is a coarse-to-fine representation which captures the “gist” of image regions by using the S-R pairs at multiple levels. [sent-174, score-0.357]
63 But the traditional concept of “gist” [14] relies only on image features and does not include segmentation templates. [sent-175, score-0.666]
64 Levin and Weiss [15] use a segmentation mask which is more object-specific than our segmentation templates (and they do not have a hierarchy). [sent-176, score-1.058]
65 Our approach has some similarities to some hierarchical models (which have two layers only) [10],[11] – but these models also lack segmentation templates. [sent-178, score-0.547]
66 2 Parsing by Dynamic Programming Parsing an image is performed as inference of the HIM. [sent-181, score-0.326]
67 More precisely, the task of parsing is to obtain the maximum a posteriori (MAP) estimate: y* = arg max_y p(y|x; α) = arg max_y ψ(x, y) · α (7). The size of the state space of each node is O(M^K |Ds|), where K = 3, M = 21, |Ds| = 30 in our case. [sent-182, score-0.407]
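The dynamic-programming parse can be sketched as a bottom-up max-sum pass over the tree; in the real HIM each node's states are (s, c) pairs scored by the energy terms above, whereas the toy tree, unary scores, and pairwise table below are made up purely for illustration.

```python
# Minimal max-sum dynamic program on a tree, sketching the HIM parsing step.
# Each node keeps, per state, the best achievable score over its subtree.

def parse(node, unary, pairwise, n_states):
    """Return best_score[state] for `node`, maximizing over its subtree."""
    scores = list(unary[node["id"]])
    for child in node["children"]:
        child_scores = parse(child, unary, pairwise, n_states)
        for s in range(n_states):
            # Max-marginalize each child independently given the parent state.
            scores[s] += max(pairwise[s][t] + child_scores[t]
                             for t in range(n_states))
    return scores

leafA = {"id": 1, "children": []}
leafB = {"id": 2, "children": []}
root = {"id": 0, "children": [leafA, leafB]}
unary = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
pairwise = [[0.5, 0.0], [0.0, 0.5]]  # reward agreeing parent/child states
best = max(parse(root, unary, pairwise, 2))
```

Because each node is visited once and children are max-marginalized independently, the cost is polynomial in the number of nodes and states, which is what makes the loop-free hierarchy tractable.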
68 The final output of the model for segmentation is the pixel labeling determined by the (s, c) of the lowest level. [sent-186, score-0.599]
69 In our case, X is a set of images, with Y being a set of possible parse trees which specify the labels of image regions in a hierarchical form. [sent-201, score-0.669]
70 It might seem that the ground-truth parse trees require labels for both the segmentation templates and the pixel labelings. [sent-202, score-1.068]
71 In our experiment, we will show how to obtain the ground truth directly from the segmentation labels without extra human labeling. [sent-203, score-0.585]
72 • find the best state of the model on the i’th training image with the current parameter setting. [sent-219, score-0.272]
73 This dataset is designed to evaluate scene labeling including both image segmentation and multi-class object recognition. [sent-236, score-0.967]
74 The ground truth only gives the labeling of the image pixels. [sent-237, score-0.516]
75 To supplement this ground truth (to enable learning), we estimate the true labels (states of the S-R pair ) of the nodes in the five-layer hierarchy of HIM by selecting the S-R pairs which have maximum overlap with the labels of the image pixels. [sent-238, score-0.827]
76 This approximation only results in 2% error in labeling image pixels. [sent-239, score-0.398]
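One plausible way to implement this approximation, choosing per region the S-R pair with maximum overlap against the pixel ground truth, is sketched below with toy masks; the majority-vote labeling rule and all names are our assumptions, not the paper's procedure.

```python
import numpy as np

# For each candidate template, label each of its parts with the majority
# ground-truth class inside that part, then score the reconstruction by
# its pixel overlap with the ground truth. Keep the best template.

def fit_sr_pair(pixel_labels, templates):
    best = None
    for idx, tpl in enumerate(templates):
        approx = np.empty_like(pixel_labels)
        for part in np.unique(tpl):
            vals, counts = np.unique(pixel_labels[tpl == part],
                                     return_counts=True)
            approx[tpl == part] = vals[np.argmax(counts)]  # majority label
        overlap = np.mean(approx == pixel_labels)
        if best is None or overlap > best[1]:
            best = (idx, overlap)
    return best  # (template index, fraction of correctly labeled pixels)

gt = np.array([[0, 1], [0, 1]])
templates = [np.zeros((2, 2), dtype=int),         # single-part template
             np.array([[0, 1], [0, 1]])]          # left/right split
idx, overlap = fit_sr_pair(gt, templates)
```

Here the left/right split reproduces the ground truth exactly, so it is selected with overlap 1.0; the quoted 2% pixel error suggests the real dictionary covers most boundary shapes nearly as well.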
77 For a given image x, the parsing result is obtained by estimating the best configuration y ∗ of the HIM. [sent-246, score-0.622]
78 To evaluate the performance of parsing we use the global accuracy measured in terms of all pixels and the average accuracy over the 21 object classes (global accuracy pays most attention to frequently occurring objects and penalizes infrequent objects). [sent-247, score-0.671]
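Both measures are straightforward to compute; the toy arrays below simply illustrate why global accuracy favors frequent classes while the class-averaged measure treats rare classes equally (array contents are ours, not MSRC results).

```python
import numpy as np

def global_accuracy(pred, gt):
    """Fraction of all pixels labeled correctly."""
    return np.mean(pred == gt)

def average_accuracy(pred, gt):
    """Per-class accuracy, averaged uniformly over the classes present."""
    classes = np.unique(gt)
    return np.mean([np.mean(pred[gt == c] == c) for c in classes])

gt = np.array([0, 0, 0, 1])    # class 1 is infrequent
pred = np.array([0, 0, 0, 0])  # predicting the frequent class everywhere
g = global_accuracy(pred, gt)
a = average_accuracy(pred, gt)
```

Predicting only the frequent class scores 0.75 globally but just 0.5 on the class-averaged measure, which is exactly the penalty for ignoring infrequent objects noted above.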
79 Boosting-based learning takes about 35 hours, of which 27 hours are spent on I/O processing and 8 hours on computation. [sent-251, score-0.256]
80 In the testing stage, it takes 30 seconds to parse an image of size 320 × 200 (6 s for extracting image features, 9 s for computing the strong boosting classifier, and 15 s for parsing the HIM). [sent-253, score-1.12]
81 Figure 4 (best viewed in color) shows several parsing results obtained by the HIM and by the classifier by itself (i.e., boosting alone). [sent-255, score-0.35]
82 One can see that the HIM is able to roughly capture differently shaped segmentation boundaries (see the legs of the cow and sheep in rows 1 and 3, and the boundary curve between sky and building in row 4). [sent-258, score-0.477]
83 Our boosting results are better than TextonBoost [8] because of the image features we use. [sent-267, score-0.369]
84 The colors indicate the labels of 21 object classes as in [8]. [sent-271, score-0.243]
85 The columns (except the fourth “accuracy” column) show the input images, ground truth, the labels obtained by HIM and the boosting classifier respectively. [sent-272, score-0.228]
86 “Classifier only” shows the results where the pixel labels are predicted by the boosting classifier alone. [sent-286, score-0.249]
87 Figure 5 shows how the S-R pairs (which include the segmentation templates) can be used to (partially) parse an object into its constituent parts, by the correspondence between S-R pairs and specific parts of objects. [sent-290, score-0.795]
88 4 Conclusion This paper describes a novel hierarchical image model (HIM) for 2D image parsing. [sent-296, score-0.697]
89 The hierarchical nature of the model, and the use of recursive segmentation and recognition templates, enable the HIM to represent complex image structures in a coarse-to-fine manner. [sent-297, score-0.939]
90 We can perform inference (parsing) rapidly in polynomial time by exploiting the hierarchical structure. [sent-298, score-0.283]
91 We demonstrated the effectiveness of HIMs by applying them to the challenging task of segmentation and labeling of the public MSRC image database. [sent-300, score-0.842]
92 Figure 5: The S-R pairs can be used to parse the object into parts. [sent-302, score-0.276]
93 The shapes (spatial layout) of the segmentation templates distinguish the constituent parts of the object. [sent-304, score-0.746]
94 We have attempted to capture the underlying spirit of the successful language processing approaches – the hierarchical representations based on the recursive composition of constituents and efficient inference and learning algorithms. [sent-309, score-0.474]
95 Zhu, “Image segmentation by data-driven Markov chain Monte Carlo,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. [sent-338, score-0.394]
96 Carreira-Perpiñán, “Multiscale conditional random fields for image labeling,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004, pp. [sent-373, score-0.272]
97 Jolly, “Interactive graph cuts for optimal boundary and region segmentation of objects in n-d images,” in Proceedings of IEEE International Conference on Computer Vision, 2001, pp. [sent-390, score-0.514]
98 Torralba, “Building the gist of a scene: the role of global image features in recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. [sent-394, score-0.409]
99 Zhang, “Rapid inference on a novel and/or graph for object detection, segmentation and parsing,” in Advances in Neural Information Processing Systems, 2007. [sent-417, score-0.552]
100 Triggs, “Scene segmentation with crfs learned from partially labeled images,” in Advances in Neural Information Processing Systems, vol. [sent-437, score-0.394]
wordName wordTfidf (topN-words)
[('segmentation', 0.394), ('parsing', 0.35), ('image', 0.272), ('templates', 0.27), ('sa', 0.254), ('hierarchical', 0.153), ('hierarchy', 0.152), ('parse', 0.129), ('labeling', 0.126), ('ca', 0.12), ('textonboost', 0.117), ('oq', 0.111), ('constituents', 0.107), ('language', 0.106), ('object', 0.104), ('cb', 0.1), ('crf', 0.098), ('boosting', 0.097), ('xr', 0.094), ('vision', 0.085), ('cow', 0.083), ('grass', 0.083), ('pixel', 0.079), ('msrc', 0.078), ('gist', 0.078), ('region', 0.076), ('labels', 0.073), ('yuille', 0.072), ('ds', 0.071), ('oa', 0.067), ('horse', 0.067), ('recognition', 0.066), ('grammars', 0.063), ('nodes', 0.06), ('truth', 0.06), ('images', 0.06), ('zhu', 0.059), ('global', 0.059), ('ground', 0.058), ('dp', 0.058), ('node', 0.057), ('er', 0.057), ('xq', 0.055), ('inference', 0.054), ('levels', 0.054), ('template', 0.054), ('recursive', 0.054), ('hours', 0.053), ('sb', 0.053), ('proceedings', 0.05), ('public', 0.05), ('constituent', 0.05), ('pixels', 0.05), ('collins', 0.049), ('conference', 0.049), ('style', 0.048), ('objects', 0.044), ('pairs', 0.043), ('gure', 0.043), ('capable', 0.043), ('classi', 0.042), ('regions', 0.042), ('exploiting', 0.041), ('lin', 0.04), ('segmentations', 0.04), ('firstly', 0.04), ('phrase', 0.04), ('tu', 0.04), ('panel', 0.04), ('encodes', 0.039), ('stomach', 0.039), ('levin', 0.039), ('leg', 0.039), ('dictionary', 0.039), ('energy', 0.038), ('designed', 0.037), ('tree', 0.037), ('chen', 0.036), ('pair', 0.036), ('linguistics', 0.036), ('contextual', 0.036), ('parsers', 0.036), ('ya', 0.036), ('ch', 0.036), ('dist', 0.036), ('emphasizes', 0.036), ('polynomial', 0.035), ('computer', 0.035), ('ieee', 0.035), ('scene', 0.034), ('classes', 0.034), ('triggs', 0.033), ('potts', 0.033), ('attempts', 0.033), ('coarse', 0.032), ('rapid', 0.032), ('colors', 0.032), ('parts', 0.032), ('maxy', 0.031), ('pays', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
Author: Leo Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan L. Yuille
Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomial-time parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful. In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena. 1
2 0.26273584 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
Author: Geremy Heitz, Stephen Gould, Ashutosh Saxena, Daphne Koller
Abstract: One of the original goals of computer vision was to fully understand a natural scene. This requires solving several sub-problems simultaneously, including object detection, region labeling, and geometric reasoning. The last few decades have seen great progress in tackling each of these problems in isolation. Only recently have researchers returned to the difficult task of considering them jointly. In this work, we consider learning a set of related models in such that they both solve their own problem and help each other. We develop a framework called Cascaded Classification Models (CCM), where repeated instantiations of these classifiers are coupled by their input/output variables in a cascade that improves performance at each level. Our method requires only a limited “black box” interface with the models, allowing us to use very sophisticated, state-of-the-art classifiers without having to look under the hood. We demonstrate the effectiveness of our method on a large set of natural images by combining the subtasks of scene categorization, object detection, multiclass image segmentation, and 3d reconstruction. 1
3 0.22506131 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
4 0.19901776 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
Author: Erik B. Sudderth, Michael I. Jordan
Abstract: We develop a statistical framework for the simultaneous, unsupervised segmentation and discovery of visual object categories from image databases. Examining a large set of manually segmented scenes, we show that object frequencies and segment sizes both follow power law distributions, which are well modeled by the Pitman–Yor (PY) process. This nonparametric prior distribution leads to learning algorithms which discover an unknown set of objects, and segmentation methods which automatically adapt their resolution to each image. Generalizing previous applications of PY processes, we use Gaussian processes to discover spatially contiguous segments which respect image boundaries. Using a novel family of variational approximations, our approach produces segmentations which compare favorably to state-of-the-art methods, while simultaneously discovering categories shared among natural scenes. 1
5 0.1801351 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
Author: Jonathan L. Roux, Alain D. Cheveigné, Lucas C. Parra
Abstract: How does one extract unknown but stereotypical events that are linearly superimposed within a signal with variable latencies and variable amplitudes? One could think of using template matching or matching pursuit to find the arbitrarily shifted linear components. However, traditional matching approaches require that the templates be known a priori. To overcome this restriction we use instead semi Non-Negative Matrix Factorization (semiNMF) that we extend to allow for time shifts when matching the templates to the signal. The algorithm estimates templates directly from the data along with their non-negative amplitudes. The resulting method can be thought of as an adaptive template matching procedure. We demonstrate the procedure on the task of extracting spikes from single channel extracellular recordings. On these data the algorithm essentially performs spike detection and unsupervised spike clustering. Results on simulated data and extracellular recordings indicate that the method performs well for signalto-noise ratios of 6dB or higher and that spike templates are recovered accurately provided they are sufficiently different. 1
6 0.13658836 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
7 0.13549665 148 nips-2008-Natural Image Denoising with Convolutional Networks
8 0.12103796 46 nips-2008-Characterizing response behavior in multisensory perception with conflicting cues
9 0.12061167 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
10 0.12023581 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
11 0.11990166 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
12 0.11800595 207 nips-2008-Shape-Based Object Localization for Descriptive Classification
13 0.11440451 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
14 0.11014877 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
15 0.10846043 229 nips-2008-Syntactic Topic Models
16 0.10625581 118 nips-2008-Learning Transformational Invariants from Natural Movies
17 0.10448453 247 nips-2008-Using Bayesian Dynamical Systems for Motion Template Libraries
18 0.10440172 119 nips-2008-Learning a discriminative hidden part model for human action recognition
19 0.10420186 139 nips-2008-Modeling the effects of memory on human online sentence processing with particle filters
20 0.10222589 98 nips-2008-Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data
topicId topicWeight
[(0, -0.262), (1, -0.164), (2, 0.187), (3, -0.24), (4, -0.038), (5, -0.028), (6, -0.055), (7, -0.136), (8, 0.044), (9, -0.055), (10, -0.096), (11, -0.02), (12, 0.12), (13, -0.051), (14, -0.022), (15, 0.033), (16, -0.054), (17, -0.13), (18, -0.041), (19, 0.012), (20, 0.111), (21, 0.075), (22, 0.073), (23, 0.019), (24, -0.064), (25, 0.013), (26, 0.017), (27, -0.052), (28, -0.064), (29, -0.044), (30, 0.064), (31, -0.165), (32, -0.069), (33, 0.012), (34, -0.027), (35, -0.137), (36, 0.055), (37, 0.145), (38, 0.021), (39, -0.132), (40, 0.063), (41, 0.102), (42, -0.103), (43, 0.078), (44, -0.035), (45, -0.023), (46, 0.077), (47, 0.095), (48, 0.075), (49, -0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.95943338 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
2 0.78694326 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
3 0.67390674 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
4 0.61343759 36 nips-2008-Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree
Author: Daphna Weinshall, Hynek Hermansky, Alon Zweig, Jie Luo, Holly Jimison, Frank Ohl, Misha Pavel
Abstract: Unexpected stimuli are a challenge to any machine learning algorithm. Here we identify distinct types of unexpected events, focusing on ’incongruent events’ when ’general level’ and ’specific level’ classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of label hierarchy, we show how partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy. An incongruent event is an event where the probability computed based on some more specific level (in accordance with the partial order) is much smaller than the probability computed based on some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership. Respectively, we show promising results with real data on two specific problems: Out Of Vocabulary words in speech recognition, and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition.
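The incongruence test the abstract defines, a specific-level probability much smaller than the general-level one, reduces to a simple comparison; the ratio threshold below is an illustrative choice, not a value from the paper:

```python
def is_incongruent(p_general, p_specific, ratio=10.0):
    """Flag an event as incongruent when its probability under the most
    specific applicable model is much smaller than under the general one."""
    return p_general > ratio * p_specific

print(is_incongruent(0.4, 0.01))  # True: general level accepts, specific rejects
print(is_incongruent(0.4, 0.35))  # False: the two levels agree
```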
5 0.56870472 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
Author: Xuming He, Richard S. Zemel
Abstract: Extensive labeled data for image annotation systems, which learn to assign class labels to image regions, is difficult to obtain. We explore a hybrid model framework for utilizing partially labeled data that integrates a generative topic model for image appearance with discriminative label prediction. We propose three alternative formulations for imposing a spatial smoothness prior on the image labels. Tests of the new models and some baseline approaches on three real image datasets demonstrate the effectiveness of incorporating the latent structure. 1
6 0.56171322 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
7 0.55436563 98 nips-2008-Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data
8 0.55209875 207 nips-2008-Shape-Based Object Localization for Descriptive Classification
9 0.54345131 148 nips-2008-Natural Image Denoising with Convolutional Networks
10 0.52796978 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
11 0.52324617 119 nips-2008-Learning a discriminative hidden part model for human action recognition
12 0.49915522 82 nips-2008-Fast Computation of Posterior Mode in Multi-Level Hierarchical Models
13 0.47968012 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
14 0.46686339 66 nips-2008-Dynamic visual attention: searching for coding length increments
15 0.44749042 147 nips-2008-Multiscale Random Fields with Application to Contour Grouping
16 0.44381219 6 nips-2008-A ``Shape Aware'' Model for semi-supervised Learning of Objects and its Context
17 0.43673256 142 nips-2008-Multi-Level Active Prediction of Useful Image Annotations for Recognition
18 0.43140081 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF
19 0.40319267 139 nips-2008-Modeling the effects of memory on human online sentence processing with particle filters
20 0.39857832 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
topicId topicWeight
[(6, 0.042), (7, 0.055), (12, 0.029), (28, 0.109), (57, 0.538), (59, 0.015), (63, 0.018), (77, 0.022), (78, 0.012), (81, 0.01), (83, 0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.94513822 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
Author: Leo Zhu, Yuanhao Chen, Yuan Lin, Chenxi Lin, Alan L. Yuille
Abstract: Language and image understanding are two major goals of artificial intelligence which can both be conceptually formulated in terms of parsing the input signal into a hierarchical representation. Natural language researchers have made great progress by exploiting the 1D structure of language to design efficient polynomial-time parsing algorithms. By contrast, the two-dimensional nature of images makes it much harder to design efficient image parsers and the form of the hierarchical representations is also unclear. Attempts to adapt representations and algorithms from natural language have only been partially successful. In this paper, we propose a Hierarchical Image Model (HIM) for 2D image parsing which outputs image segmentation and object recognition. This HIM is represented by recursive segmentation and recognition templates in multiple layers and has advantages for representation, inference, and learning. Firstly, the HIM has a coarse-to-fine representation which is capable of capturing long-range dependency and exploiting different levels of contextual information. Secondly, the structure of the HIM allows us to design a rapid inference algorithm, based on dynamic programming, which enables us to parse the image rapidly in polynomial time. Thirdly, we can learn the HIM efficiently in a discriminative manner from a labeled dataset. We demonstrate that HIM outperforms other state-of-the-art methods by evaluation on the challenging public MSRC image dataset. Finally, we sketch how the HIM architecture can be extended to model more complex image phenomena. 1
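A rough flavor of dynamic-programming inference over a layered image hierarchy can be given on a quadtree of per-cell label costs. This is a deliberately simplified sketch, not the authors' segmentation/recognition templates (which couple parent and child states rather than scoring them independently):

```python
import numpy as np

def quadtree_dp(costs):
    """Bottom-up dynamic programming on a quadtree of labels.
    costs[level] has shape (n_cells, n_labels): the local cost of
    assigning each label to each cell at that level (level 0 = leaves).
    Each parent adds its own best label cost to the best costs of its
    four children; children are scored independently here, a toy
    simplification of the recursive templates in the abstract."""
    best = costs[0].min(axis=1)  # best label cost per leaf cell
    for level in range(1, len(costs)):
        n_cells = costs[level].shape[0]
        child_sum = best.reshape(n_cells, 4).sum(axis=1)
        best = costs[level].min(axis=1) + child_sum
    return best[0]  # minimal total parse cost at the root

# Two-level toy: 4 leaves under 1 root, 3 candidate labels each.
leaf_costs = np.array([[1, 2, 3], [2, 1, 3], [3, 2, 1], [1, 3, 2]], float)
root_costs = np.array([[0.5, 1.0, 2.0]])
print(quadtree_dp([leaf_costs, root_costs]))  # 4 leaf minima plus root minimum
```

The point of the sketch is the complexity: one pass over the tree, linear in the number of cells times labels, which is why hierarchy makes polynomial-time parsing possible.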
2 0.92807293 233 nips-2008-The Gaussian Process Density Sampler
Author: Iain Murray, David MacKay, Ryan P. Adams
Abstract: We present the Gaussian Process Density Sampler (GPDS), an exchangeable generative model for use in nonparametric Bayesian density estimation. Samples drawn from the GPDS are consistent with exact, independent samples from a fixed density function that is a transformation of a function drawn from a Gaussian process prior. Our formulation allows us to infer an unknown density from data using Markov chain Monte Carlo, which gives samples from the posterior distribution over density functions and from the predictive distribution on data space. We can also infer the hyperparameters of the Gaussian process. We compare this density modeling technique to several existing techniques on a toy problem and a skull-reconstruction task. 1
3 0.9271493 148 nips-2008-Natural Image Denoising with Convolutional Networks
Author: Viren Jain, Sebastian Seung
Abstract: We present an approach to low-level vision that combines two main ideas: the use of convolutional networks as an image processing architecture and an unsupervised learning procedure that synthesizes training samples from specific noise models. We demonstrate this approach on the challenging problem of natural image denoising. Using a test set with a hundred natural images, we find that convolutional networks provide comparable and in some cases superior performance to state of the art wavelet and Markov random field (MRF) methods. Moreover, we find that a convolutional network offers similar performance in the blind denoising setting as compared to other techniques in the non-blind setting. We also show how convolutional networks are mathematically related to MRF approaches by presenting a mean field theory for an MRF specially designed for image denoising. Although these approaches are related, convolutional networks avoid computational difficulties in MRF approaches that arise from probabilistic learning and inference. This makes it possible to learn image processing architectures that have a high degree of representational power (we train models with over 15,000 parameters), but whose computational expense is significantly less than that associated with inference in MRF approaches with even hundreds of parameters. 1 Background Low-level image processing tasks include edge detection, interpolation, and deconvolution. These tasks are useful both in themselves, and as a front-end for high-level visual tasks like object recognition. This paper focuses on the task of denoising, defined as the recovery of an underlying image from an observation that has been subjected to Gaussian noise. One approach to image denoising is to transform an image from pixel intensities into another representation where statistical regularities are more easily captured.
For example, the Gaussian scale mixture (GSM) model introduced by Portilla and colleagues is based on a multiscale wavelet decomposition that provides an effective description of local image statistics [1, 2]. Another approach is to try and capture statistical regularities of pixel intensities directly using Markov random fields (MRFs) to define a prior over the image space. Initial work used hand-designed settings of the parameters, but recently there has been increasing success in learning the parameters of such models from databases of natural images [3, 4, 5, 6, 7, 8]. Prior models can be used for tasks such as image denoising by augmenting the prior with a noise model. Alternatively, an MRF can be used to model the probability distribution of the clean image conditioned on the noisy image. This conditional random field (CRF) approach is said to be discriminative, in contrast to the generative MRF approach. Several researchers have shown that the CRF approach can outperform generative learning on various image restoration and labeling tasks [9, 10]. CRFs have recently been applied to the problem of image denoising as well [5]. The present work is most closely related to the CRF approach. Indeed, certain special cases of convolutional networks can be seen as performing maximum likelihood inference on a CRF [11]. The advantage of the convolutional network approach is that it avoids a general difficulty with applying MRF-based methods to image analysis: the computational expense associated with both parameter estimation and inference in probabilistic models. For example, naive methods of learning MRF-based models involve calculation of the partition function, a normalization factor that is generally intractable for realistic models and image dimensions.
As a result, a great deal of research has been devoted to approximate MRF learning and inference techniques that ameliorate computational difficulties, generally at the cost of either representational power or theoretical guarantees [12, 13]. Convolutional networks largely avoid these difficulties by posing the computational task within the statistical framework of regression rather than density estimation. Regression is a more tractable computation and therefore permits models with greater representational power than methods based on density estimation. This claim will be argued for with empirical results on the denoising problem, as well as mathematical connections between MRF and convolutional network approaches. 2 Convolutional Networks Convolutional networks have been extensively applied to visual object recognition using architectures that accept an image as input and, through alternating layers of convolution and subsampling, produce one or more output values that are thresholded to yield binary predictions regarding object identity [14, 15]. In contrast, we study networks that accept an image as input and produce an entire image as output. Previous work has used such architectures to produce images with binary targets in image restoration problems for specialized microscopy data [11, 16]. Here we show that similar architectures can also be used to produce images with the analog fluctuations found in the intensity distributions of natural images. Network Dynamics and Architecture A convolutional network is an alternating sequence of linear filtering and nonlinear transformation operations. The input and output layers include one or more images, while intermediate layers contain “hidden
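A single convolutional stage of the kind the excerpt describes, linear filtering of an input image that produces an output image, can be sketched with a naive 2D convolution. The averaging kernel below is an illustrative stand-in for the learned filters in the paper:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 2D 'valid' convolution (cross-correlation), for illustration only."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

# One "layer": filter a noisy image with a 3x3 kernel. A trained network
# would stack several such stages with nonlinearities between them.
rng = np.random.default_rng(1)
clean = np.ones((8, 8))
noisy = clean + 0.3 * rng.standard_normal((8, 8))
smooth = conv2d_valid(noisy, np.full((3, 3), 1.0 / 9.0))  # averaging kernel
print(np.abs(smooth - 1.0).mean() < np.abs(noisy - 1.0).mean())  # noise reduced
```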
4 0.90782082 236 nips-2008-The Mondrian Process
Author: Daniel M. Roy, Yee W. Teh
Abstract: We describe a novel class of distributions, called Mondrian processes, which can be interpreted as probability distributions over kd-tree data structures. Mondrian processes are multidimensional generalizations of Poisson processes and this connection allows us to construct multidimensional generalizations of the stick-breaking process described by Sethuraman (1994), recovering the Dirichlet process in one dimension. After introducing the Aldous-Hoover representation for jointly and separately exchangeable arrays, we show how the process can be used as a nonparametric prior distribution in Bayesian models of relational data. 1
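The kd-tree interpretation in the abstract suggests a simple generative sketch: recursive axis-aligned cuts whose costs are exponential with rate given by the box's total side length, stopping when a budget is exhausted. This is offered as an illustration of the sampling scheme, not the paper's exact construction:

```python
import numpy as np

def mondrian(box, budget, rng):
    """Sample a Mondrian partition of an axis-aligned box (lows, highs).
    Cut cost ~ Exponential(rate = total side length); recurse on the two
    halves until the budget is exhausted, leaving a kd-tree of leaf cells."""
    lows, highs = box
    lengths = highs - lows
    cost = rng.exponential(1.0 / lengths.sum())
    if cost > budget:
        return [box]  # leaf cell of the kd-tree
    dim = rng.choice(len(lengths), p=lengths / lengths.sum())
    cut = rng.uniform(lows[dim], highs[dim])
    left_high, right_low = highs.copy(), lows.copy()
    left_high[dim], right_low[dim] = cut, cut
    return (mondrian((lows, left_high), budget - cost, rng)
            + mondrian((right_low, highs), budget - cost, rng))

rng = np.random.default_rng(2)
cells = mondrian((np.zeros(2), np.ones(2)), budget=2.0, rng=rng)
print(len(cells))  # number of leaf cells (varies with seed and budget)
```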
5 0.89547056 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning
Author: Jihun Hamm, Daniel D. Lee
Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1
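The Projection kernel mentioned in the abstract has a simple closed form for orthonormal bases: the squared Frobenius norm of Y1^T Y2, which equals the sum of squared cosines of the principal angles between the subspaces. A minimal numpy sketch:

```python
import numpy as np

def projection_kernel(Y1, Y2):
    """Projection kernel between the subspaces spanned by the orthonormal
    columns of Y1 and Y2: k = ||Y1^T Y2||_F^2."""
    return np.linalg.norm(Y1.T @ Y2, "fro") ** 2

# Orthonormal bases of two 2D subspaces of R^3 sharing one direction.
Y1, _ = np.linalg.qr(np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]))
Y2, _ = np.linalg.qr(np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]))
print(projection_kernel(Y1, Y1))  # equals the subspace dimension: 2.0
print(projection_kernel(Y1, Y2))  # the one shared direction contributes 1.0
```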
6 0.79854637 100 nips-2008-How memory biases affect information transmission: A rational analysis of serial reproduction
7 0.77766895 27 nips-2008-Artificial Olfactory Brain for Mixture Identification
8 0.72707248 208 nips-2008-Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes
9 0.6801582 158 nips-2008-Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks
10 0.64054054 234 nips-2008-The Infinite Factorial Hidden Markov Model
11 0.63538909 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
12 0.6249184 35 nips-2008-Bayesian Synchronous Grammar Induction
13 0.61024392 200 nips-2008-Robust Kernel Principal Component Analysis
14 0.59690076 232 nips-2008-The Conjoint Effect of Divisive Normalization and Orientation Selectivity on Redundancy Reduction
15 0.59678602 66 nips-2008-Dynamic visual attention: searching for coding length increments
16 0.59610641 42 nips-2008-Cascaded Classification Models: Combining Models for Holistic Scene Understanding
17 0.59325486 127 nips-2008-Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction
18 0.58968496 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
19 0.58635694 95 nips-2008-Grouping Contours Via a Related Image
20 0.58631426 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation