iccv iccv2013 iccv2013-331 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Eran Swears, Anthony Hoogs, Kim Boyer
Abstract: Recognizing functional scene elements in video scenes based on the behaviors of moving objects that interact with them is an emerging problem of interest. Existing approaches have a limited ability to characterize elements such as cross-walks, intersections, and buildings that have low activity, are multi-modal, or have indirect evidence. Our approach recognizes the low activity and multi-modal elements (crosswalks/intersections) by introducing a hierarchy of descriptive clusters to form a pyramid of codebooks that is sparse in the number of clusters and dense in content. The incorporation of local behavioral context such as person-enter-building and vehicle-parking nearby enables the detection of elements that do not have direct motion-based evidence, e.g. buildings. These two contributions significantly improve scene element recognition when compared against three state-of-the-art approaches. Results are shown on typical ground level surveillance video and for the first time on the more complex Wide Area Motion Imagery.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Recognizing functional scene elements in video scenes based on the behaviors of moving objects that interact with them is an emerging problem of interest. [sent-7, score-0.616]
2 Existing approaches have a limited ability to characterize elements such as cross-walks, intersections, and buildings that have low activity, are multi-modal, or have indirect evidence. [sent-8, score-0.196]
3 Our approach recognizes the low activity and multi-modal elements (crosswalks/intersections) by introducing a hierarchy of descriptive clusters to form a pyramid of codebooks that is sparse in the number of clusters and dense in content. [sent-9, score-1.091]
4 The incorporation of local behavioral context such as person-enter-building and vehicle-parking nearby enables the detection of elements that do not have direct motion-based evidence, e.g. buildings. [sent-10, score-0.278]
5 These two contributions significantly improve scene element recognition when compared against three state-of-the-art approaches. [sent-13, score-0.299]
6 Introduction We present a new approach to video scene modeling and scene element recognition that makes several improvements over existing state-of-the-art approaches. [sent-16, score-0.473]
7 More specifically, we recognize stationary scene elements in video using descriptors derived from the moving objects (people/vehicles) that interact with them. [sent-17, score-0.456]
8 When these scene elements have a specific purpose or function they are referred to as functional scene elements [1,2,3]. [sent-18, score-0.607]
9 Relying on descriptors derived from automatically computed tracks, as opposed to pixel features, enables the detection of scene elements that cannot be discriminated based on appearance alone. [sent-20, score-0.327]
10 For example, cross-walks may or may not have the black and white zebra patterns, and doorways can be completely occluded in high altitude aerial video or have very few pixels, as is the case with one of the aerial video datasets analyzed here (Figure 1). [sent-21, score-0.302]
11 Fortunately, the moving objects that interact with them are easier to detect [9] and track [10], which enables the detection of these visually ambiguous or poorly seen elements. [sent-22, score-0.21]
12 Existing functional scene element recognition approaches characterize the scene elements by clustering descriptors/features from individual grid cells [3,4], through a flat layer of clusters [2,3], and/or through the use of manually defined scene element detectors [1,4]. [sent-23, score-1.352]
13 These work well when the scene elements have a sufficient number of moving objects with well-defined behaviors passing over them. [sent-24, score-0.514]
14 For example, when there are many examples of pedestrians crossing the road on the crosswalk. [sent-25, score-0.042]
15 But they can fail to recognize scene elements when the activity is low, multi-modal, or indirect. [sent-26, score-0.392]
16 Multi-modal elements have multiple behavior characteristics; for example, roadways have vehicles driving on them but they can also have vehicles stopping and turning to enter a parking-spot. [sent-27, score-0.158]
17 Additionally, indirect activity is when the activity associated with a scene element (building) occurs nearby (person-entering-building), but not within the scene element’s bounds (no walking on the roof). [sent-28, score-0.533]
18 Our solution for recognizing scene elements with low, indirect, and/or multi-modal activity is to introduce a new pyramid coding approach that creates a hierarchy of descriptive clusters to form a pyramid of codebooks over a local behavioral context window. [sent-29, score-1.255]
19 Our first contribution is the characterization of scene elements using a sparse-dense pyramid of codebooks and the path of the descriptors through the pyramid. [sent-30, score-0.697]
20 This pyramid coding approach implicitly captures all behavior granularities, up [Figure 1 caption fragment: [...] frame image, (Middle) Defined AOI around main street, (Bottom) detected functional scene elements.] [sent-31, score-0.625]
21 to a maximum number, to enable the characterization of multi-modal behaviors. [sent-32, score-0.043]
22 However, this approach focuses on using the leaf clusters through entropy weighting and divides each cluster into K more clusters independent of the variability in their data, which can result in millions of sparse clusters. [sent-34, score-0.635]
23 Our approach uses clusters from all layers in the pyramid and produces a smaller number of clusters that are dense in content. [sent-35, score-0.769]
24 The dense clusters are created by bifurcating clusters based on the variance of their assigned data. [sent-36, score-0.63]
25 This also has an inherent subset [...] sparse pyramid results in [...] [sent-38, score-0.042]
26 [...] the number of clusters [...] HKM’s [...] [sent-113, score-0.06]
27 [Our second contribution] is the incorporation of local behavior context to compensate for both the low and indirect activity. [sent-127, score-0.181]
28 Local behavioral context is captured by aggregating (pooling) the behaviors that surround the scene element of interest, not just those in a single grid cell [1,2,3,4]. [sent-128, score-0.68]
29 This increases the observed amount of activity and couples the scene element with nearby activity. [sent-129, score-0.44]
30 Our overall approach recognizes spatial regions that have similar functional behaviors as the presented training examples. [sent-130, score-0.414]
31 Our framework for this is similar to those used by standard Pyramid Matching (PM) approaches for image/object classification [6,12], where there is a coding and a pooling step. [sent-131, score-0.119]
32 These steps are shown in Figure 2, where we start with a set [...] [sent-133, score-0.084]
33 of descriptors derived from moving objects, see Section [...]. [sent-135, score-0.163]
34 The pyramid coding algorithms [...] [sent-140, score-0.225]
35 [...] that use hierarchical [...] Gaussian Mixture Models (GMMs) to form the pyramid [...] [sent-143, score-0.094]
36 This clustering process results in two unique clusters per layer as indicated by the red and blue clusters, where the red cluster has the highest variance and is bifurcated. [sent-145, score-0.438]
37 After pyramid coding, a 2D spatial grid is applied to the scene’s ground plane and encoded [sent-146, score-0.352]
38 once for each codebook in the pyramid of codebooks. [sent-148, score-0.342]
39 The encoding process first assigns descriptors to clusters [...] [sent-150, score-0.132]
40 [...] frequently occurring clusters within each grid cell to that grid cell. [sent-154, score-0.154]
41 Each encoded scene is referred to as a functional region map [3]. [sent-155, score-0.363]
42 The scene element models are formed during the pooling step, where one model is created for each training example. [sent-156, score-0.349]
43 Pooling involves accumulating the unique clusters/codewords for the Regions of Interest (ROIs) from each layer’s functional region map into a histogram model. [sent-157, score-0.273]
44 To reduce processing time during the recognition process the unique codewords from the functional region maps are stored as integral images during training. [sent-158, score-0.273]
45 The testing process is a recognition framework that identifies both the location and label of scene elements. [sent-159, score-0.132]
46 During the testing process an “unknown” histogram model from a test ROI is compared to each learned model which returns the likelihood of fitting to each. [sent-160, score-0.066]
47 The scene is raster scanned with the test ROI to produce a 2D likelihood map that is later smoothed with a Markov Random Field. [sent-161, score-0.198]
48 To date, no functional scene modeling approaches have been applied to WAMI data, which offers more challenges such as a more diverse set of behaviors and fewer pixels on vehicles and pedestrians (movers). [sent-163, score-0.575]
49 Our experiments show how modeling local context along with applying the pyramid to the coding step significantly improves recognition results, particularly when compared to the most relevant coding [14] and functional recognition [2,3,4] approaches. [sent-164, score-0.749]
50 Relevant Work Swears and Hoogs [1] introduced functional scene element recognition in outdoor surveillance video. [sent-166, score-0.53]
51 This approach uses manually defined Bayesian classifiers and weak activity detectors to accumulate 2D likelihood maps over a scene for the elements of interest. [sent-167, score-0.455]
52 This was later [figure caption fragment: (red/blue clusters are unique)] [sent-168, score-0.268]
53 extended in [2] by converting the likelihood maps to track descriptors and passing them into a hierarchical divisive clustering algorithm. [sent-173, score-0.314]
54 The Functional-Category approach in [3] is a completely unsupervised method that clusters histograms of descriptors using a flat mean-shift clustering algorithm. [sent-174, score-0.404]
55 These approaches only use the leaf or flat layer of clusters to characterize functional scene elements and do not take local context into account. [sent-175, score-1.07]
56 The functional scene element recognition approach in [4] implements supervised binary scene element detectors to produce 2D likelihood maps for each element and then imposes local class adjacency constraints to perform spatial smoothing with a Markov Random Field (MRF). [sent-176, score-1.114]
57 However, it does not scale to a wide variety of descriptors and there is no local behavioral context taken into account. [sent-177, score-0.263]
58 The work in [5] offers a more complex approach that uses manually defined complex Markov Logic Networks to recognize interactions between moving objects specific to the scene element of interest. [sent-178, score-0.361]
59 However, the logic representation is limited to evidence that has well-defined semantic meaning, which is not always available, is subjective, and requires a subject matter expert to define. [sent-179, score-0.067]
60 Other work to classify images/objects uses HKM clustering [14] to form the pyramid of codebooks. [sent-180, score-0.233]
61 This work has shown that a larger set of leaf clusters leads to improved recognition when focusing on the leaf clusters. [sent-181, score-0.466]
62 However, our work shows that using our dense clusters from all layers in the model leads to improved recognition over emphasizing sparse leaf clusters. [sent-182, score-0.367]
63 Track Based Descriptors Our pyramid coding algorithms can use virtually any feature derived from detections or tracks, where both are referred to as track based descriptors. [sent-184, score-0.475]
64 Moving objects are detected in video using a standard background subtraction algorithm [9] and then associated to tracks [10] resulting in multiple detections per track. [sent-185, score-0.127]
65 The tracks are then processed through event detectors [4,5], track-type classification [1], and normalcy modeling algorithms [1,2]. [sent-187, score-0.459]
66 Simple low-level event detectors based on speed thresholds are used here to generate the probability of events on a per detection basis such as vehicle-stopping and vehicle-driving-fast. [sent-190, score-0.138]
67 Similarly, the vehicle-turning event detector is based on angular difference thresholds. [sent-191, score-0.086]
68 The person/vehicle/other (PVO) classifier descriptors are generated from a simple Bayesian classifier where the parameters for the person, vehicle, and other classes have been manually defined, as in [1,2]. [sent-192, score-0.083]
69 Spatial normalcy models are 2D likelihood maps that show where a [...] [Table 1 caption: Track based descriptors including PVO classification, event detection, and normalcy model types.] [sent-193, score-0.707]
70 [...] the normalcy model for doorways is shown in Figure 2 [...] [sent-194, score-0.32]
71 [...] evidence from weak detectors [...] is assigned a value [...] [sent-199, score-0.146]
72 This whitening creates a descriptor space that is better conditioned for optimization during hierarchical divisive clustering. [sent-234, score-0.144]
73 Note, any event detector, PVO classifier, or normalcy model generator can be used as descriptors here. [sent-235, score-0.405]
74 Pyramid Coding The pyramid coding process first forms the sparse-dense pyramid of codebooks and then encodes the scene into a pyramid of [...] [sent-237, score-0.92]
75 [...], where N is the number of [...] [sent-283, score-0.057]
76 [...] clustering algorithm starts at layer two, k=2, by bifurcating [...] into two clusters [...] [sent-316, score-0.264]
77 [...] from each other in the [...] [sent-332, score-0.042]
78 This process is repeated at each layer until the maximum number of clusters is reached, or until the model fit to the data vs. [sent-420, score-0.396]
79 One significant benefit of this approach is that the data points in X have a clear path through the pyramid, where the sum of the points in the child clusters equals the number in their parent cluster. [sent-422, score-0.268]
80 This results in only two unique clusters at each layer, which reduces the model complexity and creates our sparse pyramid. [sent-423, score-0.06]
81 The [...] layer in the pyramid initially results in a full codebook [sent-424, score-0.361]
82 with k clusters (codewords), which are used to encode [...] [sent-425, score-0.094]
83 This is accomplished by overlaying a 3 × [sent-427, score-0.062]
84 4 grid onto the ground plane and assigning each grid cell [...] [sent-428, score-0.154]
85 [...] specifically, each of the [...] [sent-431, score-0.055]
86 [...] grid (i,j) [...] are assigned to one of the k codewords [...] [sent-437, score-0.042]
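The sentences above describe a pyramid in which, at each layer, only the highest-variance cluster is bifurcated, so each new layer adds exactly two new clusters and every data point keeps a clear path from parent to child. A minimal sketch of that idea follows. It is an illustration under stated assumptions, not the authors' implementation: the source appears to use hierarchical Gaussian Mixture Models for the split, whereas this sketch swaps in a tiny 2-means split, and the names `two_means`, `build_pyramid`, and `max_clusters` are hypothetical.

```python
import numpy as np


def two_means(points, iters=20):
    """Split points into two groups with a small 2-means.

    Assumption: this is a stand-in for the GMM-based split in the source.
    """
    # Deterministic start: the two extreme points along the highest-variance axis.
    axis = int(np.argmax(points.var(axis=0)))
    lo, hi = np.argmin(points[:, axis]), np.argmax(points[:, axis])
    centers = np.stack([points[lo], points[hi]]).astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        side = dists.argmin(axis=1)          # 0 or 1 for every point
        for c in (0, 1):
            if np.any(side == c):
                centers[c] = points[side == c].mean(axis=0)
    return side


def build_pyramid(X, max_clusters=4):
    """Return a list of layers; each layer is a list of index arrays into X."""
    layers = [[np.arange(len(X))]]           # layer 1: one root cluster
    while len(layers[-1]) < max_clusters:
        clusters = layers[-1]
        # Total variance per cluster; singletons cannot be split.
        spread = [X[idx].var(axis=0).sum() if len(idx) > 1 else -1.0
                  for idx in clusters]
        worst = int(np.argmax(spread))
        if spread[worst] <= 0:
            break                            # nothing left worth splitting
        idx = clusters[worst]
        side = two_means(X[idx])
        if side.min() == side.max():
            break                            # degenerate split, stop early
        children = [idx[side == 0], idx[side == 1]]
        # Only the bifurcated cluster changes; all others carry over unchanged.
        layers.append(clusters[:worst] + children + clusters[worst + 1:])
    return layers
```

Because every point belongs to exactly one cluster in each layer, its path through the pyramid can be read off by looking up its index layer by layer, and the sizes of the two children always sum to the size of their parent.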
wordName wordTfidf (topN-words)
[('clusters', 0.268), ('normalcy', 0.236), ('pyramid', 0.233), ('functional', 0.231), ('element', 0.167), ('pvo', 0.141), ('behavioral', 0.133), ('scene', 0.132), ('layer', 0.128), ('rem', 0.125), ('behaviors', 0.124), ('coding', 0.119), ('elements', 0.112), ('codebooks', 0.109), ('stt', 0.104), ('leaf', 0.099), ('bifurcating', 0.094), ('cltlsu', 0.094), ('dfu', 0.094), ('drs', 0.094), ('fdfuuellnl', 0.094), ('fmroidm', 0.094), ('nnoc', 0.094), ('sparsedense', 0.094), ('tdreascckr', 0.094), ('tido', 0.094), ('wami', 0.094), ('activity', 0.093), ('event', 0.086), ('tracks', 0.085), ('indirect', 0.084), ('rei', 0.084), ('divisive', 0.084), ('wiith', 0.084), ('doorways', 0.084), ('hese', 0.084), ('descriptors', 0.083), ('track', 0.081), ('hkm', 0.077), ('grid', 0.077), ('tdheer', 0.069), ('interact', 0.067), ('logic', 0.067), ('aerial', 0.067), ('likelihood', 0.066), ('ts', 0.065), ('roi', 0.064), ('om', 0.063), ('moving', 0.062), ('insd', 0.062), ('creates', 0.06), ('sof', 0.06), ('recognizes', 0.059), ('sam', 0.057), ('gf', 0.055), ('bd', 0.054), ('flat', 0.053), ('detectors', 0.052), ('incorporation', 0.05), ('pooling', 0.05), ('er', 0.048), ('nearby', 0.048), ('dh', 0.048), ('ae', 0.048), ('context', 0.047), ('vehicles', 0.046), ('id', 0.043), ('characterization', 0.043), ('ri', 0.042), ('unique', 0.042), ('sg', 0.042), ('pm', 0.042), ('descriptive', 0.042), ('detections', 0.042), ('dr', 0.042), ('pedestrians', 0.042), ('ett', 0.042), ('rois', 0.042), ('anthony', 0.042), ('btoo', 0.042), ('itri', 0.042), ('rts', 0.042), ('esv', 0.042), ('stth', 0.042), ('cge', 0.042), ('olaf', 0.042), ('shea', 0.042), ('yse', 0.042), ('doorway', 0.042), ('edde', 0.042), ('uid', 0.042), ('thhat', 0.042), ('alll', 0.042), ('hoogs', 0.042), ('ike', 0.042), ('tgw', 0.042), ('iec', 0.042), ('movers', 0.042), ('coofd', 0.042), ('nwgh', 0.042)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 331 iccv-2013-Pyramid Coding for Functional Scene Element Recognition in Video Scenes
2 0.17253469 208 iccv-2013-Image Co-segmentation via Consistent Functional Maps
Author: Fan Wang, Qixing Huang, Leonidas J. Guibas
Abstract: Joint segmentation of image sets has great importance for object recognition, image classification, and image retrieval. In this paper, we aim to jointly segment a set of images starting from a small number of labeled images or none at all. To allow the images to share segmentation information with each other, we build a network that contains segmented as well as unsegmented images, and extract functional maps between connected image pairs based on image appearance features. These functional maps act as general property transporters between the images and, in particular, are used to transfer segmentations. We define and operate in a reduced functional space optimized so that the functional maps approximately satisfy cycle-consistency under composition in the network. A joint optimization framework is proposed to simultaneously generate all segmentation functions over the images so that they both align with local segmentation cues in each particular image, and agree with each other under network transportation. This formulation allows us to extract segmentations even with no training data, but can also exploit such data when available. The collective effect of the joint processing using functional maps leads to accurate information sharing among images and yields superior segmentation results, as shown on the iCoseg, MSRC, and PASCAL data sets.
3 0.13896966 447 iccv-2013-Volumetric Semantic Segmentation Using Pyramid Context Features
Author: Jonathan T. Barron, Mark D. Biggin, Pablo Arbeláez, David W. Knowles, Soile V.E. Keranen, Jitendra Malik
Abstract: We present an algorithm for the per-voxel semantic segmentation of a three-dimensional volume. At the core of our algorithm is a novel “pyramid context” feature, a descriptive representation designed such that exact per-voxel linear classification can be made extremely efficient. This feature not only allows for efficient semantic segmentation but enables other aspects of our algorithm, such as novel learned features and a stacked architecture that can reason about self-consistency. We demonstrate our technique on 3D fluorescence microscopy data of Drosophila embryos for which we are able to produce extremely accurate semantic segmentations in a matter of minutes, and for which other algorithms fail due to the size and high-dimensionality of the data, or due to the difficulty of the task.
4 0.1268325 127 iccv-2013-Dynamic Pooling for Complex Event Recognition
Author: Weixin Li, Qian Yu, Ajay Divakaran, Nuno Vasconcelos
Abstract: The problem of adaptively selecting pooling regions for the classification of complex video events is considered. Complex events are defined as events composed of several characteristic behaviors, whose temporal configuration can change from sequence to sequence. A dynamic pooling operator is defined so as to enable a unified solution to the problems of event specific video segmentation, temporal structure modeling, and event detection. Video is decomposed into segments, and the segments most informative for detecting a given event are identified, so as to dynamically determine the pooling operator most suited for each sequence. This dynamic pooling is implemented by treating the locations of characteristic segments as hidden information, which is inferred, on a sequence-by-sequence basis, via a large-margin classification rule with latent variables. Although the feasible set of segment selections is combinatorial, it is shown that a globally optimal solution to the inference problem can be obtained efficiently, through the solution of a series of linear programs. Besides the coarse-level location of segments, a finer model of video structure is implemented by jointly pooling features of segment-tuples. Experimental evaluation demonstrates that the resulting event detector has state-of-the-art performance on challenging video datasets.
5 0.10149293 216 iccv-2013-Inferring "Dark Matter" and "Dark Energy" from Videos
Author: Dan Xie, Sinisa Todorovic, Song-Chun Zhu
Abstract: This paper presents an approach to localizing functional objects in surveillance videos without domain knowledge about semantic object classes that may appear in the scene. Functional objects do not have discriminative appearance and shape, but they affect behavior of people in the scene. For example, they “attract” people to approach them for satisfying certain needs (e.g., vending machines could quench thirst), or “repel” people to avoid them (e.g., grass lawns). Therefore, functional objects can be viewed as “dark matter”, emanating “dark energy” that affects people’s trajectories in the video. To detect “dark matter” and infer their “dark energy” field, we extend the Lagrangian mechanics. People are treated as particle-agents with latent intents to approach “dark matter” and thus satisfy their needs, where their motions are subject to a composite “dark energy” field of all functional objects in the scene. We make the assumption that people take globally optimal paths toward the intended “dark matter” while avoiding latent obstacles. A Bayesian framework is used to probabilistically model: people’s trajectories and intents, constraint map of the scene, and locations of functional objects. A data-driven Markov Chain Monte Carlo (MCMC) process is used for inference. Our evaluation on videos of public squares and courtyards demonstrates our effectiveness in localizing functional objects and predicting people’s trajectories in unobserved parts of the video footage.
6 0.099869028 406 iccv-2013-Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time
7 0.090420961 58 iccv-2013-Bayesian 3D Tracking from Monocular Video
8 0.084770106 236 iccv-2013-Learning Discriminative Part Detectors for Image Classification and Cosegmentation
9 0.08402063 289 iccv-2013-Network Principles for SfM: Disambiguating Repeated Structures with Local Context
10 0.080724657 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification
11 0.079390571 268 iccv-2013-Modeling 4D Human-Object Interactions for Event and Object Recognition
12 0.077534519 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition
13 0.07736899 146 iccv-2013-Event Detection in Complex Scenes Using Interval Temporal Constraints
14 0.07587862 258 iccv-2013-Low-Rank Sparse Coding for Image Classification
15 0.074130818 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
16 0.074066378 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow
17 0.073817052 147 iccv-2013-Event Recognition in Photo Collections with a Stopwatch HMM
18 0.072784528 333 iccv-2013-Quantize and Conquer: A Dimensionality-Recursive Solution to Clustering, Vector Quantization, and Image Retrieval
19 0.072515629 165 iccv-2013-Find the Best Path: An Efficient and Accurate Classifier for Image Hierarchies
20 0.071624935 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
topicId topicWeight
[(0, 0.181), (1, 0.039), (2, 0.023), (3, 0.027), (4, 0.026), (5, 0.025), (6, -0.012), (7, 0.014), (8, -0.022), (9, -0.085), (10, -0.016), (11, -0.034), (12, -0.003), (13, 0.068), (14, -0.078), (15, -0.023), (16, -0.006), (17, 0.033), (18, 0.035), (19, 0.046), (20, -0.058), (21, -0.028), (22, 0.059), (23, 0.0), (24, -0.064), (25, -0.009), (26, 0.034), (27, -0.024), (28, -0.019), (29, 0.052), (30, 0.045), (31, -0.045), (32, 0.051), (33, -0.014), (34, -0.038), (35, 0.054), (36, 0.017), (37, -0.065), (38, 0.057), (39, 0.039), (40, -0.021), (41, 0.098), (42, -0.068), (43, -0.055), (44, -0.155), (45, -0.14), (46, -0.115), (47, -0.029), (48, -0.055), (49, -0.114)]
simIndex simValue paperId paperTitle
same-paper 1 0.95927322 331 iccv-2013-Pyramid Coding for Functional Scene Element Recognition in Video Scenes
2 0.67348409 447 iccv-2013-Volumetric Semantic Segmentation Using Pyramid Context Features
Author: Jonathan T. Barron, Mark D. Biggin, Pablo Arbeláez, David W. Knowles, Soile V.E. Keranen, Jitendra Malik
Abstract: We present an algorithm for the per-voxel semantic segmentation of a three-dimensional volume. At the core of our algorithm is a novel “pyramid context” feature, a descriptive representation designed such that exact per-voxel linear classification can be made extremely efficient. This feature not only allows for efficient semantic segmentation but enables other aspects of our algorithm, such as novel learned features and a stacked architecture that can reason about self-consistency. We demonstrate our technique on 3D fluorescence microscopy data of Drosophila embryos for which we are able to produce extremely accurate semantic segmentations in a matter of minutes, and for which other algorithms fail due to the size and high-dimensionality of the data, or due to the difficulty of the task.
3 0.61940253 401 iccv-2013-Stacked Predictive Sparse Coding for Classification of Distinct Regions in Tumor Histopathology
Author: Hang Chang, Yin Zhou, Paul Spellman, Bahram Parvin
Abstract: Image-based classification of histology sections, in terms of distinct components (e.g., tumor, stroma, normal), provides a series of indices for tumor composition. Furthermore, aggregation of these indices, from each whole slide image (WSI) in a large cohort, can provide predictive models of the clinical outcome. However, performance of the existing techniques is hindered as a result of large technical variations and biological heterogeneities that are always present in a large cohort. We propose a system that automatically learns a series of basis functions for representing the underlying spatial distribution using stacked predictive sparse decomposition (PSD). The learned representation is then fed into the spatial pyramid matching framework (SPM) with a linear SVM classifier. The system has been evaluated for classification of (a) distinct histological components for two cohorts of tumor types, and (b) colony organization of normal and malignant cell lines in 3D cell culture models. Throughput has been increased through the use of a graphical processing unit (GPU), and evaluation indicates superior performance compared with previous research.
4 0.6188525 412 iccv-2013-Synergistic Clustering of Image and Segment Descriptors for Unsupervised Scene Understanding
Author: Daniel M. Steinberg, Oscar Pizarro, Stefan B. Williams
Abstract: With the advent of cheap, high fidelity, digital imaging systems, the quantity and rate of generation of visual data can dramatically outpace a human’s ability to label or annotate it. In these situations there is scope for the use of unsupervised approaches that can model these datasets and automatically summarise their content. To this end, we present a totally unsupervised, and annotation-less, model for scene understanding. This model can simultaneously cluster whole-image and segment descriptors, thereby forming an unsupervised model of scenes and objects. We show that this model outperforms other unsupervised models that can only cluster one source of information (image or segment) at once. We are able to compare unsupervised and supervised techniques using standard measures derived from confusion matrices and contingency tables. This shows that our unsupervised model is competitive with current supervised and weakly-supervised models for scene understanding on standard datasets. We also demonstrate our model operating on a dataset with more than 100,000 images collected by an autonomous underwater vehicle.
Author: Yong Jae Lee, Alexei A. Efros, Martial Hebert
Abstract: We present a weakly-supervised visual data mining approach that discovers connections between recurring mid-level visual elements in historic (temporal) and geographic (spatial) image collections, and attempts to capture the underlying visual style. In contrast to existing discovery methods that mine for patterns that remain visually consistent throughout the dataset, our goal is to discover visual elements whose appearance changes due to change in time or location; i.e., exhibit consistent stylistic variations across the label space (date or geo-location). To discover these elements, we first identify groups of patches that are style-sensitive. We then incrementally build correspondences to find the same element across the entire dataset. Finally, we train style-aware regressors that model each element’s range of stylistic differences. We apply our approach to date and geo-location prediction and show substantial improvement over several baselines that do not model visual style. We also demonstrate the method’s effectiveness on the related task of fine-grained classification.
6 0.52798718 287 iccv-2013-Neighbor-to-Neighbor Search for Fast Coding of Feature Vectors
7 0.52552366 72 iccv-2013-Characterizing Layouts of Outdoor Scenes Using Spatial Topic Processes
8 0.52342421 87 iccv-2013-Conservation Tracking
9 0.5166198 258 iccv-2013-Low-Rank Sparse Coding for Image Classification
10 0.50678176 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
11 0.50481182 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
12 0.49897268 73 iccv-2013-Class-Specific Simplex-Latent Dirichlet Allocation for Image Classification
13 0.49753079 34 iccv-2013-Abnormal Event Detection at 150 FPS in MATLAB
14 0.49130583 125 iccv-2013-Drosophila Embryo Stage Annotation Using Label Propagation
15 0.48540887 289 iccv-2013-Network Principles for SfM: Disambiguating Repeated Structures with Local Context
16 0.48264876 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow
17 0.48220626 388 iccv-2013-Shape Index Descriptors Applied to Texture-Based Galaxy Analysis
18 0.47908574 58 iccv-2013-Bayesian 3D Tracking from Monocular Video
19 0.47626454 215 iccv-2013-Incorporating Cloud Distribution in Sky Representation
20 0.47150999 128 iccv-2013-Dynamic Probabilistic Volumetric Models
topicId topicWeight
[(2, 0.077), (7, 0.011), (13, 0.011), (26, 0.056), (31, 0.041), (42, 0.073), (48, 0.459), (64, 0.043), (73, 0.029), (89, 0.132)]
simIndex simValue paperId paperTitle
same-paper 1 0.78591776 331 iccv-2013-Pyramid Coding for Functional Scene Element Recognition in Video Scenes
Author: Eran Swears, Anthony Hoogs, Kim Boyer
Abstract: Recognizing functional scene elements in video scenes based on the behaviors of moving objects that interact with them is an emerging problem of interest. Existing approaches have a limited ability to characterize elements such as cross-walks, intersections, and buildings that have low activity, are multi-modal, or have indirect evidence. Our approach recognizes the low activity and multi-model elements (crosswalks/intersections) by introducing a hierarchy of descriptive clusters to form a pyramid of codebooks that is sparse in the number of clusters and dense in content. The incorporation of local behavioral context such as person-enter-building and vehicle-parking nearby enables the detection of elements that do not have direct motion-based evidence, e.g. buildings. These two contributions significantly improve scene element recognition when compared against three state-of-the-art approaches. Results are shown on typical ground level surveillance video and for the first time on the more complex Wide Area Motion Imagery.
2 0.73992705 311 iccv-2013-Pedestrian Parsing via Deep Decompositional Network
Author: Ping Luo, Xiaogang Wang, Xiaoou Tang
Abstract: We propose a new Deep Decompositional Network (DDN) for parsing pedestrian images into semantic regions, such as hair, head, body, arms, and legs, where the pedestrians can be heavily occluded. Unlike existing methods based on template matching or Bayesian inference, our approach directly maps low-level visual features to the label maps of body parts with DDN, which is able to accurately estimate complex pose variations with good robustness to occlusions and background clutters. DDN jointly estimates occluded regions and segments body parts by stacking three types of hidden layers: occlusion estimation layers, completion layers, and decomposition layers. The occlusion estimation layers estimate a binary mask, indicating which part of a pedestrian is invisible. The completion layers synthesize low-level features of the invisible part from the original features and the occlusion mask. The decomposition layers directly transform the synthesized visual features to label maps. We devise a new strategy to pre-train these hidden layers, and then fine-tune the entire network using the stochastic gradient descent. Experimental results show that our approach achieves better segmentation accuracy than the state-of-the-art methods on pedestrian images with or without occlusions. Another important contribution of this paper is that it provides a large scale benchmark human parsing dataset that includes 3,673 annotated samples collected from 171 surveillance videos. It is 20 times larger than existing public datasets.
3 0.69270039 63 iccv-2013-Bounded Labeling Function for Global Segmentation of Multi-part Objects with Geometric Constraints
Author: Masoud S. Nosrati, Shawn Andrews, Ghassan Hamarneh
Abstract: The inclusion of shape and appearance priors has proven useful for obtaining more accurate and plausible segmentations, especially for complex objects with multiple parts. In this paper, we augment the popular Mumford-Shah model to incorporate two important geometrical constraints, termed containment and detachment, between different regions with a specified minimum distance between their boundaries. Our method is able to handle multiple instances of multi-part objects defined by these geometrical constraints using a single labeling function while maintaining global optimality. We demonstrate the utility and advantages of these two constraints and show that the proposed convex continuous method is superior to other state-of-the-art methods, including its discrete counterpart, in terms of memory usage and metrication errors. [Figure 1: The inside vs. outside ambiguity in (a) is resolved by our containment constraint in (b).]
4 0.62291414 320 iccv-2013-Pose-Configurable Generic Tracking of Elongated Objects
Author: Daniel Wesierski, Patrick Horain
Abstract: Elongated objects have various shapes and can shift, rotate, change scale, and be rigid or deform by flexing, articulating, and vibrating, with examples as varied as a glass bottle, a robotic arm, a surgical suture, a finger pair, a tram, and a guitar string. This generally makes tracking of poses of elongated objects very challenging. We describe a unified, configurable framework for tracking the pose of elongated objects, which move in the image plane and extend over the image region. Our method strives for simplicity, versatility, and efficiency. The object is decomposed into a chained assembly of segments of multiple parts that are arranged under a hierarchy of tailored spatio-temporal constraints. In this hierarchy, segments can rescale independently while their elasticity is controlled with global orientations and local distances. While the trend in tracking is to design complex, structure-free algorithms that update object appearance online, we show that our tracker, with the novel but remarkably simple, structured organization of parts with constant appearance, reaches or improves state-of-the-art performance. Most importantly, our model can be easily configured to track exact pose of arbitrary, elongated objects in the image plane. The tracker can run up to 100 fps on a desktop PC, yet the computation time scales linearly with the number of object parts. To our knowledge, this is the first approach to generic tracking of elongated objects.
5 0.60646492 354 iccv-2013-Robust Dictionary Learning by Error Source Decomposition
Author: Zhuoyuan Chen, Ying Wu
Abstract: Sparsity models have recently shown great promise in many vision tasks. Using a learned dictionary in sparsity models can in general outperform predefined bases in clean data. In practice, both training and testing data may be corrupted and contain noises and outliers. Although recent studies attempted to cope with corrupted data and achieved encouraging results in testing phase, how to handle corruption in training phase still remains a very difficult problem. In contrast to most existing methods that learn the dictionary from clean data, this paper is targeted at handling corruptions and outliers in training data for dictionary learning. We propose a general method to decompose the reconstructive residual into two components: a non-sparse component for small universal noises and a sparse component for large outliers, respectively. In addition, further analysis reveals the connection between our approach and the “partial” dictionary learning approach, updating only part of the prototypes (or informative codewords) with remaining (or noisy codewords) fixed. Experiments on synthetic data as well as real applications have shown satisfactory performance of this new robust dictionary learning approach.
6 0.59023231 207 iccv-2013-Illuminant Chromaticity from Image Sequences
7 0.4861238 220 iccv-2013-Joint Deep Learning for Pedestrian Detection
8 0.46476942 279 iccv-2013-Multi-stage Contextual Deep Learning for Pedestrian Detection
9 0.45883316 7 iccv-2013-A Deep Sum-Product Architecture for Robust Facial Attributes Analysis
10 0.44766128 206 iccv-2013-Hybrid Deep Learning for Face Verification
11 0.44084358 106 iccv-2013-Deep Learning Identity-Preserving Face Space
12 0.42877018 313 iccv-2013-Person Re-identification by Salience Matching
13 0.42798162 5 iccv-2013-A Color Constancy Model with Double-Opponency Mechanisms
14 0.42401129 208 iccv-2013-Image Co-segmentation via Consistent Functional Maps
15 0.41940731 351 iccv-2013-Restoring an Image Taken through a Window Covered with Dirt or Rain
16 0.41879934 312 iccv-2013-Perceptual Fidelity Aware Mean Squared Error
17 0.41384155 151 iccv-2013-Exploiting Reflection Change for Automatic Reflection Removal
18 0.41347885 61 iccv-2013-Beyond Hard Negative Mining: Efficient Detector Learning via Block-Circulant Decomposition
19 0.40805131 364 iccv-2013-SGTD: Structure Gradient and Texture Decorrelating Regularization for Image Decomposition
20 0.40647286 369 iccv-2013-Saliency Detection: A Boolean Map Approach