cvpr cvpr2013 cvpr2013-313 knowledge-graph by maker-knowledge-mining

313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos


Source: pdf

Author: Mehrsan Javan Roshtkhari, Martin D. Levine

Abstract: We present a novel approach for video parsing and simultaneous online learning of dominant and anomalous behaviors in surveillance videos. Dominant behaviors are those occurring frequently in videos and hence, usually do not attract much attention. They can be characterized by different complexities in space and time, ranging from a scene background to human activities. In contrast, an anomalous behavior is defined as having a low likelihood of occurrence. We do not employ any models of the entities in the scene in order to detect these two kinds of behaviors. In this paper, video events are learnt at each pixel without supervision using densely constructed spatio-temporal video volumes. Furthermore, the volumes are organized into large contextual graphs. These compositions are employed to construct a hierarchical codebook model for the dominant behaviors. By decomposing spatio-temporal contextual information into unique spatial and temporal contexts, the proposed framework learns the models of the dominant spatial and temporal events. Thus, it is ultimately capable of simultaneously modeling high-level behaviors as well as low-level spatial, temporal and spatio-temporal pixel level changes.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present a novel approach for video parsing and simultaneous online learning of dominant and anomalous behaviors in surveillance videos. [sent-5, score-0.949]

2 Dominant behaviors are those occurring frequently in videos and hence, usually do not attract much attention. [sent-6, score-0.301]

3 In contrast, an anomalous behavior is defined as having a low likelihood of occurrence. [sent-8, score-0.403]

4 In this paper, video events are learnt at each pixel without supervision using densely constructed spatio-temporal video volumes. [sent-10, score-0.514]

5 Furthermore, the volumes are organized into large contextual graphs. [sent-11, score-0.366]

6 These compositions are employed to construct a hierarchical codebook model for the dominant behaviors. [sent-12, score-0.468]

7 By decomposing spatio-temporal contextual information into unique spatial and temporal contexts, the proposed framework learns the models of the dominant spatial and temporal events. [sent-13, score-0.686]

8 Thus, it is ultimately capable of simultaneously modeling high-level behaviors as well as low-level spatial, temporal and spatio-temporal pixel level changes. [sent-14, score-0.544]

9 We can further categorize dominant behavior into two classes. [sent-20, score-0.416]

10 The input video is parsed into three meaningful components: background, dominant activities (walking pedestrians), and rare activities (the bicyclist). [sent-25, score-0.594]

11 However, dominant behavior detection is more general and more complicated than background subtraction, since it includes the scene background while not being limited to it. [sent-27, score-0.605]

12 In contrast, dominant behavior understanding can be seen as a generalization of this in which all of the dynamic contents (foreground) of the video come into play. [sent-30, score-0.614]

13 Here we concentrate on detecting two of the elements in Figure 1, that is, dominant spatio-temporal activities and abnormal behavior in a video. [sent-31, score-0.773]

14 Using densely sampled spatio-temporal video volumes (STVs), we create both local and global compositional graphs of volumes at each pixel. [sent-33, score-0.849]

15 Although employing STVs in the context of bag of video words (BOV) has been extensively studied for the well-known problem of activity recognition, it generally involves supervised training. [sent-34, score-0.26]

16 After initializing the algorithm, typically using one or two seconds of video, the system builds an adaptive model of the dominant behavior while simultaneously detecting anomalous events. [sent-37, score-0.416]

17 To capture spatio-temporal configurations of video volumes, a probabilistic framework is employed that estimates probability density functions of the arrangements of video volumes. [sent-202, score-0.35]

18 The high-level output can be employed to simultaneously model normal and abnormal behaviors. [sent-204, score-0.251]

19 We are interested in detecting different kinds of behavior in the spatial and temporal domains. [sent-210, score-0.446]

20 We use a modified version of online fuzzy clustering and thereby track the dominant spatio-temporal activities (clusters). [sent-212, score-0.577]

21 For example, we can determine all of the abnormal (“anomalous”) spatial and temporal behaviors in a video. [sent-214, score-0.647]

22 The main contribution of this paper is an approach capable of learning both dominant and anomalous behaviors in videos of different spatio-temporal complexity. [sent-215, score-0.801]

23 Thus, the algorithm can simultaneously model high level behaviors and detect abnormalities by considering both spatial and temporal contextual information while also performing temporal pixel level change detection and background subtraction. [sent-217, score-1.014]

24 This characteristic makes our algorithm more general than both abnormality detection and background subtraction methods on their own. [sent-218, score-0.336]

25 II- High level activity modeling and low level pixel change detection are performed simultaneously by a single algorithm. [sent-220, score-0.248]

26 This makes the algorithm capable of understanding behaviors of different complexity. [sent-222, score-0.342]

27 III- The algorithm adaptively learns the behavior patterns in the scene in an online manner. [sent-223, score-0.304]

28 In order to evaluate the capabilities of our approach we have conducted experiments using different datasets with different dominant behavior patterns. [sent-226, score-0.416]

29 On the other hand, techniques that do not require object detection followed by tracking focus on local spatio-temporal behaviors in videos and have recently gained increased popularity [2, 11]. [sent-230, score-0.341]

30 This is achieved either by constructing a pixel-level background model and behavior template [16, 14, 3, 8, 19] or by employing spatiotemporal video volumes [6, 4, 15, 29]. [sent-236, score-0.78]

31 The recent trend in video analysis is to use spatio-temporal video volumes in the context of BOV models. [sent-237, score-0.612]

32 Although there have been some efforts to incorporate either spatial or temporal compositions of the video volumes into the probabilistic topic models, they suffer from high computational complexity. [sent-240, score-0.737]

33 Therefore, they cannot be employed for online behavior understanding and real-time scene monitoring [12]. [sent-241, score-0.387]

34 To date, these have focussed on detecting low-level local anomalies in a video by analyzing the activity pattern of each pixel as a function of time. [sent-243, score-0.363]

35 Although the latter has achieved good results for abnormality detection, the method requires that the activity pattern of each pixel be constructed by employing a conventional method for background subtraction. [sent-246, score-0.412]

36 In contrast to the aforementioned approaches that attempt to model either local spatio-temporal activity patterns of a pixel or trajectories of moving objects, our goal is to construct a hierarchical model for all of the activities in a scene. [sent-248, score-0.278]

37 We use densely sampled videos and construct a hierarchy of spatiotemporal regions in the video to model dominant local activity patterns. [sent-252, score-0.646]

38 In addition, the model decomposes the spatial and temporal information independently, thereby making it capable of detecting purely spatial or temporal abnormalities. [sent-254, score-0.567]

39 Low level scene representation The first stage of the algorithm is to represent a surveillance video by meaningful spatio-temporal descriptors. [sent-258, score-0.282]

40 This is achieved by dense sampling, thereby producing STVs, and then clustering similar video volumes. [sent-259, score-0.251]

41 1 Spatio-temporal video volume descriptors The 3D STVs, $v_i \in \mathbb{R}^{n_x \times n_y \times n_t}$, are constructed by assuming a volume of size $n_x \times n_y \times n_t$ (typically 5×5×5) around each pixel (in which $n_x \times n_y$ is the size of the spatial (image) window and $n_t$ is the depth of the video volume in time). [sent-262, score-0.637]

42 These volumes are then characterized by the histogram of the spatio-temporal gradient of the video in polar coordinates [4, 27]. [sent-263, score-0.492]
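As a concrete illustration, here is a minimal NumPy sketch of such a descriptor: dense 5×5×5 volumes, each described by a histogram of 3D gradient orientations in polar coordinates. The bin counts (nθ = 16, nφ = 8) are taken from the experimental settings quoted later in this summary; the normalization, binning, and the brute-force loops are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def stv_descriptors(video, n=5, n_theta=16, n_phi=8):
    """Sketch: dense n x n x n spatio-temporal volumes, each described by a
    histogram of 3D gradient orientations in polar coordinates.
    `video` is a (T, H, W) grayscale array; returns one L1-normalized
    descriptor per interior pixel (loops kept simple for clarity).
    """
    gt, gy, gx = np.gradient(video.astype(np.float32))   # gradients along t, y, x
    mag = np.sqrt(gx**2 + gy**2 + gt**2)
    theta = np.arctan2(gy, gx)                            # spatial orientation, [-pi, pi]
    phi = np.arctan2(gt, np.sqrt(gx**2 + gy**2))          # temporal elevation, [-pi/2, pi/2]

    t_bin = ((theta + np.pi) / (2 * np.pi) * n_theta).astype(int).clip(0, n_theta - 1)
    p_bin = ((phi + np.pi / 2) / np.pi * n_phi).astype(int).clip(0, n_phi - 1)
    joint = t_bin * n_phi + p_bin                         # joint orientation bin index

    r = n // 2
    T, H, W = video.shape
    descriptors = {}
    for t in range(r, T - r):
        for y in range(r, H - r):
            for x in range(r, W - r):
                sl = (slice(t - r, t + r + 1),
                      slice(y - r, y + r + 1),
                      slice(x - r, x + r + 1))
                h = np.bincount(joint[sl].ravel(),
                                weights=mag[sl].ravel(),
                                minlength=n_theta * n_phi)
                descriptors[(t, y, x)] = h / (h.sum() + 1e-8)
    return descriptors
```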

43 2 Online clustering of video volumes In the previous section, a set of spatio-temporal volumes, vi, was constructed using dense sampling and represented by a descriptor vector, hi. [sent-293, score-0.531]

44 As the number of these volumes is extremely large, it is advantageous to group similar spatio-temporal volumes to reduce the dimensions of the search space, as commonly performed in “bag of video words” approaches [4, 25]. [sent-294, score-0.908]

45 To be capable of handling large amounts of data, and also considering the sequential nature of the video frames, the clustering strategy needs to be capable of limiting the amount of memory used for data storage and computations. [sent-295, score-0.321]

46 Thus, we adopt an online fuzzy clustering approach for very large datasets, which is capable of incrementally updating the cluster centers as new data are observed [9]. [sent-296, score-0.309]
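A toy sketch of what such an incremental update could look like, in the spirit of fuzzy c-means: memberships are computed for each new descriptor and cluster centers are nudged toward it, so only the centers and their accumulated weights are stored. The fuzzifier m and the update rule are assumptions; the paper follows the large-scale online fuzzy clustering of [9], which this sketch does not reproduce exactly.

```python
import numpy as np

class OnlineFuzzyCodebook:
    """Sketch of incremental fuzzy clustering of volume descriptors.
    Only centers and accumulated membership mass are kept in memory."""

    def __init__(self, centers, m=2.0):
        self.centers = np.asarray(centers, dtype=float)  # (N_C, D) initial codewords
        self.weight = np.full(len(self.centers), 1e-3)   # accumulated membership mass
        self.m = m                                       # fuzzifier (> 1), assumed value

    def memberships(self, h):
        d = np.linalg.norm(self.centers - h, axis=1) + 1e-8
        u = d ** (-2.0 / (self.m - 1.0))                 # standard FCM membership form
        return u / u.sum()

    def update(self, h):
        h = np.asarray(h, dtype=float)
        u = self.memberships(h)
        w = u ** self.m
        self.weight += w
        # Move each center toward the new sample, weighted by its membership.
        self.centers += (w / self.weight)[:, None] * (h - self.centers)
        return u                                         # soft codeword assignment u_{j,i}
```

The soft memberships returned by `update` play the role of the weights u_{j,i} used later in the ensemble representation.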

47 Contextual information: Ensembles of volumes As indicated earlier, in order to understand the scene background and make the correct decision regarding normal and suspicious (foreground) events, it is necessary to analyze the spatio-temporal arrangements of volumes [6, 25]. [sent-332, score-0.503]

48 R contains many video volumes and thereby captures both local and more distant information in the video frames. [sent-357, score-0.664]

49 Such a set is called an ensemble of volumes around the particular pixel in the video (Figure 3). [sent-358, score-0.626]

50 The ensemble of volumes ($E_{s,t}$) surrounding each pixel s in the video at time t is defined as: [sent-359, score-0.626]

51 $E_{s,t} = \{v_i : v_i \in R_{s,t}\}_{i=1}^{I}$ (6), where $R_{s,t}$ is a region with pre-defined spatial and temporal radii centered at point (s, t) in the video (e.g. [sent-362, score-0.392]
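A literal reading of (6) in code, under the assumption that volumes are indexed by their (x, y, t) positions; the dictionary layout is an illustrative choice, not the paper's data structure.

```python
def ensemble_of_volumes(volumes, center, radii):
    """Sketch of (6): collect all volumes whose (x, y, t) positions fall
    inside the region R_{s,t} with the given spatial/temporal radii
    around `center` = (x, y, t). `volumes` maps positions to descriptors."""
    (cx, cy, ct), (rx, ry, rt) = center, radii
    return {p: v for p, v in volumes.items()
            if abs(p[0] - cx) <= rx and abs(p[1] - cy) <= ry and abs(p[2] - ct) <= rt}
```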

52 To capture the spatio-temporal compositions of the video volumes, we use the relative spatio-temporal coordinates of the volume in each ensemble [25]. [sent-365, score-0.375]

53 Thus, $x_{v_i} \in \mathbb{R}^3$ is the relative position of the ith video volume, $v_i$ (in space and time), inside the ensemble of volumes, $E_{s,t}$, for a given point (s, t) in the video (Figure 3b). [sent-366, score-0.429]

54 During the codeword assignment process described in the previous section, each volume $v_i$ inside each ensemble $E_{s,t}$ was assigned to all labels $c_j$ with weights $u_{j,i}$ using (4), for $i = 1{:}I$, $j = 1{:}N_C$ (7). [sent-367, score-0.259]

55 A common approach for calculating similarity between ensembles of volumes is to use the star graph model [6, 21, 4]. [sent-371, score-0.544]

56 This model uses the joint probability between a database and a query ensemble to decouple the similarity of the topologies of the ensembles and that of the actual video volumes [21]. [sent-372, score-0.853]

57 Thus, the probability of a particular arrangement of volumes v inside the ensemble $E_{s,t}$ is given by: $P_{E_{s,t}}(v) = P(x_v, c_1, c_2, \ldots)$ (8). [sent-374, score-0.421]

58 We would like to represent each ensemble of volumes by its pdf, $P_{E_{s,t}}(v)$. [sent-384, score-0.497]
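One hedged way to realize such a pdf is a joint histogram over quantized relative positions $x_v$ and codewords $c_j$, accumulated with the soft weights $u_{j,i}$. The position binning and the [0, 1) scaling below are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

def ensemble_pdf(rel_positions, assignments, pos_bins=(3, 3, 3)):
    """Sketch: estimate P(x_v, c) for one ensemble as a normalized joint
    histogram over quantized relative volume positions and codewords.
    rel_positions: (I, 3) positions of volumes relative to the ensemble
                   center, each scaled to [0, 1) per axis (assumption).
    assignments:   (I, N_C) soft codeword weights u_{j,i} per volume.
    """
    I, n_c = assignments.shape
    pdf = np.zeros(pos_bins + (n_c,))
    idx = np.minimum((rel_positions * np.array(pos_bins)).astype(int),
                     np.array(pos_bins) - 1)
    for i in range(I):
        pdf[idx[i, 0], idx[i, 1], idx[i, 2], :] += assignments[i]
    return pdf / pdf.sum()
```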

59 Space/Time decomposition of ensembles As stated previously, we are interested in detecting normal spatial and temporal activities to ultimately distinguish them from both spatial (shape and texture changes) and temporal abnormalities. [sent-389, score-0.784]

60 In order to individually characterize the different behaviors in the video, two sets of ensembles of spatio-temporal volumes are formed, one for the spatially oriented ensembles of volumes and the other for the temporally oriented ones. [sent-391, score-1.39]

61 Each ensemble is assigned to one of these sets by comparing its temporal radius $r_t$ against $\max\{r_x, r_y\}$ (9), where $D_S$ and $D_T$ represent the sets of spatially- and temporally-oriented ensembles, respectively, and $(r_x \times r_y \times r_t)$ is the size of the ensembles in (6). [sent-394, score-0.287]

62 The spatial and temporal decomposition of ensembles of STVs is illustrated in Figure 3c. [sent-395, score-0.44]
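A short sketch of that decomposition; since equation (9) is garbled in this summary, the rule used below (temporally oriented when r_t >= max{r_x, r_y}) is only an assumed reading.

```python
def decompose_ensembles(ensembles):
    """Sketch: split ensembles into spatially (D_S) and temporally (D_T)
    oriented sets. Each item is (ensemble, (rx, ry, rt)); the comparison
    direction is an assumed reading of (9)."""
    d_s, d_t = [], []
    for ens, (rx, ry, rt) in ensembles:
        (d_t if rt >= max(rx, ry) else d_s).append(ens)
    return d_s, d_t
```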

63 Clustering ensembles of STVs Once a video clip has been processed by the first level of BOV clustering in section 3.2, [sent-398, score-0.487]

64 each ensemble of spatio-temporal volumes has been represented by a pdf of its spatio-temporal volume distribution, as described in 3. [sent-400, score-0.589]

65 This will then permit us to construct a behavioral model for the video as well as infer the dominant behavior. [sent-404, score-0.417]

66 Using the pdf to represent each ensemble of volumes makes it possible to use a divergence function from statistics and information theory as the dissimilarity measure. [sent-405, score-0.497]

67 (10) where $P_{E_{s_i,t_i}}$ and $P_{E_{s_j,t_j}}$ are the pdfs of the ensembles $E_{s_i,t_i}$ and $E_{s_j,t_j}$, respectively, and d is the symmetric KL divergence between the two pdfs in (10). [sent-413, score-0.37]
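For discrete histogram pdfs, the symmetric KL divergence of (10) can be computed directly; the epsilon smoothing below is an added assumption to guard against empty bins, not something stated in the paper.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-10):
    """Sketch: symmetric KL divergence between two ensemble pdfs
    (flattened histograms), used as the dissimilarity d in (10)."""
    p = p.ravel() + eps
    q = q.ravel() + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```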

68 As stated previously, we are interested in detecting dominant spatial and temporal activities as an ultimate means of determining both spatial (shape and texture changes) and temporal abnormalities (foreground regions). [sent-427, score-0.896]

69 At each temporal sample t, a single image is added to the already observed frames and a new video sequence, the query, Q, is formed. [sent-429, score-0.334]

70 The query is densely sampled in order to construct the video volumes and thereby the ensembles of STVs, as described in section 3. [sent-430, score-0.802]

71 Given the already existing codebooks of ensembles constructed in 3. [sent-431, score-0.305]
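Sentences 69 to 71 outline the per-frame query step. Below is a hedged sketch of the scoring loop, reusing `symmetric_kl` from the sketch above; the threshold `tau` and the minimum-over-codewords decision rule are illustrative placeholders, not the paper's exact inference.

```python
def score_query_ensembles(query_pdfs, dominant_pdfs, tau):
    """Sketch of the online query step: each ensemble pdf extracted from
    the newly formed query Q is compared against the codebook of
    dominant-behavior pdfs; ensembles far from every codeword are
    flagged as candidate anomalies."""
    flagged = []
    for loc, pdf in query_pdfs.items():          # {(s, t): pdf} from dense sampling
        d = min(symmetric_kl(pdf, c) for c in dominant_pdfs)
        if d > tau:                              # poorly explained by dominant behavior
            flagged.append((loc, d))
    return flagged
```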

72 Returning to (12), the parameters α and β are seen to control the balance between spatial and temporal abnormalities based on the ultimate objective of the abnormality detection. [sent-461, score-0.498]

73 As an example, if the objective is to detect the temporal abnormality in the scene (background/foreground segmentation), then one can assume that α = 0. [sent-462, score-0.36]
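Equation (12) itself does not appear in this summary, so the linear combination below is only an assumed form, chosen to be consistent with the α = 0 special case mentioned above.

```python
def abnormality_score(spatial_score, temporal_score, alpha=0.5, beta=0.5):
    """Sketch: combine spatial and temporal abnormality evidence.
    alpha = 0 reduces the detector to purely temporal abnormality
    (background/foreground segmentation), as noted in the text.
    The linear form is an assumed reading of (12)."""
    return alpha * spatial_score + beta * temporal_score
```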

74 The scenario we have considered implies online and continuous surveillance of a particular scene in order to simultaneously detect dominant and anomalous patterns. [sent-465, score-0.52]

75 b) The dominant behaviors are produced by the cars passing through the lanes running from top to bottom and vice versa. [sent-479, score-0.546]

76 Experiments The algorithm has been tested using the following datasets: the dominant behavior understanding dataset in [28], the UCSD pedestrian dataset [18], and subway surveillance videos [1]. [sent-482, score-0.606]

77 In all cases, we have assumed that local video volumes are of size 5×5×5 and the HOG is calculated assuming nθ = 16, nφ = 8, and N = 50 frames. [sent-483, score-0.457]

78 The dominant behaviors are either the static background or the dynamic cars passing through the lanes running from top to bottom. [sent-492, score-0.601]

79 Figure 4 (a), (b), and (c) illustrate a sample frame, and the dominant and abnormal behavior maps, respectively. [sent-494, score-0.627]

80 In the BoatSea video sequence, the dominant behavior is the waves while the abnormalities are the passing boats since they are newly observed objects in the scene. [sent-495, score-0.699]

81 Figure 5 shows a sample video frame of each video sequence, the detected abnormal regions and the precision/recall curves. [sent-499, score-0.55]

82 The first experiment (first row) is concerned with detecting dominant and abnormal behavior in a busy traffic scene. [sent-531, score-0.685]

83 The second and third experiments were conducted on videos in which the abnormalities were defined as being rare but nevertheless acceptable foreground motions. [sent-532, score-0.269]

84 Column b) The detected anomalous regions are cars moving from right to left (top), a boat moving to the right (middle), and a moving person (bottom). [sent-535, score-0.286]

85 As the abnormalities in this dataset are low level motions, we also include the pixel-level background models (Gaussian Mixture Models [30]) and the behavior template approaches in [14] for comparison. [sent-538, score-0.419]

86 In particular, the method based on spatio-temporal oriented energy filters [28] produced results comparable to ours, but might not be useful for more complex behaviors for two reasons: it is too local and does not consider contextual information. [sent-540, score-0.334]

87 It is also clear that conventional methods for background subtraction (GMM) fail to detect dominant behaviors in scenes containing complicated behaviors, such as the Train and Belleview video sequences. [sent-541, score-0.743]

88 However, they still do produce good results for background subtraction in a scene with a stationary background (Boat-Sea video sequences). [sent-542, score-0.373]

89 Figure 6: Frame level abnormality detection using the UCSD pedestrian datasets (panels (a) and (b); horizontal axis: False Positive Rate). [sent-554, score-0.3]

90 b) Detected anomalous regions: bicyclist (top), a car (bottom). [sent-557, score-0.256]

91 We also conducted experiments with the UCSD pedestrian dataset. It contains video sequences from two pedestrian walkways where abnormal events occur. [sent-569, score-0.512]

92 The results in Table 1 indicate the performance of the proposed algorithm; this dataset was employed as it includes pixel level ground truth showing the exact location of the abnormal regions in each frame. [sent-574, score-0.376]

93 Frame level detection implies that a frame is marked as suspicious if it contains any abnormal pixel, regardless of its location. [sent-575, score-0.361]

94 This requires that the detected pixels in each video frame be compared to a pixel level ground truth map. [sent-577, score-0.251]

95 This is a major advantage of the proposed method, which can also learn dominant and abnormal behaviors on the fly. [sent-580, score-0.675]

96 Conclusions and future work This paper presents a novel approach for simultaneously learning dominant behaviors and detecting anomalous patterns in videos. [sent-586, score-0.735]

97 The algorithm is centered on three main ideas: hierarchical analysis of multi-scalar visual features; accounting for their spatio-temporal compositional information; and spatial and temporal decomposition of the behaviors in order to learn dominant spatial and temporal activities. [sent-587, score-0.956]

98 Future research will extend the approach by adding another level of analysis in the hierarchical structure to model the spatial and temporal connectivity of the learnt behaviors. [sent-589, score-0.312]

99 Motion segmentation and abnormal behavior detection via behavior clustering. [sent-653, score-0.631]

100 Observe locally, infer globally: A spacetime mrf for detecting abnormal activities with incremental updates. [sent-719, score-0.357]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('stvs', 0.411), ('volumes', 0.302), ('ensembles', 0.242), ('behaviors', 0.238), ('dominant', 0.226), ('anomalous', 0.213), ('abnormal', 0.211), ('behavior', 0.19), ('abnormality', 0.172), ('video', 0.155), ('temporal', 0.149), ('abnormalities', 0.128), ('ensemble', 0.119), ('bov', 0.115), ('anomaly', 0.108), ('saligrama', 0.108), ('fuzzy', 0.092), ('jodoin', 0.089), ('activities', 0.088), ('codebook', 0.08), ('pdf', 0.076), ('ks', 0.075), ('online', 0.075), ('subtraction', 0.069), ('activity', 0.066), ('belleview', 0.065), ('suspicious', 0.064), ('pdfs', 0.064), ('contextual', 0.064), ('videos', 0.063), ('events', 0.062), ('ucsd', 0.061), ('capable', 0.061), ('compositional', 0.058), ('detecting', 0.058), ('background', 0.055), ('chunk', 0.053), ('volume', 0.053), ('thereby', 0.052), ('pixel', 0.05), ('spatial', 0.049), ('codeword', 0.048), ('compositions', 0.048), ('rx', 0.048), ('level', 0.046), ('ry', 0.045), ('clustering', 0.044), ('cars', 0.044), ('kt', 0.044), ('avan', 0.043), ('bertini', 0.043), ('bicyclist', 0.043), ('dkt', 0.043), ('knss', 0.043), ('kntt', 0.043), ('levine', 0.043), ('ofvolumes', 0.043), ('roshtkhari', 0.043), ('understanding', 0.043), ('pedestrian', 0.042), ('surveillance', 0.042), ('foreground', 0.041), ('xv', 0.041), ('detection', 0.04), ('employed', 0.04), ('spatiotemporal', 0.039), ('vi', 0.039), ('scene', 0.039), ('employing', 0.039), ('mdt', 0.038), ('cim', 0.038), ('lanes', 0.038), ('benezeth', 0.038), ('esj', 0.038), ('hierarchical', 0.038), ('rare', 0.037), ('cluster', 0.037), ('construct', 0.036), ('cong', 0.036), ('query', 0.035), ('kl', 0.035), ('characterized', 0.035), ('topic', 0.034), ('clusters', 0.034), ('codewords', 0.034), ('anomalies', 0.034), ('codebooks', 0.033), ('densely', 0.032), ('oriented', 0.032), ('continuously', 0.031), ('rt', 0.031), ('hospedales', 0.031), ('mahadevan', 0.031), ('pages', 0.03), ('frames', 0.03), ('constructed', 0.03), ('learnt', 0.03), ('roc', 0.029), ('regions', 0.029), ('reddy', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos

Author: Mehrsan Javan Roshtkhari, Martin D. Levine

Abstract: We present a novel approach for video parsing and simultaneous online learning of dominant and anomalous behaviors in surveillance videos. Dominant behaviors are those occurring frequently in videos and hence, usually do not attract much attention. They can be characterized by different complexities in space and time, ranging from a scene background to human activities. In contrast, an anomalous behavior is defined as having a low likelihood of occurrence. We do not employ any models of the entities in the scene in order to detect these two kinds of behaviors. In this paper, video events are learnt at each pixel without supervision using densely constructed spatio-temporal video volumes. Furthermore, the volumes are organized into large contextual graphs. These compositions are employed to construct a hierarchical codebook model for the dominant behaviors. By decomposing spatio-temporal contextual information into unique spatial and temporal contexts, the proposed framework learns the models of the dominant spatial and temporal events. Thus, it is ultimately capable of simultaneously modeling high-level behaviors as well as low-level spatial, temporal and spatio-temporal pixel level changes.

2 0.26930487 310 cvpr-2013-Object-Centric Anomaly Detection by Attribute-Based Reasoning

Author: Babak Saleh, Ali Farhadi, Ahmed Elgammal

Abstract: When describing images, humans tend not to talk about the obvious, but rather mention what they find interesting. We argue that abnormalities and deviations from typicalities are among the most important components that form what is worth mentioning. In this paper we introduce the abnormality detection as a recognition problem and show how to model typicalities and, consequently, meaningful deviations from prototypical properties of categories. Our model can recognize abnormalities and report the main reasons of any recognized abnormality. We also show that abnormality predictions can help image categorization. We introduce the abnormality detection dataset and show interesting results on how to reason about abnormalities.

3 0.166879 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

Author: Raghuraman Gopalan

Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.

4 0.14103009 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?

Author: Michael S. Ryoo, Larry Matthies

Abstract: This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects to the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multichannel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos reliably.

5 0.13044885 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa

Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use ofrandomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.

6 0.12428831 148 cvpr-2013-Ensemble Video Object Cut in Highly Dynamic Scenes

7 0.1222574 158 cvpr-2013-Exploring Weak Stabilization for Motion Feature Extraction

8 0.12126182 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

9 0.12057802 172 cvpr-2013-Finding Group Interactions in Social Clutter

10 0.11916409 187 cvpr-2013-Geometric Context from Videos

11 0.11393902 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos

12 0.10487614 59 cvpr-2013-Better Exploiting Motion for Better Action Recognition

13 0.10344971 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

14 0.10258385 378 cvpr-2013-Sampling Strategies for Real-Time Action Recognition

15 0.093729027 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model

16 0.093535036 55 cvpr-2013-Background Modeling Based on Bidirectional Analysis

17 0.090748012 203 cvpr-2013-Hierarchical Video Representation with Trajectory Binary Partition Tree

18 0.089570388 332 cvpr-2013-Pixel-Level Hand Detection in Ego-centric Videos

19 0.08831168 103 cvpr-2013-Decoding Children's Social Behavior

20 0.087071642 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.202), (1, -0.039), (2, 0.002), (3, -0.09), (4, -0.092), (5, -0.007), (6, -0.053), (7, -0.057), (8, -0.048), (9, 0.068), (10, 0.074), (11, -0.067), (12, 0.08), (13, -0.038), (14, 0.081), (15, 0.029), (16, 0.01), (17, 0.065), (18, 0.003), (19, -0.097), (20, -0.013), (21, 0.068), (22, -0.012), (23, -0.07), (24, -0.045), (25, -0.027), (26, 0.006), (27, 0.042), (28, 0.01), (29, 0.023), (30, 0.04), (31, 0.068), (32, 0.004), (33, -0.028), (34, -0.007), (35, -0.006), (36, 0.003), (37, -0.031), (38, -0.057), (39, -0.004), (40, -0.024), (41, 0.033), (42, -0.026), (43, 0.005), (44, 0.002), (45, 0.022), (46, -0.088), (47, 0.017), (48, 0.048), (49, 0.081)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95591366 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos

Author: Mehrsan Javan Roshtkhari, Martin D. Levine

Abstract: We present a novel approach for video parsing and simultaneous online learning of dominant and anomalous behaviors in surveillance videos. Dominant behaviors are those occurring frequently in videos and hence, usually do not attract much attention. They can be characterized by different complexities in space and time, ranging from a scene background to human activities. In contrast, an anomalous behavior is defined as having a low likelihood of occurrence. We do not employ any models of the entities in the scene in order to detect these two kinds of behaviors. In this paper, video events are learnt at each pixel without supervision using densely constructed spatio-temporal video volumes. Furthermore, the volumes are organized into large contextual graphs. These compositions are employed to construct a hierarchical codebook model for the dominant behaviors. By decomposing spatio-temporal contextual information into unique spatial and temporal contexts, the proposed framework learns the models of the dominant spatial and temporal events. Thus, it is ultimately capable of simultaneously modeling high-level behaviors as well as low-level spatial, temporal and spatio-temporal pixel level changes.

2 0.77664781 49 cvpr-2013-Augmenting Bag-of-Words: Data-Driven Discovery of Temporal and Structural Information for Activity Recognition

Author: Vinay Bettadapura, Grant Schindler, Thomas Ploetz, Irfan Essa

Abstract: We present data-driven techniques to augment Bag of Words (BoW) models, which allow for more robust modeling and recognition of complex long-term activities, especially when the structure and topology of the activities are not known a priori. Our approach specifically addresses the limitations of standard BoW approaches, which fail to represent the underlying temporal and causal information that is inherent in activity streams. In addition, we also propose the use ofrandomly sampled regular expressions to discover and encode patterns in activities. We demonstrate the effectiveness of our approach in experimental evaluations where we successfully recognize activities and detect anomalies in four complex datasets.

3 0.74513268 413 cvpr-2013-Story-Driven Summarization for Egocentric Video

Author: Zheng Lu, Kristen Grauman

Abstract: We present a video summarization approach that discovers the story of an egocentric video. Given a long input video, our method selects a short chain of video subshots depicting the essential events. Inspired by work in text analysis that links news articles over time, we define a randomwalk based metric of influence between subshots that reflects how visual objects contribute to the progression of events. Using this influence metric, we define an objective for the optimal k-subshot summary. Whereas traditional methods optimize a summary ’s diversity or representativeness, ours explicitly accounts for how one sub-event “leads to ” another—which, critically, captures event connectivity beyond simple object co-occurrence. As a result, our summaries provide a better sense of story. We apply our approach to over 12 hours of daily activity video taken from 23 unique camera wearers, and systematically evaluate its quality compared to multiple baselines with 34 human subjects.

4 0.74292028 233 cvpr-2013-Joint Sparsity-Based Representation and Analysis of Unconstrained Activities

Author: Raghuraman Gopalan

Abstract: While the notion of joint sparsity in understanding common and innovative components of a multi-receiver signal ensemble has been well studied, we investigate the utility of such joint sparse models in representing information contained in a single video signal. By decomposing the content of a video sequence into that observed by multiple spatially and/or temporally distributed receivers, we first recover a collection of common and innovative components pertaining to individual videos. We then present modeling strategies based on subspace-driven manifold metrics to characterize patterns among these components, across other videos in the system, to perform subsequent video analysis. We demonstrate the efficacy of our approach for activity classification and clustering by reporting competitive results on standard datasets such as, HMDB, UCF-50, Olympic Sports and KTH.

5 0.7325322 243 cvpr-2013-Large-Scale Video Summarization Using Web-Image Priors

Author: Aditya Khosla, Raffay Hamid, Chih-Jen Lin, Neel Sundaresan

Abstract: Given the enormous growth in user-generated videos, it is becoming increasingly important to be able to navigate them efficiently. As these videos are generally of poor quality, summarization methods designed for well-produced videos do not generalize to them. To address this challenge, we propose to use web-images as a prior to facilitate summarization of user-generated videos. Our main intuition is that people tend to take pictures of objects to capture them in a maximally informative way. Such images could therefore be used as prior information to summarize videos containing a similar set of objects. In this work, we apply our novel insight to develop a summarization algorithm that uses the web-image based prior information in an unsupervised manner. Moreover, to automatically evaluate summarization algorithms on a large scale, we propose a framework that relies on multiple summaries obtained through crowdsourcing. We demonstrate the effectiveness of our evaluation framework by comparing its performance to that ofmultiple human evaluators. Finally, wepresent resultsfor our framework tested on hundreds of user-generated videos.

6 0.71841335 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding

7 0.70899558 175 cvpr-2013-First-Person Activity Recognition: What Are They Doing to Me?

8 0.69605654 294 cvpr-2013-Multi-class Video Co-segmentation with a Generative Multi-video Model

9 0.69007909 347 cvpr-2013-Recognize Human Activities from Partially Observed Videos

10 0.68197215 137 cvpr-2013-Dynamic Scene Classification: Learning Motion Descriptors with Slow Features Analysis

11 0.67517227 187 cvpr-2013-Geometric Context from Videos

12 0.66953212 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

13 0.6603753 148 cvpr-2013-Ensemble Video Object Cut in Highly Dynamic Scenes

14 0.65050268 118 cvpr-2013-Detecting Pulse from Head Motions in Video

15 0.64476448 37 cvpr-2013-Adherent Raindrop Detection and Removal in Video

16 0.64395887 333 cvpr-2013-Plane-Based Content Preserving Warps for Video Stabilization

17 0.64235735 103 cvpr-2013-Decoding Children's Social Behavior

18 0.62773782 55 cvpr-2013-Background Modeling Based on Bidirectional Analysis

19 0.62412906 348 cvpr-2013-Recognizing Activities via Bag of Words for Attribute Dynamics

20 0.62237048 28 cvpr-2013-A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.105), (16, 0.017), (17, 0.205), (26, 0.054), (28, 0.017), (33, 0.278), (65, 0.041), (67, 0.099), (69, 0.055), (87, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.86906356 313 cvpr-2013-Online Dominant and Anomalous Behavior Detection in Videos

Author: Mehrsan Javan Roshtkhari, Martin D. Levine

Abstract: We present a novel approach for video parsing and simultaneous online learning of dominant and anomalous behaviors in surveillance videos. Dominant behaviors are those occurring frequently in videos and hence, usually do not attract much attention. They can be characterized by different complexities in space and time, ranging from a scene background to human activities. In contrast, an anomalous behavior is defined as having a low likelihood of occurrence. We do not employ any models of the entities in the scene in order to detect these two kinds of behaviors. In this paper, video events are learnt at each pixel without supervision using densely constructed spatio-temporal video volumes. Furthermore, the volumes are organized into large contextual graphs. These compositions are employed to construct a hierarchical codebook model for the dominant behaviors. By decomposing spatio-temporal contextual information into unique spatial and temporal contexts, the proposed framework learns the models of the dominant spatial and temporal events. Thus, it is ultimately capable of simultaneously modeling high-level behaviors as well as low-level spatial, temporal and spatio-temporal pixel level changes.

2 0.83678287 277 cvpr-2013-MODEC: Multimodal Decomposable Models for Human Pose Estimation

Author: Ben Sapp, Ben Taskar

Abstract: We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. Our model outperforms state-of-theart approaches across the accuracy-speed trade-off curve for several pose datasets. This includes our newly-collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets. The new dataset and code are avail- able online. 1

3 0.83571106 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval

Author: Xiaohui Shen, Zhe Lin, Jonathan Brandt, Ying Wu

Abstract: Detecting faces in uncontrolled environments continues to be a challenge to traditional face detection methods[24] due to the large variation in facial appearances, as well as occlusion and clutter. In order to overcome these challenges, we present a novel and robust exemplarbased face detector that integrates image retrieval and discriminative learning. A large database of faces with bounding rectangles and facial landmark locations is collected, and simple discriminative classifiers are learned from each of them. A voting-based method is then proposed to let these classifiers cast votes on the test image through an efficient image retrieval technique. As a result, faces can be very efficiently detected by selecting the modes from the voting maps, without resorting to exhaustive sliding window-style scanning. Moreover, due to the exemplar-based framework, our approach can detect faces under challenging conditions without explicitly modeling their variations. Evaluation on two public benchmark datasets shows that our new face detection approach is accurate and efficient, and achieves the state-of-the-art performance. We further propose to use image retrieval for face validation (in order to remove false positives) and for face alignment/landmark localization. The same methodology can also be easily generalized to other facerelated tasks, such as attribute recognition, as well as general object detection.

4 0.83348691 131 cvpr-2013-Discriminative Non-blind Deblurring

Author: Uwe Schmidt, Carsten Rother, Sebastian Nowozin, Jeremy Jancsary, Stefan Roth

Abstract: Non-blind deblurring is an integral component of blind approaches for removing image blur due to camera shake. Even though learning-based deblurring methods exist, they have been limited to the generative case and are computationally expensive. To this date, manually-defined models are thus most widely used, though limiting the attained restoration quality. We address this gap by proposing a discriminative approach for non-blind deblurring. One key challenge is that the blur kernel in use at test time is not known in advance. To address this, we analyze existing approaches that use half-quadratic regularization. From this analysis, we derive a discriminative model cascade for image deblurring. Our cascade model consists of a Gaussian CRF at each stage, based on the recently introduced regression tree fields. We train our model by loss minimization and use synthetically generated blur kernels to generate training data. Our experiments show that the proposed approach is efficient and yields state-of-the-art restoration quality on images corrupted with synthetic and real blur.

5 0.83327174 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection

Author: Jianguo Li, Yimin Zhang

Abstract: This paper presents a novel learning framework for training boosting cascade based object detector from large scale dataset. The framework is derived from the wellknown Viola-Jones (VJ) framework but distinguished by three key differences. First, the proposed framework adopts multi-dimensional SURF features instead of single dimensional Haar features to describe local patches. In this way, the number of used local patches can be reduced from hundreds of thousands to several hundreds. Second, it adopts logistic regression as weak classifier for each local patch instead of decision trees in the VJ framework. Third, we adopt AUC as a single criterion for the convergence test during cascade training rather than the two trade-off criteria (false-positive-rate and hit-rate) in the VJ framework. The benefit is that the false-positive-rate can be adaptive among different cascade stages, and thus yields much faster convergence speed of SURF cascade. Combining these points together, the proposed approach has three good properties. First, the boosting cascade can be trained very efficiently. Experiments show that the proposed approach can train object detectors from billions of negative samples within one hour even on personal computers. Second, the built detector is comparable to the stateof-the-art algorithm not only on the accuracy but also on the processing speed. Third, the built detector is small in model-size due to short cascade stages.

6 0.83131427 248 cvpr-2013-Learning Collections of Part Models for Object Recognition

7 0.83128732 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence

8 0.83102214 94 cvpr-2013-Context-Aware Modeling and Recognition of Activities in Video

9 0.83074361 310 cvpr-2013-Object-Centric Anomaly Detection by Attribute-Based Reasoning

10 0.83057845 4 cvpr-2013-3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

11 0.82991326 379 cvpr-2013-Scalable Sparse Subspace Clustering

12 0.82991225 104 cvpr-2013-Deep Convolutional Network Cascade for Facial Point Detection

13 0.8298834 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation

14 0.82981402 446 cvpr-2013-Understanding Indoor Scenes Using 3D Geometric Phrases

15 0.8296892 80 cvpr-2013-Category Modeling from Just a Single Labeling: Use Depth Information to Guide the Learning of 2D Models

16 0.82968664 221 cvpr-2013-Incorporating Structural Alternatives and Sharing into Hierarchy for Multiclass Object Recognition and Detection

17 0.82966733 176 cvpr-2013-Five Shades of Grey for Fast and Reliable Camera Pose Estimation

18 0.82962632 416 cvpr-2013-Studying Relationships between Human Gaze, Description, and Computer Vision

19 0.82873142 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation

20 0.82870013 204 cvpr-2013-Histograms of Sparse Codes for Object Detection