iccv iccv2013 iccv2013-247 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yin Li, Alireza Fathi, James M. Rehg
Abstract: We present a model for gaze prediction in egocentric video by leveraging the implicit cues that exist in camera wearer’s behaviors. Specifically, we compute the camera wearer’s head motion and hand location from the video and combine them to estimate where the eyes look. We further model the dynamic behavior of the gaze, in particular fixations, as latent variables to improve the gaze prediction. Our gaze prediction results outperform the state-of-the-art algorithms by a large margin on publicly available egocentric vision datasets. In addition, we demonstrate that we get a significant performance boost in recognizing daily actions and segmenting foreground objects by plugging in our gaze predictions into state-of-the-art methods.
Reference: text
sentIndex sentText sentNum sentScore
1 We present a model for gaze prediction in egocentric video by leveraging the implicit cues that exist in the camera wearer's behaviors. [sent-2, score-1.311]
2 Our gaze prediction results outperform the state-of-the-art algorithms by a large margin on publicly available egocentric vision datasets. [sent-5, score-1.256]
3 In addition, we demonstrate that we get a significant performance boost in recognizing daily actions and segmenting foreground objects by plugging in our gaze predictions into state-of-the-art methods. [sent-6, score-0.921]
4 A key component in egocentric vision is the egocentric gaze [13]. [sent-9, score-1.188]
5 Because a person senses the visual world through a series of fixations, egocentric gaze measurements contain important cues regarding the most salient objects in the scene, and the intentions and goals of the camera-wearer. [sent-10, score-1.256]
6 Previous works have demonstrated the utility of gaze measurements in object discovery [17] and action recognition [7]. [sent-11, score-0.897]
7 This paper addresses the problem of egocentric gaze prediction, which is the task of predicting the user's point-of-gaze given an egocentric video. [sent-12, score-1.543]
8 Previous work on gaze prediction in computer vision has primarily focused on saliency detection [2]. [sent-13, score-0.964]
9 However, none of these approaches seem to be sufficient to predict egocentric gaze in the context of hand-eye coordination tasks. [sent-15, score-1.33]
10 Egocentric gaze in a natural environment is the combination of gaze direction (the line of sight in a head-centered coordinate system), head orientation, and body pose. [sent-19, score-1.811]
11 For example, large head movement is almost always accompanied by a large gaze shift [14]. [sent-21, score-0.999]
12 Also, the gaze point tends to fall on the object that is currently being manipulated by the first person [14]. [sent-22, score-0.884]
13 This evidence suggests that we can model the gaze of the first person by exploring the coordination of eye, hands, and head, using egocentric cues alone. [sent-23, score-1.477]
14 Our major contribution is leveraging the implicit cues provided by the first person, such as hand location and pose and head/hand motion, for predicting gaze in egocentric vision. [sent-25, score-1.285]
15 Moreover, we build a graphical model for gaze prediction that accounts for eye-hand and eye-head coordination and combines the temporal dynamics of gaze. [sent-27, score-0.956]
16 The model requires no information about the task or action, predicts the gaze position at each frame, and identifies moments of fixation. [sent-28, score-0.885]
17 Our gaze prediction results outperform all stateof-the-art bottom-up and top-down saliency detection algo- rithms by a large margin on two publicly available datasets. [sent-29, score-0.964]
18 The second part of our paper explores applications of gaze prediction in egocentric vision. [sent-30, score-1.256]
19 We design a graphical model for gaze prediction that accounts for eye-hand and eye-head coordination and combines the temporal dynamics of gaze. [sent-33, score-0.956]
20 Our model predicts the gaze position at each frame and identifies moments of fixation using only egocentric video. [sent-34, score-1.303]
21 We demonstrate two important applications of gaze prediction: object segmentation and action recognition. [sent-35, score-0.873]
22 Our gaze prediction, object segmentation, and action recognition results outperform several state-of-the-art methods. [sent-36, score-1.757]
23 We conclude that our gaze prediction model holds great promise for egocentric vision. [sent-40, score-1.256]
24 Instead, we address real object manipulation tasks using egocentric vision, assume no additional information other than the video and utilize only egocentric cues for gaze prediction. [sent-55, score-1.708]
25 In the computer vision community, Ba and Odobez [1] presented a model for the recognition of people’s visual focus of attention in meetings by approximating gaze direction with head orientation. [sent-66, score-0.981]
26 [25] combined bottom-up visual saliency with ego-motion information for egocentric gaze prediction in a walking or sitting setting. [sent-78, score-1.307]
27 They presented a joint method for egocentric gaze prediction and action recognition. [sent-81, score-1.295]
28 However, their model requires object masks and action annotations for gaze prediction, and the performance drops significantly if gaze data are unavailable or inaccurate. [sent-82, score-0.995]
29 Thus, head orientation provides a good approximation for gaze direction in egocentric videos. [sent-86, score-1.319]
30 (b) A scatter plot of head movement against gaze shift along the vertical and horizontal directions in the GTEA Gaze+ dataset. [sent-87, score-1.025]
31 Egocentric Cues for Gaze Prediction We focus on object manipulation tasks in a meal preparation setting, and explore the possibility of gaze prediction using egocentric cues, including hand/head movement and hand location/pose. [sent-90, score-1.457]
32 The coordination of eye, head and hand, as we show in this section, bridges the gap between these egocentric cues and gaze prediction. [sent-91, score-1.48]
33 Both datasets contain egocentric videos of meal preparation with gaze tracking results and action annotations. [sent-93, score-1.268]
34 We also consider MIT eye tracking dataset [12] for comparing gaze statistics. [sent-94, score-0.927]
35 Eye-Head Coordination Several psychophysical experiments have indicated that eye gaze and head pose are coupled in various tasks [19, 14, 15]. [sent-98, score-1.032]
36 For example, large head movement is almost always accompanied by a large gaze shift. [sent-99, score-0.989]
37 The gaze statistics suggest a sharp center bias and a strong correlation between head motion and gaze shifts. [sent-101, score-1.871]
38 These findings thus provide powerful cues for gaze prediction. [sent-102, score-0.888]
39 Center Bias: Our first observation is a sharp center bias of egocentric gaze points. [sent-111, score-1.212]
40 We fit a 2D Gaussian as the center prior to all gaze points in GTEA Gaze and GTEA Gaze+ dataset, respectively, as shown in Fig 2. [sent-112, score-0.874]
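As an illustration of how such a center prior could be fit (a minimal sketch under assumed normalized image coordinates; not the authors' code, and the function names are invented for illustration):

```python
import numpy as np

def fit_center_prior(gaze_points):
    """Fit a 2D Gaussian (mean and covariance) to gaze points given as an
    (N, 2) array of normalized image coordinates in [0, 1]."""
    pts = np.asarray(gaze_points, dtype=float)
    mu = pts.mean(axis=0)              # center of the prior
    cov = np.cov(pts, rowvar=False)    # spread of gaze around that center
    return mu, cov

def center_prior_map(mu, cov, height, width):
    """Evaluate the fitted Gaussian on an image grid to obtain a prior map."""
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs / width, ys / height], axis=-1).reshape(-1, 2)
    diff = grid - mu
    mahal = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(cov), diff)
    prior = np.exp(-0.5 * mahal).reshape(height, width)
    return prior / prior.max()
```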
41 This is due to the fact that egocentric vision captures a first-person’s perspective in 3D world, where the gaze often aligns with the head orientation. [sent-115, score-1.298]
42 In this case, the need for large gaze shifts is usually met by head movements combined with small gaze shifts. [sent-116, score-1.852]
43 Note that the preference of gaze towards the bottom part of the image is influenced by table-top object manipulation tasks. [sent-118, score-0.956]
44 Correlation between Gaze Shifts and Head Motion: We also observe a tight correlation between head motion and gaze shift in the horizontal direction. [sent-119, score-1.01]
45 A scatter plot of gaze shifts (from the center) against head motion for GTEA Gaze+ dataset is shown in Fig 2b. [sent-120, score-1.001]
46 The plot suggests a linear correlation in the horizontal direction, especially for large gaze shifts. [sent-121, score-0.881]
47 The correlation, therefore, allows us to predict gaze location from head motion. [sent-124, score-0.967]
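A minimal sketch of how this correlation could be exploited: regress the gaze shift (displacement from the image center) onto a per-frame head-motion estimate with ordinary least squares. The affine form and the use of a single 2D head-motion vector per frame are assumptions, not the paper's exact formulation.

```python
import numpy as np

def fit_gaze_from_head_motion(head_motion, gaze_shift):
    """head_motion, gaze_shift: (N, 2) arrays of per-frame (dx, dy) values."""
    X = np.hstack([head_motion, np.ones((len(head_motion), 1))])  # affine model
    W, *_ = np.linalg.lstsq(X, gaze_shift, rcond=None)            # (3, 2) weights
    return W

def predict_gaze_shift(W, head_motion):
    X = np.hstack([head_motion, np.ones((len(head_motion), 1))])
    return X @ W   # predicted gaze displacement from the image center
```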
48 Eye gaze generally guides the movement of the hands to target [15]. [sent-128, score-0.954]
49 Moreover, it has also been shown [21] that the proprioception of limbs may influence gaze shift, where the hands are used to guide eye movements. [sent-129, score-0.996]
50 We introduce the concept of manipulation point, align gaze points with respect to the first person’s hands and discover clusters in the aligned gaze density map, suggesting a strong eye-hand coordination. [sent-130, score-1.907]
51 This suggests that we can predict egocentric gaze by looking at the first person's hand information. [sent-131, score-1.257]
52 We align the gaze points into the hand’s coordinates by selecting the manipulation points as the origin, and projecting the gaze point into the new coordinate system every frame. [sent-139, score-1.835]
53 We then plot the density map by averaging the aligned gaze points across all frames within the dataset. [sent-140, score-0.882]
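A rough sketch of this alignment step, assuming per-frame gaze and manipulation points in pixel coordinates (the binning resolution and normalization are arbitrary choices here, not the paper's):

```python
import numpy as np

def aligned_gaze_density(gaze_points, manipulation_points, size=201):
    """Histogram of gaze positions expressed relative to the manipulation point.

    gaze_points, manipulation_points: (N, 2) pixel coordinates per frame.
    Returns a (size, size) density map centered on the manipulation point.
    """
    offsets = np.asarray(gaze_points) - np.asarray(manipulation_points)
    density = np.zeros((size, size))
    half = size // 2
    for dx, dy in offsets:
        x, y = int(round(dx)) + half, int(round(dy)) + half
        if 0 <= x < size and 0 <= y < size:
            density[y, x] += 1
    return density / max(density.sum(), 1)   # normalize to a distribution
```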
54 A manipulation point provides an anchor with respect to current hand pose, and allows us to align gaze points into the hand’s coordinates. [sent-159, score-1.018]
55 Gaze around Hands: We align the gaze points to the first-person’s hands by setting the manipulation points as the origin (See Fig 3). [sent-160, score-1.065]
56 The density maps of the aligned gaze points for four different hand configurations are plotted in Fig 3. [sent-161, score-0.912]
57 The data suggest interesting spatial relationship between manipulation points and gaze points. [sent-163, score-0.972]
58 For two intersecting hands, gaze shifts towards the bottom, partly due to opening/closing actions. [sent-166, score-0.874]
59 Gaze Prediction in Egocentric Video We have observed strong cues for gaze from the coordination of eye, hand, and head movement. [sent-170, score-1.17]
60 Therefore, we present a learning based framework to incorporate all these egocentric cues for gaze prediction. [sent-172, score-1.231]
61 The core of our method lies in a graphical model that combines egocentric cues at a single frame with a temporal model of gaze shifts. [sent-173, score-1.288]
62 Our gaze prediction consists of two parts: predicting the gaze position at each frame and identifying the fixations among all gaze points. [sent-174, score-1.83]
63 We extract features zt at each frame t, predict its gaze position gt and identify its moments of fixation mt. [sent-183, score-1.113]
64 The Model Denote the gaze point at frame t as gt = [gxt, gyt]T ∈ R2 and its binary label as mt ∈ {0, 1}, where mt = 1 denotes that gt is a fixation. [sent-191, score-1.188]
65 The model consists of 1) P(gt | zt), a single-frame gaze prediction model given the features zt; and 2) P(mt | gN(t)), a temporal model that couples the fixation label mt and the gaze predictions gN(t). [sent-209, score-2.002]
66 Single Frame Gaze Prediction: We use a random regression forest for gaze prediction in a single frame. [sent-211, score-0.924]
67 We train two separate models for gaze prediction, one with both hand and head cues and one with only head cues. [sent-217, score-1.15]
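A minimal sketch of this two-model setup using a random forest regressor (scikit-learn's RandomForestRegressor is a stand-in for the paper's random regression forest; the feature layout, hyper-parameters, and function names are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_gaze_forests(head_feats, hand_feats, gaze, hand_visible):
    """head_feats, hand_feats: per-frame cue features; gaze: (N, 2) ground
    truth; hand_visible: (N,) boolean mask of frames with detected hands."""
    X_full = np.hstack([head_feats, hand_feats])
    full_model = RandomForestRegressor(n_estimators=100).fit(
        X_full[hand_visible], gaze[hand_visible])      # hand + head cues
    head_model = RandomForestRegressor(n_estimators=100).fit(head_feats, gaze)
    return full_model, head_model

def predict_gaze(full_model, head_model, head_feat, hand_feat, hand_visible):
    """Fall back to the head-only model when no hand is detected."""
    if hand_visible and hand_feat is not None:
        x = np.hstack([head_feat, hand_feat])[None, :]
        return full_model.predict(x)[0]
    return head_model.predict(head_feat[None, :])[0]
```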
68 On one hand, fixation mt can be detected given all gaze points. [sent-235, score-0.986]
69 On the other hand, there is a strong constraint over gaze locations if we know current gaze point is a fixation. [sent-236, score-1.7]
70 In Eq. (3), mi can be obtained by a fixation detection algorithm given the gaze points gN(t). [sent-241, score-0.934]
71 Here we use velocity-threshold based fixation detection [18]: a fixation is detected if the velocity of the gaze points stays below a threshold c for a minimum amount of time (two frames in our case). [sent-242, score-1.009]
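A sketch of this velocity-threshold (I-VT style) detector; the exact thresholding and run-length handling here are assumptions:

```python
import numpy as np

def detect_fixations(gaze, velocity_threshold, min_frames=2):
    """Label each frame as fixation (True) or not, given (T, 2) gaze positions."""
    gaze = np.asarray(gaze, dtype=float)
    speed = np.linalg.norm(np.diff(gaze, axis=0), axis=1)        # per-frame speed
    slow = np.concatenate([[True], speed < velocity_threshold])  # frame 0 inherits
    m = np.zeros(len(gaze), dtype=bool)
    start = None
    for t, s in enumerate(slow):
        if s and start is None:
            start = t                                            # run begins
        if start is not None and (not s or t == len(slow) - 1):
            end = t + 1 if s else t                              # run ends
            if end - start >= min_frames:
                m[start:end] = True                              # long enough
            start = None
    return m
```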
72 Inference and Learning Inference: To get the gaze points {gt}t=1..K and fixations {mt}t=1..K, we apply Maximum Likelihood (ML) estimation to Eq. (1). [sent-249, score-0.902]
73 Intuitively, the optimization follows an EM-like update by (1) identifying fixations mt by velocity thresholds given all gaze predictions gt and (2) smoothing the gaze points gt given the fixation labels mt. [sent-272, score-2.082]
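A minimal sketch of that alternation (the within-fixation smoothing by segment means and the iteration count are assumptions, not the paper's exact ML update):

```python
import numpy as np

def fixation_labels(gaze, velocity_threshold):
    """Simple per-frame fixation labels from a velocity threshold."""
    speed = np.linalg.norm(np.diff(gaze, axis=0), axis=1)
    return np.concatenate([[True], speed < velocity_threshold])

def alternate_inference(gaze_init, velocity_threshold, n_iters=5):
    """(1) Re-label fixations from the current gaze track;
    (2) smooth gaze inside each fixation segment toward its mean."""
    gaze = np.asarray(gaze_init, dtype=float).copy()
    m = fixation_labels(gaze, velocity_threshold)
    for _ in range(n_iters):
        m = fixation_labels(gaze, velocity_threshold)
        t = 0
        while t < len(gaze):
            if m[t]:
                end = t
                while end < len(gaze) and m[end]:
                    end += 1
                gaze[t:end] = gaze[t:end].mean(axis=0)   # pull to segment mean
                t = end
            else:
                t += 1
    return gaze, m
```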
74 Gaze Prediction We use two standard, complementary measures to assess the performance of our gaze prediction method: Area Under the ROC Curve (AUC) and Average Angular Error (AAE). [sent-300, score-0.913]
75 AUC measures the consistency between a predicted saliency map and the ground truth gaze points in an image, and is widely used in the saliency detection literature. [sent-301, score-0.976]
76 Since our method outputs a single predicted gaze point, we generate a saliency map that can be used for AUC scoring by convolving an isotropic Gaussian over the predicted gaze. [sent-305, score-0.926]
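For concreteness, a sketch of both evaluation steps (the Gaussian width and the pinhole-model conversion from pixels to degrees are assumptions; the paper's exact AAE protocol may differ):

```python
import numpy as np

def average_angular_error(pred, gt, focal_length_px):
    """Approximate AAE in degrees from pixel errors via a pinhole camera model."""
    d = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=1)
    return np.degrees(np.arctan(d / focal_length_px)).mean()

def saliency_from_gaze(gaze_xy, height, width, sigma=30.0):
    """Turn a single predicted gaze point into a saliency map for AUC scoring
    by placing an isotropic Gaussian at the predicted location."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - gaze_xy[0]) ** 2 + (ys - gaze_xy[1]) ** 2
    sal = np.exp(-d2 / (2.0 * sigma ** 2))
    return sal / sal.max()
```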
77 Results Both the GTEA Gaze and GTEA Gaze+ datasets contain gaze data from eye-tracking glasses, which are used as ground truth for gaze prediction. [sent-308, score-1.772]
78 Our method requires no information about action or task, and largely outperforms the bottom-up and top-down gaze prediction methods. [sent-321, score-0.952]
79 Our method benefits from using the strong egocentric cues (head, hand and eye coordination) for gaze prediction and bypasses the challenging object segmentation step required by [7]. [sent-346, score-1.435]
80 These results suggest that egocentric cues can provide a reliable gaze estimate without relying on low-level image information. [sent-348, score-1.246]
81 However, for hand-eye coordination tasks the gaze is naturally coordinated with the head, making head orientation a more effective approximation. [sent-364, score-1.106]
82 Object Segmentation We further demonstrate that gaze prediction can be used to segment task-relevant foreground objects. [sent-370, score-0.945]
83 We plug our gaze prediction into two different algorithms. [sent-379, score-0.913]
84 6% over top 100 segments using gaze, with only a small performance gap between human gaze and predicted gaze. [sent-385, score-0.879]
85 (b) Confusion matrix of action recognition using predicted gaze on GTEA Gaze dataset (25 classes). [sent-386, score-0.899]
86 Results We used both our gaze prediction results and the ground-truth gaze points to seed two different methods for extracting foreground object regions: ActSeg [17] and CPMC [5]. [sent-391, score-1.803]
87 Given the ground truth segmentation, we score the effectiveness of object segmentation under both predicted and measured gaze, thereby obtaining an alternate characterization of the effectiveness of our gaze prediction method. [sent-392, score-0.956]
88 ActSeg [17] takes gaze points as the input and outputs one object segment per gaze point. [sent-393, score-1.727]
89 It assumes that the gaze point always lies within the object boundary and segments the object by finding the most salient boundary. [sent-394, score-0.91]
90 The performance using our gaze prediction method is comparable to that using ground truth gaze. [sent-415, score-0.913]
91 Action Recognition Egocentric gaze is not only useful for foreground object segmentation, but also helps to recognize first-person’s action. [sent-418, score-0.88]
92 We report action recognition results on the GTEA Gaze dataset by plugging our predicted gaze into the implementation of [7], as shown in Fig 7(b). [sent-419, score-0.917]
93 Their method extracts motion and appearance features from a small region around the gaze point and trains an SVM classifier combined with an HMM for action recognition. [sent-420, score-0.915]
94 Using our gaze prediction, we improve the action recognition result to 32. [sent-422, score-0.884]
95 We notice the large gap between our gaze prediction and real human gaze. [sent-427, score-0.922]
96 However, we conclude the gap is only partly due to gaze prediction, since the performance of the method of [7] is sensitive to the input gaze points. [sent-428, score-1.767]
97 Conclusion We described a novel approach to gaze prediction in egocentric video. [sent-430, score-1.256]
98 Our method is motivated by the fact that in an egocentric setting, the behaviors of the first-person provide strong cues for predicting the gaze direction. [sent-431, score-1.243]
99 Our gaze prediction results outperform the state-of-the-art algorithms by a large margin on GTEA Gaze and GTEA Gaze+ datasets. [sent-433, score-0.913]
100 Finally, we demonstrate that we get a significant performance boost in recognizing daily actions and segmenting foreground objects by plugging in our gaze predictions into state-of-the-art methods. [sent-434, score-0.921]
wordName wordTfidf (topN-words)
[('gaze', 0.845), ('egocentric', 0.343), ('gtea', 0.203), ('coordination', 0.13), ('head', 0.11), ('manipulation', 0.098), ('gt', 0.092), ('hands', 0.085), ('fixation', 0.075), ('aae', 0.069), ('prediction', 0.068), ('eye', 0.066), ('mt', 0.066), ('zt', 0.058), ('auc', 0.056), ('saliency', 0.051), ('cues', 0.043), ('fixations', 0.043), ('hand', 0.042), ('movements', 0.039), ('action', 0.039), ('actseg', 0.038), ('cpmc', 0.038), ('fathi', 0.032), ('fig', 0.032), ('gn', 0.03), ('tk', 0.029), ('movement', 0.024), ('coordinations', 0.023), ('foreground', 0.022), ('motion', 0.021), ('temporal', 0.018), ('degree', 0.018), ('plugging', 0.018), ('roc', 0.017), ('frame', 0.017), ('daily', 0.016), ('person', 0.016), ('masks', 0.016), ('intersecting', 0.016), ('tracking', 0.016), ('center', 0.015), ('attention', 0.015), ('hayhoe', 0.015), ('pelz', 0.015), ('eq', 0.015), ('suggest', 0.015), ('segmentation', 0.015), ('predicted', 0.015), ('borji', 0.014), ('moments', 0.014), ('points', 0.014), ('spriggs', 0.014), ('gazes', 0.014), ('wearer', 0.014), ('shifts', 0.013), ('object', 0.013), ('dynamics', 0.013), ('horizontal', 0.013), ('subjects', 0.013), ('activity', 0.013), ('yamada', 0.013), ('meal', 0.013), ('graphical', 0.012), ('predicting', 0.012), ('video', 0.012), ('saccade', 0.012), ('plot', 0.012), ('hou', 0.012), ('koch', 0.012), ('predict', 0.012), ('tpami', 0.012), ('videos', 0.012), ('density', 0.011), ('tasks', 0.011), ('direction', 0.011), ('forest', 0.011), ('pages', 0.011), ('brain', 0.011), ('topdown', 0.011), ('correlation', 0.011), ('point', 0.01), ('predictions', 0.01), ('segments', 0.01), ('accompanied', 0.01), ('harel', 0.01), ('shift', 0.01), ('segment', 0.01), ('wearable', 0.01), ('orientation', 0.01), ('lies', 0.01), ('actions', 0.01), ('salient', 0.009), ('bias', 0.009), ('predicts', 0.009), ('align', 0.009), ('mit', 0.009), ('gap', 0.009), ('sign', 0.009), ('gaussian', 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 247 iccv-2013-Learning to Predict Gaze in Egocentric Video
Author: Yin Li, Alireza Fathi, James M. Rehg
Abstract: We present a model for gaze prediction in egocentric video by leveraging the implicit cues that exist in camera wearer’s behaviors. Specifically, we compute the camera wearer’s head motion and hand location from the video and combine them to estimate where the eyes look. We further model the dynamic behavior of the gaze, in particular fixations, as latent variables to improve the gaze prediction. Our gaze prediction results outperform the state-of-the-art algorithms by a large margin on publicly available egocentric vision datasets. In addition, we demonstrate that we get a significant performance boost in recognizing daily actions and segmenting foreground objects by plugging in our gaze predictions into state-of-the-art methods.
2 0.81920391 67 iccv-2013-Calibration-Free Gaze Estimation Using Human Gaze Patterns
Author: Fares Alnajar, Theo Gevers, Roberto Valenti, Sennay Ghebreab
Abstract: We present a novel method to auto-calibrate gaze estimators based on gaze patterns obtained from other viewers. Our method is based on the observation that the gaze patterns of humans are indicative of where a new viewer will look at [12]. When a new viewer is looking at a stimulus, we first estimate a topology of gaze points (initial gaze points). Next, these points are transformed so that they match the gaze patterns of other humans to find the correct gaze points. In a flexible uncalibrated setup with a web camera and no chin rest, the proposed method was tested on ten subjects and ten images. The method estimates the gaze points after looking at a stimulus for a few seconds with an average accuracy of 4.3◦. Although the reported performance is lower than what could be achieved with dedicated hardware or calibrated setup, the proposed method still provides a sufficient accuracy to trace the viewer attention. This is promising considering the fact that auto-calibration is done in a flexible setup , without the use of a chin rest, and based only on a few seconds of gaze initialization data. To the best of our knowledge, this is the first work to use human gaze patterns in order to auto-calibrate gaze estimators.
3 0.52444357 381 iccv-2013-Semantically-Based Human Scanpath Estimation with HMMs
Author: Huiying Liu, Dong Xu, Qingming Huang, Wen Li, Min Xu, Stephen Lin
Abstract: We present a method for estimating human scanpaths, which are sequences of gaze shifts that follow visual attention over an image. In this work, scanpaths are modeled based on three principal factors that influence human attention, namely low-levelfeature saliency, spatialposition, and semantic content. Low-level feature saliency is formulated as transition probabilities between different image regions based on feature differences. The effect of spatial position on gaze shifts is modeled as a Levy flight with the shifts following a 2D Cauchy distribution. To account for semantic content, we propose to use a Hidden Markov Model (HMM) with a Bag-of-Visual-Words descriptor of image regions. An HMM is well-suited for this purpose in that 1) the hidden states, obtained by unsupervised learning, can represent latent semantic concepts, 2) the prior distribution of the hidden states describes visual attraction to the semantic concepts, and 3) the transition probabilities represent human gaze shift patterns. The proposed method is applied to task-driven viewing processes. Experiments and analysis performed on human eye gaze data verify the effectiveness of this method.
4 0.37180009 325 iccv-2013-Predicting Primary Gaze Behavior Using Social Saliency Fields
Author: Hyun Soo Park, Eakta Jain, Yaser Sheikh
Abstract: We present a method to predict primary gaze behavior in a social scene. Inspired by the study of electric fields, we posit “social charges ”—latent quantities that drive the primary gaze behavior of members of a social group. These charges induce a gradient field that defines the relationship between the social charges and the primary gaze direction of members in the scene. This field model is used to predict primary gaze behavior at any location or time in the scene. We present an algorithm to estimate the time-varying behavior of these charges from the primary gaze behavior of measured observers in the scene. We validate the model by evaluating its predictive precision via cross-validation in a variety of social scenes.
5 0.14719209 50 iccv-2013-Analysis of Scores, Datasets, and Models in Visual Saliency Prediction
Author: Ali Borji, Hamed R. Tavakoli, Dicky N. Sihite, Laurent Itti
Abstract: Significant recent progress has been made in developing high-quality saliency models. However, less effort has been undertaken on fair assessment of these models, over large standardized datasets and correctly addressing confounding factors. In this study, we pursue a critical and quantitative look at challenges (e.g., center-bias, map smoothing) in saliency modeling and the way they affect model accuracy. We quantitatively compare 32 state-of-the-art models (using the shuffled AUC score to discount center-bias) on 4 benchmark eye movement datasets, for prediction of human fixation locations and scanpath sequence. We also account for the role of map smoothing. We find that, although model rankings vary, some (e.g., AWS, LG, AIM, and HouNIPS) consistently outperform other models over all datasets. Some models work well for prediction of both fixation locations and scanpath sequence (e.g., Judd, GBVS). Our results show low prediction accuracy for models over emotional stimuli from the NUSEF dataset. Our last benchmark, for the first time, gauges the ability of models to decode the stimulus category from statistics of fixations, saccades, and model saliency values at fixated locations. In this test, ITTI and AIM models win over other models. Our benchmark provides a comprehensive high-level picture of the strengths and weaknesses of many popular models, and suggests future research directions in saliency modeling.
6 0.1464117 373 iccv-2013-Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics
7 0.1060475 267 iccv-2013-Model Recommendation with Virtual Probes for Egocentric Hand Detection
8 0.10131306 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
9 0.093624853 180 iccv-2013-From Where and How to What We See
11 0.082794331 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
12 0.06319458 369 iccv-2013-Saliency Detection: A Boolean Map Approach
13 0.057334848 71 iccv-2013-Category-Independent Object-Level Saliency Detection
14 0.054164015 372 iccv-2013-Saliency Detection via Dense and Sparse Reconstruction
15 0.049409244 396 iccv-2013-Space-Time Robust Representation for Action Recognition
16 0.045427948 91 iccv-2013-Contextual Hypergraph Modeling for Salient Object Detection
17 0.044281211 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
18 0.043882515 370 iccv-2013-Saliency Detection in Large Point Sets
19 0.043199524 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments
20 0.042004965 403 iccv-2013-Strong Appearance and Expressive Spatial Models for Human Pose Estimation
topicId topicWeight
[(0, 0.1), (1, -0.0), (2, 0.224), (3, -0.066), (4, -0.042), (5, -0.034), (6, 0.079), (7, -0.027), (8, 0.004), (9, 0.109), (10, 0.019), (11, -0.118), (12, -0.114), (13, 0.096), (14, -0.03), (15, 0.281), (16, -0.537), (17, 0.106), (18, 0.302), (19, -0.365), (20, -0.052), (21, 0.035), (22, -0.132), (23, 0.04), (24, -0.073), (25, 0.021), (26, -0.089), (27, 0.023), (28, -0.042), (29, -0.003), (30, 0.007), (31, 0.028), (32, -0.024), (33, 0.027), (34, -0.031), (35, 0.02), (36, 0.035), (37, 0.006), (38, 0.009), (39, 0.032), (40, -0.006), (41, -0.009), (42, -0.01), (43, 0.024), (44, 0.005), (45, 0.037), (46, 0.027), (47, -0.02), (48, 0.019), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.97534645 247 iccv-2013-Learning to Predict Gaze in Egocentric Video
Author: Yin Li, Alireza Fathi, James M. Rehg
Abstract: We present a model for gaze prediction in egocentric video by leveraging the implicit cues that exist in camera wearer’s behaviors. Specifically, we compute the camera wearer’s head motion and hand location from the video and combine them to estimate where the eyes look. We further model the dynamic behavior of the gaze, in particular fixations, as latent variables to improve the gaze prediction. Our gaze prediction results outperform the state-of-the-art algorithms by a large margin on publicly available egocentric vision datasets. In addition, we demonstrate that we get a significant performance boost in recognizing daily actions and segmenting foreground objects by plugging in our gaze predictions into state-of-the-art methods.
2 0.97527671 67 iccv-2013-Calibration-Free Gaze Estimation Using Human Gaze Patterns
Author: Fares Alnajar, Theo Gevers, Roberto Valenti, Sennay Ghebreab
Abstract: We present a novel method to auto-calibrate gaze estimators based on gaze patterns obtained from other viewers. Our method is based on the observation that the gaze patterns of humans are indicative of where a new viewer will look at [12]. When a new viewer is looking at a stimulus, we first estimate a topology of gaze points (initial gaze points). Next, these points are transformed so that they match the gaze patterns of other humans to find the correct gaze points. In a flexible uncalibrated setup with a web camera and no chin rest, the proposed method was tested on ten subjects and ten images. The method estimates the gaze points after looking at a stimulus for a few seconds with an average accuracy of 4.3◦. Although the reported performance is lower than what could be achieved with dedicated hardware or calibrated setup, the proposed method still provides a sufficient accuracy to trace the viewer attention. This is promising considering the fact that auto-calibration is done in a flexible setup , without the use of a chin rest, and based only on a few seconds of gaze initialization data. To the best of our knowledge, this is the first work to use human gaze patterns in order to auto-calibrate gaze estimators.
3 0.90225899 325 iccv-2013-Predicting Primary Gaze Behavior Using Social Saliency Fields
Author: Hyun Soo Park, Eakta Jain, Yaser Sheikh
Abstract: We present a method to predict primary gaze behavior in a social scene. Inspired by the study of electric fields, we posit “social charges ”—latent quantities that drive the primary gaze behavior of members of a social group. These charges induce a gradient field that defines the relationship between the social charges and the primary gaze direction of members in the scene. This field model is used to predict primary gaze behavior at any location or time in the scene. We present an algorithm to estimate the time-varying behavior of these charges from the primary gaze behavior of measured observers in the scene. We validate the model by evaluating its predictive precision via cross-validation in a variety of social scenes.
4 0.79213899 381 iccv-2013-Semantically-Based Human Scanpath Estimation with HMMs
Author: Huiying Liu, Dong Xu, Qingming Huang, Wen Li, Min Xu, Stephen Lin
Abstract: We present a method for estimating human scanpaths, which are sequences of gaze shifts that follow visual attention over an image. In this work, scanpaths are modeled based on three principal factors that influence human attention, namely low-levelfeature saliency, spatialposition, and semantic content. Low-level feature saliency is formulated as transition probabilities between different image regions based on feature differences. The effect of spatial position on gaze shifts is modeled as a Levy flight with the shifts following a 2D Cauchy distribution. To account for semantic content, we propose to use a Hidden Markov Model (HMM) with a Bag-of-Visual-Words descriptor of image regions. An HMM is well-suited for this purpose in that 1) the hidden states, obtained by unsupervised learning, can represent latent semantic concepts, 2) the prior distribution of the hidden states describes visual attraction to the semantic concepts, and 3) the transition probabilities represent human gaze shift patterns. The proposed method is applied to task-driven viewing processes. Experiments and analysis performed on human eye gaze data verify the effectiveness of this method.
5 0.27739772 50 iccv-2013-Analysis of Scores, Datasets, and Models in Visual Saliency Prediction
Author: Ali Borji, Hamed R. Tavakoli, Dicky N. Sihite, Laurent Itti
Abstract: Significant recent progress has been made in developing high-quality saliency models. However, less effort has been undertaken on fair assessment of these models, over large standardized datasets and correctly addressing confounding factors. In this study, we pursue a critical and quantitative look at challenges (e.g., center-bias, map smoothing) in saliency modeling and the way they affect model accuracy. We quantitatively compare 32 state-of-the-art models (using the shuffled AUC score to discount center-bias) on 4 benchmark eye movement datasets, for prediction of human fixation locations and scanpath sequence. We also account for the role of map smoothing. We find that, although model rankings vary, some (e.g., AWS, LG, AIM, and HouNIPS) consistently outperform other models over all datasets. Some models work well for prediction of both fixation locations and scanpath sequence (e.g., Judd, GBVS). Our results show low prediction accuracy for models over emotional stimuli from the NUSEF dataset. Our last benchmark, for the first time, gauges the ability of models to decode the stimulus category from statistics of fixations, saccades, and model saliency values at fixated locations. In this test, ITTI and AIM models win over other models. Our benchmark provides a comprehensive high-level picture of the strengths and weaknesses of many popular models, and suggests future research directions in saliency modeling.
6 0.25259629 373 iccv-2013-Saliency and Human Fixations: State-of-the-Art and Study of Comparison Metrics
7 0.20799957 180 iccv-2013-From Where and How to What We See
8 0.20570977 316 iccv-2013-Pictorial Human Spaces: How Well Do Humans Perceive a 3D Articulated Pose?
9 0.1721074 267 iccv-2013-Model Recommendation with Virtual Probes for Egocentric Hand Detection
10 0.16523962 369 iccv-2013-Saliency Detection: A Boolean Map Approach
11 0.163863 416 iccv-2013-The Interestingness of Images
12 0.14477627 260 iccv-2013-Manipulation Pattern Discovery: A Nonparametric Bayesian Approach
13 0.12936373 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences
14 0.12798676 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields
15 0.12346089 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition
16 0.11864094 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
17 0.11801612 145 iccv-2013-Estimating the Material Properties of Fabric from Video
18 0.11719494 246 iccv-2013-Learning the Visual Interpretation of Sentences
19 0.11353438 301 iccv-2013-Optimal Orthogonal Basis and Image Assimilation: Motion Modeling
20 0.11175763 178 iccv-2013-From Semi-supervised to Transfer Counting of Crowds
topicId topicWeight
[(2, 0.057), (7, 0.052), (12, 0.04), (26, 0.058), (31, 0.032), (34, 0.011), (42, 0.066), (64, 0.074), (73, 0.022), (75, 0.263), (89, 0.121), (97, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.74670744 247 iccv-2013-Learning to Predict Gaze in Egocentric Video
Author: Yin Li, Alireza Fathi, James M. Rehg
Abstract: We present a model for gaze prediction in egocentric video by leveraging the implicit cues that exist in camera wearer’s behaviors. Specifically, we compute the camera wearer’s head motion and hand location from the video and combine them to estimate where the eyes look. We further model the dynamic behavior of the gaze, in particular fixations, as latent variables to improve the gaze prediction. Our gaze prediction results outperform the state-of-the-art algorithms by a large margin on publicly available egocentric vision datasets. In addition, we demonstrate that we get a significant performance boost in recognizing daily actions and segmenting foreground objects by plugging in our gaze predictions into state-of-the-art methods.
2 0.62276703 277 iccv-2013-Multi-channel Correlation Filters
Author: Hamed Kiani Galoogahi, Terence Sim, Simon Lucey
Abstract: Modern descriptors like HOG and SIFT are now commonly used in vision for pattern detection within image and video. From a signal processing perspective, this detection process can be efficiently posed as a correlation/convolution between a multi-channel image and a multi-channel detector/filter which results in a singlechannel response map indicating where the pattern (e.g. object) has occurred. In this paper, we propose a novel framework for learning a multi-channel detector/filter efficiently in the frequency domain, both in terms of training time and memory footprint, which we refer to as a multichannel correlation filter. To demonstrate the effectiveness of our strategy, we evaluate it across a number of visual detection/localization tasks where we: (i) exhibit superiorperformance to current state of the art correlation filters, and (ii) superior computational and memory efficiencies compared to state of the art spatial detectors.
3 0.62077445 136 iccv-2013-Efficient Pedestrian Detection by Directly Optimizing the Partial Area under the ROC Curve
Author: Sakrapee Paisitkriangkrai, Chunhua Shen, Anton Van Den Hengel
Abstract: Many typical applications of object detection operate within a prescribed false-positive range. In this situation the performance of a detector should be assessed on the basis of the area under the ROC curve over that range, rather than over the full curve, as the performance outside the range is irrelevant. This measure is labelled as the partial area under the ROC curve (pAUC). Effective cascade-based classification, for example, depends on training node classifiers that achieve the maximal detection rate at a moderate false positive rate, e.g., around 40% to 50%. We propose a novel ensemble learning method which achieves a maximal detection rate at a user-defined range of false positive rates by directly optimizing the partial AUC using structured learning. By optimizing for different ranges of false positive rates, the proposed method can be used to train either a single strong classifier or a node classifier forming part of a cascade classifier. Experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our approach, and we show that it is possible to train state-of-the-art pedestrian detectors using the pro- posed structured ensemble learning method.
4 0.60230207 200 iccv-2013-Higher Order Matching for Consistent Multiple Target Tracking
Author: Chetan Arora, Amir Globerson
Abstract: This paper addresses the data assignment problem in multi frame multi object tracking in video sequences. Traditional methods employing maximum weight bipartite matching offer limited temporal modeling. It has recently been shown [6, 8, 24] that incorporating higher order temporal constraints improves the assignment solution. Finding maximum weight matching with higher order constraints is however NP-hard and the solutions proposed until now have either been greedy [8] or rely on greedy rounding of the solution obtained from spectral techniques [15]. We propose a novel algorithm to find the approximate solution to data assignment problem with higher order temporal constraints using the method of dual decomposition and the MPLP message passing algorithm [21]. We compare the proposed algorithm with an implementation of [8] and [15] and show that proposed technique provides better solution with a bound on approximation factor for each inferred solution.
5 0.56021023 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation
Author: Zhibin Hong, Xue Mei, Danil Prokhorov, Dacheng Tao
Abstract: Combining multiple observation views has proven beneficial for tracking. In this paper, we cast tracking as a novel multi-task multi-view sparse learning problem and exploit the cues from multiple views including various types of visual features, such as intensity, color, and edge, where each feature observation can be sparsely represented by a linear combination of atoms from an adaptive feature dictionary. The proposed method is integrated in a particle filter framework where every view in each particle is regarded as an individual task. We jointly consider the underlying relationship between tasks across different views and different particles, and tackle it in a unified robust multi-task formulation. In addition, to capture the frequently emerging outlier tasks, we decompose the representation matrix to two collaborative components which enable a more robust and accurate approximation. We show that theproposedformulation can be efficiently solved using the Accelerated Proximal Gradient method with a small number of closed-form updates. The presented tracker is implemented using four types of features and is tested on numerous benchmark video sequences. Both the qualitative and quantitative results demonstrate the superior performance of the proposed approach compared to several stateof-the-art trackers.
6 0.55836856 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
7 0.55644417 338 iccv-2013-Randomized Ensemble Tracking
8 0.55193186 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
9 0.54958653 86 iccv-2013-Concurrent Action Detection with Structural Prediction
11 0.54874933 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences
12 0.54824615 180 iccv-2013-From Where and How to What We See
13 0.54764724 299 iccv-2013-Online Video SEEDS for Temporal Window Objectness
14 0.54515958 415 iccv-2013-Text Localization in Natural Images Using Stroke Feature Transform and Text Covariance Descriptors
15 0.54459703 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
16 0.54443026 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments
17 0.54400373 178 iccv-2013-From Semi-supervised to Transfer Counting of Crowds
18 0.5435406 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
19 0.54238737 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition
20 0.54224676 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests