iccv iccv2013 iccv2013-213 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Devi Parikh, Kristen Grauman
Abstract: User feedback helps an image search system refine its relevance predictions, tailoring the search towards the user’s preferences. Existing methods simply take feedback at face value: clicking on an image means the user wants things like it; commenting that an image lacks a specific attribute means the user wants things that have it. However, we expect there is actually more information behind the user’s literal feedback. In particular, a user’s (possibly subconscious) search strategy leads him to comment on certain images rather than others, based on how any of the visible candidate images compare to the desired content. For example, he may be more likely to give negative feedback on an irrelevant image that is relatively close to his target, as opposed to bothering with one that is altogether different. We introduce novel features to capitalize on such implied feedback cues, and learn a ranking function that uses them to improve the system’s relevance estimates. We validate the approach with real users searching for shoes, faces, or scenes using two different modes of feedback: binary relevance feedback and relative attributes-based feedback. The results show that retrieval improves significantly when the system accounts for the learned behaviors. We show that the nuances learned are domain-invariant, and useful for both generic user-independent search as well as personalized user-specific search.
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract User feedback helps an image search system refine its relevance predictions, tailoring the search towards the user’s preferences. [sent-2, score-1.036]
2 Existing methods simply take feedback at face value: clicking on an image means the user wants things like it; commenting that an image lacks a specific attribute means the user wants things that have it. [sent-3, score-1.666]
3 For example, he may be more likely to give negative feedback on an irrelevant image that is relatively close to his target, as opposed to bothering with one that is altogether different. [sent-6, score-0.681]
4 We introduce novel features to capitalize on such implied feedback cues, and learn a ranking function that uses them to improve the system’s relevance estimates. [sent-7, score-0.933]
5 We validate the approach with real users searching for shoes, faces, or scenes using two different modes of feedback: binary relevance feedback and relative attributes-based feedback. [sent-8, score-1.126]
6 Hence, feedback plays a critical role in allowing a user to better communicate his needs. [sent-18, score-1.09]
7 edu Figure 1: Choices made by users while providing feedback to an image search system reveal information about the desired target image beyond what is explicitly stated. [sent-22, score-1.07]
8 binary relevance feedback [4, 6, 13, 19, 26] the user clicks on a few of the images returned by the system, and conveys whether each one is relevant or not. [sent-24, score-1.351]
9 The feedback can also be relative [10], where the user clicks on a reference image and specifies how what he is looking for is different from it. [sent-26, score-1.38]
10 For instance, the user might say “I am looking for a downtown scene that is less congested than this. [sent-27, score-0.503]
11 Any such feedback be it based on binary relevance, relative attributes, or some – other form – allows users to explicitly inject subjectivity into the search results, and thereby improve performance. [sent-29, score-0.967]
12 However, there is more revealed by a user’s feedback beyond what is explicitly stated. [sent-30, score-0.681]
13 We hypothesize that when a user provides feedback, some beyond-the-obvious thought be it conscious or subconscious goes into the specific choices made. [sent-31, score-0.513]
14 ” It is unlikely that you will instead provide feedback on the picture of the elephant, even though “I don’t want something like this” is true for that image as well. [sent-35, score-0.761]
15 Hence, what the user chooses to not comment on contains nuanced but valuable information that is not tapped into by existing work. [sent-36, score-0.505]
16 We hypothesize that such nuances of user behavior also carry over to humanmachine interactions, yet (unlike human listeners) machines do not exploit them in existing systems. [sent-44, score-0.635]
17 Instead, these strategies seem to be evoked naturally, perhaps from an implicit assumption by the user that the search engine incorporates feedback in some reasonable fashion. [sent-46, score-1.322]
18 We propose to learn these subtle tendencies in user behavior, with the goal of improving image search. [sent-47, score-0.499]
19 Whereas prior work concentrates on building richer interfaces to elicit more detailed (and thus possibly cumbersome) feedback from the user (e. [sent-48, score-1.117]
20 First, we collect training data: human subjects are given a target image to search for, and we record the feedback choices they make. [sent-52, score-1.003]
21 We consider two possible modes of feedback: binary relevance feedback and relative attribute-based feedback. [sent-53, score-0.992]
22 Then, we extract features that characterize the observed feedback interactions, in terms of which among the candidate images the user chooses to comment on, and how. [sent-54, score-1.233]
23 Thus, rather than hand code any search rules to exploit scenarios like the ones suggested above, our method will learn the nuances in user behavior that are useful for search. [sent-58, score-0.748]
24 In this regard, the feedback features and ranking formulation we propose are important novel aspects of the work. [sent-59, score-0.728]
25 We conduct experiments on three domains scenes, faces, and shoes and we show that modeling the nuances of user behavior significantly improves image search with both binary relevance and relative attribute-based feedback. [sent-60, score-1.151]
26 Moreover, we show that the model of user behavior learnt is not dataset-specific and can be successfully used across domains. [sent-61, score-0.51]
27 Various other modes of feedback have been explored to improve interactive search, the most common being binary relevance feedback [4, 6, 13, 19, 22, 26]. [sent-68, score-1.64]
28 We show that both binary relevance and attribute-based feedback are enhanced by the proposed implicit cues. [sent-69, score-0.93]
29 While typically the exemplar images presented to a user are those currently ranked highest by the system, some methods actively select exemplars to elicit the most informative feedback [4, 6]. [sent-70, score-1.142]
30 We wish to model the implicit information hidden in the explicit feedback provided by users in order to improve image search. [sent-72, score-0.849]
31 In particular, the fact that we consider how to more deeply leverage existing modes of feedback is in stark contrast to prior work that explores novel forms of deeper explicit feedback [2, 10, 24]. [sent-73, score-1.425]
32 Reading between the lines: Our idea can be thought of as reading between the lines of what the user is saying, and not simply taking the feedback at face value. [sent-74, score-1.116]
33 For example, users can teach the machine to detect visual concepts that interest them [2], and user-generated meta-data on social networks can be used to personalize search results by learning user preferences [14]. [sent-82, score-0.638]
34 While our primary goal is to learn patterns in collective user behavior to improve search results in a domain- and user-independent manner, we also conduct experiments on learning user-specific behaviors for further improvements in search quality. [sent-84, score-0.748]
35 Approach We model user behavior in two different feedback settings: binary relevance feedback and relative attributebased feedback (Section 3. [sent-86, score-2.802]
36 To gather training data, we conduct user studies and log the interactions – that is, the feedback choices made by users searching for a given target image. [sent-88, score-1.515]
37 We extract features to describe each such interaction that capture not only the feedback the user provided, but also the feedback the user could have provided but did not (Section 3. [sent-89, score-2.225]
38 We then learn a ranking function, which, given a set of user choices, assigns higher scores to true target images and lower scores to other distractor images in the database (Section 3. [sent-91, score-0.681]
39 The learnt ranking function is used at test time; given a novel user’s feedback choices, the system computes the likelihood of each image in the database being the target. [sent-93, score-0.803]
40 Image Search with Feedback Let’s say a user is searching through a database of N images D = {x1, . [sent-96, score-0.497]
41 The search process starts by showing the user a set P of K images; P = {p1, . [sent-109, score-0.495]
42 The system uses this feedback to compute a relevance Pfu. [sent-123, score-0.864]
43 3, we show how to model the nuances of user behavior for either of these feedback mechanisms to improve their effectiveness. [sent-130, score-1.316]
44 Binary Relevance Feedback: In this form of feedback, the user can select a reference image p∗ ∈ P and state whether the reference image is relevant or i∈rrel Peva anntd dto s tahtee image that he is looking for. [sent-131, score-0.743]
45 If the user says “What I want is like p∗”, then we have S(xi) = −d(xi, p∗), where d captures the distance between two) images xin some feature space (e. [sent-132, score-0.537]
46 If the user says “What Iwant is not like p∗”, then S(xi) = d(xi ,p∗), making the images most dissimilar from p∗ to be the most relevant. [sent-136, score-0.503]
47 Relative Attribute-based Feedback: In this form of feedback, the user can select a reference image p∗ ∈ P and state how it is different from what he is looking fo∈r, as proposed in [10]. [sent-140, score-0.591]
48 If the user feedback is “What Iwant is more am than p∗”, then the explicit feedback method (which will serve as a baseline for our approach) computes S(xi) = rm (xi) rm (p∗). [sent-155, score-1.873]
49 Similarly, if the user says “What Iwant is less am than p∗”, then S(xi) = rm (p∗) − rm (xi). [sent-157, score-0.581]
50 In [10], S(xi) = 1if rm (xi) > rm(p∗) and 0 otherwise for the first feedback statement, and vice-versa for the second feedback statement. [sent-159, score-1.401]
51 Relevance Ranking with Implied Feedback We build our model of user behavior by collecting training data, that is, by observing users providing feedback in 1We describe our approach for the user providing a single statement of feedback on a single reference image for one iteration. [sent-163, score-2.622]
52 e images uvpilse- lth ible to the user for that interaction, (2) pl∗, the user’s choice of reference image, (3) ml∗, his choice of feedback statement, and (4) ql, the “polarity” of the feedback statement. [sent-170, score-1.923]
53 In the case of attribute feedback, ml∗ is the attribute the user chose to comment on, whereas in the case of binary relevance feedback, ml∗ is simply constant, since the user only comments on the “attribute” of generic similarity. [sent-171, score-1.328]
54 The polarity of the feedback ql refers to whether the user said “more” or “less” in the attribute case, and “similar” or “dissimilar” in the binary relevance case. [sent-172, score-1.474]
55 We represent each observed interaction with a feature vector φ(tl , Ωl) that depends on both the target image tl for that interaction as well as the corresponding choices made by the user Ωl while trying to find that target image tl (feature details to be given in the next section). [sent-173, score-0.971]
56 At test time when a user provides novel feedback Ωtest, each image xi in the database will be considered as a potential target image and will be paired with Ωtest and plugged into to extract the corresponding feature vector φ(xi, Ωtest). [sent-174, score-1.322]
57 We stress that the learned function is parameterized by both the target image as well as the user feedback. [sent-205, score-0.551]
58 During training, the target images are known, since we tell the user what image to search for. [sent-206, score-0.637]
59 What is known is the user’s interaction with the system Ωtest, which includes the candidate reference images the user saw. [sent-208, score-0.681]
60 Perhaps to provide negative feedback, users may click on images that are different enough from the target images to not be satisfactory, but not so different that using them as feedback is barely informative. [sent-223, score-0.96]
61 features reflect not only how the target relates to the selected reference image, but also how it relates to the reference images that the user also saw but declined to comment on. [sent-228, score-0.999]
62 We expect the distribution of these features to be different depending on whether the user says “like this” or “not like this” (i. [sent-230, score-0.503]
63 For example, when giving negative feedback, the user may comment on an image that is not too similar to the target, but is also not too dissimilar (the puppy example). [sent-233, score-0.546]
64 In contrast, when giving positive feedback, the user may very well comment on the reference 774488 image that is most similar to the target. [sent-234, score-0.657]
65 Hence, we learn separate scoring functions for the two feedback statements. [sent-235, score-0.756]
66 Features for Relative Attribute-based Feedback: This form of feedback provides the user with more options for richer feedback, providing more opportunities to learn the nuances of user behavior. [sent-239, score-1.7]
67 The direction of feedback q is +1 if the user said “more” and −1 if the user said “less”. [sent-242, score-1.551]
68 ( Tt,h Ωe) e xinp trheses rieolnasti (vie) feedback case. [sent-244, score-0.681]
69 As described in the crime witness example in the introduction, one hypothesis is that the target image usually lies between the chosen reference image and another candidate reference image closest to it along the chosen attribute (Figure 1, bottom). [sent-246, score-0.721]
70 If all candidate reference images in P are sorted by the chosen attribute m∗, let denote the reference image ranked consecutively to the chosen reference image p∗ in the direction q of the feedbac? [sent-247, score-0.687]
71 Another hypothesis is that relative to the entire range of the attribute values spanned by images in P, perhaps the target image is usually close to the chosen reference image along the chosen attribute. [sent-254, score-0.545]
72 Or, maybe users pick the reference image a−nmdin nattribute that allow the target image to be as close to the reference image as possible, as captured by the following ratio: minp∈P,m|p∈∗m{1∗,−. [sent-256, score-0.553]
73 Further, maybe users pick the attribute and reference image such that the target image falls in the smallest interval formed by any two consecutive candidate reference images when sorted by the strength of the chosen attribute. [sent-267, score-0.736]
74 For any candidate reference image p and attribute m, p+m is the value of the mth attribute captured by: in a candidate reference image closest to p along m, while ensuring that tm ∈ [pm, pm+] . [sent-269, score-0.671]
75 Finally, another hypothesis is that when a user says “What Iwant is more am than p∗”, perhaps he really means “I want something like p∗ but more am” (see Figure 2). [sent-273, score-0.615]
76 In this case the user would pick the reference image and attribute such that the reference image has a high difference from the target image along the chosen attribute, but is similar to the target image along the remain- maxm|p? [sent-274, score-1.133]
77 Discussion: Overall, the proposed features capture both what the user did say in the feedback, as well as what he chose not to say, in light of all candidate reference images available. [sent-287, score-0.645]
78 Instead of learning the nuances of user behavior, one might envision simply instructing the user to behave in a certain way. [sent-292, score-0.966]
79 For instance, we could explain to users how the system works, and tell them the optimal strategy of providing feedback to converge on their desired result quickly. [sent-293, score-0.842]
80 Instead of forcing users to adapt to a system, we take the approach of having 774499 Figure3:Userinterfacesfortwof rmsof e dback:binary el vance(top)relativeatribute-based(bot m;asubetofatributesareshownforilustraion) our systems adapt to user behavior. [sent-299, score-0.516]
81 Subjects were allowed to pick a reference image and provide a feedback statement (clue) for that image (Figure 3). [sent-309, score-0.886]
82 The game aspect encourages the user to care about the quality of his response, much as he would if doing a search for his own purposes. [sent-311, score-0.495]
83 We ensure that the training target images and users do not overlap with the testing target images and users (except for experiments where we learn a user-specific behavior model). [sent-326, score-0.603]
84 Binary Relevance Feedback: We compare our proposed approach of modeling user behavior in binary relevance feedback to the traditional approach [19, 26] (Section 3. [sent-333, score-1.356]
85 1) that simply takes what the user says at face value. [sent-334, score-0.503]
86 If the user says “what I want is (not) like this”, it sorts all images in the database in (descending) ascending order of their distance from the selected reference image. [sent-335, score-0.751]
87 1) which, if the user says “what I want is more (less) colorful than this”, sorts all images in descending order of how much more × (less) colorful they are than the selected reference image. [sent-342, score-0.751]
88 In Figure 4 (right plot) we see significant gains across the board by modeling the nuances of user behavior. [sent-343, score-0.557]
89 As a consequence, it is more amenable to eliciting nuances in user behavior, resulting in our dramatic gains in performance. [sent-348, score-0.557]
90 We next train our ranking function using interactions from user studies conducted on one dataset and use it to sort the images of a different dataset. [sent-351, score-0.518]
91 User-specific: While the primary goal of our work is to learn user- and domain-independent models of user behavior, our model naturally lends itself to learning user-specific tendencies as well. [sent-357, score-0.499]
92 (SC: scenes, FA: faces, SH: shoes) Figure 7: Preliminary result using multiple feedback statements on 5 query shoe images using relative attribute-based feedback. [sent-359, score-0.789]
93 gine to personalize its reactions to user feedback, given the user’s prior search history. [sent-360, score-0.531]
94 We next conduct experiments using interactions from only one user to train our model, and then use held-out interactions from the same user to test our model. [sent-361, score-0.971]
95 We see that learning the tendencies of a specific user usually further improves search accuracy. [sent-367, score-0.558]
96 Multiple Feedback Statements: For clarity, our approach is presented and tested in the setting where a user provides one feedback statement on one reference image in a single iteration. [sent-370, score-1.295]
97 In (a): while searching for a target face (blue outline), the user clicks on the reference image (black outline) and says “Not like this”. [sent-377, score-0.872]
98 In each example, Top: feedback given by MTurk user for the target image on the left. [sent-379, score-1.232]
99 In (b): while searching for the target pair of shoes (blue outline), the user clicks on the reference image (black outline) and says “What Iwant is shinier than this”. [sent-391, score-1.08]
100 Whether using traditional binary relevance feedback [4, 6, 13, 19, 22, 26] or a more recent form of attribute feedback [10], our method offers notable gains in search accuracy, yet requires no additional overhead on the part of the user. [sent-398, score-1.749]
wordName wordTfidf (topN-words)
[('feedback', 0.681), ('user', 0.409), ('relevance', 0.155), ('shoes', 0.153), ('reference', 0.152), ('nuances', 0.148), ('target', 0.142), ('attribute', 0.113), ('users', 0.107), ('comment', 0.096), ('says', 0.094), ('search', 0.086), ('behavior', 0.078), ('iwant', 0.077), ('xi', 0.066), ('tl', 0.065), ('tendencies', 0.063), ('modes', 0.063), ('interactions', 0.062), ('implicit', 0.061), ('relative', 0.06), ('choices', 0.058), ('chubby', 0.055), ('shinier', 0.055), ('attributes', 0.055), ('statement', 0.053), ('statements', 0.048), ('scoring', 0.048), ('clicks', 0.048), ('tm', 0.047), ('ranking', 0.047), ('candidate', 0.047), ('subconscious', 0.046), ('something', 0.046), ('interaction', 0.045), ('personalization', 0.041), ('puppy', 0.041), ('rank', 0.041), ('rm', 0.039), ('faces', 0.038), ('sorts', 0.038), ('witness', 0.038), ('say', 0.037), ('personalize', 0.036), ('subjects', 0.036), ('pm', 0.035), ('want', 0.034), ('binary', 0.033), ('someone', 0.033), ('behaviors', 0.033), ('perhaps', 0.032), ('ql', 0.032), ('distractor', 0.032), ('crime', 0.031), ('inclinations', 0.031), ('minp', 0.031), ('ppm', 0.031), ('pragmatics', 0.031), ('suspects', 0.031), ('click', 0.03), ('looking', 0.03), ('engine', 0.029), ('conduct', 0.029), ('system', 0.028), ('outline', 0.028), ('personalizing', 0.027), ('elicit', 0.027), ('softer', 0.027), ('kitten', 0.027), ('congested', 0.027), ('searching', 0.027), ('interactive', 0.027), ('trans', 0.027), ('wants', 0.027), ('wt', 0.027), ('learn', 0.027), ('ml', 0.026), ('providing', 0.026), ('said', 0.026), ('reading', 0.026), ('white', 0.026), ('polarity', 0.025), ('angry', 0.025), ('percentiles', 0.025), ('ranked', 0.025), ('returned', 0.025), ('biases', 0.024), ('strategies', 0.024), ('attributebased', 0.024), ('furry', 0.024), ('baseline', 0.024), ('database', 0.024), ('parikh', 0.024), ('descending', 0.024), ('relates', 0.024), ('pictures', 0.024), ('multimedia', 0.023), ('learnt', 0.023), ('chosen', 0.023), ('implied', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 213 iccv-2013-Implied Feedback: Learning Nuances of User Behavior in Image Search
Author: Devi Parikh, Kristen Grauman
Abstract: User feedback helps an image search system refine its relevance predictions, tailoring the search towards the user’s preferences. Existing methods simply take feedback at face value: clicking on an image means the user wants things like it; commenting that an image lacks a specific attribute means the user wants things that have it. However, we expect there is actually more information behind the user’s literal feedback. In particular, a user’s (possibly subconscious) search strategy leads him to comment on certain images rather than others, based on how any of the visible candidate images compare to the desired content. For example, he may be more likely to give negative feedback on an irrelevant image that is relatively close to his target, as opposed to bothering with one that is altogether different. We introduce novel features to capitalize on such implied feedback cues, and learn a ranking function that uses them to improve the system’s relevance estimates. We validate the approach with real users searching for shoes, faces, or scenes using two different modes of feedback: binary relevance feedback and relative attributes-based feedback. The results show that retrieval improves significantly when the system accounts for the learned behaviors. We show that the nuances learned are domain-invariant, and useful for both generic user-independent search as well as personalized user-specific search.
2 0.62751287 54 iccv-2013-Attribute Pivots for Guiding Relevance Feedback in Image Search
Author: Adriana Kovashka, Kristen Grauman
Abstract: In interactive image search, a user iteratively refines his results by giving feedback on exemplar images. Active selection methods aim to elicit useful feedback, but traditional approaches suffer from expensive selection criteria and cannot predict informativeness reliably due to the imprecision of relevance feedback. To address these drawbacks, we propose to actively select “pivot” exemplars for which feedback in the form of a visual comparison will most reduce the system’s uncertainty. For example, the system might ask, “Is your target image more or less crowded than this image? ” Our approach relies on a series of binary search trees in relative attribute space, together with a selection function that predicts the information gain were the user to compare his envisioned target to the next node deeper in a given attribute ’s tree. It makes interactive search more efficient than existing strategies—both in terms of the system ’s selection time as well as the user’s feedback effort.
3 0.35567969 52 iccv-2013-Attribute Adaptation for Personalized Image Search
Author: Adriana Kovashka, Kristen Grauman
Abstract: Current methods learn monolithic attribute predictors, with the assumption that a single model is sufficient to reflect human understanding of a visual attribute. However, in reality, humans vary in how they perceive the association between a named property and image content. For example, two people may have slightly different internal models for what makes a shoe look “formal”, or they may disagree on which of two scenes looks “more cluttered”. Rather than discount these differences as noise, we propose to learn user-specific attribute models. We adapt a generic model trained with annotations from multiple users, tailoring it to satisfy user-specific labels. Furthermore, we propose novel techniques to infer user-specific labels based on transitivity and contradictions in the user’s search history. We demonstrate that adapted attributes improve accuracy over both existing monolithic models as well as models that learn from scratch with user-specific data alone. In addition, we show how adapted attributes are useful to personalize image search, whether with binary or relative attributes.
4 0.31920165 305 iccv-2013-POP: Person Re-identification Post-rank Optimisation
Author: Chunxiao Liu, Chen Change Loy, Shaogang Gong, Guijin Wang
Abstract: Owing to visual ambiguities and disparities, person reidentification methods inevitably produce suboptimal ranklist, which still requires exhaustive human eyeballing to identify the correct target from hundreds of different likelycandidates. Existing re-identification studies focus on improving the ranking performance, but rarely look into the critical problem of optimising the time-consuming and error-prone post-rank visual search at the user end. In this study, we present a novel one-shot Post-rank OPtimisation (POP) method, which allows a user to quickly refine their search by either “one-shot” or a couple of sparse negative selections during a re-identification process. We conduct systematic behavioural studies to understand user’s searching behaviour and show that the proposed method allows correct re-identification to converge 2.6 times faster than the conventional exhaustive search. Importantly, through extensive evaluations we demonstrate that the method is capable of achieving significant improvement over the stateof-the-art distance metric learning based ranking models, even with just “one shot” feedback optimisation, by as much as over 30% performance improvement for rank 1reidentification on the VIPeR and i-LIDS datasets.
5 0.14208165 399 iccv-2013-Spoken Attributes: Mixing Binary and Relative Attributes to Say the Right Thing
Author: Amir Sadovnik, Andrew Gallagher, Devi Parikh, Tsuhan Chen
Abstract: In recent years, there has been a great deal of progress in describing objects with attributes. Attributes have proven useful for object recognition, image search, face verification, image description, and zero-shot learning. Typically, attributes are either binary or relative: they describe either the presence or absence of a descriptive characteristic, or the relative magnitude of the characteristic when comparing two exemplars. However, prior work fails to model the actual way in which humans use these attributes in descriptive statements of images. Specifically, it does not address the important interactions between the binary and relative aspects of an attribute. In this work we propose a spoken attribute classifier which models a more natural way of using an attribute in a description. For each attribute we train a classifier which captures the specific way this attribute should be used. We show that as a result of using this model, we produce descriptions about images of people that are more natural and specific than past systems.
6 0.13143578 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation
7 0.12827028 53 iccv-2013-Attribute Dominance: What Pops Out?
8 0.11933647 31 iccv-2013-A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects
9 0.11030978 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels
10 0.095924623 444 iccv-2013-Viewing Real-World Faces in 3D
11 0.095067941 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
12 0.092539363 123 iccv-2013-Domain Adaptive Classification
13 0.092454538 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning
14 0.08801356 446 iccv-2013-Visual Semantic Complex Network for Web Images
15 0.084920086 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
16 0.078147747 3 iccv-2013-3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval
17 0.072669357 124 iccv-2013-Domain Transfer Support Vector Ranking for Person Re-identification without Target Camera Label Information
18 0.067647301 246 iccv-2013-Learning the Visual Interpretation of Sentences
19 0.066190958 26 iccv-2013-A Practical Transfer Learning Algorithm for Face Verification
20 0.063332573 238 iccv-2013-Learning Graphs to Match
topicId topicWeight
[(0, 0.158), (1, 0.104), (2, -0.063), (3, -0.159), (4, 0.137), (5, -0.006), (6, -0.049), (7, -0.136), (8, 0.181), (9, 0.204), (10, -0.038), (11, 0.02), (12, 0.026), (13, 0.015), (14, -0.036), (15, -0.08), (16, -0.025), (17, -0.047), (18, -0.054), (19, -0.046), (20, 0.04), (21, -0.104), (22, -0.093), (23, -0.052), (24, 0.032), (25, 0.112), (26, -0.027), (27, 0.116), (28, -0.03), (29, -0.194), (30, -0.121), (31, -0.073), (32, -0.084), (33, 0.119), (34, 0.114), (35, 0.095), (36, -0.063), (37, 0.128), (38, 0.13), (39, 0.068), (40, -0.057), (41, -0.075), (42, 0.118), (43, -0.159), (44, 0.005), (45, -0.18), (46, -0.095), (47, -0.104), (48, -0.103), (49, 0.167)]
simIndex simValue paperId paperTitle
same-paper 1 0.97288918 213 iccv-2013-Implied Feedback: Learning Nuances of User Behavior in Image Search
Author: Devi Parikh, Kristen Grauman
Abstract: User feedback helps an image search system refine its relevance predictions, tailoring the search towards the user’s preferences. Existing methods simply take feedback at face value: clicking on an image means the user wants things like it; commenting that an image lacks a specific attribute means the user wants things that have it. However, we expect there is actually more information behind the user’s literal feedback. In particular, a user’s (possibly subconscious) search strategy leads him to comment on certain images rather than others, based on how any of the visible candidate images compare to the desired content. For example, he may be more likely to give negative feedback on an irrelevant image that is relatively close to his target, as opposed to bothering with one that is altogether different. We introduce novel features to capitalize on such implied feedback cues, and learn a ranking function that uses them to improve the system’s relevance estimates. We validate the approach with real users searching for shoes, faces, or scenes using two different modes of feedback: binary relevance feedback and relative attributes-based feedback. The results show that retrieval improves significantly when the system accounts for the learned behaviors. We show that the nuances learned are domain-invariant, and useful for both generic user-independent search as well as personalized user-specific search.
2 0.89598536 54 iccv-2013-Attribute Pivots for Guiding Relevance Feedback in Image Search
Author: Adriana Kovashka, Kristen Grauman
Abstract: In interactive image search, a user iteratively refines his results by giving feedback on exemplar images. Active selection methods aim to elicit useful feedback, but traditional approaches suffer from expensive selection criteria and cannot predict informativeness reliably due to the imprecision of relevance feedback. To address these drawbacks, we propose to actively select “pivot” exemplars for which feedback in the form of a visual comparison will most reduce the system’s uncertainty. For example, the system might ask, “Is your target image more or less crowded than this image? ” Our approach relies on a series of binary search trees in relative attribute space, together with a selection function that predicts the information gain were the user to compare his envisioned target to the next node deeper in a given attribute ’s tree. It makes interactive search more efficient than existing strategies—both in terms of the system ’s selection time as well as the user’s feedback effort.
3 0.72219968 305 iccv-2013-POP: Person Re-identification Post-rank Optimisation
Author: Chunxiao Liu, Chen Change Loy, Shaogang Gong, Guijin Wang
Abstract: Owing to visual ambiguities and disparities, person reidentification methods inevitably produce suboptimal ranklist, which still requires exhaustive human eyeballing to identify the correct target from hundreds of different likelycandidates. Existing re-identification studies focus on improving the ranking performance, but rarely look into the critical problem of optimising the time-consuming and error-prone post-rank visual search at the user end. In this study, we present a novel one-shot Post-rank OPtimisation (POP) method, which allows a user to quickly refine their search by either “one-shot” or a couple of sparse negative selections during a re-identification process. We conduct systematic behavioural studies to understand user’s searching behaviour and show that the proposed method allows correct re-identification to converge 2.6 times faster than the conventional exhaustive search. Importantly, through extensive evaluations we demonstrate that the method is capable of achieving significant improvement over the stateof-the-art distance metric learning based ranking models, even with just “one shot” feedback optimisation, by as much as over 30% performance improvement for rank 1reidentification on the VIPeR and i-LIDS datasets.
4 0.6640808 52 iccv-2013-Attribute Adaptation for Personalized Image Search
Author: Adriana Kovashka, Kristen Grauman
Abstract: Current methods learn monolithic attribute predictors, with the assumption that a single model is sufficient to reflect human understanding of a visual attribute. However, in reality, humans vary in how they perceive the association between a named property and image content. For example, two people may have slightly different internal models for what makes a shoe look “formal”, or they may disagree on which of two scenes looks “more cluttered”. Rather than discount these differences as noise, we propose to learn user-specific attribute models. We adapt a generic model trained with annotations from multiple users, tailoring it to satisfy user-specific labels. Furthermore, we propose novel techniques to infer user-specific labels based on transitivity and contradictions in the user’s search history. We demonstrate that adapted attributes improve accuracy over both existing monolithic models as well as models that learn from scratch with user-specific data alone. In addition, we show how adapted attributes are useful to personalize image search, whether with binary or relative attributes.
5 0.45923615 53 iccv-2013-Attribute Dominance: What Pops Out?
Author: Naman Turakhia, Devi Parikh
Abstract: When we look at an image, some properties or attributes of the image stand out more than others. When describing an image, people are likely to describe these dominant attributes first. Attribute dominance is a result of a complex interplay between the various properties present or absent in the image. Which attributes in an image are more dominant than others reveals rich information about the content of the image. In this paper we tap into this information by modeling attribute dominance. We show that this helps improve the performance of vision systems on a variety of human-centric applications such as zero-shot learning, image search and generating textual descriptions of images.
6 0.45260963 399 iccv-2013-Spoken Attributes: Mixing Binary and Relative Attributes to Say the Right Thing
7 0.40351039 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
8 0.4021759 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning
9 0.394979 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation
10 0.36190599 31 iccv-2013-A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects
11 0.35046387 350 iccv-2013-Relative Attributes for Large-Scale Abandoned Object Detection
12 0.34182116 446 iccv-2013-Visual Semantic Complex Network for Web Images
13 0.33550018 449 iccv-2013-What Do You Do? Occupation Recognition in a Photo via Social Context
14 0.33505368 367 iccv-2013-SUN3D: A Database of Big Spaces Reconstructed Using SfM and Object Labels
15 0.33327815 3 iccv-2013-3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval
16 0.32786882 267 iccv-2013-Model Recommendation with Virtual Probes for Egocentric Hand Detection
17 0.29866374 248 iccv-2013-Learning to Rank Using Privileged Information
18 0.29811707 416 iccv-2013-The Interestingness of Images
19 0.29505745 124 iccv-2013-Domain Transfer Support Vector Ranking for Person Re-identification without Target Camera Label Information
20 0.26527396 178 iccv-2013-From Semi-supervised to Transfer Counting of Crowds
topicId topicWeight
[(2, 0.071), (7, 0.013), (8, 0.016), (12, 0.02), (26, 0.049), (31, 0.027), (34, 0.024), (42, 0.477), (64, 0.031), (69, 0.011), (73, 0.018), (89, 0.119), (98, 0.011)]
simIndex simValue paperId paperTitle
1 0.9770726 162 iccv-2013-Fast Subspace Search via Grassmannian Based Hashing
Author: Xu Wang, Stefan Atev, John Wright, Gilad Lerman
Abstract: The problem of efficiently deciding which of a database of models is most similar to a given input query arises throughout modern computer vision. Motivated by applications in recognition, image retrieval and optimization, there has been significant recent interest in the variant of this problem in which the database models are linear subspaces and the input is either a point or a subspace. Current approaches to this problem have poor scaling in high dimensions, and may not guarantee sublinear query complexity. We present a new approach to approximate nearest subspace search, based on a simple, new locality sensitive hash for subspaces. Our approach allows point-tosubspace query for a database of subspaces of arbitrary dimension d, in a time that depends sublinearly on the number of subspaces in the database. The query complexity of our algorithm is linear in the ambient dimension D, allow- ing it to be directly applied to high-dimensional imagery data. Numerical experiments on model problems in image repatching and automatic face recognition confirm the advantages of our algorithm in terms of both speed and accuracy.
Author: De-An Huang, Yu-Chiang Frank Wang
Abstract: Cross-domain image synthesis and recognition are typically considered as two distinct tasks in the areas of computer vision and pattern recognition. Therefore, it is not clear whether approaches addressing one task can be easily generalized or extended for solving the other. In this paper, we propose a unified model for coupled dictionary and feature space learning. The proposed learning model not only observes a common feature space for associating cross-domain image data for recognition purposes, the derived feature space is able to jointly update the dictionaries in each image domain for improved representation. This is why our method can be applied to both cross-domain image synthesis and recognition problems. Experiments on a variety of synthesis and recognition tasks such as single image super-resolution, cross-view action recognition, and sketchto-photo face recognition would verify the effectiveness of our proposed learning model.
3 0.96703529 422 iccv-2013-Toward Guaranteed Illumination Models for Non-convex Objects
Author: Yuqian Zhang, Cun Mu, Han-Wen Kuo, John Wright
Abstract: Illumination variation remains a central challenge in object detection and recognition. Existing analyses of illumination variation typically pertain to convex, Lambertian objects, and guarantee quality of approximation in an average case sense. We show that it is possible to build models for the set of images across illumination variation with worstcase performance guarantees, for nonconvex Lambertian objects. Namely, a natural verification test based on the distance to the model guarantees to accept any image which can be sufficiently well-approximated by an image of the object under some admissible lighting condition, and guarantees to reject any image that does not have a sufficiently good approximation. These models are generated by sampling illumination directions with sufficient density, which follows from a new perturbation bound for directional illuminated images in the Lambertian model. As the number of such images required for guaranteed verification may be large, we introduce a new formulation for cone preserving dimensionality reduction, which leverages tools from sparse and low-rank decomposition to reduce the complexity, while controlling the approximation error with respect to the original model. 1
4 0.96492469 70 iccv-2013-Cascaded Shape Space Pruning for Robust Facial Landmark Detection
Author: Xiaowei Zhao, Shiguang Shan, Xiujuan Chai, Xilin Chen
Abstract: In this paper, we propose a novel cascaded face shape space pruning algorithm for robust facial landmark detection. Through progressively excluding the incorrect candidate shapes, our algorithm can accurately and efficiently achieve the globally optimal shape configuration. Specifically, individual landmark detectors are firstly applied to eliminate wrong candidates for each landmark. Then, the candidate shape space is further pruned by jointly removing incorrect shape configurations. To achieve this purpose, a discriminative structure classifier is designed to assess the candidate shape configurations. Based on the learned discriminative structure classifier, an efficient shape space pruning strategy is proposed to quickly reject most incorrect candidate shapes while preserve the true shape. The proposed algorithm is carefully evaluated on a large set of real world face images. In addition, comparison results on the publicly available BioID and LFW face databases demonstrate that our algorithm outperforms some state-of-the-art algorithms.
5 0.95258081 167 iccv-2013-Finding Causal Interactions in Video Sequences
Author: Mustafa Ayazoglu, Burak Yilmaz, Mario Sznaier, Octavia Camps
Abstract: This paper considers the problem of detecting causal interactions in video clips. Specifically, the goal is to detect whether the actions of a given target can be explained in terms of the past actions of a collection of other agents. We propose to solve this problem by recasting it into a directed graph topology identification, where each node corresponds to the observed motion of a given target, and each link indicates the presence of a causal correlation. As shown in the paper, this leads to a block-sparsification problem that can be efficiently solved using a modified Group-Lasso type approach, capable of handling missing data and outliers (due for instance to occlusion and mis-identified correspondences). Moreover, this approach also identifies time instants where the interactions between agents change, thus providing event detection capabilities. These results are illustrated with several examples involving non–trivial interactions amongst several human subjects. 1. Introduction and Motivation The problem of identifying causal interactions amongst targets in a video sequence has been the focus of considerable attention in the past few years. A large portion of the existing body of work in this field uses human annotated video to build a storyline that includes both recognizing the activities involved and the causal relationships between them (see for instance [10] and references therein). While these methods are powerful and work well when suitably annotated data is available, annotating video clips is expensive and parsing relevant actions requires domain knowledge which may not be readily available. Indeed, in many situations, unveiling potentially hidden causal relationships is a first step towards building such knowledge. In this paper we consider the problem of identifying causal interactions amongst targets, not necessarily human, ∗This work was supported by NSF grants IIS–0713003, IIS-1318145, and ECCS–0901433, AFOSR grant FA9559–12–1–0271, and the Alert DHS Center of Excellence under Award Number 2008-ST-061-ED0001 . from unannotated video sequences and without prior domain knowledge. Our approach exploits the concept of “Granger Causality” [9], that formalizes the intuitive idea that ifa time series {x(t)} is causally related to a second one {thya(tt)if}a, ttihmene knowledge }oifs tchaeu past vrealluateesd otfo a{yse}c1to should l{eya(dt t}o, a ebnett kern prediction o thf efu ptuasret vvaalulueess ooff {{yx}}tt+k. In [l1ea4d], Pora ab bheatktearr eprt.e aicl.t successfully vuasleude a frequency domain reformulation of this concept to uncover pairwise interactions in scenarios involving repeating events, such as social games. This technique was later extended in [17] to model causal correlations between human joints and applied to the problem of activity classification. However, since this approach is based upon estimating the crosscovariance density function between events, it cannot handle situations where these events are non repeating, are too rare to provide an accurate estimate, or where these estimates are biased by outliers or missing data. Further, estimating a pairwise measure of causal correlation requires a spectral factorization of the cross-covariance, followed by numerical integration and statistical thresholding, limiting the approach to moderately large problems. To circumvent these problems, in this paper we propose an alternative approach based upon recasting the problem into that of identifying the topology of a sparse (directed) graph, where each node corresponds to the time traces of relevant features of a target, and each link corresponds to a regressor. The situation is illustrated in Fig. 1 using as an example the problem of finding causal relations amongst 4 tennis players, leading to a graph with 4 nodes, and potentially 12 (directed) links. Note that in general, the problem of identifying causal relationships is ill posed (unless one wants to identify the set of all individuals that could possibly have causal connections), due to the existence of secondary interactions. To illustrate this point, consider a very simplistic scenario with three actors A, B, and C, where A copies (with some delay) the actions of B, which in turn mimics C, also with some delay. In this situation, the ac- tions of A can be explained in terms of either those of B delayed one time sample, or those of C delayed by two samples. Thus, an algorithm based upon a statistical analysis 33556758 would identify a causal connection between A and C, even though there is no direct link between them. Further, if the actions of C can be explained by some simple autoregressive model of the form: = C(t) ?aiC(t − i) then it follows that the acti?ons of A can be explained by the same model, e.g. = A(t) ?aiA(t − i) Hence, multiple graphs topologies, some of which include self-loops, can explain the same set of time-series. On the other hand, note that in this situation, the sparsest graph (in the sense of having the fewest links) is the one that correctly captures the causality relations: the most direct cause of A is B and that of B is C, with C potentially being explained by a self-loop. To capture this feature and regularize the problem, in the sequel we will seek to find the sparsest graph, in the sense of having the least number of interconnections, that explains the observed data, reflecting the fact that, when alternative models are possible, often the most parsimonious is the correct one. Our main result shows that the problem of identifying sparse graph structures from observed noisy data can be reduced to a convex optimization problem (via the use of Group Lasso type arguments) that can be efficiently solved. The advantages of the proposed methods are: • • • • Its ability to handle complex scenarios involving nonrepeating events, een cvoimropnlmeexn stcael changes, clvoillnegct nioonnsof targets that do not necessarily split into well defined groups, outliers and missing data. The ability to identify the sparsest interaction structure tThhaet explains th idee nobtifseyr tvheed s dpaartas e(stthu inst avoiding labeling as causal connections those indirect correlations mediated only by an intermediary), together with a sparse “indicator” function whose support set indicates time instants where the interactions between agents change. Since the approach is not based on semantic analysis, iSt can bt hee applied ctoh ti she n moto btiaosne dof o arbitrary targets, sniost, necessarily humans (indeed, it applies to arbitrary time series including for instance economic or genetic data). From a computational standpoint, the resulting optiFmriozmatio an c problems nhaalve s a specific fthoerm re asmuletinnagbl oep ttiobe solved by a class of iterative algorithms [5, 3], that require at each step only a combination of thresholding and least-squares approximations. These algorithms have been shown to substantially outperform conventional convex-optimization solvers both in terms of memory and computation time requirements. The remainder of the paper is organized as follows. In section 2 we provide a formal reformulation of the problem of finding causal relationships between agents as a sparse graph identification problem. In section 3, we show that this problem can be efficiently solved using a re-weighted Group Lasso approach. Moreover, as shown there, the resulting problem can be solved one node at a time using first order methods, which allows for handling situations involving a large number of agents. Finally, the effectiveness of the proposed method is illustrated in section 4 using both simple scenarios (for which ground truth is readily available) and video clips of sports, involving complex, nonrepeating interactions amongst many agents. Figure 1. Finding causal interactions as a graph identification problem. Top: sample frame from a doubles tennis sequence. Bottom: Representation of this sequence as a graph, where each node represents the time series associated with the position of each player and the links are vector regressive models. Causal interactions exist when one of the time series can be explained as a combination of past values of the others. 2. Preliminaries For ease of reference, in this section we summarize the notation used in the paper and give a formal definition of the problem under consideration. 2.1. Notation (M) ?M? ??MM??F ?M?1 ?M?o σi ∗ ◦ ith largest singular value of the matrix M. nuclear norm: ?M? ?i σ?i (M). Fnruocbleeanrio nours norm: ??M?2F? ?i,j Mi2j ?1 norm: ?M? 1 ?i,j |Mij? ?|. ?o quasi-norm: ?M?o number of non-zero ?eleme?nMts i?n M. Hadamard product of matrices: (A ◦ ∗ =.: =. =. =. B)i,j = Ai,jBi,j. 33556769 2.2. Statement of the Problem Next, we formalize the problem under consideration. Consider a scenario with P moving agents, and denote by the 3D homogenous coordinates of the pth individual at time t. Motivated by the idea of Granger Causality, we will say that the actions of this agent depend causally from those in a set Ip (which can possibly contain p itself), if can be written as: Q˜p(t) Q˜p(t) Q˜p(t) ?N = ? ?ajp(n)Q˜j(t − n) +˜ η p(t) +˜ u p(t) (1) j? ?∈Ip ?n=0 Here ajp are unknown coefficients, and ˜η p(t) and up(t) represent measurement noise and a piecewise constant signal that is intended to account for relatively rare events that cannot be explained by the (past) actions of other agents. Examples include interactions of an agent with the environment, for instance to avoid obstacles, or changes in the interactions between agents. Since these events are infrequent, we will model as a signal that has (component-wise) a sparse derivative. Note in passing that since (1) involves homogeneous coordinates, the coefficients aj,p(.) satisfy the following constraint1 u ?N ? ?ajp(n) j? ?∈Ip ?n=0 =1 (2) Our goal is to identify causal relationships using as data 2D measurements qp(t) in F frames of the affine projections of the 3D coordinates Q˜p(t) of the targets. Note that, under the affine camera assumption, the 2D coordinates are related exactly by the same regressor parameters [2]. Thus, (1) holds if and only if: ?N qp(t) = ? ?ajp(n)qj(t − n) + u˜ p(t) + ηp(t) (3) j?∈Ip ?n=0 In this context, the problem can be precisely stated as: Given qp(t) (in F number of frames) and some a-priori bound N on the order of the regressors (that is the “memory” of the interactions), find the sparsest set of equations of the form (3) that explains the data, that is: aj,pm,ηinp,up?nIp (4) subject to? ?(2) and: = ? ?ajp(n)qj(t − n) + ?N qp(t) j? ?∈Ip ?n=0 up(t) + ηp(t) , p = 1 . . . , P and t = 1, ..F 1This follows by considering the third coordinate in (1) (5) where nIp denotes the cardinality of the set Ip. Rewriting (5) in matrixd efnoormtes yields: [xp; yp] = [Bp, I][apTuxTpuyTp]T + ηp (6) where qp(t) up(t) ηp(t) xp yp ap aip uxp uyp Bp Xp = [xp(t)Typ(t)T]T = [uTxp(t)uyTp(t)]T = [ηxp(t)Tηyp(t)T]T = = [xp(F)xp(F − 1)...xp(1)]T = [yp(F)yp(F − 1)...yp(1)]T [aT1p, a2Tp, ..., aTPp]T = [aip(0), aip(1), ..., aip(N)]T = [uxp(F)uxp(F−1)...uxp(1)]T = [uyp(F)uyp(F−1)...uyp(1)]T = = [Xp; Yp] [hankel(x1 , N) , ..., hankel(xP, N)] Yp = [hankel(y1, N), ..., hankel(yP, N)] and where, for a sequence z(t), hankel(z, N) denotes its associated Hankel matrix: hankel(z, N) = Itfolw⎛⎜⎝ sz t(hNzFa(t. +−a)d1 2e)scrzip(tF io(N. n− )o231f)al· t h· einzt(Frac−zti(.o1N.n)s−a)m12o)⎟ ⎞⎠ ngst uηaq= ? ηuqa1 T ,ηqau2 T ,ηaqu3 T ,· ·, ηauqP T ? T (8) Thus,inthBisc=on⎢⎣⎡teBx0t.1,theB0p.r2ob·le.·m·ofB0 i.nPte⎦⎥r ⎤estcanbeforagents (that is the complete graph structure) is captured by a matrix equation of the form: q = [B, I][aTuT]T + η (7) where and malized as finding the block–sparsest solution to the set of linear equations (2) and (7). 33557770 The problem of identifying a graph structure subject to sparsity constraints, has been the subject of intense research in the past few years. For instance, [1] proposed a Lasso type algorithm to identify a sparse network where each link corresponds to a VAR process. The main idea underlying this method is to exploit the fact that penalizing the ?1 norm of the vector of regression coefficients tends to produce sparse solutions. However, enforcing sparsity of the entire vector of regressor coefficients does not necessarily result in a sparse graph structure, since the resulting solution can consist of many links, each with a few coefficients. This difficulty can be circumvented by resorting to group Lasso type approaches [18], which seek to enforce block sparsity by using a combination of ?1 and ?2 norm constraints on the coefficients of the regressor. While this approach was shown to work well with artificial data in [11], exact recovery of the underlying network can be only guaranteed when the data satisfies suitable “incoherence” type conditions [4]. Finally, a different approach was pursued in [13], based on the use of a modified Orthogonal Least Squares algorithm, Cyclic Orthogonal Least Squares. However, this approach requires enforcing an a-priori limit on the number of links allowed to point to a single node, and such information may not be readily available, specially in cases where this number has high variability amongst nodes. To address these difficulties, in the next section we develop a convex optimization based approach to the problem of identifying sparse graph structures from observed noisy data. This method is closest in spirit to that in [11], in the sense that it is also based on a group Lasso type argument. The main differences consist in the ability to handle the unknown inputs up(t), needed to model exogenous disturbances affecting the agents, and in a reformulation of the problem, that allows for using a re-weighted iterative type algorithm, leading to substantially sparser solutions, even when the conditions in [4] fail. 3. Causality Identification Algorithm In this section we present the main result of this paper, an algorithm to search for block-sparse solutions to (7). For each fixed p, the algorithm searches for sparse solutions to (6) by solving (iteratively) the following problem (suggested by the re-weighted heuristic proposed in [7]) ?P ap,muxipn,uypi?=1wja(?aip?2) + λ??diag(wu)[Δuxp;Δuyp]??1 subject to: ?ηp ? ≤ p = 1, . . , P. ∞ ?P ?, ?N ??aip(n) i?= ?1 ?n=0 ?. = 1, p = 1,...,P. (9) where [Δuxp ; Δuyp] represents the first order differences of the exogenous input vector [uxp ; uyp], Wa and Wu are weighting matrices, and λ is a Lagrange multiplier that plays the role of a tuning parameter between graph sparsity and event sensitivity. Intuitively, for a fixed set of weights w, the algorithm attempts to find a block sparse solution to (6) and a set of sparse inp?uts Δuxp ; Δuyp , by exploiting the facts that minimizing ?i ?aip ?2 (the ?2,1 norm of the vector sequence {aip}) te?nds? tao m?aximize block-sparsity [18], while minimizing et?nhed s? 1t norm mmaizxeim blizoceks sparsity [ [1168]]. wOhniclee t mheisnesolutions are found, the weights w are adjusted to penalize those elements of the sequences with small values, so that in the next iteration solutions that set these elements to zero (hence further increasing sparsity) are favored. Note however, that proceeding in this way, requires solving at each iteration a problem with n = P(Pnr + F) variables, where P and F denote the number of agents and frames, respectively, and where nr is a bound on the regressor order. On the other hand, it is easily seen that both the objective function and the constraints in (9) can be partitioned into P groups, with the pth group involving only the variables related to the pth node. It follows then that problem (9) can be solved by solving P smaller problems of the form: ?P ap,muxipn,uypi?=1wja(?aip?2) + λ??diag(wu)[Δuxp;Δuyp]??1 ?P subject to: ?ηp?∞ ?N ≤ ? and ??aip(n) i?= ?1 ?n=0 leading to the algorithm given below: =1 (10) Algorithm 1: REWEIGHTEDCAUSALITYALGORITHM for each p wa = [1, 1, ..., 1] = [1, 1, ..., 1] S > 1(self loop weight) s = [1, 1, ..., S, ..., 1] (p’th element is S) while not converged do 1. solve (9) 2. wja = 1/( ?aip ?2 + δ) 3. wja = wja ◦ s (Penalization self loops) 4. = 1./(abs([Δuxp ; Δuyp]) + δ) end while 5. At this point ajp(.) , Ip and up(t) have been identified end for wu wu It is worth emphasizing that, since the computational complexity of standard interior point methods grows as n3, solving these smaller P problems leads to roughly a O(P2) 33557781 reduction in computational time over solving a single, larger optimization. Thus, this approach can handle moderately large problems using standard, interior-point based, semidefinite optimization solvers. Larger problems can be accommodated by noting that the special form of the objective and constraints allow for using iterative Augmented La- grangian Type Methods (ALM), based upon computing, at each step, the closed form solution to suitable intermediate optimization problems. While a complete derivation of such an algorithm is beyond the scope of this paper, using results from [12] it can be shown that each step requires only a combination of thresholding and least-squares approximations. Moreover, it can be shown that such an algorithm converges Q-superlinearly. 4. Handling Outliers and Missing Data The algorithm outlined above assumes an ideal situation where the data matrix B is perfectly known. However, in practice many of its elements may be outliers (due to misidentified correspondences) or missing (due to occlusion). As we briefly show next, these situations can be efficiently handled by performing a structured robust PCA step [3] to obtain a “clean” data matrix, prior to applying Algorithm 1. From equation (6) it follows that, in the absence of exogenous inputs and noise: ?xy11.. . .yxPP? = ?XY11.. . .YXPP? ?a1...aP? (11) Since xi ∈ {col(Xj)} and yi ∈ {col(Yj }), it follows that the sets {∈co {l(cXoli(X)} a)n}d a n{dco yl(Y∈i) { }c? are self-ex?pressive, or, ?equivalently?, Xthe }ma atnridce {sc oXl( =.) }? aXre1 . . . fX-eNxp? eanssdiv eY, ?Y1 ...YN? are mraantkri cdeesfic Xient. ?Consider no?w the case =.r, w?here some ?elements xi, yi of X and Y are missing. From ?the self-expressive property ooff {Xco aln(Xd Yi)} a raen dm i{scsoinlg(Y. Fi)ro} mit tfhoello swelsf tehxaptr ethsessieve missing eyle omf {encotsl are given by: xi = argmin rank(X) , yi x = argmin rank(Y) (12) y Similarly, in the presence of outliers, X, Y can be decomposed irnlyto, itnhe t sum oesfe a lcoew o fra onkut mlieartsr,ix X (,thYe ccalenan b eda dtae)c oamnda sparse one (the outliers) by solving a problem of the form minrank?YXoo?+ λ????EEYX????os. t.: ?XYoo?+?EEYX?=?YX? From the reasoning? abov?e it follows that in the presence of noise and exogenous outputs, the clean data record can be recovered from the corrupted, partial measurements by solving the following optimization problem: s+muλibn3je? ? ? cYXtM ot ? Y?X:∗◦ +Ξ λYX1? ? ? FM XY◦ E XY? ?1+λ2? ?M YX◦ Δ U YX? ?1 ?YX?=?XYoo?+?EEXY?+?UUYX?+?ΞΞYX? (13) where we have used the standard convex relaxations of rank and cardinality2. Here Ξ and U denote noise and piecewise constant exogenous matrices, ΔU denotes the matrix obtained by taking the difference between consecutive elements in U, and MX (MY) is a “mask” matrix, with mi,j = 0 if the element (i, j) in X ( Y) is missing, mi,j = 1 otherw=i0s e, i tuhseed e etom aenvtoi (di, penalizing )e lisem miesnstisn gin, mE, Ξ, U corresponding to missing data. Problem (13) is a structured robust PCA problem (due to the Hankel structure of X, Y) trhobatu can C bAe efficiently suoelv teod t using tkheel fsitrrsut oturrdeer o mf Xeth,oYd) proposed in [3], slightly modified to handle the terms containing ΔU. 5. Experimental Results In this section we illustrate the effectiveness of the proposed approach using several video clips (provided as supplemental material). The results of the experiments are displayed using graphs embedded on the video frames: An arrow indicates causal correlation between agents, with the point of the arrow indicating the agent whose actions are affected by the agent at its tail. The internal parameters of the algorithm were experimentally tuned, leading to the values ? = 0.1, = 0.05, self loop weights S = 10. The algorithm is fairly insensitive to the value of the regularization parameters and S, which could be adjusted up or down by an order of magnitude without affecting the structure of the resulting graph. Finally, we used regressor order N=2 for the first three examples and N=4 for the last one, a choice that is consistent with the frame rate and the complexity of λ λ the actions taking place in each clip. 5.1. Clips from the UT-Interaction Data Set We considered two video clips from the UT Human Interaction Data Set [15] (sequences 6 and 16). Figures 2 and 5 compare the results obtained applying the proposed algorithm versus Group Lasso (GL) [11] and Group Lasso combined with the reweighted heuristic described in (9) (GLRW). In all cases, the inputs to the algorithm were the (approximate) coordinates of the heads of each of the agents, normalized to the interval [−1, 1], artificially corrupted ,w niothrm m10al%iz eodut tloie trhs.e Notably, [t−he1 proposed algorithm 2As shown in [6, 8] under suitable conditions these relaxations the exact minimum rank solution. 33557792 recover Figure 2. Sample frames from the UT sequence 6 with the identified causal connections superimposed. Top: Proposed Method. Center: Reweighted Group Lasso. Bottom: Group Lasso. Only the proposed method identifies the correct connections. was able to correctly identify the correlations between the agents from this very limited amount of information, while the others failed to do so. Note in passing that in both cases none of the algorithms were directly applicable, due to some of the individuals leaving the field of view or being occluded. As illustrated in Fig. 3, the missing data was recovered by solving an RPCA problem prior to applying Algorithm 1. Finally, Fig. 4 sheds more insight on the key role played by the sparse signal u. As shown there, changes in u correspond exactly to time instants when the behavior of the corresponding agent deviates from the general pattern followed during most of the clip. Figure 3. Time traces of the individual heads in the UT sequence 6, artificially corrupted with 10 % outliers. The outliers were removed and the missing data due to targets leaving the field of view was estimated solving a modified RPCA problem. Frame number Figure 4. Sample (derivative sparse) exogenous signals in the UT sequence 6. The changes correspond to the instants when the second person starts moving towards the first, who remains stationary, and when the two persons merge in an embrace. Figure 5. Sample frames from the UT sequence 16. Top: Correct correlations identified by the Proposed Method. Center and Bottom: Reweighted Group Lasso and Group Lasso (circles indicate self-loops). 5.2. Doubles Tennis Experiment This experiment considers a non-staged real-life scenario. The data consists of 230 frames of a video clip from the Australian Open Tennis Doubles Final games. The goal here is to identify causal relationships between the different players using time traces of the respective centroid positions. Note that in this case the ground truth is not available. Nevertheless, since players from the same team usually look at their opponents and react to their motions, we expect a strong causality connection between members of 33557803 opposite teams. This intuition is matched by the correlations unveiled by the algorithm, shown in Fig. 6. The identified sparse input corresponding to the vertical direction is shown in Fig. 7 (similar results for the horizontal component are omitted due to space reasons.) Figure 6. Sample frames from the tennis sequence. Top: The proposed method correctly identifies interactions between opposite team members. Center: Reweighted Group Lasso misses the interaction between the two rear-most individuals of opposite teams, generating self loops instead (denoted by the disks). Bottom: Group Lasso yields an almost complete graph. Figure 7. Exogenous signal corresponding to the vertical axis for the tennis sequence. The change in one component corresponds to the instant when the leftmost player in the bottom team moves from the line towards the net, remaining closer to it from then on. 5.3. Basketball Game Experiment This experiment considers the interactions amongst players in a basketball game. As in the case ofthe tennis players, since the data comes from a real life scenario, the ground truth is not available. However, contrary to the tennis game, this scenario involves complex interactions amongst many players, and causality is hard to discern by inspection. Nevertheless, the results shown in Fig. 8, obtained using the position of the centroids as inputs to our algorithm, match our intuition. Firstly, one would expect a strong cause/effect connection between the actions of the player with the ball and the two defending opponents facing him. These connections (denoted by the yellow arrows) were indeed successfully identified by the algorithm. The next set of causal correlations is represented by the (blue, light green) and (black, white) arrow pairs showing the defending and the opponent players on the far side of the field and under the hoop. An important, counterintuitive, connection identified by the algorithm is represented by the magenta arrows be- tween the right winger of the white team with two of his teammates: the one holding the ball and the one running behind all players. While at first sight this connection is not as obvious as the others, it becomes apparent towards the end of the sequence, when the right winger player is signaling with a raised arm. Notably, our algorithm was able to unveil this signaling without the need to perform a semantic analysis (a very difficult task here, since this signaling is apparent only in the last few frames). Rather, it used the fact that the causal correlation was encapsulated in the dynamics of the relative motions of these players. 6. Conclusions In this paper we propose a new method for detecting causal interactions between agents using video data. The main idea is to recast this problem into a blind directed graph topology identification, where each node corresponds to the observed motion of a given target, each link indicates the presence of a causal correlation and the unknown inputs account for changes in the interaction patterns. In turn, this problem can be reduced to that of finding block-sparse solutions to a set of linear equations, which can be efficiently accomplished using an iterative re-weighted Group-Lasso approach. The ability of the algorithm to correctly identify causal correlations, even in cases where portions of the data record are missing or corrupted by outliers, and the key role played by the unknown exogenous input were illustrated with several examples involving non–trivial inter- actions amongst several human subjects. Remarkably, the proposed algorithm was able to identify both the correct interactions and the time instants when interactions amongst agents changed, based on minimal motion information: in all cases we used just a single time trace per person. This success indicates that in many scenarios, the dynamic information contained in the motion pattern of a single feature associated with a target is rich enough to enable identifying complex interaction patterns, without the need to track multiple features, perform a semantic analysis or use additional domain knowledge. 33557814 Figure 8. Sample frames from a Basketball game. Top: proposed method. Center: Reweighted Group the signaling player and his teammates. Bottom: Group Lasso yields an almost complete graph. Lasso misses the interaction between References [1] A. Arnold, Y. Liu, and N. Abe. Estimating brain functional connectivity with sparse multivariate autoregression. In Proc. of the 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 66–75, 2007. 4 [2] M. Ayazoglu, B. Li, C. Dicle, M. Sznaier, and O. Camps. Dynamic subspace-based coordinated multicamera tracking. In 2011 IEEE ICCV, pages 2462–2469, 2011. 3 [3] M. Ayazoglu, M. Sznaier, and O. Camps. Fast algorithms for structured robust principal component analysis. In 2012 IEEE CVPR, pages 1704–171 1, June 2012. 2, 5 [4] A. Bolstad, B. Van Veen, and R. Nowak. Causal network inference via group sparse regularization. IEEE Transactions on Signal Processing, 59(6):2628–2641, 2011. 4 [5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Dis- [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] tributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, Jan. 2011. 2 E. Candes, X. Li, Y. Ma, and J.Wright. Robust principal component analysis? J. ACM, (3), 2011. 5 E. J. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted l1minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, December 2008. 4 V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. Siam J. Optim., (2):572–596, 2011. 5 C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, pages 424–438l, 1969. 1 A. Gupta, P. Srinivasan, J. Shi, and L. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In 2009 IEEE CVPR, pages 2012–2019, 2009. 1 S. Haufe, G. Nolte, K. R. Muller, and N. Kramer. Sparse causal discovery in multivariate time series. In Neural Information Processing Systems, 2009. 4, 5 G. Liu, Z. Lin, and Y. Yu. Robust subspace segmentation by low-rank representation. In ICML, pages 1663–670, 2010. 5 D. Materassi, G. Innocenti, and L. Giarre. Reduced complexity models in identification of dynamical networks: Links with sparsification problems. In 48th IEEE Conference on Decision and Control, pages 4796–4801, 2009. 4 K. Prabhakar, S. Oh, P. Wang, G. Abowd, and J. Rehg. Temporal causality for the analysis ofvisual events. In IEEE Conf Comp. Vision and Pattern Recog. (CVPR)., pages 1967– 1974, 2010. 1 M. S. Ryoo and J. K. Aggarwal. UT Interaction Dataset, ICPR contest on Semantic Description of Human Activities. http://cvrc.ece.utexas.edu/SDHA2010/Human Interaction.html, 2010. 5 [16] J. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3): 1030–1051, 2006. 4 [17] S. Yi and V. Pavlovic. Sparse granger causality graphs for human action classification. In 2012 ICPR, pages 3374–3377. 1 [18] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49–67, 2006. 4 33557825
same-paper 6 0.94752425 213 iccv-2013-Implied Feedback: Learning Nuances of User Behavior in Image Search
7 0.92031431 46 iccv-2013-Allocentric Pose Estimation
8 0.90569383 184 iccv-2013-Global Fusion of Relative Motions for Robust, Accurate and Scalable Structure from Motion
9 0.88804024 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
10 0.86593461 93 iccv-2013-Correlation Adaptive Subspace Segmentation by Trace Lasso
11 0.85914725 14 iccv-2013-A Generalized Iterated Shrinkage Algorithm for Non-convex Sparse Coding
12 0.85805345 54 iccv-2013-Attribute Pivots for Guiding Relevance Feedback in Image Search
13 0.82979715 161 iccv-2013-Fast Sparsity-Based Orthogonal Dictionary Learning for Image Restoration
14 0.82291257 398 iccv-2013-Sparse Variation Dictionary Learning for Face Recognition with a Single Training Sample per Person
15 0.82117993 52 iccv-2013-Attribute Adaptation for Personalized Image Search
16 0.81908977 259 iccv-2013-Manifold Based Face Synthesis from Sparse Samples
17 0.81209415 154 iccv-2013-Face Recognition via Archetype Hull Ranking
18 0.80839139 114 iccv-2013-Dictionary Learning and Sparse Coding on Grassmann Manifolds: An Extrinsic Solution
19 0.80765206 335 iccv-2013-Random Faces Guided Sparse Many-to-One Encoder for Pose-Invariant Face Recognition
20 0.80723399 44 iccv-2013-Adapting Classification Cascades to New Domains