iccv iccv2013 iccv2013-285 knowledge-graph by maker-knowledge-mining
Source: pdf
Title: NEIL: Extracting Visual Knowledge from Web Data
Author: Xinlei Chen, Abhinav Shrivastava, Abhinav Gupta
Abstract: We propose NEIL (Never Ending Image Learner), a computer program that runs 24 hours per day and 7 days per week to automatically extract visual knowledge from Internet data. NEIL uses a semi-supervised learning algorithm that jointly discovers common sense relationships (e.g., "Corolla is a kind of/looks similar to Car", "Wheel is a part of Car") and labels instances of the given visual categories. It is an attempt to develop the world's largest visual structured knowledge base with minimum human labeling effort. As of 10th October 2013, NEIL has been continuously running for 2.5 months on a 200-core cluster (more than 350K CPU hours) and has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. During this period, NEIL has discovered more than 1700 relationships and has labeled more than 400K visual instances.

1. Motivation

Recent successes in computer vision can be primarily attributed to the ever increasing size of visual knowledge in terms of labeled instances of scenes, objects, actions, attributes, and the contextual relationships between them. But as we move forward, a key question arises: how will we gather this structured visual knowledge on a vast scale? Recent efforts such as ImageNet [8] and Visipedia [30] have tried to harness human intelligence for this task. However, we believe that these approaches lack both the richness and the scalability required for gathering massive amounts of visual knowledge. For example, at the time of submission, only 7% of the data in ImageNet had bounding boxes, and the relationships were still extracted via WordNet.

In this paper, we consider an alternative approach: automatically extracting visual knowledge from Internet-scale data. The feasibility of extracting knowledge automatically from images and videos will itself depend on the state-of-the-art in computer vision. While we have witnessed significant progress on the task of detection and recognition, we still have a long way to go before we can automatically extract the semantic content of a given image. So, is it really possible to use existing approaches for gathering visual knowledge directly from web data?

1.1. NEIL – Never Ending Image Learner

We propose NEIL, a computer program that runs 24 hours per day, 7 days per week, forever to: (a) semantically understand images on the web; (b) use this semantic understanding to augment its knowledge base with new labeled instances and common sense relationships; (c) use this dataset and these relationships to build better classifiers and detectors, which in turn help improve semantic understanding. NEIL is a constrained semi-supervised learning (SSL) system that exploits the large scale of visual data to automatically extract common sense relationships and then uses these relationships to label visual instances of existing categories. It is an attempt to develop the world's largest visual structured knowledge base with minimum human effort, one that reflects the factual content of the images on the Internet, and that would be useful to many computer vision and AI efforts. Specifically, NEIL can use web data to extract: (a) labeled examples of object categories with bounding boxes; (b) labeled examples of scenes; (c) labeled examples of attributes; (d) visual subclasses for object categories; and (e) common sense relationships about scenes, objects and attributes such as "Corolla is a kind of/looks similar to Car", "Wheel is a part of Car", etc. (see Figure 1).
Figure 1. NEIL is a computer program that runs 24 hours a day and 7 days a week to gather visual knowledge from the Internet. Specifically, it simultaneously labels the data and extracts common sense relationships between categories. Shown are (a) objects (with bounding boxes and visual subcategories), (b) scenes and (c) attributes labeled by NEIL, together with relationships it has extracted, e.g., (O-O) "Wheel is a part of Car", (S-O) "Car is found in Raceway", (O-O) "Corolla is a kind of/looks similar to Car", (S-O) "Pyramid is found in Egypt", (O-A) "Wheel is/has Round shape", (S-A) "Alley is/has Narrow", (S-A) "Bamboo forest is/has Vertical lines", (O-A) "Sunflower is/has Yellow".

We believe our approach is possible for three key reasons:

(a) Macro-vision vs. Micro-vision: We use the term "micro-vision" to refer to the traditional paradigm where the input is an image and the output is some information extracted from that image. In contrast, we define "macro-vision" as a paradigm where the input is a large collection of images and the desired output is extracting significant or interesting patterns in visual data (e.g., cars are detected frequently in raceways). These patterns help us to extract common sense relationships. Note that the key difference is that macro-vision does not require us to understand every image in the corpus and extract all possible patterns. Instead, it relies on understanding a few images and statistically combining evidence from them to build our visual knowledge.

(b) Structure of the Visual World: Our approach exploits the structure of the visual world and builds constraints for detection and classification. These global constraints are represented in terms of common sense relationships between categories. Most prior work uses manually defined relationships or learns relationships in a supervised setting. Our key insight is that at a large scale one can simultaneously label the visual instances and extract common sense relationships in a joint semi-supervised learning framework.

(c) Semantically driven knowledge acquisition: We use a semantic representation for visual knowledge; that is, we group visual data based on semantic categories and develop relationships between semantic categories. This allows us to leverage text-based indexing tools such as Google Image Search to initialize our visual knowledge base learning.

Contributions: Our main contributions are: (a) we propose a never ending learning algorithm for gathering visual knowledge from the Internet via macro-vision; NEIL has been continuously running for 2.5 months on a 200-core cluster. (b) We are automatically building a large visual structured knowledge base which consists not only of labeled instances of scenes, objects, and attributes but also of the relationships between them. While NEIL's core SSL algorithm works with a fixed vocabulary, we also use noun phrases from NELL's ontology [5] to grow our vocabulary. Currently, our growing knowledge base has an ontology of 1152 object categories, 1034 scene categories, and 87 attributes. NEIL has discovered more than 1700 relationships and labeled more than 400K visual instances of these categories. (c) We demonstrate how joint discovery of relationships and labeling of instances at a gigantic scale can provide constraints for improving semi-supervised learning.

2. Related Work

Recent work has focused only on extracting knowledge in the form of large datasets for recognition and classification [8, 23, 30].
One of the most common approaches to building datasets is manual annotation, either by motivated teams of people [30] or by the power of crowds [8, 40]. To minimize human effort, recent work has also focused on active learning [37, 39], which selects the most informative label requests. However, both of these directions have a major limitation: annotations are expensive, prone to errors, biased, and do not scale.

An alternative approach is to use visual recognition for extracting these datasets automatically from the Internet [23, 34, 36]. A common way of automatically creating a dataset is to use image search results and rerank them via visual classifiers [14] or some form of joint clustering in text and visual space [2, 34]. Another approach is to use a semi-supervised framework [42], where a small amount of labeled data is used in conjunction with a large amount of unlabeled data to learn reliable and robust visual models. These seed images can be manually labeled [36] or the top retrievals of a text-based search [23]. The biggest problem with most of these automatic approaches is that the small number of labeled examples or image search results does not provide enough constraints for learning robust visual classifiers. Hence, these approaches suffer from semantic drift [6]. One way to avoid semantic drift is to exploit additional constraints based on the structure of our visual data. Researchers have exploited a variety of constraints such as those based on visual similarity [11, 15], semantic similarity [17] or multiple feature spaces [3]. However, most of these constraints are weak in nature: for example, visual similarity only models the constraint that visually similar images should receive the same labels. On the other hand, our visual world is highly structured: object categories share parts and attributes, objects and scenes have strong contextual relationships, etc. Therefore, we need a way to capture the rich structure of our visual world and exploit this structure during semi-supervised learning.

In recent years, there have been huge advances in modeling the rich structure of our visual world via contextual relationships. Some of these relationships include: Scene-Object [38], Object-Object [31], Object-Attribute [12, 22, 28], and Scene-Attribute [29]. All these relationships can provide a rich set of constraints which can help us improve SSL [4]. For example, scene-attribute relationships such as "amphitheaters are circular" can help improve semi-supervised learning of scene classifiers [36], and WordNet hierarchical relationships can help in propagating segmentations [21]. But the big question is: how do we obtain these relationships? One way to obtain such relationships is via text analysis [5, 18]. However, as [40] points out, the visual knowledge we need to obtain is often so obvious that no one would take the time to write it down and put it on the web. In this work, we argue that, at a large scale, one can jointly discover relationships and constrain the SSL problem for extracting visual knowledge and learning visual classifiers and detectors. Motivated by a never ending learning algorithm for text [5], we propose a never ending visual learning algorithm that cycles between extracting global relationships, labeling data, and learning classifiers/detectors for building visual knowledge from the Internet.
Our work is also related to attribute discovery [33, 35], since these approaches discover attributes and the relationships between objects and attributes simultaneously. However, in our case, we focus only on semantic attributes, and therefore our goal is to discover semantic relationships and semantically label visual instances.

3. Technical Approach

Our goal is to extract visual knowledge from the pool of visual data on the web. We define visual knowledge as any information that can be useful for improving vision tasks such as image understanding and object/scene recognition. One form of visual knowledge would be labeled examples of different categories or labeled segments/boundaries; labeled examples help us learn classifiers or detectors and improve image understanding. Another example of visual knowledge would be relationships; for example, spatial contextual relationships can be used to improve object recognition. In this paper, we represent visual knowledge in terms of labeled examples of semantic categories and the relationships between those categories.

Our knowledge base consists of labeled examples of: (1) Objects (e.g., Car, Corolla); (2) Scenes (e.g., Alley, Church); (3) Attributes (e.g., Blue, Modern). Note that for objects we learn detectors and for scenes we build classifiers; however, for the rest of the paper we will use the terms detector and classifier interchangeably. Our knowledge base also contains relationships of four types: (1) Object-Object (e.g., Wheel is a part of Car); (2) Object-Attribute (e.g., Sheep is/has White); (3) Scene-Object (e.g., Car is found in Raceway); (4) Scene-Attribute (e.g., Alley is/has Narrow).

The outline of our approach is shown in Figure 2. We use Google Image Search to download thousands of images for each object, scene and attribute category. Our method then uses an iterative approach to clean the labels and train detectors/classifiers in a semi-supervised manner. For a given concept (e.g., car), we first discover the latent visual sub-categories and bounding boxes for these sub-categories using an exemplar-based clustering approach (Section 3.1). We then train multiple detectors for a concept (one for each sub-category) using the clustering and localization results. These detectors and classifiers are then used for detections on millions of images to learn relationships based on co-occurrence statistics (Section 3.2). Here, we exploit the fact that we are interested in macro-vision and therefore build co-occurrence statistics using only confident detections/classifications. Once we have relationships, we use them in conjunction with our classifiers and detectors to label the large set of noisy images (Section 3.3). The most confidently labeled images are added to the pool of labeled data and used to retrain the models, and the process repeats itself. At every iteration, we learn better classifiers and detectors, which in turn help us learn more relationships and further constrain the semi-supervised learning problem. We now describe each step in detail below.

3.1. Seeding Classifiers via Google Image Search

The first step in our semi-supervised algorithm is to build classifiers for visual categories. One way to build initial classifiers is via a few manually labeled seed images. Here, we take an alternative approach and use text-based image retrieval systems to provide seed images for training initial detectors. For scene and attribute classifiers, we directly use these retrieved images as positive data.
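For scene and attribute categories, this seeding step amounts to fitting an off-the-shelf discriminative classifier on the retrieved images. The sketch below is illustrative only and is not NEIL's actual code: `extract_features` is a hypothetical stand-in for the GIST and bag-of-words pipeline described in the implementation details, and the linear SVM mirrors the classifiers used in the paper.

```python
# Minimal sketch (assumptions noted above): train a seed scene/attribute
# classifier from text-based image-search results.
import numpy as np
from sklearn.svm import LinearSVC

def train_seed_classifier(positive_images, negative_images, extract_features):
    # Positives: top image-search results for the category.
    # Negatives: randomly sampled images from other categories.
    X_pos = np.array([extract_features(img) for img in positive_images])
    X_neg = np.array([extract_features(img) for img in negative_images])
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    clf = LinearSVC(C=1.0)  # linear SVM trained directly on the seed data
    clf.fit(X, y)
    return clf
```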
However, such an approach fails for training object and attribute detectors for four reasons (Figure 3(a)): (1) Outliers: Due to the imperfectness of text-based image retrieval, the downloaded set usually contains irrelevant images/outliers; (2) Polysemy: In many cases, semantic categories might be overloaded, and a single semantic category might have multiple senses (e.g., apple can mean both the company and the fruit); (3) Visual Diversity: Retrieved images might have high intra-class variation due to different viewpoints, illumination, etc.; (4) Localization: In many cases the retrieved image might be a scene without a bounding box, and hence one needs to localize the concept before training a detector.

Most of the current approaches handle these problems via clustering. Clustering helps in handling visual diversity [9] and discovering multiple senses of retrieval (polysemy) [25]. It can also help us reject outliers based on distances from cluster centers. One simple way to cluster would be to use K-means on the set of all possible bounding boxes and then use the representative clusters as visual sub-categories. However, clustering using K-means has two issues: (1) High Dimensionality: We use the Color HOG (CHOG) [20] representation, and standard distance metrics do not work well in such high dimensions [10]; (2) Scalability: Most clustering approaches tend to partition the complete feature space. In our case, since we do not have bounding boxes provided, every image creates millions of data points, and the majority of these data points are outliers. Recent work has suggested that K-means is not scalable and performs poorly in this scenario since it assigns membership to every data point [10].

Instead, we propose a two-step approach for clustering. In the first step, we mine the set of downloaded images from Google Image Search to create candidate object windows. Specifically, every image is used to train a detector using the recently proposed exemplar-LDA [19]. These detectors are then used for dense detections on the same set of downloaded images. We select the top K windows which have high scores from multiple detectors. Note that this step helps us prune out outliers, as the candidate windows are selected via representativeness (how many detectors fire on them). For example, in Figure 3, none of the tricycle detectors fire on outliers such as circular dots and people eating, and hence these images are rejected at this candidate window step. Once we have candidate windows, we cluster them in the next step. However, instead of using the high-dimensional CHOG representation for clustering, we use the detection signature of each window (represented as a vector of seed-detector ELDA scores on the window) to create a K × K affinity matrix: the (i, j) entry of the affinity matrix is the dot product of this vector for windows i and j. Intuitively, this step connects candidate windows if the same set of detectors fire on both windows. Once we have the affinity matrix, we cluster the candidate windows using the standard affinity propagation algorithm [16]. Affinity propagation also allows us to extract a representative window (prototype) for each cluster, which acts as an iconic image for the object [32] (Figure 3). After clustering, we train a detector for each cluster/sub-category using three-quarters of the images in the cluster. The remaining quarter is used as a validation set for calibration.

Figure 3. An example of how clustering handles polysemy, intra-class variation and outlier removal: (a) Google Image Search results for "tricycle"; (b) sub-category discovery. The bottom row shows our discovered clusters.
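To make the second clustering step concrete, the sketch below shows one way to implement it. It is a hedged illustration, not NEIL's code: `elda_scores` is a hypothetical K × D array holding the score of each of D seed exemplar-LDA detectors on each of the K candidate windows (computing those scores is not shown), and scikit-learn's affinity propagation stands in for the algorithm of [16].

```python
# Illustrative sketch: cluster candidate windows by their detection signature.
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_candidate_windows(elda_scores):
    # The (i, j) affinity is the dot product of the detection signatures of
    # windows i and j, so windows that the same detectors fire on end up
    # in the same cluster.
    affinity = elda_scores @ elda_scores.T              # K x K affinity matrix
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    labels = ap.fit_predict(affinity)                   # cluster id per window
    prototypes = ap.cluster_centers_indices_            # iconic window per cluster
    return labels, prototypes
```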
3.2. Extracting Relationships

Once we have initialized object detectors, attribute detectors, attribute classifiers and scene classifiers, we can use them to extract relationships automatically from the data. The key idea is that we do not need to understand each and every image downloaded from the Internet, but instead to understand the statistical patterns of detections and classifications at a large scale. These patterns can be used to select the top-N relationships at every iteration. Specifically, we extract four different kinds of relationships:

Object-Object Relationships: The first kind of relationships we extract are object-object relationships, which include: (1) partonomy relationships such as "Eye is a part of Baby"; (2) taxonomy relationships such as "BMW 320 is a kind of Car"; and (3) similarity relationships such as "Swan looks similar to Goose". To extract these relationships, we first build a co-detection matrix $O'$ whose elements represent the probability of simultaneous detection of object categories $i$ and $j$. Intuitively, the co-detection matrix has high values when object detector $i$ detects objects inside the bounding box of object $j$ with high detection scores. To account for detectors that fire everywhere and images that have many detections, we normalize the matrix $O'$. The normalized co-detection matrix can be written as $N_1^{-1/2} O' N_2^{-1/2}$, where $N_1$ and $N_2$ are the out-degree and in-degree matrices and the $(i, j)$ element of $O'$ represents the average score of the top detections of detector $i$ on images of object category $j$. Once we have selected a relationship between a pair of categories, we learn its characteristics in terms of the mean and variance of the relative locations, relative aspect ratio, relative scores and relative sizes of the detections. For example, the nose-face relationship is characterized by a small relative window size (the nose is less than 20% of the face area) and by the relative location: the nose occurs near the center of the face. This is used to define a compatibility function $\psi_{i,j}(\cdot)$ which evaluates whether detections from categories $i$ and $j$ are compatible or not. We also classify the relationships into two semantic categories (part-of, taxonomy/similar) using relative features, to provide a human-communicable view of the visual knowledge base.

Object-Attribute Relationships: The second type of relationship we extract is object-attribute relationships such as "Pizza has Round shape", "Sunflower is Yellow", etc. To extract these relationships, we use the same methodology, where the attributes are detected in the labeled examples of object categories. These detections and their scores are then used to build a normalized co-detection matrix which is used to find the top object-attribute relationships.

Scene-Object Relationships: The third type of relationship extracted by our algorithm is scene-object relationships such as "Bus is found in Bus depot" and "Monitor is found in Control room". For extracting scene-object relationships, we run the object detectors on randomly sampled images of different scene classes. The detections are then used to create a normalized co-presence matrix (similar to object-object relationships) whose $(i, j)$ element represents the likelihood of detecting an instance of object category $i$ in images of scene category $j$.
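All of the relationship types above (and the scene-attribute relationships described next) follow the same recipe: accumulate a co-occurrence matrix from confident detections or classifications, normalize it, and keep the top entries. The sketch below illustrates that recipe only; it assumes, as one plausible reading of the paper, that the out-degree and in-degree matrices are diagonal matrices built from the row and column sums of the raw matrix.

```python
# Minimal sketch: normalize a raw co-detection matrix O (row i, column j
# roughly "detector i fires on images of category j") and return the
# highest-scoring category pairs as candidate relationships.
import numpy as np

def top_relationships(O, names, top_n=10):
    d_out = np.diag(1.0 / np.sqrt(O.sum(axis=1) + 1e-12))  # N1^(-1/2), assumed
    d_in = np.diag(1.0 / np.sqrt(O.sum(axis=0) + 1e-12))   # N2^(-1/2), assumed
    O_norm = d_out @ O @ d_in
    np.fill_diagonal(O_norm, 0)      # ignore a category's relation to itself
    order = np.dstack(np.unravel_index(np.argsort(-O_norm, axis=None),
                                       O_norm.shape))[0]
    return [(names[i], names[j], float(O_norm[i, j])) for i, j in order[:top_n]]
```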
Scene-Attribute Relationships: The fourth and final type of relationship extracted by our algorithm is scene-attribute relationships such as "Ocean is Blue", "Alleys are Narrow", etc. Here, we follow a simple methodology for extracting scene-attribute relationships: we compute a co-classification matrix whose $(i, j)$ element represents the average classification score of attribute $i$ on images of scene $j$. The top entries in this co-classification matrix are used to extract scene-attribute relationships.

3.3. Retraining via Labeling New Instances

Once we have the initial set of classifiers/detectors and the set of relationships, we can use them to find new instances of different object and scene categories. These new instances are then added to the set of labeled data, and we retrain new classifiers/detectors using the updated set of labeled data. These new classifiers are then used to extract more relationships, which in turn are used to label more data, and so on. One way to find new instances is to use the detector itself directly: for instance, using the car detector to find more cars. However, this approach leads to semantic drift. To avoid semantic drift, we use the rich set of relationships extracted in the previous section and ensure that the new labeled instances of car satisfy the extracted relationships (e.g., they have wheels, they are found in raceways, etc.).

Mathematically, let $R_O$, $R_A$ and $R_S$ represent the sets of object-object, object-attribute and scene-object relationships at iteration $t$. If $\phi_i(\cdot)$ represents the potential from object detector $i$, $\omega_k(\cdot)$ represents the scene potential, and $\psi_{i,j}(\cdot)$ represents the compatibility function between two object categories $i, j$, then we can find the new instances of object category $i$ using the contextual scoring function given below:

$$\phi_i(x) \;+\; \sum_{(i,j) \in R_O \cup R_A} \phi_j(x_l)\,\psi_{i,j}(x, x_l) \;+\; \sum_{(i,k) \in R_S} \omega_k(x)$$

where $x$ is the window being evaluated and $x_l$ is the top-detected window of the related object/attribute category. The above equation has three terms: the first term is the appearance term for the object category itself and is measured by the score of the SVM detector on the window $x$. The second term measures the compatibility between object category $i$ and the object/attribute category $j$ if the relationship $(i, j)$ is part of the catalogue. For example, if "Wheel is a part of Car" exists in the catalogue, then this term will be the product of the score of the wheel detector and the compatibility function between the wheel window ($x_l$) and the car window ($x$). The final term measures the scene-object compatibility: if the knowledge base contains the relationship "Car is found in Raceway", this term boosts the "Car" detection scores in "Raceway" scenes.

At each iteration, we also add new instances of different scene categories. We find new instances of scene category $k$ using the contextual scoring function given below:

$$\omega_k(x) \;+\; \sum_{(m,k) \in R'_A} \omega_m(x) \;+\; \sum_{(i,k) \in R_S} \phi_i(x_l)$$

where $R'_A$ represents the set of scene-attribute relationships. The above equation has three terms: the first term is the appearance term for the scene category itself and is estimated using the scene classifier. The second term is the appearance term for the attribute category and is estimated using the attribute classifier; this term ensures that if a scene-attribute relationship exists, then the attribute classifier score should be high. The third and final term is the appearance term of an object category and is estimated using the corresponding object detector; this term ensures that if a scene-object relationship exists, then the object detector should detect objects in the scene.
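Read as code, the object scoring function has the following shape. Everything here is a hypothetical stand-in (`phi`, `omega`, `psi`, `top_window` for the learned detectors, scene classifiers, compatibility functions and top detections); the sketch only mirrors the structure of the sum above, not NEIL's implementation.

```python
# Hedged sketch of the contextual score for labeling a new instance of
# object category i on candidate window x.
def contextual_object_score(x, i, phi, omega, psi, R_OA, R_S, top_window):
    # phi[j](x): detector score of category j on window x
    # omega[k](x): scene classifier score of scene k for the image containing x
    # psi[(i, j)](x, xl): compatibility of windows x and xl for relation (i, j)
    # R_OA: object-object and object-attribute relations; R_S: scene-object relations
    score = phi[i](x)                                   # appearance term
    for (a, j) in R_OA:
        if a == i:
            xl = top_window[j]                          # top detection of category j
            score += phi[j](xl) * psi[(i, j)](x, xl)    # relationship compatibility
    for (a, k) in R_S:
        if a == i:
            score += omega[k](x)                        # scene context term
    return score
```

The scene scoring function is analogous, with the scene classifier as the appearance term and the attribute classifiers and object detectors supplying the context terms.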
Implementation Details: To train scene and attribute classifiers, we first extract a 3912-dimensional feature vector from each image. The feature vector includes 512D GIST [27] features, concatenated with bag-of-words representations for SIFT [24], HOG [7], Lab color space, and Texton [26]; the dictionary sizes are 1000, 1000, 400, and 1000, respectively. Features of randomly sampled windows from other categories are used as negative examples for SVM training and hard mining. For object and attribute detection, we use CHOG [20] features with a bin size of 8, and we train the detectors using the latent SVM model (without parts) [13].

4. Experimental Results

We demonstrate the quality of the extracted visual knowledge via qualitative results, verification by human subjects, and quantitative results on tasks such as object detection and scene recognition.

4.1. NEIL Statistics

While NEIL's core algorithm uses a fixed vocabulary, we use noun phrases from NELL [5] to grow NEIL's vocabulary. As of 10th October 2013, NEIL has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. It has downloaded more than 2 million images for extracting the current structured visual knowledge. For bootstrapping our system, we use a few seed images from ImageNet [8], SUN [41] or the top images from Google Image Search. For the purposes of extensive experimental evaluation in this paper, we ran NEIL on steroids (200 cores as opposed to the 30 cores used generally) for the last 2.5 months. NEIL has completed 16 iterations and has labeled more than 400K visual instances (including 300,000 objects with their bounding boxes). It has also extracted 1703 common sense relationships. Readers can browse the current visual knowledge base and download the detectors from www.neil-kb.com.

4.2. Qualitative Results

We first show some qualitative results in terms of the visual knowledge extracted by NEIL. Figure 4 shows the extracted visual sub-categories along with a few labeled instances belonging to each sub-category. It can be seen from the figure that NEIL effectively handles intra-class variation and polysemy via the clustering process. The purity and diversity of the clusters for different concepts indicate that contextual relationships help make our system robust to semantic drift and ensure diversity.

Figure 4. Qualitative examples of bounding-box labeling done by NEIL (categories shown include Nilgai, Yamaha, Violin, Bass and F-18).

Figure 5 shows some qualitative examples of scene-object and object-object relationships extracted by NEIL. It is effective in using a few confident detections to extract interesting relationships. Figure 6 shows some of the interesting scene-attribute and object-attribute relationships extracted by NEIL.

Figure 5. Qualitative examples of Scene-Object (rows 1-2) and Object-Object (rows 3-4) relationships extracted by NEIL, e.g., "Helicopter is found in Airfield", "Leaning tower is found in Pisa", "Zebra is found in Savanna", "Ferris wheel is found in Amusement park", "Opera house is found in Sydney", "Bus is found in Bus depot outdoor", "Throne is found in Throne room", "Van is a kind of/looks similar to Ambulance", "Airplane nose is a part of Airbus 330", "Eye is a part of Baby", "Duck is a kind of/looks similar to Goose", "Monitor is a kind of/looks similar to Desktop computer", "Sparrow is a kind of/looks similar to bird", "Gypsy moth is a kind of/looks similar to Butterfly", "Basketball net is a part of Backboard".
4.3. Evaluating Quality via Human Subjects

Next, we want to evaluate the quality of the visual knowledge extracted by NEIL. It should be noted that an extensive and comprehensive evaluation of the whole NEIL system is an extremely difficult task: it is impractical to evaluate each and every labeled instance and each and every relationship for correctness. Therefore, we randomly sample 500 visual instances and 500 relationships and verify them using human experts. At the end of iteration 6, 79% of the relationships extracted by NEIL are correct, and 98% of the visual data labeled by NEIL has been labeled correctly. We also evaluate the per-iteration correctness of relationships: at iteration 1, more than 96% of the relationships are correct, and by iteration 3 the system stabilizes with 80% of the extracted relationships being correct. While the system does not currently demonstrate any major semantic drift, we plan to continue evaluation and extensive analysis of the knowledge base as NEIL grows older.

We also evaluate the quality of the bounding boxes generated by NEIL. For this, we randomly sample 100 images and label the ground-truth bounding boxes. On the standard intersection-over-union metric, NEIL generates bounding boxes with an average overlap of 0.78 with ground truth. To give context to the difficulty of the task, the standard Objectness algorithm [1] produces bounding boxes with an average overlap of 0.59.

4.4. Using Knowledge for Vision Tasks

Finally, we want to demonstrate the usefulness of the visual knowledge learned by NEIL on standard vision tasks such as object detection and scene classification. Here, we also compare several aspects of our approach: (a) we first compare the quality of our automatically labeled dataset; as baselines, we train classifiers/detectors directly on the seed images downloaded from Google Image Search; (b) we compare NEIL against a standard bootstrapping approach which does not extract/use relationships; (c) finally, we demonstrate the usefulness of relationships by detecting and classifying new test data with and without the learned relationships.

Scene Classification: First we evaluate our visual knowledge for the task of scene classification. We build a dataset of 600 images (12 scene categories) using Flickr images. We compare the performance of our scene classifiers against scene classifiers trained from the top 15 images of Google Image Search (our seed classifiers). We also compare the performance with a standard bootstrapping approach that does not use any relationship extraction. Table 1 shows the results; we use mean average precision (mAP) as the evaluation metric. As the results show, automatic relationship extraction helps us constrain the learning problem, and so the learned classifiers give much better performance. Finally, if we also use the contextual information from NEIL relationships, we get a significant boost in performance.

Table 1. mAP performance for scene classification on 12 categories.
Seed Classifier (15 Google Images)       0.52
Bootstrapping (without relationships)    0.54
NEIL Scene Classifiers                   0.57
NEIL (Classifiers + Relationships)       0.62
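The tables in this section report mean average precision (mAP). For reference, here is a minimal sketch of the metric using scikit-learn's average-precision routine; the paper does not specify its exact AP implementation, so this is only one standard way to compute it.

```python
# Illustrative sketch: mAP over categories from per-category labels and scores.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true_per_class, y_score_per_class):
    # y_true_per_class[c]: binary ground-truth labels for category c
    # y_score_per_class[c]: classifier/detector scores for category c
    aps = [average_precision_score(y_t, y_s)
           for y_t, y_s in zip(y_true_per_class, y_score_per_class)]
    return float(np.mean(aps))
```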
Object Detection: We also evaluate our extracted visual knowledge for the task of object detection. We build a dataset of 1000 images (15 object categories) using Flickr data for testing. We compare the performance against object detectors trained directly using the top-50 and top-450 images from Google Image Search. We also compare the performance of detectors trained after aspect-ratio clustering, HOG-based clustering, and our proposed clustering procedure. Table 2 shows the detection results. Using 450 images from Google Image Search decreases the performance due to noisy retrievals. While other clustering methods help, the gain from our clustering procedure is much larger. Finally, detectors trained using NEIL work better than standard bootstrapping.

Table 2. mAP performance for object detection on 15 categories.
Latent SVM (50 Google Images)              0.34
Latent SVM (450 Google Images)             0.28
Latent SVM (450, Aspect Ratio Clustering)  0.30
Latent SVM (450, HOG-based Clustering)     0.33
Seed Detector (NEIL Clustering)            0.44
Bootstrapping (without relationships)      0.45
NEIL Detector                              0.49
NEIL Detector + Relationships              0.51

Figure 6. Examples of extracted common sense relationships, e.g., "Monitor is found in Control room", "Washing machine is found in Utility room", "Siberian tiger is found in Zoo", "Bullet train is found in Train station platform", "Cougar looks similar to Cat", "Urn looks similar to Goblet", "Samsung galaxy is a kind of Cellphone", "Computer room is/has Modern", "Hallway is/has Narrow", "Building facade is/has Check texture", "Trading floor is/has Crowded", "Umbrella looks similar to Ferris wheel", "Bonfire is found in Volcano".

Acknowledgements: This research was supported by ONR MURI N000141010934 and a gift from Google. The authors would like to thank Tom Mitchell and David Fouhey for insightful discussions. We would also like to thank our computing clusters warp and workhorse for doing all the hard work!

References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? TPAMI, 2010.
[2] T. Berg and D. Forsyth. Animals on the web. In CVPR, 2006.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.
[4] A. Carlson, J. Betteridge, E. R. H. Jr., and T. M. Mitchell. Coupling semi-supervised learning of categories and relations. In NAACL HLT Workshop on SSL for NLP, 2009.
[5] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[6] J. R. Curran, T. Murphy, and B. Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In Pacific Association for Computational Linguistics, 2007.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] S. Divvala, A. Efros, and M. Hebert. How important are 'deformable parts' in the deformable parts model? In ECCV, Parts and Attributes Workshop, 2012.
[10] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? SIGGRAPH, 2012.
[11] S. Ebert, D. Larlus, and B. Schiele. Extracting structures in image collections for object recognition. In ECCV, 2010.
[12] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[13] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010.
[14] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In ECCV, 2004.
[15] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.
[16] B. Frey and D. Dueck. Clustering by passing messages between data points. Science, 2007.
[17] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In CVPR, 2010.
[18] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.
[19] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV, 2012.
[20] F. S. Khan, R. M. Anwer, J. van de Weijer, A. D. Bagdanov, M. Vanrell, and A. M. Lopez. Color attributes for object detection. In CVPR, 2012.
[21] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in ImageNet. In ECCV, 2012.
[22] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[23] L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: Automatic object picture collection via incremental model learning. In CVPR, 2007.
[24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[25] A. Lucchi and J. Weston. Joint image and word sense discrimination for image retrieval. In ECCV, 2012.
[26] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 2004.
[27] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[28] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[29] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012.
[30] P. Perona. Visions of a Visipedia. Proceedings of the IEEE, 2010.
[31] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[32] R. Raguram and S. Lazebnik. Computing iconic summaries of general visual concepts. In Workshop on Internet Vision, 2008.
[33] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.
[34] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In ICCV, 2007.
[35] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Augmented attribute representations. In ECCV, 2012.
[36] A. Shrivastava, S. Singh, and A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In ECCV, 2012.
[37] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR, 2010.
[38] E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, 2005.
[39] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR, 2011.
[40] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, 2004.
[41] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large scale scene recognition from abbey to zoo. In CVPR, 2010.
[42] X. Zhu. Semi-supervised learning literature survey. Technical report, CS, UW-Madison, 2005.
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
Author: Xinlei Chen, Abhinav Shrivastava, Abhinav Gupta
We believe our approach is possible for three key reasons:

(a) Macro-vision vs. Micro-vision: We use the term “micro-vision” to refer to the traditional paradigm where the input is an image and the output is some information extracted from that image. In contrast, we define “macro-vision” as a paradigm where the input is a large collection of images and the desired output is extracting significant or interesting patterns in visual data (e.g., car is detected frequently in raceways). These patterns help us to extract common sense relationships. Note that the key difference is that macro-vision does not require us to understand every image in the corpus and extract all possible patterns. Instead, it relies on understanding a few images and statistically combining evidence from these to build our visual knowledge.

(b) Structure of the Visual World: Our approach exploits the structure of the visual world and builds constraints for detection and classification. These global constraints are represented in terms of common sense relationships between categories. [Figure 1: visual instances labeled by NEIL, namely (a) objects with bounding boxes and visual subcategories, (b) scenes, and (c) attributes, together with relationships extracted by NEIL, e.g., (O-O) "Wheel is a part of Car", (S-O) "Car is found in Raceway", (O-O) "Corolla is a kind of/looks similar to Car", (S-O) "Pyramid is found in Egypt", (O-A) "Wheel is/has Round shape", (S-A) "Alley is/has Narrow", (S-A) "Bamboo forest is/has Vertical lines", (O-A) "Sunflower is/has Yellow". Caption: NEIL is a computer program that runs 24 hours a day and 7 days a week to gather visual knowledge from the Internet. Specifically, it simultaneously labels the data and extracts common sense relationships between categories.] Most prior work uses manually defined relationships or learns relationships in a supervised setting. Our key insight is that at a large scale one can simultaneously label the visual instances and extract common sense relationships in a joint semi-supervised learning framework.

(c) Semantically driven knowledge acquisition: We use a semantic representation for visual knowledge; that is, we group visual data based on semantic categories and develop relationships between semantic categories. This allows us to leverage text-based indexing tools such as Google Image Search to initialize our visual knowledge base learning.

Contributions: Our main contributions are: (a) We propose a never ending learning algorithm for gathering visual knowledge from the Internet via macro-vision. NEIL has been continuously running for 2.5 months on a 200 core cluster. (b) We are automatically building a large visual structured knowledge base which not only consists of labeled instances of scenes, objects, and attributes but also the relationships between them. While NEIL's core SSL algorithm works with a fixed vocabulary, we also use noun phrases from NELL's ontology [5] to grow our vocabulary. Currently, our growing knowledge base has an ontology of 1152 object categories, 1034 scene categories, and 87 attributes. NEIL has discovered more than 1700 relationships and labeled more than 400K visual instances of these categories. (c) We demonstrate how joint discovery of relationships and labeling of instances at a gigantic scale can provide constraints for improving semi-supervised learning.

2. Related Work: Recent work has only focused on extracting knowledge in the form of large datasets for recognition and classification [8, 23, 30].
One of the most commonly used approaches to build datasets is using manual annotations by motivated teams of people [30] or the power of crowds [8, 40]. To minimize human effort, recent works have also focused on active learning [37, 39], which selects the label requests that are most informative. However, both of these directions have a major limitation: annotations are expensive, prone to errors, biased, and do not scale.

An alternative approach is to use visual recognition for extracting these datasets automatically from the Internet [23, 34, 36]. A common way of automatically creating a dataset is to use image search results and rerank them via visual classifiers [14] or some form of joint clustering in text and visual space [2, 34]. Another approach is to use a semi-supervised framework [42]. Here, a small amount of labeled data is used in conjunction with a large amount of unlabeled data to learn reliable and robust visual models. These seed images can be manually labeled [36] or the top retrievals of a text-based search [23]. The biggest problem with most of these automatic approaches is that the small number of labeled examples or image search results do not provide enough constraints for learning robust visual classifiers. Hence, these approaches suffer from semantic drift [6].

One way to avoid semantic drift is to exploit additional constraints based on the structure of our visual data. Researchers have exploited a variety of constraints such as those based on visual similarity [11, 15], semantic similarity [17] or multiple feature spaces [3]. However, most of these constraints are weak in nature: for example, visual similarity only models the constraint that visually similar images should receive the same labels. On the other hand, our visual world is highly structured: object categories share parts and attributes, objects and scenes have strong contextual relationships, etc. Therefore, we need a way to capture the rich structure of our visual world and exploit this structure during semi-supervised learning.

In recent years, there have been huge advances in modeling the rich structure of our visual world via contextual relationships. Some of these relationships include: Scene-Object [38], Object-Object [31], Object-Attribute [12, 22, 28], and Scene-Attribute [29]. All these relationships can provide a rich set of constraints which can help us improve SSL [4]. For example, scene-attribute relationships such as amphitheaters are circular can help improve semi-supervised learning of scene classifiers [36], and Wordnet hierarchical relationships can help in propagating segmentations [21]. But the big question is: how do we obtain these relationships? One way to obtain such relationships is via text analysis [5, 18]. However, as [40] points out, the visual knowledge we need to obtain is so obvious that no one would take the time to write it down and put it on the web.

In this work, we argue that, at a large scale, one can jointly discover relationships and constrain the SSL problem for extracting visual knowledge and learning visual classifiers and detectors. Motivated by a never ending learning algorithm for text [5], we propose a never ending visual learning algorithm that cycles between extracting global relationships, labeling data and learning classifiers/detectors for building visual knowledge from the Internet.
Our work is also related to attribute discovery [33, 35], since these approaches jointly discover the attributes and the relationships between objects and attributes. However, in our case, we only focus on semantic attributes and therefore our goal is to discover semantic relationships and semantically label visual instances.

3. Technical Approach: Our goal is to extract visual knowledge from the pool of visual data on the web. We define visual knowledge as any information that can be useful for improving vision tasks such as image understanding and object/scene recognition. One form of visual knowledge would be labeled examples of different categories or labeled segments/boundaries. Labeled examples help us learn classifiers or detectors and improve image understanding. Another example of visual knowledge would be relationships. For example, spatial contextual relationships can be used to improve object recognition. In this paper, we represent visual knowledge in terms of labeled examples of semantic categories and the relationships between those categories. Our knowledge base consists of labeled examples of: (1) Objects (e.g., Car, Corolla); (2) Scenes (e.g., Alley, Church); (3) Attributes (e.g., Blue, Modern). Note that for objects we learn detectors and for scenes we build classifiers; however, for the rest of the paper we will use the terms detector and classifier interchangeably. Our knowledge base also contains relationships of four types: (1) Object-Object (e.g., Wheel is a part of Car); (2) Object-Attribute (e.g., Sheep is/has White); (3) Scene-Object (e.g., Car is found in Raceway); (4) Scene-Attribute (e.g., Alley is/has Narrow).

The outline of our approach is shown in Figure 2. We use Google Image Search to download thousands of images for each object, scene and attribute category. Our method then uses an iterative approach to clean the labels and train detectors/classifiers in a semi-supervised manner. For a given concept (e.g., car), we first discover the latent visual sub-categories and bounding boxes for these sub-categories using an exemplar-based clustering approach (Section 3.1). We then train multiple detectors for a concept (one for each sub-category) using the clustering and localization results. These detectors and classifiers are then used for detections on millions of images to learn relationships based on co-occurrence statistics (Section 3.2). Here, we exploit the fact that we are interested in macro-vision and therefore build co-occurrence statistics using only confident detections/classifications. Once we have relationships, we use them in conjunction with our classifiers and detectors to label the large set of noisy images (Section 3.3). The most confidently labeled images are added to the pool of labeled data and used to retrain the models, and the process repeats itself. At every iteration, we learn better classifiers and detectors, which in turn help us learn more relationships and further constrain the semi-supervised learning problem. We now describe each step in detail below.
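For concreteness, the kind of knowledge base described above (labeled objects, scenes and attributes plus four types of relationships) can be sketched with a few simple data structures. This is only an illustrative sketch: the class and field names below are assumptions made for exposition, not the authors' actual implementation.

# A minimal, self-contained sketch of one way to represent NEIL's knowledge
# base: labeled instances per category plus typed common sense relationships.
# All names here are illustrative placeholders.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)

@dataclass
class Relationship:
    kind: str      # "object-object", "object-attribute", "scene-object" or "scene-attribute"
    source: str    # e.g. "Wheel"
    target: str    # e.g. "Car"
    phrase: str    # human-readable form, e.g. "Wheel is a part of Car"
    score: float   # confidence derived from co-detection statistics

@dataclass
class KnowledgeBase:
    objects: Dict[str, List[Tuple[str, Box]]] = field(default_factory=dict)  # category -> (image id, box)
    scenes: Dict[str, List[str]] = field(default_factory=dict)               # category -> image ids
    attributes: Dict[str, List[str]] = field(default_factory=dict)           # category -> image ids
    relationships: List[Relationship] = field(default_factory=list)

kb = KnowledgeBase()
kb.objects.setdefault("Car", []).append(("img_000123", (34, 50, 210, 180)))
kb.relationships.append(Relationship(
    kind="object-object", source="Wheel", target="Car",
    phrase="Wheel is a part of Car", score=0.91))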
3.1. Seeding Classifiers via Google Image Search: The first step in our semi-supervised algorithm is to build classifiers for visual categories. One way to build initial classifiers is via a few manually labeled seed images. Here, we take an alternative approach and use text-based image retrieval systems to provide seed images for training initial detectors. For scene and attribute classifiers we directly use these retrieved images as positive data. However, such an approach fails for training object and attribute detectors for four reasons (Figure 3(a)): (1) Outliers: due to the imperfectness of text-based image retrieval, the downloaded images usually contain irrelevant images/outliers; (2) Polysemy: in many cases, semantic categories might be overloaded and a single semantic category might have multiple senses (e.g., apple can mean both the company and the fruit); (3) Visual diversity: retrieved images might have high intra-class variation due to different viewpoints, illumination, etc.; (4) Localization: in many cases the retrieved image might be a scene without a bounding box, and hence one needs to localize the concept before training a detector.

Most of the current approaches handle these problems via clustering. Clustering helps in handling visual diversity [9] and discovering multiple senses of retrieval (polysemy) [25]. It can also help us to reject outliers based on distances from cluster centers. One simple way to cluster would be to use K-means on the set of all possible bounding boxes and then use the representative clusters as visual sub-categories. However, clustering using K-means has two issues: (1) High dimensionality: we use the Color HOG (CHOG) [20] representation, and standard distance metrics do not work well in such high dimensions [10]; (2) Scalability: most clustering approaches tend to partition the complete feature space. In our case, since we do not have bounding boxes provided, every image creates millions of data points and the majority of the data points are outliers. Recent work has suggested that K-means is not scalable and has bad performance in this scenario since it assigns membership to every data point [10].

Instead, we propose to use a two-step approach for clustering. In the first step, we mine the set of downloaded images from Google Image Search to create candidate object windows. Specifically, every image is used to train a detector using the recently proposed exemplar-LDA [19]. These detectors are then used for dense detections on the same set of downloaded images. We select the top K windows which have high scores from multiple detectors. Note that this step helps us prune out outliers, as the candidate windows are selected via representativeness (how many detectors fire on them). For example, in Figure 3, none of the tricycle detectors fire on the outliers such as circular dots and people eating, and hence these images are rejected at this candidate window step. Once we have candidate windows, we cluster them in the next step. However, instead of using the high-dimensional CHOG representation for clustering, we use the detection signature of each window (represented as a vector of seed detector ELDA scores on the window) to create a K × K affinity matrix. The (i, j) entry in the affinity matrix is the dot product of this vector for windows i and j. Intuitively, this step connects candidate windows if the same set of detectors fire on both windows. Once we have the affinity matrix, we cluster the candidate windows using the standard affinity propagation algorithm [16]. Affinity propagation also allows us to extract a representative window (prototype) for each cluster which acts as an iconic image for the object [32] (Figure 3). After clustering, we train a detector for each cluster/sub-category using three-quarters of the images in the cluster. The remaining quarter is used as a validation set for calibration.
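The two-step clustering above can be sketched in a few lines: each candidate window is described by its vector of seed-detector (ELDA) scores, the affinity between two windows is the dot product of these signatures, and the windows are grouped with affinity propagation. The sketch below uses scikit-learn's AffinityPropagation and synthetic scores in place of real ELDA detections, so it only illustrates the mechanics, not the authors' implementation.

import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
# Toy detection signatures: scores[w, d] is the score of seed detector d on
# candidate window w. Three hidden sub-categories fire on disjoint detector subsets.
blocks = []
for group in range(3):
    block = 0.05 * rng.random((30, 30))
    block[:, group * 10:(group + 1) * 10] += rng.uniform(0.5, 1.0, size=(30, 10))
    blocks.append(block)
scores = np.vstack(blocks)            # 90 candidate windows x 30 seed detectors

# K x K affinity matrix: entry (i, j) is the dot product of the detection
# signatures of windows i and j (windows covered by the same detectors are similar).
affinity = scores @ scores.T

ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(affinity)
for cid, exemplar in enumerate(ap.cluster_centers_indices_):
    members = np.flatnonzero(ap.labels_ == cid)
    # The exemplar window acts as the iconic image / prototype of the sub-category.
    print(f"sub-category {cid}: exemplar window {exemplar}, {members.size} member windows")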
3.2. Extracting Relationships: Once we have initialized object detectors, attribute detectors, attribute classifiers and scene classifiers, we can use them to extract relationships automatically from the data. The key idea is that we do not need to understand each and every image downloaded from the Internet, but instead understand the statistical pattern of detections and classifications at a large scale. These patterns can be used to select the top-N relationships at every iteration. Specifically, we extract four different kinds of relationships:

Object-Object Relationships: The first kind of relationship we extract is object-object relationships, which include: (1) Partonomy relationships such as “Eye is a part of Baby”; (2) Taxonomy relationships such as “BMW 320 is a kind of Car”; and (3) Similarity relationships such as “Swan looks similar to Goose”. [Figure 3: (a) Google Image Search results for "tricycle"; (b) sub-category discovery. Caption: an example of how clustering handles polysemy, intra-class variation and outlier removal (a); the bottom row shows our discovered clusters.] To extract these relationships, we first build a co-detection matrix O0 whose elements represent the probability of simultaneous detection of object categories i and j. Intuitively, the co-detection matrix has high values when object detector i detects objects inside the bounding box of object j with high detection scores. To account for detectors that fire everywhere and images which have lots of detections, we normalize the matrix O0. The normalized co-detection matrix can be written as N1^(−1/2) O0 N2^(−1/2), where N1 and N2 are the out-degree and in-degree matrices and the (i, j) element of O0 represents the average score of the top detections of detector i on images of object category j. Once we have selected a relationship between a pair of categories, we learn its characteristics in terms of the mean and variance of the relative locations, relative aspect ratio, relative scores and relative size of the detections. For example, the nose-face relationship is characterized by a low relative window size (the nose is less than 20% of the face area) and a relative location in the center of the face. This is used to define a compatibility function ψi,j(·) which evaluates whether the detections from categories i and j are compatible or not. We also classify the relationships into two semantic categories (part-of, taxonomy/similar) using relative features, to give a human-communicable view of the visual knowledge base.

Object-Attribute Relationships: The second type of relationship we extract is object-attribute relationships such as “Pizza has Round Shape”, “Sunflower is Yellow”, etc. To extract these relationships we use the same methodology, where the attributes are detected in the labeled examples of object categories. These detections and their scores are then used to build a normalized co-detection matrix which is used to find the top object-attribute relationships.

Scene-Object Relationships: The third type of relationship extracted by our algorithm includes scene-object relationships such as “Bus is found in Bus depot” and “Monitor is found in Control room”. For extracting scene-object relationships, we use the object detectors on randomly sampled images of different scene classes. The detections are then used to create the normalized co-presence matrix (similar to object-object relationships), where the (i, j) element represents the likelihood of detecting an instance of object category i in images of scene category j.
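All four relationship types are read off from the same kind of normalized co-occurrence statistics. The following toy sketch shows the normalization N1^(−1/2) O0 N2^(−1/2) and the selection of the highest entries as candidate relationships; the categories and numbers are invented for illustration only.

import numpy as np

categories = ["car", "wheel", "raceway", "corolla"]
# O0[i, j]: average score of the top detections of detector i on images (or
# inside boxes) of category j; toy values standing in for mined statistics.
O0 = np.array([
    [0.0, 0.1, 0.6, 0.8],
    [0.7, 0.0, 0.2, 0.7],
    [0.1, 0.0, 0.0, 0.1],
    [0.6, 0.1, 0.5, 0.0],
])

def inv_sqrt_diag(degrees):
    # Build D^(-1/2) from a vector of degrees, guarding against empty rows/columns.
    with np.errstate(divide="ignore"):
        inv = 1.0 / np.sqrt(degrees)
    inv[~np.isfinite(inv)] = 0.0
    return np.diag(inv)

d_out = O0.sum(axis=1)   # out-degree of each category
d_in = O0.sum(axis=0)    # in-degree of each category
normalized = inv_sqrt_diag(d_out) @ O0 @ inv_sqrt_diag(d_in)   # N1^(-1/2) O0 N2^(-1/2)

# Keep the highest off-diagonal entries as candidate relationships for this iteration.
candidates = [(normalized[i, j], categories[i], categories[j])
              for i in range(len(categories)) for j in range(len(categories)) if i != j]
for score, a, b in sorted(candidates, reverse=True)[:3]:
    print(f"candidate relationship: {a} -> {b} (score {score:.2f})")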
Scene-Attribute Relationships: The fourth and final type of relationship extracted by our algorithm includes scene-attribute relationships such as “Ocean is Blue”, “Alleys are Narrow”, etc. Here, we follow a simple methodology for extracting scene-attribute relationships: we compute a co-classification matrix such that the element (i, j) of the matrix represents the average classification score of attribute i on images of scene j. The top entries in this co-classification matrix are used to extract scene-attribute relationships.

3.3. Retraining via Labeling New Instances: Once we have the initial set of classifiers/detectors and the set of relationships, we can use them to find new instances of different object and scene categories. These new instances are then added to the set of labeled data and we retrain new classifiers/detectors using the updated set of labeled data. These new classifiers are then used to extract more relationships, which in turn are used to label more data, and so on. One way to find new instances is to use the detector itself directly, for instance, using the car detector to find more cars. However, this approach leads to semantic drift. To avoid semantic drift, we use the rich set of relationships we extracted in the previous section and ensure that the new labeled instances of car satisfy the extracted relationships (e.g., has wheels, is found in raceways, etc.).

Mathematically, let RO, RA and RS represent the sets of object-object, object-attribute and scene-object relationships at iteration t. If φi(·) represents the potential from object detector i, ωk(·) represents the scene potential, and ψi,j(·) represents the compatibility function between object categories i and j, then we can find the new instances of object category i using the contextual scoring function given below:

φi(x) + Σ_{(i,j) ∈ RO ∪ RA} φj(xl) ψi,j(x, xl) + Σ_{(i,k) ∈ RS} ωk(x)

where x is the window being evaluated and xl is the top-detected window of the related object/attribute category. The above equation has three terms: the first term is the appearance term for the object category itself and is measured by the score of the SVM detector on the window x. The second term measures the compatibility between object category i and the object/attribute category j if the relationship (i, j) is part of the catalogue. For example, if “Wheel is a part of Car” exists in the catalogue, then this term will be the product of the score of the wheel detector and the compatibility function between the wheel window (xl) and the car window (x). The final term measures the scene-object compatibility. Therefore, if the knowledge base contains the relationship “Car is found in Raceway”, this term boosts the “Car” detection scores in “Raceway” scenes.

At each iteration, we also add new instances of different scene categories. We find new instances of scene category k using the contextual scoring function given below:

ωk(x) + Σ_{(m,k) ∈ RA′} ωm(x) + Σ_{(i,k) ∈ RS} φi(xl)

where RA′ represents the catalogue of scene-attribute relationships. The above equation has three terms: the first term is the appearance term for the scene category itself and is estimated using the scene classifier. The second term is the appearance term for the attribute category and is estimated using the attribute classifier; this term ensures that if a scene-attribute relationship exists, then the attribute classifier score should be high. The third and final term is the appearance term of an object category and is estimated using the corresponding object detector; this term ensures that if a scene-object relationship exists, then the object detector should detect objects in the scene.
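A toy sketch of the object-instance scoring function above, with dummy potentials standing in for real detector, classifier and compatibility outputs; it mirrors the three terms of the equation rather than any released NEIL code.

from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

def contextual_object_score(
    window: Box,
    phi_i: Callable[[Box], float],
    related: List[Tuple[Callable[[Box], float], Box, Callable[[Box, Box], float]]],
    scene_potentials: List[float],
) -> float:
    # First term: appearance score of category i's own detector on the window.
    score = phi_i(window)
    # Second term: phi_j(x_l) * psi_ij(x, x_l) for each related object/attribute
    # category j, where x_l is that category's top-detected window.
    for phi_j, top_window_j, psi_ij in related:
        score += phi_j(top_window_j) * psi_ij(window, top_window_j)
    # Third term: scene potentials omega_k(x) for scenes related to category i.
    score += sum(scene_potentials)
    return score

# Example with dummy potentials: a "car" window boosted by a compatible "wheel"
# detection and by a "raceway" scene classification.
car_window: Box = (50.0, 60.0, 200.0, 120.0)
wheel_window: Box = (70.0, 140.0, 40.0, 40.0)
print(contextual_object_score(
    car_window,
    phi_i=lambda x: 1.2,
    related=[(lambda x: 0.8, wheel_window, lambda x, xl: 0.9)],
    scene_potentials=[0.5],
))  # 1.2 + 0.8 * 0.9 + 0.5 = 2.42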
Implementation Details: To train scene & attribute classifiers, we first extract a 3912-dimensional feature vector from each image. The feature vector includes 512-D GIST [27] features, concatenated with bag-of-words representations for SIFT [24], HOG [7], Lab color space, and Texton [26]. The dictionary sizes are 1000, 1000, 400, and 1000, respectively. Features of randomly sampled windows from other categories are used as negative examples for SVM training and hard mining. For the object and attribute detectors, we use CHOG [20] features with a bin size of 8. We train the detectors using the latent SVM model (without parts) [13].

4. Experimental Results: We demonstrate the quality of the visual knowledge through qualitative results, verification via human subjects, and quantitative results on tasks such as object detection and scene recognition.

4.1. NEIL Statistics: While NEIL's core algorithm uses a fixed vocabulary, we use noun phrases from NELL [5] to grow NEIL's vocabulary. As of 10th October 2013, NEIL has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. It has downloaded more than 2 million images for extracting the current structured visual knowledge. For bootstrapping our system, we use a few seed images from ImageNet [8], SUN [41] or the top images from Google Image Search. For the purposes of the extensive experimental evaluation in this paper, we ran NEIL on steroids (200 cores as opposed to the 30 cores used generally) for the last 2.5 months. NEIL has completed 16 iterations and has labeled more than 400K visual instances (including 300,000 objects with their bounding boxes). It has also extracted 1703 common sense relationships. Readers can browse the current visual knowledge base and download the detectors from www.neil-kb.com.

4.2. Qualitative Results: We first show some qualitative results in terms of the visual knowledge extracted by NEIL. Figure 4 shows the extracted visual sub-categories along with a few labeled instances belonging to each sub-category. [Figure 4: qualitative examples of bounding-box labeling done by NEIL, for categories such as Nilgai, Yamaha, Violin, Bass and F-18.] It can be seen from the figure that NEIL effectively handles intra-class variation and polysemy via the clustering process. The purity and diversity of the clusters for different concepts indicate that contextual relationships help make our system robust to semantic drift and ensure diversity. Figure 5 shows some qualitative examples of scene-object and object-object relationships extracted by NEIL; it is effective in using a few confident detections to extract interesting relationships. Figure 6 shows some of the interesting scene-attribute and object-attribute relationships extracted by NEIL. [Figure 5: qualitative examples of Scene-Object (rows 1-2) and Object-Object (rows 3-4) relationships extracted by NEIL, e.g., "Helicopter is found in Airfield", "Zebra is found in Savanna", "Throne is found in Throne room", "Eye is a part of Baby", "Duck is a kind of/looks similar to Goose", "Monitor is a kind of/looks similar to Desktop computer", "Basketball net is a part of Backboard".]
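Returning briefly to the Implementation Details above, the bookkeeping for the 3912-dimensional scene/attribute feature (512-D GIST plus bag-of-words histograms over SIFT, HOG, Lab and Texton with the stated dictionary sizes) can be sketched as follows; the descriptor extractors themselves are assumed to exist elsewhere, and random vectors stand in for their outputs.

import numpy as np

BLOCKS = [("gist", 512), ("sift_bow", 1000), ("hog_bow", 1000),
          ("lab_bow", 400), ("texton_bow", 1000)]

def assemble_feature(blocks: dict) -> np.ndarray:
    """Concatenate per-descriptor vectors in a fixed order into one 3912-D feature."""
    parts = []
    for name, dim in BLOCKS:
        vec = np.asarray(blocks[name], dtype=np.float64)
        assert vec.shape == (dim,), f"{name} should be {dim}-D, got {vec.shape}"
        parts.append(vec)
    return np.concatenate(parts)

rng = np.random.default_rng(0)
fake_blocks = {name: rng.random(dim) for name, dim in BLOCKS}
feature = assemble_feature(fake_blocks)
print(feature.shape)   # (3912,) = 512 + 1000 + 1000 + 400 + 1000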
4.3. Evaluating Quality via Human Subjects: Next, we want to evaluate the quality of the visual knowledge extracted by NEIL. It should be noted that an extensive and comprehensive evaluation of the whole NEIL system is an extremely difficult task. It is impractical to evaluate each and every labeled instance and each and every relationship for correctness. Therefore, we randomly sample 500 visual instances and 500 relationships, and verify them using human experts. At the end of iteration 6, 79% of the relationships extracted by NEIL are correct, and 98% of the visual data labeled by NEIL has been labeled correctly. We also evaluate the per-iteration correctness of relationships: at iteration 1, more than 96% of the relationships are correct, and by iteration 3 the system stabilizes and 80% of the extracted relationships are correct. While the system does not currently demonstrate any major semantic drift, we plan to continue extensive evaluation and analysis of the knowledge base as NEIL grows older. We also evaluate the quality of the bounding boxes generated by NEIL. For this we randomly sample 100 images and label the ground-truth bounding boxes. On the standard intersection-over-union metric, NEIL generates bounding boxes with 0.78 overlap on average with the ground truth. To give context to the difficulty of the task, the standard Objectness algorithm [1] produces bounding boxes with 0.59 overlap on average.

4.4. Using Knowledge for Vision Tasks: Finally, we want to demonstrate the usefulness of the visual knowledge learned by NEIL on standard vision tasks such as object detection and scene classification. Here, we will also compare several aspects of our approach: (a) We first compare the quality of our automatically labeled dataset. As baselines, we train classifiers/detectors directly on the seed images downloaded from Google Image Search. (b) We compare NEIL against a standard bootstrapping approach which does not extract/use relationships. (c) Finally, we will demonstrate the usefulness of relationships by detecting and classifying new test data with and without the learned relationships.

Scene Classification: First we evaluate our visual knowledge for the task of scene classification. We build a dataset of 600 images (12 scene categories) using Flickr images. We compare the performance of our scene classifiers against the scene classifiers trained from the top 15 images of Google Image Search (our seed classifier). We also compare the performance with the standard bootstrapping approach without using any relationship extraction. Table 1 shows the results. We use mean average precision (mAP) as the evaluation metric. As the results show, automatic relationship extraction helps us to constrain the learning problem, and so the learned classifiers give much better performance. Finally, if we also use the contextual information from NEIL relationships, we get a significant boost in performance.
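Mean average precision (mAP), the metric reported in Tables 1 and 2 below, can be computed roughly as in the following sketch. It uses one common formulation of average precision (the mean of the precision values at each correctly ranked positive); the per-class labels and scores are toy stand-ins, and the detection experiments additionally involve a box-overlap criterion that is omitted here.

import numpy as np

def average_precision(labels: np.ndarray, scores: np.ndarray) -> float:
    """AP for one class: mean of the precision values at each ranked positive."""
    order = np.argsort(-scores)
    ranked = labels[order].astype(bool)
    cumulative_tp = np.cumsum(ranked)
    precision_at_k = cumulative_tp / (np.arange(ranked.size) + 1)
    return float(precision_at_k[ranked].mean())

def mean_average_precision(per_class) -> float:
    # per_class: iterable of (binary labels, classifier scores) pairs, one per category.
    return float(np.mean([average_precision(l, s) for l, s in per_class]))

rng = np.random.default_rng(0)
toy = [(rng.integers(0, 2, 50), rng.random(50)) for _ in range(12)]  # 12 scene classes
print(f"toy mAP: {mean_average_precision(toy):.2f}")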
Table 1. mAP performance for scene classification on 12 categories:
Seed Classifier (15 Google Images): 0.52
Bootstrapping (without relationships): 0.54
NEIL Scene Classifiers: 0.57
NEIL (Classifiers + Relationships): 0.62

Object Detection: We also evaluate our extracted visual knowledge for the task of object detection. We build a dataset of 1000 images (15 object categories) using Flickr data for testing. We compare the performance against object detectors trained directly using (top-50 and top-450) images from Google Image Search. We also compare the performance of detectors trained after aspect-ratio clustering, HOG-based clustering and our proposed clustering procedure. Table 2 shows the detection results. Using 450 images from Google Image Search decreases the performance due to noisy retrievals. While other clustering methods help, the gain from our clustering procedure is much larger. Finally, detectors trained using NEIL work better than standard bootstrapping.

[Figure 6: examples of extracted common sense relationships, e.g., "Monitor is found in Control room", "Washing machine is found in Utility room", "Siberian tiger is found in Zoo", "Bullet train is found in Train station platform", "Cougar looks similar to Cat", "Urn looks similar to Goblet", "Samsung galaxy is a kind of Cellphone", "Computer room is/has Modern", "Hallway is/has Narrow", "Building facade is/has Check texture", "Trading floor is/has Crowded", "Umbrella looks similar to Ferris wheel", "Bonfire is found in Volcano".]

Table 2. mAP performance for object detection on 15 categories:
Latent SVM (50 Google Images): 0.34
Latent SVM (450 Google Images): 0.28
Latent SVM (450, Aspect Ratio Clustering): 0.30
Latent SVM (450, HOG-based Clustering): 0.33
Seed Detector (NEIL Clustering): 0.44
Bootstrapping (without relationships): 0.45
NEIL Detector: 0.49
NEIL Detector + Relationships: 0.51

Acknowledgements: This research was supported by ONR MURI N000141010934 and a gift from Google. The authors would like to thank Tom Mitchell and David Fouhey for insightful discussions. We would also like to thank our computing clusters warp and workhorse for doing all the hard work!

References
[1] B. Alexe, T. Deselares, and V. Ferrari. What is an object? TPAMI, 2010.
[2] T. Berg and D. Forsyth. Animals on the web. In CVPR, 2006.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.
[4] A. Carlson, J. Betteridge, E. R. H. Jr., and T. M. Mitchell. Coupling semi-supervised learning of categories and relations. In NAACL HLT Workshop on SSL for NLP, 2009.
[5] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[6] J. R. Curran, T. Murphy, and B. Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In Pacific Association for Computational Linguistics, 2007.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[8] J. Deng, W. Dong, R. Socher, J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] S. Divvala, A. Efros, and M. Hebert. How important are 'deformable parts' in the deformable parts model? In ECCV, Parts and Attributes Workshop, 2012.
[10] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? SIGGRAPH, 2012.
[11] S. Ebert, D. Larlus, and B. Schiele. Extracting structures in image collections for object recognition. In ECCV, 2010.
[12] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[13] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010.
[14] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In ECCV, 2004.
[15] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.
[16] B. Frey and D. Dueck. Clustering by passing messages between data points. Science, 2007.
[17] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In CVPR, 2010.
[18] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.
[19] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV, 2012.
[20] S. Khan, F. Anwer, R. Muhammad, J. van de Weijer, A. Joost, M. Vanrell, and A. Lopez. Color attributes for object detection. In CVPR, 2012.
[21] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in ImageNet. In ECCV, 2012.
[22] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[23] L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: Automatic object picture collection via incremental model learning. In CVPR, 2007.
[24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[25] A. Lucchi and J. Weston. Joint image and word sense discrimination for image retrieval. In ECCV, 2012.
[26] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 2004.
[27] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[28] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[29] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012.
[30] P. Perona. Visions of a Visipedia. Proceedings of the IEEE, 2010.
[31] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[32] R. Raguram and S. Lazebnik. Computing iconic summaries of general visual concepts. In Workshop on Internet Vision, 2008.
[33] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.
[34] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In ICCV, 2007.
[35] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Augmented attribute representations. In ECCV, 2012.
[36] A. Shrivastava, S. Singh, and A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In ECCV, 2012.
[37] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR, 2010.
[38] E. Sudderth, A. Torralba, W. T. Freeman, and A. Wilsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, 2005.
[39] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR, 2011.
[40] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, 2004.
[41] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large scale scene recognition from abbey to zoo. In CVPR, 2010.
[42] X. Zhu. Semi-supervised learning literature survey. Technical report, CS, UW-Madison, 2005.
2 0.20004779 31 iccv-2013-A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects
Author: Xiaoyang Wang, Qiang Ji
Abstract: This paper proposes a unified probabilistic model to model the relationships between attributes and objects for attribute prediction and object recognition. As a list of semantically meaningful properties of objects, attributes generally relate to each other statistically. In this paper, we propose a unified probabilistic model to automatically discover and capture both the object-dependent and objectindependent attribute relationships. The model utilizes the captured relationships to benefit both attribute prediction and object recognition. Experiments on four benchmark attribute datasets demonstrate the effectiveness of the proposed unified model for improving attribute prediction as well as object recognition in both standard and zero-shot learning cases.
3 0.12760118 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
Author: Jungseock Joo, Shuo Wang, Song-Chun Zhu
Abstract: We present a part-based approach to the problem of human attribute recognition from a single image of a human body. To recognize the attributes of a human from the body parts, it is important to reliably detect the parts. This is a challenging task due to geometric variation such as articulation and view-point changes, as well as the appearance variation of the parts arising from versatile clothing types. The prior works have primarily focused on handling geometric variation by relying on pre-trained part detectors or pose estimators, which require manual part annotation, but the appearance variation has been relatively neglected in these works. This paper explores the importance of the appearance variation, which is directly related to the main task, attribute recognition. To this end, we propose to learn a rich appearance part dictionary of humans with significantly less supervision by decomposing the image lattice into overlapping windows at multiple scales and iteratively refining local appearance templates. We also present quantitative results in which our proposed method outperforms the existing approaches.
4 0.1123089 179 iccv-2013-From Subcategories to Visual Composites: A Multi-level Framework for Object Detection
Author: Tian Lan, Michalis Raptis, Leonid Sigal, Greg Mori
Abstract: The appearance of an object changes profoundly with pose, camera view and interactions of the object with other objects in the scene. This makes it challenging to learn detectors based on an object-level label (e.g., “car”). We postulate that having a richer set of labelings (at different levels of granularity) for an object, including finer-grained subcategories, consistent in appearance and view, and higher-order composites (contextual groupings of objects consistent in their spatial layout and appearance), can significantly alleviate these problems. However, obtaining such a rich set of annotations, including annotation of an exponentially growing set of object groupings, is simply not feasible. We propose a weakly-supervised framework for object detection where we discover subcategories and the composites automatically with only traditional object-level category labels as input. To this end, we first propose an exemplar-SVM-based clustering approach, with latent SVM refinement, that discovers a variable length set of discriminative subcategories for each object class. We then develop a structured model for object detection that captures interactions among object subcategories and automatically discovers semantically meaningful and discriminatively relevant visual composites. We show that this model produces state-of-the-art performance on the UIUC phrase object detection benchmark.
5 0.10958304 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors
Author: Ramazan Gokberk Cinbis, Jakob Verbeek, Cordelia Schmid
Abstract: We present an object detection system based on the Fisher vector (FV) image representation computed over SIFT and color descriptors. For computational and storage efficiency, we use a recent segmentation-based method to generate class-independent object detection hypotheses, in combination with data compression techniques. Our main contribution is a method to produce tentative object segmentation masks to suppress background clutter in the features. Re-weighting the local image features based on these masks is shown to improve object detection significantly. We also exploit contextual features in the form of a full-image FV descriptor, and an inter-category rescoring mechanism. Our experiments on the PASCAL VOC 2007 and 2010 datasets show that our detector improves over the current state-of-the-art detection results.
6 0.10875855 52 iccv-2013-Attribute Adaptation for Personalized Image Search
7 0.10476146 406 iccv-2013-Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time
8 0.10238067 142 iccv-2013-Ensemble Projection for Semi-supervised Image Classification
9 0.10149392 236 iccv-2013-Learning Discriminative Part Detectors for Image Classification and Cosegmentation
10 0.093712419 201 iccv-2013-Holistic Scene Understanding for 3D Object Detection with RGBD Cameras
11 0.088144034 399 iccv-2013-Spoken Attributes: Mixing Binary and Relative Attributes to Say the Right Thing
12 0.085572928 246 iccv-2013-Learning the Visual Interpretation of Sentences
13 0.085545152 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
14 0.084387355 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
15 0.082908936 190 iccv-2013-Handling Occlusions with Franken-Classifiers
16 0.082828954 69 iccv-2013-Capturing Global Semantic Relationships for Facial Action Unit Recognition
17 0.08279597 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval
18 0.082762532 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
19 0.082387626 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
20 0.079870425 53 iccv-2013-Attribute Dominance: What Pops Out?
same-paper 1 0.9479627 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
Author: Xinlei Chen, Abhinav Shrivastava, Abhinav Gupta
Abstract: We propose NEIL (NeverEnding Image Learner), a computer program that runs 24 hours per day and 7 days per week to automatically extract visual knowledge from Internet data. NEIL uses a semi-supervised learning algorithm that jointly discovers common sense relationships (e.g., “Corolla is a kind of/looks similar to Car”, “Wheel is a part of Car”) and labels instances of the given visual categories. It is an attempt to develop the world’s largest visual structured knowledge base with minimum human labeling effort. As of 10th October 2013, NEIL has been continuously running for 2.5 months on 200 core cluster (more than 350K CPU hours) and has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. During this period, NEIL has discovered more than 1700 relationships and has labeled more than 400K visual instances. 1. Motivation Recent successes in computer vision can be primarily attributed to the ever increasing size of visual knowledge in terms of labeled instances of scenes, objects, actions, attributes, and the contextual relationships between them. But as we move forward, a key question arises: how will we gather this structured visual knowledge on a vast scale? Recent efforts such as ImageNet [8] and Visipedia [30] have tried to harness human intelligence for this task. However, we believe that these approaches lack both the richness and the scalability required for gathering massive amounts of visual knowledge. For example, at the time of submission, only 7% of the data in ImageNet had bounding boxes and the relationships were still extracted via Wordnet. In this paper, we consider an alternative approach of automatically extracting visual knowledge from Internet scale data. The feasibility of extracting knowledge automatically from images and videos will itself depend on the state-ofthe-art in computer vision. While we have witnessed significant progress on the task of detection and recognition, we still have a long way to go for automatically extracting the semantic content of a given image. So, is it really possible to use existing approaches for gathering visual knowledge directly from web data? 1.1. NEIL – Never Ending Image Learner We propose NEIL, a computer program that runs 24 hours per day, 7 days per week, forever to: (a) semantically understand images on the web; (b) use this semantic understanding to augment its knowledge base with new labeled instances and common sense relationships; (c) use this dataset and these relationships to build better classifiers and detectors which in turn help improve semantic understanding. NEIL is a constrained semi-supervised learning (SSL) system that exploits the big scale of visual data to automatically extract common sense relationships and then uses these relationships to label visual instances of existing categories. It is an attempt to develop the world’s largest visual structured knowledge base with minimum human effort one that reflects the factual content of the images on the Internet, and that would be useful to many computer vision and AI efforts. Specifically, NEIL can use web data to extract: (a) Labeled examples of object categories with bounding boxes; (b) Labeled examples of scenes; (c) Labeled examples of attributes; (d) Visual subclasses for object categories; and (e) Common sense relationships about scenes, objects and attributes like “Corolla is a kind of/looks similar to Car”, “Wheel is a part ofCar”, etc. (See Figure 1). 
We believe our approach is possible for three key reasons: (a) Macro-vision vs. Micro-vision: We use the term “micro-vision” to refer to the traditional paradigm where the input is an image and the output is some information extracted from that image. In contrast, we define “macrovision” as a paradigm where the input is a large collection of images and the desired output is extracting significant or interesting patterns in visual data (e.g., car is detected frequently in raceways). These patterns help us to extract common sense relationships. Note, the key difference is that macro-vision does not require us to understand every image in the corpora and extract all possible patterns. Instead, it relies on understanding a few images and statistically combine evidence from these to build our visual knowledge. – (b) Structure of the Visual World: Our approach exploits the structure of the visual world and builds constraints for detection and classification. These global constraints are represented in terms of common sense relationships be1409 orCllaoraC Hloc e yrs(a) Objects (w/Bounding Boxes and Vislue hWal Subcategories) aongkPi lrt(b) ScenewyacaResaephs nuoRd(c) At d worCreibutes Visual Instances Labeled by NEIL (O-O) Wheel is a part of Car. (S-O) Car is found in Raceway. (O-O) Corolla is a kind of/looks similar to Car. (S-O) Pyramid is found in Egypt. (O-A) Wheel is/has Round shape. (S-A) Alley is/has Narrow. (S-A) Bamboo forest is/has Vertical lines. (O-A) Sunflower is/has Yellow. Relationships Extracted by NEIL Figure 1. NEIL is a computer program that runs 24 hours a day and 7 days a week to gather visual knowledge from the Internet. Specifically, it simultaneously labels the data and extracts common sense relationships between categories. tween categories. Most prior work uses manually defined relationships or learns relationships in a supervised setting. Our key insight is that at a large scale one can simultane- ously label the visual instances and extract common sense relationships in ajoint semi-supervised learning framework. (c) Semantically driven knowledge acquisition: We use a semantic representation for visual knowledge; that is, we group visual data based on semantic categories and develop relationships between semantic categories. This allows us to leverage text-based indexing tools such as Google Image Search to initialize our visual knowledge base learning. Contributions: Our main contributions are: (a) We propose a never ending learning algorithm for gathering visual knowledge from the Internet via macro-vision. NEIL has been continuously running for 2.5 months on a 200 core cluster; (b) We are automatically building a large visual structured knowledge base which not only consists of labeled instances of scenes, objects, and attributes but also the relationships between them. While NEIL’s core SSL algorithm works with a fixed vocabulary, we also use noun phrases from NELL’s ontology [5] to grow our vocabulary. Currently, our growing knowledge base has an ontology of 1152 object categories, 1034 scene categories, and 87 attributes. NEIL has discovered more than 1700 relationships and labeled more than 400K visual instances of these categories. (c) We demonstrate how joint discovery of relationships and labeling of instances at a gigantic scale can provide constraints for improving semi-supervised learning. 2. Related Work Recent work has only focused on extracting knowledge in the form of large datasets for recognition and classification [8, 23, 30]. 
One of the most commonly used approaches to build datasets is using manual annotations by motivated teams of people [30] or the power of crowds [8, 40]. To minimize human effort, recent works have also focused on active learning [37, 39] which selects label requests that are most informative. However, both of these directions have a major limitation: annotations are expensive, prone to errors, biased and do not scale. An alternative approach is to use visual recognition for extracting these datasets automatically from the Internet [23, 34, 36]. A common way of automatically creating a dataset is to use image search results and rerank them via visual classifiers [14] or some form of joint-clustering in text and visual space [2, 34]. Another approach is to use a semi-supervised framework [42]. Here, a small amount of labeled data is used in conjunction with a large amount of unlabeled data to learn reliable and robust visual models. These seed images can be manually labeled [36] or the top retrievals of a text-based search [23]. The biggest problem with most of these automatic approaches is that the small number of labeled examples or image search results do not provide enough constraints for learning robust visual classifiers. Hence, these approaches suffer from semantic drift [6]. One way to avoid semantic drift is to exploit additional constraints based on the structure of our visual data. Researchers have exploited a variety of constraints such as those based on visual similarity [11, 15], seman- tic similarity [17] or multiple feature spaces [3]. However, most of these constraints are weak in nature: for example, visual similarity only models the constraint that visuallysimilar images should receive the same labels. On the other hand, our visual world is highly structured: object cate1410 gories share parts and attributes, objects and scenes have strong contextual relationships, etc. Therefore, we need a way to capture the rich structure of our visual world and exploit this structure during semi-supervised learning. In recent years, there have been huge advances in modeling the rich structure of our visual world via contextual relationships. Some of these relationships include: SceneObject [38], Object-Object [3 1], Object-Attribute [12, 22, 28], Scene-Attribute [29]. All these relationships can provide a rich set of constraints which can help us improve SSL [4]. For example, scene-attribute relationships such as amphitheaters are circular can help improve semisupervised learning of scene classifiers [36] and Wordnet hierarchical relationships can help in propagating segmentations [21]. But the big question is: how do we obtain these relationships? One way to obtain such relationships is via text analysis [5, 18]. However, as [40] points out that the visual knowledge we need to obtain is so obvious that no one would take the time to write it down and put it on web. In this work, we argue that, at a large-scale, one can jointly discover relationships and constrain the SSL prob- lem for extracting visual knowledge and learning visual classifiers and detectors. Motivated by a never ending learning algorithm for text [5], we propose a never ending visual learning algorithm that cycles between extracting global relationships, labeling data and learning classifiers/detectors for building visual knowledge from the Internet. 
Our work is also related to attribute discovery [33, 35] since these approaches jointly discover the attributes and relationships between objects and attributes simultaneously. However, in our case, we only focus on semantic attributes and therefore our goal is to discover semantic relationships and semantically label visual instances. 3. Technical Approach Our goal is to extract visual knowledge from the pool of visual data on the web. We define visual knowledge as any information that can be useful for improving vision tasks such as image understanding and object/scene recognition. One form of visual knowledge would be labeled examples of different categories or labeled segments/boundaries. Labeled examples helps us learn classifiers or detectors and improve image understanding. Another example of visual knowledge would be relationships. For example, spatial contextual relationships can be used to improve object recognition. In this paper, we represent visual knowledge in terms of labeled examples of semantic categories and the relationships between those categories. Our knowledge base consists of labeled examples of: (1) Objects (e.g., Car, Corolla); (2) Scenes (e.g., Alley, Church); (3) Attributes (e.g., Blue, Modern). Note that for objects we learn detectors and for scenes we build classifiers; however for the rest of the paper we will use the term detector and classifier interchangeably. Our knowledge base also contains relationships of four types: (1) Object-Object (e.g., Wheel is a part of Car);(2) Object-Attribute (e.g., Sheep is/has White); (3) Scene-Object (e.g., Car is found in Raceway); (4) SceneAttribute (e.g., Alley is/has Narrow). The outline of our approach is shown in Figure 2. We use Google Image Search to download thousands of images for each object, scene and attribute category. Our method then uses an iterative approach to clean the labels and train detectors/classifiers in a semi-supervised manner. For a given concept (e.g., car), we first discover the latent visual subcategories and bounding boxes for these sub-categories using an exemplar-based clustering approach (Section 3. 1). We then train multiple detectors for a concept (one for each sub-category) using the clustering and localization results. These detectors and classifiers are then used for detections on millions of images to learn relationships based on cooccurrence statistics (Section 3.2). Here, we exploit the fact the we are interested in macro-vision and therefore build co-occurrence statistics using only confident detections/classifications. Once we have relationships, we use them in conjunction with our classifiers and detectors to label the large set of noisy images (Section 3.3). The most confidently labeled images are added to the pool of labeled data and used to retrain the models, and the process repeats itself. At every iteration, we learn better classifiers and detectors, which in turn help us learn more relationships and further constrain the semi-supervised learning problem. We now describe each step in detail below. 3.1. Seeding Classifiers via Google Image Search The first step in our semi-supervised algorithm is to build classifiers for visual categories. One way to build initial classifiers is via a few manually labeled seed images. Here, we take an alternative approach and use text-based image retrieval systems to provide seed images for training initial detectors. For scene and attribute classifiers we directly use these retrieved images as positive data. 
However, such an approach fails for training object and attribute detectors for four reasons (Figure 3(a)): (1) Outliers: due to the imperfectness of text-based image retrieval, the downloaded images usually contain irrelevant images/outliers; (2) Polysemy: in many cases, semantic categories might be overloaded and a single semantic category might have multiple senses (e.g., apple can mean both the company and the fruit); (3) Visual diversity: retrieved images might have high intra-class variation due to different viewpoints, illumination, etc.; (4) Localization: in many cases the retrieved image might be a scene without a bounding box, and hence one needs to localize the concept before training a detector.

Most of the current approaches handle these problems via clustering. Clustering helps in handling visual diversity [9] and discovering multiple senses of retrieval (polysemy) [25]. It can also help us reject outliers based on distances from cluster centers. One simple way to cluster would be to use K-means on the set of all possible bounding boxes and then use the representative clusters as visual sub-categories. However, clustering using K-means has two issues: (1) High dimensionality: we use the Color HOG (CHOG) [20] representation, and standard distance metrics do not work well in such high dimensions [10]; (2) Scalability: most clustering approaches tend to partition the complete feature space. In our case, since we do not have bounding boxes provided, every image creates millions of data points and the majority of the data points are outliers. Recent work has suggested that K-means is not scalable and performs poorly in this scenario since it assigns membership to every data point [10].

Instead, we propose a two-step approach for clustering. In the first step, we mine the set of images downloaded from Google Image Search to create candidate object windows. Specifically, every image is used to train a detector using the recently proposed exemplar-LDA [19]. These detectors are then used for dense detections on the same set of downloaded images. We select the top K windows which have high scores from multiple detectors. Note that this step helps us prune out outliers, as the candidate windows are selected via representativeness (how many detectors fire on them). For example, in Figure 3, none of the tricycle detectors fire on the outliers such as circular dots and people eating, and hence these images are rejected at this candidate window step. Once we have candidate windows, we cluster them in the next step. However, instead of using the high-dimensional CHOG representation for clustering, we use the detection signature of each window (represented as a vector of seed detector ELDA scores on the window) to create a K × K affinity matrix: the (i, j) entry of the affinity matrix is the dot product of this vector for windows i and j. Intuitively, this step connects candidate windows if the same set of detectors fire on both windows. Once we have the affinity matrix, we cluster the candidate windows using the standard affinity propagation algorithm [16]. Affinity propagation also allows us to extract a representative window (prototype) for each cluster, which acts as an iconic image for the object [32] (Figure 3). After clustering, we train a detector for each cluster/sub-category using three-quarters of the images in the cluster. The remaining quarter is used as a validation set for calibration.
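To make the second clustering step concrete, here is a minimal sketch under the assumption that the candidate windows and their seed-detector score vectors have already been mined; it substitutes scikit-learn's affinity propagation for whatever implementation of [16] the authors used, and the array shapes, function name and toy data are illustrative only.

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    def cluster_candidate_windows(elda_scores):
        """Cluster candidate windows by their detection signatures.

        elda_scores: (num_windows, num_seed_detectors) array; entry (w, d)
        is the score of the d-th exemplar-LDA seed detector on window w.
        Both the array and its name are hypothetical stand-ins for the
        output of the window-mining step.
        """
        # Detection-signature affinity: entry (i, j) is the dot product of
        # the score vectors of windows i and j, so windows on which the
        # same seed detectors fire end up connected.
        affinity = elda_scores @ elda_scores.T

        ap = AffinityPropagation(affinity="precomputed", random_state=0)
        labels = ap.fit_predict(affinity)

        # The exemplar (prototype) of each cluster acts as its iconic window.
        return labels, ap.cluster_centers_indices_

    # Toy usage: random scores stand in for real ELDA detections.
    rng = np.random.default_rng(0)
    labels, prototypes = cluster_candidate_windows(rng.random((50, 20)))
    print(len(set(labels)), "sub-categories; prototype windows:", prototypes)

The point of the detection-signature representation is that similarity is measured in "which detectors fire" space rather than in the high-dimensional CHOG space, which is exactly the scalability and distance-metric issue raised above.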
3.2. Extracting Relationships

Once we have initialized object detectors, attribute detectors, attribute classifiers and scene classifiers, we can use them to extract relationships automatically from the data. The key idea is that we do not need to understand each and every image downloaded from the Internet, but instead understand the statistical pattern of detections and classifications at a large scale. These patterns can be used to select the top-N relationships at every iteration. Specifically, we extract four different kinds of relationships:

Object-Object Relationships: The first kind of relationship we extract are object-object relationships, which include: (1) partonomy relationships such as "Eye is a part of Baby"; (2) taxonomy relationships such as "BMW 320 is a kind of Car"; and (3) similarity relationships such as "Swan looks similar to Goose".

[Figure 3. An example of how clustering handles polysemy, intra-class variation and outlier removal: (a) Google Image Search results for "tricycle"; (b) sub-category discovery. The bottom row shows our discovered clusters.]

To extract these relationships, we first build a co-detection matrix O0 whose elements represent the probability of simultaneous detection of object categories i and j. Intuitively, the co-detection matrix has high values when object detector i detects objects inside the bounding box of object j with high detection scores. To account for detectors that fire everywhere and for images which have lots of detections, we normalize the matrix O0. The normalized co-detection matrix can be written as N1^(-1/2) O0 N2^(-1/2), where N1 and N2 are the out-degree and in-degree matrices and the (i, j) element of O0 represents the average score of the top detections of detector i on images of object category j. Once we have selected a relationship between a pair of categories, we learn its characteristics in terms of the mean and variance of the relative locations, relative aspect ratio, relative scores and relative size of the detections. For example, the nose-face relationship is characterized by a low relative window size (the nose is less than 20% of the face area) and a relative location in which the nose occurs at the center of the face. This is used to define a compatibility function ψ_{i,j}(·) which evaluates whether the detections from categories i and j are compatible or not. We also classify the relationships into two semantic categories (part-of, taxonomy/similar) using relative features, to give a human-communicable view of the visual knowledge base.

Object-Attribute Relationships: The second type of relationship we extract is object-attribute relationships such as "Pizza has Round Shape", "Sunflower is Yellow", etc. To extract these relationships we use the same methodology, where the attributes are detected in the labeled examples of object categories. These detections and their scores are then used to build a normalized co-detection matrix which is used to find the top object-attribute relationships.

Scene-Object Relationships: The third type of relationship extracted by our algorithm is scene-object relationships such as "Bus is found in Bus depot" and "Monitor is found in Control room". For extracting scene-object relationships, we use the object detectors on randomly sampled images of different scene classes. The detections are then used to create a normalized co-presence matrix (similar to object-object relationships) where the (i, j) element represents the likelihood of detecting an instance of object category i in scene category j.

Scene-Attribute Relationships: The fourth and final type of relationship extracted by our algorithm is scene-attribute relationships such as "Ocean is Blue", "Alleys are Narrow", etc. Here, we follow a simple methodology for extracting scene-attribute relationships: we compute a co-classification matrix such that the element (i, j) of the matrix represents the average classification score of attribute i on images of scene j. The top entries in this co-classification matrix are used to extract scene-attribute relationships.
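The normalization and top-N selection above can be written down in a few lines. The sketch below assumes a co-detection (or co-presence/co-classification) matrix O0 has already been accumulated from confident detections; the function name, the epsilon guard and the toy data are illustrative, not part of the released system.

    import numpy as np

    def top_relationships(O0, top_n=10, eps=1e-12):
        """Rank category pairs by the degree-normalized co-detection score.

        O0 is assumed to be accumulated upstream so that O0[i, j] is the
        average score of the top detections of detector i on images (or
        inside boxes) of category j, using confident detections only.
        """
        out_deg = O0.sum(axis=1)   # diagonal of N1 (out-degree)
        in_deg = O0.sum(axis=0)    # diagonal of N2 (in-degree)

        # Normalized co-detection matrix N1^(-1/2) O0 N2^(-1/2).
        O = O0 / np.sqrt(np.outer(out_deg, in_deg) + eps)

        # Drop self-pairs, then return the indices of the top-N entries,
        # which become candidate relationships for this iteration.
        O_no_self = O.copy()
        np.fill_diagonal(O_no_self, -np.inf)
        flat_order = np.argsort(O_no_self, axis=None)[::-1][:top_n]
        return [tuple(np.unravel_index(k, O.shape)) for k in flat_order]

    # Toy usage: a random non-negative matrix stands in for real statistics.
    rng = np.random.default_rng(0)
    print(top_relationships(rng.random((6, 6)), top_n=5))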
3.3. Retraining via Labeling New Instances

Once we have the initial set of classifiers/detectors and the set of relationships, we can use them to find new instances of different object and scene categories. These new instances are then added to the set of labeled data, and we retrain new classifiers/detectors using the updated set of labeled data. These new classifiers are then used to extract more relationships, which in turn are used to label more data, and so on. One way to find new instances is to use the detector itself directly; for instance, using the car detector to find more cars. However, this approach leads to semantic drift. To avoid semantic drift, we use the rich set of relationships we extracted in the previous section and ensure that the new labeled instances of car satisfy the extracted relationships (e.g., has wheels, found in raceways, etc.).

Mathematically, let RO, RA and RS represent the sets of object-object, object-attribute and scene-object relationships at iteration t. If φ_i(·) represents the potential from object detector i, ω_k(·) represents the scene potential, and ψ_{i,j}(·) represents the compatibility function between two object categories i and j, then we can find the new instances of object category i using the contextual scoring function given below:

    φ_i(x) + Σ_{(i,j) ∈ RO ∪ RA} φ_j(x_l) ψ_{i,j}(x, x_l) + Σ_{(i,k) ∈ RS} ω_k(x)

where x is the window being evaluated and x_l is the top-detected window of the related object/attribute category. The above equation has three terms: the first term is the appearance term for the object category itself and is measured by the score of the SVM detector on the window x. The second term measures the compatibility between object category i and the object/attribute category j if the relationship (i, j) is part of the catalogue. For example, if "Wheel is a part of Car" exists in the catalogue, then this term will be the product of the score of the wheel detector and the compatibility function between the wheel window (x_l) and the car window (x). The final term measures the scene-object compatibility. Therefore, if the knowledge base contains the relationship "Car is found in Raceway", this term boosts the "Car" detection scores in "Raceway" scenes.

[Figure 4. Qualitative examples of bounding-box labeling done by NEIL (e.g., Nilgai, Yamaha, Violin, Bass, F-18).]

At each iteration, we also add new instances of different scene categories. We find new instances of scene category k using the contextual scoring function given below:

    ω_k(x) + Σ_{(m,k) ∈ RA'} ω_m(x) + Σ_{(i,k) ∈ RS} φ_i(x_l)

where RA' represents the catalogue of scene-attribute relationships. The above equation has three terms: the first term is the appearance term for the scene category itself and is estimated using the scene classifier. The second term is the appearance term for the attribute category and is estimated using the attribute classifier. This term ensures that if a scene-attribute relationship exists, then the attribute classifier score should be high. The third and final term is the appearance term of an object category and is estimated using the corresponding object detector. This term ensures that if a scene-object relationship exists, then the object detector should detect objects in the scene.
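The object scoring function translates almost term for term into code. The sketch below is purely an illustration of the equation: the dictionaries of detector/classifier callables, the relationship pair sets and the top-window cache are assumptions about how such state might be held, not the authors' implementation. The scene score is analogous, with ω_k(x) as the appearance term, the scene-attribute catalogue supplying the second term and related object detections the third.

    def object_instance_score(x, i, phi, omega, psi, top_window, R_O, R_A, R_S):
        """Contextual score of window x for object category i (a sketch).

        phi[j](x)         -- detector score of category j on window x
        omega[k](x)       -- scene classifier score of scene k for x's image
        psi[(i, j)](x, y) -- learned compatibility of windows x and y
        top_window[j]     -- top-detected window of related category j
        R_O, R_A, R_S     -- sets of (category, category) relationship pairs
        """
        score = phi[i](x)                                 # appearance term

        # Object-object and object-attribute compatibility terms: weight the
        # related detector's score by the learned compatibility function.
        for (a, j) in set(R_O) | set(R_A):
            if a == i and j in top_window:
                x_l = top_window[j]
                score += phi[j](x_l) * psi[(i, j)](x, x_l)

        # Scene-object context: boost detections in scenes the category is
        # known to be found in.
        for (a, k) in R_S:
            if a == i:
                score += omega[k](x)
        return score

Note that only the single top-detected window of each related category enters the compatibility term, mirroring x_l in the equation.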
Implementation Details: To train scene and attribute classifiers, we first extract a 3912-dimensional feature vector from each image. The feature vector includes 512-D GIST [27] features, concatenated with bag-of-words representations for SIFT [24], HOG [7], Lab color space, and Texton [26]. The dictionary sizes are 1000, 1000, 400, and 1000, respectively. Features of randomly sampled windows from other categories are used as negative examples for SVM training and hard mining. For the object and attribute detectors, we use CHOG [20] features with a bin size of 8. We train the detectors using the latent SVM model (without parts) [13].
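The 3912 dimensions are simply the sum of the pieces listed above (512 + 1000 + 1000 + 400 + 1000). A small assembly sketch follows; the stub extractors returning random vectors are placeholders for real GIST and bag-of-words pipelines, not any released code.

    import numpy as np

    # Stub extractors standing in for real GIST / bag-of-words pipelines;
    # each returns a random vector of the documented dimensionality.
    def gist(image):        return np.random.rand(512)    # 512-D GIST
    def bow_sift(image):    return np.random.rand(1000)   # 1000-word SIFT BoW
    def bow_hog(image):     return np.random.rand(1000)   # 1000-word HOG BoW
    def bow_lab(image):     return np.random.rand(400)    # 400-word Lab BoW
    def bow_texton(image):  return np.random.rand(1000)   # 1000-word Texton BoW

    def scene_attribute_feature(image):
        """Concatenate the descriptors into the 3912-D scene/attribute feature."""
        feat = np.concatenate([gist(image), bow_sift(image), bow_hog(image),
                               bow_lab(image), bow_texton(image)])
        assert feat.shape == (3912,)
        return feat

    print(scene_attribute_feature(object()).shape)   # (3912,)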
4. Experimental Results

We demonstrate the quality of the visual knowledge through qualitative results, verification via human subjects, and quantitative results on tasks such as object detection and scene recognition.

4.1. NEIL Statistics

While NEIL's core algorithm uses a fixed vocabulary, we use noun phrases from NELL [5] to grow NEIL's vocabulary. As of 10th October 2013, NEIL has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. It has downloaded more than 2 million images for extracting the current structured visual knowledge. For bootstrapping our system, we use a few seed images from ImageNet [8], SUN [41] or the top images from Google Image Search. For the purposes of extensive experimental evaluation in this paper, we ran NEIL on steroids (200 cores as opposed to the 30 cores used generally) for the last 2.5 months. NEIL has completed 16 iterations and has labeled more than 400K visual instances (including 300,000 objects with their bounding boxes). It has also extracted 1703 common sense relationships. Readers can browse the current visual knowledge base and download the detectors from: www.neil-kb.com

4.2. Qualitative Results

We first show some qualitative results in terms of the visual knowledge extracted by NEIL. Figure 4 shows the extracted visual sub-categories along with a few labeled instances belonging to each sub-category. It can be seen from the figure that NEIL effectively handles intra-class variation and polysemy via the clustering process. The purity and diversity of the clusters for different concepts indicate that contextual relationships help make our system robust to semantic drift and ensure diversity. Figure 5 shows some qualitative examples of scene-object and object-object relationships extracted by NEIL. It is effective in using a few confident detections to extract interesting relationships. Figure 6 shows some of the interesting scene-attribute and object-attribute relationships extracted by NEIL.

[Figure 5. Qualitative examples of Scene-Object (rows 1-2) and Object-Object (rows 3-4) relationships extracted by NEIL, e.g., "Helicopter is found in Airfield", "Zebra is found in Savanna", "Opera house is found in Sydney", "Eye is a part of Baby", "Van is a kind of/looks similar to Ambulance", "Duck is a kind of/looks similar to Goose".]

4.3. Evaluating Quality via Human Subjects

Next, we want to evaluate the quality of the visual knowledge extracted by NEIL. It should be noted that an extensive and comprehensive evaluation of the whole NEIL system is an extremely difficult task. It is impractical to evaluate each and every labeled instance and each and every relationship for correctness. Therefore, we randomly sample 500 visual instances and 500 relationships, and verify them using human experts. At the end of iteration 6, 79% of the relationships extracted by NEIL are correct, and 98% of the visual data labeled by NEIL has been labeled correctly. We also evaluate the per-iteration correctness of relationships: at iteration 1, more than 96% of relationships are correct, and by iteration 3 the system stabilizes and 80% of extracted relationships are correct. While currently the system does not demonstrate any major semantic drift, we do plan to continue evaluation and extensive analysis of the knowledge base as NEIL grows older. We also evaluate the quality of the bounding boxes generated by NEIL. For this we sample 100 images randomly and label the ground-truth bounding boxes. On the standard intersection-over-union metric, NEIL generates bounding boxes with 0.78 overlap on average with ground truth. To give context to the difficulty of the task, the standard Objectness algorithm [1] produces bounding boxes with 0.59 overlap on average.

4.4. Using Knowledge for Vision Tasks

Finally, we want to demonstrate the usefulness of the visual knowledge learned by NEIL on standard vision tasks such as object detection and scene classification. Here, we also compare several aspects of our approach: (a) we first compare the quality of our automatically labeled dataset; as baselines, we train classifiers/detectors directly on the seed images downloaded from Google Image Search; (b) we compare NEIL against a standard bootstrapping approach which does not extract/use relationships; (c) finally, we demonstrate the usefulness of relationships by detecting and classifying new test data with and without the learned relationships.

Scene Classification: First we evaluate our visual knowledge for the task of scene classification. We build a dataset of 600 images (12 scene categories) using Flickr images. We compare the performance of our scene classifiers against scene classifiers trained from the top 15 images of Google Image Search (our seed classifier). We also compare the performance with a standard bootstrapping approach without using any relationship extraction. Table 1 shows the results; we use mean average precision (mAP) as the evaluation metric. As the results show, automatic relationship extraction helps us constrain the learning problem, and so the learned classifiers give much better performance. Finally, if we also use the contextual information from NEIL relationships, we get a significant boost in performance.
Table 1. mAP performance for scene classification on 12 categories.
    Seed Classifier (15 Google Images)        0.52
    Bootstrapping (without relationships)     0.54
    NEIL Scene Classifiers                    0.57
    NEIL (Classifiers + Relationships)        0.62

Object Detection: We also evaluate our extracted visual knowledge for the task of object detection. We build a dataset of 1000 images (15 object categories) using Flickr data for testing. We compare the performance against object detectors trained directly using (top-50 and top-450) images from Google Image Search. We also compare the performance of detectors trained after aspect-ratio clustering, HOG-based clustering, and our proposed clustering procedure. Table 2 shows the detection results. Using 450 images from Google Image Search decreases the performance due to noisy retrievals. While other clustering methods help, the gain from our clustering procedure is much larger. Finally, detectors trained using NEIL work better than standard bootstrapping.

[Figure 6. Examples of extracted common sense relationships, e.g., "Monitor is found in Control room", "Washing machine is found in Utility room", "Siberian tiger is found in Zoo", "Bullet train is found in Train station platform", "Cougar looks similar to Cat", "Samsung galaxy is a kind of Cellphone", "Hallway is/has Narrow", "Trading floor is/has Crowded".]

Table 2. mAP performance for object detection on 15 categories.
    Latent SVM (50 Google Images)               0.34
    Latent SVM (450 Google Images)              0.28
    Latent SVM (450, Aspect Ratio Clustering)   0.30
    Latent SVM (450, HOG-based Clustering)      0.33
    Seed Detector (NEIL Clustering)             0.44
    Bootstrapping (without relationships)       0.45
    NEIL Detector                               0.49
    NEIL Detector + Relationships               0.51

Acknowledgements: This research was supported by ONR MURI N000141010934 and a gift from Google. The authors would like to thank Tom Mitchell and David Fouhey for insightful discussions. We would also like to thank our computing clusters warp and workhorse for doing all the hard work!

References
[1] B. Alexe, T. Deselares, and V. Ferrari. What is an object? TPAMI, 2010.
[2] T. Berg and D. Forsyth. Animals on the web. In CVPR, 2006.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.
[4] A. Carlson, J. Betteridge, E. R. H. Jr., and T. M. Mitchell. Coupling semi-supervised learning of categories and relations. In NAACL HLT Workshop on SSL for NLP, 2009.
[5] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[6] J. R. Curran, T. Murphy, and B. Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In Pacific Association for Computational Linguistics, 2007.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[8] J. Deng, W. Dong, R. Socher, J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] S. Divvala, A. Efros, and M. Hebert. How important are 'deformable parts' in the deformable parts model? In ECCV, Parts and Attributes Workshop, 2012.
[10] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? SIGGRAPH, 2012.
[11] S. Ebert, D. Larlus, and B. Schiele. Extracting structures in image collections for object recognition. In ECCV, 2010.
[12] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[13] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010.
[14] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In ECCV, 2004.
[15] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.
[16] B. Frey and D. Dueck. Clustering by passing messages between data points. Science, 2007.
[17] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In CVPR, 2010.
[18] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.
[19] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV, 2012.
[20] S. Khan, F. Anwer, R. Muhammad, J. van de Weijer, A. Joost, M. Vanrell, and A. Lopez. Color attributes for object detection. In CVPR, 2012.
[21] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in ImageNet. In ECCV, 2012.
[22] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[23] L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: Automatic object picture collection via incremental model learning. In CVPR, 2007.
[24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[25] A. Lucchi and J. Weston. Joint image and word sense discrimination for image retrieval. In ECCV, 2012.
[26] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 2004.
[27] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[28] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[29] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012.
[30] P. Perona. Visions of a Visipedia. Proceedings of the IEEE, 2010.
[31] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[32] R. Raguram and S. Lazebnik. Computing iconic summaries of general visual concepts. In Workshop on Internet Vision, 2008.
[33] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.
[34] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In ICCV, 2007.
[35] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Augmented attribute representations. In ECCV, 2012.
[36] A. Shrivastava, S. Singh, and A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In ECCV, 2012.
[37] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR, 2010.
[38] E. Sudderth, A. Torralba, W. T. Freeman, and A. Wilsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, 2005.
[39] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR, 2011.
[40] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, 2004.
[41] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large scale scene recognition from abbey to zoo. In CVPR, 2010.
[42] X. Zhu. Semi-supervised learning literature survey. Technical report, CS, UW-Madison, 2005.
2 0.74267167 350 iccv-2013-Relative Attributes for Large-Scale Abandoned Object Detection
Author: Quanfu Fan, Prasad Gabbur, Sharath Pankanti
Abstract: Effective reduction of false alarms in large-scale video surveillance is rather challenging, especially for applications where abnormal events of interest rarely occur, such as abandoned object detection. We develop an approach to prioritize alerts by ranking them, and demonstrate its great effectiveness in reducing false positives while keeping good detection accuracy. Our approach benefits from a novel representation of abandoned object alerts by relative attributes, namely staticness, foregroundness and abandonment. The relative strengths of these attributes are quantified using a ranking function [19] learnt on suitably designed low-level spatial and temporal features. These attributes of varying strengths are not only powerful in distinguishing abandoned objects from false alarms such as people and light artifacts, but also computationally efficient for large-scale deployment. With these features, we apply a linear ranking algorithm to sort alerts according to their relevance to the end-user. We test the effectiveness of our approach on both public data sets and large ones collected from the real world.
3 0.71383971 179 iccv-2013-From Subcategories to Visual Composites: A Multi-level Framework for Object Detection
Author: Tian Lan, Michalis Raptis, Leonid Sigal, Greg Mori
Abstract: The appearance of an object changes profoundly with pose, camera view and interactions of the object with other objects in the scene. This makes it challenging to learn detectors based on an object-level label (e.g., "car"). We postulate that having a richer set of labelings (at different levels of granularity) for an object, including finer-grained subcategories, consistent in appearance and view, and higher-order composites – contextual groupings of objects consistent in their spatial layout and appearance, can significantly alleviate these problems. However, obtaining such a rich set of annotations, including annotation of an exponentially growing set of object groupings, is simply not feasible. We propose a weakly-supervised framework for object detection where we discover subcategories and the composites automatically with only traditional object-level category labels as input. To this end, we first propose an exemplar-SVM-based clustering approach, with latent SVM refinement, that discovers a variable length set of discriminative subcategories for each object class. We then develop a structured model for object detection that captures interactions among object subcategories and automatically discovers semantically meaningful and discriminatively relevant visual composites. We show that this model produces state-of-the-art performance on UIUC phrase object detection benchmark.
4 0.70977366 31 iccv-2013-A Unified Probabilistic Approach Modeling Relationships between Attributes and Objects
Author: Xiaoyang Wang, Qiang Ji
Abstract: This paper proposes a unified probabilistic model to model the relationships between attributes and objects for attribute prediction and object recognition. As a list of semantically meaningful properties of objects, attributes generally relate to each other statistically. In this paper, we propose a unified probabilistic model to automatically discover and capture both the object-dependent and object-independent attribute relationships. The model utilizes the captured relationships to benefit both attribute prediction and object recognition. Experiments on four benchmark attribute datasets demonstrate the effectiveness of the proposed unified model for improving attribute prediction as well as object recognition in both standard and zero-shot learning cases.
5 0.69708174 109 iccv-2013-Detecting Avocados to Zucchinis: What Have We Done, and Where Are We Going?
Author: Olga Russakovsky, Jia Deng, Zhiheng Huang, Alexander C. Berg, Li Fei-Fei
Abstract: The growth of detection datasets and the multiple directions of object detection research provide both an unprecedented need and a great opportunity for a thorough evaluation of the current state of the field of categorical object detection. In this paper we strive to answer two key questions. First, where are we currently as a field: what have we done right, what still needs to be improved? Second, where should we be going in designing the next generation of object detectors? Inspired by the recent work of Hoiem et al. [10] on the standard PASCAL VOC detection dataset, we perform a large-scale study on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) data. First, we quantitatively demonstrate that this dataset provides many of the same detection challenges as the PASCAL VOC. Due to its scale of 1000 object categories, ILSVRC also provides an excellent testbed for understanding the performance of detectors as a function of several key properties of the object classes. We conduct a series of analyses looking at how different detection methods perform on a number of image-level and object-class-level properties such as texture, color, deformation, and clutter. We learn important lessons of the current object detection methods and propose a number of insights for designing the next generation object detectors.
6 0.68509632 380 iccv-2013-Semantic Transform: Weakly Supervised Semantic Inference for Relating Visual Attributes
7 0.67958635 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary
8 0.67647988 406 iccv-2013-Style-Aware Mid-level Representation for Discovering Visual Connections in Space and Time
9 0.6548633 349 iccv-2013-Regionlets for Generic Object Detection
10 0.65451282 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction
11 0.64832491 449 iccv-2013-What Do You Do? Occupation Recognition in a Photo via Social Context
12 0.64109397 269 iccv-2013-Modeling Occlusion by Discriminative AND-OR Structures
13 0.63571823 53 iccv-2013-Attribute Dominance: What Pops Out?
14 0.63487476 191 iccv-2013-Handling Uncertain Tags in Visual Recognition
15 0.62039822 416 iccv-2013-The Interestingness of Images
16 0.61946243 399 iccv-2013-Spoken Attributes: Mixing Binary and Relative Attributes to Say the Right Thing
17 0.61564237 52 iccv-2013-Attribute Adaptation for Personalized Image Search
18 0.61366343 246 iccv-2013-Learning the Visual Interpretation of Sentences
19 0.60842848 7 iccv-2013-A Deep Sum-Product Architecture for Robust Facial Attributes Analysis
20 0.59743667 66 iccv-2013-Building Part-Based Object Detectors via 3D Geometry
topicId topicWeight
[(2, 0.086), (7, 0.014), (8, 0.01), (12, 0.025), (13, 0.024), (26, 0.084), (31, 0.037), (42, 0.106), (64, 0.078), (73, 0.029), (77, 0.031), (78, 0.019), (89, 0.144), (94, 0.174), (98, 0.023)]
simIndex simValue paperId paperTitle
1 0.84583837 222 iccv-2013-Joint Learning of Discriminative Prototypes and Large Margin Nearest Neighbor Classifiers
Author: Martin Köstinger, Paul Wohlhart, Peter M. Roth, Horst Bischof
Abstract: In this paper, we raise important issues concerning the evaluation complexity of existing Mahalanobis metric learning methods. The complexity scales linearly with the size of the dataset. This is especially cumbersome on large scale or for real-time applications with limited time budget. To alleviate this problem we propose to represent the dataset by a fixed number of discriminative prototypes. In particular, we introduce a new method that jointly chooses the positioning of prototypes and also optimizes the Mahalanobis distance metric with respect to these. We show that choosing the positioning of the prototypes and learning the metric in parallel leads to a drastically reduced evaluation effort while maintaining the discriminative essence of the original dataset. Moreover, for most problems our method performing k-nearest prototype (k-NP) classification on the condensed dataset leads to even better generalization compared to k-NN classification using all data. Results on a variety of challenging benchmarks demonstrate the power of our method. These include standard machine learning datasets as well as the challenging Public Figures Face Database. On the competitive machine learning benchmarks we are comparable to the state-of-the-art while being more efficient. On the face benchmark we clearly outperform the state-of-the-art in Mahalanobis metric learning with drastically reduced evaluation effort.
same-paper 2 0.82082421 285 iccv-2013-NEIL: Extracting Visual Knowledge from Web Data
Author: Xinlei Chen, Abhinav Shrivastava, Abhinav Gupta
Abstract: We propose NEIL (NeverEnding Image Learner), a computer program that runs 24 hours per day and 7 days per week to automatically extract visual knowledge from Internet data. NEIL uses a semi-supervised learning algorithm that jointly discovers common sense relationships (e.g., “Corolla is a kind of/looks similar to Car”, “Wheel is a part of Car”) and labels instances of the given visual categories. It is an attempt to develop the world’s largest visual structured knowledge base with minimum human labeling effort. As of 10th October 2013, NEIL has been continuously running for 2.5 months on 200 core cluster (more than 350K CPU hours) and has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. During this period, NEIL has discovered more than 1700 relationships and has labeled more than 400K visual instances. 1. Motivation Recent successes in computer vision can be primarily attributed to the ever increasing size of visual knowledge in terms of labeled instances of scenes, objects, actions, attributes, and the contextual relationships between them. But as we move forward, a key question arises: how will we gather this structured visual knowledge on a vast scale? Recent efforts such as ImageNet [8] and Visipedia [30] have tried to harness human intelligence for this task. However, we believe that these approaches lack both the richness and the scalability required for gathering massive amounts of visual knowledge. For example, at the time of submission, only 7% of the data in ImageNet had bounding boxes and the relationships were still extracted via Wordnet. In this paper, we consider an alternative approach of automatically extracting visual knowledge from Internet scale data. The feasibility of extracting knowledge automatically from images and videos will itself depend on the state-ofthe-art in computer vision. While we have witnessed significant progress on the task of detection and recognition, we still have a long way to go for automatically extracting the semantic content of a given image. So, is it really possible to use existing approaches for gathering visual knowledge directly from web data? 1.1. NEIL – Never Ending Image Learner We propose NEIL, a computer program that runs 24 hours per day, 7 days per week, forever to: (a) semantically understand images on the web; (b) use this semantic understanding to augment its knowledge base with new labeled instances and common sense relationships; (c) use this dataset and these relationships to build better classifiers and detectors which in turn help improve semantic understanding. NEIL is a constrained semi-supervised learning (SSL) system that exploits the big scale of visual data to automatically extract common sense relationships and then uses these relationships to label visual instances of existing categories. It is an attempt to develop the world’s largest visual structured knowledge base with minimum human effort one that reflects the factual content of the images on the Internet, and that would be useful to many computer vision and AI efforts. Specifically, NEIL can use web data to extract: (a) Labeled examples of object categories with bounding boxes; (b) Labeled examples of scenes; (c) Labeled examples of attributes; (d) Visual subclasses for object categories; and (e) Common sense relationships about scenes, objects and attributes like “Corolla is a kind of/looks similar to Car”, “Wheel is a part ofCar”, etc. (See Figure 1). 
We believe our approach is possible for three key reasons: (a) Macro-vision vs. Micro-vision: We use the term “micro-vision” to refer to the traditional paradigm where the input is an image and the output is some information extracted from that image. In contrast, we define “macrovision” as a paradigm where the input is a large collection of images and the desired output is extracting significant or interesting patterns in visual data (e.g., car is detected frequently in raceways). These patterns help us to extract common sense relationships. Note, the key difference is that macro-vision does not require us to understand every image in the corpora and extract all possible patterns. Instead, it relies on understanding a few images and statistically combine evidence from these to build our visual knowledge. – (b) Structure of the Visual World: Our approach exploits the structure of the visual world and builds constraints for detection and classification. These global constraints are represented in terms of common sense relationships be1409 orCllaoraC Hloc e yrs(a) Objects (w/Bounding Boxes and Vislue hWal Subcategories) aongkPi lrt(b) ScenewyacaResaephs nuoRd(c) At d worCreibutes Visual Instances Labeled by NEIL (O-O) Wheel is a part of Car. (S-O) Car is found in Raceway. (O-O) Corolla is a kind of/looks similar to Car. (S-O) Pyramid is found in Egypt. (O-A) Wheel is/has Round shape. (S-A) Alley is/has Narrow. (S-A) Bamboo forest is/has Vertical lines. (O-A) Sunflower is/has Yellow. Relationships Extracted by NEIL Figure 1. NEIL is a computer program that runs 24 hours a day and 7 days a week to gather visual knowledge from the Internet. Specifically, it simultaneously labels the data and extracts common sense relationships between categories. tween categories. Most prior work uses manually defined relationships or learns relationships in a supervised setting. Our key insight is that at a large scale one can simultane- ously label the visual instances and extract common sense relationships in ajoint semi-supervised learning framework. (c) Semantically driven knowledge acquisition: We use a semantic representation for visual knowledge; that is, we group visual data based on semantic categories and develop relationships between semantic categories. This allows us to leverage text-based indexing tools such as Google Image Search to initialize our visual knowledge base learning. Contributions: Our main contributions are: (a) We propose a never ending learning algorithm for gathering visual knowledge from the Internet via macro-vision. NEIL has been continuously running for 2.5 months on a 200 core cluster; (b) We are automatically building a large visual structured knowledge base which not only consists of labeled instances of scenes, objects, and attributes but also the relationships between them. While NEIL’s core SSL algorithm works with a fixed vocabulary, we also use noun phrases from NELL’s ontology [5] to grow our vocabulary. Currently, our growing knowledge base has an ontology of 1152 object categories, 1034 scene categories, and 87 attributes. NEIL has discovered more than 1700 relationships and labeled more than 400K visual instances of these categories. (c) We demonstrate how joint discovery of relationships and labeling of instances at a gigantic scale can provide constraints for improving semi-supervised learning. 2. Related Work Recent work has only focused on extracting knowledge in the form of large datasets for recognition and classification [8, 23, 30]. 
One of the most commonly used approaches to build datasets is using manual annotations by motivated teams of people [30] or the power of crowds [8, 40]. To minimize human effort, recent works have also focused on active learning [37, 39] which selects label requests that are most informative. However, both of these directions have a major limitation: annotations are expensive, prone to errors, biased and do not scale. An alternative approach is to use visual recognition for extracting these datasets automatically from the Internet [23, 34, 36]. A common way of automatically creating a dataset is to use image search results and rerank them via visual classifiers [14] or some form of joint-clustering in text and visual space [2, 34]. Another approach is to use a semi-supervised framework [42]. Here, a small amount of labeled data is used in conjunction with a large amount of unlabeled data to learn reliable and robust visual models. These seed images can be manually labeled [36] or the top retrievals of a text-based search [23]. The biggest problem with most of these automatic approaches is that the small number of labeled examples or image search results do not provide enough constraints for learning robust visual classifiers. Hence, these approaches suffer from semantic drift [6]. One way to avoid semantic drift is to exploit additional constraints based on the structure of our visual data. Researchers have exploited a variety of constraints such as those based on visual similarity [11, 15], seman- tic similarity [17] or multiple feature spaces [3]. However, most of these constraints are weak in nature: for example, visual similarity only models the constraint that visuallysimilar images should receive the same labels. On the other hand, our visual world is highly structured: object cate1410 gories share parts and attributes, objects and scenes have strong contextual relationships, etc. Therefore, we need a way to capture the rich structure of our visual world and exploit this structure during semi-supervised learning. In recent years, there have been huge advances in modeling the rich structure of our visual world via contextual relationships. Some of these relationships include: SceneObject [38], Object-Object [3 1], Object-Attribute [12, 22, 28], Scene-Attribute [29]. All these relationships can provide a rich set of constraints which can help us improve SSL [4]. For example, scene-attribute relationships such as amphitheaters are circular can help improve semisupervised learning of scene classifiers [36] and Wordnet hierarchical relationships can help in propagating segmentations [21]. But the big question is: how do we obtain these relationships? One way to obtain such relationships is via text analysis [5, 18]. However, as [40] points out that the visual knowledge we need to obtain is so obvious that no one would take the time to write it down and put it on web. In this work, we argue that, at a large-scale, one can jointly discover relationships and constrain the SSL prob- lem for extracting visual knowledge and learning visual classifiers and detectors. Motivated by a never ending learning algorithm for text [5], we propose a never ending visual learning algorithm that cycles between extracting global relationships, labeling data and learning classifiers/detectors for building visual knowledge from the Internet. 
Our work is also related to attribute discovery [33, 35] since these approaches jointly discover the attributes and relationships between objects and attributes simultaneously. However, in our case, we only focus on semantic attributes and therefore our goal is to discover semantic relationships and semantically label visual instances. 3. Technical Approach Our goal is to extract visual knowledge from the pool of visual data on the web. We define visual knowledge as any information that can be useful for improving vision tasks such as image understanding and object/scene recognition. One form of visual knowledge would be labeled examples of different categories or labeled segments/boundaries. Labeled examples helps us learn classifiers or detectors and improve image understanding. Another example of visual knowledge would be relationships. For example, spatial contextual relationships can be used to improve object recognition. In this paper, we represent visual knowledge in terms of labeled examples of semantic categories and the relationships between those categories. Our knowledge base consists of labeled examples of: (1) Objects (e.g., Car, Corolla); (2) Scenes (e.g., Alley, Church); (3) Attributes (e.g., Blue, Modern). Note that for objects we learn detectors and for scenes we build classifiers; however for the rest of the paper we will use the term detector and classifier interchangeably. Our knowledge base also contains relationships of four types: (1) Object-Object (e.g., Wheel is a part of Car);(2) Object-Attribute (e.g., Sheep is/has White); (3) Scene-Object (e.g., Car is found in Raceway); (4) SceneAttribute (e.g., Alley is/has Narrow). The outline of our approach is shown in Figure 2. We use Google Image Search to download thousands of images for each object, scene and attribute category. Our method then uses an iterative approach to clean the labels and train detectors/classifiers in a semi-supervised manner. For a given concept (e.g., car), we first discover the latent visual subcategories and bounding boxes for these sub-categories using an exemplar-based clustering approach (Section 3. 1). We then train multiple detectors for a concept (one for each sub-category) using the clustering and localization results. These detectors and classifiers are then used for detections on millions of images to learn relationships based on cooccurrence statistics (Section 3.2). Here, we exploit the fact the we are interested in macro-vision and therefore build co-occurrence statistics using only confident detections/classifications. Once we have relationships, we use them in conjunction with our classifiers and detectors to label the large set of noisy images (Section 3.3). The most confidently labeled images are added to the pool of labeled data and used to retrain the models, and the process repeats itself. At every iteration, we learn better classifiers and detectors, which in turn help us learn more relationships and further constrain the semi-supervised learning problem. We now describe each step in detail below. 3.1. Seeding Classifiers via Google Image Search The first step in our semi-supervised algorithm is to build classifiers for visual categories. One way to build initial classifiers is via a few manually labeled seed images. Here, we take an alternative approach and use text-based image retrieval systems to provide seed images for training initial detectors. For scene and attribute classifiers we directly use these retrieved images as positive data. 
However, such an approach fails for training object and attribute detectors because of four reasons (Figure 3(a)) (1) Outliers: Due to the imperfectness of text-based image retrieval, the downloaded images usually have irrelevant images/outliers; (2) Polysemy: In many cases, semantic categories might be overloaded and a single semantic category might have multiple senses (e.g., apple can mean both the company and the fruit); (3) Visual Diversity: Retrieved images might have high intra-class variation due to different viewpoint, illumination etc.; (4) Localization: In many cases the retrieved image might be a scene without a bounding-box and hence one needs to localize the concept before training a detector. Most of the current approaches handle these problems via clustering. Clustering helps in handling visual diversity [9] and discovering multiple senses of retrieval (polysemy) [25]. It can also help us to reject outliers based on – distances from cluster centers. One simple way to cluster 141 1 would be to use K-means on the set of all possible bounding boxes and then use the representative clusters as visual sub-categories. However, clustering using K-means has two issues: (1) High Dimensionality: We use the Color HOG (CHOG) [20] representation and standard distance metrics do not work well in such high-dimensions [10]; (2) Scalability: Most clustering approaches tend to partition the complete feature space. In our case, since we do not have bounding boxes provided, every image creates millions of data points and the majority of the datapoints are outliers. Recent work has suggested that K-means is not scalable and has bad performance in this scenario since it assigns membership to every data point [10]. Instead, we propose to use a two-step approach for clustering. In the first step, we mine the set of downloaded im- × ages from Google Image Search to create candidate object windows. Specifically, every image is used to train a detector using recently proposed exemplar-LDA [19]. These detectors are then used for dense detections on the same set of downloaded images. We select the top K windows which have high scores from multiple detectors. Note that this step helps us prune out outliers as the candidate windows are selected via representativeness (how many detectors fire on them). For example, in Figure 3, none of the tricycle detectors fire on the outliers such as circular dots and people eating, and hence these images are rejected at this candidate widow step. Once we have candidate windows, we cluster them in the next step. However, instead of using the high-dimensional CHOG representation for clustering, we use the detection signature of each window (represented as a vector of seed detector ELDA scores on the window) to create a K K affinity matrix. The (i, j) entry in the affinity amteat arix K i s× thKe da fofti product orixf t.h Tish vee (cit,ojr) )fo enr twryin indo thwes ai fainndj. Intuitively, this step connects candidate windows if the same set of detectors fire on both windows. Once we have the affinity matrix, we cluster the candidate windows using the standard affinity propagation algorithm [16]. Affinity propagation also allows us to extract a representative window (prototype) for each cluster which acts as an iconic image for the object [32] (Figure 3). After clustering, we train a detector for each cluster/sub-category using three-quarters of the images in the cluster. The remaining quarter is used as a validation set for calibration. 3.2. 
Extracting Relationships Once we have initialized object detectors, attribute detectors, attribute classifiers and scene classifiers, we can use them to extract relationships automatically from the data. The key idea is that we do not need to understand each and every image downloaded from the Internet but instead understand the statistical pattern of detections and classifications at a large scale. These patterns can be used to select the top-N relationships at every iteration. Specifically, we extract four different kinds of relationships: Object-Object Relationships: The first kind of relationship we extract are object-object relationships which include: (1) Partonomy relationships such as “Eye is a part of Baby”; (2) Taxonomy relationships such as “BMW 320 is a kind of Car”; and (3) Similarity relationships such as 1412 (a) Google Image Search for “tricycle” (b) Sub-category Discovery Figure 3. An example of how clustering handles polysemy, intraclass variation and outlier removal (a). The bottom row shows our discovered clusters. “Swan looks similar to Goose”. To extract these relationships, we first build a co-detection matrix O0 whose elements represent the probability of simultaneous detection of object categories i and j. Intuitively, the co-detection matrix has high values when object detector idetects objects inside the bounding box of object j with high detection scores. To account for detectors that fire everywhere and images which have lots of detections, we normalize the matrix O0. The normalized co-detection matrix can be written 1 1 as: N1− 2 O0N2− 2 , where N1 and N2 are out-degree and indegree matrix and (i, j) element of O0 represents the average score of top-detections of detector ion images of object category j. Once we have selected a relationship between pair of categories, we learn its characteristics in terms of mean and variance of relative locations, relative aspect ra- tio, relative scores and relative size of the detections. For example, the nose-face relationship is characterized by low relative window size (nose is less than 20% of face area) and the relative location that nose occurs in center of the face. This is used to define a compatibility function ψi,j (·) which evaluates if the detections from category iand j are compatible or not. We also classify the relationships into the two semantic categories (part-of, taxonomy/similar) using relative features to have a human-communicable view of visual knowledge base. Object-Attribute Relationships: The second type of relationship we extract is object-attribute relationships such as “Pizza has Round Shape”, ”Sunflower is Yellow” etc. To extract these relationships we use the same methodology where the attributes are detected in the labeled examples of object categories. These detections and their scores are then used to build a normalized co-detection matrix which is used to find the top object-attribute relationships. Scene-Object Relationships: The third type of relationship extracted by our algorithm includes scene-object relationships such as “Bus is found in Bus depot” and “Monitor is found in Control room”. For extracting scene-object relationships, we use the object detectors on randomly sampled images of different scene classes. The detections are then used to create the normalized co-presence matrix (similar to object-object relationships) where the (i, j) element represents the likelihood of detection of instance of object category iand the scene category class j. 
Scene-Attribute Relationships: The fourth and final type of relationship extracted by our algorithm includes sceneattribute relationships such as “Ocean is Blue”, “Alleys are Narrow”, etc. Here, we follow a simple methodology for extracting scene-attribute relationships where we compute co-classification matrix such that the element (i, j) of the matrix represents average classification scores of attribute ion images of scene j. The top entries in this coclassification matrix are used to extract scene-attribute relationships. 3.3. Retraining via Labeling New Instances Once we have the initial set of classifiers/detectors and the set of relationships, we can use them to find new instances of different objects and scene categories. These new instances are then added to the set of labeled data and we retrain new classifiers/detectors using the updated set of labeled data. These new classifiers are then used to extract more relationships which in turn are used to label more data and so on. One way to find new instances is directly using the detector itself. For instance, using the car detector to find more cars. However, this approach leads to semantic drift. To avoid semantic drift, we use the rich set of relationships we extracted in the previous section and ensure that the new labeled instances of car satisfy the extracted relationships (e.g., has wheels, found in raceways etc.) Mathematically, let RO, RA and RS represent the set of object-object, object-attribute aanndd scene-object relationships at iteration t. If φi (·) represents the potential from object detector i, ωk (·) represents sthenet scene potential, raonmd ψi,j (·) represent the compatibility sfu thnect siocnen nbeet pwoeteennt tiwalo, aonbdject categories i,j,ethceonm we can ifityndfu uthncet new inetswtaenecnetsw woof oobb-ject category iusing the contextual scoring function given below: φi(x) + ? φj(xl)ψi,j(x,xl) + ? i,j∈R?O RA ? ωk(x) i,k?∈RS where x is the wi?ndow being evaluated and xl is the topdetected window of related object/attribute category. The above equation has three terms: the first term is appearance term for the object category itself and is measured by the 1413 Nilgai Yamaha Violin Bass F-18 Figure 4. Qualitative Examples of Bounding Box Labeling Done by NEIL score of the SVM detector on the window x. The second term measures the compatibility between object category i and the object/attribute category j if the relationship (i, j) is part of the catalogue. For example, if “Wheel is a part of Car” exists in the catalogue then this term will be the product of the score of wheel detector and the compatibility function between the wheel window (xl) and the car window (x). The final term measures the scene-object compatibility. Therefore, if the knowledge base contains the re- lationship “Car is found in Raceway”, this term boosts the “Car” detection scores in the “Raceway” scenes. At each iteration, we also add new instances of different scene categories. We find new instances of scene category k using the contextual scoring function given below: ωk(x) + ? ωm(x) + ? φi(xl) m,k?∈RA? i,k?∈RS where RA? represents the catalogue of scene-attribute relationships. The above equation has three terms: the first term is the appearance term for the scene category itself and is estimated using the scene classifier. The second term is the appearance term for the attribute category and is estimated using the attribute classifier. 
At each iteration, we also add new instances of the different scene categories. We find new instances of scene category $k$ using the contextual scoring function given below:

$$\omega_k(x) \;+ \sum_{(m,k) \in R_{A'}} \omega_m(x) \;+ \sum_{(i,k) \in R_S} \phi_i(x_l)$$

where $R_{A'}$ represents the catalogue of scene-attribute relationships. The above equation has three terms: the first term is the appearance term for the scene category itself and is estimated using the scene classifier. The second term is the appearance term for the attribute category and is estimated using the attribute classifier; it ensures that if a scene-attribute relationship exists, then the attribute classifier score should be high. The third and final term is the appearance term of an object category and is estimated using the corresponding object detector; it ensures that if a scene-object relationship exists, then the object detector should detect objects in the scene.

Implementation Details: To train scene and attribute classifiers, we first extract a 3912-dimensional feature vector from each image. The feature vector consists of 512D GIST [27] features, concatenated with bag-of-words representations for SIFT [24], HOG [7], Lab color space, and Texton [26], with dictionary sizes of 1000, 1000, 400, and 1000, respectively. Features of randomly sampled windows from other categories are used as negative examples for SVM training and hard mining. For the object and attribute detectors, we use CHOG [20] features with a bin size of 8, and we train the detectors using the latent SVM model (without parts) [13].
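As a sanity check on the dimensionality above (512 + 1000 + 1000 + 400 + 1000 = 3912), the sketch below assembles such a concatenated feature vector; the per-channel extractor functions and codebooks are assumed placeholders, not the paper's implementation.

```python
import numpy as np

# Dictionary sizes reported above for the bag-of-words channels
BOW_SIZES = {"sift": 1000, "hog": 1000, "lab": 400, "texton": 1000}
GIST_DIM = 512

def bow_histogram(local_descriptors, codebook):
    """Quantize local descriptors against a codebook and return a normalized
    bag-of-words histogram (a standard recipe, not NEIL's code)."""
    # Squared distance to every codeword, then nearest-word assignment
    d2 = ((local_descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / (hist.sum() + 1e-8)

def image_feature(image, extractors, codebooks):
    """Concatenate GIST with the four bag-of-words channels into one
    3912-dimensional vector. `extractors` maps a channel name to a function
    returning that channel's descriptors (hypothetical plumbing)."""
    parts = [extractors["gist"](image)]            # 512-D global descriptor
    for name in ("sift", "hog", "lab", "texton"):  # 1000 + 1000 + 400 + 1000
        parts.append(bow_histogram(extractors[name](image), codebooks[name]))
    feat = np.concatenate(parts)
    assert feat.shape[0] == GIST_DIM + sum(BOW_SIZES.values())  # 3912
    return feat
```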
4. Experimental Results

We demonstrate the quality of the visual knowledge via qualitative results, verification by human subjects, and quantitative results on tasks such as object detection and scene recognition.

4.1. NEIL Statistics

While NEIL's core algorithm uses a fixed vocabulary, we use noun phrases from NELL [5] to grow NEIL's vocabulary. As of 10th October 2013, NEIL has an ontology of 1152 object categories, 1034 scene categories and 87 attributes. It has downloaded more than 2 million images for extracting the current structured visual knowledge. For bootstrapping our system, we use a few seed images from ImageNet [8], SUN [41] or the top images from Google Image Search. For the purposes of the extensive experimental evaluation in this paper, we ran NEIL on steroids (200 cores as opposed to the 30 cores used generally) for the last 2.5 months. NEIL has completed 16 iterations and has labeled more than 400K visual instances (including 300,000 objects with their bounding boxes). It has also extracted 1703 common sense relationships. Readers can browse the current visual knowledge base and download the detectors from: www.neil-kb.com

4.2. Qualitative Results

We first show some qualitative results in terms of the visual knowledge extracted by NEIL. Figure 4 shows the extracted visual sub-categories along with a few labeled instances belonging to each sub-category. It can be seen from the figure that NEIL effectively handles intra-class variation and polysemy via the clustering process. The purity and diversity of the clusters for different concepts indicate that contextual relationships help make our system robust to semantic drift and ensure diversity. Figure 5 shows some qualitative examples of scene-object and object-object relationships extracted by NEIL; it is effective at using a few confident detections to extract interesting relationships. Figure 6 shows some of the interesting scene-attribute and object-attribute relationships extracted by NEIL.

Figure 5. Qualitative Examples of Scene-Object (rows 1-2) and Object-Object (rows 3-4) Relationships Extracted by NEIL (e.g., "Helicopter is found in Airfield", "Zebra is found in Savanna", "Eye is a part of Baby", "Duck is a kind of/looks similar to Goose", "Gypsy moth is a kind of/looks similar to Butterfly", "Basketball net is a part of Backboard").

4.3. Evaluating Quality via Human Subjects

Next, we want to evaluate the quality of the visual knowledge extracted by NEIL. It should be noted that an extensive and comprehensive evaluation of the whole NEIL system is an extremely difficult task: it is impractical to verify each and every labeled instance and each and every relationship for correctness. Therefore, we randomly sample 500 visual instances and 500 relationships and verify them using human experts. At the end of iteration 6, 79% of the relationships extracted by NEIL are correct, and 98% of the visual data labeled by NEIL has been labeled correctly. We also evaluate the per-iteration correctness of relationships: at iteration 1, more than 96% of the relationships are correct, and by iteration 3 the system stabilizes and 80% of the extracted relationships are correct. While the system does not currently demonstrate any major semantic drift, we plan to continue evaluation and extensive analysis of the knowledge base as NEIL grows older. We also evaluate the quality of the bounding boxes generated by NEIL. For this we randomly sample 100 images and label the ground-truth bounding boxes. On the standard intersection-over-union metric, NEIL generates bounding boxes with 0.78 overlap on average with ground truth. To give context to the difficulty of the task, the standard Objectness algorithm [1] produces bounding boxes with 0.59 overlap on average.
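For reference, the intersection-over-union overlap used above is the standard one; a small self-contained example is shown below, assuming boxes are given as (x1, y1, x2, y2) corners (the box format is an assumption of this sketch).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

# e.g. a hypothetical labeled box vs. a ground-truth box
print(iou((10, 10, 110, 110), (20, 20, 120, 120)))  # ~0.68
```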
4.4. Using Knowledge for Vision Tasks

Finally, we want to demonstrate the usefulness of the visual knowledge learned by NEIL on standard vision tasks such as object detection and scene classification. Here, we also compare several aspects of our approach: (a) we first compare the quality of our automatically labeled dataset; as baselines, we train classifiers/detectors directly on the seed images downloaded from Google Image Search; (b) we compare NEIL against a standard bootstrapping approach which does not extract or use relationships; (c) finally, we demonstrate the usefulness of relationships by detecting and classifying new test data with and without the learned relationships.

Scene Classification: First we evaluate our visual knowledge for the task of scene classification. We build a dataset of 600 images (12 scene categories) using Flickr images. We compare the performance of our scene classifiers against scene classifiers trained from the top 15 images of Google Image Search (our seed classifiers). We also compare the performance with a standard bootstrapping approach that does not use any relationship extraction. Table 1 shows the results; we use mean average precision (mAP) as the evaluation metric. As the results show, automatic relationship extraction helps us constrain the learning problem, and so the learned classifiers give much better performance. Finally, if we also use the contextual information from NEIL relationships, we get a significant boost in performance.

Table 1. mAP performance for scene classification on 12 categories.
Seed Classifier (15 Google Images): 0.52
Bootstrapping (without relationships): 0.54
NEIL Scene Classifiers: 0.57
NEIL (Classifiers + Relationships): 0.62

Object Detection: We also evaluate our extracted visual knowledge for the task of object detection. We build a test dataset of 1000 images (15 object categories) using Flickr data. We compare the performance against object detectors trained directly on the top-50 and top-450 images from Google Image Search. We also compare the performance of detectors trained after aspect-ratio clustering, HOG-based clustering, and our proposed clustering procedure. Table 2 shows the detection results. Using 450 images from Google Image Search decreases performance due to noisy retrievals. While other clustering methods help, the gain from our clustering procedure is much larger. Finally, detectors trained using NEIL work better than standard bootstrapping.

Figure 6. Examples of extracted common sense relationships (e.g., "Monitor is found in Control room", "Washing machine is found in Utility room", "Bullet train is found in Train station platform", "Cougar looks similar to Cat", "Samsung galaxy is a kind of Cellphone", "Hallway is/has Narrow", "Trading floor is/has Crowded").

Table 2. mAP performance for object detection on 15 categories.
Latent SVM (50 Google Images): 0.34
Latent SVM (450 Google Images): 0.28
Latent SVM (450, Aspect Ratio Clustering): 0.30
Latent SVM (450, HOG-based Clustering): 0.33
Seed Detector (NEIL Clustering): 0.44
Bootstrapping (without relationships): 0.45
NEIL Detector: 0.49
NEIL Detector + Relationships: 0.51

Acknowledgements: This research was supported by ONR MURI N000141010934 and a gift from Google. The authors would like to thank Tom Mitchell and David Fouhey for insightful discussions. We would also like to thank our computing clusters warp and workhorse for doing all the hard work!

References
[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In TPAMI, 2010.
[2] T. Berg and D. Forsyth. Animals on the web. In CVPR, 2006.
[3] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.
[4] A. Carlson, J. Betteridge, E. R. H. Jr., and T. M. Mitchell. Coupling semi-supervised learning of categories and relations. In NAACL HLT Workshop on SSL for NLP, 2009.
[5] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.
[6] J. R. Curran, T. Murphy, and B. Scholz. Minimising semantic drift with mutual exclusion bootstrapping. In Pacific Association for Computational Linguistics, 2007.
[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[8] J. Deng, W. Dong, R. Socher, J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] S. Divvala, A. Efros, and M. Hebert. How important are 'deformable parts' in the deformable parts model? In ECCV Parts and Attributes Workshop, 2012.
[10] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? SIGGRAPH, 2012.
[11] S. Ebert, D. Larlus, and B. Schiele. Extracting structures in image collections for object recognition. In ECCV, 2010.
[12] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In CVPR, 2009.
[13] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. TPAMI, 2010.
[14] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In ECCV, 2004.
[15] R. Fergus, Y. Weiss, and A. Torralba. Semi-supervised learning in gigantic image collections. In NIPS, 2009.
[16] B. Frey and D. Dueck. Clustering by passing messages between data points. Science, 2007.
[17] M. Guillaumin, J. Verbeek, and C. Schmid. Multimodal semi-supervised learning for image classification. In CVPR, 2010.
[18] A. Gupta and L. S. Davis. Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV, 2008.
[19] B. Hariharan, J. Malik, and D. Ramanan. Discriminative decorrelation for clustering and classification. In ECCV, 2012.
[20] S. Khan, F. Anwer, R. Muhammad, J. van de Weijer, A. Joost, M. Vanrell, and A. Lopez. Color attributes for object detection. In CVPR, 2012.
[21] D. Kuettel, M. Guillaumin, and V. Ferrari. Segmentation propagation in ImageNet. In ECCV, 2012.
[22] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
[23] L.-J. Li, G. Wang, and L. Fei-Fei. OPTIMOL: Automatic object picture collection via incremental model learning. In CVPR, 2007.
[24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[25] A. Lucchi and J. Weston. Joint image and word sense discrimination for image retrieval. In ECCV, 2012.
[26] D. Martin, C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. PAMI, 2004.
[27] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
[28] D. Parikh and K. Grauman. Relative attributes. In ICCV, 2011.
[29] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR, 2012.
[30] P. Perona. Visions of a Visipedia. Proceedings of the IEEE, 2010.
[31] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[32] R. Raguram and S. Lazebnik. Computing iconic summaries of general visual concepts. In Workshop on Internet Vision, 2008.
[33] M. Rastegari, A. Farhadi, and D. Forsyth. Attribute discovery via predictable discriminative binary codes. In ECCV, 2012.
[34] F. Schroff, A. Criminisi, and A. Zisserman. Harvesting image databases from the web. In ICCV, 2007.
[35] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Augmented attribute representations. In ECCV, 2012.
[36] A. Shrivastava, S. Singh, and A. Gupta. Constrained semi-supervised learning using attributes and comparative attributes. In ECCV, 2012.
[37] B. Siddiquie and A. Gupta. Beyond active noun tagging: Modeling contextual interactions for multi-class active learning. In CVPR, 2010.
[38] E. Sudderth, A. Torralba, W. T. Freeman, and A. Willsky. Learning hierarchical models of scenes, objects, and parts. In ICCV, 2005.
[39] S. Vijayanarasimhan and K. Grauman. Large-scale live active learning: Training object detectors with crawled data and crowds. In CVPR, 2011.
[40] L. von Ahn and L. Dabbish. Labeling images with a computer game. In SIGCHI, 2004.
[41] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba. SUN database: Large scale scene recognition from abbey to zoo. In CVPR, 2010.
[42] X. Zhu. Semi-supervised learning literature survey. Technical report, CS, UW-Madison, 2005.
3 0.7833029 180 iccv-2013-From Where and How to What We See
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
4 0.779845 189 iccv-2013-HOGgles: Visualizing Object Detection Features
Author: Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, Antonio Torralba
Abstract: We introduce algorithms to visualize feature spaces used by object detectors. The tools in this paper allow a human to put on 'HOG goggles' and perceive the visual world as a HOG-based object detector sees it. We found that these visualizations allow us to analyze object detection systems in new ways and gain new insight into the detector's failures. For example, when we visualize the features for high scoring false alarms, we discovered that, although they are clearly wrong in image space, they do look deceptively similar to true positives in feature space. This result suggests that many of these false alarms are caused by our choice of feature space, and indicates that creating a better learning algorithm or building bigger datasets is unlikely to correct these errors. By visualizing feature spaces, we can gain a more intuitive understanding of our detection systems.
5 0.77868909 433 iccv-2013-Understanding High-Level Semantics by Modeling Traffic Patterns
Author: Hongyi Zhang, Andreas Geiger, Raquel Urtasun
Abstract: In this paper, we are interested in understanding the semantics of outdoor scenes in the context of autonomous driving. Towards this goal, we propose a generative model of 3D urban scenes which is able to reason not only about the geometry and objects present in the scene, but also about the high-level semantics in the form of traffic patterns. We found that a small number of patterns is sufficient to model the vast majority of traffic scenes and show how these patterns can be learned. As evidenced by our experiments, this high-level reasoning significantly improves the overall scene estimation as well as the vehicle-to-lane association when compared to state-of-the-art approaches [10].
6 0.77296692 152 iccv-2013-Extrinsic Camera Calibration without a Direct View Using Spherical Mirror
7 0.77077895 338 iccv-2013-Randomized Ensemble Tracking
8 0.77058303 326 iccv-2013-Predicting Sufficient Annotation Strength for Interactive Foreground Segmentation
9 0.76994145 150 iccv-2013-Exemplar Cut
10 0.76880467 188 iccv-2013-Group Sparsity and Geometry Constrained Dictionary Learning for Action Recognition from Depth Maps
11 0.76832604 350 iccv-2013-Relative Attributes for Large-Scale Abandoned Object Detection
12 0.76831079 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection
13 0.76635665 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning
14 0.76543653 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
15 0.76541293 194 iccv-2013-Heterogeneous Image Features Integration via Multi-modal Semi-supervised Learning Model
16 0.7653386 427 iccv-2013-Transfer Feature Learning with Joint Distribution Adaptation
17 0.76481104 86 iccv-2013-Concurrent Action Detection with Structural Prediction
18 0.7639569 44 iccv-2013-Adapting Classification Cascades to New Domains
19 0.76380277 181 iccv-2013-Frustratingly Easy NBNN Domain Adaptation
20 0.76358253 414 iccv-2013-Temporally Consistent Superpixels