cvpr cvpr2013 cvpr2013-275 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Liang Zheng, Shengjin Wang, Ziqiong Liu, Qi Tian
Abstract: The Inverse Document Frequency (IDF) is prevalently utilized in the Bag-of-Words based image search. The basic idea is to assign less weight to terms with high frequency, and vice versa. However, the estimation of visual word frequency is coarse and heuristic. Therefore, the effectiveness of the conventional IDF routine is marginal, and far from optimal. To tackle this problem, this paper introduces a novel IDF expression by the use of the Lp-norm pooling technique. Carefully designed, the proposed IDF takes into account the term frequency, document frequency, the complexity of images, as well as the codebook information. Optimizing the IDF function towards an optimal balance between TF and pIDF weights yields the so-called Lp-norm IDF (pIDF). We show that the conventional IDF is a special case of our generalized version, and two novel IDFs, i.e. the average IDF and the max IDF, can also be derived from our formula. Further, by accounting for the term frequency in each image, the proposed Lp-norm IDF helps to alleviate the visual word burstiness phenomenon. Our method is evaluated through extensive experiments on three benchmark datasets (Oxford 5K, Paris 6K and Flickr 1M). We report a performance improvement of as large as 27.1% over the baseline approach. Moreover, since the Lp-norm IDF is computed offline, no extra computation or memory cost is introduced to the system at all.
Reference: text
sentIndex sentText sentNum sentScore
1 However, the estimation of visual word frequency is coarse and heuristic. [sent-10, score-0.332]
2 Carefully designed, the proposed IDF takes into account the term frequency, document frequency, the complexity of images, as well as the codebook information. [sent-25, score-0.133]
3 Visual words zx and zy both occur in all the six images, but with varying TF distributions over the entire image collection. [sent-41, score-0.153]
4 In conventional IDF, the IDF weights are equal to zero for both words. [sent-42, score-0.079]
5 But when resorting to TF, zx and zy both have some discriminative power; this problem will be tackled in this paper. [sent-43, score-0.131]
6 This step is achieved by constructing a codebook through unsupervised clustering, e.g., k-means. [sent-49, score-0.071]
7 The Bag-of-Words model then treats each cluster center as a word in the codebook. [sent-53, score-0.19]
8 In the spirit of text retrieval, the method quantizes each detected keypoint into its nearest visual word(s) and represents each image as a histogram of visual words. [sent-54, score-0.107]
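As a minimal sketch of this quantization step (assuming hard assignment of each descriptor to a single nearest word; the array names and dimensions are illustrative, not taken from the paper):

```python
import numpy as np

def quantize_to_bow(descriptors, codebook):
    """Map the local descriptors of one image to a TF histogram over visual words.

    descriptors: (M, D) array of local features detected in the image.
    codebook:    (K, D) array of cluster centers (visual words).
    Returns a length-K vector whose k-th entry is v_{i,k}, the term
    frequency of word z_k in this image.
    """
    # Squared Euclidean distance between every descriptor and every word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # hard assignment to the nearest word
    return np.bincount(nearest, minlength=codebook.shape[0])
```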
9 To measure the importance of visual words, most of the existing approaches use the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme. [sent-56, score-0.125]
10 First, the conventional IDF functions on the image collection level. [sent-58, score-0.081]
11 It does not take a closer look into the visual word level, where multiple occurrences of a visual word are often observed. [sent-59, score-0.436]
12 Consequently, it only makes a coarse estimation of visual word frequency. [sent-60, score-0.218]
13 Second, as suggested in [5], IDF weighting does not address the problem of burstiness. [sent-61, score-0.078]
14 In this paper, we propose a novel IDF formula, called "Lp-norm IDF", which makes a careful estimation of visual word frequency and achieves significant improvement in performance. [sent-63, score-0.304]
15 The key idea is that the estimated visual word frequency is the weighted sum of the TF data across the whole database. [sent-64, score-0.332]
16 Experimental studies on three image search datasets confirm that by integrating the term frequency into IDF using the proposed method, image search performance is improved dramatically. [sent-67, score-0.208]
17 For example, soft matching [10, 16] assigns each descriptor to multiple visual words, but at the cost of increased query time and memory overhead. [sent-77, score-0.079]
18 [2] designs a quantization method based on kernel density estimation, while [1, 24] utilize binary features to improve efficiency and reduce quantization error. [sent-79, score-0.089]
19 The geometric context among local features can also be encoded into visual word assemblies [20, 22, 19]. [sent-81, score-0.218]
20 The third group of work concerns visual word weighting. [sent-83, score-0.218]
21 For example, [5] uses IDF-like weighting formulas to tackle the burstiness problem. [sent-84, score-0.276]
22 Our work, instead, re-estimates the visual word frequency by Lp-norm pooling in an offline manner. [sent-89, score-0.422]
23 Given the codebook {z_k}, k = 1, ..., K, with a vocabulary size of K, image I_i is quantized into a vector representation v_i = [v_{i,1}, v_{i,2}, ..., v_{i,K}]. [sent-95, score-0.094]
24 Conventional TF-IDF The TF part of the weighting scheme reflects the number of keypoints in an image that are quantized to a given visual word. [sent-101, score-0.168]
25 The presence of a less common visual word in an image may be a better discriminator than that of a more common one. [sent-104, score-0.218]
26 The IDF weight of a visual word z_k is denoted as: IDF(z_k) = log(N / n_k) (1) where N denotes the total number of images in the collection, and n_k encodes the number of images where z_k occurs. [sent-105, score-0.458]
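In code, Eq. (1) is a one-liner over the collection's term-frequency matrix (a sketch; the matrix V holding v_{i,k} is an assumed data structure, not something defined in this extract):

```python
import numpy as np

def conventional_idf(V):
    """IDF(z_k) = log(N / n_k).

    V: (N, K) term-frequency matrix over the collection; n_k is the
    number of images containing word z_k (document frequency).
    """
    N = V.shape[0]
    n_k = (V > 0).sum(axis=0)
    return np.log(N / np.maximum(n_k, 1))  # guard words that never occur
```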
27 In addition, in text retrieval, a variety of weighting methods have been proposed, such as the Okapi-BM25 [12], the pivoted normalization weighting [14], etc. [sent-120, score-0.22]
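For reference, a sketch of the Okapi-BM25 term weight mentioned above, in its usual text-retrieval form with the standard free parameters k1 and b; this follows the textbook formulation, not a formula taken from this paper:

```python
def bm25_weight(tf, idf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Standard Okapi-BM25 weight of one term in one document.

    tf:      raw term frequency of the word in the document
    idf:     inverse document frequency of the word
    doc_len: document length (here, the number of keypoints in the image)
    """
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```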
28 Lp-norm IDF The basic idea of IDF is the negative correlation between the visual word frequency uk and the IDF weight. [sent-124, score-0.377]
29 The conventional IDF treats uk as the number of images possessing word zk. [sent-125, score-0.296]
30 In Fig. 1, visual words zx and zy appear in all the images from I1 to I6. [sent-128, score-0.2]
31 It indicates that zx and zy are totally worthless for image search. [sent-131, score-0.107]
32 However, if we consider the fact that the frequency distributions of zx and zy over the entire image collection are quite different, we may realize that these visual words indeed possess some discriminative power, which is ignored by the conventional IDF formula. [sent-132, score-0.419]
33 Specifically, assume a collection D of images consists of N images, nk of which contain visual word zk. [sent-134, score-0.218]
34 We denote the image set containing zk as Pk = {I ∈ D|zk ∈ I}, and |Pk | = nk. [sent-135, score-0.12]
35 Built upon the adjusted estimation of visual word frequency u_k (Eq. 3), our framework is presented as follows: pIDF(z_k) = log(N / u_k) = log(N / Σ_{I_i ∈ P_k} w_{i,k} · v_{i,k}^p) (4) [sent-144, score-0.332]
36 Parameter p determines the extent to which the term frequency contributes to the estimated value. [sent-147, score-0.144]
37 The coefficient wi,k reflects the contribution of each image containing zk to the frequency estimation. [sent-148, score-0.234]
38 Put another way, it is more probable that zk appears in large images. [sent-154, score-0.12]
39 For numerical reasons, it is appropriate to introduce the normalization by relating image length to the average value d¯. [sent-157, score-0.076]
40 Next, we seek to incorporate codebook information into wi,k. [sent-159, score-0.083]
41 Given that uk is larger for a smaller codebook, another normalization should be considered. [sent-160, score-0.096]
42 The weights of visual words should be non-negative (each visual word, no matter how often it appears in bursts, should at least have some discriminative power). [sent-166, score-0.117]
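A sketch of the Lp-norm IDF under the reconstruction of Eq. (4) above. The exact form of w_{i,k} (how the image-length normalization against the average length d̄ and the codebook-size normalization are combined) is not fully specified in this extract, so the weight below is a hypothetical placeholder:

```python
import numpy as np

def lp_norm_idf(V, p=3.0, s=1.0):
    """pIDF(z_k) = log(N / u_k),  u_k = sum_{I_i in P_k} w_{i,k} * v_{i,k}^p.

    V: (N, K) term-frequency matrix. The weight w_{i,k} here only models
    the length normalization, discounting large images via (d_bar / d_i)^s;
    the direction and exponent are assumptions, and the codebook-size term
    mentioned in the text is folded into a constant and omitted.
    """
    N = V.shape[0]
    d = np.maximum(V.sum(axis=1).astype(float), 1.0)  # image lengths
    w = (d.mean() / d) ** s                           # hypothetical weight form
    u = (w[:, None] * V.astype(float) ** p).sum(axis=0)
    return np.log(N / np.maximum(u, 1e-12))
```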
43 The term frequencies of word zk in each image are depicted below. [sent-278, score-0.309]
44 The formulas demonstrate how to calculate the estimated word frequency of zk for the four IDFs, i.e. [sent-279, score-0.425]
45 the conventional IDF, average IDF, max IDF, as well as the Lp-norm IDF introduced in this paper. [sent-281, score-0.098]
46 aIDF(z_k) = log(N / u_k) = log(N / ((1/n_k) Σ_{I_i∈P_k} v_{i,k})) (6), and mIDF(z_k) = log(N / u_k) = log(N / max_i v_{i,k}) (7), where u_k is approximated by the (normalized) L1-norm and the L∞-norm of v_{i,k} (i = 1, ..., n_k), corresponding to the average pooling and max pooling, respectively. [sent-294, score-0.144]
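The two special cases, written against the same TF matrix (a sketch; for avgIDF, u_k is taken as the mean TF over the images in P_k, consistent with the average-pooling reading of Eq. (6)):

```python
import numpy as np

def avg_idf(V):
    """aIDF(z_k) = log(N / mean_{I_i in P_k} v_{i,k})  (average pooling)."""
    N = V.shape[0]
    n_k = np.maximum((V > 0).sum(axis=0), 1)
    u = V.sum(axis=0) / n_k
    return np.log(N / np.maximum(u, 1e-12))

def max_idf(V):
    """mIDF(z_k) = log(N / max_i v_{i,k})  (max pooling)."""
    N = V.shape[0]
    return np.log(N / np.maximum(V.max(axis=0), 1e-12))
```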
47 The pooling technique we used here differs from that in feature pooling in two aspects. [sent-297, score-0.124]
48 Second, the pooling used here aggregates the responses of the whole image collection into the accumulated frequency of each visual word, while in feature pooling, the result is the response of a single image to each visual word. [sent-299, score-0.318]
49 Based on Eq. 5, we seek to minimize a cost function of visual word discriminative power. [sent-303, score-0.254]
50 The TF-IDF weight encodes the importance of a visual word in separating one image from the others. [sent-304, score-0.218]
51 Note that visual words in repetitive structures (burstiness) are heavily down-weighted, while the weights of discriminative structures are preserved. [sent-316, score-0.112]
52 That is, we seek a trade-off between the two weighting factors. [sent-319, score-0.078]
53 More specifically, the objective function is to minimize the diversity of discriminative power among visual words, namely, argmin_p var_k { (1/n_k) Σ_{I_i∈P_k} v_{i,k} · pIDF(z_k) } (8) [sent-320, score-0.092]
54 where the variance operator characterizes the diversity of discriminative power among visual words. [sent-321, score-0.092]
55 The discriminative power of a visual word is described by its average TF-pIDF value. [sent-322, score-0.274]
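A grid-search sketch of the objective in Eq. (8), reusing the lp_norm_idf sketch above; the grid bounds are illustrative, not values from the paper:

```python
import numpy as np

def tune_p(V, p_grid=np.linspace(0.5, 6.0, 12)):
    """Pick the p that minimizes the variance, across visual words, of each
    word's average TF * pIDF value (its discriminative power, Eq. 8)."""
    n_k = np.maximum((V > 0).sum(axis=0), 1)
    avg_tf = V.sum(axis=0) / n_k            # average TF over images in P_k
    best_p, best_var = None, np.inf
    for p in p_grid:
        power = avg_tf * lp_norm_idf(V, p=p)  # per-word TF-pIDF value
        if power.var() < best_var:
            best_p, best_var = p, power.var()
    return best_p
```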
56 Visual Word Burstiness In text retrieval, the term positive adaptation or burstiness refers to the phenomenon in which words tend to appear in bursts, i.e., once a word appears, it is more likely to appear again. [sent-331, score-0.279]
57 In the image search community, burstiness often describes the phenomenon that repetitive structures are present (see Fig. [sent-334, score-0.272]
58 Another difference lies in that [5] penalizes burstiness by computing a normalization factor on-the-fly, while our method assigns weights to visual words on the visual word level and in an offline manner. [sent-339, score-0.586]
59 To analyze the burstiness phenomenon, we plot the visual word distribution in Fig. 4. [sent-342, score-0.396]
60 For each visual word in the codebook, we first count its maximum term frequency across the image collection and form a visual word histogram in Fig. 4(a). [sent-344, score-0.602]
61 Then, we denote the maximum term frequency of image I as N_I, and count the number of images that fall into different values of N_I, as is shown in Fig. 4(b). [sent-346, score-0.146]
62 The statistics suggest that a majority of visual words maximally occur 2 or 3 times in an image and that most images have a maximal term frequency of 5 or 6. [sent-348, score-0.179]
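The two statistics behind Fig. 4 can be read directly off the TF matrix (a sketch, assuming integer term frequencies):

```python
import numpy as np

def burstiness_stats(V):
    """Histograms behind Fig. 4: max TF per visual word and max TF per image."""
    max_tf_per_word = V.max(axis=0).astype(int)   # per word, over images (Fig. 4a)
    max_tf_per_image = V.max(axis=1).astype(int)  # N_I: per image, over words (Fig. 4b)
    return np.bincount(max_tf_per_word), np.bincount(max_tf_per_image)
```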
63 Therefore, the burstiness phenomenon (see the first row of Fig. [sent-349, score-0.202]
64 Fig. 3 depicts the impact of the proposed Lp-norm IDF. Figure 4. [sent-352, score-0.482]
65 (a): Histogram of visual words for different values of maximum term frequency; (b): Histogram of images for different values of maximum term frequency. [sent-353, score-0.129]
66 The data is evaluated over the Flickr 1M dataset, and the codebook size is 1M. [sent-354, score-0.071]
67 The green and red markers are co-located with visual words, and correspond to the conventional IDF and the Lp-norm IDF weights, respectively. [sent-356, score-0.137]
68 Small red markers indicate that the visual words are heavily down-weighted, which suggests that these visual words are part of repetitive structures. [sent-358, score-0.261]
69 On the other hand, large red markers denote visual words that are only slightly down-weighted, if at all; these correspond to quite discriminative structures. [sent-359, score-0.1]
70 As a result, the Lp-norm IDF punishes visual word burstiness, while retaining the discriminative structures. [sent-363, score-0.268]
71 In Eq. 5, the parameter p determines the extent to which burstiness is punished, so its value should be tuned carefully. [sent-395, score-0.19]
72 It can be seen from Fig. 5 that a larger value of p indicates amplified punishment of visual word burstiness. [sent-415, score-0.218]
73 L2 normalization strikes a compromise between the above two methods, and produces the highest performance. Figure 6 ((a): Oxford 5K; (b): Paris 6K; x-axis: codebook size (K)).
74 Image search performance as a function of the codebook size for different weighting schemes. [sent-424, score-0.174]
75 The compared schemes are the conventional IDF, average IDF (avgIDF), max IDF (maxIDF), and the Lp-norm IDF (pIDF). [sent-449, score-0.098]
76 So it neglects the document frequency, while conventional IDF neglects TF. [sent-462, score-0.139]
77 Fig. 6 and Table 3 show that the average IDF is slightly superior to both the conventional IDF and the max IDF. [sent-466, score-0.098]
78 By its nature, the average IDF explicitly considers both the TF of visual words in each image and the document frequency. [sent-470, score-0.148]
79 The Lp-norm IDF estimates the word frequency using both [sent-474, score-0.285]
80 the term frequency and document frequency of each visual word. [sent-475, score-0.319]
81 By carefully weighting the contribution of every database image, and optimizing the parameter p, the Lp-norm IDF gives better weights to visual words, thus making a significant improvement over the baseline approach. [sent-476, score-0.205]
82 First, the introduction of conventional TF-IDF helps to improve performance over the "no TF-IDF" case, but the improvement is marginal. Figure 7 ((a): Oxford 5K + Flickr 1M; (b): Paris 6K + Flickr 1M; x-axis: database size (K)).
83 Third, we note that although the codebook is trained on the Oxford dataset, a more notable improvement is observed on the Paris dataset. [sent-499, score-0.09]
84 The codebook may be quite discriminative for the Oxford dataset, but much more ambiguous [16] for Paris. [sent-500, score-0.095]
85 Therefore, the burstiness problem is more severe on the Paris dataset. [sent-501, score-0.178]
86 Our proposed method helps to down-weight visual words in bursts and alleviate the burstiness problem, so more improvement is brought on the Paris dataset. [sent-502, score-0.477]
87 It indicates that the Lp-norm IDF based approach generalizes well to the case where the codebook is trained on irrelevant data, and the improvement is much more considerable. [sent-503, score-0.09]
88 Okapi-BM25 weighting is the least efficient one, because this method computes the TF-IDF weights online, resulting in a further efficiency loss. [sent-516, score-0.117]
89 Comparison with other methods: We compare the proposed Lp-norm IDF with [5] and the Okapi-BM25 weighting [12], as in Table 3. [sent-518, score-0.078]
90 However, our method clearly outperforms the Okapi-BM25 weighting on both large datasets. [sent-522, score-0.078]
91 Notably, on the Oxford 5K + 1M dataset, the BM25 weighting obtains an mAP of 0. [sent-523, score-0.078]
92 We couple both the inter- and intra-image burstiness solutions in our experiment. [sent-527, score-0.191]
93 Conclusion This paper proposed an effective IDF weighting scheme, i.e., the Lp-norm IDF. [sent-540, score-0.078]
94 It integrates the term frequency, document length, as well as the codebook information, into the final IDF representation. [sent-546, score-0.129]
95 The Lp-norm IDF functions on the visual word level, and [sent-547, score-0.218]
96 can deal with the burstiness problem by down-weighting visual words in bursts. [sent-548, score-0.271]
97 Furthermore, the Lp-norm IDF outperforms several state-of-the-art weighting approaches, and more improvement can be observed when the database size gets larger. [sent-551, score-0.111]
98 In the future, more investigation will be focused on the empirical studies of visual word frequency distribution and its discriminative power. [sent-554, score-0.356]
99 This study reaffirms the importance of visual word weighting, and various weighting strategies will be studied. [sent-555, score-0.296]
100 Descriptive visual words and visual phrases for image applications. [sent-699, score-0.14]
wordName wordTfidf (topN-words)
[('idf', 0.877), ('burstiness', 0.178), ('word', 0.171), ('zk', 0.12), ('frequency', 0.114), ('oxford', 0.113), ('paris', 0.107), ('tf', 0.09), ('pidf', 0.089), ('flickr', 0.089), ('weighting', 0.078), ('idfs', 0.077), ('codebook', 0.071), ('pooling', 0.062), ('conventional', 0.061), ('zy', 0.059), ('normalization', 0.051), ('zx', 0.048), ('visual', 0.047), ('words', 0.046), ('uk', 0.045), ('document', 0.044), ('lognuk', 0.038), ('quantization', 0.034), ('bursts', 0.034), ('repetitive', 0.033), ('inghua', 0.03), ('baseline', 0.029), ('markers', 0.029), ('offline', 0.028), ('distractor', 0.027), ('datasets', 0.026), ('max', 0.026), ('lpnorm', 0.026), ('punished', 0.026), ('punishes', 0.026), ('featured', 0.025), ('search', 0.025), ('discriminative', 0.024), ('phenomenon', 0.024), ('pk', 0.024), ('scalability', 0.023), ('formula', 0.023), ('vocabulary', 0.023), ('retrieval', 0.022), ('power', 0.021), ('efficiency', 0.021), ('formulas', 0.02), ('collection', 0.02), ('improvement', 0.019), ('treats', 0.019), ('chum', 0.019), ('mm', 0.018), ('ls', 0.018), ('evident', 0.018), ('keypoints', 0.018), ('weights', 0.018), ('term', 0.018), ('neglects', 0.017), ('populated', 0.017), ('query', 0.017), ('ii', 0.016), ('indexing', 0.016), ('memory', 0.015), ('philbin', 0.015), ('je', 0.015), ('database', 0.014), ('response', 0.014), ('helps', 0.014), ('map', 0.014), ('fb', 0.014), ('egou', 0.014), ('length', 0.014), ('douze', 0.014), ('count', 0.014), ('brought', 0.014), ('matches', 0.013), ('couple', 0.013), ('isard', 0.013), ('scale', 0.013), ('heavily', 0.013), ('text', 0.013), ('cn', 0.013), ('tian', 0.013), ('contextual', 0.012), ('notably', 0.012), ('deals', 0.012), ('extent', 0.012), ('seek', 0.012), ('structures', 0.012), ('average', 0.011), ('hamming', 0.011), ('consequently', 0.011), ('fisr', 0.011), ('perd', 0.011), ('fxpal', 0.011), ('qit', 0.011), ('utsa', 0.011), ('leal', 0.011), ('trhde', 0.011)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 275 cvpr-2013-Lp-Norm IDF for Large Scale Image Search
Author: Liang Zheng, Shengjin Wang, Ziqiong Liu, Qi Tian
Abstract: The Inverse Document Frequency (IDF) is prevalently utilized in the Bag-of-Words based image search. The basic idea is to assign less weight to terms with high frequency, and vice versa. However, the estimation of visual word frequency is coarse and heuristic. Therefore, the effectiveness of the conventional IDF routine is marginal, and far from optimal. To tackle this problem, this paper introduces a novel IDF expression by the use of the Lp-norm pooling technique. Carefully designed, the proposed IDF takes into account the term frequency, document frequency, the complexity of images, as well as the codebook information. Optimizing the IDF function towards an optimal balance between TF and pIDF weights yields the so-called Lp-norm IDF (pIDF). We show that the conventional IDF is a special case of our generalized version, and two novel IDFs, i.e. the average IDF and the max IDF, can also be derived from our formula. Further, by accounting for the term frequency in each image, the proposed Lp-norm IDF helps to alleviate the visual word burstiness phenomenon. Our method is evaluated through extensive experiments on three benchmark datasets (Oxford 5K, Paris 6K and Flickr 1M). We report a performance improvement of as large as 27.1% over the baseline approach. Moreover, since the Lp-norm IDF is computed offline, no extra computation or memory cost is introduced to the system at all.
2 0.17673953 456 cvpr-2013-Visual Place Recognition with Repetitive Structures
Author: Akihiko Torii, Josef Sivic, Tomáš Pajdla, Masatoshi Okutomi
Abstract: Repeated structures such as building facades, fences or road markings often represent a significant challenge for place recognition. Repeated structures are notoriously hard for establishing correspondences using multi-view geometry. Even more importantly, they violate the feature independence assumed in the bag-of-visual-words representation which often leads to over-counting evidence and significant degradation of retrieval performance. In this work we show that repeated structures are not a nuisance but, when appropriately represented, they form an important distinguishing feature for many places. We describe a representation of repeated structures suitable for scalable retrieval. It is based on robust detection of repeated image structures and a simple modification of weights in the bag-of-visual-word model. Place recognition results are shown on datasets of street-level imagery from Pittsburgh and San Francisco demonstrating significant gains in recognition performance compared to the standard bag-of-visual-words baseline and more recently proposed burstiness weighting.
3 0.083301403 343 cvpr-2013-Query Adaptive Similarity for Large Scale Object Retrieval
Author: Danfeng Qin, Christian Wengert, Luc Van_Gool
Abstract: Many recent object retrieval systems rely on local features for describing an image. The similarity between a pair of images is measured by aggregating the similarity between their corresponding local features. In this paper we present a probabilistic framework for modeling the feature to feature similarity measure. We then derive a query adaptive distance which is appropriate for global similarity evaluation. Furthermore, we propose a function to score the individual contributions to an image-to-image similarity within the probabilistic framework. Experimental results show that our method improves the retrieval accuracy significantly and consistently. Moreover, our result compares favorably to the state-of-the-art.
4 0.075307839 8 cvpr-2013-A Fast Approximate AIB Algorithm for Distributional Word Clustering
Author: Lei Wang, Jianjia Zhang, Luping Zhou, Wanqing Li
Abstract: Distributional word clustering merges the words having similar probability distributions to attain reliable parameter estimation, compact classification models and even better classification performance. Agglomerative Information Bottleneck (AIB) is one of the typical word clustering algorithms and has been applied to both traditional text classification and recent image recognition. Although enjoying theoretical elegance, AIB has one main issue with its computational efficiency, especially when clustering a large number of words. Different from existing solutions to this issue, we analyze the characteristics of its objective function, the loss of mutual information, and show that by merely using the ratio of word-class joint probabilities of each word, good candidate word pairs for merging can be easily identified. Based on this finding, we propose a fast approximate AIB algorithm and show that it can significantly improve the computational efficiency of AIB while well maintaining or even slightly increasing its classification performance. Experimental study on both text and image classification benchmark data sets shows that our algorithm can achieve more than 100 times speedup on large real data sets over the state-of-the-art method.
5 0.064618379 200 cvpr-2013-Harvesting Mid-level Visual Concepts from Large-Scale Internet Images
Author: Quannan Li, Jiajun Wu, Zhuowen Tu
Abstract: Obtaining effective mid-level representations has become an increasingly important task in computer vision. In this paper, we propose a fully automatic algorithm which harvests visual concepts from a large number of Internet images (more than a quarter of a million) using text-based queries. Existing approaches to visual concept learning from Internet images either rely on strong supervision with detailed manual annotations or learn image-level classifiers only. Here, we take advantage of having massive well-organized Google and Bing image data; visual concepts (around 14,000) are automatically exploited from images using word-based queries. Using the learned visual concepts, we show state-of-the-art performances on a variety of benchmark datasets, which demonstrate the effectiveness of the learned mid-level representations: being able to generalize well to general natural images. Our method shows significant improvement over the competing systems in image classification, including those with strong supervision.
6 0.063111052 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
7 0.062529601 5 cvpr-2013-A Bayesian Approach to Multimodal Visual Dictionary Learning
8 0.05835839 268 cvpr-2013-Leveraging Structure from Motion to Learn Discriminative Codebooks for Scalable Landmark Classification
9 0.055469185 53 cvpr-2013-BFO Meets HOG: Feature Extraction Based on Histograms of Oriented p.d.f. Gradients for Image Classification
10 0.05143737 38 cvpr-2013-All About VLAD
11 0.047700368 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection
12 0.044419792 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
13 0.039858386 189 cvpr-2013-Graph-Based Discriminative Learning for Location Recognition
14 0.039815214 319 cvpr-2013-Optimized Product Quantization for Approximate Nearest Neighbor Search
15 0.038347658 164 cvpr-2013-Fast Convolutional Sparse Coding
16 0.038310409 145 cvpr-2013-Efficient Object Detection and Segmentation for Fine-Grained Recognition
17 0.036230348 260 cvpr-2013-Learning and Calibrating Per-Location Classifiers for Visual Place Recognition
18 0.036213312 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
19 0.034473345 182 cvpr-2013-Fusing Robust Face Region Descriptors via Multiple Metric Learning for Face Recognition in the Wild
20 0.034203894 346 cvpr-2013-Real-Time No-Reference Image Quality Assessment Based on Filter Learning
topicId topicWeight
[(0, 0.078), (1, -0.024), (2, -0.009), (3, 0.013), (4, 0.03), (5, 0.015), (6, -0.031), (7, -0.031), (8, -0.041), (9, -0.021), (10, -0.027), (11, 0.003), (12, 0.044), (13, 0.004), (14, 0.033), (15, -0.069), (16, 0.016), (17, 0.005), (18, 0.075), (19, -0.06), (20, 0.084), (21, -0.038), (22, 0.029), (23, 0.021), (24, -0.011), (25, 0.039), (26, 0.013), (27, 0.039), (28, -0.006), (29, -0.058), (30, 0.034), (31, 0.023), (32, -0.054), (33, 0.068), (34, -0.018), (35, 0.014), (36, -0.029), (37, 0.023), (38, 0.003), (39, -0.041), (40, -0.015), (41, -0.009), (42, -0.05), (43, -0.041), (44, -0.083), (45, 0.038), (46, -0.009), (47, 0.01), (48, 0.074), (49, -0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.92293555 275 cvpr-2013-Lp-Norm IDF for Large Scale Image Search
Author: Liang Zheng, Shengjin Wang, Ziqiong Liu, Qi Tian
Abstract: The Inverse Document Frequency (IDF) is prevalently utilized in the Bag-of-Words based image search. The basic idea is to assign less weight to terms with high frequency, and vice versa. However, the estimation of visual word frequency is coarse and heuristic. Therefore, the effectiveness of the conventional IDF routine is marginal, and far from optimal. To tackle this problem, this paper introduces a novel IDF expression by the use of the Lp-norm pooling technique. Carefully designed, the proposed IDF takes into account the term frequency, document frequency, the complexity of images, as well as the codebook information. Optimizing the IDF function towards an optimal balance between TF and pIDF weights yields the so-called Lp-norm IDF (pIDF). We show that the conventional IDF is a special case of our generalized version, and two novel IDFs, i.e. the average IDF and the max IDF, can also be derived from our formula. Further, by accounting for the term frequency in each image, the proposed Lp-norm IDF helps to alleviate the visual word burstiness phenomenon. Our method is evaluated through extensive experiments on three benchmark datasets (Oxford 5K, Paris 6K and Flickr 1M). We report a performance improvement of as large as 27.1% over the baseline approach. Moreover, since the Lp-norm IDF is computed offline, no extra computation or memory cost is introduced to the system at all.
2 0.77127153 456 cvpr-2013-Visual Place Recognition with Repetitive Structures
Author: Akihiko Torii, Josef Sivic, Tomáš Pajdla, Masatoshi Okutomi
Abstract: Repeated structures such as building facades, fences or road markings often represent a significant challenge for place recognition. Repeated structures are notoriously hard for establishing correspondences using multi-view geometry. Even more importantly, they violate the feature independence assumed in the bag-of-visual-words representation which often leads to over-counting evidence and significant degradation of retrieval performance. In this work we show that repeated structures are not a nuisance but, when appropriately represented, they form an important distinguishing feature for many places. We describe a representation of repeated structures suitable for scalable retrieval. It is based on robust detection of repeated image structures and a simple modification of weights in the bag-of-visual-word model. Place recognition results are shown on datasets of street-level imagery from Pittsburgh and San Francisco demonstrating significant gains in recognition performance compared to the standard bag-of-visual-words baseline and more recently proposed burstiness weighting.
3 0.73040944 8 cvpr-2013-A Fast Approximate AIB Algorithm for Distributional Word Clustering
Author: Lei Wang, Jianjia Zhang, Luping Zhou, Wanqing Li
Abstract: Distributional word clustering merges the words having similar probability distributions to attain reliable parameter estimation, compact classification models and even better classification performance. Agglomerative Information Bottleneck (AIB) is one of the typical word clustering algorithms and has been applied to both traditional text classification and recent image recognition. Although enjoying theoretical elegance, AIB has one main issue with its computational efficiency, especially when clustering a large number of words. Different from existing solutions to this issue, we analyze the characteristics of its objective function, the loss of mutual information, and show that by merely using the ratio of word-class joint probabilities of each word, good candidate word pairs for merging can be easily identified. Based on this finding, we propose a fast approximate AIB algorithm and show that it can significantly improve the computational efficiency of AIB while well maintaining or even slightly increasing its classification performance. Experimental study on both text and image classification benchmark data sets shows that our algorithm can achieve more than 100 times speedup on large real data sets over the state-of-the-art method.
4 0.64178842 38 cvpr-2013-All About VLAD
Author: unkown-author
Abstract: The objective of this paper is large scale object instance retrieval, given a query image. A starting point of such systems is feature detection and description, for example using SIFT. The focus of this paper, however, is towards very large scale retrieval where, due to storage requirements, very compact image descriptors are required and no information about the original SIFT descriptors can be accessed directly at run time. We start from VLAD, the state-of-the-art compact descriptor introduced by Jégou et al. [8] for this purpose, and make three novel contributions: first, we show that a simple change to the normalization method significantly improves retrieval performance; second, we show that vocabulary adaptation can substantially alleviate problems caused when images are added to the dataset after initial vocabulary learning. These two methods set a new state-of-the-art over all benchmarks investigated here for both mid-dimensional (20k-D to 30k-D) and small (128-D) descriptors. Our third contribution is a multiple spatial VLAD representation, MultiVLAD, that allows the retrieval and localization of objects that only extend over a small part of an image (again without requiring use of the original image SIFT descriptors).
5 0.61176527 200 cvpr-2013-Harvesting Mid-level Visual Concepts from Large-Scale Internet Images
Author: Quannan Li, Jiajun Wu, Zhuowen Tu
Abstract: Obtaining effective mid-level representations has become an increasingly important task in computer vision. In this paper, we propose a fully automatic algorithm which harvests visual concepts from a large number of Internet images (more than a quarter of a million) using text-based queries. Existing approaches to visual concept learning from Internet images either rely on strong supervision with detailed manual annotations or learn image-level classifiers only. Here, we take advantage of having massive well-organized Google and Bing image data; visual concepts (around 14,000) are automatically exploited from images using word-based queries. Using the learned visual concepts, we show state-of-the-art performances on a variety of benchmark datasets, which demonstrate the effectiveness of the learned mid-level representations: being able to generalize well to general natural images. Our method shows significant improvement over the competing systems in image classification, including those with strong supervision.
6 0.5989114 183 cvpr-2013-GRASP Recurring Patterns from a Single View
7 0.59841627 343 cvpr-2013-Query Adaptive Similarity for Large Scale Object Retrieval
8 0.55751407 268 cvpr-2013-Leveraging Structure from Motion to Learn Discriminative Codebooks for Scalable Landmark Classification
9 0.54734015 434 cvpr-2013-Topical Video Object Discovery from Key Frames by Modeling Word Co-occurrence Prior
10 0.54482502 5 cvpr-2013-A Bayesian Approach to Multimodal Visual Dictionary Learning
14 0.50787902 260 cvpr-2013-Learning and Calibrating Per-Location Classifiers for Visual Place Recognition
15 0.50091976 130 cvpr-2013-Discriminative Color Descriptors
16 0.4993073 246 cvpr-2013-Learning Binary Codes for High-Dimensional Data Using Bilinear Projections
17 0.47522426 69 cvpr-2013-Boosting Binary Keypoint Descriptors
18 0.46748227 404 cvpr-2013-Sparse Quantization for Patch Description
19 0.46427193 157 cvpr-2013-Exploring Implicit Image Statistics for Visual Representativeness Modeling
20 0.46045685 382 cvpr-2013-Scene Text Recognition Using Part-Based Tree-Structured Character Detection
topicId topicWeight
[(10, 0.082), (15, 0.011), (16, 0.017), (26, 0.034), (33, 0.226), (67, 0.418), (69, 0.037), (87, 0.045)]
simIndex simValue paperId paperTitle
1 0.94480544 142 cvpr-2013-Efficient Detector Adaptation for Object Detection in a Video
Author: Pramod Sharma, Ram Nevatia
Abstract: In this work, we present a novel and efficient detector adaptation method which improves the performance of an offline trained classifier (baseline classifier) by adapting it to new test datasets. We address two critical aspects of adaptation methods: generalizability and computational efficiency. We propose an adaptation method, which can be applied to various baseline classifiers and is computationally efficient also. For a given test video, we collect online samples in an unsupervised manner and train a random-fern adaptive classifier. The adaptive classifier improves precision of the baseline classifier by validating the obtained detection responses from the baseline classifier as correct detections or false alarms. Experiments demonstrate generalizability, computational efficiency and effectiveness of our method, as we compare our method with state-of-the-art approaches for the problem of human detection and show good performance with high computational efficiency on two different baseline classifiers.
2 0.93843412 103 cvpr-2013-Decoding Children's Social Behavior
Author: James M. Rehg, Gregory D. Abowd, Agata Rozga, Mario Romero, Mark A. Clements, Stan Sclaroff, Irfan Essa, Opal Y. Ousley, Yin Li, Chanho Kim, Hrishikesh Rao, Jonathan C. Kim, Liliana Lo Presti, Jianming Zhang, Denis Lantsman, Jonathan Bidwell, Zhefan Ye
Abstract: We introduce a new problem domain for activity recognition: the analysis of children's social and communicative behaviors based on video and audio data. We specifically target interactions between children aged 1–2 years and an adult. Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. We introduce a new publicly-available dataset containing over 160 sessions of a 3–5 minute child-adult interaction. In each session, the adult examiner followed a semistructured play interaction protocol which was designed to elicit a broad range of social behaviors. We identify the key technical challenges in analyzing these behaviors, and describe methods for decoding the interactions. We present experimental results that demonstrate the potential of the dataset to drive interesting research questions, and show preliminary results for multi-modal activity recognition.
3 0.91104037 398 cvpr-2013-Single-Pedestrian Detection Aided by Multi-pedestrian Detection
Author: Wanli Ouyang, Xiaogang Wang
Abstract: In this paper, we address the challenging problem of detecting pedestrians who appear in groups and have interaction. A new approach is proposed for single-pedestrian detection aided by multi-pedestrian detection. A mixture model of multi-pedestrian detectors is designed to capture the unique visual cues which are formed by nearby multiple pedestrians but cannot be captured by single-pedestrian detectors. A probabilistic framework is proposed to model the relationship between the configurations estimated by single- and multi-pedestrian detectors, and to refine the single-pedestrian detection result with multi-pedestrian detection. It can integrate with any single-pedestrian detector without significantly increasing the computation load. 15 state-of-the-art single-pedestrian detection approaches are investigated on three widely used public datasets: Caltech, TUD-Brussels and ETH. Experimental results show that our framework significantly improves all these approaches. The average improvement is 9% on the Caltech-Test dataset, 11% on the TUD-Brussels dataset and 17% on the ETH dataset in terms of average miss rate. The lowest average miss rate is reduced from 48% to 43% on the Caltech-Test dataset, from 55% to 50% on the TUD-Brussels dataset and from 51% to 41% on the ETH dataset.
4 0.86171931 160 cvpr-2013-Face Recognition in Movie Trailers via Mean Sequence Sparse Representation-Based Classification
Author: Enrique G. Ortiz, Alan Wright, Mubarak Shah
Abstract: This paper presents an end-to-end video face recognition system, addressing the difficult problem of identifying a video face track using a large dictionary of still face images of a few hundred people, while rejecting unknown individuals. A straightforward application of the popular ℓ1-minimization for face recognition on a frame-by-frame basis is prohibitively expensive, so we propose a novel algorithm Mean Sequence SRC (MSSRC) that performs video face recognition using a joint optimization leveraging all of the available video data and the knowledge that the face track frames belong to the same individual. By adding a strict temporal constraint to the ℓ1-minimization that forces individual frames in a face track to all reconstruct a single identity, we show the optimization reduces to a single minimization over the mean of the face track. We also introduce a new Movie Trailer Face Dataset collected from 101 movie trailers on YouTube. Finally, we show that our method matches or outperforms the state-of-the-art on three existing datasets (YouTube Celebrities, YouTube Faces, and Buffy) and our unconstrained Movie Trailer Face Dataset. More importantly, our method excels at rejecting unknown identities by at least 8% in average precision.
5 0.85640782 45 cvpr-2013-Articulated Pose Estimation Using Discriminative Armlet Classifiers
Author: Georgia Gkioxari, Pablo Arbeláez, Lubomir Bourdev, Jitendra Malik
Abstract: We propose a novel approach for human pose estimation in real-world cluttered scenes, and focus on the challenging problem of predicting the pose of both arms for each person in the image. For this purpose, we build on the notion of poselets [4] and train highly discriminative classifiers to differentiate among arm configurations, which we call armlets. We propose a rich representation which, in addition to standard HOG features, integrates the information of strong contours, skin color and contextual cues in a principled manner. Unlike existing methods, we evaluate our approach on a large subset of images from the PASCAL VOC detection dataset, where critical visual phenomena, such as occlusion, truncation, multiple instances and clutter are the norm. Our approach outperforms Yang and Ramanan [26], the state-of-the-art technique, with an improvement from 29.0% to 37.5% PCP accuracy on the arm keypoint prediction task, on this new pose estimation dataset.
same-paper 6 0.84894133 275 cvpr-2013-Lp-Norm IDF for Large Scale Image Search
7 0.8399505 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation
8 0.83615315 246 cvpr-2013-Learning Binary Codes for High-Dimensional Data Using Bilinear Projections
9 0.82335931 375 cvpr-2013-Saliency Detection via Graph-Based Manifold Ranking
10 0.80674458 345 cvpr-2013-Real-Time Model-Based Rigid Object Pose Estimation and Tracking Combining Dense and Sparse Visual Cues
11 0.77781165 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
12 0.77231222 288 cvpr-2013-Modeling Mutual Visibility Relationship in Pedestrian Detection
13 0.76695251 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
14 0.74963319 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
15 0.74878657 363 cvpr-2013-Robust Multi-resolution Pedestrian Detection in Traffic Scenes
16 0.74007088 122 cvpr-2013-Detection Evolution with Multi-order Contextual Co-occurrence
17 0.71763611 438 cvpr-2013-Towards Pose Robust Face Recognition
18 0.7138958 60 cvpr-2013-Beyond Physical Connections: Tree Models in Human Pose Estimation
19 0.71124429 338 cvpr-2013-Probabilistic Elastic Matching for Pose Variant Face Verification
20 0.70890033 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors