iccv iccv2013 iccv2013-235 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kaiye Wang, Ran He, Wei Wang, Liang Wang, Tieniu Tan
Abstract: Cross-modal matching has recently drawn much attention due to the widespread existence of multimodal data. It aims to match data from different modalities, and generally involves two basic problems: the measure of relevance and coupled feature selection. Most previous works mainly focus on solving the first problem. In this paper, we propose a novel coupled linear regression framework to deal with both problems. Our method learns two projection matrices to map multimodal data into a common feature space, in which cross-modal data matching can be performed. And in the learning procedure, the ℓ21-norm penalties are imposed on the two projection matrices separately, which leads to selecting relevant and discriminative features from coupled feature spaces simultaneously. A trace norm is further imposed on the projected data as a low-rank constraint, which enhances the relevance of different modal data with connections. We also present an iterative algorithm based on half-quadratic minimization to solve the proposed regularized linear regression problem. The experimental results on two challenging cross-modal datasets demonstrate that the proposed method outperforms the state-of-the-art approaches.
Reference: text
sentIndex sentText sentNum sentScore
1 It aims to match data from different modalities, and generally involves two basic problems: the measure of relevance and coupled feature selection. [sent-3, score-0.457]
2 In this paper, we propose a novel coupled linear regression framework to deal with both problems. [sent-5, score-0.387]
3 Our method learns two projection matrices to map multimodal data into a common feature space, in which cross-modal data matching can be performed. [sent-6, score-0.31]
4 The ℓ21-norm penalties are imposed on the two projection matrices separately, which leads to selecting relevant and discriminative features from coupled feature spaces simultaneously. [sent-8, score-0.682]
5 A trace norm is further imposed on the projected data as a low-rank constraint, which enhances the relevance of different modal data with connections. [sent-9, score-0.603]
6 We also present an iterative algorithm based on half-quadratic minimization to solve the proposed regularized linear regression problem. [sent-10, score-0.197]
7 The task of cross-modal matching is to predict whether a pair of data points from two different modalities represent the same underlying content or object. [sent-14, score-0.306]
8 Take multimedia retrieval for example: one often seeks to find the picture (or video) that best illustrates a given text, or the text that best describes a given picture (or video). [sent-16, score-0.296]
9 Most of them just focus on learning a common latent subspace to make all data comparable. [sent-18, score-0.165]
10 UA and UB are the projection matrices learned using our method on spaces A and B. [sent-24, score-0.112]
11 The ℓ21-norm and trace norm are used for coupled feature selection and low-rank relevance measurement, respectively. [sent-26, score-0.857]
12 An important problem, how to simultaneously select relevant and discriminative features from two different feature spaces, is usually ignored. [sent-27, score-0.181]
13 Although various feature selection methods [26] have been developed for single-modality data analysis, they have not been extended to the case of multi-modality data. [sent-29, score-0.225]
14 The ℓ21-norm has been proven to be a powerful tool for the feature selection problem [5, 8, 15], and the trace norm [1, 3, 4, 6] is used to encode the correlation of the design matrix or prior knowledge by enforcing a low-rank solution. [sent-31, score-0.485]
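As a concrete illustration of the two regularizers just mentioned, here is a minimal numpy sketch (ours, not from the paper; the function names are our own) of the ℓ21-norm and the trace (nuclear) norm:

```python
import numpy as np

def l21_norm(U):
    # Sum of the l2-norms of the rows of U; penalizing this drives whole
    # rows to zero, which amounts to discarding the corresponding features.
    return float(np.linalg.norm(U, axis=1).sum())

def trace_norm(W):
    # Nuclear norm: the sum of the singular values of W; penalizing it
    # encourages a low-rank W.
    return float(np.linalg.svd(W, compute_uv=False).sum())
```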
15 Motivated by these recent advances, this paper proposes a novel regularization framework (as shown in Figure 1) for the cross-modal matching problem, by combining common subspace learning and coupled feature selection. [sent-32, score-0.575]
16 First, inspired by the potential relationship between Canonical Correlation Analysis (CCA) and linear least squares [23], coupled linear regression is used to project data from different modalities into a common subspace that is defined by label information. [sent-33, score-0.758]
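The common subspace "defined by label information" is typically spanned by class-indicator vectors; a minimal sketch under that assumption (the one-hot encoding is our choice and is not spelled out in this excerpt):

```python
import numpy as np

def label_matrix(labels, num_classes):
    # Row i is the one-hot indicator of sample i's class; both coupled
    # regressions target these rows as the common representation.
    Y = np.zeros((len(labels), num_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y
```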
17 The ℓ21-norm is used to select the relevant and discriminative features from the coupled modalities, and the trace norm regularization enforces the relevance of the projected data with potential connections. [sent-35, score-0.939]
18 Second, based on the alternative formulation for the trace norm [4] and the half-quadratic analysis for the ℓ21-norm, an iterative algorithm is developed to solve the resulting minimization problem. [sent-36, score-0.330]
19 1) We integrate coupled linear regressions, the ℓ21-norm and the trace norm into a generic minimization formulation so that subspace learning and coupled feature selection can be performed simultaneously. [sent-41, score-0.909]
20 2) An iterative algorithm is presented to efficiently solve this kind of complex minimization problem. [sent-42, score-0.119]
21 Section 3 describes our proposed regularized linear regression framework for cross-modal matching, along with an iterative algorithm to solve this problem. [sent-51, score-0.129]
22 Based on the hypothesis that there is a benefit to explicitly model correlations between two modalities, CCA is used to learn a common subspace by maximizing the correlation between the two modalities. [sent-59, score-0.174]
23 Then, a semantic space is learned to measure the similarity of different modal features. [sent-60, score-0.151]
24 They use CCA to learn a common space in which one can measure the possibility that two non-corresponding face regions belong to the same face. [sent-63, score-0.134]
25 To match face images across modalities, e.g., sketches vs. photos and high-resolution vs. low-resolution photos, Sharma and Jacobs [21] use PLS to linearly map images in different modalities to a common linear subspace in which they are highly correlated. [sent-68, score-0.384]
26 They use PLS to map the image features into the text space, and then learn a semantic space for measuring the similarity between two different modalities. [sent-71, score-0.432]
27 In [24], Tenenbaum and Freeman propose a bilinear model (BLM) to derive a common space for cross-modal face recognition, and BLM is also used for text-image retrieval in [22]. [sent-72, score-0.205]
28 Lei and Li [12] propose coupled spectral regression to learn two associated projections, which respectively project heterogeneous data into a common space in which classification is performed. [sent-74, score-0.451]
29 Sharma et al. [22] propose generalized multiview analysis methods, e.g., Generalized Multiview LDA (GMLDA) and Generalized Multiview MFA (GMMFA), and apply them to the cross-media retrieval problem. [sent-81, score-0.102]
30 All the above methods can be categorized into two classes: one is to learn a common latent space into which both modalities are projected, and the other is to map the data of one modality into the space of the other. [sent-82, score-0.437]
31 Hence, how to simultaneously select the relevant and discriminative features for different modalities of data is very important. [sent-87, score-0.357]
32 Accordingly, we aim to jointly perform common subspace learning and coupled feature selection. [sent-88, score-0.487]
33 To achieve this goal, we propose a generic minimization formulation combining coupled linear regressions, the ℓ21-norm, [sent-89, score-0.376]
34 and the trace norm, which will be detailed in the next section. [sent-90, score-0.222]
35 Learning Coupled Feature Spaces. In this section, we present a novel framework for the cross-modal matching problem, which can be formulated as a minimization problem. [sent-92, score-0.118]
36 Then, an iterative algorithm based on half-quadratic optimization is given to solve this minimization problem. [sent-93, score-0.119]
37 The Frobenius norm of the matrix M is defined as ||M||_F = (Σ_{i,j} M_{ij}^2)^{1/2}. [sent-98, score-0.108]
38 Given a query from one modality, the goal of cross-modal matching is to return the closest match in another modality. [sent-143, score-0.187]
39 As shown in Figure 1, cross-modal matching generally involves two problems: 1) The first problem is how to measure the relevance of data from different modalities. [sent-144, score-0.154]
40 2) The second one is how to select the relevant and discriminative features from the coupled feature spaces, simultaneously. [sent-145, score-0.461]
41 They project data from different modalities into a latent space, in which one can measure the possibility that two different modal data represent the same semantic concept. [sent-147, score-0.403]
42 Compared to dimensionality reduction or feature selection methods performed on the two feature spaces separately, coupled feature selection is more likely to find the most relevant features. [sent-149, score-0.771]
43 Based on this consideration, we propose that the feature selection procedure should be performed on coupled feature spaces simultaneously for better matching. [sent-150, score-0.605]
44 Denote the data of the two modalities as Xa = [xa1, ..., xan] ∈ R^{d1×n} and Xb = [xb1, ..., xbn] ∈ R^{d2×n}; each modality has n samples embedded in different-dimensional spaces (d1 and d2), and each pair {xai, xbi} represents the same underlying content and belongs to the same class. [sent-157, score-0.254]
45 Our model aims to learn two projection matrices to map the data of the coupled spaces into the common space defined by class labels. [sent-162, score-0.604]
46 We impose the ℓ21-norm on the projection matrices for coupled feature selection, and impose a low-rank constraint, defined by the trace norm, on the projected data. [sent-164, score-0.738]
47 The first term is coupled linear regression, which is used to learn two projection matrices for mapping different modal data into a common space. [sent-194, score-0.584]
48 The second term consists of two ℓ21-norms that play the role of feature selection on the two feature spaces simultaneously. [sent-196, score-0.269]
49 The trace norm term enforces the relevance of the projected data with potential connections. [sent-197, score-0.485]
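Putting the terms together, a plausible form of the full objective, consistent with the description above, is the following (the trade-off weights λ1, λ2 and the exact arrangement are our assumptions; the paper's own equation is not recoverable from this excerpt):

```latex
\min_{U_a, U_b}\;
\|X_a^{\top} U_a - Y\|_F^2 + \|X_b^{\top} U_b - Y\|_F^2
+ \lambda_1 \left( \|U_a\|_{21} + \|U_b\|_{21} \right)
+ \lambda_2 \left\| \left[\, X_a^{\top} U_a,\; X_b^{\top} U_b \,\right] \right\|_{*}
```

The first two terms are the coupled regressions onto the label-defined space Y, the ℓ21 terms select features per modality, and the trace norm ties the two projected data matrices together through a shared low-rank structure.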
50 Here, an iterative algorithm based on the half-quadratic minimization [8, 9] is proposed to solve this problem. [sent-202, score-0.119]
51 Toward this end, we first need to introduce a variational formulation for the trace norm [4]: Lemma 1. [sent-203, score-0.33]
52 For any matrix W, ||W||_* = (1/2) inf_{S≻0} [tr(W^T S^{-1} W) + tr(S)], and the infimum is attained for S = (W W^T)^{1/2}. Using this lemma, we can reformulate the trace norm term in the objective. [sent-211, score-0.139]
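A quick numerical sanity check of Lemma 1 (our own sketch; W is taken square and full rank so that the minimizing S is invertible, before the μI smoothing introduced below):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))      # full rank almost surely
U_, s, _ = np.linalg.svd(W)
S = (U_ * s) @ U_.T                  # S = (W W^T)^{1/2}
S_inv = (U_ / s) @ U_.T              # its inverse

lhs = s.sum()                                              # ||W||_*
rhs = 0.5 * (np.trace(W.T @ S_inv @ W) + np.trace(S))
assert np.isclose(lhs, rhs)          # both sides agree at the minimizer
```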
53 Otherwise, the infimum over S could be attained at a non-invertible S, leading to a non-convergent algorithm. [sent-222, score-0.139]
54 The infimum over S is then attained for S = (Xa^T Ua Ua^T Xa + Xb^T Ub Ub^T Xb + μI)^{1/2} (9). If we define φ(x) = √(x^2 + ε), we can replace the ℓ2-norm of each row inside the ℓ21-norms with φ(·), which admits a half-quadratic reformulation. [sent-223, score-0.139]
55 Steps 1 and 2 correspond to the trace norm, which is expected to reinforce the relevance of the projected data of different modalities with connections. [sent-314, score-0.598]
56 Steps 3 and 4 correspond to the ℓ21-norms and play an important role in coupled feature selection. [sent-316, score-0.353]
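For concreteness, here is a schematic numpy implementation of one such alternating scheme for the assumed objective given earlier (this is our sketch, not the paper's Algorithm 1; the initialization, update order, and constants are assumptions):

```python
import numpy as np

def lcfs_sketch(Xa, Xb, Y, lam1=0.1, lam2=0.1, mu=1e-6, eps=1e-8, iters=50):
    # Xa: (d1, n), Xb: (d2, n) column-wise samples; Y: (n, c) label matrix.
    d1, n = Xa.shape
    d2 = Xb.shape[0]
    c = Y.shape[1]
    rng = np.random.default_rng(0)
    Ua = 0.01 * rng.standard_normal((d1, c))
    Ub = 0.01 * rng.standard_normal((d2, c))

    for _ in range(iters):
        # Steps 1-2 (trace norm): S = (W W^T + mu*I)^{1/2} with
        # W = [Xa^T Ua, Xb^T Ub]; only S^{-1} is needed below.
        G = Xa.T @ Ua @ Ua.T @ Xa + Xb.T @ Ub @ Ub.T @ Xb
        w, V = np.linalg.eigh(G)
        S_inv = (V * (1.0 / np.sqrt(np.clip(w, 0.0, None) + mu))) @ V.T

        # Steps 3-4 (l21-norms): half-quadratic row reweighting,
        # row i of U gets weight 1 / (2 * sqrt(||u_i||^2 + eps)).
        Da = np.diag(0.5 / np.sqrt((Ua**2).sum(axis=1) + eps))
        Db = np.diag(0.5 / np.sqrt((Ub**2).sum(axis=1) + eps))

        # Closed-form updates of the two projection matrices.
        A = Xa @ Xa.T + lam1 * Da + 0.5 * lam2 * (Xa @ S_inv @ Xa.T)
        B = Xb @ Xb.T + lam1 * Db + 0.5 * lam2 * (Xb @ S_inv @ Xb.T)
        Ua = np.linalg.solve(A, Xa @ Y)
        Ub = np.linalg.solve(B, Xb @ Y)

    return Ua, Ub
```

Each iteration first refreshes the trace-norm auxiliary variable (Steps 1-2), then the ℓ21 reweighting matrices (Steps 3-4), and finally solves two regularized least-squares problems in closed form.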
57 The overall computational cost grows with k, n and d, where k is the number of iterations needed to converge, n is the number of training samples, and d = max(d1, d2), with d1 and d2 the dimensions of the two modalities. [sent-352, score-0.148]
58 Experimental Results. Given a cross-modal problem, we can learn two projection matrices on the training set using the iterative algorithm given in Algorithm 1. [sent-354, score-0.163]
59 Then, using the two projection matrices we can project each pair of data into the common subspace defined by class labels, in which the relevance of projected data from different modalities can be easily measured. [sent-355, score-0.622]
60 In the testing phase, we take the data of one modality in the testing set as the query set to retrieve the data of the other modality. [sent-356, score-0.357]
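A minimal retrieval sketch along these lines (ours; the excerpt does not fix the distance measure, so Euclidean distance in the common space is an assumption):

```python
import numpy as np

def cross_modal_retrieve(Ua, Ub, Xa_query, Xb_db):
    # Project queries (modality A) and database items (modality B) into
    # the learned common space, then rank by squared Euclidean distance.
    Q = Xa_query.T @ Ua                 # (n_queries, c)
    D = Xb_db.T @ Ub                    # (n_database, c)
    dists = ((Q[:, None, :] - D[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(dists, axis=1)    # ranked database indices per query
```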
61 Experimental settings. We compare the proposed LCFS approach with several related methods, namely PLS [21], BLM [22, 24], CCA [7, 19], GMMFA and GMLDA [22], for two cross-modal retrieval tasks: (1) image query vs. text database, and (2) text query vs. image database. [sent-363, score-0.211]
62 The top nine images retrieved by our method on the Pascal VOC dataset, given the tags “boat+water”. [sent-379, score-0.17]
63 The task is to find the nearest neighbors in the text (or image) database. [sent-380, score-0.178]
64 We want more correct matches in the top K documents for better retrieval. [sent-381, score-0.12]
65 We define the Average Precision (AP) of a set of N retrieved documents as AP = (1/T) Σ_{r=1}^{N} P(r)·δ(r), where T is the number of relevant documents in the retrieved set. [sent-384, score-0.203]
66 P(r) denotes the precision of the top r retrieved documents. [sent-386, score-0.113]
67 δ(r) = 1 if the r-th retrieved document is relevant (where relevant means belonging to the class of the query) and δ(r) = 0 otherwise. [sent-387, score-0.414]
68 The MAP is then computed by averaging the AP values over all queries in the query set. [sent-388, score-0.182]
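Both definitions translate directly into code; a small sketch (ours) computing AP and MAP from per-query binary relevance lists:

```python
import numpy as np

def average_precision(relevance):
    # relevance: binary delta(r) over a ranked list of retrieved documents.
    flags = np.asarray(relevance, dtype=float)
    if flags.sum() == 0:
        return 0.0
    prec_at_r = np.cumsum(flags) / (np.arange(flags.size) + 1)  # P(r)
    return float((prec_at_r * flags).sum() / flags.sum())       # (1/T) sum P(r) delta(r)

def mean_average_precision(per_query_relevance):
    # Average the AP values over all queries in the query set.
    return float(np.mean([average_precision(r) for r in per_query_relevance]))
```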
69 The precision-recall curve is a classical measure of information retrieval performance, but some researchers [18] consider the characterization of retrieval performance by precision-scope curves more expressive for multimedia retrieval. [sent-391, score-0.278]
70 The image features are 512-dimensional GIST features [11], and the text features are 399-dimensional word frequency features. [sent-404, score-0.178]
71 As we mentioned in Section 2, the compared methods just focus on common subspace learning, so Principal Component Analysis (PCA) is performed on the original features to remove redundant features. [sent-405, score-0.168]
72 Our method can perform coupled feature selection, so we do not perform PCA on the original features for our method. [sent-406, score-0.353]
73 This may be because our method selects the relevant and discriminative features from the two modalities simultaneously, and the learnt common space is more compact and effective. [sent-410, score-0.375]
74 This may be because the text features of the Pascal VOC dataset are very sparse, which may not agree with the assumptions of GMLDA. [sent-415, score-0.178]
75 Figure 2 shows the top nine retrieved images using a tag vector containing “boat+water” as the query. [sent-416, score-0.15]
76 Firstly, tag vectors and image feature vectors are projected into the common space by the proposed method. [sent-417, score-0.188]
77 Then, for a tag vector, we return the nearest K images as the retrieved results. [sent-418, score-0.159]
78 We can see that most retrieved images are very relevant to the given query. [sent-419, score-0.192]
79 The corresponding precision-scope curves and precision-recall curves are plotted in Figure 3. [sent-420, score-0.094]
80 The scope (i.e., the top K retrieved items) for the precision-scope curves varies from K=50 to 1000. [sent-423, score-0.19]
81 The top row shows the performance of different methods based on the precision-scope curves for both forms of cross-modal retrieval tasks, i.e., image query vs. text database and text query vs. image database. [sent-424, score-0.151]
82 In each pair, the text is an article describing people, places, or events, and the image is closely related to the content of the article. [sent-440, score-0.213]
83 The 10-dimensional representation of the text is derived from a latent Dirichlet allocation model. [sent-443, score-0.247]
84 Due to the low dimensions of image and text features themselves, PCA is not used to reduce the dimensions of the original features here. [sent-445, score-0.254]
85 …and 0.2141 for the image query and text query respectively, only a little better than those of GMMFA and GMLDA. [sent-449, score-0.452]
86 The reason is that the dimensions of the image and text features are low, so the [sent-450, score-0.216]
87 ℓ21-norms of our method for coupled feature selection could hardly take effect. [sent-451, score-0.423]
88 We can see that for both forms of cross-modal retrieval, our method finds more correct matches in the top K documents than the compared methods. [sent-455, score-0.12]
89 Figure 5 shows two examples of text queries and the top five images retrieved by our method. [sent-456, score-0.366]
90 In each case, the query text and its paired image are shown. (Table: MAP scores for the image query and text query tasks and their average, for the compared methods.) [sent-457, score-0.343]
91 The retrieved images are perceived as belonging to the same category as the query text (“Geography & places” at the top, “Warfare” at the bottom). [sent-466, score-0.428]
92 Conclusion. In this paper, we have proposed a general regularization framework to solve the problem of cross-modal matching, which consists of coupled subspace learning for different modalities, the [sent-468, score-0.434]
93 ℓ21-norms for coupled feature selection, and the trace norm for the measurement of relevance. [sent-469, score-0.753]
94 Under the framework, different projection matrices are learnt to project different modal data into a common subspace defined by label information, and relevant and discriminative features for the coupled spaces are selected simultaneously in the projection procedure. [sent-470, score-0.972]
95 To solve this complex regularization problem, we have harnessed an alternative formulation of the trace norm, and reformulated the ℓ21-norms via half-quadratic analysis. [sent-471, score-0.292]
96 Two examples of text queries (the first column) and the top five images (columns 3-7) retrieved by our method on the Wiki dataset. [sent-511, score-0.366]
97 The second column contains the paired images of the text queries. [sent-512, score-0.251]
Continuum regression for cross-modal multimedia retrieval. [sent-525, score-0.095]
99 Trace lasso: a trace norm regularization for correlated designs. [sent-537, score-0.368]
The heterogeneous feature selection with structural sparsity for multimedia annotation and hashing: a survey. [sent-689, score-0.205]
wordName wordTfidf (topN-words)
[('coupled', 0.308), ('pls', 0.233), ('gmmfa', 0.23), ('trace', 0.222), ('modalities', 0.221), ('cca', 0.215), ('blm', 0.204), ('gmlda', 0.202), ('ub', 0.193), ('ua', 0.18), ('wiki', 0.179), ('text', 0.178), ('query', 0.137), ('modal', 0.118), ('lcfs', 0.115), ('retrieved', 0.113), ('modality', 0.11), ('spaces', 0.109), ('norm', 0.108), ('uit', 0.106), ('relevance', 0.104), ('qt', 0.095), ('voc', 0.091), ('documents', 0.09), ('subspace', 0.088), ('infimum', 0.086), ('pascal', 0.08), ('lemma', 0.08), ('relevant', 0.079), ('tr', 0.077), ('retrieval', 0.074), ('ft', 0.073), ('selection', 0.07), ('minimization', 0.068), ('rasiwasia', 0.067), ('pages', 0.064), ('sharma', 0.06), ('kaiye', 0.058), ('xatuaxbtub', 0.058), ('xbtub', 0.058), ('xbtububtxb', 0.058), ('xtauauatxa', 0.058), ('matrices', 0.057), ('projection', 0.055), ('attained', 0.053), ('iterative', 0.051), ('regression', 0.051), ('vdiag', 0.051), ('mfa', 0.051), ('projected', 0.051), ('matching', 0.05), ('pca', 0.048), ('regressions', 0.047), ('correntropy', 0.047), ('curves', 0.047), ('common', 0.046), ('heterogeneous', 0.046), ('tag', 0.046), ('multiview', 0.045), ('queries', 0.045), ('feature', 0.045), ('xib', 0.045), ('squares', 0.044), ('face', 0.044), ('multimedia', 0.044), ('bilinear', 0.041), ('correlation', 0.04), ('shan', 0.04), ('rn', 0.039), ('curve', 0.039), ('dimensions', 0.038), ('regularization', 0.038), ('quadrianto', 0.037), ('boat', 0.036), ('wikipedia', 0.036), ('photos', 0.035), ('content', 0.035), ('tenenbaum', 0.035), ('redundant', 0.034), ('xia', 0.033), ('semantic', 0.033), ('reformulated', 0.032), ('lei', 0.032), ('xb', 0.032), ('latent', 0.031), ('canonical', 0.03), ('xa', 0.03), ('top', 0.03), ('water', 0.029), ('map', 0.029), ('discriminative', 0.029), ('simultaneously', 0.028), ('deal', 0.028), ('multimodal', 0.028), ('paired', 0.028), ('objective', 0.027), ('tags', 0.027), ('china', 0.027), ('ui', 0.027), ('regularized', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
Author: Kaiye Wang, Ran He, Wei Wang, Liang Wang, Tieniu Tan
Abstract: Cross-modal matching has recently drawn much attention due to the widespread existence of multimodal data. It aims to match data from different modalities, and generally involves two basic problems: the measure of relevance and coupled feature selection. Most previous works mainly focus on solving the first problem. In this paper, we propose a novel coupled linear regression framework to deal with both problems. Our method learns two projection matrices to map multimodal data into a common feature space, in which cross-modal data matching can be performed. And in the learning procedure, the ℓ21-norm penalties are imposed on the two projection matrices separately, which leads to selecting relevant and discriminative features from coupled feature spaces simultaneously. A trace norm is further imposed on the projected data as a low-rank constraint, which enhances the relevance of different modal data with connections. We also present an iterative algorithm based on half-quadratic minimization to solve the proposed regularized linear regression problem. The experimental results on two challenging cross-modal datasets demonstrate that the proposed method outperforms the state-of-the-art approaches.
2 0.14372055 210 iccv-2013-Image Retrieval Using Textual Cues
Author: Anand Mishra, Karteek Alahari, C.V. Jawahar
Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method, where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, which we introduce.
3 0.13329983 194 iccv-2013-Heterogeneous Image Features Integration via Multi-modal Semi-supervised Learning Model
Author: Xiao Cai, Feiping Nie, Weidong Cai, Heng Huang
Abstract: Automatic image categorization has become increasingly important with the development of Internet and the growth in the size of image databases. Although the image categorization can be formulated as a typical multiclass classification problem, two major challenges have been raised by the real-world images. On one hand, though using more labeled training data may improve the prediction performance, obtaining the image labels is a time consuming as well as biased process. On the other hand, more and more visual descriptors have been proposed to describe objects and scenes appearing in images and different features describe different aspects of the visual characteristics. Therefore, how to integrate heterogeneous visual features to do the semi-supervised learning is crucial for categorizing large-scale image data. In this paper, we propose a novel approach to integrate heterogeneous features by performing multi-modal semi-supervised classification on unlabeled as well as unsegmented images. Considering each type of feature as one modality, taking advantage of the large amount of unlabeled data information, our new adaptive multimodal semi-supervised classification (AMMSS) algorithm learns a commonly shared class indicator matrix and the weights for different modalities (image features) simultaneously.
Author: Basura Fernando, Tinne Tuytelaars
Abstract: In this paper we present a new method for object retrieval starting from multiple query images. The use of multiple queries allows for a more expressive formulation of the query object including, e.g., different viewpoints and/or viewing conditions. This, in turn, leads to more diverse and more accurate retrieval results. When no query images are available to the user, they can easily be retrieved from the internet using a standard image search engine. In particular, we propose a new method based on pattern mining. Using the minimal description length principle, we derive the most suitable set of patterns to describe the query object, with patterns corresponding to local feature configurations. This results in a powerful object-specific mid-level image representation. The archive can then be searched efficiently for similar images based on this representation, using a combination of two inverted file systems. Since the patterns already encode local spatial information, good results on several standard image retrieval datasets are obtained even without costly re-ranking based on geometric verification.
5 0.12065308 180 iccv-2013-From Where and How to What We See
Author: S. Karthikeyan, Vignesh Jagadeesh, Renuka Shenoy, Miguel Ecksteinz, B.S. Manjunath
Abstract: Eye movement studies have confirmed that overt attention is highly biased towards faces and text regions in images. In this paper we explore a novel problem of predicting face and text regions in images using eye tracking data from multiple subjects. The problem is challenging as we aim to predict the semantics (face/text/background) only from eye tracking data without utilizing any image information. The proposed algorithm spatially clusters eye tracking data obtained in an image into different coherent groups and subsequently models the likelihood of the clusters containing faces and text using a fully connected Markov Random Field (MRF). Given the eye tracking data from a test image, it predicts potential face/head (humans, dogs and cats) and text locations reliably. Furthermore, the approach can be used to select regions of interest for further analysis by object detectors for faces and text. The hybrid eye position/object detector approach achieves better detection performance and reduced computation time compared to using only the object detection algorithm. We also present a new eye tracking dataset on 300 images selected from ICDAR, Street-view, Flickr and Oxford-IIIT Pet Dataset from 15 subjects.
6 0.11901881 162 iccv-2013-Fast Subspace Search via Grassmannian Based Hashing
7 0.11720092 360 iccv-2013-Robust Subspace Clustering via Half-Quadratic Minimization
9 0.10753414 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
10 0.10287926 232 iccv-2013-Latent Space Sparse Subspace Clustering
11 0.10066897 292 iccv-2013-Non-convex P-Norm Projection for Robust Sparsity
12 0.097257093 337 iccv-2013-Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search
13 0.096408695 94 iccv-2013-Correntropy Induced L2 Graph for Robust Subspace Clustering
14 0.09332376 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
15 0.088976666 93 iccv-2013-Correlation Adaptive Subspace Segmentation by Trace Lasso
16 0.083758913 290 iccv-2013-New Graph Structured Sparsity Model for Multi-label Image Annotations
17 0.083157316 444 iccv-2013-Viewing Real-World Faces in 3D
18 0.081017308 233 iccv-2013-Latent Task Adaptation with Large-Scale Hierarchies
19 0.079955287 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning
20 0.079798289 378 iccv-2013-Semantic-Aware Co-indexing for Image Retrieval
topicId topicWeight
[(0, 0.182), (1, 0.062), (2, -0.071), (3, -0.096), (4, -0.047), (5, 0.124), (6, 0.045), (7, 0.021), (8, 0.009), (9, -0.001), (10, 0.154), (11, -0.044), (12, 0.001), (13, 0.04), (14, -0.008), (15, -0.013), (16, 0.008), (17, -0.031), (18, -0.011), (19, 0.003), (20, 0.014), (21, 0.025), (22, -0.05), (23, -0.068), (24, 0.037), (25, 0.004), (26, 0.06), (27, 0.011), (28, 0.012), (29, -0.001), (30, 0.008), (31, 0.037), (32, 0.029), (33, 0.04), (34, -0.026), (35, 0.025), (36, 0.014), (37, 0.076), (38, 0.019), (39, 0.031), (40, 0.047), (41, -0.134), (42, 0.003), (43, -0.01), (44, 0.067), (45, 0.017), (46, 0.016), (47, -0.049), (48, -0.024), (49, 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 0.93757242 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
Author: Kaiye Wang, Ran He, Wei Wang, Liang Wang, Tieniu Tan
Abstract: Cross-modal matching has recently drawn much attention due to the widespread existence of multimodal data. It aims to match data from different modalities, and generally involves two basic problems: the measure of relevance and coupled feature selection. Most previous works mainly focus on solving the first problem. In this paper, we propose a novel coupled linear regression framework to deal with both problems. Our method learns two projection matrices to map multimodal data into a common feature space, in which cross-modal data matching can be performed. And in the learning procedure, the ℓ21-norm penalties are imposed on the two projection matrices separately, which leads to selecting relevant and discriminative features from coupled feature spaces simultaneously. A trace norm is further imposed on the projected data as a low-rank constraint, which enhances the relevance of different modal data with connections. We also present an iterative algorithm based on half-quadratic minimization to solve the proposed regularized linear regression problem. The experimental results on two challenging cross-modal datasets demonstrate that the proposed method outperforms the state-of-the-art approaches.
2 0.66262054 210 iccv-2013-Image Retrieval Using Textual Cues
Author: Anand Mishra, Karteek Alahari, C.V. Jawahar
Abstract: We present an approach for the text-to-image retrieval problem based on textual content present in images. Given the recent developments in understanding text in images, an appealing approach to address this problem is to localize and recognize the text, and then query the database, as in a text retrieval problem. We show that such an approach, despite being based on state-of-the-art methods, is insufficient, and propose a method, where we do not rely on an exact localization and recognition pipeline. We take a query-driven search approach, where we find approximate locations of characters in the text query, and then impose spatial constraints to generate a ranked list of images in the database. The retrieval performance is evaluated on public scene text datasets as well as three large datasets, namely IIIT scene text retrieval, Sports-10K and TV series-1M, which we introduce.
3 0.64070678 446 iccv-2013-Visual Semantic Complex Network for Web Images
Author: Shi Qiu, Xiaogang Wang, Xiaoou Tang
Abstract: This paper proposes modeling the complex web image collections with an automatically generated graph structure called visual semantic complex network (VSCN). The nodes on this complex network are clusters of images with both visual and semantic consistency, called semantic concepts. These nodes are connected based on the visual and semantic correlations. Our VSCN with 33,240 concepts is generated from a collection of 10 million web images. A great deal of valuable information on the structures of the web image collections can be revealed by exploring the VSCN, such as the small-world behavior, concept community, indegree distribution, hubs, and isolated concepts. It not only helps us better understand the web image collections at a macroscopic level, but also has many important practical applications. This paper presents two application examples: content-based image retrieval and image browsing. Experimental results show that the VSCN leads to significant improvement on both the precision of image retrieval (over 200%) and user experience for image browsing.
4 0.6368404 3 iccv-2013-3D Sub-query Expansion for Improving Sketch-Based Multi-view Image Retrieval
Author: Yen-Liang Lin, Cheng-Yu Huang, Hao-Jeng Wang, Winston Hsu
Abstract: We propose a 3D sub-query expansion approach for boosting sketch-based multi-view image retrieval. The core idea of our method is to automatically convert two (guided) 2D sketches into an approximated 3D sketch model, and then generate multi-view sketches as expanded sub-queries to improve the retrieval performance. To learn the weights among synthesized views (sub-queries), we present a new multi-query feature to model the similarity between subqueries and dataset images, and formulate it into a convex optimization problem. Our approach shows superior performance compared with the state-of-the-art approach on a public multi-view image dataset. Moreover, we also conduct sensitivity tests to analyze the parameters of our approach based on the gathered user sketches.
5 0.60193515 292 iccv-2013-Non-convex P-Norm Projection for Robust Sparsity
Author: Mithun Das Gupta, Sanjeev Kumar
Abstract: In this paper, we investigate the properties of the Lp norm (p ≤ 1) within a projection framework. We start with the KKT equations of the non-linear optimization problem and then use its key properties to arrive at an algorithm for Lp norm projection on the non-negative simplex. We compare with L1 projection, which needs prior knowledge of the true norm, as well as hard thresholding based sparsification proposed in recent compressed sensing literature. We show performance improvements compared to these techniques across different vision applications.
6 0.59726483 445 iccv-2013-Visual Reranking through Weakly Supervised Multi-graph Learning
7 0.59506899 162 iccv-2013-Fast Subspace Search via Grassmannian Based Hashing
8 0.58983666 194 iccv-2013-Heterogeneous Image Features Integration via Multi-modal Semi-supervised Learning Model
9 0.58691943 94 iccv-2013-Correntropy Induced L2 Graph for Robust Subspace Clustering
10 0.57906878 334 iccv-2013-Query-Adaptive Asymmetrical Dissimilarities for Visual Object Retrieval
11 0.57898104 290 iccv-2013-New Graph Structured Sparsity Model for Multi-label Image Annotations
12 0.56094205 122 iccv-2013-Distributed Low-Rank Subspace Segmentation
13 0.56025094 266 iccv-2013-Mining Multiple Queries for Image Retrieval: On-the-Fly Learning of an Object-Specific Mid-level Representation
14 0.55925888 306 iccv-2013-Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items
15 0.55103749 93 iccv-2013-Correlation Adaptive Subspace Segmentation by Trace Lasso
16 0.54925591 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions
17 0.54689425 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes
18 0.53930259 294 iccv-2013-Offline Mobile Instance Retrieval with a Small Memory Footprint
19 0.53783762 182 iccv-2013-GOSUS: Grassmannian Online Subspace Updates with Structured-Sparsity
20 0.53558093 337 iccv-2013-Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search
topicId topicWeight
[(2, 0.081), (7, 0.022), (12, 0.015), (13, 0.011), (26, 0.062), (27, 0.016), (31, 0.049), (42, 0.181), (64, 0.05), (68, 0.249), (73, 0.022), (89, 0.147)]
simIndex simValue paperId paperTitle
same-paper 1 0.80158055 235 iccv-2013-Learning Coupled Feature Spaces for Cross-Modal Matching
Author: Kaiye Wang, Ran He, Wei Wang, Liang Wang, Tieniu Tan
Abstract: Cross-modal matching has recently drawn much attention due to the widespread existence of multimodal data. It aims to match data from different modalities, and generally involves two basic problems: the measure of relevance and coupled feature selection. Most previous works mainly focus on solving the first problem. In this paper, we propose a novel coupled linear regression framework to deal with both problems. Our method learns two projection matrices to map multimodal data into a common feature space, in which cross-modal data matching can be performed. And in the learning procedure, the ℓ21-norm penalties are imposed on the two projection matrices separately, which leads to selecting relevant and discriminative features from coupled feature spaces simultaneously. A trace norm is further imposed on the projected data as a low-rank constraint, which enhances the relevance of different modal data with connections. We also present an iterative algorithm based on half-quadratic minimization to solve the proposed regularized linear regression problem. The experimental results on two challenging cross-modal datasets demonstrate that the proposed method outperforms the state-of-the-art approaches.
2 0.7776401 203 iccv-2013-How Related Exemplars Help Complex Event Detection in Web Videos?
Author: Yi Yang, Zhigang Ma, Zhongwen Xu, Shuicheng Yan, Alexander G. Hauptmann
Abstract: Compared to visual concepts such as actions, scenes and objects, a complex event is a higher-level abstraction of longer video sequences. For example, a “marriage proposal” event is described by multiple objects (e.g., ring, faces), scenes (e.g., in a restaurant, outdoor) and actions (e.g., kneeling down). The positive exemplars which exactly convey the precise semantics of an event are hard to obtain. It would be beneficial to utilize the related exemplars for complex event detection. However, the semantic correlations between related exemplars and the target event vary substantially as relatedness assessment is subjective. Two related exemplars can be about completely different events, e.g., in the TRECVID MED dataset, both bicycle riding and equestrianism are labeled as related to the “attempting a bike trick” event. To tackle the subjectiveness of human assessment, our algorithm automatically evaluates how positive the related exemplars are for the detection of an event and uses them on an exemplar-specific basis. Experiments demonstrate that our algorithm is able to utilize related exemplars adaptively, and the algorithm gains good performance for complex event detection.
3 0.74576694 208 iccv-2013-Image Co-segmentation via Consistent Functional Maps
Author: Fan Wang, Qixing Huang, Leonidas J. Guibas
Abstract: Joint segmentation of image sets has great importance for object recognition, image classification, and image retrieval. In this paper, we aim to jointly segment a set of images starting from a small number of labeled images or none at all. To allow the images to share segmentation information with each other, we build a network that contains segmented as well as unsegmented images, and extract functional maps between connected image pairs based on image appearance features. These functional maps act as general property transporters between the images and, in particular, are used to transfer segmentations. We define and operate in a reduced functional space optimized so that the functional maps approximately satisfy cycle-consistency under composition in the network. A joint optimization framework is proposed to simultaneously generate all segmentation functions over the images so that they both align with local segmentation cues in each particular image, and agree with each other under network transportation. This formulation allows us to extract segmentations even with no training data, but can also exploit such data when available. The collective effect of the joint processing using functional maps leads to accurate information sharing among images and yields superior segmentation results, as shown on the iCoseg, MSRC, and PASCAL data sets.
4 0.72907066 79 iccv-2013-Coherent Object Detection with 3D Geometric Context from a Single Image
Author: Jiyan Pan, Takeo Kanade
Abstract: Objects in a real world image cannot have arbitrary appearance, sizes and locations due to geometric constraints in 3D space. Such a 3D geometric context plays an important role in resolving visual ambiguities and achieving coherent object detection. In this paper, we develop a RANSAC-CRF framework to detect objects that are geometrically coherent in the 3D world. Different from existing methods, we propose a novel generalized RANSAC algorithm to generate global 3D geometry hypotheses from local entities such that outlier suppression and noise reduction is achieved simultaneously. In addition, we evaluate those hypotheses using a CRF which considers both the compatibility of individual objects under global 3D geometric context and the compatibility between adjacent objects under local 3D geometric context. Experiment results show that our approach compares favorably with the state of the art.
5 0.71727616 240 iccv-2013-Learning Maximum Margin Temporal Warping for Action Recognition
Author: Jiang Wang, Ying Wu
Abstract: Temporal misalignment and duration variation in video actions largely influence the performance of action recognition, but it is very difficult to specify effective temporal alignment on action sequences. To address this challenge, this paper proposes a novel discriminative learning-based temporal alignment method, called maximum margin temporal warping (MMTW), to align two action sequences and measure their matching score. Based on the latent structure SVM formulation, the proposed MMTW method is able to learn a phantom action template to represent an action class for maximum discrimination against other classes. The recognition of this action class is based on the associated learned alignment of the input action. Extensive experiments on five benchmark datasets have demonstrated that this MMTW model is able to significantly promote the accuracy and robustness of action recognition under temporal misalignment and variations.
6 0.71478778 436 iccv-2013-Unsupervised Intrinsic Calibration from a Single Frame Using a "Plumb-Line" Approach
7 0.71360457 54 iccv-2013-Attribute Pivots for Guiding Relevance Feedback in Image Search
8 0.71282971 44 iccv-2013-Adapting Classification Cascades to New Domains
9 0.71120179 277 iccv-2013-Multi-channel Correlation Filters
10 0.70850545 231 iccv-2013-Latent Multitask Learning for View-Invariant Action Recognition
11 0.70826077 259 iccv-2013-Manifold Based Face Synthesis from Sparse Samples
12 0.70817178 398 iccv-2013-Sparse Variation Dictionary Learning for Face Recognition with a Single Training Sample per Person
13 0.70733392 52 iccv-2013-Attribute Adaptation for Personalized Image Search
14 0.706743 45 iccv-2013-Affine-Constrained Group Sparse Coding and Its Application to Image-Based Classifications
15 0.70601773 26 iccv-2013-A Practical Transfer Learning Algorithm for Face Verification
16 0.70597649 392 iccv-2013-Similarity Metric Learning for Face Recognition
17 0.70502973 14 iccv-2013-A Generalized Iterated Shrinkage Algorithm for Non-convex Sparse Coding
18 0.70498145 161 iccv-2013-Fast Sparsity-Based Orthogonal Dictionary Learning for Image Restoration
19 0.70418131 124 iccv-2013-Domain Transfer Support Vector Ranking for Person Re-identification without Target Camera Label Information
20 0.70365787 93 iccv-2013-Correlation Adaptive Subspace Segmentation by Trace Lasso