Author: Zeynep Akata, Florent Perronnin, Zaid Harchaoui, Cordelia Schmid
Abstract: Attributes are an intermediate representation, which enables parameter sharing between classes, a must when training data is scarce. We propose to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors. We introduce a function which measures the compatibility between an image and a label embedding. The parameters of this function are learned on a training set of labeled samples to ensure that, given an image, the correct classes rank higher than the incorrect ones. Results on the Animals With Attributes and Caltech-UCSD-Birds datasets show that the proposed framework outperforms the standard Direct Attribute Prediction baseline in a zero-shot learning scenario. The label embedding framework offers other advantages such as the ability to leverage alternative sources of information in addition to attributes (e.g. class hierarchies) or to transition smoothly from zero-shot learning to learning with large quantities of data.
1 We propose to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors. [sent-2, score-0.375]
2 The parameters of this function are learned on a training set of labeled samples to ensure that, given an image, the correct classes rank higher than the incorrect ones. [sent-4, score-0.267]
3 The label embedding framework offers other advantages such as the ability to leverage alternative sources of information in addition to attributes (e. [sent-6, score-0.761]
4 A solution to zero-shot learning which has recently gained in popularity in the computer vision community consists in introducing an intermediate space A referred to as attribute layer [16, 8]. [sent-14, score-0.361]
5 image embedding (left): how to extract suitable features from an image? [sent-17, score-0.307]
6 We focus on label embedding (right): how to embed class labels in a Euclidean space? [sent-18, score-0.501]
7 We use attributes as side information for the label embedding and measure the “compatibility” between the embedded inputs and outputs with a function F. [sent-19, score-0.753]
8 As an example, if the classes correspond to animals, possible attributes include “has paws”, “has stripes” or “is black”. [sent-21, score-0.357]
9 To classify a new image, its attributes are predicted using the learned classifiers and the attribute scores are combined into class-level scores. [sent-23, score-0.596]
10 In other words, since attribute classifiers are learned independently of the end-task they might be optimal at predicting attributes but not necessarily at predicting classes. [sent-27, score-0.596]
11 which can perform zero-shot prediction if no labeled samples are available for some classes, but which can also leverage new labeled samples for these classes as they become available. [sent-30, score-0.419]
12 Third, while attributes can be a useful source of prior information, other sources of information could be leveraged for zero-shot learning. [sent-32, score-0.41]
13 Indeed, images of classes which are close in a semantic hierarchy are usually more similar than images of classes which are far apart [6]. [sent-34, score-0.284]
14 This paper proposes such a solution by making use of the label embedding framework. [sent-38, score-0.395]
15 We underline that, while there is an abundant literature in the computer vision community on image embedding (how to describe an image? [sent-39, score-0.307]
16 ) much less work has been devoted in comparison to label embedding in the Y space (how to describe a class? [sent-40, score-0.425]
17 We embed each class y ∈ Y in the space of attribute vectors and thus refer to our approach as Attribute Label Embedding (ALE). [sent-42, score-0.395]
18 Third, the label embedding framework is generic and not restricted to attributes. [sent-50, score-0.395]
19 Related Work We now review related work on attributes, zero-shot learning and label embedding (three research areas which strongly overlap) with an emphasis on the latter. [sent-58, score-0.436]
20 It has been proposed to improve the standard DAP model to take into account the correlation between attributes or between attributes and classes [38, 39, 43, 19]. [sent-62, score-0.652]
21 Zero-shot learning Zero-shot learning requires the ability to transfer knowledge from classes for which we have training data to classes for which we do not. [sent-74, score-0.299]
22 Possible sources of prior information include attributes [16, 8, 25, 28, 27], semantic class taxonomies [27, 22] or text features [25, 28, 27]. [sent-75, score-0.452]
23 It is unclear, however, how such an embedding could be extrapolated to the case of generic visual categories. [sent-79, score-0.307]
24 Label embedding In computer vision, a vast amount of work has been devoted to input embedding, i. [sent-83, score-0.337]
25 This includes works on patch encoding (see [4] for a recent comparison), on kernel-based methods [32] with a recent focus on explicit embeddings [20, 35], on dimensionality reduction [32] and on compression [13, 30, 36]. [sent-86, score-0.281]
26 Provided that the embedding function ϕ is chosen correctly, label embedding can be an effective way to share parameters between classes. [sent-88, score-0.702]
27 While this taxonomy is valid for both input θ and output embeddings ϕ, we focus here on output embeddings. [sent-91, score-0.353]
28 A possible strategy consists in directly learning an embedding from the input to the output (or from the output to the input) as is the case of regression [25]. [sent-98, score-0.426]
29 This setting is particularly relevant when little training data is available, as side information and the derived embeddings can compensate for the lack of data. [sent-106, score-0.357]
30 In our work, we focus on embeddings derived from side information but we also consider the case where they are learned from labeled data, using side information as a prior. [sent-110, score-0.46]
31 Learning with attributes as label embedding Given a training set S = {(xn, yn) , n = 1. [sent-112, score-0.693]
32 In machine learning, a common strategy is to use embedding functions $\theta : X \to \tilde{X}$ and $\varphi : Y \to \tilde{Y}$ for the inputs and outputs and then to learn on the transformed input/output pairs. [sent-118, score-0.386]
33 We then explain how to leverage attributes to compute label embeddings. [sent-122, score-0.389]
34 Finally, we show that the label embedding framework is generic enough to accommodate other sources of side information. [sent-124, score-0.575]
35 It is generally assumed that F is linear in some combined feature embedding of inputs/outputs ψ(x, y): $F(x, y; w) = w^\top \psi(x, y)$ (2) [sent-129, score-0.307]
36 and that the joint embedding ψ can be written as the tensor product between the image embedding $\theta : X \to \tilde{X} = \mathbb{R}^D$ and the label embedding $\varphi : Y \to \tilde{Y} = \mathbb{R}^E$: $\psi(x, y) = \theta(x) \otimes \varphi(y)$ (3) and $\psi(x, y) : \mathbb{R}^D \times \mathbb{R}^E \to \mathbb{R}^{DE}$. [sent-130, score-1.009]
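The tensor-product form in (2) and (3) can be checked numerically against the equivalent bilinear form, in which the vector w is viewed as a D × E matrix W. The following is a minimal sketch with random toy vectors; the dimensions and variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, E = 5, 3                      # illustrative embedding dimensions (assumed)
theta_x = rng.normal(size=D)     # stands in for the image embedding theta(x)
phi_y = rng.normal(size=E)       # stands in for the label embedding phi(y)
w = rng.normal(size=D * E)       # parameter vector of eq. (2)

# Eq. (2)-(3): F(x, y; w) = w^T (theta(x) ⊗ phi(y))
f_tensor = w @ np.kron(theta_x, phi_y)

# The same parameters reshaped into a D x E matrix give a bilinear form
W = w.reshape(D, E)
f_bilinear = theta_x @ W @ phi_y

print(np.isclose(f_tensor, f_bilinear))
```

The bilinear form avoids materializing the DE-dimensional joint embedding, which matters when D is large.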
37 Attribute label embedding We now consider the problem of computing label embeddings ϕA from attributes which we refer to as Attribute Label Embedding (ALE). [sent-144, score-1.005]
38 , C} and that we have a set of E attributes A = {ai , i= 1. [sent-150, score-0.265]
39 We also assume that we are provided with an association measure ρy,i between each attribute ai and each class y. [sent-154, score-0.36]
40 In this work, we focus on binary relevance although one advantage of the label embedding framework is that it can easily accommodate real-valued relevances. [sent-156, score-0.443]
41 We embed class y in the E-dim attribute space as follows: $\varphi^A(y) = [\rho_{y,1}, \ldots, \rho_{y,E}]$ (7) [sent-157, score-0.395]
42 and denote $\Phi^A$ the $E \times C$ matrix of attribute embeddings which stacks the individual $\varphi^A(y)$’s. [sent-160, score-0.567]
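As an illustration of the attribute embedding matrix, the sketch below builds an E × C matrix from a hypothetical binary class-attribute table and scores a toy image against every class. The association values, the identity matrix used for W, and the argmax prediction rule are assumptions made for the example only.

```python
import numpy as np

# Hypothetical binary associations rho[y, i] for C = 3 classes and E = 4 attributes
rho = np.array([[1, 0, 1, 0],
                [0, 1, 1, 0],
                [1, 1, 0, 1]], dtype=float)

# E x C matrix whose column y is the attribute embedding of class y
Phi_A = rho.T

def predict(theta_x, W, Phi):
    """Return the class whose embedding is most compatible with theta_x."""
    scores = theta_x @ W @ Phi        # one compatibility score per class
    return int(np.argmax(scores))

W = np.eye(4)                          # toy parameters: theta(x) acts directly as attribute scores
theta_x = np.array([0.9, 0.1, 0.8, 0.0])
pred = predict(theta_x, W, Phi_A)
```

Because the columns of the matrix are fixed by side information, a class with no training images can still be scored as long as its attribute vector is known.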
43 We note that in equation (4) the image and label embeddings play symmetric roles. [sent-161, score-0.345]
44 Also, in the case where attributes are redundant, it might be advantageous to decorrelate them. [sent-165, score-0.284]
45 We will study the effect of attribute decorrelation in our experiments. [sent-171, score-0.312]
46 The simplest learning strategy is to directly maximize the compatibility between the input and output embeddings: $\frac{1}{N} \sum_{n=1}^{N} F(x_n, y_n; W)$ [sent-175, score-0.405]
47 Therefore, we draw inspiration from the WSABIE algorithm [41] which learns jointly image and label embeddings from data to optimize classification accuracy. [sent-179, score-0.408]
48 The crucial difference between WSABIE and ALE is the fact that the latter uses attributes as side information. [sent-180, score-0.332]
49 In what follows, Φ is the matrix which stacks the embeddings ϕ(y). [sent-183, score-0.278]
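To make the ranking idea concrete, here is a deliberately simplified sketch of one stochastic update with a pairwise hinge ranking loss and a fixed label embedding matrix. It is not the weighted approximate ranking objective used by WSABIE; the margin of 1, the learning rate, and the function name `ranking_sgd_step` are assumptions chosen for illustration.

```python
import numpy as np

def ranking_sgd_step(W, theta_x, y, Phi, lr=0.1, rng=None):
    """One pairwise ranking update: push the correct class above a sampled wrong class."""
    if rng is None:
        rng = np.random.default_rng()
    num_classes = Phi.shape[1]
    scores = theta_x @ W @ Phi
    wrong = [c for c in range(num_classes) if c != y]
    y_neg = rng.choice(wrong)
    # Update only if the wrong class violates the margin
    if 1.0 + scores[y_neg] - scores[y] > 0:
        W = W + lr * np.outer(theta_x, Phi[:, y] - Phi[:, y_neg])
    return W

# Toy usage: two classes with one-hot label embeddings
Phi = np.eye(2)
theta_x = np.array([1.0, 2.0])
W = ranking_sgd_step(np.zeros((2, 2)), theta_x, y=0, Phi=Phi,
                     rng=np.random.default_rng(0))
scores = theta_x @ W @ Phi
```

After the update the correct class scores higher than the sampled incorrect one, which is the ranking property the training objective aims for.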
50 In WSABIE, the label embedding space dimensionality is a parameter to tune. [sent-198, score-0.419]
51 In such a case, we want to learn the class embeddings using as prior information ΦA. [sent-210, score-0.368]
52 Beyond attributes While attributes make sense in the label embedding framework, we note that label embedding is more general and can accommodate other sources of side information. [sent-218, score-1.5]
53 The hierarchy embedding $\varphi^H(y)$ can be defined as the C-dimensional vector: $\varphi^H(y) = [\xi_{y,1}, \ldots, \xi_{y,C}]$ (11) [sent-221, score-0.387]
54 We later refer to this embedding as Hierarchy Label Embedding (HLE) and we compare $\varphi^A$ and $\varphi^H$ as sources of prior information in our experiments. [sent-225, score-0.398]
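A hierarchy embedding of this kind can be sketched as follows, assuming that an entry is 1 when the indexed node is the class itself or one of its ancestors, and 0 otherwise. This definition of the entries, the toy taxonomy, and the choice to index all nodes of the tree are assumptions for the example, since the extracted text does not spell them out.

```python
import numpy as np

# Hypothetical toy taxonomy: child -> parent (None marks the root)
parent = {"animal": None, "mammal": "animal", "bird": "animal",
          "zebra": "mammal", "tiger": "mammal", "sparrow": "bird"}
nodes = list(parent)
index = {name: i for i, name in enumerate(nodes)}

def hierarchy_embedding(y):
    """Binary vector marking y and all of its ancestors."""
    phi = np.zeros(len(nodes))
    node = y
    while node is not None:
        phi[index[node]] = 1.0
        node = parent[node]
    return phi

zebra = hierarchy_embedding("zebra")
tiger = hierarchy_embedding("tiger")
sparrow = hierarchy_embedding("sparrow")
```

With this encoding the dot product between two classes counts their shared ancestors, so classes that are close in the hierarchy get more similar embeddings.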
55 In the case where classes are not organized in a tree structure but form a graph, then other types of embeddings could be used, for instance by performing a kernel PCA on the commute time kernel [29]. [sent-226, score-0.391]
56 Different embeddings can be easily combined in the label embedding framework, e. [sent-227, score-0.652]
57 through simple concatenation of the different embeddings or through more complex operations such as a CCA of the embeddings. [sent-229, score-0.289]
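The concatenation option can be sketched in a few lines. The two toy matrices below are hypothetical; the point of the example is that, under a dot-product similarity, concatenating two embeddings adds their class-to-class similarities.

```python
import numpy as np

# Hypothetical toy embeddings for C = 3 classes (one column per class)
Phi_A = np.array([[1., 0., 1.],
                  [0., 1., 1.]])        # 2 attribute dimensions
Phi_H = np.array([[1., 1., 1.],
                  [1., 1., 0.],
                  [0., 0., 1.]])        # 3 hierarchy dimensions

# Concatenate along the embedding dimension
Phi_AH = np.vstack([Phi_A, Phi_H])

sim_A = Phi_A.T @ Phi_A
sim_H = Phi_H.T @ Phi_H
sim_AH = Phi_AH.T @ Phi_AH
```

If the two sources live on very different scales, one of them may dominate the sum, so some per-source normalization before stacking may be worth considering.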
58 Each class was annotated with 85 attributes by 10 students [24] and the result was binarized. [sent-243, score-0.317]
59 Hence, there is a significant difference in the number and quality of attributes between the two datasets. [sent-255, score-0.265]
60 What is the best way to encode/normalize the attribute embeddings? [sent-273, score-0.289]
61 How do attributes compare to a class hierarchy as prior information? [sent-276, score-0.423]
62 The first baseline is Ridge Regression (RR) which was used in [25] to map input features to output attribute labels. [sent-280, score-0.338]
63 Comparison of different attribute embeddings: {0, 1} embedding, {−1, +1} embedding and mean-centered embedding, with and without ℓ2-normalization. [sent-295, score-0.596]
64 For these experiments, the attribute vectors are encoded in a binary fashion (using {0, 1}) and ℓ2-normalized. [sent-299, score-0.308]
65 We experiment with a {0, 1} embedding, a {−1, +1} embedding and a mean-centered embedding (i. [sent-307, score-0.614]
66 Underlying the {0, 1} embedding is the assumption that the presence of the same attribute in two classes should contribute to their similarity, but not its absence³. [sent-310, score-0.707]
67 Underlying the {−1, 1} embedding is the assumption that the presence or the absence of the same attribute in two classes should contribute equally to their similarity. [sent-311, score-0.707]
68 For instance, if an attribute appears in almost all classes, then in the mean-centered embedding, its absence will contribute more to the similarity than its presence. [sent-313, score-0.308]
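The three encodings can be compared on a small hypothetical class-attribute table. In the sketch below the first attribute is present in 3 of 5 classes, so after mean-centering a shared absence contributes more to the dot product than a shared presence. The table and the helper `l2_normalize` are assumptions made for the example.

```python
import numpy as np

# Hypothetical C x E table: 5 classes, 2 attributes; attribute 0 is present in 3 classes
rho = np.array([[1, 0],
                [1, 1],
                [1, 0],
                [0, 1],
                [0, 0]], dtype=float)

def l2_normalize(M):
    """Scale each class vector (row) to unit Euclidean norm; all-zero rows stay zero."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    return M / np.maximum(norms, 1e-12)

enc_01 = rho                                      # {0, 1} encoding
enc_pm1 = 2 * rho - 1                             # {-1, +1} encoding
enc_mc = rho - rho.mean(axis=0, keepdims=True)    # mean-centered encoding

# Contribution of attribute 0 to the dot product of two classes
shared_presence = enc_mc[0, 0] * enc_mc[1, 0]     # both classes have the attribute
shared_absence = enc_mc[3, 0] * enc_mc[4, 0]      # both classes lack the attribute
```

Under the {0, 1} encoding a shared absence contributes nothing, while under {−1, +1} presence and absence contribute equally.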
69 In what follows, we make use of the simple {0, 1} embedding with ℓ2-normalization. [sent-320, score-0.307]
70 In DAP, given a new image x, we assign it to the class y with [footnote 3: Here we assume a dot-product similarity between attribute embeddings, which is consistent with our linear compatibility function (4).] [sent-325, score-0.656]
71 Our DAP results on AWA are lower than those reported in [16] because we use only half of the data to train the attribute classifiers. [sent-342, score-0.308]
72 Right 2 columns: attribute prediction accuracy (AUC in %) on the 85 AWA and 312 CUB attributes. [sent-343, score-0.352]
73 $\prod_{e=1}^{E} p(a_e = \rho_{y,e} \mid x)$ (12) where $\rho_{y,e}$ is the association measure between attribute $a_e$ and class y, and $p(a_e = 1 \mid x)$ is the probability that image x contains attribute e. [sent-345, score-0.668]
74 We train for each attribute one linear classifier on the FVs. [sent-346, score-0.289]
75 We use a (regularized) logistic loss which provides an attribute classification accuracy similar to the SVM but with the added benefit that its output is already a probability. [sent-347, score-0.352]
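The DAP decision rule in (12) can be sketched as below, assuming binary associations and per-attribute probabilities that are already available from the trained classifiers. The computation is done in log space for numerical stability; the function name `dap_predict` and the toy numbers are assumptions, and any class or attribute priors are left out.

```python
import numpy as np

def dap_predict(p_attr, rho):
    """Pick the class maximizing the product over attributes of p(a_e = rho[y, e] | x).

    p_attr: shape (E,), probabilities p(a_e = 1 | x)
    rho:    shape (C, E), binary class-attribute associations
    """
    eps = 1e-12
    p = np.clip(p_attr, eps, 1 - eps)
    # log p(a_e = 1 | x) where rho is 1, log p(a_e = 0 | x) where rho is 0
    log_lik = rho @ np.log(p) + (1 - rho) @ np.log(1 - p)
    return int(np.argmax(log_lik)), log_lik

# Toy usage with 2 classes and 2 attributes
rho = np.array([[1., 0.],
                [0., 1.]])
p_attr = np.array([0.9, 0.2])
pred, log_lik = dap_predict(p_attr, rho)
```

Each attribute probability enters the product independently, which is why attribute classifiers that are optimal on their own need not give the best class decision.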
76 Hence, our approach seems to be more beneficial when the attribute quality is higher. [sent-349, score-0.289]
77 In ALE, each column of W can be interpreted as an attribute classifier and $\theta(x)^\top$ [sent-352, score-0.289]
78 However, one major difference with DAP is that we do not optimize for attribute classification accuracy. [sent-354, score-0.352]
79 We therefore measured the attribute prediction accuracy of DAP and ALE. [sent-356, score-0.352]
80 As expected, the attribute prediction accuracy of DAP is higher than that of our approach. [sent-359, score-0.352]
81 Thus, our learned attribute classifiers should still be interpretable. [sent-363, score-0.331]
82 Classification accuracy on AWA (left) and CUB (right) as a function of the label embedding dimensionality. [sent-366, score-0.395]
83 We compare the baseline which uses all attributes, with an SVD dimensionality reduction and a sampling of attributes (we report the mean and standard deviation over 10 samplings). [sent-367, score-0.309]
84 Comparison of attributes (ALE) and hierarchies (HLE) for label embedding. [sent-377, score-0.381]
85 We explore two different techniques: Singular Value Decomposition (SVD) and attribute sampling. [sent-382, score-0.289]
86 From these experiments, we can conclude that there is a significant amount of correlation between attributes and that the output space dimensionality can be significantly reduced with little accuracy loss. [sent-389, score-0.348]
87 As expected, SVD outperforms a random sampling of the attribute dimensions. [sent-396, score-0.289]
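The two reduction strategies can be sketched on a random binary embedding matrix. How exactly the SVD is applied is not stated in the extracted text; the version below projects the class embeddings onto the top k left singular vectors, which is one plausible reading. Matrix sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, E, k = 6, 10, 3                                  # classes, attributes, target dimension
Phi_A = rng.integers(0, 2, size=(E, C)).astype(float)

# SVD-based reduction: project onto the top-k left singular vectors
U, s, Vt = np.linalg.svd(Phi_A, full_matrices=False)
Phi_svd = U[:, :k].T @ Phi_A                        # k x C decorrelated embeddings

# Baseline: keep k attribute dimensions chosen uniformly at random
keep = rng.choice(E, size=k, replace=False)
Phi_sampled = Phi_A[keep]
```

The SVD projection keeps the directions that explain most of the variation across classes, whereas random sampling may keep redundant attributes.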
88 As mentioned earlier, while attributes can be a useful source of prior information to embed classes, other sources exist. [sent-398, score-0.41]
89 For each attribute we show the images ranked highest. [sent-413, score-0.289]
90 We explore different alternatives such as the concatenation of the embeddings or performing CCA on the embeddings. [sent-418, score-0.289]
91 On AWA, the combination performs better than attributes or the hierarchy alone while on CUB, there is no improvement through the combination, most likely because the hierarchy adds little additional information. [sent-424, score-0.425]
92 We compare ALE with WSABIE [41] which performs label embedding and therefore “shares” samples between classes but does not use prior information. [sent-453, score-0.553]
93 One advantage of WSABIE with respect to ALE is that the embedding space dimensionality can be tuned, thus giving more flexibility when larger amounts of training data become available. [sent-457, score-0.364]
94 As an example, ALE with 2 training samples performs on par with WSABIE with 20 training samples, showing that attributes can compensate for limited training data. [sent-460, score-0.423]
95 We compare three embedding techniques: ALE (attributes only), HLE (hierarchy only), AHLE (attributes and hierarchy). [sent-465, score-0.307]
96 Second, our model can leverage labeled training data (if available) to update the label embedding, using the attribute embedding as a prior. [sent-482, score-0.8]
97 Third, the label embedding framework is not restricted to attributes and can accommodate other sources of prior information such as class taxonomies. [sent-483, score-0.851]
98 In the few-shot setting, we showed improvements with respect to WSABIE, which learns the label embedding from labeled data but does not leverage prior information. [sent-485, score-0.504]
99 Learning to detect unseen object classes by between-class attribute transfer. [sent-596, score-0.381]
100 A joint learning framework for attribute models and object descriptions. [sent-613, score-0.33]
