cvpr cvpr2013 cvpr2013-79 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Mohammad Norouzi, David J. Fleet
Abstract: A fundamental limitation of quantization techniques like the k-means clustering algorithm is the storage and runtime cost associated with the large numbers of clusters required to keep quantization errors small and model fidelity high. We develop new models with a compositional parameterization of cluster centers, so representational capacity increases super-linearly in the number of parameters. This allows one to effectively quantize data using billions or trillions of centers. We formulate two such models, Orthogonal k-means and Cartesian k-means. They are closely related to one another, to k-means, to methods for binary hash function optimization like ITQ [5], and to Product Quantization for vector quantization [7]. The models are tested on largescale ANN retrieval tasks (1M GIST, 1B SIFT features), and on codebook learning for object recognition (CIFAR-10). 1. Introduction and background Techniques for vector quantization, like the well-known k-means algorithm, are used widely in vision and learning. Common applications include codebook learning for object recognition [16], and approximate nearest neighbor (ANN) search for information retrieval [12, 14]. In general terms, such techniques involve partitioning an input vector space into multiple regions {Si}ik=1, mapping points x in each region uinlttiop region-specific representatives {ci}ik=1, known as cioennte irnst.o A resg siuonch-,s a quantizer, qse(nxt)a,t can b {ec expressed as k q(x) = ?1(x∈Si)ci, (1) i?= ?1 where 1(·) is the usual indicator function. eTrhee 1 quality oef u a quantizer oisr expressed in terms of expected distortion, a common measure of which is squared error ?x − q(x) ?22. In this case, given centers {ci}, the region t ?ox xw −hic qh(x a point nis t assigned gwivitehn m ceinnitemrasl {dcis}to,r tthioen r eisobtained by Euclidean nearest neighbor (NN) search. The k-means algorithm can be used to learn centers from data. To reduce expected distortion, crucial for many applications, one can shrink region volumes by increasing k, the number of regions. In practice, however, increasing k results in prohibitive storage and run-time costs. Even if one resorts to ANN search with approximate k-means [14] or hierarchical k-means [12], it is hard to scale to large k (e.g., k = 264), as storing the centers is untenable. This paper concerns methods for scalable quantization with tractable storage and run-time costs. Inspired by Product Quantization (PQ), a state-of-the-art algorithm for ANN search with high-dimensional data (e.g., [7]), compositionality is one key. By expressing data in terms of recurring, reusable parts, the representational capacity of compositional models grows exponentially in the number ofparameters. Compression techniques like JPEG accomplish this by encoding images as disjoint rectangular patches. PQ divides the feature space into disjoint subspaces that are quantized independently. Other examples include part-based recognition models (e.g., [18]), and tensor-based models for stylecontent separation (e.g., [17]). Here, with a compositional parameterization of region centers, we find a family of models that reduce the enc√oding cost of k centers down from k to between log2 k and √k. A model parameter controls the trade-off between model fidelity and compactness. We formulate two related algorithms, Orthogonal kmeans (ok-means) and Cartesian k-means (ck-means). They are natural extensions of k-means, and are closely related to other hashing and quantization methods. The okmeans algorithm is a generalization of the Iterative Quantization (ITQ) algorithm for finding locality-sensitive binary codes [5]. The ck-means model is an extension of okmeans, and can be viewed as a generalization of PQ.1 We then report empirical results on large-scale ANN search tasks, and on codebook learning for recognition. For ANN search we use datasets of 1M GIST and 1B SIFT features, with both symmetric and asymmetric distance measures on the latent codes. We consider codebook learning for a generic form of recognition on CIFAR-10. 2. k-means Given a dataset of n p-dim points, D ≡ {xj }jn=1, the kmeans algorithm partitions ithme n points in ≡to { kx c}lusters, and 1A very similar generalization of PQ, was developed concurrently by Ge, He, Ke and Sun, and also appears in CVPR 2013 [4]. 333000111755 represents each cluster by a center point. Let C ∈ Rp×k breep a mseantrtsix e wachhos celu sctoelurm byns a comprise tihnet .k Lceltus Cter ∈ centers, i.e., C = [c1, c2 , · · · , ck] . The k-means objective is to minimize the within,-c··lu·s ,tecr squared distances: ?k-means(C) = = x?∈Dmiin?x − ci?22 x?∈Db∈mHin1/k?x − Cb?22 (2) where H1/k ≡ {b | b ∈ {0, 1}k and ?b? = 1}, i.e., b is a binary vee Hctor comprising a ,1-1o}f-ka encoding. Lloyd’s b k- imse aa bni-s algorithm [11] finds a local minimum of (2) by iterative, alternating optimization with respect to C and the b’s. The k-means model is simple and intuitive, using NN search to assign points to centers. The assignment of points to centers can be represented with a log k-bit index per data point. The cost of storing the centers grows linearly with k. 3. Orthogonal k-means with 2m centers With a compositional model one can represent cluster centers more efficiently. One such approach is to re- construct each input with an additive combination of the columns of C. To this end, instead of the 1-of-k encoding in (2), we let b be a general m-bit vector, b ∈ Hm ≡ {0, 1}m, (a2n)d, wCe e∈ l Rt bp× bme. aA gse nsuerchal, meac-bhi tc vluesctteorr c, ebnt ∈er H His th≡e sum o}f a asundbs Cet o∈f Rthe columns of C. There are 2m possible subsets, and therefore k = 2m centers. While the number of parameters in the quantizer is linear in m, the number of centers increases exponentially. While efficient in representing cluster centers, the approach is problematic, because solving bˆ = abrg∈Hmmin?x − Cb?22,(3) is intractable; i.e., the discrete optimization is not submodular. Obviously, for small 2m one could generate all possible centers and then perform NN search to find the optimal solution, but this would not scale well to large values of m. One key observation is that if the columns of C are orthogonal, then optimization (3) becomes tractable. To explain this, without loss of generality, assume the bits belong to {−1, 1} instead of {0, 1}, i.e., b? ∈ Hm? ≡ {−1, 1}m. tToh e{−n, ?x − Cb??22 = xTx + b?TCTCb? − 2xTCb? .(4) For diagonal CTC, the middle quadratic term on the RHS becomes trace(CTC), independent of b?. As a consequence, when C has orthogonal columns, abr?g∈mH?min?x − Cb??22 = sgn(CTx) ,(5) where sgn(·) is the element-wise sign function. Cluster centers are depicted by dots, and cluster boundaries by dashed lines. (Left) Clusters formed by a 2-cube with no rotation, scaling or translation; centers = {b? |b? ∈ H2? }. (Right) Rotation, scaling and translation are used to reduce distances between data points and cluster centers; centers = {μ + RDb? | b? ∈ H2? }. To reduce quantization error further we can also introduce an offset, denoted μ, to translate x. Taken together with (5), this leads to the loss function for the model we call orthogonal k-means (ok-means):2 ?ok-means(C,μ) = x?∈Db?m∈iHn?m?x − μ − Cb??22. (6) Clearly, with a change of variables, b? = 2b −1, we can define new versions of μ and C, with ide=ntic 2abl −lo1s,s, w wfoer c wanh dicehthe unknowns are binary, but in {0, 1}m. T uhnek nook-wmnesa anrse quantizer uetn icnod {e0,s 1ea}ch data point as a vertex of a transformed m-dimensional unit hypercube. The transform, via C and μ, maps the hypercube vertices onto the input feature space, ideally as close as possible to the data points. The matrix C has orthogonal columns and can therefore be expressed in terms of rotation and scaling; i.e., C ≡ RD, where R ∈ Rp×m has orthonormal cinoglu;m i.ne.s, ( CRT ≡R = R DIm,) w, ahnerde eD R Ris ∈ diagonal and positive definite. The goal of learning is to find the parameters, R, D, and μ, which minimize quantization error. Fig. 1 depicts a small set of 2D data points (red x’s) and two possible quantizations. Fig. 1 (left) depicts the vertices of a 2-cube with C = I2 and zero translation. The cluster regions are simply the four quadrants of the 2D space. The distances between data points and cluster centers, i.e., the quantization errors, are relatively large. By comparison, Fig. 1 (right) shows how a transformed 2-cube, the full model, can greatly reduce quantization errors. 3.1. Learning ok-means To derive the learning algorithm for ok-means we first rewrite the objective in matrix terms. Given n data points, let X = [x1, x2, · · · , xn] ∈ Rp×n. Let B? ∈ {−1, 1}m×n denote the corresponding ∈clu Rster assignment {co−ef1f,ic1i}ents. Our 2While closely related to ITQ, we use the the relationship to k-means. term ok-means to emphasize 333000111 686 goal is to find the assignment coefficients B? and the transformation parameters, namely, the rotation R, scaling D, and translation μ, that minimize ?ok-means(B?, R, D,μ) = ?X − μ1T − RDB??f2 (7) = ?X? − RDB??f2 (8) where ?·?f denotes the Frobenius norm, X? ≡ X − μ1T, R iws hceornes ?tr·a?ined to have orthonormal columns≡ ≡(R XTR − μ=1 Im), and D is a positive diagonal matrix. Like k-means, coordinate descent is effective for optimizing (8). We first initialize μ and R, and then iteratively optimize ?ok-means with respect to B?, D, R, and μ: • Optimize B? and D: With straightforward algebraic manipulation, one can show that ?ok-means = ?RTX?−DB??f2 + ?R⊥TX??f2 , (9) where columns of R⊥ span the orthogonal complement of the column-space of R (i.e., the block matrix [R R⊥] is orthogonal). It follows that, given X? and R, we can optimize the first term in (9) to solve for B? and D. Here, DB? is the leastsquares approximation to RTX?, where RTX? and DB? × are m n. Further, the ith row of DB? can only contain earleem men×tsn .fr Fomur t{h−erd,i t,h +e dii} where di = Dii. Wherever tehleem corresponding delement} }o fw RheTrXe? d is positive (negative) we should put a positive (negative) value in DB?. The optimal di is determined by the mean absolute value of the elements in the ith row of RTX?: • • B? ← sgn ?RTX?? (10) D ← Diag?(mroewan?(abs(RTX?))) (11) Optimize R: Using the original objective (8), find R that minimizes ?X? −RA?f2 , subject to RTR = Im, and Am n≡i mDizBes?. ?TXhis i−s equivalent to an Orthogonal Procrustes problem [15], and can be solved exactly using SVD. In particular, by adding p − m rows of zeros to the bottom poaf rDtic, uDlaBr, bb eyc aodmdeins p p× − n. mT rhoewn sR o ifs z square a tnhde orthogoonf aDl ,a DndB can boem eessti pm ×at end. Twhiethn RSV isD s. qBuuatr es ainncde oDrtBho gisdegenerate we are only interested in the first m columns of R; the remaining columns are unique only up to rotation of the null-space.) Optimize μ: Given R, B? and D, the optimal μ is given by the column average of X −RDB?: μ ← mcoeluamn(X−RDB?) (12) 3.2. Approximate nearest neighbor search One application of scalable quantization is ANN search. Before introducing more advanced quantization techniques, we describe some experimental results with ok-means on Euclidean ANN search. The benefits of ok-means for ANN search are two-fold. Storage costs for the centers is reduced to O(log k), from O(k) with k-means. Second, substantial speedups are possible by exploiting fast methods for NN search on binary codes in Hamming space (e.g., [13]). Generally, in terms of a generic quantizer q(·), there are twoG neanteurraalll ways etrom messti omfa ate g ethneer dicis qtaunanceti z beetrw qe(·e)n, ttwheor vectors, v and u [7]. Using the Symmetric quantizer distance (SQD) ?v−u? is approximated by ?q(v)−q(u) ? . Using the Asymmetric quantizer doixsimtanacteed (A byQ ?Dq()v, only one o Uf sthineg tw thoe vectors is quantized, and ?v−u? is estimated as ?v−q(u) ? . vWechtiloer sS iQs qDu might b, aen slightly f?as itse ers t iom compute, vA−QqD(u i)?n-. curs lower quantization errors, and hence is more accurate. For ANN search, in a pre-processing stage, each database entry, u, is encoded into a binary vector corresponding to the cluster center index to which u is assigned. At test time, the queries may or may not be encoded into indices, depending on whether one uses SQD or AQD. In the ok-means model, the quantization of an input x is straightforwardly shown to be qok(x) = μ + RD sgn(RT(x − μ)) . (13) The corresponding m-bit cluster index is sgn(RT(x − μ)). Given two indices, namely b?1 , b?2 ∈ {−1, +1}m,( txhe − symmetric ok-means quantizer distance∈ ∈is { SQDok(b?1, b?2) = ?μ + RDb?1 − μ − RDb?2 ?22 = ?D(b?1 − b?2)?22 .(14) In effect, SQDok is a weighted Hamming distance. It is the sum of the squared diagonal entries of D corresponding to bits where b?1 and b?2 differ. Interestingly, in our experiments with ok-means, Hamming and weighted Hamming distances yield similar results. Thus, in ok-means experiments we simply report results for Hamming distance, to facilitate comparisons with other techniques. When the scaling in ok-means is constrained to be isotropic (i.e., D = αIm for α ∈ R+), then SQDok becomes a constant multiple off othre α αu ∈sua Rl Hamming distance. As discussed in Sec. 5, this isotropic ok-means is closely related to ITQ [5]. The ok-means AQD between a feature vector x and a cluster index b?, is defined as AQDok(x, b?) = ?x −μ − RDb??22 = ?RTx? − Db??22 + ?R⊥Tx??22 , (15) where x? = x−μ. For ANN search, in comparing distances from query x t−o a .d Faotars AetN oNf binary i,n idni ccoesm, ptharei snegc odnisdta tnecrems on the RHS of (15) is irrelevant, since it does not depend on b?. Without this term, AQDok becomes a form of asymmetric Hamming (AH) distance between RTx? and b?. While previous work [6] discussed various ad hoc AH distance measures for binary hashing, interestingly, the optimal AH distance for ok-means is derived directly in our model. 333000111977 1M SIFT, 64−bit encoding (k = 264) lRc@aeR0 . 206148 10 RIoPT1Qk K−Q m( AHeDa)Hn )s (AH)10K Figure 2. Euclidean ANN retrieval results for different methods and distance functions on the 1M SIFT dataset. 3.3. Experiments with ok-means Following [7], we report ANN search results on 1M SIFT, a corpus of 128D SIFT descriptors with disjoint sets of 100K training, 1M base, and 10K test vectors. The training set is used to train models. The base set is the database, and the test points are queries. The number of bits m is typically less than p, but no pre-processing is needed for dimensionality reduction. Rather, to initialize learning, R is a random rotation of the first m principal directions of the training data, and μ is the mean of the data. For each query we find R nearest neighbors, and compute Recall@R, the fraction of queries for which the ground-truth Euclidean NN is found in the R retrieved items. Fig. 2 shows the Recall@R plots for ok-means with Hamming (H) ≈ SQDok and asymmetric Hamming (AH) H≡a mAmQiDngok ( Hd)is t≈an SceQ, vs. ITQ [5] and PQ [7]. The PQ ≡met AhoQdD exploits a more complex asymmetric distance func- tion denoted AD ≡ AQDck (defined in Sec. 6. 1). Note first tthioant od ke-nmoeteadns A improves upon ITQ, with both Hamming and asymmetric Hamming distances. This is due to the scaling parameters (i.e., D) in ok-means. If one is interested in Hamming distance efficient retrieval, ok-means is prefered over ITQ. But better results are obtained with the asymmetric distance function. Fig. 2 also shows that PQ achieves superior recall rates. This stems from its joint encoding of multiple feature dimensions. In ok-means, each bit represents a partition ofthe feature space into two clusters, separated by a hyperplane. The intersection of m orthogonal hyperplanes yields 2m regions. Hence we obtain just two clusters per dimension, and each dimension is encoded independently. In PQ, by comparison, multiple dimensions are encoded jointly, with arbitrary numbers of clusters. PQ thereby captures statistical dependencies among different dimensions. We next extend ok-means to jointly encode multiple dimensions. 4. Cartesian k-means In the Cartesian k-means (ck-means) model, each region center is expressed parametrically as an additive combination of multiple subcenters. Let there be m sets of subcen- Fidg2ure31.Ddep4icton5fCartde?1si2nqdu?4ati5z?3aton 4qDck(da)t= ,?wd i?5134t?h the first (last) two dimensions sub-quantized on the left (right). Cartesian k-means quantizer denoted qck, combines the subquantizations in subspaces, and produces a 4D reconstruction. ters, each with h elements.3 Let C(i) be a matrix whose columns comprise the elements of the ith subcenter set; C(i) ∈ Rp×h. Finally, assume that each cluster center, c, is the∈ sum of exactly one element from each subcenter set: = ?C(i)b(i) m c i?= ?1 , (16) where b(i) ∈ H1/h is a 1-of-h encoding. As a conc∈re Hte example (see Fig. 3), suppose we are given 4D inputs, x ∈ R4, and we split each datum into m = 2 parts: z(1) = ?I2 0? x , and Then, suppose w?e z(2) = ?0 I2? x .(17) qu?antize each part, z(?1) and? z(2) , sepa- × rately. As depicted in Fig. 3 (left and middle), we could use h = 5 subcenters for each one. Placing the corresponding subcenters in the columns of 4 5 matrices C(1) and C(2) , C(1)=?d1d20d2×35d4d5?, C(2)=?d?1d?20d2×?35d?4d?5?, we obtain a model (16) that provides 52 possible centers with which to quantize the data. More generally, the total number of model centers is k = hm. Each center is a member of the Cartesian product of the subcenter sets, hence the name Cartesian k-means. Importantly, while the number of centers is hm, the number of subcenters is only mh. The model provides a super-linear number of centers with a linear number of parameters. The learning objective for Cartesian k-means is ?ck-means(C) =x?∈D{b(mi)}inim=1???x −i?=m1C(i)b(i)??22 where b(i) ∈ H1/h, and C ≡ [C(1), ··· , (18) C(m)] ∈ Rp×mh. [b(1)T, ··· ,b(m)T] If we let bT ≡ then the second sum in (18) can be expressed succinctly as Cb. 3While here we assume a fixed cardinality for all subcenter sets, the model is easily extended to allow sets with different cardinalities. 333000112088 The key problem with this formulation is that the min- imization of (18) with respect to the b(i) ’s is intractable. Nevertheless, motivated by orthogonal k-means (Sec. 3), encoding can be shown to be both efficient and exact if we impose orthogonality constraints on the sets of subcenters. To that end, assume that all subcenters in different sets are pairwise orthogonal; i.e., ∀i,j | i = j C(i)TC(j) = 0h×h .(19) Each subcenter matrix C(i) spans a linear subspace of Rp, and the linear subspaces for different subcenter sets do not intersect. Hence, (19) implies that only the subcenters in C(i) can explain the projection of x onto the C(i) subspace. In the example depicted in Fig. 3, the input features are simply partitioned (17), and the subspaces clearly satisfy the orthogonality constraints. It is also clear that C ≡ [ C(1) C(2)] is block diagonal, Iwtit ihs 2 a ×lso o5 c bleloacrks t,h adte Cnote ≡d D(1) and D(]2 i)s . bTlohec quantization error t×he5re bfolorcek bse,c doemnoeste ?x − Cb?22 = ???zz((12))?−?D0(1) D0(2)? ?b ( 21) ? ?2 = ?????z(1)−D(1)b(1)??2+???z(2)−D(2??)b(2)??2. In words, the squa??zred quantization?? erro??r zis the sum of t??he squared errors on the subspaces. One can therefore solve for the binary coefficients of the subcenter sets independently. In the general case, assuming (19) is satisfied, C can be expressed as a product R D, where R has orthonormal columns, and D is block diagonal; i.e., C = R D where Ra=nd[hRe(n1c),e·C(,i)R=(mR)]i,Dan(di).DW=i⎢t⎡⎣⎢hDs0i.(1≡)Dra0(n2)k. C.(Di)0(.m,i)t⎦⎥ ⎤fo,l(2w0)s that D(i) ∈ Rsi×h and R(i) ∈ Rp×≡sira. Clearly, ? si ≤ p, because of∈ ∈th Re orthogonality ∈con Rstraints. Replacing C(i) with R(i)D(i) in the RHS of (18?), we find ?x−Cb?22 = ?m?z(i)−D(i)b(i)?22+?R⊥Tx?22, (21) i?= ?1 where z(i)≡R(i)Tx, and R⊥is the orthogonal complement of R. This≡ ≡shRows that, with orthogonality constraints (19), the ck-means encoding problem can be split into m independent sub-encoding problems, one for each subcenter set. To find the b(i) that minimizes ??z(i) we perform NN search with z(i) again??st h si-dim vec??tors in D(i) . This entails a cost of O(hsi).? Fortunately, all? the elements of b can be found very efficiently, in O(hs), where s ≡ ? si. If we also include the cost of rotating x to −D(i)b(i)?22, Taocbkml-emt1he.aoAnds um#ceh2nkmrtyeofskm-#lobgeiatknsh,cOo-rm( pOec(2oamkn+spt),hpan)dkO- m(pOecosa(n+mst(hipns)khtesro)m of number of centers, number of bits needed for indices (i.e., log #centers), and the storage cost of representation, which is the same as the encoding cost to convert inputs to indices. The last column shows the costs given an assumption that C has a rank of s ≥ m. obtain each z(i) , the total encoding cost is O(ps + hs), i.e., O(p2+hp). Alternatively, one could perform NN search in p-dim C(i) ’s, to find the b(i) ’s, which costs O(mhp). Table 1 summerizes the models in terms of their number of centers, index size, and cost of storage and encoding. 4.1. Learning ck-means We can re-write the ck-means objective (18) in matrix form with the Frobenius norm; i.e., ?ck-means(R, D, B) = ? X − RDB ?f2 (22) where the columns of X and B comprise the data points and the subcenter assignment coefficients. The input to the learning algorithm is the training data X, the number of subcenter sets m, the cardinality of the subcenter sets h, and an upper bound on the rank of C, i.e., s. In practice, we also let each rotation matrix R(i) have the same number of columns, i.e., si = s/m. The outputs are the matrices {R(i) } and {D(i) } that provide a local minima of (22). Learning begins hwaitth p trohev idneit aia lloizcaatlio mni noimf Ra oanf d(2 D2)., followed by iterative coordinate descent in B, D, and R: • Optimize B and D: With R fixed, the objective is given by (21) where ?R⊥TX?f2 R(i)TX, is constant. Given data pro- jections Z(i) ≡ to optimize for B and D we perform one step oRf k-means for each subcenter set: – Assignment: Perform NN searches for each subcenter set to find the assignment coefficients, B(i) , B(i)← arBg(mi)in?Z(i)− D(i)B(i)?f2 – • Update: D(i)← arDg(mi)in?Z(i)− D(i)B(i)?f2 Optimize R: Placing the D(i) ’s along the diagonal of D, as in (20), and concatenating B(i) ’s as rows of B, [B(1)T, ...,B(m)T], i.e., BT = the optimization of R reduces to the orthogonal Procrustes problem: R ← argRmin?X − RDB?f2. In experiments below, R ∈ Rp×p, and rank(C) ≤ p is unIcnon esxtpraeirniemde. tFso rb high-dimensional adnadta r awnhke(rCe ) ra ≤nk( pX is) ?np, fnostrr efficiency ri th may dbime eusnesfiuoln atol dcoantast wrahiner er anr akn(kC(X). 333000112199 5. Relations and related work As mentioned above, there are close mathematical relationships between ok-means, ck-means, ITQ for binary hashing [5], and PQ for vector quantization [7]. It is instructive to specify these relationships in some detail. Iterative Quantization vs. Orthogonal k-means. ITQ [5] is a variant of locality-sensitive hashing, mapping data to binary codes for fast retrieval. To extract m-bit codes, ITQ first zero-centers the data matrix to obtain X?. PCA is then used for dimensionality reduction, fromp down to m dimensions, after which the subspace representation is randomly rotated. The composition of PCA and the random rotation can be expressed as WTX? where W ∈ Rp×m. ITQ then solves for the m×m rotation mawthriex,r eR W, t ∈hatR minimizes ?ITQ(B?, R) = ?RTWTX?− B??f2 , s.t. RTR = Im×m, (23) where B? ∈ {−1, +1}n×p. ITQ ro∈tate {s− t1h,e+ subspace representation of the data to match the binary codes, and then minimizes quantization error within the subspace. By contrast, ok-means maps the binary codes into the original input space, and then considers both the quantization error within the subspace and the out-of-subspace projection error. A key difference is the inclusion of ?R⊥X? ?f2 in the ok-means objective (9). This is important s ?inRce one can often reduce quantization errors by projecting out significant portions of the feature space. Another key difference between ITQ and ok-means is the inclusion of non-uniform scaling in ok-means. This is important when the data are not isotropic, and contributes to the marked improvement in recall rates in Fig. 2. Orthogonal k-means vs. Cartesian k-means. We next show that ok-means is a special case of ck-means with h = 2, where each subcenter set has two elements. To this end, let C(i) = and let b(i) = be the ith subcenter matrix selection vector. Since b(i) is a 1-of-2 encoding (?10? or ?01?), it follows that: [c(1i) c2(i)], b1(i)c(1i)+ b(2i)c2(i) = [b(1i) b2(i)]T c(1i)+2 c2(i)+ bi?c1(i)−2 c2(i), (24) b1(i) − b2(i) where bi? ≡ ∈ {−1, +1}. With the following setting of t≡he bok-m−e abns parameters, μ=?i=m1c1(i)+2c(2i),andC=?c1(1)−2c2(1),...,c1(m)−2c2(m)?, it should be clear that ?i C(i)b(i) = μ + Cb?, where b? ∈ {−1, +1}m, and b??i is the ith bit of b?, used in (24). Similarly, one can also map ok-means parameters onto corresponding subcenters for ck-means. Thus, there is a 1-to-1 mapping between the parameterizations of cluster centers in ok-means and ck-means for h = 2. The benefits of ok-means are its small number of parameters, and its intuitive formulation. The benefit of the ck-means generalization is its joint encoding of multiple dimensions with an arbitrary number of centers, allowing it to capture intrinsic dependence among data dimensions. Product Quantization vs. Cartesian k-means. PQ first splits the input vectors into m disjoint sub-vectors, and then quantizes each sub-vector separately using a subquantizer. Thus PQ is a special case of ck-means where the rotation R is not optimized; rather, R is fixed in both learning and retrieval. This is important because the sub-quantizers act independently, thereby ignoring intrasubspace statistical dependence. Thus the selection of subspaces is critical for PQ [7, 8]. J e´gou et al. [8] suggest using PCA, followed by random rotation, before applying PQ to high-dimensional vectors. Finding the rotation to minimize quantization error is mentioned in [8], but it was considered too difficult to estimate. Here we show that one can find a rotation to minimize the quantization error. The ck-means learning algorithm optimizes sub-quantizers in the inner loop, but they interact in an outer loop that optimizes the rotation (Sec. 4.1). 6. Experiments 6.1. Euclidean ANN search Euclidean ANN is a useful task for comparing quantization techniques. Given a database of ck-means indices, and a query, we use Asymmetric and Symmetric ck-means quantizer distance, denoted AQDck and SQDck. The AQDck between a query x and a binary index b ≡ ?b(1)T, ...,b(m)T?T, derived in (21), is AQDck(x,b) ?m?z(i)−D(i)b(i)?22+ ?R⊥Tx?22.(25) = i?= ?1 ?z(i)−D(i)b(i)?22is the distance between the ithpro- Here, jection?? of x, i.e., z(i) , ?a?nd a subcenter projection from D(i) selecte?d by b(i) . Give?n a query, these distances for each i, and all h possible values of b(i) can be pre-computed and stored in a query-specific h×m lookup table. Once created, tshtoer lookup qtaubelery i-ss pusecedif itoc compare akullp pda tatbablea.se O points etaot tehde, query. So computing AQDck entails m lookups and additions, somewhat more costly than Hamming distance. The last term on the RHS of (25) is irrelevant for NN search. Since PQ is a special case of ck-means with pre-defined subspaces, the same distances are used for PQ (cf. [7]). The SQDck between binary codes b1 and b2 is given by SQDck(b1,b2) = ?m?D(i)b1(i)− D(i)b2(i)?22.(26) i?= ?1 Since b1(i) and b(2i) are 1-of-?h encodings, an m×h?×h lookup table can be createadre t 1o- sft-ohre e nacllo pairwise msu×bh×-dhis ltaonockeusp. 333000222200 1M SIFT, 64−bit encoding (k = 264) 1M GIST, 64−bit encoding (k = 264) 1B SIFT, 64−bit encoding (k = 264) Re@acl0 .4186021RcIPoTQk0−Q m( SAeDaH)n s (SADH)1K0 .1642081 0RIocP1TkQ −K m (SAeDHa)n s (ASHD1)0K .186420 1 R0IoPc1TkQ −KQm( ASe DaHn) s (SADH1)0K Figure 4. Euclidean NN recall@R (number of items retrieved) based on different quantizers and corresponding distance functions on the 1M SIFT, 1M GIST, and 1B SIFT datasets. The dashed curves use symmetric distance. (AH ≡ AQDok, SD ≡ SQDck, AD ≡ AQDck) While the cost of computing SQDck is the same as AQDck, SQDck could also be used to estimate the distance between the indexed database entries, for diversifying the retrieval results, or to detect near duplicates, for example. Datasets. We use the 1M SIFT dataset, as in Sec. 3.3, along with two others, namely, 1M GIST (960D features) and 1B SIFT, both comprising disjoint sets of training, base and test vectors. 1M GIST has 500K training, 1M base, and 1K query vectors. 1B SIFT has 100M training, 1B base, and 10K query points. In each case, recall rates are averaged over queries in test set for a database populated from the base set. For expedience, we use only the first 1M training points for the 1B SIFT experiments. Parameters. In ANN experiments below, for both ckmeans and PQ, we use m = 8 and h = 256. Hence the number of clusters is k = 2568 = 264, so 64-bits are used as database indices. Using h = 256 is particularly attractive because the resulting lookup tables are small, encoding is fast, and each subcenter index fits into one byte. As h increases we expect retrieval results to improve, but encoding and indexing of a query to become slower. Initialization. To initialize the Di’s for learning, as in kmeans, we simply begin with random samples from the set of Zi’s (see Sec. 4.1). To initialize R we consider the different methods that J e´gou et al. [7] proposed for splitting the feature dimensions into m sub-vectors for PQ: (1) natural: sub-vectors comprise consecutive dimensions, (2) structured: dimensions with the same index modulo 8 are grouped, and (3) random: random permutations are used. For PQ in the experiments below, we use the orderings that produced the best results in [7], namely, the structured ordering for 960D GIST, and the natural ordering for 128D SIFT. For learning ck-means, R is initialized to the identity with SIFT corpora. For 1M GIST, where the PQ ordering is significant, we consider all three orderings to initialize R. Results. Fig. 4 shows Recall@R plots for ck-means and PQ [7] with symmetric and asymmetric distances (SD ≡ PSQQD [7c]k wanitdh A syDm m≡e rAicQ aDncdk) a on tmhee t3r cda dtaissteatns.c Tsh (eS Dhor ≡izontal axainsd represents tQheD number of retrieved items, R, on a log-scale. The results consistently favor ck-means. On 1M GIST, 64−bit encoding (k = 264) lae@Rc0 .261084 1R0Pc Qk −1m(K 321e )a (A nD s )(132) (A D )10K Figure 5. PQ and ck-means results using natural (1), structured (2), and random (3) ordering to define the (initial) subspaces. the high-dimensional GIST data, ck-means with AD significantly outperforms other methods; even ck-means with SD performs on par with PQ with AD. On 1M SIFT, the Recall@ 10 numbers for PQ and ck-means, both using AD, × are 59.9% and 63.7%. On 1B SIFT, Recall@ 100 numbers are 56.5% and 64.9%. As expected, with increasing dataset size, the difference between methods is more significant. In 1B SIFT, each feature vector is 128 bytes, hence a total of 119 GB. Using any method in Fig. 4 (including ckmeans) to index the database into 64 bits, this storage cost reduces to only 7.5 GB. This allows one to work with much larger datasets. In the experiments we use linear scan to find the nearest items according to quantizer distances. For NN search using 10K SIFT queries on 1B SIFT this takes about 8 hours for AD and AH and 4 hours for Hamming distance on a 2 4-core computer. Search can be sped up significantly; using a coarse tienri.tia Sl quantization sapnedd an pin svigenritefidfile structure for AD and AH, as suggested by [7, 1], and using the multi-index hashing method of [13] for Hamming distance. In this paper we did not implement these efficiencies as we focus primarily on the quality of quantization. Fig. 5 compares ck-means to PQ when R in ck-means is initialized using the 3 orderings of [7]. It shows that ckmeans is superior in all cases. Simiarly interesting, it also shows that despite the non-convexity ofthe optimization objective, ck-means learning tends to find similarly good encodings under different initial conditions. Finally, Fig. 6 compares the methods under different code dimensionality. 333000222311 1M GIST, encoding with 64, 96, and 128 bits Rl@ace0 .814206 1 R0cP kQ − m1 69e4216a−8 Knb −itsb 1t69248− bit10K Figure 6. PQ and ck-means results using different number of bits for encoding. In all cases asymmetric distance is used. Table2.Rcokg-nmiteoamPnQCesac(ondksue=r(bkaoc416y=0ko26)n40t2h[eC]IFAR7c598-u1.72096r%a tcesyuingdfferent codebook learning algorithms. 6.2. Learning visual codebooks While the ANN seach tasks above require too many clusters for k-means, it is interesing to compare k-means and ck-means on a task with a moderate number of clusters. To this end we consider codebook learning for bag-ofword√s models [3, 10]. We use ck-means with m = 2 and h = √k, and hence k centers. The main advantage of ck- × here is that finding the closest cluster center is done in O(√k) time, much faster than standard NN search with k-means in O(k). Alternatives for k-means, to improve efficiency, include approximate k-means [14], and hierarchical k-means [12]. Here we only compare to exact k-means. CIFAR-10 [9] comprises 50K training and 10K test images (32 32 pixels). Each image is one of 10 classes (airplane, 3b2i×rd,3 car, cat, )d.e Eera,c dog, frog, sh oonrsee o, ship, laansds etrsu (caki)r.We use publicly available code from Coates et al [2], with changes to the codebook learning and cluster assignment modules. Codebooks are built on 6×6 whitened color image patches. .O Cnoed histogram i sb ucirleta otend 6 per image quadrant, aagnde a linear SVM is applied to 4k-dim feature vectors. Recognition accuracy rates on the test set for different models and k are given in Table 2. Despite having fewer parameters, ck-means performs on par or better than kmeans. This is consistent for different initializations of the algorithms. Although k-means has higher fedility than ckmeans, with fewer parameters, ck-means may be less susceptible to overfitting. Table 2, also compares with the approach of [19], where PQ without learning a rotation is used for clustering features. As expected, learning the rotation has a significant impact on recognition rates, outperforming all three initializations of PQ. mean√s 7. Conclusions We present the Cartesian k-means algorithm, a generalization of k-means with a parameterization of the clus- ter centers such that number of centers is super-linear in the number of parameters. The method is also shown to be a generalization of the ITQ algorithm and Product Quantization. In experiments on large-scale retrieval and codebook learning for recognition the results are impressive, outperforming product quantization by a significant margin. An implementation of the method is available at https : / / github . com/norou z i ckmeans . / References [1] A. Babenko and V. Lempitsky. The inverted multi-index. CVPR, 2012. [2] A. Coates, H. Lee, and A. Ng. An analysis of single-layer networks in unsupervised feature learning. AISTATS, 2011. [3] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. ECCV Workshop Statistical Learning in Computer Vision, 2004. [4] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. CVPR, 2013. [5] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. CVPR, 2011. [6] A. Gordo and F. Perronnin. Asymmetric distances for binary embeddings. CVPR, 2011. [7] H. J ´egou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. PAMI, 2011. [8] H. J ´egou, M. Douze, C. Schmid, and P. P ´erez. Aggregating local descriptors into a compact image representation. CVPR, 2010. [9] A. Krizhevsky. Learning multiple layers of features from tiny images. MSc Thesis, Univ. Toronto, 2009. [10] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR, 2006. [11] S. P. Lloyd. Least squares quantization in pcm. IEEE Trans. IT, 28(2): 129–137, 1982. [12] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. CVPR, 2006. [13] M. Norouzi, A. Punjani, and D. J. Fleet. Fast search in hamming space with multi-index hashing. CVPR, 2012. [14] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. CVPR, 2007. [15] P. Sch o¨nemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 3 1, 1966. [16] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. ICCV, 2003. [17] J. Tenenbaum and W. Freeman. Separating style and content with bilinear models. Neural Comp., 12: 1247–1283, 2000. [18] A. Torralba, K. Murphy, and W. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE T. PAMI, 29(5):854–869, 2007. [19] S. Wei, X. Wu, and D. Xu. Partitioned k-means clustering for fast construction of unbiased visual vocabulary. The Era of Interactive Media, 2012. 333000222422
Reference: text
sentIndex sentText sentNum sentScore
1 edu , Abstract A fundamental limitation of quantization techniques like the k-means clustering algorithm is the storage and runtime cost associated with the large numbers of clusters required to keep quantization errors small and model fidelity high. [sent-4, score-0.642]
2 We develop new models with a compositional parameterization of cluster centers, so representational capacity increases super-linearly in the number of parameters. [sent-5, score-0.19]
3 They are closely related to one another, to k-means, to methods for binary hash function optimization like ITQ [5], and to Product Quantization for vector quantization [7]. [sent-8, score-0.299]
4 The models are tested on largescale ANN retrieval tasks (1M GIST, 1B SIFT features), and on codebook learning for object recognition (CIFAR-10). [sent-9, score-0.137]
5 Common applications include codebook learning for object recognition [16], and approximate nearest neighbor (ANN) search for information retrieval [12, 14]. [sent-12, score-0.241]
6 eTrhee 1 quality oef u a quantizer oisr expressed in terms of expected distortion, a common measure of which is squared error ? [sent-18, score-0.226]
7 The k-means algorithm can be used to learn centers from data. [sent-23, score-0.223]
8 This paper concerns methods for scalable quantization with tractable storage and run-time costs. [sent-29, score-0.326]
9 Compression techniques like JPEG accomplish this by encoding images as disjoint rectangular patches. [sent-34, score-0.146]
10 Here, with a compositional parameterization of region centers, we find a family of models that reduce the enc√oding cost of k centers down from k to between log2 k and √k. [sent-41, score-0.332]
11 They are natural extensions of k-means, and are closely related to other hashing and quantization methods. [sent-44, score-0.315]
12 The okmeans algorithm is a generalization of the Iterative Quantization (ITQ) algorithm for finding locality-sensitive binary codes [5]. [sent-45, score-0.182]
13 1 We then report empirical results on large-scale ANN search tasks, and on codebook learning for recognition. [sent-47, score-0.143]
14 For ANN search we use datasets of 1M GIST and 1B SIFT features, with both symmetric and asymmetric distance measures on the latent codes. [sent-48, score-0.191]
15 The assignment of points to centers can be represented with a log k-bit index per data point. [sent-71, score-0.307]
16 The cost of storing the centers grows linearly with k. [sent-72, score-0.251]
17 Orthogonal k-means with 2m centers With a compositional model one can represent cluster centers more efficiently. [sent-74, score-0.57]
18 While the number of parameters in the quantizer is linear in m, the number of centers increases exponentially. [sent-79, score-0.385]
19 Obviously, for small 2m one could generate all possible centers and then perform NN search to find the optimal solution, but this would not scale well to large values of m. [sent-85, score-0.272]
20 As a consequence, when C has orthogonal columns, abr? [sent-101, score-0.119]
21 Cluster centers are depicted by dots, and cluster boundaries by dashed lines. [sent-107, score-0.299]
22 (Left) Clusters formed by a 2-cube with no rotation, scaling or translation; centers = {b? [sent-108, score-0.264]
23 (Right) Rotation, scaling and translation are used to reduce distances between data points and cluster centers; centers = {μ + RDb? [sent-112, score-0.38]
24 To reduce quantization error further we can also introduce an offset, denoted μ, to translate x. [sent-116, score-0.253]
25 Taken together with (5), this leads to the loss function for the model we call orthogonal k-means (ok-means):2 ? [sent-117, score-0.119]
26 T uhnek nook-wmnesa anrse quantizer uetn icnod {e0,s 1ea}ch data point as a vertex of a transformed m-dimensional unit hypercube. [sent-127, score-0.162]
27 The matrix C has orthogonal columns and can therefore be expressed in terms of rotation and scaling; i. [sent-129, score-0.295]
28 The goal of learning is to find the parameters, R, D, and μ, which minimize quantization error. [sent-134, score-0.281]
29 1 (right) shows how a transformed 2-cube, the full model, can greatly reduce quantization errors. [sent-144, score-0.253]
30 and the transformation parameters, namely, the rotation R, scaling D, and translation μ, that minimize ? [sent-153, score-0.117]
31 f2 , (9) where columns of R⊥ span the orthogonal complement of the column-space of R (i. [sent-180, score-0.185]
32 qBuuatr es ainncde oDrtBho gisdegenerate we are only interested in the first m columns of R; the remaining columns are unique only up to rotation of the null-space. [sent-215, score-0.208]
33 Approximate nearest neighbor search One application of scalable quantization is ANN search. [sent-221, score-0.357]
34 Before introducing more advanced quantization techniques, we describe some experimental results with ok-means on Euclidean ANN search. [sent-222, score-0.253]
35 Storage costs for the centers is reduced to O(log k), from O(k) with k-means. [sent-224, score-0.223]
36 Second, substantial speedups are possible by exploiting fast methods for NN search on binary codes in Hamming space (e. [sent-225, score-0.149]
37 Generally, in terms of a generic quantizer q(·), there are twoG neanteurraalll ways etrom messti omfa ate g ethneer dicis qtaunanceti z beetrw qe(·e)n, ttwheor vectors, v and u [7]. [sent-228, score-0.162]
38 curs lower quantization errors, and hence is more accurate. [sent-243, score-0.253]
39 For ANN search, in a pre-processing stage, each database entry, u, is encoded into a binary vector corresponding to the cluster center index to which u is assigned. [sent-244, score-0.163]
40 In the ok-means model, the quantization of an input x is straightforwardly shown to be qok(x) = μ + RD sgn(RT(x − μ)) . [sent-246, score-0.253]
41 (13) The corresponding m-bit cluster index is sgn(RT(x − μ)). [sent-247, score-0.117]
42 2 ∈ {−1, +1}m,( txhe − symmetric ok-means quantizer distance∈ ∈is { SQDok(b? [sent-250, score-0.2]
43 It is the sum of the squared diagonal entries of D corresponding to bits where b? [sent-262, score-0.15]
44 Experiments with ok-means Following [7], we report ANN search results on 1M SIFT, a corpus of 128D SIFT descriptors with disjoint sets of 100K training, 1M base, and 10K test vectors. [sent-298, score-0.126]
45 For each query we find R nearest neighbors, and compute Recall@R, the fraction of queries for which the ground-truth Euclidean NN is found in the R retrieved items. [sent-303, score-0.153]
46 The intersection of m orthogonal hyperplanes yields 2m regions. [sent-320, score-0.119]
47 Cartesian k-means quantizer denoted qck, combines the subquantizations in subspaces, and produces a 4D reconstruction. [sent-335, score-0.162]
48 3 Let C(i) be a matrix whose columns comprise the elements of the ith subcenter set; C(i) ∈ Rp×h. [sent-337, score-0.533]
49 Finally, assume that each cluster center, c, is the∈ sum of exactly one element from each subcenter set: = ? [sent-338, score-0.454]
50 3 (left and middle), we could use h = 5 subcenters for each one. [sent-354, score-0.134]
51 Placing the corresponding subcenters in the columns of 4 5 matrices C(1) and C(2) , C(1)=? [sent-355, score-0.2]
52 , we obtain a model (16) that provides 52 possible centers with which to quantize the data. [sent-364, score-0.223]
53 More generally, the total number of model centers is k = hm. [sent-365, score-0.223]
54 Each center is a member of the Cartesian product of the subcenter sets, hence the name Cartesian k-means. [sent-366, score-0.42]
55 Importantly, while the number of centers is hm, the number of subcenters is only mh. [sent-367, score-0.357]
56 The model provides a super-linear number of centers with a linear number of parameters. [sent-368, score-0.223]
57 3While here we assume a fixed cardinality for all subcenter sets, the model is easily extended to allow sets with different cardinalities. [sent-379, score-0.408]
58 3), encoding can be shown to be both efficient and exact if we impose orthogonality constraints on the sets of subcenters. [sent-382, score-0.175]
59 To that end, assume that all subcenters in different sets are pairwise orthogonal; i. [sent-383, score-0.164]
60 (19) Each subcenter matrix C(i) spans a linear subspace of Rp, and the linear subspaces for different subcenter sets do not intersect. [sent-386, score-0.842]
61 Hence, (19) implies that only the subcenters in C(i) can explain the projection of x onto the C(i) subspace. [sent-387, score-0.134]
62 bTlohec quantization error t×he5re bfolorcek bse,c doemnoeste ? [sent-391, score-0.253]
63 One can therefore solve for the binary coefficients of the subcenter sets independently. [sent-426, score-0.454]
64 1 where z(i)≡R(i)Tx, and R⊥is the orthogonal complement of R. [sent-447, score-0.119]
65 This≡ ≡shRows that, with orthogonality constraints (19), the ck-means encoding problem can be split into m independent sub-encoding problems, one for each subcenter set. [sent-448, score-0.523]
66 , log #centers), and the storage cost of representation, which is the same as the encoding cost to convert inputs to indices. [sent-465, score-0.228]
67 obtain each z(i) , the total encoding cost is O(ps + hs), i. [sent-467, score-0.127]
68 Table 1 summerizes the models in terms of their number of centers, index size, and cost of storage and encoding. [sent-471, score-0.142]
69 f2 (22) where the columns of X and B comprise the data points and the subcenter assignment coefficients. [sent-479, score-0.544]
70 The input to the learning algorithm is the training data X, the number of subcenter sets m, the cardinality of the subcenter sets h, and an upper bound on the rank of C, i. [sent-480, score-0.844]
71 Given data pro- jections Z(i) ≡ to optimize for B and D we perform one step oRf k-means for each subcenter set: – Assignment: Perform NN searches for each subcenter set to find the assignment coefficients, B(i) , B(i)← arBg(mi)in? [sent-491, score-0.84]
72 , BT = the optimization of R reduces to the orthogonal Procrustes problem: R ← argRmin? [sent-500, score-0.119]
73 Relations and related work As mentioned above, there are close mathematical relationships between ok-means, ck-means, ITQ for binary hashing [5], and PQ for vector quantization [7]. [sent-507, score-0.361]
74 ITQ ro∈tate {s− t1h,e+ subspace representation of the data to match the binary codes, and then minimizes quantization error within the subspace. [sent-527, score-0.326]
75 By contrast, ok-means maps the binary codes into the original input space, and then considers both the quantization error within the subspace and the out-of-subspace projection error. [sent-528, score-0.353]
76 inRce one can often reduce quantization errors by projecting out significant portions of the feature space. [sent-534, score-0.253]
77 We next show that ok-means is a special case of ck-means with h = 2, where each subcenter set has two elements. [sent-540, score-0.378]
78 To this end, let C(i) = and let b(i) = be the ith subcenter matrix selection vector. [sent-541, score-0.41]
79 Similarly, one can also map ok-means parameters onto corresponding subcenters for ck-means. [sent-562, score-0.134]
80 Thus, there is a 1-to-1 mapping between the parameterizations of cluster centers in ok-means and ck-means for h = 2. [sent-563, score-0.299]
81 The benefit of the ck-means generalization is its joint encoding of multiple dimensions with an arbitrary number of centers, allowing it to capture intrinsic dependence among data dimensions. [sent-565, score-0.175]
82 Finding the rotation to minimize quantization error is mentioned in [8], but it was considered too difficult to estimate. [sent-574, score-0.329]
83 Here we show that one can find a rotation to minimize the quantization error. [sent-575, score-0.329]
84 Euclidean ANN search Euclidean ANN is a useful task for comparing quantization techniques. [sent-582, score-0.302]
85 Given a database of ck-means indices, and a query, we use Asymmetric and Symmetric ck-means quantizer distance, denoted AQDck and SQDck. [sent-583, score-0.162]
86 The AQDck between a query x and a binary index b ≡ ? [sent-584, score-0.145]
87 333000222200 1M SIFT, 64−bit encoding (k = 264) 1M GIST, 64−bit encoding (k = 264) 1B SIFT, 64−bit encoding (k = 264) Re@acl0 . [sent-624, score-0.297]
88 Using h = 256 is particularly attractive because the resulting lookup tables are small, encoding is fast, and each subcenter index fits into one byte. [sent-642, score-0.568]
89 As h increases we expect retrieval results to improve, but encoding and indexing of a query to become slower. [sent-643, score-0.2]
90 [7] proposed for splitting the feature dimensions into m sub-vectors for PQ: (1) natural: sub-vectors comprise consecutive dimensions, (2) structured: dimensions with the same index modulo 8 are grouped, and (3) random: random permutations are used. [sent-649, score-0.176]
91 For PQ in the experiments below, we use the orderings that produced the best results in [7], namely, the structured ordering for 960D GIST, and the natural ordering for 128D SIFT. [sent-650, score-0.118]
92 4 shows Recall@R plots for ck-means and PQ [7] with symmetric and asymmetric distances (SD ≡ PSQQD [7c]k wanitdh A syDm m≡e rAicQ aDncdk) a on tmhee t3r cda dtaissteatns. [sent-655, score-0.182]
93 4 (including ckmeans) to index the database into 64 bits, this storage cost reduces to only 7. [sent-671, score-0.142]
94 In the experiments we use linear scan to find the nearest items according to quantizer distances. [sent-674, score-0.223]
95 tia Sl quantization sapnedd an pin svigenritefidfile structure for AD and AH, as suggested by [7, 1], and using the multi-index hashing method of [13] for Hamming distance. [sent-677, score-0.315]
96 333000222311 1M GIST, encoding with 64, 96, and 128 bits Rl@ace0 . [sent-685, score-0.178]
97 The main advantage of ck- × here is that finding the closest cluster center is done in O(√k) time, much faster than standard NN search with k-means in O(k). [sent-697, score-0.125]
98 We use publicly available code from Coates et al [2], with changes to the codebook learning and cluster assignment modules. [sent-703, score-0.213]
99 Conclusions We present the Cartesian k-means algorithm, a generalization of k-means with a parameterization of the clus- ter centers such that number of centers is super-linear in the number of parameters. [sent-714, score-0.516]
100 In experiments on large-scale retrieval and codebook learning for recognition the results are impressive, outperforming product quantization by a significant margin. [sent-716, score-0.432]
wordName wordTfidf (topN-words)
[('subcenter', 0.378), ('pq', 0.312), ('quantization', 0.253), ('centers', 0.223), ('rdb', 0.223), ('itq', 0.2), ('ann', 0.163), ('cartesian', 0.163), ('quantizer', 0.162), ('hamming', 0.159), ('aqdck', 0.156), ('rtx', 0.146), ('subcenters', 0.134), ('orthogonal', 0.119), ('gist', 0.112), ('ckmeans', 0.111), ('sqdck', 0.111), ('cb', 0.109), ('asymmetric', 0.104), ('encoding', 0.099), ('ah', 0.091), ('sqdok', 0.089), ('sift', 0.089), ('db', 0.084), ('nn', 0.083), ('bits', 0.079), ('rotation', 0.076), ('cluster', 0.076), ('rp', 0.074), ('storage', 0.073), ('bit', 0.068), ('aqdok', 0.067), ('columns', 0.066), ('codebook', 0.066), ('rhs', 0.066), ('tx', 0.063), ('hashing', 0.062), ('sgn', 0.062), ('query', 0.058), ('comprise', 0.057), ('ad', 0.057), ('subspaces', 0.056), ('codes', 0.054), ('lookup', 0.05), ('search', 0.049), ('compositional', 0.048), ('disjoint', 0.047), ('orthogonality', 0.046), ('orderings', 0.046), ('binary', 0.046), ('okmeans', 0.045), ('sqd', 0.045), ('retrieval', 0.043), ('assignment', 0.043), ('product', 0.042), ('index', 0.041), ('scaling', 0.041), ('diagonal', 0.041), ('optimize', 0.041), ('base', 0.04), ('distances', 0.04), ('ctc', 0.039), ('recall', 0.039), ('dimensions', 0.039), ('hm', 0.038), ('symmetric', 0.038), ('queries', 0.038), ('procrustes', 0.037), ('generalization', 0.037), ('ordering', 0.036), ('clusters', 0.035), ('rtr', 0.034), ('expressed', 0.034), ('parameterization', 0.033), ('representational', 0.033), ('euclidean', 0.032), ('isotropic', 0.032), ('items', 0.032), ('ith', 0.032), ('gou', 0.032), ('encodings', 0.032), ('norouzi', 0.032), ('orthonormal', 0.031), ('indices', 0.031), ('kmeans', 0.031), ('toronto', 0.03), ('sets', 0.03), ('squared', 0.03), ('nearest', 0.029), ('initialize', 0.029), ('cost', 0.028), ('coates', 0.028), ('di', 0.028), ('learning', 0.028), ('retrieved', 0.028), ('minimizes', 0.027), ('ra', 0.027), ('entails', 0.027), ('neighbor', 0.026), ('inclusion', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 79 cvpr-2013-Cartesian K-Means
Author: Mohammad Norouzi, David J. Fleet
Abstract: A fundamental limitation of quantization techniques like the k-means clustering algorithm is the storage and runtime cost associated with the large numbers of clusters required to keep quantization errors small and model fidelity high. We develop new models with a compositional parameterization of cluster centers, so representational capacity increases super-linearly in the number of parameters. This allows one to effectively quantize data using billions or trillions of centers. We formulate two such models, Orthogonal k-means and Cartesian k-means. They are closely related to one another, to k-means, to methods for binary hash function optimization like ITQ [5], and to Product Quantization for vector quantization [7]. The models are tested on largescale ANN retrieval tasks (1M GIST, 1B SIFT features), and on codebook learning for object recognition (CIFAR-10). 1. Introduction and background Techniques for vector quantization, like the well-known k-means algorithm, are used widely in vision and learning. Common applications include codebook learning for object recognition [16], and approximate nearest neighbor (ANN) search for information retrieval [12, 14]. In general terms, such techniques involve partitioning an input vector space into multiple regions {Si}ik=1, mapping points x in each region uinlttiop region-specific representatives {ci}ik=1, known as cioennte irnst.o A resg siuonch-,s a quantizer, qse(nxt)a,t can b {ec expressed as k q(x) = ?1(x∈Si)ci, (1) i?= ?1 where 1(·) is the usual indicator function. eTrhee 1 quality oef u a quantizer oisr expressed in terms of expected distortion, a common measure of which is squared error ?x − q(x) ?22. In this case, given centers {ci}, the region t ?ox xw −hic qh(x a point nis t assigned gwivitehn m ceinnitemrasl {dcis}to,r tthioen r eisobtained by Euclidean nearest neighbor (NN) search. The k-means algorithm can be used to learn centers from data. To reduce expected distortion, crucial for many applications, one can shrink region volumes by increasing k, the number of regions. In practice, however, increasing k results in prohibitive storage and run-time costs. Even if one resorts to ANN search with approximate k-means [14] or hierarchical k-means [12], it is hard to scale to large k (e.g., k = 264), as storing the centers is untenable. This paper concerns methods for scalable quantization with tractable storage and run-time costs. Inspired by Product Quantization (PQ), a state-of-the-art algorithm for ANN search with high-dimensional data (e.g., [7]), compositionality is one key. By expressing data in terms of recurring, reusable parts, the representational capacity of compositional models grows exponentially in the number ofparameters. Compression techniques like JPEG accomplish this by encoding images as disjoint rectangular patches. PQ divides the feature space into disjoint subspaces that are quantized independently. Other examples include part-based recognition models (e.g., [18]), and tensor-based models for stylecontent separation (e.g., [17]). Here, with a compositional parameterization of region centers, we find a family of models that reduce the enc√oding cost of k centers down from k to between log2 k and √k. A model parameter controls the trade-off between model fidelity and compactness. We formulate two related algorithms, Orthogonal kmeans (ok-means) and Cartesian k-means (ck-means). They are natural extensions of k-means, and are closely related to other hashing and quantization methods. The okmeans algorithm is a generalization of the Iterative Quantization (ITQ) algorithm for finding locality-sensitive binary codes [5]. The ck-means model is an extension of okmeans, and can be viewed as a generalization of PQ.1 We then report empirical results on large-scale ANN search tasks, and on codebook learning for recognition. For ANN search we use datasets of 1M GIST and 1B SIFT features, with both symmetric and asymmetric distance measures on the latent codes. We consider codebook learning for a generic form of recognition on CIFAR-10. 2. k-means Given a dataset of n p-dim points, D ≡ {xj }jn=1, the kmeans algorithm partitions ithme n points in ≡to { kx c}lusters, and 1A very similar generalization of PQ, was developed concurrently by Ge, He, Ke and Sun, and also appears in CVPR 2013 [4]. 333000111755 represents each cluster by a center point. Let C ∈ Rp×k breep a mseantrtsix e wachhos celu sctoelurm byns a comprise tihnet .k Lceltus Cter ∈ centers, i.e., C = [c1, c2 , · · · , ck] . The k-means objective is to minimize the within,-c··lu·s ,tecr squared distances: ?k-means(C) = = x?∈Dmiin?x − ci?22 x?∈Db∈mHin1/k?x − Cb?22 (2) where H1/k ≡ {b | b ∈ {0, 1}k and ?b? = 1}, i.e., b is a binary vee Hctor comprising a ,1-1o}f-ka encoding. Lloyd’s b k- imse aa bni-s algorithm [11] finds a local minimum of (2) by iterative, alternating optimization with respect to C and the b’s. The k-means model is simple and intuitive, using NN search to assign points to centers. The assignment of points to centers can be represented with a log k-bit index per data point. The cost of storing the centers grows linearly with k. 3. Orthogonal k-means with 2m centers With a compositional model one can represent cluster centers more efficiently. One such approach is to re- construct each input with an additive combination of the columns of C. To this end, instead of the 1-of-k encoding in (2), we let b be a general m-bit vector, b ∈ Hm ≡ {0, 1}m, (a2n)d, wCe e∈ l Rt bp× bme. aA gse nsuerchal, meac-bhi tc vluesctteorr c, ebnt ∈er H His th≡e sum o}f a asundbs Cet o∈f Rthe columns of C. There are 2m possible subsets, and therefore k = 2m centers. While the number of parameters in the quantizer is linear in m, the number of centers increases exponentially. While efficient in representing cluster centers, the approach is problematic, because solving bˆ = abrg∈Hmmin?x − Cb?22,(3) is intractable; i.e., the discrete optimization is not submodular. Obviously, for small 2m one could generate all possible centers and then perform NN search to find the optimal solution, but this would not scale well to large values of m. One key observation is that if the columns of C are orthogonal, then optimization (3) becomes tractable. To explain this, without loss of generality, assume the bits belong to {−1, 1} instead of {0, 1}, i.e., b? ∈ Hm? ≡ {−1, 1}m. tToh e{−n, ?x − Cb??22 = xTx + b?TCTCb? − 2xTCb? .(4) For diagonal CTC, the middle quadratic term on the RHS becomes trace(CTC), independent of b?. As a consequence, when C has orthogonal columns, abr?g∈mH?min?x − Cb??22 = sgn(CTx) ,(5) where sgn(·) is the element-wise sign function. Cluster centers are depicted by dots, and cluster boundaries by dashed lines. (Left) Clusters formed by a 2-cube with no rotation, scaling or translation; centers = {b? |b? ∈ H2? }. (Right) Rotation, scaling and translation are used to reduce distances between data points and cluster centers; centers = {μ + RDb? | b? ∈ H2? }. To reduce quantization error further we can also introduce an offset, denoted μ, to translate x. Taken together with (5), this leads to the loss function for the model we call orthogonal k-means (ok-means):2 ?ok-means(C,μ) = x?∈Db?m∈iHn?m?x − μ − Cb??22. (6) Clearly, with a change of variables, b? = 2b −1, we can define new versions of μ and C, with ide=ntic 2abl −lo1s,s, w wfoer c wanh dicehthe unknowns are binary, but in {0, 1}m. T uhnek nook-wmnesa anrse quantizer uetn icnod {e0,s 1ea}ch data point as a vertex of a transformed m-dimensional unit hypercube. The transform, via C and μ, maps the hypercube vertices onto the input feature space, ideally as close as possible to the data points. The matrix C has orthogonal columns and can therefore be expressed in terms of rotation and scaling; i.e., C ≡ RD, where R ∈ Rp×m has orthonormal cinoglu;m i.ne.s, ( CRT ≡R = R DIm,) w, ahnerde eD R Ris ∈ diagonal and positive definite. The goal of learning is to find the parameters, R, D, and μ, which minimize quantization error. Fig. 1 depicts a small set of 2D data points (red x’s) and two possible quantizations. Fig. 1 (left) depicts the vertices of a 2-cube with C = I2 and zero translation. The cluster regions are simply the four quadrants of the 2D space. The distances between data points and cluster centers, i.e., the quantization errors, are relatively large. By comparison, Fig. 1 (right) shows how a transformed 2-cube, the full model, can greatly reduce quantization errors. 3.1. Learning ok-means To derive the learning algorithm for ok-means we first rewrite the objective in matrix terms. Given n data points, let X = [x1, x2, · · · , xn] ∈ Rp×n. Let B? ∈ {−1, 1}m×n denote the corresponding ∈clu Rster assignment {co−ef1f,ic1i}ents. Our 2While closely related to ITQ, we use the the relationship to k-means. term ok-means to emphasize 333000111 686 goal is to find the assignment coefficients B? and the transformation parameters, namely, the rotation R, scaling D, and translation μ, that minimize ?ok-means(B?, R, D,μ) = ?X − μ1T − RDB??f2 (7) = ?X? − RDB??f2 (8) where ?·?f denotes the Frobenius norm, X? ≡ X − μ1T, R iws hceornes ?tr·a?ined to have orthonormal columns≡ ≡(R XTR − μ=1 Im), and D is a positive diagonal matrix. Like k-means, coordinate descent is effective for optimizing (8). We first initialize μ and R, and then iteratively optimize ?ok-means with respect to B?, D, R, and μ: • Optimize B? and D: With straightforward algebraic manipulation, one can show that ?ok-means = ?RTX?−DB??f2 + ?R⊥TX??f2 , (9) where columns of R⊥ span the orthogonal complement of the column-space of R (i.e., the block matrix [R R⊥] is orthogonal). It follows that, given X? and R, we can optimize the first term in (9) to solve for B? and D. Here, DB? is the leastsquares approximation to RTX?, where RTX? and DB? × are m n. Further, the ith row of DB? can only contain earleem men×tsn .fr Fomur t{h−erd,i t,h +e dii} where di = Dii. Wherever tehleem corresponding delement} }o fw RheTrXe? d is positive (negative) we should put a positive (negative) value in DB?. The optimal di is determined by the mean absolute value of the elements in the ith row of RTX?: • • B? ← sgn ?RTX?? (10) D ← Diag?(mroewan?(abs(RTX?))) (11) Optimize R: Using the original objective (8), find R that minimizes ?X? −RA?f2 , subject to RTR = Im, and Am n≡i mDizBes?. ?TXhis i−s equivalent to an Orthogonal Procrustes problem [15], and can be solved exactly using SVD. In particular, by adding p − m rows of zeros to the bottom poaf rDtic, uDlaBr, bb eyc aodmdeins p p× − n. mT rhoewn sR o ifs z square a tnhde orthogoonf aDl ,a DndB can boem eessti pm ×at end. Twhiethn RSV isD s. qBuuatr es ainncde oDrtBho gisdegenerate we are only interested in the first m columns of R; the remaining columns are unique only up to rotation of the null-space.) Optimize μ: Given R, B? and D, the optimal μ is given by the column average of X −RDB?: μ ← mcoeluamn(X−RDB?) (12) 3.2. Approximate nearest neighbor search One application of scalable quantization is ANN search. Before introducing more advanced quantization techniques, we describe some experimental results with ok-means on Euclidean ANN search. The benefits of ok-means for ANN search are two-fold. Storage costs for the centers is reduced to O(log k), from O(k) with k-means. Second, substantial speedups are possible by exploiting fast methods for NN search on binary codes in Hamming space (e.g., [13]). Generally, in terms of a generic quantizer q(·), there are twoG neanteurraalll ways etrom messti omfa ate g ethneer dicis qtaunanceti z beetrw qe(·e)n, ttwheor vectors, v and u [7]. Using the Symmetric quantizer distance (SQD) ?v−u? is approximated by ?q(v)−q(u) ? . Using the Asymmetric quantizer doixsimtanacteed (A byQ ?Dq()v, only one o Uf sthineg tw thoe vectors is quantized, and ?v−u? is estimated as ?v−q(u) ? . vWechtiloer sS iQs qDu might b, aen slightly f?as itse ers t iom compute, vA−QqD(u i)?n-. curs lower quantization errors, and hence is more accurate. For ANN search, in a pre-processing stage, each database entry, u, is encoded into a binary vector corresponding to the cluster center index to which u is assigned. At test time, the queries may or may not be encoded into indices, depending on whether one uses SQD or AQD. In the ok-means model, the quantization of an input x is straightforwardly shown to be qok(x) = μ + RD sgn(RT(x − μ)) . (13) The corresponding m-bit cluster index is sgn(RT(x − μ)). Given two indices, namely b?1 , b?2 ∈ {−1, +1}m,( txhe − symmetric ok-means quantizer distance∈ ∈is { SQDok(b?1, b?2) = ?μ + RDb?1 − μ − RDb?2 ?22 = ?D(b?1 − b?2)?22 .(14) In effect, SQDok is a weighted Hamming distance. It is the sum of the squared diagonal entries of D corresponding to bits where b?1 and b?2 differ. Interestingly, in our experiments with ok-means, Hamming and weighted Hamming distances yield similar results. Thus, in ok-means experiments we simply report results for Hamming distance, to facilitate comparisons with other techniques. When the scaling in ok-means is constrained to be isotropic (i.e., D = αIm for α ∈ R+), then SQDok becomes a constant multiple off othre α αu ∈sua Rl Hamming distance. As discussed in Sec. 5, this isotropic ok-means is closely related to ITQ [5]. The ok-means AQD between a feature vector x and a cluster index b?, is defined as AQDok(x, b?) = ?x −μ − RDb??22 = ?RTx? − Db??22 + ?R⊥Tx??22 , (15) where x? = x−μ. For ANN search, in comparing distances from query x t−o a .d Faotars AetN oNf binary i,n idni ccoesm, ptharei snegc odnisdta tnecrems on the RHS of (15) is irrelevant, since it does not depend on b?. Without this term, AQDok becomes a form of asymmetric Hamming (AH) distance between RTx? and b?. While previous work [6] discussed various ad hoc AH distance measures for binary hashing, interestingly, the optimal AH distance for ok-means is derived directly in our model. 333000111977 1M SIFT, 64−bit encoding (k = 264) lRc@aeR0 . 206148 10 RIoPT1Qk K−Q m( AHeDa)Hn )s (AH)10K Figure 2. Euclidean ANN retrieval results for different methods and distance functions on the 1M SIFT dataset. 3.3. Experiments with ok-means Following [7], we report ANN search results on 1M SIFT, a corpus of 128D SIFT descriptors with disjoint sets of 100K training, 1M base, and 10K test vectors. The training set is used to train models. The base set is the database, and the test points are queries. The number of bits m is typically less than p, but no pre-processing is needed for dimensionality reduction. Rather, to initialize learning, R is a random rotation of the first m principal directions of the training data, and μ is the mean of the data. For each query we find R nearest neighbors, and compute Recall@R, the fraction of queries for which the ground-truth Euclidean NN is found in the R retrieved items. Fig. 2 shows the Recall@R plots for ok-means with Hamming (H) ≈ SQDok and asymmetric Hamming (AH) H≡a mAmQiDngok ( Hd)is t≈an SceQ, vs. ITQ [5] and PQ [7]. The PQ ≡met AhoQdD exploits a more complex asymmetric distance func- tion denoted AD ≡ AQDck (defined in Sec. 6. 1). Note first tthioant od ke-nmoeteadns A improves upon ITQ, with both Hamming and asymmetric Hamming distances. This is due to the scaling parameters (i.e., D) in ok-means. If one is interested in Hamming distance efficient retrieval, ok-means is prefered over ITQ. But better results are obtained with the asymmetric distance function. Fig. 2 also shows that PQ achieves superior recall rates. This stems from its joint encoding of multiple feature dimensions. In ok-means, each bit represents a partition ofthe feature space into two clusters, separated by a hyperplane. The intersection of m orthogonal hyperplanes yields 2m regions. Hence we obtain just two clusters per dimension, and each dimension is encoded independently. In PQ, by comparison, multiple dimensions are encoded jointly, with arbitrary numbers of clusters. PQ thereby captures statistical dependencies among different dimensions. We next extend ok-means to jointly encode multiple dimensions. 4. Cartesian k-means In the Cartesian k-means (ck-means) model, each region center is expressed parametrically as an additive combination of multiple subcenters. Let there be m sets of subcen- Fidg2ure31.Ddep4icton5fCartde?1si2nqdu?4ati5z?3aton 4qDck(da)t= ,?wd i?5134t?h the first (last) two dimensions sub-quantized on the left (right). Cartesian k-means quantizer denoted qck, combines the subquantizations in subspaces, and produces a 4D reconstruction. ters, each with h elements.3 Let C(i) be a matrix whose columns comprise the elements of the ith subcenter set; C(i) ∈ Rp×h. Finally, assume that each cluster center, c, is the∈ sum of exactly one element from each subcenter set: = ?C(i)b(i) m c i?= ?1 , (16) where b(i) ∈ H1/h is a 1-of-h encoding. As a conc∈re Hte example (see Fig. 3), suppose we are given 4D inputs, x ∈ R4, and we split each datum into m = 2 parts: z(1) = ?I2 0? x , and Then, suppose w?e z(2) = ?0 I2? x .(17) qu?antize each part, z(?1) and? z(2) , sepa- × rately. As depicted in Fig. 3 (left and middle), we could use h = 5 subcenters for each one. Placing the corresponding subcenters in the columns of 4 5 matrices C(1) and C(2) , C(1)=?d1d20d2×35d4d5?, C(2)=?d?1d?20d2×?35d?4d?5?, we obtain a model (16) that provides 52 possible centers with which to quantize the data. More generally, the total number of model centers is k = hm. Each center is a member of the Cartesian product of the subcenter sets, hence the name Cartesian k-means. Importantly, while the number of centers is hm, the number of subcenters is only mh. The model provides a super-linear number of centers with a linear number of parameters. The learning objective for Cartesian k-means is ?ck-means(C) =x?∈D{b(mi)}inim=1???x −i?=m1C(i)b(i)??22 where b(i) ∈ H1/h, and C ≡ [C(1), ··· , (18) C(m)] ∈ Rp×mh. [b(1)T, ··· ,b(m)T] If we let bT ≡ then the second sum in (18) can be expressed succinctly as Cb. 3While here we assume a fixed cardinality for all subcenter sets, the model is easily extended to allow sets with different cardinalities. 333000112088 The key problem with this formulation is that the min- imization of (18) with respect to the b(i) ’s is intractable. Nevertheless, motivated by orthogonal k-means (Sec. 3), encoding can be shown to be both efficient and exact if we impose orthogonality constraints on the sets of subcenters. To that end, assume that all subcenters in different sets are pairwise orthogonal; i.e., ∀i,j | i = j C(i)TC(j) = 0h×h .(19) Each subcenter matrix C(i) spans a linear subspace of Rp, and the linear subspaces for different subcenter sets do not intersect. Hence, (19) implies that only the subcenters in C(i) can explain the projection of x onto the C(i) subspace. In the example depicted in Fig. 3, the input features are simply partitioned (17), and the subspaces clearly satisfy the orthogonality constraints. It is also clear that C ≡ [ C(1) C(2)] is block diagonal, Iwtit ihs 2 a ×lso o5 c bleloacrks t,h adte Cnote ≡d D(1) and D(]2 i)s . bTlohec quantization error t×he5re bfolorcek bse,c doemnoeste ?x − Cb?22 = ???zz((12))?−?D0(1) D0(2)? ?b ( 21) ? ?2 = ?????z(1)−D(1)b(1)??2+???z(2)−D(2??)b(2)??2. In words, the squa??zred quantization?? erro??r zis the sum of t??he squared errors on the subspaces. One can therefore solve for the binary coefficients of the subcenter sets independently. In the general case, assuming (19) is satisfied, C can be expressed as a product R D, where R has orthonormal columns, and D is block diagonal; i.e., C = R D where Ra=nd[hRe(n1c),e·C(,i)R=(mR)]i,Dan(di).DW=i⎢t⎡⎣⎢hDs0i.(1≡)Dra0(n2)k. C.(Di)0(.m,i)t⎦⎥ ⎤fo,l(2w0)s that D(i) ∈ Rsi×h and R(i) ∈ Rp×≡sira. Clearly, ? si ≤ p, because of∈ ∈th Re orthogonality ∈con Rstraints. Replacing C(i) with R(i)D(i) in the RHS of (18?), we find ?x−Cb?22 = ?m?z(i)−D(i)b(i)?22+?R⊥Tx?22, (21) i?= ?1 where z(i)≡R(i)Tx, and R⊥is the orthogonal complement of R. This≡ ≡shRows that, with orthogonality constraints (19), the ck-means encoding problem can be split into m independent sub-encoding problems, one for each subcenter set. To find the b(i) that minimizes ??z(i) we perform NN search with z(i) again??st h si-dim vec??tors in D(i) . This entails a cost of O(hsi).? Fortunately, all? the elements of b can be found very efficiently, in O(hs), where s ≡ ? si. If we also include the cost of rotating x to −D(i)b(i)?22, Taocbkml-emt1he.aoAnds um#ceh2nkmrtyeofskm-#lobgeiatknsh,cOo-rm( pOec(2oamkn+spt),hpan)dkO- m(pOecosa(n+mst(hipns)khtesro)m of number of centers, number of bits needed for indices (i.e., log #centers), and the storage cost of representation, which is the same as the encoding cost to convert inputs to indices. The last column shows the costs given an assumption that C has a rank of s ≥ m. obtain each z(i) , the total encoding cost is O(ps + hs), i.e., O(p2+hp). Alternatively, one could perform NN search in p-dim C(i) ’s, to find the b(i) ’s, which costs O(mhp). Table 1 summerizes the models in terms of their number of centers, index size, and cost of storage and encoding. 4.1. Learning ck-means We can re-write the ck-means objective (18) in matrix form with the Frobenius norm; i.e., ?ck-means(R, D, B) = ? X − RDB ?f2 (22) where the columns of X and B comprise the data points and the subcenter assignment coefficients. The input to the learning algorithm is the training data X, the number of subcenter sets m, the cardinality of the subcenter sets h, and an upper bound on the rank of C, i.e., s. In practice, we also let each rotation matrix R(i) have the same number of columns, i.e., si = s/m. The outputs are the matrices {R(i) } and {D(i) } that provide a local minima of (22). Learning begins hwaitth p trohev idneit aia lloizcaatlio mni noimf Ra oanf d(2 D2)., followed by iterative coordinate descent in B, D, and R: • Optimize B and D: With R fixed, the objective is given by (21) where ?R⊥TX?f2 R(i)TX, is constant. Given data pro- jections Z(i) ≡ to optimize for B and D we perform one step oRf k-means for each subcenter set: – Assignment: Perform NN searches for each subcenter set to find the assignment coefficients, B(i) , B(i)← arBg(mi)in?Z(i)− D(i)B(i)?f2 – • Update: D(i)← arDg(mi)in?Z(i)− D(i)B(i)?f2 Optimize R: Placing the D(i) ’s along the diagonal of D, as in (20), and concatenating B(i) ’s as rows of B, [B(1)T, ...,B(m)T], i.e., BT = the optimization of R reduces to the orthogonal Procrustes problem: R ← argRmin?X − RDB?f2. In experiments below, R ∈ Rp×p, and rank(C) ≤ p is unIcnon esxtpraeirniemde. tFso rb high-dimensional adnadta r awnhke(rCe ) ra ≤nk( pX is) ?np, fnostrr efficiency ri th may dbime eusnesfiuoln atol dcoantast wrahiner er anr akn(kC(X). 333000112199 5. Relations and related work As mentioned above, there are close mathematical relationships between ok-means, ck-means, ITQ for binary hashing [5], and PQ for vector quantization [7]. It is instructive to specify these relationships in some detail. Iterative Quantization vs. Orthogonal k-means. ITQ [5] is a variant of locality-sensitive hashing, mapping data to binary codes for fast retrieval. To extract m-bit codes, ITQ first zero-centers the data matrix to obtain X?. PCA is then used for dimensionality reduction, fromp down to m dimensions, after which the subspace representation is randomly rotated. The composition of PCA and the random rotation can be expressed as WTX? where W ∈ Rp×m. ITQ then solves for the m×m rotation mawthriex,r eR W, t ∈hatR minimizes ?ITQ(B?, R) = ?RTWTX?− B??f2 , s.t. RTR = Im×m, (23) where B? ∈ {−1, +1}n×p. ITQ ro∈tate {s− t1h,e+ subspace representation of the data to match the binary codes, and then minimizes quantization error within the subspace. By contrast, ok-means maps the binary codes into the original input space, and then considers both the quantization error within the subspace and the out-of-subspace projection error. A key difference is the inclusion of ?R⊥X? ?f2 in the ok-means objective (9). This is important s ?inRce one can often reduce quantization errors by projecting out significant portions of the feature space. Another key difference between ITQ and ok-means is the inclusion of non-uniform scaling in ok-means. This is important when the data are not isotropic, and contributes to the marked improvement in recall rates in Fig. 2. Orthogonal k-means vs. Cartesian k-means. We next show that ok-means is a special case of ck-means with h = 2, where each subcenter set has two elements. To this end, let C(i) = and let b(i) = be the ith subcenter matrix selection vector. Since b(i) is a 1-of-2 encoding (?10? or ?01?), it follows that: [c(1i) c2(i)], b1(i)c(1i)+ b(2i)c2(i) = [b(1i) b2(i)]T c(1i)+2 c2(i)+ bi?c1(i)−2 c2(i), (24) b1(i) − b2(i) where bi? ≡ ∈ {−1, +1}. With the following setting of t≡he bok-m−e abns parameters, μ=?i=m1c1(i)+2c(2i),andC=?c1(1)−2c2(1),...,c1(m)−2c2(m)?, it should be clear that ?i C(i)b(i) = μ + Cb?, where b? ∈ {−1, +1}m, and b??i is the ith bit of b?, used in (24). Similarly, one can also map ok-means parameters onto corresponding subcenters for ck-means. Thus, there is a 1-to-1 mapping between the parameterizations of cluster centers in ok-means and ck-means for h = 2. The benefits of ok-means are its small number of parameters, and its intuitive formulation. The benefit of the ck-means generalization is its joint encoding of multiple dimensions with an arbitrary number of centers, allowing it to capture intrinsic dependence among data dimensions. Product Quantization vs. Cartesian k-means. PQ first splits the input vectors into m disjoint sub-vectors, and then quantizes each sub-vector separately using a subquantizer. Thus PQ is a special case of ck-means where the rotation R is not optimized; rather, R is fixed in both learning and retrieval. This is important because the sub-quantizers act independently, thereby ignoring intrasubspace statistical dependence. Thus the selection of subspaces is critical for PQ [7, 8]. J e´gou et al. [8] suggest using PCA, followed by random rotation, before applying PQ to high-dimensional vectors. Finding the rotation to minimize quantization error is mentioned in [8], but it was considered too difficult to estimate. Here we show that one can find a rotation to minimize the quantization error. The ck-means learning algorithm optimizes sub-quantizers in the inner loop, but they interact in an outer loop that optimizes the rotation (Sec. 4.1). 6. Experiments 6.1. Euclidean ANN search Euclidean ANN is a useful task for comparing quantization techniques. Given a database of ck-means indices, and a query, we use Asymmetric and Symmetric ck-means quantizer distance, denoted AQDck and SQDck. The AQDck between a query x and a binary index b ≡ ?b(1)T, ...,b(m)T?T, derived in (21), is AQDck(x,b) ?m?z(i)−D(i)b(i)?22+ ?R⊥Tx?22.(25) = i?= ?1 ?z(i)−D(i)b(i)?22is the distance between the ithpro- Here, jection?? of x, i.e., z(i) , ?a?nd a subcenter projection from D(i) selecte?d by b(i) . Give?n a query, these distances for each i, and all h possible values of b(i) can be pre-computed and stored in a query-specific h×m lookup table. Once created, tshtoer lookup qtaubelery i-ss pusecedif itoc compare akullp pda tatbablea.se O points etaot tehde, query. So computing AQDck entails m lookups and additions, somewhat more costly than Hamming distance. The last term on the RHS of (25) is irrelevant for NN search. Since PQ is a special case of ck-means with pre-defined subspaces, the same distances are used for PQ (cf. [7]). The SQDck between binary codes b1 and b2 is given by SQDck(b1,b2) = ?m?D(i)b1(i)− D(i)b2(i)?22.(26) i?= ?1 Since b1(i) and b(2i) are 1-of-?h encodings, an m×h?×h lookup table can be createadre t 1o- sft-ohre e nacllo pairwise msu×bh×-dhis ltaonockeusp. 333000222200 1M SIFT, 64−bit encoding (k = 264) 1M GIST, 64−bit encoding (k = 264) 1B SIFT, 64−bit encoding (k = 264) Re@acl0 .4186021RcIPoTQk0−Q m( SAeDaH)n s (SADH)1K0 .1642081 0RIocP1TkQ −K m (SAeDHa)n s (ASHD1)0K .186420 1 R0IoPc1TkQ −KQm( ASe DaHn) s (SADH1)0K Figure 4. Euclidean NN recall@R (number of items retrieved) based on different quantizers and corresponding distance functions on the 1M SIFT, 1M GIST, and 1B SIFT datasets. The dashed curves use symmetric distance. (AH ≡ AQDok, SD ≡ SQDck, AD ≡ AQDck) While the cost of computing SQDck is the same as AQDck, SQDck could also be used to estimate the distance between the indexed database entries, for diversifying the retrieval results, or to detect near duplicates, for example. Datasets. We use the 1M SIFT dataset, as in Sec. 3.3, along with two others, namely, 1M GIST (960D features) and 1B SIFT, both comprising disjoint sets of training, base and test vectors. 1M GIST has 500K training, 1M base, and 1K query vectors. 1B SIFT has 100M training, 1B base, and 10K query points. In each case, recall rates are averaged over queries in test set for a database populated from the base set. For expedience, we use only the first 1M training points for the 1B SIFT experiments. Parameters. In ANN experiments below, for both ckmeans and PQ, we use m = 8 and h = 256. Hence the number of clusters is k = 2568 = 264, so 64-bits are used as database indices. Using h = 256 is particularly attractive because the resulting lookup tables are small, encoding is fast, and each subcenter index fits into one byte. As h increases we expect retrieval results to improve, but encoding and indexing of a query to become slower. Initialization. To initialize the Di’s for learning, as in kmeans, we simply begin with random samples from the set of Zi’s (see Sec. 4.1). To initialize R we consider the different methods that J e´gou et al. [7] proposed for splitting the feature dimensions into m sub-vectors for PQ: (1) natural: sub-vectors comprise consecutive dimensions, (2) structured: dimensions with the same index modulo 8 are grouped, and (3) random: random permutations are used. For PQ in the experiments below, we use the orderings that produced the best results in [7], namely, the structured ordering for 960D GIST, and the natural ordering for 128D SIFT. For learning ck-means, R is initialized to the identity with SIFT corpora. For 1M GIST, where the PQ ordering is significant, we consider all three orderings to initialize R. Results. Fig. 4 shows Recall@R plots for ck-means and PQ [7] with symmetric and asymmetric distances (SD ≡ PSQQD [7c]k wanitdh A syDm m≡e rAicQ aDncdk) a on tmhee t3r cda dtaissteatns.c Tsh (eS Dhor ≡izontal axainsd represents tQheD number of retrieved items, R, on a log-scale. The results consistently favor ck-means. On 1M GIST, 64−bit encoding (k = 264) lae@Rc0 .261084 1R0Pc Qk −1m(K 321e )a (A nD s )(132) (A D )10K Figure 5. PQ and ck-means results using natural (1), structured (2), and random (3) ordering to define the (initial) subspaces. the high-dimensional GIST data, ck-means with AD significantly outperforms other methods; even ck-means with SD performs on par with PQ with AD. On 1M SIFT, the Recall@ 10 numbers for PQ and ck-means, both using AD, × are 59.9% and 63.7%. On 1B SIFT, Recall@ 100 numbers are 56.5% and 64.9%. As expected, with increasing dataset size, the difference between methods is more significant. In 1B SIFT, each feature vector is 128 bytes, hence a total of 119 GB. Using any method in Fig. 4 (including ckmeans) to index the database into 64 bits, this storage cost reduces to only 7.5 GB. This allows one to work with much larger datasets. In the experiments we use linear scan to find the nearest items according to quantizer distances. For NN search using 10K SIFT queries on 1B SIFT this takes about 8 hours for AD and AH and 4 hours for Hamming distance on a 2 4-core computer. Search can be sped up significantly; using a coarse tienri.tia Sl quantization sapnedd an pin svigenritefidfile structure for AD and AH, as suggested by [7, 1], and using the multi-index hashing method of [13] for Hamming distance. In this paper we did not implement these efficiencies as we focus primarily on the quality of quantization. Fig. 5 compares ck-means to PQ when R in ck-means is initialized using the 3 orderings of [7]. It shows that ckmeans is superior in all cases. Simiarly interesting, it also shows that despite the non-convexity ofthe optimization objective, ck-means learning tends to find similarly good encodings under different initial conditions. Finally, Fig. 6 compares the methods under different code dimensionality. 333000222311 1M GIST, encoding with 64, 96, and 128 bits Rl@ace0 .814206 1 R0cP kQ − m1 69e4216a−8 Knb −itsb 1t69248− bit10K Figure 6. PQ and ck-means results using different number of bits for encoding. In all cases asymmetric distance is used. Table2.Rcokg-nmiteoamPnQCesac(ondksue=r(bkaoc416y=0ko26)n40t2h[eC]IFAR7c598-u1.72096r%a tcesyuingdfferent codebook learning algorithms. 6.2. Learning visual codebooks While the ANN seach tasks above require too many clusters for k-means, it is interesing to compare k-means and ck-means on a task with a moderate number of clusters. To this end we consider codebook learning for bag-ofword√s models [3, 10]. We use ck-means with m = 2 and h = √k, and hence k centers. The main advantage of ck- × here is that finding the closest cluster center is done in O(√k) time, much faster than standard NN search with k-means in O(k). Alternatives for k-means, to improve efficiency, include approximate k-means [14], and hierarchical k-means [12]. Here we only compare to exact k-means. CIFAR-10 [9] comprises 50K training and 10K test images (32 32 pixels). Each image is one of 10 classes (airplane, 3b2i×rd,3 car, cat, )d.e Eera,c dog, frog, sh oonrsee o, ship, laansds etrsu (caki)r.We use publicly available code from Coates et al [2], with changes to the codebook learning and cluster assignment modules. Codebooks are built on 6×6 whitened color image patches. .O Cnoed histogram i sb ucirleta otend 6 per image quadrant, aagnde a linear SVM is applied to 4k-dim feature vectors. Recognition accuracy rates on the test set for different models and k are given in Table 2. Despite having fewer parameters, ck-means performs on par or better than kmeans. This is consistent for different initializations of the algorithms. Although k-means has higher fedility than ckmeans, with fewer parameters, ck-means may be less susceptible to overfitting. Table 2, also compares with the approach of [19], where PQ without learning a rotation is used for clustering features. As expected, learning the rotation has a significant impact on recognition rates, outperforming all three initializations of PQ. mean√s 7. Conclusions We present the Cartesian k-means algorithm, a generalization of k-means with a parameterization of the clus- ter centers such that number of centers is super-linear in the number of parameters. The method is also shown to be a generalization of the ITQ algorithm and Product Quantization. In experiments on large-scale retrieval and codebook learning for recognition the results are impressive, outperforming product quantization by a significant margin. An implementation of the method is available at https : / / github . com/norou z i ckmeans . / References [1] A. Babenko and V. Lempitsky. The inverted multi-index. CVPR, 2012. [2] A. Coates, H. Lee, and A. Ng. An analysis of single-layer networks in unsupervised feature learning. AISTATS, 2011. [3] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. ECCV Workshop Statistical Learning in Computer Vision, 2004. [4] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. CVPR, 2013. [5] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. CVPR, 2011. [6] A. Gordo and F. Perronnin. Asymmetric distances for binary embeddings. CVPR, 2011. [7] H. J ´egou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. PAMI, 2011. [8] H. J ´egou, M. Douze, C. Schmid, and P. P ´erez. Aggregating local descriptors into a compact image representation. CVPR, 2010. [9] A. Krizhevsky. Learning multiple layers of features from tiny images. MSc Thesis, Univ. Toronto, 2009. [10] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR, 2006. [11] S. P. Lloyd. Least squares quantization in pcm. IEEE Trans. IT, 28(2): 129–137, 1982. [12] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. CVPR, 2006. [13] M. Norouzi, A. Punjani, and D. J. Fleet. Fast search in hamming space with multi-index hashing. CVPR, 2012. [14] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. CVPR, 2007. [15] P. Sch o¨nemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 3 1, 1966. [16] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. ICCV, 2003. [17] J. Tenenbaum and W. Freeman. Separating style and content with bilinear models. Neural Comp., 12: 1247–1283, 2000. [18] A. Torralba, K. Murphy, and W. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE T. PAMI, 29(5):854–869, 2007. [19] S. Wei, X. Wu, and D. Xu. Partitioned k-means clustering for fast construction of unbiased visual vocabulary. The Era of Interactive Media, 2012. 333000222422
2 0.42630473 319 cvpr-2013-Optimized Product Quantization for Approximate Nearest Neighbor Search
Author: Tiezheng Ge, Kaiming He, Qifa Ke, Jian Sun
Abstract: Product quantization is an effective vector quantization approach to compactly encode high-dimensional vectors for fast approximate nearest neighbor (ANN) search. The essence of product quantization is to decompose the original high-dimensional space into the Cartesian product of a finite number of low-dimensional subspaces that are then quantized separately. Optimal space decomposition is important for the performance of ANN search, but still remains unaddressed. In this paper, we optimize product quantization by minimizing quantization distortions w.r.t. the space decomposition and the quantization codebooks. We present two novel methods for optimization: a nonparametric method that alternatively solves two smaller sub-problems, and a parametric method that is guaranteed to achieve the optimal solution if the input data follows some Gaussian distribution. We show by experiments that our optimized approach substantially improves the accuracy of product quantization for ANN search.
3 0.32859409 236 cvpr-2013-K-Means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes
Author: Kaiming He, Fang Wen, Jian Sun
Abstract: In computer vision there has been increasing interest in learning hashing codes whose Hamming distance approximates the data similarity. The hashing functions play roles in both quantizing the vector space and generating similarity-preserving codes. Most existing hashing methods use hyper-planes (or kernelized hyper-planes) to quantize and encode. In this paper, we present a hashing method adopting the k-means quantization. We propose a novel Affinity-Preserving K-means algorithm which simultaneously performs k-means clustering and learns the binary indices of the quantized cells. The distance between the cells is approximated by the Hamming distance of the cell indices. We further generalize our algorithm to a product space for learning longer codes. Experiments show our method, named as K-means Hashing (KMH), outperforms various state-of-the-art hashing encoding methods.
4 0.30380455 246 cvpr-2013-Learning Binary Codes for High-Dimensional Data Using Bilinear Projections
Author: Yunchao Gong, Sanjiv Kumar, Henry A. Rowley, Svetlana Lazebnik
Abstract: Recent advances in visual recognition indicate that to achieve good retrieval and classification accuracy on largescale datasets like ImageNet, extremely high-dimensional visual descriptors, e.g., Fisher Vectors, are needed. We present a novel method for converting such descriptors to compact similarity-preserving binary codes that exploits their natural matrix structure to reduce their dimensionality using compact bilinear projections instead of a single large projection matrix. This method achieves comparable retrieval and classification accuracy to the original descriptors and to the state-of-the-art Product Quantization approach while having orders ofmagnitudefaster code generation time and smaller memory footprint.
5 0.16492991 63 cvpr-2013-Binary Code Ranking with Weighted Hamming Distance
Author: Lei Zhang, Yongdong Zhang, Jinhu Tang, Ke Lu, Qi Tian
Abstract: Binary hashing has been widely used for efficient similarity search due to its query and storage efficiency. In most existing binary hashing methods, the high-dimensional data are embedded into Hamming space and the distance or similarity of two points are approximated by the Hamming distance between their binary codes. The Hamming distance calculation is efficient, however, in practice, there are often lots of results sharing the same Hamming distance to a query, which makes this distance measure ambiguous and poses a critical issue for similarity search where ranking is important. In this paper, we propose a weighted Hamming distance ranking algorithm (WhRank) to rank the binary codes of hashing methods. By assigning different bit-level weights to different hash bits, the returned binary codes are ranked at a finer-grained binary code level. We give an algorithm to learn the data-adaptive and query-sensitive weight for each hash bit. Evaluations on two large-scale image data sets demonstrate the efficacy of our weighted Hamming distance for binary code ranking.
6 0.13533524 223 cvpr-2013-Inductive Hashing on Manifolds
7 0.12408274 343 cvpr-2013-Query Adaptive Similarity for Large Scale Object Retrieval
8 0.11244106 404 cvpr-2013-Sparse Quantization for Patch Description
9 0.11237571 69 cvpr-2013-Boosting Binary Keypoint Descriptors
10 0.098338448 87 cvpr-2013-Compressed Hashing
11 0.095506109 268 cvpr-2013-Leveraging Structure from Motion to Learn Discriminative Codebooks for Scalable Landmark Classification
12 0.090104342 88 cvpr-2013-Compressible Motion Fields
13 0.084948003 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
14 0.082216308 456 cvpr-2013-Visual Place Recognition with Repetitive Structures
15 0.072547652 293 cvpr-2013-Multi-attribute Queries: To Merge or Not to Merge?
16 0.072177686 260 cvpr-2013-Learning and Calibrating Per-Location Classifiers for Visual Place Recognition
17 0.071572497 250 cvpr-2013-Learning Cross-Domain Information Transfer for Location Recognition and Clustering
18 0.068623498 403 cvpr-2013-Sparse Output Coding for Large-Scale Visual Recognition
19 0.062881142 249 cvpr-2013-Learning Compact Binary Codes for Visual Tracking
20 0.061697271 215 cvpr-2013-Improved Image Set Classification via Joint Sparse Approximated Nearest Subspaces
topicId topicWeight
[(0, 0.157), (1, -0.026), (2, -0.038), (3, 0.055), (4, 0.083), (5, -0.001), (6, -0.087), (7, -0.191), (8, -0.235), (9, -0.086), (10, -0.172), (11, 0.047), (12, 0.058), (13, 0.116), (14, 0.014), (15, 0.065), (16, 0.069), (17, 0.042), (18, -0.031), (19, -0.027), (20, 0.025), (21, -0.057), (22, -0.002), (23, -0.01), (24, 0.025), (25, 0.029), (26, -0.003), (27, -0.082), (28, -0.015), (29, -0.023), (30, 0.137), (31, 0.112), (32, 0.018), (33, 0.03), (34, -0.107), (35, -0.083), (36, 0.007), (37, -0.038), (38, 0.042), (39, 0.014), (40, 0.058), (41, 0.131), (42, 0.067), (43, 0.04), (44, 0.026), (45, 0.042), (46, -0.121), (47, -0.125), (48, -0.002), (49, -0.045)]
simIndex simValue paperId paperTitle
same-paper 1 0.94265449 79 cvpr-2013-Cartesian K-Means
Author: Mohammad Norouzi, David J. Fleet
Abstract: A fundamental limitation of quantization techniques like the k-means clustering algorithm is the storage and runtime cost associated with the large numbers of clusters required to keep quantization errors small and model fidelity high. We develop new models with a compositional parameterization of cluster centers, so representational capacity increases super-linearly in the number of parameters. This allows one to effectively quantize data using billions or trillions of centers. We formulate two such models, Orthogonal k-means and Cartesian k-means. They are closely related to one another, to k-means, to methods for binary hash function optimization like ITQ [5], and to Product Quantization for vector quantization [7]. The models are tested on largescale ANN retrieval tasks (1M GIST, 1B SIFT features), and on codebook learning for object recognition (CIFAR-10). 1. Introduction and background Techniques for vector quantization, like the well-known k-means algorithm, are used widely in vision and learning. Common applications include codebook learning for object recognition [16], and approximate nearest neighbor (ANN) search for information retrieval [12, 14]. In general terms, such techniques involve partitioning an input vector space into multiple regions {Si}ik=1, mapping points x in each region uinlttiop region-specific representatives {ci}ik=1, known as cioennte irnst.o A resg siuonch-,s a quantizer, qse(nxt)a,t can b {ec expressed as k q(x) = ?1(x∈Si)ci, (1) i?= ?1 where 1(·) is the usual indicator function. eTrhee 1 quality oef u a quantizer oisr expressed in terms of expected distortion, a common measure of which is squared error ?x − q(x) ?22. In this case, given centers {ci}, the region t ?ox xw −hic qh(x a point nis t assigned gwivitehn m ceinnitemrasl {dcis}to,r tthioen r eisobtained by Euclidean nearest neighbor (NN) search. The k-means algorithm can be used to learn centers from data. To reduce expected distortion, crucial for many applications, one can shrink region volumes by increasing k, the number of regions. In practice, however, increasing k results in prohibitive storage and run-time costs. Even if one resorts to ANN search with approximate k-means [14] or hierarchical k-means [12], it is hard to scale to large k (e.g., k = 264), as storing the centers is untenable. This paper concerns methods for scalable quantization with tractable storage and run-time costs. Inspired by Product Quantization (PQ), a state-of-the-art algorithm for ANN search with high-dimensional data (e.g., [7]), compositionality is one key. By expressing data in terms of recurring, reusable parts, the representational capacity of compositional models grows exponentially in the number ofparameters. Compression techniques like JPEG accomplish this by encoding images as disjoint rectangular patches. PQ divides the feature space into disjoint subspaces that are quantized independently. Other examples include part-based recognition models (e.g., [18]), and tensor-based models for stylecontent separation (e.g., [17]). Here, with a compositional parameterization of region centers, we find a family of models that reduce the enc√oding cost of k centers down from k to between log2 k and √k. A model parameter controls the trade-off between model fidelity and compactness. We formulate two related algorithms, Orthogonal kmeans (ok-means) and Cartesian k-means (ck-means). They are natural extensions of k-means, and are closely related to other hashing and quantization methods. The okmeans algorithm is a generalization of the Iterative Quantization (ITQ) algorithm for finding locality-sensitive binary codes [5]. The ck-means model is an extension of okmeans, and can be viewed as a generalization of PQ.1 We then report empirical results on large-scale ANN search tasks, and on codebook learning for recognition. For ANN search we use datasets of 1M GIST and 1B SIFT features, with both symmetric and asymmetric distance measures on the latent codes. We consider codebook learning for a generic form of recognition on CIFAR-10. 2. k-means Given a dataset of n p-dim points, D ≡ {xj }jn=1, the kmeans algorithm partitions ithme n points in ≡to { kx c}lusters, and 1A very similar generalization of PQ, was developed concurrently by Ge, He, Ke and Sun, and also appears in CVPR 2013 [4]. 333000111755 represents each cluster by a center point. Let C ∈ Rp×k breep a mseantrtsix e wachhos celu sctoelurm byns a comprise tihnet .k Lceltus Cter ∈ centers, i.e., C = [c1, c2 , · · · , ck] . The k-means objective is to minimize the within,-c··lu·s ,tecr squared distances: ?k-means(C) = = x?∈Dmiin?x − ci?22 x?∈Db∈mHin1/k?x − Cb?22 (2) where H1/k ≡ {b | b ∈ {0, 1}k and ?b? = 1}, i.e., b is a binary vee Hctor comprising a ,1-1o}f-ka encoding. Lloyd’s b k- imse aa bni-s algorithm [11] finds a local minimum of (2) by iterative, alternating optimization with respect to C and the b’s. The k-means model is simple and intuitive, using NN search to assign points to centers. The assignment of points to centers can be represented with a log k-bit index per data point. The cost of storing the centers grows linearly with k. 3. Orthogonal k-means with 2m centers With a compositional model one can represent cluster centers more efficiently. One such approach is to re- construct each input with an additive combination of the columns of C. To this end, instead of the 1-of-k encoding in (2), we let b be a general m-bit vector, b ∈ Hm ≡ {0, 1}m, (a2n)d, wCe e∈ l Rt bp× bme. aA gse nsuerchal, meac-bhi tc vluesctteorr c, ebnt ∈er H His th≡e sum o}f a asundbs Cet o∈f Rthe columns of C. There are 2m possible subsets, and therefore k = 2m centers. While the number of parameters in the quantizer is linear in m, the number of centers increases exponentially. While efficient in representing cluster centers, the approach is problematic, because solving bˆ = abrg∈Hmmin?x − Cb?22,(3) is intractable; i.e., the discrete optimization is not submodular. Obviously, for small 2m one could generate all possible centers and then perform NN search to find the optimal solution, but this would not scale well to large values of m. One key observation is that if the columns of C are orthogonal, then optimization (3) becomes tractable. To explain this, without loss of generality, assume the bits belong to {−1, 1} instead of {0, 1}, i.e., b? ∈ Hm? ≡ {−1, 1}m. tToh e{−n, ?x − Cb??22 = xTx + b?TCTCb? − 2xTCb? .(4) For diagonal CTC, the middle quadratic term on the RHS becomes trace(CTC), independent of b?. As a consequence, when C has orthogonal columns, abr?g∈mH?min?x − Cb??22 = sgn(CTx) ,(5) where sgn(·) is the element-wise sign function. Cluster centers are depicted by dots, and cluster boundaries by dashed lines. (Left) Clusters formed by a 2-cube with no rotation, scaling or translation; centers = {b? |b? ∈ H2? }. (Right) Rotation, scaling and translation are used to reduce distances between data points and cluster centers; centers = {μ + RDb? | b? ∈ H2? }. To reduce quantization error further we can also introduce an offset, denoted μ, to translate x. Taken together with (5), this leads to the loss function for the model we call orthogonal k-means (ok-means):2 ?ok-means(C,μ) = x?∈Db?m∈iHn?m?x − μ − Cb??22. (6) Clearly, with a change of variables, b? = 2b −1, we can define new versions of μ and C, with ide=ntic 2abl −lo1s,s, w wfoer c wanh dicehthe unknowns are binary, but in {0, 1}m. T uhnek nook-wmnesa anrse quantizer uetn icnod {e0,s 1ea}ch data point as a vertex of a transformed m-dimensional unit hypercube. The transform, via C and μ, maps the hypercube vertices onto the input feature space, ideally as close as possible to the data points. The matrix C has orthogonal columns and can therefore be expressed in terms of rotation and scaling; i.e., C ≡ RD, where R ∈ Rp×m has orthonormal cinoglu;m i.ne.s, ( CRT ≡R = R DIm,) w, ahnerde eD R Ris ∈ diagonal and positive definite. The goal of learning is to find the parameters, R, D, and μ, which minimize quantization error. Fig. 1 depicts a small set of 2D data points (red x’s) and two possible quantizations. Fig. 1 (left) depicts the vertices of a 2-cube with C = I2 and zero translation. The cluster regions are simply the four quadrants of the 2D space. The distances between data points and cluster centers, i.e., the quantization errors, are relatively large. By comparison, Fig. 1 (right) shows how a transformed 2-cube, the full model, can greatly reduce quantization errors. 3.1. Learning ok-means To derive the learning algorithm for ok-means we first rewrite the objective in matrix terms. Given n data points, let X = [x1, x2, · · · , xn] ∈ Rp×n. Let B? ∈ {−1, 1}m×n denote the corresponding ∈clu Rster assignment {co−ef1f,ic1i}ents. Our 2While closely related to ITQ, we use the the relationship to k-means. term ok-means to emphasize 333000111 686 goal is to find the assignment coefficients B? and the transformation parameters, namely, the rotation R, scaling D, and translation μ, that minimize ?ok-means(B?, R, D,μ) = ?X − μ1T − RDB??f2 (7) = ?X? − RDB??f2 (8) where ?·?f denotes the Frobenius norm, X? ≡ X − μ1T, R iws hceornes ?tr·a?ined to have orthonormal columns≡ ≡(R XTR − μ=1 Im), and D is a positive diagonal matrix. Like k-means, coordinate descent is effective for optimizing (8). We first initialize μ and R, and then iteratively optimize ?ok-means with respect to B?, D, R, and μ: • Optimize B? and D: With straightforward algebraic manipulation, one can show that ?ok-means = ?RTX?−DB??f2 + ?R⊥TX??f2 , (9) where columns of R⊥ span the orthogonal complement of the column-space of R (i.e., the block matrix [R R⊥] is orthogonal). It follows that, given X? and R, we can optimize the first term in (9) to solve for B? and D. Here, DB? is the leastsquares approximation to RTX?, where RTX? and DB? × are m n. Further, the ith row of DB? can only contain earleem men×tsn .fr Fomur t{h−erd,i t,h +e dii} where di = Dii. Wherever tehleem corresponding delement} }o fw RheTrXe? d is positive (negative) we should put a positive (negative) value in DB?. The optimal di is determined by the mean absolute value of the elements in the ith row of RTX?: • • B? ← sgn ?RTX?? (10) D ← Diag?(mroewan?(abs(RTX?))) (11) Optimize R: Using the original objective (8), find R that minimizes ?X? −RA?f2 , subject to RTR = Im, and Am n≡i mDizBes?. ?TXhis i−s equivalent to an Orthogonal Procrustes problem [15], and can be solved exactly using SVD. In particular, by adding p − m rows of zeros to the bottom poaf rDtic, uDlaBr, bb eyc aodmdeins p p× − n. mT rhoewn sR o ifs z square a tnhde orthogoonf aDl ,a DndB can boem eessti pm ×at end. Twhiethn RSV isD s. qBuuatr es ainncde oDrtBho gisdegenerate we are only interested in the first m columns of R; the remaining columns are unique only up to rotation of the null-space.) Optimize μ: Given R, B? and D, the optimal μ is given by the column average of X −RDB?: μ ← mcoeluamn(X−RDB?) (12) 3.2. Approximate nearest neighbor search One application of scalable quantization is ANN search. Before introducing more advanced quantization techniques, we describe some experimental results with ok-means on Euclidean ANN search. The benefits of ok-means for ANN search are two-fold. Storage costs for the centers is reduced to O(log k), from O(k) with k-means. Second, substantial speedups are possible by exploiting fast methods for NN search on binary codes in Hamming space (e.g., [13]). Generally, in terms of a generic quantizer q(·), there are twoG neanteurraalll ways etrom messti omfa ate g ethneer dicis qtaunanceti z beetrw qe(·e)n, ttwheor vectors, v and u [7]. Using the Symmetric quantizer distance (SQD) ?v−u? is approximated by ?q(v)−q(u) ? . Using the Asymmetric quantizer doixsimtanacteed (A byQ ?Dq()v, only one o Uf sthineg tw thoe vectors is quantized, and ?v−u? is estimated as ?v−q(u) ? . vWechtiloer sS iQs qDu might b, aen slightly f?as itse ers t iom compute, vA−QqD(u i)?n-. curs lower quantization errors, and hence is more accurate. For ANN search, in a pre-processing stage, each database entry, u, is encoded into a binary vector corresponding to the cluster center index to which u is assigned. At test time, the queries may or may not be encoded into indices, depending on whether one uses SQD or AQD. In the ok-means model, the quantization of an input x is straightforwardly shown to be qok(x) = μ + RD sgn(RT(x − μ)) . (13) The corresponding m-bit cluster index is sgn(RT(x − μ)). Given two indices, namely b?1 , b?2 ∈ {−1, +1}m,( txhe − symmetric ok-means quantizer distance∈ ∈is { SQDok(b?1, b?2) = ?μ + RDb?1 − μ − RDb?2 ?22 = ?D(b?1 − b?2)?22 .(14) In effect, SQDok is a weighted Hamming distance. It is the sum of the squared diagonal entries of D corresponding to bits where b?1 and b?2 differ. Interestingly, in our experiments with ok-means, Hamming and weighted Hamming distances yield similar results. Thus, in ok-means experiments we simply report results for Hamming distance, to facilitate comparisons with other techniques. When the scaling in ok-means is constrained to be isotropic (i.e., D = αIm for α ∈ R+), then SQDok becomes a constant multiple off othre α αu ∈sua Rl Hamming distance. As discussed in Sec. 5, this isotropic ok-means is closely related to ITQ [5]. The ok-means AQD between a feature vector x and a cluster index b?, is defined as AQDok(x, b?) = ?x −μ − RDb??22 = ?RTx? − Db??22 + ?R⊥Tx??22 , (15) where x? = x−μ. For ANN search, in comparing distances from query x t−o a .d Faotars AetN oNf binary i,n idni ccoesm, ptharei snegc odnisdta tnecrems on the RHS of (15) is irrelevant, since it does not depend on b?. Without this term, AQDok becomes a form of asymmetric Hamming (AH) distance between RTx? and b?. While previous work [6] discussed various ad hoc AH distance measures for binary hashing, interestingly, the optimal AH distance for ok-means is derived directly in our model. 333000111977 1M SIFT, 64−bit encoding (k = 264) lRc@aeR0 . 206148 10 RIoPT1Qk K−Q m( AHeDa)Hn )s (AH)10K Figure 2. Euclidean ANN retrieval results for different methods and distance functions on the 1M SIFT dataset. 3.3. Experiments with ok-means Following [7], we report ANN search results on 1M SIFT, a corpus of 128D SIFT descriptors with disjoint sets of 100K training, 1M base, and 10K test vectors. The training set is used to train models. The base set is the database, and the test points are queries. The number of bits m is typically less than p, but no pre-processing is needed for dimensionality reduction. Rather, to initialize learning, R is a random rotation of the first m principal directions of the training data, and μ is the mean of the data. For each query we find R nearest neighbors, and compute Recall@R, the fraction of queries for which the ground-truth Euclidean NN is found in the R retrieved items. Fig. 2 shows the Recall@R plots for ok-means with Hamming (H) ≈ SQDok and asymmetric Hamming (AH) H≡a mAmQiDngok ( Hd)is t≈an SceQ, vs. ITQ [5] and PQ [7]. The PQ ≡met AhoQdD exploits a more complex asymmetric distance func- tion denoted AD ≡ AQDck (defined in Sec. 6. 1). Note first tthioant od ke-nmoeteadns A improves upon ITQ, with both Hamming and asymmetric Hamming distances. This is due to the scaling parameters (i.e., D) in ok-means. If one is interested in Hamming distance efficient retrieval, ok-means is prefered over ITQ. But better results are obtained with the asymmetric distance function. Fig. 2 also shows that PQ achieves superior recall rates. This stems from its joint encoding of multiple feature dimensions. In ok-means, each bit represents a partition ofthe feature space into two clusters, separated by a hyperplane. The intersection of m orthogonal hyperplanes yields 2m regions. Hence we obtain just two clusters per dimension, and each dimension is encoded independently. In PQ, by comparison, multiple dimensions are encoded jointly, with arbitrary numbers of clusters. PQ thereby captures statistical dependencies among different dimensions. We next extend ok-means to jointly encode multiple dimensions. 4. Cartesian k-means In the Cartesian k-means (ck-means) model, each region center is expressed parametrically as an additive combination of multiple subcenters. Let there be m sets of subcen- Fidg2ure31.Ddep4icton5fCartde?1si2nqdu?4ati5z?3aton 4qDck(da)t= ,?wd i?5134t?h the first (last) two dimensions sub-quantized on the left (right). Cartesian k-means quantizer denoted qck, combines the subquantizations in subspaces, and produces a 4D reconstruction. ters, each with h elements.3 Let C(i) be a matrix whose columns comprise the elements of the ith subcenter set; C(i) ∈ Rp×h. Finally, assume that each cluster center, c, is the∈ sum of exactly one element from each subcenter set: = ?C(i)b(i) m c i?= ?1 , (16) where b(i) ∈ H1/h is a 1-of-h encoding. As a conc∈re Hte example (see Fig. 3), suppose we are given 4D inputs, x ∈ R4, and we split each datum into m = 2 parts: z(1) = ?I2 0? x , and Then, suppose w?e z(2) = ?0 I2? x .(17) qu?antize each part, z(?1) and? z(2) , sepa- × rately. As depicted in Fig. 3 (left and middle), we could use h = 5 subcenters for each one. Placing the corresponding subcenters in the columns of 4 5 matrices C(1) and C(2) , C(1)=?d1d20d2×35d4d5?, C(2)=?d?1d?20d2×?35d?4d?5?, we obtain a model (16) that provides 52 possible centers with which to quantize the data. More generally, the total number of model centers is k = hm. Each center is a member of the Cartesian product of the subcenter sets, hence the name Cartesian k-means. Importantly, while the number of centers is hm, the number of subcenters is only mh. The model provides a super-linear number of centers with a linear number of parameters. The learning objective for Cartesian k-means is ?ck-means(C) =x?∈D{b(mi)}inim=1???x −i?=m1C(i)b(i)??22 where b(i) ∈ H1/h, and C ≡ [C(1), ··· , (18) C(m)] ∈ Rp×mh. [b(1)T, ··· ,b(m)T] If we let bT ≡ then the second sum in (18) can be expressed succinctly as Cb. 3While here we assume a fixed cardinality for all subcenter sets, the model is easily extended to allow sets with different cardinalities. 333000112088 The key problem with this formulation is that the min- imization of (18) with respect to the b(i) ’s is intractable. Nevertheless, motivated by orthogonal k-means (Sec. 3), encoding can be shown to be both efficient and exact if we impose orthogonality constraints on the sets of subcenters. To that end, assume that all subcenters in different sets are pairwise orthogonal; i.e., ∀i,j | i = j C(i)TC(j) = 0h×h .(19) Each subcenter matrix C(i) spans a linear subspace of Rp, and the linear subspaces for different subcenter sets do not intersect. Hence, (19) implies that only the subcenters in C(i) can explain the projection of x onto the C(i) subspace. In the example depicted in Fig. 3, the input features are simply partitioned (17), and the subspaces clearly satisfy the orthogonality constraints. It is also clear that C ≡ [ C(1) C(2)] is block diagonal, Iwtit ihs 2 a ×lso o5 c bleloacrks t,h adte Cnote ≡d D(1) and D(]2 i)s . bTlohec quantization error t×he5re bfolorcek bse,c doemnoeste ?x − Cb?22 = ???zz((12))?−?D0(1) D0(2)? ?b ( 21) ? ?2 = ?????z(1)−D(1)b(1)??2+???z(2)−D(2??)b(2)??2. In words, the squa??zred quantization?? erro??r zis the sum of t??he squared errors on the subspaces. One can therefore solve for the binary coefficients of the subcenter sets independently. In the general case, assuming (19) is satisfied, C can be expressed as a product R D, where R has orthonormal columns, and D is block diagonal; i.e., C = R D where Ra=nd[hRe(n1c),e·C(,i)R=(mR)]i,Dan(di).DW=i⎢t⎡⎣⎢hDs0i.(1≡)Dra0(n2)k. C.(Di)0(.m,i)t⎦⎥ ⎤fo,l(2w0)s that D(i) ∈ Rsi×h and R(i) ∈ Rp×≡sira. Clearly, ? si ≤ p, because of∈ ∈th Re orthogonality ∈con Rstraints. Replacing C(i) with R(i)D(i) in the RHS of (18?), we find ?x−Cb?22 = ?m?z(i)−D(i)b(i)?22+?R⊥Tx?22, (21) i?= ?1 where z(i)≡R(i)Tx, and R⊥is the orthogonal complement of R. This≡ ≡shRows that, with orthogonality constraints (19), the ck-means encoding problem can be split into m independent sub-encoding problems, one for each subcenter set. To find the b(i) that minimizes ??z(i) we perform NN search with z(i) again??st h si-dim vec??tors in D(i) . This entails a cost of O(hsi).? Fortunately, all? the elements of b can be found very efficiently, in O(hs), where s ≡ ? si. If we also include the cost of rotating x to −D(i)b(i)?22, Taocbkml-emt1he.aoAnds um#ceh2nkmrtyeofskm-#lobgeiatknsh,cOo-rm( pOec(2oamkn+spt),hpan)dkO- m(pOecosa(n+mst(hipns)khtesro)m of number of centers, number of bits needed for indices (i.e., log #centers), and the storage cost of representation, which is the same as the encoding cost to convert inputs to indices. The last column shows the costs given an assumption that C has a rank of s ≥ m. obtain each z(i) , the total encoding cost is O(ps + hs), i.e., O(p2+hp). Alternatively, one could perform NN search in p-dim C(i) ’s, to find the b(i) ’s, which costs O(mhp). Table 1 summerizes the models in terms of their number of centers, index size, and cost of storage and encoding. 4.1. Learning ck-means We can re-write the ck-means objective (18) in matrix form with the Frobenius norm; i.e., ?ck-means(R, D, B) = ? X − RDB ?f2 (22) where the columns of X and B comprise the data points and the subcenter assignment coefficients. The input to the learning algorithm is the training data X, the number of subcenter sets m, the cardinality of the subcenter sets h, and an upper bound on the rank of C, i.e., s. In practice, we also let each rotation matrix R(i) have the same number of columns, i.e., si = s/m. The outputs are the matrices {R(i) } and {D(i) } that provide a local minima of (22). Learning begins hwaitth p trohev idneit aia lloizcaatlio mni noimf Ra oanf d(2 D2)., followed by iterative coordinate descent in B, D, and R: • Optimize B and D: With R fixed, the objective is given by (21) where ?R⊥TX?f2 R(i)TX, is constant. Given data pro- jections Z(i) ≡ to optimize for B and D we perform one step oRf k-means for each subcenter set: – Assignment: Perform NN searches for each subcenter set to find the assignment coefficients, B(i) , B(i)← arBg(mi)in?Z(i)− D(i)B(i)?f2 – • Update: D(i)← arDg(mi)in?Z(i)− D(i)B(i)?f2 Optimize R: Placing the D(i) ’s along the diagonal of D, as in (20), and concatenating B(i) ’s as rows of B, [B(1)T, ...,B(m)T], i.e., BT = the optimization of R reduces to the orthogonal Procrustes problem: R ← argRmin?X − RDB?f2. In experiments below, R ∈ Rp×p, and rank(C) ≤ p is unIcnon esxtpraeirniemde. tFso rb high-dimensional adnadta r awnhke(rCe ) ra ≤nk( pX is) ?np, fnostrr efficiency ri th may dbime eusnesfiuoln atol dcoantast wrahiner er anr akn(kC(X). 333000112199 5. Relations and related work As mentioned above, there are close mathematical relationships between ok-means, ck-means, ITQ for binary hashing [5], and PQ for vector quantization [7]. It is instructive to specify these relationships in some detail. Iterative Quantization vs. Orthogonal k-means. ITQ [5] is a variant of locality-sensitive hashing, mapping data to binary codes for fast retrieval. To extract m-bit codes, ITQ first zero-centers the data matrix to obtain X?. PCA is then used for dimensionality reduction, fromp down to m dimensions, after which the subspace representation is randomly rotated. The composition of PCA and the random rotation can be expressed as WTX? where W ∈ Rp×m. ITQ then solves for the m×m rotation mawthriex,r eR W, t ∈hatR minimizes ?ITQ(B?, R) = ?RTWTX?− B??f2 , s.t. RTR = Im×m, (23) where B? ∈ {−1, +1}n×p. ITQ ro∈tate {s− t1h,e+ subspace representation of the data to match the binary codes, and then minimizes quantization error within the subspace. By contrast, ok-means maps the binary codes into the original input space, and then considers both the quantization error within the subspace and the out-of-subspace projection error. A key difference is the inclusion of ?R⊥X? ?f2 in the ok-means objective (9). This is important s ?inRce one can often reduce quantization errors by projecting out significant portions of the feature space. Another key difference between ITQ and ok-means is the inclusion of non-uniform scaling in ok-means. This is important when the data are not isotropic, and contributes to the marked improvement in recall rates in Fig. 2. Orthogonal k-means vs. Cartesian k-means. We next show that ok-means is a special case of ck-means with h = 2, where each subcenter set has two elements. To this end, let C(i) = and let b(i) = be the ith subcenter matrix selection vector. Since b(i) is a 1-of-2 encoding (?10? or ?01?), it follows that: [c(1i) c2(i)], b1(i)c(1i)+ b(2i)c2(i) = [b(1i) b2(i)]T c(1i)+2 c2(i)+ bi?c1(i)−2 c2(i), (24) b1(i) − b2(i) where bi? ≡ ∈ {−1, +1}. With the following setting of t≡he bok-m−e abns parameters, μ=?i=m1c1(i)+2c(2i),andC=?c1(1)−2c2(1),...,c1(m)−2c2(m)?, it should be clear that ?i C(i)b(i) = μ + Cb?, where b? ∈ {−1, +1}m, and b??i is the ith bit of b?, used in (24). Similarly, one can also map ok-means parameters onto corresponding subcenters for ck-means. Thus, there is a 1-to-1 mapping between the parameterizations of cluster centers in ok-means and ck-means for h = 2. The benefits of ok-means are its small number of parameters, and its intuitive formulation. The benefit of the ck-means generalization is its joint encoding of multiple dimensions with an arbitrary number of centers, allowing it to capture intrinsic dependence among data dimensions. Product Quantization vs. Cartesian k-means. PQ first splits the input vectors into m disjoint sub-vectors, and then quantizes each sub-vector separately using a subquantizer. Thus PQ is a special case of ck-means where the rotation R is not optimized; rather, R is fixed in both learning and retrieval. This is important because the sub-quantizers act independently, thereby ignoring intrasubspace statistical dependence. Thus the selection of subspaces is critical for PQ [7, 8]. J e´gou et al. [8] suggest using PCA, followed by random rotation, before applying PQ to high-dimensional vectors. Finding the rotation to minimize quantization error is mentioned in [8], but it was considered too difficult to estimate. Here we show that one can find a rotation to minimize the quantization error. The ck-means learning algorithm optimizes sub-quantizers in the inner loop, but they interact in an outer loop that optimizes the rotation (Sec. 4.1). 6. Experiments 6.1. Euclidean ANN search Euclidean ANN is a useful task for comparing quantization techniques. Given a database of ck-means indices, and a query, we use Asymmetric and Symmetric ck-means quantizer distance, denoted AQDck and SQDck. The AQDck between a query x and a binary index b ≡ ?b(1)T, ...,b(m)T?T, derived in (21), is AQDck(x,b) ?m?z(i)−D(i)b(i)?22+ ?R⊥Tx?22.(25) = i?= ?1 ?z(i)−D(i)b(i)?22is the distance between the ithpro- Here, jection?? of x, i.e., z(i) , ?a?nd a subcenter projection from D(i) selecte?d by b(i) . Give?n a query, these distances for each i, and all h possible values of b(i) can be pre-computed and stored in a query-specific h×m lookup table. Once created, tshtoer lookup qtaubelery i-ss pusecedif itoc compare akullp pda tatbablea.se O points etaot tehde, query. So computing AQDck entails m lookups and additions, somewhat more costly than Hamming distance. The last term on the RHS of (25) is irrelevant for NN search. Since PQ is a special case of ck-means with pre-defined subspaces, the same distances are used for PQ (cf. [7]). The SQDck between binary codes b1 and b2 is given by SQDck(b1,b2) = ?m?D(i)b1(i)− D(i)b2(i)?22.(26) i?= ?1 Since b1(i) and b(2i) are 1-of-?h encodings, an m×h?×h lookup table can be createadre t 1o- sft-ohre e nacllo pairwise msu×bh×-dhis ltaonockeusp. 333000222200 1M SIFT, 64−bit encoding (k = 264) 1M GIST, 64−bit encoding (k = 264) 1B SIFT, 64−bit encoding (k = 264) Re@acl0 .4186021RcIPoTQk0−Q m( SAeDaH)n s (SADH)1K0 .1642081 0RIocP1TkQ −K m (SAeDHa)n s (ASHD1)0K .186420 1 R0IoPc1TkQ −KQm( ASe DaHn) s (SADH1)0K Figure 4. Euclidean NN recall@R (number of items retrieved) based on different quantizers and corresponding distance functions on the 1M SIFT, 1M GIST, and 1B SIFT datasets. The dashed curves use symmetric distance. (AH ≡ AQDok, SD ≡ SQDck, AD ≡ AQDck) While the cost of computing SQDck is the same as AQDck, SQDck could also be used to estimate the distance between the indexed database entries, for diversifying the retrieval results, or to detect near duplicates, for example. Datasets. We use the 1M SIFT dataset, as in Sec. 3.3, along with two others, namely, 1M GIST (960D features) and 1B SIFT, both comprising disjoint sets of training, base and test vectors. 1M GIST has 500K training, 1M base, and 1K query vectors. 1B SIFT has 100M training, 1B base, and 10K query points. In each case, recall rates are averaged over queries in test set for a database populated from the base set. For expedience, we use only the first 1M training points for the 1B SIFT experiments. Parameters. In ANN experiments below, for both ckmeans and PQ, we use m = 8 and h = 256. Hence the number of clusters is k = 2568 = 264, so 64-bits are used as database indices. Using h = 256 is particularly attractive because the resulting lookup tables are small, encoding is fast, and each subcenter index fits into one byte. As h increases we expect retrieval results to improve, but encoding and indexing of a query to become slower. Initialization. To initialize the Di’s for learning, as in kmeans, we simply begin with random samples from the set of Zi’s (see Sec. 4.1). To initialize R we consider the different methods that J e´gou et al. [7] proposed for splitting the feature dimensions into m sub-vectors for PQ: (1) natural: sub-vectors comprise consecutive dimensions, (2) structured: dimensions with the same index modulo 8 are grouped, and (3) random: random permutations are used. For PQ in the experiments below, we use the orderings that produced the best results in [7], namely, the structured ordering for 960D GIST, and the natural ordering for 128D SIFT. For learning ck-means, R is initialized to the identity with SIFT corpora. For 1M GIST, where the PQ ordering is significant, we consider all three orderings to initialize R. Results. Fig. 4 shows Recall@R plots for ck-means and PQ [7] with symmetric and asymmetric distances (SD ≡ PSQQD [7c]k wanitdh A syDm m≡e rAicQ aDncdk) a on tmhee t3r cda dtaissteatns.c Tsh (eS Dhor ≡izontal axainsd represents tQheD number of retrieved items, R, on a log-scale. The results consistently favor ck-means. On 1M GIST, 64−bit encoding (k = 264) lae@Rc0 .261084 1R0Pc Qk −1m(K 321e )a (A nD s )(132) (A D )10K Figure 5. PQ and ck-means results using natural (1), structured (2), and random (3) ordering to define the (initial) subspaces. the high-dimensional GIST data, ck-means with AD significantly outperforms other methods; even ck-means with SD performs on par with PQ with AD. On 1M SIFT, the Recall@ 10 numbers for PQ and ck-means, both using AD, × are 59.9% and 63.7%. On 1B SIFT, Recall@ 100 numbers are 56.5% and 64.9%. As expected, with increasing dataset size, the difference between methods is more significant. In 1B SIFT, each feature vector is 128 bytes, hence a total of 119 GB. Using any method in Fig. 4 (including ckmeans) to index the database into 64 bits, this storage cost reduces to only 7.5 GB. This allows one to work with much larger datasets. In the experiments we use linear scan to find the nearest items according to quantizer distances. For NN search using 10K SIFT queries on 1B SIFT this takes about 8 hours for AD and AH and 4 hours for Hamming distance on a 2 4-core computer. Search can be sped up significantly; using a coarse tienri.tia Sl quantization sapnedd an pin svigenritefidfile structure for AD and AH, as suggested by [7, 1], and using the multi-index hashing method of [13] for Hamming distance. In this paper we did not implement these efficiencies as we focus primarily on the quality of quantization. Fig. 5 compares ck-means to PQ when R in ck-means is initialized using the 3 orderings of [7]. It shows that ckmeans is superior in all cases. Simiarly interesting, it also shows that despite the non-convexity ofthe optimization objective, ck-means learning tends to find similarly good encodings under different initial conditions. Finally, Fig. 6 compares the methods under different code dimensionality. 333000222311 1M GIST, encoding with 64, 96, and 128 bits Rl@ace0 .814206 1 R0cP kQ − m1 69e4216a−8 Knb −itsb 1t69248− bit10K Figure 6. PQ and ck-means results using different number of bits for encoding. In all cases asymmetric distance is used. Table2.Rcokg-nmiteoamPnQCesac(ondksue=r(bkaoc416y=0ko26)n40t2h[eC]IFAR7c598-u1.72096r%a tcesyuingdfferent codebook learning algorithms. 6.2. Learning visual codebooks While the ANN seach tasks above require too many clusters for k-means, it is interesing to compare k-means and ck-means on a task with a moderate number of clusters. To this end we consider codebook learning for bag-ofword√s models [3, 10]. We use ck-means with m = 2 and h = √k, and hence k centers. The main advantage of ck- × here is that finding the closest cluster center is done in O(√k) time, much faster than standard NN search with k-means in O(k). Alternatives for k-means, to improve efficiency, include approximate k-means [14], and hierarchical k-means [12]. Here we only compare to exact k-means. CIFAR-10 [9] comprises 50K training and 10K test images (32 32 pixels). Each image is one of 10 classes (airplane, 3b2i×rd,3 car, cat, )d.e Eera,c dog, frog, sh oonrsee o, ship, laansds etrsu (caki)r.We use publicly available code from Coates et al [2], with changes to the codebook learning and cluster assignment modules. Codebooks are built on 6×6 whitened color image patches. .O Cnoed histogram i sb ucirleta otend 6 per image quadrant, aagnde a linear SVM is applied to 4k-dim feature vectors. Recognition accuracy rates on the test set for different models and k are given in Table 2. Despite having fewer parameters, ck-means performs on par or better than kmeans. This is consistent for different initializations of the algorithms. Although k-means has higher fedility than ckmeans, with fewer parameters, ck-means may be less susceptible to overfitting. Table 2, also compares with the approach of [19], where PQ without learning a rotation is used for clustering features. As expected, learning the rotation has a significant impact on recognition rates, outperforming all three initializations of PQ. mean√s 7. Conclusions We present the Cartesian k-means algorithm, a generalization of k-means with a parameterization of the clus- ter centers such that number of centers is super-linear in the number of parameters. The method is also shown to be a generalization of the ITQ algorithm and Product Quantization. In experiments on large-scale retrieval and codebook learning for recognition the results are impressive, outperforming product quantization by a significant margin. An implementation of the method is available at https : / / github . com/norou z i ckmeans . / References [1] A. Babenko and V. Lempitsky. The inverted multi-index. CVPR, 2012. [2] A. Coates, H. Lee, and A. Ng. An analysis of single-layer networks in unsupervised feature learning. AISTATS, 2011. [3] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. ECCV Workshop Statistical Learning in Computer Vision, 2004. [4] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. CVPR, 2013. [5] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. CVPR, 2011. [6] A. Gordo and F. Perronnin. Asymmetric distances for binary embeddings. CVPR, 2011. [7] H. J ´egou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. PAMI, 2011. [8] H. J ´egou, M. Douze, C. Schmid, and P. P ´erez. Aggregating local descriptors into a compact image representation. CVPR, 2010. [9] A. Krizhevsky. Learning multiple layers of features from tiny images. MSc Thesis, Univ. Toronto, 2009. [10] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR, 2006. [11] S. P. Lloyd. Least squares quantization in pcm. IEEE Trans. IT, 28(2): 129–137, 1982. [12] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. CVPR, 2006. [13] M. Norouzi, A. Punjani, and D. J. Fleet. Fast search in hamming space with multi-index hashing. CVPR, 2012. [14] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. CVPR, 2007. [15] P. Sch o¨nemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 3 1, 1966. [16] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. ICCV, 2003. [17] J. Tenenbaum and W. Freeman. Separating style and content with bilinear models. Neural Comp., 12: 1247–1283, 2000. [18] A. Torralba, K. Murphy, and W. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE T. PAMI, 29(5):854–869, 2007. [19] S. Wei, X. Wu, and D. Xu. Partitioned k-means clustering for fast construction of unbiased visual vocabulary. The Era of Interactive Media, 2012. 333000222422
2 0.94007593 319 cvpr-2013-Optimized Product Quantization for Approximate Nearest Neighbor Search
Author: Tiezheng Ge, Kaiming He, Qifa Ke, Jian Sun
Abstract: Product quantization is an effective vector quantization approach to compactly encode high-dimensional vectors for fast approximate nearest neighbor (ANN) search. The essence of product quantization is to decompose the original high-dimensional space into the Cartesian product of a finite number of low-dimensional subspaces that are then quantized separately. Optimal space decomposition is important for the performance of ANN search, but still remains unaddressed. In this paper, we optimize product quantization by minimizing quantization distortions w.r.t. the space decomposition and the quantization codebooks. We present two novel methods for optimization: a nonparametric method that alternatively solves two smaller sub-problems, and a parametric method that is guaranteed to achieve the optimal solution if the input data follows some Gaussian distribution. We show by experiments that our optimized approach substantially improves the accuracy of product quantization for ANN search.
3 0.83336025 236 cvpr-2013-K-Means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes
Author: Kaiming He, Fang Wen, Jian Sun
Abstract: In computer vision there has been increasing interest in learning hashing codes whose Hamming distance approximates the data similarity. The hashing functions play roles in both quantizing the vector space and generating similarity-preserving codes. Most existing hashing methods use hyper-planes (or kernelized hyper-planes) to quantize and encode. In this paper, we present a hashing method adopting the k-means quantization. We propose a novel Affinity-Preserving K-means algorithm which simultaneously performs k-means clustering and learns the binary indices of the quantized cells. The distance between the cells is approximated by the Hamming distance of the cell indices. We further generalize our algorithm to a product space for learning longer codes. Experiments show our method, named as K-means Hashing (KMH), outperforms various state-of-the-art hashing encoding methods.
4 0.81584197 246 cvpr-2013-Learning Binary Codes for High-Dimensional Data Using Bilinear Projections
Author: Yunchao Gong, Sanjiv Kumar, Henry A. Rowley, Svetlana Lazebnik
Abstract: Recent advances in visual recognition indicate that to achieve good retrieval and classification accuracy on largescale datasets like ImageNet, extremely high-dimensional visual descriptors, e.g., Fisher Vectors, are needed. We present a novel method for converting such descriptors to compact similarity-preserving binary codes that exploits their natural matrix structure to reduce their dimensionality using compact bilinear projections instead of a single large projection matrix. This method achieves comparable retrieval and classification accuracy to the original descriptors and to the state-of-the-art Product Quantization approach while having orders ofmagnitudefaster code generation time and smaller memory footprint.
5 0.60430092 63 cvpr-2013-Binary Code Ranking with Weighted Hamming Distance
Author: Lei Zhang, Yongdong Zhang, Jinhu Tang, Ke Lu, Qi Tian
Abstract: Binary hashing has been widely used for efficient similarity search due to its query and storage efficiency. In most existing binary hashing methods, the high-dimensional data are embedded into Hamming space and the distance or similarity of two points are approximated by the Hamming distance between their binary codes. The Hamming distance calculation is efficient, however, in practice, there are often lots of results sharing the same Hamming distance to a query, which makes this distance measure ambiguous and poses a critical issue for similarity search where ranking is important. In this paper, we propose a weighted Hamming distance ranking algorithm (WhRank) to rank the binary codes of hashing methods. By assigning different bit-level weights to different hash bits, the returned binary codes are ranked at a finer-grained binary code level. We give an algorithm to learn the data-adaptive and query-sensitive weight for each hash bit. Evaluations on two large-scale image data sets demonstrate the efficacy of our weighted Hamming distance for binary code ranking.
6 0.5853489 69 cvpr-2013-Boosting Binary Keypoint Descriptors
7 0.58213103 38 cvpr-2013-All About VLAD
8 0.57658672 87 cvpr-2013-Compressed Hashing
9 0.54286849 404 cvpr-2013-Sparse Quantization for Patch Description
10 0.50465071 223 cvpr-2013-Inductive Hashing on Manifolds
11 0.46093646 343 cvpr-2013-Query Adaptive Similarity for Large Scale Object Retrieval
12 0.45366588 268 cvpr-2013-Leveraging Structure from Motion to Learn Discriminative Codebooks for Scalable Landmark Classification
13 0.38436353 403 cvpr-2013-Sparse Output Coding for Large-Scale Visual Recognition
14 0.35165116 151 cvpr-2013-Event Retrieval in Large Video Collections with Circulant Temporal Encoding
15 0.35051283 109 cvpr-2013-Dense Non-rigid Point-Matching Using Random Projections
16 0.34688073 250 cvpr-2013-Learning Cross-Domain Information Transfer for Location Recognition and Clustering
17 0.33602151 448 cvpr-2013-Universality of the Local Marginal Polytope
18 0.3351317 275 cvpr-2013-Lp-Norm IDF for Large Scale Image Search
19 0.33196181 215 cvpr-2013-Improved Image Set Classification via Joint Sparse Approximated Nearest Subspaces
20 0.3291783 456 cvpr-2013-Visual Place Recognition with Repetitive Structures
topicId topicWeight
[(10, 0.1), (16, 0.022), (26, 0.046), (28, 0.013), (33, 0.214), (36, 0.071), (57, 0.213), (67, 0.099), (69, 0.042), (77, 0.014), (87, 0.091)]
simIndex simValue paperId paperTitle
1 0.85073024 141 cvpr-2013-Efficient Computation of Shortest Path-Concavity for 3D Meshes
Author: Henrik Zimmer, Marcel Campen, Leif Kobbelt
Abstract: In the context of shape segmentation and retrieval object-wide distributions of measures are needed to accurately evaluate and compare local regions ofshapes. Lien et al. [16] proposed two point-wise concavity measures in the context of Approximate Convex Decompositions of polygons measuring the distance from a point to the polygon ’s convex hull: an accurate Shortest Path-Concavity (SPC) measure and a Straight Line-Concavity (SLC) approximation of the same. While both are practicable on 2D shapes, the exponential costs of SPC in 3D makes it inhibitively expensive for a generalization to meshes [14]. In this paper we propose an efficient and straight forward approximation of the Shortest Path-Concavity measure to 3D meshes. Our approximation is based on discretizing the space between mesh and convex hull, thereby reducing the continuous Shortest Path search to an efficiently solvable graph problem. Our approach works outof-the-box on complex mesh topologies and requires no complicated handling of genus. Besides presenting a rigorous evaluation of our method on a variety of input meshes, we also define an SPC-based Shape Descriptor and show its superior retrieval and runtime performance compared with the recently presented results on the Convexity Distribution by Lian et al. [12].
same-paper 2 0.82364637 79 cvpr-2013-Cartesian K-Means
Author: Mohammad Norouzi, David J. Fleet
Abstract: A fundamental limitation of quantization techniques like the k-means clustering algorithm is the storage and runtime cost associated with the large numbers of clusters required to keep quantization errors small and model fidelity high. We develop new models with a compositional parameterization of cluster centers, so representational capacity increases super-linearly in the number of parameters. This allows one to effectively quantize data using billions or trillions of centers. We formulate two such models, Orthogonal k-means and Cartesian k-means. They are closely related to one another, to k-means, to methods for binary hash function optimization like ITQ [5], and to Product Quantization for vector quantization [7]. The models are tested on largescale ANN retrieval tasks (1M GIST, 1B SIFT features), and on codebook learning for object recognition (CIFAR-10). 1. Introduction and background Techniques for vector quantization, like the well-known k-means algorithm, are used widely in vision and learning. Common applications include codebook learning for object recognition [16], and approximate nearest neighbor (ANN) search for information retrieval [12, 14]. In general terms, such techniques involve partitioning an input vector space into multiple regions {Si}ik=1, mapping points x in each region uinlttiop region-specific representatives {ci}ik=1, known as cioennte irnst.o A resg siuonch-,s a quantizer, qse(nxt)a,t can b {ec expressed as k q(x) = ?1(x∈Si)ci, (1) i?= ?1 where 1(·) is the usual indicator function. eTrhee 1 quality oef u a quantizer oisr expressed in terms of expected distortion, a common measure of which is squared error ?x − q(x) ?22. In this case, given centers {ci}, the region t ?ox xw −hic qh(x a point nis t assigned gwivitehn m ceinnitemrasl {dcis}to,r tthioen r eisobtained by Euclidean nearest neighbor (NN) search. The k-means algorithm can be used to learn centers from data. To reduce expected distortion, crucial for many applications, one can shrink region volumes by increasing k, the number of regions. In practice, however, increasing k results in prohibitive storage and run-time costs. Even if one resorts to ANN search with approximate k-means [14] or hierarchical k-means [12], it is hard to scale to large k (e.g., k = 264), as storing the centers is untenable. This paper concerns methods for scalable quantization with tractable storage and run-time costs. Inspired by Product Quantization (PQ), a state-of-the-art algorithm for ANN search with high-dimensional data (e.g., [7]), compositionality is one key. By expressing data in terms of recurring, reusable parts, the representational capacity of compositional models grows exponentially in the number ofparameters. Compression techniques like JPEG accomplish this by encoding images as disjoint rectangular patches. PQ divides the feature space into disjoint subspaces that are quantized independently. Other examples include part-based recognition models (e.g., [18]), and tensor-based models for stylecontent separation (e.g., [17]). Here, with a compositional parameterization of region centers, we find a family of models that reduce the enc√oding cost of k centers down from k to between log2 k and √k. A model parameter controls the trade-off between model fidelity and compactness. We formulate two related algorithms, Orthogonal kmeans (ok-means) and Cartesian k-means (ck-means). They are natural extensions of k-means, and are closely related to other hashing and quantization methods. The okmeans algorithm is a generalization of the Iterative Quantization (ITQ) algorithm for finding locality-sensitive binary codes [5]. The ck-means model is an extension of okmeans, and can be viewed as a generalization of PQ.1 We then report empirical results on large-scale ANN search tasks, and on codebook learning for recognition. For ANN search we use datasets of 1M GIST and 1B SIFT features, with both symmetric and asymmetric distance measures on the latent codes. We consider codebook learning for a generic form of recognition on CIFAR-10. 2. k-means Given a dataset of n p-dim points, D ≡ {xj }jn=1, the kmeans algorithm partitions ithme n points in ≡to { kx c}lusters, and 1A very similar generalization of PQ, was developed concurrently by Ge, He, Ke and Sun, and also appears in CVPR 2013 [4]. 333000111755 represents each cluster by a center point. Let C ∈ Rp×k breep a mseantrtsix e wachhos celu sctoelurm byns a comprise tihnet .k Lceltus Cter ∈ centers, i.e., C = [c1, c2 , · · · , ck] . The k-means objective is to minimize the within,-c··lu·s ,tecr squared distances: ?k-means(C) = = x?∈Dmiin?x − ci?22 x?∈Db∈mHin1/k?x − Cb?22 (2) where H1/k ≡ {b | b ∈ {0, 1}k and ?b? = 1}, i.e., b is a binary vee Hctor comprising a ,1-1o}f-ka encoding. Lloyd’s b k- imse aa bni-s algorithm [11] finds a local minimum of (2) by iterative, alternating optimization with respect to C and the b’s. The k-means model is simple and intuitive, using NN search to assign points to centers. The assignment of points to centers can be represented with a log k-bit index per data point. The cost of storing the centers grows linearly with k. 3. Orthogonal k-means with 2m centers With a compositional model one can represent cluster centers more efficiently. One such approach is to re- construct each input with an additive combination of the columns of C. To this end, instead of the 1-of-k encoding in (2), we let b be a general m-bit vector, b ∈ Hm ≡ {0, 1}m, (a2n)d, wCe e∈ l Rt bp× bme. aA gse nsuerchal, meac-bhi tc vluesctteorr c, ebnt ∈er H His th≡e sum o}f a asundbs Cet o∈f Rthe columns of C. There are 2m possible subsets, and therefore k = 2m centers. While the number of parameters in the quantizer is linear in m, the number of centers increases exponentially. While efficient in representing cluster centers, the approach is problematic, because solving bˆ = abrg∈Hmmin?x − Cb?22,(3) is intractable; i.e., the discrete optimization is not submodular. Obviously, for small 2m one could generate all possible centers and then perform NN search to find the optimal solution, but this would not scale well to large values of m. One key observation is that if the columns of C are orthogonal, then optimization (3) becomes tractable. To explain this, without loss of generality, assume the bits belong to {−1, 1} instead of {0, 1}, i.e., b? ∈ Hm? ≡ {−1, 1}m. tToh e{−n, ?x − Cb??22 = xTx + b?TCTCb? − 2xTCb? .(4) For diagonal CTC, the middle quadratic term on the RHS becomes trace(CTC), independent of b?. As a consequence, when C has orthogonal columns, abr?g∈mH?min?x − Cb??22 = sgn(CTx) ,(5) where sgn(·) is the element-wise sign function. Cluster centers are depicted by dots, and cluster boundaries by dashed lines. (Left) Clusters formed by a 2-cube with no rotation, scaling or translation; centers = {b? |b? ∈ H2? }. (Right) Rotation, scaling and translation are used to reduce distances between data points and cluster centers; centers = {μ + RDb? | b? ∈ H2? }. To reduce quantization error further we can also introduce an offset, denoted μ, to translate x. Taken together with (5), this leads to the loss function for the model we call orthogonal k-means (ok-means):2 ?ok-means(C,μ) = x?∈Db?m∈iHn?m?x − μ − Cb??22. (6) Clearly, with a change of variables, b? = 2b −1, we can define new versions of μ and C, with ide=ntic 2abl −lo1s,s, w wfoer c wanh dicehthe unknowns are binary, but in {0, 1}m. T uhnek nook-wmnesa anrse quantizer uetn icnod {e0,s 1ea}ch data point as a vertex of a transformed m-dimensional unit hypercube. The transform, via C and μ, maps the hypercube vertices onto the input feature space, ideally as close as possible to the data points. The matrix C has orthogonal columns and can therefore be expressed in terms of rotation and scaling; i.e., C ≡ RD, where R ∈ Rp×m has orthonormal cinoglu;m i.ne.s, ( CRT ≡R = R DIm,) w, ahnerde eD R Ris ∈ diagonal and positive definite. The goal of learning is to find the parameters, R, D, and μ, which minimize quantization error. Fig. 1 depicts a small set of 2D data points (red x’s) and two possible quantizations. Fig. 1 (left) depicts the vertices of a 2-cube with C = I2 and zero translation. The cluster regions are simply the four quadrants of the 2D space. The distances between data points and cluster centers, i.e., the quantization errors, are relatively large. By comparison, Fig. 1 (right) shows how a transformed 2-cube, the full model, can greatly reduce quantization errors. 3.1. Learning ok-means To derive the learning algorithm for ok-means we first rewrite the objective in matrix terms. Given n data points, let X = [x1, x2, · · · , xn] ∈ Rp×n. Let B? ∈ {−1, 1}m×n denote the corresponding ∈clu Rster assignment {co−ef1f,ic1i}ents. Our 2While closely related to ITQ, we use the the relationship to k-means. term ok-means to emphasize 333000111 686 goal is to find the assignment coefficients B? and the transformation parameters, namely, the rotation R, scaling D, and translation μ, that minimize ?ok-means(B?, R, D,μ) = ?X − μ1T − RDB??f2 (7) = ?X? − RDB??f2 (8) where ?·?f denotes the Frobenius norm, X? ≡ X − μ1T, R iws hceornes ?tr·a?ined to have orthonormal columns≡ ≡(R XTR − μ=1 Im), and D is a positive diagonal matrix. Like k-means, coordinate descent is effective for optimizing (8). We first initialize μ and R, and then iteratively optimize ?ok-means with respect to B?, D, R, and μ: • Optimize B? and D: With straightforward algebraic manipulation, one can show that ?ok-means = ?RTX?−DB??f2 + ?R⊥TX??f2 , (9) where columns of R⊥ span the orthogonal complement of the column-space of R (i.e., the block matrix [R R⊥] is orthogonal). It follows that, given X? and R, we can optimize the first term in (9) to solve for B? and D. Here, DB? is the leastsquares approximation to RTX?, where RTX? and DB? × are m n. Further, the ith row of DB? can only contain earleem men×tsn .fr Fomur t{h−erd,i t,h +e dii} where di = Dii. Wherever tehleem corresponding delement} }o fw RheTrXe? d is positive (negative) we should put a positive (negative) value in DB?. The optimal di is determined by the mean absolute value of the elements in the ith row of RTX?: • • B? ← sgn ?RTX?? (10) D ← Diag?(mroewan?(abs(RTX?))) (11) Optimize R: Using the original objective (8), find R that minimizes ?X? −RA?f2 , subject to RTR = Im, and Am n≡i mDizBes?. ?TXhis i−s equivalent to an Orthogonal Procrustes problem [15], and can be solved exactly using SVD. In particular, by adding p − m rows of zeros to the bottom poaf rDtic, uDlaBr, bb eyc aodmdeins p p× − n. mT rhoewn sR o ifs z square a tnhde orthogoonf aDl ,a DndB can boem eessti pm ×at end. Twhiethn RSV isD s. qBuuatr es ainncde oDrtBho gisdegenerate we are only interested in the first m columns of R; the remaining columns are unique only up to rotation of the null-space.) Optimize μ: Given R, B? and D, the optimal μ is given by the column average of X −RDB?: μ ← mcoeluamn(X−RDB?) (12) 3.2. Approximate nearest neighbor search One application of scalable quantization is ANN search. Before introducing more advanced quantization techniques, we describe some experimental results with ok-means on Euclidean ANN search. The benefits of ok-means for ANN search are two-fold. Storage costs for the centers is reduced to O(log k), from O(k) with k-means. Second, substantial speedups are possible by exploiting fast methods for NN search on binary codes in Hamming space (e.g., [13]). Generally, in terms of a generic quantizer q(·), there are twoG neanteurraalll ways etrom messti omfa ate g ethneer dicis qtaunanceti z beetrw qe(·e)n, ttwheor vectors, v and u [7]. Using the Symmetric quantizer distance (SQD) ?v−u? is approximated by ?q(v)−q(u) ? . Using the Asymmetric quantizer doixsimtanacteed (A byQ ?Dq()v, only one o Uf sthineg tw thoe vectors is quantized, and ?v−u? is estimated as ?v−q(u) ? . vWechtiloer sS iQs qDu might b, aen slightly f?as itse ers t iom compute, vA−QqD(u i)?n-. curs lower quantization errors, and hence is more accurate. For ANN search, in a pre-processing stage, each database entry, u, is encoded into a binary vector corresponding to the cluster center index to which u is assigned. At test time, the queries may or may not be encoded into indices, depending on whether one uses SQD or AQD. In the ok-means model, the quantization of an input x is straightforwardly shown to be qok(x) = μ + RD sgn(RT(x − μ)) . (13) The corresponding m-bit cluster index is sgn(RT(x − μ)). Given two indices, namely b?1 , b?2 ∈ {−1, +1}m,( txhe − symmetric ok-means quantizer distance∈ ∈is { SQDok(b?1, b?2) = ?μ + RDb?1 − μ − RDb?2 ?22 = ?D(b?1 − b?2)?22 .(14) In effect, SQDok is a weighted Hamming distance. It is the sum of the squared diagonal entries of D corresponding to bits where b?1 and b?2 differ. Interestingly, in our experiments with ok-means, Hamming and weighted Hamming distances yield similar results. Thus, in ok-means experiments we simply report results for Hamming distance, to facilitate comparisons with other techniques. When the scaling in ok-means is constrained to be isotropic (i.e., D = αIm for α ∈ R+), then SQDok becomes a constant multiple off othre α αu ∈sua Rl Hamming distance. As discussed in Sec. 5, this isotropic ok-means is closely related to ITQ [5]. The ok-means AQD between a feature vector x and a cluster index b?, is defined as AQDok(x, b?) = ?x −μ − RDb??22 = ?RTx? − Db??22 + ?R⊥Tx??22 , (15) where x? = x−μ. For ANN search, in comparing distances from query x t−o a .d Faotars AetN oNf binary i,n idni ccoesm, ptharei snegc odnisdta tnecrems on the RHS of (15) is irrelevant, since it does not depend on b?. Without this term, AQDok becomes a form of asymmetric Hamming (AH) distance between RTx? and b?. While previous work [6] discussed various ad hoc AH distance measures for binary hashing, interestingly, the optimal AH distance for ok-means is derived directly in our model. 333000111977 1M SIFT, 64−bit encoding (k = 264) lRc@aeR0 . 206148 10 RIoPT1Qk K−Q m( AHeDa)Hn )s (AH)10K Figure 2. Euclidean ANN retrieval results for different methods and distance functions on the 1M SIFT dataset. 3.3. Experiments with ok-means Following [7], we report ANN search results on 1M SIFT, a corpus of 128D SIFT descriptors with disjoint sets of 100K training, 1M base, and 10K test vectors. The training set is used to train models. The base set is the database, and the test points are queries. The number of bits m is typically less than p, but no pre-processing is needed for dimensionality reduction. Rather, to initialize learning, R is a random rotation of the first m principal directions of the training data, and μ is the mean of the data. For each query we find R nearest neighbors, and compute Recall@R, the fraction of queries for which the ground-truth Euclidean NN is found in the R retrieved items. Fig. 2 shows the Recall@R plots for ok-means with Hamming (H) ≈ SQDok and asymmetric Hamming (AH) H≡a mAmQiDngok ( Hd)is t≈an SceQ, vs. ITQ [5] and PQ [7]. The PQ ≡met AhoQdD exploits a more complex asymmetric distance func- tion denoted AD ≡ AQDck (defined in Sec. 6. 1). Note first tthioant od ke-nmoeteadns A improves upon ITQ, with both Hamming and asymmetric Hamming distances. This is due to the scaling parameters (i.e., D) in ok-means. If one is interested in Hamming distance efficient retrieval, ok-means is prefered over ITQ. But better results are obtained with the asymmetric distance function. Fig. 2 also shows that PQ achieves superior recall rates. This stems from its joint encoding of multiple feature dimensions. In ok-means, each bit represents a partition ofthe feature space into two clusters, separated by a hyperplane. The intersection of m orthogonal hyperplanes yields 2m regions. Hence we obtain just two clusters per dimension, and each dimension is encoded independently. In PQ, by comparison, multiple dimensions are encoded jointly, with arbitrary numbers of clusters. PQ thereby captures statistical dependencies among different dimensions. We next extend ok-means to jointly encode multiple dimensions. 4. Cartesian k-means In the Cartesian k-means (ck-means) model, each region center is expressed parametrically as an additive combination of multiple subcenters. Let there be m sets of subcen- Fidg2ure31.Ddep4icton5fCartde?1si2nqdu?4ati5z?3aton 4qDck(da)t= ,?wd i?5134t?h the first (last) two dimensions sub-quantized on the left (right). Cartesian k-means quantizer denoted qck, combines the subquantizations in subspaces, and produces a 4D reconstruction. ters, each with h elements.3 Let C(i) be a matrix whose columns comprise the elements of the ith subcenter set; C(i) ∈ Rp×h. Finally, assume that each cluster center, c, is the∈ sum of exactly one element from each subcenter set: = ?C(i)b(i) m c i?= ?1 , (16) where b(i) ∈ H1/h is a 1-of-h encoding. As a conc∈re Hte example (see Fig. 3), suppose we are given 4D inputs, x ∈ R4, and we split each datum into m = 2 parts: z(1) = ?I2 0? x , and Then, suppose w?e z(2) = ?0 I2? x .(17) qu?antize each part, z(?1) and? z(2) , sepa- × rately. As depicted in Fig. 3 (left and middle), we could use h = 5 subcenters for each one. Placing the corresponding subcenters in the columns of 4 5 matrices C(1) and C(2) , C(1)=?d1d20d2×35d4d5?, C(2)=?d?1d?20d2×?35d?4d?5?, we obtain a model (16) that provides 52 possible centers with which to quantize the data. More generally, the total number of model centers is k = hm. Each center is a member of the Cartesian product of the subcenter sets, hence the name Cartesian k-means. Importantly, while the number of centers is hm, the number of subcenters is only mh. The model provides a super-linear number of centers with a linear number of parameters. The learning objective for Cartesian k-means is ?ck-means(C) =x?∈D{b(mi)}inim=1???x −i?=m1C(i)b(i)??22 where b(i) ∈ H1/h, and C ≡ [C(1), ··· , (18) C(m)] ∈ Rp×mh. [b(1)T, ··· ,b(m)T] If we let bT ≡ then the second sum in (18) can be expressed succinctly as Cb. 3While here we assume a fixed cardinality for all subcenter sets, the model is easily extended to allow sets with different cardinalities. 333000112088 The key problem with this formulation is that the min- imization of (18) with respect to the b(i) ’s is intractable. Nevertheless, motivated by orthogonal k-means (Sec. 3), encoding can be shown to be both efficient and exact if we impose orthogonality constraints on the sets of subcenters. To that end, assume that all subcenters in different sets are pairwise orthogonal; i.e., ∀i,j | i = j C(i)TC(j) = 0h×h .(19) Each subcenter matrix C(i) spans a linear subspace of Rp, and the linear subspaces for different subcenter sets do not intersect. Hence, (19) implies that only the subcenters in C(i) can explain the projection of x onto the C(i) subspace. In the example depicted in Fig. 3, the input features are simply partitioned (17), and the subspaces clearly satisfy the orthogonality constraints. It is also clear that C ≡ [ C(1) C(2)] is block diagonal, Iwtit ihs 2 a ×lso o5 c bleloacrks t,h adte Cnote ≡d D(1) and D(]2 i)s . bTlohec quantization error t×he5re bfolorcek bse,c doemnoeste ?x − Cb?22 = ???zz((12))?−?D0(1) D0(2)? ?b ( 21) ? ?2 = ?????z(1)−D(1)b(1)??2+???z(2)−D(2??)b(2)??2. In words, the squa??zred quantization?? erro??r zis the sum of t??he squared errors on the subspaces. One can therefore solve for the binary coefficients of the subcenter sets independently. In the general case, assuming (19) is satisfied, C can be expressed as a product R D, where R has orthonormal columns, and D is block diagonal; i.e., C = R D where Ra=nd[hRe(n1c),e·C(,i)R=(mR)]i,Dan(di).DW=i⎢t⎡⎣⎢hDs0i.(1≡)Dra0(n2)k. C.(Di)0(.m,i)t⎦⎥ ⎤fo,l(2w0)s that D(i) ∈ Rsi×h and R(i) ∈ Rp×≡sira. Clearly, ? si ≤ p, because of∈ ∈th Re orthogonality ∈con Rstraints. Replacing C(i) with R(i)D(i) in the RHS of (18?), we find ?x−Cb?22 = ?m?z(i)−D(i)b(i)?22+?R⊥Tx?22, (21) i?= ?1 where z(i)≡R(i)Tx, and R⊥is the orthogonal complement of R. This≡ ≡shRows that, with orthogonality constraints (19), the ck-means encoding problem can be split into m independent sub-encoding problems, one for each subcenter set. To find the b(i) that minimizes ??z(i) we perform NN search with z(i) again??st h si-dim vec??tors in D(i) . This entails a cost of O(hsi).? Fortunately, all? the elements of b can be found very efficiently, in O(hs), where s ≡ ? si. If we also include the cost of rotating x to −D(i)b(i)?22, Taocbkml-emt1he.aoAnds um#ceh2nkmrtyeofskm-#lobgeiatknsh,cOo-rm( pOec(2oamkn+spt),hpan)dkO- m(pOecosa(n+mst(hipns)khtesro)m of number of centers, number of bits needed for indices (i.e., log #centers), and the storage cost of representation, which is the same as the encoding cost to convert inputs to indices. The last column shows the costs given an assumption that C has a rank of s ≥ m. obtain each z(i) , the total encoding cost is O(ps + hs), i.e., O(p2+hp). Alternatively, one could perform NN search in p-dim C(i) ’s, to find the b(i) ’s, which costs O(mhp). Table 1 summerizes the models in terms of their number of centers, index size, and cost of storage and encoding. 4.1. Learning ck-means We can re-write the ck-means objective (18) in matrix form with the Frobenius norm; i.e., ?ck-means(R, D, B) = ? X − RDB ?f2 (22) where the columns of X and B comprise the data points and the subcenter assignment coefficients. The input to the learning algorithm is the training data X, the number of subcenter sets m, the cardinality of the subcenter sets h, and an upper bound on the rank of C, i.e., s. In practice, we also let each rotation matrix R(i) have the same number of columns, i.e., si = s/m. The outputs are the matrices {R(i) } and {D(i) } that provide a local minima of (22). Learning begins hwaitth p trohev idneit aia lloizcaatlio mni noimf Ra oanf d(2 D2)., followed by iterative coordinate descent in B, D, and R: • Optimize B and D: With R fixed, the objective is given by (21) where ?R⊥TX?f2 R(i)TX, is constant. Given data pro- jections Z(i) ≡ to optimize for B and D we perform one step oRf k-means for each subcenter set: – Assignment: Perform NN searches for each subcenter set to find the assignment coefficients, B(i) , B(i)← arBg(mi)in?Z(i)− D(i)B(i)?f2 – • Update: D(i)← arDg(mi)in?Z(i)− D(i)B(i)?f2 Optimize R: Placing the D(i) ’s along the diagonal of D, as in (20), and concatenating B(i) ’s as rows of B, [B(1)T, ...,B(m)T], i.e., BT = the optimization of R reduces to the orthogonal Procrustes problem: R ← argRmin?X − RDB?f2. In experiments below, R ∈ Rp×p, and rank(C) ≤ p is unIcnon esxtpraeirniemde. tFso rb high-dimensional adnadta r awnhke(rCe ) ra ≤nk( pX is) ?np, fnostrr efficiency ri th may dbime eusnesfiuoln atol dcoantast wrahiner er anr akn(kC(X). 333000112199 5. Relations and related work As mentioned above, there are close mathematical relationships between ok-means, ck-means, ITQ for binary hashing [5], and PQ for vector quantization [7]. It is instructive to specify these relationships in some detail. Iterative Quantization vs. Orthogonal k-means. ITQ [5] is a variant of locality-sensitive hashing, mapping data to binary codes for fast retrieval. To extract m-bit codes, ITQ first zero-centers the data matrix to obtain X?. PCA is then used for dimensionality reduction, fromp down to m dimensions, after which the subspace representation is randomly rotated. The composition of PCA and the random rotation can be expressed as WTX? where W ∈ Rp×m. ITQ then solves for the m×m rotation mawthriex,r eR W, t ∈hatR minimizes ?ITQ(B?, R) = ?RTWTX?− B??f2 , s.t. RTR = Im×m, (23) where B? ∈ {−1, +1}n×p. ITQ ro∈tate {s− t1h,e+ subspace representation of the data to match the binary codes, and then minimizes quantization error within the subspace. By contrast, ok-means maps the binary codes into the original input space, and then considers both the quantization error within the subspace and the out-of-subspace projection error. A key difference is the inclusion of ?R⊥X? ?f2 in the ok-means objective (9). This is important s ?inRce one can often reduce quantization errors by projecting out significant portions of the feature space. Another key difference between ITQ and ok-means is the inclusion of non-uniform scaling in ok-means. This is important when the data are not isotropic, and contributes to the marked improvement in recall rates in Fig. 2. Orthogonal k-means vs. Cartesian k-means. We next show that ok-means is a special case of ck-means with h = 2, where each subcenter set has two elements. To this end, let C(i) = and let b(i) = be the ith subcenter matrix selection vector. Since b(i) is a 1-of-2 encoding (?10? or ?01?), it follows that: [c(1i) c2(i)], b1(i)c(1i)+ b(2i)c2(i) = [b(1i) b2(i)]T c(1i)+2 c2(i)+ bi?c1(i)−2 c2(i), (24) b1(i) − b2(i) where bi? ≡ ∈ {−1, +1}. With the following setting of t≡he bok-m−e abns parameters, μ=?i=m1c1(i)+2c(2i),andC=?c1(1)−2c2(1),...,c1(m)−2c2(m)?, it should be clear that ?i C(i)b(i) = μ + Cb?, where b? ∈ {−1, +1}m, and b??i is the ith bit of b?, used in (24). Similarly, one can also map ok-means parameters onto corresponding subcenters for ck-means. Thus, there is a 1-to-1 mapping between the parameterizations of cluster centers in ok-means and ck-means for h = 2. The benefits of ok-means are its small number of parameters, and its intuitive formulation. The benefit of the ck-means generalization is its joint encoding of multiple dimensions with an arbitrary number of centers, allowing it to capture intrinsic dependence among data dimensions. Product Quantization vs. Cartesian k-means. PQ first splits the input vectors into m disjoint sub-vectors, and then quantizes each sub-vector separately using a subquantizer. Thus PQ is a special case of ck-means where the rotation R is not optimized; rather, R is fixed in both learning and retrieval. This is important because the sub-quantizers act independently, thereby ignoring intrasubspace statistical dependence. Thus the selection of subspaces is critical for PQ [7, 8]. J e´gou et al. [8] suggest using PCA, followed by random rotation, before applying PQ to high-dimensional vectors. Finding the rotation to minimize quantization error is mentioned in [8], but it was considered too difficult to estimate. Here we show that one can find a rotation to minimize the quantization error. The ck-means learning algorithm optimizes sub-quantizers in the inner loop, but they interact in an outer loop that optimizes the rotation (Sec. 4.1). 6. Experiments 6.1. Euclidean ANN search Euclidean ANN is a useful task for comparing quantization techniques. Given a database of ck-means indices, and a query, we use Asymmetric and Symmetric ck-means quantizer distance, denoted AQDck and SQDck. The AQDck between a query x and a binary index b ≡ ?b(1)T, ...,b(m)T?T, derived in (21), is AQDck(x,b) ?m?z(i)−D(i)b(i)?22+ ?R⊥Tx?22.(25) = i?= ?1 ?z(i)−D(i)b(i)?22is the distance between the ithpro- Here, jection?? of x, i.e., z(i) , ?a?nd a subcenter projection from D(i) selecte?d by b(i) . Give?n a query, these distances for each i, and all h possible values of b(i) can be pre-computed and stored in a query-specific h×m lookup table. Once created, tshtoer lookup qtaubelery i-ss pusecedif itoc compare akullp pda tatbablea.se O points etaot tehde, query. So computing AQDck entails m lookups and additions, somewhat more costly than Hamming distance. The last term on the RHS of (25) is irrelevant for NN search. Since PQ is a special case of ck-means with pre-defined subspaces, the same distances are used for PQ (cf. [7]). The SQDck between binary codes b1 and b2 is given by SQDck(b1,b2) = ?m?D(i)b1(i)− D(i)b2(i)?22.(26) i?= ?1 Since b1(i) and b(2i) are 1-of-?h encodings, an m×h?×h lookup table can be createadre t 1o- sft-ohre e nacllo pairwise msu×bh×-dhis ltaonockeusp. 333000222200 1M SIFT, 64−bit encoding (k = 264) 1M GIST, 64−bit encoding (k = 264) 1B SIFT, 64−bit encoding (k = 264) Re@acl0 .4186021RcIPoTQk0−Q m( SAeDaH)n s (SADH)1K0 .1642081 0RIocP1TkQ −K m (SAeDHa)n s (ASHD1)0K .186420 1 R0IoPc1TkQ −KQm( ASe DaHn) s (SADH1)0K Figure 4. Euclidean NN recall@R (number of items retrieved) based on different quantizers and corresponding distance functions on the 1M SIFT, 1M GIST, and 1B SIFT datasets. The dashed curves use symmetric distance. (AH ≡ AQDok, SD ≡ SQDck, AD ≡ AQDck) While the cost of computing SQDck is the same as AQDck, SQDck could also be used to estimate the distance between the indexed database entries, for diversifying the retrieval results, or to detect near duplicates, for example. Datasets. We use the 1M SIFT dataset, as in Sec. 3.3, along with two others, namely, 1M GIST (960D features) and 1B SIFT, both comprising disjoint sets of training, base and test vectors. 1M GIST has 500K training, 1M base, and 1K query vectors. 1B SIFT has 100M training, 1B base, and 10K query points. In each case, recall rates are averaged over queries in test set for a database populated from the base set. For expedience, we use only the first 1M training points for the 1B SIFT experiments. Parameters. In ANN experiments below, for both ckmeans and PQ, we use m = 8 and h = 256. Hence the number of clusters is k = 2568 = 264, so 64-bits are used as database indices. Using h = 256 is particularly attractive because the resulting lookup tables are small, encoding is fast, and each subcenter index fits into one byte. As h increases we expect retrieval results to improve, but encoding and indexing of a query to become slower. Initialization. To initialize the Di’s for learning, as in kmeans, we simply begin with random samples from the set of Zi’s (see Sec. 4.1). To initialize R we consider the different methods that J e´gou et al. [7] proposed for splitting the feature dimensions into m sub-vectors for PQ: (1) natural: sub-vectors comprise consecutive dimensions, (2) structured: dimensions with the same index modulo 8 are grouped, and (3) random: random permutations are used. For PQ in the experiments below, we use the orderings that produced the best results in [7], namely, the structured ordering for 960D GIST, and the natural ordering for 128D SIFT. For learning ck-means, R is initialized to the identity with SIFT corpora. For 1M GIST, where the PQ ordering is significant, we consider all three orderings to initialize R. Results. Fig. 4 shows Recall@R plots for ck-means and PQ [7] with symmetric and asymmetric distances (SD ≡ PSQQD [7c]k wanitdh A syDm m≡e rAicQ aDncdk) a on tmhee t3r cda dtaissteatns.c Tsh (eS Dhor ≡izontal axainsd represents tQheD number of retrieved items, R, on a log-scale. The results consistently favor ck-means. On 1M GIST, 64−bit encoding (k = 264) lae@Rc0 .261084 1R0Pc Qk −1m(K 321e )a (A nD s )(132) (A D )10K Figure 5. PQ and ck-means results using natural (1), structured (2), and random (3) ordering to define the (initial) subspaces. the high-dimensional GIST data, ck-means with AD significantly outperforms other methods; even ck-means with SD performs on par with PQ with AD. On 1M SIFT, the Recall@ 10 numbers for PQ and ck-means, both using AD, × are 59.9% and 63.7%. On 1B SIFT, Recall@ 100 numbers are 56.5% and 64.9%. As expected, with increasing dataset size, the difference between methods is more significant. In 1B SIFT, each feature vector is 128 bytes, hence a total of 119 GB. Using any method in Fig. 4 (including ckmeans) to index the database into 64 bits, this storage cost reduces to only 7.5 GB. This allows one to work with much larger datasets. In the experiments we use linear scan to find the nearest items according to quantizer distances. For NN search using 10K SIFT queries on 1B SIFT this takes about 8 hours for AD and AH and 4 hours for Hamming distance on a 2 4-core computer. Search can be sped up significantly; using a coarse tienri.tia Sl quantization sapnedd an pin svigenritefidfile structure for AD and AH, as suggested by [7, 1], and using the multi-index hashing method of [13] for Hamming distance. In this paper we did not implement these efficiencies as we focus primarily on the quality of quantization. Fig. 5 compares ck-means to PQ when R in ck-means is initialized using the 3 orderings of [7]. It shows that ckmeans is superior in all cases. Simiarly interesting, it also shows that despite the non-convexity ofthe optimization objective, ck-means learning tends to find similarly good encodings under different initial conditions. Finally, Fig. 6 compares the methods under different code dimensionality. 333000222311 1M GIST, encoding with 64, 96, and 128 bits Rl@ace0 .814206 1 R0cP kQ − m1 69e4216a−8 Knb −itsb 1t69248− bit10K Figure 6. PQ and ck-means results using different number of bits for encoding. In all cases asymmetric distance is used. Table2.Rcokg-nmiteoamPnQCesac(ondksue=r(bkaoc416y=0ko26)n40t2h[eC]IFAR7c598-u1.72096r%a tcesyuingdfferent codebook learning algorithms. 6.2. Learning visual codebooks While the ANN seach tasks above require too many clusters for k-means, it is interesing to compare k-means and ck-means on a task with a moderate number of clusters. To this end we consider codebook learning for bag-ofword√s models [3, 10]. We use ck-means with m = 2 and h = √k, and hence k centers. The main advantage of ck- × here is that finding the closest cluster center is done in O(√k) time, much faster than standard NN search with k-means in O(k). Alternatives for k-means, to improve efficiency, include approximate k-means [14], and hierarchical k-means [12]. Here we only compare to exact k-means. CIFAR-10 [9] comprises 50K training and 10K test images (32 32 pixels). Each image is one of 10 classes (airplane, 3b2i×rd,3 car, cat, )d.e Eera,c dog, frog, sh oonrsee o, ship, laansds etrsu (caki)r.We use publicly available code from Coates et al [2], with changes to the codebook learning and cluster assignment modules. Codebooks are built on 6×6 whitened color image patches. .O Cnoed histogram i sb ucirleta otend 6 per image quadrant, aagnde a linear SVM is applied to 4k-dim feature vectors. Recognition accuracy rates on the test set for different models and k are given in Table 2. Despite having fewer parameters, ck-means performs on par or better than kmeans. This is consistent for different initializations of the algorithms. Although k-means has higher fedility than ckmeans, with fewer parameters, ck-means may be less susceptible to overfitting. Table 2, also compares with the approach of [19], where PQ without learning a rotation is used for clustering features. As expected, learning the rotation has a significant impact on recognition rates, outperforming all three initializations of PQ. mean√s 7. Conclusions We present the Cartesian k-means algorithm, a generalization of k-means with a parameterization of the clus- ter centers such that number of centers is super-linear in the number of parameters. The method is also shown to be a generalization of the ITQ algorithm and Product Quantization. In experiments on large-scale retrieval and codebook learning for recognition the results are impressive, outperforming product quantization by a significant margin. An implementation of the method is available at https : / / github . com/norou z i ckmeans . / References [1] A. Babenko and V. Lempitsky. The inverted multi-index. CVPR, 2012. [2] A. Coates, H. Lee, and A. Ng. An analysis of single-layer networks in unsupervised feature learning. AISTATS, 2011. [3] G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. ECCV Workshop Statistical Learning in Computer Vision, 2004. [4] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. CVPR, 2013. [5] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. CVPR, 2011. [6] A. Gordo and F. Perronnin. Asymmetric distances for binary embeddings. CVPR, 2011. [7] H. J ´egou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. PAMI, 2011. [8] H. J ´egou, M. Douze, C. Schmid, and P. P ´erez. Aggregating local descriptors into a compact image representation. CVPR, 2010. [9] A. Krizhevsky. Learning multiple layers of features from tiny images. MSc Thesis, Univ. Toronto, 2009. [10] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. CVPR, 2006. [11] S. P. Lloyd. Least squares quantization in pcm. IEEE Trans. IT, 28(2): 129–137, 1982. [12] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. CVPR, 2006. [13] M. Norouzi, A. Punjani, and D. J. Fleet. Fast search in hamming space with multi-index hashing. CVPR, 2012. [14] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. CVPR, 2007. [15] P. Sch o¨nemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 3 1, 1966. [16] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. ICCV, 2003. [17] J. Tenenbaum and W. Freeman. Separating style and content with bilinear models. Neural Comp., 12: 1247–1283, 2000. [18] A. Torralba, K. Murphy, and W. Freeman. Sharing visual features for multiclass and multiview object detection. IEEE T. PAMI, 29(5):854–869, 2007. [19] S. Wei, X. Wu, and D. Xu. Partitioned k-means clustering for fast construction of unbiased visual vocabulary. The Era of Interactive Media, 2012. 333000222422
3 0.79266846 95 cvpr-2013-Continuous Inference in Graphical Models with Polynomial Energies
Author: Mathieu Salzmann
Abstract: In this paper, we tackle the problem of performing inference in graphical models whose energy is a polynomial function of continuous variables. Our energy minimization method follows a dual decomposition approach, where the global problem is split into subproblems defined over the graph cliques. The optimal solution to these subproblems is obtained by making use of a polynomial system solver. Our algorithm inherits the convergence guarantees of dual decomposition. To speed up optimization, we also introduce a variant of this algorithm based on the augmented Lagrangian method. Our experiments illustrate the diversity of computer vision problems that can be expressed with polynomial energies, and demonstrate the benefits of our approach over existing continuous inference methods.
4 0.77115244 421 cvpr-2013-Supervised Kernel Descriptors for Visual Recognition
Author: Peng Wang, Jingdong Wang, Gang Zeng, Weiwei Xu, Hongbin Zha, Shipeng Li
Abstract: In visual recognition tasks, the design of low level image feature representation is fundamental. The advent of local patch features from pixel attributes such as SIFT and LBP, has precipitated dramatic progresses. Recently, a kernel view of these features, called kernel descriptors (KDES) [1], generalizes the feature design in an unsupervised fashion and yields impressive results. In this paper, we present a supervised framework to embed the image level label information into the design of patch level kernel descriptors, which we call supervised kernel descriptors (SKDES). Specifically, we adopt the broadly applied bag-of-words (BOW) image classification pipeline and a large margin criterion to learn the lowlevel patch representation, which makes the patch features much more compact and achieve better discriminative ability than KDES. With this method, we achieve competitive results over several public datasets comparing with stateof-the-art methods.
5 0.76405603 255 cvpr-2013-Learning Separable Filters
Author: Roberto Rigamonti, Amos Sironi, Vincent Lepetit, Pascal Fua
Abstract: Learning filters to produce sparse image representations in terms of overcomplete dictionaries has emerged as a powerful way to create image features for many different purposes. Unfortunately, these filters are usually both numerous and non-separable, making their use computationally expensive. In this paper, we show that such filters can be computed as linear combinations of a smaller number of separable ones, thus greatly reducing the computational complexity at no cost in terms of performance. This makes filter learning approaches practical even for large images or 3D volumes, and we show that we significantly outperform state-of-theart methods on the linear structure extraction task, in terms ofboth accuracy and speed. Moreover, our approach is general and can be used on generic filter banks to reduce the complexity of the convolutions.
6 0.7590301 248 cvpr-2013-Learning Collections of Part Models for Object Recognition
7 0.75448549 236 cvpr-2013-K-Means Hashing: An Affinity-Preserving Quantization Method for Learning Binary Compact Codes
8 0.75428838 339 cvpr-2013-Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation
9 0.75361019 119 cvpr-2013-Detecting and Aligning Faces by Image Retrieval
10 0.75302738 254 cvpr-2013-Learning SURF Cascade for Fast and Accurate Object Detection
11 0.75204045 319 cvpr-2013-Optimized Product Quantization for Approximate Nearest Neighbor Search
12 0.75203466 69 cvpr-2013-Boosting Binary Keypoint Descriptors
13 0.75199568 63 cvpr-2013-Binary Code Ranking with Weighted Hamming Distance
14 0.75195152 2 cvpr-2013-3D Pictorial Structures for Multiple View Articulated Pose Estimation
15 0.7508325 110 cvpr-2013-Dense Object Reconstruction with Semantic Priors
16 0.75079268 365 cvpr-2013-Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
17 0.7489149 362 cvpr-2013-Robust Monocular Epipolar Flow Estimation
18 0.74855775 225 cvpr-2013-Integrating Grammar and Segmentation for Human Pose Estimation
19 0.74771827 179 cvpr-2013-From N to N+1: Multiclass Transfer Incremental Learning
20 0.74761206 322 cvpr-2013-PISA: Pixelwise Image Saliency by Aggregating Complementary Appearance Contrast Measures with Spatial Priors