nips nips2011 nips2011-111 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ping Li, Anshumali Shrivastava, Joshua L. Moore, Arnd C. König
Abstract: Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. [sent-7, score-0.463]
2 The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. [sent-8, score-1.18]
3 In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. [sent-9, score-1.027]
4 We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. [sent-10, score-0.994]
5 Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data. [sent-11, score-1.088]
6 1 Introduction With the advent of the Internet, many machine learning applications are faced with very large and inherently high-dimensional datasets, resulting in challenges in scaling up training algorithms and storing the data. [sent-12, score-0.081]
7 For example, [33] discusses training sets with 10^11 items and 10^9 distinct features, requiring novel algorithmic approaches and architectures. [sent-14, score-0.061]
8 The method of b-bit minwise hashing [26–28] is a recent technique for efficiently (in both time and space) computing resemblances among extremely high-dimensional binary data. [sent-17, score-1.036]
9 In this paper, we show that b-bit minwise hashing can be seamlessly integrated with linear Support Vector Machine (SVM) [13, 18, 20, 31, 35] and logistic regression solvers. [sent-20, score-1.027]
10 When the data can fit in memory, linear SVM training is often extremely efficient after the data are loaded into the memory. [sent-33, score-0.079]
11 It is however often the case that, for very large datasets, the data loading time dominates the computing time for solving the SVM problem [35]. [sent-34, score-0.074]
12 The publicly available webspam dataset (in LIBSVM format) needs about 24GB disk space, which exceeds the memory capacity of many desktop PCs. [sent-37, score-0.136]
13 2 Our Proposal We propose a solution which leverages b-bit minwise hashing. [sent-40, score-0.531]
14 We apply b-bit minwise hashing to obtain a compact representation of the original data. [sent-42, score-1.019]
15 In order to use the technique for efficient learning, we have to address several issues: • We need to prove that the matrices generated by b-bit minwise hashing are positive definite, which will provide the solid foundation for our proposed solution. [sent-43, score-0.994]
16 • If we use b-bit minwise hashing to estimate the resemblance, which is nonlinear, how can we effectively convert this nonlinear problem into a linear problem? [sent-44, score-0.994]
17 • Compared to other hashing techniques such as random projections, Count-Min (CM) sketch [11], or Vowpal Wabbit (VW) [32, 34], does our approach exhibit advantages? [sent-45, score-0.496]
18 It turns out that our proof in the next section that b-bit hashing matrices are positive definite naturally provides the construction for converting the otherwise nonlinear SVM problem into linear SVM. [sent-46, score-0.463]
19 2 Review of Minwise Hashing and b-Bit Minwise Hashing Minwise hashing [6,7] has been successfully applied to a wide range of real-world problems [4,6,7, 9, 10, 12, 15, 16, 30], for efficiently computing set similarities. [sent-47, score-0.463]
20 Minwise hashing mainly works well with binary data, which can be viewed either as 0/1 vectors or as sets. [sent-48, score-0.512]
21 Given two sets S1, S2 ⊆ Ω = {0, 1, ..., D − 1}, a widely used measure of similarity is the resemblance R: R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a), where f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|. [sent-52, score-0.061]
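As a quick illustration of the resemblance defined above, a minimal Python sketch follows (the example sets and the function name are illustrative, not from the paper):

```python
# Resemblance R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 - a)
def resemblance(S1, S2):
    a = len(S1 & S2)           # a  = |S1 ∩ S2|
    f1, f2 = len(S1), len(S2)  # f1 = |S1|, f2 = |S2|
    return a / (f1 + f2 - a)

S1 = {0, 3, 7, 12}             # binary vectors viewed as sets of non-zero coordinates
S2 = {0, 3, 9, 12, 15}
print(resemblance(S1, S2))     # 3 / (4 + 5 - 3) = 0.5
```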
22 The common practice is to store each hashed value, e.g., using 64 bits. [sent-58, score-0.093]
23 The storage (and computational) cost will be prohibitive in truly large-scale (industry) applications [29]. [sent-61, score-0.07]
24 b-bit minwise hashing [27] provides a strikingly simple solution to this (storage and computational) problem by storing only the lowest b bits (instead of 64 bits) of each hashed value. [sent-62, score-1.18]
25 For convenience, denote z1 = min(π(S1)) and z2 = min(π(S2)), and denote z1^(b) (z2^(b)) the integer value corresponding to the lowest b bits of z1 (z2). [sent-63, score-0.073]
26 Define zi = min{π(Si)} and zi^(b) the lowest b bits of zi. [sent-81, score-0.19]
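A minimal sketch of this hashing step follows; it simulates the k random permutations with simple universal-style hash functions, a common practical stand-in, and every name below is illustrative rather than taken from the paper's implementation:

```python
import random

def b_bit_minhash(S, k, b, seed=0):
    """Sketch: for each of k hash functions (stand-ins for the random permutations
    pi_1, ..., pi_k), compute z = min over x in S of h(x) and keep only the lowest
    b bits of z. Returns a list of k integers in [0, 2^b)."""
    P = 2**61 - 1                              # large prime for h(x) = (a*x + c) mod P
    rng = random.Random(seed)
    sketch = []
    for _ in range(k):
        a, c = rng.randrange(1, P), rng.randrange(P)
        z = min((a * x + c) % P for x in S)    # z_i = min(pi_i(S)), approximately
        sketch.append(z & ((1 << b) - 1))      # z_i^(b): lowest b bits of z_i
    return sketch

# Two similar sets produce many matching b-bit values, which is what the
# resemblance estimator counts.
h1 = b_bit_minhash({0, 3, 7, 12}, k=50, b=2)
h2 = b_bit_minhash({0, 3, 9, 12, 15}, k=50, b=2)
matches = sum(x == y for x, y in zip(h1, h2))
```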
27 The resemblance matrix R ∈ R^{n×n}, whose (i, j)-th entry is the resemblance between set Si and set Sj: Rij = |Si ∩ Sj| / |Si ∪ Sj| = |Si ∩ Sj| / (|Si| + |Sj| − |Si ∩ Sj|). [sent-84, score-0.122]
28 The minwise hashing matrix M ∈ R^{n×n}: Mij = 1{zi = zj}. [sent-86, score-1.025]
29 The b-bit minwise hashing matrix M^(b) ∈ R^{n×n}: M^(b)_ij = 1{zi^(b) = zj^(b)}. [sent-88, score-1.064]
30 Consequently, consider k independent permutations and denote M^(b)_(s) the b-bit minwise hashing matrix generated by the s-th permutation. [sent-89, score-1.022]
31 In our approach, we apply k random permutations on each feature vector xi and store the lowest b bits of each hashed value. [sent-103, score-0.194]
32 For example, suppose k = 3 and the hashed values are originally {12013, 25964, 20191}, whose binary digits are {010111011101101, 110010101101100, 100111011011111}. [sent-106, score-0.117]
33 Clearly, this expansion is directly inspired by the proof that the b-bit minwise hashing matrix is PD in Theorem 2. [sent-110, score-0.994]
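A minimal sketch of that expansion (illustrative code; the k = 3, b = 2 example reuses the hashed values quoted above):

```python
def expand(b_bit_values, b):
    """Sketch: turn the k b-bit hashed values of one data point into a sparse binary
    vector of length 2^b * k with exactly k ones; the inner product of two expanded
    vectors counts the number of b-bit matches, which is what allows the nonlinear
    resemblance kernel to be handled by a linear learner."""
    width = 1 << b
    return [j * width + v for j, v in enumerate(b_bit_values)]  # non-zero positions only

# With k = 3 and b = 2, the hashed values {12013, 25964, 20191} keep their lowest
# 2 bits {1, 0, 3}, so the non-zeros sit at positions {0*4+1, 1*4+0, 2*4+3}.
print(expand([v & 3 for v in (12013, 25964, 20191)], b=2))  # [1, 4, 11]
```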
34 They conducted experiments on three datasets, of which only the webspam dataset is public and reasonably high-dimensional (n = 350000, D = 16609143). [sent-112, score-0.135]
35 In applications when the data do not fit in memory, we expect that b-bit hashing will be even more substantially advantageous, because the hashed data are relatively very small. [sent-119, score-0.583]
36 In fact, our experimental results will show that for this dataset, using k = 200 and b = 8 can achieve similar testing accuracies as using the original data. [sent-120, score-0.125]
37 The effective storage for the reduced dataset (with 350K examples, using k = 200 and b = 8) would be merely about 70MB. [sent-121, score-0.105]
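To make the linear-learning step concrete, a hedged sketch follows; scikit-learn's LinearSVC (a LIBLINEAR wrapper) stands in for the solvers used in the paper, and to_sparse_matrix, X_rows and y are hypothetical names introduced only for this illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

def to_sparse_matrix(expanded_rows, k, b):
    """Stack expanded examples (each a list of k non-zero positions) into a CSR
    matrix of shape (n, 2^b * k) with 0/1 entries."""
    n = len(expanded_rows)
    indptr = np.arange(0, (n + 1) * k, k)      # each row has exactly k non-zeros
    indices = np.concatenate(expanded_rows)
    data = np.ones(len(indices))
    return csr_matrix((data, indices, indptr), shape=(n, (1 << b) * k))

# Hypothetical usage, with X_rows holding the expanded hashes and y the labels:
# X = to_sparse_matrix(X_rows, k=200, b=8)
# clf = LinearSVC(C=1.0).fit(X, y)
```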
38 1 Experimental Results on Nonlinear (Kernel) SVM We implemented a new resemblance kernel function and tried to use LIBSVM to train an SVM using the webspam dataset. [sent-123, score-0.162]
39 Fortunately, using b-bit minwise hashing to estimate the resemblance kernels provides a substantial improvement. [sent-125, score-0.524]
40 For example, with k = 150, b = 4, and C = 1, the training time is about 5185 seconds and the testing accuracy is quite close to the best results given by LIBLINEAR on the original webspam data. [sent-126, score-0.38]
41 Figure 1 demonstrates that using b ≥ 8 and k ≥ 200 achieves similar test accuracies as using the original data. [sent-133, score-0.069]
42 b-bit hashing achieves very similar accuracies as using the original data (dashed, red if color is available). [sent-141, score-0.532]
43 Compared with the original training time (about 100 seconds), Figure 3 (upper panels) shows that our method only needs about 3 seconds (near C = 1). [sent-148, score-0.139]
44 Note that our reported training time did not include data loading (about 12 minutes for the original data and 10 seconds for the hashed data). [sent-149, score-0.258]
45 Compared with the original testing time (about 150 seconds), Figure 3 (bottom panels) shows that our method needs merely about 2 seconds. [sent-150, score-0.125]
46 Note that the testing time includes both the data loading time and the computing time, as designed by LIBLINEAR. [sent-151, score-0.106]
47 The efficiency of testing may be very important in practice, for example, when the classifier is deployed in a user-facing application (such as search), while the cost of training or preprocessing may be less critical and can be conducted off-line. [sent-152, score-0.165]
48 3 Experimental Results on Logistic Regression Figure 4 presents the test accuracies and training time using logistic regression. [sent-156, score-0.162]
49 Again, with k ≥ 200 and b ≥ 8, b-bit minwise hashing can achieve similar test accuracies as using the original data. [sent-157, score-1.063]
50 The training time is substantially reduced, from about 1000 seconds to about 30 seconds only. [sent-158, score-0.17]
51 Figure 4: Logistic regression test accuracy (upper panels) and training time (bottom panels). [sent-159, score-0.217]
52 In summary, it appears b-bit hashing is highly effective in reducing the data size and speeding up the training (and testing), for both SVM and logistic regression. [sent-160, score-0.557]
53 We notice that when using b = 16, the training time can be much larger than using b ≤ 8. [sent-161, score-0.085]
54 Interestingly, we find that b-bit hashing can be easily combined with Vowpal Wabbit (VW) [34] to further reduce the training time when b is large. [sent-162, score-0.548]
55 6 Random Projections, Count-Min (CM) Sketch, and Vowpal Wabbit (VW) Random projections [1, 24], Count-Min (CM) sketch [11], and Vowpal Wabbit (VW) [32, 34], as popular hashing algorithms for estimating inner products for high-dimensional datasets, are naturally applicable in large-scale learning. [sent-163, score-0.581]
56 Note that in this paper, we use “VW” particularly for the hashing algorithm in [34], not the influential “VW” online learning platform. [sent-166, score-0.463]
57 The general idea is to multiply the data vectors by a random matrix {rij} ∈ R^{D×k}, where rij is sampled i.i.d. from a distribution with zero mean and unit variance. [sent-170, score-0.156]
58 For the sparse projection distribution below, Var(rij^2) = E(rij^4) − E^2(rij^2) = s − 1 ≥ 0. [sent-174, score-0.131]
59 This generates two k-dim vectors, v1 and v2: v1,j = Σ_{i=1}^{D} u1,i rij, v2,j = Σ_{i=1}^{D} u2,i rij, j = 1, 2, ..., k. (8) [sent-175, score-0.131]
60 Typical choices include the standard normal and the “sparse projection” distribution specified as rij = √s × { 1 with prob. 1/(2s); 0 with prob. 1 − 1/s; −1 with prob. 1/(2s) }. [sent-179, score-0.131]
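A minimal sketch of such a projection, assuming the sparse distribution just described; the function name and the default s = 3 are illustrative:

```python
import numpy as np

def sparse_random_projection(u, k, s=3, seed=0):
    """Sketch of eq. (8): v_j = sum_i u_i * r_ij, with r_ij = sqrt(s) * (+1 w.p. 1/(2s),
    0 w.p. 1 - 1/s, -1 w.p. 1/(2s)), so that E(r_ij) = 0 and E(r_ij^2) = 1."""
    rng = np.random.default_rng(seed)
    D = u.shape[0]
    R = np.sqrt(s) * rng.choice([1.0, 0.0, -1.0], size=(D, k),
                                p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return u @ R   # k-dimensional projected vector; <v1, v2> / k estimates <u1, u2>
```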
61 [23] proposed an improved estimator for random projections as the solution to a cubic equation. [sent-183, score-0.073]
62 2 Count-Min (CM) Sketch and Vowpal Wabbit (VW) Again, in this paper, “VW” always refers to the hashing algorithm in [34]. [sent-186, score-0.463]
63 In the original CM algorithm, the key step is to independently and uniformly hash elements of the data vectors to k buckets, and the hashed value is the sum of the elements in the bucket. [sent-188, score-0.166]
64 By writing Iij = 1 if element i is hashed to bucket j and Iij = 0 otherwise, we can write the hashed data as w1,j = Σ_{i=1}^{D} u1,i Iij, w2,j = Σ_{i=1}^{D} u2,i Iij. (12) The estimate âcm = Σ_{j=1}^{k} w1,j w2,j is (severely) biased for estimating inner products. [sent-193, score-0.123]
65 After applying multiplication and hashing on u1 and u2, the resultant vectors g1 and g2 are g1,j = Σ_{i=1}^{D} u1,i ri Iij, g2,j = Σ_{i=1}^{D} u2,i ri Iij, j = 1, 2, ..., k. [sent-202, score-0.534]
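A hedged sketch of both schemes; the bucket assignment, the random signs, and all names are illustrative and not the exact implementation of [34]:

```python
import numpy as np

def cm_style_hash(u, k, bucket_of, signs=None):
    """Sketch of eq. (12): w_j = sum of u_i over coordinates i with bucket_of[i] == j.
    Passing random +/-1 signs gives the VW variant g_j = sum_i u_i * r_i * I_ij,
    which removes the (severe) bias of the plain CM estimate of inner products."""
    w = np.zeros(k)
    for i, ui in enumerate(u):
        if ui != 0:
            ri = 1.0 if signs is None else signs[i]
            w[bucket_of[i]] += ri * ui
    return w

# Hypothetical usage for two vectors u1, u2 of dimension D:
# rng = np.random.default_rng(0)
# bucket_of = rng.integers(0, k, size=D)    # uniform hashing of coordinates to k buckets
# signs = rng.choice([-1.0, 1.0], size=D)   # VW's random signs
# g1, g2 = cm_style_hash(u1, k, bucket_of, signs), cm_style_hash(u2, k, bucket_of, signs)
# est = g1 @ g2                             # estimate of the inner product <u1, u2>
```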
66 7 Comparing b-Bit Minwise Hashing with VW (and Random Projections) [sent-210, score-0.507]
67 [Figure 5 panels: Spam accuracy (%) and training time (sec) vs. k, for svm and logit, VW vs. b = 8 hashing, at several values of C.]
68 We implemented VW and experimented with it on the same webspam dataset. [sent-218, score-0.611]
69 Figure 5 shows that b-bit minwise hashing is substantially more accurate (at the same sample size k) and requires significantly less training time (to achieve the same accuracy). [sent-219, score-1.106]
70 Basically, 8-bit minwise hashing with k = 200 achieves similar test accuracies as VW with k = 10^4 ∼ 10^6 (note that we only stored the non-zeros). [sent-220, score-1.054]
72 Figure 5: The dashed (red if color is available) curves represent b-bit minwise hashing results (only for k ≤ 500) while the solid curves represent VW. [sent-230, score-0.994]
73 This empirical finding is not surprising, because the variance of b-bit hashing is usually substantially smaller than the variance of VW (and random projections). [sent-234, score-0.49]
74 0967, which also includes the complete proofs of the theorems presented in this paper), we show that, at the same storage cost, b-bit hashing usually improves VW by 10- to 100-fold, by assuming each sample of VW needs 32 bits to store. [sent-236, score-0.584]
75 Unlike random projections (and minwise hashing), VW is a sparsity-preserving algorithm, meaning that in the resultant sample vector of length k, the number of non-zeros will not exceed the number of non-zeros in the original vector. [sent-239, score-0.633]
76 We use b-bit minwise hashing or VW for the purpose of data reduction. [sent-243, score-0.994]
77 In fact, our b-bit minwise hashing scheme for linear learning may face such an issue. [sent-245, score-0.994]
78 8 Combining b-Bit Minwise Hashing with VW In Figures 3 and 4, when b = 16, the training time becomes substantially larger than b ≤ 8. [sent-246, score-0.112]
79 Recall that in the run-time, we expand the b-bit minwise hashed data to sparse binary vectors of length 2^b k with exactly k 1’s. [sent-247, score-0.69]
80 Therefore, in the run-time, after we have generated the sparse binary vectors of length 2^b k, we hash them using VW with sample size m (to differentiate from k). [sent-250, score-0.072]
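A minimal sketch of this combination, reducing the expanded 2^b k-dimensional binary vector to m dimensions with a VW-style signed hash; the hash constants and names are illustrative, not taken from [34]:

```python
import numpy as np

def vw_reduce(nonzero_positions, m, a=2654435761, c=97531, P=2**61 - 1):
    """Sketch: map each non-zero position of the expanded binary vector to a bucket
    in [0, m) and a random sign via a simple universal-style hash, producing the
    m-dimensional feature vector that is fed to the linear learner."""
    out = np.zeros(m)
    for pos in nonzero_positions:
        h = (a * pos + c) % P
        out[h % m] += 1.0 if (h >> 32) & 1 else -1.0
    return out

# With k = 200 and b = 16 the expanded length is 2^16 * 200 (about 1.3e7);
# hashing it down to m = 2^8 * 200 keeps similar accuracy while shrinking
# the model and the training time.
```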
81 Recall Section 2 provides the estimator, denoted by Rb , of the resemblance R, using b-bit minwise hashing. [sent-253, score-0.592]
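A hedged sketch of that estimator, taking the constants C1,b and C2,b from Theorem 1 as given inputs (their closed forms are in the paper and are not reproduced here); the linear relation assumed below, Pb = C1,b + (1 − C2,b) R, matches the structure that Theorem 4 below uses:

```python
def estimate_resemblance(h1, h2, C1b, C2b):
    """Sketch: empirical b-bit match rate P_hat = (1/k) * #{z1^(b) == z2^(b)}, then
    invert the assumed relation P_b = C1,b + (1 - C2,b) * R to obtain
    R_hat_b = (P_hat - C1,b) / (1 - C2,b)."""
    k = len(h1)
    P_hat = sum(x == y for x, y in zip(h1, h2)) / k
    return (P_hat - C1b) / (1 - C2b)
```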
82 Now, suppose we first apply VW hashing with size m on the binary vector of length 2^b k before estimating R, which will introduce some additional randomness. [sent-254, score-0.487]
83 When m = 2^8 k, this method achieves similar test accuracies (left panels) while substantially reducing the training time (right panels). [sent-260, score-0.156]
84 Theorem 4: Var(R̂b,vw) = Var(R̂b) + (1 / (m [1 − C2,b]^2)) · (1 + Pb^2 − (1/k) Pb (1 + Pb)), (16) where Var(R̂b) = Pb(1 − Pb) / (k [1 − C2,b]^2) is given by (4) and C2,b is the constant defined in Theorem 1. [sent-261, score-0.09]
85 Compared to the original variance Var(R̂b), the additional term in (16) can be relatively large if m is small. [sent-262, score-0.07]
86 9 Limitations While using b-bit minwise hashing for training linear algorithms is successful on the webspam dataset, it is important to understand the following three major limitations of the algorithm: (A): Our method is designed for binary (0/1) sparse data. [sent-266, score-1.196]
87 For most applications, we expect the preprocessing cost is not a major issue because the preprocessing can be conducted off-line (or combined with the data-collection step) and is easily parallelizable. [sent-268, score-0.077]
88 However, even if the speed is not a concern, the energy consumption might be an issue, especially considering (b-bit) minwise hashing is mainly used for industry applications. [sent-269, score-1.017]
89 For example, for the webspam dataset, using b = 8 and k = 200 seems to suffice, and 2^8 × 200 = 51200 is quite large, although it is much smaller than the original dimension of 16 million. [sent-276, score-0.126]
90 It would be desirable if we could further reduce the dimension, because the dimension determines the storage cost of the model and (moderately) increases the training time for batch learning algorithms such as LIBLINEAR. [sent-277, score-0.155]
91 In hopes of fixing the above limitations, we experimented with an implementation using another hashing technique named Conditional Random Sampling (CRS) [21, 22], which is not limited to binary data and requires only one permutation of the original data (i.e., it does not need k independent permutations). [sent-278, score-0.56]
92 For example, CRS compares favorably to VW in terms of storage (to achieve the same accuracy) on the webspam dataset. [sent-282, score-0.171]
93 However, so far CRS can not compete with b-bit minwise hashing for linear learning (in terms of training speed, storage cost, and model size). [sent-283, score-1.125]
94 In our implementation, we could not use fully correct normalization factors, which led to severe bias of the inner product estimates and less than satisfactory performance of linear learning compared to b-bit minwise hashing. [sent-286, score-0.561]
95 10 Conclusion As data sizes continue to grow faster than the memory and computational power, statistical learning tasks in industrial practice are increasingly faced with training datasets that exceed the resources on a single server. [sent-287, score-0.098]
96 We also compare b-bit minwise hashing with the Count-Min (CM) sketch and Vowpal Wabbit (VW) algorithms, which, according to our analysis, all have (essentially) the same variances as random projections [24]. [sent-290, score-1.082]
97 Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage) for binary data. [sent-291, score-1.018]
98 We thank John Langford and Tong Zhang for helping us better understand the VW hashing algorithm, and Chih-Jen Lin for his patient explanation of LIBLINEAR package and datasets. [sent-296, score-0.463]
99 Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. [sent-301, score-0.463]
100 Accurate estimators for improving minwise hashing and b-bit minwise hashing. [sent-400, score-1.525]
wordName wordTfidf (topN-words)
[('minwise', 0.531), ('hashing', 0.463), ('vw', 0.375), ('spam', 0.365), ('rij', 0.131), ('svm', 0.126), ('logit', 0.122), ('webspam', 0.101), ('sec', 0.095), ('hashed', 0.093), ('ping', 0.091), ('pb', 0.087), ('std', 0.084), ('accuracy', 0.084), ('vowpal', 0.071), ('wabbit', 0.071), ('storage', 0.07), ('training', 0.061), ('resemblance', 0.061), ('mij', 0.059), ('crs', 0.058), ('iij', 0.058), ('testing', 0.056), ('projections', 0.055), ('bits', 0.051), ('cm', 0.051), ('arnd', 0.051), ('panels', 0.049), ('pd', 0.047), ('rb', 0.046), ('ar', 0.045), ('accuracies', 0.044), ('zi', 0.039), ('www', 0.038), ('liblinear', 0.038), ('kenneth', 0.037), ('christian', 0.036), ('sketch', 0.033), ('logistic', 0.033), ('zj', 0.031), ('inner', 0.03), ('seconds', 0.029), ('preprocessing', 0.029), ('permutations', 0.028), ('experimented', 0.027), ('substantially', 0.027), ('kdd', 0.027), ('loading', 0.026), ('vectors', 0.025), ('li', 0.025), ('original', 0.025), ('documents', 0.024), ('time', 0.024), ('ri', 0.024), ('binary', 0.024), ('trevor', 0.024), ('sj', 0.024), ('eshghi', 0.023), ('kave', 0.023), ('manasse', 0.023), ('najork', 0.023), ('spain', 0.023), ('industry', 0.023), ('hash', 0.023), ('lowest', 0.022), ('bc', 0.022), ('resultant', 0.022), ('hsieh', 0.022), ('libsvm', 0.021), ('vancouver', 0.021), ('permutation', 0.021), ('enterprise', 0.02), ('gollapudi', 0.02), ('nig', 0.02), ('bit', 0.02), ('vs', 0.02), ('si', 0.02), ('storing', 0.02), ('merely', 0.02), ('memory', 0.02), ('conducted', 0.019), ('cornell', 0.019), ('sreenivas', 0.019), ('andrei', 0.019), ('extremely', 0.018), ('pages', 0.018), ('var', 0.018), ('estimator', 0.018), ('format', 0.017), ('expand', 0.017), ('web', 0.017), ('datasets', 0.017), ('wsdm', 0.017), ('marc', 0.017), ('stored', 0.016), ('barcelona', 0.016), ('chang', 0.016), ('limitations', 0.016), ('interestingly', 0.016), ('dataset', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 111 nips-2011-Hashing Algorithms for Large-Scale Learning
Author: Ping Li, Anshumali Shrivastava, Joshua L. Moore, Arnd C. König
Abstract: Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
2 0.18310744 157 nips-2011-Learning to Search Efficiently in High Dimensions
Author: Zhen Li, Huazhong Ning, Liangliang Cao, Tong Zhang, Yihong Gong, Thomas S. Huang
Abstract: High dimensional similarity search in large scale databases becomes an important challenge due to the advent of Internet. For such applications, specialized data structures are required to achieve computational efficiency. Traditional approaches relied on algorithmic constructions that are often data independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees, k-means trees). While supervised learning algorithms have been applied to related problems, those proposed in the literature mainly focused on learning hash codes optimized for compact embedding of the data rather than search efficiency. Consequently such an embedding has to be used with linear scan or another search algorithm. Hence learning to hash does not directly address the search efficiency issue. This paper considers a new framework that applies supervised learning to directly optimize a data structure that supports efficient large scale search. Our approach takes both search quality and computational cost into consideration. Specifically, we learn a boosted search forest that is optimized using pair-wise similarity labeled examples. The output of this search forest can be efficiently converted into an inverted indexing data structure, which can leverage modern text search infrastructure to achieve both scalability and efficiency. Experimental results show that our approach significantly outperforms the start-of-the-art learning to hash methods (such as spectral hashing), as well as state-of-the-art high dimensional search algorithms (such as LSH and k-means trees).
3 0.078621276 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon
Abstract: We introduce P I C O D ES: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. P I C O D ES of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.
4 0.043673489 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model
Author: Ioannis A. Gkioulekas, Todd Zickler
Abstract: We propose an approach for linear unsupervised dimensionality reduction, based on the sparse linear model that has been used to probabilistically interpret sparse coding. We formulate an optimization problem for learning a linear projection from the original signal domain to a lower-dimensional one in a way that approximately preserves, in expectation, pairwise inner products in the sparse domain. We derive solutions to the problem, present nonlinear extensions, and discuss relations to compressed sensing. Our experiments using facial images, texture patches, and images of object categories suggest that the approach can improve our ability to recover meaningful structure in many classes of signals. 1
5 0.041542973 271 nips-2011-Statistical Tests for Optimization Efficiency
Author: Levi Boyles, Anoop Korattikara, Deva Ramanan, Max Welling
Abstract: Learning problems, such as logistic regression, are typically formulated as pure optimization problems defined on some loss function. We argue that this view ignores the fact that the loss function depends on stochastically generated data which in turn determines an intrinsic scale of precision for statistical estimation. By considering the statistical properties of the update variables used during the optimization (e.g. gradients), we can construct frequentist hypothesis tests to determine the reliability of these updates. We utilize subsets of the data for computing updates, and use the hypothesis tests for determining when the batch-size needs to be increased. This provides computational benefits and avoids overfitting by stopping when the batch-size has become equal to size of the full dataset. Moreover, the proposed algorithms depend on a single interpretable parameter – the probability for an update to be in the wrong direction – which is set to a single value across all algorithms and datasets. In this paper, we illustrate these ideas on three L1 regularized coordinate descent algorithms: L1 -regularized L2 -loss SVMs, L1 -regularized logistic regression, and the Lasso, but we emphasize that the underlying methods are much more generally applicable. 1
6 0.041125756 84 nips-2011-EigenNet: A Bayesian hybrid of generative and conditional models for sparse learning
7 0.040982816 209 nips-2011-Orthogonal Matching Pursuit with Replacement
8 0.040353287 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
9 0.038987793 144 nips-2011-Learning Auto-regressive Models from Sequence and Non-sequence Data
10 0.03845048 211 nips-2011-Penalty Decomposition Methods for Rank Minimization
11 0.038377903 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
12 0.035095666 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
13 0.034855682 45 nips-2011-Beating SGD: Learning SVMs in Sublinear Time
14 0.034601472 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
15 0.031607859 276 nips-2011-Structured sparse coding via lateral inhibition
16 0.030921582 231 nips-2011-Randomized Algorithms for Comparison-based Search
17 0.030478436 95 nips-2011-Fast and Accurate k-means For Large Datasets
18 0.029528556 261 nips-2011-Sparse Filtering
19 0.027897933 247 nips-2011-Semantic Labeling of 3D Point Clouds for Indoor Scenes
20 0.026983732 258 nips-2011-Sparse Bayesian Multi-Task Learning
topicId topicWeight
[(0, 0.095), (1, 0.027), (2, -0.044), (3, 0.003), (4, -0.004), (5, 0.024), (6, 0.017), (7, -0.019), (8, -0.031), (9, -0.012), (10, -0.008), (11, 0.014), (12, -0.019), (13, -0.008), (14, 0.025), (15, 0.025), (16, -0.058), (17, 0.054), (18, -0.032), (19, -0.024), (20, 0.04), (21, 0.012), (22, 0.069), (23, -0.026), (24, -0.013), (25, -0.005), (26, 0.051), (27, -0.018), (28, 0.058), (29, 0.058), (30, -0.032), (31, -0.024), (32, 0.117), (33, -0.122), (34, -0.05), (35, -0.022), (36, 0.063), (37, -0.024), (38, 0.001), (39, 0.008), (40, -0.058), (41, 0.049), (42, 0.13), (43, 0.008), (44, 0.052), (45, 0.019), (46, -0.124), (47, 0.006), (48, -0.013), (49, 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.88631451 111 nips-2011-Hashing Algorithms for Large-Scale Learning
Author: Ping Li, Anshumali Shrivastava, Joshua L. Moore, Arnd C. König
Abstract: Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
2 0.717462 157 nips-2011-Learning to Search Efficiently in High Dimensions
Author: Zhen Li, Huazhong Ning, Liangliang Cao, Tong Zhang, Yihong Gong, Thomas S. Huang
Abstract: High dimensional similarity search in large scale databases becomes an important challenge due to the advent of Internet. For such applications, specialized data structures are required to achieve computational efficiency. Traditional approaches relied on algorithmic constructions that are often data independent (such as Locality Sensitive Hashing) or weakly dependent (such as kd-trees, k-means trees). While supervised learning algorithms have been applied to related problems, those proposed in the literature mainly focused on learning hash codes optimized for compact embedding of the data rather than search efficiency. Consequently such an embedding has to be used with linear scan or another search algorithm. Hence learning to hash does not directly address the search efficiency issue. This paper considers a new framework that applies supervised learning to directly optimize a data structure that supports efficient large scale search. Our approach takes both search quality and computational cost into consideration. Specifically, we learn a boosted search forest that is optimized using pair-wise similarity labeled examples. The output of this search forest can be efficiently converted into an inverted indexing data structure, which can leverage modern text search infrastructure to achieve both scalability and efficiency. Experimental results show that our approach significantly outperforms the start-of-the-art learning to hash methods (such as spectral hashing), as well as state-of-the-art high dimensional search algorithms (such as LSH and k-means trees).
3 0.50020403 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition
Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon
Abstract: We introduce P I C O D ES: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. P I C O D ES of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.
4 0.49913555 91 nips-2011-Exploiting spatial overlap to efficiently compute appearance distances between image windows
Author: Bogdan Alexe, Viviana Petrescu, Vittorio Ferrari
Abstract: We present a computationally efficient technique to compute the distance of highdimensional appearance descriptor vectors between image windows. The method exploits the relation between appearance distance and spatial overlap. We derive an upper bound on appearance distance given the spatial overlap of two windows in an image, and use it to bound the distances of many pairs between two images. We propose algorithms that build on these basic operations to efficiently solve tasks relevant to many computer vision applications, such as finding all pairs of windows between two images with distance smaller than a threshold, or finding the single pair with the smallest distance. In experiments on the PASCAL VOC 07 dataset, our algorithms accurately solve these problems while greatly reducing the number of appearance distances computed, and achieve larger speedups than approximate nearest neighbour algorithms based on trees [18] and on hashing [21]. For example, our algorithm finds the most similar pair of windows between two images while computing only 1% of all distances on average. 1
5 0.46535656 143 nips-2011-Learning Anchor Planes for Classification
Author: Ziming Zhang, Lubor Ladicky, Philip Torr, Amir Saffari
Abstract: Local Coordinate Coding (LCC) [18] is a method for modeling functions of data lying on non-linear manifolds. It provides a set of anchor points which form a local coordinate system, such that each data point on the manifold can be approximated by a linear combination of its anchor points, and the linear weights become the local coordinate coding. In this paper we propose encoding data using orthogonal anchor planes, rather than anchor points. Our method needs only a few orthogonal anchor planes for coding, and it can linearize any (α, β, p)-Lipschitz smooth nonlinear function with a fixed expected value of the upper-bound approximation error on any high dimensional data. In practice, the orthogonal coordinate system can be easily learned by minimizing this upper bound using singular value decomposition (SVD). We apply our method to model the coordinates locally in linear SVMs for classification tasks, and our experiment on MNIST shows that using only 50 anchor planes our method achieves 1.72% error rate, while LCC achieves 1.90% error rate using 4096 anchor points. 1
6 0.42246363 288 nips-2011-Thinning Measurement Models and Questionnaire Design
7 0.4116596 280 nips-2011-Testing a Bayesian Measure of Representativeness Using a Large Image Database
8 0.40613392 209 nips-2011-Orthogonal Matching Pursuit with Replacement
9 0.38561302 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
10 0.38528633 240 nips-2011-Robust Multi-Class Gaussian Process Classification
11 0.37997636 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
12 0.37092796 64 nips-2011-Convergent Bounds on the Euclidean Distance
13 0.36621395 267 nips-2011-Spectral Methods for Learning Multivariate Latent Tree Structure
14 0.36580792 299 nips-2011-Variance Penalizing AdaBoost
15 0.36456391 276 nips-2011-Structured sparse coding via lateral inhibition
16 0.36088726 95 nips-2011-Fast and Accurate k-means For Large Datasets
17 0.3437787 30 nips-2011-Algorithms for Hyper-Parameter Optimization
19 0.33451352 19 nips-2011-Active Classification based on Value of Classifier
20 0.32920727 112 nips-2011-Heavy-tailed Distances for Gradient Based Image Descriptors
topicId topicWeight
[(0, 0.029), (4, 0.022), (20, 0.017), (26, 0.028), (31, 0.049), (33, 0.014), (43, 0.05), (45, 0.099), (57, 0.476), (74, 0.037), (83, 0.03), (99, 0.029)]
simIndex simValue paperId paperTitle
1 0.93664855 224 nips-2011-Probabilistic Modeling of Dependencies Among Visual Short-Term Memory Representations
Author: Emin Orhan, Robert A. Jacobs
Abstract: Extensive evidence suggests that items are not encoded independently in visual short-term memory (VSTM). However, previous research has not quantitatively considered how the encoding of an item influences the encoding of other items. Here, we model the dependencies among VSTM representations using a multivariate Gaussian distribution with a stimulus-dependent mean and covariance matrix. We report the results of an experiment designed to determine the specific form of the stimulus-dependence of the mean and the covariance matrix. We find that the magnitude of the covariance between the representations of two items is a monotonically decreasing function of the difference between the items’ feature values, similar to a Gaussian process with a distance-dependent, stationary kernel function. We further show that this type of covariance function can be explained as a natural consequence of encoding multiple stimuli in a population of neurons with correlated responses. 1
2 0.93529207 234 nips-2011-Reconstructing Patterns of Information Diffusion from Incomplete Observations
Author: Flavio Chierichetti, David Liben-nowell, Jon M. Kleinberg
Abstract: Motivated by the spread of on-line information in general and on-line petitions in particular, recent research has raised the following combinatorial estimation problem. There is a tree T that we cannot observe directly (representing the structure along which the information has spread), and certain nodes randomly decide to make their copy of the information public. In the case of a petition, the list of names on each public copy of the petition also reveals a path leading back to the root of the tree. What can we conclude about the properties of the tree we observe from these revealed paths, and can we use the structure of the observed tree to estimate the size of the full unobserved tree T ? Here we provide the first algorithm for this size estimation task, together with provable guarantees on its performance. We also establish structural properties of the observed tree, providing the first rigorous explanation for some of the unusual structural phenomena present in the spread of real chain-letter petitions on the Internet. 1
3 0.89885122 170 nips-2011-Message-Passing for Approximate MAP Inference with Latent Variables
Author: Jiarong Jiang, Piyush Rai, Hal Daume
Abstract: We consider a general inference setting for discrete probabilistic graphical models where we seek maximum a posteriori (MAP) estimates for a subset of the random variables (max nodes), marginalizing over the rest (sum nodes). We present a hybrid message-passing algorithm to accomplish this. The hybrid algorithm passes a mix of sum and max messages depending on the type of source node (sum or max). We derive our algorithm by showing that it falls out as the solution of a particular relaxation of a variational framework. We further show that the Expectation Maximization algorithm can be seen as an approximation to our algorithm. Experimental results on synthetic and real-world datasets, against several baselines, demonstrate the efficacy of our proposed algorithm. 1
same-paper 4 0.86757791 111 nips-2011-Hashing Algorithms for Large-Scale Learning
Author: Ping Li, Anshumali Shrivastava, Joshua L. Moore, Arnd C. König
Abstract: Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and logistic regression, to solve large-scale and high-dimensional statistical learning tasks, especially when the data do not fit in memory. We compare b-bit minwise hashing with the Count-Min (CM) and Vowpal Wabbit (VW) algorithms, which have essentially the same variances as random projections. Our theoretical and empirical comparisons illustrate that b-bit minwise hashing is significantly more accurate (at the same storage cost) than VW (and random projections) for binary data.
5 0.86196917 100 nips-2011-Gaussian Process Training with Input Noise
Author: Andrew Mchutchon, Carl E. Rasmussen
Abstract: In standard Gaussian Process regression input locations are assumed to be noise free. We present a simple yet effective GP model for training on input points corrupted by i.i.d. Gaussian noise. To make computations tractable we use a local linear expansion about each input point. This allows the input noise to be recast as output noise proportional to the squared gradient of the GP posterior mean. The input noise variances are inferred from the data as extra hyperparameters. They are trained alongside other hyperparameters by the usual method of maximisation of the marginal likelihood. Training uses an iterative scheme, which alternates between optimising the hyperparameters and calculating the posterior gradient. Analytic predictive moments can then be found for Gaussian distributed test points. We compare our model to others over a range of different regression problems and show that it improves over current methods. 1
6 0.72114503 171 nips-2011-Metric Learning with Multiple Kernels
7 0.62418824 157 nips-2011-Learning to Search Efficiently in High Dimensions
8 0.61903715 242 nips-2011-See the Tree Through the Lines: The Shazoo Algorithm
9 0.59892893 226 nips-2011-Projection onto A Nonnegative Max-Heap
10 0.57555044 140 nips-2011-Kernel Embeddings of Latent Tree Graphical Models
11 0.56145567 89 nips-2011-Estimating time-varying input signals and ion channel states from a single voltage trace of a neuron
12 0.55982918 177 nips-2011-Multi-armed bandits on implicit metric spaces
13 0.53636098 135 nips-2011-Information Rates and Optimal Decoding in Large Neural Populations
14 0.53429949 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound
15 0.52826869 219 nips-2011-Predicting response time and error rates in visual search
16 0.52806711 30 nips-2011-Algorithms for Hyper-Parameter Optimization
17 0.52037102 31 nips-2011-An Application of Tree-Structured Expectation Propagation for Channel Decoding
18 0.52016264 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons
19 0.51205456 274 nips-2011-Structure Learning for Optimization
20 0.50961816 24 nips-2011-Active learning of neural response functions with Gaussian processes