nips nips2012 nips2012-148 knowledge-graph by maker-knowledge-mining

148 nips-2012-Hamming Distance Metric Learning


Source: pdf

Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. [sent-4, score-0.627]

2 Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. [sent-5, score-0.328]

3 The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. [sent-6, score-0.567]

4 1 Introduction Many machine learning algorithms presuppose the existence of a pairwise similarity measure on the input space. [sent-10, score-0.215]

5 Examples include semi-supervised clustering, nearest neighbor classification, and kernel-based methods. When similarity measures are not given a priori, one could adopt a generic function such as Euclidean distance, but this often produces unsatisfactory results. [sent-11, score-0.22]

6 The goal of metric learning techniques is to improve matters by incorporating side information, and optimizing parametric distance functions such as the Mahalanobis distance [7, 12, 30, 34, 36]. [sent-12, score-0.411]
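For reference, the Mahalanobis distance mentioned here has a simple closed form; a minimal sketch follows (the function name is ours, not from the paper):

```python
import numpy as np

def mahalanobis_distance(x: np.ndarray, y: np.ndarray, M: np.ndarray) -> float:
    """d_M(x, y) = sqrt((x - y)^T M (x - y)) for a positive semi-definite M.
    Metric learning fits M from side information; M = I recovers Euclidean distance."""
    d = x - y
    return float(np.sqrt(d @ M @ d))
```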

7 Compact binary codes are remarkably storage efficient, allowing one to store massive datasets in memory. [sent-14, score-0.439]

8 The Hamming distance, a natural similarity measure on binary codes, can be computed with just a few machine instructions per comparison. [sent-15, score-0.201]
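As a concrete illustration of the "few machine instructions" claim: on codes packed into machine words, Hamming distance reduces to an XOR followed by a population count. A minimal sketch (names are ours):

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two binary codes packed into integers."""
    # XOR leaves a 1 at every position where the codes disagree;
    # counting those 1s gives the Hamming distance.
    return bin(code_a ^ code_b).count("1")

# The 8-bit codes 0b10110010 and 0b10011010 differ in exactly 2 positions.
assert hamming_distance(0b10110010, 0b10011010) == 2
```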

9 By contrast, retrieval based on Mahalanobis distance requires approximate nearest neighbor (ANN) search, for which state-of-the-art methods (e. [sent-17, score-0.336]

10 , see [18, 23]) do not always perform well, especially with massive, high-dimensional datasets when storage overheads and distance computations become prohibitive. [sent-19, score-0.207]

11 In metric learning, by comparison, the goal is to preserve semantic structure based on labeled attributes or parameters associated with training exemplars. [sent-32, score-0.264]

12 The question of whether or not it is possible to learn hash functions capable of preserving complex semantic structure, with high fidelity, has remained unanswered. [sent-34, score-0.447]

13 To address this issue, we introduce a framework for learning a broad class of binary hash functions based on a triplet ranking loss designed to preserve relative similarity (c. [sent-35, score-1.258]

14 While certainly useful for preserving metric structure, this loss function is very well suited to the preservation of semantic similarity. [sent-38, score-0.351]

15 It is more flexible than the pairwise hinge loss of [24], and is shown below to produce superior hash functions. [sent-40, score-0.617]

16 Our formulation is inspired by latent SVM [10] and latent structural SVM [37] models, and it generalizes the minimal loss hashing (MLH) algorithm of [24]. [sent-41, score-0.268]

17 Accordingly, to optimize hash function parameters we formulate a continuous upper-bound on empirical loss, with a new form of loss-augmented inference designed for efficient optimization with the proposed triplet loss on the Hamming space. [sent-42, score-0.818]

18 To our knowledge, this is one of the most general frameworks for learning a broad class of hash functions. [sent-43, score-0.312]

19 We show that k-nearest neighbor (kNN) search on the resulting binary codes retrieves items that bear remarkable similarity to a given query item. [sent-48, score-0.668]

20 To show that the binary representation is rich enough to capture salient semantic structure, as is common in metric learning, we also report classification performance on the binary codes. [sent-49, score-0.341]

21 An important appeal of our approach is the scalability of kNN search on binary codes to billions of data points, and of kNN classification to millions of class labels. [sent-51, score-0.415]

22 2 Formulation The task is to learn a mapping b(x) that projects p-dimensional real-valued inputs x ∈ Rp onto q-dimensional binary codes h ∈ H ≡ {−1, 1}q , while preserving some notion of similarity. [sent-52, score-0.453]

23 This mapping, referred to as a hash function, is parameterized by a real-valued vector w as b(x; w) = sign(f (x; w)) , (1) where sign(·) denotes the element-wise sign function and f : R^p → R^q. [sent-53, score-0.317]
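A rough sketch of such a linear-threshold hash function, assuming f(x; w) = Wx with a q x p matrix W standing in for the parameter vector w (an illustration, not the authors' implementation):

```python
import numpy as np

def linear_hash(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """b(x; W) = sign(W x): map a p-dim input to a q-dim code in {-1, +1}^q."""
    f = W @ x                       # real-valued projections, one per bit
    return np.where(f >= 0, 1, -1)  # threshold each projection at zero

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 64))   # 32-bit codes from 64-dim inputs
x = rng.standard_normal(64)
code = linear_hash(x, W)            # array of +1/-1 values of length 32
```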

24 Different forms of f give rise to different families of hash functions: 1. [sent-55, score-0.324]

25 For such mappings the bits correspond to stripes of 1 and −1 regions, oriented parallel to the corresponding hyperplanes, in the input space. [sent-62, score-0.233]

26 More complex hash functions are obtained with multilayer neural networks [28, 32]. [sent-66, score-0.365]

27 Our Hamming distance metric learning framework applies to all of the above families of hash functions. [sent-69, score-0.55]

28 1 Loss functions The choice of loss function is crucial for learning good similarity measures. [sent-72, score-0.268]

29 To this end, most existing supervised binary hashing techniques [13, 22, 24] formulate learning objectives in terms of pairwise similarity, where pairs of inputs are labelled as either similar or dissimilar. [sent-73, score-0.373]

30 Similarity-preserving hashing aims to ensure that Hamming distances between binary codes for similar (dissimilar) items are small (large). [sent-74, score-0.627]

31 For example, MLH [24] uses a pairwise hinge loss function. [sent-75, score-0.329]

32 This loss incurs zero cost when a pair of similar inputs map to codes that differ by less than ρ bits. [sent-79, score-0.438]

33 The loss is zero for dissimilar items whose Hamming distance is more than ρ bits. [sent-80, score-0.402]

34 So, constraining pairwise Hamming distances over all pairs of codes with a single threshold is overly restrictive. [sent-83, score-0.409]

35 Such loss functions have been used in metric learning [5, 11], and, as shown below, they are also naturally suited to Hamming distance metric learning. [sent-86, score-0.494]

36 To define relative similarity, we assume that the training data includes triplets of items (x, x+ , x− ), such that the pair (x, x+ ) is more similar than the pair (x, x− ). [sent-87, score-0.228]

37 Our goal is to learn a hash function b such that b(x) is closer to b(x+ ) than to b(x− ) in Hamming distance. [sent-88, score-0.288]

38 Accordingly, we propose a ranking loss on the triplet of binary codes (h, h+, h−), obtained from b applied to (x, x+, x−): ℓ_triplet(h, h+, h−) = [ ‖h − h+‖_H − ‖h − h−‖_H + 1 ]_+ . (3) [sent-89, score-0.986]

39 This loss is zero when the Hamming distance between the more-similar pair, ‖h − h+‖_H, is at least one bit smaller than the Hamming distance between the less-similar pair, ‖h − h−‖_H. [sent-90, score-0.879]
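A direct transcription of the triplet ranking loss of Eq. 3 on {-1, +1} codes, as a sketch (the helper names are ours):

```python
import numpy as np

def hamming(h1: np.ndarray, h2: np.ndarray) -> int:
    """Hamming distance between two {-1, +1} codes."""
    return int(np.sum(h1 != h2))

def triplet_ranking_loss(h, h_pos, h_neg) -> int:
    """[ ||h - h_pos||_H - ||h - h_neg||_H + 1 ]_+ : zero exactly when the
    similar pair is at least one bit closer than the dissimilar pair."""
    return max(hamming(h, h_pos) - hamming(h, h_neg) + 1, 0)
```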

40 This loss function is more flexible than the pairwise loss function ℓ_pair, as it can be used to preserve rankings among similar items, for example based on Euclidean distance, or perhaps using path distance between category labels within a phylogenetic tree. [sent-91, score-0.629]

41 The learning objective is the regularized empirical loss L(w) = Σ_{(x, x+, x−) ∈ D} ℓ_triplet( b(x; w), b(x+; w), b(x−; w) ) + (λ/2) ‖w‖²_2 . (4) This objective is discontinuous and non-convex. [sent-93, score-0.412]

42 The hash function is a discrete mapping and empirical loss is piecewise constant. [sent-94, score-0.458]

43 The upper bound on loss that we adopt is inspired by previous work on latent structural SVMs [37]. [sent-97, score-0.207]

44 The key observation that relates our Hamming distance metric learning framework to structured prediction is as follows: b(x; w) = sign(f (x; w)) = argmax_{h ∈ H} hᵀ f (x; w) , (5) where H ≡ {−1, +1}^q. [sent-98, score-0.28]

45 The argmax on the RHS effectively means that for dimensions of f (x; w) with positive values, the optimal code should take on values +1, and when elements of f (x; w) are negative the corresponding bits of the code should be −1. [sent-99, score-0.266]

46 Summing the upper bound instead of the loss in Eq. [sent-109, score-0.207]

47 4 yields an upper bound on the regularized empirical loss in Eq. [sent-110, score-0.207]

48 In particular, when f is linear in w, the bound on regularized empirical loss becomes piecewise linear and convex-concave. [sent-115, score-0.232]

49 While Eq. 6 is more challenging to optimize than the bound in [24], it allows us to learn hash functions based on non-linear functions f, e.g., multilayer neural networks. [sent-117, score-0.432]

50 While the bound in [24] was defined for ℓ_pair-type loss functions and pairwise similarity labels, the bound here applies to the more flexible class of triplet loss functions. [sent-120, score-0.988]

51 To use the bound in Eq. 6 for optimization, we must be able to find the binary codes given by (ĝ, ĝ+, ĝ−) = argmax_{(g, g+, g−) ∈ H³} [ ℓ_triplet(g, g+, g−) + gᵀ f (x) + g+ᵀ f (x+) + g−ᵀ f (x−) ] . [sent-123, score-0.764]

52 The challenge stems from the 2^{3q} possible binary codes over which one has to maximize the RHS. [sent-125, score-0.357]

53 Fortunately, we can show that this loss-augmented inference problem can be solved efficiently for the class of triplet loss functions that depend only on the value of d(g, g+, g−) ≡ ‖g − g+‖_H − ‖g − g−‖_H. [sent-126, score-0.542]

54 Importantly, such loss functions do not depend on the specific binary codes, but rather just the differences. [sent-127, score-0.253]

55 Clearly the triplet ranking loss depends only on d, since ℓ_triplet(g, g+, g−) can be written as ℓ(d(g, g+, g−)) for a scalar hinge function of d alone (from Eq. 3, ℓ(α) = [ α + 1 ]_+). [sent-129, score-1.011]

56 Accordingly, let e_i ∈ {−1, 0, +1} denote the effect of the ith bits (of g, g+, and g−) on d(g, g+, g−). [sent-148, score-0.221]

57 Given Eq. 9, we aim to select values e_i, for all i, such that Σ_{i=1}^{q} e_i = m and Σ_{i=1}^{q} cont(i, e_i) is maximized. [sent-152, score-0.21]

58 9 and set the bits to the configurations that maximized cont(i, ei ). [sent-155, score-0.221]
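This per-bit decomposition suggests a dynamic program that is quadratic in the code length q: for every achievable difference m, find the assignment of e-values with maximal total cont, then add the loss term and keep the best m. The sketch below illustrates only that structure; building the cont table from f(x), f(x+), f(x−) is omitted, the names are ours, and this is not claimed to be the authors' exact algorithm:

```python
import numpy as np

def loss_augmented_inference_d(cont: np.ndarray, loss_of_d) -> int:
    """Return the difference d = m maximizing loss_of_d(m) + best total cont.

    cont      : q x 3 array; cont[i, e + 1] is the best score attainable at
                bit i when that bit contributes e in {-1, 0, +1} to d.
    loss_of_d : triplet loss as a function of d alone,
                e.g. lambda m: max(m + 1, 0) for the ranking loss above.
    Per-bit configurations can be recovered by standard DP back-tracking.
    """
    q = cont.shape[0]
    NEG = -np.inf
    # dp[s + q] = best total contribution of the bits processed so far,
    # given that their e-values sum to s (s ranges over [-q, q]).
    dp = np.full(2 * q + 1, NEG)
    dp[q] = 0.0                    # zero bits processed, sum = 0
    for i in range(q):
        new_dp = np.full(2 * q + 1, NEG)
        for s in range(-i, i + 1):
            if dp[s + q] == NEG:
                continue
            for e in (-1, 0, 1):
                cand = dp[s + q] + cont[i, e + 1]
                if cand > new_dp[s + e + q]:
                    new_dp[s + e + q] = cand
        dp = new_dp
    # Combine with the loss term for each achievable m and keep the best.
    scores = [loss_of_d(m) + dp[m + q] for m in range(-q, q + 1)]
    return int(np.argmax(scores)) - q
```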

59 3 Perceptron-like learning Our learning algorithm is a form of stochastic gradient descent, where in the tth iteration we sample a triplet (x, x+ , x− ) from the dataset, and then take a step in the direction that decreases the upper bound on the triplet’s loss in Eq. [sent-158, score-0.589]

60 Select a random triplet (x, x+ , x− ) from dataset D. [sent-162, score-0.382]

61 Although Eq. 6 is not differentiable at isolated points (owing to the max terms), in our experiments we find that this update rule consistently decreases both the upper bound and the actual regularized empirical loss L(w). [sent-173, score-0.207]
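For the linear case, one such perceptron-like step can be sketched as follows, assuming the subgradient of the upper bound takes the usual latent structural-SVM form (loss-augmented codes minus current codes, in outer products with the inputs); the regularizer and mini-batching are omitted, the names are ours, and this is not guaranteed to match the authors' exact update:

```python
import numpy as np

def signs(v: np.ndarray) -> np.ndarray:
    """Element-wise sign with ties broken toward +1, giving codes in {-1, +1}."""
    return np.where(v >= 0, 1.0, -1.0)

def sgd_step(W, x, x_pos, x_neg, eta, loss_augmented_codes):
    """One perceptron-like step for a linear hash function f(x; W) = W x.

    loss_augmented_codes(W, x, x_pos, x_neg) is assumed to return the codes
    (g, g_pos, g_neg) maximizing loss plus projection agreement, as in the
    loss-augmented inference sketched earlier.
    """
    h, h_pos, h_neg = signs(W @ x), signs(W @ x_pos), signs(W @ x_neg)
    g, g_pos, g_neg = loss_augmented_codes(W, x, x_pos, x_neg)
    # Subgradient of the bound for this triplet: loss-augmented codes enter
    # with a plus sign, the current codes with a minus sign.
    grad = (np.outer(g - h, x)
            + np.outer(g_pos - h_pos, x_pos)
            + np.outer(g_neg - h_neg, x_neg))
    return W - eta * grad
```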

62 4 Asymmetric Hamming distance When Hamming distance is used to score and retrieve the nearest neighbors to a given query, there is a high probability of a tie, where multiple items are equidistant from the query in Hamming space. [sent-174, score-0.518]

63 To break ties and improve the similarity measure, previous work suggests the use of an asymmetric Hamming (AH) distance [9, 14]. [sent-175, score-0.333]

64 With an AH distance, one stores dataset entries as binary codes (for storage efficiency) but the queries are not binarized. [sent-176, score-0.418]

65 An asymmetric distance function is therefore defined on a real-valued query vector, v ∈ Rq , and a database binary code, h ∈ H. [sent-177, score-0.358]

66 Computing AH distance is slightly less efficient than Hamming distance, and efficient retrieval algorithms, such as [25], are not directly applicable. [sent-178, score-0.224]

67 Nevertheless, the AH distance can also be used to re-rank items retrieved using Hamming distance, with a negligible increase in run-time. [sent-179, score-0.233]

68 To improve efficiency further when there are many codes to be re-ranked, AH distance from the query to binary codes can be pre-computed for each 8 or 16 consecutive bits, and stored in a query-specific lookup table. [sent-180, score-0.808]
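The per-byte lookup-table trick can be sketched as below. The exact functional form of the AH distance is not specified in this summary, so a squared difference between the real-valued query projection and the ±1 code bits is assumed purely for illustration; the names are ours:

```python
import numpy as np

def build_ah_tables(v: np.ndarray) -> np.ndarray:
    """Query-specific lookup tables for an asymmetric Hamming-style distance.

    v is the real-valued query projection f(x) of length q (q a multiple of 8).
    Returns a (q // 8, 256) table where tables[j, pattern] is the contribution
    of byte j of a database code whose 8 bits equal `pattern`.
    """
    q = len(v)
    tables = np.zeros((q // 8, 256))
    for j in range(q // 8):
        chunk = v[8 * j: 8 * (j + 1)]
        for pattern in range(256):
            bits = np.array([1.0 if (pattern >> k) & 1 else -1.0 for k in range(8)])
            tables[j, pattern] = np.sum((bits - chunk) ** 2)  # assumed per-bit score
    return tables

def ah_distance(tables: np.ndarray, code_bytes: bytes) -> float:
    """Score a packed database code with one table lookup per byte."""
    return float(sum(tables[j, b] for j, b in enumerate(code_bytes)))
```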

69 Here we use the AH distance between a database code b(x ) and the real-valued projection for the query f (x). [sent-183, score-0.232]

70 First, instead of using a single training triplet to estimate the gradients, we use mini-batches comprising 100 triplets and average the gradient. [sent-190, score-0.474]

71 Second, for each triplet (x, x+ , x− ), we replace x− with a “hard” example by selecting an item among all negative examples in the mini-batch that is closest in the current Hamming distance to b(x). [sent-191, score-0.529]
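A sketch of the in-batch hard-negative selection described here (names ours):

```python
import numpy as np

def pick_hard_negative(h_anchor: np.ndarray, negative_codes: np.ndarray) -> int:
    """Return the index of the in-batch negative whose current code is closest
    (in Hamming distance) to the anchor's code -- the 'hard' negative.

    h_anchor       : {-1, +1} code of the anchor, shape (q,)
    negative_codes : codes of all negatives in the mini-batch, shape (n, q)
    """
    dists = np.sum(negative_codes != h_anchor, axis=1)  # Hamming distances
    return int(np.argmin(dists))
```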

72 [Figure 1, left panel legend:] two-layer net with triplet loss; two-layer net with pairwise loss; linear with triplet loss; linear with pairwise loss [24]. [sent-200, score-0.978]

73 [Figure 1, right panel legend:] 128-bit, 64-bit, and 32-bit linear codes with the triplet loss; Euclidean distance. Figure 1: MNIST precision@k: (left) four methods (with 32-bit codes); (right) three code lengths. [sent-204, score-1.338]

74 Empirically, we observe that including this term in the objective improves the quality of binary codes, especially with the triplet ranking loss. [sent-210, score-0.6]

75 6 Experiments Our experiments evaluate Hamming distance metric learning using two families of hash functions, namely, linear transforms and multilayer neural networks (see Sec. [sent-222, score-0.611]

76 For each, we examine two loss functions, the pairwise hinge loss (Eq. [sent-224, score-0.451]

77 Ground-truth similarity labels are derived from class labels; items from the same class are deemed similar (footnote 4). [sent-228, score-0.222]

78 the fraction of k closest items in Hamming distance that are same-class neighbors. [sent-235, score-0.233]
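The precision@k measure described here can be computed as in the following sketch (names ours):

```python
import numpy as np

def precision_at_k(query_code, query_label, db_codes, db_labels, k: int) -> float:
    """Fraction of the k Hamming-nearest database codes sharing the query's label."""
    dists = np.sum(db_codes != query_code, axis=1)   # Hamming distances to all items
    nearest = np.argsort(dists, kind="stable")[:k]   # indices of the k closest codes
    return float(np.mean(db_labels[nearest] == query_label))
```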

79 Such high-dimensional inputs are challenging for learning similarity-preserving hash functions. [sent-246, score-0.315]

80 MNIST: We optimize binary hash functions, mapping raw MNIST images to 32, 64, and 128-bit codes. [sent-248, score-0.456]

81 For each test code we find the k closest training codes using Hamming distance, and report precision@k in Fig. [sent-249, score-0.34]

82 3) yields better performance than the pairwise hinge loss. (Footnote 4: Training triplets are created by taking two items from the same class, and one item from a different class.) [sent-253, score-0.254]

83 Further, note that these Euclidean results effectively provide an upper bound on the performance one would expect with existing hashing methods that preserve Euclidean distances (e.g. [sent-294, score-0.347]

84 To focus on the quality of the hash functions, and the speed of retrieval for large-scale multimedia datasets, we use a kNN classifier; i. [sent-298, score-0.399]

85 The ranking hinge loss also improves upon the pairwise hinge loss, even though the former has no hyperparameters. [sent-305, score-0.554]

86 The above results show that our Hamming distance metric learning framework can preserve sufficient semantic similarity, to the extent that Hamming kNN classification becomes competitive with state-of-the-art discriminative methods. [sent-316, score-0.404]

87 CIFAR-10: On CIFAR-10 we optimize hash functions for 64, 128, 256, and 512-bit codes. [sent-322, score-0.352]

88 The supplementary material includes precision@k curves, showing superior quality of hash functions learned by the ranking loss compared to the pairwise loss. [sent-323, score-0.68]

89 2, we depict the quality of retrieval results for two queries, showing the 16 nearest neighbors using 256-bit codes, 64-bit codes (both learned with the triplet ranking loss), and Euclidean distance in the original 6400-D feature space. [sent-325, score-1.093]

90 [Figure 2 panel titles: (Hamming on 256-bit codes), (Hamming on 64-bit codes), (Euclidean distance).] Figure 2: Retrieval results for two CIFAR-10 test images using Hamming distance on 256-bit and 64-bit codes, and Euclidean distance on bag-of-words features. [sent-328, score-0.481]

91 [Table: CIFAR-10 test accuracy with 64-bit codes, comparing linear hashing trained with the pairwise hinge loss [24] and with the triplet ranking loss, each evaluated with Hamming (H) and asymmetric Hamming (AH) distance and a kNN classifier (7-NN, 8-NN, 2-NN, and 2-NN, respectively), against two baselines: a one-vs-all linear SVM [6] and Euclidean 3NN; accuracy values omitted.] [sent-330, score-1.601]

92 Euclidean NN on the 6400-D input features yields under 60% accuracy, while kNN with the binary codes obtains 76–78%. [sent-351, score-0.357]

93 Not surprisingly, training fully-connected neural nets on 6400-dimensional features with only 50,000 training examples is challenging and susceptible to over-fitting; hence the results of neural nets on CIFAR-10 were not competitive. [sent-353, score-0.224]

94 7 Conclusion We present a framework for Hamming distance metric learning, which entails learning a discrete mapping from the input space onto binary codes. [sent-356, score-0.343]

95 This framework accommodates different families of hash functions, including quantized linear transforms, and multilayer neural nets. [sent-357, score-0.385]

96 By using a piecewise-smooth upper bound on a triplet ranking loss, we optimize hash functions that are shown to preserve semantic similarity on complex datasets. [sent-358, score-1.206]

97 In particular, our experiments show that a simple kNN classifier on the learned binary codes is competitive with sophisticated discriminative classifiers. [sent-359, score-0.381]

98 One appeal of this approach is the scalability of kNN search on binary codes to billions of data points, and of kNN classification to millions of class labels. [sent-362, score-0.415]

99 Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. [sent-410, score-0.283]

100 Learning globally-consistent local distance functions for shape-based image retrieval and classification. [sent-425, score-0.288]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('hamming', 0.453), ('triplet', 0.382), ('hash', 0.288), ('codes', 0.264), ('knn', 0.26), ('ah', 0.182), ('bits', 0.151), ('distance', 0.147), ('hashing', 0.146), ('nn', 0.141), ('ranking', 0.125), ('mnist', 0.124), ('loss', 0.122), ('similarity', 0.108), ('pairwise', 0.107), ('hinge', 0.1), ('binary', 0.093), ('items', 0.086), ('mappings', 0.082), ('nets', 0.081), ('bit', 0.081), ('metric', 0.079), ('preserve', 0.078), ('asymmetric', 0.078), ('retrieval', 0.077), ('semantic', 0.076), ('ei', 0.07), ('euclidean', 0.065), ('nearest', 0.063), ('triplets', 0.061), ('net', 0.058), ('jacobian', 0.056), ('cont', 0.051), ('classi', 0.051), ('rq', 0.05), ('neighbor', 0.049), ('tanh', 0.047), ('dissimilar', 0.047), ('precision', 0.046), ('preserving', 0.045), ('code', 0.045), ('upper', 0.043), ('norouzi', 0.042), ('bound', 0.042), ('query', 0.04), ('convolutional', 0.04), ('multilayer', 0.039), ('mlh', 0.039), ('distances', 0.038), ('functions', 0.038), ('quantization', 0.037), ('families', 0.036), ('storage', 0.035), ('neighbors', 0.035), ('rhs', 0.035), ('multimedia', 0.034), ('kulis', 0.033), ('itq', 0.032), ('training', 0.031), ('discontinuous', 0.03), ('hyperbolic', 0.03), ('billions', 0.03), ('suited', 0.029), ('sign', 0.029), ('svms', 0.029), ('accordingly', 0.029), ('labels', 0.028), ('compact', 0.028), ('cvpr', 0.028), ('search', 0.028), ('inputs', 0.027), ('delity', 0.027), ('queries', 0.026), ('owing', 0.026), ('lsh', 0.026), ('er', 0.026), ('layer', 0.026), ('optimize', 0.026), ('gt', 0.026), ('image', 0.026), ('reports', 0.025), ('argmax', 0.025), ('rates', 0.025), ('exible', 0.025), ('datasets', 0.025), ('pair', 0.025), ('hinton', 0.025), ('images', 0.025), ('discriminative', 0.024), ('mahalanobis', 0.024), ('mapping', 0.024), ('piecewise', 0.024), ('broad', 0.024), ('gpu', 0.024), ('massive', 0.022), ('ers', 0.022), ('svm', 0.022), ('aside', 0.022), ('coates', 0.022), ('linear', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 148 nips-2012-Hamming Distance Metric Learning

Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1

2 0.23427066 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

Author: Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik

Abstract: This paper focuses on the problem of learning binary codes for efficient retrieval of high-dimensional non-negative data that arises in vision and text applications where counts or frequencies are used as features. The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art. 1

3 0.20734984 313 nips-2012-Sketch-Based Linear Value Function Approximation

Author: Marc Bellemare, Joel Veness, Michael Bowling

Abstract: Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance obtained when using tug-of-war hashing. 1

4 0.20459287 71 nips-2012-Co-Regularized Hashing for Multimodal Data

Author: Yi Zhen, Dit-Yan Yeung

Abstract: Hashing-based methods provide a very promising approach to large-scale similarity search. To obtain compact hash codes, a recent trend seeks to learn the hash functions from data automatically. In this paper, we study hash function learning in the context of multimodal data. We propose a novel multimodal hash function learning method, called Co-Regularized Hashing (CRH), based on a boosted coregularization framework. The hash functions for each bit of the hash codes are learned by solving DC (difference of convex functions) programs, while the learning for multiple bits proceeds via a boosting procedure so that the bias introduced by the hash functions can be sequentially minimized. We empirically compare CRH with two state-of-the-art multimodal hash function learning methods on two publicly available data sets. 1

5 0.18809439 126 nips-2012-FastEx: Hash Clustering with Exponential Families

Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy

Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1

6 0.15072137 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

7 0.15007159 242 nips-2012-Non-linear Metric Learning

8 0.14421214 329 nips-2012-Super-Bit Locality-Sensitive Hashing

9 0.14268306 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

10 0.13238209 163 nips-2012-Isotropic Hashing

11 0.12912743 202 nips-2012-Locally Uniform Comparison Image Descriptor

12 0.12117808 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

13 0.11838058 257 nips-2012-One Permutation Hashing

14 0.11762369 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space

15 0.10322034 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

16 0.090836659 158 nips-2012-ImageNet Classification with Deep Convolutional Neural Networks

17 0.090355411 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

18 0.087383367 9 nips-2012-A Geometric take on Metric Learning

19 0.085228339 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking

20 0.085221082 197 nips-2012-Learning with Recursive Perceptual Representations


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.204), (1, 0.056), (2, -0.094), (3, -0.078), (4, 0.099), (5, -0.121), (6, -0.01), (7, 0.219), (8, 0.15), (9, 0.162), (10, 0.09), (11, -0.149), (12, 0.061), (13, 0.247), (14, -0.212), (15, 0.097), (16, 0.049), (17, 0.001), (18, -0.097), (19, 0.046), (20, 0.038), (21, 0.016), (22, 0.016), (23, 0.025), (24, 0.035), (25, -0.032), (26, -0.007), (27, -0.035), (28, -0.031), (29, -0.018), (30, 0.049), (31, -0.043), (32, -0.004), (33, -0.011), (34, 0.028), (35, 0.061), (36, 0.068), (37, -0.007), (38, 0.013), (39, 0.019), (40, -0.043), (41, -0.004), (42, -0.009), (43, 0.019), (44, 0.0), (45, 0.004), (46, -0.018), (47, 0.064), (48, 0.029), (49, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93440932 148 nips-2012-Hamming Distance Metric Learning

Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1

2 0.8549996 71 nips-2012-Co-Regularized Hashing for Multimodal Data

Author: Yi Zhen, Dit-Yan Yeung

Abstract: Hashing-based methods provide a very promising approach to large-scale similarity search. To obtain compact hash codes, a recent trend seeks to learn the hash functions from data automatically. In this paper, we study hash function learning in the context of multimodal data. We propose a novel multimodal hash function learning method, called Co-Regularized Hashing (CRH), based on a boosted coregularization framework. The hash functions for each bit of the hash codes are learned by solving DC (difference of convex functions) programs, while the learning for multiple bits proceeds via a boosting procedure so that the bias introduced by the hash functions can be sequentially minimized. We empirically compare CRH with two state-of-the-art multimodal hash function learning methods on two publicly available data sets. 1

3 0.80726355 329 nips-2012-Super-Bit Locality-Sensitive Hashing

Author: Jianqiu Ji, Jianmin Li, Shuicheng Yan, Bo Zhang, Qi Tian

Abstract: Sign-random-projection locality-sensitive hashing (SRP-LSH) is a probabilistic dimension reduction method which provides an unbiased estimate of angular similarity, yet suffers from the large variance of its estimation. In this work, we propose the Super-Bit locality-sensitive hashing (SBLSH). It is easy to implement, which orthogonalizes the random projection vectors in batches, and it is theoretically guaranteed that SBLSH also provides an unbiased estimate of angular similarity, yet with a smaller variance when the angle to estimate is within (0, ⇡/2]. The extensive experiments on real data well validate that given the same length of binary code, SBLSH may achieve significant mean squared error reduction in estimating pairwise angular similarity. Moreover, SBLSH shows the superiority over SRP-LSH in approximate nearest neighbor (ANN) retrieval experiments. 1

4 0.79118085 42 nips-2012-Angular Quantization-based Binary Codes for Fast Similarity Search

Author: Yunchao Gong, Sanjiv Kumar, Vishal Verma, Svetlana Lazebnik

Abstract: This paper focuses on the problem of learning binary codes for efficient retrieval of high-dimensional non-negative data that arises in vision and text applications where counts or frequencies are used as features. The similarity of such feature vectors is commonly measured using the cosine of the angle between them. In this work, we introduce a novel angular quantization-based binary coding (AQBC) technique for such data and analyze its properties. In its most basic form, AQBC works by mapping each non-negative feature vector onto the vertex of the binary hypercube with which it has the smallest angle. Even though the number of vertices (quantization landmarks) in this scheme grows exponentially with data dimensionality d, we propose a method for mapping feature vectors to their smallest-angle binary vertices that scales as O(d log d). Further, we propose a method for learning a linear transformation of the data to minimize the quantization error, and show that it results in improved binary codes. Experiments on image and text datasets show that the proposed AQBC method outperforms the state of the art. 1

5 0.73527658 257 nips-2012-One Permutation Hashing

Author: Ping Li, Art Owen, Cun-hui Zhang

Abstract: Minwise hashing is a standard procedure in the context of search, for efficiently estimating set similarities in massive binary data such as text. Recently, b-bit minwise hashing has been applied to large-scale learning and sublinear time nearneighbor search. The major drawback of minwise hashing is the expensive preprocessing, as the method requires applying (e.g.,) k = 200 to 500 permutations on the data. This paper presents a simple solution called one permutation hashing. Conceptually, given a binary data matrix, we permute the columns once and divide the permuted columns evenly into k bins; and we store, for each data vector, the smallest nonzero location in each bin. The probability analysis illustrates that this one permutation scheme should perform similarly to the original (k-permutation) minwise hashing. Our experiments with training SVM and logistic regression confirm that one permutation hashing can achieve similar (or even better) accuracies compared to the k-permutation scheme. See more details in arXiv:1208.1259.

6 0.73492748 163 nips-2012-Isotropic Hashing

7 0.68790746 313 nips-2012-Sketch-Based Linear Value Function Approximation

8 0.5367443 202 nips-2012-Locally Uniform Comparison Image Descriptor

9 0.5100078 242 nips-2012-Non-linear Metric Learning

10 0.50047719 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

11 0.485616 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

12 0.47719133 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning

13 0.47110558 265 nips-2012-Parametric Local Metric Learning for Nearest Neighbor Classification

14 0.4679127 338 nips-2012-The Perturbed Variation

15 0.43987015 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm

16 0.41213465 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space

17 0.40793791 97 nips-2012-Diffusion Decision Making for Adaptive k-Nearest Neighbor Classification

18 0.39915338 126 nips-2012-FastEx: Hash Clustering with Exponential Families

19 0.39862394 146 nips-2012-Graphical Gaussian Vector for Image Categorization

20 0.39686921 9 nips-2012-A Geometric take on Metric Learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.033), (17, 0.019), (21, 0.019), (38, 0.145), (42, 0.071), (44, 0.177), (53, 0.017), (54, 0.024), (55, 0.046), (74, 0.084), (76, 0.134), (80, 0.071), (92, 0.078)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91042638 150 nips-2012-Hierarchical spike coding of sound

Author: Yan Karklin, Chaitanya Ekanadham, Eero P. Simoncelli

Abstract: Natural sounds exhibit complex statistical regularities at multiple scales. Acoustic events underlying speech, for example, are characterized by precise temporal and frequency relationships, but they can also vary substantially according to the pitch, duration, and other high-level properties of speech production. Learning this structure from data while capturing the inherent variability is an important first step in building auditory processing systems, as well as understanding the mechanisms of auditory perception. Here we develop Hierarchical Spike Coding, a two-layer probabilistic generative model for complex acoustic structure. The first layer consists of a sparse spiking representation that encodes the sound using kernels positioned precisely in time and frequency. Patterns in the positions of first layer spikes are learned from the data: on a coarse scale, statistical regularities are encoded by a second-layer spiking representation, while fine-scale structure is captured by recurrent interactions within the first layer. When fit to speech data, the second layer acoustic features include harmonic stacks, sweeps, frequency modulations, and precise temporal onsets, which can be composed to represent complex acoustic events. Unlike spectrogram-based methods, the model gives a probability distribution over sound pressure waveforms. This allows us to use the second-layer representation to synthesize sounds directly, and to perform model-based denoising, on which we demonstrate a significant improvement over standard methods. 1

2 0.90425527 193 nips-2012-Learning to Align from Scratch

Author: Gary Huang, Marwan Mattar, Honglak Lee, Erik G. Learned-miller

Abstract: Unsupervised joint alignment of images has been demonstrated to improve performance on recognition tasks such as face verification. Such alignment reduces undesired variability due to factors such as pose, while only requiring weak supervision in the form of poorly aligned examples. However, prior work on unsupervised alignment of complex, real-world images has required the careful selection of feature representation based on hand-crafted image descriptors, in order to achieve an appropriate, smooth optimization landscape. In this paper, we instead propose a novel combination of unsupervised joint alignment with unsupervised feature learning. Specifically, we incorporate deep learning into the congealing alignment framework. Through deep learning, we obtain features that can represent the image at differing resolutions based on network depth, and that are tuned to the statistics of the specific data being aligned. In addition, we modify the learning algorithm for the restricted Boltzmann machine by incorporating a group sparsity penalty, leading to a topographic organization of the learned filters and improving subsequent alignment results. We apply our method to the Labeled Faces in the Wild database (LFW). Using the aligned images produced by our proposed unsupervised algorithm, we achieve higher accuracy in face verification compared to prior work in both unsupervised and supervised alignment. We also match the accuracy for the best available commercial method. 1

3 0.87200552 81 nips-2012-Context-Sensitive Decision Forests for Object Detection

Author: Peter Kontschieder, Samuel R. Bulò, Antonio Criminisi, Pushmeet Kohli, Marcello Pelillo, Horst Bischof

Abstract: In this paper we introduce Context-Sensitive Decision Forests - A new perspective to exploit contextual information in the popular decision forest framework for the object detection problem. They are tree-structured classifiers with the ability to access intermediate prediction (here: classification and regression) information during training and inference time. This intermediate prediction is available for each sample and allows us to develop context-based decision criteria, used for refining the prediction process. In addition, we introduce a novel split criterion which in combination with a priority based way of constructing the trees, allows more accurate regression mode selection and hence improves the current context information. In our experiments, we demonstrate improved results for the task of pedestrian detection on the challenging TUD data set when compared to state-of-the-art methods. 1
Bottom row: Inverted Hough image using [9] and breadth-first training after 6 levels (26 = 64 nodes), Inverted Hough image after growing 64 nodes using our priority queue, Inverted Hough image using priority queue shows distinctive peaks at the end of training. a pioneering work on relaxation labelling [14] and a later work with focus on inference tasks [20] that addressed the issue of learning within the same framework. More recently, contextual information has been used in the field of object class segmentation [21], however, mostly for high-level reasoning in random field models or to resolve contradicting segmentation results. The introduction of contextual information as additional features in low-level classifiers was initially proposed in the Auto-context [25] and Semantic Texton Forest [24] models. Auto-context shows a general approach for classifier boosting by iteratively learning from appearance and context information. In this line of research [18] augmented the feature space for an Entanglement Random Forest with a classification feature, that is consequently refined by the class posterior distributions according to the progress of the trained subtree. The training procedure is allowed to perform tests for specific, contextual label configurations which was demonstrated to significantly improve the segmentation results. However, the In this paper we are presenting Context-Sensitve Decision Forests - A novel and unified interpretation of Hough Forests in light of contextual sensitivity. Our work is inspired by Auto-Context and Entanglement Forests, but instead of providing only posterior classification results from an earlier level of the classifier construction during learning and testing, we additionally provide regression (voting) information as it is used in Hough Forests. The second core contribution of our work is related to how we grow the trees: Instead of training them in a depth- or breadth-first way, we propose a priority-based construction (which could actually consider depth- or breadth-first as particular cases). The priority is determined by the current training error, i.e. we first grow the parts of the tree where we experience higher error. To this end, we introduce a unified splitting criterion that estimates the joint error of classification and regression. The consequence of using our priority-based training are illustrated in Figure 1: Given the training image with corresponding label image (top row, images 1 and 2), the tree first tries to learn the foreground samples as shown in the color-coded plot (top row, image 3, colors correspond to index number of nodes in the tree). The effects on the intermediate prediction quality are shown in the bottom row for the regression case: The first image shows the regression quality after training a tree with 6 levels (26 = 64 nodes) in a breadth-first way while the second image shows the progress after growing 64 nodes according to the priority based training. Clearly, the modes for the center hypotheses are more distinctive which in turn yields to more accurate intermediate regression information that can be used for further tree construction. Our third contribution is a new family of split functions that allows to learn from training images containing multiple training instances as shown for the pedestrians in the example. 
We introduce a test that checks the centroid compatibility for pairs of training samples taken from the context, based on the intermediate classification and regression derived as described before. To assess our contributions, we performed several experiments on the challenging TUD pedestrian data set [2], yielding a significant improvement of 9% in the recall at 90% precision rate in comparison to standard Hough Forests, when learning from crowded pedestrian images. 2 2 Context-Sensitive Decision Trees This section introduces the general idea behind the context-sensitive decision forest without references to specific applications. Only in Section 3 we show a particular application to the problem of object detection. After showing some basic notational conventions that are used in the paper, we provide a section that revisits the random forest framework for classification and regression tasks from a joint perspective, i.e. a theory allowing to consider e.g. [1, 11] and [9] in a unified way. Starting from this general view we finally introduce the context-sensitive forests in 2.2. Notations. In the paper we denote vectors using boldface lowercase (e.g. d, u, v) and sets by using uppercase calligraphic (e.g. X , Y) symbols. The sets of real, natural and integer numbers are denoted with R, N and Z as usually. We denote by 2X the power set of X and by 1 [P ] the indicator function returning 1 or 0 according to whether the proposition P is true or false. Moreover, with P(Y) we denote the set of probability distributions having Y as sample space and we implicitly assume that some σ-algebra is defined on Y. We denote by δ(x) the Dirac delta function. Finally, Ex∼Q [f (x)] denotes the expectation of f (x) with respect to x sampled according to distribution Q. 2.1 Random Decision Forests for joint classification and regression A (binary) decision tree is a tree-structured predictor1 where, starting from the root, a sample is routed until it reaches a leaf where the prediction takes place. At each internal node of the tree the decision is taken whether the sample should be forwarded to the left or right child, according to a binary-valued function. In formal terms, let X denote the input space, let Y denote the output space and let T dt be the set of decision trees. In its simplest form a decision tree consists of a single node (a leaf ) and is parametrized by a probability distribution Q ∈ P(Y) which represents the posterior probability of elements in Y given any data sample reaching the leaf. We denote this (admittedly rudimentary) tree as L F (Q) ∈ T td . Otherwise, a decision tree consists of a node with a left and a right sub-tree. This node is parametrized by a split function φ : X → {0, 1}, which determines whether to route a data sample x ∈ X reaching it to the left decision sub-tree tl ∈ T dt (if φ(x) = 0) or to the right one tr ∈ T dt (if φ(x) = 1). We denote such a tree as N D (φ, tl , tr ) ∈ T td . Finally, a decision forest is an ensemble F ⊆ T td of decision trees which makes a prediction about a data sample by averaging over the single predictions gathered from all trees. Inference. Given a decision tree t ∈ T dt , the associated posterior probability of each element in Y given a sample x ∈ X is determined by finding the probability distribution Q parametrizing the leaf that is reached by x when routed along the tree. 
This is compactly presented with the following definition of P (y|x, t), which is inductive in the structure of t:  if t = L F (Q) Q(y) P (y | x, t ) = P (y | x, tl ) if t = N D (φ, tl , tr ) and φ(x) = 0 (1)  P (y | x, tr ) if t = N D (φ, tl , tr ) and φ(x) = 1 . Finally, the combination of the posterior probabilities derived from the trees in a forest F ⊆ T dt can be done by an averaging operation [6], yielding a single posterior probability for the whole forest: P (y|x, F) = 1 |F| P (y|x, t) . (2) t∈F Randomized training. A random forest is created by training a set of random decision trees independently on random subsets of the training data D ⊆ X ×Y. The training procedure for a single decision tree heuristically optimizes a set of parameters like the tree structure, the split functions at the internal nodes and the density estimates at the leaves in order to reduce the prediction error on the training data. In order to prevent overfitting problems, the search space of possible split functions is limited to a random set and a minimum number of training samples is required to grow a leaf node. During the training procedure, each new node is fed with a set of training samples Z ⊆ D. If some stopping condition holds, depending on Z, the node becomes a leaf and a density on Y is estimated based on Z. Otherwise, an internal node is grown and a split function is selected from a pool of random ones in a way to minimize some sort of training error on Z. The selected split function induces a partition 1 we use the term predictor because we will jointly consider classification and regression. 3 of Z into two sets, which are in turn becoming the left and right childs of the current node where the training procedure is continued, respectively. We will now write this training procedure in more formal terms. To this end we introduce a function π(Z) ∈ P(Y) providing a density on Y estimated from the training data Z ⊆ D and a loss function L(Z | Q) ∈ R penalizing wrong predictions on the training samples in Z, when predictions are given according to a distribution Q ∈ P(Y). The loss function L can be further decomposed in terms of a loss function (·|Q) : Y → R acting on each sample of the training set: L(Z | Q) = (y | Q) . (3) (x,y)∈Z Also, let Φ(Z) be a set of split functions randomly generated for a training set Z and given a split φ function φ ∈ Φ(Z), we denote by Zlφ and Zr the sets identified by splitting Z according to φ, i.e. Zlφ = {(x, y) ∈ Z : φ(x) = 0} and φ Zr = {(x, y) ∈ Z : φ(x) = 1} . We can now summarize the training procedure in terms of a recursive function g : 2X ×Y → T , which generates a random decision tree from a training set given as argument: g(Z) = L F (π(Z)) ND if some stopping condition holds φ φ, g(Zlφ ), g(Zr ) otherwise . (4) Here, we determine the optimal split function φ in the pool Φ(Z) as the one minimizing the loss we incur as a result of the node split: φ φ ∈ arg min L(Zlφ ) + L(Zr ) : φ ∈ Φ(Z) (5) where we compactly write L(Z) for L(Z|π(Z)), i.e. the loss on Z obtained with predictions driven by π(Z). A typical split function selection criterion commonly adopted for classification and regression is information gain. The equivalent counterpart in terms of loss can be obtained by using a log-loss, i.e. (y|Q) = − log(Q(y)). A further widely used criterion is based on Gini impurity, which can be expressed in this setting by using (y|Q) = 1 − Q(y). 
Finally, the stopping condition that is used in (4) to determine whether to create a leaf or to continue branching the tree typically consists in checking |Z|, i.e. the number of training samples at the node, or the loss L(Z) are below some given thresholds, or if a maximum depth is reached. 2.2 Context-sensitive decision forests A context-sensitive (CS) decision tree is a decision tree in which split functions are enriched with the ability of testing contextual information of a sample, before taking a decision about where to route it. We generate contextual information at each node of a decision tree by exploiting a truncated version of the same tree as a predictor. This idea is shared with [18], however, we introduce some novelties by tackling both, classification and regression problems in a joint manner and by leaving a wider flexibility in the tree truncation procedure. We denote the set of CS decision trees as T . The main differences characterizing a CS decision tree t ∈ T compared with a standard decision tree are the following: a) every node (leaves and internal nodes) of t has an associated probability distribution Q ∈ P(Y) representing the posterior probability of an element in Y given any data sample reaching it; b) internal nodes are indexed with distinct natural numbers n ∈ N in a way to preserve the property that children nodes have a larger index compared to their parent node; c) the split function at each internal node, denoted by ϕ(·|t ) : X → {0, 1}, is bound to a CS decision tree t ∈ T , which is a truncated version of t and can be used to compute intermediate, contextual information. Similar to Section 2.1 we denote by L F (Q) ∈ T the simplest CS decision tree consisting of a single leaf node parametrized by the distribution Q, while we denote by N D (n, Q, ϕ, tl , tr ) ∈ T , the rest of the trees consisting of a node having a left and a right sub-tree, denoted by tl , tr ∈ T respectively, and being parametrized by the index n, a probability distribution Q and the split function ϕ as described above. As shown in Figure 2, the truncation of a CS decision tree at each node is obtained by exploiting the indexing imposed on the internal nodes of the tree. Given a CS decision tree t ∈ T and m ∈ N, 4 1 1 4 2 3 6 2 5 4 3 (b) The truncated version t(<5) (a) A CS decision tree t Figure 2: On the left, we find a CS decision tree t, where only the internal nodes are indexed. On the right, we see the truncated version t(<5) of t, which is obtained by converting to leaves all nodes having index ≥ 5 (we marked with colors the corresponding node transformations). we denote by t( < τ 2 In the experiments conducted, we never exceeded 10 iterations for finding a mode. 6 (8) where Pj = P (·|(u + hj , I), t), with j = 1, 2, are the posterior probabilities obtained from tree t given samples at position u+h1 and u+h2 of image I, respectively. Please note that this test should not be confused with the regression split criterion in [9], which tries to partition the training set in a way to group examples with similar voting direction and length. Besides the novel context-sensitive split function we employ also standard split functions performing tests on X as defined in [24]. 4 Experiments To assess our proposed approach, we have conducted several experiments on the task of pedestrian detection. Detecting pedestrians is very challenging for Hough-voting based methods as they typically exhibit strong articulations of feet and arms, yielding to non-distinctive hypotheses in the Hough space. 
We evaluated our method on the TUD pedestrian data base [2] in two different ways: First, we show our detection results with training according to the standard protocol using 400 training images (where each image contains a single annotation of a pedestrian) and evaluation on the Campus and Crossing scenes, respectively (Section 4.1). With this experiment we show the improvement over state-of-the-art approaches when learning can be performed with simultaneous knowledge about context information. In a second variation (Section 4.2), we use the images of the Crossing scene (201 images) as a training set. Most images of this scene contain more than four persons with strong overlap and mutual occlusions. However, instead of using the original annotation which covers only pedestrians with at least 50% overlap (1008 bounding boxes), we use the more accurate, pixel-wise ground truth annotations of [23] for the entire scene that includes all persons and consists of 1215 bounding boxes. Please note that this annotation is even more detailed than the one presented in [4] with 1018 bounding boxes. The purpose of the second experiment is to show that our context-sensitive forest can exploit the availability of multiple training instances significantly better than state-of-the-art. The most related work and therefore also the baseline in our experiments is the Hough Forest [9]. To guarantee a fair comparison, we use the same training parameters for [9] and our context sensitive forest: We trained 20 trees and the training data (including horizontally flipped images) was sampled homogeneously per category per image. The patch size was fixed to 30 × 30 and we performed 1600 node tests for finding the best split function parameters per node. The trees were stopped growing when < 7 samples were available. As image features, we used the the first 16 feature channels provided in the publicly available Hough Forest code of [9]. In order to obtain the object detection hypotheses from the Hough space, we use the same Non-maximum suppression (NMS) technique in all our experiments as suggested in [9]. To evaluate the obtained hypotheses, we use the standard PASAL-VOC criterion which requires the mutual overlap between ground truth and detected bounding boxes to be ≥ 50%. The additional parameter of (7) was fixed to σ = 7. 4.1 Evaluation using standard protocol training set The standard training set contains 400 images where each image comes with a single pedestrian annotation. For our experiments, we rescaled the images by a factor of 0.5 and doubled the training image set by including also the horizontally flipped images. We randomly chose 125 training samples per image for foreground and background, resulting in 2 · 400 · 2 · 125 = 200k training samples per tree. For additional comparisons, we provide the results presented in the recent work on joint object detection and segmentation of [23], from which we also provide evaluation results of the Implicit Shape Model (ISM) [16]. However, please note that the results of [23] are based on a different baseline implementation. Moreover, we show the results of [4] when using the provided code and configuration files from the first authors homepage. Unfortunately, we could not reproduce the results of the original paper. First, we discuss the results obtained on the Campus scene. This data set consists of 71 images showing walking pedestrians at severe scale differences and partial occlusions. 
The ground truth we use has been released with [4] and contains a total of 314 pedestrians. Figure 3, first row, plot 1 shows the precision-recall curves when using 3 scales (factors 0.3, 0.4, 0.55) for our baseline [9] (blue), results from re-evaluating [4] (cyan, 5 scales), [23] (green), and our Context-Sensitive Forest without and with the priority-queue-based tree construction (red/magenta). When not using the priority queue, we trained the trees in a breadth-first manner. We obtain a performance boost of ≈ 6% in recall at a precision of 90% when using both context information and the priority-based construction of our forest. The second plot in the first row of Figure 3 shows the results when the same forests are tested on the Crossing scene, using the more detailed ground truth annotations.

Figure 3: Precision-recall curves for detections. Top row: standard training (400 images), evaluation on TUD Campus and TUD-Crossing (3 scales); the curves compare the baseline Hough Forest [9], Barinova et al. [4] (CVPR'10), Riemenschneider et al. [23] (ECCV'12), Leibe et al. [16] (IJCV'08, 1 scale), and the proposed Context-Sensitive forest without and with the priority queue. Bottom row: training on the Crossing annotations of [23], evaluation on Campus at 3 and 5 scales, comparing the baseline Hough Forest and the proposed Context-Sensitive forest without and with the priority queue. Right images: qualitative examples for Campus (top 2) and Crossing (bottom 2) scenes; (green) correctly found by our method, (blue) ground truth, (red) wrong association, (cyan) missed detection.

The Crossing data set shows walking pedestrians (Figure 3, right side, last 2 images) with a smaller variation in scale than the Campus scene but with strong mutual occlusions and overlaps. The improvement with respect to the baseline is lower (≈ 2% gain at a precision of 90%) and the curves develop similarly. However, this is somewhat expected, as the training data does not properly reflect the occlusions we actually want to model.

4.2 Evaluation on Campus scene using Crossing scene as training set

In our next experiment we trained the forests (same parameters) on the novel annotations of [23] for the Crossing scene. Please note that this reduces the training set to only 201 images (we did not include the flipped images). Qualitative detection results are shown in Figure 3, right side, images 1 and 2. From the first precision-recall curve in the second row of Figure 3 we can see that the margin between the baseline and our proposed method is clearly improved (a gain of ≈ 9% recall at a precision of 90%) when evaluating on the same 3 scales.
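The precision and recall numbers quoted here and in the following experiment can be computed with a routine like the minimal sketch below: detections are sorted by confidence, greedily matched to unmatched ground-truth boxes with IoU ≥ 0.5 (reusing iou() from the sketch after Section 4.1), and (precision, recall) pairs are accumulated; recall at a fixed precision, e.g. 90%, can then be read off the curve. This is an illustrative routine under these assumptions, not the authors' exact evaluation protocol, which aggregates detections over images and scales.

    def precision_recall_curve(detections, ground_truth, thr=0.5):
        """detections: list of (score, box); ground_truth: list of boxes.
        Greedily match detections (highest score first) to unmatched ground-truth
        boxes with IoU >= thr and accumulate (precision, recall) points."""
        detections = sorted(detections, key=lambda d: d[0], reverse=True)
        matched = [False] * len(ground_truth)
        n_gt = max(len(ground_truth), 1)
        tp = fp = 0
        curve = []
        for score, box in detections:
            best, best_overlap = -1, thr
            for i, gt in enumerate(ground_truth):
                if matched[i]:
                    continue
                overlap = iou(box, gt)   # iou() as defined in the earlier sketch
                if overlap >= best_overlap:
                    best, best_overlap = i, overlap
            if best >= 0:
                matched[best] = True
                tp += 1
            else:
                fp += 1
            curve.append((tp / (tp + fp), tp / n_gt))
        return curve

    def recall_at_precision(curve, min_precision=0.9):
        """E.g. the 'recall at a precision of 90%' figures quoted in the text."""
        return max((r for p, r in curve if p >= min_precision), default=0.0)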
With evaluation on 5 scales (factors 0.34, 0.42, 0.51, 0.65, 0.76) we found a strong increase in recall, however at the cost of losing 2–3% of precision below a recall of 60%, as illustrated in the second plot of row 2 in Figure 3. While our method is able to maintain a precision above 90% up to a recall of ≈ 83%, the baseline implementation already drops at a recall of ≈ 20%.

5 Conclusions

In this work we have presented Context-Sensitive Decision Forests with application to the object detection problem. Our new forest has the ability to access intermediate prediction (classification and regression) information about all samples of the training set and can therefore learn from contextual information throughout the growing process. This is in contrast to existing random forest methods used for object detection, which typically treat training samples independently. Moreover, we have introduced a novel splitting criterion together with a mode isolation technique, which allows us to (a) perform priority-driven tree growing and (b) install novel context-based test functions that check for mutual object centroid agreement. In our experimental results on pedestrian detection we demonstrated superior performance with respect to state-of-the-art methods and additionally found that our new algorithm can exploit training data containing multiple objects significantly better.

Acknowledgements. Peter Kontschieder acknowledges financial support of the Austrian Science Fund (FWF) under project 'Fibermorph', number P22261-N22.

References
[1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 1997.
[2] M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In CVPR, 2008.
[3] D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2), 1981.
[4] O. Barinova, V. Lempitsky, and P. Kohli. On detection of multiple object instances using Hough transforms. In CVPR, 2010.
[5] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using random forests and ferns. In ICCV, 2007.
[6] L. Breiman. Random forests. Machine Learning, 2001.
[7] A. Criminisi, J. Shotton, and E. Konukoglu. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7:81–227, 2012.
[8] A. Criminisi, J. Shotton, D. Robertson, and E. Konukoglu. Regression forests for efficient anatomy detection and localization in CT scans. In MICCAI-MCV Workshop, 2010.
[9] J. Gall and V. Lempitsky. Class-specific Hough forests for object detection. In CVPR, 2009.
[10] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. PAMI, 2011.
[11] P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Machine Learning, 2006.
[12] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon. Efficient regression of general-activity human poses from depth images. In ICCV, 2011.
[13] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, 2009.
[14] R. A. Hummel and S. W. Zucker. On the foundations of relaxation labeling. PAMI, 5(3):267–287, 1983.
[15] P. Kontschieder, S. Rota Bulò, H. Bischof, and M. Pelillo. Structured class-labels in random forests for semantic image labelling. In ICCV, 2011.
[16] B. Leibe, A. Leonardis, and B. Schiele. Robust object detection with interleaved categorization and segmentation. IJCV, 2008.
[17] R. Marée, P. Geurts, J. Piater, and L. Wehenkel. Random subwindows for robust image classification. In CVPR, 2005.
[18] A. Montillo, J. Shotton, J. Winn, J. E. Iglesias, D. Metaxas, and A. Criminisi. Entangled decision forests and their application for semantic segmentation of CT images. In IPMI, 2011.
[19] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. In NIPS, 2006.
[20] M. Pelillo and M. Refice. Learning compatibility coefficients for relaxation labeling processes. PAMI, 16(9):933–945, 1994.
[21] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
[22] N. Razavi, J. Gall, and L. Van Gool. Scalable multi-class object detection. In CVPR, 2011.
[23] H. Riemenschneider, S. Sternig, M. Donoser, P. M. Roth, and H. Bischof. Hough regions for joining instance localization and segmentation. In ECCV, 2012.
[24] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In CVPR, 2008.
[25] Z. Tu. Auto-context and its application to high-level vision tasks. In CVPR, 2008.
[26] O. Woodford, M. Pham, A. Maki, F. Perbet, and B. Stenger. Demisting the Hough transform for 3D shape recognition and registration. In BMVC, 2011.

same-paper 4 0.87172586 148 nips-2012-Hamming Distance Metric Learning

Author: Mohammad Norouzi, David M. Blei, Ruslan Salakhutdinov

Abstract: Motivated by large-scale multimedia applications we propose to learn mappings from high-dimensional data to binary codes that preserve semantic similarity. Binary codes are well suited to large-scale applications as they are storage efficient and permit exact sub-linear kNN search. The framework is applicable to broad families of mappings, and uses a flexible form of triplet ranking loss. We overcome discontinuous optimization of the discrete mappings by minimizing a piecewise-smooth upper bound on empirical loss, inspired by latent structural SVMs. We develop a new loss-augmented inference algorithm that is quadratic in the code length. We show strong retrieval performance on CIFAR-10 and MNIST, with promising classification results using no more than kNN on the binary codes. 1

5 0.82534367 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

Author: Emile Richard, Stephane Gaiffas, Nicolas Vayatis

Abstract: In the paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward agorithm. 1

6 0.78346753 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

7 0.77849519 260 nips-2012-Online Sum-Product Computation Over Trees

8 0.77478373 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

9 0.77182168 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

10 0.77170175 163 nips-2012-Isotropic Hashing

11 0.77157068 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

12 0.77073652 71 nips-2012-Co-Regularized Hashing for Multimodal Data

13 0.76976722 176 nips-2012-Learning Image Descriptors with the Boosting-Trick

14 0.76969999 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

15 0.76902902 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

16 0.76732677 38 nips-2012-Algorithms for Learning Markov Field Policies

17 0.7659176 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

18 0.76556039 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

19 0.76473981 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

20 0.76430655 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation