nips nips2008 nips2008-219 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yair Weiss, Antonio Torralba, Rob Fergus
Abstract: Semantic hashing [1] seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state of the art. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract. Semantic hashing [1] seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. [sent-9, score-0.453]
2 In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard. [sent-10, score-0.299]
3 By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. [sent-11, score-0.472]
4 By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. [sent-12, score-0.68]
5 Taken together, both learning the code and applying it to a novel point are extremely simple. [sent-13, score-0.241]
6 To categorize a novel image, they simply searched for similar images in the dataset and used the labels of these retrieved images to predict the label of the novel image. [sent-19, score-0.274]
7 The code is constructed so that similar items will have similar binary codewords and there is a simple feedforward network that can calculate the binary code for a novel input. [sent-24, score-0.679]
8 Retrieving similar neighbors is then done simply by retrieving all items with codes within a small Hamming distance of the code for the query. [sent-25, score-0.673]
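As an illustration of this lookup step, here is a minimal brute-force sketch (the function name hamming_search and the dense 0/1 arrays are our assumptions; a production system would bucket codes by value and enumerate nearby buckets instead of scanning the whole database):

```python
import numpy as np

def hamming_search(query_code, db_codes, radius=2):
    """Return indices of items whose binary code is within `radius`
    Hamming distance of the query code.

    query_code : (k,) array of 0/1 bits
    db_codes   : (n, k) array of 0/1 bits
    """
    # Compare the query against every stored code and count the differing bits.
    dists = np.count_nonzero(db_codes != query_code, axis=1)
    return np.nonzero(dists <= radius)[0]

# Example: one million random 32-bit codes, query with the first one.
rng = np.random.default_rng(0)
db = rng.integers(0, 2, size=(1_000_000, 32), dtype=np.uint8)
hits = hamming_search(db[0], db, radius=2)
```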
9 The key for this method to work is to learn a good code for the dataset. [sent-27, score-0.18]
10 We need a code that is (1) easily computed for a novel input (2) requires a small number of bits to code the full dataset and (3) maps similar items to similar binary codewords. [sent-28, score-0.964]
11 To simplify the problem, we will assume that the items have already been embedded in a Euclidean space, say Rd , in which Euclidean distance correlates with the desired similarity. [sent-29, score-0.187]
12 If we forget about the requirement of having a small number of bits in the codewords, then it is easy to design a binary code so that items that are close in Euclidean space will map to similar binary codewords. [sent-37, score-0.742]
13 This is the basis of the popular locality sensitive hashing method E2LSH [8]. [sent-38, score-0.449]
14 As shown in [8], if every bit in the code is calculated by a random linear projection followed by a random threshold, then the Hamming distance between codewords will asymptotically approach the Euclidean distance between the items. [sent-39, score-0.493]
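A minimal sketch of this random-projection construction follows (illustrative only; the Gaussian projection directions and data-dependent thresholds are simplifying assumptions, not the exact E2LSH construction of [8]):

```python
import numpy as np

def random_projection_bits(X, n_bits, seed=0):
    """LSH-style codes: each bit is a random linear projection of the
    input followed by a random threshold.

    X : (n, d) array of data points; returns an (n, n_bits) 0/1 array.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_bits))   # random projection directions
    proj = X @ W
    # Random thresholds drawn from the observed range of each projection.
    t = rng.uniform(proj.min(axis=0), proj.max(axis=0))
    return (proj > t).astype(np.uint8)

# Example: hash 1000 points in R^8 into 32-bit codes.
codes = random_projection_bits(np.random.default_rng(1).normal(size=(1000, 8)), 32)
```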
15 As the number of bits increases the precision improves (and approaches one with many bits), but the rate of convergence can be very slow. [sent-43, score-0.351]
16 Rather than using random projections to define the bits in a code, several authors have pursued machine learning approaches. [sent-44, score-0.377]
17 In order to learn 32 bits, the middle layer of the autoencoder has 32 hidden units, and noise was injected during training to encourage these bits to be as binary as possible. [sent-47, score-0.459]
18 In [9] they used multiple stacked RBMs to learn a non-linear mapping between input vector and code bits. [sent-49, score-0.18]
19 In [3] both RBMs and Boosting were used to learn binary codes for a database of millions of images and were found to outperform LSH. [sent-57, score-0.287]
20 Also, the retrieval speed using these short binary codes was found to be significantly faster than LSH (which was faster than other methods such as KD trees). [sent-58, score-0.284]
21 The success of machine learning methods leads us to ask: what is the best code for performing semantic hashing for a given dataset? [sent-59, score-0.692]
22 We formalize the requirements for a good code and show that these are equivalent to a particular form of graph partitioning. [sent-60, score-0.221]
23 These eigenvectors are exactly the eigenvectors used in many spectral algorithms including spectral clustering and Laplacian eigenmaps [6, 11]. [sent-63, score-0.729]
24 This leads to a new algorithm, which we call “spectral hashing” where the bits are calculated by thresholding a subset of eigenvectors of the Laplacian of the similarity graph. [sent-64, score-0.49]
25 By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. [sent-65, score-0.68]
26 Taken together, both learning the code and applying it to a novel point are extremely simple. [sent-66, score-0.241]
27 In many applications of LSH and Boosting SSC, a different retrieval algorithm is used whereby the binary code only creates a shortlist and exhaustive search is performed on the shortlist. [sent-71, score-0.339]
29 [Figure 1: Building hash codes to find neighbors. Curves: stumps, Boosting SSC, LSH, RBM (two hidden layers); y-axis: proportion of good neighbors for Hamming distance < 2; x-axis: number of bits.] [sent-77, score-0.571]
30 The figure plots the average precision (number of neighbors in the original space divided by number of neighbors in a hamming ball using the hash codes) at Hamming distance ≤ 1 for three methods. [sent-80, score-0.544]
31 The plots on the left show how each method partitions the space to compute the bits to represent each sample. [sent-81, score-0.413]
32 Despite the simplicity of this toy data, the methods still require many bits in order to get good performance. [sent-82, score-0.374]
33 2 Analysis: what makes a good code. As mentioned earlier, we seek a code that is (1) easily computed for a novel input, (2) requires a small number of bits to code the full dataset, and (3) maps similar items to similar binary codewords. [sent-83, score-1.144]
34 Let us first ignore the first requirement, that codewords be easily computed for a novel input, and search only for a code that is efficient. [sent-84, score-0.31]
35 For a code to be efficient, we require that each bit has a 50% chance of being one or zero, and that different bits are independent of each other. [sent-89, score-0.631]
36 Among all codes that have this property, we will seek the ones where the average Hamming distance between similar points is minimal. [sent-90, score-0.215]
37 Using this notation, the average Hamming distance between similar neighbors can be written: $\sum_{ij} W_{ij}\,\|y_i - y_j\|^2$. [sent-94, score-0.25]
38 The bit partitions the graph into two equal parts (A, B), vertices where the bit is on and vertices where the bit is off. [sent-99, score-0.447]
39 For k bits the problem can be thought of as trying to find k independent balanced partitions, each of which should have as low cut as possible. [sent-102, score-0.424]
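The requirements above can be collected into a single optimization problem. The following is a reconstruction from the constraints described in the surrounding sentences (with $y_i \in \{-1,1\}^k$ denoting the code of item $i$), not a verbatim quote of the paper's numbered equations:

```latex
\begin{aligned}
\min_{\{y_i\}}\; & \sum_{ij} W_{ij}\,\|y_i - y_j\|^2 \\
\text{s.t.}\;    & y_i \in \{-1,1\}^k, \\
                 & \textstyle\sum_i y_i = 0 \quad\text{(each bit is on for half the items)}, \\
                 & \tfrac{1}{n}\textstyle\sum_i y_i y_i^{\top} = I \quad\text{(the bits are uncorrelated)}.
\end{aligned}
```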
40 2 Out of Sample Extension. The fact that the solutions to the relaxed problem are the k eigenvectors of D − W with minimal eigenvalue would suggest simply thresholding these eigenvectors to obtain a binary code. [sent-106, score-0.416]
41 But this would only tell us how to compute the code representation of items in the training set. [sent-107, score-0.275]
42 This is the problem of out-of-sample extension of spectral methods, which is often solved using the Nyström method [13, 14]. [sent-108, score-0.281]
43 In our setting, where there can be millions of items in the dataset this is impractical. [sent-110, score-0.191]
44 Relaxing the constraint that y(x) ∈ {−1, 1}^k now gives a spectral problem whose solutions are eigenfunctions of the weighted Laplace-Beltrami operators defined on manifolds [15, 16, 13, 17]. [sent-114, score-0.621]
45 The solutions to the relaxation of problem 3 are functions that satisfy Lp f = λf with minimal eigenvalue (ignoring the trivial solution f(x) = 1, which has eigenvalue 0). [sent-116, score-0.219]
46 As discussed in [16, 15, 13], with proper normalization, the eigenvectors of the discrete Laplacian defined by n points sampled from p(x) converge to eigenfunctions of Lp as n → ∞. [sent-117, score-0.424]
47 Observation [17]: If p(x) is separable, and similarity between datapoints is defined as $e^{-\|x_i-x_j\|^2/\epsilon^2}$, then the eigenfunctions of the continuous weighted Laplacian Lp have an outer-product form. [sent-122, score-0.382]
48 That is, if Φi (x) is an eigenfunction of the weighted Laplacian defined on R1 with eigenvalue λi then Φi (x1 )Φj (x2 ) · · · Φd (xd ) is an eigenfunction of the d dimensional problem with eigenvalue λi λj · · · λd . [sent-123, score-0.301]
49 Specifically, for the case of a uniform distribution on [a, b], the eigenfunctions of the one-dimensional Laplacian Lp are extremely well-studied objects in mathematics. [sent-124, score-0.389]
50 The eigenfunctions Φk(x) and eigenvalues λk are: $\Phi_k(x) = \sin\!\left(\frac{\pi}{2} + \frac{k\pi}{b-a}\,x\right)$ (4), $\lambda_k = 1 - e^{-\frac{\epsilon^2}{2}\left|\frac{k\pi}{b-a}\right|^2}$ (5). A similar equation is also available for the one-dimensional Gaussian. [sent-126, score-0.393]
51 In this case the eigenfunctions of the one-dimensional Laplacian Lp are (in the limit of small ε) solutions to the Schrödinger equation and are related to Hermite polynomials. [sent-127, score-0.331]
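For concreteness, equations (4) and (5) for the uniform case can be evaluated directly; in the sketch below the function names are ours and eps plays the role of the kernel width ε:

```python
import numpy as np

def phi_uniform(k, a, b, x):
    """k-th analytical eigenfunction of the 1D Laplacian for a uniform
    distribution on [a, b], following eq. (4)."""
    return np.sin(np.pi / 2 + k * np.pi / (b - a) * x)

def lam_uniform(k, a, b, eps):
    """Corresponding eigenvalue, following eq. (5): small k (low spatial
    frequency) and a long interval [a, b] give eigenvalues close to zero."""
    return 1.0 - np.exp(-(eps ** 2) / 2.0 * (k * np.pi / (b - a)) ** 2)

# First three eigenvalues on [0, 1] with kernel width eps = 0.1.
vals = [lam_uniform(k, 0.0, 1.0, 0.1) for k in (1, 2, 3)]
```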
52 Figure 2 shows the analytical eigenfunctions for a 2D rectangle in order of increasing eigenvalue. [sent-128, score-0.434]
53 The eigenvalue (which corresponds to the cut) determines which k bits will be used. [sent-129, score-0.436]
54 Note that the eigenvalue depends on the aspect ratio of the rectangle and the spatial frequency — it is better to cut the long dimension before the short one, and low spatial frequencies are preferred. [sent-130, score-0.288]
55 Note that the eigenfunctions do not depend on the radius ε of similar neighbors. [sent-131, score-0.49]
56 We distinguish between single-dimension eigenfunctions, which are of the form Φk (x1 ) or Φk (x2 ) and outer-product eigenfunctions which are of the form Φk (x1 )Φl (x2 ). [sent-133, score-0.331]
57 These outer-product eigenfunctions are marked with a red border in the figure. [sent-134, score-0.367]
58 As we now discuss, these outer-product eigenfunctions should be avoided when building a hashing code. [sent-135, score-0.78]
59 Observation: Suppose we build a code by thresholding the k eigenfunctions of Lp with minimal eigenvalue y(x) = sign(Φk (x)). [sent-136, score-0.668]
60 If any of the eigenfunctions is an outer-product eigenfunction, then that bit is a deterministic function of other bits in the code. [sent-137, score-0.782]
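A one-line justification for the two-dimensional case (the general case is the same argument applied coordinate-wise): the sign of a product is the product of the signs, so the thresholded outer-product bit is already determined by the two single-dimension bits:

```latex
\operatorname{sign}\!\bigl(\Phi_k(x_1)\,\Phi_l(x_2)\bigr)
  = \operatorname{sign}\!\bigl(\Phi_k(x_1)\bigr)\cdot \operatorname{sign}\!\bigl(\Phi_l(x_2)\bigr).
```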
61 This observation highlights the simplification we made in relaxing the independence constraint and requiring that the bits be uncorrelated. [sent-139, score-0.383]
62 Indeed, the bits corresponding to outer-product eigenfunctions are approximately uncorrelated, but they are surely not independent. [sent-140, score-0.718]
63 The exact form of the eigenfunctions for 1D continuous Laplacian for different distributions is a matter of ongoing research [17]. [sent-141, score-0.331]
64 We have found, however, that the bit codes obtained by thresholding the eigenfunctions are robust to the exact form of the distribution. [sent-142, score-0.633]
65 In particular, simply fitting a multidimensional rectangle distribution to the data (by using PCA to align the axes, and then assuming a uniform distribution on each axis) works surprisingly well for a wide range of distributions. [sent-143, score-0.202]
66 In particular, using the analytic eigenfunctions of a uniform distribution on data sampled from a Gaussian, works as well as using the numerically calculated eigenvectors and far better than boosting or RBMs trained on the Gaussian distribution. [sent-144, score-0.614]
67 To summarize, given a training set of points {xi} and a desired number of bits k, the spectral hashing algorithm works by (see the code sketch after this list): • Finding the principal components of the data using PCA. [sent-145, score-1.057]
68 • Calculating the k smallest single-dimension analytical eigenfunctions of Lp using a rectangular approximation along every PCA direction. [sent-146, score-0.381]
69 The eigenvalues depend on the aspect ratio of the rectangle and the spatial frequency of the cut – it is better to cut the long dimension first and lower spatial frequencies are better than higher ones. [sent-151, score-0.303]
70 The red points correspond to the locations that are within a hamming distance of zero. [sent-154, score-0.224]
71 Green corresponds to a hamming ball of radius 1, and blue to radius 2. [sent-155, score-0.227]
72 • Thresholding the analytical eigenfunctions at zero, to obtain binary codes. [sent-156, score-0.402]
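A minimal end-to-end sketch of this recipe follows. Assumptions: the helper names are ours, bits are ranked by (kπ/(b-a))², which orders them the same way as the eigenvalue formula (5) for a fixed ε, and a fixed number of candidate modes is enumerated per PCA direction; this is not the authors' reference implementation.

```python
import numpy as np

def train_spectral_hashing(X, n_bits, n_modes=10):
    """Fit spectral-hashing parameters: PCA axes, per-axis ranges, and the
    (axis, mode) pairs whose analytical eigenvalues are smallest.
    Assumes X.shape[1] * n_modes >= n_bits."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # rows of Vt = PCA axes
    proj = Xc @ Vt.T                                     # data in PCA coordinates
    mn, mx = proj.min(axis=0), proj.max(axis=0)

    # Enumerate single-dimension eigenfunctions only (outer products excluded).
    # Smaller (k*pi/(b-a))^2 means a lower cut: long dimensions and low
    # spatial frequencies are preferred.
    axes, modes, scores = [], [], []
    for d in range(proj.shape[1]):
        if mx[d] - mn[d] < 1e-9:
            continue                                     # skip flat directions
        for k in range(1, n_modes + 1):
            axes.append(d)
            modes.append(k)
            scores.append((k * np.pi / (mx[d] - mn[d])) ** 2)
    order = np.argsort(scores)[:n_bits]
    return {"mean": mean, "pcs": Vt, "mn": mn, "mx": mx,
            "axes": np.array(axes)[order], "modes": np.array(modes)[order]}

def spectral_hash(X, model):
    """Binary code: threshold the selected analytical eigenfunctions at zero."""
    proj = (X - model["mean"]) @ model["pcs"].T
    a, k = model["axes"], model["modes"]
    span = model["mx"][a] - model["mn"][a]
    phase = np.pi / 2 + (k * np.pi / span) * (proj[:, a] - model["mn"][a])
    return (np.sin(phase) > 0).astype(np.uint8)

# Example usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 16))
model = train_spectral_hashing(X, n_bits=32)
codes = spectral_hash(X, model)                          # (5000, 32) bits
```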
73 Second, even though it avoids the trivial 3-way dependencies that arise from outer-product eigenfunctions, other high-order dependencies between the bits may exist. [sent-160, score-0.351]
74 Neither of these more complicated variants of spectral hashing gave a significant improvement in performance in our experiments. [sent-162, score-0.706]
75 Figure 4a compares the performance of spectral hashing to LSH, RBMs and Boosting on a 2D rectangle, and Figure 3 visualizes the Hamming balls for the different methods. [sent-163, score-0.785]
76 Despite the simplicity of spectral hashing, it outperforms the other methods. [sent-164, score-0.257]
77 Even when we apply RBMs and Boosting to the output of spectral hashing the performance does not improve. [sent-165, score-0.706]
78 Figure 5 shows retrieval results for spectral hashing, RBMs and boosting on the “labelme” dataset. [sent-172, score-0.496]
79 Note that even though spectral hashing uses a terrible model of the statistics of the database (it simply assumes an N-dimensional rectangle), it performs better than boosting, which actually uses the distribution (the difference in performance relative to RBMs is not significant). [sent-173, score-0.925]
80 Not only is the performance numerically better, but [sent-174, score-0.453]
82 [Figure 4: left: results on 2D rectangles with different methods; panels: a) 2D uniform distribution, b) 10D uniform distribution; curves include spectral hashing, Boosting + spectral hashing and LSH; y-axis: proportion of good neighbors for Hamming distance < 2; x-axis: number of bits.] [sent-182, score-0.767]
83 Even though spectral hashing is the simplest, it gives the best performance. [sent-183, score-0.706]
85 [Figure 5: Performance of different binary codes on the LabelMe dataset described in [3]; panels: input, Gist neighbors, spectral hashing 10 bits, Boosting 10 bits; curves: spectral hashing, RBM, Boosting SSC, LSH; x-axis: number of bits.] [sent-193, score-0.601]
86 The data is certainly not uniformly distributed, and yet spectral hashing gives better retrieval performance than boosting and LSH. [sent-194, score-0.945]
87 our visual inspection of the retrieved neighbors suggests that with a small number of bits, the retrieved images are better using spectral hashing than with boosting. [sent-195, score-0.991]
88 This dataset is obviously more challenging and even using exhaustive search some of the retrieved neighbors are semantically quite different. [sent-197, score-0.293]
89 Still, the majority of retrieved neighbors seem to be semantically relevant, and with 64 bits spectral hashing enables this performance in fractions of a second. [sent-198, score-1.272]
90 4 Discussion We have discussed the problem of learning a code for semantic hashing. [sent-199, score-0.243]
91 We defined a hard criterion for a good code that is related to graph partitioning and used a spectral relaxation to obtain an eigenvector solution. [sent-200, score-0.559]
92 We used recent results on convergence of graph Laplacian eigenvectors to obtain analytic solutions for certain distributions and showed the importance of avoiding redundant bits that arise from separable distributions. [sent-201, score-0.532]
93 The final algorithm we arrive at, spectral hashing, is extremely simple: one simply performs PCA on the data and then fits a multidimensional rectangle. [sent-202, score-0.374]
94 The aspect ratio of this multidimensional rectangle determines the code using a simple formula. [sent-203, score-0.348]
95 Figure 6: Retrieval results on a dataset of 80 million images using the original gist descriptor, and hash codes built with spectral hashing with 32 bits and 64 bits (panels: Gist neighbors, spectral hashing 32 bits, 64 bits). [sent-205, score-2.281]
96 The input image corresponds to the image on the top-left corner, the rest are the 24 nearest neighbors using hamming distance for the hash codes and L2 for gist. [sent-206, score-0.572]
97 Laplacian eigenmaps and spectral techniques for embedding and clustering. [sent-238, score-0.337]
98 Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. [sent-248, score-0.449]
99 Learning eigenfunctions links spectral embedding and kernel PCA. [sent-279, score-0.639]
100 Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. [sent-311, score-0.257]
wordName wordTfidf (topN-words)
[('hashing', 0.449), ('bits', 0.351), ('eigenfunctions', 0.331), ('spectral', 0.257), ('lsh', 0.218), ('ssc', 0.182), ('code', 0.18), ('hamming', 0.165), ('boosting', 0.158), ('codes', 0.156), ('rbm', 0.154), ('neighbors', 0.128), ('rbms', 0.116), ('laplacian', 0.114), ('bit', 0.1), ('codewords', 0.095), ('items', 0.095), ('eigenvectors', 0.093), ('erent', 0.091), ('di', 0.087), ('eigenvalue', 0.085), ('retrieval', 0.081), ('rectangle', 0.079), ('lp', 0.079), ('cut', 0.073), ('multidimensional', 0.065), ('hash', 0.064), ('gist', 0.064), ('semantic', 0.063), ('partitions', 0.062), ('retrieved', 0.061), ('distance', 0.059), ('euclidean', 0.058), ('stumps', 0.058), ('torralba', 0.053), ('datapoints', 0.051), ('embedding', 0.051), ('millions', 0.049), ('nystrom', 0.048), ('eigenfunction', 0.048), ('binary', 0.047), ('separable', 0.047), ('dataset', 0.047), ('salakhutdinov', 0.047), ('thresholding', 0.046), ('np', 0.045), ('layers', 0.045), ('graph', 0.041), ('autoencoder', 0.036), ('lafon', 0.036), ('nadler', 0.036), ('outerproduct', 0.036), ('usion', 0.036), ('yi', 0.036), ('dimensional', 0.035), ('images', 0.035), ('novel', 0.035), ('fergus', 0.034), ('descriptor', 0.034), ('correlates', 0.033), ('manifolds', 0.033), ('uniform', 0.032), ('relaxing', 0.032), ('coifman', 0.032), ('radius', 0.031), ('partitioning', 0.031), ('exhaustive', 0.031), ('preserving', 0.03), ('eigenmaps', 0.029), ('retrieving', 0.029), ('maps', 0.029), ('million', 0.028), ('sign', 0.027), ('eigenvector', 0.027), ('frequencies', 0.027), ('proportion', 0.027), ('eigenvalues', 0.027), ('yj', 0.027), ('wij', 0.026), ('minimal', 0.026), ('simply', 0.026), ('neighbourhood', 0.026), ('rectangular', 0.026), ('semantically', 0.026), ('pursued', 0.026), ('extremely', 0.026), ('hidden', 0.025), ('axes', 0.025), ('aspect', 0.024), ('extension', 0.024), ('analytical', 0.024), ('calculating', 0.024), ('toy', 0.023), ('relaxation', 0.023), ('labelme', 0.023), ('thresholded', 0.023), ('neighborhood', 0.022), ('vertices', 0.022), ('rd', 0.022), ('requirement', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 219 nips-2008-Spectral Hashing
Author: Yair Weiss, Antonio Torralba, Rob Fergus
Abstract: Semantic hashing[1] seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state-of-the art. 1
2 0.18982843 168 nips-2008-Online Metric Learning and Fast Similarity Search
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon, Kristen Grauman
Abstract: Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates to data structures used for fast approximate similarity search. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.
3 0.16292191 218 nips-2008-Spectral Clustering with Perturbed Data
Author: Ling Huang, Donghui Yan, Nina Taft, Michael I. Jordan
Abstract: Spectral clustering is useful for a wide-ranging set of applications in areas such as biological data analysis, image processing and data mining. However, the computational and/or communication resources required by the method in processing large-scale data are often prohibitively high, and practitioners are often required to perturb the original data in various ways (quantization, downsampling, etc) before invoking a spectral algorithm. In this paper, we use stochastic perturbation theory to study the effects of data perturbation on the performance of spectral clustering. We show that the error under perturbation of spectral clustering is closely related to the perturbation of the eigenvectors of the Laplacian matrix. From this result we derive approximate upper bounds on the clustering error. We show that this bound is tight empirically across a wide range of problems, suggesting that it can be used in practical settings to determine the amount of data reduction allowed in order to meet a specification of permitted loss in clustering performance. 1
4 0.13875821 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
Author: Vinod Nair, Geoffrey E. Hinton
Abstract: We present a mixture model whose components are Restricted Boltzmann Machines (RBMs). This possibility has not been considered before because computing the partition function of an RBM is intractable, which appears to make learning a mixture of RBMs intractable as well. Surprisingly, when formulated as a third-order Boltzmann machine, such a mixture model can be learned tractably using contrastive divergence. The energy function of the model captures threeway interactions among visible units, hidden units, and a single hidden discrete variable that represents the cluster label. The distinguishing feature of this model is that, unlike other mixture models, the mixing proportions are not explicitly parameterized. Instead, they are defined implicitly via the energy function and depend on all the parameters in the model. We present results for the MNIST and NORB datasets showing that the implicit mixture of RBMs learns clusters that reflect the class structure in the data. 1
5 0.12993489 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
Author: Benjamin Yackley, Eduardo Corona, Terran Lane
Abstract: Many interesting problems, including Bayesian network structure-search, can be cast in terms of finding the optimum value of a function over the space of graphs. However, this function is often expensive to compute exactly. We here present a method derived from the study of Reproducing Kernel Hilbert Spaces which takes advantage of the regular structure of the space of all graphs on a fixed number of nodes to obtain approximations to the desired function quickly and with reasonable accuracy. We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs. 1
6 0.11001711 67 nips-2008-Effects of Stimulus Type and of Error-Correcting Code Design on BCI Speller Performance
7 0.094657719 184 nips-2008-Predictive Indexing for Fast Search
8 0.085324779 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
9 0.071925499 194 nips-2008-Regularized Learning with Networks of Features
10 0.070816793 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
11 0.068900354 107 nips-2008-Influence of graph construction on graph-based clustering measures
12 0.066876821 109 nips-2008-Interpreting the neural code with Formal Concept Analysis
13 0.064425312 117 nips-2008-Learning Taxonomies by Dependence Maximization
14 0.064167902 160 nips-2008-On Computational Power and the Order-Chaos Phase Transition in Reservoir Computing
15 0.06393189 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
16 0.06157263 15 nips-2008-Adaptive Martingale Boosting
17 0.059082136 193 nips-2008-Regularized Co-Clustering with Dual Supervision
18 0.05899794 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
19 0.056247223 191 nips-2008-Recursive Segmentation and Recognition Templates for 2D Parsing
20 0.054758705 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
topicId topicWeight
[(0, -0.164), (1, -0.067), (2, 0.005), (3, 0.002), (4, 0.006), (5, -0.022), (6, 0.028), (7, -0.046), (8, 0.088), (9, -0.147), (10, 0.012), (11, 0.055), (12, -0.095), (13, 0.146), (14, -0.153), (15, -0.117), (16, -0.108), (17, -0.06), (18, 0.145), (19, -0.072), (20, -0.033), (21, 0.045), (22, 0.13), (23, 0.068), (24, -0.019), (25, 0.151), (26, 0.018), (27, 0.08), (28, -0.002), (29, 0.051), (30, -0.037), (31, -0.083), (32, 0.061), (33, -0.066), (34, 0.043), (35, -0.166), (36, 0.09), (37, -0.218), (38, -0.102), (39, -0.091), (40, -0.014), (41, 0.175), (42, -0.031), (43, -0.195), (44, 0.155), (45, -0.069), (46, -0.069), (47, -0.102), (48, 0.049), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.97004426 219 nips-2008-Spectral Hashing
Author: Yair Weiss, Antonio Torralba, Rob Fergus
Abstract: Semantic hashing[1] seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state-of-the art. 1
2 0.58532709 218 nips-2008-Spectral Clustering with Perturbed Data
Author: Ling Huang, Donghui Yan, Nina Taft, Michael I. Jordan
Abstract: Spectral clustering is useful for a wide-ranging set of applications in areas such as biological data analysis, image processing and data mining. However, the computational and/or communication resources required by the method in processing large-scale data are often prohibitively high, and practitioners are often required to perturb the original data in various ways (quantization, downsampling, etc) before invoking a spectral algorithm. In this paper, we use stochastic perturbation theory to study the effects of data perturbation on the performance of spectral clustering. We show that the error under perturbation of spectral clustering is closely related to the perturbation of the eigenvectors of the Laplacian matrix. From this result we derive approximate upper bounds on the clustering error. We show that this bound is tight empirically across a wide range of problems, suggesting that it can be used in practical settings to determine the amount of data reduction allowed in order to meet a specification of permitted loss in clustering performance. 1
3 0.51179212 168 nips-2008-Online Metric Learning and Fast Similarity Search
Author: Prateek Jain, Brian Kulis, Inderjit S. Dhillon, Kristen Grauman
Abstract: Metric learning algorithms can provide useful distance functions for a variety of domains, and recent work has shown good accuracy for problems where the learner can access all distance constraints at once. However, in many real applications, constraints are only available incrementally, thus necessitating methods that can perform online updates to the learned metric. Existing online algorithms offer bounds on worst-case performance, but typically do not perform well in practice as compared to their offline counterparts. We present a new online metric learning algorithm that updates a learned Mahalanobis metric based on LogDet regularization and gradient descent. We prove theoretical worst-case performance bounds, and empirically compare the proposed method against existing online metric learning algorithms. To further boost the practicality of our approach, we develop an online locality-sensitive hashing scheme which leads to efficient updates to data structures used for fast approximate similarity search. We demonstrate our algorithm on multiple datasets and show that it outperforms relevant baselines.
4 0.48663145 67 nips-2008-Effects of Stimulus Type and of Error-Correcting Code Design on BCI Speller Performance
Author: Jeremy Hill, Jason Farquhar, Suzanna Martens, Felix Biessmann, Bernhard Schölkopf
Abstract: From an information-theoretic perspective, a noisy transmission system such as a visual Brain-Computer Interface (BCI) speller could benefit from the use of errorcorrecting codes. However, optimizing the code solely according to the maximal minimum-Hamming-distance criterion tends to lead to an overall increase in target frequency of target stimuli, and hence a significantly reduced average target-to-target interval (TTI), leading to difficulties in classifying the individual event-related potentials (ERPs) due to overlap and refractory effects. Clearly any change to the stimulus setup must also respect the possible psychophysiological consequences. Here we report new EEG data from experiments in which we explore stimulus types and codebooks in a within-subject design, finding an interaction between the two factors. Our data demonstrate that the traditional, rowcolumn code has particular spatial properties that lead to better performance than one would expect from its TTIs and Hamming-distances alone, but nonetheless error-correcting codes can improve performance provided the right stimulus type is used. 1
5 0.46313486 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
Author: Benjamin Yackley, Eduardo Corona, Terran Lane
Abstract: Many interesting problems, including Bayesian network structure-search, can be cast in terms of finding the optimum value of a function over the space of graphs. However, this function is often expensive to compute exactly. We here present a method derived from the study of Reproducing Kernel Hilbert Spaces which takes advantage of the regular structure of the space of all graphs on a fixed number of nodes to obtain approximations to the desired function quickly and with reasonable accuracy. We then test this method on both a small testing set and a real-world Bayesian network; the results suggest that not only is this method reasonably accurate, but that the BDe score itself varies quadratically over the space of all graphs. 1
6 0.39185274 167 nips-2008-One sketch for all: Theory and Application of Conditional Random Sampling
7 0.38997394 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
8 0.35962164 107 nips-2008-Influence of graph construction on graph-based clustering measures
9 0.34559581 160 nips-2008-On Computational Power and the Order-Chaos Phase Transition in Reservoir Computing
10 0.32231724 41 nips-2008-Breaking Audio CAPTCHAs
11 0.31688789 194 nips-2008-Regularized Learning with Networks of Features
12 0.30914 92 nips-2008-Generative versus discriminative training of RBMs for classification of fMRI images
13 0.28867328 193 nips-2008-Regularized Co-Clustering with Dual Supervision
14 0.28163025 240 nips-2008-Tracking Changing Stimuli in Continuous Attractor Neural Networks
15 0.28081602 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning
16 0.27665025 117 nips-2008-Learning Taxonomies by Dependence Maximization
17 0.26403546 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm
18 0.26031965 184 nips-2008-Predictive Indexing for Fast Search
19 0.24179429 237 nips-2008-The Recurrent Temporal Restricted Boltzmann Machine
20 0.24109931 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features
topicId topicWeight
[(6, 0.04), (7, 0.121), (12, 0.05), (28, 0.167), (32, 0.032), (57, 0.104), (59, 0.018), (63, 0.033), (71, 0.017), (77, 0.037), (78, 0.015), (83, 0.049), (89, 0.228)]
simIndex simValue paperId paperTitle
same-paper 1 0.85988581 219 nips-2008-Spectral Hashing
Author: Yair Weiss, Antonio Torralba, Rob Fergus
Abstract: Semantic hashing[1] seeks compact binary codes of data-points so that the Hamming distance between codewords correlates with semantic similarity. In this paper, we show that the problem of finding a best code for a given dataset is closely related to the problem of graph partitioning and can be shown to be NP hard. By relaxing the original problem, we obtain a spectral method whose solutions are simply a subset of thresholded eigenvectors of the graph Laplacian. By utilizing recent results on convergence of graph Laplacian eigenvectors to the Laplace-Beltrami eigenfunctions of manifolds, we show how to efficiently calculate the code of a novel datapoint. Taken together, both learning the code and applying it to a novel point are extremely simple. Our experiments show that our codes outperform the state-of-the art. 1
2 0.76443404 53 nips-2008-Counting Solution Clusters in Graph Coloring Problems Using Belief Propagation
Author: Lukas Kroc, Ashish Sabharwal, Bart Selman
Abstract: We show that an important and computationally challenging solution space feature of the graph coloring problem (COL), namely the number of clusters of solutions, can be accurately estimated by a technique very similar to one for counting the number of solutions. This cluster counting approach can be naturally written in terms of a new factor graph derived from the factor graph representing the COL instance. Using a variant of the Belief Propagation inference framework, we can efficiently approximate cluster counts in random COL problems over a large range of graph densities. We illustrate the algorithm on instances with up to 100, 000 vertices. Moreover, we supply a methodology for computing the number of clusters exactly using advanced techniques from the knowledge compilation literature. This methodology scales up to several hundred variables. 1
3 0.70629382 66 nips-2008-Dynamic visual attention: searching for coding length increments
Author: Xiaodi Hou, Liqing Zhang
Abstract: A visual attention system should respond placidly when common stimuli are presented, while at the same time keep alert to anomalous visual inputs. In this paper, a dynamic visual attention model based on the rarity of features is proposed. We introduce the Incremental Coding Length (ICL) to measure the perspective entropy gain of each feature. The objective of our model is to maximize the entropy of the sampled visual features. In order to optimize energy consumption, the limit amount of energy of the system is re-distributed amongst features according to their Incremental Coding Length. By selecting features with large coding length increments, the computational system can achieve attention selectivity in both static and dynamic scenes. We demonstrate that the proposed model achieves superior accuracy in comparison to mainstream approaches in static saliency map generation. Moreover, we also show that our model captures several less-reported dynamic visual search behaviors, such as attentional swing and inhibition of return. 1
4 0.70291549 71 nips-2008-Efficient Sampling for Gaussian Process Inference using Control Variables
Author: Neil D. Lawrence, Magnus Rattray, Michalis K. Titsias
Abstract: Sampling functions in Gaussian process (GP) models is challenging because of the highly correlated posterior distribution. We describe an efficient Markov chain Monte Carlo algorithm for sampling from the posterior process of the GP model. This algorithm uses control variables which are auxiliary function values that provide a low dimensional representation of the function. At each iteration, the algorithm proposes new values for the control variables and generates the function from the conditional GP prior. The control variable input locations are found by minimizing an objective function. We demonstrate the algorithm on regression and classification problems and we use it to estimate the parameters of a differential equation model of gene regulation. 1
5 0.70092893 4 nips-2008-A Scalable Hierarchical Distributed Language Model
Author: Andriy Mnih, Geoffrey E. Hinton
Abstract: Neural probabilistic language models (NPLMs) have been shown to be competitive with and occasionally superior to the widely-used n-gram language models. The main drawback of NPLMs is their extremely long training and testing times. Morin and Bengio have proposed a hierarchical language model built around a binary tree of words, which was two orders of magnitude faster than the nonhierarchical model it was based on. However, it performed considerably worse than its non-hierarchical counterpart in spite of using a word tree created using expert knowledge. We introduce a fast hierarchical language model along with a simple feature-based algorithm for automatic construction of word trees from the data. We then show that the resulting models can outperform non-hierarchical neural models as well as the best n-gram models. 1
6 0.69967586 192 nips-2008-Reducing statistical dependencies in natural signals using radial Gaussianization
7 0.69879889 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
8 0.6985122 200 nips-2008-Robust Kernel Principal Component Analysis
9 0.69837147 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC
10 0.69798166 138 nips-2008-Modeling human function learning with Gaussian processes
11 0.69684434 118 nips-2008-Learning Transformational Invariants from Natural Movies
12 0.69593465 197 nips-2008-Relative Performance Guarantees for Approximate Inference in Latent Dirichlet Allocation
13 0.69581056 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
14 0.69486284 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
15 0.69410443 62 nips-2008-Differentiable Sparse Coding
16 0.69163311 103 nips-2008-Implicit Mixtures of Restricted Boltzmann Machines
17 0.68825513 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
18 0.68824464 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
19 0.68769187 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
20 0.68727487 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data