nips nips2007 nips2007-49 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Le Song, Arthur Gretton, Karsten M. Borgwardt, Alex J. Smola
Abstract: Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distance-preserving constraints. This general view allows us to design “colored” variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. [sent-11, score-0.16]
2 It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. [sent-12, score-0.164]
3 We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distance-preserving constraints. [sent-13, score-0.133]
4 1 Introduction In recent years maximum variance unfolding (MVU), introduced by Saul et al. [sent-17, score-0.16]
5 This method is based on a simple heuristic: maximizing the overall variance of the embedding while preserving the local distances between neighboring observations. [sent-19, score-0.358]
6 We show that the algorithm attempts to extract features from the data which simultaneously preserve the identity of individual observations and their local distance structure. [sent-25, score-0.116]
7 That is, we are able to find an embedding that trades off between two goals: • preserve the local distance structure according to the first source of information (the data); • and maximally align with the second source of information (the side information). [sent-29, score-0.365]
8 “Colored” MVU achieves the goal of elucidating primarily relevant features by aligning the embedding to the objective provided in the side information. [sent-32, score-0.336]
9 Some examples illustrate this situation in more detail: • Given a bag-of-pixels representation of images (the data), such as USPS digits, find an embedding which reflects the categories of the images (side information). [sent-33, score-0.285]
10 • Given a vector space representation of texts on the web (the data), such as newsgroups, find an embedding which reflects a hierarchy of the topics (side information). [sent-34, score-0.33]
11 • Given a TF/IDF representation of documents (the data), such as NIPS papers, find an embedding which reflects co-authorship relations between the documents (side information). [sent-35, score-0.341]
12 There is a strong motivation for not simply merging the two sources of information into a single distance metric: Firstly, the data and the side information may be heterogeneous. [sent-36, score-0.173]
13 It is unclear how to combine them into a single distance metric; Secondly, the side information may appear in the form of similarity rather than distance. [sent-37, score-0.173]
14 For instance, co-authorship defines a similarity between documents (if two papers share more authors, they tend to be more similar), but it does not induce a distance between the documents (if two papers share no authors, we cannot assert that they are far apart). [sent-38, score-0.77]
15 Given observations {z1, . . . , zm} ⊆ Z and a distance metric d : Z × Z → [0, ∞), find an inner product matrix (kernel matrix) K ∈ Rm×m with K ⪰ 0 such that 1. [sent-48, score-0.102]
16 Kii + Kjj − 2Kij = d²ij for all (i, j) pairs which are sufficiently close to each other, such as the n nearest neighbors of each observation. [sent-51, score-0.15]
17 The trace of K is maximized (the maximum variance unfolding part). [sent-65, score-0.16]
18 By and large the optimization problem looks as follows: maximize tr K subject to K ⪰ 0, K1 = 0 and Kii + Kjj − 2Kij = d²ij for all (i, j) ∈ N . [sent-67, score-0.211]
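For concreteness, here is a minimal sketch of this semidefinite program using the cvxpy modeling package (our own illustration, not the authors' implementation); `neighbors` and `d2` are hypothetical inputs holding the pairs in N and the squared original distances, and this direct formulation only scales to small m.

```python
# Sketch of the MVU semidefinite program above, assuming cvxpy is available.
# `neighbors` is a list of (i, j) pairs in N; d2[(i, j)] holds the squared
# original distance between observations i and j. Illustration only.
import cvxpy as cp
import numpy as np

def mvu_kernel(d2, neighbors, m):
    K = cp.Variable((m, m), PSD=True)              # K is positive semidefinite
    constraints = [K @ np.ones(m) == 0]            # centering constraint K1 = 0
    for (i, j) in neighbors:                       # local isometry constraints
        constraints.append(K[i, i] + K[j, j] - 2 * K[i, j] == d2[(i, j)])
    cp.Problem(cp.Maximize(cp.trace(K)), constraints).solve()  # maximize variance
    return K.value
```

An embedding is then read off from the leading eigenvectors of the returned K, as in standard MVU.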
19 where the distances are only allowed to shrink, where slack variables are added to the objective function to allow approximate distance preservation, or where one uses low-rank expansions of K to cope with the computational complexity of semidefinite programming. [sent-70, score-0.096]
20 (For convenience, we will drop the normalization and use tr HKHL as HSIC.) [sent-88, score-0.173]
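Dropping the normalization as stated, the HSIC value used here is just the trace tr HKHL of a product of centered kernel matrices; a small NumPy sketch (our illustration) is:

```python
# Sketch: unnormalized (biased) HSIC between two m x m kernel matrices K and L.
# H = I - (1/m) 11' is the centering matrix, so this computes tr(HKHL).
import numpy as np

def hsic(K, L):
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return float(np.trace(H @ K @ H @ L))
```

The commonly used biased estimator divides this quantity by (m − 1)², but that constant does not affect the optimizer, which is why it can be dropped here.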
21 Here we use it in a different way: We try to construct a kernel matrix K for the dimension-reduced data X which preserves the local distance structure of the original data Z, such that X is maximally dependent on the side information Y as seen from its kernel matrix L. [sent-90, score-0.467]
22 This is desirable, as we want our metric embedding to be robust to small changes. [sent-93, score-0.221]
23 Second, HSIC is easy to compute, since only the kernel matrices are required and no density estimation is needed. [sent-94, score-0.101]
24 The freedom of choosing a kernel for L allows us to incorporate prior knowledge into the dependence estimation process. [sent-95, score-0.144]
25 The consequence is that we are able to incorporate various side information by simply choosing an appropriate kernel for Y . [sent-96, score-0.189]
26 For instance, in the case of NIPS papers which happen to have author information, L would be the kernel matrix arising from coauthorship and d(z, z′) would be the Euclidean distance between the vector space representations of the documents. [sent-98, score-0.472]
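The excerpt does not spell out how the coauthorship kernel is built, but one simple, hypothetical choice consistent with the description is a linear kernel on a document-author incidence matrix:

```python
# Hypothetical coauthorship kernel (not necessarily the construction used in
# the paper): A is an m x a binary incidence matrix with A[d, k] = 1 if author
# k wrote paper d, so L = A A^T counts the authors shared by each pair of papers.
import numpy as np

def coauthor_kernel(A):
    return A @ A.T
```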
27 Then the following two optimization problems are equivalent: maximize tr HKHL over K subject to K ⪰ 0 and constraints on Kii + Kjj − 2Kij; and maximize tr KL over K subject to K ⪰ 0, K1 = 0 and constraints on Kii + Kjj − 2Kij. [sent-100, score-0.422]
28 Moreover, tr HKa HL ≤ tr Kb L by the optimality of Kb. [sent-107, score-0.346]
29 Combining both inequalities shows that tr HKa HL = tr Kb L, hence both solutions are equivalent. [sent-108, score-0.346]
30 This means that the centering imposed in MVU via constraints is equivalent to the centering in HSIC by means of the dependence measure tr HKHL itself. [sent-109, score-0.339]
31 In other words, MVU equivalently maximizes tr HKHI, i.e. [sent-110, score-0.173]
32 the dependence between K and the identity matrix I, which corresponds to retaining maximal diversity between observations via Lij = δij . [sent-112, score-0.177]
33 This suggests the following colored version of MVU: maximize tr HKHL over K subject to K ⪰ 0 and Kii + Kjj − 2Kij = d²ij for all (i, j) ∈ N . (6) [sent-113, score-0.306]
34 Using (6) we see that we are now extracting a Euclidean embedding which maximally depends on the coloring matrix L (for the side information) while preserving local distance structure. [sent-114, score-0.639]
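A sketch of (6) as a semidefinite program follows, again assuming cvxpy and reusing the hypothetical `neighbors`/`d2` inputs from the MVU sketch above; relative to plain MVU only the objective changes.

```python
# Sketch of the colored MVU problem (6): maximize tr(HKHL) subject to K >= 0
# and the local distance constraints. L is the kernel on the side information.
import cvxpy as cp
import numpy as np

def colored_mvu_kernel(d2, neighbors, L):
    m = L.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    M = H @ L @ H                                    # tr(HKHL) = tr((HLH) K) by cyclicity
    K = cp.Variable((m, m), PSD=True)
    constraints = [K[i, i] + K[j, j] - 2 * K[i, j] == d2[(i, j)] for (i, j) in neighbors]
    objective = cp.Maximize(cp.sum(cp.multiply(M, K)))  # elementwise sum equals tr(MK) since M is symmetric
    cp.Problem(objective, constraints).solve()
    return K.value
```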
35 Define Eij by [Eij]ii = [Eij]jj = 1 and [Eij]ij = [Eij]ji = −1, with all other entries zero. Then the distance preserving constraint can be written as tr KEij = d²ij . [sent-123, score-0.374]
36 Thus we have the following Lagrangian: L = tr KHLH + tr KZ − Σ(i,j)∈N wij (tr KEij − d²ij) = tr K(HLH + Z − Σ(i,j)∈N wij Eij) + Σ(i,j)∈N wij d²ij, where Z ⪰ 0 and wij ≥ 0. [sent-124, score-1.139]
37 The corresponding dual is: minimize over w the objective Σ(i,j)∈N wij d²ij subject to G(w) ⪰ HLH and wij ≥ 0, (7) where G(w) := Σ(i,j)∈N wij Eij . (8) [sent-126, score-0.408]
38 Note that G(w) amounts to the graph Laplacian of a weighted graph with adjacency matrix given by w. [sent-128, score-0.192]
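To make the dual concrete: since each Eij adds +1 to the (i, i) and (j, j) entries and −1 to the (i, j) and (j, i) entries, G(w) = Σ wij Eij is exactly the Laplacian of the weighted neighborhood graph. A small sketch (our illustration) that assembles it and checks the constraint G(w) ⪰ HLH numerically:

```python
# Sketch: build G(w) = sum_{(i,j) in N} w_ij E_ij (the weighted graph Laplacian)
# and test dual feasibility G(w) - HLH >= 0 via the smallest eigenvalue.
import numpy as np

def graph_laplacian(weights, m):
    # weights: dict mapping neighbor pairs (i, j) to w_ij >= 0
    G = np.zeros((m, m))
    for (i, j), w in weights.items():
        G[i, i] += w
        G[j, j] += w
        G[i, j] -= w
        G[j, i] -= w
    return G

def dual_feasible(weights, L, tol=1e-8):
    m = L.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Z = graph_laplacian(weights, m) - H @ L @ H
    return np.linalg.eigvalsh(Z).min() >= -tol
```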
39 The dual constraint G(w) ⪰ HLH effectively requires that the eigenspectrum of the graph Laplacian is bounded from below by that of HLH. [sent-129, score-0.106]
40 Recall that at optimality the Karush-Kuhn-Tucker conditions imply tr KZ = 0, i.e. complementary slackness holds. [sent-131, score-0.173]
41 Recall that Z = G(w) − HLH ⪰ 0, and by design G(w) ⪰ 0 since it is the graph Laplacian of a weighted graph with edge weights wij . [sent-135, score-0.24]
42 Hence, the eigenvectors of Z with zero eigenvalues would correspond to those lying in the image of HLH. [sent-137, score-0.096]
43 for an n-class classification problem, then we will only have up to n vanishing eigenvalues in Z. [sent-140, score-0.097]
44 Hence it is likely that G(w) − HLH will have many vanishing eigenvalues which translates into many nonzero eigenvalues of K. [sent-144, score-0.19]
45 6 Implementation Details In practice, instead of requiring the distances to remain unchanged in the embedding we only require them to be preserved approximately [4]. [sent-146, score-0.299]
46 We do so by penalizing the slackness between the original distance and the embedding distance, i.e. [sent-147, score-0.279]
47 maximize tr HKHL − ν Σ(i,j)∈N (Kii + Kjj − 2Kij − d²ij)² over K subject to K ⪰ 0. (9) Here ν controls the tradeoff between dependence maximization and distance preservation. [sent-149, score-0.415]
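A sketch of the penalized problem (9) in the same cvxpy style as before; the squared slack penalty is convex, so the overall objective remains concave in K and the problem stays a tractable SDP.

```python
# Sketch of the slack-penalized colored MVU problem (9): the hard distance
# constraints are replaced by a squared penalty weighted by nu. Illustration only.
import cvxpy as cp
import numpy as np

def colored_mvu_penalized(d2, neighbors, L, nu=1.0):
    m = L.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    M = H @ L @ H                                    # tr(HKHL) = tr(MK)
    K = cp.Variable((m, m), PSD=True)
    slack = [K[i, i] + K[j, j] - 2 * K[i, j] - d2[(i, j)] for (i, j) in neighbors]
    penalty = cp.sum_squares(cp.hstack(slack))       # sum of squared distance violations
    objective = cp.Maximize(cp.sum(cp.multiply(M, K)) - nu * penalty)
    cp.Problem(objective).solve()
    return K.value
```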
48 First, we construct a nearest neighbor graph according to N (we will also refer to this graph and its adjacency matrix as N ). [sent-157, score-0.307]
49 Then we form V by stacking together the bottom n eigenvectors of the graph Laplacian of the neighborhood graph N . [sent-158, score-0.176]
50 The key idea is that neighbors in the original space remain neighbors in the embedding space. [sent-159, score-0.281]
51 As we require them to have similar locations, the bottom eigenvectors of the graph Laplacian provide a good basis for functions that vary smoothly across the graph. [sent-160, score-0.105]
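A sketch of this construction, assuming scipy and a symmetric 0/1 adjacency matrix for the neighborhood graph; in the low-rank formulation K is then restricted to the form V A Vᵀ with a small PSD matrix A (that exact parameterization is an assumption here, not spelled out in this excerpt).

```python
# Sketch: form V from the bottom n eigenvectors of the graph Laplacian of the
# symmetrized neighborhood graph, giving a smooth low-rank basis for K = V A V^T.
import numpy as np
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh

def laplacian_basis(adjacency, n=10):
    Lap = laplacian(np.asarray(adjacency, dtype=float), normed=False)
    # eigh returns eigenvalues in ascending order; keep the n smallest ("bottom")
    _, vecs = eigh(Lap, subset_by_index=[0, n - 1])
    return vecs                                      # m x n basis matrix V
```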
52 Subsequent to the semidefinite program we perform local refinement of the embedding via gradient descent. [sent-161, score-0.221]
53 We demonstrate this on three datasets: digits from the USPS database, the 20 Newsgroups dataset containing Usenet articles in text form, and a collection of NIPS papers from 1987 to 1999. [sent-167, score-0.689]
54 1 We compare “colored” MVU (also called MUHSIC, maximum unfolding via HSIC) to MVU [1] and PCA, highlighting places where MUHSIC produces more meaningful results by incorporating side information. [sent-168, score-0.272]
55 Further details, such as the effects of the adjacency matrices and a comparison to Neighborhood Component Analysis [6], are relegated to the appendix due to limitations of space. [sent-169, score-0.104]
56 For images we use the Euclidean distance between pixel values as the base metric. [sent-170, score-0.09]
57 As before, the Euclidean distance on those vectors is used to find the nearest neighbors. [sent-172, score-0.102]
58 As in [4] we construct the nearest neighbor graph by considering the 1% nearest neighbors of each point. [sent-173, score-0.189]
59 Subsequently the adjacency matrix of this graph is symmetrized. [sent-174, score-0.192]
60 Moreover, as in [4] we choose 10 dimensions (n = 10) to decompose the embedding matrix K. [sent-176, score-0.265]
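A sketch of this preprocessing step, assuming scikit-learn: take the 1% nearest neighbors of each point, symmetrize the adjacency matrix, and record the squared distances needed for the constraints.

```python
# Sketch of the neighborhood-graph construction described above. X is an m x d
# data matrix; frac=0.01 corresponds to the 1% nearest neighbors per point.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighborhood_graph(X, frac=0.01):
    m = X.shape[0]
    k = max(1, int(frac * m))
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)     # +1: each point is its own nearest neighbor
    A = nn.kneighbors_graph(X, mode='connectivity').toarray()
    np.fill_diagonal(A, 0)
    A = np.maximum(A, A.T)                              # symmetrize the adjacency matrix
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m) if A[i, j] > 0]
    d2 = {(i, j): float(np.sum((X[i] - X[j]) ** 2)) for (i, j) in pairs}
    return A, pairs, d2
```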
61 USPS Digits This dataset consists of images of handwritten digits at a resolution of 16×16 pixels. [sent-179, score-0.105]
62 Y is used to construct the matrix L by applying the kernel k(y, y′) = δy,y′ . [sent-185, score-0.118]
63 This kernel further promotes embeddings where images from the same class are grouped more tightly. [sent-186, score-0.327]
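The label kernel itself is a one-line construction; a sketch assuming a vector y of class labels:

```python
# Sketch: the delta kernel k(y, y') = 1 if the labels agree and 0 otherwise,
# assembled into an m x m side-information matrix L from a label vector y.
import numpy as np

def delta_kernel(y):
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)
```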
64 Figure 1 also shows the eigenspectrum of K produced by different methods. [sent-194, score-0.1]
65 Newsgroups This dataset consists of Usenet articles collected from 20 different newsgroups. [sent-199, score-0.099]
66 We use a subset of 2000 documents for our experiments (100 articles from each newsgroup). [sent-200, score-0.159]
67 We remove the headers from the articles before preprocessing, while keeping the subject line. [sent-201, score-0.137]
68 For instance, 5 topics are related to computer science, 3 are related to religion, and 4 are related to recreation. [sent-203, score-0.076]
69 We will use these different topics as side information and apply a delta kernel k(y, y′) = δy,y′ on them. [sent-204, score-0.265]
70 Similar to USPS digits we want to preserve the identity of individual newsgroups. [sent-205, score-0.101]
71 5 Figure 1: Embedding of 2007 USPS digits produced by MUHSIC, MVU and PCA respectively. [sent-213, score-0.103]
72 The color bar below each figure shows the eigenspectrum of the learned kernel matrix K. [sent-215, score-0.255]
73 Figure 2: Embedding of 2000 newsgroup articles produced by MUHSIC, MVU and PCA respectively. [sent-216, score-0.164]
74 Colors and shapes of the dots are used to denote articles from different newsgroups. [sent-217, score-0.099]
75 The color bar below each figure shows the eigenspectrum of the learned kernel matrix K. [sent-218, score-0.255]
76 A distinctive feature of the visualizations is that MUHSIC groups articles from individual topics more tightly than MVU and PCA. [sent-219, score-0.175]
77 For instance, on the left side of the embedding, all computer science topics are placed adjacent to each other; comp. [sent-221, score-0.216]
78 Likewise we see that on the top we find all recreational topics (with rec. [sent-236, score-0.076]
79 NIPS Papers We used the 1735 regular NIPS papers from 1987 to 1999. [sent-252, score-0.296]
80 Our goal is to embed the papers by taking the coauthor information into account. [sent-256, score-0.336]
81 Furthermore, we also annotated some papers to show the semantics revealed by the embedding. [sent-259, score-0.296]
82 All three methods correctly represent the two major topics of NIPS papers: artificial systems, i.e. [sent-261, score-0.076]
83 machine learning (they are positioned on the left side of the visualization) and natural systems (positioned on the right). Figure 3: Embedding of 1735 NIPS papers produced by MUHSIC, MVU and PCA. [sent-263, score-0.441]
84 The color bar below each figure shows the eigenspectrum of the learned kernel matrix K. [sent-265, score-0.255]
85 For instance, the papers by Smola, Schölkopf and Jordan are embedded on the left, whereas the many papers by Sejnowski, Dayan and Bialek can be found on the right. [sent-272, score-0.619]
86 Unique to the visualization of MUHSIC is that there is a clear grouping of the papers by researchers. [sent-273, score-0.345]
87 Interestingly, papers by Jordan (at that time best-known for his work in graphical models) are grouped close to the papers on reinforcement learning. [sent-275, score-0.623]
88 Another interesting trend is that papers on new fields of research are embedded on the edges. [sent-277, score-0.323]
89 For instance, papers on reinforcement learning (Barto, Singh and Sutton), are along the left edge. [sent-278, score-0.327]
90 Note that while MUHSIC groups papers according to authors, thereby preserving the macroscopic structure of the data, it also reveals the microscopic semantics between the papers. [sent-280, score-0.363]
91 For instance, the 4 papers (numbered from 6 to 9 in Figure 3) by Smola, Schölkopf, Hinton and Dayan are very close to each other. [sent-281, score-0.296]
92 Although their titles do not convey strong similarity information, these papers all used handwritten digits for the experiments. [sent-282, score-0.369]
93 Although most of his papers are on the neuroscience side, two of his papers (numbered 14 and 15) on reinforcement learning can be found on the machine learning side. [sent-284, score-0.623]
94 A third example are papers by Bialek and Hinton on spiking neurons (numbered 20, 21 and 23). [sent-285, score-0.327]
95 Although Hinton’s papers are mainly on the left, his paper on spiking Boltzmann machines is closer to Bialek’s two papers on spiking neurons. [sent-286, score-0.654]
96 8 Discussion In summary, MUHSIC provides an embedding of the data which preserves side information possibly available at training time. [sent-287, score-0.336]
97 It makes feature extraction robust to spurious interactions between observations and noise (see the appendix for an example of adjacency matrices and further discussion). [sent-289, score-0.134]
98 A fortuitous side-effect is that if the matrix containing side information is of low rank, the reduced representation learned by MUHSIC can be lower rank than that obtained by MVU, too. [sent-290, score-0.159]
99 The fastest mixing Markov process on a graph and a connection to a maximum variance unfolding problem. [sent-310, score-0.231]
100 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-338, score-0.108]
wordName wordTfidf (topN-words)
[('mvu', 0.527), ('muhsic', 0.343), ('papers', 0.296), ('embedding', 0.221), ('hsic', 0.209), ('tr', 0.173), ('unfolding', 0.128), ('hlh', 0.123), ('hkhl', 0.121), ('kb', 0.121), ('kjj', 0.121), ('eij', 0.12), ('side', 0.115), ('articles', 0.099), ('wij', 0.098), ('colored', 0.095), ('kii', 0.09), ('hka', 0.081), ('prxy', 0.081), ('adjacency', 0.077), ('topics', 0.076), ('ij', 0.076), ('kernel', 0.074), ('digits', 0.073), ('graph', 0.071), ('cxy', 0.07), ('eigenspectrum', 0.07), ('dependence', 0.07), ('preserving', 0.067), ('eigenvalues', 0.062), ('usps', 0.062), ('usenet', 0.061), ('documents', 0.06), ('distance', 0.058), ('maximally', 0.058), ('laplacian', 0.056), ('newsgroups', 0.053), ('gretton', 0.051), ('semide', 0.05), ('smola', 0.05), ('hinton', 0.05), ('visualization', 0.049), ('centering', 0.048), ('numbered', 0.045), ('hl', 0.045), ('matrix', 0.044), ('nearest', 0.044), ('song', 0.042), ('bialek', 0.042), ('brilliant', 0.04), ('coauthor', 0.04), ('hkb', 0.04), ('keij', 0.04), ('toc', 0.04), ('preserved', 0.04), ('rm', 0.039), ('instance', 0.039), ('australia', 0.039), ('subject', 0.038), ('distances', 0.038), ('independence', 0.038), ('highlighted', 0.037), ('pca', 0.036), ('dayan', 0.036), ('bar', 0.036), ('dual', 0.035), ('exx', 0.035), ('newsgroup', 0.035), ('sha', 0.035), ('vanishing', 0.035), ('weinberger', 0.035), ('hilbert', 0.034), ('eigenvectors', 0.034), ('retain', 0.033), ('hierarchy', 0.033), ('images', 0.032), ('schmidt', 0.032), ('hij', 0.032), ('kz', 0.032), ('variance', 0.032), ('reinforcement', 0.031), ('nonzero', 0.031), ('spiking', 0.031), ('color', 0.031), ('observations', 0.03), ('produced', 0.03), ('euclidean', 0.03), ('singh', 0.03), ('neighbors', 0.03), ('meaningful', 0.029), ('sejnowski', 0.028), ('nips', 0.028), ('preserve', 0.028), ('embeddings', 0.027), ('barto', 0.027), ('variants', 0.027), ('embedded', 0.027), ('matrices', 0.027), ('borgwardt', 0.026), ('adjacent', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 49 nips-2007-Colored Maximum Variance Unfolding
Author: Le Song, Arthur Gretton, Karsten M. Borgwardt, Alex J. Smola
Abstract: Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distance-preserving constraints. This general view allows us to design “colored” variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information.
2 0.18189901 115 nips-2007-Learning the 2-D Topology of Images
Author: Nicolas L. Roux, Yoshua Bengio, Pascal Lamblin, Marc Joliveau, Balázs Kégl
Abstract: We study the following question: is the two-dimensional structure of images a very strong prior or is it something that can be learned with a few examples of natural images? If someone gave us a learning task involving images for which the two-dimensional topology of pixels was not known, could we discover it automatically and exploit it? For example suppose that the pixels had been permuted in a fixed but unknown way, could we recover the relative two-dimensional location of pixels on images? The surprising result presented here is that not only the answer is yes, but that about as few as a thousand images are enough to approximately recover the relative locations of about a thousand pixels. This is achieved using a manifold learning algorithm applied to pixels associated with a measure of distributional similarity between pixel intensities. We compare different topologyextraction approaches and show how having the two-dimensional topology can be exploited.
3 0.14664389 7 nips-2007-A Kernel Statistical Test of Independence
Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola
Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
4 0.087463677 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
Author: Ronny Luss, Alexandre D'aspremont
Abstract: In this paper, we propose a method for support vector machine classification using indefinite kernels. Instead of directly minimizing or stabilizing a nonconvex loss function, our method simultaneously finds the support vectors and a proxy kernel matrix used in computing the loss. This can be interpreted as a robust classification problem where the indefinite kernel matrix is treated as a noisy observation of the true positive semidefinite kernel. Our formulation keeps the problem convex and relatively large problems can be solved efficiently using the analytic center cutting plane method. We compare the performance of our technique with other methods on several data sets.
5 0.069195978 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
6 0.068654023 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
7 0.063305721 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
8 0.062797941 97 nips-2007-Hidden Common Cause Relations in Relational Learning
9 0.06236827 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
10 0.062094364 109 nips-2007-Kernels on Attributed Pointsets with Applications
11 0.056108452 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
12 0.055707533 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
13 0.054013725 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
14 0.053839244 182 nips-2007-Sparse deep belief net model for visual area V2
15 0.052422918 63 nips-2007-Convex Relaxations of Latent Variable Training
16 0.051930122 129 nips-2007-Mining Internet-Scale Software Repositories
17 0.051923431 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
18 0.050826021 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
19 0.049934074 118 nips-2007-Learning with Transformation Invariant Kernels
20 0.049161717 160 nips-2007-Random Features for Large-Scale Kernel Machines
topicId topicWeight
[(0, -0.173), (1, 0.061), (2, -0.061), (3, 0.012), (4, -0.033), (5, -0.005), (6, -0.019), (7, 0.089), (8, -0.056), (9, 0.15), (10, 0.033), (11, 0.046), (12, 0.059), (13, 0.062), (14, -0.056), (15, -0.116), (16, 0.048), (17, -0.073), (18, -0.016), (19, 0.113), (20, 0.07), (21, 0.121), (22, -0.066), (23, -0.061), (24, -0.06), (25, 0.041), (26, 0.113), (27, -0.02), (28, 0.007), (29, -0.065), (30, -0.025), (31, -0.082), (32, 0.042), (33, -0.077), (34, 0.08), (35, -0.077), (36, 0.009), (37, 0.232), (38, 0.096), (39, 0.045), (40, 0.032), (41, 0.066), (42, -0.085), (43, 0.091), (44, -0.011), (45, -0.028), (46, -0.051), (47, -0.062), (48, -0.126), (49, 0.112)]
simIndex simValue paperId paperTitle
same-paper 1 0.94134015 49 nips-2007-Colored Maximum Variance Unfolding
Author: Le Song, Arthur Gretton, Karsten M. Borgwardt, Alex J. Smola
Abstract: Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distance-preserving constraints. This general view allows us to design “colored” variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information.
2 0.69749612 7 nips-2007-A Kernel Statistical Test of Independence
Author: Arthur Gretton, Kenji Fukumizu, Choon H. Teo, Le Song, Bernhard Schölkopf, Alex J. Smola
Abstract: Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m2 ), where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
3 0.60444689 108 nips-2007-Kernel Measures of Conditional Dependence
Author: Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, Bernhard Schölkopf
Abstract: We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments. 1
4 0.57865912 115 nips-2007-Learning the 2-D Topology of Images
Author: Nicolas L. Roux, Yoshua Bengio, Pascal Lamblin, Marc Joliveau, Balázs Kégl
Abstract: We study the following question: is the two-dimensional structure of images a very strong prior or is it something that can be learned with a few examples of natural images? If someone gave us a learning task involving images for which the two-dimensional topology of pixels was not known, could we discover it automatically and exploit it? For example suppose that the pixels had been permuted in a fixed but unknown way, could we recover the relative two-dimensional location of pixels on images? The surprising result presented here is that not only the answer is yes, but that about as few as a thousand images are enough to approximately recover the relative locations of about a thousand pixels. This is achieved using a manifold learning algorithm applied to pixels associated with a measure of distributional similarity between pixel intensities. We compare different topologyextraction approaches and show how having the two-dimensional topology can be exploited.
5 0.4437204 109 nips-2007-Kernels on Attributed Pointsets with Applications
Author: Mehul Parsana, Sourangshu Bhattacharya, Chiru Bhattacharya, K. Ramakrishnan
Abstract: This paper introduces kernels on attributed pointsets, which are sets of vectors embedded in an euclidean space. The embedding gives the notion of neighborhood, which is used to define positive semidefinite kernels on pointsets. Two novel kernels on neighborhoods are proposed, one evaluating the attribute similarity and the other evaluating shape similarity. Shape similarity function is motivated from spectral graph matching techniques. The kernels are tested on three real life applications: face recognition, photo album tagging, and shot annotation in video sequences, with encouraging results. 1
6 0.44024575 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
7 0.42475909 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
8 0.38572791 107 nips-2007-Iterative Non-linear Dimensionality Reduction with Manifold Sculpting
9 0.37318403 67 nips-2007-Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation
10 0.35348913 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
11 0.35148865 118 nips-2007-Learning with Transformation Invariant Kernels
12 0.34319907 97 nips-2007-Hidden Common Cause Relations in Relational Learning
13 0.33937576 70 nips-2007-Discriminative K-means for Clustering
14 0.33559266 12 nips-2007-A Spectral Regularization Framework for Multi-Task Structure Learning
15 0.32633811 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
16 0.30240661 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
17 0.2945627 129 nips-2007-Mining Internet-Scale Software Repositories
18 0.29406106 80 nips-2007-Ensemble Clustering using Semidefinite Programming
19 0.28218937 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
20 0.27937588 161 nips-2007-Random Projections for Manifold Learning
topicId topicWeight
[(5, 0.055), (13, 0.051), (16, 0.036), (18, 0.014), (21, 0.041), (31, 0.017), (34, 0.028), (35, 0.043), (47, 0.062), (49, 0.026), (79, 0.27), (83, 0.141), (85, 0.018), (87, 0.036), (90, 0.074)]
simIndex simValue paperId paperTitle
1 0.82530689 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation
Author: Luis E. Ortiz
Abstract: This paper proposes constraint propagation relaxation (CPR), a probabilistic approach to classical constraint propagation that provides another view on the whole parametric family of survey propagation algorithms SP(ρ). More importantly, the approach elucidates the implicit, but fundamental assumptions underlying SP(ρ), thus shedding some light on its effectiveness and leading to applications beyond k-SAT. 1
same-paper 2 0.77507281 49 nips-2007-Colored Maximum Variance Unfolding
Author: Le Song, Arthur Gretton, Karsten M. Borgwardt, Alex J. Smola
Abstract: Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distance-preserving constraints. This general view allows us to design “colored” variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information.
3 0.57922596 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
4 0.57311398 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
Author: Marc'aurelio Ranzato, Y-lan Boureau, Yann L. Cun
Abstract: Unsupervised learning algorithms aim to discover the structure hidden in the data, and to learn representations that are more suitable as input to a supervised machine than the raw input. Many unsupervised methods are based on reconstructing the input from the representation, while constraining the representation to have certain desirable properties (e.g. low dimension, sparsity, etc). Others are based on approximating density by stochastically reconstructing the input from the representation. We describe a novel and efficient algorithm to learn sparse representations, and compare it theoretically and experimentally with a similar machine trained probabilistically, namely a Restricted Boltzmann Machine. We propose a simple criterion to compare and select different unsupervised machines based on the trade-off between the reconstruction error and the information content of the representation. We demonstrate this method by extracting features from a dataset of handwritten numerals, and from a dataset of natural image patches. We show that by stacking multiple levels of such machines and by training sequentially, high-order dependencies between the input observed variables can be captured. 1
5 0.56639981 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
Author: Tony Jebara, Yingbo Song, Kapil Thadani
Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.
6 0.56634444 115 nips-2007-Learning the 2-D Topology of Images
7 0.5654037 156 nips-2007-Predictive Matrix-Variate t Models
8 0.5640915 185 nips-2007-Stable Dual Dynamic Programming
9 0.56403017 7 nips-2007-A Kernel Statistical Test of Independence
10 0.56329626 187 nips-2007-Structured Learning with Approximate Inference
11 0.56305981 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
12 0.56283164 16 nips-2007-A learning framework for nearest neighbor search
13 0.56240392 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
14 0.56190217 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
15 0.56182671 84 nips-2007-Expectation Maximization and Posterior Constraints
16 0.56137413 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
17 0.56132144 134 nips-2007-Multi-Task Learning via Conic Programming
18 0.56017268 146 nips-2007-On higher-order perceptron algorithms
19 0.55973893 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
20 0.55915606 18 nips-2007-A probabilistic model for generating realistic lip movements from speech