nips nips2008 nips2008-113 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Novi Quadrianto, Le Song, Alex J. Smola
Abstract: Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between the classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Object matching is a fundamental operation in data analysis. [sent-8, score-0.217]
2 It typically requires the definition of a similarity measure between the classes of objects to be matched. [sent-9, score-0.111]
3 Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. [sent-10, score-0.312]
4 For instance, we might want to match a photo with a textual description of a person, a map with a satellite image, or a music score with a music performance. [sent-14, score-0.191]
5 For instance, we might have two collections of documents purportedly covering the same content, written in two different languages. [sent-19, score-0.11]
6 In the following we present a method which is able to perform such matching without the need for a cross-domain similarity measure. [sent-21, score-0.312]
7 We choose the Hilbert Schmidt Independence Criterion between two sets and we maximize over the permutation group to find a good match. [sent-24, score-0.237]
8 When using a different measure of dependence, namely an approximation of the mutual information, our method is related to an algorithm of [1]. [sent-27, score-0.11]
9 Finally, we give a simple approximation algorithm for kernelized sorting. [sent-28, score-0.222]
10 That is, we would like to find some permutation π ∈ Π_m on m terms, that is Π_m := {π | π ∈ {0, 1}^{m×m}, π 1_m = 1_m and π^⊤ 1_m = 1_m}, (1) such that the pairs Z(π) := {(x_i, y_{π(i)})} for 1 ≤ i ≤ m correspond to dependent random variables. [sent-38, score-0.193]
11 We seek a permutation π such that the mapping xi → yπ(i) and its converse mapping from y to x are simple. [sent-40, score-0.37]
12 Then we define nonparametric sorting of X and Y as follows: π* := argmax_{π ∈ Π_m} D(Z(π)). [sent-42, score-0.362]
13 The Hilbert Schmidt Independence Criterion (HSIC) [2] measures the dependence between x and y by computing the norm of the cross-covariance operator over the domain X × Y in Hilbert Space. [sent-46, score-0.137]
14 Formally, let F be the Reproducing Kernel Hilbert Space (RKHS) on X with associated kernel k : X × X → R and feature map φ : X → F. [sent-49, score-0.13]
15 Let G be the RKHS on Y with kernel l and feature map ψ. [sent-50, score-0.13]
16 In terms of kernels HSIC can be expressed as E_{x,x',y,y'}[k(x, x') l(y, y')] + E_{x,x'}[k(x, x')] E_{y,y'}[l(y, y')] − 2 E_{x,y}[E_{x'}[k(x, x')] E_{y'}[l(y, y')]], (4) where E_{x,x',y,y'} is the expectation over both (x, y) ∼ Pr_{xy} and an additional pair of variables (x', y') ∼ Pr_{xy} drawn independently according to the same law. [sent-53, score-0.12]
17 Given a sample Z := {(x_1, y_1), …, (x_m, y_m)} of size m drawn from Pr_{xy}, an empirical estimate of HSIC is D(F, G, Z) = (m − 1)^{−2} tr HKHL = (m − 1)^{−2} tr K̄ L̄, (5) [sent-57, score-0.19]
18 where K, L ∈ R^{m×m} are the kernel matrices for the data and the labels respectively, i.e. K_{ij} = k(x_i, x_j) and L_{ij} = l(y_i, y_j), and K̄ = HKH, L̄ = HLH are their centered versions. [sent-58, score-0.202]
19 Second, HSIC is easy to compute, since only the kernel matrices are required and no density estimation is needed. [sent-71, score-0.202]
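A minimal sketch of the biased estimate (5), assuming precomputed kernel matrices K and L as NumPy arrays; the function name hsic_biased is hypothetical and not from the paper:

```python
import numpy as np

def hsic_biased(K, L):
    """Biased empirical HSIC estimate (m - 1)^{-2} tr(HKHL) from (5)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m   # centering matrix
    return np.trace(H @ K @ H @ L) / (m - 1) ** 2
```

For an independently shuffled pairing the estimate should be close to zero, while strongly dependent pairings give a large value.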
20 The freedom of choosing a kernel allows us to incorporate prior knowledge into the dependence estimation process. [sent-72, score-0.219]
21 Lemma 1 The nonparametric sorting problem is given by π* = argmax_{π ∈ Π_m} tr K̄ π^⊤ L̄ π. [sent-74, score-0.439]
22 Note that since H is a centering matrix, it has the eigenvalue 0 for the vector of all ones and the eigenvalue 1 for all vectors orthogonal to that. [sent-76, score-0.148]
23 Next note that the vector of all ones is also an eigenvector of any permutation matrix π with π1 = 1. [sent-77, score-0.193]
24 In this case (5) is maximized by either π = argsort y or by π = argsort −y. [sent-83, score-0.252]
25 Since the centering matrix H only changes the offset but not the order this is equivalent to sorting y. [sent-87, score-0.406]
26 This means that sorting is a special case of kernelized sorting, hence the name. [sent-89, score-0.584]
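As a small illustration of this special case, assuming linear kernels on scalar observations so that K and L have rank 1, pairing the r-th smallest x with the r-th smallest y maximizes the centered objective:

```python
import numpy as np

def objective(x, y_matched):
    # For rank-1 linear kernels, tr(K-bar pi^T L-bar pi) reduces to (x_c^T y_c)^2
    # computed on centered, matched vectors.
    xc, yc = x - x.mean(), y_matched - y_matched.mean()
    return (xc @ yc) ** 2

rng = np.random.default_rng(0)
x, y = rng.standard_normal(50), rng.standard_normal(50)

pi = np.empty(50, dtype=int)
pi[np.argsort(x)] = np.argsort(y)     # pair the r-th smallest x with the r-th smallest y
print(objective(x, y[pi]), objective(x, y[rng.permutation(50)]))   # sorted pairing >> random
```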
27 2 Diagonal Dominance In some cases the biased estimate of HSIC as given in (5) leads to very undesirable results, in particular in the case of document analysis. [sent-92, score-0.134]
28 This is the case since kernel matrices on texts tend to be diagonally dominant: a document tends to be much more similar to itself than to others. [sent-93, score-0.363]
29 Using the shorthand K̃_{ij} = K_{ij}(1 − δ_{ij}) and L̃_{ij} = L_{ij}(1 − δ_{ij}) for kernel matrices where the main diagonal terms have been removed, we arrive at the expression (m − 1)^{−2} tr H L̃ H K̃. [sent-99, score-0.279]
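A sketch of this bias-corrected objective, assuming NumPy arrays; the helper name is hypothetical:

```python
import numpy as np

def hsic_diag_removed(K, L):
    """Bias-corrected variant: diagonals removed (K~, L~) before centering."""
    m = K.shape[0]
    Kt = K - np.diag(np.diag(K))      # K~_ij = K_ij (1 - delta_ij)
    Lt = L - np.diag(np.diag(L))
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(H @ Lt @ H @ Kt) / (m - 1) ** 2
```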
30 3 Mutual Information An alternative, natural means of studying the dependence between random variables is to compute the mutual information between the random variables xi and yπ(i) . [sent-102, score-0.28]
31 However, if we assume that x and y are jointly normal in the Reproducing Kernel Hilbert Spaces spanned by the kernels k, l and k · l we can devise an effective approximation of the mutual information. [sent-104, score-0.11]
32 Since the mutual information between random variables X and Y is I(X, Y) = h(X) + h(Y) − h(X, Y), we will obtain maximum mutual information by minimizing the joint entropy h(X, Y). [sent-106, score-0.258]
33 Using the Gaussian upper bound on the joint entropy we can maximize a lower bound on the mutual information by minimizing the joint entropy J(π) := h(X, Y). [sent-107, score-0.23]
34 By defining a joint kernel on X × Y via k((x, y), (x', y')) = k(x, x') l(y, y') we arrive at the optimization problem argmin_{π ∈ Π_m} log |H J(π) H| where J_{ij} = K_{ij} L_{π(i),π(j)}. (8) [sent-108, score-0.179]
35 Note that this is related to the optimization criterion proposed by Jebara [1] in the context of sorting via minimum volume PCA. [sent-109, score-0.455]
36 The main difference is that [1] uses the setting to align bags of observations by optimizing log |HJ(π)H| with respect to re-ordering within each of the bags. [sent-111, score-0.11]
37 As we shall see, for the optimization in Lemma 1 a simple iteration over linear assignment problems will lead to desirable solutions, whereas in (8) even computing derivatives is a computational challenge. [sent-114, score-0.133]
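For comparison, evaluating the entropy-based objective (8) for a fixed permutation might look as follows; the small ridge added before the log-determinant is a numerical workaround assumed here, not something stated in the paper:

```python
import numpy as np

def entropy_objective(K, L, perm, ridge=1e-8):
    """log |H J(pi) H| with J_ij = K_ij * L_{pi(i), pi(j)}; to be minimized over pi."""
    m = K.shape[0]
    J = K * L[np.ix_(perm, perm)]
    H = np.eye(m) - np.ones((m, m)) / m
    sign, logdet = np.linalg.slogdet(H @ J @ H + ridge * np.eye(m))
    return logdet
```

Even a single evaluation requires a log-determinant, which hints at why optimizing (8) over permutations is considerably harder than the linear-assignment steps used for Lemma 1.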
38 3 Optimization DC Programming To find a local maximum of the matching problem we may take recourse to a well-known algorithm, namely DC Programming [4] which in machine learning is also known as the Concave Convex Procedure [5]. [sent-115, score-0.217]
39 This lower bound is convex and it can be maximized effectively over a convex domain. [sent-117, score-0.166]
40 Lemma 4 The function tr K̄ π^⊤ L̄ π is convex in π. [sent-119, score-0.126]
41 Since K̄, L̄ ⪰ 0 we may factorize them as K̄ = U U^⊤ and L̄ = V V^⊤, so we may rewrite the objective function as ‖V^⊤ π U‖², which is clearly a convex quadratic function in π. [sent-120, score-0.115]
42 Note that the set of feasible permutations π is constrained in a unimodular fashion, that is, the set P_m := {M ∈ R^{m×m} | M_{ij} ≥ 0, Σ_i M_{ij} = 1 and Σ_j M_{ij} = 1} (10) has only integral vertices, namely admissible permutation matrices. [sent-121, score-0.193]
43 This means that the following procedure will generate a succession of permutation matrices which will yield a local maximum for the assignment problem: π_{i+1} = (1 − λ) π_i + λ argmax_{π ∈ P_m} tr K̄ π^⊤ L̄ π_i. (11) Here we may choose λ = 1 in the last step to ensure integrality. [sent-122, score-0.426]
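A sketch of the iteration (11) with λ = 1, assuming precomputed (uncentered) kernel matrices K and L; each step is a linear assignment problem, solved here with SciPy's linear_sum_assignment, and both the solver choice and the random initialization are assumptions of this sketch rather than details from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def kernelized_sorting(K, L, n_iter=50, seed=0):
    """DC iteration (11) with lambda = 1; a sketch, not the authors' implementation."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    Kb, Lb = H @ K @ H, H @ L @ H                 # K-bar, L-bar
    rng = np.random.default_rng(seed)
    pi = np.eye(m)[rng.permutation(m)]            # random initial permutation matrix
    best = -np.inf
    for _ in range(n_iter):
        C = Lb @ pi @ Kb                          # tr(Kb pi_new^T Lb pi) = sum_ab (pi_new)_ab C_ab
        rows, cols = linear_sum_assignment(C, maximize=True)
        pi_new = np.zeros((m, m))
        pi_new[rows, cols] = 1.0
        obj = np.trace(Kb @ pi_new.T @ Lb @ pi_new)
        if obj <= best + 1e-12:                   # no further improvement: local maximum
            break
        pi, best = pi_new, obj
    return pi
```

Since only a local maximum is found, in practice one would restart from several initial permutations and keep the one with the largest objective.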
44 That is, if K and L only had rank-1, the problem could be solved by sorting X and Y in matching fashion. [sent-134, score-0.579]
45 The problem with the relaxation (12) is that it does not scale well to large estimation problems (the size of the optimization problem scales as O(m^4)) and that the relaxation does not guarantee a feasible solution, which means that subsequent projection heuristics need to be found. [sent-139, score-0.125]
46 Moreover, denote by ki : Xi × Xi → R the corresponding kernels. [sent-148, score-0.188]
47 The expectation operator with respect to the joint distribution and with respect to the product of the marginals is given by [2] E_{x_1,…,x_T}[⊗_{i=1}^T k_i(x_i, ·)] and ⊗_{i=1}^T E_{x_i}[k_i(x_i, ·)] respectively. [sent-156, score-0.236]
48 The squared difference between both is given by ∏_{i=1}^T E_{x_i,x_i'}[k_i(x_i, x_i')] − 2 E_{x^T}[∏_{i=1}^T E_{x_i'}[k_i(x_i, x_i')]] + E_{x^T,x'^T}[∏_{i=1}^T k_i(x_i, x_i')]. [sent-161, score-0.431]
49 Denote by K_i the kernel matrix obtained from the kernel k_i on the set of observations X_i := {x_{i1}, …, x_{im}}. [sent-164, score-0.49]
50 To apply this to sorting we only need to define T permutation matrices π_i ∈ Π_m and replace the kernel matrices K_i by π_i K_i π_i^⊤. [sent-173, score-0.829]
51 That is, the objective function is convex in the permutation matrices πi and we may apply DC programming to find a locally optimal solution. [sent-176, score-0.38]
52 5 Applications To investigate the performance of our algorithm (it is a fairly nonstandard unsupervised method) we applied it to a variety of different problems ranging from visualization to matching and estimation. [sent-178, score-0.265]
53 In particular, we want to align it according to a given template, such as a grid, a torus, or any other fixed structure. [sent-183, score-0.12]
54 Such problems occur when presenting images or documents to a user. [sent-184, score-0.237]
55 Instead, we may use kernelized sorting to align objects. [sent-190, score-0.652]
56 Here the kernel matrix L is given by the similarity measure between the objects xi that are to be aligned. [sent-191, score-0.322]
57 The kernel K, on the other hand, denotes the similarity between the locations to which the objects are to be aligned. [sent-192, score-0.241]
58 For the sake of simplicity we used a Gaussian RBF kernel between the objects to be laid out and also between the grid positions. (Figure 1: Layout of 284 images into a ‘NIPS 2008’ letter grid using kernelized sorting.) [sent-193, score-0.635]
59 The kernel width γ was adjusted to the inverse median of ‖x − x'‖² such that the argument of the exponential is O(1). [sent-197, score-0.13]
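A sketch of this median adjustment, assuming observations stored as rows of a NumPy array; the function name is hypothetical:

```python
import numpy as np

def rbf_median(X):
    """Gaussian RBF kernel with gamma = 1 / median(||x - x'||^2)."""
    sq = (X ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T      # pairwise squared distances
    gamma = 1.0 / np.median(D[np.triu_indices_from(D, k=1)])
    return np.exp(-gamma * D)
```

For the layout experiment this construction would be applied both to the vectorized Lab images (giving L) and to the grid coordinates (giving K).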
60 Our choice of the Gaussian RBF kernel is likely not optimal for the specific set of observations (e.g. [sent-198, score-0.172]
61 SIFT feature extraction followed by a set kernel would be much more appropriate for images). [sent-200, score-0.13]
62 We converted the images from RGB into Lab color space, yielding 40 × 40 × 3 dimensional objects. [sent-205, score-0.127]
63 The grid, corresponding to X, is the set of ‘NIPS 2008’ letters on which the images are to be laid out. [sent-206, score-0.181]
64 After sorting we display the images according to their matching coordinates (Figure 1). [sent-207, score-0.744]
65 We can see that images with similar color composition are found at proximal locations. [sent-208, score-0.127]
66 We also lay out the images (we add 36 images to bring the total to 320) onto a 2D grid with a 16 × 20 mesh using kernelized sorting. [sent-209, score-0.11]
67 Although the images are also arranged according to the color grading, the drawback of SOM (and GTM) is that it creates blank spaces in the layout. [sent-211, score-0.127]
68 This is because SOM maps several images into the same neuron. [sent-212, score-0.127]
69 While SOM is excellent at grouping similar images together, it falls short of exactly arranging the images into a 2D grid. [sent-214, score-0.254]
70 2 Matching To obtain more quantifiable results rather than just generally aesthetically pleasing pictures we apply our algorithm to matching problems where the correct match is known. [sent-216, score-0.282]
71 For this purpose we used the data from the layout experiment and we cut the images into two 20 × 40 pixel patches. [sent-218, score-0.227]
72 The aim was to find an alignment between both halves such that the dependence between them is maximized. [sent-219, score-0.203]
73 In other words, given xi being the left half of the image and yi being the right half, we want to find a permutation π which lines up xi and yi . [sent-220, score-0.443]
74 This would be a trivial undertaking when being able to compare the two image halves xi and yi . [sent-221, score-0.207]
75 While such comparison is clearly feasible for images where we know the compatibility function, it may not be possible for generic objects. [sent-222, score-0.192]
76 For a total of 320 images we recovered 140 pairs. [sent-224, score-0.127]
77 This is quite respectable given that chance level would be 1 correct pair (a random permutation matrix has in expectation one nonzero diagonal entry). [sent-225, score-0.193]
78 For this purpose we used binary, multiclass, and regression data sets. (Table 1: Error rate for matching problems — data sets: australian, breastcancer, derm, optdigits, wdbc, satimage, segment, vehicle, abalone, bodyfat; columns report m, Kernelized Sorting, and Baseline error rates.) [sent-228, score-0.297]
79 To ensure good dependence between the subsets of variables we choose a split which ensures correlation. [sent-291, score-0.13]
80 We pick coordinates with at least 0.5 correlation with the reference and split those equally into two sets, set A and set B. [sent-294, score-0.11]
81 We also split the remaining coordinates equally into the two existing sets and finally put the reference coordinate into set A. [sent-295, score-0.148]
82 As before, we use a Gaussian RBF kernel with median adjustment of the kernel width. [sent-299, score-0.26]
83 Multilingual Document Matching To illustrate that kernelized sorting is able to recover nontrivial similarity relations we applied our algorithm to the matching of multilingual documents. [sent-304, score-0.977]
84 We select the 300 longest documents of Danish (Da), Dutch (Nl), English (En), French (Fr), German (De), Italian (It), Portuguese (Pt), Spanish (Es), and Swedish (Sv). [sent-307, score-0.147]
85 The purpose is to match the non-English documents (source languages) to their English translations (target language). [sent-308, score-0.218]
86 In fact, one could use kernelized sorting to generate a dictionary after initial matching has occurred. [sent-310, score-0.873]
87 In keeping with the choice of a simple kernel we used standard TF-IDF (term frequency – inverse document frequency) features in a bag-of-words kernel. [sent-311, score-0.23]
88 Since kernel matrices on documents are notoriously diagonally dominant we use the bias-corrected version of our optimization problem. [sent-316, score-0.422]
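A sketch of how the document kernels could be built, assuming scikit-learn's TfidfVectorizer for the TF-IDF bag-of-words features; doc_kernel, source_docs and english_docs are hypothetical names, and zeroing the diagonal is a simplification of the exact bias-corrected objective above:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def doc_kernel(docs):
    """Linear bag-of-words kernel on TF-IDF features, with the diagonal removed."""
    X = TfidfVectorizer().fit_transform(docs)       # sparse m x vocabulary matrix
    K = (X @ X.T).toarray()
    np.fill_diagonal(K, 0.0)                        # counter the diagonal dominance
    return K

# K_src, K_tgt = doc_kernel(source_docs), doc_kernel(english_docs)
# pi = kernelized_sorting(K_src, K_tgt)             # the optimization sketch given earlier
```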
89 As a baseline we used a fairly straightforward means of document matching via document length. [sent-317, score-0.353]
90 That is, longer documents in one language will most probably be translated into longer documents in the other language. [sent-318, score-0.256]
91 Smallest distance matches in combination with a linear assignment solver are used for the matching. [sent-323, score-0.125]
92 Low matching performance for the document length-based method might be due to small variance in the document length after we choose the 300 longest documents. [sent-334, score-0.454]
93 Further, in forming the dictionary we do not perform stemming on English words and thus the dictionary is highly customized to the problem at hand. [sent-336, score-0.126]
94 Our method produces results consistent with the dictionary-based method, with notably low performance for matching German documents to their English translations. [sent-337, score-0.327]
95 We conclude that the difficulty of German-English document matching is inherent to this dataset [9]. [sent-338, score-0.317]
96 6 Summary and Discussion In this paper, we generalized sorting by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. [sent-340, score-0.458]
97 This way we are able to perform matching without the need for a cross-domain similarity measure. [sent-341, score-0.312]
98 The proposed sorting algorithm is efficient and it can be applied to a variety of different problems ranging from data visualization to image and multilingual document matching and estimation. [sent-342, score-0.844]
99 Further examples of kernelized sorting and of reference algorithms are given in the appendix. [sent-343, score-0.653]
100 Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. [sent-368, score-0.168]
wordName wordTfidf (topN-words)
[('sorting', 0.362), ('hsic', 0.261), ('kernelized', 0.222), ('matching', 0.217), ('permutation', 0.193), ('ki', 0.188), ('prxy', 0.184), ('hilbert', 0.168), ('kernel', 0.13), ('schmidt', 0.13), ('kij', 0.13), ('images', 0.127), ('som', 0.123), ('mutual', 0.11), ('documents', 0.11), ('lij', 0.108), ('cxy', 0.108), ('document', 0.1), ('argsort', 0.092), ('exx', 0.092), ('dependence', 0.089), ('assignment', 0.084), ('xi', 0.081), ('exi', 0.081), ('multilingual', 0.081), ('tr', 0.077), ('english', 0.073), ('matrices', 0.072), ('independence', 0.072), ('dictionary', 0.072), ('lemma', 0.072), ('reference', 0.069), ('maximized', 0.068), ('align', 0.068), ('objective', 0.066), ('compatibility', 0.065), ('mij', 0.065), ('match', 0.065), ('diagonally', 0.061), ('europarl', 0.061), ('hy', 0.061), ('multiway', 0.061), ('alignment', 0.06), ('yy', 0.06), ('similarity', 0.059), ('layout', 0.057), ('matched', 0.054), ('gtm', 0.054), ('halves', 0.054), ('laid', 0.054), ('stemming', 0.054), ('eigenvalue', 0.052), ('objects', 0.052), ('want', 0.052), ('rm', 0.05), ('grid', 0.05), ('lh', 0.049), ('optimization', 0.049), ('convex', 0.049), ('visualization', 0.048), ('mapping', 0.048), ('operator', 0.048), ('dc', 0.047), ('rbf', 0.045), ('criterion', 0.044), ('initialization', 0.044), ('argmax', 0.044), ('maximize', 0.044), ('centering', 0.044), ('topographic', 0.044), ('jebara', 0.044), ('nicta', 0.044), ('nl', 0.044), ('purpose', 0.043), ('observations', 0.042), ('matches', 0.041), ('split', 0.041), ('xm', 0.041), ('sv', 0.04), ('fr', 0.04), ('relaxation', 0.038), ('reproducing', 0.038), ('coordinates', 0.038), ('german', 0.038), ('albeit', 0.038), ('entropy', 0.038), ('australia', 0.037), ('longest', 0.037), ('music', 0.037), ('da', 0.037), ('australian', 0.037), ('alex', 0.037), ('image', 0.036), ('able', 0.036), ('baseline', 0.036), ('translated', 0.036), ('hj', 0.036), ('ym', 0.036), ('biased', 0.034), ('expectations', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999976 113 nips-2008-Kernelized Sorting
Author: Novi Quadrianto, Le Song, Alex J. Smola
Abstract: Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between the classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution. 1
2 0.2299895 112 nips-2008-Kernel Measures of Independence for non-iid Data
Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola
Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1
3 0.16095601 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
Author: Fernando Pérez-Cruz
Abstract: We analyze the estimation of information theoretic measures of continuous random variables such as: differential entropy, mutual information or KullbackLeibler divergence. The objective of this paper is two-fold. First, we prove that the information theoretic measure estimates using the k-nearest-neighbor density estimation with fixed k converge almost surely, even though the k-nearest-neighbor density estimation with fixed k does not converge to its true measure. Second, we show that the information theoretic measure estimates do not converge for k growing linearly with the number of samples. Nevertheless, these nonconvergent estimates can be used for solving the two-sample problem and assessing if two random variables are independent. We show that the two-sample and independence tests based on these nonconvergent estimates compare favorably with the maximum mean discrepancy test and the Hilbert Schmidt independence criterion. 1
4 0.1350764 59 nips-2008-Dependent Dirichlet Process Spike Sorting
Author: Jan Gasthaus, Frank Wood, Dilan Gorur, Yee W. Teh
Abstract: In this paper we propose a new incremental spike sorting model that automatically eliminates refractory period violations, accounts for action potential waveform drift, and can handle “appearance” and “disappearance” of neurons. Our approach is to augment a known time-varying Dirichlet process that ties together a sequence of infinite Gaussian mixture models, one per action potential waveform observation, with an interspike-interval-dependent likelihood that prohibits refractory period violations. We demonstrate this model by showing results from sorting two publicly available neural data recordings for which a partial ground truth labeling is known. 1
5 0.12410012 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models
Author: Alex J. Smola, Julian J. Mcauley, Tibério S. Caetano
Abstract: Models for near-rigid shape matching are typically based on distance-related features, in order to infer matches that are consistent with the isometric assumption. However, real shapes from image datasets, even when expected to be related by “almost isometric” transformations, are actually subject not only to noise but also, to some limited degree, to variations in appearance and scale. In this paper, we introduce a graphical model that parameterises appearance, distance, and angle features and we learn all of the involved parameters via structured prediction. The outcome is a model for near-rigid shape matching which is robust in the sense that it is able to capture the possibly limited but still important scale and appearance variations. Our experimental results reveal substantial improvements upon recent successful models, while maintaining similar running times. 1
6 0.11730103 117 nips-2008-Learning Taxonomies by Dependence Maximization
7 0.1127287 106 nips-2008-Inferring rankings under constrained sensing
8 0.11059289 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
9 0.10678696 143 nips-2008-Multi-label Multiple Kernel Learning
10 0.10659084 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
11 0.10635205 239 nips-2008-Tighter Bounds for Structured Estimation
12 0.097553924 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
13 0.094524786 226 nips-2008-Supervised Dictionary Learning
14 0.092741951 238 nips-2008-Theory of matching pursuit
15 0.090806313 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization
16 0.08570537 220 nips-2008-Spike Feature Extraction Using Informative Samples
17 0.085031047 95 nips-2008-Grouping Contours Via a Related Image
18 0.084765412 241 nips-2008-Transfer Learning by Distribution Matching for Targeted Advertising
19 0.084056079 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
20 0.083911307 218 nips-2008-Spectral Clustering with Perturbed Data
topicId topicWeight
[(0, -0.286), (1, -0.101), (2, -0.014), (3, 0.028), (4, -0.02), (5, 0.055), (6, 0.047), (7, -0.027), (8, 0.014), (9, -0.116), (10, 0.133), (11, -0.221), (12, -0.008), (13, 0.058), (14, 0.028), (15, 0.052), (16, 0.129), (17, 0.002), (18, 0.001), (19, 0.116), (20, -0.005), (21, 0.079), (22, -0.022), (23, 0.073), (24, 0.127), (25, 0.064), (26, 0.011), (27, -0.025), (28, -0.024), (29, 0.006), (30, -0.044), (31, -0.077), (32, -0.017), (33, 0.049), (34, -0.064), (35, 0.029), (36, 0.004), (37, 0.075), (38, -0.018), (39, 0.109), (40, 0.043), (41, 0.144), (42, 0.004), (43, 0.003), (44, 0.056), (45, -0.09), (46, -0.209), (47, -0.016), (48, -0.014), (49, -0.122)]
simIndex simValue paperId paperTitle
same-paper 1 0.94126397 113 nips-2008-Kernelized Sorting
Author: Novi Quadrianto, Le Song, Alex J. Smola
Abstract: Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between the classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution. 1
2 0.62226468 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
Author: Fernando Pérez-Cruz
Abstract: We analyze the estimation of information theoretic measures of continuous random variables such as: differential entropy, mutual information or KullbackLeibler divergence. The objective of this paper is two-fold. First, we prove that the information theoretic measure estimates using the k-nearest-neighbor density estimation with fixed k converge almost surely, even though the k-nearest-neighbor density estimation with fixed k does not converge to its true measure. Second, we show that the information theoretic measure estimates do not converge for k growing linearly with the number of samples. Nevertheless, these nonconvergent estimates can be used for solving the two-sample problem and assessing if two random variables are independent. We show that the two-sample and independence tests based on these nonconvergent estimates compare favorably with the maximum mean discrepancy test and the Hilbert Schmidt independence criterion. 1
3 0.6151033 112 nips-2008-Kernel Measures of Independence for non-iid Data
Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola
Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1
4 0.55339366 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
Author: Haixuan Yang, Irwin King, Michael Lyu
Abstract: Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on the penalty of a constant function. Based on the intuition that a good kernelbased inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions. 1
5 0.55084664 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection
Author: Takafumi Kanamori, Shohei Hido, Masashi Sugiyama
Abstract: We address the problem of estimating the ratio of two probability density functions (a.k.a. the importance). The importance values can be used for various succeeding tasks such as non-stationarity adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally very efficient and numerically stable. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bound. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches. 1
6 0.54540229 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
7 0.50179243 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
8 0.491164 225 nips-2008-Supervised Bipartite Graph Inference
9 0.48309556 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
10 0.47856778 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
11 0.46925437 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching
12 0.4597232 238 nips-2008-Theory of matching pursuit
13 0.45118389 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models
14 0.4452675 239 nips-2008-Tighter Bounds for Structured Estimation
15 0.44196555 178 nips-2008-Performance analysis for L\ 2 kernel classification
16 0.44170812 143 nips-2008-Multi-label Multiple Kernel Learning
17 0.44016981 193 nips-2008-Regularized Co-Clustering with Dual Supervision
18 0.43757015 117 nips-2008-Learning Taxonomies by Dependence Maximization
19 0.43644595 44 nips-2008-Characteristic Kernels on Groups and Semigroups
20 0.43567371 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition
topicId topicWeight
[(6, 0.062), (7, 0.082), (12, 0.061), (28, 0.163), (57, 0.08), (59, 0.014), (62, 0.227), (63, 0.031), (71, 0.027), (77, 0.059), (78, 0.013), (81, 0.016), (83, 0.084)]
simIndex simValue paperId paperTitle
1 0.81376636 235 nips-2008-The Infinite Hierarchical Factor Regression Model
Author: Piyush Rai, Hal Daume
Abstract: We propose a nonparametric Bayesian factor regression model that accounts for uncertainty in the number of factors, and the relationship between factors. To accomplish this, we propose a sparse variant of the Indian Buffet Process and couple this with a hierarchical model over factors, based on Kingman’s coalescent. We apply this model to two problems (factor analysis and factor regression) in gene-expression data analysis. 1
same-paper 2 0.80121428 113 nips-2008-Kernelized Sorting
Author: Novi Quadrianto, Le Song, Alex J. Smola
Abstract: Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between the classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution. 1
3 0.70800799 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization
Author: Liu Yang, Rong Jin, Rahul Sukthankar
Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low density regions of the unlabeled data will help the classification. In such case, the cluster assumption may not be valid; and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although the SSLW could improve a wide range of classification tasks, in this paper, we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tends to be consistent. To this end, SSLW estimates the optimal wordcorrelation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of stateof-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.
4 0.70559603 194 nips-2008-Regularized Learning with Networks of Features
Author: Ted Sandler, John Blitzer, Partha P. Talukdar, Lyle H. Ungar
Abstract: For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, and when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should be expected to have similar weights in an accurate model. Here we present a framework for regularized learning when one has prior knowledge about which features are expected to have similar and dissimilar weights. The prior knowledge is encoded as a network whose vertices are features and whose edges represent similarities and dissimilarities between them. During learning, each feature’s weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using networks of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature networks constructed from declarative human knowledge significantly improve prediction accuracy. 1
5 0.70482081 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
Author: Francis R. Bach
Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.
6 0.70315939 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data
7 0.699938 245 nips-2008-Unlabeled data: Now it helps, now it doesn't
8 0.69912714 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations
9 0.69866204 62 nips-2008-Differentiable Sparse Coding
10 0.69819528 95 nips-2008-Grouping Contours Via a Related Image
11 0.69562137 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
12 0.69420677 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models
13 0.69403958 120 nips-2008-Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text
14 0.69329011 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words
15 0.69279486 200 nips-2008-Robust Kernel Principal Component Analysis
16 0.69223213 226 nips-2008-Supervised Dictionary Learning
17 0.6919412 64 nips-2008-DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification
18 0.69183052 4 nips-2008-A Scalable Hierarchical Distributed Language Model
19 0.69119334 75 nips-2008-Estimating vector fields using sparse basis field expansions
20 0.69061857 130 nips-2008-MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of Images and Visual Features