jmlr jmlr2008 jmlr2008-81 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Arthur D. Szlam, Mauro Maggioni, Ronald R. Coifman
Abstract: Harmonic analysis and diffusion on discrete data has been shown to lead to state-of-the-art algorithms for machine learning tasks, especially in the context of semi-supervised and transductive learning. The success of these algorithms rests on the assumption that the function(s) to be studied (learned, interpolated, etc.) are smooth with respect to the geometry of the data. In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. Among the many possible applications, we consider the problems of image denoising and transductive classification. In both settings, our approach improves on standard diffusion based methods. Keywords: diffusion processes, diffusion geometry, spectral graph theory, image denoising, transductive learning, semi-supervised learning
Reference: text
sentIndex sentText sentNum sentScore
1 ) are smooth with respect to the geometry of the data. [sent-15, score-0.272]
2 In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. [sent-16, score-0.347]
3 Among the many possible applications, we consider the problems of image denoising and transductive classification. [sent-17, score-0.419]
4 In both settings, our approach improves on standard diffusion based methods. [sent-18, score-0.393]
5 Keywords: diffusion processes, diffusion geometry, spectral graph theory, image denoising, transductive learning, semi-supervised learning 1. [sent-19, score-1.092]
6 , 2005a; Coifman and Lafon, 2006a) or its diffusion wavelets (Coifman and Maggioni, 2006) can be used to study the geometry of and analyze functions on the data set. [sent-25, score-0.629]
7 One of the main contributions of this work is the observation that the geometry of the space is not the only important factor to be considered, but that the geometry and the properties of the function f to be studied (denoised/learned) should also affect the smoothing operation of the diffusion. [sent-30, score-0.481]
8 The reason for doing this is that perhaps f is not smooth with respect to the geometry of the space, but has structure that is well encoded in the features. [sent-32, score-0.272]
9 Since the harmonic analysis tools we use are robust to complicated geometries, but are most useful on smooth functions, it is reasonable to let the geometry of the data set borrow some of the complexity of the function, and study a smoother function on a more irregular data set. [sent-33, score-0.411]
10 In other words, we attempt to find the geometry so that the functions to be studied are as smooth as possible with respect to that geometry. [sent-34, score-0.272]
11 In Section 4 we demonstrate this approach in the context of the image denoising problem. [sent-42, score-0.328]
12 The random walk allows one to construct diffusion operators on the data set, as well as associated basis functions. [sent-53, score-0.393]
13 In general K is not column-stochastic, but the operation f K of multiplication on the right by a (row) vector can be thought of as a diffusion of the vector f. [sent-67, score-0.393]
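As an illustration of this construction (not taken from the paper: the Gaussian similarity, the bandwidth sigma, and the toy data below are assumptions), a minimal NumPy sketch of building the row-stochastic diffusion K = D^(-1) W and diffusing a function f:

    import numpy as np

    # Toy data set: n points in the plane (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))

    # Symmetric similarity matrix W with Gaussian weights (one common choice).
    sigma = 0.5
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / sigma ** 2)

    # Row-normalize: K = D^{-1} W is row-stochastic (each row sums to 1),
    # but in general not column-stochastic.
    D = W.sum(axis=1)
    K = W / D[:, None]

    # Diffusing a function f on the data set:
    f = rng.normal(size=200)
    f_avg = K @ f   # averaging each value with those of its neighbors
    f_row = f @ K   # multiplication on the right by the row vector f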
14 Local similarities are assumed to be more reliable, and non-local similarities will be inferred from local similarities through diffusion processes on the graph. [sent-75, score-0.516]
15 For example, one defines the diffusion or spectral distance (Bérard et al.) as D^(t)(x, y)² = ∑_{z∈X} ( K^t(x, z) − K^t(y, z) )². [sent-114, score-0.462]
16 The term diffusion distance was introduced in Lafon (2004), Coifman et al. [sent-117, score-0.393]
17 (2005a) and Coifman and Lafon (2006a) and is suggested by the formula above, which expresses D^(t) as some similarity between the probability distributions K^t(x, ·) and K^t(y, ·), which are obtained by diffusion from x and y according to the diffusion process K. [sent-118, score-0.816]
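A small sketch of how such a diffusion distance can be computed from K (illustrative; the paper may weight the sum over z differently, e.g., by the stationary distribution, so this plain Euclidean version is an assumption):

    import numpy as np

    def diffusion_distance(K, t, x, y):
        # Distance between the probability distributions K^t(x, .) and
        # K^t(y, .) obtained by running the diffusion for t steps from x and y.
        Kt = np.linalg.matrix_power(K, t)
        return np.linalg.norm(Kt[x] - Kt[y])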
18 3 Harmonic Analysis The eigenfunctions {ψ_i} of K, satisfying Kψ_i = λ_i ψ_i, are related, via multiplication by D^(−1/2), to the eigenfunctions φ_i of the graph Laplacian (Chung, 1997), since L = D^(−1/2) W D^(−1/2) − I = D^(1/2) K D^(−1/2) − I. (3) [sent-124, score-0.345]
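A quick numerical check of this relation (illustrative; the random symmetric W below is an assumption): if L φ = μ φ, then ψ = D^(−1/2) φ satisfies K ψ = (1 + μ) ψ.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.random((50, 50))
    W = (A + A.T) / 2                                   # symmetric nonnegative weights
    D = W.sum(axis=1)
    K = W / D[:, None]                                  # K = D^{-1} W
    L = W / np.sqrt(np.outer(D, D)) - np.eye(50)        # L = D^{-1/2} W D^{-1/2} - I

    mu, phi = np.linalg.eigh(L)                         # eigenpairs of L
    psi = phi / np.sqrt(D)[:, None]                     # psi = D^{-1/2} phi
    print(np.allclose(K @ psi, psi * (1 + mu)))         # True up to round-off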
19 4 Regularization by Diffusion It is often useful to find the smoothest function f̃ on a data set X with geometry given by W, so that for a given f, f̃ is not too far from f; this task is encountered in problems of denoising and function extension. [sent-143, score-0.421]
20 In the denoising problem, we are given a function f + η : X → R, where η is Gaussian white noise of a given variance or, if one is ambitious, some other possibly stochastic contamination. [sent-144, score-0.213]
21 However, it is well known that if f does not have uniform smoothness everywhere, the approximation by eigenfunctions is poor not only in regions of lesser smoothness, but the poor approximation spills into the smooth regions as well. [sent-158, score-0.237]
22 If we would like to balance smoothing by K with fidelity to the original f, we can choose β > 0 and set f_0 = f and f_{t+1} = (K f_t + β f)/(1 + β); the original function is treated as a heat source. [sent-167, score-0.224]
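A minimal sketch of this iteration (illustrative; the values of beta and the number of iterations are assumptions):

    import numpy as np

    def diffuse_with_fidelity(K, f, beta=0.1, n_iter=20):
        # f_{t+1} = (K f_t + beta * f) / (1 + beta): smoothing by K while the
        # original (noisy) function f acts as a heat source of strength beta.
        ft = np.asarray(f, dtype=float).copy()
        for _ in range(n_iter):
            ft = (K @ ft + beta * f) / (1.0 + beta)
        return ft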
23 Function-adapted Kernels The methods above are based on the idea that the function f to be recovered should be smooth with respect to W , but it can happen that an interesting function on data is not smooth with respect to the given geometry on that data. [sent-175, score-0.336]
24 We thus propose to modify the geometry of W so that the function(s) to be recovered are as smooth as possible in the modified geometry. [sent-178, score-0.272]
25 We will attempt to incorporate the geometry of the function f (or a family of functions F) in the construction of the weights W ; the hope is that we can convert structure to smoothness, and apply the methods of harmonic analysis to a smoother function on a rougher data set. [sent-180, score-0.388]
26 In other words, we want our regularizer to take averages between points where f has similar structure, in addition to being near to each other in terms of the given geometry of the data. [sent-181, score-0.248]
27 The averaging matrix K_f associated with W_f can then be used for regularizing, denoising and learning tasks, as described above. [sent-197, score-0.259]
28 The way the function f affects the construction of K_f will be application- and data-specific, as we shall see in the applications to image denoising and graph transductive learning. [sent-199, score-0.478]
29 For example, in the application to image denoising, F(f)(x) may be a vector of filter responses applied to the image f at location x. [sent-200, score-0.230]
30 Then F(f)(x) can be obtained by evaluating K^t(χ_i) at x, where K is a diffusion operator which only depends on the data set, and not on the χ_i's. [sent-203, score-0.393]
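One natural way to realize such function-adapted weights, roughly in the spirit of the construction above (the product-of-Gaussians form and the bandwidths sigma_x, sigma_f are assumptions, not necessarily the exact formula (7) of the paper):

    import numpy as np

    def function_adapted_weights(X, Ff, sigma_x=1.0, sigma_f=1.0):
        # X  : data coordinates, shape (n, d)
        # Ff : features F(f)(x) derived from the function f, shape (n, p)
        # The weight decays both with distance in the data and with
        # dissimilarity of the function's local structure.
        dx2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        df2 = ((Ff[:, None, :] - Ff[None, :, :]) ** 2).sum(-1)
        return np.exp(-dx2 / sigma_x ** 2 - df2 / sigma_f ** 2)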
31 Application I: Denoising of Images We apply function-adapted kernels to the task of denoising images. [sent-206, score-0.213]
32 We build a graph G(I) whose vertices are the pixels of the image and whose weights are adapted to the image structure, and use the diffusion on the graph with a fidelity term, as described in Section 2. [sent-216, score-0.909]
33 If we are able to encode image structure in the geometry of the graph in such a way that the image is actually smooth as a function on its graph, then the harmonic analysis on the graph will be well-suited for denoising that image. [sent-218, score-0.972]
34 A simple choice of (d + 2)-dimensional feature vectors is obtained by setting two of the coordinates of the feature vector to be scaled versions of the coordinates of the corresponding pixel in the image, αx, where α ≥ 0 is a scalar and x ∈ Q. [sent-222, score-0.251]
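A minimal sketch of such a patch-plus-coordinates embedding (illustrative; the patch size, padding mode, and scaling alpha are assumptions):

    import numpy as np

    def patch_features(img, half=3, alpha=0.5):
        # For each pixel, a (d+2)-dimensional feature vector: the d = (2*half+1)^2
        # gray values of the surrounding patch, plus the pixel coordinates
        # scaled by alpha >= 0 (alpha = 0 ignores spatial position entirely).
        H, W = img.shape
        padded = np.pad(img, half, mode='reflect')
        feats = []
        for i in range(H):
            for j in range(W):
                patch = padded[i:i + 2 * half + 1, j:j + 2 * half + 1].ravel()
                feats.append(np.concatenate([patch, [alpha * i, alpha * j]]))
        return np.asarray(feats)        # shape (H*W, d + 2)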
35 Left: row of the diffusion kernel corresponding to the upper-left highlighted area in the above image. [sent-240, score-0.428]
36 Right: row of the diffusion kernel corresponding to the bottom-left highlighted area in the above image. [sent-241, score-0.428]
37 The diffusion kernel averages according to different patterns in different locations. [sent-242, score-0.428]
38 1, and the associated diffusion kernel K as in (1). [sent-246, score-0.428]
39 In Figure 3 we explore the local geometry in patch space by projecting the set of patches around a given patch onto the principal components of the set of patches itself. [sent-247, score-0.366]
40 Geometric structures of the set of patches, dependent on local geometry of the image (e. [sent-248, score-0.323]
41 The diffusion one gets from this choice of filters is the NL-means filter of Buades et al. [sent-261, score-0.393]
42 We wish to emphasize that smoothing with the NL-means filter is not, in any sort of reasonable limit, a 2-d PDE; rather, it is the heat equation on the set of patches of the image! [sent-267, score-0.244]
43 2 Bootstrapping a Denoiser; or Denoised Images Graph. Different denoising methods often pick up different parts of the image structure, and create different characteristic artifacts. [sent-272, score-0.244]
44 2 Image Graph Denoising Once we have the graph W and normalized diffusion K, we use K to denoise the image. [sent-296, score-0.484]
45 The obvious bijection from pixels to vertices in the image graph induces a correspondence between functions on pixels (such as the original image) and functions on the vertices of the graph. [sent-297, score-0.256]
46 If we consider iteration of K as evolving a heat equation, the fidelity term sets the noisy function as a heat source, with strength determined by β. [sent-310, score-0.256]
47 Ĩ ← DenoiseImage(I, t) // Input: I: an image; t: amount of denoising. // Output: Ĩ: a denoised version of I. [sent-313, score-0.328]
48 Compute the associated I-adapted diffusion operator K_I. [sent-317, score-0.393]
49 Figure 4: Pseudo-code for denoising an image. [sent-320, score-0.328]
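Putting the pieces together, a compact sketch of the DenoiseImage pipeline (illustrative only: the brute-force neighbor search, the k-nearest-neighbor Gaussian graph, and all parameter values are assumptions; it reuses patch_features from the sketch above and assumes img takes values in [0, 1]):

    import numpy as np

    def denoise_image(img, t=10, beta=0.05, half=3, alpha=0.0, sigma=0.3, k=20):
        # Patch features -> graph weights -> row-stochastic diffusion K_I ->
        # t smoothing steps with a fidelity term.  Only practical for tiny images.
        H, W = img.shape
        F = patch_features(img, half=half, alpha=alpha)
        d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
        thresh = np.sort(d2, axis=1)[:, k][:, None]          # keep k nearest neighbors
        Wgt = np.where(d2 <= thresh, np.exp(-d2 / sigma ** 2), 0.0)
        Wgt = np.maximum(Wgt, Wgt.T)                          # symmetrize the graph
        K = Wgt / Wgt.sum(axis=1, keepdims=True)
        f = img.ravel().astype(float)
        ft = f.copy()
        for _ in range(t):
            ft = (K @ ft + beta * f) / (1.0 + beta)           # diffusion with heat source
        return ft.reshape(H, W)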
50 3 Examples Figure 5 displays examples of denoising with a diffusion on an image graph. [sent-321, score-0.721]
51 On the top right of Figure 5, we denoise the image using a 7 × 7 NL-means type patch embedding as described in Section 4. [sent-324, score-0.213]
52 Each curvelet denoising is a reconstruction of the noisy image f_0 shifted either 1, 2, or 4 pixels in the vertical and/or horizontal directions, using only coefficients with magnitudes greater than 3σ. [sent-335, score-0.240]
53 4) Denoising with a diffusion built from the 9 curvelet denoisings. [sent-342, score-0.449]
54 Application II: Graph Transductive Learning We apply the function-adapted approach to the transductive learning problem, and give experimental evidence demonstrating that using function-adapted weights can improve diffusion-based classifiers. [sent-351, score-0.697]
55 If F is smooth with respect to the data, an intrinsic analysis on the data set, such as the one made possible by the use of diffusion processes and the associated Fourier and multi-scale analyses, fits very well in the transductive learning framework. [sent-365, score-0.548]
56 In several papers a diffusion process constructed on X has been used for finding F directly (Zhou and Schölkopf, 2005; Zhu et al. [sent-366, score-0.393]
57 , 2003; Kondor and Lafferty, 2002) and indirectly, by using adapted basis functions on X constructed from the diffusion, such as the eigenfunctions of the Laplacian (Coifman and Lafon, 2006a,b; Lafon, 2004; Coifman et al. [sent-367, score-0.229]
58 , 2005a,b; Belkin and Niyogi, 2003b; Maggioni and Mhaskar, 2007), or diffusion wavelets (Coifman and Maggioni, 2006; Mahadevan and Maggioni, 2007; Maggioni and Mahadevan, 2006; Mahadevan and Maggioni, 2005; Maggioni and Mahadevan, 2005). [sent-368, score-0.421]
59 We will try to modify the geometry of the unlabeled data so that F is as smooth as possible with respect to the modified geometry. [sent-369, score-0.348]
60 Let C_i, i ∈ {1, . . . , k}, be the classes, let C_i^lab be the labeled data points in the ith class, that is, C_i^lab = {x ∈ X̃ : F̃(x) = i}, and let χ_i^lab be the characteristic function of those C_i^lab, that is, χ_i^lab(x) = 1 if x ∈ C_i^lab, and χ_i^lab(x) = 0 otherwise. [sent-379, score-0.284]
61 A simple classification algorithm can be obtained as follows (Szummer and Jaakkola, 2001): (i) Build a geometric diffusion K on the graph defined by the data points X, as described in Section 2. [sent-380, score-0.520]
62 (ii) Use a power of K to smooth the functions χ_i^lab, exactly as in the denoising algorithm described above, obtaining the smoothed functions K^t χ_i^lab. [sent-383, score-0.277]
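A minimal sketch of steps (i)-(ii) (illustrative; the label encoding and the number of smoothing steps t are assumptions):

    import numpy as np

    def diffusion_classifier(K, labels, t=10):
        # labels: length-n array with class ids 0..k-1 on labeled points and -1
        # on unlabeled points.  Smooth each class indicator chi_i^lab with K^t
        # and assign every point to the class whose smoothed indicator is largest.
        n = K.shape[0]
        classes = np.unique(labels[labels >= 0])
        chi = np.zeros((n, len(classes)))
        for j, c in enumerate(classes):
            chi[labels == c, j] = 1.0
        smoothed = chi
        for _ in range(t):
            smoothed = K @ smoothed
        return classes[np.argmax(smoothed, axis=1)]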
63 For each i, the “initial condition” for the heat flow given by χ_i^lab considers all the unlabeled points to be the same as labeled points not in C_i. [sent-389, score-0.528]
64 Just as in the image denoising examples, it is often the case that one does not want to run such a harmonic classifier to equilibrium, and we may want to find the correct number of iterations of smoothing by K and updating the boundary values by cross validation. [sent-394, score-0.532]
65 Since the values at the unlabeled points are unknown, we regress only to the labeled points; so for each χ_i^lab, we need to solve argmin_{a_il} ∑_{x labeled} ( ∑_{l=1}^N a_il φ_l(x) − χ_i^lab(x) )², and extend the χ_i^lab to χ_i^lab = ∑_{l=1}^N a_il φ_l. [sent-400, score-0.270]
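A minimal sketch of this least-squares extension onto the first N eigenfunctions (illustrative; here L_sym stands for the symmetric normalized Laplacian D^(−1/2) W D^(−1/2) − I, and the value of N is an assumption):

    import numpy as np

    def eigenfunction_extension(L_sym, chi_lab, labeled_mask, N=50):
        # Fit each label indicator, restricted to the labeled points, onto the
        # N smoothest eigenfunctions of L_sym, then evaluate the fit everywhere.
        mu, phi = np.linalg.eigh(L_sym)
        idx = np.argsort(-mu)[:N]           # eigenvalues of L_sym lie in [-2, 0];
        Phi = phi[:, idx]                   # those closest to 0 are the smoothest
        A, *_ = np.linalg.lstsq(Phi[labeled_mask], chi_lab[labeled_mask], rcond=None)
        return Phi @ A                      # extended function(s) on all of X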
66 2 Function Adapted Diffusion for Classification If the structure of the classes is very simple with respect to the geometry of the data set, then smoothness with respect to this geometry is precisely what is necessary to generalize from the labeled data. [sent-404, score-0.535]
67 We will modify the diffusion so it flows faster along class boundaries and slower across them, by using function-adapted kernels as in (7). [sent-407, score-0.393]
68 Compute the associated diffusion operator K as in (1). [sent-421, score-0.393]
69 1 and the kernel K. Figure 6: Pseudo-code for learning of a function based on diffusion on graphs. The functions {χ_i} are initially given on a (typically small) subset X̃ of X, and hence a similarity cannot be immediately defined in a way similar to (7). [sent-429, score-0.486]
70 In this way, if several classes claim a data point with some confidence, the diffusion will tend to average more among other points which have the same ownership situation when determining the value of a function at that data point. [sent-437, score-0.46]
71 We allow the geometry of the data set to absorb some of the complexity of the classes, and use diffusion analysis techniques on the modified data set. [sent-448, score-0.601]
72 The parallel with image denoising should not be unexpected: the goal of a function-adapted kernel is to strengthen averaging along level lines, and this is as desirable in image denoising as in transductive learning. [sent-449, score-0.828]
73 (2005a) the idea of using the estimated classes to warp the diffusion is introduced. [sent-457, score-0.42]
74 The tradeoff between geometry of the data and geometry of the (estimated/diffused) labels is made explicit and controllable. [sent-460, score-0.416]
75 We find that on many standard benchmark data sets, the classification rate is improved using function-adapted weights instead of “geometry only” weights in diffusion-based classifiers. [sent-472, score-0.475]
76 Denote the labeled points by L, let C_i be the ith class, and let χ_i^lab be 1 on the labeled points in the ith class, −1 on the labeled points of the other classes, and 0 elsewhere. [sent-500, score-0.255]
77 Classify unlabeled points x by sup_i χ_i^lab(x), where the χ_i^lab(x) are constructed using the harmonic classifier with the number of iterations chosen by leave-20-out cross validation from 1 to 250. [sent-504, score-0.283]
78 For the simplest example, consider a set of n × n images where each image has a single pixel set to 1, and every other pixel set to 0. [sent-512, score-0.243]
79 Using the χ_i^lab from the harmonic classifier at steady state (N = 250), we do the following: x6. [sent-521, score-0.338]
80 The modified weights are constructed using the nearest neighbors from the original weight matrix, exactly as in the image processing examples. [sent-525, score-0.267]
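A sketch of one way to build such modified weights from the smoothed class guesses (illustrative; the multiplicative Gaussian form and the bandwidth sigma_c are assumptions, and guesses stands for any per-class confidence matrix, such as the steady-state χ_i^lab above):

    import numpy as np

    def class_adapted_weights(W, guesses, sigma_c=1.0, eps=1e-12):
        # W       : original geometric weight matrix, shape (n, n)
        # guesses : per-class confidences on the data points, shape (n, k)
        # Each existing edge weight is multiplied by a similarity between the
        # class guesses at its endpoints, so the diffusion averages preferentially
        # among points with the same "ownership situation".
        dc2 = ((guesses[:, None, :] - guesses[None, :, :]) ** 2).sum(-1)
        return np.where(W > eps, W * np.exp(-dc2 / sigma_c ** 2), 0.0)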
81 In addition to showing that function-adapted weights often improve classification using diffusion-based methods, the results we obtain are very competitive and in many cases better than all other methods listed in the extensive comparative results presented in Chapelle et al. [sent-535, score-0.520]
82 Each pair of columns corresponds to a smoothing method; the right column in each pair uses function-adapted weights, with c_i determined by the harmonic classifier. [sent-743, score-0.290]
83 KS stands for kernel smoothing as in (x5), FAKS for function-adapted kernel smoothing as in (x10), HC for harmonic classifier as in (x3), FAHC for function-adapted harmonic classifier as in (x8), EF for eigenfunctions as in (x4), and FAEF for function-adapted eigenfunctions as in (x9). [sent-744, score-1.022]
84 Figure 9: Various classification results, c_i determined by smoothing by eigenfunctions of L. [sent-1136, score-0.257]
85 The boundary between the classes is exactly at the bottleneck between the two clusters; in other words, the geometry/metric of the data as initially presented leads to the optimal classifier, and thus modifying the geometry by the cluster guesses can only do harm. [sent-1178, score-0.235]
86 Note that in most of the image denoising applications we have presented, because of the 2-d locality constraints we put on the neighbor searches, the number of operations is linear in the number N of pixels, with a rather small constant. [sent-1211, score-0.359]
87 The first one is to use a transductive learning approach to tackle image processing problems like denoising and inpainting. [sent-1220, score-0.419]
88 Another research direction is towards understanding how to construct and use efficiently basis functions which are associated to function-adapted diffusion kernels. [sent-1225, score-0.393]
89 Conclusions We have introduced a general approach for associating graphs and diffusion processes to data sets and functions on such data sets. [sent-1231, score-0.421]
90 This framework is very flexible, and we have shown two particular applications, denoising of images and transductive learning, which traditionally are considered very different and have been tackled with very different techniques. [sent-1232, score-0.348]
91 We show that in fact they are very similar problems, and that results at least as good as the state of the art can be obtained within the single framework of function-adapted diffusion kernels. [sent-1233, score-0.393]
92 A review of image denoising algorithms, with a new one. [sent-1293, score-0.328]
93 Ideal denoising in an orthonormal basis chosen from a library of bases. [sent-1463, score-0.213]
94 Manifold parametrizations by eigenfunctions of the Laplacian and heat kernels. [sent-1504, score-0.271]
95 Universal local manifold parametrizations via heat kernels and eigenfunctions of the Laplacian. [sent-1516, score-0.335]
96 Multiscale diffusion bases for policy iteration in Markov decision processes. [sent-1576, score-0.393]
97 Fast direct policy evaluation using multiscale analysis of Markov diffusion processes. [sent-1582, score-0.458]
98 Biorthogonal diffusion wavelets for multiscale representations on manifolds and graphs. [sent-1599, score-0.486]
99 Value function approximation with diffusion wavelets and Laplacian eigenfunctions. [sent-1609, score-0.477]
100 Fast image and video denoising via nonlocal means of similar neighborhoods. [sent-1628, score-0.328]
wordName wordTfidf (topN-words)
[('diffusion', 0.393), ('maggioni', 0.346), ('coifman', 0.255), ('denoising', 0.213), ('geometry', 0.208), ('lab', 0.199), ('bci', 0.149), ('eigenfunctions', 0.143), ('harmonic', 0.139), ('aggioni', 0.131), ('faef', 0.131), ('fahc', 0.131), ('faks', 0.131), ('iffusion', 0.131), ('oifman', 0.131), ('zlam', 0.131), ('heat', 0.128), ('lafon', 0.128), ('belkin', 0.117), ('image', 0.115), ('rocesses', 0.111), ('raphs', 0.099), ('egularization', 0.099), ('coil', 0.092), ('unction', 0.091), ('mahadevan', 0.091), ('transductive', 0.091), ('mnist', 0.086), ('adapted', 0.086), ('hc', 0.085), ('ks', 0.079), ('unlabeled', 0.076), ('mhaskar', 0.065), ('smoothing', 0.065), ('multiscale', 0.065), ('manifold', 0.064), ('smooth', 0.064), ('usps', 0.063), ('graph', 0.059), ('curvelet', 0.056), ('laplacian', 0.056), ('delity', 0.055), ('nearest', 0.054), ('niyogi', 0.054), ('patches', 0.051), ('lter', 0.049), ('ci', 0.049), ('coordinates', 0.047), ('smoothness', 0.047), ('diffusions', 0.047), ('schlkopf', 0.047), ('averaging', 0.046), ('labeled', 0.045), ('images', 0.044), ('chapelle', 0.042), ('pixel', 0.042), ('weights', 0.041), ('similarities', 0.041), ('spectral', 0.041), ('pixels', 0.041), ('points', 0.04), ('pde', 0.04), ('eigenfunction', 0.04), ('lters', 0.039), ('embedding', 0.038), ('buades', 0.037), ('fg', 0.037), ('sinkhorn', 0.037), ('szlam', 0.037), ('kernel', 0.035), ('zhou', 0.034), ('donoho', 0.034), ('gn', 0.034), ('perona', 0.033), ('ail', 0.032), ('denoise', 0.032), ('neighbor', 0.031), ('neighbors', 0.031), ('ft', 0.031), ('url', 0.03), ('similarity', 0.03), ('text', 0.03), ('edges', 0.029), ('patch', 0.028), ('wavelets', 0.028), ('chung', 0.028), ('geometric', 0.028), ('bremer', 0.028), ('denoisings', 0.028), ('erard', 0.028), ('malik', 0.028), ('mauro', 0.028), ('molli', 0.028), ('supi', 0.028), ('zass', 0.028), ('zhu', 0.028), ('graphs', 0.028), ('metric', 0.027), ('classes', 0.027), ('weight', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
Author: Arthur D. Szlam, Mauro Maggioni, Ronald R. Coifman
Abstract: Harmonic analysis and diffusion on discrete data has been shown to lead to state-of-the-art algorithms for machine learning tasks, especially in the context of semi-supervised and transductive learning. The success of these algorithms rests on the assumption that the function(s) to be studied (learned, interpolated, etc.) are smooth with respect to the geometry of the data. In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. Among the many possible applications, we consider the problems of image denoising and transductive classification. In both settings, our approach improves on standard diffusion based methods. Keywords: diffusion processes, diffusion geometry, spectral graph theory, image denoising, transductive learning, semi-supervised learning
2 0.11074711 96 jmlr-2008-Visualizing Data using t-SNE
Author: Laurens van der Maaten, Geoffrey Hinton
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets. Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling
3 0.081853092 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
4 0.075606838 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
Author: Olivier Chapelle, Vikas Sindhwani, Sathiya S. Keerthi
Abstract: Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S 3 VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S3 VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3 VM algorithms is studied together, under a common experimental setting. Keywords: semi-supervised learning, support vector machines, non-convex optimization, transductive learning
5 0.067920454 54 jmlr-2008-Learning to Select Features using their Properties
Author: Eyal Krupka, Amir Navot, Naftali Tishby
Abstract: Feature selection is the task of choosing a small subset of features that is sufficient to predict the target labels well. Here, instead of trying to directly determine which features are better, we attempt to learn the properties of good features. For this purpose we assume that each feature is represented by a set of properties, referred to as meta-features. This approach enables prediction of the quality of features without measuring their value on the training instances. We use this ability to devise new selection algorithms that can efficiently search for new good features in the presence of a huge number of features, and to dramatically reduce the number of feature measurements needed. We demonstrate our algorithms on a handwritten digit recognition problem and a visual object category recognition problem. In addition, we show how this novel viewpoint enables derivation of better generalization bounds for the joint learning problem of selection and classification, and how it contributes to a better understanding of the problem. Specifically, in the context of object recognition, previous works showed that it is possible to find one set of features which fits most object categories (aka a universal dictionary). Here we use our framework to analyze one such universal dictionary and find that the quality of features in this dictionary can be predicted accurately by its meta-features. Keywords: feature selection, unobserved features, meta-features
6 0.062520787 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
7 0.055847272 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
8 0.051097021 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
9 0.047401972 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
10 0.047369815 58 jmlr-2008-Max-margin Classification of Data with Absent Features
11 0.043634217 9 jmlr-2008-Active Learning by Spherical Subdivision
12 0.042398058 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
13 0.042327393 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
14 0.041405611 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding
15 0.041225892 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
16 0.04055962 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
17 0.04042843 92 jmlr-2008-Universal Multi-Task Kernels
18 0.038329139 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
19 0.038180992 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
20 0.035270717 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
topicId topicWeight
[(0, 0.189), (1, -0.016), (2, 0.065), (3, -0.039), (4, -0.013), (5, -0.013), (6, 0.105), (7, 0.098), (8, 0.082), (9, -0.002), (10, -0.139), (11, 0.212), (12, 0.024), (13, 0.123), (14, 0.097), (15, 0.231), (16, -0.027), (17, -0.179), (18, 0.05), (19, 0.173), (20, -0.102), (21, -0.181), (22, 0.204), (23, 0.011), (24, 0.084), (25, -0.023), (26, -0.051), (27, 0.035), (28, -0.131), (29, -0.013), (30, 0.092), (31, 0.026), (32, 0.014), (33, -0.066), (34, 0.003), (35, -0.049), (36, -0.035), (37, -0.006), (38, 0.038), (39, 0.055), (40, -0.078), (41, 0.019), (42, -0.145), (43, -0.226), (44, 0.003), (45, -0.11), (46, 0.005), (47, 0.051), (48, 0.026), (49, -0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.93796492 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
Author: Arthur D. Szlam, Mauro Maggioni, Ronald R. Coifman
Abstract: Harmonic analysis and diffusion on discrete data has been shown to lead to state-of-the-art algorithms for machine learning tasks, especially in the context of semi-supervised and transductive learning. The success of these algorithms rests on the assumption that the function(s) to be studied (learned, interpolated, etc.) are smooth with respect to the geometry of the data. In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. Among the many possible applications, we consider the problems of image denoising and transductive classification. In both settings, our approach improves on standard diffusion based methods. Keywords: diffusion processes, diffusion geometry, spectral graph theory, image denoising, transductive learning, semi-supervised learning
2 0.57630134 96 jmlr-2008-Visualizing Data using t-SNE
Author: Laurens van der Maaten, Geoffrey Hinton
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets. Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling
3 0.46872392 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
Author: Olivier Chapelle, Vikas Sindhwani, Sathiya S. Keerthi
Abstract: Due to its wide applicability, the problem of semi-supervised classification is attracting increasing attention in machine learning. Semi-Supervised Support Vector Machines (S 3 VMs) are based on applying the margin maximization principle to both labeled and unlabeled examples. Unlike SVMs, their formulation leads to a non-convex optimization problem. A suite of algorithms have recently been proposed for solving S3 VMs. This paper reviews key ideas in this literature. The performance and behavior of various S3 VM algorithms is studied together, under a common experimental setting. Keywords: semi-supervised learning, support vector machines, non-convex optimization, transductive learning
4 0.45360467 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
5 0.31678784 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
Author: Elena Marchiori
Abstract: In supervised learning, a training set consisting of labeled instances is used by a learning algorithm for generating a model (classifier) that is subsequently employed for deciding the class label of new instances (for generalization). Characteristics of the training set, such as presence of noisy instances and size, influence the learning algorithm and affect generalization performance. This paper introduces a new network-based representation of a training set, called hit miss network (HMN), which provides a compact description of the nearest neighbor relation over pairs of instances from each pair of classes. We show that structural properties of HMN’s correspond to properties of training points related to the one nearest neighbor (1-NN) decision rule, such as being border or central point. This motivates us to use HMN’s for improving the performance of a 1-NN classifier by removing instances from the training set (instance selection). We introduce three new HMN-based algorithms for instance selection. HMN-C, which removes instances without affecting accuracy of 1-NN on the original training set, HMN-E, based on a more aggressive storage reduction, and HMN-EI, which applies iteratively HMN-E. Their performance is assessed on 22 data sets with different characteristics, such as input dimension, cardinality, class balance, number of classes, noise content, and presence of redundant variables. Results of experiments on these data sets show that accuracy of 1-NN classifier increases significantly when HMN-EI is applied. Comparison with state-of-the-art editing algorithms for instance selection on these data sets indicates best generalization performance of HMN-EI and no significant difference in storage requirements. In general, these results indicate that HMN’s provide a powerful graph-based representation of a training set, which can be successfully applied for performing noise and redundance reduction in instance-based learning. Keywords: graph-based training set repre
6 0.25552884 54 jmlr-2008-Learning to Select Features using their Properties
7 0.25412387 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
8 0.23314461 58 jmlr-2008-Max-margin Classification of Data with Absent Features
9 0.23265997 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
10 0.20422919 73 jmlr-2008-On the Suitable Domain for SVM Training in Image Coding
12 0.19073965 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
13 0.18886785 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
14 0.1886878 87 jmlr-2008-Stationary Features and Cat Detection
15 0.18593234 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
16 0.18555312 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
17 0.18415387 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
18 0.18345033 9 jmlr-2008-Active Learning by Spherical Subdivision
19 0.18009458 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
20 0.17052998 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
topicId topicWeight
[(0, 0.044), (1, 0.01), (5, 0.038), (31, 0.014), (40, 0.059), (54, 0.035), (58, 0.03), (66, 0.047), (72, 0.015), (73, 0.404), (76, 0.026), (78, 0.013), (88, 0.069), (92, 0.032), (94, 0.053), (99, 0.035)]
simIndex simValue paperId paperTitle
1 0.8000716 80 jmlr-2008-Ranking Individuals by Group Comparisons
Author: Tzu-Kuo Huang, Chih-Jen Lin, Ruby C. Weng
Abstract: This paper proposes new approaches to rank individuals from their group comparison results. Many real-world problems are of this type. For example, ranking players from team comparisons is important in some sports. In machine learning, a closely related application is classification using coding matrices. Group comparison results are usually in two types: binary indicator outcomes (wins/losses) or measured outcomes (scores). For each type of results, we propose new models for estimating individuals’ abilities, and hence a ranking of individuals. The estimation is carried out by solving convex minimization problems, for which we develop easy and efficient solution procedures. Experiments on real bridge records and multi-class classification demonstrate the viability of the proposed models. Keywords: ranking, group comparison, binary/scored outcomes, Bradley-Terry model, multiclass classification
same-paper 2 0.77108109 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
Author: Arthur D. Szlam, Mauro Maggioni, Ronald R. Coifman
Abstract: Harmonic analysis and diffusion on discrete data has been shown to lead to state-of-the-art algorithms for machine learning tasks, especially in the context of semi-supervised and transductive learning. The success of these algorithms rests on the assumption that the function(s) to be studied (learned, interpolated, etc.) are smooth with respect to the geometry of the data. In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. Among the many possible applications, we consider the problems of image denoising and transductive classification. In both settings, our approach improves on standard diffusion based methods. Keywords: diffusion processes, diffusion geometry, spectral graph theory, image denoising, transductive learning, semi-supervised learning
3 0.30968255 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
4 0.30933845 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
5 0.30759269 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
6 0.30364439 76 jmlr-2008-Optimization Techniques for Semi-Supervised Support Vector Machines
7 0.30093277 57 jmlr-2008-Manifold Learning: The Price of Normalization
8 0.30062699 86 jmlr-2008-SimpleMKL
10 0.29968613 83 jmlr-2008-Robust Submodular Observation Selection
11 0.2996341 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
12 0.29920325 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
13 0.2982358 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
14 0.29654333 39 jmlr-2008-Gradient Tree Boosting for Training Conditional Random Fields
15 0.29516807 67 jmlr-2008-Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies
16 0.29504654 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
17 0.29350373 56 jmlr-2008-Magic Moments for Structured Output Prediction
18 0.29096749 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
19 0.29083166 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
20 0.29000074 9 jmlr-2008-Active Learning by Spherical Subdivision