jmlr jmlr2008 jmlr2008-57 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Statistics, The Hebrew University, 91905 Jerusalem, Israel. Editor: Sam Roweis. Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. [sent-13, score-0.156]
2 We present and prove conditions on the manifold that are necessary for the success of the algorithms. [sent-15, score-0.187]
3 We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. [sent-17, score-0.218]
4 Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap 1. [sent-19, score-0.225]
5 These manifold-learning algorithms compute an embedding for some given input. [sent-27, score-0.166]
6 The nonlinearity of these algorithms allows them to reveal the domain structure even when the manifold is not linearly embedded. [sent-33, score-0.149]
7 In the second step, a description of these neighborhoods is computed. [sent-56, score-0.147]
8 We show that one should not expect the normalized-output algorithms to recover geodesic distances or local structures. [sent-60, score-0.198]
9 A more reasonable criterion for success is a high degree of similarity between the output of the algorithms and the original sample, up to some affine transformation; the definition of similarity will be discussed later. [sent-61, score-0.153]
10 However, they serve as an analytic tool to prove that there are general classes of manifolds on which the normalized-output algorithms fail. [sent-70, score-0.161]
11 Moreover, the numerical examples in this section show that the class of manifolds on which the normalized-output algorithms fail is wide and includes non-isometrically embedded manifolds and real-world data. [sent-71, score-0.317]
12 Second, we show that there exist simple manifolds that do not fulfill the necessary conditions for the success of the algorithms. [sent-77, score-0.199]
13 In the first step, the normalized-output algorithms assign neighbors to each input point x i based on the Euclidean distances in the high-dimensional space. [sent-93, score-0.168]
14 This can be done, for example, by choosing all the input points in an r-ball around xi or, alternatively, by choosing xi's K-nearest-neighbors. [sent-94, score-0.165]
15 The neighborhood of xi is given by the matrix Xi = [xi , xi,1 , . . . , xi,K ]. [sent-95, score-0.170]
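As a concrete illustration of this first step, the sketch below builds the neighborhood matrices from a sample using either rule; the function name, the brute-force distance computation, and the column-stacking convention are my own choices for illustration, not code from the paper.

```python
import numpy as np

def neighborhoods(X, k=None, r=None):
    """Assign neighbors to each input point x_i using Euclidean distances in
    the high-dimensional space: either the K nearest neighbors of x_i or all
    points inside an r-ball around x_i (excluding x_i itself)."""
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)

    nbr_matrices = []
    for i in range(X.shape[0]):
        if k is not None:
            idx = np.argsort(d2[i])[:k]            # K-nearest-neighbor rule
        else:
            idx = np.flatnonzero(d2[i] <= r * r)   # r-ball rule
        # Neighborhood matrix X_i = [x_i, x_{i,1}, ..., x_{i,K}] (points as columns).
        nbr_matrices.append(np.column_stack([X[i], *X[idx]]))
    return nbr_matrices
```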
16 In the second step, the normalized-output algorithms compute a description of the local neighborhoods that were found in the previous step. [sent-108, score-0.176]
17 The neighborhoods are not mentioned explicitly by Coifman and Lafon (2006). [sent-112, score-0.147]
18 However, since a sparse optimization problem is considered, it is assumed implicitly that neighborhoods are defined (see Sec. [sent-113, score-0.147]
19 Define the output matrix Y to be the matrix that achieves the minimum of Φ under the normalization constraints of Eq. [sent-177, score-0.217]
20 Then we have the following: the embeddings of LEM and LLE are given by the corresponding output matrices Y; the embeddings of LTSA and HLLE are given by the corresponding output matrices (1/√N)Y; and the embedding of DFM is given by a linear transformation of Y as discussed in Appendix A. [sent-179, score-0.470]
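The shared computational core can be sketched as an eigenproblem: for a symmetric positive semi-definite cost matrix whose kernel contains the constant vector (a graph Laplacian for LEM, (I − W)⊤(I − W) for LLE, the alignment matrices of LTSA/HLLE), minimizing the quadratic form under a zero-mean, unit-covariance-type constraint amounts to taking the bottom non-trivial eigenvectors. The sketch below assumes the unweighted constraint Cov(Y) = I; LEM and DFM use a D-weighted (generalized) variant that is omitted here, and none of this is the authors' reference code.

```python
import numpy as np

def normalized_output_embedding(cost_matrix, d):
    """Minimize Phi(Y) = trace(Y^T M Y) over N x d matrices Y subject to
    Y^T 1 = 0 and Cov(Y) = I, for a symmetric PSD matrix M whose kernel
    contains the constant vector: take the d eigenvectors with smallest
    non-zero eigenvalues and rescale to unit sample covariance."""
    eigvals, eigvecs = np.linalg.eigh(cost_matrix)   # ascending eigenvalues
    Y = eigvecs[:, 1:d + 1]                          # drop the constant eigenvector
    return np.sqrt(cost_matrix.shape[0]) * Y         # columns now have variance ~1
```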
21 The discussion of the algorithms’ output in this paper holds for any affine transformation of the output (see Section 3). [sent-181, score-0.165]
22 The most obvious difference between input and output is that while the input is a strip, the output is roughly square. [sent-195, score-0.162]
23 By definition, the geodesic distance between two points on a manifold is the length of the shortest path on the manifold between the two points. [sent-197, score-0.333]
24 Preservation of geodesic distances is particularly relevant when the manifold is isometrically embedded. [sent-198, score-0.232]
25 In this case, assuming the domain is convex, the geodesic distance between any two points on the manifold is equal to the Euclidean distance between the corresponding domain points. [sent-199, score-0.213]
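For intuition, geodesic distances on a sampled manifold are commonly approximated by shortest paths in a neighborhood graph (the Isomap-style estimate); under an isometric embedding with a convex domain these approximations track the Euclidean distances between the corresponding domain points. The helper below is such a generic estimator, not a construction taken from the paper.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def graph_geodesics(X, k=8):
    """Approximate pairwise geodesic distances by Dijkstra shortest paths in
    a K-nearest-neighbor graph weighted by Euclidean edge lengths."""
    dist, idx = cKDTree(X).query(X, k=k + 1)          # column 0 is the point itself
    n = X.shape[0]
    rows = np.repeat(np.arange(n), k)
    G = csr_matrix((dist[:, 1:].ravel(), (rows, idx[:, 1:].ravel())), shape=(n, n))
    return shortest_path(G, method='D', directed=False)
```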
26 The normalization constraint shortens the horizontal distances and lengthens the vertical distances, leading to the distortion of geodesic distances. [sent-208, score-0.183]
27 Note that less than half of the original neighbors of the point remain neighbors in the output space. [sent-212, score-0.209]
28 In addition to the distortion of angles and distances, the K-nearest-neighbors of a given point on the manifold do not necessarily correspond to the K-nearest-neighbors of the respective output point, as shown in Figs. [sent-218, score-0.176]
29 Accordingly, we conclude that the original structure of the local neighborhoods is not necessarily preserved by the normalized-output algorithms. [sent-220, score-0.174]
30 The above discussion highlights the fact that one cannot expect the normalized-output algorithms to preserve geodesic distances or local neighborhood structure. [sent-221, score-0.25]
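One way to make the preceding claim concrete is to measure, for each point, how many of its input-space nearest neighbors survive as nearest neighbors of the corresponding output point. The small diagnostic below is my own illustration of such a check, not a measure defined in the paper.

```python
import numpy as np

def knn_indices(Z, k):
    """Indices of the k Euclidean nearest neighbors of each row of Z."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]

def neighborhood_overlap(X, Y, k=10):
    """Average fraction of each point's k input-space neighbors that remain
    among its k output-space neighbors; 1.0 means local structure is kept."""
    nx, ny = knn_indices(X, k), knn_indices(Y, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nx, ny)]))
```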
31 However, it seems reasonable to demand that the output of the normalized-output algorithms resemble an affine transformation of the original sample. [sent-222, score-0.197]
32 Using the affine transformation criterion, we can state that LTSA succeeds in recovering the underlying structure of the strip shown in Fig. [sent-233, score-0.203]
33 However, in the case of the noisy strip shown in Fig. [sent-235, score-0.15]
34 Let Y = YN×d be an affine transformation of the original sample X, such that the normalization constraints of Eq. [sent-244, score-0.197]
35 In other words, we say that the algorithm has failed when a substantially different embedding Z has a lower cost than the most appropriate embedding Y . [sent-250, score-0.3]
36 It is a mathematical construction that shows when the output of the algorithm is not likely to be similar to Y , the normalized version of the true manifold structure. [sent-255, score-0.176]
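For a LEM-style cost Φ(Y) = trace(Y⊤LY), this failure criterion can be checked directly by evaluating the cost of two candidate embeddings that satisfy the same constraints; the function names below are illustrative, and the cost is only one member of the family of quadratic forms the paper covers.

```python
import numpy as np

def lem_cost(L, Y):
    """Phi(Y) = trace(Y^T L Y) for a graph Laplacian L (LEM-style cost)."""
    return float(np.trace(Y.T @ L @ Y))

def algorithm_prefers(L, Y, Z):
    """Failure check in the sense of the paper's criterion: returns 'Z' when
    the substantially different embedding Z attains a strictly lower cost
    than the normalized true structure Y (both assumed to satisfy the same
    normalization constraints), and 'Y' otherwise."""
    return 'Z' if lem_cost(L, Z) < lem_cost(L, Y) else 'Y'
```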
37 Let S be the maximum number of neighborhoods to which a single input point belongs. [sent-261, score-0.172]
38 In particular, we argue that LEM cannot recover the structure of a two-dimensional grid in the case where the aspect ratio of the grid is greater than 2. [sent-272, score-0.313]
39 Necessary conditions for successful performance of the normalized-output algorithms on such manifolds are presented. [sent-278, score-0.161]
40 First, the conditions for the success of LEM on the two-dimensional grid are more limiting. [sent-280, score-0.155]
41 The embedding Z is a curve in R2 and clearly does not preserve the original structure of the grid. [sent-297, score-0.164]
42 Note that this is the only linear transformation of X (up to rotation) that satisfies the conditions Y D1 = 0 and Y DY = I, which are the normalization constraints for LEM (see Eq. [sent-300, score-0.148]
43 However, the embedding Y depends on the matrix D, which in turn depends on the choice of neighborhoods. [sent-302, score-0.17]
44 In this section we analyze the embedding Y instead of Y , thereby avoiding the dependence on the matrix D and hence simplifying the notation. [sent-312, score-0.17]
45 The definition of the embedding Z is as follows: $z_{ij} = \left(\frac{i}{\sigma},\ -\frac{2i}{\rho} - \bar{z}^{(2)}\right)$ for $i \le 0$ and $z_{ij} = \left(\frac{i}{\sigma},\ \frac{2i}{\rho} - \bar{z}^{(2)}\right)$ for $i \ge 0$ (Eq. 4), where $\bar{z}^{(2)} = \frac{2(2q+1)}{N\rho}\sum_{i=1}^{m} 2i$ ensures that $Z^{\top}\mathbf{1} = 0$, and σ (the same σ as before; see below) and ρ are chosen so that the sample variance of $Z^{(1)}$ and $Z^{(2)}$ is equal to one. [sent-326, score-0.229]
46 2), where yi j is an inner point of the grid and Yi j is the neighborhood of yi j ; likewise, we estimate Φ(Z) by Nφ(Zi j ) for an inner point zi j . [sent-333, score-0.433]
47 Then φ(Yi j ) > φ(Zi j ) for neighborhood-radius r that satisfies 1 ≤ r ≤ 3, or similarly, for K-nearest neighborhoods where K = 4, 8, 12. [sent-339, score-0.147]
48 The case of general r-ball neighborhoods is discussed in Appendix A. [sent-341, score-0.147]
49 Figure 3: (A) The normalized grid at an inner point yi j . [sent-343, score-0.197]
50 The neighbors from above and below are embedded to the same point as zi j . [sent-350, score-0.173]
51 However, for all the inner points of the grid with neighbors that are also inner points, the renormalization factor (qε (xi )−1 qε (xi, j )−1 ) is a constant. [sent-379, score-0.265]
52 In other words, when DFM using the “window” kernel is applied to a grid with aspect ratio slightly greater than 2 or above, DFM will prefer the embedding Z over the embedding Y (see Fig 2). [sent-381, score-0.442]
53 1 for manifolds embedded in higher dimensions), which is typically not the case in dimension-reducing applications. [sent-395, score-0.172]
54 The neighborhoods were chosen using K-nearest neighbors with K = 4, 8, 16, and 64. [sent-401, score-0.21]
55 Figure 4: The output of LEM on a grid of dimensions 81 × 41 is presented in (A). [sent-405, score-0.170]
56 In the case of the grid, both LLE and LTSA (roughly) recovered the grid shape for K = 4, 8, 16, and 64, while HLLE failed to produce any output due to large memory requirements. [sent-415, score-0.196]
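A hedged sketch of this experiment is given below: it builds a K-NN graph Laplacian on a grid with the same aspect ratio (a smaller 41 × 21 grid so that a dense generalized eigensolve stays cheap; Figure 4 uses 81 × 41), computes the two bottom non-trivial generalized eigenvectors, and reports how much information about the short grid axis the output retains. The unit weights, the choice of K, and the correlation diagnostic are my own simplifications, not the authors' experimental code.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.linalg import eigh

# A regular grid with aspect ratio ~2 (41 x 21 points).
xs, ys = np.meshgrid(np.arange(41.0), np.arange(21.0))
X = np.column_stack([xs.ravel(), ys.ravel()])
N, K = X.shape[0], 8

# Symmetrized K-nearest-neighbor adjacency with unit ("window") weights.
_, idx = cKDTree(X).query(X, k=K + 1)        # column 0 is the point itself
W = np.zeros((N, N))
W[np.repeat(np.arange(N), K), idx[:, 1:].ravel()] = 1.0
W = np.maximum(W, W.T)

# Laplacian Eigenmaps: bottom non-trivial solutions of L v = lambda D v.
D = np.diag(W.sum(axis=1))
L = D - W
_, vecs = eigh(L, D)                          # generalized problem, ascending order
Y = vecs[:, 1:3]                              # skip the constant eigenvector

# If LEM collapses the short direction, neither output coordinate carries
# information about the short grid axis (compare with Figure 4).
for c in range(2):
    r = np.corrcoef(Y[:, c], X[:, 1])[0, 1]
    print(f"output coordinate {c}: |corr| with short grid axis = {abs(r):.2f}")
```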
57 Analysis for General Two-Dimensional Manifolds The aim of this section is to present necessary conditions for the success of the normalized-output algorithms on general two-dimensional manifolds embedded in high-dimensional space. [sent-421, score-0.268]
58 The quantity $e_i^{(j)}$ is the portion of error obtained by using the j-th column of the i-th neighborhood when using the original sample as output. [sent-442, score-0.181]
59 Note that Y is just the original sample up to a linear transformation that ensures that the normalization constraints Cov(Y ) = I and Y 1 = 0 hold. [sent-448, score-0.197]
60 Moreover, Y is the only transformation of X that satisfies these conditions, which are the normalization constraints for LLE, HLLE, and LTSA. [sent-449, score-0.148]
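Computing this reference matrix Y is a centering-plus-whitening step; a minimal sketch (the function name and the use of the 1/N covariance are my own choices) is:

```python
import numpy as np
from scipy.linalg import inv, sqrtm

def normalized_reference(X):
    """Return Y, the affine image of the sample X (unique up to rotation)
    with zero mean and identity sample covariance, i.e. Y^T 1 = 0 and
    Cov(Y) = I -- the normalization shared by LLE, HLLE and LTSA."""
    Xc = X - X.mean(axis=0)                     # enforce Y^T 1 = 0
    C = np.cov(Xc, rowvar=False, bias=True)     # sample covariance of the data
    return Xc @ inv(np.real(sqrtm(C)))          # whiten: Cov(Y) = I
```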
61 In the case of neighborhoods Zi , the situation can be different. [sent-457, score-0.147]
62 Define rmax = maxi∈N0 r(i) to be the maximum radius of neighborhoods i, such that i ∈ N0 . [sent-466, score-0.216]
63 2 The Embeddings for LEM and DFM So far we have claimed that given the original sample X, we expect the output to resemble Y (see Eq. [sent-468, score-0.163]
64 When the original sample is given by X, we expect the output of LEM and DFM to resemble Ŷ. [sent-473, score-0.163]
65 We note that unlike the matrix Y, which was defined in terms of the matrix X only, Ŷ depends also on the choice of neighborhoods through the matrix D that appears in the normalization constraints. [sent-474, score-0.317]
66 Define X = XΓ; then Y = X Σ−1/2 is the only affine transformation of X that satisfies the normalization constraints of LEM and DFM; namely, we have Y DY = I and Y D1 = 0. [sent-479, score-0.148]
67 Following Eq. 10, $\hat{z}_i = \left(\frac{\hat{x}_i^{(1)}}{\hat{\sigma}},\ -\frac{\hat{x}_i^{(1)}}{\hat{\rho}} - \bar{\hat{z}}^{(2)}\right)$ for $\hat{x}_i^{(1)} < 0$ and $\hat{z}_i = \left(\frac{\hat{x}_i^{(1)}}{\hat{\sigma}},\ \frac{\kappa\hat{x}_i^{(1)}}{\hat{\rho}} - \bar{\hat{z}}^{(2)}\right)$ for $\hat{x}_i^{(1)} \ge 0$, where $\kappa$ is defined by Eq. [sent-481, score-0.178]
68 In the case where the distribution of the original points is uniform, the ratio $\hat{\sigma}/\hat{\tau}$ is close to the ratio $\sigma/\tau$, and thus the necessary conditions for the success of LEM and DFM are similar to the conditions in Corollary 5. [sent-485, score-0.174]
69 1 Let X be a sample from a two-dimensional domain and let ψ(X) be its embedding in high-dimensional space. [sent-493, score-0.159]
70 In other words, when the neighborhoods are large, one can expect a large average error in the description of the neighborhoods, since the Euclidean approximation of the neighborhoods is less accurate for neighborhoods of large radius. [sent-502, score-0.467]
71 However, when the difference between the right side and the left side of the inequalities is large, one cannot expect the output to resemble the original sample (see Lemma 3. [sent-518, score-0.163]
72 4 Generalization of the Results to Manifolds of Higher Dimensions The discussion above introduced necessary conditions for the normalized-output algorithms’ success on two-dimensional manifolds embedded in RD . [sent-522, score-0.239]
73 We present here a simple criterion to demonstrate the fact that there are d-dimensional manifolds that the normalized-output algorithms cannot recover. [sent-524, score-0.161]
74 Hence, both LEM and DFM are expected to fail whenever the ratio between the length of the grid in the first and second coordinates is slightly greater than 2 or more, regardless of the length of grid in the other coordinates, similar to the result presented in Theorem 4. [sent-560, score-0.276]
75 Note that the “fishbowl” is a two-dimensional manifold embedded in R3, and this embedding is not an isometry. [sent-585, score-0.16]
76 entire set of images and on a strip of 30 × 10 angular variations. [sent-635, score-0.15]
77 1926 M ANIFOLD L EARNING : T HE P RICE OF N ORMALIZATION A1 B1 C1 D1 E1 F1 A2 B2 C2 D2 E2 F2 Figure 7: The output of LEM on 1600 points sampled from a swissroll is presented in A1. [sent-639, score-0.168]
78 These three examples, in addition to the noisy version of the two-dimensional strip discussed in Section 3 (see Fig. [sent-643, score-0.15]
79 We consider the question of convergence in the case of input that consists of a d-dimensional manifold embedded in RD , where the manifold is isometric to a convex subset of Euclidean space. [sent-648, score-0.305]
80 are classes of manifolds on which the normalized-output algorithms cannot be expected to recover the original sample, not even asymptotically. [sent-659, score-0.219]
81 be a uniform sample from the two-dimensional strip S = [0, L] × [0, 1]. [sent-664, score-0.172]
82 Thus, if L > 4 we do not expect either LEM or DFM to recover the structure of the strip as the number of points in the sample tends to infinity. [sent-672, score-0.261]
83 Our working example is the two-dimensional strip S = [0, L] × [0, 1], which can be considered as the continuous counterpart of the grid X defined in Section 4. [sent-684, score-0.264]
84 The eigenfunctions $\varphi_{i,j}(x_1, x_2)$ and eigenvalues $\lambda_{i,j}$ on the strip S under these conditions are given by $\varphi_{i,j}(x_1, x_2) = \cos\left(\frac{i\pi}{L}x_1\right)\cos(j\pi x_2)$ and $\lambda_{i,j} = \left(\frac{i\pi}{L}\right)^2 + (j\pi)^2$ for $i, j = 0, 1, 2, \ldots$ [sent-686, score-0.178]
85 When the aspect ratio of the strip satisfies L > M ∈ N, the first M non-trivial eigenfunctions are ϕi,0 , i = 1, . . . , M. [sent-690, score-0.232]
86 Any embedding of the strip based on the first M eigenfunctions is therefore a function of only the first variable x1 . [sent-694, score-0.315]
87 Specifically, whenever L > 2 the two-dimensional embedding is a function of the first variable only, and therefore clearly cannot establish a faithful embedding of the strip. [sent-695, score-0.274]
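The ordering of the spectrum behind this argument is easy to check numerically from the formula above; the snippet below lists the smallest non-trivial eigenvalues for an aspect ratio L = 2.5 (an illustrative value of my own, not one used in the paper) and shows that the two smallest belong to eigenfunctions that depend on x1 only.

```python
import numpy as np

# lambda_{i,j} = (i*pi/L)**2 + (j*pi)**2 on the strip [0, L] x [0, 1].
L = 2.5
pairs = [(i, j) for i in range(4) for j in range(4) if (i, j) != (0, 0)]
spectrum = sorted(((i * np.pi / L) ** 2 + (j * np.pi) ** 2, i, j) for i, j in pairs)
for lam, i, j in spectrum[:4]:
    print(f"lambda_({i},{j}) = {lam:.3f}")   # the first two have j = 0: no x2 dependence
```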
88 This assumption is reasonable, for example, in the case of a uniform sample from the strip S. [sent-717, score-0.172]
89 Note that we expect both |n0| and rmax,n to be bounded whenever the radius of the neighborhoods does not increase. [sent-722, score-0.219]
90 However, if the radii of the neighborhoods tend to zero while the number of points in each neighborhood tends to infinity, we expect $\bar{e}_n^{(2)} \to 0$ for both LTSA and HLLE. [sent-728, score-0.343]
91 This is because the neighborhood matrices Wi are based on the linear approximation of the neighborhood as captured by the neighborhood SVD. [sent-729, score-0.249]
92 We conclude that a necessary condition for convergence is that the radii of the neighborhoods tend to zero. [sent-732, score-0.203]
93 If the radius of the neighborhood is smaller than α, the neighborhood cannot be approximated reasonably by a two-dimensional projection. [sent-736, score-0.212]
94 Hence, in the presence of noise of a constant magnitude, the radii of the neighborhoods cannot tend to zero. [sent-737, score-0.177]
95 Concluding Remarks In the introduction to this paper we posed the following question: Do the normalized-output algorithms succeed in revealing the underlying low-dimensional structure of manifolds embedded in high-dimensional spaces? [sent-749, score-0.201]
96 More specifically, does the output of the normalized-output algorithms resemble the original sample up to affine transformation? [sent-750, score-0.166]
97 While it is clear that, due to the normalization constraints, one cannot hope for preservation of geodesic distances or of neighborhood structure, success as measured by other criteria may be achieved. [sent-761, score-0.396]
98 The embedding suggested by Coifman and Lafon (2006) (up to rotation) is the matrix $[\lambda_1 v^{(1)}, \ldots, \lambda_d v^{(d)}]$. [sent-804, score-0.17]
99 Note that this embedding can be obtained from the output matrix Y by a simple linear transformation. [sent-808, score-0.226]
100 The function $f(x_1, x_2) = \left(\frac{x_1}{\sigma}\right)^2 + \left(\frac{x_2}{\tau}\right)^2$ agrees with the squared distance for points on the grid, where $x_1$ and $x_2$ indicate the horizontal and vertical distances from $x_{ij}$ in the original grid, respectively. [sent-832, score-0.164]
wordName wordTfidf (topN-words)
[('lem', 0.415), ('dfm', 0.39), ('ltsa', 0.33), ('hlle', 0.26), ('lle', 0.246), ('strip', 0.15), ('neighborhoods', 0.147), ('wi', 0.14), ('embedding', 0.137), ('manifolds', 0.132), ('akai', 0.13), ('itov', 0.13), ('oldberg', 0.13), ('ushnir', 0.13), ('manifold', 0.12), ('anifold', 0.12), ('ormalization', 0.12), ('rice', 0.12), ('grid', 0.114), ('cov', 0.11), ('shbowl', 0.09), ('embeddings', 0.084), ('neighborhood', 0.083), ('swissroll', 0.08), ('wiyi', 0.08), ('normalization', 0.071), ('huo', 0.07), ('wrs', 0.07), ('zi', 0.07), ('neighbors', 0.063), ('geodesic', 0.061), ('output', 0.056), ('isomap', 0.056), ('yi', 0.055), ('xi', 0.054), ('transformation', 0.053), ('distances', 0.051), ('eigenmap', 0.05), ('globe', 0.05), ('ei', 0.049), ('laplacian', 0.049), ('radius', 0.046), ('af', 0.046), ('rd', 0.044), ('corollary', 0.044), ('success', 0.041), ('embedded', 0.04), ('dy', 0.04), ('svd', 0.04), ('earning', 0.04), ('rotation', 0.038), ('stretched', 0.038), ('coifman', 0.038), ('dii', 0.038), ('lafon', 0.038), ('belkin', 0.035), ('matrix', 0.033), ('smith', 0.033), ('saul', 0.033), ('resemble', 0.032), ('diffusion', 0.032), ('points', 0.032), ('ui', 0.031), ('recover', 0.031), ('radii', 0.03), ('tenenbaum', 0.03), ('kushnir', 0.03), ('normalizedoutput', 0.03), ('wittman', 0.03), ('aspect', 0.03), ('algorithms', 0.029), ('niyogi', 0.029), ('inner', 0.028), ('eigenfunctions', 0.028), ('original', 0.027), ('failure', 0.027), ('necessary', 0.026), ('expect', 0.026), ('failed', 0.026), ('goldberg', 0.025), ('yr', 0.025), ('preservation', 0.025), ('input', 0.025), ('ca', 0.025), ('en', 0.025), ('constraints', 0.024), ('ratio', 0.024), ('fail', 0.024), ('weights', 0.023), ('tangent', 0.023), ('alignment', 0.023), ('israel', 0.023), ('roweis', 0.023), ('rmax', 0.023), ('dan', 0.023), ('eigenmaps', 0.023), ('xn', 0.023), ('sample', 0.022), ('hessian', 0.022), ('ys', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
2 0.11799368 96 jmlr-2008-Visualizing Data using t-SNE
Author: Laurens van der Maaten, Geoffrey Hinton
Abstract: We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map. The technique is a variation of Stochastic Neighbor Embedding (Hinton and Roweis, 2002) that is much easier to optimize, and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. t-SNE is better than existing techniques at creating a single map that reveals structure at many different scales. This is particularly important for high-dimensional data that lie on several different, but related, low-dimensional manifolds, such as images of objects from multiple classes seen from multiple viewpoints. For visualizing the structure of very large data sets, we show how t-SNE can use random walks on neighborhood graphs to allow the implicit structure of all of the data to influence the way in which a subset of the data is displayed. We illustrate the performance of t-SNE on a wide variety of data sets and compare it with many other non-parametric visualization techniques, including Sammon mapping, Isomap, and Locally Linear Embedding. The visualizations produced by t-SNE are significantly better than those produced by the other techniques on almost all of the data sets. Keywords: visualization, dimensionality reduction, manifold learning, embedding algorithms, multidimensional scaling
3 0.081853092 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
Author: Arthur D. Szlam, Mauro Maggioni, Ronald R. Coifman
Abstract: Harmonic analysis and diffusion on discrete data has been shown to lead to state-of-the-art algorithms for machine learning tasks, especially in the context of semi-supervised and transductive learning. The success of these algorithms rests on the assumption that the function(s) to be studied (learned, interpolated, etc.) are smooth with respect to the geometry of the data. In this paper we present a method for modifying the given geometry so the function(s) to be studied are smoother with respect to the modified geometry, and thus more amenable to treatment using harmonic analysis methods. Among the many possible applications, we consider the problems of image denoising and transductive classification. In both settings, our approach improves on standard diffusion based methods. Keywords: diffusion processes, diffusion geometry, spectral graph theory, image denoising, transductive learning, semi-supervised learning
4 0.059399735 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
Author: Arnak S. Dalalyan, Anatoly Juditsky, Vladimir Spokoiny
Abstract: The statistical problem of estimating the effective dimension-reduction (EDR) subspace in the multi-index regression model with deterministic design and additive noise is considered. A new procedure for recovering the directions of the EDR subspace is proposed. Many methods for estimating the EDR subspace perform principal component analysis on a family of vectors, say ˆ ˆ β1 , . . . , βL , nearly lying in the EDR subspace. This is in particular the case for the structure-adaptive approach proposed by Hristache et al. (2001a). In the present work, we propose to estimate the projector onto the EDR subspace by the solution to the optimization problem minimize ˆ ˆ max β (I − A)β =1,...,L subject to A ∈ Am∗ , where Am∗ is the set of all symmetric matrices with eigenvalues in [0, 1] and trace less than or equal √ to m∗ , with m∗ being the true structural dimension. Under mild assumptions, n-consistency of the proposed procedure is proved (up to a logarithmic factor) in the case when the structural dimension is not larger than 4. Moreover, the stochastic error of the estimator of the projector onto the EDR subspace is shown to depend on L logarithmically. This enables us to use a large number of vectors ˆ β for estimating the EDR subspace. The empirical behavior of the algorithm is studied through numerical simulations. Keywords: dimension-reduction, multi-index regression model, structure-adaptive approach, central subspace
5 0.056966465 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
Author: Konrad Rieck, Pavel Laskov
Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms—enabling peak performances of up to 106 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA. Keywords: string kernels, string distances, learning with sequential data
6 0.048501052 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
7 0.045870483 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
8 0.045264937 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
9 0.036244899 9 jmlr-2008-Active Learning by Spherical Subdivision
10 0.033199292 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
11 0.032454912 52 jmlr-2008-Learning from Multiple Sources
12 0.032218114 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
13 0.030261036 92 jmlr-2008-Universal Multi-Task Kernels
14 0.02780764 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
15 0.027659804 7 jmlr-2008-A Tutorial on Conformal Prediction
16 0.026420936 58 jmlr-2008-Max-margin Classification of Data with Absent Features
17 0.025389735 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
18 0.024989545 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
19 0.024552727 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
20 0.024371056 17 jmlr-2008-Automatic PCA Dimension Selection for High Dimensional Data and Small Sample Sizes
topicId topicWeight
[(0, 0.148), (1, -0.054), (2, 0.001), (3, -0.033), (4, 0.037), (5, -0.056), (6, 0.057), (7, 0.043), (8, 0.023), (9, -0.008), (10, -0.078), (11, 0.191), (12, -0.01), (13, 0.117), (14, 0.226), (15, 0.238), (16, -0.088), (17, -0.192), (18, 0.022), (19, 0.066), (20, -0.152), (21, -0.133), (22, 0.161), (23, 0.145), (24, 0.127), (25, 0.116), (26, 0.016), (27, 0.121), (28, 0.131), (29, 0.118), (30, -0.095), (31, -0.066), (32, -0.033), (33, 0.137), (34, -0.092), (35, 0.112), (36, -0.028), (37, -0.08), (38, -0.044), (39, -0.1), (40, -0.035), (41, 0.011), (42, 0.113), (43, 0.058), (44, 0.171), (45, 0.101), (46, 0.036), (47, -0.061), (48, -0.117), (49, -0.077)]
simIndex simValue paperId paperTitle
same-paper 1 0.92986572 57 jmlr-2008-Manifold Learning: The Price of Normalization
2 0.65451527 96 jmlr-2008-Visualizing Data using t-SNE
3 0.40609172 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
4 0.39832243 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
5 0.27202275 55 jmlr-2008-Linear-Time Computation of Similarity Measures for Sequential Data
6 0.20813178 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
7 0.15823999 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
8 0.1569754 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
9 0.15119827 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
10 0.14894709 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
11 0.14249106 65 jmlr-2008-Multi-Agent Reinforcement Learning in Common Interest and Fixed Sum Stochastic Games: An Experimental Study
12 0.13829164 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
13 0.1337373 35 jmlr-2008-Finding Optimal Bayesian Network Given a Super-Structure
14 0.13318577 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
15 0.13318536 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
16 0.13130073 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
17 0.13038114 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
18 0.12250769 56 jmlr-2008-Magic Moments for Structured Output Prediction
19 0.11939114 52 jmlr-2008-Learning from Multiple Sources
20 0.11864856 7 jmlr-2008-A Tutorial on Conformal Prediction
topicId topicWeight
[(0, 0.034), (1, 0.011), (5, 0.014), (31, 0.012), (36, 0.373), (40, 0.048), (54, 0.036), (58, 0.041), (66, 0.078), (73, 0.019), (76, 0.023), (78, 0.012), (88, 0.106), (92, 0.028), (94, 0.048), (99, 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.72351438 57 jmlr-2008-Manifold Learning: The Price of Normalization
2 0.38172615 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
3 0.37517866 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
4 0.37261087 58 jmlr-2008-Max-margin Classification of Data with Absent Features
Author: Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, Daphne Koller
Abstract: We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly without any completion of the missing features using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images. Keywords: max margin, missing features, network reconstruction, metabolic pa
5 0.37081704 96 jmlr-2008-Visualizing Data using t-SNE
6 0.36721355 9 jmlr-2008-Active Learning by Spherical Subdivision
7 0.36587748 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
8 0.3640801 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
9 0.36207032 83 jmlr-2008-Robust Submodular Observation Selection
10 0.35971892 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
11 0.35963863 56 jmlr-2008-Magic Moments for Structured Output Prediction
12 0.35927954 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
13 0.35924843 86 jmlr-2008-SimpleMKL
14 0.35782731 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
15 0.35764185 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
16 0.35739228 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
17 0.35644585 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
18 0.35621324 80 jmlr-2008-Ranking Individuals by Group Comparisons
19 0.35614708 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
20 0.35560963 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks