nips nips2009 nips2009-106 knowledge-graph by maker-knowledge-mining

106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding


Source: pdf

Author: Zhirong Yang, Irwin King, Zenglin Xu, Erkki Oja

Abstract: Stochastic Neighbor Embedding (SNE) has been shown to be quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in finding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of t-SNE to accommodate various heavy-tailed embedding similarity functions. With this generalization, we are presented with two difficulties. The first is how to select the best embedding similarity among all heavy-tailed functions, and the second is how to optimize the objective function once the heavy-tailed function has been selected. Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. Based on this finding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a fixed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast as and performs as well as the best known t-SNE implementation, and the other for semi-supervised visualization showing quantitative superiority using the homogeneity measure as well as a qualitative advantage in cluster separation over t-SNE.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. [sent-12, score-0.201]

2 Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc. [sent-13, score-0.077]

3 In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of t-SNE to accommodate various heavy-tailed embedding similarity functions. [sent-15, score-0.299]

4 The first is how to select the best embedding similarity among all heavy-tailed functions and the second is how to optimize the objective function once the heavy-tailed function has been selected. [sent-17, score-0.278]

5 Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. [sent-18, score-0.303]

6 A multitude of visualization approaches, especially the nonlinear dimensionality reduction techniques such as Isomap [9], Laplacian Eigenmaps [1], Stochastic Neighbor Embedding (SNE) [6], manifold sculpting [5], and kernel maps with a reference point [8], have been proposed. [sent-21, score-0.197]

7 Although they are reported to perform well on tasks such as unfolding an artificial manifold, they are often not successful at visualizing real-world data with high dimensionalities. [sent-22, score-0.042]

8 A common problem of the above methods is that most mapped data points are crowded together in the center without distinguishable gaps that isolate data clusters. [sent-23, score-0.048]

9 It was recently pointed out by van der Maaten and Hinton [10] that the “crowding problem” can be alleviated by using a heavy-tailed distribution in the low-dimensional space. [sent-24, score-0.019]

10 Their method, called t-Distributed Stochastic Neighbor Embedding (t-SNE), is adapted from SNE with two major changes: (1) it uses a symmetrized cost function; and (2) it employs a Student t-distribution with a single degree of freedom (T1 ). [sent-25, score-0.068]

11 In this way, t-SNE can achieve remarkable superiority in the discovery of clustering structure in highdimensional data. [sent-26, score-0.021]

12 The t-SNE development procedure in [10] is restricted to the T1 distribution as its embedding similarity. [sent-27, score-0.201]

13 However, different data sets or other purposes of dimensionality reduction may require generalizing t-SNE to other heavy-tailed functions. [sent-28, score-0.052]

14 The original t-SNE derivation provides little information for users on how to select the best embedding similarity among all heavy-tailed functions. [sent-29, score-0.278]

15 Furthermore, the original t-SNE optimization algorithm is not convenient when the symmetric SNE is generalized to use various heavy-tailed embedding similarity functions since it builds on the gradient descent approach with momenta. [sent-30, score-0.461]

16 As a result, several optimization parameters need to be manually specified. [sent-31, score-0.036]

17 The performance of the t-SNE algorithm depends on laborious selection of the optimization parameters. [sent-32, score-0.036]

18 For instance, a large learning step size might cause the algorithm to diverge, while a conservative one might lead to slow convergence or poor annealed results. [sent-33, score-0.02]

19 Although comprehensive strategies have been used to improve the optimization performance, they might still be problematic when extended to other applications or embedding similarity functions. [sent-34, score-0.338]

20 In this paper we generalize t-SNE to accommodate various heavy-tailed functions with two major contributions: (1) we propose to characterize heavy-tailed embedding similarities in symmetric SNE by their negative score functions. [sent-35, score-0.415]

21 This further leads to a parameterized subset facilitating the choice of the best tail-heaviness; and (2) we present a general algorithm for optimizing the symmetric SNE objective with any heavy-tailed embedding similarities. [sent-36, score-0.271]

22 Next, a fixed-point optimization algorithm for HSSNE is provided and its convergence is discussed in Section 4. [sent-40, score-0.036]

23 In Section 5, we relate the EM-like behavior of the fixed-point algorithm to a pairwise local mixture model for an in-depth analysis of HSSNE. [sent-41, score-0.03]

24 Section 6 presents two sets of experiments, one for unsupervised and the other for semi-supervised visualization. [sent-42, score-0.04]

25 2 Symmetric Stochastic Neighbor Embedding Suppose the pairwise similarities of a set of m-dimensional data points X = {x_i}_{i=1}^n are encoded in a symmetric matrix P ∈ R^{n×n}, where P_ii = 0 and ∑_{ij} P_ij = 1. [sent-44, score-0.187]

26 The optimization of SSNE uses the gradient descent method with ∂J/∂y_i = 4 ∑_j (P_ij − Q_ij)(y_i − y_j). (3) [sent-46, score-0.242]

27 A momentum term is added to the gradient in order to speed up the optimization: Y^(t+1) = Y^(t) + η ∂J/∂Y |_{Y=Y^(t)} + β(t) (Y^(t) − Y^(t−1)), (4) [sent-47, score-0.182]

28 where Y^(t) = [y_1^(t) · · · y_n^(t)] ∈ R^{r×n} is the solution in matrix form at iteration t; η is the learning rate; and β(t) is the momentum amount at iteration t. [sent-50, score-0.105]
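
The momentum update in Equations (3)–(4) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the Gaussian embedding similarity, the descent direction of the step, the placeholder values of η and β(t), and the function names are assumptions of the example.

```python
import numpy as np

def ssne_gradient(P, Y):
    """Gradient of the SSNE objective, Eq. (3): dJ/dy_i = 4 sum_j (P_ij - Q_ij)(y_i - y_j).
    P is (n, n), symmetric with zero diagonal and summing to 1; Y is (r, n) with columns y_i."""
    sq = np.sum(Y ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)   # ||y_i - y_j||^2
    H = np.exp(-d2)                                    # Gaussian embedding similarity (assumed)
    np.fill_diagonal(H, 0.0)
    Q = H / H.sum()
    W = P - Q
    # sum_j W_ij (y_i - y_j) = y_i * (sum_j W_ij) - sum_j W_ij y_j
    return 4.0 * (Y * W.sum(axis=1)[None, :] - Y @ W.T)

def momentum_step(Y, Y_prev, P, eta=100.0, beta=0.5):
    """One update in the spirit of Eq. (4): a step against the gradient plus a momentum
    term beta(t) (Y^(t) - Y^(t-1)). eta and beta are placeholders that would normally
    have to be tuned, which is exactly the inconvenience discussed in the text."""
    return Y - eta * ssne_gradient(P, Y) + beta * (Y - Y_prev)
```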

29 Compared with an earlier method Stochastic Neighbor Embedding (SNE) [6], SSNE uses a symmetrized cost function with simpler gradients. [sent-51, score-0.025]

30 Most mapped points in the SSNE visualizations are often compressed near the center of the visualizing map without clear gaps that separate clusters of the data. [sent-52, score-0.122]

31 Using this distribution yields the gradient of t-SNE: ∂J/∂y_i = 4 ∑_j (P_ij − Q_ij)(y_i − y_j)(1 + ‖y_i − y_j‖²)^{−1}. (6) [sent-55, score-0.427]

32 In addition, t-SNE employs a number of strategies to overcome the difficulties in the optimization based on gradient descent. [sent-56, score-0.13]

33 3 Heavy-tailed SNE characterized by negative score functions As the gradient derivation in [10] is restricted to the T1 distribution, we derive the gradient with a general function that converts squared distances to similarities, with T1 as a special case. [sent-57, score-0.205]

34 In addition, the direct chain rule used in [10] may cause notational clutter and conceal the working components in the gradients. [sent-58, score-0.073]

35 Our approach can provide more insight into the working factor introduced by the heavy-tailed functions. [sent-60, score-0.036]

36 Note that H is not required to be defined as a probability function because the symmetric SNE objective already involves normalization over all data pairs. [sent-62, score-0.07]

37 The extended objective using the Lagrangian technique is given by L̃(q, Y) = ∑_{ij} P_ij log( q_ij / ∑_{a≠b} q_ab ) + ∑_{ij} λ_ij [ q_ij − H(‖y_i − y_j‖²) ]. (9) [sent-63, score-0.066]

38 Setting ∂L̃(q, Y)/∂q_ij = 0 yields λ_ij = 1/∑_{a≠b} q_ab − P_ij/q_ij. [sent-64, score-0.087]

39 For notational simplicity, we also write S_ij = S(‖y_i − y_j‖²). [sent-66, score-0.238]

40 We propose to characterize the tail heaviness of the similarity function H, relative to the one that leads to the Gaussian, by its negative score function S(τ) = −∂ log H(τ)/∂τ, also called the tail-heaviness function in this paper. [sent-67, score-0.212]

41 In this characterization, there is a functional operator S that maps every similarity function to a tail-heaviness function. [sent-68, score-0.077]

42 The above observation inspires us to further parameterize a family of tail-heaviness functions by the power of H: S(H, α) = H^α for α ≥ 0, where a larger α value corresponds to a heavier-tailed embedding similarity function. [sent-73, score-0.358]
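
A short sketch of the power family follows. The closed form used here, H(τ) = (1 + ατ)^(−1/α) with H(τ) = exp(−τ) in the limit α → 0, comes from integrating S(τ) = −∂ log H(τ)/∂τ = H(τ)^α with H(0) = 1; it is a derivation under those assumptions rather than a formula quoted above, and the function name is ours.

```python
import numpy as np

def power_family_H(tau, alpha):
    """Embedding similarity whose tail-heaviness function is S(H, alpha) = H**alpha
    (assumed closed form; see the note above). alpha = 0 gives the Gaussian and
    alpha = 1 recovers the Student t-distribution with one degree of freedom (T1)."""
    tau = np.asarray(tau, dtype=float)
    if alpha == 0.0:
        return np.exp(-tau)
    return (1.0 + alpha * tau) ** (-1.0 / alpha)

tau = np.linspace(0.0, 5.0, 6)
for a in (0.0, 0.5, 1.0, 2.0):
    print(f"alpha={a}:", np.round(power_family_H(tau, a), 3))
```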

43 Figure 1: Several functions in the power family. [sent-85, score-0.026]

44 Figure 1 shows a number of functions in the power family. [sent-91, score-0.026]

45 4 A fixed-point optimization algorithm Unlike many other dimensionality reduction approaches that can be solved by eigendecomposition in a single step, SNE and its variants require iterative optimization methods. [sent-92, score-0.142]

46 However, it remains unknown whether such a comprehensive implementation also works for other types of embedding similarity functions. [sent-94, score-0.302]

47 Manually adjusting the involved parameters such as the learning rate and the momentum for every function is rather time-consuming and infeasible in practice. [sent-95, score-0.121]

48 Here we propose to optimize symmetric SNE by a fixed-point algorithm. [sent-96, score-0.07]

49 After rearranging the terms in ∂J/∂y_i = 0 (see Equation (11)), we obtain the following update rule: Y_ki^(t+1) = [ Y_ki^(t) ∑_j B_ij^(t) + ∑_j (A_ij^(t) − B_ij^(t)) Y_kj^(t) ] / ∑_j A_ij^(t), (14) where A_ij^(t) = P_ij S(‖y_i^(t) − y_j^(t)‖²) and B_ij^(t) = Q_ij S(‖y_i^(t) − y_j^(t)‖²). [sent-97, score-0.492]

50 Our optimization algorithm for HSSNE simply involves the iterative application of Equation (14). [sent-98, score-0.054]
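
A minimal NumPy sketch of iterating Equation (14) is given below. It plugs in the power-family similarity sketched above (an assumption of this example, not a requirement of the algorithm), ignores the initialization strategies mentioned later, and follows the R^{r×n} convention of Equation (4) in which column i of Y is the mapped point y_i; the function name is ours.

```python
import numpy as np

def hssne_fixed_point(P, Y, alpha=1.0, n_iters=200):
    """Fixed-point iteration of Eq. (14).
    P : (n, n) symmetric input similarities, zero diagonal, summing to 1.
    Y : (r, n) initial embedding; column i is the mapped point y_i.
    alpha : tail-heaviness of the assumed power family (alpha = 1 ~ t-SNE's T1)."""
    for _ in range(n_iters):
        sq = np.sum(Y ** 2, axis=0)
        tau = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)       # ||y_i - y_j||^2
        np.fill_diagonal(tau, 0.0)
        H = np.exp(-tau) if alpha == 0.0 else (1.0 + alpha * tau) ** (-1.0 / alpha)
        np.fill_diagonal(H, 0.0)
        S = H ** alpha                                          # S_ij = S(||y_i - y_j||^2)
        Q = H / H.sum()                                         # Q_ij = q_ij / sum_{a!=b} q_ab
        A = P * S                                               # A_ij = P_ij S_ij
        B = Q * S                                               # B_ij = Q_ij S_ij
        # Eq. (14): Y_ki <- (Y_ki sum_j B_ij + sum_j (A_ij - B_ij) Y_kj) / sum_j A_ij
        Y = (Y * B.sum(axis=1)[None, :] + Y @ (A - B).T) / A.sum(axis=1)[None, :]
    return Y
```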

51 Compared with the original t-SNE optimization algorithm, our method requires no user-provided parameters such as the learning step size and momentum, which is more convenient for applications. [sent-99, score-0.036]

52 However, it is known that the update rule (14) can diverge in some cases, for example, when Yki are large. [sent-101, score-0.093]

53 Denote ∆ = Y − Y^(t) and let ∇ denote the gradient of J with respect to Y. [sent-104, score-0.077]

54 Let us first approximate the HSSNE objective by the first-order Taylor expansion at the current estimate Y^(t): J(Y) ≈ J_lin(Y) = J(Y^(t)) + ∑_{ki} ∆_ki ∇_ki^(t). (15) [sent-105, score-0.07]

55 Then we can construct an upper bound of J_lin(Y): G(Y, Y^(t)) = J_lin(Y) + (1/2) ∑_{ki} ∆_ki² ∑_a A_ia, (16) as P_ia and S_ia are all nonnegative. [sent-106, score-0.21]

56 Equating ∂G(Y, Y^(t))/∂Y to 0 minimizes G(Y, Y^(t)) and yields the update rule (14). [sent-110, score-0.067]

57 Iteratively applying the update rule (14) thus results in a monotonically decreasing sequence of the linear approximation of the HSSNE objective: J_lin(Y^(t)) ≥ G(Y^(t+1), Y^(t)) ≥ J_lin(Y^(t+1)). [sent-111, score-0.067]

58 Even if the second-order terms in the Taylor expansion of J(Y) are also considered, the update rule (14) remains valid as long as the differences Y_ki^(t+1) − Y_ki^(t) are small. [sent-112, score-0.067]

59 With the approximated Hessian H_{ijkl} = δ_{kl} [(D^A − A) − (D^B − B)]_{ij}, (17) the updating term U_ki in Newton's method Y_ki^(t) = Y_ki^(t−1) − U_ki can be determined by ∑_{lj} H_{ijkl} U_lj = ∇_ki^(t). [sent-115, score-0.159]

60 Solving this equation by directly inverting the huge tensor H is, however, infeasible in practice, and it is thus usually implemented by iterative methods such as U_ki^(v+1) = ( ∑_j (A + D^B − B)_{ij} U_kj^(v) + ∇_ki^(t) ) / D^A_ii. [sent-116, score-0.104]

61 Then one can find that such an approximated Newton's update rule Y_ki^(t+1) = Y_ki^(t) − D_ki^(t) is identical to Equation (14). [sent-119, score-0.067]

62 5 A local mixture interpretation Further rearranging the update rule can give us more insight into the properties of SSNE solutions: Y_ki^(t+1) = ∑_j A_ij [ Y_kj^(t) + (Q_ij/P_ij)(Y_ki^(t) − Y_kj^(t)) ] / ∑_j A_ij. (19) [sent-121, score-0.136]
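
The step from Equation (14) to Equation (19) uses A_ij Q_ij / P_ij = B_ij wherever P_ij > 0. A small numerical check of that equivalence, on synthetic matrices that merely satisfy A = P ∘ S and B = Q ∘ S (purely illustrative), is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
P = rng.random((n, n)); P = P + P.T; np.fill_diagonal(P, 0.0); P /= P.sum()
Q = rng.random((n, n)); Q = Q + Q.T; np.fill_diagonal(Q, 0.0); Q /= Q.sum()
S = rng.random((n, n)); S = S + S.T              # any symmetric nonnegative S_ij
A, B = P * S, Q * S
Y = rng.standard_normal((r, n))

# Eq. (14): (Y_ki sum_j B_ij + sum_j (A_ij - B_ij) Y_kj) / sum_j A_ij
upd14 = (Y * B.sum(1)[None, :] + Y @ (A - B).T) / A.sum(1)[None, :]

# Eq. (19): sum_j A_ij [Y_kj + (Q_ij / P_ij)(Y_ki - Y_kj)] / sum_j A_ij
ratio = np.divide(Q, P, out=np.zeros_like(Q), where=P > 0)
upd19 = (Y @ A.T + Y * (A * ratio).sum(1)[None, :] - Y @ (A * ratio).T) / A.sum(1)[None, :]

print(np.allclose(upd14, upd19))                 # True: A * Q / P equals B off the diagonal
```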

63 One can see that the above update rule mimics the maximization step in the EM algorithm for the classical Gaussian mixture model (e. [sent-122, score-0.097]

64 This resemblance inspires us to find an alternative interpretation of the SNE behavior in terms of a particular mixture model. [sent-125, score-0.061]

65 Given the current estimate Y^(t), the fixed-point update rule actually performs minimization of ∑_{ij} P_ij S_ij ‖y_i − µ_ij^(t)‖², (20) where µ_ij^(t) = y_j^(t) + (Q_ij/P_ij)(y_i^(t) − y_j^(t)). [sent-126, score-0.575]

66 This problem is equivalent to maximizing the Jensen lower bound of ∑_i log ∑_j P_ij S_ij exp(−‖y_i − µ_ij^(t)‖²). (21) [sent-127, score-0.092]

67 In this form, µ_ij^(t) can be regarded as the mean of the j-th mixture component for the i-th embedded data point, while the product P_ij S_ij can be thought of as the mixing coefficients1. [sent-128, score-0.116]

68 Note that each data sample has its own mixing coefficients because of locality sensitivity. [sent-129, score-0.039]

69 At convergence, i.e., Y^(t+1) = Y^(t) = Y^*, we can rewrite the mixture without the logarithm as ∑_{ij} P_ij S_ij^* exp( −(1 − Q_ij/P_ij)² ‖y_i^* − y_j^*‖² ). [sent-132, score-0.159]

70 This is unlike the standard (i.i.d.) assumption because the mixing coefficient rows do not sum to the same number. [sent-136, score-0.02]

71 (2) As discussed in Section 3, Sij characterizes the tail heaviness of the embedding similarity function. [sent-139, score-0.341]

72 A pair of Q_ij that approximates P_ij well can increase the exponential, while a pair with a large mismatch contributes little to the mixture. [sent-143, score-0.02]

73 (4) Finally, as in many other continuity-preserving methods, the second factor in the exponential forces pairs that are close in the input space to also be situated nearby in the embedding space. [sent-144, score-0.219]

74 Due to space limitations, we focus only on three data sets, iris, wine, and segmentation (training subset), from the UCI repository2. [sent-147, score-0.019]

75 We followed the instructions in [10] for calculating Pij and choosing the learning rate η and momentum amount β(t) for Gradient t-SNE. [sent-148, score-0.105]

76 We also excluded two tricks, “early compression” and “early exaggeration”, described in [10], from the comparison of long-run optimization because they apparently belong to the initialization stage. [sent-149, score-0.055]

77 Here both the Fixed-Point and Gradient t-SNEs execute with the same initialization, which uses the “early compression” trick and pre-runs the Gradient t-SNE for 50 iterations, as suggested in [10]. [sent-150, score-0.019]

78 The visualization quality can be quantified using the ground truth class information. [sent-151, score-0.126]

79 We adopt the homogeneity of nearest neighbors as our quality measure: homogeneity = γ/n, (23) where γ is the number of mapped points belonging to the same class as their nearest neighbor and n again is the total number of points. [sent-152, score-0.565]
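
Equation (23) is straightforward to compute; a brute-force sketch (ours) is shown below.

```python
import numpy as np

def homogeneity(Y, labels):
    """Homogeneity of nearest neighbors, Eq. (23): the fraction gamma / n of mapped
    points whose nearest neighbor in the embedding carries the same class label.
    Y is (r, n) with column i the mapped point y_i; labels has length n."""
    pts = Y.T                                                # (n, r)
    d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                             # a point is not its own neighbor
    nn = d2.argmin(axis=1)                                   # index of each nearest neighbor
    labels = np.asarray(labels)
    return float(np.sum(labels == labels[nn])) / labels.size
```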

80 A larger homogeneity generally indicates better separability of the classes. [sent-153, score-0.237]

81 Even though it has a globally optimal solution, the Laplacian Eigenmap yields poor visualizations, since none of the classes can be isolated. [sent-155, score-0.02]

82 By contrast, both t-SNE methods achieve much higher homogeneities and most clusters are well separated in the visualization plots. [sent-156, score-0.178]

83 Comparing the two t-SNE implementations, one can see that our simple fixed-point algorithm converges even slightly faster than the comprehensive and carefully tuned Gradient t-SNE. [sent-157, score-0.024]

84 Besides efficiency, our approach performs as well as Gradient t-SNE in terms of both the t-SNE objectives and the homogeneities of nearest neighbors for these data sets. [sent-158, score-0.052]

85 2 Semi-supervised visualization Unsupervised symmetric SNE or t-SNE may perform poorly for some data sets in terms of identifying classes. [sent-160, score-0.196]

86 Let us consider another data set vehicle from the LIBSVM repository3 . [sent-162, score-0.023]

87 The top-left plot in Figure 3 demonstrates a poor visualization using unsupervised Gradient t-SNE. [sent-163, score-0.186]

88 We can construct a supervised matrix u where uij = 1 if xi and xj are known to belong to the same class and 0 otherwise. [sent-165, score-0.064]

89 After normalizing U_ij = u_ij / ∑_{a≠b} u_ab, we calculate the semi-supervised similarity matrix P̃ = (1 − ρ)P + ρU, where the trade-off parameter ρ is set to 0. [sent-166, score-0.123]
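
Constructing the semi-supervised similarity matrix can be sketched as follows. The default ρ = 0.5 is only a placeholder (the sentence above truncates the value actually used), and the optional `known` mask is our addition for the case where only some labels are available.

```python
import numpy as np

def semi_supervised_similarities(P, labels, rho=0.5, known=None):
    """Blend the unsupervised similarities P with label agreement:
    u_ij = 1 if x_i and x_j are known to share a class (0 otherwise),
    U_ij = u_ij / sum_{a != b} u_ab, and P_tilde = (1 - rho) P + rho U."""
    labels = np.asarray(labels)
    u = (labels[:, None] == labels[None, :]).astype(float)
    if known is not None:
        mask = np.asarray(known, dtype=bool)
        u *= np.outer(mask, mask)        # keep only pairs whose labels are both known
    np.fill_diagonal(u, 0.0)
    U = u / u.sum()
    return (1.0 - rho) * P + rho * U
```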

90 Figure 2: Unsupervised visualization on three data sets. [sent-278, score-0.126]

91 Columns 1 to 3 show the results for iris, wine, and segmentation, respectively. [sent-279, score-0.03]

92 The second to fourth rows are visualizations using Laplacian Eigenmap, Gradient t-SNE, and Fixed-Point t-SNE, respectively. [sent-281, score-0.032]

93 Figure 3: Semi-supervised visualization for the vehicle data set. [sent-296, score-0.149]

94 The plots titled with α values are produced using the fixed-point algorithm of the power family of HSSNE. [sent-297, score-0.026]

95 The top-middle plot in Figure 3 shows that inclusion of some supervised information improves the homogeneity (0. [sent-298, score-0.255]

96 We then tried the power family of HSSNE with α ranging from 0 to 1. [sent-300, score-0.026]

97 With α = 1 and α = 2, the HSSNEs implemented by our fixed-point algorithm achieve even higher homogeneities (0. [sent-303, score-0.052]

98 Two sets of experimental results, from unsupervised and semi-supervised visualization, indicate that our method is efficient, accurate, and more versatile than the other two approaches. [sent-310, score-0.052]

99 Laplacian eigenmaps and spectral techniques for embedding and clustering. [sent-319, score-0.221]

100 Data visualization and dimensionality reduction using kernel maps with a reference point. [sent-367, score-0.178]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sne', 0.645), ('qij', 0.275), ('pij', 0.248), ('homogeneity', 0.237), ('embedding', 0.201), ('hssne', 0.192), ('yki', 0.183), ('dkl', 0.159), ('yj', 0.129), ('visualization', 0.126), ('jlin', 0.122), ('ssne', 0.122), ('eigenmap', 0.107), ('momentum', 0.105), ('yi', 0.092), ('qab', 0.087), ('sij', 0.087), ('similarity', 0.077), ('gradient', 0.077), ('ki', 0.07), ('uki', 0.07), ('symmetric', 0.07), ('neighbor', 0.069), ('ij', 0.066), ('laplacian', 0.061), ('fixed', 0.06), ('aij', 0.058), ('hijkl', 0.052), ('homogeneities', 0.052), ('ykj', 0.052), ('similarities', 0.051), ('student', 0.047), ('bij', 0.047), ('uij', 0.046), ('visualizing', 0.042), ('kong', 0.042), ('hong', 0.04), ('unsupervised', 0.04), ('rule', 0.038), ('stochastic', 0.037), ('optimization', 0.036), ('dii', 0.035), ('heaviness', 0.035), ('jquad', 0.035), ('qii', 0.035), ('db', 0.034), ('iris', 0.034), ('visualizations', 0.032), ('crowding', 0.031), ('inspires', 0.031), ('maaten', 0.031), ('mixture', 0.03), ('wine', 0.03), ('score', 0.029), ('update', 0.029), ('dimensionality', 0.029), ('tail', 0.028), ('lagrangian', 0.027), ('diverge', 0.026), ('gaps', 0.026), ('freedom', 0.026), ('power', 0.026), ('symmetrized', 0.025), ('chinese', 0.025), ('comprehensive', 0.024), ('helsinki', 0.024), ('reduction', 0.023), ('shift', 0.023), ('vehicle', 0.023), ('parameterize', 0.023), ('lj', 0.023), ('mapped', 0.022), ('negative', 0.022), ('da', 0.022), ('superiority', 0.021), ('rearranging', 0.021), ('accommodate', 0.021), ('characterize', 0.021), ('eigenmaps', 0.02), ('libsvm', 0.02), ('mixing', 0.02), ('poor', 0.02), ('segmentation', 0.019), ('der', 0.019), ('locality', 0.019), ('kl', 0.019), ('manifold', 0.019), ('king', 0.019), ('initialization', 0.019), ('supervised', 0.018), ('preserving', 0.018), ('iterative', 0.018), ('insights', 0.018), ('newton', 0.018), ('working', 0.018), ('seconds', 0.018), ('compression', 0.018), ('employs', 0.017), ('notational', 0.017), ('infeasible', 0.016)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

Author: Zhirong Yang, Irwin King, Zenglin Xu, Erkki Oja

Abstract: Stochastic Neighbor Embedding (SNE) has been shown to be quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in finding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of t-SNE to accommodate various heavy-tailed embedding similarity functions. With this generalization, we are presented with two difficulties. The first is how to select the best embedding similarity among all heavy-tailed functions, and the second is how to optimize the objective function once the heavy-tailed function has been selected. Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. Based on this finding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a fixed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast as and performs as well as the best known t-SNE implementation, and the other for semi-supervised visualization showing quantitative superiority using the homogeneity measure as well as a qualitative advantage in cluster separation over t-SNE.

2 0.10417003 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

Author: Masayuki Karasuyama, Ichiro Takeuchi

Abstract: We propose a multiple incremental decremental algorithm of Support Vector Machine (SVM). Conventional single incremental decremental SVM can update the trained model efficiently when single data point is added to or removed from the training set. When we add and/or remove multiple data points, this algorithm is time-consuming because we need to repeatedly apply it to each data point. The proposed algorithm is computationally more efficient when multiple data points are added and/or removed simultaneously. The single incremental decremental algorithm is built on an optimization technique called parametric programming. We extend the idea and introduce multi-parametric programming for developing the proposed algorithm. Experimental results on synthetic and real data sets indicate that the proposed algorithm can significantly reduce the computational cost of multiple incremental decremental operation. Our approach is especially useful for online SVM learning in which we need to remove old data points and add new data points in a short amount of time.

3 0.084136203 122 nips-2009-Label Selection on Graphs

Author: Andrew Guillory, Jeff A. Bilmes

Abstract: We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e. offline, non sequential methods). In this setting, we find common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms which we evaluate. We show improved performance over baseline methods on several real world data sets. 1

4 0.083500713 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

Author: Keith Bush, Joelle Pineau

Abstract: Interesting real-world datasets often exhibit nonlinear, noisy, continuous-valued states that are unexplorable, are poorly described by first principles, and are only partially observable. If partial observability can be overcome, these constraints suggest the use of model-based reinforcement learning. We experiment with manifold embeddings to reconstruct the observable state-space in the context of offline, model-based reinforcement learning. We demonstrate that the embedding of a system can change as a result of learning, and we argue that the best performing embeddings well-represent the dynamics of both the uncontrolled and adaptively controlled system. We apply this approach to learn a neurostimulation policy that suppresses epileptic seizures on animal brain slices. 1

5 0.063253529 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

Author: Jing Gao, Feng Liang, Wei Fan, Yizhou Sun, Jiawei Han

Abstract: Ensemble classifiers such as bagging, boosting and model averaging are known to have improved accuracy and robustness over a single model. Their potential, however, is limited in applications which have no access to raw data but to the meta-level model output. In this paper, we study ensemble learning with output from multiple supervised and unsupervised models, a topic where little work has been done. Although unsupervised models, such as clustering, do not directly generate label prediction for each individual, they provide useful constraints for the joint prediction of a set of related objects. We propose to consolidate a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors the smoothness of the prediction over the graph, as well as penalizing deviations from the initial labeling provided by supervised models. We solve this problem through iterative propagation of probability estimates among neighboring nodes. Our method can also be interpreted as conducting a constrained embedding in a transformed space, or a ranking on the graph. Experimental results on three real applications demonstrate the benefits of the proposed method over existing alternatives1 . 1

6 0.060998604 3 nips-2009-AUC optimization and the two-sample problem

7 0.056070235 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

8 0.05085478 42 nips-2009-Bayesian Sparse Factor Models and DAGs Inference and Comparison

9 0.046378583 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

10 0.044837523 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

11 0.042042051 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

12 0.040551215 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

13 0.039699208 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

14 0.037153058 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms

15 0.036819853 137 nips-2009-Learning transport operators for image manifolds

16 0.036068834 8 nips-2009-A Fast, Consistent Kernel Two-Sample Test

17 0.034815159 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

18 0.034002859 1 nips-2009-$L 1$-Penalized Robust Estimation for a Class of Inverse Problems Arising in Multiview Geometry

19 0.033953696 23 nips-2009-Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models

20 0.033717886 190 nips-2009-Polynomial Semantic Indexing


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.115), (1, 0.043), (2, -0.01), (3, 0.031), (4, -0.043), (5, -0.011), (6, -0.018), (7, 0.03), (8, -0.028), (9, -0.014), (10, -0.018), (11, 0.029), (12, 0.034), (13, -0.058), (14, -0.046), (15, -0.014), (16, 0.017), (17, -0.042), (18, -0.056), (19, -0.066), (20, -0.011), (21, -0.022), (22, 0.059), (23, -0.036), (24, -0.056), (25, 0.087), (26, 0.013), (27, 0.03), (28, 0.011), (29, 0.013), (30, 0.097), (31, -0.115), (32, 0.119), (33, 0.096), (34, -0.027), (35, 0.039), (36, -0.04), (37, 0.068), (38, 0.025), (39, -0.021), (40, 0.066), (41, -0.042), (42, 0.106), (43, 0.043), (44, -0.048), (45, 0.031), (46, -0.119), (47, -0.037), (48, -0.032), (49, -0.119)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92051369 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

Author: Zhirong Yang, Irwin King, Zenglin Xu, Erkki Oja

Abstract: Stochastic Neighbor Embedding (SNE) has shown to be quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in finding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of the t-SNE to accommodate various heavytailed embedding similarity functions. With this generalization, we are presented with two difficulties. The first is how to select the best embedding similarity among all heavy-tailed functions and the second is how to optimize the objective function once the heavy-tailed function has been selected. Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. Based on this finding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a fixed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast and as good as the best known t-SNE implementation and the other for semi-supervised visualization showing quantitative superiority using the homogeneity measure as well as qualitative advantage in cluster separation over t-SNE.

2 0.64431155 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines

Author: Masayuki Karasuyama, Ichiro Takeuchi

Abstract: We propose a multiple incremental decremental algorithm of Support Vector Machine (SVM). Conventional single incremental decremental SVM can update the trained model efficiently when single data point is added to or removed from the training set. When we add and/or remove multiple data points, this algorithm is time-consuming because we need to repeatedly apply it to each data point. The proposed algorithm is computationally more efficient when multiple data points are added and/or removed simultaneously. The single incremental decremental algorithm is built on an optimization technique called parametric programming. We extend the idea and introduce multi-parametric programming for developing the proposed algorithm. Experimental results on synthetic and real data sets indicate that the proposed algorithm can significantly reduce the computational cost of multiple incremental decremental operation. Our approach is especially useful for online SVM learning in which we need to remove old data points and add new data points in a short amount of time.

3 0.48309499 34 nips-2009-Anomaly Detection with Score functions based on Nearest Neighbor Graphs

Author: Manqi Zhao, Venkatesh Saligrama

Abstract: We propose a novel non-parametric adaptive anomaly detection algorithm for high dimensional data based on score functions derived from nearest neighbor graphs on n-point nominal data. Anomalies are declared whenever the score of a test sample falls below α, which is supposed to be the desired false alarm level. The resulting anomaly detector is shown to be asymptotically optimal in that it is uniformly most powerful for the specified false alarm level, α, for the case when the anomaly density is a mixture of the nominal and a known density. Our algorithm is computationally efficient, being linear in dimension and quadratic in data size. It does not require choosing complicated tuning parameters or function approximation classes and it can adapt to local structure such as local change in dimensionality. We demonstrate the algorithm on both artificial and real data sets in high dimensional feature spaces.

4 0.47285733 3 nips-2009-AUC optimization and the two-sample problem

Author: Nicolas Vayatis, Marine Depecker, Stéphan J. Clémençcon

Abstract: The purpose of the paper is to explore the connection between multivariate homogeneity tests and AUC optimization. The latter problem has recently received much attention in the statistical learning literature. From the elementary observation that, in the two-sample problem setup, the null assumption corresponds to the situation where the area under the optimal ROC curve is equal to 1/2, we propose a two-stage testing method based on data splitting. A nearly optimal scoring function in the AUC sense is first learnt from one of the two half-samples. Data from the remaining half-sample are then projected onto the real line and eventually ranked according to the scoring function computed at the first stage. The last step amounts to performing a standard Mann-Whitney Wilcoxon test in the onedimensional framework. We show that the learning step of the procedure does not affect the consistency of the test as well as its properties in terms of power, provided the ranking produced is accurate enough in the AUC sense. The results of a numerical experiment are eventually displayed in order to show the efficiency of the method. 1

5 0.46979105 145 nips-2009-Manifold Embeddings for Model-Based Reinforcement Learning under Partial Observability

Author: Keith Bush, Joelle Pineau

Abstract: Interesting real-world datasets often exhibit nonlinear, noisy, continuous-valued states that are unexplorable, are poorly described by first principles, and are only partially observable. If partial observability can be overcome, these constraints suggest the use of model-based reinforcement learning. We experiment with manifold embeddings to reconstruct the observable state-space in the context of offline, model-based reinforcement learning. We demonstrate that the embedding of a system can change as a result of learning, and we argue that the best performing embeddings well-represent the dynamics of both the uncontrolled and adaptively controlled system. We apply this approach to learn a neurostimulation policy that suppresses epileptic seizures on animal brain slices. 1

6 0.46682039 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

7 0.4492799 122 nips-2009-Label Selection on Graphs

8 0.43725654 102 nips-2009-Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models

9 0.4290854 146 nips-2009-Manifold Regularization for SIR with Rate Root-n Convergence

10 0.42567086 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models

11 0.42364275 72 nips-2009-Distribution Matching for Transduction

12 0.42335767 127 nips-2009-Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

13 0.41968536 195 nips-2009-Probabilistic Relational PCA

14 0.39796218 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data

15 0.39487943 103 nips-2009-Graph Zeta Function in the Bethe Free Energy and Loopy Belief Propagation

16 0.39161465 214 nips-2009-Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction

17 0.39009798 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning

18 0.38078916 92 nips-2009-Fast Graph Laplacian Regularized Kernel Learning via Semidefinite–Quadratic–Linear Programming

19 0.37425187 60 nips-2009-Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

20 0.36601937 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(7, 0.02), (24, 0.038), (25, 0.092), (35, 0.06), (36, 0.085), (39, 0.053), (58, 0.063), (61, 0.019), (71, 0.042), (77, 0.312), (86, 0.109), (91, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79547656 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding

Author: Zhirong Yang, Irwin King, Zenglin Xu, Erkki Oja

Abstract: Stochastic Neighbor Embedding (SNE) has shown to be quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in finding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of the t-SNE to accommodate various heavytailed embedding similarity functions. With this generalization, we are presented with two difficulties. The first is how to select the best embedding similarity among all heavy-tailed functions and the second is how to optimize the objective function once the heavy-tailed function has been selected. Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. Based on this finding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a fixed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast and as good as the best known t-SNE implementation and the other for semi-supervised visualization showing quantitative superiority using the homogeneity measure as well as qualitative advantage in cluster separation over t-SNE.

2 0.74297065 237 nips-2009-Subject independent EEG-based BCI decoding

Author: Siamac Fazli, Cristian Grozea, Marton Danoczy, Benjamin Blankertz, Florin Popescu, Klaus-Robert Müller

Abstract: In the quest to make Brain Computer Interfacing (BCI) more usable, dry electrodes have emerged that get rid of the initial 30 minutes required for placing an electrode cap. Another time consuming step is the required individualized adaptation to the BCI user, which involves another 30 minutes calibration for assessing a subject’s brain signature. In this paper we aim to also remove this calibration proceedure from BCI setup time by means of machine learning. In particular, we harvest a large database of EEG BCI motor imagination recordings (83 subjects) for constructing a library of subject-specific spatio-temporal filters and derive a subject independent BCI classifier. Our offline results indicate that BCI-na¨ve ı users could start real-time BCI use with no prior calibration at only a very moderate performance loss.

3 0.73640388 136 nips-2009-Learning to Rank by Optimizing NDCG Measure

Author: Hamed Valizadegan, Rong Jin, Ruofei Zhang, Jianchang Mao

Abstract: Learning to rank is a relatively new field of study, aiming to learn a ranking function from a set of training data with relevancy labels. The ranking algorithms are often evaluated using information retrieval measures, such as Normalized Discounted Cumulative Gain (NDCG) [1] and Mean Average Precision (MAP) [2]. Until recently, most learning to rank algorithms were not using a loss function related to the above mentioned evaluation measures. The main difficulty in direct optimization of these measures is that they depend on the ranks of documents, not the numerical values output by the ranking function. We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents. A relaxation strategy is used to approximate the average of NDCG over the space of permutation, and a bound optimization approach is proposed to make the computation efficient. Extensive experiments show that the proposed algorithm outperforms state-of-the-art ranking algorithms on several benchmark data sets. 1

4 0.71297383 185 nips-2009-Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis

Author: Sundeep Rangan, Alyson K. Fletcher

Abstract: A well-known analysis of Tropp and Gilbert shows that orthogonal matching pursuit (OMP) can recover a k-sparse n-dimensional real vector from m = 4k log(n) noise-free linear measurements obtained through a random Gaussian measurement matrix with a probability that approaches one as n → ∞. This work strengthens this result by showing that a lower number of measurements, m = 2k log(n − k), is in fact sufficient for asymptotic recovery. More generally, when the sparsity level satisfies kmin ≤ k ≤ kmax but is unknown, m = 2kmax log(n − kmin ) measurements is sufficient. Furthermore, this number of measurements is also sufficient for detection of the sparsity pattern (support) of the vector with measurement errors provided the signal-to-noise ratio (SNR) scales to infinity. The scaling m = 2k log(n − k) exactly matches the number of measurements required by the more complex lasso method for signal recovery in a similar SNR scaling.

5 0.58885098 199 nips-2009-Ranking Measures and Loss Functions in Learning to Rank

Author: Wei Chen, Tie-yan Liu, Yanyan Lan, Zhi-ming Ma, Hang Li

Abstract: Learning to rank has become an important research topic in machine learning. While most learning-to-rank methods learn the ranking functions by minimizing loss functions, it is the ranking measures (such as NDCG and MAP) that are used to evaluate the performance of the learned ranking functions. In this work, we reveal the relationship between ranking measures and loss functions in learningto-rank methods, such as Ranking SVM, RankBoost, RankNet, and ListMLE. We show that the loss functions of these methods are upper bounds of the measurebased ranking errors. As a result, the minimization of these loss functions will lead to the maximization of the ranking measures. The key to obtaining this result is to model ranking as a sequence of classification tasks, and define a so-called essential loss for ranking as the weighted sum of the classification errors of individual tasks in the sequence. We have proved that the essential loss is both an upper bound of the measure-based ranking errors, and a lower bound of the loss functions in the aforementioned methods. Our proof technique also suggests a way to modify existing loss functions to make them tighter bounds of the measure-based ranking errors. Experimental results on benchmark datasets show that the modifications can lead to better ranking performances, demonstrating the correctness of our theoretical analysis. 1

6 0.5486095 87 nips-2009-Exponential Family Graph Matching and Ranking

7 0.54782724 230 nips-2009-Statistical Consistency of Top-k Ranking

8 0.53032285 137 nips-2009-Learning transport operators for image manifolds

9 0.52904189 211 nips-2009-Segmenting Scenes by Matching Image Composites

10 0.52841401 212 nips-2009-Semi-Supervised Learning in Gigantic Image Collections

11 0.52431828 112 nips-2009-Human Rademacher Complexity

12 0.52296859 155 nips-2009-Modelling Relational Data using Bayesian Clustered Tensor Factorization

13 0.52239895 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds

14 0.52189994 105 nips-2009-Grouped Orthogonal Matching Pursuit for Variable Selection and Prediction

15 0.52120656 175 nips-2009-Occlusive Components Analysis

16 0.52097189 104 nips-2009-Group Sparse Coding

17 0.52058578 169 nips-2009-Nonlinear Learning using Local Coordinate Coding

18 0.51889795 3 nips-2009-AUC optimization and the two-sample problem

19 0.51851678 113 nips-2009-Improving Existing Fault Recovery Policies

20 0.51797676 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions