jmlr jmlr2013 jmlr2013-37 knowledge-graph by maker-knowledge-mining

37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis


Source: pdf

Author: Joshua M. Lewis, Virginia R. de Sa, Laurens van der Maaten

Abstract: Divvy is an application for applying unsupervised machine learning techniques (clustering and dimensionality reduction) to the data analysis process. Divvy provides a novel UI that allows researchers to tighten the action-perception loop of changing algorithm parameters and seeing a visualization of the result. Machine learning researchers can use Divvy to publish easy to use reference implementations of their algorithms, which helps the machine learning field have a greater impact on research practices elsewhere. Keywords: clustering, dimensionality reduction, open source software, human computer interaction, data visualization

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 EDU Department of Cognitive Science, University of California, San Diego, La Jolla, CA 92093-0515, USA. Laurens van der Maaten, LVDMAATEN@GMAIL. [sent-7, score-0.108]

2 Science, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands. Editor: Mark Reid. Abstract: Divvy is an application for applying unsupervised machine learning techniques (clustering and dimensionality reduction) to the data analysis process. [sent-12, score-0.048]

3 Divvy provides a novel UI that allows researchers to tighten the action-perception loop of changing algorithm parameters and seeing a visualization of the result. [sent-13, score-0.12]

4 Machine learning researchers can use Divvy to publish easy-to-use reference implementations of their algorithms, which helps the machine learning field have a greater impact on research practices elsewhere. [sent-14, score-0.152]

5 Keywords: clustering, dimensionality reduction, open source software, human computer interaction, data visualization 1. [sent-15, score-0.096]

6 Introduction. The field of machine learning has produced many techniques for performing data analysis, but researchers outside the field face substantial challenges applying them to their data. [sent-16, score-0.08]

7 Authors often describe an algorithm without providing a reference implementation, and if they do provide code it may be for an unfamiliar language or platform. [sent-18, score-0.025]

8 Third, in the early, exploratory stage of data analysis researchers should explore and compare several different techniques. [sent-21, score-0.085]

9 If each technique is challenging to get running for the reasons above, the challenge is compounded with multiple techniques using different languages and data formats. [sent-22, score-0.044]

10 Using Divvy, researchers can quickly run an assortment of clustering and dimensionality reduction algorithms on their data, without worrying about programming languages, data formats, or visualization. [sent-24, score-0.208]

11 Divvy is a member of a family of attempts to bring machine learning and data visualization to a wider audience. [sent-25, score-0.027]

12-13 GGobi (Swayne et al., 2003) provides a variety of interactive visualizations for high-dimensional data, and includes some lightweight machine learning components such as PCA. [sent-27, score-0.028] [sent-30, score-0.047]

14 The Orange project (University of Ljubljana Bioinformatics Laboratory, 2013) provides a visual programming interface for machine learning techniques in Python. [sent-31, score-0.054]

15 Divvy’s unique focus among these projects is on user experience and extensibility. [sent-35, score-0.075]

16 For users, Divvy provides a simple, fast interface for doing data analysis. [sent-36, score-0.054]

17 For machine learning researchers, Divvy provides a plugin architecture that lets one release a version of an algorithm complete with custom UI and help resources. [sent-37, score-0.201]

18 By publishing an algorithm on the Divvy platform, machine learning researchers can drastically lower the barriers to entry that data analysts face when attempting to use it. [sent-38, score-0.125]

19 Divvy and all of its included plugins are open source, distributed under the MIT license. [sent-39, score-0.14]

20 Interaction. Divvy has three fundamental components in its interface (see Figure 1). [sent-41, score-0.054]

21 In the lower-right corner of the main window (Label 1) is a collection of data sets for the user to analyze. [sent-42, score-0.039]

22 Each data set is associated with one or more data set views, which are visualized in the left hand portion of the interface (Label 2). [sent-43, score-0.054]

23 Finally, each data set view is characterized by a combination of four plugins from the top-right panel (Label 3): a dimensionality reduction algorithm, a clustering algorithm, a point visualizer, and a data set visualizer. [sent-44, score-0.336]
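
The view composition just described lends itself to a small sketch. Divvy itself is written in Objective-C and C/C++, so the C++ below is only an illustration of how the four plugin roles combine into a view's render pipeline; the type and member names are assumptions, not Divvy's actual classes.

```cpp
// Illustrative sketch only: names are hypothetical, not Divvy's API.
#include <functional>
#include <vector>

struct DataSet {
    int n = 0, d = 0;              // n points in d input dimensions
    std::vector<float> values;     // row-major, n * d
};

// A data set view pairs a data set with one plugin from each category.
struct DataSetView {
    DataSet data;
    std::function<std::vector<float>(const DataSet&)> reducer;    // e.g., PCA, Isomap, t-SNE
    std::function<std::vector<int>(const DataSet&)>   clusterer;  // e.g., K-means, linkage
    std::function<void(int)> pointVisualizer;                     // draws one point (e.g., an image)
    std::function<void(const std::vector<float>&,
                       const std::vector<int>&)> dataSetVisualizer; // draws the whole view

    void render() const {
        std::vector<float> embedding = reducer(data);   // dimensionality reduction
        std::vector<int>   labels    = clusterer(data); // cluster assignments
        dataSetVisualizer(embedding, labels);           // e.g., scatter plot; may delegate
                                                        // individual points to pointVisualizer
    }
};
```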

24-25 Clustering and dimensionality reduction plugins are interfaces to their associated algorithms, currently K-means and single/complete linkage for clustering, and PCA, Isomap (Tenenbaum et al., 2000) and t-SNE (van der Maaten and Hinton, 2008) for dimensionality reduction. [sent-45, score-0.232] [sent-46, score-0.119]

26 Point visualizers render data points in a meaningful way, for example, by rendering images in a data set where points represent images, or rendering type for a linguistic corpus. [sent-47, score-0.159]

27 Data set visualizers render the entire data set, for example as a scatter plot or parallel coordinates plot. [sent-48, score-0.147]

28 Divvy includes an image point visualizer and a scatter plot data set visualizer by default. [sent-49, score-0.324]

29 As an example, the user in Figure 1 has loaded four data sets into Divvy and selected the faces data set. [sent-50, score-0.09]

30 The user has created three views that represent three distinct perspectives on the data. [sent-51, score-0.13]

31 The top-right view is an Isomap embedding of the data set with an image point visualizer, so the user sees a selection of points rendered as images and positioned in the first two Isomap dimensions. [sent-52, score-0.157]

32 The lower left view is a t-SNE embedding with coloring from K-means clustering rendered as a scatter plot. [sent-53, score-0.195]

33 In practice a researcher can use these visualizations to evaluate different approaches to dimensionality reduction and clustering. [sent-54, score-0.177]

34 In Figure 1 the user can easily see that Isomap embeds the face images in two dimensions more smoothly than t-SNE. [sent-55, score-0.082]

35 The structure of the faces data set is ideal for Isomap, and a researcher can quickly discover that with Divvy. [sent-56, score-0.074]

36 Users can create as many perspectives on their data as they’d like, grow or shrink them with the slider on the lower-right edge of the screen, and export their preferred views as PNGs. [sent-57, score-0.091]

37 Whenever they change a plugin or plugin parameter, the selected view automatically rerenders with the new setting. [sent-58, score-0.408]

38 In order to make Divvy as responsive as possible, each data set view is sandboxed in its own set of threads (Divvy is task parallel with granularity at the view level). [sent-59, score-0.195]
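
Divvy realizes this with Cocoa threading (NSOperation, noted later in this summary). The sketch below is a portable C++ analogy, not Divvy's code: each view owns its in-flight computation, so changing a parameter launches a recomputation for that view on a separate thread while the UI and the other views stay responsive.

```cpp
// C++ analogy of view-level task parallelism; Divvy itself uses NSOperation.
#include <cstdio>
#include <future>
#include <vector>

struct View {
    int numClusters = 2;          // example plugin parameter
    std::future<void> pending;    // this view's in-flight computation

    void recompute() {
        // Run the expensive reduce/cluster/render work off the calling thread.
        pending = std::async(std::launch::async, [k = numClusters] {
            std::printf("recomputing view with k = %d clusters\n", k);
            // ... dimensionality reduction, clustering, rendering would go here ...
        });
    }
};

int main() {
    std::vector<View> views(3);
    views[1].numClusters = 5;     // the user tweaks a parameter on one view
    views[1].recompute();         // only that view re-renders; the others are untouched
    for (auto& v : views)
        if (v.pending.valid()) v.pending.wait();
}
```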

39 Users can start a long computation in one view while still interacting with other views or data sets. [sent-60, score-0.111]

40 For shorter computations the view changes instantly, giving users immediate visual feedback on the effect of their parameter selections. [sent-61, score-0.095]

41 (1) The data sets list—data sets the user has loaded appear here with some summary statistics. [sent-63, score-0.07]

42 (2) The data set view palette—data set views are visualizations of data sets using a combination of a dimensionality reduction algorithm, a clustering algorithm, a point visualizer, and a data set visualizer. [sent-64, score-0.29]

43 (3) Data set view parameters—controls for setting the parameters of the algorithms that compose a particular data set view, for example, the number of clusters for a clustering algorithm. [sent-66, score-0.104]

44 Our goal with the Divvy UI is to tighten up the action-perception loop in data analysis. [sent-67, score-0.033]

45 As an analogy, baseball players have an excellent idea of how baseballs behave. [sent-68, score-0.192]

46 Nevertheless, through extensive experience, baseball players acquire an excellent pragmatic understanding of how baseballs behave, an understanding that one might guess is based on an implicit learned model of baseball behavior rather than the explicit model a physicist would give. [sent-70, score-0.341]

47 By giving Divvy users an immediate, tactile experience of algorithmic behavior, we hope they can develop better intuitive models of how algorithms behave and thus make better analysis decisions. [sent-71, score-0.103]

48 Rather, it is a lightweight framework for loading data sets and plugins and then coordinating their interaction. [sent-74, score-0.207]

49 Each plugin is an independent bundle that defines a UI and follows one of four possible input/output protocols: clusterer, reducer, point visualizer and data set visualizer. [sent-75, score-0.351]
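
As a sketch of what one of these protocols might look like, here is a C++ rendering of a clusterer interface together with a deliberately trivial conforming plugin. Divvy's real plugins are Objective-C bundles with their own UI, so these names and signatures are assumptions for illustration, not the actual plugin API.

```cpp
// Hypothetical clusterer protocol and a trivial plugin that conforms to it.
#include <cstdlib>
#include <vector>

struct Clusterer {
    // Input: n points in d dimensions (row-major); output: one label per point.
    virtual std::vector<int> cluster(const std::vector<float>& points,
                                     int n, int d, int k) = 0;
    virtual ~Clusterer() = default;
};

// A toy plugin: assigns each point to a random cluster.
struct RandomClusterer : Clusterer {
    std::vector<int> cluster(const std::vector<float>&, int n, int, int k) override {
        std::vector<int> labels(n);
        for (int i = 0; i < n; ++i) labels[i] = std::rand() % k;
        return labels;
    }
};
```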

50 While the core Divvy application is Mac OS X specific, each machine learning plugin is just a lightly-wrapped reference implementation of its algorithm in C or C++. [sent-76, score-0.224]

51 1 A researcher publishing an algorithm in Divvy’s format need only add an OS X UI (and Divvy gives complete freedom as to the details of that UI save a fixed width) to their implementation, which remains completely platform-agnostic. [sent-77, score-0.136]
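
To make the wrapping pattern concrete, here is a hedged sketch: the core is plain, platform-agnostic C-style code (just the nearest-centroid assignment step of K-means, not a full implementation), and the plugin author adds only a thin wrapper around it; the OS X UI layer the text mentions is omitted. All names are hypothetical.

```cpp
#include <vector>

// Platform-agnostic reference implementation: one K-means assignment step.
// This could live unchanged in a plain C file; nothing here touches the UI.
extern "C" void kmeans_assign(const float* points, int n, int d,
                              const float* centroids, int k, int* labels) {
    for (int i = 0; i < n; ++i) {
        int best = 0;
        float bestDist = -1.0f;
        for (int c = 0; c < k; ++c) {
            float dist = 0.0f;
            for (int j = 0; j < d; ++j) {
                float diff = points[i * d + j] - centroids[c * d + j];
                dist += diff * diff;
            }
            if (bestDist < 0.0f || dist < bestDist) { bestDist = dist; best = c; }
        }
        labels[i] = best;
    }
}

// Thin wrapper: the small amount of glue a plugin author adds (besides the UI).
std::vector<int> assignToNearestCentroid(const std::vector<float>& points, int n, int d,
                                         const std::vector<float>& centroids, int k) {
    std::vector<int> labels(n);
    kmeans_assign(points.data(), n, d, centroids.data(), k, labels.data());
    return labels;
}
```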

52 With this bit of work users can drop the algorithm bundle into Divvy and start using the new technique on their existing data sets. [sent-78, score-0.078]

53 We implemented Divvy on Mac OS X in order to focus on one user experience. [sent-79, score-0.039]

54 While it’s of course desirable to have open source software on as many platforms as possible, building the Divvy UI across platforms was outside the scope of our engineering resources. [sent-80, score-0.093]

55 Data within Divvy can be exported as PNG or CSV as appropriate. [sent-82, score-0.02]
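
As an illustration of the CSV half of this export (a minimal sketch, not Divvy's own exporter), the following writes a two-dimensional embedding and its cluster labels to a comma-separated file; the function name and column layout are assumptions.

```cpp
// Minimal CSV export of an embedding plus cluster labels (illustrative only).
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

void exportCSV(const std::string& path,
               const std::vector<float>& embedding,   // n x 2, row-major
               const std::vector<int>& labels) {      // n cluster labels
    std::ofstream out(path);
    out << "x,y,cluster\n";
    for (std::size_t i = 0; i < labels.size(); ++i)
        out << embedding[2 * i] << ',' << embedding[2 * i + 1] << ',' << labels[i] << '\n';
}
```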

56 As mentioned above, each data set view owns a set of threads that compute independently from the UI and those of other views. [sent-86, score-0.102]

57 Within a view each plugin can operate in parallel over its assigned data set. [sent-87, score-0.253]

58 In our lab Divvy can achieve over 2,300% CPU utilization on our hyperthreaded 12-core Mac Pro through a combination of these two forms of parallelization. [sent-88, score-0.033]

59 The task parallelism is achieved at the application level through the NSOperation framework in Cocoa. [sent-89, score-0.033]

60 Data parallelism is the responsibility of plugin authors and should be implemented using the open-source libdispatch library. [sent-91, score-0.26]
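
The sketch below shows what such plugin-level data parallelism could look like with libdispatch's dispatch_apply_f, which fans a per-point computation (here, squared distances to a query point, a step shared by K-means and neighborhood-graph construction) out across the global concurrent queue. The surrounding function and data layout are assumptions, not Divvy's code.

```cpp
// Data parallelism inside a plugin via libdispatch (illustrative sketch).
#include <dispatch/dispatch.h>
#include <vector>

struct DistanceJob {
    const float* points;   // n x d, row-major
    const float* query;    // 1 x d
    float* out;            // n squared distances
    int d;
};

static void distance_one(void* ctx, size_t i) {
    const DistanceJob* job = static_cast<const DistanceJob*>(ctx);
    float dist = 0.0f;
    for (int j = 0; j < job->d; ++j) {
        float diff = job->points[i * job->d + j] - job->query[j];
        dist += diff * diff;
    }
    job->out[i] = dist;
}

void parallelDistances(const std::vector<float>& points, int n, int d,
                       const std::vector<float>& query, std::vector<float>& out) {
    out.resize(n);
    DistanceJob job{points.data(), query.data(), out.data(), d};
    // One iteration per data point, spread across cores by libdispatch.
    dispatch_apply_f(static_cast<size_t>(n),
                     dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                     &job, distance_one);
}
```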

61 Overall the Divvy codebase is two-thirds Objective-C (interface, plugin wrappers) and one-third C/C++ (algorithms). [sent-99, score-0.2]

62 To port Divvy to another platform, one would only need to rewrite the interface code. [sent-101, score-0.054]

63 Joshua Lewis wrote the entire Divvy application and bundled plugins save the dimensionality reduction plugins, which were written by Laurens van der Maaten. [sent-103, score-0.366]

64 With twelve cores and two threads per core, peak utilization is 2,400%. [sent-105, score-0.087]

65-66 We picked libdispatch because OpenMP has an issue with GCC 4.3 where starting a parallel section from a thread other than the main thread causes a crash, so we cannot recommend it. [sent-107, score-0.047] [sent-109, score-0.083]

67 GGobi: evolving from XGobi into an extensible framework for interactive data visualization. [sent-112, score-0.028]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('divvy', 0.865), ('plugin', 0.18), ('plugins', 0.14), ('visualizer', 0.14), ('baseball', 0.093), ('laurens', 0.08), ('joshua', 0.072), ('der', 0.071), ('ayasdi', 0.07), ('views', 0.063), ('isomap', 0.063), ('researchers', 0.06), ('ui', 0.058), ('clustering', 0.056), ('interface', 0.054), ('researcher', 0.054), ('maaten', 0.054), ('threads', 0.054), ('dimensionality', 0.048), ('view', 0.048), ('users', 0.047), ('aaten', 0.047), ('baseballs', 0.047), ('cogsci', 0.047), ('csv', 0.047), ('delft', 0.047), ('ewis', 0.047), ('ggobi', 0.047), ('ivvy', 0.047), ('libdispatch', 0.047), ('lightweight', 0.047), ('ntuitive', 0.047), ('publish', 0.047), ('swayne', 0.047), ('visualizers', 0.047), ('xploratory', 0.047), ('os', 0.047), ('reduction', 0.044), ('scatter', 0.044), ('mac', 0.044), ('lewis', 0.042), ('ljubljana', 0.04), ('orange', 0.04), ('user', 0.039), ('platform', 0.038), ('van', 0.037), ('experience', 0.036), ('platforms', 0.036), ('ucsd', 0.033), ('players', 0.033), ('parallelism', 0.033), ('tighten', 0.033), ('virginia', 0.033), ('utilization', 0.033), ('render', 0.031), ('format', 0.031), ('loaded', 0.031), ('bundle', 0.031), ('visualizations', 0.031), ('sa', 0.029), ('rendering', 0.029), ('thread', 0.029), ('perspectives', 0.028), ('interactive', 0.028), ('rendered', 0.028), ('nalysis', 0.028), ('visualization', 0.027), ('save', 0.026), ('tenenbaum', 0.026), ('publishing', 0.025), ('parallel', 0.025), ('exploratory', 0.025), ('reference', 0.025), ('languages', 0.024), ('eld', 0.023), ('images', 0.023), ('architecture', 0.021), ('source', 0.021), ('face', 0.02), ('behave', 0.02), ('crash', 0.02), ('analysts', 0.02), ('protocols', 0.02), ('compounded', 0.02), ('reducer', 0.02), ('ameliorate', 0.02), ('buja', 0.02), ('codebase', 0.02), ('coordinating', 0.02), ('exported', 0.02), ('openmp', 0.02), ('practices', 0.02), ('pragmatic', 0.02), ('reid', 0.02), ('responsive', 0.02), ('vin', 0.02), ('faces', 0.02), ('embedding', 0.019), ('core', 0.019), ('excellent', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis

Author: Joshua M. Lewis, Virginia R. de Sa, Laurens van der Maaten

Abstract: Divvy is an application for applying unsupervised machine learning techniques (clustering and dimensionality reduction) to the data analysis process. Divvy provides a novel UI that allows researchers to tighten the action-perception loop of changing algorithm parameters and seeing a visualization of the result. Machine learning researchers can use Divvy to publish easy to use reference implementations of their algorithms, which helps the machine learning field have a greater impact on research practices elsewhere. Keywords: clustering, dimensionality reduction, open source software, human computer interaction, data visualization

2 0.053004947 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

Author: Sergey Lisitsyn, Christian Widmer, Fernando J. Iglesias Garcia

Abstract: We present Tapkee, a C++ template library that provides efficient implementations of more than 20 widely used dimensionality reduction techniques ranging from Locally Linear Embedding (Roweis and Saul, 2000) and Isomap (de Silva and Tenenbaum, 2002) to the recently introduced BarnesHut-SNE (van der Maaten, 2013). Our library was designed with a focus on performance and flexibility. For performance, we combine efficient multi-core algorithms, modern data structures and state-of-the-art low-level libraries. To achieve flexibility, we designed a clean interface for applying methods to user data and provide a callback API that facilitates integration with the library. The library is freely available as open-source software and is distributed under the permissive BSD 3-clause license. We encourage the integration of Tapkee into other open-source toolboxes and libraries. For example, Tapkee has been integrated into the codebase of the Shogun toolbox (Sonnenburg et al., 2010), giving us access to a rich set of kernels, distance measures and bindings to common programming languages including Python, Octave, Matlab, R, Java, C#, Ruby, Perl and Lua. Source code, examples and documentation are available at http://tapkee.lisitsyn.me. Keywords: dimensionality reduction, machine learning, C++, open source software

3 0.033119425 83 jmlr-2013-Orange: Data Mining Toolbox in Python

Author: Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, Blaž Zupan

Abstract: Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. In the selection and design of components, we focus on the flexibility of their reuse: our principal intention is to let the user write simple and clear scripts in Python, which build upon C++ implementations of computationallyintensive tasks. Orange is intended both for experienced users and programmers, as well as for students of data mining. Keywords: Python, data mining, machine learning, toolbox, scripting

4 0.031295151 86 jmlr-2013-Parallel Vector Field Embedding

Author: Binbin Lin, Xiaofei He, Chiyuan Zhang, Ming Ji

Abstract: We propose a novel local isometry based dimensionality reduction method from the perspective of vector fields, which is called parallel vector field embedding (PFE). We first give a discussion on local isometry and global isometry to show the intrinsic connection between parallel vector fields and isometry. The problem of finding an isometry turns out to be equivalent to finding orthonormal parallel vector fields on the data manifold. Therefore, we first find orthonormal parallel vector fields by solving a variational problem on the manifold. Then each embedding function can be obtained by requiring its gradient field to be as close to the corresponding parallel vector field as possible. Theoretical results show that our method can precisely recover the manifold if it is isometric to a connected open subset of Euclidean space. Both synthetic and real data examples demonstrate the effectiveness of our method even if there is heavy noise and high curvature. Keywords: manifold learning, isometry, vector field, covariant derivative, out-of-sample extension

5 0.027966361 59 jmlr-2013-Large-scale SVD and Manifold Learning

Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley

Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nystr¨ m and Column sampling methods. We present a theoretical o comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the largescale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nystr¨ m approximation is superior to the Column sampling method for this o task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization

6 0.025603615 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

7 0.017547458 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java

8 0.016666671 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

9 0.014715852 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

10 0.013839687 78 jmlr-2013-On the Learnability of Shuffle Ideals

11 0.013721565 67 jmlr-2013-MLPACK: A Scalable C++ Machine Learning Library

12 0.01314347 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

13 0.013013768 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding

14 0.01202683 15 jmlr-2013-Bayesian Canonical Correlation Analysis

15 0.012015199 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

16 0.01191764 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

17 0.011768907 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

18 0.010518786 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

19 0.010341069 96 jmlr-2013-Regularization-Free Principal Curve Estimation

20 0.010330483 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.052), (1, 0.024), (2, -0.025), (3, -0.069), (4, -0.019), (5, 0.02), (6, -0.008), (7, -0.003), (8, -0.033), (9, -0.023), (10, 0.055), (11, -0.12), (12, 0.005), (13, 0.003), (14, 0.027), (15, -0.107), (16, 0.024), (17, -0.041), (18, -0.022), (19, 0.03), (20, 0.118), (21, 0.103), (22, 0.018), (23, -0.081), (24, -0.048), (25, 0.075), (26, -0.023), (27, -0.046), (28, -0.013), (29, -0.116), (30, 0.046), (31, -0.061), (32, 0.038), (33, -0.068), (34, 0.07), (35, -0.079), (36, -0.015), (37, -0.294), (38, -0.087), (39, -0.161), (40, 0.184), (41, 0.104), (42, 0.17), (43, 0.324), (44, 0.165), (45, -0.25), (46, -0.078), (47, -0.193), (48, 0.006), (49, 0.153)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97830617 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis

Author: Joshua M. Lewis, Virginia R. de Sa, Laurens van der Maaten

Abstract: Divvy is an application for applying unsupervised machine learning techniques (clustering and dimensionality reduction) to the data analysis process. Divvy provides a novel UI that allows researchers to tighten the action-perception loop of changing algorithm parameters and seeing a visualization of the result. Machine learning researchers can use Divvy to publish easy to use reference implementations of their algorithms, which helps the machine learning field have a greater impact on research practices elsewhere. Keywords: clustering, dimensionality reduction, open source software, human computer interaction, data visualization

2 0.54857516 83 jmlr-2013-Orange: Data Mining Toolbox in Python

Author: Janez Demšar, Tomaž Curk, Aleš Erjavec, Črt Gorup, Tomaž Hočevar, Mitar Milutinovič, Martin Možina, Matija Polajnar, Marko Toplak, Anže Starič, Miha Štajdohar, Lan Umek, Lan Žagar, Jure Žbontar, Marinka Žitnik, Blaž Zupan

Abstract: Orange is a machine learning and data mining suite for data analysis through Python scripting and visual programming. Here we report on the scripting part, which features interactive data analysis and component-based assembly of data mining procedures. In the selection and design of components, we focus on the flexibility of their reuse: our principal intention is to let the user write simple and clear scripts in Python, which build upon C++ implementations of computationallyintensive tasks. Orange is intended both for experienced users and programmers, as well as for students of data mining. Keywords: Python, data mining, machine learning, toolbox, scripting

3 0.40076476 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

Author: Sergey Lisitsyn, Christian Widmer, Fernando J. Iglesias Garcia

Abstract: We present Tapkee, a C++ template library that provides efficient implementations of more than 20 widely used dimensionality reduction techniques ranging from Locally Linear Embedding (Roweis and Saul, 2000) and Isomap (de Silva and Tenenbaum, 2002) to the recently introduced BarnesHut-SNE (van der Maaten, 2013). Our library was designed with a focus on performance and flexibility. For performance, we combine efficient multi-core algorithms, modern data structures and state-of-the-art low-level libraries. To achieve flexibility, we designed a clean interface for applying methods to user data and provide a callback API that facilitates integration with the library. The library is freely available as open-source software and is distributed under the permissive BSD 3-clause license. We encourage the integration of Tapkee into other open-source toolboxes and libraries. For example, Tapkee has been integrated into the codebase of the Shogun toolbox (Sonnenburg et al., 2010), giving us access to a rich set of kernels, distance measures and bindings to common programming languages including Python, Octave, Matlab, R, Java, C#, Ruby, Perl and Lua. Source code, examples and documentation are available at http://tapkee.lisitsyn.me. Keywords: dimensionality reduction, machine learning, C++, open source software

4 0.18202497 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing

Author: Lisha Chen, Andreas Buja

Abstract: Multidimensional scaling (MDS) is the art of reconstructing pointsets (embeddings) from pairwise distance data, and as such it is at the basis of several approaches to nonlinear dimension reduction and manifold learning. At present, MDS lacks a unifying methodology as it consists of a discrete collection of proposals that differ in their optimization criteria, called “stress functions”. To correct this situation we propose (1) to embed many of the extant stress functions in a parametric family of stress functions, and (2) to replace the ad hoc choice among discrete proposals with a principled parameter selection method. This methodology yields the following benefits and problem solutions: (a) It provides guidance in tailoring stress functions to a given data situation, responding to the fact that no single stress function dominates all others across all data situations; (b) the methodology enriches the supply of available stress functions; (c) it helps our understanding of stress functions by replacing the comparison of discrete proposals with a characterization of the effect of parameters on embeddings; (d) it builds a bridge to graph drawing, which is the related but not identical art of constructing embeddings from graphs. Keywords: multidimensional scaling, force-directed layout, cluster analysis, clustering strength, unsupervised learning, Box-Cox transformations

5 0.17092375 113 jmlr-2013-The CAM Software for Nonnegative Blind Source Separation in R-Java

Author: Niya Wang, Fan Meng, Li Chen, Subha Madhavan, Robert Clarke, Eric P. Hoffman, Jianhua Xuan, Yue Wang

Abstract: We describe a R-Java CAM (convex analysis of mixtures) package that provides comprehensive analytic functions and a graphic user interface (GUI) for blindly separating mixed nonnegative sources. This open-source multiplatform software implements recent and classic algorithms in the literature including Chan et al. (2008), Wang et al. (2010), Chen et al. (2011a) and Chen et al. (2011b). The CAM package offers several attractive features: (1) instead of using proprietary MATLAB, its analytic functions are written in R, which makes the codes more portable and easier to modify; (2) besides producing and plotting results in R, it also provides a Java GUI for automatic progress update and convenient visual monitoring; (3) multi-thread interactions between the R and Java modules are driven and integrated by a Java GUI, assuring that the whole CAM software runs responsively; (4) the package offers a simple mechanism to allow others to plug-in additional R-functions. Keywords: convex analysis of mixtures, blind source separation, affinity propagation clustering, compartment modeling, information-based model selection

6 0.16235611 15 jmlr-2013-Bayesian Canonical Correlation Analysis

7 0.14393373 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty

8 0.14332694 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

9 0.14133225 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

10 0.12681262 66 jmlr-2013-MAGIC Summoning: Towards Automatic Suggesting and Testing of Gestures With Low Probability of False Positives During Use

11 0.1205778 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

12 0.10621874 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

13 0.10372458 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

14 0.095716037 86 jmlr-2013-Parallel Vector Field Embedding

15 0.094695352 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

16 0.088966668 30 jmlr-2013-Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising

17 0.088804066 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

18 0.083970949 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

19 0.083244041 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

20 0.08118055 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.018), (4, 0.426), (5, 0.106), (6, 0.023), (9, 0.023), (10, 0.051), (20, 0.024), (23, 0.021), (41, 0.013), (68, 0.014), (70, 0.02), (75, 0.04), (87, 0.09)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73030806 37 jmlr-2013-Divvy: Fast and Intuitive Exploratory Data Analysis

Author: Joshua M. Lewis, Virginia R. de Sa, Laurens van der Maaten

Abstract: Divvy is an application for applying unsupervised machine learning techniques (clustering and dimensionality reduction) to the data analysis process. Divvy provides a novel UI that allows researchers to tighten the action-perception loop of changing algorithm parameters and seeing a visualization of the result. Machine learning researchers can use Divvy to publish easy to use reference implementations of their algorithms, which helps the machine learning field have a greater impact on research practices elsewhere. Keywords: clustering, dimensionality reduction, open source software, human computer interaction, data visualization

2 0.34706467 116 jmlr-2013-Truncated Power Method for Sparse Eigenvalue Problems

Author: Xiao-Tong Yuan, Tong Zhang

Abstract: This paper considers the sparse eigenvalue problem, which is to extract dominant (largest) sparse eigenvectors with at most k non-zero components. We propose a simple yet effective solution called truncated power method that can approximately solve the underlying nonconvex optimization problem. A strong sparse recovery result is proved for the truncated power method, and this theory is our key motivation for developing the new algorithm. The proposed method is tested on applications such as sparse principal component analysis and the densest k-subgraph problem. Extensive experiments on several synthetic and real-world data sets demonstrate the competitive empirical performance of our method. Keywords: sparse eigenvalue, power method, sparse principal component analysis, densest k-subgraph

3 0.31344849 112 jmlr-2013-Tapkee: An Efficient Dimension Reduction Library

Author: Sergey Lisitsyn, Christian Widmer, Fernando J. Iglesias Garcia

Abstract: We present Tapkee, a C++ template library that provides efficient implementations of more than 20 widely used dimensionality reduction techniques ranging from Locally Linear Embedding (Roweis and Saul, 2000) and Isomap (de Silva and Tenenbaum, 2002) to the recently introduced BarnesHut-SNE (van der Maaten, 2013). Our library was designed with a focus on performance and flexibility. For performance, we combine efficient multi-core algorithms, modern data structures and state-of-the-art low-level libraries. To achieve flexibility, we designed a clean interface for applying methods to user data and provide a callback API that facilitates integration with the library. The library is freely available as open-source software and is distributed under the permissive BSD 3-clause license. We encourage the integration of Tapkee into other open-source toolboxes and libraries. For example, Tapkee has been integrated into the codebase of the Shogun toolbox (Sonnenburg et al., 2010), giving us access to a rich set of kernels, distance measures and bindings to common programming languages including Python, Octave, Matlab, R, Java, C#, Ruby, Perl and Lua. Source code, examples and documentation are available at http://tapkee.lisitsyn.me. Keywords: dimensionality reduction, machine learning, C++, open source software

4 0.30696601 46 jmlr-2013-GURLS: A Least Squares Library for Supervised Learning

Author: Andrea Tacchetti, Pavan K. Mallapragada, Matteo Santoro, Lorenzo Rosasco

Abstract: We present GURLS, a least squares, modular, easy-to-extend software library for efficient supervised learning. GURLS is targeted to machine learning practitioners, as well as non-specialists. It offers a number state-of-the-art training strategies for medium and large-scale learning, and routines for efficient model selection. The library is particularly well suited for multi-output problems (multi-category/multi-label). GURLS is currently available in two independent implementations: Matlab and C++. It takes advantage of the favorable properties of regularized least squares algorithm to exploit advanced tools in linear algebra. Routines to handle computations with very large matrices by means of memory-mapped storage and distributed task execution are available. The package is distributed under the BSD license and is available for download at https://github.com/LCSL/GURLS. Keywords: regularized least squares, big data, linear algebra

5 0.30617154 59 jmlr-2013-Large-scale SVD and Manifold Learning

Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley

Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nystr¨ m and Column sampling methods. We present a theoretical o comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the largescale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nystr¨ m approximation is superior to the Column sampling method for this o task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization

6 0.30306029 7 jmlr-2013-A Risk Comparison of Ordinary Least Squares vs Ridge Regression

7 0.30250201 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

8 0.29850179 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

9 0.2956976 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

10 0.2950412 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

11 0.29368997 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

12 0.29271415 86 jmlr-2013-Parallel Vector Field Embedding

13 0.29109624 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

14 0.2909261 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

15 0.2908999 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

16 0.29085878 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

17 0.2890197 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression

18 0.28875616 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

19 0.28825292 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs

20 0.28758654 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling