jmlr jmlr2007 jmlr2007-82 knowledge-graph by maker-knowledge-mining

82 jmlr-2007-The Need for Open Source Software in Machine Learning


Source: pdf

Author: Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson

Abstract: Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community. Keywords: machine learning, open source, reproducibility, creditability, algorithms, software

Reference: text


Summary: the most important sentences generated by the tfidf model (a sketch of how such scores might be computed follows the sentence list)

sentIndex sentText sentNum sentScore

1 However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. [sent-54, score-0.571]

2 We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. [sent-55, score-1.133]

3 Keywords: machine learning, open source, reproducibility, creditability, algorithms, software [sent-58, score-0.71]

4 However, few machine learning researchers currently publish the software and/or source code associated with their papers (Thimbleby, 2003). [sent-62, score-1.098]

5 This contrasts for instance with the practices of the bioinformatics community, where open source software has been the foundation of further research (Stajich and Lapp, 2006). [sent-63, score-1.065]

6 We believe that open source sharing of machine learning software can play a very important role in removing that obstacle. [sent-65, score-1.035]

7 The open source model has many advantages which will lead to better reproducibility of experimental results: quicker detection of errors, innovative applications, and faster adoption of machine learning methods in other disciplines and in industry. [sent-66, score-0.692]

8 However, incentives for polishing and publishing software are at present lacking. [sent-67, score-0.575]

9 This paper is structured as follows: First, we briefly explain the idea behind open source software (Section 2). [sent-71, score-1.035]

10 Finally, we propose a new, separate, ongoing track for machine learning open source software in JMLR (JMLR-MLOSS) in Section 5. [sent-74, score-1.035]

11 We provide an overview of open source licenses in Appendix A and guidelines for good machine learning software in Appendix B. [sent-75, score-1.316]

12 The basic idea of open source software is very simple; programmers or users can read, modify and redistribute the source code of a piece of software (Gacek and Arief, 2004). [sent-78, score-2.15]

13 While there are various licenses of open source software (cf. [sent-79, score-1.264]

14 The open source model replaces central control with collaborative networks of contributors. [sent-82, score-0.539]

15 The Open Source Initiative (OSI)1 defines open source software as work that satisfies the criteria spelled out in Table 1. [sent-84, score-1.035]

16 To achieve this, the supporting software and data should be distributed under a suitable open source license along with the scientific paper. [sent-116, score-1.359]

17 On the contrary, open source software has created numerous new opportunities for businesses (Riehle, 2007). [sent-138, score-1.035]

18 Also, simply using an open source program on a day-to-day basis has few legal implications for users, provided they comply with the terms of the license. [sent-139, score-0.643]

19 Users are free to copy and distribute the software as is. [sent-140, score-0.57]

20 Most issues arise when users, playing the role of a developer, modify the software or incorporate it in their own programs and distribute a modified product. [sent-141, score-0.558]

21 A variety of open source licenses exists, which protect different aspects of the software with benefits for the initial developer or for developers creating derived work (Laurent, 2004). [sent-142, score-1.54]

22 A developer who wants to give away the source code in exchange for proper credit for derivative works, even closed-source ones, could choose the BSD license. [sent-147, score-0.742]

23 A developer who wants to give away the source code, is comfortable with it being incorporated into a closed-source product, but still wants to receive the bug fixes and changes made to the source when the code is integrated, could choose the GNU Lesser General Public License (LGPL). [sent-151, score-1.397]

24 This developer could be someone who wants to keep developing his software and who, by publishing it, invites the community to contribute. [sent-152, score-0.838]

25 A developer who wants to give away the source code and make sure that his program stays open source, that is, any extension (or integration) will require both the original and the derived code to be released as open source, could choose the GNU General Public License (GPL). [sent-156, score-1.372]

26 Here, the developer could be a researcher who has further plans for his software and wants to make sure that no closed-source product, not even one of his own if it includes changes from external developers, benefits from his software. [sent-157, score-0.796]
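The guidance of the last few sentences can be condensed into a decision rule. Below is a minimal, hypothetical Python sketch (not legal advice, and not from the paper) that encodes the BSD/LGPL/GPL choices as a lookup; the concern labels are illustrative assumptions.

```python
# A hypothetical condensation of the BSD/LGPL/GPL guidance above.
# Not legal advice; the concern labels are illustrative assumptions.

LICENSE_GUIDE = {
    # credit is enough; closed-source derivatives are acceptable
    "credit_only": "BSD",
    # closed-source use is fine, but fixes to the library itself
    # should flow back to the original developer
    "library_changes_flow_back": "LGPL",
    # any derived or integrated work must stay open source
    "derivatives_stay_open": "GPL",
}

def suggest_license(concern: str) -> str:
    """Return the license commonly suggested for the given concern."""
    if concern not in LICENSE_GUIDE:
        raise ValueError(f"unknown concern: {concern!r}")
    return LICENSE_GUIDE[concern]

print(suggest_license("library_changes_flow_back"))  # -> LGPL
```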

27 A comparison of open source software licenses listed as “with strong communities” on http://opensource. [sent-178, score-1.264]

28 All of the open source licenses allow for derivative works (item two in Table 1). [sent-182, score-0.768]

29 In addition it is not possible to limit an open source product to a particular use, for example, to non-commercial or academic use, as it conflicts with item six in Table 1. [sent-183, score-0.539]

30 In a brief summary of common open source licenses, Table 2 shows the rights of a developer to distribute a modified product. [sent-184, score-0.798]

31 The CC project was started in 2001 to supply the analog to open source for less technical forms of expression (Coates, 2007), and extends to all kinds of media, like text documents, photographs, video and music. [sent-192, score-0.539]

32 All CC licenses allow copying, distribution, and public performance and display of the work without any license payments. [sent-193, score-0.649]

33 It therefore conflicts with the non-discrimination provision in the open source definition (Table 1). [sent-197, score-0.539]

34 Applied to the area of science, Creative Commons advocates not only having open source methods, but also open source data and results. [sent-203, score-1.078]

35 [Open Source in Machine Learning] This section of the paper aims to provide a brief overview of open source software and its relationship to scientific activity, specifically machine learning. [sent-207, score-1.035]

36 The truth is that it is extremely difficult to obtain hard evidence on the debate between proprietary systems and open source software. [sent-209, score-0.584]

37 We argue from moral, ethical and social grounds that open source should be the preferred software publication option for machine learning research, and refer the reader to the many advantages of open source software development (Raymond, 2000). [sent-210, score-2.12]

38 Here, we focus on the specific advantages of open source software for machine learning research, which combines the needs and requirements both of being a scientific endeavor, as well as being a producer and consumer of software. [sent-212, score-1.035]

39 So, it follows that an open source approach would be ideally suited to this challenge. [sent-254, score-0.539]

40 [Quicker Detection and Correction of Bugs] An important feature that has contributed much to the success of open source software is that, with the availability of the source code, it is much easier to spot and fix bugs in software. [sent-284, score-1.439]

41 The only question is what can be done about a particular instance of software failure, and that is where having the source matters. [sent-288, score-0.821]

42 Therefore, the availability of open source implementations can help speed up scientific progress significantly. [sent-298, score-0.657]

43 [Long-Term Availability and Support] For the individual researcher, open source may provide a means of ensuring that he will be able to use his research even after changing his employer. [sent-300, score-0.539]

44 By releasing code under an open source license the chances of having long-term support are dramatically increased. [sent-311, score-1.043]

45 [Faster Adoption in Machine Learning, Other Disciplines and Industry] Availability of high-quality open source implementations can ease adoption by other machine learning researchers, users in other disciplines and developers in industry, for the following reasons: [sent-326, score-0.78]

46 Open source software can be used without cost in teaching. [sent-327, score-0.821]

47 Publishing software as open source might also be the 15. [sent-335, score-1.035]

48 There are also impressive precedents of open source software leading to the creation of multi-billion dollar companies and industries. [sent-343, score-1.068]

49 Now with the publication of toolboxes according to an open source model, it becomes possible for individual projects to move towards standardization in a collaborative, distributed manner. [sent-349, score-0.634]

50 [Current Obstacles to an Open Source Community] While there are many advantages to publishing implementations according to the open source model, this option is currently not taken often. [sent-363, score-0.658]

51 mloss.org as the platform for machine learning open source software (MLOSS), to openly discuss design decisions and to host and announce MLOSS. [sent-373, score-1.07]

52 [Publishing Software is Not Considered a Scientific Contribution] Some researchers may not consider the extra effort to create a usable piece of software out of machine learning methods to be science. [sent-387, score-0.62]

53 In reality, careful selection of a suitable open source license would satisfy the requirements of most researchers and their employers. [sent-401, score-0.914]

54 For example, using the concept of dual licensing, one could release the source code to the public under an open source license with strong reciprocal obligations (like the GNU GPL), and at the same time sell it commercially in a closed-source product. [sent-402, score-1.635]

55 [The Incentive for Publishing Open Source Software is Not High Enough] Unlike writing a journal article, releasing a piece of software is only the beginning. [sent-405, score-0.599]

56 Maintaining a software package, fixing bugs or writing further documentation requires time and commitment from the developer, and this contribution is also rarely acknowledged. [sent-406, score-0.578]

57 As a result, researchers tend not to acknowledge software used in their published research, and the effort required to turn a piece of personal research code into a software product that can be used, understood, and extended by others is not sufficiently acknowledged. [sent-410, score-1.266]

58 Therefore, at first glance, making the source code for a particular machine learning paper public may seem counterproductive for the researcher, as other researchers can more easily find problems with the proposed method, and possibly even discredit the approach. [sent-422, score-0.622]

59 Therefore, the already altruistic act of publishing papers should be complemented by providing open source code, since the same great benefits can be expected if many other researchers follow this path and distribute accompanying open source software. [sent-425, score-1.417]

60 [Proposal] In summary, providing open source code would help the whole community accelerate research. [sent-435, score-0.719]

61 Arguably, the best way to build an open source community of scientists in machine learning is to promote open source software through the existing reward system based on citation of archival sources (journals, conferences). [sent-436, score-1.631]

62 We would like to initiate this process by giving researchers the opportunity to publish their machine learning open source software, thereby setting an example of how to deal with this kind of publication media. [sent-439, score-0.687]

63 The proposed new JMLR track on machine learning open source software with review guidelines specially tailored to the needs of software is designed to serve that purpose. [sent-440, score-1.583]

64 The software must adhere to a recognized open source license (http://www. [sent-443, score-1.359]

65 Since we specifically want to honor the effort of turning a method into a highly usable piece of software, prior publication of the method is admissible, as long as the software has not been published elsewhere. [sent-447, score-0.619]

66 In summary, preparing research software for publication is a significant extra effort which should also be rewarded as such. [sent-449, score-0.546]

67 It is hoped that the open source track will motivate the machine learning community towards open science, where open access publishing, open data standards and open source software foster research progress. [sent-450, score-2.326]

68 [Format] We invite submissions of descriptions of high-quality machine learning open source software implementations. [sent-452, score-1.08]

69 A cover letter stating that the submission is intended for the machine learning open source software section, the open source license the software is released under, the web address of the project, and the software version to be reviewed. [sent-454, score-2.933]

70 The quality of the user documentation (it should enable new users to quickly apply the software to other problems, and should include a tutorial and several non-trivial examples of how the software can be used). [sent-474, score-1.056]

71 After acceptance, the abstract, including the link to the software project website, the four-page description, and the reviewed version of the software will be published on the JMLR-MLOSS website http://www. [sent-482, score-0.992]

72 [Conclusion] We have argued that the adoption of the open source model of sharing information for implementations of machine learning software can be highly beneficial for the whole field. [sent-487, score-1.13]

73 The open source model has many advantages, such as improved reproducibility of experimental results, quicker detection of errors, accelerated scientific progress, and faster adoption of machine learning methods in other disciplines and in industry. [sent-488, score-0.692]

74 As the incentives for publishing open source software are … [sent-489, score-1.114]

75 currently insufficient, we outlined a platform for publishing software for machine learning. [sent-490, score-0.575]

76 If machine learning is to solve real scientific and technological problems, the community needs to build on each other’s open source software tools. [sent-495, score-1.065]

77 Hence, we believe that there is an urgent need for machine learning open source software. [sent-496, score-0.539]

78 Such software will fulfill several concurrent roles: a better means for reproducing results; a mechanism for providing academic recognition for quality software implementations; and acceleration of the research process by allowing the standing on shoulders of others (not necessarily giants!). [sent-497, score-0.992]

79 As discussed in Section 2, most issues regarding the use of open source software arise when one wants to distribute a modified or derived product. [sent-515, score-1.104]

80 With the proliferation of open source software, various licenses have been put forward, confusing a developer who just wants to release his program to the public. [sent-517, score-1.076]

81 Whilst the choice of license might be considered a boring legal/management detail, it is actually very important to get it right: the choice of certain licenses may significantly limit the impact a piece of software may have. [sent-518, score-1.122]

82 Different licenses protect different aspects of the software with benefits for the initial developer or developers creating derived work (Laurent, 2004). [sent-527, score-1.001]

83 Significant licensing issues may arise when open source software (OSS) is combined with proprietary code. [sent-528, score-1.159]

84 Licenses which demand that subsequent modifications of the software be released under the same license are called “copyleft” licenses (Wikipedia, 2007a), the most famous of which is the GNU General Public License (GPL). [sent-530, score-1.092]

85 Then there are the “in between” licenses, like the GNU Lesser General Public License (LGPL). [Figure 1: An illustration of open source licenses with respect to the rights for the initial developer and the developer creating derived works.] [sent-533, score-1.191]

86 This is referred to as dual licensing and allows a developer to release his code to the public under the GPL and at the same time sell it commercially in a closed-source product. [sent-538, score-0.551]

87 [Some Complexities] As an illustration of some of the difficulties, let us consider the issue of conflicting open source licenses and the issue of reciprocal obligations. [sent-544, score-0.805]

88 [Open Source Licenses May Conflict] When releasing a program as “open source”, it is not obvious that, although the program is now “open source”, it still may have a license that conflicts with many other open source licenses. [sent-547, score-0.979]

89 The OSI currently lists 60 open source licenses, and the consequence of this license proliferation is that the simple inclusion BSD ⊂ LGPL ⊂ GPL, as shown in Figure 1, does not hold for other licenses. [sent-549, score-0.863]

90 While this can be used to purposely generate conflicts, as a general rule one should refrain from doing so, as it will make code exchange between open source projects impossible and may limit the distribution, and thus the success, of an open source project. [sent-551, score-1.262]

91 Researchers aspiring to a wide developer audience for their software should consider GPL-compatible licenses, or select one with a strong community. [sent-553, score-0.69]

92 Also note that this is a one-way street, that is, BSD-licensed software cannot merge code from LGPL/GPL projects, and LGPL software cannot merge code from GPL projects (see the sketch below). [sent-565, score-1.142]
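A minimal sketch of the one-way street just described, assuming only the simplified BSD ⊂ LGPL ⊂ GPL chain from Figure 1 (real compatibility questions involve many more licenses and need legal review):

```python
# Encodes the simplified one-way compatibility chain BSD < LGPL < GPL:
# code may flow from a more permissive license into a more restrictive
# project, but not back. Only these three licenses are modeled.

RESTRICTIVENESS = {"BSD": 0, "LGPL": 1, "GPL": 2}

def can_incorporate(code_license: str, project_license: str) -> bool:
    """True if code under code_license may be merged into a project
    released under project_license, per the simplified chain."""
    return RESTRICTIVENESS[code_license] <= RESTRICTIVENESS[project_license]

assert can_incorporate("BSD", "GPL")        # permissive into copyleft: fine
assert not can_incorporate("GPL", "BSD")    # copyleft into permissive: not fine
assert not can_incorporate("GPL", "LGPL")   # one-way street
```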

93 [Reciprocal Obligations] Another issue is that of reciprocal obligations: any modifications to a piece of open source software may need to be made available to the original authors. [sent-573, score-1.145]

94 While a certain lack of organization, documentation, and robustness can be tolerated when the software is used internally, it can make the software next to useless for others. [sent-594, score-0.992]

95 [Table 4: Six features of useful machine learning software.] Good machine learning software should first of all be a good piece of software (Table 4). [sent-609, score-1.561]

96 One should follow general rules for developing open source software (see also the discussion by Levesque, 2004, which highlights common failure modes for open source software development): • The software should be well and logically structured, so that its usability is high. [sent-613, score-2.566]

97 Open source software development as a special type of academic research (critique of vulgar Raymondism). [sent-665, score-0.821]

98 Legal issues relating to free and open source software. [sent-714, score-0.583]

99 Open source licenses and the creative commons framework: License selection and comparison. [sent-776, score-0.696]

100 The economic motivation of open source software: Stakeholder perspectives. [sent-814, score-0.566]
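The sentence list ends here. As noted above, here is a rough sketch of how such sentScore values might be computed; it assumes scikit-learn and a summed-tfidf scoring rule, since the actual pipeline behind this page is not documented.

```python
# A minimal sketch of tfidf sentence scoring (assumed pipeline:
# scikit-learn's TfidfVectorizer, score = sum of a sentence's weights).
from sklearn.feature_extraction.text import TfidfVectorizer

def score_sentences(sentences):
    """Rank sentences by the sum of their tfidf term weights."""
    tfidf = TfidfVectorizer().fit_transform(sentences)  # rows = sentences
    scores = tfidf.sum(axis=1).A1                       # one score per row
    return sorted(enumerate(scores), key=lambda p: -p[1])

sents = [
    "Open source tools have reached a level of maturity.",
    "We argue for publishing machine learning software as open source.",
    "The weather was fine that day.",
]
for idx, score in score_sentences(sents):
    print(idx, round(float(score), 3))
```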


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('software', 0.496), ('source', 0.325), ('license', 0.324), ('licenses', 0.229), ('open', 0.214), ('developer', 0.194), ('code', 0.15), ('mpl', 0.141), ('gpl', 0.115), ('lgpl', 0.115), ('onnenburg', 0.106), ('ource', 0.106), ('raun', 0.106), ('oftware', 0.097), ('public', 0.096), ('yes', 0.092), ('scienti', 0.083), ('achine', 0.082), ('developers', 0.082), ('pen', 0.08), ('publishing', 0.079), ('licensing', 0.079), ('piece', 0.073), ('commons', 0.071), ('creative', 0.071), ('researcher', 0.067), ('cpl', 0.062), ('legal', 0.061), ('reproducibility', 0.061), ('bsd', 0.06), ('al', 0.059), ('adoption', 0.055), ('obligations', 0.053), ('guidelines', 0.052), ('gnu', 0.052), ('access', 0.052), ('researchers', 0.051), ('publication', 0.05), ('journals', 0.047), ('publish', 0.047), ('bugs', 0.045), ('toolboxes', 0.045), ('submissions', 0.045), ('proprietary', 0.045), ('progress', 0.044), ('programmers', 0.044), ('routines', 0.044), ('sonnenburg', 0.044), ('free', 0.044), ('program', 0.043), ('released', 0.043), ('library', 0.041), ('ng', 0.041), ('implementations', 0.04), ('commercial', 0.04), ('wants', 0.039), ('disciplines', 0.037), ('documentation', 0.037), ('reciprocal', 0.037), ('bug', 0.035), ('collobert', 0.035), ('jurisdiction', 0.035), ('openly', 0.035), ('osi', 0.035), ('rights', 0.035), ('samy', 0.035), ('silent', 0.035), ('jmlr', 0.035), ('exchange', 0.034), ('availability', 0.034), ('icts', 0.033), ('companies', 0.033), ('interfaces', 0.033), ('libraries', 0.033), ('release', 0.032), ('bengio', 0.032), ('programs', 0.032), ('gunnar', 0.031), ('frameworks', 0.031), ('earning', 0.031), ('community', 0.03), ('http', 0.03), ('distribute', 0.03), ('practices', 0.03), ('citations', 0.03), ('releasing', 0.03), ('wikipedia', 0.03), ('germany', 0.029), ('papers', 0.029), ('standards', 0.028), ('algebra', 0.027), ('users', 0.027), ('citation', 0.027), ('opening', 0.027), ('ong', 0.027), ('economic', 0.027), ('cc', 0.027), ('accepted', 0.027), ('apache', 0.026), ('citing', 0.026)]
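The (wordName, wordTfidf) pairs above are the top-weighted terms for this paper. A minimal sketch of how such a top-N list might be derived, again assuming scikit-learn rather than this page's actual (undocumented) pipeline:

```python
# A minimal sketch: top-N tfidf terms for one document in a corpus
# (assumes scikit-learn; the corpus and N are illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "open source software license gpl lgpl bsd developer",
    "kernel methods and support vector machines",
]
vec = TfidfVectorizer()
weights = vec.fit_transform(docs)
terms = vec.get_feature_names_out()

row = weights[0].toarray().ravel()                 # weights for doc 0
top = sorted(zip(terms, row), key=lambda p: -p[1])[:5]
print([(t, round(float(w), 3)) for t, w in top])   # (wordName, wordTfidf)
```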

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000032 82 jmlr-2007-The Need for Open Source Software in Machine Learning

Author: Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson

Abstract: Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community. Keywords: machine learning, open source, reproducibility, creditability, algorithms, software

2 0.062785223 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen

Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation

3 0.055640657 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

Author: Matthew E. Taylor, Peter Stone, Yaxin Liu

Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping

4 0.039629292 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

5 0.038167119 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

Author: Evgeniy Gabrilovich, Shaul Markovitch

Abstract: Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation. Keywords: feature generation, text classification, background knowledge

6 0.031403981 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

7 0.02540607 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

8 0.023754232 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

9 0.021497566 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

10 0.021085897 91 jmlr-2007-Very Fast Online Learning of Highly Non Linear Problems

11 0.019512165 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"

12 0.019402914 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

13 0.019168949 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction

14 0.019046977 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

15 0.018978039 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

16 0.018473733 47 jmlr-2007-Learning Horn Expressions with LOGAN-H

17 0.017770916 68 jmlr-2007-Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters     (Special Topic on Model Selection)

18 0.01747006 15 jmlr-2007-Bilinear Discriminant Component Analysis

19 0.016849915 74 jmlr-2007-Separating Models of Learning from Correlated and Uncorrelated Data     (Special Topic on the Conference on Learning Theory 2005)

20 0.016324706 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.112), (1, 0.08), (2, -0.007), (3, 0.019), (4, -0.065), (5, 0.057), (6, -0.008), (7, -0.04), (8, -0.026), (9, -0.034), (10, -0.032), (11, -0.121), (12, -0.068), (13, 0.162), (14, 0.014), (15, -0.028), (16, 0.039), (17, -0.062), (18, -0.113), (19, 0.079), (20, 0.174), (21, -0.13), (22, -0.029), (23, -0.008), (24, -0.063), (25, 0.003), (26, 0.041), (27, 0.128), (28, -0.054), (29, 0.056), (30, -0.036), (31, 0.19), (32, -0.311), (33, 0.151), (34, 0.402), (35, -0.259), (36, -0.047), (37, 0.351), (38, -0.045), (39, -0.043), (40, 0.029), (41, -0.056), (42, -0.205), (43, 0.133), (44, -0.031), (45, 0.023), (46, 0.198), (47, -0.134), (48, -0.075), (49, -0.157)]
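The lsi vector above is this paper's weight on each latent topic. A minimal sketch of LSI-style similarity, assuming scikit-learn's TruncatedSVD over tfidf vectors as a stand-in for whatever LSI implementation actually produced these weights:

```python
# A minimal sketch of LSI similarity: project tfidf vectors onto latent
# topics with truncated SVD, then rank documents by cosine similarity
# in topic space (assumes scikit-learn; the corpus is illustrative).
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "open source software licenses for machine learning",
    "publishing machine learning software as open source",
    "variational bayesian learning of latent variable models",
]
tfidf = TfidfVectorizer().fit_transform(docs)
topics = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

sims = cosine_similarity(topics[:1], topics).ravel()
print(sims.argsort()[::-1])  # documents ranked by similarity to doc 0
```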

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99426717 82 jmlr-2007-The Need for Open Source Software in Machine Learning

Author: Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson

Abstract: Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community. Keywords: machine learning, open source, reproducibility, creditability, algorithms, software

2 0.34620398 85 jmlr-2007-Transfer Learning via Inter-Task Mappings for Temporal Difference Learning

Author: Matthew E. Taylor, Peter Stone, Yaxin Liu

Abstract: Temporal difference (TD) learning (Sutton and Barto, 1998) has become a popular reinforcement learning technique in recent years. TD methods, relying on function approximators to generalize learning to novel situations, have had some experimental successes and have been shown to exhibit some desirable properties in theory, but the most basic algorithms have often been found slow in practice. This empirical result has motivated the development of many methods that speed up reinforcement learning by modifying a task for the learner or helping the learner better generalize to novel situations. This article focuses on generalizing across tasks, thereby speeding up learning, via a novel form of transfer using handcoded task relationships. We compare learning on a complex task with three function approximators, a cerebellar model arithmetic computer (CMAC), an artificial neural network (ANN), and a radial basis function (RBF), and empirically demonstrate that directly transferring the action-value function can lead to a dramatic speedup in learning with all three. Using transfer via inter-task mapping (TVITM), agents are able to learn one task and then markedly reduce the time it takes to learn a more complex task. Our algorithms are fully implemented and tested in the RoboCup soccer Keepaway domain. This article contains and extends material published in two conference papers (Taylor and Stone, 2005; Taylor et al., 2005). Keywords: transfer learning, reinforcement learning, temporal difference methods, value function approximation, inter-task mapping

3 0.32523152 17 jmlr-2007-Building Blocks for Variational Bayesian Learning of Latent Variable Models

Author: Tapani Raiko, Harri Valpola, Markus Harva, Juha Karhunen

Abstract: We introduce standardised building blocks designed to be used with variational Bayesian learning. The blocks include Gaussian variables, summation, multiplication, nonlinearity, and delay. A large variety of latent variable models can be constructed from these blocks, including nonlinear and variance models, which are lacking from most existing variational systems. The introduced blocks are designed to fit together and to yield efficient update rules. Practical implementation of various models is easy thanks to an associated software package which derives the learning formulas automatically once a specific model structure has been fixed. Variational Bayesian learning provides a cost function which is used both for updating the variables of the model and for optimising the model structure. All the computations can be carried out locally, resulting in linear computational complexity. We present experimental results on several structures, including a new hierarchical nonlinear model for variances and means. The test results demonstrate the good performance and usefulness of the introduced method. Keywords: latent variable models, variational Bayesian learning, graphical models, building blocks, Bayesian modelling, local computation

4 0.18356872 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

5 0.15234704 40 jmlr-2007-Harnessing the Expertise of 70,000 Human Editors: Knowledge-Based Feature Generation for Text Categorization

Author: Evgeniy Gabrilovich, Shaul Markovitch

Abstract: Most existing methods for text categorization employ induction algorithms that use the words appearing in the training documents as features. While they perform well in many categorization tasks, these methods are inherently limited when faced with more complicated tasks where external knowledge is essential. Recently, there have been efforts to augment these basic features with external knowledge, including semi-supervised learning and transfer learning. In this work, we present a new framework for automatic acquisition of world knowledge and methods for incorporating it into the text categorization process. Our approach enhances machine learning algorithms with features generated from domain-specific and common-sense knowledge. This knowledge is represented by ontologies that contain hundreds of thousands of concepts, further enriched through controlled Web crawling. Prior to text categorization, a feature generator analyzes the documents and maps them onto appropriate ontology concepts that augment the bag of words used in simple supervised learning. Feature generation is accomplished through contextual analysis of document text, thus implicitly performing word sense disambiguation. Coupled with the ability to generalize concepts using the ontology, this approach addresses two significant problems in natural language processing—synonymy and polysemy. Categorizing documents with the aid of knowledge-based features leverages information that cannot be deduced from the training documents alone. We applied our methodology using the Open Directory Project, the largest existing Web directory built by over 70,000 human editors. Experimental results over a range of data sets confirm improved performance compared to the bag of words document representation. Keywords: feature generation, text classification, background knowledge

6 0.14977951 21 jmlr-2007-Comments on the "Core Vector Machines: Fast SVM Training on Very Large Data Sets"

7 0.14423762 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

8 0.093096174 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

9 0.089266896 22 jmlr-2007-Compression-Based Averaging of Selective Naive Bayes Classifiers     (Special Topic on Model Selection)

10 0.088118583 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

11 0.079712585 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

12 0.078915961 47 jmlr-2007-Learning Horn Expressions with LOGAN-H

13 0.077731177 12 jmlr-2007-Attribute-Efficient and Non-adaptive Learning of Parities and DNF Expressions     (Special Topic on the Conference on Learning Theory 2005)

14 0.073403239 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

15 0.070344843 60 jmlr-2007-Nonlinear Estimators and Tail Bounds for Dimension Reduction inl1Using Cauchy Random Projections

16 0.069013692 59 jmlr-2007-Nonlinear Boosting Projections for Ensemble Construction

17 0.068733104 76 jmlr-2007-Spherical-Homoscedastic Distributions: The Equivalency of Spherical and Normal Distributions in Classification

18 0.068438768 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

19 0.066170737 64 jmlr-2007-Online Learning of Multiple Tasks with a Shared Loss

20 0.062932983 11 jmlr-2007-Anytime Learning of Decision Trees


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.013), (10, 0.022), (12, 0.016), (15, 0.021), (22, 0.595), (28, 0.026), (40, 0.027), (45, 0.01), (48, 0.028), (60, 0.023), (80, 0.021), (85, 0.042), (98, 0.075)]
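The lda list above gives (topicId, topicWeight) pairs for this paper. A minimal sketch of how such a topic distribution might be obtained, assuming scikit-learn's LatentDirichletAllocation rather than the page's actual (undocumented) model:

```python
# A minimal sketch of LDA topic weights: fit a topic model on raw term
# counts and read off one document's topic distribution
# (assumes scikit-learn; the corpus and topic count are illustrative).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "open source software license gpl",
    "open source machine learning software",
    "kernel principal component analysis",
]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # each row sums to 1

for topic_id, weight in enumerate(theta[0]):
    print(topic_id, round(float(weight), 3))  # (topicId, topicWeight)
```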

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90873301 82 jmlr-2007-The Need for Open Source Software in Machine Learning

Author: Sören Sonnenburg, Mikio L. Braun, Cheng Soon Ong, Samy Bengio, Leon Bottou, Geoffrey Holmes, Yann LeCun, Klaus-Robert Müller, Fernando Pereira, Carl Edward Rasmussen, Gunnar Rätsch, Bernhard Schölkopf, Alexander Smola, Pascal Vincent, Jason Weston, Robert Williamson

Abstract: Open source tools have recently reached a level of maturity which makes them suitable for building large-scale real-world systems. At the same time, the field of machine learning has developed a large body of powerful learning algorithms for diverse applications. However, the true potential of these methods is not used, since existing implementations are not openly shared, resulting in software with low usability, and weak interoperability. We argue that this situation can be significantly improved by increasing incentives for researchers to publish their software under an open source model. Additionally, we outline the problems authors are faced with when trying to publish algorithmic implementations of machine learning methods. We believe that a resource of peer reviewed software accompanied by short articles would be highly valuable to both the machine learning and the general scientific community. Keywords: machine learning, open source, reproducibility, creditability, algorithms, software

2 0.23762402 5 jmlr-2007-A Nonparametric Statistical Approach to Clustering via Mode Identification

Author: Jia Li, Surajit Ray, Bruce G. Lindsay

Abstract: A new clustering approach based on mode identification is developed by applying new optimization techniques to a nonparametric density estimator. A cluster is formed by those sample points that ascend to the same local maximum (mode) of the density function. The path from a point to its associated mode is efficiently solved by an EM-style algorithm, namely, the Modal EM (MEM). This method is then extended for hierarchical clustering by recursively locating modes of kernel density estimators with increasing bandwidths. Without model fitting, the mode-based clustering yields a density description for every cluster, a major advantage of mixture-model-based clustering. Moreover, it ensures that every cluster corresponds to a bump of the density. The issue of diagnosing clustering results is also investigated. Specifically, a pairwise separability measure for clusters is defined using the ridgeline between the density bumps of two clusters. The ridgeline is solved for by the Ridgeline EM (REM) algorithm, an extension of MEM. Based upon this new measure, a cluster merging procedure is created to enforce strong separation. Experiments on simulated and real data demonstrate that the mode-based clustering approach tends to combine the strengths of linkage and mixture-model-based clustering. In addition, the approach is robust in high dimensions and when clusters deviate substantially from Gaussian distributions. Both of these cases pose difficulty for parametric mixture modeling. A C package on the new algorithms is developed for public access at http://www.stat.psu.edu/~jiali/hmac. Keywords: modal clustering, mode-based clustering, mixture modeling, modal EM, ridgeline EM, nonparametric density

3 0.21748823 32 jmlr-2007-Euclidean Embedding of Co-occurrence Data

Author: Amir Globerson, Gal Chechik, Fernando Pereira, Naftali Tishby

Abstract: Embedding algorithms search for a low dimensional continuous representation of data, but most algorithms only handle objects of a single type for which pairwise distances are specified. This paper describes a method for embedding objects of different types, such as images and text, into a single common Euclidean space, based on their co-occurrence statistics. The joint distributions are modeled as exponentials of Euclidean distances in the low-dimensional embedding space, which links the problem to convex optimization over positive semidefinite matrices. The local structure of the embedding corresponds to the statistical correlations via random walks in the Euclidean space. We quantify the performance of our method on two text data sets, and show that it consistently and significantly outperforms standard methods of statistical correspondence modeling, such as multidimensional scaling, IsoMap and correspondence analysis. Keywords: embedding algorithms, manifold learning, exponential families, multidimensional scaling, matrix factorization, semidefinite programming

4 0.2133964 57 jmlr-2007-Multi-class Protein Classification Using Adaptive Codes

Author: Iain Melvin, Eugene Ie, Jason Weston, William Stafford Noble, Christina Leslie

Abstract: Predicting a protein’s structural class from its amino acid sequence is a fundamental problem in computational biology. Recent machine learning work in this domain has focused on developing new input space representations for protein sequences, that is, string kernels, some of which give state-of-the-art performance for the binary prediction task of discriminating between one class and all the others. However, the underlying protein classification problem is in fact a huge multiclass problem, with over 1000 protein folds and even more structural subcategories organized into a hierarchy. To handle this challenging many-class problem while taking advantage of progress on the binary problem, we introduce an adaptive code approach in the output space of one-vs-the-rest prediction scores. Specifically, we use a ranking perceptron algorithm to learn a weighting of binary classifiers that improves multi-class prediction with respect to a fixed set of output codes. We use a cross-validation set-up to generate output vectors for training, and we define codes that capture information about the protein structural hierarchy. Our code weighting approach significantly improves on the standard one-vs-all method for two difficult multi-class protein classification problems: remote homology detection and fold recognition. Our algorithm also outperforms a previous code learning approach due to Crammer and Singer, trained here using a perceptron, when the dimension of the code vectors is high and the number of classes is large. Finally, we compare against PSI-BLAST, one of the most widely used methods in protein sequence analysis, and find that our method strongly outperforms it on every structure classification problem that we consider. Supplementary data and source code are available at http://www.cs

5 0.20413013 33 jmlr-2007-Fast Iterative Kernel Principal Component Analysis

Author: Simon Günter, Nicol N. Schraudolph, S. V. N. Vishwanathan

Abstract: We develop gain adaptation methods that improve convergence of the kernel Hebbian algorithm (KHA) for iterative kernel PCA (Kim et al., 2005). KHA has a scalar gain parameter which is either held constant or decreased according to a predetermined annealing schedule, leading to slow convergence. We accelerate it by incorporating the reciprocal of the current estimated eigenvalues as part of a gain vector. An additional normalization term then allows us to eliminate a tuning parameter in the annealing schedule. Finally we derive and apply stochastic meta-descent (SMD) gain vector adaptation (Schraudolph, 1999, 2002) in reproducing kernel Hilbert space to further speed up convergence. Experimental results on kernel PCA and spectral clustering of USPS digits, motion capture and image denoising, and image super-resolution tasks confirm that our methods converge substantially faster than conventional KHA. To demonstrate scalability, we perform kernel PCA on the entire MNIST data set. Keywords: step size adaptation, gain vector adaptation, stochastic meta-descent, kernel Hebbian algorithm, online learning

6 0.19472551 19 jmlr-2007-Classification in Networked Data: A Toolkit and a Univariate Case Study

7 0.18755895 84 jmlr-2007-The Pyramid Match Kernel: Efficient Learning with Sets of Features

8 0.18616182 51 jmlr-2007-Loop Corrections for Approximate Inference on Factor Graphs

9 0.18537554 10 jmlr-2007-An Interior-Point Method for Large-Scalel1-Regularized Logistic Regression

10 0.18536234 25 jmlr-2007-Covariate Shift Adaptation by Importance Weighted Cross Validation

11 0.17111982 81 jmlr-2007-The Locally Weighted Bag of Words Framework for Document Representation

12 0.16537592 62 jmlr-2007-On the Effectiveness of Laplacian Normalization for Graph Semi-supervised Learning

13 0.16521493 7 jmlr-2007-A Stochastic Algorithm for Feature Selection in Pattern Recognition

14 0.16334721 66 jmlr-2007-Penalized Model-Based Clustering with Application to Variable Selection

15 0.16299358 49 jmlr-2007-Learning to Classify Ordinal Data: The Data Replication Method

16 0.16298839 24 jmlr-2007-Consistent Feature Selection for Pattern Recognition in Polynomial Time

17 0.16298032 37 jmlr-2007-GiniSupport Vector Machine: Quadratic Entropy Based Robust Multi-Class Probability Regression

18 0.16277146 58 jmlr-2007-Noise Tolerant Variants of the Perceptron Algorithm

19 0.16175407 39 jmlr-2007-Handling Missing Values when Applying Classification Models

20 0.16172403 53 jmlr-2007-Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling