jmlr jmlr2009 jmlr2009-70 knowledge-graph by maker-knowledge-mining

70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)


Source: pdf

Author: Hugo Jair Escalante, Manuel Montes, Luis Enrique Sucar

Abstract: This paper proposes the application of particle swarm optimization (PSO) to the problem of full model selection, FMS, for classification tasks. FMS is defined as follows: given a pool of preprocessing methods, feature selection and learning algorithms, to select the combination of these that obtains the lowest classification error for a given data set; the task also includes the selection of hyperparameters for the considered methods. This problem generates a vast search space to be explored, well suited for stochastic optimization techniques. FMS can be applied to any classification domain as it does not require domain knowledge. Different model types and a variety of algorithms can be considered under this formulation. Furthermore, competitive yet simple models can be obtained with FMS. We adopt PSO for the search because of its proven performance in different problems and because of its simplicity, since neither expensive computations nor complicated operations are needed. Interestingly, the way the search is guided allows PSO to avoid overfitting to some extent. Experimental results on benchmark data sets give evidence that the proposed approach is very effective, despite its simplicity. Furthermore, results obtained in the framework of a model selection challenge show the competitiveness of the models selected with PSO, compared to models selected with other techniques that focus on a single algorithm and that use domain knowledge. Keywords: full model selection, machine learning challenge, particle swarm optimization, experimentation, cross validation

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 FMS is defined as follows: given a pool of preprocessing methods, feature selection and learning algorithms, to select the combination of these that obtains the lowest classification error for a given data set; the task also includes the selection of hyperparameters for the considered methods. [sent-6, score-0.218]

2 Furthermore, results obtained in the framework of a model selection challenge show the competitiveness of the models selected with PSO, compared to models selected with other techniques that focus on a single algorithm and that use domain knowledge. [sent-14, score-0.16]

3 Keywords: full model selection, machine learning challenge, particle swarm optimization, experimentation, cross validation 1. [sent-15, score-0.494]

4 For example, several authors make use of search strategies for the selection of candidate models (Lutz, 2006; Boullé, 2007; Reunanen, 2007; Wichard, 2007); the FMS approach can be adopted for obtaining such candidate models. [sent-37, score-0.151]

5 In this paper, we propose to use particle swarm optimization (PSO) for exploring the full-models search space. [sent-48, score-0.591]

6 Combinatorial and real-valued optimization problems in which the optimization surface possesses many locally optimal solutions are well suited for swarm optimization. [sent-51, score-0.24]

7 The methodological differences between swarm optimization and evolutionary algorithms have been highlighted by several authors (Angeline, 1998; Kennedy and Eberhart, 1995, 2001). [sent-53, score-0.292]

8 PSO is compared to pattern search (PS) in order to evaluate the added value of using the swarm strategy instead of another intensive search method. [sent-63, score-0.453]

9 We consider PS among other search techniques because of its simplicity and proven performance in model selection (Momma and Bennett, 2002; Bi et al. [sent-64, score-0.151]

10 The main contribution of this work is experimental: we provide empirical evidence indicating that by using PSO we were able to perform an intensive search over a huge space and succeeded in selecting competitive models without significantly overfitting. [sent-97, score-0.161]

11 This is due to the way the search is guided in PSO: it performs a broad search around promising solutions without over-refining the search around any single one. [sent-98, score-0.227]

12 Like evolutionary algorithms, PSO is appropriate for problems with immense search spaces that present many local minima. [sent-119, score-0.169]

13 The velocity of particle i at time t is given by vti = <vti,1, vti,2, ..., vti,d>. [sent-130, score-0.282]

14 Here vti,k is the velocity for dimension k of particle i at time t. [sent-133, score-0.282]

15 pg,j is the value in dimension j of the best particle found so far in the swarm (S); pg = <pg,1, ..., pg,d>. [sent-138, score-0.54]

16 Note that through pi and pg each particle i takes into account individual (local) and social (global) information for updating its velocity and position. [sent-142, score-0.328]
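
To make items 13-17 concrete, here is a minimal single-particle update sketch in Python. It uses the standard PSO velocity/position rule with an inertia weight, which is what Equations (1) and (2) referenced in the text are described as implementing; the default c1 = c2 = 2 comes from the Algorithm 1 parameters listed further below, and the function name and array-based interface are assumptions for illustration.

```python
import numpy as np

def pso_step(x, v, p_i, p_g, W, c1=2.0, c2=2.0, rng=np.random.default_rng()):
    """One velocity/position update for a single particle (standard PSO rule,
    assumed to match Equations (1) and (2) referenced in the text).
    x: current position, v: current velocity,
    p_i: individual best position, p_g: global best position, W: inertia weight."""
    r1 = rng.random(len(x))   # individual (local) random factors
    r2 = rng.random(len(x))   # social (global) random factors
    v_new = W * v + c1 * r1 * (p_i - x) + c2 * r2 * (p_g - x)
    x_new = x + v_new
    return x_new, v_new
```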

17 W is the so-called inertia weight, whose goal is to control the impact of the past velocity of a particle on the current one, influencing the local and global exploration abilities of the algorithm. [sent-145, score-0.396]

18 For this work we considered an adaptive inertia weight specified by a triplet W = (wstart, wf, wend), where wstart and wend are the initial and final values for W, respectively, and wf indicates the fraction of iterations in which W is decreased. [sent-147, score-0.221]

19 This setting allows us to explore a large area at the start of the optimization, when W is large, and to slightly refine the search later by using a smaller inertia weight (Shi and Eberhart, 1998, 1999; van den Bergh, 2001). [sent-149, score-0.222]
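
A small sketch of the adaptive inertia schedule described in items 18-19, using the decrement factor wdec = (wstart - wend)/(I x wf) given with Algorithm 1 further below; the concrete triplet values used as defaults here are illustrative assumptions, not the paper's.

```python
def inertia_weight(t, I, w_start=1.2, w_end=0.4, w_f=0.5):
    """Adaptive inertia weight: decrease W linearly from w_start to w_end
    over the first w_f fraction of the I iterations, then hold it constant.
    The triplet values used as defaults here are illustrative assumptions."""
    w_dec = (w_start - w_end) / (I * w_f)   # per-iteration decrement
    return max(w_end, w_start - w_dec * t)
```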

20 The swarm is randomly initialized, considering restrictions on the values that each dimension can take. [sent-159, score-0.24]

21 Next, the goodness of each particle is evaluated and pg, p1,...,m are initialized. [sent-160, score-0.3]

22 Then, the iterative PSO process starts; in each iteration: i) the velocities and positions of each particle in every dimension are updated according to Equations (1) and (2); ii) the goodness of each particle is evaluated; iii) pg and p1,...,m are updated. [sent-164, score-0.554]

23 This process is repeated until either a maximum number of iterations is reached or a minimum fitness value is obtained by a particle in the swarm (we used the first criterion for FMS); eventually, a (locally) optimal solution is found. [sent-168, score-0.537]

24 A fitness function F : Ψ → R, where Ψ is the space of particle positions, should return a scalar fxi for each particle position xi, indicating how far particle i is from the optimal solution to the problem at hand. [sent-171, score-0.58]
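
For FMS, item 27 below fixes the fitness function to the 2-fold cross-validated balanced error rate (BER). A minimal sketch of such a fitness function follows; scikit-learn is an illustrative stand-in for the CLOP/Matlab toolbox actually used in the paper, and the function names are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

def balanced_error_rate(y_true, y_pred):
    """BER for a binary problem: average of the per-class error rates."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return 0.5 * (fp / (fp + tn) + fn / (fn + tp))

def cv_ber_fitness(model, X, y, k=2, seed=0):
    """Fitness of a candidate full model: k-fold cross-validated BER
    (the paper uses k = 2). scikit-learn stands in for the CLOP toolbox
    used in the paper; this is a sketch, not the paper's implementation."""
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        m = clone(model)                       # refit a fresh copy on each fold
        m.fit(X[train_idx], y[train_idx])
        scores.append(balanced_error_rate(y[test_idx], m.predict(X[test_idx])))
    return float(np.mean(scores))
```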

25 Note that in Equation (1) every particle in the swarm knows the best position found so far by any other particle within the swarm, that is, pg. [sent-175, score-0.794]

26 This topology has been shown to converge faster than any other topology (Kennedy and Mendes, 2002; Reyes and Coello, 2006). Algorithm 1: Particle swarm optimization. [sent-177, score-0.24]

27 Require (default recommended values for FMS in brackets): c1, c2: individual/social behavior weights [c1 = c2 = 2]; m: swarm size [m = 5]; I: number of iterations [I = 50]; F(Ψ → R): fitness function [F = 2-fold CV BER]; W: inertia weight triplet (wstart, wf, wend). [sent-178, score-0.24]

28 Set the decrement factor for W, wdec = (wstart - wend)/(I x wf), and initialize the swarm S = {x1, x2, ..., xm}. [sent-181, score-0.268]
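
Putting items 20-23 and the Algorithm 1 parameters of items 27-28 together, here is a self-contained sketch of the PSO loop in Python. The search-space bounds, the inertia-triplet defaults and the decoding of positions into models are assumptions; `evaluate` would, in PSMS, decode a position into a full model and return its 2-fold CV BER.

```python
import numpy as np

def pso_search(evaluate, d, m=5, I=50, c1=2.0, c2=2.0,
               w_start=1.2, w_end=0.4, w_f=0.5, lo=0.0, hi=1.0, seed=0):
    """Minimal PSO loop in the spirit of Algorithm 1. `evaluate` maps a
    d-dimensional position to a scalar fitness. Bounds and inertia defaults
    here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, size=(m, d))          # random swarm initialization
    v = np.zeros((m, d))
    f = np.array([evaluate(p) for p in x])        # initial fitness of every particle
    p_best, f_best = x.copy(), f.copy()           # individual best positions/fitness
    g = int(np.argmin(f_best))
    p_g, f_g = p_best[g].copy(), f_best[g]        # global best so far
    w, w_dec = w_start, (w_start - w_end) / (I * w_f)
    for t in range(I):
        r1, r2 = rng.random((m, d)), rng.random((m, d))
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (p_g - x)
        x = np.clip(x + v, lo, hi)                # respect per-dimension restrictions
        f = np.array([evaluate(p) for p in x])
        better = f < f_best                       # update individual bests
        p_best[better], f_best[better] = x[better], f[better]
        g = int(np.argmin(f_best))                # update global best
        if f_best[g] < f_g:
            p_g, f_g = p_best[g].copy(), f_best[g]
        w = max(w_end, w - w_dec)                 # adaptive inertia weight
    return p_g, f_g
```

Note that this loop evaluates the fitness m times at initialization plus m times per iteration, which matches the ρ = m x (I + 1) evaluations mentioned in item 41.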

29 With this topology, however, the swarm is prone to converge to local minima. [sent-198, score-0.26]

30 As a consequence, the swarm may show divergent behavior at the end of the search process. [sent-227, score-0.337]

31 The results confirm that the selection of m is not crucial, although (contrary to previous results) slightly better models are obtained by using a small swarm size. [sent-234, score-0.294]

32 Particle swarm full model selection (hereafter PSMS, that is, the application of PSO to FMS) is described below. [sent-245, score-0.294]

33 Note that in PSO we say that the swarm converges iff limt→∞ pgt = p, where p is an arbitrary position in the search space and t indexes iterations of PSO. [sent-246, score-0.38]

34 Since p refers to an arbitrary position, this definition does not imply convergence to either a local or a global optimum, but rather convergence to the global best position in the swarm (van den Bergh, 2001; Reyes and Coello, 2006; Engelbrecht, 2006). [sent-247, score-0.291]

35 Specifically, we consider models with at most one feature selection method, but allow preprocessing to be performed before feature selection and vice versa; see Section 3. [sent-280, score-0.209]

36 The search space in FMS is given by all the possible combinations of methods and hyperparameters; an infinite search space due to the real-valued parameters. [sent-283, score-0.194]

37 Npre dimensions codify the hyperparameters for the selected combination of preprocessing methods; Npre = 3 because each preprocessing method has a single hyperparameter. Note that the order of the preprocessing methods is fixed. [sent-314, score-0.244]

38 This numerical codification must be decoded and used with the chain grouping object to obtain a full model from a particle position xi. [sent-332, score-0.254]

39 Note that the dimensionality of each particle is d = 1 + Npre + 1 + Nfs + 1 + 1 + Nclass = 16. [sent-333, score-0.254]
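
A toy sketch of what decoding a real-valued particle position into a full model could look like. In the paper, positions are decoded into CLOP chain objects (item 38) with the layout of item 39 giving d = 16; everything concrete below, including the method names, the slot meanings and the slot counts, is an illustrative assumption rather than the paper's actual codification.

```python
import numpy as np

# Hypothetical method catalogues; the paper uses the CLOP (Matlab) toolbox,
# so these names and the slot interpretation are illustrative assumptions.
FS_METHODS = ["s2n", "relief", "gram_schmidt"]
CLASSIFIERS = ["neural", "svc", "kridge", "naive", "logitboost"]

def slot_to_index(value, n_options):
    """Map a real-valued particle slot to a discrete method index."""
    return int(np.clip(round(value), 0, n_options - 1))

def decode_particle(x, n_pre=3):
    """Toy decoding of a particle position into a symbolic full model:
    one slot selects the preprocessing combination, n_pre slots carry the
    preprocessing hyperparameters, two slots select the feature selection
    method and classifier, and remaining slots carry their hyperparameters."""
    pre_combo = slot_to_index(x[0], 2 ** n_pre)      # which preprocessors are switched on
    pre_hyper = list(x[1:1 + n_pre])                 # one hyperparameter per preprocessor
    fs = FS_METHODS[slot_to_index(x[1 + n_pre], len(FS_METHODS))]
    clf = CLASSIFIERS[slot_to_index(x[2 + n_pre], len(CLASSIFIERS))]
    hyper = list(x[3 + n_pre:])                      # fs/classifier hyperparameters
    return {"preprocessing": pre_combo, "pre_hyper": pre_hyper,
            "feature_selection": fs, "classifier": clf, "hyperparameters": hyper}
```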

40 However, the use of particle swarm optimization (PSO) allows us to harness the complexity of this problem. [sent-365, score-0.494]

41 For a single run of PSMS the fitness function should be evaluated ρ = m × (I + 1) times, with m being the swarm size and I the number of iterations. [sent-373, score-0.24]
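
As a worked example using the recommended defaults from Algorithm 1 (m = 5, I = 50), a single PSMS run performs ρ = 5 x (50 + 1) = 255 fitness evaluations, that is, 255 cross-validated BER estimates.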

42 The goal was to evaluate the advantages of using the swarm strategy over another intensive search method. [sent-412, score-0.356]

43 Among the available search techniques we selected PS because of its simplicity and proven performance in model selection (Dennis and Torczon, 1994; Momma and Bennett, 2002; Bi et al. [sent-413, score-0.183]

44 PS is a direct search method that samples points in the search space in a fixed pattern about the best solution found so far (the center of the pattern). [sent-415, score-0.194]

45 For applying PS to the FMS problem (hereafter PATSMS), the solution qg was initialized in the same way that each particle in the swarm was initialized for PSMS. [sent-433, score-0.542]

46 Require: Ips: number of iterations; F(Ψ → R): fitness function; P: pattern; ∆: search step. Initialize solution qi (i = 1) and compute fqi = F(qi) (Section 3). [sent-438, score-0.183]

47 For each pattern direction pj: set qj = qg + sj, compute fqj = F(qj), and if fqj < fmin update qg (qg = qj, fmin = fqj); after each sweep the step is halved (∆ = ∆/2), and the search returns qg. The same representation, fitness function and restrictions on the values that each dimension can take were considered for both methods; see Section 3. [sent-443, score-0.182]
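
A compact sketch of the pattern search loop described in items 44-47, with the step sj assumed to take the concrete form ∆ times the pattern direction pj; the function name and stopping tolerance are assumptions for illustration.

```python
import numpy as np

def pattern_search(F, q0, pattern, delta, I_ps, min_delta=1e-6):
    """Minimal pattern search sketch (cf. Algorithm 2): sample points around
    the best solution qg in a fixed pattern, keep any improvement, and halve
    the search step after every sweep."""
    q_g = np.asarray(q0, dtype=float)
    f_min = F(q_g)
    for _ in range(I_ps):
        for p_j in pattern:                      # e.g., +/- unit vectors per dimension
            q_j = q_g + delta * np.asarray(p_j, dtype=float)
            f_j = F(q_j)
            if f_j < f_min:                      # keep the improved candidate
                q_g, f_min = q_j, f_j
        delta /= 2.0                             # refine the pattern around q_g
        if delta < min_delta:
            break
    return q_g, f_min
```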

48 We compared the FMS ability of PSMS and PATSMS by evaluating the accuracy of the selected models at the end of the search process; that is, we compared the performance of models pg and qg in Algorithms 1 and 2, respectively. [sent-449, score-0.223]

49 For each trial we fixed the number of iterations for PSMS to I = 100 with a swarm size of m = 10. [sent-451, score-0.283]

50 CV-BER is the average of CV BER obtained by each of the candidate solutions through the search process (averaging performed over all particles and iterations and over 10 replications of each data set). [sent-532, score-0.269]

51 CV-BER reflects the behavior of the search strategies through the search process. [sent-545, score-0.194]

52 PATSMS performs a finer-grained search over a restricted search area around the initial solution qg. [sent-548, score-0.242]

53 PSMS, on the other hand, explores a much larger search space because the search is not only guided by the best solution found so far (pg), but also by the individual best solutions of each particle (p1,...,m). [sent-550, score-0.481]

54 For PSMS we show the performance of each particle with a different color. [sent-556, score-0.254]

55 In Figure 1 we show the performance of the solutions tried through the search by each method for a single replication of the Heart data set (for clarity in the plots we used m = 5 and I = 50 for this experiment). [sent-560, score-0.201]

56 In fact all of the solutions in the final swarm outperformed the solution obtained with PATSMS in test-BER. [sent-568, score-0.273]

57 PSMS, on the other hand, tries a wide variety of classifiers and feature selection methods, exploring a larger search space and being able to escape from local minima. [sent-574, score-0.191]

58 In our experiments we found that for each replication PATSMS used the same classifier and feature selection method for about 95% of the search iterations. [sent-601, score-0.215]

59 The selection of these methods depended on the initial solution instead of the way the search space is explored or the individual performance of the methods. [sent-602, score-0.151]

60 As before, we show the average CV error of the solutions considered during the search as well as the error of the selected model in the test set. [sent-631, score-0.162]

61 We also show the average test set error of all solutions tried through the search test-BER∗ (averaging performed over all particles and all iterations), providing information about the generalization performance of the models considered throughout the search. [sent-632, score-0.229]

62 (Number of iterations, I.) Next we performed experiments varying the number of iterations (I), using 2-fold CV for computing the fitness function and a swarm size of m = 10. [sent-650, score-0.283]

63 (Swarm size, m.) In the next experiment we fixed the number of iterations to I = 50 and varied the swarm size as follows: m = [5, 10, 20, 40, 50]. [sent-737, score-0.283]

64 This time the best result was obtained by using a swarm size of 5; however, there is a statistically significant difference only between m = 5 and m = 50. [sent-739, score-0.24]

65 This is another interesting result because using a small swarm size reduces the number of fitness function evaluations for PSMS and therefore makes it more practical. [sent-741, score-0.24]

66 An interesting result is that by using 50 iterations with any swarm size the test-BER is very close to the CV-BER estimate. [sent-742, score-0.283]

67 We are displaying the test and CV BER values for each particle at every time step. [sent-754, score-0.254]

68 The latter configuration could be a better choice for FMS because this way PSMS does not over-search around any solution; however, local search is also an important component of any search algorithm. [sent-767, score-0.214]

69 In Figure 3 we show the CV and test BER of solutions tried during the search for the Heart data set under the different configurations tried. [sent-768, score-0.157]

70 We ran PSMS for I = 50 iterations with a swarm size of m = 5, using a single ... The settings compared were ID 1: c1 = 2, c2 = 2; ID 2: c1 = 0, c2 = 2; ID 3: c1 = 2, c2 = 0. [sent-784, score-0.283]

71 With c1 = 0 the m = 5 particles converge to a single local minimum, performing a fine-grained search around this solution (red circles). [sent-827, score-0.189]

72 In order to better appreciate the generalization performance for the different configurations, in Figure 6 we plot the CV-BER as a function of test-BER for the run of PSMS with I = 25, we plot each particle with a different color. [sent-834, score-0.254]

73 It is clear from the right plot in Figure 6 that with c2 = 0 each particle is trapped in a different local minimum, performing a fine-grained search over it, which causes PSMS to overfit the data. [sent-837, score-0.371]

74 In each plot each particle is shown with a different color. [sent-848, score-0.254]

75 With the exception of Ada, the selected models included a feature selection method; this result shows the importance of feature selection methods and that some of them are more compatible than others with some classifiers; note that different numbers of features were selected for each data set. [sent-945, score-0.212]

76 Currently, the PSMS run is the top-ranked agnostic entry on the challenge website; furthermore, the performance of this entry is superior to all but one prior-knowledge entry, Interim-all-prior, which is the best-ranked entry overall using “prior knowledge” or “domain knowledge”. [sent-963, score-0.176]

77 However, experimental results indicate that the use of an adaptive inertia weight may be helpful to explore the search space better. [sent-1029, score-0.191]

78 Our hypothesis is that PSMS is able to avoid overfitting because of the way the search is guided: PSMS performs a broad search around promising solutions without focusing on reaching local minima. [sent-1038, score-0.247]

79 Solutions in PSMS are updated by taking into account information from both the global-best solution (pg, weighted by c2) and the individual-best solutions of each particle (p1,...,m, weighted by c1). [sent-1039, score-0.287]

80 The latter, combined with an adaptive inertia weight and early stopping, ensures the search is not exaggerated around any local minimum. [sent-1043, score-0.231]

81 On the contrary, PSMS is able to undercompute because, for updating solutions, it considers local and global knowledge, information from past solutions (weighted by the inertia term), and randomness; note that the latter holds when a reasonably small number of iterations is performed. [sent-1057, score-0.223]

82 For FMS, however, it is very interesting that updating solutions in a heterogeneous way allows us to avoid oversearching and, in consequence, overfitting; even when the FMS search space is infinite and has many local minima solutions. [sent-1062, score-0.178]

83 However, despite the potential advantages of FMS (namely generality, simplicity and competitiveness), this problem has been little studied because of the huge search space involved and because intensive search methods are prone to overfit the data (Gorissen et al. [sent-1069, score-0.213]

84 Grid search (with CV) is a widely used search approach to model selection in practical applications (Momma and Bennett, 2002; Hsu et al. [sent-1074, score-0.151]

85 Other methods already used for parameter optimization that can be applied to FMS include greedy search (Dietterich, 1995) and random search. [sent-1080, score-0.194]

86 Note that although one may think that exhaustive search is the best search option in model selection, this approach is impractical for most real-world problems and, when applicable, it suffers from the oversearching phenomenon (Dietterich, 1995). [sent-1086, score-0.222]

87 A problem with early stopping is that stopping the algorithm prematurely would lead to selecting a solution that has not yet converged, while stopping the search too late will severely overfit the data because of oversearching. [sent-1091, score-0.157]

88 Like the adaptive inertia weight in PSO (see Section 2), there are parameters in other algorithms that aim to avoid overfitting by exploring the search space both globally and locally; examples include the temperature parameter in simulated annealing (Kirkpatrick et al. [sent-1105, score-0.191]

89 Future work on PSMS includes combining particles in order to improve the performance of the swarm strategy. [sent-1110, score-0.312]

90 Besides the selected model, the output of the search (Res, line 4) is a structure with useful information about the search process; see the PSMS documentation (Escalante, in preparation, 2009). [sent-1122, score-0.226]

91 Results obtained by models selected with PSMS in the framework of a model selection challenge show that it is a very competitive model selection method, despite its simplicity and generality. [sent-1145, score-0.202]

92 Evolutionary optimization vs particle swarm optimization: Philosophy and performance differences. [sent-1169, score-0.494]

93 The particle swarm: Explosion, stability and convergence in a multidimensional complex space. [sent-1229, score-0.254]

94 Particle swarm optimization for classifier selection: A practical guide to PSMS. [sent-1257, score-0.24]

95 Comparison of particle swarm optimization and backpropagation as training algorithms for neural networks. [sent-1331, score-0.494]

96 On the use of a population-based particle swarm optimizer to design combinational logic circuits. [sent-1395, score-0.494]

97 Multi-objective particle swarm optimizers: A survey of the state-of-the-art. [sent-1551, score-0.494]

98 Using the particle swarm optimization technique to train a recurrent neural model. [sent-1570, score-0.494]

99 ARMA model selection using particle swarm optimization and AIC criteria. [sent-1628, score-0.548]

100 A particle swarm optimization for reactive power and voltage control considering voltage security assessment. [sent-1678, score-0.494]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('psms', 0.679), ('fms', 0.268), ('pso', 0.264), ('particle', 0.254), ('swarm', 0.24), ('patsms', 0.146), ('tness', 0.118), ('eberhart', 0.103), ('guyon', 0.1), ('clop', 0.099), ('search', 0.097), ('inertia', 0.094), ('ber', 0.092), ('kennedy', 0.085), ('ontes', 0.085), ('scalante', 0.085), ('ucar', 0.085), ('warm', 0.085), ('fmax', 0.075), ('particles', 0.072), ('preprocessing', 0.061), ('cawley', 0.057), ('alvspk', 0.056), ('cv', 0.056), ('selection', 0.054), ('odel', 0.052), ('coello', 0.052), ('reyes', 0.052), ('evolutionary', 0.052), ('guration', 0.05), ('escalante', 0.048), ('qg', 0.048), ('logitboost', 0.047), ('pg', 0.046), ('replication', 0.044), ('agnostic', 0.044), ('ps', 0.043), ('iterations', 0.043), ('bergh', 0.042), ('election', 0.042), ('challenge', 0.042), ('saffari', 0.04), ('subsampling', 0.039), ('nova', 0.039), ('lutz', 0.038), ('hastie', 0.037), ('momma', 0.036), ('gorissen', 0.036), ('wmin', 0.036), ('shi', 0.036), ('gurations', 0.035), ('participants', 0.034), ('fs', 0.034), ('solutions', 0.033), ('engelbrecht', 0.033), ('selected', 0.032), ('talbot', 0.032), ('boull', 0.032), ('den', 0.031), ('tting', 0.031), ('entry', 0.03), ('hyperparameters', 0.029), ('hyperparameter', 0.028), ('nelles', 0.028), ('oversearching', 0.028), ('wend', 0.028), ('wstart', 0.028), ('hiva', 0.028), ('velocity', 0.028), ('tried', 0.027), ('heart', 0.025), ('evidence', 0.025), ('ada', 0.024), ('replications', 0.024), ('drmax', 0.024), ('gkridge', 0.024), ('gudise', 0.024), ('matlabr', 0.024), ('pval', 0.024), ('zarbi', 0.024), ('sf', 0.021), ('competition', 0.021), ('id', 0.021), ('feature', 0.02), ('competitive', 0.02), ('table', 0.02), ('stopping', 0.02), ('gina', 0.02), ('sylva', 0.02), ('local', 0.02), ('intensive', 0.019), ('bennett', 0.019), ('clerc', 0.019), ('fmin', 0.019), ('kirkpatrick', 0.019), ('lssvm', 0.019), ('ozcan', 0.019), ('svc', 0.019), ('venayagamoorthy', 0.019), ('voss', 0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

Author: Hugo Jair Escalante, Manuel Montes, Luis Enrique Sucar

Abstract: This paper proposes the application of particle swarm optimization (PSO) to the problem of full model selection, FMS, for classification tasks. FMS is defined as follows: given a pool of preprocessing methods, feature selection and learning algorithms, to select the combination of these that obtains the lowest classification error for a given data set; the task also includes the selection of hyperparameters for the considered methods. This problem generates a vast search space to be explored, well suited for stochastic optimization techniques. FMS can be applied to any classification domain as it does not require domain knowledge. Different model types and a variety of algorithms can be considered under this formulation. Furthermore, competitive yet simple models can be obtained with FMS. We adopt PSO for the search because of its proven performance in different problems and because of its simplicity, since neither expensive computations nor complicated operations are needed. Interestingly, the way the search is guided allows PSO to avoid overfitting to some extend. Experimental results on benchmark data sets give evidence that the proposed approach is very effective, despite its simplicity. Furthermore, results obtained in the framework of a model selection challenge show the competitiveness of the models selected with PSO, compared to models selected with other techniques that focus on a single algorithm and that use domain knowledge. Keywords: full model selection, machine learning challenge, particle swarm optimization, experimentation, cross validation

2 0.089181729 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling

Author: Dirk Gorissen, Tom Dhaene, Filip De Turck

Abstract: Due to the scale and computational complexity of currently used simulation codes, global surrogate (metamodels) models have become indispensable tools for exploring and understanding the design space. Due to their compact formulation they are cheap to evaluate and thus readily facilitate visualization, design space exploration, rapid prototyping, and sensitivity analysis. They can also be used as accurate building blocks in design packages or larger simulation environments. Consequently, there is great interest in techniques that facilitate the construction of such approximation models while minimizing the computational cost and maximizing model accuracy. Many surrogate model types exist (Support Vector Machines, Kriging, Neural Networks, etc.) but no type is optimal in all circumstances. Nor is there any hard theory available that can help make this choice. In this paper we present an automatic approach to the model type selection problem. We describe an adaptive global surrogate modeling environment with adaptive sampling, driven by speciated evolution. Different model types are evolved cooperatively using a Genetic Algorithm (heterogeneous evolution) and compete to approximate the iteratively selected data. In this way the optimal model type and complexity for a given data set or simulation code can be dynamically determined. Its utility and performance is demonstrated on a number of problems where it outperforms traditional sequential execution of each model type. Keywords: model type selection, genetic algorithms, global surrogate modeling, function approximation, active learning, adaptive sampling

3 0.039647117 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning

Author: Marc Boullé

Abstract: With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, which play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models using a logarithmic smoothing of the posterior distribution. We focus on the complexity of the algorithms and show how they can cope with data sets that are far larger than the available central memory. We finally report results on the Large Scale Learning challenge, where our method obtains state of the art performance within practicable computation time. Keywords: large scale learning, naive Bayes, Bayesianism, model selection, model averaging

4 0.03946317 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination    (Special Topic on Model Selection)

Author: Eugene Tuv, Alexander Borisov, George Runger, Kari Torkkola

Abstract: Predictive models benefit from a compact, non-redundant subset of features that improves interpretability and generalization. Modern data sets are wide, dirty, mixed with both numerical and categorical predictors, and may contain interactive effects that require complex models. This is a challenge for filters, wrappers, and embedded feature selection methods. We describe details of an algorithm using tree-based ensembles to generate a compact subset of non-redundant features. Parallel and serial ensembles of trees are combined into a mixed method that can uncover masking and detect features of secondary effect. Simulated and actual examples illustrate the effectiveness of the approach. Keywords: trees, resampling, importance, masking, residuals

5 0.035581183 50 jmlr-2009-Learning When Concepts Abound

Author: Omid Madani, Michael Connor, Wiley Greiner

Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization

6 0.033095248 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

7 0.025909869 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning

8 0.025634648 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

9 0.024728712 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning

10 0.024165906 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

11 0.023381507 23 jmlr-2009-Discriminative Learning Under Covariate Shift

12 0.020619366 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

13 0.019508684 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

14 0.0191331 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

15 0.018716222 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

16 0.018346263 38 jmlr-2009-Hash Kernels for Structured Data

17 0.018343132 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training

18 0.01823529 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization

19 0.018105615 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

20 0.01774364 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.102), (1, -0.05), (2, 0.055), (3, -0.02), (4, 0.028), (5, -0.045), (6, 0.017), (7, 0.122), (8, 0.051), (9, 0.032), (10, 0.049), (11, -0.019), (12, 0.077), (13, -0.065), (14, -0.069), (15, -0.037), (16, -0.112), (17, 0.035), (18, -0.264), (19, 0.232), (20, 0.083), (21, 0.051), (22, -0.09), (23, -0.109), (24, -0.042), (25, 0.076), (26, 0.117), (27, -0.259), (28, 0.084), (29, 0.018), (30, -0.066), (31, -0.106), (32, -0.273), (33, -0.106), (34, -0.051), (35, 0.133), (36, -0.224), (37, 0.124), (38, -0.043), (39, -0.044), (40, -0.003), (41, 0.04), (42, 0.027), (43, -0.188), (44, -0.031), (45, -0.336), (46, -0.048), (47, -0.05), (48, 0.143), (49, -0.168)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94161177 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

Author: Hugo Jair Escalante, Manuel Montes, Luis Enrique Sucar

Abstract: This paper proposes the application of particle swarm optimization (PSO) to the problem of full model selection, FMS, for classification tasks. FMS is defined as follows: given a pool of preprocessing methods, feature selection and learning algorithms, to select the combination of these that obtains the lowest classification error for a given data set; the task also includes the selection of hyperparameters for the considered methods. This problem generates a vast search space to be explored, well suited for stochastic optimization techniques. FMS can be applied to any classification domain as it does not require domain knowledge. Different model types and a variety of algorithms can be considered under this formulation. Furthermore, competitive yet simple models can be obtained with FMS. We adopt PSO for the search because of its proven performance in different problems and because of its simplicity, since neither expensive computations nor complicated operations are needed. Interestingly, the way the search is guided allows PSO to avoid overfitting to some extend. Experimental results on benchmark data sets give evidence that the proposed approach is very effective, despite its simplicity. Furthermore, results obtained in the framework of a model selection challenge show the competitiveness of the models selected with PSO, compared to models selected with other techniques that focus on a single algorithm and that use domain knowledge. Keywords: full model selection, machine learning challenge, particle swarm optimization, experimentation, cross validation

2 0.55561173 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling

Author: Dirk Gorissen, Tom Dhaene, Filip De Turck

Abstract: Due to the scale and computational complexity of currently used simulation codes, global surrogate (metamodels) models have become indispensable tools for exploring and understanding the design space. Due to their compact formulation they are cheap to evaluate and thus readily facilitate visualization, design space exploration, rapid prototyping, and sensitivity analysis. They can also be used as accurate building blocks in design packages or larger simulation environments. Consequently, there is great interest in techniques that facilitate the construction of such approximation models while minimizing the computational cost and maximizing model accuracy. Many surrogate model types exist (Support Vector Machines, Kriging, Neural Networks, etc.) but no type is optimal in all circumstances. Nor is there any hard theory available that can help make this choice. In this paper we present an automatic approach to the model type selection problem. We describe an adaptive global surrogate modeling environment with adaptive sampling, driven by speciated evolution. Different model types are evolved cooperatively using a Genetic Algorithm (heterogeneous evolution) and compete to approximate the iteratively selected data. In this way the optimal model type and complexity for a given data set or simulation code can be dynamically determined. Its utility and performance is demonstrated on a number of problems where it outperforms traditional sequential execution of each model type. Keywords: model type selection, genetic algorithms, global surrogate modeling, function approximation, active learning, adaptive sampling

3 0.24117894 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning

Author: Marc Boullé

Abstract: With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, which play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models using a logarithmic smoothing of the posterior distribution. We focus on the complexity of the algorithms and show how they can cope with data sets that are far larger than the available central memory. We finally report results on the Large Scale Learning challenge, where our method obtains state of the art performance within practicable computation time. Keywords: large scale learning, naive Bayes, Bayesianism, model selection, model averaging

4 0.19610089 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

Author: Troy Raeder, Nitesh V. Chawla

Abstract: This paper presents Model Monitor (M 2 ), a Java toolkit for robustly evaluating machine learning algorithms in the presence of changing data distributions. M 2 provides a simple and intuitive framework in which users can evaluate classifiers under hypothesized shifts in distribution and therefore determine the best model (or models) for their data under a number of potential scenarios. Additionally, M 2 is fully integrated with the WEKA machine learning environment, so that a variety of commodity classifiers can be used if desired. Keywords: machine learning, open-source software, distribution shift, scenario analysis

5 0.17796178 64 jmlr-2009-On The Power of Membership Queries in Agnostic Learning

Author: Vitaly Feldman

Abstract: We study the properties of the agnostic learning framework of Haussler (1992) and Kearns, Schapire, and Sellie (1994). In particular, we address the question: is there any situation in which membership queries are useful in agnostic learning? Our results show that the answer is negative for distribution-independent agnostic learning and positive for agnostic learning with respect to a specific marginal distribution. Namely, we give a simple proof that any concept class learnable agnostically by a distribution-independent algorithm with access to membership queries is also learnable agnostically without membership queries. This resolves an open problem posed by Kearns et al. (1994). For agnostic learning with respect to the uniform distribution over {0, 1}n we show a concept class that is learnable with membership queries but computationally hard to learn from random examples alone (assuming that one-way functions exist). Keywords: agnostic learning, membership query, separation, PAC learning

6 0.14506617 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination    (Special Topic on Model Selection)

7 0.133544 47 jmlr-2009-Learning Linear Ranking Functions for Beam Search with Application to Planning

8 0.13215932 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

9 0.13067532 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

10 0.1251536 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

11 0.12112147 50 jmlr-2009-Learning When Concepts Abound

12 0.1198831 37 jmlr-2009-Generalization Bounds for Ranking Algorithms via Algorithmic Stability

13 0.11713175 15 jmlr-2009-Cautious Collective Classification

14 0.11705022 67 jmlr-2009-Online Learning with Sample Path Constraints

15 0.11636905 49 jmlr-2009-Learning Permutations with Exponential Weights

16 0.11348155 76 jmlr-2009-Python Environment for Bayesian Learning: Inferring the Structure of Bayesian Networks from Knowledge and Data    (Machine Learning Open Source Software Paper)

17 0.1106708 11 jmlr-2009-Bayesian Network Structure Learning by Recursive Autonomy Identification

18 0.10621992 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

19 0.10505737 28 jmlr-2009-Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks

20 0.10089399 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.015), (8, 0.023), (11, 0.025), (19, 0.01), (26, 0.058), (38, 0.032), (47, 0.02), (52, 0.046), (55, 0.025), (58, 0.032), (66, 0.069), (68, 0.026), (82, 0.39), (86, 0.032), (90, 0.05), (96, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.74323332 61 jmlr-2009-Nonextensive Information Theoretic Kernels on Measures

Author: André F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar, Mário A. T. Figueiredo

Abstract: Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon’s) mutual information and the JensenShannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the the two building blocks of the classical JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters. Keywords: positive definite kernels, nonextensive information theory, Tsallis entropy, JensenShannon divergence, string kernels ∗. Earlier versions of this work appeared in Martins et al. (2008a) and Martins et al. (2008b). †. Also at Instituto de Telecomunicacoes, Instituto Superior T´ cnico, Lisboa, Portugal. ¸˜ e c 2009 Andr´ F. T. Martins, Noah A. Smith, Eric P. Xing, Pedro M. Q. Aguiar and M´ rio A. T. Figueiredo. e a M ARTINS , S MITH , X ING , AGUIAR AND F IGUEIREDO

same-paper 2 0.67181087 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

Author: Hugo Jair Escalante, Manuel Montes, Luis Enrique Sucar

Abstract: This paper proposes the application of particle swarm optimization (PSO) to the problem of full model selection, FMS, for classification tasks. FMS is defined as follows: given a pool of preprocessing methods, feature selection and learning algorithms, to select the combination of these that obtains the lowest classification error for a given data set; the task also includes the selection of hyperparameters for the considered methods. This problem generates a vast search space to be explored, well suited for stochastic optimization techniques. FMS can be applied to any classification domain as it does not require domain knowledge. Different model types and a variety of algorithms can be considered under this formulation. Furthermore, competitive yet simple models can be obtained with FMS. We adopt PSO for the search because of its proven performance in different problems and because of its simplicity, since neither expensive computations nor complicated operations are needed. Interestingly, the way the search is guided allows PSO to avoid overfitting to some extend. Experimental results on benchmark data sets give evidence that the proposed approach is very effective, despite its simplicity. Furthermore, results obtained in the framework of a model selection challenge show the competitiveness of the models selected with PSO, compared to models selected with other techniques that focus on a single algorithm and that use domain knowledge. Keywords: full model selection, machine learning challenge, particle swarm optimization, experimentation, cross validation

3 0.30210236 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling

Author: Dirk Gorissen, Tom Dhaene, Filip De Turck

Abstract: Due to the scale and computational complexity of currently used simulation codes, global surrogate (metamodels) models have become indispensable tools for exploring and understanding the design space. Due to their compact formulation they are cheap to evaluate and thus readily facilitate visualization, design space exploration, rapid prototyping, and sensitivity analysis. They can also be used as accurate building blocks in design packages or larger simulation environments. Consequently, there is great interest in techniques that facilitate the construction of such approximation models while minimizing the computational cost and maximizing model accuracy. Many surrogate model types exist (Support Vector Machines, Kriging, Neural Networks, etc.) but no type is optimal in all circumstances. Nor is there any hard theory available that can help make this choice. In this paper we present an automatic approach to the model type selection problem. We describe an adaptive global surrogate modeling environment with adaptive sampling, driven by speciated evolution. Different model types are evolved cooperatively using a Genetic Algorithm (heterogeneous evolution) and compete to approximate the iteratively selected data. In this way the optimal model type and complexity for a given data set or simulation code can be dynamically determined. Its utility and performance is demonstrated on a number of problems where it outperforms traditional sequential execution of each model type. Keywords: model type selection, genetic algorithms, global surrogate modeling, function approximation, active learning, adaptive sampling

4 0.29273182 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning

Author: Marc Boullé

Abstract: With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, which play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models using a logarithmic smoothing of the posterior distribution. We focus on the complexity of the algorithms and show how they can cope with data sets that are far larger than the available central memory. We finally report results on the Large Scale Learning challenge, where our method obtains state of the art performance within practicable computation time. Keywords: large scale learning, naive Bayes, Bayesianism, model selection, model averaging

5 0.28404739 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination    (Special Topic on Model Selection)

Author: Eugene Tuv, Alexander Borisov, George Runger, Kari Torkkola

Abstract: Predictive models benefit from a compact, non-redundant subset of features that improves interpretability and generalization. Modern data sets are wide, dirty, mixed with both numerical and categorical predictors, and may contain interactive effects that require complex models. This is a challenge for filters, wrappers, and embedded feature selection methods. We describe details of an algorithm using tree-based ensembles to generate a compact subset of non-redundant features. Parallel and serial ensembles of trees are combined into a mixed method that can uncover masking and detect features of secondary effect. Simulated and actual examples illustrate the effectiveness of the approach. Keywords: trees, resampling, importance, masking, residuals

6 0.28261441 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning

7 0.2820012 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

8 0.28064024 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

9 0.27761674 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

10 0.27679789 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

11 0.27448028 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

12 0.27325594 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

13 0.27282831 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

14 0.27030605 38 jmlr-2009-Hash Kernels for Structured Data

15 0.26863462 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

16 0.26860759 48 jmlr-2009-Learning Nondeterministic Classifiers

17 0.2667821 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

18 0.26463005 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

19 0.26458713 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

20 0.26434284 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification