jmlr jmlr2009 jmlr2009-3 knowledge-graph by maker-knowledge-mining

3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning


Source: pdf

Author: Marc Boullé

Abstract: With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, who play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models using a logarithmic smoothing of the posterior distribution. We focus on the complexity of the algorithms and show how they can cope with data sets that are far larger than the available central memory. We finally report results on the Large Scale Learning challenge, where our method obtains state of the art performance within practicable computation time. Keywords: large scale learning, naive Bayes, Bayesianism, model selection, model averaging

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, who play a key role in data preparation and model selection. [sent-4, score-0.15]

2 The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models using a logarithmic smoothing of the posterior distribution. [sent-6, score-0.286]

3 Keywords: large scale learning, naive Bayes, Bayesianism, model selection, model averaging [sent-9, score-0.201]

4 Practical data mining projects involve a large variety of constraints, like scalable training algorithms, understandable models, easy and fast deployment. [sent-15, score-0.176]

5 The most limiting factor which slows down the spread of data mining solutions is the data preparation phase, which consumes 80% of the process (Pyle, 1999; Mamdouh, 2006) and requires skilled data analysts. [sent-18, score-0.15]

6 In this paper, we present a method (Boullé, 2007) which aims at automatizing the data preparation and modeling phases of a data mining project, and which performs well on a large variety of problems. [sent-19, score-0.217]

7 A Parameter-Free Classifier. Our method, introduced in Boullé (2007), extends the naive Bayes classifier owing to an optimal estimation of the class conditional probabilities, a Bayesian variable selection and a compression-based model averaging. [sent-33, score-0.221]

8 Experiments demonstrate that even a simple equal-width discretization brings superior performance compared to assuming a Gaussian distribution. [sent-41, score-0.212]
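
To make the equal-width baseline concrete, here is a minimal Python sketch under assumptions not taken from the paper (the helper name equal_width_cpt, ten bins and the Laplace smoothing constant eps are illustrative choices) of estimating class-conditional probabilities from an equal-width discretization instead of a fitted Gaussian:

```python
import numpy as np

def equal_width_cpt(x, y, n_bins=10, n_classes=2, eps=1.0):
    """Class-conditional probabilities from an equal-width discretization of x.

    Returns the bin edges and an (n_bins, n_classes) table of Laplace-smoothed
    estimates of P(bin | class), used in place of a fitted Gaussian density.
    """
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    counts = np.zeros((n_bins, n_classes))
    for b, c in zip(bins, y):
        counts[b, c] += 1
    cpt = (counts + eps) / (counts.sum(axis=0, keepdims=True) + eps * n_bins)
    return edges, cpt
```

At prediction time, a value is mapped to its bin and the per-class column of the table enters the naive Bayes product, which is the role of a univariate conditional density estimator.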

9 In the MODL approach (Boullé, 2006), the discretization is turned into a model selection problem and solved in a Bayesian way. [sent-42, score-0.259]

10 The parameters of a specific discretization are the number of intervals, the bounds of the intervals and the output frequencies in each interval. [sent-44, score-0.283]

11 This prior exploits the hierarchy of the parameters: the number of intervals is first chosen, then the bounds of the intervals and finally the output frequencies. [sent-46, score-0.207]

12 A Bayesian approach is applied to select the best discretization model, which is found by maximizing the probability p(Model|Data) of the model given the data. [sent-49, score-0.212]

13 Owing to the definition of the model space and its prior distribution, the Bayes formula is applicable to derive an exact analytical criterion to evaluate the posterior probability of a discretization model. [sent-50, score-0.242]
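
The exact analytical criterion is derived in Boullé (2006); the sketch below only mirrors the prior hierarchy described above (number of intervals, interval bounds, per-interval class frequencies) followed by a multinomial likelihood term, and should be read as an illustrative approximation rather than the paper's closed form. Lower cost corresponds to a more probable discretization.

```python
from math import lgamma, log

def log_binom(n, k):
    # log of the binomial coefficient C(n, k)
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def discretization_cost(interval_class_counts, n_instances):
    """Cost (negative log of prior times likelihood) of one discretization model.

    interval_class_counts: one list of class counts per interval.
    The first three terms mirror the prior hierarchy described above; the last
    term is the multinomial log-likelihood of the labels given the model.
    Illustrative sketch only; the exact MODL terms are in Boullé (2006).
    """
    I = len(interval_class_counts)       # number of intervals
    J = len(interval_class_counts[0])    # number of classes
    N = n_instances
    cost = log(N)                        # choice of the number of intervals
    cost += log_binom(N + I - 1, I - 1)  # choice of the interval bounds
    for counts in interval_class_counts:
        n_i = sum(counts)
        cost += log_binom(n_i + J - 1, J - 1)                          # class frequencies
        cost += lgamma(n_i + 1) - sum(lgamma(c + 1) for c in counts)   # likelihood
    return cost
```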

14 Efficient search heuristics allow finding the most probable discretization given the data sample. [sent-51, score-0.212]

15 In order to better deal with highly correlated variables, the selective naive Bayes approach (Langley and Sage, 1994) exploits a wrapper approach (Kohavi and John, 1997) to select the subset of variables which optimizes the classification accuracy. [sent-56, score-0.254]

16 Although the selective naive Bayes approach performs quite well on data sets with a reasonable number of variables, it does not scale to very large data sets with hundreds of thousands of instances and thousands of variables, such as in marketing applications or text mining. [sent-57, score-0.186]

17 The parameters of a variable selection model are the number of selected variables and the subset of variables. [sent-60, score-0.144]

18 In the case of the selective naive Bayes classifier, an inspection of the optimized models reveals that their posterior distribution is so sharply peaked that averaging them according to the BMA approach almost reduces to the maximum a posteriori (MAP) model. [sent-70, score-0.253]

19 Extensive experiments demonstrate that the resulting compression-based model averaging scheme clearly outperforms the Bayesian model averaging scheme. [sent-73, score-0.16]

20 Complexity Analysis. In this section, we first recall the algorithmic complexity of the algorithms detailed in Boullé (2005, 2006, 2007) in the case where all the data fits in central memory, then introduce the extension of the algorithms when the data exceeds the central memory. [sent-75, score-0.218]

21 When Data Fits into Central Memory. The algorithm consists of three phases: data preprocessing using discretization or value grouping, variable selection, and model averaging. [sent-77, score-0.436]

22 PREPROCESSING. In the case of numerical variables, the discretization Algorithm 1 exploits a greedy heuristic. [sent-80, score-0.277]

23 This merge is performed if the cost (evaluation) of the discretization decreases after the merge and the process is reiterated until no further merge can decrease the discretization cost. [sent-82, score-0.567]

24 However, the method can be optimized to run in O(N log N) time owing to the additivity of the discretization cost with respect to the intervals and by using a maintained sorted list of the evaluated merges (for example, an AVL binary search tree). [sent-84, score-0.384]
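
A minimal sketch of such a greedy bottom-up merge, assuming an additive per-interval cost function and using a binary heap with lazy invalidation in place of the maintained sorted list mentioned above; the variation of the global prior terms with the number of intervals is ignored here for brevity, so this is not the paper's Algorithm 1:

```python
import heapq

def greedy_merge(initial_counts, interval_cost):
    """Greedy bottom-up merging of adjacent intervals.

    initial_counts: per-interval class-count lists (e.g., one per distinct value).
    interval_cost(counts): additive contribution of one interval to the total cost.
    A merge of two neighbours is applied whenever it decreases the cost; candidate
    merges live in a heap with lazy invalidation, so each step costs O(log N).
    """
    n = len(initial_counts)
    counts = [list(c) for c in initial_counts]
    alive = [True] * n
    nxt = list(range(1, n + 1))          # index of the right neighbour (n = none)
    prv = list(range(-1, n - 1))         # index of the left neighbour (-1 = none)
    version = [0] * n                    # bumped whenever an interval changes
    heap = []

    def push(i):
        j = nxt[i]
        if j < n:
            merged = [a + b for a, b in zip(counts[i], counts[j])]
            delta = interval_cost(merged) - interval_cost(counts[i]) - interval_cost(counts[j])
            heapq.heappush(heap, (delta, version[i], version[j], i, j))

    for i in range(n - 1):
        push(i)

    while heap:
        delta, vi, vj, i, j = heapq.heappop(heap)
        if delta >= 0:
            break                        # no remaining merge decreases the cost
        if not (alive[i] and alive[j]) or vi != version[i] or vj != version[j]:
            continue                     # stale candidate, skip it
        counts[i] = [a + b for a, b in zip(counts[i], counts[j])]
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] < n:
            prv[nxt[j]] = i
        version[i] += 1
        if prv[i] >= 0:
            push(prv[i])                 # re-evaluate the merge with the left neighbour
        push(i)                          # and with the new right neighbour

    return [counts[i] for i in range(n) if alive[i]]
```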

25 Overall, the preprocessing phase is super-linear in time and requires O(KN log N) time, where K is the number of variables and N the number of instances. [sent-86, score-0.252]

26 Evaluating one variable selection for the naive Bayes classifier requires O(KN) time, since the conditional probabilities, which are linear in the number of variables, require O(K) time per instance. [sent-90, score-0.203]

27 Updating the evaluation of a variable subset by adding or dropping one single variable requires only O(1) time per instance and thus O(N) time for all the instances. [sent-91, score-0.156]

28 Therefore, one loop of fast forward or backward variable selection, which involves O(K) adds or drops of variables, can be evaluated in O(KN) time. [sent-92, score-0.161]

29 In order to bound the time complexity, each succession of fast forward and backward variable selection is repeated only twice (MaxIter = 2 in Algorithm 2), which is sufficient in practice to be close to convergence within a small computation time. [sent-93, score-0.235]

30 The number of repeats of the outer loop is fixed to log2 N + log2 K, so that the overall time complexity of the variable selection algorithm is O(KN(log K + log N)), which is comparable to that of the preprocessing phase. [sent-94, score-0.28]
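
A sketch of the add/drop mechanics under stated assumptions: recoded integer inputs, per-variable log conditional probability tables from the preprocessing step, and a caller-supplied selection criterion to minimize. The function and parameter names are hypothetical, and the outer loop of log2 N + log2 K random reorderings is left out:

```python
import numpy as np

def fast_forward_backward(recoded, y, log_cpt, log_prior, criterion, rng, max_iter=2):
    """One outer step of fast forward/backward selection for naive Bayes.

    recoded: (N, K) integer indexes produced by preprocessing.
    log_cpt[k]: (values_k, J) table of log P(index | class) for variable k.
    log_prior: (J,) log class priors; criterion(scores, y): cost to minimize.
    The per-instance unnormalized log posteriors are maintained incrementally, so
    adding or dropping a variable costs O(N), i.e., O(1) per instance.
    """
    N, K = recoded.shape
    selected = np.zeros(K, dtype=bool)
    scores = np.tile(log_prior, (N, 1))           # empty selection to start with
    best = criterion(scores, y)
    for _ in range(max_iter):
        for add in (True, False):                 # forward pass, then backward pass
            for k in rng.permutation(K):
                if selected[k] == add:
                    continue                      # already in the desired state
                delta = log_cpt[k][recoded[:, k]]
                scores += delta if add else -delta
                cost = criterion(scores, y)
                if cost < best:
                    best, selected[k] = cost, add # keep the add / drop
                else:
                    scores -= delta if add else -delta  # undo the change
    return selected, best
```

The key point is that scores is never recomputed from scratch: each add or drop touches one table lookup per instance, which is the O(1)-per-instance update discussed above.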

31 MODEL AVERAGING. The model averaging algorithm consists of collecting all the models evaluated in the variable selection phase and averaging them according to a logarithmic smoothing of their posterior probability, with no overhead on the time complexity. [sent-100, score-0.471]
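
A hedged sketch of the averaging step: the weighting below interprets the logarithmic smoothing as a compression-rate weight relative to the null (no-variable) model, which is one possible reading of Boullé (2007) and an assumption here, not a verbatim reproduction of the paper's scheme:

```python
import numpy as np

def compression_weighted_average(model_costs, null_cost, model_probas):
    """Average the class probabilities of all models met during selection.

    model_costs[m]: cost (negative log posterior) of model m; null_cost: cost of
    the empty model; model_probas[m]: (N, J) class probabilities of model m.
    Weights exp(-cost) would collapse onto the MAP model; here each model is
    instead weighted by how much its cost improves on the null model.
    """
    costs = np.asarray(model_costs, dtype=float)
    weights = np.maximum(0.0, 1.0 - costs / null_cost)  # smoothed, log-scale weights
    if weights.sum() == 0.0:
        weights = np.ones_like(weights)                 # degenerate case: uniform
    weights /= weights.sum()
    stacked = np.stack(model_probas)                    # (M, N, J)
    return np.tensordot(weights, stacked, axes=1)       # (N, J) averaged predictions
```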

32 Overall, the preprocessing, variable selection and model averaging have a time complexity of O(KN(log K + log N)) and a space complexity of O(KN). [sent-101, score-0.205]

33 With the O(KN) space complexity, large data sets cannot fit into central memory, and training time becomes impracticable as soon as memory pages have to be swapped between disk and central memory. [sent-104, score-0.733]

34 To avoid this strong limitation, we enhance our algorithm with a specially designed chunking strategy. [sent-105, score-0.155]

35 CHUNKING STRATEGY. First of all, let us consider the access time from central memory (tm) and from sequential read (tr) or random disk access (ta). [sent-108, score-0.634]

36 Sequential read from disk is so fast (based on 100 Mb/s transfer rates) that tr is of the same order as tm: the CPU is sometimes the limiting factor when parsing operations are involved. [sent-110, score-0.441]

37 Random disk access ta is in the order of 10 milliseconds, one million times slower than tm or tr . [sent-111, score-0.276]

38 Therefore, the only way to manage very large amounts of memory space is to exploit disk in a sequential manner. [sent-112, score-0.406]
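
As a rough illustration, assuming the orders of magnitude quoted above (about 10 ms per random access and a sequential throughput on the order of 100 MB/s) and, say, 10^8 stored 4-byte values:

```latex
t_{\text{random}} \approx 10^{8} \times 10^{-2}\,\text{s} = 10^{6}\,\text{s} \approx 11.6~\text{days},
\qquad
t_{\text{sequential}} \approx \frac{10^{8} \times 4~\text{bytes}}{100~\text{MB/s}} = 4~\text{s}.
```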

39 Let S=O(KN) be the size of the data set and M the size of the central memory. [sent-113, score-0.169]

40 Our chunking strategy consists in minimizing the number of exchanges between disk and central memory. [sent-114, score-0.501]

41 The overhead of the scalable version of the algorithms will be expressed in terms of sequential disk read time tr . [sent-115, score-0.602]

42 PREPROCESSING. For the preprocessing phase of our method, each variable is analyzed once after being loaded into central memory. [sent-118, score-0.45]

43 The discretization Algorithm 1 extensively accesses the values of the analyzed variable, both in the initialization step (values are sorted) and in the optimization step. [sent-119, score-0.242]

44 We partition the set of input variables into C = ⌈S/M⌉ chunks of K1, . . . , KC variables, such that each chunk can be loaded into memory (Kc N < M, 1 ≤ c ≤ C). [sent-120, score-0.258] [sent-123, score-0.696]

46 The preprocessing phase loops over the C subsets and, at each step of the loop, reads the data set but parses and loads only the chunk variables, as shown in Figure 1. [sent-124, score-0.74]

47 Each memory chunk is preprocessed as in Algorithm 1, in order to induce conditional probability tables (CPT) for the chunk variables. [sent-125, score-1.13]

48 After the preprocessing, the chunk variables are recoded by replacing the input values by their index in the related CPT. [sent-126, score-0.925]

49 The recoded chunks of variables are stored in C files of integer indexes, as shown in Figure 2. [sent-127, score-0.895]

50 These recoded chunks are very fast to load, since the chunk variables only need to be read, without any parsing (integer indexes are directly transferred from disk to central memory). [sent-128, score-1.462]
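
A sketch of this chunked preprocessing and recoding loop under stated assumptions: a comma-separated input file, a build_cpt helper standing in for the univariate preprocessing of one variable, and an illustrative file layout for the recoded chunks (none of these names come from the paper):

```python
import math
import numpy as np

def preprocess_in_chunks(csv_path, n_vars, n_rows, bytes_per_value,
                         memory_budget, build_cpt, recode_dir):
    """Chunked preprocessing: parse one chunk of variables per pass over the file,
    build its conditional probability tables, and store the recoded columns.

    build_cpt(column) stands in for the univariate preprocessing of one variable
    and must return (cpt, index_of), where index_of maps raw values to CPT indexes.
    """
    data_size = n_vars * n_rows * bytes_per_value
    n_chunks = math.ceil(data_size / memory_budget)       # C = ceil(S / M)
    vars_per_chunk = math.ceil(n_vars / n_chunks)
    cpts = [None] * n_vars

    for c in range(n_chunks):
        cols = list(range(c * vars_per_chunk, min((c + 1) * vars_per_chunk, n_vars)))
        # Sequential pass over the whole file; only the chunk columns are parsed and kept.
        chunk = np.loadtxt(csv_path, delimiter=",", usecols=cols).reshape(n_rows, len(cols))
        recoded = np.empty(chunk.shape, dtype=np.int32)
        for j, k in enumerate(cols):
            cpts[k], index_of = build_cpt(chunk[:, j])
            recoded[:, j] = index_of(chunk[:, j])         # replace values by CPT indexes
        # Raw integer file: reloading later needs no parsing at all.
        recoded.tofile(f"{recode_dir}/chunk_{c}.i32")
    return cpts, n_chunks
```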

51 The scalable version of the preprocessing algorithm is summarized in Algorithm 3. [sent-129, score-0.2]

52 Each input variable is read C times, but parsed and loaded only once as in the standard case. [sent-130, score-0.288]

53 The overhead in terms of disk read time is O((C − 1)KNtr ) time for the load steps. [sent-131, score-0.565]

54 In the recoding step, each variable is written once on disk, which involves an overhead of O(KNtr ) (disk sequential write time is similar to disk sequential read time). [sent-132, score-0.659]

55 This could be improved by first splitting the input data file into C initial disk chunks (without parsing), with an overall overhead of only O(3KNtr ) time (read and write all the input data before preprocessing, and write recoded chunks after preprocessing). [sent-134, score-1.154]

56 However, this involves extra disk space, and the benefit is small in practice, since even C read steps are not expensive compared to the rest of the algorithm. [sent-135, score-0.39]

57 On 32-bit CPUs, central memory is physically limited to 4 GB, or even to 2 GB on Windows PCs. [sent-137, score-0.209]

58 Figure 1: Preprocessing: the input file on disk is read several times, with one chunk of variables parsed and loaded into central memory one at a time. [sent-292, score-1.344]

59 [Figure 2 residue: recoded chunk 1, with integer index columns RVar1, RVar2, RVar3 replacing the original input values of variables such as Var8.] [sent-391, score-0.515]

60 VARIABLE SELECTION. In the variable selection phase, the algorithm performs O(log K + log N) steps of fast forward/backward variable selection, based on random reorderings of the variables. [sent-408, score-0.305]

61 In the scalable version of the variable selection phase, we exploit the recoded chunks of variables as shown in Figure 3: one single chunk is loaded in memory at a time. [sent-410, score-1.464]

62 Instead of randomly reordering all the variables, we first randomly reorder the list of recoded chunks, then randomly reorder the subset of variables in the recoded chunk currently loaded in memory. [sent-411, score-1.44]

63 Each step of fast forward/backward variable selection involves reading each recoded chunk O(1) times (exactly 2 ∗ MaxIter times, with MaxIter = 2), and thus implies an overhead of O(KNtr) read time. [sent-413, score-1.286]
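
A sketch of one such selection step over recoded chunks, assuming the file layout of the preprocessing sketch above; the double-stage reordering (shuffle the chunk order, then shuffle the variables inside the loaded chunk) is the part taken from the text, while the helper names are assumptions:

```python
import numpy as np

def scalable_selection_step(chunk_files, chunk_var_ids, add_or_drop_pass, rng):
    """One fast forward (or backward) selection step over recoded chunks.

    chunk_files[c]: raw integer file written by the preprocessing sketch;
    chunk_var_ids[c]: the variable indexes stored in that chunk.
    Double-stage reordering: shuffle the chunk order, then shuffle the variables
    inside the single chunk currently loaded in memory.
    """
    for c in rng.permutation(len(chunk_files)):                # stage 1: chunk order
        columns = np.fromfile(chunk_files[c], dtype=np.int32)  # fast load, no parsing
        columns = columns.reshape(-1, len(chunk_var_ids[c]))
        order = rng.permutation(len(chunk_var_ids[c]))         # stage 2: in-chunk order
        add_or_drop_pass([chunk_var_ids[c][i] for i in order], columns[:, order])
```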

64 Overall, the algorithmic overhead consists of additional steps for reading the input data: C ≈ S/M read steps for the preprocessing phase and (log K + log N) read steps for the variable selection phase, with an overhead of O(KN(S/M + log K + log N)tr) time. [sent-414, score-0.639]

65 Since sequential disk read time tr is comparable to memory access time, we expect that the scalable version of the algorithm has a practicable computation time. [sent-415, score-0.629]

66 We then analyze the advantages and limitations of our non-parametric modeling approach on large scale learning and evaluate our chunking strategy. [sent-420, score-0.198]

67 The data sets, summarized in Table 1, represent a variety of domains and contain up to millions of instances, thousands of variables and tens of gigabytes of disk space. [sent-424, score-0.283]

68 The challenge data sets are a convenient way to evaluate the scalability of our offline method, when the data set size is larger than the central memory by one order of magnitude (on computers with a 32-bit CPU). [sent-425, score-0.321]

69 It is noteworthy that the evaluated training time is the computation time for the optimal parameter settings, excluding data loading times and any kind of data preparation or model selection time. [sent-433, score-0.251]

70 Actually, the MODL method considers a space of discretization models which grows with the size of the training data set. [sent-473, score-0.293]

71 Both the space of discretization models and the prior distribution on the model parameters are data-dependent. [sent-475, score-0.212]

72 In Figure 4, we draw the discretization (with boundaries and conditional probability of the positive class) of Var150, using the MODL method and the unsupervised equal frequency method with 10 intervals, as a baseline. [sent-480, score-0.212]

73 For data set size 100, the empirical class distribution is noisy, and the MODL method selects a discretization having one single interval, leading to the elimination of the variable. [sent-481, score-0.242]

74 When the data set size increases, the MODL discretizations contain more and more intervals, with up to 12 intervals in the case of data set size 400000. [sent-482, score-0.176]

75 This also demonstrates how the univariate MODL preprocessing method behaves as a variable ranking method with a clear threshold: any input variable discretized into one single interval does not contain any predictive information and can be eliminated. [sent-484, score-0.258]
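
This ranking rule is straightforward to apply in code; a trivial sketch, where the input data structure is a hypothetical stand-in for the output of the preprocessing step:

```python
def informative_variables(discretizations):
    """Keep only the variables whose discretization has more than one interval.

    discretizations: mapping from variable name to its list of intervals; a
    single-interval variable carries no detectable predictive information.
    """
    return [name for name, intervals in discretizations.items() if len(intervals) > 1]
```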

76 Small training data sets lead to simple models, which are fast to deploy (few selected variables with simple discretization), whereas large training data sets allow more accurate models at the expense of more training time and deployment time. [sent-490, score-0.226]

77 We have proposed a chunking strategy that allows learning with training data sets whose size is far larger than the central memory. [sent-493, score-0.345]

78 The only difference with the standard algorithm is the “double stage” random reordering of the variables, which is performed prior to each step of fast forward or backward variable selection in Algorithm 2. [sent-494, score-0.254]

79 In this section, we evaluate our chunking strategy and its impact on training time and test AUC. [sent-495, score-0.265]

80 The largest training size fits into central memory on our test machine. [sent-497, score-0.32]

81 Figure 5: Data set size versus selected variable number and test AUC for the delta data set. [sent-558, score-0.194]

82 Our test machine (2 GHz, 2 GB RAM, Windows XP) allows us to compare the standard algorithm with the chunking strategy. [sent-560, score-0.155]

83 In Figure 6, we report the wall-clock variable selection training time and the test AUC for the standard algorithm (chunk number = 1) and the chunking strategy, for various numbers of chunks. [sent-561, score-0.331]

84 Figure 6: Impact of the chunking strategy on the wall-clock variable selection training time and the test AUC for the delta data set. [sent-568, score-0.384]

85 The variable selection training time is almost constant with the number of chunks. [sent-572, score-0.176]

86 Compared to the standard algorithm, the time overhead is only 10% (for data set sizes above 10000), whatever the number of chunks. [sent-573, score-0.16]

87 The recoded chunks of variables are extremely quick to load into memory, resulting in just a slight impact on training time. [sent-575, score-0.69]

88 The data set contains millions of instances and 900 variables, which is one order of magnitude larger than our central memory resources. [sent-577, score-0.209]

89 Figure 7 reports the wall-clock full training time, including load time, preprocessing and variable selection. [sent-578, score-0.273]

90 When the size of the data sets is far beyond the size of the central memory, the overhead in wall-clock time is only a factor of two. [sent-579, score-0.299]

91 This overhead mainly comes from the preprocessing phase of the algorithm (see Algorithm 3), which involves several reads of the initial training data set. [sent-580, score-0.333]

92 Figure 7: Training time (wall-clock, CPU and IO time versus data set size) for the fd data set. [sent-582, score-0.211]

93 These methods were applied using a specific data preparation per data set, such as 0-1, L1 or L2 normalization, and the model parameters were tuned using most of the available training data, even for the smallest training sessions. [sent-600, score-0.201]

94 However, when the overall process is accounted for, with data preparation, model selection and data loading time, our training time becomes competitive, with about one hour of training time per analyzed gigabyte. [sent-604, score-0.262]

95 It searches for a subset of variables consistent with the naive Bayes assumption, using an evaluation based on a Bayesian model selection approach and efficient add-drop greedy heuristics. [sent-617, score-0.171]

96 In this paper, we have introduced a carefully designed chunking strategy, such that our method is no longer limited to data sets that fit into central memory. [sent-619, score-0.264]

97 When the data set size is larger than the available central memory by one order of magnitude, our method exploits an efficient chunking strategy, with a time overhead of only a factor two. [sent-622, score-0.589]

98 Figure 8: Final training time results: data set size versus training time (excluding data loading). [sent-629, score-0.186]

99 The method is of great interest for automatizing the data preparation and modeling phases of data mining, and exploits as much as possible all the available training data in its initial representation. [sent-631, score-0.282]

100 MODL: a Bayes optimal discretization method for continuous attributes. [sent-650, score-0.212]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('chunk', 0.515), ('recoded', 0.334), ('disk', 0.237), ('discretization', 0.212), ('boull', 0.193), ('chunks', 0.182), ('modl', 0.167), ('chunking', 0.155), ('oull', 0.136), ('ree', 0.136), ('preprocessing', 0.126), ('read', 0.126), ('cale', 0.116), ('ethod', 0.116), ('central', 0.109), ('overhead', 0.103), ('memory', 0.1), ('preparation', 0.099), ('arc', 0.095), ('arge', 0.095), ('kn', 0.083), ('challenge', 0.082), ('loaded', 0.081), ('averaging', 0.08), ('naive', 0.078), ('scalable', 0.074), ('lassification', 0.071), ('intervals', 0.071), ('bayes', 0.07), ('backward', 0.068), ('exploits', 0.065), ('selective', 0.065), ('aoprc', 0.064), ('costbest', 0.061), ('iter', 0.061), ('auc', 0.053), ('delta', 0.053), ('phase', 0.053), ('langley', 0.052), ('maxiter', 0.052), ('webspam', 0.052), ('mining', 0.051), ('training', 0.051), ('variable', 0.051), ('selection', 0.047), ('antoine', 0.046), ('fd', 0.046), ('reordering', 0.046), ('variables', 0.046), ('cpt', 0.045), ('discretizations', 0.045), ('kntr', 0.045), ('mbest', 0.045), ('owing', 0.045), ('recoding', 0.045), ('load', 0.045), ('scale', 0.043), ('reorder', 0.042), ('forward', 0.042), ('bordes', 0.039), ('keerthi', 0.039), ('parsing', 0.039), ('tm', 0.039), ('bma', 0.039), ('categorical', 0.038), ('ine', 0.038), ('merge', 0.038), ('phases', 0.037), ('sequential', 0.035), ('marc', 0.035), ('earning', 0.035), ('exploit', 0.034), ('windows', 0.033), ('impact', 0.032), ('ocr', 0.032), ('sonnenburg', 0.032), ('input', 0.03), ('automatizing', 0.03), ('boulle', 0.03), ('dougherty', 0.03), ('fayyad', 0.03), ('hoeting', 0.03), ('practicable', 0.03), ('reprocessing', 0.03), ('rofu', 0.03), ('posterior', 0.03), ('size', 0.03), ('xk', 0.03), ('analyzed', 0.03), ('pascal', 0.029), ('cost', 0.029), ('overall', 0.029), ('cpu', 0.028), ('bagging', 0.028), ('kohavi', 0.028), ('submission', 0.028), ('bayesian', 0.027), ('time', 0.027), ('obtains', 0.027), ('gb', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000012 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning

Author: Marc Boullé

Abstract: With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, which play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models using a logarithmic smoothing of the posterior distribution. We focus on the complexity of the algorithms and show how they can cope with data sets that are far larger than the available central memory. We finally report results on the Large Scale Learning challenge, where our method obtains state of the art performance within practicable computation time. Keywords: large scale learning, naive Bayes, Bayesianism, model selection, model averaging

2 0.060042642 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training

Author: Kristian Woodsend, Jacek Gondzio

Abstract: Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from the PASCAL Challenge on Large-scale Learning. We show that our approach is competitive, and is able to solve problems in the Challenge many times faster than other parallel approaches. We also demonstrate that the hybrid version performs more efficiently than the version using pure MPI. Keywords: linear SVM training, hybrid parallelism, largescale learning, interior point method

3 0.043311555 23 jmlr-2009-Discriminative Learning Under Covariate Shift

Author: Steffen Bickel, Michael Brückner, Tobias Scheffer

Abstract: We address classification problems for which the training instances are governed by an input distribution that is allowed to differ arbitrarily from the test distribution—problems also referred to as classification under covariate shift. We derive a solution that is purely discriminative: neither training nor test distribution are modeled explicitly. The problem of learning under covariate shift can be written as an integrated optimization problem. Instantiating the general optimization problem leads to a kernel logistic regression and an exponential model classifier for covariate shift. The optimization problem is convex under certain conditions; our findings also clarify the relationship to the known kernel mean matching procedure. We report on experiments on problems of spam filtering, text classification, and landmine detection. Keywords: covariate shift, discriminative learning, transfer learning

4 0.043079223 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination    (Special Topic on Model Selection)

Author: Eugene Tuv, Alexander Borisov, George Runger, Kari Torkkola

Abstract: Predictive models benefit from a compact, non-redundant subset of features that improves interpretability and generalization. Modern data sets are wide, dirty, mixed with both numerical and categorical predictors, and may contain interactive effects that require complex models. This is a challenge for filters, wrappers, and embedded feature selection methods. We describe details of an algorithm using tree-based ensembles to generate a compact subset of non-redundant features. Parallel and serial ensembles of trees are combined into a mixed method that can uncover masking and detect features of secondary effect. Simulated and actual examples illustrate the effectiveness of the approach. Keywords: trees, resampling, importance, masking, residuals

5 0.040969763 50 jmlr-2009-Learning When Concepts Abound

Author: Omid Madani, Michael Connor, Wiley Greiner

Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization

6 0.039647117 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

7 0.037821315 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

8 0.037421543 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

9 0.036402691 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

10 0.03579814 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

11 0.034001671 43 jmlr-2009-Java-ML: A Machine Learning Library    (Machine Learning Open Source Software Paper)

12 0.032758337 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

13 0.03237972 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

14 0.031177931 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

15 0.028496664 91 jmlr-2009-Subgroup Analysis via Recursive Partitioning

16 0.027694996 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks

17 0.026626891 38 jmlr-2009-Hash Kernels for Structured Data

18 0.024390191 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification

19 0.024356702 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

20 0.023452036 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.138), (1, -0.062), (2, 0.069), (3, -0.036), (4, 0.084), (5, -0.059), (6, 0.007), (7, 0.089), (8, 0.053), (9, 0.079), (10, 0.043), (11, -0.019), (12, 0.029), (13, -0.052), (14, -0.043), (15, -0.007), (16, -0.049), (17, -0.009), (18, 0.041), (19, 0.063), (20, -0.076), (21, 0.04), (22, 0.124), (23, 0.035), (24, 0.019), (25, 0.119), (26, 0.07), (27, -0.163), (28, -0.022), (29, 0.138), (30, 0.034), (31, -0.103), (32, -0.096), (33, -0.094), (34, -0.047), (35, -0.023), (36, -0.426), (37, -0.194), (38, 0.184), (39, 0.068), (40, -0.041), (41, 0.343), (42, 0.281), (43, 0.056), (44, 0.051), (45, 0.053), (46, -0.094), (47, 0.182), (48, -0.081), (49, 0.254)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94131225 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning

Author: Marc Boullé

Abstract: With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, which play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models using a logarithmic smoothing of the posterior distribution. We focus on the complexity of the algorithms and show how they can cope with data sets that are far larger than the available central memory. We finally report results on the Large Scale Learning challenge, where our method obtains state of the art performance within practicable computation time. Keywords: large scale learning, naive Bayes, Bayesianism, model selection, model averaging

2 0.30148023 35 jmlr-2009-Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination    (Special Topic on Model Selection)

Author: Eugene Tuv, Alexander Borisov, George Runger, Kari Torkkola

Abstract: Predictive models benefit from a compact, non-redundant subset of features that improves interpretability and generalization. Modern data sets are wide, dirty, mixed with both numerical and categorical predictors, and may contain interactive effects that require complex models. This is a challenge for filters, wrappers, and embedded feature selection methods. We describe details of an algorithm using tree-based ensembles to generate a compact subset of non-redundant features. Parallel and serial ensembles of trees are combined into a mixed method that can uncover masking and detect features of secondary effect. Simulated and actual examples illustrate the effectiveness of the approach. Keywords: trees, resampling, importance, masking, residuals

3 0.27993351 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training

Author: Kristian Woodsend, Jacek Gondzio

Abstract: Support vector machines are a powerful machine learning technology, but the training process involves a dense quadratic optimization problem and is computationally challenging. A parallel implementation of linear Support Vector Machine training has been developed, using a combination of MPI and OpenMP. Using an interior point method for the optimization and a reformulation that avoids the dense Hessian matrix, the structure of the augmented system matrix is exploited to partition data and computations amongst parallel processors efficiently. The new implementation has been applied to solve problems from the PASCAL Challenge on Large-scale Learning. We show that our approach is competitive, and is able to solve problems in the Challenge many times faster than other parallel approaches. We also demonstrate that the hybrid version performs more efficiently than the version using pure MPI. Keywords: linear SVM training, hybrid parallelism, largescale learning, interior point method

4 0.27170867 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

Author: Hugo Jair Escalante, Manuel Montes, Luis Enrique Sucar

Abstract: This paper proposes the application of particle swarm optimization (PSO) to the problem of full model selection, FMS, for classification tasks. FMS is defined as follows: given a pool of preprocessing methods, feature selection and learning algorithms, to select the combination of these that obtains the lowest classification error for a given data set; the task also includes the selection of hyperparameters for the considered methods. This problem generates a vast search space to be explored, well suited for stochastic optimization techniques. FMS can be applied to any classification domain as it does not require domain knowledge. Different model types and a variety of algorithms can be considered under this formulation. Furthermore, competitive yet simple models can be obtained with FMS. We adopt PSO for the search because of its proven performance in different problems and because of its simplicity, since neither expensive computations nor complicated operations are needed. Interestingly, the way the search is guided allows PSO to avoid overfitting to some extend. Experimental results on benchmark data sets give evidence that the proposed approach is very effective, despite its simplicity. Furthermore, results obtained in the framework of a model selection challenge show the competitiveness of the models selected with PSO, compared to models selected with other techniques that focus on a single algorithm and that use domain knowledge. Keywords: full model selection, machine learning challenge, particle swarm optimization, experimentation, cross validation

5 0.24542199 50 jmlr-2009-Learning When Concepts Abound

Author: Omid Madani, Michael Connor, Wiley Greiner

Abstract: Many learning tasks, such as large-scale text categorization and word prediction, can benefit from efficient training and classification when the number of classes, in addition to instances and features, is large, that is, in the thousands and beyond. We investigate the learning of sparse class indices to address this challenge. An index is a mapping from features to classes. We compare the index-learning methods against other techniques, including one-versus-rest and top-down classification using perceptrons and support vector machines. We find that index learning is highly advantageous for space and time efficiency, at both training and classification times. Moreover, this approach yields similar and at times better accuracies. On problems with hundreds of thousands of instances and thousands of classes, the index is learned in minutes, while other methods can take hours or days. As we explain, the design of the learning update enables conveniently constraining each feature to connect to a small subset of the classes in the index. This constraint is crucial for scalability. Given an instance with l active (positive-valued) features, each feature on average connecting to d classes in the index (in the order of 10s in our experiments), update and classification take O(dl log(dl)). Keywords: index learning, many-class learning, multiclass learning, online learning, text categorization

6 0.22730534 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization

7 0.2165412 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

8 0.2096729 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks

9 0.19250368 63 jmlr-2009-On Efficient Large Margin Semisupervised Learning: Method and Theory

10 0.17826444 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

11 0.17475477 89 jmlr-2009-Strong Limit Theorems for the Bayesian Scoring Criterion in Bayesian Networks

12 0.17068172 57 jmlr-2009-Multi-task Reinforcement Learning in Partially Observable Stochastic Environments

13 0.1659916 23 jmlr-2009-Discriminative Learning Under Covariate Shift

14 0.15661474 56 jmlr-2009-Model Monitor (M2): Evaluating, Comparing, and Monitoring Models    (Machine Learning Open Source Software Paper)

15 0.1483594 91 jmlr-2009-Subgroup Analysis via Recursive Partitioning

16 0.14697306 48 jmlr-2009-Learning Nondeterministic Classifiers

17 0.14349869 43 jmlr-2009-Java-ML: A Machine Learning Library    (Machine Learning Open Source Software Paper)

18 0.14349867 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models

19 0.13331203 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks

20 0.13236155 60 jmlr-2009-Nieme: Large-Scale Energy-Based Models    (Machine Learning Open Source Software Paper)


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(5, 0.409), (8, 0.02), (11, 0.024), (19, 0.026), (24, 0.022), (26, 0.042), (38, 0.039), (47, 0.016), (52, 0.074), (55, 0.044), (58, 0.031), (66, 0.078), (68, 0.018), (90, 0.043), (96, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71624142 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning

Author: Marc Boullé

Abstract: With the rapid growth of computer storage capacities, available data and demand for scoring models both follow an increasing trend, sharper than that of the processing power. However, the main limitation to a wide spread of data mining solutions is the non-increasing availability of skilled data analysts, which play a key role in data preparation and model selection. In this paper, we present a parameter-free scalable classification method, which is a step towards fully automatic data mining. The method is based on Bayes optimal univariate conditional density estimators, naive Bayes classification enhanced with a Bayesian variable selection scheme, and averaging of models using a logarithmic smoothing of the posterior distribution. We focus on the complexity of the algorithms and show how they can cope with data sets that are far larger than the available central memory. We finally report results on the Large Scale Learning challenge, where our method obtains state of the art performance within practicable computation time. Keywords: large scale learning, naive Bayes, Bayesianism, model selection, model averaging

2 0.30663326 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model

Author: Jianqing Fan, Richard Samworth, Yichao Wu

Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection

3 0.30626255 62 jmlr-2009-Nonlinear Models Using Dirichlet Process Mixtures

Author: Babak Shahbaba, Radford Neal

Abstract: We introduce a new nonlinear model for classification, in which we model the joint distribution of response variable, y, and covariates, x, non-parametrically using Dirichlet process mixtures. We keep the relationship between y and x linear within each component of the mixture. The overall relationship becomes nonlinear if the mixture contains more than one component, with different regression coefficients. We use simulated data to compare the performance of this new approach to alternative methods such as multinomial logit (MNL) models, decision trees, and support vector machines. We also evaluate our approach on two classification problems: identifying the folding class of protein sequences and detecting Parkinson’s disease. Our model can sometimes improve predictive accuracy. Moreover, by grouping observations into sub-populations (i.e., mixture components), our model can sometimes provide insight into hidden structure in the data. Keywords: mixture models, Dirichlet process, classification

4 0.30587128 38 jmlr-2009-Hash Kernels for Structured Data

Author: Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, S.V.N. Vishwanathan

Abstract: We propose hashing to facilitate efficient kernels. This generalizes previous work using sampling and we show a principled way to compute the kernel matrix for data streams and sparse feature spaces. Moreover, we give deviation bounds from the exact kernel matrix. This has applications to estimation on strings and graphs. Keywords: hashing, stream, string kernel, graphlet kernel, multiclass classification

5 0.30518657 87 jmlr-2009-Sparse Online Learning via Truncated Gradient

Author: John Langford, Lihong Li, Tong Zhang

Abstract: We propose a general method called truncated gradient to induce sparsity in the weights of onlinelearning algorithms with convex loss functions. This method has several essential properties: 1. The degree of sparsity is continuous—a parameter controls the rate of sparsification from no sparsification to total sparsification. 2. The approach is theoretically motivated, and an instance of it can be regarded as an online counterpart of the popular L1 -regularization method in the batch setting. We prove that small rates of sparsification result in only small additional regret with respect to typical online-learning guarantees. 3. The approach works well empirically. We apply the approach to several data sets and find for data sets with large numbers of features, substantial sparsity is discoverable. Keywords: truncated gradient, stochastic gradient descent, online learning, sparsity, regularization, Lasso

6 0.30349734 48 jmlr-2009-Learning Nondeterministic Classifiers

7 0.29872635 82 jmlr-2009-Robustness and Regularization of Support Vector Machines

8 0.29569978 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training

9 0.29550198 19 jmlr-2009-Controlling the False Discovery Rate of the Association Causality Structure Learned with the PC Algorithm    (Special Topic on Mining and Learning with Graphs and Relations)

10 0.29466817 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

11 0.2937403 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

12 0.29350761 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models

13 0.29168144 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

14 0.29156354 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

15 0.28994527 10 jmlr-2009-Application of Non Parametric Empirical Bayes Estimation to High Dimensional Classification

16 0.28931361 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning

17 0.28795913 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

18 0.28759986 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

19 0.28637922 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation

20 0.28471598 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications