nips nips2006 nips2006-129 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Cheng-tao Chu, Sang K. Kim, Yi-an Lin, Yuanyuan Yu, Gary Bradski, Kunle Olukotun, Andrew Y. Ng
Abstract: We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We are at the beginning of the multicore era. [sent-15, score-0.246]
2 Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. [sent-16, score-0.556]
3 In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. [sent-17, score-0.16]
4 Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. [sent-19, score-0.317]
5 Our experimental results show basically linear speedup with an increasing number of processors. [sent-21, score-0.24]
6 Yet Moore's law [20], the doubling of circuit density every generation, is projected to last between 10 and 20 more years for silicon-based circuits [10]. [sent-23, score-0.093]
7 By keeping clock frequency fixed, but doubling the number of processing cores on a chip, one can maintain lower power while doubling the speed of many applications. [sent-24, score-0.633]
8 We thus approach an era of increasing numbers of cores per chip, but there is as yet no good framework for machine learning to take advantage of massive numbers of cores. [sent-26, score-0.425]
9 There are many parallel programming languages such as Orca, Occam, ABCL, SNOW, MPI and PARLOG, but none of these approaches makes it obvious how to parallelize a particular algorithm. [sent-27, score-0.294]
10 There is a vast literature on distributed learning and data mining [18], but very little of this literature focuses on our goal: A general means of programming machine learning on multicore. [sent-28, score-0.086]
11 Much of this literature contains a long and distinguished tradition of developing (often ingenious) ways to speed up or parallelize individual learning algorithms, for instance cascaded SVMs [11]. [sent-29, score-0.257]
12 But these yield no general parallelization technique for machine learning and, more pragmatically, specialized implementations of popular algorithms rarely lead to widespread use. [sent-30, score-0.263]
13 The authors of [5] give some general data distribution conditions for parallelizing machine learning, but restrict the focus to decision trees; Jin and Agrawal [14] give a general machine learning programming approach, but only for shared-memory machines. [sent-33, score-0.131]
14 This doesn’t fit the architecture of cellular or grid type multiprocessors where cores have local cache, even if it can be dynamically reallocated. [sent-34, score-0.493]
15 In this paper, we focus on developing a general and exact technique for parallel programming of a large class of machine learning algorithms for multicore processors. [sent-35, score-0.441]
16 The central idea of this approach is to allow a future programmer or user to speed up machine learning applications by "throwing more cores" at the problem rather than searching for specialized optimizations. [sent-36, score-0.101]
17 (ii) The summation form does not depend on, but can be easily expressed in, a map-reduce [7] framework, which is easy to program in. [sent-39, score-0.176]
18 Here we achieve linear speedup, which in fact often beats specific solutions such as cascaded SVM [11] (see Section 5; note, however, that cascaded SVMs handle kernels, which we have not addressed). [sent-43, score-0.236]
19 (ii) We make no claim that following our framework (for a specific algorithm) always leads to a novel parallelization undiscovered by others. [sent-44, score-0.165]
20 What is novel is the larger, broadly applicable framework, together with a pragmatic programming paradigm, map-reduce. [sent-45, score-0.131]
21 (iii) We focus here on exact implementation of machine learning algorithms, not on parallel approximations to algorithms (a worthy topic, but one which is beyond this paper’s scope). [sent-46, score-0.109]
22 In section 2 we discuss the Statistical Query Model, our summation form framework and an example of its application. [sent-47, score-0.176]
23 Basically, we often achieve linear speedup in the number of cores. [sent-51, score-0.191]
24 2 Statistical Query and Summation Form For multicore systems, Sutter and Larus [25] point out that multicore mostly benefits concurrent applications, meaning ones where there is little communication between cores. [sent-53, score-0.555]
25 The Statistical Query Model is sometimes posed as a restriction on the Valiant PAC model [26], in which we permit the learning algorithm to access the learning problem only through a statistical query oracle. [sent-56, score-0.101]
26 Given a function f (x, y) over instances, the statistical query oracle returns an estimate of the expectation of f (x, y) (averaged over the training/test distribution). [sent-57, score-0.101]
27 Algorithms that calculate sufficient statistics or gradients fit this model, and since these calculations may be batched, they are expressible as a sum over data points. [sent-58, score-0.086]
28 However, when an algorithm does sum over the data, we can easily distribute the calculations over multiple cores: We just divide the data set into as many pieces as there are cores, give each core its share of the data to sum the equations over, and aggregate the results at the end. [sent-62, score-0.217]
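The divide/sum/aggregate recipe above can be made concrete with a short sketch. The following Python fragment is our own illustration, not code from the paper; it uses the standard multiprocessing module and NumPy, and the function names are hypothetical. It splits the data by rows, sums each chunk on a separate core, and aggregates the partial sums:

```python
# Minimal sketch of summation-form parallelization: split the data into one
# chunk per core, sum over each chunk in parallel, then aggregate the partials.
from multiprocessing import Pool
import numpy as np

def partial_sum(chunk):
    # Each "mapper" sums its share of the data (here f(x) is simply x).
    return chunk.sum(axis=0)

def parallel_sum(data, n_cores=4):
    chunks = np.array_split(data, n_cores)        # step 0: split by training examples
    with Pool(n_cores) as pool:
        partials = pool.map(partial_sum, chunks)  # map: one partial sum per core
    return np.sum(partials, axis=0)               # reduce: aggregate the partials

if __name__ == "__main__":
    X = np.random.randn(100_000, 10)
    assert np.allclose(parallel_sum(X), X.sum(axis=0))
```

The same pattern carries over to any statistic that decomposes as a sum over training examples; only the body of partial_sum changes.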
29 As an example, consider ordinary least squares (linear regression), which fits a model of the form y = θ^T x by solving θ* = min_θ Σ_{i=1}^m (θ^T x_i − y_i)^2. The parameter θ is typically solved for via the normal equations. [sent-64, score-0.127]
30 To put this computation into summation form, we reformulate it into a two-phase algorithm where we first compute sufficient statistics by summing over the data, and then aggregate those statistics and solve to get θ* = A^{-1} b. [sent-93, score-0.246]
31 We next discuss an architecture that lends itself to the summation form: Map-reduce. [sent-96, score-0.244]
32 3 Architecture Many programming frameworks are possible for the summation form, but inspired by Google's success in adapting a functional programming construct, map-reduce [7], for widespread parallel programming use inside their company, we adapted this same construct for multicore use. [sent-97, score-0.754]
33 Google’s map-reduce is specialized for use over clusters that have unreliable communication and where individual computers may go down. [sent-98, score-0.152]
34 These are issues that multicores do not have; thus, we were able to develop a much lighter-weight architecture for multicores, shown in Figure 1. [sent-99, score-0.113]
35 In step 0, the map-reduce engine is responsible for splitting the data by training examples (rows). [sent-101, score-0.118]
36 The engine then caches the split data for the subsequent map-reduce invocations. [sent-102, score-0.085]
37 Every algorithm has its own engine instance, and every map-reduce task will be delegated to its engine (step 1). [sent-103, score-0.17]
38 Similar to the original map-reduce architecture, the engine will run a master (step 1. [sent-104, score-0.144]
39 The master is responsible for assigning the split data to the different mappers and then collecting the processed intermediate data from the mappers (step 1. [sent-106, score-0.29]
40 After the intermediate data is collected, the master will in turn invoke the reducer to process it (step 1. [sent-111, score-0.402]
41 Note that some mapper and reducer operations require additional scalar information from the algorithms. [sent-116, score-0.582]
42 In order to support these operations, the mapper/reducer can obtain this information through the query info interface, which can be customized for each different algorithm (step 1. [sent-117, score-0.101]
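As a rough illustration of the engine just described (a sketch of our own, not the authors' implementation; the class name MapReduceEngine, the method run, and the query_info argument are hypothetical names), the master splits and caches the data once, hands the cached splits to mappers, collects the intermediate results, and invokes the reducer, with query_info carrying any algorithm-specific scalars the mappers need:

```python
from multiprocessing import Pool
import numpy as np

class MapReduceEngine:
    """Illustrative lightweight engine: split the data once, then run map-reduce passes."""

    def __init__(self, data, n_cores=4):
        # Step 0: split the data by training examples (rows) and cache the splits.
        self.splits = np.array_split(data, n_cores)
        self.n_cores = n_cores

    def run(self, mapper, reducer, query_info=None):
        # The master assigns the cached splits to mappers, collects the
        # intermediate results, and hands them to the reducer. query_info
        # carries extra algorithm-specific state (e.g., current parameters).
        args = [(split, query_info) for split in self.splits]
        with Pool(self.n_cores) as pool:
            intermediates = pool.starmap(mapper, args)
        return reducer(intermediates)

# Example pass: count the training examples with a positive first feature.
def count_mapper(split, info):
    return int(np.sum(split[:, 0] > 0))

def count_reducer(parts):
    return sum(parts)

if __name__ == "__main__":
    engine = MapReduceEngine(np.random.randn(10_000, 3))
    print(engine.run(count_mapper, count_reducer))
```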
43 These algorithms were chosen partly for their popularity of use in NIPS papers, and our goal will be to illustrate how each algorithm can be expressed in summation form. [sent-125, score-0.211]
44 We will defer the discussion of the theoretical improvement that can be achieved by this parallelization to Section 4. [sent-126, score-0.134]
45 In the following, x or x_i denotes a training vector and y or y_i denotes a training label. [sent-128, score-0.193]
46 • Locally Weighted Linear Regression (LWLR) LWLR [28, 3] is solved by finding the solution of the normal equations Aθ = b, where A = Σ_{i=1}^m w_i (x_i x_i^T) and b = Σ_{i=1}^m w_i (x_i y_i). [sent-129, score-0.223]
47 For the summation form, we divide the computation among different mappers. [sent-130, score-0.245]
48 In this case, one set of mappers is used to compute Σ_subgroup w_i (x_i x_i^T), and another set to compute Σ_subgroup w_i (x_i y_i). [sent-131, score-0.898]
49 Two reducers respectively sum up the partial values for A and b, and the algorithm finally computes the solution θ = A^{-1} b. [sent-132, score-0.105]
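A hedged sketch of the LWLR mapper/reducer split just described (our own code, not the paper's; it assumes the weights w_i are given as an array w aligned with the rows of X):

```python
import numpy as np

def lwlr_mapper(X, y, w):
    # Partial sufficient statistics over one subgroup:
    #   A_part = sum_i w_i (x_i x_i^T),  b_part = sum_i w_i (x_i y_i)
    Xw = X * w[:, None]
    return Xw.T @ X, Xw.T @ y

def lwlr_reducer(partials):
    # Conceptually two reducers: one sums the A parts, one sums the b parts.
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)   # theta = A^{-1} b
```

Splitting the rows of (X, y, w) into one chunk per core and feeding each chunk to lwlr_mapper recovers the summation form; the ordinary least squares case is the same sketch with all weights equal to one.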
50 In order to do so, we need to sum over x_j = k for each y label in the training data to calculate P(x|y). [sent-135, score-0.154]
51 We specify different sets of mappers to calculate the following: Σ_subgroup 1{x_j = k | y = 1}, Σ_subgroup 1{x_j = k | y = 0}, Σ_subgroup 1{y = 1}, and Σ_subgroup 1{y = 0}. [sent-136, score-1.079]
52 The reducer then sums up intermediate subgroup results to get the final result for the parameters. [sent-137, score-0.674]
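A minimal sketch of this counting scheme (our own illustration; it assumes binary features x_j in {0, 1} and binary labels, and the Laplace smoothing in the reducer is our addition, not something stated in the paper):

```python
import numpy as np

def nb_mapper(X, y):
    # Partial counts over one subgroup:
    #   counts[c, j] = number of examples with x_j = 1 and y = c
    #   n[c]         = number of examples with y = c
    counts = np.zeros((2, X.shape[1]))
    n = np.zeros(2)
    for c in (0, 1):
        rows = X[y == c]
        counts[c] = rows.sum(axis=0)
        n[c] = rows.shape[0]
    return counts, n

def nb_reducer(partials):
    counts = sum(p[0] for p in partials)
    n = sum(p[1] for p in partials)
    phi_y = n / n.sum()                          # estimate of P(y = c)
    phi_x = (counts + 1.0) / (n[:, None] + 2.0)  # P(x_j = 1 | y = c), Laplace-smoothed
    return phi_y, phi_x
```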
53 For all the summation forms involved in these computations, we may leverage the map-reduce framework to parallelize the process. [sent-139, score-0.31]
54 The mappers compute partial sums (Σ 1{y_i = 1}, Σ 1{y_i = 0}, Σ 1{y_i = 0} x_i, etc.) for a subgroup of the training samples. [sent-142, score-0.317]
55 Finally, the reducer will aggregate the intermediate sums and calculate the final result for the parameters. [sent-143, score-0.477]
56 • k-means In k-means [12], it is clear that the operation of computing the Euclidean distance between the sample vectors and the centroids can be parallelized by splitting the data into individual subgroups and clustering samples in each subgroup separately (by the mapper). [sent-144, score-0.356]
57 In recalculating new centroid vectors, we divide the sample vectors into subgroups, compute the sum of vectors in each subgroup in parallel, and finally the reducer will add up the partial sums and compute the new centroids. [sent-145, score-0.796]
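The following sketch (again our own, not the authors' code) shows one map-reduce iteration of k-means in this style: each mapper assigns its subgroup of samples to the nearest centroid and returns per-cluster vector sums and counts, and the reducer recomputes the centroids:

```python
import numpy as np

def kmeans_mapper(X, centroids):
    # Assign each sample in this subgroup to its nearest centroid and return
    # the per-cluster vector sums and counts needed to recompute the centroids.
    k, d = centroids.shape
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for c in range(k):
        members = X[labels == c]
        sums[c] = members.sum(axis=0)
        counts[c] = members.shape[0]
    return sums, counts

def kmeans_reducer(partials, old_centroids):
    sums = sum(p[0] for p in partials)
    counts = sum(p[1] for p in partials)
    new = old_centroids.copy()
    nonempty = counts > 0
    new[nonempty] = sums[nonempty] / counts[nonempty, None]  # keep old centroid if empty
    return new
```

Iterating mapper-then-reducer until the centroids stop moving gives the usual batch k-means, with the expensive distance computations spread across the cores.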
58 • Logistic Regression (LR) For logistic regression [23], we choose the hypothesis of the form h_θ(x) = g(θ^T x) = 1/(1 + exp(−θ^T x)). Learning is done by fitting θ to the training data, where the likelihood function can be optimized by using Newton-Raphson to update θ := θ − H^{-1} ∇_θ ℓ(θ). [sent-146, score-0.127]
59 ∇_θ ℓ(θ) is the gradient, which can be computed in parallel by mappers summing up Σ_subgroup (y^(i) − h_θ(x^(i))) x_j^(i) at each NR step. [sent-147, score-0.537]
60 The computation of the Hessian matrix can also be written in a summation form of H(j, k) := H(j, k) + h_θ(x^(i)) (h_θ(x^(i)) − 1) x_j^(i) x_k^(i) for the mappers. [sent-148, score-0.205]
61 The reducer will then sum up the values for the gradient and the Hessian to perform the update for θ. [sent-149, score-0.414]
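A sketch of one Newton-Raphson (NR) step in this form (our own code; X holds the x^(i) as rows, y the labels in {0, 1}, and theta the current parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_mapper(X, y, theta):
    # Partial gradient and Hessian of the log-likelihood over one subgroup:
    #   grad = sum_i (y_i - h(x_i)) x_i
    #   H    = sum_i h(x_i)(h(x_i) - 1) x_i x_i^T
    h = sigmoid(X @ theta)
    grad = X.T @ (y - h)
    H = (X * (h * (h - 1))[:, None]).T @ X
    return grad, H

def lr_reducer(partials, theta):
    grad = sum(p[0] for p in partials)
    H = sum(p[1] for p in partials)
    return theta - np.linalg.solve(H, grad)   # NR update: theta := theta - H^{-1} grad
```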
62 • Neural Network (NN) We focus on backpropagation [6]. By defining a network structure (we use a three-layer network with two output neurons classifying the data into two categories), each mapper propagates its set of data through the network. [sent-150, score-0.324]
63 For each training example, the error is back propagated to calculate the partial gradient for each of the weights in the network. [sent-151, score-0.204]
64 The reducer then sums the partial gradient from each mapper and does a batch gradient descent to update the weights of the network. [sent-152, score-0.856]
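To make the neural-network case concrete, here is a small sketch (our own, with biases omitted for brevity; W1 and W2 are the weight matrices of the three-layer network, Y holds one-of-two target vectors, and a squared-error loss is assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_mapper(X, Y, W1, W2):
    # Forward-propagate this subgroup through input -> hidden -> two sigmoid
    # outputs, then backpropagate the squared error to get partial gradients.
    H = sigmoid(X @ W1)              # hidden activations, shape (n, n_hidden)
    O = sigmoid(H @ W2)              # outputs, shape (n, 2)
    dO = (O - Y) * O * (1 - O)       # output-layer deltas
    dH = (dO @ W2.T) * H * (1 - H)   # hidden-layer deltas
    return X.T @ dH, H.T @ dO        # partial gradients for W1 and W2

def nn_reducer(partials, W1, W2, lr=0.1):
    # Sum the partial gradients and take one batch gradient-descent step.
    g1 = sum(p[0] for p in partials)
    g2 = sum(p[1] for p in partials)
    return W1 - lr * g1, W2 - lr * g2
```

The learning rate lr is our own illustrative choice, not a value from the paper.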
65 In the definition of the covariance matrix Σ, the term Σ_{i=1}^m x_i x_i^T is already expressed in summation form. [sent-154, score-0.32]
66 Further, we can also express the mean vector µ as a sum, µ = (1/m) Σ_{i=1}^m x_i. [sent-155, score-0.216]
67 The sums can be mapped to separate cores, and then the reducer will sum up the partial results to produce the final empirical covariance matrix. [sent-156, score-0.443]
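A sketch of this computation in map-reduce form (ours, not the paper's code; the reducer's covariance formula E[xx^T] − µµ^T is the standard identity, and the eigen-decomposition of the small n-by-n matrix stays on a single core, as noted later in the text):

```python
import numpy as np

def pca_mapper(X):
    # Partial sums for the empirical covariance: sum_i x_i x_i^T, sum_i x_i, and m.
    return X.T @ X, X.sum(axis=0), X.shape[0]

def pca_reducer(partials):
    S = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    m = sum(p[2] for p in partials)
    mu = s / m
    cov = S / m - np.outer(mu, mu)        # empirical covariance from the two sums
    eigvals, eigvecs = np.linalg.eigh(cov)
    return mu, cov, eigvecs[:, ::-1]      # principal directions, largest eigenvalue first
```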
68 We implement batch gradient ascent to optimize the likelihood of the unmixing matrix W. [sent-159, score-0.159]
69 In this scheme, we can independently calculate the expression [1 − 2g(w_j^T x^(i))]_j (x^(i))^T in the mappers and sum the results up in the reducer. [sent-160, score-0.265]
70 For parallelization: In the E-step, every mapper processes its subset of the training data and computes the corresponding w_j^(i) (expected pseudo count). [sent-165, score-0.387]
71 For p(y), every mapper will compute Σ_subgroup w_j^(i), and the reducer will sum up the partial results and divide by m. [sent-167, score-0.101]
72 For µ, each mapper will compute Σ_subgroup (w_j^(i) x^(i)) and Σ_subgroup w_j^(i), and the reducer will sum up the partial results and divide them. [sent-168, score-1.324]
73 For Σ, every mapper will compute Σ_subgroup (w_j^(i) (x^(i) − µ_j)(x^(i) − µ_j)^T) and Σ_subgroup w_j^(i), and the reducer will again sum up the partial results and divide them. [sent-169, score-1.324]
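A compact sketch of this E-step/M-step split for a Gaussian mixture (our own code; gaussian_pdf and the variable names are ours, and, as in the text, the covariance update reuses the old means mu_j):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    d = X.shape[1]
    diff = X - mu
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def em_mapper(X, phi, mus, Sigmas):
    # E-step over one subgroup: responsibilities w_j^(i), then the partial sums
    # sum_i w_j^(i), sum_i w_j^(i) x^(i), and sum_i w_j^(i)(x^(i)-mu_j)(x^(i)-mu_j)^T.
    k = len(phi)
    W = np.column_stack([phi[j] * gaussian_pdf(X, mus[j], Sigmas[j]) for j in range(k)])
    W /= W.sum(axis=1, keepdims=True)
    sum_w = W.sum(axis=0)
    sum_wx = W.T @ X
    sum_wS = [(W[:, j, None] * (X - mus[j])).T @ (X - mus[j]) for j in range(k)]
    return sum_w, sum_wx, sum_wS, X.shape[0]

def em_reducer(partials):
    # M-step: aggregate the partial sums from all mappers and renormalize.
    sum_w = sum(p[0] for p in partials)
    sum_wx = sum(p[1] for p in partials)
    k = len(sum_w)
    sum_wS = [sum(p[2][j] for p in partials) for j in range(k)]
    m = sum(p[3] for p in partials)
    phi = sum_w / m
    mus = sum_wx / sum_w[:, None]
    Sigmas = [sum_wS[j] / sum_w[j] for j in range(k)]
    return phi, mus, Sigmas
```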
74 [2] has shown that the primal problem for quadratic loss can be solved using the following formulas, where sv denotes the set of support vectors: gradient ∇ = 2w + 2C Σ_{i∈sv} (w · x_i − y_i) x_i and Hessian H = I + C Σ_{i∈sv} x_i x_i^T. We perform batch gradient descent to optimize the objective function. [sent-173, score-0.36]
75 The mappers will calculate the partial gradient Σ_{subgroup, i∈sv} (w · x_i − y_i) x_i, and the reducer will sum up the partial results to update the w vector. [sent-174, score-0.873]
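A sketch of this batch step (our own; following the quadratic-loss primal formulation, we take the "support vectors" of the current iterate to be the margin violators, i.e., points with y_i (w · x_i) < 1, and the learning rate is an illustrative choice):

```python
import numpy as np

def svm_mapper(X, y, w, C):
    # Partial gradient over one subgroup for the primal quadratic-loss SVM:
    #   2C * sum_{i in sv} (w . x_i - y_i) x_i, with sv the margin violators.
    sv = (y * (X @ w)) < 1
    return 2.0 * C * (X[sv].T @ (X[sv] @ w - y[sv]))

def svm_reducer(partials, w, lr=0.01):
    grad = 2.0 * w + sum(partials)   # add the regularizer term 2w once
    return w - lr * grad             # one batch gradient-descent step
```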
76 Some implementations of machine learning algorithms, such as ICA, are commonly done with stochastic gradient ascent, which poses a challenge to parallelization. [sent-175, score-0.094]
77 When one gradient ascent step (involving one training sample) is updating W , it has to lock down this matrix, read it, compute the gradient, update W , and finally release the lock. [sent-179, score-0.144]
78 This “lock-release” block creates a bottleneck for parallelization; thus, instead of stochastic gradient ascent, our algorithms above were implemented using batch gradient ascent. [sent-180, score-0.195]
79 A few algorithms require matrix inversion or an eigen-decomposition of an n-by-n matrix; we did not parallelize these steps in our experiments, because for us m ≫ n, and so their cost is small. [sent-187, score-0.204]
80 Further, the reduce phase can minimize communication by combining data as it's passed back; this accounts for the log(P) factor. [sent-191, score-0.094]
81 As an example of our running-time analysis, for single-core LWLR we have to compute A = Σ_{i=1}^m w_i (x_i x_i^T), which gives us the mn^2 term. [sent-192, score-0.259]
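Putting the pieces of this paragraph together, one plausible reconstruction of the LWLR running time (consistent with the mn^2 term above, the unparallelized n-by-n inversion, and the n^2 log(P) reduce-phase communication, but not a verbatim copy of the paper's Table 1) is:

```latex
\begin{align*}
T_{\text{single core}} &= O\!\left(mn^{2} + n^{3}\right), \\
T_{\text{multicore}}   &= O\!\left(\tfrac{mn^{2}}{P} + n^{3} + n^{2}\log P\right).
\end{align*}
```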
82 Note that not all the experiments make sense from an output point of view (e.g., regression on categorical data), but our purpose was to test speedup, so we ran every algorithm over all the data sets. [sent-196, score-0.243]
83 1 Results and Discussion Table 3 shows the speedup on dual processors over all the algorithms on all the data sets. [sent-206, score-0.284]
84 Figure 2 shows the speedup of the algorithms over all the data sets for 2, 4, 8, and 16 processing cores. [sent-214, score-0.226]
85 Speedup is basically linear with the number of processing cores. [Table 3 residue: column headers Adult, Helicopter, Corel Image, IPUMS, Synthetic, Census Income, Sensor, KDD, Cover Type, Census; rows give the speedup of each algorithm, starting with lwlr.] [sent-216, score-0.183]
86 Super-linear speedup sometimes occurs due to a reduction in processor idle time with multiple threads. [sent-329, score-0.23]
87 Figure 2: (a)-(i) show the speedup from 1 to 16 processors of all the algorithms over all the data sets. [sent-330, score-0.284]
88 The reason for the sub-unity slope is increasing communication overhead. [sent-334, score-0.091]
89 For simplicity, and because the number of data points m typically dominates the reduction-phase communication costs (typically a factor of n^2, but n ≪ m), we did not parallelize the reduce phase, where we could have combined data on the way back. [sent-335, score-0.259]
90 6% speed up on average over 16 cores whereas the specialized SVM cascade [11] averages only 4%. [sent-337, score-0.556]
91 We finish by reporting some confirming results and higher performance on a proprietary multicore simulator over the sensor dataset. [sent-339, score-0.286]
92 Multicore machines are generally faster than multiprocessor machines because communication internal to the chip is much less costly. [sent-344, score-0.17]
93 6 Conclusion As the Intel and AMD product roadmaps indicate [24], the number of processing cores on a chip will be doubling several times over the next decade, even as individual cores cease to get significantly faster. [sent-345, score-0.974]
94 For machine learning to continue reaping the bounty of Moore’s law and apply to ever larger datasets and problems, it is important to adopt a programming architecture which takes advantage of multicore. [sent-346, score-0.154]
95 In this paper, by taking advantage of the summation form in a map-reduce framework, we could parallelize a wide range of machine learning algorithms and achieve a 1.9 times speedup on a dual processor and up to a 54 times speedup on 64 cores. (This work was done in collaboration with Intel Corporation.) [sent-347, score-0.176] [sent-348, score-0.169] [sent-349, score-0.421]
98 We note that the speedups achieved here involved no special optimizations of the algorithms themselves. [sent-351, score-0.101]
99 We have demonstrated a simple programming framework where in the future we can just “throw cores” at the problem of speeding up machine learning code. [sent-352, score-0.086]
100 Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance. [sent-432, score-0.22]
wordName wordTfidf (topN-words)
[('cores', 0.425), ('mapper', 0.291), ('reducer', 0.291), ('subgroup', 0.284), ('multicore', 0.246), ('speedup', 0.191), ('mappers', 0.179), ('summation', 0.176), ('mn', 0.139), ('lwlr', 0.134), ('parallelization', 0.134), ('parallelize', 0.134), ('gda', 0.112), ('query', 0.101), ('census', 0.089), ('programming', 0.086), ('engine', 0.085), ('ica', 0.084), ('nb', 0.078), ('parallel', 0.074), ('xi', 0.072), ('divide', 0.069), ('architecture', 0.068), ('partial', 0.067), ('speedups', 0.066), ('communication', 0.063), ('wj', 0.063), ('doubling', 0.062), ('chip', 0.062), ('nn', 0.06), ('master', 0.059), ('processors', 0.058), ('sv', 0.057), ('gradient', 0.056), ('specialized', 0.056), ('yi', 0.055), ('ascent', 0.055), ('lr', 0.054), ('helicopter', 0.053), ('intermediate', 0.052), ('regression', 0.052), ('intel', 0.051), ('basically', 0.049), ('nc', 0.049), ('wi', 0.048), ('calculate', 0.048), ('batch', 0.048), ('svm', 0.047), ('sums', 0.047), ('speed', 0.045), ('acip', 0.045), ('cascaded', 0.045), ('ingenious', 0.045), ('ipums', 0.045), ('kunle', 0.045), ('mnc', 0.045), ('multicores', 0.045), ('multiprocessor', 0.045), ('parallelizing', 0.045), ('pragmatic', 0.045), ('sutter', 0.045), ('unmixing', 0.045), ('yuanyuan', 0.045), ('logistic', 0.042), ('stanford', 0.04), ('sensor', 0.04), ('aggregate', 0.039), ('processor', 0.039), ('income', 0.039), ('digest', 0.039), ('clock', 0.039), ('implementations', 0.038), ('sum', 0.038), ('pca', 0.038), ('em', 0.036), ('adult', 0.036), ('corel', 0.036), ('subgroups', 0.036), ('parallelized', 0.036), ('xj', 0.035), ('algorithms', 0.035), ('inversion', 0.035), ('google', 0.035), ('cmos', 0.033), ('computers', 0.033), ('distribute', 0.033), ('tradition', 0.033), ('backpropagation', 0.033), ('training', 0.033), ('table', 0.033), ('claim', 0.031), ('silicon', 0.031), ('jin', 0.031), ('operating', 0.031), ('phase', 0.031), ('kdd', 0.03), ('cascade', 0.03), ('nally', 0.029), ('hessian', 0.029), ('slope', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 129 nips-2006-Map-Reduce for Machine Learning on Multicore
Author: Cheng-tao Chu, Sang K. Kim, Yi-an Lin, Yuanyuan Yu, Gary Bradski, Kunle Olukotun, Andrew Y. Ng
Abstract: We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors. 1
2 0.064468808 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
Author: Christopher J. Burges, Robert Ragno, Quoc V. Le
Abstract: The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions. 1
3 0.06381347 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
Author: Dong S. Cheng, Vittorio Murino, Mário Figueiredo
Abstract: This paper proposes a new approach to model-based clustering under prior knowledge. The proposed formulation can be interpreted from two different angles: as penalized logistic regression, where the class labels are only indirectly observed (via the probability density of each class); as finite mixture learning under a grouping prior. To estimate the parameters of the proposed model, we derive a (generalized) EM algorithm with a closed-form E-step, in contrast with other recent approaches to semi-supervised probabilistic clustering which require Gibbs sampling or suboptimal shortcuts. We show that our approach is ideally suited for image segmentation: it avoids the combinatorial nature Markov random field priors, and opens the door to more sophisticated spatial priors (e.g., wavelet-based) in a simple and computationally efficient way. Finally, we extend our formulation to work in unsupervised, semi-supervised, or discriminative modes. 1
4 0.06349358 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
5 0.060357861 109 nips-2006-Learnability and the doubling dimension
Author: Yi Li, Philip M. Long
Abstract: Given a set of classifiers and a probability distribution over their domain, one can define a metric by taking the distance between a pair of classifiers to be the probability that they classify a random item differently. We prove bounds on the sample complexity of PAC learning in terms of the doubling dimension of this metric. These bounds imply known bounds on the sample complexity of learning halfspaces with respect to the uniform distribution that are optimal up to a constant factor. We prove a bound that holds for any algorithm that outputs a classifier with zero error whenever this is possible; this bound is in terms of the maximum of the doubling dimension and the VC-dimension of , and strengthens the best known bound in terms of the VC-dimension alone. We show that there is no bound on the doubling dimension in terms of the VC-dimension of (in contrast with the metric dimension).
6 0.057278518 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
7 0.054737031 20 nips-2006-Active learning for misspecified generalized linear models
8 0.054704022 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
9 0.054281261 126 nips-2006-Logistic Regression for Single Trial EEG Classification
10 0.051905055 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
11 0.051514916 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
12 0.050200589 50 nips-2006-Chained Boosting
13 0.050198011 167 nips-2006-Recursive ICA
14 0.049546335 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight
15 0.049529821 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
16 0.049417824 67 nips-2006-Differential Entropic Clustering of Multivariate Gaussians
17 0.047570311 46 nips-2006-Blind source separation for over-determined delayed mixtures
18 0.047120314 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
19 0.046891637 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
20 0.046673816 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
topicId topicWeight
[(0, -0.173), (1, -0.002), (2, 0.008), (3, 0.012), (4, -0.019), (5, 0.027), (6, -0.02), (7, -0.001), (8, -0.0), (9, 0.056), (10, -0.084), (11, -0.05), (12, 0.047), (13, -0.025), (14, 0.026), (15, 0.014), (16, 0.036), (17, 0.021), (18, 0.105), (19, -0.002), (20, 0.03), (21, -0.032), (22, 0.016), (23, 0.062), (24, -0.06), (25, -0.04), (26, -0.043), (27, 0.024), (28, -0.041), (29, -0.088), (30, 0.004), (31, -0.03), (32, 0.068), (33, -0.086), (34, 0.096), (35, -0.032), (36, 0.078), (37, 0.082), (38, -0.009), (39, 0.025), (40, -0.002), (41, 0.013), (42, -0.014), (43, -0.064), (44, 0.024), (45, -0.097), (46, 0.057), (47, -0.032), (48, -0.16), (49, 0.018)]
simIndex simValue paperId paperTitle
same-paper 1 0.87756503 129 nips-2006-Map-Reduce for Machine Learning on Multicore
Author: Cheng-tao Chu, Sang K. Kim, Yi-an Lin, Yuanyuan Yu, Gary Bradski, Kunle Olukotun, Andrew Y. Ng
Abstract: We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors. 1
2 0.60358703 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
Author: Christopher J. Burges, Robert Ragno, Quoc V. Le
Abstract: The quality measures used in information retrieval are particularly difficult to optimize directly, since they depend on the model scores only through the sorted order of the documents returned for a given query. Thus, the derivatives of the cost with respect to the model parameters are either zero, or are undefined. In this paper, we propose a class of simple, flexible algorithms, called LambdaRank, which avoids these difficulties by working with implicit cost functions. We describe LambdaRank using neural network models, although the idea applies to any differentiable function class. We give necessary and sufficient conditions for the resulting implicit cost function to be convex, and we show that the general method has a simple mechanical interpretation. We demonstrate significantly improved accuracy, over a state-of-the-art ranking algorithm, on several datasets. We also show that LambdaRank provides a method for significantly speeding up the training phase of that ranking algorithm. Although this paper is directed towards ranking, the proposed method can be extended to any non-smooth and multivariate cost functions. 1
3 0.51766789 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
4 0.51412988 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
5 0.49776572 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
Author: S. S. Keerthi, Vikas Sindhwani, Olivier Chapelle
Abstract: We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold crossvalidation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations. . 1
6 0.47536021 149 nips-2006-Nonnegative Sparse PCA
7 0.47344777 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
8 0.44843119 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
9 0.44298798 104 nips-2006-Large-Scale Sparsified Manifold Regularization
10 0.4246248 150 nips-2006-On Transductive Regression
11 0.40265259 107 nips-2006-Large Margin Multi-channel Analog-to-Digital Conversion with Applications to Neural Prosthesis
12 0.40131789 76 nips-2006-Emergence of conjunctive visual features by quadratic independent component analysis
13 0.39727962 105 nips-2006-Large Margin Component Analysis
14 0.39491421 96 nips-2006-In-Network PCA and Anomaly Detection
15 0.394207 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
16 0.38265458 194 nips-2006-Towards a general independent subspace analysis
17 0.37736303 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
18 0.37446797 75 nips-2006-Efficient sparse coding algorithms
19 0.37245724 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
20 0.36957338 120 nips-2006-Learning to Traverse Image Manifolds
topicId topicWeight
[(1, 0.103), (3, 0.017), (7, 0.549), (9, 0.034), (22, 0.047), (44, 0.05), (57, 0.056), (65, 0.029), (69, 0.018)]
simIndex simValue paperId paperTitle
1 0.99505037 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
Author: Hariharan Narayanan, Mikhail Belkin, Partha Niyogi
Abstract: One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning, is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary. keywords: Clustering, Semi-Supervised Learning 1
2 0.95958191 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing
Author: Yuanhao Chen, Long Zhu, Alan L. Yuille
Abstract: We describe an unsupervised method for learning a probabilistic grammar of an object from a set of training examples. Our approach is invariant to the scale and rotation of the objects. We illustrate our approach using thirteen objects from the Caltech 101 database. In addition, we learn the model of a hybrid object class where we do not know the specific object or its position, scale or pose. This is illustrated by learning a hybrid class consisting of faces, motorbikes, and airplanes. The individual objects can be recovered as different aspects of the grammar for the object class. In all cases, we validate our results by learning the probability grammars from training datasets and evaluating them on the test datasets. We compare our method to alternative approaches. The advantages of our approach is the speed of inference (under one second), the parsing of the object, and increased accuracy of performance. Moreover, our approach is very general and can be applied to a large range of objects and structures. 1
same-paper 3 0.95444393 129 nips-2006-Map-Reduce for Machine Learning on Multicore
Author: Cheng-tao Chu, Sang K. Kim, Yi-an Lin, Yuanyuan Yu, Gary Bradski, Kunle Olukotun, Andrew Y. Ng
Abstract: We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors. 1
4 0.93929666 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
Author: S. S. Keerthi, Vikas Sindhwani, Olivier Chapelle
Abstract: We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold crossvalidation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations. . 1
5 0.88005513 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields
Author: Thomas Ott, Ruedi Stoop
Abstract: We rigorously establish a close relationship between message passing algorithms and models of neurodynamics by showing that the equations of a continuous Hopfield network can be derived from the equations of belief propagation on a binary Markov random field. As Hopfield networks are equipped with a Lyapunov function, convergence is guaranteed. As a consequence, in the limit of many weak connections per neuron, Hopfield networks exactly implement a continuous-time variant of belief propagation starting from message initialisations that prevent from running into convergence problems. Our results lead to a better understanding of the role of message passing algorithms in real biological neural networks.
6 0.84801072 60 nips-2006-Convergence of Laplacian Eigenmaps
7 0.78824371 80 nips-2006-Fundamental Limitations of Spectral Clustering
8 0.72054493 128 nips-2006-Manifold Denoising
9 0.68616813 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
10 0.68149167 43 nips-2006-Bayesian Model Scoring in Markov Random Fields
11 0.67646343 33 nips-2006-Analysis of Representations for Domain Adaptation
12 0.6675157 77 nips-2006-Fast Computation of Graph Kernels
13 0.66507971 181 nips-2006-Stability of $K$-Means Clustering
14 0.66265464 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
15 0.66179812 163 nips-2006-Prediction on a Graph with a Perceptron
16 0.65601039 110 nips-2006-Learning Dense 3D Correspondence
17 0.65548664 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
18 0.65351886 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
19 0.65237004 109 nips-2006-Learnability and the doubling dimension
20 0.65072471 56 nips-2006-Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data