nips nips2006 nips2006-156 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a reduction framework from ordinal regression to binary classification based on extended examples. [sent-3, score-1.169]
2 The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. [sent-4, score-1.005]
3 A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. [sent-5, score-0.588]
4 Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. [sent-6, score-2.103]
5 In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. [sent-7, score-1.99]
6 When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. [sent-8, score-0.208]
7 1 Introduction We work on a type of supervised learning problem called ranking or ordinal regression, where examples are labeled by an ordinal scale called the rank. [sent-9, score-1.571]
8 The ratings have a natural order, which distinguishes ordinal regression from general multiclass classification. [sent-11, score-0.816]
9 Recently, many algorithms for ordinal regression have been proposed from a machine learning perspective. [sent-12, score-0.808]
10 For instance, Crammer and Singer [1] generalized the online perceptron algorithm with multiple thresholds to do ordinal regression. [sent-13, score-1.005]
11 In their approach, a perceptron maps an input vector to a latent potential value, which is then thresholded to obtain a rank. [sent-14, score-0.336]
12 Since binary classification is much better studied than ordinal regression, a general framework to systematically reduce the latter to the former can introduce two immediate benefits. [sent-19, score-0.835]
13 First, well-tuned binary classification approaches can be readily transformed into good ordinal regression algorithms, which saves immense efforts in design and implementation. [sent-20, score-1.008]
14 Second, new generalization bounds for ordinal regression can be easily derived from known bounds for binary classification, which saves tremendous efforts in theoretical analysis. [sent-21, score-1.181]
15 The framework is based on extended examples, which are extracted from the original examples and a given mislabeling cost matrix. [sent-23, score-0.387]
16 The binary classifier trained from the extended examples can then be used to construct a ranking rule. [sent-24, score-0.482]
17 We prove that the mislabeling cost of the ranking rule is bounded by a weighted 0/1 loss of the binary classifier. [sent-25, score-0.606]
18 Hence, binary classifiers that generalize well could introduce ranking rules that generalize well. [sent-26, score-0.34]
19 In addition, we show that our framework provides a unified view for many existing ordinal regression algorithms. [sent-28, score-0.861]
20 Theoretical guarantee on the reduction, including derivations of new generalization bounds for ordinal regression, is provided in Section 4. [sent-33, score-0.791]
21 2 The reduction framework In an ordinal regression problem, an example (x, y) is composed of an input vector x ∈ X and an ordinal label (i.e., rank) y ∈ Y = {1, 2, . . . , K}. [sent-35, score-1.603]
22 The generalization error of a ranking rule r : X → Y is then defined as C(r, P) = E_{(x,y)∼P} [ C_{y,r(x)} ], where C is a K × K cost matrix with C_{y,k} being the cost of predicting an example (x, y) as rank k. [sent-45, score-0.706]
23 Given a training set S = {(x_n, y_n)}_{n=1}^{N} containing N examples, the goal is to find a ranking rule r that generalizes well, i.e., one that attains a small C(r, P). [sent-47, score-0.499]
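As a small illustration of this setup, the sketch below (Python, with hypothetical helper names; not code from the paper) computes the empirical counterpart of C(r, P) on a finite sample, using the absolute cost |y − k| as one choice of cost matrix.

```python
import numpy as np

def empirical_cost(C, y_true, y_pred):
    """Average C[y, r(x)] over a sample; ranks are 1..K, shifted to 0-based indices."""
    C = np.asarray(C)
    return float(np.mean(C[np.asarray(y_true) - 1, np.asarray(y_pred) - 1]))

# Absolute cost for K = 4 ranks: C[y, k] = |y - k|.
K = 4
C_abs = np.abs(np.subtract.outer(np.arange(K), np.arange(K)))
print(empirical_cost(C_abs, y_true=[1, 2, 4], y_pred=[1, 4, 3]))  # (0 + 2 + 1) / 3
```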
24 The ordinal information can be interpreted in several ways. [sent-51, score-0.655]
25 Another interpretation is that the mislabeling cost depends on the “closeness” of the prediction. [sent-53, score-0.196]
26 However, because the cost is invariant for all kinds of mislabelings, the ordinal information is not taken into account. [sent-60, score-0.764]
27 In this paper, we shall always assume that the ordinal regression problem under study comes with a cost matrix of V-shaped rows, and discuss how to reduce the ordinal regression problem to a binary classification problem. [sent-65, score-1.879]
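To make the two cost matrices discussed above concrete, here is a minimal sketch (our own helper names, not code from the paper) that builds the classification cost and the absolute cost and checks the V-shape condition on every row.

```python
import numpy as np

def classification_cost(K):
    # C[y, k] = [[y != k]]: every mislabeling costs 1, ignoring the ordinal structure.
    return 1 - np.eye(K)

def absolute_cost(K):
    # C[y, k] = |y - k|: cost grows with the distance between rank and prediction.
    ranks = np.arange(K)
    return np.abs(np.subtract.outer(ranks, ranks)).astype(float)

def is_v_shaped(C):
    # Row y must be non-increasing before column y and non-decreasing after it.
    for y, row in enumerate(C):
        if np.any(np.diff(row[: y + 1]) > 0) or np.any(np.diff(row[y:]) < 0):
            return False
    return True

for C in (classification_cost(5), absolute_cost(5)):
    assert is_v_shaped(C)
```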
28 2.1 Reducing ordinal regression to binary classification The ordinal information allows ranks to be compared. [sent-68, score-1.599]
29 For a fixed k, the question "is the rank of x greater than k?" is exactly a binary classification problem, and the rank of x can be determined by asking multiple questions for k = 1, 2, until (K − 1). [sent-71, score-0.201]
30 Frank and Hall [6] proposed to solve each binary classification problem independently and combine the binary outputs to a rank. [sent-72, score-0.285]
31 First, all the binary classification problems are solved jointly to obtain a single binary classifier. [sent-75, score-0.262]
32 Second, a simpler step is used to convert the binary outputs to a rank, and generalization analysis can immediately follow. [sent-76, score-0.227]
33 Assume that fb (x, k) is a binary classifier for all the associated questions above. [sent-78, score-0.292]
34 Consistent answers would be fb(x, k) = 1 ("yes") for k = 1 until (y′ − 1) for some y′, and 0 ("no") afterwards. [sent-79, score-0.203]
35 Then, a reasonable ranking rule based on the binary answers is r(x) ≡ y′, which for consistent answers equals 1 + Σ_{k=1}^{K−1} [[ fb(x, k) = 1 ]]. [sent-80, score-0.613]
36 Although the definition can be flexibly applied even when fb is not consistent, a consistent fb is usually desired in order to introduce a good ranking rule r. [sent-82, score-0.601]
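A tiny sketch of this conversion (our own illustration, not code from the paper): with answers to the K − 1 questions encoded as 1 for "yes" and 0 for "no", the rank is one plus the number of "yes" answers, which coincides with y′ whenever the answers are consistent.

```python
def rank_from_answers(answers):
    """answers[k-1] == 1 means fb(x, k) answered 'yes' to 'is the rank greater than k?'."""
    return 1 + sum(1 for a in answers if a == 1)

# Consistent answers of a rank-3 example among K = 5 ranks: yes, yes, no, no.
assert rank_from_answers([1, 1, 0, 0]) == 3
```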
37 Furthermore, the ordinal information can help to model the relative confidence in the binary outputs. [sent-83, score-0.786]
38 That is, when k is farther from the rank of x, the answer fb (x, k) should be more confident. [sent-84, score-0.231]
39 The confidence function is converted to a ranking rule via r(x) = 1 + Σ_{k=1}^{K−1} [[ f(x, k) > 0 ]] (1). The ordinal information would naturally require f to be rank-monotonic, i.e., f(x, 1) ≥ f(x, 2) ≥ · · · ≥ f(x, K − 1). [sent-90, score-0.655]
40 Note that a rank-monotonic function f introduces consistent answers fb . [sent-93, score-0.203]
41 Thus the cost of the ranking rule r on an example (x, y) is Σ_{k=1}^{K−1} (C_{y,k} − C_{y,k+1}) [[ f(x, k) ≤ 0 ]] + C_{y,K} = Σ_{k=r(x)}^{K−1} (C_{y,k} − C_{y,k+1}) + C_{y,K} = C_{y,r(x)}. (2) [sent-96, score-0.388]
42 Define the extended examples (x^{(k)}, y^{(k)}) with weights w_{y,k} as x^{(k)} = (x, k), y^{(k)} = 2 [[ k < y ]] − 1, and w_{y,k} = |C_{y,k} − C_{y,k+1}|. (3) Because row y in C is V-shaped, the binary variable y^{(k)} agrees with the sign of (C_{y,k} − C_{y,k+1}) whenever the latter is not zero. [sent-97, score-0.314]
43 Combining these pieces gives C_{y,r(x)} ≤ Σ_{k=1}^{K−1} w_{y,k} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]]. (4) Inequality (4) shows that the cost of r on example (x, y) is bounded by a weighted 0/1 loss of f on the extended examples. [sent-100, score-0.199]
44 Altogether, our reduction framework consists of the following steps: we first use (3) to transform all training examples (x_n, y_n) to extended examples (x_n^{(k)}, y_n^{(k)}) with weights w_{y_n,k} (also denoted as w_n^{(k)}). [sent-104, score-0.951]
45 All the extended examples would then be jointly learned by a binary classifier f with confidence outputs, aiming at a low weighted 0/1 loss. [sent-105, score-0.273]
46 Finally, a ranking rule r is constructed from f using (1). [sent-106, score-0.279]
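The three steps can be sketched as follows; this is a minimal illustration under our own simplifying assumptions (the pair (x, k) is encoded by appending the scalar k to x, whereas the paper's kernel construction appends a scaled basis vector, and `train_weighted_binary` stands for any learner that accepts per-example weights and returns a confidence function f).

```python
import numpy as np

def extend_examples(X, Y, C):
    """Step 1: build extended inputs (x, k), binary labels y^(k), and weights w_{y,k} as in (3)."""
    K = C.shape[0]
    Xe, ye, we = [], [], []
    for x, y in zip(X, Y):                        # y is a rank in 1..K
        for k in range(1, K):                     # questions k = 1, ..., K-1
            Xe.append(np.append(x, k))            # simple scalar encoding of (x, k)
            ye.append(1 if k < y else -1)         # y^(k) = 2[[k < y]] - 1
            we.append(abs(C[y - 1, k - 1] - C[y - 1, k]))  # w_{y,k} = |C_{y,k} - C_{y,k+1}|
    return np.array(Xe), np.array(ye), np.array(we)

def rank_rule(f, x, K):
    """Step 3: construct r(x) = 1 + sum_k [[f(x, k) > 0]] as in (1)."""
    return 1 + sum(int(f(np.append(x, k)) > 0) for k in range(1, K))

# Step 2 (training) is delegated to any weighted binary learner, e.g.:
#   f = train_weighted_binary(Xe, ye, sample_weight=we)
```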
47 Theorem 1 (reduction) An ordinal regression problem with a V-shaped cost matrix C can be reduced to a binary classification problem with the extended examples in (3) and the ranking rule r in (1). [sent-108, score-1.469]
48 If f is rank-monotonic or every row of C is convex, for any example (x, y) and its extended examples (x^{(k)}, y^{(k)}), the weighted sum of the 0/1 loss of f(x^{(k)}) bounds the cost of r(x). [sent-109, score-0.355]
49 The question is then, "when can a binary classification algorithm return ordered thresholds?" [sent-116, score-0.186]
50 First, consider an example with y_n = k + 1. [sent-121, score-0.339]
51 Since y_n^{(k)} = 1 and y_n^{(k+1)} = −1, switching the thresholds θ_k and θ_{k+1} changes the objective value by w_n^{(k)} [ ℓ(g(x_n) − θ_{k+1}) − ℓ(g(x_n) − θ_k) ] + w_n^{(k+1)} [ ℓ(θ_k − g(x_n)) − ℓ(θ_{k+1} − g(x_n)) ], where ℓ is the loss of the underlying thresholded model g. [sent-122, score-0.748]
52 For an example with y_n < k + 1, we have y_n^{(k)} = y_n^{(k+1)} = −1. [sent-124, score-0.66]
53 Note that row y_n of the cost matrix being convex leads to w_n^{(k)} ≤ w_n^{(k+1)}; since ℓ(ρ) is non-increasing, the change above is also non-positive. [sent-127, score-0.735]
54 The case for examples with yn > k + 1 is similar and the change there is also non-positive. [sent-128, score-0.272]
55 With proper choices of the cost matrix, the encoding scheme of (x, k), and the binary learning algorithm, many existing ordinal regression algorithms can be unified in our framework. [sent-135, score-1.123]
56 3.1 Perceptron-based algorithms The perceptron ranking (PRank) algorithm proposed by Crammer and Singer [1] is an online ordinal regression algorithm that employs the thresholded model with f(x, k) = ⟨u, x⟩ − θ_k. [sent-141, score-1.377]
57 Whenever a training example is not predicted correctly, the current u and θ are updated in a way similar to the perceptron learning rule [8]. [sent-142, score-0.309]
58 Thus, when the absolute cost matrix is taken and a modified perceptron learning rule (footnote 2) is used as the underlying binary classification algorithm, the PRank algorithm is a specific instance of our framework. [sent-145, score-0.569]
59 The orderliness of the thresholds is guaranteed by Theorem 2, and the mistake bound is a direct application of the well-known perceptron mistake bound (see for example Freund and Schapire [8]). [sent-146, score-0.54]
60 Our framework not only simplifies the derivation of the mistake bound, but also allows the use of other perceptron algorithms, such as a batch-mode algorithm rather than an online one. [sent-147, score-0.367]
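For instance, one batch-mode possibility is an ordinary weighted perceptron run over the extended examples, with (x, k) encoded as (x, e_k) so that a single weight vector v = (u, −θ) gives ⟨v, z⟩ = ⟨u, x⟩ − θ_k; the sketch below is our own illustration and is not the exact PRank update.

```python
import numpy as np

def encode(x, k, K):
    """Encode (x, k) as (x, e_k); with v = (u, -theta), <v, z> equals <u, x> - theta_k."""
    e = np.zeros(K - 1)
    e[k - 1] = 1.0
    return np.concatenate([np.asarray(x, dtype=float), e])

def batch_weighted_perceptron(extended, dim, K, epochs=10):
    """extended: iterable of ((x, k), label, weight) triples; returns the learned vector v."""
    v = np.zeros(dim + K - 1)
    for _ in range(epochs):
        for (x, k), label, weight in extended:
            z = encode(x, k, K)
            if label * np.dot(v, z) <= 0:         # a weighted mistake on the extended example
                v += weight * label * z           # weighted perceptron update
    return v
```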
61 3.2 SVM-based algorithms SVM [9] can be thought of as a generalized perceptron with a kernel that computes the inner product on transformed input vectors φ(x). [sent-149, score-0.31]
62 For the extended examples (x, k), we can suitably define the extended kernel as the original kernel plus the inner product between the extensions, K((x, k), (x′, k′)) = ⟨φ(x), φ(x′)⟩ + ⟨e_k, e_{k′}⟩. [sent-150, score-0.408]
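A direct sketch of this construction (our own wrapper; with the standard basis vectors e_k, the extension term ⟨e_k, e_{k′}⟩ is simply the indicator of k = k′):

```python
def extended_kernel(base_kernel):
    """Turn a kernel on X into a kernel on extended inputs (x, k)."""
    def kernel(a, b):
        (x1, k1), (x2, k2) = a, b
        # <phi(x), phi(x')> + <e_k, e_{k'}> = base_kernel(x, x') + [[k == k']]
        return base_kernel(x1, x2) + (1.0 if k1 == k2 else 0.0)
    return kernel

# Usage with a linear base kernel:
linear = lambda u, v: float(sum(ui * vi for ui, vi in zip(u, v)))
K_ext = extended_kernel(linear)
print(K_ext(([1.0, 2.0], 1), ([0.5, 1.0], 1)))   # 2.5 + 1.0 = 3.5
```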
63 Then, several SVM-based approaches for ordinal regression are special instances of our framework. [sent-151, score-0.811]
64 When E = γ I_{K−1} and the traditional soft-margin SVM are used in our framework, the binary classifier f(x, k) has the form ⟨u, φ(x)⟩ − θ_k − b, and can be obtained by solving min_{u,θ,b} ||u||² + ||θ||²/γ² + κ Σ_{n=1}^{N} Σ_{k=1}^{K−1} w_n^{(k)} max(0, 1 − y_n^{(k)} (⟨u, φ(x_n)⟩ − θ_k − b)). [sent-154, score-0.505]
65 This finding can also be explained by reduction: SVOR-EXP is an instance of our framework using the classification cost and SVOR-IMC comes from using the absolute cost. [sent-159, score-0.247]
66 If the unmodified soft-margin SVM (7) is directly used in our framework with the absolute cost, we obtain a new support vector ordinal regression formulation. [sent-161, score-0.905]
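In practice, one way to instantiate such a formulation is to hand the extended examples and their weights to an off-the-shelf weighted soft-margin SVM; the sketch below uses scikit-learn as one such implementation (our choice, not the paper's, which works with LIBSVM) and reuses `extend_examples` and the scalar encoding from the earlier sketch.

```python
import numpy as np
from sklearn.svm import SVC

def fit_reduction_svm(Xe, ye, we, kappa=1.0):
    """Weighted binary SVM on the extended examples; the weights carry the absolute cost."""
    clf = SVC(kernel="rbf", C=kappa)
    clf.fit(Xe, ye, sample_weight=we)
    return clf

def predict_rank(clf, x, K):
    """r(x) = 1 + sum_k [[f(x, k) > 0]], with f taken as the SVM decision function."""
    return 1 + sum(
        int(clf.decision_function([np.append(x, k)])[0] > 0) for k in range(1, K)
    )
```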
67 4 Generalization bounds With the extended examples, new generalization bounds can be derived for ordinal regression problems with any cost matrix. [sent-165, score-1.183]
68 A simple result that comes immediately from (4) is the following theorem. (Footnote 2: to precisely replicate the PRank algorithm, the (K − 1) extended examples sprouted from the same example should be considered altogether in updating the perceptron weight vector.) [sent-166, score-0.43]
69 Theorem 3 (reduction of generalization error) Let c_y = C_{y,1} + C_{y,K} and c = max_y c_y. [sent-168, score-0.411]
70 If f is rank-monotonic or every row of C is convex, there exists a distribution P̂ on (X, Y), where X contains the encoding of (x, k) and Y is a binary label, such that E_{(x,y)∼P} C_{y,r(x)} ≤ c · E_{(X,Y)∼P̂} [[ Y f(X) ≤ 0 ]]. [sent-169, score-0.22]
71 Given the conditions, following (4), we have C_{y,r(x)} ≤ Σ_{k=1}^{K−1} w_{y,k} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]] = c_y · E_{k∼P_k} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]], where P_k(k | y) = w_{y,k} / c_y is a probability distribution because c_y = Σ_{k=1}^{K−1} w_{y,k}. [sent-171, score-0.306]
72 Then, the generalization error of r is E_{(x,y)∼P} C_{y,r(x)} ≤ E_{(x,y)∼P} [ c_y · E_{k∼P_k} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]] ] ≤ c · E_{(x^{(k)},y^{(k)})∼P̂} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]]. (8) [sent-173, score-0.226]
73 Theorem 3 shows that, if the binary classifier f generalizes well when examples are sampled from P̂, the constructed ranking rule would also generalize well. [sent-174, score-0.462]
74 The terms y^{(k)} f(x^{(k)}), which are exactly the margins of the associated binary classifier fb(x, k), would be analogously called the margins for ordinal regression, and are expected to be positive and large for correct and confident predictions. [sent-175, score-1.057]
75 However, the bound is not data-dependent, and hence does not fully explain the generalization performance of large-margin ranking rules in reality (for more discussions on data-dependent bounds, see the work of, for example, Bartlett and Shawe-Taylor [10]). [sent-182, score-0.334]
76 Next, we show how a novel data-dependent bound for SVM-based ordinal regression approaches can be derived from our reduction framework. [sent-183, score-0.977]
77 Theorem 4 (data-dependent bound for support vector ordinal regression) Assume that f(x, k) belongs to { f : (x, k) ↦ ⟨u, φ(x)⟩ − θ_k, ||u||² + ||θ||² ≤ 1 }, with ||φ(x)||² + 1 ≤ R². [sent-188, score-0.733]
78 If θ is ordered or every row of C is convex, for any margin criterion ∆, with probability at least 1 − δ, every ranking rule r based on f has generalization error no more than (β/N) Σ_{n=1}^{N} Σ_{k=1}^{K−1} w_n^{(k)} [[ y_n^{(k)} f(x_n^{(k)}) ≤ ∆ ]] + O( (log N / √N) · (R/∆), √(log(1/δ) / N) ), where β = max_y c_y / min_y c_y. [sent-189, score-0.899]
79 PROOF Consider the extended training set Ŝ = { (x_n^{(k)}, y_n^{(k)}) }, which contains N(K − 1) elements. [sent-190, score-0.463]
80 One way to extract such a subset is to choose independent k_n from P_k(k | y_n) for each (x_n, y_n). [sent-201, score-0.495]
81 The subset would be named T = { (x_n^{(k_n)}, y_n^{(k_n)}) }_{n=1}^{N}. [sent-202, score-0.22]
82 For independent outcomes from P̂, which is the case of T, E_{(x^{(k)},y^{(k)})∼P̂} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]] ≤ (1/N) Σ_{n=1}^{N} [[ y_n^{(k_n)} f(x_n^{(k_n)}) ≤ ∆ ]] + O( (log N / √N) · (R/∆), √(log(1/δ) / N) ). (9) [sent-206, score-0.262]
83 Table 1: Test error with the absolute cost on the benchmark data sets pyrimidines, machine, boston, abalone, bank, computer, california, and census. [sent-207, score-0.251]
84 Let b_n = [[ y_n^{(k_n)} f(x_n^{(k_n)}) ≤ ∆ ]] be a Boolean random variable introduced by k_n ∼ P_k(k | y_n). [sent-289, score-0.55]
85 The variable has mean c_{y_n}^{−1} · Σ_{k=1}^{K−1} w_n^{(k)} [[ y_n^{(k)} f(x_n^{(k)}) ≤ ∆ ]]. [sent-290, score-0.374]
86 An extended Chernoff bound shows that when each b_n is chosen independently, with probability at least (1 − δ/2) over the choice of b_n, (1/N) Σ_{n=1}^{N} b_n ≤ (1/N) Σ_{n=1}^{N} (1/c_{y_n}) Σ_{k=1}^{K−1} w_n^{(k)} [[ y_n^{(k)} f(x_n^{(k)}) ≤ ∆ ]] + O( √(log(1/δ) / N) ). [sent-291, score-1.121]
87 We tested our framework with E = γ I_{K−1} and three different binary classification algorithms. [sent-297, score-0.18]
88 The third one is SVM with the perceptron kernel [12], with a simple setting of γ = 1. [sent-301, score-0.287]
89 We used the perceptron kernel instead to gain the advantage of faster parameter selection. [sent-303, score-0.287]
90 For a fair comparison, we also implemented SVOR-IMC with the perceptron kernel and the same parameter selection procedure in LIBSVM. [sent-308, score-0.287]
91 Within the three SVM-based approaches, the two with the perceptron kernel are better than SVOR-IMC with the Gaussian kernel in test performance. [sent-309, score-0.335]
92 Our direct reduction to the standard SVM performs similarly to SVOR-IMC with the same perceptron kernel, but is much easier to implement. [sent-310, score-0.353]
93 With our reduction framework, all three binary learning algorithms could be better than SVOR-IMC with the Gaussian kernel on some of the data sets, which demonstrates that they achieve decent out-of-sample performance. [sent-315, score-0.316]
94 Figure 1: Training time (including automatic parameter selection) of the SVM-based approaches with the perceptron kernel, comparing the reduction with SVOR-IMC on the bank, computer, california, and census data sets. The results are averaged CPU time. [sent-317, score-0.524]
95 This difference demonstrates another advantage of our reduction framework: improvements to binary classification approaches can be immediately inherited by reduction-based ordinal regression algorithms. [sent-320, score-1.056]
96 6 Conclusion We presented a reduction framework from ordinal regression to binary classification based on extended examples. [sent-321, score-1.169]
97 The framework has the flexibility to work with any reasonable cost matrix and any binary classifiers. [sent-322, score-0.312]
98 We demonstrated the algorithmic advantages of the framework in designing new ordinal regression algorithms and explaining existing algorithms. [sent-323, score-0.935]
99 We also showed that the framework can be used to derive new generalization bounds for ordinal regression. [sent-324, score-0.84]
wordName wordTfidf (topN-words)
[('ordinal', 0.655), ('perceptron', 0.239), ('yn', 0.22), ('ranking', 0.209), ('fb', 0.161), ('wn', 0.154), ('cy', 0.153), ('chu', 0.145), ('binary', 0.131), ('regression', 0.13), ('classi', 0.123), ('keerthi', 0.116), ('reduction', 0.114), ('cost', 0.109), ('thresholded', 0.097), ('xn', 0.093), ('extended', 0.09), ('mislabeling', 0.087), ('thresholds', 0.087), ('pk', 0.083), ('generalization', 0.073), ('rule', 0.07), ('rank', 0.07), ('lin', 0.069), ('prank', 0.066), ('rajaram', 0.066), ('bounds', 0.063), ('svm', 0.058), ('bn', 0.055), ('margins', 0.055), ('kn', 0.055), ('mistake', 0.055), ('ordered', 0.055), ('examples', 0.052), ('shashua', 0.052), ('ling', 0.052), ('roof', 0.052), ('libsvm', 0.052), ('bound', 0.052), ('framework', 0.049), ('kernel', 0.048), ('er', 0.048), ('ik', 0.048), ('encoding', 0.048), ('absolute', 0.045), ('def', 0.043), ('answers', 0.042), ('outcomes', 0.042), ('cation', 0.041), ('row', 0.041), ('rows', 0.041), ('crammer', 0.04), ('ek', 0.04), ('levin', 0.038), ('optimizers', 0.038), ('saves', 0.038), ('theorem', 0.038), ('bartlett', 0.037), ('usefulness', 0.036), ('formulations', 0.036), ('census', 0.035), ('bank', 0.035), ('uni', 0.034), ('convex', 0.034), ('dence', 0.033), ('raedt', 0.032), ('maxy', 0.032), ('caltech', 0.032), ('lecture', 0.032), ('switching', 0.032), ('margin', 0.031), ('multiclass', 0.031), ('ecml', 0.03), ('frank', 0.03), ('herbrich', 0.03), ('algorithmic', 0.029), ('notes', 0.028), ('ranks', 0.028), ('efforts', 0.028), ('singer', 0.028), ('california', 0.027), ('benchmark', 0.027), ('existing', 0.027), ('altogether', 0.027), ('movie', 0.027), ('approaches', 0.026), ('decreasing', 0.026), ('support', 0.026), ('shall', 0.024), ('adaboost', 0.024), ('boolean', 0.024), ('online', 0.024), ('freund', 0.023), ('outputs', 0.023), ('algorithms', 0.023), ('matrix', 0.023), ('advantages', 0.022), ('instance', 0.022), ('strictly', 0.022), ('comes', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
2 0.12060589 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1
3 0.098912008 163 nips-2006-Prediction on a Graph with a Perceptron
Author: Mark Herbster, Massimiliano Pontil
Abstract: We study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. 1
4 0.090978153 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
5 0.086557686 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
6 0.085301481 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
7 0.082945809 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
8 0.075339742 65 nips-2006-Denoising and Dimension Reduction in Feature Space
9 0.075295039 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
10 0.073301353 193 nips-2006-Tighter PAC-Bayes Bounds
11 0.072172403 116 nips-2006-Learning from Multiple Sources
12 0.067444101 150 nips-2006-On Transductive Regression
13 0.067274161 50 nips-2006-Chained Boosting
14 0.065985449 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
15 0.06468413 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
16 0.061177745 126 nips-2006-Logistic Regression for Single Trial EEG Classification
17 0.059962455 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
18 0.059704147 58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models
19 0.059573833 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
20 0.059025574 61 nips-2006-Convex Repeated Games and Fenchel Duality
topicId topicWeight
[(0, -0.185), (1, 0.061), (2, -0.051), (3, -0.006), (4, -0.067), (5, 0.156), (6, -0.083), (7, 0.011), (8, -0.051), (9, -0.034), (10, -0.032), (11, -0.012), (12, 0.0), (13, -0.032), (14, 0.0), (15, -0.021), (16, 0.011), (17, -0.022), (18, 0.016), (19, -0.074), (20, 0.035), (21, 0.017), (22, -0.017), (23, 0.016), (24, 0.003), (25, 0.038), (26, -0.086), (27, 0.136), (28, 0.114), (29, 0.002), (30, 0.064), (31, 0.089), (32, 0.032), (33, -0.067), (34, -0.004), (35, -0.029), (36, 0.11), (37, 0.083), (38, -0.098), (39, -0.085), (40, 0.003), (41, 0.043), (42, -0.009), (43, 0.109), (44, 0.024), (45, -0.083), (46, -0.023), (47, -0.067), (48, -0.186), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.93182522 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
2 0.64851809 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
3 0.52962226 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
4 0.51605088 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
Author: S. S. Keerthi, Vikas Sindhwani, Olivier Chapelle
Abstract: We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold crossvalidation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations. . 1
5 0.50610888 129 nips-2006-Map-Reduce for Machine Learning on Multicore
Author: Cheng-tao Chu, Sang K. Kim, Yi-an Lin, Yuanyuan Yu, Gary Bradski, Kunle Olukotun, Andrew Y. Ng
Abstract: We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors. 1
6 0.49560484 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
7 0.49049658 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
8 0.47588769 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
9 0.47099808 50 nips-2006-Chained Boosting
10 0.46177906 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
11 0.45172578 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
12 0.43579629 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
13 0.43176389 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
14 0.42328033 163 nips-2006-Prediction on a Graph with a Perceptron
15 0.42262688 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
16 0.41346103 150 nips-2006-On Transductive Regression
17 0.41131639 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
18 0.40428105 61 nips-2006-Convex Repeated Games and Fenchel Duality
19 0.40010113 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
20 0.39424166 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
topicId topicWeight
[(1, 0.106), (7, 0.049), (9, 0.03), (20, 0.014), (22, 0.06), (44, 0.06), (57, 0.059), (65, 0.504), (69, 0.01)]
simIndex simValue paperId paperTitle
1 0.96089482 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
Author: Marco Cuturi, Kenji Fukumizu
Abstract: We propose a family of kernels for structured objects which is based on the bag-ofcomponents paradigm. However, rather than decomposing each complex object into the single histogram of its components, we use for each object a family of nested histograms, where each histogram in this hierarchy describes the object seen from an increasingly granular perspective. We use this hierarchy of histograms to define elementary kernels which can detect coarse and fine similarities between the objects. We compute through an efficient averaging trick a mixture of such specific kernels, to propose a final kernel value which weights efficiently local and global matches. We propose experimental results on an image retrieval experiment which show that this mixture is an effective template procedure to be used with kernels on histograms.
same-paper 2 0.91661274 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
3 0.91412288 146 nips-2006-No-regret Algorithms for Online Convex Programs
Author: Geoffrey J. Gordon
Abstract: Online convex programming has recently emerged as a powerful primitive for designing machine learning algorithms. For example, OCP can be used for learning a linear classifier, dynamically rebalancing a binary search tree, finding the shortest path in a graph with unknown edge lengths, solving a structured classification problem, or finding a good strategy in an extensive-form game. Several researchers have designed no-regret algorithms for OCP. But, compared to algorithms for special cases of OCP such as learning from expert advice, these algorithms are not very numerous or flexible. In learning from expert advice, one tool which has proved particularly valuable is the correspondence between no-regret algorithms and convex potential functions: by reasoning about these potential functions, researchers have designed algorithms with a wide variety of useful guarantees such as good performance when the target hypothesis is sparse. Until now, there has been no such recipe for the more general OCP problem, and therefore no ability to tune OCP algorithms to take advantage of properties of the problem or data. In this paper we derive a new class of no-regret learning algorithms for OCP. These Lagrangian Hedging algorithms are based on a general class of potential functions, and are a direct generalization of known learning rules like weighted majority and external-regret matching. In addition to proving regret bounds, we demonstrate our algorithms learning to play one-card poker. 1
4 0.89745694 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
Author: Robert Jenssen, Torbjørn Eltoft, Mark Girolami, Deniz Erdogmus
Abstract: We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi’s entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, by replacing kernel PCA by kernel MaxEnt as an intermediate step. This has a major impact on performance.
5 0.89284706 170 nips-2006-Robotic Grasping of Novel Objects
Author: Ashutosh Saxena, Justin Driemeyer, Justin Kearns, Andrew Y. Ng
Abstract: We consider the problem of grasping novel objects, specifically ones that are being seen for the first time through vision. We present a learning algorithm that neither requires, nor tries to build, a 3-d model of the object. Instead it predicts, directly as a function of the images, a point at which to grasp the object. Our algorithm is trained via supervised learning, using synthetic images for the training set. We demonstrate on a robotic manipulation platform that this approach successfully grasps a wide variety of objects, such as wine glasses, duct tape, markers, a translucent box, jugs, knife-cutters, cellphones, keys, screwdrivers, staplers, toothbrushes, a thick coil of wire, a strangely shaped power horn, and others, none of which were seen in the training set. 1
6 0.6130898 203 nips-2006-implicit Online Learning with Kernels
7 0.59598035 61 nips-2006-Convex Repeated Games and Fenchel Duality
8 0.57858753 79 nips-2006-Fast Iterative Kernel PCA
9 0.56492651 115 nips-2006-Learning annotated hierarchies from relational data
10 0.5636158 26 nips-2006-An Approach to Bounded Rationality
11 0.55985188 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
12 0.55087405 82 nips-2006-Gaussian and Wishart Hyperkernels
13 0.53822124 65 nips-2006-Denoising and Dimension Reduction in Feature Space
14 0.53332216 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
15 0.53159976 117 nips-2006-Learning on Graph with Laplacian Regularization
16 0.52855003 150 nips-2006-On Transductive Regression
17 0.52374494 163 nips-2006-Prediction on a Graph with a Perceptron
18 0.52315074 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
19 0.52303803 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
20 0.5220266 109 nips-2006-Learnability and the doubling dimension