nips nips2006 nips2006-156 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a reduction framework from ordinal regression to binary classification based on extended examples. [sent-3, score-1.169]
2 The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. [sent-4, score-1.005]
3 A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. [sent-5, score-0.588]
4 Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. [sent-6, score-2.103]
5 In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. [sent-7, score-1.99]
6 When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. [sent-8, score-0.208]
7 1 Introduction We work on a type of supervised learning problem called ranking or ordinal regression, where examples are labeled by an ordinal scale called the rank. [sent-9, score-1.571]
8 The ratings have a natural order, which distinguishes ordinal regression from general multiclass classification. [sent-11, score-0.816]
9 Recently, many algorithms for ordinal regression have been proposed from a machine learning perspective. [sent-12, score-0.808]
10 For instance, Crammer and Singer [1] generalized the online perceptron algorithm with multiple thresholds to do ordinal regression. [sent-13, score-1.005]
11 In their approach, a perceptron maps an input vector to a latent potential value, which is then thresholded to obtain a rank. [sent-14, score-0.336]
12 Since binary classification is much better studied than ordinal regression, a general framework to systematically reduce the latter to the former can introduce two immediate benefits. [sent-19, score-0.835]
13 First, well-tuned binary classification approaches can be readily transformed into good ordinal regression algorithms, which saves immense efforts in design and implementation. [sent-20, score-1.008]
14 Second, new generalization bounds for ordinal regression can be easily derived from known bounds for binary classification, which saves tremendous efforts in theoretical analysis. [sent-21, score-1.181]
15 The framework is based on extended examples, which are extracted from the original examples and a given mislabeling cost matrix. [sent-23, score-0.387]
16 The binary classifier trained from the extended examples can then be used to construct a ranking rule. [sent-24, score-0.482]
17 We prove that the mislabeling cost of the ranking rule is bounded by a weighted 0/1 loss of the binary classifier. [sent-25, score-0.606]
18 Hence, binary classifiers that generalize well could introduce ranking rules that generalize well. [sent-26, score-0.34]
19 In addition, we show that our framework provides a unified view for many existing ordinal regression algorithms. [sent-28, score-0.861]
20 Theoretical guarantee on the reduction, including derivations of new generalization bounds for ordinal regression, is provided in Section 4. [sent-33, score-0.791]
21 2 The reduction framework In an ordinal regression problem, an example (x, y) is composed of an input vector x ∈ X and an ordinal label (i.e., rank) y ∈ Y = {1, 2, . . . , K}. [sent-35, score-1.603]
22 The generalization error of a ranking rule r : X → Y is then defined as C(r, P) = E_{(x,y)∼P} [ C_{y,r(x)} ], where C is a K × K cost matrix with C_{y,k} being the cost of predicting an example (x, y) as rank k. [sent-45, score-0.706]
23 Given a training set S = {(x_n, y_n)}_{n=1}^{N} containing N examples, the goal is to find a ranking rule r that generalizes well, i.e., one that attains a small C(r, P). [sent-47, score-0.499]
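As a small illustration of this setup, the sketch below (Python, with hypothetical helper names; not code from the paper) computes the empirical counterpart of C(r, P) on a finite sample, using the absolute cost |y − k| as one choice of cost matrix.

```python
import numpy as np

def empirical_cost(C, y_true, y_pred):
    """Average C[y, r(x)] over a sample; ranks are 1..K, shifted to 0-based indices."""
    C = np.asarray(C)
    return float(np.mean(C[np.asarray(y_true) - 1, np.asarray(y_pred) - 1]))

# Absolute cost for K = 4 ranks: C[y, k] = |y - k|.
K = 4
C_abs = np.abs(np.subtract.outer(np.arange(K), np.arange(K)))
print(empirical_cost(C_abs, y_true=[1, 2, 4], y_pred=[1, 4, 3]))  # (0 + 2 + 1) / 3
```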
24 The ordinal information can be interpreted in several ways. [sent-51, score-0.655]
25 Another interpretation is that the mislabeling cost depends on the “closeness” of the prediction. [sent-53, score-0.196]
26 However, because the cost is invariant for all kinds of mislabelings, the ordinal information is not taken into account. [sent-60, score-0.764]
27 In this paper, we shall always assume that the ordinal regression problem under study comes with a cost matrix of V-shaped rows, and discuss how to reduce the ordinal regression problem to a binary classification problem. [sent-65, score-1.879]
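To make the two cost matrices discussed above concrete, here is a minimal sketch (our own helper names, not code from the paper) that builds the classification cost and the absolute cost and checks the V-shape condition on every row.

```python
import numpy as np

def classification_cost(K):
    # C[y, k] = [[y != k]]: every mislabeling costs 1, ignoring the ordinal structure.
    return 1 - np.eye(K)

def absolute_cost(K):
    # C[y, k] = |y - k|: cost grows with the distance between rank and prediction.
    ranks = np.arange(K)
    return np.abs(np.subtract.outer(ranks, ranks)).astype(float)

def is_v_shaped(C):
    # Row y must be non-increasing before column y and non-decreasing after it.
    for y, row in enumerate(C):
        if np.any(np.diff(row[: y + 1]) > 0) or np.any(np.diff(row[y:]) < 0):
            return False
    return True

for C in (classification_cost(5), absolute_cost(5)):
    assert is_v_shaped(C)
```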
28 2.1 Reducing ordinal regression to binary classification The ordinal information allows ranks to be compared. [sent-68, score-1.599]
29 For a fixed k, the question "is the rank of x greater than k?" is exactly a binary classification problem, and the rank of x can be determined by asking multiple questions for k = 1, 2, until (K − 1). [sent-71, score-0.201]
30 Frank and Hall [6] proposed to solve each binary classification problem independently and combine the binary outputs to a rank. [sent-72, score-0.285]
31 First, all the binary classification problems are solved jointly to obtain a single binary classifier. [sent-75, score-0.262]
32 Second, a simpler step is used to convert the binary outputs to a rank, and generalization analysis can immediately follow. [sent-76, score-0.227]
33 Assume that fb (x, k) is a binary classifier for all the associated questions above. [sent-78, score-0.292]
34 Consistent answers would be fb(x, k) = 1 ("yes") for k = 1 until (y′ − 1) for some y′, and 0 ("no") afterwards. [sent-79, score-0.203]
35 Then, a reasonable ranking rule based on the binary answers is r(x) ≡ y′, which for consistent answers equals 1 + Σ_{k=1}^{K−1} [[ fb(x, k) = 1 ]]. [sent-80, score-0.613]
36 Although the definition can be flexibly applied even when fb is not consistent, a consistent fb is usually desired in order to introduce a good ranking rule r. [sent-82, score-0.601]
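A tiny sketch of this conversion (our own illustration, not code from the paper): with answers to the K − 1 questions encoded as 1 for "yes" and 0 for "no", the rank is one plus the number of "yes" answers, which coincides with y′ whenever the answers are consistent.

```python
def rank_from_answers(answers):
    """answers[k-1] == 1 means fb(x, k) answered 'yes' to 'is the rank greater than k?'."""
    return 1 + sum(1 for a in answers if a == 1)

# Consistent answers of a rank-3 example among K = 5 ranks: yes, yes, no, no.
assert rank_from_answers([1, 1, 0, 0]) == 3
```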
37 Furthermore, the ordinal information can help to model the relative confidence in the binary outputs. [sent-83, score-0.786]
38 That is, when k is farther from the rank of x, the answer fb (x, k) should be more confident. [sent-84, score-0.231]
39 The confidence function is converted to a ranking rule via r(x) = 1 + Σ_{k=1}^{K−1} [[ f(x, k) > 0 ]] (1). The ordinal information would naturally require f to be rank-monotonic, i.e., f(x, 1) ≥ f(x, 2) ≥ · · · ≥ f(x, K − 1). [sent-90, score-0.655]
40 Note that a rank-monotonic function f introduces consistent answers fb . [sent-93, score-0.203]
41 Thus the cost of the ranking rule r on an example (x, y) is Σ_{k=1}^{K−1} (C_{y,k} − C_{y,k+1}) [[ f(x, k) ≤ 0 ]] + C_{y,K} = Σ_{k=r(x)}^{K−1} (C_{y,k} − C_{y,k+1}) + C_{y,K} = C_{y,r(x)}. (2) [sent-96, score-0.388]
42 Define the extended examples (x^{(k)}, y^{(k)}) with weights w_{y,k} as x^{(k)} = (x, k), y^{(k)} = 2 [[ k < y ]] − 1, and w_{y,k} = |C_{y,k} − C_{y,k+1}|. (3) Because row y in C is V-shaped, the binary variable y^{(k)} agrees with the sign of (C_{y,k} − C_{y,k+1}) whenever the latter is not zero. [sent-97, score-0.314]
43 Combining these pieces gives C_{y,r(x)} ≤ Σ_{k=1}^{K−1} w_{y,k} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]]. (4) Inequality (4) shows that the cost of r on example (x, y) is bounded by a weighted 0/1 loss of f on the extended examples. [sent-100, score-0.199]
44 Altogether, our reduction framework consists of the following steps: we first use (3) to transform all training examples (x_n, y_n) to extended examples (x_n^{(k)}, y_n^{(k)}) with weights w_{y_n,k} (also denoted as w_n^{(k)}). [sent-104, score-0.951]
45 All the extended examples would then be jointly learned by a binary classifier f with confidence outputs, aiming at a low weighted 0/1 loss. [sent-105, score-0.273]
46 Finally, a ranking rule r is constructed from f using (1). [sent-106, score-0.279]
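The three steps can be sketched as follows; this is a minimal illustration under our own simplifying assumptions (the pair (x, k) is encoded by appending the scalar k to x, whereas the paper's kernel construction appends a scaled basis vector, and `train_weighted_binary` stands for any learner that accepts per-example weights and returns a confidence function f).

```python
import numpy as np

def extend_examples(X, Y, C):
    """Step 1: build extended inputs (x, k), binary labels y^(k), and weights w_{y,k} as in (3)."""
    K = C.shape[0]
    Xe, ye, we = [], [], []
    for x, y in zip(X, Y):                        # y is a rank in 1..K
        for k in range(1, K):                     # questions k = 1, ..., K-1
            Xe.append(np.append(x, k))            # simple scalar encoding of (x, k)
            ye.append(1 if k < y else -1)         # y^(k) = 2[[k < y]] - 1
            we.append(abs(C[y - 1, k - 1] - C[y - 1, k]))  # w_{y,k} = |C_{y,k} - C_{y,k+1}|
    return np.array(Xe), np.array(ye), np.array(we)

def rank_rule(f, x, K):
    """Step 3: construct r(x) = 1 + sum_k [[f(x, k) > 0]] as in (1)."""
    return 1 + sum(int(f(np.append(x, k)) > 0) for k in range(1, K))

# Step 2 (training) is delegated to any weighted binary learner, e.g.:
#   f = train_weighted_binary(Xe, ye, sample_weight=we)
```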
47 Theorem 1 (reduction) An ordinal regression problem with a V-shaped cost matrix C can be reduced to a binary classification problem with the extended examples in (3) and the ranking rule r in (1). [sent-108, score-1.469]
48 If f is rank-monotonic or every row of C is convex, for any example (x, y) and its extended examples (x^{(k)}, y^{(k)}), the weighted sum of the 0/1 loss of f(x^{(k)}) bounds the cost of r(x). [sent-109, score-0.355]
49 The question is then, "when can a binary classification algorithm return ordered thresholds?" [sent-116, score-0.186]
50 First, consider an example with y_n = k + 1. [sent-121, score-0.339]
51 Since y_n^{(k)} = 1 and y_n^{(k+1)} = −1, switching the thresholds θ_k and θ_{k+1} changes the objective value by w_n^{(k)} [ ℓ(g(x_n) − θ_{k+1}) − ℓ(g(x_n) − θ_k) ] + w_n^{(k+1)} [ ℓ(θ_k − g(x_n)) − ℓ(θ_{k+1} − g(x_n)) ], where ℓ is the loss of the underlying thresholded model g. [sent-122, score-0.748]
52 For an example with y_n < k + 1, we have y_n^{(k)} = y_n^{(k+1)} = −1. [sent-124, score-0.66]
53 Note that row y_n of the cost matrix being convex leads to w_n^{(k)} ≤ w_n^{(k+1)}; since ℓ(ρ) is non-increasing, the change above is also non-positive. [sent-127, score-0.735]
54 The case for examples with yn > k + 1 is similar and the change there is also non-positive. [sent-128, score-0.272]
55 With proper choices of the cost matrix, the encoding scheme of (x, k), and the binary learning algorithm, many existing ordinal regression algorithms can be unified in our framework. [sent-135, score-1.123]
56 3.1 Perceptron-based algorithms The perceptron ranking (PRank) algorithm proposed by Crammer and Singer [1] is an online ordinal regression algorithm that employs the thresholded model with f(x, k) = ⟨u, x⟩ − θ_k. [sent-141, score-1.377]
57 Whenever a training example is not predicted correctly, the current u and θ are updated in a way similar to the perceptron learning rule [8]. [sent-142, score-0.309]
58 Thus, when the absolute cost matrix is taken and a modified perceptron learning rule (footnote 2) is used as the underlying binary classification algorithm, the PRank algorithm is a specific instance of our framework. [sent-145, score-0.569]
59 The orderliness of the thresholds is guaranteed by Theorem 2, and the mistake bound is a direct application of the well-known perceptron mistake bound (see for example Freund and Schapire [8]). [sent-146, score-0.54]
60 Our framework not only simplifies the derivation of the mistake bound, but also allows the use of other perceptron algorithms, such as a batch-mode algorithm rather than an online one. [sent-147, score-0.367]
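For instance, one batch-mode possibility is an ordinary weighted perceptron run over the extended examples, with (x, k) encoded as (x, e_k) so that a single weight vector v = (u, −θ) gives ⟨v, z⟩ = ⟨u, x⟩ − θ_k; the sketch below is our own illustration and is not the exact PRank update.

```python
import numpy as np

def encode(x, k, K):
    """Encode (x, k) as (x, e_k); with v = (u, -theta), <v, z> equals <u, x> - theta_k."""
    e = np.zeros(K - 1)
    e[k - 1] = 1.0
    return np.concatenate([np.asarray(x, dtype=float), e])

def batch_weighted_perceptron(extended, dim, K, epochs=10):
    """extended: iterable of ((x, k), label, weight) triples; returns the learned vector v."""
    v = np.zeros(dim + K - 1)
    for _ in range(epochs):
        for (x, k), label, weight in extended:
            z = encode(x, k, K)
            if label * np.dot(v, z) <= 0:         # a weighted mistake on the extended example
                v += weight * label * z           # weighted perceptron update
    return v
```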
61 3.2 SVM-based algorithms SVM [9] can be thought of as a generalized perceptron with a kernel that computes the inner product on transformed input vectors φ(x). [sent-149, score-0.31]
62 For the extended examples (x, k), we can suitably define the extended kernel as the original kernel plus the inner product between the extensions, K((x, k), (x′, k′)) = ⟨φ(x), φ(x′)⟩ + ⟨e_k, e_{k′}⟩. [sent-150, score-0.408]
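A direct sketch of this construction (our own wrapper; with the standard basis vectors e_k, the extension term ⟨e_k, e_{k′}⟩ is simply the indicator of k = k′):

```python
def extended_kernel(base_kernel):
    """Turn a kernel on X into a kernel on extended inputs (x, k)."""
    def kernel(a, b):
        (x1, k1), (x2, k2) = a, b
        # <phi(x), phi(x')> + <e_k, e_{k'}> = base_kernel(x, x') + [[k == k']]
        return base_kernel(x1, x2) + (1.0 if k1 == k2 else 0.0)
    return kernel

# Usage with a linear base kernel:
linear = lambda u, v: float(sum(ui * vi for ui, vi in zip(u, v)))
K_ext = extended_kernel(linear)
print(K_ext(([1.0, 2.0], 1), ([0.5, 1.0], 1)))   # 2.5 + 1.0 = 3.5
```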
63 Then, several SVM-based approaches for ordinal regression are special instances of our framework. [sent-151, score-0.811]
64 When E = γ I_{K−1} and the traditional soft-margin SVM are used in our framework, the binary classifier f(x, k) has the form ⟨u, φ(x)⟩ − θ_k − b, and can be obtained by solving min_{u,θ,b} ||u||² + ||θ||²/γ² + κ Σ_{n=1}^{N} Σ_{k=1}^{K−1} w_n^{(k)} max(0, 1 − y_n^{(k)} (⟨u, φ(x_n)⟩ − θ_k − b)). [sent-154, score-0.505]
65 This finding can also be explained by reduction: SVOR-EXP is an instance of our framework using the classification cost and SVOR-IMC comes from using the absolute cost. [sent-159, score-0.247]
66 If the unmodified soft-margin SVM (7) is directly used in our framework with the absolute cost, we obtain a new support vector ordinal regression formulation. [sent-161, score-0.905]
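In practice, one way to instantiate such a formulation is to hand the extended examples and their weights to an off-the-shelf weighted soft-margin SVM; the sketch below uses scikit-learn as one such implementation (our choice, not the paper's, which works with LIBSVM) and reuses `extend_examples` and the scalar encoding from the earlier sketch.

```python
import numpy as np
from sklearn.svm import SVC

def fit_reduction_svm(Xe, ye, we, kappa=1.0):
    """Weighted binary SVM on the extended examples; the weights carry the absolute cost."""
    clf = SVC(kernel="rbf", C=kappa)
    clf.fit(Xe, ye, sample_weight=we)
    return clf

def predict_rank(clf, x, K):
    """r(x) = 1 + sum_k [[f(x, k) > 0]], with f taken as the SVM decision function."""
    return 1 + sum(
        int(clf.decision_function([np.append(x, k)])[0] > 0) for k in range(1, K)
    )
```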
67 4 Generalization bounds With the extended examples, new generalization bounds can be derived for ordinal regression problems with any cost matrix. [sent-165, score-1.183]
68 A simple result that comes immediately from (4) is the following theorem. (Footnote 2: to precisely replicate the PRank algorithm, the (K − 1) extended examples sprouted from the same example should be considered altogether in updating the perceptron weight vector.) [sent-166, score-0.43]
69 Theorem 3 (reduction of generalization error) Let c_y = C_{y,1} + C_{y,K} and c = max_y c_y. [sent-168, score-0.411]
70 If f is rank-monotonic or every row of C is convex, there exists a distribution P̂ on (X, Y), where X contains the encoding of (x, k) and Y is a binary label, such that E_{(x,y)∼P} C_{y,r(x)} ≤ c · E_{(X,Y)∼P̂} [[ Y f(X) ≤ 0 ]]. [sent-169, score-0.22]
71 Given the conditions, following (4), we have C_{y,r(x)} ≤ Σ_{k=1}^{K−1} w_{y,k} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]] = c_y · E_{k∼P_k} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]], where P_k(k | y) = w_{y,k} / c_y is a probability distribution because c_y = Σ_{k=1}^{K−1} w_{y,k}. [sent-171, score-0.306]
72 Then, the generalization error of r is E_{(x,y)∼P} C_{y,r(x)} ≤ E_{(x,y)∼P} [ c_y · E_{k∼P_k} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]] ] ≤ c · E_{(x^{(k)},y^{(k)})∼P̂} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]]. (8) [sent-173, score-0.226]
73 Theorem 3 shows that, if the binary classifier f generalizes well when examples are sampled from P̂, the constructed ranking rule would also generalize well. [sent-174, score-0.462]
74 The terms y^{(k)} f(x^{(k)}), which are exactly the margins of the associated binary classifier fb(x, k), would be analogously called the margins for ordinal regression, and are expected to be positive and large for correct and confident predictions. [sent-175, score-1.057]
75 However, the bound is not data-dependent, and hence does not fully explain the generalization performance of large-margin ranking rules in reality (for more discussions on data-dependent bounds, see the work of, for example, Bartlett and Shawe-Taylor [10]). [sent-182, score-0.334]
76 Next, we show how a novel data-dependent bound for SVM-based ordinal regression approaches can be derived from our reduction framework. [sent-183, score-0.977]
77 Theorem 4 (data-dependent bound for support vector ordinal regression) Assume that f(x, k) belongs to { f : (x, k) ↦ ⟨u, φ(x)⟩ − θ_k, ||u||² + ||θ||² ≤ 1 }, with ||φ(x)||² + 1 ≤ R². [sent-188, score-0.733]
78 If θ is ordered or every row of C is convex, for any margin criterion ∆, with probability at least 1 − δ, every ranking rule r based on f has generalization error no more than (β/N) Σ_{n=1}^{N} Σ_{k=1}^{K−1} w_n^{(k)} [[ y_n^{(k)} f(x_n^{(k)}) ≤ ∆ ]] + O( (log N / √N) · (R/∆), √(log(1/δ) / N) ), where β = max_y c_y / min_y c_y. [sent-189, score-0.899]
79 PROOF Consider the extended training set Ŝ = { (x_n^{(k)}, y_n^{(k)}) }, which contains N(K − 1) elements. [sent-190, score-0.463]
80 One way to extract such a subset is to choose independent k_n from P_k(k | y_n) for each (x_n, y_n). [sent-201, score-0.495]
81 The subset would be named T = { (x_n^{(k_n)}, y_n^{(k_n)}) }_{n=1}^{N}. [sent-202, score-0.22]
82 For independent outcomes from P̂, which is the case of T, E_{(x^{(k)},y^{(k)})∼P̂} [[ y^{(k)} f(x^{(k)}) ≤ 0 ]] ≤ (1/N) Σ_{n=1}^{N} [[ y_n^{(k_n)} f(x_n^{(k_n)}) ≤ ∆ ]] + O( (log N / √N) · (R/∆), √(log(1/δ) / N) ). (9) [sent-206, score-0.262]
83 Table 1: Test error with the absolute cost on the benchmark data sets pyrimidines, machine, boston, abalone, bank, computer, california, and census. [sent-207, score-0.251]
84 Let b_n = [[ y_n^{(k_n)} f(x_n^{(k_n)}) ≤ ∆ ]] be a Boolean random variable introduced by k_n ∼ P_k(k | y_n). [sent-289, score-0.55]
85 The variable has mean c_{y_n}^{−1} · Σ_{k=1}^{K−1} w_n^{(k)} [[ y_n^{(k)} f(x_n^{(k)}) ≤ ∆ ]]. [sent-290, score-0.374]
86 An extended Chernoff bound shows that when each b_n is chosen independently, with probability at least (1 − δ/2) over the choice of b_n, (1/N) Σ_{n=1}^{N} b_n ≤ (1/N) Σ_{n=1}^{N} (1/c_{y_n}) Σ_{k=1}^{K−1} w_n^{(k)} [[ y_n^{(k)} f(x_n^{(k)}) ≤ ∆ ]] + O( √(log(1/δ) / N) ). [sent-291, score-1.121]
87 We tested our framework with E = γ I_{K−1} and three different binary classification algorithms. [sent-297, score-0.18]
88 The third one is SVM with the perceptron kernel [12], with a simple setting of γ = 1. [sent-301, score-0.287]
89 We used the perceptron kernel instead to gain the advantage of faster parameter selection. [sent-303, score-0.287]
90 For a fair comparison, we also implemented SVOR-IMC with the perceptron kernel and the same parameter selection procedure in LIBSVM. [sent-308, score-0.287]
91 Within the three SVM-based approaches, the two with the perceptron kernel are better than SVOR-IMC with the Gaussian kernel in test performance. [sent-309, score-0.335]
92 Our direct reduction to the standard SVM performs similarly to SVOR-IMC with the same perceptron kernel, but is much easier to implement. [sent-310, score-0.353]
93 With our reduction framework, all three binary learning algorithms could be better than SVOR-IMC with the Gaussian kernel on some of the data sets, which demonstrates that they achieve decent out-of-sample performance. [sent-315, score-0.316]
94 Figure 1: Training time (including automatic parameter selection) of the SVM-based approaches with the perceptron kernel, comparing the reduction with SVOR-IMC on the bank, computer, california, and census data sets. The results are averaged CPU time. [sent-317, score-0.524]
95 This difference demonstrates another advantage of our reduction framework: improvements to binary classification approaches can be immediately inherited by reduction-based ordinal regression algorithms. [sent-320, score-1.056]
96 6 Conclusion We presented a reduction framework from ordinal regression to binary classification based on extended examples. [sent-321, score-1.169]
97 The framework has the flexibility to work with any reasonable cost matrix and any binary classifiers. [sent-322, score-0.312]
98 We demonstrated the algorithmic advantages of the framework in designing new ordinal regression algorithms and explaining existing algorithms. [sent-323, score-0.935]
99 We also showed that the framework can be used to derive new generalization bounds for ordinal regression. [sent-324, score-0.84]
wordName wordTfidf (topN-words)
[('ordinal', 0.655), ('perceptron', 0.239), ('yn', 0.22), ('ranking', 0.209), ('fb', 0.161), ('wn', 0.154), ('cy', 0.153), ('chu', 0.145), ('binary', 0.131), ('regression', 0.13), ('classi', 0.123), ('keerthi', 0.116), ('reduction', 0.114), ('cost', 0.109), ('thresholded', 0.097), ('xn', 0.093), ('extended', 0.09), ('mislabeling', 0.087), ('thresholds', 0.087), ('pk', 0.083), ('generalization', 0.073), ('rule', 0.07), ('rank', 0.07), ('lin', 0.069), ('prank', 0.066), ('rajaram', 0.066), ('bounds', 0.063), ('svm', 0.058), ('bn', 0.055), ('margins', 0.055), ('kn', 0.055), ('mistake', 0.055), ('ordered', 0.055), ('examples', 0.052), ('shashua', 0.052), ('ling', 0.052), ('roof', 0.052), ('libsvm', 0.052), ('bound', 0.052), ('framework', 0.049), ('kernel', 0.048), ('er', 0.048), ('ik', 0.048), ('encoding', 0.048), ('absolute', 0.045), ('def', 0.043), ('answers', 0.042), ('outcomes', 0.042), ('cation', 0.041), ('row', 0.041), ('rows', 0.041), ('crammer', 0.04), ('ek', 0.04), ('levin', 0.038), ('optimizers', 0.038), ('saves', 0.038), ('theorem', 0.038), ('bartlett', 0.037), ('usefulness', 0.036), ('formulations', 0.036), ('census', 0.035), ('bank', 0.035), ('uni', 0.034), ('convex', 0.034), ('dence', 0.033), ('raedt', 0.032), ('maxy', 0.032), ('caltech', 0.032), ('lecture', 0.032), ('switching', 0.032), ('margin', 0.031), ('multiclass', 0.031), ('ecml', 0.03), ('frank', 0.03), ('herbrich', 0.03), ('algorithmic', 0.029), ('notes', 0.028), ('ranks', 0.028), ('efforts', 0.028), ('singer', 0.028), ('california', 0.027), ('benchmark', 0.027), ('existing', 0.027), ('altogether', 0.027), ('movie', 0.027), ('approaches', 0.026), ('decreasing', 0.026), ('support', 0.026), ('shall', 0.024), ('adaboost', 0.024), ('boolean', 0.024), ('online', 0.024), ('freund', 0.023), ('outputs', 0.023), ('algorithms', 0.023), ('matrix', 0.023), ('advantages', 0.022), ('instance', 0.022), ('strictly', 0.022), ('comes', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
2 0.12060589 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
Author: Yonatan Amit, Shai Shalev-shwartz, Yoram Singer
Abstract: We describe and analyze an algorithmic framework for online classification where each online trial consists of multiple prediction tasks that are tied together. We tackle the problem of updating the online hypothesis by defining a projection problem in which each prediction task corresponds to a single linear constraint. These constraints are tied together through a single slack parameter. We then introduce a general method for approximately solving the problem by projecting simultaneously and independently on each constraint which corresponds to a prediction sub-problem, and then averaging the individual solutions. We show that this approach constitutes a feasible, albeit not necessarily optimal, solution for the original projection problem. We derive concrete simultaneous projection schemes and analyze them in the mistake bound model. We demonstrate the power of the proposed algorithm in experiments with online multiclass text categorization. Our experiments indicate that a combination of class-dependent features with the simultaneous projection method outperforms previously studied algorithms. 1
3 0.098912008 163 nips-2006-Prediction on a Graph with a Perceptron
Author: Mark Herbster, Massimiliano Pontil
Abstract: We study the problem of online prediction of a noisy labeling of a graph with the perceptron. We address both label noise and concept noise. Graph learning is framed as an instance of prediction on a finite set. To treat label noise we show that the hinge loss bounds derived by Gentile [1] for online perceptron learning can be transformed to relative mistake bounds with an optimal leading constant when applied to prediction on a finite set. These bounds depend crucially on the norm of the learned concept. Often the norm of a concept can vary dramatically with only small perturbations in a labeling. We analyze a simple transformation that stabilizes the norm under perturbations. We derive an upper bound that depends only on natural properties of the graph – the graph diameter and the cut size of a partitioning of the graph – which are only indirectly dependent on the size of the graph. The impossibility of such bounds for the graph geodesic nearest neighbors algorithm will be demonstrated. 1
4 0.090978153 20 nips-2006-Active learning for misspecified generalized linear models
Author: Francis R. Bach
Abstract: Active learning refers to algorithmic frameworks aimed at selecting training data points in order to reduce the number of required training data points and/or improve the generalization performance of a learning method. In this paper, we present an asymptotic analysis of active learning for generalized linear models. Our analysis holds under the common practical situation of model misspecification, and is based on realistic assumptions regarding the nature of the sampling distributions, which are usually neither independent nor identical. We derive unbiased estimators of generalization performance, as well as estimators of expected reduction in generalization error after adding a new training data point, that allow us to optimize its sampling distribution through a convex optimization problem. Our analysis naturally leads to an algorithm for sequential active learning which is applicable for all tasks supported by generalized linear models (e.g., binary classification, multi-class classification, regression) and can be applied in non-linear settings through the use of Mercer kernels. 1
5 0.086557686 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
6 0.085301481 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
7 0.082945809 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
8 0.075339742 65 nips-2006-Denoising and Dimension Reduction in Feature Space
9 0.075295039 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
10 0.073301353 193 nips-2006-Tighter PAC-Bayes Bounds
11 0.072172403 116 nips-2006-Learning from Multiple Sources
12 0.067444101 150 nips-2006-On Transductive Regression
13 0.067274161 50 nips-2006-Chained Boosting
14 0.065985449 11 nips-2006-A PAC-Bayes Risk Bound for General Loss Functions
15 0.06468413 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
16 0.061177745 126 nips-2006-Logistic Regression for Single Trial EEG Classification
17 0.059962455 178 nips-2006-Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation
18 0.059704147 58 nips-2006-Context Effects in Category Learning: An Investigation of Four Probabilistic Models
19 0.059573833 164 nips-2006-Randomized PCA Algorithms with Regret Bounds that are Logarithmic in the Dimension
20 0.059025574 61 nips-2006-Convex Repeated Games and Fenchel Duality
topicId topicWeight
[(0, -0.185), (1, 0.061), (2, -0.051), (3, -0.006), (4, -0.067), (5, 0.156), (6, -0.083), (7, 0.011), (8, -0.051), (9, -0.034), (10, -0.032), (11, -0.012), (12, 0.0), (13, -0.032), (14, 0.0), (15, -0.021), (16, 0.011), (17, -0.022), (18, 0.016), (19, -0.074), (20, 0.035), (21, 0.017), (22, -0.017), (23, 0.016), (24, 0.003), (25, 0.038), (26, -0.086), (27, 0.136), (28, 0.114), (29, 0.002), (30, 0.064), (31, 0.089), (32, 0.032), (33, -0.067), (34, -0.004), (35, -0.029), (36, 0.11), (37, 0.083), (38, -0.098), (39, -0.085), (40, 0.003), (41, 0.043), (42, -0.009), (43, 0.109), (44, 0.024), (45, -0.083), (46, -0.023), (47, -0.067), (48, -0.186), (49, -0.014)]
simIndex simValue paperId paperTitle
same-paper 1 0.93182522 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
2 0.64851809 186 nips-2006-Support Vector Machines on a Budget
Author: Ofer Dekel, Yoram Singer
Abstract: The standard Support Vector Machine formulation does not provide its user with the ability to explicitly control the number of support vectors used to define the generated classifier. We present a modified version of SVM that allows the user to set a budget parameter B and focuses on minimizing the loss attained by the B worst-classified examples while ignoring the remaining examples. This idea can be used to derive sparse versions of both L1-SVM and L2-SVM. Technically, we obtain these new SVM variants by replacing the 1-norm in the standard SVM formulation with various interpolation-norms. We also adapt the SMO optimization algorithm to our setting and report on some preliminary experimental results. 1
3 0.52962226 193 nips-2006-Tighter PAC-Bayes Bounds
Author: Amiran Ambroladze, Emilio Parrado-hernández, John S. Shawe-taylor
Abstract: This paper proposes a PAC-Bayes bound to measure the performance of Support Vector Machine (SVM) classifiers. The bound is based on learning a prior over the distribution of classifiers with a part of the training samples. Experimental work shows that this bound is tighter than the original PAC-Bayes, resulting in an enhancement of the predictive capabilities of the PAC-Bayes bound. In addition, it is shown that the use of this bound as a means to estimate the hyperparameters of the classifier compares favourably with cross validation in terms of accuracy of the model, while saving a lot of computational burden. 1
4 0.51605088 28 nips-2006-An Efficient Method for Gradient-Based Adaptation of Hyperparameters in SVM Models
Author: S. S. Keerthi, Vikas Sindhwani, Olivier Chapelle
Abstract: We consider the task of tuning hyperparameters in SVM models based on minimizing a smooth performance validation function, e.g., smoothed k-fold crossvalidation error, using non-linear optimization techniques. The key computation in this approach is that of the gradient of the validation function with respect to hyperparameters. We show that for large-scale problems involving a wide choice of kernel-based models and validation functions, this computation can be very efficiently done; often within just a fraction of the training time. Empirical results show that a near-optimal set of hyperparameters can be identified by our approach with very few training rounds and gradient computations. . 1
5 0.50610888 129 nips-2006-Map-Reduce for Machine Learning on Multicore
Author: Cheng-tao Chu, Sang K. Kim, Yi-an Lin, Yuanyuan Yu, Gary Bradski, Kunle Olukotun, Andrew Y. Ng
Abstract: We are at the beginning of the multicore era. Computers will have increasingly many cores (processors), but there is still no good programming framework for these architectures, and thus no simple and unified way for machine learning to take advantage of the potential speed up. In this paper, we develop a broadly applicable parallel programming method, one that is easily applied to many different learning algorithms. Our work is in distinct contrast to the tradition in machine learning of designing (often ingenious) ways to speed up a single algorithm at a time. Specifically, we show that algorithms that fit the Statistical Query model [15] can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers. We adapt Google’s map-reduce [7] paradigm to demonstrate this parallel speed up technique on a variety of learning algorithms including locally weighted linear regression (LWLR), k-means, logistic regression (LR), naive Bayes (NB), SVM, ICA, PCA, gaussian discriminant analysis (GDA), EM, and backpropagation (NN). Our experimental results show basically linear speedup with an increasing number of processors. 1
6 0.49560484 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
7 0.49049658 106 nips-2006-Large Margin Hidden Markov Models for Automatic Speech Recognition
8 0.47588769 73 nips-2006-Efficient Methods for Privacy Preserving Face Detection
9 0.47099808 50 nips-2006-Chained Boosting
10 0.46177906 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
11 0.45172578 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
12 0.43579629 173 nips-2006-Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds
13 0.43176389 63 nips-2006-Cross-Validation Optimization for Large Scale Hierarchical Classification Kernel Methods
14 0.42328033 163 nips-2006-Prediction on a Graph with a Perceptron
15 0.42262688 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions
16 0.41346103 150 nips-2006-On Transductive Regression
17 0.41131639 140 nips-2006-Multiple Instance Learning for Computer Aided Diagnosis
18 0.40428105 61 nips-2006-Convex Repeated Games and Fenchel Duality
19 0.40010113 177 nips-2006-Sparse Kernel Orthonormalized PLS for feature extraction in large data sets
20 0.39424166 89 nips-2006-Handling Advertisements of Unknown Quality in Search Advertising
topicId topicWeight
[(1, 0.106), (7, 0.049), (9, 0.03), (20, 0.014), (22, 0.06), (44, 0.06), (57, 0.059), (65, 0.504), (69, 0.01)]
simIndex simValue paperId paperTitle
1 0.96089482 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
Author: Marco Cuturi, Kenji Fukumizu
Abstract: We propose a family of kernels for structured objects which is based on the bag-ofcomponents paradigm. However, rather than decomposing each complex object into the single histogram of its components, we use for each object a family of nested histograms, where each histogram in this hierarchy describes the object seen from an increasingly granular perspective. We use this hierarchy of histograms to define elementary kernels which can detect coarse and fine similarities between the objects. We compute through an efficient averaging trick a mixture of such specific kernels, to propose a final kernel value which weights efficiently local and global matches. We propose experimental results on an image retrieval experiment which show that this mixture is an effective template procedure to be used with kernels on histograms.
same-paper 2 0.91661274 156 nips-2006-Ordinal Regression by Extended Binary Classification
Author: Ling Li, Hsuan-tien Lin
Abstract: We present a reduction framework from ordinal regression to binary classification based on extended examples. The framework consists of three steps: extracting extended examples from the original examples, learning a binary classifier on the extended examples with any binary classification algorithm, and constructing a ranking rule from the binary classifier. A weighted 0/1 loss of the binary classifier would then bound the mislabeling cost of the ranking rule. Our framework allows not only to design good ordinal regression algorithms based on well-tuned binary classification approaches, but also to derive new generalization bounds for ordinal regression from known bounds for binary classification. In addition, our framework unifies many existing ordinal regression algorithms, such as perceptron ranking and support vector ordinal regression. When compared empirically on benchmark data sets, some of our newly designed algorithms enjoy advantages in terms of both training speed and generalization performance over existing algorithms, which demonstrates the usefulness of our framework. 1
3 0.91412288 146 nips-2006-No-regret Algorithms for Online Convex Programs
Author: Geoffrey J. Gordon
Abstract: Online convex programming has recently emerged as a powerful primitive for designing machine learning algorithms. For example, OCP can be used for learning a linear classifier, dynamically rebalancing a binary search tree, finding the shortest path in a graph with unknown edge lengths, solving a structured classification problem, or finding a good strategy in an extensive-form game. Several researchers have designed no-regret algorithms for OCP. But, compared to algorithms for special cases of OCP such as learning from expert advice, these algorithms are not very numerous or flexible. In learning from expert advice, one tool which has proved particularly valuable is the correspondence between no-regret algorithms and convex potential functions: by reasoning about these potential functions, researchers have designed algorithms with a wide variety of useful guarantees such as good performance when the target hypothesis is sparse. Until now, there has been no such recipe for the more general OCP problem, and therefore no ability to tune OCP algorithms to take advantage of properties of the problem or data. In this paper we derive a new class of no-regret learning algorithms for OCP. These Lagrangian Hedging algorithms are based on a general class of potential functions, and are a direct generalization of known learning rules like weighted majority and external-regret matching. In addition to proving regret bounds, we demonstrate our algorithms learning to play one-card poker. 1
4 0.89745694 102 nips-2006-Kernel Maximum Entropy Data Transformation and an Enhanced Spectral Clustering Algorithm
Author: Robert Jenssen, Torbjørn Eltoft, Mark Girolami, Deniz Erdogmus
Abstract: We propose a new kernel-based data transformation technique. It is founded on the principle of maximum entropy (MaxEnt) preservation, hence named kernel MaxEnt. The key measure is Renyi’s entropy estimated via Parzen windowing. We show that kernel MaxEnt is based on eigenvectors, and is in that sense similar to kernel PCA, but may produce strikingly different transformed data sets. An enhanced spectral clustering algorithm is proposed, by replacing kernel PCA by kernel MaxEnt as an intermediate step. This has a major impact on performance.
5 0.89284706 170 nips-2006-Robotic Grasping of Novel Objects
Author: Ashutosh Saxena, Justin Driemeyer, Justin Kearns, Andrew Y. Ng
Abstract: We consider the problem of grasping novel objects, specifically ones that are being seen for the first time through vision. We present a learning algorithm that neither requires, nor tries to build, a 3-d model of the object. Instead it predicts, directly as a function of the images, a point at which to grasp the object. Our algorithm is trained via supervised learning, using synthetic images for the training set. We demonstrate on a robotic manipulation platform that this approach successfully grasps a wide variety of objects, such as wine glasses, duct tape, markers, a translucent box, jugs, knife-cutters, cellphones, keys, screwdrivers, staplers, toothbrushes, a thick coil of wire, a strangely shaped power horn, and others, none of which were seen in the training set. 1
6 0.6130898 203 nips-2006-implicit Online Learning with Kernels
7 0.59598035 61 nips-2006-Convex Repeated Games and Fenchel Duality
8 0.57858753 79 nips-2006-Fast Iterative Kernel PCA
9 0.56492651 115 nips-2006-Learning annotated hierarchies from relational data
10 0.5636158 26 nips-2006-An Approach to Bounded Rationality
11 0.55985188 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
12 0.55087405 82 nips-2006-Gaussian and Wishart Hyperkernels
13 0.53822124 65 nips-2006-Denoising and Dimension Reduction in Feature Space
14 0.53332216 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
15 0.53159976 117 nips-2006-Learning on Graph with Laplacian Regularization
16 0.52855003 150 nips-2006-On Transductive Regression
17 0.52374494 163 nips-2006-Prediction on a Graph with a Perceptron
18 0.52315074 84 nips-2006-Generalized Regularized Least-Squares Learning with Predefined Features in a Hilbert Space
19 0.52303803 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
20 0.5220266 109 nips-2006-Learnability and the doubling dimension