nips nips2005 nips2005-114 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. [sent-5, score-0.709]
2 Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. [sent-8, score-0.603]
3 Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. [sent-9, score-0.174]
4 1 Introduction Many machine learning applications depend on accurately ordering the elements of a set based on the known ordering of only some of its elements. [sent-10, score-0.288]
5 In the literature, variants of this problem have been referred to as ordinal regression, ranking, and learning of preference relations. [sent-11, score-0.282]
6 Formally, we want to find a function f : ℜn → ℜ such that, for a set of test samples {xk ∈ ℜn }, the output of the function f (xk ) can be sorted to obtain a ranking. [sent-12, score-0.067]
7 In order to learn such a function we are provided with training data, A, containing S sets (or classes) of training samples: A = ∪_{j=1}^{S} (A^j = {x_i^j}_{i=1}^{m_j}), where the j-th set A^j contains m_j samples, so that we have a total of m = Σ_{j=1}^{S} m_j samples in A. [sent-13, score-0.558]
8 Further, we are also provided with a directed order graph G = (S, E), each of whose vertices corresponds to a class A^j, and the existence of a directed edge E_{PQ}—corresponding to A^P → A^Q—means that all training samples x^p ∈ A^P should be ranked higher than any sample x^q ∈ A^Q: i. [sent-14, score-0.726]
9 In general, the number of constraints on the ranking function grows as O(m²), so that naive solutions are computationally infeasible even for moderate-sized training sets with a few thousand samples. [sent-17, score-0.949]
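To make this growth concrete, here is a small illustrative sketch (our own, not from the paper; the dict layout and the chain order graph are assumptions) that counts the pairwise constraints a naive formulation would generate:

```python
import numpy as np

# Toy ordered classes A^1, ..., A^5 with 500 samples each in R^10 (illustrative only).
rng = np.random.default_rng(0)
classes = {j: rng.normal(size=(500, 10)) for j in range(1, 6)}   # m = 2,500 samples
edges = [(j, j + 1) for j in range(1, 5)]                        # chain order graph

# One constraint per cross-class pair (x^p, x^q) for every edge A^P -> A^Q.
n_constraints = sum(classes[p].shape[0] * classes[q].shape[0] for p, q in edges)
print(n_constraints)   # 4 * 500 * 500 = 1,000,000 constraints from only 2,500 samples
```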
10 Hence, we propose a more stringent problem with a larger (infinite) set of constraints, which is nevertheless much more tractable to solve. [sent-18, score-0.036]
11 In particular, we modify the constraints to: ∀ (x^p ∈ CH(A^P), x^q ∈ CH(A^Q)), f(x^p) ≤ f(x^q), where CH(A^j) denotes the set of all points in the convex hull of A^j. [sent-19, score-0.502]
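The reason these convex-hull constraints remain manageable for a linear scoring function is that a linear f attains its extreme values over CH(A^j) at the original training points; the check below is our own small illustration of that fact (all names and data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
A_p = rng.normal(loc=+2.0, size=(50, 5))   # samples of one class
A_q = rng.normal(loc=-2.0, size=(60, 5))   # samples of another class
w = rng.normal(size=5)                     # any linear scoring direction, f(x) = w'x

def random_hull_point(A):
    lam = rng.dirichlet(np.ones(A.shape[0]))   # convex weights: lam >= 0, sum(lam) = 1
    return lam @ A                             # a point inside CH(A)

# Scores of convex combinations are convex combinations of the scores, so they always
# lie between the minimum and maximum score attained on the original samples.
for _ in range(1000):
    xp, xq = random_hull_point(A_p), random_hull_point(A_q)
    assert (A_p @ w).min() - 1e-9 <= xp @ w <= (A_p @ w).max() + 1e-9
    assert (A_q @ w).min() - 1e-9 <= xq @ w <= (A_q @ w).max() + 1e-9
```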
12 We show how this leads to: (a) a family of approximations to the original problem; and (b) considerably more efficient solutions that still enforce all of the original inter-group order constraints. [sent-20, score-0.073]
13 Notice that this formulation subsumes the standard ranking problem (e. [sent-21, score-0.564]
14 Each problem instance is defined by an order graph. [sent-24, score-0.043]
15 (a-d) A succession of order graphs with an increasing number of constraints; (e-f) two order graphs defining the same partial ordering but different problem instances. [sent-25, score-0.505]
16 However, as illustrated in Figure 1, the formulation is more general and does not require a total ordering of the sets of training samples Aj , i. [sent-27, score-0.398]
17 it allows any order graph G to be incorporated into the problem. [sent-29, score-0.168]
18 1.1 Generalized Wilcoxon-Mann-Whitney Statistics A distinction is usually made between classification and ordinal regression methods on one hand, and ranking on the other. [sent-31, score-0.85]
19 In particular, the loss functions used for classification and ordinal regression evaluate whether each test sample is correctly classified: in other words, the loss functions that are used to evaluate these algorithms—e. [sent-32, score-0.43]
20 the 0–1 loss for binary classification—are computed for every sample individually, and then averaged over the training or test set. [sent-34, score-0.144]
21 By contrast, bipartite ranking solutions are evaluated using the Wilcoxon-Mann-Whitney (WMW) statistic which measures the (sample averaged) probability that any pair of samples is ordered correctly; intuitively, the WMW statistic may be interpreted as the area under the ROC curve (AUC). [sent-35, score-0.791]
22 We define a slight generalization of the WMW statistic that accounts for our notion of class-ordering: WMW(f, A) = [ Σ_{E_{ij}} Σ_{k=1}^{m_i} Σ_{l=1}^{m_j} δ( f(x_k^i) < f(x_l^j) ) ] / [ Σ_{E_{ij}} Σ_{k=1}^{m_i} Σ_{l=1}^{m_j} 1 ]. [sent-36, score-0.386]
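As a concrete illustration, a minimal numpy sketch of this statistic could look like the following (the dict-of-scores layout and the function name are ours, not the authors'):

```python
import numpy as np

def generalized_wmw(scores, edges):
    """Generalized WMW statistic over an order graph.

    scores : dict mapping class label j to a 1-D array of f(x) values for A^j.
    edges  : iterable of (i, j) pairs, one per directed edge E_ij of the graph.
    """
    hits, total = 0, 0
    for i, j in edges:
        fi, fj = np.asarray(scores[i]), np.asarray(scores[j])
        # numerator: pairs (k, l) with f(x_k^i) < f(x_l^j); denominator: all pairs
        hits += np.sum(fi[:, None] < fj[None, :])
        total += fi.size * fj.size
    return hits / total
```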
23 1.2 Previous Work Ordinal regression and methods for handling structured output classes: For a classic description of generalized linear models for ordinal regression, see [11]. [sent-39, score-0.531]
24 A non-parametric Bayesian model for ordinal regression based on Gaussian processes (GP) was defined in [1]. [sent-40, score-0.369]
25 Several recent machine learning papers consider structured output classes: e. [sent-41, score-0.077]
26 [13] presents SVM based algorithms for handling structured and interdependent output spaces, and [5] discusses automatic document categorization into pre-defined hierarchies or taxonomies of topics. [sent-43, score-0.215]
27 Learning Rankings: The problem of learning rankings was first treated as a classification problem on pairs of objects by Herbrich [4] and subsequently used on a web page ranking task by Joachims [6]; a variety of authors have investigated this approach recently. [sent-44, score-0.581]
28 The major advantage of this approach is that it considers a more explicit notion of ordering. However, the naive optimization strategy proposed there suffers from the O(m²) growth in the number of constraints mentioned in the previous section. [sent-45, score-0.252]
29 This computational burden renders these methods impractical even for medium-sized datasets with a few thousand samples. [sent-46, score-0.198]
30 Relationship to the proposed work: Our algorithm penalizes wrong ordering of pairs of training instances in order to learn ranking functions (similar to [4]), but in addition, it can also utilize the notion of a structured class order graph. [sent-49, score-0.933]
31 Nevertheless, using a formulation based on constraints over convex hulls of the training classes, our method avoids the prohibitive computational complexity of the previous algorithms for ranking. [sent-50, score-0.43]
32 1.3 Notation and Background In the following, vectors will be assumed to be column vectors unless transposed to a row vector by a prime superscript ′. [sent-52, score-0.029]
33 Vectors and samples are assumed to lie in the n-dimensional real space ℜ^n; the cardinality of a set A will be denoted by #(A). [sent-53, score-0.066]
34 The scalar (inner) product of two vectors x and y in the n-dimensional real space ℜ^n will be denoted by x′y, and the 2-norm of x will be denoted by ‖x‖. [sent-54, score-0.074]
35 For a matrix A ∈ ℜm×n , Ai is the ith row of A which is a row vector in ℜn , while A·j is the jth column of A. [sent-55, score-0.029]
36 A column vector of ones of arbitrary dimension will be denoted by e. [sent-56, score-0.095]
37 In particular, if x and y are column vectors in ℜn then, K(x′ , y) is a real number, K(x′ , A′ ) is a row vector in ℜm and K(A, A′ ) is an m × m matrix. [sent-58, score-0.029]
38 The identity matrix of arbitrary dimension will be denoted by I. [sent-59, score-0.066]
39 2 Convex Hull formulation We are interested in learning a ranking function f : ℜn → ℜ given known ranking relationships between some training instances Ai , Aj ⊂ A. [sent-60, score-1.153]
40 In other words, for binary classifiers, we want a linear ranking function f_w(x) = w′x that satisfies the following constraints: ∀ (x^+ ∈ A^+, x^- ∈ A^-), f(x^-) ≤ f(x^+) ⇒ f(x^-) − f(x^+) = w′x^- − w′x^+ ≤ −1 ≤ 0. [sent-62, score-0.566]
41 (1) Clearly, the number of constraints grows as O(m+ m− ), which is roughly quadratic in the number of training samples (unless we have severe class imbalance). [sent-63, score-0.323]
42 First, notice that (by negation) the feasibility constraints in (1) can also be defined as: ∀ (x^+ ∈ A^+, x^- ∈ A^-), w′x^- − w′x^+ ≤ −1 ⇔ ∄ (x^+ ∈ A^+, x^- ∈ A^-), w′x^- − w′x^+ > −1. [sent-66, score-0.221]
43 In other words, a solution w is feasible iff there exists no pair of samples from the two classes such that f_w(·) orders them incorrectly. [sent-67, score-0.19]
44 Note that two elements xi and xj of the set A− are not correctly ordered and hence generate positive values of the corresponding slack variables yi and yj . [sent-69, score-0.406]
45 Thus, our constraints become: ∀ (λ^+, λ^-) such that 0 ≤ λ^+ ≤ 1, 0 ≤ λ^- ≤ 1, e′λ^+ = 1, e′λ^- = 1: w′A^-′λ^- − w′A^+′λ^+ ≤ −1. [sent-71, score-0.185]
46 Bu − w′[A^+′ −A^-′] = 0, b′u ≤ −1, u ≥ 0 (5)–(7), where the second equivalent form of the constraints was obtained by negation (as before), and the third equivalent form results from our third key insight: the application of Farkas' theorem of alternatives [9]. [sent-78, score-0.233]
47 The resulting linear system of m equalities and m + 5 inequalities in m + n + 4 variables can be used while minimizing any regularizer (such as ‖w‖²) to obtain the linear ranking function that satisfies (1); notice, however, that we avoid the O(m²) scaling in constraints. [sent-79, score-0.581]
48 2.1 The binary non-separable case In the non-separable case, CH(A^+) ∩ CH(A^-) ≠ ∅, so the requirements have to be relaxed by introducing slack variables. [sent-81, score-0.219]
49 To this end, we allow one slack variable yi ≥ 0 for each training sample xi , and consider the slack for any point inside the convex hull CH(Aj ) to also be a convex combination of y (see Fig. [sent-82, score-0.794]
50 For example, this implies that if only a subset of training samples have non-zero slacks yi> 0 (i. [sent-84, score-0.193]
51 they are possibly misclassified), then the slacks of any points inside the convex hull also only depend on those yi . [sent-86, score-0.302]
52 Note that enforcing the constraints defined above indeed implies the desired ordering, since we have: A^i w + y^i ≥ γ̂^{ij} ≥ γ^{ij} + 1 ≥ γ^{ij} ≥ A^j w − y^j. It is also important to note the connection with the Support Vector Machine (SVM) formulation [10, 14] for the binary case. [sent-94, score-0.943]
53 If we impose the extra constraints γ̂^{ij} = γ + 1 and γ^{ij} = γ − 1, then equations (18) imply the constraints included in the standard primal SVM formulation. [sent-95, score-0.818]
54 The constraints are imposed for each edge E_{ij}, ∀ (i, j) ∈ E, where ǫ is a given loss function for the slack variables y^i and R(v) represents a regularizer on the normal v to the hyperplane. [sent-98, score-0.277]
55 For an arbitrary kernel K(x, x′), the number of variables of formulation (27) is 2m + 2#(E) and the number of linear equations (excluding the nonnegativity constraints) is m#(E) + #(E) = #(E)(m + 1). [sent-99, score-0.145]
56 When using a linear kernel K(x, x′) = xx′, the number of variables of formulation (27) becomes m + n + 2#(E) and the number of linear equations remains the same. [sent-102, score-0.178]
57 When using a linear kernel and using ǫ(x) = R(x) = ‖x‖₂², the optimization problem (27) becomes a linearly constrained quadratic optimization problem for which a unique solution exists due to the convexity of the objective function: [sent-103, score-0.193]
58 min_{w, y^i, γ^{ij} | (i,j) ∈ E}  ν‖y‖₂² + ½ w′w  s.t. the constraints E_{ij}, ∀ (i, j) ∈ E. (28) Unlike other SVM-like methods for ranking that need O(m²) slack variables y, our formulation requires only one slack variable per example, i.e., only m slack variables in total, giving it a computational advantage over those ranking methods. [sent-105, score-1.725]
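As one plausible instantiation (not the authors' code), the sketch below sets up this linear-kernel problem for a chain order graph with cvxpy, using per-edge constraints of the form reconstructed above (A^i w + y^i ≥ γ̂^{ij}, A^j w − y^j ≤ γ^{ij}, γ̂^{ij} ≥ γ^{ij} + 1, y ≥ 0); the function name, data layout, and the exact constraint form are assumptions made for illustration:

```python
import numpy as np
import cvxpy as cp

def rank_chain(classes, nu=1.0):
    """Hedged sketch of a convex-hull-style ranking QP over a chain order graph.

    classes : list of (m_j, n) arrays A^1, ..., A^S, where class A^i should be
              scored above class A^{i+1} (one edge per consecutive pair).
    Returns the weight vector w of the linear ranking function f(x) = w'x.
    """
    n = classes[0].shape[1]
    w = cp.Variable(n)
    ys = [cp.Variable(A.shape[0], nonneg=True) for A in classes]   # one slack per sample
    constraints = []
    for i in range(len(classes) - 1):
        j = i + 1
        g_hat, g = cp.Variable(), cp.Variable()            # gamma_hat^{ij}, gamma^{ij}
        constraints += [classes[i] @ w + ys[i] >= g_hat,   # higher class scores above gamma_hat
                        classes[j] @ w - ys[j] <= g,       # lower class scores below gamma
                        g_hat >= g + 1]                    # unit margin between the thresholds
    objective = nu * cp.sum_squares(cp.hstack(ys)) + 0.5 * cp.sum_squares(w)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value
```

With m samples in total, this uses only m slack variables plus two threshold variables per edge, which is the count the sentence above refers to.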
59 3 Experimental Evaluation We tested our approach on a set of nine publicly available datasets 1 shown in Tab. [sent-107, score-0.108]
60 1 (several large datasets are not reported since only the algorithm presented in this paper was able to run them). [sent-108, score-0.109]
61 These datasets have been frequently used as a benchmark for ordinal regression methods (e. [sent-109, score-0.501]
62 We compare our method against SVM for ranking (e. [sent-113, score-0.481]
63 [4, 6]) using the SVM-light package 2 and an efficient Gaussian process method (the informative vector machine) 3 [7]. [sent-115, score-0.029]
64 These datasets were originally designed for regression; thus the continuous target values for each dataset were discretized into five equal-size bins. [sent-116, score-0.121]
65 We used these bins to define our ranking constraints: all the datapoints with target values falling in the same bin were grouped together. [sent-117, score-0.512]
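A hedged sketch of that preprocessing step (the equal-frequency split and all names are our own choices, not the paper's code) could be:

```python
import numpy as np

def make_rank_groups(X, t, n_bins=5):
    """Split samples X into n_bins ordered groups by their continuous target t.

    Returns a list [A^1, ..., A^n_bins] of arrays, one per bin, ordered by target
    value so that consecutive groups can serve as the nodes of a chain order graph.
    """
    order = np.argsort(t)                    # sort samples by target value
    bins = np.array_split(order, n_bins)     # (nearly) equal-size bins of indices
    return [X[idx] for idx in bins]
```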
66 Each dataset was divided into 10% for testing and 90% for training. [sent-118, score-0.041]
67 Thus, the input to all of the algorithms tested was, for each point in the training set: (1) a vector in ℜn (where n is different for each set) and (2) a value from 1 to 5 denoting the rank of the group to which it belongs. [sent-119, score-0.135]
68 Since we do not employ information about the ranking of the elements within each group, order constraints within a group are not used. (Footnote 1: Available at http:\\www.) [sent-121, score-0.744]
69 [Figure 3 plots omitted: panel (a) shows the generalized Wilcoxon statistic (AUC) and panel (b) the run time (log scale), each versus dataset number.] [sent-135, score-0.09]
70 Figure 3: Experimental comparison of the ranking SVM, IVM and the proposed method on nine benchmark datasets. [sent-142, score-0.598]
71 Along with the mean values in 10-fold cross-validation, the entire range of variation is indicated by the error bars. [sent-143, score-0.035]
72 (b) The proposed method has a much lower run time than the other methods, even in the full-graph case, for medium- to large-size datasets. [sent-145, score-0.257]
73 NOTE: Both SVM-light and IVM ran out of memory and crashed on dataset 4; on dataset 1, SVM-light failed to complete even one fold after more than 24 hours of run time, so its results could not be compiled in time for submission. [sent-146, score-0.201]
74 Letting b(m) = m(m − 1)/2, the total number of order constraints is equal to b(m) − Σ_i b(m_i), where m_i is the number of instances in group i. [sent-148, score-0.357]
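As a quick illustrative check of this count (the five equal groups are a made-up example):

```python
def n_order_constraints(group_sizes):
    b = lambda m: m * (m - 1) // 2
    return b(sum(group_sizes)) - sum(b(mi) for mi in group_sizes)

print(n_order_constraints([100] * 5))   # b(500) - 5*b(100) = 124750 - 24750 = 100000
```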
75 Our formulation was tested employing two order graphs, the full directed acyclic graph and the chain graph. [sent-151, score-0.412]
76 The performance for all datasets is generally comparable or significantly better for our algorithm (when using a chain order graph). [sent-152, score-0.192]
77 Note that the performance for the full graph is consistently lower than that for the chain graph. [sent-153, score-0.227]
78 Thus, interestingly, enforcing more order constraints does not necessarily imply better performance. [sent-154, score-0.282]
79 We suspect that this is due to the role that the slack variables play in both formulations, since the number of slack variables remains the same while the number of constraints increases. [sent-155, score-0.605]
80 Adding more slack variables may positively affect performance in the full graph, but this comes at a computational cost. [sent-156, score-0.243]
81 A different but potentially related problem is that of finding a good order graph given a dataset. [sent-158, score-0.168]
82 Note also that the chain graph is much more stable in terms of overall performance. [sent-159, score-0.194]
83 Regarding run-time, our algorithm runs an order of magnitude faster than current implementations of state-of-the-art methods, even approximate ones (like IVM). [sent-160, score-0.043]
84 4 Discussions and future work We propose a general method for learning a ranking function from structured order constraints on sets of training samples. [sent-161, score-0.89]
85 The proposed algorithm was illustrated on benchmark ranking problems with two different constraint graphs: (a) a chain graph; and (b) a full ordering graph. [sent-162, score-0.816]
86 Besides being accurate, our algorithm has computational requirements that scale much more favorably with the number of training samples as compared to other state-of-the-art methods. [sent-164, score-0.138]
87 Indeed it was the only algorithm capable of handling several large datasets, while the other methods either crashed due to lack of memory or ran for so long that they were not practically feasible. [sent-165, score-0.095]
88 While our experiments illustrate only specific order graphs, we stress that the method is general enough to handle arbitrary constraint relationships. [sent-166, score-0.072]
89 While the proposed formulation reduces the computational complexity of enforcing order constraints, it is entirely independent of the regularizer that is minimized (under these constraints) while learning the optimal ranking function. [sent-167, score-0.765]
90 Though we have used a simple margin regularization (via ‖w‖² in (28)) and RKHS regularization (in (27)) in order to learn in a supervised setting, we can just as easily use a graph-Laplacian based regularizer that exploits unlabeled data, in order to learn in semi-supervised settings. [sent-168, score-0.196]
91 Obermayer, Large margin rank boundaries for ordinal regression, Advances in Large Margin Classifiers (2000), 115–132. [sent-186, score-0.325]
92 Herbrich, Fast sparse gaussian process methods: The informative vector machine, Neural Info. [sent-197, score-0.029]
93 Lafferty, Conditional models on the ranking poset, Neural Info. [sent-202, score-0.481]
94 [10] , Generalized support vector machines, Advances in Large Margin Classifiers, 2000, pp. [sent-208, score-0.029]
95 Altun, Support vector machine learning for interdependent and structured output spaces, Int. [sent-220, score-0.149]
wordName wordTfidf (topN-words)
[('ranking', 0.481), ('ordinal', 0.282), ('aj', 0.236), ('ch', 0.214), ('ij', 0.193), ('uij', 0.19), ('constraints', 0.185), ('slack', 0.177), ('ordering', 0.144), ('wmw', 0.137), ('ai', 0.132), ('graph', 0.125), ('ivm', 0.119), ('hull', 0.117), ('xq', 0.109), ('eij', 0.109), ('xp', 0.095), ('convex', 0.091), ('mj', 0.091), ('statistic', 0.09), ('regression', 0.087), ('formulation', 0.083), ('aq', 0.081), ('datasets', 0.08), ('structured', 0.077), ('svm', 0.074), ('wilcoxon', 0.071), ('rankings', 0.071), ('training', 0.071), ('ap', 0.069), ('chain', 0.069), ('samples', 0.067), ('ranked', 0.067), ('regularizer', 0.067), ('joachims', 0.065), ('equations', 0.062), ('herbrich', 0.061), ('directed', 0.059), ('mi', 0.057), ('crashed', 0.055), ('farka', 0.055), ('poset', 0.055), ('slacks', 0.055), ('taxonomies', 0.055), ('enforcing', 0.054), ('benchmark', 0.052), ('xj', 0.052), ('sized', 0.052), ('classes', 0.051), ('negation', 0.048), ('graphs', 0.045), ('generalized', 0.045), ('interdependent', 0.043), ('fw', 0.043), ('margin', 0.043), ('order', 0.043), ('binary', 0.042), ('yj', 0.042), ('dataset', 0.041), ('hofmann', 0.04), ('preferences', 0.04), ('handling', 0.04), ('yi', 0.039), ('xk', 0.038), ('auc', 0.038), ('denoted', 0.037), ('instances', 0.037), ('proposed', 0.037), ('stringent', 0.036), ('notice', 0.036), ('group', 0.035), ('classi', 0.035), ('fold', 0.035), ('sets', 0.033), ('medium', 0.033), ('thousand', 0.033), ('moderate', 0.033), ('ordered', 0.033), ('full', 0.033), ('variables', 0.033), ('sample', 0.031), ('infeasible', 0.031), ('grouped', 0.031), ('alternatives', 0.031), ('roc', 0.031), ('growth', 0.03), ('misclassi', 0.03), ('correctly', 0.03), ('solutions', 0.03), ('web', 0.029), ('orders', 0.029), ('individually', 0.029), ('ers', 0.029), ('run', 0.029), ('arbitrary', 0.029), ('vector', 0.029), ('bj', 0.028), ('separable', 0.028), ('accurate', 0.028), ('nine', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
2 0.21015541 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
3 0.12804423 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1
4 0.11811413 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1
5 0.10389281 105 nips-2005-Large-Scale Multiclass Transduction
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
6 0.089054614 184 nips-2005-Structured Prediction via the Extragradient Method
7 0.088978611 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
8 0.088720851 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
9 0.087825306 50 nips-2005-Convex Neural Networks
10 0.08740934 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
11 0.087082766 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
12 0.082825422 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
13 0.082592674 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
14 0.08233463 59 nips-2005-Dual-Tree Fast Gauss Transforms
15 0.082087062 46 nips-2005-Consensus Propagation
16 0.073708788 178 nips-2005-Soft Clustering on Graphs
17 0.072112374 47 nips-2005-Consistency of one-class SVM and related algorithms
18 0.071095593 20 nips-2005-Affine Structure From Sound
19 0.068157606 126 nips-2005-Metric Learning by Collapsing Classes
20 0.067003772 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
topicId topicWeight
[(0, 0.235), (1, 0.126), (2, -0.072), (3, -0.072), (4, 0.002), (5, 0.066), (6, 0.039), (7, 0.076), (8, 0.047), (9, 0.025), (10, -0.012), (11, -0.112), (12, -0.015), (13, 0.04), (14, -0.028), (15, 0.074), (16, 0.084), (17, 0.026), (18, 0.035), (19, -0.027), (20, -0.02), (21, 0.049), (22, 0.107), (23, -0.022), (24, 0.028), (25, -0.009), (26, 0.18), (27, 0.208), (28, 0.072), (29, 0.012), (30, -0.007), (31, 0.106), (32, -0.068), (33, 0.091), (34, 0.066), (35, 0.075), (36, 0.035), (37, 0.03), (38, 0.098), (39, 0.057), (40, -0.066), (41, 0.211), (42, -0.045), (43, 0.049), (44, -0.044), (45, -0.103), (46, -0.096), (47, -0.048), (48, -0.079), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.94208676 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
2 0.66105437 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
3 0.53495783 105 nips-2005-Large-Scale Multiclass Transduction
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
4 0.53387845 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
Author: Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-taylor, Sándor Szedmák
Abstract: Kernel methods make it relatively easy to define complex highdimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach showing encouraging results and insights. 1
5 0.52103335 107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition
Author: J. I. Alvarez-hamelin, Luca Dall'asta, Alain Barrat, Alessandro Vespignani
Abstract: We use the k-core decomposition to develop algorithms for the analysis of large scale complex networks. This decomposition, based on a recursive pruning of the least connected vertices, allows to disentangle the hierarchical structure of networks by progressively focusing on their central cores. By using this strategy we develop a general visualization algorithm that can be used to compare the structural properties of various networks and highlight their hierarchical structure. The low computational complexity of the algorithm, O(n + e), where n is the size of the network, and e is the number of edges, makes it suitable for the visualization of very large sparse networks. We show how the proposed visualization tool allows to find specific structural fingerprints of networks. 1
6 0.50098234 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
7 0.49449113 59 nips-2005-Dual-Tree Fast Gauss Transforms
8 0.44373593 62 nips-2005-Efficient Estimation of OOMs
9 0.44043308 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
10 0.4375788 184 nips-2005-Structured Prediction via the Extragradient Method
11 0.43164936 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
12 0.42781585 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
13 0.40402356 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
14 0.40139651 46 nips-2005-Consensus Propagation
15 0.39200467 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
16 0.38513184 71 nips-2005-Fast Krylov Methods for N-Body Learning
17 0.38190377 195 nips-2005-Transfer learning for text classification
18 0.37686983 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
19 0.372244 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
20 0.37047303 6 nips-2005-A Connectionist Model for Constructive Modal Reasoning
topicId topicWeight
[(3, 0.073), (10, 0.042), (27, 0.022), (31, 0.024), (34, 0.152), (39, 0.013), (41, 0.025), (50, 0.036), (55, 0.038), (69, 0.088), (73, 0.041), (86, 0.195), (88, 0.094), (91, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.86061758 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
2 0.72574693 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul
Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1
3 0.72431415 184 nips-2005-Structured Prediction via the Extragradient Method
Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1
4 0.72328788 50 nips-2005-Convex Neural Networks
Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1
5 0.722063 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
6 0.72180152 177 nips-2005-Size Regularized Cut for Data Clustering
7 0.71995068 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
8 0.71843958 126 nips-2005-Metric Learning by Collapsing Classes
9 0.71680152 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
10 0.71491247 105 nips-2005-Large-Scale Multiclass Transduction
11 0.713561 160 nips-2005-Query by Committee Made Real
12 0.71082455 77 nips-2005-From Lasso regression to Feature vector machine
13 0.70937359 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
14 0.70926648 58 nips-2005-Divergences, surrogate loss functions and experimental design
15 0.70862627 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
16 0.70838797 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
17 0.70100754 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
18 0.69823569 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
19 0.69655257 47 nips-2005-Consistency of one-class SVM and related algorithms
20 0.69635177 195 nips-2005-Transfer learning for text classification