nips nips2005 nips2005-114 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. [sent-5, score-0.709]
2 Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. [sent-8, score-0.603]
3 Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. [sent-9, score-0.174]
4 1 Introduction Many machine learning applications depend on accurately ordering the elements of a set based on the known ordering of only some of its elements. [sent-10, score-0.288]
5 In the literature, variants of this problem have been referred to as ordinal regression, ranking, and learning of preference relations. [sent-11, score-0.282]
6 Formally, we want to find a function f : ℜn → ℜ such that, for a set of test samples {xk ∈ ℜn }, the output of the function f (xk ) can be sorted to obtain a ranking. [sent-12, score-0.067]
7 In order to learn such a function we are provided with training data, A, containing S sets (or classes) of training samples: A = ∪_{j=1}^{S} (A^j = {x_i^j}_{i=1}^{m_j}), where the j-th set A^j contains m_j samples, so that we have a total of m = Σ_{j=1}^{S} m_j samples in A. [sent-13, score-0.558]
8 Further, we are also provided with a directed order graph G = (S, E), each of whose vertices corresponds to a class A^j, and the existence of a directed edge E_{PQ}—corresponding to A^P → A^Q—means that all training samples x^p ∈ A^P should be ranked higher than any sample x^q ∈ A^Q: i. [sent-14, score-0.726]
9 In general, the number of constraints on the ranking function grows as O(m²), so that naive solutions are computationally infeasible even for moderate-sized training sets with a few thousand samples. [sent-17, score-0.949]
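To make this growth concrete, here is a small illustrative sketch (our own, not from the paper; the dict layout and the chain order graph are assumptions) that counts the pairwise constraints a naive formulation would generate:

```python
import numpy as np

# Toy ordered classes A^1, ..., A^5 with 500 samples each in R^10 (illustrative only).
rng = np.random.default_rng(0)
classes = {j: rng.normal(size=(500, 10)) for j in range(1, 6)}   # m = 2,500 samples
edges = [(j, j + 1) for j in range(1, 5)]                        # chain order graph

# One constraint per cross-class pair (x^p, x^q) for every edge A^P -> A^Q.
n_constraints = sum(classes[p].shape[0] * classes[q].shape[0] for p, q in edges)
print(n_constraints)   # 4 * 500 * 500 = 1,000,000 constraints from only 2,500 samples
```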
10 Hence, we propose a more stringent problem with a larger (infinite) set of constraints, which is nevertheless much more tractable to solve. [sent-18, score-0.036]
11 In particular, we modify the constraints to: ∀ (x^p ∈ CH(A^P), x^q ∈ CH(A^Q)), f(x^p) ≤ f(x^q), where CH(A^j) denotes the set of all points in the convex hull of A^j. [sent-19, score-0.502]
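The reason these convex-hull constraints remain manageable for a linear scoring function is that a linear f attains its extreme values over CH(A^j) at the original training points; the check below is our own small illustration of that fact (all names and data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
A_p = rng.normal(loc=+2.0, size=(50, 5))   # samples of one class
A_q = rng.normal(loc=-2.0, size=(60, 5))   # samples of another class
w = rng.normal(size=5)                     # any linear scoring direction, f(x) = w'x

def random_hull_point(A):
    lam = rng.dirichlet(np.ones(A.shape[0]))   # convex weights: lam >= 0, sum(lam) = 1
    return lam @ A                             # a point inside CH(A)

# Scores of convex combinations are convex combinations of the scores, so they always
# lie between the minimum and maximum score attained on the original samples.
for _ in range(1000):
    xp, xq = random_hull_point(A_p), random_hull_point(A_q)
    assert (A_p @ w).min() - 1e-9 <= xp @ w <= (A_p @ w).max() + 1e-9
    assert (A_q @ w).min() - 1e-9 <= xq @ w <= (A_q @ w).max() + 1e-9
```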
12 We show how this leads to: (a) a family of approximations to the original problem; and (b) considerably more efficient solutions that still enforce all of the original inter-group order constraints. [sent-20, score-0.073]
13 Notice that this formulation subsumes the standard ranking problem (e. [sent-21, score-0.564]
14 Each problem instance is defined by an order graph. [sent-24, score-0.043]
15 (a-d) A succession of order graphs with an increasing number of constraints; (e-f) two order graphs defining the same partial ordering but different problem instances. [sent-25, score-0.505]
16 However, as illustrated in Figure 1, the formulation is more general and does not require a total ordering of the sets of training samples Aj , i. [sent-27, score-0.398]
17 it allows any order graph G to be incorporated into the problem. [sent-29, score-0.168]
18 1.1 Generalized Wilcoxon-Mann-Whitney Statistics A distinction is usually made between classification and ordinal regression methods on one hand, and ranking on the other. [sent-31, score-0.85]
19 In particular, the loss functions used for classification and ordinal regression evaluate whether each test sample is correctly classified: in other words, the loss functions that are used to evaluate these algorithms—e. [sent-32, score-0.43]
20 the 0–1 loss for binary classification—are computed for every sample individually, and then averaged over the training or test set. [sent-34, score-0.144]
21 By contrast, bipartite ranking solutions are evaluated using the Wilcoxon-Mann-Whitney (WMW) statistic which measures the (sample averaged) probability that any pair of samples is ordered correctly; intuitively, the WMW statistic may be interpreted as the area under the ROC curve (AUC). [sent-35, score-0.791]
22 We define a slight generalization of the WMW statistic that accounts for our notion of class-ordering: WMW(f, A) = [ Σ_{E_{ij}} Σ_{k=1}^{m_i} Σ_{l=1}^{m_j} δ( f(x_k^i) < f(x_l^j) ) ] / [ Σ_{E_{ij}} Σ_{k=1}^{m_i} Σ_{l=1}^{m_j} 1 ]. [sent-36, score-0.386]
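As a concrete illustration, a minimal numpy sketch of this statistic could look like the following (the dict-of-scores layout and the function name are ours, not the authors'):

```python
import numpy as np

def generalized_wmw(scores, edges):
    """Generalized WMW statistic over an order graph.

    scores : dict mapping class label j to a 1-D array of f(x) values for A^j.
    edges  : iterable of (i, j) pairs, one per directed edge E_ij of the graph.
    """
    hits, total = 0, 0
    for i, j in edges:
        fi, fj = np.asarray(scores[i]), np.asarray(scores[j])
        # numerator: pairs (k, l) with f(x_k^i) < f(x_l^j); denominator: all pairs
        hits += np.sum(fi[:, None] < fj[None, :])
        total += fi.size * fj.size
    return hits / total
```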
23 1.2 Previous Work Ordinal regression and methods for handling structured output classes: For a classic description of generalized linear models for ordinal regression, see [11]. [sent-39, score-0.531]
24 A non-parametric Bayesian model for ordinal regression based on Gaussian processes (GP) was defined in [1]. [sent-40, score-0.369]
25 Several recent machine learning papers consider structured output classes: e. [sent-41, score-0.077]
26 [13] presents SVM based algorithms for handling structured and interdependent output spaces, and [5] discusses automatic document categorization into pre-defined hierarchies or taxonomies of topics. [sent-43, score-0.215]
27 Learning Rankings: The problem of learning rankings was first treated as a classification problem on pairs of objects by Herbrich [4] and subsequently used on a web page ranking task by Joachims [6]; a variety of authors have investigated this approach recently. [sent-44, score-0.581]
28 The major advantage of this approach is that it considers a more explicit notion of ordering. However, the naive optimization strategy proposed there suffers from the O(m²) growth in the number of constraints mentioned in the previous section. [sent-45, score-0.252]
29 This computational burden renders these methods impractical even for medium-sized datasets with a few thousand samples. [sent-46, score-0.198]
30 Relationship to the proposed work: Our algorithm penalizes wrong ordering of pairs of training instances in order to learn ranking functions (similar to [4]), but in addition, it can also utilize the notion of a structured class order graph. [sent-49, score-0.933]
31 Nevertheless, using a formulation based on constraints over convex hulls of the training classes, our method avoids the prohibitive computational complexity of the previous algorithms for ranking. [sent-50, score-0.43]
32 1.3 Notation and Background In the following, vectors will be assumed to be column vectors unless transposed to a row vector by a prime superscript ′. [sent-52, score-0.029]
33 Vectors and samples are assumed to lie in the n-dimensional real space ℜ^n; the cardinality of a set A will be denoted by #(A). [sent-53, score-0.066]
34 The scalar (inner) product of two vectors x and y in the n-dimensional real space ℜ^n will be denoted by x′y, and the 2-norm of x will be denoted by ‖x‖. [sent-54, score-0.074]
35 For a matrix A ∈ ℜm×n , Ai is the ith row of A which is a row vector in ℜn , while A·j is the jth column of A. [sent-55, score-0.029]
36 A column vector of ones of arbitrary dimension will be denoted by e. [sent-56, score-0.095]
37 In particular, if x and y are column vectors in ℜn then, K(x′ , y) is a real number, K(x′ , A′ ) is a row vector in ℜm and K(A, A′ ) is an m × m matrix. [sent-58, score-0.029]
38 The identity matrix of arbitrary dimension will be denoted by I. [sent-59, score-0.066]
39 2 Convex Hull formulation We are interested in learning a ranking function f : ℜn → ℜ given known ranking relationships between some training instances Ai , Aj ⊂ A. [sent-60, score-1.153]
40 In other words, for binary classifiers, we want a linear ranking function f_w(x) = w′x that satisfies the following constraints: ∀ (x^+ ∈ A^+, x^- ∈ A^-), f(x^-) ≤ f(x^+) ⇒ f(x^-) − f(x^+) = w′x^- − w′x^+ ≤ −1 ≤ 0. [sent-62, score-0.566]
41 (1) Clearly, the number of constraints grows as O(m+ m− ), which is roughly quadratic in the number of training samples (unless we have severe class imbalance). [sent-63, score-0.323]
42 First, notice that (by negation) the feasibility constraints in (1) can also be defined as: ∀ (x^+ ∈ A^+, x^- ∈ A^-), w′x^- − w′x^+ ≤ −1 ⇔ ∄ (x^+ ∈ A^+, x^- ∈ A^-), w′x^- − w′x^+ > −1. [sent-66, score-0.221]
43 In other words, a solution w is feasible iff there exists no pair of samples from the two classes such that f_w(·) orders them incorrectly. [sent-67, score-0.19]
44 Note that two elements xi and xj of the set A− are not correctly ordered and hence generate positive values of the corresponding slack variables yi and yj . [sent-69, score-0.406]
45 Thus, our constraints become: ∀ (λ^+, λ^-) such that 0 ≤ λ^+ ≤ 1, 0 ≤ λ^- ≤ 1, e′λ^+ = 1, e′λ^- = 1: w′A^-′λ^- − w′A^+′λ^+ ≤ −1. [sent-71, score-0.185]
46 Bu − w′[A^+′ −A^-′] = 0, b′u ≤ −1, u ≥ 0 (5)–(7), where the second equivalent form of the constraints was obtained by negation (as before), and the third equivalent form results from our third key insight: the application of Farkas' theorem of alternatives [9]. [sent-78, score-0.233]
47 The resulting linear system of m equalities and m + 5 inequalities in m + n + 4 variables can be used while minimizing any regularizer (such as ‖w‖²) to obtain the linear ranking function that satisfies (1); notice, however, that we avoid the O(m²) scaling in constraints. [sent-79, score-0.581]
48 2.1 The binary non-separable case In the non-separable case, CH(A^+) ∩ CH(A^-) ≠ ∅, so the requirements have to be relaxed by introducing slack variables. [sent-81, score-0.219]
49 To this end, we allow one slack variable yi ≥ 0 for each training sample xi , and consider the slack for any point inside the convex hull CH(Aj ) to also be a convex combination of y (see Fig. [sent-82, score-0.794]
50 For example, this implies that if only a subset of training samples have non-zero slacks yi> 0 (i. [sent-84, score-0.193]
51 they are possibly misclassified), then the slacks of any points inside the convex hull also only depend on those yi . [sent-86, score-0.302]
52 Note that enforcing the constraints defined above indeed implies the desired ordering, since we have: A^i w + y^i ≥ γ̂^{ij} ≥ γ^{ij} + 1 ≥ γ^{ij} ≥ A^j w − y^j. It is also important to note the connection with the Support Vector Machine (SVM) formulation [10, 14] for the binary case. [sent-94, score-0.943]
53 If we impose the extra constraints γ̂^{ij} = γ + 1 and γ^{ij} = γ − 1, then equations (18) imply the constraints included in the standard primal SVM formulation. [sent-95, score-0.818]
54 The constraints are imposed for each edge E_{ij}, ∀ (i, j) ∈ E, where ǫ is a given loss function for the slack variables y^i and R(v) represents a regularizer on the normal v to the hyperplane. [sent-98, score-0.277]
55 For an arbitrary kernel K(x, x′), the number of variables of formulation (27) is 2m + 2#(E) and the number of linear equations (excluding the nonnegativity constraints) is m#(E) + #(E) = #(E)(m + 1). [sent-99, score-0.145]
56 When using a linear kernel K(x, x′) = xx′, the number of variables of formulation (27) becomes m + n + 2#(E) and the number of linear equations remains the same. [sent-102, score-0.178]
57 When using a linear kernel and using ǫ(x) = R(x) = ‖x‖₂², the optimization problem (27) becomes a linearly constrained quadratic optimization problem for which a unique solution exists due to the convexity of the objective function: [sent-103, score-0.193]
58 min_{w, y^i, γ^{ij} | (i,j) ∈ E}  ν‖y‖₂² + ½ w′w  s.t. the constraints E_{ij}, ∀ (i, j) ∈ E. (28) Unlike other SVM-like methods for ranking that need O(m²) slack variables y, our formulation requires only one slack variable per example, i.e., only m slack variables in total, giving it a computational advantage over those ranking methods. [sent-105, score-1.725]
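As one plausible instantiation (not the authors' code), the sketch below sets up this linear-kernel problem for a chain order graph with cvxpy, using per-edge constraints of the form reconstructed above (A^i w + y^i ≥ γ̂^{ij}, A^j w − y^j ≤ γ^{ij}, γ̂^{ij} ≥ γ^{ij} + 1, y ≥ 0); the function name, data layout, and the exact constraint form are assumptions made for illustration:

```python
import numpy as np
import cvxpy as cp

def rank_chain(classes, nu=1.0):
    """Hedged sketch of a convex-hull-style ranking QP over a chain order graph.

    classes : list of (m_j, n) arrays A^1, ..., A^S, where class A^i should be
              scored above class A^{i+1} (one edge per consecutive pair).
    Returns the weight vector w of the linear ranking function f(x) = w'x.
    """
    n = classes[0].shape[1]
    w = cp.Variable(n)
    ys = [cp.Variable(A.shape[0], nonneg=True) for A in classes]   # one slack per sample
    constraints = []
    for i in range(len(classes) - 1):
        j = i + 1
        g_hat, g = cp.Variable(), cp.Variable()            # gamma_hat^{ij}, gamma^{ij}
        constraints += [classes[i] @ w + ys[i] >= g_hat,   # higher class scores above gamma_hat
                        classes[j] @ w - ys[j] <= g,       # lower class scores below gamma
                        g_hat >= g + 1]                    # unit margin between the thresholds
    objective = nu * cp.sum_squares(cp.hstack(ys)) + 0.5 * cp.sum_squares(w)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value
```

With m samples in total, this uses only m slack variables plus two threshold variables per edge, which is the count the sentence above refers to.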
59 3 Experimental Evaluation We tested our approach on a set of nine publicly available datasets 1 shown in Tab. [sent-107, score-0.108]
60 1 (several large datasets are not reported since only the algorithm presented in this paper was able to run them). [sent-108, score-0.109]
61 These datasets have been frequently used as a benchmark for ordinal regression methods (e. [sent-109, score-0.501]
62 We compare our method against SVM for ranking (e. [sent-113, score-0.481]
63 [4, 6]) using the SVM-light package 2 and an efficient Gaussian process method (the informative vector machine) 3 [7]. [sent-115, score-0.029]
64 These datasets were originally designed for regression; thus the continuous target values for each dataset were discretized into five equal-size bins. [sent-116, score-0.121]
65 We used these bins to define our ranking constraints: all the datapoints with target values falling in the same bin were grouped together. [sent-117, score-0.512]
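A hedged sketch of that preprocessing step (the equal-frequency split and all names are our own choices, not the paper's code) could be:

```python
import numpy as np

def make_rank_groups(X, t, n_bins=5):
    """Split samples X into n_bins ordered groups by their continuous target t.

    Returns a list [A^1, ..., A^n_bins] of arrays, one per bin, ordered by target
    value so that consecutive groups can serve as the nodes of a chain order graph.
    """
    order = np.argsort(t)                    # sort samples by target value
    bins = np.array_split(order, n_bins)     # (nearly) equal-size bins of indices
    return [X[idx] for idx in bins]
```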
66 Each dataset was divided into 10% for testing and 90% for training. [sent-118, score-0.041]
67 Thus, the input to all of the algorithms tested was, for each point in the training set: (1) a vector in ℜn (where n is different for each set) and (2) a value from 1 to 5 denoting the rank of the group to which it belongs. [sent-119, score-0.135]
68 Since we do not employ information about the ranking of the elements within each group, order constraints within a group are not used. (Footnote 1: Available at http:\\www.) [sent-121, score-0.744]
69 [Figure 3 plots omitted: panel (a) shows the generalized Wilcoxon statistic (AUC) and panel (b) the run time (log scale), each versus dataset number.] [sent-135, score-0.09]
70 Figure 3: Experimental comparison of the ranking SVM, IVM and the proposed method on nine benchmark datasets. [sent-142, score-0.598]
71 Along with the mean values in 10-fold cross-validation, the entire range of variation is indicated by the error bars. [sent-143, score-0.035]
72 (b) The proposed method has a much lower run time than the other methods, even in the full-graph case, for medium- to large-size datasets. [sent-145, score-0.257]
73 NOTE: Both SVM-light and IVM ran out of memory and crashed on dataset 4; on dataset 1, SVM-light failed to complete even one fold after more than 24 hours of run time, so its results could not be compiled in time for submission. [sent-146, score-0.201]
74 Letting b(m) = m(m − 1)/2, the total number of order constraints is equal to b(m) − Σ_i b(m_i), where m_i is the number of instances in group i. [sent-148, score-0.357]
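As a quick illustrative check of this count (the five equal groups are a made-up example):

```python
def n_order_constraints(group_sizes):
    b = lambda m: m * (m - 1) // 2
    return b(sum(group_sizes)) - sum(b(mi) for mi in group_sizes)

print(n_order_constraints([100] * 5))   # b(500) - 5*b(100) = 124750 - 24750 = 100000
```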
75 Our formulation was tested employing two order graphs, the full directed acyclic graph and the chain graph. [sent-151, score-0.412]
76 The performance for all datasets is generally comparable or significantly better for our algorithm (when using a chain order graph). [sent-152, score-0.192]
77 Note that the performance for the full graph is consistently lower than that for the chain graph. [sent-153, score-0.227]
78 Thus, interestingly, enforcing more order constraints does not necessarily imply better performance. [sent-154, score-0.282]
79 We suspect that this is due to the role that the slack variables play in both formulations, since the number of slack variables remains the same while the number of constraints increases. [sent-155, score-0.605]
80 Adding more slack variables may positively affect performance in the full graph, but this comes at a computational cost. [sent-156, score-0.243]
81 A different but potentially related problem is that of finding a good order graph given a dataset. [sent-158, score-0.168]
82 Note also that the chain graph is much more stable in terms of overall performance. [sent-159, score-0.194]
83 Regarding run-time, our algorithm runs an order of magnitude faster than current implementations of state-of-the-art methods, even approximate ones (like IVM). [sent-160, score-0.043]
84 4 Discussions and future work We propose a general method for learning a ranking function from structured order constraints on sets of training samples. [sent-161, score-0.89]
85 The proposed algorithm was illustrated on benchmark ranking problems with two different constraint graphs: (a) a chain graph; and (b) a full ordering graph. [sent-162, score-0.816]
86 Besides being accurate, our algorithm has computational requirements that scale much more favorably with the number of training samples as compared to other state-of-the-art methods. [sent-164, score-0.138]
87 Indeed it was the only algorithm capable of handling several large datasets, while the other methods either crashed due to lack of memory or ran for so long that they were not practically feasible. [sent-165, score-0.095]
88 While our experiments illustrate only specific order graphs, we stress that the method is general enough to handle arbitrary constraint relationships. [sent-166, score-0.072]
89 While the proposed formulation reduces the computational complexity of enforcing order constraints, it is entirely independent of the regularizer that is minimized (under these constraints) while learning the optimal ranking function. [sent-167, score-0.765]
90 Though we have used a simple margin regularization (via ‖w‖² in (28)) and RKHS regularization (in (27)) in order to learn in a supervised setting, we can just as easily use a graph-Laplacian based regularizer that exploits unlabeled data, in order to learn in semi-supervised settings. [sent-168, score-0.196]
91 Obermayer, Large margin rank boundaries for ordinal regression, Advances in Large Margin Classifiers (2000), 115–132. [sent-186, score-0.325]
92 Herbrich, Fast sparse gaussian process methods: The informative vector machine, Neural Info. [sent-197, score-0.029]
93 Lafferty, Conditional models on the ranking poset, Neural Info. [sent-202, score-0.481]
94 [10] , Generalized support vector machines, Advances in Large Margin Classifiers, 2000, pp. [sent-208, score-0.029]
95 Altun, Support vector machine learning for interdependent and structured output spaces, Int. [sent-220, score-0.149]
wordName wordTfidf (topN-words)
[('ranking', 0.481), ('ordinal', 0.282), ('aj', 0.236), ('ch', 0.214), ('ij', 0.193), ('uij', 0.19), ('constraints', 0.185), ('slack', 0.177), ('ordering', 0.144), ('wmw', 0.137), ('ai', 0.132), ('graph', 0.125), ('ivm', 0.119), ('hull', 0.117), ('xq', 0.109), ('eij', 0.109), ('xp', 0.095), ('convex', 0.091), ('mj', 0.091), ('statistic', 0.09), ('regression', 0.087), ('formulation', 0.083), ('aq', 0.081), ('datasets', 0.08), ('structured', 0.077), ('svm', 0.074), ('wilcoxon', 0.071), ('rankings', 0.071), ('training', 0.071), ('ap', 0.069), ('chain', 0.069), ('samples', 0.067), ('ranked', 0.067), ('regularizer', 0.067), ('joachims', 0.065), ('equations', 0.062), ('herbrich', 0.061), ('directed', 0.059), ('mi', 0.057), ('crashed', 0.055), ('farka', 0.055), ('poset', 0.055), ('slacks', 0.055), ('taxonomies', 0.055), ('enforcing', 0.054), ('benchmark', 0.052), ('xj', 0.052), ('sized', 0.052), ('classes', 0.051), ('negation', 0.048), ('graphs', 0.045), ('generalized', 0.045), ('interdependent', 0.043), ('fw', 0.043), ('margin', 0.043), ('order', 0.043), ('binary', 0.042), ('yj', 0.042), ('dataset', 0.041), ('hofmann', 0.04), ('preferences', 0.04), ('handling', 0.04), ('yi', 0.039), ('xk', 0.038), ('auc', 0.038), ('denoted', 0.037), ('instances', 0.037), ('proposed', 0.037), ('stringent', 0.036), ('notice', 0.036), ('group', 0.035), ('classi', 0.035), ('fold', 0.035), ('sets', 0.033), ('medium', 0.033), ('thousand', 0.033), ('moderate', 0.033), ('ordered', 0.033), ('full', 0.033), ('variables', 0.033), ('sample', 0.031), ('infeasible', 0.031), ('grouped', 0.031), ('alternatives', 0.031), ('roc', 0.031), ('growth', 0.03), ('misclassi', 0.03), ('correctly', 0.03), ('solutions', 0.03), ('web', 0.029), ('orders', 0.029), ('individually', 0.029), ('ers', 0.029), ('run', 0.029), ('arbitrary', 0.029), ('vector', 0.029), ('bj', 0.028), ('separable', 0.028), ('accurate', 0.028), ('nine', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
2 0.21015541 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
3 0.12804423 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
Author: Y. Altun, D. McAllester, M. Belkin
Abstract: Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points. 1
4 0.11811413 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
Author: Andreas Argyriou, Mark Herbster, Massimiliano Pontil
Abstract: A foundational problem in semi-supervised learning is the construction of a graph underlying the data. We propose to use a method which optimally combines a number of differently constructed graphs. For each of these graphs we associate a basic graph kernel. We then compute an optimal combined kernel. This kernel solves an extended regularization problem which requires a joint minimization over both the data and the set of graph kernels. We present encouraging results on different OCR tasks where the optimal combined kernel is computed from graphs constructed with a variety of distances functions and the ‘k’ in nearest neighbors. 1
5 0.10389281 105 nips-2005-Large-Scale Multiclass Transduction
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
6 0.089054614 184 nips-2005-Structured Prediction via the Extragradient Method
7 0.088978611 14 nips-2005-A Probabilistic Interpretation of SVMs with an Application to Unbalanced Classification
8 0.088720851 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
9 0.087825306 50 nips-2005-Convex Neural Networks
10 0.08740934 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
11 0.087082766 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
12 0.082825422 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
13 0.082592674 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
14 0.08233463 59 nips-2005-Dual-Tree Fast Gauss Transforms
15 0.082087062 46 nips-2005-Consensus Propagation
16 0.073708788 178 nips-2005-Soft Clustering on Graphs
17 0.072112374 47 nips-2005-Consistency of one-class SVM and related algorithms
18 0.071095593 20 nips-2005-Affine Structure From Sound
19 0.068157606 126 nips-2005-Metric Learning by Collapsing Classes
20 0.067003772 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
topicId topicWeight
[(0, 0.235), (1, 0.126), (2, -0.072), (3, -0.072), (4, 0.002), (5, 0.066), (6, 0.039), (7, 0.076), (8, 0.047), (9, 0.025), (10, -0.012), (11, -0.112), (12, -0.015), (13, 0.04), (14, -0.028), (15, 0.074), (16, 0.084), (17, 0.026), (18, 0.035), (19, -0.027), (20, -0.02), (21, 0.049), (22, 0.107), (23, -0.022), (24, 0.028), (25, -0.009), (26, 0.18), (27, 0.208), (28, 0.072), (29, 0.012), (30, -0.007), (31, 0.106), (32, -0.068), (33, 0.091), (34, 0.066), (35, 0.075), (36, 0.035), (37, 0.03), (38, 0.098), (39, 0.057), (40, -0.066), (41, 0.211), (42, -0.045), (43, 0.049), (44, -0.044), (45, -0.103), (46, -0.096), (47, -0.048), (48, -0.079), (49, 0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.94208676 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
2 0.66105437 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
3 0.53495783 105 nips-2005-Large-Scale Multiclass Transduction
Author: Thomas Gärtner, Quoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan
Abstract: We present a method for performing transductive inference on very large datasets. Our algorithm is based on multiclass Gaussian processes and is effective whenever the multiplication of the kernel matrix or its inverse with a vector can be computed sufficiently fast. This holds, for instance, for certain graph and string kernels. Transduction is achieved by variational inference over the unlabeled data subject to a balancing constraint. 1
4 0.53387845 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
Author: Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-taylor, Sándor Szedmák
Abstract: Kernel methods make it relatively easy to define complex highdimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach showing encouraging results and insights. 1
5 0.52103335 107 nips-2005-Large scale networks fingerprinting and visualization using the k-core decomposition
Author: J. I. Alvarez-hamelin, Luca Dall'asta, Alain Barrat, Alessandro Vespignani
Abstract: We use the k-core decomposition to develop algorithms for the analysis of large scale complex networks. This decomposition, based on a recursive pruning of the least connected vertices, allows to disentangle the hierarchical structure of networks by progressively focusing on their central cores. By using this strategy we develop a general visualization algorithm that can be used to compare the structural properties of various networks and highlight their hierarchical structure. The low computational complexity of the algorithm, O(n + e), where n is the size of the network, and e is the number of edges, makes it suitable for the visualization of very large sparse networks. We show how the proposed visualization tool allows to find specific structural fingerprints of networks. 1
6 0.50098234 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
7 0.49449113 59 nips-2005-Dual-Tree Fast Gauss Transforms
8 0.44373593 62 nips-2005-Efficient Estimation of OOMs
9 0.44043308 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
10 0.4375788 184 nips-2005-Structured Prediction via the Extragradient Method
11 0.43164936 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning
12 0.42781585 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
13 0.40402356 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
14 0.40139651 46 nips-2005-Consensus Propagation
15 0.39200467 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
16 0.38513184 71 nips-2005-Fast Krylov Methods for N-Body Learning
17 0.38190377 195 nips-2005-Transfer learning for text classification
18 0.37686983 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
19 0.372244 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
20 0.37047303 6 nips-2005-A Connectionist Model for Constructive Modal Reasoning
topicId topicWeight
[(3, 0.073), (10, 0.042), (27, 0.022), (31, 0.024), (34, 0.152), (39, 0.013), (41, 0.025), (50, 0.036), (55, 0.038), (69, 0.088), (73, 0.041), (86, 0.195), (88, 0.094), (91, 0.041)]
simIndex simValue paperId paperTitle
same-paper 1 0.86061758 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
2 0.72574693 57 nips-2005-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, John Blitzer, Lawrence K. Saul
Abstract: We show how to learn a Mahanalobis distance metric for k-nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification—for example, achieving a test error rate of 1.3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification. 1
3 0.72431415 184 nips-2005-Structured Prediction via the Extragradient Method
Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1
4 0.72328788 50 nips-2005-Convex Neural Networks
Author: Yoshua Bengio, Nicolas L. Roux, Pascal Vincent, Olivier Delalleau, Patrice Marcotte
Abstract: Convexity has recently received a lot of attention in the machine learning community, and the lack of convexity has been seen as a major disadvantage of many learning algorithms, such as multi-layer artificial neural networks. We show that training multi-layer neural networks in which the number of hidden units is learned can be viewed as a convex optimization problem. This problem involves an infinite number of variables, but can be solved by incrementally inserting a hidden unit at a time, each time finding a linear classifier that minimizes a weighted sum of errors. 1
5 0.722063 92 nips-2005-Hyperparameter and Kernel Learning for Graph Based Semi-Supervised Classification
Author: Ashish Kapoor, Hyungil Ahn, Yuan Qi, Rosalind W. Picard
Abstract: There have been many graph-based approaches for semi-supervised classification. One problem is that of hyperparameter learning: performance depends greatly on the hyperparameters of the similarity graph, transformation of the graph Laplacian and the noise model. We present a Bayesian framework for learning hyperparameters for graph-based semisupervised classification. Given some labeled data, which can contain inaccurate labels, we pose the semi-supervised classification as an inference problem over the unknown labels. Expectation Propagation is used for approximate inference and the mean of the posterior is used for classification. The hyperparameters are learned using EM for evidence maximization. We also show that the posterior mean can be written in terms of the kernel matrix, providing a Bayesian classifier to classify new points. Tests on synthetic and real datasets show cases where there are significant improvements in performance over the existing approaches. 1
6 0.72180152 177 nips-2005-Size Regularized Cut for Data Clustering
7 0.71995068 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
8 0.71843958 126 nips-2005-Metric Learning by Collapsing Classes
9 0.71680152 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
10 0.71491247 105 nips-2005-Large-Scale Multiclass Transduction
11 0.713561 160 nips-2005-Query by Committee Made Real
12 0.71082455 77 nips-2005-From Lasso regression to Feature vector machine
13 0.70937359 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
14 0.70926648 58 nips-2005-Divergences, surrogate loss functions and experimental design
15 0.70862627 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
16 0.70838797 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
17 0.70100754 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery
18 0.69823569 16 nips-2005-A matching pursuit approach to sparse Gaussian process regression
19 0.69655257 47 nips-2005-Consistency of one-class SVM and related algorithms
20 0.69635177 195 nips-2005-Transfer learning for text classification