nips nips2005 nips2005-83 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Generalization error bounds for classifiers trained with interdependent data Nicolas Usunier, Massih-Reza Amini, Patrick Gallinari Department of Computer Science, University of Paris VI 8, rue du Capitaine Scott, 75015 Paris France {usunier, amini, gallinari}@poleia. [sent-1, score-0.388]
2 fr Abstract In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. [sent-3, score-0.408]
3 It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. [sent-4, score-0.673]
4 1 Introduction Many machine learning (ML) applications deal with the problem of bipartite ranking where the goal is to find a function which orders relevant elements over irrelevant ones. [sent-5, score-0.495]
5 The criterion widely used to measure the ranking quality is the Area Under the ROC Curve (AUC) [6]. [sent-7, score-0.279]
6 Given a training set S = ((x_p, y_p))_{p=1}^n with y_p ∈ {±1}, its optimization over a class of real-valued functions G can be carried out by finding a classifier of the form c_g(x, x′) = sign(g(x) − g(x′)), g ∈ G, which minimizes the error rate over pairs of examples (x, 1) and (x′, −1) in S [6]. [sent-8, score-0.367]
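The pairwise reduction described in the sentence above can be sketched in a few lines of numpy. The function below is illustrative only (the name and the tie-handling convention are ours, not the paper's): the empirical AUC of a scoring function g is one minus the error rate of the pair classifier c_g(x, x′) = sign(g(x) − g(x′)) over all (positive, negative) pairs.

```python
import numpy as np

def pairwise_auc(scores, y):
    """Empirical AUC through the pairwise view: the fraction of
    (positive, negative) pairs ordered correctly by the scores, i.e.
    1 minus the error rate of c_g(x, x') = sign(g(x) - g(x'))."""
    pos = scores[y == 1]
    neg = scores[y == -1]
    diff = pos[:, None] - neg[None, :]          # g(x) - g(x') over all pairs
    # Ties are counted as half-errors, a common convention for the AUC.
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

# Toy usage with hypothetical scores.
y = np.array([1, 1, -1, -1, -1])
g = np.array([2.1, 0.3, 0.5, -1.0, -0.2])
print(pairwise_auc(g, y))
```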
7 More generally, it is well-known that the learning of scoring functions can be expressed as a classification task over pairs of examples [7, 5]. [sent-9, score-0.146]
8 The study of the generalization properties of ranking problems is a challenging task, since the pairs of examples violate the central i.i.d. assumption. [sent-10, score-0.514]
9 [2] showed that SVM-like algorithms optimizing the AUC have good generalization guarantees, and [11] showed that maximizing the margin of the pairs, defined by the quantity g(x) − g(x′), leads to the minimization of the generalization error. [sent-15, score-0.294]
10 While these results suggest some similarity between the classification of the pairs of examples and the classification of independent data, no common framework has been established. [sent-16, score-0.166]
11 As a major drawback, it is not possible to directly deduce results for ranking from those obtained in classification. [sent-17, score-0.315]
12 In this paper, we present a new framework to study the generalization properties of classifiers over data which can exhibit a suitable dependency structure. [sent-18, score-0.28]
13 Among others, the problems of binary classification, bipartite ranking, and the ranking risk defined in [5] are special cases of our study. [sent-19, score-0.566]
14 It shows that it is possible to infer generalization bounds for classifiers trained over interdependent examples using generalization results known for binary classification. [sent-20, score-0.791]
15 This bound derives straightforwardly from the same kind of bounds for SVMs for classification given in [12]. [sent-22, score-0.242]
16 Since learning algorithms aim at minimizing the generalization error of their chosen hypothesis, our results suggest that the design of bipartite ranking algorithms can follow the design of standard classification learning systems. [sent-23, score-0.606]
17 In section 3, we present a new concentration inequality which allows us to extend the notion of Rademacher complexity (section 4), and, in section 5, we prove generalization bounds for binary classification and bipartite ranking tasks under our framework. [sent-26, score-1.043]
18 2 Formal framework We distinguish between the input and the training data. [sent-28, score-0.101]
19 The input data S = (s_p)_{p=1}^n is a set of n independent examples, while the training data Z = (z_i)_{i=1}^N is composed of N binary classified elements, where each z_i is in X_tr × {−1, +1}, with X_tr the space of characteristics. [sent-29, score-0.425]
20 For example, in the general case of bipartite ranking, the input data is the set of elements to be ordered, while the training data consists of the pairs of examples to be classified. [sent-30, score-0.371]
21 The purpose of this work is the study of the generalization properties of classifiers trained using possibly dependent training data, but in the special case where the latter is deterministically generated from the input data. [sent-31, score-0.353]
22 The aim here is to select a hypothesis h ∈ H = {h_θ : X_tr → {−1, 1} | θ ∈ Θ} which optimizes the empirical risk L(h, Z) = (1/N) Σ_{i=1}^N ℓ(h, z_i), ℓ being the instantaneous loss of h, over the training set Z. [sent-32, score-0.34]
23 In a first step, the learner applies to its input data S a fixed function ϕ : S^n → (X_tr × {−1, 1})^N to generate a vector Z = (z_i)_{i=1}^N = ϕ(S) of N training examples z_i ∈ X_tr × {−1, 1}, i = 1, . . . , N. [sent-35, score-0.339]
24 In the second step, the learner runs a classification algorithm in order to obtain h which minimizes the empirical classification loss L(h, Z) over its training data Z = ϕ(S). [sent-39, score-0.093]
25 Examples Using the notations above, when S = X_tr × {±1}, n = N, ϕ is the identity function and S is drawn i.i.d. [sent-40, score-0.153]
26 according to an unknown distribution D, we recover the classical definition of a binary classification algorithm. [sent-43, score-0.152]
27 Another example is the ranking task described in [5] where S = X × R, X_tr = X², N = n(n−1) and, given S = ((x_p, y_p))_{p=1}^n drawn i.i.d. [sent-44, score-0.433]
28 according to a fixed D, ϕ generates all the pairs ((x_k, x_l), sign((y_k − y_l)/2)), k ≠ l. [sent-47, score-0.094]
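A minimal sketch of the map ϕ of this example, assuming real-valued labels y_p as in [5]. The function below (our own naming, not the paper's code) also records, for each generated pair, the set [i] of input indices it depends on, which is the object used later to build the dependency graph.

```python
import numpy as np

def phi_all_pairs(sample):
    """Generate the N = n(n-1) training pairs ((x_k, x_l), sign((y_k - y_l)/2)),
    k != l, together with the index sets [i] recording which of the n input
    examples each training pair depends on (an illustrative sketch)."""
    pairs, index_sets = [], []
    n = len(sample)
    for k in range(n):
        for l in range(n):
            if k == l:
                continue
            (xk, yk), (xl, yl) = sample[k], sample[l]
            label = int(np.sign((yk - yl) / 2.0))   # 0 when y_k == y_l (a tie)
            pairs.append(((xk, xl), label))
            index_sets.append({k, l})               # the set [i] of the formal framework
    return pairs, index_sets
```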
29 In the remainder of the paper, we will prove generalization error bounds for the selected hypothesis by upper bounding sup_{h∈H} L(h) − L(h, ϕ(S)) (1) with high confidence over S, where L(h) = E_S L(h, ϕ(S)). [sent-48, score-0.487]
30 To this end we decompose Z = ϕ(S) using the dependency graph of the random variables composing Z with a technique similar to the one proposed by [8]. [sent-49, score-0.179]
31 sup_{q∈Q} E_{S̃} (1/N) Σ_{i=1}^N q(ϕ(S̃)_i) − (1/N) Σ_{i=1}^N q(ϕ(S)_i) with high confidence over samples S, where S̃ is also drawn according to ⊗_{p=1}^n D_p, Q is a class of functions taking values in [0, 1], and ϕ(S)_i denotes the i-th training example (Theorem 4). [sent-51, score-0.253]
32 This bound uses an extension of the Rademacher complexity [3], the fractional Rademacher complexity (FRC) (definition 3), which is a weighted sum of Rademacher complexities over independent subsets of the training data. [sent-52, score-0.549]
33 We show that the FRC of an arbitrary class of real-valued functions can be trivially computed given the Rademacher complexity of this class of functions and ϕ (theorem 6). [sent-53, score-0.236]
34 This theorem shows that generalization error bounds for classes of classifiers over interdependent data (in the sense of definition 1) trivially follow from the same kind of bounds for the same class of classifiers trained over i.i.d. data. [sent-54, score-0.953]
35 Finally, we show an example of the derivation of a margin-based, data-dependent generalization error bound (i.e. [sent-58, score-0.212]
36 a bound on equation (1) which can be computed on the training data) for the bipartite ranking case when H = {(x, x′) → sign(K(θ, x) − K(θ, x′)) | K(θ, θ) ≤ B²}, assuming that the input examples are drawn i.i.d. [sent-60, score-0.744]
37 according to a distribution D over X × {±1}, X ⊂ R^d and K is a kernel over X². [sent-63, score-0.073]
38 Notations Throughout the paper, we will use the notations of the preceding subsection, except for Z = (z_i)_{i=1}^N, which will denote an arbitrary element of (X_tr × {−1, 1})^N. [sent-64, score-0.084]
39 In order to obtain the dependency graph of the random variables ϕ(S)_i, we will consider, for each 1 ≤ i ≤ N, a set [i] ⊂ {1, . . . , n} [sent-65, score-0.179]
40 such that ϕ(S)_i depends only on the variables s_p ∈ S for which p ∈ [i]. [sent-68, score-0.12]
41 For k, l ∈ {1, . . . , N}, we can notice that the two random variables ϕ(S)_k and ϕ(S)_l are independent if and only if [k] ∩ [l] = ∅. [sent-72, score-0.091]
42 The dependency graph of the ϕ(S)_i's follows, by constructing the graph Γ(ϕ), with the set of vertices V = {1, . . . , N} and an edge between k and l whenever [k] ∩ [l] ≠ ∅. [sent-73, score-0.165]
43 The following definitions, taken from [8], will enable us to separate the set of partly dependent variables into sets of independent variables: • A subset A of V is independent if all the elements in A are independent. [sent-77, score-0.306]
44 • A sequence C = (C_j)_{j=1}^m of subsets of V is a proper cover of V if, for all j, C_j is independent, and ∪_j C_j = V. • A sequence C = (C_j, w_j)_{j=1}^m is a proper, exact fractional cover of Γ if w_j > 0 for all j and, for each i ∈ V, Σ_{j=1}^m w_j I_{C_j}(i) = 1, where I_{C_j} is the indicator function of C_j. [sent-78, score-1.004]
45 • The fractional chromatic number of Γ, denoted χ(Γ), is equal to the minimum of Σ_j w_j over all proper, exact fractional covers. [sent-79, score-0.669]
46 2 of [8], the existence of proper, exact fractional covers is ensured. [sent-81, score-0.206]
47 Moreover, we will denote by C(ϕ) = (C_j, w_j)_{j=1}^κ a proper, exact fractional cover of Γ such that Σ_j w_j = χ(ϕ). [sent-83, score-0.683]
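The following sketch (ours, not the paper's) builds the dependency graph Γ(ϕ) from the sets [i] and produces one valid proper, exact fractional cover by greedy colouring: the colour classes are independent, they partition V, and taking w_j = 1 gives Σ_j w_j I_{C_j}(i) = 1 for every vertex. Note that such a cover is not guaranteed to attain the fractional chromatic number χ(Γ).

```python
import numpy as np

def dependency_graph(index_sets):
    """Adjacency lists of Gamma(phi): vertices are training-example indices,
    with an edge between k and l whenever [k] and [l] intersect."""
    N = len(index_sets)
    adj = [set() for _ in range(N)]
    for k in range(N):
        for l in range(k + 1, N):
            if index_sets[k] & index_sets[l]:
                adj[k].add(l)
                adj[l].add(k)
    return adj

def greedy_cover(adj):
    """A proper, exact fractional cover (C_j, w_j) with w_j = 1 obtained by
    greedy colouring; sum_j w_j upper bounds chi(Gamma) but may exceed it."""
    colour = {}
    for v in range(len(adj)):
        used = {colour[u] for u in adj[v] if u in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    classes = {}
    for v, c in colour.items():
        classes.setdefault(c, []).append(v)
    return [(sorted(C), 1.0) for C in classes.values()]

# Toy usage: three training examples built from four input examples.
index_sets = [{0, 1}, {1, 2}, {2, 3}]
cover = greedy_cover(dependency_graph(index_sets))
# Check that the cover is exact: every vertex receives total weight 1.
weights = np.zeros(len(index_sets))
for C, w in cover:
    weights[C] += w
assert np.allclose(weights, 1.0)
```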
48 It is to be noted that if (t_i)_{i=1}^N ∈ R^N and C(ϕ) = (C_j, w_j)_{j=1}^κ, lemma 3. [sent-88, score-0.257]
49 1 of [8] states that: Σ_{i=1}^N t_i = Σ_{j=1}^κ w_j T_j, where T_j = Σ_{k=1}^{κ_j} t_{C_{jk}} (2). 3 A new concentration inequality Concentration inequalities bound the probability that a random variable deviates too much from its expectation (see [4] for a survey). [sent-89, score-0.418]
50 They play a major role in learning theory as they can be used for example to bound the probability of deviation of the expected loss of a function from its empirical value estimated over a sample set. [sent-90, score-0.151]
51 A well-known inequality is McDiarmid’s theorem [9] for independent random variables, which bounds the probability that an arbitrary function with bounded variations in each of its parameters deviates from its expectation. [sent-91, score-0.502]
52 While this theorem is very general, [8] proved a large deviation bound for sums of partly dependent random variables where the dependency structure of the variables is known, which can be tighter in some cases. [sent-92, score-0.591]
53 Since we also consider variables with known dependency structure, using such results may lead to tighter bounds. [sent-93, score-0.171]
54 However, we will bound functions like the one in equation (1), which cannot be written as a sum of partly dependent variables. [sent-94, score-0.252]
55 Thus, we need a result on more general functions than sums of random variables, but which also takes into account the known dependency structure of the variables. [sent-95, score-0.168]
56 Using the notations defined above, let C(ϕ) = (C_j, w_j)_{j=1}^κ. [sent-98, score-0.295]
57 There exist κ functions f_j : X^{κ_j} → R which satisfy, for all Z = (z_1, . . . , z_N), f(Z) = Σ_{j=1}^κ w_j f_j(z_{C_{j1}}, . . . , z_{C_{jκ_j}}), and real numbers β_1, . . . [sent-100, score-0.124]
58 , β_N ∈ R^+ such that, for all j, all k, and all Z_j, Z′_j ∈ X^{κ_j} that differ only in the k-th dimension, |f_j(Z_j) − f_j(Z′_j)| ≤ β_{C_{jk}}. [sent-111, score-0.092]
59 The proof of this theorem (given in [13]) is a variation of the proofs in [8] and of McDiarmid’s theorem. [sent-117, score-0.178]
60 The main idea of this theorem is to allow the decomposition of f , which will take as input partly dependent random variables when applied to ϕ(S), into a sum of functions which, when considering f ◦ ϕ(S), will be functions of independent variables. [sent-118, score-0.469]
61 As we will see, this theorem will be the major tool in our analysis. [sent-119, score-0.178]
62 It is to be noted that when the two spaces coincide, N = n, and ϕ is the identity function of X^n, theorem 2 is exactly McDiarmid’s theorem. [sent-120, score-0.224]
63 On the other hand, when f takes the form Σ_{i=1}^N q_i(z_i) with, for all z ∈ X, a ≤ q_i(z) ≤ a + β_i with a ∈ R, then theorem 2 reduces to a particular case of the large deviation bound of [8]. [sent-121, score-0.333]
64 4 The fractional Rademacher complexity Let Z = (z_i)_{i=1}^N ∈ Z^N. [sent-122, score-0.279]
65 This quantity has been extensively used to measure the complexity of function classes in previous bounds for binary classification [3, 10]. [sent-129, score-0.32]
66 Let Q be a class of functions from a set Z to R, let ϕ : X^n → Z^N and S a sample of size n drawn according to a product distribution ⊗_{p=1}^n D_p over X^n. [sent-132, score-0.207]
67 Then, we define the empirical fractional Rademacher complexity of Q given ϕ as: R*_n(Q, S, ϕ) = (2/N) E_σ Σ_j w_j sup_{q∈Q} Σ_{i∈C_j} σ_i q(ϕ(S)_i), as well as the fractional Rademacher complexity of Q as R*_n(Q, ϕ) = E_S R*_n(Q, S, ϕ). Theorem 4. [sent-133, score-0.895]
68 These results are therefore extensions of equation (4), and show that the generalization error bounds for the tasks falling in our framework will follow from a unique approach. [sent-136, score-0.403]
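A Monte Carlo sketch of the empirical fractional Rademacher complexity for a finite class Q, following the definition above; the array layout, the cover format and the function name are our own assumptions rather than anything prescribed by the paper.

```python
import numpy as np

def empirical_frc(Q_values, cover, n_draws=2000, seed=0):
    """Estimate R*_n(Q, S, phi) = (2/N) E_sigma sum_j w_j sup_{q in Q}
    sum_{i in C_j} sigma_i q(phi(S)_i).  `Q_values` is a (|Q|, N) array whose
    rows hold the values q(phi(S)_i) on the N training examples; `cover`
    is a list of (C_j, w_j) pairs with C_j a list of example indices."""
    rng = np.random.default_rng(seed)
    n_q, N = Q_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=N)       # Rademacher variables
        acc = 0.0
        for C, w in cover:                            # independent blocks C_j
            block_sums = Q_values[:, C] @ sigma[C]    # sum_{i in C_j} sigma_i q(.)
            acc += w * block_sums.max()               # sup over the finite class Q
        total += acc
    return 2.0 * total / (N * n_draws)

# Toy usage: 3 hypothetical functions on N = 4 examples, partition cover.
Q_values = np.array([[0.1, 0.9, 0.3, 0.5],
                     [0.4, 0.2, 0.8, 0.6],
                     [0.7, 0.5, 0.1, 0.2]])
cover = [([0, 2], 1.0), ([1, 3], 1.0)]
print(empirical_frc(Q_values, cover, n_draws=500))
```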
69 In order to find a bound, for all q in Q, of λ(q) − λ(q, ϕ(S)), we write: λ(q) − λ(q, ϕ(S)) ≤ sup_{q∈Q} [E_{S̃} (1/N) Σ_{i=1}^N q(ϕ(S̃)_i) − (1/N) Σ_{i=1}^N q(ϕ(S)_i)] ≤ (1/N) Σ_j w_j sup_{q∈Q} [E_{S̃} Σ_{i∈C_j} q(ϕ(S̃)_i) − Σ_{i∈C_j} q(ϕ(S)_i)] (5), where we have used equation (2). [sent-138, score-0.519]
70 Now, consider, for each j, f_j : Z^{κ_j} → R such that, for all z^{(j)} ∈ Z^{κ_j}, f_j(z^{(j)}) = (1/N) sup_{q∈Q} [E_{S̃} Σ_{k=1}^{κ_j} q(ϕ(S̃)_{C_{jk}}) − Σ_{k=1}^{κ_j} q(z^{(j)}_k)]. [sent-139, score-0.184]
71 It is clear that if f : Z^N → R is defined by: for all Z ∈ Z^N, f(Z) = Σ_{j=1}^κ w_j f_j(z_{C_{j1}}, . . . [sent-140, score-0.303]
72 , z_{C_{jκ_j}}), then the right side of equation (5) is equal to f ∘ ϕ(S), and that f satisfies all the conditions of theorem 2 with, for all i ∈ {1, . . . , N}, β_i = 1/N. [sent-143, score-0.257]
73 However in practice, our bounds only depend on χ(ϕ) (see section 4. [sent-148, score-0.177]
74 Then, from equation (6), the first inequality of the theorem follows. [sent-154, score-0.151]
75 The second inequality is due to an application of theorem 2 to R*_n(Q, S, ϕ). [sent-155, score-0.09]
76 The symmetrization performed in equation (7) requires the variables ϕ(S)i appearing in the same sum to be independent. [sent-157, score-0.092]
77 Thus, the generalization of Rademacher complexities could only be performed using a decomposition into independent sets, and the cover C ensures some optimality of the decomposition. [sent-158, score-0.293]
78 Moreover, even though McDiarmid’s theorem could be applied each time we used theorem 2, the derivation of the real numbers bounding the differences is not straightforward, and may not lead to the same result. [sent-159, score-0.393]
79 The creation of the dependency graph of ϕ and theorem 2 are therefore necessary tools for obtaining theorem 4. [sent-160, score-0.488]
80 Then: R*_n(H_{K,B}, S, ϕ) ≤ (2B√χ(ϕ)/N) √(Σ_{i=1}^N ||ϕ(S)_i||²_K). The first point of this theorem is a direct consequence of a Rademacher process comparison theorem, namely theorem 7 of [10], and will enable the derivation of margin-based bounds. [sent-169, score-0.381]
81 The second and third points show that the results regarding the Rademacher complexity can be used to immediately deduce bounds on the FRC. [sent-170, score-0.286]
82 This result, as well as theorem 4, shows that binary classifiers of i.i.d. [sent-171, score-0.248]
83 data and classifiers of interdependent data will have generalization bounds of the same form, but with different convergence rates depending on the dependency structure imposed by ϕ. [sent-174, score-0.579]
84 The second point results from Jensen’s inequality, using the facts that Σ_j w_j = χ(ϕ) and, from equation (2), Σ_j w_j |C_j| = N. [sent-176, score-0.467]
85 [3]), if S_k = ((x_p, y_p))_{p=1}^k, then R_k(H_{K,B}, S_k) ≤ (2B/k) √(Σ_{p=1}^k ||x_p||²_K). [sent-179, score-0.085]
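As a small illustration, the two complexity terms just recalled can be evaluated directly on data; the helper names are ours, B is the norm bound of H_{K,B}, and chi is whatever value or upper bound one has for the fractional chromatic number χ(ϕ).

```python
import numpy as np

def iid_rademacher_bound(X, B):
    """(2B/k) * sqrt(sum_p ||x_p||^2): the classical bound recalled above
    for H_{K,B} with a linear kernel on an i.i.d. sample of size k."""
    k = len(X)
    return (2.0 * B / k) * np.sqrt(np.sum(X * X))

def frc_bound(Phi_S, B, chi):
    """(2B*sqrt(chi)/N) * sqrt(sum_i ||phi(S)_i||^2): the bound of theorem 6,
    where Phi_S stacks the N training vectors phi(S)_i (e.g. the pair
    differences x - x' in bipartite ranking) and chi is chi(phi)."""
    N = len(Phi_S)
    return (2.0 * B * np.sqrt(chi) / N) * np.sqrt(np.sum(Phi_S * Phi_S))
```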
86 5 Data-dependent bounds The fact that classifiers trained on interdependent data will “inherit” the generalization bound of the same classifier trained on i.i.d. [sent-180, score-0.655]
87 data suggests simple ways of obtaining bipartite ranking algorithms. [sent-183, score-0.459]
88 Indeed, suppose we want to learn a linear ranking function, for example a function h ∈ H_{K,B} as defined in theorem 6, where K is a linear kernel, and consider a sample S ∈ (X × {−1, 1})^n with X ⊂ R^d, drawn i.i.d. [sent-184, score-0.551]
89 Therefore, we can learn a bipartite ranking function by applying an SVM algorithm to the pairs ((x, 1), (x′, −1)) in S, each pair being represented by x − x′, and our framework allows us to immediately obtain generalization bounds for this learning process based on the generalization bounds for SVMs. [sent-189, score-1.19]
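The recipe of the last sentence can be sketched with scikit-learn. Everything below is an illustrative assumption rather than the authors' code; in particular, the mirrored copies of the difference vectors are only a device to hand the one-class pair problem to a standard two-class SVM solver (with no intercept, training on {(x−x′, +1), (x′−x, −1)} is equivalent to training on the original pairs alone).

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_bipartite_ranker(X, y, C=1.0):
    """Learn a linear scoring function g(x) = <theta, x> by running a linear
    SVM on the pair differences x - x' of (positive, negative) pairs, as
    sketched above (illustrative only)."""
    pos, neg = X[y == 1], X[y == -1]
    diffs = (pos[:, None, :] - neg[None, :, :]).reshape(-1, X.shape[1])
    Z = np.vstack([diffs, -diffs])                    # mirrored copies
    t = np.hstack([np.ones(len(diffs)), -np.ones(len(diffs))])
    svm = LinearSVC(C=C, fit_intercept=False).fit(Z, t)
    return svm.coef_.ravel()                          # theta

# Toy usage on a hypothetical sample: rank relevant above irrelevant.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (20, 5)), rng.normal(-1.0, 1.0, (20, 5))])
y = np.hstack([np.ones(20), -np.ones(20)])
theta = fit_bipartite_ranker(X, y)
scores = X @ theta
```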
90 Let S ∈ (X × {−1, 1})^n be a sample of size n drawn i.i.d. [sent-197, score-0.094]
91 4 remark that φ is upper bounded by the slack variables of the SVM optimization problem (see e.g. [sent-202, score-0.078]
92 It is to be noted that when h ∈ H_{K,B} with a non-linear kernel, the same bounds apply, with, for the case of bipartite ranking, ||x_{σ(i)} − x_{ν(j)}||²_K replaced by ||x_{σ(i)}||²_K + ||x_{ν(j)}||²_K − 2K(x_{σ(i)}, x_{ν(j)}). [sent-205, score-0.403]
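For the non-linear case of this remark, the squared feature-space distance of a pair can be computed from kernel evaluations only; a one-line sketch (the kernel k is any callable, names ours):

```python
def pair_sq_norm(k, x, x_prime):
    """||x - x'||_K^2 in the feature space of a kernel k, via
    k(x, x) + k(x', x') - 2 k(x, x'), as in the remark above."""
    return k(x, x) + k(x_prime, x_prime) - 2.0 * k(x, x_prime)
```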
93 For binary classification, we recover the bounds of [12], since our framework is a generalization of their approach. [sent-206, score-0.465]
94 As expected, the bounds suggest that kernel machines will generalize well for bipartite ranking. [sent-207, score-0.385]
95 6 Conclusion We have presented a general framework for classifiers trained with interdependent data, and provided the necessary tools to study their generalization properties. [sent-211, score-0.392]
96 It gives new insight into the close relationship between the binary classification task and bipartite ranking, and allows us to prove the first data-dependent bounds for the latter. [sent-212, score-0.427]
97 Moreover, the framework could also yield comparable bounds for other learning tasks. [sent-213, score-0.211]
98 (2005) Stability and generalization of bipartite ranking algorithms, Conference on Learning Theory 18. [sent-224, score-0.606]
99 (2005) Ranking and scoring using empirical risk minimization, Conference on Learning Theory 18. [sent-241, score-0.09]
100 (2004) Large deviations for sums of partly dependent random variables, Random Structures and Algorithms 24, pp. [sent-253, score-0.147]
wordName wordTfidf (topN-words)
[('rademacher', 0.377), ('cj', 0.297), ('ranking', 0.279), ('xtr', 0.221), ('ns', 0.218), ('wj', 0.211), ('zi', 0.208), ('fractional', 0.206), ('bipartite', 0.18), ('theorem', 0.178), ('bounds', 0.177), ('interdependent', 0.156), ('generalization', 0.147), ('rn', 0.122), ('classi', 0.119), ('mcdiarmid', 0.107), ('ers', 0.105), ('dependency', 0.099), ('sup', 0.099), ('fj', 0.092), ('zj', 0.087), ('yp', 0.085), ('notations', 0.084), ('dp', 0.084), ('ln', 0.083), ('cjk', 0.074), ('frc', 0.074), ('complexity', 0.073), ('sp', 0.073), ('binary', 0.07), ('drawn', 0.069), ('inequality', 0.069), ('xp', 0.069), ('bound', 0.065), ('partly', 0.059), ('proper', 0.055), ('cover', 0.055), ('trained', 0.055), ('auc', 0.051), ('dependent', 0.051), ('es', 0.051), ('pairs', 0.049), ('amini', 0.049), ('gallinari', 0.049), ('icj', 0.049), ('supq', 0.049), ('usunier', 0.049), ('zcj', 0.049), ('zcjk', 0.049), ('dz', 0.049), ('sk', 0.048), ('concentration', 0.048), ('variables', 0.047), ('complexities', 0.047), ('noted', 0.046), ('according', 0.045), ('equation', 0.045), ('independent', 0.044), ('mohri', 0.043), ('cation', 0.042), ('training', 0.041), ('examples', 0.039), ('cortes', 0.039), ('risk', 0.037), ('bounding', 0.037), ('sums', 0.037), ('recover', 0.037), ('agarwal', 0.036), ('deduce', 0.036), ('ez', 0.036), ('supposed', 0.036), ('lugosi', 0.036), ('class', 0.036), ('elements', 0.036), ('framework', 0.034), ('deviation', 0.034), ('graph', 0.033), ('paris', 0.033), ('deterministically', 0.033), ('functions', 0.032), ('nition', 0.032), ('remark', 0.031), ('samples', 0.03), ('sign', 0.03), ('kernel', 0.028), ('bousquet', 0.028), ('rk', 0.028), ('qi', 0.028), ('roc', 0.028), ('hypothesis', 0.027), ('empirical', 0.027), ('trivially', 0.027), ('scoring', 0.026), ('input', 0.026), ('learner', 0.025), ('schapire', 0.025), ('tighter', 0.025), ('enable', 0.025), ('inequalities', 0.025), ('sample', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
2 0.23429158 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
Author: Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-taylor, Sándor Szedmák
Abstract: Kernel methods make it relatively easy to define complex highdimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach showing encouraging results and insights. 1
3 0.21015541 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
4 0.14867865 47 nips-2005-Consistency of one-class SVM and related algorithms
Author: Régis Vert, Jean-philippe Vert
Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
5 0.11487181 184 nips-2005-Structured Prediction via the Extragradient Method
Author: Ben Taskar, Simon Lacoste-Julian, Michael I. Jordan
Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. 1
7 0.098171085 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
8 0.093604207 50 nips-2005-Convex Neural Networks
9 0.092031553 85 nips-2005-Generalization to Unseen Cases
10 0.091396734 27 nips-2005-Analysis of Spectral Kernel Design based Semi-supervised Learning
11 0.090728641 102 nips-2005-Kernelized Infomax Clustering
12 0.090166949 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
13 0.086588688 95 nips-2005-Improved risk tail bounds for on-line algorithms
14 0.084853351 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis
15 0.082163192 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
16 0.080199137 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
17 0.073745273 112 nips-2005-Learning Minimum Volume Sets
18 0.073651679 58 nips-2005-Divergences, surrogate loss functions and experimental design
19 0.071761891 10 nips-2005-A General and Efficient Multiple Kernel Learning Algorithm
20 0.071391895 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
topicId topicWeight
[(0, 0.227), (1, 0.134), (2, -0.057), (3, -0.124), (4, 0.081), (5, 0.144), (6, 0.02), (7, 0.093), (8, -0.052), (9, -0.13), (10, -0.089), (11, 0.006), (12, 0.058), (13, -0.039), (14, 0.008), (15, 0.204), (16, 0.158), (17, -0.084), (18, 0.126), (19, -0.069), (20, 0.071), (21, -0.093), (22, 0.09), (23, 0.027), (24, 0.107), (25, 0.003), (26, 0.046), (27, 0.165), (28, 0.109), (29, 0.018), (30, -0.069), (31, 0.076), (32, -0.052), (33, 0.046), (34, 0.063), (35, 0.219), (36, 0.049), (37, 0.139), (38, 0.228), (39, 0.008), (40, -0.021), (41, 0.132), (42, -0.122), (43, 0.063), (44, -0.007), (45, -0.0), (46, 0.024), (47, -0.028), (48, -0.034), (49, 0.094)]
simIndex simValue paperId paperTitle
same-paper 1 0.9482879 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
2 0.7351681 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
Author: Jason Farquhar, David Hardoon, Hongying Meng, John S. Shawe-taylor, Sándor Szedmák
Abstract: Kernel methods make it relatively easy to define complex highdimensional feature spaces. This raises the question of how we can identify the relevant subspaces for a particular learning task. When two views of the same phenomenon are available kernel Canonical Correlation Analysis (KCCA) has been shown to be an effective preprocessing step that can improve the performance of classification algorithms such as the Support Vector Machine (SVM). This paper takes this observation to its logical conclusion and proposes a method that combines this two stage learning (KCCA followed by SVM) into a single optimisation termed SVM-2K. We present both experimental and theoretical analysis of the approach showing encouraging results and insights. 1
3 0.62678891 114 nips-2005-Learning Rankings via Convex Hull Separation
Author: Glenn Fung, Rómer Rosales, Balaji Krishnapuram
Abstract: We propose efficient algorithms for learning ranking functions from order constraints between sets—i.e. classes—of training samples. Our algorithms may be used for maximizing the generalized Wilcoxon Mann Whitney statistic that accounts for the partial ordering of the classes: special cases include maximizing the area under the ROC curve for binary classification and its generalization for ordinal regression. Experiments on public benchmarks indicate that: (a) the proposed algorithm is at least as accurate as the current state-of-the-art; (b) computationally, it is several orders of magnitude faster and—unlike current methods—it is easily able to handle even large datasets with over 20,000 samples. 1
4 0.6054706 112 nips-2005-Learning Minimum Volume Sets
Author: Clayton Scott, Robert Nowak
Abstract: Given a probability measure P and a reference measure µ, one is often interested in the minimum µ-measure set with P -measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P , and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P . Other than these samples, no other information is available regarding P , but the reference measure µ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules. 1
5 0.53428239 47 nips-2005-Consistency of one-class SVM and related algorithms
Author: Régis Vert, Jean-philippe Vert
Abstract: We determine the asymptotic limit of the function computed by support vector machines (SVM) and related algorithms that minimize a regularized empirical convex loss function in the reproducing kernel Hilbert space of the Gaussian RBF kernel, in the situation where the number of examples tends to infinity, the bandwidth of the Gaussian kernel tends to 0, and the regularization parameter is held fixed. Non-asymptotic convergence bounds to this limit in the L2 sense are provided, together with upper bounds on the classification error that is shown to converge to the Bayes risk, therefore proving the Bayes-consistency of a variety of methods although the regularization term does not vanish. These results are particularly relevant to the one-class SVM, for which the regularization can not vanish by construction, and which is shown for the first time to be a consistent density level set estimator. 1
6 0.49378204 85 nips-2005-Generalization to Unseen Cases
7 0.46189246 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
8 0.42096305 117 nips-2005-Learning from Data of Variable Quality
9 0.41646922 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
10 0.39893878 12 nips-2005-A PAC-Bayes approach to the Set Covering Machine
11 0.37475011 95 nips-2005-Improved risk tail bounds for on-line algorithms
12 0.37079993 184 nips-2005-Structured Prediction via the Extragradient Method
13 0.37077343 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
14 0.37024182 62 nips-2005-Efficient Estimation of OOMs
15 0.34640855 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
16 0.33095029 59 nips-2005-Dual-Tree Fast Gauss Transforms
17 0.3233102 105 nips-2005-Large-Scale Multiclass Transduction
18 0.31806487 137 nips-2005-Non-Gaussian Component Analysis: a Semi-parametric Framework for Linear Dimension Reduction
19 0.3032946 195 nips-2005-Transfer learning for text classification
20 0.29451212 123 nips-2005-Maximum Margin Semi-Supervised Learning for Structured Variables
topicId topicWeight
[(3, 0.579), (10, 0.024), (27, 0.011), (31, 0.015), (34, 0.088), (41, 0.017), (55, 0.024), (69, 0.03), (73, 0.03), (88, 0.054), (91, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.97371095 83 nips-2005-Generalization error bounds for classifiers trained with interdependent data
Author: Nicolas Usunier, Massih-reza Amini, Patrick Gallinari
Abstract: In this paper we propose a general framework to study the generalization properties of binary classifiers trained with data which may be dependent, but are deterministically generated upon a sample of independent examples. It provides generalization bounds for binary classification and some cases of ranking problems, and clarifies the relationship between these learning tasks. 1
2 0.94222987 159 nips-2005-Q-Clustering
Author: Mukund Narasimhan, Nebojsa Jojic, Jeff A. Bilmes
Abstract: We show that Queyranne’s algorithm for minimizing symmetric submodular functions can be used for clustering with a variety of different objective functions. Two specific criteria that we consider in this paper are the single linkage and the minimum description length criteria. The first criterion tries to maximize the minimum distance between elements of different clusters, and is inherently “discriminative”. It is known that optimal clusterings into k clusters for any given k in polynomial time for this criterion can be computed. The second criterion seeks to minimize the description length of the clusters given a probabilistic generative model. We show that the optimal partitioning into 2 clusters, and approximate partitioning (guaranteed to be within a factor of 2 of the the optimal) for more clusters can be computed. To the best of our knowledge, this is the first time that a tractable algorithm for finding the optimal clustering with respect to the MDL criterion for 2 clusters has been given. Besides the optimality result for the MDL criterion, the chief contribution of this paper is to show that the same algorithm can be used to optimize a broad class of criteria, and hence can be used for many application specific criterion for which efficient algorithm are not known.
3 0.94076508 117 nips-2005-Learning from Data of Variable Quality
Author: Koby Crammer, Michael Kearns, Jennifer Wortman
Abstract: We initiate the study of learning from multiple sources of limited data, each of which may be corrupted at a different rate. We develop a complete theory of which data sources should be used for two fundamental problems: estimating the bias of a coin, and learning a classifier in the presence of label noise. In both cases, efficient algorithms are provided for computing the optimal subset of data. 1
4 0.93211228 131 nips-2005-Multiple Instance Boosting for Object Detection
Author: Cha Zhang, John C. Platt, Paul A. Viola
Abstract: A good image object detection algorithm is accurate, fast, and does not require exact locations of objects in a training set. We can create such an object detector by taking the architecture of the Viola-Jones detector cascade and training it with a new variant of boosting that we call MILBoost. MILBoost uses cost functions from the Multiple Instance Learning literature combined with the AnyBoost framework. We adapt the feature selection criterion of MILBoost to optimize the performance of the Viola-Jones cascade. Experiments show that the detection rate is up to 1.6 times better using MILBoost. This increased detection rate shows the advantage of simultaneously learning the locations and scales of the objects in the training set along with the parameters of the classifier. 1
5 0.79368395 85 nips-2005-Generalization to Unseen Cases
Author: Teemu Roos, Peter Grünwald, Petri Myllymäki, Henry Tirri
Abstract: We analyze classification error on unseen cases, i.e. cases that are different from those in the training set. Unlike standard generalization error, this off-training-set error may differ significantly from the empirical error with high probability even with large sample sizes. We derive a datadependent bound on the difference between off-training-set and standard generalization error. Our result is based on a new bound on the missing mass, which for small samples is stronger than existing bounds based on Good-Turing estimators. As we demonstrate on UCI data-sets, our bound gives nontrivial generalization guarantees in many practical cases. In light of these results, we show that certain claims made in the No Free Lunch literature are overly pessimistic. 1
6 0.7646783 112 nips-2005-Learning Minimum Volume Sets
7 0.67479843 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
8 0.67248476 41 nips-2005-Coarse sample complexity bounds for active learning
9 0.66670251 196 nips-2005-Two view learning: SVM-2K, Theory and Practice
10 0.63439494 84 nips-2005-Generalization in Clustering with Unobserved Features
11 0.62103367 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
12 0.6182043 58 nips-2005-Divergences, surrogate loss functions and experimental design
13 0.61676192 177 nips-2005-Size Regularized Cut for Data Clustering
14 0.61099195 191 nips-2005-The Forgetron: A Kernel-Based Perceptron on a Fixed Budget
15 0.60804021 147 nips-2005-On the Convergence of Eigenspaces in Kernel Principal Component Analysis
16 0.59615302 95 nips-2005-Improved risk tail bounds for on-line algorithms
17 0.5909031 160 nips-2005-Query by Committee Made Real
18 0.58805007 47 nips-2005-Consistency of one-class SVM and related algorithms
19 0.58150554 197 nips-2005-Unbiased Estimator of Shape Parameter for Spiking Irregularities under Changing Environments
20 0.5763123 74 nips-2005-Faster Rates in Regression via Active Learning