jmlr jmlr2008 jmlr2008-9 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Falk-Florian Henrich, Klaus Obermayer
Abstract: We introduce a computationally feasible, “constructive” active learning method for binary classification. The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. In order to reduce computational complexity the version space is restricted to spherical simplices and learning proceeds by subdividing the edges of maximal length. We show that this procedure optimally reduces a tight upper bound on the generalization error. The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. An upper bound is provided for the probability of disagreement between classifiers (hence the generalization error) for non-constant data densities on the sphere. The emphasis of this work lies on providing mathematically exact performance estimates for active learning strategies. Keywords: active learning, spherical subdivision, error bounds, simplex halving
Reference: text
sentIndex sentText sentNum sentScore
1 The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. [sent-8, score-0.275]
2 In order to reduce computational complexity the version space is restricted to spherical simplices and learning proceeds by subdividing the edges of maximal length. [sent-9, score-0.394]
3 The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. [sent-11, score-0.482]
4 Keywords: active learning, spherical subdivision, error bounds, simplex halving. [sent-14, score-0.818]
5 The article is organized as follows: After introducing the geometric setup of active learning for binary classification (Section 2), we formulate a learning method for data distributed on the unit sphere Sn ⊂ Rn+1 (Section 3). [sent-53, score-0.377]
6 Thereafter, we extend the basic algorithm to a broader class of separable, binary classification problems using isometries induced by charts of the sphere and products of hyperspheres (Sections 4, 5). [sent-54, score-0.472]
7 Upper bounds are provided for the generalization error for nonconstant data densities for binary classification problems on the sphere (Section 6) and for linear classifiers without bias in Rn (Section 7). [sent-56, score-0.394]
8 However, this is not the case, because Rn, or any submanifold therein, can be embedded into Sn, the n-sphere, by the inverse of stereographic projection (see Section 4). [sent-64, score-0.283]
9 The use of closed hemispheres as classifiers is appropriate for spherical data, since hemispheres are the direct analog to half spaces in Euclidean geometry. [sent-92, score-0.433]
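As an illustrative sketch (the function name and interface are not taken from the paper), a closed hemisphere with pole c ∈ Sn classifies a point x ∈ Sn by the sign of the Euclidean inner product, the direct spherical analog of a linear classifier without bias:

```python
import numpy as np

def classify(c, x):
    """Hemisphere classifier on the sphere (sketch): a point x in S^n is
    assigned the sign of its inner product with the classifier's pole c,
    the spherical analog of a linear classifier without bias."""
    return np.sign(np.dot(c, x))
```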
10 Since the total volume V := ∫_{Sn} ω of the unit sphere with respect to its canonical Riemannian metric and volume form is not equal to one, we have to normalize: dG(c1, c2) = (1/V) ∫_{D(c1,c2)} ω, which is a fixed multiple of the spherical distance d(c1, c2) between the classifiers. [sent-99, score-0.518]
11 Geometrically, V is a convex spherical polytope, a high-dimensional generalization of a spherical polygon whose edges are segments of great circles and whose (n − 1)-dimensional facets are segments of (n − 1)-dimensional great spheres within Sn. [sent-108, score-0.893]
12 This results in a set of admissible classifiers S ⊂ Sn which is an equilateral simplex on Sn . [sent-118, score-0.5]
13 Select one of the edges of maximal length of the current simplex S. [sent-120, score-0.637]
14 Replace the old simplex S by the part in direction of u, that is, the simplex which has as vertices m as well as all vertices of S with the exception of that end point e of the chosen longest edge for which ⟨u, e⟩ < 0. [sent-129, score-1.135]
15 Figure 1 illustrates one iteration of the simplex algorithm on the sphere S 2 . [sent-134, score-0.657]
16 [Figure 1: The drawing shows one iteration of the simplex algorithm for the two-dimensional case, that is, for spherical triangles on the unit sphere S2; the labels mark the simplex vertices, the longest edge, its midpoint m, and the cut normal u.] [sent-135, score-1.062]
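A minimal Python sketch of one such iteration under the conventions above; the array layout, the oracle interface, and the SVD-based computation of the cut normal u are illustrative assumptions (the authors report a C++ implementation):

```python
import numpy as np

def subdivision_step(simplex, oracle):
    """One iteration of the spherical simplex algorithm (sketch).

    simplex : (n+1, n+1) array; each row is a unit vector in R^{n+1},
              a vertex of the current spherical simplex in classifier space S^n.
    oracle  : callable mapping a query point u in S^n to its label +1 or -1.
    """
    k = simplex.shape[0]
    # 1. pick an edge of maximal spherical length arccos<v_i, v_j>
    best, (i, j) = -1.0, (0, 1)
    for a in range(k):
        for b in range(a + 1, k):
            d = np.arccos(np.clip(simplex[a] @ simplex[b], -1.0, 1.0))
            if d > best:
                best, (i, j) = d, (a, b)
    # 2. spherical midpoint m of that edge
    m = simplex[i] + simplex[j]
    m /= np.linalg.norm(m)
    # 3. normal u of the hyperplane H spanned by m and the remaining vertices;
    #    u is the point whose label is queried
    others = np.delete(simplex, [i, j], axis=0)
    basis = np.vstack([others, m])
    u = np.linalg.svd(basis)[2][-1]      # unit vector orthogonal to H
    u *= oracle(u)                       # orient u by the queried label
    # 4. keep the sub-simplex on the labeled side: replace the end point e
    #    of the longest edge with <u, e> < 0 by the midpoint m
    new_simplex = simplex.copy()
    new_simplex[i if u @ simplex[i] < 0 else j] = m
    return new_simplex
```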
17 The worst case generalization error after each iteration can be computed by evaluating the maximum spherical distance of the chosen classifier to the vertices of the spherical simplex. [sent-142, score-0.672]
18 To be explicit, the following statement holds: Proposition 3 If S is the current simplex from the simplex algorithm (see Definition 2) with vertices v1 , . [sent-143, score-0.953]
19 Moreover, if c ∈ S denotes the center of mass, then dG(c∗, c) ≤ max_i d(c, vi) = max_i arccos⟨c, vi⟩ is a tight and attainable upper bound for the generalization error. [sent-148, score-0.428]
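A short sketch of how this bound can be evaluated, assuming the simplex is stored as an array of unit vertex vectors as in the iteration sketch above (the function name is illustrative):

```python
import numpy as np

def center_of_mass_bound(simplex):
    """Center of mass of a spherical simplex (normalized vertex sum) and the
    attainable bound max_i arccos<c, v_i> on the generalization error (sketch)."""
    c = simplex.sum(axis=0)
    c /= np.linalg.norm(c)
    bound = max(np.arccos(np.clip(c @ v, -1.0, 1.0)) for v in simplex)
    return c, bound
```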
20 If all elements of the simplex are admissible classifiers, the bound is tight. [sent-151, score-0.443]
21 Clearly, the maximal edge length of the simplex S converges to zero. [sent-153, score-0.649]
22 In Appendix A, we derive O((n+1)^3) as a rough complexity estimate for one iteration of the simplex algorithm (see Definition 2). [sent-154, score-0.443]
23 Another question concerning the convergence of the algorithm is: How many iterations are needed until the maximum edge length of the initial simplex starts to drop? [sent-155, score-0.572]
24 Proposition 4 Let S be the initial equilateral simplex from the simplex algorithm (see Definition 2). [sent-157, score-0.943]
25 The easiest method to obtain isometries from the n-sphere to other data spaces is to consider charts of the sphere together with the induced metric. [sent-162, score-0.419]
26 Being isometries, they preserve the geometry, and any generalization bounds derived for the sphere can be applied without modifications. [sent-163, score-0.318]
27 The stereographic projection σN : Sn \ {N} → Rn, (x1 , . [sent-168, score-0.283]
28 Hence, we can identify Sn \ {N} with Rn. [Figure 2: The stereographic projection S2 \ {N} → R2 is a diffeomorphism from the sphere with the north pole removed to the plane; the labels mark the north pole N, a point on the sphere, the plane of projection (R2), and its image under projection.] [sent-179, score-1.302]
29 If one considers a ray from the north pole to some point on the plane, the stereographic projection will map the intersection point of this ray with the sphere onto its intersection point with the plane. [sent-181, score-0.656]
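A sketch of the standard stereographic projection from the north pole N = (0, ..., 0, 1) and its inverse; the concrete formulas are the textbook ones and are assumed here rather than quoted from the truncated equation above:

```python
import numpy as np

def stereographic(x):
    """Stereographic projection S^n \\ {N} -> R^n from N = (0, ..., 0, 1):
    (x_1, ..., x_{n+1}) maps to (x_1, ..., x_n) / (1 - x_{n+1})."""
    return x[:-1] / (1.0 - x[-1])

def stereographic_inv(y):
    """Inverse map R^n -> S^n \\ {N}; |y| -> infinity approaches the north pole."""
    s = y @ y
    return np.append(2.0 * y, s - 1.0) / (s + 1.0)
```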
30 When viewed under stereographic projection, our spherical classifiers fall in three categories: If the boundary of the hemisphere, which is a great (n − 1)-sphere within Sn, contains the north pole, its projection is a hyperplane through the origin in Rn. [sent-190, score-0.626]
31 The equatorial great sphere {x ∈ Sn ⊂ Rn+1 | xn+1 = 0} is projected onto Sn−1 ⊂ Rn. [sent-191, score-0.299]
32 All other great spheres become spheres intersecting Sn−1 ⊂ Rn orthogonally. [sent-192, score-0.369]
33 Hence, any data which is separable by these classes of hypersurfaces in Rn is separable by great spheres on Sn and vice versa. [sent-193, score-0.374]
34 [Figure 3: Illustration of the gnomonic projection for the case S2 ⊃ {x3 > 0} → R2; the labels mark the plane of projection (R2), the north pole N, a point on the sphere, and its image under projection.] [sent-207, score-1.272]
35 Their intersection points with the sphere are mapped to the corresponding intersection points with the plane. [sent-210, score-0.284]
36 The gnomonic projection distorts lengths as well as angles, but maps great circles to straight lines. [sent-211, score-0.315]
37 As was the case with the stereographic chart, gnomonic projection is an isometry from the upper half-sphere to (Rn, g). [sent-217, score-0.456]
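For comparison, a sketch of the gnomonic (central) projection of the upper half-sphere and its inverse; again the concrete formulas are the standard ones and are assumptions here:

```python
import numpy as np

def gnomonic(x):
    """Gnomonic projection of the upper half-sphere {x_{n+1} > 0} onto R^n:
    project along the ray from the origin; great circles map to straight lines."""
    return x[:-1] / x[-1]

def gnomonic_inv(y):
    """Inverse: lift y in R^n back onto the upper half-sphere."""
    return np.append(y, 1.0) / np.sqrt(y @ y + 1.0)
```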
38 Products of Spheres We now extend the simplex algorithm (see Definition 2) to other data manifolds and other sets of classifiers using a simple product construction. [sent-235, score-0.58]
39 The data manifold is a product M = (Sn1, g1) × · · · × (Snk, gk) (Equation 4), where each factor (Snj, gj) is a unit sphere of dimension nj with its standard metric gj. [sent-240, score-0.307]
40 This leads to the extended spherical simplex algorithm: Definition 6 (Extended simplex algorithm). [sent-260, score-1.137]
41 For each factor Snj, create an initial simplex using a random orthogonal matrix whose columns are interpreted as a basis for Rnj+1. [sent-262, score-0.443]
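A sketch of this initialization; using the QR decomposition of a Gaussian matrix to obtain a random orthogonal matrix is an implementation choice, not taken from the paper:

```python
import numpy as np

def initial_simplex(n, seed=None):
    """Equilateral initial simplex on S^n: the columns of a random orthogonal
    (n+1) x (n+1) matrix serve as the n+1 vertices; every edge then has
    spherical length pi/2 (sketch)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((n + 1, n + 1)))
    return q.T          # rows are orthonormal unit vectors in R^{n+1}
```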
42 Find one of the edges of maximal length of each factor of the current simplex product S. [sent-265, score-0.695]
43 Replace the old simplex product S by the product of the parts in direction of u j . [sent-277, score-0.559]
44 Proposition 7 If S is the current product simplex from the extended simplex algorithm (see Definition 6) with maximal edge lengths d1 , . [sent-288, score-1.148]
45 The complexity estimate for one iteration of the extended simplex algorithm is O((n1 + 1)^3 + . [sent-294, score-0.443]
46 This estimate can be deduced directly from the complexity analysis of the simplex algorithm given in Appendix A. [sent-302, score-0.443]
47 For each factor in Equation 4, we may choose one of stereographic or gnomonic projection. [sent-305, score-0.342]
48 At each step of the extended simplex algorithm, the version space is a hypercube of equal edge length l. [sent-345, score-0.572]
49 The Riemannian volume form belonging to the metric g on M induces a volume form on Sn which, in general, is not equal to the spherical volume form ω. [sent-354, score-0.542]
50 c∈S Therefore, the simplex algorithm (see Definition 2) will still converge, as it reduces an upper bound of the generalization error. [sent-361, score-0.543]
51 Nevertheless, we can force a simple upper bound by assuming the deviation of the induced volume form from spherical volume to be small: sup |1 − f (x)| < ε. [sent-363, score-0.445]
52 Inserting the above upper bound into Proposition 3 we obtain Proposition 9: Let S be the current simplex from the simplex algorithm (see Definition 2) with vertices v1 , . [sent-371, score-1.077]
53 If c∗ ∈ Sn denotes the unknown true classifier, the generalization error of c is bounded by dG(c, c∗) ≤ ((1 + ε) Vol(Sn) / (π Vol(Sn))) · max_{i,j} d(vi, vj), where d(vi, vj) denotes the spherical distance of the vertices. [sent-375, score-0.354]
54 The usefulness of the above proposition depends on how much the scaled volume form f ω deviates from the canonical spherical volume ω. [sent-376, score-0.55]
55 The generalization distance is given by dG(c1, c2) = ∫_{D(c1,c2)} f dx, which can be rewritten as an integral over {s ∈ Sn−1 | Fs ⊂ D(c1, c2)}; here dx, dr, ds denote the canonical volume forms, and different fibers Fs have different mass in the sense that the map Sn−1 → R, s ↦ ∫_{Fs} f(s, ·) dr, is not constant. [sent-392, score-0.304]
56 Consequently, all results derived for the spherical simplex algorithm (see Definition 2) apply, including the hard upper bounds on the generalization error. [sent-402, score-0.834]
57 Proposition 11: If S is the current simplex from the simplex algorithm (see Definition 2) with vertices v1 , . [sent-405, score-0.953]
58 Let us denote this center by p∗ ∈ V ⊂ Sn−1 , where V ⊂ Sn−1 is the version space (a spherical polytope, see Section 3) on the hypersphere. [sent-416, score-0.303]
59 Motivated by these insights, they propose the following pool-based strategy for selecting an unlabeled data point x ∈ Sn−1 to be labeled: Choose x such that the (spherical) distance from the (n − 2)-dimensional great sphere X := {s ∈ Sn−1 | ⟨x, s⟩ = 0} to p∗ is minimal. [sent-418, score-0.304]
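A sketch of this selection rule; the spherical distance from a point p∗ to the great sphere {s | ⟨x, s⟩ = 0} equals |π/2 − arccos⟨x, p∗⟩|, which is what the code minimizes over a pool of candidates (the pool interface and function name are illustrative, not the cited authors' implementation):

```python
import numpy as np

def select_query(pool, p_star):
    """Pool-based selection (sketch): choose the unlabeled x whose decision
    great sphere X = {s : <x, s> = 0} lies closest to the current center p_star."""
    cos = np.clip(pool @ p_star, -1.0, 1.0)           # pool: (m, n) unit rows
    dist_to_X = np.abs(np.pi / 2.0 - np.arccos(cos))  # spherical point-to-X distance
    return pool[np.argmin(dist_to_X)]
```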
60 After the data point x is labeled, the version space will be cut into two pieces along the great sphere X ⊂ Sn−1 . [sent-419, score-0.304]
61 The goal of this strategy is to reduce the volume of the spherical polytope V as quickly as possible. [sent-420, score-0.37]
62 To date, no closed formula is known for the volume of an n-spherical simplex, let alone of a spherical polytope; hence, there is no way of computing the exact volume of a version space on a hypersphere. [sent-423, score-0.601]
63 The SVM algorithm works with the Euclidean scalar product on Rn, and therefore implicitly assumes the canonical Riemannian metric and volume form on the unit sphere. [sent-427, score-0.311]
64 Experimental Results The following results were obtained from a C++ implementation of the simplex algorithm (see Definition 2). [sent-430, score-0.443]
65 The numerically most sensitive operation of the algorithm is the computation of a normal vector u ∈ Sn to the hyperplane H ⊂ Rn+1 whose intersection I = H ∩ Sn with Sn cuts the current simplex S into two pieces. [sent-431, score-0.478]
66 Select the basis of H given by the midpoint of the longest edge and all vertices of the simplex excluding the end points of the longest edge. [sent-435, score-0.71]
67 In order to test the performance of the simplex algorithm on spheres of different dimensions a series of numerical experiments was conducted, where the following quantities were measured. [sent-441, score-0.602]
68 Let (v1 , . . . , vn+1) denote the vertices of the current simplex S of the simplex algorithm; the maximal edge length max_{i,j} d(vi, vj) is an upper bound on the generalization error, regardless of which point of the simplex is chosen as the learned classifier. [sent-445, score-1.702]
69 This yields a tight upper bound on the generalization error if we choose the center of mass as the learned classifier. [sent-448, score-0.286]
70 Test data were sampled (i) from a uniform density on the sphere and (ii) from an aspherical density with two distinct “clusters” at opposite poles. [sent-450, score-0.426]
71 The aspherical density was constructed by mapping the uniform distribution from an open parameter cuboid onto the sphere using n-spherical coordinates. [sent-451, score-0.34]
72 Figure 4 illustrates the relation between a sample drawn from this density on the sphere S2 and its stereographic projection onto the plane R2. [sent-452, score-0.598]
73 Any density with two peaks at p1, p2 ∈ Rn can be identified with a density on Sn with peaks at opposite poles: Firstly, we apply a translation to move the midpoint of the line p1 p2 to the origin. [sent-454, score-0.296]
74 Now inverse stereographic projection will map the peaks onto opposite poles. [sent-463, score-0.363]
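A sketch of this construction; the rescaling that places the two peaks at unit distance from the origin (so that their images under inverse stereographic projection become antipodal) is an assumption about the elided normalization, and stereographic_inv refers to the projection sketch given in Section 4 above:

```python
import numpy as np

def lift_two_peak_density(points, p1, p2):
    """Map data in R^n with density peaks at p1, p2 onto S^n so that the peaks
    land at opposite (antipodal) poles (sketch): translate the midpoint of the
    segment p1 p2 to the origin, rescale so the peaks sit at unit distance,
    then apply inverse stereographic projection."""
    mid = 0.5 * (p1 + p2)
    scale = 1.0 / np.linalg.norm(p1 - mid)
    shifted = (points - mid) * scale
    return np.array([stereographic_inv(y) for y in shifted])
```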
75 The above quantities were computed for each step of the simplex algorithm. [sent-469, score-0.443]
76 The simplex algorithm is initialized with an equilateral simplex. [sent-486, score-0.5]
77 During the first learning steps, the center of mass moves towards those vertices whose adjacent edges are cut already. [sent-487, score-0.293]
78 The simplex becomes a “thin pyramid” with small base, and the following drop in the plots then corresponds to a cut of a line connecting the apex to the base. [sent-489, score-0.482]
79 If the data is drawn from the spherical distribution, the approximate generalization error changes smoothly with the number of selected training data, and its variance is very small. [sent-494, score-0.315]
80 Nevertheless, the numerical experiments show that the spherical simplex algorithm performs well even in the case of non-uniform densities. [sent-496, score-0.694]
81 So far, we restricted the experiments to single spheres instead of products, because the simulation of the extended simplex algorithm on M = S n1 × . [sent-498, score-0.602]
82 We implemented the extended simplex algorithm on the n-torus. [sent-518, score-0.443]
83 Using this mapping, we can visualize the iterations of the extended simplex algorithm on von Mises distributed data. [sent-521, score-0.494]
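A sketch of a standard embedding of the 2-torus into R3 that can serve as such a visualization mapping; the radii below are illustrative choices, not values from the paper:

```python
import numpy as np

def torus_embedding(s, t, R=2.0, r=1.0):
    """Embed the flat torus S^1 x S^1, parametrized by angles (s, t),
    into R^3 for visualization (sketch)."""
    return np.array([(R + r * np.cos(s)) * np.cos(t),
                     (R + r * np.cos(s)) * np.sin(t),
                     r * np.sin(s)])
```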
84 Figure 9 depicts the stereographic projection of a sample drawn from the von Mises distribution together with the projected classifier in R2. [sent-523, score-0.368]
85 The resulting values were averaged over 1,000 runs. [Figure 8: The extended simplex algorithm on the 2-torus.] [sent-528, score-0.443]
86 The meaning of the nested regions is the following (light to dark): positively classified area of the true classifier, version space after initialization (step one of the extended simplex algorithm), version space after first iteration, version space after second iteration. [sent-531, score-0.472]
87 Figure 9: The image of a data sample from two superposed von Mises distributions on the two-dimensional torus under the stereographic projection (defined in Section 4) to R2. [sent-532, score-0.425]
88 Therefore, the initialization of the extended simplex algorithm yields a classifier whose error is far smaller than the average generalization error of its spherical counterpart. [sent-544, score-0.758]
89 The algorithm was first formulated for the generic case of a binary classification problem, where data lies on an n-dimensional hypersphere and where both classes are separable using (n − 1)-dimensional great spheres as classifiers. [sent-551, score-0.343]
90 The purpose of this appendix is to give a more detailed analysis of the complexity of the simplex algorithm (see Definition 2) as well as a proof of Proposition 4. [sent-569, score-0.443]
91 The edge lengths d(vi, vj) := arccos(⟨vi, vj⟩) between vertices vi, vj of the simplex must be computed in order to determine which edge is to be cut next. [sent-572, score-0.856]
92 After step 2, the current simplex is cut by a plane through the midpoint m of some edge (a, b). [sent-574, score-0.651]
93 Since the computational complexity of the other steps is negligible we obtain O((n + 1)^3) as a rough complexity estimate for one iteration of the simplex algorithm (see Definition 2). [sent-582, score-0.443]
94 The computation of the center of mass in step seven amounts to adding up all vertex vectors of the current simplex and normalizing their sum to length one. [sent-586, score-0.66]
95 We now restate and prove Proposition 4 from Section 3: Proposition 12: Let S be the initial equilateral simplex from the simplex algorithm (see Definition 2). [sent-588, score-0.943]
96 Since the degree of each vertex is n, each vertex of the initial simplex is an end point of an edge of full length. [sent-593, score-0.611]
97 Now the spherical law of cosines tells us that all edges not adjacent to e have length π/2. [sent-603, score-0.368]
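One way to verify this for the orthonormal initial vertices is a direct inner-product computation (a sketch of the step, not the paper's full argument): with m the normalized midpoint of the subdivided edge (vi, vj) and vk any other vertex,

$$\cos d(m, v_k) = \left\langle \frac{v_i + v_j}{\lVert v_i + v_j \rVert},\, v_k \right\rangle = 0 \quad\Longrightarrow\quad d(m, v_k) = \frac{\pi}{2},$$

since the initial vertices are pairwise orthogonal.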
98 It can be covered by two charts, stereographic projection from the north and south pole. [sent-623, score-0.324]
99 [Figure 11: A subdivision of a spherical simplex on S3 ⊂ R4 under stereographic projection (see Section 4).] [sent-628, score-1.011]
100 We use this equality in Section 4 to compute the metric in stereographic and gnomonic coordinates. [sent-636, score-0.396]
wordName wordTfidf (topN-words)
[('simplex', 0.443), ('sn', 0.341), ('spherical', 0.251), ('sphere', 0.214), ('stereographic', 0.205), ('riemannian', 0.19), ('vol', 0.184), ('spheres', 0.159), ('bermayer', 0.148), ('enrich', 0.148), ('gnomonic', 0.137), ('pherical', 0.137), ('ubdivision', 0.137), ('active', 0.124), ('isometries', 0.114), ('charts', 0.091), ('hemispheres', 0.091), ('torus', 0.091), ('proposition', 0.088), ('dxn', 0.08), ('hemisphere', 0.08), ('subsimplex', 0.08), ('manifolds', 0.079), ('volume', 0.079), ('projection', 0.078), ('edge', 0.078), ('mises', 0.078), ('maximal', 0.077), ('densities', 0.076), ('arccos', 0.069), ('mass', 0.069), ('aspherical', 0.068), ('hypersphere', 0.068), ('vertices', 0.067), ('edges', 0.066), ('separable', 0.065), ('tight', 0.065), ('classi', 0.064), ('generalization', 0.064), ('ers', 0.06), ('density', 0.058), ('product', 0.058), ('equilateral', 0.057), ('metric', 0.054), ('products', 0.053), ('canonical', 0.053), ('peaks', 0.052), ('center', 0.052), ('length', 0.051), ('great', 0.051), ('von', 0.051), ('vi', 0.051), ('lengths', 0.049), ('fs', 0.048), ('chart', 0.048), ('midpoint', 0.048), ('pole', 0.048), ('ber', 0.046), ('subdivisions', 0.046), ('earning', 0.045), ('vertex', 0.045), ('plane', 0.043), ('er', 0.043), ('disagreement', 0.041), ('geodesic', 0.041), ('north', 0.041), ('bounds', 0.04), ('attainable', 0.04), ('polytope', 0.04), ('cut', 0.039), ('tp', 0.039), ('distance', 0.039), ('unit', 0.039), ('constructive', 0.038), ('tong', 0.037), ('longest', 0.037), ('upper', 0.036), ('intersection', 0.035), ('henrich', 0.034), ('hypersurfaces', 0.034), ('subdivide', 0.034), ('subdivision', 0.034), ('volumina', 0.034), ('projected', 0.034), ('labeled', 0.033), ('det', 0.033), ('tangent', 0.033), ('manifold', 0.033), ('curves', 0.033), ('nk', 0.032), ('smooth', 0.032), ('dr', 0.032), ('euclidean', 0.032), ('balcan', 0.029), ('positively', 0.029), ('freund', 0.029), ('xn', 0.028), ('scalar', 0.028), ('opposite', 0.028), ('query', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 9 jmlr-2008-Active Learning by Spherical Subdivision
Author: Falk-Florian Henrich, Klaus Obermayer
Abstract: We introduce a computationally feasible, “constructive” active learning method for binary classification. The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. In order to reduce computational complexity the version space is restricted to spherical simplices and learning procedes by subdividing the edges of maximal length. We show that this procedure optimally reduces a tight upper bound on the generalization error. The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. An upper bound is provided for the probability of disagreement between classifiers (hence the generalization error) for non-constant data densities on the sphere. The emphasis of this work lies on providing mathematically exact performance estimates for active learning strategies. Keywords: active learning, spherical subdivision, error bounds, simplex halving
2 0.11638141 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
Author: Guy Lebanon, Yi Mao
Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing
3 0.06474939 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
4 0.059107691 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
5 0.056628909 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
Author: Xianchao Xie, Zhi Geng
Abstract: In this paper, we propose a recursive method for structural learning of directed acyclic graphs (DAGs), in which a problem of structural learning for a large DAG is first decomposed into two problems of structural learning for two small vertex subsets, each of which is then decomposed recursively into two problems of smaller subsets until none subset can be decomposed further. In our approach, search for separators of a pair of variables in a large DAG is localized to small subsets, and thus the approach can improve the efficiency of searches and the power of statistical tests for structural learning. We show how the recent advances in the learning of undirected graphical models can be employed to facilitate the decomposition. Simulations are given to demonstrate the performance of the proposed method. Keywords: Bayesian network, conditional independence, decomposition, directed acyclic graph, structural learning
6 0.055234812 52 jmlr-2008-Learning from Multiple Sources
7 0.053337272 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
8 0.052104395 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
9 0.050806522 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs (Special Topic on Causality)
10 0.050644491 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
11 0.049157292 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
12 0.043634217 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
13 0.043520637 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
14 0.039257489 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
15 0.037978396 7 jmlr-2008-A Tutorial on Conformal Prediction
16 0.037112687 96 jmlr-2008-Visualizing Data using t-SNE
17 0.036244899 57 jmlr-2008-Manifold Learning: The Price of Normalization
18 0.035772499 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
19 0.035612646 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
20 0.035004381 11 jmlr-2008-Aggregation of SVM Classifiers Using Sobolev Spaces
topicId topicWeight
[(0, 0.197), (1, 0.022), (2, -0.017), (3, -0.059), (4, 0.034), (5, -0.057), (6, 0.082), (7, -0.044), (8, 0.007), (9, -0.119), (10, 0.013), (11, -0.016), (12, -0.101), (13, 0.126), (14, 0.172), (15, 0.133), (16, 0.054), (17, 0.016), (18, 0.063), (19, 0.105), (20, -0.037), (21, -0.175), (22, -0.259), (23, -0.204), (24, -0.022), (25, -0.05), (26, 0.173), (27, -0.095), (28, 0.103), (29, -0.372), (30, 0.165), (31, -0.015), (32, -0.046), (33, 0.118), (34, -0.013), (35, 0.02), (36, 0.018), (37, 0.051), (38, 0.031), (39, -0.137), (40, 0.002), (41, -0.017), (42, -0.113), (43, 0.137), (44, -0.009), (45, 0.055), (46, -0.023), (47, 0.072), (48, 0.075), (49, 0.047)]
simIndex simValue paperId paperTitle
same-paper 1 0.94175678 9 jmlr-2008-Active Learning by Spherical Subdivision
Author: Falk-Florian Henrich, Klaus Obermayer
Abstract: We introduce a computationally feasible, “constructive” active learning method for binary classification. The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. In order to reduce computational complexity the version space is restricted to spherical simplices and learning procedes by subdividing the edges of maximal length. We show that this procedure optimally reduces a tight upper bound on the generalization error. The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. An upper bound is provided for the probability of disagreement between classifiers (hence the generalization error) for non-constant data densities on the sphere. The emphasis of this work lies on providing mathematically exact performance estimates for active learning strategies. Keywords: active learning, spherical subdivision, error bounds, simplex halving
2 0.71938974 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
Author: Guy Lebanon, Yi Mao
Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing
3 0.305181 1 jmlr-2008-A Bahadur Representation of the Linear Support Vector Machine
Author: Ja-Yong Koo, Yoonkyung Lee, Yuwon Kim, Changyi Park
Abstract: The support vector machine has been successful in a variety of applications. Also on the theoretical front, statistical properties of the support vector machine have been studied quite extensively with a particular attention to its Bayes risk consistency under some conditions. In this paper, we study somewhat basic statistical properties of the support vector machine yet to be investigated, namely the asymptotic behavior of the coefficients of the linear support vector machine. A Bahadur type representation of the coefficients is established under appropriate conditions, and their asymptotic normality and statistical variability are derived on the basis of the representation. These asymptotic results do not only help further our understanding of the support vector machine, but also they can be useful for related statistical inferences. Keywords: asymptotic normality, Bahadur representation, classification, convexity lemma, Radon transform
4 0.27164707 30 jmlr-2008-Discriminative Learning of Max-Sum Classifiers
Author: Vojtěch Franc, Bogdan Savchynskyy
Abstract: The max-sum classifier predicts n-tuple of labels from n-tuple of observable variables by maximizing a sum of quality functions defined over neighbouring pairs of labels and observable variables. Predicting labels as MAP assignments of a Random Markov Field is a particular example of the max-sum classifier. Learning parameters of the max-sum classifier is a challenging problem because even computing the response of such classifier is NP-complete in general. Estimating parameters using the Maximum Likelihood approach is feasible only for a subclass of max-sum classifiers with an acyclic structure of neighbouring pairs. Recently, the discriminative methods represented by the perceptron and the Support Vector Machines, originally designed for binary linear classifiers, have been extended for learning some subclasses of the max-sum classifier. Besides the max-sum classifiers with the acyclic neighbouring structure, it has been shown that the discriminative learning is possible even with arbitrary neighbouring structure provided the quality functions fulfill some additional constraints. In this article, we extend the discriminative approach to other three classes of max-sum classifiers with an arbitrary neighbourhood structure. We derive learning algorithms for two subclasses of max-sum classifiers whose response can be computed in polynomial time: (i) the max-sum classifiers with supermodular quality functions and (ii) the max-sum classifiers whose response can be computed exactly by a linear programming relaxation. Moreover, we show that the learning problem can be approximately solved even for a general max-sum classifier. Keywords: max-xum classifier, hidden Markov networks, support vector machines
5 0.24013366 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
Author: Andreas Maurer
Abstract: A method is introduced to learn and represent similarity with linear operators in kernel induced Hilbert spaces. Transferring error bounds for vector valued large-margin classifiers to the setting of Hilbert-Schmidt operators leads to dimension free bounds on a risk functional for linear representations and motivates a regularized objective functional. Minimization of this objective is effected by a simple technique of stochastic gradient descent. The resulting representations are tested on transfer problems in image processing, involving plane and spatial geometric invariants, handwritten characters and face recognition. Keywords: learning similarity, similarity, transfer learning
6 0.2257742 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
7 0.21263893 40 jmlr-2008-Graphical Methods for Efficient Likelihood Inference in Gaussian Covariance Models
8 0.2121373 52 jmlr-2008-Learning from Multiple Sources
9 0.2113533 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
10 0.20309147 96 jmlr-2008-Visualizing Data using t-SNE
11 0.19062395 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
12 0.18746699 25 jmlr-2008-Consistency of Random Forests and Other Averaging Classifiers
13 0.18643683 12 jmlr-2008-Algorithms for Sparse Linear Classifiers in the Massive Data Setting
14 0.18318684 68 jmlr-2008-Nearly Uniform Validation Improves Compression-Based Error Bounds
15 0.17730644 6 jmlr-2008-A Recursive Method for Structural Learning of Directed Acyclic Graphs
16 0.17699923 10 jmlr-2008-Active Learning of Causal Networks with Intervention Experiments and Optimal Designs (Special Topic on Causality)
17 0.17607245 8 jmlr-2008-Accelerated Neural Evolution through Cooperatively Coevolved Synapses
18 0.17534012 64 jmlr-2008-Model Selection in Kernel Based Regression using the Influence Function (Special Topic on Model Selection)
19 0.17139609 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
20 0.16294751 47 jmlr-2008-Learning Balls of Strings from Edit Corrections
topicId topicWeight
[(0, 0.03), (5, 0.015), (40, 0.041), (54, 0.047), (58, 0.045), (60, 0.014), (61, 0.022), (66, 0.06), (76, 0.028), (78, 0.024), (88, 0.103), (89, 0.373), (92, 0.044), (94, 0.055), (99, 0.02)]
simIndex simValue paperId paperTitle
1 0.77560037 19 jmlr-2008-Bouligand Derivatives and Robustness of Support Vector Machines for Regression
Author: Andreas Christmann, Arnout Van Messem
Abstract: We investigate robustness properties for a broad class of support vector machines with non-smooth loss functions. These kernel methods are inspired by convex risk minimization in infinite dimensional Hilbert spaces. Leading examples are the support vector machine based on the ε-insensitive loss function, and kernel based quantile regression based on the pinball loss function. Firstly, we propose with the Bouligand influence function (BIF) a modification of F.R. Hampel’s influence function. The BIF has the advantage of being positive homogeneous which is in general not true for Hampel’s influence function. Secondly, we show that many support vector machines based on a Lipschitz continuous loss function and a bounded kernel have a bounded BIF and are thus robust in the sense of robust statistics based on influence functions. Keywords: Bouligand derivatives, empirical risk minimization, influence function, robustness, support vector machines
same-paper 2 0.75018078 9 jmlr-2008-Active Learning by Spherical Subdivision
Author: Falk-Florian Henrich, Klaus Obermayer
Abstract: We introduce a computationally feasible, “constructive” active learning method for binary classification. The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. In order to reduce computational complexity the version space is restricted to spherical simplices and learning procedes by subdividing the edges of maximal length. We show that this procedure optimally reduces a tight upper bound on the generalization error. The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. An upper bound is provided for the probability of disagreement between classifiers (hence the generalization error) for non-constant data densities on the sphere. The emphasis of this work lies on providing mathematically exact performance estimates for active learning strategies. Keywords: active learning, spherical subdivision, error bounds, simplex halving
3 0.37295395 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
Author: Hsuan-Tien Lin, Ling Li
Abstract: Ensemble learning algorithms such as boosting can achieve better performance by averaging over the predictions of some base hypotheses. Nevertheless, most existing algorithms are limited to combining only a finite number of hypotheses, and the generated ensemble is usually sparse. Thus, it is not clear whether we should construct an ensemble classifier with a larger or even an infinite number of hypotheses. In addition, constructing an infinite ensemble itself is a challenging task. In this paper, we formulate an infinite ensemble learning framework based on the support vector machine (SVM). The framework can output an infinite and nonsparse ensemble through embedding infinitely many hypotheses into an SVM kernel. We use the framework to derive two novel kernels, the stump kernel and the perceptron kernel. The stump kernel embodies infinitely many decision stumps, and the perceptron kernel embodies infinitely many perceptrons. We also show that the Laplacian radial basis function kernel embodies infinitely many decision trees, and can thus be explained through infinite ensemble learning. Experimental results show that SVM with these kernels is superior to boosting with the same base hypothesis set. In addition, SVM with the stump kernel or the perceptron kernel performs similarly to SVM with the Gaussian radial basis function kernel, but enjoys the benefit of faster parameter selection. These properties make the novel kernels favorable choices in practice. Keywords: ensemble learning, boosting, support vector machine, kernel
4 0.3697482 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
5 0.36712962 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
6 0.36708498 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
7 0.36582553 96 jmlr-2008-Visualizing Data using t-SNE
8 0.36372373 58 jmlr-2008-Max-margin Classification of Data with Absent Features
9 0.36056554 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
10 0.36025834 86 jmlr-2008-SimpleMKL
11 0.35947791 56 jmlr-2008-Magic Moments for Structured Output Prediction
12 0.35936108 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
13 0.35860312 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
14 0.35801506 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
15 0.35694525 83 jmlr-2008-Robust Submodular Observation Selection
16 0.35672808 43 jmlr-2008-Hit Miss Networks with Applications to Instance Selection
17 0.35510984 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
18 0.35383058 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss
19 0.35313877 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
20 0.35275236 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers