nips nips2001 nips2001-28 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Carlotta Domeniconi, Dimitrios Gunopulos
Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on the assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 edu Abstract The nearest neighbor technique is a simple and appealing method to address classification problems. [sent-4, score-0.877]
2 It relies on the assumption of locally constant class conditional probabilities. [sent-5, score-0.197]
3 We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). [sent-7, score-0.417]
4 The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. [sent-8, score-0.397]
5 Such direction provides a local weighting scheme for input features. [sent-9, score-0.406]
6 We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. [sent-10, score-0.549]
7 1 Introduction In a classification problem, we are given J classes and l training observations. [sent-11, score-0.308]
8 The training observations consist of n feature measurements x = (x_1, ..., x_n)^T ∈ ℝ^n and the known class labels j = 1, ..., J. [sent-12, score-0.195]
9 The goal is to predict the class label of a given query q. [sent-16, score-0.334]
10 The K nearest neighbor classification method [4, 13, 16] is a simple and appealing approach to this problem: it finds the K nearest neighbors of q in the training set, and then predicts the class label of q as the most frequent one occurring in the K neighbors. [sent-17, score-1.159]
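For concreteness, a minimal sketch of this plain K-NN rule with the Euclidean metric is given below; the function name and default K are ours, not from the paper.

```python
# Plain K-nearest-neighbor rule: predict the majority label among the K
# training points closest to the query q (Euclidean distance).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, q, K=5):
    dists = np.linalg.norm(X_train - q, axis=1)   # distances from q to all training points
    nearest = np.argsort(dists)[:K]               # indices of the K closest points
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```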
11 It has been shown [5, 8] that the one nearest neighbor rule has an asymptotic error rate that is at most twice the Bayes error rate, independent of the distance metric used. [sent-18, score-0.779]
12 The nearest neighbor rule becomes less appealing with finite training samples, however. [sent-19, score-0.645]
13 Severe bias can be introduced in the nearest neighbor rule in a high dimensional input feature space with finite samples. [sent-21, score-0.689]
14 As such, the choice of a distance measure becomes crucial in determining the outcome of nearest neighbor classification. [sent-22, score-0.63]
15 The commonly used Euclidean distance implies that the input space is isotropic, which is often invalid and generally undesirable in many practical applications. [sent-23, score-0.217]
16 Several techniques [9, 10, 7] have been proposed to try to minimize bias in high dimensions by using locally adaptive mechanisms. [sent-24, score-0.206]
17 The feature weighting scheme they introduce, in fact, is query-based and is applied on-line when the test point is presented to the "lazy learner". [sent-26, score-0.549]
18 In this paper we propose a locally adaptive metric classification method which, although still founded on a query-based weighting mechanism, computes off-line the information needed to define the local weights. [sent-27, score-1.113]
19 Our technique uses support vector machines (SVMs) as guidance for the process of defining a local flexible metric. [sent-28, score-0.404]
20 SVMs have been successfully used as a classification tool in a variety of areas [11, 3, 14], and the maximum margin boundary they provide has been proved to be optimal in a structural risk minimization sense. [sent-29, score-0.515]
21 The solid theoretical foundations that have inspired SVMs convey desirable computational and learning theoretic properties to the SVM's learning algorithm, and therefore SVMs are a natural choice for seeking local discriminant directions between classes. [sent-30, score-0.196]
22 The solution provided by SVMs allows us to determine locations in input space where the class conditional probabilities are likely to be non-constant, and guides the extraction of local information in such areas. [sent-31, score-0.272]
23 This process produces highly stretched neighborhoods along boundary directions when the query is close to the boundary. [sent-32, score-0.804]
24 As a result, the class conditional probabilities tend to be constant in the modified neighborhoods, whereby better classification performance can be achieved. [sent-33, score-0.371]
25 The amount of elongation-constriction decays as the query moves further from the boundary vicinity. [sent-34, score-0.514]
26 Here K(x, y) = φ(x) · φ(y) (kernel function), and φ: ℝ^n → ℝ^N is a mapping of the input vectors into a higher dimensional feature space. [sent-36, score-0.148]
27 Clearly, in the general case of a non-linear feature mapping φ, the SVM classifier gives a non-linear boundary f(x) = 0 in input space. [sent-41, score-0.483]
28 The gradient vector n_d = ∇f(d), computed at any point d of the level curve f(x) = 0, gives the direction perpendicular to the decision boundary in input space at d. [sent-42, score-0.837]
29 As such, the vector n_d identifies the orientation in input space on which the projected training data are well separated, locally over d's neighborhood. [sent-43, score-0.792]
30 Therefore, the orientation given by n_d, and any orientation close to it, is highly informative for the classification task at hand, and we can use such information to define a local measure of feature relevance. [sent-44, score-0.751]
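A sketch of how n_d can be obtained in practice is shown below, assuming an RBF kernel and an sklearn SVC fitted with an explicit numeric gamma; the function name is ours, not the authors', and the gradient is simply the analytic derivative of the kernel expansion f(x) = Σ_i α_i y_i K(x, x_i) + b.

```python
# Gradient of an RBF-SVM decision function at a point d: this is the
# boundary-normal direction n_d used in the text.
import numpy as np
from sklearn.svm import SVC

def decision_gradient(svc, d, gamma):
    sv = svc.support_vectors_                      # support vectors s_i
    coef = svc.dual_coef_.ravel()                  # alpha_i * y_i
    diff = d - sv                                  # (n_sv, n_features)
    k = np.exp(-gamma * np.sum(diff ** 2, axis=1)) # K(d, s_i)
    # d/dd exp(-gamma ||d - s_i||^2) = -2 * gamma * (d - s_i) * K(d, s_i)
    return np.sum(coef[:, None] * (-2.0 * gamma) * diff * k[:, None], axis=0)

# usage sketch: fit with an explicit numeric gamma so it can be reused here
# svc = SVC(kernel='rbf', gamma=0.5).fit(X_train, y_train)
# n_d = decision_gradient(svc, d, gamma=0.5)
```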
31 Let q be a query point whose class label we want to predict. [sent-45, score-0.368]
32 Suppose q is close to the boundary, which is where class conditional probabilities become locally non-uniform, and therefore estimation of local feature relevance becomes crucial. [sent-46, score-0.558]
33 Let d be the closest point to q on the boundary f(x) = 0: d = argmin_p ||q - p||, subject to the constraint f(p) = 0. [sent-47, score-0.329]
34 Then we know that the gradient n_d identifies a direction along which data points between classes are well separated. [sent-48, score-0.59]
35 As a consequence, the subspace spanned by the orientation n_d, locally at q, is likely to contain points having the same class label as q. [sent-49, score-0.335]
36 Therefore, when applying a nearest neighbor rule at q, we desire to stay close to q along the n_d direction, because that is where we are likely to find points similar to q in terms of the class posterior probabilities. [sent-50, score-1.178]
37 Distances should be constricted (large weight) along n_d and along directions close to it. [sent-51, score-0.594]
38 The farther we move from the n_d direction, the less discriminant the corresponding orientation becomes. [sent-52, score-0.513]
39 This means that class labels are likely not to change along those orientations, and distances should be elongated (small weight), thus including in q's neighborhood points which are likely to be similar to q in terms of the class conditional probabilities. [sent-53, score-0.353]
40 Formally, we can measure how close a direction t is to n_d by considering the dot product n_d · t. [sent-54, score-0.505]
41 In particular, by denoting with u_j the unit vector along input feature j, for j = 1, ..., n, [sent-55, score-0.263]
42 we can define a measure of relevance for feature j, locally at q (and therefore at d), as R_j(q) ≡ |u_j · n_d|. [sent-58, score-0.37]
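Since u_j is the unit vector along feature j, R_j(q) = |u_j · n_d| is just the absolute value of the j-th component of n_d; a one-line sketch (function name ours):

```python
import numpy as np

def feature_relevance(n_d):
    return np.abs(np.asarray(n_d, dtype=float))   # R_j(q) for j = 1, ..., n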
43 On the other hand, when A is large a change in R_j will be exponentially reflected in w_j. The exponential weighting scheme conveys stability to the method by preventing neighborhoods from extending infinitely in any direction. [sent-61, score-0.295]
44 Thus, the exponential weighting scheme can be used to produce the weights associated with the features for the weighted distance computation D(x, y) = √(Σ_{i=1}^n w_i (x_i - y_i)^2). [sent-63, score-0.314]
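A sketch of the weighting scheme and the weighted distance follows; the softmax-style normalization w_j = exp(A R_j(q)) / Σ_i exp(A R_i(q)) is our reading of the exponential scheme described above, and the function names are ours.

```python
import numpy as np

def feature_weights(R, A):
    # Exponential weights from the relevance values R_j and the parameter A.
    e = np.exp(A * np.asarray(R, dtype=float))
    return e / e.sum()

def weighted_distance(x, y, w):
    # Weighted Euclidean distance D(x, y) = sqrt(sum_i w_i (x_i - y_i)^2).
    x, y, w = map(np.asarray, (x, y, w))
    return np.sqrt(np.sum(w * (x - y) ** 2))
```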
45 We stop as soon as the boundary is crossed along an input axis i, i.e., as soon as the sign of f changes. [sent-67, score-0.466]
46 Given p_i, we can get arbitrarily close to the boundary by moving in (arbitrarily) small steps along the segment that joins p_i to q. [sent-70, score-0.438]
47 Let us denote by d_i the intercepted point on the boundary along direction i. [sent-71, score-0.432]
48 We then approximate n_d with the gradient vector n_{d_i} = ∇f(d_i), computed at d_i. [sent-72, score-0.744]
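The boundary-search step just described could be sketched as follows: step from q along an input axis until f changes sign (point p_i), then bisect the segment joining p_i to q to approximate the intercepted point d_i. Step sizes and iteration counts are arbitrary choices of ours, not values from the paper.

```python
import numpy as np

def axis_boundary_point(f, q, axis, step=0.1, max_steps=100, n_bisect=30):
    f_q = f(q)
    for direction in (+1.0, -1.0):                 # search both ways along this axis
        p = np.array(q, dtype=float)
        for _ in range(max_steps):
            p[axis] += direction * step
            if f(p) * f_q < 0:                     # boundary crossed at some point p_i
                lo, hi = np.array(q, dtype=float), p.copy()
                for _ in range(n_bisect):          # close in on the crossing
                    mid = 0.5 * (lo + hi)
                    if f(mid) * f_q < 0:
                        hi = mid
                    else:
                        lo = mid
                return 0.5 * (lo + hi)             # approximate intercept d_i
    return None                                    # boundary not crossed along this axis

# f can be, e.g., lambda x: float(svc.decision_function(np.atleast_2d(x))[0])
```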
49 We desire that the parameter A in the exponential weighting scheme increases as the distance of q from the boundary decreases. [sent-73, score-0.568]
50 Following the same principle, in [1] the spatial resolution around the boundary is increased by enlarging volume elements locally in neighborhoods of support vectors. [sent-75, score-0.666]
51 In our experiments we set D equal to the approximated average distance between the training points x_k and the boundary: D = (1/l) Σ_{x_k} min_{s_i} ||x_k - s_i||. [sent-77, score-0.133]
52 By doing so, the value of A nicely adapts to each query point according to its location with respect to the boundary. [sent-79, score-0.293]
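A sketch of this per-query adaptation is given below. B_q is the distance of q from the boundary, approximated through the support vectors, and D is the average such distance over the training points; the specific rule A = max(D - B_q, 0) is our assumption, since the text only requires A to grow as B_q shrinks. Function names are ours.

```python
import numpy as np

def distance_to_svs(x, support_vectors):
    # Approximate distance from x to the boundary: min_i ||x - s_i||.
    return np.min(np.linalg.norm(support_vectors - x, axis=1))

def adaptive_A(q, X_train, support_vectors):
    B_q = distance_to_svs(q, support_vectors)
    D = np.mean([distance_to_svs(x, support_vectors) for x in X_train])
    return max(D - B_q, 0.0)
```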
53 In fact, for each query point q, in order to compute B_q we only need to consider the support vectors, whose number is typically small compared to the size of the training set. Input: the decision boundary f(x) = 0 produced by the SVM, the query point q, and the parameter K. [sent-82, score-0.674]
54 Set the feature relevance values R_j(q) = |n_{d_i, j}| for j = 1, ..., n. [sent-86, score-0.173]
55 Estimate the distance of q from the boundary as B_q = min_{s_i} ||q - s_i||. [sent-90, score-0.452]
56 Use the resulting w for K-nearest neighbor classification at the query point q. [sent-96, score-0.841]
57 The resulting local flexible metric technique based on SVMs (LFM-SVM) is summarized in Figure 1. [sent-99, score-0.377]
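To make the summary concrete, an end-to-end sketch that composes the helper functions sketched earlier is shown below. All names (decision_gradient, axis_boundary_point, feature_relevance, feature_weights, adaptive_A, weighted_distance) are ours, and picking the axis crossing closest to q is our reading of the procedure rather than a detail stated in the extracted text.

```python
import numpy as np
from collections import Counter

def lfm_svm_predict(svc, gamma, X_train, y_train, q, K=5):
    f = lambda x: float(svc.decision_function(np.atleast_2d(x))[0])
    # 1-2. approximate the boundary point d_i reached from q along the input axes
    crossings = [axis_boundary_point(f, q, j) for j in range(len(q))]
    crossings = [d for d in crossings if d is not None]
    if not crossings:                              # far from the boundary: fall back to uniform weights
        w = np.full(len(q), 1.0 / len(q))
    else:
        d_i = min(crossings, key=lambda d: np.linalg.norm(q - d))
        # 3. feature relevance from the boundary-normal direction at d_i
        R = feature_relevance(decision_gradient(svc, d_i, gamma))
        # 4-5. query-dependent parameter A and exponential feature weights
        A = adaptive_A(q, X_train, svc.support_vectors_)
        w = feature_weights(R, A)
    # 6. weighted K-nearest-neighbor vote
    dists = np.array([weighted_distance(q, x, w) for x in X_train])
    nearest = np.argsort(dists)[:K]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```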
58 The algorithm has only one adjustable tuning parameter, namely the number K of neighbors in the final nearest neighbor rule. [sent-100, score-0.504]
59 This parameter is common to all nearest neighbor classification techniques. [sent-101, score-0.764]
60 4 Experimental Results In the following we compare several classification methods using both simulated and real data. [sent-102, score-0.383]
61 We compare the following classification approaches: (1) LFM-SVM algorithm described in Figure 1. [sent-103, score-0.26]
62 The output of this classifier is the input of LFM-SVM; (3) ADAMENN-adaptive metric nearest neighbor technique [7]. [sent-107, score-0.87]
63 It uses the Chi-squared distance in order to estimate to what extent each dimension can be relied on to predict class posterior probabilities; (4) Machete [9]. [sent-108, score-0.16]
64 It is a recursive partitioning procedure, in which the input variable used for splitting at each step is the one that maximizes the estimated local relevance. [sent-109, score-0.197]
65 Such relevance is measured in terms of the improvement in squared prediction error each feature is capable of providing; (5) Scythe [9]. [sent-110, score-0.173]
66 It is a generalization of the machete algorithm, in which the input variables influence each split in proportion to their estimated local relevance; (6) DANN-discriminant adaptive nearest neighbor classification [10]. [sent-111, score-1.146]
67 It is an adaptive nearest neighbor classification method based on linear discriminant analysis. [sent-112, score-0.926]
68 It computes a distance metric as a product of properly weighted within- and between-class sum-of-squares matrices; (7) Simple K-NN method using the Euclidean distance measure; (8) C4.5. [sent-113, score-0.323]
69 In all the experiments, the features are first normalized over the training data to have zero mean and unit variance, and the test data features are normalized using the corresponding training mean and variance. [sent-115, score-0.266]
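A sketch of this normalization step (function name ours): standardize with the training statistics and apply the same statistics to the test data.

```python
import numpy as np

def standardize(X_train, X_test):
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```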
70 These test data were classified by each competing method using the respective training data set. [sent-120, score-0.128]
71 The data set consists of n = 2 input features, l = 200 training data, and J = 2 classes. [sent-124, score-0.164]
72 The mean vectors for one class are (-3/4, -3) and (3/4, 3), whereas for the other class they are (3, -3) and (-3, 3). [sent-126, score-0.15]
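For illustration, a generator for this simulated problem as described (n = 2 features, l = 200 training points, J = 2 classes, each class a mixture of two Gaussians with the stated means) is sketched below; unit covariance and equal mixing proportions are our assumptions, since the extracted text does not specify them.

```python
import numpy as np

def multi_gaussians(l=200, seed=0):
    rng = np.random.default_rng(seed)
    means = {0: [(-0.75, -3.0), (0.75, 3.0)],
             1: [(3.0, -3.0), (-3.0, 3.0)]}
    X, y = [], []
    for _ in range(l):
        label = int(rng.integers(0, 2))                 # pick a class
        mean = means[label][rng.integers(0, 2)]         # pick one of its two components
        X.append(rng.normal(loc=mean, scale=1.0, size=2))
        y.append(label)
    return np.array(X), np.array(y)
```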
73 When the noisy predictors are added to the problem (NoisyGaussians), we observe different levels of deterioration in performance among the eight methods. [sent-154, score-0.129]
74 Table 1: Average classification error rates for simulated and real data. [sent-173, score-0.383]
75 Experiments on Real Data. In our experiments we used seven different real data sets. [sent-248, score-0.163]
76 For the Iris, Sonar, Liver and Vote data we perform leave-one-out cross-validation to measure performance, since the number of available data is limited for these data sets. [sent-255, score-0.161]
77 Table 1 (columns 3-9) shows the cross-validated error rates for the eight methods under consideration on the seven real data sets. [sent-265, score-0.123]
78 Table 1 shows that LFM-SVM achieves the best performance in 2/7 of the real data sets; in one case it shows the second best performance, and in the remaining four its error rate is still quite close to the best one. [sent-295, score-0.169]
79 Following Friedman [9], we capture robustness by computing the ratio b_m of the error rate e_m of method m to the smallest error rate over all methods being compared in a particular example: b_m = e_m / min_{1≤k≤8} e_k. Figure 3 plots the distribution of b_m for each method over the seven real data sets. [sent-296, score-0.364]
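The robustness ratios can be computed directly from a table of error rates, as in the sketch below (function name ours): for each data set, divide each method's error rate by the smallest error rate achieved by any method on that data set.

```python
import numpy as np

def robustness_ratios(error_rates):
    # error_rates: array of shape (n_datasets, n_methods); returns b_m per entry.
    e = np.asarray(error_rates, dtype=float)
    return e / e.min(axis=1, keepdims=True)
```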
80 The scythe algorithm, by relaxing the winner-take-all splitting strategy of the machete algorithm, mitigates the greedy nature of the approach, and thereby achieves better performance. [sent-306, score-0.263]
81 In [10], the authors show that the metric employed by the DANN algorithm approximates the weighted Chi-squared distance, given that class densities are Gaussian and have the same covariance matrix. [sent-307, score-0.228]
82 As a consequence, we may expect a degradation in performance when the data do not follow Gaussian distributions and are corrupted by noise, which is likely the case in real scenarios like the ones tested here. [sent-308, score-0.144]
83 Furthermore, LFM-SVM speeds up the classification process since it applies the nearest neighbor rule only once, whereas ADAMENN applies it at each point within a region centered at the query. [sent-311, score-0.869]
84 The LFM-SVM offers performance improvements over the RBF-SVM algorithm alone, for both the (noisy) simulated and real data sets. [sent-313, score-0.163]
85 The reason for such a performance gain may lie in the effect of our local weighting scheme on the separability of classes, and therefore on the margin, as shown in [6]. [sent-314, score-0.324]
86 Assigning large weights to input features close to the gradient direction, locally in neighborhoods of support vectors, corresponds to increasing the spatial resolution along those orientations, and therefore to improving the separability of classes. [sent-315, score-0.769]
87 As a consequence, better classification results can be achieved as demonstrated in our experiments. [sent-316, score-0.26]
88 5 Related Work In [1], Amari and Wu improve support vector machine classifiers by modifying kernel functions. [sent-317, score-0.205]
89 A primary kernel is first used to obtain support vectors. [sent-318, score-0.135]
90 The kernel is then modified in a data dependent way by using the support vectors: the factor that drives the transformation has larger values at positions close to support vectors. [sent-319, score-0.409]
91 The modified kernel enlarges the spatial resolution around the boundary so that the separability of classes is increased. [sent-320, score-0.478]
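A conformal, data-dependent kernel modification in the spirit of this description is sketched below: K~(x, z) = c(x) c(z) K(x, z), where c(·) takes larger values near the support vectors. The Gaussian-bump form of c and the single width tau are our simplification; the exact factor used in [1] may differ.

```python
import numpy as np

def conformal_factor(x, support_vectors, tau=1.0):
    d2 = np.sum((support_vectors - x) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2.0 * tau ** 2)))   # large near support vectors

def modified_kernel(x, z, base_kernel, support_vectors, tau=1.0):
    return (conformal_factor(x, support_vectors, tau) *
            conformal_factor(z, support_vectors, tau) *
            base_kernel(x, z))
```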
92 The resulting transformation depends on the distance of data points from the support vectors, and it is therefore a local transformation, but it is independent of the boundary's orientation in input space. [sent-321, score-0.516]
93 Likewise, our transformation metric depends, through the factor A, on the distance of the query point from the support vectors. [sent-322, score-0.665]
94 Moreover, since we weight features, our metric is directional, and depends on the orientation of local boundaries in input space. [sent-323, score-0.41]
95 This dependence is driven by our measure of feature relevance, which has the effect of increasing the spatial resolution along discriminant directions around the boundary. [sent-324, score-0.392]
96 6 Conclusions We have described a locally adaptive metric classification method and demonstrated its efficacy through experimental results. [sent-325, score-0.619]
97 In fact, it speeds up the classification process by computing off-line the information needed to define the local weights, and by applying the nearest neighbor rule only once. [sent-327, score-0.951]
98 Wu, "Improving support vector machine classifiers by modifying kernel functions", Neural Networks, 12, pp. [sent-331, score-0.205]
99 Haussler, "Knowledge-based analysis of microarray gene expressions data using support vector machines", Tech. [sent-346, score-0.168]
100 Girosi, "Training support vector machines: An application to face detection", Proc. [sent-410, score-0.128]
wordName wordTfidf (topN-words)
[('lld', 0.336), ('neighbor', 0.288), ('classification', 0.26), ('query', 0.259), ('boundary', 0.255), ('nearest', 0.216), ('metric', 0.153), ('machete', 0.14), ('locally', 0.122), ('weighting', 0.119), ('svms', 0.119), ('adamenn', 0.112), ('minsi', 0.112), ('neighborhoods', 0.111), ('relevance', 0.101), ('orientation', 0.099), ('svm', 0.098), ('wj', 0.097), ('support', 0.092), ('flexible', 0.085), ('distance', 0.085), ('domeniconi', 0.084), ('gunopulos', 0.084), ('ilq', 0.084), ('scythe', 0.084), ('sill', 0.084), ('adaptive', 0.084), ('local', 0.082), ('classifier', 0.08), ('along', 0.079), ('discriminant', 0.078), ('input', 0.076), ('class', 0.075), ('feature', 0.072), ('rj', 0.068), ('bm', 0.067), ('real', 0.065), ('scheme', 0.065), ('direction', 0.064), ('close', 0.064), ('bq', 0.062), ('deviations', 0.059), ('pima', 0.058), ('separability', 0.058), ('seven', 0.058), ('simulated', 0.058), ('technique', 0.057), ('appealing', 0.056), ('arj', 0.056), ('carlotta', 0.056), ('crossed', 0.056), ('dann', 0.056), ('ilxk', 0.056), ('invalid', 0.056), ('multigaussians', 0.056), ('oq', 0.056), ('riverside', 0.056), ('noisy', 0.053), ('resolution', 0.052), ('machines', 0.052), ('cp', 0.049), ('ari', 0.049), ('joachims', 0.049), ('liver', 0.049), ('mlight', 0.049), ('training', 0.048), ('breast', 0.047), ('distances', 0.046), ('features', 0.045), ('vote', 0.044), ('desire', 0.044), ('iris', 0.044), ('sonar', 0.044), ('kernel', 0.043), ('transformation', 0.042), ('alone', 0.042), ('non', 0.042), ('table', 0.042), ('measure', 0.041), ('predictors', 0.041), ('hart', 0.041), ('lazy', 0.041), ('closest', 0.04), ('arbitrarily', 0.04), ('data', 0.04), ('likely', 0.039), ('curse', 0.039), ('splitting', 0.039), ('rule', 0.037), ('modified', 0.036), ('gradient', 0.036), ('directions', 0.036), ('vector', 0.036), ('observe', 0.035), ('identifies', 0.035), ('point', 0.034), ('speeds', 0.034), ('modifying', 0.034), ('spatial', 0.034), ('define', 0.034)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
Author: Carlotta Domeniconi, Dimitrios Gunopulos
Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on t he assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. 1
2 0.22531559 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
Author: Pascal Vincent, Yoshua Bengio
Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in (the input space). wx u r i q ©
3 0.17346162 24 nips-2001-Active Information Retrieval
Author: Tommi Jaakkola, Hava T. Siegelmann
Abstract: In classical large information retrieval systems, the system responds to a user initiated query with a list of results ranked by relevance. The users may further refine their query as needed. This process may result in a lengthy correspondence without conclusion. We propose an alternative active learning approach, where the system responds to the initial user's query by successively probing the user for distinctions at multiple levels of abstraction. The system's initiated queries are optimized for speedy recovery and the user is permitted to respond with multiple selections or may reject the query. The information is in each case unambiguously incorporated by the system and the subsequent queries are adjusted to minimize the need for further exchange. The system's initiated queries are subject to resource constraints pertaining to the amount of information that can be presented to the user per iteration. 1
4 0.14192936 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
Author: Manfred Opper, Robert Urbanczik
Abstract: Using methods of Statistical Physics, we investigate the rOle of model complexity in learning with support vector machines (SVMs). We show the advantages of using SVMs with kernels of infinite complexity on noisy target rules, which, in contrast to common theoretical beliefs, are found to achieve optimal generalization error although the training error does not converge to the generalization error. Moreover, we find a universal asymptotics of the learning curves which only depend on the target rule but not on the SVM kernel. 1
5 0.13879672 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
Author: Ronan Collobert, Samy Bengio, Yoshua Bengio
Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database , yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples) . In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1
6 0.12856568 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
7 0.12038255 172 nips-2001-Speech Recognition using SVMs
8 0.11984682 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
9 0.11360861 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
10 0.11218452 60 nips-2001-Discriminative Direction for Kernel Classifiers
11 0.10409526 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
12 0.10406072 184 nips-2001-The Intelligent surfer: Probabilistic Combination of Link and Content Information in PageRank
13 0.097866096 58 nips-2001-Covariance Kernels from Bayesian Generative Models
14 0.093764707 13 nips-2001-A Natural Policy Gradient
15 0.093196623 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
16 0.093093082 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
17 0.085646883 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
18 0.085529037 164 nips-2001-Sampling Techniques for Kernel Methods
19 0.083573215 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
20 0.08078327 144 nips-2001-Partially labeled classification with Markov random walks
topicId topicWeight
[(0, -0.278), (1, 0.095), (2, -0.047), (3, -0.025), (4, 0.034), (5, 0.102), (6, -0.027), (7, -0.049), (8, -0.058), (9, 0.043), (10, 0.168), (11, -0.049), (12, 0.001), (13, -0.051), (14, 0.348), (15, 0.089), (16, 0.036), (17, 0.013), (18, -0.019), (19, 0.067), (20, -0.032), (21, 0.073), (22, 0.065), (23, -0.063), (24, -0.042), (25, 0.035), (26, -0.01), (27, 0.097), (28, -0.06), (29, 0.167), (30, 0.008), (31, 0.146), (32, 0.033), (33, 0.0), (34, 0.002), (35, -0.1), (36, 0.038), (37, 0.047), (38, -0.136), (39, 0.034), (40, 0.119), (41, -0.078), (42, -0.058), (43, -0.036), (44, -0.04), (45, 0.098), (46, 0.11), (47, 0.065), (48, -0.033), (49, 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.9475407 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
Author: Carlotta Domeniconi, Dimitrios Gunopulos
Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on t he assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. 1
2 0.76907921 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
Author: Pascal Vincent, Yoshua Bengio
Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in (the input space). wx u r i q ©
3 0.58019185 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
Author: Patrick Haffner
Abstract: Maximum margin classifiers such as Support Vector Machines (SVMs) critically depends upon the convex hulls of the training samples of each class, as they implicitly search for the minimum distance between the convex hulls. We propose Extrapolated Vector Machines (XVMs) which rely on extrapolations outside these convex hulls. XVMs improve SVM generalization very significantly on the MNIST [7] OCR data. They share similarities with the Fisher discriminant: maximize the inter-class margin while minimizing the intra-class disparity. 1
4 0.5646137 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
Author: Olivier Chapelle, Bernhard Schölkopf
Abstract: The choice of an SVM kernel corresponds to the choice of a representation of the data in a feature space and, to improve performance , it should therefore incorporate prior knowledge such as known transformation invariances. We propose a technique which extends earlier work and aims at incorporating invariances in nonlinear kernels. We show on a digit recognition task that the proposed approach is superior to the Virtual Support Vector method, which previously had been the method of choice. 1
5 0.51408464 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
Author: Ronan Collobert, Samy Bengio, Yoshua Bengio
Abstract: Support Vector Machines (SVMs) are currently the state-of-the-art models for many classification problems but they suffer from the complexity of their training algorithm which is at least quadratic with respect to the number of examples. Hence, it is hopeless to try to solve real-life problems having more than a few hundreds of thousands examples with SVMs. The present paper proposes a new mixture of SVMs that can be easily implemented in parallel and where each SVM is trained on a small subset of the whole dataset. Experiments on a large benchmark dataset (Forest) as well as a difficult speech database , yielded significant time improvement (time complexity appears empirically to locally grow linearly with the number of examples) . In addition, and that is a surprise, a significant improvement in generalization was observed on Forest. 1
6 0.49183142 184 nips-2001-The Intelligent surfer: Probabilistic Combination of Link and Content Information in PageRank
7 0.47090039 60 nips-2001-Discriminative Direction for Kernel Classifiers
8 0.46787608 120 nips-2001-Minimax Probability Machine
9 0.45108366 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
10 0.43613681 172 nips-2001-Speech Recognition using SVMs
11 0.42525315 25 nips-2001-Active Learning in the Drug Discovery Process
12 0.41440526 94 nips-2001-Incremental Learning and Selective Sampling via Parametric Optimization Framework for SVM
13 0.41021669 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
14 0.40442663 15 nips-2001-A New Discriminative Kernel From Probabilistic Models
15 0.3893249 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine
16 0.38679376 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
17 0.37697154 24 nips-2001-Active Information Retrieval
18 0.35823488 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network
19 0.35654357 78 nips-2001-Fragment Completion in Humans and Machines
20 0.34809527 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique
topicId topicWeight
[(12, 0.209), (14, 0.05), (17, 0.044), (19, 0.04), (27, 0.139), (30, 0.068), (38, 0.011), (59, 0.029), (72, 0.151), (79, 0.055), (83, 0.018), (91, 0.107)]
simIndex simValue paperId paperTitle
1 0.90814221 147 nips-2001-Pranking with Ranking
Author: Koby Crammer, Yoram Singer
Abstract: We discuss the problem of ranking instances. In our framework each instance is associated with a rank or a rating, which is an integer from 1 to k. Our goal is to find a rank-prediction rule that assigns each instance a rank which is as close as possible to the instance's true rank. We describe a simple and efficient online algorithm, analyze its performance in the mistake bound model, and prove its correctness. We describe two sets of experiments, with synthetic data and with the EachMovie dataset for collaborative filtering. In the experiments we performed, our algorithm outperforms online algorithms for regression and classification applied to ranking. 1
2 0.88254571 114 nips-2001-Learning from Infinite Data in Finite Time
Author: Pedro Domingos, Geoff Hulten
Abstract: We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model Mii learned by the algorithm using ni examples in step i (ii = (nl , ... ,nm )) , and the model Moo that would be learned using infinite examples. Upper-bound the loss L(Mii' M oo ) between them as a function of ii, and then minimize the algorithm's time complexity f(ii) subject to the constraint that L(Moo , M ii ) be at most f with probability at most 8. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach. 1 An Approach to Large-Scale Learning Large data sets make it possible to reliably learn complex models. On the other hand , they require large computational resources to learn from. While in the past the factor limit ing the quality of learnable models was typically the quantity of data available, in many domains today data is super-abundant, and the bottleneck is t he t ime required to process it. Many algorithms for learning on large data sets have been proposed, but in order to achieve scalability they generally compromise the quality of the results to an unspecified degree. We believe this unsatisfactory state of affairs is avoidable, and in this paper we propose a general method for scaling learning algorithms to arbitrarily large databases without compromising the quality of the results. Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data. Consider the simplest possible learning problem: estimating the mean of a random variable x. If we have a very large number of samples, most of them are probably superfluous. If we are willing to accept an error of at most f with probability at most 8, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of x, only n = ~(R/f)2 1n (2/8) samples are needed, where R is x's range. We propose to extend this type of reasoning beyond learning single parameters, to learning complex models. The approach we propose consists of three steps: 1. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm. 2. Derive an upper bound on the time complexity of the learning algorithm , as a function of the number of samples used in each step. 3. Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss. In this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. In earlier papers we applied it (or an earlier version of it) to decision tree induction [2J and k-means clustering [3J. Despite its wide use, EM has long been criticized for its inefficiency (see discussion following Dempster et al. [1]), and has been considered unsuitable for large data sets [8J. Many approaches to speeding it up have been proposed (see Thiesson et al. [6J for a survey) . Our method can be seen as an extension of progressive sampling approaches like Meek et al. [5J: rather than minimize the total number of samples needed by the algorithm , we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves. 
2 A Loss Bound for EM In a mixture of Gaussians model, each D-dimensional data point Xj is assumed to have been independently generated by the following process: 1) randomly choose a mixture component k; 2) randomly generate a point from it according to a Gaussian distribution with mean f-Lk and covariance matrix ~k. In this paper we will restrict ourselves to the case where the number K of mixture components and the probability of selection P(f-Lk) and covariance matrix for each component are known. Given a training set S = {Xl, ... , XN }, the learning goal is then to find the maximumlikelihood estimates of the means f-Lk. The EM algorithm [IJ accomplishes this by, starting from some set of initial means , alternating until convergence between estimating the probability p(f-Lk IXj) that each point was generated by each Gaussian (the Estep), and computing the ML estimates of the means ilk = 2::;':1 WjkXj / 2::f=l Wjk (the M step), where Wjk = p(f-Lklxj) from the previous E step. In the basic EM algorithm, all N examples in the training set are used in each iteration. The goal in this paper is to speed up EM by using only ni < N examples in the ith iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large N. Let Mii = (ill , . . . , ilK) be the vector of mean estimates obtained by the finite-data EM algorithm (i.e., using ni examples in iteration i), and let Moo = (f-L1, ... ,f-LK) be the vector obtained using infinite examples at each iteration. In order to proceed, we need to quantify the difference between Mii and Moo . A natural choice is the sum of the squared errors between corresponding means, which is proportional to the negative log-likelihood of the finite-data means given the infinite-data ones: K L(Mii' Moo ) = L k=l K Ililk - f-Lkl12 = D LL lilkd - f-Lkdl 2 (1) k=ld=l where ilkd is the dth coordinate of il, and similarly for f-Lkd. After any given iteration of EM, lilkd - f-Lkdl has two components. One, which we call the sampling error, derives from the fact that ilkd is estimated from a finite sample, while J-Lkd is estimated from an infinite one. The other component, which we call the weighting error, derives from the fact that , due to sampling errors in previous iterations, the weights Wjk used to compute the two estimates may differ. Let J-Lkdi be the infinite-data estimate of the dth coordinate of the kth mean produced in iteration i, flkdi be the corresponding finite-data estimate, and flkdi be the estimate that would be obtained if there were no weighting errors in that iteration. Then the sampling error at iteration i is Iflkdi - J-Lkdi I, the weighting error is Iflkdi - flkdi I, and the total error is Iflkdi - J-Lkdi 1 :::; Iflkdi - flkdi 1 + Iflkdi - J-Lkdi I衯 Given bounds on the total error of each coordinate of each mean after iteration i-I, we can derive a bound on the weighting error after iteration i as follows. Bounds on J-Lkd ,i-l for each d imply bounds on p(XjlJ-Lki ) for each example Xj, obtained by substituting the maximum and minimum allowed distances between Xjd and J-Lkd ,i-l into the expression of the Gaussian distribution. Let P}ki be the upper bound on P(XjlJ-Lki) , and Pjki be the lower bound. Then the weight of example Xj in mean J-Lki can be bounded from below by by W}ki W (+) jki = min{p}kiP(J-Lk)/ wjki = PjkiP(J-Lk)/ ~~= l P}k'iP(J-LU, and from above ~~=l Pjk'iP(J-LU, I}. Let w;t: = W}ki if Xj ::::: 0 and (- - 1 > (- + . 
W jki) - W jki' f Xj _ 0 an d W jki) - W jki 0 th erWlse. ot h ' erWlse, an d 1et W- jki Then Iflkdi - flkdi , IJ-Lkdi 1 < max - ~7~1 Wjk i Xj I
same-paper 3 0.85724103 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
Author: Carlotta Domeniconi, Dimitrios Gunopulos
Abstract: The nearest neighbor technique is a simple and appealing method to address classification problems. It relies on t he assumption of locally constant class conditional probabilities. This assumption becomes invalid in high dimensions with a finite number of examples due to the curse of dimensionality. We propose a technique that computes a locally flexible metric by means of Support Vector Machines (SVMs). The maximum margin boundary found by the SVM is used to determine the most discriminant direction over the query's neighborhood. Such direction provides a local weighting scheme for input features. We present experimental evidence of classification performance improvement over the SVM algorithm alone and over a variety of adaptive learning schemes, by using both simulated and real data sets. 1
4 0.75908488 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
Author: Mário Figueiredo
Abstract: In this paper we introduce a new sparseness inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparsenesscontrolling hyper-parameters.
5 0.74910545 40 nips-2001-Batch Value Function Approximation via Support Vectors
Author: Thomas G. Dietterich, Xin Wang
Abstract: We present three ways of combining linear programming with the kernel trick to find value function approximations for reinforcement learning. One formulation is based on SVM regression; the second is based on the Bellman equation; and the third seeks only to ensure that good moves have an advantage over bad moves. All formulations attempt to minimize the number of support vectors while fitting the data. Experiments in a difficult, synthetic maze problem show that all three formulations give excellent performance, but the advantage formulation is much easier to train. Unlike policy gradient methods, the kernel methods described here can easily 'adjust the complexity of the function approximator to fit the complexity of the value function. 1
6 0.74547362 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
8 0.7362417 13 nips-2001-A Natural Policy Gradient
9 0.73596776 1 nips-2001-(Not) Bounding the True Error
10 0.73339355 155 nips-2001-Quantizing Density Estimators
11 0.73242229 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms
12 0.73233962 79 nips-2001-Gaussian Process Regression with Mismatched Models
13 0.73159963 92 nips-2001-Incorporating Invariances in Non-Linear Support Vector Machines
14 0.73082674 16 nips-2001-A Parallel Mixture of SVMs for Very Large Scale Problems
15 0.72869653 88 nips-2001-Grouping and dimensionality reduction by locally linear embedding
16 0.72607106 8 nips-2001-A General Greedy Approximation Algorithm with Applications
17 0.72534108 188 nips-2001-The Unified Propagation and Scaling Algorithm
18 0.72344244 36 nips-2001-Approximate Dynamic Programming via Linear Programming
19 0.72337687 185 nips-2001-The Method of Quantum Clustering
20 0.72332257 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning