nips nips2002 nips2002-121 knowledge-graph by maker-knowledge-mining

121 nips-2002-Knowledge-Based Support Vector Machine Classifiers


Source: pdf

Author: Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik

Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. Keywords: use and refinement of prior knowledge, support vector machines, linear programming 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. [sent-6, score-0.854]

2 The resulting formulation leads to a linear program that can be solved efficiently. [sent-7, score-0.337]

3 Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. [sent-8, score-0.187]

4 Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. [sent-9, score-0.776]

5 One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. [sent-10, score-0.763]

6 Keywords: use and refinement of prior knowledge, support vector machines, linear programming 1 Introduction Support vector machines (SVMs) have played a major role in classification problems [18,3, 11]. [sent-11, score-0.675]

7 However, unlike other classification tools such as knowledge-based neural networks [16, 17, 7], little work [15] has gone into incorporating prior knowledge into support vector machines. [sent-12, score-0.728]

8 In this work we present a novel approach to incorporating prior knowledge in the form of polyhedral knowledge sets in the input space of the given data. [sent-13, score-1.097]

9 These knowledge sets, which can be as simple as cubes, are supposed to belong to one of two categories into which all the data is divided. [sent-14, score-0.331]

10 Thus, a single knowledge set can be interpreted as a generalization of a training example, which typically consists of a single point in input space. [sent-15, score-0.332]

11 In contrast, each of our knowledge sets consists of a region in the same space. [sent-16, score-0.403]

12 By using a powerful tool from mathematical programming, theorems of the alternative [9, Chapter 2], we are able to embed such prior data into a linear program that can be efficiently solved by any of the publicly available solvers. [sent-17, score-0.329]

13 In Section 2 we describe the linear support vector machine classifier and give a linear program for it. [sent-19, score-0.747]

14 We then describe how prior knowledge, in the form of polyhedral knowledge sets belonging to one of two classes, can be characterized. [sent-20, score-0.809]

15 In Section 3 we incorporate these polyhedral sets into our linear programming formulation which results in our knowledge-based support vector machine (KSVM) formulation (19). [sent-21, score-0.992]

16 This formulation is capable of generating a linear classifier based on real data and/or prior knowledge. [sent-22, score-0.613]

17 Section 4 gives a brief summary of numerical results that compare various linear and nonlinear classifiers with and without the incorporation of prior knowledge. [sent-23, score-0.43]

18 A separating plane, with respect to two given point sets A and B in R^n, is a plane that attempts to separate R^n into two halfspaces such that each open halfspace contains points mostly of A or B. [sent-38, score-0.619]

19 A bounding plane to the set A is a plane that places A in one of the two closed halfspaces that the plane generates. [sent-39, score-0.865]

20 Here ‖·‖₁ denotes the 1-norm as defined in the Introduction, y is a vector of slack variables measuring the empirical error, and (w, γ) characterize the separating plane depicted in Figure 1. [sent-46, score-0.492]

21 That this problem is indeed a linear program can be easily seen from the equivalent formulation: min over (w, γ, y, t) ∈ R^(n+1+m+n) of ν e'y + e't subject to D(Aw − eγ) + y ≥ e, t ≥ w ≥ −t, y ≥ 0, (2) where e is a vector of ones of appropriate dimension. [sent-47, score-0.198]

22 For economy of notation we shall use the first formulation (1) with the understanding that computational implementation is via (2). [sent-48, score-0.137]
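To make the computational route via (2) concrete, here is a minimal sketch (not the authors' code; the function name, the use of scipy.optimize.linprog, and the variable stacking are our own assumptions) that assembles and solves the linear program (2) for a data matrix A and ±1 labels d = diag(D):

```python
# Sketch of the 1-norm linear SVM LP (2):
#   min  nu*e'y + e't   s.t.  D(Aw - e*gamma) + y >= e,  -t <= w <= t,  y >= 0
import numpy as np
from scipy.optimize import linprog

def linear_svm_1norm(A, d, nu=1.0):
    """A: m x n data matrix, d: vector of +/-1 labels (diagonal of D), nu: error weight."""
    m, n = A.shape
    # variable ordering: z = [w (n), gamma (1), y (m), t (n)]
    c = np.concatenate([np.zeros(n), [0.0], nu * np.ones(m), np.ones(n)])

    # -D(Aw - e*gamma) - y <= -e   (equivalently  D(Aw - e*gamma) + y >= e)
    G1 = np.hstack([-d[:, None] * A, d[:, None], -np.eye(m), np.zeros((m, n))])
    h1 = -np.ones(m)
    #  w - t <= 0  and  -w - t <= 0   (i.e. |w_j| <= t_j, so e't = ||w||_1 at optimum)
    G2 = np.hstack([np.eye(n), np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
    G3 = np.hstack([-np.eye(n), np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
    G = np.vstack([G1, G2, G3])
    h = np.concatenate([h1, np.zeros(2 * n)])

    # w and gamma are free; y and t are nonnegative
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + n)
    res = linprog(c, A_ub=G, b_ub=h, bounds=bounds, method="highs")
    w, gamma = res.x[:n], res.x[n]
    return w, gamma

# usage, following the classification rule (7):
#   w, gamma = linear_svm_1norm(A, d);  labels = np.sign(A @ w - gamma)
```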

23 As depicted in Figure 1(a), w is the normal to the bounding planes: x'w = γ + 1, x'w = γ − 1, (3) that bound the points belonging to the sets A+ and A−, respectively. [sent-49, score-0.417]

24 (4) Consequently, the plane: x'w = γ, (5) midway between the bounding planes (3), is a separating plane that separates points belonging to A+ from those belonging to A− completely if y = 0, else only approximately. [sent-52, score-0.811]

25 Minimizing the 1-norm term ‖w‖₁ in (1) maximizes the distance 2/‖w‖₁, measured using the ∞-norm [10], between the two bounding planes of (3) (see Figure 1(a)); this distance is often called the "margin". [sent-53, score-0.25]

26 Maximizing the margin enhances the generalization capability of a support vector machine [18, 3]. [sent-54, score-0.247]

27 If the classes are linearly inseparable, then the two planes bound the two classes with a "soft margin" (i. [sent-55, score-0.137]

28 bound approximately with some error) determined by the nonnegative error variable y, that is: A_i w + y_i ≥ γ + 1 for D_ii = 1, and A_i w − y_i ≤ γ − 1 for D_ii = −1. [sent-57, score-0.42]

29 (6) The 1-norm of the error variable y is minimized parametrically with weight ν in (1), resulting in an approximate separating plane (5) which classifies as follows: x ∈ A+ if sign(x'w − γ) = 1, x ∈ A− if sign(x'w − γ) = −1. [sent-58, score-0.344]

30 (7) Suppose now that we have prior information of the following type. [sent-59, score-0.129]

31 All points x lying in the polyhedral set determined by the linear inequalities: Bx ≤ b, (8) belong to class A+. [sent-60, score-0.327]
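As a concrete instance of a set of the form {x | Bx ≤ b} (a sketch; the helper below is ours, not from the paper), an axis-aligned box l ≤ x ≤ u, such as the cube-shaped knowledge sets mentioned earlier, is obtained with B = [I; −I] and b = [u; −l]:

```python
# Encode an axis-aligned box  l <= x <= u  as  Bx <= b.
import numpy as np

def box_as_polyhedron(lower, upper):
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    n = lower.size
    B = np.vstack([np.eye(n), -np.eye(n)])   # x <= u  and  -x <= -l
    b = np.concatenate([upper, -lower])
    return B, b

# e.g. the unit square in R^2:  B, b = box_as_polyhedron([0, 0], [1, 1])
```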

32 Looking at Figure 1(a) or at the inequalities (4) we conclude that the following implication must hold: Bx ≤ b ⟹ x'w ≥ γ + 1. (9) [sent-62, score-0.184]

33 That is, the knowledge set {x | Bx ≤ b} lies on the A+ side of the bounding plane x'w = γ + 1. [sent-64, score-0.644]

34 Later, in (19), we will accommodate the case when the implication (9) cannot be satisfied exactly by the introduction of slack error variables. [sent-65, score-0.211]

35 For now, assuming that the implication (9) holds for a given (w, γ), it follows that (9) is equivalent to: Bx ≤ b, x'w < γ + 1, has no solution x. [sent-66, score-0.352]

36 In fact, under the natural assumption that the prior knowledge set {x | Bx ≤ b} is nonempty, the forward implication (10) ⟹ (11) holds as well. [sent-69, score-0.427]

37 Then for a given (w, γ), the implication (9) is equivalent to the statement (11). [sent-77, score-0.176]

38 In other words, the set {x | Bx ≤ b} lies in the halfspace {x | w'x ≥ γ + 1} if and only if there exists u such that B'u + w = 0, b'u + γ + 1 ≤ 0 and u ≥ 0. [sent-78, score-0.474]
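The "if" direction of this equivalence can be verified in one line from the stated conditions (a sketch in the notation above):

```latex
% If u >= 0 satisfies B'u + w = 0 and b'u + gamma + 1 <= 0, then for any x
% with Bx <= b we have w = -B'u and therefore
\[
  u \ge 0,\; B'u = -w,\; b'u \le -(\gamma+1),\; Bx \le b
  \;\Longrightarrow\;
  x'w \;=\; -u'(Bx) \;\ge\; -u'b \;\ge\; \gamma + 1,
\]
% so the knowledge set {x | Bx <= b} is contained in {x | x'w >= gamma + 1}.
```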

39 8] we have that (10) is equivalent to either: B'u + w = 0, b'u + γ + 1 ≤ 0, u ≥ 0, having solution (u, w), (13) or B'u = 0, b'u < 0, u ≥ 0, having solution u. [sent-82, score-0.195]

40 □ This proposition will play a key role in incorporating knowledge sets, such as {x | Bx ≤ b}, into one of two categories in a support vector classifier formulation, as demonstrated in the next section. [sent-85, score-1.031]

41 [Figure 1; plane labels shown in the plot: x'w = γ + 1 and x'w = γ] (a): A linear SVM separation for 200 points in R² using the linear programming formulation (1). [sent-86, score-0.521]

42 (b): A linear SVM separation for the same 200 points in R² as those in Figure 1(a), but using the linear programming formulation (19), which incorporates three knowledge sets: {x | B¹x ≤ b¹} into the halfspace of A+, and {x | C¹x ≤ c¹}, {x | C²x ≤ c²} into the halfspace of A−, as depicted above. [sent-87, score-0.894]

43 Note the substantial difference between the linear classifiers x'w = γ of both figures. [sent-88, score-0.176]

44 3 Knowledge-Based SVM Classification We now describe how to incorporate prior knowledge in the form of polyhedral sets into our linear programming SVM classifier formulation (1). [sent-89, score-1.35]

45 We assume that we are given the following knowledge sets (16): k sets belonging to A+: {x | B^i x ≤ b^i}, i = 1, ..., k, [sent-90, score-0.548]

46 and ℓ sets belonging to A−: {x | C^i x ≤ c^i}, i = 1, ..., ℓ. [sent-93, score-0.197]

47 It follows by Proposition 2.1 that, relative to the bounding planes (3), there exist u^i, i = 1, ..., k, and v^j, j = 1, ..., ℓ, satisfying the conditions (17). [sent-97, score-0.25]

48 (17) We now incorporate the knowledge sets (16) into the SVM linear programming classifier formulation (1) by adding the conditions (17) as constraints to it, giving a linear program minimized over w, γ and the nonnegative variables (y, u^i, v^j). [sent-110, score-0.806]

49 (18) This linear programming formulation will ensure that each of the knowledge sets {x | B^i x ≤ b^i}, i = 1, ..., k, and {x | C^i x ≤ c^i}, i = 1, ..., ℓ, [sent-124, score-0.787]

50 lie on the appropriate side of the bounding planes (3). [sent-130, score-0.25]
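Since the displayed conditions (17)–(18) did not survive the extraction above, the following is a reconstruction of (17), obtained by applying Proposition 2.1 to each knowledge set (for the A− sets the proposition is applied with w and γ + 1 replaced by −w and −γ + 1); it should be read as a sketch rather than a verbatim copy of the paper's display:

```latex
% Reconstructed conditions (17), relative to the bounding planes (3):
\[
\begin{aligned}
&\exists\, u^i \ge 0:\quad B^{i\prime} u^i + w = 0,\quad b^{i\prime} u^i + \gamma + 1 \le 0,
  && i = 1,\dots,k \quad (\text{sets in } A^+),\\
&\exists\, v^j \ge 0:\quad C^{j\prime} v^j - w = 0,\quad c^{j\prime} v^j - \gamma + 1 \le 0,
  && j = 1,\dots,\ell \quad (\text{sets in } A^-).
\end{aligned}
\]
```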

51 However, there is no guarantee that such bounding planes exist that will precisely separate these two classes of knowledge sets, just as there is no a priori guarantee that the original points belonging to the sets A+ and A− are linearly separable. [sent-131, score-0.716]

52 We therefore add nonnegative slack error variables to the conditions (17), for i = 1, ..., k and j = 1, ..., ℓ, just like the slack error variable y of the SVM formulation (1), and attempt to drive these error variables to zero by modifying our last formulation above as follows: [sent-138, score-0.389]

53 (19) This is our final knowledge-based linear programming formulation, which incorporates the knowledge sets (16) into the linear classifier with weight μ. [sent-152, score-1.081]

54 If we set μ = 0, then the linear program (19) degenerates to (1), the linear program associated with an ordinary linear SVM. [sent-157, score-0.567]

55 However, if we set ν = 0, then the linear program (19) generates a linear SVM that is strictly based on the knowledge sets, and not on any specific training data. [sent-158, score-0.643]

56 This will be demonstrated in the breast cancer dataset of Section 4. [sent-160, score-0.279]
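A minimal sketch of the ν = 0 ("knowledge sets only") case discussed above, written with cvxpy rather than an explicit LP tableau; the relaxed constraints mirror the reconstructed conditions (17) with nonnegative slacks, but the function name, the exact slack and penalty bookkeeping, and the weight μ are our assumptions, not a verbatim transcription of (19):

```python
# Knowledge-only linear classifier (nu = 0): no data term, only knowledge sets.
import cvxpy as cp
import numpy as np

def knowledge_only_svm(Bs, bs, Cs, cs, n, mu=100.0):
    """Bs/bs: list of (B_i, b_i) for A+ sets {x | B_i x <= b_i};
       Cs/cs: the same for A- sets; n: input dimension; mu: knowledge weight."""
    w = cp.Variable(n)
    gamma = cp.Variable()
    penalty = 0
    constraints = []
    for B, b in zip(Bs, bs):               # A+ sets must lie in {x | x'w >= gamma + 1}
        u = cp.Variable(B.shape[0], nonneg=True)
        r = cp.Variable(n, nonneg=True)    # slack on B'u + w = 0
        rho = cp.Variable(nonneg=True)     # slack on b'u + gamma + 1 <= 0
        constraints += [cp.abs(B.T @ u + w) <= r,
                        b @ u + gamma + 1 <= rho]
        penalty += cp.sum(r) + rho
    for C, c in zip(Cs, cs):               # A- sets must lie in {x | x'w <= gamma - 1}
        v = cp.Variable(C.shape[0], nonneg=True)
        s = cp.Variable(n, nonneg=True)
        sigma = cp.Variable(nonneg=True)
        constraints += [cp.abs(C.T @ v - w) <= s,
                        c @ v - gamma + 1 <= sigma]
        penalty += cp.sum(s) + sigma
    objective = cp.Minimize(cp.norm1(w) + mu * penalty)
    cp.Problem(objective, constraints).solve()
    return w.value, gamma.value
```

With the three synthetic knowledge sets of Figure 1(b) passed in as (Bs, bs) and (Cs, cs), this returns a separating plane (w, γ) determined purely by the prior knowledge.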

57 Note that the 1-norm term ‖w‖₁ can be replaced by one half the 2-norm squared, ½‖w‖₂², which is the usual margin maximization term for ordinary support vector machine classifiers [18, 3]. [sent-161, score-0.432]

58 However, this changes the linear program (19) to a quadratic program, which typically takes longer to solve. [sent-162, score-0.321]

59 For standard SVMs, support vectors consist of all data points which are the complement of the data points that can be dropped from the problem without changing the separating plane (5) [18, 11]. [sent-163, score-0.551]

60 Thus for our knowledge-based linear programming formulation (19), support vectors correspond to data points (rows of the matrix A) for which the Lagrange multipliers are nonzero, because solving (19) with these data points only will give the same answer as solving (19) with the entire matrix A. [sent-164, score-0.63]

61 The concept of support vectors has to be modified as follows for our knowledge sets. [sent-165, score-0.44]

62 Since each knowledge set in (16) is represented by a matrix B^i or C^j, each row of these matrices can be thought of as characterizing a boundary plane of the knowledge set. [sent-166, score-0.829]

63 In our formulation (19) above, such rows are wiped out if the corresponding components of the variables u^i or v^j are zero at an optimal solution. [sent-167, score-0.168]

64 We call the complement of these components the support constraints of the knowledge sets (16). [sent-168, score-0.514]

65 Deleting constraints (rows of B^i or C^j) for which the corresponding components of u^i or v^j are zero will not alter the solution of the knowledge-based linear program (19). [sent-169, score-0.2]

66 Deletion of non-support constraints can be considered a refinement of prior knowledge [17]. [sent-171, score-0.511]
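A small illustrative helper (our own, under the assumption that the optimal multipliers u^i or v^j are available from the LP solution) that flags which rows of a knowledge-set matrix are support constraints in the sense just described:

```python
# Rows of B^i (or C^j) whose optimal multipliers are numerically zero are
# non-support constraints; deleting them does not change the solution of (19).
import numpy as np

def support_constraint_rows(multipliers, tol=1e-8):
    """Return indices of knowledge-set rows with nonzero optimal multipliers."""
    return np.flatnonzero(np.abs(np.asarray(multipliers)) > tol)
```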

67 Another type of refinement of prior knowledge may occur when the separating plane x'w = γ intersects one of the knowledge sets. [sent-172, score-1.176]

68 In such a case the plane x'w = γ can be added as an inequality to the knowledge set it intersects. [sent-173, score-0.531]

69 We demonstrate the geometry of incorporating knowledge sets by considering a synthetic example in R² with m = 200 points, 100 of which are in A+ and the other 100 in A−. [sent-175, score-0.485]

70 Figure 1 (a) depicts ordinary linear separation using the linear SVM formulation (1). [sent-176, score-0.431]

71 We now incorporate three knowledge sets into the problem: {x | B¹x ≤ b¹} belonging to A+, and {x | C¹x ≤ c¹} and {x | C²x ≤ c²} belonging to A−, and solve our linear program (19) with μ = 100 and ν = 1. [sent-177, score-0.874]

72 We depict the new linear separation in Figure 1 (b) and note the substantial change generated in the linear separation by the incorporation of these three knowledge sets. [sent-178, score-0.622]

73 Also note that since the plane x'w = γ intersects the knowledge set {x | B¹x ≤ b¹}, this knowledge set can be refined to the following: {x | B¹x ≤ b¹, w'x ≥ γ}. [sent-179, score-0.935]
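The refinement just described amounts to appending one inequality row to the knowledge set; a sketch (the helper name and the rewriting w'x ≥ γ ⟺ −w'x ≤ −γ are ours):

```python
# Refine {x | Bx <= b} by intersecting it with the halfspace w'x >= gamma.
import numpy as np

def refine_knowledge_set(B, b, w, gamma):
    B_refined = np.vstack([B, -np.asarray(w, float).reshape(1, -1)])
    b_refined = np.append(b, -gamma)
    return B_refined, b_refined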

74 4 Numerical Testing Numerical tests, which are described in detail in [6], were carried out on the DNA promoter recognition dataset [17] and the Wisconsin prognostic breast cancer dataset WPBC (ftp://ftp. [sent-180, score-0.463]

75 The prior knowledge for this dataset, which consists of a set of 14 prior rules, matches none of the examples of the training set. [sent-188, score-0.59]

76 However, they do capture significant information about promoters and it is known that incorporating them into a classifier results in a more accurate classifier [17]. [sent-190, score-0.697]

77 These 14 prior rules were converted in a straightforward manner [6] into 64 knowledge sets. [sent-191, score-0.525]

78 After expressing the prior knowledge in the form of polyhedral sets and applying KSVM, we obtained 5 errors out of 106 (5/106). [sent-200, score-0.717]

79 KSVM was also compared with [16], where a hybrid learning system maps problem-specific prior knowledge, represented in propositional logic, into neural networks and then refines this reformulated knowledge using backpropagation. [sent-203, score-0.459]

80 However, it is important to note that our classifier is a much simpler linear classifier, sign(x'w − γ), while the neural network classifier of KBANN is a considerably more complex nonlinear classifier. [sent-206, score-0.615]

81 Furthermore, we note that KSVM is simpler to implement than KBANN and requires merely a commonly available linear programming solver. [sent-207, score-0.194]

82 In addition, KSVM, which is a linear support vector machine classifier, improves by 44. [sent-208, score-0.279]

83 4% the error of an ordinary linear 1-norm SVM classifier that does not utilize prior knowledge sets. [sent-209, score-0.892]

84 The second dataset used in our numerical tests was the Wisconsin breast cancer prognosis dataset WPBC using a 60-month cutoff for predicting recurrence or nonrecurrence of the disease [2]. [sent-210, score-0.563]

85 9) ⟹ NONRECUR. It is important to note that the rules described above can be applied directly to classify only 32 of the 110 given points of the training dataset, and they correctly classify 22 of these 32 points. [sent-213, score-0.347]

86 Hence, if the rules are applied by themselves as a classifier, the classification accuracy would be 20% (22 correctly classified points out of all 110 training points). [sent-215, score-0.425]

87 As such, these rules are not very useful by themselves and doctors use them in conjunction with other rules [8]. [sent-216, score-0.265]

88 However, using our approach the rules were converted to linear inequalities and used in our KSVM algorithm without any use of the data, i. [sent-217, score-0.235]
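As an illustration of that conversion (with hypothetical thresholds λ and τ, since the exact rule constants are not reproduced above), a prognosis rule of the form (L ≥ λ) ∧ (T ≥ τ) ⟹ RECUR becomes a polyhedral knowledge set in the (L, T) plane assigned to the RECUR class:

```latex
% Rule (L >= lambda) AND (T >= tau) ==> RECUR, with x = (L, T)':
\[
\{\, x = (L, T)' \;\mid\; B x \le b \,\},
\qquad
B = \begin{pmatrix} -1 & 0 \\ 0 & -1 \end{pmatrix},
\qquad
b = \begin{pmatrix} -\lambda \\ -\tau \end{pmatrix},
\]
% i.e. the region L >= lambda, T >= tau, handed to KSVM exactly like the
% knowledge sets (16).
```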

89 The resulting linear classifier in the 2-dimensional space of L(ymph) and T(umor) achieved 66. [sent-220, score-0.347]

90 This result is remarkable because our knowledge-based formulation can be applied to problems where training data may not be available whereas expert knowledge may be readily available in the form of knowledge sets. [sent-224, score-0.849]

91 This fact makes this method considerably different from previous hybrid methods like KBANN where training examples are needed in order to refine prior knowledge. [sent-225, score-0.163]

92 5 Conclusion & Future Directions We have proposed an efficient procedure for incorporating prior knowledge in the form of knowledge sets into a linear support vector machine classifier either in combination with a given dataset or based solely on the knowledge sets. [sent-227, score-1.849]

93 This novel and promising approach of handling prior knowledge is worthy of further study, especially ways to handle and simplify the combinatorial nature of incorporating prior knowledge into linear inequalities. [sent-228, score-1.015]

94 A class of possible future applications might be to problems where training data may not be easily available whereas expert knowledge may be readily available in the form of knowledge sets. [sent-229, score-0.712]

95 This would correspond to solving our knowledge-based linear program (19) with ν = 0. [sent-230, score-0.529]

96 A typical example of this type was breast cancer prognosis [8] where knowledge sets by themselves generated a linear classifier as good as any classifier based on data points. [sent-231, score-1.311]

97 This is a new way of incorporating prior knowledge into powerful support vector machine classifiers. [sent-232, score-0.709]

98 Also, the concept of support constraints as discussed at the end of Section 3, warrants further study that may lead to a systematic simplification of prior knowledge sets. [sent-233, score-0.538]

99 Feature selection via concave minimization and support vector machines. [sent-245, score-0.16]

100 Prior knowledge and the creation of "virtual" examples for RBF networks. [sent-287, score-0.298]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('knowledge', 0.298), ('classifier', 0.268), ('ksvm', 0.238), ('plane', 0.233), ('bx', 0.203), ('ry', 0.195), ('polyhedral', 0.185), ('ftp', 0.138), ('formulation', 0.137), ('planes', 0.137), ('ilwlll', 0.132), ('kbann', 0.132), ('prior', 0.129), ('implication', 0.126), ('wisconsin', 0.123), ('program', 0.121), ('programming', 0.115), ('bounding', 0.113), ('support', 0.111), ('aiw', 0.106), ('prognosis', 0.106), ('sets', 0.105), ('svm', 0.104), ('rules', 0.098), ('classifiers', 0.097), ('breast', 0.097), ('belonging', 0.092), ('promoter', 0.092), ('dii', 0.092), ('dataset', 0.092), ('cancer', 0.09), ('ordinary', 0.088), ('halfspace', 0.084), ('refinement', 0.084), ('artificial', 0.082), ('incorporating', 0.082), ('separating', 0.081), ('promoters', 0.079), ('linear', 0.079), ('shavlik', 0.078), ('sign', 0.071), ('aw', 0.07), ('incorporation', 0.07), ('ern', 0.069), ('doctors', 0.069), ('points', 0.063), ('mangasarian', 0.063), ('dna', 0.062), ('classification', 0.059), ('inequalities', 0.058), ('slack', 0.055), ('numerical', 0.055), ('proposition', 0.053), ('bi', 0.053), ('blx', 0.053), ('eix', 0.053), ('farkas', 0.053), ('fung', 0.053), ('halfspaces', 0.053), ('intersects', 0.053), ('lymph', 0.053), ('nonhomogeneous', 0.053), ('olvi', 0.053), ('recur', 0.053), ('rumelhart', 0.053), ('towell', 0.053), ('wpbc', 0.053), ('madison', 0.053), ('bl', 0.053), ('statement', 0.05), ('vector', 0.049), ('separation', 0.048), ('margin', 0.047), ('expert', 0.047), ('contradiction', 0.046), ('tumor', 0.046), ('depicted', 0.044), ('sj', 0.042), ('mining', 0.041), ('equivalence', 0.04), ('machine', 0.04), ('min', 0.038), ('ri', 0.038), ('readily', 0.035), ('november', 0.035), ('training', 0.034), ('briefly', 0.034), ('incorporate', 0.034), ('categories', 0.033), ('specific', 0.032), ('ones', 0.032), ('tests', 0.031), ('denoted', 0.031), ('cj', 0.031), ('follows', 0.031), ('rows', 0.031), ('solving', 0.031), ('classify', 0.03), ('sciences', 0.03), ('error', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

Author: Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik

Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. Keywords: use and refinement of prior knowledge, support vector machines, linear programming 1

2 0.17322996 161 nips-2002-PAC-Bayes & Margins

Author: John Langford, John Shawe-Taylor

Abstract: unkown-abstract

3 0.10973924 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

Author: Yves Grandvalet, Stéphane Canu

Abstract: This paper introduces an algorithm for the automatic relevance determination of input variables in kernelized Support Vector Machines. Relevance is measured by scale factors defining the input space metric, and feature selection is performed by assigning zero weights to irrelevant variables. The metric is automatically tuned by the minimization of the standard SVM empirical risk, where scale factors are added to the usual set of parameters defining the classifier. Feature selection is achieved by constraints encouraging the sparsity of scale factors. The resulting algorithm compares favorably to state-of-the-art feature selection procedures and demonstrates its effectiveness on a demanding facial expression recognition problem.

4 0.10669032 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

Author: Fei Sha, Lawrence K. Saul, Daniel D. Lee

Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.

5 0.10145215 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

Author: Shantanu Chakrabartty, Gert Cauwenberghs

Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177 ,080 samples of the TlMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.

6 0.10139573 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

7 0.087671749 135 nips-2002-Learning with Multiple Labels

8 0.07766211 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers

9 0.076954536 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

10 0.07609117 156 nips-2002-On the Complexity of Learning the Kernel Matrix

11 0.072545648 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

12 0.068828933 19 nips-2002-Adapting Codes and Embeddings for Polychotomies

13 0.068457223 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

14 0.066518061 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems

15 0.065395758 119 nips-2002-Kernel Dependency Estimation

16 0.064999878 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation

17 0.062403917 192 nips-2002-Support Vector Machines for Multiple-Instance Learning

18 0.055780429 115 nips-2002-Informed Projections

19 0.05531038 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error

20 0.054741073 99 nips-2002-Graph-Driven Feature Extraction From Microarray Data Using Diffusion Kernels and Kernel CCA


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.192), (1, -0.089), (2, 0.02), (3, -0.049), (4, 0.033), (5, 0.011), (6, 0.013), (7, 0.006), (8, -0.002), (9, -0.014), (10, 0.092), (11, -0.112), (12, -0.019), (13, -0.123), (14, -0.028), (15, -0.078), (16, 0.066), (17, 0.032), (18, 0.091), (19, 0.052), (20, -0.007), (21, 0.005), (22, 0.166), (23, -0.113), (24, -0.026), (25, -0.145), (26, 0.072), (27, -0.027), (28, 0.044), (29, -0.108), (30, 0.037), (31, -0.067), (32, -0.091), (33, -0.022), (34, 0.085), (35, 0.076), (36, 0.021), (37, 0.057), (38, -0.04), (39, -0.021), (40, -0.116), (41, 0.063), (42, 0.013), (43, 0.009), (44, -0.0), (45, -0.057), (46, 0.14), (47, 0.161), (48, 0.154), (49, -0.019)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9606216 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

Author: Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik

Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. Keywords: use and refinement of prior knowledge, support vector machines, linear programming 1

2 0.63650721 161 nips-2002-PAC-Bayes & Margins

Author: John Langford, John Shawe-Taylor

Abstract: unkown-abstract

3 0.56472462 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

Author: Fei Sha, Lawrence K. Saul, Daniel D. Lee

Abstract: We derive multiplicative updates for solving the nonnegative quadratic programming problem in support vector machines (SVMs). The updates have a simple closed form, and we prove that they converge monotonically to the solution of the maximum margin hyperplane. The updates optimize the traditionally proposed objective function for SVMs. They do not involve any heuristics such as choosing a learning rate or deciding which variables to update at each iteration. They can be used to adjust all the quadratic programming variables in parallel with a guarantee of improvement at each iteration. We analyze the asymptotic convergence of the updates and show that the coefficients of non-support vectors decay geometrically to zero at a rate that depends on their margins. In practice, the updates converge very rapidly to good classifiers.

4 0.51839906 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems

Author: Sepp Hochreiter, Michael C. Mozer, Klaus Obermayer

Abstract: We introduce a family of classifiers based on a physical analogy to an electrostatic system of charged conductors. The family, called Coulomb classifiers, includes the two best-known support-vector machines (SVMs), the ν–SVM and the C–SVM. In the electrostatics analogy, a training example corresponds to a charged conductor at a given location in space, the classification function corresponds to the electrostatic potential function, and the training objective function corresponds to the Coulomb energy. The electrostatic framework provides not only a novel interpretation of existing algorithms and their interrelationships, but it suggests a variety of new methods for SVMs including kernels that bridge the gap between polynomial and radial-basis functions, objective functions that do not require positive-definite kernels, regularization techniques that allow for the construction of an optimal classifier in Minkowski space. Based on the framework, we propose novel SVMs and perform simulation studies to show that they are comparable or superior to standard SVMs. The experiments include classification tasks on data which are represented in terms of their pairwise proximities, where a Coulomb Classifier outperformed standard SVMs. 1

5 0.47807458 165 nips-2002-Ranking with Large Margin Principle: Two Approaches

Author: empty-author

Abstract: We discuss the problem of ranking k instances with the use of a

6 0.46492901 178 nips-2002-Robust Novelty Detection with Single-Class MPM

7 0.45884904 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition

8 0.44080275 192 nips-2002-Support Vector Machines for Multiple-Instance Learning

9 0.39414105 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs

10 0.36124435 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

11 0.35395238 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

12 0.34316221 16 nips-2002-A Prototype for Automatic Recognition of Spontaneous Facial Actions

13 0.34245363 140 nips-2002-Margin Analysis of the LVQ Algorithm

14 0.34018338 6 nips-2002-A Formulation for Minimax Probability Machine Regression

15 0.32793728 194 nips-2002-The Decision List Machine

16 0.32714245 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

17 0.32085675 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization

18 0.3200326 114 nips-2002-Information Regularization with Partially Labeled Data

19 0.3155756 35 nips-2002-Automatic Acquisition and Efficient Representation of Syntactic Structures

20 0.3128581 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(42, 0.048), (54, 0.672), (55, 0.023), (68, 0.014), (74, 0.045), (92, 0.041), (98, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99606806 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers

Author: Glenn M. Fung, Olvi L. Mangasarian, Jude W. Shavlik

Abstract: Prior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. Keywords: use and refinement of prior knowledge, support vector machines, linear programming 1

2 0.99561977 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

Author: Tong Zhang

Abstract: We investigate the generalization performance of some learning problems in Hilbert function Spaces. We introduce a concept of scalesensitive effective data dimension, and show that it characterizes the convergence rate of the underlying learning problem. Using this concept, we can naturally extend results for parametric estimation problems in finite dimensional spaces to non-parametric kernel learning methods. We derive upper bounds on the generalization performance and show that the resulting convergent rates are optimal under various circumstances.

3 0.99400938 36 nips-2002-Automatic Alignment of Local Representations

Author: Yee W. Teh, Sam T. Roweis

Abstract: We present an automatic alignment procedure which maps the disparate internal representations learned by several local dimensionality reduction experts into a single, coherent global coordinate system for the original data space. Our algorithm can be applied to any set of experts, each of which produces a low-dimensional local representation of a highdimensional input. Unlike recent efforts to coordinate such models by modifying their objective functions [1, 2], our algorithm is invoked after training and applies an efficient eigensolver to post-process the trained models. The post-processing has no local optima and the size of the system it must solve scales with the number of local models rather than the number of original data points, making it more efficient than model-free algorithms such as Isomap [3] or LLE [4]. 1 Introduction: Local vs. Global Dimensionality Reduction Beyond density modelling, an important goal of unsupervised learning is to discover compact, informative representations of high-dimensional data. If the data lie on a smooth low dimensional manifold, then an excellent encoding is the coordinates internal to that manifold. The process of determining such coordinates is dimensionality reduction. Linear dimensionality reduction methods such as principal component analysis and factor analysis are easy to train but cannot capture the structure of curved manifolds. Mixtures of these simple unsupervised models [5, 6, 7, 8] have been used to perform local dimensionality reduction, and can provide good density models for curved manifolds, but unfortunately such mixtures cannot do dimensionality reduction. They do not describe a single, coherent low-dimensional coordinate system for the data since there is no pressure for the local coordinates of each component to agree. Roweis et al [1] recently proposed a model which performs global coordination of local coordinate systems in a mixture of factor analyzers (MFA). Their model is trained by maximizing the likelihood of the data, with an additional variational penalty term to encourage the internal coordinates of the factor analyzers to agree. While their model can trade off modelling the data and having consistent local coordinate systems, it requires a user given trade-off parameter, training is quite inefficient (although [2] describes an improved training algorithm for a more constrained model), and it has quite serious local minima problems (methods like LLE [4] or Isomap [3] have to be used for initialization). In this paper we describe a novel, automatic way to align the hidden representations used by each component of a mixture of dimensionality reducers into a single global representation of the data throughout space. Given an already trained mixture, the alignment is achieved by applying an eigensolver to a matrix constructed from the internal representations of the mixture components. Our method is efficient, simple to implement, and has no local optima in its optimization nor any learning rates or annealing schedules. 2 The Locally Linear Coordination Algorithm H 9¥ EI¡ CD66B9 ©9B 766 % G F 5 #

4 0.99035597 97 nips-2002-Global Versus Local Methods in Nonlinear Dimensionality Reduction

Author: Vin D. Silva, Joshua B. Tenenbaum

Abstract: Recently proposed algorithms for nonlinear dimensionality reduction fall broadly into two categories which have different advantages and disadvantages: global (Isomap [1]), and local (Locally Linear Embedding [2], Laplacian Eigenmaps [3]). We present two variants of Isomap which combine the advantages of the global approach with what have previously been exclusive advantages of local methods: computational sparsity and the ability to invert conformal maps.

5 0.9627921 67 nips-2002-Discriminative Binaural Sound Localization

Author: Ehud Ben-reuven, Yoram Singer

Abstract: Time difference of arrival (TDOA) is commonly used to estimate the azimuth of a source in a microphone array. The most common methods to estimate TDOA are based on finding extrema in generalized crosscorrelation waveforms. In this paper we apply microphone array techniques to a manikin head. By considering the entire cross-correlation waveform we achieve azimuth prediction accuracy that exceeds extrema locating methods. We do so by quantizing the azimuthal angle and treating the prediction problem as a multiclass categorization task. We demonstrate the merits of our approach by evaluating the various approaches on Sony’s AIBO robot.

6 0.96239483 104 nips-2002-How the Poverty of the Stimulus Solves the Poverty of the Stimulus

7 0.91428465 156 nips-2002-On the Complexity of Learning the Kernel Matrix

8 0.88157749 113 nips-2002-Information Diffusion Kernels

9 0.87235171 34 nips-2002-Artefactual Structure from Least-Squares Multidimensional Scaling

10 0.86975837 119 nips-2002-Kernel Dependency Estimation

11 0.85915357 190 nips-2002-Stochastic Neighbor Embedding

12 0.85333389 106 nips-2002-Hyperkernels

13 0.8499006 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques

14 0.8481524 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

15 0.84149653 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers

16 0.83890384 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

17 0.83856046 120 nips-2002-Kernel Design Using Boosting

18 0.83820027 80 nips-2002-Exact MAP Estimates by (Hyper)tree Agreement

19 0.82139957 63 nips-2002-Critical Lines in Symmetry of Mixture Models and its Application to Component Splitting

20 0.81948483 49 nips-2002-Charting a Manifold