nips nips2003 nips2003-1 knowledge-graph by maker-knowledge-mining

1 nips-2003-1-norm Support Vector Machines


Source: pdf

Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie

Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitating adaptive selection of the tuning parameter for the 1-norm SVM.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. [sent-5, score-0.22]

2 We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitating adaptive selection of the tuning parameter for the 1-norm SVM. [sent-6, score-0.861]

3 (xn, yn), where the input xi ∈ R^p, and the output yi ∈ {1, −1} is binary. [sent-10, score-0.224]

4 To handle this problem, we consider the 1-norm support vector machine (SVM): [sent-12, score-0.097]

5 min_{β0, β} Σ_{i=1}^{n} [1 − yi(β0 + Σ_{j=1}^{q} βj hj(xi))]_+ (1), subject to ||β||_1 = |β1| + · · · + |βq| ≤ s, (2) [sent-14, score-0.425]

6 where D = {h1(x), . . . , hq(x)} is a dictionary of basis functions and s is a tuning parameter. [sent-17, score-0.281]

7 The solution is denoted as β̂0(s) and β̂(s); the fitted model is f̂(x) = β̂0(s) + Σ_{j=1}^{q} β̂j(s) hj(x). [sent-18, score-0.350]
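
As a small illustration of the fitted model f̂(x) = β̂0(s) + Σ_j β̂j(s) hj(x), the Python sketch below evaluates such a model for a hand-picked dictionary of basis functions; the dictionary, coefficients and input are made up for illustration and are not taken from the paper.

    import numpy as np

    # A toy dictionary of basis functions h_1, ..., h_q on R^2 (hypothetical choice).
    dictionary = [
        lambda x: x[0],          # h_1(x) = x_1
        lambda x: x[1],          # h_2(x) = x_2
        lambda x: x[0] * x[1],   # h_3(x) = x_1 * x_2
        lambda x: x[0] ** 2,     # h_4(x) = x_1^2
        lambda x: x[1] ** 2,     # h_5(x) = x_2^2
    ]

    def fitted_model(x, beta0, beta, dictionary):
        """Evaluate f(x) = beta0 + sum_j beta_j * h_j(x)."""
        return beta0 + sum(b * h(x) for b, h in zip(beta, dictionary))

    # Made-up coefficients; a 1-norm SVM fit would set several of them to exactly 0.
    beta0, beta = 0.3, np.array([0.0, -1.2, 0.0, 0.8, 0.0])
    x = np.array([0.5, -1.0])
    print(np.sign(fitted_model(x, beta0, beta, dictionary)))  # predicted class in {1, -1}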

8 We argue in this paper that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. [sent-21, score-0.22]

9 To get a good fitted model f̂(x) that performs well on future data, we also need to select an appropriate tuning parameter s. [sent-22, score-0.231]

10 Under some mild assumptions, we show that the computational cost to compute the whole solution path β̂(s) is O(nq min(n, q)^2) in the worst case and O(nq) in the best case. [sent-25, score-0.583]

11 Before delving into the technical details, we illustrate the concept of piece-wise linearity of the solution path β̂(s) with a simple example. [sent-31, score-0.425]

12 The first class has two standard normal independent inputs x1, x2. [sent-33, score-0.134]

13 The dictionary of basis functions is D = {√2 x1, √2 x2, √2 x1x2, x1^2, x2^2}. [sent-36, score-0.155]

14 The solution path β̂(s) as a function of s is shown in Figure 1. [sent-37, score-0.356]

15 Hence the right derivative of β̂(s) with respect to s is piece-wise constant (in R^q). [sent-39, score-0.154]

16 Figure 1: The solution path β̂(s) as a function of s. [sent-45, score-0.356]

17 In section 3, we describe the algorithm that computes the whole solution path β̂(s). [sent-47, score-0.522]

18 2 Regularized support vector machines. The standard 2-norm SVM is equivalent to fitting the model min_{β0, β} Σ_{i=1}^{n} [1 − yi(β0 + Σ_{j=1}^{q} βj hj(xi))]_+ + λ ||β||_2^2, (4) where λ is a tuning parameter. [sent-49, score-0.648]
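
A minimal NumPy sketch of objective (4), hinge loss plus ridge penalty, evaluated on a basis-expanded design matrix; the matrix H, labels y and coefficients below are placeholders, not the paper's data.

    import numpy as np

    def two_norm_svm_objective(beta0, beta, H, y, lam):
        """Objective (4): sum_i [1 - y_i (beta0 + H_i beta)]_+  +  lam * ||beta||_2^2.

        H is the n x q matrix with H[i, j] = h_j(x_i); y takes values in {1, -1}.
        """
        margins = y * (beta0 + H @ beta)
        hinge = np.maximum(0.0, 1.0 - margins).sum()
        ridge = lam * np.sum(beta ** 2)
        return hinge + ridge

    # Toy example
    rng = np.random.default_rng(0)
    H = rng.normal(size=(10, 3))
    y = np.where(rng.normal(size=10) > 0, 1.0, -1.0)
    print(two_norm_svm_objective(0.0, np.array([0.5, -0.2, 0.1]), H, y, lam=1.0))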

19 In practice, people usually choose hj (x)’s to be the basis functions of a reproducing kernel Hilbert space. [sent-50, score-0.386]

20 In this paper, however, we will concentrate on the basis representation (3) rather than a kernel representation. [sent-54, score-0.104]

21 Notice that (4) has the form loss + penalty, and λ is the tuning parameter that controls the tradeoff between loss and penalty. [sent-55, score-0.366]

22 The loss (1 − yf)_+ is called the hinge loss, and the penalty is called the ridge penalty. [sent-56, score-0.569]

23 The ridge penalty shrinks the fitted coefficients β̂ towards zero. [sent-58, score-0.453]

24 It is well known that this shrinkage has the effect of controlling the variances of β̂, hence possibly improving the fitted model's prediction accuracy, especially when there are many highly correlated features [6]. [sent-59, score-0.142]

25 So from a statistical function estimation point of view, the ridge penalty could possibly explain the success of the SVM ([6] and [12]). [sent-60, score-0.411]

26 On the other hand, computational learning theory has attributed the good performance of the SVM to its margin-maximizing property [11], a property of the hinge loss. [sent-61, score-0.122]

27 In this paper, we replace the ridge penalty in (4) with the L1-norm of β, i.e. [sent-63, score-0.411]

28 the lasso penalty [10], and consider the 1-norm SVM problem: min_{β0, β} Σ_{i=1}^{n} [1 − yi(β0 + Σ_{j=1}^{q} βj hj(xi))]_+ + λ ||β||_1, (5) which is an equivalent Lagrange version of the optimization problem (1)-(2). [sent-65, score-0.954]

29 The lasso penalty was first proposed in [10] for regression problems, where the response y is continuous rather than categorical. [sent-66, score-0.482]

30 Similar to the ridge penalty, the lasso penalty also shrinks the fitted coefficients β̂ towards zero, hence (5) also benefits from the reduction in the fitted coefficients' variances. [sent-68, score-0.768]

31 Another property of the lasso penalty is that, because of the L1 nature of the penalty, making λ sufficiently large, or equivalently s sufficiently small, will cause some of the coefficients β̂j to be exactly zero. [sent-69, score-0.511]

32 Thus the lasso penalty does a kind of continuous feature selection, while this is not the case for the ridge penalty. [sent-71, score-0.689]
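
To see this qualitative difference numerically, here is a small sketch (illustrative only, not the paper's experiment) that fits lasso- and ridge-penalized linear regressions, as in [10], to the same sparse data and counts exact zeros among the fitted coefficients; it uses squared-error loss rather than the hinge loss, but the contrast between the two penalties is the point.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(0)
    n, q = 50, 20
    X = rng.normal(size=(n, q))
    true_beta = np.zeros(q)
    true_beta[:3] = [2.0, -1.5, 1.0]          # sparse truth: only 3 nonzero coefficients
    y = X @ true_beta + 0.5 * rng.normal(size=n)

    lasso = Lasso(alpha=0.1).fit(X, y)        # L1 (lasso) penalty
    ridge = Ridge(alpha=10.0).fit(X, y)       # L2 (ridge) penalty

    print("lasso exact zeros:", np.sum(lasso.coef_ == 0.0))   # typically many
    print("ridge exact zeros:", np.sum(ridge.coef_ == 0.0))   # typically none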

33 It is interesting to note that the ridge penalty corresponds to a Gaussian prior for the βj ’s, while the lasso penalty corresponds to a double-exponential prior. [sent-73, score-0.893]

34 This reflects the greater tendency of the lasso to produce some large fitted coefficients and leave others at 0, especially in high-dimensional problems. [sent-75, score-0.291]

35 In the sparse scenario, i.e. only a small number of the true coefficients βj are nonzero, the lasso penalty works better than the ridge penalty; while in the non-sparse scenario, e.g. [sent-83, score-0.662]

36 the true coefficients βj have a Gaussian distribution, neither the lasso penalty nor the ridge penalty will fit the coefficients well, since there is too little data from which to estimate these non-zero coefficients. [sent-85, score-0.893]

37 Based on these observations, [3] further propose the "bet on sparsity" principle for high-dimensional problems, which encourages using the lasso penalty. [sent-87, score-0.251]

38 To solve the 1-norm SVM for a fixed value of s, we can transform (1)-(2) into a linear programming problem and use standard software packages; but to get a good fitted model f̂(x) that performs well on future data, we need to select an appropriate value for the tuning parameter s. [sent-89, score-0.231]
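
A sketch of that linear-programming route for one fixed s, using scipy.optimize.linprog; the split β = β⁺ − β⁻ and the slack variables ξ are standard LP devices (not notation from the paper), the data are toy placeholders, and this is not the path algorithm proposed in the paper.

    import numpy as np
    from scipy.optimize import linprog

    def one_norm_svm_lp(H, y, s):
        """Solve (1)-(2) for a fixed s as a linear program (a sketch only).

        Variables: [beta0, beta_plus (q), beta_minus (q), xi (n)], beta = beta_plus - beta_minus.
        minimize sum(xi)
        s.t. y_i (beta0 + H_i beta) >= 1 - xi_i,  sum_j (beta_plus_j + beta_minus_j) <= s,
             beta_plus, beta_minus, xi >= 0.
        """
        n, q = H.shape
        c = np.concatenate([[0.0], np.zeros(2 * q), np.ones(n)])

        # Margin constraints as A_ub x <= b_ub:
        # -y_i*beta0 - y_i*H_i*beta_plus + y_i*H_i*beta_minus - xi_i <= -1
        A_margin = np.hstack([-y[:, None], -y[:, None] * H, y[:, None] * H, -np.eye(n)])
        b_margin = -np.ones(n)

        # L1 budget: sum(beta_plus) + sum(beta_minus) <= s
        A_budget = np.concatenate([[0.0], np.ones(2 * q), np.zeros(n)])[None, :]
        b_budget = np.array([s])

        bounds = [(None, None)] + [(0, None)] * (2 * q + n)
        res = linprog(c, A_ub=np.vstack([A_margin, A_budget]),
                      b_ub=np.concatenate([b_margin, b_budget]),
                      bounds=bounds, method="highs")
        beta0 = res.x[0]
        beta = res.x[1:1 + q] - res.x[1 + q:1 + 2 * q]
        return beta0, beta

    # Toy data: H[i, j] = h_j(x_i) for some dictionary, y in {1, -1}
    rng = np.random.default_rng(1)
    H = rng.normal(size=(30, 5))
    y = np.where(H[:, 0] + 0.5 * H[:, 1] + 0.1 * rng.normal(size=30) > 0, 1.0, -1.0)
    print(one_norm_svm_lp(H, y, s=2.0))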

39 In this section, we propose an efficient algorithm that computes the whole solution path β̂(s), hence facilitating adaptive selection of s. [sent-90, score-0.735]

40 This implies that the derivative of β̂(s) with respect to s is piece-wise constant, because when the Karush-Kuhn-Tucker conditions do not change, the derivative of β̂(s) will not change either. [sent-93, score-0.308]

41 Thus to compute the whole solution path β̂(s), all we need to do is find the joints, i.e. [sent-96, score-0.841]

42 the asterisk points in Figure 1, on this piece-wise linear path, and then use straight lines to interpolate them; or equivalently, start at β̂(0) = 0, find the right derivative of β̂(s), [sent-98, score-0.206]

43 let s increase, and only change the derivative when β̂(s) gets to a joint. [sent-99, score-0.154]
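
A brute-force way to visualize such a coefficient path (not the paper's algorithm) is to refit an L1-penalized linear classifier on a grid of penalty values and stack the coefficients; the sketch below uses scikit-learn's LinearSVC with an L1 penalty as a stand-in, which optimizes the squared hinge loss rather than the hinge loss in (1), so it only approximates the 1-norm SVM path. The data are placeholders.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = np.where(X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=100) > 0, 1, -1)

    # Larger C plays the role of a larger budget s.
    Cs = np.logspace(-2, 1, 20)
    path = []
    for C in Cs:
        clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C, max_iter=10000)
        clf.fit(X, y)
        path.append(clf.coef_.ravel().copy())
    path = np.array(path)          # shape (len(Cs), 5): one row of coefficients per C
    print(np.round(path, 3))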

44 Let V = {j : β̂j(s) ≠ 0}, E = {i : 1 − yi f̂i = 0}, L = {i : 1 − yi f̂i > 0}, and u for the right derivative of β̂_V(s), with ||u||_1 = 1; β̂_V(s) denotes the components of β̂(s) with indices in V. [sent-103, score-0.836]

45 Without loss of generality, we assume #{yi = 1} ≥ #{yi = −1}; then β̂0(0) = 1, β̂j(0) = 0. [sent-104, score-0.120]

46 In what follows, we need to compute the initial derivative of β̂(s); we consider a modified problem: min_{β0, β} Σ_{yi=1} (1 − yi fi)_+ + Σ_{yi=−1} (1 − yi fi) (6), subject to Σ_{j=1}^{q} |βj| ≤ Δs. (7) [sent-106, score-1.271]

47 Notice that if yi = 1, the loss is still (1 − yi fi)_+; but if yi = −1, the loss becomes (1 − yi fi). [sent-109, score-1.278]

48 In this setup, the derivative of β̂(Δs) with respect to Δs is the same no matter what value Δs is, and one can show that it coincides with the right derivative of β̂(s) when s is sufficiently small. [sent-110, score-0.308]

49 Hence this setup helps us find the initial derivative u of β̂(s). [sent-111, score-0.154]

50 3 Main algorithm. The main algorithm that computes the whole solution path β̂(s) proceeds as follows: 1. [sent-117, score-0.522]

51 Let the current β̂0, β̂ and s be denoted by β̂0^old, β̂^old and s^old. [sent-124, score-0.159]

52 For each j* ∉ V, we solve: u0 + Σ_{j∈V} uj hj(xi) + uj* hj*(xi) = 0 for i ∈ E, and Σ_{j∈V} sign(β̂j^old) uj + |uj*| = 1, (9) where u0, the uj and uj* are the unknowns. [sent-126, score-1.566]

53 We then compute: Δloss_{j*}/Δs = −Σ_{i∈L} yi (u0 + Σ_{j∈V} uj hj(xi) + uj* hj*(xi)). [sent-127, score-1.166]

54 For each i′ ∈ E, we solve: u0 + Σ_{j∈V} uj hj(xi) = 0 for i ∈ E \ {i′}, and Σ_{j∈V} sign(β̂j^old) uj = 1, (11) where u0 and the uj are the unknowns. [sent-129, score-0.741]

55 We then compute: Δloss_{i′}/Δs = −Σ_{i∈L} yi (u0 + Σ_{j∈V} uj hj(xi)). [sent-130, score-0.672]
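
The two "we solve" displays above (in 52 and 54) are small linear systems in (u0, u_V), plus uj* in step 2. Below is a minimal NumPy sketch of the step-3 system, assuming (as is generically the case at a joint) that |E| = |V| + 1 so that removing one elbow point leaves a square system; the active set, elbow set, signs and basis values are all made up for illustration.

    import numpy as np

    # Illustrative step-3 system: |V| = 2 active variables, |E| = 3 elbow points,
    # and we try removing the elbow point i' = 2 (all values are hypothetical).
    H_E = np.array([[ 0.8, -0.4],      # rows: h_j(x_i) for i in E, columns: j in V
                    [-0.3,  0.9],
                    [ 0.5,  0.2]])
    signs = np.array([1.0, -1.0])      # sign(beta_j^old) for j in V
    i_prime = 2                        # elbow point considered for removal

    keep = [i for i in range(H_E.shape[0]) if i != i_prime]
    # Unknowns: [u0, u_1, u_2]
    A = np.vstack([
        np.hstack([np.ones((len(keep), 1)), H_E[keep]]),   # u0 + sum_j u_j h_j(x_i) = 0, i in E \ {i'}
        np.concatenate([[0.0], signs])[None, :],           # sum_j sign(beta_j^old) u_j = 1
    ])
    b = np.concatenate([np.zeros(len(keep)), [1.0]])
    u = np.linalg.solve(A, b)
    print("u0, u_V =", u)              # candidate right derivative if i' leaves E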

56 Hence: if the smallest Δloss/Δs is non-negative, the algorithm terminates; else, if the smallest negative Δloss/Δs corresponds to a j* in step 2, we update V ← V ∪ {j*} and u ← (u, uj*). [sent-135, score-0.396]

57 (β̂0, β̂_V) = (β̂0^old, β̂_V^old) + Δs · (u0, u). (15) In the end, we get a path β̂(s), which is piece-wise linear. [sent-138, score-0.280]

58 4 Remarks. Due to the page limit, we omit the proof that this algorithm does indeed give the exact whole solution path β̂(s) of (1)-(2) (see [13] for a detailed proof). [sent-140, score-0.453]

59 Step 2 computes the possible right derivative of β̂(s) if adding each basis function hj*(x) into V. [sent-144, score-0.545]

60 Step 3 computes the possible right derivative of β̂(s) if removing each point i′ from E. [sent-145, score-0.223]

61 The possible right derivative of β̂(s) (determined by either (9) or (11)) is such that the training points in E are kept in E when s increases, until the next joint (step 1) occurs. [sent-146, score-0.191]

62 Δloss/Δs indicates how fast the loss will decrease if β̂(s) changes according to u. [sent-147, score-0.165]

63 When the loss cannot be decreased, the algorithm terminates. [sent-149, score-0.120]

64 [Table 1, "# Joints" row: 94 (13), 149 (20), 225 (30), 374 (52), 499 (67)] Computational cost: We have proposed an algorithm that computes the whole solution path β̂(s). [sent-181, score-0.553]

65 Suppose |E| = m at a joint on the piece-wise linear solution path; then it takes O(qm^2) to compute step 2 and step 3 of the algorithm via the Sherman-Morrison updating formula. [sent-183, score-0.213]
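
The Sherman-Morrison formula is what keeps each step at O(qm^2): it updates the inverse of an m×m matrix after a rank-one change in O(m^2) instead of recomputing it in O(m^3). A small self-contained check (generic, not tied to the paper's specific matrices):

    import numpy as np

    def sherman_morrison_update(A_inv, u, v):
        """Return (A + u v^T)^{-1} given A^{-1}, in O(m^2) operations."""
        Au = A_inv @ u
        vA = v @ A_inv
        return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

    rng = np.random.default_rng(0)
    m = 5
    A = rng.normal(size=(m, m)) + m * np.eye(m)   # well-conditioned matrix
    u, v = rng.normal(size=m), rng.normal(size=m)

    A_inv = np.linalg.inv(A)
    updated = sherman_morrison_update(A_inv, u, v)
    direct = np.linalg.inv(A + np.outer(u, v))
    print(np.allclose(updated, direct))            # True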

66 If we assume the training data are separable by the dictionary D, then all the training data will eventually have loss (1 − yi f̂i)_+ equal to zero. [sent-184, score-0.615]

67 Hence it is reasonable to assume the number of joints on the piece-wise linear solution path is O(n). [sent-185, score-0.547]

68 Since the maximum value of m is min(n, q) and the minimum value of m is 1, the worst-case computational cost is O(nq min(n, q)^2) and the best-case computational cost is O(nq). [sent-186, score-0.116]

69 Simulation results (section 4) actually indicate that the number of joints tends to be O(min(n, q)). [sent-188, score-0.191]

70 4.1 Simulation results. The data generation mechanism is the same as the one described in section 1, except that we generate 50 training data in each of the two classes, and, to make the problems harder, we sequentially augment the inputs with an additional two, four, six and eight standard normal noise inputs. [sent-191, score-0.210]

71 In the original input space, a hyperplane cannot separate the classes; we use an enlarged feature space corresponding to the 2nd-degree polynomial kernel, hence the dictionary of basis functions is D = {√2 xj, √2 xj xj′, xj^2; j, j′ = 1, . [sent-195, score-0.271]
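
A sketch of that basis expansion, mapping raw inputs to the dictionary {√2 xj, √2 xj xj′, xj^2}; the input data below are placeholders.

    import numpy as np
    from itertools import combinations

    def poly2_dictionary(X):
        """Map raw inputs X (n x p) to the basis {sqrt(2) x_j, sqrt(2) x_j x_j', x_j^2}."""
        n, p = X.shape
        cols = [np.sqrt(2.0) * X[:, j] for j in range(p)]                                  # sqrt(2) x_j
        cols += [np.sqrt(2.0) * X[:, j] * X[:, k] for j, k in combinations(range(p), 2)]   # sqrt(2) x_j x_j'
        cols += [X[:, j] ** 2 for j in range(p)]                                           # x_j^2
        return np.column_stack(cols)

    X = np.random.default_rng(0).normal(size=(50, 4))   # e.g. 2 signal + 2 noise inputs
    H = poly2_dictionary(X)
    print(H.shape)   # (50, 14): q = 2p + p(p-1)/2 basis functions for p = 4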

72 The average test errors over 50 simulations, with different numbers of noise inputs, are shown in Table 1. [sent-200, score-0.105]

73 For both the 1-norm SVM and the 2-norm SVM, we choose the tuning parameters to minimize the test error, to be as fair as possible to each method. [sent-201, score-0.156]

74 From Table 1 we can see that the non-penalized SVM performs significantly worse than the penalized ones; the 1-norm SVM and the 2-norm SVM perform similarly when there is no noise input (line 1), but the 2-norm SVM is adversely affected by noise inputs (lines 2-5). [sent-203, score-0.288]

75 Since the 1-norm SVM has the ability to select relevant features and ignore redundant features, it does not suffer from the noise inputs as much as the 2-norm SVM. [sent-204, score-0.296]
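
The following sketch mimics that comparison in a simplified form (it is not the paper's simulation): a sparse linear signal plus pure-noise inputs, classified with L1- and L2-penalized linear classifiers from scikit-learn; both use the squared hinge loss, so they are only stand-ins for the 1-norm and 2-norm SVMs discussed here.

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    def make_data(n, n_noise):
        X_signal = rng.normal(size=(n, 2))
        y = np.where(X_signal[:, 0] + X_signal[:, 1] + 0.5 * rng.normal(size=n) > 0, 1, -1)
        X_noise = rng.normal(size=(n, n_noise))         # redundant noise inputs
        return np.hstack([X_signal, X_noise]), y

    for n_noise in [0, 4, 8]:
        X_tr, y_tr = make_data(100, n_noise)
        X_te, y_te = make_data(2000, n_noise)
        err = {}
        for name, penalty in [("1-norm", "l1"), ("2-norm", "l2")]:
            clf = LinearSVC(penalty=penalty, loss="squared_hinge", dual=False,
                            C=1.0, max_iter=10000).fit(X_tr, y_tr)
            err[name] = np.mean(clf.predict(X_te) != y_te)
        print(n_noise, "noise inputs:", err)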

76 Table 1 also shows the number of basis functions q and the number of joints on the piece-wise linear solution path. [sent-205, score-0.369]

77 The left panel is the piece-wise linear solution path β̂(s). [sent-217, score-0.393]

78 The middle panel is the test error along the solution path. [sent-219, score-0.17]

79 The right panel illustrates the linear relationship between the number of basis functions and the number of joints on the solution path when q < n. [sent-221, score-0.659]

80 4.2 Real data results. In this section, we apply the 1-norm SVM to the classification of gene microarrays. [sent-223, score-0.130]

81 Classification of patient samples is an important aspect of cancer diagnosis and treatment. [sent-224, score-0.182]

82 The 2-norm SVM has been successfully applied to microarray cancer diagnosis problems ([5] and [7]). [sent-225, score-0.282]

83 However, one weakness of the 2-norm SVM is that it only predicts a cancer class label but does not automatically select the relevant genes for the classification. [sent-226, score-0.327]

84 Often a primary goal in microarray cancer diagnosis is to identify the genes responsible for the classification, rather than class prediction. [sent-227, score-0.409]

85 [4] and [5] have proposed gene selection methods, which we call univariate ranking (UR) and recursive feature elimination (RFE) (see [14]), that can be combined with the 2-norm SVM. [sent-228, score-0.237]

86 However, these procedures are two-step procedures that depend on external gene selection methods. [sent-229, score-0.21]

87 On the other hand, the 1-norm SVM has an inherent gene (feature) selection property due to the lasso penalty. [sent-230, score-0.49]

88 Hence the 1-norm SVM achieves the goals of classification of patients and selection of genes simultaneously. [sent-231, score-0.171]

89 This data set consists of 38 training data and 34 test data of two types of acute leukemia, acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). [sent-233, score-0.462]

90 The tuning parameter is chosen according to 10-fold cross-validation, then the final model is fitted on all the training data and evaluated on the test data. [sent-240, score-0.193]
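
A sketch of that tuning procedure using 10-fold cross-validation via scikit-learn's GridSearchCV over the penalty strength C of an L1-penalized linear classifier, which plays the role of the tuning parameter s; the data are random placeholders with n ≪ q, not the leukemia set, and the classifier is a squared-hinge stand-in for the 1-norm SVM.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    # Placeholder "microarray-like" data: n << q, balanced labels.
    rng = np.random.default_rng(0)
    n_train, n_test, q = 38, 34, 500
    X = rng.normal(size=(n_train + n_test, q))
    y = np.tile([1, -1], (n_train + n_test) // 2)
    X_tr, y_tr, X_te, y_te = X[:n_train], y[:n_train], X[n_train:], y[n_train:]

    # 10-fold CV over the penalty strength C.
    grid = GridSearchCV(
        LinearSVC(penalty="l1", loss="squared_hinge", dual=False, max_iter=20000),
        param_grid={"C": np.logspace(-2, 1, 10)},
        cv=10,
    )
    grid.fit(X_tr, y_tr)
    best = grid.best_estimator_          # refit on all training data by default
    print("test error:", np.mean(best.predict(X_te) != y_te))
    print("genes selected:", int(np.sum(best.coef_ != 0)))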

91 The number of joints on the solution path is 104, which appears to be O(n) ≪ O(q). [sent-241, score-0.547]

92 We should notice that the maximum number of genes that the 1-norm SVM can select is upper bounded by n, which is usually much less than q in microarray problems. [sent-244, score-0.294]

93 We illustrate that the 1-norm SVM may have some advantage over the 2-norm SVM, especially when there are redundant features. [sent-246, score-0.14]

94 The solution path β̂(s) of the 1-norm SVM is a piece-wise linear function in the tuning parameter s. (Table 2: Results on Microarray Classification — Method / CV Error / Test Error / # of Genes: 2-norm SVM UR: 2/38, 3/34, 22; 2-norm SVM RFE: 2/38, 1/34, 31; 1-norm SVM: 2/38, 2/34, 17.) [sent-247, score-0.482]

95 We have proposed an efficient algorithm to compute the whole solution path β̂(s) of the 1-norm SVM, and facilitate adaptive selection of the tuning parameter s. [sent-248, score-0.753]

96 (1998) Feature selection via concave minimization and support vector machines. [sent-253, score-0.13]

97 (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. [sent-284, score-0.202]

98 (2002) Gene selection for cancer classification using support vector machines. [sent-290, score-0.244]

99 (2003) Boosting as a regularized path to a maximum margin classifier. [sent-309, score-0.304]

100 (2003) Classification of gene microarrays by penalized logistic regression. [sent-342, score-0.168]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('svm', 0.363), ('path', 0.253), ('lasso', 0.251), ('hj', 0.247), ('uj', 0.247), ('penalty', 0.231), ('joints', 0.191), ('ridge', 0.18), ('yi', 0.178), ('fi', 0.163), ('tted', 0.163), ('derivative', 0.154), ('gene', 0.13), ('tuning', 0.126), ('loss', 0.12), ('nq', 0.115), ('cancer', 0.114), ('coef', 0.104), ('solution', 0.103), ('microarray', 0.1), ('whole', 0.097), ('tibshirani', 0.095), ('hastie', 0.094), ('genes', 0.091), ('leukemia', 0.085), ('old', 0.084), ('zhu', 0.084), ('cients', 0.082), ('dictionary', 0.08), ('selection', 0.08), ('basis', 0.075), ('noise', 0.075), ('acute', 0.075), ('sold', 0.075), ('inputs', 0.072), ('classi', 0.072), ('computes', 0.069), ('diagnosis', 0.068), ('rosset', 0.068), ('hence', 0.064), ('redundant', 0.063), ('smallest', 0.055), ('simulation', 0.054), ('notice', 0.053), ('stanford', 0.052), ('rfe', 0.05), ('rq', 0.05), ('select', 0.05), ('support', 0.05), ('min', 0.047), ('xi', 0.046), ('changes', 0.045), ('shrinks', 0.042), ('ur', 0.042), ('argue', 0.042), ('cation', 0.041), ('especially', 0.04), ('mild', 0.04), ('step', 0.039), ('hinge', 0.038), ('penalized', 0.038), ('shrinkage', 0.038), ('training', 0.037), ('panel', 0.037), ('illustrate', 0.037), ('grant', 0.036), ('class', 0.036), ('relevant', 0.036), ('reproducing', 0.035), ('adaptive', 0.035), ('facilitates', 0.034), ('nih', 0.034), ('linearity', 0.032), ('residual', 0.032), ('compute', 0.032), ('cost', 0.031), ('sign', 0.03), ('hilbert', 0.03), ('scenario', 0.03), ('test', 0.03), ('kernel', 0.029), ('table', 0.029), ('property', 0.029), ('performs', 0.028), ('boosting', 0.027), ('feature', 0.027), ('facilitate', 0.027), ('friedman', 0.027), ('gets', 0.027), ('paths', 0.027), ('worst', 0.027), ('get', 0.027), ('margin', 0.026), ('normal', 0.026), ('regularized', 0.025), ('lines', 0.025), ('enlarged', 0.025), ('packages', 0.025), ('retention', 0.025), ('rob', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 1 nips-2003-1-norm Support Vector Machines

Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie

Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitating adaptive selection of the tuning parameter for the 1-norm SVM.

2 0.2729837 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines.

3 0.13452928 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

Author: Pedro J. Moreno, Purdy P. Ho, Nuno Vasconcelos

Abstract: Over the last years significant efforts have been made to develop kernels that can be applied to sequence data such as DNA, text, speech, video and images. The Fisher Kernel and similar variants have been suggested as good ways to combine an underlying generative model in the feature space and discriminant classifiers such as SVM’s. In this paper we suggest an alternative procedure to the Fisher kernel for systematically finding kernel functions that naturally handle variable length sequence data in multimedia domains. In particular for domains such as speech and images we explore the use of kernel functions that take full advantage of well known probabilistic models such as Gaussian Mixtures and single full covariance Gaussian models. We derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect our approach combines the best of both generative and discriminative methods and replaces the standard SVM kernels. We perform experiments on speaker identification/verification and image classification tasks and show that these new kernels have the best performance in speaker verification and mostly outperform the Fisher kernel based SVM’s and the generative classifiers in speaker identification and image classification. 1

4 0.1280264 113 nips-2003-Learning with Local and Global Consistency

Author: Dengyong Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, Bernhard Schölkopf

Abstract: We consider the general problem of learning from labeled and unlabeled data, which is often called semi-supervised learning or transductive inference. A principled approach to semi-supervised learning is to design a classifying function which is sufficiently smooth with respect to the intrinsic structure collectively revealed by known labeled and unlabeled points. We present a simple algorithm to obtain such a smooth solution. Our method yields encouraging experimental results on a number of classification problems and demonstrates effective use of unlabeled data. 1

5 0.12614666 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

Author: Claudio Gentile

Abstract: New feature selection algorithms for linear threshold functions are described which combine backward elimination with an adaptive regularization method. This makes them particularly suitable to the classification of microarray expression data, where the goal is to obtain accurate rules depending on few genes only. Our algorithms are fast and easy to implement, since they center on an incremental (large margin) algorithm which allows us to avoid linear, quadratic or higher-order programming methods. We report on preliminary experiments with five known DNA microarray datasets. These experiments suggest that multiplicative large margin algorithms tend to outperform additive algorithms (such as SVM) on feature selection tasks. 1

6 0.11792072 73 nips-2003-Feature Selection in Clustering Problems

7 0.11757049 124 nips-2003-Max-Margin Markov Networks

8 0.11720975 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

9 0.11123611 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

10 0.10928929 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data

11 0.10549048 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

12 0.10127951 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

13 0.098366171 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

14 0.097732931 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots

15 0.088522792 112 nips-2003-Learning to Find Pre-Images

16 0.085511275 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

17 0.079384036 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

18 0.077860087 126 nips-2003-Measure Based Regularization

19 0.07774093 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification

20 0.076490305 181 nips-2003-Statistical Debugging of Sampled Programs


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.249), (1, -0.139), (2, -0.044), (3, -0.203), (4, 0.088), (5, -0.003), (6, -0.122), (7, -0.095), (8, 0.009), (9, 0.117), (10, 0.091), (11, 0.086), (12, -0.119), (13, 0.033), (14, -0.065), (15, 0.23), (16, 0.018), (17, -0.013), (18, 0.064), (19, 0.001), (20, -0.048), (21, -0.01), (22, 0.069), (23, 0.174), (24, 0.085), (25, -0.01), (26, -0.057), (27, 0.008), (28, -0.087), (29, 0.095), (30, -0.094), (31, 0.073), (32, -0.052), (33, -0.047), (34, 0.09), (35, 0.021), (36, -0.084), (37, 0.037), (38, -0.055), (39, -0.031), (40, 0.002), (41, 0.052), (42, 0.137), (43, -0.064), (44, -0.024), (45, 0.011), (46, 0.028), (47, -0.038), (48, -0.089), (49, 0.074)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96517634 1 nips-2003-1-norm Support Vector Machines

Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie

Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitating adaptive selection of the tuning parameter for the 1-norm SVM.

2 0.72138339 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines.

3 0.67002702 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

Author: Claudio Gentile

Abstract: New feature selection algorithms for linear threshold functions are described which combine backward elimination with an adaptive regularization method. This makes them particularly suitable to the classification of microarray expression data, where the goal is to obtain accurate rules depending on few genes only. Our algorithms are fast and easy to implement, since they center on an incremental (large margin) algorithm which allows us to avoid linear, quadratic or higher-order programming methods. We report on preliminary experiments with five known DNA microarray datasets. These experiments suggest that multiplicative large margin algorithms tend to outperform additive algorithms (such as SVM) on feature selection tasks. 1

4 0.61628842 124 nips-2003-Max-Margin Markov Networks

Author: Ben Taskar, Carlos Guestrin, Daphne Koller

Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1

5 0.57893002 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification

Author: Thomas R. Strohmann, Andrei Belitski, Gregory Z. Grudic, Dennis DeCoste

Abstract: The Minimax Probability Machine Classification (MPMC) framework [Lanckriet et al., 2002] builds classifiers by minimizing the maximum probability of misclassification, and gives direct estimates of the probabilistic accuracy bound Ω. The only assumptions that MPMC makes is that good estimates of means and covariance matrices of the classes exist. However, as with Support Vector Machines, MPMC is computationally expensive and requires extensive cross validation experiments to choose kernels and kernel parameters that give good performance. In this paper we address the computational cost of MPMC by proposing an algorithm that constructs nonlinear sparse MPMC (SMPMC) models by incrementally adding basis functions (i.e. kernels) one at a time – greedily selecting the next one that maximizes the accuracy bound Ω. SMPMC automatically chooses both kernel parameters and feature weights without using computationally expensive cross validation. Therefore the SMPMC algorithm simultaneously addresses the problem of kernel selection and feature selection (i.e. feature weighting), based solely on maximizing the accuracy bound Ω. Experimental results indicate that we can obtain reliable bounds Ω, as well as test set accuracies that are comparable to state of the art classification algorithms.

6 0.5260734 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots

7 0.49558944 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

8 0.49313781 86 nips-2003-ICA-based Clustering of Genes from Microarray Expression Data

9 0.49145916 181 nips-2003-Statistical Debugging of Sampled Programs

10 0.48758221 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

11 0.48361549 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

12 0.47544762 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

13 0.47407293 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

14 0.46526116 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

15 0.43418947 79 nips-2003-Gene Expression Clustering with Functional Mixture Models

16 0.41966867 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning

17 0.4160358 113 nips-2003-Learning with Local and Global Consistency

18 0.38276774 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

19 0.37744436 126 nips-2003-Measure Based Regularization

20 0.3748154 186 nips-2003-Towards Social Robots: Automatic Evaluation of Human-Robot Interaction by Facial Expression Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (11, 0.012), (35, 0.583), (53, 0.066), (71, 0.041), (76, 0.042), (85, 0.063), (91, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.97572654 32 nips-2003-Approximate Expectation Maximization

Author: Tom Heskes, Onno Zoeter, Wim Wiegerinck

Abstract: We discuss the integration of the expectation-maximization (EM) algorithm for maximum likelihood learning of Bayesian networks with belief propagation algorithms for approximate inference. Specifically we propose to combine the outer-loop step of convergent belief propagation algorithms with the M-step of the EM algorithm. This then yields an approximate EM algorithm that is essentially still double loop, with the important advantage of an inner loop that is guaranteed to converge. Simulations illustrate the merits of such an approach. 1

2 0.95796746 70 nips-2003-Fast Algorithms for Large-State-Space HMMs with Applications to Web Usage Analysis

Author: Pedro F. Felzenszwalb, Daniel P. Huttenlocher, Jon M. Kleinberg

Abstract: In applying Hidden Markov Models to the analysis of massive data streams, it is often necessary to use an artificially reduced set of states; this is due in large part to the fact that the basic HMM estimation algorithms have a quadratic dependence on the size of the state set. We present algorithms that reduce this computational bottleneck to linear or near-linear time, when the states can be embedded in an underlying grid of parameters. This type of state representation arises in many domains; in particular, we show an application to traffic analysis at a high-volume Web site. 1

same-paper 3 0.95200145 1 nips-2003-1-norm Support Vector Machines

Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie

Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitating adaptive selection of the tuning parameter for the 1-norm SVM.

4 0.68668115 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models

Author: Radford M. Neal, Matthew J. Beal, Sam T. Roweis

Abstract: We describe a Markov chain method for sampling from the distribution of the hidden state sequence in a non-linear dynamical system, given a sequence of observations. This method updates all states in the sequence simultaneously using an embedded Hidden Markov Model (HMM). An update begins with the creation of “pools” of candidate states at each time. We then define an embedded HMM whose states are indexes within these pools. Using a forward-backward dynamic programming algorithm, we can efficiently choose a state sequence with the appropriate probabilities from the exponentially large number of state sequences that pass through states in these pools. We illustrate the method in a simple one-dimensional example, and in an example showing how an embedded HMM can be used to in effect discretize the state space without any discretization error. We also compare the embedded HMM to a particle smoother on a more substantial problem of inferring human motion from 2D traces of markers. 1

5 0.61780393 146 nips-2003-Online Learning of Non-stationary Sequences

Author: Claire Monteleoni, Tommi S. Jaakkola

Abstract: We consider an online learning scenario in which the learner can make predictions on the basis of a fixed set of experts. We derive upper and lower relative loss bounds for a class of universal learning algorithms involving a switching dynamics over the choice of the experts. On the basis of the performance bounds we provide the optimal a priori discretization for learning the parameter that governs the switching dynamics. We demonstrate the new algorithm in the context of wireless networks.

6 0.61357582 122 nips-2003-Margin Maximizing Loss Functions

7 0.58787304 123 nips-2003-Markov Models for Automated ECG Interval Analysis

8 0.57471097 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian

9 0.55283821 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms

10 0.55273139 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

11 0.53936923 100 nips-2003-Laplace Propagation

12 0.53384215 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors

13 0.53019291 169 nips-2003-Sample Propagation

14 0.52762401 158 nips-2003-Policy Search by Dynamic Programming

15 0.52661264 117 nips-2003-Linear Response for Approximate Inference

16 0.52391499 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence

17 0.51516366 189 nips-2003-Tree-structured Approximations by Expectation Propagation

18 0.51253444 131 nips-2003-Modeling User Rating Profiles For Collaborative Filtering

19 0.50481629 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes

20 0.50294924 75 nips-2003-From Algorithmic to Subjective Randomness