nips nips2008 nips2008-196 knowledge-graph by maker-knowledge-mining

196 nips-2008-Relative Margin Machines


Source: pdf

Author: Tony Jebara, Pannagadatta K. Shivaswamy

Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 In classification problems, Support Vector Machines maximize the margin of separation between two classes. [sent-3, score-0.254]

2 While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. [sent-4, score-0.499]

3 This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. [sent-5, score-0.686]

4 The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. [sent-6, score-0.185]

5 For example, in support vector machines [10] (SVMs), a hyperplane of the form w⊤ x + b = 0, w ∈ Rm , x ∈ Rm , b ∈ R is recovered as a decision boundary after observing a limited number of training examples. [sent-8, score-0.209]

6 The parameters of the hyperplane (w, b) are estimated by maximizing the margin (the distance between w⊤ x + b = 1 and w⊤ x + b = −1) while minimizing a weighted upper bound on the misclassification rate on the training data (the so-called slack variables). [sent-9, score-0.452]

7 In practice, the margin is maximized by minimizing ½ w⊤ w. [sent-10, score-0.254]
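
To make the familiar algebra explicit (a standard derivation, not part of the extracted text): the distance between the two bounding hyperplanes is 2/‖w‖, which is why maximizing the margin reduces to minimizing ½ w⊤ w.

```latex
% Distance between the parallel hyperplanes w^T x + b = 1 and w^T x + b = -1:
\mathrm{dist} \;=\; \frac{|1 - (-1)|}{\|w\|} \;=\; \frac{2}{\|w\|},
\qquad\text{so}\qquad
\max_{w}\ \frac{2}{\|w\|}
\;\Longleftrightarrow\;
\min_{w}\ \tfrac{1}{2}\, w^\top w .
```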

8 On one hand, an adversary can exploit this shortcoming and transform the data so as to degrade performance. [sent-12, score-0.101]

9 More distressingly, this shortcoming can naturally lead to poor performance, especially in high-dimensional settings. [sent-13, score-0.096]

10 The key problem is that SVMs simply find a large margin solution, paying no attention to the spread of the data. [sent-14, score-0.537]

11 An excellent discriminator lying in a dimension with relatively small data spread may be easily overlooked by the SVM solution. [sent-15, score-0.216]

12 The crux here is to find the maximum margin solution with respect to the spread of the data in a relative sense rather than finding the absolute large margin solution. [sent-17, score-0.869]

13 Ellipsoidal kernel machines [9] normalize data in feature space by estimating bounding ellipsoids. [sent-21, score-0.111]

14 While these previous methods showed performance improvements, both relied on multiple-step locally optimal algorithms for interleaving spread information with margin estimation. [sent-22, score-0.47]

15 In addition, the formulations derived in this paper are convex, can be efficiently solved and admit some useful generalization bounds. [sent-26, score-0.088]

16 Bottom: The projections of the examples in the top row on the real line for the SVM solution (red or dark shade) and the proposed classifier (green or light shade) in each case. [sent-32, score-0.314]

17 Consider the binary classification example shown in the top row of Figure 1 where squares denote examples from one class and triangles denote examples from the other class. [sent-34, score-0.138]

18 One possible decision boundary separating the two classes is shown in green (or light shade). [sent-36, score-0.233]

19 The solution shown in red (or dark shade) is the SVM estimate; it achieves the largest margin possible while still separating both the classes. [sent-37, score-0.43]

20 With progressive scaling, the SVM increasingly deviates from the green solution, clearly indicating that the SVM decision boundary is sensitive to affine transformations of the data and produces a family of different solutions as a result. [sent-40, score-0.278]
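
The sketch below is a minimal illustration of this sensitivity; it is not the authors' experiment from Figure 1, and the synthetic data, the scikit-learn solver, and all variable names are assumptions. It fits a linear SVM to data whose classes are cleanly separated along a small-spread feature while a large-spread feature carries only a weak, noisy signal, and then stretches the large-spread axis.

```python
# Illustrative sketch (not the authors' code): how a linear SVM's learned
# direction drifts when the large-spread axis of the data is stretched.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 400
y = np.where(rng.random(n) < 0.5, 1, -1)
# Feature 1: large spread, weak and noisy class signal (heavily overlapping clusters).
# Feature 2: small spread, clean class signal (separates the classes).
x1 = y * 1.0 + rng.normal(0.0, 5.0, n)
x2 = y * 0.5 + rng.normal(0.0, 0.1, n)
X = np.column_stack([x1, x2])

for scale in (1, 10, 100):
    Xs = X * np.array([scale, 1.0])            # progressively stretch the first axis
    clf = SVC(kernel="linear", C=1.0).fit(Xs, y)
    # Map the learned weights back to the original coordinates.
    w_eff = clf.coef_.ravel() * np.array([scale, 1.0])
    w_eff /= np.linalg.norm(w_eff)
    print(f"scale={scale:4d}  effective direction (original coords) = {w_eff}")
# As the first axis is stretched, the effective decision direction typically leans
# more and more on the noisy large-spread feature, mirroring the Figure 1 discussion.
```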

21 This sensitivity to scaling and affine transformations is worrisome. [sent-41, score-0.121]

22 If there is a best and a worst solution in the family of SVM estimates, there is always the possibility that an adversary exploits this scaling such that the SVM solution we recover is poor. [sent-42, score-0.22]

23 Meanwhile, an algorithm producing the green decision boundary remains resilient to such adversarial scalings. [sent-43, score-0.199]

24 In the previous example, a direction with a small spread in the data produced a good discriminator. [sent-44, score-0.216]

25 Merely finding a large margin solution, on the other hand, does not recover the best possible discriminator. [sent-45, score-0.254]

26 This particular weakness in large margin estimation has only received limited attention in previous work. [sent-46, score-0.254]

27 In this case, the green decision boundary should obtain zero test error even if it is estimated from a finite number of samples. [sent-48, score-0.168]

28 However, for finite training data, the SVM solution will make errors and will do so increasingly as the data is scaled along the x-axis. [sent-49, score-0.117]

29 Similarly, simple preprocessing of the data (affine “whitening” to make the dataset zero mean and unit covariance, or scaling to place the data into a zero-one box) may fail to resolve such problems. [sent-51, score-0.1]

30 For more insight, consider the uni-dimensional projections of the data given by the green and red solutions in the bottom row of Figure 1. [sent-52, score-0.251]

31 Meanwhile, the red solution produces more dispersed projections of the two classes. [sent-54, score-0.209]

32 As the adversarial scaling is increased, the spread of the projection in the SVM solution increases correspondingly. [sent-55, score-0.414]

33 Large margins are not sufficient on their own; what is needed is a way to also control the spread of the data after projection. [sent-56, score-0.216]

34 Therefore, rather than just maximizing the margin, a trade-off regularizer should also be used to minimize the spread of the projected data. [sent-57, score-0.245]

35 In other words, we will couple large margin estimation with regularization which seeks to bound the spread |w⊤ x + b| of the data. [sent-58, score-0.513]

36 This will allow the linear classifier to recover large margin solutions not in the absolute sense but rather relative to the spread of the data in that projection direction. [sent-59, score-0.626]

37 3 Formulations. Given $(x_i, y_i)_{i=1}^n$ where $x_i \in \mathbb{R}^m$ and $y_i \in \{\pm 1\}$, drawn independently and identically distributed from a distribution Pr(x, y), the Support Vector Machine primal formulation is: $\min_{w,b,\xi\ge 0}\ \tfrac{1}{2}\|w\|^2 + C\xi^\top\mathbf{1}$ s.t. $y_i(w^\top x_i + b) \ge 1 - \xi_i$. (1) [sent-60, score-0.4]

38 The above formulation (1) minimizes an upper bound on the misclassification rate while maximizing the margin (the two quantities are traded off by C). [sent-63, score-0.427]
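
As a concrete reference point, here is a minimal sketch of formulation (1) as a convex program. Using cvxpy is purely an illustrative assumption (in practice the paper works with the dual (2) via specialized QP solvers), and the function name `svm_primal` is hypothetical.

```python
import cvxpy as cp
import numpy as np

def svm_primal(X, y, C=1.0):
    """Sketch of (1): min 0.5*||w||^2 + C*1'xi  s.t.  y_i(w'x_i + b) >= 1 - xi_i, xi >= 0."""
    n, m = X.shape
    w = cp.Variable(m)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]   # margin constraints
    objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```

Calling `w, b = svm_primal(X, y, C=1.0)` on a small labeled dataset (labels in {±1}) recovers the usual soft-margin separator.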

39 In practice, the following dual of the formulation (1) is solved: $\max_{0\le\alpha\le C\mathbf{1}}\ \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$ s.t. $\sum_{i=1}^n \alpha_i y_i = 0$. (2) [sent-64, score-0.31]

40 It is easy to see that the above formulation (2) is rotation invariant; if all the xi are replaced by Axi where A ∈ Rm×m , A⊤ A = I, then the solution remains the same. [sent-67, score-0.329]

41 Typically, the dot product between the examples is replaced by a kernel function k : Rm × Rm → R such that k(xi , xj ) = φ(xi )⊤ φ(xj ), where φ : Rm → H is a mapping to a Hilbert space to obtain non-linear decision boundaries in the input space. [sent-70, score-0.269]

42 Thus, in (2), xi⊤ xj is replaced by k(xi , xj ) to obtain non-linear solutions. [sent-71, score-0.147]
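
For completeness, the kernelized dual (2) can be sketched the same way. Again this is an illustrative cvxpy formulation rather than the paper's implementation; `svm_dual_kernel`, the ridge term, and the RBF example in the comment are assumptions.

```python
import cvxpy as cp
import numpy as np

def svm_dual_kernel(K, y, C=1.0):
    """Sketch of the kernelized dual (2):
    max 1'a - 0.5 * sum_ij a_i a_j y_i y_j K_ij  s.t.  0 <= a <= C, y'a = 0."""
    n = K.shape[0]
    a = cp.Variable(n)
    # (yy')∘K is PSD when K is PSD (Schur product theorem); the tiny ridge keeps it
    # numerically PSD for quad_form.
    P = np.outer(y, y) * K + 1e-8 * np.eye(n)
    objective = cp.Maximize(cp.sum(a) - 0.5 * cp.quad_form(a, P))
    constraints = [a >= 0, a <= C, a @ y == 0]
    cp.Problem(objective, constraints).solve()
    return a.value

# Example kernel matrix (illustrative choice, not prescribed by the paper):
# K = np.exp(-gamma * np.square(X[:, None] - X[None]).sum(-1))
```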

43 Next, we consider the formulation which corresponds to whitening the data with the covariance matrix. [sent-73, score-0.158]

44 Denote by $\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top - \frac{1}{n^2}\sum_{i=1}^n x_i \sum_{j=1}^n x_j^\top$ and $\mu = \frac{1}{n}\sum_{i=1}^n x_i$ the sample covariance and mean, respectively. [sent-74, score-0.39]

45 Consider the following formulation, which we call Σ-SVM: $\min_{w,b,\xi\ge 0}\ \frac{1-D}{2}\|w\|^2 + \frac{D}{2}\|\Sigma^{\frac{1}{2}} w\|^2 + C\xi^\top\mathbf{1}$ s.t. [sent-75, score-0.101]

46 yi (w⊤ (xi − µ) + b) ≥ 1 − ξi , (3) where 0 ≤ D ≤ 1 is an additional parameter that trades off between the two regularization terms. [sent-77, score-0.102]
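
A sketch of the Σ-SVM primal (3) follows directly, since ‖Σ½ w‖² = w⊤Σw. As before, cvxpy, the ridge term for numerical stability, and the function name `sigma_svm_primal` are assumptions rather than the authors' code.

```python
import cvxpy as cp
import numpy as np

def sigma_svm_primal(X, y, C=1.0, D=0.5):
    """Sketch of (3): min (1-D)/2*||w||^2 + D/2 * w'Sigma w + C*1'xi
    s.t.  y_i(w'(x_i - mu) + b) >= 1 - xi_i, xi >= 0."""
    n, m = X.shape
    mu = X.mean(axis=0)
    Sigma = (X - mu).T @ (X - mu) / n              # sample covariance, as defined above
    w, b = cp.Variable(m), cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    reg = (0.5 * (1 - D) * cp.sum_squares(w)
           + 0.5 * D * cp.quad_form(w, Sigma + 1e-10 * np.eye(m)))
    constraints = [cp.multiply(y, (X - mu) @ w + b) >= 1 - xi]
    cp.Problem(cp.Minimize(reg + C * cp.sum(xi)), constraints).solve()
    return w.value, b.value
```

With D = 0 this reduces to the plain SVM primal; D close to 1 pushes the solution toward the whitened, affine-invariant regime described next.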

47 The dual of (3) can be shown to be: $\max_{0\le\alpha\le C\mathbf{1},\ y^\top\alpha=0}\ \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\sum_{i=1}^n \alpha_i y_i (x_i - \mu)^\top \bigl((1-D)I + D\Sigma\bigr)^{-1} \sum_{j=1}^n \alpha_j y_j (x_j - \mu)$. (4) [sent-78, score-0.151]

48 It is easy to see that the above formulation (4) is translation invariant and tends to an affine invariant solution when D tends to one. [sent-80, score-0.25]

49 3.1 RMM and its geometrical interpretation. From Section 2, it is clear that a large margin in the absolute sense might be deceptive and could merely be a byproduct of bad scaling of the data. [sent-85, score-0.445]

50 To overcome this limitation, as we pointed out earlier, we need to bound the projections of the training examples as well. [sent-86, score-0.315]

51 As in the two dimensional example, it is necessary to trade off between the margin and the spread of the data. [sent-87, score-0.529]

52 We propose a slightly modified formulation in the next section that can be solved efficiently. [sent-88, score-0.152]

53 In addition, writing the dual of the following formulation gives some geometric intuition. [sent-90, score-0.184]

54 Since we trade off between the projections and the margin, we implicitly find a large relative margin. [sent-91, score-0.175]

55 Thus we call the following formulation the Relative Margin Machine (RMM): $\min_{w,b,\xi\ge 0}\ \tfrac{1}{2}\|w\|^2 + C\xi^\top\mathbf{1}$ s.t. [sent-92, score-0.101]

56 $y_i(w^\top x_i + b) \ge 1 - \xi_i$, $\tfrac{1}{2}(w^\top x_i + b)^2 \le \tfrac{B^2}{2}$. (5) [sent-94, score-0.328]

57 This formulation has one extra parameter B in addition to the SVM parameter. [sent-96, score-0.101]

58 Note that B ≥ 1 since having a B less than one would mean none of the examples would satisfy yi (w⊤ xi + b) ≥ 1. [sent-97, score-0.267]

59 Let wC and bC be the solution obtained by solving the SVM (1) for a particular value of C; then any B > maxi |wC⊤ xi + bC| makes the second constraint in formulation (5) inactive for each i, and the solution obtained is the same as the SVM estimate. [sent-98, score-0.326]

60 Specifically, with a smaller B, we still find a large margin solution such that all the projections of the training examples are bounded by B. [sent-100, score-0.549]

61 Thus by trying out different B values, we explore different large margin solutions with respect to the projection and spread of the data. [sent-101, score-0.548]

62 The Lagrangian of (5) is given by: $\tfrac{1}{2}\|w\|^2 + C\xi^\top\mathbf{1} - \sum_{i=1}^n \alpha_i\bigl(y_i(w^\top x_i + b) - 1 + \xi_i\bigr) - \beta^\top\xi + \sum_{i=1}^n \tfrac{\lambda_i}{2}\bigl((w^\top x_i + b)^2 - B^2\bigr)$, where α, β, λ ≥ 0 are the Lagrange multipliers corresponding to the constraints. [sent-103, score-0.328]

63 Eliminating the primal variables gives the dual (6): $\max\ \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\sum_{i=1}^n \alpha_i y_i (x_i - \mu_\lambda)^\top (I + \Sigma_\lambda)^{-1} \sum_{j=1}^n \alpha_j y_j (x_j - \mu_\lambda) - \tfrac{1}{2} B^2 \lambda^\top\mathbf{1}$, where $\Sigma_\lambda = \sum_{j=1}^n \lambda_j x_j x_j^\top - \frac{1}{\lambda^\top\mathbf{1}}\sum_{i=1}^n \lambda_i x_i \sum_{j=1}^n \lambda_j x_j^\top$ and $\mu_\lambda = \frac{1}{\lambda^\top\mathbf{1}}\sum_{j=1}^n \lambda_j x_j$. Note that the above formulation is translation invariant since µλ is subtracted from each xi. [sent-105, score-0.426]

64 Σλ corresponds to a “shape matrix” (potentially low rank) determined by the xi 's that have non-zero λi. [sent-106, score-0.13]

65 From the KKT conditions of (5), $\lambda_i\bigl(\tfrac{1}{2}(w^\top x_i + b)^2 - \tfrac{B^2}{2}\bigr) = 0$. [sent-107, score-0.13]

66 Consequently, $\lambda_i > 0$ implies $\tfrac{1}{2}(w^\top x_i + b)^2 - \tfrac{B^2}{2} = 0$. [sent-108, score-0.13]
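
Spelling out this complementary-slackness step (standard KKT reasoning applied to constraint (5), not additional text from the paper):

```latex
\lambda_i \Bigl( \tfrac{1}{2}(w^\top x_i + b)^2 - \tfrac{B^2}{2} \Bigr) = 0
\quad\Longrightarrow\quad
\lambda_i > 0 \;\Rightarrow\; (w^\top x_i + b)^2 = B^2
\;\Longleftrightarrow\; |w^\top x_i + b| = B .
```

Hence only examples whose projections lie exactly on the bounding planes w⊤ x + b = ±B acquire non-zero λi and enter µλ and Σλ.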

67 Geometrically, in the above formulation (6), the data is whitened with the matrix (I + Σλ) while solving the SVM. [sent-109, score-0.101]

68 While this is similar to what is done by the Σ-SVM, the matrix (I + Σλ) is determined jointly by considering both the margin and the spread of the data. [sent-110, score-0.254]

69 In contrast, in Σ-SVM, whitening is simply a preprocessing step which can be done independently of the margin. [sent-111, score-0.107]

70 Note that the constraint $\tfrac{1}{2}(w^\top x_i + b)^2 \le \tfrac{1}{2}B^2$ can be relaxed with slack variables at the expense of one additional parameter; however, this is not investigated in this paper. [sent-112, score-0.173]

71 The proposed formulation is of limited use unless it can be solved efficiently. [sent-113, score-0.152]

72 Note that the constraint $\tfrac{1}{2}(w^\top x_i + b)^2 \le \tfrac{1}{2}B^2$ can be equivalently posed as two linear constraints: $w^\top x_i + b \le B$ and $-(w^\top x_i + b) \le B$. [sent-116, score-0.433]

73 With these constraints replacing the quadratic constraint, we have a quadratic program to solve. [sent-117, score-0.139]

74 In the primal, we have 4n constraints (including ξ ≥ 0 ) instead of the 2n constraints in the SVM. [sent-118, score-0.086]

75 3.2 Fast algorithm. The main idea of the fast algorithm is to use linear constraints bounding the projections rather than quadratic constraints. [sent-122, score-0.239]

76 yi (w⊤ xi + b) ≥ 1 − ξi , w⊤ xi + b ≤ B, − w⊤ xi − b ≤ B. [sent-126, score-0.458]
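
The formulation with these linear constraints is again a plain QP; a minimal sketch is given below (cvxpy, the hypothetical name `rmm_primal`, and the default B value are assumptions, and this is not the specialized decomposition solver the paper actually uses).

```python
import cvxpy as cp
import numpy as np

def rmm_primal(X, y, C=1.0, B=1.5):
    """Sketch of the RMM with linear constraints:
    min 0.5*||w||^2 + C*1'xi
    s.t.  y_i(w'x_i + b) >= 1 - xi_i,  -B <= w'x_i + b <= B,  xi >= 0."""
    n, m = X.shape
    w, b = cp.Variable(m), cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    proj = X @ w + b                               # projections of the training examples
    constraints = [cp.multiply(y, proj) >= 1 - xi, proj <= B, -proj <= B]
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)),
               constraints).solve()
    return w.value, b.value

# Sweeping B from large (which recovers the plain SVM) down toward 1 explores the
# family of large relative margin solutions discussed above.
```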

77 α⊤ yq − λ⊤ 1 + λ∗⊤ 1 = −α⊤ yq + λ⊤ 1 − λ∗⊤ 1, 0 ≤ αq ≤ C1, λr , λ∗ ≥ 0. [sent-135, score-0.226]

78 The algorithm solves a small sub-problem like (9) at each step until the KKT conditions of formulation (8) are satisfied to a given tolerance. [sent-137, score-0.101]

79 4 Experiments. Experiments were carried out on three sets of digits: optical digits from the UCI machine learning repository [1], USPS digits [6], and MNIST digits [7]. [sent-142, score-0.707]

80 These datasets have different numbers of features (64 in optical digits, 256 in USPS and 784 in MNIST) and training examples (3823 in optical digits, 7291 in USPS and 60000 in MNIST). [sent-143, score-0.262]

81 A final classifier was trained for each of the 45 classification problems with the best parameters found from cross validation using all the training examples in those classes. [sent-150, score-0.15]

82 The larger experiment on MNIST consisted of training with two-thirds of the digits (note that this amounts to training with 8000 examples on average for each pair of digits) for each binary classification task. [sent-154, score-0.332]

83 Once we had 45 classifiers, one for each pair of digits, testing was done on the separate test set available in each of these three datasets (1797 examples in the case of optical digits, 2007 examples in USPS and 10000 examples in MNIST). [sent-157, score-0.295]

84 Training with the entire MNIST dataset. We used the best parameters found by cross-validation in the previous experiments on MNIST and trained 45 classifiers for both SVM and RMM with all the training examples for each class in MNIST, for various kernels. [sent-164, score-0.119]

85 Figure 2: Log run time versus log number of examples, from 1000 to 10000 in steps of 1000. [sent-179, score-0.108]

86 Run time comparison. We studied the empirical run times using MNIST digits 3 vs. 8 and a polynomial kernel of degree two. [sent-181, score-0.235]

87 The number of training examples was increased in steps of 1000 and the training time was noted. [sent-185, score-0.169]

88 We show a log-log plot comparing the number of examples to the run time in Figure 2. [sent-191, score-0.108]

89 However, in many cases, warm-starting the RMM with the previous solution significantly helped reduce the run times. [sent-193, score-0.106]

90 5 Conclusions. We identified a sensitivity of Support Vector Machines and maximum absolute margin criteria to affine scalings. [sent-194, score-0.327]

91 The Relative Margin Machine was proposed to overcome such a problem and optimizes the projection direction such that the margin is large only relative to the spread of the data. [sent-196, score-0.602]

92 By deriving the dual with quadratic constraints, a geometrical interpretation was also formulated for RMMs. [sent-197, score-0.134]

93 Since RMMs require only additional linear constraints in the SVM quadratic program, they admit a competitively fast implementation. [sent-198, score-0.091]

94 The maximization of relative margin is fairly promising as it is compatible with other popular problems handled by the SVM framework, such as ordinal regression and structured prediction. [sent-200, score-0.292]

95 Furthermore, the constraints that bound the projection are unsupervised; thus RMMs can readily work in semi-supervised and transduction problems. [sent-202, score-0.136]

96 Maximizing the margin can be seen as choosing a function $f(x) = w^\top x$ from a bounded class of functions $\mathcal{F}_E := \{x \mapsto w^\top x \mid \tfrac{1}{2}\|w\|^2 \le E\}$. [sent-286, score-0.254]

97 For a technical reason, instead of bounding the projections on the training examples as in (5), we consider bounding the projections on an independent set of examples drawn from Pr(x), that is, a set $U = \{u_1, u_2, \ldots, u_{n_u}\}$. [sent-287, score-0.425]

98 Note that if we have an iid training set, it can be split into two parts and one part can be used exclusively to bound the projections and the other part can be used exclusively for classification constraints. [sent-291, score-0.27]

99 Similarly, consider $\mathcal{G}_{E,D} := \{x \mapsto w^\top x \mid \tfrac{1}{2} w^\top w + \frac{D}{2 n_u}\sum_{i=1}^{n_u} (w^\top u_i)^2 \le E\}$, which is closely related to the class of functions considered by the Σ-SVM. [sent-293, score-0.136]

100 Then, with probability at least 1 − δ over samples of size n, the following bound holds: $\Pr_D[y \ne \operatorname{sign}(g(x))] \le \xi^\top\mathbf{1}/n + 2\hat{R}(\mathcal{F})/\gamma + 3\sqrt{\ln(2/\delta)/(2n)}$, where $\xi_i = \max(0, 1 - y_i g(x_i))$ are the so-called slack variables. [sent-302, score-0.154]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rmm', 0.63), ('margin', 0.254), ('svm', 0.22), ('spread', 0.216), ('mnist', 0.194), ('digits', 0.163), ('erent', 0.141), ('shade', 0.132), ('xi', 0.13), ('yq', 0.113), ('projections', 0.109), ('formulation', 0.101), ('klda', 0.101), ('di', 0.092), ('nu', 0.088), ('usps', 0.085), ('svms', 0.084), ('rm', 0.082), ('green', 0.081), ('svmlight', 0.076), ('rademacher', 0.071), ('examples', 0.069), ('yi', 0.068), ('solution', 0.067), ('ks', 0.066), ('xj', 0.058), ('whitening', 0.057), ('dual', 0.055), ('optical', 0.055), ('misclassi', 0.053), ('solved', 0.051), ('boundary', 0.051), ('kqq', 0.05), ('krq', 0.05), ('plugged', 0.05), ('prepossessing', 0.05), ('rmms', 0.05), ('ufe', 0.05), ('training', 0.05), ('scaling', 0.05), ('projection', 0.05), ('ui', 0.048), ('quadratic', 0.048), ('classi', 0.045), ('fe', 0.044), ('ellipsoidal', 0.044), ('deviates', 0.044), ('overcome', 0.044), ('constraints', 0.043), ('slack', 0.043), ('bound', 0.043), ('dot', 0.042), ('invariant', 0.041), ('dark', 0.04), ('kq', 0.04), ('wc', 0.04), ('shivaswamy', 0.04), ('absolute', 0.04), ('bounding', 0.039), ('run', 0.039), ('machines', 0.039), ('relative', 0.038), ('kr', 0.038), ('merely', 0.038), ('transformations', 0.038), ('formulations', 0.037), ('decision', 0.036), ('adversary', 0.036), ('separating', 0.036), ('trades', 0.034), ('meanwhile', 0.034), ('qp', 0.034), ('exclusively', 0.034), ('kernels', 0.034), ('red', 0.033), ('primal', 0.033), ('datasets', 0.033), ('kernel', 0.033), ('sensitivity', 0.033), ('shortcoming', 0.033), ('hyperplane', 0.033), ('complexities', 0.033), ('bad', 0.032), ('adversarial', 0.031), ('kkt', 0.031), ('geometrical', 0.031), ('replaced', 0.031), ('validation', 0.031), ('dimensional', 0.031), ('lecun', 0.03), ('kernelized', 0.03), ('free', 0.03), ('ers', 0.03), ('light', 0.029), ('maximizing', 0.029), ('bengio', 0.028), ('trade', 0.028), ('writing', 0.028), ('solutions', 0.028), ('yj', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 196 nips-2008-Relative Margin Machines

Author: Tony Jebara, Pannagadatta K. Shivaswamy

Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. 1

2 0.13225378 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

Author: Sham M. Kakade, Karthik Sridharan, Ambuj Tewari

Abstract: This work characterizes the generalization ability of algorithms whose predictions are linear in the input vector. To this end, we provide sharp bounds for Rademacher and Gaussian complexities of (constrained) linear classes, which directly lead to a number of generalization bounds. This derivation provides simplified proofs of a number of corollaries including: risk bounds for linear prediction (including settings where the weight vectors are constrained by either L2 or L1 constraints), margin bounds (including both L2 and L1 margins, along with more general notions based on relative entropy), a proof of the PAC-Bayes theorem, and upper bounds on L2 covering numbers (with Lp norm constraints and relative entropy constraints). In addition to providing a unified analysis, the results herein provide some of the sharpest risk and margin bounds. Interestingly, our results show that the uniform convergence rates of empirical risk minimization algorithms tightly match the regret bounds of online learning algorithms for linear prediction, up to a constant factor of 2. 1

3 0.12939689 78 nips-2008-Exact Convex Confidence-Weighted Learning

Author: Koby Crammer, Mark Dredze, Fernando Pereira

Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distributions over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1

4 0.11215384 62 nips-2008-Differentiable Sparse Coding

Author: J. A. Bagnell, David M. Bradley

Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1

5 0.10433441 226 nips-2008-Supervised Dictionary Learning

Author: Julien Mairal, Jean Ponce, Guillermo Sapiro, Andrew Zisserman, Francis R. Bach

Abstract: It is now well established that sparse signal models are well suited for restoration tasks and can be effectively learned from audio, image, and video data. Recent research has been aimed at learning discriminative sparse models instead of purely reconstructive ones. This paper proposes a new step in that direction, with a novel sparse representation for signals belonging to different classes in terms of a shared dictionary and discriminative class models. The linear version of the proposed model admits a simple probabilistic interpretation, while its most general variant admits an interpretation in terms of kernels. An optimization framework for learning all the components of the proposed model is presented, along with experimental results on standard handwritten digit and texture classification tasks. 1

6 0.10273271 228 nips-2008-Support Vector Machines with a Reject Option

7 0.09425234 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition

8 0.088905632 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization

9 0.08780054 189 nips-2008-Rademacher Complexity Bounds for Non-I.I.D. Processes

10 0.080223218 239 nips-2008-Tighter Bounds for Structured Estimation

11 0.077787481 113 nips-2008-Kernelized Sorting

12 0.076847568 178 nips-2008-Performance analysis for L\ 2 kernel classification

13 0.071440749 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

14 0.070717424 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions

15 0.069437496 19 nips-2008-An Empirical Analysis of Domain Adaptation Algorithms for Genomic Sequence Analysis

16 0.068102419 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

17 0.067558274 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data

18 0.067142963 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

19 0.066178374 143 nips-2008-Multi-label Multiple Kernel Learning

20 0.063344605 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.213), (1, -0.085), (2, -0.113), (3, 0.044), (4, -0.021), (5, 0.031), (6, 0.012), (7, -0.037), (8, -0.023), (9, 0.032), (10, 0.171), (11, 0.072), (12, 0.006), (13, 0.068), (14, -0.036), (15, 0.02), (16, 0.026), (17, -0.032), (18, 0.011), (19, 0.055), (20, -0.034), (21, -0.031), (22, -0.011), (23, -0.017), (24, 0.012), (25, 0.056), (26, 0.001), (27, -0.071), (28, -0.017), (29, -0.045), (30, -0.069), (31, -0.011), (32, 0.016), (33, -0.03), (34, -0.036), (35, 0.028), (36, -0.068), (37, -0.009), (38, -0.069), (39, -0.053), (40, -0.071), (41, 0.044), (42, 0.182), (43, 0.074), (44, -0.045), (45, 0.029), (46, 0.065), (47, -0.015), (48, 0.063), (49, -0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93787926 196 nips-2008-Relative Margin Machines

Author: Tony Jebara, Pannagadatta K. Shivaswamy

Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. 1

2 0.73281944 78 nips-2008-Exact Convex Confidence-Weighted Learning

Author: Koby Crammer, Mark Dredze, Fernando Pereira

Abstract: Confidence-weighted (CW) learning [6], an online learning method for linear classifiers, maintains a Gaussian distributions over weight vectors, with a covariance matrix that represents uncertainty about weights and correlations. Confidence constraints ensure that a weight vector drawn from the hypothesis distribution correctly classifies examples with a specified probability. Within this framework, we derive a new convex form of the constraint and analyze it in the mistake bound model. Empirical evaluation with both synthetic and text data shows our version of CW learning achieves lower cumulative and out-of-sample errors than commonly used first-order and second-order online methods. 1

3 0.69457299 178 nips-2008-Performance analysis for L\ 2 kernel classification

Author: Jooseuk Kim, Clayton Scott

Abstract: We provide statistical performance guarantees for a recently introduced kernel classifier that optimizes the L2 or integrated squared error (ISE) of a difference of densities. The classifier is similar to a support vector machine (SVM) in that it is the solution of a quadratic program and yields a sparse classifier. Unlike SVMs, however, the L2 kernel classifier does not involve a regularization parameter. We prove a distribution free concentration inequality for a cross-validation based estimate of the ISE, and apply this result to deduce an oracle inequality and consistency of the classifier on the sense of both ISE and probability of error. Our results also specialize to give performance guarantees for an existing method of L2 kernel density estimation. 1

4 0.66007215 185 nips-2008-Privacy-preserving logistic regression

Author: Kamalika Chaudhuri, Claire Monteleoni

Abstract: This paper addresses the important tradeoff between privacy and learnability, when designing algorithms for learning from private databases. We focus on privacy-preserving logistic regression. First we apply an idea of Dwork et al. [6] to design a privacy-preserving logistic regression algorithm. This involves bounding the sensitivity of regularized logistic regression, and perturbing the learned classifier with noise proportional to the sensitivity. We then provide a privacy-preserving regularized logistic regression algorithm based on a new privacy-preserving technique: solving a perturbed optimization problem. We prove that our algorithm preserves privacy in the model due to [6]. We provide learning guarantees for both algorithms, which are tighter for our new algorithm, in cases in which one would typically apply logistic regression. Experiments demonstrate improved learning performance of our method, versus the sensitivity method. Our privacy-preserving technique does not depend on the sensitivity of the function, and extends easily to a class of convex loss functions. Our work also reveals an interesting connection between regularization and privacy. 1

5 0.64737469 228 nips-2008-Support Vector Machines with a Reject Option

Author: Yves Grandvalet, Alain Rakotomamonjy, Joseph Keshet, Stéphane Canu

Abstract: We consider the problem of binary classification where the classifier may abstain instead of classifying each observation. The Bayes decision rule for this setup, known as Chow’s rule, is defined by two thresholds on posterior probabilities. From simple desiderata, namely the consistency and the sparsity of the classifier, we derive the double hinge loss function that focuses on estimating conditional probabilities only in the vicinity of the threshold points of the optimal decision rule. We show that, for suitable kernel machines, our approach is universally consistent. We cast the problem of minimizing the double hinge loss as a quadratic program akin to the standard SVM optimization problem and propose an active set method to solve it efficiently. We finally provide preliminary experimental results illustrating the interest of our constructive approach to devising loss functions. 1

6 0.60376567 226 nips-2008-Supervised Dictionary Learning

7 0.60105246 85 nips-2008-Fast Rates for Regularized Objectives

8 0.58860308 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data

9 0.58695108 239 nips-2008-Tighter Bounds for Structured Estimation

10 0.58256805 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization

11 0.58191919 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition

12 0.57089657 161 nips-2008-On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization

13 0.529719 88 nips-2008-From Online to Batch Learning with Cutoff-Averaging

14 0.52865899 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels

15 0.51856726 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

16 0.51773876 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence

17 0.50407964 62 nips-2008-Differentiable Sparse Coding

18 0.49726284 5 nips-2008-A Transductive Bound for the Voted Classifier with an Application to Semi-supervised Learning

19 0.49326384 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning

20 0.48928174 126 nips-2008-Localized Sliced Inverse Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.085), (7, 0.093), (12, 0.043), (15, 0.013), (28, 0.155), (52, 0.224), (57, 0.054), (59, 0.024), (63, 0.067), (71, 0.04), (77, 0.05), (83, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.84151447 229 nips-2008-Syntactic Topic Models

Author: Jordan L. Boyd-graber, David M. Blei

Abstract: We develop the syntactic topic model (STM), a nonparametric Bayesian model of parsed documents. The STM generates words that are both thematically and syntactically constrained, which combines the semantic insights of topic models with the syntactic information available from parse trees. Each word of a sentence is generated by a distribution that combines document-specific topic weights and parse-tree-specific syntactic transitions. Words are assumed to be generated in an order that respects the parse tree. We derive an approximate posterior inference method based on variational methods for hierarchical Dirichlet processes, and we report qualitative and quantitative results on both synthetic data and hand-parsed documents. 1

same-paper 2 0.80044442 196 nips-2008-Relative Margin Machines

Author: Tony Jebara, Pannagadatta K. Shivaswamy

Abstract: In classification problems, Support Vector Machines maximize the margin of separation between two classes. While the paradigm has been successful, the solution obtained by SVMs is dominated by the directions with large data spread and biased to separate the classes by cutting along large spread directions. This article proposes a novel formulation to overcome such sensitivity and maximizes the margin relative to the spread of the data. The proposed formulation can be efficiently solved and experiments on digit datasets show drastic performance improvements over SVMs. 1

3 0.78348327 185 nips-2008-Privacy-preserving logistic regression

Author: Kamalika Chaudhuri, Claire Monteleoni

Abstract: This paper addresses the important tradeoff between privacy and learnability, when designing algorithms for learning from private databases. We focus on privacy-preserving logistic regression. First we apply an idea of Dwork et al. [6] to design a privacy-preserving logistic regression algorithm. This involves bounding the sensitivity of regularized logistic regression, and perturbing the learned classifier with noise proportional to the sensitivity. We then provide a privacy-preserving regularized logistic regression algorithm based on a new privacy-preserving technique: solving a perturbed optimization problem. We prove that our algorithm preserves privacy in the model due to [6]. We provide learning guarantees for both algorithms, which are tighter for our new algorithm, in cases in which one would typically apply logistic regression. Experiments demonstrate improved learning performance of our method, versus the sensitivity method. Our privacy-preserving technique does not depend on the sensitivity of the function, and extends easily to a class of convex loss functions. Our work also reveals an interesting connection between regularization and privacy. 1

4 0.69953895 62 nips-2008-Differentiable Sparse Coding

Author: J. A. Bagnell, David M. Bradley

Abstract: Prior work has shown that features which appear to be biologically plausible as well as empirically useful can be found by sparse coding with a prior such as a laplacian (L1 ) that promotes sparsity. We show how smoother priors can preserve the benefits of these sparse priors while adding stability to the Maximum A-Posteriori (MAP) estimate that makes it more useful for prediction problems. Additionally, we show how to calculate the derivative of the MAP estimate efficiently with implicit differentiation. One prior that can be differentiated this way is KL-regularization. We demonstrate its effectiveness on a wide variety of applications, and find that online optimization of the parameters of the KL-regularized model can significantly improve prediction performance. 1

5 0.69454056 202 nips-2008-Robust Regression and Lasso

Author: Huan Xu, Constantine Caramanis, Shie Mannor

Abstract: We consider robust least-squares regression with feature-wise disturbance. We show that this formulation leads to tractable convex optimization problems, and we exhibit a particular uncertainty set for which the robust problem is equivalent to 1 regularized regression (Lasso). This provides an interpretation of Lasso from a robust optimization perspective. We generalize this robust formulation to consider more general uncertainty sets, which all lead to tractable convex optimization problems. Therefore, we provide a new methodology for designing regression algorithms, which generalize known formulations. The advantage is that robustness to disturbance is a physical property that can be exploited: in addition to obtaining new formulations, we use it directly to show sparsity properties of Lasso, as well as to prove a general consistency result for robust regression problems, including Lasso, from a unified robustness perspective. 1

6 0.69222826 75 nips-2008-Estimating vector fields using sparse basis field expansions

7 0.69211924 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

8 0.69171327 194 nips-2008-Regularized Learning with Networks of Features

9 0.68986869 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

10 0.6888262 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

11 0.68416506 143 nips-2008-Multi-label Multiple Kernel Learning

12 0.68403172 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

13 0.68251884 14 nips-2008-Adaptive Forward-Backward Greedy Algorithm for Sparse Learning with Linear Models

14 0.68229938 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

15 0.68194723 179 nips-2008-Phase transitions for high-dimensional joint support recovery

16 0.68082088 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

17 0.67960489 99 nips-2008-High-dimensional support union recovery in multivariate regression

18 0.67931503 114 nips-2008-Large Margin Taxonomy Embedding for Document Categorization

19 0.67919362 85 nips-2008-Fast Rates for Regularized Objectives

20 0.67775315 16 nips-2008-Adaptive Template Matching with Shift-Invariant Semi-NMF