nips nips2012 nips2012-361 knowledge-graph by maker-knowledge-mining

361 nips-2012-Volume Regularization for Binary Classification


Source: pdf

Author: Koby Crammer, Tal Wagner

Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks a box of large volume that contains “simple” weight vectors, most of which are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. [sent-6, score-0.417]

2 Our learning algorithm seeks a box of large volume that contains “simple” weight vectors, most of which are accurate on the training set. [sent-7, score-0.462]

3 Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. [sent-8, score-0.194]

4 The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets. [sent-10, score-0.107]

5 Support vector machines [3, 22] (SVMs) are considered a primary method to efficiently build linear classifiers from data, yielding state-of-the-art performance. [sent-12, score-0.068]

6 Specifically, our algorithm maintains an axis-aligned box, which requires only double the number of parameters needed to maintain a single weight-vector, a dominant model for many tasks. [sent-20, score-0.105]

7 BPMs use the version space, the set of all consistent weight vectors, which is a convex polyhedron. [sent-23, score-0.116]

8 Since the size of the polyhedron’s representation grows with the number of training examples, BPMs approximate the polyhedron with a single weight-vector, the Bayes point. [sent-24, score-0.073]

9 We cast learning as a convex optimization problem and propose methods to solve it efficiently. [sent-26, score-0.151]

10 We experiment with 30 binary text categorization datasets from various tasks: sentiment classification, predicting domain of product-review, assigning topics to news items, tagging spam emails, and classifying posts to news-groups. [sent-30, score-0.166]

11 Notation: Given a vector x ∈ Rd, we denote its kth element by xk ∈ R, and by |x| ∈ Rd the vector of component-wise absolute values of its elements, |x| = (|x1|, ..., |xd|). [sent-33, score-0.064]

12 2 Large-Volume Box Classifiers Standard linear classification learning algorithms maintain and return a single weight vector w ∈ Rd used to predict the label of any test point. [sent-37, score-0.112]

13 We study a generalization of these algorithms where hypotheses are uncertainty (sub)sets of weight vectors w. [sent-38, score-0.162]

14 PAC-Bayesian analysis and its generalization bounds give additional justification to this approach (see Sec 5). [sent-44, score-0.069]

15 The uncertainty subsets we study are axis-aligned boxes parametrized by two vectors u, v ∈ Rd, where we assume uk ≤ vk for all k = 1, ..., d. [sent-45, score-0.383]

16 The projection of the box onto the k-axis yields the interval [uk, vk]. [sent-50, score-0.31]

17 The set of weight vectors contained in the box is denoted by Q = {w : uk ≤ wk ≤ vk for k = 1, ..., d}. [sent-51, score-0.66]

18 Given an instance x to be classified, a Gibbs classifier samples a weight vector w ∈ Q uniformly at random from the box and returns sign(w · x). [sent-55, score-0.361]
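
To make the Gibbs and center-of-mass predictions concrete, here is a minimal NumPy sketch (the function names are ours, not from the paper): the box Q is stored as its center µ and per-coordinate half-widths σ, a Gibbs prediction samples one weight vector uniformly from [µ − σ, µ + σ], and the deterministic classifier uses the center of mass µ.

    import numpy as np

    def gibbs_predict(x, mu, sigma, rng):
        # Sample one weight vector uniformly from the box Q = [mu - sigma, mu + sigma].
        w = rng.uniform(mu - sigma, mu + sigma)
        return np.sign(w @ x)

    def mean_predict(x, mu):
        # Deterministic prediction based on the box's center of mass.
        return np.sign(mu @ x)

    rng = np.random.default_rng(0)
    mu, sigma = np.array([0.5, -1.0]), np.array([0.1, 0.3])
    x = np.array([1.0, 2.0])
    print(gibbs_predict(x, mu, sigma, rng), mean_predict(x, mu))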

19 Intuitively, the uncertainty in the weight associated with the kth feature is σk. [sent-59, score-0.152]

20 3 Learning as Optimization Given a labeled sample S = {(xi, yi)}, i = 1, ..., n, a common practice in learning linear models w is to perform structural risk minimization (SRM) [25], which picks a weight-vector that is both “simple” (e.g., small norm) and performs well on the training set. [sent-61, score-0.099]

21 Learning is cast as an optimization problem, w = arg min_w (1/n) Σ_i ℓ(w, (xi, yi)) + D R(w). (1) [sent-62, score-0.15]

22 The first term is the empirical loss evaluated on the training set with some loss function ℓ(w, (x, y)), and the second term is a regularization that penalizes weight-vectors according to their complexity. [sent-63, score-0.201]
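
For concreteness, a small NumPy sketch of the single-weight-vector objective (1), instantiated with the hinge loss and squared-L2 regularization (our illustrative choices of ℓ and R, not prescribed by the extract):

    import numpy as np

    def srm_objective(w, X, y, D):
        # (1/n) * sum_i hinge(y_i <w, x_i>) + D * ||w||^2
        margins = y * (X @ w)
        empirical_loss = np.maximum(0.0, 1.0 - margins).mean()
        return empirical_loss + D * np.dot(w, w)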

23 Learning with uncertainty sets invites us to balance three desires rather than two as when learning a single weight-vector. [sent-65, score-0.117]

24 The first two desires are generalizations of the structural risk minimization principle [25] to boxes: we prefer boxes containing weight-vectors that both attain low loss ℓ(w, (xi, yi)) and are “simple” (e.g., small norm). [sent-66, score-0.305]

25 This alone, though, is not enough: if the loss and regularization functions are strictly convex, then the optimal box would in fact be a single weight-vector. [sent-67, score-0.473]

26 The third desire is thus to prefer boxes with large volume. [sent-68, score-0.085]

27 Intuitively, if during training an algorithm finds a box with large volume, such that all weight-vectors belonging to it attain low training error and are simple, we expect the classifier based on the center of mass to be robust to noise or fluctuations. [sent-69, score-0.494]

28 We formalize this requirement by adding a term that is inversely proportional to the volume of the box Q. [sent-71, score-0.342]

29 We take a worst-case approach and define the loss of the box Q given an example (x, y), denoted by ℓ(Q, (x, y)), to be the loss of the worst member w ∈ Q. [sent-72, score-0.446]

30 Similarly, we define the complexity of the box Q to be the complexity of its most complex member w ∈ Q; formally, ℓ(Q, (x, y)) = sup_{w∈Q} ℓ(w, (x, y)) and R(Q) = sup_{w∈Q} R(w). [sent-73, score-0.62]

31 Putting it all together, we replace (1) with Q = arg min_{Q∈Q} (1/m) Σ_i sup_{w∈Q} [ℓ(w, (xi, yi)) + D R(w)], (2) where Q is a set of boxes with some minimal volume. [sent-74, score-0.182]

32 We expect this formulation to be robust, as a box is evaluated with its worst performing member. [sent-76, score-0.31]

33 First, it is a common barrier function in optimization [26], and in our case it keeps the box from actually shrinking to a zero-volume box. [sent-79, score-0.373]

34 Additionally, we bound the supremum over w with a sum of supremum operators. [sent-81, score-0.237]

35 To conclude, we cast the learning problem as the following optimization problem over boxes: arg min_Q (1/m) Σ_i sup_{w∈Q} ℓ(w, (xi, yi)) − C log vol(Q) + D sup_{w∈Q} R(w), (3) where C, D > 0 are two trade-off parameters used to balance the three goals. [sent-82, score-0.247]

36 (In the analysis below it will be shown that D can also be interpreted as a Lagrange multiplier of a constrained optimization problem.) [sent-83, score-0.068]

37 We further develop the last equation by making additional assumptions on the loss function and the regularization. [sent-84, score-0.097]

38 We assume that the loss is a monotonically decreasing function of the product y(x · w), often called the margin (or the signed margin). [sent-85, score-0.157]

39 This is a property of many popular loss functions for binary classification, including the hinge-loss and its square used by SVMs [3, 22], exp-loss used by boosting [9], logistic-regression [11] and the Huber-loss [14]. [sent-86, score-0.068]

40 Lemma 1 If the loss function is monotonically decreasing in the margin, ℓ(y(x · w)), then sup_{w∈Q} ℓ(w, (xi, yi)) = ℓ(yi(xi · µ) − |xi| · σ). [sent-88, score-0.173]

41–43 Proof: From the monotonicity of ℓ(·) we have sup_{w∈Q} ℓ(w, (x, y)) = ℓ(inf_{w∈Q} y(x · w)). Computing the infimum we get inf_{w∈Q} y(x · w) = inf_{wk∈[uk,vk], k=1,...,d} Σ_k (y xk) wk = Σ_k inf_{wk∈[uk,vk]} (y xk) wk, where each term equals (y xk) uk if (y xk) ≥ 0 and (y xk) vk if (y xk) < 0; in both cases this is (y xk)(µk − sign(y xk) σk), so the sum equals y(x · µ) − |x| · σ, using u = µ − σ and v = µ + σ as stated above. [sent-89, sent-92, sent-93]
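
Lemma 1 turns the worst-case loss over the whole box into a single margin computation: the smallest margin any w ∈ Q can attain on (x, y) is y(x · µ) − |x| · σ. A sketch, with the hinge loss as an illustrative choice of ℓ:

    import numpy as np

    def worst_case_margin(x, y, mu, sigma):
        # inf_{w in Q} y (x . w) = y (x . mu) - (|x| . sigma)   (Lemma 1)
        return y * np.dot(x, mu) - np.dot(np.abs(x), sigma)

    def worst_case_hinge(x, y, mu, sigma):
        # sup_{w in Q} hinge(w, (x, y)) = hinge evaluated at the worst-case margin
        return max(0.0, 1.0 - worst_case_margin(x, y, mu, sigma))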

44 The lemma highlights the need to constrain the volume to be strictly larger than zero: due to monotonicity and the fact that σ ≥ 0 (component-wise), we have ℓ(y(x · µ) − |x| · σ) ≥ ℓ(y(x · µ)), so the loss is always minimized when we set σ = 0. [sent-94, score-0.163]

45 Lemma 2 (1) If R(w) is convex, then sup_{w∈Q} R(w) is attained on the vertices of the box Q. [sent-96, score-0.344]

46 (2) Additionally, if R(w) is strictly convex then the supremum is attained only on vertices. [sent-97, score-0.202]

47 Proof: We use the fact that every point in the box can be represented as a convex combination of the vertices. [sent-98, score-0.375]

48 Formally, given a point in the box w ∈ Q, there exists a vector α ∈ R^{2^d} with non-negative elements and Σ_t αt = 1 such that w = Σ_t αt zt, where the zt are the vertices of the box. [sent-99, score-0.31]

49 Thus, if some w ∈ Q attains the supremum sup_{w∈Q} R(w), then so does at least one vertex. [sent-101, score-0.142]

50 Additionally, if R(w) is a strictly convex function, then the first inequality in the last equation is a strict inequality, and thus a non-vertex cannot attain the supremum. [sent-102, score-0.104]

51 In this case the supremum is attained on each coordinate independently as follows. [sent-104, score-0.137]

52 Corollary 3 Assuming R(w) is a sum of scalar convex functions, R(w) = Σ_k r(wk), we have sup_{w∈Q} R(w) = Σ_k max{r(uk), r(vk)} = Σ_k max{r(µk − σk), r(µk + σk)}. [sent-105, score-0.105]

53 The corollary follows from the lemma, since the supremum of a scalar function over a box is equivalent to taking the supremum over the box projected onto a single coordinate. [sent-106, score-0.919]

54 Finally, the volume of a box is given by the product of the lengths of its axes, that is, vol(Q) = Π_k (vk − uk) = Π_k (2σk) = 2^d Π_k σk. [sent-107, score-0.592]
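
Both the regularization term of Corollary 3 and the volume decompose over coordinates, so they reduce to simple sums; a sketch with r(w) = w² as our illustrative choice of r:

    import numpy as np

    def box_regularizer_and_logvol(u, v):
        # sup_{w in Q} ||w||^2 = sum_k max(u_k^2, v_k^2)   (Corollary 3 with r(w) = w^2)
        sup_reg = np.sum(np.maximum(u**2, v**2))
        # log vol(Q) = sum_k log(v_k - u_k); requires v > u component-wise
        log_vol = np.sum(np.log(v - u))
        return sup_reg, log_vol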

55 We denote z_{i,+} = yi xi + |xi| ∈ Rd and z_{i,−} = yi xi − |xi| ∈ Rd. (5) [sent-109, score-0.224]

56 The kth element of z_{i,+} (respectively z_{i,−}) is twice the kth element of |xi| if the sign of the kth element of xi agrees (disagrees) with yi, and zero otherwise. [sent-110, score-0.355]

57 This problem can equivalently be written in terms of the two “extreme” vertices u and v as follows: min_{v≥u} (1/m) Σ_{i=1}^m ℓ((1/2)(v · z_{i,−} + u · z_{i,+})) − C Σ_k log(vk − uk) + D Σ_k max{r(vk), r(uk)}, (6) by using the relation yi xi · (v + u) − |xi| · (v − u) = v · z_{i,−} + u · z_{i,+}. [sent-111, score-0.341]
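
Formulation (6) can be evaluated directly from the data through the vectors z_{i,+} and z_{i,−} of (5); a hedged NumPy sketch with the hinge loss standing in for ℓ and r(w) = w²:

    import numpy as np

    def objective_6(u, v, X, y, C, D):
        # z_{i,+} = y_i x_i + |x_i|,  z_{i,-} = y_i x_i - |x_i|     (eq. 5)
        Zp = y[:, None] * X + np.abs(X)
        Zm = y[:, None] * X - np.abs(X)
        # worst-case margins: (1/2)(v . z_{i,-} + u . z_{i,+}) = y_i (x_i . mu) - (|x_i| . sigma)
        margins = 0.5 * (Zm @ v + Zp @ u)
        empirical_loss = np.maximum(0.0, 1.0 - margins).mean()   # hinge loss
        barrier = -C * np.sum(np.log(v - u))                      # keeps the volume positive
        reg = D * np.sum(np.maximum(u**2, v**2))                  # sup of ||w||^2 over the box
        return empirical_loss + barrier + reg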

58 Note that if the loss function ℓ(·) is convex, then both formulations (4) and (6) of the learning problem are convex in their arguments, as each is a sum of convex functions of linear combinations of the arguments, and a maximum of convex functions is convex. [sent-112, score-0.263]

59 Although the above problem is convex, the regularization term Σ_k max{r(vk), r(uk)} is not smooth because of the max operator. [sent-114, score-0.102]

60 The problem then becomes min_{v≥u} (1/m) Σ_{i=1}^m ℓ((1/2)(v · z_{i,−} + u · z_{i,+})) − C Σ_k log(vk − uk) + D (R(u) + R(v)). [sent-116, score-0.193]

61 The algorithm is based on COMID [8], and its convergence analysis follows directly from the analysis of COMID; it is omitted due to lack of space. [sent-123, score-0.073]

62 The algorithm then solves the following regularization-oriented optimization problem: min_{u,v} (1/2)‖u − ũ‖² + (1/2)‖v − ṽ‖² − C Σ_k log(vk − uk) + D Σ_k max{vk², uk²}. [sent-127, score-0.26]

63 The objective of the last problem decomposes over the individual pairs (uk, vk), so we reduce the optimization to d independent problems, each defined over two scalars u and v (omitting the index k): min_{u,v} F(u, v) = (1/2)(u − ũ)² + (1/2)(v − ṽ)² − C log(v − u) + D max{v², u²}. (8) [sent-128, score-0.26]
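
The paper solves (8) in closed form by case analysis (Lemma 4, whose derivation is omitted in this extract); as a sanity check one can also minimize F numerically. The sketch below is our stand-in, not the paper's update rule, using Nelder-Mead with an infeasibility guard for v ≤ u:

    import numpy as np
    from scipy.optimize import minimize

    def solve_scalar_problem(u_tilde, v_tilde, C, D):
        def F(p):
            u, v = p
            if v <= u:            # outside the domain of the log barrier
                return np.inf
            return (0.5 * (u - u_tilde)**2 + 0.5 * (v - v_tilde)**2
                    - C * np.log(v - u) + D * max(v**2, u**2))
        mid = 0.5 * (u_tilde + v_tilde)
        res = minimize(F, np.array([mid - 1.0, mid + 1.0]), method="Nelder-Mead")
        return res.x  # approximate minimizer (u, v)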

64 The following lemma describes the optimal solution of (8). [sent-130, score-0.063]

65 Proof sketch: By definition, the function F is smooth and convex on G1. [sent-139, score-0.065]

66 The convexity of F on the entire set H yields that any such point is also a global minimum of F , and that if no such point exists then F attains a global minimum on L (which is derived in 3). [sent-142, score-0.099]

67 The algebraic derivation is omitted due to lack of space. [sent-144, score-0.073]

68 Its proof is similar to the proof of Lemma 4, but simpler, and is omitted due to lack of space. [sent-149, score-0.073]

69 Let D be a distribution over the labeled examples (x, y), and denote by ℓ̄(w, D) the expected zero-one loss of a linear classifier characterized by its weight vector w: ℓ̄(w, D) = Pr_{(x,y)∼D}[sign(w · x) ≠ y] = E_{(x,y)∼D}[ℓ̄(w, (x, y))]. [sent-156, score-0.119]

70 We abuse notation and denote by ℓ̄(w, S) the expected loss ℓ̄(w, DS) for the empirical distribution DS of a sample S. [sent-157, score-0.068]

71 Below, we identify a compact set with a uniform distribution over the set, and in particular we identify a box Q with a uniform distribution over all weight vectors it contains (and zero mass otherwise). [sent-161, score-0.463]

72 We also denote by ℓ̄(Q, D) the expectation of ℓ̄(w, D) over weight vectors w drawn according to the distribution Q. [sent-164, score-0.085]

73–74 Additionally, we bound the empirical training error: ℓ̄(Q, S) = (1/n) Σ_i (1/vol(Q)) ∫_{w∈Q} ℓ̄(w, (xi, yi)) dw ≤ (1/n) Σ_i ℓ(inf_{w∈Q} yi (xi · w)), (12) where the equality is the definition of ℓ̄(Q, S), and the inequality follows by choosing a loss function ℓ(·) which upper bounds the zero-one loss (e.g., the hinge loss), by bounding an expectation with the supremum value, and from Lemma 1. [sent-172, sent-174]
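
The inequality in (12) can be checked empirically: sample weight vectors uniformly from the box, average their zero-one training errors, and compare with the averaged loss of the worst-case margins (hinge loss used here as the upper-bounding ℓ). A Monte Carlo sketch, for illustration only:

    import numpy as np

    def gibbs_error_vs_bound(X, y, mu, sigma, n_samples=1000, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        # Monte Carlo estimate of the Gibbs training error (left-hand side of (12))
        W = rng.uniform(mu - sigma, mu + sigma, size=(n_samples, d))
        gibbs_err = (np.sign(X @ W.T) != y[:, None]).mean()
        # Hinge loss of the worst-case margins (right-hand side of (12))
        worst_margins = y * (X @ mu) - np.abs(X) @ sigma
        bound = np.maximum(0.0, 1.0 - worst_margins).mean()
        return gibbs_err, bound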

75 Thus, to minimize the generalization bound of (9) we can minimize a bound on (10), which is obtained by substituting (11) and (12) into (10). [sent-175, score-0.102]

76 Omitting constants we get min_Q (1/n) Σ_i ℓ(inf_{w∈Q} yi (w · xi)) − (1/(nγ)) log vol(Q) subject to Q ⊆ P. (13) [sent-176, score-0.17]

77 Next, we set P to be a ball of radius R about the origin and, as in Sec 2, we set Q as a box parametrized with the vectors u and v. [sent-179, score-0.464]

78 We use the following lemma, whose proof is omitted due to lack of space. Lemma 7 If P is a ball of radius R about the origin and Q is a box parametrized using u and v, we have Q ⊆ P ⇔ Σ_k max{vk², uk²} ≤ R². [sent-180, score-0.503]

79 Finally, plugging Lemma 7 and Lemma 1 into (13), we get the following problem, which is monotonically related to a bound on the generalization loss: min_{v≥u} (1/n) Σ_i ℓ((1/2)(v · z_{i,−} + u · z_{i,+})) − (1/(nγ)) Σ_k log(vk − uk) subject to Σ_k max{r(vk), r(uk)} ≤ R². [sent-181, score-0.341]

80 To solve the last problem we write its Lagrangian: max_η min_{v≥u} (1/n) Σ_i ℓ((1/2)(v · z_{i,−} + u · z_{i,+})) − (1/(nγ)) Σ_k log(vk − uk) + η (Σ_k max{r(vk), r(uk)} − R²), (14) where η is the Lagrange multiplier ensuring the constraint. [sent-182, score-0.302]

81 In fact, each value of the radius R yields a unique optimal value of the Lagrange multiplier η. [sent-186, score-0.076]

82 Thus, we can interpret the role of the constant D as setting implicitly the effective radius of the prior ball P. [sent-187, score-0.086]

83 However, we chose Q to be a box, as it has a nice interpretation of uncertainty over features, and P to be a ball, as it decomposes (as opposed to an ℓ∞ ball), which allows simpler optimization algorithms. [sent-192, score-0.068]

84 Third, the bound is small if the volume of the box Q is large, which motivates seeking large-volume boxes whose members perform well. [sent-194, score-0.373]

85 We also experimented with USPS OCR data, which we binarized into 45 all-pairs problems, maintaining the standard split into training and test sets. [sent-205, score-0.136]
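
The 45 all-pairs problems are simply the restrictions of the 10-class USPS data to each pair of digits, with labels mapped to ±1; a sketch of the binarization (data loading not shown, names ours):

    import numpy as np
    from itertools import combinations

    def all_pairs_binary(X, labels):
        # Yield one binary problem per digit pair: 10 choose 2 = 45 problems.
        for a, b in combinations(range(10), 2):
            mask = (labels == a) | (labels == b)
            yield (a, b), X[mask], np.where(labels[mask] == a, 1.0, -1.0)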

86 We implemented BoW-M and BoW-S with both the hinge loss and the Huber loss. [sent-207, score-0.068]

87 The mean error for 30 NLP tasks over 10 folds of BoW-M and BoW-S vs SVM is summarized in the two left panels of Fig. [sent-215, score-0.146]

88 Clearly, both BoW versions outperform SVM, obtaining lower test error on most (26) datasets and higher on only a few (at most 3). [sent-218, score-0.173]

89 The right two panels compare the performance of both BoW versions with AROW. [sent-219, score-0.096]

90 Number of USPS 1-vs-1 datasets (out of 45) for which one algorithm is better than the other (see legend), shown for four levels of label noise during training: 0%, 10%, 20% and 30% (left to right). [sent-223, score-0.116]

91 Each panel shows the number of datasets (out of 45) for which one algorithm outperforms another algorithm, for four levels of label noise (i.e., 0%, 10%, 20% and 30%). [sent-227, score-0.147]

92 The four pairs compared are BoW vs SVM (two left panels, BoW-M leftmost) and BoW vs AROW (two right panels, BoW-S rightmost). [sent-230, score-0.094]

93 SVM attains lower test error than BoW-S on 20 datasets and higher on 12 datasets, with a tie on 13 datasets. [sent-234, score-0.213]

94 With maximal level of 30% label noise, the average test error is 16. [sent-240, score-0.107]

95 BoW-M achieves lower test error on 27 datasets (compared with both SVM and AROW), while BoW-S achieves lower test error than SVM on 38 datasets and than AROW on 40 datasets. [sent-244, score-0.202]

96 (1) BoW maintains only 2d parameters, while GMM employs d + d(d + 1)/2 as it maintains a full covariance matrix. [sent-252, score-0.112]

97 (3) We directly use a specialized PAC-Bayes bound for convex loss functions [10], while the analysis of GMMs uses a bound designed for the 0–1 loss, which is then further bounded. [sent-254, score-0.263]

98 (4) The optimization problem of both versions of BoW is convex, while the optimization problem of GMMs is not convex, and it is only approximated with a convex problem. [sent-255, score-0.17]

99 8 Conclusion We extend the commonly used linear classifiers to subsets of the class of possible classifiers, or in other words uniform distributions over weight vectors. [sent-259, score-0.085]

100 The empirical evaluation presented shows that our method performs favourably with respect to SVMs and AROW, and is more robust in the presence of label noise. [sent-261, score-0.09]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('arow', 0.494), ('bow', 0.454), ('box', 0.31), ('uk', 0.193), ('yxk', 0.173), ('supw', 0.148), ('gmms', 0.138), ('svm', 0.127), ('nlp', 0.12), ('usps', 0.105), ('supremum', 0.103), ('boxes', 0.085), ('volq', 0.074), ('tie', 0.073), ('wk', 0.072), ('loss', 0.068), ('classi', 0.068), ('convex', 0.065), ('yi', 0.064), ('kth', 0.064), ('wins', 0.064), ('lemma', 0.063), ('label', 0.061), ('comid', 0.06), ('crammer', 0.06), ('inf', 0.058), ('spam', 0.057), ('vol', 0.057), ('maintains', 0.056), ('datasets', 0.055), ('cast', 0.055), ('sentiment', 0.054), ('dredze', 0.054), ('panels', 0.053), ('binarized', 0.052), ('sign', 0.051), ('weight', 0.051), ('dkl', 0.05), ('bpms', 0.049), ('desires', 0.049), ('limv', 0.049), ('maintaing', 0.049), ('margin', 0.048), ('xi', 0.048), ('ball', 0.047), ('vs', 0.047), ('error', 0.046), ('svms', 0.046), ('herbrich', 0.044), ('germain', 0.044), ('omitting', 0.044), ('ocr', 0.044), ('versions', 0.043), ('ers', 0.043), ('bayes', 0.042), ('sec', 0.041), ('monotonically', 0.041), ('generalization', 0.04), ('attain', 0.039), ('attains', 0.039), ('radius', 0.039), ('koby', 0.038), ('graepel', 0.038), ('polyhedron', 0.038), ('multiplier', 0.037), ('uncertainty', 0.037), ('lack', 0.037), ('israel', 0.036), ('eg', 0.036), ('shivaswamy', 0.036), ('max', 0.036), ('omitted', 0.036), ('training', 0.035), ('gmm', 0.034), ('lagrange', 0.034), ('parametrized', 0.034), ('machines', 0.034), ('vectors', 0.034), ('uniform', 0.034), ('additionally', 0.034), ('attained', 0.034), ('sup', 0.033), ('uv', 0.033), ('bhattacharyya', 0.033), ('markers', 0.032), ('hinge', 0.032), ('volume', 0.032), ('rd', 0.032), ('bound', 0.031), ('optimization', 0.031), ('balance', 0.031), ('panel', 0.031), ('minimum', 0.03), ('superior', 0.03), ('corollary', 0.03), ('regularization', 0.03), ('amazon', 0.029), ('additional', 0.029), ('robust', 0.029), ('outperform', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 361 nips-2012-Volume Regularization for Binary Classification

Author: Koby Crammer, Tal Wagner

Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks a box of large volume that contains “simple” weight vectors, most of which are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets.

2 0.12223029 197 nips-2012-Learning with Recursive Perceptual Representations

Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell

Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1

3 0.11043017 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

4 0.10377329 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

Author: Xianghang Liu, James Petterson, Tibério S. Caetano

Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or instead non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the 0 pseudo-norm, thus providing the ability to perform feature selection and trade-off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1

5 0.092394054 188 nips-2012-Learning from Distributions via Support Measure Machines

Author: Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, Bernhard Schölkopf

Abstract: This paper presents a kernel-based discriminative learning framework on probability measures. Rather than relying on large collections of vectorial training examples, our framework learns using a collection of probability distributions that have been constructed to meaningfully represent training data. By representing these probability distributions as mean embeddings in the reproducing kernel Hilbert space (RKHS), we are able to apply many standard kernel-based learning techniques in straightforward fashion. To accomplish this, we construct a generalization of the support vector machine (SVM) called a support measure machine (SMM). Our analyses of SMMs provides several insights into their relationship to traditional SVMs. Based on such insights, we propose a flexible SVM (FlexSVM) that places different kernel functions on each training example. Experimental results on both synthetic and real-world data demonstrate the effectiveness of our proposed framework. 1

6 0.089818612 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

7 0.086514145 227 nips-2012-Multiclass Learning with Simplex Coding

8 0.084979579 16 nips-2012-A Polynomial-time Form of Robust Regression

9 0.084782586 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

10 0.084566712 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

11 0.075786531 311 nips-2012-Shifting Weights: Adapting Object Detectors from Image to Video

12 0.072650075 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

13 0.070930883 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

14 0.070812121 168 nips-2012-Kernel Latent SVM for Visual Recognition

15 0.069854759 142 nips-2012-Generalization Bounds for Domain Adaptation

16 0.069820881 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

17 0.068005785 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

18 0.065960363 27 nips-2012-A quasi-Newton proximal splitting method

19 0.065002471 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

20 0.064771347 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.202), (1, 0.023), (2, 0.024), (3, -0.056), (4, 0.099), (5, -0.013), (6, 0.01), (7, 0.124), (8, 0.001), (9, -0.036), (10, 0.01), (11, 0.085), (12, 0.074), (13, 0.024), (14, 0.04), (15, -0.041), (16, -0.073), (17, -0.035), (18, -0.011), (19, 0.055), (20, -0.046), (21, 0.005), (22, -0.031), (23, -0.005), (24, -0.051), (25, 0.011), (26, -0.036), (27, -0.043), (28, -0.072), (29, -0.007), (30, -0.022), (31, 0.045), (32, -0.041), (33, -0.043), (34, -0.009), (35, -0.003), (36, -0.008), (37, 0.002), (38, 0.028), (39, -0.034), (40, -0.032), (41, 0.006), (42, 0.037), (43, 0.03), (44, -0.008), (45, 0.036), (46, -0.037), (47, -0.014), (48, -0.04), (49, -0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94064337 361 nips-2012-Volume Regularization for Binary Classification

Author: Koby Crammer, Tal Wagner

Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks a box of large volume that contains “simple” weight vectors, most of which are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets.

2 0.7907005 227 nips-2012-Multiclass Learning with Simplex Coding

Author: Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, Jean-jeacques Slotine

Abstract: In this paper we discuss a novel framework for multiclass learning, defined by a suitable coding/decoding strategy, namely the simplex coding, that allows us to generalize to multiple classes a relaxation approach commonly used in binary classification. In this framework, we develop a relaxation error analysis that avoids constraints on the considered hypotheses class. Moreover, using this setting we derive the first provably consistent regularized method with training/tuning complexity that is independent to the number of classes. We introduce tools from convex analysis that can be used beyond the scope of this paper. 1

3 0.78187495 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

Author: Chi Jin, Liwei Wang

Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.

4 0.75436854 174 nips-2012-Learning Halfspaces with the Zero-One Loss: Time-Accuracy Tradeoffs

Author: Aharon Birnbaum, Shai S. Shwartz

Abstract: Given α, ϵ, we study the time complexity required to improperly learn a halfspace with misclassification error rate of at most (1 + α) L∗ + ϵ, where L∗ is the γ γ optimal γ-margin error rate. For α = 1/γ, polynomial time and sample complexity is achievable using the hinge-loss. For α = 0, Shalev-Shwartz et al. [2011] showed that poly(1/γ) time is impossible, while learning is possible in ˜ time exp(O(1/γ)). An immediate question, which this paper tackles, is what is achievable if α ∈ (0, 1/γ). We derive positive results interpolating between the polynomial time for α = 1/γ and the exponential time for α = 0. In particular, we show that there are cases in which α = o(1/γ) but the problem is still solvable in polynomial time. Our results naturally extend to the adversarial online learning model and to the PAC learning with malicious noise model. 1

5 0.74375534 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

Author: Amit Daniely, Sivan Sabato, Shai S. Shwartz

Abstract: We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over Rd . We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1

6 0.72867996 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

7 0.72301 200 nips-2012-Local Supervised Learning through Space Partitioning

8 0.71750128 271 nips-2012-Pointwise Tracking the Optimal Regression Function

9 0.70819509 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses

10 0.7029652 197 nips-2012-Learning with Recursive Perceptual Representations

11 0.69866204 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification

12 0.69794446 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

13 0.66020542 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification

14 0.63591135 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

15 0.63560933 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

16 0.63183904 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

17 0.63157666 168 nips-2012-Kernel Latent SVM for Visual Recognition

18 0.63014561 212 nips-2012-Minimax Multi-Task Learning and a Generalized Loss-Compositional Paradigm for MTL

19 0.6274671 221 nips-2012-Multi-Stage Multi-Task Feature Learning

20 0.60738629 16 nips-2012-A Polynomial-time Form of Robust Regression


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.04), (13, 0.187), (17, 0.013), (21, 0.011), (27, 0.018), (38, 0.183), (42, 0.046), (53, 0.025), (54, 0.015), (55, 0.045), (74, 0.04), (76, 0.161), (80, 0.094), (92, 0.047)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89268446 125 nips-2012-Factoring nonnegative matrices with linear programs

Author: Ben Recht, Christopher Re, Joel Tropp, Victor Bittorf

Abstract: This paper describes a new approach, based on linear programming, for computing nonnegative matrix factorizations (NMFs). The key idea is a data-driven model for the factorization where the most salient features in the data are used to express the remaining features. More precisely, given a data matrix X, the algorithm identifies a matrix C that satisfies X ≈ CX and some linear constraints. The constraints are chosen to ensure that the matrix C selects features; these features can then be used to find a low-rank NMF of X. A theoretical analysis demonstrates that this approach has guarantees similar to those of the recent NMF algorithm of Arora et al. (2012). In contrast with this earlier work, the proposed method extends to more general noise models and leads to efficient, scalable algorithms. Experiments with synthetic and real datasets provide evidence that the new approach is also superior in practice. An optimized C++ implementation can factor a multigigabyte matrix in a matter of minutes. 1

same-paper 2 0.87240934 361 nips-2012-Volume Regularization for Binary Classification

Author: Koby Crammer, Tal Wagner

Abstract: We introduce a large-volume box classification for binary prediction, which maintains a subset of weight vectors, and specifically axis-aligned boxes. Our learning algorithm seeks a box of large volume that contains “simple” weight vectors, most of which are accurate on the training set. Two versions of the learning process are cast as convex optimization problems, and it is shown how to solve them efficiently. The formulation yields a natural PAC-Bayesian performance bound and it is shown to minimize a quantity directly aligned with it. The algorithm outperforms SVM and the recently proposed AROW algorithm on a majority of 30 NLP datasets and binarized USPS optical character recognition datasets.

3 0.82707214 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

4 0.81399268 319 nips-2012-Sparse Prediction with the $k$-Support Norm

Author: Andreas Argyriou, Rina Foygel, Nathan Srebro

Abstract: We derive a novel norm that corresponds to the tightest convex relaxation of sparsity combined with an 2 penalty. We show that this new k-support norm provides a tighter relaxation than the elastic net and can thus be advantageous in in sparse prediction problems. We also bound the looseness of the elastic net, thus shedding new light on it and providing justification for its use. 1

5 0.81397462 179 nips-2012-Learning Manifolds with K-Means and K-Flats

Author: Guillermo Canas, Tomaso Poggio, Lorenzo Rosasco

Abstract: We study the problem of estimating a manifold from random samples. In particular, we consider piecewise constant and piecewise linear estimators induced by k-means and k-flats, and analyze their performance. We extend previous results for k-means in two separate directions. First, we provide new results for k-means reconstruction on manifolds and, secondly, we prove reconstruction bounds for higher-order approximation (k-flats), for which no known results were previously available. While the results for k-means are novel, some of the technical tools are well-established in the literature. In the case of k-flats, both the results and the mathematical tools are new. 1

6 0.81316513 236 nips-2012-Near-Optimal MAP Inference for Determinantal Point Processes

7 0.81300789 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

8 0.81294471 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

9 0.81145209 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models

10 0.81078595 16 nips-2012-A Polynomial-time Form of Robust Regression

11 0.81000262 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

12 0.80994833 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

13 0.80980766 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

14 0.80978167 227 nips-2012-Multiclass Learning with Simplex Coding

15 0.80919039 325 nips-2012-Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

16 0.80891764 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications

17 0.80847681 275 nips-2012-Privacy Aware Learning

18 0.80822676 206 nips-2012-Majorization for CRFs and Latent Likelihoods

19 0.80789733 34 nips-2012-Active Learning of Multi-Index Function Models

20 0.80783755 263 nips-2012-Optimal Regularized Dual Averaging Methods for Stochastic Optimization