nips nips2003 nips2003-132 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Stuart Andrews, Thomas Hofmann
Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of patterns instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Learning from ambiguous training data is highly relevant in many applications. [sent-5, score-0.286]
2 We present a new learning algorithm for classification problems where labels are associated with sets of patterns instead of individual patterns. [sent-6, score-0.158]
3 This encompasses multiple instance learning as a special case. [sent-7, score-0.033]
4 Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. [sent-8, score-0.729]
5 1 Introduction In many applications of machine learning, it is inherently difficult or prohibitively expensive to generate large amounts of labeled training data. [sent-9, score-0.077]
6 However, it is often considerably less challenging to provide weakly labeled data, where labels or annotations y are associated with sets of patterns or bags X instead of individual patterns x ∈ X. [sent-10, score-0.54]
7 These bags reflect a fundamental ambiguity about the correspondence of patterns and the associated label, which can be expressed logically as a disjunction of the form ⋁_{x∈X} (x is an example of class y). [sent-11, score-0.425]
8 In plain English, each labeled bag contains at least one pattern (but possibly more) belonging to this class, but the identities of these patterns are unknown. [sent-12, score-0.177]
9 A special case of particular relevance is known as multiple instance learning [5] (MIL). [sent-13, score-0.033]
10 In MIL labels are binary and the ambiguity is asymmetric in the sense that bags with negative labels are always of size one. [sent-14, score-0.464]
11 The ambiguity typically arises because of polymorphisms allowing multiple representations, e.g. [sent-17, score-0.107]
12 annotations may be associated with images or documents where they should be attached to objects in an image or passages in a document. [sent-21, score-0.066]
13 Notice also that there are two intertwined objectives: the goal may be to learn a pattern-level classifier from ambiguous training examples, but sometimes one may be primarily interested in classifying new bags without necessarily resolving the ambiguity for individual patterns. [sent-22, score-0.608]
14 Because of the combinatorial nature of the problem, a simple optimization heuristic was used in [1] to learn discriminant functions. [sent-25, score-0.127]
15 In this paper, we take a more principled approach by carefully analyzing the nature of the resulting optimization problem and by deriving a sequence of successively stronger relaxations that can be used to compute lower and upper bounds on the objective. [sent-26, score-0.154]
16 Since it turns out that exploiting sparseness is a crucial aspect, we have focused on a linear programming formulation by generalizing the LPBoost algorithm [7, 12, 4]; we call the resulting method Disjunctive Programming Boosting (DPBoost). [sent-27, score-0.123]
17 2 Linear Programming Boosting LPBoost is a linear programming approach to boosting, which aims at learning ensemble classifiers of the form G(x) = sgn F(x) with F(x) = Σ_k α_k h_k(x), where h_k : ℝ^d → {−1, 1}, k = 1, . [sent-28, score-1.057]
18 . . , n are the so-called base classifiers, weak hypotheses, or features, and α_k ≥ 0 are combination weights. [sent-31, score-0.04]
19 The ensemble margin of a labeled example (x, y) is defined as yF (x). [sent-32, score-0.235]
20 Given a set of labeled training examples {(x_1, y_1), . . . [sent-33, score-0.077]
21 , (x_m, y_m)}, LPBoost formulates the supervised learning problem using the 1-norm soft margin objective: min_{α,ξ} Σ_{k=1}^n α_k + C Σ_{i=1}^m ξ_i, s.t. y_i F(x_i) + ξ_i ≥ 1 and ξ_i ≥ 0 for all i, α_k ≥ 0 for all k. (1) [sent-36, score-0.137]
22 The dual of Eq. (1) can be written as: max_u Σ_{i=1}^m u_i, s.t. Σ_{i=1}^m u_i y_i h_k(x_i) ≤ 1 ∀k, and 0 ≤ u_i ≤ C ∀i. (2) [sent-42, score-0.865]
23 It is useful to take a closer look at the KKT complementarity conditions: u_i (y_i F(x_i) + ξ_i − 1) = 0 and (Σ_{i=1}^m u_i y_i h_k(x_i) − 1) α_k = 0. (3) [sent-45, score-0.865]
24 Since the optimal values of the slack variables are implicitly determined by α as ξ_i(α) = [1 − y_i F(x_i)]_+, the first set of conditions states that u_i = 0 whenever y_i F(x_i) > 1. [sent-46, score-0.431]
25 Since u_i can be interpreted as the “misclassification” cost, this implies that only instances with tight margin constraints may have non-vanishing associated costs. [sent-47, score-0.4]
26 The second set of conditions ensures that α_k = 0 if Σ_{i=1}^m u_i y_i h_k(x_i) < 1, which states that a weak hypothesis h_k is never included in the ensemble if its weighted score Σ_i u_i y_i h_k(x_i) is strictly below the maximum score of 1. [sent-48, score-2.019]
27 So a typical LPBoost solution may be sparse in two ways: (i) only a small number of weak hypotheses with α_k > 0 may contribute to the ensemble, and (ii) the solution may only depend on a subset of the training data, i.e. the instances with non-vanishing costs u_i. [sent-49, score-0.212]
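As a concrete illustration of the formulation above, the following sketch sets up the 1-norm soft-margin primal of Eq. (1) and reads off the dual costs u_i. It is not the paper's implementation; the names H, y, C, lpboost_primal and the use of scipy's HiGHS backend (needed for the dual values) are assumptions made here for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

def lpboost_primal(H, y, C):
    """Minimal sketch of the 1-norm soft-margin LP of Eq. (1).

    H : (m, n) array with H[i, k] = h_k(x_i) in {-1, +1}
    y : (m,) array of labels in {-1, +1}
    C : soft-margin trade-off constant
    Returns (alpha, xi, u), where u are the dual "misclassification costs".
    """
    m, n = H.shape
    # decision vector z = [alpha_1..alpha_n, xi_1..xi_m], all >= 0
    c = np.concatenate([np.ones(n), C * np.ones(m)])
    # margin constraints  y_i * sum_k alpha_k h_k(x_i) + xi_i >= 1,
    # rewritten for linprog as  -(y_i * H[i, :]) @ alpha - xi_i <= -1
    A_ub = np.hstack([-(y[:, None] * H), -np.eye(m)])
    b_ub = -np.ones(m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    alpha, xi = res.x[:n], res.x[n:]
    u = -res.ineqlin.marginals  # duals of the margin constraints (nonnegative)
    return alpha, xi, u
```

In line with the KKT conditions above, the returned u should vanish on examples with y_i F(x_i) > 1, and alpha is typically sparse.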
28 LPBoost exploits the sparseness of the ensemble by incrementally selecting columns from the simplex tableau and optimizing the smaller tableau. [sent-52, score-0.24]
29 This amounts to finding in each round a hypothesis hk for which the constraint in Eq. [sent-53, score-0.546]
30 (2) is violated, adding it to the ensemble and re-optimizing the tableau with the selected columns. [sent-54, score-0.172]
31 As a column selection heuristic, the authors of [4] propose to use the magnitude of the violation, i.e. [sent-55, score-0.138]
32 pick the weak hypothesis h_k with maximal score Σ_i u_i y_i h_k(x_i). [sent-57, score-1.271]
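A minimal column-generation loop following this selection heuristic might look as follows; it assumes a finite pool of weak hypotheses given as the columns of H and reuses the lpboost_primal sketch above (both assumptions, not the paper's code).

```python
def lpboost_column_generation(H, y, C, tol=1e-6, max_rounds=200):
    """Sketch of LPBoost column generation for Eqs. (1)/(2)."""
    m, n = H.shape
    active = []                     # indices of selected weak hypotheses
    u = np.full(m, 1.0 / m)         # initial misclassification costs
    alpha_active = np.zeros(0)
    for _ in range(max_rounds):
        scores = (u * y) @ H        # sum_i u_i y_i h_k(x_i) for every k
        k = int(np.argmax(scores))
        if scores[k] <= 1.0 + tol:  # no constraint of Eq. (2) is violated
            break
        active.append(k)            # add the most violated column ...
        alpha_active, _, u = lpboost_primal(H[:, active], y, C)  # ... and re-optimize
    alpha = np.zeros(n)
    alpha[active] = alpha_active
    return alpha
```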
33 3 Disjunctive Programming Boosting In order to deal with pattern ambiguity, we employ the disjunctive programming framework [2, 9]. [sent-58, score-0.455]
34 In the spirit of transductive large margin methods [8, 3], we propose to estimate the parameters α of the discriminant function in a way that achieves a large margin for at least one of the patterns in each bag. [sent-59, score-0.355]
35 Applying this principle, we can compile the training data into a set of disjunctive constraints on α. [sent-60, score-0.443]
36 To that end, let us define the following polyhedra: H_i(x) ≡ {(α, ξ) : y_i Σ_k α_k h_k(x) + ξ_i ≥ 1}, Q ≡ {(α, ξ) : α, ξ ≥ 0}. (4) [sent-61, score-0.541]
37 Then we can formulate the following disjunctive program: min_{α,ξ} Σ_{k=1}^n α_k + C Σ_{i=1}^m ξ_i, s.t. (α, ξ) ∈ ⋂_i ( ⋃_{x∈X_i} H_i(x) ) ∩ Q. (5) [sent-62, score-0.34]
38 Notice that if |X_i| ≥ 2 then the constraint imposed by X_i is highly non-convex, since it is defined via a union of halfspaces. [sent-65, score-0.072]
39 However, for trivial bags with |X_i| = 1, the resulting constraints are the same as in Eq. (1). [sent-66, score-0.289]
40 Since we will handle these two cases quite differently in the sequel, let us introduce index sets I = {i : |Xi | ≥ 2} and J = {j : |Xj | = 1}. [sent-68, score-0.036]
41 A suitable way to define a relaxation of this non-convex optimization problem is to replace the disjunctive set in Eq. (5) by a convex set that contains it. [sent-69, score-0.47]
42 As shown in [2], a whole hierarchy of such relaxations can be built, using the fundamental fact that cl-conv(A) ∩ cl-conv(B) ⊇ cl-conv(A ∩ B), where cl-conv(A) denotes the closure of the convex hull of the limiting points of A. [sent-71, score-0.195]
43 This means a tighter convex relaxation is obtained, if we intersect as many sets as possible, before taking their convex hull. [sent-72, score-0.292]
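A one-dimensional example (not taken from the paper) makes the point: for A = {0} ∪ {2} and B = {0} ∪ {1}, intersecting the hulls gives cl-conv(A) ∩ cl-conv(B) = [0, 2] ∩ [0, 1] = [0, 1], whereas intersecting first gives cl-conv(A ∩ B) = cl-conv({0}) = {0}, a strictly tighter set.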
44 Since repeated intersections of disjunctive sets with more than one element each lead to a combinatorial blow-up in the number of constraints, we propose to intersect every ambiguous disjunctive constraint with every non-ambiguous constraint as well as with Q. [sent-73, score-1.135]
45 This is also called a parallel reduction step [2]. [sent-74, score-0.105]
46 It results in the following convex relaxation of the constraints in Eq. [sent-75, score-0.237]
47 (5): (α, ξ) ∈ ⋂_{i∈I} cl-conv( ⋃_{x∈X_i} ( H_i(x) ∩ Q ∩ ⋂_{j∈J} H_j(x_j) ) ), (6) where we have abused the notation slightly and identified X_j = {x_j} for bags with one pattern. [sent-76, score-0.215]
48 The rationale for using this relaxation is that the resulting convex optimization problem is tractable and may provide a reasonably accurate approximation to the original disjunctive program, which can be further strengthened by using it in combination with branch-and-bound search. [sent-77, score-0.527]
49 There is a lift-and-project representation of the convex hulls in Eq. (6). [sent-78, score-0.057]
50 Assume a set of non-empty linear constraints H_i ≡ {z : A_i z ≥ b_i} ≠ ∅ is given. [sent-83, score-0.074]
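For reference, the lift-and-project representation from disjunctive programming [2] that the construction relies on can be sketched as follows; it is stated here as background under the usual boundedness and closure caveats, not quoted from the paper.

```latex
% Lift-and-project representation of the convex hull of a union of polyhedra
% H_i = { z : A_i z >= b_i } (cf. Balas [2]); valid up to closure and
% recession-cone technicalities, e.g. when the H_i are bounded.
\text{cl-conv}\Bigl(\bigcup_{i} H_i\Bigr)
  \;=\; \Bigl\{\, z \;:\; z = \sum_{i} z^{i},\quad
        A_i z^{i} \ge \eta_i b_i,\quad
        \sum_{i} \eta_i = 1,\quad
        \eta_i \ge 0 \ \ \forall i \,\Bigr\}.
```

Keeping the auxiliary variables z^i and η_i rather than projecting them out is what introduces the duplicated parameters and slack variables used below.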
51 We have derived an LP relaxation of the original disjunctive program for boosting with ambiguity. [sent-87, score-0.629]
52 This relaxation was obtained by a linearization of the original non-convex constraints. [sent-88, score-0.164]
53 Furthermore, we have demonstrated how this relaxation can be improved using parallel reduction steps. [sent-89, score-0.211]
54 Applying this linearization to every convex hull in Eq. [sent-90, score-0.161]
55 (6) individually, notice that one needs to introduce duplicates α^x, ξ^x of the parameters α and slack variables ξ for every x ∈ X_i. [sent-91, score-0.117]
56 ξ_j = Σ_{x∈X_i} ξ_j^x. (7c) The first margin constraint in Eq. [sent-93, score-0.125]
57 (7a) is the one associated with the specific pattern x, while the second set of margin constraints in Eq. [sent-94, score-0.227]
58 (7b) stems from the parallel reduction performed with unambiguous bags. [sent-95, score-0.253]
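Instantiating this recipe for a single ambiguous bag X_i after the parallel reduction gives constraints of roughly the following form; this is a reconstruction from the surviving fragments of Eqs. (7a)-(7c) and the lift-and-project sketch above, not a verbatim copy of the paper's equations.

```latex
% Sketch of the linearized constraints for one ambiguous bag X_i (i in I),
% after parallel reduction with the unambiguous bags j in J (a reconstruction).
\begin{gather*}
  y_i \sum\nolimits_k \alpha_k^x h_k(x) + \xi_i^x \;\ge\; \eta_i^x
      \quad \forall x \in X_i, \tag{7a} \\
  y_j \sum\nolimits_k \alpha_k^x h_k(x_j) + \xi_j^x \;\ge\; \eta_i^x
      \quad \forall x \in X_i,\ \forall j \in J, \tag{7b} \\
  \sum_{x \in X_i} \alpha_k^x = \alpha_k, \quad
  \sum_{x \in X_i} \xi_i^x = \xi_i, \quad
  \sum_{x \in X_i} \xi_j^x = \xi_j, \quad
  \sum_{x \in X_i} \eta_i^x = 1, \quad
  \alpha^x,\ \xi^x,\ \eta_i^x \ge 0. \tag{7c}
\end{gather*}
```

The dual multipliers of (7a) and (7b) are the variables u_i^x and u_j^x that appear in the dual constraints and in the appendix.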
59 One can calculate the dual LP of the above relaxation, the derivation of which can be found in the appendix. [sent-96, score-0.072]
60 The resulting program has a more complicated bound structure on the u-variables and the following crucial constraints involving the data: ∀i, ∀x ∈ X_i: y_i u_i^x h_k(x) + Σ_{j∈J} y_j u_j^x h_k(x_j) ≤ ρ_ik, with Σ_{i∈I} ρ_ik = 1 for each k. [sent-97, score-1.768]
61 As a result of linearization and parallel reductions, the number of parameters in the primal LP is now O(q · n + q · r), where q, r ≤ m denote the number of patterns in ambiguous and unambiguous bags, compared to O(n + m) for the standard LPBoost. [sent-99, score-0.642]
62 The number of constraints (variables in the dual) has also been inflated significantly from O(m) to O(q·r + p·n), where p ≤ q is the number of ambiguous bags. [sent-100, score-0.331]
63 In order to maintain the spirit of LPBoost in dealing efficiently with a large-scale linear program, we propose to keep the column selection scheme, selecting one or more columns α_k^x in every round. [sent-101, score-0.136]
64 Notice that the column selection cannot proceed independently across the duplicates because of the equality constraints Σ_{x∈X_i} α_k^x = α_k for all X_i; in particular, α_k^x > 0 implies α_k > 0, and whenever α_k > 0 we must have α_k^z > 0 for at least some z ∈ X_i for each X_i, i ∈ I. [sent-102, score-0.155]
65 We hence propose to simultaneously add all columns {α_k^x : x ∈ X_i, i ∈ I} involving the same weak hypothesis and to prune those back after each boosting round in order to exploit the expected sparseness of the solution. [sent-103, score-0.392]
66 In order to select a feature h_k, we compute the following score: S(k) = Σ_{i∈I} ρ̄_ik − 1, where ρ̄_ik ≡ max_{x∈X_i} [ y_i u_i^x h_k(x) + Σ_{j∈J} y_j u_j^x h_k(x_j) ]. (9) [sent-104, score-2.02]
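A direct transcription of this scoring rule into code could look as follows; the dictionary layout and names (bags_I, u_amb, u_par, and so on) are illustrative assumptions about how bags, labels, and the current dual values are stored, not the paper's data structures.

```python
def dpboost_feature_score(k, bags_I, X_J, y_I, y_J, u_amb, u_par, h):
    """Sketch of the score S(k) of Eq. (9) for weak hypothesis h_k.

    bags_I : dict i -> list of patterns in ambiguous bag X_i
    X_J    : dict j -> the single pattern x_j of unambiguous bag j
    y_I    : dict i -> bag label y_i;  y_J : dict j -> label y_j
    u_amb  : dict (i, xidx) -> dual value u_i^x of constraint (7a)
    u_par  : dict (i, xidx, j) -> dual value u_j^x of constraint (7b)
    h      : callable h(k, pattern) -> value in {-1, +1}
    """
    S = 0.0
    for i, X_i in bags_I.items():
        rho_bar = max(
            y_I[i] * u_amb[(i, xidx)] * h(k, x)
            + sum(y_J[j] * u_par[(i, xidx, j)] * h(k, xj) for j, xj in X_J.items())
            for xidx, x in enumerate(X_i)
        )
        S += rho_bar
    return S - 1.0
```

A feature with S(k) > 0 violates its dual constraint and is a candidate for the next boosting round.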
67 Notice that due to the block structure of the tableau, working with a reduced set of columns also eliminates a large number of inequalities (rows). [sent-105, score-0.075]
68 However, the large set of q · r inequalities for the parallel reductions is still prohibitive. [sent-106, score-0.185]
69 In order to address this problem, we propose to perform incremental row selection in an outer loop. [sent-107, score-0.099]
70 Once we have converged to a column basis for the current relaxed LP, we add a subset of rows corresponding to the most useful parallel reductions. [sent-108, score-0.142]
71 One can use the magnitude of the margin violation as a heuristic to perform this row selection. [sent-109, score-0.187]
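A sketch of this row-selection heuristic is given below; it assumes the current restricted-LP solution exposes the duplicated primal values (the names and dictionary layout are again illustrative, with rows not yet in the LP treated as having zero slack).

```python
def most_violated_rows(alpha_x, xi_par, eta_x, h_values, y_J, n_rows=50):
    """Rank the not-yet-added parallel-reduction inequalities (7b) by violation.

    alpha_x  : dict (i, xidx) -> current duplicated weight vector alpha^x
    xi_par   : dict (i, xidx, j) -> current slack xi_j^x (0.0 if the row is absent)
    eta_x    : dict (i, xidx) -> current eta_i^x
    h_values : dict j -> list of weak-hypothesis outputs (h_1(x_j), ..., h_n(x_j))
    y_J      : dict j -> label of unambiguous bag j
    Returns the n_rows most violated (i, xidx, j) triples.
    """
    scored = []
    for (i, xidx), a in alpha_x.items():
        for j, h_j in h_values.items():
            margin = y_J[j] * sum(ak * hk for ak, hk in zip(a, h_j))
            slack = xi_par.get((i, xidx, j), 0.0)
            violation = eta_x[(i, xidx)] - (margin + slack)  # > 0: (7b) violated
            if violation > 0.0:
                scored.append((violation, (i, xidx, j)))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [row for _, row in scored[:n_rows]]
```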
72 Although the margin constraints imposed by unambiguous training instances (xj , yj ) are redundant after we performed the parallel reduction step in Eq. [sent-111, score-0.609]
73 (6), we add them to the problem, because this will give us a better starting point with respect to the row selection process, and may lead to a sparser solution. [sent-112, score-0.099]
74 We hence add the following constraints to the primal: y_j Σ_k α_k h_k(x_j) + ξ_j ≥ 1, ∀j ∈ J, (11) which will introduce additional dual variables u_j, j ∈ J. [sent-113, score-0.859]
75 Notice that in the worst case where all inequalities imposed by ambiguous training instances Xi are vacuous, this will make sure that one recovers the standard LPBoost formulation on the unambiguous examples. [sent-114, score-0.565]
76 One can then think of the row generation process as a way of deriving useful information from ambiguous examples. [sent-115, score-0.317]
77 This information takes the form of linear inequalities in the high-dimensional representation of the convex hull and will sequentially reduce the version space. [sent-116, score-0.153]
78 The three plots correspond to data sets of size |I| = 10, 20, 30. [sent-122, score-0.036]
79 4 Experiments We generated a set of synthetic weakly labeled data sets to evaluate DPBoost on a small scale. [sent-123, score-0.156]
80 These were multiple-instance data sets, where the label uncertainty was asymmetric; the only ambiguous bags (|Xi | > 1) were positive. [sent-124, score-0.502]
81 More specifically, we generated instances x ∈ [0, 1] × [0, 1] sampled uniformly at random from the white (yi = 1) and black (yi = −1) regions of Figure 1, leaving the intermediate gray area as a separating margin. [sent-125, score-0.046]
82 The degree of ambiguity was controlled by generating ambiguous bags of size k ∼ Poisson(λ) having only one positive and k − 1 negative patterns. [sent-126, score-0.579]
83 To control data set size, we generated a pre-specified number of ambiguous bags, and the same number of singleton unambiguous bags. [sent-127, score-0.405]
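The data generation described above can be sketched as follows; the region geometry of Figure 1 is not recoverable here, so the sampler below uses two strips of the unit square separated by a gray margin, and the handling of bag sizes k < 2 and of the singleton labels are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_region(label):
    """Stand-in for drawing x uniformly from the white (y = +1) or black (y = -1)
    region of Figure 1; the strip layout is an assumption, not the paper's figure."""
    lo, hi = (0.55, 1.0) if label == 1 else (0.0, 0.45)
    return np.array([rng.uniform(0.0, 1.0), rng.uniform(lo, hi)])

def make_weakly_labeled_data(n_ambiguous, lam=3.0):
    """Ambiguous positive bags of size k ~ Poisson(lam) with one positive and
    k - 1 negative patterns, plus the same number of singleton unambiguous bags."""
    bags, labels = [], []
    for _ in range(n_ambiguous):
        k = max(2, int(rng.poisson(lam)))        # enforce |X_i| >= 2 (assumption)
        bag = [sample_from_region(+1)] + [sample_from_region(-1) for _ in range(k - 1)]
        bags.append(bag)
        labels.append(+1)
    for _ in range(n_ambiguous):                 # singleton bags; the label
        y = int(rng.choice([-1, +1]))            # balance is an assumption
        bags.append([sample_from_region(y)])
        labels.append(y)
    return bags, labels
```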
84 The second variant uses the true pattern-level labels to prune the negative examples from ambiguous bags and solves the smaller supervised problem with LPBoost as above. [sent-134, score-0.586]
85 At the other extreme, the third variant assumes the ambiguous pattern labels are equal to their respective bag labels. [sent-136, score-0.4]
86 Figure 2 shows the discriminant boundary (black line), learned by each of the four algorithms for a data set generated with λ = 3 and having 20 ambiguous bags (i.e. |I| = 20). [sent-138, score-0.522]
87 The ambiguous patterns are marked by “o”, unambiguous ones “x”, and the background is shaded to indicate the value of the ensemble F (x) (clamped to [−3, 3]). [sent-144, score-0.547]
88 It is clear from the shading that the ensemble has a small number of active features for the DPBoost, perfect-selector, and perfect-knowledge algorithms. [sent-145, score-0.099]
89 The sparsity of the dual variables was also verified; less than 20 percent of the dual variables and reductions were active. [sent-147, score-0.275]
90 We ran 5-fold cross-validation on the synthetic data sets for λ = 1, 3, 5, 7 and for data sets having |I| = 10, 20, 30. [sent-148, score-0.112]
91 5 Conclusion We have presented a new learning algorithm for classification problems where labels are associated with sets of patterns instead of individual patterns. [sent-155, score-0.158]
92 Using synthetic data, we have demonstrated the expected behaviour of the algorithm. [sent-156, score-0.04]
93 Our current implementation could not handle large data sets, and so improvements, followed by a large-scale validation and comparison to other algorithms using benchmark MIL data sets, will follow. [sent-157, score-0.035]
94 Disjunctive programming and a hierarchy of relaxations for discrete optimization problems. [sent-166, score-0.196]
95 Boosting in the limit: Maximizing the margin of learned ensembles. [sent-205, score-0.088]
96 Constructing boosting algorithms from SVMs: an application to one-class classification. [sent-238, score-0.127]
97 Appendix The primal variables are α_k, α_k^x, ξ_i, ξ_i^x, ξ_j^x, and η_i^x. [sent-245, score-0.098]
98 The dual variables are u_i^x and u_j^x for the margin constraints, and ρ_ik, σ_i, and θ_i for the equality constraints on α_k, ξ, and η, respectively. [sent-246, score-0.619]
99 The Lagrangian L combines the primal objective Σ_k α_k + C Σ_i ξ_i with the margin and equality constraints, weighted by the dual variables u_i^x, u_j^x, ρ_ik, σ_i, σ_ij, and θ_i. [sent-247, score-0.287]
100 Setting the derivatives of L with respect to the primal variables to zero leads to the following dual: max Σ_i θ_i, subject to the constraints on the dual variables given above. [sent-251, score-0.136]
wordName wordTfidf (topN-words)
[('hk', 0.439), ('disjunctive', 0.34), ('ux', 0.287), ('lpboost', 0.267), ('ambiguous', 0.257), ('bags', 0.215), ('dpboost', 0.17), ('ui', 0.162), ('xi', 0.152), ('unambiguous', 0.148), ('boosting', 0.127), ('mil', 0.121), ('ik', 0.113), ('ambiguity', 0.107), ('relaxation', 0.106), ('yi', 0.102), ('ensemble', 0.099), ('margin', 0.088), ('yj', 0.084), ('lp', 0.083), ('programming', 0.08), ('xj', 0.079), ('constraints', 0.074), ('hi', 0.074), ('tableau', 0.073), ('dual', 0.072), ('parallel', 0.072), ('relaxations', 0.068), ('uj', 0.064), ('primal', 0.064), ('reductions', 0.063), ('linearization', 0.058), ('convex', 0.057), ('labels', 0.057), ('program', 0.056), ('notice', 0.052), ('bag', 0.051), ('discriminant', 0.05), ('inequalities', 0.05), ('ioannis', 0.049), ('labeled', 0.048), ('classi', 0.047), ('hull', 0.046), ('instances', 0.046), ('score', 0.045), ('hypothesis', 0.044), ('patterns', 0.043), ('sparseness', 0.043), ('andrews', 0.042), ('column', 0.042), ('weak', 0.04), ('synthetic', 0.04), ('selection', 0.039), ('providence', 0.038), ('violation', 0.038), ('tsochantaridis', 0.038), ('constraint', 0.037), ('sets', 0.036), ('annotations', 0.036), ('stuart', 0.036), ('intersect', 0.036), ('francisco', 0.035), ('pattern', 0.035), ('imposed', 0.035), ('benchmark', 0.035), ('successively', 0.034), ('variables', 0.034), ('instance', 0.033), ('reduction', 0.033), ('row', 0.032), ('weakly', 0.032), ('prune', 0.031), ('transductive', 0.031), ('slack', 0.031), ('violated', 0.031), ('label', 0.03), ('di', 0.03), ('kaufmann', 0.03), ('associated', 0.03), ('tsch', 0.029), ('heuristic', 0.029), ('training', 0.029), ('propose', 0.028), ('san', 0.028), ('add', 0.028), ('asymmetric', 0.028), ('deriving', 0.028), ('morgan', 0.027), ('naive', 0.027), ('spirit', 0.027), ('supervised', 0.026), ('round', 0.026), ('columns', 0.025), ('ij', 0.025), ('optimization', 0.024), ('hierarchy', 0.024), ('brown', 0.024), ('combinatorial', 0.024), ('soft', 0.023), ('hypotheses', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
Author: Stuart Andrews, Thomas Hofmann
Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of patterns instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1
2 0.15126863 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
3 0.14802195 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
4 0.13357358 112 nips-2003-Learning to Find Pre-Images
Author: Jason Weston, Bernhard Schölkopf, Gökhan H. Bakir
Abstract: We consider the problem of reconstructing patterns from a feature map. Learning algorithms using kernels to operate in a reproducing kernel Hilbert space (RKHS) express their solutions in terms of input points mapped into the RKHS. We introduce a technique based on kernel principal component analysis and regression to reconstruct corresponding patterns in the input space (aka pre-images) and review its performance in several applications requiring the construction of pre-images. The introduced technique avoids difficult and/or unstable numerical optimization, is easy to implement and, unlike previous methods, permits the computation of pre-images in discrete input spaces. 1
5 0.13198806 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
6 0.1214907 117 nips-2003-Linear Response for Approximate Inference
7 0.10611502 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
8 0.092602558 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
9 0.085511275 1 nips-2003-1-norm Support Vector Machines
10 0.085023597 88 nips-2003-Image Reconstruction by Linear Programming
11 0.084370919 41 nips-2003-Boosting versus Covering
12 0.075698316 113 nips-2003-Learning with Local and Global Consistency
13 0.074084006 133 nips-2003-Mutual Boosting for Contextual Inference
14 0.0737582 188 nips-2003-Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects
15 0.073754832 100 nips-2003-Laplace Propagation
16 0.072647519 48 nips-2003-Convex Methods for Transduction
17 0.07099428 126 nips-2003-Measure Based Regularization
18 0.065254882 121 nips-2003-Log-Linear Models for Label Ranking
19 0.064074218 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
20 0.057699062 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
topicId topicWeight
[(0, -0.208), (1, -0.108), (2, -0.064), (3, -0.131), (4, 0.077), (5, -0.092), (6, -0.111), (7, -0.095), (8, 0.004), (9, -0.022), (10, 0.071), (11, 0.033), (12, -0.038), (13, 0.008), (14, -0.088), (15, 0.094), (16, 0.043), (17, 0.011), (18, -0.127), (19, -0.02), (20, 0.026), (21, 0.034), (22, -0.177), (23, 0.026), (24, 0.042), (25, -0.103), (26, 0.032), (27, -0.175), (28, -0.096), (29, -0.02), (30, -0.026), (31, 0.02), (32, 0.012), (33, 0.002), (34, -0.012), (35, -0.025), (36, 0.078), (37, -0.054), (38, -0.081), (39, 0.028), (40, 0.055), (41, -0.03), (42, -0.109), (43, 0.109), (44, -0.036), (45, -0.031), (46, -0.032), (47, -0.033), (48, -0.147), (49, 0.004)]
simIndex simValue paperId paperTitle
same-paper 1 0.94930363 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
Author: Stuart Andrews, Thomas Hofmann
Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of patterns instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1
2 0.71182281 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
3 0.68840235 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
4 0.67619097 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
Author: Noam Shental, Aharon Bar-hillel, Tomer Hertz, Daphna Weinshall
Abstract: Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.
5 0.54118693 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
6 0.53847033 117 nips-2003-Linear Response for Approximate Inference
7 0.47606978 1 nips-2003-1-norm Support Vector Machines
8 0.46416742 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
9 0.45687208 112 nips-2003-Learning to Find Pre-Images
10 0.44672596 100 nips-2003-Laplace Propagation
11 0.4211089 108 nips-2003-Learning a Distance Metric from Relative Comparisons
12 0.41770178 48 nips-2003-Convex Methods for Transduction
13 0.39510152 126 nips-2003-Measure Based Regularization
14 0.39336437 120 nips-2003-Locality Preserving Projections
15 0.37515545 41 nips-2003-Boosting versus Covering
16 0.35541999 88 nips-2003-Image Reconstruction by Linear Programming
17 0.35345271 113 nips-2003-Learning with Local and Global Consistency
18 0.34425905 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
19 0.34301561 25 nips-2003-An MCMC-Based Method of Comparing Connectionist Models in Cognitive Science
20 0.33673981 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
topicId topicWeight
[(0, 0.067), (11, 0.04), (30, 0.013), (33, 0.268), (35, 0.069), (53, 0.097), (71, 0.123), (76, 0.042), (85, 0.075), (91, 0.065), (99, 0.036)]
simIndex simValue paperId paperTitle
1 0.84275645 176 nips-2003-Sequential Bayesian Kernel Regression
Author: Jaco Vermaak, Simon J. Godsill, Arnaud Doucet
Abstract: We propose a method for sequential Bayesian kernel regression. As is the case for the popular Relevance Vector Machine (RVM) [10, 11], the method automatically identifies the number and locations of the kernels. Our algorithm overcomes some of the computational difficulties related to batch methods for kernel regression. It is non-iterative, and requires only a single pass over the data. It is thus applicable to truly sequential data sets and batch data sets alike. The algorithm is based on a generalisation of Importance Sampling, which allows the design of intuitively simple and efficient proposal distributions for the model parameters. Comparative results on two standard data sets show our algorithm to compare favourably with existing batch estimation strategies.
same-paper 2 0.83242726 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
Author: Stuart Andrews, Thomas Hofmann
Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of patterns instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1
3 0.7972998 172 nips-2003-Semi-Supervised Learning with Trees
Author: Charles Kemp, Thomas L. Griffiths, Sean Stromsten, Joshua B. Tenenbaum
Abstract: We describe a nonparametric Bayesian approach to generalizing from few labeled examples, guided by a larger set of unlabeled objects and the assumption of a latent tree-structure to the domain. The tree (or a distribution over trees) may be inferred using the unlabeled data. A prior over concepts generated by a mutation process on the inferred tree(s) allows efficient computation of the optimal Bayesian classification function from the labeled examples. We test our approach on eight real-world datasets. 1
4 0.70361364 3 nips-2003-AUC Optimization vs. Error Rate Minimization
Author: Corinna Cortes, Mehryar Mohri
Abstract: The area under an ROC curve (AUC) is a criterion used in many applications to measure the quality of a classification algorithm. However, the objective function optimized in most of these algorithms is the error rate and not the AUC value. We give a detailed statistical analysis of the relationship between the AUC and the error rate, including the first exact expression of the expected value and the variance of the AUC for a fixed error rate. Our results show that the average AUC is monotonically increasing as a function of the classification accuracy, but that the standard deviation for uneven distributions and higher error rates is noticeable. Thus, algorithms designed to minimize the error rate may not lead to the best possible AUC values. We show that, under certain conditions, the global function optimized by the RankBoost algorithm is exactly the AUC. We report the results of our experiments with RankBoost in several datasets demonstrating the benefits of an algorithm specifically designed to globally optimize the AUC over other existing algorithms optimizing an approximation of the AUC or only locally optimizing the AUC. 1
5 0.59709883 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
6 0.59653825 117 nips-2003-Linear Response for Approximate Inference
7 0.59294474 189 nips-2003-Tree-structured Approximations by Expectation Propagation
8 0.59106243 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
9 0.59023595 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
10 0.58892339 58 nips-2003-Efficient Multiscale Sampling from Products of Gaussian Mixtures
11 0.58743358 41 nips-2003-Boosting versus Covering
12 0.58737016 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
13 0.58706838 100 nips-2003-Laplace Propagation
14 0.58532196 145 nips-2003-Online Classification on a Budget
15 0.58082891 126 nips-2003-Measure Based Regularization
16 0.57756197 113 nips-2003-Learning with Local and Global Consistency
17 0.57709885 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
18 0.57650989 158 nips-2003-Policy Search by Dynamic Programming
19 0.57607466 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
20 0.57536519 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors