nips nips2003 nips2003-124 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. [sent-4, score-0.124]
2 However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. [sent-6, score-0.231]
3 Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. [sent-7, score-0.138]
4 Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. [sent-8, score-0.271]
5 In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. [sent-9, score-0.247]
6 We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. [sent-10, score-0.141]
7 We provide a new theoretical bound for generalization in structured domains. [sent-11, score-0.19]
8 Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. [sent-12, score-0.301]
9 In both of these cases, we need to assign multiple labels simultaneously, leading to a classification problem that has an exponentially large set of joint labels. [sent-23, score-0.165]
10 An alternative approach is offered by the probabilistic framework, and specifically by probabilistic graphical models. [sent-26, score-0.153]
11 In this case, we can define and learn a joint probabilistic model over the set of label variables. [sent-27, score-0.179]
12 This approach has the advantage of exploiting the correlations between the different labels, often resulting in significant improvements in accuracy over approaches that classify instances independently [7, 10]. [sent-29, score-0.165]
13 Unfortunately, even probabilistic graphical models that are trained discriminatively do not usually achieve the same level of generalization accuracy as SVMs, especially when kernel features are used. [sent-31, score-0.297]
14 Our approach defines a log-linear Markov network over a set of label variables (e.g., [sent-35, score-0.205]
15 the labels of the letters in an OCR problem); this network allows us to represent the correlations between these label variables. [sent-37, score-0.343]
16 For Markov networks that can be triangulated tractably, the resulting quadratic program (QP) has an equivalent polynomial-size formulation (e. [sent-39, score-0.186]
17 We also show a generalization bound for such margin-based classifiers. [sent-45, score-0.131]
18 Unlike previous results [3], our bound grows logarithmically rather than linearly with the number of label variables. [sent-46, score-0.148]
19 Our experimental results on character recognition and on hypertext classification demonstrate dramatic improvements in accuracy over both kernel-based instance-by-instance classification and probabilistic models. [sent-47, score-0.298]
20 A common choice is the linear family: Given n real-valued basis functions f_j : X × Y → IR, a hypothesis h_w ∈ H is defined by a set of n coefficients w_j such that: h_w(x) = arg max_y Σ_{j=1..n} w_j f_j(x, y) = arg max_y w⊤f(x, y), (1) where the f(x, y) are features or basis functions. [sent-53, score-0.41]
21 In a webpage collective classification task [10], each Yi is a webpage label, whereas Y is a joint label for an entire website. [sent-66, score-0.277]
22 In these cases, the number of possible assignments to Y is exponential in the number of labels l. [sent-67, score-0.138]
23 Thus, both representing the basis functions f_j(x, y) in (1) and computing the maximization arg max_y are infeasible. [sent-68, score-0.242]
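To make the blow-up concrete, the following sketch (our own illustration, not code from the paper) evaluates the arg max in (1) by brute force over the joint label space; the feature map and all names are hypothetical.

```python
# Illustrative sketch: brute-force evaluation of h_w(x) = argmax_y w . f(x, y)
# by enumerating every joint label y in {0, ..., k-1}^l.  The loop visits
# k**l assignments, the exponential blow-up that motivates exploiting structure.
import itertools
import numpy as np

def brute_force_argmax(w, feature_fn, x, num_labels, num_classes):
    """Return the joint labeling y maximizing w . f(x, y) by exhaustive search."""
    best_y, best_score = None, -np.inf
    for y in itertools.product(range(num_classes), repeat=num_labels):
        score = float(np.dot(w, feature_fn(x, y)))
        if score > best_score:
            best_y, best_score = y, score
    return best_y, best_score

# Toy example (hypothetical): 3 binary labels, one indicator feature per (position, class).
def toy_features(x, y):
    f = np.zeros(6)
    for i, yi in enumerate(y):
        f[2 * i + yi] = x[i]
    return f

w = np.array([0.1, 0.9, 0.8, 0.2, 0.3, 0.7])
x = np.array([1.0, 1.0, 1.0])
print(brute_force_argmax(w, toy_features, x, num_labels=3, num_classes=2))
```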
24 The advantage of the probabilistic framework is that it can exploit sparseness in the correlations between labels Yi . [sent-72, score-0.282]
25 For example, in the OCR task, we might use a Markov model, where Yi is conditionally independent of the rest of the labels given Yi−1 , Yi+1 . [sent-73, score-0.138]
26 A pairwise Markov network is defined as a graph G = (Y, E), where each edge (i, j) is associated with a potential function ψij (x, yi , yj ). [sent-77, score-0.921]
27 The network encodes a joint conditional probability distribution as P(y | x) ∝ ∏_{(i,j)∈E} ψ_ij(x, y_i, y_j). [sent-78, score-0.869]
28 These networks exploit the interaction structure to parameterize a classifier very compactly. [sent-79, score-0.118]
29 In some cases (e.g., tree-structured networks), we can use effective dynamic programming algorithms (such as the Viterbi algorithm) to find the highest-probability label y; in others, we can use approximate inference algorithms that also exploit the structure [12]. [sent-82, score-0.174]
30 The Markov network distribution is simply a log-linear model, with the pairwise potential ψij (x, yi , yj ) representing (in log-space) a sum of basis functions over x, yi , yj . [sent-83, score-1.653]
31 We can therefore parameterize such a model using a set of pairwise basis functions f(x, y_i, y_j) for (i, j) ∈ E. [sent-84, score-0.501]
32 We assume for simplicity of notation that all edges in the graph denote the same type of interaction, so that we can define a set of features f_k(x, y) = Σ_{(i,j)∈E} f_k(x, y_i, y_j). (2) [sent-85, score-0.923]
33 The network potentials are then ψ_ij(x, y_i, y_j) = exp[Σ_{k=1..n} w_k f_k(x, y_i, y_j)] = exp[w⊤f(x, y_i, y_j)]. [sent-86, score-2.343]
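For a chain-structured network with these log-linear potentials, the highest-probability labeling mentioned above can be found by Viterbi-style dynamic programming; the sketch below is our own minimal illustration, assuming a simple chain with k classes per node and hypothetical log-potentials w⊤f(x, y_i, y_{i+1}).

```python
# Minimal sketch: MAP decoding in a chain Markov network whose edge potentials
# are log-linear, psi_i(y_i, y_{i+1}) = exp(w . f(x, y_i, y_{i+1})).
# Dynamic programming finds argmax_y sum_i w . f(x, y_i, y_{i+1}) in O(l * k^2).
import numpy as np

def viterbi_map(log_edge_potentials):
    """log_edge_potentials: list of (k, k) arrays, one per edge (i, i+1)."""
    k = log_edge_potentials[0].shape[0]
    score = np.zeros(k)                     # best score of a path ending in each state
    backptr = []
    for theta in log_edge_potentials:
        cand = score[:, None] + theta       # cand[a, b]: come from state a, go to state b
        backptr.append(cand.argmax(axis=0))
        score = cand.max(axis=0)
    y = [int(score.argmax())]               # best final state, then follow back-pointers
    for bp in reversed(backptr):
        y.append(int(bp[y[-1]]))
    return list(reversed(y))

# Toy run: 4 labels, 2 classes, randomly generated (hypothetical) log-potentials.
rng = np.random.default_rng(0)
log_pots = [rng.normal(size=(2, 2)) for _ in range(3)]
print(viterbi_map(log_pots))
```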
34 For single-label multi-class classification, Crammer and Singer [5] provide a natural extension of this framework by maximizing the margin γ subject to constraints: maximize γ s.t. w⊤∆f_x(y) ≥ γ, ∀y ≠ t(x); ||w||² ≤ 1, where ∆f_x(y) = f(x, t(x)) − f(x, y). [sent-92, score-0.158]
35 The constraints in this formulation ensure that arg max_y w⊤f(x, y) = t(x). [sent-95, score-0.244]
36 In structured problems, where we are predicting multiple labels, the loss function is usually not the simple 0-1 loss I(arg max_y w⊤f_x(y) ≠ t(x)), but a per-label loss, such as the proportion of incorrect labels predicted. [sent-97, score-0.805]
37 In order to extend the margin-based framework to the multi-label setting, we must generalize the notion of margin to take into account the number of labels in y that are misclassified. [sent-98, score-0.227]
38 In particular, we would like the margin between t(x) and y to scale linearly with the number of wrong labels in y, ∆t_x(y): maximize γ s.t. w⊤∆f_x(y) ≥ γ ∆t_x(y), ∀y; ||w||² ≤ 1. [sent-99, score-0.262]
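As a small illustration (ours, with hypothetical function names), the per-label loss ∆t_x(y) is simply the number of labels in y that disagree with the true labeling t(x), and the scaled-margin condition compares the score gap against γ ∆t_x(y):

```python
# Sketch of the per-label (Hamming) loss and the scaled-margin check
# w . f(x, t(x)) - w . f(x, y) >= gamma * Delta_t_x(y).
import numpy as np

def per_label_loss(y, t):
    """Number of wrong labels in candidate y with respect to truth t."""
    return int(sum(yi != ti for yi, ti in zip(y, t)))

def satisfies_scaled_margin(w, feature_fn, x, t, y, gamma):
    margin = float(np.dot(w, feature_fn(x, t)) - np.dot(w, feature_fn(x, y)))
    return margin >= gamma * per_label_loss(y, t)
```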
39 We can now present the complete form of our optimization problem, as well as the equivalent dual problem [2]: Primal formulation (6): min ½||w||² + C Σ_x ξ_x, s.t. w⊤∆f_x(y) ≥ ∆t_x(y) − ξ_x, ∀x, ∀y. [sent-107, score-0.191]
40 (Note: for each x, we add an extra dual variable α_x(t(x)), with no effect on the solution.) [sent-113, score-0.146]
41 Exploiting structure in M3 networks (Section 4): Unfortunately, both the number of constraints in the primal QP in (6) and the number of variables in the dual QP in (7) are exponential in the number of labels l. [sent-114, score-0.462]
42 Our main insight is that the variables α_x(y) in the dual formulation (7) can be interpreted as a density function over y conditional on x, since Σ_y α_x(y) = C and α_x(y) ≥ 0. [sent-116, score-0.283]
43 The dual objective is a function of expectations of ∆tx (y) and ∆fx (y) with respect to αx (y). [sent-117, score-0.185]
44 Since both ∆t_x(y) = Σ_i ∆t_x(y_i) and ∆f_x(y) = Σ_{(i,j)} ∆f_x(y_i, y_j) are sums of functions over nodes and edges, we only need node and edge marginals of the measure α_x(y) to compute their expectations. [sent-118, score-0.479]
45 We define the marginal dual variables as follows: µ_x(y_i, y_j) = Σ_{y∼[y_i,y_j]} α_x(y), ∀(i,j) ∈ E, ∀y_i, y_j, ∀x; µ_x(y_i) = Σ_{y∼[y_i]} α_x(y), ∀i, ∀y_i, ∀x; (8) where y ∼ [y_i, y_j] denotes a full assignment y consistent with the partial assignment y_i, y_j. [sent-119, score-2.187]
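The marginalization in (8) can be sketched as follows for a toy chain; this is our own illustration, with α stored explicitly as a dictionary over all joint labelings (feasible only for tiny examples), and all names hypothetical.

```python
# Sketch of (8): accumulate edge marginals mu_x(y_i, y_j) and node marginals
# mu_x(y_i) from a dual "density" alpha_x(y) over joint labelings of a small chain.
import itertools
from collections import defaultdict

def marginal_duals(alpha, num_labels, edges):
    """alpha: dict mapping a joint labeling tuple y to its weight alpha_x(y)."""
    mu_node = defaultdict(float)
    mu_edge = defaultdict(float)
    for y, a in alpha.items():
        for i in range(num_labels):
            mu_node[(i, y[i])] += a
        for (i, j) in edges:
            mu_edge[(i, j, y[i], y[j])] += a
    return mu_node, mu_edge

# Toy example: 3 binary labels on a chain, uniform alpha with total mass C = 1.
C = 1.0
labelings = list(itertools.product(range(2), repeat=3))
alpha = {y: C / len(labelings) for y in labelings}
mu_node, mu_edge = marginal_duals(alpha, 3, edges=[(0, 1), (1, 2)])
# Node marginals at any position sum to C, mirroring sum_y alpha_x(y) = C.
assert abs(sum(v for (i, _), v in mu_node.items() if i == 0) - C) < 1e-12
```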
46 Now we can reformulate our entire QP (7) in terms of these dual variables. [sent-120, score-0.146]
47 Σ_y α_x(y)∆t_x(y) = Σ_y α_x(y) Σ_i ∆t_x(y_i) = Σ_{i,y_i} ∆t_x(y_i) Σ_{y∼[y_i]} α_x(y) = Σ_{i,y_i} µ_x(y_i)∆t_x(y_i). The decomposition of the second term in the objective uses edge marginals µ_x(y_i, y_j). [sent-122, score-0.518]
48 In order to produce an equivalent QP, however, we must also ensure that the dual variables µx (yi , yj ), µx (yi ) are the marginals resulting from a legal density α(y); that is, that they belong to the marginal polytope [4]. [sent-123, score-0.708]
49 In particular, we must enforce consistency between the pairwise and singleton marginals (and hence between overlapping pairwise marginals): Σ_{y_i} µ_x(y_i, y_j) = µ_x(y_j), ∀y_j, ∀(i,j) ∈ E, ∀x. (9) [sent-124, score-0.539]
50 If the Markov network for our basis functions is a forest (singly connected), these constraints are equivalent to the requirement that the µ variables arise from a density. [sent-125, score-0.613]
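The consistency constraints (9) can be checked directly on the marginal dual variables; the helper below is a hypothetical illustration using the same dictionary representation as the previous sketch.

```python
# Sketch of checking (9): for every edge (i, j) and every y_j,
# sum_{y_i} mu_x(y_i, y_j) must equal the node marginal mu_x(y_j).
def check_consistency(mu_node, mu_edge, edges, num_classes, tol=1e-9):
    for (i, j) in edges:
        for yj in range(num_classes):
            s = sum(mu_edge.get((i, j, yi, yj), 0.0) for yi in range(num_classes))
            if abs(s - mu_node.get((j, yj), 0.0)) > tol:
                return False
    return True
```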
51 Therefore, the following factored dual QP is equivalent to the original dual QP: [sent-126, score-0.365]
52 max Σ_x Σ_{i,y_i} µ_x(y_i)∆t_x(y_i) − ½ Σ_{x,x̂} Σ_{(i,j),(r,s)} Σ_{y_i,y_j} Σ_{y_r,y_s} µ_x(y_i, y_j) µ_x̂(y_r, y_s) f_x(y_i, y_j)⊤ f_x̂(y_r, y_s), s.t. Σ_{y_i} µ_x(y_i, y_j) = µ_x(y_j); Σ_{y_i} µ_x(y_i) = C; µ_x(y_i, y_j) ≥ 0. (10) [sent-128, score-2.864]
53 Similarly, the original primal can be factored as follows: [sent-129, score-0.489]
54 min ½||w||² + C Σ_x Σ_i ξ_{x,i} + C Σ_x Σ_{(i,j)} ξ_{x,ij}, s.t. w⊤∆f_x(y_i, y_j) + m_{x,i}(y_j) + m_{x,j}(y_i) ≥ −ξ_{x,ij}; Σ_{(i,j)∈E} m_{x,j}(y_i) ≥ ∆t_x(y_i) − ξ_{x,i}; ξ_{x,ij} ≥ 0, ξ_{x,i} ≥ 0. (11) [sent-131, score-0.376]
55 The solution to the factored dual gives us: w = Σ_x Σ_{(i,j)} Σ_{y_i,y_j} µ_x(y_i, y_j)∆f_x(y_i, y_j). [sent-132, score-1.348]
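Recovering the primal weights from the factored dual solution is then just a weighted sum of edge feature differences; a minimal sketch (ours, with a hypothetical ∆f callable):

```python
# Sketch of w = sum_x sum_{(i,j)} sum_{y_i, y_j} mu_x(y_i, y_j) * Delta_f_x(y_i, y_j).
import numpy as np

def recover_weights(mu_edge_per_example, delta_f, dim):
    """mu_edge_per_example: list of dicts {(i, j, yi, yj): mu}, one per example x;
    delta_f(x_idx, i, j, yi, yj) returns the feature-difference vector (hypothetical)."""
    w = np.zeros(dim)
    for x_idx, mu_edge in enumerate(mu_edge_per_example):
        for (i, j, yi, yj), mu in mu_edge.items():
            w += mu * delta_f(x_idx, i, j, yi, yj)
    return w
```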
56 For example, if our graph is a 4-cycle Y1 —Y2 —Y3 —Y4 —Y1 , we might triangulate the graph by adding an arc Y1 —Y3 , and introducing η variables over joint instantiations of the cliques Y1 , Y2 , Y3 and Y1 , Y3 , Y4 . [sent-137, score-0.137]
57 The η variables appear only in the constraints; they do not add any new basis functions nor change the objective function. [sent-139, score-0.139]
58 In fact, BP makes an additional approximation, using not only the relaxed marginal polytope but also an approximate objective (Bethe free-energy) [12]. [sent-146, score-0.11]
59 As with SVMs [11], the factored dual formulation in (10) uses only dot products between basis functions. [sent-149, score-0.312]
60 In particular, we define our basis functions by f_x(y_i, y_j) = ρ(y_i, y_j)φ_ij(x), i.e., [sent-151, score-1.195]
61 the product of a selector function ρ(y_i, y_j) with a possibly infinite feature vector φ_ij(x). [sent-153, score-0.376]
62 For example, in the OCR task, ρ(yi , yj ) could be an indicator function over the class of two adjacent characters i and j, and φij (x) could be an RBF kernel on the images of these two characters. [sent-154, score-0.434]
63 The operation f_x(y_i, y_j)⊤ f_x̂(y_r, y_s) used in the objective function of the factored dual QP is now ρ(y_i, y_j)ρ(y_r, y_s) K_φ(x, i, j, x̂, r, s), where K_φ(x, i, j, x̂, r, s) = φ_ij(x) · φ_rs(x̂) is the kernel function for the feature φ. [sent-155, score-1.938]
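A minimal sketch (our illustration) of this kernelized inner product, assuming an RBF kernel on raw edge inputs as in the OCR example; the function names and inputs are hypothetical.

```python
# Sketch of f_x(y_i, y_j) . f_xhat(y_r, y_s)
#   = rho(y_i, y_j) * rho(y_r, y_s) * K_phi(x, i, j, xhat, r, s),
# with an RBF kernel playing the role of K_phi on the raw edge inputs.
import numpy as np

def rbf(u, v, sigma=1.0):
    return float(np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2)))

def edge_basis_kernel(rho_ij, rho_rs, phi_input_ij, phi_input_rs, sigma=1.0):
    """rho_*: selector values (e.g. indicators of the two adjacent classes);
    phi_input_*: raw inputs for the edges (e.g. concatenated character images)."""
    return rho_ij * rho_rs * rbf(phi_input_ij, phi_input_rs, sigma)

# Toy usage: two edges with identical images and the same selected class pair.
img_a = np.ones(16 * 8 * 2)
img_b = np.ones(16 * 8 * 2)
print(edge_basis_kernel(1.0, 1.0, img_a, img_b))   # prints 1.0
```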
64 SMO learning of M3 networks (Section 5): Although the number of variables and constraints in the factored dual in (10) is polynomial in the size of the data, the number of coefficients in the quadratic term (kernel matrix) in the objective is quadratic in the number of examples and edges in the network. [sent-157, score-0.581]
65 Let us begin by considering the original dual problem (7). [sent-160, score-0.146]
66 Fortunately, we can perform precisely the same optimization in terms of the marginal dual variables. [sent-166, score-0.186]
67 It is easy to see that a change from α_x(y¹), α_x(y²) to α′_x(y¹), α′_x(y²) has the following effect on µ_x(y_i, y_j): µ′_x(y_i, y_j) = µ_x(y_i, y_j) + λ I(y_i = y¹_i, y_j = y¹_j) − λ I(y_i = y²_i, y_j = y²_j). [sent-169, score-3.386]
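In code, an SMO step that shifts mass λ from α_x(y²) to α_x(y¹) touches only the edge marginals of the two joint assignments involved; the sketch below is our own illustration, reusing the dictionary representation from the earlier sketches.

```python
# Sketch of the marginal update induced by an SMO step:
# mu'_x(y_i, y_j) = mu_x(y_i, y_j) + lam * I[(y_i, y_j) agrees with y1]
#                                  - lam * I[(y_i, y_j) agrees with y2].
def smo_update_edge_marginals(mu_edge, y1, y2, lam, edges):
    """mu_edge: dict {(i, j, yi, yj): mu}; y1, y2: full joint labelings (tuples)."""
    for (i, j) in edges:
        key1 = (i, j, y1[i], y1[j])
        key2 = (i, j, y2[i], y2[j])
        mu_edge[key1] = mu_edge.get(key1, 0.0) + lam
        mu_edge[key2] = mu_edge.get(key2, 0.0) - lam
    return mu_edge
```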
68 Generalization bound (Section 6): In this section, we show a generalization bound for the task of multi-label classification that allows us to relate the error rate on the training set to the generalization error. [sent-173, score-0.262]
69 Thus an appropriate error function is the average per-label loss L(w, x) = (1/l) ∆t_x(arg max_y w⊤f_x(y)). [sent-176, score-0.553]
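A small sketch (ours, with hypothetical names) of this average per-label loss, parameterized by any arg-max routine such as the brute-force one shown earlier:

```python
# Sketch of L(w, x) = (1/l) * Delta_t_x(argmax_y w . f_x(y)).
def average_per_label_loss(w, feature_fn, x, t, argmax_fn, num_classes):
    """argmax_fn(w, feature_fn, x, num_labels, num_classes) -> (y_hat, score)."""
    l = len(t)
    y_hat, _ = argmax_fn(w, feature_fn, x, l, num_classes)
    wrong = sum(yi != ti for yi, ti in zip(y_hat, t))
    return wrong / float(l)
```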
70 As in other generalization bounds for margin-based classifiers, we relate the generalization error to the margin of the classifier. [sent-177, score-0.263]
71 We can now define a γ-margin per-label loss: L_γ(w, x) = sup_{z: |z(y) − w⊤f_x(y)| ≤ γ∆t_x(y), ∀y} (1/l) ∆t_x(arg max_y z(y)). [sent-180, score-0.498]
72 This loss function measures the worst per-label loss on x made by any classifier z which is perturbed from w fx by at most a γ-margin per-label. [sent-181, score-0.505]
73 We can now prove that the generalization accuracy of any classifier is bounded by its expected γ-margin per-label loss on the training data, plus a term that grows inversely with the margin. [sent-182, score-0.176]
74 Intuitively, the first term corresponds to the “bias”, as margin γ decreases the complexity of our hypothesis class by considering a γ-per-label margin ball around w fx and selecting one (the worst) classifier within this ball. [sent-183, score-0.602]
75 Thus, the result provides a bound to the generalization error that trades off the effective complexity of the hypothesis space with the training error. [sent-185, score-0.186]
76 However, we propose a novel method for covering structured problems by constructing a cover of the loss function from a cover of the individual edge basis function differences ∆f_x(y_i, y_j). [sent-193, score-0.662]
77 Specifically, our bound has a logarithmic dependence on the number of labels (ln l) and depends only on the 2-norm of the basis functions per-edge (Redge ). [sent-195, score-0.23]
78 This is a significant gain over the previous result of Collins [3] which has linear dependence on the number of labels (l), and depends on the joint 2-norm of all of the features (which is ∼ lRedge , unless each sequence is normalized separately, which is often ineffective in practice). [sent-196, score-0.274]
79 Finally, note that if l/m = O(1) (for example, in OCR, if the number of instances is at least a constant times the length of a word), then our bound is independent of the number of labels l. [sent-197, score-0.234]
80 Experiments (Section 7): We evaluate our approach on two very different tasks: a sequence model for handwriting recognition and an arbitrary-topology Markov network for hypertext classification. [sent-199, score-0.234]
81 Each word was divided into characters, and each character was rasterized into an image of 16 by 8 binary pixels. [sent-215, score-0.139]
82 In our framework, the image for each word corresponds to x, a label of an individual character to Y_i, and a labeling for a complete word to Y. [sent-218, score-0.349]
83 Logistic regression and CRFs are both trained by maximizing the conditional likelihood of the labels given the features, using a zero-mean diagonal Gaussian prior over the parameters, with a standard deviation between 0. [sent-227, score-0.242]
84 Our features for each label Y_i are the corresponding image of the ith character. [sent-230, score-0.145]
85 For margin-based methods (SVMs and M3 ), we were able to use kernels (both quadratic and cubic were evaluated) to increase the dimensionality of the feature space. [sent-232, score-0.168]
86 Interestingly, the error rate of our method using linear features is 16% lower than that of CRFs, and about the same as multi-class SVMs with cubic kernels. [sent-237, score-0.106]
87 We also tested our approach on collective hypertext classification, using the data set in [10], which contains web pages from four different Computer Science departments. [sent-243, score-0.171]
88 The second model is the relational Markov network (RMN) of [10], which, in addition to word-label dependence, has an edge with a potential over the labels of two pages that are hyper-linked to each other. [sent-249, score-0.178]
89 This model defines a Markov network over each web site that was trained to maximize the conditional probability of the labels given the words and the links. [sent-250, score-0.325]
90 The third model is an M3 net with the same features but trained by maximizing the margin using the relaxed dual formulation and loopy BP for inference. [sent-251, score-0.385]
91 Figure 1(c) shows a gain in accuracy from SVMs to RMNs by using the correlations between labels of linked web pages, and a very significant additional gain by using maximum margin training. [sent-253, score-0.414]
92 Discussion (Section 8): We present a discriminative framework for labeling and segmentation of structured data such as sequences, images, etc. [sent-255, score-0.13]
93 Our approach seamlessly integrates state-of-the-art kernel methods developed for classification of independent instances with the rich language of graphical models that can exploit the structure of complex data. [sent-256, score-0.153]
94 In our experiments with the OCR task, for example, our sequence model significantly outperforms other approaches by incorporating high-dimensional decision boundaries of polynomial kernels over character images while capturing correlations between consecutive characters. [sent-257, score-0.255]
95 Although the number of variables and constraints of our QP formulation is polynomial in the example size (e.g., [sent-259, score-0.163]
96 sequence length), we also address its quadratic growth using an effective optimization procedure inspired by SMO. [sent-261, score-0.124]
97 Our generalization bound significantly tightens previous results of Collins [3] and suggests possibilities for analyzing per-label generalization properties of graphical models. [sent-263, score-0.275]
98 For brevity, we simplified our presentation of graphical models to only pairwise Markov networks. [sent-264, score-0.107]
99 Our formulation and generalization bound easily extend to interaction patterns involving more than two labels. [sent-265, score-0.314]
100 Overall, we believe that M3 networks will significantly further the applicability of high accuracy margin-based methods to real-world structured data. [sent-268, score-0.14]
wordName wordTfidf (topN-words)
[('fx', 0.395), ('yi', 0.377), ('yj', 0.376), ('qp', 0.277), ('tx', 0.261), ('ocr', 0.178), ('svms', 0.154), ('dual', 0.146), ('labels', 0.138), ('crfs', 0.12), ('classi', 0.116), ('label', 0.104), ('character', 0.104), ('maxy', 0.103), ('margin', 0.089), ('generalization', 0.087), ('hypertext', 0.086), ('factored', 0.073), ('yr', 0.073), ('labeling', 0.071), ('ys', 0.069), ('markov', 0.068), ('cubic', 0.065), ('quadratic', 0.064), ('marginals', 0.063), ('structured', 0.059), ('redge', 0.059), ('characters', 0.058), ('graphical', 0.057), ('arg', 0.056), ('loss', 0.055), ('xx', 0.055), ('ij', 0.054), ('collective', 0.052), ('correlations', 0.052), ('instances', 0.052), ('cation', 0.052), ('variables', 0.052), ('rmns', 0.051), ('taskar', 0.051), ('pairwise', 0.05), ('network', 0.049), ('probabilistic', 0.048), ('basis', 0.048), ('ln', 0.047), ('webpage', 0.047), ('forest', 0.047), ('networks', 0.047), ('formulation', 0.045), ('exploit', 0.044), ('smo', 0.044), ('bound', 0.044), ('features', 0.041), ('multiclass', 0.041), ('conditional', 0.04), ('constraints', 0.04), ('marginal', 0.04), ('edge', 0.04), ('handwriting', 0.039), ('msvm', 0.039), ('primal', 0.039), ('kernels', 0.039), ('objective', 0.039), ('unfortunately', 0.037), ('maximize', 0.035), ('word', 0.035), ('fk', 0.035), ('fj', 0.035), ('sequence', 0.034), ('crf', 0.034), ('rmn', 0.034), ('assignment', 0.034), ('maximizing', 0.034), ('tasks', 0.034), ('accuracy', 0.034), ('gain', 0.034), ('handwritten', 0.033), ('web', 0.033), ('covering', 0.032), ('hw', 0.031), ('polytope', 0.031), ('bp', 0.03), ('cant', 0.03), ('program', 0.03), ('edges', 0.03), ('trained', 0.03), ('graph', 0.029), ('hypothesis', 0.029), ('er', 0.028), ('yk', 0.027), ('parameterize', 0.027), ('exploiting', 0.027), ('joint', 0.027), ('polynomial', 0.026), ('cover', 0.026), ('collins', 0.026), ('crammer', 0.026), ('signi', 0.026), ('effective', 0.026), ('recognition', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999994 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
2 0.2161942 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
3 0.18805401 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
Author: Noam Shental, Aharon Bar-hillel, Tomer Hertz, Daphna Weinshall
Abstract: Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.
4 0.15751056 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
5 0.14802195 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
Author: Stuart Andrews, Thomas Hofmann
Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of pattern instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1
6 0.12235247 155 nips-2003-Perspectives on Sparse Bayesian Learning
7 0.11931448 88 nips-2003-Image Reconstruction by Linear Programming
8 0.11757049 1 nips-2003-1-norm Support Vector Machines
9 0.1111274 118 nips-2003-Link Prediction in Relational Data
10 0.10909735 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
11 0.10652904 100 nips-2003-Laplace Propagation
12 0.10650753 115 nips-2003-Linear Dependent Dimensionality Reduction
13 0.10097387 113 nips-2003-Learning with Local and Global Consistency
14 0.095992692 121 nips-2003-Log-Linear Models for Label Ranking
15 0.089058116 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification
16 0.081145048 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
17 0.080426708 41 nips-2003-Boosting versus Covering
18 0.078652538 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks
19 0.078039192 126 nips-2003-Measure Based Regularization
20 0.075875714 128 nips-2003-Minimax Embeddings
topicId topicWeight
[(0, -0.272), (1, -0.144), (2, -0.095), (3, -0.104), (4, 0.158), (5, -0.122), (6, -0.072), (7, -0.149), (8, 0.001), (9, -0.025), (10, 0.09), (11, 0.052), (12, -0.066), (13, 0.045), (14, -0.101), (15, 0.16), (16, -0.017), (17, -0.047), (18, -0.073), (19, -0.02), (20, -0.05), (21, 0.065), (22, -0.04), (23, 0.086), (24, 0.001), (25, -0.095), (26, 0.077), (27, -0.137), (28, -0.01), (29, -0.002), (30, -0.001), (31, 0.02), (32, 0.045), (33, 0.129), (34, 0.057), (35, -0.12), (36, 0.141), (37, -0.095), (38, -0.138), (39, 0.027), (40, 0.05), (41, 0.114), (42, 0.039), (43, 0.006), (44, 0.033), (45, 0.037), (46, 0.1), (47, -0.034), (48, 0.022), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.96807575 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
2 0.73040468 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
Author: Stuart Andrews, Thomas Hofmann
Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of pattern instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1
3 0.65933126 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
Author: Noam Shental, Aharon Bar-hillel, Tomer Hertz, Daphna Weinshall
Abstract: Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.
4 0.64331418 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
5 0.60172361 1 nips-2003-1-norm Support Vector Machines
Author: Ji Zhu, Saharon Rosset, Robert Tibshirani, Trevor J. Hastie
Abstract: The standard 2-norm SVM is known for its good performance in two-class classification. In this paper, we consider the 1-norm SVM. We argue that the 1-norm SVM may have some advantage over the standard 2-norm SVM, especially when there are redundant noise features. We also propose an efficient algorithm that computes the whole solution path of the 1-norm SVM, hence facilitates adaptive selection of the tuning parameter for the 1-norm SVM. 1
6 0.53999209 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
7 0.53173554 118 nips-2003-Link Prediction in Relational Data
8 0.5019539 100 nips-2003-Laplace Propagation
9 0.50151718 155 nips-2003-Perspectives on Sparse Bayesian Learning
10 0.44405273 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
11 0.42669967 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
12 0.4187333 88 nips-2003-Image Reconstruction by Linear Programming
13 0.4150894 113 nips-2003-Learning with Local and Global Consistency
14 0.41121978 48 nips-2003-Convex Methods for Transduction
15 0.40914136 181 nips-2003-Statistical Debugging of Sampled Programs
16 0.40038124 126 nips-2003-Measure Based Regularization
17 0.3944087 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
18 0.39382112 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers
19 0.39045539 115 nips-2003-Linear Dependent Dimensionality Reduction
20 0.38939849 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification
topicId topicWeight
[(0, 0.043), (11, 0.075), (16, 0.022), (26, 0.012), (35, 0.06), (53, 0.07), (66, 0.012), (69, 0.014), (71, 0.102), (76, 0.049), (85, 0.32), (91, 0.077), (99, 0.018)]
simIndex simValue paperId paperTitle
1 0.95935988 193 nips-2003-Variational Linear Response
Author: Manfred Opper, Ole Winther
Abstract: A general linear response method for deriving improved estimates of correlations in the variational Bayes framework is presented. Three applications are given and it is discussed how to use linear response as a general principle for improving mean field approximations.
2 0.95811009 136 nips-2003-New Algorithms for Efficient High Dimensional Non-parametric Classification
Author: Ting liu, Andrew W. Moore, Alexander Gray
Abstract: This paper is about non-approximate acceleration of high dimensional nonparametric operations such as k nearest neighbor classifiers and the prediction phase of Support Vector Machine classifiers. We attempt to exploit the fact that even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the datapoints close to the query, but merely need to ask questions about the properties about that set of datapoints. This offers a small amount of computational leeway, and we investigate how much that leeway can be exploited. For clarity, this paper concentrates on pure k-NN classification and the prediction phase of SVMs. We introduce new ball tree algorithms that on real-world datasets give accelerations of 2-fold up to 100-fold compared against highly optimized traditional ball-tree-based k-NN. These results include datasets with up to 106 dimensions and 105 records, and show non-trivial speedups while giving exact answers. 1
3 0.95723152 95 nips-2003-Insights from Machine Learning Applied to Human Visual Classification
Author: Felix A. Wichmann, Arnulf B. Graf
Abstract: We attempt to understand visual classification in humans using both psychophysical and machine learning techniques. Frontal views of human faces were used for a gender classification task. Human subjects classified the faces and their gender judgment, reaction time and confidence rating were recorded. Several hyperplane learning algorithms were used on the same classification task using the Principal Components of the texture and shape representation of the faces. The classification performance of the learning algorithms was estimated using the face database with the true gender of the faces as labels, and also with the gender estimated by the subjects. We then correlated the human responses to the distance of the stimuli to the separating hyperplane of the learning algorithms. Our results suggest that human classification can be modeled by some hyperplane algorithms in the feature space we used. For classification, the brain needs more processing for stimuli close to that hyperplane than for those further away. 1
same-paper 4 0.95578092 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
5 0.95413625 2 nips-2003-ARA*: Anytime A* with Provable Bounds on Sub-Optimality
Author: Maxim Likhachev, Geoffrey J. Gordon, Sebastian Thrun
Abstract: In real world planning problems, time for deliberation is often limited. Anytime planners are well suited for these problems: they find a feasible solution quickly and then continually work on improving it until time runs out. In this paper we propose an anytime heuristic search, ARA*, which tunes its performance bound based on available search time. It starts by finding a suboptimal solution quickly using a loose bound, then tightens the bound progressively as time allows. Given enough time it finds a provably optimal solution. While improving its bound, ARA* reuses previous search efforts and, as a result, is significantly more efficient than other anytime search methods. In addition to our theoretical analysis, we demonstrate the practical utility of ARA* with experiments on a simulated robot kinematic arm and a dynamic path planning problem for an outdoor rover. 1
6 0.92522335 64 nips-2003-Estimating Internal Variables and Paramters of a Learning Agent by a Particle Filter
7 0.92088759 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes
8 0.86932141 28 nips-2003-Application of SVMs for Colour Classification and Collision Detection with AIBO Robots
9 0.85308629 3 nips-2003-AUC Optimization vs. Error Rate Minimization
10 0.83716649 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
11 0.82941842 134 nips-2003-Near-Minimax Optimal Classification with Dyadic Classification Trees
12 0.82087731 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
13 0.78451663 35 nips-2003-Attractive People: Assembling Loose-Limbed Models using Non-parametric Belief Propagation
14 0.78287482 41 nips-2003-Boosting versus Covering
15 0.7778666 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
16 0.77681696 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
17 0.77595013 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
18 0.77547592 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
19 0.77118075 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
20 0.76826811 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification