nips nips2003 nips2003-47 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Noam Shental, Aharon Bar-hillel, Tomer Hertz, Daphna Weinshall
Abstract: Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.
Reference: text
sentIndex sentText sentNum sentScore
1 We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. [sent-18, score-0.776]
2 Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). [sent-19, score-0.66]
3 Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. [sent-20, score-0.45]
4 For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. [sent-21, score-0.263]
5 Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information. [sent-22, score-0.31]
6 More specifically, we use unlabeled data augmented by equivalence constraints between pairs of data points, where the constraints determine whether each pair was generated by the same source or by different sources. [sent-27, score-1.227]
7 Such constraints may be acquired without human intervention in a broad class of problems, and are a natural form of supervision in other scenarios. [sent-28, score-0.409]
8 We show how to incorporate equivalence constraints into the EM algorithm [1], in order to fit a generative Gaussian mixture model to the data. [sent-29, score-0.769]
9 Density estimation with Gaussian mixture models is a popular generative technique, mostly because it is computationally tractable and often produces good results. [sent-30, score-0.13]
10 (i.e., that the data is generated by a mixture of Gaussian sources) may not be easily justified. [sent-33, score-0.092]
11 In this paper we propose to incorporate equivalence constraints into an EM parameter estimation algorithm. [sent-35, score-0.699]
12 Much more importantly, the constraints change the GMM likelihood function and therefore may lead the estimation procedure to choose a better solution which would have otherwise been rejected (due to low relative likelihood in the unconstrained GMM density model). [sent-37, score-0.514]
13 Ideally the solution obtained with side information will be more faithful to the desired results. [sent-38, score-0.103]
14 Figure 1: Illustrative examples for the importance of equivalence constraints (panels: (a) unconstrained, (b) constrained, (c) unconstrained, (d) constrained). [sent-41, score-0.762]
15 Left: the data set consists of 2 vertically aligned classes - (a) given no additional information, the EM algorithm identifies two horizontal classes, and this can be shown to be the maximum likelihood solution (with log likelihood of −3500 vs. [sent-42, score-0.194]
16 log likelihood of −2800 for the solution in (b)); (b) additional side information in the form of equivalence constraints changes the probability function and we get a vertical partition as the most likely solution. [sent-43, score-0.808]
17 Right: the dataset consists of two classes with partial overlap - (c) without constraints the most likely solution includes two non-overlapping sources; (d) with constraints the correct model with overlapping classes was retrieved as the most likely solution. [sent-44, score-0.802]
18 In all plots only the class assignment of novel un-constrained points is shown. [sent-45, score-0.151]
19 Equivalence constraints are binary functions of pairs of points, indicating whether the two points come from the same source or from two different sources. [sent-46, score-0.562]
20 As it turns out, “is-equivalent” constraints can be easily incorporated into EM, while “not-equivalent” constraints require heavy duty inference machinery such as Markov networks. [sent-48, score-0.674]
21 Our choice to use equivalence constraints is motivated by the relative abundance of equivalence constraints in real life applications. [sent-50, score-1.284]
22 In a broad family of applications, equivalence constraints can be obtained without supervision. [sent-51, score-0.668]
23 Maybe the most important source of unsupervised equivalence constraints is temporal continuity in data; for example, in video indexing a sequence of faces obtained from successive frames in roughly the same location are likely to contain the same unknown individual. [sent-52, score-0.779]
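As a purely illustrative sketch of how such constraints could be gathered automatically (the function name, thresholds and distance rule below are hypothetical, not taken from the paper), detections that are close in time and location can be paired as positive constraints:

```python
import numpy as np

def positive_constraints_from_continuity(frame_ids, locations, max_gap=1, max_dist=20.0):
    """Hypothetical sketch: pair detections in nearby frames whose locations are close,
    and treat each pair as an 'is-equivalent' (positive) constraint."""
    frame_ids = np.asarray(frame_ids)
    locations = np.asarray(locations, dtype=float)
    order = np.argsort(frame_ids)          # process detections in temporal order
    pairs = []
    for ii in range(len(order)):
        for jj in range(ii + 1, len(order)):
            a, b = order[ii], order[jj]
            if frame_ids[b] - frame_ids[a] > max_gap:
                break                       # later detections are even further in time
            if np.linalg.norm(locations[a] - locations[b]) <= max_dist:
                pairs.append((int(a), int(b)))
    return pairs
```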
24 Furthermore, there are several learning applications in which equivalence constraints are the natural form of supervision. [sent-53, score-0.642]
25 One such scenario occurs when we wish to enhance a retrieval engine using supervision provided by its users. [sent-54, score-0.151]
26 The users may be asked to help annotate the retrieved set of data points, in what may be viewed as ’generalized relevance feedback’. [sent-55, score-0.132]
27 Therefore, we can only extract equivalence constraints from the feedback provided by the users. [sent-57, score-0.668]
28 Similar things happen in a ’distributed learning’ scenario, where supervision is provided by several uncoordinated teachers. [sent-58, score-0.139]
29 In such scenarios, when equivalence constraints are obtained in a supervised manner, our method can be viewed as a semi-supervised learning technique. [sent-59, score-0.705]
30 Most of the work in the field of semi-supervised learning focused on the case of partial labels augmenting a large unlabeled data set [4, 8, 5]. [sent-60, score-0.099]
31 A few recent papers use side information in the form of equivalence constraints [6, 7, 10]. [sent-61, score-0.719]
32 In [9] equivalence constraints were introduced into the K-means clustering algorithm. [sent-62, score-0.709]
33 In [3] equivalence constraints were introduced into the complete linkage clustering algorithm. [sent-64, score-0.927]
34 In comparison with both approaches, we gain significantly better clustering results by introducing the constraints into the EM algorithm. [sent-65, score-0.404]
35 One reason may be that the EM of a Gaussian mixture model is preferable as a clustering algorithm. [sent-66, score-0.16]
36 More importantly, the probabilistic semantics of the EM procedure allows for the introduction of constraints in a principled way, thus overcoming many drawbacks of the heuristic approaches. [sent-67, score-0.337]
37 Comparative results are given in Section 3, demonstrating the very significant advantage of our method over the two alternative constrained clustering algorithms using a number of data sets from the UCI repository and a large database of facial images [2]. [sent-68, score-0.469]
38 2 Constrained EM: the update rules A Gaussian mixture model (GMM) is a parametric statistical model which assumes that the data originates from a weighted sum of several Gaussian sources. [sent-69, score-0.173]
39 More formally, a GMM is given by p(x|Θ) = Σ_{l=1}^M αl p(x|θl), where αl denotes the weight of each Gaussian, θl its respective parameters, and M denotes the number of Gaussian sources in the GMM. [sent-70, score-0.13]
40 EM is a widely used method for estimating the parameter set of the model (Θ) using unlabeled data [1]. [sent-71, score-0.099]
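For reference, here is a minimal sketch of the unconstrained GMM density and the standard E-step responsibilities that the constrained variants below modify; the use of NumPy/SciPy and the array shapes are implementation assumptions, not the authors' code:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(X, alphas, mus, covs):
    """p(x|Theta) = sum_l alpha_l * p(x|theta_l), evaluated for every row of X."""
    comps = np.stack([a * multivariate_normal.pdf(X, mean=m, cov=c)
                      for a, m, c in zip(alphas, mus, covs)], axis=1)   # shape (N, M)
    return comps.sum(axis=1), comps

def e_step(X, alphas, mus, covs):
    """Standard (unconstrained) E-step: posterior p(y_i = l | x_i, Theta)."""
    px, comps = gmm_density(X, alphas, mus, covs)
    return comps / px[:, None]            # responsibilities, shape (N, M)
```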
41 Equivalence constraints modify the ’E’ (expectation computation) step, such that the sum is taken only over assignments which comply with the given constraints (instead of summing over all possible assignments of data points to sources). [sent-72, score-0.892]
42 It is important to note that there is a basic difference between “is-equivalent” (positive) and “not-equivalent” (negative) constraints: while positive constraints are transitive (i.e. [sent-73, score-0.452]
43 a group of pairwise “is-equivalent” constraints can be merged using a transitive closure), negative constraints are not transitive. [sent-75, score-0.819]
44 Therefore, we begin by presenting a formulation for positive constraints (Section 2. [sent-77, score-0.41]
45 1), and then present a different formulation for negative constraints (Section 2. [sent-78, score-0.44]
46 A unified formulation for both types of constraints immediately follows, and the details are therefore omitted. [sent-80, score-0.363]
47 1 Incorporating positive constraints Let a chunklet denote a small subset of data points that are known to belong to a single unknown class. [sent-82, score-0.833]
48 Chunklets may be obtained by applying the transitive closure to the set of “is-equivalent” constraints. [sent-83, score-0.125]
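A possible implementation of this step (an assumption about implementation detail, not code from the paper) merges positive constraints into chunklets with a union-find over point indices:

```python
def chunklets_from_positive_constraints(n_points, positive_pairs):
    """Merge 'is-equivalent' pairs into chunklets via transitive closure (union-find)."""
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for a, b in positive_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb                 # union the two groups

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    # unconstrained points simply end up as chunklets of size one
    return list(groups.values())
```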
49 Assume we are given a set of unlabeled data points and a set of chunklets. [sent-84, score-0.205]
50 In order to write down the likelihood of a given assignment of points to sources, a probabilistic model of how chunklets are obtained must be specified. [sent-85, score-0.456]
51 (1) chunklets are sampled i.i.d. with respect to the weight of their corresponding source (points within each chunklet are also sampled i.i.d.); [sent-89, score-0.364]
52 (2) data points are sampled i.i.d. without any knowledge about their class membership, and only afterwards chunklets are selected from these points. [sent-95, score-0.233]
53 The first assumption may be appropriate when chunklets are automatically obtained using temporal continuity. [sent-96, score-0.299]
54 The second sampling assumption is appropriate when equivalence constraints are obtained using distributed learning. [sent-97, score-0.742]
55 When incorporating these sampling assumptions into the EM formulation for GMM fitting, different algorithms are obtained: Using the first assumption we obtain closed-form update rules for all of the GMM parameters. [sent-98, score-0.229]
56 In this section we therefore restrict the discussion to the first sampling assumption only; the discussion of the second sampling assumption, where generalized EM must be used, is omitted. [sent-100, score-0.134]
57 Let {Xj}_{j=1}^L, L ≤ N, denote the distinct chunklets, where each Xj is a set of points xi such that ∪_{j=1}^L Xj = {xi}_{i=1}^N (unconstrained data points appear as chunklets of size one). [sent-104, score-0.575]
58 Let Y = {yi}_{i=1}^N denote the source assignment of the respective data points, and Yj = {yj^1, …, [sent-105, score-0.152]
59 yj^{|Xj|}} denote the source assignment of the chunklet Xj. [sent-108, score-0.683]
60 Finally, let EΩ denote the event {Y complies with the constraints}. [sent-109, score-0.08]
61 The expectation of the log likelihood is the following: E[log p(X, Y|Θnew, EΩ) | X, Θold, EΩ] = Σ_Y log p(X, Y|Θnew, EΩ) · p(Y|X, Θold, EΩ) (1) [sent-110, score-0.089]
62 where Σ_Y stands for a summation over all assignments of points to sources, Σ_Y ≡ Σ_{y1=1}^M ··· Σ_{yN=1}^M, and similarly for chunklets, Σ_{Yj} ≡ Σ_{yj^1} ··· Σ_{yj^{|Xj|}}. [sent-113, score-0.178]
63 First, using Bayes rule and the independence of chunklets, we can write p(Y|X, Θold, EΩ) = [p(EΩ|Y, X, Θold) p(Y|X, Θold)] / [Σ_Y p(EΩ|Y, X, Θold) p(Y|X, Θold)] = [Π_{j=1}^L δ_{Yj} p(Yj|Xj, Θold)] / [Σ_{Y1} ··· Σ_{YL} Π_{j=1}^L δ_{Yj} p(Yj|Xj, Θold)] (2) [sent-118, score-1.513]
64 where δ_{Yj} ≡ δ_{yj^1, …, yj^{|Xj|}} equals 1 if all the points in chunklet j have the same label, and 0 otherwise. [sent-124, score-0.391]
65 As can be readily seen, the update rules above effectively treat each chunklet as a single data point weighted according to the number of elements in it. [sent-127, score-0.421]
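The sketch below is one possible reading of this closed-form case: chunklet posteriors are computed jointly for all points in a chunklet, every point then inherits its chunklet's posterior, and a weighted M-step follows, which is what makes a chunklet behave like one point weighted by its size. The function signatures and the mixing-weight update shown (chunklets weighted equally, per the first sampling assumption) are assumptions on my part, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def constrained_e_step(chunklets, X, alphas, mus, covs):
    """Chunklet posteriors: p(Y_j = l) proportional to alpha_l * prod_{x in X_j} p(x|theta_l)."""
    M = len(alphas)
    post = np.zeros((len(chunklets), M))
    for j, idx in enumerate(chunklets):
        log_p = np.log(np.asarray(alphas, dtype=float))
        for l in range(M):
            log_p[l] += multivariate_normal.logpdf(X[idx], mean=mus[l], cov=covs[l]).sum()
        log_p -= log_p.max()                       # numerical stability
        p = np.exp(log_p)
        post[j] = p / p.sum()
    return post

def constrained_m_step(chunklets, X, post):
    """Every point inherits its chunklet's posterior, then standard weighted updates."""
    N, _ = X.shape
    M = post.shape[1]
    r = np.zeros((N, M))
    for j, idx in enumerate(chunklets):
        r[idx] = post[j]
    Nl = r.sum(axis=0)                             # effective counts per source
    mus = (r.T @ X) / Nl[:, None]
    covs = []
    for l in range(M):
        diff = X - mus[l]
        covs.append((r[:, l][:, None] * diff).T @ diff / Nl[l])
    alphas = post.mean(axis=0)                     # assumption: chunklets weighted equally
    return alphas, mus, covs
```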
66 2 Incorporating negative constraints The probabilistic description of a data set using a GMM attaches to each data point two random variables: an observable and a hidden. [sent-129, score-0.533]
67 The hidden variable of a point describes its source label, while the data point itself is an observed example from the source. [sent-130, score-0.151]
68 Each pair of observable and hidden variables is assumed to be independent of the other pairs. [sent-131, score-0.101]
69 However, negative equivalence constraints violate this assumption, as dependencies between the hidden variables are introduced. [sent-132, score-0.761]
70 Specifically, assume we have a group Ω = {(a1_i, a2_i)}_{i=1}^P of index pairs corresponding to P pairs of points that are negatively constrained, and define the event EΩ = {Y complies with the constraints}. [sent-133, score-0.279]
71 The network nodes are the hidden source variables and the observable data point variables. [sent-137, score-0.21]
72 The potential p(xi |yi , Θ) connects each observable data point, in a Gaussian manner, to a hidden variable corresponding to the label of its source. [sent-138, score-0.155]
73 Each hidden source node holds an initial potential of p(yi |Θ) reflecting the prior of the cluster weights. [sent-139, score-0.121]
74 Negative constraints are expressed by edges between hidden variables which prevent them from having the same value. [sent-140, score-0.379]
75 Data points 1 and 2 have a negative constraint, and so do points 2 and 3. [sent-144, score-0.289]
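The Markov-net inference itself is not reproduced here; the brute-force sketch below only illustrates the quantity that inference computes, namely the per-point posterior restricted to assignments that satisfy the negative constraints. It is feasible only for a handful of points (e.g. the 1-2 and 2-3 constraints above) and is an illustrative assumption, not the authors' procedure.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def negative_constrained_posteriors(X, neg_pairs, alphas, mus, covs):
    """Exact p(y_i = l | X, Theta, E_Omega) by enumerating all assignments Y that
    satisfy every 'not-equivalent' pair; exponential in N, so only for tiny examples."""
    N, M = X.shape[0], len(alphas)
    lik = np.stack([alphas[l] * multivariate_normal.pdf(X, mean=mus[l], cov=covs[l])
                    for l in range(M)], axis=1)          # (N, M): alpha_l * p(x_i|theta_l)
    post = np.zeros((N, M))
    Z = 0.0
    for Y in itertools.product(range(M), repeat=N):
        if any(Y[a] == Y[b] for a, b in neg_pairs):
            continue                                     # assignment violates E_Omega
        w = np.prod([lik[i, Y[i]] for i in range(N)])
        Z += w
        for i in range(N):
            post[i, Y[i]] += w
    return post / Z
```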
76 The update rules for µl and Σl are still µl^new = [Σ_{i=1}^N xi p(yi = l | X, Θold, EΩ)] / [Σ_{i=1}^N p(yi = l | X, Θold, EΩ)] and Σl^new = [Σ_{i=1}^N Σ_il p(yi = l | X, Θold, EΩ)] / [Σ_{i=1}^N p(yi = l | X, Θold, EΩ)], where Σ_il = (xi − µl^new)(xi − µl^new)^T denotes the sample covariance matrix. [sent-146, score-1.496]
77 The update rule of αl = p(yi = l | Θnew, EΩ) is more intricate, since this parameter appears in the normalization factor Z in the likelihood expression (4): Z = p(EΩ|Θ) = Σ_Y p(Y|Θ) p(EΩ|Y) [sent-148, score-0.117]
78 = Σ_{y1} ··· Σ_{yN} Π_{i=1}^N α_{yi} Π_{(a1_i, a2_i)∈Ω} (1 − δ_{y_{a1_i}, y_{a2_i}}) (6). This factor can be calculated using a net which is similar to the one discussed above but lacks the observable nodes. [sent-151, score-0.1]
79 Alternatively, we can approximate Z by assuming that the pairs of negatively constrained points are disjoint. [sent-155, score-0.388]
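Under that disjointness assumption each negatively constrained pair takes different labels independently with probability 1 − Σl αl², which suggests the following approximation; this is my derivation from the stated assumption, not a formula quoted from the paper.

```python
import numpy as np

def approx_Z(alphas, n_neg_pairs):
    """Approximate Z = p(E_Omega | Theta) by treating the negatively constrained pairs
    as disjoint: each pair gets different labels with probability 1 - sum_l alpha_l^2."""
    alphas = np.asarray(alphas, dtype=float)
    return (1.0 - np.sum(alphas ** 2)) ** n_neg_pairs
```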
80 We simulated a ’distributed learning’ scenario in order to obtain side information. [sent-160, score-0.13]
81 In this scenario equivalence constraints are obtained by employing N uncoordinated teachers. [sent-161, score-0.762]
82 Each teacher is given a random selection of K data points from the data set, and is then asked to partition this set of points into equivalence classes. [sent-162, score-0.607]
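A small simulation sketch of this distributed-learning protocol (the sampling details and function name are assumptions) turns each teacher's partition of K points into positive and negative constraints:

```python
import numpy as np

def simulate_distributed_learning(labels, n_teachers, K, rng=None):
    """Simulate uncoordinated teachers: each gets K random points and partitions them
    by (true) class, yielding positive and negative equivalence constraints."""
    rng = np.random.default_rng(rng)
    positive, negative = [], []
    for _ in range(n_teachers):
        subset = rng.choice(len(labels), size=K, replace=False)
        for a_idx in range(K):
            for b_idx in range(a_idx + 1, K):
                a, b = subset[a_idx], subset[b_idx]
                (positive if labels[a] == labels[b] else negative).append((int(a), int(b)))
    return positive, negative
```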
83 The constraints provided (Figure 3 panels: bar plots for BALANCE N=625 d=4 C=3, BOSTON N=506 d=13 C=3, IONOSPHERE N=351 d=34 C=2, each under the "little" and "much" conditions, bars a-i)
84 Figure 3: Combined precision and recall scores (f_{1/2}) of several clustering algorithms over 5 data sets from the UCI repository, and 1 facial image database (YaleB). [sent-197, score-0.17]
85 The YaleB dataset contained a total of 640 images including 64 frontal pose images of 10 different subjects. [sent-198, score-0.109]
86 In each panel results are shown for two cases, using 15% of the data points in constraints (left bars) and 30% of the points constrained (right bars). [sent-201, score-0.78]
87 The results were averaged over 100 realizations of constraints for the UCI datasets, and 1000 realizations for the YaleB dataset. [sent-202, score-0.413]
88 Also shown are the names of the data sets used and some of their parameters: N - the size of the data set; C - the number of classes; d - the dimensionality of the data. [sent-203, score-0.105]
89 by the teachers are gathered and used as equivalence constraints. [sent-204, score-0.398]
90 In the simulations, the number of constrained points was determined by the number of teachers N and the size of the subset K given to each. [sent-208, score-0.359]
91 By controlling the product N K we controlled the amount of side information provided to the learning algorithms. [sent-209, score-0.103]
92 We experimented with two conditions: using “little” side information (approximately 15% of the data points are constrained) and using “much” side information (approximately 30% of the points are constrained). [sent-210, score-0.396]
93 All algorithms were given initial conditions that did not take into account the available equivalence constraints. [sent-211, score-0.305]
94 • The constrained EM outperformed the two alternative algorithms in almost all cases (Fig. 3), while showing substantial improvement over the baseline EM. [sent-214, score-0.256]
95 The one case where constrained complete linkage outperformed all other algorithms involved the “wine” dataset. [sent-215, score-0.446]
96 The EM procedure was not able to fit the data well even with constraints, probably due to the fact that only 168 data points were available for training. [sent-217, score-0.166]
97 • Introducing side information in the form of equivalence constraints clearly improves the results of both K-means and the EM algorithms. [sent-218, score-0.719]
98 This is not always true for the constrained complete linkage algorithm. [sent-219, score-0.419]
99 In most cases adding the negative constraints contributes a small but significant improvement over results obtained when using only positive constraints. [sent-222, score-0.515]
100 From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. [sent-244, score-0.367]
wordName wordTfidf (topN-words)
[('constraints', 0.337), ('old', 0.328), ('equivalence', 0.305), ('chunklet', 0.285), ('em', 0.272), ('yj', 0.246), ('chunklets', 0.233), ('constrained', 0.201), ('xj', 0.182), ('linkage', 0.164), ('gmm', 0.153), ('yi', 0.128), ('jerusalem', 0.113), ('points', 0.106), ('yaleb', 0.104), ('source', 0.079), ('side', 0.077), ('negative', 0.077), ('supervision', 0.072), ('xi', 0.072), ('unlabeled', 0.069), ('sources', 0.068), ('transitive', 0.068), ('clustering', 0.067), ('mixture', 0.062), ('observable', 0.059), ('unconstrained', 0.055), ('hebrew', 0.055), ('uci', 0.055), ('complete', 0.054), ('scenario', 0.053), ('israel', 0.052), ('complies', 0.052), ('daphna', 0.052), ('teachers', 0.052), ('little', 0.051), ('incorporating', 0.048), ('positive', 0.047), ('likelihood', 0.046), ('retrieved', 0.045), ('names', 0.045), ('shental', 0.045), ('weinshall', 0.045), ('wine', 0.045), ('assignment', 0.045), ('facial', 0.044), ('repository', 0.044), ('rules', 0.044), ('log', 0.043), ('hidden', 0.042), ('net', 0.041), ('assignments', 0.041), ('gathered', 0.041), ('jl', 0.041), ('uncoordinated', 0.041), ('hertz', 0.041), ('negatively', 0.041), ('assumption', 0.04), ('pairs', 0.04), ('realizations', 0.038), ('differentiate', 0.038), ('generative', 0.038), ('update', 0.037), ('supervised', 0.037), ('independence', 0.037), ('yl', 0.036), ('petsche', 0.036), ('handles', 0.034), ('expression', 0.034), ('sampling', 0.034), ('gaussian', 0.033), ('mozer', 0.033), ('derivations', 0.033), ('unsupervised', 0.032), ('closure', 0.031), ('preferable', 0.031), ('pl', 0.031), ('stands', 0.031), ('denotes', 0.031), ('asked', 0.03), ('pose', 0.03), ('estimation', 0.03), ('data', 0.03), ('classes', 0.029), ('database', 0.029), ('denote', 0.028), ('improvement', 0.028), ('demonstrating', 0.027), ('users', 0.027), ('outperformed', 0.027), ('images', 0.027), ('incorporate', 0.027), ('provided', 0.026), ('generalized', 0.026), ('formulation', 0.026), ('obtained', 0.026), ('dataset', 0.025), ('readily', 0.025), ('label', 0.024), ('nips', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999887 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
Author: Noam Shental, Aharon Bar-hillel, Tomer Hertz, Daphna Weinshall
Abstract: Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.
2 0.18805401 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
3 0.13154715 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
4 0.12221294 32 nips-2003-Approximate Expectation Maximization
Author: Tom Heskes, Onno Zoeter, Wim Wiegerinck
Abstract: We discuss the integration of the expectation-maximization (EM) algorithm for maximum likelihood learning of Bayesian networks with belief propagation algorithms for approximate inference. Specifically we propose to combine the outer-loop step of convergent belief propagation algorithms with the M-step of the EM algorithm. This then yields an approximate EM algorithm that is essentially still double loop, with the important advantage of an inner loop that is guaranteed to converge. Simulations illustrate the merits of such an approach. 1
5 0.10611502 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
Author: Stuart Andrews, Thomas Hofmann
Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of pattern instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1
6 0.10483308 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
7 0.10039344 100 nips-2003-Laplace Propagation
8 0.09779761 113 nips-2003-Learning with Local and Global Consistency
9 0.093512021 152 nips-2003-Pairwise Clustering and Graphical Models
10 0.09192685 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
11 0.090950377 24 nips-2003-An Iterative Improvement Procedure for Hierarchical Clustering
12 0.090926595 73 nips-2003-Feature Selection in Clustering Problems
13 0.082451783 115 nips-2003-Linear Dependent Dimensionality Reduction
14 0.081954561 79 nips-2003-Gene Expression Clustering with Functional Mixture Models
15 0.08172331 166 nips-2003-Reconstructing MEG Sources with Unknown Correlations
16 0.081437103 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
17 0.081075974 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence
18 0.080784529 126 nips-2003-Measure Based Regularization
19 0.079384036 1 nips-2003-1-norm Support Vector Machines
20 0.078497931 46 nips-2003-Clustering with the Connectivity Kernel
topicId topicWeight
[(0, -0.242), (1, -0.129), (2, -0.062), (3, 0.044), (4, 0.009), (5, -0.015), (6, -0.009), (7, -0.039), (8, 0.03), (9, -0.009), (10, 0.097), (11, 0.039), (12, -0.074), (13, -0.053), (14, -0.111), (15, 0.144), (16, -0.005), (17, -0.042), (18, -0.081), (19, -0.091), (20, 0.019), (21, 0.08), (22, -0.113), (23, 0.129), (24, 0.094), (25, -0.097), (26, 0.03), (27, -0.179), (28, 0.004), (29, 0.044), (30, -0.015), (31, -0.06), (32, 0.05), (33, 0.111), (34, -0.004), (35, 0.002), (36, 0.074), (37, -0.155), (38, -0.034), (39, 0.047), (40, 0.069), (41, 0.092), (42, 0.007), (43, 0.133), (44, 0.17), (45, -0.031), (46, -0.053), (47, 0.063), (48, -0.105), (49, -0.162)]
simIndex simValue paperId paperTitle
same-paper 1 0.97398561 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
Author: Noam Shental, Aharon Bar-hillel, Tomer Hertz, Daphna Weinshall
Abstract: Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.
2 0.68585914 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
Author: Stuart Andrews, Thomas Hofmann
Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of pattern instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1
3 0.62071961 124 nips-2003-Max-Margin Markov Networks
Author: Ben Taskar, Carlos Guestrin, Daphne Koller
Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1
4 0.51825666 100 nips-2003-Laplace Propagation
Author: Eleazar Eskin, Alex J. Smola, S.v.n. Vishwanathan
Abstract: We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka’s Expectation Propagation. In the jointly normal case, it coincides with the latter and belief propagation, whereas in the general case, it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases. 1
5 0.50893319 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
6 0.47308734 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
7 0.4414432 172 nips-2003-Semi-Supervised Learning with Trees
8 0.43945512 108 nips-2003-Learning a Distance Metric from Relative Comparisons
9 0.43494368 113 nips-2003-Learning with Local and Global Consistency
10 0.42391419 115 nips-2003-Linear Dependent Dimensionality Reduction
11 0.39786923 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
12 0.38095173 126 nips-2003-Measure Based Regularization
13 0.37391037 130 nips-2003-Model Uncertainty in Classical Conditioning
14 0.37102127 48 nips-2003-Convex Methods for Transduction
15 0.36948133 69 nips-2003-Factorization with Uncertainty and Missing Data: Exploiting Temporal Coherence
16 0.36576465 152 nips-2003-Pairwise Clustering and Graphical Models
17 0.36464351 1 nips-2003-1-norm Support Vector Machines
18 0.35621145 73 nips-2003-Feature Selection in Clustering Problems
19 0.35595179 191 nips-2003-Unsupervised Context Sensitive Language Acquisition from a Large Corpus
20 0.33680621 32 nips-2003-Approximate Expectation Maximization
topicId topicWeight
[(0, 0.037), (11, 0.026), (29, 0.011), (30, 0.011), (35, 0.082), (53, 0.125), (66, 0.02), (69, 0.014), (70, 0.239), (71, 0.113), (76, 0.038), (85, 0.1), (91, 0.105), (99, 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.83749115 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
Author: Noam Shental, Aharon Bar-hillel, Tomer Hertz, Daphna Weinshall
Abstract: Density estimation with Gaussian Mixture Models is a popular generative technique used also for clustering. We develop a framework to incorporate side information in the form of equivalence constraints into the model estimation procedure. Equivalence constraints are defined on pairs of data points, indicating whether the points arise from the same source (positive constraints) or from different sources (negative constraints). Such constraints can be gathered automatically in some learning problems, and are a natural form of supervision in others. For the estimation of model parameters we present a closed form EM procedure which handles positive constraints, and a Generalized EM procedure using a Markov net which handles negative constraints. Using publicly available data sets we demonstrate that such side information can lead to considerable improvement in clustering tasks, and that our algorithm is preferable to two other suggested methods using the same type of side information.
2 0.79833251 80 nips-2003-Generalised Propagation for Fast Fourier Transforms with Partial or Missing Data
Author: Amos J. Storkey
Abstract: Discrete Fourier transforms and other related Fourier methods have been practically implementable due to the fast Fourier transform (FFT). However there are many situations where doing fast Fourier transforms without complete data would be desirable. In this paper it is recognised that formulating the FFT algorithm as a belief network allows suitable priors to be set for the Fourier coefficients. Furthermore efficient generalised belief propagation methods between clusters of four nodes enable the Fourier coefficients to be inferred and the missing data to be estimated in near to O(n log n) time, where n is the total of the given and missing data points. This method is compared with a number of common approaches such as setting missing data to zero or to interpolation. It is tested on generated data and for a Fourier analysis of a damaged audio signal. 1
3 0.69113582 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
4 0.68839002 145 nips-2003-Online Classification on a Budget
Author: Koby Crammer, Jaz Kandola, Yoram Singer
Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times. 1
5 0.68469304 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
Author: Pedro J. Moreno, Purdy P. Ho, Nuno Vasconcelos
Abstract: Over the last years significant efforts have been made to develop kernels that can be applied to sequence data such as DNA, text, speech, video and images. The Fisher Kernel and similar variants have been suggested as good ways to combine an underlying generative model in the feature space and discriminant classifiers such as SVM’s. In this paper we suggest an alternative procedure to the Fisher kernel for systematically finding kernel functions that naturally handle variable length sequence data in multimedia domains. In particular for domains such as speech and images we explore the use of kernel functions that take full advantage of well known probabilistic models such as Gaussian Mixtures and single full covariance Gaussian models. We derive a kernel distance based on the Kullback-Leibler (KL) divergence between generative models. In effect our approach combines the best of both generative and discriminative methods and replaces the standard SVM kernels. We perform experiments on speaker identification/verification and image classification tasks and show that these new kernels have the best performance in speaker verification and mostly outperform the Fisher kernel based SVM’s and the generative classifiers in speaker identification and image classification. 1
6 0.68410808 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
7 0.68232018 143 nips-2003-On the Dynamics of Boosting
8 0.68210804 41 nips-2003-Boosting versus Covering
9 0.68161023 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
10 0.68099052 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
11 0.68072313 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling
12 0.6806038 122 nips-2003-Margin Maximizing Loss Functions
13 0.6802761 117 nips-2003-Linear Response for Approximate Inference
14 0.68005514 3 nips-2003-AUC Optimization vs. Error Rate Minimization
15 0.67947602 113 nips-2003-Learning with Local and Global Consistency
16 0.67902994 187 nips-2003-Training a Quantum Neural Network
18 0.67837322 189 nips-2003-Tree-structured Approximations by Expectation Propagation
19 0.67733115 20 nips-2003-All learning is Local: Multi-agent Learning in Global Reward Games
20 0.67711902 100 nips-2003-Laplace Propagation