nips nips2003 nips2003-171 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Thore Graepel, Ralf Herbrich, Andriy Kharechko, John S. Shawe-taylor
Abstract: We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasible region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. [sent-6, score-0.586]
2 Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasible region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. [sent-8, score-0.607]
3 We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. [sent-9, score-0.203]
4 1 Introduction Semidefinite programming (SDP) is one of the most active research areas in optimisation. [sent-10, score-0.124]
5 Recently, semidefinite programming has been discovered as a useful toolkit in machine learning with applications ranging from pattern separation via ellipsoids [4] to kernel matrix optimisation [5] and transformation invariant learning [6]. [sent-12, score-0.399]
6 Methods for solving SDPs have mostly been developed in an analogy to linear programming. [sent-13, score-0.121]
7 Generalised simplex-like algorithms were developed for SDPs [11], but to the best of our knowledge are currently merely of theoretical interest. [sent-14, score-0.033]
8 The ellipsoid method works by searching for a feasible point via repeatedly “halving” an ellipsoid that encloses the affine space of constraint matrices such that the centre of the ellipsoid is a feasible point [7]. [sent-15, score-0.593]
9 However, this method shows poor performance in practice as the running time usually attains its worst-case bound. [sent-16, score-0.068]
10 A third set of methods for solving SDPs are interior point methods [14]. [sent-17, score-0.099]
11 These methods minimise a linear function on convex sets provided the sets are endowed with self-concordant barrier functions. [sent-18, score-0.185]
12 Since such a barrier function is known for SDPs, interior point methods are currently the most efficient method for solving SDPs in practice. [sent-19, score-0.142]
13 Considering the great generality of semidefinite programming and the complexity of state-of-the-art solution methods it is quite surprising that the forty year old simple perceptron learning algorithm [12] can be modified so as to solve SDPs. [sent-20, score-0.568]
14 In this paper we present a combination of the perceptron learning algorithm (PLA) with a rescaling algorithm (originally developed for LPs [3]) that is able to solve semidefinite programs in polynomial time. [sent-21, score-0.948]
15 We start with a short introduction into semidefinite programming and the perceptron learning algorithm in Section 2. [sent-22, score-0.485]
16 In Section 3 we present our main algorithm together with some performance guarantees, whose proofs we only sketch due to space restrictions. [sent-23, score-0.099]
17 While our numerical results presented in Section 4 are very preliminary, they do give insights into the workings of the algorithm and demonstrate that machine learning may have something to offer to the field of convex optimisation. [sent-24, score-0.145]
18 For the rest of the paper we denote matrices and vectors by bold face upper and lower case letters, e. [sent-25, score-0.027]
19 We shall use x̂ := x/‖x‖ to denote the unit length vector in the direction of x. [sent-28, score-0.049]
20 The following proposition shows that semidefinite programs are a direct generalisation of linear programs. [sent-35, score-0.185]
21 Every semidefinite program is a linear program with infinitely many linear constraints. [sent-37, score-0.144]
22 Obviously, the objective function in (1) is linear in x. [sent-39, score-0.141]
23 For any u ∈ Rm, define the vector a_u := (u⊤F1u, . . . , u⊤Fnu)⊤. [sent-40, score-0.307]
24 Then, the constraints in (1) can be written as ∀u ∈ Rm : u⊤F(x)u ≥ 0 ⇔ ∀u ∈ Rm : x⊤a_u ≥ −u⊤F0u. This is a linear constraint in x for all u ∈ Rm. [sent-44, score-0.409]
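To make this reduction concrete, here is a minimal numpy sketch (our own illustration, not code from the paper; the function names a_of_u and violated_direction are ours) that builds a_u for a given direction u and exposes the corresponding violated linear constraint x⊤a_u ≥ −u⊤F0u:

```python
import numpy as np

def a_of_u(u, F):
    """Map a direction u to the constraint vector a_u = (u'F1u, ..., u'Fnu)."""
    return np.array([u @ Fi @ u for Fi in F])

def violated_direction(x, F0, F):
    """Return a u with u'F(x)u < 0 if one exists (eigenvector of the smallest
    eigenvalue of F(x) = F0 + sum_i x_i F_i), otherwise None."""
    Fx = F0 + sum(xi * Fi for xi, Fi in zip(x, F))
    w, V = np.linalg.eigh(Fx)                 # ascending eigenvalues
    return V[:, 0] if w[0] < 0 else None

# toy instance: F(x) = F0 + x1*F1 + x2*F2 must be positive semidefinite
F0 = np.diag([1.0, -1.0])
F = [np.diag([0.0, 1.0]), np.eye(2)]
x = np.zeros(2)
u = violated_direction(x, F0, F)
if u is not None:
    a_u = a_of_u(u, F)
    # the violated linear constraint on x reads  x @ a_u >= -u @ F0 @ u
    print(x @ a_u, "<", -(u @ F0 @ u))
```

For a fixed u this is an ordinary linear inequality in x; letting u range over all of Rm recovers the infinite family of linear constraints of Proposition 1.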
25 Since the objective function is linear in x, we can solve an SDP by a sequence of semidefinite constraint satisfaction problems (CSPs), introducing the additional constraint c⊤x ≤ c0 and varying c0 ∈ R. [sent-46, score-0.385]
26 Any SDP can be solved by a sequence of homogenised semidefinite CSPs of the following form: find x ∈ Rn+1 subject to G(x) := Σ_{i=0}^{n} xi Gi ⪰ 0. [sent-49, score-0.029]
27 Algorithm 1 Perceptron Learning Algorithm
Require: A (possibly) infinite set A of vectors a ∈ Rn
Set t ← 0 and xt = 0
while there exists a ∈ A such that xt⊤a ≤ 0 do
  xt+1 = xt + â
  t ← t + 1
end while
return xt
Proof. [sent-50, score-0.616]
28 In order to make F0 and c0 dependent on the optimisation variables, we introduce an auxiliary variable x0 > 0; the solution to the original problem is given by x0⁻¹ · x. [sent-51, score-0.17]
29 Moreover, we can repose the two linear constraints c0x0 − c⊤x ≥ 0 and x0 > 0 as an LMI using the fact that a block-diagonal matrix is positive (semi)definite if and only if every block is positive (semi)definite. [sent-52, score-0.124]
30 Thus, the following block-diagonal matrices are sufficient:
G0 = [ F0 0 0 ; 0 c0 0 ; 0 0 1 ] ,   Gi = [ Fi 0 0 ; 0 −ci 0 ; 0 0 0 ] . [sent-53, score-0.027]
31 Given an upper and a lower bound on the objective function, repeated bisection can be used to determine the solution to accuracy ε in O(log(1/ε)) steps. [sent-54, score-0.348]
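A short sketch of this block-diagonal construction, assuming numpy/scipy and our own function names (the paper only states the matrices themselves):

```python
import numpy as np
from scipy.linalg import block_diag

def homogenise(F0, F, c, c0):
    """Build G0, ..., Gn of the homogenised CSP: find x in R^(n+1) with
    sum_i x_i G_i PSD.  The 2x2 diagonal tail encodes c0*x0 - c'x >= 0 and x0 >= 0."""
    G0 = block_diag(F0, c0, 1.0)
    G = [block_diag(Fi, -ci, 0.0) for Fi, ci in zip(F, c)]
    return [G0] + G

def G_of(x, G):
    """Evaluate G(x) = sum_i x_i G_i for x = (x0, x1, ..., xn)."""
    return sum(xi * Gi for xi, Gi in zip(x, G))

# tiny example: minimise c'x subject to F0 + x1*F1 PSD, with trial bound c0
F0, F, c = np.diag([1.0, 2.0]), [np.eye(2)], np.array([1.0])
G = homogenise(F0, F, c, c0=5.0)
x = np.array([1.0, 0.5])                      # (x0, x1); original solution is x[1:] / x[0]
print(np.linalg.eigvalsh(G_of(x, G)))         # all >= 0, so feasible for this c0
```

Running a bisection over c0 then amounts to rebuilding G0 (only its middle block changes) and re-solving the resulting CSP.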
32 2 Perceptron Learning Algorithm The perceptron learning algorithm (PLA) [12] is an online procedure which finds a linear separation of a set of points from the origin (see Algorithm 1). [sent-57, score-0.434]
33 A remarkable property of the perceptron learning algorithm is that the total number t of updates is independent of the cardinality of A but can be upper bounded simply in terms of the following quantity: ρ(A) := max_{x∈Rn} ρ(A, x) := max_{x∈Rn} min_{a∈A} â⊤x̂. [sent-59, score-0.385]
34 This quantity is known as the (normalised) margin of A in the machine learning community or as the radius of the feasible region in the optimisation community. [sent-60, score-0.444]
35 It quantifies the radius of the largest ball that can be fitted in the convex region enclosed by all a ∈ A (the so-called feasible set). [sent-61, score-0.341]
36 Then, the perceptron convergence theorem [10] states that t ≤ ρ⁻²(A). [sent-62, score-0.312]
37 For the purpose of this paper we observe that Algorithm 1 solves a linear CSP where the linear constraints are given by the vectors a ∈ A. [sent-63, score-0.171]
38 If the feasible set has a positive radius, then the perceptron learning algorithm solves a linear CSP in finitely many steps. [sent-66, score-0.642]
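For a finite constraint set A, the update loop of Algorithm 1 can be sketched as follows (a minimal numpy illustration, not the authors' code; following footnote 1 we use the normalised update vector):

```python
import numpy as np

def perceptron_csp(A, max_updates=10_000):
    """Find x with a'x > 0 for every row a of A (a linear CSP), or give up.

    The number of updates is bounded by rho(A)^(-2) whenever the feasible set
    has positive radius rho(A), independently of the number of rows of A."""
    x = np.zeros(A.shape[1])
    for _ in range(max_updates):
        violated = A @ x <= 0
        if not violated.any():
            return x                                # every constraint satisfied
        a = A[np.argmax(violated)]                  # any violated constraint
        x = x + a / np.linalg.norm(a)               # normalised update (footnote 1)
    raise RuntimeError("update budget exhausted")

# three half-spaces whose intersection contains a ball around (1, 1)
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(perceptron_csp(A))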
39 It is worth mentioning that in the last few decades a series of modified PLAs have been developed (see [2] for a good overview) which mainly aim at guaranteeing … ¹Note that sometimes the update equation is given using the unnormalised vector a. [sent-67, score-0.058]
40 Algorithm 2 Rescaling algorithm
Require: A maximal number T ∈ N+ of steps and a parameter σ ∈ R+
Set y uniformly at random in {z : ‖z‖ = 1}
for t = 0, . . . , T do [sent-68, score-0.074]
41   Find a_u such that ȳ⊤â_u := u⊤G(ȳ)u / √(Σ_{j=1}^{n} (u⊤Gj u)²) ≤ −σ   (u ≈ smallest EV of G(ȳ))
  if no such u exists then
    Set ∀i ∈ {1, . . . , n} : Gi ← Gi + yi G(y); return y [sent-71, score-0.564]
42   end if
  y ← y − (y⊤â_u) â_u; t ← t + 1
end for
return unsolved
… not only feasibility of the solution xt but also a lower bound on ρ(A, xt). [sent-74, score-1.172]
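The structure of Algorithm 2 can be sketched as below; this is our own reconstruction in numpy (names are ours), using the smallest eigenvector of G(ȳ) as the approximate search for a σ-violating direction, as in the paper's parenthetical remark:

```python
import numpy as np

def rescale(G, T, sigma, rng=np.random.default_rng(0)):
    """One run of the rescaling step (a sketch of Algorithm 2; names are ours).

    G is the list [G_1, ..., G_n].  On success the rescaled matrices
    G_i + y_i * G(y) and the vector y are returned, otherwise None ("unsolved")."""
    n = len(G)
    y = rng.standard_normal(n)
    y /= np.linalg.norm(y)                          # uniform direction on the unit sphere
    for _ in range(T):
        Gy = sum(yi * Gi for yi, Gi in zip(y, G))   # G(y)
        u = np.linalg.eigh(Gy)[1][:, 0]             # ~ eigenvector of the smallest eigenvalue
        a_u = np.array([u @ Gi @ u for Gi in G])
        a_hat = a_u / np.linalg.norm(a_u)
        if (y / np.linalg.norm(y)) @ a_hat > -sigma:
            # no sigma-violating direction found: rescale with y and report success
            return [Gi + yi * Gy for Gi, yi in zip(G, y)], y
        y = y - (y @ a_hat) * a_hat                 # step away from the violating constraint
    return None                                     # "unsolved"
```

In Algorithm 3 this routine is called with T = 1024 n² ln(n) and σ = 1/(32n).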
43 3 Semidefinite Programming by Perceptron Learning If we combine Propositions 1, 2 and 3 together with Equation (2) we obtain a perceptron algorithm that sequentially solves SDPs. [sent-76, score-0.417]
44 How can we make the running time of this algorithm polynomial in the description length of the data? [sent-80, score-0.215]
45 Hence, finding a vector a_u ∈ A such that x⊤a_u ≤ 0 is equivalent to identifying a vector u ∈ Rm such that Σ_{i=1}^{n} xi u⊤Gi u = u⊤G(x)u ≤ 0. [sent-88, score-0.614]
46 One possible way of finding such a vector u (and consequently a_u) for the current solution xt in Algorithm 1 is to calculate the eigenvector corresponding to the smallest eigenvalue of G(xt); if this eigenvalue is positive, the algorithm stops and outputs xt. [sent-89, score-0.741]
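Putting the two observations together gives the perceptron for semidefinite CSPs; a hedged numpy sketch (our naming, not the authors' MATLAB implementation) with the smallest-eigenvalue oracle described above:

```python
import numpy as np

def sdp_perceptron(G, max_updates=100_000):
    """Perceptron for the semidefinite CSP: find x with G(x) = sum_i x_i G_i
    positive definite.  A sketch under our own naming; the separation oracle is
    the eigenvector of the smallest eigenvalue of G(x), exactly as in the text."""
    n = len(G)
    x = np.zeros(n)
    for _ in range(max_updates):
        Gx = sum(xi * Gi for xi, Gi in zip(x, G))
        w, V = np.linalg.eigh(Gx)
        if w[0] > 0:                                # G(x) is positive definite: done
            return x
        u = V[:, 0]                                 # violating direction: u'G(x)u <= 0
        a_u = np.array([u @ Gi @ u for Gi in G])
        x = x + a_u / np.linalg.norm(a_u)           # normalised perceptron update
    raise RuntimeError("update budget exhausted")

# toy CSP: find x with x1*diag(1,2) + x2*I positive definite
print(sdp_perceptron([np.diag([1.0, 2.0]), np.eye(2)]))
```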
47 The second problem requires us to improve the dependency of the runtime from O(ρ⁻²) to O(−log ρ). [sent-91, score-0.065]
48 To this end we employ a probabilistic rescaling algorithm (see Algorithm 2) which was originally developed for LPs [3]. [sent-92, score-0.482]
49 The purpose of this algorithm is to enlarge the feasible region (in terms of ρ (A (G1 , . [sent-93, score-0.283]
50 , Gn ))) by a constant factor κ, on average, which would imply a decrease in the number of updates of the perceptron algorithm exponential in the number of calls to this rescaling algorithm. [sent-96, score-0.739]
51 If the algorithm does not return unsolved, the rescaling procedure on the Gi has the effect that a_u changes into a_u + (y⊤a_u) y for every u ∈ Rm. [sent-98, score-1.372]
52 In order to be able to reconstruct the solution xt to the original problem, whenever we rescale the Gi we need to remember the vector y used for rescaling. [sent-99, score-0.201]
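A quick numeric check (our own illustration) of both facts: the substitution Gi ← Gi + yi G(y) acts on every constraint vector as a_u ↦ a_u + (y⊤a_u) y, and a point that is feasible for the rescaled matrices is mapped back to the original problem by I + yy⊤, which is exactly why Algorithm 3 accumulates B ← B(In + yy⊤) and returns Bx:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
G = [(lambda M: (M + M.T) / 2)(rng.standard_normal((m, m))) for _ in range(n)]
y = rng.standard_normal(n)
u = rng.standard_normal(m)

Gy = sum(yi * Gi for yi, Gi in zip(y, G))                  # G(y)
G_new = [Gi + yi * Gy for Gi, yi in zip(G, y)]             # rescaled matrices

a_u = np.array([u @ Gi @ u for Gi in G])
a_u_new = np.array([u @ Gi @ u for Gi in G_new])
print(np.allclose(a_u_new, a_u + (y @ a_u) * y))           # True: a_u -> a_u + (y'a_u) y

# a point x_new for the rescaled matrices corresponds to (I + y y') x_new
# for the original ones, which is why the solution is recovered as B @ x
x_new = rng.standard_normal(n)
lhs = sum(xi * Gi for xi, Gi in zip(x_new, G_new))
rhs = sum(xi * Gi for xi, Gi in zip((np.eye(n) + np.outer(y, y)) @ x_new, G))
print(np.allclose(lhs, rhs))                               # True
```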
53 In Figure 1 we have shown the effect of rescaling for three linear constraints … ²Note that polynomial runtime is only guaranteed if ρ⁻²(A(G1, . [sent-100, score-0.464]
54 , Gn)) is bounded by a polynomial function of the description length of the data. [sent-103, score-0.073]
55 Figure 1: Shown is the feasible region and one feasible point before (left) and after (right) rescaling with the feasible point. [sent-105, score-0.812]
56 The main idea of Algorithm 2 is to find a vector y that is σ-close to the current feasible region and hence leads to an increase in its radius when used for rescaling. [sent-107, score-0.33]
57 Let σ ≤ 1/(32n), let ρ be the radius of the feasible set before rescaling and ρ′ be the radius of the feasible set after rescaling, and assume that ρ ≤ 1/(4n). [sent-111, score-1.08]
58 4 The probabilistic nature of the theorem stems from the fact that the rescaling can only be shown to increase the size of the feasible region if the (random) initial value y already points sufficiently closely to the feasible region. [sent-116, score-0.733]
59 A consequence of this theorem is that, on average, the radius increases by κ = (1 + 1/64n) > 1. [sent-117, score-0.121]
60 Algorithm 3 combines rescaling and perceptron learning, which results in a probabilistic polynomial-runtime algorithm³ which alternates between calls to Algorithms 1 and 2. [sent-118, score-0.806]
61 This algorithm may return infeasible in two cases: either Ti many calls to Algorithm 2 have returned unsolved or L many calls of Algorithm 1 together with rescaling have not returned a solution. [sent-119, score-0.836]
62 Following the argument in [3] one can show that for L = −2048n · ln (ρmin ) the total probability of returning infeasible despite ρ (A (G1 , . [sent-124, score-0.16]
63 Our initial aim was to demonstrate that the method works in practice and to assess its efficacy on a … ³Note that we assume that the optimisation problem in line 3 of Algorithm 2 can be solved in polynomial time with algorithms such as Newton-Raphson. [sent-129, score-0.241]
64 (Algorithm 3) Require: G1 , . . . , Gn ∈ Rm×m and a maximal number of iterations L ∈ N+
Set B = In
for i = 1, . . . , L do [sent-133, score-0.066]
65   Call Algorithm 1 for at most M(A, 1/(4n)) many updates
  if Algorithm 1 converged then return Bx
  Set δi = 3δ/(π² i²) and Ti = ln(δi)/ln(3/4)
  for j = 1, . . . , Ti do [sent-136, score-0.189]
66     Call Algorithm 2 with T = 1024 n² ln(n) and σ = 1/(32n)
    if Algorithm 2 returns y then B ← B(In + y y⊤); goto the outer for-loop
  end for
  return infeasible
end for
return infeasible
… benchmark example from graph bisection [1]. [sent-139, score-0.854]
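A compact, self-contained sketch of this combined loop (our own simplified constants and stopping rules, in numpy; it is not the authors' implementation and omits the Ti/δi retry schedule):

```python
import numpy as np

def solve_csp(G, n_outer=20, rng=np.random.default_rng(0)):
    """Alternate bounded perceptron runs (Algorithm 1) with rescaling (Algorithm 2).

    G is the list [G_1, ..., G_n]; on success B @ x is returned, a feasible point
    of the *original* CSP sum_i x_i G_i PSD, because B accumulates the rescalings."""
    n = len(G)
    B = np.eye(n)

    def min_eig_dir(M):
        w, V = np.linalg.eigh(M)
        return w[0], V[:, 0]

    for _ in range(n_outer):
        # perceptron phase with a fixed update budget (stands in for M(A, 1/(4n)))
        x = np.zeros(n)
        for _ in range((4 * n) ** 2):
            lam, u = min_eig_dir(sum(xi * Gi for xi, Gi in zip(x, G)))
            if lam > 0:
                return B @ x
            a_u = np.array([u @ Gi @ u for Gi in G])
            x = x + a_u / np.linalg.norm(a_u)
        # rescaling phase, sigma = 1/(32n), T = 1024 n^2 ln(n)
        sigma, T = 1.0 / (32 * n), int(1024 * n * n * np.log(n + 1))
        y = rng.standard_normal(n)
        y /= np.linalg.norm(y)
        for _ in range(T):
            Gy = sum(yi * Gi for yi, Gi in zip(y, G))
            _, u = min_eig_dir(Gy)
            a_u = np.array([u @ Gi @ u for Gi in G])
            a_hat = a_u / np.linalg.norm(a_u)
            if (y / np.linalg.norm(y)) @ a_hat > -sigma:
                G = [Gi + yi * Gy for Gi, yi in zip(G, y)]   # enlarge the feasible region
                B = B @ (np.eye(n) + np.outer(y, y))         # remember how to undo it
                break
            y = y - (y @ a_hat) * a_hat
    return None  # give up ("infeasible")

# tiny feasible instance
print(solve_csp([np.diag([1.0, 2.0]), np.eye(2)]))
```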
67 These experiments would also indicate how competitive the baseline method is when compared to other solvers. [sent-140, score-0.077]
68 The algorithm was implemented in MATLAB and all of the experiments were run on 1. [sent-141, score-0.074]
69 The time taken can be compared with a standard method SDPT3 [13] partially implemented in C but running under MATLAB. [sent-143, score-0.068]
70 We considered benchmark problems arising from semidefinite relaxations to the MAXCUT problems of weighted graphs, which is posed as finding a maximum weight bisection of a graph. [sent-144, score-0.362]
71 The benchmark MAXCUT problems have the following relaxed SDP form (see [8]): minimise_{x∈Rn} 1⊤x subject to −(1/4)(diag(C1) − C) + diag(x) ⪰ 0, (3) where the first term plays the role of F0, diag(x) that of Σi xi Fi, and C ∈ Rn×n is the adjacency matrix of the graph with n vertices. [sent-145, score-0.172]
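A small sketch (our names; numpy assumed) of how the data of this SDP is assembled from the adjacency matrix, with a 4-cycle standing in for the mcp100 benchmark:

```python
import numpy as np

def maxcut_sdp_data(C):
    """Build F0 and the F_i of the relaxed MAXCUT SDP (3):
    minimise 1'x subject to -(1/4)(diag(C1) - C) + diag(x) PSD."""
    n = C.shape[0]
    F0 = -0.25 * (np.diag(C @ np.ones(n)) - C)     # -(1/4) times the graph Laplacian
    F = [np.diag(e) for e in np.eye(n)]            # diag(x) = sum_i x_i e_i e_i'
    c = np.ones(n)                                  # objective vector: 1'x
    return F0, F, c

# 4-cycle graph as a tiny stand-in for the mcp100 benchmark
C = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
F0, F, c = maxcut_sdp_data(C)
x = np.ones(4)                                      # objective value 1'x = 4 (the max cut)
print(np.linalg.eigvalsh(F0 + sum(xi * Fi for xi, Fi in zip(x, F))).min())   # ~0: boundary
```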
72 The benchmark used was ‘mcp100’ provided by SDPLIB 1. [sent-146, score-0.052]
73 For this problem, n = 100 and it is known that the optimal value of the objective function equals 226. [sent-148, score-0.128]
74 The baseline method used the bisection approach to identify the critical value of the objective, referred to throughout this section as c0 . [sent-150, score-0.245]
75 Figure 2 (left) shows a plot of the time per iteration against the value of c0 for the first four iterations of the bisection method. [sent-151, score-0.354]
76 As can be seen from the plots the time taken by the algorithm for each iteration is quite long, with the time of the fourth iteration being around 19,000 seconds. [sent-152, score-0.232]
77 The initial value of 999 for c0 was found without an objective constraint and converged within 0. [sent-153, score-0.206]
78 The bisection then started with the lower (infeasible) value of 0 and the upper value of 999. [sent-155, score-0.273]
79 5, but the feasible solution had an objective value of 492. [sent-157, score-0.318]
80 The second iteration used a value of c0 = 246 slightly above the optimum of 226. [sent-159, score-0.156]
81 The third iteration was infeasible but since it was quite far from the optimum, the algorithm was able to deduce this fact quite quickly. [sent-160, score-0.308]
82 The final iteration was also infeasible, but much closer to the optimal value. [sent-161, score-0.066]
83 If we were to continue, the next iteration would also be infeasible but closer to the optimum and so would take even longer. [sent-164, score-0.244]
84 First, that the method does indeed work as predicted; secondly, that the running times are very far from being … [Figure 2 here; left panel y-axis: Time (in sec.)] [sent-166, score-0.068]
85 Figure 2 (right): Decay of the attained objective function value while iterating through Algorithm 3 with a non-zero threshold of τ = 500. [sent-168, score-0.128]
86 competitive (SDPT3 takes under 12 seconds to solve this problem) and thirdly that the running times increase as the value of c0 approaches the optimum with those iterations that must prove infeasibility being more costly than those that find a solution. [sent-169, score-0.304]
87 Rather than perform the search using the bisection method we implemented a non-zero threshold on the objective constraint (see the while-statement in Algorithm 1). [sent-171, score-0.37]
88 The value of this threshold is denoted τ , following the notation introduced in [9]. [sent-172, score-0.028]
89 Using a value of τ = 500 ensured that when a feasible solution is found, its objective value is significantly below that of the objective constraint c0 . [sent-173, score-0.499]
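The modification only touches the stopping test of Algorithm 1; below is a hedged sketch of our reading of it, with a per-constraint threshold vector that is zero everywhere except on the objective row (numbers and names are illustrative only):

```python
import numpy as np

def perceptron_with_threshold(A, tau, max_updates=100_000):
    """Algorithm 1 with per-constraint thresholds (a sketch of our reading of the
    tau modification): row j counts as violated while a_j'x <= tau[j], so an
    all-zero tau recovers the original algorithm, and a positive entry on the
    objective row forces the objective to clear c0 by a margin."""
    x = np.zeros(A.shape[1])
    for _ in range(max_updates):
        violated = A @ x <= tau
        if not violated.any():
            return x
        a = A[np.argmax(violated)]                  # first violated constraint
        x = x + a / np.linalg.norm(a)               # normalised perceptron update
    raise RuntimeError("update budget exhausted")

# objective row first: require  c0*x0 - c'x > 500  before stopping (toy numbers)
A = np.array([[5.0, -1.0], [1.0, 0.0], [0.0, 1.0]])
print(perceptron_with_threshold(A, np.array([500.0, 0.0, 0.0])))
```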
90 Figure 2 (right) shows the values of c0 as a function of the outer for-loops (iterations); the algorithm eventually approached its estimate of the optimal value at 228. [sent-174, score-0.147]
91 This is within 1% of the optimum, though of course iterations could have been continued. [sent-176, score-0.043]
92 Despite the clear convergence, the running time to an accurate estimate of the solution is still prohibitive with this approach: overall, the algorithm took approximately 60 hours of CPU time to find its solution. [sent-177, score-0.173]
93 A profile of the execution, however, revealed that up to 93% of the execution time is spent in the eigenvalue decomposition to identify u. [sent-178, score-0.065]
94 Observe that we do not need a minimal eigenvector to perform an update, simply a vector u satisfying u⊤G(x)u < 0 (4). A Cholesky decomposition will either return a u satisfying (4) or it will complete, indicating that G(x) is psd and Algorithm 1 has converged. [sent-179, score-0.148]
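One way such a certificate direction can be extracted is to run the Cholesky recursion and, at the first non-positive pivot, back-substitute for u; the sketch below is our own illustration of this idea, not the authors' implementation:

```python
import numpy as np

def cholesky_or_certificate(M):
    """Try to Cholesky-factorise the symmetric matrix M.

    Returns (L, None) if M is positive definite, otherwise (None, u) with
    u' M u <= 0, i.e. a certificate direction as in (4)."""
    m = M.shape[0]
    L = np.zeros_like(M, dtype=float)
    for k in range(m):
        # solve L[:k,:k] l = M[:k,k] for the new off-diagonal column
        l = np.linalg.solve(L[:k, :k], M[:k, k]) if k else np.zeros(0)
        d = M[k, k] - l @ l                        # the k-th pivot
        if d <= 0:
            u = np.zeros(m)
            u[k] = 1.0
            if k:                                  # u[:k] = -M_k^{-1} M[:k,k]
                u[:k] = -np.linalg.solve(L[:k, :k].T, l)
            return None, u                         # here u @ M @ u == d <= 0
        L[k, :k], L[k, k] = l, np.sqrt(d)
    return L, None

# example: an indefinite matrix yields a certificate instead of a factor
M_bad = np.array([[2.0, 3.0], [3.0, 2.0]])
L, u = cholesky_or_certificate(M_bad)
print(u, u @ M_bad @ u)                            # negative value, satisfying (4)
```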
95 5 Conclusions Semidefinite programming has interesting applications in machine learning. [sent-180, score-0.124]
96 In turn, we have shown how a simple learning algorithm can be modified to solve higher order convex optimisation problems such as semidefinite programs. [sent-181, score-0.305]
97 While the optimisation setting leads to the somewhat artificial and inefficient bisection method, the positive definite perceptron algorithm excels at solving positive definite CSPs as found, e. [sent-183, score-0.814]
98 , in problems of transformation invariant pattern recognition as solved by Semidefinite Programming Machines [6]. [sent-185, score-0.115]
99 In future work it will be of interest to consider the combined primal-dual problem at a predefined level ε of granularity so as to avoid the necessity of bisection search. [sent-186, score-0.217]
100 The perceptron: A probabilistic model for information storage and organization in the brain. [sent-268, score-0.027]
wordName wordTfidf (topN-words)
[('semide', 0.465), ('perceptron', 0.287), ('rescaling', 0.285), ('au', 0.282), ('bisection', 0.217), ('sdps', 0.166), ('feasible', 0.159), ('gn', 0.141), ('optimisation', 0.139), ('sdp', 0.13), ('programming', 0.124), ('xt', 0.121), ('rm', 0.121), ('nite', 0.119), ('infeasible', 0.116), ('pla', 0.11), ('objective', 0.1), ('programs', 0.096), ('radius', 0.096), ('return', 0.096), ('gi', 0.093), ('csps', 0.082), ('satisfaction', 0.082), ('competitive', 0.077), ('algorithm', 0.074), ('polynomial', 0.073), ('unsolved', 0.071), ('csp', 0.071), ('nitely', 0.07), ('calls', 0.069), ('running', 0.068), ('iteration', 0.066), ('runtime', 0.065), ('ellipsoid', 0.065), ('minimise', 0.065), ('optimum', 0.062), ('fi', 0.056), ('solves', 0.056), ('dunagan', 0.055), ('maxcut', 0.055), ('mons', 0.055), ('sdplib', 0.055), ('rn', 0.053), ('constraint', 0.053), ('benchmark', 0.052), ('interior', 0.052), ('region', 0.05), ('proposition', 0.048), ('microsoft', 0.048), ('ellipsoids', 0.048), ('lps', 0.048), ('graepel', 0.048), ('solving', 0.047), ('outer', 0.045), ('ln', 0.044), ('lmi', 0.043), ('herbrich', 0.043), ('barrier', 0.043), ('iterations', 0.043), ('linear', 0.041), ('su', 0.039), ('cone', 0.038), ('ti', 0.037), ('end', 0.036), ('semi', 0.036), ('convex', 0.036), ('execution', 0.035), ('insights', 0.035), ('developed', 0.033), ('constraints', 0.033), ('arising', 0.033), ('separation', 0.032), ('program', 0.031), ('solution', 0.031), ('problems', 0.03), ('invariant', 0.03), ('eigenvalue', 0.03), ('solved', 0.029), ('returned', 0.028), ('value', 0.028), ('originally', 0.027), ('matrices', 0.027), ('probabilistic', 0.027), ('combinatorial', 0.027), ('eigenvector', 0.027), ('quite', 0.026), ('solve', 0.026), ('transformation', 0.026), ('diag', 0.025), ('matlab', 0.025), ('vector', 0.025), ('converged', 0.025), ('proofs', 0.025), ('positive', 0.025), ('theorem', 0.025), ('updates', 0.024), ('shall', 0.024), ('uneven', 0.024), ('remember', 0.024), ('prompted', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 171 nips-2003-Semi-Definite Programming by Perceptron Learning
Author: Thore Graepel, Ralf Herbrich, Andriy Kharechko, John S. Shawe-taylor
Abstract: We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasable region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. 1
2 0.2660245 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
3 0.2309486 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
Author: Michael I. Jordan, Martin J. Wainwright
Abstract: We present a new method for calculating approximate marginals for probability distributions defined by graphs with cycles, based on a Gaussian entropy bound combined with a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant maximization problem that can be solved by efficient interior point methods [8]. As with the Bethe approximation and its generalizations [12], the optimizing arguments of this problem can be taken as approximations to the exact marginals. In contrast to Bethe/Kikuchi approaches, our variational problem is strictly convex and so has a unique global optimum. An additional desirable feature is that the value of the optimal solution is guaranteed to provide an upper bound on the log partition function. In experimental trials, the performance of the log-determinant relaxation is comparable to or better than the sum-product algorithm, and by a substantial margin for certain problem classes. Finally, the zero-temperature limit of our log-determinant relaxation recovers a class of well-known semidefinite relaxations for integer programming [e.g., 3]. 1
4 0.13029815 48 nips-2003-Convex Methods for Transduction
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
5 0.085302733 145 nips-2003-Online Classification on a Budget
Author: Koby Crammer, Jaz Kandola, Yoram Singer
Abstract: Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times. 1 Introduction and Motivation Kernel-based methods are widely being used for data modeling and prediction because of their conceptual simplicity and outstanding performance on many real-world tasks. The support vector machine (SVM) is a well known algorithm for finding kernel-based linear classifiers with maximal margin [7]. The kernel trick can be used to provide an effective method to deal with very high dimensional feature spaces as well as to model complex input phenomena via embedding into inner product spaces. However, despite generalization error being upper bounded by a function of the margin of a linear classifier, it is notoriously difficult to implement such classifiers efficiently. Empirically this often translates into very long training times. A number of alternative algorithms exist for finding a maximal margin hyperplane many of which have been inspired by Rosenblatt’s Perceptron algorithm [6] which is an on-line learning algorithm for linear classifiers. The work on SVMs has inspired a number of modifications and enhancements to the original Perceptron algorithm. These incorporate the notion of margin to the learning and prediction processes whilst exhibiting good empirical performance in practice. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2], and the Margin Infused Relaxed Algorithm (MIRA) [1] which can be used in conjunction with kernel functions. A notable limitation of kernel based methods is their computational complexity since the amount of computer memory that they require to store the so called support patterns grows linearly with the number prediction errors. A number of attempts have been made to speed up the training and testing of SVM’s by enforcing a sparsity condition. In this paper we devise an online algorithm that is not only sparse but also generalizes well. To achieve this goal our algorithm employs an insertion and deletion process. Informally, it can be thought of as revising the weight vector after each example on which a prediction mistake has been made. Once such an event occurs the algorithm adds the new erroneous example (the insertion phase), and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). As we describe later, making this adjustment to the algorithm allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps. This paper is organized as follows. In Sec. 2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. In Sec. 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache. Sec. 
6 0.081033297 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
7 0.078681402 102 nips-2003-Large Scale Online Learning
8 0.074393265 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
9 0.073776819 91 nips-2003-Inferring State Sequences for Non-linear Systems with Embedded Hidden Markov Models
10 0.061276592 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class
11 0.057849735 104 nips-2003-Learning Curves for Stochastic Gradient Descent in Linear Feedforward Networks
12 0.057073563 88 nips-2003-Image Reconstruction by Linear Programming
13 0.055721004 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
14 0.052066579 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian
15 0.051196147 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
16 0.049398918 121 nips-2003-Log-Linear Models for Label Ranking
17 0.048610825 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
18 0.047764488 181 nips-2003-Statistical Debugging of Sampled Programs
19 0.04746874 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
20 0.046128415 100 nips-2003-Laplace Propagation
topicId topicWeight
[(0, -0.192), (1, -0.026), (2, -0.08), (3, -0.055), (4, 0.127), (5, 0.012), (6, 0.022), (7, 0.016), (8, 0.112), (9, -0.012), (10, 0.026), (11, -0.142), (12, 0.036), (13, 0.101), (14, -0.147), (15, 0.022), (16, -0.104), (17, 0.004), (18, -0.093), (19, -0.048), (20, 0.02), (21, -0.081), (22, -0.294), (23, -0.26), (24, 0.08), (25, -0.009), (26, 0.215), (27, 0.149), (28, -0.064), (29, -0.019), (30, -0.117), (31, 0.218), (32, -0.045), (33, -0.127), (34, 0.003), (35, 0.07), (36, -0.079), (37, 0.017), (38, 0.059), (39, -0.012), (40, -0.004), (41, -0.218), (42, 0.028), (43, -0.039), (44, 0.043), (45, 0.125), (46, -0.026), (47, 0.07), (48, 0.13), (49, -0.081)]
simIndex simValue paperId paperTitle
same-paper 1 0.96105582 171 nips-2003-Semi-Definite Programming by Perceptron Learning
Author: Thore Graepel, Ralf Herbrich, Andriy Kharechko, John S. Shawe-taylor
Abstract: We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasable region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. 1
2 0.69996697 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles
Author: Michael I. Jordan, Martin J. Wainwright
Abstract: We present a new method for calculating approximate marginals for probability distributions defined by graphs with cycles, based on a Gaussian entropy bound combined with a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant maximization problem that can be solved by efficient interior point methods [8]. As with the Bethe approximation and its generalizations [12], the optimizing arguments of this problem can be taken as approximations to the exact marginals. In contrast to Bethe/Kikuchi approaches, our variational problem is strictly convex and so has a unique global optimum. An additional desirable feature is that the value of the optimal solution is guaranteed to provide an upper bound on the log partition function. In experimental trials, the performance of the log-determinant relaxation is comparable to or better than the sum-product algorithm, and by a substantial margin for certain problem classes. Finally, the zero-temperature limit of our log-determinant relaxation recovers a class of well-known semidefinite relaxations for integer programming [e.g., 3]. 1
3 0.63050002 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
Author: Thore Graepel, Ralf Herbrich
Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1
4 0.54296935 48 nips-2003-Convex Methods for Transduction
Author: Tijl D. Bie, Nello Cristianini
Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1
5 0.34653741 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
Author: Milos Hauskrecht, Branislav Kveton
Abstract: Approximate linear programming (ALP) has emerged recently as one of the most promising methods for solving complex factored MDPs with finite state spaces. In this work we show that ALP solutions are not limited only to MDPs with finite state spaces, but that they can also be applied successfully to factored continuous-state MDPs (CMDPs). We show how one can build an ALP-based approximation for such a model and contrast it to existing solution methods. We argue that this approach offers a robust alternative for solving high dimensional continuous-state space problems. The point is supported by experiments on three CMDP problems with 24-25 continuous state factors. 1
6 0.32992041 44 nips-2003-Can We Learn to Beat the Best Stock
7 0.32685539 102 nips-2003-Large Scale Online Learning
8 0.30015188 167 nips-2003-Robustness in Markov Decision Problems with Uncertain Transition Matrices
9 0.29330182 57 nips-2003-Dynamical Modeling with Kernels for Nonlinear Time Series Prediction
10 0.27781862 97 nips-2003-Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian
11 0.27767184 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
12 0.27746943 181 nips-2003-Statistical Debugging of Sampled Programs
13 0.26620227 187 nips-2003-Training a Quantum Neural Network
14 0.25346604 72 nips-2003-Fast Feature Selection from Microarray Expression Data via Multiplicative Large Margin Algorithms
15 0.24970423 139 nips-2003-Nonlinear Filtering of Electron Micrographs by Means of Support Vector Regression
16 0.24533124 145 nips-2003-Online Classification on a Budget
17 0.23569393 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
18 0.23016813 2 nips-2003-ARA*: Anytime A* with Provable Bounds on Sub-Optimality
19 0.22511938 90 nips-2003-Increase Information Transfer Rates in BCI by CSP Extension to Multi-class
20 0.21145999 32 nips-2003-Approximate Expectation Maximization
topicId topicWeight
[(0, 0.493), (11, 0.013), (35, 0.055), (53, 0.083), (71, 0.068), (76, 0.025), (85, 0.07), (91, 0.065), (99, 0.034)]
simIndex simValue paperId paperTitle
1 0.94240558 29 nips-2003-Applying Metric-Trees to Belief-Point POMDPs
Author: Joelle Pineau, Geoffrey J. Gordon, Sebastian Thrun
Abstract: Recent developments in grid-based and point-based approximation algorithms for POMDPs have greatly improved the tractability of POMDP planning. These approaches operate on sets of belief points by individually learning a value function for each point. In reality, belief points exist in a highly-structured metric simplex, but current POMDP algorithms do not exploit this property. This paper presents a new metric-tree algorithm which can be used in the context of POMDP planning to sort belief points spatially, and then perform fast value function updates over groups of points. We present results showing that this approach can reduce computation in point-based POMDP algorithms for a wide range of problems. 1
same-paper 2 0.94010353 171 nips-2003-Semi-Definite Programming by Perceptron Learning
Author: Thore Graepel, Ralf Herbrich, Andriy Kharechko, John S. Shawe-taylor
Abstract: We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasable region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. 1
3 0.93280321 53 nips-2003-Discriminating Deformable Shape Classes
Author: Salvador Ruiz-correa, Linda G. Shapiro, Marina Meila, Gabriel Berson
Abstract: We present and empirically test a novel approach for categorizing 3-D free form object shapes represented by range data . In contrast to traditional surface-signature based systems that use alignment to match specific objects, we adapted the newly introduced symbolic-signature representation to classify deformable shapes [10]. Our approach constructs an abstract description of shape classes using an ensemble of classifiers that learn object class parts and their corresponding geometrical relationships from a set of numeric and symbolic descriptors. We used our classification engine in a series of large scale discrimination experiments on two well-defined classes that share many common distinctive features. The experimental results suggest that our method outperforms traditional numeric signature-based methodologies. 1 1
4 0.85945898 22 nips-2003-An Improved Scheme for Detection and Labelling in Johansson Displays
Author: Claudio Fanti, Marzia Polito, Pietro Perona
Abstract: Consider a number of moving points, where each point is attached to a joint of the human body and projected onto an image plane. Johannson showed that humans can effortlessly detect and recognize the presence of other humans from such displays. This is true even when some of the body points are missing (e.g. because of occlusion) and unrelated clutter points are added to the display. We are interested in replicating this ability in a machine. To this end, we present a labelling and detection scheme in a probabilistic framework. Our method is based on representing the joint probability density of positions and velocities of body points with a graphical model, and using Loopy Belief Propagation to calculate a likely interpretation of the scene. Furthermore, we introduce a global variable representing the body’s centroid. Experiments on one motion-captured sequence suggest that our scheme improves on the accuracy of a previous approach based on triangulated graphical models, especially when very few parts are visible. The improvement is due both to the more general graph structure we use and, more significantly, to the introduction of the centroid variable. 1
5 0.67947608 42 nips-2003-Bounded Finite State Controllers
Author: Pascal Poupart, Craig Boutilier
Abstract: We describe a new approximation algorithm for solving partially observable MDPs. Our bounded policy iteration approach searches through the space of bounded-size, stochastic finite state controllers, combining several advantages of gradient ascent (efficiency, search through restricted controller space) and policy iteration (less vulnerability to local optima).
6 0.6184572 33 nips-2003-Approximate Planning in POMDPs with Macro-Actions
7 0.56537879 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
8 0.55846292 113 nips-2003-Learning with Local and Global Consistency
9 0.54838014 78 nips-2003-Gaussian Processes in Reinforcement Learning
10 0.54714328 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion
11 0.53476292 6 nips-2003-A Fast Multi-Resolution Method for Detection of Significant Spatial Disease Clusters
12 0.5306955 81 nips-2003-Geometric Analysis of Constrained Curves
13 0.52602714 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting
14 0.52338439 48 nips-2003-Convex Methods for Transduction
15 0.51802158 172 nips-2003-Semi-Supervised Learning with Trees
16 0.51683378 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
17 0.51578879 138 nips-2003-Non-linear CCA and PCA by Alignment of Local Models
18 0.51532531 158 nips-2003-Policy Search by Dynamic Programming
19 0.51144648 34 nips-2003-Approximate Policy Iteration with a Policy Language Bias
20 0.51135063 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection