nips nips2003 nips2003-96 knowledge-graph by maker-knowledge-mining

96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines


Source: pdf

Author: Thore Graepel, Ralf Herbrich

Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm—the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows us to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. [sent-5, score-0.218]

2 Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. [sent-6, score-0.151]

3 We present a new learning algorithm— the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. [sent-8, score-0.506]

4 The solution is found to be sparse in dual variables and allows us to identify those points on the trajectory with minimal real-valued output as virtual support vectors. [sent-9, score-0.372]

5 In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. [sent-11, score-0.276]

6 1 Introduction: One of the central problems of pattern recognition is the exploitation of known invariances in the pattern domain. [sent-12, score-0.193]

7 In images these invariances may include rotation, translation, shearing, scaling, brightness, and lighting direction. [sent-13, score-0.2]

8 In addition, specific domains such as handwritten digit recognition may exhibit invariances such as line thinning/thickening and other non-uniform deformations [8]. [sent-14, score-0.282]

9 The challenge is to combine the training sample with the knowledge of invariances to obtain a good classifier. [sent-15, score-0.255]

10 Possibly the most straightforward way of incorporating invariances is to include in the training sample virtual examples which have been generated from actual examples by applying the invariance T : R × Rn → Rn at some fixed θ ∈ R, e.g. [sent-16, score-0.639]
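
A minimal sketch of this virtual-example strategy, assuming the training images are 2-D numpy arrays and rotation is the known invariance (the function name, angles, and interpolation settings are illustrative, not taken from the paper):

```python
import numpy as np
from scipy.ndimage import rotate

def virtual_examples(images, labels, angles=(-10.0, 10.0)):
    # T(theta, x): rotation about the image centre, keeping the original frame.
    aug_x, aug_y = list(images), list(labels)
    for x, y in zip(images, labels):
        for theta in angles:
            aug_x.append(rotate(x, theta, reshape=False, order=3, mode="nearest"))
            aug_y.append(y)
    return np.stack(aug_x), np.array(aug_y)
```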

11 Images x subjected to the transformation T (θ, ·) describe highly non-linear trajectories or manifolds in pixel space. [sent-19, score-0.313]

12 The tangent distance [8] approximates the distance between the trajectories (manifolds) by the distance between their tangent vectors (planes) at a given value θ = θ0 and can be used with any kind of distance-based classifier. [sent-20, score-0.294]

13 Another approach, tangent prop [8], incorporates the invariance T directly into the objective function for learning by penalising large values of the derivative of the classification function w.r.t. the transformation parameter θ. [sent-21, score-0.164]

14 A similar regulariser can be applied to support vector machines [1]. [sent-25, score-0.187]

15 We take up the idea of considering the trajectory given by the combination of training vector and transformation. [sent-26, score-0.156]

16 While data in machine learning are commonly represented as vectors x ∈ Rn we instead consider more complex training examples each of which is represented as a (usually infinite) set {T (θ, xi) : θ ∈ R} ⊂ Rn, (1) which constitutes a trajectory in Rn. [sent-27, score-0.571]

17 Our goal is to learn a linear classifier that separates well the training trajectories belonging to different classes. [sent-28, score-0.176]

18 In practice, we may be given a “standard” training example x together with a differentiable transformation T representing an invariance of the learning problem. [sent-29, score-0.267]

19 The problem can be solved if the transformation T is approximated by a transformation $\tilde{T}$ that is polynomial in θ, e.g. [sent-30, score-0.453]

20 , a Taylor expansion of the form $T(\theta, x_i) \approx \tilde{T}(\theta, x_i) = \sum_{j=0}^{r} \frac{\theta^j}{j!} \left.\frac{d^j T(\theta, x_i)}{d\theta^j}\right|_{\theta=0}$ (2). [sent-32, score-0.63]

21 Our approach is based on a powerful theorem by Nesterov [5] which states that the set $P_{2l}^+$ of polynomials of degree 2l non-negative on the entire real line is a convex set representable by positive semidefinite (psd) constraints. [sent-34, score-0.282]

22 Hence, optimisation over $P_{2l}^+$ can be formulated as a semidefinite program (SDP). [sent-35, score-0.161]

23 Recall that an SDP [9] is given by a linear objective function minimised subject to a linear matrix inequality (LMI): minimise over $w \in \mathbb{R}^n$ the objective $c^\top w$ subject to $A(w) := \sum_{j=1}^{n} w_j A_j - B \succeq 0$, (3) with $A_j \in \mathbb{R}^{m \times m}$ for all j ∈ {0, … [sent-36, score-0.243]
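
For concreteness, the sketch below instantiates the form (3) in CVXPY for a toy problem: the smallest eigenvalue of a symmetric matrix S solves maximise t subject to S − tI ⪰ 0, i.e. form (3) with w = (t), c = (−1), A₁ = −I and B = −S (the instance is an illustrative assumption, not taken from the paper):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
S = M + M.T                        # symmetric data matrix

t = cp.Variable()
lmi = S - t * np.eye(4)            # A(w) = w_1 A_1 - B with A_1 = -I, B = -S
problem = cp.Problem(cp.Minimize(-t), [lmi >> 0])
problem.solve()
print(t.value, np.linalg.eigvalsh(S).min())   # the two values should agree
```

The optimal t equals the smallest eigenvalue of S, which makes the role of the LMI constraint easy to inspect.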

24 This expressive power can be used to enforce constraints for training examples as given by (1), i.e. [sent-43, score-0.174]

25 Based on this representability theorem for non-negative polynomials we develop a learning algorithm—the Semidefinite Programming Machine (SDPM)—that maximises the margin on polynomial training samples, much like the support vector machine [2] for ordinary single vector data. [sent-46, score-0.778]

26 , $(x_m, y_m)) \in (\mathbb{R}^n \times \{-1, +1\})^m$ we aim at learning a weight vector w ∈ Rn to classify examples x by $y(x) = \mathrm{sign}(w^\top x)$. [sent-51, score-0.115]

27 Assuming linear separability of the training sample, the principle of empirical risk minimisation recommends finding a weight vector w such that for all i ∈ {1, … [sent-52, score-0.181]

28 As such this constitutes a linear feasibility problem and is easily solved by the perceptron algorithm [6]. [sent-56, score-0.176]

29 Additionally requiring the solution to maximise the margin leads to the well-known quadratic program of support vector learning [2]. [sent-57, score-0.338]

30 In order to be able to cope with known invariances T (θ, ·) we would like to generalise the above setting to the following feasibility problem: find w ∈ Rn such that ∀i ∈ {1, … [sent-58, score-0.229]

31 Figure 1 (panel title: SDPM version space): (Left) Approximated trajectories for rotated USPS images (2) for r = 1 (dashed line) and r = 2 (dotted line). [sent-77, score-0.2]

32 (Right) Set of weight vectors w which are consistent with the six images (top) and the six trajectories (bottom). [sent-79, score-0.252]

33 that is, we would require the weight vector to classify correctly every transformed training example xi (θ) := T (θ, xi ) for every value of the transformation parameter θ. [sent-82, score-0.804]

34 As a consequence, we consider only transformations $\tilde{T}(\theta, x)$ of polynomial form, i.e. [sent-85, score-0.275]

35 , $\tilde{x}_i(\theta) := \tilde{T}(\theta, x_i) = X_i^\top \theta$, each polynomial example $\tilde{x}_i(\theta)$ being represented by a polynomial in the row vectors of $X_i \in \mathbb{R}^{(r+1) \times n}$, with $\theta := (1, \theta, \ldots, \theta^r)^\top$. [sent-87, score-1.313]
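
As a small illustration of this representation, assuming X_i is stored as a numpy array of shape (r + 1) × n whose rows are the Taylor coefficient vectors of (2), a point on the approximated trajectory is evaluated as:

```python
import numpy as np

def trajectory_point(Xi, theta):
    # Xi: (r+1) x n matrix whose rows are the Taylor coefficient vectors of (2).
    r = Xi.shape[0] - 1
    powers = np.array([theta ** j for j in range(r + 1)])  # (1, theta, ..., theta^r)
    return Xi.T @ powers                                   # a point on the trajectory in R^n
```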

36 , m} : ∀θ ∈ R : $y_i w^\top X_i^\top \theta \geq 0$, (5) which is equivalent to finding a weight vector w such that the polynomials $p_i(\theta) := y_i w^\top X_i^\top \theta$ are non-negative everywhere, i.e. [sent-94, score-0.801]

37 The set $P_{2l}^+$ of polynomials non-negative everywhere on the real line is SD-representable: 1. [sent-99, score-0.315]

38 For every $P \succeq 0$ the polynomial $p(\theta) = \theta^\top P \theta$ is non-negative everywhere. [sent-100, score-0.217]

39 For every polynomial $p \in P_{2l}^+$ there exists a $P \succeq 0$ such that $p(\theta) = \theta^\top P \theta$. [sent-102, score-0.217]

40 Any polynomial $p \in P_{2l}$ can be written as $p(\theta) = \theta^\top P \theta$, where $P = P^\top \in \mathbb{R}^{(l+1) \times (l+1)}$. [sent-104, score-0.259]

41 Statement 2: Every non-negative polynomial $p \in P_{2l}^+$ can be written as a sum of squared polynomials [4], hence $\exists q_i \in P_l : p(\theta) = \sum_i q_i^2(\theta) = \theta^\top \big(\sum_i \mathbf{q}_i \mathbf{q}_i^\top\big) \theta$, where $P := \sum_i \mathbf{q}_i \mathbf{q}_i^\top \succeq 0$ and $\mathbf{q}_i$ is the coefficient vector of polynomial $q_i$. [sent-106, score-1.462]
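
A minimal numerical illustration of this statement, assuming CVXPY: searching for a psd Gram matrix P with p(θ) = θᵀPθ is itself a small SDP feasibility problem (the example polynomial, (θ² − θ + 1)², is an illustrative choice):

```python
import cvxpy as cp

# p(theta) = 1 - 2*theta + 3*theta^2 - 2*theta^3 + theta^4 = (theta^2 - theta + 1)^2 >= 0
p = [1.0, -2.0, 3.0, -2.0, 1.0]            # coefficients p_0, ..., p_4
P = cp.Variable((3, 3), symmetric=True)

constraints = [P >> 0]
for k in range(5):                          # match the coefficient of theta^k
    entries = [P[i, k - i] for i in range(3) if 0 <= k - i <= 2]
    constraints.append(sum(entries) == p[k])

problem = cp.Problem(cp.Minimize(0), constraints)
problem.solve()
print(problem.status)                       # 'optimal' certifies a psd Gram matrix exists
print(P.value)
```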

42 Maximising Margins on Polynomial Samples: Here we develop an SDP formulation for learning a maximum margin classifier given the polynomial constraints (5). [sent-107, score-0.304]

43 This constitutes an SDP as in (3) by the fact that a block-diagonal matrix is psd if and only if all its diagonal blocks are psd. [sent-110, score-0.199]

44 The matrix G(w, Xi, yi) reduces to a scalar $y_i w^\top x_i - 1$, which translates into the standard SVM constraint $y_i w^\top x_i \geq 1$, linear in w. [sent-112, score-1.323]

45 For the case l = 1 we have $G(w, X_i, y_i) \in \mathbb{R}^{2 \times 2}$ and $G(w, X_i, y_i) = \begin{pmatrix} y_i w^\top (X_i)_{0,\cdot} - 1 & \frac{1}{2} y_i w^\top (X_i)_{1,\cdot} \\ \frac{1}{2} y_i w^\top (X_i)_{1,\cdot} & y_i w^\top (X_i)_{2,\cdot} \end{pmatrix}$. (7) [sent-113, score-1.422]

46 Although we require $G(w, X_i, y_i)$ to be psd, the resulting optimisation problem can be formulated in terms of a second-order cone program (SOCP) because the matrices involved are only 2 × 2. [sent-114, score-0.542]
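
Because a 2 × 2 symmetric matrix [[a, b], [b, c]] is psd exactly when a + c ≥ sqrt((a − c)² + 4b²), the l = 1 case can indeed be written with one cone constraint per example. A minimal sketch, assuming CVXPY and a hard-margin objective of minimising ||w||²; the paper's exact primal (6) is not reproduced in this extract, so that objective is an assumption:

```python
import numpy as np
import cvxpy as cp

def sdpm_socp(X, y):
    # X[i] holds the rows (X_i)_{0,.}, (X_i)_{1,.}, (X_i)_{2,.} of the quadratic trajectory.
    n = X[0].shape[1]
    w = cp.Variable(n)
    constraints = []
    for Xi, yi in zip(X, y):
        a = yi * (Xi[0] @ w) - 1           # G_00 in (7)
        b = 0.5 * yi * (Xi[1] @ w)         # G_01 = G_10 in (7)
        c = yi * (Xi[2] @ w)               # G_11 in (7)
        constraints.append(cp.norm(cp.hstack([2 * b, a - c])) <= a + c)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value
```

The cone constraint above is equivalent to requiring the 2 × 2 matrix in (7) to be psd, so no generality is lost for l = 1; the sketch assumes the trajectories are separable (hard margin).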

47 Since a polynomial p of degree four is fully determined by its five coefficients $p_0$, … [sent-117, score-0.258]

48 , $p_4$, but the symmetric matrix $P \in \mathbb{R}^{3 \times 3}$ in $p(\theta) = \theta^\top P \theta$ has six degrees of freedom, we require one auxiliary variable $u_i$ per training example, $G(w, u_i, X_i, y_i) = \frac{1}{2} \begin{pmatrix} 2 y_i w^\top (X_i)_{0,\cdot} - 2 & y_i w^\top (X_i)_{1,\cdot} & y_i w^\top (X_i)_{2,\cdot} - u_i \\ y_i w^\top (X_i)_{1,\cdot} & 2 u_i & y_i w^\top (X_i)_{3,\cdot} \\ y_i w^\top (X_i)_{2,\cdot} - u_i & y_i w^\top (X_i)_{3,\cdot} & 2 y_i w^\top (X_i)_{4,\cdot} \end{pmatrix}$. [sent-120, score-0.738]

49 In general, since a polynomial of degree 2l has 2l + 1 coefficients and a symmetric (l + 1) × (l + 1) matrix has (l + 1)(l + 2)/2 degrees of freedom, we require (l − 1)l/2 auxiliary variables. [sent-121, score-1.878]
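
As a quick numerical sanity check of this degree-four parameterisation, the identity θᵀGθ = y_i wᵀx̃_i(θ) − 1 holds for any value of the auxiliary variable u_i; the sketch below verifies it with random, purely illustrative numbers (y_j abbreviates y_i wᵀ(X_i)_{j,·}):

```python
import numpy as np

rng = np.random.default_rng(1)
yj = rng.standard_normal(5)          # y_0, ..., y_4: the five polynomial coefficients
u = rng.standard_normal()            # auxiliary variable
G = 0.5 * np.array([[2 * yj[0] - 2, yj[1],     yj[2] - u],
                    [yj[1],         2 * u,     yj[3]],
                    [yj[2] - u,     yj[3],     2 * yj[4]]])
theta = 0.7
t = np.array([1.0, theta, theta ** 2])
lhs = t @ G @ t
rhs = sum(yj[j] * theta ** j for j in range(5)) - 1.0
print(np.isclose(lhs, rhs))          # True for any u and theta
```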

50 Dual Program and Complementarity: Let us consider the dual SDPs corresponding to the optimisation problems above. [sent-122, score-0.145]

51 The dual of the general SDP (3) is given by maximise tr (BΛ) subject to ∀j ∈ {1, . [sent-124, score-0.169]

52 The complementarity conditions for the optimal solution $(w^*, t^*)$ read $A((w^*, t^*)) \, \Lambda^* = 0$. [sent-128, score-0.151]

53 The dual formulation of (6) with matrix (7), combined with the F(w, t) part of the complementarity conditions, reads $\max_{(\alpha,\beta,\gamma) \in \mathbb{R}^{3m}} -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \, [\tilde{x}(\alpha_i, \beta_i, \gamma_i, X_i)]^\top [\tilde{x}(\alpha_j, \beta_j, \gamma_j, X_j)] + \sum_{i=1}^{m} \alpha_i$ subject to ∀i ∈ {1, … [sent-129, score-0.622]

54 , m} : $M_i := \begin{pmatrix} \alpha_i & \beta_i \\ \beta_i & \gamma_i \end{pmatrix} \succeq 0$, (8). (Footnote 2: The characteristic polynomial of a 2×2 matrix is quadratic and has at most two solutions.) [sent-132, score-0.295]

55 where we define extrapolated training examples $\tilde{x}(\alpha_i, \beta_i, \gamma_i, X_i) := \alpha_i (X_i)_{0,\cdot} + \beta_i (X_i)_{1,\cdot} + \gamma_i (X_i)_{2,\cdot}$. [sent-135, score-0.133]

56 As before, this program with quadratic objective and psd constraints can be formulated as a standard SDP in the form (3) and is easily solved by a standard SDP solver. [sent-136, score-0.285]

57 In addition, the complementarity conditions reveal that the optimal weight vector $w^*$ can be expanded as $w^* = \sum_{i=1}^{m} y_i \, \tilde{x}(\alpha_i, \beta_i, \gamma_i, X_i)$, (9) in analogy to the corresponding result for support vector machines [2]. [sent-137, score-0.661]
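
A direct transcription of the expansion (9), assuming the dual variables and the matrices X_i are available as numpy arrays (function and variable names are illustrative):

```python
import numpy as np

def weight_from_dual(alpha, beta, gamma, X, y):
    # X[i] is (3, n); rows are (X_i)_{0,.}, (X_i)_{1,.}, (X_i)_{2,.}
    return sum(y[i] * (alpha[i] * X[i][0] + beta[i] * X[i][1] + gamma[i] * X[i][2])
               for i in range(len(X)))
```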

58 It remains to analyse the complementarity conditions related to the example-related G (w, Xi , yi ) constraints in (6). [sent-138, score-0.429]

59 Using (7) and assuming primal and dual feasibility we obtain for all i ∈ {1, . [sent-139, score-0.176]

60 , m} at the solution $(w^*, t^*, M_i^*)$: $G(w^*, X_i, y_i) \cdot M_i^* = 0$, (10) the trace of which translates into $y_i w^{*\top} \big[\alpha_i^* (X_i)_{0,\cdot} + \beta_i^* (X_i)_{1,\cdot} + \gamma_i^* (X_i)_{2,\cdot}\big] = \alpha_i^*$. [sent-142, score-0.503]

61 The expansion (9) of $w^*$ in terms of $X_i$ is sparse: only those examples $X_i$ (“support vectors”) that lie on the margin may have non-zero expansion coefficients $\alpha_i^*$, i.e. [sent-144, score-0.227]

62 From $G(w^*, X_i, y_i) \succ 0$ we conclude using Proposition 1 that for all θ ∈ R we have $y_i w^{*\top} \big((X_i)_{0,\cdot} + \theta (X_i)_{1,\cdot} + \theta^2 (X_i)_{2,\cdot}\big) > 1$. [sent-150, score-0.474]

63 Then, $\det(M_i^*) = 0$ implies $\beta_i^* = 0$ and the fact that $G(w^*, X_i, y_i) \succeq 0 \Rightarrow y_i w^{*\top} (X_i)_{2,\cdot} = 0$ ensures that $\gamma_i^* = 0$ holds as well. [sent-153, score-0.474]

64 , satisfying $\det(G(w^*, X_i, y_i)) = 0$ and $\det(M_i^*) = 0$, there exist $\theta_i^* \in \mathbb{R} \cup \{\infty\}$ such that the optimal weight vector $w^*$ can be written as $w^* = \sum_{i=1}^{m} \alpha_i^* y_i \, \tilde{x}_i(\theta_i^*) = \sum_{i=1}^{m} y_i \alpha_i^* \big[(X_i)_{0,\cdot} + \theta_i^* (X_i)_{1,\cdot} + \theta_i^{*2} (X_i)_{2,\cdot}\big]$. Proof. [sent-157, score-1.266]

65 The other cases are ruled out by the complementarity conditions (10). [sent-160, score-0.151]

66 Based on this proposition it is possible not only to identify which examples $X_i$ are used in the expansion of the optimal weight vector $w^*$, but also the corresponding values $\theta_i^*$ of the transformation parameter θ. [sent-161, score-0.435]

67 This extends the idea of virtual support vectors [7] in that Semidefinite Programming Machines are capable of finding truly virtual support vectors that were not explicitly provided in the training sample. [sent-162, score-0.723]

68 3 Extensions to SDPMs. Optimisation on a Segment: In many applications it may not be desirable to enforce correct classification on the entire trajectory given by the polynomial example $\tilde{x}(\theta)$. [sent-167, score-0.269]

69 In particular, when the polynomial is used as a local approximation to a global invariance we would like to restrict the example to a segment of the trajectory. [sent-168, score-0.406]

70 For any l ∈ N, the set $P_l^+(-\tau, \tau)$ of polynomials non-negative on a segment [−τ, τ] is SD-representable. [sent-171, score-0.302]

71 (sketch) Consider a polynomial $p \in P_l^+(-\tau, \tau)$, where $p := x \mapsto \sum_{i=0}^{l} p_i x^i$, and define $q := x \mapsto (1 + x^2)^l \cdot \big[p\big(\tau (2x^2 (1 + x^2)^{-1} - 1)\big)\big]$. [sent-173, score-0.492]

72 The proposition shows how we can restrict the examples $\tilde{x}(\theta)$ to a segment θ ∈ [−τ, τ] by effectively doubling the degree of the polynomial used. [sent-175, score-0.515]
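
A small sketch of this substitution, assuming polynomial coefficients are stored in increasing-degree order as in numpy.polynomial (the example polynomial and τ are illustrative choices):

```python
import numpy as np
from numpy.polynomial import polynomial as P

def segment_to_line(p, tau):
    # Map a degree-l polynomial p, required non-negative on [-tau, tau], to the degree-2l
    # polynomial q(x) = (1 + x^2)^l * p(tau * (2x^2/(1 + x^2) - 1))
    #                 = sum_i p_i * tau^i * (x^2 - 1)^i * (x^2 + 1)^(l - i),
    # which is non-negative on all of R exactly when p is non-negative on the segment.
    l = len(p) - 1
    q = np.zeros(2 * l + 1)
    for i, pi in enumerate(p):
        term = P.polymul(P.polypow([-1.0, 0.0, 1.0], i),       # (x^2 - 1)^i
                         P.polypow([1.0, 0.0, 1.0], l - i))     # (x^2 + 1)^(l - i)
        q[:len(term)] += pi * (tau ** i) * term
    return q                                                    # coefficients q_0, ..., q_{2l}

print(segment_to_line([1.0, 0.0, -1.0], tau=1.0))               # p(theta) = 1 - theta^2
```

For p(θ) = 1 − θ² and τ = 1 the printed coefficients are (0, 0, 4, 0, 0), i.e. q(x) = 4x², which is indeed non-negative on the whole real line.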

73 Note that the matrix G (w, Xi , yi ) is sparse because the resulting polynomial contains only even powers of θ. [sent-177, score-0.487]

74 For example, in handwritten digit recognition transformations like rotation, scaling, translation, shearing, thinning/thickening etc. [sent-179, score-0.18]

75 Unfortunately, Proposition 1 only holds for polynomials in one variable. [sent-181, score-0.241]

76 However, its first statement may be generalised to polynomials of more than one variable: for every psd matrix $P \succeq 0$ the polynomial $p(\rho) = \theta_\rho^\top P \theta_\rho$ is non-negative everywhere, even if $\theta_i$ is any monomial in $\rho_1$, … [sent-182, score-0.639]

77 Considering polynomials of degree two and $\theta_\rho := (1, \rho_1, \ldots$ [sent-187, score-0.282]

78 , $\rho_D)$ we have $\tilde{x}_i(\rho) \approx \theta_\rho^\top \begin{pmatrix} x_i(0) & \nabla_\rho^\top x_i(0) \\ \nabla_\rho x_i(0) & \nabla_\rho \nabla_\rho^\top x_i(0) \end{pmatrix} \theta_\rho$, where $\nabla_\rho$ denotes the gradient and $\nabla_\rho \nabla_\rho^\top$ denotes the Hessian operator. [sent-190, score-1.375]

79 Note that the scaling behaviour with regard to the number D of parameters is more benign than that of the naive method of adding virtual examples to the training sample on a grid. [sent-191, score-0.313]

80 However, taking the dual SDPM (8) as a starting point and assuming the Taylor expansion (2) the crucial point is that in order to represent the polynomial trajectory in feature space we need to differentiate through the kernel function. [sent-195, score-0.423]

81 Footnote 4: There exist polynomials in more than one variable that are non-negative everywhere yet cannot be written as a sum of squares and are hence not SD-representable. [sent-197, score-0.357]

82 Note that the “support” vector is truly virtual since it was never directly supplied to the algorithm (inset zoom-in). [sent-240, score-0.281]

83 Then an inner product expression between data points $x_i$ and $x_j$ differentiated, respectively, u and v times reads $\sum_{s=1}^{N} \frac{d^u \varphi_s(x(\theta))}{d\theta^u} \cdot \frac{d^v \varphi_s(\tilde{x}(\theta))}{d\theta^v}$. [sent-245, score-0.307]
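
Since k(x(θ), x̃(θ′)) = ⟨φ(x(θ)), φ(x̃(θ′))⟩, these differentiated inner products can be obtained by differentiating the kernel function itself with respect to the two trajectory parameters. A minimal sketch for u = v = 1 via central finite differences; the RBF kernel, the step size, and the callables x_traj and z_traj are illustrative assumptions:

```python
import numpy as np

def k(a, b, gamma=0.5):
    # RBF kernel as an illustrative choice of k(.,.)
    return np.exp(-gamma * np.sum((a - b) ** 2))

def mixed_first_derivative(x_traj, z_traj, h=1e-4):
    # x_traj, z_traj: callables theta -> R^n giving the two polynomial trajectories;
    # returns d^2/(d theta d theta') k(x(theta), z(theta')) at theta = theta' = 0.
    return (k(x_traj(h), z_traj(h)) - k(x_traj(h), z_traj(-h))
            - k(x_traj(-h), z_traj(h)) + k(x_traj(-h), z_traj(-h))) / (4 * h * h)
```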

84 It turns out that for polynomials of degree r = 2 the exact calculation of elements of the kernel matrix is already $O(n^4)$ and needs to be approximated efficiently in practice. [sent-247, score-0.347]

85 4 Experimental Results: In order to test and illustrate the SDPM we used the well-known USPS data set of 16 × 16 pixel images in [0, 1] of handwritten digits. [sent-248, score-0.14]

86 We considered the transformation rotation by angle θ and calculated the first and second derivatives $x_i'(\theta = 0)$ and $x_i''(\theta = 0)$ based on an image representation smoothed by a Gaussian of variance 0.… [sent-249, score-0.688]
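
A sketch of this derivative computation, estimating x_i'(0) and x_i''(0) by central finite differences of the rotated, Gaussian-smoothed image; the step size, smoothing width, and interpolation settings are illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

def rotation_taylor_rows(image, sigma=0.75, d_theta=1.0):
    # Derivatives are taken with respect to the rotation angle in degrees.
    smooth = gaussian_filter(image, sigma)
    plus = rotate(smooth, +d_theta, reshape=False, order=3, mode="nearest")
    minus = rotate(smooth, -d_theta, reshape=False, order=3, mode="nearest")
    x0 = smooth.reshape(-1)                                    # (X_i)_{0,.}
    x1 = (plus - minus).reshape(-1) / (2 * d_theta)            # x_i'(0) = (X_i)_{1,.}
    x2 = (plus - 2 * smooth + minus).reshape(-1) / d_theta**2  # second derivative x_i''(0)
    return np.stack([x0, x1, 0.5 * x2])                        # rows scaled as in (2)
```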

87 Figure 2 (a) shows a plot of 10 training examples of digits “1” and “9” together with the quadratically approximated trajectories for θ ∈ [−20◦ , 20◦ ]. [sent-252, score-0.338]

88 The examples are separated by the solution found with an SDPM restricted to the same segment of the trajectory. [sent-253, score-0.128]

89 Following Propositions 2 and 3 the weight vector found is expressed as a linear combination of truly virtual support vectors that had not been supplied in the training sample directly (see inset). [sent-254, score-0.573]

90 In a second experiment, we probed the performance of the SDPM algorithm on the full feature set of 256 pixel intensities using 50 training sets of size m = 20 for each of the 45 one-versus-one classification tasks between all of the digits from “0” to “9” from the USPS data set. [sent-255, score-0.232]

91 For each task, the digits in one class were rotated by −10◦ and the digits of the other class by +10◦ . [sent-256, score-0.176]

92 We compared the performance of the SDPM algorithm to the performance of the original support vector machine (SVM) [2] and the virtual support vector machine (VSVM) [7] measured on independent test sets of size 250. [sent-257, score-0.417]

93 The VSVM takes the support vectors of the ordinary SVM run and is trained on a sample that contains these support vectors together with transformed versions rotated by −10◦ and +10◦ in the quadratic approximation. [sent-258, score-0.459]
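
For reference, a sketch of the VSVM baseline described here, assuming scikit-learn and 16 × 16 images stored as flattened rows; exact rotation is used in this sketch, whereas the paper rotates within the quadratic approximation, and C and the other settings are illustrative:

```python
import numpy as np
from scipy.ndimage import rotate
from sklearn.svm import SVC

def train_vsvm(X, y, image_shape=(16, 16), angles=(-10.0, 10.0), C=10.0):
    # X: numpy array of flattened images, y: labels in {-1, +1}.
    base = SVC(kernel="linear", C=C).fit(X, y)
    sv_x, sv_y = X[base.support_], y[base.support_]
    extra_x, extra_y = [], []
    for x, label in zip(sv_x, sv_y):
        for theta in angles:
            img = rotate(x.reshape(image_shape), theta, reshape=False, mode="nearest")
            extra_x.append(img.reshape(-1))
            extra_y.append(label)
    X_aug = np.vstack([sv_x, np.array(extra_x)])
    y_aug = np.concatenate([sv_y, np.array(extra_y)])
    return SVC(kernel="linear", C=C).fit(X_aug, y_aug)
```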

94 Clearly, taking into account the invariance is useful and leads to SDPM performance superior to the ordinary SVM. [sent-260, score-0.136]

95 The SDPM also performs slightly better than the VSVM; however, this could be attributed to the pre-selection of support vectors to which the transformation is applied. [sent-261, score-0.251]

96 It is expected that for an increasing number D of transformations the performance improvement becomes more pronounced because in high dimensions most volume is concentrated on the boundary of the convex hull of the polynomial manifold. [sent-262, score-0.275]

97 5 Conclusion: We introduced Semidefinite Programming Machines as a means of learning on infinite families of examples given in terms of polynomial trajectories or—more generally—manifolds in data space. [sent-263, score-0.438]

98 The crucial insight lies in the SD-representability of non-negative polynomials, which allows us to replace the simple non-negativity constraint in algorithms such as support vector machines by positive semidefinite constraints. [sent-264, score-0.428]

99 , by a version of the perceptron algorithm for semidefinite feasibility problems [3]—without necessarily maximising the margin. [sent-269, score-0.152]

100 Transformation invariance in pattern recognition, tangent distance and tangent propagation. [sent-333, score-0.229]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('sdpm', 0.453), ('xi', 0.275), ('polynomials', 0.241), ('yi', 0.237), ('sdp', 0.219), ('polynomial', 0.217), ('semide', 0.214), ('invariances', 0.16), ('virtual', 0.151), ('complementarity', 0.151), ('sdpms', 0.126), ('trajectories', 0.11), ('psd', 0.109), ('rn', 0.108), ('transformation', 0.102), ('qi', 0.101), ('vsvm', 0.101), ('proposition', 0.1), ('lmi', 0.1), ('invariance', 0.099), ('support', 0.095), ('expansion', 0.08), ('det', 0.076), ('dual', 0.074), ('everywhere', 0.074), ('optimisation', 0.071), ('feasibility', 0.069), ('usps', 0.069), ('examples', 0.067), ('training', 0.066), ('tangent', 0.065), ('digits', 0.063), ('segment', 0.061), ('ui', 0.061), ('coe', 0.059), ('aj', 0.059), ('transformations', 0.058), ('program', 0.058), ('constitutes', 0.057), ('truly', 0.057), ('pixel', 0.057), ('nite', 0.056), ('maximise', 0.056), ('classi', 0.056), ('svm', 0.056), ('vectors', 0.054), ('machines', 0.054), ('programming', 0.053), ('trajectory', 0.052), ('taylor', 0.05), ('perceptron', 0.05), ('nesterov', 0.05), ('shearing', 0.05), ('socp', 0.05), ('rotated', 0.05), ('weight', 0.048), ('wj', 0.048), ('di', 0.047), ('intensities', 0.046), ('digit', 0.046), ('pl', 0.046), ('margin', 0.046), ('quadratic', 0.045), ('manifolds', 0.044), ('graepel', 0.044), ('minimised', 0.044), ('propositions', 0.044), ('sdps', 0.044), ('auxiliary', 0.043), ('sake', 0.043), ('handwritten', 0.043), ('written', 0.042), ('constraints', 0.041), ('degree', 0.041), ('minimise', 0.04), ('herbrich', 0.04), ('images', 0.04), ('subject', 0.039), ('statement', 0.039), ('modern', 0.038), ('vector', 0.038), ('ectively', 0.037), ('ordinary', 0.037), ('rotation', 0.036), ('cone', 0.035), ('growth', 0.035), ('supplied', 0.035), ('illustration', 0.033), ('maximising', 0.033), ('primal', 0.033), ('matrix', 0.033), ('recognition', 0.033), ('approximated', 0.032), ('formulated', 0.032), ('inset', 0.032), ('reads', 0.032), ('restrict', 0.029), ('translates', 0.029), ('microsoft', 0.029), ('sample', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

Author: Thore Graepel, Ralf Herbrich

Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm—the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows us to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1

2 0.2660245 171 nips-2003-Semi-Definite Programming by Perceptron Learning

Author: Thore Graepel, Ralf Herbrich, Andriy Kharechko, John S. Shawe-taylor

Abstract: We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasible region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. 1

3 0.16237429 122 nips-2003-Margin Maximizing Loss Functions

Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie

Abstract: Margin maximizing properties play an important role in the analysis of classification models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a sufficient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classification problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1

4 0.15751056 124 nips-2003-Max-Margin Markov Networks

Author: Ben Taskar, Carlos Guestrin, Daphne Koller

Abstract: In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3 ) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches. 1

5 0.15124901 48 nips-2003-Convex Methods for Transduction

Author: Tijl D. Bie, Nello Cristianini

Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1

6 0.14610089 117 nips-2003-Linear Response for Approximate Inference

7 0.13198806 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

8 0.12206578 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

9 0.11123611 1 nips-2003-1-norm Support Vector Machines

10 0.099896334 112 nips-2003-Learning to Find Pre-Images

11 0.095215775 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

12 0.09192685 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

13 0.088589728 113 nips-2003-Learning with Local and Global Consistency

14 0.085988753 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

15 0.084540725 115 nips-2003-Linear Dependent Dimensionality Reduction

16 0.082835905 155 nips-2003-Perspectives on Sparse Bayesian Learning

17 0.081206031 126 nips-2003-Measure Based Regularization

18 0.080991469 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms

19 0.079361126 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

20 0.077619404 120 nips-2003-Locality Preserving Projections


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.258), (1, -0.144), (2, -0.086), (3, -0.111), (4, 0.152), (5, 0.019), (6, -0.071), (7, -0.132), (8, 0.08), (9, -0.058), (10, 0.08), (11, -0.022), (12, -0.019), (13, 0.049), (14, -0.166), (15, 0.123), (16, 0.011), (17, 0.003), (18, -0.185), (19, -0.041), (20, 0.071), (21, -0.071), (22, -0.245), (23, -0.104), (24, 0.086), (25, -0.022), (26, 0.178), (27, 0.06), (28, -0.047), (29, 0.02), (30, -0.07), (31, 0.176), (32, -0.027), (33, -0.088), (34, 0.017), (35, 0.061), (36, -0.049), (37, 0.005), (38, 0.011), (39, 0.012), (40, 0.063), (41, -0.148), (42, -0.054), (43, -0.005), (44, 0.027), (45, 0.062), (46, 0.035), (47, 0.035), (48, -0.02), (49, -0.054)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94220603 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

Author: Thore Graepel, Ralf Herbrich

Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm—the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows us to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1

2 0.80857158 171 nips-2003-Semi-Definite Programming by Perceptron Learning

Author: Thore Graepel, Ralf Herbrich, Andriy Kharechko, John S. Shawe-taylor

Abstract: We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) Semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasible region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods. 1

3 0.69634044 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

Author: Stuart Andrews, Thomas Hofmann

Abstract: Learning from ambiguous training data is highly relevant in many applications. We present a new learning algorithm for classification problems where labels are associated with sets of patterns instead of individual patterns. This encompasses multiple instance learning as a special case. Our approach is based on a generalization of linear programming boosting and uses results from disjunctive programming to generate successively stronger linear relaxations of a discrete non-convex problem. 1

4 0.6172182 48 nips-2003-Convex Methods for Transduction

Author: Tijl D. Bie, Nello Cristianini

Abstract: The 2-class transduction problem, as formulated by Vapnik [1], involves finding a separating hyperplane for a labelled data set that is also maximally distant from a given set of unlabelled test points. In this form, the problem has exponential computational complexity in the size of the working set. So far it has been attacked by means of integer programming techniques [2] that do not scale to reasonable problem sizes, or by local search procedures [3]. In this paper we present a relaxation of this task based on semidefinite programming (SDP), resulting in a convex optimization problem that has polynomial complexity in the size of the data set. The results are very encouraging for mid sized data sets, however the cost is still too high for large scale problems, due to the high dimensional search space. To this end, we restrict the feasible region by introducing an approximation based on solving an eigenproblem. With this approximation, the computational cost of the algorithm is such that problems with more than 1000 points can be treated. 1

5 0.5995121 174 nips-2003-Semidefinite Relaxations for Approximate Inference on Graphs with Cycles

Author: Michael I. Jordan, Martin J. Wainwright

Abstract: We present a new method for calculating approximate marginals for probability distributions defined by graphs with cycles, based on a Gaussian entropy bound combined with a semidefinite outer bound on the marginal polytope. This combination leads to a log-determinant maximization problem that can be solved by efficient interior point methods [8]. As with the Bethe approximation and its generalizations [12], the optimizing arguments of this problem can be taken as approximations to the exact marginals. In contrast to Bethe/Kikuchi approaches, our variational problem is strictly convex and so has a unique global optimum. An additional desirable feature is that the value of the optimal solution is guaranteed to provide an upper bound on the log partition function. In experimental trials, the performance of the log-determinant relaxation is comparable to or better than the sum-product algorithm, and by a substantial margin for certain problem classes. Finally, the zero-temperature limit of our log-determinant relaxation recovers a class of well-known semidefinite relaxations for integer programming [e.g., 3]. 1

6 0.48346782 124 nips-2003-Max-Margin Markov Networks

7 0.4645527 108 nips-2003-Learning a Distance Metric from Relative Comparisons

8 0.45725209 117 nips-2003-Linear Response for Approximate Inference

9 0.45233548 1 nips-2003-1-norm Support Vector Machines

10 0.432657 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

11 0.42881688 122 nips-2003-Margin Maximizing Loss Functions

12 0.39760613 112 nips-2003-Learning to Find Pre-Images

13 0.3939513 126 nips-2003-Measure Based Regularization

14 0.39211601 31 nips-2003-Approximate Analytical Bootstrap Averages for Support Vector Classifiers

15 0.38590521 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds

16 0.36343208 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

17 0.35641766 120 nips-2003-Locality Preserving Projections

18 0.34823957 113 nips-2003-Learning with Local and Global Consistency

19 0.34762749 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition

20 0.34729034 128 nips-2003-Minimax Embeddings


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.088), (11, 0.025), (35, 0.031), (53, 0.085), (71, 0.496), (76, 0.04), (85, 0.067), (91, 0.064), (99, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98252851 108 nips-2003-Learning a Distance Metric from Relative Comparisons

Author: Matthew Schultz, Thorsten Joachims

Abstract: This paper presents a method for learning a distance metric from relative comparisons such as “A is closer to B than A is to C”. Taking a Support Vector Machine (SVM) approach, we develop an algorithm that provides a flexible way of describing qualitative training data as a set of constraints. We show that such constraints lead to a convex quadratic programming problem that can be solved by adapting standard methods for SVM training. We empirically evaluate the performance and the modelling flexibility of the algorithm on a collection of text documents. 1

2 0.98206639 133 nips-2003-Mutual Boosting for Contextual Inference

Author: Michael Fink, Pietro Perona

Abstract: Mutual Boosting is a method aimed at incorporating contextual information to augment object detection. When multiple detectors of objects and parts are trained in parallel using AdaBoost [1], object detectors might use the remaining intermediate detectors to enrich the weak learner set. This method generalizes the efficient features suggested by Viola and Jones [2] thus enabling information inference between parts and objects in a compositional hierarchy. In our experiments eye-, nose-, mouth- and face detectors are trained using the Mutual Boosting framework. Results show that the method outperforms applications overlooking contextual information. We suggest that achieving contextual integration is a step toward human-like detection capabilities. 1 In trod u ction Classification of multiple objects in complex scenes is one of the next challenges facing the machine learning and computer vision communities. Although, real-time detection of single object classes has been recently demonstrated [2], naïve duplication of these detectors to the multiclass case would be unfeasible. Our goal is to propose an efficient method for detection of multiple objects in natural scenes. Hand-in-hand with the challenges entailing multiclass detection, some distinct advantages emerge as well. Knowledge on position of several objects might shed light on the entire scene (Figure 1). Detection systems that do not exploit the information provided by objects on the neighboring scene will be suboptimal. A B Figure 1: Contextual spatial relationships assist detection A. in absence of facial components (whitened blocking box) faces can be detected by context (alignment of neighboring faces). B. keyboards can be detected when they appear under monitors. Many human and computer vision models postulate explicitly or implicitly that vision follows a compositional hierarchy. Grounded features (that are innate/hardwired and are available prior to learning) are used to detect salient parts, these parts in turn enable detection of complex objects [3, 4], and finally objects are used to recognize the semantics of the entire scene. Yet, a more accurate assessment of human performance reveals that the visual system often violates this strictly hierarchical structure in two ways. First, part and whole detection are often evidently interacting [5, 6]. Second, several layers of the hierarchy are occasionally bypassed to enable swift direct detection. This phenomenon is demonstrated by gist recognition experiments where the semantic classification of an entire scene is performed using only minimal low level feature information [7]. The insights emerging from observing human perception were adopted by the object detection community. Many object detection algorithms bypass stages of a strict compositional hierarchy. The Viola & Jones (VJ) detector [2] is able to perform robust online face detection by directly agglomerating very low-level features (rectangle contrasts), without explicitly referring to facial parts. Gist detection from low-level spatial frequencies was demonstrated by Oliva and Torralba [8]. Recurrent optimization of parts and object constellation is also common in modern detection schemes [9]. Although Latent Semantic Analysis (making use of object cooccurrence information) has been adapted to images [10], the existing state of object detection methods is still far from unifying all the sources of visual contextual information integrated by the human perceptual system. 
Tackling the context integration problem and achieving robust multiclass object detection is a vital step for applications like image-content database indexing and autonomous robot navigation. We will propose a method termed Mutual Boosting to incorporate contextual information for object detection. Section 2 will start by posing the multiclass detection problem from labeled images. In Section 3 we characterize the feature sets implemented by Mutual Boosting and define an object's contextual neighborhood. Section 4 presents the Mutual Boosting framework aimed at integrating contextual information and inspired by the recurrent inferences dominating the human perceptual system. An application of the Mutual Boosting framework to facial component detection is presented in Section 5. We conclude with a discussion on the scope and limitations of the proposed framework. 2 Problem setting and basic notation Suppose we wish to detect multiple objects in natural scenes, and that these scenes are characterized by certain mutual positions between the composing objects. Could we make use of these objects' contextual relations to improve detection? Perceptual context might include multiple sources of information: information originating from the presence of existing parts, information derived from other objects in the perceptual vicinity and finally general visual knowledge on the scene. In order to incorporate these various sources of visual contextual information Mutual Boosting will treat parts, objects and scenes identically. We will therefore use the term object as a general term while referring to any entity in the compositional hierarchy. Let M denote the cardinality of the object set we wish to detect in natural scenes. Our goal is to optimize detection by exploiting contextual information while maintaining detection time comparable to M individual detectors trained without such information. We define the goal of the multiclass detection algorithm as generating M intensity maps Hm=1,..,M indicating the likelihood of object m appearing at different positions in a target image. We will use the following notation (Figure 2): • H0+/H0-: raw image input with/without the trained objects (A1 & A2) • Cm[i]: labeled position of instance i of object m in image H0+ • Hm: intensity map output indicating the likelihood of object m appearing in different positions in the image H0 (B) B. Hm A2. H0- A1. H0+ Cm[1] Cm[2] Cm[1] Cm[2] Figure 2: A1 & A2. Input: position of positive and negative examples of eyes in natural images. B. Output: Eye intensity (eyeness) detection map of image H0+ 3 F e a t u r e se t a n d c o n t e x t u a l wi n d o w g e n e r a l i za t i o n s The VJ method for real-time object-detection included three basic innovations. First, they presented the rectangle contrast-features, features that are evaluated efficiently, using an integral-image. Second, VJ introduced AdaBoost [1] to object detection using rectangle features as weak learners. Finally a cascade method was developed to chain a sequence of increasingly complex AdaBoost learners to enable rapid filtering of non-relevant sections in the target image. The resulting cascade of AdaBoost face detectors achieves a 15 frame per second detection speed, with 90% detection rate and 2x10-6 false alarms. This detection speed is currently unmatched. In order to maintain efficient detection and in order to benchmark the performance of Mutual Boosting we will adopt the rectangle contrast feature framework suggested by VJ. 
It should be noted that the grayscale rectangle features could be naturally extended to any image channel that preserves the semantics of summation. A diversified feature set (including color features, texture features, etc.) might saturate later than a homogeneous channel feature set. By making use of features that capture the object regularities well, one can improve performance or reduce detection time. VJ extract training windows that capture the exact area of the training faces. We term this the local window approach. A second approach, in line with our attempt to incorporate information from neighboring parts or objects, would be to make use of training windows that capture wide regions around the object (Figure 3)1. A B Figure 3: A local window (VJ) and a contextual window that captures relative position information from objects or parts around and within the detected object. 1 Contextual neighborhoods emerge by downscaling larger regions in the original image to a PxP resolution window. The contextual neighborhood approach contributes to detection when the applied channels require a wide contextual range as will be demonstrated in the Mutual Boosting scheme presented in the following section2. 4 Mutual Boosting The AdaBoost algorithm maintains a clear distinction between the boosting level and the weak-learner training level. The basic insight guiding the Mutual Boosting method reexamines this distinction, stipulating that when multiple objects and parts are trained simultaneously using AdaBoost; any object detector might combine the previously evolving intermediate detectors to generate new weak learners. In order to elaborate this insight it should first be noted that while training a strong learner using 100 iterations of AdaBoost (abbreviated AB100) one could calculate an intermediate strong learner at each step on the way (AB2 - AB99). To apply this observation for our multiclass detection problem we simultaneously train M object detectors. At each boosting iteration t the M detectors (ABmt-1) emerging at the previous stage t-1, are used to filter positive and negative3 training images, thus producing intermediate m-detection maps Hmt-1 (likelihood of object m in the images4). Next, the Mutual Boosting stage takes place and all the existing Hmt-1 maps are used as additional channels out of which new contrast features are selected. This process gradually enriches the initial grounded features with composite contextual features. The composite features are searched on a PxP wide contextual neighborhood region rather than the PxP local window (Figure 3). Following a dynamic programming approach in training and detection, Hm=1,..,M detection maps are constantly maintained and updated so that the recalculation of Hmt only requires the last chosen weak learner WLmn*t to be evaluated on channel Hn*t-1 of the training image (Figure 4). This evaluation produces a binary detection layer that will be weighted by the AdaBoost weak-learner weighting scheme and added to the previous stage map5. Although Mutual Boosting examines a larger feature set during training, an iteration of Mutual Boosting detection of M objects is as time-consuming as performing an AdaBoost detection iteration for M individual objects. The advantage of Mutual Boosting emerges from introducing highly informative feature sets that can enhance detection or require fewer boosting iterations. 
While most object detection applications extract a local window containing the object information and discard the remaining image (including the object positional information). Mutual Boosting processes the entire image during training and detection and makes constant use of the information characterizing objects’ relative-position in the training images. As we have previously stated, the detected objects might be in various levels of a compositional hierarchy (e.g. complex objects or parts of other objects). Nevertheless, Mutual Boosting provides a similar treatment to objects, parts and scenes enabling any compositional structure of the data to naturally emerge. We will term any contextual reference that is not directly grounded to the basic features, as a cross referencing of objects6. 2 The most efficient size of the contextual neighborhoods might vary, from the immediate to the entire image, and therefore should be empirically learned. 3 Images without target objects (see experimental section below) 4 Unlike the weak learners, the intermediate strong learners do not apply a threshold 5 In order to optimize the number of detection map integral image recalculations these maps might be updated every k (e.g. 50) iterations rather than at each iteration. 6 Scenes can be crossed referenced as well if scene labels are available (office/lab etc.). H0+/0- positive / negative raw images Cm[i] position of instance i of object m=1,..,M in image H0+ initialize boosting-weights of instances i of object m to 1 initialize detection maps Hm+0/Hm-0 to 0 Input Initialization For t=1,…,T For m=1,..,M and n=0,..,M (A) cutout & downscale local (n=0) or contextual (n>0) windows (WINm) of instances i of object m (at Cm[i]), from all existing images Hnt-1 For m=1,..,M normalize boosting-weights of object m instances [1] (B1&2) select map Hn*t-1 and weak learner WLmn* that minimize error on WINm decrease boosting-weights of instances that WLmn* labeled correctly [1] (C) DetectionLayermn* ← WLmn*(Hn*t-1) calculate α mt the weak learner contribution factor from the empirical error [1] (D) update m-detection map Hmt ← Hmt-1 + αmt DetectionLayermn * Return strong learner ABmT including WLmn*1,..,T and αm1,..,T (m=1,..,M) H0± raw image H1± . . . Hn*± (A) WIN m0 WL m0 (B1) . . . Hm± (A) WIN m1 (B2) WL m1 (B1) (B2) m detection map (A) WIN mn* WL (B1) (D) (C) Detection Layer mn* mn* Figure 4: Mutual Boosting Diagram & Pseudo code. Each raw image H0 is analyzed by M object detection maps Hm=1,.,M, updated by iterating through four steps: (A) cutout & downscale from existing maps H n=0,..,M t-1 a local (n=0) or contextual (n>0) PxP window containing a neighborhood of object m (B1&2) select best performing map Hn* and weak learner WLmn* that optimize object m detection (C) run WLmn* on Hn* map to generate a new binary m-detection layer (D) add m-detection layer to existing detection map Hm. [1] Standard AdaBoost stages are not elaborated To maintain local and global natural scene statistics, negative training examples are generated by pairing each image with an image of equal size that does not contain the target objects and by centering the local and contextual windows of the positive and negative examples on the object positions in the positive images (see Figure 2). 
By using parallel boosting and efficient rectangle contrast features, Mutual Boosting is capable of incorporating many information inferences (references in Figure 5): • Features could be used to directly detect parts and objects (A & B) • Objects could be used to detect other (or identical) objects in the image (C) • Parts could be used to detect other (or identical) nearby parts (D & E) • Parts could be used to detect objects (F) • Objects could be used to detect parts A. eye feature from raw image B. face feature from raw image C. face E. mouth feature from eye feature from face detection image detection image F. face feature from mouth D. eye feature from eye detection image detection image Figure 5: A-E. Emerging features of eyes, mouths and faces (presented on windows of raw images for legibility). The windows’ scale is defined by the detected object size and by the map mode (local or contextual). C. faces are detected using face detection maps HFace, exploiting the fact that faces tend to be horizontally aligned. 5 Experiments A. Pd In order to test the contribution of the Mutual Boosting process we focused on detection of objects in what we term a face-scene (right eye, left eye, nose, mouth and face). We chose to perform contextual detection in the face-scene for two main reasons. First as detailed in Figure 5, face scenes demonstrate a range of potential part and object cross references. Second, faces have been the focus of object detection research for many years, thus enabling a systematic result comparison. Experiment 1 was aimed at comparing the performance of Mutual Boosting to that of naïve independently trained object detectors using local windows. Pfa Figure 6: A. Two examples of the CMU/MIT face database. B. Mutual Boosting and AdaBoost ROCs on the CMU/MIT face database. Face-scene images were downloaded from the web and manually labeled7. Training relied on 450 positive and negative examples (~4% of the images used by VJ). 400 iterations of local window AdaBoost and contextual window Mutual Boosting were performed on the same image set. Contextual windows encompassed a region five times larger in width and height than the local windows8 (see Figure 3). 7 By following CMU database conventions (R-eye, L-eye, Nose & Mouth positions) we derive both the local window position and the relative position of objects in the image 8 Local windows were created by downscaling objects to 25x25 grids Test image detection maps emerge from iteratively summing T m-detection layers (Mutual Boosting stages C&D;). ROC performance on the CMU/MIT face database (see sample images in Figure 6A) was assessed using a threshold on position Cm[i] that best discriminated the final positive and negative detection maps Hm+/-T. Figure 6B demonstrates the superiority of Mutual Boosting to grounded feature AdaBoost. A. COV 0.25 COV 1.00 COV 4.00 Equal error performance Our second experiment was aimed at assessing the performance of Mutual Boosting as we change the detected configurations’ variance. Assuming normal distribution of face configurations we estimated (from our existing labeled set) the spatial covariance between four facial components (noses, mouths and both eyes). We then modified the covariance matrix, multiplying it by 0.25, 1 or 4 and generated 100 artificial configurations by positioning four contrasting rectangles in the estimated position of facial components. 
Although both Mutual Boosting and AdaBoost performance degraded as the configuration variance increased, the advantage of Mutual Boosting persists both in rigid and in varying configurations9 (Figure 7). MB sigma=0.25 MB sigma=1.00 MB sigma=4.00 AB sigma=0.25 AB sigma=1.00 AB sigma=4.00 Boosting iteration Figure 7: A. Artificial face configurations with increasing covariance B. MB and AB Equal error rate performance on configurations with varying covariance as a function of boosting iterations. 6 D i s c u s s i on While evaluating the performance of Mutual Boosting it should be emphasized that we did not implement the VJ cascade approach; therefore we only attempt to demonstrate that the power of a single AdaBoost learner could be augmented by Mutual Boosting. The VJ detector is rescaled in order to perform efficient detection of objects in multiple scales. For simplicity, scale of neighboring objects and parts was assumed to be fixed so that a similar detector-rescaling approach could be followed. This assumption holds well for face-scenes, but if neighboring objects may vary in scale a single m-detection map will not suffice. However, by transforming each m-detection image to an m-detection cube, (having scale as the third dimension) multi-scale context detection could be achieved10. The dynamic programming characteristic of Mutual Boosting (simply reusing the multiple position and scale detections already performed by VJ) will ensure that the running time of varying scale context will only be doubled. It should be noted that the facescene is highly structured and therefore it is a good candidate for demonstrating 9 In this experiment the resolution of the MB windows (and the number of training features) was decreased so that information derived from the higher resolution of the parts would be ruled out as an explaining factor for the Mutual Boosting advantage. This procedure explains the superior AdaBoost performance in the first boosting iteration. 10 By using an integral cube, calculating the sum of a cube feature (of any size) requires 8 access operations (only double than the 4 operations required in the integral image case). Mutual Boosting; however as suggested by Figure 7B Mutual Boosting can handle highly varying configurations and the proposed method needs no modification when applied to other scenes, like the office scene in Figure 111. Notice that Mutual Boosting does not require a-priori knowledge of the compositional structure but rather permits structure to naturally emerge in the cross referencing pattern (see examples in Figure 5). Mutual Boosting could be enhanced by unifying the selection of weak-learners rather than selecting an individual weak learner for each object detector. Unified selection is aimed at choosing weak learners that maximize the entire object set detection rate, thus maximizing feature reuse [11]. This approach is optimal when many objects with common characteristics are trained. Is Mutual Boosting specific for image object detection? Indeed it requires labeled input of multiple objects in a scene supplying a local description of the objects as well as information on their contextual mutual positioning. But these criterions are shared by other complex

same-paper 3 0.97659826 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines

Author: Thore Graepel, Ralf Herbrich

Abstract: Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm—the Semidefinite Programming Machine (SDPM)—which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows us to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods. 1

4 0.9723323 142 nips-2003-On the Concentration of Expectation and Approximate Inference in Layered Networks

Author: Xuanlong Nguyen, Michael I. Jordan

Abstract: We present an analysis of concentration-of-expectation phenomena in layered Bayesian networks that use generalized linear models as the local conditional probabilities. This framework encompasses a wide variety of probability distributions, including both discrete and continuous random variables. We utilize ideas from large deviation analysis and the delta method to devise and evaluate a class of approximate inference algorithms for layered Bayesian networks that have superior asymptotic error bounds and very fast computation time. 1

5 0.96696222 156 nips-2003-Phonetic Speaker Recognition with Support Vector Machines

Author: William M. Campbell, Joseph P. Campbell, Douglas A. Reynolds, Douglas A. Jones, Timothy R. Leek

Abstract: A recent area of significant progress in speaker recognition is the use of high level features—idiolect, phonetic relations, prosody, discourse structure, etc. A speaker not only has a distinctive acoustic sound but uses language in a characteristic manner. Large corpora of speech data available in recent years allow experimentation with long term statistics of phone patterns, word patterns, etc. of an individual. We propose the use of support vector machines and term frequency analysis of phone sequences to model a given speaker. To this end, we explore techniques for text categorization applied to the problem. We derive a new kernel based upon a linearization of likelihood ratio scoring. We introduce a new phone-based SVM speaker recognition approach that halves the error rate of conventional phone-based approaches.

6 0.80584478 117 nips-2003-Linear Response for Approximate Inference

7 0.72440624 132 nips-2003-Multiple Instance Learning via Disjunctive Programming Boosting

8 0.68927902 41 nips-2003-Boosting versus Covering

9 0.67987913 163 nips-2003-Probability Estimates for Multi-Class Classification by Pairwise Coupling

10 0.67429394 145 nips-2003-Online Classification on a Budget

11 0.67028576 121 nips-2003-Log-Linear Models for Label Ranking

12 0.66900218 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints

13 0.66737282 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications

14 0.66427696 100 nips-2003-Laplace Propagation

15 0.65972239 122 nips-2003-Margin Maximizing Loss Functions

16 0.65657878 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification

17 0.64661425 192 nips-2003-Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes

18 0.63884693 187 nips-2003-Training a Quantum Neural Network

19 0.63674623 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection

20 0.63673365 189 nips-2003-Tree-structured Approximations by Expectation Propagation