jmlr jmlr2010 jmlr2010-26 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
Reference: text
sentIndex sentText sentNum sentScore
1 To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. [sent-7, score-0.534]
2 These SVs obtained locally per node are incrementally passed on to neighboring nodes, and further processed at the FC to obtain a discriminant function approaching the centralized one obtained as if all training sets were centrally available. [sent-37, score-0.782]
3 Convergence of the incremental distributed (D) SVM to the centralized SVM requires multiple SV exchanges between the nodes and the FC (Flouri et al. [sent-38, score-0.666]
4 Without updating local SVs through node-FC exchanges, DSVM schemes can approximate but not ensure the performance of centralized SVM classifiers (Navia-Vazquez et al. [sent-41, score-0.462]
5 Another class of DSVMs deals with parallel designs of centralized SVMs—a direction well motivated when training sets are prohibitively large (Chang et al. [sent-43, score-0.46]
6 Moreover, convergence to the centralized SVM is generally not guaranteed for any partitioning of the aggregate data set (Graf et al. [sent-49, score-0.402]
7 The centralized SVM problem is cast as a set of coupled decentralized convex optimization subproblems with consensus constraints imposed on the desired classifier parameters. [sent-53, score-0.492]
8 Using the alternating direction method of multipliers (ADMoM), see, for example, Bertsekas and Tsitsiklis (1997), distributed training algorithms that are provably convergent to the centralized SVM are developed based solely on message exchanges among neighboring nodes. [sent-54, score-0.644]
9 For linear SVMs, the novel DSVM algorithm is provably convergent to a centralized SVM classifier, as if all distributed samples were available centrally. [sent-77, score-0.466]
10 If those points correspond to a classification query, the network “agrees on” the classification of these points with performance identical to the centralized one. [sent-79, score-0.451]
11 For other classification queries, nodes provide classification results that closely approximate the centralized SVM. [sent-80, score-0.533]
12 To provide context, Section 2 outlines the centralized linear and nonlinear SVM designs. [sent-86, score-0.402]
13 At every node j ∈ J, a labeled training set S_j := {(x_{jn}, y_{jn}) : n = 1, ..., N_j} is available, [sent-102, score-1.247]
14 where N_j is the size of S_j, x_{jn} ∈ X is a p × 1 data vector belonging to the input space X ⊆ R^p, and y_{jn} ∈ Y := {−1, 1} denotes its corresponding class label. [sent-105, score-1.04]
15 Given S_j per node j, the goal is to find a maximum-margin linear discriminant function g(x) in a distributed fashion, and thus enable each node to classify any new input vector x to one of the two classes {−1, 1} without communicating S_j to other nodes j′ ≠ j. [sent-106, score-0.604]
16 Sensor j measures and forms a local binary decision variable y_{jn} ∈ {1, −1}, where y_{jn} = 1 (−1) indicates presence (absence) of the pollutant at the position vector x_j := [x_{j1}, x_{j2}, x_{j3}]^T. [sent-110, score-1.1]
17 The goal is to have each low-cost sensor improve the performance of local detection achieved based on S_j = {([x_j^T, t_n]^T, y_{jn}) : n = 1, ... [sent-112, score-0.634]
18 , age, sex, or blood pressure), and y_{jn} is a particular diagnosis (e. [sent-123, score-0.52]
19 However, a nonchalant exchange of database entries (x_{jn}^T, y_{jn}) can pose a privacy risk for the information exchanged. [sent-129, score-1.117]
20 If {S_j}_{j=1}^J were all centrally available at an FC, then the global variables w* and b* describing the centralized maximum-margin linear discriminant function g*(x) = x^T w* + b* could be obtained by solving the convex optimization problem; see, for example, Schölkopf and Smola (2002, Ch. [sent-137, score-0.498]
21 7) {w*, b*} = arg min_{w, b, {ξ_{jn}}} (1/2)||w||² + C ∑_{j=1}^J ∑_{n=1}^{N_j} ξ_{jn}, s.t. y_{jn}(w^T x_{jn} + b) ≥ 1 − ξ_{jn}, ξ_{jn} ≥ 0, ∀ j ∈ J, n = 1, ..., N_j. (1) [sent-138, score-1.065]
22 Here the slack variables ξ_{jn} account for non-linearly separable training sets, and C is a tunable positive scalar. [sent-146, score-0.578]
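To make the centralized problem (1) concrete, the sketch below minimizes its standard hinge-loss equivalent, (1/2)||w||² + C ∑_{j,n} max(0, 1 − y_{jn}(w^T x_{jn} + b)), by plain subgradient descent over the pooled data. This is only an illustrative stand-in for the QP solvers the paper assumes; the step-size schedule, iteration counts, and the toy data in the usage example are assumptions, not taken from the paper.

```python
import numpy as np

def centralized_linear_svm(X, y, C=10.0, n_iter=2000, lr0=0.1):
    """Hinge-loss (soft-margin) linear SVM trained by subgradient descent.

    X : (N, p) array stacking the training vectors from all nodes.
    y : (N,) array of labels in {-1, +1}.
    Returns the primal variables (w, b) of problem (1).
    """
    N, p = X.shape
    w, b = np.zeros(p), 0.0
    for t in range(1, n_iter + 1):
        lr = lr0 / np.sqrt(t)                  # diminishing step size (assumption)
        margins = y * (X @ w + b)
        viol = margins < 1.0                   # points with active hinge loss
        # Subgradient of 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
    y = np.hstack([-np.ones(50), np.ones(50)])
    w, b = centralized_linear_svm(X, y)
    print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```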
23 Nonlinear discriminant functions g(x) can also be found along the lines of (1) after mapping vectors x_{jn} to a higher-dimensional space H ⊆ R^P, with P > p, via a nonlinear transformation φ : X → H. [sent-147, score-0.613]
24 The generalized maximum-margin linear classifier in H is then obtained after replacing x_{jn} with φ(x_{jn}) in (1), and solving the following optimization problem: {w*, b*} = arg min_{w, b, {ξ_{jn}}} (1/2)||w||² + C ∑_{j=1}^J ∑_{n=1}^{N_j} ξ_{jn} (2) [sent-148, score-2.105]
25 s.t. y_{jn}(w^T φ(x_{jn}) + b) ≥ 1 − ξ_{jn}, ξ_{jn} ≥ 0, ∀ j ∈ J, n = 1, ..., N_j. [sent-150, score-1.56]
26 Letting λ_{jn} denote the Lagrange multiplier corresponding to the constraint y_{jn}(w^T φ(x_{jn}) + b) ≥ 1 − ξ_{jn}, the dual problem of (2) is: [sent-158, score-2.625]
27 max_{{λ_{jn}}} −(1/2) ∑_{j=1}^J ∑_{i=1}^J ∑_{n=1}^{N_j} ∑_{m=1}^{N_i} λ_{jn} λ_{im} y_{jn} y_{im} φ^T(x_{jn}) φ(x_{im}) + ∑_{j=1}^J ∑_{n=1}^{N_j} λ_{jn}, s.t. ∑_{j=1}^J ∑_{n=1}^{N_j} λ_{jn} y_{jn} = 0, 0 ≤ λ_{jn} ≤ C ∀ j ∈ J, n = 1, ..., N_j. (3) [sent-160, score-3.64]
28 Training vectors corresponding to non-zero λ*_{jn}'s constitute the SVs. [sent-165, score-1.064]
29 Once the SVs are identified, all other training vectors with λ*_{jn} = 0 can be discarded since they do not contribute to w*. [sent-166, score-0.602]
30 Solving (3) does not require knowledge of φ but only inner product values φ^T(x_{jn}) φ(x_{im}) := K(x_{jn}, x_{im}), which can be computed through a pre-selected positive semi-definite kernel K : X × X → R; see, for example, Schölkopf and Smola (2002, Ch. [sent-168, score-1.067]
31 Although not explicitly given, the optimal slack variables ξ*_{jn} can be found through the KKT conditions of (2) in terms of λ*_{jn} (Schölkopf and Smola, 2002). [sent-170, score-0.52]
32 The optimal discriminant function can also be expressed in terms of kernels as g*(x) = ∑_{j=1}^J ∑_{n=1}^{N_j} λ*_{jn} y_{jn} K(x_{jn}, x) + b* (5), where b* = y_{jn} − ∑_{i=1}^J ∑_{m=1}^{N_i} λ*_{im} y_{im} K(x_{im}, x_{jn}) for any SV x_{jn} with λ*_{jn} ∈ (0, C). [sent-171, score-3.189]
33 This so-called kernel trick allows finding maximum-margin linear classifiers in higher-dimensional spaces without explicitly operating in such spaces (Schölkopf and Smola, 2002). [sent-172, score-0.52]
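As a small illustration of how (5) is used at classification time once the dual variables are available, the sketch below evaluates g*(x) with a Gaussian kernel; the multipliers lam are assumed to come from some dual QP solver for (3), and the kernel bandwidth is an arbitrary assumption.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for every row pair of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_discriminant(x, X_train, y_train, lam, b, sigma=1.0):
    """Evaluate g*(x) = sum_{j,n} lambda_{jn} y_{jn} K(x_{jn}, x) + b*  (cf. (5)).

    X_train stacks the training vectors x_{jn} of all nodes; lam holds the
    corresponding dual multipliers (its non-zero entries mark the SVs).
    """
    k = gaussian_kernel(X_train, x[None, :], sigma)[:, 0]   # K(x_{jn}, x) for all jn
    return float(np.dot(lam * y_train, k) + b)
```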
34 The objective here is to develop fully distributed solvers of the centralized problems in (1) and (2) while guaranteeing performance approaching that of a centralized equivalent SVM. [sent-173, score-0.868]
35 Recall that exchanging all local SVs among all nodes in the network several times is necessary for incremental DSVMs to approach the optimal centralized solution. [sent-177, score-0.71]
36 With proper scaling of the cost by J, the proposed consensus-based reformulation of (1) becomes: min_{{w_j, b_j, ξ_{jn}}} (1/2) ∑_{j=1}^J ||w_j||² + JC ∑_{j=1}^J ∑_{n=1}^{N_j} ξ_{jn} [sent-185, score-1.077]
37 s.t. y_{jn}(w_j^T x_{jn} + b_j) ≥ 1 − ξ_{jn}, ξ_{jn} ≥ 0, ∀ j ∈ J, n = 1, ..., N_j, and w_j = w_i, b_j = b_i, ∀ j ∈ J, ∀ i ∈ B_j. (6) [sent-187, score-2.08]
38 s.t. Y_j X_j v_j ⪰ 1_j − ξ_j and ξ_j ⪰ 0_j (entry-wise), ∀ j ∈ J, together with v_j = ω_{ji}, ω_{ji} = v_i, ∀ j ∈ J, ∀ i ∈ B_j, (7) where the redundant variables {ω_{ji}} will turn out to facilitate the decoupling of the classifier parameters v_j at node j from those of its neighbors i ∈ B_j. [sent-215, score-0.853]
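To fix notation in code form, a minimal sketch of the per-node quantities the reformulation (7) works with; the convention v_j = [w_j; b_j] and the augmented matrices follow the surrounding text, while treating X_j's rows as the x_{jn}^T and initializing v_j at zero are assumptions about data layout.

```python
import numpy as np

def node_quantities(X_j, y_j):
    """Per-node quantities used in the consensus reformulation (7).

    X_j : (N_j, p) matrix whose rows are the local training vectors x_{jn}^T.
    y_j : (N_j,) local labels in {-1, +1}.
    Returns the augmented data matrix [X_j, 1_j], the diagonal label matrix
    Y_j, and an initial stacked classifier v_j = [w_j; b_j].
    """
    N_j, p = X_j.shape
    X_aug = np.hstack([X_j, np.ones((N_j, 1))])   # appends the all-ones column for the bias
    Y_diag = np.diag(y_j)                         # Y_j := diag([y_{j1}, ..., y_{jN_j}])
    v_j = np.zeros(p + 1)                         # v_j stacks w_j (p entries) and b_j
    return X_aug, Y_diag, v_j
```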
39 As in the centralized case, problem (7) will be solved through its dual. [sent-216, score-0.402]
40 The role of these quadratic terms ||v_j − ω_{ji}||² and ||ω_{ji} − v_i||² is twofold: (a) they effect strict convexity of L with respect to (w. [sent-218, score-0.464]
41 Lemma 2 links the proposed DSVM design with the convergent ADMoM solver, and thus ensures convergence of the novel MoM-DSVM to the centralized SVM classifier. [sent-229, score-0.426]
42 Indeed, simple inspection of (8) confirms that with all other variables fixed, the cost in (10) is linear-quadratic in ω_{ji}; hence, ω_{ji}(t + 1) can be found in closed form per iteration, and the resulting closed-form expression can be substituted back to eliminate ω_{ji} from L. [sent-231, score-0.624]
43 Similar to the centralized SVM algorithm, if [λ_j(t)]_n ≠ 0, then [x_{jn}^T, 1]^T is an SV. [sent-247, score-0.402]
44 Finding λ_j(t + 1) as in (16) requires solving a quadratic optimization problem similar to the one that a centralized SVM would solve, for example, via a gradient projection algorithm or an interior point method; see, for example, Schölkopf and Smola (2002, Ch. [sent-248, score-0.922]
45 However, the number of variables involved in (16) per iteration per node is considerably smaller when compared to its centralized counterpart, namely N_j versus ∑_{j=1}^J N_j. [sent-250, score-0.681]
46 Note that at any given iteration t of the algorithm, each node j can evaluate its own local discriminant function g_j^{(t)}(x) for any vector x ∈ X as g_j^{(t)}(x) = [x^T, 1] v_j(t) (19), which from Proposition 1 is guaranteed to converge to the same solution across all nodes as t → ∞. [sent-260, score-0.496]
47 Figure 2: Visualization of iterations (16)-(18): (left) every node j ∈ J computes λ_j(t + 1) to obtain v_j(t + 1), and then broadcasts v_j(t + 1) to all neighbors i ∈ B_j; (right) once every node j ∈ J has received v_i(t + 1) from all i ∈ B_j, it computes α_j(t + 1). [sent-263, score-0.481]
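The sketch below mimics only the communication pattern of iterations (16)-(18) just visualized: each node refines its local v_j, "broadcasts" it to its neighbors, and then updates its multiplier vector. As assumptions, the exact per-node quadratic program in (16) is replaced here by a few subgradient steps on a hinge-loss surrogate of the node's augmented objective, the multiplier step only follows α_j(t + 1) in spirit, and all parameter names and values (JC, eta, iteration counts) are illustrative rather than the paper's.

```python
import numpy as np

def consensus_dsvm_sketch(X_nodes, y_nodes, neighbors, JC=20.0, eta=10.0,
                          n_outer=200, n_inner=20, lr=0.01):
    """Illustrative consensus-based distributed linear SVM (not the exact MoM-DSVM).

    X_nodes[j]   : (N_j, p) local training vectors at node j.
    y_nodes[j]   : (N_j,) local labels in {-1, +1}.
    neighbors[j] : list of one-hop neighbors B_j.
    Each node keeps v_j = [w_j; b_j] and a multiplier alpha_j; per outer
    iteration it refines v_j locally, broadcasts it, then updates alpha_j.
    """
    J = len(X_nodes)
    p = X_nodes[0].shape[1]
    V = np.zeros((J, p + 1))            # v_j = [w_j, b_j] per node
    A = np.zeros((J, p + 1))            # alpha_j multipliers per node
    for t in range(n_outer):
        V_old = V.copy()
        for j in range(J):
            Xa = np.hstack([X_nodes[j], np.ones((len(X_nodes[j]), 1))])
            yj = y_nodes[j]
            # Consensus targets: midpoints between v_j(t) and each neighbor's v_i(t)
            targets = [(V_old[j] + V_old[i]) / 2.0 for i in neighbors[j]]
            v = V_old[j].copy()
            for _ in range(n_inner):    # crude inner solver replacing the QP in (16)
                margins = yj * (Xa @ v)
                viol = margins < 1.0
                g = np.append(v[:-1], 0.0)                          # grad of 0.5*||w_j||^2
                g -= JC * (yj[viol, None] * Xa[viol]).sum(axis=0)   # hinge-loss subgradient
                g += 2.0 * A[j]
                for trg in targets:
                    g += 2.0 * eta * (v - trg)                      # consensus penalty
                v -= lr * g
            V[j] = v
        for j in range(J):              # multiplier update after all broadcasts
            A[j] += (eta / 2.0) * sum(V[j] - V[i] for i in neighbors[j])
    return V
```

A caller would pass lists of per-node (X_j, y_j) arrays plus an adjacency list; with a connected topology, the rows of the returned V should drift toward a common classifier, which is the behavior the convergence results above establish for the actual algorithm.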
48 Remark 1 The messages exchanged among neighboring nodes in the MoM-DSVM algorithm correspond to local estimates v_j(t), which, together with the local multiplier vectors α_j(t), convey sufficient information about the local training sets to achieve consensus globally. [sent-264, score-0.555]
49 An online version of DSVM is thus well motivated when a new training example x_{jn}(t), along with its label y_{jn}(t) acquired at time t, is incorporated into X_j(t) and Y_j(t), respectively. [sent-288, score-1.098]
50 Remark 3 Compared to existing centralized online SVM alternatives in, for example, Cauwenberghs and Poggio (2000) and Fung and Mangasarian (2002), the online MoM-DSVM algorithm of this section allows seamless integration of both distributed and online processing. [sent-309, score-0.466]
51 The space of functions g_j described by (24) is fully determined by the span of the kernel function K(·, ·) centered at training vectors {x_{jn}, n = 1, ... [sent-351, score-0.602]
52 Coefficients a*_{jn} and c*_{jl} are found so that all local discriminants g*_j agree on their values at the points {χ_l}_{l=1}^L. [sent-358, score-0.631]
53 However, finding the coefficients a*_{jn}, c*_{jl} and b*_j in a distributed fashion remains an issue. [sent-361, score-0.635]
54 Similar to (7), introduce auxiliary variables {ω_{ji}} ({ζ_{ji}}) to decouple the constraints Gw_j = Gw_i (b_j = b_i) across nodes, and let α_{jik} (β_{jik}) denote the corresponding Lagrange multipliers (cf. [sent-363, score-0.636]
55 Collect the common points into Γ := [χ_1, ..., χ_L]^T, and define the kernel matrices with entries [K(X_j, X_j)]_{n,m} := K(x_{jn}, x_{jm}) (28), [K(X_j, Γ)]_{n,l} := K(x_{jn}, χ_l) (29), and [K(Γ, Γ)]_{l,l′} := K(χ_l, χ_{l′}) (30). [sent-372, score-1.04]
56 The latter is specified by the coefficients {a_{jn}(t)}, {c_{jl}(t)} and {b_j(t)} that can be obtained in closed form, as shown in the next proposition. [sent-374, score-0.571]
57 The local discriminant function is g_j^{(t)}(x) = ∑_{n=1}^{N_j} a_{jn}(t) K(x, x_{jn}) + ∑_{l=1}^L c_{jl}(t) K(x, χ_l) + b_j(t) (31), where a_j(t) := [a_{j1}(t), ... [sent-379, score-0.7]
58 With arbitrary initialization λ_j(0), w_j(0), and b_j(0), and with α_j(0) = 0_{L×1} and β_j(0) = 0, the iterates {a_{jn}(t)}, {c_{jl}(t)} and {b_j(t)} in (32), (33) and (34) converge to {a*_{jn}}, {c*_{jl}} and {b*_j} in (24) as t → ∞, ∀ j ∈ J, n = 1, ... [sent-391, score-0.603]
59 ..., L, where g*(χ_l) = φ^T(χ_l) w* + b*, and {w*, b*} are the optimal solution of the centralized problem (2). [sent-421, score-0.402]
60 In this case, MoM-NDSVM finds local approximations g*_j to the centralized g* which accommodate information available to all nodes. [sent-439, score-0.462]
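To illustrate how a node would use the expansion just discussed (cf. (24) and (31)) at run time, the sketch below evaluates the local nonlinear discriminant from its coefficients; the coefficient vectors a_j, c_j and the bias b_j are assumed to come from the MoM-NDSVM iterations, and the Gaussian kernel choice and bandwidth are assumptions.

```python
import numpy as np

def local_nonlinear_discriminant(x, X_j, Chi, a_j, c_j, b_j, sigma=1.0):
    """g_j(x) = sum_n a_{jn} K(x, x_{jn}) + sum_l c_{jl} K(x, chi_l) + b_j  (cf. (31)).

    X_j : (N_j, p) local training vectors of node j.
    Chi : (L, p) common points chi_1, ..., chi_L shared by all nodes.
    """
    k_train = np.exp(-((X_j - x) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    k_common = np.exp(-((Chi - x) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    return float(a_j @ k_train + c_j @ k_common + b_j)
```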
61 Consider every entry k of the training vectors {x_{jn}}, and form the intervals I_k := [x_k^min, x_k^max], k = 1, ..., p, [sent-443, score-0.602]
62 where x_k^min := min_{j∈J, n=1,...,N_j} [x_{jn}]_k and x_k^max := max_{j∈J, n=1,...,N_j} [x_{jn}]_k, and partition each I_k uniformly to obtain the common points. [sent-449, score-0.52]
63 In this case, MoM-NDSVM performs a global consensus step on the entry-wise maxima and minima of the training vectors {x_{jn}}. [sent-468, score-0.663]
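A minimal sketch of the grid construction just described: every entry k gets an interval [x_k^min, x_k^max], which in the distributed setting would come from the entry-wise max/min consensus step mentioned above (simulated here by stacking the local sets), and each interval is partitioned uniformly; the number of points per dimension is an assumed parameter.

```python
import numpy as np
from itertools import product

def common_grid_points(X_nodes, points_per_dim=3):
    """Build common points {chi_l} as a uniform grid over the entry-wise
    [min, max] intervals of all training vectors across nodes."""
    X_all = np.vstack(X_nodes)          # stand-in for a network-wide min/max consensus step
    lo, hi = X_all.min(axis=0), X_all.max(axis=0)
    axes = [np.linspace(lo[k], hi[k], points_per_dim) for k in range(X_all.shape[1])]
    return np.array(list(product(*axes)))   # L = points_per_dim ** p grid points
```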
64 Once again, we consider every entry k of the training vectors {x_{jn}}. [sent-471, score-0.602]
65 MoM-NDSVM starts by performing a consensus step on the entry-wise maxima and minima of the local training vectors {x_{jn}}. [sent-472, score-0.723]
66 As mentioned earlier, the number of points L affects how close local functions are to each other as well as to the centralized one. [sent-478, score-0.462]
67 Each local training set S_j consists of N_j = N = 10 labeled examples and was generated by: (i) randomly choosing class C_k, k = 1, 2; and (ii) randomly generating a labeled example (x_{jn}^T, y_{jn} = C_k) with x_{jn} ∼ N(m_k, Σ). [sent-499, score-1.158]
68 Thus, the global training set contains JN = 300 training examples. [sent-500, score-0.636]
69 The empirical risk of the centralized SVM in (1) is defined as R_emp^central := (1/N_T) ∑_{n=1}^{N_T} (1/2)|y_n − ŷ_n|, where ŷ_n is the predicted label for x_n. [sent-509, score-0.438]
70 The average empirical risk of the MoM-DSVM algorithm as a function of the number of iterations is defined as R̂_emp(t) := (1/(J N_T)) ∑_{j=1}^J ∑_{n=1}^{N_T} (1/2)|y_n − ŷ_{jn}(t)| (40), where ŷ_{jn}(t) is the label prediction at iteration t and node j for x_n, n = 1, ... [sent-510, score-1.332]
71 The average empirical risk of the local SVMs across nodes, R_emp^local, is defined as in (40) with ŷ_{jn} found using only locally-trained SVMs. [sent-514, score-0.788]
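These risk curves can be computed directly from node-wise predictions; a short sketch under the assumption that the predicted labels ŷ_{jn}(t) for the N_T test points are stored as one row per node at iteration t.

```python
import numpy as np

def average_empirical_risk(y_true, Y_pred_nodes):
    """R_emp(t) = (1 / (J * N_T)) * sum_j sum_n 0.5 * |y_n - yhat_{jn}(t)|  (cf. (40)).

    y_true : (N_T,) test labels in {-1, +1}.
    Y_pred_nodes : (J, N_T) predicted labels, one row per node, at iteration t.
    """
    J, NT = Y_pred_nodes.shape
    return 0.5 * np.abs(Y_pred_nodes - y_true[None, :]).sum() / (J * NT)
```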
72 The centralized and local empirical risks with C = 10 are included for comparison. [sent-517, score-0.462]
73 Clearly, the risk of the MoM-DSVM algorithm reduces as the number of iterations increases, quickly outperforming local-based predictions and approaching that of the centralized benchmark. [sent-519, score-0.499]
74 To further visualize this test case, Figure 3 (right) shows the global training set, along with the linear discriminant functions found by the centralized SVM and the MoM-DSVM at two different nodes after 400 iterations with JC = 20 and η = 10. [sent-520, score-0.721]
75 Decision boundary comparison among MoM-DSVM, centralized SVM and local SVM results for synthetic data generated from two Gaussian classes (right). [sent-531, score-0.462]
76 The training set is equally partitioned across nodes, thus every node in the network with J = 25 has N j = 472 training vectors, and every node in the network with J = 50 has N j = 236 samples. [sent-540, score-0.553]
77 The MNIST training set is partitioned across nodes ensuring that every node has an equal number of examples from digit 2 and digit 9. [sent-572, score-0.427]
78 Figure 6 shows the evolution of the test error for the network with J = 25 nodes and Figure 7 shows the evolution of the test error for the network with J = 50 nodes for different values for the penalties JC and η. [sent-573, score-0.432]
79 Moreover, before consensus is reached across all nodes the test error at any given node and iteration index does not necessarily need to be greater than the centralized one. [sent-581, score-0.83]
80 Relating the distributed setting with its centralized counterpart, it follows that with, for example, J = 25 a change in JC from 1 to 5 in the distributed setup of (6), corresponds to a change in C from 0. [sent-588, score-0.53]
81 The figure of merit in this case is V(t) := (1/J) ∑_{j=1}^J ||v_j(t) − v_c(t)||, where v_c(t) contains the coefficients of the centralized SVM using the training set available at time t. [sent-624, score-0.46]
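The figure of merit V(t) can be computed as below; using the Euclidean norm is an assumption, since the extracted text does not show the norm explicitly.

```python
import numpy as np

def coefficient_gap(V_nodes, v_central):
    """V(t) = (1/J) * sum_j ||v_j(t) - v_c(t)||, comparing each node's classifier
    coefficients with the centralized SVM trained on the data available at time t."""
    return float(np.mean(np.linalg.norm(V_nodes - v_central[None, :], axis=1)))
```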
82 The network with J = 30 nodes is considered again, where each node j has available a local training set of N_j = N = 20 training vectors generated as in Test Case 1. [sent-636, score-0.529]
83 For comparison, we have also included the Bayes risk, the centralized SVM empirical risk, and the local SVM risk. [sent-678, score-0.462]
84 As expected, the classification performance of the distributed classifier approaches that of the centralized one. [sent-679, score-0.466]
85 Even though the nodes do not exactly agree on the final form of g j (x) at all points, their classification performance closely converges to the one obtained by the centralized SVM benchmark. [sent-689, score-0.533]
86 Figure 13: Comparison of the discriminant functions found by a centralized SVM, local SVMs, and the MoM-NDSVM algorithm at 6 different nodes of a network with J = 30 using synthetic data. [sent-691, score-0.711]
87 A brief description ... Figure 14: Comparison of the discriminant functions found by a centralized SVM, local SVMs, and the MoM-NDSVM algorithm at 6 different nodes of a network with J = 30 using synthetic data. [sent-696, score-0.711]
88 Table 3 compares performance of the classifiers constructed via MoM-NDSVM with the average performance of the 5 local classifiers trained with local training sets only, and with the one of a centralized SVM trained with the training set available to the whole network. [sent-704, score-0.638]
89 The local and centralized SVMs were trained using the Spider toolbox for MATLAB (Weston et al. [sent-710, score-0.462]
90 Table 3: UCI data sets, centralized versus local versus distributed performance comparison for t = 1,000. [sent-739, score-0.526]
91 Although both centralized and local performance remain nearly unchanged, the MoM-NDSVM performance improves about 7% for both L = 150 and L = 300. [sent-763, score-0.462]
92 Each communication link between node j and node i ∈ B_j introduces additive noise ε_v(t) (ε_α(t)) to the variable v_j(t) (α_{ji}). [sent-836, score-0.561]
93 Conclusions This work developed distributed SVM algorithms by reformulating the centralized SVM training problem into per-node separable sub-problems linked via consensus constraints, which can be solved using decentralized optimization tools. [sent-863, score-0.614]
94 First, it will be shown that the set of consensus constraints in (7), namely {v_j = ω_{ji}, ω_{ji} = v_i : ∀ j ∈ J, ∀ i ∈ B_j}, can be written as the equality constraint Av = ω in (45). [sent-934, score-0.525]
Setting the gradient of L with respect to ω_{ji} equal to zero, ω_{ji}(t + 1) can be found in closed form as ω_{ji}(t + 1) = (1/(2η)) (α_{ji1}(t) − α_{ji2}(t)) + (1/2) (v_j(t + 1) + v_i(t + 1)). [sent-1025, score-0.658]
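A short derivation of why ω_{ji}(t + 1) takes this closed form, assuming the standard quadratically augmented Lagrangian with penalty coefficient η (which matches the role of the quadratic terms ||v_j − ω_{ji}||² and ||ω_{ji} − v_i||² described earlier):

```latex
\nabla_{\omega_{ji}} L = -\alpha_{ji1}(t) + \alpha_{ji2}(t)
  - \eta\bigl(v_j(t+1) - \omega_{ji}\bigr) + \eta\bigl(\omega_{ji} - v_i(t+1)\bigr) = 0
\;\Longrightarrow\;
\omega_{ji}(t+1) = \frac{1}{2\eta}\bigl(\alpha_{ji1}(t) - \alpha_{ji2}(t)\bigr)
  + \frac{1}{2}\bigl(v_j(t+1) + v_i(t+1)\bigr).
```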
The right-hand side computes the sum by fixing a node j and adding the inner products of v_j with the incoming Lagrange multipliers α_{ij1}(t), while the left-hand side performs the same sum by fixing a node j and adding the inner products of the outgoing Lagrange multipliers α_{ji1}(t) with the corresponding v_i of its neighbors. [sent-1039, score-0.49]
, Schölkopf and Smola, 2002): min_{{g_j ∈ H}} (1/2) ∑_{j=1}^J ||g_j||²_H + JC ∑_{j=1}^J ∑_{n=1}^{N_j} ℓ(y_{jn}, g_j(x_{jn})) (74) s.t. [sent-1077, score-1.04]
Given the optimal Lagrange multipliers ς*_{jil} for the constraints {g_j(χ_l) = g_i(χ_l)}, the solution {g*_j} of (74) can be obtained from its Lagrangian as {g*_j} = arg min_{{g_j ∈ H}} (1/2) ∑_{j=1}^J ||g_j||²_H + JC ∑_{j=1}^J ∑_{n=1}^{N_j} ℓ(y_{jn}, g_j(x_{jn})) + ∑_{j=1}^J ∑_{i∈B_j} ∑_{l=1}^L ς*_{jil} (g_j(χ_l) − g_i(χ_l)). [sent-1083, score-1.157]
, y_{jN_j}, {χ_l}, g_j) := JC ∑_{n=1}^{N_j} ℓ(y_{jn}, g_j(x_{jn})) + ∑_{i∈B_j} ∑_{l=1}^L (ς*_{jil} − ς*_{ijl}) g_j(χ_l). [sent-1097, score-1.04]
Applying the Representer Theorem to (76) as in Wahba (1990) and Schölkopf and Smola (2002), one readily arrives at g*_j(x) = ∑_{n=1}^{N_j} a*_{jn} K(x, x_{jn}) + ∑_{l=1}^L c*_{jl} K(x, χ_l). [sent-1098, score-0.554]
wordName wordTfidf (topN-words)
[('jn', 0.52), ('centralized', 0.402), ('jc', 0.274), ('ji', 0.194), ('iannakis', 0.15), ('istributed', 0.15), ('onsensus', 0.15), ('orero', 0.15), ('upport', 0.15), ('node', 0.149), ('svs', 0.137), ('nodes', 0.131), ('achines', 0.128), ('ano', 0.128), ('ector', 0.128), ('admom', 0.123), ('dsvm', 0.109), ('svm', 0.078), ('vi', 0.076), ('isvm', 0.075), ('communication', 0.069), ('discriminant', 0.069), ('distributed', 0.064), ('bj', 0.064), ('flouri', 0.061), ('jik', 0.061), ('consensus', 0.061), ('iterations', 0.061), ('local', 0.06), ('training', 0.058), ('parkinsons', 0.058), ('av', 0.058), ('multipliers', 0.058), ('mnist', 0.057), ('rtest', 0.055), ('sensor', 0.054), ('jl', 0.051), ('lagrange', 0.051), ('network', 0.049), ('classi', 0.046), ('iteration', 0.046), ('per', 0.042), ('incremental', 0.042), ('ase', 0.042), ('nj', 0.042), ('exchanged', 0.041), ('gw', 0.041), ('privacy', 0.041), ('across', 0.041), ('fc', 0.04), ('wj', 0.037), ('risk', 0.036), ('xt', 0.036), ('evolution', 0.036), ('neighboring', 0.035), ('dwork', 0.034), ('evenly', 0.034), ('gwi', 0.034), ('jil', 0.034), ('wireless', 0.034), ('kkt', 0.032), ('nt', 0.032), ('iterates', 0.032), ('er', 0.031), ('connectivity', 0.03), ('est', 0.029), ('transmitted', 0.029), ('decentralized', 0.029), ('gu', 0.029), ('vj', 0.029), ('proposition', 0.028), ('bi', 0.027), ('centrally', 0.027), ('exchanges', 0.027), ('momdsvm', 0.027), ('transmissions', 0.027), ('xim', 0.027), ('overhead', 0.027), ('sensors', 0.027), ('gt', 0.027), ('exchanging', 0.026), ('vt', 0.026), ('kt', 0.026), ('multiplier', 0.025), ('cycle', 0.025), ('arg', 0.025), ('links', 0.024), ('digit', 0.024), ('raining', 0.024), ('vectors', 0.024), ('svms', 0.024), ('neighbors', 0.023), ('acquires', 0.023), ('broadcast', 0.023), ('broadcasts', 0.023), ('communicates', 0.023), ('drift', 0.023), ('graf', 0.023), ('scalars', 0.023), ('digits', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
2 0.091482624 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
Author: Daniil Ryabko
Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction
3 0.08195626 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning
4 0.068052635 108 jmlr-2010-Stochastic Complexity and Generalization Error of a Restricted Boltzmann Machine in Bayesian Estimation
Author: Miki Aoyagi
Abstract: In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation. It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b). The zeta function is defined by using a Kullback function. We use two methods to obtain the maximum pole: a new eigenvalue analysis method and a recursive blowing up process. We show that these methods are effective for obtaining the asymptotic form of the generalization error of hierarchical learning models. Keywords: Boltzmann machine, non-regular learning machine, resolution of singularities, zeta function
5 0.063324913 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
Author: Chunping Wang, Xuejun Liao, Lawrence Carin, David B. Dunson
Abstract: A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local “expert”, and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the “experts” allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform inference via Gibbs sampling, to which we compare the VB results. Keywords: classification, incomplete data, expert, Dirichlet process, variational Bayesian, multitask learning
6 0.058723815 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
7 0.056160741 69 jmlr-2010-Lp-Nested Symmetric Distributions
8 0.054269098 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
9 0.047713116 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
10 0.047688417 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
11 0.046626203 44 jmlr-2010-Graph Kernels
12 0.046138927 15 jmlr-2010-Approximate Tree Kernels
13 0.044449668 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
14 0.041599605 22 jmlr-2010-Classification Using Geometric Level Sets
15 0.041595705 90 jmlr-2010-Permutation Tests for Studying Classifier Performance
16 0.039839692 18 jmlr-2010-Bundle Methods for Regularized Risk Minimization
17 0.039409697 97 jmlr-2010-Regret Bounds and Minimax Policies under Partial Monitoring
18 0.038893647 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide
19 0.036198951 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers
20 0.035777237 63 jmlr-2010-Learning Instance-Specific Predictive Models
topicId topicWeight
[(0, -0.187), (1, -0.014), (2, 0.005), (3, 0.06), (4, 0.017), (5, 0.088), (6, -0.061), (7, -0.025), (8, -0.024), (9, 0.039), (10, 0.057), (11, 0.077), (12, 0.054), (13, -0.104), (14, -0.052), (15, 0.019), (16, 0.078), (17, -0.086), (18, -0.009), (19, 0.014), (20, 0.273), (21, -0.019), (22, 0.065), (23, 0.018), (24, -0.037), (25, -0.083), (26, 0.181), (27, -0.038), (28, 0.133), (29, 0.016), (30, -0.029), (31, -0.007), (32, -0.063), (33, 0.128), (34, 0.119), (35, -0.01), (36, -0.057), (37, 0.288), (38, -0.019), (39, 0.242), (40, 0.058), (41, 0.034), (42, 0.069), (43, -0.071), (44, 0.019), (45, -0.06), (46, -0.253), (47, -0.262), (48, 0.02), (49, 0.2)]
simIndex simValue paperId paperTitle
same-paper 1 0.91283411 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
2 0.52329785 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes
Author: Daniil Ryabko
Abstract: The problem is sequence prediction in the following setting. A sequence x1 , . . . , xn , . . . of discretevalued observations is generated according to some unknown probabilistic law (measure) µ. After observing each outcome, it is required to give the conditional probabilities of the next observation. The measure µ belongs to an arbitrary but known class C of stochastic process measures. We are interested in predictors ρ whose conditional probabilities converge (in some sense) to the “true” µ-conditional probabilities, if any µ ∈ C is chosen to generate the sequence. The contribution of this work is in characterizing the families C for which such predictors exist, and in providing a specific and simple form in which to look for a solution. We show that if any predictor works, then there exists a Bayesian predictor, whose prior is discrete, and which works too. We also find several sufficient and necessary conditions for the existence of a predictor, in terms of topological characterizations of the family C , as well as in terms of local behaviour of the measures in C , which in some cases lead to procedures for constructing such predictors. It should be emphasized that the framework is completely general: the stochastic processes considered are not required to be i.i.d., stationary, or to belong to any parametric or countable family. Keywords: sequence prediction, time series, online prediction, Bayesian prediction
3 0.4078753 40 jmlr-2010-Fast and Scalable Local Kernel Machines
Author: Nicola Segata, Enrico Blanzieri
Abstract: A computationally efficient approach to local learning with kernel methods is presented. The Fast Local Kernel Support Vector Machine (FaLK-SVM) trains a set of local SVMs on redundant neighbourhoods in the training set and an appropriate model for each query point is selected at testing time according to a proximity strategy. Supported by a recent result by Zakai and Ritov (2009) relating consistency and localizability, our approach achieves high classification accuracies by dividing the separation function in local optimisation problems that can be handled very efficiently from the computational viewpoint. The introduction of a fast local model selection further speeds-up the learning process. Learning and complexity bounds are derived for FaLK-SVM, and the empirical evaluation of the approach (with data sets up to 3 million points) showed that it is much faster and more accurate and scalable than state-of-the-art accurate and approximated SVM solvers at least for non high-dimensional data sets. More generally, we show that locality can be an important factor to sensibly speed-up learning approaches and kernel methods, differently from other recent techniques that tend to dismiss local information in order to improve scalability. Keywords: locality, kernel methods, local learning algorithms, support vector machines, instancebased learning
Author: Miki Aoyagi
Abstract: In this paper, we consider the asymptotic form of the generalization error for the restricted Boltzmann machine in Bayesian estimation. It has been shown that obtaining the maximum pole of zeta functions is related to the asymptotic form of the generalization error for hierarchical learning models (Watanabe, 2001a,b). The zeta function is defined by using a Kullback function. We use two methods to obtain the maximum pole: a new eigenvalue analysis method and a recursive blowing up process. We show that these methods are effective for obtaining the asymptotic form of the generalization error of hierarchical learning models. Keywords: Boltzmann machine, non-regular learning machine, resolution of singularities, zeta function
5 0.28895426 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization
Author: Pannagadatta K. Shivaswamy, Tony Jebara
Abstract: Leading classification methods such as support vector machines (SVMs) and their counterparts achieve strong generalization performance by maximizing the margin of separation between data classes. While the maximum margin approach has achieved promising performance, this article identifies its sensitivity to affine transformations of the data and to directions with large data spread. Maximum margin solutions may be misled by the spread of data and preferentially separate classes along large spread directions. This article corrects these weaknesses by measuring margin not in the absolute sense but rather only relative to the spread of data in any projection direction. Maximum relative margin corresponds to a data-dependent regularization on the classification function while maximum absolute margin corresponds to an ℓ2 norm constraint on the classification function. Interestingly, the proposed improvements only require simple extensions to existing maximum margin formulations and preserve the computational efficiency of SVMs. Through the maximization of relative margin, surprising performance gains are achieved on real-world problems such as digit, text classification and on several other benchmark data sets. In addition, risk bounds are derived for the new formulation based on Rademacher averages. Keywords: support vector machines, kernel methods, large margin, Rademacher complexity
6 0.28462756 112 jmlr-2010-Training and Testing Low-degree Polynomial Data Mappings via Linear SVM
7 0.27223957 23 jmlr-2010-Classification with Incomplete Data Using Dirichlet Process Priors
8 0.26440337 63 jmlr-2010-Learning Instance-Specific Predictive Models
9 0.24362086 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
10 0.24030994 13 jmlr-2010-Approximate Inference on Planar Graphs using Loop Calculus and Belief Propagation
11 0.21209635 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
12 0.19361413 101 jmlr-2010-Second-Order Bilinear Discriminant Analysis
13 0.19344604 69 jmlr-2010-Lp-Nested Symmetric Distributions
14 0.19083682 34 jmlr-2010-Erratum: SGDQN is Less Careful than Expected
15 0.18814236 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
16 0.18397921 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking
17 0.17930102 22 jmlr-2010-Classification Using Geometric Level Sets
18 0.17841084 113 jmlr-2010-Tree Decomposition for Large-Scale SVM Problems
19 0.17765045 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels
20 0.174519 52 jmlr-2010-Incremental Sigmoid Belief Networks for Grammar Learning
topicId topicWeight
[(3, 0.012), (8, 0.01), (15, 0.012), (21, 0.014), (32, 0.048), (36, 0.024), (37, 0.032), (75, 0.663), (81, 0.016), (85, 0.052), (96, 0.016)]
simIndex simValue paperId paperTitle
1 0.99817145 77 jmlr-2010-Model-based Boosting 2.0
Author: Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid, Benjamin Hofner
Abstract: We describe version 2.0 of the R add-on package mboost. The package implements boosting for optimizing general risk functions using component-wise (penalized) least squares estimates or regression trees as base-learners for fitting generalized linear, additive and interaction models to potentially high-dimensional data. Keywords: component-wise functional gradient descent, splines, decision trees 1. Overview The R add-on package mboost (Hothorn et al., 2010) implements tools for fitting and evaluating a variety of regression and classification models that have been suggested in machine learning and statistics. Optimization within the empirical risk minimization framework is performed via a component-wise functional gradient descent algorithm. The algorithm originates from the statistical view on boosting algorithms (Friedman et al., 2000; Bühlmann and Yu, 2003). The theory and its implementation in mboost allow for fitting complex prediction models, taking potentially many interactions of features into account, as well as for fitting additive and linear models. The model class the package deals with is best described by so-called structured additive regression (STAR) models, where some characteristic ξ of the conditional distribution of a response variable Y given features X is modeled through a regression function f of the features ξ(Y |X = x) = f (x). In order to facilitate parsimonious and interpretable models, the regression function f is structured, that is, restricted to additive functions f (x) = ∑ p f j (x). Each model component f j (x) might take only j=1 c 2010 Torsten Hothorn, Peter Bühlmann, Thomas Kneib, Matthias Schmid and Benjamin Hofner. H OTHORN , B ÜHLMANN , K NEIB , S CHMID AND H OFNER a subset of the features into account. Special cases are linear models f (x) = x⊤ β, additive models f (x) = ∑ p f j (x( j) ), where f j is a function of the jth feature x( j) only (smooth functions or j=1 stumps, for example) or a more complex function where f (x) is implicitly defined as the sum of multiple decision trees including higher-order interactions. The latter case corresponds to boosting with trees. Combinations of these structures are also possible. The most important advantage of such a decomposition of the regression function is that each component of a fitted model can be looked at and interpreted separately for gaining a better understanding of the model at hand. The characteristic ξ of the distribution depends on the measurement scale of the response Y and the scientific question to be answered. For binary or numeric variables, some function of the expectation may be appropriate, but also quantiles or expectiles may be interesting. The definition of ξ is determined by defining a loss function ρ whose empirical risk is to be minimized under some algorithmic constraints (i.e., limited number of boosting iterations). The model is then fitted using n p ( fˆ1 , . . . , fˆp ) = argmin ∑ wi ρ yi , ∑ f j (x) . ( f1 ,..., f p ) i=1 j=1 Here (yi , xi ), i = 1, . . . , n, are n training samples with responses yi and potentially high-dimensional feature vectors xi , and wi are some weights. The component-wise boosting algorithm starts with some offset for f and iteratively fits residuals defined by the negative gradient of the loss function evaluated at the current fit by updating only the best model component in each iteration. The details have been described by Bühlmann and Yu (2003). 
Early stopping via resampling approaches or AIC leads to sparse models in the sense that only a subset of important model components f j defines the final model. A more thorough introduction to boosting with applications in statistics based on version 1.0 of mboost is given by Bühlmann and Hothorn (2007). As of version 2.0, the package allows for fitting models to binary, numeric, ordered and censored responses, that is, regression of the mean, robust regression, classification (logistic and exponential loss), ordinal regression,1 quantile1 and expectile1 regression, censored regression (including Cox, Weibull1 , log-logistic1 or lognormal1 models) as well as Poisson and negative binomial regression1 for count data can be performed. Because the structure of the regression function f (x) can be chosen independently from the loss function ρ, interesting new models can be fitted (e.g., in geoadditive regression, Kneib et al., 2009). 2. Design and Implementation The package incorporates an infrastructure for representing loss functions (so-called ‘families’), base-learners defining the structure of the regression function and thus the model components f j , and a generic implementation of component-wise functional gradient descent. The main progress in version 2.0 is that only one implementation of the boosting algorithm is applied to all possible models (linear, additive, tree-based) and all families. Earlier versions were based on three implementations, one for linear models, one for additive models, and one for tree-based boosting. In comparison to the 1.0 series, the reduced code basis is easier to maintain, more robust and regression tests have been set-up in a more unified way. Specifically, the new code basis results in an enhanced and more user-friendly formula interface. In addition, convenience functions for hyperparameter selection, faster computation of predictions and improved visual model diagnostics are available. 1. Model family is new in version 2.0 or was added after the release of mboost 1.0. 2110 M ODEL - BASED B OOSTING 2.0 Currently implemented base-learners include component-wise linear models (where only one variable is updated in each iteration of the algorithm), additive models with quadratic penalties (e.g., for fitting smooth functions via penalized splines, varying coefficients or bi- and trivariate tensor product splines, Schmid and Hothorn, 2008), and trees. As a major improvement over the 1.0 series, computations on larger data sets (both with respect to the number of observations and the number of variables) are now facilitated by memory efficient implementations of the base-learners, mostly by applying sparse matrix techniques (package Matrix, Bates and Mächler, 2009) and parallelization for a cross-validation-based choice of the number of boosting iterations (per default via package multicore, Urbanek, 2009). A more elaborate description of mboost 2.0 features is available from the mboost vignette.2 3. User Interface by Example We illustrate the main components of the user-interface by a small example on human body fat composition: Garcia et al. (2005) used a linear model for predicting body fat content by means of common anthropometric measurements that were obtained for n = 71 healthy German women. In addition, the women’s body composition was measured by Dual Energy X-Ray Absorptiometry (DXA). The aim is to describe the DXA measurements as a function of the anthropometric features. 
Here, we extend the linear model by i) an intrinsic variable selection via early stopping, ii) additional terms allowing for smooth deviations from linearity where necessary (by means of penalized splines orthogonalized to the linear effect, Kneib et al., 2009), iii) a possible interaction between two variables with known impact on body fat composition (hip and waist circumference) and iv) using a robust median regression approach instead of L2 risk. For the data (available as data frame bodyfat), the model structure is specified via a formula involving the base-learners corresponding to the different model components (linear terms: bols(); smooth terms: bbs(); interactions: btree()). The loss function (here, the check function for the 0.5 quantile) along with its negative gradient function are defined by the QuantReg(0.5) family (Fenske et al., 2009). The model structure (specified using the formula fm), the data and the family are then passed to function mboost() for model fitting:3 R> library(
2 0.99659818 65 jmlr-2010-Learning Translation Invariant Kernels for Classification
Author: Kamaledin Ghiasi-Shirazi, Reza Safabakhsh, Mostafa Shamsi
Abstract: Appropriate selection of the kernel function, which implicitly defines the feature space of an algorithm, has a crucial role in the success of kernel methods. In this paper, we consider the problem of optimizing a kernel function over the class of translation invariant kernels for the task of binary classification. The learning capacity of this class is invariant with respect to rotation and scaling of the features and it encompasses the set of radial kernels. We show that how translation invariant kernel functions can be embedded in a nested set of sub-classes and consider the kernel learning problem over one of these sub-classes. This allows the choice of an appropriate sub-class based on the problem at hand. We use the criterion proposed by Lanckriet et al. (2004) to obtain a functional formulation for the problem. It will be proven that the optimal kernel is a finite mixture of cosine functions. The kernel learning problem is then formulated as a semi-infinite programming (SIP) problem which is solved by a sequence of quadratically constrained quadratic programming (QCQP) sub-problems. Using the fact that the cosine kernel is of rank two, we propose a formulation of a QCQP sub-problem which does not require the kernel matrices to be loaded into memory, making the method applicable to large-scale problems. We also address the issue of including other classes of kernels, such as individual kernels and isotropic Gaussian kernels, in the learning process. Another interesting feature of the proposed method is that the optimal classifier has an expansion in terms of the number of cosine kernels, instead of support vectors, leading to a remarkable speedup at run-time. As a by-product, we also generalize the kernel trick to complex-valued kernel functions. Our experiments on artificial and real-world benchmark data sets, including the USPS and the MNIST digit recognition data sets, show the usefulness of the proposed method. Keywords: kernel learning, translation invariant k
3 0.99611241 3 jmlr-2010-A Fast Hybrid Algorithm for Large-Scalel1-Regularized Logistic Regression
Author: Jianing Shi, Wotao Yin, Stanley Osher, Paul Sajda
Abstract: ℓ1 -regularized logistic regression, also known as sparse logistic regression, is widely used in machine learning, computer vision, data mining, bioinformatics and neural signal processing. The use of ℓ1 regularization attributes attractive properties to the classifier, such as feature selection, robustness to noise, and as a result, classifier generality in the context of supervised learning. When a sparse logistic regression problem has large-scale data in high dimensions, it is computationally expensive to minimize the non-differentiable ℓ1 -norm in the objective function. Motivated by recent work (Koh et al., 2007; Hale et al., 2008), we propose a novel hybrid algorithm based on combining two types of optimization iterations: one being very fast and memory friendly while the other being slower but more accurate. Called hybrid iterative shrinkage (HIS), the resulting algorithm is comprised of a fixed point continuation phase and an interior point phase. The first phase is based completely on memory efficient operations such as matrix-vector multiplications, while the second phase is based on a truncated Newton’s method. Furthermore, we show that various optimization techniques, including line search and continuation, can significantly accelerate convergence. The algorithm has global convergence at a geometric rate (a Q-linear rate in optimization terminology). We present a numerical comparison with several existing algorithms, including an analysis using benchmark data from the UCI machine learning repository, and show our algorithm is the most computationally efficient without loss of accuracy. Keywords: logistic regression, ℓ1 regularization, fixed point continuation, supervised learning, large scale c 2010 Jianing Shi, Wotao Yin, Stanley Osher and Paul Sajda. S HI , Y IN , O SHER AND S AJDA
4 0.9935292 28 jmlr-2010-Continuous Time Bayesian Network Reasoning and Learning Engine
Author: Christian R. Shelton, Yu Fan, William Lam, Joon Lee, Jing Xu
Abstract: We present a continuous time Bayesian network reasoning and learning engine (CTBN-RLE). A continuous time Bayesian network (CTBN) provides a compact (factored) description of a continuoustime Markov process. This software provides libraries and programs for most of the algorithms developed for CTBNs. For learning, CTBN-RLE implements structure and parameter learning for both complete and partial data. For inference, it implements exact inference and Gibbs and importance sampling approximate inference for any type of evidence pattern. Additionally, the library supplies visualization methods for graphically displaying CTBNs or trajectories of evidence. Keywords: continuous time Bayesian networks, C++, open source software
same-paper 5 0.99115533 26 jmlr-2010-Consensus-Based Distributed Support Vector Machines
Author: Pedro A. Forero, Alfonso Cano, Georgios B. Giannakis
Abstract: This paper develops algorithms to train support vector machines when training data are distributed across different nodes, and their communication to a centralized processing unit is prohibited due to, for example, communication complexity, scalability, or privacy reasons. To accomplish this goal, the centralized linear SVM problem is cast as a set of decentralized convex optimization subproblems (one per node) with consensus constraints on the wanted classifier parameters. Using the alternating direction method of multipliers, fully distributed training algorithms are obtained without exchanging training data among nodes. Different from existing incremental approaches, the overhead associated with inter-node communications is fixed and solely dependent on the network topology rather than the size of the training sets available per node. Important generalizations to train nonlinear SVMs in a distributed fashion are also developed along with sequential variants capable of online processing. Simulated tests illustrate the performance of the novel algorithms.1 Keywords: support vector machine, distributed optimization, distributed data mining, distributed learning, sensor networks
6 0.99011195 6 jmlr-2010-A Rotation Test to Verify Latent Structure
7 0.93536329 87 jmlr-2010-Online Learning for Matrix Factorization and Sparse Coding
8 0.93378961 51 jmlr-2010-Importance Sampling for Continuous Time Bayesian Networks
9 0.92562163 1 jmlr-2010-A Comparison of Optimization Methods and Software for Large-scale L1-regularized Linear Classification
10 0.9252618 43 jmlr-2010-Generalized Power Method for Sparse Principal Component Analysis
11 0.91388011 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes
12 0.91023022 40 jmlr-2010-Fast and Scalable Local Kernel Machines
13 0.90898234 84 jmlr-2010-On Spectral Learning
14 0.90876526 17 jmlr-2010-Bayesian Learning in Sparse Graphical Factor Models via Variational Mean-Field Annealing
15 0.90698051 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices
16 0.9049322 63 jmlr-2010-Learning Instance-Specific Predictive Models
17 0.90470946 58 jmlr-2010-Kronecker Graphs: An Approach to Modeling Networks
18 0.90379786 66 jmlr-2010-Linear Algorithms for Online Multitask Classification
19 0.90112448 5 jmlr-2010-A Quasi-Newton Approach to Nonsmooth Convex Optimization Problems in Machine Learning
20 0.90007031 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming