nips nips2003 nips2003-180 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ingo Steinwart
Abstract: The decision functions constructed by support vector machines (SVM’s) usually depend only on a subset of the training set—the so-called support vectors. We derive asymptotically sharp lower and upper bounds on the number of support vectors for several standard types of SVM’s. In particular, we show for the Gaussian RBF kernel that the fraction of support vectors tends to twice the Bayes risk for the L1-SVM, to the probability of noise for the L2-SVM, and to 1 for the LS-SVM. 1
Reference: text
sentIndex sentText sentNum sentScore
1 gov Abstract The decision functions constructed by support vector machines (SVM’s) usually depend only on a subset of the training set—the so-called support vectors. [sent-2, score-0.199]
2 We derive asymptotically sharp lower and upper bounds on the number of support vectors for several standard types of SVM’s. [sent-3, score-0.222]
3 In particular, we show for the Gaussian RBF kernel that the fraction of support vectors tends to twice the Bayes risk for the L1-SVM, to the probability of noise for the L2-SVM, and to 1 for the LS-SVM. [sent-4, score-0.249]
4 , (xn , yn )) with xi ∈ X, yi ∈ Y := {−1, 1} standard support vector machines (SVM’s) for classification (cf. [sent-8, score-0.286]
5 [1], [2]) solve arg min λ f f ∈H b∈R 2 H + 1 n n L yi (f (xi ) + b) , (1) i=1 where H is a reproducing kernel Hilbert space (RKHS) of a kernel k : X ×X → R (cf. [sent-9, score-0.205]
6 [3], [4]), λ > 0 is a free regularization parameter and L : R → [0, ∞) is a convex loss function. [sent-10, score-0.123]
7 Common choices for L are the hinge loss function L(t) := max{0, 1−t}, the squared hinge loss function L(t) := (max{0, 1 − t})2 and the least square loss function L(t) := (1 − t)2 . [sent-11, score-0.332]
8 Common choices of kernels are the Gaussian RBF k(x, x ) = exp(−σ 2 x − x 2 ) for 2 x, x ∈ Rd and fixed σ > 0 and polynomial kernels k(x, x ) = ( x, x +c)m for x, x ∈ Rd and fixed c ≥ 0, m ∈ N. [sent-13, score-0.062]
9 If (fT,λ , bT,λ ) ∈ H × R denotes a solution of (1) we have fT,λ = 1 2λ n ∗ yi αi k(xi , . [sent-14, score-0.051]
10 Obviously, only the samples xi with ∗ αi = 0 have an impact on fT,λ . [sent-20, score-0.157]
11 The fewer support vectors fT,λ has the faster it can be evaluated. [sent-22, score-0.101]
12 Moreover, it is well known that the number of support vectors #SV (fT,λ ) of the representation of fT,λ (cf. [sent-23, score-0.101]
13 Section 3 for a brief discusssion) also has a large impact on the time needed to solve (1) using the dual problem. [sent-24, score-0.046]
14 Therefore, it is of high interest to know how many support vectors one can expect for a given classification problem. [sent-25, score-0.101]
15 In this work we address this question by establishing asymptotically lower and upper bounds on the number of support vectors for typical situations. [sent-26, score-0.225]
16 The best known example of a universal kernel is the Gaussian RBF kernel (cf. [sent-44, score-0.227]
17 To simplify the statements, let us assume that P has no discrete components, i. [sent-47, score-0.064]
18 Furthermore, let L be a continuous convex loss function satisfying some minor regularity conditions. [sent-50, score-0.158]
19 Then it was shown for universal RKHS’s and stritly positive nullsequences (λn ) satisfying a regularity condition that the following statements hold for all ε > 0 and n → ∞: P n T ∈ (X × Y )n : #SV (fT,λn ) ≥ (RP − ε)n → 1 . [sent-51, score-0.199]
20 We shall also show that this new lower bound is also an upper bound under moderate conditions on P and H. [sent-58, score-0.075]
21 Furthermore, we prove that (4) is asymptotically optimal for the L2-SVM and show that it can be significantly improved for the LS-SVM. [sent-59, score-0.074]
22 3 New bounds We begin with lower and upper bounds for the L1-SVM. [sent-60, score-0.071]
23 Recall, that the problem (1) for this classifier can be reformulated as minimize subject to λ f, f + 1 n n ξi for f ∈ H, b ∈ R, ξ ∈ Rn i=1 yi f (xi ) + b ≥ 1 − ξi , ξi ≥ 0, i = 1, . [sent-61, score-0.051]
24 [4]) n αi − maximize i=1 n 1 4λ n yi yj αi αj k(xi , xj ) for α ∈ Rn i,j=1 (6) yi αi = 0, subject to i=1 0 ≤ αi ≤ 1 n, i = 1, . [sent-69, score-0.102]
25 ) unique if the kernel is universal and P has no discrete components (cf. [sent-82, score-0.18]
26 Since our results for the L1-SVM hold for general kernels we always assume that fT,λ is found by (6). [sent-84, score-0.05]
27 Finally, for a loss function L and a RKHS H we write RL,P,H := inf RL,P (f + b) , f ∈H b∈R where RL,P (f ) := E(x,y)∼P L yf (x) . [sent-85, score-0.13]
28 1 Let k be a continuous kernel on X and P be a probability measure on X ×Y with no discrete components. [sent-88, score-0.127]
29 2 If k is a universal kernel we have RL,P,H = 2RP (cf. [sent-91, score-0.161]
30 3 For specific kernels the regularity condition nλ2 / log n → ∞ can be weakn ened. [sent-96, score-0.092]
31 Namely, for the Gaussian RBF kernel on X ⊂ Rd it can be substituted by nλn |log λn |−d−1 → ∞. [sent-97, score-0.066]
32 4 If H is finite dimensional and n > dim H the representation (2) of fT,λn can be simplified such that only at most dim H kernel evaluations are neccessary. [sent-101, score-0.13]
33 In order to formulate an upper bound on #SV (fT,λn ) recall that a function is called analytic if it can be locally represented by its Taylor series. [sent-103, score-0.138]
34 Let L be a loss function, H be a RKHS over X and P be a probability measure on X × Y . [sent-104, score-0.12]
35 If H is universal we have RL,P,H = inf{RL,P (f ) f : X → R} (cf. [sent-108, score-0.095]
36 Furthermore, we denote the open unit ball of Rd by BRd . [sent-112, score-0.046]
37 5 Let H be the RKHS of an analytic kernel on BRd . [sent-114, score-0.12]
38 Furthermore, let X ⊂ BRd be a closed ball and P be a noisy non-degenerate probability measure on X × Y such that PX has a density with respect to the Lebesgue measure on X and (H, P ) is non-trivial. [sent-115, score-0.231]
39 Then for the L1-SVM using a regularization sequence (λn ) which tends sufficiently slowly to 0 we have #SV (fT,λn ) → RL,P,H n in probability. [sent-116, score-0.136]
40 Probably the most restricting condition on P in the above theorem is that PX has to have a density with respect to the Lebesgue measure. [sent-117, score-0.138]
41 Considering the proof this condition can be slightly weakened to the assumption that every d−1-dimensional subset of X has measure zero. [sent-118, score-0.116]
42 Although it would be desirable to exclude only probability measures with discrete components it is almost obvious that such a condition cannot be sufficient for d > 1 (cf. [sent-119, score-0.064]
43 The assumption that P is noisy and non-degenerate is far more less restrictive since neither completely noise-free P nor noisy problems with only “coin-flipping” noise often occur in practice. [sent-122, score-0.069]
44 7 Let k be a Gaussian RBF kernel with RKHS H and X be a closed ball of Rd . [sent-129, score-0.146]
45 Recall, that k is universal and hence (H, P ) is non-trivial iff P has two non-vanishing classes. [sent-132, score-0.13]
46 Since k is also analytic on Rd we find #SV (fT,λn ) → 2 RP . [sent-133, score-0.054]
47 n Therefore, (4) shows that in general this L1-SVM produces sparser decision functions than the L2-SVM and the LS-SVM based on a Gaussian RBF kernel (cf. [sent-134, score-0.114]
48 Besides the constraint n i=1 yi αi = 0, which no longer appears, the corresponding dual problem is identical to (6). [sent-139, score-0.072]
49 In particular, for a Gaussian RBF kernel and noise-free problems P we then obtain #SV (fT,λn ) → 0, (7) n i. [sent-147, score-0.066]
50 the number of support vectors increases more slowly than linearly. [sent-149, score-0.154]
51 The following theorem shows that the lower bound (4) on #SV (fT,λn ) for the L2-SVM is often asymptotically optimal. [sent-151, score-0.171]
52 This result is independent of the used optimization algorithm since we only consider universal kernels and measures with no discrete components. [sent-152, score-0.145]
53 9 Let H be the RKHS of an analytic and universal kernel on BRd . [sent-154, score-0.215]
54 Furthermore, let X ⊂ BRd be a closed ball and P be a probability measure on X × Y with RP > 0 such that PX has a density with respect to the Lebesgue measure on X and (H, P ) is non-trivial. [sent-155, score-0.206]
55 Then for the L2-SVM using using a regularization sequence (λ n ) which tends sufficiently slowly to 0 we have #SV (fT,λn ) → SP n in probability. [sent-156, score-0.136]
56 10 For the L2-SVM with fixed offset b := 0 the assumption RP > 0 in the above theorem is superfluous (cf. [sent-158, score-0.139]
57 In particular, for a Gaussian RBF kernel and noise-free problems P we obtain (7), i. [sent-162, score-0.066]
58 Our last result shows that LS-SVM’s often tend to use almost every sample as a support vector: Theorem 3. [sent-166, score-0.08]
59 11 Let H be the RKHS of an analytic and universal kernel on BRd . [sent-167, score-0.215]
60 Furthermore, let X ⊂ BRd be a closed ball and P be a probability measure on X × Y such that PX has a density with respect to the Lebesgue measure on X and (H, P ) is non-trivial. [sent-168, score-0.206]
61 Then for the LS-SVM using a regularization sequence (λn ) which tends sufficiently slowly to 0 we have #SV (fT,λn ) →1 n in probability. [sent-169, score-0.136]
62 This still holds if one fixes the offset for L2-SVM’s, i. [sent-172, score-0.052]
63 The reason for the different behaviours is the margin as already observed in [12]: the assumptions on H and P ensure that only a very small fraction of samples xi can be mapped to ±1 by fT,λn (cf. [sent-176, score-0.154]
64 For the L2-SVM this asymptotically ensures that most of the samples are mapped to values outside the margin, i. [sent-179, score-0.086]
65 the properties of Bn \ Aδ in the proof of Theorem 3. [sent-182, score-0.048]
66 9) and it is well-known that such samples cannot be support vectors. [sent-183, score-0.11]
67 In contrast to this the LS-SVM has the property that every point not lying on the margin is a support vector. [sent-184, score-0.102]
68 Using the techniques of our proofs it is fairly easy to see that the same reasoning holds for the hinge loss function compared to “modified hinge loss functions with no margin”. [sent-185, score-0.316]
69 4 Proofs Let L be a loss function and T be a training set. [sent-186, score-0.078]
70 For a function f : X → R we denote the empirical L-risk of f by RL,T (f + b) := 1 n n L yi (f (xi ) + b) . [sent-187, score-0.051]
71 [4]): λn fT,λn , fT,λn 1 + n n n ∗ ξi i=1 ∗ αi = i=1 1 − 4λn n ∗ ∗ yi yj αi αj k(xi , xj ) (8) i,j=1 By (2) this yields 1 n n ∗ ξi ≤ 2λn fT,λn , fT,λn + i=1 1 n n n ∗ ξi = i=1 ∗ αi . [sent-191, score-0.068]
72 i=1 Furthermore, recall that λn → 0 and nλ2 / log n → ∞ implies n 1 n n ∗ ξi = RL,T (fT,λn + bT,λn ) → RL,P,H i=1 (9) in probability for n → ∞ (cf. [sent-192, score-0.081]
73 [9]) and hence for all ε > 0 the probability of n ∗ αi ≥ RL,P,H − ε (10) i=1 tends to 1 for n → ∞. [sent-193, score-0.092]
74 5: Since H is the RKHS of an analytic kernel every function f ∈ H is analytic. [sent-200, score-0.12]
75 , xd−1 , xd ) = c has at most j solutions xd , where j ≥ 0 is locally (with respect to x1 , . [sent-209, score-0.122]
76 Now, let us suppose that PX {x ∈ X : fP,λ (x) + bP,λ = fP (x)} > 0 (13) for some λ > 0, where fP denotes the Bayes decision function. [sent-216, score-0.064]
77 Then we may assume without loss of generality that PX {x ∈ X : fP,λ (x) + bP,λ = 1} > 0 holds. [sent-217, score-0.121]
78 Therefore (13) cannot hold for small λ > 0 and hence we may assume without loss of generality that PX {x ∈ X : |fP,λ (x) + bP,λ − fP (x)| = 0} = 0 holds for all λ > 0. [sent-222, score-0.203]
79 , (xn , yn )) with fT,λn + bT,λn − fP,λn − bP,λn ∞ RL,T (fT,λn + bT,λn ) − RL,P (fP,λn + bP,λn ) 4 ≤ δn , (14) ≤ ε (15) (λn )λ3 n n and {i : xi ∈ Aδn (n))} ≤ 2εn. [sent-235, score-0.135]
80 If m → ∞ the results of [9] and [8] ensure, that the probability of such a T converges to 1 for n → ∞. [sent-236, score-0.047]
81 Now let us suppose that we have a sample (xi , yi ) of T with xi ∈ Aδn (n). [sent-243, score-0.198]
82 Therefore, by (19) we find n n ∗ αi ≥ RL,P,H + 5ε ≥ i=1 ∗ αi = xi ∈Aδn (n) 1 ∗ {i : xi ∈ Aδn (n) and αi = 0} n Since we have at most 2εn samples in Aδn (n) we finally obtain 1 #SV (fT,λn ) ≤ RL,P,H + 7ε . [sent-247, score-0.234]
83 Furthermore, let Aδ (n) be defined as in the proof of Theorem 3. [sent-257, score-0.093]
84 , (xn , yn )) with fT,λn + bT,λn − fP,λn − bP,λn ∞ ≤ δn , {i : xi ∈ Bδ (n) \ Aδn (n)} ≥ n PX (X \ N ) − 3ε . [sent-266, score-0.135]
85 Again, the probability of such T converges to 1 for n → ∞ whenever (λn ) converges sufficiently slowly to 0. [sent-267, score-0.146]
86 In view of (4) it suffices to show that every sample xi ∈ Bδ (n) \ Aδn (n) cannot be a support vector. [sent-268, score-0.182]
87 Given an xi ∈ Bδ (n) \ Aδn (n) we may assume without loss of generality that xi ∈ C1 . [sent-269, score-0.325]
88 Then xi ∈ Bδ (n) implies fP,λn (xi )+bP,λn ≥ 1−δn while xi ∈ Aδn (n) yields |fP,λn (xi )+bP,λn −1| > δn . [sent-270, score-0.239]
89 105]) this shows that xi is not a support vector. [sent-274, score-0.182]
90 11: Let Aδ (n) and δn be defined as in the proof of Theorem 3. [sent-276, score-0.048]
91 Without loss of generality we may assume δn ∈ (0, 1/2). [sent-278, score-0.121]
92 22] we may assume without loss of generality that PX (Dn ) ≥ PX (C0 ) − ε for all n ≥ 1. [sent-281, score-0.121]
93 , (xn , yn )) with fT,λn + bT,λn − fP,λn − bP,λn ∞ ≤ δn {i : xi ∈ Aδn (n)} {i : xi ∈ Dn } ≤ 2εn ≥ n PX (C0 ) − 2ε . [sent-285, score-0.237]
94 Again, the probability of such T converges to 1 for n → ∞ whenever (λn ) converges sufficiently slowly to 0. [sent-286, score-0.146]
95 Now let us consider a sample xi ∈ (X \ Aδn (n)) ∩ C1 of T . [sent-287, score-0.147]
96 Obviously, the same holds true for samples xi ∈ (X \ Aδn (n)) ∩ C−1 . [sent-291, score-0.16]
97 Finally, for samples xi ∈ Dn we have |fT,λn (xi ) + bT,λn | ≤ 1/2 + δn < 1 and hence these samples are always support vectors. [sent-292, score-0.277]
98 On the influence of the kernel on the consistency of support vector machines. [sent-337, score-0.146]
99 Consistency of support vector machines and other regularized kernel machine. [sent-345, score-0.182]
100 Sparsity of data representation of optimal kernel machine and leaveone-out estimator. [sent-360, score-0.066]
wordName wordTfidf (topN-words)
[('ft', 0.584), ('fp', 0.429), ('px', 0.387), ('bp', 0.229), ('bt', 0.211), ('sv', 0.152), ('rkhs', 0.135), ('brd', 0.13), ('xi', 0.102), ('theorem', 0.096), ('universal', 0.095), ('remark', 0.094), ('support', 0.08), ('loss', 0.078), ('kernel', 0.066), ('rbf', 0.061), ('xd', 0.061), ('rp', 0.056), ('asymptotically', 0.056), ('analytic', 0.054), ('slowly', 0.053), ('rd', 0.053), ('lebesgue', 0.052), ('yi', 0.051), ('hinge', 0.049), ('dn', 0.049), ('proof', 0.048), ('ball', 0.046), ('regularization', 0.045), ('let', 0.045), ('nj', 0.044), ('recall', 0.044), ('generality', 0.043), ('tends', 0.038), ('holomorphic', 0.037), ('supj', 0.037), ('uous', 0.037), ('furthermore', 0.037), ('inf', 0.036), ('regularity', 0.035), ('hence', 0.035), ('closed', 0.034), ('proofs', 0.034), ('suf', 0.033), ('yn', 0.033), ('sparseness', 0.032), ('ingo', 0.032), ('dim', 0.032), ('kernels', 0.031), ('bayes', 0.03), ('classi', 0.03), ('samples', 0.03), ('sparser', 0.029), ('alamos', 0.029), ('rn', 0.029), ('holds', 0.028), ('converges', 0.028), ('condition', 0.026), ('xn', 0.025), ('noisy', 0.025), ('risk', 0.025), ('bounds', 0.025), ('impact', 0.025), ('super', 0.024), ('statements', 0.024), ('offset', 0.024), ('svm', 0.023), ('los', 0.023), ('measure', 0.023), ('obviously', 0.023), ('reproducing', 0.022), ('considerations', 0.022), ('establishing', 0.022), ('margin', 0.022), ('vectors', 0.021), ('ciently', 0.021), ('upper', 0.021), ('dual', 0.021), ('machines', 0.02), ('assumption', 0.019), ('sharp', 0.019), ('exists', 0.019), ('hold', 0.019), ('probability', 0.019), ('moreover', 0.019), ('bound', 0.019), ('discrete', 0.019), ('decision', 0.019), ('whenever', 0.018), ('implies', 0.018), ('prove', 0.018), ('lim', 0.018), ('sup', 0.018), ('yields', 0.017), ('write', 0.016), ('regularized', 0.016), ('shall', 0.016), ('density', 0.016), ('risks', 0.016), ('ipping', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
Author: Ingo Steinwart
Abstract: The decision functions constructed by support vector machines (SVM’s) usually depend only on a subset of the training set—the so-called support vectors. We derive asymptotically sharp lower and upper bounds on the number of support vectors for several standard types of SVM’s. In particular, we show for the Gaussian RBF kernel that the fraction of support vectors tends to twice the Bayes risk for the L1-SVM, to the probability of noise for the L2-SVM, and to 1 for the LS-SVM. 1
2 0.14869675 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
3 0.12415653 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
4 0.11630875 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
5 0.11249363 106 nips-2003-Learning Non-Rigid 3D Shape from 2D Motion
Author: Lorenzo Torresani, Aaron Hertzmann, Christoph Bregler
Abstract: This paper presents an algorithm for learning the time-varying shape of a non-rigid 3D object from uncalibrated 2D tracking data. We model shape motion as a rigid component (rotation and translation) combined with a non-rigid deformation. Reconstruction is ill-posed if arbitrary deformations are allowed. We constrain the problem by assuming that the object shape at each time instant is drawn from a Gaussian distribution. Based on this assumption, the algorithm simultaneously estimates 3D shape and motion for each time frame, learns the parameters of the Gaussian, and robustly fills-in missing data points. We then extend the algorithm to model temporal smoothness in object shape, thus allowing it to handle severe cases of missing data. 1
6 0.10501904 121 nips-2003-Log-Linear Models for Label Ranking
7 0.085563049 51 nips-2003-Design of Experiments via Information Theory
8 0.083617046 126 nips-2003-Measure Based Regularization
9 0.083558962 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
10 0.079361126 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
11 0.070158422 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
12 0.068330571 189 nips-2003-Tree-structured Approximations by Expectation Propagation
13 0.066358097 124 nips-2003-Max-Margin Markov Networks
14 0.065914199 112 nips-2003-Learning to Find Pre-Images
15 0.062939994 113 nips-2003-Learning with Local and Global Consistency
16 0.062300708 1 nips-2003-1-norm Support Vector Machines
17 0.061516471 152 nips-2003-Pairwise Clustering and Graphical Models
18 0.057033043 160 nips-2003-Prediction on Spike Data Using Kernel Algorithms
19 0.051612485 44 nips-2003-Can We Learn to Beat the Best Stock
20 0.051362954 78 nips-2003-Gaussian Processes in Reinforcement Learning
topicId topicWeight
[(0, -0.165), (1, -0.094), (2, -0.077), (3, -0.073), (4, 0.144), (5, -0.012), (6, -0.064), (7, -0.055), (8, -0.003), (9, -0.015), (10, 0.054), (11, 0.158), (12, 0.049), (13, 0.079), (14, -0.052), (15, -0.039), (16, 0.073), (17, 0.024), (18, -0.031), (19, -0.113), (20, 0.059), (21, 0.031), (22, 0.14), (23, 0.01), (24, 0.0), (25, -0.008), (26, -0.044), (27, 0.002), (28, 0.074), (29, -0.044), (30, 0.119), (31, -0.033), (32, -0.01), (33, -0.048), (34, -0.062), (35, 0.066), (36, -0.188), (37, 0.084), (38, 0.09), (39, -0.059), (40, 0.039), (41, 0.018), (42, -0.003), (43, -0.063), (44, -0.104), (45, 0.125), (46, -0.185), (47, -0.017), (48, -0.048), (49, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.94672889 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
Author: Ingo Steinwart
Abstract: The decision functions constructed by support vector machines (SVM’s) usually depend only on a subset of the training set—the so-called support vectors. We derive asymptotically sharp lower and upper bounds on the number of support vectors for several standard types of SVM’s. In particular, we show for the Gaussian RBF kernel that the fraction of support vectors tends to twice the Bayes risk for the L1-SVM, to the probability of noise for the L2-SVM, and to 1 for the LS-SVM. 1
2 0.55435628 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
3 0.54697269 122 nips-2003-Margin Maximizing Loss Functions
Author: Saharon Rosset, Ji Zhu, Trevor J. Hastie
Abstract: Margin maximizing properties play an important role in the analysis of classi£cation models, such as boosting and support vector machines. Margin maximization is theoretically interesting because it facilitates generalization error analysis, and practically interesting because it presents a clear geometric interpretation of the models being built. We formulate and prove a suf£cient condition for the solutions of regularized loss functions to converge to margin maximizing separators, as the regularization vanishes. This condition covers the hinge loss of SVM, the exponential loss of AdaBoost and logistic regression loss. We also generalize it to multi-class classi£cation problems, and present margin maximizing multiclass versions of logistic regression and support vector machines. 1
4 0.46272716 117 nips-2003-Linear Response for Approximate Inference
Author: Max Welling, Yee W. Teh
Abstract: Belief propagation on cyclic graphs is an efficient algorithm for computing approximate marginal probability distributions over single nodes and neighboring nodes in the graph. In this paper we propose two new algorithms for approximating joint probabilities of arbitrary pairs of nodes and prove a number of desirable properties that these estimates fulfill. The first algorithm is a propagation algorithm which is shown to converge if belief propagation converges to a stable fixed point. The second algorithm is based on matrix inversion. Experiments compare a number of competing methods.
5 0.4116382 126 nips-2003-Measure Based Regularization
Author: Olivier Bousquet, Olivier Chapelle, Matthias Hein
Abstract: We address in this paper the question of how the knowledge of the marginal distribution P (x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking into account this distribution for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations. 1
6 0.4038125 98 nips-2003-Kernel Dimensionality Reduction for Supervised Learning
7 0.39494476 51 nips-2003-Design of Experiments via Information Theory
8 0.39041337 23 nips-2003-An Infinity-sample Theory for Multi-category Large Margin Classification
9 0.37764174 74 nips-2003-Finding the M Most Probable Configurations using Loopy Belief Propagation
10 0.37749615 1 nips-2003-1-norm Support Vector Machines
11 0.3694655 44 nips-2003-Can We Learn to Beat the Best Stock
12 0.35574707 113 nips-2003-Learning with Local and Global Consistency
13 0.3535744 96 nips-2003-Invariant Pattern Recognition by Semi-Definite Programming Machines
14 0.3466526 152 nips-2003-Pairwise Clustering and Graphical Models
15 0.34500596 112 nips-2003-Learning to Find Pre-Images
16 0.34376305 189 nips-2003-Tree-structured Approximations by Expectation Propagation
17 0.34304902 178 nips-2003-Sparse Greedy Minimax Probability Machine Classification
18 0.33841503 121 nips-2003-Log-Linear Models for Label Ranking
19 0.31786454 150 nips-2003-Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering
20 0.30315304 146 nips-2003-Online Learning of Non-stationary Sequences
topicId topicWeight
[(0, 0.042), (11, 0.021), (19, 0.305), (30, 0.014), (35, 0.058), (53, 0.075), (69, 0.036), (71, 0.074), (76, 0.069), (85, 0.069), (91, 0.091), (99, 0.029)]
simIndex simValue paperId paperTitle
same-paper 1 0.81078905 180 nips-2003-Sparseness of Support Vector Machines---Some Asymptotically Sharp Bounds
Author: Ingo Steinwart
Abstract: The decision functions constructed by support vector machines (SVM’s) usually depend only on a subset of the training set—the so-called support vectors. We derive asymptotically sharp lower and upper bounds on the number of support vectors for several standard types of SVM’s. In particular, we show for the Gaussian RBF kernel that the fraction of support vectors tends to twice the Bayes risk for the L1-SVM, to the probability of noise for the L2-SVM, and to 1 for the LS-SVM. 1
2 0.7092554 50 nips-2003-Denoising and Untangling Graphs Using Degree Priors
Author: Quaid D. Morris, Brendan J. Frey
Abstract: This paper addresses the problem of untangling hidden graphs from a set of noisy detections of undirected edges. We present a model of the generation of the observed graph that includes degree-based structure priors on the hidden graphs. Exact inference in the model is intractable; we present an efficient approximate inference algorithm to compute edge appearance posteriors. We evaluate our model and algorithm on a biological graph inference problem. 1 Introduction and motivation The inference of hidden graphs from noisy edge appearance data is an important problem with obvious practical application. For example, biologists are currently building networks of all the physical protein-protein interactions (PPI) that occur in particular organisms. The importance of this enterprise is commensurate with its scale: a completed network would be as valuable as a completed genome sequence, and because each organism contains thousands of different types of proteins, there are millions of possible types of interactions. However, scalable experimental methods for detecting interactions are noisy, generating many false detections. Motivated by this application, we formulate the general problem of inferring hidden graphs as probabilistic inference in a graphical model, and we introduce an efficient algorithm that approximates the posterior probability that an edge is present. In our model, a set of hidden, constituent graphs are combined to generate the observed graph. Each hidden graph is independently sampled from a prior on graph structure. The combination mechanism acts independently on each edge but can be either stochastic or deterministic. Figure 1 shows an example of our generative model. Typically one of the hidden graphs represents the graph of interest (the true graph), the others represent different types of observation noise. Independent edge noise may also be added by the combination mechanism. We use probabilistic inference to compute a likely decomposition of the observed graph into its constituent parts. This process is deemed “untangling”. We use the term “denoising” to refer to the special case where the edge noise is independent. In denoising there is a single hidden graph, the true graph, and all edge noise in the observed graph is due E1 1 eij E2 i 2 eij j xij i j i j X Figure 1: Illustrative generative model example. Figure shows an example where an observed graph, X, is a noisy composition of two constituent graphs, E 1 and E 2 . All graphs share the same vertex set, so each can be represented by a symmetric matrix of random binary variables (i.e., an adjacency matrix). This generative model is designed to solve a toy counter-espionage problem. The vertices represent suspects and each edge in X represents an observed call between two suspects. The graph X reflects zero or more spy rings (represented by E 1 ), telemarketing calls (represented by E 2 ), social calls (independent edge noise), and lost call records (more independent edge noise). The task is to locate any spy rings hidden in X. We model the distribution of spy ring graphs using a prior, P (E 1 ), that has support only on graphs where all vertices have degree of either 2 (i.e., are in the ring) or 0 (i.e., are not). Graphs of telemarketing call patterns are represented using a prior, P (E 2 ), under which all nodes have degrees of > 3 (i.e., are telemarketers), 1 (i.e., are telemarketees), or 0 (i.e., are neither). The displayed hidden graphs are one likely untangling of X. to the combination mechanism. Prior distributions over graphs can be specified in various ways, but our choice is motivated by problems we want to solve, and by a view to deriving an efficient inference algorithm. One compact representation of a distribution over graphs consists of specifying a distribution over vertex degrees, and assuming that graphs that have the same vertex degrees are equiprobable. Such a prior can model quite rich distributions over graphs. These degree-based structure priors are natural representions of graph structure; many classes of real-world networks have a characteristic functional form associated with their degree distributions [1], and sometimes this form can be predicted using knowledge about the domain (see, e.g., [2]) or detected empirically (see, e.g., [3, 4]). As such, our model incorporates degree-based structure priors. Though exact inference in our model is intractable in general, we present an efficient algorithm for approximate inference for arbitrary degree distributions. We evaluate our model and algorithm using the real-world example of untangling yeast proteinprotein interaction networks. 2 A model of noisy and tangled graphs For degree-based structure priors, inference consists of searching over vertex degrees and edge instantiations, while comparing each edge with its noisy observation and enforcing the constraint that the number of edges connected to every vertex must equal the degree of the vertex. Our formulation of the problem in this way is inspired by the success of the sum-product algorithm (loopy belief propagation) for solving similar formulations of problems in error-correcting decoding [6, 7], phase unwrapping [8], and random satisfiability [9]. For example, in error-correcting decoding, inference consists of searching over configurations of codeword bits, while comparing each bit with its noisy observation and enforcing parity-check constraints on subsets of bits [10]. For a graph on a set of N vertices, eij is a variable that indicates the presence of an edge connecting vertices i and j: eij = 1 if there is an edge, and eij = 0 otherwise. We assume the vertex set is fixed, so each graph is specified by an adjacency matrix, E = {eij }N . The degree of vertex i is denoted by di and the i,j=1 degree set by D = {di }N . The observations are given by a noisy adjacency matrix, i=1 X = {xij }N . Generally, edges can be directed, but in this paper we focus on i,j=1 undirected graphs, so eij = eji and xij = xji . Assuming the observation noise is independent for different edges, the joint distribution is P (X, E, D) = P (X|E)P (E, D) = P (xij |eij ) P (E, D). j≥i P (xij |eij ) models the edge observation noise. We use an undirected model for the joint distribution over edges and degrees, P (E, D), where the prior distribution over di is determined by a non-negative potential fi (di ). Assuming graphs that have the same vertex degrees are equiprobable, we have N eij ) , fi (di )I(di , P (E, D) ∝ i j=1 where I(a, b) = 1 if a = b, and I(a, b) = 0 if a = b. The term I(di , j eij ) ensures that the number of edges connected to vertex i is equal to di . It is straightforward to show that the marginal distribution over di is P (di ) ∝ fi (di ) D\di nD j=i fj (dj ) , where nD is the number of graphs with degrees D and the sum is over all degree variables except di . The potentials, fi , can be estimated from a given degree prior using Markov chain Monte Carlo; or, as an approximation, they can be set to an empirical degree distribution obtained from noise-free graphs. Fig 2a shows the factor graph [11] for the above model. Each filled square corresponds to a term in the factorization of the joint distribution and the square is connected to all variables on which the term depends. Factor graphs are graphical models that unify the properties of Bayesian networks and Markov random fields [12]. Many inference algorithms, including the sum-product algorithm (a.k.a. loopy belief propagation), are more easily derived using factor graphs than Bayesian networks or Markov random fields. We describe the sum-product algorithm for our model in section 3. (a) I(d ,e + e +e +e 4 14 24 34 44 d1 e11 e12 e14 4 d3 d2 e13 f 4(d ) e22 e23 e24 (b) ) d1 d4 e33 e34 e1 e44 11 x11 x11 x12 x13 x14 x22 x23 x24 x33 d1 1 x34 x44 e2 11 e1 12 x12 e2 12 d1 2 e1 13 e1 e2 13 e1 14 x13 e1 22 x14 e2 14 d1 3 23 x22 e2 22 x23 e2 23 4 e1 e1 24 e2 24 e1 33 x24 34 x33 e2 33 x34 e2 34 e1 44 x44 e2 44 P( x44 | e44 ) (c) d4 s41 e14 e24 d2 1 d4 e34 e44 e14 s42 e24 s43 e34 d2 2 d2 3 d2 4 s44 P( x e44 44 1 ,e 2 ) 44 44 |e Figure 2: (a) A factor graph that describes a distribution over graphs with vertex degrees di , binary edge indicator variables eij , and noisy edge observations xij . The indicator function I(di , j eij ) enforces the constraint that the sum of the binary edge indicator variables for vertex i must equal the degree of vertex i. (b) A factor graph that explains noisy observed edges as a combination of two constituent graphs, with edge indicator variables e 1 and e2 . ij ij (c) The constraint I(di , j eij ) can be implemented using a chain with state variables, which leads to an exponentially faster message-passing algorithm. 2.1 Combining multiple graphs The above model is suitable when we want to infer a graph that matches a degree prior, assuming the edge observation noise is independent. A more challenging goal, with practical application, is to infer multiple hidden graphs that combine to explain the observed edge data. In section 4, we show how priors over multiple hidden graphs can be be used to infer protein-protein interactions. When there are H hidden graphs, each constituent graph is specified by a set of edges on the same set of N common vertices. For the degree variables and edge variables, we use a superscript to indicate which hidden graph the variable is used to describe. Assuming the graphs are independent, the joint distribution over the observed edge data X, and the edge variables and degree variables for the hidden graphs, E 1 , D1 , . . . , E H , DH , is H P (X, E 1 , D1 , . . . , E H , DH ) = P (xij |e1 , . . . , eH ) ij ij j≥i P (E h , Dh ), (1) h=1 where for each hidden graph, P (E h , Dh ) is modeled as described above. Here, the likelihood P (xij |e1 , . . . , eH ) describes how the edges in the hidden graphs combine ij ij to model the observed edge. Figure 2b shows the factor graph for this model. 3 Probabilistic inference of constituent graphs Exact probabilistic inference in the above models is intractable, here we introduce an approximate inference algorithm that consists of applying the sum-product algorithm, while ignoring cycles in the factor graph. Although the sum-product algorithm has been used to obtain excellent results on several problems [6, 7, 13, 14, 8, 9], we have found that the algorithm works best when the model consists of uncertain observations of variables that are subject to a large number of hard constraints. Thus the formulation of the model described above. Conceptually, our inference algorithm is a straight-forward application of the sumproduct algorithm, c.f. [15], where messages are passed along edges in the factor graph iteratively, and then combined at variables to obtain estimates of posterior probabilities. However, direct implementation of the message-passing updates will lead to an intractable algorithm. In particular, direct implementation of the update for the message sent from function I(di , j eij ) to edge variable eik takes a number of scalar operations that is exponential in the number of vertices. Fortunately there exists a more efficient way to compute these messages. 3.1 Efficiently summing over edge configurations The function I(di , j eij ) ensures that the number of edges connected to vertex i is equal to di . Passing messages through this function requires summing over all edge configurations that correspond to each possible degree, di , and summing over di . Specifically, the message, µIi →eik (eik ), sent from function I(di , j eij ) to edge variable eik is given by I(di , di {eij | j=1,...,N, j=k} eij ) j µeij →Ii (eij ) , j=k where µeij →Ii (eij ) is the message sent from eij to function I(di , j eij ). The sum over {eij | j = 1, . . . , N, j = k} contains 2N −1 terms, so direct computation is intractable. However, for a maximum degree of dmax , all messages departing from the function I(di , j eij ) can be computed using order dmax N binary scalar operations, by introducing integer state variables sij . We define sij = n≤j ein and note that, by recursion, sij = sij−1 + eij , where si0 = 0 and 0 ≤ sij ≤ dmax . This recursive expression enables us to write the high-complexity constraint as the sum of a product of low-complexity constraints, N I(di , eij ) = j I(si1 , ei1 ) {sij | j=1,...,N } I(sij , sij−1 + eij ) I(di , siN ). j=2 This summation can be performed using the forward-backward algorithm. In the factor graph, the summation can be implemented by replacing the function I(di , j eij ) with a chain of lower-complexity functions, connected as shown in Fig. 2c. The function vertex (filled square) on the far left corresponds to I(si1 , ei1 ) and the function vertex in the upper right corresponds to I(di , siN ). So, messages can be passed through each constraint function I(di , j eij ) in an efficient manner, by performing a single forward-backward pass in the corresponding chain. 4 Results We evaluate our model using yeast protein-protein interaction (PPI) data compiled by [16]. These data include eight sets of putative, but noisy, interactions derived from various sources, and one gold-standard set of interactions detected by reliable experiments. Using the ∼ 6300 yeast proteins as vertices, we represent the eight sets of putative m interactions using adjacency matrices {Y m }8 m=1 where yij = 1 if and only if putative interaction dataset m contains an interaction between proteins i and j. We similarly use Y gold to represent the gold-standard interactions. m We construct an observed graph, X, by setting xij = maxm yij for all i and j, thus the observed edge set is the union of all the putative edge sets. We test our model (a) (b) 40 0 untangling baseline random empirical potential posterior −2 30 log Pr true positives (%) 50 20 10 −4 −6 −8 0 0 5 10 −10 0 false positives (%) 10 20 30 degree (# of nodes) Figure 3: Protein-protein interaction network untangling results. (a) ROC curves measuring performance of predicting e1 when xij = 1. (b) Degree distributions. Compares the empirical ij degree distribution of the test set subgraph of E 1 to the degree potential f 1 estimated on the ˆ ij training set subgraph of E 1 and to the distribution of di = j pij where pij = P (e1 = 1|X) is estimated by untangling. on the task of discerning which of the edges in X are also in Y gold . We formalize this problem as that of decomposing X into two constituent graphs E 1 and E 2 , the gold true and the noise graphs respectively, such that e1 = xij yij and e2 = xij − e1 . ij ij ij We use a training set to fit our model parameters and then measure task performance on a test set. The training set contains a randomly selected half of the ∼ 6300 yeast proteins, and the subgraphs of E 1 , E 2 , and X restricted to those vertices. The test contains the other half of the proteins and the corresponding subgraphs. Note that interactions connecting test set proteins to training set proteins (and vice versa) are ignored. We fit three sets of parameters: a set of Naive Bayes parameters that define a set of edge-specific likelihood functions, Pij (xij |e1 , e2 ), one degree potential, f 1 , which ij ij is the same for every vertex in E1 and defines the prior P (E 1 ), and a second, f 2 , that similarly defines the prior P (E 2 ). The likelihood functions, Pij , are used to both assign likelihoods and enforce problem constraints. Given our problem definition, if xij = 0 then e1 = e2 = 0, ij ij otherwise xij = 1 and e1 = 1 − e2 . We enforce the former constraint by setij ij ting Pij (xij = 0|e1 , e2 ) = (1 − e1 )(1 − e2 ), and the latter by setting Pij (xij = ij ij ij ij 1|e1 , e2 ) = 0 whenever e1 = e2 . This construction of Pij simplifies the calculation ij ij ij ij of the µPij →eh messages and improves the computational efficiency of inference beij cause when xij = 0, we need never update messages to and from variables e1 and ij e2 . We complete the specification of Pij (xij = 1|e1 , e2 ) as follows: ij ij ij ym Pij (xij = 1|e1 , e2 ) ij ij = m ij θm (1 − θm )1−yij , if e1 = 1 and e2 = 0, ij ij ym m ij ψm (1 − ψm )1−yij , if e1 = 0 and e2 = 1. ij ij where {θm } and {ψm } are naive Bayes parameters, θm = i,j m yij e1 / ij i,j e1 and ij ψm = i,j m yij e2 / ij i,j e2 , respectively. ij The degree potentials f 1 (d) and f 2 (d) are kernel density estimates fit to the degree distribution of the training set subgraphs of E 1 and E 2 , respectively. We use Gaussian kernels and set the width parameter (standard deviation) σ using leaveone-out cross-validation to maximize the total log density of the held-out datapoints. Each datapoint is the degree of a single vertex. Both degree potentials closely followed the training set empirical degree distributions. Untangling was done on the test set subgraph of X. We initially set the µ Pij →e1 ij messages equal to the likelihood function Pij and we randomly initialized the 1 µIj →e1 messages with samples from a normal distribution with mean 0 and variij ance 0.01. We then performed 40 iterations of the following message update order: 2 2 1 1 µe1 →Ij , µIj →e1 , µe1 →Pij , µPij →e2 , µe2 →Ij , µIj →e2 , µe2 →Pij , µPij →e1 . ij ij ij ij ij ij ij ij We evaluated our untangling algorithm using an ROC curve by comparing the actual ˆ test set subgraph of E 1 to posterior marginal probabilities,P (e1 = 1|X), estimated ij by our sum-product algorithm. Note that because the true interaction network is sparse (less than 0.2% of the 1.8 × 107 possible interactions are likely present [16]) and, in this case, true positive predictions are of greater biological interest than true negative predictions, we focus on low false positive rate portions of the ROC curve. Figure 3a compares the performance of a classifier for e1 based on thresholding ij ˆ P (eij = 1|X) to a baseline method based on thresholding the likelihood functions, Pij (xij = 1|e1 = 1, e2 = 0). Note because e1 = 0 whenever xij = 0, we exclude ij ij ij the xij = 0 cases from our performance evaluation. The ROC curve shows that for the same low false positive rate, untangling produces 50% − 100% more true positives than the baseline method. Figure 3b shows that the degree potential, the true degree distribution, and the predicted degree distribution are all comparable. The slight overprediction of the true degree distribution may result because the degree potential f 1 that defines P (E 1 ) is not equal to the expected degree distribution of graphs sampled from the distribution P (E 1 ). 5 Summary and Related Work Related work includes other algorithms for structure-based graph denoising [17, 18]. These algorithms use structural properties of the observed graph to score edges and rely on the true graph having a surprisingly large number of three (or four) edge cycles compared to the noise graph. In contrast, we place graph generation in a probabilistic framework; our algorithm computes structural fit in the hidden graph, where this computation is not affected by the noise graph(s); and we allow for multiple sources of observation noise, each with its own structural properties. After submitting this paper to the NIPS conference, we discovered [19], in which a degree-based graph structure prior is used to denoise (but not untangle) observed graphs. This paper addresses denoising in directed graphs as well as undirected graphs, however, the prior that they use is not amenable to deriving an efficient sumproduct algorithm. Instead, they use Markov Chain Monte Carlo to do approximate inference in a hidden graph containing 40 vertices. It is not clear how well this approach scales to the ∼ 3000 vertex graphs that we are using. In summary, the contributions of the work described in this paper include: a general formulation of the problem of graph untangling as inference in a factor graph; an efficient approximate inference algorithm for a rich class of degree-based structure priors; and a set of reliability scores (i.e., edge posteriors) for interactions from a current version of the yeast protein-protein interaction network. References [1] A L Barabasi and R Albert. Emergence of scaling in random networks. Science, 286(5439), October 1999. [2] A Rzhetsky and S M Gomez. Birth of scale-free molecular networks and the number of distinct dna and protein domains per genome. Bioinformatics, pages 988–96, 2001. [3] M Faloutsos, P Faloutsos, and C Faloutsos. On power-law relationships of the Internet topology. Computer Communications Review, 29, 1999. [4] Hawoong Jeong, B Tombor, R´ka Albert, Z N Oltvai, and Albert-L´szl´ Barab´si. e a o a The large-scale organization of metabolic networks. Nature, 407, October 2000. [5] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo CA., 1988. [6] D. J. C. MacKay and R. M. Neal. Near Shannon limit performance of low density parity check codes. Electronics Letters, 32(18):1645–1646, August 1996. Reprinted in Electronics Letters, vol. 33, March 1997, 457–458. [7] B. J. Frey and F. R. Kschischang. Probability propagation and iterative decoding. In Proceedings of the 1996 Allerton Conference on Communication, Control and Computing, 1996. [8] B. J. Frey, R. Koetter, and N. Petrovic. Very loopy belief propagation for unwrapping phase images. In 2001 Conference on Advances in Neural Information Processing Systems, Volume 14. MIT Press, 2002. [9] M. M´zard, G. Parisi, and R. Zecchina. Analytic and algorithmic solution of random e satisfiability problems. Science, 297:812–815, 2002. [10] B. J. Frey and D. J. C. MacKay. Trellis-constrained codes. In Proceedings of the 35 th Allerton Conference on Communication, Control and Computing 1997, 1998. [11] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, Special Issue on Codes on Graphs and Iterative Algorithms, 47(2):498–519, February 2001. [12] B. J. Frey. Factor graphs: A unification of directed and undirected graphical models. University of Toronto Technical Report PSI-2003-02, 2003. [13] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Uncertainty in Artificial Intelligence 1999. Stockholm, Sweden, 1999. [14] W. Freeman and E. Pasztor. Learning low-level vision. In Proceedings of the International Conference on Computer Vision, pages 1182–1189, 1999. [15] M. I. Jordan. An Inroduction to Learning in Graphical Models. 2004. In preparation. [16] C von Mering et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 2002. [17] R Saito, H Suzuki, and Y Hayashizaki. Construction of reliable protein-protein interaction networks with a new interaction generality measure. Bioinformatics, pages 756–63, 2003. [18] D S Goldberg and F P Roth. Assessing experimentally derived interactions in a small world. Proceedings of the National Academy of Science, 2003. [19] S M Gomez and A Rzhetsky. Towards the prediction of complete protein–protein interaction networks. In Pacific Symposium on Biocomputing, pages 413–24, 2002.
3 0.50453854 101 nips-2003-Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates
Author: Peter L. Bartlett, Michael I. Jordan, Jon D. Mcauliffe
Abstract: Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by providing a general quantitative relationship between the risk as assessed using the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds under the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.
4 0.50395608 189 nips-2003-Tree-structured Approximations by Expectation Propagation
Author: Yuan Qi, Tom Minka
Abstract: Approximation structure plays an important role in inference on loopy graphs. As a tractable structure, tree approximations have been utilized in the variational method of Ghahramani & Jordan (1997) and the sequential projection method of Frey et al. (2000). However, belief propagation represents each factor of the graph with a product of single-node messages. In this paper, belief propagation is extended to represent factors with tree approximations, by way of the expectation propagation framework. That is, each factor sends a “message” to all pairs of nodes in a tree structure. The result is more accurate inferences and more frequent convergence than ordinary belief propagation, at a lower cost than variational trees or double-loop algorithms. 1
5 0.50323266 103 nips-2003-Learning Bounds for a Generalized Family of Bayesian Posterior Distributions
Author: Tong Zhang
Abstract: In this paper we obtain convergence bounds for the concentration of Bayesian posterior distributions (around the true distribution) using a novel method that simplifies and enhances previous results. Based on the analysis, we also introduce a generalized family of Bayesian posteriors, and show that the convergence behavior of these generalized posteriors is completely determined by the local prior structure around the true distribution. This important and surprising robustness property does not hold for the standard Bayesian posterior in that it may not concentrate when there exist “bad” prior structures even at places far away from the true distribution. 1
6 0.50100315 78 nips-2003-Gaussian Processes in Reinforcement Learning
7 0.49865758 113 nips-2003-Learning with Local and Global Consistency
8 0.49645451 126 nips-2003-Measure Based Regularization
9 0.49587578 158 nips-2003-Policy Search by Dynamic Programming
10 0.49426743 109 nips-2003-Learning a Rare Event Detection Cascade by Direct Feature Selection
11 0.4921706 9 nips-2003-A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications
12 0.4914937 54 nips-2003-Discriminative Fields for Modeling Spatial Dependencies in Natural Images
13 0.491402 107 nips-2003-Learning Spectral Clustering
14 0.49093369 143 nips-2003-On the Dynamics of Boosting
15 0.49083465 116 nips-2003-Linear Program Approximations for Factored Continuous-State Markov Decision Processes
16 0.48992386 47 nips-2003-Computing Gaussian Mixture Models with EM Using Equivalence Constraints
17 0.48926425 112 nips-2003-Learning to Find Pre-Images
18 0.48765522 147 nips-2003-Online Learning via Global Feedback for Phrase Recognition
19 0.48733521 41 nips-2003-Boosting versus Covering
20 0.4871766 63 nips-2003-Error Bounds for Transductive Learning via Compression and Clustering