nips nips2000 nips2000-22 knowledge-graph by maker-knowledge-mining

22 nips-2000-Algorithms for Non-negative Matrix Factorization


Source: pdf

Author: Daniel D. Lee, H. Sebastian Seung

Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Massachusetts Institute of Technology, Cambridge, MA 02138. Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. [sent-6, score-0.217]

2 They differ only slightly in the multiplicative factor used in the update rules. [sent-8, score-0.311]

3 One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. [sent-9, score-0.041]

4 The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. [sent-10, score-0.302]

5 The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence. [sent-11, score-0.297]

6 1 Introduction Unsupervised learning algorithms such as principal components analysis and vector quantization can be understood as factorizing a data matrix subject to different constraints. [sent-12, score-0.17]

7 Depending upon the constraints utilized, the resulting factors can be shown to have very different representational properties. [sent-13, score-0.034]

8 On the other hand, vector quantization uses a hard winner-take-all constraint that results in clustering the data into mutually exclusive prototypes [3]. [sent-15, score-0.068]

9 We have previously shown that nonnegativity is a useful constraint for matrix factorization that can learn a parts representation of the data [4, 5]. [sent-16, score-0.247]

10 The nonnegative basis vectors that are learned are used in distributed, yet still sparse combinations to generate expressiveness in the reconstructions [6, 7]. [sent-17, score-0.078]

11 In this submission, we analyze in detail two numerical algorithms for learning the optimal nonnegative factors from data. [sent-18, score-0.133]

12 Given a set of multivariate n-dimensional data vectors, the vectors are placed in the columns of an n × m matrix V, where m is the number of examples in the data set. [sent-20, score-0.142]

13 This matrix is then approximately factorized into an n × r matrix W and an r × m matrix H. [sent-21, score-0.204]

14 Usually r is chosen to be smaller than n or m, so that W and H are smaller than the original matrix V. [sent-22, score-0.068]
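
A minimal NumPy sketch of the shapes involved (the dimensions and random data below are arbitrary choices for illustration, not values from the paper):

```python
import numpy as np

n, m, r = 100, 500, 10      # data dimension, number of examples, chosen rank (r < n, m)
V = np.random.rand(n, m)     # non-negative data matrix, one example per column
W = np.random.rand(n, r)     # non-negative basis matrix
H = np.random.rand(r, m)     # non-negative encoding matrix

V_approx = W @ H             # the approximation V ≈ WH, shape (n, m)
```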

15 It can be rewritten column by column as v ≈ Wh, where v and h are the corresponding columns of V and H. [sent-26, score-0.058]

16 In other words, each data vector v is approximated by a linear combination of the columns of W, weighted by the components of h. [sent-27, score-0.025]

17 Therefore W can be regarded as containing a basis that is optimized for the linear approximation of the data in V. [sent-28, score-0.045]

18 Since relatively few basis vectors are used to represent many data vectors, good approximation can only be achieved if the basis vectors discover structure that is latent in the data. [sent-29, score-0.084]

19 The present submission is not about applications of NMF, but focuses instead on the technical aspects of finding non-negative matrix factorizations. [sent-30, score-0.11]

20 Of course, other types of matrix factorizations have been extensively studied in numerical linear algebra, but the nonnegativity constraint makes much of this previous work inapplicable to the present case [8]. [sent-31, score-0.151]

21 Here we discuss two algorithms for NMF based on iterative updates of W and H. [sent-32, score-0.107]

22 Because these algorithms are easy to implement and their convergence properties are guaranteed, we have found them very useful in practical applications. [sent-33, score-0.154]

23 Other algorithms may possibly be more efficient in overall computation time, but are more difficult to implement and may not generalize to different cost functions. [sent-34, score-0.108]

24 Algorithms similar to ours where only one of the factors is adapted have previously been used for the deconvolution of emission tomography and astronomical images [9, 10, 11, 12]. [sent-35, score-0.12]

25 At each iteration of our algorithms, the new value of W or H is found by multiplying the current value by some factor that depends on the quality of the approximation in Eq. (1). [sent-36, score-0.054]

26 We prove that the quality of the approximation improves monotonically with the application of these multiplicative update rules. [sent-38, score-0.327]

27 In practice, this means that repeated iteration of the update rules is guaranteed to converge to a locally optimal matrix factorization. [sent-39, score-0.34]

28 3 Cost functions To find an approximate factorization V ≈ WH, we first need to define cost functions that quantify the quality of the approximation. [sent-40, score-0.125]

29 Such a cost function can be constructed using some measure of distance between two non-negative matrices A and B. [sent-41, score-0.072]

30 One useful measure is simply the square of the Euclidean distance between A and B [13], $\|A - B\|^2 = \sum_{ij}(A_{ij} - B_{ij})^2$. (2) This is lower bounded by zero, and clearly vanishes if and only if A = B. [sent-42, score-0.118]

31 Another useful measure is $D(A\|B) = \sum_{ij}\left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$. (3) Like the Euclidean distance, this is also lower bounded by zero, and vanishes if and only if A = B. [sent-43, score-0.122]
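
A small sketch of the two cost measures of Eqs. (2) and (3) in NumPy; the eps guard against log(0) and division by zero is an implementation choice of this example, not part of the definitions:

```python
import numpy as np

def squared_euclidean(A, B):
    """||A - B||^2 = sum_ij (A_ij - B_ij)^2, Eq. (2)."""
    return np.sum((A - B) ** 2)

def generalized_kl(A, B, eps=1e-12):
    """D(A||B) = sum_ij (A_ij log(A_ij/B_ij) - A_ij + B_ij), Eq. (3)."""
    A = np.maximum(A, eps)   # illustrative safeguard against log(0)
    B = np.maximum(B, eps)
    return np.sum(A * np.log(A / B) - A + B)
```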

32 It reduces to the Kullback-Leibler divergence, or relative entropy, when $\sum_{ij} A_{ij} = \sum_{ij} B_{ij} = 1$, so that A and B can be regarded as normalized probability distributions. [sent-45, score-0.024]

33 Although the functions $\|V - WH\|^2$ and $D(V\|WH)$ are convex in W only or H only, they are not convex in both variables together. [sent-49, score-0.06]

34 However, there are many techniques from numerical optimization that can be applied to find local minima. [sent-51, score-0.025]

35 Gradient descent is perhaps the simplest technique to implement, but convergence can be slow. [sent-52, score-0.12]

36 Other methods such as conjugate gradient have faster convergence, at least in the vicinity of local minima, but are more complicated to implement than gradient descent [8]. [sent-53, score-0.26]

37 Gradient-based methods also have the disadvantage that their convergence is very sensitive to the choice of step size, which can be very inconvenient for large applications. [sent-54, score-0.123]

38 4 Multiplicative update rules We have found that the following "multiplicative update rules" are a good compromise between speed and ease of implementation for solving Problems 1 and 2. [sent-55, score-0.394]

39 Theorem 1 The Euclidean distance $\|V - WH\|$ is nonincreasing under the update rules $H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}}$, $W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}}$. (4) The Euclidean distance is invariant under these updates if and only if W and H are at a stationary point of the distance. [sent-56, score-0.37]
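
A direct NumPy transcription of the updates in Eq. (4) might look as follows; the small eps added to the denominators is a common numerical safeguard assumed by this sketch, not part of the theorem:

```python
import numpy as np

def nmf_euclidean_step(V, W, H, eps=1e-12):
    """One pass of the multiplicative updates for ||V - WH||^2 (Eq. 4)."""
    H = H * (W.T @ V) / (W.T @ W @ H + eps)   # update H with the current W
    W = W * (V @ H.T) / (W @ H @ H.T + eps)   # then update W with the new H
    return W, H
```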

40 Theorem 2 The divergence $D(V\|WH)$ is nonincreasing under the update rules $H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia} V_{i\mu}/(WH)_{i\mu}}{\sum_k W_{ka}}$, $W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu} V_{i\mu}/(WH)_{i\mu}}{\sum_\nu H_{a\nu}}$. (5) [sent-57, score-0.57]

41 The divergence is invariant under these updates if and only if W and H are at a stationary point of the divergence. [sent-61, score-0.103]
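
Likewise, a sketch of the divergence updates in Eq. (5), with the sums written as matrix products and the same illustrative eps safeguard:

```python
import numpy as np

def nmf_divergence_step(V, W, H, eps=1e-12):
    """One pass of the multiplicative updates for D(V||WH) (Eq. 5)."""
    WH = W @ H + eps
    H = H * (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)   # H update
    WH = W @ H + eps
    W = W * ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)   # W update
    return W, H
```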

42 Proofs of these theorems are given in a later section. [sent-62, score-0.057]

43 For now, we note that each update consists of multiplication by a factor. [sent-63, score-0.15]

44 In particular, it is straightforward to see that this multiplicative factor is unity when V = W H, so that perfect reconstruction is necessarily a fixed point of the update rules. [sent-64, score-0.342]

45 5 Multiplicative versus additive update rules It is useful to contrast these multiplicative updates with those arising from gradient descent [14]. [sent-65, score-0.627]

46 In particular, a simple additive update for H that reduces the squared distance can be written as $H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu}\left[ (W^T V)_{a\mu} - (W^T W H)_{a\mu} \right]$. (6) If the $\eta_{a\mu}$ are all set equal to some small positive number, this is equivalent to conventional gradient descent. [sent-66, score-0.325]

47 As long as this number is sufficiently small, the update should reduce $\|V - WH\|$. Now if we diagonally rescale the variables and set $\eta_{a\mu} = \frac{H_{a\mu}}{(W^T W H)_{a\mu}}$, (7) then we obtain the update rule for H that is given in Theorem 1. [sent-67, score-0.404]
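
Substituting the step size of Eq. (7) into the additive update of Eq. (6) makes this explicit; a short worked step (written out here for clarity, not present in the extracted text):

```latex
H_{a\mu} \;\leftarrow\; H_{a\mu} + \frac{H_{a\mu}}{(W^T W H)_{a\mu}}
\Bigl[(W^T V)_{a\mu} - (W^T W H)_{a\mu}\Bigr]
\;=\; H_{a\mu}\,\frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}} ,
```

which is exactly the H update of Theorem 1.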

48 Note that this rescaling results in a multiplicative factor with the positive component of the gradient in the denominator and the absolute value of the negative component in the numerator of the factor. [sent-68, score-0.422]

49 For the divergence, diagonally rescaled gradient descent takes the form $H_{a\mu} \leftarrow H_{a\mu} + \eta_{a\mu} \left[ \sum_i W_{ia} \frac{V_{i\mu}}{(WH)_{i\mu}} - \sum_i W_{ia} \right]$. (8) [sent-69, score-0.252]

50 Again, if the $\eta_{a\mu}$ are small and positive, this update should reduce $D(V\|WH)$. [sent-70, score-0.15]

51 If instead we set $\eta_{a\mu} = \frac{H_{a\mu}}{\sum_i W_{ia}}$, (9) then we obtain the update rule for H that is given in Theorem 2. [sent-72, score-0.19]
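
The same substitution for the divergence, with the step size of Eq. (9) inserted into Eq. (8) (worked out here for clarity):

```latex
H_{a\mu} \;\leftarrow\; H_{a\mu} + \frac{H_{a\mu}}{\sum_k W_{ka}}
\Bigl[\sum_i W_{ia}\frac{V_{i\mu}}{(WH)_{i\mu}} - \sum_i W_{ia}\Bigr]
\;=\; H_{a\mu}\,\frac{\sum_i W_{ia}\,V_{i\mu}/(WH)_{i\mu}}{\sum_k W_{ka}} ,
```

matching the H update stated in Theorem 2.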

52 This rescaling can also be interpreted as a multiplicative rule with the positive component of the gradient in the denominator and the negative component in the numerator of the multiplicative factor. [sent-73, score-0.563]

53 Since our choices for $\eta_{a\mu}$ are not small, it may seem that there is no guarantee that such a rescaled gradient descent should cause the cost function to decrease. [sent-74, score-0.22]

54 6 Proofs of convergence To prove Theorems 1 and 2, we will make use of an auxiliary function similar to that used in the Expectation-Maximization algorithm [15, 16]. [sent-76, score-0.238]

55 Definition 1 G(h, h') is an auxiliary function for F(h) if the conditions $G(h, h') \geq F(h)$ and $G(h, h) = F(h)$ (10) are satisfied. [sent-77, score-0.168]

56 The auxiliary function is a useful concept because of the following lemma, which is also graphically illustrated in Fig. 1. [sent-78, score-0.198]

57 Lemma 1 If G is an auxiliary function, then F is nonincreasing under the update $h^{t+1} = \arg\min_h G(h, h^t)$. (11) Proof: $F(h^{t+1}) \leq G(h^{t+1}, h^t) \leq G(h^t, h^t) = F(h^t)$. Note that $F(h^{t+1}) = F(h^t)$ only if $h^t$ is a local minimum of $G(h, h^t)$. [sent-80, score-1.198]

58 If the derivatives of F exist and are continuous in a small neighborhood of $h^t$, this also implies that the derivatives $\nabla F(h^t) = 0$. [sent-81, score-0.781]

59 By iterating the update of Eq. (11) we obtain a sequence of estimates that converge to a local minimum $h_{\min} = \arg\min_h F(h)$ of the objective function. We will show that by defining the appropriate auxiliary functions $G(h, h^t)$ for both $\|V - WH\|$ and $D(V\|WH)$, the update rules in Theorems 1 and 2 easily follow from Eq. (11). [sent-83, score-0.454]

60 Figure 1: Minimizing the auxiliary function $G(h, h^t) \geq F(h)$ guarantees that $F(h^{t+1}) \leq F(h^t)$ for $h^{t+1} = \arg\min_h G(h, h^t)$. [sent-85, score-0.21]

61 Comparing the Taylor expansion of F(h) with Eq. (14), we find that $G(h, h^t) \geq F(h)$ is equivalent to $0 \leq (h - h^t)^T \left[ K(h^t) - W^T W \right] (h - h^t)$. (16) To prove positive semidefiniteness, consider the matrix $M_{ab}(h^t) = h^t_a \left( K(h^t) - W^T W \right)_{ab} h^t_b$, (18) which is just a rescaling of the components of $K - W^T W$. [sent-88, score-0.239]
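
For reference, the objective F(h), the diagonal matrix K(h^t), and the quadratic auxiliary function that the surrounding text refers to as Eq. (14) can be written as follows (reconstructed here from the definitions used above; the equation numbering follows the original paper and should be treated as approximate):

```latex
F(h) = \tfrac{1}{2}\sum_i \Bigl(v_i - \sum_a W_{ia} h_a\Bigr)^2, \qquad
K_{ab}(h^t) = \delta_{ab}\,\frac{(W^T W h^t)_a}{h^t_a},

G(h, h^t) = F(h^t) + (h - h^t)^T \nabla F(h^t)
          + \tfrac{1}{2}\,(h - h^t)^T K(h^t)\,(h - h^t).
```

With this choice of K(h^t), minimizing G(h, h^t) over h yields the multiplicative rule quoted below as Eq. (25).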

62 Then $K - W^T W$ is positive semidefinite if and only if M is, and $\nu^T M \nu = \sum_{ab} \nu_a M_{ab} \nu_b$ (19) $= \sum_{ab} \left[ h^t_a (W^T W)_{ab} h^t_b\, \nu_a^2 - \nu_a h^t_a (W^T W)_{ab} h^t_b\, \nu_b \right]$ (20) [sent-89, score-0.186]

63 $= \sum_{ab} (W^T W)_{ab} h^t_a h^t_b \left[ \tfrac{1}{2}\nu_a^2 + \tfrac{1}{2}\nu_b^2 - \nu_a \nu_b \right]$ (21) $= \tfrac{1}{2} \sum_{ab} (W^T W)_{ab} h^t_a h^t_b (\nu_a - \nu_b)^2$ (22) $\geq 0$. (23) One can also show that $K - W^T W$ is positive semidefinite by considering the matrix $K^{-1/2}(K - W^T W)K^{-1/2}$. [sent-92, score-0.254]

64 Then $\nu$ with components $\nu_a = \sqrt{h^t_a\,(W^T W h^t)_a}$ is a positive eigenvector of $K^{-1/2} W^T W K^{-1/2}$ with unity eigenvalue, and application of the Frobenius-Perron theorem shows that Eq. (17) holds. [sent-94, score-0.894]

65 We can now demonstrate the convergence of Theorem 1. Proof of Theorem 1: Replacing $G(h, h^t)$ in Eq. (11) by Eq. [sent-96, score-0.048]

66 (14) results in the update rule $h^{t+1} = h^t - K(h^t)^{-1} \nabla F(h^t)$. (24) Since Eq.

67 (14) is an auxiliary function, F is nonincreasing under this update rule, according to Lemma 1. [sent-99, score-0.417]

68 Writing the components of this equation explicitly, we obtain $h^{t+1}_a = h^t_a \frac{(W^T v)_a}{(W^T W h^t)_a}$. (25) [sent-100, score-1.562]

69 By reversing the roles of W and H in Lemmas 1 and 2, F can similarly be shown to be nonincreasing under the update rules for W. [sent-101, score-0.405]

70 To show that $G(h, h^t) \geq F(h)$, we use convexity of the log function to derive the inequality $-\log \sum_a W_{ia} h_a \leq -\sum_a \alpha_a \log \frac{W_{ia} h_a}{\alpha_a}$, (28) which holds for all nonnegative $\alpha_a$ that sum to unity. [sent-103, score-0.086]

71 Setting $\alpha_a = \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b}$, (30) we obtain $-\log \sum_a W_{ia} h_a \leq -\sum_a \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \left( \log W_{ia} h_a - \log \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \right)$. (31) From this inequality it follows that $F(h) \leq G(h, h^t)$.

72 Theorem 2 then follows from the application of Lemma 1. Proof of Theorem 2: The minimum of $G(h, h^t)$ with respect to h is determined by setting the gradient to zero: $\frac{dG(h, h^t)}{dh_a} = -\sum_i v_i \frac{W_{ia} h^t_a}{\sum_b W_{ib} h^t_b} \frac{1}{h_a} + \sum_i W_{ia} = 0$. (32) Thus, the update rule of Eq. [sent-105, score-0.296]

73 (11) takes the form $h^{t+1}_a = \frac{h^t_a}{\sum_k W_{ka}} \sum_i \frac{v_i}{\sum_b W_{ib} h^t_b}\, W_{ia}$. (33) Since G is an auxiliary function, F is nonincreasing under this update. [sent-106, score-0.452]

74 Rewritten in matrix form, this is equivalent to the update rule in Eq. [sent-108, score-0.258]

75 By reversing the roles of H and W, F can similarly be shown to be nonincreasing under the update rule for W. [sent-110, score-0.351]

76 7 Discussion We have shown that application of the update rules in Eqs. [sent-111, score-0.244]

77 (4) and (5) is guaranteed to find at least locally optimal solutions of Problems 1 and 2, respectively. [sent-112, score-0.028]

78 The convergence proofs rely upon defining an appropriate auxiliary function. [sent-113, score-0.252]

79 We are currently working to generalize these theorems to more complex constraints. [sent-114, score-0.057]

80 The update rules themselves are extremely easy to implement computationally, and will hopefully be utilized by others for a wide variety of applications. [sent-115, score-0.311]
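
As a quick, self-contained illustration of how little code the updates require, the following sketch iterates the Euclidean rule of Eq. (4) on random data and checks that the cost never increases (data sizes, iteration count, and tolerance are arbitrary choices of this example):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 50, 200, 5
V = rng.random((n, m))
W = rng.random((n, r))
H = rng.random((r, m))

prev_cost = np.inf
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)   # Euclidean H update, Eq. (4)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)   # Euclidean W update, Eq. (4)
    cost = np.sum((V - W @ H) ** 2)
    assert cost <= prev_cost + 1e-6          # monotone decrease, up to rounding
    prev_cost = cost
```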

81 Learning the parts of objects by non-negative matrix factorization. [sent-135, score-0.068]

82 An iterative technique for the rectification of observed distributions. [sent-160, score-0.023]

83 A unified approach to statistical tomography using coordinate descent optimization. [sent-165, score-0.109]

84 Aggregate and mixed-order Markov models for statistical language processing. [sent-185, score-0.02]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ht', 0.781), ('auxiliary', 0.168), ('update', 0.15), ('wtw', 0.132), ('multiplicative', 0.131), ('wia', 0.127), ('wiah', 0.127), ('nmf', 0.116), ('ialt', 0.106), ('nonincreasing', 0.099), ('wand', 0.099), ('rules', 0.094), ('ub', 0.092), ('att', 0.085), ('halt', 0.085), ('iiv', 0.085), ('wh', 0.082), ('gradient', 0.075), ('descent', 0.072), ('lemma', 0.072), ('ia', 0.069), ('factorization', 0.069), ('matrix', 0.068), ('abh', 0.064), ('diagonally', 0.064), ('viiw', 0.064), ('wiaha', 0.064), ('ab', 0.061), ('theorems', 0.057), ('divergence', 0.057), ('theorem', 0.051), ('rescaling', 0.049), ('aij', 0.049), ('convergence', 0.048), ('vb', 0.046), ('updates', 0.046), ('argminh', 0.042), ('hal', 0.042), ('htf', 0.042), ('submission', 0.042), ('wtwht', 0.042), ('rescaled', 0.041), ('quantization', 0.041), ('rule', 0.04), ('distance', 0.04), ('bij', 0.039), ('algorithms', 0.038), ('implement', 0.038), ('seung', 0.037), ('itt', 0.037), ('tomography', 0.037), ('proof', 0.036), ('nonnegative', 0.036), ('proofs', 0.036), ('euclidean', 0.035), ('factors', 0.034), ('reversing', 0.033), ('hs', 0.033), ('numerator', 0.033), ('rewritten', 0.033), ('semidefinite', 0.033), ('vi', 0.032), ('cost', 0.032), ('positive', 0.031), ('ha', 0.031), ('nonnegativity', 0.031), ('unity', 0.031), ('convex', 0.03), ('useful', 0.03), ('factor', 0.03), ('additive', 0.029), ('dd', 0.029), ('roles', 0.029), ('utilized', 0.029), ('multivariate', 0.028), ('guaranteed', 0.028), ('lee', 0.028), ('denominator', 0.027), ('emission', 0.027), ('vanishes', 0.027), ('constraint', 0.027), ('columns', 0.025), ('log', 0.025), ('numerical', 0.025), ('regarded', 0.024), ('quality', 0.024), ('principal', 0.023), ('iterative', 0.023), ('component', 0.023), ('prove', 0.022), ('coding', 0.022), ('saul', 0.022), ('previously', 0.022), ('minimize', 0.021), ('vectors', 0.021), ('ij', 0.021), ('wt', 0.021), ('basis', 0.021), ('squares', 0.02), ('language', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 22 nips-2000-Algorithms for Non-negative Matrix Factorization

Author: Daniel D. Lee, H. Sebastian Seung

Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.

2 0.1016113 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

Author: Koby Crammer, Yoram Singer

Abstract: Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes.

3 0.071254089 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

Author: Geoffrey J. Gordon

Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.

4 0.063598476 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

Author: Oren Shriki, Haim Sompolinsky, Daniel D. Lee

Abstract: The principle of maximizing mutual information is applied to learning overcomplete and recurrent representations. The underlying model consists of a network of input units driving a larger number of output units with recurrent interactions. In the limit of zero noise, the network is deterministic and the mutual information can be related to the entropy of the output units. Maximizing this entropy with respect to both the feedforward connections as well as the recurrent interactions results in simple learning rules for both sets of parameters. The conventional independent components (ICA) learning algorithm can be recovered as a special case where there is an equal number of output units and no recurrent connections. The application of these new learning rules is illustrated on a simple two-dimensional input example.

5 0.063079976 52 nips-2000-Fast Training of Support Vector Classifiers

Author: Fernando Pérez-Cruz, Pedro Luis Alarcón-Diana, Angel Navia-Vázquez, Antonio Artés-Rodríguez

Abstract: In this communication we present a new algorithm for solving Support Vector Classifiers (SVC) with large training data sets. The new algorithm is based on an Iterative Re-Weighted Least Squares procedure which is used to optimize the SVc. Moreover, a novel sample selection strategy for the working set is presented, which randomly chooses the working set among the training samples that do not fulfill the stopping criteria. The validity of both proposals, the optimization procedure and sample selection strategy, is shown by means of computer experiments using well-known data sets. 1 INTRODUCTION The Support Vector Classifier (SVC) is a powerful tool to solve pattern recognition problems [13, 14] in such a way that the solution is completely described as a linear combination of several training samples, named the Support Vectors. The training procedure for solving the SVC is usually based on Quadratic Programming (QP) which presents some inherent limitations, mainly the computational complexity and memory requirements for large training data sets. This problem is typically avoided by dividing the QP problem into sets of smaller ones [6, 1, 7, 11], that are iteratively solved in order to reach the SVC solution for the whole set of training samples. These schemes rely on an optimizing engine, QP, and in the sample selection strategy for each sub-problem, in order to obtain a fast solution for the SVC. An Iterative Re-Weighted Least Squares (IRWLS) procedure has already been proposed as an alternative solver for the SVC [10] and the Support Vector Regressor [9], being computationally efficient in absolute terms. In this communication, we will show that the IRWLS algorithm can replace the QP one in any chunking scheme in order to find the SVC solution for large training data sets. Moreover, we consider that the strategy to decide which training samples must j oin the working set is critical to reduce the total number of iterations needed to attain the SVC solution, and the runtime complexity as a consequence. To aim for this issue, the computer program SV cradit have been developed so as to solve the SVC for large training data sets using IRWLS procedure and fixed-size working sets. The paper is organized as follows. In Section 2, we start by giving a summary of the IRWLS procedure for SVC and explain how it can be incorporated to a chunking scheme to obtain an overall implementation which efficiently deals with large training data sets. We present in Section 3 a novel strategy to make up the working set. Section 4 shows the capabilities of the new implementation and they are compared with the fastest available SVC implementation, SV Mlight [6]. We end with some concluding remarks. 2 IRWLS-SVC In order to solve classification problems, the SVC has to minimize Lp = ~llwI12+CLei- LJliei- LQi(Yi(¢(xifw+b)-l+ei) (1) i i i with respectto w, band ei and maximize it with respectto Qi and Jli, subject to Qi, Jli ~ 0, where ¢(.) is a nonlinear transformation (usually unknown) to a higher dimensional space and C is a penalization factor. The solution to (1) is defined by the Karush-Kuhn-Tucker (KKT) conditions [2]. For further details on the SVC, one can refer to the tutorial survey by Burges [2] and to the work ofVapnik [13, 14]. In order to obtain an IRWLS procedure we will first need to rearrange (1) in such a way that the terms depending on ei can be removed because, at the solution C - Qi - Jli = 0 Vi (one of the KKT conditions [2]) must hold. 
Lp = 1 Qi(l- Yi(¢T(Xi)W + b)) 211wl12 + L i = (2) where The weighted least square nature of (2) can be understood if ei is defined as the error on each sample and ai as its associated weight, where! IIwl1 2 is a regularizing functional. The minimization of (2) cannot be accomplished in a single step because ai = ai(ei), and we need to apply an IRWLS procedure [4], summarized below in tree steps: 1. Considering the ai fixed, minimize (2). 2. Recalculate ai from the solution on step 1. 3. Repeat until convergence. In order to work with Reproducing Kernels in Hilbert Space (RKHS), as the QP procedure does, we require that w = Ei (JiYi¢(Xi) and in order to obtain a non-zero b, that Ei {JiYi = O. Substituting them into (2), its minimum with respect to {Ji and b for a fixed set of ai is found by solving the following linear equation system l (3) IThe detailed description of the steps needed to obtain (3) from (2) can be found in [10]. where y = [Yl, Y2, ... Yn]T (4) 'r/i,j = 1, ... ,n 'r/i,j = 1, ... ,n (H)ij = YiYj¢T(Xi)¢(Xj) = YiyjK(Xi,Xj) (Da)ij = aio[i - j] 13 = [,81, ,82, ... (5) (6) (7) , ,8n]T and 0[·] is the discrete impulse function. Finally, the dependency of ai upon the Lagrange multipliers is eliminated using the KKT conditions, obtaining a, ai 2.1 ={~ ei Yi' eiYi < Yt.et. > - ° ° (8) IRWLS ALGORITHMIC IMPLEMENTATION The SVC solution with the IRWLS procedure can be simplified by dividing the training samples into three sets. The first set, SI, contains the training samples verifying < ,8i < C, which have to be determined by solving (3). The second one, S2, includes every training sample whose,8i = 0. And the last one, S3, is made up of the training samples whose ,8i = C. This division in sets is fully justified in [10]. The IRWLS-SVC algorithm is shown in Table 1. ° 0. Initialization: SI will contain every training sample, S2 = 0 and S3 = 0. Compute H. e_a = y, f3_a = 0, b_a = 0, G 13 = Gin, a = 1 and G b3 = G bi n . 1 Solve [ (H)Sb S1 + D(al S1 . =° = e-lt a, 3. ai = { ~ (13) S2 2. e ° 1[ (Y)Sl (f3)Sl ] (y ) ~1 b and (13) Ss = C DyH(f3 - f3_a) - (b - b_a)1 =[1- G 13 ] G b3 ' °. eiYi < e- _ > O'r/Z E SI U S2 U S3 tYt 4. Sets reordering: a. Move every sample in S3 with eiYi < to S2. b. Move every sample in SI with ,8i = C to S3. c. Move every sample in SI with ai = to S2 . d. Move every sample in S2 with ai :I to SI. 5. e_a = e, f3_a = 13, G 13 = (H)Sl,SS (f3)ss + (G in )Sl' b-lt = band Gb3 = -y~s (f3)ss + Gbin · 6. Go to step 1 and repeat until convergence. ei Yi ' ° ° ° Table 1: IRWLS-SVC algorithm. The IRWLS-SVC procedure has to be slightly modified in order to be used inside a chunk:ing scheme as the one proposed in [8, 6], such that it can be directly applied in the one proposed in [1]. A chunking scheme is needed to solve the SVC whenever H is too large to fit into memory. In those cases, several SVC with a reduced set of training samples are iteratively solved until the solution for the whole set is found. The samples are divide into a working set, Sw, which is solved as a full SVC problem, and an inactive set, Sin. If there are support vectors in the inactive set, as it might be, the inactive set modifies the IRWLSSVC procedure, adding a contribution to the independent term in the linear equation system (3) . Those support vectors in S in can be seen as anchored samples in S3, because their ,8i is not zero and can not be modified by the IRWLS procedure. 
Then, such contribution (Gin and G bin ) will be calculated as G 13 and G b3 are (Table 1, 5th step), before calling the IRWLS-SVC algorithm. We have already modified the IRWLS-SVC in Table 1 to consider Gin and G bin , which must be set to zero if the Hessian matrix, H, fits into memory for the whole set of training samples. The resolution of the SVC for large training data sets, employing as minimization engine the IRWLS procedure, is summarized in the following steps: 1. Select the samples that will form the working set. 2. Construct Gin = (H)Sw,Sin (f3)s.n and G bin = -yIin (f3)Sin 3. Solve the IRWLS-SVC procedure, following the steps in Table 1. 4. Compute the error of every training sample. 5. If the stopping conditions Yiei < C eiYi> -c leiYil < C 'Vii 'Vii 'Vii (Ji = 0 (Ji = C 0 < (Ji < C (9) (10) (11) are fulfilled, the SVC solution has been reached. The stopping conditions are the ones proposed in [6] and C must be a small value around 10 - 3 , a full discussion concerning this topic can be found in [6]. 3 SAMPLE SELECTION STRATEGY The selection of the training samples that will constitute the working set in each iteration is the most critical decision in any chunking scheme, because such decision is directly involved in the number of IRWLS-SVC (or QP-SVC) procedures to be called and in the number of reproducing kernel evaluations to be made, which are, by far, the two most time consuming operations in any chunking schemes. In order to solve the SVC efficiently, we first need to define a candidate set of training samples to form the working set in each iteration. The candidate set will be made up, as it could not be otherwise, with all the training samples that violate the stopping conditions (9)-(11); and we will also add all those training samples that satisfy condition (11) but a small variation on their error will make them violate such condition. The strategies to select the working set are as numerous as the number of problems to be solved, but one can think three different simple strategies: • Select those samples which do not fulfill the stopping criteria and present the largest Iei I values. • Select those samples which do not fulfill the stopping criteria and present the smallest Iei I values. • Select them randomly from the ones that do not fulfill the stopping conditions. The first strategy seems the more natural one and it was proposed in [6]. If the largest leil samples are selected we guanrantee that attained solution gives the greatest step towards the solution of (1). But if the step is too large, which usually happens, it will cause the solution in each iteration and the (Ji values to oscillate around its optimal value. The magnitude of this effect is directly proportional to the value of C and q (size of the working set), so in the case ofsmall C (C < 10) and low q (q < 20) it would be less noticeable. The second one is the most conservative strategy because we will be moving towards the solution of (1) with small steps. Its drawback is readily discerned if the starting point is inappropriate, needing too many iterations to reach the SVC solution. The last strategy, which has been implemented together with the IRWLS-SVC procedure, is a mid-point between the other two, but if the number of samples whose 0 < (3i < C increases above q there might be some iterations where we will make no progress (working set is only made up of the training samples that fulfill the stopping condition in (11)). 
This situation is easily avoided by introducing one sample that violates each one of the stopping conditions per class. Finally, if the cardinality of the candidate set is less than q the working set is completed with those samples that fulfil the stopping criteria conditions and present the least leil. In summary, the sample selection strategy proposed is 2 : 1. Construct the candidate set, Se with those samples that do not fulfill stopping conditions (9) and (10), and those samples whose (3 obeys 0 < (3i < C. 2. IfISel < ngot05. 3. Choose a sample per class that violates each one of the stopping conditions and move them from Se to the working set, SW. 4. Choose randomly n - ISw I samples from Se and move then to SW. Go to Step 6. 5. Move every sample form Se to Sw and then-ISwl samples that fulfill the stopping conditions (9) and (10) and present the lowest leil values are used to complete SW . 6. Go on, obtaining Gin and Gbin. 4 BENCHMARK FOR THE IRWLS-SVC We have prepared two different experiments to test both the IRWLS and the sample selection strategy for solving the SVc. The first one compares the IRWLS against QP and the second one compares the samples selection strategy, together with the IRWLS, against a complete solving procedure for SVC, the SV Mlight. In the first trial, we have replaced the LOQO interior point optimizer used by SV M1ig ht version 3.02 [5] by the IRWLS-SVC procedure in Table 1, to compare both optimizing engines with equal samples selection strategy. The comparison has been made over a Pentium ill-450MHz with 128Mb running on Window98 and the programs have been compiled using Microsoft Developer 6.0. In Table 2, we show the results for two data sets: the first q 20 40 70 Adult44781 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 21.25 20.70 0.61 0.39 20.60 19.22 1.01 0.17 21.15 18.72 2.30 0.46 Splice 2175 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 46.19 30.76 21.94 4.77 71.34 24.93 46.26 8.07 53.77 20.32 34.24 7.72 Table 2: CPU Time indicates the consume time in seconds for the whole procedure. The Optimize Time indicates the consume time in second for the LOQO or IRWLS procedure. one, containing 4781 training samples, needs most CPU resources to compute the RKHS and the second one, containing 2175 training samples, uses most CPU resources to solve the SVC for each Sw, where q indicates the size of the working set. The value of C has 2In what follows, I . I represents absolute value for numbers and cardinality for sets been set to 1 and 1000, respectively, and a Radial Basis Function (RBF) RKHS [2] has been employed, where its parameter a has been set, respectively, to 10 and 70. As it can be seen, the SV M1ig ht with IRWLS is significantly faster than the LOQO procedure in all cases. The kernel cache size has been set to 64Mb for both data sets and for both procedures. The results in Table 2 validates the IRWLS procedure as the fastest SVC solver. For the second trial, we have compiled a computer program that uses the IRWLS-SVC procedure and the working set selection in Section 3, we will refer to it as svcradit from now on. We have borrowed the chunking and shrinking ideas from the SV Mlight [6] for our computer program. To test these two programs several data sets have been used. The Adult and Web data sets have been obtained from 1. 
Platt's web page http://research.microsoft.comr jplatt/smo.html/; the Gauss-M data set is a two dimensional classification problem proposed in [3] to test neural networks, which comprises a gaussian random variable for each class, which highly overlap. The Banana, Diabetes and Splice data sets have been obtained from Gunnar Ratsch web page http://svm.first.gmd.der raetschl. The selection of C and the RKHS has been done as indicated in [11] for Adult and Web data sets and in http://svm.first.gmd.derraetschl for Banana, Diabetes and Splice data sets. In Table 3, we show the runtime complexity for each data set, where the value of q has been elected as the one that reduces the runtime complexity. Database Dim Adult6 Adult9 Adult! Web 1 Web7 Gauss-M Gauss-M Banana Banana Diabetes Splice 123 123 123 300 300 2 2 2 2 8 69 N Sampl. 11221 32562 1605 2477 24693 4000 4000 400 4900 768 2175 C a SV 1 1 1000 5 5 1 100 316.2 316.2 10 1000 10 10 10 10 10 1 1 1 1 2 70 4477 12181 630 224 1444 1736 1516 80 1084 409 525 q CPU time radit light radit light 150 130 100 100 150 70 100 40 70 40 150 40 70 10 10 10 10 10 70 40 10 20 118.2 1093.29 25.98 2.42 158.13 12.69 61.68 0.33 22.46 2.41 14.06 124.46 1097.09 113.54 2.36 124.57 48.28 3053.20 0.77 1786.56 6.04 49.19 Table 3: Several data sets runtime complexity, when solved with the short, and SV Mlight, light for short. s v c radit , radit for One can appreciate that the svcradit is faster than the SV M1ig ht for most data sets. For the Web data set, which is the only data set the SV Mlight is sligthly faster, the value of C is low and most training samples end up as support vector with (3i < C. In such cases the best strategy is to take the largest step towards the solution in every iteration, as the SV Mlig ht does [6], because most training samples (3i will not be affected by the others training samples (3j value. But in those case the value of C increases the SV c radit samples selection strategy is a much more appropriate strategy than the one used in SV Mlight. 5 CONCLUSIONS In this communication a new algorithm for solving the SVC for large training data sets has been presented. Its two major contributions deal with the optimizing engine and the sample selection strategy. An IRWLS procedure is used to solve the SVC in each step, which is much faster that the usual QP procedure, and simpler to implement, because the most difficult step is the linear equation system solution that can be easily obtained by LU decomposition means [12]. The random working set selection from the samples not fulfilling the KKT conditions is the best option if the working is be large, because it reduces the number of chunks to be solved. This strategy benefits from the IRWLS procedure, which allows to work with large training data set. All these modifications have been concreted in the svcradit solving procedure, publicly available at http://svm.tsc.uc3m.es/. 6 ACKNOWLEDGEMENTS We are sincerely grateful to Thorsten Joachims who has allowed and encouraged us to use his SV Mlight to test our IRWLS procedure, comparisons which could not have been properly done otherwise. References [1] B. E. Boser, I. M . Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In 5th Annual Workshop on Computational Learning Theory, Pittsburg, U.S.A., 1992. [2] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. [3] S. Haykin. Neural Networks: A comprehensivefoundation. Prentice-Hall, 1994. [4] P. 
W. Holland and R. E. Welch. Robust regression using iterative re-weighted least squares. Communications of Statistics Theory Methods, A6(9):813-27, 1977. [5] T. Joachims. http://www-ai.infonnatik.uni-dortmund.de/forschung/verfahren Isvmlight Isvmlight.eng.html. Technical report, University of Dortmund, Informatik, AI-Unit Collaborative Research Center on 'Complexity Reduction in Multivariate Data', 1998. [6] T. Joachims. Making Large Scale SVM Learning Practical, In Advances in Kernel Methods- Support Vector Learning, Editors SchOlkopf, B., Burges, C. 1. C. and Smola, A. 1., pages 169-184. M.I.T. Press, 1999. [7] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Proc. of the 1997 IEEE Workshop on Neural Networks for Signal Processing, pages 276-285, Amelia Island, U.S.A, 1997. [8] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In ICPR'98, Brisbane, Australia, August 1998. [9] F. Perez-Cruz, A. Navia-Vazquez

6 0.061692525 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

7 0.054252096 120 nips-2000-Sparse Greedy Gaussian Process Regression

8 0.050515592 6 nips-2000-A Neural Probabilistic Language Model

9 0.049985081 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

10 0.046414848 111 nips-2000-Regularized Winnow Methods

11 0.042484537 18 nips-2000-Active Support Vector Machine Classification

12 0.041588645 100 nips-2000-Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks

13 0.041466828 51 nips-2000-Factored Semi-Tied Covariance Matrices

14 0.040846806 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

15 0.040584732 62 nips-2000-Generalized Belief Propagation

16 0.03955356 77 nips-2000-Learning Curves for Gaussian Processes Regression: A Framework for Good Approximations

17 0.037587784 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles

18 0.036694109 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice

19 0.035281464 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

20 0.035193838 121 nips-2000-Sparse Kernel Principal Component Analysis


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.131), (1, 0.02), (2, 0.012), (3, -0.021), (4, 0.018), (5, 0.005), (6, -0.013), (7, -0.057), (8, -0.038), (9, 0.038), (10, -0.138), (11, -0.028), (12, 0.054), (13, 0.055), (14, -0.072), (15, -0.01), (16, -0.066), (17, -0.005), (18, 0.151), (19, -0.037), (20, 0.118), (21, 0.083), (22, -0.02), (23, 0.055), (24, 0.02), (25, -0.109), (26, -0.022), (27, 0.174), (28, -0.16), (29, -0.018), (30, 0.054), (31, 0.026), (32, 0.012), (33, 0.262), (34, -0.145), (35, -0.017), (36, 0.107), (37, 0.041), (38, 0.059), (39, 0.011), (40, 0.048), (41, -0.056), (42, -0.014), (43, 0.101), (44, -0.121), (45, 0.139), (46, 0.191), (47, -0.122), (48, -0.164), (49, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96462625 22 nips-2000-Algorithms for Non-negative Matrix Factorization

Author: Daniel D. Lee, H. Sebastian Seung

Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.

2 0.5063045 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

Author: Koby Crammer, Yoram Singer

Abstract: Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes.

3 0.36318713 52 nips-2000-Fast Training of Support Vector Classifiers

Author: Fernando Pérez-Cruz, Pedro Luis Alarcón-Diana, Angel Navia-Vázquez, Antonio Artés-Rodríguez

Abstract: In this communication we present a new algorithm for solving Support Vector Classifiers (SVC) with large training data sets. The new algorithm is based on an Iterative Re-Weighted Least Squares procedure which is used to optimize the SVc. Moreover, a novel sample selection strategy for the working set is presented, which randomly chooses the working set among the training samples that do not fulfill the stopping criteria. The validity of both proposals, the optimization procedure and sample selection strategy, is shown by means of computer experiments using well-known data sets. 1 INTRODUCTION The Support Vector Classifier (SVC) is a powerful tool to solve pattern recognition problems [13, 14] in such a way that the solution is completely described as a linear combination of several training samples, named the Support Vectors. The training procedure for solving the SVC is usually based on Quadratic Programming (QP) which presents some inherent limitations, mainly the computational complexity and memory requirements for large training data sets. This problem is typically avoided by dividing the QP problem into sets of smaller ones [6, 1, 7, 11], that are iteratively solved in order to reach the SVC solution for the whole set of training samples. These schemes rely on an optimizing engine, QP, and in the sample selection strategy for each sub-problem, in order to obtain a fast solution for the SVC. An Iterative Re-Weighted Least Squares (IRWLS) procedure has already been proposed as an alternative solver for the SVC [10] and the Support Vector Regressor [9], being computationally efficient in absolute terms. In this communication, we will show that the IRWLS algorithm can replace the QP one in any chunking scheme in order to find the SVC solution for large training data sets. Moreover, we consider that the strategy to decide which training samples must j oin the working set is critical to reduce the total number of iterations needed to attain the SVC solution, and the runtime complexity as a consequence. To aim for this issue, the computer program SV cradit have been developed so as to solve the SVC for large training data sets using IRWLS procedure and fixed-size working sets. The paper is organized as follows. In Section 2, we start by giving a summary of the IRWLS procedure for SVC and explain how it can be incorporated to a chunking scheme to obtain an overall implementation which efficiently deals with large training data sets. We present in Section 3 a novel strategy to make up the working set. Section 4 shows the capabilities of the new implementation and they are compared with the fastest available SVC implementation, SV Mlight [6]. We end with some concluding remarks. 2 IRWLS-SVC In order to solve classification problems, the SVC has to minimize Lp = ~llwI12+CLei- LJliei- LQi(Yi(¢(xifw+b)-l+ei) (1) i i i with respectto w, band ei and maximize it with respectto Qi and Jli, subject to Qi, Jli ~ 0, where ¢(.) is a nonlinear transformation (usually unknown) to a higher dimensional space and C is a penalization factor. The solution to (1) is defined by the Karush-Kuhn-Tucker (KKT) conditions [2]. For further details on the SVC, one can refer to the tutorial survey by Burges [2] and to the work ofVapnik [13, 14]. In order to obtain an IRWLS procedure we will first need to rearrange (1) in such a way that the terms depending on ei can be removed because, at the solution C - Qi - Jli = 0 Vi (one of the KKT conditions [2]) must hold. 
Lp = 1 Qi(l- Yi(¢T(Xi)W + b)) 211wl12 + L i = (2) where The weighted least square nature of (2) can be understood if ei is defined as the error on each sample and ai as its associated weight, where! IIwl1 2 is a regularizing functional. The minimization of (2) cannot be accomplished in a single step because ai = ai(ei), and we need to apply an IRWLS procedure [4], summarized below in tree steps: 1. Considering the ai fixed, minimize (2). 2. Recalculate ai from the solution on step 1. 3. Repeat until convergence. In order to work with Reproducing Kernels in Hilbert Space (RKHS), as the QP procedure does, we require that w = Ei (JiYi¢(Xi) and in order to obtain a non-zero b, that Ei {JiYi = O. Substituting them into (2), its minimum with respect to {Ji and b for a fixed set of ai is found by solving the following linear equation system l (3) IThe detailed description of the steps needed to obtain (3) from (2) can be found in [10]. where y = [Yl, Y2, ... Yn]T (4) 'r/i,j = 1, ... ,n 'r/i,j = 1, ... ,n (H)ij = YiYj¢T(Xi)¢(Xj) = YiyjK(Xi,Xj) (Da)ij = aio[i - j] 13 = [,81, ,82, ... (5) (6) (7) , ,8n]T and 0[·] is the discrete impulse function. Finally, the dependency of ai upon the Lagrange multipliers is eliminated using the KKT conditions, obtaining a, ai 2.1 ={~ ei Yi' eiYi < Yt.et. > - ° ° (8) IRWLS ALGORITHMIC IMPLEMENTATION The SVC solution with the IRWLS procedure can be simplified by dividing the training samples into three sets. The first set, SI, contains the training samples verifying < ,8i < C, which have to be determined by solving (3). The second one, S2, includes every training sample whose,8i = 0. And the last one, S3, is made up of the training samples whose ,8i = C. This division in sets is fully justified in [10]. The IRWLS-SVC algorithm is shown in Table 1. ° 0. Initialization: SI will contain every training sample, S2 = 0 and S3 = 0. Compute H. e_a = y, f3_a = 0, b_a = 0, G 13 = Gin, a = 1 and G b3 = G bi n . 1 Solve [ (H)Sb S1 + D(al S1 . =° = e-lt a, 3. ai = { ~ (13) S2 2. e ° 1[ (Y)Sl (f3)Sl ] (y ) ~1 b and (13) Ss = C DyH(f3 - f3_a) - (b - b_a)1 =[1- G 13 ] G b3 ' °. eiYi < e- _ > O'r/Z E SI U S2 U S3 tYt 4. Sets reordering: a. Move every sample in S3 with eiYi < to S2. b. Move every sample in SI with ,8i = C to S3. c. Move every sample in SI with ai = to S2 . d. Move every sample in S2 with ai :I to SI. 5. e_a = e, f3_a = 13, G 13 = (H)Sl,SS (f3)ss + (G in )Sl' b-lt = band Gb3 = -y~s (f3)ss + Gbin · 6. Go to step 1 and repeat until convergence. ei Yi ' ° ° ° Table 1: IRWLS-SVC algorithm. The IRWLS-SVC procedure has to be slightly modified in order to be used inside a chunk:ing scheme as the one proposed in [8, 6], such that it can be directly applied in the one proposed in [1]. A chunking scheme is needed to solve the SVC whenever H is too large to fit into memory. In those cases, several SVC with a reduced set of training samples are iteratively solved until the solution for the whole set is found. The samples are divide into a working set, Sw, which is solved as a full SVC problem, and an inactive set, Sin. If there are support vectors in the inactive set, as it might be, the inactive set modifies the IRWLSSVC procedure, adding a contribution to the independent term in the linear equation system (3) . Those support vectors in S in can be seen as anchored samples in S3, because their ,8i is not zero and can not be modified by the IRWLS procedure. 
Then, such contribution (Gin and G bin ) will be calculated as G 13 and G b3 are (Table 1, 5th step), before calling the IRWLS-SVC algorithm. We have already modified the IRWLS-SVC in Table 1 to consider Gin and G bin , which must be set to zero if the Hessian matrix, H, fits into memory for the whole set of training samples. The resolution of the SVC for large training data sets, employing as minimization engine the IRWLS procedure, is summarized in the following steps: 1. Select the samples that will form the working set. 2. Construct Gin = (H)Sw,Sin (f3)s.n and G bin = -yIin (f3)Sin 3. Solve the IRWLS-SVC procedure, following the steps in Table 1. 4. Compute the error of every training sample. 5. If the stopping conditions Yiei < C eiYi> -c leiYil < C 'Vii 'Vii 'Vii (Ji = 0 (Ji = C 0 < (Ji < C (9) (10) (11) are fulfilled, the SVC solution has been reached. The stopping conditions are the ones proposed in [6] and C must be a small value around 10 - 3 , a full discussion concerning this topic can be found in [6]. 3 SAMPLE SELECTION STRATEGY The selection of the training samples that will constitute the working set in each iteration is the most critical decision in any chunking scheme, because such decision is directly involved in the number of IRWLS-SVC (or QP-SVC) procedures to be called and in the number of reproducing kernel evaluations to be made, which are, by far, the two most time consuming operations in any chunking schemes. In order to solve the SVC efficiently, we first need to define a candidate set of training samples to form the working set in each iteration. The candidate set will be made up, as it could not be otherwise, with all the training samples that violate the stopping conditions (9)-(11); and we will also add all those training samples that satisfy condition (11) but a small variation on their error will make them violate such condition. The strategies to select the working set are as numerous as the number of problems to be solved, but one can think three different simple strategies: • Select those samples which do not fulfill the stopping criteria and present the largest Iei I values. • Select those samples which do not fulfill the stopping criteria and present the smallest Iei I values. • Select them randomly from the ones that do not fulfill the stopping conditions. The first strategy seems the more natural one and it was proposed in [6]. If the largest leil samples are selected we guanrantee that attained solution gives the greatest step towards the solution of (1). But if the step is too large, which usually happens, it will cause the solution in each iteration and the (Ji values to oscillate around its optimal value. The magnitude of this effect is directly proportional to the value of C and q (size of the working set), so in the case ofsmall C (C < 10) and low q (q < 20) it would be less noticeable. The second one is the most conservative strategy because we will be moving towards the solution of (1) with small steps. Its drawback is readily discerned if the starting point is inappropriate, needing too many iterations to reach the SVC solution. The last strategy, which has been implemented together with the IRWLS-SVC procedure, is a mid-point between the other two, but if the number of samples whose 0 < (3i < C increases above q there might be some iterations where we will make no progress (working set is only made up of the training samples that fulfill the stopping condition in (11)). 
This situation is easily avoided by introducing one sample that violates each one of the stopping conditions per class. Finally, if the cardinality of the candidate set is less than q the working set is completed with those samples that fulfil the stopping criteria conditions and present the least leil. In summary, the sample selection strategy proposed is 2 : 1. Construct the candidate set, Se with those samples that do not fulfill stopping conditions (9) and (10), and those samples whose (3 obeys 0 < (3i < C. 2. IfISel < ngot05. 3. Choose a sample per class that violates each one of the stopping conditions and move them from Se to the working set, SW. 4. Choose randomly n - ISw I samples from Se and move then to SW. Go to Step 6. 5. Move every sample form Se to Sw and then-ISwl samples that fulfill the stopping conditions (9) and (10) and present the lowest leil values are used to complete SW . 6. Go on, obtaining Gin and Gbin. 4 BENCHMARK FOR THE IRWLS-SVC We have prepared two different experiments to test both the IRWLS and the sample selection strategy for solving the SVc. The first one compares the IRWLS against QP and the second one compares the samples selection strategy, together with the IRWLS, against a complete solving procedure for SVC, the SV Mlight. In the first trial, we have replaced the LOQO interior point optimizer used by SV M1ig ht version 3.02 [5] by the IRWLS-SVC procedure in Table 1, to compare both optimizing engines with equal samples selection strategy. The comparison has been made over a Pentium ill-450MHz with 128Mb running on Window98 and the programs have been compiled using Microsoft Developer 6.0. In Table 2, we show the results for two data sets: the first q 20 40 70 Adult44781 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 21.25 20.70 0.61 0.39 20.60 19.22 1.01 0.17 21.15 18.72 2.30 0.46 Splice 2175 CPU time Optimize Time LOQO IRWLS LOQO IRWLS 46.19 30.76 21.94 4.77 71.34 24.93 46.26 8.07 53.77 20.32 34.24 7.72 Table 2: CPU Time indicates the consume time in seconds for the whole procedure. The Optimize Time indicates the consume time in second for the LOQO or IRWLS procedure. one, containing 4781 training samples, needs most CPU resources to compute the RKHS and the second one, containing 2175 training samples, uses most CPU resources to solve the SVC for each Sw, where q indicates the size of the working set. The value of C has 2In what follows, I . I represents absolute value for numbers and cardinality for sets been set to 1 and 1000, respectively, and a Radial Basis Function (RBF) RKHS [2] has been employed, where its parameter a has been set, respectively, to 10 and 70. As it can be seen, the SV M1ig ht with IRWLS is significantly faster than the LOQO procedure in all cases. The kernel cache size has been set to 64Mb for both data sets and for both procedures. The results in Table 2 validates the IRWLS procedure as the fastest SVC solver. For the second trial, we have compiled a computer program that uses the IRWLS-SVC procedure and the working set selection in Section 3, we will refer to it as svcradit from now on. We have borrowed the chunking and shrinking ideas from the SV Mlight [6] for our computer program. To test these two programs several data sets have been used. The Adult and Web data sets have been obtained from 1. 
4 BENCHMARK FOR THE IRWLS-SVC

We have prepared two experiments to test both the IRWLS procedure and the sample selection strategy for solving the SVC. The first compares the IRWLS against QP, and the second compares the sample selection strategy, together with the IRWLS, against a complete SVC solver, SVMlight. In the first trial, we replaced the LOQO interior-point optimizer used by SVMlight version 3.02 [5] with the IRWLS-SVC procedure in Table 1, so that both optimizing engines are compared under the same sample selection strategy. The comparison was made on a Pentium III at 450 MHz with 128 MB of RAM running Windows 98, and the programs were compiled with Microsoft Developer 6.0. Table 2 shows the results for two data sets: the first, Adult4 with 4781 training samples, spends most of its CPU time computing the RKHS, while the second, Splice with 2175 training samples, spends most of its CPU time solving the SVC for each S_w; q indicates the size of the working set.

              Adult4 (4781 samples)              Splice (2175 samples)
           CPU time       Optimize time       CPU time       Optimize time
   q     LOQO    IRWLS    LOQO    IRWLS     LOQO    IRWLS    LOQO    IRWLS
   20    21.25   20.70    0.61    0.39      46.19   30.76    21.94   4.77
   40    20.60   19.22    1.01    0.17      71.34   24.93    46.26   8.07
   70    21.15   18.72    2.30    0.46      53.77   20.32    34.24   7.72

Table 2: CPU time is the time in seconds for the whole procedure; Optimize time is the time in seconds spent inside the LOQO or IRWLS procedure.

The value of C has been set to 1 and 1000, respectively, and a Radial Basis Function (RBF) RKHS [2] has been employed, with its parameter σ set, respectively, to 10 and 70. As can be seen, SVMlight with IRWLS is significantly faster than with the LOQO procedure in all cases. The kernel cache size has been set to 64 MB for both data sets and both procedures. The results in Table 2 validate the IRWLS procedure as the fastest SVC solver.

For the second trial, we have compiled a computer program that uses the IRWLS-SVC procedure and the working set selection of Section 3; we will refer to it as svcradit from now on. We have borrowed the chunking and shrinking ideas from SVMlight [6] for our program. Several data sets have been used to test the two programs. The Adult and Web data sets have been obtained from J. Platt's web page, http://research.microsoft.com/~jplatt/smo.html; the Gauss-M data set is a two-dimensional classification problem proposed in [3] to test neural networks, comprising one Gaussian random variable per class, the two of which highly overlap. The Banana, Diabetes and Splice data sets have been obtained from Gunnar Rätsch's web page, http://svm.first.gmd.de/~raetsch/. The values of C and the RKHS have been chosen as indicated in [11] for the Adult and Web data sets and as in http://svm.first.gmd.de/~raetsch/ for the Banana, Diabetes and Splice data sets. Table 3 reports the runtime for each data set, where the value of q has been selected as the one that minimizes the runtime.

   Database   Dim   N Sampl.     C      σ     SV     q (radit/light)   CPU time (radit/light)
   Adult6     123    11221       1      10    4477     150 / 40           118.20 / 124.46
   Adult9     123    32562       1      10    12181    130 / 70          1093.29 / 1097.09
   Adult1     123     1605     1000     10    630      100 / 10            25.98 / 113.54
   Web1       300     2477       5      10    224      100 / 10             2.42 / 2.36
   Web7       300    24693       5      10    1444     150 / 10           158.13 / 124.57
   Gauss-M      2     4000       1       1    1736      70 / 10            12.69 / 48.28
   Gauss-M      2     4000     100       1    1516     100 / 10            61.68 / 3053.20
   Banana       2      400     316.2     1    80        40 / 70             0.33 / 0.77
   Banana       2     4900     316.2     1    1084      70 / 40            22.46 / 1786.56
   Diabetes     8      768      10       2    409       40 / 10             2.41 / 6.04
   Splice      69     2175     1000     70    525      150 / 20            14.06 / 49.19

Table 3: Runtime for several data sets when solved with svcradit (radit, for short) and with SVMlight (light, for short).

One can appreciate that svcradit is faster than SVMlight for most data sets. For the Web data set, the only data set for which SVMlight is slightly faster, the value of C is low and most training samples end up as support vectors with β_i < C. In such cases the best strategy is to take the largest step towards the solution in every iteration, as SVMlight does [6], because the β_i of most training samples will not be affected by the β_j of the other training samples. But when the value of C increases, the svcradit sample selection strategy is a much more appropriate strategy than the one used in SVMlight.

5 CONCLUSIONS

In this communication a new algorithm for solving the SVC for large training data sets has been presented. Its two major contributions concern the optimizing engine and the sample selection strategy. An IRWLS procedure is used to solve the SVC in each step; it is much faster than the usual QP procedure and simpler to implement, because its most difficult step is the solution of a linear equation system, which can easily be obtained by means of an LU decomposition [12] (a minimal sketch is given below). The random selection of the working set from the samples that do not fulfill the KKT conditions is the best option when the working set is large, because it reduces the number of chunks to be solved. This strategy benefits from the IRWLS procedure, which makes it possible to work with large training data sets. All these modifications have been brought together in the svcradit solving procedure, which is publicly available at http://svm.tsc.uc3m.es/.
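As a small illustration of that remark, the sketch below shows the LU-based solve. The system A z = c is a placeholder standing in for whatever the IRWLS step of Table 1 assembles (its exact form is not reproduced here), and the scipy routines are used only because they expose the factorization explicitly, so it can be reused across right-hand sides.

```python
# Illustrative only: the IRWLS inner step reduces to a square linear system
# A z = c (assembled elsewhere, e.g. from the weighted least-squares problem
# of Table 1). LU decomposition solves it directly, and the factorization can
# be reused if several right-hand sides share the same A.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def irwls_linear_step(A, c):
    """Solve A z = c by LU decomposition (the 'most difficult step' above)."""
    lu, piv = lu_factor(A)          # PA = LU factorization with partial pivoting
    return lu_solve((lu, piv), c)

# Tiny usage example with a random, well-conditioned placeholder system.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)
c = rng.standard_normal(5)
z = irwls_linear_step(A, c)
assert np.allclose(A @ z, c)
```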
6 ACKNOWLEDGEMENTS

We are sincerely grateful to Thorsten Joachims, who allowed and encouraged us to use his SVMlight to test our IRWLS procedure; the comparisons could not have been done properly otherwise.

References

[1] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In 5th Annual Workshop on Computational Learning Theory, Pittsburgh, U.S.A., 1992.
[2] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[3] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice-Hall, 1994.
[4] P. W. Holland and R. E. Welsch. Robust regression using iteratively reweighted least squares. Communications in Statistics - Theory and Methods, A6(9):813-827, 1977.
[5] T. Joachims. http://www-ai.informatik.uni-dortmund.de/forschung/verfahren/svmlight/svmlight.eng.html. Technical report, University of Dortmund, Informatik, AI-Unit Collaborative Research Center on 'Complexity Reduction in Multivariate Data', 1998.
[6] T. Joachims. Making Large Scale SVM Learning Practical. In Advances in Kernel Methods - Support Vector Learning, Editors Schölkopf, B., Burges, C. J. C. and Smola, A. J., pages 169-184. M.I.T. Press, 1999.
[7] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Proc. of the 1997 IEEE Workshop on Neural Networks for Signal Processing, pages 276-285, Amelia Island, U.S.A., 1997.
[8] E. Osuna and F. Girosi. Reducing the run-time complexity of support vector machines. In ICPR'98, Brisbane, Australia, August 1998.
[9] F. Perez-Cruz, A. Navia-Vazquez

4 0.31914133 93 nips-2000-On Iterative Krylov-Dogleg Trust-Region Steps for Solving Neural Networks Nonlinear Least Squares Problems

Author: Eiji Mizutani, James Demmel

Abstract: This paper describes a method of dogleg trust-region steps, or restricted Levenberg-Marquardt steps, based on a projection process onto the Krylov subspaces for neural networks nonlinear least squares problems. In particular, the linear conjugate gradient (CG) method works as the inner iterative algorithm for solving the linearized Gauss-Newton normal equation, whereas the outer nonlinear algorithm repeatedly takes so-called

5 0.31483024 112 nips-2000-Reinforcement Learning with Function Approximation Converges to a Region

Author: Geoffrey J. Gordon

Abstract: Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.

6 0.31128952 46 nips-2000-Ensemble Learning and Linear Response Theory for ICA

7 0.30788121 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

8 0.28589058 111 nips-2000-Regularized Winnow Methods

9 0.27169573 120 nips-2000-Sparse Greedy Gaussian Process Regression

10 0.27066454 18 nips-2000-Active Support Vector Machine Classification

11 0.26583388 109 nips-2000-Redundancy and Dimensionality Reduction in Sparse-Distributed Representations of Natural Objects in Terms of Their Local Features

12 0.25858438 73 nips-2000-Kernel-Based Reinforcement Learning in Average-Cost Problems: An Application to Optimal Portfolio Choice

13 0.25167134 6 nips-2000-A Neural Probabilistic Language Model

14 0.2428312 70 nips-2000-Incremental and Decremental Support Vector Machine Learning

15 0.2320649 24 nips-2000-An Information Maximization Approach to Overcomplete and Recurrent Representations

16 0.22261335 61 nips-2000-Generalizable Singular Value Decomposition for Ill-posed Datasets

17 0.21959363 60 nips-2000-Gaussianization

18 0.21691443 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

19 0.21671763 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

20 0.20779389 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.021), (16, 0.015), (17, 0.095), (23, 0.306), (32, 0.031), (33, 0.05), (42, 0.014), (55, 0.016), (62, 0.068), (65, 0.02), (67, 0.084), (75, 0.017), (76, 0.047), (79, 0.012), (81, 0.011), (90, 0.057), (91, 0.019), (97, 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78907365 22 nips-2000-Algorithms for Non-negative Matrix Factorization

Author: Daniel D. Lee, H. Sebastian Seung

Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the ExpectationMaximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.

2 0.4600417 74 nips-2000-Kernel Expansions with Unlabeled Examples

Author: Martin Szummer, Tommi Jaakkola

Abstract: Modern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.

3 0.45362353 79 nips-2000-Learning Segmentation by Random Walks

Author: Marina Meila, Jianbo Shi

Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features.

4 0.45221117 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set.

5 0.45201328 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ≥ 2 for a set of linearly separable data. Our algorithm, called ALMA_p (Approximate Large Margin algorithm w.r.t. norm p), takes O((p-1)/(α²γ²)) corrections to separate the data with p-norm margin larger than (1-α)γ, where γ is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMA_p avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMA_p to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMA_p are slightly inferior to those obtained by Support Vector Machines (SVMs). On the other hand, ALMA_p is quite faster and easier to implement than standard SVM training algorithms.

6 0.45149288 64 nips-2000-High-temperature Expansions for Learning Models of Nonnegative Data

7 0.44944534 111 nips-2000-Regularized Winnow Methods

8 0.44924814 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

9 0.44757032 21 nips-2000-Algorithmic Stability and Generalization Performance

10 0.44436699 146 nips-2000-What Can a Single Neuron Compute?

11 0.44268233 134 nips-2000-The Kernel Trick for Distances

12 0.44164184 52 nips-2000-Fast Training of Support Vector Classifiers

13 0.4408651 94 nips-2000-On Reversing Jensen's Inequality

14 0.43997785 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

15 0.43966466 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

16 0.43904895 37 nips-2000-Convergence of Large Margin Separable Linear Classification

17 0.43838024 60 nips-2000-Gaussianization

18 0.43823406 122 nips-2000-Sparse Representation for Gaussian Process Models

19 0.43804938 20 nips-2000-Algebraic Information Geometry for Learning Machines with Singularities

20 0.43730605 92 nips-2000-Occam's Razor