nips nips2005 nips2005-86 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Suvrit Sra, Inderjit S. Dhillon
Abstract: Nonnegative matrix approximation (NNMA) is a recent technique for dimensionality reduction and data analysis that yields a parts based, sparse nonnegative representation for nonnegative input data. NNMA has found a wide variety of applications, including text analysis, document clustering, face/image recognition, language modeling, speech processing and many others. Despite these numerous applications, the algorithmic development for computing the NNMA factors has been relatively deficient. This paper makes algorithmic progress by modeling and solving (using multiplicative updates) new generalized NNMA problems that minimize Bregman divergences between the input matrix and its lowrank approximation. The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms. In addition, the paper shows how to use penalty functions for incorporating constraints other than nonnegativity into the problem. Further, some interesting extensions to the use of “link” functions for modeling nonlinear relationships are also discussed. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: Nonnegative matrix approximation (NNMA) is a recent technique for dimensionality reduction and data analysis that yields a parts-based, sparse nonnegative representation for nonnegative input data. [sent-7, score-0.572]
2 This paper makes algorithmic progress by modeling and solving (using multiplicative updates) new generalized NNMA problems that minimize Bregman divergences between the input matrix and its lowrank approximation. [sent-10, score-0.407]
3 The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms. [sent-11, score-0.205]
4 In addition, the paper shows how to use penalty functions for incorporating constraints other than nonnegativity into the problem. [sent-12, score-0.186]
5 1 Introduction Nonnegative matrix approximation (NNMA) is a method for dimensionality reduction and data analysis that has gained favor over the past few years. [sent-14, score-0.078]
6 NNMA has previously been called positive matrix factorization [13] and nonnegative matrix factorization [12]. [sent-15, score-0.401]
7 Here a1 , . . . , aN are N nonnegative input (M -dimensional) vectors. [sent-19, score-0.263]
8 We organize these vectors as the columns of a nonnegative data matrix A = [a1 a2 · · · aN ]. [sent-20, score-0.307]
9 NNMA seeks a small set of K nonnegative representative vectors b1 , . . . , bK [sent-24, score-0.289]
10 that can be nonnegatively (or conically) combined to approximate the input vectors ai . [sent-27, score-0.106]
11 That is, an ≈ Σk ckn bk (k = 1, . . . , K), 1 ≤ n ≤ N. (Footnote 1: We use the word approximation instead of factorization to emphasize the inexactness of the process, since the input A is approximated by BC.) [sent-28, score-0.313]
12 where the combining coefficients ckn are restricted to be nonnegative. [sent-29, score-0.132]
13 If ckn and bk are unrestricted, and we minimize Σn ‖an − Bcn ‖2 , the Truncated Singular Value Decomposition (TSVD) of A yields the optimal bk and ckn values. [sent-30, score-0.476]
14 If the bk are unrestricted, but the coefficient vectors cn are restricted to be indicator vectors, then we obtain the problem of hard-clustering (See [16, Chapter 8] for related discussion regarding different constraints on cn and bk ). [sent-31, score-0.269]
15 In this paper we consider problems where all involved matrices are nonnegative. [sent-32, score-0.047]
16 For many practical problems nonnegativity is a natural requirement. [sent-33, score-0.102]
17 The quantities measured in such problems are all nonnegative entities, and approximating their measurements by nonnegative representations leads to greater interpretability. [sent-35, score-0.494]
18 NNMA has found a significant number of applications, not only due to increased interpretability, but also because admitting only nonnegative combinations of the bk leads to sparse representations. [sent-36, score-0.328]
19 This paper contributes to the algorithmic advancement of NNMA by generalizing the problem significantly, and by deriving efficient algorithms based on multiplicative updates for the generalized problems. [sent-37, score-0.242]
20 The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms, which seek to minimize Bregman divergences between the nonnegative input A and its approximation. [sent-39, score-0.676]
21 In addition, we discuss the use of penalty functions for incorporating constraints other than nonnegativity into the problem. [sent-40, score-0.186]
22 Further, we illustrate an interesting extension of our algorithms for handling non-linear relationships through the use of “link” functions. [sent-41, score-0.05]
23 2 Problems Given a nonnegative matrix A as input, the classical NNMA problem is to approximate it by a lower rank nonnegative matrix of the form BC, where B = [b1 , . . . , bK ] and C = [c1 , . . . , cN ] are themselves nonnegative. [sent-42, score-0.582]
24 For any strictly convex function ϕ : S ⊆ R → R that has a continuous first derivative, the corresponding Bregman divergence Dϕ : S × int(S) → R+ is defined as Dϕ (x, y) = ϕ(x) − ϕ(y) − ∇ϕ(y)(x − y), where int(S) is the interior of set S [1, 2]. [sent-53, score-0.077]
25 Bregman divergences are nonnegative, convex in the first argument and zero if and only if x = y. [sent-54, score-0.169]
26 These divergences play an important role in convex optimization [2]. [sent-55, score-0.169]
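As a concrete illustration of the definition above (our sketch, not part of the paper), the following NumPy snippet evaluates Dϕ for two standard choices of ϕ; the function names and test vectors are illustrative.

```python
import numpy as np

def bregman_divergence(x, y, phi, grad_phi):
    """D_phi(x, y) = sum_i phi(x_i) - phi(y_i) - phi'(y_i) (x_i - y_i)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(phi(x) - phi(y) - grad_phi(y) * (x - y))

# phi(t) = t^2/2  ->  squared Euclidean distance
sq, sq_grad = (lambda t: 0.5 * t**2), (lambda t: t)
# phi(t) = t log t - t  ->  generalized KL-divergence (I-divergence)
kl, kl_grad = (lambda t: t * np.log(t) - t), (lambda t: np.log(t))

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.5, 2.0, 2.5])
print(bregman_divergence(x, y, sq, sq_grad))  # equals 0.5 * ||x - y||^2
print(bregman_divergence(x, y, kl, kl_grad))  # equals sum x log(x/y) - x + y
```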
27 Formally, the resulting generalized nonnegative matrix approximation problems are: min B,C≥0 Dϕ (BC, A) + α(B) + β(C), (2.2) and min B,C≥0 Dϕ (A, BC) + α(B) + β(C). (2.3) [sent-60, score-0.371]
28 The functions α and β serve as penalty functions, and they allow us to enforce regularization (or other constraints) on B and C. [sent-63, score-0.071]
29 Table 1 gives a small sample of NNMA problems to illustrate the breadth of our formulation. [sent-67, score-0.053]
30 3 Algorithms In this section we present algorithms that seek to optimize (2.2) and (2.3). [sent-68, score-0.051]
31 Our algorithms are iterative in nature, and are directly inspired by the efficient algorithms of Lee and Seung [11]. [sent-71, score-0.106]
32 KL(x, y) denotes the generalized KL-divergence Σi ( xi log(xi /yi ) − xi + yi ) (also called I-divergence). [sent-79, score-0.067]
33 Problems (2.2) and (2.3) are not jointly convex in B and C, so it is not easy to obtain globally optimal solutions in polynomial time. [sent-82, score-0.052]
34 Our iterative procedures start by initializing B and C randomly or otherwise. [sent-83, score-0.052]
35 We utilize the concept of auxiliary functions [11] for our derivations. [sent-87, score-0.153]
36 It is sufficient to illustrate our methods using a single column of C (or row of B), since our divergences are separable. [sent-88, score-0.157]
37 A function G(c, c′ ) is called an auxiliary function for F (c) if: 1. G(c, c) = F (c), and 2. G(c, c′ ) ≥ F (c) for all c, c′ . [sent-91, score-0.131]
38 If G(c, c′ ) is an auxiliary function for F (c), then F is non-increasing under the update ct+1 = argminc G(c, ct ). [sent-97, score-0.241]
39 F (ct+1 ) ≤ G(ct+1 , ct ) ≤ G(ct , ct ) = F (ct ). [sent-99, score-0.138]
40 As can be observed, the sequence formed by the iterative application of Lemma 3.2 [sent-100, score-0.052]
41 leads to a monotonic decrease in the objective function value F (c). [sent-101, score-0.063]
42 For an algorithm that iteratively updates c in its quest to minimize F (c), the method for proving convergence boils down to the construction of an appropriate auxiliary function. [sent-102, score-0.247]
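A toy illustration of Lemma 3.2 (ours, not from the paper): F (c) = (c − 2)² is majorized by the auxiliary function G(c, c′ ) = (c − 2)² + (c − c′ )², and iterating c ← argmin G(c, c′ ) never increases F.

```python
# Majorize-minimize demo: G(c, c') >= F(c) and G(c, c) = F(c), so the update
# c_{t+1} = argmin_c G(c, c_t) = 1 + c_t/2 (set dG/dc = 0) decreases F monotonically.
F = lambda c: (c - 2.0) ** 2

c = 10.0
for t in range(8):
    c_next = 1.0 + 0.5 * c
    assert F(c_next) <= F(c) + 1e-12   # monotonic decrease guaranteed by Lemma 3.2
    c = c_next
    print(t, round(c, 4), round(F(c), 6))
```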
43 To avoid clutter we drop the functions α and β from (2.2). [sent-106, score-0.043]
44 The lemma below shows how to construct an auxiliary function for this objective. [sent-111, score-0.154]
45 The function G(c, c′ ) = Σij λij ϕ( bij cj /λij ) − Σi [ ϕ(ai ) + ψ(ai )( (Bc)i − ai ) ], (3.2) [sent-116, score-0.237]
46 with λij = (bij c′j )/(Σl bil c′l ) and ψ = ∇ϕ, is an auxiliary function for F (c) = Dϕ (Bc, a). [sent-117, score-0.131]
47 Note that by definition Σj λij = 1, and as both bij and c′j are nonnegative, λij ≥ 0. [sent-119, score-0.097]
48 It is easy to verify that G(c, c) = F (c), since in that case λij = (bij cj )/(Σl bil cl ) and Σj λij = 1. [sent-121, score-0.26]
49 Using the convexity of ϕ and the fact that Σj λij = 1 with λij ≥ 0, we conclude that F (c) = Σi [ ϕ( Σj bij cj ) − ϕ(ai ) − ψ(ai )( (Bc)i − ai ) ] ≤ Σij λij ϕ( bij cj /λij ) − Σi [ ϕ(ai ) + ψ(ai )( (Bc)i − ai ) ] = G(c, c′ ). [sent-122, score-0.164]
50 We compute the partial derivative ∂G/∂cp = Σi bip ψ( bip cp /λip ) − Σi bip ψ(ai ) = Σi bip ψ( cp (Bc′ )i /c′p ) − (B T ψ(a))p , where ψ(a) denotes the vector obtained by applying ψ to a elementwise. [sent-131, score-0.658]
51 Assuming ψ is multiplicative (i.e., ψ(xy) = ψ(x)ψ(y)), setting this derivative to zero yields the following iterative update relations for b and c (see [7]): bp ← bp · ψ −1 ( [ψ(aT )C T ]p / [ψ(bT C)C T ]p ) (3.4) and cp ← cp · ψ −1 ( [B T ψ(a)]p / [B T ψ(Bc)]p ). (3.5) [sent-139, score-0.166]
52 It turns out that when ϕ is a convex function of Legendre type, ψ −1 can be obtained as the derivative of the conjugate function ϕ∗ of ϕ, i.e., ψ −1 = ∇ϕ∗ . [sent-142, score-0.061]
53 The updates (3.4) and (3.5) coincide with the updates derived by Lee and Seung [11] when ϕ(x) = (1/2) x2 . [sent-148, score-0.066]
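For ϕ(x) = (1/2) x², ψ and ψ⁻¹ are the identity, and (3.4)–(3.5) become the Lee–Seung Frobenius updates. The sketch below (our code; matrix sizes and the eps guard are illustrative) runs them on random nonnegative data.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 8))             # nonnegative input matrix
K = 4
B = rng.random((20, K)) + 1e-3      # nonnegative initializations
C = rng.random((K, 8)) + 1e-3
eps = 1e-12                         # guards against division by zero

for it in range(200):
    # psi = identity: cp <- cp [B^T a]_p / [B^T Bc]_p, and the analogous row update for B
    C *= (B.T @ A) / (B.T @ (B @ C) + eps)
    B *= (A @ C.T) / ((B @ C) @ C.T + eps)

print(0.5 * np.linalg.norm(A - B @ C) ** 2)   # Frobenius objective after the sweeps
```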
54 1 Examples of New NNMA Problems We illustrate the power of our generic auxiliary functions given above for deriving algorithms with multiplicative updates for some specific interesting problems. [sent-151, score-0.382]
55 First we consider the problem that seeks to minimize the divergence KL(Bc, a) = Σi ( (Bc)i log( (Bc)i /ai ) − (Bc)i + ai ), subject to B, c ≥ 0. [sent-152, score-0.259]
56 Substituting into the auxiliary function and setting the resulting derivative to zero we obtain ∂G/∂cp = Σi bip log( cp (Bc′ )i /c′p ) − Σi bip log ai = 0, which gives (B T 1)p log(cp /c′p ) = [B T log a − B T log(Bc′ )]p , and hence cp = c′p · exp( [B T log( a/(Bc′ ) )]p / (B T 1)p ). [sent-156, score-0.691]
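A sketch of the resulting exponentiated update (our code, not from the paper); the division by (B T 1)p follows from the derivation above and disappears when the columns of B sum to one.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K = 30, 5
B = rng.random((M, K)) + 1e-3       # fixed nonnegative basis
a = rng.random(M) + 1e-3            # nonnegative target vector
c = rng.random(K) + 1e-3            # coefficients stay strictly positive under the update

def kl_bc_a(Bc, a):
    # generalized KL-divergence KL(Bc, a)
    return np.sum(Bc * np.log(Bc / a) - Bc + a)

col_sums = B.sum(axis=0)            # (B^T 1)_p
for it in range(100):
    c *= np.exp((B.T @ np.log(a / (B @ c))) / col_sums)

print(kl_bc_a(B @ c, a))            # objective after the exponentiated updates
```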
57 A problem with an additional linear constraint P c ≤ 0 can also be handled by our method, by adding an appropriate (differentiable) penalty function that enforces the constraint. [sent-166, score-0.049]
58 Assuming multiplicative ψ and following the auxiliary function technique described above, we obtain the following update for c: ck ← ck · ψ −1 ( ( [B T ψ(a)]k − ρ[P T (P c)+ ]k ) / [B T ψ(Bc)]k ), where (P c)+ = max(0, P c). [sent-169, score-0.329]
59 Note that care must be taken to ensure that the addition of this penalty term does not violate the nonnegativity of c, and to ensure that the argument of ψ −1 lies in its domain. [sent-170, score-0.121]
60 The earlier problem of minimizing KL(Bc, a) is, however, easier to handle, since its exponential updates ensure nonnegativity. [sent-173, score-0.066]
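For the Frobenius case (ψ the identity), one way to realize the penalized update above is sketched below; the constraint matrix P, the penalty weight ρ, and the clipping of the numerator at zero (our way of respecting the nonnegativity caveat) are illustrative choices, not prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
M, K = 30, 4
B = rng.random((M, K)) + 1e-3
a = rng.random(M) + 1e-3
c = rng.random(K) + 1e-3
P = np.array([[1.0, -1.0, 0.0, 0.0]])   # illustrative constraint: c[0] <= c[1]
rho, eps = 5.0, 1e-12                   # illustrative penalty strength

for it in range(300):
    violation = np.maximum(0.0, P @ c)                        # (P c)_+
    numer = np.maximum(B.T @ a - rho * (P.T @ violation), 0.0)
    c *= numer / (B.T @ (B @ c) + eps)

print(P @ c)   # approximately nonpositive once rho is large enough
```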
61 Given a = 1, with appropriate penalty functions, our solution to (3. [sent-174, score-0.049]
62 If A ≈ h(BC), where h is a “link” function that models a nonlinear relationship between A and the approximant BC, we may wish to minimize Dϕ (h(BC), A). [sent-177, score-0.076]
63 Recall that the auxiliary function that we used depended upon the convexity of ϕ. [sent-179, score-0.147]
64 Thus, if (ϕ ◦h) is a convex function, whose derivative ∇(ϕ ◦h) is “factorizable,” then we can easily derive algorithms for this problem with link functions. [sent-180, score-0.122]
65 We exclude explicit examples for lack of space and refer the reader to [7] for further details. [sent-181, score-0.042]
66 2 Algorithms using KKT conditions We now derive efficient multiplicative update relations for (2.3), [sent-183, score-0.14]
67 and these updates turn out to be simpler than those for (2.2). [sent-184, score-0.066]
68 Using matrix algebra, one can show that the gradients of Dϕ (A, BC) w.r.t. [sent-188, score-0.044]
69 B and C are ∇B Dϕ (A, BC) = ( ζ(BC) ⊙ (BC − A) ) C T and ∇C Dϕ (A, BC) = B T ( ζ(BC) ⊙ (BC − A) ), where ⊙ denotes the elementwise or Hadamard product, and ζ = ∇2 ϕ is applied elementwise to BC. [sent-191, score-0.116]
70 According to the KKT conditions, there exist Lagrange multiplier matrices Λ ≥ 0 and Ω ≥ 0 such that ∇B Dϕ (A, BC) = Λ, ∇C Dϕ (A, BC) = Ω, (3.9a) and λmk bmk = ωkn ckn = 0. (3.9b) [sent-192, score-0.232]
71 Writing out the gradient ∇B Dϕ (A, BC) elementwise, multiplying by bmk , and making use of (3.9a, b), [sent-195, score-0.083]
72 we obtain [ ( ζ(BC) ⊙ (BC − A) ) C T ]mk bmk = λmk bmk = 0, which suggests the iterative scheme bmk ← bmk [ ( ζ(BC) ⊙ A ) C T ]mk / [ ( ζ(BC) ⊙ BC ) C T ]mk . (3.10) [sent-196, score-0.498]
73 Proceeding in a similar fashion we obtain a similar iterative formula for ckn , namely ckn ← ckn [B T ( ζ(BC) ⊙ A )]kn / [B T ( ζ(BC) ⊙ BC )]kn . [sent-198, score-0.525]
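The scheme (3.10) and its companion update for ckn can be coded directly; in the sketch below (our code, illustrative sizes), ζ = ϕ′′ is supplied as a function, so ζ(x) = 1 recovers the Frobenius updates and ζ(x) = 1/x gives the familiar updates for KL(A, BC).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((20, 8)) + 1e-3
K = 4
B = rng.random((20, K)) + 1e-3
C = rng.random((K, 8)) + 1e-3
eps = 1e-12

zeta = lambda X: 1.0 / X            # phi(x) = x log x - x, i.e. D_phi(A, BC) = KL(A, BC)
# zeta = lambda X: np.ones_like(X)  # phi(x) = x^2/2, i.e. the Frobenius objective

for it in range(200):
    Z = zeta(B @ C)
    B *= ((Z * A) @ C.T) / ((Z * (B @ C)) @ C.T + eps)
    Z = zeta(B @ C)
    C *= (B.T @ (Z * A)) / (B.T @ (Z * (B @ C)) + eps)

print(np.sum(A * np.log(A / (B @ C)) - A + B @ C))   # KL(A, BC) after the updates
```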
74 1 Examples of New and Old NNMA Problems as Special Cases We now illustrate the power of our approach by showing how one can easily obtain iterative update relations for many NNMA problems, including known and new problems. [sent-202, score-0.153]
75 Here we wish to minimize ‖W ⊙ (A − BC)‖F , which reduces to the unweighted case via the substitutions X ← √W ⊙ X and A ← √W ⊙ A in (3.10). [sent-210, score-0.076]
76 The resulting updates are B ← B ⊙ [ (W ⊙ A) C T ] / [ (W ⊙ (BC)) C T ] and C ← C ⊙ [B T (W ⊙ A)] / [B T (W ⊙ (BC))]; these iterative updates are significantly simpler than the PMF algorithms of [13]. [sent-214, score-0.145]
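A sketch of these weighted updates in NumPy (our code; dimensions and the eps guard are illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((20, 8))
W = rng.random((20, 8))             # fixed nonnegative weight matrix
K = 4
B = rng.random((20, K)) + 1e-3
C = rng.random((K, 8)) + 1e-3
eps = 1e-12

for it in range(200):
    B *= ((W * A) @ C.T) / ((W * (B @ C)) @ C.T + eps)
    C *= (B.T @ (W * A)) / (B.T @ (W * (B @ C)) + eps)

print(np.sum(W * (A - B @ C) ** 2))   # weighted squared error after the sweeps
```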
77 The above ideas can be extended to the multifactor NNMA problem that seeks to minimize the following divergence (see [7]): Dϕ (A, B1 B2 · · · BR ). [sent-216, score-0.161]
78 A typical usage of multifactor NNMA problem would be to obtain a three-factor NNMA, namely A ≈ RBC. [sent-220, score-0.06]
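The excerpt does not spell out the three-factor updates; for the Frobenius objective ‖A − RBC‖F , one plausible set of multiplicative updates, obtained by splitting each gradient into its positive and negative parts (our sketch; see [7] for the paper's own derivation), is:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((20, 12)) + 1e-3
R = rng.random((20, 6)) + 1e-3
B = rng.random((6, 4)) + 1e-3
C = rng.random((4, 12)) + 1e-3
eps = 1e-12

for it in range(300):
    BC = B @ C
    R *= (A @ BC.T) / (R @ BC @ BC.T + eps)               # grad_R = (RBC - A)(BC)^T
    RB = R @ B
    C *= (RB.T @ A) / (RB.T @ RB @ C + eps)               # grad_C = (RB)^T (RBC - A)
    B *= (R.T @ A @ C.T) / (R.T @ R @ B @ C @ C.T + eps)  # grad_B = R^T (RBC - A) C^T

print(np.linalg.norm(A - R @ B @ C))
```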
79 We can follow the same derivation method as above (based on KKT conditions) for obtaining multiplicative updates for the weighted NNMA problem: min Dϕ (A, W1 BCW2 ), where W1 and W2 are nonnegative (and nonsingular) weight matrices. [sent-223, score-0.418]
80 4 Experiments and Discussion We have looked at generic algorithms for minimizing Bregman divergences between the input and its approximation. [sent-226, score-0.195]
81 One important question arises: Which Bregman divergence should one use for a given problem? [sent-227, score-0.042]
82 If we assume that the noise is distributed according to some member of the exponential family, then minimizing the corresponding Bregman divergence [1] is appropriate. [sent-229, score-0.042]
83 Some other open problems involve looking at the class of minimization problems to which the iterative methods of Section 3 may be applied. [sent-244, score-0.112]
84 For example, determining the class of functions h for which these methods may be used to minimize Dϕ (A, h(BC)). [sent-246, score-0.072]
85 However, we did not demonstrate such a guarantee for the updates (3.10) and the companion update for ckn . [sent-252, score-0.066]
86 Figure 1 offers encouraging empirical evidence in favor of a monotonic behavior of these updates. [sent-255, score-0.044]
87 [Figure 1 panels: ϕ(x) = − log x, ϕ(x) = x log x − x, and the PMF objective; axis tick labels omitted.] [sent-258, score-0.07]
88 Figure 1: Objective function values over 100 iterations for different NNMA problems. [sent-266, score-0.076]
89 The input matrix A was a random 20 × 8 nonnegative matrix. [sent-267, score-0.307]
90 We believe that special cases of our generalized problems will prove to be useful for applications in data mining and machine learning. [sent-270, score-0.12]
91 5 Related Work Paatero and Tapper [13] introduced NNMA as positive matrix factorization, and they aimed to minimize ‖W ⊙ (A − BC)‖F , where W was a fixed nonnegative matrix of weights. [sent-271, score-0.385]
92 NNMA remained confined to applications in Environmetrics and Chemometrics before the pioneering papers of Lee and Seung [11, 12] popularized the problem. [sent-272, score-0.064]
93 Lee and Seung [11] provided simple and efficient algorithms for the NNMA problems that sought to minimize ‖A − BC‖F and KL(A, BC). [sent-273, score-0.107]
94 Lee & Seung called these problems nonnegative matrix factorization (NNMF), and their algorithms have inspired our generalizations. [sent-274, score-0.414]
95 We refer the reader to [7] for pointers to the literature on various applications of NNMA. [sent-276, score-0.063]
96 Srebro and Jaakkola [15] discuss elementwise weighted low-rank approximations without any nonnegativity constraints. [sent-277, score-0.156]
97 [6] discuss algorithms for obtaining a low-rank approximation of the form A ≈ BC, where the loss functions are Bregman divergences; however, there is no restriction on B and C. [sent-279, score-0.067]
98 A weighted nonnegative matrix factorization for local representations. [sent-351, score-0.383]
99 Learning the parts of objects by nonnegative matrix factorization. [sent-372, score-0.291]
100 Positive matrix factorization: A nonnegative factor model with optimal utilization of error estimates of data values. [sent-377, score-0.291]
wordName wordTfidf (topN-words)
[('bc', 0.598), ('nnma', 0.545), ('nonnegative', 0.247), ('bregman', 0.161), ('cp', 0.144), ('divergences', 0.134), ('ckn', 0.132), ('auxiliary', 0.131), ('seung', 0.107), ('lee', 0.088), ('bip', 0.086), ('bmk', 0.083), ('bk', 0.081), ('multiplicative', 0.079), ('ai', 0.074), ('nonnegativity', 0.072), ('ct', 0.069), ('ij', 0.066), ('factorization', 0.066), ('updates', 0.066), ('bij', 0.061), ('mk', 0.06), ('elementwise', 0.058), ('austin', 0.052), ('iterative', 0.052), ('minimize', 0.05), ('paatero', 0.05), ('dhillon', 0.049), ('penalty', 0.049), ('kl', 0.044), ('matrix', 0.044), ('csisz', 0.043), ('pioneering', 0.043), ('kkt', 0.043), ('multifactor', 0.043), ('divergence', 0.042), ('update', 0.041), ('texas', 0.037), ('cj', 0.036), ('objective', 0.035), ('convex', 0.035), ('log', 0.035), ('link', 0.034), ('cichocki', 0.033), ('environmetrics', 0.033), ('guillamet', 0.033), ('pmf', 0.033), ('tapper', 0.033), ('unrestricted', 0.033), ('generalized', 0.032), ('collins', 0.031), ('frobenius', 0.03), ('problems', 0.03), ('ip', 0.029), ('srebro', 0.029), ('monotonic', 0.028), ('kn', 0.028), ('algorithms', 0.027), ('seeks', 0.026), ('formulae', 0.026), ('florida', 0.026), ('wish', 0.026), ('cn', 0.026), ('derivative', 0.026), ('weighted', 0.026), ('feng', 0.024), ('reader', 0.024), ('seek', 0.024), ('lemma', 0.023), ('illustrate', 0.023), ('algorithmic', 0.022), ('xij', 0.022), ('yij', 0.022), ('constraints', 0.022), ('functions', 0.022), ('mining', 0.021), ('applications', 0.021), ('clutter', 0.021), ('xy', 0.021), ('award', 0.021), ('incorporating', 0.021), ('relations', 0.02), ('iterations', 0.019), ('analytic', 0.019), ('ck', 0.018), ('int', 0.018), ('approximation', 0.018), ('generic', 0.018), ('refer', 0.018), ('bp', 0.018), ('obtain', 0.017), ('matrices', 0.017), ('special', 0.016), ('convexity', 0.016), ('favor', 0.016), ('noting', 0.016), ('deriving', 0.016), ('vectors', 0.016), ('input', 0.016), ('differentiable', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences
Author: Suvrit Sra, Inderjit S. Dhillon
Abstract: Nonnegative matrix approximation (NNMA) is a recent technique for dimensionality reduction and data analysis that yields a parts based, sparse nonnegative representation for nonnegative input data. NNMA has found a wide variety of applications, including text analysis, document clustering, face/image recognition, language modeling, speech processing and many others. Despite these numerous applications, the algorithmic development for computing the NNMA factors has been relatively deficient. This paper makes algorithmic progress by modeling and solving (using multiplicative updates) new generalized NNMA problems that minimize Bregman divergences between the input matrix and its lowrank approximation. The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms. In addition, the paper shows how to use penalty functions for incorporating constraints other than nonnegativity into the problem. Further, some interesting extensions to the use of “link” functions for modeling nonlinear relationships are also discussed. 1
2 0.16022241 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
Author: Ross Lippert, Ryan Rifkin
Abstract: We consider regularized least-squares (RLS) with a Gaussian kernel. We prove that if we let the Gaussian bandwidth σ → ∞ while letting the regularization parameter λ → 0, the RLS solution tends to a polynomial 1 whose order is controlled by the rielative rates of decay of σ2 and λ: if λ = σ −(2k+1) , then, as σ → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error. We illustrate the result with an example. 1
3 0.057537507 79 nips-2005-Fusion of Similarity Data in Clustering
Author: Tilman Lange, Joachim M. Buhmann
Abstract: Fusing multiple information sources can yield significant benefits to successfully accomplish learning tasks. Many studies have focussed on fusing information in supervised learning contexts. We present an approach to utilize multiple information sources in the form of similarity data for unsupervised learning. Based on similarity information, the clustering task is phrased as a non-negative matrix factorization problem of a mixture of similarity measurements. The tradeoff between the informativeness of data sources and the sparseness of their mixture is controlled by an entropy-based weighting mechanism. For the purpose of model selection, a stability-based approach is employed to ensure the selection of the most self-consistent hypothesis. The experiments demonstrate the performance of the method on toy as well as real world data sets. 1
4 0.050942432 178 nips-2005-Soft Clustering on Graphs
Author: Kai Yu, Shipeng Yu, Volker Tresp
Abstract: We propose a simple clustering framework on graphs encoding pairwise data similarities. Unlike usual similarity-based methods, the approach softly assigns data to clusters in a probabilistic way. More importantly, a hierarchical clustering is naturally derived in this framework to gradually merge lower-level clusters into higher-level ones. A random walk analysis indicates that the algorithm exposes clustering structures in various resolutions, i.e., a higher level statistically models a longer-term diffusion on graphs and thus discovers a more global clustering structure. Finally we provide very encouraging experimental results. 1
5 0.050929148 58 nips-2005-Divergences, surrogate loss functions and experimental design
Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan
Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1
6 0.046805624 114 nips-2005-Learning Rankings via Convex Hull Separation
7 0.044784322 82 nips-2005-Generalization Error Bounds for Aggregation by Mirror Descent with Averaging
8 0.040204562 96 nips-2005-Inference with Minimal Communication: a Decision-Theoretic Variational Approach
9 0.039578985 185 nips-2005-Subsequence Kernels for Relation Extraction
10 0.038851518 202 nips-2005-Variational EM Algorithms for Non-Gaussian Latent Variable Models
11 0.037045557 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
12 0.036534615 50 nips-2005-Convex Neural Networks
13 0.036433622 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models
14 0.036224253 70 nips-2005-Fast Information Value for Graphical Models
15 0.034334991 192 nips-2005-The Information-Form Data Association Filter
16 0.033579793 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
17 0.033441696 51 nips-2005-Correcting sample selection bias in maximum entropy density estimation
18 0.033400837 126 nips-2005-Metric Learning by Collapsing Classes
19 0.033316072 98 nips-2005-Infinite latent feature models and the Indian buffet process
20 0.032382384 2 nips-2005-A Bayes Rule for Density Matrices
topicId topicWeight
[(0, 0.106), (1, 0.046), (2, -0.029), (3, -0.036), (4, 0.008), (5, -0.016), (6, -0.049), (7, 0.01), (8, -0.04), (9, -0.037), (10, -0.024), (11, -0.03), (12, -0.048), (13, 0.008), (14, 0.016), (15, 0.085), (16, -0.043), (17, 0.064), (18, 0.012), (19, -0.022), (20, -0.034), (21, 0.033), (22, 0.031), (23, -0.014), (24, 0.051), (25, -0.046), (26, 0.078), (27, 0.108), (28, -0.086), (29, -0.025), (30, -0.025), (31, 0.132), (32, -0.02), (33, -0.037), (34, -0.036), (35, 0.037), (36, -0.215), (37, -0.033), (38, -0.31), (39, 0.076), (40, -0.135), (41, 0.026), (42, -0.046), (43, -0.113), (44, -0.02), (45, -0.042), (46, -0.076), (47, -0.017), (48, 0.039), (49, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.94766027 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences
Author: Suvrit Sra, Inderjit S. Dhillon
Abstract: Nonnegative matrix approximation (NNMA) is a recent technique for dimensionality reduction and data analysis that yields a parts based, sparse nonnegative representation for nonnegative input data. NNMA has found a wide variety of applications, including text analysis, document clustering, face/image recognition, language modeling, speech processing and many others. Despite these numerous applications, the algorithmic development for computing the NNMA factors has been relatively deficient. This paper makes algorithmic progress by modeling and solving (using multiplicative updates) new generalized NNMA problems that minimize Bregman divergences between the input matrix and its lowrank approximation. The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms. In addition, the paper shows how to use penalty functions for incorporating constraints other than nonnegativity into the problem. Further, some interesting extensions to the use of “link” functions for modeling nonlinear relationships are also discussed. 1
2 0.81014782 31 nips-2005-Asymptotics of Gaussian Regularized Least Squares
Author: Ross Lippert, Ryan Rifkin
Abstract: We consider regularized least-squares (RLS) with a Gaussian kernel. We prove that if we let the Gaussian bandwidth σ → ∞ while letting the regularization parameter λ → 0, the RLS solution tends to a polynomial 1 whose order is controlled by the rielative rates of decay of σ2 and λ: if λ = σ −(2k+1) , then, as σ → ∞, the RLS solution tends to the kth order polynomial with minimal empirical error. We illustrate the result with an example. 1
3 0.3924225 182 nips-2005-Statistical Convergence of Kernel CCA
Author: Kenji Fukumizu, Arthur Gretton, Francis R. Bach
Abstract: While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence. 1
4 0.34054074 79 nips-2005-Fusion of Similarity Data in Clustering
Author: Tilman Lange, Joachim M. Buhmann
Abstract: Fusing multiple information sources can yield significant benefits to successfully accomplish learning tasks. Many studies have focussed on fusing information in supervised learning contexts. We present an approach to utilize multiple information sources in the form of similarity data for unsupervised learning. Based on similarity information, the clustering task is phrased as a non-negative matrix factorization problem of a mixture of similarity measurements. The tradeoff between the informativeness of data sources and the sparseness of their mixture is controlled by an entropy-based weighting mechanism. For the purpose of model selection, a stability-based approach is employed to ensure the selection of the most self-consistent hypothesis. The experiments demonstrate the performance of the method on toy as well as real world data sets. 1
5 0.33765817 59 nips-2005-Dual-Tree Fast Gauss Transforms
Author: Dongryeol Lee, Andrew W. Moore, Alexander G. Gray
Abstract: In previous work we presented an efficient approach to computing kernel summations which arise in many machine learning methods such as kernel density estimation. This approach, dual-tree recursion with finite-difference approximation, generalized existing methods for similar problems arising in computational physics in two ways appropriate for statistical problems: toward distribution sensitivity and general dimension, partly by avoiding series expansions. While this proved to be the fastest practical method for multivariate kernel density estimation at the optimal bandwidth, it is much less efficient at larger-than-optimal bandwidths. In this work, we explore the extent to which the dual-tree approach can be integrated with multipole-like Hermite expansions in order to achieve reasonable efficiency across all bandwidth scales, though only for low dimensionalities. In the process, we derive and demonstrate the first truly hierarchical fast Gauss transforms, effectively combining the best tools from discrete algorithms and continuous approximation theory.
6 0.32826123 192 nips-2005-The Information-Form Data Association Filter
7 0.31142136 2 nips-2005-A Bayes Rule for Density Matrices
8 0.29724866 168 nips-2005-Rodeo: Sparse Nonparametric Regression in High Dimensions
9 0.29710504 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
10 0.29706633 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
11 0.28271294 71 nips-2005-Fast Krylov Methods for N-Body Learning
12 0.27631789 46 nips-2005-Consensus Propagation
13 0.26264903 114 nips-2005-Learning Rankings via Convex Hull Separation
14 0.26186925 140 nips-2005-Nonparametric inference of prior probabilities from Bayes-optimal behavior
15 0.25724363 186 nips-2005-TD(0) Leads to Better Policies than Approximate Value Iteration
16 0.23759079 117 nips-2005-Learning from Data of Variable Quality
17 0.22856066 180 nips-2005-Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms
18 0.22248164 105 nips-2005-Large-Scale Multiclass Transduction
19 0.2200637 185 nips-2005-Subsequence Kernels for Relation Extraction
20 0.21275705 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
topicId topicWeight
[(3, 0.063), (10, 0.032), (12, 0.358), (27, 0.042), (31, 0.023), (34, 0.094), (41, 0.032), (50, 0.017), (55, 0.062), (69, 0.042), (73, 0.03), (88, 0.048), (91, 0.039)]
simIndex simValue paperId paperTitle
1 0.93223876 135 nips-2005-Neuronal Fiber Delineation in Area of Edema from Diffusion Weighted MRI
Author: Ofer Pasternak, Nathan Intrator, Nir Sochen, Yaniv Assaf
Abstract: Diffusion Tensor Magnetic Resonance Imaging (DT-MRI) is a non invasive method for brain neuronal fibers delineation. Here we show a modification for DT-MRI that allows delineation of neuronal fibers which are infiltrated by edema. We use the Muliple Tensor Variational (MTV) framework which replaces the diffusion model of DT-MRI with a multiple component model and fits it to the signal attenuation with a variational regularization mechanism. In order to reduce free water contamination we estimate the free water compartment volume fraction in each voxel, remove it, and then calculate the anisotropy of the remaining compartment. The variational framework was applied on data collected with conventional clinical parameters, containing only six diffusion directions. By using the variational framework we were able to overcome the highly ill posed fitting. The results show that we were able to find fibers that were not found by DT-MRI.
same-paper 2 0.78922045 86 nips-2005-Generalized Nonnegative Matrix Approximations with Bregman Divergences
Author: Suvrit Sra, Inderjit S. Dhillon
Abstract: Nonnegative matrix approximation (NNMA) is a recent technique for dimensionality reduction and data analysis that yields a parts based, sparse nonnegative representation for nonnegative input data. NNMA has found a wide variety of applications, including text analysis, document clustering, face/image recognition, language modeling, speech processing and many others. Despite these numerous applications, the algorithmic development for computing the NNMA factors has been relatively deficient. This paper makes algorithmic progress by modeling and solving (using multiplicative updates) new generalized NNMA problems that minimize Bregman divergences between the input matrix and its lowrank approximation. The multiplicative update formulae in the pioneering work by Lee and Seung [11] arise as a special case of our algorithms. In addition, the paper shows how to use penalty functions for incorporating constraints other than nonnegativity into the problem. Further, some interesting extensions to the use of “link” functions for modeling nonlinear relationships are also discussed. 1
3 0.71206707 116 nips-2005-Learning Topology with the Generative Gaussian Graph and the EM Algorithm
Author: Michaël Aupetit
Abstract: Given a set of points and a set of prototypes representing them, how to create a graph of the prototypes whose topology accounts for that of the points? This problem had not yet been explored in the framework of statistical learning theory. In this work, we propose a generative model based on the Delaunay graph of the prototypes and the ExpectationMaximization algorithm to learn the parameters. This work is a first step towards the construction of a topological model of a set of points grounded on statistics. 1 1.1
4 0.65861773 28 nips-2005-Analyzing Auditory Neurons by Learning Distance Functions
Author: Inna Weiner, Tomer Hertz, Israel Nelken, Daphna Weinshall
Abstract: We present a novel approach to the characterization of complex sensory neurons. One of the main goals of characterizing sensory neurons is to characterize dimensions in stimulus space to which the neurons are highly sensitive (causing large gradients in the neural responses) or alternatively dimensions in stimulus space to which the neuronal response are invariant (defining iso-response manifolds). We formulate this problem as that of learning a geometry on stimulus space that is compatible with the neural responses: the distance between stimuli should be large when the responses they evoke are very different, and small when the responses they evoke are similar. Here we show how to successfully train such distance functions using rather limited amount of information. The data consisted of the responses of neurons in primary auditory cortex (A1) of anesthetized cats to 32 stimuli derived from natural sounds. For each neuron, a subset of all pairs of stimuli was selected such that the responses of the two stimuli in a pair were either very similar or very dissimilar. The distance function was trained to fit these constraints. The resulting distance functions generalized to predict the distances between the responses of a test stimulus and the trained stimuli. 1
5 0.38642058 58 nips-2005-Divergences, surrogate loss functions and experimental design
Author: Xuanlong Nguyen, Martin J. Wainwright, Michael I. Jordan
Abstract: In this paper, we provide a general theorem that establishes a correspondence between surrogate loss functions in classification and the family of f -divergences. Moreover, we provide constructive procedures for determining the f -divergence induced by a given surrogate loss, and conversely for finding all surrogate loss functions that realize a given f -divergence. Next we introduce the notion of universal equivalence among loss functions and corresponding f -divergences, and provide necessary and sufficient conditions for universal equivalence to hold. These ideas have applications to classification problems that also involve a component of experiment design; in particular, we leverage our results to prove consistency of a procedure for learning a classifier under decentralization requirements. 1
6 0.38004744 177 nips-2005-Size Regularized Cut for Data Clustering
7 0.37945503 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models
8 0.37817374 47 nips-2005-Consistency of one-class SVM and related algorithms
9 0.37811664 184 nips-2005-Structured Prediction via the Extragradient Method
10 0.37800634 50 nips-2005-Convex Neural Networks
11 0.37710083 190 nips-2005-The Curse of Highly Variable Functions for Local Kernel Machines
12 0.37672588 114 nips-2005-Learning Rankings via Convex Hull Separation
13 0.37605596 43 nips-2005-Comparing the Effects of Different Weight Distributions on Finding Sparse Representations
14 0.37505248 178 nips-2005-Soft Clustering on Graphs
15 0.37445337 9 nips-2005-A Domain Decomposition Method for Fast Manifold Learning
16 0.37322557 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity
17 0.37208533 94 nips-2005-Identifying Distributed Object Representations in Human Extrastriate Visual Cortex
18 0.37170222 112 nips-2005-Learning Minimum Volume Sets
19 0.37112942 49 nips-2005-Convergence and Consistency of Regularized Boosting Algorithms with Stationary B-Mixing Observations
20 0.37035024 160 nips-2005-Query by Committee Made Real