nips nips2006 nips2006-117 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rie K. Ando, Tong Zhang
Abstract: We consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. The results suggest a limitation of the standard degree-based normalization. We propose a remedy from our analysis and demonstrate empirically that the remedy leads to improved classification performance.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. [sent-14, score-0.183]
2 We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. [sent-15, score-0.363]
3 It was proposed in [3] that one may first project these data points into the eigenspace corresponding to the largest eigenvalues of a normalized adjacency matrix of the graph and then use the standard k-means method for clustering. [sent-19, score-0.194]
4 Although such normalization may significantly affect the performance, the issue has not been studied from the learning theory perspective. [sent-27, score-0.152]
5 The relationship of kernel design and graph learning was investigated in [6], which argued that quadratic regularization-based graph learning can be regarded as kernel design. [sent-28, score-0.286]
6 The goal of this paper is to provide some learning theoretical insight into the role of normalization of the graph Laplacian matrix (D − W). [sent-30, score-0.285]
7 We first present a model for transductive learning on graphs and develop a margin analysis for multi-class graph learning. [sent-31, score-0.222]
8 Based on this, we analyze the performance of Laplacian regularization-based graph learning in relation to graph properties. [sent-32, score-0.201]
9 We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as dimension reduction in graph learning. [sent-33, score-0.569]
10 The results indicate a limitation of the commonly practiced degree-based normalization mentioned above. [sent-34, score-0.245]
11 We propose a learning theoretical remedy based on our analysis and use experiments to demonstrate that the remedy leads to improved classification performance. [sent-35, score-0.17]
12 Assume that each node vj is associated with an output value yj ∈ Y, which we are interested in predicting. [sent-41, score-0.355]
13 We encode the label yj into a vector in RK, so that the problem becomes that of generating an estimation vector fj,· = [fj,1, ..., fj,K] ∈ RK, which can then be used to recover the label yj. [sent-48, score-0.329] [sent-51, score-0.329]
15 Assuming Y = {1, ..., K}, we encode each yj = k ∈ Y as ek ∈ RK, where ek is a vector of zero entries except for the k-th entry being one. [sent-55, score-0.346]
16 Given an estimation vector fj,· = [fj,1, ..., fj,K] ∈ RK (which is intended to approximate eyj), we decode the corresponding label estimation ŷj as: ŷj = arg max_k {fj,k : k = 1, ..., K}. [sent-59, score-0.633]
17 If the true label is yj, then the classification error is err(fj,·, yj) = I(ŷj ≠ yj), where we use I(·) to denote the set indicator function. [sent-63, score-0.937]
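As a concrete illustration of this encoding/decoding scheme, here is a minimal numpy sketch (the function names and the assumption of 0-based class labels are ours, not the paper's):

    import numpy as np

    def encode_labels(y, K):
        # e_{y_j}: one-hot target vectors, one row per node
        E = np.zeros((len(y), K))
        E[np.arange(len(y)), y] = 1.0
        return E

    def decode_and_err(F, y):
        # decode y_hat_j = arg max_k f_{j,k}; err is the 0/1 indicator I(y_hat_j != y_j)
        y_hat = np.argmax(F, axis=1)
        return y_hat, (y_hat != np.asarray(y)).astype(float)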
18 In order to estimate f = [fj,k] ∈ RmK from only a subset of labeled nodes, we consider, for a given kernel matrix K ∈ Rm×m, the quadratic regularization f^T QK f = Σ_{k=1}^{K} f·,k^T K^{−1} f·,k, where f·,k = [f1,k, ..., fm,k]. [sent-64, score-0.176]
19 We will consider the kernel matrix induced by the graph Laplacian, to be introduced later in the paper. [sent-69, score-0.174]
20 The quality of an estimation vector fj,· = [fj,1, ..., fj,K] ∈ RK is measured by a loss function φ(fj,·, yj). [sent-74, score-0.329]
21 Our learning method attempts to minimize the empirical risk on the set Zn of n labeled training nodes, subject to f^T QK f being small: f̂(Zn) = arg min_{f ∈ RmK} [ (1/n) Σ_{j∈Zn} φ(fj,·, yj) + λ f^T QK f ]. (1) [sent-75, score-0.358]
22 In this paper, we focus on a special class of loss functions of the form φ(fj,·, yj) = Σ_{k=1}^{K} φ0(fj,k, δk,yj), where δa,b is the delta function defined as: δa,b = 1 when a = b and δa,b = 0 otherwise. [sent-77, score-0.329]
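For the squared-loss choice φ0(f, δ) = (f − δ)², objective (1) has a closed-form minimizer; the sketch below computes it via a standard representer argument for a given kernel matrix K, an m × K one-hot target matrix Y, and the list of labeled node indices (the names and this particular closed form are our own working assumptions, not the paper's code):

    import numpy as np

    def fit_transductive(K, Y, labeled, lam):
        # Minimize (1/n) sum_{j in Z_n} ||f_{j,.} - e_{y_j}||^2 + lam * sum_k f_{.,k}^T K^{-1} f_{.,k}
        # Closed form: f = K[:, L] (K[L, L] + lam * n * I)^{-1} Y[L, :]
        n = len(labeled)
        K_LL = K[np.ix_(labeled, labeled)]
        alpha = np.linalg.solve(K_LL + lam * n * np.eye(n), Y[labeled])
        return K[:, labeled] @ alpha   # m x K matrix of estimation vectors f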
23 Theorem 1 Let φ(fj,·, yj) = Σ_{k=1}^{K} φ0(fj,k, δk,yj) in (1). [sent-79, score-0.304]
24 Then ∀p > 0, the expected generalization error of the learning method (1) over the random training samples Zn can be bounded by: EZn [ (1/(m−n)) Σ_{j∈Z̄n} err(f̂j,·(Zn), yj) ] ≤ (1/a) inf_{f∈RmK} [ (1/m) Σ_{j=1}^{m} φ(fj,·, yj) + λ f^T QK f + b·trp(K)/(λ n cp) ], where Z̄n = {1, ..., m} − Zn, trp(K) = ( (1/m) Σ_{j=1}^{m} Kj,j^p )^{1/p}, and Kj,j is the (j, j)-entry of K. [sent-81, score-0.386] [sent-84, score-0.406]
26 We have err(f̂j,·(Zn), yj) ≤ (1/a) φ(f̂j,·(Zn+1), yj) + b Kj,j^p. [sent-109, score-0.608]
27 3 Laplacian regularization. Consider an undirected graph G = (V, E) defined on the nodes V = {vj : j = 1, ..., m}. [sent-119, score-0.222]
28 Let degj(G) = Σ_{j′=1}^{m} wj,j′ be the degree of node j of graph G. [sent-130, score-0.241]
29 Definition 1 Consider a graph G = (V, E) of m nodes with weights wj,j′ (j, j′ = 1, ..., m). [sent-132, score-0.167]
30 The unnormalized Laplacian matrix L(G) ∈ Rm×m is defined as: Lj,j′(G) = −wj,j′ if j ≠ j′; degj(G) otherwise. [sent-136, score-0.313]
31 The corresponding regularization is based on: f·,k^T LS(G) f·,k = (1/2) Σ_{j,j′=1}^{m} wj,j′ ( fj,k/√Sj − fj′,k/√Sj′ )². [sent-142, score-0.321]
32 A common choice of S is S = I, corresponding to regularizing with the unnormalized Laplacian L. [sent-143, score-0.17]
33 The idea is natural: we assume that the predictive values fj,k and fj ′ ,k should be close when (j, j ′ ) ∈ E with a strong link. [sent-144, score-0.266]
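A small sketch of these constructions, assuming a dense numpy weight matrix W and a vector S of positive scaling factors; the form LS = S^{-1/2} L S^{-1/2} reproduces the quadratic form in the display above:

    import numpy as np

    def laplacian(W):
        # Unnormalized Laplacian L(G) = D - W with D_jj = deg_j(G)
        return np.diag(W.sum(axis=1)) - W

    def scaled_laplacian(W, S):
        # L_S(G) = S^{-1/2} L(G) S^{-1/2}, so that f^T L_S f
        # = (1/2) sum_{j,j'} w_{j,j'} (f_j / sqrt(S_j) - f_{j'} / sqrt(S_{j'}))^2
        s = 1.0 / np.sqrt(np.asarray(S, dtype=float))
        return s[:, None] * laplacian(W) * s[None, :]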
34 Given labels y = {yj} (j = 1, ..., m) on V, we define the cut for LS in Definition 1 as: cut(LS, y) = Σ_{j,j′: yj ≠ yj′} (wj,j′/2) (1/Sj + 1/Sj′) + Σ_{j,j′: yj = yj′} (wj,j′/2) ( 1/√Sj − 1/√Sj′ )². [sent-151, score-0.175]
35 Unlike typical graph-theoretical definitions of graph-cut, this learning theoretical definition of graph-cut penalizes not only between-class edge weights but also within-class edge weights when such an edge connects two nodes with different scaling factors. [sent-152, score-0.242]
36 This penalization is intuitive if we look at the regularizer in Definition 1, which encourages fj,k/√Sj to be similar to fj′,k/√Sj′ when wj,j′ is large. [sent-153, score-0.318]
37 If j and j′ belong to the same class, we want fj,k to be similar to fj′,k. [sent-154, score-0.266]
38 It can be easily verified that (1/m) Σ_{j=1}^{m} φ(fj,·, yj) + λ f^T QK f = λ(αs + cut(LS, y)). [sent-169, score-0.304]
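A direct (unoptimized) sketch of this cut quantity, under the reading of the definition given above:

    import numpy as np

    def cut_LS(W, S, y):
        # Between-class edges contribute (w/2)(1/S_j + 1/S_j');
        # within-class edges contribute (w/2)(1/sqrt(S_j) - 1/sqrt(S_j'))^2.
        total = 0.0
        m = len(y)
        for j in range(m):
            for jp in range(m):
                w = W[j, jp]
                if w == 0.0:
                    continue
                if y[j] != y[jp]:
                    total += 0.5 * w * (1.0 / S[j] + 1.0 / S[jp])
                else:
                    total += 0.5 * w * (1.0 / np.sqrt(S[j]) - 1.0 / np.sqrt(S[jp])) ** 2
        return total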
39 Definition 3 A subgraph G0 = (V0 , E0 ) of G = (V, E) is a pure component if G0 is connected, E0 is induced by restricting E on V0 , and if labels y have identical values on V0 . [sent-181, score-0.249]
40 A pure subgraph G′ = ∪_{ℓ=1}^{q} Gℓ of G divides V into q disjoint sets V = ∪_{ℓ=1}^{q} Vℓ such that each subgraph Gℓ = (Vℓ, Eℓ) is a pure component. [sent-182, score-0.382]
41 If we remove all edges of G that connect nodes with different labels, then the resulting subgraph is a pure subgraph (but not the only one). [sent-184, score-0.366]
42 For each pure component Gℓ , its first eigenvalue λ1 (Gℓ ) is always zero. [sent-185, score-0.162]
43 Theorem 3 Let the assumptions of Theorem 2 hold, and let G′ = ∪_{ℓ=1}^{q} Gℓ (Gℓ = (Vℓ, Eℓ)) be a pure subgraph of G. [sent-187, score-0.191]
44 To put this into perspective, suppose that we use the unnormalized Laplacian regularizer on a zero-cut graph. [sent-192, score-0.192]
45 Then S = I and cut(LS, y) = 0, and by letting p = 1 and p → ∞ in Theorem 3, we have: EZn [ (1/(m−n)) Σ_{j∈Z̄n} err(f̂j,·, yj) ] ≤ 2 √( b q / (a c n) ) for p = 1, with a corresponding bound involving b and m for p → ∞. [sent-193, score-0.662]
46 3.1 Near zero-cut optimum scaling factors. The above observation motivates a scaling matrix S that compensates for the unbalanced pure component sizes. [sent-198, score-0.355]
47 Here we focus on the case that scaling factors are constant within each pure component (Sj = sℓ when j ∈ Vℓ ) in order to derive optimum scaling factors. [sent-200, score-0.312]
48 Hence, if cut(G′, y) is small, then we should choose sℓ ∝ mℓ for each pure component ℓ so that the generalization performance is approximately (ac)^{−1} b · q/n. [sent-203, score-0.212]
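A tiny sketch of this heuristic: within each pure component the scaling factor is constant and proportional to the component size (the per-node component labeling is a hypothetical input here):

    import numpy as np

    def component_size_scaling(components):
        # S_j = s_l with s_l proportional to m_l = |V_l|, the size of the pure component containing node j
        components = np.asarray(components)
        sizes = np.bincount(components)
        return sizes[components].astype(float)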
49 The commonly practiced degree-based normalization method Sj = degj(G) provides such good normalization factors under a simplified “box model” used in early studies. [sent-205, score-0.528]
50 In this model, each node connects to itself and all other nodes of the same pure component with edge weight wj,j ′ = 1. [sent-208, score-0.296]
51 The degree is thus degj (Gℓ ) = |Vℓ | = mℓ , which gives the optimal scaling in our analysis. [sent-209, score-0.167]
52 For such a model, the degree-based normalization can fail because the degj (Gℓ ) within each pure component Gℓ is not approximately constant (thus raising cut(LS , y)), and it may not be proportional to mℓ . [sent-212, score-0.41]
53 We call this method of normalization (Sj = 1/Kj,j, with K = (αS^{−1} + LS)^{−1}) K-scaling in this paper, as it scales the kernel matrix K so that each Kj,j = 1. [sent-218, score-0.219]
54 By contrast, we call the standard degree-based normalization (Sj = degj(G), K = (αI + LS)^{−1}) L-scaling, as it scales the diagonal entries of LS to 1. [sent-219, score-0.309]
55 In fact, no one has proposed this normalization method in the graph learning setting before this work. [sent-221, score-0.241]
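The two normalizations can be sketched as follows; the K-scaling function implements one plausible reading of the text (symmetrically rescale the kernel so every diagonal entry equals 1), which may differ in detail from the paper's exact construction:

    import numpy as np

    def l_scaling_kernel(W, alpha):
        # L-scaling: S_j = deg_j(G), kernel K = (alpha*I + L_S)^{-1}
        deg = W.sum(axis=1)
        L = np.diag(deg) - W
        s = 1.0 / np.sqrt(deg)
        L_S = s[:, None] * L * s[None, :]
        return np.linalg.inv(alpha * np.eye(W.shape[0]) + L_S)

    def k_scaling_kernel(W, alpha):
        # K-scaling (our reading): start from the unnormalized-Laplacian kernel
        # and rescale so that each diagonal entry K_{j,j} equals 1
        L = np.diag(W.sum(axis=1)) - W
        K0 = np.linalg.inv(alpha * np.eye(W.shape[0]) + L)
        d = 1.0 / np.sqrt(np.diag(K0))
        return d[:, None] * K0 * d[None, :]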
56 4 Dimension Reduction. Normalization and dimension reduction have been commonly used in spectral clustering, as in [3, 4]. [sent-223, score-0.266]
57 For semi-supervised learning, dimension reduction (without normalization) is known to improve performance [1, 6] while normalization (without dimension reduction) has also been explored [7]. [sent-224, score-0.488]
58 An appropriate combination of normalization and dimension reduction can further improve performance. [sent-225, score-0.367]
59 We shall first introduce dimension reduction with normalized Laplacian LS (G). [sent-226, score-0.262]
60 The benefit of dimension reduction in graph learning has been investigated in [6], under the spectral kernel design framework. [sent-232, score-0.397]
61 Note that the normalization issue, which will change the eigenvectors and their ordering, wasn’t investigated there. [sent-233, score-0.174]
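A sketch of the dimension-reduction step as we read it: keep the r eigenvectors of LS with the smallest eigenvalues and build the kernel within their span (the exact form of the paper's regularizations (4)/(5) is not shown in this extraction, so the way α enters below is an assumption):

    import numpy as np

    def reduced_kernel(L_S, r, alpha):
        # Bottom-r eigenvectors of the (scaled) Laplacian; regularize only within their span
        eigvals, eigvecs = np.linalg.eigh(L_S)   # eigenvalues in ascending order
        V, lam_r = eigvecs[:, :r], eigvals[:r]
        return V @ np.diag(1.0 / (alpha + lam_r)) @ V.T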
62 Theorem 4 Let G′ = ∪_{ℓ=1}^{q} Gℓ (Gℓ = (Vℓ, Eℓ)) be a pure subgraph of G. [sent-236, score-0.191]
63 For simplicity, we only consider the least squares loss φ(fj,·, yj) = Σ_{k=1}^{K} (fj,k − δk,yj)² in (1), using regularization (4) and K0 = I. [sent-244, score-0.412]
64 With p = 1, we have (1/m) Σ_{j=1}^{m} φ(fj,·, yj) ≤ δr(S)² + λm. [sent-245, score-0.304]
65 Then we have EZn [ (1/(m−n)) Σ_{j∈Z̄n} err(f̂j,·, yj) ] ≤ 16(δr(S)² + λm) + 16r/(λnm). [sent-249, score-0.304]
66 By optimizing over λ, we obtain EZn [ (1/(m−n)) Σ_{j∈Z̄n} err(f̂j,·, yj) ] ≤ 16 δr(S)² + 32 √(r/n). [sent-250, score-0.304]
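A short check of the optimization over λ, taking the middle bound exactly as reconstructed above (so the constants are only as reliable as that reconstruction):

    \min_{\lambda>0}\left[16\lambda m+\frac{16r}{\lambda n m}\right]
    \quad\text{is attained at}\quad \lambda^{*}=\frac{\sqrt{r/n}}{m},
    \quad\text{giving}\quad 16\lambda^{*}m+\frac{16r}{\lambda^{*}n m}=16\sqrt{r/n}+16\sqrt{r/n}=32\sqrt{r/n}.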
67 Compared to Theorem 3, the advantage of dimension reduction in (5) is that the quantity cut(LS, y) is replaced by ‖LS(G) − LS(G′)‖2, which is typically much smaller. [sent-253, score-0.215]
68 The regularization parameter λ is chosen by cross validation on the n training labeled examples. [sent-263, score-0.235]
69 We will show performance either when the rest of the parameters (α and dimensionality r) are also chosen by cross validation or when they are set to the optimum (oracle performance). [sent-264, score-0.234]
70 Controlled data experiments The purpose of the controlled data experiments is to observe the correlation of the effectiveness of the normalization methods with graph properties. [sent-269, score-0.241]
71 We show the results when dimension reduction is applied. (Figure 1: classification accuracy (%) of Unnormalized, L-scaling, and K-scaling on graph1–graph3 and graph6–graph10; panel (a): nearly-constant degrees.) [sent-271, score-0.215]
72 With dimension reduction (dim ≤ 20; chosen by cross validation). [sent-276, score-0.289]
73 Figure 1 (a) shows classification accuracy on three graphs that were generated so that the node degrees (of either correct edges or erroneous edges) are close to constant within each class but vary across classes. [sent-279, score-0.203]
74 On these graphs, both K-scaling and L-scaling significantly improve classification accuracy over the unnormalized baseline. [sent-280, score-0.196]
75 Each class consists of core nodes and satellite nodes. [sent-284, score-0.17]
76 Satellite nodes are relatively weakly connected to core nodes of the same class. [sent-286, score-0.231]
77 They are also connected to some other classes’ satellite nodes (i.e., by erroneous edges). [sent-287, score-0.168]
78 For graphs generated in this manner, degrees vary within the same class since the satellite nodes have smaller degrees than the core nodes. [sent-291, score-0.285]
79 In particular, K-scaling does well even when L-scaling rather underperforms the unnormalized baseline. [sent-294, score-0.204]
80 As our baseline, we test the supervised configuration by letting W + βI be the kernel matrix and using the same least squares loss function, where we use the oracle β which is optimal. [sent-306, score-0.167]
(Figure 2: accuracy (%) as a function of the number of labeled examples, comparing the unnormalized Laplacian, L-scaling, and K-scaling, with and without dimension reduction; (a-1) MNIST, dim and α determined by cross validation; (b-1) RCV1, dim and α determined by cross validation.)
86 K-scaling outperforms L-scaling, and L-scaling outperforms the unnormalized Laplacian. [sent-320, score-0.17]
87 The performance of the unnormalized Laplacian (with dimension reduction) is roughly consistent with the performance with similar (m, n) with heuristic dimension selection in [1]. [sent-323, score-0.412]
88 Without dimension reduction, L-scaling and K-scaling still improve performance over the unnormalized Laplacian. [sent-324, score-0.291]
89 In Figure 2 (a-1), the unnormalized Laplacian with dimension reduction underperforms the unnormalized Laplacian without dimension reduction, indicating that dimension reduction rather degrades performance. [sent-326, score-0.902]
90 By comparing Figure 2 (a-1) and (a-2), we observe that this seemingly counterintuitive performance trend is caused by the difficulty of choosing the right dimensionality by cross validation. [sent-327, score-0.157]
91 As observed, if the optimal dimensionality is known (as in (a-2)), dimension reduction improves performance either with or without normalization by K-scaling and L-scaling, and all transductive configurations outperform the supervised baseline. [sent-329, score-0.483]
92 We also note that the comparison of Figure 2 (a-1) and (a-2) shows that choosing good dimensionality by cross validation is much harder than choosing α by cross validation, especially when the number of labeled examples is small. [sent-330, score-0.293]
93 In the setting of Figure 2 (b-1) where the dimensionality and α were determined by cross validation, K-scaling with dimension reduction generally performs the best. [sent-334, score-0.328]
94 By setting the dimensionality and α to the optimum, the benefit of K-scaling with dimension reduction is even clearer (Figure 2 (b-2)). [sent-335, score-0.254]
95 Its performance differences from the second and third best ‘L-scaling (w/ dim redu. [sent-336, score-0.185]
96 In our experiments, K-scaling with dimension reduction consistently outperformed others. [sent-340, score-0.215]
97 On real data, cut is not near-zero, and the effect of normalization is unclear (Section 3. [sent-343, score-0.327]
98 In particular, we explained the importance of Laplacian normalization and dimension reduction for graph learning. [sent-346, score-0.456]
99 We argued that the standard L-scaling normalization method has the undesirable property that the normalization factors can vary significantly within a pure component. [sent-347, score-0.451]
100 An alternate normalization method, which we call K-scaling, is proposed to remedy the problem. [sent-348, score-0.227]
wordName wordTfidf (topN-words)
[('zn', 0.471), ('ls', 0.399), ('yj', 0.304), ('fj', 0.266), ('cut', 0.175), ('unnormalized', 0.17), ('dim', 0.162), ('sj', 0.155), ('fin', 0.153), ('normalization', 0.152), ('err', 0.145), ('ezn', 0.136), ('laplacian', 0.135), ('degj', 0.119), ('reduction', 0.117), ('pure', 0.116), ('rmk', 0.102), ('trp', 0.102), ('dimension', 0.098), ('graph', 0.089), ('qk', 0.083), ('graphs', 0.079), ('nodes', 0.078), ('subgraph', 0.075), ('remedy', 0.075), ('cross', 0.074), ('mnist', 0.063), ('satellite', 0.063), ('theorem', 0.059), ('regularization', 0.055), ('fi', 0.055), ('ac', 0.054), ('transductive', 0.054), ('labeled', 0.054), ('validation', 0.052), ('amples', 0.051), ('practiced', 0.051), ('yin', 0.051), ('generalization', 0.05), ('scaling', 0.048), ('oracle', 0.047), ('nition', 0.047), ('optimum', 0.046), ('cp', 0.044), ('ez', 0.043), ('kernel', 0.043), ('rk', 0.041), ('eigenspace', 0.04), ('dimensionality', 0.039), ('diagonals', 0.038), ('ex', 0.036), ('bkin', 0.034), ('underperforms', 0.034), ('node', 0.033), ('inf', 0.032), ('factors', 0.031), ('penalization', 0.03), ('core', 0.029), ('pr', 0.028), ('squares', 0.028), ('spectral', 0.028), ('connected', 0.027), ('accuracy', 0.026), ('edge', 0.025), ('shall', 0.025), ('label', 0.025), ('erroneous', 0.025), ('loss', 0.025), ('matrix', 0.024), ('articles', 0.024), ('component', 0.023), ('commonly', 0.023), ('rm', 0.023), ('sup', 0.023), ('performance', 0.023), ('eigenvalue', 0.023), ('regularizer', 0.022), ('normalized', 0.022), ('edges', 0.022), ('investigated', 0.022), ('bound', 0.021), ('trend', 0.021), ('connects', 0.021), ('ek', 0.021), ('classes', 0.02), ('suggests', 0.02), ('rbf', 0.02), ('classi', 0.02), ('theoretical', 0.02), ('weakly', 0.019), ('motivates', 0.019), ('bold', 0.019), ('behaves', 0.019), ('limitation', 0.019), ('adjacency', 0.019), ('induced', 0.018), ('degrees', 0.018), ('vj', 0.018), ('labels', 0.017), ('reduced', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999994 117 nips-2006-Learning on Graph with Laplacian Regularization
Author: Rie K. Ando, Tong Zhang
Abstract: We consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. The results suggest a limitation of the standard degree-based normalization. We propose a remedy from our analysis and demonstrate empirically that the remedy leads to improved classification performance.
2 0.17103271 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1
3 0.14642324 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
Author: Kenichi Kurihara, Max Welling, Nikos A. Vlassis
Abstract: Dirichlet Process (DP) mixture models are promising candidates for clustering applications where the number of clusters is unknown a priori. Due to computational considerations these models are unfortunately unsuitable for large scale data-mining applications. We propose a class of deterministic accelerated DP mixture models that can routinely handle millions of data-cases. The speedup is achieved by incorporating kd-trees into a variational Bayesian algorithm for DP mixtures in the stick-breaking representation, similar to that of Blei and Jordan (2005). Our algorithm differs in the use of kd-trees and in the way we handle truncation: we only assume that the variational distributions are fixed at their priors after a certain level. Experiments show that speedups relative to the standard variational algorithm can be significant. 1
4 0.12645674 159 nips-2006-Parameter Expanded Variational Bayesian Methods
Author: Tommi S. Jaakkola, Yuan Qi
Abstract: Bayesian inference has become increasingly important in statistical machine learning. Exact Bayesian calculations are often not feasible in practice, however. A number of approximate Bayesian methods have been proposed to make such calculations practical, among them the variational Bayesian (VB) approach. The VB approach, while useful, can nevertheless suffer from slow convergence to the approximate solution. To address this problem, we propose Parameter-eXpanded Variational Bayesian (PX-VB) methods to speed up VB. The new algorithm is inspired by parameter-expanded expectation maximization (PX-EM) and parameterexpanded data augmentation (PX-DA). Similar to PX-EM and -DA, PX-VB expands a model with auxiliary variables to reduce the coupling between variables in the original model. We analyze the convergence rates of VB and PX-VB and demonstrate the superior convergence rates of PX-VB in variational probit regression and automatic relevance determination. 1
5 0.1239104 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
Author: Hariharan Narayanan, Mikhail Belkin, Partha Niyogi
Abstract: One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning, is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary. keywords: Clustering, Semi-Supervised Learning 1
6 0.1190565 116 nips-2006-Learning from Multiple Sources
7 0.11097538 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
8 0.10137653 80 nips-2006-Fundamental Limitations of Spectral Clustering
9 0.095564157 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
10 0.084932573 65 nips-2006-Denoising and Dimension Reduction in Feature Space
11 0.084193729 163 nips-2006-Prediction on a Graph with a Perceptron
12 0.083292291 39 nips-2006-Balanced Graph Matching
13 0.077867106 7 nips-2006-A Local Learning Approach for Clustering
14 0.074998423 64 nips-2006-Data Integration for Classification Problems Employing Gaussian Process Priors
15 0.071328931 150 nips-2006-On Transductive Regression
16 0.071047366 128 nips-2006-Manifold Denoising
17 0.070937589 181 nips-2006-Stability of $K$-Means Clustering
18 0.069147885 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
19 0.06870465 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms
20 0.066834241 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
topicId topicWeight
[(0, -0.202), (1, 0.109), (2, -0.05), (3, 0.123), (4, 0.035), (5, 0.064), (6, 0.033), (7, 0.085), (8, -0.023), (9, -0.056), (10, 0.049), (11, -0.09), (12, -0.128), (13, -0.015), (14, -0.077), (15, 0.123), (16, 0.047), (17, -0.021), (18, 0.068), (19, -0.198), (20, 0.022), (21, 0.015), (22, -0.19), (23, -0.147), (24, -0.102), (25, -0.039), (26, 0.037), (27, -0.043), (28, 0.077), (29, -0.091), (30, -0.174), (31, 0.002), (32, -0.044), (33, -0.088), (34, 0.047), (35, -0.09), (36, -0.147), (37, 0.031), (38, -0.074), (39, 0.129), (40, 0.086), (41, -0.004), (42, 0.013), (43, 0.07), (44, -0.124), (45, 0.025), (46, -0.002), (47, 0.09), (48, -0.014), (49, -0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.95122463 117 nips-2006-Learning on Graph with Laplacian Regularization
Author: Rie K. Ando, Tong Zhang
Abstract: We consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. The results suggest a limitation of the standard degree-based normalization. We propose a remedy from our analysis and demonstrate empirically that the remedy leads to improved classification performance.
2 0.60326391 195 nips-2006-Training Conditional Random Fields for Maximum Labelwise Accuracy
Author: Samuel S. Gross, Olga Russakovsky, Chuong B. Do, Serafim Batzoglou
Abstract: We consider the problem of training a conditional random field (CRF) to maximize per-label predictive accuracy on a training set, an approach motivated by the principle of empirical risk minimization. We give a gradient-based procedure for minimizing an arbitrarily accurate approximation of the empirical risk under a Hamming loss function. In experiments with both simulated and real data, our optimization procedure gives significantly better testing performance than several current approaches for CRF training, especially in situations of high label noise. 1
3 0.53553057 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures
Author: Kenichi Kurihara, Max Welling, Nikos A. Vlassis
Abstract: Dirichlet Process (DP) mixture models are promising candidates for clustering applications where the number of clusters is unknown a priori. Due to computational considerations these models are unfortunately unsuitable for large scale data-mining applications. We propose a class of deterministic accelerated DP mixture models that can routinely handle millions of data-cases. The speedup is achieved by incorporating kd-trees into a variational Bayesian algorithm for DP mixtures in the stick-breaking representation, similar to that of Blei and Jordan (2005). Our algorithm differs in the use of kd-trees and in the way we handle truncation: we only assume that the variational distributions are fixed at their priors after a certain level. Experiments show that speedups relative to the standard variational algorithm can be significant. 1
4 0.48645365 151 nips-2006-On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts
Author: Hariharan Narayanan, Mikhail Belkin, Partha Niyogi
Abstract: One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning, is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary. keywords: Clustering, Semi-Supervised Learning 1
5 0.41769764 123 nips-2006-Learning with Hypergraphs: Clustering, Classification, and Embedding
Author: Dengyong Zhou, Jiayuan Huang, Bernhard Schölkopf
Abstract: We usually endow the investigated objects with pairwise relationships, which can be illustrated as graphs. In many real-world problems, however, relationships among the objects of our interest are more complex than pairwise. Naively squeezing the complex relationships into pairwise ones will inevitably lead to loss of information which can be expected valuable for our learning tasks however. Therefore we consider using hypergraphs instead to completely represent complex relationships among the objects of our interest, and thus the problem of learning with hypergraphs arises. Our main contribution in this paper is to generalize the powerful methodology of spectral clustering which originally operates on undirected graphs to hypergraphs, and further develop algorithms for hypergraph embedding and transductive classification on the basis of the spectral hypergraph clustering approach. Our experiments on a number of benchmarks showed the advantages of hypergraphs over usual graphs. 1
6 0.41517365 159 nips-2006-Parameter Expanded Variational Bayesian Methods
7 0.39945969 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
8 0.37858325 93 nips-2006-Hyperparameter Learning for Graph Based Semi-supervised Learning Algorithms
9 0.37350509 181 nips-2006-Stability of $K$-Means Clustering
10 0.37092388 70 nips-2006-Doubly Stochastic Normalization for Spectral Clustering
11 0.36088648 116 nips-2006-Learning from Multiple Sources
12 0.35861897 152 nips-2006-Online Classification for Complex Problems Using Simultaneous Projections
13 0.32088396 54 nips-2006-Comparative Gene Prediction using Conditional Random Fields
14 0.31727555 39 nips-2006-Balanced Graph Matching
15 0.31455937 60 nips-2006-Convergence of Laplacian Eigenmaps
16 0.29471508 80 nips-2006-Fundamental Limitations of Spectral Clustering
17 0.29298043 163 nips-2006-Prediction on a Graph with a Perceptron
18 0.2927697 194 nips-2006-Towards a general independent subspace analysis
19 0.26953027 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
20 0.26905894 150 nips-2006-On Transductive Regression
topicId topicWeight
[(1, 0.105), (3, 0.025), (7, 0.112), (9, 0.069), (20, 0.013), (22, 0.055), (44, 0.105), (57, 0.073), (65, 0.077), (69, 0.022), (84, 0.24)]
simIndex simValue paperId paperTitle
1 0.89988786 18 nips-2006-A selective attention multi--chip system with dynamic synapses and spiking neurons
Author: Chiara Bartolozzi, Giacomo Indiveri
Abstract: Selective attention is the strategy used by biological sensory systems to solve the problem of limited parallel processing capacity: salient subregions of the input stimuli are serially processed, while non–salient regions are suppressed. We present an mixed mode analog/digital Very Large Scale Integration implementation of a building block for a multi–chip neuromorphic hardware model of selective attention. We describe the chip’s architecture and its behavior, when its is part of a multi–chip system with a spiking retina as input, and show how it can be used to implement in real-time flexible models of bottom-up attention. 1
2 0.8487159 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
Author: Yee W. Teh, David Newman, Max Welling
Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.
same-paper 3 0.81499761 117 nips-2006-Learning on Graph with Laplacian Regularization
Author: Rie K. Ando, Tong Zhang
Abstract: We consider a general form of transductive learning on graphs with Laplacian regularization, and derive margin-based generalization bounds using appropriate geometric properties of the graph. We use this analysis to obtain a better understanding of the role of normalization of the graph Laplacian matrix as well as the effect of dimension reduction. The results suggest a limitation of the standard degree-based normalization. We propose a remedy from our analysis and demonstrate empirically that the remedy leads to improved classification performance.
4 0.66996109 80 nips-2006-Fundamental Limitations of Spectral Clustering
Author: Boaz Nadler, Meirav Galun
Abstract: Spectral clustering methods are common graph-based approaches to clustering of data. Spectral clustering algorithms typically start from local information encoded in a weighted graph on the data and cluster according to the global eigenvectors of the corresponding (normalized) similarity matrix. One contribution of this paper is to present fundamental limitations of this general local to global approach. We show that based only on local information, the normalized cut functional is not a suitable measure for the quality of clustering. Further, even with a suitable similarity measure, we show that the first few eigenvectors of such adjacency matrices cannot successfully cluster datasets that contain structures at different scales of size and density. Based on these findings, a second contribution of this paper is a novel diffusion based measure to evaluate the coherence of individual clusters. Our measure can be used in conjunction with any bottom-up graph-based clustering method, it is scale-free and can determine coherent clusters at all scales. We present both synthetic examples and real image segmentation problems where various spectral clustering algorithms fail. In contrast, using this coherence measure finds the expected clusters at all scales. Keywords: Clustering, kernels, learning theory. 1
5 0.66969538 65 nips-2006-Denoising and Dimension Reduction in Feature Space
Author: Mikio L. Braun, Klaus-Robert Müller, Joachim M. Buhmann
Abstract: We show that the relevant information about a classification problem in feature space is contained up to negligible error in a finite number of leading kernel PCA components if the kernel matches the underlying learning problem. Thus, kernels not only transform data sets such that good generalization can be achieved even by linear discriminant functions, but this transformation is also performed in a manner which makes economic use of feature space dimensions. In the best case, kernels provide efficient implicit representations of the data to perform classification. Practically, we propose an algorithm which enables us to recover the subspace and dimensionality relevant for good classification. Our algorithm can therefore be applied (1) to analyze the interplay of data set and kernel in a geometric fashion, (2) to help in model selection, and to (3) de-noise in feature space in order to yield better classification results. 1
6 0.66414094 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds
7 0.66024113 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment
8 0.65858066 87 nips-2006-Graph Laplacian Regularization for Large-Scale Semidefinite Programming
9 0.65852755 20 nips-2006-Active learning for misspecified generalized linear models
10 0.65743291 83 nips-2006-Generalized Maximum Margin Clustering and Unsupervised Kernel Learning
11 0.65721834 175 nips-2006-Simplifying Mixture Models through Function Approximation
12 0.65659881 109 nips-2006-Learnability and the doubling dimension
13 0.65452534 35 nips-2006-Approximate inference using planar graph decomposition
14 0.65428424 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
15 0.65225679 171 nips-2006-Sample Complexity of Policy Search with Known Dynamics
16 0.65223092 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization
17 0.65175951 21 nips-2006-AdaBoost is Consistent
18 0.65152061 163 nips-2006-Prediction on a Graph with a Perceptron
19 0.65055454 48 nips-2006-Branch and Bound for Semi-Supervised Support Vector Machines
20 0.64875698 37 nips-2006-Attribute-efficient learning of decision lists and linear threshold functions under unconcentrated distributions