nips nips2013 nips2013-182 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Masayuki Karasuyama, Hiroshi Mamitsuka
Abstract: Label propagation is one of the state-of-the-art methods for semi-supervised learning, which estimates labels by propagating label information through a graph. Label propagation assumes that data points (nodes) connected in a graph should have similar labels. Consequently, the label estimation heavily depends on edge weights in a graph which represent similarity of each node pair. We propose a method for a graph to capture the manifold structure of input features using edge weights parameterized by a similarity function. In this approach, edge weights represent both similarity and local reconstruction weight simultaneously, both being reasonable for label propagation. For further justification, we provide analytical considerations including an interpretation as a cross-validation of a propagation model in the feature space, and an error analysis based on a low dimensional manifold model. Experimental results demonstrated the effectiveness of our approach both in synthetic and real datasets. 1
Reference: text
sentIndex sentText sentNum sentScore
1 jp Abstract Label propagation is one of the state-of-the-art methods for semi-supervised learning, which estimates labels by propagating label information through a graph. [sent-4, score-0.188]
2 Label propagation assumes that data points (nodes) connected in a graph should have similar labels. [sent-5, score-0.208]
3 Consequently, the label estimation heavily depends on edge weights in a graph which represent similarity of each node pair. [sent-6, score-0.383]
4 We propose a method for a graph to capture the manifold structure of input features using edge weights parameterized by a similarity function. [sent-7, score-0.392]
5 In this approach, edge weights represent both similarity and local reconstruction weight simultaneously, both being reasonable for label propagation. [sent-8, score-0.339]
6 For further justification, we provide analytical considerations including an interpretation as a cross-validation of a propagation model in the feature space, and an error analysis based on a low dimensional manifold model. [sent-9, score-0.224]
7 , [1, 2]) is widely accepted as a state-of-the-art approach for semi-supervised learning, in which node labels are estimated through the input graph structure. [sent-14, score-0.183]
8 A common important property of these graph-based approaches is that the manifold structure of the input data can be captured by the graph. [sent-15, score-0.095]
9 On the other hand, it is well-known that the accuracy of the graph-based methods highly depends on the quality of the input graph (e. [sent-17, score-0.123]
10 A general framework of graph-based learning can be represented as the following three-step procedure: Step 1: Generating graph edges from given data, where nodes of the generated graph correspond to the instances of input data. [sent-22, score-0.312]
11 Step 3: Estimating node labels based on the generated graph, which is often represented as an adjacency matrix. [sent-24, score-0.128]
12 In this paper, we focus on the second step in the three-step procedure; estimating edge weights for the subsequent label estimation. [sent-25, score-0.178]
13 Optimizing edge weights is difficult in semi-supervised learning, because there are only a small number of labeled instances. [sent-26, score-0.145]
14 Also this problem is important because edge weights heavily affect final prediction accuracy of graph-based methods, while in reality rather simple heuristics strategies have been employed. [sent-27, score-0.181]
15 There are two standard approaches for estimating edge weights: similarity function based- and locally linear embedding (LLE) [6] based-approaches. [sent-28, score-0.175]
16 The similarity based approaches use similarity functions, such as Gaussian kernel, while most similarity functions have tuning parameters (such as the width parameter of Gaussian kernel) that are in general difficult to be tuned. [sent-30, score-0.24]
17 On the other hand, in LLE, the true underlying manifold can be approximated by a graph which minimizes a local reconstruction error. [sent-31, score-0.279]
18 Our approach is a similarity-based method, yet also captures the manifold structure of the input data; we refer to our approach as adaptive edge weighting (AEW). [sent-35, score-0.167]
19 In AEW, graph edges are determined by a data adaptive manner in terms of both similarity and manifold structure. [sent-36, score-0.287]
20 The objective function in AEW is based on local reconstruction, by which estimated weights capture the manifold structure, where each edge is parameterized as a similarity function of each node pair. [sent-37, score-0.34]
21 • Similarity based representation of edge weights is reasonable for label propagation because transitions of labels are determined by those weights, and edge weights obtained by LLE approaches may not represent node similarity. [sent-40, score-0.403]
22 Although the number of edges in a graph cannot be determined by AEW, we show that performance of AEW is robust against the number of edges compared to standard heuristics and a LLE based approach. [sent-42, score-0.209]
23 We provide further justifications for our approach based on the ideas of feature propagation and local linear approximation. [sent-43, score-0.131]
24 Our objective function can be seen as a cross validation error of a propagation model for feature vectors, which we call feature propagation. [sent-44, score-0.182]
25 This allows us to interpret that AEW optimizes graph weights through cross validation (for prediction) in the feature vector space instead of label space, assuming that input feature vectors and given labels share the same local structure. [sent-45, score-0.385]
26 Another interpretation is provided through local linear approximation, by which we can analyze the error of local reconstruction in the output (label) space under the assumption of low dimensional manifold model. [sent-46, score-0.227]
27 An undirected graph G is generated from X , where each node (or vertex) corresponds to each data point xi . [sent-52, score-0.156]
28 The graph G can be represented by the adjacency matrix W ∈ Rn×n where (i, j)-element Wij is a weight of the edge between xi and xj . [sent-53, score-0.272]
29 The key idea of graph-based algorithms is so-called manifold assumption, in which instances connected by large weights Wij on a graph have similar labels (meaning that labels smoothly change on the graph). [sent-54, score-0.375]
30 From this adjacency matrix, the graph Laplacian can be defined by L = D − W, 2 where D is a diagonal matrix with the diagonal entry Dii = j Wij . [sent-57, score-0.173]
31 Among several label propagation algorithms, we mainly use the formulation by [1], which is the standard formulation of graph-based semi-supervised learning. [sent-59, score-0.167]
32 The goal of label propagation is to predict the labels of unlabeled nodes {xℓ+1 , . [sent-67, score-0.212]
33 Label propagation can be defined as estimating F in such a way that score F smoothly changes on a given graph as well as it can predict given labeled points. [sent-72, score-0.244]
34 The following is standard formulation, which is called the harmonic Gaussian field (HGF) model, of label propagation [1]: min trace F ⊤ LF F subject to Fij = Yij , for i = 1, . [sent-73, score-0.159]
35 where Yij is the label matrix with Yij = 1 if xi is labeled as yi = j; otherwise, Yij = 0, In this formulation, the scores for labeled nodes are fixed as constants. [sent-77, score-0.2]
36 3 Basic Framework of Proposed Approach The performance of label propagation heavily depends on quality of an input graph. [sent-79, score-0.169]
37 Our proposed approach, adaptive edge weighting (AEW), optimizes edge weights for the graph-based learning algorithms. [sent-80, score-0.188]
38 In this paper we consider that the input graph is generated by k-NN graph (the first step is based on k-NN), while we note that AEW can be applied to any types of graphs. [sent-82, score-0.228]
39 First of all, graph edges should satisfy the following conditions: • Capturing the manifold structure of the input space. [sent-83, score-0.229]
40 These two conditions are closely related to manifold assumption of graph-based learning algorithms, in which labels vary smoothly along the input manifold. [sent-85, score-0.156]
41 Since the manifold structure of the input data is unknown beforehand, the graph is used to approximate the manifold (the first condition). [sent-86, score-0.277]
42 Subsequent predictions are performed in such a way that labels smoothly change according to the similarity structure provided by the graph (the second condition). [sent-87, score-0.242]
43 Our algorithm simultaneously pursues these two important aspects of the graph for the graph-based learning algorithms. [sent-88, score-0.105]
44 We define Wij as a similarity function of two nodes (1), using Gaussian kernel in this paper (Note: other similarity functions can also be used). [sent-89, score-0.194]
45 We estimate σd so that the graph represents manifold structure of the input data by the following optimization problem: n xi − min p {σd }d=1 i=1 1 Dii Wij xj 2 2, (2) j∼i where j ∼ i means that j is connected to i. [sent-90, score-0.259]
46 This minimizes the reconstruction error by local linear approximation, which captures the input manifold structure, in terms of the parameters of the similarity function. [sent-91, score-0.29]
47 Since the number of edges of a k-NN graph is O(nk), the derivative of adjacency matrix W can be calculated by O(nkp). [sent-101, score-0.202]
48 4 Analytical Considerations In Section 3, we defined our approach as the minimization of the local reconstruction error of input features. [sent-104, score-0.137]
49 When we employ leave-one-out as the cross-validation of the feature propagation model, we obtain n ˆ xi − x−i 2 , 2 (3) i=1 ˆ where x−i is a prediction for xi with T = {1, . [sent-124, score-0.176]
50 From this equivalence, AEW can be interpreted as the parameter optimization in graph weights of the HGF model for feature vectors through the leave-one-out cross-validation. [sent-132, score-0.196]
51 This also means that our framework estimates labels using the adjacency matrix W optimized in the feature space instead of the output (label) space. [sent-133, score-0.132]
52 Thus, if input features and labels share the same adjacency matrix (i. [sent-134, score-0.124]
53 , sharing the same local structure), the minimization of the objective function (2) should estimate the adjacency matrix which accurately propagates the labels of graph nodes. [sent-136, score-0.26]
54 2 Local Linear Approximation The feature propagation model provides the interpretation of our approach as the optimization of the adjacency matrix under the assumption that x and y can be reconstructed by the same adjacency matrix. [sent-138, score-0.236]
55 We here justify this assumption in a more formal way from a viewpoint of local reconstruction with a lower dimensional manifold model. [sent-139, score-0.174]
56 As shown in [1], HGF can be regarded as local reconstruction methods, which means the prediction can be represented as weighted local averages: Fik = j Wij Fjk Dii for i = ℓ + 1, . [sent-140, score-0.146]
57 We show the relationship between the local reconstruction error in the feature space described by our objective function (2) and the output space. [sent-144, score-0.163]
58 The same analysis can be approximately applied to other graph learning methods such as local global consistency [2] because it has similar local averaging form as the above, though we omitted here. [sent-146, score-0.167]
59 4 We assume the following manifold model for the input feature space, in which x is generated from corresponding some lower dimensional variable τ ∈ Rq : x = g(τ ) + εx , where g : Rq → Rp is a smooth function, and εx ∈ Rp represents noise. [sent-147, score-0.121]
60 For this model, the following theorem shows the relationship between the reconstruction error of the feature vector x and the output y: Theorem 1. [sent-149, score-0.114]
61 Suppose xi can be approximated by its neighbors as follows 1 Wij xj + ei , (4) xi = Dii j∼i where ei ∈ Rp represents an approximation error. [sent-150, score-0.116]
62 The first term includes the reconstruction error for xi which is represented by ei , and the second term is the distance between τ i and {τ j }j∼i . [sent-154, score-0.138]
63 In spite of its importance, this simple relationship has not been focused on in the context of graph estimation for semi-supervised learning, in which a LLE based objective function has been used without clear justification [5, 7–9]. [sent-157, score-0.137]
64 A simple approach to exploit this theorem would be a regularization formulation, which can be a minimization of a combination of the reconstruction error for x and a penalization term for distances between data points connected by edges. [sent-158, score-0.13]
65 We therefore optimize edge weights through parameters of the similarity function, especially the bandwidth parameter of Gaussian similarity function σ. [sent-161, score-0.27]
66 In this approach, a very large bandwidth (giving large weights to distant data points) may cause a large reconstruction error, while an extremely small bandwidth causes the problem of not giving enough weights to reconstruct. [sent-162, score-0.194]
67 For symmetric normalized graph Laplacian, we can not apply Theorem 1 to our algorithm. [sent-163, score-0.122]
68 5 Related Topics LLE [6] can also estimate graph edges based on a similar objective function, in which W is directly optimized as a real valued matrix. [sent-166, score-0.152]
69 This manner has been used in many methods for graph-based semi-supervised learning and clustering [5, 7–9], but LLE is very noise-sensitive [10], and resulting weights Wij cannot necessarily represent similarity between the corresponding nodes (i, j). [sent-167, score-0.161]
70 This parameterized form of edge weights can alleviate the over-fitting problem. [sent-171, score-0.116]
71 In addition, model selection based approaches are basically for the third step in the three-step procedure, by which AEW can be combined with such methods, like that the optimized graph by AEW can be used as the input graph of these methods. [sent-179, score-0.241]
72 Besides k-NN, there have been several methods generating a graph (edges) from the feature vectors (e. [sent-180, score-0.131]
73 In our experiments, we used the edges of the k-NN graph as the initial graph of AEW. [sent-184, score-0.239]
74 This is because the Gaussian similarity value becomes small if xi and xj are not close to each other to minimize the reconstruction error (2). [sent-186, score-0.209]
75 In other words, redundant weights can be reduced drastically, because in the Gaussian kernel, weights decay exponentially according to the squared distance. [sent-187, score-0.098]
76 A standard method for incorporating graph information into metric learning is to use some graphbased regularization, in which graph weights must be determined beforehand. [sent-189, score-0.283]
77 For example, in [18], a graph is generated by LLE, of which we already described the disadvantages. [sent-190, score-0.105]
78 Another approach is [19], which estimates a distance metric so that the k-NN graph in terms of the obtained metric should reproduce a given graph. [sent-191, score-0.153]
79 Overall metric learning is developed from a different context from our setting, by which it has not been justified that metric learning can be applied to label propagation. [sent-193, score-0.111]
80 For comparison, we used linear neighborhood propagation (LNP) [5], which generates a graph using a LLE based objective function. [sent-196, score-0.197]
81 For the parameter in the LLE process, we used the heuristics suggested by [11], and for the label propagation process, we chose the best parameter value in terms of the test accuracy. [sent-198, score-0.2]
82 For HGF, we used the median heuristics to choose the parameter σd in similarity function (1), meaning that a common σ (= σ1 = . [sent-205, score-0.147]
83 This is because σd is automatically adjusted in such a way that the local reconstruction error is minimized and then weights for connections between 6 1 1 0. [sent-217, score-0.168]
84 Although LNP also minimizes the local reconstruction error, LNP may connect data points far from each other if it reduces the reconstruction error. [sent-254, score-0.178]
85 In Figure 2, the k-NN graph connects a lot of nodes in different classes, while AEW favorably eliminates those undesirable edges. [sent-256, score-0.129]
86 In what follows, ‘HGF’ indicates HGF using unnormalized graph Laplacian L = D − W , and ‘N-HGF’ indicates HGF using symmetric normalized Laplacian L = I − D −1/2 W D −1/2 . [sent-263, score-0.122]
87 To adapt the difference of local scale, we here use local scaling kernel [27] as the similarity function. [sent-265, score-0.156]
88 In this figure, two dashed lines with different markers are by HGF and N-HGF, while two solid lines with the same markers are by HGF with AEW. [sent-267, score-0.097]
89 Although Theorem 1 exactly holds only for AEW-HGF, we can see that AEW-N-HGF, in which the degrees of the graph nodes are scaled by normalized Laplacian had highly stable performance. [sent-273, score-0.146]
90 05 1 2 4 6 8 10 # labeled instances in each class Test error rate 2 4 6 8 10 # labeled instances in each class (g) optdigit 2 4 6 8 10 # labeled instances in each class (h) UMIST Figure 3: Performance comparison on real-world datasets. [sent-335, score-0.273]
91 HGFs with AEW are by solid lines with markers, while HGFs with median heuristics is by dashed lines with the same markers, and LNP is by a solid line without any markers. [sent-336, score-0.127]
92 2 is based on the local reconstruction with the constraint 0. [sent-345, score-0.097]
93 15 that each edge represents the similarity of each pair of 0. [sent-346, score-0.13]
94 For example, N−HGF AEW−N−HGF LNP noise sensitivity of LLE can be alleviated by the parameterized form of the edge weights, and the similarity form for the edges weights is very reasonable for graph-based Figure 4: Comparison in test error rates methods. [sent-350, score-0.26]
95 Picard, “Hyperparameter and kernel learning for graph based semisupervised classification,” in Advances in NIPS 18 (Y. [sent-382, score-0.136]
96 Lee, “Hyperparameter learning for graph based semi-supervised learning algorithms,” in Advances in NIPS 19 (B. [sent-391, score-0.105]
97 Spielman, “Fitting a graph to vector data,” in Proc. [sent-414, score-0.105]
98 Yang, “Sparsity induced similarity measure for label propagation,” in IEEE 12th ICCV, pp. [sent-420, score-0.139]
99 Chang, “Large graph construction for scalable semi-supervised learning,” in Proc. [sent-426, score-0.105]
100 Lafferty, “Nonparametric transforms of graph kernels for semisupervised learning,” in Advances in NIPS 17 (L. [sent-482, score-0.118]
wordName wordTfidf (topN-words)
[('aew', 0.642), ('hgf', 0.631), ('lnp', 0.201), ('lle', 0.18), ('graph', 0.105), ('manifold', 0.077), ('similarity', 0.076), ('propagation', 0.074), ('adjacency', 0.068), ('reconstruction', 0.066), ('label', 0.063), ('coil', 0.06), ('yale', 0.059), ('wij', 0.057), ('edge', 0.054), ('vowel', 0.053), ('weights', 0.049), ('orl', 0.046), ('heuristics', 0.046), ('laplacian', 0.045), ('umist', 0.042), ('labeled', 0.042), ('dii', 0.04), ('labels', 0.038), ('lkopf', 0.032), ('optdigit', 0.032), ('instances', 0.031), ('local', 0.031), ('saul', 0.03), ('edges', 0.029), ('xi', 0.029), ('markers', 0.028), ('sch', 0.027), ('usps', 0.026), ('feature', 0.026), ('median', 0.025), ('embedding', 0.025), ('yij', 0.025), ('nodes', 0.024), ('metric', 0.024), ('smoothly', 0.023), ('harmonic', 0.022), ('node', 0.022), ('error', 0.022), ('ei', 0.021), ('hgfs', 0.021), ('xij', 0.021), ('locally', 0.02), ('synthetic', 0.02), ('rq', 0.018), ('objective', 0.018), ('kernel', 0.018), ('input', 0.018), ('weighting', 0.018), ('prediction', 0.018), ('normalized', 0.017), ('face', 0.017), ('xid', 0.017), ('test', 0.017), ('platt', 0.016), ('variants', 0.016), ('thrun', 0.016), ('validation', 0.016), ('xj', 0.016), ('interpreted', 0.016), ('solid', 0.015), ('points', 0.015), ('formulation', 0.015), ('bandwidth', 0.015), ('rp', 0.015), ('eight', 0.015), ('examined', 0.014), ('jebara', 0.014), ('spite', 0.014), ('fij', 0.014), ('connected', 0.014), ('classes', 0.014), ('justi', 0.014), ('heavily', 0.014), ('analytical', 0.013), ('omnipress', 0.013), ('semisupervised', 0.013), ('datasets', 0.013), ('parameterized', 0.013), ('mnist', 0.013), ('optimizes', 0.013), ('regularization', 0.013), ('mit', 0.013), ('steepest', 0.013), ('propagating', 0.013), ('unlabeled', 0.013), ('lines', 0.013), ('third', 0.013), ('bottou', 0.012), ('graphs', 0.012), ('width', 0.012), ('manifolds', 0.012), ('clustering', 0.012), ('considerations', 0.012), ('subsequent', 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
Author: Masayuki Karasuyama, Hiroshi Mamitsuka
Abstract: Label propagation is one of the state-of-the-art methods for semi-supervised learning, which estimates labels by propagating label information through a graph. Label propagation assumes that data points (nodes) connected in a graph should have similar labels. Consequently, the label estimation heavily depends on edge weights in a graph which represent similarity of each node pair. We propose a method for a graph to capture the manifold structure of input features using edge weights parameterized by a similarity function. In this approach, edge weights represent both similarity and local reconstruction weight simultaneously, both being reasonable for label propagation. For further justification, we provide analytical considerations including an interpretation as a cross-validation of a propagation model in the feature space, and an error analysis based on a low dimensional manifold model. Experimental results demonstrated the effectiveness of our approach both in synthetic and real datasets. 1
2 0.053689774 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning
Author: Xiao-Ming Wu, Zhenguo Li, Shih-Fu Chang
Abstract: We find that various well-known graph-based models exhibit a common important harmonic structure in its target function – the value of a vertex is approximately the weighted average of the values of its adjacent neighbors. Understanding of such structure and analysis of the loss defined over such structure help reveal important properties of the target function over a graph. In this paper, we show that the variation of the target function across a cut can be upper and lower bounded by the ratio of its harmonic loss and the cut cost. We use this to develop an analytical tool and analyze five popular graph-based models: absorbing random walks, partially absorbing random walks, hitting times, pseudo-inverse of the graph Laplacian, and eigenvectors of the Laplacian matrices. Our analysis sheds new insights into several open questions related to these models, and provides theoretical justifications and guidelines for their practical use. Simulations on synthetic and real datasets confirm the potential of the proposed theory and tool.
3 0.052454613 335 nips-2013-Transfer Learning in a Transductive Setting
Author: Marcus Rohrbach, Sandra Ebert, Bernt Schiele
Abstract: Category models for objects or activities typically rely on supervised learning requiring sufficiently large training sets. Transferring knowledge from known categories to novel classes with no or only a few labels is far less researched even though it is a common scenario. In this work, we extend transfer learning with semi-supervised learning to exploit unlabeled instances of (novel) categories with no or only a few labeled instances. Our proposed approach Propagated Semantic Transfer combines three techniques. First, we transfer information from known to novel categories by incorporating external knowledge, such as linguistic or expertspecified information, e.g., by a mid-level layer of semantic attributes. Second, we exploit the manifold structure of novel classes. More specifically we adapt a graph-based learning algorithm – so far only used for semi-supervised learning – to zero-shot and few-shot learning. Third, we improve the local neighborhood in such graph structures by replacing the raw feature-based representation with a mid-level object- or attribute-based representation. We evaluate our approach on three challenging datasets in two different applications, namely on Animals with Attributes and ImageNet for image classification and on MPII Composites for activity recognition. Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets. 1
4 0.048579596 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
Author: Marcelo Fiori, Pablo Sprechmann, Joshua Vogelstein, Pablo Muse, Guillermo Sapiro
Abstract: Graph matching is a challenging problem with very important applications in a wide range of fields, from image and video analysis to biological and biomedical problems. We propose a robust graph matching algorithm inspired in sparsityrelated techniques. We cast the problem, resembling group or collaborative sparsity formulations, as a non-smooth convex optimization problem that can be efficiently solved using augmented Lagrangian techniques. The method can deal with weighted or unweighted graphs, as well as multimodal data, where different graphs represent different types of data. The proposed approach is also naturally integrated with collaborative graph inference techniques, solving general network inference problems where the observed variables, possibly coming from different modalities, are not in correspondence. The algorithm is tested and compared with state-of-the-art graph matching techniques in both synthetic and real graphs. We also present results on multimodal graphs and applications to collaborative inference of brain connectivity from alignment-free functional magnetic resonance imaging (fMRI) data. The code is publicly available. 1
5 0.048395224 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stockl, January Grewe, January Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
6 0.04598951 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
7 0.045144424 53 nips-2013-Bayesian inference for low rank spatiotemporal neural receptive fields
8 0.044472069 149 nips-2013-Latent Structured Active Learning
9 0.041283645 289 nips-2013-Scalable kernels for graphs with continuous attributes
10 0.040073939 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections
11 0.039379895 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
12 0.037328798 350 nips-2013-Wavelets on Graphs via Deep Learning
13 0.037023641 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
14 0.036755718 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
15 0.036434583 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
16 0.036154479 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap
17 0.035295267 7 nips-2013-A Gang of Bandits
18 0.035097189 75 nips-2013-Convex Two-Layer Modeling
19 0.034752656 279 nips-2013-Robust Bloom Filters for Large MultiLabel Classification Tasks
20 0.034251478 63 nips-2013-Cluster Trees on Manifolds
topicId topicWeight
[(0, 0.091), (1, 0.038), (2, -0.009), (3, -0.006), (4, 0.038), (5, 0.027), (6, 0.002), (7, -0.01), (8, -0.028), (9, 0.024), (10, 0.013), (11, -0.053), (12, 0.068), (13, -0.022), (14, 0.053), (15, 0.001), (16, -0.036), (17, 0.032), (18, -0.016), (19, -0.001), (20, 0.009), (21, -0.006), (22, -0.001), (23, -0.009), (24, 0.003), (25, 0.009), (26, 0.015), (27, 0.011), (28, 0.009), (29, 0.016), (30, -0.026), (31, -0.01), (32, 0.011), (33, 0.036), (34, -0.089), (35, 0.016), (36, 0.022), (37, -0.026), (38, -0.089), (39, 0.037), (40, 0.091), (41, 0.018), (42, -0.029), (43, 0.054), (44, 0.026), (45, -0.019), (46, 0.019), (47, 0.066), (48, 0.026), (49, -0.008)]
simIndex simValue paperId paperTitle
same-paper 1 0.86580354 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
Author: Masayuki Karasuyama, Hiroshi Mamitsuka
Abstract: Label propagation is one of the state-of-the-art methods for semi-supervised learning, which estimates labels by propagating label information through a graph. Label propagation assumes that data points (nodes) connected in a graph should have similar labels. Consequently, the label estimation heavily depends on edge weights in a graph which represent similarity of each node pair. We propose a method for a graph to capture the manifold structure of input features using edge weights parameterized by a similarity function. In this approach, edge weights represent both similarity and local reconstruction weight simultaneously, both being reasonable for label propagation. For further justification, we provide analytical considerations including an interpretation as a cross-validation of a propagation model in the feature space, and an error analysis based on a low dimensional manifold model. Experimental results demonstrated the effectiveness of our approach both in synthetic and real datasets. 1
2 0.62413043 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap
Author: Ulrike Von Luxburg, Morteza Alamgir
Abstract: Consider an unweighted k-nearest neighbor graph on n points that have been sampled i.i.d. from some unknown density p on Rd . We prove how one can estimate the density p just from the unweighted adjacency matrix of the graph, without knowing the points themselves or any distance or similarity scores. The key insights are that local differences in link numbers can be used to estimate a local function of the gradient of p, and that integrating this function along shortest paths leads to an estimate of the underlying density. 1
3 0.61357838 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning
Author: Xiao-Ming Wu, Zhenguo Li, Shih-Fu Chang
Abstract: We find that various well-known graph-based models exhibit a common important harmonic structure in its target function – the value of a vertex is approximately the weighted average of the values of its adjacent neighbors. Understanding of such structure and analysis of the loss defined over such structure help reveal important properties of the target function over a graph. In this paper, we show that the variation of the target function across a cut can be upper and lower bounded by the ratio of its harmonic loss and the cut cost. We use this to develop an analytical tool and analyze five popular graph-based models: absorbing random walks, partially absorbing random walks, hitting times, pseudo-inverse of the graph Laplacian, and eigenvectors of the Laplacian matrices. Our analysis sheds new insights into several open questions related to these models, and provides theoretical justifications and guidelines for their practical use. Simulations on synthetic and real datasets confirm the potential of the proposed theory and tool.
4 0.58692747 289 nips-2013-Scalable kernels for graphs with continuous attributes
Author: Aasa Feragen, Niklas Kasenburg, Jens Petersen, Marleen de Bruijne, Karsten Borgwardt
Abstract: While graphs with continuous node attributes arise in many applications, stateof-the-art graph kernels for comparing continuous-attributed graphs suffer from a high runtime complexity. For instance, the popular shortest path kernel scales as O(n4 ), where n is the number of nodes. In this paper, we present a class of graph kernels with computational complexity O(n2 (m + log n + δ 2 + d)), where δ is the graph diameter, m is the number of edges, and d is the dimension of the node attributes. Due to the sparsity and small diameter of real-world graphs, these kernels typically scale comfortably to large graphs. In our experiments, the presented kernels outperform state-of-the-art kernels in terms of speed and accuracy on classification benchmark datasets. 1
5 0.55256701 357 nips-2013-k-Prototype Learning for 3D Rigid Structures
Author: Hu Ding, Ronald Berezney, Jinhui Xu
Abstract: In this paper, we study the following new variant of prototype learning, called k-prototype learning problem for 3D rigid structures: Given a set of 3D rigid structures, find a set of k rigid structures so that each of them is a prototype for a cluster of the given rigid structures and the total cost (or dissimilarity) is minimized. Prototype learning is a core problem in machine learning and has a wide range of applications in many areas. Existing results on this problem have mainly focused on the graph domain. In this paper, we present the first algorithm for learning multiple prototypes from 3D rigid structures. Our result is based on a number of new insights to rigid structures alignment, clustering, and prototype reconstruction, and is practically efficient with quality guarantee. We validate our approach using two type of data sets, random data and biological data of chromosome territories. Experiments suggest that our approach can effectively learn prototypes in both types of data. 1
6 0.53119141 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
7 0.52796048 335 nips-2013-Transfer Learning in a Transductive Setting
8 0.52701879 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
9 0.52334827 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
10 0.50003219 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections
11 0.47529119 7 nips-2013-A Gang of Bandits
12 0.46504828 132 nips-2013-Global MAP-Optimality by Shrinking the Combinatorial Search Area with Convex Relaxation
13 0.46314088 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel
14 0.46238214 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
15 0.44818518 151 nips-2013-Learning Chordal Markov Networks by Constraint Satisfaction
16 0.44576687 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
17 0.44266936 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles
18 0.4422971 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding
19 0.43492234 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation
20 0.43162572 350 nips-2013-Wavelets on Graphs via Deep Learning
topicId topicWeight
[(2, 0.3), (16, 0.023), (33, 0.124), (34, 0.112), (41, 0.042), (49, 0.017), (56, 0.095), (70, 0.017), (85, 0.063), (89, 0.023), (93, 0.039), (95, 0.019)]
simIndex simValue paperId paperTitle
1 0.8353287 191 nips-2013-Minimax Optimal Algorithms for Unconstrained Linear Optimization
Author: Brendan McMahan, Jacob Abernethy
Abstract: We design and analyze minimax-optimal algorithms for online linear optimization games where the player’s choice is unconstrained. The player strives to minimize regret, the difference between his loss and the loss of a post-hoc benchmark strategy. While the standard benchmark is the loss of the best strategy chosen from a bounded comparator set, we consider a very broad range of benchmark functions. The problem is cast as a sequential multi-stage zero-sum game, and we give a thorough analysis of the minimax behavior of the game, providing characterizations for the value of the game, as well as both the player’s and the adversary’s optimal strategy. We show how these objects can be computed efficiently under certain circumstances, and by selecting an appropriate benchmark, we construct a novel hedging strategy for an unconstrained betting game. 1
2 0.81725836 248 nips-2013-Point Based Value Iteration with Optimal Belief Compression for Dec-POMDPs
Author: Liam C. MacDermed, Charles Isbell
Abstract: We present four major results towards solving decentralized partially observable Markov decision problems (DecPOMDPs) culminating in an algorithm that outperforms all existing algorithms on all but one standard infinite-horizon benchmark problems. (1) We give an integer program that solves collaborative Bayesian games (CBGs). The program is notable because its linear relaxation is very often integral. (2) We show that a DecPOMDP with bounded belief can be converted to a POMDP (albeit with actions exponential in the number of beliefs). These actions correspond to strategies of a CBG. (3) We present a method to transform any DecPOMDP into a DecPOMDP with bounded beliefs (the number of beliefs is a free parameter) using optimal (not lossless) belief compression. (4) We show that the combination of these results opens the door for new classes of DecPOMDP algorithms based on previous POMDP algorithms. We choose one such algorithm, point-based valued iteration, and modify it to produce the first tractable value iteration method for DecPOMDPs that outperforms existing algorithms. 1
3 0.79675311 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge
Author: Evan W. Archer, Il M. Park, Jonathan W. Pillow
Abstract: Shannon’s entropy is a basic quantity in information theory, and a fundamental building block for the analysis of neural codes. Estimating the entropy of a discrete distribution from samples is an important and difficult problem that has received considerable attention in statistics and theoretical neuroscience. However, neural responses have characteristic statistical structure that generic entropy estimators fail to exploit. For example, existing Bayesian entropy estimators make the naive assumption that all spike words are equally likely a priori, which makes for an inefficient allocation of prior probability mass in cases where spikes are sparse. Here we develop Bayesian estimators for the entropy of binary spike trains using priors designed to flexibly exploit the statistical structure of simultaneouslyrecorded spike responses. We define two prior distributions over spike words using mixtures of Dirichlet distributions centered on simple parametric models. The parametric model captures high-level statistical features of the data, such as the average spike count in a spike word, which allows the posterior over entropy to concentrate more rapidly than with standard estimators (e.g., in cases where the probability of spiking differs strongly from 0.5). Conversely, the Dirichlet distributions assign prior mass to distributions far from the parametric model, ensuring consistent estimates for arbitrary distributions. We devise a compact representation of the data and prior that allow for computationally efficient implementations of Bayesian least squares and empirical Bayes entropy estimators with large numbers of neurons. We apply these estimators to simulated and real neural data and show that they substantially outperform traditional methods.
4 0.78594899 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions
Author: Alfredo Kalaitzis, Ricardo Silva
Abstract: Learning the joint dependence of discrete variables is a fundamental problem in machine learning, with many applications including prediction, clustering and dimensionality reduction. More recently, the framework of copula modeling has gained popularity due to its modular parameterization of joint distributions. Among other properties, copulas provide a recipe for combining flexible models for univariate marginal distributions with parametric families suitable for potentially high dimensional dependence structures. More radically, the extended rank likelihood approach of Hoff (2007) bypasses learning marginal models completely when such information is ancillary to the learning task at hand as in, e.g., standard dimensionality reduction problems or copula parameter estimation. The main idea is to represent data by their observable rank statistics, ignoring any other information from the marginals. Inference is typically done in a Bayesian framework with Gaussian copulas, and it is complicated by the fact this implies sampling within a space where the number of constraints increases quadratically with the number of data points. The result is slow mixing when using off-the-shelf Gibbs sampling. We present an efficient algorithm based on recent advances on constrained Hamiltonian Markov chain Monte Carlo that is simple to implement and does not require paying for a quadratic cost in sample size. 1 Contribution There are many ways of constructing multivariate discrete distributions: from full contingency tables in the small dimensional case [1], to structured models given by sparsity constraints [11] and (hierarchies of) latent variable models [6]. More recently, the idea of copula modeling [16] has been combined with such standard building blocks. Our contribution is a novel algorithm for efficient Markov chain Monte Carlo (MCMC) for the copula framework introduced by [7], extending algorithmic ideas introduced by [17]. A copula is a continuous cumulative distribution function (CDF) with uniformly distributed univariate marginals in the unit interval [0, 1]. It complements graphical models and other formalisms that provide a modular parameterization of joint distributions. The core idea is simple and given by the following observation: suppose we are given a (say) bivariate CDF F (y1 , y2 ) with marginals −1 −1 F1 (y1 ) and F2 (y2 ). This CDF can then be rewritten as F (F1 (F1 (y1 )), F2 (F2 (y2 ))). The func−1 −1 tion C(·, ·) given by F (F1 (·), F2 (·)) is a copula. For discrete distributions, this decomposition is not unique but still well-defined [16]. Copulas have found numerous applications in statistics and machine learning since they provide a way of constructing flexible multivariate distributions by mix-and-matching different copulas with different univariate marginals. For instance, one can combine flexible univariate marginals Fi (·) with useful but more constrained high-dimensional copulas. We will not further motivate the use of copula models, which has been discussed at length in recent 1 machine learning publications and conference workshops, and for which comprehensive textbooks exist [e.g., 9]. For a recent discussion on the applications of copulas from a machine learning perspective, [4] provides an overview. [10] is an early reference in machine learning. The core idea dates back at least to the 1950s [16]. In the discrete case, copulas can be difficult to apply: transforming a copula CDF into a probability mass function (PMF) is computationally intractable in general. For the continuous case, a common ˆ trick goes as follows: transform variables by defining ai ≡ Fi (yi ) for an estimate of Fi (·) and then fit a copula density c(·, . . . , ·) to the resulting ai [e.g. 9]. It is not hard to check this breaks down in the discrete case [7]. An alternative is to represent the CDF to PMF transformation for each data point by a continuous integral on a bounded space. Sampling methods can then be used. This trick has allowed many applications of the Gaussian copula to discrete domains. Readers familiar with probit models will recognize the similarities to models where an underlying latent Gaussian field is discretized into observable integers as in Gaussian process classifiers and ordinal regression [18]. Such models can be indirectly interpreted as special cases of the Gaussian copula. In what follows, we describe in Section 2 the Gaussian copula and the general framework for constructing Bayesian estimators of Gaussian copulas by [7], the extended rank likelihood framework. This framework entails computational issues which are discussed. A recent general approach for MCMC in constrained Gaussian fields by [17] can in principle be directly applied to this problem as a blackbox, but at a cost that scales quadratically in sample size and as such it is not practical in general. Our key contribution is given in Section 4. An application experiment on the Bayesian Gaussian copula factor model is performed in Section 5. Conclusions are discussed in the final section. 2 Gaussian copulas and the extended rank likelihood It is not hard to see that any multivariate Gaussian copula is fully defined by a correlation matrix C, since marginal distributions have no free parameters. In practice, the following equivalent generative model is used to define a sample U according to a Gaussian copula GC(C): 1. Sample Z from a zero mean Gaussian with covariance matrix C 2. For each Zj , set Uj = Φ(zj ), where Φ(·) is the CDF of the standard Gaussian It is clear that each Uj follows a uniform distribution in [0, 1]. To obtain a model for variables {y1 , y2 , . . . , yp } with marginal distributions Fj (·) and copula GC(C), one can add the deterministic (n) (1) (1) (2) step yj = Fj−1 (uj ). Now, given n samples of observed data Y ≡ {y1 , . . . , yp , y1 , . . . , yp }, one is interested on inferring C via a Bayesian approach and the posterior distribution p(C, θF | Y) ∝ pGC (Y | C, θF )π(C, θF ) where π(·) is a prior distribution, θF are marginal parameters for each Fj (·), which in general might need to be marginalized since they will be unknown, and pGC (·) is the PMF of a (here discrete) distribution with a Gaussian copula and marginals given by θF . Let Z be the underlying latent Gaussians of the corresponding copula for dataset Y. Although Y is a deterministic function of Z, this mapping is not invertible due to the discreteness of the distribution: each marginal Fj (·) has jumps. Instead, the reverse mapping only enforces the constraints where (i ) (i ) (i ) (i ) yj 1 < yj 2 implies zj 1 < zj 2 . Based on this observation, [7] considers the event Z ∈ D(y), where D(y) is the set of values of Z in Rn×p obeying those constraints, that is (k) (k) D(y) ≡ Z ∈ Rn×p : max zj s.t. yj (i) < yj (i) (k) (i) (k) < zj < min zj s.t. yj < yj . Since {Y = y} ⇒ Z(y) ∈ D(y), we have pGC (Y | C, θF ) = pGC (Z ∈ D(y), Y | C, θF ) = pN (Z ∈ D(y) | C) × pGC (Y| Z ∈ D(y), C, θF ), (1) the first factor of the last line being that of a zero-mean a Gaussian density function marginalized over D(y). 2 The extended rank likelihood is defined by the first factor of (1). With this likelihood, inference for C is given simply by marginalizing p(C, Z | Y) ∝ I(Z ∈ D(y)) pN (Z| C) π(C), (2) the first factor of the right-hand side being the usual binary indicator function. Strictly speaking, this is not a fully Bayesian method since partial information on the marginals is ignored. Nevertheless, it is possible to show that under some mild conditions there is information in the extended rank likelihood to consistently estimate C [13]. It has two important properties: first, in many applications where marginal distributions are nuisance parameters, this sidesteps any major assumptions about the shape of {Fi (·)} – applications include learning the degree of dependence among variables (e.g., to understand relationships between social indicators as in [7] and [13]) and copula-based dimensionality reduction (a generalization of correlation-based principal component analysis, e.g., [5]); second, MCMC inference in the extended rank likelihood is conceptually simpler than with the joint likelihood, since dropping marginal models will remove complicated entanglements between C and θF . Therefore, even if θF is necessary (when, for instance, predicting missing values of Y), an estimate of C can be computed separately and will not depend on the choice of estimator for {Fi (·)}. The standard model with a full correlation matrix C can be further refined to take into account structure implied by sparse inverse correlation matrices [2] or low rank decompositions via higher-order latent variable models [13], among others. We explore the latter case in section 5. An off-the-shelf algorithm for sampling from (2) is full Gibbs sampling: first, given Z, the (full or structured) correlation matrix C can be sampled by standard methods. More to the point, sampling (i) Z is straightforward if for each variable j and data point i we sample Zj conditioned on all other variables. The corresponding distribution is an univariate truncated Gaussian. This is the approach used originally by Hoff. However, mixing can be severely compromised by the sampling of Z, and that is where novel sampling methods can facilitate inference. 3 Exact HMC for truncated Gaussian distributions (i) Hoff’s algorithm modifies the positions of all Zj associated with a particular discrete value of Yj , conditioned on the remaining points. As the number of data points increases, the spread of the hard (i) boundaries on Zj , given by data points of Zj associated with other levels of Yj , increases. This (i) reduces the space in which variables Zj can move at a time. To improve the mixing, we aim to sample from the joint Gaussian distribution of all latent variables (i) Zj , i = 1 . . . n , conditioned on other columns of the data, such that the constraints between them are satisfied and thus the ordering in the observation level is conserved. Standard Gibbs approaches for sampling from truncated Gaussians reduce the problem to sampling from univariate truncated Gaussians. Even though each step is computationally simple, mixing can be slow when strong correlations are induced by very tight truncation bounds. In the following, we briefly describe the methodology recently introduced by [17] that deals with the problem of sampling from log p(x) ∝ − 1 x Mx + r x , where x, r ∈ Rn and M is positive 2 definite, with linear constraints of the form fj x ≤ gj , where fj ∈ Rn , j = 1 . . . m, is the normal vector to some linear boundary in the sample space. Later in this section we shall describe how this framework can be applied to the Gaussian copula extended rank likelihood model. More importantly, the observed rank statistics impose only linear constraints of the form xi1 ≤ xi2 . We shall describe how this special structure can be exploited to reduce the runtime complexity of the constrained sampler from O(n2 ) (in the number of observations) to O(n) in practice. 3.1 Hamiltonian Monte Carlo for the Gaussian distribution Hamiltonian Monte Carlo (HMC) [15] is a MCMC method that extends the sampling space with auxiliary variables so that (ideally) deterministic moves in the joint space brings the sampler to 3 potentially far places in the original variable space. Deterministic moves cannot in general be done, but this is possible in the Gaussian case. The form of the Hamiltonian for the general d-dimensional Gaussian case with mean µ and precision matrix M is: 1 1 H = x Mx − r x + s M−1 s , (3) 2 2 where M is also known in the present context as the mass matrix, r = Mµ and s is the velocity. Both x and s are Gaussian distributed so this Hamiltonian can be seen (up to a constant) as the negative log of the product of two independent Gaussian random variables. The physical interpretation is that of a sum of potential and kinetic energy terms, where the total energy of the system is conserved. In a system where this Hamiltonian function is constant, we can exactly compute its evolution through the pair of differential equations: ˙ x= sH = M−1 s , ˙ s=− xH = −Mx + r . (4) These are solved exactly by x(t) = µ + a sin(t) + b cos(t) , where a and b can be identified at initial conditions (t = 0) : ˙ a = x(0) = M−1 s , b = x(0) − µ . (5) Therefore, the exact HMC algorithm can be summarised as follows: • Initialise the allowed travel time T and some initial position x0 . • Repeat for HMC samples k = 1 . . . N 1. Sample sk ∼ N (0, M) 2. Use sk and xk to update a and b and store the new position at the end of the trajectory xk+1 = x(T ) as an HMC sample. It can be easily shown that the Markov chain of sampled positions has the desired equilibrium distribution N µ, M−1 [17]. 3.2 Sampling with linear constraints Sampling from multivariate Gaussians does not require any method as sophisticated as HMC, but the plot thickens when the target distribution is truncated by linear constraints of the form Fx ≤ g . Here, F ∈ Rm×n is a constraint matrix whose every row is the normal vector to a linear boundary in the sample space. This is equivalent to sampling from a Gaussian that is confined in the (not necessarily bounded) convex polyhedron {x : Fx ≤ g}. In general, to remain within the boundaries of each wall, once a new velocity has been sampled one must compute all possible collision times with the walls. The smallest of all collision times signifies the wall that the particle should bounce from at that collision time. Figure 1 illustrates the concept with two simple examples on 2 and 3 dimensions. The collision times can be computed analytically and their equations can be found in the supplementary material. We also point the reader to [17] for a more detailed discussion of this implementation. Once the wall to be hit has been found, then position and velocity at impact time are computed and the velocity is reflected about the boundary normal1 . The constrained HMC sampler is summarized follows: • Initialise the allowed travel time T and some initial position x0 . • Repeat for HMC samples k = 1 . . . N 1. Sample sk ∼ N (0, M) 2. Use sk and xk to update a and b . 1 Also equivalent to transforming the velocity with a Householder reflection matrix about the bounding hyperplane. 4 1 2 3 4 1 2 3 4 Figure 1: Left: Trajectories of the first 40 iterations of the exact HMC sampler on a 2D truncated Gaussian. A reflection of the velocity can clearly be seen when the particle meets wall #2 . Here, the constraint matrix F is a 4 × 2 matrix. Center: The same example after 40000 samples. The coloring of each sample indicates its density value. Right: The anatomy of a 3D Gaussian. The walls are now planes and in this case F is a 2 × 3 matrix. Figure best seen in color. 3. Reset remaining travel time Tleft ← T . Until no travel time is left or no walls can be reached (no solutions exist), do: (a) Compute impact times with all walls and pick the smallest one, th (if a solution exists). (b) Compute v(th ) and reflect it about the hyperplane fh . This is the updated velocity after impact. The updated position is x(th ) . (c) Tleft ← Tleft − th 4. Store the new position at the end of the trajectory xk+1 as an HMC sample. In general, all walls are candidates for impact, so the runtime of the sampler is linear in m , the number of constraints. This means that the computational load is concentrated in step 3(a). Another consideration is that of the allocated travel time T . Depending on the shape of the bounding polyhedron and the number of walls, a very large travel time can induce many more bounces thus requiring more computations per sample. On the other hand, a very small travel time explores the distribution more locally so the mixing of the chain can suffer. What constitutes a given travel time “large” or “small” is relative to the dimensionality, the number of constraints and the structure of the constraints. Due to the nature of our problem, the number of constraints, when explicitly expressed as linear functions, is O(n2 ) . Clearly, this restricts any direct application of the HMC framework for Gaussian copula estimation to small-sample (n) datasets. More importantly, we show how to exploit the structure of the constraints to reduce the number of candidate walls (prior to each bounce) to O(n) . 4 HMC for the Gaussian Copula extended rank likelihood model Given some discrete data Y ∈ Rn×p , the task is to infer the correlation matrix of the underlying Gaussian copula. Hoff’s sampling algorithm proceeds by alternating between sampling the continu(i) (i) ous latent representation Zj of each Yj , for i = 1 . . . n, j = 1 . . . p , and sampling a covariance matrix from an inverse-Wishart distribution conditioned on the sampled matrix Z ∈ Rn×p , which is then renormalized as a correlation matrix. From here on, we use matrix notation for the samples, as opposed to the random variables – with (i) Zi,j replacing Zj , Z:,j being a column of Z, and Z:,\j being the submatrix of Z without the j-th column. In a similar vein to Hoff’s sampling algorithm, we replace the successive sampling of each Zi,j conditioned on Zi,\j (a conditional univariate truncated Gaussian) with the simultaneous sampling of Z:,j conditioned on Z:,\j . This is done through an HMC step from a conditional multivariate truncated Gaussian. The added benefit of this HMC step over the standard Gibbs approach, is that of a handle for regulating the trade-off between exploration and runtime via the allocated travel time T . Larger travel times potentially allow for larger moves in the sample space, but it comes at a cost as explained in the sequel. 5 4.1 The Hough envelope algorithm The special structure of constraints. Recall that the number of constraints is quadratic in the dimension of the distribution. This is because every Z sample must satisfy the conditions of the event Z ∈ D(y) of the extended rank likelihood (see Section 2). In other words, for any column Z:,j , all entries are organised into a partition L(j) of |L(j) | levels, the number of unique values observed for the discrete or ordinal variable Y (j) . Thereby, for any two adjacent levels lk , lk+1 ∈ L(j) and any pair i1 ∈ lk , i2 ∈ lk+1 , it must be true that Zli ,j < Zli+1 ,j . Equivalently, a constraint f exists where fi1 = 1, fi2 = −1 and g = 0 . It is easy to see that O(n2 ) of such constraints are induced by the order statistics of the j-th variable. To deal with this boundary explosion, we developed the Hough Envelope algorithm to search efficiently, within all pairs in {Z:,j }, in practically linear time. Recall in HMC (section 3.2) that the trajectory of the particle, x(t), is decomposed as xi (t) = ai sin(t) + bi cos(t) + µi , (6) and there are n such functions, grouped into a partition of levels as described above. The Hough envelope2 is found for every pair of adjacent levels. We illustrate this with an example of 10 dimensions and two levels in Figure 2, without loss of generalization to any number of levels or dimensions. Assume we represent trajectories for points in level lk with blue curves, and points in lk+1 with red curves. Assuming we start with a valid state, at time t = 0 all red curves are above all blue curves. The goal is to find the smallest t where a blue curve meets a red curve. This will be our collision time where a bounce will be necessary. 5 3 1 2 Figure 2: The trajectories xj (t) of each component are sinusoid functions. The right-most green dot signifies the wall and the time th of the earliest bounce, where the first inter-level pair (that is, any two components respectively from the blue and red level) becomes equal, in this case the constraint activated being xblue2 = xred2 . 4 4 5 1 2 3 0.2 0.4 0.6 t 0.8 1 1.2 1.4 1. First we find the largest component bluemax of the blue level at t = 0. This takes O(n) time. Clearly, this will be the largest component until its sinusoid intersects that of any other component. 2. To find the next largest component, compute the roots of xbluemax (t) − xi (t) = 0 for all components and pick the smallest (earliest) one (represented by a green dot). This also takes O(n) time. 3. Repeat this procedure until a red sinusoid crosses the highest running blue sinusoid. When this happens, the time of earliest bounce and its constraint are found. In the worst-case scenario, n such repetitions have to be made, but in practice we can safely assume an fixed upper bound h on the number of blue crossings before a inter-level crossing occurs. In experiments, we found h << n, no more than 10 in simulations with hundreds of thousands of curves. Thus, this search strategy takes O(n) time in practice to complete, mirroring the analysis of other output-sensitive algorithms such as the gift wrapping algorithm for computing convex hulls [8]. Our HMC sampling approach is summarized in Algorithm 1. 2 The name is inspired from the fact that each xi (t) is the sinusoid representation, in angle-distance space, of all lines that pass from the (ai , bi ) point in a − b space. A representation known in image processing as the Hough transform [3]. 6 Algorithm 1 HMC for GCERL # Notation: T MN (µ, C, F) is a truncated multivariate normal with location vector µ, scale matrix C and constraints encoded by F and g = 0 . # IW(df, V0 ) is an inverse-Wishart prior with degrees of freedom df and scale matrix V0 . Input: Y ∈ Rn×p , allocated travel time T , a starting Z and variable covariance V ∈ Rp×p , df = p + 2, V0 = df Ip and chain size N . Generate constraints F(j) from Y:,j , for j = 1 . . . p . for samples k = 1 . . . N do # Resample Z as follows: for variables j = 1 . . . p do −1 −1 2 Compute parameters: σj = Vjj − Vj,\j V\j,\j V\j,j , µj = Z:,\j V\j,\j V\j,j . 2 Get one sample Z:,j ∼ T MN µj , σj I, F(j) efficiently by using the Hough Envelope algorithm, see section 4.1. end for Resample V ∼ IW(df + n, V0 + Z Z) . Compute correlation matrix C, s.t. Ci,j = Vi,j / Vi,i Vj,j and store sample, C(k) ← C . end for 5 An application on the Bayesian Gausian copula factor model In this section we describe an experiment that highlights the benefits of our HMC treatment, compared to a state-of-the-art parameter expansion (PX) sampling scheme. During this experiment we ask the important question: “How do the two schemes compare when we exploit the full-advantage of the HMC machinery to jointly sample parameters and the augmented data Z, in a model of latent variables and structured correlations?” We argue that under such circumstances the superior convergence speed and mixing of HMC undeniably compensate for its computational overhead. Experimental setup In this section we provide results from an application on the Gaussian copula latent factor model of [13] (Hoff’s model [7] for low-rank structured correlation matrices). We modify the parameter expansion (PX) algorithm used by [13] by replacing two of its Gibbs steps with a single HMC step. We show a much faster convergence to the true mode with considerable support on its vicinity. We show that unlike the HMC, the PX algorithm falls short of properly exploring the posterior in any reasonable finite amount of time, even for small models, even for small samples. Worse, PX fails in ways one cannot easily detect. Namely, we sample each row of the factor loadings matrix Λ jointly with the corresponding column of the augmented data matrix Z, conditioning on the higher-order latent factors. This step is analogous to Pakman and Paninski’s [17, sec.3.1] use of HMC in the context of a binary probit model (the extension to many levels in the discrete marginal is straightforward with direct application of the constraint matrix F and the Hough envelope algorithm). The sampling of the higher level latent factors remains identical to [13]. Our scheme involves no parameter expansion. We do however interweave the Gibbs step for the Z matrix similarly to Hoff. This has the added benefit of exploring the Z sample space within their current boundaries, complementing the joint (λ, z) sampling which moves the boundaries jointly. The value of such ”interweaving” schemes has been addressed in [19]. Results We perform simulations of 10000 iterations, n = 1000 observations (rows of Y), travel time π/2 for HMC with the setups listed in the following table, along with the elapsed times of each sampling scheme. These experiments were run on Intel COREi7 desktops with 4 cores and 8GB of RAM. Both methods were parallelized across the observed variables (p). Figure p (vars) k (latent factors) M (ordinal levels) elapsed (mins): HMC PX 3(a) : 20 5 2 115 8 3(b) : 10 3 2 80 6 10 3 5 203 16 3(c) : Many functionals of the loadings matrix Λ can be assessed. We focus on reconstructing the true (low-rank) correlation matrix of the Gaussian copula. In particular, we summarize the algorithm’s 7 outcome with the root mean squared error (RMSE) of the differences between entries of the ground-truth correlation matrix and the implied correlation matrix at each iteration of a MCMC scheme (so the following plots looks like a time-series of 10000 timepoints), see Figures 3(a), 3(b) and 3(c) . (a) (b) (c) Figure 3: Reconstruction (RMSE per iteration) of the low-rank structured correlation matrix of the Gaussian copula and its histogram (along the left side). (a) Simulation setup: 20 variables, 5 factors, 5 levels. HMC (blue) reaches a better mode faster (in iterations/CPU-time) than PX (red). Even more importantly the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. (b) Simulation setup: 10 vars, 3 factors, 2 levels. We observe behaviors similar to Figure 3(a). Note that the histogram counts RMSEs after the burn-in period of PX (iteration #500). (c) Simulation setup: 10 vars, 3 factors, 5 levels. We observe behaviors similar to Figures 3(a) and 3(b) but with a thinner tail for HMC. Note that the histogram counts RMSEs after the burn-in period of PX (iteration #2000). Main message HMC reaches a better mode faster (iterations/CPUtime). Even more importantly the RMSE posterior samples of PX are concentrated in a much smaller region compared to HMC, even after 10000 iterations. This illustrates that PX poorly explores the true distribution. As an analogous situation we refer to the top and bottom panels of Figure 14 of Radford Neal’s slice sampler paper [14]. If there was no comparison against HMC, there would be no evidence from the PX plot alone that the algorithm is performing poorly. This mirrors Radford Neal’s statement opening Section 8 of his paper: “a wrong answer is obtained without any obvious indication that something is amiss”. The concentration on the posterior mode of PX in these simulations is misleading of the truth. PX might seen a bit simpler to implement, but it seems one cannot avoid using complex algorithms for complex models. We urge practitioners to revisit their past work with this model to find out by how much credible intervals of functionals of interest have been overconfident. Whether trivially or severely, our algorithm offers the first principled approach for checking this out. 6 Conclusion Sampling large random vectors simultaneously in order to improve mixing is in general a very hard problem, and this is why clever methods such as HMC or elliptical slice sampling [12] are necessary. We expect that the method here developed is useful not only for those with data analysis problems within the large family of Gaussian copula extended rank likelihood models, but the method itself and its behaviour might provide some new insights on MCMC sampling in constrained spaces in general. Another direction of future work consists of exploring methods for elliptical copulas, and related possible extensions of general HMC for non-Gaussian copula models. Acknowledgements The quality of this work has benefited largely from comments by our anonymous reviewers and useful discussions with Simon Byrne and Vassilios Stathopoulos. Research was supported by EPSRC grant EP/J013293/1. 8 References [1] Y. Bishop, S. Fienberg, and P. Holland. Discrete Multivariate Analysis: Theory and Practice. MIT Press, 1975. [2] A. Dobra and A. Lenkoski. Copula Gaussian graphical models and their application to modeling functional disability data. Annals of Applied Statistics, 5:969–993, 2011. [3] R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15, 1972. [4] G. Elidan. Copulas and machine learning. Proceedings of the Copulae in Mathematical and Quantitative Finance workshop, to appear, 2013. [5] F. Han and H. Liu. Semiparametric principal component analysis. Advances in Neural Information Processing Systems, 25:171–179, 2012. [6] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. [7] P. Hoff. Extending the rank likelihood for semiparametric copula estimation. Annals of Applied Statistics, 1:265–283, 2007. [8] R. Jarvis. On the identification of the convex hull of a finite set of points in the plane. Information Processing Letters, 2(1):18–21, 1973. [9] H. Joe. Multivariate Models and Dependence Concepts. Chapman-Hall, 1997. [10] S. Kirshner. Learning with tree-averaged densities and distributions. Neural Information Processing Systems, 2007. [11] S. Lauritzen. Graphical Models. Oxford University Press, 1996. [12] I. Murray, R. Adams, and D. MacKay. Elliptical slice sampling. JMLR Workshop and Conference Proceedings: AISTATS 2010, 9:541–548, 2010. [13] J. Murray, D. Dunson, L. Carin, and J. Lucas. Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, to appear, 2013. [14] R. Neal. Slice sampling. The Annals of Statistics, 31:705–767, 2003. [15] R. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, pages 113–162, 2010. [16] R. Nelsen. An Introduction to Copulas. Springer-Verlag, 2007. [17] A. Pakman and L. Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. arXiv:1208.4118, 2012. [18] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [19] Y. Yu and X. L. Meng. To center or not to center: That is not the question — An ancillaritysufficiency interweaving strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531–570, 2011. 9
same-paper 5 0.73913217 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
Author: Masayuki Karasuyama, Hiroshi Mamitsuka
Abstract: Label propagation is one of the state-of-the-art methods for semi-supervised learning, which estimates labels by propagating label information through a graph. Label propagation assumes that data points (nodes) connected in a graph should have similar labels. Consequently, the label estimation heavily depends on edge weights in a graph which represent similarity of each node pair. We propose a method for a graph to capture the manifold structure of input features using edge weights parameterized by a similarity function. In this approach, edge weights represent both similarity and local reconstruction weight simultaneously, both being reasonable for label propagation. For further justification, we provide analytical considerations including an interpretation as a cross-validation of a propagation model in the feature space, and an error analysis based on a low dimensional manifold model. Experimental results demonstrated the effectiveness of our approach both in synthetic and real datasets. 1
6 0.73766994 171 nips-2013-Learning with Noisy Labels
7 0.68142921 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
8 0.67031389 66 nips-2013-Computing the Stationary Distribution Locally
9 0.66244668 325 nips-2013-The Pareto Regret Frontier
10 0.63543004 240 nips-2013-Optimization, Learning, and Games with Predictable Sequences
11 0.63307482 231 nips-2013-Online Learning with Switching Costs and Other Adaptive Adversaries
12 0.63252443 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence
13 0.62821293 327 nips-2013-The Randomized Dependence Coefficient
14 0.61502171 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors
15 0.61347979 173 nips-2013-Least Informative Dimensions
16 0.61098087 126 nips-2013-Gaussian Process Conditional Copulas with Applications to Financial Time Series
17 0.60871738 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
18 0.60371298 25 nips-2013-Adaptive Anonymity via $b$-Matching
19 0.59964466 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations
20 0.59317768 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints