nips nips2013 nips2013-350 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Raif Rustamov, Leonidas Guibas
Abstract: An increasing number of applications require processing of signals defined on weighted graphs. While wavelets provide a flexible tool for signal processing in the classical setting of regular domains, the existing graph wavelet constructions are less flexible – they are guided solely by the structure of the underlying graph and do not take directly into consideration the particular class of signals to be processed. This paper introduces a machine learning framework for constructing graph wavelets that can sparsely represent a given class of signals. Our construction uses the lifting scheme, and is based on the observation that the recurrent nature of the lifting scheme gives rise to a structure resembling a deep auto-encoder network. Particular properties that the resulting wavelets must satisfy determine the training objective and the structure of the involved neural networks. The training is unsupervised, and is conducted similarly to the greedy pre-training of a stack of auto-encoders. After training is completed, we obtain a linear wavelet transform that can be applied to any graph signal in time and memory linear in the size of the graph. Improved sparsity of our wavelet transform for the test signals is confirmed via experiments both on synthetic and real data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 This paper introduces a machine learning framework for constructing graph wavelets that can sparsely represent a given class of signals. [sent-5, score-0.892]
2 Our construction uses the lifting scheme, and is based on the observation that the recurrent nature of the lifting scheme gives rise to a structure resembling a deep auto-encoder network. [sent-6, score-0.574]
3 Particular properties that the resulting wavelets must satisfy determine the training objective and the structure of the involved neural networks. [sent-7, score-0.776]
4 After training is completed, we obtain a linear wavelet transform that can be applied to any graph signal in time and memory linear in the size of the graph. [sent-9, score-0.537]
5 Improved sparsity of our wavelet transform for the test signals is confirmed via experiments both on synthetic and real data. [sent-10, score-0.431]
6 Given this generality, it is desirable to extend the flexibility of classical tools such as wavelets to the processing of signals defined on weighted graphs. [sent-14, score-0.843]
7 However, all of these constructions are guided solely by the structure of the underlying graph, and do not take directly into consideration the particular class of signals to be processed. [sent-16, score-0.236]
8 [19, 17]), such an approach does not fully exploit the degrees of freedom inherent in wavelet design. [sent-19, score-0.255]
9 In contrast, a variety of signal class specific and adaptive wavelet constructions exist on images and multidimensional regular domains, see [9] and references therein. [sent-20, score-0.395]
10 In addition, theoretical guidance for such adaptive constructions is lacking as it remains largely unknown how the properties of the graph wavelet transforms, such as sparsity, relate to the structural properties of graph signals and their underlying graphs [22]. [sent-22, score-0.745]
11 The goal of our work is to provide a machine learning framework for constructing wavelets on weighted graphs that can sparsely represent a given class of signals. [sent-23, score-0.834]
12 Our construction uses the lifting scheme as applied to the Haar wavelets, and is based on the observation that the update and predict steps of the lifting scheme are similar to the encode and decode steps of an auto-encoder. [sent-24, score-0.676]
13 From this point of view, the recurrent nature of the lifting scheme gives rise to a structure resembling a deep auto-encoder network. [sent-25, score-0.31]
14 Particular properties that the resulting wavelets must satisfy, such as sparse representation of signals, local support, and vanishing moments, determine the training objective and the structure of the involved neural networks. [sent-26, score-0.838]
15 First, when no training functions are specified by the application, we can impose a smoothness prior and obtain a novel general-purpose wavelet construction on graphs. [sent-31, score-0.422]
16 Second, our wavelets are adaptive to a class of signals and after training we obtain a linear transform; this is in contrast to adapting to the input signal (e. [sent-32, score-0.926]
17 Third, our construction provides efficient and exact analysis and synthesis operators and results in a critically sampled basis that respects the multiscale structure imposed on the underlying graph. [sent-35, score-0.208]
18 2 Lifting scheme The goal of wavelet design is to obtain a multiresolution [16] of L2 (G) – the set of all functions/signals on graph G. [sent-39, score-0.457]
19 Scaling functions {φ_{ℓ,k}} provide a basis for approximation space V_ℓ, and similarly wavelet functions {ψ_{ℓ,k}} for W_ℓ. [sent-46, score-0.353]
20 As a result, for any signal f ∈ L2(G) on the graph and any level ℓ_0 with 0 ≤ ℓ_0 < ℓ_max, we have the wavelet decomposition f = Σ_k a_{ℓ_0,k} φ_{ℓ_0,k} + Σ_{ℓ=ℓ_0}^{ℓ_max−1} Σ_k d_{ℓ,k} ψ_{ℓ,k}. [sent-47, score-0.495]
21 For simplicity, we use a_ℓ and d_ℓ to denote the vectors of all approximation and detail coefficients at level ℓ. [sent-49, score-0.223]
22 Our construction of wavelets is based on the lifting scheme [23]. [sent-50, score-1.031]
23 Starting with a given wavelet transform, which in our case is the Haar transform (HT), one can obtain lifted wavelets by applying the process illustrated in Figure 1(left) starting with ℓ = ℓ_max − 1, a_{ℓ_max} = f and iterating down until ℓ = 1. [sent-51, score-1.086]
24 Here, a_ℓ and d_ℓ denote the vectors of all approximation and detail coefficients of the lifted transform at level ℓ. [sent-53, score-0.328]
25 The lifted coefficients a_ℓ and d_ℓ are computed from the Haar coefficients ā_ℓ and d̄_ℓ (of the lifted approximation coefficients a_{ℓ+1}) as follows: a_ℓ ← ā_ℓ + U d̄_ℓ and d_ℓ ← d̄_ℓ − P a_ℓ, where update (U) and predict (P) are linear operators (matrices). [sent-56, score-0.208]
26 Note that in adaptive wavelet designs the update and predict operators will vary from level to level, but for simplicity of notation we do not indicate this explicitly. [sent-57, score-0.491]
27 This process is always invertible – the backward transform, depicted in Figure 1(right) with IHT denoting the inverse Haar transform, allows perfect reconstruction of the original signal. [sent-58, score-0.224]
28 While the wavelets and scaling functions are not explicitly computed during either forward or backward transform, it is possible to recover them using the expansion of Eq. [sent-59, score-0.929]
29 For example, to obtain a specific scaling function φ_{ℓ,k}, one simply sets all of the approximation and detail coefficients to zero, except for a_{ℓ,k} = 1, and runs the backward transform. [sent-61, score-0.299]
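As a concrete illustration of the update/predict mechanics and the invertibility described above, here is a minimal numpy sketch of one update-first lifting step and its exact inverse. The function names and the stand-in split/merge operators are assumptions made for illustration only; a real instance would use the graph Haar transform of §4, not these placeholders.

```python
import numpy as np

def lift_forward(a_next, haar_split, U, P):
    """One update-first lifting step applied to the approximations a_{l+1}."""
    a_bar, d_bar = haar_split(a_next)      # Haar approximation/detail coefficients
    a = a_bar + U @ d_bar                  # update:  a_l <- a_bar_l + U d_bar_l
    d = d_bar - P @ a                      # predict: d_l <- d_bar_l - P a_l
    return a, d

def lift_inverse(a, d, haar_merge, U, P):
    """Exact inverse of the step above: undo predict, undo update, inverse Haar."""
    d_bar = d + P @ a
    a_bar = a - U @ d_bar
    return haar_merge(a_bar, d_bar)        # recovers a_{l+1}

# Toy check with stand-in split/merge operators (not the graph Haar transform):
rng = np.random.default_rng(0)
n = 8
split = lambda x: (x[:n // 2].copy(), x[n // 2:].copy())
merge = lambda a, d: np.concatenate([a, d])
U = 0.1 * rng.normal(size=(n // 2, n // 2))
P = 0.1 * rng.normal(size=(n // 2, n // 2))
x = rng.normal(size=n)
a, d = lift_forward(x, split, U, P)
assert np.allclose(lift_inverse(a, d, merge, U, P), x)   # perfect reconstruction
# A scaling function phi_{l,k} can be recovered by running the backward transform on
# coefficients that are all zero except a_{l,k} = 1, as described in the text.
```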
30 3 Approach For a given class of signals, our objective is to design wavelets that yield approximately sparse expansions in Eq. [sent-62, score-0.747]
31 Therefore, we learn the update and predict operators that minimize some sparsity surrogate of the detail (wavelet) coefficients of given training functions {f^n}, n = 1, ..., n_max. [sent-66, score-0.385]
32 For a fixed multiresolution level ℓ and a training function f^n, let ā^n_ℓ and d̄^n_ℓ be the Haar approximation and detail coefficient vectors of f^n received at level ℓ (i. [sent-67, score-0.38]
33 Since we would like to obtain a linear wavelet transform, the linearity of the encode and decode steps is of crucial importance. [sent-72, score-0.293]
34 In addition to linearity and the special form of bias terms, our auto-encoders differ from commonly used ones in that we enforce sparsity on the reconstruction error, rather than the hidden representation – in our setting, the reconstruction errors correspond to detail coefficients. [sent-73, score-0.326]
35 We also follow a similar strategy and tie the weights of update and predict steps, but the specific form of tying is dictated by the wavelet properties and will be discussed in §4. [sent-80, score-0.423]
36 Namely, we first train the update and predict operators at the finest level: here the inputs to the lifting step are the original training functions – this corresponds to ℓ = ℓ_max − 1 and, for all n, a^n_{ℓ+1} = f^n in Figure 1(left). [sent-83, score-0.435]
37 After training of this finest level is completed, we obtain new approximation coefficients a^n_ℓ which are passed to the next level as the training functions, and this process is repeated until one reaches the coarsest level. [sent-84, score-0.297]
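A minimal sketch of this greedy, level-by-level training loop, assuming a haar_split routine that returns per-signal Haar approximation and detail coefficient matrices. For brevity the update operator is fixed to zero and a plain gradient step on a smoothed L1 surrogate of the details is used; the actual construction ties U to P and trains in an auto-encoder fashion, so this only illustrates the data flow.

```python
import numpy as np

def train_level(A_next, haar_split, n_iters=200, lr=1e-2, eps=1e-3):
    """Learn a predict operator P at one level from the lifted approximations A_next.

    A_next: array of shape (n_signals, n_regions_{l+1}); haar_split maps it to the
    per-signal Haar approximation and detail coefficient matrices (A_bar, D_bar).
    The update operator is taken to be zero here; in the full construction it is
    tied to P through the wavelet constraints.
    """
    A_bar, D_bar = haar_split(A_next)
    P = np.zeros((D_bar.shape[1], A_bar.shape[1]))
    for _ in range(n_iters):
        A = A_bar                              # approximations (U = 0 in this sketch)
        D = D_bar - A @ P.T                    # detail coefficients d = d_bar - P a
        G = D / np.sqrt(D**2 + eps)            # gradient of the smoothed L1 surrogate
        P -= lr * (-(G.T @ A)) / len(A)        # gradient step on sum sqrt(d^2 + eps)
    return P, A_bar                            # A_bar becomes the next level's input

def train_all_levels(F, haar_splits):
    """Greedy pre-training: finest level first, then coarser levels on the new
    approximation coefficients, mirroring stacked auto-encoder pre-training."""
    A, operators = F, []
    for split in haar_splits:                  # ordered from finest to coarsest
        P, A = train_level(A, split)
        operators.append(P)
    return operators
```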
38 The choice of the lifting scheme as the backbone of our construction is motivated by several observations. [sent-86, score-0.305]
39 First, every invertible 1D discrete wavelet transform can be factored into lifting steps [8], which makes lifting a universal tool for constructing multiresolutions. [sent-87, score-0.768]
40 Second, the lifting scheme is always invertible, and provides exact reconstruction of signals. [sent-88, score-0.33]
41 We choose to apply lifting to Haar wavelets specifically because Haar wavelets are easy to define on any underlying space provided that it can be hierarchically partitioned [24, 10]. [sent-90, score-1.684]
42 Our use of the update-first scheme mirrors its common use for adaptive wavelet constructions in the image processing literature, which is motivated by its stability; see [4] for a thorough discussion. [sent-91, score-0.375]
43 For a graph signal f, we define its integral over the graph as a weighted sum, ∫_G f = Σ_i S_{ii} f(i). [sent-95, score-0.322]
44 We assume that a hierarchical partitioning (not necessarily dyadic) of the underlying graph into connected regions is provided. [sent-97, score-0.229]
45 As will become clear, our wavelet construction yields one approximation coefficient a_{ℓ,k} for each region R_{ℓ,k}, and one detail coefficient d_{ℓ,k} for each active region R_{ℓ+1,k} at level ℓ + 1. [sent-108, score-0.731]
46 Note that if the partition is not dyadic, at a given level ℓ the number of scaling coefficients (equal to the number of regions at level ℓ) will not be the same as the number of detail coefficients (equal to the number of active regions at level ℓ + 1). [sent-109, score-0.589]
47 1 Haar wavelets Usually, the (unnormalized) Haar approximation and detail coefficients of a signal f are computed as follows. [sent-112, score-0.919]
48 The detail coefficient d̄_{ℓ,k} corresponding to an active region R_{ℓ+1,k} is the difference between the averages at the region R_{ℓ+1,k} and its parent region R_{ℓ,par(k)}, namely d̄_{ℓ,k} = ā_{ℓ+1,k} − ā_{ℓ,par(k)}. [sent-114, score-0.461]
49 For perfect reconstruction there is no need to keep detail coefficients for inactive regions, because these can be recovered from the scaling coefficient of the parent region and the detail coefficients of the sibling regions. [sent-115, score-0.603]
50 In our setting, Haar wavelets are a part of the lifting scheme, and so the coefficient vectors ā_ℓ and d̄_ℓ at level ℓ need to be computed from the augmented coefficient vector a_{ℓ+1} at level ℓ + 1 (c. [sent-116, score-1.072]
51 As before, the detail coefficient corresponding to an active region R_{ℓ+1,k} is given by d̄_{ℓ,k} = a_{ℓ+1,k} − ā_{ℓ,par(k)}. [sent-121, score-0.241]
52 The resulting Haar wavelets are not normalized; when sorting wavelet/scaling coefficients we will multiply coefficients coming from level ℓ by 2^{−ℓ/2}. [sent-122, score-0.798]
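The following numpy sketch computes these unnormalized Haar coefficients for one level of a given hierarchical partition. The representation of the hierarchy and the convention that every child except the first is "active" are illustrative assumptions, not necessarily the paper's exact bookkeeping.

```python
import numpy as np

def haar_split_level(a_next, children, volumes_next):
    """Unnormalized Haar coefficients for one level of the hierarchical partition.

    a_next:       values a_{l+1,k}, one per region at level l+1
    children:     children[k] = index array of the level-(l+1) regions inside R_{l,k}
    volumes_next: volumes of the level-(l+1) regions (for the weighted average)
    Convention assumed here: within each parent, all children but the first are active.
    """
    a_bar, d_bar = [], []
    for kids in children:
        vols = volumes_next[kids]
        avg = np.dot(vols, a_next[kids]) / vols.sum()   # a_bar_{l,k}: weighted average
        a_bar.append(avg)
        for k in kids[1:]:                              # active children only
            d_bar.append(a_next[k] - avg)               # d_bar = a_{l+1,k} - a_bar_{l,par(k)}
    return np.array(a_bar), np.array(d_bar)

# When coefficients from different levels are compared or sorted, those from level l
# are multiplied by 2**(-l / 2) to compensate for the missing normalization.
```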
53 2 Auto-encoder setup The choice of the update and predict operators and their tying scheme is guided by a number of properties that wavelets need to satisfy. [sent-124, score-0.975]
54 Vanishing moments: The wavelets should have vanishing dual and primal moments – two independent conditions due to biorthogonality of our wavelets. [sent-126, score-0.793]
55 In terms of the approximation and detail coefficients these can be expressed as follows: a) all of the detail coefficients of a constant function should be zero and b) the integral of the approximation at any level of multiresolution should be the same as the integral of the original function. [sent-127, score-0.409]
56 Taking these two requirements into consideration, we impose the following constraints on the predict and update weights: P1 = 0 and U = A_c^{−1} P^t A_f, where A_f is the diagonal matrix of the active region volumes at level ℓ + 1. [sent-136, score-0.292]
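A small sketch of how this tying could be imposed in code: the rows of P are centered so that P1 = 0, and U is built from P via U = A_c^{−1} P^t A_f. The interpretation of A_c as the diagonal matrix of level-ℓ region volumes is an assumption made for this illustration.

```python
import numpy as np

def project_zero_row_sum(P):
    """Enforce the dual vanishing moment P 1 = 0 by removing each row's mean."""
    return P - P.mean(axis=1, keepdims=True)

def tie_update_to_predict(P, vol_coarse, vol_active):
    """Build the update operator from the predict operator, U = A_c^{-1} P^T A_f.

    vol_coarse: volumes of the level-l regions (diagonal of A_c, assumed here)
    vol_active: volumes of the active level-(l+1) regions (diagonal of A_f)
    """
    A_c_inv = np.diag(1.0 / np.asarray(vol_coarse, dtype=float))
    A_f = np.diag(np.asarray(vol_active, dtype=float))
    return A_c_inv @ P.T @ A_f

# With these constraints, a constant signal produces zero detail coefficients, and the
# weighted sum (integral) of the approximation coefficients is preserved across levels.
```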
57 Locality: To make our wavelets and scaling functions localized on the graph, we need to constrain update and predict operators in a way that would disallow distant regions from updating or predicting the approximation/detail coefficients of each other. [sent-141, score-1.073]
58 For a detail coefficient d_{ℓ,k} corresponding to the active region R_{ℓ+1,k}, we only allow predictions that come from the parent region R_{ℓ,par(k)} and the immediate neighbors of this parent region. [sent-143, score-0.41]
59 As a result of this choice, it is not difficult to see that the resulting scaling functions φ_{ℓ,k} and wavelets ψ_{ℓ,k} will be supported in the vicinity of the region R_{ℓ,k}. [sent-146, score-0.96]
60 In this case we choose smoothness as our prior, and train the wavelets with a set of smooth functions on the graph – namely, we use scaled eigenvectors of the graph Laplacian corresponding to the smallest eigenvalues. [sent-158, score-1.09]
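A sketch of generating such smooth training functions follows; the particular scaling of the eigenvectors (here 1/(1 + eigenvalue)) is an assumption, since the text only states that scaled eigenvectors corresponding to the smallest eigenvalues are used.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian

def smooth_training_functions(W, n_funcs=32):
    """Columns are smooth graph signals: Laplacian eigenvectors for the smallest
    eigenvalues, scaled by 1 / (1 + eigenvalue) (scaling chosen for illustration)."""
    L = laplacian(np.asarray(W, dtype=float), normed=False)
    vals, vecs = eigh(L)                       # eigenvalues in ascending order
    return vecs[:, :n_funcs] / (1.0 + vals[:n_funcs])
```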
61 As proved in [10], the quality of Haar wavelets depends on the quality (balance) of the graph partitioning. [sent-171, score-0.912]
62 At every step, a given region (a subset of the vertex set) of graph G is split into two children partitions by running the 2-means clustering algorithm (k-means with k = 2) on the above embedding restricted to the vertices of the given partition [24]. [sent-180, score-0.348]
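A sketch of this recursive 2-means partitioning; the spectral embedding used to drive the splits and the stopping criterion are illustrative choices, and any balance-improving details of the actual procedure are omitted.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.sparse.csgraph import laplacian
from sklearn.cluster import KMeans

def spectral_embedding(W, dim=8):
    """Simple Laplacian-eigenmap embedding used for splitting (an assumption)."""
    vals, vecs = eigh(laplacian(np.asarray(W, dtype=float), normed=False))
    return vecs[:, 1:dim + 1]                  # drop the constant eigenvector

def hierarchical_partition(emb, vertices=None, min_size=2):
    """Recursively split a region into two children with 2-means (k-means, k = 2).
    Returns a nested list of vertex-index arrays representing the partition tree."""
    if vertices is None:
        vertices = np.arange(len(emb))
    if len(vertices) <= min_size:
        return vertices
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(emb[vertices])
    left, right = vertices[labels == 0], vertices[labels == 1]
    if len(left) == 0 or len(right) == 0:      # degenerate split: stop recursing
        return vertices
    return [hierarchical_partition(emb, left, min_size),
            hierarchical_partition(emb, right, min_size)]
```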
63 6 Graph construction for point clouds Our problem setup started with a weighted graph and arrived at the Laplacian matrix L in §4. [sent-184, score-0.283]
64 5 Experiments Our goal is to experimentally investigate the constructed wavelets for multiscale behavior, meaningful adaptation to training signals, and sparse representation that generalizes to testing signals. [sent-189, score-0.858]
65 For the first two objectives we visualize the scaling functions at different levels because they provide insight about the signal approximation spaces V_ℓ. [sent-190, score-0.277]
66 The generalization performance can be deduced from comparison to Haar wavelets, because during training we modify Haar wavelets so as to achieve a sparser representation of training signals. [sent-191, score-0.826]
67 We start with the case of a periodic interval, which is discretized as a cycle graph; 32 scaled eigenvectors (sines [Figure 2: Scaling (left) and wavelet (right) functions on periodic interval.] [sent-192, score-0.353]
68 Figure 2 shows the resulting scaling and wavelet functions at level ℓ = 4. [sent-194, score-0.482]
69 Up to discretization errors, the wavelets and scaling functions at the same level are shifts of each other – showing that our construction is able to learn shift invariance from training functions. [sent-195, score-1.065]
70 Figure 3(a) depicts a graph representing the road network of Minnesota, with edges showing the major roads and vertices being their intersections. [sent-196, score-0.234]
71 In our construction we employ unit weights on edges and use 32 scaled eigenvectors of the graph Laplacian as training functions. [sent-197, score-0.284]
72 The resulting scaling functions for regions containing the red vertex in Figure 3(a) are shown at different levels in Figure 3(b,c,d,e,f). [sent-198, score-0.28]
73 Note that the scaling functions are continuous and show multiscale spatial behavior. [sent-200, score-0.216]
74 Better average reconstruction results (h) for our wavelets (Wav-smooth) indicate a good generalization performance. [sent-203, score-0.813]
75 The improvement over the Haar wavelets shows that our model generalizes well to unseen signals. [sent-207, score-0.726]
76 The daily temperature data for the year of 2012 gives us 366 signals on the graph; Figure 4(b) depicts one such signal. [sent-211, score-0.238]
77 We use the signals from the first half of the year to train the wavelets, and test for sparse reconstruction quality on the second half of the year (and vice versa). [sent-212, score-0.279]
78 Figure 4(c,d,e,f,g) depicts some of the scaling functions at a number of levels; note that the depicted scaling function at level ℓ = 2 captures the rough temperature distribution pattern of the US. [sent-213, score-0.442]
79 The average reconstruction error from a specified fraction of largest detail coefficients is shown in Figure 4(g). [sent-214, score-0.216]
80 As an application, we employ our wavelets for semi-supervised learning of the temperature distribution for a day from the temperatures at a subset of labeled graph vertices. [sent-215, score-0.914]
81 The sought temperature [Figure 4 panels: (a) GSOD network, (b) April 9, 2012, (c)-(f) Scaling ℓ = 2, 4, 6, 8, (g) Reconstruction error, (h) Learning error] Figure 4: Our construction on the station network (a) trained with daily temperature data (e. [sent-216, score-0.23]
82 Reconstruction results (g) using our wavelets trained on data (Wav-data) and with smooth prior (Wav-smooth). [sent-219, score-0.778]
83 gov/pub/data/gsod/2012/ [Figure 5 panels: (a) Scaling functions, (b) PSNR, (c) SSIM] Figure 5: The scaling functions (a) resulting from training on a face images dataset. [sent-224, score-0.282]
84 These wavelets (Wav-data) provide better sparse reconstruction quality than the CDF9/7 wavelet filterbanks (b,c). [sent-225, score-1.119]
85 Since we expect the detail coefficients to be sparse, we impose a lasso penalty on them; to make the problem smaller, all detail coefficients for levels ℓ ≥ 7 are set to zero. [sent-228, score-0.298]
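A sketch of the semi-supervised fit described above: a lasso over wavelet coefficients using the synthesis operator restricted to the labeled vertices. For brevity this sketch penalizes all coefficients and uses a dense synthesis matrix, whereas the text penalizes only the detail coefficients and zeroes out levels ℓ ≥ 7; all names here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_from_labels(synth, labeled_idx, y_labeled, alpha=1e-3):
    """Recover a graph signal from a few labeled vertices via a sparse wavelet fit.

    synth: (n_vertices, n_coeffs) dense synthesis matrix whose columns are the
           scaling/wavelet functions, a stand-in for the paper's fast transform.
    """
    S = synth[labeled_idx]                     # synthesis restricted to labeled vertices
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    model.fit(S, y_labeled)                    # L1 penalty yields a sparse coefficient vector
    return synth @ model.coef_                 # reconstruct the signal on all vertices
```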
86 Images can be seen as signals on a graph – pixels are the vertices and each pixel is connected to its 8 nearest neighbors. [sent-234, score-0.268]
87 Figure 5(a) depicts a number of obtained scaling functions at different levels (the rows correspond to levels ℓ = 4, 5, 6, 7, 8) in various locations (columns). [sent-237, score-0.271]
88 Note that the scaling functions show controllable multiscale spatial behavior. [sent-239, score-0.216]
89 The quality of reconstruction from a sparse set of detail coefficients is plotted in Figure 5(b,c). [sent-240, score-0.267]
90 We also make a comparison to reconstruction using the standard separable CDF 9/7 wavelet filterbanks from the bottom-most level; for both quality metrics, our wavelets trained on data perform better than CDF 9/7. [sent-243, score-1.136]
91 The smoothly trained wavelets do not improve over the Haar wavelets, because the smoothness assumption does not hold for face images. [sent-244, score-0.782]
92 6 Conclusion We have introduced an approach to constructing wavelets that take into consideration structural properties of both graph signals and their underlying graphs. [sent-245, score-1.02]
93 An interesting direction for future research would be to randomize the graph partitioning process or to use bagging over training functions in order to obtain a family of wavelet constructions on the same graph – leading to overcomplete dictionaries like in [25]. [sent-246, score-0.681]
94 One can also introduce multiple lifting steps at each level or even add non-linearities as common with neural networks. [sent-247, score-0.274]
95 Our wavelets are obtained by training a structure similar to a deep neural network; interestingly, the recent work of Mallat and collaborators (e. [sent-248, score-0.824]
96 [3]) goes in the other direction and provides a wavelet interpretation of deep neural networks. [sent-250, score-0.303]
97 Therefore, we believe that there are ample opportunities for future work in the interface between wavelets and deep neural networks. [sent-251, score-0.774]
98 Multiscale wavelets on trees, graphs and high dimensional data: Theory and applications to semi supervised learning. [sent-329, score-0.766]
99 Multi-dimensional separable critically sampled wavelet filterbanks on arbitrary graphs. [sent-387, score-0.274]
100 The lifting scheme: A construction of second generation wavelets. [sent-435, score-0.264]
wordName wordTfidf (topN-words)
[('wavelets', 0.726), ('haar', 0.275), ('wavelet', 0.255), ('lifting', 0.202), ('coef', 0.154), ('ol', 0.154), ('detail', 0.129), ('graph', 0.126), ('cients', 0.123), ('scaling', 0.117), ('signals', 0.089), ('reconstruction', 0.087), ('nmax', 0.081), ('region', 0.079), ('level', 0.072), ('clouds', 0.067), ('transform', 0.064), ('temperature', 0.062), ('construction', 0.062), ('multiscale', 0.061), ('constructions', 0.06), ('laplacian', 0.058), ('predict', 0.056), ('operators', 0.055), ('vertices', 0.053), ('training', 0.05), ('deep', 0.048), ('regions', 0.047), ('lterbanks', 0.046), ('parent', 0.045), ('ac', 0.044), ('signal', 0.042), ('scheme', 0.041), ('lifted', 0.041), ('vanishing', 0.041), ('coifman', 0.041), ('graphs', 0.04), ('levels', 0.04), ('functions', 0.038), ('vertex', 0.038), ('decode', 0.038), ('nest', 0.038), ('dyadic', 0.038), ('depicts', 0.036), ('sii', 0.035), ('multiresolution', 0.035), ('tying', 0.035), ('af', 0.034), ('update', 0.034), ('smooth', 0.033), ('active', 0.033), ('coarsest', 0.031), ('iht', 0.031), ('narang', 0.031), ('backward', 0.031), ('children', 0.03), ('transforms', 0.03), ('stack', 0.03), ('quality', 0.03), ('underlying', 0.03), ('consideration', 0.029), ('weighted', 0.028), ('guided', 0.028), ('dn', 0.028), ('stations', 0.027), ('year', 0.026), ('moments', 0.026), ('partitioning', 0.026), ('daily', 0.025), ('invertible', 0.025), ('diffusion', 0.024), ('eigenvectors', 0.024), ('harmonic', 0.024), ('sparsity', 0.023), ('approximation', 0.022), ('partitions', 0.022), ('weights', 0.022), ('tie', 0.021), ('sparse', 0.021), ('constructing', 0.02), ('face', 0.02), ('sparsely', 0.02), ('separable', 0.019), ('resembling', 0.019), ('road', 0.019), ('adaptive', 0.019), ('trained', 0.019), ('ner', 0.019), ('tied', 0.019), ('images', 0.019), ('volumes', 0.018), ('spaces', 0.018), ('dark', 0.018), ('periodic', 0.018), ('manifolds', 0.018), ('greedy', 0.018), ('perfect', 0.017), ('expansion', 0.017), ('namely', 0.017), ('smoothness', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 350 nips-2013-Wavelets on Graphs via Deep Learning
Author: Raif Rustamov, Leonidas Guibas
Abstract: An increasing number of applications require processing of signals defined on weighted graphs. While wavelets provide a flexible tool for signal processing in the classical setting of regular domains, the existing graph wavelet constructions are less flexible – they are guided solely by the structure of the underlying graph and do not take directly into consideration the particular class of signals to be processed. This paper introduces a machine learning framework for constructing graph wavelets that can sparsely represent a given class of signals. Our construction uses the lifting scheme, and is based on the observation that the recurrent nature of the lifting scheme gives rise to a structure resembling a deep auto-encoder network. Particular properties that the resulting wavelets must satisfy determine the training objective and the structure of the involved neural networks. The training is unsupervised, and is conducted similarly to the greedy pre-training of a stack of auto-encoders. After training is completed, we obtain a linear wavelet transform that can be applied to any graph signal in time and memory linear in the size of the graph. Improved sparsity of our wavelet transform for the test signals is confirmed via experiments both on synthetic and real data. 1
2 0.1005016 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
Author: James L. Sharpnack, Akshay Krishnamurthy, Aarti Singh
Abstract: The detection of anomalous activity in graphs is a statistical problem that arises in many applications, such as network surveillance, disease outbreak detection, and activity monitoring in social networks. Beyond its wide applicability, graph structured anomaly detection serves as a case study in the difficulty of balancing computational complexity with statistical power. In this work, we develop from first principles the generalized likelihood ratio test for determining if there is a well connected region of activation over the vertices in the graph in Gaussian noise. Because this test is computationally infeasible, we provide a relaxation, called the Lovász extended scan statistic (LESS), that uses submodularity to approximate the intractable generalized likelihood ratio. We demonstrate a connection between LESS and maximum a-posteriori inference in Markov random fields, which provides us with a poly-time algorithm for LESS. Using electrical network theory, we are able to control type 1 error for LESS and prove conditions under which LESS is risk consistent. Finally, we consider specific graph models, the torus, k-nearest neighbor graphs, and ǫ-random graphs. We show that on these graphs our results provide near-optimal performance by matching our results to known lower bounds. 1
3 0.071354754 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
4 0.068059929 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning
Author: Xiao-Ming Wu, Zhenguo Li, Shih-Fu Chang
Abstract: We find that various well-known graph-based models exhibit a common important harmonic structure in its target function – the value of a vertex is approximately the weighted average of the values of its adjacent neighbors. Understanding of such structure and analysis of the loss defined over such structure help reveal important properties of the target function over a graph. In this paper, we show that the variation of the target function across a cut can be upper and lower bounded by the ratio of its harmonic loss and the cut cost. We use this to develop an analytical tool and analyze five popular graph-based models: absorbing random walks, partially absorbing random walks, hitting times, pseudo-inverse of the graph Laplacian, and eigenvectors of the Laplacian matrices. Our analysis sheds new insights into several open questions related to these models, and provides theoretical justifications and guidelines for their practical use. Simulations on synthetic and real datasets confirm the potential of the proposed theory and tool.
5 0.067476861 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato
Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. 1
6 0.065145254 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
7 0.063984551 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
8 0.062731311 251 nips-2013-Predicting Parameters in Deep Learning
9 0.049594112 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
10 0.049515959 122 nips-2013-First-order Decomposition Trees
11 0.047609057 149 nips-2013-Latent Structured Active Learning
12 0.044029228 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
13 0.040945426 87 nips-2013-Density estimation from unweighted k-nearest neighbor graphs: a roadmap
14 0.040070843 194 nips-2013-Model Selection for High-Dimensional Regression under the Generalized Irrepresentability Condition
15 0.040000703 75 nips-2013-Convex Two-Layer Modeling
16 0.039573755 331 nips-2013-Top-Down Regularization of Deep Belief Networks
17 0.039135154 186 nips-2013-Matrix factorization with binary components
18 0.038753644 7 nips-2013-A Gang of Bandits
19 0.03869028 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
20 0.03848058 59 nips-2013-Blind Calibration in Compressed Sensing using Message Passing Algorithms
topicId topicWeight
[(0, 0.125), (1, 0.053), (2, -0.022), (3, -0.012), (4, 0.033), (5, 0.012), (6, -0.029), (7, -0.022), (8, -0.03), (9, -0.018), (10, 0.051), (11, -0.049), (12, 0.042), (13, -0.02), (14, -0.03), (15, -0.002), (16, -0.021), (17, -0.008), (18, -0.055), (19, -0.004), (20, -0.021), (21, 0.016), (22, 0.059), (23, -0.051), (24, 0.014), (25, -0.024), (26, 0.023), (27, 0.049), (28, -0.009), (29, 0.029), (30, -0.003), (31, 0.009), (32, 0.022), (33, 0.066), (34, -0.019), (35, -0.007), (36, 0.068), (37, -0.034), (38, 0.056), (39, 0.035), (40, -0.034), (41, 0.037), (42, 0.029), (43, -0.013), (44, -0.091), (45, 0.015), (46, -0.036), (47, 0.027), (48, 0.011), (49, -0.071)]
simIndex simValue paperId paperTitle
same-paper 1 0.91043067 350 nips-2013-Wavelets on Graphs via Deep Learning
Author: Raif Rustamov, Leonidas Guibas
Abstract: An increasing number of applications require processing of signals defined on weighted graphs. While wavelets provide a flexible tool for signal processing in the classical setting of regular domains, the existing graph wavelet constructions are less flexible – they are guided solely by the structure of the underlying graph and do not take directly into consideration the particular class of signals to be processed. This paper introduces a machine learning framework for constructing graph wavelets that can sparsely represent a given class of signals. Our construction uses the lifting scheme, and is based on the observation that the recurrent nature of the lifting scheme gives rise to a structure resembling a deep auto-encoder network. Particular properties that the resulting wavelets must satisfy determine the training objective and the structure of the involved neural networks. The training is unsupervised, and is conducted similarly to the greedy pre-training of a stack of auto-encoders. After training is completed, we obtain a linear wavelet transform that can be applied to any graph signal in time and memory linear in the size of the graph. Improved sparsity of our wavelet transform for the test signals is confirmed via experiments both on synthetic and real data. 1
2 0.62726605 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning
Author: Xiao-Ming Wu, Zhenguo Li, Shih-Fu Chang
Abstract: We find that various well-known graph-based models exhibit a common important harmonic structure in its target function – the value of a vertex is approximately the weighted average of the values of its adjacent neighbors. Understanding of such structure and analysis of the loss defined over such structure help reveal important properties of the target function over a graph. In this paper, we show that the variation of the target function across a cut can be upper and lower bounded by the ratio of its harmonic loss and the cut cost. We use this to develop an analytical tool and analyze five popular graph-based models: absorbing random walks, partially absorbing random walks, hitting times, pseudo-inverse of the graph Laplacian, and eigenvectors of the Laplacian matrices. Our analysis sheds new insights into several open questions related to these models, and provides theoretical justifications and guidelines for their practical use. Simulations on synthetic and real datasets confirm the potential of the proposed theory and tool.
3 0.62271553 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching
Author: Marcelo Fiori, Pablo Sprechmann, Joshua Vogelstein, Pablo Muse, Guillermo Sapiro
Abstract: Graph matching is a challenging problem with very important applications in a wide range of fields, from image and video analysis to biological and biomedical problems. We propose a robust graph matching algorithm inspired in sparsity-related techniques. We cast the problem, resembling group or collaborative sparsity formulations, as a non-smooth convex optimization problem that can be efficiently solved using augmented Lagrangian techniques. The method can deal with weighted or unweighted graphs, as well as multimodal data, where different graphs represent different types of data. The proposed approach is also naturally integrated with collaborative graph inference techniques, solving general network inference problems where the observed variables, possibly coming from different modalities, are not in correspondence. The algorithm is tested and compared with state-of-the-art graph matching techniques in both synthetic and real graphs. We also present results on multimodal graphs and applications to collaborative inference of brain connectivity from alignment-free functional magnetic resonance imaging (fMRI) data. The code is publicly available. 1
4 0.60232651 157 nips-2013-Learning Multi-level Sparse Representations
Author: Ferran Diego Andilla, Fred A. Hamprecht
Abstract: Bilinear approximation of a matrix is a powerful paradigm of unsupervised learning. In some applications, however, there is a natural hierarchy of concepts that ought to be reflected in the unsupervised analysis. For example, in the neurosciences image sequence considered here, there are the semantic concepts of pixel → neuron → assembly that should find their counterpart in the unsupervised analysis. Driven by this concrete problem, we propose a decomposition of the matrix of observations into a product of more than two sparse matrices, with the rank decreasing from lower to higher levels. In contrast to prior work, we allow for both hierarchical and heterarchical relations of lower-level to higher-level concepts. In addition, we learn the nature of these relations rather than imposing them. Finally, we describe an optimization scheme that allows to optimize the decomposition over all levels jointly, rather than in a greedy level-by-level fashion. The proposed bilevel SHMF (sparse heterarchical matrix factorization) is the first formalism that allows to simultaneously interpret a calcium imaging sequence in terms of the constituent neurons, their membership in assemblies, and the time courses of both neurons and assemblies. Experiments show that the proposed model fully recovers the structure from difficult synthetic data designed to imitate the experimental data. More importantly, bilevel SHMF yields plausible interpretations of real-world Calcium imaging data. 1
5 0.60140067 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
Author: Pablo Sprechmann, Roee Litman, Tal Ben Yakar, Alexander M. Bronstein, Guillermo Sapiro
Abstract: In this paper, we propose a new computationally efficient framework for learning sparse models. We formulate a unified approach that contains as particular cases models promoting sparse synthesis and analysis type of priors, and mixtures thereof. The supervised training of the proposed model is formulated as a bilevel optimization problem, in which the operators are optimized to achieve the best possible performance on a specific task, e.g., reconstruction or classification. By restricting the operators to be shift invariant, our approach can be thought as a way of learning sparsity-promoting convolutional operators. Leveraging recent ideas on fast trainable regressors designed to approximate exact sparse codes, we propose a way of constructing feed-forward networks capable of approximating the learned models at a fraction of the computational cost of exact solvers. In the shift-invariant case, this leads to a principled way of constructing a form of taskspecific convolutional networks. We illustrate the proposed models on several experiments in music analysis and image processing applications. 1
6 0.59918392 207 nips-2013-Near-optimal Anomaly Detection in Graphs using Lovasz Extended Scan Statistic
7 0.58224428 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
8 0.55386347 186 nips-2013-Matrix factorization with binary components
9 0.54866922 122 nips-2013-First-order Decomposition Trees
10 0.54481739 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
11 0.53938395 212 nips-2013-Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty
12 0.52673441 202 nips-2013-Multiclass Total Variation Clustering
13 0.51698023 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference
14 0.5166322 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation
15 0.51582479 359 nips-2013-Σ-Optimality for Active Learning on Gaussian Random Fields
16 0.51051617 37 nips-2013-Approximate Bayesian Image Interpretation using Generative Probabilistic Graphics Programs
17 0.50833619 73 nips-2013-Convex Relaxations for Permutation Problems
18 0.50679523 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
19 0.50135976 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections
20 0.49745387 300 nips-2013-Solving the multi-way matching problem by permutation synchronization
topicId topicWeight
[(2, 0.023), (12, 0.217), (16, 0.06), (33, 0.146), (34, 0.109), (41, 0.039), (49, 0.029), (56, 0.089), (70, 0.038), (85, 0.048), (89, 0.037), (93, 0.048), (95, 0.024)]
simIndex simValue paperId paperTitle
1 0.83401614 340 nips-2013-Understanding variable importances in forests of randomized trees
Author: Gilles Louppe, Louis Wehenkel, Antonio Sutera, Pierre Geurts
Abstract: Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, iii) the different interaction terms of a given degree. We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees. 1 Motivation An important task in many scientific fields is the prediction of a response variable based on a set of predictor variables. In many situations though, the aim is not only to make the most accurate predictions of the response but also to identify which predictor variables are the most important to make these predictions, e.g. in order to understand the underlying process. Because of their applicability to a wide range of problems and their capability to both build accurate models and, at the same time, to provide variable importance measures, Random Forests (Breiman, 2001) and variants such as Extra-Trees (Geurts et al., 2006) have become a major data analysis tool used with success in various scientific areas. Despite their extensive use in applied research, only a couple of works have studied the theoretical properties and statistical mechanisms of these algorithms. Zhao (2000), Breiman (2004), Biau et al. (2008); Biau (2012), Meinshausen (2006) and Lin and Jeon (2006) investigated simplified to very realistic variants of these algorithms and proved the consistency of those variants. Little is known however regarding the variable importances computed by Random Forests like algorithms, and – as far as we know – the work of Ishwaran (2007) is indeed the only theoretical study of tree-based variable importance measures. In this work, we aim at filling this gap and present a theoretical analysis of the Mean Decrease Impurity importance derived from ensembles of randomized trees. The rest of the paper is organized as follows: in section 2, we provide the background about ensembles of randomized trees and recall how variable importances can be derived from them; in section 3, we then derive a characterization in asymptotic conditions and show how variable importances derived from totally randomized trees offer a three-level decomposition of the information jointly contained in the input variables about the output; section 4 shows that this characterization only depends on the relevant variables and section 5 1 discusses theses ideas in the context of variants closer to the Random Forest algorithm; section 6 then illustrates all these ideas on an artificial problem; finally, section 7 includes our conclusions and proposes directions of future works. 2 Background In this section, we first describe decision trees, as well as forests of randomized trees. 
Then, we describe the two major variable importances measures derived from them – including the Mean Decrease Impurity (MDI) importance that we will study in the subsequent sections. 2.1 Single classification and regression trees and random forests A binary classification (resp. regression) tree (Breiman et al., 1984) is an input-output model represented by a tree structure T , from a random input vector (X1 , ..., Xp ) taking its values in X1 × ... × Xp = X to a random output variable Y ∈ Y . Any node t in the tree represents a subset of the space X , with the root node being X itself. Internal nodes t are labeled with a binary test (or split) st = (Xm < c) dividing their subset in two subsets1 corresponding to their two children tL and tR , while the terminal nodes (or leaves) are labeled with a best guess value of the output ˆ variable2 . The predicted output Y for a new instance is the label of the leaf reached by the instance when it is propagated through the tree. A tree is built from a learning sample of size N drawn from P (X1 , ..., Xp , Y ) using a recursive procedure which identifies at each node t the split st = s∗ for which the partition of the Nt node samples into tL and tR maximizes the decrease ∆i(s, t) = i(t) − pL i(tL ) − pR i(tR ) (1) of some impurity measure i(t) (e.g., the Gini index, the Shannon entropy, or the variance of Y ), and where pL = NtL /Nt and pR = NtR /Nt . The construction of the tree stops , e.g., when nodes become pure in terms of Y or when all variables Xi are locally constant. Single trees typically suffer from high variance, which makes them not competitive in terms of accuracy. A very efficient and simple way to address this flaw is to use them in the context of randomization-based ensemble methods. Specifically, the core principle is to introduce random perturbations into the learning procedure in order to produce several different decision trees from a single learning set and to use some aggregation technique to combine the predictions of all these trees. In Bagging (Breiman, 1996), trees are built on random bootstrap copies of the original data, hence producing different decision trees. In Random Forests (Breiman, 2001), Bagging is extended and combined with a randomization of the input variables that are used when considering candidate variables to split internal nodes t. In particular, instead of looking for the best split s∗ among all variables, the Random Forest algorithm selects, at each node, a random subset of K variables and then determines the best split over these latter variables only. 2.2 Variable importances In the context of ensembles of randomized trees, Breiman (2001, 2002) proposed to evaluate the importance of a variable Xm for predicting Y by adding up the weighted impurity decreases p(t)∆i(st , t) for all nodes t where Xm is used, averaged over all NT trees in the forest: Imp(Xm ) = 1 NT p(t)∆i(st , t) T (2) t∈T :v(st )=Xm and where p(t) is the proportion Nt /N of samples reaching t and v(st ) is the variable used in split st . When using the Gini index as impurity function, this measure is known as the Gini importance or Mean Decrease Gini. However, since it can be defined for any impurity measure i(t), we will refer to Equation 2 as the Mean Decrease Impurity importance (MDI), no matter the impurity measure i(t). We will characterize and derive results for this measure in the rest of this text. 
1 More generally, splits are defined by a (not necessarily binary) partition of the range Xm of possible values of a single variable Xm . 2 e.g. determined as the majority class j(t) (resp., the average value y (t)) within the subset of the leaf t. ¯ 2 In addition to MDI, Breiman (2001, 2002) also proposed to evaluate the importance of a variable Xm by measuring the Mean Decrease Accuracy (MDA) of the forest when the values of Xm are randomly permuted in the out-of-bag samples. For that reason, this latter measure is also known as the permutation importance. Thanks to popular machine learning softwares (Breiman, 2002; Liaw and Wiener, 2002; Pedregosa et al., 2011), both of these variable importance measures have shown their practical utility in an increasing number of experimental studies. Little is known however regarding their inner workings. Strobl et al. (2007) compare both MDI and MDA and show experimentally that the former is biased towards some predictor variables. As explained by White and Liu (1994) in case of single decision trees, this bias stems from an unfair advantage given by the usual impurity functions i(t) towards predictors with a large number of values. Strobl et al. (2008) later showed that MDA is biased as well, and that it overestimates the importance of correlated variables – although this effect was not confirmed in a later experimental study by Genuer et al. (2010). From a theoretical point of view, Ishwaran (2007) provides a detailed theoretical development of a simplified version of MDA, giving key insights for the understanding of the actual MDA. 3 Variable importances derived from totally randomized tree ensembles Let us now consider the MDI importance as defined by Equation 2, and let us assume a set V = {X1 , ..., Xp } of categorical input variables and a categorical output Y . For the sake of simplicity we will use the Shannon entropy as impurity measure, and focus on totally randomized trees; later on we will discuss other impurity measures and tree construction algorithms. Given a training sample L of N joint observations of X1 , ..., Xp , Y independently drawn from the joint distribution P (X1 , ..., Xp , Y ), let us assume that we infer from it an infinitely large ensemble of totally randomized and fully developed trees. In this setting, a totally randomized and fully developed tree is defined as a decision tree in which each node t is partitioned using a variable Xi picked uniformly at random among those not yet used at the parent nodes of t, and where each t is split into |Xi | sub-trees, i.e., one for each possible value of Xi , and where the recursive construction process halts only when all p variables have been used along the current branch. Hence, in such a tree, leaves are all at the same depth p, and the set of leaves of a fully developed tree is in bijection with the set X of all possible joint configurations of the p input variables. For example, if the input variables are all binary, the resulting tree will have exactly 2p leaves. Theorem 1. The MDI importance of Xm ∈ V for Y as computed with an infinite ensemble of fully developed totally randomized trees and an infinitely large training sample is: p−1 Imp(Xm ) = k=0 1 1 k Cp p − k I(Xm ; Y |B), (3) B∈Pk (V −m ) where V −m denotes the subset V \ {Xm }, Pk (V −m ) is the set of subsets of V −m of cardinality k, and I(Xm ; Y |B) is the conditional mutual information of Xm and Y given the variables in B. Proof. See Appendix B. Theorem 2. 
For any ensemble of fully developed trees in asymptotic learning sample size conditions (e.g., in the same conditions as those of Theorem 1), we have that p Imp(Xm ) = I(X1 , . . . , Xp ; Y ). (4) m=1 Proof. See Appendix C. Together, theorems 1 and 2 show that variable importances derived from totally randomized trees in asymptotic conditions provide a three-level decomposition of the information I(X1 , . . . , Xp ; Y ) contained in the set of input variables about the output variable. The first level is a decomposition among input variables (see Equation 4 of Theorem 2), the second level is a decomposition along the 3 degrees k of interaction terms of a variable with the other ones (see the outer sum in Equation 3 of Theorem 1), and the third level is a decomposition along the combinations B of interaction terms of fixed size k of possible interacting variables (see the inner sum in Equation 3). We observe that the decomposition includes, for each variable, each and every interaction term of each and every degree weighted in a fashion resulting only from the combinatorics of possible interaction terms. In particular, since all I(Xm ; Y |B) terms are at most equal to H(Y ), the prior entropy of Y , the p terms of the outer sum of Equation 3 are each upper bounded by 1 1 1 1 1 H(Y ) = k C k H(Y ) = H(Y ). k Cp p − k Cp p − k p−1 p −m B∈Pk (V ) As such, the second level decomposition resulting from totally randomized trees makes the p sub1 1 importance terms C k p−k B∈Pk (V −m ) I(Xm ; Y |B) to equally contribute (at most) to the total p importance, even though they each include a combinatorially different number of terms. 4 Importances of relevant and irrelevant variables Following Kohavi and John (1997), let us define as relevant to Y with respect to V a variable Xm for which there exists at least one subset B ⊆ V (possibly empty) such that I(Xm ; Y |B) > 0.3 Thus we define as irrelevant to Y with respect to V a variable Xi for which, for all B ⊆ V , I(Xi ; Y |B) = 0. Remark that if Xi is irrelevant to Y with respect to V , then by definition it is also irrelevant to Y with respect to any subset of V . Theorem 3. Xi ∈ V is irrelevant to Y with respect to V if and only if its infinite sample size importance as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is 0. Proof. See Appendix D. Lemma 4. Let Xi ∈ V be an irrelevant variable for Y with respect to V . The infinite sample size / importance of Xm ∈ V as computed with an infinite ensemble of fully developed totally randomized trees built on V for Y is the same as the importance derived when using V ∪ {Xi } to build the ensemble of trees for Y . Proof. See Appendix E. Theorem 5. Let VR ⊆ V be the subset of all variables in V that are relevant to Y with respect to V . The infinite sample size importance of any variable Xm ∈ VR as computed with an infinite ensemble of fully developed totally randomized trees built on VR for Y is the same as its importance computed in the same conditions by using all variables in V . That is: p−1 Imp(Xm ) = k=0 r−1 = l=0 1 1 k Cp p − k 1 1 l Cr r − l I(Xm ; Y |B) B∈Pk (V −m ) (5) I(Xm ; Y |B) −m B∈Pl (VR ) where r is the number of relevant variables in VR . Proof. See Appendix F. Theorems 3 and 5 show that the importances computed with an ensemble of totally randomized trees depends only on the relevant variables. Irrelevant variables have a zero importance and do not affect the importance of relevant variables. 
Practically, we believe that such properties are desirable conditions for a sound criterion assessing the importance of a variable. Indeed, noise should not be credited of any importance and should not make any other variable more (or less) important. 3 Among the relevant variables, we have the marginally relevant ones, for which I(Xm ; Y ) > 0, the strongly relevant ones, for which I(Xm ; Y |V −m ) > 0, and the weakly relevant variables, which are relevant but not strongly relevant. 4 5 Random Forest variants In this section, we consider and discuss variable importances as computed with other types of ensembles of randomized trees. We first show how our results extend to any other impurity measure, and then analyze importances computed by depth-pruned ensemble of randomized trees and those computed by randomized trees built on random subspaces of fixed size. Finally, we discuss the case of non-totally randomized trees. 5.1 Generalization to other impurity measures Although our characterization in sections 3 and 4 uses Shannon entropy as the impurity measure, we show in Appendix I that theorems 1, 3 and 5 hold for other impurity measures, simply substituting conditional mutual information for conditional impurity reduction in the different formulas and in the definition of irrelevant variables. In particular, our results thus hold for the Gini index in classification and can be extended to regression problems using variance as the impurity measure. 5.2 Pruning and random subspaces In sections 3 and 4, we considered totally randomized trees that were fully developed, i.e. until all p variables were used within each branch. When totally randomized trees are developed only up to some smaller depth q ≤ p, we show in Proposition 6 that the variable importances as computed by these trees is limited to the q first terms of Equation 3. We then show in Proposition 7 that these latter importances are actually the same as when each tree of the ensemble is fully developed over a random subspace (Ho, 1998) of q variables drawn prior to its construction. Proposition 6. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is: q−1 Imp(Xm ) = k=0 1 1 k p−k Cp I(Xm ; Y |B) (6) B∈Pk (V −m ) Proof. See Appendix G. Proposition 7. The importance of Xm ∈ V for Y as computed with an infinite ensemble of pruned totally randomized trees built up to depth q ≤ p and an infinitely large training sample is identical to the importance as computed for Y with an infinite ensemble of fully developed totally randomized trees built on random subspaces of q variables drawn from V . Proof. See Appendix H. As long as q ≥ r (where r denotes the number of relevant variables in V ), it can easily be shown that all relevant variables will still obtain a strictly positive importance, which will however differ in general from the importances computed by fully grown totally randomized trees built over all variables. Also, each irrelevant variable of course keeps an importance equal to zero, which means that, in asymptotic conditions, these pruning and random subspace methods would still allow us identify the relevant variables, as long as we have a good upper bound q on r. 5.3 Non-totally randomized trees In our analysis in the previous sections, trees are built totally at random and hence do not directly relate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al., 2006). 
5.3 Non-totally randomized trees

In our analysis in the previous sections, trees are built totally at random and hence do not directly relate to those built in Random Forests (Breiman, 2001) or in Extra-Trees (Geurts et al., 2006). To better understand the importances as computed by those algorithms, let us consider a close variant of totally randomized trees: at each node t, let us instead draw uniformly at random 1 ≤ K ≤ p variables and let us choose the one that maximizes ∆i(t). Notice that, for K = 1, this procedure amounts to building ensembles of totally randomized trees as defined before, while, for K = p, it amounts to building classical single trees in a deterministic way.

First, the importance of Xm ∈ V as computed with an infinite ensemble of such randomized trees is not the same as Equation 3. For K > 1, masking effects indeed appear: at t, some variables are never selected because there always is some other variable for which ∆i(t) is larger. Such effects tend to pull the best variables at the top of the trees and to push the others at the leaves. As a result, the importance of a variable no longer decomposes into a sum including all I(Xm; Y|B) terms. The importance of the best variables decomposes into a sum of their mutual information alone or conditioned only with the best others – but not conditioned with all variables, since they no longer ever appear at the bottom of trees. By contrast, the importance of the least promising variables now decomposes into a sum of their mutual information conditioned only with all variables – but not alone or conditioned with a couple of others, since they no longer ever appear at the top of trees. In other words, because of the guided structure of the trees, the importance of Xm now takes into account only some of the conditioning sets B, which may over- or underestimate its overall relevance.

To make things clearer, let us consider a simple example. Let X1 perfectly explain Y and let X2 be a slightly noisy copy of X1 (i.e., I(X1; Y) ≈ I(X2; Y), I(X1; Y|X2) = ε for some small ε > 0, and I(X2; Y|X1) = 0). Using totally randomized trees, the importances of X1 and X2 are nearly equal – the importance of X1 being slightly higher than the importance of X2:

\mathrm{Imp}(X_1) = \frac{1}{2} I(X_1; Y) + \frac{1}{2} I(X_1; Y \mid X_2) = \frac{1}{2} I(X_1; Y) + \frac{\epsilon}{2}
\mathrm{Imp}(X_2) = \frac{1}{2} I(X_2; Y) + \frac{1}{2} I(X_2; Y \mid X_1) = \frac{1}{2} I(X_2; Y) + 0

In non-totally randomized trees, for K = 2, X1 is always selected at the root node and X2 is always used in its children. Also, since X1 perfectly explains Y, all its children are pure and the reduction of entropy when splitting on X2 is null. As a result, Imp_{K=2}(X1) = I(X1; Y) and Imp_{K=2}(X2) = I(X2; Y|X1) = 0. Masking effects are here clearly visible: the true importance of X2 is masked by X1, as if X2 were irrelevant, while it is only a bit less informative than X1.

As a direct consequence of the example above, for K > 1, it is no longer true that a variable is irrelevant if and only if its importance is zero. In the same way, it can also be shown that the importances become dependent on the number of irrelevant variables. Let us indeed consider the following counter-example: let us add in the previous example an irrelevant variable Xi with respect to {X1, X2} and let us keep K = 2. The probability of selecting X2 at the root node now becomes positive, which means that Imp_{K=2}(X2) now includes I(X2; Y) > 0 and is therefore strictly larger than the importance computed before. For K fixed, adding irrelevant variables dampens masking effects, which thereby makes importances indirectly dependent on the number of irrelevant variables. A rough empirical illustration of the masking effect is sketched below.
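The following rough sketch (ours, not part of the paper's experiments) reproduces the X1/X2 masking example empirically, using scikit-learn's ExtraTreesClassifier as a stand-in for the K-variant described above: at each node, max_features = K features are drawn at random and the best split among them is kept. Note that scikit-learn's MDI importances use Gini impurity by default, weight nodes by sample counts and are normalized to sum to one, so the numbers only qualitatively track the analysis.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
n = 5000
y = rng.randint(0, 2, size=n)
x1 = y.copy()                      # X1 perfectly explains Y
flip = rng.rand(n) < 0.05          # X2 is a slightly noisy copy of X1
x2 = np.where(flip, 1 - x1, x1)
X = np.column_stack([x1, x2])

for K in (1, 2):
    forest = ExtraTreesClassifier(n_estimators=300, max_features=K,
                                  random_state=0, n_jobs=-1).fit(X, y)
    print(f"K={K}: normalized importances ~ {forest.feature_importances_.round(3)}")
# Expected pattern: for K=1 the two importances are of comparable magnitude
# (X1 slightly ahead), while for K=2 X1 masks X2 at the root and the
# importance of X2 is driven towards zero.
```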
In conclusion, the importances as computed with totally randomized trees exhibit properties that, by extension, neither random forests nor extra-trees possess. With totally randomized trees, the importance of Xm only depends on the relevant variables and is 0 if and only if Xm is irrelevant. As we have shown, this may no longer be the case for K > 1. Asymptotically, the use of totally randomized trees for assessing the importance of a variable may therefore be more appropriate. In a finite setting (i.e., a limited number of samples and a limited number of trees), guiding the choice of the splitting variables remains however a sound strategy. In such a case, I(Xm; Y|B) cannot be measured for all Xm and all B. It is therefore pragmatic to promote those that look the most promising – even if the resulting importances may be biased.

6 Illustration on a digit recognition problem

In this section, we consider the digit recognition problem of (Breiman et al., 1984) for illustrating variable importances as computed with totally randomized trees. We verify that they match with our theoretical developments and that they decompose as foretold. We also compare these importances with those computed by an ensemble of non-totally randomized trees, as discussed in section 5.3, for increasing values of K.

Let us consider a seven-segment indicator displaying numerals using horizontal and vertical lights in on-off combinations, as illustrated in Figure 1. Let Y be a random variable taking its value in {0, 1, ..., 9} with equal probability and let X1, ..., X7 be binary variables whose values are each determined univocally given the corresponding value of Y in Table 1.

Figure 1: 7-segment display (segments labelled X1 to X7).

Table 1: Values of Y, X1, ..., X7

y | x1 x2 x3 x4 x5 x6 x7
0 |  1  1  1  0  1  1  1
1 |  0  0  1  0  0  1  0
2 |  1  0  1  1  1  0  1
3 |  1  0  1  1  0  1  1
4 |  0  1  1  1  0  1  0
5 |  1  1  0  1  0  1  1
6 |  1  1  0  1  1  1  1
7 |  1  0  1  0  0  1  0
8 |  1  1  1  1  1  1  1
9 |  1  1  1  1  0  1  1

Table 2: Variable importances as computed with an ensemble of randomized trees, for increasing values of K. Importances at K = 1 follow their theoretical values, as predicted by Equation 3 in Theorem 1. However, as K increases, importances diverge due to masking effects. In accordance with Theorem 2, their sum is also always equal to I(X1, ..., X7; Y) = H(Y) = log2(10) = 3.321 since the inputs allow the output to be predicted perfectly.

    | Eqn. 3  K=1    K=2    K=3    K=4    K=5    K=6    K=7
X1  | 0.412   0.414  0.362  0.327  0.309  0.304  0.305  0.306
X2  | 0.581   0.583  0.663  0.715  0.757  0.787  0.801  0.799
X3  | 0.531   0.532  0.512  0.496  0.489  0.483  0.475  0.475
X4  | 0.542   0.543  0.525  0.484  0.445  0.414  0.409  0.412
X5  | 0.656   0.658  0.731  0.778  0.810  0.827  0.831  0.835
X6  | 0.225   0.221  0.140  0.126  0.122  0.122  0.121  0.120
X7  | 0.372   0.368  0.385  0.392  0.387  0.382  0.375  0.372
Sum | 3.321   3.321  3.321  3.321  3.321  3.321  3.321  3.321

Since Table 1 perfectly defines the data distribution, and given the small dimensionality of the problem, it is practicable to directly apply Equation 3 to compute variable importances (a short script doing so is sketched below). To verify our theoretical developments, we then compare in Table 2 variable importances as computed by Equation 3 and those yielded by an ensemble of 10000 totally randomized trees (K = 1). Note that given the known structure of the problem, building trees on a sample of finite size that perfectly follows the data distribution amounts to building them on a sample of infinite size. At best, trees can thus be built on a 10-sample dataset, containing exactly one sample for each of the equiprobable outcomes of Y.
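The sketch below (ours) rebuilds the 7-segment problem of Table 1 as an exact joint distribution and evaluates Equation 3 on it; it assumes the entropy/marginal/cond_mi/importance helpers from the first sketch are in scope. Up to rounding, the output should match the "Eqn. 3" column of Table 2 and sum to H(Y) = log2(10) ≈ 3.321 bits.

```python
# y: (x1, ..., x7), as in Table 1 (standard 7-segment encoding).
SEGMENTS = {
    0: (1, 1, 1, 0, 1, 1, 1), 1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1), 3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0), 5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1), 7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1), 9: (1, 1, 1, 1, 0, 1, 1),
}
p, y_pos = 7, 7
# Y is uniform on {0, ..., 9} and the X's are deterministic given Y.
joint = {xs + (digit,): 0.1 for digit, xs in SEGMENTS.items()}

imps = [importance(joint, m, p, y_pos) for m in range(p)]
for m, imp in enumerate(imps, start=1):
    print(f"Imp(X{m}) = {imp:.3f}")
print(f"sum = {sum(imps):.3f}")   # ~ 3.321 bits, i.e. H(Y)
```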
As the table illustrates, the importances yielded by totally randomized trees match those computed by Equation 3, which confirms Theorem 1. Small differences stem from the fact that a finite number of trees were built in our simulations, but those discrepancies should disappear as the size of the ensemble grows towards infinity. It also shows that importances indeed add up to I(X1, ..., X7; Y), which confirms Theorem 2. Regarding the actual importances, they indicate that X5 is stronger than all others, followed – in that order – by X2, X4 and X3, which also show large importances. X1, X7 and X6 appear to be the least informative.

The table also reports importances for increasing values of K. As discussed before, we see that a large value of K yields importances that can be either overestimated (e.g., at K = 7, the importances of X2 and X5 are larger than at K = 1) or underestimated due to masking effects (e.g., at K = 7, the importances of X1, X3, X4 and X6 are smaller than at K = 1, as if they were less important). It can also be observed that masking effects may even induce changes in the variable rankings (e.g., compare the rankings at K = 1 and at K = 7), which thus confirms that importances are differently affected.

To better understand why a variable is important, it is also insightful to look at its decomposition into its p sub-importance terms, as shown in Figure 2. Each row in the plots of the figure corresponds to one of the p = 7 variables and each column to a size k of conditioning sets. As such, the value at row m and column k corresponds to the importance of Xm when conditioned with k other variables (i.e., to the term $\frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} I(X_m; Y|B)$ in Equation 3 in the case of totally randomized trees).

Figure 2: Decomposition of variable importances along the degrees k of interactions of one variable with the other ones (left panel: K = 1; right panel: K = 7; rows X1 to X7, columns k = 0, ..., 6). At K = 1, all I(Xm; Y|B) are accounted for in the total importance, while at K = 7 only some of them are taken into account due to masking effects.

In the left plot, for K = 1, the figure first illustrates how importances yielded by totally randomized trees decompose along the degrees k of interaction terms. We can observe that these terms each contribute (at most) equally to the total importance of a variable. The plot also illustrates why X5 is important: it is informative either alone or conditioned with any combination of the other variables (all of its terms are significantly larger than 0). By contrast, it also clearly shows why X6 is not important: neither alone nor combined with others does X6 seem to be very informative (all of its terms are close to 0). More interestingly, this figure also highlights redundancies: X7 is informative alone or conditioned with a couple of others (the first terms are significantly larger than 0), but becomes uninformative when conditioned with many others (the last terms are closer to 0).

The right plot, for K = 7, illustrates the decomposition of importances when variables are chosen in a deterministic way. The first thing to notice is masking effects. Some of the I(Xm; Y|B) terms are indeed clearly never encountered and their contribution is therefore reduced to 0 in the total importance. For instance, for k = 0, the sub-importances of X2 and X5 are positive, while all others are null, which means that only those two variables are ever selected at the root node, hence masking the others.
As a consequence, this also means that the importances of the remaining variables are biased and that they actually account only for their relevance when conditioned on X2 or X5, but not for their relevance in other contexts. At k = 0, masking effects also amplify the contribution of I(X2; Y) (resp. I(X5; Y)) since X2 (resp. X5) appears more frequently at the root node than in totally randomized trees. In addition, because nodes become pure before reaching depth p, conditioning sets of size k ≥ 4 are never actually encountered, which means that we can no longer know whether variables are still informative when conditioned with many others. All in all, this figure thus indeed confirms that importances as computed with non-totally randomized trees take into account only some of the conditioning sets B, hence biasing the measured importances.

7 Conclusions

In this work, we made a first step towards understanding variable importances as computed with a forest of randomized trees. In particular, we derived a theoretical characterization of the Mean Decrease Impurity importances as computed by totally randomized trees in asymptotic conditions. We showed that they offer a three-level decomposition of the information jointly provided by all input variables about the output (Section 3). We then demonstrated (Section 4) that MDI importances as computed by totally randomized trees exhibit desirable properties for assessing the relevance of a variable: the importance of a variable is equal to zero if and only if the variable is irrelevant, and it depends only on the relevant variables. We discussed the case of Random Forests and Extra-Trees (Section 5) and finally illustrated our developments on an artificial but insightful example (Section 6).

There remain several limitations to our framework that we would like to address in the future. First, our results should be adapted to binary splits as used within an actual Random Forest-like algorithm. In this setting, any node t is split in only two subsets, which means that any variable may then appear one or several times within a branch, and thus should make variable importances now dependent on the cardinalities of the input variables. In the same direction, our framework should also be extended to the case of continuous variables. Finally, results presented in this work are valid in an asymptotic setting only. An important direction of future work includes the characterization of the distribution of variable importances in a finite setting.

Acknowledgements. Gilles Louppe is a research fellow of the FNRS (Belgium) and acknowledges its financial support. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science Policy Office.

References

Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research, 13:1063–1095.

Biau, G., Devroye, L., and Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. The Journal of Machine Learning Research, 9:2015–2033.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.

Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.

Breiman, L. (2002). Manual on setting up, using, and understanding random forests v3.1. Statistics Department, University of California, Berkeley, CA, USA.

Breiman, L. (2004). Consistency for a simple model of random forests. Technical report, UC Berkeley.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees.

Genuer, R., Poggi, J.-M., and Tuleau-Malot, C.
(2010). Variable selection using random forests. Pattern Recognition Letters, 31(14):2225–2236.

Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42.

Ho, T. (1998). The random subspace method for constructing decision forests. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20(8):832–844.

Ishwaran, H. (2007). Variable importance in binary regression trees and forests. Electronic Journal of Statistics, 1:519–537.

Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1):273–324.

Liaw, A. and Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3):18–22.

Lin, Y. and Jeon, Y. (2006). Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101(474):578–590.

Meinshausen, N. (2006). Quantile regression forests. The Journal of Machine Learning Research, 7:983–999.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830.

Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., and Zeileis, A. (2008). Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307.

Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics, 8(1):25.

White, A. P. and Liu, W. Z. (1994). Technical note: Bias in information-based measures in decision tree induction. Machine Learning, 15(3):321–329.

Zhao, G. (2000). A new perspective on classification. PhD thesis, Utah State University, Department of Mathematics and Statistics.