nips nips2008 nips2008-112 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola
Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. [sent-10, score-0.114]
2 In this paper, we extend this criterion to deal with structured and interdependent observations. [sent-11, score-0.211]
3 This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. [sent-12, score-0.17]
4 We apply this new criterion to independent component analysis and sequence clustering. [sent-13, score-0.084]
5 1 Introduction Statistical dependence measures have been proposed as a unifying framework to address many machine learning problems. [sent-14, score-0.118]
6 For instance, clustering can be viewed as a problem where one strives to maximize the dependence between the observations and a discrete set of labels [14]. [sent-15, score-0.235]
7 Conversely, if labels are given, feature selection can be achieved by finding a subset of features in the observations which maximize the dependence between labels and features [15]. [sent-16, score-0.197]
8 Likewise, blind source separation (BSS) tries to unmix independent sources, which requires a contrast function quantifying the dependence of the unmixed signals. [sent-18, score-0.246]
9 Assume that the observations xt are drawn iid from a uniform distribution on {0, 1} and yt is determined by an XOR operation via yt = xt ⊗ xt−1 . [sent-28, score-1.774]
10 Algorithms which treat the observation pairs {(xt, yt)}, t = 1, 2, . . . , as iid will consider the random variables x, y as independent. [sent-29, score-0.669]
11 However, it is trivial to detect the XOR dependence by using the information that xi and yi are, in fact, sequences. [sent-30, score-0.118]
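To make the XOR example concrete, the following is a minimal simulation sketch (Python/numpy, with our own toy sample size and variable names, not code from the paper): treated as iid pairs, (xt, yt) show essentially zero correlation, while the sequence view exposes the deterministic relation.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000
x = rng.integers(0, 2, size=T)        # x_t drawn iid uniform on {0, 1}
y = np.zeros(T, dtype=int)
y[1:] = x[1:] ^ x[:-1]                # y_t = x_t XOR x_{t-1}

# Treating (x_t, y_t) as iid pairs: x_t and y_t are marginally independent,
# so the sample correlation is close to zero.
print("corr(x_t, y_t):", np.corrcoef(x, y)[0, 1])

# Using the sequence structure, y_t is recovered exactly from (x_t, x_{t-1}).
print("y_t == x_t XOR x_{t-1}:", bool(np.all(y[1:] == (x[1:] ^ x[:-1]))))
```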
12 In view of its importance, temporal correlation has been exploited in the independence test for blind source separation. [sent-31, score-0.147]
13 In this paper, we propose a framework which extends HSIC to structured non-iid data. [sent-35, score-0.178]
14 For instance x and y may represent sequences or a mesh for which we wish to establish dependence. [sent-45, score-0.111]
15 To assess whether x and y are independent we briefly review the notion of Hilbert Space embeddings of distributions [6]. [sent-46, score-0.103]
16 Subsequently we discuss properties of the expectation operator in the case of conditionally independent random variables which will lead to a template for a dependence measure. [sent-47, score-0.202]
17 The following theorem shows that the map is injective [16] for a large class of kernels such as Gaussian and Laplacian RBF. [sent-52, score-0.108]
18 1 Exponential Families We are interested in the properties of µ[p] in the case where p satisfies the conditional independence relations specified by an undirected graphical model. [sent-55, score-0.229]
19 In [2], it is shown that for this case the sufficient statistics decompose along the maximal cliques of the conditional independence graph. [sent-56, score-0.404]
20 More formally, denote by C the set of maximal cliques of the graph G and let zc be the restriction of z ∈ Z to the variables on clique c ∈ C. [sent-57, score-0.415]
21 Moreover, let vc be universal kernels in the sense of [17] acting on the restrictions of Z on clique c ∈ C. [sent-58, score-0.187]
22 In this case, [2] showed that v(z, z′) = Σc∈C vc(zc, z′c) (3) can be used to describe all probability distributions with the above mentioned conditional independence relations using an exponential family model with v as its kernel. [sent-59, score-0.275]
23 Corollary 2 implies that we will be able to perform all subsequent operations on structured domains simply by dealing with mean operators on the corresponding maximal cliques. [sent-63, score-0.254]
24 Similarly, we can quantify the strength of dependence between random variables x and y by simply measuring the squared distance between the RKHS embeddings of the joint distribution p(x, y) and the product of the marginals p(x) · p(y), via I(x, y) := ||µ[p(x, y)] − µ[p(x)p(y)]||² in H. [sent-68, score-0.227]
25 (7) Estimates for Special Structures: To illustrate the versatility of our approach we apply our model to a number of graphical models, ranging from independent random variables to meshes, proceeding according to the following recipe: 1. [sent-74, score-0.112]
26 Independent and Identically Distributed Data: As the simplest case, we first consider the graphical model in Figure 1b, where {(xt, yt)}, t = 1, . . . , T, are iid random variables. [sent-85, score-0.697]
27 Correspondingly the maximal cliques are {(xt, yt)}, t = 1, . . . , T. [sent-86, score-0.722]
28 We choose the joint kernel on the cliques to be vt((xt, yt), (x′t, y′t)) := k(xt, x′t) l(yt, y′t), hence v((x, y), (x′, y′)) = Σt=1,...,T k(xt, x′t) l(yt, y′t). [sent-87, score-2.733]
29 (8) The representation for vt implies that we are taking an outer product between the Hilbert Spaces on xt and yt induced by kernels k and l respectively. [sent-88, score-0.848]
30 If the pairs of random variables (xt , yt ) are not identically distributed, all that is left is to use (8) to obtain an empirical estimate via (7). [sent-89, score-0.515]
31 We may improve the estimate considerably if we are able to assume that all pairs (xt , yt ) are drawn from the same distribution p(xt , yt ). [sent-90, score-0.912]
32 Consequently all coordinates of the mean map are identical and we can use all the data to estimate just one of the discrepancies ||µc[pc(xc, yc)] − µc[pc(xc)pc(yc)]||². [sent-91, score-0.249]
33 The latter expression is identical to the standard HSIC criterion and we obtain the biased estimate Î(x, y) = (1/T) tr HKHL, where Kst := k(xs, xt), Lst := l(ys, yt) and Hst := δst − 1/T. [sent-92, score-0.871]
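A minimal numpy sketch of this biased estimate (our own toy data and bandwidths; the overall constant only rescales the statistic, so any fixed normalization works for comparing dependence values):

```python
import numpy as np

def gaussian_gram(Z, h):
    """Gram matrix with entries exp(-||z_s - z_t||^2 / (2 h^2))."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * h ** 2))

def hsic_biased(X, Y, hx=1.0, hy=1.0):
    """Biased HSIC estimate (1/T) tr(HKHL) with H_st = delta_st - 1/T."""
    T = X.shape[0]
    H = np.eye(T) - np.ones((T, T)) / T
    return np.trace(H @ gaussian_gram(X, hx) @ H @ gaussian_gram(Y, hy)) / T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y_indep = rng.normal(size=(200, 1))
Y_dep = X ** 2 + 0.1 * rng.normal(size=(200, 1))
print(hsic_biased(X, Y_indep))  # independent pair: smaller value
print(hsic_biased(X, Y_dep))    # dependent pair: clearly larger value
```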
34 2 Sequence Data: A more interesting application beyond iid data is sequences with a Markovian dependence as depicted in Figure 1c. [sent-94, score-0.332]
35 Here the maximal cliques are the sets {(xt, xt+1, yt, yt+1)}, t = 1, . . . , T − 1. [sent-95, score-0.722]
36 More generally, for longer range dependency of order τ ∈ N, the maximal cliques will involve the random variables (xt, . . . , xt+τ, yt, . . . , yt+τ). [sent-96, score-0.296]
37 We assume homogeneity and stationarity of the random variables: that is, all cliques share the same sufficient statistics (feature map) and their expected value is identical. [sent-103, score-0.232]
38 Stationarity means that µc [pc (xc , yc )] and µc [pc (xc )pc (yc )] are the same for all cliques c, hence I(x, y) is a multiple of the difference for a single clique. [sent-105, score-0.439]
39 Using the same argument as in the iid case, we can obtain a biased estimate of the measure of dependence by using Kij = k(xi,τ , xj,τ ) and Lij = l(yi,τ , yj,τ ) instead of the definitions of K and L in (9). [sent-106, score-0.301]
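A sketch of this lagged construction (reusing the gaussian_gram and hsic_biased helpers from the earlier sketch; we read xi,τ as the stacked window (xi, . . . , xi+τ), which is an assumption about the notation):

```python
import numpy as np

def lagged(Z, tau):
    """Stack (z_i, ..., z_{i+tau}) into one row per clique variable."""
    T = Z.shape[0]
    return np.hstack([Z[u:T - tau + u] for u in range(tau + 1)])

def structured_hsic(X, Y, tau, hx=1.0, hy=1.0):
    """Biased structured-HSIC estimate with Markov cliques of order tau."""
    # hsic_biased is the iid estimator sketched earlier.
    return hsic_biased(lagged(X, tau), lagged(Y, tau), hx, hy)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = np.roll(x, 1, axis=0) + 0.2 * rng.normal(size=(500, 1))  # y_t ~ x_{t-1} + noise
print(hsic_biased(x, y))             # same-time pairs are independent here
print(structured_hsic(x, y, tau=1))  # lag-1 cliques expose the dependence
```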
40 Σ(t,u,v,w) [ k(xt, xu) l(yt, yu) + k(xt, xu) l(yv, yw) − 2 k(xt, xu) l(yt, yv) ], and the latter sum denotes all ordered quadruples (t, u, v, w) drawn from (1, 2, 3, 4). [sent-119, score-0.145]
41 The theorem implies that in expectation h takes on the value of the dependence measure. [sent-120, score-0.118]
42 In general, an RBF kernel will lead to an effective criterion for measuring the dependence between random variables, especially in time-series applications. [sent-130, score-0.218]
43 For a specific choice of cliques and kernels, we can recover the work of [18] as a special case of our framework. [sent-132, score-0.19]
44 In [18], for two centered scalar time series x and y, the contrast function is chosen as the sum of same-time and time-lagged cross-covariance E[xt yt ]+E[xt yt+τ ]. [sent-133, score-0.567]
45 Using our framework, two types of cliques, (xt , yt ) and (xt , yt+τ ), are considered in the corresponding graphical model. [sent-134, score-0.514]
46 Furthermore, we use a joint kernel of the form ⟨xs, xt⟩⟨ys, yt⟩ + ⟨xs, xt⟩⟨ys+τ, yt+τ⟩, (14) which leads to the estimator of structured HSIC: Î(x, y) = (1/T)(tr HKHL + tr HKHLτ). [sent-135, score-1.594]
47 Here Lτ denotes the linear covariance matrix for the time lagged y signals. [sent-136, score-0.078]
48 For scalar time series, basic algebra shows that tr HKHL and tr HKHLτ are the estimators of E[xt yt ] and E[xt yt+τ ] respectively (up to a multiplicative constant). [sent-137, score-0.599]
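A small numerical check of this reduction (toy data and the wrap-around lag via np.roll are our own shortcuts): with centered scalar series and linear kernels, the two traces equal the squared empirical same-time and lag-τ cross-covariances, which is the sense in which they track E[xt yt] and E[xt yt+τ].

```python
import numpy as np

rng = np.random.default_rng(1)
T, tau = 1000, 3
x = rng.normal(size=T)
y = 0.7 * np.roll(x, tau) + 0.3 * rng.normal(size=T)  # y_{t+tau} tracks x_t

x = x - x.mean()
y = y - y.mean()
H = np.eye(T) - np.ones((T, T)) / T
K = np.outer(x, x)                      # linear kernel on x
L = np.outer(y, y)                      # same-time linear kernel on y
y_lag = np.roll(y, -tau)                # y_{t+tau} (wrapping at the boundary)
L_tau = np.outer(y_lag, y_lag)          # lag-tau linear kernel on y

# For centered scalar series these identities hold exactly:
print(np.trace(H @ K @ H @ L),     (x @ y) ** 2)       # both equal (sum_t x_t y_t)^2
print(np.trace(H @ K @ H @ L_tau), (x @ y_lag) ** 2)   # both equal (sum_t x_t y_lag_t)^2
```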
49 Further generalization can incorporate several time lagged cross-covariances into the contrast function. [sent-138, score-0.11]
50 That said, by using a nonlinear kernel we are able to obtain better contrast functions, as we will show in our experiments. [sent-140, score-0.099]
51 4 Grid Structured Data Structured HSIC can go beyond sequence data and be applied to more general dependence structures such as 2-D grids for images. [sent-142, score-0.118]
52 In the simplest case, the maximal cliques are C = {(xij , xi+1,j , xi,j+1 , xi+1,j+1 , yij , yi+1,j , yi,j+1 , yi+1,j+1 )}ij . [sent-145, score-0.266]
53 Provided that the kernel v can also be decomposed into the product of k and l, then a biased estimate of the independence measure can be again formulated as tr HKHL up to a multiplicative constant. [sent-147, score-0.237]
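A sketch of the same recipe on a 2-D grid (reusing hsic_biased from the earlier sketch; patch size, toy images and the coupling below are our own choices): each 2 × 2 patch of x, paired with the corresponding patch of y, plays the role of one clique.

```python
import numpy as np

def patch_cliques(img, size=2):
    """Flatten every size x size patch of a 2-D array into one clique variable."""
    H_, W_ = img.shape
    return np.array([img[i:i + size, j:j + size].ravel()
                     for i in range(H_ - size + 1)
                     for j in range(W_ - size + 1)])

rng = np.random.default_rng(0)
x_img = rng.normal(size=(20, 20))
y_img = np.roll(x_img, 1, axis=0) + 0.5 * rng.normal(size=(20, 20))  # shifted copy

# Pixel-wise pairs (x_ij, y_ij) are independent here, but 2x2 patch cliques
# overlap the one-row shift and therefore reveal the spatial dependence.
print(hsic_biased(x_img.reshape(-1, 1), y_img.reshape(-1, 1)))
print(hsic_biased(patch_cliques(x_img), patch_cliques(y_img)))
```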
54 4 Experiments Having a dependence measure for structured spaces is useful for a range of applications. [sent-151, score-0.296]
55 Analogous to iid HSIC, structured HSIC can be applied to non-iid data in applications such as independent component analysis [12], independence test [6], feature selection [15], clustering [14], and dimensionality reduction [13]. [sent-152, score-0.59]
56 The fact that structured HSIC can take into account the interdependency between observations provides us with a principled generalization of these algorithms to, e.g., structured non-iid domains such as time series. [sent-153, score-0.205]
57 In this paper, we will focus on two examples: independent component analysis, where we wish to minimize the dependence, and time series segmentation, where we wish to maximize the dependence instead. [sent-156, score-0.296]
58 Two simple illustrative experiments on independence tests for an XOR binary sequence and a Gaussian process can be found in the longer version of this paper. [sent-157, score-0.114]
59 Based on the series of observations t, we wish to recover the sources using only the independence assumption on s. [sent-160, score-0.289]
60 Note that sources can only be recovered up to scaling and permutation. [sent-161, score-0.076]
61 The core of ICA is a contrast function that measures the independence of the estimated sources. [sent-162, score-0.146]
62 Thus, we propose to use structured HSIC as the contrast function for ICA. [sent-164, score-0.21]
63 By incorporating time lagged variables in the cliques, we expect that structured HSIC can better deal with the non-iid nature of time series. [sent-165, score-0.317]
64 Table 1: Median performance of ICA on music using HSIC, TDSEP, and structured HSIC. [sent-168, score-0.202]
65 In the top row, the number n of sources and m of samples are given. [sent-169, score-0.076]
66 In the second row, the number of time lags τ used by TDSEP and structured HSIC is given: thus the observation vectors xt, xt−1, . . . , xt−τ are taken into account. [sent-170, score-0.243]
67 The original HSIC method does not take into account time dependence (τ = 0), and returns a single performance number. [sent-175, score-0.149]
68 Results are in all cases averaged over 136 repetitions: for two sources, this represents all possible pairings, whereas for larger n the sources are chosen at random without replacement. [sent-176, score-0.076]
69 We unmixed various musical sources, combined using a randomly generated orthogonal matrix A (since optimization over the orthogonal part of a general mixing matrix is the more difficult step in ICA). [sent-199, score-0.072]
70 We used the sum of pairwise dependencies as the overall contrast function when more than two sources were present. [sent-201, score-0.108]
71 Methods We compared structured HSIC to TD-SEP and iid HSIC. [sent-202, score-0.361]
72 While iid HSIC does not take the temporal dependence in the signal into account, it has been shown to perform very well for iid data [12]. [sent-203, score-0.484]
73 Following [7], we employed a Laplace kernel, k(x, x′) = exp(−λ||x − x′||) with λ = 3, for both structured and iid HSIC. [sent-204, score-0.361]
74 For both structured and iid HSIC, we used gradient descent over the orthogonal group with a Golden search, and low rank Cholesky decompositions of the Gram matrices to reduce computational cost, as in [3]. [sent-205, score-0.361]
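As a hedged, much-simplified sketch of this pipeline (two sources only, a grid search over the demixing rotation instead of gradient descent over the orthogonal group with Golden search and low-rank Cholesky factors, and toy AR(1) sources of our own choosing; reuses structured_hsic from the sequence sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 400

def ar1(phi, innovations):
    """AR(1) source with the given innovation sampler."""
    e = innovations(size=T)
    s = np.zeros(T)
    for t in range(1, T):
        s[t] = phi * s[t - 1] + e[t]
    return s

s = np.vstack([ar1(0.9, rng.laplace),
               ar1(0.5, lambda size: rng.uniform(-1.0, 1.0, size))])
theta_true = 0.6
A = np.array([[np.cos(theta_true), -np.sin(theta_true)],
              [np.sin(theta_true),  np.cos(theta_true)]])   # orthogonal mixing
t_obs = A @ s

def contrast(theta, tau=1):
    """Structured-HSIC dependence between the two demixed signals."""
    W = np.array([[ np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    u = W @ t_obs
    return structured_hsic(u[0][:, None], u[1][:, None], tau)

thetas = np.linspace(0.0, np.pi / 2, 46)
theta_hat = min(thetas, key=contrast)   # minimize dependence of the unmixed signals
print(theta_hat, theta_true)            # theta_hat should land near theta_true
```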
75 Overall, contrast functions that take time delayed information into account perform best, although the best time lag is different when the number of sources varies. [sent-210, score-0.17]
76 2 Time Series Clustering and Segmentation We can also extend clustering to time series and sequences using structured HSIC. [sent-212, score-0.352]
77 This is carried out in a similar way to the iid case. [sent-213, score-0.183]
78 One can formulate clustering as generating the labels y from a finite discrete set, such that their dependence on x is maximized [14]: maximize over y the quantity tr HKHL, subject to constraints on y. [sent-214, score-0.264]
79 More specifically, assuming Lst := δ(ys , yt ) for discrete labels y, we recover clustering. [sent-216, score-0.482]
80 Relaxing discrete labels to yt ∈ R with bounded norm ||y||2 and setting Lst := ys yt, we obtain Principal Component Analysis. [sent-217, score-1.039]
81 This reasoning for iid data carries over to sequences by introducing additional dependence structure through the kernels: Kst := k(xs,τ , xt,τ ) and Lst := l(ys,τ , yt,τ ). [sent-218, score-0.332]
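A sketch of the relaxed variant just described (reusing gaussian_gram and lagged from the earlier sketches; window length, bandwidth and the toy two-regime signal are our own choices, and the paper's actual clustering works with discrete labels rather than this PCA-style relaxation): with Lst = ys yt and a norm constraint on y, maximizing tr HKHL amounts to taking the top eigenvector of HKH built from the lagged x-kernel, and thresholding it yields a crude two-segment labelling.

```python
import numpy as np

def relaxed_segmentation(X, tau=5, h=1.0):
    """PCA-style relaxation of max_y tr(HKHL): top eigenvector of HKH."""
    Xc = lagged(X, tau)                    # lagged clique variables on x
    T = Xc.shape[0]
    H = np.eye(T) - np.ones((T, T)) / T
    M = H @ gaussian_gram(Xc, h) @ H
    _, V = np.linalg.eigh(M)
    y = V[:, -1]                           # maximizes y^T (HKH) y with ||y|| = 1
    return (y > 0).astype(int)             # threshold into two crude segments

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(300, 1)),
               rng.normal(3.0, 1.0, size=(300, 1))])   # two regimes back to back
labels = relaxed_segmentation(X)
print(labels[:5], labels[-5:])   # the two halves should mostly receive different labels
```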
82 However, for a class of kernels l an efficient decomposition can be found by applying a reverse convolution on k: assume that l is given by l(ys,τ, yt,τ) = Σu,v=0,...,τ l̄(ys+u, yt+v) Muv, (16) where M ∈ R(τ+1)×(τ+1) with M ⪰ 0, and l̄ is a base kernel between individual time points. [sent-220, score-0.164]
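A minimal sketch of evaluating the decomposed kernel in (16), together with a numerical check of the reverse-convolution rearrangement used in (17) below (the choice of M, base kernel and toy series is illustrative only):

```python
import numpy as np

def conv_clique_kernel(y, M, base_l):
    """L[s, t] = sum_{u,v} base_l(y[s+u], y[t+v]) * M[u, v]   (cf. eq. 16)."""
    tau = M.shape[0] - 1
    T = len(y) - tau                          # number of clique variables y_{s,tau}
    Lbar = base_l(y[:, None], y[None, :])     # base kernel on single time points
    L = np.zeros((T, T))
    for u in range(tau + 1):
        for v in range(tau + 1):
            L += M[u, v] * Lbar[u:u + T, v:v + T]
    return L

rng = np.random.default_rng(0)
y = rng.normal(size=50)
M = np.eye(3)                                 # tau = 2, a simple PSD weighting
base_l = lambda a, b: np.exp(-(a - b) ** 2)
L = conv_clique_kernel(y, M, base_l)

# Check the rearrangement: sum_{s,t} [HKH]_{st} L_{st} equals
# sum_{i,j} base_l(y_i, y_j) * Kbar_{ij}, with Kbar = M "reverse-convolved" into HKH.
# (sum(HKH * L) == tr(HKH @ L) here since both matrices are symmetric.)
T = L.shape[0]
x = rng.normal(size=T)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)   # any T x T Gram matrix on x works here
H = np.eye(T) - np.ones((T, T)) / T
HKH = H @ K @ H
Kbar = np.zeros((len(y), len(y)))
for u in range(M.shape[0]):
    for v in range(M.shape[0]):
        Kbar[u:u + T, v:v + T] += M[u, v] * HKH
print(np.allclose(np.sum(HKH * L), np.sum(base_l(y[:, None], y[None, :]) * Kbar)))  # True
```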
83 Then tr HKHL = Σs,t=1,...,T [HKH]st Σu,v=0,...,τ l̄(ys+u, yt+v) Muv = Σs,t=1,...,T+τ l̄(ys, yt) K̄st, where K̄st := Σu,v=0,...,τ; s−u,t−v∈[1,T] Muv [HKH]s−u,t−v. (17) Table 2: Segmentation errors by various methods on the four studied time series. [sent-224, score-0.487]
84 (Table 2 layout: columns Swimming I, Swimming II, Swimming III, BCI; rows structured HSIC, spectral clustering, HMM; the numerical segmentation errors are omitted here.)
85 The three time series we used in our experiment have the following configurations: T = 23000 time steps with 4 laps; T = 47000 time steps with 16 laps; and T = 67000 time steps with 20 laps. [sent-238, score-0.172]
86 The task is to automatically find the starting and finishing time of each lap based on the sensor signals. [sent-239, score-0.07]
87 Methods We compared three algorithms: structured HSIC for clustering, spectral clustering [10], and HMM. [sent-248, score-0.293]
88 For structured HSIC, we used the maximal cliques of (xt , yt−50,100 ), where y is the discrete label sequence to be generated. [sent-249, score-0.444]
89 As a baseline, we used a spectral clustering with the same kernel k on x, and a first-order HMM with 6 hidden states and a diagonal Gaussian observation model. [sent-252, score-0.182]
90 Results To evaluate the segmentation quality, the boundaries found by various methods were compared to the ground truth. [sent-254, score-0.169]
91 According to Table 2, in all of the four time series we studied, segmentation using structured HSIC leads to lower error compared with spectral clustering and HMM. [sent-258, score-0.473]
92 For instance, structured HSIC reduces the segmentation error by nearly 1/3 on the BCI dataset. [sent-259, score-0.279]
93 To provide a visual feel for the improvement, we plot the true boundaries together with the segmentation results in Figures 2a, 2b and 2c. [sent-260, score-0.125]
94 Clearly, segment boundaries produced by structured HSIC fit better with the ground truth. [sent-261, score-0.272]
95 5 Conclusion In this paper, we extended the Hilbert Schmidt Independence Criterion from iid data to structured and non-iid data. [sent-262, score-0.361]
96 Our approach is based on RKHS embeddings of distributions, and utilizes the efficient factorizations provided by the exponential family associated with undirected graphical models. [sent-263, score-0.199]
97 Encouraging experimental results were demonstrated on independence test, ICA, and segmentation for time series. [sent-264, score-0.246]
98 Further work will be done in the direction of applying structured HSIC to PCA and feature selection on structured data. [sent-265, score-0.356]
99 Figure 2: Segmentation results produced by (a) structured HSIC, (b) spectral clustering and (c) HMM. [sent-276, score-0.293]
100 The red line denotes the ground truth and the blue line the segmentation result. [sent-278, score-0.175]
wordName wordTfidf (topN-words)
[('yt', 0.456), ('hsic', 0.387), ('xt', 0.326), ('yc', 0.249), ('cliques', 0.19), ('xc', 0.188), ('iid', 0.183), ('structured', 0.178), ('hkhl', 0.137), ('dependence', 0.118), ('pc', 0.118), ('independence', 0.114), ('segmentation', 0.101), ('ys', 0.101), ('tdsep', 0.098), ('gretton', 0.097), ('hkh', 0.085), ('embeddings', 0.079), ('lst', 0.078), ('muv', 0.078), ('ica', 0.077), ('maximal', 0.076), ('sources', 0.076), ('hilbert', 0.076), ('xor', 0.073), ('swimming', 0.068), ('kernel', 0.067), ('clique', 0.066), ('kernels', 0.066), ('clustering', 0.064), ('smola', 0.061), ('rkhs', 0.06), ('kst', 0.059), ('graphical', 0.058), ('tr', 0.056), ('vc', 0.055), ('zc', 0.053), ('spectral', 0.051), ('series', 0.048), ('song', 0.048), ('schmidt', 0.047), ('ez', 0.047), ('lagged', 0.047), ('ground', 0.044), ('xs', 0.042), ('stationarity', 0.042), ('injective', 0.042), ('bci', 0.042), ('amari', 0.042), ('bss', 0.039), ('burton', 0.039), ('dehling', 0.039), ('lap', 0.039), ('laps', 0.039), ('unmixed', 0.039), ('borgwardt', 0.038), ('australian', 0.035), ('lags', 0.034), ('exc', 0.034), ('nishing', 0.034), ('zt', 0.034), ('mixing', 0.033), ('undirected', 0.033), ('blind', 0.033), ('criterion', 0.033), ('divergence', 0.032), ('hmm', 0.032), ('contrast', 0.032), ('time', 0.031), ('mesh', 0.031), ('sequences', 0.031), ('operator', 0.03), ('truth', 0.03), ('variables', 0.03), ('xv', 0.029), ('identically', 0.029), ('families', 0.029), ('xu', 0.029), ('exponential', 0.029), ('arthur', 0.028), ('nicta', 0.028), ('component', 0.027), ('observations', 0.027), ('eeg', 0.026), ('segment', 0.026), ('jmlr', 0.026), ('labels', 0.026), ('instance', 0.025), ('nontrivial', 0.025), ('fukumizu', 0.025), ('functionals', 0.025), ('cholesky', 0.024), ('subscripts', 0.024), ('independent', 0.024), ('conditional', 0.024), ('boundaries', 0.024), ('wish', 0.024), ('unbiased', 0.024), ('australia', 0.024), ('music', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 112 nips-2008-Kernel Measures of Independence for non-iid Data
Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola
Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1
2 0.37459287 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu
Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1
3 0.2299895 113 nips-2008-Kernelized Sorting
Author: Novi Quadrianto, Le Song, Alex J. Smola
Abstract: Object matching is a fundamental operation in data analysis. It typically requires the definition of a similarity measure between the classes of objects to be matched. Instead, we develop an approach which is able to perform matching by requiring a similarity measure only within each of the classes. This is achieved by maximizing the dependency between matched pairs of observations by means of the Hilbert Schmidt Independence Criterion. This problem can be cast as one of maximizing a quadratic assignment problem with special structure and we present a simple algorithm for finding a locally optimal solution. 1
4 0.20548984 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos
Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1
5 0.1714666 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
Author: Giovanni Cavallanti, Nicolò Cesa-bianchi, Claudio Gentile
Abstract: We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N −(1+α)(2+α)/2(3+α) where N denotes the number of √ queried labels, and α > 0 is the exponent in the low noise condition. For all α > 3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N −(1+α)/(2+α) achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors. 1
6 0.16603874 206 nips-2008-Sequential effects: Superstition or rational behavior?
7 0.16031653 57 nips-2008-Deflation Methods for Sparse PCA
8 0.13690239 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
9 0.13174795 15 nips-2008-Adaptive Martingale Boosting
10 0.12020868 168 nips-2008-Online Metric Learning and Fast Similarity Search
11 0.11622795 117 nips-2008-Learning Taxonomies by Dependence Maximization
12 0.11004444 234 nips-2008-The Infinite Factorial Hidden Markov Model
13 0.10519888 164 nips-2008-On the Generalization Ability of Online Strongly Convex Programming Algorithms
14 0.10185349 195 nips-2008-Regularized Policy Iteration
15 0.10011537 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
16 0.089849263 193 nips-2008-Regularized Co-Clustering with Dual Supervision
17 0.087831236 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
18 0.079523943 244 nips-2008-Unifying the Sensory and Motor Components of Sensorimotor Adaptation
19 0.076689482 44 nips-2008-Characteristic Kernels on Groups and Semigroups
20 0.076539382 119 nips-2008-Learning a discriminative hidden part model for human action recognition
topicId topicWeight
[(0, -0.231), (1, 0.018), (2, -0.071), (3, -0.006), (4, -0.091), (5, 0.299), (6, 0.012), (7, 0.178), (8, 0.263), (9, -0.264), (10, 0.029), (11, -0.295), (12, 0.059), (13, -0.029), (14, -0.05), (15, 0.038), (16, 0.083), (17, -0.018), (18, 0.031), (19, -0.007), (20, 0.014), (21, 0.06), (22, -0.046), (23, 0.113), (24, 0.174), (25, 0.009), (26, -0.009), (27, -0.012), (28, -0.044), (29, -0.059), (30, 0.01), (31, 0.026), (32, -0.02), (33, 0.039), (34, 0.043), (35, 0.05), (36, 0.011), (37, 0.044), (38, -0.029), (39, -0.037), (40, 0.097), (41, 0.143), (42, -0.068), (43, 0.05), (44, -0.004), (45, 0.077), (46, -0.06), (47, -0.043), (48, -0.037), (49, -0.079)]
simIndex simValue paperId paperTitle
same-paper 1 0.97326523 112 nips-2008-Kernel Measures of Independence for non-iid Data
Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola
Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1
2 0.78599405 242 nips-2008-Translated Learning: Transfer Learning across Different Feature Spaces
Author: Wenyuan Dai, Yuqiang Chen, Gui-rong Xue, Qiang Yang, Yong Yu
Abstract: This paper investigates a new machine learning strategy called translated learning. Unlike many previous learning tasks, we focus on how to use labeled data from one feature space to enhance the classification of other entirely different learning spaces. For example, we might wish to use labeled text data to help learn a model for classifying image data, when the labeled images are difficult to obtain. An important aspect of translated learning is to build a “bridge” to link one feature space (known as the “source space”) to another space (known as the “target space”) through a translator in order to migrate the knowledge from source to target. The translated learning solution uses a language model to link the class labels to the features in the source spaces, which in turn is translated to the features in the target spaces. Finally, this chain of linkages is completed by tracing back to the instances in the target spaces. We show that this path of linkage can be modeled using a Markov chain and risk minimization. Through experiments on the text-aided image classification and cross-language classification tasks, we demonstrate that our translated learning framework can greatly outperform many state-of-the-art baseline methods. 1
3 0.63315243 57 nips-2008-Deflation Methods for Sparse PCA
Author: Lester W. Mackey
Abstract: In analogy to the PCA setting, the sparse PCA problem is often solved by iteratively alternating between two subtasks: cardinality-constrained rank-one variance maximization and matrix deflation. While the former has received a great deal of attention in the literature, the latter is seldom analyzed and is typically borrowed without justification from the PCA context. In this work, we demonstrate that the standard PCA deflation procedure is seldom appropriate for the sparse PCA setting. To rectify the situation, we first develop several deflation alternatives better suited to the cardinality-constrained context. We then reformulate the sparse PCA optimization problem to explicitly reflect the maximum additional variance objective on each round. The result is a generalized deflation procedure that typically outperforms more standard techniques on real-world datasets. 1
4 0.57460576 177 nips-2008-Particle Filter-based Policy Gradient in POMDPs
Author: Pierre-arnaud Coquelin, Romain Deguest, Rémi Munos
Abstract: Our setting is a Partially Observable Markov Decision Process with continuous state, observation and action spaces. Decisions are based on a Particle Filter for estimating the belief state given past observations. We consider a policy gradient approach for parameterized policy optimization. For that purpose, we investigate sensitivity analysis of the performance measure with respect to the parameters of the policy, focusing on Finite Difference (FD) techniques. We show that the naive FD is subject to variance explosion because of the non-smoothness of the resampling procedure. We propose a more sophisticated FD method which overcomes this problem and establish its consistency. 1
5 0.53096259 206 nips-2008-Sequential effects: Superstition or rational behavior?
Author: Angela J. Yu, Jonathan D. Cohen
Abstract: In a variety of behavioral tasks, subjects exhibit an automatic and apparently suboptimal sequential effect: they respond more rapidly and accurately to a stimulus if it reinforces a local pattern in stimulus history, such as a string of repetitions or alternations, compared to when it violates such a pattern. This is often the case even if the local trends arise by chance in the context of a randomized design, such that stimulus history has no real predictive power. In this work, we use a normative Bayesian framework to examine the hypothesis that such idiosyncrasies may reflect the inadvertent engagement of mechanisms critical for adapting to a changing environment. We show that prior belief in non-stationarity can induce experimentally observed sequential effects in an otherwise Bayes-optimal algorithm. The Bayesian algorithm is shown to be well approximated by linear-exponential filtering of past observations, a feature also apparent in the behavioral data. We derive an explicit relationship between the parameters and computations of the exact Bayesian algorithm and those of the approximate linear-exponential filter. Since the latter is equivalent to a leaky-integration process, a commonly used model of neuronal dynamics underlying perceptual decision-making and trial-to-trial dependencies, our model provides a principled account of why such dynamics are useful. We also show that parameter-tuning of the leaky-integration process is possible, using stochastic gradient descent based only on the noisy binary inputs. This is a proof of concept that not only can neurons implement near-optimal prediction based on standard neuronal dynamics, but that they can also learn to tune the processing parameters without explicitly representing probabilities. 1
6 0.52135736 113 nips-2008-Kernelized Sorting
7 0.42738268 123 nips-2008-Linear Classification and Selective Sampling Under Low Noise Conditions
8 0.42000705 154 nips-2008-Nonparametric Bayesian Learning of Switching Linear Dynamical Systems
9 0.39960337 15 nips-2008-Adaptive Martingale Boosting
10 0.39113313 76 nips-2008-Estimation of Information Theoretic Measures for Continuous Random Variables
11 0.36711985 168 nips-2008-Online Metric Learning and Fast Similarity Search
12 0.33751619 110 nips-2008-Kernel-ARMA for Hand Tracking and Brain-Machine interfacing During 3D Motor Control
13 0.33683676 13 nips-2008-Adapting to a Market Shock: Optimal Sequential Market-Making
14 0.31425545 234 nips-2008-The Infinite Factorial Hidden Markov Model
15 0.30959338 117 nips-2008-Learning Taxonomies by Dependence Maximization
16 0.30907109 195 nips-2008-Regularized Policy Iteration
17 0.29967487 119 nips-2008-Learning a discriminative hidden part model for human action recognition
18 0.29291782 102 nips-2008-ICA based on a Smooth Estimation of the Differential Entropy
19 0.28520805 183 nips-2008-Predicting the Geometry of Metal Binding Sites from Protein Sequence
20 0.28321454 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels
topicId topicWeight
[(5, 0.202), (6, 0.068), (7, 0.068), (12, 0.025), (28, 0.253), (38, 0.012), (57, 0.068), (59, 0.015), (63, 0.02), (71, 0.014), (77, 0.059), (81, 0.022), (83, 0.06)]
simIndex simValue paperId paperTitle
1 0.91916639 83 nips-2008-Fast High-dimensional Kernel Summations Using the Monte Carlo Multipole Method
Author: Dongryeol Lee, Alexander G. Gray
Abstract: We propose a new fast Gaussian summation algorithm for high-dimensional datasets with high accuracy. First, we extend the original fast multipole-type methods to use approximation schemes with both hard and probabilistic error. Second, we utilize a new data structure called subspace tree which maps each data point in the node to its lower dimensional mapping as determined by any linear dimension reduction method such as PCA. This new data structure is suitable for reducing the cost of each pairwise distance computation, the most dominant cost in many kernel methods. Our algorithm guarantees probabilistic relative error on each kernel sum, and can be applied to high-dimensional Gaussian summations which are ubiquitous inside many kernel methods as the key computational bottleneck. We provide empirical speedup results on low to high-dimensional datasets up to 89 dimensions. 1
same-paper 2 0.89665782 112 nips-2008-Kernel Measures of Independence for non-iid Data
Author: Xinhua Zhang, Le Song, Arthur Gretton, Alex J. Smola
Abstract: Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering. 1
3 0.89173532 66 nips-2008-Dynamic visual attention: searching for coding length increments
Author: Xiaodi Hou, Liqing Zhang
Abstract: A visual attention system should respond placidly when common stimuli are presented, while at the same time keeping alert to anomalous visual inputs. In this paper, a dynamic visual attention model based on the rarity of features is proposed. We introduce the Incremental Coding Length (ICL) to measure the perspective entropy gain of each feature. The objective of our model is to maximize the entropy of the sampled visual features. In order to optimize energy consumption, the limited amount of energy of the system is re-distributed amongst features according to their Incremental Coding Length. By selecting features with large coding length increments, the computational system can achieve attention selectivity in both static and dynamic scenes. We demonstrate that the proposed model achieves superior accuracy in comparison to mainstream approaches in static saliency map generation. Moreover, we also show that our model captures several less-reported dynamic visual search behaviors, such as attentional swing and inhibition of return. 1
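As a rough illustration of the rarity-based weighting idea sketched in this abstract (not the authors' actual Incremental Coding Length definition or energy-redistribution rule), one could weight features by the negative log of their empirical activation probability estimated from recent frames; all names and constants below are hypothetical.

  import numpy as np

  def rarity_weights(activations, eps=1e-12):
      """activations: (T, K) nonnegative responses of K features over T past frames.
      Returns per-feature weights proportional to a crude 'coding length'
      (negative log of the empirical activation probability)."""
      totals = activations.sum(axis=0)
      probs = totals / (totals.sum() + eps)
      coding_length = -np.log(probs + eps)        # rare features get long codes
      return coding_length / coding_length.sum()  # redistribute a fixed energy budget

  def saliency_map(feature_maps, weights):
      """feature_maps: (K, H, W) responses for the current frame; weights: (K,)."""
      return np.tensordot(weights, feature_maps, axes=1)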
Author: Garvesh Raskutti, Bin Yu, Martin J. Wainwright, Pradeep K. Ravikumar
Abstract: We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of the ℓ1-regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph p, the number of edges in the graph s, and the maximum node degree d are allowed to grow as a function of the number of samples n. Our main result provides sufficient conditions on (n, p, d) for the ℓ1-regularized MLE to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes n = Ω(d² log(p)), with the error decaying as O(exp(−c log(p))) for some constant c. We illustrate our theoretical results via simulations and show good correspondences between the theoretical predictions and behavior in simulations.
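For readers who want to experiment with the ℓ1-regularized Gaussian MLE discussed in this abstract, here is a minimal sketch using scikit-learn's GraphicalLasso as a stand-in solver; the chain-structured precision matrix, the regularization level alpha, and the edge threshold are illustrative choices, not values from the paper.

  import numpy as np
  from sklearn.covariance import GraphicalLasso

  # Build a sparse, chain-structured precision matrix (tridiagonal), sample from
  # the corresponding GMRF, and check which edges the l1-regularized MLE recovers.
  rng = np.random.default_rng(0)
  p = 20
  prec = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))   # true precision matrix
  cov = np.linalg.inv(prec)
  X = rng.multivariate_normal(np.zeros(p), cov, size=500)       # n = 500 samples

  model = GraphicalLasso(alpha=0.05).fit(X)                     # alpha chosen arbitrarily

  est_edges = np.abs(model.precision_) > 1e-3                   # thresholded support
  np.fill_diagonal(est_edges, False)
  true_edges = np.abs(prec) > 0
  np.fill_diagonal(true_edges, False)

  print("recovered true edges:", int(np.sum(est_edges & true_edges)) // 2)
  print("false positive edges:", int(np.sum(est_edges & ~true_edges)) // 2)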
5 0.81178248 231 nips-2008-Temporal Dynamics of Cognitive Control
Author: Jeremy Reynolds, Michael C. Mozer
Abstract: Cognitive control refers to the flexible deployment of memory and attention in response to task demands and current goals. Control is often studied experimentally by presenting sequences of stimuli, some demanding a response, and others modulating the stimulus-response mapping. In these tasks, participants must maintain information about the current stimulus-response mapping in working memory. Prominent theories of cognitive control use recurrent neural nets to implement working memory, and optimize memory utilization via reinforcement learning. We present a novel perspective on cognitive control in which working memory representations are intrinsically probabilistic, and control operations that maintain and update working memory are dynamically determined via probabilistic inference. We show that our model provides a parsimonious account of behavioral and neuroimaging data, and suggest that it offers an elegant conceptualization of control in which behavior can be cast as optimal, subject to limitations on learning and the rate of information processing. Moreover, our model provides insight into how task instructions can be directly translated into appropriate behavior and then efficiently refined with subsequent task experience. 1
6 0.81135064 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization
7 0.8082267 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models
8 0.80728561 138 nips-2008-Modeling human function learning with Gaussian processes
9 0.80703586 101 nips-2008-Human Active Learning
10 0.80681676 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations
11 0.80677265 21 nips-2008-An Homotopy Algorithm for the Lasso with Online Observations
12 0.80642033 113 nips-2008-Kernelized Sorting
13 0.80633461 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms
14 0.80599231 195 nips-2008-Regularized Policy Iteration
15 0.80520833 96 nips-2008-Hebbian Learning of Bayes Optimal Decisions
16 0.80509478 34 nips-2008-Bayesian Network Score Approximation using a Metagraph Kernel
17 0.80474931 211 nips-2008-Simple Local Models for Complex Dynamical Systems
18 0.80399334 176 nips-2008-Partially Observed Maximum Entropy Discrimination Markov Networks
19 0.8039037 48 nips-2008-Clustering via LP-based Stabilities
20 0.80317575 226 nips-2008-Supervised Dictionary Learning