nips nips2013 nips2013-281 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Le Song, Bo Dai
Abstract: Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernel embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. 1
Reference: text
sentIndex sentText sentNum sentScore
1 However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. [sent-5, score-0.627]
2 Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. [sent-6, score-1.554]
3 In this paper, we propose a hierarchical low rank decomposition of kernel embeddings which can exploit such low rank structures in data while being robust to model misspecification. [sent-7, score-1.313]
4 We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. [sent-8, score-0.79]
5 This new framework has led to many recent advances in machine learning such as kernel independence test [3] and kernel belief propagation [4]. [sent-15, score-0.771]
6 However, algorithms designed with kernel embeddings have rarely taken into account latent and low rank structures prevalent in high dimensional data arising from various applications such as gene expression analysis. [sent-16, score-1.32]
7 While such information has been extensively exploited in other learning contexts such as graphical models and collaborative filtering, its use in kernel embeddings remains scarce and challenging. [sent-17, score-0.703]
8 Intuitively, these intrinsically low dimensional structures of the data should reduce the effective number of parameters in kernel embeddings, and allow us to obtain a better estimator when facing high dimensional problems. [sent-18, score-0.618]
9 As a demonstration of the above intuition, we illustrate the behavior of low rank kernel embeddings (which we will explain later in more detail) when applied to density estimation (Figure 1). [sent-19, score-1.139]
10 The fitted density based on an ordinary kernel density estimator has quite different contours from the ground truth (Figure 1(b)), while those provided by low rank embeddings appear to be much closer to the ground truth (Figure 1(c)). [sent-24, score-1.398]
11 Essentially, the low rank approximation step endows kernel embeddings with an additional mechanism to smooth the estimator which can be beneficial when the number of data points is small and there are clusters in the data. [sent-25, score-1.089]
12 In our later more systematic experiments, we show that low rank embeddings can lead to density estimators which can significantly improve over alternative approaches in terms of held-out likelihood. [sent-26, score-0.768]
13 (a) the contour plot for the ground truth density, (b) for the ordinary kernel density estimator (KDE), and (c) for the low rank KDE. [sent-29, score-0.961]
14 We used cross-validation to find the best kernel bandwidth for both the KDE and low rank KDE. [sent-30, score-0.755]
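To make the bandwidth selection step concrete, below is a minimal numpy sketch of choosing a Gaussian KDE bandwidth by held-out log-likelihood; the helper name gaussian_kde_loglik, the candidate grid and the two-cluster toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_kde_loglik(train, test, s):
    """Mean held-out log-likelihood of a Gaussian KDE with bandwidth s."""
    d = train.shape[1]
    sq = ((test[:, None, :] - train[None, :, :]) ** 2).sum(-1)     # (m, n) squared distances
    log_k = -sq / (2 * s ** 2) - d * np.log(np.sqrt(2 * np.pi) * s)
    m = log_k.max(axis=1, keepdims=True)                           # log-sum-exp for stability
    log_p = m.ravel() + np.log(np.exp(log_k - m).mean(axis=1))     # log (1/n) sum_i N(x; x_i, s^2 I)
    return log_p.mean()

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 0.5, size=(150, 2)),
                    rng.normal(+2.0, 0.5, size=(150, 2))])         # clustered toy data
rng.shuffle(X)
train, valid = X[:200], X[200:]
bandwidths = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
scores = [gaussian_kde_loglik(train, valid, s) for s in bandwidths]
print("selected bandwidth:", bandwidths[int(np.argmax(scores))])
```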
15 considered as a kernel generalization of the discrete valued tree-structured latent variable models studied in [7]. [sent-35, score-0.54]
16 The objective of the current paper is to address previous limitations of kernel embeddings as applied to graphical models and make them more practically useful. [sent-36, score-0.703]
17 Another key contribution of the paper is a novel view of kernel embedding of multivariate distributions as infinite dimensional higher order tensors, and the low rank structure of these tensors in the presence of latent variables. [sent-38, score-1.346]
18 This novel view allows us to introduce modern multi-linear algebra and tensor decomposition tools to address challenging problems in the interface between kernel methods and latent variable models. [sent-39, score-0.911]
19 We believe our work will play a synergistic role in bridging together largely separate areas in machine learning research, including kernel methods, latent variable models, and tensor data analysis. [sent-40, score-0.767]
20 In the remainder of the paper, we will first present the tensor view of kernel embeddings of multivariate distributions and its low rank structure in the presence of latent variables. [sent-41, score-1.457]
21 Then we will present our algorithm for hierarchical low rank decomposition of kernel embeddings by making use of a sequence of nested kernel singular value decompositions. [sent-42, score-1.716]
22 A reproducing kernel Hilbert space (RKHS) F on Ω with a kernel k(x, x′) is a Hilbert space of functions f : Ω → R. [sent-46, score-0.767]
23 For simplicity of notation, we assume that the domains of all variables are the same and the same kernel function is applied to all variables. [sent-50, score-0.396]
24 A kernel embedding represents a density by its expected features, i.e., µX := EX[φ(X)] = ∫Ω φ(x) p(x) dx. [sent-51, score-0.759]
25 The embedding µX has the property that the expectation of any RKHS function f ∈ F can be evaluated as an inner product in F, ⟨µX, f⟩F := EX[f(X)]. [sent-54, score-0.376]
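As a minimal illustration of how these two definitions are used in practice (the kernel choice, names and toy data below are our own, not from the paper): the empirical embedding µ̂X = (1/n) Σi φ(xi) is never formed explicitly, and ⟨µ̂X, f⟩ reduces to an average of kernel evaluations.

```python
import numpy as np

def gaussian_kernel(X, Y, s=1.0):
    """Gram matrix with entries k(x, y) = exp(-||x - y||^2 / (2 s^2))."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * s ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # samples x^1, ..., x^n from p(X)

# For an RKHS function f = sum_j alpha_j k(., z_j), the property
# <mu_X, f> = E_X[f(X)] is approximated by (1/n) sum_i f(x^i) = alpha^T K(Z, X) 1/n.
Z = rng.normal(size=(5, 2))                # expansion points defining f
alpha = rng.normal(size=5)
approx_expectation = alpha @ (gaussian_kernel(Z, X) @ np.ones(200) / 200)
print("E_X[f(X)] via the empirical embedding:", approx_expectation)
```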
26 Kernel embeddings can be readily generalized to joint density of d variables, X1 , . [sent-55, score-0.43]
27 , Xd, using the dth order tensor product feature space F^d. In this feature space, the feature map is defined as ⊗_{i=1}^d φ(xi) := φ(x1) ⊗ φ(x2) ⊗ · · · ⊗ φ(xd). [sent-58, score-0.351]
28 The kernel embeddings can also be generalized to conditional densities p(X|z) [9]: µX|z := EX|z[φ(X)] = ∫Ω φ(x) p(x|z) dx. (2) Given this embedding, the conditional expectation of a function f ∈ F can be computed as EX|z[f(X)] = ⟨f, µX|z⟩F. [sent-72, score-0.783]
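For completeness, here is a small sketch of one standard finite-sample estimator of the conditional embedding (kernel ridge regression of features of X on features of Z, in the spirit of [9]); the regularization constant, kernel bandwidth and variable names are illustrative choices of ours.

```python
import numpy as np

def gaussian_kernel(X, Y, s=1.0):
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * s ** 2))

rng = np.random.default_rng(1)
Z = rng.normal(size=(300, 1))
X = np.sin(Z) + 0.1 * rng.normal(size=(300, 1))       # X depends on Z

lam = 1e-3
Kz = gaussian_kernel(Z, Z)

def conditional_weights(z_query):
    """Weights beta(z) such that mu_{X|z} is approximated by sum_i beta_i(z) phi(x^i)."""
    kz = gaussian_kernel(Z, np.atleast_2d(z_query))
    return np.linalg.solve(Kz + len(Z) * lam * np.eye(len(Z)), kz).ravel()

# E[f(X) | Z = z] for f = k(., x0) is then beta(z)^T applied to k(x^i, x0).
beta = conditional_weights([0.5])
x0 = np.array([[np.sin(0.5)]])
print("E[k(X, x0) | Z = 0.5] approx:", beta @ gaussian_kernel(X, x0).ravel())
```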
29 Unlike the ordinary embeddings, an embedding of a conditional distribution is not a single element in the RKHS, but will instead sweep out a family of points in the RKHS, each indexed by a fixed value z of the conditioning variable Z. [sent-73, score-0.428]
30 In other words, the conditional embedding is an operator, denoted as CX|Z, which can take as input a z and output an embedding, i.e., µX|z = CX|Z φ(z). [sent-75, score-0.352]
31 Likewise, the kernel embedding of conditional distributions can also be generalized to joint distributions of d variables. [sent-78, score-0.768]
32 In this case, we let φ(Z) = Z and use the linear kernel k(Z, Z′) = Z⊤Z′. [sent-82, score-0.371]
33 Then, the conditional embedding operator reduces to a separate embedding µX|z for each conditional density p(X|z). [sent-83, score-0.822]
34 3 Kernel Embeddings as Infinite Dimensional Higher Order Tensors. The above kernel embedding CX1:d can also be viewed as a multi-linear operator (tensor) of order d mapping from F × · · · × F to R. [sent-86, score-0.753]
35 (For a generic introduction to tensors and tensor notation, please see [10]). [sent-90, score-0.454]
36 Furthermore, the application of the operator to a set of elements {fi ∈ F}_{i=1}^d can be defined using the inner product from the tensor product feature space, i.e., CX1:d •1 f1 •2 · · · •d fd := ⟨CX1:d, f1 ⊗ · · · ⊗ fd⟩. [sent-92, score-0.391]
37 We can reshape a higher order tensor into a lower order tensor by partitioning its modes into several disjoint groups. [sent-114, score-0.698]
38 Similarly to the Matlab function, we can obtain a 2nd order tensor by CI1;I2 = reshape(CX1:d, I1, I2) : F^s × F^{d−s} → R. (5) [sent-122, score-0.422]
39 In the reverse direction, we can also reshape a lower order tensor into a higher order one by further partitioning a certain mode of the tensor. [sent-124, score-0.445]
40 , Xs}, and turn CI1;I2 into a 3rd order tensor by CI1′;I1″;I2 = reshape(CI1;I2, I1′, I1″, I2) : F^t × F^{s−t} × F^{d−s} → R. [sent-131, score-0.422]
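A finite-dimensional analogue makes the two reshaping directions concrete: for discrete variables the embedding is an ordinary d-way probability table, and the reshape above is exactly an array reshape that groups or splits modes (the sizes below are arbitrary illustrative choices).

```python
import numpy as np

# A joint probability table over d = 4 discrete variables with k states each
# plays the role of the 4th order embedding C_{X1:4}.
rng = np.random.default_rng(2)
k = 3
C = rng.dirichlet(np.ones(k ** 4)).reshape(k, k, k, k)

# reshape(C, {X1:2}, {X3:4}): group the first two and the last two modes
# into a 2nd order tensor (a k^2 x k^2 matrix).
C_I1_I2 = C.reshape(k * k, k * k)

# Reverse direction: split the first group again into a 3rd order tensor.
C_I1a_I1b_I2 = C_I1_I2.reshape(k, k, k * k)
print(C_I1_I2.shape, C_I1a_I1b_I2.shape)
```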
41 (c) Caterpillar tree (hidden Markov model). Figure 2: Three latent variable models with different tree topologies. The 2nd order tensor CI1;I2 can also be viewed as the cross-covariance operator between two sets of variables in I1 and I2. [sent-147, score-0.746]
42 For instance, we can perform singular value decomposition of CI1;I2 = Σ_{i=1}^∞ si (ui ⊗ vi), where si ∈ R are ordered in nonincreasing manner, and {ui}_{i=1}^∞ ⊂ F^s and {vi}_{i=1}^∞ ⊂ F^{d−s} are singular vectors. [sent-149, score-0.358]
43 , sr), and denote the low rank approximation as CI1;I2 = Ur Sr Vr⊤. [sent-160, score-0.54]
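In the finite-dimensional analogue, the rank r approximation of a reshaped embedding is simply a truncated SVD of the corresponding matrix; the sketch below uses an arbitrary discrete joint table as a stand-in.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 3
C_I1_I2 = rng.dirichlet(np.ones(k ** 4)).reshape(k * k, k * k)   # a reshaped joint table

r = 2
U, s, Vt = np.linalg.svd(C_I1_I2, full_matrices=False)
Ur, Sr, Vr_t = U[:, :r], np.diag(s[:r]), Vt[:r, :]
low_rank = Ur @ Sr @ Vr_t                                        # U_r S_r V_r^T
print("rank-r approximation error:", np.linalg.norm(C_I1_I2 - low_rank))
```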
44 Finally, a 1st order tensor reshape(CX1:d, {X1:d}, ∅) is simply a vector, where we will use vector notation. [sent-161, score-0.422]
45 4 Low Rank Kernel Embeddings Induced by Latent Variables. In the presence of latent variables, the kernel embedding CX1:d will be low rank. [sent-162, score-0.951]
46 Then the embedding CX1X2 of p(X1, X2) has rank at most r. [sent-166, score-0.576]
47 Reshaping its kernel embedding CX1:4, we obtain CX1:2;X3:4 = reshape(CX1:4, {X1:2}, {X3:4}), which factorizes as EX1:2|Z1[φ(X1) ⊗ φ(X2)] EZ1Z2[Z1 ⊗ Z2] EX3:4|Z2[φ(X3) ⊗ φ(X4)]⊤, (8) where EZ1Z2[Z1 ⊗ Z2] is an r × r matrix. [sent-171, score-0.911]
48 Hence the intrinsic “rank” of the reshaped embedding is only r, although the original kernel embedding CX1:4 is a 4th order tensor with infinite dimensions. [sent-172, score-1.266]
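The rank statement can be checked numerically in the discrete analogue: a joint table that factorizes through an r-state latent variable has matrix rank at most r, no matter how many observed states there are (the sizes and the Dirichlet draws below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)
k, r = 10, 3                                    # k observed states, r latent states
pz = rng.dirichlet(np.ones(r))                  # P(Z)
A = rng.dirichlet(np.ones(k), size=r).T         # P(X1 | z), columns indexed by z
B = rng.dirichlet(np.ones(k), size=r).T         # P(X2 | z)

# P(X1, X2) = sum_z P(X1|z) P(X2|z) P(z) = A diag(P(Z)) B^T factorizes through Z,
# so this k x k matrix has rank at most r.
P = A @ np.diag(pz) @ B.T
print("numerical rank of P(X1, X2):", np.linalg.matrix_rank(P, tol=1e-10))
```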
49 , Xd ) where the conditional independence structure is a tree T , various reshapings of its kernel embedding CX1:d according to edges in the tree will be low rank. [sent-176, score-1.221]
50 More specifically, each edge in the latent tree corresponds to a pair of latent variables (Zs , Zt ) (or an observed and a hidden variable (Xs , Zt )) which induces a partition of the observed variables into two groups, I1 and I2 . [sent-177, score-0.517]
51 If we reshape the tensor according to this partitioning, then we have the following result. Theorem 1 Assume that all observed variables are leaves in the latent tree structure, and all latent variables take r possible values; then rank(CI1;I2) ≤ r. [sent-180, score-0.892]
52 Then its embedding can be written as CI1;I2 = CI1|Zs EZsZt[Zs ⊗ Zt] CI2|Zt⊤, (9) where CI1|Zs and CI2|Zt are the conditional embedding operators for p(I1|zs) and p(I2|zt) respectively. [sent-185, score-0.665]
53 Theorem 1 implies that, given a latent tree model, we obtain a collection of low rank reshapings {CI1 ;I2 } of the kernel embedding CX1:d , each corresponding to an edge (Zs , Zt ) of the tree. [sent-187, score-1.458]
54 We will denote by H(T, r) the class of kernel embeddings CX1:d whose various reshapings according to the latent tree T have rank at most r. [sent-188, score-1.309]
55 , Xd ), and consequently the various reshapings of its kernel embedding CX1:d are only approximately low rank. [sent-193, score-0.897]
56 In this case, we will instead impose a (potentially misspecified) latent structure T and a fixed rank r on the data and obtain an approximate low rank decomposition of the kernel embedding. [sent-194, score-1.308]
57 The goal is to obtain a low rank embedding C̃X1:d ∈ H(T, r), while at the same time ensuring that ‖CX1:d − C̃X1:d‖• is small. [sent-195, score-0.697]
58 5 Low Rank Decomposition of Kernel Embeddings. For simplicity of exposition, we will focus on the case where the latent tree structure T has a caterpillar shape (Figure 2(c)). [sent-197, score-0.403]
59 This decomposition can be viewed as a kernel generalization of the hierarchical tensor decomposition in [11, 12, 7]. [sent-198, score-0.942]
60 The decomposition proceeds by reshaping the kernel embedding CX1:d according to the first edge (Z1 , Z2 ), resulting in A1 := CX1 ;X2:d . [sent-199, score-0.917]
61 This leads to the first intermediate tensor G1 = Ur, and we reshape Sr Vr⊤ and recursively decompose it. [sent-201, score-0.468]
62 We note that Algorithm 1 contains only pseudocode, and is not implementable in practice since the kernel embedding to decompose can have infinite dimensions. [sent-202, score-0.684]
63 We will design a practical kernel algorithm in the next section. [sent-203, score-0.371]
64 Algorithm 1 Low Rank Decomposition of Kernel Embeddings. In: A kernel embedding CX1:d, the caterpillar tree T and desired rank r. Out: A low rank embedding C̃X1:d ∈ H(T, r) as intermediate tensors {G1, . . . , Gd}. [sent-204, score-2.026]
65 8: end for 9: Gd = Bd−1. Once we finish the decomposition, we obtain the low rank representation of the kernel embedding as a set of intermediate tensors {G1, . . . , Gd}. [sent-218, score-1.193]
66 In particular, we can think of G1 as a second order tensor with dimension ∞ × r, Gd as a second order tensor with dimension r × ∞, and Gj for 2 ≤ j ≤ d − 1 as a third order tensor with dimension r × ∞ × r. [sent-222, score-0.681]
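The following is a finite-dimensional sketch of this sequence of reshape-and-truncated-SVD steps along the caterpillar tree (in spirit a tensor-train style sweep); the real Algorithm 1 operates on the infinite-dimensional embedding and is carried out through kernel matrices in Section 6, so the function below, with a feature dimension k standing in for F, is only meant to convey the structure.

```python
import numpy as np

def low_rank_decompose(C, r):
    """Sequential truncated SVDs of a d-way array C (a stand-in for C_{X1:d}).

    Returns cores G_1, ..., G_d with shapes (k, r), (r, k, r), ..., (r, k)."""
    d, k = C.ndim, C.shape[0]
    cores, r_prev = [], 1
    M = C.reshape(k, -1)                         # A_1 = reshape(C, {X1}, {X2:d})
    for j in range(d - 1):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        rr = min(r, len(s))
        core = U[:, :rr] if j == 0 else U[:, :rr].reshape(r_prev, k, rr)
        cores.append(core)
        M = np.diag(s[:rr]) @ Vt[:rr, :]         # remainder S_r V_r^T
        if j < d - 2:
            M = M.reshape(rr * k, -1)            # reshape along the next edge
        r_prev = rr
    cores.append(M)                              # G_d has shape (r, k)
    return cores

def evaluate(cores, idx):
    """Entry of the reconstructed low rank tensor at index (i_1, ..., i_d)."""
    v = cores[0][idx[0], :]
    for j in range(1, len(cores) - 1):
        v = v @ cores[j][:, idx[j], :]
    return float(v @ cores[-1][:, idx[-1]])

# Build a tensor that genuinely has this low rank structure, then recover it.
rng = np.random.default_rng(4)
k, d, r = 4, 5, 2
G1 = rng.random((k, r)); G5 = rng.random((r, k))
G2, G3, G4 = (rng.random((r, k, r)) for _ in range(3))
C = np.einsum('ia,ajb,bkc,cle,em->ijklm', G1, G2, G3, G4, G5)
cores = low_rank_decompose(C, r)
print("exact:", C[1, 2, 0, 3, 1], " reconstructed:", evaluate(cores, (1, 2, 0, 3, 1)))
```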
67 Then we can apply the low rank kernel embedding C̃X1:d to a set of elements {fi ∈ F}_{i=1}^d as follows: C̃X1:d •1 f1 •2 · · · •d fd. [sent-223, score-1.068]
68 Based on the above decomposition, one can obtain a low rank density estimate via p̃(X1, . . . , Xd) = C̃X1:d •1 φ(x1) •2 · · · •d φ(xd). [sent-230, score-0.459]
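Evaluating such an expression amounts to contracting the intermediate tensors with one vector per mode; in the finite-dimensional sketch this is the following chain of small matrix-vector products (the random cores and vectors below are placeholders; with kernel embeddings the fi would be feature vectors φ(xi)).

```python
import numpy as np

def apply_to_vectors(cores, fs):
    """C~ applied as C~ •_1 f_1 •_2 ... •_d f_d for cores (G_1, ..., G_d)."""
    v = fs[0] @ cores[0]                                  # shape (r,)
    for j in range(1, len(cores) - 1):
        v = v @ np.einsum('akb,k->ab', cores[j], fs[j])   # contract the middle mode
    return float(v @ (cores[-1] @ fs[-1]))

rng = np.random.default_rng(5)
k, d, r = 4, 5, 2
cores = ([rng.random((k, r))] + [rng.random((r, k, r)) for _ in range(d - 2)]
         + [rng.random((r, k))])
fs = [rng.random(k) for _ in range(d)]
print("C~ applied to f_1, ..., f_d:", apply_to_vectors(cores, fs))
```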
69 , Xd ), and we want to obtain an empirical low rank decomposition of the kernel embedding. [sent-247, score-0.921]
70 In this case, we will perform a low rank decomposition of the empirical kernel embedding C̄X1:d = (1/n) Σ_{i=1}^n ⊗_{j=1}^d φ(x_j^i). [sent-248, score-1.234]
71 Although the empirical kernel embedding still has infinite dimensions, we will show that we can carry out the decomposition using just the kernel matrices. [sent-249, score-1.249]
72 Let us denote the kernel matrix for each dimension of the data by Kj where j ∈ {1, . . . , d}. [sent-250, score-0.371]
73 , φ(x_j^n)), and the corresponding kernel matrix is Kj = Φj⊤ Φj. [sent-259, score-0.371]
74 The corresponding kernel matrix Lj has its (i, i′)-th entry defined as Lj^{ii′} = ∏_{j′=j+1}^d k(x_{j′}^i, x_{j′}^{i′}). [sent-264, score-0.43]
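These matrices are straightforward to form from the samples; a small sketch is below (the Gaussian kernel, bandwidth and array layout are illustrative assumptions).

```python
import numpy as np

def gaussian_kernel(X, Y, s=1.0):
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * s ** 2))

rng = np.random.default_rng(6)
n, d = 100, 4
data = rng.normal(size=(n, d, 1))        # n samples of d scalar variables

# K_j: Gram matrix of variable j across the n samples.
K = [gaussian_kernel(data[:, j, :], data[:, j, :]) for j in range(d)]

# L_j: entrywise product of K_{j+1}, ..., K_d, so that
# L_j[i, i'] = prod_{j' > j} k(x^i_{j'}, x^{i'}_{j'}); for the last j it is all ones.
L = [np.prod(np.stack(K[j + 1:]), axis=0) if j < d - 1 else np.ones((n, n))
     for j in range(d)]
print(K[0].shape, L[0].shape)
```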
75 The key building block of the algorithm is a kernel singular value decomposition (Algorithm 2), which we will explain in more detail using the example in step 2 of Algorithm 1. [sent-266, score-0.622]
76 For the low rank approximation A1 ≈ Ur Sr Vr⊤ using singular value decomposition, the leading r singular vectors Ur = (u1, . [sent-268, score-0.598]
77 Then we can transform the singular value decomposition problem for an infinite dimensional matrix to a generalized eigenvalue problem involving kernel matrices: A1 A1⊤ u = λ u becomes (1/n²) K1 L1 K1 β = λ K1 β (with u = Φ1 β). [sent-277, score-0.7]
78 Then the kernel singular value decomposition can be carried out recursively on the reshaped tensor B1 . [sent-297, score-0.891]
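A sketch of this kernel SVD step, reducing the generalized eigenvalue problem (1/n²) K L K β = λ K β to an ordinary symmetric eigenproblem through a Cholesky factor of K in the spirit of Algorithm 2, is shown below; the jitter term, function name and toy matrices are our own choices.

```python
import numpy as np
from scipy.linalg import cholesky, eigh

def kernel_svd(K, L, r, jitter=1e-8):
    """Leading r directions of (1/n^2) K L K beta = lam K beta.

    With K = R^T R (Cholesky) and beta = R^{-1} e, this is equivalent to the
    ordinary symmetric eigenproblem (1/n^2) R L R^T e = lam e."""
    n = K.shape[0]
    R = cholesky(K + jitter * np.eye(n), lower=False)   # upper triangular, K = R^T R
    M = (R @ L @ R.T) / n ** 2
    w, E = eigh((M + M.T) / 2)                          # eigenvalues in ascending order
    w, E = w[::-1][:r], E[:, ::-1][:, :r]               # keep the leading r
    beta = np.linalg.solve(R, E)                        # u_i = Phi beta_i in the RKHS
    return w, beta

# Toy usage with random positive semi-definite K and L of size n = 50.
rng = np.random.default_rng(7)
A, B = rng.normal(size=(50, 5)), rng.normal(size=(50, 5))
K, L = A @ A.T, B @ B.T
w, beta = kernel_svd(K, L, r=3)
print("leading eigenvalues:", w)
```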
79 The result of the algorithm is an empirical low rank kernel embedding, ĈX1:d, represented as a collection of intermediate tensors {G1, . [sent-321, score-0.902]
80 , Gd} to a set of elements {fi ∈ F} can be expressed as kernel operations. [sent-330, score-0.371]
81 7 Performance Guarantees. As we mentioned in the introduction, the imposed latent structure used in the low rank decomposition of kernel embeddings may be misspecified, and the decomposition of empirical embeddings may suffer from sampling error. [sent-344, score-1.829]
82 , θn). 1: Perform Cholesky decomposition K = R⊤R. 2: Solve the eigen decomposition (1/n²) R L R⊤ ẽ = λ ẽ, and keep the leading r eigen vectors ẽ1, . [sent-349, score-0.352]
83 , xd). Out: A low rank embedding ĈX1:d ∈ H(T, r) as intermediate operators {G1, . [sent-367, score-0.936]
84 , n). The generalized Frobenius norm ‖CX1:d − ĈX1:d‖• measures the difference between the true kernel embeddings and the low rank kernel embeddings estimated from a set of n i.i.d. samples. [sent-406, score-1.744]
85 We will bound these two sources of error separately (the proof is deferred to Appendix B). Theorem 2 Suppose each reshaping CI1;I2 of CX1:d according to an edge in the latent tree structure has a rank r approximation Ur Sr Vr⊤ with error ‖CI1;I2 − Ur Sr Vr⊤‖• ≤ ε. [sent-414, score-0.645]
86 Then the low rank decomposition C̃X1:d from Algorithm 1 satisfies ‖CX1:d − C̃X1:d‖• ≤ √(d − 1) ε. [sent-415, score-0.528]
87 Although previous work [5, 6] has also used hierarchical decompositions for kernel embeddings, their decompositions make the strong assumption that the latent tree models are correctly specified. [sent-416, score-0.848]
88 In contrast, the decomposition we propose here is robust in the sense that even when the latent tree structure is misspecified, we can still provide an approximation guarantee for the algorithm. [sent-418, score-0.445]
89 Furthermore, when the latent tree structures are correctly specified and the rank r is also correct, then CI1;I2 has rank r and hence ε = 0 and our decomposition algorithm does not incur any modeling error. [sent-419, score-0.979]
90 The estimation error arises from decomposing the empirical estimate C̄X1:d of the kernel embedding, and the error can accumulate as we combine intermediate tensors {G1, . [sent-421, score-0.518]
91 , Gd } to form the final low rank kernel embedding. [sent-424, score-0.755]
92 From the above theorem, we can see that the smaller the r-th singular value, the more difficult it is to estimate the low rank kernel embedding. [sent-426, score-0.862]
93 8 Experiments. Besides the synthetic dataset we showed in Figure 1, where low rank kernel embedding can lead to significant improvement in terms of estimating the density, we also experimented with real world datasets from the UCI data repository. [sent-428, score-1.068]
94 We use the Gaussian kernel k(x, x′) = (1/(√(2π) s)) exp(−‖x − x′‖²/(2s²)) for KDE, Spectral and our method (Low rank). [sent-431, score-0.371]
95 For both Spectral and Low rank, we use a caterpillar tree in Figure 2(c) as the structure for the latent variable model. [sent-433, score-0.426]
96 From Table 1, we can see that low rank kernel embeddings provide the best or comparable held-out negative log-likelihood across the datasets we experimented with. [sent-434, score-1.064]
97 In some datasets, low rank kernel embeddings can lead to drastic improvement over the alternatives. [sent-435, score-1.064]
98 9 Discussion and Conclusion. In this paper, we presented a robust kernel embedding algorithm which can make use of the low rank structure of the data, and provided both theoretical and empirical support for it. [sent-531, score-1.117]
99 First, the algorithm requires a sequence of kernel singular value decompositions which can be computationally intensive for high dimensional and large datasets. [sent-533, score-0.54]
100 Third, it will be interesting empirical work to explore other applications for low rank kernel embeddings, such as kernel two-sample tests, kernel independence tests and kernel belief propagation. [sent-537, score-1.919]
wordName wordTfidf (topN-words)
[('kernel', 0.371), ('embedding', 0.313), ('embeddings', 0.309), ('rank', 0.263), ('tensor', 0.227), ('ur', 0.224), ('reshape', 0.195), ('xd', 0.193), ('vr', 0.178), ('sr', 0.156), ('gd', 0.149), ('kde', 0.146), ('latent', 0.146), ('decomposition', 0.144), ('misspeci', 0.131), ('caterpillar', 0.129), ('tree', 0.128), ('low', 0.121), ('zs', 0.115), ('zj', 0.108), ('singular', 0.107), ('zt', 0.099), ('reshapings', 0.092), ('fd', 0.084), ('tensors', 0.079), ('density', 0.075), ('kj', 0.071), ('song', 0.069), ('gj', 0.065), ('reshaping', 0.065), ('cx', 0.062), ('xi', 0.06), ('xj', 0.059), ('lj', 0.058), ('zd', 0.058), ('eid', 0.055), ('kernelsvd', 0.055), ('ordinary', 0.053), ('hilbert', 0.053), ('spectral', 0.052), ('rkhs', 0.051), ('intermediate', 0.046), ('qd', 0.045), ('operator', 0.043), ('gretton', 0.042), ('reshaped', 0.042), ('conditional', 0.039), ('ex', 0.039), ('inner', 0.038), ('ez', 0.037), ('frobenius', 0.037), ('ezs', 0.037), ('fi', 0.036), ('structures', 0.035), ('dimensional', 0.033), ('feature', 0.033), ('xs', 0.033), ('eigen', 0.032), ('factorizes', 0.032), ('hierarchical', 0.03), ('sonar', 0.03), ('rr', 0.03), ('decompositions', 0.029), ('independence', 0.029), ('yeast', 0.028), ('carry', 0.028), ('truth', 0.028), ('aj', 0.028), ('robust', 0.027), ('tj', 0.027), ('orthonormal', 0.026), ('modes', 0.026), ('bj', 0.026), ('fukumizu', 0.026), ('viewed', 0.026), ('reproducing', 0.025), ('product', 0.025), ('estimator', 0.025), ('ground', 0.025), ('variables', 0.025), ('generalized', 0.025), ('edge', 0.024), ('variable', 0.023), ('partitioning', 0.023), ('prevalent', 0.023), ('graphical', 0.023), ('empirical', 0.022), ('cholesky', 0.022), ('xn', 0.021), ('readily', 0.021), ('mixture', 0.021), ('spherical', 0.021), ('furthermore', 0.021), ('distributions', 0.02), ('sch', 0.02), ('eigenvalue', 0.02), ('parikh', 0.019), ('implicitly', 0.019), ('deferred', 0.019), ('rarely', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000005 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
Author: Le Song, Bo Dai
Abstract: Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernel embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. 1
2 0.2144344 11 nips-2013-A New Convex Relaxation for Tensor Completion
Author: Bernardino Romera-Paredes, Massimiliano Pontil
Abstract: We study the problem of learning a tensor from a set of linear measurements. A prominent methodology for this problem is based on a generalization of trace norm regularization, which has been used extensively for learning low rank matrices, to the tensor setting. In this paper, we highlight some limitations of this approach and propose an alternative convex relaxation on the Euclidean ball. We then describe a technique to solve the associated regularization problem, which builds upon the alternating direction method of multipliers. Experiments on one synthetic dataset and two real datasets indicate that the proposed method improves significantly over tensor trace norm regularization in terms of estimation error, while remaining computationally tractable. 1
3 0.19586071 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
Author: Ryota Tomioka, Taiji Suzuki
Abstract: We study a new class of structured Schatten norms for tensors that includes two recently proposed norms (“overlapped” and “latent”) for convex-optimizationbased tensor decomposition. We analyze the performance of “latent” approach for tensor decomposition, which was empirically found to perform better than the “overlapped” approach in some settings. We show theoretically that this is indeed the case. In particular, when the unknown true tensor is low-rank in a specific unknown mode, this approach performs as well as knowing the mode with the smallest rank. Along the way, we show a novel duality result for structured Schatten norms, which is also interesting in the general context of structured sparsity. We confirm through numerical simulations that our theory can precisely predict the scaling behaviour of the mean squared error. 1
4 0.19210991 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
Author: Tzu-Kuo Huang, Jeff Schneider
Abstract: Learning dynamic models from observed data has been a central issue in many scientific studies or engineering tasks. The usual setting is that data are collected sequentially from trajectories of some dynamical system operation. In quite a few modern scientific modeling tasks, however, it turns out that reliable sequential data are rather difficult to gather, whereas out-of-order snapshots are much easier to obtain. Examples include the modeling of galaxies, chronic diseases such Alzheimer’s, or certain biological processes. Existing methods for learning dynamic model from non-sequence data are mostly based on Expectation-Maximization, which involves non-convex optimization and is thus hard to analyze. Inspired by recent advances in spectral learning methods, we propose to study this problem from a different perspective: moment matching and spectral decomposition. Under that framework, we identify reasonable assumptions on the generative process of non-sequence data, and propose learning algorithms based on the tensor decomposition method [2] to provably recover firstorder Markov models and hidden Markov models. To the best of our knowledge, this is the first formal guarantee on learning from non-sequence data. Preliminary simulation results confirm our theoretical findings. 1
5 0.18676026 295 nips-2013-Simultaneous Rectification and Alignment via Robust Recovery of Low-rank Tensors
Author: Xiaoqin Zhang, Di Wang, Zhengyuan Zhou, Yi Ma
Abstract: In this work, we propose a general method for recovering low-rank three-order tensors, in which the data can be deformed by some unknown transformation and corrupted by arbitrary sparse errors. Since the unfolding matrices of a tensor are interdependent, we introduce auxiliary variables and relax the hard equality constraints by the augmented Lagrange multiplier method. To improve the computational efficiency, we introduce a proximal gradient step to the alternating direction minimization method. We have provided proof for the convergence of the linearized version of the problem which is the inner loop of the overall algorithm. Both simulations and experiments show that our methods are more efficient and effective than previous work. The proposed method can be easily applied to simultaneously rectify and align multiple images or videos frames. In this context, the state-of-the-art algorithms “RASL” and “TILT” can be viewed as two special cases of our work, and yet each only performs part of the function of our method.
6 0.16948618 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
7 0.15619323 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
8 0.14881259 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
9 0.14695737 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
10 0.14530015 9 nips-2013-A Kernel Test for Three-Variable Interactions
11 0.12935074 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
12 0.1271726 173 nips-2013-Least Informative Dimensions
13 0.12649775 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
14 0.12527311 203 nips-2013-Multilinear Dynamical Systems for Tensor Time Series
15 0.12455993 345 nips-2013-Variance Reduction for Stochastic Gradient Optimization
16 0.12301978 224 nips-2013-On the Sample Complexity of Subspace Learning
17 0.12272844 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
18 0.12145459 75 nips-2013-Convex Two-Layer Modeling
19 0.11121421 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
20 0.1089502 81 nips-2013-DeViSE: A Deep Visual-Semantic Embedding Model
topicId topicWeight
[(0, 0.224), (1, 0.144), (2, 0.082), (3, 0.214), (4, 0.049), (5, -0.162), (6, 0.112), (7, -0.002), (8, 0.047), (9, 0.033), (10, -0.091), (11, -0.088), (12, -0.084), (13, -0.069), (14, 0.263), (15, 0.125), (16, 0.039), (17, 0.081), (18, -0.03), (19, -0.087), (20, 0.043), (21, 0.021), (22, 0.052), (23, 0.066), (24, -0.04), (25, 0.001), (26, 0.07), (27, 0.046), (28, 0.137), (29, 0.015), (30, 0.069), (31, -0.098), (32, 0.071), (33, -0.002), (34, -0.0), (35, -0.031), (36, 0.064), (37, 0.077), (38, -0.017), (39, 0.001), (40, -0.037), (41, -0.001), (42, -0.016), (43, -0.075), (44, -0.062), (45, -0.017), (46, -0.042), (47, -0.003), (48, 0.068), (49, -0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.97508061 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
Author: Le Song, Bo Dai
Abstract: Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernel embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. 1
2 0.69747162 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
Author: Michel Besserve, Nikos K. Logothetis, Bernhard Schölkopf
Abstract: Many applications require the analysis of complex interactions between time series. These interactions can be non-linear and involve vector valued as well as complex data structures such as graphs or strings. Here we provide a general framework for the statistical analysis of these dependencies when random variables are sampled from stationary time-series of arbitrary objects. To achieve this goal, we study the properties of the Kernel Cross-Spectral Density (KCSD) operator induced by positive definite kernels on arbitrary input domains. This framework enables us to develop an independence test between time series, as well as a similarity measure to compare different types of coupling. The performance of our test is compared to the HSIC test using i.i.d. assumptions, showing improvements in terms of detection errors, as well as the suitability of this approach for testing dependency in complex dynamical systems. This similarity measure enables us to identify different types of interactions in electrophysiological neural time series. 1
3 0.6739251 9 nips-2013-A Kernel Test for Three-Variable Interactions
Author: Dino Sejdinovic, Arthur Gretton, Wicher Bergsma
Abstract: We introduce kernel nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space. The resulting test statistics are straightforward to compute, and are used in powerful interaction tests, which are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.
4 0.64216554 156 nips-2013-Learning Kernels Using Local Rademacher Complexity
Author: Corinna Cortes, Marius Kloft, Mehryar Mohri
Abstract: We use the notion of local Rademacher complexity to design new algorithms for learning kernels. Our algorithms thereby benefit from the sharper learning bounds based on that notion which, under certain general conditions, guarantee a faster convergence rate. We devise two new learning kernel algorithms: one based on a convex optimization problem for which we give an efficient solution using existing learning kernel techniques, and another one that can be formulated as a DC-programming problem for which we describe a solution in detail. We also report the results of experiments with both algorithms in both binary and multi-class classification tasks. 1
5 0.64186615 144 nips-2013-Inverse Density as an Inverse Problem: the Fredholm Equation Approach
Author: Qichao Que, Mikhail Belkin
Abstract: q We address the problem of estimating the ratio p where p is a density function and q is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration often referred to as importance sampling in statistical inference. It is also closely related to the problem of covariate shift in transfer learning. Our approach is based on reformulating the problem of estimating the ratio as an inverse problem in terms of an integral operator corresponding to a kernel, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization leads to a principled framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel for densities defined on Rd and smooth d-dimensional sub-manifolds of the Euclidean space. Model selection for unsupervised or semi-supervised inference is generally a difficult problem. It turns out that in the density ratio estimation setting, when samples from both distributions are available, simple completely unsupervised model selection methods are available. We call this mechanism CD-CV for Cross-Density Cross-Validation. We show encouraging experimental results including applications to classification within the covariate shift framework. 1
6 0.62892234 74 nips-2013-Convex Tensor Decomposition via Structured Schatten Norm Regularization
7 0.60812503 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test
8 0.59021491 11 nips-2013-A New Convex Relaxation for Tensor Completion
9 0.5710429 295 nips-2013-Simultaneous Rectification and Alignment via Robust Recovery of Low-rank Tensors
10 0.55991685 170 nips-2013-Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space
11 0.55626589 203 nips-2013-Multilinear Dynamical Systems for Tensor Time Series
12 0.55200398 173 nips-2013-Least Informative Dimensions
13 0.5483619 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
14 0.49128506 336 nips-2013-Translating Embeddings for Modeling Multi-relational Data
15 0.47287872 327 nips-2013-The Randomized Dependence Coefficient
16 0.46399829 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
17 0.45756865 329 nips-2013-Third-Order Edge Statistics: Contour Continuation, Curvature, and Cortical Connections
18 0.45509931 76 nips-2013-Correlated random features for fast semi-supervised learning
19 0.45106691 289 nips-2013-Scalable kernels for graphs with continuous attributes
20 0.42954335 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
topicId topicWeight
[(2, 0.017), (10, 0.013), (16, 0.032), (19, 0.035), (33, 0.184), (34, 0.113), (41, 0.033), (49, 0.046), (56, 0.131), (70, 0.014), (85, 0.034), (88, 0.156), (89, 0.029), (93, 0.069)]
simIndex simValue paperId paperTitle
1 0.91882521 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse
Author: John Duchi, Michael Jordan, Brendan McMahan
Abstract: We study stochastic optimization problems when the data is sparse, which is in a sense dual to current perspectives on high-dimensional statistical learning and optimization. We highlight both the difficulties—in terms of increased sample complexity that sparse data necessitates—and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium to large-scale learning tasks. 1 Introduction and problem setting In this paper, we investigate stochastic optimization problems in which the data is sparse. Formally, let {F (·; ξ), ξ ∈ Ξ} be a collection of real-valued convex functions, each of whose domains contains the convex set X ⊂ Rd . For a probability distribution P on Ξ, we consider the following optimization problem: minimize f (x) := E[F (x; ξ)] = x∈X F (x; ξ)dP (ξ). (1) Ξ By data sparsity, we mean the samples ξ are sparse: assuming that samples ξ lie in Rd , and defining the support supp(x) of a vector x to the set of indices of its non-zero components, we assume supp F (x; ξ) ⊂ supp ξ. (2) The sparsity condition (2) means that F (x; ξ) does not “depend” on the values of xj for indices j such that ξj = 0.1 This type of data sparsity is prevalent in statistical optimization problems and machine learning applications; in spite of its prevalence, study of such problems has been limited. As a motivating example, consider a text classification problem: data ξ ∈ Rd represents words appearing in a document, and we wish to minimize a logistic loss F (x; ξ) = log(1 + exp( ξ, x )) on the data (we encode the label implicitly with the sign of ξ). Such generalized linear models satisfy the sparsity condition (2), and while instances are of very high dimension, in any given instance, very few entries of ξ are non-zero [8]. From a modelling perspective, it thus makes sense to allow a dense predictor x: any non-zero entry of ξ is potentially relevant and important. In a sense, this is dual to the standard approaches to high-dimensional problems; one usually assumes that the data ξ may be dense, but there are only a few relevant features, and thus a parsimonious model x is desirous [2]. So 1 Formally, if πξ denotes the coordinate projection zeroing all indices j of its argument where ξj = 0, then F (πξ (x); ξ) = F (x; ξ) for all x, ξ. This follows from the first-order conditions for convexity [6]. 1 while such sparse data problems are prevalent—natural language processing, information retrieval, and other large data settings all have significant data sparsity—they do not appear to have attracted as much study as their high-dimensional “duals” of dense data and sparse predictors. In this paper, we investigate algorithms and their inherent limitations for solving problem (1) under natural conditions on the data generating distribution. Recent work in the optimization and machine learning communities has shown that data sparsity can be leveraged to develop parallel (and even asynchronous [12]) optimization algorithms [13, 14], but this work does not consider the statistical effects of data sparsity. In another line of research, Duchi et al. 
[4] and McMahan and Streeter [9] develop “adaptive” stochastic gradient algorithms to address problems in sparse data regimes (2). These algorithms exhibit excellent practical performance and have theoretical guarantees on their convergence, but it is not clear if they are optimal—in that no algorithm can attain better statistical performance—or whether they can leverage parallel computing as in the papers [12, 14]. In this paper, we take a two-pronged approach. First, we investigate the fundamental limits of optimization and learning algorithms in sparse data regimes. In doing so, we derive lower bounds on the optimization error of any algorithm for problems of the form (1) with sparsity condition (2). These results have two main implications. They show that in some scenarios, learning with sparse data is quite difficult, as essentially each coordinate j ∈ [d] can be relevant and must be optimized for. In spite of this seemingly negative result, we are also able to show that the A DAG RAD algorithms of [4, 9] are optimal, and we show examples in which their dependence on the dimension d can be made exponentially better than standard gradient methods. As the second facet of our two-pronged approach, we study how sparsity may be leveraged in parallel computing frameworks to give substantially faster algorithms that still achieve optimal sample complexity in terms of the number of samples ξ used. We develop two new algorithms, asynchronous dual averaging (A SYNC DA) and asynchronous A DAG RAD (A SYNC A DAG RAD), which allow asynchronous parallel solution of the problem (1) for general convex f and X . Combining insights of Niu et al.’s H OGWILD ! [12] with a new analysis, we prove our algorithms achieve linear speedup in the number of processors while maintaining optimal statistical guarantees. We also give experiments on text-classification and web-advertising tasks to illustrate the benefits of the new algorithms. 2 Minimax rates for sparse optimization We begin our study of sparse optimization problems by establishing their fundamental statistical and optimization-theoretic properties. To do this, we derive bounds on the minimax convergence rate of any algorithm for such problems. Formally, let x denote any estimator for a minimizer of the objective (1). We define the optimality gap N for the estimator x based on N samples ξ 1 , . . . , ξ N from the distribution P as N (x, F, X , P ) := f (x) − inf f (x) = EP [F (x; ξ)] − inf EP [F (x; ξ)] . x∈X x∈X This quantity is a random variable, since x is a random variable (it is a function of ξ 1 , . . . , ξ N ). To define the minimax error, we thus take expectations of the quantity N , though we require a bit more than simply E[ N ]. We let P denote a collection of probability distributions, and we consider a collection of loss functions F specified by a collection F of convex losses F : X × ξ → R. We can then define the minimax error for the family of losses F and distributions P as ∗ N (X , P, F) := inf sup sup EP [ x P ∈P F ∈F N (x(ξ 1:N ), F, X , P )], (3) where the infimum is taken over all possible estimators x (an estimator is an optimization scheme, or a measurable mapping x : ΞN → X ) . 2.1 Minimax lower bounds Let us now give a more precise characterization of the (natural) set of sparse optimization problems we consider to provide the lower bound. For the next proposition, we let P consist of distributions supported on Ξ = {−1, 0, 1}d , and we let pj := P (ξj = 0) be the marginal probability of appearance of feature j ∈ {1, . . . , d}. 
For our class of functions, we set F to consist of functions F satisfying the sparsity condition (2) and with the additional constraint that for g ∈ ∂x F (x; ξ), we have that the jth coordinate |gj | ≤ Mj for a constant Mj < ∞. We obtain 2 Proposition 1. Let the conditions of the preceding paragraph hold. Let R be a constant such that X ⊃ [−R, R]d . Then √ d pj 1 ∗ . Mj min pj , √ N (X , P, F) ≥ R 8 j=1 N log 3 We provide the proof of Proposition 1 in the supplement A.1 in the full version of the paper, providing a few remarks here. We begin by giving a corollary to Proposition 1 that follows when the data ξ obeys a type of power law: let p0 ∈ [0, 1], and assume that P (ξj = 0) = p0 j −α . We have Corollary 2. Let α ≥ 0. Let the conditions of Proposition 1 hold with Mj ≡ M for all j, and assume the power law condition P (ξj = 0) = p0 j −α on coordinate appearance probabilities. Then (1) If d > (p0 N )1/α , ∗ N (X , P, F) ≥ 2−α 1−α p0 p0 (p0 N ) 2α − 1 + d1−α − (p0 N ) α N 1−α 2 MR 8 2−α (2) If d ≤ (p0 N )1/α , ∗ N (X , P, F) ≥ MR 8 p0 N α 1 1 d1− 2 − 1 − α/2 1 − α/2 . . Expanding Corollary 2 slightly, for simplicity assume the number of samples is large enough that d ≤ (p0 N )1/α . Then we find that the lower bound on optimization error is of order p0 1− α p0 p0 d 2 when α < 2, M R log d when α → 2, and M R when α > 2. (4) N N N These results beg the question of tightness: are they improvable? As we see presently, they are not. MR 2.2 Algorithms for attaining the minimax rate To show that the lower bounds of Proposition 1 and its subsequent specializations are sharp, we review a few stochastic gradient algorithms. We begin with stochastic gradient descent (SGD): SGD repeatedly samples ξ ∼ P , computes g ∈ ∂x F (x; ξ), then performs the update x ← ΠX (x − ηg), where η is a stepsize parameter and ΠX denotes Euclidean projection onto X . Standard analyses of stochastic gradient descent [10] show that after N samples ξ i , the SGD estimator x(N ) satisfies R2 M ( d j=1 1 pj ) 2 √ , (5) N where R2 denotes the 2 -radius of X . Dual averaging, due to Nesterov [11] (sometimes called “follow the regularized leader” [5]) is a more recent algorithm. In dual averaging, one again samples g ∈ ∂x F (x; ξ), but instead of updating the parameter vector x one updates a dual vector z by z ← z + g, then computes 1 x ← argmin z, x + ψ(x) , η x∈X E[f (x(N ))] − inf f (x) ≤ O(1) x∈X 2 1 where ψ(x) is a strongly convex function defined over X (often one takes ψ(x) = 2 x 2 ). As we discuss presently, the dual averaging algorithm is somewhat more natural in asynchronous and parallel computing environments, and it enjoys the same type of convergence guarantees (5) as SGD. The A DAG RAD algorithm [4, 9] is an extension of the preceding stochastic gradient methods. It maintains a diagonal matrix S, where upon receiving a new sample ξ, A DAG RAD performs the following: it computes g ∈ ∂x F (x; ξ), then updates 2 Sj ← Sj + gj for j ∈ [d]. The dual averaging variant of A DAG RAD updates the usual dual vector z ← z + g; the update to x is based on S and a stepsize η and computes x ← argmin z, x + x ∈X 3 1 1 x ,S2x 2η . After N samples ξ, the averaged parameter x(N ) returned by A DAG RAD satisfies R∞ M E[f (x(N ))] − inf f (x) ≤ O(1) √ x∈X N d √ pj , (6) j=1 where R∞ denotes the ∞ -radius of X (cf. [4, Section 1.3 and Theorem 5]). By inspection, the A DAG RAD rate (6) matches the lower bound in Proposition 1 and is thus optimal. 
It is interesting to note, though, that in the power law setting of Corollary 2 (recall the error order (4)), a calculation √ shows that the multiplier for the SGD guarantee (5) becomes R∞ d max{d(1−α)/2 , 1}, while A DA G RAD attains rate at worst R∞ max{d1−α/2 , log d}. For α > 1, the A DAG RAD rate is no worse, √ and for α ≥ 2, is more than d/ log d better—an exponential improvement in the dimension. 3 Parallel and asynchronous optimization with sparsity As we note in the introduction, recent works [12, 14] have suggested that sparsity can yield benefits in our ability to parallelize stochastic gradient-type algorithms. Given the optimality of A DAG RADtype algorithms, it is natural to focus on their parallelization in the hope that we can leverage their ability to “adapt” to sparsity in the data. To provide the setting for our further algorithms, we first revisit Niu et al.’s H OGWILD ! [12]. H OGWILD ! is an asynchronous (parallelized) stochastic gradient algorithm for optimization over product-space domains, meaning that X in problem (1) decomposes as X = X1 × · · · × Xd , where Xj ⊂ R. Fix a stepsize η > 0. A pool of independently running processors then performs the following updates asynchronously to a centralized vector x: 1. Sample ξ ∼ P 2. Read x and compute g ∈ ∂x F (x; ξ) 3. For each j s.t. gj = 0, update xj ← ΠXj (xj − ηgj ). Here ΠXj denotes projection onto the jth coordinate of the domain X . The key of H OGWILD ! is that in step 2, the parameter x is allowed to be inconsistent—it may have received partial gradient updates from many processors—and for appropriate problems, this inconsistency is negligible. Indeed, Niu et al. [12] show linear speedup in optimization time as the number of processors grow; they show this empirically in many scenarios, providing a proof under the somewhat restrictive assumptions that there is at most one non-zero entry in any gradient g and that f has Lipschitz gradients. 3.1 Asynchronous dual averaging A weakness of H OGWILD ! is that it appears only applicable to problems for which the domain X is a product space, and its analysis assumes g 0 = 1 for all gradients g. In effort to alleviate these difficulties, we now develop and present our asynchronous dual averaging algorithm, A SYNC DA. A SYNC DA maintains and upates a centralized dual vector z instead of a parameter x, and a pool of processors perform asynchronous updates to z, where each processor independently iterates: 1. Read z and compute x := argminx∈X 1 z, x + η ψ(x) // Implicitly increment “time” counter t and let x(t) = x 2. Sample ξ ∼ P and let g ∈ ∂x F (x; ξ) // Let g(t) = g. 3. For j ∈ [d] such that gj = 0, update zj ← zj + gj . Because the actual computation of the vector x in A SYNC DA is performed locally on each processor in step 1 of the algorithm, the algorithm can be executed with any proximal function ψ and domain X . The only communication point between any of the processors is the addition operation in step 3. Since addition is commutative and associative, forcing all asynchrony to this point of the algorithm is a natural strategy for avoiding synchronization problems. In our analysis of A SYNC DA, and in our subsequent analysis of the adaptive methods, we require a measurement of time elapsed. With that in mind, we let t denote a time index that exists (roughly) behind-the-scenes. We let x(t) denote the vector x ∈ X computed in the tth step 1 of the A SYNC DA 4 algorithm, that is, whichever is the tth x actually computed by any of the processors. 
This quantity t exists and is recoverable from the algorithm, and it is possible to track the running sum τ =1 x(τ ). Additionally, we state two assumptions encapsulating the conditions underlying our analysis. Assumption A. There is an upper bound m on the delay of any processor. In addition, for each j ∈ [d] there is a constant pj ∈ [0, 1] such that P (ξj = 0) ≤ pj . We also require certain continuity (Lipschitzian) properties of the loss functions; these amount to a second moment constraint on the instantaneous ∂F and a rough measure of gradient sparsity. Assumption B. There exist constants M and (Mj )d such that the following bounds hold for all j=1 2 x ∈ X : E[ ∂x F (x; ξ) 2 ] ≤ M2 and for each j ∈ [d] we have E[|∂xj F (x; ξ)|] ≤ pj Mj . With these definitions, we have the following theorem, which captures the convergence behavior of A SYNC DA under the assumption that X is a Cartesian product, meaning that X = X1 × · · · × Xd , 2 where Xj ⊂ R, and that ψ(x) = 1 x 2 . Note the algorithm itself can still be efficiently parallelized 2 for more general convex X , even if the theorem does not apply. Theorem 3. Let Assumptions A and B and the conditions in the preceding paragraph hold. Then T E t=1 F (x(t); ξ t ) − F (x∗ ; ξ t ) ≤ 1 x∗ 2η d 2 2 η 2 p2 Mj . + T M2 + ηT m j 2 j=1 We now provide a few remarks to explain and simplify the result. Under the more stringent condition 2 d 2 that |∂xj F (x; ξ)| ≤ Mj , Assumption A implies E[ ∂x F (x; ξ) 2 ] ≤ j=1 pj Mj . Thus, for the d 2 remainder of this section we take M2 = j=1 pj Mj , which upper bounds the Lipschitz continuity constant of the objective function f . We then obtain the following corollary. √ T 1 Corollary 4. Define x(T ) = T t=1 x(t), and set η = x∗ 2 /M T . Then E[f (x(T )) − f (x∗ )] ≤ M x∗ √ T 2 +m x∗ 2 √ 2M T d 2 p2 M j . j j=1 Corollary 4 is nearly immediate: since ξ t is independent of x(t), we have E[F (x(t); ξ t ) | x(t)] = f (x(t)); applying Jensen’s inequality to f (x(T )) and performing an algebraic manipulation give the result. If the data is suitably sparse, meaning that pj ≤ 1/m, the bound in Corollary 4 simplifies to 3 M x∗ √ E[f (x(T )) − f (x )] ≤ 2 T ∗ 2 3 = 2 d j=1 2 pj M j x ∗ √ T 2 , (7) which is the convergence rate of stochastic gradient descent even in centralized settings (5). The √ convergence guarantee (7) shows that after T timesteps, the error scales as 1/ T ; however, if we have k processors, updates occur roughly k times as quickly, as they are asynchronous, and in time scaling as N/k, we can evaluate N gradient samples: a linear speedup. 3.2 Asynchronous AdaGrad We now turn to extending A DAG RAD to asynchronous settings, developing A SYNC A DAG RAD (asynchronous A DAG RAD). As in the A SYNC DA algorithm, A SYNC A DAG RAD maintains a shared dual vector z (the sum of gradients) and the shared matrix S, which is the diagonal sum of squares of gradient entries (recall Section 2.2). The matrix S is initialized as diag(δ 2 ), where δj ≥ 0 is an initial value. Each processor asynchronously performs the following iterations: 1 1 1. Read S and z and set G = S 2 . Compute x := argminx∈X { z, x + 2η x, Gx } increment “time” counter t and let x(t) = x, S(t) = S 2. Sample ξ ∼ P and let g ∈ ∂F (x; ξ) 2 3. For j ∈ [d] such that gj = 0, update Sj ← Sj + gj and zj ← zj + gj . 5 // Implicitly As in the description of A SYNC DA, we note that x(t) is the vector x ∈ X computed in the tth “step” of the algorithm (step 1), and similarly associate ξ t with x(t). 
To analyze A SYNC A DAG RAD, we make a somewhat stronger assumption on the sparsity properties of the losses F than Assumption B. 2 Assumption C. There exist constants (Mj )d such that E[(∂xj F (x; ξ))2 | ξj = 0] ≤ Mj for all j=1 x ∈ X. 2 Indeed, taking M2 = j pj Mj shows that Assumption C implies Assumption B with specific constants. We then have the following convergence result. Theorem 5. In addition to the conditions of Theorem 3, let Assumption C hold. Assume that for all 2 j we have δ 2 ≥ Mj m and X ⊂ [−R∞ , R∞ ]d . Then T t=1 E F (x(t); ξ t ) − F (x∗ ; ξ t ) d ≤ min j=1 T 1 2 R E η ∞ 2 δ + gj (t) 2 1 2 T + ηE gj (t) t=1 2 1 2 (1 + pj m), Mj R∞ pj T . t=1 It is possible to relax the condition on the initial constant diagonal term; we defer this to the full version of the paper. It is natural to ask in which situations the bound provided by Theorem 5 is optimal. We note that, as in the case with Theorem 3, we may obtain a convergence rate for f (x(T )) − f (x∗ ) using convexity, T 1 where x(T ) = T t=1 x(t). By Jensen’s inequality, we have for any δ that T E 2 δ + gj (t) 2 1 2 t=1 T ≤ 2 2 E[gj (t) ] δ + t=1 1 2 ≤ 2 δ 2 + T pj Mj . For interpretation, let us now make a few assumptions on the probabilities pj . If we assume that pj ≤ c/m for a universal (numerical) constant c, then Theorem 5 guarantees that d log(T )/T + pj √ (8) , pj , T j=1 √ which is the convergence rate of A DAG RAD except for a small factor of min{ log T /T, pj } in addition to the usual pj /T rate. In particular, optimizing by choosing η = R∞ , and assuming 1 pj T log T , we have convergence guarantee √ d pj E[f (x(T )) − f (x∗ )] ≤ O(1)R∞ Mj min √ , pj , T j=1 E[f (x(T )) − f (x∗ )] ≤ O(1) 1 2 R +η η ∞ Mj min which is minimax optimal by Proposition 1. In fact, however, the bounds of Theorem 5 are somewhat stronger: they provide bounds using the expectation of the squared gradients gj (t) rather than the maximal value Mj , though the bounds are perhaps clearer in the form (8). We note also that our analysis applies to more adversarial settings than stochastic optimization (e.g., to online convex optimization [5]). Specifically, an adversary may choose an arbitrary sequence of functions subject to the random data sparsity constraint (2), and our results provide an expected regret bound, which is strictly stronger than the stochastic convergence guarantees provided (and guarantees high-probability convergence in stochastic settings [3]). Moreover, our comments in Section 2 about the relative optimality of A DAG RAD versus standard gradient methods apply. When the data is sparse, we indeed should use asynchronous algorithms, but using adaptive methods yields even more improvement than simple gradient-based methods. 4 Experiments In this section, we give experimental validation of our theoretical results on A SYNC A DAG RAD and A SYNC DA, giving results on two datasets selected for their high-dimensional sparsity.2 2 In our experiments, A SYNC DA and H OGWILD ! had effectively identical performance. 6 8 0.07 6 5 4 0.024 Test error Training loss Speedup 0.025 0.065 7 0.06 0.055 0.05 0.045 0.04 0.023 0.022 0.021 0.02 0.035 3 0.019 2 1 2 4 0.03 A-A DAG RAD A SYNC DA Number of workers 6 8 10 12 14 0.018 0.025 0.02 16 2 4 6 8 10 12 14 Number of workers 0.017 16 2 4 6 8 10 12 14 Number of workers 16 Figure 1. Experiments with URL data. Left: speedup relative to one processor. Middle: training dataset loss versus number of processors. Right: test set error rate versus number of processors. 
AA DAG RAD abbreviates A SYNC A DAG RAD. 1.03 1.02 1.01 1.00 1.0 1 2 4 8 16 64 256 number of passes A-AdaGrad, η = 0.008 L2 = 0 A-AdaGrad, η = 0.008 L2 = 80 A-DA, η = 0.8 L2 = 0 A-DA, η = 0.8 L2 = 80 1.00 1.01 1.4 1.02 1.03 1.04 Impact of L2 regularizaton on test error 1.04 Fixed stepsizes, test data, L2=0 1.2 relative log-loss 1.6 1.8 Fixed stepsizes, training data, L2=0 A-AdaGrad η = 0.002 A-AdaGrad η = 0.004 A-AdaGrad η = 0.008 A-AdaGrad η = 0.016 A-DA η = 0.800 A-DA η = 1.600 A-DA η = 3.200 1 2 4 8 16 32 number of passes 64 128 256 1 2 4 8 16 32 64 128 256 number of passes Figure 2: Relative accuracy for various stepsize choices on an click-through rate prediction dataset. 4.1 Malicious URL detection For our first set of experiments, we consider the speedup attainable by applying A SYNC A DAG RAD and A SYNC DA, investigating the performance of each algorithm on a malicious URL prediction task [7]. The dataset in this case consists of an anonymized collection of URLs labeled as malicious (e.g., spam, phishing, etc.) or benign over a span of 120 days. The data in this case consists of 2.4 · 106 examples with dimension d = 3.2 · 106 (sparse) features. We perform several experiments, randomly dividing the dataset into 1.2 · 106 training and test samples for each experiment. In Figure 1 we compare the performance of A SYNC A DAG RAD and A SYNC DA after doing after single pass through the training dataset. (For each algorithm, we choose the stepsize η for optimal training set performance.) We perform the experiments on a single machine running Ubuntu Linux with six cores (with two-way hyperthreading) and 32Gb of RAM. From the left-most plot in Fig. 1, we see that up to six processors, both A SYNC DA and A SYNC A DAG RAD enjoy the expected linear speedup, and from 6 to 12, they continue to enjoy a speedup that is linear in the number of processors though at a lesser slope (this is the effect of hyperthreading). For more than 12 processors, there is no further benefit to parallelism on this machine. The two right plots in Figure 1 plot performance of the different methods (with standard errors) versus the number of worker threads used. Both are essentially flat; increasing the amount of parallelism does nothing to the average training loss or the test error rate for either method. It is clear, however, that for this dataset, the adaptive A SYNC A DAG RAD algorithm provides substantial performance benefits over A SYNC DA. 4.2 Click-through-rate prediction experiments We also experiment on a proprietary datasets consisting of search ad impressions. Each example corresponds to showing a search-engine user a particular text ad in response to a query string. From this, we construct a very sparse feature vector based on the text of the ad displayed and the query string (no user-specific data is used). The target label is 1 if the user clicked the ad and -1 otherwise. 7 (B) A-AdaGrad speedup (D) Impact of training data ordering 1.004 1.005 1.006 1.007 1.008 1 2 4 8 16 32 number of passes 64 128 256 1.000 1 2 A-DA base η = 1.600 A-AdaGrad base η = 0.023 0 1.005 relative stepsize (C) Optimal stepsize scaling relative log-loss 1.003 target relative log-loss 1.005 1.010 1.002 1.010 1.015 8 4 0 speedup A-DA η = 1.600 A-AdaGrad η = 0.016 1.001 1.000 relative log-loss 1.015 A-DA, L2=80 A-AdaGrad, L2=80 12 (A) Optimized stepsize for each number of passes 1 2 4 8 16 32 number of passes 64 128 256 1 2 4 8 16 32 64 128 256 number of passes Figure 3. 
(A) Relative test-set log-loss for AsyncDA and AsyncAdaGrad, choosing the best stepsize (within a factor of about 1.4×) individually for each number of passes. (B) Effective speedup for AsyncAdaGrad. (C) The best stepsize η, expressed as a scaling factor on the stepsize used for one pass. (D) Five runs with different random seeds for each algorithm (with ℓ2 penalty 80).

We fit logistic regression models using both AsyncDA and AsyncAdaGrad. We run extensive experiments on a moderate-sized dataset (about 10^7 examples, split between training and testing), which allows thorough investigation of the impact of the stepsize η, the number of training passes,³ and ℓ2 regularization on accuracy. For these experiments we used 32 threads on 16-core machines for each run, as AsyncAdaGrad and AsyncDA achieve similar speedups from parallelization.

On this dataset, AsyncAdaGrad typically achieves an effective additional speedup over AsyncDA of 4× or more. That is, to reach a given level of accuracy, AsyncDA generally needs four times as many effective passes over the dataset. We measure accuracy with log-loss (the logistic loss) averaged over five runs using different random seeds (which control the order in which the algorithms sample examples during training). We report relative values in Figures 2 and 3, that is, the ratio of the mean loss for the given datapoint to the lowest (best) mean loss obtained. Our results are not particularly sensitive to the choice of relative log-loss as the metric of interest; we also considered AUC (the area under the ROC curve) and observed similar results.

Figure 2 shows relative log-loss as a function of the number of training passes for various stepsizes. Without regularization, AsyncAdaGrad is prone to overfitting: it achieves significantly higher accuracy on the training data (Fig. 2, left), but unless the stepsize is tuned carefully to the number of passes, it will overfit (Fig. 2, middle). Fortunately, the addition of ℓ2 regularization largely solves this problem. Indeed, Figure 2 (right) shows that while adding an ℓ2 penalty of 80 has very little impact on AsyncDA, it effectively prevents the overfitting of AsyncAdaGrad.⁴

Fixing the ℓ2 regularization multiplier to 80, we varied the stepsize η over a multiplicative grid with resolution √2 for each number of passes and for each algorithm. Figure 3 reports the results obtained by selecting the best stepsize in terms of test-set log-loss for each number of passes. Figure 3(A) shows the relative log-loss of the best stepsize for each algorithm; Figure 3(B) shows the relative time AsyncDA requires with respect to AsyncAdaGrad to achieve a given loss. Specifically, Fig. 3(B) shows the ratio of the number of passes the algorithms require to achieve a fixed loss, which gives a broader estimate of the speedup obtained by using AsyncAdaGrad; speedups range from 3.6× to 12×. Figure 3(C) shows the optimal stepsizes as a function of the best setting for one pass; they decrease moderately for AsyncAdaGrad but are somewhat noisy for AsyncDA. It is interesting to note that AsyncAdaGrad's accuracy is largely independent of the ordering of the training data, while AsyncDA shows significant variability. This can be seen both in the error bars on Figure 3(A) and explicitly in Figure 3(D), where we plot one line for each of the five random seeds used.
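As a concrete illustration of the evaluation protocol just described (test log-loss averaged over five seeds, a multiplicative stepsize grid of resolution √2, and reporting each setting relative to the best mean loss), here is a small sketch. The function train_and_eval is a hypothetical stand-in for a full training run and is not part of the paper; the grid half-width and base stepsize are likewise arbitrary choices for the example.

import numpy as np

def relative_log_loss_grid(train_and_eval, base_eta, passes_list, n_seeds=5, half_width=4):
    """Evaluate mean test log-loss over a (stepsize x passes) grid and report each
    setting relative to the best (lowest) mean loss, as in Figures 2 and 3.

    train_and_eval(eta, n_passes, seed) -> test log-loss is a hypothetical stand-in
    for one full AsyncAdaGrad or AsyncDA training run with the given random seed."""
    # multiplicative stepsize grid with resolution sqrt(2) around base_eta
    etas = [base_eta * np.sqrt(2.0) ** k for k in range(-half_width, half_width + 1)]
    mean_loss = {}
    for eta in etas:
        for n_passes in passes_list:
            runs = [train_and_eval(eta, n_passes, seed) for seed in range(n_seeds)]
            mean_loss[(eta, n_passes)] = float(np.mean(runs))
    best = min(mean_loss.values())
    # ratio of each setting's mean loss to the best mean loss over the grid
    return {setting: loss / best for setting, loss in mean_loss.items()}

Selecting, for each number of passes, the stepsize with the smallest relative value reproduces the kind of per-pass "best stepsize" summary shown in Figure 3(A) and 3(C).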
Thus, while on the one hand AsyncDA requires somewhat less tuning of the stepsize and ℓ2 parameter, tuning AsyncAdaGrad is much easier because of its predictable response.

³ Here "number of passes" more precisely means the expected number of times each example in the dataset is trained on. That is, each worker thread randomly selects a training example from the dataset for each update, and we continued making updates until (dataset size) × (number of passes) updates had been processed.

⁴ For both algorithms, this is accomplished by adding the term 80 η ||x||₂² to the ψ function. We can achieve slightly better results for AsyncAdaGrad by varying the ℓ2 penalty with the number of passes.

References

[1] P. Auer and C. Gentile. Adaptive and self-confident online learning algorithms. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, 2000.
[2] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
[3] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057, September 2004.
[4] J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
[5] E. Hazan. The convex optimization approach to regret minimization. In Optimization for Machine Learning, chapter 10. MIT Press, 2012.
[6] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1996.
[7] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker. Identifying malicious URLs: an application of large-scale online learning. In Proceedings of the 26th International Conference on Machine Learning, 2009.
[8] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[9] B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the Twenty-Third Annual Conference on Computational Learning Theory, 2010.
[10] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[11] Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):261–283, 2009.
[12] F. Niu, B. Recht, C. Ré, and S. Wright. Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, 2011.
[13] P. Richtárik and M. Takáč. Parallel coordinate descent methods for big data optimization. arXiv:1212.0873 [math.OC], 2012. URL http://arxiv.org/abs/1212.0873.
[14] M. Takáč, A. Bijral, P. Richtárik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.
2 0.89838374 50 nips-2013-Bayesian Mixture Modelling and Inference based Thompson Sampling in Monte-Carlo Tree Search
Author: Aijun Bai, Feng Wu, Xiaoping Chen
Abstract: Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning and learning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. To address this, we present a novel approach for MCTS using Bayesian mixture modeling and inference based Thompson sampling and apply it to the problem of online planning in MDPs. Our algorithm, named Dirichlet-NormalGamma MCTS (DNG-MCTS), models the uncertainty of the accumulated reward for actions in the search tree as a mixture of Normal distributions. We perform inferences on the mixture in Bayesian settings by choosing conjugate priors in the form of combinations of Dirichlet and NormalGamma distributions and select the best action at each decision node using Thompson sampling. Experimental results confirm that our algorithm advances the state-of-the-art UCT approach with better values on several benchmark problems. 1
3 0.89736086 79 nips-2013-DESPOT: Online POMDP Planning with Regularization
Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee
Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp.nus.edu.sg/pmwiki/farm/appl/. 1
same-paper 4 0.89305991 281 nips-2013-Robust Low Rank Kernel Embeddings of Multivariate Distributions
Author: Le Song, Bo Dai
Abstract: Kernel embedding of distributions has led to many recent advances in machine learning. However, latent and low rank structures prevalent in real world distributions have rarely been taken into account in this setting. Furthermore, no prior work in kernel embedding literature has addressed the issue of robust embedding when the latent and low rank information are misspecified. In this paper, we propose a hierarchical low rank decomposition of kernels embeddings which can exploit such low rank structures in data while being robust to model misspecification. We also illustrate with empirical evidence that the estimated low rank embeddings lead to improved performance in density estimation. 1
Author: Harikrishna Narasimhan, Shivani Agarwal
Abstract: We investigate the relationship between three fundamental problems in machine learning: binary classification, bipartite ranking, and binary class probability estimation (CPE). It is known that a good binary CPE model can be used to obtain a good binary classification model (by thresholding at 0.5), and also to obtain a good bipartite ranking model (by using the CPE model directly as a ranking model); it is also known that a binary classification model does not necessarily yield a CPE model. However, not much is known about other directions. Formally, these relationships involve regret transfer bounds. In this paper, we introduce the notion of weak regret transfer bounds, where the mapping needed to transform a model from one problem to another depends on the underlying probability distribution (and in practice, must be estimated from data). We then show that, in this weaker sense, a good bipartite ranking model can be used to construct a good classification model (by thresholding at a suitable point), and more surprisingly, also to construct a good binary CPE model (by calibrating the scores of the ranking model). 1
6 0.85540104 201 nips-2013-Multi-Task Bayesian Optimization
7 0.84746581 251 nips-2013-Predicting Parameters in Deep Learning
8 0.84625423 75 nips-2013-Convex Two-Layer Modeling
9 0.84610546 99 nips-2013-Dropout Training as Adaptive Regularization
10 0.84537369 318 nips-2013-Structured Learning via Logistic Regression
11 0.84423453 30 nips-2013-Adaptive dropout for training deep neural networks
12 0.84336221 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
13 0.84328926 301 nips-2013-Sparse Additive Text Models with Low Rank Background
14 0.84276778 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
15 0.84275359 294 nips-2013-Similarity Component Analysis
16 0.84258997 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning
17 0.84243345 173 nips-2013-Least Informative Dimensions
18 0.84156007 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
19 0.84001875 5 nips-2013-A Deep Architecture for Matching Short Texts
20 0.83999652 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization