nips nips2013 nips2013-245 knowledge-graph by maker-knowledge-mining

245 nips-2013-Pass-efficient unsupervised feature selection


Source: pdf

Author: Crystal Maung, Haim Schweitzer

Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. [sent-4, score-0.237]

2 We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. [sent-5, score-0.466]

3 The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. [sent-6, score-0.153]

4 Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. [sent-7, score-0.181]

5 We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. [sent-8, score-0.108]

6 These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. [sent-9, score-0.289]

7 Introduction: Work on unsupervised feature selection has received considerable attention. [sent-11, score-0.169]

8 In numerical linear algebra unsupervised feature selection is known as the column subset selection problem, where one attempts to identify a small subset of matrix columns that can approximate the entire column space of the matrix. [sent-15, score-0.43]
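To make the column subset selection objective concrete, here is a minimal NumPy sketch (not taken from the paper) that scores a given index set by how well least-squares combinations of the selected columns reproduce the whole matrix; the function name and the random test matrix are illustrative choices.

```python
# Sketch: scoring a column subset by how well its span approximates the full matrix.
import numpy as np

def subset_reconstruction_error(A, selected):
    """Frobenius-norm error left after projecting every column of A onto span(A[:, selected])."""
    C = A[:, selected]                                  # m x |S| block of chosen columns
    coeffs, *_ = np.linalg.lstsq(C, A, rcond=None)      # least-squares fit of all n columns
    return np.linalg.norm(A - C @ coeffs)

A = np.random.randn(100, 20)
print(subset_reconstruction_error(A, [0, 3, 7]))
```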

9 The distinction between supervised and unsupervised feature selection is as follows. [sent-19, score-0.169]

10 In the supervised case one is given labeled objects as training data and features are selected to help predict that label; in the unsupervised case nothing is known about the labels. [sent-20, score-0.154]

11 We describe an improvement to the classical Businger and Golub pivoted QR algorithm [9, 10]. [sent-21, score-0.112]

12 The QRP selects features one by one, using k passes in order to select k features. [sent-23, score-0.473]

13 In each pass the selected feature is the one that is the hardest to approximate by the previously selected features. [sent-24, score-0.278]
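The greedy rule described here can be sketched in a few lines of NumPy; this is an illustrative reimplementation of pivoted-QR-style selection (one pass per selected feature), not the authors' code, and the helper name is made up.

```python
# Illustrative sketch of pivoted-QR-style greedy selection: k passes, one pick per pass.
import numpy as np

def greedy_qr_select(A, k):
    R = A.astype(float).copy()          # residual of every column w.r.t. the selected set
    selected = []
    for _ in range(k):                  # one pass over the data per selected feature
        v = np.sum(R * R, axis=0)       # squared residual norm of each column
        j = int(np.argmax(v))           # hardest column to approximate so far
        selected.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])   # orthonormal direction of the new pick
        R -= np.outer(q, q @ R)         # remove that direction from every column
    return selected

A = np.random.randn(50, 200)
print(greedy_qr_select(A, 5))
```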

14 We achieve improvements to the algorithm run time and pass efficiency without affecting the selection and the excellent numerical stability of the original algorithm. [sent-25, score-0.324]

15 Our algorithm is deterministic, and runs in a small number of passes over the data. [sent-26, score-0.379]

16 In each pass we identify multiple features that are hard to approximate with the previously selected features. [sent-28, score-0.275]

17 A second selection step among these features uses an upper bound on unselected features that enables identifying multiple features that are guaranteed to have been selected by the QRP. [sent-29, score-0.345]

18 Since the error of approximating a feature can only decrease when additional features are added to the selection, there is no need to evaluate candidates with error that is already “too small”. [sent-32, score-0.194]

19 This allows a significant reduction in the number of candidate features that need to be considered in each pass. [sent-33, score-0.113]
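A tiny sketch of the skipping test implied here (my own phrasing, not the paper's code): a stale error value can only over-estimate the current one, so a column whose stored value is already at or below the current threshold cannot enter the candidate list.

```python
# Sketch of the skip test: stored errors only shrink as more features are selected,
# so a stale value at or below the threshold T rules the column out of this pass.
def can_skip(v_stale, T):
    return v_stale <= T

print(can_skip(0.05, T=0.2))   # True: this column need not be re-evaluated in this pass
```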

20 Algorithms for unsupervised feature selection: The algorithms that we consider take as input large matrices of numeric values. [sent-35, score-0.169]

21 We denote by m the number of rows, by n the number of columns (features), and by k the number of features to be selected. [sent-36, score-0.166]

22 Criteria for evaluating algorithms include their run time and memory requirements, the number of passes over the data, and the algorithm accuracy. [sent-37, score-0.465]

23 We review some classical and recent algorithms for unsupervised feature selection. [sent-39, score-0.125]

24 It requires k passes for selecting k features, and its run time is 4kmn − 2k^2(m + n) + 4k^3/3. [sent-43, score-0.417]
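As a quick illustration (my own arithmetic, not from the paper), the stated count can be evaluated for a concrete problem size to see that the 4kmn term dominates:

```python
# Evaluating the stated QRP flop count for an example problem size (illustration only).
def qrp_flops(k, m, n):
    return 4 * k * m * n - 2 * k**2 * (m + n) + 4 * k**3 / 3

print(qrp_flops(k=50, m=10_000, n=100_000))   # ~2e11, dominated by the 4kmn term
```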

25 A recent study [11] compares experimentally the accuracy of the QRP as a feature selection algorithm to some recently proposed state-of-the-art algorithms. [sent-44, score-0.113]

26 It computes an initial selection (typically by using the QRP), and then repeatedly swaps selected columns with unselected columns. [sent-48, score-0.239]

27 The swapping is done so that the product of singular values of the matrix formed by the selected columns is increased with each swapping. [sent-49, score-0.161]

28 The algorithm requires random access memory, and it is not clear how to implement it by a series of passes over the data. [sent-50, score-0.379]

29 Frieze et al [12, 13] have proposed a randomized algorithm that requires only two passes over the data. [sent-54, score-0.453]

30 This assumes that the norms of all matrix columns are known in advance, and guarantees only an additive approximation error. [sent-55, score-0.131]

31 Volume sampling Deshpande et al [14] have studied a randomized algorithm that samples k-tuples of columns with probability proportional to their “volume”. [sent-57, score-0.172]

32 They describe an efficient O(kmn) randomized algorithm that can be implemented in k passes and approximates this sampling scheme. [sent-60, score-0.478]

33 Further improvements to the speed of volume sampling in [6] have reduced the run time complexity to O(km^2 n). [sent-63, score-0.099]

34 Leverage sampling The idea behind leverage sampling is to randomly select features with probability proportional to their “leverage”. [sent-65, score-0.135]
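A hedged sketch of that idea follows (exact SVD shown for clarity; the cited work estimates leverage values with randomized algorithms instead, and the function name and sizes here are illustrative):

```python
# Sketch of leverage-score sampling: columns are drawn with probability proportional
# to their rank-k leverage, computed here from an exact SVD for simplicity.
import numpy as np

def leverage_sample(A, k, size, seed=None):
    rng = np.random.default_rng(seed)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(Vt[:k, :] ** 2, axis=0)       # rank-k leverage score of each column
    return rng.choice(A.shape[1], size=size, replace=False, p=lev / lev.sum())

A = np.random.randn(100, 300)
print(leverage_sample(A, k=10, size=10, seed=0))
```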

35 In particular, the “two stage” algorithm described in [2] requires only 2 passes if the leverage values are known. [sent-68, score-0.446]

36 Its run time is dominated by the calculation of the leverage values. [sent-69, score-0.131]

37 To the best of our knowledge the currently best algorithms for estimating leverage values are randomized [17, 18]. [sent-70, score-0.141]

38 One run takes 2 passes and O(mn log n + m^3) time. [sent-71, score-0.417]

39 (x̃i is the error vector of approximating xi by a linear combination of the columns in S.) [sent-91, score-0.199]

40 At the end of the pass set z1 = arg max vi, and initialize S = (z1). [sent-92, score-0.481]

41 For i = 1, . . . , n set vi to the squared error of approximating xi by a linear combination of the columns in S. [sent-102, score-0.434]

42 At the end of pass j set zj = arg max vi , and add zj to S. [sent-103, score-0.659]

43 describe how to compute QR factorization using their randomized Interpolative Decomposition. [sent-108, score-0.099]

44 The following values depend on the data: the number of passes p, the number of IO-passes q (explained below), and a unit cost of orthogonalization c (see Section 4. [sent-117, score-0.459]

45 For l ≈ k our experiments show that the number of passes is typically much smaller than k. [sent-121, score-0.379]

46 The number of passes is even smaller if one considers IO-passes. [sent-122, score-0.379]

47 To explain what we mean by IO-passes consider as an example a situation where the algorithm runs three passes over the data. [sent-123, score-0.379]

48 In the first pass all n features are being accessed. [sent-124, score-0.245]

49 We believe that q is a relevant measure of the algorithm pass complexity when skipping is cheap, so that the cost of a pass over the data is the amount of data that needs to be read. [sent-128, score-0.402]
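A toy illustration of the IO-pass accounting described above (the numbers are invented): each pass contributes the fraction of the data it actually reads, so heavy skipping keeps q well below the raw pass count p.

```python
# Toy IO-pass accounting: pass 1 reads everything, later passes skip most of the data.
fractions_read = [1.0, 0.4, 0.1]   # invented per-pass fractions of the columns accessed
p = len(fractions_read)            # number of passes
q = sum(fractions_read)            # number of IO-passes
print(p, q)                        # 3 passes, but only 1.5 IO-passes worth of reading
```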

50 Memory-efficient implementations: The implementations shown in Figure 2 update the memory where the matrix A is stored. [sent-140, score-0.161]

51 The flops count is dominated by Steps 1 and 2, which cost at most 4(j − 1)mn at pass j. [sent-144, score-0.203]

52 Figure 2: Standard implementations of Step 2 (modified Gram-Schmidt and Householder orthogonalization). [sent-179, score-0.12]

53 Figure 3: Memory-efficient implementations of Step 2 (modified Gram-Schmidt and Householder orthogonalization). [sent-206, score-0.12]
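The figures themselves are not reproduced on this page; as a stand-in, here is a minimal sketch of a modified Gram-Schmidt version of Step 2 (orthogonalizing one column against the current basis). It is an assumption about the step's shape, not the paper's listing.

```python
# Sketch of modified Gram-Schmidt for Step 2: remove from x its components along the
# orthonormal basis Q of the selected columns, one basis vector at a time.
import numpy as np

def mgs_residual(x, Q):
    r = x.astype(float).copy()
    for j in range(Q.shape[1]):
        r -= (Q[:, j] @ r) * Q[:, j]   # use the updated residual at every step (MGS)
    return r

Q, _ = np.linalg.qr(np.random.randn(50, 3))       # toy orthonormal basis
x = np.random.randn(50)
print(np.linalg.norm(Q.T @ mgs_residual(x, Q)))   # ~0: residual is orthogonal to span(Q)
```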

54 The algorithm maintains three ordered lists of columns: The list F is the input list containing all columns. [sent-208, score-0.228]

55 The list S contains columns that have already been selected. [sent-209, score-0.19]

56 For each column xi in F the algorithm maintains an integer value ri and a real value vi . [sent-211, score-0.53]

57 They are defined as follows: ri ≤ |S|, vi = vi(ri) = ||xi − Qri Qri^T xi||^2 (1), where Qri = (q1, . . . [sent-213, score-0.765]

58 , qri) is an orthonormal basis to the first ri columns in S. [sent-216, score-0.3]

59 Thus, vi (ri ) is the (squared) error of approximating xi with the first ri columns in S. [sent-217, score-0.577]
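The quantity in Eq. (1) can be computed directly; this short sketch (my own code) also checks the monotonicity used later for skipping: the error can only decrease as more columns are selected.

```python
# Sketch of Eq. (1): squared error of approximating x using the first r selected columns.
import numpy as np

def v(x, Q, r):
    Qr = Q[:, :r]                    # orthonormal basis of the first r selected columns
    resid = x - Qr @ (Qr.T @ x)      # x minus its projection onto span(Q_r)
    return float(resid @ resid)

Q, _ = np.linalg.qr(np.random.randn(40, 5))
x = np.random.randn(40)
print(v(x, Q, 3) >= v(x, Q, 5))      # True: larger r can only shrink the error
```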

60 In each pass the algorithm identifies the l candidate columns xi corresponding to the l largest values of vi (|S|). [sent-218, score-0.649]

61 That is, the vi values are computed as the error of predicting each candidate by all columns currently in S. [sent-219, score-0.357]

62 The identified l columns with the largest vi (|S|) are stored in the list L. [sent-220, score-0.493]

63 In addition, the value of the l+1’th largest vi (|S|) is kept as the constant BF . [sent-221, score-0.299]

64 Thus, after a pass is terminated the following condition holds: vα(rα) ≤ BF for all xα ∈ F \ L. (2) [sent-222, score-0.177]

65 The list L and the value BF can be calculated in one pass using a binary heap data structure, with the cost of at most n log(l + 1) comparisons. [sent-223, score-0.445]
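A sketch of that one-pass heap step using Python's heapq (my construction, with made-up values): keep the l+1 largest v-values seen so far, return the top l as the candidate list L and the (l+1)'th largest as the bound BF.

```python
# One-pass selection of the l largest v-values (candidate list L) plus the bound B_F,
# using a min-heap of size l+1 as described in the text.
import heapq

def top_candidates(values, l):
    heap = []                                # min-heap of (v, index), size at most l+1
    for i, v in enumerate(values):
        if len(heap) < l + 1:
            heapq.heappush(heap, (v, i))
        elif v > heap[0][0]:                 # conditional insert: beats the threshold T
            heapq.heapreplace(heap, (v, i))
    B_F, _ = heapq.heappop(heap)             # smallest of the l+1 kept values
    return [i for _, i in sorted(heap, reverse=True)], B_F

print(top_candidates([0.3, 2.5, 0.9, 4.1, 1.7, 0.2], l=3))   # ([3, 1, 4], 0.9)
```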

66 The threshold value T is defined by: T = −∞ if the heap is not full. [sent-229, score-0.176]

67 Input: The matrix columns (features) x1, . . . [sent-231, score-0.131]

68 Fill the heap with the candidates corresponding to the l+1 largest vi (0). [sent-247, score-0.51]

69 Thus, when the heap is full, T is the value of v associated with the l+1’th largest candidate encountered so far. [sent-271, score-0.239]

70 If ri = |S| conditionally insert xi into the heap. [sent-286, score-0.246]

71 To move candidates from L to S run the QRP on L as long as the pivot value is above BF . [sent-306, score-0.124]

72 (The pivot value is the largest value of vi (|S|) in L.) [sent-307, score-0.3]
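A sketch of this second selection stage (illustrative, with invented sizes): run pivoted-QR steps on the candidate block and keep accepting picks while the pivot value stays above BF, since below that bound an unseen column could have been the true winner.

```python
# Sketch of the second stage: accept candidates from L while the pivot (largest
# residual within L) exceeds the bound B_F on everything left outside L.
import numpy as np

def select_from_candidates(Lcols, B_F):
    R = Lcols.astype(float).copy()
    accepted = []
    while True:
        v = np.sum(R * R, axis=0)            # current v-values of the candidates
        j = int(np.argmax(v))
        if v[j] <= B_F:                      # bound reached: stop and start a new pass
            break
        accepted.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(q, q @ R)              # orthogonalize remaining candidates
    return accepted

print(select_from_candidates(np.random.randn(30, 6), B_F=1.0))
```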

73 For j = 0 the QRP selects xj with vj = |xj|^2 = max |xi|^2. [sent-339, score-0.152]

74 The IQRP selects ṽj as the largest among the l largest values in F. [sent-340, score-0.21]

75 Therefore ṽj = max_{xi ∈ L} |xi|^2 = max_{xi ∈ F} |xi|^2 = vj. [sent-341, score-0.33]

76 Let vj(|S|) be the value of the j+1'th selection by the QRP, and let ṽj(|S|) be the value of the j+1'th selection by the IQRP. [sent-343, score-0.356]

77 The QRP selection of j satisfies: vj(|S|) = max_{xi ∈ F} vi(|S|). [sent-345, score-0.472]

78 (Initially L is created from the heap elements that have ri = |S|.) [sent-347, score-0.345]

79 The IQRP selection satisfies: ṽj(|S|) = max vi(|S|) and ṽj(|S|) ≥ BF. [sent-353, score-0.539]

80 Therefore, combining (4) and (5) we get ṽj(|S|) = max vi(|S|) = vj(|S|). [sent-356, score-0.467]

81 The value of BF is the l+1'th largest vi(|S|), while the maximum at B. [sent-362, score-0.274]

82 At pass j the number of selected columns is kj , and the number of columns that were not skipped in Step 2. [sent-372, score-0.46]

83 For the faster implementation that overwrites the input it can be shown that: flops_time = 2mn + 4m Σ_{i=1}^{n} r̃i, (7) where r̃i is the value of ri at termination. [sent-378, score-0.459]

84 Since r̃i ≤ k − 1 it follows that flops_time ≤ 4kmn. [sent-379, score-0.143]

85 Memory: the memory-efficient implementation requires km in-core floats, and additional memory for the heap, which can be reused for the list L. [sent-381, score-0.195]

86 Additional memory to store and manipulate vi , ri for i = 1, . [sent-382, score-0.426]

87 Observe that these memory locations are being accessed consecutively, and can be efficiently stored and manipulated out-of-core. [sent-386, score-0.1]

88 We wish to distinguish between a pass where the entire data is accessed and a pass where most of the data is skipped. [sent-393, score-0.377]

89 Testing for the skipping and manipulating the heap requires floating point comparisons. [sent-396, score-0.224]

90 # passes is the number of passes needed to select k features. [sent-406, score-0.758]

91 Thus, the ratio between the number of IO-passes and the number of passes is the fraction of the data that was not skipped. [sent-411, score-0.379]

92 We describe experiments with the list size l taken as l = k. [sent-414, score-0.117]

93 We describe experiments with the list size l taken as l = k, and also with l = 100 regardless of the value of k. [sent-423, score-0.117]

94 In absolute terms the number of passes was below 10 for most of the data; the number of IO-passes was below 2 for most of the data. [sent-431, score-0.379]

95 Our experiments show that for typical datasets the number of passes is significantly smaller than k. [sent-436, score-0.416]

96 In situations where memory can be skipped the notion of IO-passes may be more accurate than passes. [sent-437, score-0.107]

97 [Figure residue: axis labels for plots of flops/kmn, #passes, and #IO-passes.] [sent-440, score-0.758]

98 This appears to suggest that worst case analysis should not be considered as the only criterion for evaluating feature selection algorithms. [sent-445, score-0.113]

99 We observe that the IQRP is competitive in terms of the number of passes and appears to outperform these algorithms in terms of the number of IO-passes. [sent-447, score-0.379]

100 An improved approximation algorithm for the column subset selection problem. [sent-461, score-0.101]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('qrp', 0.452), ('passes', 0.379), ('iqrp', 0.295), ('vi', 0.235), ('bf', 0.208), ('qj', 0.193), ('pass', 0.177), ('heap', 0.176), ('householder', 0.173), ('hj', 0.166), ('ops', 0.16), ('ri', 0.143), ('opstime', 0.137), ('businger', 0.118), ('opsmemory', 0.118), ('vj', 0.106), ('columns', 0.098), ('list', 0.092), ('zj', 0.089), ('orthogonalization', 0.08), ('qr', 0.076), ('xi', 0.076), ('randomized', 0.074), ('selection', 0.072), ('golub', 0.07), ('features', 0.068), ('leverage', 0.067), ('candidates', 0.06), ('gisette', 0.059), ('kmn', 0.059), ('maxxi', 0.059), ('pivoted', 0.059), ('qri', 0.059), ('xzj', 0.059), ('unsupervised', 0.056), ('drineas', 0.055), ('thrombin', 0.052), ('deshpande', 0.052), ('frieze', 0.048), ('skipping', 0.048), ('memory', 0.048), ('halko', 0.045), ('mn', 0.041), ('wi', 0.041), ('boutsidis', 0.041), ('feature', 0.041), ('implementations', 0.04), ('amazon', 0.04), ('nj', 0.04), ('unselected', 0.039), ('largest', 0.039), ('run', 0.038), ('improvements', 0.037), ('savings', 0.037), ('mahoney', 0.037), ('skipped', 0.035), ('matrix', 0.033), ('temporary', 0.032), ('selected', 0.03), ('skip', 0.03), ('create', 0.03), ('implementation', 0.03), ('arg', 0.029), ('stored', 0.029), ('column', 0.029), ('dallas', 0.029), ('gu', 0.029), ('th', 0.029), ('classical', 0.028), ('insert', 0.027), ('pivot', 0.026), ('kannan', 0.026), ('dominated', 0.026), ('selects', 0.026), ('approximating', 0.025), ('km', 0.025), ('oating', 0.025), ('kept', 0.025), ('describe', 0.025), ('candidate', 0.024), ('integer', 0.024), ('volume', 0.024), ('accurate', 0.024), ('accessed', 0.023), ('maintains', 0.023), ('steps', 0.023), ('slower', 0.022), ('editor', 0.022), ('soda', 0.022), ('recommend', 0.022), ('kj', 0.022), ('ordered', 0.021), ('reduction', 0.021), ('end', 0.02), ('max', 0.02), ('typical', 0.019), ('january', 0.019), ('texas', 0.019), ('explained', 0.019), ('datasets', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 245 nips-2013-Pass-efficient unsupervised feature selection

Author: Crystal Maung, Haim Schweitzer

Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1

2 0.068009526 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

Author: Hu Ding, Ronald Berezney, Jinhui Xu

Abstract: In this paper, we study the following new variant of prototype learning, called k-prototype learning problem for 3D rigid structures: Given a set of 3D rigid structures, find a set of k rigid structures so that each of them is a prototype for a cluster of the given rigid structures and the total cost (or dissimilarity) is minimized. Prototype learning is a core problem in machine learning and has a wide range of applications in many areas. Existing results on this problem have mainly focused on the graph domain. In this paper, we present the first algorithm for learning multiple prototypes from 3D rigid structures. Our result is based on a number of new insights to rigid structures alignment, clustering, and prototype reconstruction, and is practically efficient with quality guarantee. We validate our approach using two type of data sets, random data and biological data of chromosome territories. Experiments suggest that our approach can effectively learn prototypes in both types of data. 1

3 0.06126669 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

Author: Michael Hughes, Erik Sudderth

Abstract: Variational inference algorithms provide the most effective framework for largescale training of Bayesian nonparametric models. Stochastic online approaches are promising, but are sensitive to the chosen learning rate and often converge to poor local optima. We present a new algorithm, memoized online variational inference, which scales to very large (yet finite) datasets while avoiding the complexities of stochastic gradient. Our algorithm maintains finite-dimensional sufficient statistics from batches of the full dataset, requiring some additional memory but still scaling to millions of examples. Exploiting nested families of variational bounds for infinite nonparametric models, we develop principled birth and merge moves allowing non-local optimization. Births adaptively add components to the model to escape local optima, while merges remove redundancy and improve speed. Using Dirichlet process mixture models for image clustering and denoising, we demonstrate major improvements in robustness and accuracy.

4 0.058643587 65 nips-2013-Compressive Feature Learning

Author: Hristo S. Paskov, Robert West, John C. Mitchell, Trevor Hastie

Abstract: This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning. 1

5 0.058209028 188 nips-2013-Memory Limited, Streaming PCA

Author: Ioannis Mitliagkas, Constantine Caramanis, Prateek Jain

Abstract: We consider streaming, one-pass principal component analysis (PCA), in the highdimensional regime, with limited memory. Here, p-dimensional samples are presented sequentially, and the goal is to produce the k-dimensional subspace that best approximates these points. Standard algorithms require O(p2 ) memory; meanwhile no algorithm can do better than O(kp) memory, since this is what the output itself requires. Memory (or storage) complexity is most meaningful when understood in the context of computational and sample complexity. Sample complexity for high-dimensional PCA is typically studied in the setting of the spiked covariance model, where p-dimensional points are generated from a population covariance equal to the identity (white noise) plus a low-dimensional perturbation (the spike) which is the signal to be recovered. It is now well-understood that the spike can be recovered when the number of samples, n, scales proportionally with the dimension, p. Yet, all algorithms that provably achieve this, have memory complexity O(p2 ). Meanwhile, algorithms with memory-complexity O(kp) do not have provable bounds on sample complexity comparable to p. We present an algorithm that achieves both: it uses O(kp) memory (meaning storage of any kind) and is able to compute the k-dimensional spike with O(p log p) samplecomplexity – the first algorithm of its kind. While our theoretical analysis focuses on the spiked covariance model, our simulations show that our algorithm is successful on much more general models for the data. 1

6 0.05727642 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel

7 0.05338515 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse

8 0.050918046 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

9 0.050837189 43 nips-2013-Auxiliary-variable Exact Hamiltonian Monte Carlo Samplers for Binary Distributions

10 0.050476771 134 nips-2013-Graphical Models for Inference with Missing Data

11 0.050141897 186 nips-2013-Matrix factorization with binary components

12 0.048739936 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning

13 0.048481587 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling

14 0.048317261 300 nips-2013-Solving the multi-way matching problem by permutation synchronization

15 0.044656646 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

16 0.044467974 331 nips-2013-Top-Down Regularization of Deep Belief Networks

17 0.044118453 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices

18 0.042568892 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing

19 0.041149061 91 nips-2013-Dirty Statistical Models

20 0.040086277 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.117), (1, 0.04), (2, -0.0), (3, 0.021), (4, 0.032), (5, 0.02), (6, 0.004), (7, 0.007), (8, -0.03), (9, -0.007), (10, 0.043), (11, 0.007), (12, 0.017), (13, 0.013), (14, -0.002), (15, 0.033), (16, -0.022), (17, -0.024), (18, -0.04), (19, 0.029), (20, -0.02), (21, -0.056), (22, 0.014), (23, 0.048), (24, -0.002), (25, 0.015), (26, -0.031), (27, -0.034), (28, 0.062), (29, -0.102), (30, -0.002), (31, 0.046), (32, 0.005), (33, 0.028), (34, 0.013), (35, 0.054), (36, -0.081), (37, 0.019), (38, -0.006), (39, -0.039), (40, -0.044), (41, 0.03), (42, 0.027), (43, 0.051), (44, 0.027), (45, 0.006), (46, 0.014), (47, 0.046), (48, -0.033), (49, -0.069)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.90245944 245 nips-2013-Pass-efficient unsupervised feature selection

Author: Crystal Maung, Haim Schweitzer

Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1

2 0.66431856 293 nips-2013-Sign Cauchy Projections and Chi-Square Kernel

Author: Ping Li, Gennady Samorodnitsk, John Hopcroft

Abstract: The method of stable random projections is useful for efficiently approximating the lα distance (0 < α ≤ 2) in high dimension and it is naturally suitable for data streams. In this paper, we propose to use only the signs of the projected data and we analyze the probability of collision (i.e., when the two signs differ). Interestingly, when α = 1 (i.e., Cauchy random projections), we show that the probability of collision can be accurately approximated as functions of the chi-square (χ2 ) similarity. In text and vision applications, the χ2 similarity is a popular measure when the features are generated from histograms (which are a typical example of data streams). Experiments confirm that the proposed method is promising for large-scale learning applications. The full paper is available at arXiv:1308.1009. There are many future research problems. For example, when α → 0, the collision probability is a function of the resemblance (of the binary-quantized data). This provides an effective mechanism for resemblance estimation in data streams. 1

3 0.599693 111 nips-2013-Estimation, Optimization, and Parallelism when Data is Sparse

Author: John Duchi, Michael Jordan, Brendan McMahan

Abstract: We study stochastic optimization problems when the data is sparse, which is in a sense dual to current perspectives on high-dimensional statistical learning and optimization. We highlight both the difficulties—in terms of increased sample complexity that sparse data necessitates—and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium to large-scale learning tasks. 1

4 0.57459068 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning

Author: Daniel Hernández-Lobato, José Miguel Hernández-Lobato

Abstract: A probabilistic model based on the horseshoe prior is proposed for learning dependencies in the process of identifying relevant features for prediction. Exact inference is intractable in this model. However, expectation propagation offers an approximate alternative. Because the process of estimating feature selection dependencies may suffer from over-fitting in the model proposed, additional data from a multi-task learning scenario are considered for induction. The same model can be used in this setting with few modifications. Furthermore, the assumptions made are less restrictive than in other multi-task methods: The different tasks must share feature selection dependencies, but can have different relevant features and model coefficients. Experiments with real and synthetic data show that this model performs better than other multi-task alternatives from the literature. The experiments also show that the model is able to induce suitable feature selection dependencies for the problems considered, only from the training data. 1
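As a point of reference for the prior used here, the following short sketch draws one set of coefficients from a plain (independent) horseshoe prior; the dependency model and the expectation-propagation inference described in the abstract are not reproduced, and the function name is ours.

```python
import numpy as np

def sample_horseshoe_coefficients(n_features, rng=None):
    """Draw regression coefficients from a plain horseshoe prior.

    beta_j ~ N(0, lambda_j^2 * tau^2), with lambda_j, tau ~ Half-Cauchy(0, 1).
    The dependency structure and EP inference of the paper are not shown.
    """
    rng = np.random.default_rng() if rng is None else rng
    tau = np.abs(rng.standard_cauchy())            # global shrinkage scale
    lam = np.abs(rng.standard_cauchy(n_features))  # local shrinkage scales
    return rng.normal(0.0, lam * tau)

print(sample_horseshoe_coefficients(10, np.random.default_rng(1)))
```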

5 0.54198849 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation

Author: Edoardo M. Airoldi, Thiago B. Costa, Stanley H. Chan

Abstract: Non-parametric approaches for analyzing network data based on exchangeable graph models (ExGM) have recently gained interest. The key object that defines an ExGM is often referred to as a graphon. This non-parametric perspective on network modeling poses challenging questions on how to make inference on the graphon underlying observed network data. In this paper, we propose a computationally efficient procedure to estimate a graphon from a set of observed networks generated from it. This procedure is based on a stochastic blockmodel approximation (SBA) of the graphon. We show that, by approximating the graphon with a stochastic block model, the graphon can be consistently estimated, that is, the estimation error vanishes as the size of the graph approaches infinity.
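A rough illustration of the blockmodel-approximation idea, under our own simplifications: average the observed adjacency matrices, cluster nodes whose connection profiles are similar (here with a plain k-means loop rather than the paper's clustering rule), and read off block connection probabilities as empirical means.

```python
import numpy as np

def sba_estimate(adjacency_list, n_blocks=5, n_iter=50, seed=0):
    """Toy stochastic-blockmodel approximation from several observed graphs.

    adjacency_list: list of (n x n) 0/1 symmetric adjacency matrices on the
    same node set. Returns (labels, B) where B[a, b] is the estimated
    connection probability between blocks a and b.
    """
    rng = np.random.default_rng(seed)
    A_mean = np.mean(adjacency_list, axis=0)   # empirical edge probabilities
    n = A_mean.shape[0]
    # Cluster nodes by their connection profiles (simple k-means loop).
    centers = A_mean[rng.choice(n, n_blocks, replace=False)]
    for _ in range(n_iter):
        d = ((A_mean[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_blocks):
            if np.any(labels == k):
                centers[k] = A_mean[labels == k].mean(axis=0)
    # Block connection probabilities as empirical means over node pairs.
    B = np.zeros((n_blocks, n_blocks))
    for a in range(n_blocks):
        for b in range(n_blocks):
            if np.any(labels == a) and np.any(labels == b):
                B[a, b] = A_mean[np.ix_(labels == a, labels == b)].mean()
    return labels, B
```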

6 0.53576434 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

7 0.52839088 326 nips-2013-The Power of Asymmetry in Binary Hashing

8 0.52154839 352 nips-2013-What do row and column marginals reveal about your dataset?

9 0.51892215 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

10 0.513192 65 nips-2013-Compressive Feature Learning

11 0.50919008 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators

12 0.49400336 55 nips-2013-Bellman Error Based Feature Generation using Random Projections on Sparse Spaces

13 0.48256063 214 nips-2013-On Algorithms for Sparse Multi-factor NMF

14 0.47926828 107 nips-2013-Embed and Project: Discrete Sampling with Universal Hashing

15 0.47719067 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

16 0.4731555 19 nips-2013-Accelerated Mini-Batch Stochastic Dual Coordinate Ascent

17 0.47300869 168 nips-2013-Learning to Pass Expectation Propagation Messages

18 0.46952093 186 nips-2013-Matrix factorization with binary components

19 0.46936613 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport

20 0.46113056 76 nips-2013-Correlated random features for fast semi-supervised learning


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.524), (33, 0.081), (34, 0.072), (36, 0.011), (41, 0.018), (49, 0.014), (56, 0.079), (70, 0.015), (85, 0.037), (89, 0.021), (93, 0.032), (95, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.90945137 61 nips-2013-Capacity of strong attractor patterns to model behavioural and cognitive prototypes

Author: Abbas Edalat

Abstract: We solve the mean field equations for a stochastic Hopfield network with temperature (noise) in the presence of strong, i.e., multiply stored, patterns, and use this solution to obtain the storage capacity of such a network. Our result provides for the first time a rigorous solution of the mean field equations for the standard Hopfield model and is in contrast to the mathematically unjustifiable replica technique that has been used hitherto for this derivation. We show that the critical temperature for stability of a strong pattern is equal to its degree or multiplicity, when the sum of the squares of degrees of the patterns is negligible compared to the network size. In the case of a single strong pattern, when the ratio of the number of all stored patterns and the network size is a positive constant, we obtain the distribution of the overlaps of the patterns with the mean field and deduce that the storage capacity for retrieving a strong pattern exceeds that for retrieving a simple pattern by a multiplicative factor equal to the square of the degree of the strong pattern. This square law property provides justification for using strong patterns to model attachment types and behavioural prototypes in psychology and psychotherapy.

1 Introduction: Multiply learned patterns in Hopfield networks

The Hopfield network as a model of associative memory and unsupervised learning was introduced in [23] and has been intensively studied from a wide range of viewpoints in the past thirty years. However, properties of a strong pattern, as a pattern that has been multiply stored or learned in these networks, have only been examined very recently, a surprising delay given that repetition of an activity is the basis of learning by the Hebbian rule and long term potentiation. In particular, while the storage capacity of a Hopfield network with certain correlated patterns has been tackled [13, 25], the storage capacity of a Hopfield network in the presence of strong as well as random patterns has not been hitherto addressed. The notion of a strong pattern of a Hopfield network has been proposed in [15] to model attachment types and behavioural prototypes in developmental psychology and psychotherapy. This suggestion has been motivated by reviewing the pioneering work of Bowlby [9] in attachment theory and highlighting how a number of academic biologists, psychiatrists, psychologists, sociologists and neuroscientists have consistently regarded Hopfield-like artificial neural networks as suitable tools to model cognitive and behavioural constructs as patterns that are deeply and repeatedly learned by individuals [11, 22, 24, 30, 29, 10]. A number of mathematical properties of strong patterns in Hopfield networks, which give rise to strong attractors, have been derived in [15]. These show in particular that strong attractors are strongly stable; a series of experiments have also been carried out which confirm the mathematical results and also indicate that a strong pattern stored in the network can be retrieved even in the presence of a large number of simple patterns, far exceeding the well-known maximum load parameter or storage capacity of the Hopfield network with random patterns (α_c ≈ 0.138). In this paper, we consider strong patterns in the stochastic Hopfield model with temperature, which accounts for various types of noise in the network. In these networks, the updating rule is probabilistic and depends on the temperature.
Since analytical solution of such a system is not possible in general, one strives to obtain the average behaviour of the network when the input to each node, the so-called field at the node, is replaced with its mean. This is the basis of mean field theory for these networks. Due to the close connection between the Hopfield network and the Ising model in ferromagnetism [1, 8], the mean field approach for the Hopfield network and its variations has been tackled using the replica method, starting with the pioneering work of Amit, Gutfreund and Sompolinsky [3, 2, 4, 19, 31, 1, 13]. Although this method has been widely used in the theory of spin glasses in statistical physics [26, 16], its mathematical justification has proved to be elusive as we will discuss in the next section; see for example [20, page 264], [14, page 27], and [7, page 9]. In [17] and independently in [27], an alternative technique to the replica method for solving the mean field equations has been proposed, which is reproduced and characterised as heuristic in [20, section 2.5] since it relies on a number of assumptions that are not later justified and uses a number of mathematical steps that are not validated. Here, we use the basic idea of the above heuristic to develop a verifiable mathematical framework with provable results grounded on elements of probability theory, with which we assume the reader is familiar. This technique allows us to solve the mean field equations for the Hopfield network in the presence of strong patterns and use the results to study, first, the stability of these patterns in the presence of temperature (noise) and, second, the storage capacity of the network with a single strong pattern at temperature zero. We show that the critical temperature for the stability of a strong pattern is equal to its degree (i.e., its multiplicity) when the ratio of the sum of the squares of degrees of the patterns to the network size tends to zero when the latter tends to infinity. In the case that there is only one strong pattern present with its degree small compared to the number of patterns and the latter is a fixed multiple of the number of nodes, we find the distribution of the overlap of the mean field and the patterns when the strong pattern is being retrieved. We use these distributions to prove that the storage capacity for retrieving a strong pattern exceeds that for a simple pattern by a multiplicative factor equal to the square of the degree of the strong attractor. This result matches the finding in [15] regarding the capacity of a network to recall strong patterns as mentioned above. Our results therefore show that strong patterns are robust and persistent in the network memory as attachment types and behavioural prototypes are in the human memory system. In this paper, we will several times use Lyapunov's theorem in probability, which provides a simple sufficient condition to generalise the Central Limit theorem when we deal with independent but not necessarily identically distributed random variables. We require a general form of this theorem as follows. Let $Y_n = \sum_{i=1}^{k_n} Y_{ni}$, for $n \in \mathbb{N}$, be a triangular array of random variables such that for each $n$, the random variables $Y_{ni}$, for $1 \le i \le k_n$, are independent with $E(Y_{ni}) = 0$ and $E(Y_{ni}^2) = \sigma_{ni}^2$, where $E(X)$ stands for the expected value of the random variable $X$. Let $s_n^2 = \sum_{i=1}^{k_n} \sigma_{ni}^2$. We use the notation $X \sim Y$ when the two random variables $X$ and $Y$ have the same distribution (for large $n$ if either or both of them depend on $n$).
Theorem 1.1 (Lyapunov's theorem [6, page 368]) If for some $\delta > 0$ we have the condition
$$\frac{1}{s_n^{2+\delta}} \sum_{i=1}^{k_n} E\big(|Y_{ni}|^{2+\delta}\big) \to 0 \quad \text{as } n \to \infty,$$
then $\frac{1}{s_n} Y_n \xrightarrow{d} N(0, 1)$ as $n \to \infty$, where $\xrightarrow{d}$ denotes convergence in distribution, and we denote by $N(a, \sigma^2)$ the normal distribution with mean $a$ and variance $\sigma^2$. Thus, for large $n$ we have $Y_n \sim N(0, s_n^2)$.

2 Mean field theory

We consider a Hopfield network with $N$ neurons $i = 1, \ldots, N$ with values $S_i = \pm 1$ and follow the notations in [20]. As in [15], we assume patterns can be multiply stored and the degree of a pattern is defined as its multiplicity. The total number of patterns, counting their multiplicity, is denoted by $p$ and we assume there are $n$ patterns $\xi^1, \ldots, \xi^n$ with degrees $d_1, \ldots, d_n \ge 1$ respectively, and that the remaining $p - \sum_{k=1}^{n} d_k \ge 0$ patterns are simple, i.e., each has degree one. Note that by our assumptions there are precisely $p_0 = p + n - \sum_{k=1}^{n} d_k$ distinct patterns, which we assume are independent and identically distributed with equal probability of taking value $\pm 1$ for each node. More generally, for any non-negative integer $k \in \mathbb{N}$, we let $p_k = \sum_{\mu=1}^{p_0} d_\mu^k$. We use the generalized Hebbian rule for the synaptic couplings: $w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p_0} d_\mu \xi_i^\mu \xi_j^\mu$ for $i \ne j$, with $w_{ii} = 0$ for $1 \le i, j \le N$. As in the standard stochastic Hopfield model [20], we use Glauber dynamics [18] for the stochastic updating rule with pseudo-temperature $T > 0$, which accounts for various types of noise in the network, and assume zero bias in the local field. Putting $\beta = 1/T$ (i.e., with the Boltzmann constant $k_B = 1$) and letting $f_\beta(h) = 1/(1 + \exp(-2\beta h))$, the stochastic updating rule at time $t$ is given by:
$$\Pr(S_i(t+1) = \pm 1) = f_\beta(\pm h_i(t)), \quad \text{where } h_i(t) = \sum_{j=1}^{N} w_{ij} S_j(t) \quad (1)$$
is the local field at $i$ at time $t$. The updating is implemented asynchronously in a random way. The energy of the network in the configuration $S = (S_i)_{i=1}^{N}$ is given by
$$H(S) = -\frac{1}{2} \sum_{i,j=1}^{N} S_i S_j w_{ij}.$$
For large $N$, this specifies a complex system, with an underlying state space of dimension $2^N$, which in general cannot be solved exactly. However, mean field theory has proved very useful in studying Hopfield networks. The average updated value of $S_i(t+1)$ in Equation (1) is
$$\langle S_i(t+1) \rangle = 1/(1 + e^{-2\beta h_i(t)}) - 1/(1 + e^{2\beta h_i(t)}) = \tanh(\beta h_i(t)), \quad (2)$$
where $\langle \cdot \rangle$ denotes taking the average with respect to the probability distribution in the updating rule in Equation (1). The stationary solution for the mean field thus satisfies:
$$\langle S_i \rangle = \langle \tanh(\beta h_i) \rangle. \quad (3)$$
The average overlap of pattern $\xi^\nu$ with the mean field at the nodes of the network is given by:
$$m_\nu = \frac{1}{N} \sum_{i=1}^{N} \xi_i^\nu \langle S_i \rangle. \quad (4)$$
The replica technique for solving the mean field problem, used in the case $p/N = \alpha > 0$ as $N \to \infty$, seeks to obtain the average of the overlaps in Equation (4) by evaluating the partition function of the system, namely, $Z = \mathrm{Tr}_S \exp(-\beta H(S))$, where the trace $\mathrm{Tr}_S$ stands for taking the sum over all possible configurations $S = (S_i)_{i=1}^{N}$. As is generally the case in statistical physics, once the partition function of the system is obtained, all required physical quantities can in principle be computed. However, in this case, the partition function is very difficult to compute since it entails computing the average $\langle\!\langle \log Z \rangle\!\rangle$ of $\log Z$, where $\langle\!\langle \cdot \rangle\!\rangle$ indicates averaging over the random distribution of the stored patterns $\xi^\mu$. To overcome this problem, the identity
$$\log Z = \lim_{k \to 0} \frac{Z^k - 1}{k}$$
is used to reduce the problem to finding the average $\langle\!\langle Z^k \rangle\!\rangle$ of $Z^k$, which is then computed for positive integer values of $k$.
For such $k$, we have:
$$Z^k = \mathrm{Tr}_{S^1}\, \mathrm{Tr}_{S^2} \cdots \mathrm{Tr}_{S^k} \exp\big(-\beta(H(S^1) + H(S^2) + \cdots + H(S^k))\big),$$
where for each $i = 1, \ldots, k$ the super-scripted configuration $S^i$ is a replica of the configuration state. In computing the trace over each replica, various parameters are obtained and the replica symmetry condition assumes that these parameters are independent of the particular replica under consideration. Apart from this assumption, there are two basic mathematical problems with the technique which make it unjustifiable [20, page 264]. Firstly, the positive integer $k$ above is eventually treated as a real number near zero without any mathematical justification. Secondly, the order of taking limits, in particular the order of taking the two limits $k \to 0$ and $N \to \infty$, is several times interchanged, again without any mathematical justification. Here, we develop a mathematically rigorous method for solving the mean field problem, i.e., computing the average of the overlaps in Equation (4) in the case of $p/N = \alpha > 0$ as $N \to \infty$. Our method turns the basic idea of the heuristic presented in [17] and reproduced in [20] for solving the mean field equation into a mathematically verifiable formalism, which for the standard Hopfield network with random stored patterns gives the same result as the replica method, assuming replica symmetry. In the presence of strong patterns we obtain a set of new results as explained in the next two sections.

The mean field equation is obtained from Equation (3) by approximating the right hand side of this equation by the value of tanh at the mean field $\langle h_i \rangle = \sum_{j=1}^{N} w_{ij} \langle S_j \rangle$, ignoring the sum $\sum_{j=1}^{N} w_{ij}(S_j - \langle S_j \rangle)$ for large $N$ [17, page 32]:
$$\langle S_i \rangle = \tanh(\beta \langle h_i \rangle) = \tanh\Big(\frac{\beta}{N} \sum_{j=1}^{N} \sum_{\mu=1}^{p_0} d_\mu\, \xi_i^\mu \xi_j^\mu \langle S_j \rangle\Big). \quad (5)$$
Equation (5) gives the mean field equation for the Hopfield network with $n$ possible strong patterns $\xi^\mu$ ($1 \le \mu \le n$) and $p - \sum_{\mu=1}^{n} d_\mu$ simple patterns $\xi^\mu$ with $n + 1 \le \mu \le p_0$. As in the standard Hopfield model, where all patterns are simple, we have two cases to deal with. However, we now have to account for the presence of strong attractors and our two cases will be as follows: (i) In the first case we assume $p_2 := \sum_{\mu=1}^{p_0} d_\mu^2 = o(N)$, which includes the simpler case $p_2 \ll N$ when $p_2$ is fixed and independent of $N$. (ii) In the second case we assume we have a single strong attractor with the load parameter $p/N = \alpha > 0$.

3 Stability of strong patterns with noise: $p_2 = o(N)$

The case of constant $p$ and $N \to \infty$ is usually referred to as $\alpha = 0$ in the standard Hopfield model. Here, we need to consider the sum of degrees of all stored patterns (and not just the number of patterns) compared to $N$. We solve the mean field equation with $T > 0$ by using a method similar in spirit to [20, page 33] for the standard Hopfield model, but in our case strong patterns induce a sequence of independent but non-identically distributed random variables in the crosstalk term, where the Central Limit Theorem cannot be used; we show however that Lyapunov's theorem (Theorem 1.1) can be invoked. In retrieving pattern $\xi^1$, we look for a solution of the mean field equation of the form $\langle S_i \rangle = m \xi_i^1$, where $m > 0$ is a constant. Using Equation (5) and separating the contribution of $\xi^1$ in the argument of tanh, we obtain:
$$m \xi_i^1 = \tanh\Big(m\beta\Big(d_1 \xi_i^1 + \frac{1}{N} \sum_{j \ne i,\, \mu > 1} d_\mu\, \xi_i^\mu \xi_j^\mu \xi_j^1\Big)\Big). \quad (6)$$
For each $N$, $\mu > 1$ and $j \ne i$, let
$$Y_{N\mu j} = \frac{d_\mu}{N}\, \xi_i^\mu \xi_j^\mu \xi_j^1. \quad (7)$$
This gives $(p_0 - 1)(N - 1)$ independent random variables with $E(Y_{N\mu j}) = 0$, $E(Y_{N\mu j}^2) = d_\mu^2 / N^2$, and $E(|Y_{N\mu j}|^3) = d_\mu^3 / N^3$. We have:
$$s_N^2 := \sum_{\mu > 1,\, j \ne i} E(Y_{N\mu j}^2) = \frac{N - 1}{N^2} \sum_{\mu > 1} d_\mu^2 \sim \frac{1}{N} \sum_{\mu > 1} d_\mu^2. \quad (8)$$
Thus, as $N \to \infty$, we have:
$$\frac{1}{s_N^3} \sum_{\mu > 1,\, j \ne i} E(|Y_{N\mu j}|^3) \sim \frac{\sum_{\mu > 1} d_\mu^3}{\sqrt{N}\, \big(\sum_{\mu > 1} d_\mu^2\big)^{3/2}} \to 0 \quad (9)$$
as $N \to \infty$, since for positive numbers $d_\mu$ we always have $\sum_{\mu > 1} d_\mu^3 < \big(\sum_{\mu > 1} d_\mu^2\big)^{3/2}$. Thus the Lyapunov condition is satisfied for $\delta = 1$. By Lyapunov's theorem we deduce:
$$\frac{1}{N} \sum_{\mu > 1,\, j \ne i} d_\mu\, \xi_i^\mu \xi_j^\mu \xi_j^1 \sim N\Big(0,\ \sum_{\mu > 1} d_\mu^2 / N\Big). \quad (10)$$
Since we also have $p_2 = o(N)$, it follows that we can ignore the second term, i.e., the crosstalk term, in the argument of tanh in Equation (6) as $N \to \infty$; we thus obtain:
$$m = \tanh(\beta d_1 m). \quad (11)$$
To examine the fixed points of Equation (11), we let $d = d_1$ for convenience and put $x = \beta d m = dm/T$, so that $\tanh x = Tx/d$; see Figure 1. It follows that $T_c = d$ is the critical temperature. If $T < d$ then there is a non-zero (non-trivial) solution for $m$, whereas for $T > d$ we only have the trivial solution. For $d = 1$ our solution is that of the standard Hopfield network as in [20, page 34].

[Figure 1: the curve y = tanh x plotted against the line y = Tx/d, which lies above y = x when d < T and coincides with y = x when d = T.]
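A quick numerical check of the fixed-point behaviour of Equation (11): iterating m ← tanh(dm/T) from a positive start converges to a non-trivial overlap when T < d and collapses toward m = 0 when T ≥ d (slowly at T = d). This is only a sketch of the fixed-point iteration; the variable names are ours.

```python
import numpy as np

def overlap_fixed_point(d, T, m0=0.5, iters=2000):
    """Iterate m <- tanh(d * m / T) from a positive start and return the limit.

    For T < d this converges to the non-trivial solution of Eq. (11);
    for T >= d it collapses toward the trivial solution m = 0.
    """
    m = m0
    for _ in range(iters):
        m = np.tanh(d * m / T)
    return m

if __name__ == "__main__":
    d = 3  # degree (multiplicity) of the strong pattern, so T_c = d
    for T in [0.5, 1.0, 2.0, 2.9, 3.0, 3.5]:
        print(f"T = {T:>4}: m* ≈ {overlap_fixed_point(d, T):.4f}")
```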

2 0.84846228 145 nips-2013-It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals

Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle

Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in the form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1
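A minimal sketch of the covariance structure the abstract describes: a task-relatedness matrix Kronecker-multiplied with a sample kernel, plus a structured-noise term Kronecker-multiplied with the identity. The direct construction below is O((nt)^3) and only meant for tiny examples; the paper's point is that this structure admits efficient inference, which is not shown here, and all matrix names are ours.

```python
import numpy as np
from scipy.stats import multivariate_normal

def kron_sum_loglik(Y, C_signal, R_samples, C_noise, jitter=1e-6):
    """Log-likelihood of vec(Y) under N(0, C_signal kron R_samples + C_noise kron I).

    Y: (n_samples, n_tasks) response matrix; C_signal, C_noise: (t, t) task
    covariances; R_samples: (n, n) sample covariance (e.g. a kernel on inputs).
    """
    n, t = Y.shape
    K = (np.kron(C_signal, R_samples)
         + np.kron(C_noise, np.eye(n))
         + jitter * np.eye(n * t))
    y = Y.flatten(order="F")   # stack columns: matches the Kronecker ordering
    return multivariate_normal.logpdf(y, mean=np.zeros(n * t), cov=K)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, t = 30, 4
    R = np.exp(-0.5 * (np.arange(n)[:, None] - np.arange(n)[None, :]) ** 2 / 9.0)
    C_sig = np.full((t, t), 0.5) + 0.5 * np.eye(t)
    C_noi = 0.3 * np.eye(t)
    K = np.kron(C_sig, R) + np.kron(C_noi, np.eye(n)) + 1e-6 * np.eye(n * t)
    v = np.linalg.cholesky(K) @ rng.standard_normal(n * t)
    Y = v.reshape((n, t), order="F")   # undo the column stacking
    print(kron_sum_loglik(Y, C_sig, R, C_noi))
```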

same-paper 3 0.82194346 245 nips-2013-Pass-efficient unsupervised feature selection

Author: Crystal Maung, Haim Schweitzer

Abstract: The goal of unsupervised feature selection is to identify a small number of important features that can represent the data. We propose a new algorithm, a modification of the classical pivoted QR algorithm of Businger and Golub, that requires a small number of passes over the data. The improvements are based on two ideas: keeping track of multiple features in each pass, and skipping calculations that can be shown not to affect the final selection. Our algorithm selects the exact same features as the classical pivoted QR algorithm, and has the same favorable numerical stability. We describe experiments on real-world datasets which sometimes show improvements of several orders of magnitude over the classical algorithm. These results appear to be competitive with recently proposed randomized algorithms in terms of pass efficiency and run time. On the other hand, the randomized algorithms may produce more accurate features, at the cost of small probability of failure. 1
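For concreteness, here is a short illustration of selecting k columns with the classical column-pivoted QR factorization via SciPy. It shows the greedy selection rule the abstract refers to, not the pass-efficient modification proposed in the paper; the example data is synthetic.

```python
import numpy as np
from scipy.linalg import qr

def select_features_pivoted_qr(X, k):
    """Return the indices of k columns chosen by column-pivoted QR.

    X: (n_samples, n_features) data matrix. The pivoting greedily picks, at
    each step, the column with the largest residual norm after projecting
    out the columns chosen so far.
    """
    _, _, piv = qr(X, mode="economic", pivoting=True)
    return piv[:k]

# Example: 200 noisy points in 50 dimensions, keep 5 representative columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 50))
print(select_features_pivoted_qr(X, 5))
```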

4 0.76603371 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

Author: Jason Chang, John W. Fisher III

Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1

5 0.69864398 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson

Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1

6 0.69245291 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling

7 0.48894167 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

8 0.47690204 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

9 0.47272548 252 nips-2013-Predictive PAC Learning and Process Decompositions

10 0.46640074 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

11 0.45324337 178 nips-2013-Locally Adaptive Bayesian Multivariate Time Series

12 0.45279175 47 nips-2013-Bayesian Hierarchical Community Discovery

13 0.44550884 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

14 0.44430262 355 nips-2013-Which Space Partitioning Tree to Use for Search?

15 0.43333825 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

16 0.43235877 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

17 0.43089417 350 nips-2013-Wavelets on Graphs via Deep Learning

18 0.42466214 41 nips-2013-Approximate inference in latent Gaussian-Markov models from continuous time observations

19 0.4188956 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

20 0.41509709 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval