
206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices


Source: pdf

Author: Dimitris Achlioptas, Zohar S. Karnin, Edo Liberty

Abstract: We consider the problem of selecting non-zero entries of a matrix A in order to produce a sparse sketch of it, B, that minimizes $\|A - B\|_2$. For large $m \times n$ matrices, such that $n \gg m$ (for example, representing n observations over m attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding A. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with $O(1)$ computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model.

Reference: text


Summary: the most important sentences generated by the tf-idf model

sentIndex sentText sentNum sentScore

1 Abstract We consider the problem of selecting non-zero entries of a matrix A in order to produce a sparse sketch of it, B, that minimizes $\|A - B\|_2$. [sent-6, score-0.363]

2 Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. [sent-10, score-0.339]

3 Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. [sent-12, score-0.177]

4 1 Introduction Given an $m \times n$ matrix A, it is often desirable to find a sparser matrix B that is a good proxy for A. [sent-14, score-0.268]

5 A fruitful measure for the approximation of A by B is the spectral norm of $A - B$, where for any matrix C its spectral norm is defined as $\|C\|_2 = \max_{\|x\|_2 = 1} \|Cx\|_2$. [sent-16, score-0.413]
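
To make the definition concrete, here is a small power-iteration sketch (our own illustration, not code from the paper) for estimating $\|C\|_2 = \max_{\|x\|_2 = 1} \|Cx\|_2$; the function name, iteration cap and tolerance are arbitrary choices, and for small matrices `np.linalg.norm(C, 2)` returns the same value directly.

```python
import numpy as np

def spectral_norm(C, iters=200, tol=1e-10, seed=0):
    """Estimate ||C||_2 = max_{||x||_2 = 1} ||Cx||_2 by power iteration on C^T C."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(C.shape[1])
    x /= np.linalg.norm(x)
    prev = 0.0
    for _ in range(iters):
        y = C.T @ (C @ x)              # apply C^T C to the current direction
        nrm = np.linalg.norm(y)
        if nrm == 0.0:                 # C is the zero matrix
            return 0.0
        x = y / nrm
        sigma = np.linalg.norm(C @ x)  # current estimate of the top singular value
        if abs(sigma - prev) < tol:
            break
        prev = sigma
    return sigma

C = np.random.default_rng(1).standard_normal((50, 80))
print(spectral_norm(C), np.linalg.norm(C, 2))  # the two values agree up to tolerance
```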

6 Also we expect the total number of non-zero entries in A to exceed available memory. [sent-21, score-0.215]

7 We assume that the original data matrix A is accessed in the streaming model where we know only very basic features of A a priori and the actual non-zero entries are presented to us one at a time in an arbitrary order. [sent-22, score-0.381]

8 But, it is also important in cases when A exists in durable storage and random access of its entries is prohibitively expensive. [sent-24, score-0.177]

9 Assign to each element $A_{ij}$ of the matrix a weight that depends only on the elements in its row: $q_{ij} = |A_{ij}| / \|A_i\|_1$. [sent-26, score-0.346]

10 Sample s locations from A using the distribution $p_{ij} = \rho_i q_{ij}$. [sent-31, score-0.507]

11 Return B, which is the mean of s matrices, each containing a single non-zero entry $A_{ij}/p_{ij}$ in the corresponding selected location $(i, j)$. [sent-32, score-0.502]
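
A minimal NumPy sketch of the three steps above (our illustration, not the authors' implementation); the row weights `rho` are treated as a given input summing to 1, since the paper derives them from an optimization that is not reproduced here.

```python
import numpy as np

def entrywise_sketch(A, s, rho, seed=0):
    """Return B, the mean of s single-entry unbiased estimates A_ij / p_ij,
    with locations drawn i.i.d. from p_ij = rho_i * |A_ij| / ||A_i||_1."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    absA = np.abs(A)
    row_l1 = absA.sum(axis=1)
    row_l1[row_l1 == 0] = 1.0            # all-zero rows simply get probability 0
    q = absA / row_l1[:, None]           # q_ij = |A_ij| / ||A_i||_1
    p = rho[:, None] * q                 # p_ij = rho_i * q_ij
    p = p / p.sum()                      # exact renormalization (no-op if rho sums to 1)
    picks = rng.choice(m * n, size=s, replace=True, p=p.ravel())
    B = np.zeros_like(A, dtype=float)
    for loc in picks:                    # accumulate the s single-entry matrices
        i, j = divmod(int(loc), n)
        B[i, j] += A[i, j] / (s * p[i, j])
    return B
```

Choosing `rho` proportional to the row L1 norms recovers plain L1 sampling ($p_{ij} \propto |A_{ij}|$), matching the small-s regime discussed below.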

12 As we will see, this simple form of the probabilities $p_{ij}$ falls out naturally from generic optimization considerations. [sent-33, score-0.432]

13 Every non-zero in the i-th row of B will take the form $k_{ij} \|A_i\|_1 / (s \rho_i)$, where $k_{ij}$ is the number of times location $(i, j)$ of A was selected. [sent-35, score-0.234]

14 Note that since we sample with replacement $k_{ij}$ may be more than 1 but, typically, $k_{ij} \in \{0, 1\}$. [sent-36, score-0.269]

15 The result is a matrix B which is representable in $O(m \log n + s \log \frac{n}{s})$ bits. [sent-37, score-0.198]

16 This is because there is no reason to store floating point matrix entry values. [sent-38, score-0.204]

17 We use $O(m \log n)$ bits to store all values $\|A_i\|_1 / (s \rho_i)$ and $O(s \log \frac{n}{s})$ bits to store the non-zero index offsets. [sent-39, score-0.288]

18 In a simple experiment we measured the average number of bits per sample resulting from this approach (total size of the sketch divided by the number of samples s). [sent-41, score-0.236]

19 The results were between 5 and 22 bits per sample depending on the matrix and s. [sent-42, score-0.283]

20 It is important to note that the number of bits per sample was usually less than even $\log_2 n + \log_2 m$, the minimal number of bits required to represent a pair $(i, j)$. [sent-43, score-0.261]
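
The compression claim can be sanity-checked with a rough, purely illustrative accounting (ours, not the paper's encoder): about $\log_2 n$ bits for each of the m stored per-row values, plus an Elias-gamma-style cost for delta-encoded sample locations.

```python
import numpy as np

def sketch_bits(m, n, picks):
    """Illustrative size estimate for the sketch: per-row values plus
    delta-encoded offsets of the sorted sample locations."""
    picks = np.sort(np.asarray(picks, dtype=np.int64))
    gaps = np.diff(picks, prepend=0)
    # Elias-gamma-style cost: roughly 2*floor(log2(gap)) + 1 bits per gap.
    offset_bits = int(np.sum(2 * np.floor(np.log2(np.maximum(gaps, 1))) + 1))
    value_bits = m * int(np.ceil(np.log2(n)))
    return value_bits + offset_bits

rng = np.random.default_rng(0)
samples = rng.integers(0, 10_000 * 500, size=20_000)
print(sketch_bits(10_000, 500, samples) / 20_000, "bits per sample in this toy example")
```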

21 Our experiments show a reduction of disc space by a factor of between 2 and 5 relative to the compressed size of the file representing the sample matrix B in the standard row-column-value list format. [sent-44, score-0.171]

22 When the number of samples s is small, $\rho_i$ is nearly linear in $\|A_i\|_1$, resulting in $p_{ij} \propto |A_{ij}|$. [sent-46, score-0.467]

23 However, as the number of samples grows, $\rho_i$ tends towards $\|A_i\|_1^2$, resulting in $p_{ij} \propto |A_{ij}| \, \|A_i\|_1$, a distribution we refer to as Row-L1 sampling. [sent-47, score-0.467]

24 The dependence of the preferred distribution on the sample budget is also borne out in experiments, with sampling based on appropriately mixed distributions being consistently best. [sent-48, score-0.236]

25 This highlights that the need to adapt the sampling distribution to the sample budget is a genuine phenomenon. [sent-49, score-0.164]

26 Thus, $\|Ax\|_2$ measures the strength of linear trend x in A, and $\|A\|_2$ measures the strongest linear trend in A. [sent-53, score-0.199]

27 Thus, minimizing $\|A - B\|_2$ minimizes the strength of the strongest linear trend of A not captured by B. [sent-54, score-0.17]

28 This is because the best strategy would be to always pick the largest s matrix entries from A, a strategy that can easily be “fooled”. [sent-58, score-0.311]

29 As a stark example, when the matrix entries are $A_{ij} \in \{0, 1\}$, the quality of approximation of A by B is completely independent of which elements of A we keep. [sent-59, score-0.372]

30 By using the spectral norm to measure error we get a natural and sophisticated target: to minimize $\|A - B\|_2$ is to make $E = A - B$ a near-rotation, having only small variations in the amount by which it stretches different vectors. [sent-61, score-0.135]

31 This idea that the error matrix E should be isotropic, thus packing as much Frobenius norm as possible for its L2 norm, motivated the first work on element-wise matrix sampling by Achlioptas and McSherry [AM07]. [sent-62, score-0.406]

32 That is, B is an unbiased estimator of A whose entries are formed by sampling the entries of A (and, thus, of E) independently. [sent-65, score-0.461]

33 The study of the spectral characteristics of such matrices goes back all the way to Wigner’s famous semi-circle law [Wig58]. [sent-70, score-0.152]

34 Specifically, to bound $\|E\|_2$ in [AM07] a bound due to Alon, Krivelevich and Vu [AKV02] was used, a refinement of a bound by Juhász [Juh81] and Füredi and Komlós [FK81]. [sent-71, score-0.168]

35 The most salient feature of that bound is that it depends on the maximum entry-wise variance $\sigma^2$ of $A - B$, and therefore the distribution optimizing the bound is the one in which the variance of all entries in E is the same. [sent-72, score-0.257]

36 In turn, this means keeping each entry of A independently with probability $p_{ij} \propto A_{ij}^2$ (up to a small wrinkle discussed below). [sent-73, score-0.502]

37 This is because when each item $A_{ij}$ is kept with probability $p_{ij} \propto A_{ij}^2$, the resulting entry $B_{ij}$ in the sample matrix has magnitude $A_{ij}/p_{ij} \propto |A_{ij}|^{-1}$. [sent-76, score-0.834]

38 (Footnote: it is harmless to assume any value in the matrix is stored using $O(\log n)$ bits; otherwise, truncating the trailing bits can be shown to be negligible.) [sent-78, score-0.673]

39 Thus, if an extremely small element $A_{ij}$ is accidentally picked, the largest entry of the sample matrix “blows up”. [sent-79, score-0.276]

40 In [AM07] this was addressed by sampling small entries with probability proportional to $|A_{ij}|$ rather than $A_{ij}^2$. [sent-80, score-0.251]

41 In the work of Gittens and Tropp [GT09], small entries are not handled separately and the bound derived depends on the ratio between the largest and the smallest non-zero magnitude. [sent-81, score-0.313]

42 This progress motivated Drineas and Zouzias in [DZ11] to revisit L2-sampling using concentration results for sums of random matrices [Rec11], as we do here. [sent-83, score-0.183]

43 This is somewhat different from the original setting of [AM07] since now B is not a random matrix with independent entries, but a sum of many single-element independent matrices, each such matrix resulting from choosing a location of A with replacement. [sent-84, score-0.268]

44 The issue of small entries was handled in [DZ11] by deterministically discarding all sufficiently small entries, a strategy that gives a strong mathematical guarantee (but see the discussion regarding deterministic truncation in the experimental section). [sent-86, score-0.298]

45 Their approximation keeps the largest entries in A deterministically (specifically all $|A_{ij}| \ge \varepsilon/\sqrt{n}$, where the threshold $\varepsilon$ needs to be known a priori) and randomly rounds the remaining smaller entries to $\mathrm{sign}(A_{ij})\,\varepsilon/\sqrt{n}$ or 0. [sent-89, score-0.354]

46 3 Our Approach Following the discussion in Section 2 and in line with previous works, we: (i) measure the quality of B by $\|A - B\|_2$, (ii) sample the entries of A independently, and (iii) require B to be an unbiased estimator of A. [sent-95, score-0.305]

47 We are therefore left with the task of determining a good probability distribution $p_{ij}$ from which to sample the entries of A in order to get B. [sent-96, score-0.646]

48 Specifically, each work proposes a specific sampling distribution and then uses results from random matrix theory to demonstrate that it has good properties. [sent-98, score-0.208]

49 The inequality we use is the Matrix-Bernstein inequality for sums of independent random matrices (see, e.g., [Rec11]). [sent-101, score-0.149]
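
For reference, a standard statement of the Matrix-Bernstein inequality (in the form found in Tropp's survey; the exact variant and constants used in the paper may differ):

```latex
% Let X_1, ..., X_s be independent m x n random matrices with E[X_k] = 0 and
% ||X_k||_2 <= R almost surely, and let
%   sigma^2 = max( ||sum_k E[X_k X_k^T]||_2 , ||sum_k E[X_k^T X_k]||_2 ).
% Then, for every t >= 0,
\[
\Pr\!\left[\Big\|\sum_{k=1}^{s} X_k\Big\|_2 \ge t\right]
\;\le\; (m + n)\,\exp\!\left(\frac{-t^2/2}{\sigma^2 + R t / 3}\right).
\]
```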

50 Let B be a random $m \times n$ matrix with exactly one non-zero element, formed by sampling an element $A_{ij}$ of A according to p and letting $B_{ij} = A_{ij}/p_{ij}$. [sent-116, score-0.675]

51 A conceptual contribution of our work is the discovery that good distributions depend on the sample budget s, a fact also borne out in experiments. [sent-137, score-0.162]

52 Theorem 4.1 is stated as a bound on the probability that the norm of the error matrix is greater than some target error $\varepsilon$, given the number of samples s. [sent-143, score-0.273]

53 In practice, the target error $\varepsilon$ is typically not known in advance, but rather is the quantity to minimize, given the matrix A, the number of samples s, and the target confidence $\delta$. [sent-144, score-0.198]

54 In contrast, the only global information regarding A that we require is the set of ratios between the L1 norms of the rows of the matrix. [sent-150, score-0.279]

55 Slightly less trivially, standard concentration arguments imply that these ratios can be estimated very well by sampling only a small number of columns. [sent-152, score-0.249]

56 In the setting of data analysis, though, it is in fact reasonable to expect that good estimates of these ratios are available a priori. [sent-153, score-0.179]

57 This is because different rows correspond to different attributes and the ratios between the row norms reflect the ratios between the average absolute values of the features. [sent-154, score-0.473]

58 For example, if the matrix corresponds to text documents, knowing the ratios amounts to knowing global word frequencies. [sent-155, score-0.275]

59 Moreover these ratios do not need to be known exactly to apply the algorithm, as even rough estimates of them give highly competitive results. [sent-156, score-0.141]

60 Indeed, even disregarding this issue completely and simply assuming that all ratios equal 1, yields an algorithm that appears quite competitive in practice, as demonstrated by our experiments. [sent-157, score-0.141]

61 Regarding Condition 2, it is easy to verify that unless the values of the entries of A exhibit unbounded variance as n grows, the ratio $\|A\|_1^2 / \|A\|_2^2$ grows as $\Omega(n)$, and Condition 2 follows from $n \gg m$. [sent-177, score-0.253]

62 To apply Theorem 4.1, the entries of A must be sampled with replacement. [sent-182, score-0.177]

63 A simple way to achieve this in the streaming model was presented in [DKM06] that uses $O(s)$ operations per matrix element and $O(s)$ active memory. [sent-183, score-0.239]

64 In Section D (see supplementary material) we discuss how to implement sampling with replacement far more efficiently, using $O(\log s)$ active memory, $\tilde{O}(s)$ space, and $O(1)$ operations per element. [sent-184, score-0.174]
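
A bare-bones sketch of the simpler $O(s)$-per-element approach (our illustration of the idea, not the pseudocode of [DKM06]): run s independent weighted reservoir samplers over the stream, each of which ends up holding one location with probability proportional to its weight.

```python
import random

def stream_sample_with_replacement(stream, s, seed=0):
    """s independent weighted reservoir samplers over a stream of
    ((i, j), value, weight) triples; each sampler keeps one item with
    probability proportional to its weight, giving s i.i.d. picks."""
    rng = random.Random(seed)
    totals = [0.0] * s
    kept = [None] * s
    for loc, val, w in stream:
        if w <= 0.0:
            continue                    # zero-weight entries can never be picked
        for k in range(s):              # O(s) work per streamed non-zero
            totals[k] += w
            if rng.random() < w / totals[k]:
                kept[k] = (loc, val)
    return kept
```

Feeding weights $w_{ij} = \rho_i |A_{ij}| / z_i$ (with $z_i$ the pre-estimated row L1 norms) makes each pick follow $p_{ij} = \rho_i q_{ij}$, up to normalization, in a single pass.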

65 That is, we assume we know m and n and that we can compute numbers $z_i = \|A_i\|_1$ as well as repeatedly sample entries from the matrix. [sent-186, score-0.293]

66 Theorem 4.2 gives a bound on the minimal number of samples $s_0$ needed by our algorithm to achieve an approximation B to the matrix A such that $\|A - B\| \le \varepsilon \|A\|$ with constant probability. [sent-202, score-0.205]

67 Stable rank: Denoted as sr and defined as $\|A\|_F^2 / \|A\|_2^2$. [sent-203, score-0.152]

68 For data matrices we expect it is small, even constant, and that it captures the “inherent dimensionality” of the data. [sent-205, score-0.148]

69 Numeric density: Denoted as nd and defined as $\|A\|_1^2 / \|A\|_F^2$, this is a smooth analog of the number of non-zero entries $\mathrm{nnz}(A)$. [sent-206, score-0.239]

70 For 0-1 matrices it equals $\mathrm{nnz}(A)$, but when there is variance in the magnitude of the entries it is smaller. [sent-207, score-0.349]

71 Numeric row density: Denoted as nrd and defined as $\sum_i \|A_i\|_1^2 / \|A\|_F^2$, a quantity that is at most n. [sent-208, score-0.259]
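
These three quantities are cheap to compute explicitly; the sketch below (ours) follows the definitions as reconstructed above, so the `nrd` line in particular should be checked against the paper before relying on it.

```python
import numpy as np

def matrix_metrics(A):
    """Stable rank (sr), numeric density (nd) and numeric row density (nrd)."""
    fro2 = float(np.sum(A * A))               # ||A||_F^2
    spec = float(np.linalg.norm(A, 2))        # ||A||_2, the top singular value
    l1 = float(np.abs(A).sum())               # entrywise L1 norm ||A||_1
    row_l1 = np.abs(A).sum(axis=1)            # ||A_i||_1 for each row
    sr = fro2 / spec ** 2                     # stable rank
    nd = l1 ** 2 / fro2                       # smooth analog of nnz(A)
    nrd = float(np.sum(row_l1 ** 2)) / fro2   # numeric row density (at most n)
    return sr, nd, nrd
```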

72 Let A satisfy the conditions of Definition 4.1 and let B be the matrix returned by Algorithm 1 for $\delta = 1/10$, $\varepsilon > 0$ and any $s \ge s_0 = \Theta\big(\mathrm{nrd}\cdot\mathrm{sr}\cdot\varepsilon^{-2}\log n + (\mathrm{sr}\cdot\mathrm{nd}\cdot\varepsilon^{-2}\log n)^{1/2}\big)$. With probability at least 9/10, $\|A - B\|_2 \le \varepsilon\|A\|_2$. [sent-213, score-0.691]

73 The third column of the table below shows the number of samples needed to guarantee that $\|A - B\| \le \varepsilon\|A\|$ occurs with constant probability, in terms of the matrix metrics defined above. [sent-217, score-0.244]

74 The fourth column presents the ratio of the samples needed by previous results divided by the samples needed by our method. [sent-218, score-0.228]

75 In the comparison with [DZ11] we see that the key parameter is the ratio nrd/n, a quantity typically much smaller than 1 for data matrices. [sent-223, score-0.265]

76 As a point of reference for the assumptions, in the experimental Section 6 we provide the values of all relevant matrix metrics for all the real data matrices we worked with, wherein the ratio nrd/n is typically around $10^{-2}$. [sent-224, score-0.48]

77 The table lists, for each method, the citation, the type of sampling (L1, L2, or Bernstein), the number of samples needed, and the improvement ratio of Theorem 4.3: [AM07] (L1, L2), [DZ11] (L2) and [AHK06] (L1) are compared against this paper (Bernstein). [sent-228, score-0.27]

78 Roughly, [AM07] needs $\mathrm{sr}\cdot n\cdot\varepsilon^{-2}\,\mathrm{polylog}(n)$ samples and [DZ11] needs $\mathrm{sr}\cdot n\cdot\varepsilon^{-2}\log n$, whereas this paper needs $\mathrm{nrd}\cdot\mathrm{sr}\cdot\varepsilon^{-2}\log n + (\mathrm{sr}\cdot\mathrm{nd}\cdot\varepsilon^{-2}\log n)^{1/2}$. 5 Proof outline We start by iteratively replacing the objective functions (1) and (2) with simpler and simpler functions. [sent-229, score-0.183]

79 Let $\varepsilon_3 := \alpha\tilde\sigma^2 + \beta\tilde R$, where $\tilde\sigma^2 := \max\big(\max_i \sum_j A_{ij}^2/p_{ij},\ \max_j \sum_i A_{ij}^2/p_{ij}\big)$ and $\tilde R := \max_{ij} |A_{ij}|/p_{ij}$. [sent-236, score-1.546]

80 For every matrix A satisfying Conditions 2 and 3 of Definition 4.1. [sent-239, score-0.164]

81 We remove $\tilde\sigma(p)$ and $\tilde R(p)$ from the minimizations by minimizing $\varepsilon_4 := \max(\varepsilon_5, \varepsilon_6)$, where $\varepsilon_5 := \max_i\big(\alpha \sum_j A_{ij}^2/p_{ij} + \beta \max_j |A_{ij}|/p_{ij}\big)$ and $\varepsilon_6 := \max_j\big(\alpha \sum_i A_{ij}^2/p_{ij} + \beta \max_i |A_{ij}|/p_{ij}\big)$. [sent-248, score-2.061]

82 The following key lemma establishes that for all data matrices satisfying Condition 1 of Definition 4.1. [sent-252, score-0.14]

83 For every matrix satisfying Condition 1 of Definition 4.1. [sent-256, score-0.164]

84 The function $\varepsilon_5$ is minimized by $p_{ij} = \rho_i q_{ij}$ where $q_{ij} = |A_{ij}| / \|A_i\|_1$. [sent-261, score-0.582]

85 These lemmas combined imply that there is an efficient algorithm for minimizing $\varepsilon_4$ for every matrix A satisfying Condition 1 of Definition 4.1. [sent-268, score-0.209]

86 If A satisfies the conditions of Definition 4.1, then it is possible to lower and upper bound the ratios $\varepsilon_1/\varepsilon_2$, $\varepsilon_2/\varepsilon_3$ and $\varepsilon_3/\varepsilon_4$. [sent-271, score-0.181]

87 Columns correspond to subject lines, rows to words, and entries to tf-idf values. [sent-315, score-0.235]

88 Each row corresponds to an item and each column to a user. [sent-322, score-0.165]

89 The Row-L1 distribution is a simplified version of the Bernstein distribution, where $p_{ij} \propto |A_{ij}| \, \|A_i\|_1$. [sent-337, score-0.432]

90 L1 and L2 refer to $p_{ij} \propto |A_{ij}|$ and $p_{ij} \propto A_{ij}^2$, respectively, as defined earlier in the paper. [sent-338, score-0.864]

91 L2 sampling was split into three sampling methods corresponding to different trimming thresholds. [sent-340, score-0.229]

92 In the method referred to as L2 no trimming is made and $p_{ij} \propto A_{ij}^2$. [sent-341, score-0.513]
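
For completeness, the baseline distributions compared in the experiments can be written in a few NumPy lines (our illustration; the Bernstein weights $\rho_i$ and the exact trimming thresholds are not reproduced here, and the Row-L1 formula follows the reconstruction $p_{ij} \propto |A_{ij}|\,\|A_i\|_1$ used above).

```python
import numpy as np

def baseline_distributions(A):
    """L1, L2 and Row-L1 entrywise sampling probabilities (each sums to 1)."""
    absA = np.abs(A)
    p_l1 = absA / absA.sum()                      # L1:     p_ij proportional to |A_ij|
    p_l2 = A ** 2 / np.sum(A ** 2)                # L2:     p_ij proportional to A_ij^2
    row_l1 = absA.sum(axis=1, keepdims=True)      # ||A_i||_1 per row
    w = absA * row_l1
    p_row_l1 = w / w.sum()                        # Row-L1: p_ij proportional to |A_ij| * ||A_i||_1
    return p_l1, p_l2, p_row_l1
```

A trimmed L2 variant would additionally zero out entries below a chosen threshold before the final normalization.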

93 Although to derive our sampling probability distributions we targeted minimizing $\|A - B\|_2$, in experiments it is more informative to consider a more sensitive measure of quality of approximation. [sent-348, score-0.177]

94 The reason is that for a number of values of s, the scaling of entries required for B to be an unbiased estimator of A results in $\|A - B\| > \|A\|$, which would suggest that the all-zeros matrix is a better sketch for A than the sampled matrix. [sent-349, score-0.396]

95 In the experiments we made sure to choose a sufficiently wide range of sample sizes so that at least the best method for each matrix goes from poor to near-perfect both in approximating the row and the column space. [sent-358, score-0.28]

96 The second insight is that plain L1 sampling, i.e. simply taking $p_{ij} = |A_{ij}| / \|A\|_1$, performs rather well in many cases. [sent-425, score-0.432]

97 Hence, if it is impossible to perform more than one pass over the matrix and one can not even obtain an estimate of the ratios of the L1-weights of the rows, L1-sampling seems to be a highly viable option. [sent-426, score-0.275]

98 The third insight is that for L2-sampling, discarding small entries may drastically improve the performance. [sent-427, score-0.209]

99 A note on element-wise matrix sparsification via a matrix-valued Bernstein inequality. [sent-475, score-0.212]

100 Tensor sparsification via a bound on the spectral norm of random tensors. [sent-501, score-0.146]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('aij', 0.518), ('pij', 0.432), ('nrd', 0.189), ('entries', 0.177), ('sr', 0.152), ('ratios', 0.141), ('matrix', 0.134), ('bits', 0.112), ('matrices', 0.11), ('drineas', 0.094), ('sparsi', 0.09), ('achlioptas', 0.088), ('enron', 0.088), ('petros', 0.083), ('kij', 0.082), ('trimming', 0.081), ('zi', 0.079), ('bernstein', 0.078), ('wikipedia', 0.076), ('qij', 0.075), ('trend', 0.074), ('sampling', 0.074), ('borne', 0.072), ('row', 0.07), ('streaming', 0.07), ('entry', 0.07), ('replacement', 0.068), ('norm', 0.064), ('nnz', 0.062), ('dimitris', 0.062), ('ak', 0.059), ('bij', 0.059), ('numeric', 0.059), ('rows', 0.058), ('arora', 0.057), ('item', 0.056), ('argminp', 0.054), ('istribution', 0.054), ('juh', 0.054), ('krivelevich', 0.054), ('lighten', 0.054), ('budget', 0.053), ('bs', 0.052), ('joel', 0.052), ('sketch', 0.052), ('strongest', 0.051), ('ij', 0.049), ('regarding', 0.048), ('trim', 0.048), ('ompute', 0.048), ('boulder', 0.048), ('koml', 0.048), ('redi', 0.048), ('ratio', 0.047), ('minimizing', 0.045), ('emmanuel', 0.044), ('pci', 0.044), ('spectral', 0.042), ('nition', 0.042), ('truncation', 0.041), ('bound', 0.04), ('column', 0.039), ('satyen', 0.039), ('sums', 0.039), ('expect', 0.038), ('max', 0.038), ('gittens', 0.038), ('alon', 0.038), ('eij', 0.038), ('condition', 0.037), ('sample', 0.037), ('sanjeev', 0.036), ('needed', 0.036), ('pk', 0.036), ('hazan', 0.035), ('se', 0.035), ('samples', 0.035), ('singular', 0.035), ('element', 0.035), ('kept', 0.034), ('concentration', 0.034), ('labs', 0.034), ('unbiased', 0.033), ('theorem', 0.033), ('elements', 0.032), ('log', 0.032), ('norms', 0.032), ('altogether', 0.032), ('nguyen', 0.032), ('discarding', 0.032), ('attributes', 0.031), ('mathematics', 0.031), ('yahoo', 0.03), ('satisfying', 0.03), ('measure', 0.029), ('grows', 0.029), ('quantity', 0.029), ('elad', 0.029), ('algebraic', 0.029), ('quality', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices

Author: Dimitris Achlioptas, Zohar S. Karnin, Edo Liberty

Abstract: We consider the problem of selecting non-zero entries of a matrix A in order to produce a sparse sketch of it, B, that minimizes $\|A - B\|_2$. For large $m \times n$ matrices, such that $n \gg m$ (for example, representing n observations over m attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding A. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with $O(1)$ computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model.

2 0.35967728 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion

Author: Franz Kiraly, Louis Theran

Abstract: We propose a general framework for reconstructing and denoising single entries of incomplete and noisy entries. We describe: effective algorithms for deciding if an entry can be reconstructed and, if so, for reconstructing and denoising it; and a priori bounds on the error of each entry, individually. In the noiseless case our algorithm is exact. For rank-one matrices, the new algorithm is fast, admits a highly-parallel implementation, and produces an error minimizing estimate that is qualitatively close to our theoretical and the state-of-the-art Nuclear Norm and OptSpace methods.

3 0.30632511 185 nips-2013-Matrix Completion From any Given Set of Observations

Author: Troy Lee, Adi Shraibman

Abstract: In the matrix completion problem the aim is to recover an unknown real matrix from a subset of its entries. This problem comes up in many application areas, and has received a great deal of attention in the context of the netflix prize. A central approach to this problem is to output a matrix of lowest possible complexity (e.g. rank or trace norm) that agrees with the partially specified matrix. The performance of this approach under the assumption that the revealed entries are sampled randomly has received considerable attention (e.g. [1, 2, 3, 4, 5, 6, 7, 8]). In practice, often the set of revealed entries is not chosen at random and these results do not apply. We are therefore left with no guarantees on the performance of the algorithm we are using. We present a means to obtain performance guarantees with respect to any set of initial observations. The first step remains the same: find a matrix of lowest possible complexity that agrees with the partially specified matrix. We give a new way to interpret the output of this algorithm by next finding a probability distribution over the non-revealed entries with respect to which a bound on the generalization error can be proven. The more complex the set of revealed entries according to a certain measure, the better the bound on the generalization error. 1

4 0.1466458 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport

Author: Marco Cuturi

Abstract: Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms’ dimension exceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximumentropy perspective. We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn’s matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.

5 0.13685653 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling

Author: Akshay Krishnamurthy, Aarti Singh

Abstract: We study low rank matrix and tensor completion and propose novel algorithms that employ adaptive sampling schemes to obtain strong performance guarantees. Our algorithms exploit adaptivity to identify entries that are highly informative for learning the column space of the matrix (tensor) and consequently, our results hold even when the row space is highly coherent, in contrast with previous analyses. In the absence of noise, we show that one can exactly recover a n × n matrix of rank r from merely Ω(nr^{3/2} log(r)) matrix entries. We also show that one can recover an order-T tensor using Ω(nr^{T-1/2} T^2 log(r)) entries. For noisy recovery, our algorithm consistently estimates a low rank matrix corrupted with noise using Ω(nr^{3/2} polylog(n)) entries. We complement our study with simulations that verify our theory and demonstrate the scalability of our algorithms.

6 0.10699368 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

7 0.10476167 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

8 0.10139731 91 nips-2013-Dirty Statistical Models

9 0.097536221 188 nips-2013-Memory Limited, Streaming PCA

10 0.096949548 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models

11 0.095214799 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors

12 0.090433337 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

13 0.088246852 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

14 0.086490028 73 nips-2013-Convex Relaxations for Permutation Problems

15 0.085333288 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel

16 0.077951662 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning

17 0.077029034 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

18 0.076824643 352 nips-2013-What do row and column marginals reveal about your dataset?

19 0.075673722 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

20 0.073030844 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.222), (1, 0.085), (2, 0.106), (3, 0.137), (4, 0.013), (5, -0.01), (6, -0.0), (7, -0.041), (8, -0.094), (9, 0.012), (10, 0.078), (11, -0.012), (12, 0.06), (13, 0.028), (14, -0.035), (15, 0.011), (16, -0.08), (17, 0.008), (18, -0.061), (19, 0.085), (20, -0.067), (21, -0.258), (22, -0.027), (23, -0.127), (24, -0.096), (25, -0.013), (26, -0.002), (27, -0.335), (28, -0.069), (29, -0.074), (30, -0.046), (31, -0.088), (32, -0.096), (33, -0.069), (34, -0.019), (35, 0.005), (36, 0.051), (37, 0.057), (38, -0.039), (39, -0.028), (40, -0.014), (41, -0.121), (42, -0.008), (43, -0.091), (44, -0.031), (45, -0.084), (46, 0.028), (47, -0.032), (48, 0.041), (49, 0.036)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94624424 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices

Author: Dimitris Achlioptas, Zohar S. Karnin, Edo Liberty

Abstract: We consider the problem of selecting non-zero entries of a matrix A in order to produce a sparse sketch of it, B, that minimizes $\|A - B\|_2$. For large $m \times n$ matrices, such that $n \gg m$ (for example, representing n observations over m attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding A. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with $O(1)$ computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model.

2 0.90556526 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion

Author: Franz Kiraly, Louis Theran

Abstract: We propose a general framework for reconstructing and denoising single entries of incomplete and noisy entries. We describe: effective algorithms for deciding if an entry can be reconstructed and, if so, for reconstructing and denoising it; and a priori bounds on the error of each entry, individually. In the noiseless case our algorithm is exact. For rank-one matrices, the new algorithm is fast, admits a highly-parallel implementation, and produces an error minimizing estimate that is qualitatively close to our theoretical and the state-of-the-art Nuclear Norm and OptSpace methods.

3 0.88602144 185 nips-2013-Matrix Completion From any Given Set of Observations

Author: Troy Lee, Adi Shraibman

Abstract: In the matrix completion problem the aim is to recover an unknown real matrix from a subset of its entries. This problem comes up in many application areas, and has received a great deal of attention in the context of the netflix prize. A central approach to this problem is to output a matrix of lowest possible complexity (e.g. rank or trace norm) that agrees with the partially specified matrix. The performance of this approach under the assumption that the revealed entries are sampled randomly has received considerable attention (e.g. [1, 2, 3, 4, 5, 6, 7, 8]). In practice, often the set of revealed entries is not chosen at random and these results do not apply. We are therefore left with no guarantees on the performance of the algorithm we are using. We present a means to obtain performance guarantees with respect to any set of initial observations. The first step remains the same: find a matrix of lowest possible complexity that agrees with the partially specified matrix. We give a new way to interpret the output of this algorithm by next finding a probability distribution over the non-revealed entries with respect to which a bound on the generalization error can be proven. The more complex the set of revealed entries according to a certain measure, the better the bound on the generalization error. 1

4 0.78584385 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning

Author: Miao Xu, Rong Jin, Zhi-Hua Zhou

Abstract: In standard matrix completion theory, it is required to have at least O(n ln^2 n) observed entries to perfectly recover a low-rank matrix M of size n × n, leading to a large number of observations when n is large. In many real tasks, side information in addition to the observed entries is often available. In this work, we develop a novel theory of matrix completion that explicitly explores the side information to reduce the requirement on the number of observed entries. We show that, under appropriate conditions, with the assistance of side information matrices, the number of observed entries needed for a perfect recovery of matrix M can be dramatically reduced to O(ln n). We demonstrate the effectiveness of the proposed approach for matrix completion in transductive incomplete multi-label learning.

5 0.72698182 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport

Author: Marco Cuturi

Abstract: Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms’ dimension exceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximumentropy perspective. We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn’s matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.

6 0.69190395 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

7 0.68776643 342 nips-2013-Unsupervised Spectral Learning of Finite State Transducers

8 0.67859536 352 nips-2013-What do row and column marginals reveal about your dataset?

9 0.63975978 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling

10 0.59874099 186 nips-2013-Matrix factorization with binary components

11 0.56583339 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

12 0.53062689 209 nips-2013-New Subsampling Algorithms for Fast Least Squares Regression

13 0.52660698 306 nips-2013-Speeding up Permutation Testing in Neuroimaging

14 0.52628356 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

15 0.50159043 73 nips-2013-Convex Relaxations for Permutation Problems

16 0.50040954 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

17 0.49104232 290 nips-2013-Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?

18 0.45725143 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties

19 0.4496077 247 nips-2013-Phase Retrieval using Alternating Minimization

20 0.44905224 214 nips-2013-On Algorithms for Sparse Multi-factor NMF


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(16, 0.04), (33, 0.149), (34, 0.075), (36, 0.021), (40, 0.018), (41, 0.019), (49, 0.02), (56, 0.18), (70, 0.018), (85, 0.262), (89, 0.038), (93, 0.04), (95, 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.98247612 272 nips-2013-Regularized Spectral Clustering under the Degree-Corrected Stochastic Blockmodel

Author: Tai Qin, Karl Rohe

Abstract: Spectral clustering is a fast and popular algorithm for finding clusters in networks. Recently, Chaudhuri et al. [1] and Amini et al. [2] proposed inspired variations on the algorithm that artificially inflate the node degrees for improved statistical performance. The current paper extends the previous statistical estimation results to the more canonical spectral clustering algorithm in a way that removes any assumption on the minimum degree and provides guidance on the choice of the tuning parameter. Moreover, our results show how the “star shape” in the eigenvectors–a common feature of empirical networks–can be explained by the Degree-Corrected Stochastic Blockmodel and the Extended Planted Partition model, two statistical models that allow for highly heterogeneous degrees. Throughout, the paper characterizes and justifies several of the variations of the spectral clustering algorithm in terms of these models. 1

2 0.97667676 7 nips-2013-A Gang of Bandits

Author: Nicolò Cesa-Bianchi, Claudio Gentile, Giovanni Zappella

Abstract: Multi-armed bandit problems formalize the exploration-exploitation trade-offs arising in several industrially relevant applications, such as online advertisement and, more generally, recommendation systems. In many cases, however, these applications have a strong social component, whose integration in the bandit algorithm could lead to a dramatic performance increase. For instance, content may be served to a group of users by taking advantage of an underlying network of social relationships among them. In this paper, we introduce novel algorithmic approaches to the solution of such networked bandit problems. More specifically, we design and analyze a global recommendation strategy which allocates a bandit algorithm to each network node (user) and allows it to “share” signals (contexts and payoffs) with the neighboring nodes. We then derive two more scalable variants of this strategy based on different ways of clustering the graph nodes. We experimentally compare the algorithm and its variants to state-of-the-art methods for contextual bandits that do not use the relational information. Our experiments, carried out on synthetic and real-world datasets, show a consistent increase in prediction performance obtained by exploiting the network structure.

3 0.95970899 332 nips-2013-Tracking Time-varying Graphical Structure

Author: Erich Kummerfeld, David Danks

Abstract: Structure learning algorithms for graphical models have focused almost exclusively on stable environments in which the underlying generative process does not change; that is, they assume that the generating model is globally stationary. In real-world environments, however, such changes often occur without warning or signal. Real-world data often come from generating models that are only locally stationary. In this paper, we present LoSST, a novel, heuristic structure learning algorithm that tracks changes in graphical model structure or parameters in a dynamic, real-time manner. We show by simulation that the algorithm performs comparably to batch-mode learning when the generating graphical structure is globally stationary, and significantly better when it is only locally stationary. 1

4 0.9525941 140 nips-2013-Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

Author: Bruno Scherrer

Abstract: Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard's PI, which changes the actions in all states with a positive advantage, and Simplex-PI, which only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most n(m - 1)⌈(1/(1 - γ)) log(1/(1 - γ))⌉ = O((nm/(1 - γ)) log(1/(1 - γ))) iterations, improving by a factor O(log n) a result by [3], while Simplex-PI terminates after at most n^2(m - 1)(1 + (2/(1 - γ)) log(1/(1 - γ))) = O((n^2 m/(1 - γ)) log(1/(1 - γ))) iterations, improving by a factor O(log n) a result by [11]. Under some structural assumptions of the MDP, we then consider bounds that are independent of the discount factor γ: given a measure of the maximal transient time τt and the maximal time τr to revisit states in recurrent classes under all policies, we show that Simplex-PI terminates after at most n^2(m - 1)(⌈τr log(nτr)⌉ + ⌈τr log(nτt)⌉) · ((m - 1)⌈nτt log(nτt)⌉ + ⌈nτt log(n^2 τt)⌉) iterations.

5 0.91470432 168 nips-2013-Learning to Pass Expectation Propagation Messages

Author: Nicolas Heess, Daniel Tarlow, John Winn

Abstract: Expectation Propagation (EP) is a popular approximate posterior inference algorithm that often provides a fast and accurate alternative to sampling-based methods. However, while the EP framework in theory allows for complex nonGaussian factors, there is still a significant practical barrier to using them within EP, because doing so requires the implementation of message update operators, which can be difficult and require hand-crafted approximations. In this work, we study the question of whether it is possible to automatically derive fast and accurate EP updates by learning a discriminative model (e.g., a neural network or random forest) to map EP message inputs to EP message outputs. We address the practical concerns that arise in the process, and we provide empirical analysis on several challenging and diverse factors, indicating that there is a space of factors where this approach appears promising. 1

6 0.9003861 89 nips-2013-Dimension-Free Exponentiated Gradient

same-paper 7 0.88394594 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices

8 0.86067373 232 nips-2013-Online PCA for Contaminated Data

9 0.84010702 125 nips-2013-From Bandits to Experts: A Tale of Domination and Independence

10 0.83128464 196 nips-2013-Modeling Overlapping Communities with Node Popularities

11 0.82770509 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

12 0.82710499 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

13 0.81874371 2 nips-2013-(Nearly) Optimal Algorithms for Private Online Learning in Full-information and Bandit Settings

14 0.81298 29 nips-2013-Adaptive Submodular Maximization in Bandit Setting

15 0.80597776 289 nips-2013-Scalable kernels for graphs with continuous attributes

16 0.80582404 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

17 0.79931349 227 nips-2013-Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

18 0.79582417 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints

19 0.79528672 25 nips-2013-Adaptive Anonymity via $b$-Matching

20 0.79159147 233 nips-2013-Online Robust PCA via Stochastic Optimization