nips nips2013 nips2013-186 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin Slawski, Matthias Hein, Pavlo Lutsik
Abstract: Motivated by an application in computational biology, we consider low-rank matrix factorization with {0, 1}-constraints on one of the factors and optionally convex constraints on the second one. In addition to the non-convexity shared with other matrix factorization schemes, our problem is further complicated by a combinatorial constraint set of size 2^{m·r}, where m is the dimension of the data points and r the rank of the factorization. Despite apparent intractability, we provide, in line with recent work on non-negative matrix factorization by Arora et al. (2012), an algorithm that provably recovers the underlying factorization in the exact case with O(mr·2^r + mnr + r^2·n) operations for n data points. To obtain this result, we use theory around the Littlewood-Offord lemma from combinatorics.
Reference: text
sentIndex sentText sentNum sentScore
1 Matrix factorization with Binary Components Martin Slawski, Matthias Hein and Pavlo Lutsik Saarland University {ms,hein}@cs. [sent-1, score-0.305]
2 de Abstract Motivated by an application in computational biology, we consider low-rank matrix factorization with {0, 1}-constraints on one of the factors and optionally convex constraints on the second one. [sent-6, score-0.38]
3 In addition to the non-convexity shared with other matrix factorization schemes, our problem is further complicated by a combinatorial constraint set of size 2^{m·r}, where m is the dimension of the data points and r the rank of the factorization. [sent-7, score-0.409]
4 Despite apparent intractability, we provide, in line with recent work on non-negative matrix factorization by Arora et al. [sent-8, score-0.38]
5 (2012), an algorithm that provably recovers the underlying factorization in the exact case with O(mr·2^r + mnr + r^2·n) operations for n data points. [sent-9, score-0.381]
6 Several other matrix factorizations involving binary matrices have been proposed in the literature. [sent-21, score-0.125]
7 In [12] and [13], matrix factorization for binary input data but non-binary factors T and A is discussed, whereas [14] proposes a factorization T W A with both T and A binary and W real-valued, which is more restrictive than the model of the present paper. [sent-22, score-0.785]
8 The model in [14] in turn encompasses binary matrix factorization as proposed in [15], where all of D, T and A are constrained to be binary. [sent-23, score-0.43]
9 It is important to note that this line of research is fundamentally different from Boolean matrix factorization [16], which is sometimes also referred to as binary matrix factorization. [sent-24, score-0.532]
10 A major drawback of matrix factorization schemes is non-convexity. [sent-25, score-0.38]
11 Substantial progress in this regard has been achieved recently for non-negative matrix factorization (NMF) by Arora et al. [sent-29, score-0.41]
12 Despite the obvious hardness of the problem, we present as our main contribution an algorithm that provably provides an exact factorization D = T A whenever such factorization exists. [sent-32, score-0.639]
13 Moreover, we establish uniqueness of the exact factorization under the separability condition from the NMF literature [17, 19], or alternatively with high probability for T drawn uniformly at random. [sent-36, score-0.516]
14 We demonstrate the practical usefulness of our approach in unmixing DNA methylation signatures of blood samples [9]. [sent-40, score-0.422]
15 For a matrix M and index sets I, J, M_{I,J} denotes the submatrix corresponding to I and J; M_{I,:} and M_{:,J} denote the submatrices formed by the rows in I and the columns in J, respectively. [sent-42, score-0.324]
16 The affine hull generated by the columns of M is denoted by aff(M ). [sent-44, score-0.176]
17 we suppose that a factorization having the desired properties exists. [sent-51, score-0.305]
18 We first discuss the geometric ideas underlying our basic approach for recovering such a factorization from the data matrix before presenting conditions under which the factorization is unique. [sent-52, score-0.685]
19 It is shown that the question of uniqueness as well as the computational performance of our approach is intimately connected to the Littlewood-Offord problem in combinatorics [20]. [sent-53, score-0.177]
20 The model is D = T A with T ∈ {0, 1}^{m×r} and A^⊤ 1_r = 1_n (1). The columns {T_{:,k}}_{k=1}^r of T, which are vertices of the hypercube [0, 1]^m, are referred to as components. [sent-58, score-0.386]
21 The requirement A^⊤ 1_r = 1_n entails that the columns of D are affine instead of linear combinations of the columns of T. [sent-59, score-0.298]
22 This additional constraint is not essential to our approach; it is imposed for reasons of presentation, to avoid treating the origin differently from the other vertices of [0, 1]^m, since otherwise the zero vector could be dropped from T, leaving the factorization unchanged. [sent-60, score-0.631]
23 there is no factorization of the form (1) with r′ < r, and in turn that the columns of T are affinely independent, i. [sent-67, score-0.454]
24 Since D contains the same number of affinely independent columns as T, it must also hold that aff(D) ⊇ aff(T), in particular aff(D) ⊇ {T_{:,k}}_{k=1}^r. [sent-77, score-0.149]
25 Consequently, (1) can in principle be solved by enumerating all vertices of [0, 1]^m contained in aff(D) and selecting a maximal affinely independent subset thereof (see Figure 1). [sent-78, score-0.247]
26 This procedure, however, is exponential in the dimension m, with 2^m vertices to be checked for containment in aff(D) by solving a linear system. [sent-79, score-0.333]
27 Remarkably, the following observation along with its proof, which prompts Algorithm 1 below, shows that the number of elements to be checked can be reduced to 2^{r−1} irrespective of m. [sent-80, score-0.12]
28 The affine subspace aff(D) contains no more than 2^{r−1} vertices of [0, 1]^m. [sent-82, score-0.241]
29 Moreover, Algorithm 1 provides all vertices contained in aff(D). [sent-83, score-0.247]
30 Determine r − 1 linearly independent columns C of P, obtaining P_{:,C}, and subsequently r − 1 linearly independent rows R, obtaining P_{R,C} ∈ R^{(r−1)×(r−1)}. [sent-90, score-0.308]
31 Form Z = P_{:,C} (P_{R,C})^{−1} ∈ R^{m×(r−1)} and T = Z(B^{(r−1)} − p_R 1_{2^{r−1}}^⊤) + p 1_{2^{r−1}}^⊤ ∈ R^{m×2^{r−1}}, where the columns of B^{(r−1)} correspond to the elements of {0, 1}^{r−1}. [sent-92, score-0.177]
32 Select r affinely independent elements of T to be used as columns of T . [sent-103, score-0.177]
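To make steps 2 to 4 concrete, here is a minimal NumPy sketch of the candidate generation and vertex check of Algorithm 1 for the exact (noiseless) case; the choice of p as the first data column, the helper name find_vertices, and the tolerance tol are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.linalg import qr
from itertools import product

def find_vertices(D, r, tol=1e-8):
    """Sketch of Algorithm 1: return the candidate columns that are vertices of [0,1]^m in aff(D)."""
    m, n = D.shape
    p = D[:, [0]]                         # anchor point of the affine hull (here: first data column)
    P = D - p                             # aff(D) = p + column space of P (rank r-1 in the exact case)
    # step 2: r-1 linearly independent columns, then rows, via column-pivoted QR
    _, _, col_piv = qr(P, pivoting=True)
    C = col_piv[: r - 1]
    _, _, row_piv = qr(P[:, C].T, pivoting=True)
    R = row_piv[: r - 1]
    Z = P[:, C] @ np.linalg.inv(P[R][:, C])                 # Z = P_{:,C} (P_{R,C})^{-1}, m x (r-1)
    # step 3: candidate matrix, one column per element of {0,1}^{r-1}
    B = np.array(list(product([0, 1], repeat=r - 1)), dtype=float).T   # (r-1) x 2^(r-1)
    Tcand = Z @ (B - p[R]) + p                              # m x 2^(r-1)
    # step 4: keep candidates whose entries are all (numerically) in {0,1}
    is_vertex = np.all(np.minimum(np.abs(Tcand), np.abs(Tcand - 1)) < tol, axis=0)
    return Tcand[:, is_vertex]
```

In the noisy variant discussed later (Algorithm 3), the exact vertex test is replaced by selecting the columns closest to {0, 1}^m.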
33 Left: aff(D) intersects with r + 1 vertices of [0, 1]m . [sent-110, score-0.246]
34 In step 2 of Algorithm 1, determining the rank of P and an associated set of linearly independent columns/rows can be done by means of a rank-revealing QR factorization [21, 22]. [sent-113, score-0.367]
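As an illustration of how this step can be realized in practice, here is a small sketch using SciPy's column-pivoted QR as a stand-in for a rank-revealing QR; the rank threshold tol is an assumed heuristic, not taken from the paper.

```python
import numpy as np
from scipy.linalg import qr

def independent_columns(P, tol=1e-10):
    """Return indices of a maximal set of numerically independent columns of P via pivoted QR."""
    Q, Rfac, piv = qr(P, pivoting=True)
    diag = np.abs(np.diag(Rfac))
    rank = int(np.sum(diag > tol * diag[0]))   # numerical rank read off the pivoted R factor
    return piv[:rank]
```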
35 Determining T is the hardest part in solving the matrix factorization problem (1). [sent-116, score-0.411]
36 The dominating cost in Algorithm 1 is computation of the candidate matrix T and checking whether its columns are vertices of [0, 1]^m. [sent-123, score-0.537]
37 Forming the matrix T would hence require O((m − r + 1)(r − 1)2^{r−1}) and the subsequent check for vertices in the fourth step O((m − r + 1)2^{r−1}) operations. [sent-127, score-0.285]
38 The second most expensive operation is forming the matrix P_{R,C} in step 2 with the help of a QR decomposition requiring O(mn(r − 1)) operations in typical cases [21]. [sent-131, score-0.102]
39 Computing the matrix factorization (1) after the vertices have been identified (steps 2 to 4 in Algorithm 2) has complexity O(mnr + r^3 + r^2 n). [sent-132, score-0.59]
40 In this section, we study uniqueness of the matrix factorization problem (1) (modulo permutation of columns/rows). [sent-138, score-0.513]
41 First note that, in view of the affine independence of the columns of T, the factorization is unique iff T is, which holds iff condition (4): aff(D) ∩ {0, 1}^m = aff(T) ∩ {0, 1}^m = {T_{:,1}, . . . , T_{:,r}}, [sent-139, score-0.454]
42 i.e. aff(T) contains no other vertices of [0, 1]^m than the r given ones (cf. [sent-147, score-0.21]
43 Uniqueness is of great importance in applications, where one aims at an interpretation in which the columns of T play the role of underlying data-generating elements. [sent-149, score-0.211]
44 Such an interpretation is not valid if (4) fails to hold, since it is then possible to replace one of the columns of a specific choice of T by another vertex contained in the same affine subspace. [sent-150, score-0.215]
45 In the sequel, we argue that property (4) plays an important role from a computational point of view when solving extensions of problem (1) in which further constraints are imposed on A. [sent-152, score-0.115]
46 Problem (5) is of particular interest in the present paper, leading to a novel real world application of matrix factorization techniques as presented in Section 4. [sent-156, score-0.38]
47 For the example below, T consists of all 2^{r−1} vertices contained in an (r − 1)-dimensional face of [0, 1]^m; its block construction in (6) uses the blocks 0_{(m−r)×r}, I_{r−1} and 0_{r−1}, and the associated set of vertices is {T λ : λ_1 ∈ {0, 1}, . . . [sent-168, score-0.247]
48 In view of the negative example (6), one might ask whether uniqueness according to (4) can be achieved under additional conditions on T . [sent-173, score-0.163]
49 We prove uniqueness under separability, a condition introduced in [19] and imposed recently in [17] to show solvability of the NMF problem by linear programming. [sent-174, score-0.217]
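For reference, a common way of stating the separability condition, written here in the form consistent with the separable matrices T = [M ; I_r] used in the experiments of Section 4 (the exact formulation in [19] and [17] may differ in details):

```latex
% T in {0,1}^{m x r} is called separable if, after a suitable permutation Pi of its rows,
T \;=\; \Pi \begin{pmatrix} M \\ I_r \end{pmatrix},
\qquad M \in \{0,1\}^{(m-r)\times r},
% i.e. for every component k there is a row in which that component equals one and all others equal zero.
```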
50 This raises the question whether uniqueness holds or fails for broader classes of binary matrices. [sent-180, score-0.183]
51 Theorem 1 suggests a positive answer to the question of uniqueness posed above. [sent-200, score-0.133]
52 For m large enough and r small compared to m (in fact, following [24] one may conjecture that Theorem 1 holds with C = 1), the probability that the affine hull of r vertices of [0, 1]m selected uniformly at random contains some other vertex is exponentially small in the dimension m. [sent-201, score-0.266]
53 As a byproduct, these results imply that also the NMF variant of our matrix factorization problem (5) can in most cases be reduced to identifying a set of r vertices of [0, 1]m (cf. [sent-204, score-0.59]
54 In Algorithm 1, an m × 2^{r−1} matrix T of potential vertices is formed (Step 3). [sent-208, score-0.285]
55 We have discussed the case (6) where all candidates must indeed be vertices, in which case it seems to be impossible to reduce the computational cost of O((m − r)r2^{r−1}), which becomes significant once m is in the thousands and r ≥ 25. [sent-209, score-0.114]
56 On the positive side, Theorem 1 indicates that for many instances of T , only r out of 2r−1 candidates are in fact vertices. [sent-210, score-0.114]
57 We have observed empirically that this scheme rapidly reduces the candidate set − already checking a single row of T eliminates a substantial portion (see Figure 2). [sent-212, score-0.135]
58 To obtain a reduction of candidates by checking a single row of T = Z(B^{(r−1)} − p_R 1_{2^{r−1}}^⊤) + p 1_{2^{r−1}}^⊤, pick i ∉ R (recall that coordinates in R do not need to be checked, cf. [sent-230, score-0.268]
59 Provided none of the entries of Z_{i,:} is zero, the discrete L-O lemma implies that there are at most 2·(r−1 choose ⌊(r−1)/2⌋) out of the 2^{r−1} candidates for which the i-th coordinate is in {0, 1}. [sent-238, score-0.237]
60 This yields a reduction of the candidate set by a factor of 2·(r−1 choose ⌊(r−1)/2⌋)/2^{r−1} = O(1/√(r−1)). [sent-239, score-0.101]
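For reference, the point-concentration form of the Littlewood-Offord lemma that underlies this count, stated in a standard textbook form (the precise variant invoked in the paper may differ slightly):

```latex
\textbf{Littlewood--Offord (point-concentration form).}\;
Let $z_1,\dots,z_{n}$ be real numbers with $z_j \neq 0$ for all $j$. Then for every $x \in \mathbb{R}$,
\[
\#\Big\{ b \in \{0,1\}^{n} : \textstyle\sum_{j} b_j z_j = x \Big\} \;\le\; \binom{n}{\lfloor n/2 \rfloor},
\]
% so at most 2*binom(n, floor(n/2)) of the 2^n sums can hit the two values {0,1}; with n = r-1 this gives the bound used above.
```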
61 Admittedly, this reduction may appear insignificant given the total number of candidates to be checked. [sent-240, score-0.164]
62 Furthermore, when successively picking d rows of T and assuming that each row yields a reduction according to the discrete L-O lemma, one would obtain the reduction (r − 1)^{−d/2}, so that d = r − 1 would suffice to identify all vertices provided r ≥ 4. [sent-246, score-0.405]
63 In view of the continuous L-O lemma, a reduction in the number of candidates can still be achieved if the requirement is weakened to T_{i,u} ∈ [0, 1]. [sent-251, score-0.194]
64 According to (7) the candidates satisfying the relaxed constraint for the i-th coordinate can be obtained from the feasibility problem: find b ∈ {0, 1}^{r−1} subject to 0 ≤ Z_{i,:}(b − p_R) + p_i ≤ 1, (8) which is an integer linear program that can be solved e. [sent-252, score-0.151]
65 With the help of CPLEX, it is affordable to solve problem (8) with all m − r + 1 constraints (one for each of the rows of T to be checked) imposed simultaneously. [sent-256, score-0.147]
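A brute-force stand-in for this pruning step, checking the box constraints of (8) row by row over the explicit candidate set instead of calling an ILP solver such as CPLEX; the variable names follow the sketch of Algorithm 1 above and the tolerance is an assumed choice:

```python
import numpy as np

def prune_candidates(Z, p, R, B, tol=1e-8, rows_to_check=None):
    """Z: (m, r-1), p: (m, 1) anchor column, R: indices of the r-1 pivot rows,
    B: (r-1, K) matrix whose columns are the candidate vectors b in {0,1}^(r-1).
    Drops candidates whose implied coordinate Z[i,:] @ (b - p[R]) + p[i] leaves [0,1]."""
    m = Z.shape[0]
    keep = np.ones(B.shape[1], dtype=bool)
    rows = rows_to_check if rows_to_check is not None else [i for i in range(m) if i not in set(R)]
    for i in rows:
        vals = Z[i, :] @ (B - p[R]) + p[i]    # i-th coordinate of every candidate vertex
        keep &= (vals > -tol) & (vals < 1 + tol)
    return B[:, keep]
```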
66 In our experiments, we always directly recovered the underlying vertices and only these, without the need to prune the solution pool (which could be achieved by Algorithm 1, replacing the 2^{r−1} candidates by a potentially much smaller solution pool). [sent-257, score-0.399]
67 [Figure 2 residue: the left panel plots the maximum number of remaining vertices (out of 2^{r−1}) over 100 trials against the number of coordinates checked; the right panel plots log10(CPU time) in seconds against r.] [sent-267, score-0.302]
68 Figure 2: Left: Speeding up the algorithm by checking single coordinates, remaining number of vertices vs. coordinates checked. [sent-277, score-0.104]
69 As distinguished from the exact case, Algorithm 3 requires the number of components r to be specified in advance, as is typically the case in noisy matrix factorization problems. [sent-321, score-0.409]
70 Moreover, the vector p subtracted from all columns of D in step 1 is chosen as the mean of the data points, which is in particular a reasonable choice if D is contaminated with additive noise distributed symmetrically around zero. [sent-322, score-0.176]
71 The truncated SVD of step 2 achieves the desired dimension reduction and potentially reduces noise corresponding to small singular values that are discarded. [sent-323, score-0.107]
72 While in the exact case one identifies all columns of T that are in {0, 1}^m, in the noisy case one instead only identifies columns close to {0, 1}^m. [sent-325, score-0.327]
73 Given the output of Algorithm 3, we solve the approximate matrix factorization problem via least squares, obtaining the right factor from min_A ‖D − T A‖_F^2. [sent-326, score-0.38]
74 Improved performance for higher noise levels can be achieved by running Algorithm 3 multiple times with different sets of rows selected in step 2, which yields candidate matrices {T^{(l)}}_{l=1}^s, and subsequently using T = argmin_{T^{(l)}} min_A ‖D − T^{(l)} A‖_F^2, i.e. the candidate giving the best fit. [sent-328, score-0.201]
75 Alternatively, we may form a candidate pool by merging the {T^{(l)}}_{l=1}^s and then use a backward elimination scheme, in which candidates yielding the smallest improvement in fitting D are successively dropped until r candidates are left. [sent-331, score-0.388]
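A minimal sketch of this refitting and selection step, using plain unconstrained least squares for the inner fit; the constraints imposed on A in the paper (such as A^⊤ 1_r = 1_n or A ≥ 0) are omitted here for brevity, so this is an illustration rather than the authors' procedure:

```python
import numpy as np

def refit_A(T, D):
    """Least-squares right factor: argmin_A ||D - T A||_F^2 (unconstrained)."""
    A, *_ = np.linalg.lstsq(T, D, rcond=None)
    return A

def best_candidate(T_list, D):
    """Pick the candidate T^(l) whose least-squares fit to D has the smallest residual."""
    def residual(T):
        return np.linalg.norm(D - T @ refit_A(T, D), "fro")
    return min(T_list, key=residual)
```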
76 Apart from that, T returned by Algorithm 3 can be used for initializing the block optimization scheme of Algorithm 4 below. [sent-332, score-0.108]
77 Algorithm 4 is akin to standard block coordinate descent schemes proposed in the matrix factorization literature, e. [sent-333, score-0.464]
78 An important observation (step 3) is that optimization of T is separable along the rows of T, so that for small r, it is feasible to perform exhaustive search over all 2^r possibilities (or to use CPLEX). [sent-336, score-0.123]
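A minimal sketch of this row-separable update for fixed A, enumerating all 2^r binary rows per row of T (practical only for small r); the function name and the fully vectorized enumeration are illustrative choices:

```python
import numpy as np
from itertools import product

def update_T_rows(D, A, r):
    """For fixed A, set each row of T to the binary vector minimizing its residual:
    T[i,:] = argmin over t in {0,1}^r of ||D[i,:] - t @ A||^2 (rows are independent)."""
    candidates = np.array(list(product([0, 1], repeat=r)), dtype=float)   # (2^r, r)
    fits = candidates @ A                                                 # (2^r, n): every possible row t @ A
    # squared residual of every candidate row against every row of D: shape (m, 2^r)
    errs = ((D[:, None, :] - fits[None, :, :]) ** 2).sum(axis=2)
    return candidates[np.argmin(errs, axis=1)]                            # (m, r)
```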
79 box: we run the block scheme of Algorithm 4, relaxing the integer constraint into a box constraint. [sent-369, score-0.161]
80 Five random initializations are used and we take the result yielding the best fit, subsequently rounding the entries of T to fulfill the {0, 1}-constraints and refitting A. [sent-370, score-0.12]
81 quad pen: as box, but a (concave) quadratic penalty λ Σ_{i,k} T_{i,k}(1 − T_{i,k}) is added to push the entries of T towards {0, 1}. [sent-371, score-0.147]
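A plausible form of the resulting penalized objective, written with the same fitting term used elsewhere in the paper; the exact formulation of the box and quad pen baselines may differ in details:

```latex
\min_{T,\,A}\; \|D - TA\|_F^2 \;+\; \lambda \sum_{i,k} T_{i,k}\,(1 - T_{i,k})
\quad \text{subject to} \quad 0 \le T_{i,k} \le 1,\;\; A^\top \mathbf{1}_r = \mathbf{1}_n .
```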
82 [Figure 3 residue: the compared methods are box, quad pen, oracle and FindVertices; the panels report the errors ‖T A − D‖_F/√(m·n) and ‖T − T∗‖_F^2/(m·r).] [sent-404, score-0.295]
84 Since separability is crucial to the performance of HT, we restrict our comparison to separable T = [M ; I_r], generating the entries of M i.i.d. [sent-425, score-0.166]
85 The range of the grid is chosen based on knowledge of the noise matrix E. [sent-439, score-0.102]
86 From Figure 3, we find that unlike the other approaches, box does not always recover T ∗ even if the noise level α = 0. [sent-442, score-0.109]
87 Unmixing of DNA methylation profiles is a problem of high interest in cancer research. [sent-450, score-0.284]
88 DNA methylation is a chemical modification of the DNA occurring at specific sites, so-called CpGs. [sent-451, score-0.284]
89 DNA methylation affects gene expression and in turn various processes such as cellular differentiation. [sent-452, score-0.346]
90 DNA methylation microarrays allow one to measure the methylation level for thousands of sites. [sent-454, score-0.568]
91 In the dataset considered here, the measurements D (the rows corresponding to sites, the columns to samples) result from a mixture of cell types. [sent-455, score-0.289]
92 The methylation profiles of the latter are in {0, 1}^m, whereas, depending on the mixture proportions associated with each sample, the entries of D take values in [0, 1]. [sent-456, score-0.441]
93 In other words, we have the model D ≈ T A, with T representing the methylation of the cell types and the columns of A being elements of the probability simplex. [sent-457, score-0.506]
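To illustrate how the simplex constraint on the columns of A can be handled when fitting this model, here is a sketch of a projected-gradient solver using the standard Euclidean projection onto the probability simplex; this is one plausible way to solve the constrained least-squares subproblem and is not taken from the paper:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex {a >= 0, sum(a) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def fit_A_simplex(D, T, n_iter=500):
    """Projected gradient for min_A 0.5*||D - T A||_F^2 with each column of A on the simplex."""
    m, r = T.shape
    _, n = D.shape
    A = np.full((r, n), 1.0 / r)
    step = 1.0 / (np.linalg.norm(T, 2) ** 2 + 1e-12)    # 1/L, L = ||T||_2^2 (Lipschitz constant of the gradient)
    for _ in range(n_iter):
        grad = T.T @ (T @ A - D)                        # gradient of 0.5*||D - T A||_F^2
        A = np.apply_along_axis(project_simplex, 0, A - step * grad)   # project each column
    return A
```

With A fitted this way, the mixture proportions of each sample can be read off from the corresponding column of A.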
94 It is often of interest to recover the mixture proportions of the samples, because e. [sent-458, score-0.1]
95 [Figure 4 residue; horizontal axis: components used (1 to 6).] Figure 4: Left: Mixture proportions of the ground truth. [sent-473, score-0.1]
96 We apply our approach to obtain an approximate factorization D ≈ T A, T ∈ {0, 1}^{m×r}, A ∈ R^{r×n} and A^⊤ 1_r = 1_n. [sent-477, score-0.305]
97 In order to judge the fit as well as the matrix T returned by our method, we compute T∗ = argmin_{T ∈ {0,1}^{m×r}} ‖D − T A∗‖_F^2 as in (9). [sent-486, score-0.104]
98 Learning the parts of objects by nonnegative matrix factorization. [sent-500, score-0.101]
99 DNA methylation arrays as surrogate measures of cell mixture distribution. [sent-540, score-0.361]
100 When does non-negative matrix factorization give a correct decomposition into parts? [sent-606, score-0.407]
wordName wordTfidf (topN-words)
[('aff', 0.52), ('factorization', 0.305), ('methylation', 0.284), ('findvertices', 0.213), ('vertices', 0.21), ('columns', 0.149), ('af', 0.143), ('uniqueness', 0.133), ('dna', 0.133), ('rr', 0.128), ('ta', 0.127), ('hottopixx', 0.118), ('candidates', 0.114), ('nmf', 0.107), ('nely', 0.103), ('checked', 0.092), ('alpha', 0.092), ('pen', 0.09), ('quad', 0.09), ('imposed', 0.084), ('box', 0.082), ('matrix', 0.075), ('pr', 0.074), ('ertices', 0.071), ('indv', 0.071), ('unmixing', 0.071), ('ht', 0.071), ('rmse', 0.07), ('proportions', 0.068), ('ir', 0.066), ('cplex', 0.066), ('rows', 0.063), ('rmses', 0.063), ('separable', 0.06), ('entries', 0.057), ('coordinates', 0.052), ('checking', 0.052), ('candidate', 0.051), ('reduction', 0.05), ('binary', 0.05), ('separability', 0.049), ('block', 0.047), ('mnr', 0.047), ('qr', 0.046), ('cell', 0.045), ('pool', 0.045), ('combinatorics', 0.044), ('sites', 0.042), ('argmina', 0.042), ('argmint', 0.038), ('mina', 0.038), ('arora', 0.038), ('contained', 0.037), ('submatrix', 0.037), ('supplement', 0.037), ('coordinate', 0.037), ('intersects', 0.036), ('mint', 0.036), ('hamming', 0.036), ('blood', 0.034), ('transcription', 0.034), ('ai', 0.034), ('oracle', 0.033), ('rm', 0.033), ('bernoulli', 0.033), ('speeding', 0.033), ('signatures', 0.033), ('yielding', 0.033), ('gene', 0.033), ('linearly', 0.033), ('ground', 0.032), ('mixture', 0.032), ('scheme', 0.032), ('dropped', 0.032), ('successively', 0.032), ('solving', 0.031), ('subspace', 0.031), ('zi', 0.031), ('singular', 0.03), ('achieved', 0.03), ('les', 0.03), ('apart', 0.03), ('subsequently', 0.03), ('rank', 0.029), ('vertex', 0.029), ('returned', 0.029), ('lemma', 0.029), ('exact', 0.029), ('cellular', 0.029), ('ur', 0.029), ('truth', 0.029), ('elements', 0.028), ('noise', 0.027), ('tao', 0.027), ('ful', 0.027), ('hull', 0.027), ('referred', 0.027), ('proposition', 0.027), ('decomposition', 0.027), ('nonnegative', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 186 nips-2013-Matrix factorization with binary components
Author: Martin Slawski, Matthias Hein, Pavlo Lutsik
Abstract: Motivated by an application in computational biology, we consider low-rank matrix factorization with {0, 1}-constraints on one of the factors and optionally convex constraints on the second one. In addition to the non-convexity shared with other matrix factorization schemes, our problem is further complicated by a combinatorial constraint set of size 2^{m·r}, where m is the dimension of the data points and r the rank of the factorization. Despite apparent intractability, we provide, in line with recent work on non-negative matrix factorization by Arora et al. (2012), an algorithm that provably recovers the underlying factorization in the exact case with O(mr·2^r + mnr + r^2·n) operations for n data points. To obtain this result, we use theory around the Littlewood-Offord lemma from combinatorics.
2 0.078881033 179 nips-2013-Low-Rank Matrix and Tensor Completion via Adaptive Sampling
Author: Akshay Krishnamurthy, Aarti Singh
Abstract: We study low rank matrix and tensor completion and propose novel algorithms that employ adaptive sampling schemes to obtain strong performance guarantees. Our algorithms exploit adaptivity to identify entries that are highly informative for learning the column space of the matrix (tensor) and consequently, our results hold even when the row space is highly coherent, in contrast with previous analyses. In the absence of noise, we show that one can exactly recover a n ⇥ n matrix of rank r from merely ⌦(nr3/2 log(r)) matrix entries. We also show that one can recover an order T tensor using ⌦(nrT 1/2 T 2 log(r)) entries. For noisy recovery, our algorithm consistently estimates a low rank matrix corrupted with noise using ⌦(nr3/2 polylog(n)) entries. We complement our study with simulations that verify our theory and demonstrate the scalability of our algorithms. 1
3 0.072489068 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
Author: Siwei Lyu, Xin Wang
Abstract: Nonnegative matrix factorization (NMF) is a popular data analysis method, the objective of which is to approximate a matrix with all nonnegative components into the product of two nonnegative matrices. In this work, we describe a new simple and efficient algorithm for multi-factor nonnegative matrix factorization (mfNMF) problem that generalizes the original NMF problem to more than two factors. Furthermore, we extend the mfNMF algorithm to incorporate a regularizer based on the Dirichlet distribution to encourage the sparsity of the components of the obtained factors. Our sparse mfNMF algorithm affords a closed form and an intuitive interpretation, and is more efficient in comparison with previous works that use fix point iterations. We demonstrate the effectiveness and efficiency of our algorithms on both synthetic and real data sets. 1
4 0.072131172 157 nips-2013-Learning Multi-level Sparse Representations
Author: Ferran Diego Andilla, Fred A. Hamprecht
Abstract: Bilinear approximation of a matrix is a powerful paradigm of unsupervised learning. In some applications, however, there is a natural hierarchy of concepts that ought to be reflected in the unsupervised analysis. For example, in the neurosciences image sequence considered here, there are the semantic concepts of pixel → neuron → assembly that should find their counterpart in the unsupervised analysis. Driven by this concrete problem, we propose a decomposition of the matrix of observations into a product of more than two sparse matrices, with the rank decreasing from lower to higher levels. In contrast to prior work, we allow for both hierarchical and heterarchical relations of lower-level to higher-level concepts. In addition, we learn the nature of these relations rather than imposing them. Finally, we describe an optimization scheme that allows to optimize the decomposition over all levels jointly, rather than in a greedy level-by-level fashion. The proposed bilevel SHMF (sparse heterarchical matrix factorization) is the first formalism that allows to simultaneously interpret a calcium imaging sequence in terms of the constituent neurons, their membership in assemblies, and the time courses of both neurons and assemblies. Experiments show that the proposed model fully recovers the structure from difficult synthetic data designed to imitate the experimental data. More importantly, bilevel SHMF yields plausible interpretations of real-world Calcium imaging data. 1
5 0.068151332 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
Author: Dimitris Achlioptas, Zohar S. Karnin, Edo Liberty
Abstract: We consider the problem of selecting non-zero entries of a matrix A in order to produce a sparse sketch of it, B, that minimizes A B 2 . For large m n matrices, such that n m (for example, representing n observations over m attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding A. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with O 1 computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model. 1
6 0.067226648 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion
7 0.066578202 233 nips-2013-Online Robust PCA via Stochastic Optimization
8 0.065426588 91 nips-2013-Dirty Statistical Models
9 0.064516686 185 nips-2013-Matrix Completion From any Given Set of Observations
10 0.062346943 220 nips-2013-On the Complexity and Approximation of Binary Evidence in Lifted Inference
11 0.06158071 188 nips-2013-Memory Limited, Streaming PCA
12 0.061131552 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition
13 0.060526647 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing
14 0.059120949 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
15 0.05894595 68 nips-2013-Confidence Intervals and Hypothesis Testing for High-Dimensional Statistical Models
16 0.058455735 105 nips-2013-Efficient Optimization for Sparse Gaussian Process Regression
17 0.057585679 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
18 0.0572748 113 nips-2013-Exact and Stable Recovery of Pairwise Interaction Tensors
19 0.057035159 353 nips-2013-When are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity
20 0.056750674 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
topicId topicWeight
[(0, 0.172), (1, 0.07), (2, 0.055), (3, 0.059), (4, 0.004), (5, 0.016), (6, -0.012), (7, -0.011), (8, -0.035), (9, -0.007), (10, 0.009), (11, 0.04), (12, 0.063), (13, 0.032), (14, -0.042), (15, 0.015), (16, -0.026), (17, -0.036), (18, -0.058), (19, 0.032), (20, -0.04), (21, -0.031), (22, -0.021), (23, -0.019), (24, 0.037), (25, -0.033), (26, 0.028), (27, -0.026), (28, 0.023), (29, -0.014), (30, -0.079), (31, -0.009), (32, 0.061), (33, 0.065), (34, 0.046), (35, 0.007), (36, 0.012), (37, 0.022), (38, 0.033), (39, 0.051), (40, -0.046), (41, 0.032), (42, 0.007), (43, -0.074), (44, -0.111), (45, -0.019), (46, -0.084), (47, -0.031), (48, -0.001), (49, -0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.91937774 186 nips-2013-Matrix factorization with binary components
Author: Martin Slawski, Matthias Hein, Pavlo Lutsik
Abstract: Motivated by an application in computational biology, we consider low-rank matrix factorization with {0, 1}-constraints on one of the factors and optionally convex constraints on the second one. In addition to the non-convexity shared with other matrix factorization schemes, our problem is further complicated by a combinatorial constraint set of size 2^{m·r}, where m is the dimension of the data points and r the rank of the factorization. Despite apparent intractability, we provide, in line with recent work on non-negative matrix factorization by Arora et al. (2012), an algorithm that provably recovers the underlying factorization in the exact case with O(mr·2^r + mnr + r^2·n) operations for n data points. To obtain this result, we use theory around the Littlewood-Offord lemma from combinatorics.
2 0.71620184 73 nips-2013-Convex Relaxations for Permutation Problems
Author: Fajwel Fogel, Rodolphe Jenatton, Francis Bach, Alexandre D'Aspremont
Abstract: Seriation seeks to reconstruct a linear order between variables using unsorted similarity information. It has direct applications in archeology and shotgun gene sequencing for example. We prove the equivalence between the seriation and the combinatorial 2-SUM problem (a quadratic minimization problem over permutations) over a class of similarity matrices. The seriation problem can be solved exactly by a spectral algorithm in the noiseless case and we produce a convex relaxation for the 2-SUM problem to improve the robustness of solutions in a noisy setting. This relaxation also allows us to impose additional structural constraints on the solution, to solve semi-supervised seriation problems. We present numerical experiments on archeological data, Markov chains and gene sequences. 1
3 0.70564169 214 nips-2013-On Algorithms for Sparse Multi-factor NMF
Author: Siwei Lyu, Xin Wang
Abstract: Nonnegative matrix factorization (NMF) is a popular data analysis method, the objective of which is to approximate a matrix with all nonnegative components into the product of two nonnegative matrices. In this work, we describe a new simple and efficient algorithm for multi-factor nonnegative matrix factorization (mfNMF) problem that generalizes the original NMF problem to more than two factors. Furthermore, we extend the mfNMF algorithm to incorporate a regularizer based on the Dirichlet distribution to encourage the sparsity of the components of the obtained factors. Our sparse mfNMF algorithm affords a closed form and an intuitive interpretation, and is more efficient in comparison with previous works that use fix point iterations. We demonstrate the effectiveness and efficiency of our algorithms on both synthetic and real data sets. 1
4 0.66090411 342 nips-2013-Unsupervised Spectral Learning of Finite State Transducers
Author: Raphael Bailly, Xavier Carreras, Ariadna Quattoni
Abstract: Finite-State Transducers (FST) are a standard tool for modeling paired inputoutput sequences and are used in numerous applications, ranging from computational biology to natural language processing. Recently Balle et al. [4] presented a spectral algorithm for learning FST from samples of aligned input-output sequences. In this paper we address the more realistic, yet challenging setting where the alignments are unknown to the learning algorithm. We frame FST learning as finding a low rank Hankel matrix satisfying constraints derived from observable statistics. Under this formulation, we provide identifiability results for FST distributions. Then, following previous work on rank minimization, we propose a regularized convex relaxation of this objective which is based on minimizing a nuclear norm penalty subject to linear constraints and can be solved efficiently. 1
5 0.65205956 117 nips-2013-Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis
Author: James R. Voss, Luis Rademacher, Mikhail Belkin
Abstract: The performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise. In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows: 1. We develop and implement an efficient, Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. 2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. The algorithm is based on a special form of gradient iteration (different from gradient descent). We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case. 1 Introduction and Related Works In the Blind Signal Separation setting, it is assumed that observed data is drawn from an unknown distribution. The goal is to recover the latent signals under some appropriate structural assumption. A prototypical setting is the so-called cocktail party problem: in a room, there are d people speaking simultaneously and d microphones, with each microphone capturing a superposition of the voices. The objective is to recover the speech of each individual speaker. The simplest modeling assumption is to consider each speaker as producing a signal that is a random variable independent of the others, and to take the superposition to be a linear transformation independent of time. This leads to the following formalization: We observe samples from a random vector x distributed according to the equation x = As + b + η where A is a linear mixing matrix, b ∈ Rd is a constant vector, s is a latent random vector with independent coordinates, and η is an unknown random noise independent 1 of s. For simplicity, we assume A ∈ Rd×d is square and of full rank. The latent components of s are viewed as containing the information describing the makeup of the observed signal (voices of individual speakers in the cocktail party setting). The goal of Independent Component Analysis is to approximate the matrix A in order to recover the latent signal s. In practice, most methods ignore the noise term, leaving the simpler problem of recovering the mixing matrix A when x = As is observed. Arguably the two most widely used ICA algorithms are FastICA [13] and JADE [6]. Both of these algorithms are based on a two step process: (1) The data is centered and whitened, that is, made to have identity covariance matrix. This is typically done using principal component analysis (PCA) and rescaling the appropriate components. In the noiseless case this procedure orthogonalizes and rescales the independent components and thus recovers A up to an unknown orthogonal matrix R. (2) Recover the orthogonal matrix R. Most practical ICA algorithms differ only in the second step. 
In FastICA, various objective functions are used to perform a projection pursuit style algorithm which recovers the columns of R one at a time. JADE uses a fourth-cumulant based technique to simultaneously recover all columns of R. Step 1 of ICA is affected by the addition of a Gaussian noise. Even if the noise is white (has a scalar times identity covariance matrix) the PCA-based whitening procedure can no longer guarantee the whitening of the underlying independent components. Hence, the second step of the process is no longer justified. This failure may be even more significant if the noise is not white, which is likely to be the case in many practical situations. Recent theoretical developments (see, [2] and [3]) consider the case where the noise η is an arbitrary (not necessarily white) additive Gaussian variable drawn independently from s. In [2], it was observed that certain cumulant-based techniques for ICA can still be applied for the second step if the underlying signals can be orthogonalized.1 Orthogonalization of the latent signals (quasi-orthogonalization) is a significantly less restrictive condition as it does not force the underlying signal to have identity covariance (as in whitening in the noiseless case). In the noisy setting, the usual PCA cannot achieve quasi-orthogonalization as it will whiten the mixed signal, but not the underlying components. In [3], we show how quasi-orthogonalization can be achieved in a noise-invariant way through a method based on the fourth-order cumulant tensor. However, a direct implementation of that method requires estimating the full fourth-order cumulant tensor, which is computationally challenging even in relatively low dimensions. In this paper we derive a practical version of that algorithm based on directional Hessians of the fourth univariate cumulant, thus reducing the complexity dependence on the data dimensionality from d4 to d3 , and also allowing for a fully vectorized implementation. We also develop a fast and very simple gradient iteration (not to be confused with gradient descent) algorithm, GI-ICA, which is compatible with the quasi-orthogonalization step and can be shown to have convergence of order r − 1, when implemented using a univariate cumulant of order r. For the cumulant of order four, commonly used in practical applications, we obtain cubic convergence. We show how these convergence rates follow directly from the properties of the cumulants, which sheds some light on the somewhat surprising cubic convergence seen in fourth-order based ICA methods [13, 18, 22]. The update step has complexity O(N d) where N is the number of samples, giving a total algorithmic complexity of O(N d3 ) for step 1 and O(N d2 t) for step 2, where t is the number of iterations for convergence in the gradient iteration. Interestingly, while the techniques are quite different, our gradient iteration algorithm turns out to be closely related to Fast ICA in the noiseless setting, in the case when the data is whitened and the cumulants of order three or four are used. Thus, GI-ICA can be viewed as a generalization (and a conceptual simplification) of Fast ICA for more general quasi-orthogonalized data. We present experimental results showing superior performance in the case of data contaminated by Gaussian noise and very competitive performance for clean data. 
We also note that the GIICA algorithms are fast in practice, allowing us to process (decorrelate and detect the independent 1 This process of orthogonalizing the latent signals was called quasi-whitening in [2] and later in [3]. However, this conflicts with the definition of quasi-whitening given in [12] which requires the latent signals to be whitened. To avoid the confusion we will use the term quasi-orthogonalization for the process of orthogonalizing the latent signals. 2 components) 100 000 points in dimension 5 in well under a second on a standard desktop computer. Our Matlab implementation of GI-ICA is available for download at http://sourceforge. net/projects/giica/. Finally, we observe that our method is partially compatible with the robust cumulants introduced in [20]. We briefly discuss how GI-ICA can be extended using these noise-robust techniques for ICA to reduce the impact of sparse noise. The paper is organized as follows. In section 2, we discuss the relevant properties of cumulants, and discuss results from prior work which allows for the quasi-orthogonalization of signals with non-zero fourth cumulant. In section 3, we discuss the connection between the fourth-order cumulant tensor method for quasi-orthogonalization discussed in section 2 with Hessian-based techniques seen in [2] and [11]. We use this connection to create a more computationally efficient and practically implementable version of the quasi-orthogonalization algorithm discussed in section 2. In section 4, we discuss new, fast, projection-pursuit style algorithms for the second step of ICA which are compatible with quasi-orthogonalization. In order to simplify the presentation, all algorithms are stated in an abstract form as if we have exact knowledge of required distribution parameters. Section 5 discusses the estimators of required distribution parameters to be used in practice. Section 6 discusses numerical experiments demonstrating the applicability of our techniques. Related Work. The name Independent Component Analysis refers to a broad range of algorithms addressing the blind signal separation problem as well as its variants and extensions. There is an extensive literature on ICA in the signal processing and machine learning communities due to its applicability to a variety of important practical situations. For a comprehensive introduction see the books [8, 14]. In this paper we develop techniques for dealing with noisy data by introducing new and more efficient techniques for quasi-orthogonalization and subsequent component recovery. The quasi-orthogonalization step was introduced in [2], where the authors proposed an algorithm for the case when the fourth cumulants of all independent components are of the same sign. A general algorithm with complete theoretical analysis was provided in [3]. That algorithm required estimating the full fourth-order cumulant tensor. We note that Hessian based techniques for ICA were used in [21, 2, 11], with [11] and [2] using the Hessian of the fourth-order cumulant. The papers [21] and [11] proposed interesting randomized one step noise-robust ICA algorithms based on the cumulant generating function and the fourth cumulant respectively in primarily theoretical settings. The gradient iteration algorithm proposed is closely related to the work [18], which provides a gradient-based algorithm derived from the fourth moment with cubic convergence to learn an unknown parallelepiped in a cryptographic setting. 
For the special case of the fourth cumulant, the idea of gradient iteration has appeared in the context of FastICA with a different justification, see e.g. [16, Equation 11 and Theorem 2]. We also note the work [12], which develops methods for Gaussian noise-invariant ICA under the assumption that the noise parameters are known. Finally, there are several papers that considered the problem of performing PCA in a noisy framework. [5] gives a provably robust algorithm for PCA under a sparse noise model. [4] performs PCA robust to white Gaussian noise, and [9] performs PCA robust to white Gaussian noise and sparse noise. 2 Using Cumulants to Orthogonalize the Independent Components Properties of Cumulants: Cumulants are similar to moments and can be expressed in terms of certain polynomials of the moments. However, cumulants have additional properties which allow independent random variables to be algebraically separated. We will be interested in the fourth order multi-variate cumulants, and univariate cumulants of arbitrary order. Denote by Qx the fourth order cumulant tensor for the random vector x. So, (Qx )ijkl is the cross-cumulant between the random variables xi , xj , xk , and xl , which we alternatively denote as Cum(xi , xj , xk , xl ). Cumulant tensors are symmetric, i.e. (Qx )ijkl is invariant under permutations of indices. Multivariate cumulants have the following properties (written in the case of fourth order cumulants): 1. (Multilinearity) Cum(αxi , xj , xk , xl ) = α Cum(xi , xj , xk , xl ) for random vector x and scalar α. If y is a random variable, then Cum(xi +y, xj , xk , xl ) = Cum(xi , xj , xk , xl )+Cum(y, xj , xk , xl ). 2. (Independence) If xi and xj are independent random variables, then Cum(xi , xj , xk , xl ) = 0. When x and y are independent, Qx+y = Qx + Qy . 3. (Vanishing Gaussian) Cumulants of order 3 and above are zero for Gaussian random variables. 3 The first order cumulant is the mean, and the second order multivariate cumulant is the covariance matrix. We will denote by κr (x) the order-r univariate cumulant, which is equivalent to the crosscumulant of x with itself r times: κr (x) := Cum(x, x, . . . , x) (where x appears r times). Univariate r-cumulants are additive for independent random variables, i.e. κr (x + y) = κr (x) + κr (y), and homogeneous of degree r, i.e. κr (αx) = αr κr (x). Quasi-Orthogonalization Using Cumulant Tensors. Recalling our original notation, x = As + b + η gives the generative ICA model. We define an operation of fourth-order tensors on matrices: For Q ∈ Rd×d×d×d and M ∈ Rd×d , Q(M ) is the matrix such that d d Q(M )ij := Qijkl mlk . (1) k=1 l=1 We can use this operation to orthogonalize the latent random signals. Definition 2.1. A matrix W is called a quasi-orthogonalization matrix if there exists an orthogonal matrix R and a nonsingular diagonal matrix D such that W A = RD. We will need the following results from [3]. Here we use Aq to denote the q th column of A. Lemma 2.2. Let M ∈ Rd×d be an arbitrary matrix. Then, Qx (M ) = ADAT where D is a diagonal matrix with entries dqq = κ4 (sq )AT M Aq . q Theorem 2.3. Suppose that each component of s has non-zero fourth cumulant. Let M = Qx (I), and let C = Qx (M −1 ). Then C = ADAT where D is a diagonal matrix with entries dqq = 1/ Aq 2 . In particular, C is positive definite, and for any factorization BB T of C, B −1 is a quasi2 orthogonalization matrix. 
3 Quasi-Orthogonalization using Cumulant Hessians We have seen in Theorem 2.3 a tensor-based method which can be used to quasi-orthogonalize observed data. However, this method na¨vely requires the estimation of O(d4 ) terms from data. ı There is a connection between the cumulant Hessian-based techniques used in ICA [2, 11] and the tensor-based technique for quasi-orthogonalization described in Theorem 2.3 that allows the tensor-method to be rewritten using a series of Hessian operations. We make this connection precise below. The Hessian version requires only O(d3 ) terms to be estimated from data and simplifies the computation to consist of matrix and vector operations. Let Hu denote the Hessian operator with respect to a vector u ∈ Rd . The following lemma connects Hessian methods with our tensor-matrix operation (a special case is discussed in [2, Section 2.1]). Lemma 3.1. Hu (κ4 (uT x)) = ADAT where dqq = 12(uT Aq )2 κ4 (sq ). In Lemma 3.1, the diagonal entries can be rewritten as dqq = 12κ4 (sq )(AT (uuT )Aq ). By comq paring with Lemma 2.2, we see that applying Qx against a symmetric, rank one matrix uuT can be 1 rewritten in terms of the Hessian operations: Qx (uuT ) = 12 Hu (κ4 (uT x)). This formula extends to arbitrary symmetric matrices by the following Lemma. Lemma 3.2. Let M be a symmetric matrix with eigen decomposition U ΛU T such that U = d 1 (u1 , u2 , . . . , ud ) and Λ = diag(λ1 , λ2 , . . . , λd ). Then, Qx (M ) = 12 i=1 λi Hui κ4 (uT x). i The matrices I and M −1 in Theorem 2.3 are symmetric. As such, the tensor-based method for quasi-orthogonalization can be rewritten using Hessian operations. This is done in Algorithm 1. 4 Gradient Iteration ICA In the preceding sections, we discussed techniques to quasi-orthogonalize data. For this section, we will assume that quasi-orthogonalization is accomplished, and discuss deflationary approaches that can quickly recover the directions of the independent components. Let W be a quasiorthogonalization matrix. Then, define y := W x = W As + W η. Note that since η is Gaussian noise, so is W η. There exists a rotation matrix R and a diagonal matrix D such that W A = RD. Let ˜ := Ds. The coordinates of ˜ are still independent random variables. Gaussian noise makes s s recovering the scaling matrix D impossible. We aim to recover the rotation matrix R. 4 Algorithm 1 Hessian-based algorithm to generate a quasi-orthogonalization matrix. 1: function F IND Q UASI O RTHOGONALIZATION M ATRIX(x) d 1 2: Let M = 12 i=1 Hu κ4 (uT x)|u=ei . See Equation (4) for the estimator. T 3: Let U ΛU give the eigendecomposition of M −1 d 4: Let C = i=1 λi Hu κ4 (uT x)|u=Ui . See Equation (4) for the estimator. 5: Factorize C as BB T . 6: return B −1 7: end function To see why recovery of D is impossible, we note that a white Gaussian random variable η 1 has independent components. It is impossible to distinguish between the case where η 1 is part of the signal, i.e. W A(s + η 1 ) + W η, and the case where Aη 1 is part of the additive Gaussian noise, i.e. W As + W (Aη 1 + η), when s, η 1 , and η are drawn independently. In the noise-free ICA setting, the latent signal is typically assumed to have identity covariance, placing the scaling information in the columns of A. The presence of additive Gaussian noise makes recovery of the scaling information impossible since the latent signals become ill-defined. Following the idea popularized in FastICA, we will discuss a deflationary technique to recover the columns of R one at a time. 
Fast Recovery of a Single Independent Component. In the deflationary approach, a function f is fixed that acts upon a directional vector u ∈ Rd . Based on some criterion (typically maximization or minimization of f ), an iterative optimization step is performed until convergence. This technique was popularized in FastICA, which is considered fast for the following reasons: 1. As an approximate Newton method, FastICA requires computation of u f and a quick-tocompute estimate of (Hu (f ))−1 at each iterative step. Due to the estimate, the computation runs in O(N d) time, where N is the number of samples. 2. The iterative step in FastICA has local quadratic order convergence using arbitrary functions, and global cubic-order convergence when using the fourth cumulant [13]. We note that cubic convergence rates are not unique to FastICA and have been seen using gradient descent (with the correct step-size) when choosing f as the fourth moment [18]. Our proposed deflationary algorithm will be comparable with FastICA in terms of computational complexity, and the iterative step will take on a conceptually simpler form as it only relies on u κr . We provide a derivation of fast convergence rates that relies entirely on the properties of cumulants. As cumulants are invariant with respect to the additive Gaussian noise, the proposed methods will be admissible for both standard and noisy ICA. While cumulants are essentially unique with the additivity and homogeneity properties [17] when no restrictions are made on the probability space, the preprocessing step of ICA gives additional structure (like orthogonality and centering), providing additional admissible functions. In particular, [20] designs “robust cumulants” which are only minimally effected by sparse noise. Welling’s robust cumulants have versions of the additivity and homogeneity properties, and are consistent with our update step. For this reason, we will state our results in greater generality. Let G be a function of univariate random variables that satisfies the additivity, degree-r (r ≥ 3) homogeneity, and (for the noisy case) the vanishing Gaussians properties of cumulants. Then for a generic choice of input vector v, Algorithm 2 will demonstrate order r−1 convergence. In particular, if G is κ3 , then we obtain quadratic convergence; and if G is κ4 , we obtain cubic convergence. Lemma 4.1 helps explain why this is true. Lemma 4.1. v G(v · y) = r d i=1 (v · Ri )r−1 G(˜i )Ri . s If we consider what is happening in the basis of the columns of R, then up to some multiplicative constant, each coordinate is raised to the r − 1 power and then renormalized during each step of Algorithm 2. This ultimately leads to the order r − 1 convergence. Theorem 4.2. If for a unit vector input v to Algorithm 2 h = arg maxi |(v · Ri )r−2 G(˜i )| has a s unique answer, then v has order r − 1 convergence to Rh up to sign. In particular, if the following conditions are met: (1) There exists a coordinate random variable si of s such that G(si ) = 0. (2) v inputted into Algorithm 2 is chosen uniformly at random from the unit sphere S d−1 . Then Algorithm 2 converges to a column of R (up to sign) almost surely, and convergence is of order r − 1. 5 Algorithm 2 A fast algorithm to recover a single column of R when v is drawn generically from the unit sphere. Equations (2) and (3) provide k-statistic based estimates of v κ3 and v κ4 , which can be used as practical choices of v G on real data. 
1: function GI-ICA(v, y) 2: repeat 3: v ← v G(vT y) 4: v ← v/ v 2 5: until Convergence return v 6: end function ˜ Algorithm 3 Algorithm for ICA in the presence of Gaussian noise. A recovers A up to column order and scaling. RT W is the demixing matrix for the observed random vector x. function G AUSSIAN ROBUST ICA(G, x) W = F IND Q UASI O RTHOGONALIZATION M ATRIX(x) y = Wx R columns = ∅ for i = 1 to d do Draw v from S d−1 ∩ span(R columns)⊥ uniformly at random. R columns = R columns ∪ {GI-ICA(v, y)} end for Construct a matrix R using the elements of R columns as columns. ˜ = RT y s ˜ A = (RT W )−1 ˜ s return A, ˜ end function By convergence up to sign, we include the possibility that v oscillates between Rh and −Rh on alternating steps. This can occur if G(˜i ) < 0 and r is odd. Due to space limitations, the proof is s omitted. Recovering all Independent Components. As a Corollary to Theorem 4.2 we get: Corollary 4.3. Suppose R1 , R2 , . . . , Rk are known for some k < d. Suppose there exists i > k such that G(si ) = 0. If v is drawn uniformly at random from S d−1 ∩ span(R1 , . . . , Rk )⊥ where S d−1 denotes the unit sphere in Rd , then Algorithm 2 with input v converges to a new column of R almost surely. Since the indexing of R is arbitrary, Corollary 4.3 gives a solution to noisy ICA, in Algorithm 3. In practice (not required by the theory), it may be better to enforce orthogonality between the columns of R, by orthogonalizing v against previously found columns of R at the end of each step in Algorithm 2. We expect the fourth or third cumulant function will typically be chosen for G. 5 Time Complexity Analysis and Estimation of Cumulants To implement Algorithms 1 and 2 requires the estimation of functions from data. We will limit our discussion to estimation of the third and fourth cumulants, as lower order cumulants are more statistically stable to estimate than higher order cumulants. κ3 is useful in Algorithm 2 for nonsymmetric distributions. However, since κ3 (si ) = 0 whenever si is a symmetric distribution, it is plausible that κ3 would not recover all columns of R. When s is suspected of being symmetric, it is prudent to use κ4 for G. Alternatively, one can fall back to κ4 from κ3 when κ3 is detected to be near 0. Denote by z (1) , z (2) , . . . , z (N ) the observed samples of a random variable z. Given a sample, each cumulant can be estimated in an unbiased fashion by its k-statistic. Denote by kr (z (i) ) the kN 1 statistic sample estimate of κr (z). Letting mr (z (i) ) := N i=1 (z (i) − z )r give the rth sample ¯ central moment, then N 2 m3 (z (i) ) (N + 1)m4 (z (i) ) − 3(N − 1)m2 (z (i) )2 k3 (z (i) ) := , k4 (z (i) ) := N 2 (N − 1)(N − 2) (N − 1)(N − 2)(N − 3) 6 gives the third and fourth k-statistics [15]. However, we are interested in estimating the gradients (for Algorithm 2) and Hessians (for Algorithm 1) of the cumulants rather than the cumulants themselves. The following Lemma shows how to obtain unbiased estimates: Lemma 5.1. Let z be a d-dimensional random vector with finite moments up to order r. Let z(i) be α an iid sample of z. Let α ∈ Nd be a multi-index. Then ∂u kr (u · z(i) ) is an unbiased estimate for α ∂u κr (u · z). 
If we mean-subtract (via the sample mean) all observed random variables, then the resulting estimates are: N u k3 (u · y) = (N − 1)−1 (N − 2)−1 3N (u · y(i) )2 y(i) (2) i=1 u k4 (u · y) = N2 (N − 1)(N − 2)(N − 3) −12 Hu k4 (u · x) = N −1 − N2 N −1 N2 N +1 N N N ((u · y(i) ))3 y(i) i=1 N (u · y(i) )2 i=1 12N 2 (N − 1)(N − 2)(N − 3) N 4 (u · y(i) )y(i) (3) i=1 N +1 N N 2N − 2 (u · x(i) )2 (xxT )(i) − N2 i=1 i=1 N ((u · x(i) ))2 (xxT )(i) (4) i=1 N (u · x(i) )x(i) i=1 T N (u · x(i) )x(i) i=1 Using (4) to estimate Hu κ4 (uT x) from data when implementing Algorithm 1, the resulting quasiorthogonalization algorithm runs in O(N d3 ) time. Using (2) or (3) to estimate u G(vT y) (with G chosen to be κ3 or κ4 respectively) when implementing Algorithm 2 gives an update step that runs in O(N d) time. If t bounds the number of iterations to convergence in Algorithm 2, then O(N d2 t) steps are required to recover all columns of R once quasi-orthogonalization has been achieved. 6 Simulation Results In Figure 1, we compare our algorithms to the baselines JADE [7] and versions of FastICA [10], using the code made available by the authors. Except for the choice of the contrast function for FastICA the baselines were run using default settings. All tests were done using artificially generated data. In implementing our algorithms (available at [19]), we opted to enforce orthogonality during the update step of Algorithm 2 with previously found columns of R. In Figure 1, comparison on five distributions indicates that each of the independent coordinates was generated from a distinct distribution among the Laplace distribution, the Bernoulli distribution with parameter 0.5, the tdistribution with 5 degrees of freedom, the exponential distribution, and the continuous uniform distribution. Most of these distributions are symmetric, making GI-κ3 inadmissible. When generating data for the ICA algorithm, we generate a random mixing matrix A with condition number 10 (minimum singular value 1 and maximum singular value 10), and intermediate singular values chosen uniformly at random. The noise magnitude indicates the strength of an additive white Gaussian noise. We define 100% noise magnitude to mean variance 10, with 25% noise and 50% noise indicating variances 2.5 and 5 respectively. Performance was measured using the Amari Index ˆ introduced in [1]. Let B denote the approximate demixing matrix returned by an ICA algorithm, |m | n n ˆ and let M = BA. Then, the Amari index is given by: E := i=1 j=1 maxk ij ik | − 1 + |m n j=1 n i=1 |mij | maxk |mkj | − 1 . The Amari index takes on values between 0 and the dimensionality d. It can be roughly viewed as the distance of M from the nearest scaled permutation matrix P D (where P is a permutation matrix and D is a diagonal matrix). From the noiseles data, we see that quasi-orthogonalization requires more data than whitening in order to provide accurate results. Once sufficient data is provided, all fourth order methods (GI-κ4 , JADE, and κ4 -FastICA) perform comparably. 
6 Simulation Results

In Figure 1, we compare our algorithms to the baselines JADE [7] and versions of FastICA [10], using the code made available by the authors. Except for the choice of the contrast function for FastICA, the baselines were run using default settings. All tests were done using artificially generated data. In implementing our algorithms (available at [19]), we opted to enforce orthogonality with previously found columns of R during the update step of Algorithm 2. For the comparison on five distributions in Figure 1, each of the independent coordinates was generated from a distinct distribution among the Laplace distribution, the Bernoulli distribution with parameter 0.5, the t-distribution with 5 degrees of freedom, the exponential distribution, and the continuous uniform distribution. Most of these distributions are symmetric, making GI-κ3 inadmissible. When generating data for the ICA algorithm, we generate a random mixing matrix A with condition number 10 (minimum singular value 1 and maximum singular value 10), with the intermediate singular values chosen uniformly at random. The noise magnitude indicates the strength of an additive white Gaussian noise: we define 100% noise magnitude to mean variance 10, with 25% and 50% noise indicating variances 2.5 and 5, respectively. Performance was measured using the Amari index introduced in [1]. Let B̂ denote the approximate demixing matrix returned by an ICA algorithm, and let M = B̂A. Then the Amari index is given by

E := \frac{1}{2(n-1)} \left[ \sum_{i=1}^{n} \Bigl( \sum_{j=1}^{n} \frac{|m_{ij}|}{\max_k |m_{ik}|} - 1 \Bigr) + \sum_{j=1}^{n} \Bigl( \sum_{i=1}^{n} \frac{|m_{ij}|}{\max_k |m_{kj}|} - 1 \Bigr) \right].

The Amari index takes on values between 0 and the dimensionality d. It can be roughly viewed as the distance of M from the nearest scaled permutation matrix P D (where P is a permutation matrix and D is a diagonal matrix). From the noiseless data, we see that quasi-orthogonalization requires more data than whitening in order to provide accurate results. Once sufficient data is provided, all fourth-order methods (GI-κ4, JADE, and κ4-FastICA) perform comparably. The difference between GI-κ4 and κ4-FastICA is not statistically significant over 50 runs with 100 000 samples.

Figure 1: Comparison of ICA algorithms under various levels of noise. Each panel plots the mean Amari index against the number of samples for GI-κ4 (white), GI-κ4 (quasi-orthogonal), κ4-FastICA, log cosh-FastICA, and JADE, for d = 5 and d = 10 and for noiseless data, 25% noise magnitude, and 50% noise magnitude. White and quasi-orthogonal refer to the choice of the first step of ICA. All baseline algorithms use whitening. Reported Amari indices denote the mean Amari index over 50 runs on different draws of both A and the data. d gives the data dimensionality, with two copies of each distribution used when d = 10.

We note that GI-κ4 under whitening and κ4-FastICA have the same update step (up to a slightly different choice of estimators), with GI-κ4 differing to allow for quasi-orthogonalization. Where provided, the error bars give a 2σ confidence interval on the mean Amari index. In all cases, error bars for our algorithms are provided, and error bars for the baseline algorithms are provided when they do not hinder readability. It is clear that all algorithms degrade with the addition of Gaussian noise. However, GI-κ4 under quasi-orthogonalization degrades far less when given sufficient samples. For this reason, the quasi-orthogonalized GI-κ4 outperforms all other algorithms (given sufficient samples), including log cosh-FastICA, which performs best in the noiseless case. Contrasting the performance of GI-κ4 under whitening with itself under quasi-orthogonalization, it is clear that quasi-orthogonalization is necessary for robustness to Gaussian noise. Run times were reasonably fast: for 100 000 samples on the varied distributions (d = 5) with 50% Gaussian noise magnitude, GI-κ4 (including the orthogonalization step) had an average running time of 0.19 seconds using PCA whitening and 0.23 seconds under quasi-orthogonalization (timings on a standard desktop with an i7-2600 3.4 GHz CPU and 16 GB RAM). The corresponding average numbers of iterations to convergence per independent component (at 0.0001 error) were 4.16 and 4.08.
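For completeness, the Amari index used to score all of these comparisons is straightforward to compute; the following NumPy sketch is our own, with the 1/(2(n−1)) normalization chosen to keep the index in the range [0, d] described above.

```python
import numpy as np

def amari_index(B_hat, A):
    """Amari index of M = B_hat @ A; it vanishes when M is a scaled permutation matrix."""
    M = np.abs(B_hat @ A)
    n = M.shape[0]
    row_term = np.sum(M / M.max(axis=1, keepdims=True)) - n  # sum_i (sum_j |m_ij|/max_k |m_ik| - 1)
    col_term = np.sum(M / M.max(axis=0, keepdims=True)) - n  # sum_j (sum_i |m_ij|/max_k |m_kj| - 1)
    return (row_term + col_term) / (2 * (n - 1))
```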
In the following table, we report the mean number of steps to convergence (per independent component) over the 50 runs for the 50% noise distribution (d = 5); once sufficiently many samples were taken, the number of steps to convergence becomes remarkably small.

Number of data pts                     500      1000     5000     10000    50000    100000
whitening + GI-κ4: mean num steps      11.76    5.92     4.99     4.59     4.35     4.16
quasi-orth. + GI-κ4: mean num steps    213.92   65.95    4.48     4.36     4.06     4.08

Acknowledgments

This work was supported by NSF grant IIS 1117707.

References

[1] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems, pages 757–763, 1996.
[2] S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2384–2392, 2012.
[3] M. Belkin, L. Rademacher, and J. Voss. Blind signal separation in the presence of Gaussian noise. In JMLR W&CP, volume 30: COLT, pages 270–287, 2013.
[4] C. M. Bishop. Variational principal components. In Proc. Ninth Int. Conf. on Artificial Neural Networks (ICANN), 1:509–514, 1999.
[5] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? CoRR, abs/0912.3599, 2009.
[6] J. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In Radar and Signal Processing, IEE Proceedings F, volume 140, pages 362–370. IET, 1993.
[7] J.-F. Cardoso and A. Souloumiac. Matlab JADE for real-valued data v 1.8. http://perso.telecom-paristech.fr/~cardoso/Algo/Jade/jadeR.m, 2005. [Online; accessed 8-May-2013].
[8] P. Comon and C. Jutten, editors. Handbook of Blind Source Separation. Academic Press, 2010.
[9] X. Ding, L. He, and L. Carin. Bayesian robust principal component analysis. IEEE Transactions on Image Processing, 20(12):3419–3430, 2011.
[10] H. Gävert, J. Hurri, J. Särelä, and A. Hyvärinen. Matlab FastICA v 2.5. http://research.ics.aalto.fi/ica/fastica/code/dlcode.shtml, 2005. [Online; accessed 1-May-2013].
[11] D. Hsu and S. M. Kakade. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In ITCS, pages 11–20, 2013.
[12] A. Hyvärinen. Independent component analysis in the presence of Gaussian noise by maximizing joint likelihood. Neurocomputing, 22(1-3):49–67, 1998.
[13] A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
[14] A. Hyvärinen and E. Oja. Independent component analysis: Algorithms and applications. Neural Networks, 13(4-5):411–430, 2000.
[15] J. F. Kenney and E. S. Keeping. Mathematics of Statistics, Part 2. Van Nostrand, 1962.
[16] H. Li and T. Adali. A class of complex ICA algorithms based on the kurtosis cost function. IEEE Transactions on Neural Networks, 19(3):408–420, 2008.
[17] L. Mattner. What are cumulants? Documenta Mathematica, 4:601–622, 1999.
[18] P. Q. Nguyen and O. Regev. Learning a parallelepiped: Cryptanalysis of GGH and NTRU signatures. J. Cryptology, 22(2):139–160, 2009.
[19] J. Voss, L. Rademacher, and M. Belkin. Matlab GI-ICA implementation. http://sourceforge.net/projects/giica/, 2013. [Online].
[20] M. Welling. Robust higher order statistics. In Tenth International Workshop on Artificial Intelligence and Statistics, pages 405–412, 2005.
[21] A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.
[22] V. Zarzoso and P. Comon. How fast is FastICA? In EUSIPCO, 2006.
6 0.64995629 350 nips-2013-Wavelets on Graphs via Deep Learning
7 0.64151907 202 nips-2013-Multiclass Total Variation Clustering
8 0.63497114 157 nips-2013-Learning Multi-level Sparse Representations
9 0.63364267 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
10 0.62361795 352 nips-2013-What do row and column marginals reveal about your dataset?
11 0.61411369 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression
12 0.59893787 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport
13 0.58282053 321 nips-2013-Supervised Sparse Analysis and Synthesis Operators
14 0.58087617 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
15 0.58019066 256 nips-2013-Probabilistic Principal Geodesic Analysis
16 0.57133561 328 nips-2013-The Total Variation on Hypergraphs - Learning on Hypergraphs Revisited
17 0.56794703 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing
18 0.56577373 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
19 0.56541878 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices
20 0.56332964 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning
topicId topicWeight
[(2, 0.01), (16, 0.057), (33, 0.127), (34, 0.108), (41, 0.021), (49, 0.02), (56, 0.126), (70, 0.068), (85, 0.045), (89, 0.043), (93, 0.038), (95, 0.032), (97, 0.225)]
simIndex simValue paperId paperTitle
1 0.81394643 120 nips-2013-Faster Ridge Regression via the Subsampled Randomized Hadamard Transform
Author: Yichao Lu, Paramveer Dhillon, Dean P. Foster, Lyle Ungar
Abstract: We propose a fast algorithm for ridge regression when the number of features is much larger than the number of observations (p ≫ n). The standard way to solve ridge regression in this setting works in the dual space and gives a running time of O(n^2 p). Our algorithm, Subsampled Randomized Hadamard Transform-Dual Ridge Regression (SRHT-DRR), runs in time O(np log(n)) and works by preconditioning the design matrix by a Randomized Walsh-Hadamard Transform with a subsequent subsampling of features. We provide risk bounds for our SRHT-DRR algorithm in the fixed design setting and show experimental results on synthetic and real datasets.
same-paper 2 0.80452591 186 nips-2013-Matrix factorization with binary components
Author: Martin Slawski, Matthias Hein, Pavlo Lutsik
Abstract: Motivated by an application in computational biology, we consider low-rank matrix factorization with {0, 1}-constraints on one of the factors and optionally convex constraints on the second one. In addition to the non-convexity shared with other matrix factorization schemes, our problem is further complicated by a combinatorial constraint set of size 2m·r , where m is the dimension of the data points and r the rank of the factorization. Despite apparent intractability, we provide − in the line of recent work on non-negative matrix factorization by Arora et al. (2012)− an algorithm that provably recovers the underlying factorization in the exact case with O(mr2r + mnr + r2 n) operations for n datapoints. To obtain this result, we use theory around the Littlewood-Offord lemma from combinatorics.
3 0.79726112 252 nips-2013-Predictive PAC Learning and Process Decompositions
Author: Cosma Shalizi, Aryeh Kontorovitch
Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest. 1
4 0.70292974 249 nips-2013-Polar Operators for Structured Sparse Estimation
Author: Xinhua Zhang, Yao-Liang Yu, Dale Schuurmans
Abstract: Structured sparse estimation has become an important technique in many areas of data analysis. Unfortunately, these estimators normally create computational difficulties that entail sophisticated algorithms. Our first contribution is to uncover a rich class of structured sparse regularizers whose polar operator can be evaluated efficiently. With such an operator, a simple conditional gradient method can then be developed that, when combined with smoothing and local optimization, significantly reduces training time vs. the state of the art. We also demonstrate a new reduction of polar to proximal maps that enables more efficient latent fused lasso. 1
5 0.70219964 56 nips-2013-Better Approximation and Faster Algorithm Using the Proximal Average
Author: Yao-Liang Yu
Abstract: It is a common practice to approximate “complicated” functions with more friendly ones. In large-scale machine learning applications, nonsmooth losses/regularizers that entail great computational challenges are usually approximated by smooth functions. We re-examine this powerful methodology and point out a nonsmooth approximation which simply pretends the linearity of the proximal map. The new approximation is justified using a recent convex analysis tool— proximal average, and yields a novel proximal gradient algorithm that is strictly better than the one based on smoothing, without incurring any extra overhead. Numerical experiments conducted on two important applications, overlapping group lasso and graph-guided fused lasso, corroborate the theoretical claims. 1
6 0.69752783 355 nips-2013-Which Space Partitioning Tree to Use for Search?
7 0.69701153 15 nips-2013-A memory frontier for complex synapses
8 0.69607633 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
9 0.69537187 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data
10 0.69416595 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
11 0.69363415 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks
12 0.69353342 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
13 0.69167805 233 nips-2013-Online Robust PCA via Stochastic Optimization
14 0.69163609 350 nips-2013-Wavelets on Graphs via Deep Learning
15 0.69153923 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
16 0.69085193 184 nips-2013-Marginals-to-Models Reducibility
17 0.69016796 14 nips-2013-A Stability-based Validation Procedure for Differentially Private Machine Learning
18 0.68963462 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
19 0.68943137 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
20 0.68940639 318 nips-2013-Structured Learning via Logistic Regression