
281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders


Source: pdf

Author: Sanjeev Arora, Rong Ge, Ankur Moitra, Sushant Sachdeva

Abstract: We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + η where A is an unknown n × n matrix, x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable, and η is an n-dimensional Gaussian random variable with unknown covariance Σ. We give an algorithm that provably recovers A and Σ up to an additive ε and whose running time and sample complexity are polynomial in n and 1/ε. To accomplish this, we introduce a novel “quasi-whitening” step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one); this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and it allows us to control the accumulation of error when we find the columns of A one by one via local search. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders Sanjeev Arora∗ Rong Ge∗ Ankur Moitra † Sushant Sachdeva∗ Abstract We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. [sent-1, score-0.114]

2 To accomplish this, we introduce a novel “quasi-whitening” step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. [sent-3, score-0.129]

3 Suppose η is an independent n-dimensional Gaussian random variable with an unknown covariance matrix Σ and A is an unknown n × n matrix. [sent-6, score-0.248]

4 We are given samples of the form y = Ax + η where x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable. [sent-7, score-0.288]
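
As a concrete illustration of this generative model (not code from the paper), the sketch below draws synthetic samples y = Ax + η with x uniform on {−1, +1}^n, whose fourth moment (1) is strictly below the Gaussian value (3), and with η Gaussian with an arbitrary unknown covariance. All names and sizes here are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 8, 50_000                               # dimension and sample count (illustrative)

A = rng.normal(size=(n, n))                    # unknown mixing matrix
L = rng.normal(size=(n, n))
Sigma = L @ L.T / n                            # unknown (arbitrary PSD) noise covariance

X = rng.choice([-1.0, 1.0], size=(N, n))       # independent ±1 coordinates: E[x_i^4] = 1 < 3
eta = rng.multivariate_normal(np.zeros(n), Sigma, size=N)
Y = X @ A.T + eta                              # observed samples, one y = Ax + eta per row
```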

5 Our goal is to reconstruct an additive approximation to the matrix A and the covariance matrix Σ, running in time and using a number of samples that is polynomial in n and 1/ε, where ε is the target precision (see Theorem 1. [sent-9, score-0.394]

6 We describe these connections next, and known results (focusing on algorithms with provable performance guarantees, since that is our goal). [sent-11, score-0.114]

7 , with no noise η added) then applying linear transformation A−1 to the data results in samples A−1 y whose coordinates are independent. [sent-18, score-0.132]

8 1 additive approximation to A efficiently and using a polynomial number of samples. [sent-28, score-0.14]

9 , Hsu and Kakade [6, 7], that do not use local search and avoid this issue. [sent-31, score-0.238]

10 ) To the best of our knowledge, there are currently no known algorithms with provable guarantees for the more general case of ICA with Gaussian noise (this is especially true if the covariance matrix is unknown, as in our problem), although many empirical approaches are known. [sent-32, score-0.291]

11 Our data is generated as a mixture of 2^n identical Gaussian components (with an unknown covariance matrix) whose centers are the points {Ax : x ∈ {−1, 1}^n}, and all mixing weights are equal. [sent-37, score-0.163]

12 An influential paper of Dasgupta [10] initiated the program of designing algorithms with provable guarantees, which was improved in a sequence of papers [11, 12, 13, 14]. [sent-40, score-0.146]

13 Thus, new ideas seem necessary to achieve polynomial running time. [sent-42, score-0.123]

14 cannot be compactly described by piecewise linear functions or other “simple” local representations). [sent-48, score-0.184]

15 However, to the best of our knowledge, there has been no successful attempt to give a rigorous analysis of Deep Learning. [sent-51, score-0.123]

16 1 Our results and techniques We give a provable algorithm for ICA with unknown Gaussian noise. [sent-62, score-0.217]

17 There is an algorithm that recovers the unknown A and Σ up to additive error ε in each entry, in time that is polynomial in n, ‖A‖2, ‖Σ‖2, 1/ε, and 1/λmin(A), where ‖·‖2 denotes the operator norm and λmin(·) denotes the smallest eigenvalue. [sent-66, score-0.25]

18 The first step is whitening, which applies a suitable linear transformation that makes the variance the same in all directions, thus reducing to the case where A is a rotation matrix. [sent-68, score-0.156]

19 Given samples y = Rx where R is a rotation matrix, the rows of R can be found in principle by computing the vectors u that are local minima of E[(u · y)^4]. [sent-69, score-0.322]

20 A popular approach is to use the fourth order cumulant (as an alternative to the fourth order moment) as a method for “denoising,” or any one of a number of other functionals whose local optima reveal interesting directions. [sent-73, score-0.754]
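
For a zero-mean random variable X, the fourth cumulant is κ4(X) = E[X^4] − 3 E[X^2]^2, and it is identically zero for Gaussians, which is what makes it useful for denoising here. A minimal empirical estimator of the directional functional u ↦ κ4(u · y) might look as follows; this is only a sketch of the underlying idea, and the paper's actual functional P(u) carries its own sign and normalization conventions.

```python
import numpy as np

def kappa4(u, Y):
    """Empirical 4th cumulant of the projection u . y; Y holds one sample per row."""
    s = (Y - Y.mean(axis=0)) @ u               # center, then project onto direction u
    return np.mean(s**4) - 3.0 * np.mean(s**2) ** 2

# By independence and 4th-order homogeneity, kappa4(u . y) = sum_i (u . A_i)^4 * kappa4(x_i),
# so the additive Gaussian noise eta contributes nothing to this quantity.
```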

21 , provably polynomial running time and number of samples), except for one subtlety: it is unclear how to use local search to find all optima in polynomial time. [sent-77, score-0.573]

22 In practice, one finds a single local optimum, projects to the subspace orthogonal to it and continues recursively on a lower-dimensional problem. [sent-78, score-0.329]

23 However, a naive implementation of this idea is unstable since approximation errors can accumulate badly, and to the best of our knowledge no rigorous analysis has been given prior to our work. [sent-79, score-0.144]

24 ) One of our contributions is a modified local search that avoids this potential instability and finds all local optima in this setting. [sent-81, score-0.554]

25 Our major new contribution, however, is dealing with noise that is an unknown Gaussian. [sent-84, score-0.114]

26 We design new tools for denoising and especially whitening in this setting. [sent-88, score-0.165]

27 Denoising uses the fourth order cumulant instead of the fourth moment used in [5] and whitening involves a novel use of the Hessian of the cumulant. [sent-89, score-0.58]

28 ) Nevertheless, we can reduce to an optimization problem that can be solved via local search, and which remains amenable to a rigorous analysis. [sent-91, score-0.277]

29 The results of the local optimization step can be then used to simplify the complicated functional form and recover A as well as the noise Σ. [sent-92, score-0.351]

30 2) will take the same form but with weights depending on the exact value of the fourth moment for each coordinate. [sent-96, score-0.232]

31 Since we already carry through an unknown diagonal matrix D throughout our analysis, this generalization only changes the entries on the diagonal and the same algorithm and proof apply. [sent-97, score-0.216]

32 2 Denoising and quasi-whitening As mentioned, our approach is based on the fourth order cumulant. [sent-98, score-0.152]

33 Let κr (X) be the rth cumulant of a random variable X. [sent-100, score-0.134]

34 The intuition is that P(u) = −κ4(u^T Ax), since the fourth cumulant does not depend on the additive Gaussian noise, and then the lemma follows from completing the square. [sent-115, score-0.464]

35 1 Quasi-whitening via the Hessian of P(u). In prior works on ICA, whitening refers to reducing to the case where y = Rx for some rotation matrix R. [sent-117, score-0.197]

36 Here we give a technique to reduce to the case where y = RDx + η where η is some other Gaussian noise (still unknown), R is a rotation matrix and D is a diagonal matrix that depends upon A. [sent-118, score-0.304]

37 Quasi-whitening suffices for us since local search using the objective function κ4 (uT y) will give us (approximations to) the rows of RD, from which we will be able to recover A. [sent-120, score-0.318]

38 Then there is a rotation matrix R∗ such that ‖B−1 A DA(u0)^{1/2} − R∗‖F ≤ √n · ε. [sent-155, score-0.135]
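
Numerically, the quasi-whitening step can be mimicked by forming the Hessian of P(u) = −κ4(u^T y) at a fixed direction u0, factoring it as BB^T, and applying B^{−1} to the data. The sketch below uses a finite-difference Hessian of the empirical cumulant purely for illustration; the paper works with exact expressions and finite-sample bounds, and the helper names, step sizes, and the assumption that the estimated Hessian is positive definite are all ours, not the authors'.

```python
import numpy as np

def neg_kappa4(u, Y):
    """Empirical P(u) = -kappa4(u . y) for samples Y (one per row)."""
    s = (Y - Y.mean(axis=0)) @ u
    return -(np.mean(s**4) - 3.0 * np.mean(s**2) ** 2)

def hessian_fd(f, u0, eps=1e-3):
    """Central finite-difference Hessian of a scalar function at u0 (illustrative only)."""
    n = len(u0)
    H, I = np.zeros((n, n)), np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(u0 + eps*I[i] + eps*I[j]) - f(u0 + eps*I[i] - eps*I[j])
                       - f(u0 - eps*I[i] + eps*I[j]) + f(u0 - eps*I[i] - eps*I[j])) / (4 * eps**2)
    return H

def quasi_whiten(Y, rng):
    """Return quasi-whitened samples Z = B^{-1} Y along with B and the probe direction u0."""
    u0 = rng.normal(size=Y.shape[1])
    u0 /= np.linalg.norm(u0)
    M = hessian_fd(lambda u: neg_kappa4(u, Y), u0)
    M = (M + M.T) / 2                          # symmetrize against numerical noise
    B = np.linalg.cholesky(M)                  # M = B B^T; assumes M is positive definite
    Z = np.linalg.solve(B, Y.T).T              # z = B^{-1} y for every sample
    return Z, B, u0
```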

39 It uses as a blackbox the denoising and quasiwhitening already described above, as well as a routine for computing all local maxima of some “well-behaved” functions which is described later in Section 4. [sent-159, score-0.684]

40 Step 3: Use the procedure ALLOPT(P(u), β, δ, β′, δ′) of Section 4 to compute all n local maxima of the function P(u). [sent-173, score-0.581]

41 Step 4: Let R be the matrix whose rows are the n local optima recovered in the previous step. [sent-174, score-0.369]

42 Namely, we consider the sequence of samples z = B−1 y, which are therefore of the form RDx + η′ where η′ = B−1 η, D = DA(u0), and R is close to a rotation matrix R∗ (by Lemma 2. [sent-177, score-0.235]

43 Then Step 3 tries to find local optima via local search. [sent-183, score-0.5]

44 Ideally we would have liked access to the functional P∗(u) = (u^T R∗x)^4 since the procedure for local optima works only for true rotations. [sent-184, score-0.353]

45 But since R and R∗ are close we can make it work approximately with P (u), and then in Step 4 use these local optima to finally recover A. [sent-185, score-0.41]

46 Suppose we are given samples of the form y = Ax + η where x is uniform on {+1, −1}^n, A is an n × n matrix, and η is an n-dimensional Gaussian random variable independent of x with unknown covariance matrix Σ. [sent-188, score-0.231]

47 There is an algorithm that with high probability recovers an estimate  with ‖ − AΠdiag(ki)‖F ≤ ε, where Π is some permutation matrix and each ki ∈ {+1, −1}, and also recovers Σ̂ with ‖Σ̂ − Σ‖F ≤ ε. [sent-189, score-0.24]

48 Furthermore, the running time and number of samples needed are poly(n, 1/ε, ‖A‖2, ‖Σ‖2, 1/λmin(A)). Note that here we recover A up to a permutation of the columns and sign-flips. [sent-190, score-0.185]
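
Because A is only identifiable up to column permutation and sign flips, comparing a recovered estimate against the ground truth requires aligning columns first. The utility below is not part of the paper's algorithm, just a hypothetical evaluation aid: it picks the best signed column matching with the Hungarian method and reports the resulting Frobenius error.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def aligned_error(A_hat, A):
    """Frobenius error between A_hat and A after the best column permutation and sign flips."""
    n = A.shape[1]
    cost = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Squared error if column j of A_hat is matched to column i of A, with the better sign.
            cost[i, j] = min(np.linalg.norm(A[:, i] - A_hat[:, j]) ** 2,
                             np.linalg.norm(A[:, i] + A_hat[:, j]) ** 2)
    rows, cols = linear_sum_assignment(cost)   # minimum-cost perfect matching
    return np.sqrt(cost[rows, cols].sum())
```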

49 4 Framework for iteratively finding all local maxima In this section, we first describe a fairly standard procedure (based upon Newton’s method) for finding a single local maximum of a function f ∗ : Rn → R among all unit vectors and an analysis of its rate of convergence. [sent-193, score-0.765]

50 Such a procedure is a common tool in statistical algorithms, but here we state it rather carefully since we later give a general method to convert any local search algorithm (that meets certain criteria) into one that finds all local maxima (see Section 4. [sent-194, score-0.849]

51 Given that we can only ever hope for an additive approximation to a local maximum, one should be concerned about how the error accumulates when our goal is to find all local maxima. [sent-196, score-0.428]

52 In fact, a naive strategy is to project onto the subspace orthogonal to the directions found so far, and continue in this subspace. [sent-197, score-0.175]

53 However, such an approach seems to accumulate errors badly (the additive error of the last local maximum found is exponentially larger than the error of the first). [sent-198, score-0.727]

54 Our strategy is to first find a local maximum in the orthogonal subspace, then run the local optimization algorithm again (in the original n-dimensional space) to “refine” the local maximum we have found. [sent-207, score-0.329]

55 The intuition is that since we are already close to a particular local maximum, the local search algorithm cannot jump to some other local maximum (since this would entail going through a valley). [sent-208, score-1.077]

56 1 Finding one local maximum Throughout this section, we will assume that we are given oracle access to a function f (u) and its gradient and Hessian. [sent-210, score-0.184]

57 The basic step of the algorithm is a modification of Newton’s method to find a local improvement that makes progress so long as the current point u is far from a local maximum. [sent-216, score-0.407]

58 In order to have control over how the norm of u changes, during the local optimization step the algorithm projects the gradient ∇f and the Hessian H(f) onto the space perpendicular to u. [sent-218, score-0.223]
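
A crude stand-in for this single-maximum routine is projected gradient ascent on the unit sphere: estimate the gradient, drop its component along u so the norm stays controlled, step, and renormalize. This is not the paper's projected-Newton procedure and inherits none of its guarantees; the numerical gradient, step size, and stopping tolerance below are arbitrary choices for illustration.

```python
import numpy as np

def grad_fd(f, u, eps=1e-4):
    """Central-difference gradient of a scalar function (illustrative)."""
    g = np.zeros_like(u)
    for i in range(len(u)):
        e = np.zeros_like(u)
        e[i] = eps
        g[i] = (f(u + e) - f(u - e)) / (2 * eps)
    return g

def local_opt(f, u_init, step=0.1, iters=500):
    """Approximate local maximum of f over unit vectors, started from u_init."""
    u = u_init / np.linalg.norm(u_init)
    for _ in range(iters):
        g = grad_fd(f, u)
        g -= (g @ u) * u                       # keep only the component perpendicular to u
        if np.linalg.norm(g) < 1e-6:
            break
        u = u + step * g
        u /= np.linalg.norm(u)                 # project back onto the unit sphere
    return u
```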

59 A function f(u) : R^n → R is (γ, β, δ)-Locally Improvable if, for any u that is at least γ far from any local maximum, there is a u′ such that ‖u′ − u‖2 ≤ β and f(u′) ≥ f(u) + δ. [sent-225, score-0.184]

60 The analysis of the running time of the procedure comes from local Taylor expansion. [sent-230, score-0.227]

61 The following theorem from [5] showed that the two properties above are enough to guarantee the success of a local search algorithm even when the function is only approximated. [sent-232, score-0.275]

62 If |f(u) − f∗(u)| ≤ δ/8, the function f∗(u) is (γ, β, δ)-Locally Improvable, and f(u) is (β, δ)-Locally Approximable, then Algorithm 1 will find a vector v that is γ-close to some local maximum. [sent-235, score-0.228]

63 2 Finding all local maxima Now we consider how to find all local maxima of a given function f ∗ (u). [sent-238, score-1.162]

64 The crucial condition that we need is that all local maxima are orthogonal (which is indeed true in our problem, and is morally true when using local search more generally in ICA). [sent-239, score-0.89]

65 Note that this condition implies that there are at most n local maxima. [sent-240, score-0.184]

66 1 In fact we will assume that there are exactly n local maxima. [sent-241, score-0.184]

67 If we are given an exact oracle for f∗ and can compute exact local maxima, then we can find all local maxima v1, . . . , vn easily (Algorithm 2). [sent-242, score-1.162]

68 Algorithm 2: let v1 = LOCALOPT(f, e1, β, δ); FOR i = 2 TO n DO: let gi be the projection of f to the orthogonal subspace of v1, v2, . . . [sent-254, score-0.219]

69 That is: find one local maximum, project the function into the orthogonal subspace, and continue to find more local maxima. [sent-263, score-0.506]

70 The projection of a function f to a linear subspace S is a function on that subspace with value equal to f . [sent-266, score-0.222]

71 If {v1, . . . , vd} is an orthonormal basis of S, the projection of f to S is a function g : R^d → R such that g(w) = f(∑_{i=1}^{d} wi vi). [sent-270, score-0.145]
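
Putting these pieces together, the recursion of Algorithm 2 can be sketched as: maximize the projected function in the subspace orthogonal to the directions already found, lift the result back to R^n, and refine it with one more run of the full-space search so that errors do not compound. The sketch reuses the hypothetical local_opt routine from above (passed in as an argument) and deliberately ignores the paper's (γ, β, δ) bookkeeping.

```python
import numpy as np

def all_local_maxima(f, n, local_opt, rng):
    """Find n approximately orthogonal local maxima of f over unit vectors in R^n."""
    V = []                                          # directions found so far
    for _ in range(n):
        if V:
            # Orthonormal basis W of the subspace orthogonal to the directions in V.
            Q, _ = np.linalg.qr(np.column_stack(V), mode='complete')
            W = Q[:, len(V):]
            g = lambda w: f(W @ w)                  # projection of f onto that subspace
            v = W @ local_opt(g, rng.normal(size=W.shape[1]))   # optimize there, lift back
        else:
            v = local_opt(f, rng.normal(size=n))
        v = local_opt(f, v)                         # refine in the full space to limit error buildup
        V.append(v / np.linalg.norm(v))
    return np.column_stack(V)
```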

72 The following theorem gives sufficient conditions under which the above algorithm finds all local maxima, making precise the intuition given at the beginning of this section. [sent-271, score-0.251]

73 Orthogonal Local Maxima: The function has n local maxima vi , and they are orthogonal to each other. [sent-275, score-0.723]

74 Improvable Projection: The projection of the function to any subspace spanned by a subset of local maxima is (γ′, β′, δ′)-Locally Improvable. [sent-279, score-0.729]

75 Attraction Radius: Let Rad ≥ 3√n·γ + γ′. For any local maximum vi∗, let T be the minimum of f∗(u) over ‖u − vi∗‖2 ≤ Rad; then there exists a set U containing {u : ‖u − vi∗‖2 ≤ 3√n·γ + γ′} and not containing any other local maxima, such that for every u that is not in U but is β′-close to U, f∗(u) < T. [sent-284, score-0.662]

76 If we are given a function f such that |f(u) − f∗(u)| ≤ δ/8 and f is both (β, δ)- and (β′, δ′)-Locally Approximable, then Algorithm 2 can find all local maxima of f∗ within distance γ. [sent-285, score-0.581]

77 To prove this theorem, we first notice that the projection of the function f in Step 3 of the algorithm should be close to the projection of f∗ onto the span of the remaining local maxima. [sent-286, score-0.376]

78 First we prove a “coupling” between the orthogonal complements of two close subspaces: Lemma 4. [sent-288, score-0.115]

79 , vk , each γ-close respectively to local maxima v1 , v2 , . [sent-293, score-0.66]

80 , vk (this is without loss of generality because we can permute the index of local maxima), then there is an orthonormal basis vk+1 , vk+2 , . [sent-296, score-0.263]

81 , vk} such that for any unit vector w ∈ R^{n−k}, ∑_{i=1}^{n−k} wi v_{k+i} is 3√n·γ-close to ∑_{i=1}^{n−k} wi v∗_{k+i}. [sent-302, score-0.123]

82 Using this lemma we see that the projected function is close to the projection of f ∗ in the span of the rest of local maxima: Lemma 4. [sent-304, score-0.39]

83 Let g∗ be the projection of f∗ into the space spanned by the rest of the local maxima; then |g∗(w) − g(w)| ≤ δ/8 + δ′/20 ≤ δ′/8. [sent-306, score-0.258]

84 5 Local search on the fourth order cumulant Next, we prove that the fourth order cumulant P ∗ (u) satisfies the properties above. [sent-307, score-0.626]

85 Then the algorithm given in the previous section will find all of the local maxima, which is the missing step in our algorithm. (Technically, there are 2n local maxima, since for each direction u that is a local maximum, so too is −u, but this is an unimportant detail for our purposes.) [sent-308, score-0.988]

86 Estimate C = E[yy^T] by taking O((‖A‖2 + ‖Σ‖2)^4 n^2 ε^{−2}) samples and let Ĉ = (1/N) ∑_{i=1}^{N} yi yi^T. [sent-315, score-0.132]

87 We first use a theorem from [5] to show that the properties needed for finding one local maximum are satisfied. [sent-319, score-0.618]

88 Also, for notational convenience we set di = 2 DA(u0)^{−2}_{i,i} and let dmin and dmax denote the minimum and maximum values (bounds on these and their ratio follow from Claim 2. [sent-320, score-0.206]

89 When β < dmin/(10 dmax n^2), the function P∗(u) is (3√n·β, β, P∗(u)β^2/100)-Locally Improvable and (β, dmin β^2/100n)-Locally Approximable. [sent-325, score-0.33]

90 Moreover, the local maxima of the function are exactly {±Ri∗}. [sent-326, score-0.119]

91 , yN, suppose the columns of R = B−1 A DA(u0)^{1/2} are close to the corresponding columns of R∗; then with high probability the function P(u) is O(dmax n^{1/2} + n^2 (N/Z log n)^{−1/2})-close to the true function P∗(u). [sent-340, score-0.119]

92 All local maxima of P∗ have attraction radius Rad ≥ dmin/100dmax. [sent-345, score-0.78]

93 …, then the procedure RECOVER(f, β, dmin β^2/100n, β′, dmin β^2/100n) finds vectors v1, v2, . . . [sent-350, score-0.33]

94 , vn, so that there is a permutation matrix Π and ki ∈ {±1} such that for all i: ‖vi − (R∗ΠDiag(ki))i‖2 ≤ γ. [sent-353, score-0.304]

95 Given a matrix R such that there is a permutation matrix Π and ki ∈ {±1} with ‖Ri − ki (R∗Π)i‖2 ≤ γ for all i, Algorithm 3 returns a matrix  such that ‖ − AΠDiag(ki)‖F ≤ O(γ ‖A‖2^2 n^{3/2} / λmin(A)^2). [sent-359, score-0.349]

96 Recall that the diagonal matrix DA(u) is unknown (since it depends on A), but if we are given R∗ (or an approximation), then since P∗(u) = ∑_{i=1}^{n} di (u^T Ri∗)^4, we can recover the matrix DA(u) approximately by computing P∗(Ri∗). [sent-361, score-0.274]

97 An exciting question is: can we give a rigorous analysis of those techniques as well, just as we did for local search on cumulants? [sent-365, score-0.361]

98 A rigorous analysis of deep learning —say, an algorithm that provably learns the parameters of an RBM—is another problem that is wide open, and a plausible special case involves subtle variations on the problem we considered here. [sent-366, score-0.174]

99 Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. [sent-414, score-0.154]

100 Structure from local optima: learning subspace juntas via higher order PCA. [sent-496, score-0.258]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('maxima', 0.397), ('ut', 0.313), ('ada', 0.246), ('ica', 0.195), ('local', 0.184), ('improvable', 0.177), ('hessian', 0.169), ('dmin', 0.165), ('approximable', 0.152), ('fourth', 0.152), ('ax', 0.137), ('cumulant', 0.134), ('optima', 0.132), ('da', 0.127), ('proj', 0.124), ('provable', 0.114), ('denoising', 0.103), ('rigorous', 0.093), ('opt', 0.09), ('ocal', 0.089), ('locally', 0.088), ('lemma', 0.088), ('rotation', 0.082), ('deep', 0.081), ('moment', 0.08), ('polynomial', 0.08), ('vk', 0.079), ('moitra', 0.077), ('claim', 0.077), ('ki', 0.077), ('ecover', 0.076), ('rbm', 0.075), ('subspace', 0.074), ('projection', 0.074), ('mixtures', 0.074), ('unknown', 0.073), ('orthogonal', 0.071), ('vi', 0.071), ('vn', 0.067), ('cumulants', 0.062), ('frieze', 0.062), ('whitening', 0.062), ('additive', 0.06), ('gaussian', 0.058), ('rad', 0.058), ('focs', 0.056), ('samples', 0.056), ('yy', 0.055), ('jerrum', 0.055), ('search', 0.054), ('ak', 0.054), ('matrix', 0.053), ('arora', 0.053), ('yn', 0.052), ('accumulate', 0.051), ('autoencoding', 0.051), ('lathauwer', 0.051), ('sachdeva', 0.051), ('recover', 0.05), ('rx', 0.049), ('covariance', 0.049), ('ri', 0.048), ('diagonal', 0.045), ('comon', 0.045), ('cruz', 0.045), ('crux', 0.045), ('close', 0.044), ('hsu', 0.044), ('running', 0.043), ('hyvarinen', 0.041), ('dmax', 0.041), ('noise', 0.041), ('nds', 0.041), ('mixture', 0.041), ('rn', 0.04), ('step', 0.039), ('kannan', 0.039), ('santa', 0.039), ('yi', 0.038), ('recovers', 0.037), ('min', 0.037), ('theorem', 0.037), ('concisely', 0.037), ('gaussians', 0.037), ('functional', 0.037), ('permutation', 0.036), ('badly', 0.035), ('autoencoder', 0.035), ('transformation', 0.035), ('anandkumar', 0.034), ('attraction', 0.034), ('guarantees', 0.034), ('bb', 0.032), ('initiated', 0.032), ('suppose', 0.031), ('poly', 0.031), ('princeton', 0.031), ('directions', 0.03), ('intuition', 0.03), ('give', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

Author: Sanjeev Arora, Rong Ge, Ankur Moitra, Sushant Sachdeva

Abstract: We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + η where A is an unknown n × n matrix, x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable, and η is an n-dimensional Gaussian random variable with unknown covariance Σ. We give an algorithm that provably recovers A and Σ up to an additive ε and whose running time and sample complexity are polynomial in n and 1/ε. To accomplish this, we introduce a novel “quasi-whitening” step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one); this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and it allows us to control the accumulation of error when we find the columns of A one by one via local search. 1

2 0.23886609 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz

Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1

3 0.11672806 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

Author: Junyuan Xie, Linli Xu, Enhong Chen

Abstract: We present a novel approach to low-level vision problems that combines sparse coding and deep networks pre-trained with denoising auto-encoder (DA). We propose an alternative training scheme that successfully adapts DA, originally designed for unsupervised feature learning, to the tasks of image denoising and blind inpainting. Our method’s performance in the image denoising task is comparable to that of KSVD which is a widely used sparse coding technique. More importantly, in blind image inpainting task, the proposed method provides solutions to some complex problems that have not been tackled before. Specifically, we can automatically remove complex patterns like superimposed text from an image, rather than simple patterns like pixels missing at random. Moreover, the proposed method does not need the information regarding the region that requires inpainting to be given a priori. Experimental results demonstrate the effectiveness of the proposed method in the tasks of image denoising and blind inpainting. We also show that our new training scheme for DA is more effective and can improve the performance of unsupervised feature learning. 1

4 0.10296921 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation

Author: Anima Anandkumar, Yi-kai Liu, Daniel J. Hsu, Dean P. Foster, Sham M. Kakade

Abstract: Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space. 1

5 0.10200167 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

Author: Vasiliy Karasev, Alessandro Chiuso, Stefano Soatto

Abstract: We describe the tradeoff between the performance in a visual recognition problem and the control authority that the agent can exercise on the sensing process. We focus on the problem of “visual search” of an object in an otherwise known and static scene, propose a measure of control authority, and relate it to the expected risk and its proxy (conditional entropy of the posterior density). We show this analytically, as well as empirically by simulation using the simplest known model that captures the phenomenology of image formation, including scaling and occlusions. We show that a “passive” agent given a training set can provide no guarantees on performance beyond what is afforded by the priors, and that an “omnipotent” agent, capable of infinite control authority, can achieve arbitrarily good performance (asymptotically). In between these limiting cases, the tradeoff can be characterized empirically. 1

6 0.095126167 34 nips-2012-Active Learning of Multi-Index Function Models

7 0.091427952 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

8 0.090235122 180 nips-2012-Learning Mixtures of Tree Graphical Models

9 0.087453358 290 nips-2012-Recovery of Sparse Probability Measures via Convex Programming

10 0.086292833 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

11 0.084624685 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

12 0.083496831 134 nips-2012-Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

13 0.081960768 27 nips-2012-A quasi-Newton proximal splitting method

14 0.080513887 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

15 0.077574752 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

16 0.07500869 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

17 0.07379505 4 nips-2012-A Better Way to Pretrain Deep Boltzmann Machines

18 0.072805613 286 nips-2012-Random Utility Theory for Social Choice

19 0.072020628 254 nips-2012-On the Sample Complexity of Robust PCA

20 0.071773127 282 nips-2012-Proximal Newton-type methods for convex optimization


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.215), (1, 0.028), (2, 0.058), (3, -0.04), (4, 0.054), (5, 0.054), (6, -0.004), (7, -0.028), (8, -0.015), (9, -0.035), (10, 0.098), (11, 0.058), (12, -0.073), (13, -0.037), (14, -0.089), (15, 0.059), (16, 0.076), (17, -0.012), (18, 0.053), (19, -0.09), (20, -0.004), (21, -0.004), (22, -0.046), (23, 0.108), (24, -0.018), (25, 0.122), (26, -0.067), (27, -0.066), (28, 0.02), (29, 0.026), (30, -0.008), (31, 0.079), (32, 0.055), (33, -0.033), (34, 0.046), (35, -0.079), (36, -0.041), (37, -0.051), (38, 0.017), (39, 0.151), (40, 0.0), (41, -0.045), (42, -0.039), (43, -0.06), (44, -0.001), (45, -0.077), (46, 0.121), (47, 0.136), (48, -0.074), (49, -0.17)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93213648 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

Author: Sanjeev Arora, Rong Ge, Ankur Moitra, Sushant Sachdeva

Abstract: We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax + η where A is an unknown n × n matrix, x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable, and η is an n-dimensional Gaussian random variable with unknown covariance Σ. We give an algorithm that provably recovers A and Σ up to an additive ε and whose running time and sample complexity are polynomial in n and 1/ε. To accomplish this, we introduce a novel “quasi-whitening” step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one); this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and it allows us to control the accumulation of error when we find the columns of A one by one via local search. 1

2 0.66988587 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

Author: Nicolò Cesa-bianchi, Pierre Gaillard, Gabor Lugosi, Gilles Stoltz

Abstract: Mirror descent with an entropic regularizer is known to achieve shifting regret bounds that are logarithmic in the dimension. This is done using either a carefully designed projection or by a weight sharing technique. Via a novel unified analysis, we show that these two approaches deliver essentially equivalent bounds on a notion of regret generalizing shifting, adaptive, discounted, and other related regrets. Our analysis also captures and extends the generalized weight sharing technique of Bousquet and Warmuth, and can be refined in several ways, including improvements for small losses and adaptive tuning of parameters. 1

3 0.59478241 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

Author: Borja Balle, Mehryar Mohri

Abstract: Many tasks in text and speech processing and computational biology require estimating functions mapping strings to real numbers. A broad class of such functions can be defined by weighted automata. Spectral methods based on the singular value decomposition of a Hankel matrix have been recently proposed for learning a probability distribution represented by a weighted automaton from a training sample drawn according to this same target distribution. In this paper, we show how spectral methods can be extended to the problem of learning a general weighted automaton from a sample generated by an arbitrary distribution. The main obstruction to this approach is that, in general, some entries of the Hankel matrix may be missing. We present a solution to this problem based on solving a constrained matrix completion problem. Combining these two ingredients, matrix completion and spectral method, a whole new family of algorithms for learning general weighted automata is obtained. We present generalization bounds for a particular algorithm in this family. The proofs rely on a joint stability analysis of matrix completion and spectral learning. 1

4 0.56389189 34 nips-2012-Active Learning of Multi-Index Function Models

Author: Tyagi Hemant, Volkan Cevher

Abstract: We consider the problem of actively learning multi-index functions of the form f(x) = g(Ax) = ∑_{i=1}^{k} gi(ai^T x) from point evaluations of f. We assume that the function f is defined on an ℓ2-ball in R^d, g is twice continuously differentiable almost everywhere, and A ∈ R^{k×d} is a rank-k matrix, where k ≪ d. We propose a randomized, active sampling scheme for estimating such functions with uniform approximation guarantees. Our theoretical developments leverage recent techniques from low-rank matrix recovery, which enables us to derive an estimator of the function f along with sample complexity bounds. We also characterize the noise robustness of the scheme, and provide empirical evidence that the high-dimensional scaling of our sample complexity bounds is quite accurate. 1

5 0.52054757 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition is an important extension of Nyström approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses a tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms. 1

6 0.51257324 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

7 0.50316334 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

8 0.49828964 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

9 0.49078298 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

10 0.48345113 159 nips-2012-Image Denoising and Inpainting with Deep Neural Networks

11 0.48256287 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

12 0.47684494 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering

13 0.47576278 280 nips-2012-Proper losses for learning from partial labels

14 0.47111508 221 nips-2012-Multi-Stage Multi-Task Feature Learning

15 0.46792725 156 nips-2012-Identifiability and Unmixing of Latent Parse Trees

16 0.45942664 254 nips-2012-On the Sample Complexity of Robust PCA

17 0.45449349 86 nips-2012-Convex Multi-view Subspace Learning

18 0.44654962 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance

19 0.44236362 211 nips-2012-Meta-Gaussian Information Bottleneck

20 0.44152832 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.032), (14, 0.031), (21, 0.018), (38, 0.143), (42, 0.116), (54, 0.025), (55, 0.026), (74, 0.048), (76, 0.109), (80, 0.314), (92, 0.046)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9543848 100 nips-2012-Discriminative Learning of Sum-Product Networks

Author: Robert Gens, Pedro Domingos

Abstract: Sum-product networks are a new deep architecture that can perform fast, exact inference on high-treewidth models. Only generative methods for training SPNs have been proposed to date. In this paper, we present the first discriminative training algorithms for SPNs, combining the high accuracy of the former with the representational power and tractability of the latter. We show that the class of tractable discriminative SPNs is broader than the class of tractable generative ones, and propose an efficient backpropagation-style algorithm for computing the gradient of the conditional log likelihood. Standard gradient descent suffers from the diffusion problem, but networks with many layers can be learned reliably using “hard” gradient descent, where marginal inference is replaced by MPE inference (i.e., inferring the most probable state of the non-evidence variables). The resulting updates have a simple and intuitive form. We test discriminative SPNs on standard image classification tasks. We obtain the best results to date on the CIFAR-10 dataset, using fewer features than prior methods with an SPN architecture that learns local image structure discriminatively. We also report the highest published test accuracy on STL-10 even though we only use the labeled portion of the dataset. 1

2 0.95316327 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks

Author: Tivadar Papai, Henry Kautz, Daniel Stefankovic

Abstract: Markov logic is a widely used tool in statistical relational learning, which uses a weighted first-order logic knowledge base to specify a Markov random field (MRF) or a conditional random field (CRF). In many applications, a Markov logic network (MLN) is trained in one domain, but used in a different one. This paper focuses on dynamic Markov logic networks, where the size of the discretized time-domain typically varies between training and testing. It has been previously pointed out that the marginal probabilities of truth assignments to ground atoms can change if one extends or reduces the domains of predicates in an MLN. We show that in addition to this problem, the standard way of unrolling a Markov logic theory into a MRF may result in time-inhomogeneity of the underlying Markov chain. Furthermore, even if these representational problems are not significant for a given domain, we show that the more practical problem of generating samples in a sequential conditional random field for the next slice relying on the samples from the previous slice has high computational cost in the general case, due to the need to estimate a normalization factor for each sample. We propose a new discriminative model, slice normalized dynamic Markov logic networks (SN-DMLN), that suffers from none of these issues. It supports efficient online inference, and can directly model influences between variables within a time slice that do not have a causal direction, in contrast with fully directed models (e.g., DBNs). Experimental results show an improvement in accuracy over previous approaches to online inference in dynamic Markov logic networks. 1

3 0.95252913 204 nips-2012-MAP Inference in Chains using Column Generation

Author: David Belanger, Alexandre Passos, Sebastian Riedel, Andrew McCallum

Abstract: Linear chains and trees are basic building blocks in many applications of graphical models, and they admit simple exact maximum a-posteriori (MAP) inference algorithms based on message passing. However, in many cases this computation is prohibitively expensive, due to quadratic dependence on variables’ domain sizes. The standard algorithms are inefficient because they compute scores for hypotheses for which there is strong negative local evidence. For this reason there has been significant previous interest in beam search and its variants; however, these methods provide only approximate results. This paper presents new exact inference algorithms based on the combination of column generation and pre-computed bounds on terms of the model’s scoring function. While we do not improve worst-case performance, our method substantially speeds real-world, typical-case inference in chains and trees. Experiments show our method to be twice as fast as exact Viterbi for Wall Street Journal part-of-speech tagging and over thirteen times faster for a joint part-of-speed and named-entity-recognition task. Our algorithm is also extendable to new techniques for approximate inference, to faster 0/1 loss oracles, and new opportunities for connections between inference and learning. We encourage further exploration of high-level reasoning about the optimization problem implicit in dynamic programs. 1

4 0.94829029 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

Author: Harish G. Ramaswamy, Shivani Agarwal

Abstract: We study consistency properties of surrogate loss functions for general multiclass classification problems, defined by a general loss matrix. We extend the notion of classification calibration, which has been studied for binary and multiclass 0-1 classification problems (and for certain other specific learning problems), to the general multiclass setting, and derive necessary and sufficient conditions for a surrogate loss to be classification calibrated with respect to a loss matrix in this setting. We then introduce the notion of classification calibration dimension of a multiclass loss matrix, which measures the smallest ‘size’ of a prediction space for which it is possible to design a convex surrogate that is classification calibrated with respect to the loss matrix. We derive both upper and lower bounds on this quantity, and use these results to analyze various loss matrices. In particular, as one application, we provide a different route from the recent result of Duchi et al. (2010) for analyzing the difficulty of designing ‘low-dimensional’ convex surrogates that are consistent with respect to pairwise subset ranking losses. We anticipate the classification calibration dimension may prove to be a useful tool in the study and design of surrogate losses for general multiclass learning problems. 1

5 0.94485784 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking

Author: James Scott, Jonathan W. Pillow

Abstract: Characterizing the information carried by neural populations in the brain requires accurate statistical models of neural spike responses. The negative-binomial distribution provides a convenient model for over-dispersed spike counts, that is, responses with greater-than-Poisson variability. Here we describe a powerful data-augmentation framework for fully Bayesian inference in neural models with negative-binomial spiking. Our approach relies on a recently described latentvariable representation of the negative-binomial distribution, which equates it to a Polya-gamma mixture of normals. This framework provides a tractable, conditionally Gaussian representation of the posterior that can be used to design efficient EM and Gibbs sampling based algorithms for inference in regression and dynamic factor models. We apply the model to neural data from primate retina and show that it substantially outperforms Poisson regression on held-out data, and reveals latent structure underlying spike count correlations in simultaneously recorded spike trains. 1

6 0.93362045 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

same-paper 7 0.90672755 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

8 0.89953053 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

9 0.89828622 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions

10 0.88160264 293 nips-2012-Relax and Randomize : From Value to Algorithms

11 0.8794015 200 nips-2012-Local Supervised Learning through Space Partitioning

12 0.87939 171 nips-2012-Latent Coincidence Analysis: A Hidden Variable Model for Distance Metric Learning

13 0.86851138 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

14 0.84746432 197 nips-2012-Learning with Recursive Perceptual Representations

15 0.84369612 330 nips-2012-Supervised Learning with Similarity Functions

16 0.84346694 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

17 0.83846343 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization

18 0.83789605 279 nips-2012-Projection Retrieval for Classification

19 0.83683711 207 nips-2012-Mandatory Leaf Node Prediction in Hierarchical Multilabel Classification

20 0.83479643 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)