nips nips2012 nips2012-17 knowledge-graph by maker-knowledge-mining

17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound


Source: pdf

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition is an important extension of the Nyström approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses a tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 The CUR matrix decomposition is an important extension of the Nyström approximation to a general matrix. [sent-3, score-0.091]

2 It approximates any data matrix in terms of a small number of its columns and rows. [sent-4, score-0.13]

3 In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. [sent-5, score-0.124]

4 The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses a tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. [sent-6, score-0.171]

5 In many cases, matrix factorization methods are employed to construct compressed and informative representations to facilitate computation and interpretation. [sent-10, score-0.116]

6 A principled approach is the truncated singular value decomposition (SVD) which finds the best low-rank approximation of a data matrix. [sent-11, score-0.083]
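
As a quick, concrete illustration of this truncated-SVD baseline, here is a minimal NumPy sketch (our own example, not code from the paper) that forms the best rank-k approximation A_k from the top k singular triplets:

```python
import numpy as np

def truncated_svd_approx(A, k):
    """Best rank-k approximation A_k of A in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Example: approximation error of a random matrix at rank 5.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 50))
A_k = truncated_svd_approx(A, k=5)
print(np.linalg.norm(A - A_k, "fro"))
```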

7 Therefore, it is of great interest to represent a data matrix in terms of a small number of actual columns and/or actual rows of the matrix. [sent-16, score-0.199]

8 The CUR matrix decomposition provides such techniques, and it has been shown to be very useful in high dimensional data analysis [19]. [sent-17, score-0.091]

9 Given a matrix A, the CUR technique selects a subset of columns of A to construct a matrix C and a subset of rows of A to construct a matrix R, and computes a matrix U such that Ã = CUR best approximates A. [sent-18, score-0.482]
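
To make the CUR construction concrete, the following NumPy sketch (our own illustration, using arbitrarily chosen column and row indices rather than the paper's sampling schemes) builds C and R from index sets and takes U = C†AR†, a standard choice that minimizes ∥A − CUR∥_F once C and R are fixed:

```python
import numpy as np

def cur_from_indices(A, col_idx, row_idx):
    """Form C, U, R from chosen column/row indices.
    U = pinv(C) @ A @ pinv(R) minimizes ||A - C U R||_F for fixed C and R."""
    C = A[:, col_idx]                                  # m x c
    R = A[row_idx, :]                                  # r x n
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)      # c x r
    return C, U, R

rng = np.random.default_rng(0)
A = rng.standard_normal((80, 60))
C, U, R = cur_from_indices(A, col_idx=[0, 5, 10, 20], row_idx=[3, 7, 11, 40, 55])
print(np.linalg.norm(A - C @ U @ R, "fro"))
```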

10 Stage 1 is a standard column selection procedure, and Stage 2 does row selection from A and C simultaneously. [sent-20, score-0.16]

11 The CUR matrix decomposition problem is widely studied in the literature [7, 8, 9, 10, 12, 13, 16, 18, 19, 22]. [sent-22, score-0.091]

12 Perhaps the most widely known work on the CUR problem is [10], in which the authors devised a randomized CUR algorithm called the subspace sampling algorithm. [sent-23, score-0.43]

13 Particularly, the algorithm has a (1 + ϵ) relative-error ratio with high probability (w.h.p.). [sent-24, score-0.1]

14 Unfortunately, all the existing CUR algorithms require a large number of columns and rows to be chosen. [sent-28, score-0.158]

15 For example, for an m × n matrix A and a target rank k ≤ min{m, n}, the state-of-the-art CUR algorithm — the subspace sampling algorithm in [10] — requires exactly O(k^4 ϵ^{-6}) rows, or O(k ϵ^{-4} log^2 k) rows in expectation, to achieve a (1 + ϵ) relative-error ratio w.h.p. [sent-29, score-0.796]

16 Moreover, the computational cost of this algorithm is at least the cost of the truncated SVD of A, that is, O(min{mn^2, nm^2}). [sent-32, score-0.063]

17 In this paper we develop a CUR algorithm which beats the state-of-the-art algorithm in both theory and experiments. [sent-34, score-0.088]

18 In particular, we show in Theorem 5 a novel randomized CUR algorithm with lower time complexity and a tighter theoretical bound in comparison with the state-of-the-art CUR algorithm in [10]. [sent-35, score-0.181]

19 Section 3 introduces several existing column selection algorithms and the state-of-the-art CUR algorithm. [sent-37, score-0.111]

20 Section 5 empirically compares our proposed algorithm with the state-of-the-art algorithm. [sent-39, score-0.033]

21 2 Notations: For a matrix A = [a_ij] ∈ R^{m×n}, let a^{(i)} be its i-th row and a_j be its j-th column. [sent-40, score-0.074]

22 Section 3.1 introduces several relative-error column selection algorithms related to this work. [sent-47, score-0.111]

23 Section 3.3 discusses the connection between the column selection problem and the CUR problem. [sent-51, score-0.098]

24 3.1 Relative-Error Column Selection Algorithms: Given a matrix A ∈ R^{m×n}, column selection is the problem of selecting c columns of A to construct C ∈ R^{m×c} so as to minimize ∥A − CC†A∥_F. [sent-53, score-0.277]
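
The objective is easy to evaluate for any candidate subset; the helper below is a small illustrative sketch (ours, not from the paper) that computes the column-selection residual ∥A − CC†A∥_F:

```python
import numpy as np

def column_selection_residual(A, col_idx):
    """Frobenius-norm error of projecting A onto the span of the
    selected columns C = A[:, col_idx], i.e. ||A - C pinv(C) A||_F."""
    C = A[:, col_idx]
    return np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, "fro")

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 40))
print(column_selection_residual(A, col_idx=[0, 3, 9, 17]))
```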

25 In recent years, many polynomial-time approximate algorithms have been proposed, among which we are particularly interested in those with relative-error bounds; that is, with c ≥ k columns selected from A, there is a constant η such that ∥A − CC†A∥_F ≤ η ∥A − A_k∥_F. [sent-55, score-0.102]

26 We first introduce a recently developed deterministic algorithm, the dual set sparsification algorithm proposed in [2, 3]. [sent-58, score-0.107]

27 Furthermore, this algorithm is a building block of some more powerful algorithms (e.g., Lemma 2). [sent-60, score-0.046]

28 Our novel CUR algorithm also relies on this algorithm. [sent-62, score-0.051]

29 Given a matrix A ∈ R^{m×n} of rank ρ and a target rank k (< ρ), there exists a deterministic algorithm to select c (> k) columns of A and form a matrix C ∈ R^{m×c} such that ∥A − CC†A∥_F ≤ √(1 + 1/(1 − √(k/c))^2) ∥A − A_k∥_F. [sent-65, score-0.368]

30 Although some partial SVD algorithms, such as Krylov subspace methods, require only O(mnk) time, they are all numerically unstable. [sent-66, score-0.224]

31 Moreover, the matrix C can be computed in T_{V_A,k} + O(mn + nck^2) time, where T_{V_A,k} is the time needed to compute the top k right singular vectors of A. [sent-68, score-0.087]

32 There are also a variety of randomized column selection algorithms achieving relative-error bounds in the literature: [3, 5, 6, 10, 14]. [sent-69, score-0.173]

33 A randomized algorithm in [2] selects only c = (2k/ϵ)(1 + o(1)) columns to achieve the expected relative-error ratio (1 + ϵ). [sent-70, score-0.29]

34 The algorithm is based on the approximate SVD via random projection [15], the dual set sparsification algorithm [2], and the adaptive sampling algorithm [6]. [sent-71, score-0.294]
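
The "approximate SVD via random projection" ingredient can be sketched as follows. This is a generic Halko-style randomized range finder written purely for illustration; it is assumed to be representative of the projection-based SVD in [15], not a transcription of the exact variant used by the authors:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate top-k SVD of A using a random projection (range finder)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                     # orthonormal basis for range(A @ Omega)
    B = Q.T @ A                                        # small (k + p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 120))
U, s, Vt = randomized_svd(A, k=10)
print(np.linalg.norm(A - U @ np.diag(s) @ Vt, "fro"))
```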

35 Here we present the main results of this algorithm in Lemma 2. [sent-72, score-0.033]

36 Our proposed CUR algorithm is motivated by and relies on this algorithm. [sent-73, score-0.051]

37 Given a matrix A ∈ R^{m×n} of rank ρ, a target rank k (2 ≤ k < ρ), and 0 < ϵ < 1, there exists a randomized algorithm to select at most c = (2k/ϵ)(1 + o(1)) columns of A to form a matrix C ∈ R^{m×c} such that E^2[∥A − CC†A∥_F] ≤ E[∥A − CC†A∥_F^2] ≤ (1 + ϵ)∥A − A_k∥_F^2, where the expectations are taken w.r.t. the randomness of the algorithm. [sent-75, score-0.414]

38 Furthermore, the matrix C can be computed in O((mnk + nk^3) ϵ^{-2/3}) time. [sent-79, score-0.073]

39 [10] proposed a two-stage randomized CUR algorithm which has a relative-error bound w.h.p. [sent-82, score-0.11]

40 With probability at least 1 − δ, the relative-error ratio is 1 + ϵ. [sent-86, score-0.067]

41 The computational cost is dominated by the truncated SVD of A and C. [sent-87, score-0.03]

42 Though the algorithm is ϵ-optimal with high probability, it requires too many rows to be chosen: at least r = O(k ϵ^{-4} log^2 k) rows in expectation. [sent-88, score-0.171]

43 In this paper we seek to devise an algorithm with mild requirements on the numbers of columns and rows. [sent-89, score-0.109]

44 3.3 Connection between Column Selection and CUR Matrix Decomposition: The CUR problem has a close connection with the column selection problem. [sent-91, score-0.098]

45 As aforementioned, the first stage of existing CUR algorithms is simply a column selection procedure. [sent-92, score-0.196]

46 If the second stage is naïvely solved by a column selection algorithm on A^T, then the error ratio will be at least (2 + ϵ). [sent-94, score-0.307]

47 For a relative-error CUR algorithm, the first stage seeks to bound the construction error ratio ∥A − CC†A∥_F / ∥A − A_k∥_F, while the second stage seeks to bound ∥A − CC†AR†R∥_F / ∥A − CC†A∥_F given C. [sent-95, score-0.239]

48 Actually, the first stage is a special case of the second stage where C = A_k. [sent-96, score-0.284]

49 Given a matrix A, if an algorithm solving the second stage results in a bound ∥A − CC†AR†R∥_F / ∥A − CC†A∥_F ≤ η, then this algorithm also solves the column selection problem for A^T with an η relative-error ratio. [sent-97, score-0.318]

50 Thus the second stage of CUR is a generalization of the column selection problem. [sent-98, score-0.183]

51 We call it the fast CUR algorithm because it has lower time complexity compared with SVD. [sent-100, score-0.121]

52 Theorem 5 relies on Lemma 2 and Theorem 4, and Theorem 4 relies on Theorem 3. [sent-102, score-0.036]

53 4.1 Adaptive Sampling: The relative-error adaptive sampling algorithm is established in [6, Theorem 2.1]. [sent-108, score-0.17]

54 The algorithm is based on the following idea: after selecting a proportion of columns from A to form C_1 by an arbitrary algorithm, the algorithm randomly samples additional c_2 columns according to the residual A − C_1C_1†A. [sent-110, score-0.228]

55 [2] used the adaptive sampling algorithm to decrease the residual of the dual set sparsification algorithm and obtained a (1 + ϵ) relative-error bound. [sent-112, score-0.291]

56 Here we prove a new bound for the adaptive sampling algorithm. [sent-113, score-0.152]

57 Theorem 2.1 of [6] is a direct corollary of our following theorem, in which C = A_k is set. [sent-117, score-0.037]

58 Given a matrix A ∈ R^{m×n} and a matrix C ∈ R^{m×c} such that rank(C) = rank(CC†A) = ρ (ρ ≤ c ≤ n), we let R_1 ∈ R^{r_1×n} consist of r_1 rows of A, and define the residual B = A − A R_1†R_1. [sent-119, score-0.207]

59 We sample r_2 rows i.i.d. from A, in each trial of which the i-th row is chosen with probability p_i. [sent-124, score-0.038]

60 Let R_2 ∈ R^{r_2×n} contain the r_2 sampled rows and let R = [R_1^T, R_2^T]^T ∈ R^{(r_1+r_2)×n}. [sent-125, score-0.069]
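
A minimal NumPy sketch of this adaptive row-sampling step is given below. It is our own illustration and assumes the standard choice p_i ∝ ∥b^{(i)}∥_2^2 for the sampling probabilities (the extracted text above does not spell out the exact definition of p_i); the helper name and indices are ours:

```python
import numpy as np

def adaptive_row_sampling(A, r1_idx, r2, seed=0):
    """Sample r2 extra rows of A with probabilities proportional to the squared
    row norms of the residual B = A - A pinv(R1) R1, then stack them with R1."""
    rng = np.random.default_rng(seed)
    R1 = A[r1_idx, :]
    B = A - A @ np.linalg.pinv(R1) @ R1        # residual after the first round
    p = np.sum(B ** 2, axis=1)
    p = p / p.sum()                            # assumed: p_i proportional to ||b^(i)||_2^2
    r2_idx = rng.choice(A.shape[0], size=r2, replace=True, p=p)
    return np.vstack([R1, A[r2_idx, :]])       # R = [R1^T, R2^T]^T

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 60))
R = adaptive_row_sampling(A, r1_idx=[0, 10, 20, 30], r2=8)
print(R.shape)   # (12, 60)
```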

61 4.2 The Fast CUR Algorithm: Based on the dual set sparsification algorithm of Lemma 1 and the adaptive sampling algorithm of Theorem 3, we develop a randomized algorithm to solve the second stage of the CUR problem. [sent-131, score-0.441]

62 We present the results of the algorithm in Theorem 4. [sent-132, score-0.033]

63 Theorem 5 of [2] is a special case of the following theorem, where C = A_k. [sent-133, score-0.037]

64 Furthermore, the matrix R can be computed in O((mnk + mk^3) ϵ^{-2/3}) time. [sent-139, score-0.078]

65 Based on Lemma 2 and Theorem 4, here we present the main theorem for the fast CUR algorithm. [sent-140, score-0.108]

66 Moreover, the algorithm runs in time O(mnk ϵ^{-2/3} + (m + n)k^3 ϵ^{-2/3} + mk^2 ϵ^{-2} + nk^2 ϵ^{-4}). [sent-153, score-0.093]

67 Since k, c, r ≪ min{m, n} by the assumptions, the time complexity of the fast CUR algorithm is lower than that of the SVD of A. [sent-154, score-0.121]

68 This is the main reason why we call it the fast CUR algorithm. [sent-155, score-0.071]

69 Another advantage of this algorithm is that it avoids loading the whole m × n data matrix A into main memory. [sent-156, score-0.145]

70 None of the three steps — the randomized SVD, the dual set sparsification algorithm, and the adaptive sampling algorithm — requires loading the whole of A into memory. [sent-157, score-0.348]

71 The most memory-expensive operation throughout the fast CUR algorithm is computing the Moore-Penrose inverses of C and R, which requires maintaining an m × c matrix or an r × n matrix in memory. [sent-158, score-0.179]
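
To illustrate why this step only needs an m × c (or r × n) block in memory, the sketch below (our own example, not the authors' implementation) computes the Moore-Penrose inverse of a tall matrix C from its thin SVD alone, without ever touching the full m × n matrix A:

```python
import numpy as np

def thin_pinv(C, tol=1e-12):
    """Moore-Penrose pseudoinverse of a tall m x c matrix via its thin SVD.
    Only C itself (m x c) and c x c / c x m factors are held in memory."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)   # U: m x c, s: (c,), Vt: c x c
    s_inv = np.where(s > tol * s.max(), 1.0 / s, 0.0)
    return (Vt.T * s_inv) @ U.T                        # c x m

rng = np.random.default_rng(0)
C = rng.standard_normal((10000, 20))      # tall and skinny
C_pinv = thin_pinv(C)
print(np.allclose(C_pinv @ C, np.eye(20)))
```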

72 In comparison, the subspace sampling algorithm requires loading the whole matrix into memory to compute its truncated SVD. [sent-159, score-0.51]

73 5 Empirical Comparisons: In this section we provide empirical comparisons among the relative-error CUR algorithms on several datasets. [sent-160, score-0.027]

74 We report the relative-error ratio and the running time of each algorithm on each data set. [sent-161, score-0.148]

75 The relative-error ratio is defined by relative-error ratio = ∥A − CUR∥_F / ∥A − A_k∥_F, where k is a specified target rank. [sent-162, score-0.163]
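
A small illustrative helper (ours, not the authors' evaluation code) for computing this metric:

```python
import numpy as np

def relative_error_ratio(A, C, U, R, k):
    """||A - C U R||_F / ||A - A_k||_F, with A_k the best rank-k approximation."""
    Us, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = Us[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A - A_k, "fro")

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 40))
cols, rows = [0, 2, 4, 6, 8, 10], [1, 3, 5, 7, 9, 11, 13]
C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
print(relative_error_ratio(A, C, U, R, k=3))
```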

76 The results show that the fast CUR algorithm has a much lower relative-error ratio than the subspace sampling algorithm. [sent-180, score-0.506]

77 As for the running time, the fast CUR algorithm is more efficient when c and r are small. [sent-182, score-0.135]

78 When c and r become large, the fast CUR algorithm becomes less efficient. [sent-183, score-0.104]

79 This is because the time complexity of the fast CUR algorithm is linear in ϵ−4 and large c and r imply small ϵ. [sent-184, score-0.121]

80 However, the purpose of CUR is to select a small number of columns and rows from the data matrix, that is, c ≪ n and r ≪ m. [sent-185, score-0.159]

81 6 Conclusions: In this paper we have proposed a novel randomized algorithm for the CUR matrix decomposition problem. [sent-272, score-0.186]

82 This algorithm is faster, more scalable, and more accurate than the state-of-the-art algorithm, i.e., the subspace sampling algorithm. [sent-273, score-0.033]

83 Our algorithm requires only c = 2kϵ^{-1}(1 + o(1)) columns and r = 2cϵ^{-1}(1 + o(1)) rows to achieve a (1 + ϵ) relative-error ratio. [sent-276, score-0.178]

84 To achieve the same relative-error bound, the subspace sampling algorithm requires c = O(k ϵ^{-2} log k) columns and r = O(c ϵ^{-2} log c) rows selected from the original matrix. [sent-277, score-0.513]
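
To get a rough feel for these counts, one can plug in sample values of k and ϵ. The snippet below drops the o(1) terms and sets all hidden big-O constants to one, so it is only a back-of-the-envelope illustration, not a statement about the actual constants in either bound:

```python
import math

k, eps = 10, 0.5

# Fast CUR algorithm, o(1) terms dropped:
c_fast = 2 * k / eps             # 40 columns
r_fast = 2 * c_fast / eps        # 160 rows

# Subspace sampling algorithm, big-O constants set to 1:
c_sub = k * eps ** -2 * math.log(k)           # ~92 columns
r_sub = c_sub * eps ** -2 * math.log(c_sub)   # ~1666 rows

print(c_fast, r_fast, round(c_sub, 1), round(r_sub, 1))
```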

85 Our algorithm also beats the subspace sampling algorithm in time-complexity. [sent-278, score-0.423]

86 Our algorithm costs O(mnk ϵ^{-2/3} + (m + n)k^3 ϵ^{-2/3} + mk^2 ϵ^{-2} + nk^2 ϵ^{-4}) time, which is lower than the O(min{mn^2, m^2n}) cost of the subspace sampling algorithm when k is small. [sent-279, score-0.444]

87 Moreover, our algorithm enjoys another advantage of avoiding loading the whole data matrix into main memory, which also makes our algorithm more scalable. [sent-280, score-0.178]

88 Appendix A: The Dual Set Sparsification Algorithm. For the sake of completeness, we attach the dual set sparsification algorithm here and describe some implementation details. [sent-282, score-0.111]

89 The dual set sparsification algorithms are deterministic algorithms established in [2]. [sent-283, score-0.1]

90 The fast CUR algorithm calls the dual set spectral-Frobenius sparsification algorithm [2, Lemma 13] in both stages. [sent-284, score-0.195]

91 We show this algorithm in Algorithm 2 and its bounds in Lemma 6. [sent-285, score-0.033]

92 Let U = {x_1, ..., x_n} ⊂ R^l (l < n) contain the columns of an arbitrary matrix X ∈ R^{l×n}. [sent-287, score-0.13]

93 Given an integer r with k < r < n, Algorithm 2 deterministically computes a set of weights s_i ≥ 0 (i = 1, ..., n), at most r of which are non-zero, such that λ_k(Σ_{i=1}^n s_i v_i v_i^T) ≥ (1 − √(k/r))^2 and tr(Σ_{i=1}^n s_i x_i x_i^T) ≤ ∥X∥_F^2. [sent-291, score-0.048]

94 Algorithm 2: Deterministic Dual Set Spectral-Frobenius Sparsification Algorithm. [sent-292, score-0.075]

95 In each step the algorithm updates s_{τ+1}[j] = s_τ[j] + t and A_{τ+1} = A_τ + t v_j v_j^T; the weights s_i can be computed deterministically in O(rnk^2 + nl) time. [sent-294, score-0.064]

96 In each iteration the algorithm performs one eigenvalue decomposition: A_τ = WΛW^T. [sent-296, score-0.047]

97 Since (A_τ − αI_k)^q = W Diag((λ_1 − α)^q, ..., (λ_k − α)^q) W^T, we can efficiently compute (A_τ − (L_τ + 1)I_k)^q based on the eigenvalue decomposition of A_τ. [sent-298, score-0.051]
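
This identity is easy to verify numerically; the sketch below (our illustration) computes (A_τ − αI)^q for a symmetric matrix from a single eigendecomposition and checks it against direct matrix powers:

```python
import numpy as np

def shifted_power_via_eig(A_tau, alpha, q):
    """(A_tau - alpha*I)^q for symmetric A_tau via its eigendecomposition:
    (A - alpha*I)^q = W diag((lam_i - alpha)^q) W^T."""
    lam, W = np.linalg.eigh(A_tau)
    return (W * (lam - alpha) ** q) @ W.T

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A_tau = M @ M.T                                        # symmetric test matrix
direct = np.linalg.matrix_power(A_tau - 0.3 * np.eye(5), 3)
print(np.allclose(direct, shifted_power_via_eig(A_tau, 0.3, 3)))
```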

98 Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. [sent-332, score-0.08]

99 On the Nyström method for approximating a Gram matrix for improved kernel-based learning. [sent-336, score-0.054]

100 Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. [sent-371, score-0.067]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cur', 0.859), ('subspace', 0.224), ('cc', 0.154), ('sparsi', 0.135), ('sampling', 0.111), ('petros', 0.104), ('stage', 0.085), ('ak', 0.077), ('columns', 0.076), ('fast', 0.071), ('svd', 0.07), ('rows', 0.069), ('ar', 0.068), ('drineas', 0.068), ('ratio', 0.067), ('rm', 0.064), ('frobenius', 0.064), ('randomized', 0.062), ('mnk', 0.06), ('arcene', 0.059), ('dual', 0.058), ('column', 0.056), ('matrix', 0.054), ('dexter', 0.052), ('construct', 0.049), ('exactly', 0.047), ('rank', 0.046), ('selection', 0.042), ('ik', 0.041), ('loading', 0.04), ('boutsidis', 0.039), ('theorem', 0.037), ('decomposition', 0.037), ('construction', 0.034), ('algorithm', 0.033), ('vj', 0.033), ('lemma', 0.033), ('running', 0.031), ('truncated', 0.03), ('deshpande', 0.03), ('goreinov', 0.03), ('redrock', 0.03), ('residual', 0.03), ('vi', 0.029), ('target', 0.029), ('expected', 0.029), ('norm', 0.028), ('vk', 0.027), ('adaptive', 0.026), ('error', 0.024), ('christos', 0.024), ('luis', 0.024), ('mk', 0.024), ('rr', 0.024), ('rt', 0.024), ('selects', 0.023), ('aij', 0.023), ('mahoney', 0.023), ('nystr', 0.023), ('biology', 0.023), ('beats', 0.022), ('delete', 0.022), ('malik', 0.022), ('ua', 0.022), ('tighter', 0.021), ('ravi', 0.021), ('relative', 0.021), ('michael', 0.02), ('attach', 0.02), ('row', 0.02), ('rl', 0.019), ('nk', 0.019), ('amit', 0.019), ('symposium', 0.018), ('whole', 0.018), ('pi', 0.018), ('relies', 0.018), ('va', 0.018), ('time', 0.017), ('si', 0.017), ('focs', 0.016), ('singular', 0.016), ('rk', 0.016), ('deterministic', 0.016), ('bag', 0.016), ('siam', 0.016), ('uk', 0.015), ('decompositions', 0.015), ('china', 0.015), ('bound', 0.015), ('eigenvalue', 0.014), ('deterministically', 0.014), ('expectation', 0.014), ('select', 0.014), ('comparisons', 0.014), ('seeks', 0.014), ('annual', 0.013), ('cation', 0.013), ('algorithms', 0.013), ('compressed', 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition is an important extension of the Nyström approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses a tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms.

2 0.11692661 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

Abstract: In this paper, we consider the problem of clustering data points into low-dimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we first develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that the proposed methods are very effective in subspace clustering and identifying outliers.

3 0.058471177 86 nips-2012-Convex Multi-view Subspace Learning

Author: Martha White, Xinhua Zhang, Dale Schuurmans, Yao-liang Yu

Abstract: Subspace learning seeks a low dimensional representation of data that enables accurate reconstruction. However, in many applications, data is obtained from multiple sources rather than a single source (e.g. an object might be viewed by cameras at different angles, or a document might consist of text and images). The conditional independence of separate sources imposes constraints on their shared latent representation, which, if respected, can improve the quality of a learned low dimensional representation. In this paper, we present a convex formulation of multi-view subspace learning that enforces conditional independence while reducing dimensionality. For this formulation, we develop an efficient algorithm that recovers an optimal data reconstruction by exploiting an implicit convex regularizer, then recovers the corresponding latent representation and reconstruction model, jointly and optimally. Experiments illustrate that the proposed method produces high quality results. 1

4 0.057601705 125 nips-2012-Factoring nonnegative matrices with linear programs

Author: Ben Recht, Christopher Re, Joel Tropp, Victor Bittorf

Abstract: This paper describes a new approach, based on linear programming, for computing nonnegative matrix factorizations (NMFs). The key idea is a data-driven model for the factorization where the most salient features in the data are used to express the remaining features. More precisely, given a data matrix X, the algorithm identifies a matrix C that satisfies X ≈ CX and some linear constraints. The constraints are chosen to ensure that the matrix C selects features; these features can then be used to find a low-rank NMF of X. A theoretical analysis demonstrates that this approach has guarantees similar to those of the recent NMF algorithm of Arora et al. (2012). In contrast with this earlier work, the proposed method extends to more general noise models and leads to efficient, scalable algorithms. Experiments with synthetic and real datasets provide evidence that the new approach is also superior in practice. An optimized C++ implementation can factor a multigigabyte matrix in a matter of minutes. 1

5 0.054915071 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

Author: Thanh Ngo, Yousef Saad

Abstract: This paper describes gradient methods based on a scaled metric on the Grassmann manifold for low-rank matrix completion. The proposed methods significantly improve canonical gradient methods, especially on ill-conditioned matrices, while maintaining established global convergence and exact recovery guarantees. A connection between a form of subspace iteration for matrix completion and the scaled gradient descent procedure is also established. The proposed conjugate gradient method based on the scaled gradient outperforms several existing algorithms for matrix completion and is competitive with recently proposed methods.

6 0.053520016 254 nips-2012-On the Sample Complexity of Robust PCA

7 0.052298229 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

8 0.045068249 120 nips-2012-Exact and Stable Recovery of Sequences of Signals with Sparse Increments via Differential 1-Minimization

9 0.043614268 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses

10 0.043034196 279 nips-2012-Projection Retrieval for Classification

11 0.042406842 237 nips-2012-Near-optimal Differentially Private Principal Components

12 0.04063205 234 nips-2012-Multiresolution analysis on the symmetric group

13 0.039603863 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

14 0.03951117 34 nips-2012-Active Learning of Multi-Index Function Models

15 0.039432064 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

16 0.039207842 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach

17 0.038959771 247 nips-2012-Nonparametric Reduced Rank Regression

18 0.038881566 208 nips-2012-Matrix reconstruction with the local max norm

19 0.036216129 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

20 0.033849739 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.101), (1, 0.015), (2, 0.037), (3, -0.047), (4, 0.014), (5, 0.027), (6, -0.006), (7, 0.018), (8, 0.019), (9, -0.013), (10, 0.028), (11, -0.018), (12, -0.048), (13, -0.013), (14, 0.024), (15, 0.048), (16, 0.036), (17, -0.024), (18, 0.044), (19, -0.053), (20, -0.02), (21, 0.073), (22, -0.054), (23, 0.002), (24, -0.07), (25, 0.016), (26, 0.0), (27, -0.027), (28, -0.001), (29, 0.027), (30, -0.046), (31, 0.074), (32, 0.058), (33, 0.026), (34, -0.089), (35, -0.038), (36, -0.047), (37, -0.002), (38, -0.033), (39, 0.071), (40, 0.024), (41, -0.11), (42, -0.045), (43, 0.001), (44, 0.077), (45, 0.02), (46, 0.021), (47, -0.062), (48, 0.001), (49, 0.032)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91648555 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition is an important extension of the Nyström approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses a tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms.

2 0.66270947 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

Author: S. D. Babacan, Shinichi Nakajima, Minh Do

Abstract: In this paper, we consider the problem of clustering data points into low-dimensional subspaces in the presence of outliers. We pose the problem using a density estimation formulation with an associated generative model. Based on this probability model, we first develop an iterative expectation-maximization (EM) algorithm and then derive its global solution. In addition, we develop two Bayesian methods based on variational Bayesian (VB) approximation, which are capable of automatic dimensionality selection. While the first method is based on an alternating optimization scheme for all unknowns, the second method makes use of recent results in VB matrix factorization leading to fast and effective estimation. Both methods are extended to handle sparse outliers for robustness and can handle missing values. Experimental results suggest that the proposed methods are very effective in subspace clustering and identifying outliers.

3 0.65007621 320 nips-2012-Spectral Learning of General Weighted Automata via Constrained Matrix Completion

Author: Borja Balle, Mehryar Mohri

Abstract: Many tasks in text and speech processing and computational biology require estimating functions mapping strings to real numbers. A broad class of such functions can be defined by weighted automata. Spectral methods based on the singular value decomposition of a Hankel matrix have been recently proposed for learning a probability distribution represented by a weighted automaton from a training sample drawn according to this same target distribution. In this paper, we show how spectral methods can be extended to the problem of learning a general weighted automaton from a sample generated by an arbitrary distribution. The main obstruction to this approach is that, in general, some entries of the Hankel matrix may be missing. We present a solution to this problem based on solving a constrained matrix completion problem. Combining these two ingredients, matrix completion and spectral method, a whole new family of algorithms for learning general weighted automata is obtained. We present generalization bounds for a particular algorithm in this family. The proofs rely on a joint stability analysis of matrix completion and spectral learning. 1

4 0.64871943 301 nips-2012-Scaled Gradients on Grassmann Manifolds for Matrix Completion

Author: Thanh Ngo, Yousef Saad

Abstract: This paper describes gradient methods based on a scaled metric on the Grassmann manifold for low-rank matrix completion. The proposed methods significantly improve canonical gradient methods, especially on ill-conditioned matrices, while maintaining established global convergence and exact recovery guarantees. A connection between a form of subspace iteration for matrix completion and the scaled gradient descent procedure is also established. The proposed conjugate gradient method based on the scaled gradient outperforms several existing algorithms for matrix completion and is competitive with recently proposed methods.

5 0.59790564 125 nips-2012-Factoring nonnegative matrices with linear programs

Author: Ben Recht, Christopher Re, Joel Tropp, Victor Bittorf

Abstract: This paper describes a new approach, based on linear programming, for computing nonnegative matrix factorizations (NMFs). The key idea is a data-driven model for the factorization where the most salient features in the data are used to express the remaining features. More precisely, given a data matrix X, the algorithm identifies a matrix C that satisfies X ≈ CX and some linear constraints. The constraints are chosen to ensure that the matrix C selects features; these features can then be used to find a low-rank NMF of X. A theoretical analysis demonstrates that this approach has guarantees similar to those of the recent NMF algorithm of Arora et al. (2012). In contrast with this earlier work, the proposed method extends to more general noise models and leads to efficient, scalable algorithms. Experiments with synthetic and real datasets provide evidence that the new approach is also superior in practice. An optimized C++ implementation can factor a multigigabyte matrix in a matter of minutes. 1

6 0.56259835 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

7 0.54329008 86 nips-2012-Convex Multi-view Subspace Learning

8 0.53009152 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA

9 0.51802492 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

10 0.50086576 34 nips-2012-Active Learning of Multi-Index Function Models

11 0.49588138 281 nips-2012-Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders

12 0.48041928 225 nips-2012-Multi-task Vector Field Learning

13 0.46179008 221 nips-2012-Multi-Stage Multi-Task Feature Learning

14 0.45791921 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

15 0.45562151 247 nips-2012-Nonparametric Reduced Rank Regression

16 0.45521584 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach

17 0.44551373 279 nips-2012-Projection Retrieval for Classification

18 0.44541079 254 nips-2012-On the Sample Complexity of Robust PCA

19 0.44380802 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

20 0.43960375 44 nips-2012-Approximating Concavely Parameterized Optimization Problems


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.027), (2, 0.229), (21, 0.013), (38, 0.153), (39, 0.017), (42, 0.015), (54, 0.052), (55, 0.024), (74, 0.053), (76, 0.14), (80, 0.106), (92, 0.049)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.85636806 196 nips-2012-Learning with Partially Absorbing Random Walks

Author: Xiao-ming Wu, Zhenguo Li, Anthony M. So, John Wright, Shih-fu Chang

Abstract: We propose a novel stochastic process that is with probability αi being absorbed at current state i, and with probability 1 − αi follows a random edge out of it. We analyze its properties and show its potential for exploring graph structures. We prove that under proper absorption rates, a random walk starting from a set S of low conductance will be mostly absorbed in S. Moreover, the absorption probabilities vary slowly inside S, while dropping sharply outside, thus implementing the desirable cluster assumption for graph-based learning. Remarkably, the partially absorbing process unifies many popular models arising in a variety of contexts, provides new insights into them, and makes it possible for transferring findings from one paradigm to another. Simulation results demonstrate its promising applications in retrieval and classification.

2 0.8386547 8 nips-2012-A Generative Model for Parts-based Object Segmentation

Author: S. Eslami, Christopher Williams

Abstract: The Shape Boltzmann Machine (SBM) [1] has recently been introduced as a stateof-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object’s parts. Our new model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based object segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state-of-the-art. There has been significant focus in computer vision on object recognition and detection e.g. [2], but a strong desire remains to obtain richer descriptions of objects than just their bounding boxes. One such description is a parts-based object segmentation, in which an image is partitioned into multiple sets of pixels, each belonging to either a part of the object of interest, or its background. The significance of parts in computer vision has been recognized since the earliest days of the field (e.g. [3, 4, 5]), and there exists a rich history of work on probabilistic models for parts-based segmentation e.g. [6, 7]. Many such models only consider local neighborhood statistics, however several models have recently been proposed that aim to increase the accuracy of segmentations by also incorporating prior knowledge about the foreground object’s shape [8, 9, 10, 11]. In such cases, probabilistic techniques often mainly differ in how accurately they represent and learn about the variability exhibited by the shapes of the object’s parts. Accurate models of the shapes and appearances of parts can be necessary to perform inference in datasets that exhibit large amounts of variability. In general, the stronger the models of these two components, the more performance is improved. A generative model has the added benefit of being able to generate samples, which allows us to visually inspect the quality of its understanding of the data and the problem. Recently, a generative probabilistic model known as the Shape Boltzmann Machine (SBM) has been used to model binary object shapes [1]. The SBM has been shown to constitute the state-of-the-art and it possesses several highly desirable characteristics: samples from the model look realistic, and it generalizes to generate samples that differ from the limited number of examples it is trained on. The main contributions of this paper are as follows: 1) In order to account for object parts we extend the SBM to use multinomial visible units instead of binary ones, resulting in the Multinomial Shape Boltzmann Machine (MSBM), and we demonstrate that the MSBM constitutes a strong model of parts-based object shape. 2) We combine the MSBM with an appearance model to form a fully generative model of images of objects (see Fig. 1). We show how parts-based object segmentations can be obtained simply by performing probabilistic inference in the model. We apply our model to two challenging datasets and find that in addition to being principled and fully generative, the model’s performance is comparable to the state-of-the-art. 1 Train labels Train images Test image Appearance model Joint Model Shape model Parsing Figure 1: Overview. Using annotated images separate models of shape and appearance are trained. 
Given an unseen test image, its parsing is obtained via inference in the proposed joint model. In Secs. 1 and 2 we present the model and propose efficient inference and learning schemes. In Sec. 3 we compare and contrast the resulting joint model with existing work in the literature. We describe our experimental results in Sec. 4 and conclude with a discussion in Sec. 5. 1 Model We consider datasets of cropped images of an object class. We assume that the images are constructed through some combination of a fixed number of parts. Given a dataset D = {Xd }, d = 1...n of such images X, each consisting of P pixels {xi }, i = 1...P , we wish to infer a segmentation S for the image. S consists of a labeling si for every pixel, where si is a 1-of-(L+1) encoded variable, and L is the fixed number of parts that combine to generate the foreground. In other words, si = (sli ), P l = 0...L, sli 2 {0, 1} and l sli = 1. Note that the background is also treated as a ‘part’ (l = 0). Accurate inference of S is driven by models for 1) part shapes and 2) part appearances. Part shapes: Several types of models can be used to define probabilistic distributions over segmentations S. The simplest approach is to model each pixel si independently with categorical variables whose parameters are specified by the object’s mean shape (Fig. 2(a)). Markov Random Fields (MRFs, Fig. 2(b)) additionally model interactions between nearby pixels using pairwise potential functions that efficiently capture local properties of images like smoothness and continuity. Restricted Boltzmann Machines (RBMs) and their multi-layered counterparts Deep Boltzmann Machines (DBMs, Fig. 2(c)) make heavy use of hidden variables to efficiently define higher-order potentials that take into account the configuration of larger groups of image pixels. The introduction of such hidden variables provides a way to efficiently capture complex, global properties of image pixels. RBMs and DBMs are powerful generative models, but they also have many parameters. Segmented images, however, are expensive to obtain and datasets are typically small (hundreds of examples). In order to learn a model that accurately captures the properties of part shapes we use DBMs but also impose carefully chosen connectivity and capacity constraints, following the structure of the Shape Boltzmann Machine (SBM) [1]. We further extend the model to account for multi-part shapes to obtain the Multinomial Shape Boltzmann Machine (MSBM). The MSBM has two layers of latent variables: h1 and h2 (collectively H = {h1 , h2 }), and defines a P Boltzmann distribution over segmentations p(S) = h1 ,h2 exp{ E(S, h1 , h2 |✓s )}/Z(✓s ) where X X X X X 1 2 E(S, h1 , h2 |✓s ) = bli sli + wlij sli h1 + c 1 h1 + wjk h1 h2 + c2 h2 , (1) j j j j k k k i,l j i,j,l j,k k where j and k range over the first and second layer hidden variables, and ✓s = {W 1 , W 2 , b, c1 , c2 } are the shape model parameters. In the first layer, local receptive fields are enforced by connecting each hidden unit in h1 only to a subset of the visible units, corresponding to one of four patches, as shown in Fig. 2(d,e). Each patch overlaps its neighbor by b pixels, which allows boundary continuity to be learned at the lowest layer. We share weights between the four sets of first-layer hidden units and patches, and purposely restrict the number of units in h2 . 
These modifications significantly reduce the number of parameters whilst taking into account an important property of shapes, namely that the strongest dependencies between pixels are typically local. 2 h2 1 1 h S S (a) Mean h S (b) MRF h2 h2 h1 S S (c) DBM b (d) SBM (e) 2D SBM Figure 2: Models of shape. Object shape is modeled with undirected graphical models. (a) 1D slice of a mean model. (b) Markov Random Field in 1D. (c) Deep Boltzmann Machine in 1D. (d) 1D slice of a Shape Boltzmann Machine. (e) Shape Boltzmann Machine in 2D. In all models latent units h are binary and visible units S are multinomial random variables. Based on Fig. 2 of [1]. k=1 k=2 k=3 k=1 k=2 k=3 k=1 k=2 k=3 ⇡ l=0 l=1 l=2 Figure 3: A model of appearances. Left: An exemplar dataset. Here we assume one background (l = 0) and two foreground (l = 1, non-body; l = 2, body) parts. Right: The corresponding appearance model. In this example, L = 2, K = 3 and W = 6. Best viewed in color. Part appearances: Pixels in a given image are assumed to have been generated by W fixed Gaussians in RGB space. During pre-training, the means {µw } and covariances {⌃w } of these Gaussians are extracted by training a mixture model with W components on every pixel in the dataset, ignoring image and part structure. It is also assumed that each of the L parts can have different appearances in different images, and that these appearances can be clustered into K classes. The classes differ in how likely they are to use each of the W components when ‘coloring in’ the part. The generative process is as follows. For part l in an image, one of the K classes is chosen (represented by a 1-of-K indicator variable al ). Given al , the probability distribution defined on pixels associated with part l is given by a Gaussian mixture model with means {µw } and covariances {⌃w } and mixing proportions { lkw }. The prior on A = {al } specifies the probability ⇡lk of appearance class k being chosen for part l. Therefore appearance parameters ✓a = {⇡lk , lkw } (see Fig. 3) and: a p(xi |A, si , ✓ ) = p(A|✓a ) = Y l Y l a sli p(xi |al , ✓ ) p(al |✓a ) = = Y Y X YY l l k w lkw N (xi |µw , ⌃w ) !alk !sli (⇡lk )alk . , (2) (3) k Combining shapes and appearances: To summarize, the latent variables for X are A, S, H, and the model’s active parameters ✓ include shape parameters ✓s and appearance parameters ✓a , so that p(X, A, S, H|✓) = Y 1 p(A|✓a )p(S, H|✓s ) p(xi |A, si , ✓a ) , Z( ) i (4) where the parameter adjusts the relative contributions of the shape and appearance components. See Fig. 4 for an illustration of the complete graphical model. During learning, we find the values of ✓ that maximize the likelihood of the training data D, and segmentation is performed on a previously-unseen image by querying the marginal distribution p(S|Xtest , ✓). Note that Z( ) is constant throughout the execution of the algorithms. We set via trial and error in our experiments. 3 n H ✓a si al H xi L+1 ✓s S X A P Figure 4: A model of shape and appearance. Left: The joint model. Pixels xi are modeled via appearance variables al . The model’s belief about each layer’s shape is captured by shape variables H. Segmentation variables si assign each pixel to a layer. Right: Schematic for an image X. 2 Inference and learning Inference: We approximate p(A, S, H|X, ✓) by drawing samples of A, S and H using block-Gibbs Markov Chain Monte Carlo (MCMC). The desired distribution p(S|X, ✓) can then be obtained by considering only the samples for S (see Algorithm 1). 
In order to sample p(A|S, H, X, ✓) we consider the conditional distribution of appearance class k being chosen for part l which is given by: Q P ·s ⇡lk i ( w lkw N (xi |µw , ⌃w )) li h Q P i. p(alk = 1|S, X, ✓) = P (5) K ·sli r=1 ⇡lr i( w lrw N (xi |µw , ⌃w )) Since the MSBM only has edges between each pair of adjacent layers, all hidden units within a layer are conditionally independent given the units in the other two layers. This property can be exploited to make inference in the shape model exact and efficient. The conditional probabilities are: X X 1 2 p(h1 = 1|s, h2 , ✓) = ( wlij sli + wjk h2 + c1 ), (6) j k j i,l p(h2 k 1 = 1|h , ✓) = ( X k 2 wjk h1 j + c2 ), j (7) j where (y) = 1/(1 + exp( y)) is the sigmoid function. To sample from p(H|S, X, ✓) we iterate between Eqns. 6 and 7 multiple times and keep only the final values of h1 and h2 . Finally, we draw samples for the pixels in p(S|A, H, X, ✓) independently: P 1 exp( j wlij h1 + bli ) p(xi |A, sli = 1, ✓) j p(sli = 1|A, H, X, ✓) = PL . (8) P 1 1 m=1 exp( j wmij hj + bmi ) p(xi |A, smi = 1, ✓) Seeding: Since the latent-space is extremely high-dimensional, in practice we find it helpful to run several inference chains, each initializing S(1) to a different value. The ‘best’ inference is retained and the others are discarded. The computation of the likelihood p(X|✓) of image X is intractable, so we approximate the quality of each inference using a scoring function: 1X Score(X|✓) = p(X, A(t) , S(t) , H(t) |✓), (9) T t where {A(t) , S(t) , H(t) }, t = 1...T are the samples obtained from the posterior p(A, S, H|X, ✓). If the samples were drawn from the prior p(A, S, H|✓) the scoring function would be an unbiased estimator of p(X|✓), but would be wildly inaccurate due to the high probability of missing the important regions of latent space (see e.g. [12, p. 107-109] for further discussion of this issue). Learning: Learning of the model involves maximizing the log likelihood log p(D|✓a , ✓s ) of the training dataset D with respect to the model parameters ✓a and ✓s . Since training is partially supervised, in that for each image X its corresponding segmentation S is also given, we can learn the parameters of the shape and appearance components separately. For appearances, the learning of the mixing coefficients and the histogram parameters decomposes into standard mixture updates independently for each part. For shapes, we follow the standard deep 4 Algorithm 1 MCMC inference algorithm. 1: procedure I NFER(X, ✓) 2: Initialize S(1) , H(1) 3: for t 2 : chain length do 4: A(t) ⇠ p(A|S(t 1) , H(t 1) , X, ✓) 5: S(t) ⇠ p(S|A(t) , H(t 1) , X, ✓) 6: H(t) ⇠ p(H|S(t) , ✓) 7: return {S(t) }t=burnin:chain length learning literature closely [13, 1]. In the pre-training phase we greedily train the model bottom up, one layer at a time. We begin by training an RBM on the observed data using stochastic maximum likelihood learning (SML; also referred to as ‘persistent CD’; [14, 13]). Once this RBM is trained, we infer the conditional mean of the hidden units for each training image. The resulting vectors then serve as the training data for a second RBM which is again trained using SML. We use the parameters of these two RBMs to initialize the parameters of the full MSBM model. In the second phase we perform approximate stochastic gradient ascent in the likelihood of the full model to finetune the parameters in an EM-like scheme as described in [13]. 
3 Related work Existing probabilistic models of images can be categorized by the amount of variability they expect to encounter in the data and by how they model this variability. A significant portion of the literature models images using only two parts: a foreground object and its background e.g. [15, 16, 17, 18, 19]. Models that account for the parts within the foreground object mainly differ in how accurately they learn about and represent the variability of the shapes of the object’s parts. In Probabilistic Index Maps (PIMs) [8] a mean partitioning is learned, and the deformable PIM [9] additionally allows for local deformations of this mean partitioning. Stel Component Analysis [10] accounts for larger amounts of shape variability by learning a number of different template means for the object that are blended together on a pixel-by-pixel basis. Factored Shapes and Appearances [11] models global properties of shape using a factor analysis-like model, and ‘masked’ RBMs have been used to model more local properties of shape [20]. However, none of these models constitute a strong model of shape in terms of realism of samples and generalization capabilities [1]. We demonstrate in Sec. 4 that, like the SBM, the MSBM does in fact possess these properties. The closest works to ours in terms of ability to deal with datasets that exhibit significant variability in both shape and appearance are the works of Bo and Fowlkes [21] and Thomas et al. [22]. Bo and Fowlkes [21] present an algorithm for pedestrian segmentation that models the shapes of the parts using several template means. The different parts are composed using hand coded geometric constraints, which means that the model cannot be automatically extended to other application domains. The Implicit Shape Model (ISM) used in [22] is reliant on interest point detectors and defines distributions over segmentations only in the posterior, and therefore is not fully generative. The model presented here is entirely learned from data and fully generative, therefore it can be applied to new datasets and diagnosed with relative ease. Due to its modular structure, we also expect it to rapidly absorb future developments in shape and appearance models. 4 Experiments Penn-Fudan pedestrians: The first dataset that we considered is Penn-Fudan pedestrians [23], consisting of 169 images of pedestrians (Fig. 6(a)). The images are annotated with ground-truth segmentations for L = 7 different parts (hair, face, upper and lower clothes, shoes, legs, arms; Fig. 6(d)). We compare the performance of the model with the algorithm of Bo and Fowlkes [21]. For the shape component, we trained an MSBM on the 684 images of a labeled version of the HumanEva dataset [24] (at 48 ⇥ 24 pixels; also flipped horizontally) with overlap b = 4, and 400 and 50 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs (iterations). After pre-training, joint training was performed for 1000 epochs. 5 (c) Completion (a) Sampling (b) Diffs ! ! ! Figure 5: Learned shape model. (a) A chain of samples (1000 samples between frames). The apparent ‘blurriness’ of samples is not due to averaging or resizing. We display the probability of each pixel belonging to different parts. If, for example, there is a 50-50 chance that a pixel belongs to the red or blue parts, we display that pixel in purple. (b) Differences between the samples and their most similar counterparts in the training dataset. (c) Completion of occlusions (pink). 
To assess the realism and generalization characteristics of the learned MSBM we sample from it. In Fig. 5(a) we show a chain of unconstrained samples from an MSBM generated via block-Gibbs MCMC (1000 samples between frames). The model captures highly non-linear correlations in the data whilst preserving the object’s details (e.g. face and arms). To demonstrate that the model has not simply memorized the training data, in Fig. 5(b) we show the difference between the sampled shapes in Fig. 5(a) and their closest images in the training set (based on per-pixel label agreement). We see that the model generalizes in non-trivial ways to generate realistic shapes that it had not encountered during training. In Fig. 5(c) we show how the MSBM completes rectangular occlusions. The samples highlight the variability in possible completions captured by the model. Note how, e.g. the length of the person’s trousers on one leg affects the model’s predictions for the other, demonstrating the model’s knowledge about long-range dependencies. An interactive M ATLAB GUI for sampling from this MSBM has been included in the supplementary material. The Penn-Fudan dataset (at 200 ⇥ 100 pixels) was then split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train the appearance component with a vocabulary of size W = 50 and K = 100 mixture components1 . We additionally constrained the model by sharing the appearance models for the arms and legs with that of the face. We assess the quality of the appearance model by performing the following experiment: for each test image, we used the scoring function described in Eq. 9 to evaluate a number of different proposal segmentations for that image. We considered 10 randomly chosen segmentations from the training dataset as well as the ground-truth segmentation for the test image, and found that the appearance model correctly assigns the highest score to the ground-truth 95% of the time. During inference, the shape and appearance models (which are defined on images of different sizes), were combined at 200 ⇥ 100 pixels via M ATLAB’s imresize function, and we set = 0.8 (Eq. 8) via trial and error. Inference chains were seeded at 100 exemplar segmentations from the HumanEva dataset (obtained using the K-medoids algorithm with K = 100), and were run for 20 Gibbs iterations each (with 5 iterations of Eqs. 6 and 7 per Gibbs iteration). Our unoptimized M ATLAB implementation completed inference for each chain in around 7 seconds. We compute the conditional probability of each pixel belonging to different parts given the last set of samples obtained from the highest scoring chain, assign each pixel independently to the most likely part at that pixel, and report the percentage of correctly labeled pixels (see Table 1). We find that accuracy can be improved using superpixels (SP) computed on X (pixels within a superpixel are all assigned the most common label within it; as with [21] we use gPb-OWT-UCM [25]). We also report the accuracy obtained, had the top scoring seed segmentation been returned as the final segmentation for each image. Here the quality of the seed is determined solely by the appearance model. We observe that the model has comparable performance to the state-of-the-art but pedestrianspecific algorithm of [21], and that inference in the model significantly improves the accuracy of the segmentations over the baseline (top seed+SP). Qualitative results can be seen in Fig. 6(c). 
1 We obtained the best quantitative results with these settings. The appearances exhibited by the parts in the dataset are highly varied, and the complexity of the appearance model reflects this fact. 6 Table 1: Penn-Fudan pedestrians. We report the percentage of correctly labeled pixels. The final column is an average of the background, upper and lower body scores (as reported in [21]). FG BG Upper Body Lower Body Head Average Bo and Fowlkes [21] 73.3% 81.1% 73.6% 71.6% 51.8% 69.5% MSBM MSBM + SP 70.7% 71.6% 72.8% 73.8% 68.6% 69.9% 66.7% 68.5% 53.0% 54.1% 65.3% 66.6% Top seed Top seed + SP 59.0% 61.6% 61.8% 67.3% 56.8% 60.8% 49.8% 54.1% 45.5% 43.5% 53.5% 56.4% Table 2: ETHZ cars. We report the percentage of pixels belonging to each part that are labeled correctly. The final column is an average weighted by the frequency of occurrence of each label. BG Body Wheel Window Bumper License Light Average ISM [22] 93.2% 72.2% 63.6% 80.5% 73.8% 56.2% 34.8% 86.8% MSBM 94.6% 72.7% 36.8% 74.4% 64.9% 17.9% 19.9% 86.0% Top seed 92.2% 68.4% 28.3% 63.8% 45.4% 11.2% 15.1% 81.8% ETHZ cars: The second dataset that we considered is the ETHZ labeled cars dataset [22], which itself is a subset of the LabelMe dataset [23], consisting of 139 images of cars, all in the same semiprofile view (Fig. 7(a)). The images are annotated with ground-truth segmentations for L = 6 parts (body, wheel, window, bumper, license plate, headlight; Fig. 7(d)). We compare the performance of the model with the ISM of Thomas et al. [22], who also report their results on this dataset. The dataset was split into 10 train/test cross-validation splits without replacement. We used the training images in each split to train both the shape and appearance components. For the shape component, we trained an MSBM at 50 ⇥ 50 pixels with overlap b = 4, and 2000 and 100 hidden units in the first and second layers respectively. Each layer was pre-trained for 3000 epochs and joint training was performed for 1000 epochs. The appearance model was trained with a vocabulary of size W = 50 and K = 100 mixture components and we set = 0.7. Inference chains were seeded at 50 exemplar segmentations (obtained using K-medoids). We find that the use of superpixels does not help with this dataset (due to the poor quality of superpixels obtained for these images). Qualitative and quantitative results that show the performance of model to be comparable to the state-of-the-art ISM can be seen in Fig. 7(c) and Table 2. We believe the discrepancy in accuracy between the MSBM and ISM on the ‘license’ and ‘light’ labels to mainly be due to ISM’s use of interest-points, as they are able to locate such fine structures accurately. By incorporating better models of part appearance into the generative model, we expect to see this discrepancy decrease. 5 Conclusions and future work In this paper we have shown how the SBM can be extended to obtain the MSBM, and presented a principled probabilistic model of images of objects that exploits the MSBM as its model for part shapes. We demonstrated how object segmentations can be obtained simply by performing MCMC inference in the model. The model can also be treated as a probabilistic evaluator of segmentations: given a proposal segmentation it can be used to estimate its likelihood. This leads us to believe that the combination of a generative model such as ours, with a discriminative, bottom-up segmentation algorithm could be highly effective. 
We are currently investigating how textured appearance models, which take into account the spatial structure of pixels, affect the learning and inference algorithms and the performance of the model.

Acknowledgments

Thanks to Charless Fowlkes and Vittorio Ferrari for access to datasets, and to Pushmeet Kohli and John Winn for valuable discussions. AE has received funding from the Carnegie Trust, the SORSAS scheme, and the IST Programme under the PASCAL2 Network of Excellence (IST-2007-216886).

Figure 6: Penn-Fudan pedestrians. (a) Test images. (b) Results reported by Bo and Fowlkes [21]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [21]. (Label legend: Background, Hair, Face, Upper, Shoes, Legs, Lower, Arms.)

Figure 7: ETHZ cars. (a) Test images. (b) Results reported by Thomas et al. [22]. (c) Output of the joint model. (d) Ground-truth images. Images shown are those selected by [22]. (Label legend: Background, Body, Wheel, Window, Bumper, License, Headlight.)

References

[1] S. M. Ali Eslami, Nicolas Heess, and John Winn. The Shape Boltzmann Machine: a Strong Model of Object Shape. In IEEE CVPR, 2012.
[2] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88:303–338, 2010.
[3] Martin Fischler and Robert Elschlager. The Representation and Matching of Pictorial Structures. IEEE Transactions on Computers, 22(1):67–92, 1973.
[4] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, 1982.
[5] Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94:115–147, 1987.
[6] Ashish Kapoor and John Winn. Located Hidden Random Fields: Learning Discriminative Parts for Object Detection. In ECCV, pages 302–315, 2006.
[7] John Winn and Jamie Shotton. The Layout Consistent Random Field for Recognizing and Segmenting Partially Occluded Objects. In IEEE CVPR, pages 37–44, 2006.
[8] Nebojsa Jojic and Yaron Caspi. Capturing Image Structure with Probabilistic Index Maps. In IEEE CVPR, pages 212–219, 2004.
[9] John Winn and Nebojsa Jojic. LOCUS: Learning object classes with unsupervised segmentation. In ICCV, pages 756–763, 2005.
[10] Nebojsa Jojic, Alessandro Perina, Marco Cristani, Vittorio Murino, and Brendan Frey. Stel component analysis. In IEEE CVPR, pages 2044–2051, 2009.
[11] S. M. Ali Eslami and Christopher K. I. Williams. Factored Shapes and Appearances for Parts-based Object Understanding. In BMVC, pages 18.1–18.12, 2011.
[12] Nicolas Heess. Learning generative models of mid-level structure in natural images. PhD thesis, University of Edinburgh, 2011.
[13] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann Machines. In AISTATS, volume 5, pages 448–455, 2009.
[14] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, pages 1064–1071, 2008.
[15] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. "GrabCut": interactive foreground extraction using iterated graph cuts. ACM SIGGRAPH, 23:309–314, 2004.
[16] Eran Borenstein, Eitan Sharon, and Shimon Ullman. Combining Top-Down and Bottom-Up Segmentation. In CVPR Workshop on Perceptual Organization in Computer Vision, 2004.
[17] Himanshu Arora, Nicolas Loeff, David Forsyth, and Narendra Ahuja. Unsupervised Segmentation of Objects using Efficient Learning. In IEEE CVPR, pages 1–7, 2007.
[18] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. ClassCut for unsupervised class segmentation. In ECCV, pages 380–393, 2010.
[19] Nicolas Heess, Nicolas Le Roux, and John Winn. Weakly Supervised Learning of Foreground-Background Segmentation using Masked RBMs. In ICANN, 2011.
[20] Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a Generative Model of Images by Factoring Appearance and Shape. Neural Computation, 23(3):593–650, 2011.
[21] Yihang Bo and Charless Fowlkes. Shape-based Pedestrian Parsing. In IEEE CVPR, 2011.
[22] Alexander Thomas, Vittorio Ferrari, Bastian Leibe, Tinne Tuytelaars, and Luc Van Gool. Using Recognition and Annotation to Guide a Robot's Attention. IJRR, 28(8):976–998, 2009.
[23] Bryan Russell, Antonio Torralba, Kevin Murphy, and William Freeman. LabelMe: A Database and Tool for Image Annotation. International Journal of Computer Vision, 77:157–173, 2008.
[24] Leonid Sigal, Alexandru Balan, and Michael Black. HumanEva. International Journal of Computer Vision, 87(1-2):4–27, 2010.
[25] Pablo Arbelaez, Michael Maire, Charless C. Fowlkes, and Jitendra Malik. From Contours to Regions: An Empirical Evaluation. In IEEE CVPR, 2009.

3 0.82565981 37 nips-2012-Affine Independent Variational Inference

Author: Edward Challis, David Barber

Abstract: We consider inference in a broad class of non-conjugate probabilistic models based on minimising the Kullback-Leibler divergence between the given target density and an approximating 'variational' density. In particular, for generalised linear models we describe approximating densities formed from an affine transformation of independently distributed latent variables, a class that includes many well-known densities as special cases. We show how all relevant quantities can be efficiently computed using the fast Fourier transform. This extends the known class of tractable variational approximations and enables, for example, the fitting of skew variational densities to the target density. 1
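To make the construction in this abstract concrete: an 'affine independent' approximating density is obtained by pushing independent latent variables z through an affine map w = Az + b, so samples are cheap to draw and the entropy separates as H[q(w)] = H[q(z)] + log|det A|. The sketch below illustrates this family with skew-normal latents; the dimensions, the triangular A, and the parameter values are our own assumptions, and it does not reproduce the paper's FFT-based computations.

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(0)
D = 3                                        # dimensionality of the parameter vector w
A = np.tril(rng.normal(size=(D, D)))         # affine map; lower-triangular keeps log|det A| cheap
np.fill_diagonal(A, np.abs(np.diag(A)) + 0.1)
b = rng.normal(size=D)

# Independent (here skew-normal) latents z, pushed through w = A z + b.
z = skewnorm.rvs(a=4.0, size=(10000, D), random_state=0)
w = z @ A.T + b

# Entropy of the affine family: H[q(w)] = H[q(z)] + log|det A|.
H_z = D * skewnorm.entropy(a=4.0)            # sum of the independent marginal entropies
H_w = H_z + np.linalg.slogdet(A)[1]
print("sample mean of w:", w.mean(axis=0))
print("entropy of q(w): ", H_w)
```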

same-paper 4 0.81296402 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound

Author: Shusen Wang, Zhihua Zhang

Abstract: The CUR matrix decomposition is an important extension of Nyström approximation to a general matrix. It approximates any data matrix in terms of a small number of its columns and rows. In this paper we propose a novel randomized CUR algorithm with an expected relative-error bound. The proposed algorithm has the advantages over the existing relative-error CUR algorithms that it possesses tighter theoretical bound and lower time complexity, and that it can avoid maintaining the whole data matrix in main memory. Finally, experiments on several real-world datasets demonstrate significant improvement over the existing relative-error algorithms. 1
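As a rough illustration of what a CUR decomposition produces (not the algorithm proposed in this paper): sample a few columns and rows of A, for instance with probabilities proportional to their squared norms, and set U = pinv(C) A pinv(R), which minimises the Frobenius error for the chosen C and R. Everything in the sketch below (sampling scheme, sizes, test matrix) is an assumption made for illustration.

```python
import numpy as np

def naive_cur(A, c, r, rng=None):
    """Illustrative randomized CUR: sample c columns and r rows of A with
    probabilities proportional to their squared norms, then set
    U = pinv(C) @ A @ pinv(R) so that A is approximated by C @ U @ R."""
    rng = np.random.default_rng(0) if rng is None else rng
    col_p = (A ** 2).sum(axis=0); col_p /= col_p.sum()
    row_p = (A ** 2).sum(axis=1); row_p /= row_p.sum()
    cols = rng.choice(A.shape[1], size=c, replace=False, p=col_p)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=row_p)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R

# Tiny example on a nearly low-rank matrix.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 150)) + 0.01 * rng.normal(size=(200, 150))
C, U, R = naive_cur(A, c=30, r=30)
err = np.linalg.norm(A - C @ U @ R, "fro") / np.linalg.norm(A, "fro")
print(f"relative Frobenius error: {err:.4f}")
```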

5 0.78489673 321 nips-2012-Spectral learning of linear dynamics from generalised-linear observations with application to neural population data

Author: Lars Buesing, Maneesh Sahani, Jakob H. Macke

Abstract: Latent linear dynamical systems with generalised-linear observation models arise in a variety of applications, for instance when modelling the spiking activity of populations of neurons. Here, we show how spectral learning methods (usually called subspace identification in this context) for linear systems with linear-Gaussian observations can be extended to estimate the parameters of a generalised-linear dynamical system model despite a non-linear and non-Gaussian observation process. We use this approach to obtain estimates of parameters for a dynamical model of neural population data, where the observed spike-counts are Poisson-distributed with log-rates determined by the latent dynamical process, possibly driven by external inputs. We show that the extended subspace identification algorithm is consistent and accurately recovers the correct parameters on large simulated data sets with a single calculation, avoiding the costly iterative computation of approximate expectation-maximisation (EM). Even on smaller data sets, it provides an effective initialisation for EM, avoiding local optima and speeding convergence. These benefits are shown to extend to real neural data.
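The model class in this abstract is simple to write down generatively: a latent linear dynamical system x_{t+1} = A x_t + noise, with spike counts y_t drawn from Poisson distributions whose log-rates are C x_t + d. The simulation sketch below uses made-up dimensions and parameter values purely to illustrate that generative process; it does not implement the subspace-identification estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 500, 2, 20             # time bins, latent dimension, number of neurons

A = np.array([[0.95, 0.10],      # stable, rotation-like latent dynamics
              [-0.10, 0.95]])
Q = 0.05 * np.eye(d)             # innovation covariance
C = rng.normal(scale=0.5, size=(n, d))   # loading matrix mapping latents to log-rates
dvec = np.log(5.0) * np.ones(n)          # baseline log-rates (roughly 5 spikes per bin)

x = np.zeros((T, d))
y = np.zeros((T, n), dtype=int)
for t in range(1, T):
    x[t] = A @ x[t - 1] + rng.multivariate_normal(np.zeros(d), Q)
    y[t] = rng.poisson(np.exp(C @ x[t] + dvec))   # Poisson counts with log-rates from the latent state

print("mean spike count per bin:", y.mean())
```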

6 0.73102951 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

7 0.72531772 38 nips-2012-Algorithms for Learning Markov Field Policies

8 0.72520542 162 nips-2012-Inverse Reinforcement Learning through Structured Classification

9 0.72481126 255 nips-2012-On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes

10 0.72410923 160 nips-2012-Imitation Learning by Coaching

11 0.72401434 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning

12 0.72345608 292 nips-2012-Regularized Off-Policy TD-Learning

13 0.72332424 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

14 0.72315222 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search

15 0.72288275 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback

16 0.72151881 364 nips-2012-Weighted Likelihood Policy Search with Model Selection

17 0.72104424 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress

18 0.72083879 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models

19 0.72013611 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

20 0.71879858 56 nips-2012-Bayesian active learning with localized priors for fast receptive field characterization