nips nips2012 nips2012-258 knowledge-graph by maker-knowledge-mining

258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection


Source: pdf

Author: Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville

Abstract: Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online ℓ1-dictionary learning where, unlike traditional dictionary learning, which uses squared loss, the ℓ1-penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on the alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data show that this online ℓ1-dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. [sent-11, score-0.836]

2 Motivated by this challenge, we introduce the problem of online ℓ1-dictionary learning where, unlike traditional dictionary learning, which uses squared loss, the ℓ1-penalty is used for measuring the reconstruction error. [sent-12, score-0.785]

3 We present an efficient online algorithm for this problem based on alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. [sent-13, score-0.569]

4 Empirical results on news-stream and Twitter data show that this online ℓ1-dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. [sent-14, score-1.051]

5 The key challenge in automatic detection of breaking news is being able to detect novel documents in a stream of text, where a document is considered novel if it is “unlike” documents seen in the past. [sent-17, score-1.166]

6 Recently, this has been made possible by dictionary learning, which has emerged as a powerful data representation framework. [sent-18, score-0.459]

7 In dictionary learning each data point y is represented as a sparse linear combination Ax of dictionary atoms, where A is the dictionary and x is a sparse vector [1, 12]. [sent-19, score-1.527]
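To make the representation concrete, here is a minimal numpy sketch (illustration only; the dictionary and code values are made up) of a data point expressed as a sparse combination of dictionary atoms:

```python
import numpy as np

# A hypothetical dictionary with 4 atoms (columns) in a 3-dimensional feature space.
A = np.array([[0.6, 0.0, 0.2, 0.1],
              [0.3, 0.9, 0.0, 0.1],
              [0.1, 0.1, 0.8, 0.8]])

# A sparse code: only two of the four atoms participate in the representation.
x = np.array([0.7, 0.0, 0.3, 0.0])

# The data point y is (approximately) the sparse linear combination Ax.
y = A @ x
print(y)  # [0.48 0.21 0.31]
```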

8 At the end of timestep t, the dictionary is updated to represent all the documents till time t. [sent-21, score-1.005]

9 [10] presented such a (batch) dictionary learning approach for detecting novel documents/topics. [sent-23, score-0.612]

10 … (commonly used in the dictionary learning literature) as the ℓ1-penalty has been found to be more effective for text analysis (see Section 3). [sent-30, score-0.49]

11 We build upon this work by proposing an efficient algorithm for online dictionary learning with the ℓ1-penalty. [sent-32, score-0.723]

12 Our online dictionary learning algorithm is based on the online alternating directions method which was recently proposed by Wang and Banerjee [19] to solve online composite optimization problems with additional linear equality constraints. [sent-33, score-1.277]

13 Traditional online convex optimization methods such as [25, 8, 5, 6, 22] require explicit computation of the subgradient, making them computationally expensive to apply in our high-volume text setting, whereas in our algorithm the subgradients are computed implicitly. [sent-34, score-0.367]

14 Under suitable assumptions (to cope with the non-convexity of the dictionary learning problem), we establish an $O(\sqrt{T})$ regret bound for the objective, matching the regret bounds of existing methods [25, 5, 6, 22]. [sent-36, score-0.695]

15 Using this online algorithm for ℓ1-dictionary learning, we obtain an online algorithm for novel document detection, which we empirically validate on traditional news-streams as well as streaming data from Twitter. [sent-37, score-0.921]

16 Experimental results show a substantial speedup over the batch ℓ1-dictionary learning based approach of Kasiviswanathan et al. [sent-38, score-0.317]

17 Online dictionary learning was recently introduced by Mairal et al. [sent-42, score-0.49]

18 They considered an ℓ2-penalty and showed that their online algorithm converges to the minimum objective value in the stochastic case (i. [sent-44, score-0.264]

19 The problem of novel document/topics detection was also addressed by a recent work of Saha et al. [sent-48, score-0.299]

20 However, their algorithm operates over a sliding time window (does not have online regret guarantees) and works only for the ℓ2-penalty. [sent-50, score-0.411]

21 Classic dictionary learning techniques for sparse representation (see [1, 15, 12] and references therein) consider a finite training set of signals P = [p1 , . [sent-63, score-0.534]

22 We define the problem of dictionary learning as that of minimizing the empirical cost f(A). [sent-69, score-0.459]

23 In other words, dictionary learning is the following optimization problem: $\min_{A} f(A) \stackrel{\mathrm{def}}{=} \min_{A,X} f(A,X) = \min_{A,X} \|P - AX\|_1 + \lambda \|X\|_1$, where $f(A) = \sum_{i=1}^{n} l(p_i, A)$. [sent-70, score-0.548]
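To pin down the notation, the following numpy sketch (illustration only, with random data) evaluates the ℓ1 dictionary-learning objective for given P, A, X; here the matrix ℓ1-norm is the entrywise sum of absolute values, as used in the objective above:

```python
import numpy as np

def l1_dictionary_objective(P, A, X, lam):
    """Entrywise-l1 reconstruction error plus l1 sparsity penalty."""
    residual = P - A @ X
    return np.abs(residual).sum() + lam * np.abs(X).sum()

# Toy usage (shapes: P is m x n, A is m x k, X is k x n).
rng = np.random.default_rng(0)
P = rng.random((5, 8))
A = rng.random((5, 3))
X = rng.random((3, 8))
print(l1_dictionary_objective(P, A, X, lam=0.1))
```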

24 3 Novel Document Detection Using Dictionary Learning In this section, we describe the problem of novel document detection and explain how dictionary learning could be used to tackle this problem. [sent-80, score-0.96]

25 Let {Pt} denote a sequence of streaming matrices where each column of Pt represents a document arriving at time t. [sent-87, score-0.45]

26 Each document is represented in some conventional vector space model such as TF-IDF [13]. [sent-89, score-0.233]
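As a concrete illustration of this preprocessing step (a sketch using scikit-learn; which implementation the authors used is not stated in this summary), documents can be mapped to TF-IDF vectors and scaled to unit ℓ1-length, matching the normalization mentioned later:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs_t = [
    "breaking news about a new smartphone release",
    "sports scores from last night's game",
]

# norm='l1' scales each document vector to unit l1-length.
vectorizer = TfidfVectorizer(norm="l1")
Pt = vectorizer.fit_transform(docs_t).T  # columns = documents, as in the matrix P_t

print(Pt.shape)             # (vocabulary size, number of documents n_t)
print(abs(Pt).sum(axis=0))  # each column sums to 1 under the l1 normalization
```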

27 We use nt to represent the number of documents arriving at time t. [sent-93, score-0.421]

28 Let Nt be the number of documents arriving at time ≤ t, then P[t] ∈ Rm×Nt . [sent-105, score-0.341]

29 Under this setup, the goal of novel document detection is to identify documents in Pt that are “dissimilar” to the documents in P[t−1] . [sent-106, score-0.949]

30 Let At ∈ R^{m×k} represent the dictionary matrix after time t − 1, where At is a good basis for representing all the documents in P[t−1]. [sent-108, score-1.171]

31 The exact construction of the dictionary is described later. [sent-109, score-0.459]

32 Now, consider a document y ∈ Rm appearing at time t. [sent-110, score-0.262]

33 We can use sparse coding to detect novel documents as follows. [sent-123, score-0.516]

34 For each document y arriving at time t, we do the following. [sent-124, score-0.35]

35 If the objective value l(y, At) is “big,” then we mark the document as novel; otherwise we mark it as non-novel. [sent-126, score-0.514]
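This per-document test is a small convex program. The sketch below (using cvxpy as one possible solver, whereas the paper itself uses ADMM; the data and the threshold ζ are made up) solves the sparse coding problem and applies the thresholding rule just described:

```python
import numpy as np
import cvxpy as cp

def novelty_score(y, At, lam):
    """Solve min_{x >= 0} ||y - At x||_1 + lam * ||x||_1 and return the optimal value."""
    x = cp.Variable(At.shape[1], nonneg=True)
    objective = cp.Minimize(cp.norm1(y - At @ x) + lam * cp.norm1(x))
    return cp.Problem(objective).solve()

# Toy usage: random dictionary and document, hypothetical lambda and zeta.
rng = np.random.default_rng(0)
At = rng.random((20, 5))
y = rng.random(20)
lam, zeta = 0.1, 5.0
score = novelty_score(y, At, lam)
print("novel" if score > zeta else "non-novel", score)
```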

36 Since we have normalized all documents in Pt to unit ℓ1-length, the objective values are on the same scale. [sent-127, score-0.224]

37 We empirically validate the superiority of using the ℓ1-penalty for novel document detection in Section 5. [sent-137, score-0.531]

38 Ideally, in our application setting, changing the size of the dictionary (k) dynamically with t would lead to a more efficient and effective sparse coding. [sent-139, score-0.534]

39 In our experiments, we allow for small increases in the size of the dictionary over time when required. [sent-141, score-0.513]

40 We now describe a simple batch algorithm (slightly modified from [10]) for detecting novel documents. [sent-143, score-0.434]

41 Algorithm BATCH alternates between a novel document detection step and a batch dictionary learning step. [sent-144, score-1.215]

42 Input: [p1, . . . , pnt] ∈ R^{m×nt}, At ∈ R^{m×k}, λ ≥ 0, ζ ≥ 0. Novel Document Detection Step: for j = 1 to nt, solve $x_j = \arg\min_{x \geq 0} \|p_j - A_t x\|_1 + \lambda \|x\|_1$; if $\|p_j - A_t x_j\|_1 + \lambda \|x_j\|_1 > \zeta$, mark pj as novel. Batch Dictionary Learning Step: set P[t] ← [P[t−1] | p1, . . . [sent-148, score-0.333]

43 At time t, the dictionary learning step is $[A_{t+1}, X_{[t]}] = \arg\min_{A \in \mathcal{A},\, X \geq 0} \|P_{[t]} - AX\|_1 + \lambda \|X\|_1$. [sent-153, score-0.488]
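The joint minimization over A and X is non-convex, but with the codes X held fixed the A-update is a convex program (this is the flavor of the approximation discussed in the next sentence, where only A is updated). Below is a hedged cvxpy sketch; the constraint set A is assumed here, purely for illustration, to bound the ℓ1-norm of each dictionary column, in line with the ℓ1-ball projection mentioned later; the paper's exact constraint set is not spelled out in this summary:

```python
import numpy as np
import cvxpy as cp

def update_dictionary(P, X, col_l1_bound=1.0):
    """Sketch of A = argmin_{A in A} ||P - A X||_1 with the codes X held fixed."""
    m, k = P.shape[0], X.shape[0]
    A = cp.Variable((m, k))
    # Assumed feasible set: every dictionary column lies in an l1-ball.
    constraints = [cp.norm1(A[:, j]) <= col_l1_bound for j in range(k)]
    objective = cp.Minimize(cp.sum(cp.abs(P - A @ X)))  # entrywise l1 reconstruction error
    cp.Problem(objective, constraints).solve()
    return A.value

# Toy usage with random data.
rng = np.random.default_rng(1)
P = rng.random((20, 30))   # documents seen so far, P_[t]
X = rng.random((5, 30))    # fixed sparse codes X_[t]
print(update_dictionary(P, X).shape)  # (20, 5)
```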

44 To achieve computational efficiency, in [10], the authors solved an approximation of (3) where in the dictionary learning step they only update the A’s and not the X’s. [sent-157, score-0.459]

45 This leads to faster running times, but because of the approximation, the quality of the dictionary degrades over time and the performance of the algorithm decreases. [sent-158, score-0.565]

46 In this paper, we propose an online learning algorithm for (3) and show that this online algorithm is both computationally efficient and generates good quality dictionaries under reasonable assumptions. [sent-159, score-0.608]

47 4 Online ℓ1-Dictionary Learning. In this section, we introduce the online ℓ1-dictionary learning problem and propose an efficient algorithm for it. [sent-160, score-0.264]

48 The standard goal of online learning is to design algorithms whose regret is sublinear in time T , since this implies that “on the average” the algorithm performs as well as the best fixed strategy in hindsight [18]. [sent-161, score-0.546]

49 … (polynomial running time) algorithms that solve it without making any assumptions on either the dictionary (A) or the sparse code (X). [sent-165, score-0.585]

50 This also means that it may not be possible to design an efficient online algorithm with sublinear regret without making any assumptions on either A or X, because an efficient online algorithm with sublinear regret would imply an efficient algorithm for solving (1) in the offline case. [sent-166, score-1.008]

51 Therefore, we focus on obtaining regret bounds for the dictionary update, assuming that at each timestep the sparse codes given to the batch and online algorithms are “close”. [sent-167, score-1.413]

52 At time t, the online algorithm picks $\hat{A}_{t+1} \in \mathcal{A}$. [sent-171, score-0.293]

53 … , x_{nt}], where the xj's come from the novel document detection step at time t. [sent-176, score-0.53]

54 In [10], the dictionary learning step is $A_{t+1} = \arg\min_{A \in \mathcal{A}} \|P_{[t]} - A X_{[t]}\|_1$. [sent-177, score-0.459]

55 The regret defined above admits the discrepancy between the sparse coding matrices supplied to the batch and online algorithms through the error matrix. [sent-180, score-0.817]
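To make the role of the error matrix concrete, one plausible way to write this regret (an assumption for illustration, not a formula quoted from the paper) is the following LaTeX sketch, where Â_t is the online dictionary, X_t are the codes supplied to the batch comparator, and X̂_t = X_t + E_t are the codes supplied to the online algorithm:

```latex
% Hedged sketch of a regret with an error matrix E_t; the paper's exact definition may differ.
R(T) \;=\; \sum_{t=1}^{T} \bigl\| P_t - \hat{A}_t \hat{X}_t \bigr\|_1
       \;-\; \min_{A \in \mathcal{A}} \sum_{t=1}^{T} \bigl\| P_t - A X_t \bigr\|_1,
\qquad \hat{X}_t = X_t + E_t .
```

When E_t = 0 for all t, this reduces to the usual regret against the best fixed dictionary in hindsight, consistent with the statement below that X̂_t = X_t (so E_t = 0) yields an O(√T) bound.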

56 The reason for this generality is that, in our application setting, the sparse coding matrices used for updating the dictionaries of the batch and online algorithms could be different. [sent-181, score-0.754]

57 4.1 Online ℓ1-Dictionary Algorithm. In this section, we design an algorithm for the online ℓ1-dictionary learning problem, which we call Online Inexact ADMM (OIADMM) and bound its regret. [sent-185, score-0.264]

58 First, note that because of the non-smooth ℓ1-norms involved, it is computationally expensive to apply standard online learning algorithms like online gradient descent [25, 8], COMID [6], FOBOS [5], and RDA [22], as they require computing a costly subgradient at every iteration. [sent-186, score-0.5]

59 Our algorithm for online ℓ1-dictionary learning is based on the online alternating direction method which was recently proposed by Wang et al. [sent-188, score-0.588]

60 In a standard online learning setting, the (Pt, Xt) made available to the online learning algorithm will be the same as the (Pt, Xt) made available to the batch dictionary learning algorithm in hindsight, so that $\hat{X}_t = X_t \Rightarrow E_t = 0$, yielding an $O(\sqrt{T})$ regret. [sent-213, score-1.242]

61 More generally, as long as $\sum_{t=1}^{T} \|E_t\|_p = o(T)$ for some suitable p-norm, we get a sublinear regret bound. [sent-214, score-0.227]

62 For the ith column of the dictionary matrix, the projection onto A can be done in O(si log m) time, where si is the number of non-zero elements in the ith column, using the projection onto the ℓ1-ball algorithm of Duchi et al. [sent-218, score-0.593]
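For reference, here is a standard sort-based implementation of Euclidean projection onto the ℓ1-ball in the spirit of Duchi et al. (a sketch of the generic O(n log n) version; the O(si log m) variant referred to above additionally exploits sparsity):

```python
import numpy as np

def project_l1_ball(v, z=1.0):
    """Euclidean projection of v onto the l1-ball {w : ||w||_1 <= z}."""
    if np.abs(v).sum() <= z:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]              # magnitudes, sorted in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)      # soft-thresholding level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# Toy usage: project one dictionary column onto the unit l1-ball.
col = np.array([0.8, -0.5, 0.3, 0.1])
w = project_l1_ball(col, z=1.0)
print(w, np.abs(w).sum())  # [ 0.6 -0.3  0.1  0. ] 1.0
```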

63 The simplest implementation of OIADMM takes O(mnk) time at each timestep because of the matrix multiplications involved. [sent-220, score-0.297]

64 5 Experimental Results. In this section, we present experiments to compare and contrast the performance of ℓ1-batch and ℓ1-online dictionary learning algorithms for the task of novel document detection. [sent-221, score-0.816]

65 In our implementation, we grow the dictionary size by η in each timestep. [sent-224, score-0.459]

66 Growing the dictionary size is essential for the batch algorithm because as t increases the number of columns of P[t] also increases, and therefore, a larger dictionary is required to compactly represent all the documents in P[t] . [sent-225, score-1.448]

67 The optimization problems arising in the sparse coding and dictionary learning steps are solved using ADMMs. [sent-228, score-0.6]

68 Our online algorithm uses the same novel document detection step as Algorithm BATCH, but dictionary learning is done using OIADMM. [sent-230, score-1.198]

69 If these sequences of matrices are close to each other, then we have a sublinear regret on the objective function. [sent-239, score-0.267]

70 For simplicity, we assume that each document is tagged with the single most dominant topic it is associated with, which we call the true topic of that document. [sent-242, score-0.289]

71 We call a document y arriving at time t novel if the true topic of y has not appeared before time t. [sent-243, score-0.531]

72 In our experiments, the number of documents introduced in each timestep is almost of the same order, and hence there is no need to change the size of the dictionary across timesteps for the online algorithm. [sent-246, score-1.259]

73 … document detection is to classify each document as either novel (positive) or non-novel (negative). [sent-248, score-0.734]

74 We want the dictionary at time t to be a good basis to represent all the documents in P[t] ∈ Rm×Nt . [sent-252, score-0.712]

75 This leads us to define the sparse reconstruction error (SRE) of a dictionary A at time t as $\mathrm{SRE}(A) \stackrel{\mathrm{def}}{=} \frac{1}{N_t} \min_{X \geq 0} \|P_{[t]} - AX\|_1 + \lambda \|X\|_1$. [sent-253, score-0.721]
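Because the objective separates across the columns of X, SRE can be evaluated by solving one small convex program per document. A hedged cvxpy sketch (illustration only, with toy data; the authors use their own ADMM solvers) is:

```python
import numpy as np
import cvxpy as cp

def sre(A, P_upto_t, lam):
    """SRE(A) = (1/N_t) * min_{X >= 0} ||P_[t] - A X||_1 + lam * ||X||_1, solved column-wise."""
    m, Nt = P_upto_t.shape
    total = 0.0
    for j in range(Nt):
        x = cp.Variable(A.shape[1], nonneg=True)
        obj = cp.Minimize(cp.norm1(P_upto_t[:, j] - A @ x) + lam * cp.norm1(x))
        total += cp.Problem(obj).solve()
    return total / Nt

# Toy usage with random data.
rng = np.random.default_rng(2)
A = rng.random((15, 4))
P = rng.random((15, 10))
print(sre(A, P, lam=0.1))
```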

76 A dictionary with a smaller SRE is better on average at sparsely representing the documents in P[t] . [sent-254, score-0.683]

77 To justify the choice of using an ℓ1-penalty (on the reconstruction error) for novel document detection, we performed experiments comparing ℓ1- vs. [sent-256, score-0.445]

78 The ADMM parameters for the sparse coding and batch dictionary learning steps are set as suggested in [10] (refer to the full version [11]). [sent-267, score-0.855]

79 In the batch algorithms, we grow the dictionary sizes by η = 10 in each timestep. [sent-268, score-0.714]

80 In our evaluation, we used a set of 9000 documents represented over 19528 terms and distributed into the top 30 TDT2 human-labeled topics over a period of 27 weeks. [sent-272, score-0.25]

81 At timestep 0, we introduce the first 1000 documents and these documents are used for initializing the dictionary. [sent-274, score-0.716]

82 In these experiments, the size of the initial dictionary is k = 200. [sent-276, score-0.459]

83 … , 8}, we provide the batch and online algorithms with the same set of 1000 documents. [sent-280, score-0.493]

84 In Figure 1, we present novel document detection results for those timesteps where at least one novel document was introduced. [sent-281, score-0.928]

85 The results show that using an ℓ1-penalty on the reconstruction error is better for novel document detection than using an ℓ2-penalty. [sent-283, score-0.614]

86 Figure 1: ROC curves for TDT2 for timesteps where novel documents were introduced. [sent-294, score-0.418]

87 The ℓ1-online and ℓ1-batch algorithms have almost identical performance in terms of detecting novel documents (see Table 1). [sent-296, score-0.377]

88 However, the online algorithm is much more computationally efficient. [sent-297, score-0.264]

89 As noted earlier, the running time of the batch algorithm goes up as t increases (as it has to optimize over the entire past). [sent-299, score-0.386]

90 However, the running time of the online algorithm is independent of the past and only depends on the number of documents introduced in each timestep (which in this case is always 1000). [sent-300, score-0.836]

91 Therefore, the running time of the online … (footnote: We used the SPAMS package http://spams-devel.) [sent-301, score-0.318]

92 As expected, the run-time gap between the ℓ1-batch and ℓ1-online algorithms widens as t increases – in the first timestep the online algorithm is 5. [sent-330, score-0.557]

93 In the first few timesteps, the SRE of the dictionaries produced by the online algorithm is slightly lower than that of the batch algorithm. [sent-338, score-0.627]

94 However, this gets corrected after a few timesteps and, as expected, later on the batch algorithm produces better dictionaries. [sent-339, score-0.351]

95 … ℓ1-online algorithms (in terms of SRE) and do a qualitative evaluation of the ℓ1-online algorithm for the novel document detection task. [sent-354, score-0.527]

96 Here, the size of the initial dictionary is k = 100. [sent-355, score-0.459]

97 At the first timestep, the online algorithm is already 10. [sent-357, score-0.532]

98 In this case, the SRE of the dictionaries produced by the batch algorithm is consistently better than that of the online algorithm, but as the running time plots suggest, this improvement comes at a very steep price. [sent-361, score-0.707]

99 Table 2: Sample novel documents detected by our online algorithm (e.g., “#iPhone Can’t wait for the iphone 4s #letstalkiphone”; “Everybody put an iPhone up in the air one time #ripstevejobs”). [sent-371, score-0.692]

100 Table 2 below shows a representative set of novel tweets identified by our online algorithm. [sent-372, score-0.497]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dictionary', 0.459), ('timestep', 0.268), ('batch', 0.255), ('online', 0.238), ('document', 0.233), ('documents', 0.224), ('sre', 0.212), ('pt', 0.21), ('impl', 0.173), ('twitter', 0.162), ('oiadmm', 0.154), ('detection', 0.144), ('tweets', 0.135), ('novel', 0.124), ('regret', 0.118), ('sublinear', 0.109), ('ax', 0.095), ('arriving', 0.088), ('reconstruction', 0.088), ('admm', 0.087), ('kasiviswanathan', 0.085), ('dictionaries', 0.08), ('nt', 0.08), ('argmina', 0.077), ('iphone', 0.077), ('melville', 0.077), ('sparse', 0.075), ('rm', 0.073), ('timesteps', 0.07), ('axt', 0.07), ('xt', 0.069), ('news', 0.067), ('coding', 0.066), ('auc', 0.065), ('banerjee', 0.058), ('aopt', 0.058), ('sept', 0.058), ('alternating', 0.055), ('running', 0.051), ('media', 0.05), ('pi', 0.047), ('breaking', 0.044), ('matrices', 0.04), ('android', 0.039), ('pnt', 0.039), ('smartphone', 0.039), ('smartphones', 0.039), ('throttling', 0.039), ('duchi', 0.038), ('streaming', 0.036), ('false', 0.035), ('ganesh', 0.034), ('et', 0.031), ('oct', 0.031), ('saha', 0.031), ('roc', 0.031), ('text', 0.031), ('speedup', 0.031), ('superiority', 0.03), ('atoms', 0.03), ('pj', 0.03), ('emerging', 0.029), ('sastry', 0.029), ('detecting', 0.029), ('time', 0.029), ('social', 0.028), ('produced', 0.028), ('minnesota', 0.028), ('watson', 0.028), ('mins', 0.028), ('multipliers', 0.028), ('topic', 0.028), ('vocabulary', 0.028), ('detect', 0.027), ('story', 0.027), ('marketing', 0.027), ('algorithm', 0.026), ('topics', 0.026), ('wang', 0.026), ('hindsight', 0.026), ('gt', 0.026), ('increases', 0.025), ('error', 0.025), ('rate', 0.025), ('till', 0.025), ('subgradients', 0.024), ('subgradient', 0.024), ('convex', 0.024), ('column', 0.024), ('zt', 0.024), ('mark', 0.024), ('day', 0.024), ('directions', 0.023), ('sign', 0.023), ('def', 0.023), ('opt', 0.023), ('stream', 0.022), ('zij', 0.022), ('min', 0.022), ('ibm', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

Author: Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville

Abstract: Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online ℓ1-dictionary learning where, unlike traditional dictionary learning, which uses squared loss, the ℓ1-penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on the alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data show that this online ℓ1-dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. 1

2 0.17083639 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

Author: Tingni Sun, Cun-hui Zhang

Abstract: This paper concerns the problem of matrix completion, which is to estimate a matrix from observations in a small subset of indices. We propose a calibrated spectrum elastic net method with a sum of the nuclear and Frobenius penalties and develop an iterative algorithm to solve the convex minimization problem. The iterative algorithm alternates between imputing the missing entries in the incomplete matrix by the current guess and estimating the matrix by a scaled soft-thresholding singular value decomposition of the imputed matrix until the resulting matrix converges. A calibration step follows to correct the bias caused by the Frobenius penalty. Under proper coherence conditions and for suitable penalties levels, we prove that the proposed estimator achieves an error bound of nearly optimal order and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems. Simulation results are presented to compare our proposal with previous ones. 1

3 0.13364272 358 nips-2012-Value Pursuit Iteration

Author: Amir M. Farahmand, Doina Precup

Abstract: Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close to optimal policy for reinforcement learning problems with large state spaces. VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. The algorithm is almost insensitive to the number of irrelevant features. Second, after each iteration of VPI, the algorithm adds a set of functions based on the currently learned value function to the dictionary. This increases the representation power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. We theoretically study VPI and provide a finite-sample error upper bound for it. 1

4 0.13133298 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

Author: Yi Wu, David P. Wipf

Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I or Type II objectives to be expressed in either coefficient or hyperparmeter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can be successful recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1

5 0.10545138 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

Author: Brendan Mcmahan, Matthew Streeter

Abstract: Some of the most compelling applications of online convex optimization, including online prediction and classification, are unconstrained: the natural feasible set is R^n. Existing algorithms fail to achieve sub-linear regret in this setting unless constraints on the comparator point x̊ are known in advance. We present algorithms that, without such prior knowledge, offer near-optimal regret bounds with respect to any choice of x̊. In particular, regret with respect to x̊ = 0 is constant. We then prove lower bounds showing that our guarantees are near-optimal in this setting. 1

6 0.10437694 12 nips-2012-A Neural Autoregressive Topic Model

7 0.10188077 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

8 0.10170985 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

9 0.099739559 324 nips-2012-Stochastic Gradient Descent with Only One Projection

10 0.095999502 102 nips-2012-Distributed Non-Stochastic Experts

11 0.092596784 92 nips-2012-Deep Representations and Codes for Image Auto-Annotation

12 0.091968499 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

13 0.086395189 179 nips-2012-Learning Manifolds with K-Means and K-Flats

14 0.085638821 277 nips-2012-Probabilistic Low-Rank Subspace Clustering

15 0.079092264 199 nips-2012-Link Prediction in Graphs with Autoregressive Features

16 0.078952052 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity

17 0.076899014 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

18 0.075744905 293 nips-2012-Relax and Randomize : From Value to Algorithms

19 0.075045049 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

20 0.072059706 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.187), (1, 0.014), (2, 0.036), (3, 0.033), (4, 0.04), (5, -0.003), (6, -0.023), (7, -0.045), (8, 0.045), (9, 0.001), (10, 0.163), (11, 0.116), (12, -0.082), (13, -0.027), (14, -0.01), (15, 0.116), (16, 0.015), (17, 0.039), (18, -0.036), (19, -0.075), (20, 0.005), (21, 0.03), (22, 0.003), (23, -0.056), (24, -0.062), (25, 0.028), (26, -0.05), (27, 0.15), (28, 0.044), (29, 0.009), (30, -0.067), (31, -0.039), (32, 0.085), (33, -0.007), (34, 0.006), (35, -0.008), (36, 0.084), (37, 0.121), (38, -0.089), (39, -0.013), (40, 0.053), (41, 0.063), (42, -0.159), (43, 0.144), (44, -0.22), (45, 0.01), (46, -0.065), (47, -0.099), (48, -0.138), (49, 0.04)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95584613 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

Author: Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville

Abstract: Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online ℓ1-dictionary learning where, unlike traditional dictionary learning, which uses squared loss, the ℓ1-penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on the alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data show that this online ℓ1-dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. 1

2 0.65554935 102 nips-2012-Distributed Non-Stochastic Experts

Author: Varun Kanade, Zhenming Liu, Bozidar Radunovic

Abstract: We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time-step t, one of the k site nodes has to pick an expert from the set {1, . . . , n}, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the non-distributed setting to obtain the optimal $O(\sqrt{\log(n) T})$ regret bound at the cost of T communication. (ii) No communication: Each site runs an independent copy – the regret is $O(\sqrt{\log(n) k T})$ and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than $\sqrt{kT}$ and communication better than T. We give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret $O(\sqrt{k^{5(1+\epsilon)/6}\, T})$ and communication $O(T/k^{\epsilon})$, for any value of $\epsilon \in (0, 1/5)$. We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us a strategy that is near optimal in the regret vs communication trade-off. 1

3 0.59323651 358 nips-2012-Value Pursuit Iteration

Author: Amir M. Farahmand, Doina Precup

Abstract: Value Pursuit Iteration (VPI) is an approximate value iteration algorithm that finds a close to optimal policy for reinforcement learning problems with large state spaces. VPI has two main features: First, it is a nonparametric algorithm that finds a good sparse approximation of the optimal value function given a dictionary of features. The algorithm is almost insensitive to the number of irrelevant features. Second, after each iteration of VPI, the algorithm adds a set of functions based on the currently learned value function to the dictionary. This increases the representation power of the dictionary in a way that is directly relevant to the goal of having a good approximation of the optimal value function. We theoretically study VPI and provide a finite-sample error upper bound for it. 1

4 0.53809226 64 nips-2012-Calibrated Elastic Regularization in Matrix Completion

Author: Tingni Sun, Cun-hui Zhang

Abstract: This paper concerns the problem of matrix completion, which is to estimate a matrix from observations in a small subset of indices. We propose a calibrated spectrum elastic net method with a sum of the nuclear and Frobenius penalties and develop an iterative algorithm to solve the convex minimization problem. The iterative algorithm alternates between imputing the missing entries in the incomplete matrix by the current guess and estimating the matrix by a scaled soft-thresholding singular value decomposition of the imputed matrix until the resulting matrix converges. A calibration step follows to correct the bias caused by the Frobenius penalty. Under proper coherence conditions and for suitable penalties levels, we prove that the proposed estimator achieves an error bound of nearly optimal order and in proportion to the noise level. This provides a unified analysis of the noisy and noiseless matrix completion problems. Simulation results are presented to compare our proposal with previous ones. 1

5 0.50353634 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

Author: Yi Wu, David P. Wipf

Abstract: Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from [22], which allows both the Type I or Type II objectives to be expressed in either coefficient or hyperparmeter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transitioning natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can be successful recovering sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular ℓ1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations. 1

6 0.48197895 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization

7 0.47889674 216 nips-2012-Mirror Descent Meets Fixed Share (and feels no regret)

8 0.46019647 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

9 0.45181957 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

10 0.44862223 63 nips-2012-CPRL -- An Extension of Compressive Sensing to the Phase Retrieval Problem

11 0.4214429 12 nips-2012-A Neural Autoregressive Topic Model

12 0.39847288 43 nips-2012-Approximate Message Passing with Consistent Parameter Estimation and Applications to Sparse Learning

13 0.3872287 11 nips-2012-A Marginalized Particle Gaussian Process Regression

14 0.37401974 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems

15 0.37362233 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

16 0.35943127 124 nips-2012-Factorial LDA: Sparse Multi-Dimensional Text Models

17 0.35811841 332 nips-2012-Symmetric Correspondence Topic Models for Multilingual Text Analysis

18 0.34990048 293 nips-2012-Relax and Randomize : From Value to Algorithms

19 0.3476707 221 nips-2012-Multi-Stage Multi-Task Feature Learning

20 0.34234601 199 nips-2012-Link Prediction in Graphs with Autoregressive Features


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.073), (17, 0.016), (21, 0.024), (38, 0.123), (42, 0.023), (54, 0.027), (55, 0.012), (68, 0.018), (72, 0.234), (74, 0.052), (76, 0.131), (80, 0.116), (92, 0.06)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83208978 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

Author: Michael C. Hughes, Erik B. Sudderth, Emily B. Fox

Abstract: Applications of Bayesian nonparametric methods require learning and inference algorithms which efficiently explore models of unbounded complexity. We develop new Markov chain Monte Carlo methods for the beta process hidden Markov model (BP-HMM), enabling discovery of shared activity patterns in large video and motion capture databases. By introducing split-merge moves based on sequential allocation, we allow large global changes in the shared feature structure. We also develop data-driven reversible jump moves which more reliably discover rare or unique behaviors. Our proposals apply to any choice of conjugate likelihood for observed data, and we show success with multinomial, Gaussian, and autoregressive emission models. Together, these innovations allow tractable analysis of hundreds of time series, where previous inference required clever initialization and lengthy burn-in periods for just six sequences. 1

2 0.83136195 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

Abstract: Multi-metric learning techniques learn local metric tensors in different parts of a feature space. With such an approach, even simple classifiers can be competitive with the state-of-the-art because the distance measure locally adapts to the structure of the data. The learned distance measure is, however, non-metric, which has prevented multi-metric learning from generalizing to tasks such as dimensionality reduction and regression in a principled way. We prove that, with appropriate changes, multi-metric learning corresponds to learning the structure of a Riemannian manifold. We then show that this structure gives us a principled way to perform dimensionality reduction and regression according to the learned metrics. Algorithmically, we provide the first practical algorithm for computing geodesics according to the learned metrics, as well as algorithms for computing exponential and logarithmic maps on the Riemannian manifold. Together, these tools let many Euclidean algorithms take advantage of multi-metric learning. We illustrate the approach on regression and dimensionality reduction tasks that involve predicting measurements of the human body from shape data. 1 Learning and Computing Distances Statistics relies on measuring distances. When the Euclidean metric is insufficient, as is the case in many real problems, standard methods break down. This is a key motivation behind metric learning, which strives to learn good distance measures from data. In the most simple scenarios a single metric tensor is learned, but in recent years, several methods have proposed learning multiple metric tensors, such that different distance measures are applied in different parts of the feature space. This has proven to be a very powerful approach for classification tasks [1, 2], but the approach has not generalized to other tasks. Here we consider the generalization of Principal Component Analysis (PCA) and linear regression; see Fig. 1 for an illustration of our approach. The main problem with generalizing multi-metric learning is that it is based on assumptions that make the feature space both non-smooth and non-metric. Specifically, it is often assumed that straight lines form geodesic curves and that the metric tensor stays constant along these lines. These assumptions are made because it is believed that computing the actual geodesics is intractable, requiring a discretization of the entire feature space [3]. We solve these problems by smoothing the transitions between different metric tensors, which ensures a metric space where geodesics can be computed. In this paper, we consider the scenario where the metric tensor at a given point in feature space is defined as the weighted average of a set of learned metric tensors. In this model, we prove that the feature space becomes a chart for a Riemannian manifold. This ensures a metric feature space, i.e. dist(x, y) = 0 ⇔ x = y , dist(x, y) = dist(y, x) (symmetry), (1) dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality). To compute statistics according to the learned metric, we need to be able to compute distances, which implies that we need to compute geodesics. Based on the observation that geodesics are 1 (a) Local Metrics & Geodesics (b) Tangent Space Representation (c) First Principal Geodesic Figure 1: Illustration of Principal Geodesic Analysis. (a) Geodesics are computed between the mean and each data point. (b) Data is mapped to the Euclidean tangent space and the first principal component is computed. 
(c) The principal component is mapped back to the feature space. smooth curves in Riemannian spaces, we derive an algorithm for computing geodesics that only requires a discretization of the geodesic rather than the entire feature space. Furthermore, we show how to compute the exponential and logarithmic maps of the manifold. With this we can map any point back and forth between a Euclidean tangent space and the manifold. This gives us a general strategy for incorporating the learned metric tensors in many Euclidean algorithms: map the data to the tangent of the manifold, perform the Euclidean analysis and map the results back to the manifold. Before deriving the algorithms (Sec. 3) we set the scene by an analysis of the shortcomings of current state-of-the-art methods (Sec. 2), which motivate our final model. The model is general and can be used for many problems. Here we illustrate it with several challenging problems in 3D body shape modeling and analysis (Sec. 4). All proofs can be found in the supplementary material along with algorithmic details and further experimental results. 2 Background and Related Work Single-metric learning learns a metric tensor, M, such that distances are measured as dist2 (xi , xj ) = xi − xj 2 M ≡ (xi − xj )T M(xi − xj ) , (2) where M is a symmetric and positive definite D × D matrix. Classic approaches for finding such a metric tensor include PCA, where the metric is given by the inverse covariance matrix of the training data; and linear discriminant analysis (LDA), where the metric tensor is M = S−1 SB S−1 , with Sw W W and SB being the within class scatter and the between class scatter respectively [9]. A more recent approach tries to learn a metric tensor from triplets of data points (xi , xj , xk ), where the metric should obey the constraint that dist(xi , xj ) < dist(xi , xk ). Here the constraints are often chosen such that xi and xj belong to the same class, while xi and xk do not. Various relaxed versions of this idea have been suggested such that the metric can be learned by solving a semi-definite or a quadratic program [1, 2, 4–8]. Among the most popular approaches is the Large Margin Nearest Neighbor (LMNN) classifier [5], which finds a linear transformation that satisfies local distance constraints, making the approach suitable for multi-modal classes. For many problems, a single global metric tensor is not enough, which motivates learning several local metric tensors. The classic work by Hastie and Tibshirani [9] advocates locally learning metric tensors according to LDA and using these as part of a kNN classifier. In a somewhat similar fashion, Weinberger and Saul [5] cluster the training data and learn a separate metric tensor for each cluster using LMNN. A more extreme point of view was taken by Frome et al. [1, 2], who learn a diagonal metric tensor for every point in the training set, such that distance rankings are preserved. Similarly, Malisiewicz and Efros [6] find a diagonal metric tensor for each training point such that the distance to a subset of the training data from the same class is kept small. Once a set of metric tensors {M1 , . . . , MR } has been learned, the distance dist(a, b) is measured according to (2) where “the nearest” metric tensor is used, i.e. R M(x) = r=1 wr (x) ˜ Mr , where wr (x) = ˜ ˜ j wj (x) 1 0 x − xr 2 r ≤ x − xj M otherwise 2 Mj , ∀j , (3) where x is either a or b depending on the algorithm. Note that this gives a non-metric distance function as it is not symmetric. 
To derive this equation, it is necessary to assume that 1) geodesics 2 −8 −8 Assumed Geodesics Location of Metric Tensors Test Points −6 −8 Actual Geodesics Location of Metric Tensors Test Points −6 Riemannian Geodesics Location of Metric Tensors Test Points −6 −4 −4 −4 −2 −2 −2 0 0 0 2 2 2 4 4 4 6 −8 6 −8 −6 −4 −2 0 (a) 2 4 6 −6 −4 −2 0 2 4 6 6 −8 −6 (b) −4 −2 (c) 0 2 4 6 (d) Figure 2: (a)–(b) An illustrative example where straight lines do not form geodesics and where the metric tensor does not stay constant along lines; see text for details. The background color is proportional to the trace of the metric tensor, such that light grey corresponds to regions where paths are short (M1 ), and dark grey corresponds to regions they are long (M2 ). (c) The suggested geometric model along with the geodesics. Again, background colour is proportional to the trace of the metric tensor; the colour scale is the same is used in (a) and (b). (d) An illustration of the exponential and logarithmic maps. form straight lines, and 2) the metric tensor stays constant along these lines [3]. Both assumptions are problematic, which we illustrate with a simple example in Fig. 2a–c. Assume we are given two metric tensors M1 = 2I and M2 = I positioned at x1 = (2, 2)T and x2 = (4, 4)T respectively. This gives rise to two regions in feature space in which x1 is nearest in the first and x2 is nearest in the second, according to (3). This is illustrated in Fig. 2a. In the same figure, we also show the assumed straight-line geodesics between selected points in space. As can be seen, two of the lines goes through both regions, such that the assumption of constant metric tensors along the line is violated. Hence, it would seem natural to measure the length of the line, by adding the length of the line segments which pass through the different regions of feature space. This was suggested by Ramanan and Baker [3] who also proposed a polynomial time algorithm for measuring these line lengths. This gives a symmetric distance function. Properly computing line lengths according to the local metrics is, however, not enough to ensure that the distance function is metric. As can be seen in Fig. 2a the straight line does not form a geodesic as a shorter path can be found by circumventing the region with the “expensive” metric tensor M1 as illustrated in Fig. 2b. This issue makes it trivial to construct cases where the triangle inequality is violated, which again makes the line length measure non-metric. In summary, if we want a metric feature space, we can neither assume that geodesics are straight lines nor that the metric tensor stays constant along such lines. In practice, good results have been reported using (3) [1,3,5], so it seems obvious to ask: is metricity required? For kNN classifiers this does not appear to be the case, with many successes based on dissimilarities rather than distances [10]. We, however, want to generalize PCA and linear regression, which both seek to minimize the reconstruction error of points projected onto a subspace. As the notion of projection is hard to define sensibly in non-metric spaces, we consider metricity essential. In order to build a model with a metric feature space, we change the weights in (3) to be smooth functions. This impose a well-behaved geometric structure on the feature space, which we take advantage of in order to perform statistical analysis according to the learned metrics. 
However, first we review the basics of Riemannian geometry as this provides the theoretical foundation of our work. 2.1 Geodesics and Riemannian Geometry We start by defining Riemannian manifolds, which intuitively are smoothly curved spaces equipped with an inner product. Formally, they are smooth manifolds endowed with a Riemannian metric [11]: Definition A Riemannian metric M on a manifold M is a smoothly varying inner product < a, b >x = aT M(x)b in the tangent space Tx M of each point x ∈ M . 3 Often Riemannian manifolds are represented by a chart; i.e. a parameter space for the curved surface. An example chart is the spherical coordinate system often used to represent spheres. While such charts are often flat spaces, the curvature of the manifold arises from the smooth changes in the metric. On a Riemannian manifold M, the length of a smooth curve c : [0, 1] → M is defined as the integral of the norm of the tangent vector (interpreted as speed) along the curve: 1 Length(c) = 1 c (λ) M(c(λ)) dλ c (λ)T M(c(λ))c (λ)dλ , = (4) 0 0 where c denotes the derivative of c and M(c(λ)) is the metric tensor at c(λ). A geodesic curve is then a length-minimizing curve connecting two given points x and y, i.e. (5) cgeo = arg min Length(c) with c(0) = x and c(1) = y . c The distance between x and y is defined as the length of the geodesic. Given a tangent vector v ∈ Tx M, there exists a unique geodesic cv (t) with initial velocity v at x. The Riemannian exponential map, Expx , maps v to a point on the manifold along the geodesic cv at t = 1. This mapping preserves distances such that dist(cv (0), cv (1)) = v . The inverse of the exponential map is the Riemannian logarithmic map denoted Logx . Informally, the exponential and logarithmic maps move points back and forth between the manifold and the tangent space while preserving distances (see Fig. 2d for an illustration). This provides a general strategy for generalizing many Euclidean techniques to Riemannian domains: data points are mapped to the tangent space, where ordinary Euclidean techniques are applied and the results are mapped back to the manifold. 3 A Metric Feature Space With the preliminaries settled we define the new model. Let C = RD denote the feature space. We endow C with a metric tensor in every point x, which we define akin to (3), R M(x) = wr (x)Mr , where wr (x) = r=1 wr (x) ˜ R ˜ j=1 wj (x) , (6) with wr > 0. The only difference from (3) is that we shall not restrict ourselves to binary weight ˜ functions wr . We assume the metric tensors Mr have already been learned; Sec. 4 contain examples ˜ where they have been learned using LMNN [5] and LDA [9]. From the definition of a Riemannian metric, we trivially have the following result: Lemma 1 The space C = RD endowed with the metric tensor from (6) is a chart of a Riemannian manifold, iff the weights wr (x) change smoothly with x. Hence, by only considering smooth weight functions wr we get a well-studied geometric structure ˜ on the feature space, which ensures us that it is metric. To illustrate the implications we return to the example in Fig. 2. We change the weight functions from binary to squared exponentials, which gives the feature space shown in Fig. 2c. As can be seen, the metric tensor now changes smoothly, which also makes the geodesics smooth curves (a property we will use when computing the geodesics). It is worth noting that Ramanan and Baker [3] also consider the idea of smoothly averaging the metric tensor. 
They, however, only evaluate the metric tensor at the test point of their classifier and then assume straight line geodesics with a constant metric tensor. Such assumptions violate the premise of a smoothly changing metric tensor and, again, the distance measure becomes non-metric. Lemma 1 shows that metric learning can be viewed as manifold learning. The main difference between our approach and techniques such as Isomap [12] is that, while Isomap learns an embedding of the data points, we learn the actual manifold structure. This gives us the benefit that we can compute geodesics as well as the exponential and logarithmic maps. These provide us with mappings back and forth between the manifold and Euclidean representation of the data, which preserve distances as well as possible. The availability of such mappings is in stark contrast to e.g. Isomap. In the next section we will derive a system of ordinary differential equations (ODE’s) that geodesics in C have to satisfy, which provides us with algorithms for computing geodesics as well as exponential and logarithmic maps. With these we can generalize many Euclidean techniques. 4 3.1 Computing Geodesics, Maps and Statistics At minima of (4) we know that the Euler-Lagrange equation must hold [11], i.e. ∂L d ∂L , where L(λ, c, c ) = c (λ)T M(c(λ))c (λ) . = ∂c dλ ∂c As we have an explicit expression for the metric tensor we can compute (7) in closed form: (7) Theorem 2 Geodesic curves in C satisfy the following system of 2nd order ODE’s M(c(λ))c (λ) = − 1 ∂vec [M(c(λ))] 2 ∂c(λ) T (c (λ) ⊗ c (λ)) , (8) where ⊗ denotes the Kronecker product and vec [·] stacks the columns of a matrix into a vector [13]. Proof See supplementary material. This result holds for any smooth weight functions wr . We, however, still need to compute ∂vec[M] , ˜ ∂c which depends on the specific choice of wr . Any smooth weighting scheme is applicable, but we ˜ restrict ourselves to the obvious smooth generalization of (3) and use squared exponentials. From this assumption, we get the following result Theorem 3 For wr (x) = exp − ρ x − xr ˜ 2 ∂vec [M(c)] = ∂c the derivative of the metric tensor from (6) is R ρ R j=1 2 Mr R 2 wj ˜ T r=1 T wj (c − xj ) Mj − (c − xr ) Mr ˜ wr vec [Mr ] ˜ . (9) j=1 Proof See supplementary material. Computing Geodesics. Any geodesic curve must be a solution to (8). Hence, to compute a geodesic between x and y, we can solve (8) subject to the constraints c(0) = x and c(1) = y . (10) This is a boundary value problem, which has a smooth solution. This allows us to solve the problem numerically using a standard three-stage Lobatto IIIa formula, which provides a fourth-order accurate C 1 –continuous solution [14]. Ramanan and Baker [3] discuss the possibility of computing geodesics, but arrive at the conclusion that this is intractable based on the assumption that it requires discretizing the entire feature space. Our solution avoids discretizing the feature space by discretizing the geodesic curve instead. As this is always one-dimensional the approach remains tractable in high-dimensional feature spaces. Computing Logarithmic Maps. Once a geodesic c is found, it follows from the definition of the logarithmic map, Logx (y), that it can be computed as v = Logx (y) = c (0) Length(c) . c (0) (11) In practice, we solve (8) by rewriting it as a system of first order ODE’s, such that we compute both c and c simultaneously (see supplementary material for details). Computing Exponential Maps. 
Given a starting point x on the manifold and a vector v in the tangent space, the exponential map, Expx (v), finds the unique geodesic starting at x with initial velocity v. As the geodesic must fulfill (8), we can compute the exponential map by solving this system of ODE’s with the initial conditions c(0) = x and c (0) = v . (12) This initial value problem has a unique solution, which we find numerically using a standard RungeKutta scheme [15]. 5 3.1.1 Generalizing PCA and Regression At this stage, we know that the feature space is Riemannian and we know how to compute geodesics and exponential and logarithmic maps. We now seek to generalize PCA and linear regression, which becomes straightforward since solutions are available in Riemannian spaces [16, 17]. These generalizations can be summarized as mapping the data to the tangent space at the mean, performing standard Euclidean analysis in the tangent and mapping the results back. The first step is to compute the mean value on the manifold, which is defined as the point that minimizes the sum-of-squares distances to the data points. Pennec [18] provides an efficient gradient descent approach for computing this point, which we also summarize in the supplementary material. The empirical covariance of a set of points is defined as the ordinary Euclidean covariance in the tangent space at the mean value [18]. With this in mind, it is not surprising that the principal components of a dataset have been generalized as the geodesics starting at the mean with initial velocity corresponding to the eigenvectors of the covariance [16], γvd (t) = Expµ (tvd ) , (13) th where vd denotes the d eigenvector of the covariance. This approach is called Principal Geodesic Analysis (PGA), and the geodesic curve γvd is called the principal geodesic. An illustration of the approach can be seen in Fig. 1 and more algorithmic details are in the supplementary material. Linear regression has been generalized in a similar way [17] by performing regression in the tangent of the mean and mapping the resulting line back to the manifold using the exponential map. The idea of working in the tangent space is both efficient and convenient, but comes with an element of approximation as the logarithmic map is only guarantied to preserve distances to the origin of the tangent and not between all pairs of data points. Practical experience, however, indicates that this is a good tradeoff; see [19] for a more in-depth discussion of when the approximation is suitable. 4 Experiments To illustrate the framework1 we consider an example in human body analysis, and then we analyze the scalability of the approach. But first, to build intuition, Fig. 3a show synthetically generated data samples from two classes. We sample random points xr and learn a local LDA metric [9] by considering all data points within a radius; this locally pushes the two classes apart. We combine the local metrics using (6) and Fig. 3b show the data in the tangent space of the resulting manifold. As can be seen the two classes are now globally further apart, which shows the effect of local metrics. 4.1 Human Body Shape We consider a regression example concerning human body shape analysis. We study 986 female body laser scans from the CAESAR [20] data set; each shape is represented using the leading 35 principal components of the data learned using a SCAPE-like model [21, 22]. Each shape is associated with anthropometric measurements such as body height, shoe size, etc. 
We show results for shoulder to wrist distance and shoulder breadth, but results for more measurements are in the supplementary material. To predict the measurements from shape coefficients, we learn local metrics and perform linear regression according to these. As a further experiment, we use PGA to reduce the dimensionality of the shape coefficients according to the local metrics, and measure the quality of the reduction by performing linear regression to predict the measurements. As a baseline we use the corresponding Euclidean techniques.

To learn the local metric we do the following. First we whiten the data such that the variance captured by PGA will only be due to the change of metric; this allows easy visualization of the impact of the learned metrics. We then cluster the body shapes into equal-sized clusters according to the measurement and learn an LMNN metric for each cluster [5], which we associate with the mean of each class. These push the clusters apart, which introduces variance along the directions where the measurement changes. From this we construct a Riemannian manifold according to (6), compute the mean value on the manifold, map the data to the tangent space at the mean, and perform linear regression in the tangent space.

¹ Our software implementation for computing geodesics and performing manifold statistics is available at http://ps.is.tue.mpg.de/project/Smooth Metric Learning

Figure 3: Left panels: Synthetic data. (a) Samples from two classes along with illustratively sampled metric tensors from (6). (b) The data represented in the tangent of a manifold constructed from local LDA metrics learned at random positions. Right panels: Real data. (c) Average error of linearly predicted body measurements (mm). (d) Running time (sec) of the geodesic computation as a function of dimensionality.

Figure 4: Left: body shape data in the first two principal components according to the Euclidean metric. Point color indicates cluster membership. Center: As on the left, but according to the Riemannian model. Right: regression error as a function of the dimensionality of the shape space; again the Euclidean metric and the Riemannian metric are compared.

As a first visualization we plot the data expressed in the leading two dimensions of PGA in Fig. 4; as can be seen, the learned metrics provide principal geodesics that are more strongly related to the measurements than the Euclidean model. In order to predict the measurements from the body shape, we perform linear regression, both directly in the shape space according to the Euclidean metric and in the tangent space of the manifold corresponding to the learned metrics (using the logarithmic map from (11)). We measure the prediction error using leave-one-out cross-validation.
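As a rough illustration of this regression step, the sketch below operates under the same assumptions as the earlier ones: the local metrics M_r (e.g. from an LMNN implementation) are taken as given, and manifold_mean and log_map come from the previous sketches; names are illustrative only.

```python
import numpy as np

def tangent_regression(X, t, centers, metrics, rho):
    """Linear regression in the tangent space at the manifold mean, via the log map (11)."""
    mu = manifold_mean(X, centers, metrics, rho)
    V = np.array([log_map(mu, xi, centers, metrics, rho) for xi in X])  # N x D tangent coords
    A = np.hstack([V, np.ones((len(X), 1))])                            # add an intercept
    beta, *_ = np.linalg.lstsq(A, t, rcond=None)
    return mu, beta

def predict(x_new, mu, beta, centers, metrics, rho):
    """Predict a measurement for a new shape by mapping it to the tangent space at mu."""
    v = log_map(mu, x_new, centers, metrics, rho)
    return np.append(v, 1.0) @ beta
```

Leave-one-out cross-validation, as used here, would simply repeat this fit N times, holding out one scan at a time and predicting its measurement.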
To further illustrate the power of the PGA model, we repeat this experiment for different dimensionalities of the data. The results are plotted in Fig. 4, showing that regression according to the learned metrics outperforms the Euclidean model. To verify that the learned metrics improve accuracy, we average the prediction errors over all millimeter measurements. The result in Fig. 3c shows that much can be gained in lower dimensions by using the local metrics.

To provide visual insights into the behavior of the learned metrics, we uniformly sample body shape along the first principal geodesic (in the range ±7 times the standard deviation) according to the different metrics. The results are available as a movie in the supplementary material, but are also shown in Fig. 5. As can be seen, the learned metrics pick up intuitive relationships between body shape and the measurements, e.g. shoulder to wrist distance is related to overall body size, while shoulder breadth is related to body weight.

Figure 5 (panels: shoulder to wrist distance; shoulder breadth): Shapes corresponding to the mean (center) and ±7 times the standard deviations along the principal geodesics (left and right). Movies are available in the supplementary material.

4.2 Scalability

The human body data set is small enough (986 samples in 35 dimensions) that computing a geodesic only takes a few seconds. To show that the current unoptimized Matlab implementation can handle somewhat larger datasets, we briefly consider a dimensionality reduction task on the classic MNIST handwritten digit data set. We use the preprocessed data available with [3], where the original 28×28 gray scale images were deskewed and projected onto their leading 164 Euclidean principal components (which capture 95% of the variance in the original data). We learn one diagonal LMNN metric per class, which we associate with the mean of the class. From this we construct a Riemannian manifold from (6), compute the mean value on the manifold and compute geodesics between the mean and each data point; this is the computationally expensive part of performing PGA. Fig. 3d plots the average running time (sec) for the computation of geodesics as a function of the dimensionality of the training data. A geodesic can be computed in 100 dimensions in approximately 5 sec., whereas in 150 dimensions it takes about 30 sec. In this experiment, we train a PGA model on 60,000 data points, and test a nearest neighbor classifier in the tangent space as we decrease the dimensionality of the model. Compared to a Euclidean model, this gives a modest improvement in classification accuracy of 2.3 percent, when averaged across different dimensionalities. Plots of the results can be found in the supplementary material.

5 Discussion

This work shows that multi-metric learning techniques are indeed applicable outside the realm of kNN classifiers. The idea of defining the metric tensor at any given point as the weighted average of a finite set of learned metrics is quite natural from a modeling point of view, which is also validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Specifically, we show how to use local metric tensors for both regression and dimensionality reduction tasks. Others have attempted to solve non-classification problems using local metrics, but we feel that our approach is the first to have a solid theoretical backing.
For example, Hastie and Tibshirani [9] use local LDA metrics for dimensionality reduction by averaging the local metrics and using the resulting metric as part of a Euclidean PCA, which is essentially a linear approach. Another approach was suggested by Hong et al. [23], who simply compute the principal components according to each metric separately, such that one low-dimensional model is learned per metric.

The suggested approach is, however, not free of difficulties in its current implementation. Currently, we are using off-the-shelf numerical solvers for computing geodesics, which can be computationally demanding. While we managed to analyze medium-sized datasets, we believe that the run-time can be drastically improved by developing specialized numerical solvers. In the experiments, we learned local metrics using techniques specialized for classification tasks, as this is all the current literature provides. We expect improvements by learning the metrics specifically for regression and dimensionality reduction, but doing so is currently an open problem.

Acknowledgments: Søren Hauberg is supported in part by the Villum Foundation, and Oren Freifeld is supported in part by NIH-NINDS EUREKA (R01-NS066311).

References

[1] Andrea Frome, Yoram Singer, and Jitendra Malik. Image retrieval and classification using local distance functions. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19 (NIPS), pages 417–424, Cambridge, MA, 2007. MIT Press.
[2] Andrea Frome, Fei Sha, Yoram Singer, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In International Conference on Computer Vision (ICCV), pages 1–8, 2007.
[3] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):794–806, 2011.
[4] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML '04, pages 94–101. ACM, 2004.
[5] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244, 2009.
[6] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008.
[7] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13:1–26, 2012.
[8] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16 (NIPS), 2004.
[9] Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, June 1996.
[10] Elzbieta Pekalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilarity-based classification. Journal of Machine Learning Research, 2:175–211, 2002.
[11] Manfredo Perdigão do Carmo. Riemannian Geometry. Birkhäuser Boston, January 1992.
[12] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[13] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2007.
[14] Jacek Kierzenka and Lawrence F. Shampine. A BVP solver based on residual control and the MATLAB PSE. ACM Transactions on Mathematical Software, 27(3):299–316, 2001.
[15] John R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6:19–26, 1980.
[16] P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal Geodesic Analysis for the study of Nonlinear Statistics of Shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004.
[17] Peter E. Jupp and John T. Kent. Fitting smooth paths to spherical data. Applied Statistics, 36(1):34–46, 1987.
[18] Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of Nonlinear Signal and Image Processing, pages 194–198, 1999.
[19] Stefan Sommer, François Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations. In European Conference on Computer Vision (ECCV), pages 43–56, 2010.
[20] Kathleen M. Robinette, Hein Daanen, and Eric Paquet. The CAESAR project: a 3-D surface anthropometry survey. In 3-D Digital Imaging and Modeling, pages 380–386, 1999.
[21] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005.
[22] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In A. Fitzgibbon et al., editors, European Conference on Computer Vision (ECCV), Part I, LNCS 7572, pages 1–14. Springer-Verlag, October 2012.
[23] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classification and dimensionality reduction. In International Conference on Computer Vision (ICCV), pages 906–913, 2011.

3 0.81579351 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

Author: Doina Precup, Joelle Pineau, Andre S. Barreto

Abstract: Kernel-based stochastic factorization (KBSF) is an algorithm for solving reinforcement learning tasks with continuous state spaces which builds a Markov decision process (MDP) based on a set of sample transitions. What sets KBSF apart from other kernel-based approaches is the fact that the size of its MDP is independent of the number of transitions, which makes it possible to control the trade-off between the quality of the resulting approximation and the associated computational cost. However, KBSF’s memory usage grows linearly with the number of transitions, precluding its application in scenarios where a large amount of data must be processed. In this paper we show that it is possible to construct KBSF’s MDP in a fully incremental way, thus freeing the space complexity of this algorithm from its dependence on the number of sample transitions. The incremental version of KBSF is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve continuous MDPs in both off-line and on-line regimes. We present theoretical results showing that KBSF can approximate the value function that would be computed by conventional kernel-based learning with arbitrary precision. We empirically demonstrate the effectiveness of the proposed algorithm in the challenging three-pole balancing task, in which the ability to process a large number of transitions is crucial for success. 1

4 0.80976784 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

Author: Ren Xiaofeng, Liefeng Bo

Abstract: Finding contours in natural images is a fundamental problem that serves as the basis of many tasks such as image segmentation and object recognition. At the core of contour detection technologies are a set of hand-designed gradient features, used by most approaches including the state-of-the-art Global Pb (gPb) operator. In this work, we show that contour detection accuracy can be significantly improved by computing Sparse Code Gradients (SCG), which measure contrast using patch representations automatically learned through sparse coding. We use K-SVD for dictionary learning and Orthogonal Matching Pursuit for computing sparse codes on oriented local neighborhoods, and apply multi-scale pooling and power transforms before classifying them with linear SVMs. By extracting rich representations from pixels and avoiding collapsing them prematurely, Sparse Code Gradients effectively learn how to measure local contrasts and find contours. We improve the F-measure metric on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb contours). Moreover, our learning approach can easily adapt to novel sensor data such as Kinect-style RGB-D cameras: Sparse Code Gradients on depth maps and surface normals lead to promising contour detection using depth and depth+color, as verified on the NYU Depth Dataset. 1

same-paper 5 0.79222131 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

Author: Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville

Abstract: Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identification of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online 1 -dictionary learning where unlike traditional dictionary learning, which uses squared loss, the 1 -penalty is used for measuring the reconstruction error. We present an efficient online algorithm for this problem based on alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data, shows that this online 1 -dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any significant loss in quality of results. 1

6 0.73588663 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

7 0.71689832 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

8 0.70808393 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

9 0.6986149 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

10 0.69600785 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

11 0.69453931 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

12 0.69453269 197 nips-2012-Learning with Recursive Perceptual Representations

13 0.69447798 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model

14 0.69445258 65 nips-2012-Cardinality Restricted Boltzmann Machines

15 0.6941855 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

16 0.69237459 200 nips-2012-Local Supervised Learning through Space Partitioning

17 0.69127011 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

18 0.69126427 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

19 0.6911 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

20 0.68919629 19 nips-2012-A Spectral Algorithm for Latent Dirichlet Allocation