nips nips2012 nips2012-107 knowledge-graph by maker-knowledge-mining

107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

Source: pdf

Author: Michael C. Hughes, Erik B. Sudderth, Emily B. Fox

Abstract: Applications of Bayesian nonparametric methods require learning and inference algorithms which efﬁciently explore models of unbounded complexity. We develop new Markov chain Monte Carlo methods for the beta process hidden Markov model (BP-HMM), enabling discovery of shared activity patterns in large video and motion capture databases. By introducing split-merge moves based on sequential allocation, we allow large global changes in the shared feature structure. We also develop data-driven reversible jump moves which more reliably discover rare or unique behaviors. Our proposals apply to any choice of conjugate likelihood for observed data, and we show success with multinomial, Gaussian, and autoregressive emission models. Together, these innovations allow tractable analysis of hundreds of time series, where previous inference required clever initialization and lengthy burn-in periods for just six sequences. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 We develop new Markov chain Monte Carlo methods for the beta process hidden Markov model (BP-HMM), enabling discovery of shared activity patterns in large video and motion capture databases. [sent-9, score-0.386]

2 By introducing split-merge moves based on sequential allocation, we allow large global changes in the shared feature structure. [sent-10, score-0.302]

3 We also develop data-driven reversible jump moves which more reliably discover rare or unique behaviors. [sent-11, score-0.413]

4 Our proposals apply to any choice of conjugate likelihood for observed data, and we show success with multinomial, Gaussian, and autoregressive emission models. [sent-12, score-0.453]

5 For example, given collections of videos or human motion capture sequences, our goal is to (i) identify a concise global library of behaviors that explain the observed motions, (ii) annotate each sequence with the subset of behaviors used, and (iii) label each timestep with one active behavior. [sent-16, score-0.386]

6 The beta process hidden Markov model (BP-HMM) [4] offers a promising solution to such problems, allowing an unbounded set of relevant behaviors to be learned from data. [sent-17, score-0.19]

7 [4] considered a dataset containing only six motion capture sequences and proposed a Markov chain Monte Carlo (MCMC) method that required careful initialization and tens of thousands of burn-in iterations. [sent-20, score-0.23]

8 However, like most MCMC samplers, their proposals only modiﬁed small subsets of variables at each step. [sent-22, score-0.251]

9 Additionally, the sampler relied on parameter proposals from priors, leading to low acceptance rates for high-dimensional data. [sent-23, score-0.391]

10 First, we develop split-merge moves which change many variables simultaneously, allowing rapid improvements in the discovered behavior library. [sent-28, score-0.222]

11 Our approach builds on previous work on restricted Gibbs proposals [8] and sequential allocation strategies [9], both of which were formulated for static Dirichlet 1   bk fi i z i ,1 zi ,2 f1 f2 f3 f4  i   k features k=1, 2, . [sent-29, score-0.545]

12 z i ,T xi ,T f1 f2 f3 f4 i = 1, 2, … N z1 z2 z3 z4 merge split km z1 z2 z3 z4 ka kb Figure 1: Left: The BP-HMM as a directed graphical model. [sent-38, score-1.055]

13 Right: Illustration of our split and merge proposals, which modify both binary feature assignment matrices F (white indicates present feature) and state sequences z. [sent-39, score-0.394]

14 We show F, z before (top) and after (bottom) feature km (yellow) is split into ka , kb (red,orange). [sent-40, score-0.969]

15 An item i with fikm = 1 can have either ka , kb , or both after the split, and its new zi sequence can use any features available in fi . [sent-41, score-1.078]

16 An item without km cannot possess ka , kb , and its state sequence zi does not change. [sent-42, score-1.05]

17 Second, we design data-driven [11] reversible jump moves [12] which efﬁciently discover behaviors unique to a single sequence. [sent-44, score-0.518]

18 These data-driven proposals are especially important for high-dimensional observation sequences. [sent-45, score-0.251]

19 Both innovations apply to any likelihood model with a conjugate prior; we show success with multinomial models of discrete video descriptors, and Gaussian autoregressive models of continuous motion capture trajectories. [sent-46, score-0.344]

20 1, and then develop our novel split-merge and data-driven proposals in Sec. [sent-51, score-0.251]

21 4, showing improvements over prior work modeling human motions captured both via a marker-based mocap system [4] and video cameras [13]. [sent-56, score-0.173]

22 The binary feature vector for item i is determined by independent Bernoulli draws fik ∼ Ber(bk ). [sent-67, score-0.243]

23 Marginalizing over B, the number of active features in item i has distribution Poisson(γ), determined by mass parameter γ. [sent-68, score-0.238]

24 The concentration parameter β controls how often features are shared between items [15]. [sent-69, score-0.229]

25 The binary vector fi determines a ﬁnite set of states available for item i. [sent-73, score-0.315]

26 Each timestep t is assigned a single state zit = k from the set {k : fik = 1}, determining which parameters θk generate data xit . [sent-74, score-0.262]

27 λ0 ) (3) 1 Throughout this paper, for variables wijk we use w to denote the vector or collection of wijk ’s over the entire set of subscripts, and wi for the collection over only the omitted subscripts j and k 2 The BP-HMM allows each item independent transition dynamics. [sent-79, score-0.203]

28 The transition distribution πij from each state j for the HMM of item i is built by drawing a set of transition weights ηi , and then normalizing these over the set of active features fi : ηijk ∼ Gamma(α + κδjk , 1), πijk = ηijk fik . [sent-80, score-0.478]

29 This construction assigns positive transition mass πijk only to features k active in fi . [sent-82, score-0.249]

30 We then present our novel contributions: split-merge moves and data-driven reversible jump proposals. [sent-85, score-0.372]

31 [4]’s sampler alternates updates to HMM parameters θ and η, discrete state sequences z, and feature assignments F. [sent-89, score-0.29]

32 2 Sampling each item’s features requires separate updates to features shared with some other time series and features unique to item i. [sent-91, score-0.455]

33 This local move alters only one entry in F while holding all others ﬁxed; the split-merge moves of Sec. [sent-95, score-0.267]

34 [4] deﬁne a reversible pair of birth and death moves which add or delete features to single sequences. [sent-99, score-0.542]

35 While this approach elegantly avoids approximating the inﬁnite BP-HMM, their birth proposals use the (typically vague) prior to propose emission parameters θk∗ for new features k ∗ . [sent-100, score-0.615]

36 We remedy this slow exploration with data-driven proposals in Sec. [sent-101, score-0.251]

37 Jain and Neal present valid proposals that reversibly split a single cluster km into two (ka , kb ), or merge two clusters into one. [sent-107, score-0.97]

38 To construct an initial split of km , the RG sampler ﬁrst assigns items in cluster km at random to either ka or kb . [sent-109, score-1.286]

39 Starting from this partition, the proposal is constructed by performing one-at-a-time Gibbs updates, forgetting an item’s current cluster and reassigning to either ka or kb conditioned on the remaining partitioned data. [sent-110, score-0.738]

40 After several sweeps, these Gibbs updates encourage proposed clusters ka and kb which agree with the data and thus are more likely to be accepted. [sent-111, score-0.712]

41 For non-conjugate models, more complex RG proposals can be constructed which instantiate emission parameters [16]. [sent-112, score-0.358]

42 An alternative sequential allocation [9] method replaces the random initialization of RG by using two randomly chosen items to “anchor” the two new clusters ka , kb . [sent-115, score-0.837]

43 Remaining items are then sequentially assigned to either ka or kb one-at-a-time, using RG moves conditioning only on previously assigned data. [sent-116, score-1.1]

44 Recent work has shown some success with sequentiallyallocated split-merge moves for a hierarchical DP topic model [17]. [sent-118, score-0.222]

45 3 For nonparametric models not based on the DP, split-merge moves are not well studied. [sent-121, score-0.264]

46 Several authors have considered RG split-merge proposals for beta process models [18, 19, 20]. [sent-122, score-0.304]

47 In the mixture models considered by prior work [8, 9], each data item i is associated with a single cluster ki , so selecting two anchors i, j also identiﬁes two cluster indices ki , kj . [sent-126, score-0.716]

48 However, in feature-based models such as the BP-HMM, each data item i is associated with a collection of features indexed by fi . [sent-127, score-0.352]

49 Therefore, our proposals require mechanisms for selecting anchors and for choosing candidate states to split or merge from fi , fj . [sent-128, score-0.755]

50 Additionally, our proposals must allow changes to state sequences z to reﬂect changes in F. [sent-129, score-0.378]

51 Our proposals thus jointly create a new conﬁguration (F∗ , z∗ ), collapsing away HMM parameters θ, η. [sent-130, score-0.284]

52 Selecting Anchors Following [9], we ﬁrst randomly select distinct anchor data items i and j. [sent-133, score-0.207]

53 Next, we select from each anchor one feature it possesses, denoted ki , kj . [sent-135, score-0.451]

54 This choice determines the proposed move: we merge ki , kj if they are distinct, and split ki = kj into two new features otherwise. [sent-136, score-0.935]

55 Selecting ki , kj uniformly at random is problematic. [sent-137, score-0.322]

56 First, in datasets with many features choosing ki = kj is unlikely, making split moves rare. [sent-138, score-0.702]

57 We thus develop a proposal distribution which ﬁrst draws ki uniformly from fi , and then selects kj given ﬁxed ki as follows: qk (ki , kj ) = Unif(ki | fi )q(kj | ki , fj ), q(kj = k | ki , fj ) ∝ 2Cj fjk m(x i ,xk ) fjk m(xk k)m(xk ) if k = ki (5) o. [sent-142, score-1.554]

58 i Here, xk is the data assigned to k, m(·) denotes the marginal likelihood of data collapsing away emission parameters θ, and Cj = kj =ki fjkj m(xki , xkj )/ m(xki )m(xkj ) . [sent-144, score-0.354]

59 This construction gives large mass (2/3) to a split move when possible, and also encourages choices ki = kj for a merge that explain similar data via the marginal likelihood ratio. [sent-145, score-0.587]

60 A large ratio means the model prefers to explain all data assigned to ki , kj together rather than separately, biasing selection towards promising merge candidates. [sent-146, score-0.492]

61 Once ki , kj are ﬁxed, we construct the candidate state F∗ , z∗ . [sent-148, score-0.355]

62 1, we only alter f , z for items which possess either ki or kj . [sent-150, score-0.447]

63 Sweeping through a random permutation of items in the active set S, we draw each item’s assignment to new features ka , kb and resample its state sequence. [sent-154, score-0.94]

64 The dynamic programming recursions underlying these proposals use nonˆ random auxiliary parameters: η is set to its prior mean, and θk to its posterior mean given the current ˆ ˆ z. [sent-157, score-0.312]

65 For new states k ∗ ∈ {ka , kb }, we initialize θk∗ from anchor sequences and then update to account for new data assigned to k ∗ at each item . [sent-158, score-0.725]

66 Finally, we sample f ∗ , z∗ for anchor items, enforcing fika = 1 and fjkb = 1 so the move ∗ ∗ remains reversible under a merge. [sent-160, score-0.219]

67 This does not force zi to use ka nor zj to use kb . [sent-161, score-0.726]

68 We thus need only to sample z∗ for items in S, using a block sampler as in Alg. [sent-163, score-0.208]

69 Again ˆ ˆ this requires auxiliary HMM parameters θ, η , which we emphasize are deterministic tools enabling collapsed proposals of discrete indicators F∗ , z∗ . [sent-165, score-0.286]

70 We give the ratio for a split move which creates features 4 Alg. [sent-168, score-0.239]

71 The acceptance ratio for a merge move is the reciprocal of Eq. [sent-170, score-0.235]

72 p(x, z∗ , F∗ ) qmerge (F, z | x, F∗ , z∗ , ka , kb ) qk (ka , kb | x, F∗ , z∗ , i, j) (6) ρsplit = p(x, z, F) qsplit (F∗ , z∗ | x, F, z, km ) qk (km , km | x, F, z, i, j) The joint probability p(x, z, F) is only tractable with conjugate likelihoods. [sent-172, score-1.457]

73 To accept the birth of new feature k ∗ = K + 1 for item i, this feature must explain some of the observed data xi at least as well as existing features 1, 2, . [sent-176, score-0.425]

74 This mixture strikes a balance between proposing promising new features (via the posterior) while also making death moves possible, since the diffuse prior will place some mass on the reverse birth move. [sent-184, score-0.511]

75 β Let Ui denote the number of unique features in fi and ν = γ N −1+β . [sent-185, score-0.258]

77 (7) is similar to the ratio for the birth proposal from the prior, adding only one term to account ∗ for the proposed θk∗ . [sent-187, score-0.184]

78 Note that each choice of W deﬁnes a valid pair of birth-death moves satisfying detailed balance, so we need not account for this choice in the acceptance ratio [21]. [sent-188, score-0.279]

79 To evaluate how well our novel moves explore the posterior distribution, we compare three possible methods for adding or removing features: split-merge moves (SM, Sec. [sent-190, score-0.444]

80 4), and reversible jump moves using the prior (Prior [4], Sec. [sent-194, score-0.433]

81 Each run begins with one feature used by all items, and must add new features via split-merge proposals (SM), or reversible-jump moves using data-driven (DD) or prior (Prior) proposals. [sent-236, score-0.652]

82 All runs of both SM and DD moves ﬁnd all true states within several minutes, while no Prior run recovers all true states, remaining stuck with merged versions of true features. [sent-249, score-0.287]

83 DD moves add new features most rapidly due to low computational cost. [sent-250, score-0.293]

84 Merge moves enable critical global changes, while the one-at-a-time updates of [4] (and our DD variant) must take long random walks to completely delete a popular feature. [sent-255, score-0.255]

85 These results demonstrate the importance of DD birth and split moves for exploring new features, and merge moves for removing features via proposals of large assignment changes. [sent-257, score-1.111]

86 As such, we consider a sampler that interleaves SM and DD moves in our subsequent analyses. [sent-258, score-0.335]

87 2 Motion Capture Data We next apply our improved MCMC methods to motion capture (mocap) sequences from the CMU database [22]. [sent-260, score-0.197]

88 Previous results [4] show that the BP-HMM outperforms competitors in segmenting these actions, although they report that some true actions like jogging are split across multiple recovered features (see their Fig. [sent-263, score-0.244]

89 5 4 x 10 Figure 3: Analysis of 6 motion capture sequences previously considered by [4]. [sent-285, score-0.197]

90 SM+DD moves (top row started from one feature, middle row started with 5 unique states per sequence) successfully explain each action with one primary state, while [4]’s sampler (bottom row) started from 5 unique features remains stuck with multiple unique states for one true action. [sent-288, score-0.598]

91 3, we compare a sampler which interleaves SM and DD moves with [4]’s original method. [sent-294, score-0.335]

92 We run 20 chains of each method for ten hours from two initializations: unique5, which assigns 5 unique features per sequence (as done in [4]), and one, using a single feature across all items. [sent-295, score-0.191]

93 Our SM moves are critical for effectively creating and deleting features. [sent-302, score-0.222]

94 In contrast, the Prior remains stuck with some unique behaviors used in different sequences, yielding lower probability and larger Hamming distance. [sent-305, score-0.177]

95 In contrast, starting from one, our SM+DD moves rapidly identify a diverse set of 33 behaviors. [sent-310, score-0.222]

96 Each video is several minutes long and depicts a single actor in the same kitchen preparing either a pizza, a sandwich, a salad, brownies, or eggs. [sent-316, score-0.174]

97 We compare our new SM moves on this small dataset, and then study a larger set of 126 Kitchen videos using our improved sampler. [sent-324, score-0.263]

98 DD proposals offer signiﬁcant gains over the prior, and further interleaving DD and SM moves achieves the best overall conﬁguration, showing the beneﬁts of proposing global changes to F, z. [sent-327, score-0.473]

99 Our proposals do not require careful initialization or parameter tuning, and enable exploration of large datasets intractable under previous methods. [sent-341, score-0.284]

100 We expect the guiding principles of data-driven and sequentially-allocated proposals to apply to many other models, enabling new applications of nonparametric analysis. [sent-343, score-0.328]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('kb', 0.343), ('ka', 0.336), ('dd', 0.307), ('proposals', 0.251), ('sm', 0.237), ('moves', 0.222), ('kj', 0.18), ('km', 0.156), ('fi', 0.146), ('ki', 0.142), ('item', 0.135), ('merge', 0.133), ('birth', 0.125), ('items', 0.125), ('emission', 0.107), ('behaviors', 0.105), ('fox', 0.097), ('sequences', 0.094), ('rg', 0.092), ('reversible', 0.092), ('split', 0.087), ('jogging', 0.086), ('kitchen', 0.084), ('sampler', 0.083), ('anchor', 0.082), ('xit', 0.079), ('features', 0.071), ('motion', 0.071), ('sec', 0.067), ('samplers', 0.065), ('mcmc', 0.063), ('cmu', 0.063), ('prior', 0.061), ('fik', 0.061), ('proposal', 0.059), ('jump', 0.058), ('acceptance', 0.057), ('anchors', 0.056), ('mocap', 0.056), ('hmm', 0.056), ('video', 0.056), ('beta', 0.053), ('ar', 0.053), ('pizza', 0.052), ('salad', 0.052), ('sprev', 0.052), ('zit', 0.052), ('conjugate', 0.049), ('fj', 0.048), ('ijk', 0.048), ('feature', 0.047), ('zi', 0.047), ('autoregressive', 0.046), ('innovations', 0.046), ('bowl', 0.046), ('move', 0.045), ('cpu', 0.045), ('dirichlet', 0.044), ('multinomial', 0.044), ('activity', 0.042), ('nonparametric', 0.042), ('unique', 0.041), ('videos', 0.041), ('hamming', 0.041), ('markov', 0.038), ('xw', 0.038), ('assigned', 0.037), ('qk', 0.037), ('creates', 0.036), ('bp', 0.035), ('enabling', 0.035), ('brownies', 0.034), ('preparing', 0.034), ('sticky', 0.034), ('stir', 0.034), ('wijk', 0.034), ('xki', 0.034), ('states', 0.034), ('sweeps', 0.033), ('shared', 0.033), ('monte', 0.033), ('updates', 0.033), ('initialization', 0.033), ('create', 0.033), ('state', 0.033), ('death', 0.032), ('active', 0.032), ('chains', 0.032), ('capture', 0.032), ('patterns', 0.032), ('jain', 0.032), ('hidden', 0.032), ('stuck', 0.031), ('dp', 0.03), ('emissions', 0.03), ('interleaves', 0.03), ('merges', 0.03), ('poi', 0.03), ('xkj', 0.03), ('bk', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

Author: Michael C. Hughes, Erik B. Sudderth, Emily B. Fox

2 0.18126838 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

Author: Maksims Volkovs, Richard S. Zemel

Abstract: Bipartite matching problems characterize many situations, ranging from ranking in information retrieval to correspondence in vision. Exact inference in realworld applications of these problems is intractable, making efﬁcient approximation methods essential for learning and inference. In this paper we propose a novel sequential matching sampler based on a generalization of the PlackettLuce model, which can effectively make large moves in the space of matchings. This allows the sampler to match the difﬁcult target distributions common in these problems: highly multimodal distributions with well separated modes. We present experimental results with bipartite matching problems—ranking and image correspondence—which show that the sequential matching sampler efﬁciently approximates the target distribution, signiﬁcantly outperforming other sampling approaches. 1

3 0.110565 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

Author: Andriy Mnih, Yee W. Teh

Abstract: User preferences for items can be inferred from either explicit feedback, such as item ratings, or implicit feedback, such as rental histories. Research in collaborative ﬁltering has concentrated on explicit feedback, resulting in the development of accurate and scalable models. However, since explicit feedback is often difﬁcult to collect it is important to develop effective models that take advantage of the more widely available implicit feedback. We introduce a probabilistic approach to collaborative ﬁltering with implicit feedback based on modelling the user’s item selection process. In the interests of scalability, we restrict our attention to treestructured distributions over items and develop a principled and efﬁcient algorithm for learning item trees from data. We also identify a problem with a widely used protocol for evaluating implicit feedback models and propose a way of addressing it using a small quantity of explicit feedback data. 1

4 0.10984499 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

Author: Michael Bryant, Erik B. Sudderth

Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.

5 0.10493563 126 nips-2012-FastEx: Hash Clustering with Exponential Families

Author: Amr Ahmed, Sujith Ravi, Alex J. Smola, Shravan M. Narayanamurthy

Abstract: Clustering is a key component in any data analysis toolbox. Despite its importance, scalable algorithms often eschew rich statistical models in favor of simpler descriptions such as k-means clustering. In this paper we present a sampler, capable of estimating mixtures of exponential families. At its heart lies a novel proposal distribution using random projections to achieve high throughput in generating proposals, which is crucial for clustering models with large numbers of clusters. 1

6 0.098301999 60 nips-2012-Bayesian nonparametric models for ranked data

7 0.097733669 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

8 0.097366564 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

9 0.086530246 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

10 0.086503431 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

11 0.085937843 270 nips-2012-Phoneme Classification using Constrained Variational Gaussian Process Dynamical System

12 0.084514968 145 nips-2012-Gradient Weights help Nonparametric Regressors

13 0.082578346 165 nips-2012-Iterative ranking from pair-wise comparisons

14 0.077381931 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

15 0.073185563 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

16 0.072928049 336 nips-2012-The Coloured Noise Expansion and Parameter Estimation of Diffusion Processes

17 0.071068853 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

18 0.070849486 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering

19 0.070806742 74 nips-2012-Collaborative Gaussian Processes for Preference Learning

20 0.069335245 205 nips-2012-MCMC for continuous-time discrete-state systems

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.185), (1, 0.024), (2, -0.032), (3, -0.002), (4, -0.138), (5, -0.085), (6, -0.01), (7, 0.046), (8, 0.111), (9, 0.041), (10, -0.065), (11, 0.008), (12, -0.056), (13, -0.062), (14, -0.006), (15, -0.058), (16, -0.076), (17, -0.073), (18, 0.008), (19, -0.003), (20, 0.081), (21, -0.039), (22, -0.048), (23, -0.089), (24, -0.053), (25, -0.033), (26, -0.029), (27, -0.071), (28, 0.064), (29, 0.011), (30, -0.076), (31, 0.074), (32, -0.007), (33, -0.029), (34, -0.034), (35, 0.095), (36, -0.023), (37, -0.065), (38, -0.12), (39, -0.005), (40, 0.059), (41, 0.034), (42, -0.199), (43, -0.026), (44, -0.08), (45, -0.045), (46, 0.028), (47, 0.07), (48, 0.0), (49, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94012779 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

Author: Michael C. Hughes, Erik B. Sudderth, Emily B. Fox

2 0.74877626 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

Author: Maksims Volkovs, Richard S. Zemel

3 0.56512415 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models

Author: Nicholas Foti, Sinead Williamson

Abstract: A number of dependent nonparametric processes have been proposed to model non-stationary data with unknown latent dimensionality. However, the inference algorithms are often slow and unwieldy, and are in general highly speciﬁc to a given model formulation. In this paper, we describe a large class of dependent nonparametric processes, including several existing models, and present a slice sampler that allows efﬁcient inference across this class of models. 1

4 0.54321748 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

Author: Andriy Mnih, Yee W. Teh

5 0.53570443 189 nips-2012-Learning from the Wisdom of Crowds by Minimax Entropy

Author: Dengyong Zhou, Sumit Basu, Yi Mao, John C. Platt

Abstract: An important way to make large training sets is to gather noisy labels from crowds of nonexperts. We propose a minimax entropy principle to improve the quality of these labels. Our method assumes that labels are generated by a probability distribution over workers, items, and labels. By maximizing the entropy of this distribution, the method naturally infers item confusability and worker expertise. We infer the ground truth by minimizing the entropy of this distribution, which we show minimizes the Kullback-Leibler (KL) divergence between the probability distribution and the unknown truth. We show that a simple coordinate descent scheme can optimize minimax entropy. Empirically, our results are substantially better than previously published methods for the same problem. 1

6 0.53542483 205 nips-2012-MCMC for continuous-time discrete-state systems

7 0.5229888 136 nips-2012-Forward-Backward Activation Algorithm for Hierarchical Hidden Markov Models

8 0.51888192 349 nips-2012-Training sparse natural image models with a fast Gibbs sampler of an extended state space

9 0.49441662 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

10 0.49387598 49 nips-2012-Automatic Feature Induction for Stagewise Collaborative Filtering

11 0.48485816 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo

12 0.47694591 155 nips-2012-Human memory search as a random walk in a semantic network

13 0.46709791 60 nips-2012-Bayesian nonparametric models for ranked data

14 0.46658298 30 nips-2012-Accuracy at the Top

15 0.46307603 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process

16 0.45980349 52 nips-2012-Bayesian Nonparametric Modeling of Suicide Attempts

17 0.45911303 126 nips-2012-FastEx: Hash Clustering with Exponential Families

18 0.45601949 359 nips-2012-Variational Inference for Crowdsourcing

19 0.4487955 342 nips-2012-The variational hierarchical EM algorithm for clustering hidden Markov models

20 0.4394958 41 nips-2012-Ancestor Sampling for Particle Gibbs

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.062), (17, 0.013), (21, 0.034), (38, 0.103), (39, 0.021), (42, 0.023), (54, 0.022), (55, 0.022), (72, 0.267), (74, 0.066), (76, 0.15), (80, 0.078), (92, 0.052)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80738682 107 nips-2012-Effective Split-Merge Monte Carlo Methods for Nonparametric Models of Sequential Data

Author: Michael C. Hughes, Erik B. Sudderth, Emily B. Fox

2 0.79977655 9 nips-2012-A Geometric take on Metric Learning

Author: Søren Hauberg, Oren Freifeld, Michael J. Black

Abstract: Multi-metric learning techniques learn local metric tensors in different parts of a feature space. With such an approach, even simple classiﬁers can be competitive with the state-of-the-art because the distance measure locally adapts to the structure of the data. The learned distance measure is, however, non-metric, which has prevented multi-metric learning from generalizing to tasks such as dimensionality reduction and regression in a principled way. We prove that, with appropriate changes, multi-metric learning corresponds to learning the structure of a Riemannian manifold. We then show that this structure gives us a principled way to perform dimensionality reduction and regression according to the learned metrics. Algorithmically, we provide the ﬁrst practical algorithm for computing geodesics according to the learned metrics, as well as algorithms for computing exponential and logarithmic maps on the Riemannian manifold. Together, these tools let many Euclidean algorithms take advantage of multi-metric learning. We illustrate the approach on regression and dimensionality reduction tasks that involve predicting measurements of the human body from shape data. 1 Learning and Computing Distances Statistics relies on measuring distances. When the Euclidean metric is insufﬁcient, as is the case in many real problems, standard methods break down. This is a key motivation behind metric learning, which strives to learn good distance measures from data. In the most simple scenarios a single metric tensor is learned, but in recent years, several methods have proposed learning multiple metric tensors, such that different distance measures are applied in different parts of the feature space. This has proven to be a very powerful approach for classiﬁcation tasks [1, 2], but the approach has not generalized to other tasks. Here we consider the generalization of Principal Component Analysis (PCA) and linear regression; see Fig. 1 for an illustration of our approach. The main problem with generalizing multi-metric learning is that it is based on assumptions that make the feature space both non-smooth and non-metric. Speciﬁcally, it is often assumed that straight lines form geodesic curves and that the metric tensor stays constant along these lines. These assumptions are made because it is believed that computing the actual geodesics is intractable, requiring a discretization of the entire feature space [3]. We solve these problems by smoothing the transitions between different metric tensors, which ensures a metric space where geodesics can be computed. In this paper, we consider the scenario where the metric tensor at a given point in feature space is deﬁned as the weighted average of a set of learned metric tensors. In this model, we prove that the feature space becomes a chart for a Riemannian manifold. This ensures a metric feature space, i.e. dist(x, y) = 0 ⇔ x = y , dist(x, y) = dist(y, x) (symmetry), (1) dist(x, z) ≤ dist(x, y) + dist(y, z) (triangle inequality). To compute statistics according to the learned metric, we need to be able to compute distances, which implies that we need to compute geodesics. Based on the observation that geodesics are 1 (a) Local Metrics & Geodesics (b) Tangent Space Representation (c) First Principal Geodesic Figure 1: Illustration of Principal Geodesic Analysis. (a) Geodesics are computed between the mean and each data point. (b) Data is mapped to the Euclidean tangent space and the ﬁrst principal component is computed. (c) The principal component is mapped back to the feature space. smooth curves in Riemannian spaces, we derive an algorithm for computing geodesics that only requires a discretization of the geodesic rather than the entire feature space. Furthermore, we show how to compute the exponential and logarithmic maps of the manifold. With this we can map any point back and forth between a Euclidean tangent space and the manifold. This gives us a general strategy for incorporating the learned metric tensors in many Euclidean algorithms: map the data to the tangent of the manifold, perform the Euclidean analysis and map the results back to the manifold. Before deriving the algorithms (Sec. 3) we set the scene by an analysis of the shortcomings of current state-of-the-art methods (Sec. 2), which motivate our ﬁnal model. The model is general and can be used for many problems. Here we illustrate it with several challenging problems in 3D body shape modeling and analysis (Sec. 4). All proofs can be found in the supplementary material along with algorithmic details and further experimental results. 2 Background and Related Work Single-metric learning learns a metric tensor, M, such that distances are measured as dist2 (xi , xj ) = xi − xj 2 M ≡ (xi − xj )T M(xi − xj ) , (2) where M is a symmetric and positive deﬁnite D × D matrix. Classic approaches for ﬁnding such a metric tensor include PCA, where the metric is given by the inverse covariance matrix of the training data; and linear discriminant analysis (LDA), where the metric tensor is M = S−1 SB S−1 , with Sw W W and SB being the within class scatter and the between class scatter respectively [9]. A more recent approach tries to learn a metric tensor from triplets of data points (xi , xj , xk ), where the metric should obey the constraint that dist(xi , xj ) < dist(xi , xk ). Here the constraints are often chosen such that xi and xj belong to the same class, while xi and xk do not. Various relaxed versions of this idea have been suggested such that the metric can be learned by solving a semi-deﬁnite or a quadratic program [1, 2, 4–8]. Among the most popular approaches is the Large Margin Nearest Neighbor (LMNN) classiﬁer [5], which ﬁnds a linear transformation that satisﬁes local distance constraints, making the approach suitable for multi-modal classes. For many problems, a single global metric tensor is not enough, which motivates learning several local metric tensors. The classic work by Hastie and Tibshirani [9] advocates locally learning metric tensors according to LDA and using these as part of a kNN classiﬁer. In a somewhat similar fashion, Weinberger and Saul [5] cluster the training data and learn a separate metric tensor for each cluster using LMNN. A more extreme point of view was taken by Frome et al. [1, 2], who learn a diagonal metric tensor for every point in the training set, such that distance rankings are preserved. Similarly, Malisiewicz and Efros [6] ﬁnd a diagonal metric tensor for each training point such that the distance to a subset of the training data from the same class is kept small. Once a set of metric tensors {M1 , . . . , MR } has been learned, the distance dist(a, b) is measured according to (2) where “the nearest” metric tensor is used, i.e. R M(x) = r=1 wr (x) ˜ Mr , where wr (x) = ˜ ˜ j wj (x) 1 0 x − xr 2 r ≤ x − xj M otherwise 2 Mj , ∀j , (3) where x is either a or b depending on the algorithm. Note that this gives a non-metric distance function as it is not symmetric. To derive this equation, it is necessary to assume that 1) geodesics 2 −8 −8 Assumed Geodesics Location of Metric Tensors Test Points −6 −8 Actual Geodesics Location of Metric Tensors Test Points −6 Riemannian Geodesics Location of Metric Tensors Test Points −6 −4 −4 −4 −2 −2 −2 0 0 0 2 2 2 4 4 4 6 −8 6 −8 −6 −4 −2 0 (a) 2 4 6 −6 −4 −2 0 2 4 6 6 −8 −6 (b) −4 −2 (c) 0 2 4 6 (d) Figure 2: (a)–(b) An illustrative example where straight lines do not form geodesics and where the metric tensor does not stay constant along lines; see text for details. The background color is proportional to the trace of the metric tensor, such that light grey corresponds to regions where paths are short (M1 ), and dark grey corresponds to regions they are long (M2 ). (c) The suggested geometric model along with the geodesics. Again, background colour is proportional to the trace of the metric tensor; the colour scale is the same is used in (a) and (b). (d) An illustration of the exponential and logarithmic maps. form straight lines, and 2) the metric tensor stays constant along these lines [3]. Both assumptions are problematic, which we illustrate with a simple example in Fig. 2a–c. Assume we are given two metric tensors M1 = 2I and M2 = I positioned at x1 = (2, 2)T and x2 = (4, 4)T respectively. This gives rise to two regions in feature space in which x1 is nearest in the ﬁrst and x2 is nearest in the second, according to (3). This is illustrated in Fig. 2a. In the same ﬁgure, we also show the assumed straight-line geodesics between selected points in space. As can be seen, two of the lines goes through both regions, such that the assumption of constant metric tensors along the line is violated. Hence, it would seem natural to measure the length of the line, by adding the length of the line segments which pass through the different regions of feature space. This was suggested by Ramanan and Baker [3] who also proposed a polynomial time algorithm for measuring these line lengths. This gives a symmetric distance function. Properly computing line lengths according to the local metrics is, however, not enough to ensure that the distance function is metric. As can be seen in Fig. 2a the straight line does not form a geodesic as a shorter path can be found by circumventing the region with the “expensive” metric tensor M1 as illustrated in Fig. 2b. This issue makes it trivial to construct cases where the triangle inequality is violated, which again makes the line length measure non-metric. In summary, if we want a metric feature space, we can neither assume that geodesics are straight lines nor that the metric tensor stays constant along such lines. In practice, good results have been reported using (3) [1,3,5], so it seems obvious to ask: is metricity required? For kNN classiﬁers this does not appear to be the case, with many successes based on dissimilarities rather than distances [10]. We, however, want to generalize PCA and linear regression, which both seek to minimize the reconstruction error of points projected onto a subspace. As the notion of projection is hard to deﬁne sensibly in non-metric spaces, we consider metricity essential. In order to build a model with a metric feature space, we change the weights in (3) to be smooth functions. This impose a well-behaved geometric structure on the feature space, which we take advantage of in order to perform statistical analysis according to the learned metrics. However, ﬁrst we review the basics of Riemannian geometry as this provides the theoretical foundation of our work. 2.1 Geodesics and Riemannian Geometry We start by deﬁning Riemannian manifolds, which intuitively are smoothly curved spaces equipped with an inner product. Formally, they are smooth manifolds endowed with a Riemannian metric [11]: Deﬁnition A Riemannian metric M on a manifold M is a smoothly varying inner product < a, b >x = aT M(x)b in the tangent space Tx M of each point x ∈ M . 3 Often Riemannian manifolds are represented by a chart; i.e. a parameter space for the curved surface. An example chart is the spherical coordinate system often used to represent spheres. While such charts are often ﬂat spaces, the curvature of the manifold arises from the smooth changes in the metric. On a Riemannian manifold M, the length of a smooth curve c : [0, 1] → M is deﬁned as the integral of the norm of the tangent vector (interpreted as speed) along the curve: 1 Length(c) = 1 c (λ) M(c(λ)) dλ c (λ)T M(c(λ))c (λ)dλ , = (4) 0 0 where c denotes the derivative of c and M(c(λ)) is the metric tensor at c(λ). A geodesic curve is then a length-minimizing curve connecting two given points x and y, i.e. (5) cgeo = arg min Length(c) with c(0) = x and c(1) = y . c The distance between x and y is deﬁned as the length of the geodesic. Given a tangent vector v ∈ Tx M, there exists a unique geodesic cv (t) with initial velocity v at x. The Riemannian exponential map, Expx , maps v to a point on the manifold along the geodesic cv at t = 1. This mapping preserves distances such that dist(cv (0), cv (1)) = v . The inverse of the exponential map is the Riemannian logarithmic map denoted Logx . Informally, the exponential and logarithmic maps move points back and forth between the manifold and the tangent space while preserving distances (see Fig. 2d for an illustration). This provides a general strategy for generalizing many Euclidean techniques to Riemannian domains: data points are mapped to the tangent space, where ordinary Euclidean techniques are applied and the results are mapped back to the manifold. 3 A Metric Feature Space With the preliminaries settled we deﬁne the new model. Let C = RD denote the feature space. We endow C with a metric tensor in every point x, which we deﬁne akin to (3), R M(x) = wr (x)Mr , where wr (x) = r=1 wr (x) ˜ R ˜ j=1 wj (x) , (6) with wr > 0. The only difference from (3) is that we shall not restrict ourselves to binary weight ˜ functions wr . We assume the metric tensors Mr have already been learned; Sec. 4 contain examples ˜ where they have been learned using LMNN [5] and LDA [9]. From the deﬁnition of a Riemannian metric, we trivially have the following result: Lemma 1 The space C = RD endowed with the metric tensor from (6) is a chart of a Riemannian manifold, iff the weights wr (x) change smoothly with x. Hence, by only considering smooth weight functions wr we get a well-studied geometric structure ˜ on the feature space, which ensures us that it is metric. To illustrate the implications we return to the example in Fig. 2. We change the weight functions from binary to squared exponentials, which gives the feature space shown in Fig. 2c. As can be seen, the metric tensor now changes smoothly, which also makes the geodesics smooth curves (a property we will use when computing the geodesics). It is worth noting that Ramanan and Baker [3] also consider the idea of smoothly averaging the metric tensor. They, however, only evaluate the metric tensor at the test point of their classiﬁer and then assume straight line geodesics with a constant metric tensor. Such assumptions violate the premise of a smoothly changing metric tensor and, again, the distance measure becomes non-metric. Lemma 1 shows that metric learning can be viewed as manifold learning. The main difference between our approach and techniques such as Isomap [12] is that, while Isomap learns an embedding of the data points, we learn the actual manifold structure. This gives us the beneﬁt that we can compute geodesics as well as the exponential and logarithmic maps. These provide us with mappings back and forth between the manifold and Euclidean representation of the data, which preserve distances as well as possible. The availability of such mappings is in stark contrast to e.g. Isomap. In the next section we will derive a system of ordinary differential equations (ODE’s) that geodesics in C have to satisfy, which provides us with algorithms for computing geodesics as well as exponential and logarithmic maps. With these we can generalize many Euclidean techniques. 4 3.1 Computing Geodesics, Maps and Statistics At minima of (4) we know that the Euler-Lagrange equation must hold [11], i.e. ∂L d ∂L , where L(λ, c, c ) = c (λ)T M(c(λ))c (λ) . = ∂c dλ ∂c As we have an explicit expression for the metric tensor we can compute (7) in closed form: (7) Theorem 2 Geodesic curves in C satisfy the following system of 2nd order ODE’s M(c(λ))c (λ) = − 1 ∂vec [M(c(λ))] 2 ∂c(λ) T (c (λ) ⊗ c (λ)) , (8) where ⊗ denotes the Kronecker product and vec [·] stacks the columns of a matrix into a vector [13]. Proof See supplementary material. This result holds for any smooth weight functions wr . We, however, still need to compute ∂vec[M] , ˜ ∂c which depends on the speciﬁc choice of wr . Any smooth weighting scheme is applicable, but we ˜ restrict ourselves to the obvious smooth generalization of (3) and use squared exponentials. From this assumption, we get the following result Theorem 3 For wr (x) = exp − ρ x − xr ˜ 2 ∂vec [M(c)] = ∂c the derivative of the metric tensor from (6) is R ρ R j=1 2 Mr R 2 wj ˜ T r=1 T wj (c − xj ) Mj − (c − xr ) Mr ˜ wr vec [Mr ] ˜ . (9) j=1 Proof See supplementary material. Computing Geodesics. Any geodesic curve must be a solution to (8). Hence, to compute a geodesic between x and y, we can solve (8) subject to the constraints c(0) = x and c(1) = y . (10) This is a boundary value problem, which has a smooth solution. This allows us to solve the problem numerically using a standard three-stage Lobatto IIIa formula, which provides a fourth-order accurate C 1 –continuous solution [14]. Ramanan and Baker [3] discuss the possibility of computing geodesics, but arrive at the conclusion that this is intractable based on the assumption that it requires discretizing the entire feature space. Our solution avoids discretizing the feature space by discretizing the geodesic curve instead. As this is always one-dimensional the approach remains tractable in high-dimensional feature spaces. Computing Logarithmic Maps. Once a geodesic c is found, it follows from the deﬁnition of the logarithmic map, Logx (y), that it can be computed as v = Logx (y) = c (0) Length(c) . c (0) (11) In practice, we solve (8) by rewriting it as a system of ﬁrst order ODE’s, such that we compute both c and c simultaneously (see supplementary material for details). Computing Exponential Maps. Given a starting point x on the manifold and a vector v in the tangent space, the exponential map, Expx (v), ﬁnds the unique geodesic starting at x with initial velocity v. As the geodesic must fulﬁll (8), we can compute the exponential map by solving this system of ODE’s with the initial conditions c(0) = x and c (0) = v . (12) This initial value problem has a unique solution, which we ﬁnd numerically using a standard RungeKutta scheme [15]. 5 3.1.1 Generalizing PCA and Regression At this stage, we know that the feature space is Riemannian and we know how to compute geodesics and exponential and logarithmic maps. We now seek to generalize PCA and linear regression, which becomes straightforward since solutions are available in Riemannian spaces [16, 17]. These generalizations can be summarized as mapping the data to the tangent space at the mean, performing standard Euclidean analysis in the tangent and mapping the results back. The ﬁrst step is to compute the mean value on the manifold, which is deﬁned as the point that minimizes the sum-of-squares distances to the data points. Pennec [18] provides an efﬁcient gradient descent approach for computing this point, which we also summarize in the supplementary material. The empirical covariance of a set of points is deﬁned as the ordinary Euclidean covariance in the tangent space at the mean value [18]. With this in mind, it is not surprising that the principal components of a dataset have been generalized as the geodesics starting at the mean with initial velocity corresponding to the eigenvectors of the covariance [16], γvd (t) = Expµ (tvd ) , (13) th where vd denotes the d eigenvector of the covariance. This approach is called Principal Geodesic Analysis (PGA), and the geodesic curve γvd is called the principal geodesic. An illustration of the approach can be seen in Fig. 1 and more algorithmic details are in the supplementary material. Linear regression has been generalized in a similar way [17] by performing regression in the tangent of the mean and mapping the resulting line back to the manifold using the exponential map. The idea of working in the tangent space is both efﬁcient and convenient, but comes with an element of approximation as the logarithmic map is only guarantied to preserve distances to the origin of the tangent and not between all pairs of data points. Practical experience, however, indicates that this is a good tradeoff; see [19] for a more in-depth discussion of when the approximation is suitable. 4 Experiments To illustrate the framework1 we consider an example in human body analysis, and then we analyze the scalability of the approach. But ﬁrst, to build intuition, Fig. 3a show synthetically generated data samples from two classes. We sample random points xr and learn a local LDA metric [9] by considering all data points within a radius; this locally pushes the two classes apart. We combine the local metrics using (6) and Fig. 3b show the data in the tangent space of the resulting manifold. As can be seen the two classes are now globally further apart, which shows the effect of local metrics. 4.1 Human Body Shape We consider a regression example concerning human body shape analysis. We study 986 female body laser scans from the CAESAR [20] data set; each shape is represented using the leading 35 principal components of the data learned using a SCAPE-like model [21, 22]. Each shape is associated with anthropometric measurements such as body height, shoe size, etc. We show results for shoulder to wrist distance and shoulder breadth, but results for more measurements are in the supplementary material. To predict the measurements from shape coefﬁcients, we learn local metrics and perform linear regression according to these. As a further experiment, we use PGA to reduce the dimensionality of the shape coefﬁcients according to the local metrics, and measure the quality of the reduction by performing linear regression to predict the measurements. As a baseline we use the corresponding Euclidean techniques. To learn the local metric we do the following. First we whiten the data such that the variance captured by PGA will only be due to the change of metric; this allows easy visualization of the impact of the learned metrics. We then cluster the body shapes into equal-sized clusters according to the measurement and learn a LMNN metric for each cluster [5], which we associate with the mean of each class. These push the clusters apart, which introduces variance along the directions where the measurement changes. From this we construct a Riemannian manifold according to (6), 1 Our software implementation for computing geodesics and performing manifold statistics is available at http://ps.is.tue.mpg.de/project/Smooth Metric Learning 6 30 Euclidean Model Riemannian Model 24 20 18 16 20 15 10 5 14 12 0 (a) 25 22 Running Time (sec.) Average Prediction Error 26 10 (b) 20 Dimensionality 0 0 30 50 (c) 100 Dimensionality 150 (d) 4 3 3 2 2 1 1 0 −1 −2 −3 −4 −4 −3 −2 −1 0 1 2 3 4 Shoulder breadth 20 −2 −3 Euclidean Model Riemannian Model 0 −1 25 Prediction Error 4 15 10 0 −4 −5 0 4 10 15 20 Dimensionality 16 25 30 35 17 3 3 5 5 Euclidean Model Riemannian Model 2 15 2 1 1 Prediction Error Shoulder to wrist distance Figure 3: Left panels: Synthetic data. (a) Samples from two classes along with illustratively sampled metric tensors from (6). (b) The data represented in the tangent of a manifold constructed from local LDA metrics learned at random positions. Right panels: Real data. (c) Average error of linearly predicted body measurements (mm). (d) Running time (sec) of the geodesic computation as a function of dimensionality. 0 0 −1 −2 −1 −3 14 13 12 11 −2 −4 −3 −4 −4 10 −5 −3 −2 −1 0 1 Euclidean PCA 2 3 −6 −4 9 0 −2 0 2 4 Tangent Space PCA (PGA) 6 5 10 15 20 Dimensionality 25 30 35 Regression Error Figure 4: Left: body shape data in the ﬁrst two principal components according to the Euclidean metric. Point color indicates cluster membership. Center: As on the left, but according to the Riemannian model. Right: regression error as a function of the dimensionality of the shape space; again the Euclidean metric and the Riemannian metric are compared. compute the mean value on the manifold, map the data to the tangent space at the mean and perform linear regression in the tangent space. As a ﬁrst visualization we plot the data expressed in the leading two dimensions of PGA in Fig. 4; as can be seen the learned metrics provide principal geodesics, which are more strongly related with the measurements than the Euclidean model. In order to predict the measurements from the body shape, we perform linear regression, both directly in the shape space according to the Euclidean metric and in the tangent space of the manifold corresponding to the learned metrics (using the logarithmic map from (11)). We measure the prediction error using leave-one-out cross-validation. To further illustrate the power of the PGA model, we repeat this experiment for different dimensionalities of the data. The results are plotted in Fig. 4, showing that regression according to the learned metrics outperforms the Euclidean model. To verify that the learned metrics improve accuracy, we average the prediction errors over all millimeter measurements. The result in Fig. 3c shows that much can be gained in lower dimensions by using the local metrics. To provide visual insights into the behavior of the learned metrics, we uniformly sample body shape along the ﬁrst principal geodesic (in the range ±7 times the standard deviation) according to the different metrics. The results are available as a movie in the supplementary material, but are also shown in Fig. 5. As can be seen, the learned metrics pick up intuitive relationships between body shape and the measurements, e.g. shoulder to wrist distance is related to overall body size, while shoulder breadth is related to body weight. 7 Shoulder to wrist distance Shoulder breadth Figure 5: Shapes corresponding to the mean (center) and ±7 times the standard deviations along the principal geodesics (left and right). Movies are available in the supplementary material. 4.2 Scalability The human body data set is small enough (986 samples in 35 dimensions) that computing a geodesic only takes a few seconds. To show that the current unoptimized Matlab implementation can handle somewhat larger datasets, we brieﬂy consider a dimensionality reduction task on the classic MNIST handwritten digit data set. We use the preprocessed data available with [3] where the original 28×28 gray scale images were deskewed and projected onto their leading 164 Euclidean principal components (which captures 95% of the variance in the original data). We learn one diagonal LMNN metric per class, which we associate with the mean of the class. From this we construct a Riemannian manifold from (6), compute the mean value on the manifold and compute geodesics between the mean and each data point; this is the computationally expensive part of performing PGA. Fig. 3d plots the average running time (sec) for the computation of geodesics as a function of the dimensionality of the training data. A geodesic can be computed in 100 dimensions in approximately 5 sec., whereas in 150 dimensions it takes about 30 sec. In this experiment, we train a PGA model on 60,000 data points, and test a nearest neighbor classiﬁer in the tangent space as we decrease the dimensionality of the model. Compared to a Euclidean model, this gives a modest improvement in classiﬁcation accuracy of 2.3 percent, when averaged across different dimensionalities. Plots of the results can be found in the supplementary material. 5 Discussion This work shows that multi-metric learning techniques are indeed applicable outside the realm of kNN classiﬁers. The idea of deﬁning the metric tensor at any given point as the weighted average of a ﬁnite set of learned metrics is quite natural from a modeling point of view, which is also validated by the Riemannian structure of the resulting space. This opens both a theoretical and a practical toolbox for analyzing and developing algorithms that use local metric tensors. Speciﬁcally, we show how to use local metric tensors for both regression and dimensionality reduction tasks. Others have attempted to solve non-classiﬁcation problems using local metrics, but we feel that our approach is the ﬁrst to have a solid theoretical backing. For example, Hastie and Tibshirani [9] use local LDA metrics for dimensionality reduction by averaging the local metrics and using the resulting metric as part of a Euclidean PCA, which essentially is a linear approach. Another approach was suggested by Hong et al. [23] who simply compute the principal components according to each metric separately, such that one low dimensional model is learned per metric. The suggested approach is, however, not difﬁculty-free in its current implementation. Currently, we are using off-the-shelf numerical solvers for computing geodesics, which can be computationally demanding. While we managed to analyze medium-sized datasets, we believe that the run-time can be drastically improved by developing specialized numerical solvers. In the experiments, we learned local metrics using techniques specialized for classiﬁcation tasks as this is all the current literature provides. We expect improvements by learning the metrics speciﬁcally for regression and dimensionality reduction, but doing so is currently an open problem. Acknowledgments: Søren Hauberg is supported in part by the Villum Foundation, and Oren Freifeld is supported in part by NIH-NINDS EUREKA (R01-NS066311). 8 References [1] Andrea Frome, Yoram Singer, and Jitendra Malik. Image retrieval and classiﬁcation using local distance functions. In B. Sch¨ lkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing o Systems 19 (NIPS), pages 417–424, Cambridge, MA, 2007. MIT Press. [2] Andrea Frome, Fei Sha, Yoram Singer, and Jitendra Malik. Learning globally-consistent local distance functions for shape-based image retrieval and classiﬁcation. In International Conference on Computer Vision (ICCV), pages 1–8, 2007. [3] Deva Ramanan and Simon Baker. Local distance functions: A taxonomy, new algorithms, and an evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):794–806, 2011. [4] Shai Shalev-Shwartz, Yoram Singer, and Andrew Y. Ng. Online and batch learning of pseudo-metrics. In Proceedings of the twenty-ﬁrst international conference on Machine learning, ICML ’04, pages 94–101. ACM, 2004. [5] Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classiﬁcation. The Journal of Machine Learning Research, 10:207–244, 2009. [6] Tomasz Malisiewicz and Alexei A. Efros. Recognition by association via learning per-exemplar distances. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–8, 2008. [7] Yiming Ying and Peng Li. Distance metric learning with eigenvalue optimization. The Journal of Machine Learning Research, 13:1–26, 2012. [8] Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In Advances in Neural Information Processing Systems 16 (NIPS), 2004. [9] Trevor Hastie and Robert Tibshirani. Discriminant adaptive nearest neighbor classiﬁcation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):607–616, June 1996. [10] Elzbieta Pekalska, Pavel Paclik, and Robert P. W. Duin. A generalized kernel approach to dissimilaritybased classiﬁcation. Journal of Machine Learning Research, 2:175–211, 2002. [11] Manfredo Perdigao do Carmo. Riemannian Geometry. Birkh¨ user Boston, January 1992. a [12] Joshua B. Tenenbaum, Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000. [13] Jan R. Magnus and Heinz Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley & Sons, 2007. [14] Jacek Kierzenka and Lawrence F. Shampine. A BVP solver based on residual control and the Matlab PSE. ACM Transactions on Mathematical Software, 27(3):299–316, 2001. [15] John R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6:19–26, 1980. [16] P. Thomas Fletcher, Conglin Lu, Stephen M. Pizer, and Sarang Joshi. Principal Geodesic Analysis for the study of Nonlinear Statistics of Shape. IEEE Transactions on Medical Imaging, 23(8):995–1005, 2004. [17] Peter E. Jupp and John T. Kent. Fitting smooth paths to spherical data. Applied Statistics, 36(1):34–46, 1987. [18] Xavier Pennec. Probabilities and statistics on Riemannian manifolds: Basic tools for geometric measurements. In Proceedings of Nonlinear Signal and Image Processing, pages 194–198, 1999. [19] Stefan Sommer, Francois Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact ¸ principal geodesic analysis and the effect of linear approximations. In European Conference on Computer Vision (ECCV), pages 43–56, 2010. [20] Kathleen M. Robinette, Hein Daanen, and Eric Paquet. The CAESAR project: a 3-D surface anthropometry survey. In 3-D Digital Imaging and Modeling, pages 380–386, 1999. [21] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. Scape: shape completion and animation of people. ACM Transactions on Graphics, 24(3):408–416, 2005. [22] Oren Freifeld and Michael J. Black. Lie bodies: A manifold representation of 3D human shape. In A. Fitzgibbon et al. (Eds.), editor, European Conference on Computer Vision (ECCV), Part I, LNCS 7572, pages 1–14. Springer-Verlag, oct 2012. [23] Yi Hong, Quannan Li, Jiayan Jiang, and Zhuowen Tu. Learning a mixture of sparse distance metrics for classiﬁcation and dimensionality reduction. In International Conference on Computer Vision (ICCV), pages 906–913, 2011. 9

3 0.78963679 101 nips-2012-Discriminatively Trained Sparse Code Gradients for Contour Detection

Author: Ren Xiaofeng, Liefeng Bo

Abstract: Finding contours in natural images is a fundamental problem that serves as the basis of many tasks such as image segmentation and object recognition. At the core of contour detection technologies are a set of hand-designed gradient features, used by most approaches including the state-of-the-art Global Pb (gPb) operator. In this work, we show that contour detection accuracy can be significantly improved by computing Sparse Code Gradients (SCG), which measure contrast using patch representations automatically learned through sparse coding. We use K-SVD for dictionary learning and Orthogonal Matching Pursuit for computing sparse codes on oriented local neighborhoods, and apply multi-scale pooling and power transforms before classifying them with linear SVMs. By extracting rich representations from pixels and avoiding collapsing them prematurely, Sparse Code Gradients effectively learn how to measure local contrasts and ﬁnd contours. We improve the F-measure metric on the BSDS500 benchmark to 0.74 (up from 0.71 of gPb contours). Moreover, our learning approach can easily adapt to novel sensor data such as Kinect-style RGB-D cameras: Sparse Code Gradients on depth maps and surface normals lead to promising contour detection using depth and depth+color, as veriﬁed on the NYU Depth Dataset. 1

4 0.77525854 250 nips-2012-On-line Reinforcement Learning Using Incremental Kernel-Based Stochastic Factorization

Author: Doina Precup, Joelle Pineau, Andre S. Barreto

Abstract: Kernel-based stochastic factorization (KBSF) is an algorithm for solving reinforcement learning tasks with continuous state spaces which builds a Markov decision process (MDP) based on a set of sample transitions. What sets KBSF apart from other kernel-based approaches is the fact that the size of its MDP is independent of the number of transitions, which makes it possible to control the trade-off between the quality of the resulting approximation and the associated computational cost. However, KBSF’s memory usage grows linearly with the number of transitions, precluding its application in scenarios where a large amount of data must be processed. In this paper we show that it is possible to construct KBSF’s MDP in a fully incremental way, thus freeing the space complexity of this algorithm from its dependence on the number of sample transitions. The incremental version of KBSF is able to process an arbitrary amount of data, which results in a model-based reinforcement learning algorithm that can be used to solve continuous MDPs in both off-line and on-line regimes. We present theoretical results showing that KBSF can approximate the value function that would be computed by conventional kernel-based learning with arbitrary precision. We empirically demonstrate the effectiveness of the proposed algorithm in the challenging threepole balancing task, in which the ability to process a large number of transitions is crucial for success. 1

5 0.72292441 258 nips-2012-Online L1-Dictionary Learning with Application to Novel Document Detection

Author: Shiva P. Kasiviswanathan, Huahua Wang, Arindam Banerjee, Prem Melville

Abstract: Given their pervasive use, social media, such as Twitter, have become a leading source of breaking news. A key task in the automated identiﬁcation of such news is the detection of novel documents from a voluminous stream of text documents in a scalable manner. Motivated by this challenge, we introduce the problem of online 1 -dictionary learning where unlike traditional dictionary learning, which uses squared loss, the 1 -penalty is used for measuring the reconstruction error. We present an efﬁcient online algorithm for this problem based on alternating directions method of multipliers, and establish a sublinear regret bound for this algorithm. Empirical results on news-stream and Twitter data, shows that this online 1 -dictionary learning algorithm for novel document detection gives more than an order of magnitude speedup over the previously known batch algorithm, without any signiﬁcant loss in quality of results. 1

6 0.68101281 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback

7 0.67765588 318 nips-2012-Sparse Approximate Manifolds for Differential Geometric MCMC

8 0.65553963 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition

9 0.63341337 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes

10 0.62934905 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization

11 0.62881947 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

12 0.62720215 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

13 0.62702876 260 nips-2012-Online Sum-Product Computation Over Trees

14 0.62691575 188 nips-2012-Learning from Distributions via Support Measure Machines

15 0.62588 210 nips-2012-Memorability of Image Regions

16 0.6257081 172 nips-2012-Latent Graphical Model Selection: Efficient Methods for Locally Tree-like Graphs

17 0.62534142 197 nips-2012-Learning with Recursive Perceptual Representations

18 0.62484688 111 nips-2012-Efficient Sampling for Bipartite Matching Problems

19 0.62461913 90 nips-2012-Deep Learning of Invariant Features via Simulated Fixations in Video

20 0.62438852 294 nips-2012-Repulsive Mixtures