nips nips2012 nips2012-115 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Paul Vernaza, Drew Bagnell
Abstract: Maximum entropy (MaxEnt) modeling is a popular choice for sequence analysis in applications such as natural language processing, where the sequences are embedded in discrete, tractably-sized spaces. We consider the problem of applying MaxEnt to distributions over paths in continuous spaces of high dimensionality— a problem for which inference is generally intractable. Our main contribution is to show that this intractability can be avoided as long as the constrained features possess a certain kind of low dimensional structure. In this case, we show that the associated partition function is symmetric and that this symmetry can be exploited to compute the partition function efficiently in a compressed form. Empirical results are given showing an application of our method to learning models of high-dimensional human motion capture data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Efficient high-dimensional maximum entropy modeling via symmetric partition functions J. [sent-1, score-0.403]
2 edu Abstract Maximum entropy (MaxEnt) modeling is a popular choice for sequence analysis in applications such as natural language processing, where the sequences are embedded in discrete, tractably-sized spaces. [sent-5, score-0.246]
3 We consider the problem of applying MaxEnt to distributions over paths in continuous spaces of high dimensionality— a problem for which inference is generally intractable. [sent-6, score-0.401]
4 Our main contribution is to show that this intractability can be avoided as long as the constrained features possess a certain kind of low dimensional structure. [sent-7, score-0.285]
5 In this case, we show that the associated partition function is symmetric and that this symmetry can be exploited to compute the partition function efficiently in a compressed form. [sent-8, score-0.682]
6 1 Introduction This work aims to generate useful probabilistic models of high dimensional trajectories in continuous spaces. [sent-10, score-0.261]
7 1, which demonstrates the application of our proposed method to the problem of building generative models of high dimensional human motion capture data. [sent-12, score-0.211]
8 Using this method, we may efficiently learn models and perform inferences including but not limited to the following: (1) Given any single pose, what is the probability that a certain type of motion ever visits this pose? [sent-13, score-0.236]
9 (3) Given any initial sequence of poses, what are the odds that this sequence corresponds to one action type versus another? [sent-15, score-0.23]
10 The maximum entropy learning (MaxEnt) approach advocated here has the distinct advantage of being able to efficiently answer all of the aforementioned global inferences in a unified framework while also allowing the use of global features of the state and observations. [sent-17, score-0.348]
11 We show how MaxEnt modeling may be efficiently applied to paths in continuous state spaces of high dimensionality. [sent-19, score-0.462]
12 This is achieved without having to resort to expensive, approximate inference methods based on MCMC, and without having to assume that the sequences themselves lie in or near a low dimensional submanifold, as in standard dimensionality-reduction-based methods. [sent-20, score-0.287]
13 Here we suppose that we are tasked with the problem of comparing two sets of paths: the first, sampled from an empirical distribution; and the second, sampled from a learned distribution intended to model the distribution underlying the empirical samples. [sent-24, score-0.121]
14 We claim that a natural approach to this problem is to visualize both sets of paths by projecting them onto a low-dimensional subspace and comparing the projections visually. [sent-26, score-1.942]
15 [Figure 1 residue: per-class log-probability bars for up-phase jumping jack, down-phase jumping jack, side twist, and cross-toe touch; panel (a) true held-out class = side twist, panel (b) true held-out class = down-phase jumping jack.] [sent-31, score-0.434]
16 Figure 1: Visualizations of predictions of future locations of hands for an individually held-out motion capture frame, conditioned on classes indicated by labels above figures, and corresponding class membership probabilities. [sent-40, score-0.674]
17 Figure 2: Illustration of the constraint that paths sampled from the learned distribution should (in expectation) visit certain regions of space exactly as often as they are visited by paths sampled from the true distribution, after projection of both onto a low dimensional subspace. [sent-42, score-0.944]
18 We then might consider automating this procedure by choosing numerical features of the projected paths and comparing these features in order to determine whether the projected paths appear similar. [sent-47, score-0.788]
19 The MaxEnt method described here iteratively samples paths, projects them onto a low dimensional subspace, computes features of these projected paths, and adjusts the distribution so as to ensure that, in expectation, these features match the desired features. [sent-49, score-0.491]
20 A key contribution of this work is to show that employing low dimensional features of this sort enables tractable inference and learning algorithms, even in high dimensional spaces. [sent-50, score-0.511]
21 Maximum entropy learning requires repeatedly calculating feature statistics for different distributions, which generally requires computing average feature values over all paths sampled from the distributions. [sent-51, score-0.47]
22 Though this is straightforward to accomplish via dynamic programming in low dimensional spaces, it may not be obvious that the same can be accomplished in high-dimensional spaces. [sent-52, score-0.194]
23 2 2 Preliminaries We now briefly review the basic MaxEnt modeling problem in discrete state spaces. [sent-58, score-0.158]
24 In the basic MaxEnt problem, we have N disjoint events xi , K random variables denoted features φj (xi ) mapping events to scalars, and K expected values of these features Eφj . [sent-59, score-0.403]
25 To continue the example previously discussed, we will think of each xi as being a path, φj (xi ) as being the number of times that a path passes through the jth spatial region, and Eφj as the empirically estimated number of times that a path visits the jth region. [sent-60, score-0.463]
26 Our goal is to find a distribution p(xi) over the events consistent with our empirical observations in the sense that it generates the observed feature expectations: Σi φj(xi) p(xi) = Eφj, ∀j ∈ {1, . . . , K}. [sent-61, score-0.131]
27 This problem can be written compactly as maxp − Σi pi log pi s.t. Σi φj(xi) pi = Eφj ∀j, Σi pi = 1. (2) [sent-66, score-0.125]
28 Its solution takes the exponential-family form p(xi | θ) = exp(−θ⊤φ(xi)) / Σk exp(−θ⊤φ(xk)), with θ chosen so that the expectation constraints in (2) hold. (3) 3 MaxEnt modeling of continuous paths We now consider an extension of the MaxEnt formalism to the case that the events are paths embedded in a continuous space. [sent-71, score-0.782]
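To make the discrete formulation above concrete before moving to continuous paths, here is a minimal sketch of fitting it by matching feature expectations; the toy events, feature values, and step size are hypothetical, and NumPy is assumed.

```python
import numpy as np

# Hypothetical toy problem: N = 4 events, K = 2 features.
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0],
                [0.0, 0.0]])        # phi[i, j] = feature j of event x_i
target = np.array([0.6, 0.4])       # empirically observed expectations E[phi_j]

theta = np.zeros(2)
for _ in range(5000):
    p = np.exp(-phi @ theta)        # unnormalized p(x_i | theta) under the cost convention
    p /= p.sum()
    model = phi.T @ p               # model feature expectations E_p[phi_j | theta]
    theta += 0.5 * (model - target) # raise the cost weight of over-expressed features

p = np.exp(-phi @ theta); p /= p.sum()
print(theta, phi.T @ p)             # model expectations now (approximately) match `target`
```

The same moment-matching view carries over to paths; only the computation of the model expectations changes.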
29 A natural choice in this case is to express each feature φj as an integral of the following form: φj(x) = ∫_0^T ψj(x(s)) ds, (4) where T is the duration (or length) of x and each ψj : RN → R+ is what we refer to as a feature potential. [sent-75, score-0.135]
30 An analogous expression for the probability of a continuous path is then obtained by substituting these features into (3). [sent-77, score-0.287]
31 Defining the cost function Cθ := Σj θj ψj and the cost functional Sθ{x} := ∫_0^T Cθ(x(s)) ds, (5) we have that p(x | θ) = exp(−Sθ{x}) / ∫ exp(−Sθ{x}) Dx. (6) [sent-78, score-0.354]
32 The normalization factor Zθ := ∫ exp(−Sθ{x}) Dx is referred to as the partition function. [sent-79, score-0.317]
33 As in the discrete case, computing the partition function is of prime concern, as it enables a variety of inference and learning techniques. [sent-80, score-0.411]
34 The functional integral in (6) can be formalized in several ways, including taking an expectation with respect to Wiener measure [12] or as a Feynman integral [4]. [sent-81, score-0.15]
35 The solution, denoted Zθ(a) for a ∈ RN, gives the value of the functional integral evaluated over all paths beginning at a and ending at a given goal location (henceforth assumed w.l.o.g. to be the origin). [sent-83, score-0.355]
36 A discrete approximation to the partition function can therefore be computed via standard numerical methods such as finite differences, finite elements, or spectral methods [2]. [sent-88, score-0.318]
37 However, we proceed by discretizing the state space as a lattice graph and computing the partition function associated with discrete paths in this graph via a standard dynamic programming method [1, 15, 11]. [sent-89, score-0.758]
38 Concretely, the discretized partition function is computed as the fixed point of the following iteration: Zθ(a) ← δ(a) + exp(−ϵ Cθ(a)) Σ_{a′∼a} Zθ(a′), (7) where a′ ∼ a denotes the set of a′ adjacent to a in the lattice, ϵ is the spacing between adjacent lattice elements, and δ is the Kronecker delta. [sent-91, score-0.479]
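A minimal sketch of iterating (7) on a small two-dimensional lattice (grid size, spacing, cost, and goal are hypothetical; NumPy assumed); it also illustrates why the direct approach stops scaling beyond a few dimensions, since the table holding Zθ grows exponentially with dimension.

```python
import numpy as np

n, eps = 50, 0.5                      # lattice side length and spacing (hypothetical)
C = 3.0 * np.ones((n, n))             # cost C_theta(a); must be large enough that the
                                      # sum over unboundedly long paths converges
goal = (n // 2, n // 2)
delta = np.zeros((n, n)); delta[goal] = 1.0   # Kronecker delta at the goal

Z = np.zeros((n, n))
for _ in range(500):                  # iterate Eq. (7) until (approximate) convergence
    nbr = np.zeros_like(Z)
    nbr[1:, :] += Z[:-1, :]; nbr[:-1, :] += Z[1:, :]   # vertical lattice neighbors
    nbr[:, 1:] += Z[:, :-1]; nbr[:, :-1] += Z[:, 1:]   # horizontal lattice neighbors
    Z = delta + np.exp(-eps * C) * nbr
# Z[a] now approximates the (unnormalized) weight of all lattice paths from a to the goal.
```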
39 4 Efficient inference via symmetry reduction Unfortunately, the dynamic programming approach described above is tractable only for low dimensional problems; for problems in more than a few dimensions, even storing the partition function would be infeasible. [sent-92, score-0.741]
40 Fortunately, we show in this section that it is possible to compute the partition function directly in a compressed form, given that the features also satisfy a certain compressibility property. [sent-93, score-0.472]
41 1 Symmetry of the partition function Elaborating on this statement, we now recall Eq. [sent-95, score-0.264]
42 (4), which expresses the features as integrals of feature potentials ψj over paths. [sent-96, score-0.234]
43 We then examine the effects of assuming that the ψj are compressible in the sense that they may be predicted exactly from their projection onto a low dimensional subspace—i.e., each ψj(x) depends on x only through W⊤x, where the columns of W span the subspace. [sent-97, score-0.314]
44 First, we show that the partition function is symmetric about rotations about the origin that preserve the subspace spanned by the columns of W . [sent-102, score-0.473]
45 We then show that there always exists such a rotation that also brings an arbitrary point in RN into correspondence with a point in a (d + 1)-dimensional slice where the partition function has been computed. [sent-103, score-0.492]
46 5 and features derived from feature potentials ψj . [sent-107, score-0.194]
47 The next theorem makes explicit how to exploit the symmetry of the partition function by computing it restricted to a low-dimensional slice of the state space. [sent-114, score-0.543]
48 1 that rotates b onto the subspace spanned by the columns of W and ν. [sent-120, score-0.172]
49 2 Exploiting symmetry in DP We proceed to compute the discretized partition function via a modified version of the dynamic programming algorithm described in Sec. [sent-125, score-0.434]
50 2 in order to represent the partition function in a compressed form. [sent-128, score-0.325]
51 Figure 3 illustrates the algorithm applied to computing the partition function associated with a constant C(x) in a two-dimensional space. [sent-130, score-0.264]
52 The partition function is represented by its values on a regular lattice lying in the low-dimensional slice spanned by the columns of W and ν, as defined in Corollary 4. [sent-131, score-0.567]
53 At each iteration of the algorithm, we update each value in the slice based on adjacent values, as before. [sent-134, score-0.224]
54 However, it is now the case that some of the adjacent nodes lie off of the slice. [sent-135, score-0.148]
55 We compute the values associated with such nodes by rotating them onto the slice (according to Corollary 4. [sent-136, score-0.262]
56 2) and interpolating the value based on those of adjacent nodes within the slice. [sent-137, score-0.196]
57 Suppose that b is a point contained within the slice and y := b + δ is an adjacent point lying off the slice whose value we wish to compute. [sent-139, score-0.396]
58 Therefore, assuming that all nodes adjacent to b lie at a distance of δ from it, all of the updates from the off-slice neighbors will be identical, which allows us to compute the net contribution due to all such nodes simply by multiplying the above value by their cardinality. [sent-144, score-0.205]
59 3 MaxEnt training procedure Given the ability to efficiently compute the partition function, learning may proceed in a way exactly analogous to the discrete case (Sec. [sent-148, score-0.358]
60 The large sphere marked goal denotes origin with respect to which partition function is computed. [sent-151, score-0.325]
61 Partition function in this case is symmetric about all rotations around the origin; hence, any value can be computed by rotation onto any axis (slice) where the partition function is known (ν). [sent-152, score-0.479]
62 Symmetry implies that value updates from off-axis nodes can be computed by rotation (proj) onto the axis. [sent-154, score-0.224]
63 computing feature expectations under the model distribution is not as straightforward as in the low dimensional case, as we must account for the symmetry of the partition function. [sent-156, score-0.628]
64 As such, we compute feature expectations by sampling paths from the model given the partition function. [sent-157, score-0.601]
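A hedged sketch of the resulting training loop; `sample_paths` and `path_features` are placeholders standing in for the sampling and feature machinery described above, and the step size is illustrative.

```python
import numpy as np

def maxent_train(empirical_feats, sample_paths, path_features, K, iters=100, lr=0.1):
    """Match model feature expectations to empirical ones.

    empirical_feats : length-K vector of empirically observed feature values.
    sample_paths    : placeholder; draws paths from p(x | theta) using the
                      (compressed) partition function.
    path_features   : placeholder; maps a path to its length-K feature vector (Eq. (4)).
    """
    theta = np.zeros(K)
    for _ in range(iters):
        paths = sample_paths(theta)
        model_feats = np.mean([path_features(x) for x in paths], axis=0)
        # Under the cost convention p(x) proportional to exp(-S_theta{x}), the ascent
        # direction raises the cost weight of features the model over-expresses.
        theta += lr * (model_feats - empirical_feats)
    return theta
```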
65 Algorithm 1 PartitionFunc(xT, Cθ, W, N, d): Z : R^{d+1} → R : y ↦ 0; ν ← (any ν with ⟨ν, ν⟩ = 1, W⊤ν = 0); lift : R^{d+1} → R^N : y ↦ [W ν] y + xT; proj : R^N → R^{d+1} : x ↦ (W⊤(x − xT), ∥(I − W W⊤)(x − xT)∥); while Z not converged do: for y ∈ G ⊂ Z^{d+1} do: zon ← Σ_{δ∈Z^{d+1}, ∥δ∥=1} Z(y + δ); zoff ← 2(N − d − 1) Z(y1, . [sent-158, score-0.132]
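Below is a hedged Python sketch of this compressed DP. Algorithm 1 is truncated above, so the off-slice update (rotating each of the 2(N − d − 1) off-slice neighbors onto the slice at radial coordinate √(y_{d+1}² + 1) and interpolating) is our reading of Sec. 4.2, not code from the paper; the grid size and iteration count are hypothetical.

```python
import numpy as np

def partition_func_symmetric(cost_on_slice, n_grid, N, d, eps, n_iter=300):
    """Sketch of Algorithm 1: DP for Z_theta restricted to a (d+1)-dim slice.

    cost_on_slice(y): cost C_theta at the lifted point of slice coordinate y,
    where the first d coordinates lie along the columns of W and the last
    coordinate is the distance from the W-subspace (the nu direction).
    """
    shape = (n_grid,) * (d + 1)
    Z = np.zeros(shape)
    goal = (0,) * (d + 1)                       # goal placed at the slice origin
    for _ in range(n_iter):
        Z_new = np.zeros(shape)
        for y in np.ndindex(shape):
            # on-slice neighbors: the 2(d+1) unit steps within the slice lattice
            z_on = 0.0
            for axis in range(d + 1):
                for step in (-1, 1):
                    nb = list(y)
                    nb[axis] += step
                    if 0 <= nb[axis] < n_grid:
                        z_on += Z[tuple(nb)]
            # off-slice neighbors (assumption, see lead-in): all 2(N - d - 1) of them
            # rotate onto the slice at radial coordinate sqrt(y_{d+1}^2 + 1);
            # interpolate Z linearly between the two bracketing grid points.
            r = np.sqrt(y[d] ** 2 + 1.0)
            lo = int(np.floor(r))
            frac = r - lo
            z_rot = 0.0
            if lo < n_grid:
                z_rot += (1.0 - frac) * Z[y[:d] + (lo,)]
            if lo + 1 < n_grid:
                z_rot += frac * Z[y[:d] + (lo + 1,)]
            z_off = 2 * (N - d - 1) * z_rot
            delta = 1.0 if y == goal else 0.0
            Z_new[y] = delta + np.exp(-eps * cost_on_slice(y)) * (z_on + z_off)
        Z = Z_new
    return Z   # stores (n_grid)^(d+1) values rather than (n_grid)^N
```

The storage and per-iteration cost depend on the ambient dimension N only through the constant 2(N − d − 1), rather than exponentially, which is the point of the symmetry reduction.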
66 Our training set consisted of a small sample of trajectories representing four different exercises performed by a human actor. [sent-162, score-0.12]
67 The feature potentials employed consisted of indicator functions of the form φj (a) = {1 if W T a ∈ Cj , 0 otherwise}, (12) where the Cj were non-overlapping, rectangular regions of the projected state space. [sent-164, score-0.251]
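A small sketch of how indicator features of this form can be accumulated along a trajectory (the grid of rectangular regions is hypothetical; NumPy assumed):

```python
import numpy as np

def indicator_features(path, W, edges):
    """Count visits of the projected path to rectangular regions C_j, as in (12).

    path  : (T, N) array of states along a trajectory.
    W     : (N, d) matrix whose columns span the feature subspace.
    edges : length-d list of bin-edge arrays defining the rectangular regions.
    Returns one count per region (up to the sample-spacing factor from Eq. (4)).
    """
    projected = path @ W                              # (T, d) projected trajectory
    counts, _ = np.histogramdd(projected, bins=edges) # visits per region C_j
    return counts.ravel()
```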
68 [Figure 4 plot residue: per-class panels (up-phase jumping jack, down-phase jumping jack, side twist) plotting log odds ratio against fraction of path revealed for HDMaxEnt and logistic regression, with a correct-discrimination threshold marked.] [sent-166, score-0.47]
71 Figure 4: Results of classification experiment given progressively revealed trajectories. [sent-174, score-0.554]
72 Abscissa indicates the fraction of the trajectory revealed to the classifiers. [sent-176, score-0.292]
73 Samples of held-out trajectory at different points along abscissa are illustrated above fraction of path revealed. [sent-177, score-0.365]
74 Ordinate shows predicted log-odds ratio between correct class and next-most-probable class. [sent-178, score-0.148]
75 Given our ability to efficiently compute the partition function, this enables us to normalize each of these probability distributions. [sent-180, score-0.314]
76 Knowing the partition function also enables us to perform various marginalizations of the distribution that would otherwise be intractable. [sent-182, score-0.314]
77 [8, 15] In particular, we performed an experiment consisting of evaluating the probability of a held-out trajectory under each model as it was progressively revealed in time. [sent-183, score-0.324]
78 , xt represents the portion of the trajectory revealed up to time t, P (x0 ) is the prior probability of the initial state, and is the spacing between successive samples. [sent-187, score-0.335]
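One standard way to score a partially revealed trajectory under a MaxEnt path model (our reading of this experiment, not a formula quoted from the paper) is p(x0..xt | class) = P(x0) exp(−Sθ(x0..xt)) Zθ(xt) / Zθ(x0), where Zθ(a) denotes the class model's partition function over path suffixes starting at a. A hedged sketch, with all function names as placeholders:

```python
import numpy as np

def prefix_log_prob(prefix, log_prior_x0, cost, Z_from, eps):
    """Log-probability that a path from this class begins with `prefix`.

    prefix       : sequence of states x_0, ..., x_t revealed so far.
    log_prior_x0 : log P(x_0), the prior over the initial state.
    cost         : placeholder for the learned cost C_theta.
    Z_from       : placeholder; Z_from(a) is the partition function over
                   path suffixes starting at a.
    eps          : spacing between successive samples.
    """
    S = eps * sum(cost(x) for x in prefix)        # discretized path cost S_theta(prefix)
    return log_prior_x0 - S + np.log(Z_from(prefix[-1])) - np.log(Z_from(prefix[0]))
```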
79 4, which plots the predicted log-odds ratio between the correct and next-most-probable classes. [sent-189, score-0.112]
80 Features for this classifier consisted of radial basis functions centered around the portion of each training trajectory revealed up to the current time step. [sent-191, score-0.285]
81 The logistic regression predictions were generally inaccurate on their own, but the confidence of these predictions was so low that these probabilities were far outweighed by the prior—the log-odds ratio in time therefore appears almost flat for logistic regression. [sent-195, score-0.204]
82 Our method, however, quickly recovered the correct label as the rest of the sequence was revealed. [sent-199, score-0.112]
83 Figures 1(a) and 1(b) show the result of a different inference—here we used the same learned class models to evaluate the probability that a single held-out frame was generated by a path in each class. [sent-200, score-0.332]
84 This probability can be computed as the product of forward and backwards partition functions evaluated at the held-out frame divided by the partition function between nominal start and goal positions. [sent-201, score-0.651]
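A sketch of that computation; `Z_fwd`, `Z_bwd`, and `Z_total` are placeholders for each class model's forward and backward partition functions and its nominal start-to-goal normalizer.

```python
import numpy as np

def frame_class_log_scores(frame, class_models):
    """Unnormalized log-probability that `frame` is visited by a path of each class.

    Each model is assumed to expose Z_fwd(a) (paths from the nominal start to a),
    Z_bwd(a) (paths from a to the nominal goal), and Z_total (nominal start to goal).
    """
    scores = {}
    for name, model in class_models.items():
        scores[name] = (np.log(model.Z_fwd(frame))
                        + np.log(model.Z_bwd(frame))
                        - np.log(model.Z_total))
    return scores   # exponentiate and normalize over classes (with a class prior) if desired
```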
85 [15] We also sampled trajectories given each potential class label, given the held-out frame as a starting point, and visualized the results. [sent-202, score-0.269]
86 1(b) shows a case where the held-out frame is ambiguous enough that it could have been generated by either the jumping jack up or down phases. [sent-207, score-0.597]
87 6 Related work Our work bears the most relation to the extensive literature on maximum entropy modeling in sequence analysis. [sent-209, score-0.196]
88 Our method is also an instance of MaxEnt modeling applied to sequence analysis; however, our method applies to high-dimensional paths in continuous spaces with a continuous notion of (potentially unbounded) time (as opposed to the discrete notions of finite sequence length or horizon). [sent-211, score-0.637]
89 First our method is able to exploit global, contextual features of sequences without having to model how these features are generated from a latent state. [sent-215, score-0.269]
90 Although the features used in the experiments shown here were fairly simple, we plan to show in future work how our method can leverage context-dependent features to generalize across different environments. [sent-216, score-0.236]
91 Second, global inferences in the aforementioned GP-based methods are intractable, since the state distribution as a function of time is generally not a Gaussian process, unless the dynamics are assumed linear. [sent-217, score-0.169]
92 Therefore, expensive, approximate inference methods such as MCMC would be required to compute any of the inferences demonstrated here. [sent-218, score-0.159]
93 7 Conclusions We have demonstrated a method for efficiently performing inference and learning for maximumentropy modeling of high dimensional, continuous trajectories. [sent-219, score-0.154]
94 Key to the method is the assumption that features arise from potentials that vary only in low dimensional subspaces. [sent-220, score-0.348]
95 The partition functions associated with such features can be computed efficiently by exploiting the symmetries that arise in this case. [sent-221, score-0.403]
96 The ability to efficiently compute the partition function enables tractable learning as well as the opportunity to compute a variety of inferences that would otherwise be intractable. [sent-222, score-0.43]
97 We have demonstrated experimentally that the method is able to build plausible models of high dimensional motion capture trajectories that are well-suited for classification and other prediction tasks. [sent-223, score-0.321]
98 As future work, we would like to explore similar ideas to leverage more generic types of low dimensional structure that might arise in maximum entropy modeling. [sent-224, score-0.336]
99 We are also investigating problem domains such as assistive teleoperation, where the ability to leverage contextual features is essential to learning policies that generalize. [sent-226, score-0.182]
100 A continuous-state version of discrete randomized shortest-paths, with application to path planning. [sent-256, score-0.19]
wordName wordTfidf (topN-words)
[('maxent', 0.337), ('partition', 0.264), ('paths', 0.26), ('jack', 0.237), ('jumping', 0.237), ('hdmaxent', 0.172), ('revealed', 0.153), ('twist', 0.138), ('path', 0.136), ('slice', 0.133), ('dimensional', 0.133), ('frame', 0.123), ('touch', 0.122), ('inferences', 0.116), ('odds', 0.116), ('ra', 0.096), ('rotation', 0.095), ('symmetry', 0.093), ('adjacent', 0.091), ('features', 0.091), ('events', 0.091), ('entropy', 0.088), ('lattice', 0.087), ('proj', 0.084), ('trajectory', 0.08), ('motion', 0.078), ('onto', 0.072), ('ry', 0.072), ('rn', 0.07), ('feynman', 0.069), ('vernaza', 0.069), ('trajectories', 0.068), ('discrimination', 0.067), ('potentials', 0.063), ('low', 0.061), ('origin', 0.061), ('compressed', 0.061), ('ds', 0.061), ('dover', 0.061), ('rw', 0.061), ('continuous', 0.06), ('fraction', 0.059), ('ratio', 0.057), ('sequence', 0.057), ('nodes', 0.057), ('subspace', 0.056), ('compressibility', 0.056), ('spacing', 0.056), ('integral', 0.055), ('jth', 0.055), ('correct', 0.055), ('discrete', 0.054), ('leverage', 0.054), ('exp', 0.053), ('state', 0.053), ('progressively', 0.053), ('abscissa', 0.053), ('consisted', 0.052), ('modeling', 0.051), ('hands', 0.05), ('enables', 0.05), ('sequences', 0.05), ('dx', 0.049), ('actor', 0.048), ('compressible', 0.048), ('rotations', 0.048), ('lift', 0.048), ('interpolating', 0.048), ('symmetries', 0.048), ('xt', 0.046), ('bagnell', 0.046), ('visualizations', 0.046), ('zd', 0.046), ('spanned', 0.044), ('projected', 0.043), ('yd', 0.043), ('logistic', 0.043), ('inference', 0.043), ('plausible', 0.042), ('visits', 0.042), ('sampled', 0.042), ('pi', 0.042), ('corollary', 0.041), ('log', 0.041), ('pittsburgh', 0.041), ('functional', 0.04), ('integrals', 0.04), ('proceed', 0.04), ('feature', 0.04), ('xi', 0.039), ('lying', 0.039), ('evaluating', 0.038), ('spaces', 0.038), ('discretized', 0.037), ('visited', 0.037), ('illustrated', 0.037), ('expectations', 0.037), ('learned', 0.037), ('contextual', 0.037), ('class', 0.036)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000008 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
Author: Paul Vernaza, Drew Bagnell
Abstract: Maximum entropy (MaxEnt) modeling is a popular choice for sequence analysis in applications such as natural language processing, where the sequences are embedded in discrete, tractably-sized spaces. We consider the problem of applying MaxEnt to distributions over paths in continuous spaces of high dimensionality— a problem for which inference is generally intractable. Our main contribution is to show that this intractability can be avoided as long as the constrained features possess a certain kind of low dimensional structure. In this case, we show that the associated partition function is symmetric and that this symmetry can be exploited to compute the partition function efficiently in a compressed form. Empirical results are given showing an application of our method to learning models of high-dimensional human motion capture data. 1
2 0.10029457 38 nips-2012-Algorithms for Learning Markov Field Policies
Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer
Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1
3 0.094841294 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
Author: Ashwini Shukla, Aude Billard
Abstract: Non-linear dynamical systems (DS) have been used extensively for building generative models of human behavior. Their applications range from modeling brain dynamics to encoding motor commands. Many schemes have been proposed for encoding robot motions using dynamical systems with a single attractor placed at a predefined target in state space. Although these enable the robots to react against sudden perturbations without any re-planning, the motions are always directed towards a single target. In this work, we focus on combining several such DS with distinct attractors, resulting in a multi-stable DS. We show its applicability in reach-to-grasp tasks where the attractors represent several grasping points on the target object. While exploiting multiple attractors provides more flexibility in recovering from unseen perturbations, it also increases the complexity of the underlying learning problem. Here we present the Augmented-SVM (A-SVM) model which inherits region partitioning ability of the well known SVM classifier and is augmented with novel constraints derived from the individual DS. The new constraints modify the original SVM dual whose optimal solution then results in a new class of support vectors (SV). These new SV ensure that the resulting multistable DS incurs minimum deviation from the original dynamics and is stable at each of the attractors within a finite region of attraction. We show, via implementations on a simulated 10 degrees of freedom mobile robotic platform, that the model is capable of real-time motion generation and is able to adapt on-the-fly to perturbations.
4 0.084696531 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
Author: Xianghang Liu, James Petterson, Tibério S. Caetano
Abstract: We present a new formulation for binary classification. Instead of relying on convex losses and regularizers such as in SVMs, logistic regression and boosting, or instead non-convex but continuous formulations such as those encountered in neural networks and deep belief networks, our framework entails a non-convex but discrete formulation, where estimation amounts to finding a MAP configuration in a graphical model whose potential functions are low-dimensional discrete surrogates for the misclassification loss. We argue that such a discrete formulation can naturally account for a number of issues that are typically encountered in either the convex or the continuous non-convex approaches, or both. By reducing the learning problem to a MAP inference problem, we can immediately translate the guarantees available for many inference settings to the learning problem itself. We empirically demonstrate in a number of experiments that this approach is promising in dealing with issues such as severe label noise, while still having global optimality guarantees. Due to the discrete nature of the formulation, it also allows for direct regularization through cardinality-based penalties, such as the 0 pseudo-norm, thus providing the ability to perform feature selection and trade-off interpretability and predictability in a principled manner. We also outline a number of open problems arising from the formulation. 1
5 0.084662564 44 nips-2012-Approximating Concavely Parameterized Optimization Problems
Author: Joachim Giesen, Jens Mueller, Soeren Laue, Sascha Swiercy
Abstract: We consider an abstract class of optimization problems that are parameterized concavely in a single parameter, and show that the solution path along the parameter can always be approximated with accuracy ε > 0 by a set of size O(1/√ε). A lower bound of size Ω(1/√ε) shows that the upper bound is tight up to a constant factor. We also devise an algorithm that calls a step-size oracle and computes an approximate path of size O(1/√ε). Finally, we provide an implementation of the oracle for soft-margin support vector machines, and a parameterized semi-definite program for matrix completion. 1
6 0.084594168 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
7 0.084426835 315 nips-2012-Slice sampling normalized kernel-weighted completely random measure mixture models
8 0.08420568 233 nips-2012-Multiresolution Gaussian Processes
9 0.081544384 206 nips-2012-Majorization for CRFs and Latent Likelihoods
10 0.080426618 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems
11 0.079479203 279 nips-2012-Projection Retrieval for Classification
12 0.078964695 82 nips-2012-Continuous Relaxations for Discrete Hamiltonian Monte Carlo
13 0.076723099 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
14 0.075508341 314 nips-2012-Slice Normalized Dynamic Markov Logic Networks
15 0.073674724 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
16 0.070689686 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
17 0.070062041 13 nips-2012-A Nonparametric Conjugate Prior Distribution for the Maximizing Argument of a Noisy Function
18 0.069999911 23 nips-2012-A lattice filter model of the visual pathway
19 0.069692761 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
20 0.068052404 37 nips-2012-Affine Independent Variational Inference
topicId topicWeight
[(0, 0.239), (1, -0.01), (2, 0.001), (3, -0.004), (4, -0.024), (5, -0.038), (6, -0.01), (7, -0.011), (8, -0.041), (9, -0.029), (10, -0.026), (11, -0.03), (12, -0.012), (13, -0.035), (14, -0.025), (15, -0.035), (16, -0.031), (17, -0.036), (18, -0.004), (19, 0.042), (20, 0.004), (21, 0.067), (22, -0.02), (23, -0.052), (24, -0.013), (25, -0.013), (26, -0.002), (27, 0.017), (28, 0.031), (29, 0.004), (30, 0.048), (31, 0.044), (32, 0.116), (33, 0.02), (34, -0.058), (35, -0.095), (36, -0.015), (37, -0.118), (38, -0.036), (39, 0.069), (40, -0.034), (41, 0.026), (42, -0.02), (43, -0.01), (44, 0.056), (45, 0.002), (46, -0.028), (47, -0.067), (48, 0.038), (49, 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 0.93683773 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
Author: Paul Vernaza, Drew Bagnell
Abstract: Maximum entropy (MaxEnt) modeling is a popular choice for sequence analysis in applications such as natural language processing, where the sequences are embedded in discrete, tractably-sized spaces. We consider the problem of applying MaxEnt to distributions over paths in continuous spaces of high dimensionality— a problem for which inference is generally intractable. Our main contribution is to show that this intractability can be avoided as long as the constrained features possess a certain kind of low dimensional structure. In this case, we show that the associated partition function is symmetric and that this symmetry can be exploited to compute the partition function efficiently in a compressed form. Empirical results are given showing an application of our method to learning models of high-dimensional human motion capture data. 1
2 0.69702101 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics
Author: Ashwini Shukla, Aude Billard
Abstract: Non-linear dynamical systems (DS) have been used extensively for building generative models of human behavior. Their applications range from modeling brain dynamics to encoding motor commands. Many schemes have been proposed for encoding robot motions using dynamical systems with a single attractor placed at a predefined target in state space. Although these enable the robots to react against sudden perturbations without any re-planning, the motions are always directed towards a single target. In this work, we focus on combining several such DS with distinct attractors, resulting in a multi-stable DS. We show its applicability in reach-to-grasp tasks where the attractors represent several grasping points on the target object. While exploiting multiple attractors provides more flexibility in recovering from unseen perturbations, it also increases the complexity of the underlying learning problem. Here we present the Augmented-SVM (A-SVM) model which inherits region partitioning ability of the well known SVM classifier and is augmented with novel constraints derived from the individual DS. The new constraints modify the original SVM dual whose optimal solution then results in a new class of support vectors (SV). These new SV ensure that the resulting multistable DS incurs minimum deviation from the original dynamics and is stable at each of the attractors within a finite region of attraction. We show, via implementations on a simulated 10 degrees of freedom mobile robotic platform, that the model is capable of real-time motion generation and is able to adapt on-the-fly to perturbations.
3 0.62583613 128 nips-2012-Fast Resampling Weighted v-Statistics
Author: Chunxiao Zhou, Jiseong Park, Yun Fu
Abstract: In this paper, a novel and computationally fast algorithm for computing weighted v-statistics in resampling both univariate and multivariate data is proposed. To avoid any real resampling, we have linked this problem with finite group action and converted it into a problem of orbit enumeration. For further computational cost reduction, an efficient method is developed to list all orbits by their symmetry orders and calculate all index function orbit sums and data function orbit sums recursively. The computational complexity analysis shows reduction in the computational cost from n! or nn level to low-order polynomial level. 1
4 0.62536454 267 nips-2012-Perceptron Learning of SAT
Author: Alex Flint, Matthew Blaschko
Abstract: Boolean satisfiability (SAT) as a canonical NP-complete decision problem is one of the most important problems in computer science. In practice, real-world SAT sentences are drawn from a distribution that may result in efficient algorithms for their solution. Such SAT instances are likely to have shared characteristics and substructures. This work approaches the exploration of a family of SAT solvers as a learning problem. In particular, we relate polynomial time solvability of a SAT subset to a notion of margin between sentences mapped by a feature function into a Hilbert space. Provided this mapping is based on polynomial time computable statistics of a sentence, we show that the existance of a margin between these data points implies the existance of a polynomial time solver for that SAT subset based on the Davis-Putnam-Logemann-Loveland algorithm. Furthermore, we show that a simple perceptron-style learning rule will find an optimal SAT solver with a bounded number of training updates. We derive a linear time computable set of features and show analytically that margins exist for important polynomial special cases of SAT. Empirical results show an order of magnitude improvement over a state-of-the-art SAT solver on a hardware verification task. 1
5 0.59785777 206 nips-2012-Majorization for CRFs and Latent Likelihoods
Author: Tony Jebara, Anna Choromanska
Abstract: The partition function plays a key role in probabilistic modeling including conditional random fields, graphical models, and maximum likelihood estimation. To optimize partition functions, this article introduces a quadratic variational upper bound. This inequality facilitates majorization methods: optimization of complicated functions through the iterative solution of simpler sub-problems. Such bounds remain efficient to compute even when the partition function involves a graphical model (with small tree-width) or in latent likelihood settings. For large-scale problems, low-rank versions of the bound are provided and outperform LBFGS as well as first-order methods. Several learning applications are shown and reduce to fast and convergent update rules. Experimental results show advantages over state-of-the-art optimization methods.
6 0.58941793 103 nips-2012-Distributed Probabilistic Learning for Camera Networks with Missing Data
7 0.58930105 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization
8 0.58381456 261 nips-2012-Online allocation and homogeneous partitioning for piecewise constant mean-approximation
9 0.580109 96 nips-2012-Density Propagation and Improved Bounds on the Partition Function
10 0.57900101 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process
11 0.57171685 53 nips-2012-Bayesian Pedigree Analysis using Measure Factorization
12 0.56937033 37 nips-2012-Affine Independent Variational Inference
13 0.56785768 66 nips-2012-Causal discovery with scale-mixture model for spatiotemporal variance dependencies
14 0.56633306 286 nips-2012-Random Utility Theory for Social Choice
15 0.56488246 198 nips-2012-Learning with Target Prior
16 0.56313008 343 nips-2012-Tight Bounds on Profile Redundancy and Distinguishability
17 0.55571795 232 nips-2012-Multiplicative Forests for Continuous-Time Processes
18 0.55446988 110 nips-2012-Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems
19 0.55248964 279 nips-2012-Projection Retrieval for Classification
20 0.55084074 2 nips-2012-3D Social Saliency from Head-mounted Cameras
topicId topicWeight
[(0, 0.035), (21, 0.023), (38, 0.098), (42, 0.027), (54, 0.399), (55, 0.019), (74, 0.043), (76, 0.151), (80, 0.101), (92, 0.038)]
simIndex simValue paperId paperTitle
1 0.94876111 331 nips-2012-Symbolic Dynamic Programming for Continuous State and Observation POMDPs
Author: Zahra Zamani, Scott Sanner, Pascal Poupart, Kristian Kersting
Abstract: Point-based value iteration (PBVI) methods have proven extremely effective for finding (approximately) optimal dynamic programming solutions to partially observable Markov decision processes (POMDPs) when a set of initial belief states is known. However, no PBVI work has provided exact point-based backups for both continuous state and observation spaces, which we tackle in this paper. Our key insight is that while there may be an infinite number of observations, there are only a finite number of continuous observation partitionings that are relevant for optimal decision-making when a finite, fixed set of reachable belief states is considered. To this end, we make two important contributions: (1) we show how previous exact symbolic dynamic programming solutions for continuous state MDPs can be generalized to continuous state POMDPs with discrete observations, and (2) we show how recently developed symbolic integration methods allow this solution to be extended to PBVI for continuous state and observation POMDPs with potentially correlated, multivariate continuous observation spaces. 1
2 0.84681851 177 nips-2012-Learning Invariant Representations of Molecules for Atomization Energy Prediction
Author: Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V. Lilienfeld, Klaus-Robert Müller
Abstract: The accurate prediction of molecular energetics in chemical compound space is a crucial ingredient for rational compound design. The inherently graph-like, non-vectorial nature of molecular data gives rise to a unique and difficult machine learning problem. In this paper, we adopt a learning-from-scratch approach where quantum-mechanical molecular energies are predicted directly from the raw molecular geometry. The study suggests a benefit from setting flexible priors and enforcing invariance stochastically rather than structurally. Our results improve the state-of-the-art by a factor of almost three, bringing statistical methods one step closer to chemical accuracy. 1
same-paper 3 0.82009381 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions
Author: Paul Vernaza, Drew Bagnell
Abstract: Maximum entropy (MaxEnt) modeling is a popular choice for sequence analysis in applications such as natural language processing, where the sequences are embedded in discrete, tractably-sized spaces. We consider the problem of applying MaxEnt to distributions over paths in continuous spaces of high dimensionality— a problem for which inference is generally intractable. Our main contribution is to show that this intractability can be avoided as long as the constrained features possess a certain kind of low dimensional structure. In this case, we show that the associated partition function is symmetric and that this symmetry can be exploited to compute the partition function efficiently in a compressed form. Empirical results are given showing an application of our method to learning models of high-dimensional human motion capture data. 1
4 0.79395831 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
Author: Zhirong Yang, Tele Hao, Onur Dikmen, Xi Chen, Erkki Oja
Abstract: Nonnegative Matrix Factorization (NMF) is a promising relaxation technique for clustering analysis. However, conventional NMF methods that directly approximate the pairwise similarities using the least square error often yield mediocre performance for data in curved manifolds because they can capture only the immediate similarities between data samples. Here we propose a new NMF clustering method which replaces the approximated matrix with its smoothed version using random walk. Our method can thus accommodate farther relationships between data samples. Furthermore, we introduce a novel regularization in the proposed objective function in order to improve over spectral clustering. The new learning objective is optimized by a multiplicative Majorization-Minimization algorithm with a scalable implementation for learning the factorizing matrix. Extensive experimental results on real-world datasets show that our method has strong performance in terms of cluster purity. 1
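As a concrete illustration of the random-walk smoothing idea mentioned in this abstract, the sketch below forms one generic smoothed similarity matrix S = (1 − α) Σ_t α^t P^t = (1 − α)(I − αP)^{-1} with P = D^{-1}A. This is a standard construction offered only for intuition; it is not claimed to be the exact smoothing variant or the Majorization-Minimization solver used in the cited paper.

import numpy as np

def random_walk_smooth(A, alpha=0.8):
    # P = D^{-1} A is the row-stochastic transition matrix of a random walk
    # on the affinity graph; the result is a discounted sum over walks of
    # all lengths, (1 - alpha) * (I - alpha * P)^{-1}.
    P = A / A.sum(axis=1, keepdims=True)
    n = A.shape[0]
    return (1.0 - alpha) * np.linalg.inv(np.eye(n) - alpha * P)

# Toy usage on a small affinity matrix.
A = np.array([[0.0, 1.0, 1.0, 0.1],
              [1.0, 0.0, 1.0, 0.1],
              [1.0, 1.0, 0.0, 0.1],
              [0.1, 0.1, 0.1, 0.0]])
print(np.round(random_walk_smooth(A, alpha=0.8), 3))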
5 0.78464353 344 nips-2012-Timely Object Recognition
Author: Sergey Karayev, Tobias Baumgartner, Mario Fritz, Trevor Darrell
Abstract: In a large visual multi-class detection framework, the timeliness of results can be crucial. Our method for timely multi-class detection aims to give the best possible performance at any single point after a start time; it is terminated at a deadline time. Toward this goal, we formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which detector to deploy next. In contrast to previous work, our method significantly diverges from the predominant greedy strategies, and is able to learn to take actions with deferred values. We evaluate our method with a novel timeliness measure, computed as the area under an Average Precision vs. Time curve. Experiments are conducted on the PASCAL VOC object detection dataset. If execution is stopped when only half the detectors have been run, our method obtains 66% better AP than a random ordering, and 14% better performance than an intelligent baseline. On the timeliness measure, our method obtains at least 11% better performance. Our method is easily extensible, as it treats detectors and classifiers as black boxes and learns from execution traces using reinforcement learning. 1
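The "area under an Average Precision vs. Time curve" used as the timeliness measure in this abstract amounts to a simple trapezoidal integral. The sketch below assumes AP has already been measured at a few time points and normalizes by the evaluation window; the function name, hold-last-value convention, and normalization are assumptions for illustration, not the paper's exact protocol.

import numpy as np

def ap_vs_time_auc(times, ap_values, t_start, t_deadline):
    # Normalized area under an AP-vs-time curve over [t_start, t_deadline],
    # holding the last measured AP flat until the deadline.
    t = np.clip(np.asarray(times, dtype=float), t_start, t_deadline)
    ap = np.asarray(ap_values, dtype=float)
    if t[-1] < t_deadline:
        t = np.append(t, t_deadline)
        ap = np.append(ap, ap[-1])
    area = np.sum(0.5 * (ap[1:] + ap[:-1]) * np.diff(t))  # trapezoid rule
    return area / (t_deadline - t_start)

# Example: AP measured after each detector finishes running.
print(ap_vs_time_auc([0.0, 0.2, 0.5, 1.0], [0.0, 0.15, 0.28, 0.35], 0.0, 1.0))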
6 0.7785238 287 nips-2012-Random function priors for exchangeable arrays with applications to graphs and relational data
7 0.68225282 88 nips-2012-Cost-Sensitive Exploration in Bayesian Reinforcement Learning
8 0.67817611 173 nips-2012-Learned Prioritization for Trading Off Accuracy and Speed
9 0.64379233 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning
10 0.64008945 108 nips-2012-Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
11 0.61811656 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
12 0.61654866 38 nips-2012-Algorithms for Learning Markov Field Policies
13 0.61565864 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions
14 0.6118055 153 nips-2012-How Prior Probability Influences Decision Making: A Unifying Probabilistic Model
15 0.60829598 353 nips-2012-Transferring Expectations in Model-based Reinforcement Learning
16 0.60420287 209 nips-2012-Max-Margin Structured Output Regression for Spatio-Temporal Action Localization
17 0.60389888 51 nips-2012-Bayesian Hierarchical Reinforcement Learning
18 0.60377645 160 nips-2012-Imitation Learning by Coaching
19 0.60332954 122 nips-2012-Exploration in Model-based Reinforcement Learning by Empirically Estimating Learning Progress
20 0.60318303 348 nips-2012-Tractable Objectives for Robust Policy Optimization