nips nips2012 nips2012-48 knowledge-graph by maker-knowledge-mining

48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics


Source: pdf

Author: Ashwini Shukla, Aude Billard

Abstract: Non-linear dynamical systems (DS) have been used extensively for building generative models of human behavior. Their applications range from modeling brain dynamics to encoding motor commands. Many schemes have been proposed for encoding robot motions using dynamical systems with a single attractor placed at a predefined target in state space. Although these enable the robots to react against sudden perturbations without any re-planning, the motions are always directed towards a single target. In this work, we focus on combining several such DS with distinct attractors, resulting in a multi-stable DS. We show its applicability in reach-to-grasp tasks where the attractors represent several grasping points on the target object. While exploiting multiple attractors provides more flexibility in recovering from unseen perturbations, it also increases the complexity of the underlying learning problem. Here we present the Augmented-SVM (A-SVM) model which inherits region partitioning ability of the well known SVM classifier and is augmented with novel constraints derived from the individual DS. The new constraints modify the original SVM dual whose optimal solution then results in a new class of support vectors (SV). These new SV ensure that the resulting multistable DS incurs minimum deviation from the original dynamics and is stable at each of the attractors within a finite region of attraction. We show, via implementations on a simulated 10 degrees of freedom mobile robotic platform, that the model is capable of real-time motion generation and is able to adapt on-the-fly to perturbations.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Many schemes have been proposed for encoding robot motions using dynamical systems with a single attractor placed at a predefined target in state space. [sent-7, score-0.672]

2 We show its applicability in reach-to-grasp tasks where the attractors represent several grasping points on the target object. [sent-10, score-0.727]

3 While exploiting multiple attractors provides more flexibility in recovering from unseen perturbations, it also increases the complexity of the underlying learning problem. [sent-11, score-0.466]

4 These new SV ensure that the resulting multistable DS incurs minimum deviation from the original dynamics and is stable at each of the attractors within a finite region of attraction. [sent-14, score-0.732]

5 A major advantage of representing motion using DS based models [1, 2, 3, 4] is the ability to counter perturbations by virtue of the fact that re-planning of trajectories is instantaneous. [sent-17, score-0.391]

6 These are generative schemes that define the flow of trajectories in state space x ∈ R^N by means of a non-linear dynamical function ẋ = f(x). [sent-18, score-0.409]
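For illustration, a minimal sketch of such a single-attractor DS; the quadratic-gain dynamics and all names here are hypothetical and not the learned models used in the paper:

```python
import numpy as np

def f_single(x, x_star, gain=2.0):
    # Hypothetical single-attractor dynamics: velocity always points toward x_star.
    return -gain * (np.asarray(x) - np.asarray(x_star))

def rollout(x0, x_star, dt=0.01, steps=500):
    # Forward Euler integration of x_dot = f(x); a perturbation simply re-enters
    # the same flow, which is why no re-planning is required.
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        x = x + dt * f_single(x, x_star)
        traj.append(x.copy())
    return np.array(traj)

traj = rollout(x0=[1.0, -0.5], x_star=[0.0, 0.0])  # converges to the attractor
```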

7 DS with single stable attractors have been used in pick and place tasks to control both the motion of the end-effector [5, 6, 7] and the placement of the fingers on an object [8]. [sent-19, score-0.658]

8 Assuming a single attractor, and hence a single grasping location on the object, constrains considerably the applicability of these methods to realistic grasping problems. [sent-20, score-0.443]

9 A DS composed of multiple stable attractors provides an opportunity to encode different ways to reach and grasp an object. [sent-21, score-0.525]

10 Recent neuro-physiological results [9] have shown that DS-based modeling best explains the trajectories followed by humans while switching between several reaching targets. [sent-22, score-0.283]

11 From a robotics viewpoint, a robot controlled using a DS with multiple attractors would [sent-23, score-0.654]

12 be able to switch online across grasping strategies. [sent-44, score-0.241]

13 e.g., when one grasping point becomes no longer accessible due to a sudden change in the orientation of the object or the appearance of an obstacle along the current trajectory. [sent-47, score-0.316]

14 This paper presents a method by which one can learn multiple dynamics directed toward different attractors in a single dynamical system. [sent-48, score-0.727]

15 The Hopfield network and variants offered a powerful means to encode several stable attractors in the same system to provide a form of content-addressable memory [13, 14] (Figure 1: 8-attractor DS). [sent-61, score-0.862]

16 The dynamics to reach these attractors was however not controlled for, nor was the partitioning of the state space that would send the trajectories to each attractor. [sent-62, score-0.895]

17 To our knowledge, this is the first attempt at learning simultaneously a partitioning of the state space and an embedding of multiple dynamical systems with separate regions of attractions and distinct attractors. [sent-66, score-0.247]

18 Also, let the two DS be ẋ = f_{y_i}(x) with stable attractors at x*_{y_i}. [sent-74, score-0.499]

19 Figure 2(c) shows the trajectories resulting from this approach. [sent-76, score-0.253]

20 Due to the non-linearity of the dynamics, trajectories initialized in one region cross the boundary and converge to the attractor located in the opposite region. [sent-77, score-0.736]

21 In other words, each region partitioned by the SVM hyperplane is not a region of attraction for its attractor. [sent-78, score-0.212]

22 In a real-world scenario where the attractors represent grasping points on an object and the trajectories are to be followed by robots, crossing over may take the trajectories towards kinematically unreachable regions. [sent-79, score-1.272]

23 As shown in Fig. 2(d), trajectories that encounter the boundary may switch rapidly between different dynamics, leading to jittery motion. [sent-81, score-0.427]

24 Concretely, if ẋ_O = f_{sgn(h(x))}(x) denotes the velocity obtained from the original DS and λ(x) = max(ε, ∇h(x)^T ẋ_O) if h(x) > 0, λ(x) = min(−ε, ∇h(x)^T ẋ_O) if h(x) < 0, the modulated dynamical system is given by ẋ = f̃(x) = λ(x)∇h(x) + ẋ_⊥. [sent-86, score-0.396]
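A sketch of this modulation, assuming the classifier function h, its gradient, and the two original DS are available as callables; all names are illustrative, and ẋ_⊥ is taken as the component of the original velocity orthogonal to ∇h(x):

```python
import numpy as np

def modulated_velocity(x, h, grad_h, f_pos, f_neg, eps=1e-3):
    # Sketch of the modulated DS: x_dot_tilde = lambda(x) * grad_h(x) + x_dot_perp.
    x_dot = f_pos(x) if h(x) > 0 else f_neg(x)   # velocity of the active original DS
    g = grad_h(x)
    comp = float(g @ x_dot)                      # component of x_dot along grad h
    lam = max(eps, comp) if h(x) > 0 else min(-eps, comp)
    g_unit = g / (np.linalg.norm(g) + 1e-12)
    x_perp = x_dot - (g_unit @ x_dot) * g_unit   # part of x_dot orthogonal to grad h
    return lam * g + x_perp
```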

25 As a result, the trajectories move away from the classification hyperplane and converge to a point located in the region where they were initialized. [sent-89, score-0.326]

26 If h(x) is upper bounded, all trajectories converge to one of the stationary points {x : ∇h(x) = 0} and h(x) is a Lyapunov function of the overall system (see [17], Proposition 1). [sent-91, score-0.315]

27 Figure 3 shows the result of applying the above modulation to our pair of DS. [sent-92, score-0.238]

28 As expected, it forces the trajectories to flow along the gradient of the function h(x). [sent-93, score-0.294]

29 Although no longer “crossing-over” the boundary, the trajectories obtained are deficient in two major ways. [sent-95, score-0.253]

30 In other words, the vector field given by ∇h(x) was not aligned with the flow of the training trajectories, and the stationary points of the modulation function did not coincide with the desired attractors. [sent-99, score-0.605]

31 In subsequent sections, we show how we can learn a new modulation function which takes into account the three issues we highlighted above (Figure 3: modulated trajectories). [sent-103, score-0.238]

32 We will seek a system that a) ensures strict classification across regions of attraction (ROA) for each DS, b) follows closely the dynamics of each DS in each ROA and c) ensures that all trajectories in each ROA reach the desired attractor. [sent-105, score-0.634]

33 In the next sections, we show that in addition to the usual SVM support vectors (SVs), the resulting modulation function is composed of an additional class of SVs. [sent-108, score-0.347]

34 We collect trajectories in the state space, yielding a set of P data points {x_i ; ẋ_i ; l_i}_{i=1···P}. [sent-112, score-0.399]
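A sketch of assembling such a dataset from recorded demonstrations; the finite-difference velocity estimate and all names are assumptions made for illustration:

```python
import numpy as np

def build_dataset(demos, labels, dt):
    # demos: list of (T_k x N) arrays of sampled positions; labels: one motion class per demo.
    X, X_dot, L = [], [], []
    for traj, lab in zip(demos, labels):
        vel = np.gradient(traj, dt, axis=0)   # finite-difference velocity estimate
        X.append(traj)
        X_dot.append(vel)
        L.append(np.full(len(traj), lab))
    return np.vstack(X), np.vstack(X_dot), np.concatenate(L)
```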

35 To learn the set of modulation functions {h_m(x)}_{m=1···M}. [sent-116, score-0.238]

36 We learn each modulation function in a one-vs-all classifier scheme and then (footnote: the SVM classifier function is bounded if the Radial Basis Function (rbf) is used as the kernel) [sent-120, score-0.238]

37 compute the final modulation function h̃(x) = max_{m=1···M} h_m(x). [sent-123, score-0.459]

38 In the multi-class setting, the behavior of avoiding boundaries is obtained if the trajectories move along increasing values of the function h̃(x). [sent-124, score-0.294]
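A sketch of this one-vs-all combination, assuming the M per-class modulation functions have already been learned (h_list holds callables; names are illustrative):

```python
import numpy as np

def combined_modulation(x, h_list):
    # h_tilde(x) = max_m h_m(x); the argmax identifies the active motion class.
    values = np.array([h(x) for h in h_list])
    m_star = int(np.argmax(values))
    return values[m_star], m_star
```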

39 Next, we describe the procedure for learning a single hm (x) function. [sent-126, score-0.221]

40 We also assume that each function h_m(x) is linear in feature space, i.e. h_m(x) = w^T φ(x) + b. [sent-128, score-0.221]

41 We label the current (m-th) motion class as positive and all others negative, such that the set of labels for the current sub-problem is given by y_i = +1 if l_i = m and y_i = −1 if l_i ≠ m, for i = 1···P (3). [sent-131, score-0.225]

42 Lyapunov constraint: To ensure that the modulated flow is aligned with the training trajectories, the gradient of the modulation function must have a positive component along the velocities at the data points. [sent-137, score-0.461]

43 That is, ∇h_m(x_i)^T ẋ̂_i = w^T J(x_i) ẋ̂_i ≥ 0 ∀i ∈ I+ (4), where J ∈ R^{F×N} is the Jacobian matrix given by J = [∇φ_1(x) ∇φ_2(x) ··· ∇φ_F(x)]^T and ẋ̂_i = ẋ_i/‖ẋ_i‖ is the normalized velocity at the i-th data point. [sent-138, score-0.483]

44 Stability: Lastly, the gradient of the modulation function must vanish at the attractor of the positive class x*. [sent-139, score-0.606]

45 The solution to the above problem yields a modulation function (see Eq. [sent-165, score-0.238]

46 11 for proof) given by h_m(x) = Σ_{i=1···P} α_i y_i k(x, x_i) + Σ_{i∈I+} β_i ẋ̂_i^T ∂k(x, x_i)/∂x_i − Σ_{i=1···N} γ_i e_i^T ∂k(x, x*)/∂x* + b (9), which can be further expanded depending on the choice of kernel. [sent-167, score-0.543]
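A sketch of evaluating one such modulation function for the rbf kernel, assuming the α-, β- and γ-support vectors, the bias b and the attractor x* have been obtained from the dual; the form of Eq. 9 above is itself a best-effort reconstruction, so signs and indexing here are indicative only:

```python
import numpy as np

def rbf(x, xi, sigma):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(xi)) ** 2) / (2 * sigma ** 2))

def h_m(x, alpha_sv, beta_sv, gamma, x_star, b, sigma):
    # alpha_sv: list of (alpha_i * y_i, x_i)       -- standard SVM support vectors
    # beta_sv : list of (beta_i, x_i, xdot_hat_i)  -- Lyapunov support vectors
    # gamma   : length-N vector for the bias term centred at the attractor x_star
    x = np.asarray(x, dtype=float)
    val = b
    for ay, xi in alpha_sv:
        val += ay * rbf(x, xi, sigma)
    for bi, xi, xdot_hat in beta_sv:
        # xdot_hat^T d k(x, x_i)/d x_i for the rbf kernel
        val += bi * rbf(x, xi, sigma) * float(np.dot(xdot_hat, x - np.asarray(xi))) / sigma ** 2
    # gamma term: local modification forcing a maximum of h_m at x_star
    val -= float(np.dot(gamma, rbf(x, x_star, sigma) * (x - np.asarray(x_star)) / sigma ** 2))
    return val
```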

47 The modulation function (9) learned using the A-SVM [sent-171, score-0.238]

48 linear combination of the normalized velocity ẋ̂_i at the training data points x_i. [sent-187, score-0.322]

49 The support vectors (β-SVs) collectively contribute to the fulfillment of the Lyapunov constraint by introducing a positive slope in the modulation function (Figure 4: isocurves of f(x) = ẋ̂_i^T ∂k(x, x_i)/∂x_i at x_i = [0 0]^T, ẋ̂_i = [1/√2 1/√2]^T, for the rbf kernel). [sent-189, score-0.421]

50 Figure 4 shows the influence of a β-SV for the rbf kernel k(x_i, x_j) = exp(−‖x_i − x_j‖²/2σ²) with x_i at the origin and ẋ̂_i = [1/√2 1/√2]^T. [sent-191, score-0.602]

51 The third summation term is a non-linear bias, which does not depend on the chosen support vectors, and performs a local modification around the desired attractor x∗ to ensure that the modulation function has a local maximum at that point. [sent-193, score-0.691]

52 We calculate its value by making use of the fact that for all the α-SV xi , we must have yi hm (xi ) = 1. [sent-195, score-0.355]

53 The value of the modulation function hm (x) is shown by the color plot (white indicates high values). [sent-198, score-0.459]

54 As shown in Fig. 5(b)-(d), they push the flow of trajectories along their associated directions. [sent-200, score-0.294]

55 As shown in Fig. 5(e)-(f), adding the two γ terms shifts the location of the maximum of the modulation function to coincide with the desired attractor. [sent-202, score-0.327]

56 That is, they follow the training trajectories and terminate at the desired attractor. [sent-205, score-0.338]

57 [Figure 5 legend: obtained attractor vs. desired attractor.] [sent-231, score-0.656]

58 (b)-(d) β-SVs guide the flow of trajectories along their respective associated directions ẋ̂_i, shown by arrows. [sent-240, score-0.388]

59 (e)-(f) The 2 γ terms force the local maximum of the modulation function to coincide with the desired attractor along the X and Y axes respectively. [sent-241, score-0.696]

60 Next, we present a cross-validation analysis of the error introduced by the modulation in the original dynamics. [sent-244, score-0.265]

61 A sensitivity analysis of the region of attraction of the resulting dynamical system with respect to the model parameters is also presented. [sent-245, score-0.363]

62 As discussed in Section 2, the RBF kernel is advantageous as it ensures that the function hm (x) is bounded. [sent-247, score-0.291]

63 The color plot indicates the value of the combined modulation function h̃(x) = max_{m=1···M} h_m(x), where each of the functions h_m(x) is learned using the presented A-SVM technique. [sent-250, score-0.68]

64 The trajectories obtained after modulating the original dynamical systems flow along increasing values of the modulation function, thereby bifurcating towards different attractors at the region boundaries. [sent-252, score-1.254]

65 Unlike in Fig. 3, the flow here is aligned with the training trajectories and terminates at the desired attractors. [sent-254, score-0.365]

66 In Eq. 6, the Lyapunov constraints admit some slack, which allows the modulation to introduce slight deviations from the original dynamics. [sent-261, score-0.302]

67 In the 4-attractor problem presented [sent-263, score-0.328]

68 above, we generate a total of 10 trajectories per motion class and use 2:3 training to testing ratio for cross validation. [sent-270, score-0.49]

69 We calculate the average percentage error between the original velocity (read off from the data) and the modulated velocity (calculated using Eq. 2) for the m-th class as e_m = ⟨ ‖ẋ_i − f̃(x_i)‖ / ‖ẋ_i‖ ⟩ × 100, where ⟨·⟩ denotes the average over the test points of class m. [sent-271, score-0.446]
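A sketch of this error metric, assuming the original and modulated velocities have been collected for the held-out points of class m; the per-point normalization by ‖ẋ_i‖ matches the reconstructed formula above and is an assumption:

```python
import numpy as np

def percent_velocity_error(x_dot_true, x_dot_mod):
    # Mean relative deviation between original and modulated velocities, in percent.
    num = np.linalg.norm(x_dot_true - x_dot_mod, axis=1)
    den = np.linalg.norm(x_dot_true, axis=1)
    return 100.0 * float(np.mean(num / den))
```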

70 The region covered by this band of optimal values may vary depending on the relative location of the attractors and other data points. [sent-275, score-0.569]

71 In Fig. 6(a), motion classes 2 (upper left) and 4 (upper right) are better fitted and show less sensitivity to the choice of kernel width than classes 1 (lower left) and 3 (lower right). [sent-277, score-0.222]

72 For each class, we compute the isosurfaces of the corresponding modulation function. [sent-285, score-0.297]

73 We mesh each of these test surfaces and compute trajectories starting from the obtained mesh-points, looking for spurious attractors. [sent-288, score-0.751]

74 The largest test surface that contains no spurious attractor marks the ROA of the corresponding motion dynamics. [sent-291, score-0.518]

75 to illustrate this process. [sent-293, score-0.253]

76 Figure 7 shows a case where one spurious attractor is generated from several points detected using a larger test surface (dotted line), whereas the actual ROA (solid line) is smaller; the test trajectories are initialized on an isocurve (dotted line). [sent-294, score-0.45]

77 rROA = 0 when no trajectory, except those originating at the attractor itself, leads to the attractor. [sent-297, score-0.374]
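A sketch of this ROA test, assuming the combined modulated DS is available as a callable f_tilde; the convergence tolerance and the way spurious attractors are counted are assumptions, and the paper's exact definition of rROA may differ:

```python
import numpy as np

def trajectory_endpoints(mesh_points, f_tilde, dt=0.01, steps=2000, tol=1e-4):
    # Integrate the modulated DS from every mesh point on the test isosurface
    # and record where each trajectory comes to rest.
    endpoints = []
    for x0 in mesh_points:
        x = np.array(x0, dtype=float)
        for _ in range(steps):
            step = dt * f_tilde(x)
            x = x + step
            if np.linalg.norm(step) < tol:
                break
        endpoints.append(x)
    return np.array(endpoints)

def count_spurious(endpoints, x_star, radius=0.05):
    # Stationary points far from the desired attractor count as spurious.
    dist = np.linalg.norm(endpoints - np.asarray(x_star), axis=1)
    return int(np.sum(dist > radius))
```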

78 6(a) and translating the attractors so that they are either very far apart (left, distance datt = 1. [sent-302, score-0.544]

79 Furthermore, when the attractors are farther apart, high values of rROA are obtained for a larger range of values of the kernel width, i. [sent-306, score-0.513]

80 The attractors here represent manually labeled grasping points on a pitcher. [sent-312, score-0.702]

81 [Figure 8 panels: (a) Training data, (b) h_m(x) = 0, (c) Trajectory 1, (d) Trajectory 2.] [sent-327, score-0.221]

82 (a) shows training trajectories for three manually chosen grasping points. [sent-333, score-0.49]

83 (b) shows the isosurfaces hm (x) = 0; m = 1, 2, 3 along with the locations of the corresponding attractors. [sent-334, score-0.321]

84 In (c) and (d), the robot executes the generated trajectories starting from different positions and hence converging to different grasping points. [sent-335, score-0.558]

85 The arm is mounted on a KUKA-Omnirob base for executing the modulated Cartesian trajectories in simulation. [sent-337, score-0.352]

86 Training data for this implementation was obtained by recording the end-effector positions xi ∈ R3 from kinesthetic demonstrations of reach-to-grasp motions directed towards these grasping points, yielding a 3-class problem (see Fig. [sent-339, score-0.395]

87 Figure 8(b) shows the isosurfaces hm (x) = 0; m ∈ {1, 2, 3} learned using the presented method. [sent-342, score-0.28]

88 Figures 8(c)-(d) show the robot executing two trajectories started from two different locations, each converging to a different attractor (grasping point). [sent-343, score-0.677]

89 Such a fast generative model allows the robot to switch on-the-fly between the attractors and adapt to real-time perturbations in the object or the end-effector pose, without any re-planning or re-learning. [sent-347, score-0.675]

90 A video illustrating how the robot exploits multiple attractors to catch one of the grasping points on the object as it falls down is also provided in the supplementary material. [sent-351, score-0.866]

91 non-linear dynamical systems through a partitioning of the space. [sent-356, score-0.201]

92 set of constraints results in a new class of support vectors that exploit partial [sent-368, score-0.224]

93 derivatives of the kernel function to align the flow of trajectories with the training data (Figure 9: variation of rROA with varying model parameters). [sent-370, score-0.3]

94 The resulting model behaves as a multi-stable DS with attractors at the desired locations. [sent-372, score-0.523]

95 This ensures that the trajectories do not cross over region boundaries. [sent-376, score-0.394]

96 We validated the presented method on synthetic motions in 2D and on 3D grasping motions for real objects. [sent-377, score-0.393]

97 Results show that even though spurious attractors may occur, in practice they can be avoided by a careful choice of model parameters through grid search. [sent-378, score-0.561]

98 A dynamical systems approach to task-level system integration used to plan and control autonomous vehicle motion. [sent-391, score-0.278]

99 Autonomous movement generation for manipulators with multiple simultaneous constraints using the attractor dynamics approach. [sent-413, score-0.499]

100 Coupled dynamical system based arm-hand grasping model for learning fast adaptation strategies. [sent-435, score-0.4]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('attractors', 0.466), ('attractor', 0.328), ('trajectories', 0.253), ('ds', 0.25), ('modulation', 0.238), ('hm', 0.221), ('grasping', 0.209), ('roa', 0.176), ('rroa', 0.156), ('dynamical', 0.156), ('ow', 0.126), ('dynamics', 0.105), ('modulated', 0.099), ('robot', 0.096), ('xo', 0.095), ('spurious', 0.095), ('motion', 0.095), ('xi', 0.094), ('motions', 0.092), ('robotics', 0.092), ('svm', 0.089), ('velocity', 0.079), ('datt', 0.078), ('region', 0.073), ('wt', 0.069), ('lyapunov', 0.068), ('attraction', 0.066), ('autonomous', 0.061), ('rbf', 0.06), ('hroa', 0.059), ('isosurfaces', 0.059), ('streamlines', 0.059), ('automation', 0.058), ('desired', 0.057), ('icra', 0.052), ('svs', 0.048), ('kernel', 0.047), ('width', 0.047), ('stability', 0.046), ('regions', 0.046), ('trajectory', 0.046), ('sv', 0.045), ('partitioning', 0.045), ('cross', 0.045), ('classi', 0.044), ('ei', 0.044), ('perturbations', 0.043), ('along', 0.041), ('support', 0.04), ('yi', 0.04), ('class', 0.04), ('rn', 0.04), ('ej', 0.039), ('robotic', 0.039), ('er', 0.039), ('fsgn', 0.039), ('isosurface', 0.039), ('mounted', 0.039), ('shukla', 0.039), ('object', 0.038), ('slack', 0.038), ('boundary', 0.037), ('constraints', 0.037), ('ij', 0.036), ('rf', 0.035), ('system', 0.035), ('dof', 0.035), ('stable', 0.033), ('arm', 0.033), ('sensitivity', 0.033), ('coincide', 0.032), ('switch', 0.032), ('rp', 0.031), ('switching', 0.03), ('catch', 0.03), ('aude', 0.03), ('band', 0.03), ('movement', 0.029), ('vectors', 0.029), ('testing', 0.029), ('lausanne', 0.028), ('sudden', 0.028), ('ensure', 0.028), ('th', 0.028), ('training', 0.028), ('aligned', 0.027), ('mobile', 0.027), ('dual', 0.027), ('points', 0.027), ('original', 0.027), ('reach', 0.026), ('crossing', 0.026), ('associative', 0.026), ('control', 0.026), ('sch', 0.026), ('stefan', 0.025), ('applicability', 0.025), ('li', 0.025), ('xj', 0.025), ('ensures', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

Author: Ashwini Shukla, Aude Billard

Abstract: Non-linear dynamical systems (DS) have been used extensively for building generative models of human behavior. Their applications range from modeling brain dynamics to encoding motor commands. Many schemes have been proposed for encoding robot motions using dynamical systems with a single attractor placed at a predefined target in state space. Although these enable the robots to react against sudden perturbations without any re-planning, the motions are always directed towards a single target. In this work, we focus on combining several such DS with distinct attractors, resulting in a multi-stable DS. We show its applicability in reach-to-grasp tasks where the attractors represent several grasping points on the target object. While exploiting multiple attractors provides more flexibility in recovering from unseen perturbations, it also increases the complexity of the underlying learning problem. Here we present the Augmented-SVM (A-SVM) model which inherits region partitioning ability of the well known SVM classifier and is augmented with novel constraints derived from the individual DS. The new constraints modify the original SVM dual whose optimal solution then results in a new class of support vectors (SV). These new SV ensure that the resulting multistable DS incurs minimum deviation from the original dynamics and is stable at each of the attractors within a finite region of attraction. We show, via implementations on a simulated 10 degrees of freedom mobile robotic platform, that the model is capable of real-time motion generation and is able to adapt on-the-fly to perturbations.

2 0.094841294 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

Author: Paul Vernaza, Drew Bagnell

Abstract: Maximum entropy (MaxEnt) modeling is a popular choice for sequence analysis in applications such as natural language processing, where the sequences are embedded in discrete, tractably-sized spaces. We consider the problem of applying MaxEnt to distributions over paths in continuous spaces of high dimensionality— a problem for which inference is generally intractable. Our main contribution is to show that this intractability can be avoided as long as the constrained features possess a certain kind of low dimensional structure. In this case, we show that the associated partition function is symmetric and that this symmetry can be exploited to compute the partition function efficiently in a compressed form. Empirical results are given showing an application of our method to learning models of high-dimensional human motion capture data. 1

3 0.09355513 200 nips-2012-Local Supervised Learning through Space Partitioning

Author: Joseph Wang, Venkatesh Saligrama

Abstract: We develop a novel approach for supervised learning based on adaptively partitioning the feature space into different regions and learning local region-specific classifiers. We formulate an empirical risk minimization problem that incorporates both partitioning and classification in to a single global objective. We show that space partitioning can be equivalently reformulated as a supervised learning problem and consequently any discriminative learning method can be utilized in conjunction with our approach. Nevertheless, we consider locally linear schemes by learning linear partitions and linear region classifiers. Locally linear schemes can not only approximate complex decision boundaries and ensure low training error but also provide tight control on over-fitting and generalization error. We train locally linear classifiers by using LDA, logistic regression and perceptrons, and so our scheme is scalable to large data sizes and high-dimensions. We present experimental results demonstrating improved performance over state of the art classification techniques on benchmark datasets. We also show improved robustness to label noise.

4 0.092543393 197 nips-2012-Learning with Recursive Perceptual Representations

Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell

Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous –often more complicated– methods on several vision and speech benchmarks. 1

5 0.08649864 38 nips-2012-Algorithms for Learning Markov Field Policies

Author: Abdeslam Boularias, Jan R. Peters, Oliver B. Kroemer

Abstract: We use a graphical model for representing policies in Markov Decision Processes. This new representation can easily incorporate domain knowledge in the form of a state similarity graph that loosely indicates which states are supposed to have similar optimal actions. A bias is then introduced into the policy search process by sampling policies from a distribution that assigns high probabilities to policies that agree with the provided state similarity graph, i.e. smoother policies. This distribution corresponds to a Markov Random Field. We also present forward and inverse reinforcement learning algorithms for learning such policy distributions. We illustrate the advantage of the proposed approach on two problems: cart-balancing with swing-up, and teaching a robot to grasp unknown objects. 1

6 0.081425846 3 nips-2012-A Bayesian Approach for Policy Learning from Trajectory Preference Queries

7 0.081237964 94 nips-2012-Delay Compensation with Dynamical Synapses

8 0.067964926 188 nips-2012-Learning from Distributions via Support Measure Machines

9 0.065394953 195 nips-2012-Learning visual motion in recurrent neural networks

10 0.062405847 284 nips-2012-Q-MKL: Matrix-induced Regularization in Multi-Kernel Learning with Applications to Neuroimaging

11 0.059345651 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

12 0.058773354 204 nips-2012-MAP Inference in Chains using Column Generation

13 0.058181953 121 nips-2012-Expectation Propagation in Gaussian Process Dynamical Systems

14 0.057624225 245 nips-2012-Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions

15 0.056924198 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins

16 0.054193299 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

17 0.054109253 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

18 0.053867921 269 nips-2012-Persistent Homology for Learning Densities with Bounded Support

19 0.053726356 259 nips-2012-Online Regret Bounds for Undiscounted Continuous Reinforcement Learning

20 0.053580984 168 nips-2012-Kernel Latent SVM for Visual Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.162), (1, -0.037), (2, -0.043), (3, 0.006), (4, 0.046), (5, 0.019), (6, 0.012), (7, 0.038), (8, -0.044), (9, -0.049), (10, -0.006), (11, -0.018), (12, 0.073), (13, -0.011), (14, 0.002), (15, -0.025), (16, -0.066), (17, 0.004), (18, -0.039), (19, 0.041), (20, 0.013), (21, 0.007), (22, -0.013), (23, -0.035), (24, -0.025), (25, 0.002), (26, -0.043), (27, -0.04), (28, 0.073), (29, -0.007), (30, -0.049), (31, 0.037), (32, 0.09), (33, 0.067), (34, 0.022), (35, -0.078), (36, -0.007), (37, -0.045), (38, 0.07), (39, 0.044), (40, -0.02), (41, 0.029), (42, -0.026), (43, 0.028), (44, 0.038), (45, -0.019), (46, 0.003), (47, -0.076), (48, 0.021), (49, 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91509068 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

Author: Ashwini Shukla, Aude Billard

Abstract: Non-linear dynamical systems (DS) have been used extensively for building generative models of human behavior. Their applications range from modeling brain dynamics to encoding motor commands. Many schemes have been proposed for encoding robot motions using dynamical systems with a single attractor placed at a predefined target in state space. Although these enable the robots to react against sudden perturbations without any re-planning, the motions are always directed towards a single target. In this work, we focus on combining several such DS with distinct attractors, resulting in a multi-stable DS. We show its applicability in reach-to-grasp tasks where the attractors represent several grasping points on the target object. While exploiting multiple attractors provides more flexibility in recovering from unseen perturbations, it also increases the complexity of the underlying learning problem. Here we present the Augmented-SVM (A-SVM) model which inherits region partitioning ability of the well known SVM classifier and is augmented with novel constraints derived from the individual DS. The new constraints modify the original SVM dual whose optimal solution then results in a new class of support vectors (SV). These new SV ensure that the resulting multistable DS incurs minimum deviation from the original dynamics and is stable at each of the attractors within a finite region of attraction. We show, via implementations on a simulated 10 degrees of freedom mobile robotic platform, that the model is capable of real-time motion generation and is able to adapt on-the-fly to perturbations.

2 0.65706933 115 nips-2012-Efficient high dimensional maximum entropy modeling via symmetric partition functions

Author: Paul Vernaza, Drew Bagnell

Abstract: Maximum entropy (MaxEnt) modeling is a popular choice for sequence analysis in applications such as natural language processing, where the sequences are embedded in discrete, tractably-sized spaces. We consider the problem of applying MaxEnt to distributions over paths in continuous spaces of high dimensionality— a problem for which inference is generally intractable. Our main contribution is to show that this intractability can be avoided as long as the constrained features possess a certain kind of low dimensional structure. In this case, we show that the associated partition function is symmetric and that this symmetry can be exploited to compute the partition function efficiently in a compressed form. Empirical results are given showing an application of our method to learning models of high-dimensional human motion capture data. 1

3 0.61509597 168 nips-2012-Kernel Latent SVM for Visual Recognition

Author: Weilong Yang, Yang Wang, Arash Vahdat, Greg Mori

Abstract: Latent SVMs (LSVMs) are a class of powerful tools that have been successfully applied to many applications in computer vision. However, a limitation of LSVMs is that they rely on linear models. For many computer vision tasks, linear models are suboptimal and nonlinear models learned with kernels typically perform much better. Therefore it is desirable to develop the kernel version of LSVM. In this paper, we propose kernel latent SVM (KLSVM) – a new learning framework that combines latent SVMs and kernel methods. We develop an iterative training algorithm to learn the model parameters. We demonstrate the effectiveness of KLSVM using three different applications in visual recognition. Our KLSVM formulation is very general and can be applied to solve a wide range of applications in computer vision and machine learning. 1

4 0.59492278 279 nips-2012-Projection Retrieval for Classification

Author: Madalina Fiterau, Artur Dubrawski

Abstract: In many applications, classification systems often require human intervention in the loop. In such cases the decision process must be transparent and comprehensible, simultaneously requiring minimal assumptions on the underlying data distributions. To tackle this problem, we formulate an axis-aligned subspace-finding task under the assumption that query specific information dictates the complementary use of the subspaces. We develop a regression-based approach called RECIP that efficiently solves this problem by finding projections that minimize a nonparametric conditional entropy estimator. Experiments show that the method is accurate in identifying the informative projections of the dataset, picking the correct views to classify query points, and facilitates visual evaluation by users. 1 Introduction and problem statement In the domain of predictive analytics, many applications which keep human users in the loop require the use of simple classification models. Often, it is required that a test-point be ‘explained’ (classified) using a simple low-dimensional projection of the original feature space. This is a Projection Retrieval for Classification problem (PRC). The interaction with the user proceeds as follows: the user provides the system a query point; the system searches for a projection in which the point can be accurately classified; the system displays the classification result as well as an illustration of how the classification decision was reached in the selected projection. Solving the PRC problem is relevant in many practical applications. For instance, consider a nuclear threat detection system installed at a border check point. Vehicles crossing the border are scanned with sensors so that a large array of measurements of radioactivity and secondary contextual information is being collected. These observations are fed into a classification system that determines whether the scanned vehicle may carry a threat. Given the potentially devastating consequences of a false negative, a border control agent is requested to validate the prediction and decide whether to submit the vehicle for a costly further inspection. With the positive classification rate of the system under strict bounds because of limitations in the control process, the risk of false negatives is increased. Despite its crucial role, human intervention should only be withheld for cases in which there are reasons to doubt the validity of classification. In order for a user to attest the validity of a decision, the user must have a good understanding of the classification process, which happens more readily when the classifier only uses the original dataset features rather than combinations of them, and when the discrimination models are low-dimensional. In this context, we aim to learn a set of classifiers in low-dimensional subspaces and a decision function which selects the subspace under which a test point is to be classified. Assume we are given a dataset {(x1 , y1 ) . . . (xn , yn )} ∈ X n × {0, 1}n and a class of discriminators H. The model will contain a set Π of subspaces of X ; Π ⊆ Π, where Π is the set of all axis-aligned subspaces of the original feature space, the power set of the features. To each projection πi ∈ Π corresponds one discriminator from a given hypothesis space hi ∈ H. It will also contain a selection function g : X → Π × H, which yields, for a query point x, the projection/discriminator pair with which this point will be classified. 
The notation π(x) refers to the projection of the point x onto the subspace 1 π while h(π(x)) represents the predicted label for x. Formally, we describe the model class as Md = {Π = {π : π ∈ Π, dim(π) ≤ d}, H = {hi : hi ∈ H, h : πi → Y, ∀i = 1 . . . |Π|}, g ∈ {f : X → {1 . . . |Π|}} . where dim(π) presents the dimensionality of the subspace determined by the projection π. Note that only projections up to size d will be considered, where d is a parameter specific to the application. The set H contains one discriminator from the hypothesis class H for each projection. Intuitively, the aim is to minimize the expected classification error over Md , however, a notable modification is that the projection and, implicitly, the discriminator, are chosen according to the data point that needs to be classified. Given a query x in the space X , g(x) will yield the subspace πg(x) onto which the query is projected and the discriminator hg(x) for it. Distinct test points can be handled using different combinations of subspaces and discriminators. We consider models that minimize 0/1 loss. Hence, the PRC problem can be stated as follows: M ∗ = arg min EX ,Y y = hg(x) (πg(x) (x)) M ∈Md There are limitations to the type of selection function g that can be learned. A simple example for which g can be recovered is a set of signal readings x for which, if one of the readings xi exceeds a threshold ti , the label can be predicted just based on xi . A more complex one is a dataset containing regulatory variables, that is, for xi in the interval [ak , bk ] the label only depends on (x1 . . . xnk ) k k datasets that fall into the latter category fulfill what we call the Subspace-Separability Assumption. This paper proposes an algorithm called RECIP that solves the PRC problem for a class of nonparametric classifiers. We evaluate the method on artificial data to show that indeed it correctly identifies the underlying structure for data satisfying the Subspace-Separability Assumption. We show some case studies to illustrate how RECIP offers insight into applications requiring human intervention. The use of dimensionality reduction techniques is a common preprocessing step in applications where the use of simplified classification models is preferable. Methods that learn linear combinations of features, such as Linear Discriminant Analysis, are not quite appropriate for the task considered here, since we prefer to natively rely on the dimensions available in the original feature space. Feature selection methods, such as e.g. lasso, are suitable for identifying sets of relevant features, but do not consider interactions between them. Our work better fits the areas of class dependent feature selection and context specific classification, highly connected to the concept of Transductive Learning [6]. Other context-sensitive methods are Lazy and Data-Dependent Decision Trees, [5] and [10] respectively. In Ting et al [14], the Feating submodel selection relies on simple attribute splits followed by fitting local predictors, though the algorithm itself is substantially different. Obozinski et al present a subspace selection method in the context of multitask learning [11]. Go et al propose a joint method for feature selection and subspace learning [7], however, their classification model is not particularly query specific. Alternatively, algorithms that transform complex or unintelligible models with user-friendly equivalents have been proposed [3, 2, 1, 8]. 
Algorithms specifically designed to yield understandable models are a precious few. Here we note a rule learning method described in [12], even though the resulting rules can make visualization difficult, while itemset mining [9] is not specifically designed for classification. Unlike those approaches, our method is designed to retrieve subsets of the feature space designed for use in a way that is complementary to the basic task at hand (classification) while providing query-specific information. 2 Recovering informative projections with RECIP To solve PRC, we need means by which to ascertain which projections are useful in terms of discriminating data from the two classes. Since our model allows the use of distinct projections depending on the query point, it is expected that each projection would potentially benefit different areas of the feature space. A(π) refers to the area of the feature space where the projection π is selected. A(π) = {x ∈ X : πg(x) = π} The objective becomes min E(X ×Y) y = hg(x) (πg(x) (x)) M ∈Md = p(A(π))E y = hg(x) (πg(x) (x))|x ∈ A(π) min M ∈Md 2 π∈Π . The expected classification error over A(π) is linked to the conditional entropy of Y |X. Fano’s inequality provides a lower bound on the error while Feder and Merhav [4] derive a tight upper bound on the minimal error probability in terms of the entropy. This means that conditional entropy characterizes the potential of a subset of the feature space to separate data, which is more generic than simply quantifying classification accuracy for a specific discriminator. In view of this connection between classification accuracy and entropy, we adapt the objective to: p(A(π))H(Y |π(X); X ∈ A(π)) min M ∈Md (1) π∈Π The method we propose optimizes an empirical analog of (1) which we develop below and for which we will need the following result. Proposition 2.1. Given a continuous variable X ∈ X and a binary variable Y , where X is sampled from the mixture model f (x) = p(y = 0)f0 (x) + p(y = 1)f1 (x) = p0 f0 (x) + p1 f1 (x) , then H(Y |X) = −p0 log p0 − p1 log p1 − DKL (f0 ||f ) − DKL (f1 ||f ) . Next, we will use the nonparametric estimator presented in [13] for Tsallis α-divergence. Given samples Ui ∼ U, with i = 1, n and Vj ∼ V with j = 1, m, the divergence is estimated as follows: ˆ Tα (U ||V ) = 1 1 1−α n n (n − 1)νk (Ui , U \ ui )d mνk (Ui , V )d i=1 1−α B(k, α) − 1 , (2) where d is the dimensionality of the variables U and V and νk (z, Z) represents the distance from z ˆ to its k th nearest neighbor of the set of points Z. For α ≈ 1 and n → ∞, Tα (u||v) ≈ DKL (u||v). 2.1 Local estimators of entropy We will now plug (2) in the formula obtained by Proposition 2.1 to estimate the quantity (1). We use the notation X0 to represent the n0 samples from X which have the labels Y equal to 0, and X1 to represent the n1 samples from X which have the labels set to 1. Also, Xy(x) represents the set of samples that have labels equal to the label of x and X¬y(x) the data that have labels opposite to the label of x. 
ˆ ˆ x ˆ x H(Y |X; X ∈ A) = −H(p0 ) − H(p1 ) − T (f0 ||f x ) − T (f1 ||f x ) + C α≈1 ˆ H(Y |X; X ∈ A) ∝ 1 n0 + 1 n1 ∝ 1 n0 + 1 n1 ∝ 1 n n0 (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X \ xi )d 1−α I[xi ∈ A] (n0 − 1)νk (xi , X0 \ xi )d nνk (xi , X1 \ xi )d 1−α I[xi ∈ A] (n1 − 1)νk (xi , X1 \ xi )d nνk (xi , X0 \ xi )d 1−α I[xi ∈ A] i=1 n1 i=1 n0 i=1 n1 i=1 n I[xi ∈ A] i=1 (n − 1)νk (xi , Xy(xi ) \ xi )d nνk (xi , X¬y(xi ) \ xi )d 1−α The estimator for the entropy of the data that is classified with projection π is as follows: ˆ H(Y |π(X); X ∈ A(π)) ∝ 1 n n I[xi ∈ A(π)] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 1−α (3) From 3 and using the fact that I[xi ∈ A(π)] = I[πg(xi ) = π] for which we use the notation I[g(xi ) → π], we estimate the objective as min M ∈Md π∈Π 1 n n I[g(xi ) → π] i=1 (n − 1)νk (π(xi ), π(Xy(xi ) ) \ π(xi ))d nνk (π(xi ), π(X¬y(xi ) \ xi ))d 3 1−α (4) Therefore, the contribution of each data point to the objective corresponds to a distance ratio on the projection π ∗ where the class of the point is obtained with the highest confidence (data is separable in the neighborhood of the point). We start by computing the distance-based metric of each point on each projection of size up to d - there are d∗ such projections. This procedure yields an extended set of features Z, which we name local entropy estimates: Zij = νk (πj (xi ), πj (Xy(xi ) ) \ πj (xi )) νk (πj (xi ), πj (X¬y(xi ) ) \ πj (xi )) d(1−α) α≈1 j ∈ {1 . . . d∗ } (5) For each training data point, we compute the best distance ratio amid all the projections, which is simply Ti = minj∈[d∗ ] Zij . The objective can be then further rewritten as a function of the entropy estimates: n I[g(xi ) → πj ]Zij min M ∈Md (6) i=1 πj ∈Π From the definition of T, it is also clear that n n I[g(xi ) → πj ]Zij min M ∈Md 2.2 ≥ i=1 πj ∈Π Ti . (7) i=1 Projection selection as a combinatorial problem Considering form (6) of the objective, and given that the estimates Zij are constants, depending only on the training set, the projection retrieval problem is reduced to finding g for all training points, which will implicitly select the projection set of the model. Naturally, one might assume the bestperforming classification model is the one containing all the axis-aligned subspaces. This model achieves the lower bound (7) for the training set. However, the larger the set of projections, the more values the function g takes, and thus the problem of selecting the correct projection becomes more difficult. It becomes apparent that the number of projections should be somehow restricted to allow intepretability. Assuming a hard threshold of at most t projections, the optimization (6) becomes an entry selection problem over matrix Z where one value must be picked from each row under a limitation on the number of columns that can be used. This problem cannot be solved exactly in polynomial time. Instead, it can be formulated as an optimization problem under 1 constraints. 2.3 Projection retrieval through regularized regression To transform the projection retrieval to a regression problem we consider T, the minimum obtainable value of the entropy estimator for each point, as the output which the method needs to predict. Each row i of the parameter matrix B represents the degrees to which the entropy estimates on each projection contribute to the entropy estimator of point xi . 
Thus, the sum over each row of B is 1, and the regularization penalty applies to the number of non-zero columns in B. d∗ min ||T − (Z B B)J|Π|,1 ||2 2 +λ [Bi = 0] (8) i=1 subject to |Bk | 1 = 1 k = 1, n where (Z B)ij = Zij + Bij and J is a matrix of ones. The problem with this optimization is that it is not convex. A typical walk-around of this issue is to use the convex relaxation for Bi = 0, that is 1 norm. This would transform the penalized term d∗ d∗ n to i=1 |Bi | 1 . However, i=1 |Bi | 1 = k=1 |Bk | 1 = n , so this penalty really has no effect. An alternative mechanism to encourage the non-zero elements in B to populate a small number of columns is to add a penalty term in the form of Bδ, where δ is a d∗ -size column vector with each element representing the penalty for a column in B. With no prior information about which subspaces are more informative, δ starts as an all-1 vector. An initial value for B is obtained through the optimization (8). Since our goal is to handle data using a small number of projections, δ is then updated such that its value is lower for the denser columns in B. This update resembles the reweighing in adaptive lasso. The matrix B itself is updated, and this 2-step process continues until convergence of δ. Once δ converges, the projections corresponding to the non-zero columns of B are added to the model. The procedure is shown in Algorithm 1. 4 Algorithm 1: RECIP δ = [1 . . . 1] repeat |P I| b = arg minB ||T − i=1 < Z, B > ||2 + λ|Bδ| 1 2 subject to |Bk | 1 = 1 k = 1...n δk = |Bi | 1 i = . . . d∗ (update the differential penalty) δ δ = 1 − |δ| 1 until δ converges return Π = {πi ; |Bi | 1 > 0 ∀i = 1 . . . d∗ } 2.4 Lasso for projection selection We will compare our algorithm to lasso regularization that ranks the projections in terms of their potential for data separability. We write this as an 1 -penalized optimization on the extended feature set Z, with the objective T : minβ |T − Zβ|2 + λ|β| 1 . The lasso penalty to the coefficient vector encourages sparsity. For a high enough λ, the sparsity pattern in β is indicative of the usefulness of the projections. The lasso on entropy contributions was not found to perform well as it is not query specific and will find one projection for all data. We improved it by allowing it to iteratively find projections - this robust version offers increased performance by reweighting the data thus focusing on different subsets of it. Although better than running lasso on entropy contributions, the robust lasso does not match RECIP’s performance as the projections are selected gradually rather than jointly. Running the standard lasso on the original design matrix yields a set of relevant variables and it is not immediately clear how the solution would translate to the desired class. 2.5 The selection function Once the projections are selected, the second stage of the algorithm deals with assigning the projection with which to classify a particular query point. An immediate way of selecting the correct projection starts by computing the local entropy estimator for each subspace with each class assignment. Then, we may select the label/subspace combination that minimizes the empirical entropy. (i∗ , θ∗ ) = arg min i,θ 3 νk (πi (x), πi (Xθ )) νk (πi (x), πi (X¬θ )) dim(πi )(1−α) i = 1 . . . d∗ , α≈1 (9) Experimental results In this section we illustrate the capability of RECIP to retrieve informative projections of data and their use in support of interpreting results of classification. 
First, we analyze how well RECIP can identify subspaces in synthetic data whose distribution obeys the subspace separability assumption (3.1). As a point of reference, we also present classification accuracy results (3.2) for both the synthetic data and a few real-world sets. This is to quantify the extent of the trade-off between fidelity of attainable classifiers and desired informativeness of the projections chosen by RECIP. We expect RECIP’s classification performance to be slightly, but not substantially worse when compared to relevant classification algorithms trained to maximize classification accuracy. Finally, we present a few examples (3.3) of informative projections recovered from real-world data and their utility in explaining to human users the decision processes applied to query points. A set of artificial data used in our experiments contains q batches of data points, each of them made classifiable with high accuracy using one of available 2-dimensional subspaces (x1 , x2 ) with k ∈ k k {1 . . . q}. The data in batch k also have the property that x1 > tk . This is done such that the group a k point belongs to can be detected from x1 , thus x1 is a regulatory variable. We control the amount of k k noise added to thusly created synthetic data by varying the proportion of noisy data points in each batch. The results below are for datasets with 7 features each, with number of batches q ranging between 1 and 7. We kept the number of features specifically low in order to prevent excessive variation between any two sets generated this way, and to enable computing meaningful estimates of the expectation and variance of performance, while enabling creation of complicated data in which synthetic patterns may substantially overlap (using 7 features and 7 2-dimensional patterns implies that dimensions of at least 4 of the patterns will overlap). We implemented our method 5 to be scalable to the size and dimensionality of data and although for brevity we do not include a discussion of this topic here, we have successfully run RECIP against data with 100 features. The parameter α is a value close to 1, because the Tsallis divergence converges to the KL divergence as alpha approaches 1. For the experiments on real-world data, d was set to n (all projections were considered). For the artificial data experiments, we reported results for d = 2 as they do not change significantly for d >= 2 because this data was synthesized to contain bidimensional informative projections. In general, if d is too low, the correct full set of projections will not be found, but it may be recovered partially. If d is chosen too high, there is a risk that a given selected projection p will contain irrelevant features compared to the true projection p0 . However, this situation only occurs if the noise introduced by these features in the estimators makes the entropy contributions on p and p0 statistically indistinguishable for a large subset of data. The users will choose d according to the desired/acceptable complexity of the resulting model. If the results are to be visually interpreted by a human, values of 2 or 3 are reasonable for d. 3.1 Recovering informative projections Table 1 shows how well RECIP recovers the q subspaces corresponding to the synthesized batches of data. We measure precision (proportion of the recovered projections that are known to be informative), and recall (proportion of known informative projections that are recovered by the algorithm). 
in Table 1, rows correspond to the number of distinct synthetic batches injected in data, q, and subsequent columns correspond to the increasing amounts of noise in data. We note that the observed precision is nearly perfect: the algorithm makes only 2 mistakes over the entire set of experiments, and those occur for highly noisy setups. The recall is nearly perfect as long as there is little overlap among the dimensions, that is when the injections do not interfere with each other. As the number of projections increases, the chances for overlap among the affected features also increase, which makes the data more confusing resulting on a gradual drop of recall until only about 3 or 4 of the 7 known to be informative subspaces can be recovered. We have also used lasso as described in 2.4 in an attempt to recover projections. This setup only manages to recover one of the informative subspaces, regardless of how the regularization parameter is tuned. 3.2 Classification accuracy Table 2 shows the classification accuracy of RECIP, obtained using synthetic data. As expected, the observed performance is initially high when there are few known informative projections in data and it decreases as noise and ambiguity of the injected patterns increase. Most types of ensemble learners would use a voting scheme to arrive at the final classification of a testing sample, rather than use a model selection scheme. For this reason, we have also compared predictive accuracy revealed by RECIP against a method based on majority voting among multiple candidate subspaces. Table 4 shows that the accuracy of this technique is lower than the accuracy of RECIP, regardless of whether the informative projections are recovered by the algorithm or assumed to be known a priori. This confirms the intuition that a selection-based approach can be more effective than voting for data which satisfies the subspace separability assumption. For reference, we have also classified the synthetic data using K-Nearest-Neighbors algorithm using all available features at once. The results of that experiment are shown in Table 5. Since RECIP uses neighbor information, K-NN is conceptually the closest among the popular alternatives. Compared to RECIP, K-NN performs worse when there are fewer synthetic patterns injected in data to form informative projections. It is because some features used then by K-NN are noisy. As more features become informative, the K-NN accuracy improves. This example shows the benefit of a selective approach to feature space and using a subset of the most explanatory projections to support not only explanatory analyses but also classification tasks in such circumstances. 3.3 RECIP case studies using real-world data Table 3 summarizes the RECIP and K-NN performance on UCI datasets. We also test the methods using Cell dataset containing a set of measurements such as the area and perimeter biological cells with separate labels marking cells subjected to treatment and control cells. In Vowel data, the nearest-neighbor approach works exceptionally well, even outperforming random forests (0.94 accuracy), which is an indication that all features are jointly relevant. For d lower than the number of features, RECIP picks projections of only one feature, but if there is no such limitation, RECIP picks the space of all the features as informative. 6 Table 1: Projection recovery for artificial datasets with 1 . . . 7 informative features and noise level 0 . . . 0.2 in terms of mean and variance of Precision and Recall. 
Mean/var obtained for each setting by repeating the experiment with datasets with different informative projections. PRECISION 1 2 3 4 5 6 7 0 1 1 1 1 1 1 1 0.02 1 1 1 1 1 1 1 Mean 0.05 1 1 1 1 1 1 1 1 2 3 4 5 6 7 0 1 1 1 0.9643 0.7714 0.6429 0.6327 0.02 1 1 1 0.9643 0.7429 0.6905 0.5918 Mean 0.05 1 1 0.9524 0.9643 0.8286 0.6905 0.5918 0 0 0 0 0 0 0 0 0.02 0 0 0 0 0 0 0 Variance 0.05 0 0 0 0 0 0 0 0.1 0.0306 0 0 0 0 0 0 0.2 0.0306 0 0 0 0 0 0 0 0 0 0 0.0077 0.0163 0.0113 0.0225 0.02 0 0 0 0.0077 0.0196 0.0113 0.02 Variance 0.05 0 0 0.0136 0.0077 0.0049 0.0272 0.0258 0.1 0 0 0.0136 0.0077 0.0082 0.0113 0.0233 0.2 0 0 0 0.0128 0.0278 0.0113 0.02 0.1 0.0008 0.0001 0.0028 0.0025 0.0036 0.0025 0.0042 0.2 0.0007 0.0001 0.0007 0.0032 0.0044 0.0027 0.0045 0.1 0.0001 0.0001 0.0007 0.0014 0.0019 0.0023 0.0021 0.2 0.0000 0.0001 0.0007 0.0020 0.0023 0.0021 0.0020 0.1 0.9286 1 1 1 1 1 1 0.2 0.9286 1 1 1 1 1 1 RECALL 0.1 1 1 0.9524 0.9643 0.8571 0.6905 0.5714 0.2 1 1 1 0.9286 0.7714 0.6905 0.551 Table 2: RECIP Classification Accuracy on Artificial Data 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8725 0.8113 0.7655 0.7534 1 2 3 4 5 6 7 0 0.9751 0.9333 0.9053 0.8820 0.8714 0.8566 0.8429 CLASSIFICATION ACCURACY Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9543 0.9420 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8764 0.8640 0.8618 0.0004 0.0005 0.0016 0.8685 0.8589 0.8454 0.8187 0.0020 0.0020 0.0019 0.8009 0.8105 0.8105 0.7782 0.0042 0.0044 0.0033 0.7739 0.7669 0.7632 0.7511 0.0025 0.0021 0.0026 0.7399 0.7347 0.7278 0.7205 0.0034 0.0040 0.0042 CLASSIFICATION ACCURACY - KNOWN PROJECTIONS Mean Variance 0.02 0.05 0.1 0.2 0 0.02 0.05 0.9731 0.9686 0.9637 0.9514 0.0000 0.0000 0.0000 0.9297 0.9227 0.9067 0.8946 0.0001 0.0001 0.0001 0.8967 0.8914 0.8777 0.8618 0.0004 0.0005 0.0005 0.8781 0.8657 0.8541 0.8331 0.0011 0.0011 0.0014 0.8641 0.8523 0.8429 0.8209 0.0015 0.0015 0.0018 0.8497 0.8377 0.8285 0.8074 0.0014 0.0015 0.0016 0.8371 0.8256 0.8122 0.7988 0.0015 0.0018 0.0018 Table 3: Accuracy of K-NN and RECIP Dataset KNN RECIP In Spam data, the two most informative projections are Breast Cancer Wis 0.8415 0.8275 ’Capital Length Total’ (CLT)/’Capital Length Longest’ Breast Tissue 1.0000 1.0000 (CLL) and CLT/’Frequency of word your’ (FWY). FigCell 0.7072 0.7640 ure 1 shows these two projections, with the dots repreMiniBOONE* 0.7896 0.7396 senting training points. The red dots represent points laSpam 0.7680 0.7680 Vowel 0.9839 0.9839 beled as spam while the blue ones are non-spam. The circles are query points that have been assigned to be classified with the projection in which they are plotted. The green circles are correctly classified points, while the magenta circles - far fewer - are the incorrectly classified ones. Not only does the importance of text in capital letters make sense for a spam filtering dataset, but the points that select those projections are almost flawlessly classified. Additionally, assuming the user would need to attest the validity of classification for the first plot, he/she would have no trouble seeing that the circled data points are located in a region predominantly populated with examples of spam, so any non-spam entry appears suspicious. Both of the magenta-colored cases fall into this category, and they can be therefore flagged for further investigation. 
Figure 1: Spam Dataset Selected Subspaces (left: Capital Run Length Longest vs. Capital Run Length Total; right: Frequency of word 'your' vs. Capital Run Length Total).

Table 4: Classification accuracy using RECIP-learned projections (or known projections, in the lower section) within a voting model instead of a selection model. Rows are q = 1 ... 7; columns are noise levels 0 / 0.02 / 0.05 / 0.1 / 0.2.

CLASSIFICATION ACCURACY - VOTING ENSEMBLE            mean  |  variance
q=1:   0.9751  0.9731  0.9686  0.9317  0.9226  |  0.0000  0.0000  0.0000  0.0070  0.0053
q=2:   0.7360  0.7354  0.7331  0.7303  0.7257  |  0.0002  0.0002  0.0001  0.0002  0.0001
q=3:   0.7290  0.7266  0.7163  0.7166  0.7212  |  0.0002  0.0002  0.0008  0.0006  0.0002
q=4:   0.6934  0.6931  0.6932  0.6904  0.6867  |  0.0008  0.0008  0.0008  0.0008  0.0009
q=5:   0.6715  0.6602  0.6745  0.6688  0.6581  |  0.0013  0.0014  0.0013  0.0014  0.0013
q=6:   0.6410  0.6541  0.6460  0.6529  0.6512  |  0.0008  0.0007  0.0010  0.0006  0.0005
q=7:   0.6392  0.6342  0.6268  0.6251  0.6294  |  0.0009  0.0011  0.0012  0.0012  0.0012

CLASSIFICATION ACCURACY - VOTING ENSEMBLE, KNOWN PROJECTIONS   mean  |  variance
q=1:   0.9751  0.9731  0.9686  0.9637  0.9514  |  0.0000  0.0000  0.0000  0.0001  0.0000
q=2:   0.7360  0.7354  0.7331  0.7303  0.7257  |  0.0002  0.0002  0.0001  0.0002  0.0001
q=3:   0.7409  0.7385  0.7390  0.7353  0.7325  |  0.0010  0.0012  0.0010  0.0011  0.0010
q=4:   0.7110  0.7109  0.7083  0.7067  0.7035  |  0.0041  0.0041  0.0042  0.0042  0.0043
q=5:   0.7077  0.7070  0.7050  0.7034  0.7008  |  0.0015  0.0015  0.0015  0.0016  0.0016
q=6:   0.6816  0.6807  0.6801  0.6790  0.6747  |  0.0008  0.0008  0.0008  0.0008  0.0009
q=7:   0.6787  0.6783  0.6772  0.6767  0.6722  |  0.0008  0.0009  0.0009  0.0008  0.0008

Table 5: Classification accuracy for artificial data with the K-Nearest Neighbors method. Rows are q = 1 ... 7; columns are noise levels 0 / 0.02 / 0.05 / 0.1 / 0.2.

CLASSIFICATION ACCURACY - KNN                        mean  |  variance
q=1:   0.7909  0.7843  0.7747  0.7652  0.7412  |  0.0002  0.0002  0.0002  0.0002  0.0002
q=2:   0.7940  0.7911  0.7861  0.7790  0.7655  |  0.0001  0.0001  0.0001  0.0001  0.0001
q=3:   0.7964  0.7939  0.7901  0.7854  0.7756  |  0.0000  0.0001  0.0001  0.0000  0.0000
q=4:   0.7990  0.7972  0.7942  0.7904  0.7828  |  0.0001  0.0001  0.0001  0.0001  0.0001
q=5:   0.8038  0.8024  0.8002  0.7970  0.7905  |  0.0001  0.0001  0.0001  0.0001  0.0001
q=6:   0.8043  0.8032  0.8015  0.7987  0.7930  |  0.0001  0.0001  0.0001  0.0001  0.0001
q=7:   0.8054  0.8044  0.8028  0.8004  0.7955  |  0.0001  0.0001  0.0001  0.0001  0.0001

4 Conclusion

This paper considers the problem of Projection Recovery for Classification. It is relevant in applications where the decision process must be easy to understand in order to enable human interpretation of the results. We have developed a principled, regression-based algorithm designed to recover small sets of low-dimensional subspaces that support interpretability. It optimizes the selection using individual data-point-specific entropy estimators. In this context, the proposed algorithm follows the idea of transductive learning, and the role of the resulting projections bears resemblance to high confidence regions known in conformal prediction models. Empirical results obtained using simulated and real-world data show the effectiveness of our method in finding informative projections that enable accurate classification while maintaining transparency of the underlying decision process.

Acknowledgments

This material is based upon work supported by the NSF under Grant No. IIS-0911032.
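The conclusion above mentions ranking candidate projections with individual, data-point-specific entropy estimators. The sketch below shows one common nonparametric proxy for that idea: scoring a candidate 2-D projection by the entropy of the label distribution in each point's k-nearest-neighbour neighbourhood. The neighbourhood size, the toy data, and the scoring rule are assumptions for illustration; this is not the estimator used in the paper.

```python
# Hedged sketch: lower average local label entropy = purer neighbourhoods = a more
# informative projection. A generic k-NN proxy, not the paper's exact estimator.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_entropy_scores(X_proj, y, k=10):
    """Entropy (in bits) of the label distribution among each point's k nearest
    neighbours, computed inside the candidate projection X_proj."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_proj)
    _, idx = nn.kneighbors(X_proj)                 # idx[:, 0] is the point itself
    scores = np.empty(len(y))
    for i, neigh in enumerate(idx[:, 1:]):
        p = np.bincount(y[neigh], minlength=2) / k
        p = p[p > 0]
        scores[i] = -(p * np.log2(p)).sum()        # 0 = perfectly pure neighbourhood
    return scores

# Toy example: the feature pair (0, 1) carries the class signal, (2, 3) is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

for pair in [(0, 1), (2, 3)]:
    s = local_entropy_scores(X[:, list(pair)], y)
    print(pair, "mean local entropy:", round(s.mean(), 3))
```

Under this proxy the informative pair (0, 1) receives a much lower average entropy than the noise pair (2, 3), which is the kind of per-point evidence a selection scheme can aggregate.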

5 0.58002609 197 nips-2012-Learning with Recursive Perceptual Representations

Author: Oriol Vinyals, Yangqing Jia, Li Deng, Trevor Darrell

Abstract: Linear Support Vector Machines (SVMs) have become very popular in vision as part of state-of-the-art object recognition and other classification tasks but require high dimensional feature spaces for good performance. Deep learning methods can find more compact representations but current methods employ multilayer perceptrons that require solving a difficult, non-convex optimization problem. We propose a deep non-linear classifier whose layers are SVMs and which incorporates random projection as its core stacking element. Our method learns layers of linear SVMs recursively transforming the original data manifold through a random projection of the weak prediction computed from each layer. Our method scales as linear SVMs, does not rely on any kernel computations or nonconvex optimization, and exhibits better generalization ability than kernel-based SVMs. This is especially true when the number of training samples is smaller than the dimensionality of data, a common scenario in many real-world applications. The use of random projections is key to our method, as we show in the experiments section, in which we observe a consistent improvement over previous, often more complicated, methods on several vision and speech benchmarks.
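A rough sketch of the stacking recipe this abstract describes is given below: each layer fits a linear SVM, passes its decision values through a fixed random projection, and appends the projected scores to the original features before the next layer is trained. The layer count, projection width, scaling factor, and the use of scikit-learn's LinearSVC are assumptions for illustration rather than the authors' exact configuration.

```python
# Minimal sketch of recursive linear-SVM stacking with random projections (assumed setup).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def fit_recursive_svm(X, y, n_layers=3, proj_dim=20, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    layers, Z = [], X
    for _ in range(n_layers):
        svm = LinearSVC(C=1.0, max_iter=5000).fit(Z, y)
        R = rng.normal(size=(1, proj_dim))           # fixed random projection of the score
        layers.append((svm, R))
        scores = svm.decision_function(Z).reshape(-1, 1)
        Z = np.hstack([X, beta * scores @ R])        # recursively augmented representation
    return layers

def predict_recursive_svm(layers, X, beta=1.0):
    Z = X
    for svm, R in layers[:-1]:
        scores = svm.decision_function(Z).reshape(-1, 1)
        Z = np.hstack([X, beta * scores @ R])
    return layers[-1][0].predict(Z)

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
layers = fit_recursive_svm(X_tr, y_tr)
print("test accuracy:", (predict_recursive_svm(layers, X_te) == y_te).mean())
```

Because every layer only solves a linear SVM and the projection matrices are fixed, the whole pipeline stays convex layer-by-layer, which is the property the abstract emphasizes.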

6 0.57456976 360 nips-2012-Visual Recognition using Embedded Feature Selection for Curvature Self-Similarity

7 0.55478215 188 nips-2012-Learning from Distributions via Support Measure Machines

8 0.52968597 200 nips-2012-Local Supervised Learning through Space Partitioning

9 0.52455395 289 nips-2012-Recognizing Activities by Attribute Dynamics

10 0.51685959 144 nips-2012-Gradient-based kernel method for feature extraction and variable selection

11 0.49895439 231 nips-2012-Multiple Operator-valued Kernel Learning

12 0.49231756 204 nips-2012-MAP Inference in Chains using Column Generation

13 0.49116647 267 nips-2012-Perceptron Learning of SAT

14 0.4880394 39 nips-2012-Analog readout for optical reservoir computers

15 0.48720199 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing

16 0.48631093 44 nips-2012-Approximating Concavely Parameterized Optimization Problems

17 0.48569715 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs

18 0.48488525 167 nips-2012-Kernel Hyperalignment

19 0.48292291 361 nips-2012-Volume Regularization for Binary Classification

20 0.48243254 302 nips-2012-Scaling MPE Inference for Constrained Continuous Markov Random Fields with Consensus Optimization


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.05), (17, 0.017), (21, 0.051), (22, 0.239), (36, 0.011), (38, 0.106), (39, 0.015), (42, 0.023), (54, 0.041), (55, 0.025), (74, 0.082), (76, 0.108), (80, 0.085), (92, 0.065)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.77801371 48 nips-2012-Augmented-SVM: Automatic space partitioning for combining multiple non-linear dynamics

Author: Ashwini Shukla, Aude Billard

Abstract: Non-linear dynamical systems (DS) have been used extensively for building generative models of human behavior. Their applications range from modeling brain dynamics to encoding motor commands. Many schemes have been proposed for encoding robot motions using dynamical systems with a single attractor placed at a predefined target in state space. Although these enable the robots to react against sudden perturbations without any re-planning, the motions are always directed towards a single target. In this work, we focus on combining several such DS with distinct attractors, resulting in a multi-stable DS. We show its applicability in reach-to-grasp tasks where the attractors represent several grasping points on the target object. While exploiting multiple attractors provides more flexibility in recovering from unseen perturbations, it also increases the complexity of the underlying learning problem. Here we present the Augmented-SVM (A-SVM) model which inherits region partitioning ability of the well known SVM classifier and is augmented with novel constraints derived from the individual DS. The new constraints modify the original SVM dual whose optimal solution then results in a new class of support vectors (SV). These new SV ensure that the resulting multistable DS incurs minimum deviation from the original dynamics and is stable at each of the attractors within a finite region of attraction. We show, via implementations on a simulated 10 degrees of freedom mobile robotic platform, that the model is capable of real-time motion generation and is able to adapt on-the-fly to perturbations.

2 0.75034469 334 nips-2012-Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs

Author: Michael Collins, Shay B. Cohen

Abstract: We describe an approach to speed-up inference with latent-variable PCFGs, which have been shown to be highly effective for natural language parsing. Our approach is based on a tensor formulation recently introduced for spectral estimation of latent-variable PCFGs coupled with a tensor decomposition algorithm well-known in the multilinear algebra literature. We also describe an error bound for this approximation, which gives guarantees showing that if the underlying tensors are well approximated, then the probability distribution over trees will also be well approximated. Empirical evaluation on real-world natural language parsing data demonstrates a significant speed-up at minimal cost for parsing performance.

3 0.7498154 251 nips-2012-On Lifting the Gibbs Sampling Algorithm

Author: Deepak Venugopal, Vibhav Gogate

Abstract: First-order probabilistic models combine the power of first-order logic, the de facto tool for handling relational structure, with probabilistic graphical models, the de facto tool for handling uncertainty. Lifted probabilistic inference algorithms for them have been the subject of much recent research. The main idea in these algorithms is to improve the accuracy and scalability of existing graphical models’ inference algorithms by exploiting symmetry in the first-order representation. In this paper, we consider blocked Gibbs sampling, an advanced MCMC scheme, and lift it to the first-order level. We propose to achieve this by partitioning the first-order atoms in the model into a set of disjoint clusters such that exact lifted inference is polynomial in each cluster given an assignment to all other atoms not in the cluster. We propose an approach for constructing the clusters and show how it can be used to trade accuracy with computational complexity in a principled manner. Our experimental evaluation shows that lifted Gibbs sampling is superior to the propositional algorithm in terms of accuracy, scalability and convergence.

4 0.66697925 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound

Author: Chi Jin, Liwei Wang

Abstract: Margin is one of the most important concepts in machine learning. Previous margin bounds, both for SVM and for boosting, are dimensionality independent. A major advantage of this dimensionality independency is that it can explain the excellent performance of SVM whose feature spaces are often of high or infinite dimension. In this paper we address the problem whether such dimensionality independency is intrinsic for the margin bounds. We prove a dimensionality dependent PAC-Bayes margin bound. The bound is monotone increasing with respect to the dimension when keeping all other factors fixed. We show that our bound is strictly sharper than a previously well-known PAC-Bayes margin bound if the feature space is of finite dimension; and the two bounds tend to be equivalent as the dimension goes to infinity. In addition, we show that the VC bound for linear classifiers can be recovered from our bound under mild conditions. We conduct extensive experiments on benchmark datasets and find that the new bound is useful for model selection and is usually significantly sharper than the dimensionality independent PAC-Bayes margin bound as well as the VC bound for linear classifiers.

5 0.6401282 83 nips-2012-Controlled Recognition Bounds for Visual Learning and Exploration

Author: Vasiliy Karasev, Alessandro Chiuso, Stefano Soatto

Abstract: We describe the tradeoff between the performance in a visual recognition problem and the control authority that the agent can exercise on the sensing process. We focus on the problem of “visual search” of an object in an otherwise known and static scene, propose a measure of control authority, and relate it to the expected risk and its proxy (conditional entropy of the posterior density). We show this analytically, as well as empirically by simulation using the simplest known model that captures the phenomenology of image formation, including scaling and occlusions. We show that a “passive” agent given a training set can provide no guarantees on performance beyond what is afforded by the priors, and that an “omnipotent” agent, capable of infinite control authority, can achieve arbitrarily good performance (asymptotically). In between these limiting cases, the tradeoff can be characterized empirically.

6 0.63627785 260 nips-2012-Online Sum-Product Computation Over Trees

7 0.6333043 168 nips-2012-Kernel Latent SVM for Visual Recognition

8 0.63287228 274 nips-2012-Priors for Diversity in Generative Latent Variable Models

9 0.63159627 193 nips-2012-Learning to Align from Scratch

10 0.63126731 112 nips-2012-Efficient Spike-Coding with Multiplicative Adaptation in a Spike Response Model

11 0.63071477 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines

12 0.63042659 113 nips-2012-Efficient and direct estimation of a neural subunit model for sensory coding

13 0.6282934 235 nips-2012-Natural Images, Gaussian Mixtures and Dead Leaves

14 0.62745816 77 nips-2012-Complex Inference in Neural Circuits with Probabilistic Population Codes and Topic Models

15 0.62736833 197 nips-2012-Learning with Recursive Perceptual Representations

16 0.62667823 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models

17 0.62612081 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models

18 0.62564451 210 nips-2012-Memorability of Image Regions

19 0.62563598 333 nips-2012-Synchronization can Control Regularization in Neural Systems via Correlated Noise Processes

20 0.62517524 176 nips-2012-Learning Image Descriptors with the Boosting-Trick