nips nips2006 nips2006-112 knowledge-graph by maker-knowledge-mining

112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

Source: pdf

Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao

Abstract: Learning by imitation represents an important mechanism for rapid acquisition of new behaviors in humans and robots. A critical requirement for learning by imitation is the ability to handle uncertainty arising from the observation process as well as the imitator’s own dynamics and interactions with the environment. In this paper, we present a new probabilistic method for inferring imitative actions that takes into account both the observations of the teacher as well as the imitator’s dynamics. Our key contribution is a nonparametric learning method which generalizes to systems with very different dynamics. Rather than relying on a known forward model of the dynamics, our approach learns a nonparametric forward model via exploration. Leveraging advances in approximate inference in graphical models, we show how the learned forward model can be directly used to plan an imitating sequence. We provide experimental results for two systems: a biomechanical model of the human arm and a 25-degrees-of-freedom humanoid robot. We demonstrate that the proposed method can be used to learn appropriate motor inputs to the model arm which imitates the desired movements. A second set of results demonstrates dynamically stable full-body imitation of a human teacher by the humanoid robot. 1

Reference: text

Summary: the most important sentenses genereted by tfidf model

sentIndex sentText sentNum sentScore

1 edu Abstract Learning by imitation represents an important mechanism for rapid acquisition of new behaviors in humans and robots. [sent-7, score-0.594]

2 A critical requirement for learning by imitation is the ability to handle uncertainty arising from the observation process as well as the imitator’s own dynamics and interactions with the environment. [sent-8, score-0.657]

3 In this paper, we present a new probabilistic method for inferring imitative actions that takes into account both the observations of the teacher as well as the imitator’s dynamics. [sent-9, score-0.581]

4 Rather than relying on a known forward model of the dynamics, our approach learns a nonparametric forward model via exploration. [sent-11, score-0.49]

5 Leveraging advances in approximate inference in graphical models, we show how the learned forward model can be directly used to plan an imitating sequence. [sent-12, score-0.381]

6 We provide experimental results for two systems: a biomechanical model of the human arm and a 25-degrees-of-freedom humanoid robot. [sent-13, score-0.605]

7 We demonstrate that the proposed method can be used to learn appropriate motor inputs to the model arm which imitates the desired movements. [sent-14, score-0.299]

8 A second set of results demonstrates dynamically stable full-body imitation of a human teacher by the humanoid robot. [sent-15, score-1.137]

9 Robotics researchers have become increasingly interested in learning by imitation (also called “learning by watching” or “learning from demonstration”) as an attractive alternative to manually programming robots [5, 8, 19]. [sent-18, score-0.548]

10 Uncertainty in imitation arises from many sources including the internal dynamics of the robot, the robot’s interactions with its environment, observations of the teacher, etc. [sent-20, score-0.653]

11 Being able to handle uncertainty is especially critical in robotic imitation because executing actions that have high uncertainty during imitation could lead to potentially disastrous consequences. [sent-21, score-1.385]

12 In this paper, we propose a new technique for imitation that explicitly handles uncertainty using a probabilistic model of actions and their sensory consequences. [sent-22, score-0.784]

13 Rather than relying on a physicsbased parametric model of system dynamics as in traditional methods, our approach learns a nonparametric model of the imitator’s internal dynamics during a constrained exploration period. [sent-23, score-0.547]

14 The learned model is then used to infer appropriate actions for imitation using probabilistic inference in a dynamic Bayesian network (DBN) with teacher observations as evidence. [sent-24, score-1.176]

15 We demonstrate the viability of the approach using two systems: a biomechanical model of the human arm and a 25- a2 a1 s1 s3 s2 c1 o1 aT −1 c2 sT c3 o3 o2 (a) cT oT (b) (c) Figure 1: Graphical model and systems for imitation learning. [sent-25, score-0.949]

16 (a) Dynamic Bayesian network for inferring a sequence of imitative actions a1:T−1 from a sequence of observations of the teacher o1:T . [sent-26, score-0.601]

17 The model also allows for probabilistic constraint variables ct on the imitators states st . [sent-27, score-0.389]

18 (b) The two link biomechanical model of the human arm (from [10]) used in experiments on learning reaching movements via imitation. [sent-29, score-0.508]

19 (c) The Fujitsu Hoap-2 humanoid robot used in our experiments on fullbody, dynamic imitation. [sent-30, score-0.479]

20 Our ﬁrst set of results illustrate how the proposed method can be used to learn appropriate motor commands for producing imitative movements in the model human arm. [sent-32, score-0.328]

21 The second set of results demonstrates dynamically stable full-body imitation of a human teacher by the humanoid robot. [sent-33, score-1.137]

22 Taken together, the results suggest that a probabilistic approach to imitation based on nonparametric model learning could provide a powerful and ﬂexible platform for acquiring new behaviors in complex robotic systems. [sent-34, score-0.843]

23 2 Imitation via Inference and Constrained Exploration In this section we present our inference-based approach to selecting a set of actions based on observations of another agent’s state during demonstration, and a set of probabilistic constraints. [sent-35, score-0.304]

24 We use the convention that the agent 1 starts in an initial state s1 , and as the result of executing the actions visits the set of continuous states s2 , ··, st , ··, sT . [sent-39, score-0.458]

25 In our imitation learning framework the agent observes a sequence of continuous variables o1 , ··, ot , ··, oT− providing partial information about the state of the teacher during demonstration. [sent-41, score-1.068]

26 1 The conditional probability density P (ot |st ) encodes how likely an observation of the teacher (ot ) agrees with an an agent’s state (st ) while performing the same motion or task. [sent-42, score-0.449]

27 Probabilistic constraints on state variables are included within the graphical model by a set of variables ct . [sent-45, score-0.262]

28 The corresponding constraint models P (ct |st ) encode the likelihood of satisfying the constraint in state st . [sent-46, score-0.342]

29 While in principle any algorithm for computing the marginal posterior distributions of the action variables could be used, we ﬁnd it convenient here to use Pearl’s belief propagation (BP) algorithm [13]. [sent-53, score-0.235]

30 The result of performing belief propagation is a set of marginal belief distributions B(x) = P (x|E) = π(x)λ(x). [sent-58, score-0.264]

31 This belief distribution is the product of two sets of messages π(x) and λ(x), which represent the information coming from neighboring parent and children variable nodes respectively. [sent-59, score-0.319]

32 Beliefs are computed via messages passed along the edges of the graphical model, which are distributions over single variables. [sent-60, score-0.212]

33 This “self message” represented using a Dirac delta distribution about the observed value is considered in the product of all messages from the m children (denoted Yj ) of X: m λYj (x). [sent-69, score-0.186]

34 Although the product of a set of mixtures of Gaussians is simply another mixture of Gaussians, the complexity (in terms of the number of components in the output mixture) grows exponentially in the number of input mixtures. [sent-75, score-0.193]

35 Thus an approximation is needed to keep inference tractable in the action sequence length T . [sent-76, score-0.182]

36 For example, when the message πst+1 (st ) coming from a previous state has M = 10 components, the message from the action πat−1 (at ) has N = 1 component (based on a unimodal Gaussian prior), and the GMR model has P = 67 components, the conditional product has M N P = 670 components. [sent-80, score-0.357]

37 This sparsity should not be surprising as the P model components represent localized data, and only a few of these components tend to have overlap with the belief state being propagated. [sent-83, score-0.313]

38 We now brieﬂy describe our algorithm 1 for action inference and constrained exploration. [sent-93, score-0.21]

39 Inference proceeds by ﬁrst ”inverting” the evidence from the observations and constraint variables yielding the messages λot (st ), λct (st ). [sent-95, score-0.205]

40 After initialization from the priors PS , PA we perform a forward planning pass thereby computing the forward state messages πst+1 (st ). [sent-96, score-0.461]

41 The algorithm then combines information from forward and backward messages (via Eq. [sent-98, score-0.254]

42 ˆ We then select the maximum marginal belief action at from the belief distribution using the mode ﬁnding algorithm described in [4]. [sent-100, score-0.281]

43 Our algorithm to iteratively explore the state and action spaces while satisfying the constraints placed on the system builds on the inference based action selection algorithm described above. [sent-101, score-0.321]

44 Execution yields a sequence of states, which are used to update the learned forward model (see Section 3). [sent-104, score-0.201]

45 Using the new (ideally more accurate) forward model we show we are able to obtain a better imitation of the teacher via the newly inferred actions. [sent-105, score-1.016]

46 The ﬁnal model and sequence of actions are then returned after N constrained exploration trials. [sent-106, score-0.393]

47 For simplicity, we currently assume that state and action prior distributions, and the observation and constraint models are pre-speciﬁed. [sent-107, score-0.223]

48 The focus of our algorithms presented here is to learn the forward model which in many real-world domains is extremely complex to derive analytically. [sent-109, score-0.169]

49 2 describe the results of our algorithms applied in the human arm model and humanoid robot domains respectively. [sent-112, score-0.678]

50 3 Nonparametric Model Learning In this section we investigate an algorithm for learning a nonparametric forward model via Gaussian Mixture Regression (GMR) [17]. [sent-113, score-0.321]

51 jj jj (6) Gaussian mixture regression is derived by applying the result of this theorem to Eq. [sent-122, score-0.201]

52 k wk N (x; µk j , Σk jj ) (8) Likewise the mean of the k-th conditioned component of xi given xj is a function of xj : µkij (x) = µki + Σkij Σ−1 (x − µkj ). [sent-127, score-0.234]

53 We assume that a set of state and action histories have been observed during the N trial histories: {[si , ai , si , ai , ··, ai si ]}N . [sent-137, score-0.239]

54 To learn a GMR forward model we ﬁrst construct the joint variable 2 1 1 2 T−1 T i=1 space: x = [s a (s ) ] where s denotes the resulting state when executing action a in state s. [sent-138, score-0.468]

55 The algorithm proceeds by merging components which are very similar, as determined by a symmetric similarity metric between two mixture components. [sent-141, score-0.202]

56 To perform efﬁcient merging we ﬁrst compute the minimum spanning tree of all mixture components. [sent-143, score-0.188]

57 Rather than the ”method of moments” (MoM) approach to merging components and then later running expectation maximization to ﬁne-tune the selected model, we that found performing local maximum likelihood estimation (MLE) within model selection to be more effective at ﬁnding an accurate model. [sent-147, score-0.193]

58 2 demonstrates the learning of a forward model for the biomechanical arm model from Section 4. [sent-154, score-0.45]

59 The imitator explores the space via body babbling (second plot, shown in red). [sent-184, score-0.217]

60 From this data a GMR model is learned, constrained exploration is performed to ﬁnd an imitative reaching motion (shown every 5 iterations). [sent-185, score-0.497]

61 b) The velocities of the two joints during the imitation learning process. [sent-186, score-0.605]

62 By trial number 20 the imitator’s velocities (thin lines) closely match the demonstrator’s velocities (the thick,light colored lines), and meet the zero ﬁnal velocity constraint. [sent-187, score-0.184]

63 c) The teacher’s original muscle torques, followed by the babbling torques, and the torques computed during constrained exploration. [sent-188, score-0.192]

64 In the ﬁrst set of experiments we learn reaching movements via imitation in a complex non-linear model of the human arm. [sent-189, score-0.813]

65 The arm simulator we use is a biomechanical arm model [10] consisting of two degrees of freedom (denoted θ) representing the shoulder and elbow joints. [sent-190, score-0.436]

66 The arm is controlled via two torque inputs (denoted τ ) for the two degrees of freedom. [sent-191, score-0.283]

67 The dynamics of the arm are described via the following differential equation: ¨ ˙ ˙ M (θ)θ + C(θ, θ) + B(θ) = τ (12) where M is the inertial force matrix, C is a vector of centripetal and Coriolis forces, and B is the matrix of force due to friction at the joints. [sent-192, score-0.264]

68 3 shows the process of learning to perform a reaching motion via imitation. [sent-194, score-0.168]

69 First we compute the teacher’s simulated arm motion using the model-based iLQG algorithm [10] based on start and target positions of the hand. [sent-195, score-0.232]

70 By executing the sequence of computed torque inputs [ˆ1:T−1 ] from a a speciﬁed initial state s1 , we obtain the state history of the demonstrator [ˆ1:T ]. [sent-196, score-0.452]

71 To simulate the s 70 50 balanced duration constrained exploration period random (from prior) exploration period 60 40 30 20 10 0 0 (a) 5 10 15 20 25 trial # 30 35 40 45 (b) Figure 4: Humanoid robot dynamic imitation. [sent-197, score-0.616]

72 The second row shows the result of performing a kinematic imitation in the simulator. [sent-199, score-0.664]

73 The third and fourth rows show the ﬁnal imitation result obtained by our method of constrained exploration, in the simulator, and on the Hoap-2 robot. [sent-200, score-0.608]

74 b) The duration that the executed imitation was balanced (out of a total of T = 63) shown vs the trial number. [sent-201, score-0.581]

75 The random exploration trials are shown in red, and the inferred imitative trials are shown in blue. [sent-202, score-0.265]

76 Note that the balanced duration rapidly rises and by the 15th inferred sequence is able to perform the imitation without falling. [sent-203, score-0.615]

77 natural partial observability of a human demonstrator and a human learner, we provide our inference algorithm with noisy measurements of the kinematic state only (not the torques). [sent-204, score-0.502]

78 2 Dynamic Humanoid Imitation We applied our algorithms for nonparametric action selection, model learning, and constrained exploration to the problem of full-body dynamic imitation in a humanoid robot. [sent-207, score-1.32]

79 The experiment consisted of a humanoid demonstrator performing motions such as squatting and standing on one leg. [sent-208, score-0.436]

80 First, the demonstrators’ kinematics were obtained using a commercially available retroreﬂective marker-based optical motion capture system based on inverse kinematics (IK). [sent-210, score-0.197]

81 The IK skeletal model of the human was restricted to have the same degrees of freedom as the Fujitsu Hoap-2 humanoid robot. [sent-211, score-0.362]

82 Representing humanoid motion using a full kinematic conﬁguration is problematic (due to the curse of dimensionality). [sent-212, score-0.404]

83 considering the dynamic balance involved in stably imitating the human demonstrator. [sent-218, score-0.29]

84 By computing four differential features of the pressure sensors, and extracting the two horizontal gyroscope axis, we form a six dimensional representation of the dynamic state of the robot. [sent-221, score-0.241]

85 Concatenating the four dimensional kinematic state and the six dimensional dynamic state we form the full ten dimensional state representation st . [sent-222, score-0.551]

86 Robot actions at are then simply points in the embedded kinematic space. [sent-223, score-0.205]

87 We bootstrap the forward model (of robot kinematics and dynamics) by ﬁrst performing random exploration (body babbling) about the instructor’s trajectory. [sent-224, score-0.564]

88 Subsequently we place a probabilistic constraint on the dynamic conﬁguration of the robot (using a tight, central Gaussian distribution around zero angular velocity, and zero pressure differentials). [sent-226, score-0.381]

89 Using this constraint on dynamics we perform constrained exploration, until we obtain a stable motion for the Hoap-2 which imitates the human motion. [sent-227, score-0.387]

90 The results we obtained in imitating a difﬁcult one-legged balance motion are shown in Fig. [sent-228, score-0.209]

91 5 Conclusion Our results demonstrate that probabilistic inference and learning techniques can be used to successfully acquire new behaviors in complex robotic systems such as a humanoid robot. [sent-230, score-0.436]

92 In particular, we showed how a nonparametric model of forward dynamics can be learned from constrained exploration and used to infer actions for imitating a teacher while simultaneously taking the imitator’s dynamics into account. [sent-231, score-1.127]

93 There exists a large body of previous work on robotic imitation learning (see, for example [2, 5, 14, 19]). [sent-232, score-0.597]

94 Some rely on producing imitative behaviors using nonlinear dynamical systems (e. [sent-233, score-0.168]

95 However, the role of this type of expert and that of our human demonstrator must be distinguished. [sent-239, score-0.204]

96 In the former case, the teacher is directly controlling the artiﬁcial system. [sent-240, score-0.265]

97 In the imitation learning paradigm, one can only observe the teacher controlling their own body. [sent-241, score-0.813]

98 Further, despite kinematic similarities between the human and humanoid robot, the dynamic properties of the robot and human are very different. [sent-242, score-0.728]

99 Learning human arm movements by imitation: Evaluation of a biologically-inspired connectionist architecture. [sent-258, score-0.291]

100 Dynamic imitation in a humanoid robot through nonparametric probabilistic inference. [sent-280, score-1.113]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('imitation', 0.548), ('teacher', 0.265), ('humanoid', 0.242), ('gmr', 0.183), ('st', 0.162), ('robot', 0.161), ('arm', 0.155), ('exploration', 0.143), ('forward', 0.131), ('messages', 0.123), ('demonstrator', 0.122), ('imitative', 0.122), ('imitator', 0.122), ('actions', 0.12), ('nonparametric', 0.118), ('imitating', 0.102), ('ot', 0.099), ('action', 0.095), ('belief', 0.093), ('ct', 0.093), ('biomechanical', 0.088), ('kinematic', 0.085), ('human', 0.082), ('mixture', 0.078), ('motion', 0.077), ('dynamic', 0.076), ('state', 0.076), ('dynamics', 0.075), ('merging', 0.071), ('torques', 0.071), ('xj', 0.064), ('kij', 0.062), ('babbling', 0.061), ('torque', 0.061), ('kinematics', 0.06), ('constrained', 0.06), ('robotics', 0.06), ('message', 0.058), ('reaching', 0.057), ('velocities', 0.057), ('inference', 0.055), ('graphical', 0.055), ('movements', 0.054), ('components', 0.053), ('constraint', 0.052), ('executing', 0.052), ('robotic', 0.049), ('pressure', 0.048), ('agent', 0.048), ('propagation', 0.047), ('behaviors', 0.046), ('probabilistic', 0.044), ('regression', 0.043), ('kj', 0.043), ('eqs', 0.041), ('fujitsu', 0.041), ('gyroscope', 0.041), ('ijspeert', 0.041), ('imitates', 0.041), ('instructor', 0.041), ('ipra', 0.041), ('kjj', 0.041), ('nbp', 0.041), ('squatting', 0.041), ('xtr', 0.041), ('jj', 0.04), ('reinforcement', 0.04), ('parent', 0.04), ('spanning', 0.039), ('gmm', 0.039), ('model', 0.038), ('velocity', 0.037), ('integrals', 0.037), ('wk', 0.036), ('yj', 0.035), ('merges', 0.035), ('apprenticeship', 0.035), ('rises', 0.035), ('histories', 0.035), ('imitate', 0.035), ('grimes', 0.035), ('watching', 0.035), ('xl', 0.035), ('via', 0.034), ('uncertainty', 0.034), ('inputs', 0.033), ('trial', 0.033), ('demonstration', 0.032), ('wkj', 0.032), ('mle', 0.032), ('product', 0.032), ('sequence', 0.032), ('motor', 0.032), ('performing', 0.031), ('children', 0.031), ('observations', 0.03), ('ihler', 0.03), ('balance', 0.03), ('xi', 0.03), ('mixtures', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999958 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao

2 0.15578696 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

Author: David Barber, Bertrand Mesot

Abstract: We introduce a method for approximate smoothed inference in a class of switching linear dynamical systems, based on a novel form of Gaussian Sum smoother. This class includes the switching Kalman Filter and the more general case of switch transitions dependent on the continuous latent state. The method improves on the standard Kim smoothing approach by dispensing with one of the key approximations, thus making fuller use of the available future information. Whilst the only central assumption required is projection to a mixture of Gaussians, we show that an additional conditional independence assumption results in a simpler but stable and accurate alternative. Unlike the alternative unstable Expectation Propagation procedure, our method consists only of a single forward and backward pass and is reminiscent of the standard smoothing ‘correction’ recursions in the simpler linear dynamical system. The algorithm performs well on both toy experiments and in a large scale application to noise robust speech recognition. 1 Switching Linear Dynamical System The Linear Dynamical System (LDS) [1] is a key temporal model in which a latent linear process generates the observed series. For complex time-series which are not well described globally by a single LDS, we may break the time-series into segments, each modeled by a potentially different LDS. This is the basis for the Switching LDS (SLDS) [2, 3, 4, 5] where, for each time t, a switch variable st ∈ 1, . . . , S describes which of the LDSs is to be used. The observation (or ‘visible’) vt ∈ RV is linearly related to the hidden state ht ∈ RH with additive noise η by vt = B(st )ht + η v (st ) p(vt |ht , st ) = N (B(st )ht , Σv (st )) ≡ (1) where N (µ, Σ) denotes a Gaussian distribution with mean µ and covariance Σ. The transition dynamics of the continuous hidden state ht is linear, ht = A(st )ht−1 + η h (st ), ≡ p(ht |ht−1 , st ) = N A(st )ht−1 , Σh (st ) (2) The switch st may depend on both the previous st−1 and ht−1 . This is an augmented SLDS (aSLDS), and deﬁnes the model T p(vt |ht , st )p(ht |ht−1 , st )p(st |ht−1 , st−1 ) p(v1:T , h1:T , s1:T ) = t=1 The standard SLDS[4] considers only switch transitions p(st |st−1 ). At time t = 1, p(s1 |h0 , s0 ) simply denotes the prior p(s1 ), and p(h1 |h0 , s1 ) denotes p(h1 |s1 ). The aim of this article is to address how to perform inference in the aSLDS. In particular we desire the ﬁltered estimate p(ht , st |v1:t ) and the smoothed estimate p(ht , st |v1:T ), for any 1 ≤ t ≤ T . Both ﬁltered and smoothed inference in the SLDS is intractable, scaling exponentially with time [4]. s1 s2 s3 s4 h1 h2 h3 h4 v1 v2 v3 v4 Figure 1: The independence structure of the aSLDS. Square nodes denote discrete variables, round nodes continuous variables. In the SLDS links from h to s are not normally considered. 2 Expectation Correction Our approach to approximate p(ht , st |v1:T ) mirrors the Rauch-Tung-Striebel ‘correction’ smoother for the simpler LDS [1].The method consists of a single forward pass to recursively ﬁnd the ﬁltered posterior p(ht , st |v1:t ), followed by a single backward pass to correct this into a smoothed posterior p(ht , st |v1:T ). The forward pass we use is equivalent to standard Assumed Density Filtering (ADF) [6]. The main contribution of this paper is a novel form of backward pass, based only on collapsing the smoothed posterior to a mixture of Gaussians. Together with the ADF forward pass, we call the method Expectation Correction, since it corrects the moments found from the forward pass. A more detailed description of the method, including pseudocode, is given in [7]. 2.1 Forward Pass (Filtering) Readers familiar with ADF may wish to continue directly to Section (2.2). Our aim is to form a recursion for p(st , ht |v1:t ), based on a Gaussian mixture approximation of p(ht |st , v1:t ). Without loss of generality, we may decompose the ﬁltered posterior as p(ht , st |v1:t ) = p(ht |st , v1:t )p(st |v1:t ) (3) The exact representation of p(ht |st , v1:t ) is a mixture with O(S t ) components. We therefore approximate this with a smaller I-component mixture I p(ht |st , v1:t ) ≈ p(ht |it , st , v1:t )p(it |st , v1:t ) it =1 where p(ht |it , st , v1:t ) is a Gaussian parameterized with mean f (it , st ) and covariance F (it , st ). To ﬁnd a recursion for these parameters, consider p(ht+1 |st+1 , v1:t+1 ) = p(ht+1 |st , it , st+1 , v1:t+1 )p(st , it |st+1 , v1:t+1 ) (4) st ,it Evaluating p(ht+1 |st , it , st+1 , v1:t+1 ) We ﬁnd p(ht+1 |st , it , st+1 , v1:t+1 ) by ﬁrst computing the joint distribution p(ht+1 , vt+1 |st , it , st+1 , v1:t ), which is a Gaussian with covariance and mean elements, Σhh = A(st+1 )F (it , st )AT (st+1 ) + Σh (st+1 ), Σvv = B(st+1 )Σhh B T (st+1 ) + Σv (st+1 ) Σvh = B(st+1 )F (it , st ), µv = B(st+1 )A(st+1 )f (it , st ), µh = A(st+1 )f (it , st ) (5) and then conditioning on vt+1 1 . For the case S = 1, this forms the usual Kalman Filter recursions[1]. Evaluating p(st , it |st+1 , v1:t+1 ) The mixture weight in (4) can be found from the decomposition p(st , it |st+1 , v1:t+1 ) ∝ p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) (6) 1 p(x|y) is a Gaussian with mean µx + Σxy Σ−1 (y − µy ) and covariance Σxx − Σxy Σ−1 Σyx . yy yy The ﬁrst factor in (6), p(vt+1 |it , st , st+1 , v1:t ) is a Gaussian with mean µv and covariance Σvv , as given in (5). The last two factors p(it |st , v1:t ) and p(st |v1:t ) are given from the previous iteration. Finally, p(st+1 |it , st , v1:t ) is found from p(st+1 |it , st , v1:t ) = p(st+1 |ht , st ) p(ht |it ,st ,v1:t ) (7) where · p denotes expectation with respect to p. In the SLDS, (7) is replaced by the Markov transition p(st+1 |st ). In the aSLDS, however, (7) will generally need to be computed numerically. Closing the recursion We are now in a position to calculate (4). For each setting of the variable st+1 , we have a mixture of I × S Gaussians which we numerically collapse back to I Gaussians to form I p(ht+1 |st+1 , v1:t+1 ) ≈ p(ht+1 |it+1 , st+1 , v1:t+1 )p(it+1 |st+1 , v1:t+1 ) it+1 =1 Any method of choice may be supplied to collapse a mixture to a smaller mixture; our code simply repeatedly merges low-weight components. In this way the new mixture coefﬁcients p(it+1 |st+1 , v1:t+1 ), it+1 ∈ 1, . . . , I are deﬁned, completing the description of how to form a recursion for p(ht+1 |st+1 , v1:t+1 ) in (3). A recursion for the switch variable is given by p(st+1 |v1:t+1 ) ∝ p(vt+1 |st+1 , it , st , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) st ,it where all terms have been computed during the recursion for p(ht+1 |st+1 , v1:t+1 ). The likelihood p(v1:T ) may be found by recursing p(v1:t+1 ) = p(vt+1 |v1:t )p(v1:t ), where p(vt+1 |vt ) = p(vt+1 |it , st , st+1 , v1:t )p(st+1 |it , st , v1:t )p(it |st , v1:t )p(st |v1:t ) it ,st ,st+1 2.2 Backward Pass (Smoothing) The main contribution of this paper is to ﬁnd a suitable way to ‘correct’ the ﬁltered posterior p(st , ht |v1:t ) obtained from the forward pass into a smoothed posterior p(st , ht |v1:T ). We derive this for the case of a single Gaussian representation. The extension to the mixture case is straightforward and presented in [7]. We approximate the smoothed posterior p(ht |st , v1:T ) by a Gaussian with mean g(st ) and covariance G(st ) and our aim is to ﬁnd a recursion for these parameters. A useful starting point for a recursion is: p(st+1 |v1:T )p(ht |st , st+1 , v1:T )p(st |st+1 , v1:T ) p(ht , st |v1:T ) = st+1 The term p(ht |st , st+1 , v1:T ) may be computed as p(ht |st , st+1 , v1:T ) = p(ht |ht+1 , st , st+1 , v1:t )p(ht+1 |st , st+1 , v1:T ) (8) ht+1 The recursion therefore requires p(ht+1 |st , st+1 , v1:T ), which we can write as p(ht+1 |st , st+1 , v1:T ) ∝ p(ht+1 |st+1 , v1:T )p(st |st+1 , ht+1 , v1:t ) (9) The difﬁculty here is that the functional form of p(st |st+1 , ht+1 , v1:t ) is not squared exponential in ht+1 , so that p(ht+1 |st , st+1 , v1:T ) will not be Gaussian2 . One possibility would be to approximate the non-Gaussian p(ht+1 |st , st+1 , v1:T ) by a Gaussian (or mixture thereof) by minimizing the Kullback-Leilbler divergence between the two, or performing moment matching in the case of a single Gaussian. A simpler alternative (which forms ‘standard’ EC) is to make the assumption p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), where p(ht+1 |st+1 , v1:T ) is already known from the previous backward recursion. Under this assumption, the recursion becomes p(ht , st |v1:T ) ≈ p(st+1 |v1:T )p(st |st+1 , v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) (10) st+1 2 In the exact calculation, p(ht+1 |st , st+1 , v1:T ) is a mixture of Gaussians, see [7]. However, since in (9) the two terms p(ht+1 |st+1 , v1:T ) will only be approximately computed during the recursion, our approximation to p(ht+1 |st , st+1 , v1:T ) will not be a mixture of Gaussians. Evaluating p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian in ht , whose statistics we will now compute. First we ﬁnd p(ht |ht+1 , st , st+1 , v1:t ) which may be obtained from the joint distribution p(ht , ht+1 |st , st+1 , v1:t ) = p(ht+1 |ht , st+1 )p(ht |st , v1:t ) (11) which itself can be found from a forward dynamics from the ﬁltered estimate p(ht |st , v1:t ). The statistics for the marginal p(ht |st , st+1 , v1:t ) are simply those of p(ht |st , v1:t ), since st+1 carries no extra information about ht . The remaining statistics are the mean of ht+1 , the covariance of ht+1 and cross-variance between ht and ht+1 , which are given by ht+1 = A(st+1 )ft (st ), Σt+1,t+1 = A(st+1 )Ft (st )AT (st+1 )+Σh (st+1 ), Σt+1,t = A(st+1 )Ft (st ) Given the statistics of (11), we may now condition on ht+1 to ﬁnd p(ht |ht+1 , st , st+1 , v1:t ). Doing so effectively constitutes a reversal of the dynamics, ← − − ht = A (st , st+1 )ht+1 + ←(st , st+1 ) η ← − ← − − − where A (st , st+1 ) and ←(st , st+1 ) ∼ N (← t , st+1 ), Σ (st , st+1 )) are easily found using η m(s conditioning. Averaging the above reversed dynamics over p(ht+1 |st+1 , v1:T ), we ﬁnd that p(ht |ht+1 , st , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) is a Gaussian with statistics ← − ← − ← − ← − − µt = A (st , st+1 )g(st+1 )+← t , st+1 ), Σt,t = A (st , st+1 )G(st+1 ) A T (st , st+1 )+ Σ (st , st+1 ) m(s These equations directly mirror the standard RTS backward pass[1]. Evaluating p(st |st+1 , v1:T ) The main departure of EC from previous methods is in treating the term p(st |st+1 , v1:T ) = p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) (12) The term p(st |ht+1 , st+1 , v1:t ) is given by p(st |ht+1 , st+1 , v1:t ) = p(ht+1 |st+1 , st , v1:t )p(st , st+1 |v1:t ) ′ ′ s′ p(ht+1 |st+1 , st , v1:t )p(st , st+1 |v1:t ) (13) t Here p(st , st+1 |v1:t ) = p(st+1 |st , v1:t )p(st |v1:t ), where p(st+1 |st , v1:t ) occurs in the forward pass, (7). In (13), p(ht+1 |st+1 , st , v1:t ) is found by marginalizing (11). Computing the average of (13) with respect to p(ht+1 |st+1 , v1:T ) may be achieved by any numerical integration method desired. A simple approximation is to evaluate the integrand at the mean value of the averaging distribution p(ht+1 |st+1 , v1:T ). More sophisticated methods (see [7]) such as sampling from the Gaussian p(ht+1 |st+1 , v1:T ) have the advantage that covariance information is used3 . Closing the Recursion We have now computed both the continuous and discrete factors in (8), which we wish to use to write the smoothed estimate in the form p(ht , st |v1:T ) = p(st |v1:T )p(ht |st , v1:T ). The distribution p(ht |st , v1:T ) is readily obtained from the joint (8) by conditioning on st to form the mixture p(ht |st , v1:T ) = p(st+1 |st , v1:T )p(ht |st , st+1 , v1:T ) st+1 which may then be collapsed to a single Gaussian (the mixture case is discussed in [7]). The smoothed posterior p(st |v1:T ) is given by p(st |v1:T ) = p(st+1 |v1:T ) p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) . (14) st+1 3 This is a form of exact sampling since drawing samples from a Gaussian is easy. This should not be confused with meaning that this use of sampling renders EC a sequential Monte-Carlo scheme. 2.3 Relation to other methods The EC Backward pass is closely related to Kim’s method [8]. In both EC and Kim’s method, the approximation p(ht+1 |st , st+1 , v1:T ) ≈ p(ht+1 |st+1 , v1:T ), is used to form a numerically simple backward pass. The other ‘approximation’ in EC is to numerically compute the average in (14). In Kim’s method, however, an update for the discrete variables is formed by replacing the required term in (14) by p(st |ht+1 , st+1 , v1:t ) p(ht+1 |st+1 ,v1:T ) ≈ p(st |st+1 , v1:t ) (15) Since p(st |st+1 , v1:t ) ∝ p(st+1 |st )p(st |v1:t )/p(st+1 |v1:t ), this can be computed simply from the ﬁltered results alone. The fundamental difference therefore between EC and Kim’s method is that the approximation, (15), is not required by EC. The EC backward pass therefore makes fuller use of the future information, resulting in a recursion which intimately couples the continuous and discrete variables. The resulting effect on the quality of the approximation can be profound, as we will see in the experiments. The Expectation Propagation (EP) algorithm makes the central assumption of collapsing the posteriors to a Gaussian family [5]; the collapse is deﬁned by a consistency criterion on overlapping marginals. In our experiments, we take the approach in [9] of collapsing to a single Gaussian. Ensuring consistency requires frequent translations between moment and canonical parameterizations, which is the origin of potentially severe numerical instability [10]. In contrast, EC works largely with moment parameterizations of Gaussians, for which relatively few numerical difﬁculties arise. Unlike EP, EC is not based on a consistency criterion and a subtle issue arises about possible inconsistencies in the Forward and Backward approximations for EC. For example, under the conditional independence assumption in the Backward Pass, p(hT |sT −1 , sT , v1:T ) ≈ p(hT |sT , v1:T ), which is in contradiction to (5) which states that the approximation to p(hT |sT −1 , sT , v1:T ) will depend on sT −1 . Such potential inconsistencies arise because of the approximations made, and should not be considered as separate approximations in themselves. Rather than using a global (consistency) objective, EC attempts to faithfully approximate the exact Forward and Backward propagation routines. For this reason, as in the exact computation, only a single Forward and Backward pass are required in EC. In [11] a related dynamics reversed is proposed. However, the singularities resulting from incorrectly treating p(vt+1:T |ht , st ) as a density are heuristically ﬁnessed. In [12] a variational method approximates the joint distribution p(h1:T , s1:T |v1:T ) rather than the marginal inference p(ht , st |v1:T ). This is a disadvantage when compared to other methods that directly approximate the marginal. Sequential Monte Carlo methods (Particle Filters)[13], are essentially mixture of delta-function approximations. Whilst potentially powerful, these typically suffer in high-dimensional hidden spaces, unless techniques such as Rao-Blackwellization are performed. ADF is generally preferential to Particle Filtering since in ADF the approximation is a mixture of non-trivial distributions, and is therefore more able to represent the posterior. 3 Demonstration Testing EC in a problem with a reasonably long temporal sequence, T , is important since numerical instabilities may not be apparent in timeseries of just a few points. To do this, we sequentially generate hidden and visible states from a given model, here with H = 3, S = 2, V = 1 – see Figure(2) for full details of the experimental setup. Then, given only the parameters of the model and the visible observations (but not any of the hidden states h1:T , s1:T ), the task is to infer p(ht |st , v1:T ) and p(st |v1:T ). Since the exact computation is exponential in T , a simple alternative is to assume that the original sample states s1:T are the ‘correct’ inferences, and compare how our most probable posterior smoothed estimates arg maxst p(st |v1:T ) compare with the assumed correct sample st . We chose conditions that, from the viewpoint of classical signal processing, are difﬁcult, with changes in the switches occurring at a much higher rate than the typical frequencies in the signal vt . For EC we use the mean approximation for the numerical integration of (12). We included the Particle Filter merely for a point of comparison with ADF, since they are not designed to approximate PF RBPF EP ADFS KimS ECS ADFM KimM ECM 1000 800 600 400 200 0 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 0 10 20 Figure 2: The number of errors in estimating p(st |v1:T ) for a binary switch (S = 2) over a time series of length T = 100. Hence 50 errors corresponds to random guessing. Plotted are histograms of the errors are over 1000 experiments. The x-axes are cut off at 20 errors to improve visualization of the results. (PF) Particle Filter. (RBPF) Rao-Blackwellized PF. (EP) Expectation Propagation. (ADFS) Assumed Density Filtering using a Single Gaussian. (KimS) Kim’s smoother using the results from ADFS. (ECS) Expectation Correction using a Single Gaussian (I = J = 1). (ADFM) ADF using a multiple of I = 4 Gaussians. (KimM) Kim’s smoother using the results from ADFM. (ECM) Expectation Correction using a mixture with I = J = 4 components. S = 2, V = 1 (scalar observations), T = 100, with zero output bias. A(s) = 0.9999 ∗ orth(randn(H, H)), B(s) = randn(V, H). H = 3, Σh (s) = IH , Σv (s) = 0.1IV , p(st+1 |st ) ∝ 1S×S + IS . At time t = 1, the priors are p1 = uniform, with h1 drawn from N (10 ∗ randn(H, 1), IH ). the smoothed estimate, for which 1000 particles were used, with Kitagawa resampling. For the RaoBlackwellized Particle Filter [13], 500 particles were used, with Kitagawa resampling. We found that EP4 was numerically unstable and often struggled to converge. To encourage convergence, we used the damping method in [9], performing 20 iterations with a damping factor of 0.5. Nevertheless, the disappointing performance of EP is most likely due to conﬂicts resulting from numerical instabilities introduced by the frequent conversions between moment and canonical representations. The best ﬁltered results are given using ADF, since this is better able to represent the variance in the ﬁltered posterior than the sampling methods. Unlike Kim’s method, EC makes good use of the future information to clean up the ﬁltered results considerably. One should bear in mind that both EC and Kim’s method use the same ADF ﬁltered results. This demonstrates that EC may dramatically improve on Kim’s method, so that the small amount of extra work in making a numerical approximation of p(st |st+1 , v1:T ), (12), may bring signiﬁcant beneﬁts. We found similar conclusions for experiments with an aSLDS[7]. 4 Application to Noise Robust ASR Here we brieﬂy present an application of the SLDS to robust Automatic Speech Recognition (ASR), for which the intractable inference is performed by EC, and serves to demonstrate how EC scales well to a large-scale application. Fuller details are given in [14]. The standard approach to noise robust ASR is to provide a set of noise-robust features to a standard Hidden Markov Model (HMM) classiﬁer, which is based on modeling the acoustic feature vector. For example, the method of Unsupervised Spectral Subtraction (USS) [15] provides state-of-the-art performance in this respect. Incorporating noise models directly into such feature-based HMM systems is difﬁcult, mainly because the explicit inﬂuence of the noise on the features is poorly understood. An alternative is to model the raw speech signal directly, such as the SAR-HMM model [16] for which, under clean conditions, isolated spoken digit recognition performs well. However, the SAR-HMM performs poorly under noisy conditions, since no explicit noise processes are taken into account by the model. The approach we take here is to extend the SAR-HMM to include an explicit noise process, so that h the observed signal vt is modeled as a noise corrupted version of a clean hidden signal vt : h vt = vt + ηt ˜ 4 with ηt ∼ N (0, σ 2 ) ˜ ˜ Generalized EP [5], which groups variables together improves on the results, but is still far inferior to the EC results presented here – Onno Zoeter personal communication. Noise Variance 0 10−7 10−6 10−5 10−4 10−3 SNR (dB) 26.5 26.3 25.1 19.7 10.6 0.7 HMM 100.0% 100.0% 90.9% 86.4% 59.1% 9.1% SAR-HMM 97.0% 79.8% 56.7% 22.2% 9.7% 9.1% AR-SLDS 96.8% 96.8% 96.4% 94.8% 84.0% 61.2% Table 1: Comparison of the recognition accuracy of three models when the test utterances are corrupted by various levels of Gaussian noise. The dynamics of the clean signal is modeled by a switching AR process R h vt = h h cr (st )vt−r + ηt (st ), h ηt (st ) ∼ N (0, σ 2 (st )) r=1 where st ∈ {1, . . . , S} denotes which of a set of AR coefﬁcients cr (st ) are to be used at time t, h and ηt (st ) is the so-called innovation noise. When σ 2 (st ) ≡ 0, this model reproduces the SARHMM of [16], a specially constrained HMM. Hence inference and learning for the SAR-HMM are tractable and straightforward. For the case σ 2 (st ) > 0 the model can be recast as an SLDS. To do this we deﬁne ht as a vector which contains the R most recent clean hidden samples ht = h vt h . . . vt−r+1 T (16) and we set A(st ) to be an R × R matrix where the ﬁrst row contains the AR coefﬁcients −cr (st ) and the rest is a shifted down identity matrix. For example, for a third order (R = 3) AR process, A(st ) = −c1 (st ) −c2 (st ) −c3 (st ) 1 0 0 0 1 0 . (17) The hidden covariance matrix Σh (s) has all elements zero, except the top-left most which is set to the innovation variance. To extract the ﬁrst component of ht we use the (switch independent) 1 × R projection matrix B = [ 1 0 . . . 0 ]. The (switch independent) visible scalar noise 2 variance is given by Σv ≡ σv . A well-known issue with raw speech signal models is that the energy of a signal may vary from one speaker to another or because of a change in recording conditions. For this reason the innovation Σh is adjusted by maximizing the likelihood of an observed sequence with respect to the innovation covariance, a process called Gain Adaptation [16]. 4.1 Training & Evaluation Following [16], we trained a separate SAR-HMM for each of the eleven digits (0–9 and ‘oh’) from the TI-DIGITS database [17]. The training set for each digit was composed of 110 single digit utterances down-sampled to 8 kHz, each one pronounced by a male speaker. Each SAR-HMM was composed of ten states with a left-right transition matrix. Each state was associated with a 10thorder AR process and the model was constrained to stay an integer multiple of K = 140 time steps (0.0175 seconds) in the same state. We refer the reader to [16] for a detailed explanation of the training procedure used with the SAR-HMM. An AR-SLDS was built for each of the eleven digits by copying the parameters of the corresponding trained SAR-HMM, i.e., the AR coefﬁcients cr (s) are copied into the ﬁrst row of the hidden transition matrix A(s) and the same discrete transition distribution p(st | st−1 ) is used. The models were then evaluated on a test set composed of 112 corrupted utterances of each of the eleven digits, each pronounced by different male speakers than those used in the training set. The recognition accuracy obtained by the models on the corrupted test sets is presented in Table 1. As expected, the performance of the SAR-HMM rapidly decreases with noise. The feature-based HMM with USS has high accuracy only for high SNR levels. In contrast, the AR-SLDS achieves a recognition accuracy of 61.2% at a SNR close to 0 dB, while the performance of the two other methods is equivalent to random guessing (9.1%). Whilst other inference methods may also perform well in this case, we found that EC performs admirably, without numerical instabilities, even for time-series with several thousand time-steps. 5 Discussion We presented a method for approximate smoothed inference in an augmented class of switching linear dynamical systems. Our approximation is based on the idea that due to the forgetting which commonly occurs in Markovian models, a ﬁnite number of mixture components may provide a reasonable approximation. Clearly, in systems with very long correlation times our method may require too many mixture components to produce a satisfactory result, although we are unaware of other techniques that would be able to cope well in that case. The main beneﬁt of EC over Kim smoothing is that future information is more accurately dealt with. Whilst EC is not as general as EP, EC carefully exploits the properties of singly-connected distributions, such as the aSLDS, to provide a numerically stable procedure. We hope that the ideas presented here may therefore help facilitate the practical application of dynamic hybrid networks. Acknowledgements This work is supported by the EU Project FP6-0027787. This paper only reﬂects the authors’ views and funding agencies are not liable for any use that may be made of the information contained herein. References [1] Y. Bar-Shalom and Xiao-Rong Li. Estimation and Tracking : Principles, Techniques and Software. Artech House, Norwood, MA, 1998. [2] V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning switching linear models of human motion. In Advances in Neural Information Processing systems (NIPS 13), pages 981–987, 2001. [3] A. T. Cemgil, B. Kappen, and D. Barber. A Generative Model for Music Transcription. IEEE Transactions on Audio, Speech and Language Processing, 14(2):679 – 694, 2006. [4] U. N. Lerner. Hybrid Bayesian Networks for Reasoning about Complex Systems. PhD thesis, Stanford University, 2002. [5] O. Zoeter. Monitoring non-linear and switching dynamical systems. PhD thesis, Radboud University Nijmegen, 2005. [6] T. Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, MIT Media Lab, 2001. [7] D. Barber. Expectation Correction for Smoothed Inference in Switching Linear Dynamical Systems. Journal of Machine Learning Research, 7:2515–2540, 2006. [8] C-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60:1–22, 1994. [9] T. Heskes and O. Zoeter. Expectation Propagation for approximate inference in dynamic Bayesian networks. In A. Darwiche and N. Friedman, editors, Uncertainty in Art. Intelligence, pages 216–223, 2002. [10] S. Lauritzen and F. Jensen. Stable local computation with conditional Gaussian distributions. Statistics and Computing, 11:191–203, 2001. [11] G. Kitagawa. The Two-Filter Formula for Smoothing and an implementation of the Gaussian-sum smoother. Annals of the Institute of Statistical Mathematics, 46(4):605–623, 1994. [12] Z. Ghahramani and G. E. Hinton. Variational learning for switching state-space models. Neural Computation, 12(4):963–996, 1998. [13] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice. Springer, 2001. [14] B. Mesot and D. Barber. Switching Linear Dynamical Systems for Noise Robust Speech Recognition. IDIAP-RR 08, 2006. [15] G. Lathoud, M. Magimai-Doss, B. Mesot, and H. Bourlard. Unsupervised spectral subtraction for noiserobust ASR. In Proceedings of ASRU 2005, pages 189–194, November 2005. [16] Y. Ephraim and W. J. J. Roberts. Revisiting autoregressive hidden Markov modeling of speech signals. IEEE Signal Processing Letters, 12(2):166–169, February 2005. [17] R.G. Leonard. A database for speaker independent digit recognition. In Proceedings of ICASSP84, volume 3, 1984.

3 0.13008653 47 nips-2006-Boosting Structured Prediction for Imitation Learning

Author: J. A. Bagnell, Joel Chestnutt, David M. Bradley, Nathan D. Ratliff

Abstract: The Maximum Margin Planning (MMP) (Ratliff et al., 2006) algorithm solves imitation learning problems by learning linear mappings from features to cost functions in a planning domain. The learned policy is the result of minimum-cost planning using these cost functions. These mappings are chosen so that example policies (or trajectories) given by a teacher appear to be lower cost (with a lossscaled margin) than any other policy for a given planning domain. We provide a novel approach, M MP B OOST , based on the functional gradient descent view of boosting (Mason et al., 1999; Friedman, 1999a) that extends MMP by “boosting” in new features. This approach uses simple binary classiﬁcation or regression to improve performance of MMP imitation learning, and naturally extends to the class of structured maximum margin prediction problems. (Taskar et al., 2005) Our technique is applied to navigation and planning problems for outdoor mobile robots and robotic legged locomotion. 1

4 0.08607927 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations

Author: Lorenzo Torresani, Peggy Hackney, Christoph Bregler

Abstract: This paper presents an algorithm for synthesis of human motion in speciﬁed styles. We use a theory of movement observation (Laban Movement Analysis) to describe movement styles as points in a multi-dimensional perceptual space. We cast the task of learning to synthesize desired movement styles as a regression problem: sequences generated via space-time interpolation of motion capture data are used to learn a nonlinear mapping between animation parameters and movement styles in perceptual space. We demonstrate that the learned model can apply a variety of motion styles to pre-recorded motion sequences and it can extrapolate styles not originally included in the training data. 1

5 0.085171789 148 nips-2006-Nonlinear physically-based models for decoding motor-cortical population activity

Author: Gregory Shakhnarovich, Sung-phil Kim, Michael J. Black

Abstract: Neural motor prostheses (NMPs) require the accurate decoding of motor cortical population activity for the control of an artiﬁcial motor system. Previous work on cortical decoding for NMPs has focused on the recovery of hand kinematics. Human NMPs however may require the control of computer cursors or robotic devices with very different physical and dynamical properties. Here we show that the ﬁring rates of cells in the primary motor cortex of non-human primates can be used to control the parameters of an artiﬁcial physical system exhibiting realistic dynamics. The model represents 2D hand motion in terms of a point mass connected to a system of idealized springs. The nonlinear spring coefﬁcients are estimated from the ﬁring rates of neurons in the motor cortex. We evaluate linear and a nonlinear decoding algorithms using neural recordings from two monkeys performing two different tasks. We found that the decoded spring coefﬁcients produced accurate hand trajectories compared with state-of-the-art methods for direct decoding of hand kinematics. Furthermore, using a physically-based system produced decoded movements that were more “natural” in that their frequency spectrum more closely matched that of natural hand movements. 1

6 0.081752911 170 nips-2006-Robotic Grasping of Novel Objects

7 0.081523158 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

8 0.080223888 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

9 0.079414912 57 nips-2006-Conditional mean field

10 0.075230494 69 nips-2006-Distributed Inference in Dynamical Systems

11 0.073009081 122 nips-2006-Learning to parse images of articulated bodies

12 0.069802634 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

13 0.069757335 146 nips-2006-No-regret Algorithms for Online Convex Programs

14 0.069130406 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

15 0.069026597 201 nips-2006-Using Combinatorial Optimization within Max-Product Belief Propagation

16 0.062206235 38 nips-2006-Automated Hierarchy Discovery for Planning in Partially Observable Environments

17 0.060109485 131 nips-2006-Mixture Regression for Covariate Shift

18 0.058490679 154 nips-2006-Optimal Change-Detection and Spiking Neurons

19 0.057595327 190 nips-2006-The Neurodynamics of Belief Propagation on Binary Markov Random Fields

20 0.056575786 9 nips-2006-A Nonparametric Bayesian Method for Inferring Features From Similarity Judgments

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.204), (1, -0.012), (2, -0.029), (3, -0.136), (4, 0.017), (5, -0.105), (6, 0.104), (7, -0.038), (8, 0.047), (9, 0.012), (10, -0.031), (11, -0.076), (12, -0.082), (13, 0.081), (14, 0.061), (15, -0.049), (16, -0.083), (17, 0.112), (18, -0.124), (19, 0.053), (20, 0.107), (21, 0.009), (22, -0.022), (23, 0.104), (24, 0.017), (25, 0.06), (26, 0.032), (27, 0.023), (28, 0.036), (29, 0.016), (30, -0.173), (31, -0.049), (32, 0.035), (33, -0.075), (34, -0.057), (35, -0.17), (36, 0.0), (37, 0.087), (38, 0.073), (39, -0.107), (40, 0.174), (41, -0.032), (42, 0.138), (43, -0.042), (44, -0.028), (45, -0.106), (46, -0.077), (47, -0.035), (48, 0.047), (49, -0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92983007 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao

2 0.56427294 25 nips-2006-An Application of Reinforcement Learning to Aerobatic Helicopter Flight

Author: Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng

Abstract: Autonomous helicopter ﬂight is widely regarded to be a highly challenging control problem. This paper presents the ﬁrst successful autonomous completion on a real RC helicopter of the following four aerobatic maneuvers: forward ﬂip and sideways roll at low speed, tail-in funnel, and nose-in funnel. Our experimental results signiﬁcantly extend the state of the art in autonomous helicopter ﬂight. We used the following approach: First we had a pilot ﬂy the helicopter to help us ﬁnd a helicopter dynamics model and a reward (cost) function. Then we used a reinforcement learning (optimal control) algorithm to ﬁnd a controller that is optimized for the resulting model and reward function. More speciﬁcally, we used differential dynamic programming (DDP), an extension of the linear quadratic regulator (LQR). 1

3 0.52724284 71 nips-2006-Effects of Stress and Genotype on Meta-parameter Dynamics in Reinforcement Learning

Author: Gediminas Lukšys, Jérémie Knüsel, Denis Sheynikhovich, Carmen Sandi, Wulfram Gerstner

Abstract: Stress and genetic background regulate different aspects of behavioral learning through the action of stress hormones and neuromodulators. In reinforcement learning (RL) models, meta-parameters such as learning rate, future reward discount factor, and exploitation-exploration factor, control learning dynamics and performance. They are hypothesized to be related to neuromodulatory levels in the brain. We found that many aspects of animal learning and performance can be described by simple RL models using dynamic control of the meta-parameters. To study the effects of stress and genotype, we carried out 5-hole-box light conditioning and Morris water maze experiments with C57BL/6 and DBA/2 mouse strains. The animals were exposed to different kinds of stress to evaluate its effects on immediate performance as well as on long-term memory. Then, we used RL models to simulate their behavior. For each experimental session, we estimated a set of model meta-parameters that produced the best ﬁt between the model and the animal performance. The dynamics of several estimated meta-parameters were qualitatively similar for the two simulated experiments, and with statistically signiﬁcant differences between different genetic strains and stress conditions. 1

4 0.52385509 170 nips-2006-Robotic Grasping of Novel Objects

Author: Ashutosh Saxena, Justin Driemeyer, Justin Kearns, Andrew Y. Ng

Abstract: We consider the problem of grasping novel objects, speciﬁcally ones that are being seen for the ﬁrst time through vision. We present a learning algorithm that neither requires, nor tries to build, a 3-d model of the object. Instead it predicts, directly as a function of the images, a point at which to grasp the object. Our algorithm is trained via supervised learning, using synthetic images for the training set. We demonstrate on a robotic manipulation platform that this approach successfully grasps a wide variety of objects, such as wine glasses, duct tape, markers, a translucent box, jugs, knife-cutters, cellphones, keys, screwdrivers, staplers, toothbrushes, a thick coil of wire, a strangely shaped power horn, and others, none of which were seen in the training set. 1

5 0.50402427 47 nips-2006-Boosting Structured Prediction for Imitation Learning

Author: J. A. Bagnell, Joel Chestnutt, David M. Bradley, Nathan D. Ratliff

6 0.49142858 10 nips-2006-A Novel Gaussian Sum Smoother for Approximate Inference in Switching Linear Dynamical Systems

7 0.45018607 148 nips-2006-Nonlinear physically-based models for decoding motor-cortical population activity

8 0.42687273 90 nips-2006-Hidden Markov Dirichlet Process: Modeling Genetic Recombination in Open Ancestral Space

9 0.42374262 131 nips-2006-Mixture Regression for Covariate Shift

10 0.3873336 134 nips-2006-Modeling Human Motion Using Binary Latent Variables

11 0.38622835 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization

12 0.38272503 143 nips-2006-Natural Actor-Critic for Road Traffic Optimisation

13 0.37841651 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations

14 0.37652382 43 nips-2006-Bayesian Model Scoring in Markov Random Fields

15 0.3719689 158 nips-2006-PG-means: learning the number of clusters in data

16 0.36286691 57 nips-2006-Conditional mean field

17 0.35670659 202 nips-2006-iLSTD: Eligibility Traces and Convergence Analysis

18 0.35030699 146 nips-2006-No-regret Algorithms for Online Convex Programs

19 0.34320796 69 nips-2006-Distributed Inference in Dynamical Systems

20 0.32947022 198 nips-2006-Unified Inference for Variational Bayesian Linear Gaussian State-Space Models

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.082), (3, 0.032), (7, 0.053), (9, 0.035), (20, 0.031), (22, 0.064), (44, 0.091), (47, 0.02), (57, 0.082), (65, 0.082), (69, 0.051), (71, 0.032), (92, 0.238)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.87353843 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

Author: Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers

Abstract: Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the speciﬁc words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being speciﬁc to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a speciﬁc word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-sematic indexing) or that only match documents at the speciﬁc word level (such as TF-IDF). 1 Introduction and Motivation Reducing high-dimensional data vectors to robust and interpretable lower-dimensional representations has a long and successful history in data analysis, including recent innovations such as latent semantic indexing (LSI) (Deerwester et al, 1994) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). These types of techniques have found broad application in modeling of sparse high-dimensional count data such as the “bag of words” representations for documents or transaction data for Web and retail applications. Approaches such as LSI and LDA have both been shown to be useful for “object matching” in their respective latent spaces. In information retrieval for example, both a query and a set of documents can be represented in the LSI or topic latent spaces, and the documents can be ranked in terms of how well they match the query based on distance or similarity in the latent space. The mapping to latent space represents a generalization or abstraction away from the sparse set of observed words, to a “higher-level” semantic representation in the latent space. These abstractions in principle lead to better generalization on new data compared to inferences carried out directly in the original sparse high-dimensional space. The capability of these models to provide improved generalization has been demonstrated empirically in a number of studies (e.g., Deerwester et al 1994; Hofmann 1999; Canny 2004; Buntine et al, 2005). However, while this type of generalization is broadly useful in terms of inference and prediction, there are situations where one can over-generalize. Consider trying to match the following query to a historical archive of news articles: election + campaign + Camejo. The query is intended to ﬁnd documents that are about US presidential campaigns and also about Peter Camejo (who ran as vice-presidential candidate alongside independent Ralph Nader in 2004). LSI and topic models are likely to highly rank articles that are related to presidential elections (even if they don’t necessarily contain the words election or campaign). However, a potential problem is that the documents that are highly ranked by LSI or topic models need not include any mention of the name Camejo. The reason is that the combination of words in this query is likely to activate one or more latent variables related to the concept of presidential campaigns. However, once this generalization is made the model has “lost” the information about the speciﬁc word Camejo and it will only show up in highly ranked documents if this word happens to frequently occur in these topics (unlikely in this case given that this candidate received relatively little media coverage compared to the coverage given to the candidates from the two main parties). But from the viewpoint of the original query, our preference would be to get documents that are about the general topic of US presidential elections with the speciﬁc constraint that they mention Peter Camejo. Word-based retrieval techniques, such as the widely-used term-frequency inverse-documentfrequency (TF-IDF) method, have the opposite problem in general. They tend to be overly speciﬁc in terms of matching words in the query to documents. In general of course one would like to have a balance between generality and speciﬁcity. One ad hoc approach is to combine scores from a general method such as LSI with those from a more speciﬁc method such as TF-IDF in some manner, and indeed this technique has been proposed in information retrieval (Vogt and Cottrell, 1999). Similarly, in the ad hoc LDA approach (Wei and Croft, 2006), the LDA model is linearly combined with document-speciﬁc word distributions to capture both general as well as speciﬁc information in documents. However, neither method is entirely satisfactory since it is not clear how to trade-off generality and speciﬁcity in a principled way. The contribution of this paper is a new graphical model based on latent topics that handles the tradeoff between generality and speciﬁcity in a fully probabilistic and automated manner. The model, which we call the special words with background (SWB) model, is an extension of the LDA model. The new model allows words in documents to be modeled as either originating from general topics, or from document-speciﬁc “special” word distributions, or from a corpus-wide background distribution. The idea is that words in a document such as election and campaign are likely to come from a general topic on presidential elections, whereas a name such as Camejo is much more likely to be treated as “non-topical” and speciﬁc to that document. Words in queries are automatically interpreted (in a probabilistic manner) as either being topical or special, in the context of each document, allowing for a data-driven document-speciﬁc trade-off between the beneﬁts of topic-based abstraction and speciﬁc word matching. Daum´ and Marcu (2006) independently proposed a probabilistic e model using similar concepts for handling different training and test distributions in classiﬁcation problems. Although we have focused primarily on documents in information retrieval in the discussion above, the model we propose can in principle be used on any large sparse matrix of count data. For example, transaction data sets where rows are individuals and columns correspond to items purchased or Web sites visited are ideally suited to this approach. The latent topics can capture broad patterns of population behavior and the “special word distributions” can capture the idiosyncracies of speciﬁc individuals. Section 2 reviews the basic principles of the LDA model and introduces the new SWB model. Section 3 illustrates how the model works in practice using examples from New York Times news articles. In Section 4 we describe a number of experiments with 4 different document sets, including perplexity experiments and information retrieval experiments, illustrating the trade-offs between generalization and speciﬁcity for different models. Section 5 contains a brief discussion and concluding comments. 2 A Topic Model for Special Words Figure 1(a) shows the graphical model for what we will refer to as the “standard topic model” or LDA. There are D documents and document d has Nd words. α and β are ﬁxed parameters of symmetric Dirichlet priors for the D document-topic multinomials represented by θ and the T topicword multinomials represented by φ. In the generative model, for each document d, the Nd words (a) β2 β1 α γ Ω ψ θ λ β0 z x φ w (b) α θ β z φ w Nd T T D Nd D Figure 1: Graphical models for (a) the standard LDA topic model (left) and (b) the proposed special words topic model with a background distribution (SWB) (right). are generated by drawing a topic t from the document-topic distribution p(z|θd ) and then drawing a word w from the topic-word distribution p(w|z = t, φt ). As shown in Grifﬁths and Steyvers (2004) the topic assignments z for each word token in the corpus can be efﬁciently sampled via Gibbs sampling (after marginalizing over θ and φ). Point estimates for the θ and φ distributions can be computed conditioned on a particular sample, and predictive distributions can be obtained by averaging over multiple samples. We will refer to the proposed model as the special words topic model with background distribution (SWB) (Figure 1(b)). SWB has a similar general structure to the LDA model (Figure 1(a)) but with additional machinery to handle special words and background words. In particular, associated with each word token is a latent random variable x, taking value x = 0 if the word w is generated via the topic route, value x = 1 if the word is generated as a special word (for that document) and value x = 2 if the word is generated from a background distribution speciﬁc for the corpus. The variable x acts as a switch: if x = 0, the previously described standard topic mechanism is used to generate the word, whereas if x = 1 or x = 2, words are sampled from a document-speciﬁc multinomial Ψ or a corpus speciﬁc multinomial Ω (with symmetric Dirichlet priors parametrized by β1 and β2 ) respectively. x is sampled from a document-speciﬁc multinomial λ, which in turn has a symmetric Dirichlet prior, γ. One could also use a hierarchical Bayesian approach to introduce another level of uncertainty about the Dirichlet priors (e.g., see Blei, Ng, and Jordan, 2003)—we have not investigated this option, primarily for computational reasons. In all our experiments, we set α = 0.1, β0 = β2 = 0.01, β1 = 0.0001 and γ = 0.3—all weak symmetric priors. The conditional probability of a word w given a document d can be written as: T p(w|z = t)p(z = t|d) + p(x = 1|d)p′ (w|d) + p(x = 2|d)p′′ (w) p(w|d) = p(x = 0|d) t=1 ′ where p (w|d) is the special word distribution for document d, and p′′ (w) is the background word distribution for the corpus. Note that when compared to the standard topic model the SWB model can explain words in three different ways, via topics, via a special word distribution, or via a background word distribution. Given the graphical model above, it is relatively straightforward to derive Gibbs sampling equations that allow joint sampling of the zi and xi latent variables for each word token wi , for xi = 0: p (xi = 0, zi = t |w, x−i , z−i , α, β0 , γ ) ∝ Nd0,−i + γ × Nd,−i + 3γ TD Ctd,−i + α t′ TD Ct′ d,−i + T α × WT Cwt,−i + β0 WT w ′ Cw ′ t,−i + W β0 and for xi = 1: p (xi = 1 |w, x−i , z−i , β1 , γ ) ∝ Nd1,−i + γ × Nd,−i + 3γ WD Cwd,−i + β1 w′ WD Cw′ d,−i + W β1 e mail krugman nytimes com memo to critics of the media s liberal bias the pinkos you really should be going after are those business reporters even i was startled by the tone of the jan 21 issue of investment news which describes itself as the weekly newspaper for financial advisers the headline was paul o neill s sweet deal the blurb was irs backs off closing loophole averting tax liability for execs and treasury chief it s not really news that the bush administration likes tax breaks for businessmen but two weeks later i learned from the wall street journal that this loophole is more than a tax break for businessmen it s a gift to biznesmen and it may be part of a larger pattern confused in the former soviet union the term biznesmen pronounced beeznessmen refers to the class of sudden new rich who emerged after the fall of communism and who generally got rich by using their connections to strip away the assets of public enterprises what we ve learned from enron and other players to be named later is that america has its own biznesmen and that we need to watch out for policies that make it easier for them to ply their trade it turns out that the sweet deal investment news was referring to the use of split premium life insurance policies to give executives largely tax free compensation you don t want to know the details is an even sweeter deal for executives of companies that go belly up it shields their wealth from creditors and even from lawsuits sure enough reports the wall street journal former enron c e o s kenneth lay and jeffrey skilling both had large split premium policies so what other pro biznes policies have been promulgated lately last year both houses of … john w snow was paid more than 50 million in salary bonus and stock in his nearly 12 years as chairman of the csx corporation the railroad company during that period the company s profits fell and its stock rose a bit more than half as much as that of the average big company mr snow s compensation amid csx s uneven performance has drawn criticism from union officials and some corporate governance specialists in 2000 for example after the stock had plunged csx decided to reverse a 25 million loan to him the move is likely to get more scrutiny after yesterday s announcement that mr snow has been chosen by president bush to replace paul o neill as the treasury secretary like mr o neill mr snow is an outsider on wall street but an insider in corporate america with long experience running an industrial company some wall street analysts who follow csx said yesterday that mr snow had ably led the company through a difficult period in the railroad industry and would make a good treasury secretary it s an excellent nomination said jill evans an analyst at j p morgan who has a neutral rating on csx stock i think john s a great person for the administration he as the c e o of a railroad has probably touched every sector of the economy union officials are less complimentary of mr snow s performance at csx last year the a f l c i o criticized him and csx for the company s decision to reverse the loan allowing him to return stock he had purchased with the borrowed money at a time when independent directors are in demand a corporate governance specialist said recently that mr snow had more business relationships with members of his own board than any other chief executive in addition mr snow is the third highest paid of 37 chief executives of transportation companies said ric marshall chief executive of the corporate library which provides specialized investment research into corporate boards his own compensation levels have been pretty high mr marshall said he could afford to take a public service job a csx program in 1996 allowed mr snow and other top csx executives to buy… Figure 2: Examples of two news articles with special words (as inferred by the model) shaded in gray. (a) upper, email article with several colloquialisms, (b) lower, article about CSX corporation. and for xi = 2: p (xi = 2 |w, x−i , z−i , β2 , γ ) ∝ Nd2,−i + γ × Nd,−i + 3γ W Cw,−i + β2 W w ′ Cw ′ ,−i + W β2 where the subscript −i indicates that the count for word token i is removed, Nd is the number of words in document d and Nd0 , Nd1 and Nd2 are the number of words in document d assigned to the W W W latent topics, special words and background component, respectively, CwtT , CwdD and Cw are the number of times word w is assigned to topic t, to the special-words distribution of document d, and to the background distribution, respectively, and W is the number of unique words in the corpus. Note that when there is not strong supporting evidence for xi = 0 (i.e., the conditional probability of this event is low), then the probability of the word being generated by the special words route, xi = 1, or background route, xi = 2 increases. One iteration of the Gibbs sampler corresponds to a sampling pass through all word tokens in the corpus. In practice we have found that around 500 iterations are often sufﬁcient for the in-sample perplexity (or log-likelihood) and the topic distributions to stabilize. We also pursued a variant of SWB, the special words (SW) model that excludes the background distribution Ω and has a symmetric Beta prior, γ, on λ (which in SW is a document-speciﬁc Bernoulli distribution). In all our SW model runs, we set γ = 0.5 resulting in a weak symmetric prior that is equivalent to adding one pseudo-word to each document. Experimental results (not shown) indicate that the ﬁnal word-topic assignments are not sensitive to either the value of the prior or the initial assignments to the latent variables, x and z. 3 Illustrative Examples We illustrate the operation of the SW model with a data set consisting of 3104 articles from the New York Times (NYT) with a total of 1,399,488 word tokens. This small set of NYT articles was formed by selecting all NYT articles that mention the word “Enron.” The SW topic model was run with T = 100 topics. In total, 10 Gibbs samples were collected from the model. Figure 2 shows two short fragments of articles from this NYT dataset. The background color of words indicates the probability of assigning words to the special words topic—darker colors are associated with higher probability that over the 10 Gibbs samples a word was assigned to the special topic. The words with gray foreground colors were treated as stopwords and were not included in the analysis. Figure 2(a) shows how intentionally misspelled words such as “biznesmen” and “beeznessmen” and rare Collection NIPS PATENTS AP FR # of Docs 1740 6711 10000 2500 Total # of Word Tokens 2,301,375 15,014,099 2,426,181 6,332,681 Median Doc Length 1310 1858 235.5 516 Mean Doc Length 1322.6 2237.2 242.6 2533.1 # of Queries N/A N/A 142 30 Table 1: General characteristics of document data sets used in experiments. NIPS set number results case problem function values paper approach large PATENTS .0206 .0167 .0153 .0123 .0118 .0108 .0102 .0088 .0080 .0079 fig end extend invent view shown claim side posit form .0647 .0372 .0267 .0246 .0214 .0191 .0189 .0177 .0153 .0128 AP tagnum itag requir includ section determin part inform addit applic FR .0416 .0412 .0381 .0207 .0189 .0134 .0112 .0105 .0096 .0086 nation sai presid polici issu call support need govern effort .0147 .0129 .0118 .0108 .0096 .0094 .0085 .0079 .0070 .0068 Figure 3: Examples of background distributions (10 most likely words) learned by the SWB model for 4 different document corpora. words such as “pinkos” are likely to be assigned to the special words topic. Figure 2(b) shows how a last name such as “Snow” and the corporation name “CSX” that are speciﬁc to the document are likely to be assigned to the special topic. The words “Snow” and “CSX” do not occur often in other documents but are mentioned several times in the example document. This combination of low document-frequency and high term-frequency within the document is one factor that makes these words more likely to be treated as “special” words. 4 Experimental Results: Perplexity and Precision We use 4 different document sets in our experiments, as summarized in Table 1. The NIPS and PATENTS document sets are used for perplexity experiments and the AP and FR data sets for retrieval experiments. The NIPS data set is available online1 and PATENTS, AP, and FR consist of documents from the U.S. Patents collection (TREC Vol-3), Associated Press news articles from 1998 (TREC Vol-2), and articles from the Federal Register (TREC Vol-1, 2) respectively. To create the sampled AP and FR data sets, all documents relevant to queries were included ﬁrst and the rest of the documents were chosen randomly. In the results below all LDA/SWB/SW models were ﬁt using T = 200 topics. Figure 3 demonstrates the background component learned by the SWB model on the 4 different document data sets. The background distributions learned for each set of documents are quite intuitive, with words that are commonly used across a broad range of documents within each corpus. The ratio of words assigned to the special words distribution and the background distribution are (respectively for each data set), 25%:10% (NIPS), 58%:5% (PATENTS), 11%:6% (AP), 50%:11% (FR). Of note is the fact that a much larger fraction of words are treated as special in collections containing long documents (NIPS, PATENTS, and FR) than in short “abstract-like” collections (such as AP)—this makes sense since short documents are more likely to contain general summary information while longer documents will have more speciﬁc details. 4.1 Perplexity Comparisons The NIPS and PATENTS documents sets do not have queries and relevance judgments, but nonetheless are useful for evaluating perplexity. We compare the predictive performance of the SW and SWB topic models with the standard topic model by computing the perplexity of unseen words in test documents. Perplexity of a test set under a model is deﬁned as follows: 1 From http://www.cs.toronto.edu/˜roweis/data.html 1900 550 LDA SW SWB 1800 500 1700 450 LDA SW SWB Perplexity Perplexity 1600 1500 1400 400 350 300 1300 250 1200 200 1100 1000 10 20 30 40 50 60 70 80 150 10 90 20 Percentage of Words Observed 30 40 50 60 70 80 90 Percentage of Words Observed Figure 4: Average perplexity of the two special words models and the standard topics model as a function of the percentage of words observed in test documents on the NIPS data set (left) and the PATENTS data set (right). Perplexity(wtest |Dtrain ) = exp − Dtest d=1 log p(wd |Dtrain ) Dtest d=1 Nd where wtest is a vector of words in the test data set, wd is a vector of words in document d of the test set, and Dtrain is the training set. For the SWB model, we approximate p(wd |Dtrain ) as follows: p(wd |Dtrain ) ≈ 1 S S p(wd |{Θs Φs Ψs Ωs λs }) s=1 where Θs , Φs , Ψs , Ωs and λs are point estimates from s = 1:S different Gibbs sampling runs. The probability of the words wd in a test document d, given its parameters, can be computed as follows: Nd p(wd |{Θs Φs Ψs Ωs λs }) = T s s φs i t θtd + λs ψwi d + λs Ωs i 3d w w 2d λs 1d i=1 t=1 where Nd is the number of words in test document d and wi is the ith word being predicted in the s s test document. θtd , φs i t , ψwi d , Ωs i and λs are point estimates from sample s. w w d When a fraction of words of a test document d is observed, a Gibbs sampler is run on the observed words to update the document-speciﬁc parameters, θd , ψd and λd and these updated parameters are used in the computation of perplexity. For the NIPS data set, documents from the last year of the data set were held out to compute perplexity (Dtest = 150), and for the PATENTS data set 500 documents were randomly selected as test documents. From the perplexity ﬁgures, it can be seen that once a small fraction of the test document words is observed (20% for NIPS and 10% for PATENTS), the SW and SWB models have signiﬁcantly lower perplexity values than LDA indicating that the SW and SWB models are using the special words “route” to better learn predictive models for individual documents. 4.2 Information Retrieval Results Returning to the point of capturing both speciﬁc and general aspects of documents as discussed in the introduction of the paper, we generated 500 queries of length 3-5 using randomly selected lowfrequency words from the NIPS corpus and then ranked documents relative to these queries using several different methods. Table 2 shows for the top k-ranked documents (k = 1, 10, 50, 100) how many of the retrieved documents contained at least one of the words in the query. Note that we are not assessing relevance here in a traditional information retrieval sense, but instead are assessing how Method TF-IDF LSI LDA SW SWB 1 Ret Doc 100.0 97.6 90.0 99.2 99.4 10 Ret Docs 100.0 82.7 80.6 97.1 96.6 50 Ret Docs 100.0 64.6 67.0 79.1 78.7 100 Ret Docs 100.0 54.3 58.7 67.3 67.2 Table 2: Percentage of retrieved documents containing at least one query word (NIPS corpus). AP MAP Method TF-IDF LSI LDA SW SWB Title .353 .286 .424 .466* .460* Pr@10d Desc .358 .387 .394 .430* .417 Concepts .498 .459 .498 .550* .549* Method TF-IDF LSI LDA SW SWB Title .406 .455 .478 .524* .513* Method TF-IDF LSI LDA SW SWB Title .300 .366 .428 .469 .462 Desc .434 .469 .463 .509* .495 Concepts .549 .523 .556 .599* .603* FR MAP Method TF-IDF LSI LDA SW SWB Title .268 .329 .344 .371 .373 Pr@10d Desc .272 .295 .271 .323* .328* Concepts .391 .399 .396 .448* .435 Desc .287 .327 .340 .407* .423* Concepts .483 .487 .487 .550* .523 *=sig difference wrt LDA Figure 5: Information retrieval experimental results. often speciﬁc query words occur in retrieved documents. TF-IDF has 100% matches, as one would expect, and the techniques that generalize (such as LSI and LDA) have far fewer exact matches. The SWB and SW models have more speciﬁc matches than either LDA or LSI, indicating that they have the ability to match at the level of speciﬁc words. Of course this is not of much utility unless the SWB and SW models can also perform well in terms of retrieving relevant documents (not just documents containing the query words), which we investigate next. For the AP and FR documents sets, 3 types of query sets were constructed from TREC Topics 1150, based on the T itle (short), Desc (sentence-length) and Concepts (long list of keywords) ﬁelds. Queries that have no relevance judgments for a collection were removed from the query set for that collection. The score for a document d relative to a query q for the SW and standard topic models can be computed as the probability of q given d (known as the query-likelihood model in the IR community). For the SWB topic model, we have T p(w|z = t)p(z = t|d) + p(x = 1|d)p′ (w|d) + p(x = 2|d)p′′ (w)] [p(x = 0|d) p(q|d) ≈ w∈q t=1 We compare SW and SWB models with the standard topic model (LDA), LSI and TF-IDF. The TFCW D D wd IDF score for a word w in a document d is computed as TF-IDF(w, d) = Nd × log2 Dw . For LSI, the TF-IDF weight matrix is reduced to a K-dimensional latent space using SVD, K = 200. A given query is ﬁrst mapped into the LSI latent space or the TF-IDF space (known as query folding), and documents are scored based on their cosine distances to the mapped queries. To measure the performance of each algorithm we used 2 metrics that are widely used in IR research: the mean average precision (MAP) and the precision for the top 10 documents retrieved (pr@10d). The main difference between the AP and FR documents is that the latter documents are considerably longer on average and there are fewer queries for the FR data set. Figure 5 summarizes the results, broken down by algorithm, query type, document set, and metric. The maximum score for each query experiment is shown in bold: in all cases (query-type/data set/metric) the SW or SWB model produced the highest scores. To determine statistical signiﬁcance, we performed a t-test at the 0.05 level between the scores of each of the SW and SWB models, and the scores of the LDA model (as LDA has the best scores overall among TF-IDF, LSI and LDA). Differences between SW and SWB are not signiﬁcant. In ﬁgure 5, we use the symbol * to indicate scores where the SW and SWB models showed a statistically signiﬁcant difference (always an improvement) relative to the LDA model. The differences for the “non-starred” query and metric scores of SW and SWB are not statistically signiﬁcant but nonetheless always favor SW and SWB over LDA. 5 Discussion and Conclusions Wei and Croft (2006) have recently proposed an ad hoc LDA approach that models p(q|d) as a weighted combination of a multinomial over the entire corpus (the background model), a multinomial over the document, and an LDA model. Wei and Croft showed that this combination provides excellent retrieval performance compared to other state-of-the-art IR methods. In a number of experiments (not shown) comparing the SWB and ad hoc LDA models we found that the two techniques produced comparable precision performance, with small but systematic performance gains being achieved by an ad hoc combination where the standard LDA model in ad hoc LDA was replaced with the SWB model. An interesting direction for future work is to investigate fully generative models that can achieve the performance of ad hoc approaches. In conclusion, we have proposed a new probabilistic model that accounts for both general and speciﬁc aspects of documents or individual behavior. The model extends existing latent variable probabilistic approaches such as LDA by allowing these models to take into account speciﬁc aspects of documents (or individuals) that are exceptions to the broader structure of the data. This allows, for example, documents to be modeled as a mixture of words generated by general topics and words generated in a manner speciﬁc to that document. Experimental results on information retrieval tasks indicate that the SWB topic model does not suffer from the weakness of techniques such as LSI and LDA when faced with very speciﬁc query words, nor does it suffer the limitations of TF-IDF in terms of its ability to generalize. Acknowledgements We thank Tom Grifﬁths for useful initial discussions about the special words model. This material is based upon work supported by the National Science Foundation under grant IIS-0083489. We acknowledge use of the computer clusters supported by NIH grant LM-07443-01 and NSF grant EIA-0321390 to Pierre Baldi and the Institute of Genomics and Bioinformatics. References Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003) Latent Dirichlet allocation, Journal of Machine Learning Research 3: 993-1022. Buntine, W., L¨ fstr¨ m, J., Perttu, S. and Valtonen, K. (2005) Topic-speciﬁc scoring of documents for relevant o o retrieval Workshop on Learning in Web Search: 22nd International Conference on Machine Learning, pp. 34-41. Bonn, Germany. Canny, J. (2004) GaP: a factor model for discrete data. Proceedings of the 27th Annual SIGIR Conference, pp. 122-129. Daum´ III, H., and Marcu, D. (2006) Domain Adaptation for Statistical classiﬁers. Journal of the Artiﬁcial e Intelligence Research, 26: 101-126. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407. Grifﬁths, T. L., and Steyvers, M. (2004) Finding scientiﬁc topics, Proceedings of the National Academy of Sciences, pp. 5228-5235. Hofmann, T. (1999) Probabilistic latent semantic indexing, Proceedings of the 22nd Annual SIGIR Conference, pp. 50-57. Vogt, C. and Cottrell, G. (1999) Fusion via a linear combination of scores. Information Retrieval, 1(3): 151173. Wei, X. and Croft, W.B. (2006) LDA-based document models for ad-hoc retrieval, Proceedings of the 29th SIGIR Conference, pp. 178-185.

same-paper 2 0.79651099 112 nips-2006-Learning Nonparametric Models for Probabilistic Imitation

Author: David B. Grimes, Daniel R. Rashid, Rajesh P. Rao

3 0.79051888 194 nips-2006-Towards a general independent subspace analysis

Author: Fabian J. Theis

Abstract: The increasingly popular independent component analysis (ICA) may only be applied to data following the generative ICA model in order to guarantee algorithmindependent and theoretically valid results. Subspace ICA models generalize the assumption of component independence to independence between groups of components. They are attractive candidates for dimensionality reduction methods, however are currently limited by the assumption of equal group sizes or less general semi-parametric models. By introducing the concept of irreducible independent subspaces or components, we present a generalization to a parameter-free mixture model. Moreover, we relieve the condition of at-most-one-Gaussian by including previous results on non-Gaussian component analysis. After introducing this general model, we discuss joint block diagonalization with unknown block sizes, on which we base a simple extension of JADE to algorithmically perform the subspace analysis. Simulations conﬁrm the feasibility of the algorithm. 1 Independent subspace analysis A random vector Y is called an independent component of the random vector X, if there exists an invertible matrix A and a decomposition X = A(Y, Z) such that Y and Z are stochastically independent. The goal of a general independent subspace analysis (ISA) or multidimensional independent component analysis is the decomposition of an arbitrary random vector X into independent components. If X is to be decomposed into one-dimensional components, this coincides with ordinary independent component analysis (ICA). Similarly, if the independent components are required to be of the same dimension k, then this is denoted by multidimensional ICA of ﬁxed group size k or simply k-ISA. So 1-ISA is equivalent to ICA. 1.1 Why extend ICA? An important structural aspect in the search for decompositions is the knowledge of the number of solutions i.e. the indeterminacies of the problem. Without it, the result of any ICA or ISA algorithm cannot be compared with other solutions, so for instance blind source separation (BSS) would be impossible. Clearly, given an ISA solution, invertible transforms in each component (scaling matrices L) as well as permutations of components of the same dimension (permutation matrices P) give again an ISA of X. And indeed, in the special case of ICA, scaling and permutation are already all indeterminacies given that at most one Gaussian is contained in X [6]. This is one of the key theoretical results in ICA, allowing the usage of ICA for solving BSS problems and hence stimulating many applications. It has been shown that also for k-ISA, scalings and permutations as above are the only indeterminacies [11], given some additional rather weak restrictions to the model. However, a serious drawback of k-ISA (and hence of ICA) lies in the fact that the requirement ﬁxed group-size k does not allow us to apply this analysis to an arbitrary random vector. Indeed, crosstalking error 4 3 2 1 0 FastICA JADE Extended Infomax Figure 1: Applying ICA to a random vector X = AS that does not fulﬁll the ICA model; here S is chosen to consist of a two-dimensional and a one-dimensional irreducible component. Shown are the statistics over 100 runs of the Amari error of the random original and the reconstructed mixing matrix using the three ICA-algorithms FastICA, JADE and Extended Infomax. Clearly, the original mixing matrix could not be reconstructed in any of the experiments. However, interestingly, the latter two algorithms do indeed ﬁnd an ISA up to permutation, which will be explained in section 3. theoretically speaking, it may only be applied to random vectors following the k-ISA blind source separation model, which means that they have to be mixtures of a random vector that consists of independent groups of size k. If this is the case, uniqueness up to permutation and scaling holds as noted above; however if k-ISA is applied to any random vector, a decomposition into groups that are only ‘as independent as possible’ cannot be unique and depends on the contrast and the algorithm. In the literature, ICA is often applied to ﬁnd representations fulﬁlling the independence condition as well as possible, however care has to be taken; the strong uniqueness result is not valid any more, and the results may depend on the algorithm as illustrated in ﬁgure 1. This work aims at ﬁnding an ISA model that allows applicability to any random vector. After reviewing previous approaches, we will provide such a model together with a corresponding uniqueness result and a preliminary algorithm. 1.2 Previous approaches to ISA for dependent component analysis Generalizations of the ICA model that are to include dependencies of multiple one-dimensional components have been studied for quite some time. ISA in the terminology of multidimensional ICA has ﬁrst been introduced by Cardoso [4] using geometrical motivations. His model as well as the related but independently proposed factorization of multivariate function classes [9] is quite general, however no identiﬁability results were presented, and applicability to an arbitrary random vector was unclear; later, in the special case of equal group sizes (k-ISA) uniqueness results have been extended from the ICA theory [11]. Algorithmic enhancements in this setting have been recently studied by [10]. Moreover, if the observation contain additional structures such as spatial or temporal structures, these may be used for the multidimensional separation [13]. Hyv¨ rinen and Hoyer presented a special case of k-ISA by combining it with invariant feature suba space analysis [7]. They model the dependence within a k-tuple explicitly and are therefore able to propose more efﬁcient algorithms without having to resort to the problematic multidimensional density estimation. A related relaxation of the ICA assumption is given by topographic ICA [8], where dependencies between all components are assumed and modelled along a topographic structure (e.g. a 2-dimensional grid). Bach and Jordan [2] formulate ISA as a component clustering problem, which necessitates a model for inter-cluster independence and intra-cluster dependence. For the latter, they propose to use a tree-structure as employed by their tree dependepent component analysis. Together with inter-cluster independence, this implies a search for a transformation of the mixtures into a forest i.e. a set of disjoint trees. However, the above models are all semi-parametric and hence not fully blind. In the following, no additional structures are necessary for the separation. 1.3 General ISA Deﬁnition 1.1. A random vector S is said to be irreducible if it contains no lower-dimensional independent component. An invertible matrix W is called a (general) independent subspace analysis of X if WX = (S1 , . . . , Sk ) with pairwise independent, irreducible random vectors Si . Note that in this case, the Si are independent components of X. The idea behind this deﬁnition is that in contrast to ICA and k-ISA, we do not ﬁx the size of the groups Si in advance. Of course, some restriction is necessary, otherwise no decomposition would be enforced at all. This restriction is realized by allowing only irreducible components. The advantage of this formulation now is that it can clearly be applied to any random vector, although of course a trivial decomposition might be the result in the case of an irreducible random vector. Obvious indeterminacies of an ISA of X are, as mentioned above, scalings i.e. invertible transformations within each Si and permutation of Si of the same dimension1. These are already all indeterminacies as shown by the following theorem, which extends previous results in the case of ICA [6] and k-ISA [11], where also the additional slight assumptions on square-integrability i.e. on existing covariance have been made. Theorem 1.2. Given a random vector X with existing covariance and no Gaussian independent component, then an ISA of X exists and is unique except for scaling and permutation. Existence holds trivially but uniqueness is not obvious. Due to the limited space, we only give a short sketch of the proof in the following. The uniqueness result can easily be formulated as a subspace extraction problem, and theorem 1.2 follows readily from Lemma 1.3. Let S = (S1 , . . . , Sk ) be a square-integrable decomposition of S into irreducible independent components Si . If X is an irreducible component of S, then X ∼ Si for some i. Here the equivalence relation ∼ denotes equality except for an invertible transformation. The following two lemmata each give a simpliﬁcation of lemma 1.3 by ordering the components Si according to their dimensions. Some care has to be taken when showing that lemma 1.5 implies lemma 1.4. Lemma 1.4. Let S and X be deﬁned as in lemma 1.3. In addition assume that dim Si = dim X for i ≤ l and dim Si < dim X for i > l. Then X ∼ Si for some i ≤ l. Lemma 1.5. Let S and X be deﬁned as in lemma 1.4, and let l = 1 and k = 2. Then X ∼ S1 . In order to prove lemma 1.5 (and hence the theorem), it is sufﬁcient to show the following lemma: Lemma 1.6. Let S = (S1 , S2 ) with S1 irreducible and m := dim S1 > dim S2 =: n. If X = AS is again irreducible for some m × (m + n)-matrix A, then (i) the left m × m-submatrix of A is invertible, and (ii) if X is an independent component of S, the right m × n-submatrix of A vanishes. (i) follows after some linear algebra, and is necessary to show the more difﬁcult part (ii). For this, we follow the ideas presented in [12] using factorization of the joint characteristic function of S. 1.4 Dealing with Gaussians In the previous section, Gaussians had to be excluded (or at most one was allowed) in order to avoid additional indeterminacies. Indeed, any orthogonal transformation of two decorrelated hence independent Gaussians is again independent, so clearly such a strong identiﬁcation result would not be possible. Recently, a general decomposition model dealing with Gaussians was proposed in the form of the socalled non-Gaussian subspace analysis (NGSA) [3]. It tries to detect a whole non-Gaussian subspace within the data, and no assumption of independence within the subspace is made. More precisely, given a random vector X, a factorization X = AS with an invertible matrix A, S = (SN , SG ) and SN a square-integrable m-dimensional random vector is called an m-decomposition of X if SN and SG are stochastically independent and SG is Gaussian. In this case, X is said to be mdecomposable. X is denoted to be minimally n-decomposable if X is not (n − 1)-decomposable. According to our previous notation, SN and SG are independent components of X. It has been shown that the subspaces of such decompositions are unique [12]: 1 Note that scaling here implies a basis change in the component Si , so for example in the case of a twodimensional source component, this might be rotation and sheering. In the example later in ﬁgure 3, these indeterminacies can easily be seen by comparing true and estimated sources. Theorem 1.7 (Uniqueness of NGSA). The mixing matrix A of a minimal decomposition is unique except for transformations in each of the two subspaces. Moreover, explicit algorithms can be constructed for identifying the subspaces [3]. This result enables us to generalize theorem 1.2and to get a general decomposition theorem, which characterizes solutions of ISA. Theorem 1.8 (Existence and Uniqueness of ISA). Given a random vector X with existing covariance, an ISA of X exists and is unique except for permutation of components of the same dimension and invertible transformations within each independent component and within the Gaussian part. Proof. Existence is obvious. Uniqueness follows after ﬁrst applying theorem 1.7 to X and then theorem 1.2 to the non-Gaussian part. 2 Joint block diagonalization with unknown block-sizes Joint diagonalization has become an important tool in ICA-based BSS (used for example in JADE) or in BSS relying on second-order temporal decorrelation. The task of (real) joint diagonalization (JD) of a set of symmetric real n×n matrices M := {M1 , . . . , MK } is to ﬁnd an orthogonal matrix K ˆ ˆ ˆ E such that E⊤ Mk E is diagonal for all k = 1, . . . , K i.e. to minimizef (E) := k=1 E⊤ Mk E − ˆ ⊤ Mk E) 2 with respect to the orthogonal matrix E, where diagM(M) produces a matrix ˆ ˆ diagM(E F where all off-diagonal elements of M have been set to zero, and M 2 := tr(MM⊤ ) denotes the F squared Frobenius norm. The Frobenius norm is invariant under conjugation by an orthogonal maˆ ˆ⊤ ˆ 2 trix, so minimizing f is equivalent to maximizing g(E) := K k=1 diag(E Mk E) , where now diag(M) := (mii )i denotes the diagonal of M. For the actual minimization of f respectively maximization of g, we will use the common approach of Jacobi-like optimization by iterative applications of Givens rotation in two coordinates [5]. 2.1 Generalization to blocks In the following we will use a generalization of JD in order to solve ISA problems. Instead of fully diagonalizing all n × n matrices Mk ∈ M, in joint block diagonalization (JBD) of M we want to determine E such that E⊤ Mk E is block-diagonal. Depending on the application, we ﬁx the blockstructure in advance or try to determine it from M. We are not interested in the order of the blocks, so the block-structure is uniquely speciﬁed by ﬁxing a partition of n i.e. a way of writing n as a sum of positive integers, where the order of the addends is not signiﬁcant. So let2 n = m1 + . . . + mr with m1 ≤ m2 ≤ . . . ≤ mr and set m := (m1 , . . . , mr ) ∈ Nr . An n × n matrix is said to be m-block diagonal if it is of the form   D1 · · · 0  . .  .. .   . . . . 0 · · · Dr with arbitrary mi × mi matrices Di . As generalization of JD in the case of known the block structure, we can formulate the joint mK ˆ ˆ⊤ ˆ block diagonalization (m-JBD) problem as the minimization of f m (E) := k=1 E Mk E − m ˆ⊤ ˆ 2 with respect to the orthogonal matrix E, where diagMm (M) produces a ˆ diagM (E Mk E) F m-block diagonal matrix by setting all other elements of M to zero. In practice due to estimation errors, such E will not exist, so we speak of approximate JBD and imply minimizing some errormeasure on non-block-diagonality. Indeterminacies of any m-JBD are m-scaling i.e. multiplication by an m-block diagonal matrix from the right, and m-permutation deﬁned by a permutation matrix that only swaps blocks of the same size. Finally, we speak of general JBD if we search for a JBD but no block structure is given; instead it is to be determined from the matrix set. For this it is necessary to require a block 2 We do not use the convention from Ferrers graphs of specifying partitions in decreasing order, as a visualization of increasing block-sizes seems to be preferable in our setting. structure of maximal length, otherwise trivial solutions or ‘in-between’ solutions could exist (and obviously contain high indeterminacies). Formally, E is said to be a (general) JBD of M if (E, m) = argmaxm | ∃E:f m (E)=0 |m|. In practice due to errors, a true JBD would always result in the trivial decomposition m = (n), so we deﬁne an approximate general JBD by requiring f m (E) < ǫ for some ﬁxed constant ǫ > 0 instead of f m (E) = 0. 2.2 JBD by JD A few algorithms to actually perform JBD have been proposed, see [1] and references therein. In the following we will simply perform joint diagonalization and then permute the columns of E to achieve block-diagonality — in experiments this turns out to be an efﬁcient solution to JBD [1]. This idea has been formulated in a conjecture [1] essentially claiming that a minimum of the JD cost function f already is a JBD i.e. a minimum of the function f m up to a permutation matrix. Indeed, in the conjecture it is required to use the Jacobi-update algorithm from [5], but this is not necessary, and we can prove the conjecture partially: We want to show that JD implies JBD up to permutation, i.e. if E is a minimum of f , then there exists a permutation P such that f m (EP) = 0 (given existence of a JBD of M). But of course f (EP) = f (E), so we will show why (certain) JBD solutions are minima of f . However, JD might have additional minima. First note that clearly not any JBD minimizes f , only those such that in each block of size mk , f (E) when restricted to the block is maximal over E ∈ O(mk ). We will call such a JBD block-optimal in the following. Theorem 2.1. Any block-optimal JBD of M (zero of f m ) is a local minimum of f . Proof. Let E ∈ O(n) be block-optimal with f m (E) = 0. We have to show that E is a local minimum of f or equivalently a local maximum of the squared diagonal sum g. After substituting each Mk by E⊤ Mk E, we may already assume that Mk is m-block diagonal, so we have to show that E = I is a local maximum of g. Consider the elementary Givens rotation Gij (ǫ) deﬁned for i < j and ǫ√ (−1, 1) as the orthogonal ∈ matrix, where all diagonal elements are 1 except for the two elements 1 − ǫ2 in rows i and j and with all off-diagonal elements equal to 0 except for the two elements ǫ and −ǫ at (i, j) and (j, i), respectively. It can be used to construct local coordinates of the d := n(n − 1)/2-dimensional manifold O(n) at I, simply by ι(ǫ12 , ǫ13 , . . . , ǫn−1,n ) := i < j. If i, j are in the same block of m, then h is locally maximal i.e. negative semi-deﬁnite at 0 in the direction ǫij because of block-optimality. Now assume i and j are from different blocks. After possible permutation, we may assume that j = i + 1 so that each matrix Mk ∈ M has (Mk )ij = (Mk )ji = 0, and ak := (Mk )ii , bk := (Mk )jj . Then Gij (ǫ)⊤ Mk Gij (ǫ) can be easily calculated at coordinates (i, i) to (j, j), and indeed entries on the diagonal other than at indices (i, i) and (j, j) are not changed, so diag(Gij (ǫ)⊤ Mk Gij (ǫ)) 2 2 − diag(Mk ) 2 = 2 = −2ak (ak − bk )ǫ + 2bk (ak − bk )ǫ + 2(ak − bk )2 ǫ4 = −2(a2 + b2 )ǫ2 + 2(ak − bk )2 ǫ4 . k k Hence h(0, . . . , 0, ǫij , 0, . . . , 0) − h(0) = −cǫ2 + dǫ4 with c = 2 K (a2 + b2 ) and d = ij ij k k=1 k K 2 k=1 (ak − bk )2 . Now either c = 0, then also d = 0 and h is constant zero in the direction ǫij . Or, more interestingly, c = 0, then c > 0 and therefore h is negative deﬁnite in the direction ǫij . Altogether we get a negative deﬁnite h at 0 except for ‘trivial directions’, and hence a local maximum at 0. 2.3 Recovering the permutation In order to perform JBD, we therefore only have to ﬁnd a JD E of M. What is left according to the above theorem is to ﬁnd a permutation matrix P such that EP block-diagonalizes M. In the case of known block-order m, we can employ similar techniques as used in [1, 10], which essentially ﬁnd P by some combinatorial optimization. 5 5 5 10 10 10 15 15 15 20 20 20 25 25 25 30 30 30 35 35 40 35 40 40 5 10 15 20 25 30 35 40 5 10 15 20 25 30 35 40 ˆ⊤ (a) (unknown) block diagonal M1 (b) E E w/o recovered permutation 5 10 15 20 25 30 35 40 ˆ⊤ (c) E E Figure 2: Performance of the proposed general JBD algorithm in the case of the (unknown) blockpartition 40 = 1 + 2 + 2 + 3 + 3 + 5 + 6 + 6 + 6 + 6 in the presence of noise with SNR of 5dB. The ˆ product E⊤ E of the inverse of the estimated block diagonalizer and the original one is an m-block diagonal matrix except for permutation within groups of the same sizes as claimed in section 2.2. In the case of unknown block-size, we propose to use the following simple permutation-recovery K algorithm: consider the mean diagonalized matrix D := K −1 k=1 E⊤ Mk E. Due to the assumption that M is m-block-diagonalizable (with unknown m), each E⊤ Mk E and hence also D must be m-block-diagonal except for a permutation P, so it must have the corresponding number of zeros in each column and row. In the approximate JBD case, thresholding with a threshold θ is necessary, whose choice is non-trivial. We propose using algorithm 1 to recover the permutation; we denote its resulting permuted matrix by P(D) when applied to the input D. P(D) is constructed from possibly thresholded D by iteratively permuting columns and rows in order to guarantee that all non-zeros of D are clustered along the diagonal as closely as possible. This recovers the permutation as well as the partition m of n. Algorithm 1: Block-diagonality permutation ﬁnder Input: (n × n)-matrix D Output: block-diagonal matrix P(D) := D′ such that D′ = PDPT for a permutation matrix P D′ ← D for i ← 1 to n do repeat if (j0 ← min{j|j ≥ i and d′ = 0 and d′ = 0}) exists then ij ji if (k0 ← min{k|k > j0 and (d′ = 0 or d′ = 0)}) exists then ik ki swap column j0 of D′ with column k0 swap row j0 of D′ with row k0 until no swap has occurred ; We illustrate the performance of the proposed JBD algorithm as follows: we generate a set of K = 100 m-block-diagonal matrices Dk of dimension 40 × 40 with m = (1, 2, 2, 3, 3, 5, 6, 6, 6, 6). They have been generated in blocks of size m with coefﬁcients chosen randomly uniform from [−1, 1], and symmetrized by Dk ← (Dk + D⊤ )/2. After that, they have been mixed by a random k orthogonal mixing matrix E ∈ O(40), i.e. Mk := EDk E⊤ + N, where N is a noise matrix with independent Gaussian entries such that the resulting signal-to-noise ratio is 5dB. Application of the JBD algorithm from above to {M1 , . . . , MK } with threshold θ = 0.1 correctly recovers the ˆ block sizes, and the estimated block diagonalizer E equals E up to m-scaling and permutation, as illustrated in ﬁgure 2. 3 SJADE — a simple algorithm for general ISA As usual by preprocessing of the observations X by whitening we may assume that Cov(X) = I. The indeterminacies allow scaling transformations in the sources, so without loss of generality let 5.5 6 5 5.5 5 1 5 2 4.5 4 3 5 4.5 4 3 4 2 5 4.5 4 3.5 4 3.5 6 3 1 2.5 0 5 3.5 3 7 3 2.5 8 4 3 2 2.5 2 5 4 2 1 1.5 7 8 9 10 11 12 13 14 2 3 4 5 (a) S2 6 7 8 9 1.5 3 3.5 4 (b) S3 4.5 5 5.5 6 6.5 10 1 0 7 (c) S4 5 9 3 2 0 1 14 3 4 5 6 7 8 9 10 ˆ (e) A−1 A −3 1 −3.5 4.5 2 (d) S5 13 250 0 −4 4 −1 −4.5 12 200 −5 −3 −6 −4 −6.5 11 150 −2 −5.5 3.5 −5 0 3 10 100 2.5 −1 −7 9 50 2 5 −2 4 3 −3 −7.5 2 −4 1.5 −6.5 −6 −5.5 −5 −4.5 −4 −3.5 −3 ˆ ˆ (f) (S1 , S2 ) −2.5 −2 0 −4 −3.5 −3 −2.5 −2 −1.5 −1 −0.5 ˆ (g) histogram of S3 0 8 −4.5 −4 −3.5 −3 −2.5 −2 (h) S4 −1.5 −1 −0.5 −8 −7.5 −7 −6.5 −6 −5.5 −5 −4.5 (i) S5 −4 −3.5 −3 −2.5 1 −5 0 (j) S6 Figure 3: Example application of general ISA for unknown sizes m = (1, 2, 2, 2, 3). Shown are the ˆ scatter plots i.e. densities of the source components and the mixing-separating map A−1 A. also Cov(S) = I. Then I = Cov(X) = A Cov(S)A⊤ = AA⊤ so A is orthogonal. Due to the ISA assumptions, the fourth-order cross cumulants of the sources have to be trivial between different groups, and within the Gaussians. In order to ﬁnd transformations of the mixtures fulﬁlling this property, we follow the idea of the JADE algorithmbut now in the ISA setting. We perform JBD of the (whitened) contracted quadricovariance matrices deﬁned by Cij (X) := E X⊤ Eij XXX⊤ − Eij − E⊤ − tr(Eij )I. Here RX := Cov(X) and Eij is a set of eigen-matrices of Cij , 1 ≤ i, j ≤ n. ij One simple choice is to use n2 matrices Eij with zeros everywhere except 1 at index (i, j). More elaborate choices of eigen-matrices (with only n(n + 1)/2 or even n entries) are possible. The resulting algorithm, subspace-JADE (SJADE) not only performs NGCA by grouping Gaussians as one-dimensional components with trivial Cii ’s, but also automatically ﬁnds the subspace partition m using the general JBD algorithm from section 2.3. 4 Experimental results In a ﬁrst example, we consider a general ISA problem in dimension n = 10 with the unknown partition m = (1, 2, 2, 2, 3). In order to generate 2- and 3-dimensional irreducible random vectors, we decided to follow the nice visual ideas from [10] and to draw samples from a density following a known shape — in our case 2d-letters or 3d-geometrical shapes. The chosen sources densities are shown in ﬁgure 3(a-d). Another 1-dimensional source following a uniform distribution was constructed. Altogether 104 samples were used. The sources S were mixed by a mixing matrix A with coefﬁcients uniformly randomly sampled from [−1, 1] to give mixtures X = AS. The recovered ˆ mixing matrix A was then estimated using the above block-JADE algorithm with unknown block size; we observed that the method is quite sensitive to the choice of the threshold (here θ = 0.015). ˆ Figure 3(e) shows the composed mixing-separating system A−1 A; clearly the matrices are equal except for block permutation and scaling, which experimentally conﬁrms theorem 1.8. The algoˆ rithm found a partition m = (1, 1, 1, 2, 2, 3), so one 2d-source was misinterpreted as two 1d-sources, but by using previous knowledge combination of the correct two 1d-sources yields the original 2dˆ ˆ source. The resulting recovered sources S := A−1 X, ﬁgures 3(f-j), then equal the original sources except for permutation and scaling within the sources — which in the higher-dimensional cases implies transformations such as rotation of the underlying images or shapes. When applying ICA (1-ISA) to the above mixtures, we cannot expect to recover the original sources as explained in ﬁgure 1; however, some algorithms might recover the sources up to permutation. Indeed, SJADE equals JADE with additional permutation recovery because the joint block diagonalization is per- formed using joint diagonalization. This explains why JADE retrieves meaningful components even in this non-ICA setting as observed in [4]. In a second example, we illustrate how the algorithm deals with Gaussian sources i.e. how the subspace JADE also includes NGCA. For this we consider the case n = 5, m = (1, 1, 1, 2) and sources with two Gaussians, one uniform and a 2-dimensional irreducible component as before; 105 samples were drawn. We perform 100 Monte-Carlo simulations with random mixing matrix ˆ A, and apply SJADE with θ = 0.01. The recovered mixing matrix A is compared with A by 3 2 2 2 ˆ −1 A. Indeed, we get nearly taking the ad-hoc measure ι(P) := i=1 j=1 (pij + pji ) for P := A perfect recovery in 99 out of 100 runs, the median of ι(P) is very low with 0.0083. A single run diverges with ι(P ) = 3.48. In order to show that the algorithm really separates the Gaussian part from the other components, we compare the recovered source kurtoses. The median kurtoses are −0.0006 ± 0.02, −0.003 ± 0.3, −1.2 ± 0.3, −1.2 ± 0.2 and −1.6 ± 0.2. The ﬁrst two components have kurtoses close to zero, so they are the two Gaussians, whereas the third component has kurtosis of around −1.2, which equals the kurtosis of a uniform density. This conﬁrms the applicability of the algorithm in the general, noisy ISA setting. 5 Conclusion Previous approaches for independent subspace analysis were restricted either to ﬁxed group sizes or semi-parametric models. In neither case, general applicability to any kind of mixture data set was guaranteed, so blind source separation might fail. In the present contribution we introduce the concept of irreducible independent components and give an identiﬁability result for this general, parameter-free model together with a novel arbitrary-subspace-size algorithm based on joint block diagonalization. As in ICA, the main uniqueness theorem is an asymptotic result (but includes noisy case via NGCA). However in practice in the ﬁnite sample case, due to estimation errors the general joint block diagonality only approximately holds. Our simple solution in this contribution was to choose appropriate thresholds. But this choice is non-trivial, and adaptive methods are to be developed in future works. References [1] K. Abed-Meraim and A. Belouchrani. Algorithms for joint block diagonalization. In Proc. EUSIPCO 2004, pages 209–212, Vienna, Austria, 2004. [2] F.R. Bach and M.I. Jordan. Finding clusters in independent component analysis. In Proc. ICA 2003, pages 891–896, 2003. [3] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny, and K.-R. M¨ ller. In search of non-gaussian u components of a high-dimensional distribution. JMLR, 7:247–282, 2006. [4] J.F. Cardoso. Multidimensional independent component analysis. In Proc. of ICASSP ’98, Seattle, 1998. [5] J.F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl., 17(1):161–164, January 1995. [6] P. Comon. Independent component analysis - a new concept? Signal Processing, 36:287–314, 1994. [7] A. Hyv¨ rinen and P.O. Hoyer. Emergence of phase and shift invariant features by decomposition of a natural images into independent feature subspaces. Neural Computation, 12(7):1705–1720, 2000. [8] A. Hyv¨ rinen, P.O. Hoyer, and M. Inki. Topographic independent component analysis. Neural Computaa tion, 13(7):1525–1558, 2001. [9] J.K. Lin. Factorizing multivariate function classes. In Advances in Neural Information Processing Systems, volume 10, pages 563–569, 1998. [10] B. Poczos and A. L¨ rincz. Independent subspace analysis using k-nearest neighborhood distances. In o Proc. ICANN 2005, volume 3696 of LNCS, pages 163–168, Warsaw, Poland, 2005. Springer. [11] F.J. Theis. Uniqueness of complex and multidimensional independent component analysis. Signal Processing, 84(5):951–956, 2004. [12] F.J. Theis and M. Kawanabe. Uniqueness of non-gaussian subspace analysis. In Proc. ICA 2006, pages 917–925, Charleston, USA, 2006. [13] R. Vollgraf and K. Obermayer. Multi-dimensional ICA to separate correlated sources. In Proc. NIPS 2001, pages 993–1000, 2001.

4 0.60528028 3 nips-2006-A Complexity-Distortion Approach to Joint Pattern Alignment

Author: Andrea Vedaldi, Stefano Soatto

Abstract: Image Congealing (IC) is a non-parametric method for the joint alignment of a collection of images affected by systematic and unwanted deformations. The method attempts to undo the deformations by minimizing a measure of complexity of the image ensemble, such as the averaged per-pixel entropy. This enables alignment without an explicit model of the aligned dataset as required by other methods (e.g. transformed component analysis). While IC is simple and general, it may introduce degenerate solutions when the transformations allow minimizing the complexity of the data by collapsing them to a constant. Such solutions need to be explicitly removed by regularization. In this paper we propose an alternative formulation which solves this regularization issue on a more principled ground. We make the simple observation that alignment should simplify the data while preserving the useful information carried by them. Therefore we trade off ﬁdelity and complexity of the aligned ensemble rather than minimizing the complexity alone. This eliminates the need for an explicit regularization of the transformations, and has a number of other useful properties such as noise suppression. We show the modeling and computational beneﬁts of the approach to the some of the problems on which IC has been demonstrated. 1

5 0.60063958 175 nips-2006-Simplifying Mixture Models through Function Approximation

Author: Kai Zhang, James T. Kwok

Abstract: Finite mixture model is a powerful tool in many statistical learning problems. In this paper, we propose a general, structure-preserving approach to reduce its model complexity, which can bring signiﬁcant computational beneﬁts in many applications. The basic idea is to group the original mixture components into compact clusters, and then minimize an upper bound on the approximation error between the original and simpliﬁed models. By adopting the L2 norm as the distance measure between mixture models, we can derive closed-form solutions that are more robust and reliable than using the KL-based distance measure. Moreover, the complexity of our algorithm is only linear in the sample size and dimensionality. Experiments on density estimation and clustering-based image segmentation demonstrate its outstanding performance in terms of both speed and accuracy.

6 0.60036445 79 nips-2006-Fast Iterative Kernel PCA

7 0.59915423 65 nips-2006-Denoising and Dimension Reduction in Feature Space

8 0.59888774 32 nips-2006-Analysis of Empirical Bayesian Methods for Neuroelectromagnetic Source Localization

9 0.59619313 19 nips-2006-Accelerated Variational Dirichlet Process Mixtures

10 0.59515113 34 nips-2006-Approximate Correspondences in High Dimensions

11 0.59380794 115 nips-2006-Learning annotated hierarchies from relational data

12 0.58896488 165 nips-2006-Real-time adaptive information-theoretic optimization of neurophysiology experiments

13 0.58873272 20 nips-2006-Active learning for misspecified generalized linear models

14 0.58819801 117 nips-2006-Learning on Graph with Laplacian Regularization

15 0.58805317 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields

16 0.58627605 167 nips-2006-Recursive ICA

17 0.58598018 61 nips-2006-Convex Repeated Games and Fenchel Duality

18 0.58482021 184 nips-2006-Stratification Learning: Detecting Mixed Density and Dimensionality in High Dimensional Point Clouds

19 0.58435678 162 nips-2006-Predicting spike times from subthreshold dynamics of a neuron

20 0.58431989 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints