nips nips2002 nips2002-96 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Geoffrey J. Gordon
Abstract: We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)²M approximately decomposes a rectangular matrix X into a simpler representation f(g(A)h(B)). Here A and B are low-rank matrices, while f, g, and h are link functions. (GL)²Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)²M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new.
Reference: text
sentIndex sentText sentNum sentScore
1 We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. [sent-4, score-0.099]
2 Here A and B are low-rank matrices, while f, g, and h are link functions. [sent-6, score-0.283]
3 (GL)²Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. [sent-7, score-0.259]
4 They also include new and interesting special cases, one of which we describe below. [sent-8, score-0.079]
5 This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new. [sent-10, score-0.133]
6 Each column of X represents a training example, and each row represents a measured feature of the examples. [sent-12, score-0.088]
7 It is often reasonable to assume that some of the features are redundant, that is, that there exists a reduced set of l features which contains all or most of the information in X. [sent-13, score-0.097]
8 If the reduced features are linear functions of the original features and the distributions of the elements of X are Gaussian, redundancy means we can write X as the product of two smaller matrices U and V with small sum of squared errors. [sent-14, score-0.311]
9 This factorization is essentially a singular value decomposition: U must span the first l dimensions of the left principal subspace of X, while V^T must span the first l dimensions of the right principal subspace. [sent-15, score-0.189]
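A minimal sketch (illustrative, not from the paper) of this rank-l factorization via NumPy's SVD; splitting the singular values evenly between the two factors is an arbitrary choice:

```python
import numpy as np

def svd_factorize(X, l):
    """Return U (m x l) and V (l x n) minimizing ||X - UV||_F over rank-l factorizations."""
    left, s, right_t = np.linalg.svd(X, full_matrices=False)
    U = left[:, :l] * np.sqrt(s[:l])            # first l left singular directions
    V = np.sqrt(s[:l])[:, None] * right_t[:l]   # first l right singular directions
    return U, V

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
U, V = svd_factorize(X, l=3)
print(np.linalg.norm(X - U @ V))   # the best achievable rank-3 squared-error fit
```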
10 The SVD has a long list of successes in machine learning, including information retrieval applications such as latent semantic analysis [1] and link analysis [2]; pattern recognition applications such as "eigenfaces" [3]; structure from motion algorithms [4]; and data compression tools [5]. [sent-17, score-0.346]
11 The first assumption is the use of the sum of squared errors between X and UV as a loss function. [sent-19, score-0.298]
12 Squared error loss means that predicting 1000 when the answer is 1010 is as bad as saying -7 when the answer is 3. [sent-20, score-0.239]
13 The second assumption is that the reduced features are linear functions of the original features. [sent-21, score-0.135]
14 Instead, X might be a nonlinear function of UV, and U and V might be nonlinear functions of some other matrices A and B. [sent-22, score-0.193]
15 We also propose allowing non-quadratic loss functions for the error (X − X̂) and the parameter matrices A and B. [sent-24, score-0.391]
16 By analogy to generalized linear models [6], we call equation (1) a Generalized² Linear² Model: generalized² because it uses link functions for the parameters A and B as well as the prediction X̂, and linear² because like the SVD it is bilinear. [sent-26, score-0.491]
17 As long as we choose link and loss functions that match each other (see below for the definition of matching link and loss), there will exist efficient algorithms for finding A and B given X, f, g, and h. [sent-27, score-1.016]
18 Because (1) is a generalization of the SVD, (GL)²Ms are drop-in replacements for SVDs in all of the applications mentioned above, with better reconstruction performance when the SVD's error model is inaccurate. [sent-28, score-0.094]
19 2 Matching link and loss functions. Whenever we try to optimize the predictions of a nonlinear model, we need to worry about getting stuck in local minima. [sent-30, score-0.671]
20 On the other hand, if we optimize the same predictions ŷ_i under the logarithmic loss function Σ_i [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)] instead of squared error, our optimization problem is convex. [sent-33, score-0.333]
21 Because the logistic link works with the log loss to produce a convex optimization problem, we say they match each other [7]. [sent-34, score-0.679]
22 Matching link-loss pairs are important because minimizing a convex loss function is usually far easier than minimizing a nonconvex one. [sent-35, score-0.483]
23 We can use any convex function F(z) to generate a matching pair of link and loss functions. [sent-36, score-0.746]
24 The loss function which corresponds to F is (2), where F*(y) is defined so that min_z D_F(z || y) = 0. [sent-37, score-0.239]
25 (F* is the convex dual of F [8], and D_F is the generalized Bregman divergence from z to y [9].) [sent-38, score-0.197]
26 Expression (2) is nonnegative, and it is globally convex in all of the z_i's (and therefore also in θ, since each z_i is a linear function of θ). [sent-39, score-0.122]
27 If we write f for the gradient of F, the derivative of (2) with respect to z_i is f(z_i) − y_i. [sent-40, score-0.1]
28 So, (2) will be zero if and only if y_i = f(z_i) for all i; in other words, using the loss (2) implies that ŷ_i = f(z_i) is our best prediction of y_i, and f is therefore our matching link function. [sent-41, score-0.686]
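As a small numerical illustration (not from the paper), take F(z) = log(1 + e^z): its gradient is the logistic link, its dual F* is the negative binary entropy, and the matching loss, written here in the standard Bregman form F(z) + F*(y) − yz (one common way to realize expression (2)), is convex in z, nonnegative, and zero exactly where f(z) = y:

```python
import numpy as np

def F(z):                         # convex generating function
    return np.logaddexp(0.0, z)   # log(1 + e^z)

def f(z):                         # gradient of F: the matching (logistic) link
    return 1.0 / (1.0 + np.exp(-z))

def F_star(y):                    # convex dual of F, for 0 < y < 1
    return y * np.log(y) + (1.0 - y) * np.log(1.0 - y)

def matching_loss(z, y):          # Bregman form: F(z) + F*(y) - y z
    return F(z) + F_star(y) - y * z

y = 0.3
zs = np.linspace(-6.0, 6.0, 2001)
losses = matching_loss(zs, y)
z_best = zs[np.argmin(losses)]
print(f(z_best))      # ~0.3: the minimizer satisfies f(z) = y
print(losses.min())   # ~0:   the loss vanishes at the minimizer
```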
29 (Also, convex duality is defined even when F, G, and H aren't differentiable. [sent-44, score-0.122]
30 If they are not, replace derivatives by subgradients below.) [sent-45, score-0.103]
31 3 Loss functions for (GL)²Ms. In (GL)²Ms, matching loss functions will be particularly important because we need to deal with three separate nonlinear link functions. [sent-46, score-0.803]
32 We will usually not be able to avoid local minima entirely; instead, the overall loss function will be convex in some groups of parameters if we hold the remaining parameters fixed. [sent-47, score-0.441]
33 We will specify a (GL)²M by picking three link functions and their matching loss functions. [sent-48, score-0.695]
34 We can then combine these individual loss functions into an overall loss function as described in section 4; fitting a (GL)²M will then reduce to minimizing the overall loss function with respect to our parameters. [sent-49, score-0.89]
35 The choice of link functions is where we should inject our domain knowledge about what sort of noise there is in X and what parameter matrices A and B are a priori most likely. [sent-51, score-0.435]
36 Useful link functions include f(x) = x (corresponding to squared error and Gaussian noise), f(x) = log x (unnormalized KL-divergence and Poisson noise), and f(x) = (1 + e^{-x})^{-1} (log-loss and Bernoulli noise). [sent-52, score-0.479]
37 The loss functions themselves are only necessary for the analysis; all of our algorithms need only the link functions and (in some cases) their derivatives. [sent-53, score-0.664]
38 So, we can pick the loss functions and differentiate to get the matching link functions; or, we can pick the link functions directly and not worry about the corresponding loss functions. [sent-54, score-1.329]
39 In order for our analysis to apply, the link functions must be derivatives of some (possibly unknown) convex functions. [sent-55, score-0.577]
40 Our loss functions are D_F, D_G, and D_H, where F : ℝ^{m×n} → ℝ, G : ℝ^{m×l} → ℝ, and H : ℝ^{l×n} → ℝ are convex functions. [sent-56, score-0.432]
41 We will abuse notation and call F, G, and H loss functions as well: F is the prediction loss, and its derivative f is the prediction link; it provides our model of the noise in X. [sent-57, score-0.469]
42 G and H are the parameter losses, and their derivatives g and h are the parameter links; they tell us which values of A and B are a priori most likely. [sent-58, score-0.18]
43 By convention, since F takes an m × n matrix argument, we will say that the input and output to f are also m × n matrices (similarly for g and h). [sent-59, score-0.104]
44 4 The model and its fixed point equations. We will define a (GL)²M by specifying an overall loss function which relates the parameter matrices A and B to the data matrix X. [sent-60, score-0.498]
45 If we write U = g(A) and V = h(B), the (GL)²M loss function is L(U, V) = F(UV) − X ∘ UV + G*(U) + H*(V) (3). Here A ∘ B is the "matrix dot product," often written tr(A^T B). [sent-61, score-0.275]
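A sketch (under illustrative choices, not the paper's only ones) of evaluating the overall loss (3), using the componentwise-exponential F that reappears in section 7 and weak quadratic parameter penalties for G* and H*:

```python
import numpy as np

def gl2m_loss(X, U, V, eps=1e-12):
    """L(U, V) = F(UV) - X o UV + G*(U) + H*(V), where o is the matrix dot product
    tr(A^T B); here F(Z) = sum_ij exp(Z_ij) (so f = exp) and G*, H* are weak ridges."""
    Z = U @ V
    F_of_Z = np.exp(Z).sum()               # F(UV)
    dot = np.sum(X * Z)                    # X o UV = tr(X^T U V)
    G_star = 0.5 * eps * np.sum(U * U)     # weak bias towards small U
    H_star = 0.5 * eps * np.sum(V * V)     # weak bias towards small V
    return F_of_Z - dot + G_star + H_star

rng = np.random.default_rng(1)
X = rng.poisson(3.0, size=(8, 12)).astype(float)
U = rng.normal(scale=0.1, size=(8, 2))
V = rng.normal(scale=0.1, size=(2, 12))
print(gl2m_loss(X, U, V))   # the quantity the fitting algorithms try to decrease
```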
46 The result is a set of fixed-point equations that the optimal parameter settings must satisfy: U^T(X − f(UV)) = B (4) and (X − f(UV))V^T = A (5). To understand these equations, we can consider two special cases. [sent-65, score-0.151]
47 First, if we let G* go to zero (so there is no pressure to keep U and V small), (4) becomes U^T(X − f(UV)) = 0 (6), which tells us that each column of the error matrix must be orthogonal to each column of U. [sent-66, score-0.232]
48 Second, if we set the prediction link to be f(UV) = UV, (6) becomes U^T UV = U^T X, which tells us that for a given U, we must choose V so that UV reconstructs X with the smallest possible sum of squared errors. [sent-67, score-0.464]
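A quick numerical check (illustrative) that the condition U^T UV = U^T X is just the normal equations, i.e. it picks out the least-squares V for a fixed U:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 7))
U = rng.normal(size=(10, 3))

V_normal = np.linalg.solve(U.T @ U, U.T @ X)      # solve U^T U V = U^T X
V_lstsq, *_ = np.linalg.lstsq(U, X, rcond=None)   # direct least-squares fit of UV to X
print(np.allclose(V_normal, V_lstsq))             # True: the two agree
```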
49 5 Algorithms for fitting (GL)²Ms. We could solve equations (4-5) with any of several different algorithms. [sent-68, score-0.086]
50 Or, we could use the generalized gradient descent [9] update rule (with learning rate α), stepping A in the direction (X − f(UV))V^T and B in the direction U^T(X − f(UV)). The advantage of these algorithms is that they are simple to implement and don't require additional assumptions on F, G, and H. [sent-70, score-0.142]
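A sketch of the alternating loop this update rule suggests, under one reasonable reading (an additive step of size α on A and B; the exact notation of the original update is not preserved by the text above). With all three links set to the identity, as below, the loop is plain gradient descent for the squared-error/SVD case; other links such as f = exp can be substituted:

```python
import numpy as np

def fit_gl2m_gradient(X, l, f=lambda Z: Z, g=lambda A: A, h=lambda B: B,
                      alpha=0.02, iters=1000, seed=0):
    """Alternate A <- A + alpha*(X - f(UV))V^T and B <- B + alpha*U^T(X - f(UV)),
    with U = g(A) and V = h(B)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    A = rng.normal(scale=0.1, size=(m, l))
    B = rng.normal(scale=0.1, size=(l, n))
    for _ in range(iters):
        U, V = g(A), h(B)
        E = X - f(U @ V)              # residual under the prediction link
        A = A + alpha * E @ V.T
        B = B + alpha * U.T @ E
    return A, B

rng = np.random.default_rng(3)
X = rng.normal(size=(15, 30))
A, B = fit_gl2m_gradient(X, l=3)
s = np.linalg.svd(X, compute_uv=False)
print(np.linalg.norm(X - A @ B))     # should move toward ...
print(np.sqrt(np.sum(s[3:] ** 2)))   # ... the optimal rank-3 residual
```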
51 Our algorithm is based on Newton's method, and it reduces to well-known algorithms for several special cases of (GL)²Ms. [sent-73, score-0.085]
52 For our Newton algorithm we will need to place some restrictions on the prediction and parameter loss functions. [sent-75, score-0.396]
53 (These restrictions are only necessary for the Newton algorithm; more general loss functions still give valid (GL)²Ms, but require different algorithms.) [sent-76, score-0.372]
54 These definitions fix most of the second derivatives of L(U, V) to be zero, simplifying and speeding up computation. [sent-79, score-0.103]
55 With these restrictions, we can linearize (4) and (5) around our current guess at the parameters, then solve the resulting equations to find updated parameters. [sent-81, score-0.131]
56 Here the m × m diagonal matrix D_j contains the second derivatives of F with respect to the jth column of its argument. [sent-87, score-0.187]
57 Similarly, we can linearize with respect to rows of U to find an analogous equation (8), in which the updated row U_i·^new multiplies (V D_i V^T + G_i) and the right-hand side is built from the row error X_i· − f(U_i· V). [sent-89, score-0.089]
58 We can solve one copy of (7) simultaneously for each column of V, then replace V by V^new. [sent-96, score-0.132]
59 Next we can solve one copy of (8) simultaneously for each row of U, then replace U by U^new. [sent-97, score-0.074]
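The exact right-hand sides of equations (7) and (8) are not fully reproduced above, so the sketch below is generic rather than the paper's exact update: for the exponential prediction link, each column (or row) subproblem is convex, and a damped Newton step on it uses the same ingredients named here, namely the diagonal matrix of second derivatives of F and a small parameter-loss term. The step-halving follows the footnote's advice to shrink the step when the loss fails to decrease:

```python
import numpy as np

def col_objective(U, x, v, eps):
    """Convex column subproblem for the exponential link:
       sum(exp(Uv)) - x . (Uv) + (eps/2)||v||^2."""
    z = U @ v
    return np.exp(z).sum() - x @ z + 0.5 * eps * (v @ v)

def newton_col(U, x, v, eps=1e-3):
    """One damped Newton step on the column subproblem."""
    mu = np.exp(U @ v)                                    # predicted column f(Uv)
    grad = U.T @ (mu - x) + eps * v
    H = U.T @ (mu[:, None] * U) + eps * np.eye(len(v))    # U^T D U plus a weak ridge
    step = np.linalg.solve(H, grad)
    t, f0 = 1.0, col_objective(U, x, v, eps)
    while col_objective(U, x, v - t * step, eps) > f0 and t > 1e-8:
        t *= 0.5                                          # reduce the step if the loss increases
    return v - t * step

def fit_exp_link(X, l, sweeps=20, eps=1e-3, seed=0):
    """Alternating Newton sweeps: one subproblem per column of V, then per row of U."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.normal(scale=0.1, size=(m, l))
    V = rng.normal(scale=0.1, size=(l, n))
    for _ in range(sweeps):
        for j in range(n):
            V[:, j] = newton_col(U, X[:, j], V[:, j], eps)
        for i in range(m):                                # the row problem is symmetric
            U[i, :] = newton_col(V.T, X[i, :], U[i, :], eps)
    return U, V

X = np.random.default_rng(4).poisson(2.0, size=(10, 20)).astype(float)
U, V = fit_exp_link(X, l=3)
print(np.abs(X - np.exp(U @ V)).mean())   # reconstruction error under the exponential link
```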
60 6 Related models. There are many important special cases of (GL)²Ms. [sent-99, score-0.085]
61 We derive two in this section; others include principal components analysis, "sensible" PCA, linear regression, generalized linear models, and the weighted majority algorithm. [sent-100, score-0.182]
62 (GL)²Ms are related to generalized bilinear models; the latter include some of the above special cases, but not ICA, weighted majority, or the example of section 7. [sent-102, score-0.154]
63 Finally, some models such as non-negative matrix factorization [10] and generalized low-rank approximation [11] are cousins of (GL)²Ms: they use a loss function which is convex in either factor with the other fixed but which is not a Bregman divergence. [sent-104, score-0.582]
64 6.1 Independent components analysis. In ICA, we assume that there is a hidden matrix V (the same size as X) of independent random variables, and that X was generated from V by applying a square matrix U. [sent-106, score-0.141]
65 We seek to recover the mixing matrix U and the sources V; in other words, we want to decompose X = UV so that the elements of V are as nearly independent as possible. [sent-107, score-0.112]
66 The infomax algorithm for ICA assumes that the elements of V follow distributions with heavy tails (i. [sent-108, score-0.085]
67 In our notation, the fixed point of the infomax algorithm for ICA is −U^T = tanh(V)X^T (9) (see, e. [sent-112, score-0.1]
68 To reproduce (9), we will let the prediction link f be the identity, and we will let the duals of the parameter loss functions be G*(U) = −ε log det U and H*(V) = ε Σ_ij log cosh V_ij. (Footnote: To guarantee convergence, we can check that (3) decreases and reduce our step size if we encounter problems.) [sent-115, score-0.874]
69 Then equations (4) and (5) become U^T(X − UV) = ε tanh(V) (10) and (X − UV)V^T = −ε U^{-T} (11), since the derivative of log cosh v is tanh v and the derivative of log det U is U^{-T}. [sent-119, score-0.286]
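A quick finite-difference check (illustrative) of the two derivative facts used here, d/dv log cosh v = tanh v and d/dU_ij log det U = (U^{-T})_ij:

```python
import numpy as np

h = 1e-6

v = 0.7
num = (np.log(np.cosh(v + h)) - np.log(np.cosh(v - h))) / (2 * h)
print(num, np.tanh(v))                    # the two agree

rng = np.random.default_rng(5)
U = rng.normal(size=(3, 3)) + 5.0 * np.eye(3)   # keep det(U) safely positive
i, j = 1, 2
dU = np.zeros_like(U)
dU[i, j] = h
num = (np.log(np.linalg.det(U + dU)) - np.log(np.linalg.det(U - dU))) / (2 * h)
print(num, np.linalg.inv(U).T[i, j])      # the two agree
```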
70 6.2 Exponential family PCA. To duplicate exponential family PCA [13], we can set the prediction link f arbitrarily and let the parameter links g and h be large multiples of the identity. [sent-122, score-0.449]
71 7 Example: robot belief states. Figure 1 shows a map of a corridor in the CMU CS building. [sent-124, score-0.574]
72 A robot navigating in this corridor can sense both side walls and compute an accurate estimate of its lateral position. [sent-125, score-0.314]
73 Unless it is near an observable feature such as the lab door near the middle of the corridor, however, it can't accurately resolve its position along the corridor and it can't tell whether it is pointing left or right. [sent-126, score-0.329]
74 In order to plan to achieve a goal in this environment, the robot must maintain a belief state (a probability distribution representing its best information about the unobserved state variables). [sent-127, score-0.339]
75 The map shows the robot's starting belief state: it is at one end of the corridor facing in, but it doesn't know which end. [sent-128, score-0.441]
76 We collected a training set of 400 belief states by driving the robot along the corridor and feeding its sensor readings to a belief tracker [14]. [sent-129, score-0.821]
77 Planning is difficult because belief states are high-dimensional: even in this simple world there are 550 states (275 positions at 10cm intervals along the corridor × 2 orientations), so a belief is a vector in ℝ^550. [sent-132, score-0.742]
78 This regularity can make planning tractable: if we can identify a few features which extract the important information from belief states, we can plan in low-dimensional feature space instead of high-dimensional belief space. [sent-134, score-0.572]
79 We factored the matrix of belief states using feature space ranks l = 3, 4, 5. [sent-135, score-0.346]
80 For the prediction link f(Z) we used exp(Z) (componentwise); this link ensures that the predicted beliefs are positive, and treats errors in small probabilities as proportionally more important than errors in large ones. [sent-136, score-0.745]
81 (The matching loss for f is a Poisson log-likelihood or unnormalized KL-divergence.) [sent-137, score-0.382]
82 For the parameter link h we used 10^12 I, corresponding to H* = 10^{-12} ||V||^2 / 2 (a weak bias towards small V). [sent-138, score-0.316]
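A small illustration (not from the paper) of why this pairing suits probability vectors: under the unnormalized KL-divergence that matches the exponential link, the same absolute error costs far more on a small probability than on a large one, and exp(UV) can never produce a negative belief. (The paper dedicates one feature to a normalizing constant; the sketch simply renormalizes after the fact.)

```python
import numpy as np

def unnormalized_kl(x, xhat):
    """Matching loss for the componentwise exponential link (Poisson log-likelihood up to constants)."""
    return x * np.log(x / xhat) - x + xhat

# the same absolute error (0.001) on a small vs. a large probability entry
print(unnormalized_kl(0.001, 0.002))   # ~3e-4: doubling a rare state's probability is penalized heavily
print(unnormalized_kl(0.500, 0.501))   # ~1e-6: the same absolute slip on a likely state barely matters

# reconstructed beliefs are positive by construction; renormalize so they sum to 1
rng = np.random.default_rng(6)
U = rng.normal(scale=0.5, size=(550, 3))   # 550 states, 3 features (illustrative values)
v = rng.normal(scale=0.5, size=3)          # feature vector for one belief
b_hat = np.exp(U @ v)
b_hat /= b_hat.sum()
print(b_hat.min() > 0, np.isclose(b_hat.sum(), 1.0))   # True True
```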
83 Left panel: overhead map of corridor with initial belief b_1; belief state b_50 (just before the robot finds out which direction it's pointing); belief b_90 (just after finding out). [sent-147, score-1.027]
84 Right panel: reconstruction of b_50 with 3, 4, and 5 features. [sent-148, score-0.15]
85 This loss function fixes the first column of U, representing our knowledge that one feature should be a normalizing constant so that each belief sums to 1. [sent-154, score-0.546]
86 The subgradient of G* is 10^{-12} U + [k, 0], so equation (5) becomes (X − exp(UV))V^T = 10^{-12} U + [k, 0]. Here [k, 0] is a matrix with an arbitrary first column and all other elements 0; this matrix has enough degrees of freedom to compensate for the constraints on U. [sent-155, score-0.17]
87 So, this (GL)²M is a principled and efficient way to decompose a matrix of probability distributions. [sent-157, score-0.122]
88 Figure 1 shows our reconstructions of a representative belief state using l = 3, 4, 5 features (one of which is a normalizing constant that can be discarded for planning). [sent-159, score-0.252]
89 The l = 5 reconstruction is consistently good across all 400 beliefs, while the l = 4 reconstruction has minor artifacts for some beliefs. [sent-160, score-0.222]
90 For comparison, a traditional SVD requires a matrix of rank about 25 to achieve the same mean-squared reconstruction error as our rank-3 decomposition. [sent-162, score-0.15]
91 Examination of the learned U matrix (not shown) for l = 4 reveals that the corridor is mapped into two smooth curves in feature space, one for each orientation. [sent-164, score-0.308]
92 Corresponding states with opposite orientations are mapped into similar feature vectors for one half of the corridor (where the training beliefs were sometimes confused about orientation) but not the other (where there were no training beliefs that indicated any connection between orientations). [sent-165, score-0.572]
93 This self-intersection happens because of local minima in the loss function; with more flexibility (l = 5) the optimizer is able to untangle the curves and avoid self-intersection. [sent-167, score-0.29]
94 Our success in compressing the belief state translates directly into success in planning; see [15] for details. [sent-168, score-0.219]
95 By comparison, traditional SVD on either the beliefs or the log beliefs produces feature sets which are unusable for planning because they do not achieve sufficiently good reconstruction with few enough features. [sent-169, score-0.497]
96 8 Discussion. We have introduced a new general class of nonlinear regression and factor analysis models, presenting a derivation and algorithm, reductions to previously-known special cases, and a practical example. [sent-170, score-0.114]
97 The model is a drop-in replacement for PCA, but can provide much better reconstruction performance in cases where the PCA error model is inaccurate. [sent-171, score-0.131]
98 Future research includes online algorithms for parameter adjustment; extensions for missing data; and exploration of new link functions. [sent-172, score-0.316]
99 This work was supported by AFRL contract F30602-01-C-0219, DARPA's MICA program, and by AFRL contract F30602-98-2-0137, DARPA's CoABS program. [sent-174, score-0.078]
100 A generalization of principal component analysis to the exponential family. [sent-239, score-0.078]
wordName wordTfidf (topN-words)
[('uv', 0.469), ('gl', 0.373), ('link', 0.283), ('loss', 0.239), ('corridor', 0.222), ('belief', 0.219), ('svd', 0.154), ('convex', 0.122), ('beliefs', 0.117), ('newton', 0.111), ('zi', 0.102), ('matching', 0.102), ('reconstruction', 0.094), ('robot', 0.092), ('yi', 0.078), ('pca', 0.078), ('generalized', 0.075), ('derivatives', 0.073), ('functions', 0.071), ('planning', 0.071), ('tanh', 0.067), ('ica', 0.064), ('ut', 0.063), ('prediction', 0.062), ('restrictions', 0.062), ('squared', 0.059), ('column', 0.058), ('matrix', 0.056), ('afrl', 0.056), ('bso', 0.056), ('duals', 0.056), ('linearize', 0.056), ('lrd', 0.056), ('hj', 0.056), ('geoffrey', 0.056), ('minima', 0.051), ('fixed', 0.051), ('hessian', 0.049), ('info', 0.049), ('pea', 0.049), ('matrices', 0.048), ('special', 0.048), ('vj', 0.048), ('principal', 0.047), ('bregman', 0.046), ('orientations', 0.045), ('copy', 0.044), ('eigenfaces', 0.044), ('fitting', 0.044), ('nips', 0.044), ('equations', 0.042), ('roy', 0.041), ('tell', 0.041), ('unnormalized', 0.041), ('worry', 0.041), ('states', 0.041), ('links', 0.04), ('factorization', 0.039), ('contract', 0.039), ('lr', 0.039), ('darpa', 0.039), ('descent', 0.038), ('efficient', 0.038), ('cases', 0.037), ('cosh', 0.037), ('nonnegative', 0.037), ('nonlinear', 0.037), ('vt', 0.036), ('tails', 0.036), ('pointing', 0.036), ('dg', 0.036), ('write', 0.036), ('compression', 0.035), ('derivative', 0.035), ('log', 0.035), ('artifacts', 0.034), ('features', 0.033), ('parameter', 0.033), ('sufficiently', 0.033), ('find', 0.033), ('tells', 0.032), ('df', 0.032), ('dh', 0.031), ('include', 0.031), ('exponential', 0.031), ('reduced', 0.031), ('feature', 0.03), ('ij', 0.03), ('replace', 0.03), ('gi', 0.03), ('components', 0.029), ('gradient', 0.029), ('overall', 0.029), ('regression', 0.029), ('must', 0.028), ('nearly', 0.028), ('semantic', 0.028), ('decompose', 0.028), ('sensor', 0.028), ('check', 0.028)]
simIndex simValue paperId paperTitle
same-paper 1 0.9999997 96 nips-2002-Generalized² Linear² Models
Author: Geoffrey J. Gordon
Abstract: We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)²M approximately decomposes a rectangular matrix X into a simpler representation f(g(A)h(B)). Here A and B are low-rank matrices, while f, g, and h are link functions. (GL)²Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)²M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new.
2 0.34621143 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
Author: Nicholas Roy, Geoffrey J. Gordon
Abstract: Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are intractable for large models. The intractability of these algorithms is due to a great extent to their generating an optimal policy over the entire belief space. However, in real POMDP problems most belief states are unlikely, and there is a structured, low-dimensional manifold of plausible beliefs embedded in the high-dimensional belief space. We introduce a new method for solving large-scale POMDPs by taking advantage of belief space sparsity. We reduce the dimensionality of the belief space by exponential family Principal Components Analysis [1], which allows us to turn the sparse, highdimensional belief space into a compact, low-dimensional representation in terms of learned features of the belief state. We then plan directly on the low-dimensional belief features. By planning in a low-dimensional space, we can find policies for POMDPs that are orders of magnitude larger than can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and also on a mobile robot navigation task.
3 0.12968189 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
Author: Tom Heskes
Abstract: We extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications. 1
4 0.11999314 94 nips-2002-Fractional Belief Propagation
Author: Wim Wiegerinck, Tom Heskes
Abstract: We consider loopy belief propagation for approximate inference in probabilistic graphical models. A limitation of the standard algorithm is that clique marginals are computed as if there were no loops in the graph. To overcome this limitation, we introduce fractional belief propagation. Fractional belief propagation is formulated in terms of a family of approximate free energies, which includes the Bethe free energy and the naive mean-field free as special cases. Using the linear response correction of the clique marginals, the scale parameters can be tuned. Simulation results illustrate the potential merits of the approach.
5 0.11486144 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes
Author: Anat Levin, Assaf Zomet, Yair Weiss
Abstract: Certain simple images are known to trigger a percept of transparency: the input image I is perceived as the sum of two images I(x, y) = I1 (x, y) + I2 (x, y). This percept is puzzling. First, why do we choose the “more complicated” description with two images rather than the “simpler” explanation I(x, y) = I1 (x, y) + 0 ? Second, given the infinite number of ways to express I as a sum of two images, how do we compute the “best” decomposition ? Here we suggest that transparency is the rational percept of a system that is adapted to the statistics of natural scenes. We present a probabilistic model of images based on the qualitative statistics of derivative filters and “corner detectors” in natural scenes and use this model to find the most probable decomposition of a novel image. The optimization is performed using loopy belief propagation. We show that our model computes perceptually “correct” decompositions on synthetic images and discuss its application to real images. 1
6 0.10285936 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
7 0.098683752 205 nips-2002-Value-Directed Compression of POMDPs
8 0.096666045 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
9 0.095947474 169 nips-2002-Real-Time Particle Filters
10 0.092881441 119 nips-2002-Kernel Dependency Estimation
11 0.085219875 111 nips-2002-Independent Components Analysis through Product Density Estimation
12 0.084302209 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
13 0.080722101 140 nips-2002-Margin Analysis of the LVQ Algorithm
14 0.07902889 77 nips-2002-Effective Dimension and Generalization of Kernel Learning
15 0.068130225 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
16 0.065037318 163 nips-2002-Prediction and Semantic Association
17 0.06481798 85 nips-2002-Fast Kernels for String and Tree Matching
18 0.062927105 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
19 0.062877931 181 nips-2002-Self Supervised Boosting
20 0.06204541 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
topicId topicWeight
[(0, -0.225), (1, -0.061), (2, -0.149), (3, 0.034), (4, -0.021), (5, 0.042), (6, -0.059), (7, 0.221), (8, -0.107), (9, 0.004), (10, -0.183), (11, -0.028), (12, -0.112), (13, -0.087), (14, 0.063), (15, 0.095), (16, 0.065), (17, -0.014), (18, 0.005), (19, 0.064), (20, -0.028), (21, -0.057), (22, 0.07), (23, 0.097), (24, 0.066), (25, -0.011), (26, -0.191), (27, -0.147), (28, 0.06), (29, -0.037), (30, 0.087), (31, 0.172), (32, -0.039), (33, 0.065), (34, -0.086), (35, -0.101), (36, 0.035), (37, -0.038), (38, -0.12), (39, 0.06), (40, 0.191), (41, 0.028), (42, 0.135), (43, -0.092), (44, 0.092), (45, -0.02), (46, 0.029), (47, 0.058), (48, 0.01), (49, 0.111)]
simIndex simValue paperId paperTitle
same-paper 1 0.96179783 96 nips-2002-Generalized² Linear² Models
Author: Geoffrey J. Gordon
Abstract: We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)²M approximately decomposes a rectangular matrix X into a simpler representation f(g(A)h(B)). Here A and B are low-rank matrices, while f, g, and h are link functions. (GL)²Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)²M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new.
2 0.81493211 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
Author: Nicholas Roy, Geoffrey J. Gordon
Abstract: Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are intractable for large models. The intractability of these algorithms is due to a great extent to their generating an optimal policy over the entire belief space. However, in real POMDP problems most belief states are unlikely, and there is a structured, low-dimensional manifold of plausible beliefs embedded in the high-dimensional belief space. We introduce a new method for solving large-scale POMDPs by taking advantage of belief space sparsity. We reduce the dimensionality of the belief space by exponential family Principal Components Analysis [1], which allows us to turn the sparse, highdimensional belief space into a compact, low-dimensional representation in terms of learned features of the belief state. We then plan directly on the low-dimensional belief features. By planning in a low-dimensional space, we can find policies for POMDPs that are orders of magnitude larger than can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and also on a mobile robot navigation task.
3 0.63636076 205 nips-2002-Value-Directed Compression of POMDPs
Author: Pascal Poupart, Craig Boutilier
Abstract: We examine the problem of generating state-space compressions of POMDPs in a way that minimally impacts decision quality. We analyze the impact of compressions on decision quality, observing that compressions that allow accurate policy evaluation (prediction of expected future reward) will not affect decision quality. We derive a set of sufficient conditions that ensure accurate prediction in this respect, illustrate interesting mathematical properties these confer on lossless linear compressions, and use these to derive an iterative procedure for finding good linear lossy compressions. We also elaborate on how structured representations of a POMDP can be used to find such compressions.
4 0.41664591 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes
Author: Anat Levin, Assaf Zomet, Yair Weiss
Abstract: Certain simple images are known to trigger a percept of transparency: the input image I is perceived as the sum of two images I(x, y) = I1 (x, y) + I2 (x, y). This percept is puzzling. First, why do we choose the “more complicated” description with two images rather than the “simpler” explanation I(x, y) = I1 (x, y) + 0 ? Second, given the infinite number of ways to express I as a sum of two images, how do we compute the “best” decomposition ? Here we suggest that transparency is the rational percept of a system that is adapted to the statistics of natural scenes. We present a probabilistic model of images based on the qualitative statistics of derivative filters and “corner detectors” in natural scenes and use this model to find the most probable decomposition of a novel image. The optimization is performed using loopy belief propagation. We show that our model computes perceptually “correct” decompositions on synthetic images and discuss its application to real images. 1
5 0.40003029 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
Author: Tom Heskes
Abstract: We extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications. 1
6 0.39568341 169 nips-2002-Real-Time Particle Filters
7 0.38570386 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
8 0.37963712 94 nips-2002-Fractional Belief Propagation
9 0.33952317 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
10 0.33632943 115 nips-2002-Informed Projections
11 0.33001918 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems
12 0.32648239 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum
13 0.32257375 119 nips-2002-Kernel Dependency Estimation
14 0.31582159 111 nips-2002-Independent Components Analysis through Product Density Estimation
15 0.31211099 150 nips-2002-Multiple Cause Vector Quantization
16 0.31094381 140 nips-2002-Margin Analysis of the LVQ Algorithm
17 0.29863793 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers
18 0.29671559 6 nips-2002-A Formulation for Minimax Probability Machine Regression
19 0.29466763 136 nips-2002-Linear Combinations of Optic Flow Vectors for Estimating Self-Motion - a Real-World Test of a Neural Model
20 0.29393414 178 nips-2002-Robust Novelty Detection with Single-Class MPM
topicId topicWeight
[(11, 0.021), (14, 0.31), (23, 0.022), (42, 0.096), (54, 0.138), (55, 0.047), (57, 0.01), (65, 0.015), (67, 0.02), (68, 0.026), (74, 0.062), (92, 0.05), (98, 0.094)]
simIndex simValue paperId paperTitle
1 0.83970612 150 nips-2002-Multiple Cause Vector Quantization
Author: David A. Ross, Richard S. Zemel
Abstract: We propose a model that can learn parts-based representations of highdimensional data. Our key assumption is that the dimensions of the data can be separated into several disjoint subsets, or factors, which take on values independently of each other. We assume each factor has a small number of discrete states, and model it using a vector quantizer. The selected states of each factor represent the multiple causes of the input. Given a set of training examples, our model learns the association of data dimensions with factors, as well as the states of each VQ. Inference and learning are carried out efficiently via variational algorithms. We present applications of this model to problems in image decomposition, collaborative filtering, and text classification.
same-paper 2 0.83731139 96 nips-2002-Generalized² Linear² Models
Author: Geoffrey J. Gordon
Abstract: We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)²M approximately decomposes a rectangular matrix X into a simpler representation f(g(A)h(B)). Here A and B are low-rank matrices, while f, g, and h are link functions. (GL)²Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)²M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new.
3 0.80341053 83 nips-2002-Extracting Relevant Structures with Side Information
Author: Gal Chechik, Naftali Tishby
Abstract: The problem of extracting the relevant aspects of data, in face of multiple conflicting structures, is inherent to modeling of complex data. Extracting structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. However, such auxiliary variables often contain more information than is actually required due to structures that are irrelevant for the task. In many other cases it is in fact easier to specify what is irrelevant than what is, for the task at hand. Identifying the relevant structures, however, can thus be considerably improved by also minimizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categorization and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information.
4 0.61167037 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
Author: Nicholas Roy, Geoffrey J. Gordon
Abstract: Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are intractable for large models. The intractability of these algorithms is due to a great extent to their generating an optimal policy over the entire belief space. However, in real POMDP problems most belief states are unlikely, and there is a structured, low-dimensional manifold of plausible beliefs embedded in the high-dimensional belief space. We introduce a new method for solving large-scale POMDPs by taking advantage of belief space sparsity. We reduce the dimensionality of the belief space by exponential family Principal Components Analysis [1], which allows us to turn the sparse, highdimensional belief space into a compact, low-dimensional representation in terms of learned features of the belief state. We then plan directly on the low-dimensional belief features. By planning in a low-dimensional space, we can find policies for POMDPs that are orders of magnitude larger than can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and also on a mobile robot navigation task.
5 0.60713524 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
Author: Chakra Chennubhotla, Allan D. Jepson
Abstract: Using a Markov chain perspective of spectral clustering we present an algorithm to automatically find the number of stable clusters in a dataset. The Markov chain’s behaviour is characterized by the spectral properties of the matrix of transition probabilities, from which we derive eigenflows along with their halflives. An eigenflow describes the flow of probability mass due to the Markov chain, and it is characterized by its eigenvalue, or equivalently, by the halflife of its decay as the Markov chain is iterated. A ideal stable cluster is one with zero eigenflow and infinite half-life. The key insight in this paper is that bottlenecks between weakly coupled clusters can be identified by computing the sensitivity of the eigenflow’s halflife to variations in the edge weights. We propose a novel E IGEN C UTS algorithm to perform clustering that removes these identified bottlenecks in an iterative fashion.
6 0.58285457 169 nips-2002-Real-Time Particle Filters
7 0.58111292 190 nips-2002-Stochastic Neighbor Embedding
8 0.58082014 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information
9 0.58069545 27 nips-2002-An Impossibility Theorem for Clustering
10 0.57372957 53 nips-2002-Clustering with the Fisher Score
11 0.57015318 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons
12 0.56858891 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
13 0.56830144 2 nips-2002-A Bilinear Model for Sparse Coding
14 0.56663638 40 nips-2002-Bayesian Models of Inductive Generalization
15 0.56370252 188 nips-2002-Stability-Based Model Selection
16 0.56096715 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
17 0.55903065 46 nips-2002-Boosting Density Estimation
18 0.55719459 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
19 0.55645484 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
20 0.55617762 3 nips-2002-A Convergent Form of Approximate Policy Iteration