nips nips2002 nips2002-96 knowledge-graph by maker-knowledge-mining

96 nips-2002-Generalized² Linear² Models


Source: pdf

Author: Geoffrey J. Gordon

Abstract: We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)2M approximately decomposes a rectangular matrix X into a simpler representation f(g(A)h(B)). Here A and B are low-rank matrices, while f, g, and h are link functions. (GL)2Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)2M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. [sent-4, score-0.099]

2 Here A and B are low-rank matrices, while f, g, and h are link functions. [sent-6, score-0.283]

3 (GL)2Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. [sent-7, score-0.259]

4 They also include new and interesting special cases, one of which we describe below. [sent-8, score-0.079]

5 This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new. [sent-10, score-0.133]

6 Each column of X represents a training example, and each row represents a measured feature of the examples. [sent-12, score-0.088]

7 It is often reasonable to assume that some of the features are redundant, that is, that there exists a reduced set of l features which contains all or most of the information in X. [sent-13, score-0.097]

8 If the reduced features are linear functions of the original features and the distributions of the elements of X are Gaussian, redundancy means we can write X as the product of two smaller matrices U and V with small sum of squared errors. [sent-14, score-0.311]

9 This factorization is essentially a singular value decomposition: U must span the first l dimensions of the left principal subspace of X, while V^T must span the first l dimensions of the right principal subspace. [sent-15, score-0.189]

10 ) The SVD has a long list of successes in machine learning, including information retrieval applications such as latent semantic analysis [1] and link analysis [2]; pattern recognition applications such as "eigenfaces" [3]; structure from motion algorithms [4]; and data compression tools [5]. [sent-17, score-0.346]

11 The first assumption is the use of the sum of squared errors between X and UV as a loss function. [sent-19, score-0.298]

12 Squared error loss means that predicting 1000 when the answer is 1010 is as bad as saying -7 when the answer is 3. [sent-20, score-0.239]

13 The second assumption is that the reduced features are linear functions of the original features. [sent-21, score-0.135]

14 Instead, X might be a nonlinear function of UV, and U and V might be nonlinear functions of some other matrices A and B. [sent-22, score-0.193]

15 We also propose allowing non-quadratic loss functions for the error (X - X̂) and the parameter matrices A and B. [sent-24, score-0.391]

16 By analogy to generalized linear models [6], we call equation (1) a Generalized² Linear² Model: generalized² because it uses link functions for the parameters A and B as well as the prediction X, and linear² because like the SVD it is bilinear. [sent-26, score-0.491]
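
Equation (1) itself is not reproduced in this extract; restating the decomposition from the abstract, together with the shorthand used in section 4 below, the model is

    X \approx \hat{X} = f\big(g(A)\,h(B)\big) = f(UV), \qquad U = g(A), \quad V = h(B),

with A and B the low-rank parameter matrices and f, g, h the three link functions.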

17 As long as we choose link and loss functions that match each other (see below for the definition of matching link and loss), there will exist efficient algorithms for finding A and B given X, f, g, and h. [sent-27, score-1.016]

18 Because (1) is a generalization of the SVD, (GL)2Ms are drop-in replacements for SVDs in all of the applications mentioned above, with better reconstruction performance when the SVD's error model is inaccurate. [sent-28, score-0.094]

19 2 Matching link and loss functions Whenever we try to optimize the predictions of a nonlinear model, we need to worry about getting stuck in local minima. [sent-30, score-0.671]

20 On the other hand, if we optimize the same predictions ŷ_i under the logarithmic loss function -Σ_i [y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i)] instead of squared error, our optimization problem is convex. [sent-33, score-0.333]

21 Because the logistic link works with the log loss to produce a convex optimization problem, we say they match each other [7]. [sent-34, score-0.679]

22 Matching link-loss pairs are important because minimizing a convex loss function is usually far easier than minimizing a non-convex one. [sent-35, score-0.483]

23 We can use any convex function F(z) to generate a matching pair of link and loss functions. [sent-36, score-0.746]

24 The loss function which corresponds to F is (2), where F*(y) is defined so that min_z D_F(z | y) = 0. [sent-37, score-0.239]

25 (F* is the convex dual of F [8], and D_F is the generalized Bregman divergence from z to y [9]. [sent-38, score-0.197]

26 ) Expression (2) is nonnegative, and it is globally convex in all of the z_i (and therefore also in θ, since each z_i is a linear function of θ). [sent-39, score-0.122]

27 If we write f for the gradient of F, the derivative of (2) with respect to z_i is f(z_i) - y_i. [sent-40, score-0.1]

28 So, (2) will be zero if and only if y_i = f(z_i) for all i; in other words, using the loss (2) implies that ŷ_i = f(z_i) is our best prediction of y_i, and f is therefore our matching link function. [sent-41, score-0.686]
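
A quick numerical illustration of this construction (a sketch only: the displayed form of loss (2) is not reproduced in this extract, so the code assumes D_F(z | y) = F(z) - y z + F*(y), which has exactly the properties stated above). Taking F(z) = log(1 + e^z), the gradient f is the logistic link of section 2, and the generated loss equals the log loss up to the constant F*(y):

    import numpy as np

    # Assumed form of loss (2): D_F(z | y) = F(z) - y*z + F*(y).
    # It is nonnegative, zero exactly when y = f(z), and has derivative f(z) - y in z.
    def F(z):      return np.logaddexp(0.0, z)              # F(z) = log(1 + e^z), convex
    def f(z):      return 1.0 / (1.0 + np.exp(-z))          # f = F', the logistic link
    def F_star(y): return y*np.log(y) + (1-y)*np.log(1-y)   # convex dual of F on (0, 1)

    def D_F(z, y): return F(z) - y*z + F_star(y)

    z, y = 0.7, 0.3
    log_loss = -(y*np.log(f(z)) + (1-y)*np.log(1-f(z)))     # the log loss from section 2
    print(np.isclose(D_F(z, y), log_loss + F_star(y)))      # True: same loss up to the constant F*(y)
    eps = 1e-6
    print(np.isclose((D_F(z+eps, y) - D_F(z-eps, y)) / (2*eps), f(z) - y))   # derivative is f(z) - y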

29 (Also, convex duality is defined even when F, G, and H aren't differentiable. [sent-44, score-0.122]

30 If they are not, replace derivatives by subgradients below. [sent-45, score-0.103]

31 ) 3 Loss functions for (GL)2Ms In (GL)2Ms, matching loss functions will be particularly important because we need to deal with three separate nonlinear link functions. [sent-46, score-0.803]

32 We will usually not be able to avoid local minima entirely; instead, the overall loss function will be convex in some groups of parameters if we hold the remaining parameters fixed. [sent-47, score-0.441]

33 We will specify a (GL)2M by picking three link functions and their matching loss functions. [sent-48, score-0.695]

34 We can then combine these individual loss functions into an overall loss function as described in section 4; fitting a (GL)2M will then reduce to minimizing the overall loss function with respect to our parameters. [sent-49, score-0.89]

35 The choice of link functions is where we should inject our domain knowledge about what sort of noise there is in X and what parameter matrices A and B are a priori most likely. [sent-51, score-0.435]

36 Useful link functions include f(x) = x (corresponding to squared error and Gaussian noise), f(x) = log x (unnormalized KL-divergence and Poisson noise), and f(x) = (1 + e^{-x})^{-1} (log-loss and Bernoulli noise). [sent-52, score-0.479]
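
For later reference, a plain transcription of these three links as elementwise maps, together with the derivatives that some of the algorithms below also need (the noise pairings in the comments simply repeat the sentence above; this is not code from the paper):

    import numpy as np

    # (link, derivative of link) pairs, applied elementwise.
    links = {
        "identity": (lambda x: x,                        lambda x: np.ones_like(x)),  # squared error / Gaussian
        "log":      (lambda x: np.log(x),                lambda x: 1.0 / x),          # unnormalized KL / Poisson
        "logistic": (lambda x: 1.0 / (1.0 + np.exp(-x)),
                     lambda x: np.exp(-x) / (1.0 + np.exp(-x))**2),                   # log-loss / Bernoulli
    }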

37 The loss functions themselves are only necessary for the analysis; all of our algorithms need only the link functions and (in some cases) their derivatives. [sent-53, score-0.664]

38 So, we can pick the loss functions and differentiate to get the matching link functions; or, we can pick the link functions directly and not worry about the corresponding loss functions. [sent-54, score-1.329]

39 In order for our analysis to apply, the link functions must be derivatives of some (possibly unknown) convex functions. [sent-55, score-0.577]

40 Our loss functions are D_F, D_G, and D_H, where F : ℝ^{m×n} → ℝ, G : ℝ^{m×l} → ℝ, and H : ℝ^{l×n} → ℝ are convex functions. [sent-56, score-0.432]

41 We will abuse notation and call F, G, and H loss functions as well: F is the prediction loss, and its derivative f is the prediction link; it provides our model of the noise in X. [sent-57, score-0.469]

42 G and H are the parameter losses, and their derivatives g and h are the parameter links; they tell us which values of A and B are a priori most likely. [sent-58, score-0.18]

43 By convention, since F takes an m × n matrix argument, we will say that the input and output to f are also m × n matrices (similarly for g and h). [sent-59, score-0.104]

44 4 The model and its fixed point equations We will define a (GL)2M by specifying an overall loss function which relates the parameter matrices A and B to the data matrix X. [sent-60, score-0.498]

45 If we write U = g(A) and V = h(B), the (GL)2M loss function is L(U, V) = F(UV) - X ∘ UV + G*(U) + H*(V) (3). Here A ∘ B is the "matrix dot product," often written tr(A^T B). [sent-61, score-0.275]
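
To make (3) concrete, a minimal numerical check of the squared-error special case (my own choice of F(Z) = ||Z||^2/2 and quadratic duals G*, H*; not code from the paper): the overall loss differs from the familiar regularized sum of squared errors only by a term that does not depend on U or V.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 8))
    U = rng.normal(size=(6, 2))
    V = rng.normal(size=(2, 8))
    lam = 0.1                                        # strength of the quadratic parameter losses

    def dot(A, B): return np.sum(A * B)              # the "matrix dot product" tr(A^T B)

    F      = lambda Z: 0.5 * np.sum(Z**2)            # prediction loss for squared error
    G_star = lambda U: 0.5 * lam * np.sum(U**2)      # quadratic duals of the parameter losses
    H_star = lambda V: 0.5 * lam * np.sum(V**2)

    L = F(U @ V) - dot(X, U @ V) + G_star(U) + H_star(V)          # equation (3)
    ridge = 0.5 * np.sum((X - U @ V)**2) + G_star(U) + H_star(V)  # regularized sum of squared errors
    print(np.isclose(L, ridge - 0.5 * np.sum(X**2)))              # True: equal up to a U,V-independent constant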

46 The result is a set of fixed-point equations that the optimal parameter settings must satisfy: U^T(X - f(UV)) = B (4) and (X - f(UV))V^T = A (5). To understand these equations, we can consider two special cases. [sent-65, score-0.151]

47 First, if we let G* and H* go to zero (so there is no pressure to keep U and V small), (4) becomes U^T(X - f(UV)) = 0 (6), which tells us that each column of the error matrix must be orthogonal to each column of U. [sent-66, score-0.232]

48 Second, if we set the prediction link to be f(UV) = UV, (6) becomes U^T UV = U^T X, which tells us that for a given U, we must choose V so that UV reconstructs X with the smallest possible sum of squared errors. [sent-67, score-0.464]
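
A small numerical check of this claim (synthetic data; just a sketch): the least-squares V for a given U satisfies the normal equations U^T U V = U^T X exactly.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(6, 8))
    U = rng.normal(size=(6, 3))
    V, *_ = np.linalg.lstsq(U, X, rcond=None)    # the V minimizing the sum of squared errors ||X - UV||^2
    print(np.allclose(U.T @ U @ V, U.T @ X))     # True: equation (6) with the identity prediction link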

49 5 Algorithms for fitting (GL)2Ms We could solve equations (4-5) with any of several different algorithms. [sent-68, score-0.086]

50 Or, we could use the generalized gradient descent [9] update rule (with learning rate α): A ←α (X - f(UV))V^T, B ←α U^T(X - f(UV)). The advantage of these algorithms is that they are simple to implement and don't require additional assumptions on F, G, and H. [sent-70, score-0.142]
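
A minimal sketch of this fitting loop on synthetic data, reading the update as a plain additive step of size alpha in A and B (the generalized gradient descent of [9] may differ in detail), and using identity links so that the behaviour is easy to verify:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(10, 12))
    l = 3                                   # rank of the factorization
    A = 0.01 * rng.normal(size=(10, l))
    B = 0.01 * rng.normal(size=(l, 12))
    g = h = lambda Z: Z                     # identity parameter links
    f = lambda Z: Z                         # identity prediction link: the squared-error / SVD special case
    alpha = 0.01                            # learning rate

    for it in range(2001):
        U, V = g(A), h(B)
        E = X - f(U @ V)                    # error matrix
        A = A + alpha * E @ V.T             # step in the direction (X - f(UV)) V^T
        B = B + alpha * U.T @ E             # step in the direction U^T (X - f(UV))
        if it % 500 == 0:
            print(it, 0.5 * np.sum(E**2))   # reconstruction error; decreases toward the best rank-l fit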

51 Our algorithm is based on Newton's method, and it reduces to well-known algorithms for several special cases of (GL)2Ms. [sent-73, score-0.085]

52 For our Newton algorithm we will need to place some restrictions on the prediction and parameter loss functions. [sent-75, score-0.396]

53 (These restrictions are only necessary for the Newton algorithm; more general loss functions still give valid (GL)2Ms, but require different algorithms. [sent-76, score-0.372]

54 These definitions fix most of the second derivatives of L(U, V) to be zero, simplifying and speeding up computation. [sent-79, score-0.103]

55 With these restrictions, we can linearize (4) and (5) around our current guess at the parameters, then solve the resulting equations to find updated parameters. [sent-81, score-0.131]

56 ... and the m × m diagonal matrix D_j contains the second derivatives of F with respect to the jth column of its argument. [sent-87, score-0.187]

57 Similarly, we can linearize with respect to rows of U to find the corresponding equation (8), of the form U_i^new (V D_i V^T + G_i) = ... [sent-89, score-0.089]

58 We can solve one copy of (7) simultaneously for each column of V, then replace V by V^new. [sent-96, score-0.132]

59 Next we can solve one copy of (8) simultaneously for each row of U, then replace U by U^new. [sent-97, score-0.074]
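
A minimal sketch of this alternation for the simplest setting (identity prediction link and quadratic parameter losses, so the D_j and G_i of (7) and (8) reduce to constant diagonal matrices; the general Newton step with non-trivial second derivatives of F is not reconstructed here). Each half-sweep solves one small linear system per column of V and then per row of U:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(10, 12))
    l, lam = 3, 1e-3                        # rank, and an assumed quadratic parameter-loss strength
    U = rng.normal(size=(10, l))
    V = rng.normal(size=(l, 12))

    for sweep in range(21):
        # One linear system per column of V (here all columns share the same l x l matrix).
        V = np.linalg.solve(U.T @ U + lam * np.eye(l), U.T @ X)
        # Then one linear system per row of U.
        U = np.linalg.solve(V @ V.T + lam * np.eye(l), V @ X.T).T
        if sweep % 5 == 0:
            print(sweep, 0.5 * np.sum((X - U @ V)**2))   # decreases monotonically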

60 6 Related models There are many important special cases of (GL)2Ms. [sent-99, score-0.085]

61 We derive two in this section; others include principal components analysis, "sensible" PCA, linear regression, generalized linear models, and the weighted majority algorithm. [sent-100, score-0.182]

62 ) (GL)2Ms are related to generalized bilinear models; the latter include some of the above special cases, but not ICA, weighted majority, or the example of section 7. [sent-102, score-0.154]

63 Finally, some models such as non-negative matrix factorization [10] and generalized low-rank approximation [11] are cousins of (GL)2Ms: they use a loss function which is convex in either factor with the other fixed but which is not a Bregman divergence. [sent-104, score-0.582]

64 1 Independent components analysis In ICA, we assume that there is a hidden matrix V (the same size as X) of independent random variables, and that X was generated from V by applying a square matrix U. [sent-106, score-0.141]

65 We seek to recover the mixing matrix U and the sources V; in other words, we want to decompose X = UV so that the elements of V are as nearly independent as possible. [sent-107, score-0.112]

66 The infomax algorithm for ICA assumes that the elements of V follow distributions with heavy tails (i.e. [sent-108, score-0.085]

67 In our notation, the fixed point of the infomax algorithm for ICA is U^T = tanh(V)X^T (9) (see, e.g. [sent-112, score-0.1]

68 To reproduce (9), we will let the prediction link f be the identity, and we will let the duals of the parameter loss functions be G*(U) = -ε log det U and H*(V) = ε Σ_ij log cosh V_ij. (To guarantee convergence, we can check that (3) decreases and reduce our step size if we encounter problems.) [sent-115, score-0.874]

69 Then equations (4) and (5) become U^T(X - UV) = ε tanh(V) (10) and (X - UV)V^T = -ε U^{-T} (11), since the derivative of log cosh v is tanh v and the derivative of log det U is U^{-T}. [sent-119, score-0.286]
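
Two quick numerical checks of the derivative facts used here (a sketch; not code from the paper):

    import numpy as np

    eps, v = 1e-6, 0.8
    lhs = (np.log(np.cosh(v + eps)) - np.log(np.cosh(v - eps))) / (2 * eps)
    print(np.isclose(lhs, np.tanh(v)))                       # d/dv log cosh v = tanh v

    rng = np.random.default_rng(4)
    U = rng.normal(size=(3, 3)) + 3 * np.eye(3)              # keep det(U) safely positive
    grad = np.zeros_like(U)
    for i in range(3):
        for j in range(3):
            dU = np.zeros_like(U); dU[i, j] = eps
            grad[i, j] = (np.log(np.linalg.det(U + dU)) - np.log(np.linalg.det(U - dU))) / (2 * eps)
    print(np.allclose(grad, np.linalg.inv(U).T, atol=1e-5))  # d/dU log det U = U^{-T}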

70 Exponential family PCA To duplicate exponential family PCA [13], we can set the prediction link f arbitrarily and let the parameter links g and h be large multiples of the identity. [sent-122, score-0.449]
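
One way to see this reduction (my own derivation, under the quadratic-parameter-loss convention used in the sketches above, not a passage from the paper): if g is c times the identity, we can take

    G(A) = \tfrac{c}{2}\,\|A\|_F^2 ,\qquad
    g(A) = \nabla G(A) = cA ,\qquad
    G^*(U) = \sup_A \Big( U \circ A - \tfrac{c}{2}\,\|A\|_F^2 \Big) = \tfrac{1}{2c}\,\|U\|_F^2 \;\longrightarrow\; 0
    \quad\text{as } c \to \infty ,

and similarly H*(V) → 0, so the overall loss (3) tends to F(UV) - X ∘ UV, which is (up to terms that do not depend on U and V) the exponential-family PCA objective with natural parameter matrix UV.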

71 7 Example: robot belief states Figure 1 shows a map of a corridor in the CMU CS building. [sent-124, score-0.574]

72 A robot navigating in this corridor can sense both side walls and compute an accurate estimate of its lateral position. [sent-125, score-0.314]

73 Unless it is near an observable feature such as the lab door near the middle of the corridor, however, it can't accurately resolve its position along the corridor and it can't tell whether it is pointing left or right. [sent-126, score-0.329]

74 In order to plan to achieve a goal in this environment, the robot must maintain a belief state (a probability distribution representing its best information about the unobserved state variables). [sent-127, score-0.339]

75 The map shows the robot's starting belief state: it is at one end of the corridor facing in, but it doesn't know which end. [sent-128, score-0.441]

76 We collected a training set of 400 belief states by driving the robot along the corridor and feeding its sensor readings to a belief tracker [14]. [sent-129, score-0.821]

77 Planning is difficult because belief states are high-dimensional: even in this simple world there are 550 states (275 positions at 10cm intervals along the corridor × 2 orientations), so a belief is a vector in ℝ^550. [sent-132, score-0.742]

78 This regularity can make planning tractable: if we can identify a few features which extract the important information from belief states, we can plan in low-dimensional feature space instead of high-dimensional belief space. [sent-134, score-0.572]

79 We factored the matrix of belief states using feature space ranks l = 3, 4, 5. [sent-135, score-0.346]

80 For the prediction link f(Z) we used exp(Z) (componentwise); this link ensures that the predicted beliefs are positive, and treats errors in small probabilities as proportionally more important than errors in large ones. [sent-136, score-0.745]

81 (The matching loss for f is a Poisson log-likelihood or unnormalized KL-divergence. [sent-137, score-0.382]

82 ) For the parameter link h we used 10^12 I, corresponding to H* = 10^{-12} ||V||^2 / 2 (a weak bias towards small V). [sent-138, score-0.316]
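
A minimal sketch of this particular (GL)2M on stand-in data (synthetic probability columns rather than the robot's 550 × 400 belief matrix, and a simple accept/reject gradient fit rather than the paper's Newton algorithm; with h = 10^12 I the dual penalty 10^{-12}||V||^2/2 is negligible, so it is dropped, as is the first-column constraint on U described below):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.dirichlet(np.ones(50), size=400).T      # 50 x 400 matrix whose columns are probability distributions
    m, n, l = X.shape[0], X.shape[1], 4

    def loss(U, V):                                 # equation (3) with f = exp and G*, H* ~ 0
        Z = U @ V
        return np.sum(np.exp(Z)) - np.sum(X * Z)

    U = 1e-2 * rng.normal(size=(m, l))
    V = 1e-2 * rng.normal(size=(l, n))
    alpha, cur = 1e-3, loss(U, V)

    for it in range(5000):
        E = X - np.exp(U @ V)                       # error matrix under the exp prediction link
        U_new = U + alpha * E @ V.T
        V_new = V + alpha * U.T @ E
        new = loss(U_new, V_new)
        if new < cur:                               # accept only loss-decreasing steps ...
            U, V, cur = U_new, V_new, new
            alpha *= 1.1
        else:
            alpha *= 0.5                            # ... otherwise shrink the step, per the earlier note about checking that (3) decreases

    print(np.abs(X - np.exp(U @ V)).mean())         # mean reconstruction error of the rank-l fit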

83 Left panel: overhead map of corridor with initial belief b_1; belief state b_50 (just before robot finds out which direction it's pointing); belief b_90 (just after finding out). [sent-147, score-1.027]

84 Right panel: reconstruction of b_50 with 3, 4, and 5 features. [sent-148, score-0.15]

85 This loss function fixes the first column of U, representing our knowledge that one feature should be a normalizing constant so that each belief sums to 1. [sent-154, score-0.546]

86 The subgradient of G* is 10^{-12} U + [k, 0], so equation (5) becomes (X - exp(UV))V^T = 10^{-12} U + [k, 0]. Here [k, 0] is a matrix with an arbitrary first column and all other elements 0; this matrix has enough degrees of freedom to compensate for the constraints on U. [sent-155, score-0.17]

87 So, this (GL)2M is a principled and efficient way to decompose a matrix of probability distributions. [sent-157, score-0.122]

88 Figure 1 shows our reconstructions of a representative belief state using l = 3, 4, 5 features (one of which is a normalizing constant that can be discarded for planning). [sent-159, score-0.252]

89 The l = 5 reconstruction is consistently good across all 400 beliefs, while the l = 4 reconstruction has minor artifacts for some beliefs. [sent-160, score-0.222]

90 For comparison, a traditional SVD requires a matrix of rank about 25 to achieve the same mean-squared reconstruction error as our rank-3 decomposition. [sent-162, score-0.15]

91 Examination of the learned U matrix (not shown) for l = 4 reveals that the corridor is mapped into two smooth curves in feature space, one for each orientation. [sent-164, score-0.308]

92 Corresponding states with opposite orientations are mapped into similar feature vectors for one half of the corridor (where the training beliefs were sometimes confused about orientation) but not the other (where there were no training beliefs that indicated any connection between orientations). [sent-165, score-0.572]

93 This self-intersection happens because of local minima in the loss function; with more flexibility (l = 5) the optimizer is able to untangle the curves and avoid self-intersection. [sent-167, score-0.29]

94 Our success in compressing the belief state translates directly into success in planning; see [15] for details. [sent-168, score-0.219]

95 By comparison, traditional SVD on either the beliefs or the log beliefs produces feature sets which are unusable for planning because they do not achieve sufficiently good reconstruction with few enough features. [sent-169, score-0.497]

96 8 Discussion We have introduced a new general class of nonlinear regression and factor analysis models, presenting a derivation and algorithm, reductions to previously-known special cases, and a practical example. [sent-170, score-0.114]

97 The model is a drop-in replacement for PCA, but can provide much better reconstruction performance in cases where the PCA error model is inaccurate. [sent-171, score-0.131]

98 Future research includes online algorithms for parameter adjustment; extensions for missing data; and exploration of new link functions. [sent-172, score-0.316]

99 This work was supported by AFRL contract F30602-01-C-0219, DARPA's MICA program, and by AFRL contract F30602-98-2-0137, DARPA's CoABS program. [sent-174, score-0.078]

100 A generalization of principal component analysis to the exponential family. [sent-239, score-0.078]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('uv', 0.469), ('gl', 0.373), ('link', 0.283), ('loss', 0.239), ('corridor', 0.222), ('belief', 0.219), ('svd', 0.154), ('convex', 0.122), ('beliefs', 0.117), ('newton', 0.111), ('zi', 0.102), ('matching', 0.102), ('reconstruction', 0.094), ('robot', 0.092), ('yi', 0.078), ('pca', 0.078), ('generalized', 0.075), ('derivatives', 0.073), ('functions', 0.071), ('planning', 0.071), ('tanh', 0.067), ('ica', 0.064), ('ut', 0.063), ('prediction', 0.062), ('restrictions', 0.062), ('squared', 0.059), ('column', 0.058), ('matrix', 0.056), ('afrl', 0.056), ('bso', 0.056), ('duals', 0.056), ('linearize', 0.056), ('lrd', 0.056), ('hj', 0.056), ('geoffrey', 0.056), ('minima', 0.051), ('fixed', 0.051), ('hessian', 0.049), ('info', 0.049), ('pea', 0.049), ('matrices', 0.048), ('special', 0.048), ('vj', 0.048), ('principal', 0.047), ('bregman', 0.046), ('orientations', 0.045), ('copy', 0.044), ('eigenfaces', 0.044), ('fitting', 0.044), ('nips', 0.044), ('equations', 0.042), ('roy', 0.041), ('tell', 0.041), ('unnormalized', 0.041), ('worry', 0.041), ('states', 0.041), ('links', 0.04), ('factorization', 0.039), ('contract', 0.039), ('lr', 0.039), ('darpa', 0.039), ('descent', 0.038), ('efficient', 0.038), ('cases', 0.037), ('cosh', 0.037), ('nonnegative', 0.037), ('nonlinear', 0.037), ('vt', 0.036), ('tails', 0.036), ('pointing', 0.036), ('dg', 0.036), ('write', 0.036), ('compression', 0.035), ('derivative', 0.035), ('log', 0.035), ('artifacts', 0.034), ('features', 0.033), ('parameter', 0.033), ('sufficiently', 0.033), ('find', 0.033), ('tells', 0.032), ('df', 0.032), ('dh', 0.031), ('include', 0.031), ('exponential', 0.031), ('reduced', 0.031), ('feature', 0.03), ('ij', 0.03), ('replace', 0.03), ('gi', 0.03), ('components', 0.029), ('gradient', 0.029), ('overall', 0.029), ('regression', 0.029), ('must', 0.028), ('nearly', 0.028), ('semantic', 0.028), ('decompose', 0.028), ('sensor', 0.028), ('check', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 96 nips-2002-Generalized² Linear² Models

Author: Geoffrey J. Gordon

Abstract: We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)2M approximately decomposes a rectangular matrix X into a simpler representation f(g(A)h(B)). Here A and B are low-rank matrices, while f, g, and h are link functions. (GL)2Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)2M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new. 1

2 0.34621143 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

Author: Nicholas Roy, Geoffrey J. Gordon

Abstract: Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are intractable for large models. The intractability of these algorithms is due to a great extent to their generating an optimal policy over the entire belief space. However, in real POMDP problems most belief states are unlikely, and there is a structured, low-dimensional manifold of plausible beliefs embedded in the high-dimensional belief space. We introduce a new method for solving large-scale POMDPs by taking advantage of belief space sparsity. We reduce the dimensionality of the belief space by exponential family Principal Components Analysis [1], which allows us to turn the sparse, high-dimensional belief space into a compact, low-dimensional representation in terms of learned features of the belief state. We then plan directly on the low-dimensional belief features. By planning in a low-dimensional space, we can find policies for POMDPs that are orders of magnitude larger than can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and also on a mobile robot navigation task.

3 0.12968189 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy

Author: Tom Heskes

Abstract: We extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications. 1

4 0.11999314 94 nips-2002-Fractional Belief Propagation

Author: Wim Wiegerinck, Tom Heskes

Abstract: We consider loopy belief propagation for approximate inference in probabilistic graphical models. A limitation of the standard algorithm is that clique marginals are computed as if there were no loops in the graph. To overcome this limitation, we introduce fractional belief propagation. Fractional belief propagation is formulated in terms of a family of approximate free energies, which includes the Bethe free energy and the naive mean-field free energy as special cases. Using the linear response correction of the clique marginals, the scale parameters can be tuned. Simulation results illustrate the potential merits of the approach.

5 0.11486144 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes

Author: Anat Levin, Assaf Zomet, Yair Weiss

Abstract: Certain simple images are known to trigger a percept of transparency: the input image I is perceived as the sum of two images I(x, y) = I1 (x, y) + I2 (x, y). This percept is puzzling. First, why do we choose the “more complicated” description with two images rather than the “simpler” explanation I(x, y) = I1 (x, y) + 0 ? Second, given the infinite number of ways to express I as a sum of two images, how do we compute the “best” decomposition ? Here we suggest that transparency is the rational percept of a system that is adapted to the statistics of natural scenes. We present a probabilistic model of images based on the qualitative statistics of derivative filters and “corner detectors” in natural scenes and use this model to find the most probable decomposition of a novel image. The optimization is performed using loopy belief propagation. We show that our model computes perceptually “correct” decompositions on synthetic images and discuss its application to real images. 1

6 0.10285936 69 nips-2002-Discriminative Learning for Label Sequences via Boosting

7 0.098683752 205 nips-2002-Value-Directed Compression of POMDPs

8 0.096666045 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines

9 0.095947474 169 nips-2002-Real-Time Particle Filters

10 0.092881441 119 nips-2002-Kernel Dependency Estimation

11 0.085219875 111 nips-2002-Independent Components Analysis through Product Density Estimation

12 0.084302209 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods

13 0.080722101 140 nips-2002-Margin Analysis of the LVQ Algorithm

14 0.07902889 77 nips-2002-Effective Dimension and Generalization of Kernel Learning

15 0.068130225 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine

16 0.065037318 163 nips-2002-Prediction and Semantic Association

17 0.06481798 85 nips-2002-Fast Kernels for String and Tree Matching

18 0.062927105 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation

19 0.062877931 181 nips-2002-Self Supervised Boosting

20 0.06204541 19 nips-2002-Adapting Codes and Embeddings for Polychotomies


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.225), (1, -0.061), (2, -0.149), (3, 0.034), (4, -0.021), (5, 0.042), (6, -0.059), (7, 0.221), (8, -0.107), (9, 0.004), (10, -0.183), (11, -0.028), (12, -0.112), (13, -0.087), (14, 0.063), (15, 0.095), (16, 0.065), (17, -0.014), (18, 0.005), (19, 0.064), (20, -0.028), (21, -0.057), (22, 0.07), (23, 0.097), (24, 0.066), (25, -0.011), (26, -0.191), (27, -0.147), (28, 0.06), (29, -0.037), (30, 0.087), (31, 0.172), (32, -0.039), (33, 0.065), (34, -0.086), (35, -0.101), (36, 0.035), (37, -0.038), (38, -0.12), (39, 0.06), (40, 0.191), (41, 0.028), (42, 0.135), (43, -0.092), (44, 0.092), (45, -0.02), (46, 0.029), (47, 0.058), (48, 0.01), (49, 0.111)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96179783 96 nips-2002-Generalized² Linear² Models

Author: Geoffrey J. Gordon

Abstract: We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)2M approximately decomposes a rectangular matrix X into a simpler representation f(g(A)h(B)). Here A and B are low-rank matrices, while f, g, and h are link functions. (GL)2Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)2M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new. 1

2 0.81493211 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

Author: Nicholas Roy, Geoffrey J. Gordon

Abstract: Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are intractable for large models. The intractability of these algorithms is due to a great extent to their generating an optimal policy over the entire belief space. However, in real POMDP problems most belief states are unlikely, and there is a structured, low-dimensional manifold of plausible beliefs embedded in the high-dimensional belief space. We introduce a new method for solving large-scale POMDPs by taking advantage of belief space sparsity. We reduce the dimensionality of the belief space by exponential family Principal Components Analysis [1], which allows us to turn the sparse, high-dimensional belief space into a compact, low-dimensional representation in terms of learned features of the belief state. We then plan directly on the low-dimensional belief features. By planning in a low-dimensional space, we can find policies for POMDPs that are orders of magnitude larger than can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and also on a mobile robot navigation task.

3 0.63636076 205 nips-2002-Value-Directed Compression of POMDPs

Author: Pascal Poupart, Craig Boutilier

Abstract: We examine the problem of generating state-space compressions of POMDPs in a way that minimally impacts decision quality. We analyze the impact of compressions on decision quality, observing that compressions that allow accurate policy evaluation (prediction of expected future reward) will not affect decision quality. We derive a set of sufficient conditions that ensure accurate prediction in this respect, illustrate interesting mathematical properties these confer on lossless linear compressions, and use these to derive an iterative procedure for finding good linear lossy compressions. We also elaborate on how structured representations of a POMDP can be used to find such compressions.

4 0.41664591 133 nips-2002-Learning to Perceive Transparency from the Statistics of Natural Scenes

Author: Anat Levin, Assaf Zomet, Yair Weiss

Abstract: Certain simple images are known to trigger a percept of transparency: the input image I is perceived as the sum of two images I(x, y) = I1 (x, y) + I2 (x, y). This percept is puzzling. First, why do we choose the “more complicated” description with two images rather than the “simpler” explanation I(x, y) = I1 (x, y) + 0 ? Second, given the infinite number of ways to express I as a sum of two images, how do we compute the “best” decomposition ? Here we suggest that transparency is the rational percept of a system that is adapted to the statistics of natural scenes. We present a probabilistic model of images based on the qualitative statistics of derivative filters and “corner detectors” in natural scenes and use this model to find the most probable decomposition of a novel image. The optimization is performed using loopy belief propagation. We show that our model computes perceptually “correct” decompositions on synthetic images and discuss its application to real images. 1

5 0.40003029 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy

Author: Tom Heskes

Abstract: We extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications. 1

6 0.39568341 169 nips-2002-Real-Time Particle Filters

7 0.38570386 69 nips-2002-Discriminative Learning for Label Sequences via Boosting

8 0.37963712 94 nips-2002-Fractional Belief Propagation

9 0.33952317 179 nips-2002-Scaling of Probability-Based Optimization Algorithms

10 0.33632943 115 nips-2002-Informed Projections

11 0.33001918 65 nips-2002-Derivative Observations in Gaussian Process Models of Dynamic Systems

12 0.32648239 197 nips-2002-The Stability of Kernel Principal Components Analysis and its Relation to the Process Eigenspectrum

13 0.32257375 119 nips-2002-Kernel Dependency Estimation

14 0.31582159 111 nips-2002-Independent Components Analysis through Product Density Estimation

15 0.31211099 150 nips-2002-Multiple Cause Vector Quantization

16 0.31094381 140 nips-2002-Margin Analysis of the LVQ Algorithm

17 0.29863793 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers

18 0.29671559 6 nips-2002-A Formulation for Minimax Probability Machine Regression

19 0.29466763 136 nips-2002-Linear Combinations of Optic Flow Vectors for Estimating Self-Motion - a Real-World Test of a Neural Model

20 0.29393414 178 nips-2002-Robust Novelty Detection with Single-Class MPM


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(11, 0.021), (14, 0.31), (23, 0.022), (42, 0.096), (54, 0.138), (55, 0.047), (57, 0.01), (65, 0.015), (67, 0.02), (68, 0.026), (74, 0.062), (92, 0.05), (98, 0.094)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.83970612 150 nips-2002-Multiple Cause Vector Quantization

Author: David A. Ross, Richard S. Zemel

Abstract: We propose a model that can learn parts-based representations of highdimensional data. Our key assumption is that the dimensions of the data can be separated into several disjoint subsets, or factors, which take on values independently of each other. We assume each factor has a small number of discrete states, and model it using a vector quantizer. The selected states of each factor represent the multiple causes of the input. Given a set of training examples, our model learns the association of data dimensions with factors, as well as the states of each VQ. Inference and learning are carried out efficiently via variational algorithms. We present applications of this model to problems in image decomposition, collaborative filtering, and text classification.

same-paper 2 0.83731139 96 nips-2002-Generalized² Linear² Models

Author: Geoffrey J. Gordon

Abstract: We introduce the Generalized² Linear² Model, a statistical estimator which combines features of nonlinear regression and factor analysis. A (GL)2M approximately decomposes a rectangular matrix X into a simpler representation f(g(A)h(B)). Here A and B are low-rank matrices, while f, g, and h are link functions. (GL)2Ms include many useful models as special cases, including principal components analysis, exponential-family PCA, the infomax formulation of independent components analysis, linear regression, and generalized linear models. They also include new and interesting special cases, one of which we describe below. We also present an iterative procedure which optimizes the parameters of a (GL)2M. This procedure reduces to well-known algorithms for some of the special cases listed above; for other special cases, it is new. 1

3 0.80341053 83 nips-2002-Extracting Relevant Structures with Side Information

Author: Gal Chechik, Naftali Tishby

Abstract: The problem of extracting the relevant aspects of data, in face of multiple conflicting structures, is inherent to modeling of complex data. Extracting structure in one random variable that is relevant for another variable has been principally addressed recently via the information bottleneck method [15]. However, such auxiliary variables often contain more information than is actually required due to structures that are irrelevant for the task. In many other cases it is in fact easier to specify what is irrelevant than what is, for the task at hand. Identifying the relevant structures, however, can thus be considerably improved by also minimizing the information about another, irrelevant, variable. In this paper we give a general formulation of this problem and derive its formal, as well as algorithmic, solution. Its operation is demonstrated in a synthetic example and in two real world problems in the context of text categorization and face images. While the original information bottleneck problem is related to rate distortion theory, with the distortion measure replaced by the relevant information, extracting relevant features while removing irrelevant ones is related to rate distortion with side information.

4 0.61167037 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs

Author: Nicholas Roy, Geoffrey J. Gordon

Abstract: Standard value function approaches to finding policies for Partially Observable Markov Decision Processes (POMDPs) are intractable for large models. The intractability of these algorithms is due to a great extent to their generating an optimal policy over the entire belief space. However, in real POMDP problems most belief states are unlikely, and there is a structured, low-dimensional manifold of plausible beliefs embedded in the high-dimensional belief space. We introduce a new method for solving large-scale POMDPs by taking advantage of belief space sparsity. We reduce the dimensionality of the belief space by exponential family Principal Components Analysis [1], which allows us to turn the sparse, high-dimensional belief space into a compact, low-dimensional representation in terms of learned features of the belief state. We then plan directly on the low-dimensional belief features. By planning in a low-dimensional space, we can find policies for POMDPs that are orders of magnitude larger than can be handled by conventional techniques. We demonstrate the use of this algorithm on a synthetic problem and also on a mobile robot navigation task.

5 0.60713524 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering

Author: Chakra Chennubhotla, Allan D. Jepson

Abstract: Using a Markov chain perspective of spectral clustering we present an algorithm to automatically find the number of stable clusters in a dataset. The Markov chain's behaviour is characterized by the spectral properties of the matrix of transition probabilities, from which we derive eigenflows along with their halflives. An eigenflow describes the flow of probability mass due to the Markov chain, and it is characterized by its eigenvalue, or equivalently, by the halflife of its decay as the Markov chain is iterated. An ideal stable cluster is one with zero eigenflow and infinite half-life. The key insight in this paper is that bottlenecks between weakly coupled clusters can be identified by computing the sensitivity of the eigenflow's halflife to variations in the edge weights. We propose a novel EigenCuts algorithm to perform clustering that removes these identified bottlenecks in an iterative fashion.

6 0.58285457 169 nips-2002-Real-Time Particle Filters

7 0.58111292 190 nips-2002-Stochastic Neighbor Embedding

8 0.58082014 70 nips-2002-Distance Metric Learning with Application to Clustering with Side-Information

9 0.58069545 27 nips-2002-An Impossibility Theorem for Clustering

10 0.57372957 53 nips-2002-Clustering with the Fisher Score

11 0.57015318 28 nips-2002-An Information Theoretic Approach to the Functional Classification of Neurons

12 0.56858891 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture

13 0.56830144 2 nips-2002-A Bilinear Model for Sparse Coding

14 0.56663638 40 nips-2002-Bayesian Models of Inductive Generalization

15 0.56370252 188 nips-2002-Stability-Based Model Selection

16 0.56096715 21 nips-2002-Adaptive Classification by Variational Kalman Filtering

17 0.55903065 46 nips-2002-Boosting Density Estimation

18 0.55719459 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions

19 0.55645484 52 nips-2002-Cluster Kernels for Semi-Supervised Learning

20 0.55617762 3 nips-2002-A Convergent Form of Approximate Policy Iteration