nips nips2007 nips2007-181 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis
Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract An important problem in many fields is the analysis of counts data to extract meaningful latent components. [sent-5, score-0.311]
2 We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. [sent-10, score-0.445]
3 1 Introduction A frequently encountered problem in many fields is the analysis of histogram data to extract meaningful latent factors from it. [sent-12, score-0.311]
4 PLSA allows us to express distributions that underlie such count data as mixtures of latent components. [sent-17, score-0.323]
5 Realistically, it may be expected that the number of latent components in the process underlying any dataset is unrestricted. [sent-21, score-0.292]
6 Any analysis that attempts to find an overcomplete set of a larger number of components encounters the problem of indeterminacy and is liable to result in meaningless or trivial solutions. [sent-25, score-0.365]
7 Sparse coding refers to a representational scheme where, of a set of components that may be combined to compose data, only a small number are combined to represent any particular instance of the data (although the specific set of components may change from instance to instance). [sent-31, score-0.479]
8 In our problem, this translates to permitting the generating process to have an unrestricted number of latent components, but requiring that only a small number of them contribute to the composition of the histogram represented by any data instance. [sent-32, score-0.326]
9 In other words, the latent components must be learned such that the mixture weights with which they are combined to generate any data have low entropy – a set with low entropy implies that only a few mixture weight terms are significant. [sent-33, score-0.934]
10 Firstly, it largely eliminates the problem of indeterminacy, permitting us to learn an unrestricted number of latent components. [sent-35, score-0.341]
11 Secondly, estimation of low entropy mixture weights forces more information on to the latent components, thereby making them more expressive. [sent-36, score-0.513]
12 The basic formulation we use to extract latent components is similar to PLSA. [sent-37, score-0.308]
13 We use an entropic prior to manipulate the entropy of the mixture weights. [sent-38, score-0.467]
14 We use an artificial dataset to illustrate the effects of sparsity on the model. [sent-40, score-0.282]
15 We show through simulations that sparsity can lead to components that are more representative of the true nature of the data compared to conventional maximum likelihood learning. [sent-41, score-0.398]
16 We demonstrate through experiments on images that the latent components learned in this manner are more informative, enabling us to predict unobserved data. [sent-42, score-0.393]
17 Vf n , the f th row entry of Vn , the nth column of V, represents the count of f (or the f th discrete symbol that may be generated by the multinomial) in the nth data set. [sent-48, score-0.464]
18 For example, if the columns of V represent word count vectors for a collection of documents, Vf n would be the count of the f th word of the vocabulary in the nth document in the collection. [sent-49, score-0.518]
19 We model all data as having been generated by a process that is characterized by a set of latent probability distributions that, although not directly observed, combine to compose the distribution of any data set. [sent-50, score-0.368]
20 We represent the probability of drawing f from the z th latent distribution by P (f |z), where z is a latent variable. [sent-51, score-0.466]
21 To generate any data set, the latent distributions P (f |z) are combined in proportions that are specific to that set. [sent-52, score-0.287]
22 We can define the distribution underlying the nth column of V as Pn (f ) = Σz P (f |z)Pn (z), (1) where Pn (f ) represents the probability of drawing f in the nth data set in V, and Pn (z) is the mixing proportion signifying the contribution of P (f |z) towards Pn (f ). [sent-54, score-0.282]
23 Equation 1 is functionally identical to that used for Probabilistic Latent Semantic Analysis of text data [6]: if the columns Vn of V represent word count vectors for documents, P (f |z) represents the z th latent topic in the documents. [sent-55, score-0.545]
24 For example, if each column of V represents one of a collection of images (each of which has been unraveled into a column vector), the P (f |z)’s would represent the latent “bases” that compose all images in the collection. [sent-57, score-0.658]
25 In maintaining this latter analogy, we will henceforth refer to P (f |z) as the basis distributions for the process. [sent-58, score-0.257]
26 The distributions Pn (f ) and basis distributions P (f |z) are also F -dimensional vectors in the same simplex. [sent-61, score-0.409]
27 The model expresses Pn (f ) as points within the convex hull formed by the basis distributions P (f |z). [sent-62, score-0.357]
28 The model approximates data distributions as points lying within the convex hull formed by the basis distributions. [sent-69, score-0.357]
29 2 Latent Variable Model as Matrix Factorization We can write the model given by equation (1) in matrix form as pn = Wgn , where pn is a column vector indicating Pn (f ), gn is a column vector indicating Pn (z), and W is a matrix with the (f, z)th element corresponding to P (f |z). [sent-81, score-1.185]
30 If we characterize V by R basis distributions, W is an F × R matrix. [sent-82, score-0.224]
31 Concatenating all column vectors pn and gn as matrices P and G respectively, one can write the model as P = WG, where G is an R × N matrix. [sent-83, score-0.7]
32 In other words, the model of Equation (1) actually represents the decomposition V ≈ WGD = WH (4), where D is an N × N diagonal matrix whose nth diagonal element is the total number of counts in Vn and H = GD. [sent-85, score-0.237]
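To make the matrix view concrete, the following is a minimal sketch of the factorization V ≈ WGD = WH on synthetic counts; the dimensions, the Poisson draw, and the scale of 50 are all made-up illustrations, not the authors' setup.

```python
import numpy as np

# Minimal sketch of V ~ W G D = W H with made-up dimensions (not the authors' code).
F, R, N = 100, 20, 500

rng = np.random.default_rng(0)
W = rng.random((F, R)); W /= W.sum(axis=0)   # columns: basis distributions P(f|z)
G = rng.random((R, N)); G /= G.sum(axis=0)   # columns: mixture weights Pn(z)

P = W @ G                                    # column n is Pn(f), a point in the simplex
V = rng.poisson(50 * P)                      # synthetic counts drawn around P
D = np.diag(V.sum(axis=0))                   # nth diagonal entry: total counts in Vn
H = G @ D
print(np.allclose((W @ H).sum(axis=0), V.sum(axis=0)))  # column totals match exactly
```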
33 If R, the number of basis distributions, is equal to F , then a trivial solution exists that achieves perfect decomposition: W = I; H = V, where I is the identity matrix (although the algorithm may not always arrive at this solution). [sent-89, score-0.23]
34 A data point (the point marked ‘+’ in Figure 2) can be accurately represented by any set of three or more bases that form an enclosing polygon, and there are many such polygons. [sent-93, score-0.460]
35 However, if we require the number of bases used to enclose ‘+’ to be minimal, only the 7 enclosing triangles shown remain as valid solutions. [sent-94, score-0.495]
36 By further requiring that the entropy of the mixture weights with which the bases (corners) are combined to represent ‘+’ be minimal, only one triangle is obtained as the unique optimal enclosure. [sent-95, score-0.836]
37 For overcomplete decompositions where R > F , the solution becomes indeterminate – multiple perfect decompositions are possible. [sent-97, score-0.22]
38 The indeterminacy of the overcomplete decomposition can, however, be greatly reduced by imposing a restriction that the approximation for any Vn must employ the minimum number of basis distributions required. [sent-98, score-0.573]
39 By further imposing the constraint that the entropy of gn must be minimized, the indeterminacy of the solution can often be eliminated as illustrated by Figure 2. [sent-99, score-0.334]
40 This principle, which is related to the concept of sparse coding [5], is what we will use to derive overcomplete sets of basis distributions for the data. [sent-100, score-0.486]
41 3 Sparsity in the Latent Variable Model Sparse coding refers to a representational scheme where, of a set of components that may be combined to compose data, only a small number are combined to represent any particular input. [sent-101, score-0.392]
42 In the context of basis decompositions, the goal of sparse coding is to find a set of bases for any data set such that the mixture weights with which the bases are combined to compose any data are sparse. [sent-102, score-1.365]
43 Different metrics have been used to quantify the sparsity of the mixture weights in the literature. [sent-103, score-0.482]
44 Some approaches use such metrics directly (e.g. [7]), while other approaches minimize various approximations of the entropy of the mixture weights. [sent-105, score-0.262]
45 We use the entropic prior, which has been used in the maximum entropy literature (see [9]) to manipulate entropy. [sent-107, score-0.339]
46 Given a probability distribution θ, the entropic prior is defined as Pe (θ) ∝ e^(−αH(θ)), where H(θ) = −Σi θi log θi is the entropy of the distribution and α is a weighting factor. [sent-108, score-0.276]
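As a small illustration of the prior (the function names below are ours, not the paper's), the following computes H(θ) and the unnormalized log-prior −αH(θ); with α > 0, a peaked, sparse distribution receives a higher prior value than a flat one.

```python
import numpy as np

def entropy(theta, eps=1e-12):
    theta = np.asarray(theta, dtype=float)
    return -np.sum(theta * np.log(theta + eps))

def log_entropic_prior(theta, alpha):
    return -alpha * entropy(theta)           # unnormalized log Pe(theta)

sparse_dist = np.array([0.97, 0.01, 0.01, 0.01])
flat_dist = np.full(4, 0.25)
print(log_entropic_prior(sparse_dist, 1.0) > log_entropic_prior(flat_dist, 1.0))  # True
```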
47 Imposing this prior during maximum a posteriori estimation is a way to manipulate the entropy of the distribution. [sent-110, score-0.223]
48 The distribution θ could correspond to the basis distributions P (f |z) or the mixture weights Pn (z) or both. [sent-111, score-0.457]
49 A sparse code would correspond to having the entropic prior on Pn (z) with a positive value for α. [sent-112, score-0.216]
50 Below, we consider the case where both the basis vectors and mixture weights have the entropic prior to keep the exposition general. [sent-113, score-0.611]
51 Figure 3: Illustration of the effect of sparsity on the synthetic data set from Figure 1. [sent-122, score-0.282]
52 Sets of 3 (left), 7 (center), and 10 (right) basis distributions were obtained from the data without employing sparsity. [sent-125, score-0.287]
53 The convex hulls formed by the bases from each of these runs are shown in the panels from left to right. [sent-127, score-0.495]
54 Notice that increasing the number of bases enlarges the convex hulls, none of which characterize the distribution of the data well. [sent-128, score-0.462]
55 The panels from left to right show the 20 sets of estimates of 7 basis distributions, for increasing values of the sparsity parameter for the mixture weights. [sent-130, score-0.678]
56 where α and β are parameters indicating the degree of sparsity desired in P (f |z) and Pn (z) respectively. [sent-132, score-0.282]
57 Consequently, reducing the entropy of the mixture weights Pn (z) to obtain a sparse code results in increased entropy (information) in the basis distributions P (f |z). [sent-145, score-0.799]
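The update rules themselves (the paper's Equations 2 and 3, which carry the sparsity parameters α and β above) are not reproduced in this summary. As a point of reference, the sketch below implements only the standard maximum-likelihood PLSA/EM updates for the model Pn (f ) = Σz P (f |z)Pn (z); the paper's sparse variant further applies the entropic prior in the M-step (its solution involves Lambert's W function), which is not attempted here, and all names are illustrative.

```python
import numpy as np

def plsa_em(V, R, n_iter=100, seed=0):
    """Plain maximum-likelihood PLSA/EM for an F x N count matrix V with R bases.
    The paper's sparse (MAP) variant further applies the entropic prior to these
    M-step numerators; that modification is not implemented here."""
    F, N = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((F, R)); W /= W.sum(axis=0)   # P(f|z), columns sum to 1
    G = rng.random((R, N)); G /= G.sum(axis=0)   # Pn(z), columns sum to 1
    for _ in range(n_iter):
        P = W @ G + 1e-12                        # current Pn(f) for every column
        S = V / P                                # observed-to-modeled count ratios
        W_new = W * (S @ G.T)                    # numerator of the P(f|z) update
        G_new = G * (W.T @ S)                    # numerator of the Pn(z) update
        W = W_new / W_new.sum(axis=0)
        G = G_new / G_new.sum(axis=0)
    return W, G
```

Written multiplicatively in this way, the two M-step updates are the familiar KL-divergence NMF-style updates, which makes the connection to the matrix factorization view above explicit.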
58 This structure cannot be accurately represented by three or fewer basis distributions. [sent-149, score-0.216]
59 Figure 4: Application of latent variable decomposition for reconstructing faces from occluded images (CBCL Database); panel A shows the original test images. [sent-152, score-0.554]
60 Reconstructed faces from a sparse-overcomplete basis set of 1000 learned components (sparsity parameter = 0. [sent-157, score-0.356]
61 Simply increasing the number of bases without constraining the sparsity of the mixture weights does not provide meaningful solutions. [sent-162, score-0.918]
62 However, increasing the sparsity quickly results in solutions that accurately characterize the distribution of the data. [sent-163, score-0.377]
63 The goal of the decomposition is often to identify a set of latent distributions that characterize the underlying process that generated the data V. [sent-165, score-0.393]
64 When no sparsity is enforced on the solution, the trivial solution W = I, H = V is obtained at R = F . [sent-166, score-0.325]
65 In this solution, the entire information in V is borne by H and the bases W become uninformative. [sent-167, score-0.367]
66 However, by enforcing sparsity on H, the information in V is transferred back to W, and non-trivial solutions are possible for R > F . [sent-170, score-0.282]
67 By enforcing sparsity, we have thus increased the implicit limit on the number of bases that can be estimated without indeterminacy from the smaller dimension of V to the larger one. [sent-177, score-0.46]
68 4 Experimental Evaluation We hypothesize that if the learned basis distributions are characteristic of the process that generates the data, they must not only generalize to explain new data from the process, but also enable prediction of components of the data that were not observed. [sent-178, score-0.308]
69 Secondly, the bases for a given process must be worse at explaining data that have been generated by any other process. [sent-179, score-0.367]
70 1 Face Reconstruction In this experiment we evaluate the ability of the overcomplete bases to explain new data and predict the values of unobserved components of the data. [sent-183, score-0.596]
71 To create occluded test images, we removed 6 × 6 grids in ten random configurations for 10 test faces each, resulting in 100 occluded images. [sent-190, score-0.354]
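A minimal sketch of how such occluded test images might be generated (assuming 19 × 19 CBCL-style face images and one 6 × 6 patch removed per configuration; the exact protocol may differ, and the helper name is ours):

```python
import numpy as np

def occlude(face, rng, size=6):
    """Zero out one size x size patch at a random location; return image and mask."""
    h, w = face.shape
    r = rng.integers(0, h - size + 1)
    c = rng.integers(0, w - size + 1)
    observed = np.ones_like(face, dtype=bool)    # True where the pixel is observed
    observed[r:r + size, c:c + size] = False
    return np.where(observed, face, 0), observed

rng = np.random.default_rng(0)
# occluded_img, observed_mask = occlude(test_face, rng)   # test_face: 19 x 19 array
```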
72 In a training stage, we learned sets of K ∈ {50, 200, 500, 750, 1000} basis distributions from the training data. [sent-193, score-0.291]
73 Basis images combine in proportion to the mixture weights shown to result in the pixel images shown. [sent-196, score-0.46]
74 Figure 6: 25 basis distributions learned from training data for class “3” with increasing sparsity parameters on the mixture weights. [sent-199, score-0.756]
75 Increasing the sparsity parameter of mixture weights produces bases which are holistic representations of the input (histogram) data instead of parts-like features. [sent-203, score-0.9]
76 A sparsity prior was imposed on the mixture weights in the overcomplete cases (500, 750 and 1000 basis vectors). [sent-205, score-0.529]
77 The procedure for estimating the occluded regions of a test image has two steps. [sent-206, score-0.216]
78 In the first step, we estimate the distribution underlying the image as a linear combination of the basis distributions. [sent-207, score-0.276]
79 This is done by iterations of Equations 2 and 3 to estimate Pn (z) (the bases P (f |z), being already known, stay fixed) based only on the pixels that are observed (i.e., the set {Fo }). [sent-208, score-0.434]
80 The combination of the bases P (f |z) and the estimated Pn (z) gives us the overall distribution Pn (f ) for the image. [sent-211, score-0.367]
81 The occluded value at any pixel f is estimated as the expected number of counts at that pixel, given by Pn (f ) (Σf ′∈{Fo} Vf ′) / (Σf ′∈{Fo} Pn (f ′)), where Vf represents the value of the image at the f th pixel and {Fo } is the set of observed pixels. [sent-212, score-0.594]
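Putting the two steps together, here is a minimal sketch of the reconstruction (the helper name and details are ours): mixture weights are estimated from the observed pixels with the learned bases held fixed, using the plain maximum-likelihood update rather than the paper's entropic-prior version, and the occluded pixels are then filled in with the expected counts from the formula above.

```python
import numpy as np

def reconstruct(v, W, observed, n_iter=200):
    """v: length-F vector of counts/intensities; W: F x R learned bases P(f|z);
    observed: boolean mask over the F pixels (the set Fo)."""
    F, R = W.shape
    g = np.full(R, 1.0 / R)                      # Pn(z), estimated with W held fixed
    Wo, vo = W[observed], v[observed].astype(float)
    for _ in range(n_iter):
        p = Wo @ g + 1e-12                       # Pn(f) restricted to observed pixels
        g *= Wo.T @ (vo / p)                     # maximum-likelihood weight update
        g /= g.sum()
    Pf = W @ g                                   # full Pn(f)
    scale = vo.sum() / Pf[observed].sum()        # expected-count scaling from the text
    v_hat = v.astype(float)
    v_hat[~observed] = scale * Pf[~observed]     # fill in the occluded pixels
    return v_hat
```

Since the bases operate on unraveled column vectors, a flattened image and mask (e.g. observed_mask.ravel()) would be passed in.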
82 Figure 4B shows the reconstructed faces for the sparse-overcomplete case of 1000 basis vectors. [sent-213, score-0.235]
83 Performance is measured by mean Signal-to-Noise-Ratio (SNR), where SNR for an image was computed as the ratio of the sum of squared pixel intensities of the original image to the sum of squared error between the original image pixels and the reconstruction. [sent-215, score-0.357]
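For reference, a one-line sketch of this SNR measure (whether the reported numbers are the raw ratio or its value in decibels is not specified here, so that choice is an assumption):

```python
import numpy as np

def snr(original, reconstruction):
    original = np.asarray(original, dtype=float)
    err = original - np.asarray(reconstruction, dtype=float)
    return np.sum(original ** 2) / np.sum(err ** 2)
```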
84 2 Handwritten Digit Classification In this experiment we evaluate the specificity of the bases to the process represented by the training data set, through a simple example of handwritten digit classification. [sent-217, score-0.442]
85 During training, separate sets of basis distributions P k (f |z) were learned for each class, where k represents the index of the class. [sent-221, score-0.33]
86 Figure 5 shows 25 basis images extracted for the digit “2”. [sent-222, score-0.533]
87 To classify any test image v, we attempted to compute the distribution underlying the image using the bases of each class k (by estimating the mixture weights Pv (z), keeping the bases fixed, as before). [sent-223, score-1.112]
88 The “match” of the bases to the test instance was indicated by the likelihood Lk of the image, computed using P k (f ) = Σz P k (f |z)Pv (z) as Lk = Σf vf log P k (f ). [sent-224, score-0.633]
89 Since we expect the bases for the true class of the image to best compose it, we expect the likelihood for the correct class to be maximum. [sent-225, score-0.63]
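A minimal sketch of this classification rule (names are illustrative; infer_weights stands in for the fixed-basis weight estimation described earlier and is not a function from the paper):

```python
import numpy as np

def classify(v, class_bases, infer_weights):
    """class_bases: dict mapping class k to its F x Rk matrix of learned P^k(f|z);
    infer_weights: estimates the mixture weights Pv(z) with the bases held fixed."""
    scores = {}
    for k, Wk in class_bases.items():
        g = infer_weights(v, Wk)                   # weights for this class's bases
        Pk = Wk @ g + 1e-12                        # P^k(f) = sum_z P^k(f|z) Pv(z)
        scores[k] = float(np.sum(v * np.log(Pk)))  # L^k = sum_f v_f log P^k(f)
    return max(scores, key=scores.get)             # class with the highest likelihood
```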
90 [Plot: Reconstruction Experiment; mean SNR vs. number of basis vectors (25, 50, 75, 100, 200); legend: 1 patch, 2 patches, 3 patches, 4 patches.] [sent-228, score-0.216]
91 Mean SNR of the reconstructions is shown as a function of the number of basis vectors and the test case (number of deleted patches, shown in the legend). [sent-238, score-0.296]
92 Notice that imposing sparsity almost always leads to better classification performance. [sent-243, score-0.338]
93 In the case of 100 bases, the error rate comes down by almost 50% when sparsity is imposed on the mixture weights. [sent-244, score-0.282]
94 As one can see, imposing sparsity improves classification performance in almost all cases. [sent-247, score-0.338]
95 Figure 6 shows three sets of basis distributions learned for class “3” with different sparsity values on the mixture weights. [sent-248, score-0.727]
96 As the sparsity parameter is increased, bases tend to be holistic representations of the input histograms. [sent-249, score-0.7]
97 This is consistent with the improved classification performance: as the basis distributions become more holistic, they become more unlike the bases of other classes. [sent-250, score-0.624]
98 Thus, there is less chance that the bases of one class can compose an image from another class, thereby improving performance. [sent-251, score-0.575]
99 5 Conclusions In this paper, we have presented an algorithm for sparse extraction of overcomplete sets of latent distributions from histogram data. [sent-252, score-0.486]
100 We have used entropy as a measure of sparsity and employed the entropic prior to manipulate the entropy of the estimated parameters. [sent-253, score-0.755]
wordName wordTfidf (topN-words)
[('pn', 0.52), ('bases', 0.367), ('sparsity', 0.282), ('basis', 0.187), ('latent', 0.179), ('vf', 0.174), ('occluded', 0.153), ('entropic', 0.142), ('overcomplete', 0.142), ('vn', 0.136), ('entropy', 0.134), ('mixture', 0.128), ('plsa', 0.119), ('compose', 0.119), ('indeterminacy', 0.093), ('images', 0.093), ('components', 0.087), ('nth', 0.085), ('vectors', 0.082), ('decomposition', 0.081), ('count', 0.074), ('pixel', 0.074), ('weights', 0.072), ('patches', 0.072), ('hull', 0.071), ('distributions', 0.07), ('pixels', 0.067), ('th', 0.067), ('enclosing', 0.064), ('snr', 0.063), ('manipulate', 0.063), ('simplex', 0.063), ('image', 0.063), ('equations', 0.062), ('imposing', 0.056), ('panels', 0.052), ('gn', 0.051), ('holistic', 0.051), ('fo', 0.051), ('counts', 0.05), ('histogram', 0.05), ('faces', 0.048), ('column', 0.047), ('hulls', 0.047), ('param', 0.047), ('sparse', 0.045), ('digit', 0.043), ('lda', 0.043), ('trivial', 0.043), ('bhiksha', 0.043), ('cbcl', 0.043), ('expressiveness', 0.043), ('quantum', 0.043), ('shashanka', 0.043), ('coding', 0.042), ('extract', 0.042), ('represent', 0.041), ('panel', 0.041), ('draws', 0.04), ('meaningful', 0.04), ('represents', 0.039), ('decompositions', 0.039), ('combined', 0.038), ('permitting', 0.037), ('lambert', 0.037), ('enclose', 0.037), ('semantic', 0.037), ('characterize', 0.037), ('documents', 0.036), ('illustration', 0.035), ('supplemental', 0.034), ('pv', 0.034), ('learned', 0.034), ('word', 0.032), ('handwritten', 0.032), ('unrestricted', 0.032), ('legend', 0.032), ('columns', 0.031), ('dm', 0.03), ('extracted', 0.03), ('employing', 0.03), ('increasing', 0.029), ('likelihood', 0.029), ('convex', 0.029), ('code', 0.029), ('accurately', 0.029), ('composition', 0.028), ('em', 0.028), ('factorization', 0.028), ('dirichlet', 0.028), ('reconstructions', 0.027), ('intensities', 0.027), ('triangles', 0.027), ('representational', 0.027), ('lk', 0.026), ('reconstruction', 0.026), ('utility', 0.026), ('underlying', 0.026), ('class', 0.026), ('posteriori', 0.026)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis
Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.
2 0.17143399 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
Author: Tony Jebara, Yingbo Song, Kapil Thadani
Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.
3 0.15913337 182 nips-2007-Sparse deep belief net model for visual area V2
Author: Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng
Abstract: Motivated in part by the hierarchical organization of the cortex, a number of algorithms have recently been proposed that try to learn hierarchical, or “deep,” structure from unlabeled data. While several authors have formally or informally compared their algorithms to computations performed in visual area V1 (and the cochlea), little attempt has been made thus far to evaluate these algorithms in terms of their fidelity for mimicking computations at deeper levels in the cortical hierarchy. This paper presents an unsupervised learning model that faithfully mimics certain properties of visual area V2. Specifically, we develop a sparse variant of the deep belief networks of Hinton et al. (2006). We learn two layers of nodes in the network, and demonstrate that the first layer, similar to prior work on sparse coding and ICA, results in localized, oriented, edge filters, similar to the Gabor functions known to model V1 cell receptive fields. Further, the second layer in our model encodes correlations of the first layer responses in the data. Specifically, it picks up both colinear (“contour”) features as well as corners and junctions. More interestingly, in a quantitative comparison, the encoding of these more complex “corner” features matches well with the results from the Ito & Komatsu’s study of biological V2 responses. This suggests that our sparse variant of deep belief networks holds promise for modeling more higher-order features. 1
4 0.15551662 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.
5 0.13527985 145 nips-2007-On Sparsity and Overcompleteness in Image Models
Author: Pietro Berkes, Richard Turner, Maneesh Sahani
Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1
6 0.11916678 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
7 0.096647382 183 nips-2007-Spatial Latent Dirichlet Allocation
8 0.089716725 115 nips-2007-Learning the 2-D Topology of Images
9 0.080101959 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
10 0.074223146 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
11 0.071451597 73 nips-2007-Distributed Inference for Latent Dirichlet Allocation
12 0.06960699 189 nips-2007-Supervised Topic Models
13 0.069237359 185 nips-2007-Stable Dual Dynamic Programming
14 0.068759367 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
15 0.065604582 179 nips-2007-SpAM: Sparse Additive Models
16 0.065330401 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
17 0.062160015 105 nips-2007-Infinite State Bayes-Nets for Structured Domains
18 0.062107891 61 nips-2007-Convex Clustering with Exemplar-Based Models
19 0.05893806 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
20 0.058261946 143 nips-2007-Object Recognition by Scene Alignment
topicId topicWeight
[(0, -0.198), (1, 0.1), (2, -0.02), (3, -0.158), (4, 0.014), (5, 0.035), (6, -0.116), (7, 0.07), (8, 0.039), (9, 0.069), (10, 0.157), (11, 0.023), (12, 0.061), (13, -0.022), (14, 0.009), (15, 0.016), (16, 0.085), (17, 0.146), (18, -0.088), (19, -0.109), (20, -0.11), (21, 0.053), (22, 0.05), (23, -0.013), (24, -0.094), (25, 0.027), (26, -0.016), (27, -0.071), (28, 0.027), (29, 0.002), (30, 0.036), (31, 0.077), (32, 0.084), (33, -0.072), (34, -0.182), (35, 0.105), (36, -0.009), (37, -0.009), (38, -0.175), (39, 0.031), (40, 0.083), (41, -0.038), (42, -0.104), (43, 0.011), (44, 0.172), (45, -0.176), (46, -0.088), (47, -0.001), (48, -0.077), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.96979761 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis
Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.
2 0.71538931 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.
3 0.64357483 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
Author: Tony Jebara, Yingbo Song, Kapil Thadani
Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.
4 0.62954682 145 nips-2007-On Sparsity and Overcompleteness in Image Models
Author: Pietro Berkes, Richard Turner, Maneesh Sahani
Abstract: Computational models of visual cortex, and in particular those based on sparse coding, have enjoyed much recent attention. Despite this currency, the question of how sparse or how over-complete a sparse representation should be, has gone without principled answer. Here, we use Bayesian model-selection methods to address these questions for a sparse-coding model based on a Student-t prior. Having validated our methods on toy data, we find that natural images are indeed best modelled by extremely sparse distributions; although for the Student-t prior, the associated optimal basis size is only modestly over-complete. 1
5 0.61402375 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Author: Matthias Bethge, Philipp Berens
Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model, that makes this type of analysis feasible for very high-dimensional data—the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high dimensional measurements of more than thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that are not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible. 1
6 0.49516022 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
7 0.45009583 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
8 0.44373292 153 nips-2007-People Tracking with the Laplacian Eigenmaps Latent Variable Model
9 0.43245599 131 nips-2007-Modeling homophily and stochastic equivalence in symmetric relational data
10 0.42845929 182 nips-2007-Sparse deep belief net model for visual area V2
11 0.41543543 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
12 0.39565551 196 nips-2007-The Infinite Gamma-Poisson Feature Model
13 0.36046791 211 nips-2007-Unsupervised Feature Selection for Accurate Recommendation of High-Dimensional Image Data
14 0.35752854 130 nips-2007-Modeling Natural Sounds with Modulation Cascade Processes
15 0.35622418 8 nips-2007-A New View of Automatic Relevance Determination
16 0.35185698 50 nips-2007-Combined discriminative and generative articulated pose and non-rigid shape estimation
17 0.33628851 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
18 0.32900107 179 nips-2007-SpAM: Sparse Additive Models
19 0.31747156 47 nips-2007-Collapsed Variational Inference for HDP
20 0.31258753 183 nips-2007-Spatial Latent Dirichlet Allocation
topicId topicWeight
[(5, 0.482), (13, 0.032), (16, 0.015), (21, 0.037), (34, 0.021), (35, 0.024), (47, 0.078), (83, 0.114), (87, 0.046), (90, 0.046)]
simIndex simValue paperId paperTitle
1 0.93894881 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
Author: Charlie Frogner, Avi Pfeffer
Abstract: Dynamic Bayesian networks are structured representations of stochastic processes. Despite their structure, exact inference in DBNs is generally intractable. One approach to approximate inference involves grouping the variables in the process into smaller factors and keeping independent beliefs over these factors. In this paper we present several techniques for decomposing a dynamic Bayesian network automatically to enable factored inference. We examine a number of features of a DBN that capture different types of dependencies that will cause error in factored inference. An empirical comparison shows that the most useful of these is a heuristic that estimates the mutual information introduced between factors by one step of belief propagation. In addition to features computed over entire factors, for efficiency we explored scores computed over pairs of variables. We present search methods that use these features, pairwise and not, to find a factorization, and we compare their results on several datasets. Automatic factorization extends the applicability of factored inference to large, complex models that are undesirable to factor by hand. Moreover, tests on real DBNs show that automatic factorization can achieve significantly lower error in some cases. 1
2 0.89283413 111 nips-2007-Learning Horizontal Connections in a Sparse Coding Model of Natural Images
Author: Pierre Garrigues, Bruno A. Olshausen
Abstract: It has been shown that adapting a dictionary of basis functions to the statistics of natural images so as to maximize sparsity in the coefficients results in a set of dictionary elements whose spatial properties resemble those of V1 (primary visual cortex) receptive fields. However, the resulting sparse coefficients still exhibit pronounced statistical dependencies, thus violating the independence assumption of the sparse coding model. Here, we propose a model that attempts to capture the dependencies among the basis function coefficients by including a pairwise coupling term in the prior over the coefficient activity states. When adapted to the statistics of natural images, the coupling terms learn a combination of facilitatory and inhibitory interactions among neighboring basis functions. These learned interactions may offer an explanation for the function of horizontal connections in V1 in terms of a prior over natural images.
3 0.89146817 27 nips-2007-Anytime Induction of Cost-sensitive Trees
Author: Saher Esmeir, Shaul Markovitch
Abstract: Machine learning techniques are increasingly being used to produce a wide-range of classifiers for complex real-world applications that involve nonuniform testing costs and misclassification costs. As the complexity of these applications grows, the management of resources during the learning and classification processes becomes a challenging task. In this work we introduce ACT (Anytime Cost-sensitive Trees), a novel framework for operating in such environments. ACT is an anytime algorithm that allows trading computation time for lower classification costs. It builds a tree top-down and exploits additional time resources to obtain better estimations for the utility of the different candidate splits. Using sampling techniques ACT approximates for each candidate split the cost of the subtree under it and favors the one with a minimal cost. Due to its stochastic nature ACT is expected to be able to escape local minima, into which greedy methods may be trapped. Experiments with a variety of datasets were conducted to compare the performance of ACT to that of the state of the art cost-sensitive tree learners. The results show that for most domains ACT produces trees of significantly lower costs. ACT is also shown to exhibit good anytime behavior with diminishing returns.
same-paper 4 0.88468665 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
Author: Madhusudana Shashanka, Bhiksha Raj, Paris Smaragdis
Abstract: An important problem in many fields is the analysis of counts data to extract meaningful latent components. Methods like Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) have been proposed for this purpose. However, they are limited in the number of components they can extract and lack an explicit provision to control the “expressiveness” of the extracted components. In this paper, we present a learning formulation to address these limitations by employing the notion of sparsity. We start with the PLSA framework and use an entropic prior in a maximum a posteriori formulation to enforce sparsity. We show that this allows the extraction of overcomplete sets of latent components which better characterize the data. We present experimental evidence of the utility of such representations.
5 0.57224274 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Author: Matthias Bethge, Philipp Berens
Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model, that makes this type of analysis feasible for very high-dimensional data—the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high dimensional measurements of more than thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that are not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible. 1
6 0.51445961 115 nips-2007-Learning the 2-D Topology of Images
7 0.51214284 7 nips-2007-A Kernel Statistical Test of Independence
8 0.50806773 146 nips-2007-On higher-order perceptron algorithms
9 0.49968928 96 nips-2007-Heterogeneous Component Analysis
10 0.49867252 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
11 0.49600869 180 nips-2007-Sparse Feature Learning for Deep Belief Networks
12 0.49503982 187 nips-2007-Structured Learning with Approximate Inference
13 0.4917219 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
14 0.48555514 49 nips-2007-Colored Maximum Variance Unfolding
15 0.48304951 172 nips-2007-Scene Segmentation with CRFs Learned from Partially Labeled Images
16 0.47755075 113 nips-2007-Learning Visual Attributes
17 0.46878061 202 nips-2007-The discriminant center-surround hypothesis for bottom-up saliency
18 0.46743482 48 nips-2007-Collective Inference on Markov Models for Modeling Bird Migration
19 0.46555293 164 nips-2007-Receptive Fields without Spike-Triggering
20 0.46474436 63 nips-2007-Convex Relaxations of Latent Variable Training