nips nips2013 nips2013-320 knowledge-graph by maker-knowledge-mining

320 nips-2013-Summary Statistics for Partitionings and Feature Allocations


Source: pdf

Author: Isik B. Fidaner, Taylan Cemgil

Abstract: Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. [sent-6, score-0.158]

2 In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. [sent-8, score-0.741]

3 We develop an element-based definition of entropy to quantify segmentation among their elements. [sent-9, score-0.347]

4 Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. [sent-10, score-0.472]

5 Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice. [sent-11, score-0.138]

6 1 Introduction. Clustering aims to summarize observed data by grouping its elements according to their similarities. [sent-12, score-0.136]

7 Depending on the application, clusters may represent words belonging to topics, genes belonging to metabolic processes or any other relation assumed by the deployed approach. [sent-13, score-0.114]

8 Infinite mixture models provide a general solution by allowing a potentially unlimited number of mixture components. [sent-14, score-0.178]

9 Studies on infinite mixture models inspired the development of several other models [8, 9], including the Indian buffet process (IBP) for infinite feature models [10, 11] and the fragmentation-coagulation process for sequence data [12], all of which belong to Bayesian nonparametrics [13]. [sent-16, score-0.138]

10 In making inference on infinite mixture models, a sample set of partitionings can be obtained from the posterior. [sent-17, score-0.665]

11 However, in some cases the posterior is more diffuse and one needs to extract statistical information about the random partitioning induced by the model. [sent-19, score-0.231]

12 This problem of ‘summarizing’ the samples from the infinite mixture posterior was raised in the bioinformatics literature in 2002 by Medvedovic and Sivaganesan for clustering gene expression profiles [14]. [sent-20, score-0.275]

13 But the question proved difficult and they ‘circumvented’ it by using a heuristic linkage algorithm based on pairwise occurrence probabilities [15, 16]. [sent-21, score-0.241]

14 In this paper, we approach this problem and propose basic methodology for summarizing sample sets of partitionings as well as feature allocations. [sent-22, score-0.625]

15 showed in 2002 that the entropy [17] of a DP posterior was strongly determined by its prior hyperparameters [18]. [sent-24, score-0.304]

16 In other work, entropy was generalized to partitionings by interpreting partitionings as probability distributions [20, 21]. [sent-27, score-1.358]

17 Therefore, entropy emerges as an important statistic for our problem, but new definitions will be needed for quantifying information in feature allocations. [sent-28, score-0.395]

18 In the following sections, we define the problem and introduce cumulative statistics for representing partitionings and feature allocations. (Footnote 1: in methods such as collapsed Gibbs sampling, slice sampling, retrospective sampling, truncation methods.) [sent-29, score-0.709]

19 Then, we develop an interpretation of the entropy function in terms of per-element information in order to quantify segmentation among their elements. [sent-30, score-0.347]

20 Finally, we describe the entropy agglomeration (EA) algorithm, which generates dendrograms to summarize sample sets of partitionings and feature allocations. [sent-31, score-1.135]

21 We demonstrate EA on infinite mixture posteriors for synthetic and real datasets as well as on a real dataset directly interpreted as a feature allocation. [sent-32, score-0.138]

22 A partitioning of a set of elements [n] = {1, 2, . [sent-34, score-0.285]

23 We write Z ⊢ [n] to designate that Z is a partitioning of [n]. [sent-44, score-0.191]

24 , Z^(T)} from a distribution π(Z) over partitionings is a multiset such that Z^(t) ∼ π(Z) for all t ∈ {1, . [sent-48, score-0.597]

25 , x_n) are clustered by an infinite mixture model with parameters θ^(k) for each component k and mixture assignments (z_1, . [sent-56, score-0.178]

26 These z^(t) are then represented by partitionings Z^(t) ⊢ [n]. [sent-63, score-0.547]
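
As a concrete illustration of this representation step, here is a minimal Python sketch (ours, not the authors' code) that maps a sampled assignment vector to the partitioning it induces; the function and variable names are our own.

```python
from collections import defaultdict

def assignments_to_partitioning(z):
    """Map mixture assignments (z_1, ..., z_n) to the induced partitioning Z of [n]:
    elements sharing a component label end up in the same block. The component
    labels themselves are discarded; only the grouping is kept."""
    blocks = defaultdict(set)
    for element, component in enumerate(z, start=1):
        blocks[component].add(element)
    return [frozenset(b) for b in blocks.values()]

# One hypothetical posterior sample of assignments for n = 7 elements:
z_sample = [2, 5, 2, 1, 1, 2, 2]
print(assignments_to_partitioning(z_sample))
# e.g. [frozenset({1, 3, 6, 7}), frozenset({2}), frozenset({4, 5})]
```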

27 The induced sample set contains information regarding (1) the CRP prior over partitioning structure given by the hyperparameters (α, d) and (2) the integrals over θ that capture the relation among the observed elements (x_1, . [sent-64, score-0.314]

28 In addition, we aim to extract information from feature allocations, which constitute a superclass of partitionings [11]. [sent-68, score-0.629]

29 A feature allocation of [n] is a multiset of blocks F = {B1 , . [sent-69, score-0.232]

30 , F^(T)} from a distribution π(F) over feature allocations is a multiset such that F^(t) ∼ π(F) for all t. [sent-79, score-0.232]

31 If it was obtained by sampling from an infinite mixture posterior, then its blocks B ∈ Z^(t) correspond to the mixture components. [sent-82, score-0.311]

32 Three statistics commonly appear in the literature: the first one is the number of blocks |Z|, which has been analyzed theoretically for various nonparametric priors [2, 5]. [sent-87, score-0.133]

33 It is simple, general and exchangeable with respect to the elements [n], but it is not very informative about the distribution π(Z) and therefore is not very useful in practice. [sent-88, score-0.151]

34 A common statistic is pairwise occurrence, which is used to extract information from infinite mixture posteriors in applications like bioinformatics [14]. [sent-89, score-0.287]

35 For given pairs of elements {a, b}, it counts the number of blocks that contain these pairs, written Σ_i [{a, b} ⊂ B_i]. [sent-90, score-0.256]

36 Another statistic is the exact block size distribution (referred to as ‘multiplicities’ in [11, 19]). [sent-92, score-0.167]

37 It counts the partitioning’s blocks that contain exactly k elements, written Σ_i [|B_i| = k]. [sent-93, score-0.162]

38 It is exchangeable with respect to the elements [n], but its weighted average over a sample set is difficult to interpret. [sent-94, score-0.18]
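
To make these three classical statistics concrete, the sketch below (our own Python, not from the paper) computes the number of blocks, a pairwise occurrence indicator and the exact block size distribution for one partitioning; averaging each over a sample set {Z^(1), ..., Z^(T)} gives the corresponding summary.

```python
def num_blocks(Z):
    # |Z|: how many blocks the partitioning has
    return len(Z)

def pairwise_occurrence(Z, a, b):
    # sum_i [{a, b} subset B_i]: 1 if a and b share a block, else 0
    return sum(1 for B in Z if a in B and b in B)

def exact_block_sizes(Z, n):
    # sum_i [|B_i| = k] for k = 1..n ("multiplicities")
    return [sum(1 for B in Z if len(B) == k) for k in range(1, n + 1)]

Z = [frozenset({1, 3, 6, 7}), frozenset({2}), frozenset({4, 5})]
print(num_blocks(Z))                 # 3
print(pairwise_occurrence(Z, 1, 6))  # 1
print(exact_block_sizes(Z, 7))       # [1, 1, 0, 1, 0, 0, 0]
```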

39 The symbol ‘⊢’ is usually used for integer partitions, but here we use it for partitionings (=set partitions). [sent-96, score-0.547]

40 We want to compare the subsets of these genes S1 , S2 , S3 . [sent-98, score-0.125]

41 The projection of a partitioning Z ⊢ [n] onto S ⊂ [n] is defined as the set of non-empty intersections between S and the blocks B ∈ Z. [sent-99, score-0.231]
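
A minimal sketch of this projection operation, in our own Python:

```python
def proj(Z, S):
    """PROJ(Z, S): the non-empty intersections between S and the blocks B in Z."""
    S = set(S)
    return [B & S for B in Z if B & S]

Z = [frozenset({1, 3, 6, 7}), frozenset({2}), frozenset({4, 5})]
print(proj(Z, {1, 2, 3}))  # blocks {1, 3} and {2}: a partitioning of the subset
```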

42 In the following section, we develop a novel and general approach based on block sizes that opens up a systematic method for analyzing sample sets over partitionings and feature allocations. [sent-104, score-0.741]

43 3 Cumulative statistics to represent structure. We define the cumulative block size distribution, or ‘cumulative statistic’ for short, as the function φ_k(Z) = Σ_i [|B_i| ≥ k], which counts the partitioning’s blocks of size at least k. [sent-105, score-0.36]

44 We can rewrite the previous statistics: the number of blocks as φ_1(Z), the exact block size distribution as φ_k(Z) − φ_{k+1}(Z), and pairwise occurrence as φ_2(PROJ(Z, {a, b})). [sent-106, score-0.43]

45 Moreover, cumulative statistics satisfy the following property: for partitionings of [n], φ(Z) always sums up to n, just like a probability mass function that sums up to 1. [sent-107, score-0.66]
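
The cumulative statistic and the rewrites above take only a few lines; the following is our own illustrative Python under the stated definitions, not the authors' code.

```python
def phi(Z, n):
    # phi_k(Z) = sum_i [|B_i| >= k], for k = 1..n
    return [sum(1 for B in Z if len(B) >= k) for k in range(1, n + 1)]

def proj(Z, S):
    # PROJ(Z, S): non-empty intersections of the blocks with S
    return [B & set(S) for B in Z if B & set(S)]

Z, n = [frozenset({1, 3, 6, 7}), frozenset({2}), frozenset({4, 5})], 7
cum = phi(Z, n)
print(cum)       # [3, 2, 1, 1, 0, 0, 0]
print(cum[0])    # number of blocks phi_1(Z) = 3
print([cum[k] - cum[k + 1] for k in range(n - 1)] + [cum[n - 1]])
# exact block size distribution phi_k - phi_{k+1}: [1, 1, 0, 1, 0, 0, 0]
print(phi(proj(Z, {1, 6}), 2)[1])  # pairwise occurrence phi_2(PROJ(Z, {1, 6})) = 1
print(sum(cum))  # mass conservation: the entries sum to n = 7
```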

46 When blocks of Z are sorted according to their sizes and the indicators [|Bi | ≥ k] are arranged on a matrix as in Figure 1a, they form a Young diagram, showing that φ(Z) is always the conjugate partition of the integer partition of Z. [sent-108, score-0.257]

47 Therefore, cumulative statistics of a random partitioning ‘conserve mass’. [sent-110, score-0.304]

48 Z ⊢ [n] ⇒ Σ_{k=1}^{n} φ_k(Z) = n ⇒ Σ_{k=1}^{n} ⟨φ_k(Z)⟩_π(Z) = n (5). When we project the partitioning Z onto a subset S ⊂ [n], the resulting vector φ(PROJ(Z, S)) will then sum up to |S| (Figure 1b). [sent-112, score-0.191]

49 We can form a partitioning Z by inserting elements 1, 2, 3, 4, . [sent-114, score-0.34]

50 The resulting tree can be simplified by representing the partitionings by their cumulative statistics instead of their blocks (Figure 3b). [sent-128, score-0.793]

51 Based on this concept, we define the cumulative occurrence distribution (COD) as the triangular matrix of incremental cumulative statistic vectors, written ∆_{i,k}(Z, σ) = φ_k(PROJ(Z, S_i)), where Z ⊢ [n], σ is a permutation of [n] and S_i = {σ_1, . [sent-129, score-0.518]

52 COD matrices for two extreme paths (Figure 3c, 3e) and for the example partitioning Z^(1) (Figure 3d) are shown. [sent-136, score-0.191]

53 Z ⊢ [n] ⇒ Σ_{k=1}^{i} ∆_{i,k}(Z, σ) = i ⇒ Σ_{k=1}^{i} ⟨∆_{i,k}(Z, σ)⟩_π(Z) = i (6). The expected COD matrix of a random partitioning expresses (1) the cumulation of elements by the differences between its rows, and (2) the cumulation of block sizes by the differences between its columns. [sent-138, score-0.477]
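
Under these definitions, the COD matrix can be assembled directly from projected cumulative statistics; the sketch below (our own Python, with helper names of our choosing) also checks the row-sum property in equation (6).

```python
def phi(Z, n):
    return [sum(1 for B in Z if len(B) >= k) for k in range(1, n + 1)]

def proj(Z, S):
    return [B & set(S) for B in Z if B & set(S)]

def cod_matrix(Z, sigma):
    """Delta_{i,k}(Z, sigma) = phi_k(PROJ(Z, S_i)) with S_i = {sigma_1, ..., sigma_i}."""
    rows = []
    for i in range(1, len(sigma) + 1):
        S_i = set(sigma[:i])
        rows.append(phi(proj(Z, S_i), i))  # row i has i entries
    return rows

Z = [frozenset({1, 3, 6, 7}), frozenset({2}), frozenset({4, 5})]
for i, row in enumerate(cod_matrix(Z, [1, 2, 3, 4, 5, 6, 7]), start=1):
    assert sum(row) == i   # each row sums to i, as in equation (6)
    print(row)
```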

54 Since the CRP is exchangeable and projective, its expected cumulative statistic ⟨φ(Z)⟩_π(Z) for n elements depends only on its hyperparameters (α, d). [sent-140, score-0.346]
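
As a hedged illustration of this point, one can estimate ⟨φ(Z)⟩_π(Z) by simulation; the sketch below uses the one-parameter CRP(α) (discount d = 0) and is our own code, not the paper's.

```python
import random

def sample_crp(n, alpha, rng=random):
    """Draw one partitioning of [n] from a CRP(alpha) prior by sequential insertion."""
    blocks = []
    for i in range(1, n + 1):
        weights = [len(B) for B in blocks] + [alpha]  # existing blocks, then a new block
        r = rng.uniform(0, i - 1 + alpha)
        acc, j = 0.0, 0
        for j, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if j == len(blocks):
            blocks.append({i})   # element i opens a new block
        else:
            blocks[j].add(i)     # element i joins block j
    return blocks

def phi(Z, n):
    # cumulative statistic phi_k(Z) = sum_i [|B_i| >= k]
    return [sum(1 for B in Z if len(B) >= k) for k in range(1, n + 1)]

# Monte Carlo estimate of <phi(Z)> under CRP(alpha) with d = 0
n, alpha, T = 8, 1.0, 5000
samples = [phi(sample_crp(n, alpha), n) for _ in range(T)]
expected_phi = [sum(col) / T for col in zip(*samples)]
print(expected_phi)   # entries sum (up to Monte Carlo error) to n = 8
```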

55 (Table fragment listing example partitions with their cumulative statistics and entropies; recoverable entries include {{1}, {2}, {3}} and the vector (4, 0, 0, 0).) [sent-144, score-0.293]

56 This example shows that the COD matrix can reveal specific information about a distribution over partitionings; of course in practice we encounter non-exchangeable and almost arbitrary distributions over partitionings (e. [sent-213, score-0.547]

57 4 Entropy to quantify segmentation. Shannon’s entropy [17] can be an appropriate quantity to measure ‘segmentation’ with respect to partitionings, which can be interpreted as probability distributions [20, 21]. [sent-216, score-0.347]

58 If |B| < n, the block supplies positive information since the segment size is larger than the minimum, and we know that its segment size could be smaller if the block were larger. [sent-221, score-0.373]

59 To quantify this information, we define per-element information for a block B as the integral of segment size 1/s over the range [|B|, n] of block sizes that make this segment smaller (Figure 5). [sent-222, score-0.415]

60 pein(B) = ∫_{|B|}^{n} (1/s) ds = log(n/|B|). In pein(B), n is a ‘base’ that determines the minimum possible per-element segment size. [sent-225, score-0.142]

61 Since segment size expresses the significance of elements, the function integrates segment sizes over the block sizes that make the elements less significant. [sent-226, score-0.411]

62 (Figure residue: axis ticks and labels from a plot of the partition entropy H(Z) over S = {a, b, c}, together with the caption ‘Figure 7: H(Z) in incremental construction of Z’.) [sent-229, score-0.417]

63 A partitioning with a single block has zero entropy, and a partitioning with n blocks has the maximum entropy log n. [sent-234, score-0.864]
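
A small sketch (ours, assuming H(Z) is the average per-element information, i.e. Σ_{B∈Z} (|B|/n) · pein(B), which coincides with the Shannon entropy of the block proportions) that checks the two extreme cases just mentioned and computes a projection entropy:

```python
import math

def pein(B, n):
    # per-element information of block B: integral of 1/s over [|B|, n] = log(n/|B|)
    return math.log(n / len(B))

def partition_entropy(Z):
    # average per-element information = Shannon entropy of the block proportions |B|/n
    n = sum(len(B) for B in Z)
    return sum(len(B) / n * pein(B, n) for B in Z)

def proj(Z, S):
    return [B & set(S) for B in Z if B & set(S)]

Z = [frozenset({1, 3, 6, 7}), frozenset({2}), frozenset({4, 5})]
print(partition_entropy([frozenset(range(1, 8))]))               # single block: 0.0
print(partition_entropy([frozenset({i}) for i in range(1, 8)]))  # 7 singletons: log 7
print(partition_entropy(Z))                                       # something in between
print(partition_entropy(proj(Z, {1, 2, 4})))                      # S fully segmented: log 3
```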

64 On the extended tree (Figure 7), the nth column of nodes represents the possible partitionings of n. [sent-236, score-0.547]

65 A similar grid for feature allocations can be generated by inserting nodes for cumulative statistics that do not conserve mass. [sent-238, score-0.383]

66 To quantify the segmentation of a subset S, we compute the projection entropy H(PROJ(Z, S)). [sent-239, score-0.387]

67 To understand this function, we compare it to subset occurrence in Figure 8. [sent-240, score-0.151]

68 Subset occurrence acts as a ‘score’ that counts the ‘successful’ blocks that contain all of S, whereas projection entropy acts as a ‘penalty’ that quantifies how much S is being divided and segmented by the given blocks B ∈ Z. [sent-241, score-0.816]

69 A partitioning Z and a permutation σ of its elements induce an entropy sequence (h1 , . [sent-242, score-0.578]

70 To find subsets of elements that are more closely related, one can seek permutations σ that keep the entropies low. [sent-252, score-0.25]

71 5 Entropy agglomeration and experimental results. We want to summarize a sample set using the proposed statistics. [sent-261, score-0.237]

72 Permutations that yield lower entropy sequences can be meaningful, but a feasible algorithm can only involve a small subset of the n! [sent-262, score-0.264]

73 We define the entropy agglomeration (EA) algorithm, which begins from 1-element subsets and in each iteration merges the pair of subsets that yields the minimum expected entropy: Entropy Agglomeration Algorithm: 1. [sent-264, score-0.474]

74 Find the subset pair {S_a, S_b} ⊂ Ψ that minimizes the entropy H(PROJ(Z, S_a ∪ S_b)) 3. [sent-270, score-0.264]

75 Generate the dendrogram of chosen pairs by plotting minimum entropies for every split. [sent-275, score-0.128]
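
The agglomeration loop, as we read these steps, can be sketched as follows (our own Python, not the authors' implementation; expected_proj_entropy averages the projection entropy over the sample set of partitionings, and the toy sample set at the bottom is hypothetical):

```python
import math

def proj(Z, S):
    return [B & S for B in Z if B & S]

def entropy(Z):
    n = sum(len(B) for B in Z)
    return sum(len(B) / n * math.log(n / len(B)) for B in Z)

def expected_proj_entropy(samples, S):
    # <H(PROJ(Z, S))> averaged over the sample set {Z^(1), ..., Z^(T)}
    return sum(entropy(proj(Z, S)) for Z in samples) / len(samples)

def entropy_agglomeration(samples, elements):
    """Greedy EA: start from 1-element subsets and repeatedly merge the pair of
    subsets whose union has the minimum expected projection entropy. Returns the
    merge history (subset, entropy) from which a dendrogram would be drawn."""
    psi = [frozenset({e}) for e in elements]
    history = []
    while len(psi) > 1:
        a, b = min(((x, y) for i, x in enumerate(psi) for y in psi[i + 1:]),
                   key=lambda p: expected_proj_entropy(samples, p[0] | p[1]))
        h = expected_proj_entropy(samples, a | b)
        psi = [S for S in psi if S not in (a, b)] + [a | b]
        history.append((a | b, h))
    return history

# Two hypothetical posterior samples over n = 7 elements
samples = [
    [frozenset({1, 3, 6, 7}), frozenset({2}), frozenset({4, 5})],
    [frozenset({1, 3, 6}), frozenset({7}), frozenset({2, 4, 5})],
]
for subset, h in entropy_agglomeration(samples, range(1, 8)):
    print(sorted(subset), round(h, 3))
```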

76 The resulting dendrogram for the example partitionings is shown in Figure 9a. [sent-277, score-0.614]

77 Besides using this dendrogram as a general summary, one can also generate more specific dendrograms by choosing specific elements or specific parts of the data. [sent-279, score-0.199]

78 For a detailed element-wise analysis, entropy sequences of particular permutations σ can be assessed. [sent-280, score-0.315]

79 To summarize partitionings of gene expressions, [14] applied agglomerative clustering by pairwise occurrences. [sent-282, score-0.741]

80 EA avoids this drawback, since projection entropy is already defined over subsets. [sent-284, score-0.304]

81 To test the proposed algorithm, we apply it to partitionings sampled from infinite mixture posteriors. [sent-285, score-0.636]

82 Pairwise occurrences are ordered according to the EA dendrogram. [sent-290, score-0.113]

83 Plots are based on 450 partitionings from the posterior. [sent-293, score-0.547]

84 The dispersedness of the first cluster is represented by distinguishing ‘inner’ elements 1, 10 from ‘outer’ elements 6, 7. [sent-295, score-0.188]

85 Plots are based on 150 partitionings obtained from the posterior. [sent-298, score-0.547]

86 3) Galactose data (Figure 9d): This is a dataset of gene expressions by 820 genes in 20 experimental conditions [25]. [sent-302, score-0.142]

87 The first 204 genes are chosen, and the first two letters of the gene names are used as labels. [sent-303, score-0.142]

88 Plots are based on 250 partitionings from the posterior. [sent-304, score-0.547]

89 70 RP (ribosomal protein) genes and 12 HX (hexose transport) genes appear in individual leaves. [sent-305, score-0.162]

90 In this experiment, we take a different approach and apply EA directly on the dataset interpreted as a sample set of single-block feature allocations, where the blocks are IGO-year tuples and elements are the countries. [sent-309, score-0.305]
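
A hedged sketch of that interpretation (our own construction; the record format, organization names and country codes below are hypothetical): each IGO-year membership list becomes a one-block feature allocation over the countries, and the resulting multiset can then be passed to the same expected-projection-entropy and agglomeration routines sketched earlier (for feature allocations a projection may be empty, which those toy routines treat as zero entropy).

```python
# Hypothetical records: (igo_name, year, member_countries)
records = [
    ("IGO_A", 1990, {"TR", "DE", "FR"}),
    ("IGO_A", 2000, {"TR", "DE", "FR", "GB"}),
    ("IGO_B", 1995, {"US", "CA", "GB"}),
]

# Each record becomes a single-block feature allocation F^(t) = {B_t};
# the multiset of these plays the role of the sample set in EA.
feature_allocation_samples = [[frozenset(members)] for _, _, members in records]
countries = sorted(set().union(*(m for _, _, m in records)))
print(len(feature_allocation_samples), countries)
```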

91 6 Conclusion. In this paper, we developed a novel approach for summarizing sample sets of partitionings and feature allocations. [sent-313, score-0.625]

92 After presenting the problem, we introduced cumulative statistics and cumulative occurrence distribution matrices for each of its permutations, to represent a sample set in a systematic manner. [sent-314, score-0.406]

93 We defined per-element information to compute entropy sequences for these permutations. [sent-315, score-0.264]

94 We developed the entropy agglomeration (EA) algorithm, which chooses and visualises a small subset of these entropy sequences. [sent-316, score-0.694]

95 Entropy agglomeration is a simple algorithm that does not require much knowledge to implement, but it is conceptually based on the cumulative statistics we have presented. [sent-318, score-0.279]

96 Last but not least, algorithms like entropy agglomeration can be designed for summarization tasks concerning various types of combinatorial sample sets. [sent-322, score-0.459]

97 This work was funded by TÜBİTAK. (The remainder of this fragment is Figure 9 residue: ‘Galactose data: Number of blocks’, ‘(a) Example partitionings’, and the example partitioning Z^(1) = {{1, 3, 6, 7}, {2}, {4, 5}}.) [sent-325, score-0.133]

98 Figure 9: Entropy agglomeration and other results from the experiments (see the text).

99 (2002) Bayesian infinite mixture model based clustering of gene expression profiles. [sent-420, score-0.18]

100 (2004) Bayesian mixture model based clustering of replicated microarray data. [sent-425, score-0.119]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('partitionings', 0.547), ('roj', 0.339), ('entropy', 0.264), ('partitioning', 0.191), ('cod', 0.17), ('agglomeration', 0.166), ('occurence', 0.151), ('allocations', 0.133), ('blocks', 0.133), ('ea', 0.13), ('sr', 0.128), ('cumulative', 0.113), ('occurences', 0.113), ('sn', 0.102), ('bi', 0.102), ('ha', 0.094), ('pr', 0.094), ('elements', 0.094), ('mixture', 0.089), ('block', 0.085), ('segment', 0.085), ('statistic', 0.082), ('genes', 0.081), ('igo', 0.075), ('crp', 0.073), ('medvedovic', 0.067), ('dendrogram', 0.067), ('segmented', 0.066), ('pairwise', 0.061), ('tf', 0.061), ('entropies', 0.061), ('gene', 0.061), ('rp', 0.059), ('azici', 0.057), ('bumgarner', 0.057), ('exchangable', 0.057), ('galactose', 0.057), ('ower', 0.057), ('pein', 0.057), ('inserting', 0.055), ('hp', 0.055), ('bioinformatics', 0.055), ('permutations', 0.051), ('pg', 0.05), ('multiset', 0.05), ('feature', 0.049), ('pitman', 0.046), ('ls', 0.046), ('sl', 0.046), ('shannon', 0.045), ('subsets', 0.044), ('quantify', 0.044), ('sm', 0.044), ('sb', 0.043), ('summarize', 0.042), ('countries', 0.041), ('sa', 0.041), ('rr', 0.041), ('posterior', 0.04), ('projection', 0.04), ('cd', 0.04), ('segmentation', 0.039), ('ab', 0.038), ('hx', 0.038), ('continent', 0.038), ('cumulation', 0.038), ('dendrograms', 0.038), ('fidaner', 0.038), ('pdp', 0.038), ('sivaganesan', 0.038), ('nc', 0.037), ('zi', 0.037), ('arranged', 0.035), ('bo', 0.034), ('dirichlet', 0.034), ('si', 0.034), ('young', 0.034), ('superclass', 0.033), ('metabolic', 0.033), ('supplies', 0.033), ('yeung', 0.033), ('conserve', 0.033), ('istanbul', 0.033), ('teh', 0.033), ('nt', 0.031), ('mt', 0.031), ('organizations', 0.031), ('sizes', 0.031), ('incremental', 0.03), ('clustering', 0.03), ('fi', 0.029), ('counts', 0.029), ('neal', 0.029), ('permutation', 0.029), ('archer', 0.029), ('nemenman', 0.029), ('linkage', 0.029), ('sample', 0.029), ('partition', 0.029), ('pa', 0.028)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000006 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations

Author: Isik B. Fidaner, Taylan Cemgil

Abstract: Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.

2 0.17504108 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

Author: Karin C. Knudson, Jonathan W. Pillow

Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1

3 0.096952751 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge

Author: Evan W. Archer, Il M. Park, Jonathan W. Pillow

Abstract: Shannon’s entropy is a basic quantity in information theory, and a fundamental building block for the analysis of neural codes. Estimating the entropy of a discrete distribution from samples is an important and difficult problem that has received considerable attention in statistics and theoretical neuroscience. However, neural responses have characteristic statistical structure that generic entropy estimators fail to exploit. For example, existing Bayesian entropy estimators make the naive assumption that all spike words are equally likely a priori, which makes for an inefficient allocation of prior probability mass in cases where spikes are sparse. Here we develop Bayesian estimators for the entropy of binary spike trains using priors designed to flexibly exploit the statistical structure of simultaneouslyrecorded spike responses. We define two prior distributions over spike words using mixtures of Dirichlet distributions centered on simple parametric models. The parametric model captures high-level statistical features of the data, such as the average spike count in a spike word, which allows the posterior over entropy to concentrate more rapidly than with standard estimators (e.g., in cases where the probability of spiking differs strongly from 0.5). Conversely, the Dirichlet distributions assign prior mass to distributions far from the parametric model, ensuring consistent estimates for arbitrary distributions. We devise a compact representation of the data and prior that allow for computationally efficient implementations of Bayesian least squares and empirical Bayes entropy estimators with large numbers of neurons. We apply these estimators to simulated and real neural data and show that they substantially outperform traditional methods.

4 0.079751052 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

Author: Martin Azizyan, Aarti Singh, Larry Wasserman

Abstract: While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds for their statistical performance as well as fundamental limits in high-dimensional settings are not well-understood. In this paper, we provide precise information theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity only depends on the number of relevant dimensions and mean separation, and can be achieved by a simple computationally efficient procedure. Our results provide the first step of a theoretical basis for recent methods that combine feature selection and clustering. 1

5 0.07945209 44 nips-2013-B-test: A Non-parametric, Low Variance Kernel Two-sample Test

Author: Wojciech Zaremba, Arthur Gretton, Matthew Blaschko

Abstract: A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of previously proposed MMD two-sample tests: B-tests are more powerful than a linear time test where blocks are just pairs of samples, yet they are more computationally efficient than a quadratic time test where a single large block incorporating all the samples is used to compute a U-statistic. A further important advantage of the B-tests is their asymptotically Normal null distribution: this is by contrast with the U-statistic, which is degenerate under the null hypothesis, and for which estimates of the null distribution are computationally demanding. Recent results on kernel selection for hypothesis testing transfer seamlessly to the B-tests, yielding a means to optimize test power via kernel choice. 1

6 0.069618449 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

7 0.068978295 224 nips-2013-On the Sample Complexity of Subspace Learning

8 0.065215118 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties

9 0.063890629 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

10 0.062330604 146 nips-2013-Large Scale Distributed Sparse Precision Estimation

11 0.061245892 201 nips-2013-Multi-Task Bayesian Optimization

12 0.058520734 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

13 0.058167253 23 nips-2013-Active Learning for Probabilistic Hypotheses Using the Maximum Gibbs Error Criterion

14 0.057022005 149 nips-2013-Latent Structured Active Learning

15 0.055984687 300 nips-2013-Solving the multi-way matching problem by permutation synchronization

16 0.055234812 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

17 0.049968079 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices

18 0.047910407 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions

19 0.047539961 318 nips-2013-Structured Learning via Logistic Regression

20 0.047470324 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.143), (1, 0.048), (2, -0.008), (3, 0.014), (4, -0.001), (5, 0.086), (6, 0.066), (7, -0.037), (8, 0.006), (9, 0.016), (10, -0.03), (11, 0.029), (12, -0.006), (13, 0.043), (14, 0.045), (15, 0.001), (16, 0.035), (17, 0.028), (18, -0.032), (19, 0.05), (20, -0.028), (21, 0.046), (22, -0.079), (23, -0.115), (24, -0.004), (25, 0.012), (26, -0.027), (27, -0.057), (28, -0.041), (29, 0.051), (30, 0.026), (31, 0.137), (32, 0.034), (33, -0.044), (34, 0.054), (35, 0.107), (36, -0.043), (37, 0.108), (38, 0.076), (39, 0.075), (40, 0.016), (41, -0.036), (42, 0.09), (43, -0.033), (44, 0.074), (45, 0.077), (46, -0.045), (47, -0.016), (48, 0.082), (49, -0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94451994 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations

Author: Isik B. Fidaner, Taylan Cemgil

Abstract: Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.

2 0.78626919 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

Author: Karin C. Knudson, Jonathan W. Pillow

Abstract: Entropy rate quantifies the amount of disorder in a stochastic process. For spiking neurons, the entropy rate places an upper bound on the rate at which the spike train can convey stimulus information, and a large literature has focused on the problem of estimating entropy rate from spike train data. Here we present Bayes least squares and empirical Bayesian entropy rate estimators for binary spike trains using hierarchical Dirichlet process (HDP) priors. Our estimator leverages the fact that the entropy rate of an ergodic Markov Chain with known transition probabilities can be calculated analytically, and many stochastic processes that are non-Markovian can still be well approximated by Markov processes of sufficient depth. Choosing an appropriate depth of Markov model presents challenges due to possibly long time dependencies and short data sequences: a deeper model can better account for long time dependencies, but is more difficult to infer from limited data. Our approach mitigates this difficulty by using a hierarchical prior to share statistical power across Markov chains of different depths. We present both a fully Bayesian and empirical Bayes entropy rate estimator based on this model, and demonstrate their performance on simulated and real neural spike train data. 1

3 0.65320796 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge

Author: Evan W. Archer, Il M. Park, Jonathan W. Pillow

Abstract: Shannon’s entropy is a basic quantity in information theory, and a fundamental building block for the analysis of neural codes. Estimating the entropy of a discrete distribution from samples is an important and difficult problem that has received considerable attention in statistics and theoretical neuroscience. However, neural responses have characteristic statistical structure that generic entropy estimators fail to exploit. For example, existing Bayesian entropy estimators make the naive assumption that all spike words are equally likely a priori, which makes for an inefficient allocation of prior probability mass in cases where spikes are sparse. Here we develop Bayesian estimators for the entropy of binary spike trains using priors designed to flexibly exploit the statistical structure of simultaneouslyrecorded spike responses. We define two prior distributions over spike words using mixtures of Dirichlet distributions centered on simple parametric models. The parametric model captures high-level statistical features of the data, such as the average spike count in a spike word, which allows the posterior over entropy to concentrate more rapidly than with standard estimators (e.g., in cases where the probability of spiking differs strongly from 0.5). Conversely, the Dirichlet distributions assign prior mass to distributions far from the parametric model, ensuring consistent estimates for arbitrary distributions. We devise a compact representation of the data and prior that allow for computationally efficient implementations of Bayesian least squares and empirical Bayes entropy estimators with large numbers of neurons. We apply these estimators to simulated and real neural data and show that they substantially outperform traditional methods.

4 0.60650522 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties

Author: Paul Valiant, Gregory Valiant

Abstract: Recently, Valiant and Valiant [1, 2] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n/ log n). We propose a novel modification of this approach and show: 1) theoretically, this estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in our approach is to first use the sample to characterize the “unseen” portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems. 1

5 0.56239098 252 nips-2013-Predictive PAC Learning and Process Decompositions

Author: Cosma Shalizi, Aryeh Kontorovitch

Abstract: We informally call a stochastic process learnable if it admits a generalization error approaching zero in probability for any concept class with finite VC-dimension (IID processes are the simplest example). A mixture of learnable processes need not be learnable itself, and certainly its generalization error need not decay at the same rate. In this paper, we argue that it is natural in predictive PAC to condition not on the past observations but on the mixture component of the sample path. This definition not only matches what a realistic learner might demand, but also allows us to sidestep several otherwise grave problems in learning from dependent data. In particular, we give a novel PAC generalization bound for mixtures of learnable processes with a generalization error that is not worse than that of each mixture component. We also provide a characterization of mixtures of absolutely regular (β-mixing) processes, of independent probability-theoretic interest. 1

6 0.53511804 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes

7 0.49832892 316 nips-2013-Stochastic blockmodel approximation of a graphon: Theory and consistent estimation

8 0.49284899 277 nips-2013-Restricting exchangeable nonparametric distributions

9 0.47292349 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation

10 0.47013894 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models

11 0.4664686 201 nips-2013-Multi-Task Bayesian Optimization

12 0.45529807 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models

13 0.45250842 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

14 0.43536586 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent

15 0.43374151 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits

16 0.43142912 173 nips-2013-Least Informative Dimensions

17 0.42684042 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components

18 0.42359027 344 nips-2013-Using multiple samples to learn mixture models

19 0.42100644 260 nips-2013-RNADE: The real-valued neural autoregressive density-estimator

20 0.41472235 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.036), (3, 0.288), (16, 0.065), (33, 0.134), (34, 0.097), (36, 0.014), (41, 0.021), (49, 0.037), (56, 0.068), (70, 0.024), (85, 0.048), (89, 0.017), (93, 0.056), (95, 0.014), (99, 0.012)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.76046187 8 nips-2013-A Graphical Transformation for Belief Propagation: Maximum Weight Matchings and Odd-Sized Cycles

Author: Jinwoo Shin, Andrew E. Gelfand, Misha Chertkov

Abstract: Max-product ‘belief propagation’ (BP) is a popular distributed heuristic for finding the Maximum A Posteriori (MAP) assignment in a joint probability distribution represented by a Graphical Model (GM). It was recently shown that BP converges to the correct MAP assignment for a class of loopy GMs with the following common feature: the Linear Programming (LP) relaxation to the MAP problem is tight (has no integrality gap). Unfortunately, tightness of the LP relaxation does not, in general, guarantee convergence and correctness of the BP algorithm. The failure of BP in such cases motivates reverse engineering a solution – namely, given a tight LP, can we design a ‘good’ BP algorithm. In this paper, we design a BP algorithm for the Maximum Weight Matching (MWM) problem over general graphs. We prove that the algorithm converges to the correct optimum if the respective LP relaxation, which may include inequalities associated with non-intersecting odd-sized cycles, is tight. The most significant part of our approach is the introduction of a novel graph transformation designed to force convergence of BP. Our theoretical result suggests an efficient BP-based heuristic for the MWM problem, which consists of making sequential, “cutting plane”, modifications to the underlying GM. Our experiments show that this heuristic performs as well as traditional cutting-plane algorithms using LP solvers on MWM problems. 1

same-paper 2 0.75379109 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations

Author: Isik B. Fidaner, Taylan Cemgil

Abstract: Infinite mixture models are commonly used for clustering. One can sample from the posterior of mixture assignments by Monte Carlo methods or find its maximum a posteriori solution by optimization. However, in some problems the posterior is diffuse and it is hard to interpret the sampled partitionings. In this paper, we introduce novel statistics based on block sizes for representing sample sets of partitionings and feature allocations. We develop an element-based definition of entropy to quantify segmentation among their elements. Then we propose a simple algorithm called entropy agglomeration (EA) to summarize and visualize this information. Experiments on various infinite mixture posteriors as well as a feature allocation dataset demonstrate that the proposed statistics are useful in practice.

3 0.70149571 277 nips-2013-Restricting exchangeable nonparametric distributions

Author: Sinead A. Williamson, Steve N. MacEachern, Eric Xing

Abstract: Distributions over matrices with exchangeable rows and infinitely many columns are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly-suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well-modeled by the original distribution. 1

4 0.64914799 1 nips-2013-(More) Efficient Reinforcement Learning via Posterior Sampling

Author: Ian Osband, Dan Russo, Benjamin Van Roy

Abstract: Most provably-efficient reinforcement learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration: posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient and allows an √ agent to encode prior knowledge ˜ in a natural way. We establish an O(τ S AT ) bound on expected regret, where T is time, τ is the episode length and S and A are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds. 1

5 0.56481522 115 nips-2013-Factorized Asymptotic Bayesian Inference for Latent Feature Models

Author: Kohei Hayashi, Ryohei Fujimaki

Abstract: This paper extends factorized asymptotic Bayesian (FAB) inference for latent feature models (LFMs). FAB inference has not been applicable to models, including LFMs, without a specific condition on the Hessian matrix of a complete loglikelihood, which is required to derive a “factorized information criterion” (FIC). Our asymptotic analysis of the Hessian matrix of LFMs shows that FIC of LFMs has the same form as those of mixture models. FAB/LFMs have several desirable properties (e.g., automatic hidden states selection and parameter identifiability) and empirically perform better than state-of-the-art Indian Buffet processes in terms of model selection, prediction, and computational efficiency. 1

6 0.56391728 201 nips-2013-Multi-Task Bayesian Optimization

7 0.56334519 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models

8 0.56253153 350 nips-2013-Wavelets on Graphs via Deep Learning

9 0.55993354 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking

10 0.55740482 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

11 0.55703443 341 nips-2013-Universal models for binary spike patterns using centered Dirichlet processes

12 0.55690336 173 nips-2013-Least Informative Dimensions

13 0.55677688 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

14 0.55670196 5 nips-2013-A Deep Architecture for Matching Short Texts

15 0.55665392 123 nips-2013-Flexible sampling of discrete data correlations without the marginal distributions

16 0.55504096 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

17 0.55496937 182 nips-2013-Manifold-based Similarity Adaptation for Label Propagation

18 0.55488276 285 nips-2013-Robust Transfer Principal Component Analysis with Rank Constraints

19 0.55465299 251 nips-2013-Predicting Parameters in Deep Learning

20 0.55456078 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation