nips nips2010 nips2010-90 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Bo Thiesson, Chong Wang
Abstract: Remarkably easy implementation and guaranteed convergence has made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Fast Large-scale Mixture Modeling with Component-specific Data Partitions Bo Thiesson∗ Microsoft Research Chong Wang∗† Princeton University Abstract Remarkably easy implementation and guaranteed convergence has made the EM algorithm one of the most used algorithms for mixture modeling. [sent-1, score-0.119]
2 Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. [sent-3, score-0.312]
3 Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. [sent-4, score-0.115]
4 We demonstrate this speedup by experiments on large-scale synthetic and real data. [sent-5, score-0.117]
5 1 Introduction Probabilistic mixture modeling [7] has been widely used for density estimation and clustering applications. [sent-6, score-0.123]
6 In particular, the E-step is linear in both the number of data points and the number of mixture components, and therefore computationally impractical for large-scale applications. [sent-9, score-0.151]
7 Our work was motivated by a large-scale geo-spatial problem, demanding a mixture model of a customer base (a huge number of data points) for competing businesses (a large number of mixture components), as the basis for site evaluation (where to locate a new store). [sent-10, score-0.25]
8 Our work is inspired by the “chunky EM” algorithm in [17, 16], a smart application of the variational EM framework [11], where a lower bound on the objective function increases at each iteration and convergence is guaranteed. [sent-16, score-0.203]
9 The variational EM framework alters the E-step to use sufficient statistics calculated under a variational distribution instead. [sent-18, score-0.306]
10 In chunky EM, the speedup is obtained by using a variational distribution with shared (variational) membership probabilities for blocks of data (in an exhaustive partition of the entire data set into non-overlapping blocks). [sent-19, score-1.102]
11 The chunky EM starts from a coarse partition of the data and gradually refines the partition until convergence. [sent-20, score-0.729]
12 However, chunky EM does not scale well in the number of components, since all components share the same partition. [sent-21, score-0.493]
13 The individual components are different – in order to obtain membership probabilities of appropriate quality, one component may need fine-grained blocks in one area of the data space, while another component is perfectly fine with coarse blocks in that area. [sent-22, score-0.401]
14 Chunky EM expands the shared partition to match the needed granularity for the most demanding mixture component in any area of the data space, which might unnecessarily increase the computational cost. (∗Equal contributors.) [sent-23, score-0.345]
15 We demonstrate a significant performance improvement over standard and chunky EM for experiments on synthetic data and the previously mentioned customer-business data. [sent-27, score-0.478]
16 Given data x = {x1, · · · , xN}, we are interested in estimating the parameters θ = {η1:K, π1:K} in the K-component mixture model with log-likelihood function L(θ) = Σn log Σk p(xn|ηk)πk. (1) [sent-32, score-0.132]
17 For this task, we consider a variational generalization [11] of standard EM [4], which maximizes a lower bound of L(θ) through the introduction of a variational distribution q. [sent-33, score-0.337]
18 We assume that the variational distribution factorizes in accordance with data points, [sent-34, score-0.172]
19 i.e., q = Πn qn, where each qn is an arbitrary discrete distribution over mixture components k = 1, . . . , K. [sent-35, score-0.303]
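The lower bound referred to in sentence 17 is not written out in the extracted sentences; the following is a standard form of it, reconstructed under the factorization q = Πn qn stated above (a hedged sketch, not a verbatim equation from the paper):

```latex
% Variational lower bound maximized by variational EM (standard form):
\mathcal{F}(\theta, q)
  = \sum_{n}\sum_{k} q_n(k)\bigl[\log p(x_n \mid \eta_k)\pi_k - \log q_n(k)\bigr]
  = \mathcal{L}(\theta) - \sum_{n} \mathrm{KL}\bigl(q_n(\cdot)\,\big\|\,p(\cdot \mid x_n, \theta)\bigr)
  \;\le\; \mathcal{L}(\theta).
```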
20 The variational EM algorithm alternates the following two steps, i.e., an E-step that updates the variational distribution q and an M-step that updates the model parameters θ. [sent-40, score-0.153]
21 With an unrestricted q, the variational EM is equivalent to the standard EM, and hence produces the maximum likelihood (ML) estimate. [sent-45, score-0.153]
22 In the following, we consider certain ways of restricting q to attain speedup over standard EM, implying that the minimum KL-divergence between qn and p(·|xn , θ) is not necessarily zero. [sent-46, score-0.172]
23 Still the variational EM defines a convergent algorithm, which instead optimizes a lower bound of the log-likelihood. [sent-47, score-0.184]
24 The chunky EM algorithm [17, 16] falls into the framework of variational EM algorithms. [sent-49, score-0.605]
25 In chunky EM, the variational distribution q = Πn qn is restricted according to a partition of the data into exhaustive and mutually exclusive blocks. [sent-50, score-0.887]
26 The intuition is that data points in the same block are somewhat similar and can be treated in the same way, which leads to computational savings in the E-step. [sent-52, score-0.117]
27 If M is the number of blocks in a given partition, the E-step for chunky EM has cost O(KM ) whereas in standard EM the cost is O(KN ). [sent-53, score-0.534]
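To make the O(KM) cost concrete, here is a minimal numpy sketch of a block-shared E-step for a diagonal-covariance Gaussian mixture; the function name, the layout of the per-block sufficient statistics (count, sum, sum of squares), and the diagonal-Gaussian choice are illustrative assumptions, not the paper's implementation. The optimal shared responsibility for a block is proportional to the mixing weight times the exponentiated average log-density over the block, which can be evaluated from the block statistics alone.

```python
import numpy as np
from scipy.special import logsumexp

def block_e_step(block_counts, block_sums, block_sumsqs, means, variances, weights):
    """Shared (block-level) responsibilities for a diagonal-covariance Gaussian mixture.

    block_counts: (M,)   number of points in each block
    block_sums:   (M, D) per-block sum of x
    block_sumsqs: (M, D) per-block sum of x**2
    means:        (K, D) component means
    variances:    (K, D) component variances (diagonal)
    weights:      (K,)   mixing proportions
    Returns q of shape (M, K): one shared distribution per block, at O(M*K*D) cost
    instead of the O(N*K*D) cost of a per-point E-step.
    """
    avg_x = block_sums / block_counts[:, None]       # E_B[x],   shape (M, D)
    avg_x2 = block_sumsqs / block_counts[:, None]    # E_B[x^2], shape (M, D)
    # E_B[(x - mu_k)^2] = E_B[x^2] - 2 mu_k E_B[x] + mu_k^2, broadcast to (M, K, D)
    sq = (avg_x2[:, None, :]
          - 2.0 * avg_x[:, None, :] * means[None, :, :]
          + means[None, :, :] ** 2)
    # Average log N(x | mu_k, var_k) over the block, summed over dimensions -> (M, K)
    avg_loglik = (-0.5 * (np.log(2.0 * np.pi * variances)[None, :, :]
                          + sq / variances[None, :, :])).sum(axis=2)
    # Optimal shared q_B(k) under the equal-within-block constraint
    log_q = np.log(weights)[None, :] + avg_loglik
    return np.exp(log_q - logsumexp(log_q, axis=1, keepdims=True))
```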
28 The speedup is gained by a trade-off between the tightness of the lower bound for the log-likelihood and the restrictiveness of constraints. [sent-55, score-0.122]
29 Chunky EM starts from a coarse partition and iteratively refines it. [sent-56, score-0.139]
30 This refinement process always produces a tighter bound, since restrictions on the variational distribution are gradually relaxed. [sent-57, score-0.153]
31 The chunky EM algorithm stops when refining any block in a partition will not significantly increase the lower bound. [sent-58, score-0.655]
32 3 Component-specific EM In chunky EM, all mixture components share the same data partition. [sent-59, score-0.612]
33 However, for a particular block of data, the variation in membership probabilities differs across components, resulting in varying differences from the equality constrained variational probabilities. [sent-60, score-0.32]
34 Roughly, the variation in membership probabilities is greatest for components closer to a block of data, and, in particular, for components far away the membership probabilities are all so small that the variation is insignificant. [sent-61, score-0.348]
35 Let Mk be the number of data blocks in the partition for component k. [sent-63, score-0.256]
36 The complexity for the E-step is then O(Σk Mk), compared to O(KM) in chunky EM. [sent-64, score-0.452]
37 Since our model maintains different partitions for different mixture components, we call it the component-specific EM algorithm (CS-EM). [sent-66, score-0.198]
38 The bottom-right figure is the corresponding MPT, where {·} indicates the component marks and a, b, c, d, e, f, g enumerate all the marked nodes. [sent-68, score-0.12]
39 Starting from a coarse partition for each component (see Section 4.1 for examples), [sent-73, score-0.175]
40 CS-EM runs variational EM to convergence and then selectively refines the component-specific partitions. [sent-74, score-0.191]
41 1 Marked Partition Trees It is convenient to organize the data into a pre-computed partition tree, where a node in the tree represents the union of the data represented by its children. [sent-80, score-0.22]
42 Any hierarchical decomposition of data that ensures some degree of similarity between data in a block is suitable for constructing a partition tree. [sent-84, score-0.225]
43 We will in the following consider tree-consistent partitions, where each data block in a partition corresponds to exactly one node for a cut (possibly across different levels) in the tree–see Figure 1. [sent-87, score-0.244]
44 Let us now define a marked partition tree (MPT), a simple encoding of all component-specific partitions, as follows. [sent-88, score-0.201]
45 Let Bk be the data partition (a set of blocks) in the tree-consistent partition for mixture component k. [sent-89, score-0.393]
46 In Figure 1, for example, B1 is the partition into data blocks associated with nodes {e, c, d}. [sent-90, score-0.245]
47 In the shared data partition tree used to generate the component-specific partitions, we mark the corresponding nodes for the data blocks in each Bk by the component identifier k. [sent-91, score-0.325]
48 For example, in chunky EM, the component-specific partitions are the same for each mixture component. [sent-97, score-0.65]
49 In this case, only the leaves in the MPT are marked, with each leaf marked by all mixture components. [sent-98, score-0.234]
50 The following important property for an MPT holds, since all component-specific partitions are constructed with respect to the same data partition tree. [sent-99, score-0.236]
51 The marked nodes on a path from leaf to root in T mark exactly one data block from each of the K component-specific data partitions. [sent-102, score-0.276]
52 In the following, it becomes important to identify the data block in a component-specific partition, which embeds the block defined by a leaf. [sent-103, score-0.155]
53 Let L denote the set of leaves in T , and let BL denote a partition with data blocks Bl ∈ BL according to these leaves. [sent-104, score-0.242]
54 Example: In Figure 1, the path a → e → g in turn marks the components Ka = {3, 4}, Ke = {1, 2}, and Kg = {5} and we see that each component is marked exactly once on this path, as stated in Property 1. [sent-107, score-0.178]
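A minimal sketch of the marked partition tree just described, under an illustrative node class that is not taken from the paper: marks are stored per node, and a Property 1 check walks every leaf-to-root path and verifies that each component is marked exactly once. The toy marks reproduce the quoted example (Ka = {3, 4}, Ke = {1, 2}, Kg = {5}, B1 = {e, c, d}); the remaining marks are assumptions chosen to be consistent with it.

```python
from collections import Counter

class Node:
    """A node in the shared partition tree; its block is the union of its children's blocks."""
    def __init__(self, children=()):
        self.children = list(children)
        self.parent = None
        self.marks = set()                     # component ids k whose partition uses this block
        for child in self.children:
            child.parent = self

def leaves(node):
    return [node] if not node.children else [l for c in node.children for l in leaves(c)]

def check_property_1(root, num_components):
    """On every leaf-to-root path, each component must be marked exactly once."""
    for leaf in leaves(root):
        counts, current = Counter(), leaf
        while current is not None:
            counts.update(current.marks)
            current = current.parent
        assert all(counts[k] == 1 for k in range(1, num_components + 1)), counts
    return True

# Toy tree mirroring Figure 1's example: leaves a, b, c, d; e = (a, b); f = (c, d); root g.
a, b, c, d = Node(), Node(), Node(), Node()
e, f = Node([a, b]), Node([c, d])
g = Node([e, f])
a.marks, b.marks = {3, 4}, {3, 4}
c.marks, d.marks = {1, 3}, {1, 3}
e.marks, f.marks, g.marks = {1, 2}, {2, 4}, {5}
check_property_1(g, num_components=5)          # passes: every path marks components 1..5 once
```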
55 2 The Variational Distribution Our variational distribution q assigns the same variational membership probability to mixture component k for all data points in a component-specific block Bk ∈ Bk . [sent-110, score-0.609]
56 That is, qn (k) = qBk for all xn ∈ Bk , (4) which we denote as the component-specific block constraint. [sent-111, score-0.172]
57 Unlike chunky EM, we do not assume that the data partition Bk is the same across different mixture components. [sent-112, score-0.69]
58 To further drive intuition behind the E-step complication, let us make the sum-to-one constraint for the variational distributions qn(·) explicit. [sent-115, score-0.248]
59 That is, Σk qn(k) = 1 for all data points n, which according to the above block constraint and using Property 1 can be reformulated as the |L| constraints Σk qBk(l) = 1 for all l ∈ L. [sent-116, score-0.211]
60 The closer a data block Bk is to the root of T, the more constraints simultaneously involve the same qBk. [sent-119, score-0.13]
61 For chunky EM, the E-step is straightforward, because Bk(l) = Bl and therefore Σl:Bl⊆Bk(l) λl = λl for all k = 1, . . . , K. [sent-137, score-0.468]
62 Consider a leaf node l ∈ L and recall that Kl denotes the components with Bk(l) = Bl in their partitions. [sent-147, score-0.134]
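Assembling the pieces from the last few sentences, the constrained E-step can be written as follows; this is a hedged reconstruction from the extracted text (the paper's own equation numbers are not reproduced), with one Lagrange multiplier λl per leaf constraint:

```latex
% E-step: maximize the bound over q subject to one sum-to-one constraint per leaf l
\max_{q}\; \mathcal{F}(\theta, q)
  \quad \text{s.t.} \quad \sum_{k=1}^{K} q_{B_k(l)} = 1 \quad \text{for all } l \in \mathcal{L},
\qquad
\Lambda(q, \lambda) \;=\; \mathcal{F}(\theta, q)
  \;-\; \sum_{l \in \mathcal{L}} \lambda_l \Bigl( \sum_{k=1}^{K} q_{B_k(l)} - 1 \Bigr).
```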
63 Efficient Variational M-step In the variational M-step the model parameters θ = {η1:K , π1:K } are updated by maximizing Eq. [sent-170, score-0.153]
64 Any refinement enlarges the family of variational distributions, and therefore always tightens the optimal lower bound for the log-likelihood. [sent-182, score-0.2]
65 We define a refinement unit as the refinement of one data block in the current partition for one component in the model. [sent-183, score-0.242]
66 This introduces K new free variational parameters, which is similar to a refinement step in chunky EM. [sent-187, score-0.605]
67 However, chunky EM refines the same data block across all components, which is not the case in CS-EM. [sent-188, score-0.539]
68 Algorithm excerpt: 2: repeat; 3: repeat; 4: run variational E-step and M-step. [sent-192, score-0.153]
69 Algorithm excerpt: 2: for each marked node v in T do; 3: compute q via E-step with constraints as in (14). [sent-199, score-0.122]
70 To approximate the new variational distribution q̄, we fix the value for each q̄Bk(l), with k ∈ Ku and l ∈ L, to the value obtained for the distribution q before the refinement. [sent-210, score-0.153]
71 The improvement to F that is achieved by the refinement-guiding E-step for the refinement unit refining data block v for component k is denoted ∆Fv,k, and can be computed as ∆Fv,k = Σu∈ch(v) F(θ, qBk(u)) − F(θ, qBk(v)). [sent-215, score-0.123]
72 Notice that this selective refinement step will most likely not refine the same data block for all components and therefore creates component-specific partitions. [sent-217, score-0.144]
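The selective refinement step can be organized with a priority queue over refinement units ('priority' and 'queue' also appear in the word list below); the sketch below is an assumed implementation, not the paper's pseudocode: each candidate unit (node v, component k) is scored by its estimated gain ΔF, and the highest-gain units are applied first.

```python
import heapq

def select_refinement_units(delta_F, max_units):
    """Greedily pick the refinement units with the largest estimated bound improvement.

    delta_F:   dict mapping (node, component) -> estimated gain ΔF_{v,k} from refining
               block `node` in component `component`'s partition only
    max_units: number of refinement units to apply in this selection step
    """
    # heapq is a min-heap, so push negated gains; the index breaks ties without
    # requiring (node, component) pairs to be comparable.
    heap = [(-gain, i, unit)
            for i, (unit, gain) in enumerate(delta_F.items()) if gain > 0.0]
    heapq.heapify(heap)
    chosen = []
    while heap and len(chosen) < max_units:
        _, _, unit = heapq.heappop(heap)
        chosen.append(unit)
    return chosen
```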
73 4 Experiments In this section we provide a systematic evaluation of CS-EM, chunky EM, and standard EM on synthetic data, as well as a comparison between CS-EM and chunky EM on the business-customer data mentioned in Section 1. [sent-221, score-0.93]
74 The locations for competing stores of a particular type, in this way, correspond to fixed centers for components in a mixture model. [sent-233, score-0.2]
75 After constructing the KD-tree, the first tree-level containing at least K nodes (⌈log2 K⌉) is used as the initial data partition for both chunky EM and all components in CS-EM. [sent-236, score-0.671]
76 For all algorithms (including standard EM), we randomly chose K data blocks from the initial partition and initialized parameters for the individual mixture components accordingly. [sent-237, score-0.39]
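A small sketch of this initialization under assumed inputs: `tree_levels` is the KD-tree stored level by level (each block a numpy array of its points), so the first level with at least K blocks is level ⌈log2 K⌉ for a balanced tree, and component parameters are then drawn from K randomly chosen blocks of that level. Names and the diagonal-Gaussian parameterization are illustrative, matching the E-step sketch above rather than the paper's code.

```python
import numpy as np

def initial_partition(tree_levels, num_components):
    """Blocks of the first KD-tree level containing at least K nodes (ceil(log2 K) if balanced)."""
    for blocks in tree_levels:                 # tree_levels[0] is the root level
        if len(blocks) >= num_components:
            return blocks
    return tree_levels[-1]                     # tree too shallow: fall back to the deepest level

def init_components(blocks, num_components, seed=0):
    """Initialize K diagonal-Gaussian components from K randomly chosen initial blocks."""
    rng = np.random.default_rng(seed)
    chosen = rng.choice(len(blocks), size=num_components, replace=False)
    means = np.stack([blocks[i].mean(axis=0) for i in chosen])             # (K, D)
    variances = np.stack([blocks[i].var(axis=0) + 1e-6 for i in chosen])   # (K, D), floored
    weights = np.full(num_components, 1.0 / num_components)
    return means, variances, weights
```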
77 The experiments on the business-customer data are initialized in the same way, except that the component centers are fixed and the initial data blocks that cover these centers are used for initializing the remaining parameters. [sent-239, score-0.219]
78 It starts from the CS-EM initialization and recursively, according to the KD-tree structure, merges two data blocks in a component-specific partition, if the merge has little effect on that component. [sent-241, score-0.131]
79 We recorded the log-likelihood for the test data at each iteration of the EM algorithm, and before each S-step in chunky EM and the CS-EM. [sent-245, score-0.471]
80 We now recorded the run-time for each algorithm to reach this baseline, and computed the EM-speedup factors for chunky EM, CS-EM, and CS-EM∗, each defined as the standard EM run-time divided by the run-time for the alternative algorithm. [sent-247, score-0.472]
81 First of all, we see that both CS-EM and CS-EM∗ are significantly faster than chunky EM in all experiments. [sent-250, score-0.467]
82 In general, the Σk Mk variational parameters needed for the CS-EM algorithms are far fewer than the KM parameters needed for chunky EM in order to reach an estimate of the same quality. [sent-251, score-0.625]
83 This observation can be explained by the fact that the resulting component-specific data partitions greatly refine the initial partitions, and any computational speedup due to the smarter initial partition in CS-EM∗ is therefore overwhelmed. [sent-256, score-0.373]
84 However, the last plot in Figure 4 reveals an important difference between chunky EM and CS-EM: with a fixed ratio between number of data points and number of clusters, the EM-speedup declines a lot for chunky EM, as the number of clusters and data points increases. [sent-259, score-0.989]
85 To compare run-times for chunky EM ... (footnote 3: Let µ and Σ be the mean and variance parameters for an initial component, and let µp, µl, and µr denote the sample means for the data in the considered parent, left, and right child.) [sent-264, score-0.486]
86 Figure 5: A comparison of run-time and final number of variational parameters for chunky EM vs. CS-EM. [sent-270, score-0.153]
87 Figure 5 shows the speedup (time ratio) and the reduction in number of variational parameters (parameter ratio) for CS-EM compared to chunky EM, as evaluated on exemplary types of businesses. [sent-287, score-0.717]
88 Again, CS-EM is significantly faster than chunky EM, and the speedup is achieved by better targeting of the variational distribution through the component-specific partitions. [sent-288, score-0.711]
89 CS-EM combines the best from two major directions in the literature regarding speedup of EM for mixture modeling. [sent-290, score-0.191]
90 In contrast, CS-EM is provably convergent, even for arbitrarily rough partitions, if extreme speedup is needed. [sent-294, score-0.133]
91 The granularity of partitions in [10] is controlled by a user-specified threshold on the minimum and maximum membership probabilities that are reachable within the boundaries of a node in the KD-tree. [sent-295, score-0.257]
92 Finally, [10] “prunes” a component (sets the membership probability to zero) for data far away from the component. [sent-298, score-0.119]
93 The second direction of speedup approaches is based on the variational EM framework [11]. [sent-300, score-0.244]
94 The chunky EM in [17, 16] is the approach most similar to our CS-EM. [sent-305, score-0.452]
95 Both are based on the variational EM framework and therefore guarantee convergence, but CS-EM is faster and scales better in the number of clusters. [sent-306, score-0.184]
96 Sub-sampling therefore has the same computational complexity as chunky EM, and our results suggest that we should expect CS-EM to be much faster than sub-sampling and to scale better in the number of mixture components. [sent-309, score-0.599]
97 Finally, we exemplified our work by using KD-trees as the tree-consistent partition structure for generating the component-specific partitions in CS-EM, which limited its effectiveness in high dimensions. [sent-310, score-0.217]
98 However, any hierarchical partition structure can be used, and the work in [8] therefore suggests that changing to an anchor tree (a special kind of metric tree [15]) will also render CS-EM effective in high dimensions, under the assumption of lower intrinsic dimensionality for the data. [sent-311, score-0.201]
99 , k-means clustering [6, 13, 5] and 2) general EM applications beyond mixture modeling. [sent-315, score-0.123]
100 Very fast EM-based mixture model clustering using multiresolution kd-trees. [sent-379, score-0.123]
wordName wordTfidf (topN-words)
[('qbk', 0.595), ('chunky', 0.452), ('bk', 0.32), ('em', 0.312), ('gbk', 0.155), ('variational', 0.153), ('mpt', 0.143), ('partition', 0.119), ('bl', 0.102), ('nement', 0.102), ('mixture', 0.1), ('partitions', 0.098), ('ku', 0.094), ('speedup', 0.091), ('blocks', 0.082), ('qn', 0.081), ('block', 0.068), ('membership', 0.064), ('re', 0.061), ('kl', 0.059), ('marked', 0.057), ('leaf', 0.055), ('mk', 0.051), ('kv', 0.043), ('provable', 0.042), ('components', 0.041), ('fs', 0.041), ('ch', 0.041), ('node', 0.038), ('component', 0.036), ('units', 0.036), ('granularity', 0.036), ('xxt', 0.036), ('ft', 0.034), ('accelerating', 0.029), ('km', 0.028), ('ql', 0.027), ('ke', 0.027), ('stores', 0.027), ('marks', 0.027), ('constraints', 0.027), ('synthetic', 0.026), ('ka', 0.026), ('tree', 0.025), ('nodes', 0.025), ('intertwined', 0.024), ('nail', 0.024), ('salon', 0.024), ('business', 0.023), ('clustering', 0.023), ('xn', 0.023), ('leaves', 0.022), ('probabilities', 0.021), ('exemplary', 0.021), ('thiesson', 0.021), ('priority', 0.021), ('unmarked', 0.021), ('coarse', 0.02), ('reach', 0.02), ('lagrange', 0.02), ('data', 0.019), ('convergence', 0.019), ('unnecessarily', 0.019), ('nements', 0.019), ('queue', 0.019), ('selectively', 0.019), ('nes', 0.018), ('ru', 0.018), ('kb', 0.018), ('stored', 0.018), ('children', 0.018), ('accordingly', 0.017), ('verbeek', 0.017), ('default', 0.017), ('centers', 0.017), ('path', 0.017), ('exp', 0.017), ('therefore', 0.016), ('root', 0.016), ('parent', 0.016), ('notice', 0.016), ('points', 0.016), ('initialization', 0.016), ('customers', 0.016), ('demanding', 0.016), ('lower', 0.016), ('initial', 0.015), ('bound', 0.015), ('ning', 0.015), ('ratio', 0.015), ('competing', 0.015), ('faster', 0.015), ('variation', 0.014), ('merge', 0.014), ('multipliers', 0.014), ('initialized', 0.014), ('microsoft', 0.014), ('intuition', 0.014), ('truly', 0.013), ('log', 0.013)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
Author: Bo Thiesson, Chong Wang
Abstract: Remarkably easy implementation and guaranteed convergence has made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data. 1
2 0.1895756 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
Author: Akshat Kumar, Shlomo Zilberstein
Abstract: Computing a maximum a posteriori (MAP) assignment in graphical models is a crucial inference problem for many practical applications. Several provably convergent approaches have been successfully developed using linear programming (LP) relaxation of the MAP problem. We present an alternative approach, which transforms the MAP problem into that of inference in a mixture of simple Bayes nets. We then derive the Expectation Maximization (EM) algorithm for this mixture that also monotonically increases a lower bound on the MAP assignment until convergence. The update equations for the EM algorithm are remarkably simple, both conceptually and computationally, and can be implemented using a graph-based message passing paradigm similar to max-product computation. Experiments on the real-world protein design dataset show that EM’s convergence rate is significantly higher than the previous LP relaxation based approach MPLP. EM also achieves a solution quality within 95% of optimal for most instances. 1
3 0.1296892 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
Author: Armand Joulin, Jean Ponce, Francis R. Bach
Abstract: Dimensionality reduction is commonly used in the setting of multi-label supervised classification to control the learning capacity and to provide a meaningful representation of the data. We introduce a simple forward probabilistic model which is a multinomial extension of reduced rank regression, and show that this model provides a probabilistic interpretation of discriminative clustering methods with added benefits in terms of number of hyperparameters and optimization. While the expectation-maximization (EM) algorithm is commonly used to learn these probabilistic models, it usually leads to local maxima because it relies on a non-convex cost function. To avoid this problem, we introduce a local approximation of this cost function, which in turn leads to a quadratic non-convex optimization problem over a product of simplices. In order to maximize quadratic functions, we propose an efficient algorithm based on convex relaxations and lowrank representations of the data, capable of handling large-scale problems. Experiments on text document classification show that the new model outperforms other supervised dimensionality reduction methods, while simulations on unsupervised clustering show that our probabilistic formulation has better properties than existing discriminative clustering methods. 1
4 0.087674826 284 nips-2010-Variational bounds for mixed-data factor analysis
Author: Mohammad E. Khan, Guillaume Bouchard, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method. 1
5 0.06825427 283 nips-2010-Variational Inference over Combinatorial Spaces
Author: Alexandre Bouchard-côté, Michael I. Jordan
Abstract: Since the discovery of sophisticated fully polynomial randomized algorithms for a range of #P problems [1, 2, 3], theoretical work on approximate inference in combinatorial spaces has focused on Markov chain Monte Carlo methods. Despite their strong theoretical guarantees, the slow running time of many of these randomized algorithms and the restrictive assumptions on the potentials have hindered the applicability of these algorithms to machine learning. Because of this, in applications to combinatorial spaces simple exact models are often preferred to more complex models that require approximate inference [4]. Variational inference would appear to provide an appealing alternative, given the success of variational methods for graphical models [5]; unfortunately, however, it is not obvious how to develop variational approximations for combinatorial objects such as matchings, partial orders, plane partitions and sequence alignments. We propose a new framework that extends variational inference to a wide range of combinatorial spaces. Our method is based on a simple assumption: the existence of a tractable measure factorization, which we show holds in many examples. Simulations on a range of matching models show that the algorithm is more general and empirically faster than a popular fully polynomial randomized algorithm. We also apply the framework to the problem of multiple alignment of protein sequences, obtaining state-of-the-art results on the BAliBASE dataset [6]. 1
6 0.064385213 166 nips-2010-Minimum Average Cost Clustering
7 0.063983664 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates
8 0.059066318 108 nips-2010-Graph-Valued Regression
9 0.055774912 194 nips-2010-Online Learning for Latent Dirichlet Allocation
10 0.055414621 276 nips-2010-Tree-Structured Stick Breaking for Hierarchical Data
11 0.052924749 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
12 0.050940841 268 nips-2010-The Neural Costs of Optimal Control
13 0.049798775 32 nips-2010-Approximate Inference by Compilation to Arithmetic Circuits
14 0.042579826 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks
15 0.041028623 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
16 0.039857171 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
17 0.038450092 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
18 0.036562879 266 nips-2010-The Maximal Causes of Natural Scenes are Edge Filters
19 0.035722036 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone
20 0.03508608 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
topicId topicWeight
[(0, 0.11), (1, 0.024), (2, 0.024), (3, 0.021), (4, -0.106), (5, 0.011), (6, 0.024), (7, 0.047), (8, 0.071), (9, 0.036), (10, -0.012), (11, -0.018), (12, 0.093), (13, -0.024), (14, 0.001), (15, -0.052), (16, -0.025), (17, 0.076), (18, 0.054), (19, -0.065), (20, 0.03), (21, -0.057), (22, -0.006), (23, 0.079), (24, 0.001), (25, -0.078), (26, -0.128), (27, 0.174), (28, -0.151), (29, -0.063), (30, 0.021), (31, 0.003), (32, 0.026), (33, 0.0), (34, 0.129), (35, 0.04), (36, 0.061), (37, -0.015), (38, 0.012), (39, 0.091), (40, -0.047), (41, 0.058), (42, 0.114), (43, 0.036), (44, 0.066), (45, -0.05), (46, -0.049), (47, -0.02), (48, 0.013), (49, -0.017)]
simIndex simValue paperId paperTitle
same-paper 1 0.95126152 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
Author: Bo Thiesson, Chong Wang
Abstract: Remarkably easy implementation and guaranteed convergence has made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data. 1
2 0.68856919 164 nips-2010-MAP Estimation for Graphical Models by Likelihood Maximization
Author: Akshat Kumar, Shlomo Zilberstein
Abstract: Computing a maximum a posteriori (MAP) assignment in graphical models is a crucial inference problem for many practical applications. Several provably convergent approaches have been successfully developed using linear programming (LP) relaxation of the MAP problem. We present an alternative approach, which transforms the MAP problem into that of inference in a mixture of simple Bayes nets. We then derive the Expectation Maximization (EM) algorithm for this mixture that also monotonically increases a lower bound on the MAP assignment until convergence. The update equations for the EM algorithm are remarkably simple, both conceptually and computationally, and can be implemented using a graph-based message passing paradigm similar to max-product computation. Experiments on the real-world protein design dataset show that EM’s convergence rate is significantly higher than the previous LP relaxation based approach MPLP. EM also achieves a solution quality within 95% of optimal for most instances. 1
3 0.5545122 284 nips-2010-Variational bounds for mixed-data factor analysis
Author: Mohammad E. Khan, Guillaume Bouchard, Kevin P. Murphy, Benjamin M. Marlin
Abstract: We propose a new variational EM algorithm for fitting factor analysis models with mixed continuous and categorical observations. The algorithm is based on a simple quadratic bound to the log-sum-exp function. In the special case of fully observed binary data, the bound we propose is significantly faster than previous variational methods. We show that EM is significantly more robust in the presence of missing data compared to treating the latent factors as parameters, which is the approach used by exponential family PCA and other related matrix-factorization methods. A further benefit of the variational approach is that it can easily be extended to the case of mixtures of factor analyzers, as we show. We present results on synthetic and real data sets demonstrating several desirable properties of our proposed method. 1
Author: Ken Takiyama, Masato Okada
Abstract: We propose an algorithm for simultaneously estimating state transitions among neural states, the number of neural states, and nonstationary firing rates using a switching state space model (SSSM). This algorithm enables us to detect state transitions on the basis of not only the discontinuous changes of mean firing rates but also discontinuous changes in temporal profiles of firing rates, e.g., temporal correlation. We construct a variational Bayes algorithm for a non-Gaussian SSSM whose non-Gaussian property is caused by binary spike events. Synthetic data analysis reveals that our algorithm has the high performance for estimating state transitions, the number of neural states, and nonstationary firing rates compared to previous methods. We also analyze neural data that were recorded from the medial temporal area. The statistically detected neural states probably coincide with transient and sustained states that have been detected heuristically. Estimated parameters suggest that our algorithm detects the state transition on the basis of discontinuous changes in the temporal correlation of firing rates, which transitions previous methods cannot detect. This result suggests that our algorithm is advantageous in real-data analysis.
5 0.53901666 283 nips-2010-Variational Inference over Combinatorial Spaces
Author: Alexandre Bouchard-côté, Michael I. Jordan
Abstract: Since the discovery of sophisticated fully polynomial randomized algorithms for a range of #P problems [1, 2, 3], theoretical work on approximate inference in combinatorial spaces has focused on Markov chain Monte Carlo methods. Despite their strong theoretical guarantees, the slow running time of many of these randomized algorithms and the restrictive assumptions on the potentials have hindered the applicability of these algorithms to machine learning. Because of this, in applications to combinatorial spaces simple exact models are often preferred to more complex models that require approximate inference [4]. Variational inference would appear to provide an appealing alternative, given the success of variational methods for graphical models [5]; unfortunately, however, it is not obvious how to develop variational approximations for combinatorial objects such as matchings, partial orders, plane partitions and sequence alignments. We propose a new framework that extends variational inference to a wide range of combinatorial spaces. Our method is based on a simple assumption: the existence of a tractable measure factorization, which we show holds in many examples. Simulations on a range of matching models show that the algorithm is more general and empirically faster than a popular fully polynomial randomized algorithm. We also apply the framework to the problem of multiple alignment of protein sequences, obtaining state-of-the-art results on the BAliBASE dataset [6]. 1
6 0.50456512 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
7 0.47957185 165 nips-2010-MAP estimation in Binary MRFs via Bipartite Multi-cuts
8 0.44068024 106 nips-2010-Global Analytic Solution for Variational Bayesian Matrix Factorization
9 0.3498235 288 nips-2010-Worst-case bounds on the quality of max-product fixed-points
10 0.34060249 194 nips-2010-Online Learning for Latent Dirichlet Allocation
11 0.32839486 55 nips-2010-Cross Species Expression Analysis using a Dirichlet Process Mixture Model with Latent Matchings
12 0.32786942 268 nips-2010-The Neural Costs of Optimal Control
13 0.32647547 29 nips-2010-An Approximate Inference Approach to Temporal Optimization in Optimal Control
14 0.32413742 171 nips-2010-Movement extraction by detecting dynamics switches and repetitions
15 0.31011763 62 nips-2010-Discriminative Clustering by Regularized Information Maximization
16 0.30768123 139 nips-2010-Latent Variable Models for Predicting File Dependencies in Large-Scale Software Development
17 0.30711564 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
18 0.30290934 33 nips-2010-Approximate inference in continuous time Gaussian-Jump processes
19 0.30118707 223 nips-2010-Rates of convergence for the cluster tree
20 0.30084094 84 nips-2010-Exact inference and learning for cumulative distribution functions on loopy graphs
topicId topicWeight
[(13, 0.06), (17, 0.01), (27, 0.082), (30, 0.082), (35, 0.011), (45, 0.2), (50, 0.047), (52, 0.019), (53, 0.275), (60, 0.028), (77, 0.05), (78, 0.012), (90, 0.023)]
simIndex simValue paperId paperTitle
1 0.84117043 226 nips-2010-Repeated Games against Budgeted Adversaries
Author: Jacob D. Abernethy, Manfred K. Warmuth
Abstract: We study repeated zero-sum games against an adversary on a budget. Given that an adversary has some constraint on the sequence of actions that he plays, we consider what ought to be the player’s best mixed strategy with knowledge of this budget. We show that, for a general class of normal-form games, the minimax strategy is indeed efficiently computable and relies on a “random playout” technique. We give three diverse applications of this new algorithmic template: a cost-sensitive “Hedge” setting, a particular problem in Metrical Task Systems, and the design of combinatorial prediction markets. 1
2 0.8065083 2 nips-2010-A Bayesian Approach to Concept Drift
Author: Stephen Bach, Mark Maloof
Abstract: To cope with concept drift, we placed a probability distribution over the location of the most-recent drift point. We used Bayesian model comparison to update this distribution from the predictions of models trained on blocks of consecutive observations and pruned potential drift points with low probability. We compare our approach to a non-probabilistic method for drift and a probabilistic method for change-point detection. In our experiments, our approach generally yielded improved accuracy and/or speed over these other methods. 1
same-paper 3 0.77929199 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
Author: Bo Thiesson, Chong Wang
Abstract: Remarkably easy implementation and guaranteed convergence has made the EM algorithm one of the most used algorithms for mixture modeling. On the downside, the E-step is linear in both the sample size and the number of mixture components, making it impractical for large-scale data. Based on the variational EM framework, we propose a fast alternative that uses component-specific data partitions to obtain a sub-linear E-step in sample size, while the algorithm still maintains provable convergence. Our approach builds on previous work, but is significantly faster and scales much better in the number of mixture components. We demonstrate this speedup by experiments on large-scale synthetic and real data. 1
4 0.76182812 208 nips-2010-Policy gradients in linearly-solvable MDPs
Author: Emanuel Todorov
Abstract: We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy gradients for continuous-time stochastic systems.
5 0.75450796 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
Author: Xinhua Zhang, Ankan Saha, S.v.n. Vishwanathan
Abstract: In a recent paper Joachims [1] presented SVM-Perf, a cutting plane method (CPM) for training linear Support Vector Machines (SVMs) which converges to an accurate solution in O(1/ε²) iterations. By tightening the analysis, Teo et al. [2] showed that O(1/ε) iterations suffice. Given the impressive convergence speed of CPM on a number of practical problems, it was conjectured that these rates could be further improved. In this paper we disprove this conjecture. We present counter examples which are not only applicable for training linear SVMs with hinge loss, but also hold for support vector methods which optimize a multivariate performance score. However, surprisingly, these problems are not inherently hard. By exploiting the structure of the objective function we can devise an algorithm that converges in O(1/√ε) iterations. 1
6 0.66637594 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
7 0.66616124 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
8 0.66600299 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
9 0.66531134 179 nips-2010-Natural Policy Gradient Methods with Parameter-based Exploration for Control Tasks
10 0.66194141 155 nips-2010-Learning the context of a category
11 0.65704173 64 nips-2010-Distributionally Robust Markov Decision Processes
12 0.65601701 238 nips-2010-Short-term memory in neuronal networks through dynamical compressed sensing
13 0.65514284 160 nips-2010-Linear Complementarity for Regularized Policy Evaluation and Improvement
14 0.65469211 63 nips-2010-Distributed Dual Averaging In Networks
15 0.65394211 117 nips-2010-Identifying graph-structured activation patterns in networks
16 0.65374631 268 nips-2010-The Neural Costs of Optimal Control
17 0.65369153 51 nips-2010-Construction of Dependent Dirichlet Processes based on Poisson Processes
18 0.65334386 194 nips-2010-Online Learning for Latent Dirichlet Allocation
19 0.65200585 4 nips-2010-A Computational Decision Theory for Interactive Assistants
20 0.65159321 85 nips-2010-Exact learning curves for Gaussian process regression on large random graphs