nips nips2013 nips2013-238 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Xinghao Pan, Joseph E. Gonzalez, Stefanie Jegelka, Tamara Broderick, Michael Jordan
Abstract: Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. We consider an intermediate alternative in which algorithms optimistically assume that conflicts are unlikely and if conflicts do arise a conflict-resolution protocol is invoked. We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. We evaluate our methods via large-scale experiments in a cluster computing environment. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Research on distributed machine learning algorithms has focused primarily on one of two extremes—algorithms that obey strict concurrency constraints or algorithms that obey few or no such constraints. [sent-4, score-0.317]
2 We view this “optimistic concurrency control” paradigm as particularly appropriate for large-scale machine learning algorithms, particularly in the unsupervised setting. [sent-6, score-0.287]
3 We demonstrate our approach in three problem areas: clustering, feature learning and online facility location. [sent-7, score-0.175]
4 We evaluate our methods via large-scale experiments in a cluster computing environment. [sent-8, score-0.096]
5 1 Introduction The desire to apply machine learning to increasingly larger datasets has pushed the machine learning community to address the challenges of distributed algorithm design: partitioning and coordinating computation across the processing resources. [sent-9, score-0.075]
6 For these embarrassingly parallel tasks, the machine learning community has embraced the map-reduce paradigm, which provides a template for constructing distributed algorithms that are fault tolerant, scalable, and easy to study. [sent-11, score-0.119]
7 Many algorithms of interest (e.g., collapsed Gibbs sampling or coordinate ascent) were developed and studied in the serial setting. [sent-14, score-0.181]
8 The mutual exclusion approach, adopted by [1] and [2], guarantees a serializable execution preserving the theoretical properties of the serial algorithm but at the expense of parallelism and costly locking overhead. [sent-17, score-0.434]
9 In this paper we explore a third approach, optimistic concurrency control (OCC) [5] which offers the performance gains of the coordination-free approach while at the same time ensuring a serializable execution and preserving the theoretical properties of the serial algorithm. [sent-19, score-0.692]
10 1 We apply OCC to distributed nonparametric unsupervised learning—including but not limited to clustering—and implement distributed versions of the DP-Means [6], BP-Means [7], and online facility location (OFL) algorithms. [sent-23, score-0.375]
11 Reinterpretation of online nonparametric clustering in the form of facility location with approximation guarantees. [sent-28, score-0.234]
12 Many machine learning algorithms iteratively transform some global state (e.g., model parameters or variable assignments), giving the illusion of serial dependencies between each operation. [sent-35, score-0.181]
13 This opportunity for serializable concurrency forms the foundation of distributed database systems. [sent-37, score-0.421]
14 For example, two customers may concurrently make purchases exhausting the inventory of unrelated products, but if they try to purchase the same product then we may need to serialize their purchases to ensure sufficient inventory. [sent-38, score-0.21]
15 This might work for an unpopular, rare product, but if we are interested in selling a popular product for which we have a large inventory, the serialization overhead could lead to unnecessarily slow response times. [sent-40, score-0.188]
16 To address this problem, the database community has adopted optimistic concurrency control (OCC) [5], in which the system tries to satisfy the customers' requests without locking and corrects transactions that could lead to negative inventory. [sent-41, score-0.497]
17 Optimistic concurrency control exploits situations where most operations can execute concurrently without conflicting or violating serialization invariants. [sent-44, score-0.452]
18 For example, given sufficient inventory the order in which customers are satisfied is immaterial and concurrent operations can be executed serially to yield the same final result. [sent-45, score-0.297]
19 However, in the rare event that inventory is nearly depleted two concurrent purchases may not be serializable since the inventory can never be negative. [sent-46, score-0.254]
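To make the inventory example concrete, here is a minimal sketch of the optimistic pattern (illustrative Python only; the Purchase record and validate_purchases function are our own names, not taken from the paper or from any particular database system): purchases execute without locks, and a short serial validation step accepts or rejects only the proposals that touch contested inventory.

```python
# Minimal sketch of optimistic concurrency control for the inventory example.
# All names here are illustrative, not from the paper or a real database API.
from dataclasses import dataclass

@dataclass
class Purchase:
    customer: str
    product: str
    quantity: int

def validate_purchases(inventory, proposed):
    """Serially validate optimistically executed purchases.

    Purchases that would drive inventory negative are rejected; all others
    are accepted in some serial order, which suffices for a serializable
    execution.
    """
    accepted, rejected = [], []
    for p in proposed:                              # only contested items reach this point
        if inventory.get(p.product, 0) >= p.quantity:
            inventory[p.product] -= p.quantity
            accepted.append(p)
        else:
            rejected.append(p)                      # conflict: resolve, e.g. refuse the order
    return accepted, rejected

inventory = {"widget": 3}
proposed = [Purchase("alice", "widget", 2), Purchase("bob", "widget", 2)]
accepted, rejected = validate_purchases(inventory, proposed)
# One purchase is accepted and one rejected, exactly as some serial execution would do.
```

In the common case of ample inventory the validation loop does essentially no work, which is exactly the regime OCC targets.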
20 By shifting the cost of concurrency control to rare events, we can admit more costly concurrency control mechanisms. [sent-47, score-0.571]
21 Applying OCC requires a notion of transactions (i.e., operations or collections of operations), a mechanism to detect when a transaction violates serialization invariants (i.e., a conflict), and a method to correct or reject violating transactions. [sent-52, score-0.215]
22 Optimistic concurrency control is most effective when the cost of validating concurrent transactions is small and conflicts occur infrequently. [sent-57, score-0.304]
23 Machine learning algorithms are ideal for optimistic concurrency control. [sent-58, score-0.331]
24 Similarly, symmetry in our models often provides the flexibility to reorder serial operations while preserving algorithm invariants. [sent-60, score-0.277]
25 Because the models encode the dependency structure, we can easily detect when an operation violates serial invariants and correct by rejecting the change and rerunning the computation. [sent-61, score-0.24]
26 As a consequence OCC allows us to easily construct provably correct and efficient distributed algorithms without the need to develop new theoretical tools to analyze complex non-deterministic distributed behavior. [sent-63, score-0.15]
27 1 The OCC Pattern for Machine Learning Optimistic concurrency control can be distilled to a simple pattern (meta-algorithm) for the design and implementation of distributed machine learning systems. [sent-65, score-0.35]
28 Each processor maintains a replicated view of the global state and serially applies the learning algorithm as a sequence of operations on its assigned data and the global state. [sent-67, score-0.241]
29 If an operation mutates the global state in a way that preserves the serialization invariants then the operation is accepted locally and its effect on the global state, if any, is eventually replicated to other processors. [sent-68, score-0.316]
30 However, if an operation could potentially conflict with operations on other processors then it is sent to a unique serializing processor where it is rejected or corrected and the resulting global state change is eventually replicated to the rest of the processors. [sent-69, score-0.23]
31 Meanwhile the originating processor either tentatively accepts the state change (if a rollback operator is defined) or proceeds as though the operation has been deferred to some point in the future. [sent-70, score-0.114]
32 Within an epoch t, b data points B(p, t) are evenly assigned to each of the P processors. [sent-72, score-0.187]
33 Any state changes or serialization operations are transmitted at the end of the epoch and processed before the next epoch. [sent-73, score-0.277]
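As a rough illustration of this epoch-level pattern, the sketch below uses our own simplified Python with hypothetical propose and validate callbacks standing in for the algorithm-specific logic; it is not the authors' implementation, only a skeleton of the bulk-synchronous structure described above.

```python
# Sketch of the OCC meta-algorithm over bulk-synchronous epochs of P*b points.
# `propose`, `validate`, and the update representation are hypothetical placeholders.
from copy import deepcopy

def occ_epochs(data, P, b, state, propose, validate):
    """propose(block, snapshot) -> (safe_updates, risky_proposals), one call per
    simulated processor; validate(risky_proposals, state) -> accepted updates,
    run serially at the end of each epoch.  Updates are callables state -> state."""
    n = len(data)
    for start in range(0, n, P * b):
        epoch = data[start:start + P * b]
        safe, risky = [], []
        # Conceptually parallel: processor p works on its block B(p, t)
        # against a replicated snapshot of the global state.
        for p in range(P):
            block = epoch[p * b:(p + 1) * b]
            s, r = propose(block, deepcopy(state))
            safe.extend(s)
            risky.extend(r)
        for update in safe:                       # non-conflicting changes commit directly
            state = update(state)
        for update in validate(risky, state):     # serial validation / conflict resolution
            state = update(state)
    return state
```

The instantiations discussed below (DP-Means, OFL, BP-Means) differ only in what counts as a risky proposal and how the serial validation resolves it.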
34 In this paper we apply the OCC pattern to machine learning problems that have a more discrete, combinatorial flavor—in particular unsupervised clustering and latent feature learning problems. [sent-76, score-0.09]
35 These problems exhibit symmetry via their invariance to both data permutation and cluster or feature permutation. [sent-77, score-0.125]
36 Together with the sparsity of interacting operations in their existing serial algorithms, these problems offer a unique opportunity to develop OCC algorithms. [sent-78, score-0.26]
37 The algorithms considered to date in this literature have been developed and analyzed in the serial setting; our goal is to explore distributed algorithms for optimizing these cost functions that preserve the structure and analysis of their serial counterparts. [sent-81, score-0.437]
38 Like the K-means algorithm, DP-Means alternates between updating the cluster assignment zi for each point xi and recomputing the centroids C = {µ1, . . . , µK} associated with each cluster. [sent-85, score-0.206]
39 However, DP-Means differs in that the number of clusters is not fixed a priori. [sent-86, score-0.092]
40 Instead, if the distance from a given data point to all existing cluster centroids is greater than a parameter λ, then a new cluster is created. [sent-87, score-0.192]
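For reference, one serial DP-Means iteration might look like the following minimal sketch (our own simplified Python, assuming Euclidean distances and initializing with the first point when no centers are given; it is not the authors' code).

```python
import numpy as np

def dp_means_iteration(X, centers, lam):
    """One DP-Means iteration: assign points, creating a new cluster whenever
    the nearest center is farther than lam, then recompute the centroids."""
    centers = list(centers)
    if not centers:
        centers.append(X[0].copy())                 # simple initialization assumption
    z = np.zeros(len(X), dtype=int)
    for i, x in enumerate(X):                       # phase 1: assignments
        d2 = [np.sum((x - c) ** 2) for c in centers]
        k = int(np.argmin(d2))
        if d2[k] > lam ** 2:                        # no center within lam: open a cluster
            centers.append(x.copy())
            k = len(centers) - 1
        z[i] = k
    for k in range(len(centers)):                   # phase 2: recompute centroids
        members = X[z == k]
        if len(members) > 0:
            centers[k] = members.mean(axis=0)
    return centers, z
```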
41 While the second phase is trivially parallel, the process of introducing clusters in the first phase is inherently serial. [sent-88, score-0.152]
42 However, clusters tend to be introduced infrequently, and thus DP-Means provides an opportunity for OCC. [sent-89, score-0.118]
43 We present an OCC parallelization of the DP-Means algorithm in which each iteration of the serial DP-Means algorithm is divided into N/(P b) bulk-synchronous epochs. [sent-91, score-0.215]
44 During each epoch t, each processor p evaluates the cluster membership of its assigned data {xi}i∈B(p,t) using the cluster centers C from the previous epoch and optimistically proposes a new set of cluster centers Ĉ. [sent-93, score-0.835]
45 At the end of each epoch the proposed cluster centers Ĉ are serially validated using the DPValidate routine. [sent-94, score-0.369]
46 The validation process accepts cluster centers that are not covered by (i.e., not within λ of) already accepted cluster centers. [sent-100, score-0.262]
47 When a cluster center is rejected we update its reference to point to the already accepted center, thereby correcting the original point assignment. [sent-103, score-0.224]
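A hedged sketch of this serial validation step is given below (simplified Python in the spirit of the DPValidate routine described above; the exact bookkeeping for re-pointing rejected proposals is our own approximation).

```python
import numpy as np

def dp_validate(proposed, accepted, lam):
    """Serially validate proposed cluster centers.

    A proposal is accepted only if it is farther than lam from every
    previously accepted center; otherwise it is mapped to the nearest
    accepted center so the proposing processor can correct its assignments.
    """
    mapping = {}                                   # proposal index -> accepted center index
    accepted = list(accepted)
    for j, c in enumerate(proposed):
        d2 = [np.sum((c - a) ** 2) for a in accepted]
        if not d2 or min(d2) > lam ** 2:
            accepted.append(c)                     # new center survives validation
            mapping[j] = len(accepted) - 1
        else:
            mapping[j] = int(np.argmin(d2))        # rejected: point to the covering center
    return accepted, mapping
```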
48–49 However, while DP-Means allows the clusters to be arbitrary points (e.g., C ⊂ R^D), FL constrains the clusters to be points C ⊆ F in a set of candidate locations F. [sent-106, score-0.141] [sent-108, score-0.141]
50 We build on the online facility location (OFL) algorithm described by Meyerson [10]. [sent-112, score-0.179]
51 The OFL algorithm processes each data point x serially in a single pass by either adding x to the set of clusters with probability min(1, minµ∈C ‖x − µ‖²/λ²) or assigning x to the nearest existing cluster. [sent-113, score-0.22]
52 The OCC OFL algorithm differs only in that clusters are introduced and validated stochastically—the validation process ensures that new clusters are accepted with the same probability as in the serial algorithm. [sent-117, score-0.518]
53 [Algorithm 4 (Parallel OFL) and Algorithm 5 (OFLValidate); input: same as DP-Means; for each epoch t = 1 to N/(P b), Ĉ ← ∅ and each processor p ∈ {1, . . . , P} proposes centers that are then validated.] [sent-120, score-0.113]
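The following sketch (our own simplified Python, not the paper's pseudocode or its Spark code) illustrates the serial OFL acceptance rule and one natural reading of the stochastic validation: a proposal made with probability p_old against last epoch's centers is kept with probability p_new/p_old, computed against the centers accepted so far, so that the overall acceptance probability matches the serial rule.

```python
import numpy as np
rng = np.random.default_rng(0)

def ofl_accept_prob(x, centers, lam):
    """Serial OFL rule: open a facility at x with probability
    min(1, min_{mu in C} ||x - mu||^2 / lam^2)."""
    if not centers:
        return 1.0
    d2 = min(np.sum((x - mu) ** 2) for mu in centers)
    return min(1.0, d2 / lam ** 2)

def serial_ofl(X, lam):
    centers = []
    for x in X:
        if rng.random() < ofl_accept_prob(x, centers, lam):
            centers.append(x.copy())            # x becomes a new cluster center
        # otherwise x is assigned to its nearest existing center
    return centers

def ofl_validate(proposals, centers, lam):
    """Hedged sketch of the stochastic validation: thin each proposal so the
    overall acceptance probability equals the serial rule's probability."""
    accepted = list(centers)
    for x, p_old in proposals:                  # p_old recorded by the proposing processor
        p_new = ofl_accept_prob(x, accepted, lam)
        if p_old > 0 and rng.random() < p_new / p_old:
            accepted.append(x.copy())
    return accepted
```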
54 As with serial DP-means, there are two phases in serial BP-means. [sent-124, score-0.362]
55 In the first phase, each data point xi is labeled with binary assignments from a collection of features (zik = 0 if xi doesn't belong to feature k; otherwise zik = 1) to construct a representation xi ≈ Σk zik fk. [sent-126, score-0.452]
56 In the second phase, parameter values (the feature means fk ∈ C) are updated based on the assignments. [sent-127, score-0.108]
57 While the second phase is trivially parallel, the inherently serial nature of the first phase combined with the infrequent introduction of new features points to the usefulness of OCC in this domain. [sent-129, score-0.313]
58 Each transaction operates on a data point xi in two phases. [sent-131, score-0.083]
59 In the first, analysis phase, the optimal representation Σk zik fk is found. [sent-132, score-0.172]
60 If xi is not well represented (i.e., ‖xi − Σk zik fk‖ > λ), the difference is proposed as a new feature in the second, validation phase. [sent-135, score-0.267]
61 At the end of epoch t, the proposed features {fi^new} are serially validated to obtain a set of accepted features C̃. [sent-136, score-0.419]
62 For each proposed feature fi^new, the validation process first finds the optimal representation fi^new ≈ Σ_{fk∈C̃} zik fk using the newly accepted features. [sent-137, score-0.572]
63 If fi^new is not well represented, the difference fi^new − Σ_{fk∈C̃} zik fk is added to C̃ and accepted as a new feature. [sent-138, score-0.351]
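A minimal sketch of the propose/validate steps is shown below (our own simplified Python; the binary assignment uses a greedy pass where the analysis phase described above finds the optimal assignment, so treat it as illustrative only).

```python
import numpy as np

def assign_features(x, F):
    """Greedy binary assignment of x to features in F (a simplification of
    the optimal assignment used in the analysis phase)."""
    z = np.zeros(len(F))
    residual = x.copy()
    for k, f in enumerate(F):
        if np.sum((residual - f) ** 2) < np.sum(residual ** 2):
            z[k] = 1
            residual = residual - f
    return z, residual

def bp_propose(X_block, F, lam):
    """Analysis phase on one processor: residuals that are not well
    represented (||x - sum_k z_ik f_k|| > lam) become proposed features."""
    proposals = []
    for x in X_block:
        _, r = assign_features(x, F)
        if np.sum(r ** 2) > lam ** 2:
            proposals.append(r)
    return proposals

def bp_validate(proposals, lam):
    """Serial validation: re-express each proposal with the newly accepted
    features and keep only residuals that remain poorly represented."""
    accepted = []
    for f_new in proposals:
        _, r = assign_features(f_new, accepted)
        if np.sum(r ** 2) > lam ** 2:
            accepted.append(r)
    return accepted
```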
64 The feature means update F ← (Z^T Z)^{−1} Z^T X can be evaluated as a single transaction by computing the sums Z^T Z = Σi zi zi^T (where zi is a K × 1 column vector, so zi zi^T is a K × K matrix) and Z^T X = Σi zi xi^T in parallel. [sent-140, score-0.457]
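Because both sufficient statistics are sums over data points, each processor can accumulate partial sums over its own block and a single reduction completes the update; a minimal sketch follows (our own Python, with lists of blocks standing in for the per-processor partitions, and assuming Z^T Z is invertible).

```python
import numpy as np

def partial_stats(Z_block, X_block):
    """Per-processor partial sums of the sufficient statistics."""
    return Z_block.T @ Z_block, Z_block.T @ X_block     # (K x K) and (K x D)

def update_feature_means(Z_blocks, X_blocks):
    """F <- (Z^T Z)^{-1} Z^T X computed from per-block partial sums; in a real
    deployment the partial sums would be reduced across machines."""
    K = Z_blocks[0].shape[1]
    ZtZ = np.zeros((K, K))
    ZtX = np.zeros((K, X_blocks[0].shape[1]))
    for Zb, Xb in zip(Z_blocks, X_blocks):
        a, b = partial_stats(Zb, Xb)
        ZtZ += a
        ZtX += b
    return np.linalg.solve(ZtZ, ZtX)                    # avoids forming the inverse explicitly
```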
65 4 Analysis of Correctness and Scalability In contrast to the coordination-free pattern in which scalability is trivial and correctness often requires strong assumptions or holds only in expectation, the OCC pattern leads to simple proofs of correctness and challenging scalability analysis. [sent-142, score-0.206]
66 The distributed DP-means, OFL, and BP-means algorithms are serially equivalent to DP-means, OFL and BP-means, respectively. [sent-146, score-0.203]
67 The proof of Theorem 4.1 is relatively straightforward and is obtained by constructing a permutation function that describes an equivalent serial execution for each distributed execution. [sent-148, score-0.303]
68 Serializability allows us to easily extend important theoretical properties of the serial algorithm to the distributed setting. [sent-150, score-0.256]
69 For example, by invoking serializability, we can establish the following result (Theorem 4.2) for the OCC version of the online facility location (OFL) algorithm. [sent-151, score-0.179]
70 Theorem 4.2 is first derived in the serial setting and then extended to the distributed setting through serializability. [sent-156, score-0.256]
71 The guarantee of Theorem 4.2 is unaffected by distributed processing and has no communication or coarsening tradeoffs. [sent-158, score-0.098]
72 Furthermore, to retain the same factors as a batch algorithm on the full data, divide-and-conquer schemes need a large number of preliminary centers at lower levels [11, 12]. [sent-159, score-0.126]
73 In that case, the communication cost can be high, since all proposed clusters are sent at the same time, as opposed to the OCC approach. [sent-160, score-0.149]
74 Scalability: The scalability of the OCC algorithms depends on the number of transactions that are rejected during validation. [sent-162, score-0.118]
75 To illustrate the techniques employed in OCC scalability analysis we study the DP-Means algorithm, whose scalability limiting factor is determined by the number of points that must be serially validated. [sent-167, score-0.315]
76 We show that the communication cost only depends on the number of clusters and processing resources and does not directly depend on the number of data points. [sent-168, score-0.115]
77 Assume N data points are generated iid to form a random number (KN ) of well-spaced clusters of diameter λ: λ is an upper bound on the distances within clusters and a lower bound on the distance between clusters. [sent-172, score-0.233]
78 Then the expected number of serially validated points is bounded above by P b + E [KN ] for P processors and b points per epoch. [sent-173, score-0.292]
79 Under the separation assumptions of the theorem, the number of clusters present in N data points, KN , is exactly equal to the number of clusters found by DP-Means in N data points; call this latter quantity kN . [sent-174, score-0.184]
80 Since the master must process at least kN points, the additional overhead caused by rejections is at most P b and independent of N. [sent-176, score-0.096]
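Restating the bound discussed above in symbols (our restatement of the stated result, not a new claim):

```latex
% Under the well-spaced-clusters assumption, with P processors and b points per epoch:
\mathbb{E}\bigl[\#\{\text{points sent for serial validation}\}\bigr] \;\le\; P b + \mathbb{E}[K_N],
\qquad
\text{rejection overhead} \;\le\; P b \quad \text{(independent of } N\text{)}.
```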
81 The cluster and feature proportions were generated nonparametrically as described below. [sent-178, score-0.125]
82 Clustering: The cluster proportions and indicators were generated simultaneously using the stick-breaking procedure for Dirichlet processes—‘sticks’ are ‘broken’ on-the-fly to generate new clusters as necessary. [sent-181, score-0.188]
83 Cluster means were sampled µk ∼ N(0, I16), and data points were generated as xi ∼ N(µzi, (1/4) I16). [sent-183, score-0.094]
84 We generated feature means fk ∼ N(0, I16) and data points xi ∼ N(Σk zik fk, 1 I16). [sent-190, score-0.374]
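A minimal sketch of the clustering data generation appears below (our own Python; the concentration parameter of the stick-breaking construction is an illustrative choice, while the dimension 16 and the 1/4 noise variance follow the clustering description above).

```python
import numpy as np
rng = np.random.default_rng(0)

def stick_breaking_clusters(N, alpha=1.0, dim=16, noise=0.25):
    """Generate DP mixture data by breaking 'sticks' on the fly: a new cluster
    mean ~ N(0, I) is created whenever the sampled stick index falls beyond
    the clusters generated so far."""
    weights, means = [], []
    remaining = 1.0                        # unbroken stick mass
    X, z = [], []
    for _ in range(N):
        u = rng.random()
        acc, k = 0.0, 0
        while True:
            if k == len(weights):          # break a new piece of the stick
                beta = rng.beta(1.0, alpha)
                weights.append(remaining * beta)
                remaining *= 1.0 - beta
                means.append(rng.normal(0.0, 1.0, dim))
            acc += weights[k]
            if u < acc or remaining < 1e-12:
                break
            k += 1
        z.append(k)
        X.append(rng.normal(means[k], np.sqrt(noise), dim))
    return np.array(X), np.array(z), np.array(means)
```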
85 1 Simulated experiments To test the efficiency of our algorithms, we simulated the first iteration (one complete pass over all the data, where most clusters / features are created and thus greatest coordination is needed) of each algorithm in MATLAB. [sent-192, score-0.155]
86 We report KN, the number of clusters / features, and MN, the number of proposed clusters / features. [sent-196, score-0.092]
87 2 Distributed implementation and experiments We also implemented the distributed algorithms in Spark [9], an open-source cluster computing system. [sent-206, score-0.171]
88 The DP-means and BP-means algorithms were initialized by pre-processing a small number of data points (1/16 of the first P b points)—this reduces the number of data points sent to the master on the first epoch, while still preserving serializability of the algorithms. [sent-207, score-0.295]
89 Ideally, to process the same amount of data, an algorithm and implementation with perfect scaling would take half the runtime on 8 machines as it would on 4, and so on. [sent-210, score-0.077]
90 We were able to get perfect scaling (Figure 4a) in all but the first iteration, when the master has to perform the most synchronization of proposed centers. [sent-215, score-0.077]
91 Figure 4b shows that we get no scaling in the first epoch, where all P b data points are sent to the master. [sent-219, score-0.105]
92 6 Related work Others have proposed alternatives to mutual exclusion and coordination-free parallelism for machine learning algorithm design. [sent-225, score-0.081]
93 [14] proposed transforming the underlying model to expose additional parallelism while preserving the marginal posterior. [sent-226, score-0.108]
94 Figure 4: Normalized runtime for distributed algorithms; panels (a) OCC DP-means, (b) OCC OFL, (c) OCC BP-means. [sent-233, score-0.109]
95 Runtime of each iteration / epoch is divided by that using 1 machine (P = 8). [sent-234, score-0.113]
96 7 Discussion In this paper we have shown how optimistic concurrency control can be usefully employed in the design of distributed machine learning algorithms. [sent-245, score-0.439]
97 We established the equivalence of our distributed OCC DP-means, OFL and BP-means algorithms to their serial counterparts, thus preserving their theoretical properties. [sent-247, score-0.299]
98 In particular, the strong approximation guarantees of serial OFL translate immediately to the distributed algorithm. [sent-248, score-0.256]
99 We implemented and evaluated all three OCC algorithms on a distributed computing platform and demonstrated strong scalability in practice. [sent-250, score-0.144]
100 In particular we may be able to partially or probabilistically accept non-serializable operations in a way that preserves underlying algorithm invariants. [sent-253, score-0.096]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
2 0.072680853 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
3 0.065260969 277 nips-2013-Restricting exchangeable nonparametric distributions
Author: Sinead A. Williamson, Steve N. MacEachern, Eric Xing
Abstract: Distributions over matrices with exchangeable rows and infinitely many columns are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly-suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well-modeled by the original distribution. 1
4 0.058912609 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
Author: Trevor Campbell, Miao Liu, Brian Kulis, Jonathan P. How, Lawrence Carin
Abstract: This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a lowvariance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. 1
5 0.058187485 63 nips-2013-Cluster Trees on Manifolds
Author: Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, Larry Wasserman
Abstract: unkown-abstract
6 0.053483825 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
7 0.046653375 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
8 0.044646535 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
9 0.044510327 158 nips-2013-Learning Multiple Models via Regularized Weighting
10 0.044344325 193 nips-2013-Mixed Optimization for Smooth Functions
11 0.043973129 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
12 0.042493429 213 nips-2013-Nonparametric Multi-group Membership Model for Dynamic Networks
13 0.041436147 142 nips-2013-Information-theoretic lower bounds for distributed statistical estimation with communication constraints
14 0.038957175 148 nips-2013-Latent Maximum Margin Clustering
15 0.037940167 75 nips-2013-Convex Two-Layer Modeling
16 0.037585676 273 nips-2013-Reinforcement Learning in Robust Markov Decision Processes
17 0.036279917 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
18 0.03615487 344 nips-2013-Using multiple samples to learn mixture models
19 0.03554843 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
20 0.035017364 146 nips-2013-Large Scale Distributed Sparse Precision Estimation
simIndex simValue paperId paperTitle
same-paper 1 0.89647347 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
2 0.61391097 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
Author: Jason Chang, John W. Fisher III
Abstract: We present an MCMC sampler for Dirichlet process mixture models that can be parallelized to achieve significant computational gains. We combine a nonergodic, restricted Gibbs iteration with split/merge proposals in a manner that produces an ergodic Markov chain. Each cluster is augmented with two subclusters to construct likely split moves. Unlike some previous parallel samplers, the proposed sampler enforces the correct stationary distribution of the Markov chain without the need for finite approximations. Empirical results illustrate that the new sampler exhibits better convergence properties than current methods. 1
3 0.61327994 100 nips-2013-Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture
Author: Trevor Campbell, Miao Liu, Brian Kulis, Jonathan P. How, Lawrence Carin
Abstract: This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a lowvariance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets. 1
4 0.59977412 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
Author: Maria-Florina Balcan, Steven Ehrlich, Yingyu Liang
Abstract: This paper provides new algorithms for distributed clustering for two popular center-based objectives, k-median and k-means. These algorithms have provable guarantees and improve communication complexity over existing approaches. Following a classic approach in clustering by [13], we reduce the problem of finding a clustering with low cost to the problem of finding a coreset of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication complexity, and which works over general communication topologies. Experimental results on large scale data sets show that this approach outperforms other coreset-based distributed clustering algorithms. 1
5 0.5706051 277 nips-2013-Restricting exchangeable nonparametric distributions
Author: Sinead A. Williamson, Steve N. MacEachern, Eric Xing
Abstract: Distributions over matrices with exchangeable rows and infinitely many columns are useful in constructing nonparametric latent variable models. However, the distribution implied by such models over the number of features exhibited by each data point may be poorly-suited for many modeling tasks. In this paper, we propose a class of exchangeable nonparametric priors obtained by restricting the domain of existing models. Such models allow us to specify the distribution over the number of features per data point, and can achieve better performance on data sets where the number of features is not well-modeled by the original distribution. 1
6 0.53972042 18 nips-2013-A simple example of Dirichlet process mixture inconsistency for the number of components
7 0.53466004 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
8 0.52510488 344 nips-2013-Using multiple samples to learn mixture models
9 0.51557213 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling
10 0.5005495 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models
11 0.47184375 63 nips-2013-Cluster Trees on Manifolds
12 0.46194378 252 nips-2013-Predictive PAC Learning and Process Decompositions
13 0.4554444 158 nips-2013-Learning Multiple Models via Regularized Weighting
14 0.44794202 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
15 0.44735101 190 nips-2013-Mid-level Visual Element Discovery as Discriminative Mode Seeking
16 0.44049063 333 nips-2013-Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent
17 0.43783757 148 nips-2013-Latent Maximum Margin Clustering
18 0.43772155 202 nips-2013-Multiclass Total Variation Clustering
19 0.42601559 320 nips-2013-Summary Statistics for Partitionings and Feature Allocations
20 0.41874418 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
topicId topicWeight
[(2, 0.011), (3, 0.013), (16, 0.097), (17, 0.23), (33, 0.083), (34, 0.1), (36, 0.016), (41, 0.035), (49, 0.036), (56, 0.082), (70, 0.024), (85, 0.045), (89, 0.022), (93, 0.064), (95, 0.033)]
simIndex simValue paperId paperTitle
same-paper 1 0.75022316 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
2 0.71504974 284 nips-2013-Robust Spatial Filtering with Beta Divergence
Author: Wojciech Samek, Duncan Blythe, Klaus-Robert Müller, Motoaki Kawanabe
Abstract: The efficiency of Brain-Computer Interfaces (BCI) largely depends upon a reliable extraction of informative features from the high-dimensional EEG signal. A crucial step in this protocol is the computation of spatial filters. The Common Spatial Patterns (CSP) algorithm computes filters that maximize the difference in band power between two conditions, thus it is tailored to extract the relevant information in motor imagery experiments. However, CSP is highly sensitive to artifacts in the EEG data, i.e. few outliers may alter the estimate drastically and decrease classification performance. Inspired by concepts from the field of information geometry we propose a novel approach for robustifying CSP. More precisely, we formulate CSP as a divergence maximization problem and utilize the property of a particular type of divergence, namely beta divergence, for robustifying the estimation of spatial filters in the presence of artifacts in the data. We demonstrate the usefulness of our method on toy data and on EEG recordings from 80 subjects. 1
3 0.6692217 64 nips-2013-Compete to Compute
Author: Rupesh K. Srivastava, Jonathan Masci, Sohrob Kazerounian, Faustino Gomez, Jürgen Schmidhuber
Abstract: Local competition among neighboring neurons is common in biological neural networks (NNs). In this paper, we apply the concept to gradient-based, backprop-trained artificial multilayer NNs. NNs with competing linear units tend to outperform those with non-competing nonlinear units, and avoid catastrophic forgetting when training sets change over time. 1
4 0.61658639 204 nips-2013-Multiscale Dictionary Learning for Estimating Conditional Distributions
Author: Francesca Petralia, Joshua T. Vogelstein, David Dunson
Abstract: Nonparametric estimation of the conditional distribution of a response given highdimensional features is a challenging problem. It is important to allow not only the mean but also the variance and shape of the response density to change flexibly with features, which are massive-dimensional. We propose a multiscale dictionary learning model, which expresses the conditional response density as a convex combination of dictionary densities, with the densities used and their weights dependent on the path through a tree decomposition of the feature space. A fast graph partitioning algorithm is applied to obtain the tree decomposition, with Bayesian methods then used to adaptively prune and average over different sub-trees in a soft probabilistic manner. The algorithm scales efficiently to approximately one million features. State of the art predictive performance is demonstrated for toy examples and two neuroscience applications including up to a million features. 1
Author: Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, Oliver Stegle
Abstract: Multi-task prediction methods are widely used to couple regressors or classification models by sharing information across related tasks. We propose a multi-task Gaussian process approach for modeling both the relatedness between regressors and the task correlations in the residuals, in order to more accurately identify true sharing between regressors. The resulting Gaussian model has a covariance term in form of a sum of Kronecker products, for which efficient parameter inference and out of sample prediction are feasible. On both synthetic examples and applications to phenotype prediction in genetics, we find substantial benefits of modeling structured noise compared to established alternatives. 1
6 0.60576296 245 nips-2013-Pass-efficient unsupervised feature selection
7 0.59827161 34 nips-2013-Analyzing Hogwild Parallel Gaussian Gibbs Sampling
8 0.59645885 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models
9 0.59336549 58 nips-2013-Binary to Bushy: Bayesian Hierarchical Clustering with the Beta Coalescent
10 0.58979243 243 nips-2013-Parallel Sampling of DP Mixture Models using Sub-Cluster Splits
11 0.58950126 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
12 0.58926809 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
13 0.58719921 5 nips-2013-A Deep Architecture for Matching Short Texts
14 0.58643848 287 nips-2013-Scalable Inference for Logistic-Normal Topic Models
15 0.5861724 350 nips-2013-Wavelets on Graphs via Deep Learning
16 0.58583355 201 nips-2013-Multi-Task Bayesian Optimization
17 0.58512133 262 nips-2013-Real-Time Inference for a Gamma Process Model of Neural Spiking
18 0.58408117 229 nips-2013-Online Learning of Nonparametric Mixture Models via Sequential Variational Approximation
19 0.5840649 77 nips-2013-Correlations strike back (again): the case of associative memory retrieval
20 0.58344853 187 nips-2013-Memoized Online Variational Inference for Dirichlet Process Mixture Models