nips nips2013 nips2013-296 knowledge-graph by maker-knowledge-mining

296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport


Source: pdf

Author: Marco Cuturi

Abstract: Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms’ dimension exceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximum-entropy perspective. We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn’s matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. [sent-4, score-0.921]

2 We propose in this work a new family of optimal transport distances that look at transport problems from a maximum-entropy perspective. [sent-6, score-1.194]

3 We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn’s matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. [sent-7, score-1.365]

4 We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem. [sent-8, score-0.869]

5 1 Introduction. Choosing a suitable distance to compare probabilities is a key problem in statistical machine learning. [sent-9, score-0.13]

6 When little is known about the probability space on which these probabilities are supported, various information divergences with minimalistic assumptions have been proposed to play that part, among them the Hellinger, χ2, total variation or Kullback-Leibler divergences. [sent-10, score-0.131]

7 When the probability space is a metric space, optimal transport distances (Villani, 2009, §6), a.k.a. [sent-11, score-0.752]

8 earth mover’s distances (EMD) in computer vision (Rubner et al. [sent-14, score-0.107]

9 In the particular case that the metric probability space of interest can be embedded in Rn and n is small, computing or approximating optimal transport distances can become reasonably cheap. [sent-19, score-0.752]

10 When n ≥ 2, embeddings of measures can be used to approximate them in linear time (Indyk and Thaper, 2003; Grauman and Darrell, 2004; Shirdhonkar and Jacobs, 2008) and network simplex solvers can be modified to run in quadratic time (Gudmundsson et al. [sent-21, score-0.077]

11 Outside of the perimeter of these cases, computing a single distance between a pair of measures supported by a few hundred points/bins in an arbitrary metric space can take more than a few seconds on a single CPU. [sent-24, score-0.146]

12 This issue severely hinders the applicability of optimal transport distances in large-scale data analysis and goes as far as putting into question their relevance within the field of machine learning, where high-dimensional histograms and measures in high-dimensional spaces are now prevalent. [sent-25, score-0.791]

13 We show in this paper that another strategy can be employed to speed up optimal transport, and even potentially define a better distance in inference tasks. [sent-26, score-0.124]

14 Our strategy is valid regardless of the metric characteristics of the original probability space. [sent-27, score-0.056]

15 Rather than exploit properties of the metric probability space of interest (such as embeddability in a low-dimensional Euclidean space) we choose to focus directly on the original transport problem, and regularize it with an entropic term. [sent-28, score-0.647]

16 We argue that this regularization is intuitive given the geometry of the optimal transport problem and has, in fact, been long known and favored in transport theory to predict traffic patterns (Wilson, 1969). [sent-29, score-1.058]

17 From an optimization point of view, this regularization has multiple virtues, among which that of turning the transport problem into a strictly convex problem that can be solved with matrix scaling algorithms. [sent-30, score-0.595]

18 We propose a novel implementation of this algorithm that can simultaneously compute the distances of a single point to a family of points using matrix-matrix products, and which can therefore be implemented on GPGPU architectures. [sent-33, score-0.09]

19 We show that, on the benchmark task of classifying MNIST digits, regularized distances perform better than standard optimal transport distances, and can be computed several orders of magnitude faster. [sent-34, score-0.724]

20 This paper is organized as follows: we provide reminders on optimal transport theory in Section 2, introduce Sinkhorn distances in Section 3 and provide algorithmic details in Section 4. [sent-35, score-0.731]

21 For two probability vectors r and c in the simplex Σd := {x ∈ Rd+ : xT 1d = 1}, where 1d is the d-dimensional vector of ones, we write U(r, c) for the transport polytope of r and c, namely the polyhedral set of d × d matrices U(r, c) := {P ∈ Rd×d+ | P 1d = r, PT 1d = c}. [sent-38, score-0.579]

22 U (r, c) has a probabilistic interpretation: for X and Y two multinomial random variables taking values in {1, · · · , d}, each with distribution r and c respectively, the set U (r, c) contains all possible joint probabilities of (X, Y ). [sent-40, score-0.069]

23 Indeed, any matrix P ∈ U (r, c) can be identified with a joint probability for (X, Y ) such that p(X = i, Y = j) = pij . [sent-41, score-0.33]
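To make the marginal constraints concrete, here is a minimal NumPy sketch (not from the paper; the array names are illustrative) checking that the independence table rcT lies in U(r, c):

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
r = rng.random(d); r /= r.sum()   # probability vector r
c = rng.random(d); c /= c.sum()   # probability vector c

P = np.outer(r, c)                # independence table r c^T

# P belongs to U(r, c): its row sums are r and its column sums are c.
assert np.allclose(P.sum(axis=1), r)
assert np.allclose(P.sum(axis=0), c)
```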

24 We define the entropy h and the Kullback-Leibler divergence of P, Q ∈ U(r, c) and of a marginal r ∈ Σd as h(r) = −Σ_{i=1..d} ri log ri, h(P) = −Σ_{i,j=1..d} pij log pij, KL(P‖Q) = Σ_{ij} pij log(pij/qij). [sent-42, score-1.35]

25 Optimal Transport Distance Between r and c. Given a d × d cost matrix M, the cost of mapping r to c using a transport matrix (or joint probability) P can be quantified as ⟨P, M⟩. [sent-43, score-0.595]

26 The problem min_{P ∈ U(r,c)} ⟨P, M⟩ is called an optimal transport (OT) problem between r and c given cost M. [sent-45, score-0.532]
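For reference, this OT problem is a linear program over the vectorization of P. The sketch below solves it with scipy.optimize.linprog (assumed available); it is a generic baseline used only to make the definition concrete, not the algorithm proposed in the paper, and it scales poorly with d.

```python
import numpy as np
from scipy.optimize import linprog

def ot_distance(r, c, M):
    """Exact OT cost min_{P in U(r,c)} <P, M> via a generic LP solver (baseline only)."""
    d = len(r)
    # Equality constraints on the row-major vectorization of P.
    A_rows = np.kron(np.eye(d), np.ones((1, d)))   # row sums:    P 1_d   = r
    A_cols = np.kron(np.ones((1, d)), np.eye(d))   # column sums: P^T 1_d = c
    res = linprog(M.reshape(-1),
                  A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([r, c]),
                  bounds=(0, None), method="highs")
    return res.fun
```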

27 An optimal table P ⋆ for this problem can be obtained, among other approaches, with the network simplex (Ahuja et al. [sent-46, score-0.111]

28 The optimum of this problem, dM(r, c), is a distance between r and c (Villani, 2009, §6.1) [sent-48, score-0.09]

29 whenever the matrix M is itself a metric matrix, namely whenever M belongs to the cone of distance matrices (Avis, 1980; Brickell et al., 2008): [sent-49, score-0.203]

30 M = {M ∈ Rd×d : ∀i, j ≤ d, mij = 0 ⇔ i = j, ∀i, j, k ≤ d, mij ≤ mik + mkj}. [sent-50, score-0.221]

31 Hence, the set of tables P whose Kullback-Leibler divergence to rcT is constrained to lie below a certain threshold can be interpreted as the set of joint probabilities P in U (r, c) which have sufficient entropy with respect to h(r) and h(c), or small enough mutual information. [sent-56, score-0.197]

32 For reasons that will become clear in Section 4, we call the quantity below the Sinkhorn distance of r and c: Definition 1 (Sinkhorn Distance). [sent-57, score-0.09]

33 dM,α(r, c) := min_{P ∈ Uα(r,c)} ⟨P, M⟩. Why consider an entropic constraint in optimal transport? [sent-58, score-0.127]

34 As a classic result of linear optimization, the OT problem is always solved on a vertex of U (r, c). [sent-61, score-0.076]

35 From a probabilistic perspective, such vertices are quasi-deterministic joint probabilities, since if pij > 0, then very few probabilities pij′ for j′ ≠ j will be non-zero in general. [sent-65, score-0.603]

36 Rather than considering such outliers of U (r, c) as the basis of OT distances, we propose to restrict the search for low cost joint probabilities to tables with sufficient smoothness. [sent-66, score-0.069]

37 Note that this is equivalent to considering the maximum-entropy principle (Jaynes, 1957; Darroch and Ratcliff, 1972) if we were to maximize entropy while keeping the transportation cost constrained. [sent-67, score-0.079]

38 They relax and penalize (through graph-based norms) the original transport problem to avoid undesirable properties exhibited by the original optima in the problem of color matching. [sent-70, score-0.498]

39 Metric Properties of Sinkhorn Distances. When α is large enough, the Sinkhorn distance coincides with the classic OT distance. [sent-72, score-0.145]

40 When α = 0, the Sinkhorn distance has a closed form and becomes a negative definite kernel if one assumes that M is itself a negative definite distance, or equivalently a Euclidean distance matrix¹. [sent-73, score-0.228]

41 For α large enough, the Sinkhorn distance dM,α is the transport distance dM . [sent-75, score-0.678]

42 If M is a Euclidean distance matrix, dM,0 is a negative definite kernel and e^{−t dM,0}, the independence kernel, is positive definite for all t > 0. [sent-80, score-0.176]

43 Beyond these two extreme cases, the main theorem of this section states that Sinkhorn distances are symmetric and satisfy triangle inequalities for all possible values of α. [sent-82, score-0.191]

44 Since for α small enough dM,α (r, r) > 0 for any r such that h(r) > 0, Sinkhorn distances cannot satisfy the coincidence axiom (d(x, y) = 0 ⇔ x = y holds for all x, y). [sent-83, score-0.192]

45 The function (r, c) → 1_{r≠c} dM,α(r, c) satisfies all three distance axioms. [sent-87, score-0.09]

46 1 ∃n, ∃ϕ1, · · · , ϕd ∈ Rn such that mij = ‖ϕi − ϕj‖². [sent-88, score-0.095]

47 Recall that, in that case, M raised to the power t element-wise, [mij^t], 0 < t < 1, is also a Euclidean distance matrix (Berg et al. [sent-89, score-0.147]

48 This drawing implicitly assumes that the optimal transport P ⋆ is unique. [sent-95, score-0.532]

49 The Sinkhorn distance dM,α(r, c) is equal to ⟨Pα, M⟩, the minimum of the dot product with M on that ball. [sent-96, score-0.09]

50 The dual-Sinkhorn distance dλ(r, c), the minimum of the transport cost ⟨P, M⟩ minus the entropy divided by λ, reaches its minimum at a unique solution P λ, forming a regularization path for varying λ from rcT to P ⋆. [sent-98, score-0.731]

51 The gluing lemma is key to proving that OT distances are indeed distances. [sent-103, score-0.164]

52 We can prove the triangle inequality for dM,α by using the same proof strategy as that used for classic transport distances: Proof of Theorem 1. [sent-110, score-0.58]

53 4 Computing Regularized Transport with Sinkhorn’s Algorithm. We consider in this section a Lagrange multiplier for the entropy constraint of Sinkhorn distances: for λ > 0, dλ(r, c) := ⟨P λ, M⟩, where P λ = argmin_{P ∈ U(r,c)} ⟨P, M⟩ − (1/λ) h(P). [sent-115, score-0.085]

54 We call dλ the dual-Sinkhorn divergence and show that it can be computed for a much cheaper cost than the original distance dM. [sent-117, score-0.133]

55 Since the entropy of P λ decreases monotonically with λ, computing dM,α can be carried out by computing dλ with increasing values of λ until h(P λ) reaches h(r) + h(c) − α. [sent-119, score-0.058]

56 Computing dλ with Matrix Scaling Algorithms. Adding an entropy regularization to the optimal transport problem enforces a simple structure on the optimal regularized transport P λ: Lemma 2. [sent-121, score-1.144]

57 The fact that P λ can be written as a rescaled version of K is well known in transport theory (Erlander and Stewart, 1990, §3.3): [sent-125, score-0.498]

58 Let L(P, α, β) be the Lagrangian of Equation (2), with dual variables α, β ∈ Rd for the two equality constraints in U(r, c): L(P, α, β) = (1/λ) Σ_{ij} pij log pij + Σ_{ij} pij mij + αT(P 1d − r) + βT(PT 1d − c). [sent-126, score-0.896]

59 For any couple (i, j), (∂L/∂pij = 0) ⇒ pij = e^{−1/2−λαi} e^{−λ mij} e^{−1/2−λβj}. [sent-127, score-0.308]
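A minimal NumPy sketch of the resulting scaling iterations (the fixed-point alternation behind Sinkhorn's algorithm); it assumes r and c have no zero entries, uses a fixed number of iterations rather than the stopping criterion discussed below, and omits the numerical stabilization needed for very large λ.

```python
import numpy as np

def sinkhorn(r, c, M, lam, n_iter=1000):
    """Dual-Sinkhorn divergence and regularized plan P = diag(u) K diag(v), K = exp(-lam * M)."""
    K = np.exp(-lam * M)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)     # match column marginals
        u = r / (K @ v)       # match row marginals
    P = u[:, None] * K * v[None, :]
    return np.sum(P * M), P
```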

60 Parallelism, Convergence and Stopping Criteria. As can be seen right above, Sinkhorn’s algorithm can be vectorized and generalized to N target histograms c1, · · · , cN. [sent-150, score-0.095]

61 When N > 1, the computations for N target histograms can be simultaneously carried out by updating a single matrix of scaling factors u ∈ Rd×N instead of updating a scaling vector u ∈ Rd . [sent-152, score-0.199]
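A sketch of that vectorized variant, loosely modeled on the vectorized update described here (cf. Algorithm 1 in the paper): the N problems share the kernel K, so each iteration reduces to matrix-matrix products, which is what makes the GPGPU implementation mentioned earlier attractive. It assumes r has no zero entries; C stacks the N target histograms as columns.

```python
import numpy as np

def sinkhorn_many(r, C, M, lam, n_iter=1000):
    """Dual-Sinkhorn divergences between one histogram r and the N columns of C."""
    K = np.exp(-lam * M)                          # d x d kernel
    K_tilde = K / r[:, None]                      # diag(1/r) K, precomputed once
    U = np.ones((len(r), C.shape[1])) / len(r)    # d x N matrix of scaling factors
    for _ in range(n_iter):
        U = 1.0 / (K_tilde @ (C / (K.T @ U)))
    V = C / (K.T @ U)
    return np.sum(U * ((K * M) @ V), axis=0)      # one divergence per column of C
```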

62 The upper bound κ(K) tends to 1 as λ grows, and we do observe slower convergence as P λ gets closer to the optimal vertex P ⋆ (or the optimal facet of U(r, c) if it is not unique). [sent-157, score-0.089]

63 5 Experimental Results. MNIST Digits. We test the performance of dual-Sinkhorn divergences on the MNIST digits dataset. [sent-160, score-0.115]

64 Given a distance d, we form the kernel e−d/t , where t > 0 is chosen by CV on each training fold within {1, q10 (d), q20 (d), q50 (d)}, where qs is the s% quantile of a subset of distances observed in that fold. [sent-164, score-0.302]
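A small sketch of that kernel construction, assuming D is a matrix of pairwise divergences observed on a training fold (names are illustrative):

```python
import numpy as np

def candidate_kernels(D):
    """Gram matrices exp(-D/t) for the CV bandwidth grid {1, q10(D), q20(D), q50(D)}."""
    ts = [1.0] + [float(np.quantile(D, q)) for q in (0.10, 0.20, 0.50)]
    return {t: np.exp(-D / t) for t in ts}
```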

65 M is the 400 × 400 matrix of Euclidean distances between the 20 × 20 bins in the grid. [sent-172, score-0.198]
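For instance, the 400 × 400 ground metric for 20 × 20 images can be built as follows (a sketch, not the paper's code):

```python
import numpy as np

side = 20
# Pixel coordinates of the 20 x 20 grid, flattened to 400 points in R^2.
coords = np.stack(np.meshgrid(np.arange(side), np.arange(side), indexing="ij"),
                  axis=-1).reshape(-1, 2)
# M[i, j] = Euclidean distance between pixel i and pixel j (400 x 400).
M = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```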

66 We also tried Mahalanobis distances on this example using exp(-tM. [sent-173, score-0.164]

67 For the Independence kernel we considered [mij^a] where a ∈ {0. [sent-175, score-0.089]

68 We select λ in {5, 7, 9, 11} × 1/q50 (M ) where q50 (M (:)) is the median distance between pixels. [sent-178, score-0.09]

69 The dual-Sinkhorn divergence beats by a safe margin all other distances, including the classic optimal transport distance, here labeled as EMD. [sent-181, score-0.63]

70 We study the convergence of the dual-Sinkhorn divergence towards classic optimal transport distances as λ grows. [Figure title: Deviation of Sinkhorn’s Distance to EMD on subset of MNIST Data] [sent-183, score-0.794]

71 For this experiment as well as all the other experiments below, we compute a vector of N divergences d at each iteration, and stop when none of the N values of d varies in absolute value by more than 1/100th of a percent. [Figure 3: Decrease of the gap between the dual-Sinkhorn divergence and EMD; x-axis: λ ∈ {1, 5, 9, 15, 25, 50, 100}] [sent-195, score-0.091]

72 Several Orders of Magnitude Faster. We measure the computational speed of classic optimal transport distances vs. [Figure 4 title: Computational Speed for Histograms of Varying Dimension Drawn Uniformly on the Simplex (log-log scale)] [sent-202, score-0.751]

73 that of dual-Sinkhorn divergences, using Rubner et al. [Figure 4 legend: FastEMD, Rubner’s emd] [sent-203, score-0.283]

74 M is the all-pairs shortest-path matrix obtained from this connectivity matrix using the Floyd-Warshall algorithm (Ahuja et al. [sent-212, score-0.091]
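A compact Floyd-Warshall sketch for building such a ground metric, assuming W encodes non-negative edge weights with np.inf for missing edges:

```python
import numpy as np

def floyd_warshall(W):
    """All-pairs shortest-path distances from a non-negative weight matrix W (inf = no edge)."""
    D = np.array(W, dtype=float, copy=True)
    np.fill_diagonal(D, 0.0)
    for k in range(D.shape[0]):                          # classic O(d^3) relaxation
        D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
    return D
```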

75 We implemented Algorithm 1 in matlab, and use emd_mex and emd_hat_gd_metric mex/C files. [Figure 4 x-axis: Histogram Dimension, d ∈ {64, 128, 256, 512, 1024, 2048}] [sent-218, score-0.394]

76 The EMD distances and Sinkhorn CPU are run on a single core (2. [Figure 4 caption: Average computational time required to compute a distance between two histograms sampled uniformly in the d-dimensional simplex, for varying values of d] [sent-219, score-0.349]

77 Sinkhorn GPU is run on an NVidia Quadro K5000 card. [sent-221, score-0.082]

78 Dual-Sinkhorn divergences are run both on a single CPU and on the GPU; we consider λ in {1, 10, 50}. [sent-223, score-0.091]

79 Rubner et al.’s implementation cannot be run for histograms larger than d = 512. [sent-227, score-0.095]

80 As can be expected, the competitive advantage of dual-Sinkhorn divergences over EMD solvers increases with the dimension. [sent-228, score-0.091]

81 6 Conclusion. We have shown that regularizing the optimal transport problem with an entropic penalty opens the door for new numerical approaches to compute OT. [sent-236, score-0.625]

82 This regularization yields speed-ups that are effective regardless of any assumptions on the ground metric M . [sent-237, score-0.084]

83 Based on preliminary evidence, it seems that dual-Sinkhorn divergences do not perform worse than the EMD, and may in fact perform better in applications. [sent-238, score-0.091]

84 Dual-Sinkhorn divergences are parameterized by a regularization weight λ which should be tuned having both computational and performance objectives in mind, but we have not observed a need to establish a trade-off between both. [sent-239, score-0.119]

85 The set U1 (r, c) contains all joint probabilities P for which h(P ) = h(r) + h(c). [sent-243, score-0.069]

86 If M is negative definite, there exist vectors (ϕ1, · · · , ϕd) in some Euclidean space Rn such that mij = ‖ϕi − ϕj‖² through (Berg et al. [sent-247, score-0.118]

87 We thus have that rT M c = Σ_{ij} ri cj ‖ϕi − ϕj‖² = (Σ_i ri ‖ϕi‖² + Σ_i ci ‖ϕi‖²) − 2 Σ_{ij} ri cj ⟨ϕi, ϕj⟩ = rT u + cT u − 2 rT K c, where ui = ‖ϕi‖² and Kij = ⟨ϕi, ϕj⟩. [sent-251, score-0.22]

88 This is a sum of two negative definite kernels: the first term (rT u + cT u) is the sum of the same function evaluated separately on r and c, and thus a negative definite kernel (Berg et al. [sent-257, score-0.071]

89 10); the latter term −2rT Kc is negative definite as minus a positive definite kernel (Berg et al. [sent-260, score-0.1]

90 Given a matrix M, one can indeed pre-compute the vector of norms u as well as a Cholesky factor L of K above to preprocess a dataset of histograms, by premultiplying each observation ri by L^T and only storing L^T ri, as well as precomputing its diagonal term ri^T u. [sent-266, score-0.221]
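A hedged sketch of that preprocessing, assuming the points ϕi behind the Euclidean distance matrix are available (here they are generated synthetically, and K is assumed positive definite so that a Cholesky factor exists):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6
Phi = rng.standard_normal((d, n))                        # points phi_1, ..., phi_d in R^n
M = ((Phi[:, None, :] - Phi[None, :, :]) ** 2).sum(-1)   # m_ij = ||phi_i - phi_j||^2
K = Phi @ Phi.T                                          # K_ij = <phi_i, phi_j>
u = np.diag(K).copy()                                    # u_i = ||phi_i||^2
L = np.linalg.cholesky(K)                                # K = L L^T (assumes K is PD)

r = rng.random(d); r /= r.sum()
c = rng.random(d); c /= c.sum()

# For the independence table rc^T: <rc^T, M> = r^T M c = r^T u + c^T u - 2 (L^T r)^T (L^T c),
# so only L^T r_i and r_i^T u need to be stored for each histogram r_i.
lhs = r @ M @ c
rhs = r @ u + c @ u - 2 * (L.T @ r) @ (L.T @ c)
assert np.isclose(lhs, rhs)
```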

91 Note that the independence kernel is positive definite on histograms with the same 1-norm, but is no longer positive definite for arbitrary vectors. [sent-267, score-0.181]

92 Let T be the probability distribution on {1, · · · , d}³ whose coefficients are defined as tijk := pij qjk / yj (3), for all indices j such that yj > 0. [sent-269, score-0.799]

93 For indices j such that yj = 0, all values tijk are set to 0. [sent-270, score-0.168]

94 Indeed, Σ_{ij} sijk = Σ_j (qjk/yj) Σ_i pij = Σ_j (qjk/yj) yj = Σ_j qjk = zk (column sums), and Σ_{jk} sijk = Σ_j (pij/yj) Σ_k qjk = Σ_j (pij/yj) yj = Σ_j pij = xi (row sums). We now prove that h(S) ≥ h(x) + h(z) − α. [sent-273, score-4.1]
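A tiny numerical check of these marginal computations (a sketch with hypothetical couplings; S here is treated as the d × d matrix obtained by summing the coefficients over j):

```python
import numpy as np

def glue(P, Q, y):
    """Composite coupling S with S[i, k] = sum_j P[i, j] * Q[j, k] / y[j] (0 where y[j] = 0)."""
    w = np.zeros_like(y)
    np.divide(1.0, y, out=w, where=y > 0)
    return (P * w[None, :]) @ Q

rng = np.random.default_rng(1)
d = 4
x = rng.random(d); x /= x.sum()
y = rng.random(d); y /= y.sum()
z = rng.random(d); z /= z.sum()
P = np.outer(x, y)                      # some coupling in U(x, y)
Q = np.outer(y, z)                      # some coupling in U(y, z)
S = glue(P, Q, y)
assert np.allclose(S.sum(axis=1), x)    # row sums give x
assert np.allclose(S.sum(axis=0), z)    # column sums give z
```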

95 An efficient gpu implementation of the revised simplex method. [sent-300, score-0.126]

96 Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. [sent-349, score-0.058]

97 Small manhattan networks and algorithmic applications for the earth movers distance. [sent-364, score-0.136]

98 An efficient earth mover’s distance algorithm for robust histogram comparison. [sent-386, score-0.174]

99 The earth movers distance, multi-dimensional scaling, and color-based image retrieval. [sent-408, score-0.136]

100 The use of entropy maximising models, in the theory of trip distribution, mode split and route split. [sent-427, score-0.058]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('transport', 0.498), ('sinkhorn', 0.485), ('pij', 0.267), ('qjk', 0.242), ('dm', 0.229), ('rct', 0.173), ('emd', 0.169), ('distances', 0.164), ('yj', 0.122), ('histograms', 0.095), ('mij', 0.095), ('entropic', 0.093), ('divergences', 0.091), ('distance', 0.09), ('ot', 0.084), ('earth', 0.084), ('gpu', 0.072), ('mjk', 0.069), ('pele', 0.061), ('rubner', 0.061), ('villani', 0.061), ('entropy', 0.058), ('metric', 0.056), ('classic', 0.055), ('simplex', 0.054), ('mover', 0.053), ('avis', 0.052), ('movers', 0.052), ('werman', 0.052), ('mnist', 0.05), ('berg', 0.05), ('kernel', 0.048), ('ri', 0.046), ('ahuja', 0.046), ('tijk', 0.046), ('lorenz', 0.046), ('divergence', 0.043), ('knight', 0.042), ('cv', 0.042), ('thomas', 0.042), ('ij', 0.041), ('probabilities', 0.04), ('franklin', 0.04), ('independence', 0.038), ('euclidean', 0.036), ('cpu', 0.035), ('scaling', 0.035), ('bieling', 0.035), ('brickell', 0.035), ('erlander', 0.035), ('ferradans', 0.035), ('gpgpu', 0.035), ('gudmundsson', 0.035), ('jaynes', 0.035), ('okada', 0.035), ('ratcliff', 0.035), ('reminders', 0.035), ('shirdhonkar', 0.035), ('sijk', 0.035), ('thaper', 0.035), ('cover', 0.034), ('optimal', 0.034), ('matrix', 0.034), ('diag', 0.032), ('rd', 0.031), ('gluing', 0.031), ('darroch', 0.031), ('schechtman', 0.031), ('hellinger', 0.031), ('mik', 0.031), ('brualdi', 0.031), ('rt', 0.03), ('joint', 0.029), ('minus', 0.029), ('cn', 0.029), ('naor', 0.028), ('coincidence', 0.028), ('formly', 0.028), ('regularized', 0.028), ('regularization', 0.028), ('polytope', 0.027), ('argmin', 0.027), ('mutual', 0.027), ('triangle', 0.027), ('grauman', 0.026), ('stewart', 0.026), ('ijk', 0.025), ('lemma', 0.024), ('digits', 0.024), ('execution', 0.024), ('ling', 0.024), ('darrell', 0.024), ('jacobs', 0.024), ('et', 0.023), ('indyk', 0.023), ('stopping', 0.023), ('exceeds', 0.022), ('vertex', 0.021), ('wilson', 0.021), ('transportation', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport

Author: Marco Cuturi

Abstract: Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms’ dimension exceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximum-entropy perspective. We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn’s matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.

2 0.1466458 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices

Author: Dimitris Achlioptas, Zohar S. Karnin, Edo Liberty

Abstract: We consider the problem of selecting non-zero entries of a matrix A in order to produce a sparse sketch of it, B, that minimizes ‖A − B‖2. For large m × n matrices, such that n ≫ m (for example, representing n observations over m attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding A. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with O(1) computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model. 1

3 0.11429074 185 nips-2013-Matrix Completion From any Given Set of Observations

Author: Troy Lee, Adi Shraibman

Abstract: In the matrix completion problem the aim is to recover an unknown real matrix from a subset of its entries. This problem comes up in many application areas, and has received a great deal of attention in the context of the netflix prize. A central approach to this problem is to output a matrix of lowest possible complexity (e.g. rank or trace norm) that agrees with the partially specified matrix. The performance of this approach under the assumption that the revealed entries are sampled randomly has received considerable attention (e.g. [1, 2, 3, 4, 5, 6, 7, 8]). In practice, often the set of revealed entries is not chosen at random and these results do not apply. We are therefore left with no guarantees on the performance of the algorithm we are using. We present a means to obtain performance guarantees with respect to any set of initial observations. The first step remains the same: find a matrix of lowest possible complexity that agrees with the partially specified matrix. We give a new way to interpret the output of this algorithm by next finding a probability distribution over the non-revealed entries with respect to which a bound on the generalization error can be proven. The more complex the set of revealed entries according to a certain measure, the better the bound on the generalization error. 1

4 0.068976305 195 nips-2013-Modeling Clutter Perception using Parametric Proto-object Partitioning

Author: Chen-Ping Yu, Wen-Yu Hua, Dimitris Samaras, Greg Zelinsky

Abstract: Visual clutter, the perception of an image as being crowded and disordered, affects aspects of our lives ranging from object detection to aesthetics, yet relatively little effort has been made to model this important and ubiquitous percept. Our approach models clutter as the number of proto-objects segmented from an image, with proto-objects defined as groupings of superpixels that are similar in intensity, color, and gradient orientation features. We introduce a novel parametric method of clustering superpixels by modeling mixture of Weibulls on Earth Mover’s Distance statistics, then taking the normalized number of proto-objects following partitioning as our estimate of clutter perception. We validated this model using a new 90-image dataset of real world scenes rank ordered by human raters for clutter, and showed that our method not only predicted clutter extremely well (Spearman’s ρ = 0.8038, p < 0.001), but also outperformed all existing clutter perception models and even a behavioral object segmentation ground truth. We conclude that the number of proto-objects in an image affects clutter perception more than the number of objects or features. 1

5 0.0633035 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

Author: Shahin Shahrampour, Sasha Rakhlin, Ali Jadbabaie

Abstract: This paper addresses the problem of online learning in a dynamic setting. We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady state mean-square deviation(MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time. 1

6 0.054836597 337 nips-2013-Transportability from Multiple Environments with Limited Experiments

7 0.053031642 169 nips-2013-Learning to Prune in Metric and Non-Metric Spaces

8 0.049083274 258 nips-2013-Projecting Ising Model Parameters for Fast Mixing

9 0.048258297 155 nips-2013-Learning Hidden Markov Models from Non-sequence Data via Tensor Decomposition

10 0.047917873 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

11 0.043602426 75 nips-2013-Convex Two-Layer Modeling

12 0.043303613 153 nips-2013-Learning Feature Selection Dependencies in Multi-task Learning

13 0.042263635 308 nips-2013-Spike train entropy-rate estimation using hierarchical Dirichlet process priors

14 0.040976368 12 nips-2013-A Novel Two-Step Method for Cross Language Representation Learning

15 0.04093378 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

16 0.038748354 318 nips-2013-Structured Learning via Logistic Regression

17 0.038545363 173 nips-2013-Least Informative Dimensions

18 0.038264506 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

19 0.036636271 158 nips-2013-Learning Multiple Models via Regularized Weighting

20 0.036204547 51 nips-2013-Bayesian entropy estimation for binary spike train data using parametric prior knowledge


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.126), (1, 0.035), (2, 0.015), (3, 0.03), (4, 0.017), (5, 0.019), (6, -0.001), (7, -0.017), (8, -0.026), (9, 0.01), (10, 0.01), (11, -0.032), (12, 0.031), (13, -0.002), (14, 0.028), (15, 0.04), (16, -0.031), (17, 0.024), (18, -0.045), (19, 0.028), (20, -0.023), (21, -0.07), (22, -0.021), (23, -0.043), (24, -0.045), (25, -0.033), (26, 0.018), (27, -0.12), (28, -0.004), (29, -0.0), (30, -0.054), (31, 0.009), (32, -0.084), (33, -0.024), (34, -0.014), (35, 0.032), (36, -0.035), (37, 0.065), (38, 0.004), (39, 0.064), (40, -0.005), (41, 0.006), (42, -0.027), (43, -0.045), (44, -0.003), (45, -0.119), (46, -0.016), (47, 0.048), (48, 0.034), (49, 0.01)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89547175 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport

Author: Marco Cuturi

Abstract: Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms’ dimension exceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximumentropy perspective. We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn’s matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.

2 0.70413798 206 nips-2013-Near-Optimal Entrywise Sampling for Data Matrices

Author: Dimitris Achlioptas, Zohar S. Karnin, Edo Liberty

Abstract: We consider the problem of selecting non-zero entries of a matrix A in order to produce a sparse sketch of it, B, that minimizes ‖A − B‖2. For large m × n matrices, such that n ≫ m (for example, representing n observations over m attributes) we give sampling distributions that exhibit four important properties. First, they have closed forms computable from minimal information regarding A. Second, they allow sketching of matrices whose non-zeros are presented to the algorithm in arbitrary order as a stream, with O(1) computation per non-zero. Third, the resulting sketch matrices are not only sparse, but their non-zero entries are highly compressible. Lastly, and most importantly, under mild assumptions, our distributions are provably competitive with the optimal offline distribution. Note that the probabilities in the optimal offline distribution may be complex functions of all the entries in the matrix. Therefore, regardless of computational complexity, the optimal distribution might be impossible to compute in the streaming model. 1

3 0.68779409 108 nips-2013-Error-Minimizing Estimates and Universal Entry-Wise Error Bounds for Low-Rank Matrix Completion

Author: Franz Kiraly, Louis Theran

Abstract: We propose a general framework for reconstructing and denoising single entries of incomplete and noisy entries. We describe: effective algorithms for deciding if and entry can be reconstructed and, if so, for reconstructing and denoising it; and a priori bounds on the error of each entry, individually. In the noiseless case our algorithm is exact. For rank-one matrices, the new algorithm is fast, admits a highly-parallel implementation, and produces an error minimizing estimate that is qualitatively close to our theoretical and the state-of-the-art Nuclear Norm and OptSpace methods. 1

4 0.67166311 185 nips-2013-Matrix Completion From any Given Set of Observations

Author: Troy Lee, Adi Shraibman

Abstract: In the matrix completion problem the aim is to recover an unknown real matrix from a subset of its entries. This problem comes up in many application areas, and has received a great deal of attention in the context of the netflix prize. A central approach to this problem is to output a matrix of lowest possible complexity (e.g. rank or trace norm) that agrees with the partially specified matrix. The performance of this approach under the assumption that the revealed entries are sampled randomly has received considerable attention (e.g. [1, 2, 3, 4, 5, 6, 7, 8]). In practice, often the set of revealed entries is not chosen at random and these results do not apply. We are therefore left with no guarantees on the performance of the algorithm we are using. We present a means to obtain performance guarantees with respect to any set of initial observations. The first step remains the same: find a matrix of lowest possible complexity that agrees with the partially specified matrix. We give a new way to interpret the output of this algorithm by next finding a probability distribution over the non-revealed entries with respect to which a bound on the generalization error can be proven. The more complex the set of revealed entries according to a certain measure, the better the bound on the generalization error. 1

5 0.56932759 110 nips-2013-Estimating the Unseen: Improved Estimators for Entropy and other Properties

Author: Paul Valiant, Gregory Valiant

Abstract: Recently, Valiant and Valiant [1, 2] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most n distinct elements, these properties can be estimated accurately using a sample of size O(n/ log n). We propose a novel modification of this approach and show: 1) theoretically, this estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in our approach is to first use the sample to characterize the “unseen” portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems. 1

6 0.55711007 342 nips-2013-Unsupervised Spectral Learning of Finite State Transducers

7 0.55669677 357 nips-2013-k-Prototype Learning for 3D Rigid Structures

8 0.54340523 299 nips-2013-Solving inverse problem of Markov chain with partial observations

9 0.53524095 352 nips-2013-What do row and column marginals reveal about your dataset?

10 0.53455204 186 nips-2013-Matrix factorization with binary components

11 0.51562375 307 nips-2013-Speedup Matrix Completion with Side Information: Application to Multi-Label Learning

12 0.50105709 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression

13 0.49272501 214 nips-2013-On Algorithms for Sparse Multi-factor NMF

14 0.47683585 297 nips-2013-Sketching Structured Matrices for Faster Nonlinear Regression

15 0.47175232 35 nips-2013-Analyzing the Harmonic Structure in Graph-Based Learning

16 0.4688966 332 nips-2013-Tracking Time-varying Graphical Structure

17 0.46870223 131 nips-2013-Geometric optimisation on positive definite matrices for elliptically contoured distributions

18 0.4669795 222 nips-2013-On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization

19 0.46132725 254 nips-2013-Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms

20 0.46126559 343 nips-2013-Unsupervised Structure Learning of Stochastic And-Or Grammars


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.018), (16, 0.047), (19, 0.012), (33, 0.094), (34, 0.079), (41, 0.033), (49, 0.036), (56, 0.109), (70, 0.019), (85, 0.055), (89, 0.02), (92, 0.316), (93, 0.05), (95, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.72712272 296 nips-2013-Sinkhorn Distances: Lightspeed Computation of Optimal Transport

Author: Marco Cuturi

Abstract: Optimal transport distances are a fundamental family of distances for probability measures and histograms of features. Despite their appealing theoretical properties, excellent performance in retrieval tasks and intuitive formulation, their computation involves the resolution of a linear program whose cost can quickly become prohibitive whenever the size of the support of these measures or the histograms’ dimension exceeds a few hundred. We propose in this work a new family of optimal transport distances that look at transport problems from a maximumentropy perspective. We smooth the classic optimal transport problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn’s matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transport solvers. We also show that this regularized distance improves upon classic optimal transport distances on the MNIST classification problem.

2 0.65395772 280 nips-2013-Robust Data-Driven Dynamic Programming

Author: Grani Adiwena Hanasusanto, Daniel Kuhn

Abstract: In stochastic optimal control the distribution of the exogenous noise is typically unknown and must be inferred from limited data before dynamic programming (DP)-based solution schemes can be applied. If the conditional expectations in the DP recursions are estimated via kernel regression, however, the historical sample paths enter the solution procedure directly as they determine the evaluation points of the cost-to-go functions. The resulting data-driven DP scheme is asymptotically consistent and admits an efficient computational solution when combined with parametric value function approximations. If training data is sparse, however, the estimated cost-to-go functions display a high variability and an optimistic bias, while the corresponding control policies perform poorly in out-of-sample tests. To mitigate these small sample effects, we propose a robust data-driven DP scheme, which replaces the expectations in the DP recursions with worst-case expectations over a set of distributions close to the best estimate. We show that the arising minmax problems in the DP recursions reduce to tractable conic programs. We also demonstrate that the proposed robust DP algorithm dominates various non-robust schemes in out-of-sample tests across several application domains. 1

3 0.6457262 282 nips-2013-Robust Multimodal Graph Matching: Sparse Coding Meets Graph Matching

Author: Marcelo Fiori, Pablo Sprechmann, Joshua Vogelstein, Pablo Muse, Guillermo Sapiro

Abstract: Graph matching is a challenging problem with very important applications in a wide range of fields, from image and video analysis to biological and biomedical problems. We propose a robust graph matching algorithm inspired in sparsityrelated techniques. We cast the problem, resembling group or collaborative sparsity formulations, as a non-smooth convex optimization problem that can be efficiently solved using augmented Lagrangian techniques. The method can deal with weighted or unweighted graphs, as well as multimodal data, where different graphs represent different types of data. The proposed approach is also naturally integrated with collaborative graph inference techniques, solving general network inference problems where the observed variables, possibly coming from different modalities, are not in correspondence. The algorithm is tested and compared with state-of-the-art graph matching techniques in both synthetic and real graphs. We also present results on multimodal graphs and applications to collaborative inference of brain connectivity from alignment-free functional magnetic resonance imaging (fMRI) data. The code is publicly available. 1

4 0.54587108 30 nips-2013-Adaptive dropout for training deep neural networks

Author: Jimmy Ba, Brendan Frey

Abstract: Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g, by randomly dropping out 50% of their activities. We describe a method called ‘standout’ in which a binary belief network is overlaid on a neural network and is used to regularize of its hidden units by selectively setting activities to zero. This ‘adaptive dropout network’ can be trained jointly with the neural network by approximately computing local expectations of binary dropout variables, computing derivatives using back-propagation, and using stochastic gradient descent. Interestingly, experiments show that the learnt dropout network parameters recapitulate the neural network parameters, suggesting that a good dropout network regularizes activities according to magnitude. When evaluated on the MNIST and NORB datasets, we found that our method achieves lower classification error rates than other feature learning methods, including standard dropout, denoising auto-encoders, and restricted Boltzmann machines. For example, our method achieves 0.80% and 5.8% errors on the MNIST and NORB test sets, which is better than state-of-the-art results obtained using feature learning methods, including those that use convolutional architectures. 1

5 0.50236684 79 nips-2013-DESPOT: Online POMDP Planning with Regularization

Author: Adhiraj Somani, Nan Ye, David Hsu, Wee Sun Lee

Abstract: POMDPs provide a principled framework for planning under uncertainty, but are computationally intractable, due to the “curse of dimensionality” and the “curse of history”. This paper presents an online POMDP algorithm that alleviates these difficulties by focusing the search on a set of randomly sampled scenarios. A Determinized Sparse Partially Observable Tree (DESPOT) compactly captures the execution of all policies on these scenarios. Our Regularized DESPOT (R-DESPOT) algorithm searches the DESPOT for a policy, while optimally balancing the size of the policy and its estimated value obtained under the sampled scenarios. We give an output-sensitive performance bound for all policies derived from a DESPOT, and show that R-DESPOT works well if a small optimal policy exists. We also give an anytime algorithm that approximates R-DESPOT. Experiments show strong results, compared with two of the fastest online POMDP algorithms. Source code along with experimental settings are available at http://bigbird.comp. nus.edu.sg/pmwiki/farm/appl/. 1

6 0.50100142 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies

7 0.50056869 104 nips-2013-Efficient Online Inference for Bayesian Nonparametric Relational Models

8 0.49956912 249 nips-2013-Polar Operators for Structured Sparse Estimation

9 0.49741265 355 nips-2013-Which Space Partitioning Tree to Use for Search?

10 0.49704731 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables

11 0.49610236 252 nips-2013-Predictive PAC Learning and Process Decompositions

12 0.49582219 228 nips-2013-Online Learning of Dynamic Parameters in Social Networks

13 0.49562588 232 nips-2013-Online PCA for Contaminated Data

14 0.4953579 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning

15 0.49526092 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA

16 0.49501896 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction

17 0.49474952 102 nips-2013-Efficient Algorithm for Privately Releasing Smooth Queries

18 0.49442178 318 nips-2013-Structured Learning via Logistic Regression

19 0.49341163 33 nips-2013-An Approximate, Efficient LP Solver for LP Rounding

20 0.4933618 99 nips-2013-Dropout Training as Adaptive Regularization