nips nips2010 nips2010-166 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kiyohito Nagano, Yoshinobu Kawahara, Satoru Iwata
Abstract: A number of objective functions in clustering problems can be described with submodular functions. In this paper, we introduce the minimum average cost criterion, and show that the theory of intersecting submodular functions can be used for clustering with submodular objective functions. The proposed algorithm does not require the number of clusters in advance, and it will be determined by the property of a given set of data points. The minimum average cost clustering problem is parameterized with a real variable, and surprisingly, we show that all information about optimal clusterings for all parameters can be computed in polynomial time in total. Additionally, we evaluate the performance of the proposed algorithm through computational experiments. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract. A number of objective functions in clustering problems can be described with submodular functions. [sent-12, score-0.733]
2 In this paper, we introduce the minimum average cost criterion, and show that the theory of intersecting submodular functions can be used for clustering with submodular objective functions. [sent-13, score-1.738]
3 The minimum average cost clustering problem is parameterized with a real variable, and surprisingly, we show that all information about optimal clusterings for all parameters can be computed in polynomial time in total. [sent-15, score-0.538]
4 1 Introduction A clustering of a finite set V of data points is a partition of V into subsets (called clusters) such that data points in the same cluster are similar to each other. [sent-17, score-0.564]
5 Basically, a clustering problem asks for a partition P of V such that the intra-cluster similarity is maximized and/or the inter-cluster similarity is minimized. [sent-18, score-0.561]
6 The clustering of data is one of the most fundamental unsupervised learning problems. [sent-19, score-0.197]
7 We use a criterion function defined on all partitions of V , and the clustering problem becomes that of finding a partition of V that minimizes the clustering cost under some constraints. [sent-20, score-0.928]
8 Suppose that the inhomogeneity of subsets of the data set is measured by a nonnegative set function f : 2V → R with f (∅) = 0, where 2V denotes the set of all subsets of V , and the clustering cost of a partition P = {S1 , S2 , . . . , Sk } is defined by f [P] = f (S1 ) + · · · + f (Sk ). [sent-21, score-0.756]
9 A number of set functions that represent the inhomogeneity, including cut functions of graphs and entropy functions, are known to be submodular [3, 4]. [sent-25, score-0.618]
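To make the setting concrete, here is a minimal sketch (ours, not the paper's code) of a weighted graph cut function, one of the submodular functions mentioned above, together with the clustering cost f[P] of a partition; the graph, weights, and function names are invented for illustration.

    from itertools import combinations

    # Toy undirected graph on V = {0, 1, 2, 3} with made-up edge weights.
    V = [0, 1, 2, 3]
    w = {(0, 1): 3.0, (1, 2): 1.0, (2, 3): 2.0, (0, 3): 0.5}

    def cut(S):
        # Cut function f(S): total weight of the edges with exactly one endpoint in S.
        S = set(S)
        return sum(wij for (i, j), wij in w.items() if (i in S) != (j in S))

    def clustering_cost(partition):
        # f[P] = f(S1) + ... + f(Sk) for a partition P = {S1, ..., Sk}.
        return sum(cut(S) for S in partition)

    def is_submodular(f, ground):
        # Check f(S) + f(T) >= f(S | T) + f(S & T) over all pairs of subsets.
        subs = [set(c) for r in range(len(ground) + 1) for c in combinations(ground, r)]
        return all(f(S) + f(T) >= f(S | T) + f(S & T) - 1e-9 for S in subs for T in subs)

    print(is_submodular(cut, V))              # True: graph cut functions are submodular
    print(clustering_cost([{0, 1}, {2, 3}]))  # 3.0 = cut({0,1}) + cut({2,3}) = 1.5 + 1.5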
10 A submodular function is known to be a discrete counterpart of a convex function, and in recent years, its importance has been recognized in the field of machine learning. [sent-27, score-0.534]
11 For any given integer k with 1 ≤ k ≤ n, where n is the number of points in V , a partition P of V is called a k-partition if there are exactly k nonempty elements in P, and is called an optimal k-clustering if P is a k-partition that minimizes the cost f [P] among all k-partitions. [sent-28, score-0.48]
12 A problem of finding an optimal k-clustering is widely studied in combinatorial optimization and various fields, and it is recognized as a natural formulation of a clustering problem [8, 9, 10]. [sent-29, score-0.266]
13 Even if f is a cut function of a graph, which is submodular and symmetric, that is, f (V − S) = f (S) for all S ⊆ V , this problem is known to be NP-hard unless k can be regarded as a constant [5]. [sent-30, score-0.576]
14 [10] dealt with the case when f is submodular and symmetric. [sent-33, score-0.515]
15 [17] gave a 2(1−1/k)-approximation algorithm using Queyranne’s symmetric submodular function minimization algorithm [13]. [sent-35, score-0.571]
16 [10] showed that Queyranne’s algorithm can be used for clustering problems with some specific natural criteria. [sent-37, score-0.225]
17 For a general submodular function and a small constant k, constant factor approximation algorithms for optimal k-clusterings are designed in [12, 18]. [sent-38, score-0.534]
18 In addition, balanced clustering problems with submodular costs are studied in [8, 9]. [sent-39, score-0.739]
19 In this paper, we introduce a new clustering criterion to resolve the above shortcomings of previous approaches [10]. [sent-42, score-0.217]
20 In the minimum average cost (MAC) clustering problem we consider, the objective function is the average cost of a partition P which combines the clustering cost f [P] and the number of clusters |P|. [sent-43, score-1.162]
21 We argue that the MAC clustering problem represents a natural clustering criterion. [sent-45, score-0.394]
22 In this paper, we show that the Dilworth truncation of an intersecting submodular function [2] (see also Chapter II of Fujishige [4] and Chapter 48 of Schrijver [14]) can be used to solve the clustering problem exactly and efficiently. [sent-46, score-1.124]
23 To the best of our knowledge, this is the first time that the theory of intersecting submodular functions is used for clustering. [sent-47, score-0.844]
24 The MAC clustering problem can be parameterized with a real-valued parameter β ≥ 0, and the problem with respect to β asks for a partition P of V that minimizes the average cost under the constraint |P| > β. [sent-48, score-0.686]
25 The main contribution of this paper is a polynomial time algorithm that solves the MAC clustering problem exactly for any given parameter β. [sent-49, score-0.298]
26 Even more surprisingly, our algorithm computes all information about MAC clusterings for all parameters in polynomial time in total. [sent-51, score-0.206]
27 If f is a cut function and β = 1, the optimal value of the MAC clustering problem coincides with the strength of a graph [1]. [sent-53, score-0.302]
28 In addition, the computation of the principal sequence of partitions of a graph [7] is a special case of the parametrized MAC clustering problem in an implicit way. [sent-54, score-0.301]
29 In Section 2, we formulate the minimum average cost clustering problem, and show a structural property of minimum average cost clusterings. [sent-56, score-0.59]
30 In Section 3, we propose a framework of our algorithm for the minimum average cost clustering problem. [sent-57, score-0.407]
31 In Section 4, we explain the basic results on the theory of intersecting submodular functions, and describe the Dilworth truncation algorithm which is used in Section 3 as a subroutine. [sent-58, score-0.955]
32 Let V be a given set of n data points, and let f : 2V → R be a nonnegative submodular function with f (∅) = 0, which is not necessarily symmetric. [sent-62, score-0.551]
33 For a partition P = {S1 , S2 , . . . , Sk }, the clustering cost is defined by f [P] = f (S1 ) + · · · + f (Sk ). [sent-67, score-0.272]
34 We will introduce the minimum average cost criterion in order to take both the clustering cost f [P] and the number of clusters |P| into consideration. [sent-68, score-0.538]
35 1 Definition Consider a k-partition P of V with k > 1, and compare P with the trivial partition {V } of V . [sent-70, score-0.341]
36 Then, the number of clusters has increased by k − 1 and the clustering cost has increased by f [P] + c, where c is a constant. [sent-71, score-0.336]
37 Suppose that P ∗ is a partition of V that minimizes the average cost among all partitions P of V with |P| > 1. [sent-73, score-0.545]
38 For any parameter β ∈ [0, n), we consider the minimum average cost (MAC) clustering problem λ(β) := min {f [P]/(|P| − β) : P is a partition of V , |P| > β}. (1) [sent-77, score-0.72]
39 Let us say that a partition P is a β-MAC clustering if P is optimal for the problem (1) with respect to β ∈ [0, n). [sent-78, score-0.557]
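As a sanity check of definition (1), the following brute-force sketch (our own; the paper's algorithm is far more efficient) enumerates all partitions of a tiny ground set and evaluates λ(β) directly; the cut function of the complete graph is used only as a convenient submodular example.

    from math import inf

    V = [0, 1, 2, 3]
    f = lambda S: len(S) * (len(V) - len(S))   # cut function of the complete graph (submodular, symmetric)

    def partitions(elems):
        # Enumerate all partitions of a list (Bell-number many; only feasible for tiny V).
        if not elems:
            yield []
            return
        first, rest = elems[0], elems[1:]
        for sub in partitions(rest):
            for i in range(len(sub)):
                yield sub[:i] + [sub[i] | {first}] + sub[i + 1:]
            yield sub + [{first}]

    def mac(beta):
        # lambda(beta) = min { f[P] / (|P| - beta) : P a partition of V, |P| > beta }.
        best, best_P = inf, None
        for P in partitions(V):
            if len(P) > beta:
                val = sum(f(S) for S in P) / (len(P) - beta)
                if val < best:
                    best, best_P = val, P
        return best, best_P

    for beta in [0.0, 1.0, 2.5]:
        print(beta, *mac(beta))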
40 Let P be a β-MAC clustering for some β ∈ [0, n), and set k := |P|. [sent-83, score-0.197]
41 Our algorithm requires the help of the theory of intersecting submodular functions [4, 14]. [sent-89, score-0.872]
42 Proposition 1 says that if there exists a β-MAC clustering P satisfying |P| = k, then we obtain an optimal k-clustering. [sent-90, score-0.216]
43 If P is a partition of V satisfying |P| ≤ β, we have −βλ ≤ −|P|λ ≤ f [P] − |P|λ for all λ ∈ R+ . [sent-99, score-0.341]
44 For λ ≥ 0, we say that a partition P determines h at λ if f [P] − |P|λ = h(λ), where h(λ) := min{f [P] − |P|λ : P is a partition of V }. [sent-102, score-0.382]
45 For each partition P of V , define a linear function hP : R+ → R as hP (λ) = f [P] − |P|λ. [sent-106, score-0.341]
46 We have h(0) = f (V ) because f [{V }] ≤ f [P] for any partition P of V . [sent-109, score-0.341]
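Since h is the pointwise minimum of the linear functions hP, it is a piecewise-linear concave function of λ; the short sketch below (again brute force, reusing the partitions enumerator from the previous sketch) evaluates h at a few values of λ to make this structure visible.

    # h(lam) = min over partitions P of V of f[P] - |P|*lam; it is the lower envelope of
    # the linear functions h_P, hence piecewise linear and concave in lam.
    V = [0, 1, 2, 3]
    f = lambda S: len(S) * (len(V) - len(S))
    all_P = list(partitions(V))               # reuses the enumerator from the previous sketch

    def h(lam):
        return min(sum(f(S) for S in P) - len(P) * lam for P in all_P)

    for lam in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]:
        print(lam, h(lam))                    # the slope changes at the breakpoints of h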
47 In addition, a β-MAC clustering can be characterized as follows. [sent-115, score-0.197]
48 Given a parameter β ∈ [0, n), let P be a partition of V such that |P| > β and h(λ(β)) = f [P] − |P|λ(β). [sent-117, score-0.341]
49 For any partition Q of V with |Q| > β, we have −βλ(β) ≤ f [Q] − |Q|λ(β), and thus λ(β) ≤ f [Q]/(|Q| − β). [sent-121, score-0.341]
50 For j = 1, . . . , d, the partition Psj determines h at all λ ∈ Rj (see Figure 2 (a)). [sent-145, score-0.382]
51 In particular, sd = n and the last partition Psd is the set of singletons {{1}, {2}, . . . , {n}}. [sent-146, score-0.396]
52 For any β ∈ Bj , the partition Ps∗ is a β-MAC clustering. [sent-179, score-0.341]
53 3 The clustering algorithm In this section, we present a framework of a polynomial time algorithm that finds the collection {Ps1 , Ps2 , . . . , Psd }. [sent-188, score-0.346]
54 Procedure FINDPARTITION(λ): For any given λ ≥ 0, this procedure computes the value h(λ) and finds a partition P of V that determines h at λ. [sent-196, score-0.452]
55 We will use SFM(n) to denote the time required to minimize a general submodular function defined on 2V , where n = |V |. [sent-197, score-0.515]
56 The proof of Lemma 4, which will be given in §4, utilizes the Dilworth truncation of an intersecting submodular function [4, 14]. [sent-202, score-0.927]
57 Let us call a partition P of V supporting if there exists λ ≥ 0 such that h(λ) = hP (λ). [sent-203, score-0.46]
58 In addition, for any λ ≥ 0, FINDPARTITION(λ) returns a supporting partition of V . [sent-205, score-0.486]
59 Q1 is a supporting partition of V because h(0) = f [{V }] = hQ1 (0), and Qn is also supporting because Qn = Psd . [sent-210, score-0.579]
60 For a supporting partition P of V , if |P| = sj for some j ∈ {1, . [sent-211, score-0.505]
61 Suppose that we are given two supporting partitions Qk and Qℓ such that |Qk | = k, |Qℓ | = ℓ and k < ℓ. [sent-218, score-0.198]
62 We describe the algorithm SPLIT(Qk , Qℓ ), which computes the information about all breakpoints of h on the interval R(k, ℓ). [sent-219, score-0.205]
63 Moreover, if the decision is negative, the algorithm finds a supporting partition Qm such that |Qm | = m and k < m < ℓ. [sent-225, score-0.488]
64 Then the algorithm gives a negative answer, and the partition P returned by FINDPARTITION is supporting and satisfies k < |P| < ℓ. [sent-235, score-0.488]
65 By performing FINDPARTITION(λ), compute h(λ) and a partition P of V that determines h at λ. [sent-242, score-0.382]
66 In the algorithm, each call of FINDPARTITION either (i) yields the information about one breakpoint of h, or (ii) produces a new supporting partition Qm . [sent-263, score-0.492]
67 Throughout the execution of SPLIT(Q1 , Qn ), the algorithm computes a supporting k-partition at most once for each k ∈ {1, . . . , n}. [sent-265, score-0.209]
68 All information about optimal solutions to the minimum average cost clustering problem (1) for all parameters β ∈ [0, n) can be computed in O(n2 · SFM(n)) time in total. [sent-272, score-0.398]
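The divide-and-conquer structure of SPLIT can be summarized in the following sketch; this is our schematic reading of the description above, with FINDPARTITION treated as a black box returning h(λ) and a partition determining h at λ (the function names and the recursion bookkeeping are ours).

    def f_cost(f, P):
        # Clustering cost f[P]: the sum of f over the blocks of the partition P.
        return sum(f(S) for S in P)

    def split(Qk, Ql, f, find_partition, out, eps=1e-9):
        # Recover the breakpoints of h between two supporting partitions Qk and Ql
        # with |Qk| < |Ql|.  `find_partition(lam)` is assumed to return (h(lam), P)
        # where P is a partition of V determining h at lam.
        k, l = len(Qk), len(Ql)
        lam = (f_cost(f, Ql) - f_cost(f, Qk)) / (l - k)   # where the lines h_Qk and h_Ql cross
        h_lam, P = find_partition(lam)
        if h_lam >= f_cost(f, Qk) - k * lam - eps:
            # h touches both lines at lam: lam is a breakpoint, nothing lies in between.
            out.append((lam, Qk, Ql))
        else:
            # P is a new supporting partition with k < |P| < l: recurse on both sides.
            split(Qk, P, f, find_partition, out, eps)
            split(P, Ql, f, find_partition, out, eps)

    # Usage (schematic): out = []; split([set(V)], [{v} for v in V], f, find_partition, out)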
69 4 Finding a partition In the clustering algorithm of Section 3, we iteratively call the procedure FINDPARTITION, which computes h(λ) defined in (3) and a partition P that determines h at λ for any given λ ≥ 0. [sent-273, score-1.018]
70 In this section, we will see that the procedure FINDPARTITION can be implemented to run in polynomial time with the aid of the Dilworth truncation of an intersecting submodular function [2], and give a proof of Lemma 4. [sent-274, score-1.032]
71 1 The Dilworth truncation of an intersecting submodular function We start with definitions of an intersecting submodular function and the Dilworth truncation. [sent-279, score-1.75]
72 Subsets S, T ⊆ V are intersecting if S ∩ T ≠ ∅, S \ T ≠ ∅, and T \ S ≠ ∅. [sent-280, score-0.308]
73 A set function g : 2V → R is intersecting submodular if g(S) + g(T ) ≥ g(S ∪ T ) + g(S ∩ T ) for all intersecting subsets S, T ⊆ V . [sent-281, score-1.157]
74 Clearly, the fully submodular function f is also intersecting submodular. [sent-282, score-0.857]
75 For any λ ≥ 0, define the set function fλ : 2V → R by fλ (S) = f (S) − λ for each nonempty S ⊆ V and fλ (∅) = 0. (To emphasize the difference between submodular and intersecting submodular functions, in what follows we refer to a submodular function as a fully submodular function.) [sent-283, score-2.402]
76 It is easy to see that fλ is an intersecting submodular function. [sent-285, score-0.823]
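This claim can be checked numerically on a small example; the sketch below (ours) verifies that fλ(S) = f(S) − λ (for nonempty S) satisfies the intersecting submodular inequality, while full submodularity may fail once λ is large enough. The ground set, the cut function, and the value of λ are arbitrary illustrative choices.

    from itertools import combinations

    V = [0, 1, 2, 3]
    f = lambda S: len(S) * (len(V) - len(S))        # cut function of the complete graph on V
    lam = 3.0                                       # large enough to break full submodularity
    f_lam = lambda S: (f(S) - lam) if S else 0.0    # f_lam(S) = f(S) - lam for nonempty S

    def subsets(ground):
        return [frozenset(c) for r in range(len(ground) + 1) for c in combinations(ground, r)]

    def violates(g, pairs):
        return any(g(S) + g(T) + 1e-9 < g(S | T) + g(S & T) for S, T in pairs)

    all_pairs = [(S, T) for S in subsets(V) for T in subsets(V)]
    intersecting_pairs = [(S, T) for S, T in all_pairs if S & T and S - T and T - S]

    print(violates(f_lam, all_pairs))           # True: full submodularity fails (disjoint singletons)
    print(violates(f_lam, intersecting_pairs))  # False: the intersecting inequality always holds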
77 For a fully submodular function f with f (∅) = 0, consider the polyhedron P(f ) = {x ∈ Rn : x(S) ≤ f (S) for all nonempty S ⊆ V }, where x(S) = ∑i∈S xi . [sent-286, score-0.593]
78 The polyhedron P(f ) is called a submodular polyhedron. [sent-287, score-0.559]
79 In the same manner, for an intersecting submodular function g with g(∅) = 0, define P(g) = {x ∈ Rn : x(S) ≤ g(S) for all nonempty S ⊆ V }. [sent-288, score-0.823]
80 Given an intersecting submodular function g : 2V → R with g(∅) = 0, there exists a fully submodular function ĝ : 2V → R such that ĝ(∅) = 0 and P(ĝ) = P(g). [sent-295, score-1.372]
81 Furthermore, the function ĝ, called the Dilworth truncation of g, can be represented as ĝ(S) = min{ ∑T ∈P g(T ) : P is a partition of S}. [sent-296, score-0.341]
82 For a general intersecting submodular function g, however, the computation of ĝ(S) is a nontrivial task. [sent-299, score-0.823]
83 Suppose that a fully submodular function f : 2{1, 2} → R satisfies f (∅) = 0, f ({1}) = 12, f ({2}) = 8, and f ({1, 2}) = 19. [sent-301, score-0.549]
84 Observe that f̂λ is fully submodular and P(f̂λ ) = P(fλ ). [sent-305, score-0.549]
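The two-element example above can be worked out directly; the following brute-force sketch (ours) computes the Dilworth truncation f̂λ by minimizing over all partitions, with λ = 2 chosen purely for illustration.

    # Brute-force check of the Dilworth truncation on the two-element example above;
    # lambda = 2 is our own illustrative choice (it is not taken from the paper's figure).
    f = {frozenset(): 0.0, frozenset({1}): 12.0, frozenset({2}): 8.0, frozenset({1, 2}): 19.0}
    lam = 2.0
    f_lam = lambda S: f[frozenset(S)] - lam if S else 0.0   # f_lam(S) = f(S) - lam, S nonempty

    # hat(f_lam)(S) = min over partitions P of S of the sum of f_lam over the blocks of P.
    trunc_1  = f_lam({1})                                      # only partition: {{1}}
    trunc_2  = f_lam({2})                                      # only partition: {{2}}
    trunc_12 = min(f_lam({1, 2}), f_lam({1}) + f_lam({2}))     # {{1,2}} vs {{1},{2}}

    print(trunc_1, trunc_2, trunc_12)   # 10.0 6.0 16.0 -- the minimum for {1,2} is the split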
85 Figure 5: The greedy algorithm [3]. 2 Algorithm that finds a partition Let us fix λ ≥ 0, and describe FINDPARTITION(λ). [sent-308, score-0.411]
86 We ask for a partition P of V satisfying f̂λ (V ) = fλ [P] (= ∑T ∈P fλ (T )) because such a partition P of V determines h at λ. [sent-310, score-0.723]
87 We know that f̂λ : 2V → R is submodular, but f̂λ (S) = min{fλ [P] : P is a partition of S} cannot be obtained directly for each S ⊆ V . [sent-311, score-0.341]
88 Hence, the value αℓ can be computed by minimizing a fully submodular function. [sent-345, score-0.549]
89 In addition to the value h(λ), a partition P of V such that f [P] − λ|P| = h(λ) is also required. [sent-347, score-0.341]
90 Output : A real value hλ and a partition Pλ of V . [sent-350, score-0.341]
91 By the intersecting submodularity of fλ , if S and T are intersecting and both S and T are x-tight, then S ∪ T is also x-tight. [sent-370, score-0.616]
92 By Theorem 8, we have hλ = h(λ), and by Lemma 9, we have f̂λ (V ) = fλ [Pλ ], and thus the partition Pλ of V determines h at λ. [sent-389, score-0.382]
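For tiny ground sets, the whole of FINDPARTITION can be emulated by brute force: the sketch below (ours) follows the greedy scheme described above but replaces the submodular function minimization used to compute each value αℓ by explicit enumeration of subsets, so it is exponential and only meant to illustrate the role of x-tight sets and the partition update.

    from itertools import combinations

    def find_partition(f, V, lam, eps=1e-9):
        # Brute-force FINDPARTITION(lam) for tiny V: returns (h(lam), P) with
        # h(lam) = min over partitions P of V of f[P] - |P|*lam, and P attaining it.
        f_lam = lambda S: f(S) - lam if S else 0.0
        x = {}                       # the vector built by the greedy algorithm
        blocks = []                  # current partition of the processed elements
        processed = []
        for v in V:
            processed.append(v)
            others = [u for u in processed if u != v]
            # all subsets of the processed elements that contain v
            cands = [frozenset(c) | {v} for r in range(len(others) + 1)
                     for c in combinations(others, r)]
            # alpha = min { f_lam(S) - x(S \ {v}) : v in S, S subset of processed elements }
            # (in the paper this minimization is done by submodular function minimization)
            x[v] = min(f_lam(S) - sum(x[u] for u in S if u != v) for S in cands)
            # the maximal x-tight set containing v is the union of all x-tight candidates
            tight = [S for S in cands if abs(sum(x[u] for u in S) - f_lam(S)) <= eps]
            T = frozenset().union(*tight)
            # blocks meeting T are absorbed into T, so the blocks stay a partition
            blocks = [B for B in blocks if not (B & T)] + [set(T)]
        return sum(x.values()), blocks        # h(lam) = x(V), and the partition P_lam

    # The two-element example from the text, with an illustrative lam = 2:
    f = lambda S: {frozenset(): 0.0, frozenset({1}): 12.0,
                   frozenset({2}): 8.0, frozenset({1, 2}): 19.0}[frozenset(S)]
    print(find_partition(f, [1, 2], lam=2.0))   # (16.0, [{1}, {2}])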
93 The compared algorithms are the k-means method, the spectral clustering method with normalized cut [11], and maximum-margin clustering [16]; we used cut functions as the objective functions for the MAC clustering algorithm. [sent-410, score-0.497]
94 6 Concluding remarks We have introduced a new concept, the minimum average cost clustering problem. [sent-438, score-0.398]
95 We have shown that the set of minimum average cost clusterings has a compact representation, and, when the clustering cost is given by a submodular function, we have proposed a polynomial time algorithm that computes all information about minimum average cost clusterings. [sent-439, score-1.319]
96 The present paper has reinforced the importance of the theory of intersecting submodular functions from the viewpoint of clustering. [sent-441, score-0.844]
97 Nagamochi: Divide-and-conquer algorithms for partitioning hypergraphs and submodular systems. [sent-507, score-0.515]
98 Queyranne: Minimizing symmetric submodular functions, Mathematical Programming, 82 (1998), pp. [sent-511, score-0.515]
99 Ibaraki: Approximating the minimum k-way cut in a graph via minimum 3-way cuts. [sent-530, score-0.238]
100 Ibaraki: A unified framework for approximating multiway partition problems. [sent-536, score-0.341]
wordName wordTfidf (topN-words)
[('submodular', 0.515), ('partition', 0.341), ('intersecting', 0.308), ('ind', 0.261), ('plit', 0.255), ('mac', 0.205), ('dilworth', 0.2), ('clustering', 0.197), ('qk', 0.191), ('supporting', 0.119), ('qm', 0.117), ('breakpoints', 0.112), ('sfm', 0.112), ('truncation', 0.104), ('psd', 0.086), ('partitions', 0.079), ('minimum', 0.076), ('cost', 0.075), ('hq', 0.075), ('polynomial', 0.073), ('clusterings', 0.067), ('qn', 0.067), ('clusters', 0.064), ('subintervals', 0.064), ('narasimhan', 0.064), ('cut', 0.061), ('edmonds', 0.055), ('hqk', 0.055), ('inhomogeneity', 0.055), ('nagamochi', 0.055), ('psj', 0.055), ('kawahara', 0.048), ('schrijver', 0.048), ('lemma', 0.048), ('hp', 0.048), ('sj', 0.045), ('polyhedron', 0.044), ('queyranne', 0.044), ('greedy', 0.042), ('determines', 0.041), ('iwata', 0.041), ('nagano', 0.041), ('computes', 0.038), ('ei', 0.037), ('nonnegative', 0.036), ('zhao', 0.036), ('bilmes', 0.036), ('ibaraki', 0.036), ('libras', 0.036), ('polyhedra', 0.036), ('japan', 0.036), ('sk', 0.035), ('fully', 0.034), ('procedure', 0.032), ('isaac', 0.032), ('breakpoint', 0.032), ('fukunaga', 0.032), ('combinatorial', 0.031), ('average', 0.031), ('kf', 0.029), ('fujishige', 0.029), ('lncs', 0.029), ('iris', 0.029), ('property', 0.029), ('sd', 0.029), ('suppose', 0.028), ('algorithm', 0.028), ('balanced', 0.027), ('interval', 0.027), ('rj', 0.026), ('returns', 0.026), ('nonempty', 0.026), ('singletons', 0.026), ('subsets', 0.026), ('graph', 0.025), ('execution', 0.024), ('ex', 0.024), ('nds', 0.024), ('bd', 0.024), ('glass', 0.024), ('concluding', 0.023), ('asks', 0.023), ('uci', 0.022), ('theorem', 0.022), ('industrial', 0.022), ('rn', 0.021), ('functions', 0.021), ('basically', 0.02), ('collection', 0.02), ('apparently', 0.02), ('criterion', 0.02), ('remarks', 0.019), ('equality', 0.019), ('recognized', 0.019), ('illustrative', 0.019), ('optimal', 0.019), ('illustrates', 0.019), ('minimizes', 0.019), ('advance', 0.018), ('id', 0.017)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000006 166 nips-2010-Minimum Average Cost Clustering
Author: Kiyohito Nagano, Yoshinobu Kawahara, Satoru Iwata
Abstract: A number of objective functions in clustering problems can be described with submodular functions. In this paper, we introduce the minimum average cost criterion, and show that the theory of intersecting submodular functions can be used for clustering with submodular objective functions. The proposed algorithm does not require the number of clusters in advance, and it will be determined by the property of a given set of data points. The minimum average cost clustering problem is parameterized with a real variable, and surprisingly, we show that all information about optimal clusterings for all parameters can be computed in polynomial time in total. Additionally, we evaluate the performance of the proposed algorithm through computational experiments. 1
2 0.3695322 258 nips-2010-Structured sparsity-inducing norms through submodular functions
Author: Francis R. Bach
Abstract: Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1 -norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
3 0.33118981 69 nips-2010-Efficient Minimization of Decomposable Submodular Functions
Author: Peter Stobbe, Andreas Krause
Abstract: Many combinatorial problems arising in machine learning can be reduced to the problem of minimizing a submodular function. Submodular functions are a natural discrete analog of convex functions, and can be minimized in strongly polynomial time. Unfortunately, state-of-the-art algorithms for general submodular minimization are intractable for larger problems. In this paper, we introduce a novel subclass of submodular minimization problems that we call decomposable. Decomposable submodular functions are those that can be represented as sums of concave functions applied to modular functions. We develop an algorithm, SLG, that can efficiently minimize decomposable submodular functions with tens of thousands of variables. Our algorithm exploits recent results in smoothed convex minimization. We apply SLG to synthetic benchmarks and a joint classification-and-segmentation task, and show that it outperforms the state-of-the-art general purpose submodular minimization algorithms by several orders of magnitude. 1
4 0.17492117 102 nips-2010-Generalized roof duality and bisubmodular functions
Author: Vladimir Kolmogorov
Abstract: Consider a convex relaxation f̂ of a pseudo-boolean function f . We say that the relaxation is totally half-integral if f̂(x) is a polyhedral function with half-integral extreme points x, and this property is preserved after adding an arbitrary combination of constraints of the form xi = xj , xi = 1 − xj , and xi = γ where γ ∈ {0, 1, 1/2} is a constant. A well-known example is the roof duality relaxation for quadratic pseudo-boolean functions f . We argue that total half-integrality is a natural requirement for generalizations of roof duality to arbitrary pseudo-boolean functions. Our contributions are as follows. First, we provide a complete characterization of totally half-integral relaxations f̂ by establishing a one-to-one correspondence with bisubmodular functions. Second, we give a new characterization of bisubmodular functions. Finally, we show some relationships between general totally half-integral relaxations and relaxations based on the roof duality. 1
5 0.15458065 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
Author: Margareta Ackerman, Shai Ben-David, David Loker
Abstract: Clustering is a basic data mining task with a wide variety of applications. Not surprisingly, there exist many clustering algorithms. However, clustering is an ill defined problem - given a data set, it is not clear what a “correct” clustering for that set is. Indeed, different algorithms may yield dramatically different outputs for the same input sets. Faced with a concrete clustering task, a user needs to choose an appropriate clustering algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Given the crucial effect of the choice of a clustering algorithm on the resulting clustering, this state of affairs is truly regrettable. In this paper we address the major research challenge of developing tools for helping users make more informed decisions when they come to pick a clustering tool for their data. This is, of course, a very ambitious endeavor, and in this paper, we make some first steps towards this goal. We propose to address this problem by distilling abstract properties of the input-output behavior of different clustering paradigms. In this paper, we demonstrate how abstract, intuitive properties of clustering functions can be used to taxonomize a set of popular clustering algorithmic paradigms. On top of addressing deterministic clustering algorithms, we also propose similar properties for randomized algorithms and use them to highlight functional differences between different common implementations of k-means clustering. We also study relationships between the properties, independent of any particular algorithm. In particular, we strengthen Kleinberg’s famous impossibility result, while providing a simpler proof. 1
6 0.11686841 108 nips-2010-Graph-Valued Regression
7 0.10281216 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
8 0.090289958 261 nips-2010-Supervised Clustering
9 0.078822665 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
10 0.07498122 78 nips-2010-Error Propagation for Approximate Policy and Value Iteration
11 0.072378695 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
12 0.071978301 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
13 0.064385213 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
14 0.064216062 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA
15 0.056597222 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
16 0.055159494 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
17 0.054369234 223 nips-2010-Rates of convergence for the cluster tree
18 0.051362757 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
19 0.051105507 221 nips-2010-Random Projections for $k$-means Clustering
20 0.04764834 180 nips-2010-Near-Optimal Bayesian Active Learning with Noisy Observations
topicId topicWeight
[(0, 0.128), (1, 0.021), (2, 0.128), (3, 0.074), (4, -0.025), (5, -0.161), (6, -0.033), (7, 0.137), (8, 0.435), (9, 0.031), (10, 0.252), (11, 0.124), (12, 0.132), (13, -0.154), (14, -0.076), (15, 0.126), (16, 0.08), (17, 0.039), (18, -0.016), (19, 0.058), (20, -0.073), (21, -0.076), (22, 0.014), (23, 0.022), (24, 0.028), (25, 0.113), (26, 0.043), (27, -0.022), (28, 0.002), (29, 0.029), (30, 0.059), (31, -0.044), (32, -0.025), (33, 0.046), (34, -0.008), (35, 0.006), (36, 0.056), (37, -0.028), (38, 0.051), (39, -0.007), (40, 0.048), (41, 0.044), (42, 0.001), (43, 0.046), (44, 0.043), (45, -0.111), (46, 0.024), (47, 0.075), (48, 0.009), (49, 0.012)]
simIndex simValue paperId paperTitle
same-paper 1 0.94659162 166 nips-2010-Minimum Average Cost Clustering
Author: Kiyohito Nagano, Yoshinobu Kawahara, Satoru Iwata
Abstract: A number of objective functions in clustering problems can be described with submodular functions. In this paper, we introduce the minimum average cost criterion, and show that the theory of intersecting submodular functions can be used for clustering with submodular objective functions. The proposed algorithm does not require the number of clusters in advance, and it will be determined by the property of a given set of data points. The minimum average cost clustering problem is parameterized with a real variable, and surprisingly, we show that all information about optimal clusterings for all parameters can be computed in polynomial time in total. Additionally, we evaluate the performance of the proposed algorithm through computational experiments. 1
2 0.83099747 69 nips-2010-Efficient Minimization of Decomposable Submodular Functions
Author: Peter Stobbe, Andreas Krause
Abstract: Many combinatorial problems arising in machine learning can be reduced to the problem of minimizing a submodular function. Submodular functions are a natural discrete analog of convex functions, and can be minimized in strongly polynomial time. Unfortunately, state-of-the-art algorithms for general submodular minimization are intractable for larger problems. In this paper, we introduce a novel subclass of submodular minimization problems that we call decomposable. Decomposable submodular functions are those that can be represented as sums of concave functions applied to modular functions. We develop an algorithm, SLG, that can efficiently minimize decomposable submodular functions with tens of thousands of variables. Our algorithm exploits recent results in smoothed convex minimization. We apply SLG to synthetic benchmarks and a joint classification-and-segmentation task, and show that it outperforms the state-of-the-art general purpose submodular minimization algorithms by several orders of magnitude. 1
3 0.79663026 258 nips-2010-Structured sparsity-inducing norms through submodular functions
Author: Francis R. Bach
Abstract: Sparse methods for supervised learning aim at finding good linear predictors from as few variables as possible, i.e., with small cardinality of their supports. This combinatorial selection problem is often turned into a convex optimization problem by replacing the cardinality function by its convex envelope (tightest convex lower bound), in this case the ℓ1 -norm. In this paper, we investigate more general set-functions than the cardinality, that may incorporate prior knowledge or structural constraints which are common in many applications: namely, we show that for nondecreasing submodular set-functions, the corresponding convex envelope can be obtained from its Lovász extension, a common tool in submodular analysis. This defines a family of polyhedral norms, for which we provide generic algorithmic tools (subgradients and proximal operators) and theoretical results (conditions for support recovery or high-dimensional inference). By selecting specific submodular functions, we can give a new interpretation to known norms, such as those based on rank-statistics or grouped norms with potentially overlapping groups; we also define new norms, in particular ones that can be used as non-factorial priors for supervised learning.
4 0.73937613 102 nips-2010-Generalized roof duality and bisubmodular functions
Author: Vladimir Kolmogorov
Abstract: Consider a convex relaxation f̂ of a pseudo-boolean function f . We say that the relaxation is totally half-integral if f̂(x) is a polyhedral function with half-integral extreme points x, and this property is preserved after adding an arbitrary combination of constraints of the form xi = xj , xi = 1 − xj , and xi = γ where γ ∈ {0, 1, 1/2} is a constant. A well-known example is the roof duality relaxation for quadratic pseudo-boolean functions f . We argue that total half-integrality is a natural requirement for generalizations of roof duality to arbitrary pseudo-boolean functions. Our contributions are as follows. First, we provide a complete characterization of totally half-integral relaxations f̂ by establishing a one-to-one correspondence with bisubmodular functions. Second, we give a new characterization of bisubmodular functions. Finally, we show some relationships between general totally half-integral relaxations and relaxations based on the roof duality. 1
5 0.41475892 273 nips-2010-Towards Property-Based Classification of Clustering Paradigms
Author: Margareta Ackerman, Shai Ben-David, David Loker
Abstract: Clustering is a basic data mining task with a wide variety of applications. Not surprisingly, there exist many clustering algorithms. However, clustering is an ill defined problem - given a data set, it is not clear what a “correct” clustering for that set is. Indeed, different algorithms may yield dramatically different outputs for the same input sets. Faced with a concrete clustering task, a user needs to choose an appropriate clustering algorithm. Currently, such decisions are often made in a very ad hoc, if not completely random, manner. Given the crucial effect of the choice of a clustering algorithm on the resulting clustering, this state of affairs is truly regrettable. In this paper we address the major research challenge of developing tools for helping users make more informed decisions when they come to pick a clustering tool for their data. This is, of course, a very ambitious endeavor, and in this paper, we make some first steps towards this goal. We propose to address this problem by distilling abstract properties of the input-output behavior of different clustering paradigms. In this paper, we demonstrate how abstract, intuitive properties of clustering functions can be used to taxonomize a set of popular clustering algorithmic paradigms. On top of addressing deterministic clustering algorithms, we also propose similar properties for randomized algorithms and use them to highlight functional differences between different common implementations of k-means clustering. We also study relationships between the properties, independent of any particular algorithm. In particular, we strengthen Kleinberg’s famous impossibility result, while providing a simpler proof. 1
6 0.34068573 230 nips-2010-Robust Clustering as Ensembles of Affinity Relations
7 0.31946835 261 nips-2010-Supervised Clustering
9 0.28593031 221 nips-2010-Random Projections for $k$-means Clustering
10 0.28278908 108 nips-2010-Graph-Valued Regression
11 0.28025097 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
12 0.26539835 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
13 0.25288689 38 nips-2010-Batch Bayesian Optimization via Simulation Matching
14 0.24364333 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
15 0.23020859 251 nips-2010-Sphere Embedding: An Application to Part-of-Speech Induction
16 0.21992384 90 nips-2010-Fast Large-scale Mixture Modeling with Component-specific Data Partitions
17 0.21863647 223 nips-2010-Rates of convergence for the cluster tree
18 0.20113313 181 nips-2010-Network Flow Algorithms for Structured Sparsity
19 0.19807255 155 nips-2010-Learning the context of a category
20 0.19323307 226 nips-2010-Repeated Games against Budgeted Adversaries
topicId topicWeight
[(13, 0.073), (21, 0.374), (27, 0.029), (30, 0.072), (33, 0.015), (35, 0.015), (45, 0.173), (50, 0.019), (52, 0.018), (60, 0.051), (77, 0.04), (90, 0.018)]
simIndex simValue paperId paperTitle
1 0.72329235 46 nips-2010-Causal discovery in multiple models from different experiments
Author: Tom Claassen, Tom Heskes
Abstract: A long-standing open research problem is how to use information from different experiments, including background knowledge, to infer causal relations. Recent developments have shown ways to use multiple data sets, provided they originate from identical experiments. We present the MCI-algorithm as the first method that can infer provably valid causal relations in the large sample limit from different experiments. It is fast, reliable and produces very clear and easily interpretable output. It is based on a result that shows that constraint-based causal discovery is decomposable into a candidate pair identification and subsequent elimination step that can be applied separately from different models. We test the algorithm on a variety of synthetic input model sets to assess its behavior and the quality of the output. The method shows promising signs that it can be adapted to suit causal discovery in real-world application areas as well, including large databases. 1
same-paper 2 0.67838997 166 nips-2010-Minimum Average Cost Clustering
Author: Kiyohito Nagano, Yoshinobu Kawahara, Satoru Iwata
Abstract: A number of objective functions in clustering problems can be described with submodular functions. In this paper, we introduce the minimum average cost criterion, and show that the theory of intersecting submodular functions can be used for clustering with submodular objective functions. The proposed algorithm does not require the number of clusters in advance, and it will be determined by the property of a given set of data points. The minimum average cost clustering problem is parameterized with a real variable, and surprisingly, we show that all information about optimal clusterings for all parameters can be computed in polynomial time in total. Additionally, we evaluate the performance of the proposed algorithm through computational experiments. 1
3 0.5927968 148 nips-2010-Learning Networks of Stochastic Differential Equations
Author: José Pereira, Morteza Ibrahimi, Andrea Montanari
Abstract: We consider linear models for stochastic dynamics. To any such model can be associated a network (namely a directed graph) describing which degrees of freedom interact under the dynamics. We tackle the problem of learning such a network from observation of the system trajectory over a time interval T . We analyze the ℓ1 -regularized least squares algorithm and, in the setting in which the underlying network is sparse, we prove performance guarantees that are uniform in the sampling rate as long as this is sufficiently high. This result substantiates the notion of a well defined ‘time complexity’ for the network inference problem. keywords: Gaussian processes, model selection and structure learning, graphical models, sparsity and feature selection. 1 Introduction and main results Let G = (V, E) be a directed graph with weight A0 ∈ R associated to the directed edge (j, i) from ij j ∈ V to i ∈ V . To each node i ∈ V in this network is associated an independent standard Brownian motion bi and a variable xi taking values in R and evolving according to A0 xj (t) dt + dbi (t) , ij dxi (t) = j∈∂+ i where ∂+ i = {j ∈ V : (j, i) ∈ E} is the set of ‘parents’ of i. Without loss of generality we shall take V = [p] ≡ {1, . . . , p}. In words, the rate of change of xi is given by a weighted sum of the current values of its neighbors, corrupted by white noise. In matrix notation, the same system is then represented by dx(t) = A0 x(t) dt + db(t) , p (1) 0 p×p with x(t) ∈ R , b(t) a p-dimensional standard Brownian motion and A ∈ R a matrix with entries {A0 }i,j∈[p] whose sparsity pattern is given by the graph G. We assume that the linear system ij x(t) = A0 x(t) is stable (i.e. that the spectrum of A0 is contained in {z ∈ C : Re(z) < 0}). Further, ˙ we assume that x(t = 0) is in its stationary state. More precisely, x(0) is a Gaussian random variable 1 independent of b(t), distributed according to the invariant measure. Under the stability assumption, this a mild restriction, since the system converges exponentially to stationarity. A portion of time length T of the system trajectory {x(t)}t∈[0,T ] is observed and we ask under which conditions these data are sufficient to reconstruct the graph G (i.e., the sparsity pattern of A0 ). We are particularly interested in computationally efficient procedures, and in characterizing the scaling of the learning time for large networks. Can the network structure be learnt in a time scaling linearly with the number of its degrees of freedom? As an example application, chemical reactions can be conveniently modeled by systems of nonlinear stochastic differential equations, whose variables encode the densities of various chemical species [1, 2]. Complex biological networks might involve hundreds of such species [3], and learning stochastic models from data is an important (and challenging) computational task [4]. Considering one such chemical reaction network in proximity of an equilibrium point, the model (1) can be used to trace fluctuations of the species counts with respect to the equilibrium values. The network G would represent in this case the interactions between different chemical factors. Work in this area focused so-far on low-dimensional networks, i.e. on methods that are guaranteed to be correct for fixed p, as T → ∞, while we will tackle here the regime in which both p and T diverge. 
Before stating our results, it is useful to stress a few important differences with respect to classical graphical model learning problems: (i) Samples are not independent. This can (and does) increase the sample complexity. (ii) On the other hand, infinitely many samples are given as data (in fact a collection indexed by the continuous parameter t ∈ [0, T ]). Of course one can select a finite subsample, for instance at regularly spaced times {x(i η)}i=0,1,... . This raises the question as to whether the learning performances depend on the choice of the spacing η. (iii) In particular, one expects that choosing η sufficiently large as to make the configurations in the subsample approximately independent can be harmful. Indeed, the matrix A0 contains more information than the stationary distribution of the above process (1), and only the latter can be learned from independent samples. (iv) On the other hand, letting η → 0, one can produce an arbitrarily large number of distinct samples. However, samples become more dependent, and intuitively one expects that there is limited information to be harnessed from a given time interval T . Our results confirm in a detailed and quantitative way these intuitions. 1.1 Results: Regularized least squares Regularized least squares is an efficient and well-studied method for support recovery. We will discuss relations with existing literature in Section 1.3. In the present case, the algorithm reconstructs independently each row of the matrix A0 . The rth row, A0 , is estimated by solving the following convex optimization problem for Ar ∈ Rp r minimize L(Ar ; {x(t)}t∈[0,T ] ) + λ Ar 1 , (2) where the likelihood function L is defined by L(Ar ; {x(t)}t∈[0,T ] ) = 1 2T T 0 (A∗ x(t))2 dt − r 1 T T 0 (A∗ x(t)) dxr (t) . r (3) (Here and below M ∗ denotes the transpose of matrix/vector M .) To see that this likelihood function is indeed related to least squares, one can formally write xr (t) = dxr (t)/dt and complete the square ˙ for the right hand side of Eq. (3), thus getting the integral (A∗ x(t) − xr (t))2 dt − xr (t)2 dt. ˙ ˙ r The first term is a sum of square residuals, and the second is independent of A. Finally the ℓ1 regularization term in Eq. (2) has the role of shrinking to 0 a subset of the entries Aij thus effectively selecting the structure. Let S 0 be the support of row A0 , and assume |S 0 | ≤ k. We will refer to the vector sign(A0 ) as to r r the signed support of A0 (where sign(0) = 0 by convention). Let λmax (M ) and λmin (M ) stand for r 2 the maximum and minimum eigenvalue of a square matrix M respectively. Further, denote by Amin the smallest absolute value among the non-zero entries of row A0 . r When stable, the diffusion process (1) has a unique stationary measure which is Gaussian with covariance Q0 ∈ Rp×p given by the solution of Lyapunov’s equation [5] A0 Q0 + Q0 (A0 )∗ + I = 0. (4) Our guarantee for regularized least squares is stated in terms of two properties of the covariance Q0 and one assumption on ρmin (A0 ) (given a matrix M , we denote by ML,R its submatrix ML,R ≡ (Mij )i∈L,j∈R ): (a) We denote by Cmin ≡ λmin (Q0 0 ,S 0 ) the minimum eigenvalue of the restriction of Q0 to S the support S 0 and assume Cmin > 0. (b) We define the incoherence parameter α by letting |||Q0 (S 0 )C ,S 0 Q0 S 0 ,S 0 and assume α > 0. (Here ||| · |||∞ is the operator sup norm.) −1 |||∞ = 1 − α, ∗ (c) We define ρmin (A0 ) = −λmax ((A0 + A0 )/2) and assume ρmin (A0 ) > 0. Note this is a stronger form of stability assumption. 
Our main result is to show that there exists a well defined time complexity, i.e. a minimum time interval T such that, observing the system for time T enables us to reconstruct the network with high probability. This result is stated in the following theorem. Theorem 1.1. Consider the problem of learning the support S 0 of row A0 of the matrix A0 from a r sample trajectory {x(t)}t∈[0,T ] distributed according to the model (1). If T > 104 k 2 (k ρmin (A0 )−2 + A−2 ) 4pk min log , 2 α2 ρmin (A0 )Cmin δ (5) then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = 36 log(4p/δ)/(T α2 ρmin (A0 )) . The time complexity is logarithmic in the number of variables and polynomial in the support size. Further, it is roughly inversely proportional to ρmin (A0 ), which is quite satisfying conceptually, since ρmin (A0 )−1 controls the relaxation time of the mixes. 1.2 Overview of other results So far we focused on continuous-time dynamics. While, this is useful in order to obtain elegant statements, much of the paper is in fact devoted to the analysis of the following discrete-time dynamics, with parameter η > 0: x(t) = x(t − 1) + ηA0 x(t − 1) + w(t), t ∈ N0 . (6) Here x(t) ∈ Rp is the vector collecting the dynamical variables, A0 ∈ Rp×p specifies the dynamics as above, and {w(t)}t≥0 is a sequence of i.i.d. normal vectors with covariance η Ip×p (i.e. with independent components of variance η). We assume that consecutive samples {x(t)}0≤t≤n are given and will ask under which conditions regularized least squares reconstructs the support of A0 . The parameter η has the meaning of a time-step size. The continuous-time model (1) is recovered, in a sense made precise below, by letting η → 0. Indeed we will prove reconstruction guarantees that are uniform in this limit as long as the product nη (which corresponds to the time interval T in the previous section) is kept constant. For a formal statement we refer to Theorem 3.1. Theorem 1.1 is indeed proved by carefully controlling this limit. The mathematical challenge in this problem is related to the fundamental fact that the samples {x(t)}0≤t≤n are dependent (and strongly dependent as η → 0). Discrete time models of the form (6) can arise either because the system under study evolves by discrete steps, or because we are subsampling a continuous time system modeled as in Eq. (1). Notice that in the latter case the matrices A0 appearing in Eq. (6) and (1) coincide only to the zeroth order in η. Neglecting this technical complication, the uniformity of our reconstruction guarantees as η → 0 has an appealing interpretation already mentioned above. Whenever the samples spacing is not too large, the time complexity (i.e. the product nη) is roughly independent of the spacing itself. 3 1.3 Related work A substantial amount of work has been devoted to the analysis of ℓ1 regularized least squares, and its variants [6, 7, 8, 9, 10]. The most closely related results are the one concerning high-dimensional consistency for support recovery [11, 12]. Our proof follows indeed the line of work developed in these papers, with two important challenges. First, the design matrix is in our case produced by a stochastic diffusion, and it does not necessarily satisfies the irrepresentability conditions used by these works. Second, the observations are not corrupted by i.i.d. 
noise (since successive configurations are correlated) and therefore elementary concentration inequalities are not sufficient. Learning sparse graphical models via ℓ1 regularization is also a topic with significant literature. In the Gaussian case, the graphical LASSO was proposed to reconstruct the model from i.i.d. samples [13]. In the context of binary pairwise graphical models, Ref. [11] proves high-dimensional consistency of regularized logistic regression for structural learning, under a suitable irrepresentability conditions on a modified covariance. Also this paper focuses on i.i.d. samples. Most of these proofs builds on the technique of [12]. A naive adaptation to the present case allows to prove some performance guarantee for the discrete-time setting. However the resulting bounds are not uniform as η → 0 for nη = T fixed. In particular, they do not allow to prove an analogous of our continuous time result, Theorem 1.1. A large part of our effort is devoted to producing more accurate probability estimates that capture the correct scaling for small η. Similar issues were explored in the study of stochastic differential equations, whereby one is often interested in tracking some slow degrees of freedom while ‘averaging out’ the fast ones [14]. The relevance of this time-scale separation for learning was addressed in [15]. Let us however emphasize that these works focus once more on system with a fixed (small) number of dimensions p. Finally, the related topic of learning graphical models for autoregressive processes was studied recently in [16, 17]. The convex relaxation proposed in these papers is different from the one developed here. Further, no model selection guarantee was proved in [16, 17]. 2 Illustration of the main results It might be difficult to get a clear intuition of Theorem 1.1, mainly because of conditions (a) and (b), which introduce parameters Cmin and α. The same difficulty arises with analogous results on the high-dimensional consistency of the LASSO [11, 12]. In this section we provide concrete illustration both via numerical simulations, and by checking the condition on specific classes of graphs. 2.1 Learning the laplacian of graphs with bounded degree Given a simple graph G = (V, E) on vertex set V = [p], its laplacian ∆G is the symmetric p × p matrix which is equal to the adjacency matrix of G outside the diagonal, and with entries ∆G = ii −deg(i) on the diagonal [18]. (Here deg(i) denotes the degree of vertex i.) It is well known that ∆G is negative semidefinite, with one eigenvalue equal to 0, whose multiplicity is equal to the number of connected components of G. The matrix A0 = −m I + ∆G fits into the setting of Theorem 1.1 for m > 0. The corresponding model (1.1) describes the over-damped dynamics of a network of masses connected by springs of unit strength, and connected by a spring of strength m to the origin. We obtain the following result. Theorem 2.1. Let G be a simple connected graph of maximum vertex degree k and consider the model (1.1) with A0 = −m I + ∆G where ∆G is the laplacian of G and m > 0. If k+m 5 4pk T ≥ 2 · 105 k 2 , (7) (k + m2 ) log m δ then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = 36(k + m)2 log(4p/δ)/(T m3 ). 
In other words, for m bounded away from 0 and ∞, regularized least squares regression correctly reconstructs the graph G from a trajectory of time length which is polynomial in the degree and logarithmic in the system size. Notice that once the graph is known, the laplacian ∆G is uniquely determined. Also, the proof technique used for this example is generalizable to other graphs as well. 4 2800 Min. # of samples for success prob. = 0.9 1 0.9 p = 16 p = 32 0.8 Probability of success p = 64 0.7 p = 128 p = 256 0.6 p = 512 0.5 0.4 0.3 0.2 0.1 0 0 50 100 150 200 250 300 T=nη 350 400 2600 2400 2200 2000 1800 1600 1400 1200 1 10 450 2 3 10 10 p Figure 1: (left) Probability of success vs. length of the observation interval nη. (right) Sample complexity for 90% probability of success vs. p. 2.2 Numerical illustrations In this section we present numerical validation of the proposed method on synthetic data. The results confirm our observations in Theorems 1.1 and 3.1, below, namely that the time complexity scales logarithmically with the number of nodes in the network p, given a constant maximum degree. Also, the time complexity is roughly independent of the sampling rate. In Fig. 1 and 2 we consider the discrete-time setting, generating data as follows. We draw A0 as a random sparse matrix in {0, 1}p×p with elements chosen independently at random with P(A0 = 1) = k/p, k = 5. The ij process xn ≡ {x(t)}0≤t≤n is then generated according to Eq. (6). We solve the regularized least 0 square problem (the cost function is given explicitly in Eq. (8) for the discrete-time case) for different values of n, the number of observations, and record if the correct support is recovered for a random row r using the optimum value of the parameter λ. An estimate of the probability of successful recovery is obtained by repeating this experiment. Note that we are estimating here an average probability of success over randomly generated matrices. The left plot in Fig.1 depicts the probability of success vs. nη for η = 0.1 and different values of p. Each curve is obtained using 211 instances, and each instance is generated using a new random matrix A0 . The right plot in Fig.1 is the corresponding curve of the sample complexity vs. p where sample complexity is defined as the minimum value of nη with probability of success of 90%. As predicted by Theorem 2.1 the curve shows the logarithmic scaling of the sample complexity with p. In Fig. 2 we turn to the continuous-time model (1). Trajectories are generated by discretizing this stochastic differential equation with step δ much smaller than the sampling rate η. We draw random matrices A0 as above and plot the probability of success for p = 16, k = 4 and different values of η, as a function of T . We used 211 instances for each curve. As predicted by Theorem 1.1, for a fixed observation interval T , the probability of success converges to some limiting value as η → 0. 3 Discrete-time model: Statement of the results Consider a system evolving in discrete time according to the model (6), and let xn ≡ {x(t)}0≤t≤n 0 be the observed portion of the trajectory. The rth row A0 is estimated by solving the following r convex optimization problem for Ar ∈ Rp minimize L(Ar ; xn ) + λ Ar 0 where L(Ar ; xn ) ≡ 0 1 2η 2 n 1 , (8) n−1 2 t=0 {xr (t + 1) − xr (t) − η A∗ x(t)} . r (9) Apart from an additive constant, the η → 0 limit of this cost function can be shown to coincide with the cost function in the continuous time case, cf. Eq. (3). 
Indeed the proof of Theorem 1.1 will amount to a more precise version of this statement. Furthermore, L(Ar ; xn ) is easily seen to be the 0 log-likelihood of Ar within model (6). 5 1 1 0.9 0.95 0.9 0.7 Probability of success Probability of success 0.8 η = 0.04 η = 0.06 0.6 η = 0.08 0.5 η = 0.1 0.4 η = 0.14 0.3 η = 0.22 η = 0.18 0.85 0.8 0.75 0.7 0.65 0.2 0.6 0.1 0 50 100 150 T=nη 200 0.55 0.04 250 0.06 0.08 0.1 0.12 η 0.14 0.16 0.18 0.2 0.22 Figure 2: (right)Probability of success vs. length of the observation interval nη for different values of η. (left) Probability of success vs. η for a fixed length of the observation interval, (nη = 150) . The process is generated for a small value of η and sampled at different rates. As before, we let S 0 be the support of row A0 , and assume |S 0 | ≤ k. Under the model (6) x(t) has r a Gaussian stationary state distribution with covariance Q0 determined by the following modified Lyapunov equation A0 Q0 + Q0 (A0 )∗ + ηA0 Q0 (A0 )∗ + I = 0 . (10) It will be clear from the context whether A0 /Q0 refers to the dynamics/stationary matrix from the continuous or discrete time system. We assume conditions (a) and (b) introduced in Section 1.1, and adopt the notations already introduced there. We use as a shorthand notation σmax ≡ σmax (I +η A0 ) where σmax (.) is the maximum singular value. Also define D ≡ 1 − σmax /η . We will assume D > 0. As in the previous section, we assume the model (6) is initiated in the stationary state. Theorem 3.1. Consider the problem of learning the support S 0 of row A0 from the discrete-time r trajectory {x(t)}0≤t≤n . If nη > 4pk 104 k 2 (kD−2 + A−2 ) min log , 2 DC 2 α δ min (11) then there exists λ such that ℓ1 -regularized least squares recovers the signed support of A0 with r probability larger than 1 − δ. This is achieved by taking λ = (36 log(4p/δ))/(Dα2 nη). In other words the discrete-time sample complexity, n, is logarithmic in the model dimension, polynomial in the maximum network degree and inversely proportional to the time spacing between samples. The last point is particularly important. It enables us to derive the bound on the continuoustime sample complexity as the limit η → 0 of the discrete-time sample complexity. It also confirms our intuition mentioned in the Introduction: although one can produce an arbitrary large number of samples by sampling the continuous process with finer resolutions, there is limited amount of information that can be harnessed from a given time interval [0, T ]. 4 Proofs In the following we denote by X ∈ Rn×p the matrix whose (t + 1)th column corresponds to the configuration x(t), i.e. X = [x(0), x(1), . . . , x(n − 1)]. Further ∆X ∈ Rn×p is the matrix containing configuration changes, namely ∆X = [x(1) − x(0), . . . , x(n) − x(n − 1)]. Finally we write W = [w(1), . . . , w(n − 1)] for the matrix containing the Gaussian noise realization. Equivalently, The r th row of W is denoted by Wr . W = ∆X − ηA X . In order to lighten the notation, we will omit the reference to xn in the likelihood function (9) and 0 simply write L(Ar ). We define its normalized gradient and Hessian by G = −∇L(A0 ) = r 1 ∗ XWr , nη Q = ∇2 L(A0 ) = r 6 1 XX ∗ . n (12) 4.1 Discrete time In this Section we outline our prove for our main result for discrete-time dynamics, i.e., Theorem 3.1. We start by stating a set of sufficient conditions for regularized least squares to work. 
Then we present a series of concentration lemmas to be used to prove the validity of these conditions, and finally we sketch the outline of the proof. As mentioned, the proof strategy, and in particular the following proposition, which provides a compact set of sufficient conditions for the support to be recovered correctly, is analogous to the one in [12]. A proof of this proposition can be found in the supplementary material.

Proposition 4.1. Let α, C_min > 0 be defined by

λ_min(Q^0_{S^0,S^0}) ≡ C_{min},    |||Q^0_{(S^0)^C,S^0} (Q^0_{S^0,S^0})^{−1}|||_∞ ≡ 1 − α.    (13)

If the following conditions hold, then the regularized least squares solution (8) correctly recovers the signed support sign(A^0_r):

\|G\|_∞ ≤ \frac{λα}{3},    \|G_{S^0}\|_∞ ≤ \frac{A_{min} C_{min}}{4k} − λ,    (14)

|||Q_{(S^0)^C,S^0} − Q^0_{(S^0)^C,S^0}|||_∞ ≤ \frac{α C_{min}}{12\sqrt{k}},    |||Q_{S^0,S^0} − Q^0_{S^0,S^0}|||_∞ ≤ \frac{α C_{min}}{12\sqrt{k}}.    (15)

Further, the same statement holds for the continuous-time model, provided G and Q are the gradient and the Hessian of the likelihood (3).

The proof of Theorem 3.1 consists in checking that, under the hypothesis (11) on the number of consecutive configurations, conditions (14) and (15) hold with high probability. Checking these conditions can be regarded in turn as concentration-of-measure statements. Indeed, if the expectation is taken with respect to a stationary trajectory, we have E{G} = 0 and E{Q} = Q^0.

4.1.1 Technical lemmas

In this section we state the necessary concentration lemmas for proving Theorem 3.1. These are non-trivial because G and Q are quadratic functions of dependent random variables, the samples {x(t)}_{0≤t≤n}. The proofs of Proposition 4.2, Proposition 4.3, and Corollary 4.4 can be found in the supplementary material. Our first proposition implies concentration of G around 0.

Proposition 4.2. Let S ⊆ [p] be any set of vertices and ε < 1/2. If σ_max ≡ σ_max(I + η A^0) < 1, then

P( \|G_S\|_∞ > ε ) ≤ 2|S| e^{−n(1−σ_max)^2 ε^2 / 4}.    (16)

We furthermore need to bound the matrix norms as per (15) in Proposition 4.1. First we relate bounds on |||Q_{JS} − Q^0_{JS}|||_∞ to bounds on |Q_{ij} − Q^0_{ij}| (i ∈ J, j ∈ S), where J and S are any subsets of {1, ..., p}. We have

P( |||Q_{JS} − Q^0_{JS}|||_∞ > ε ) ≤ |J||S| \max_{i ∈ J, j ∈ S} P( |Q_{ij} − Q^0_{ij}| > ε/|S| ).    (17)

Then we bound |Q_{ij} − Q^0_{ij}| using the following proposition.

Proposition 4.3. Let i, j ∈ {1, ..., p}, σ_max ≡ σ_max(I + η A^0) < 1, T = ηn > 3/D and 0 < ε < 2/D, where D = (1 − σ_max)/η. Then

P( |Q_{ij} − Q^0_{ij}| > ε ) ≤ 2 e^{−\frac{n}{32η^2}(1−σ_max)^3 ε^2}.    (18)

Finally, the next corollary follows from Proposition 4.3 and Eq. (17).

Corollary 4.4. Let J, S (|S| ≤ k) be any two subsets of {1, ..., p}, σ_max ≡ σ_max(I + η A^0) < 1, ε < 2k/D and nη > 3/D (where D = (1 − σ_max)/η). Then

P( |||Q_{JS} − Q^0_{JS}|||_∞ > ε ) ≤ 2|J| k e^{−\frac{n}{32 k^2 η^2}(1−σ_max)^3 ε^2}.    (19)

4.1.2 Outline of the proof of Theorem 3.1

With these concentration bounds we can now easily prove Theorem 3.1. All we need to do is to compute the probability that the conditions given by Proposition 4.1 hold. From the statement of the theorem we have that the first two conditions (α, C_min > 0) of Proposition 4.1 hold. In order to make the first condition on G imply the second condition on G, we assume that λα/3 ≤ A_{min} C_{min}/(4k) − λ, which is guaranteed to hold if

λ ≤ \frac{A_{min} C_{min}}{8k}.    (20)

We also combine the two last conditions on Q, obtaining

|||Q_{[p],S^0} − Q^0_{[p],S^0}|||_∞ ≤ \frac{α C_{min}}{12\sqrt{k}},    (21)

since [p] = S^0 ∪ (S^0)^C.
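For intuition, the conditions of Proposition 4.1 can be checked numerically on a single instance. The sketch below is illustrative only: δ = 0.1 and the use of the λ scaling from Theorem 3.1 are choices of this sketch, and Q^0, G and Q can be obtained from the previous sketches.

```python
# Sketch: numerically checking the sufficient conditions of Proposition 4.1 for one
# instance. Q0, G, Q can be produced by the previous sketches; delta = 0.1 and the
# lambda scaling of Theorem 3.1 are choices made for this illustration only.
import numpy as np

def proposition_41_check(A0, Q0, G, Q, r, eta, n, k, delta=0.1):
    p = A0.shape[0]
    S = np.flatnonzero(A0[r])                        # support S0 of row r
    Sc = np.setdiff1d(np.arange(p), S)
    Q0_SS = Q0[np.ix_(S, S)]
    Cmin = np.linalg.eigvalsh(Q0_SS)[0]              # lambda_min(Q0_{S0,S0}), eq. (13)
    incoh = np.abs(Q0[np.ix_(Sc, S)] @ np.linalg.inv(Q0_SS))
    alpha = 1.0 - incoh.sum(axis=1).max()            # |||.|||_inf = max row l1 norm, eq. (13)
    Amin = np.abs(A0[r, S]).min()
    sigma_max = np.linalg.norm(np.eye(p) + eta * A0, 2)
    D = (1.0 - sigma_max) / eta                      # must be > 0 for the bound to apply
    lam = np.sqrt(36 * np.log(4 * p / delta) / (D * alpha**2 * n * eta))
    tol = alpha * Cmin / (12 * np.sqrt(k))
    conds = [
        np.abs(G).max() <= lam * alpha / 3,                                     # (14)
        np.abs(G[S]).max() <= Amin * Cmin / (4 * k) - lam,                      # (14)
        np.abs(Q[np.ix_(S, S)] - Q0_SS).sum(axis=1).max() <= tol,               # (15)
        np.abs(Q[np.ix_(Sc, S)] - Q0[np.ix_(Sc, S)]).sum(axis=1).max() <= tol,  # (15)
    ]
    return Cmin, alpha, lam, all(conds)
```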
We then impose that both the probability of the condition on Q failing and the probability of the condition on G failing are upper bounded by δ/2, using Proposition 4.2 and Corollary 4.4. It is shown in the supplementary material that this is satisfied if condition (11) holds.

4.2 Outline of the proof of Theorem 1.1

To prove Theorem 1.1 we recall that Proposition 4.1 holds provided the appropriate continuous-time expressions are used for G and Q, namely

G = −∇L(A^0_r) = \frac{1}{T} \int_0^T x(t)\, db_r(t),    Q = ∇^2 L(A^0_r) = \frac{1}{T} \int_0^T x(t) x(t)^*\, dt.    (22)

These are of course random variables. In order to distinguish them from the discrete-time versions, we will adopt the notation G^n, Q^n for the latter. We claim that these random variables can be coupled (i.e. defined on the same probability space) in such a way that G^n → G and Q^n → Q almost surely as n → ∞ for fixed T. Under assumption (5), it is easy to show that (11) holds for all n > n_0, with n_0 a sufficiently large constant (for a proof, see the supplementary material). Therefore, by the proof of Theorem 3.1, the conditions in Proposition 4.1 hold for the gradient G^n and Hessian Q^n for any n ≥ n_0, with probability larger than 1 − δ. But by the claimed convergence G^n → G and Q^n → Q, they also hold for G and Q with probability at least 1 − δ, which proves the theorem.

We are left with the task of showing that the discrete- and continuous-time processes can be coupled in such a way that G^n → G and Q^n → Q. With a slight abuse of notation, the state of the discrete-time system (6) will be denoted by x(i), where i ∈ N, and the state of the continuous-time system (1) by x(t), where t ∈ R. We denote by Q^0 the solution of (4) and by Q^0(η) the solution of (10). It is easy to check that Q^0(η) → Q^0 as η → 0, by the uniqueness of the stationary state distribution. The initial state of the continuous-time system, x(t = 0), is an N(0, Q^0) random variable independent of b(t), and the initial state of the discrete-time system is defined to be x(i = 0) = (Q^0(η))^{1/2} (Q^0)^{−1/2} x(t = 0). At subsequent times, x(i) and x(t) are generated by the respective dynamical systems using the same matrix A^0 and the common randomness provided by the standard Brownian motion {b(t)}_{0≤t≤T} in R^p. In order to couple x(t) and x(i), we construct w(i), the noise driving the discrete-time system, by letting w(i) ≡ b(Ti/n) − b(T(i − 1)/n). The almost sure convergence G^n → G and Q^n → Q then follows from the standard convergence of random walks to Brownian motion. (A small numerical sketch of this coupling construction is given after the references.)

Acknowledgments

This work was partially supported by a Terman fellowship, the NSF CAREER award CCF-0743978, the NSF grant DMS-0806211, and by a Portuguese Doctoral FCT fellowship.

References

[1] D.T. Gillespie. Stochastic simulation of chemical kinetics. Annual Review of Physical Chemistry, 58:35–55, 2007.
[2] D. Higham. Modeling and simulating chemical reactions. SIAM Review, 50:347–368, 2008.
[3] N.D. Lawrence et al., editors. Learning and Inference in Computational Systems Biology. MIT Press, 2010.
[4] T. Toni, D. Welch, N. Strelkova, A. Ipsen, and M.P.H. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface, 6:187–202, 2009.
[5] K. Zhou, J.C. Doyle, and K. Glover. Robust and Optimal Control. Prentice Hall, 1996.
[6] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267–288, 1996.
[7] D.L. Donoho. For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution.
Communications on Pure and Applied Mathematics, 59(7):907–934, 2006.
[8] D.L. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
[9] T. Zhang. Some sharp performance bounds for least squares regression with L1 regularization. Annals of Statistics, 37:2109–2144, 2009.
[10] M.J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
[11] M.J. Wainwright, P. Ravikumar, and J.D. Lafferty. High-dimensional graphical model selection using l1-regularized logistic regression. Advances in Neural Information Processing Systems, 19:1465, 2007.
[12] P. Zhao and B. Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
[13] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432, 2008.
[14] K. Ball, T.G. Kurtz, L. Popovic, and G. Rempala. Asymptotic analysis of multiscale approximations to reaction networks. Annals of Applied Probability, 16:1925–1961, 2006.
[15] G.A. Pavliotis and A.M. Stuart. Parameter estimation for multiscale diffusions. Journal of Statistical Physics, 127:741–781, 2007.
[16] J. Songsiri, J. Dahl, and L. Vandenberghe. Graphical models of autoregressive processes. Pages 89–116, 2010.
[17] J. Songsiri and L. Vandenberghe. Topology selection in graphical models of autoregressive processes. Journal of Machine Learning Research, 2010. Submitted.
[18] F.R.K. Chung. Spectral Graph Theory. CBMS Regional Conference Series in Mathematics, 1997.
[19] P. Ravikumar, M.J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using l1-regularized logistic regression. Annals of Statistics, 2008.
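For completeness, here is the small numerical sketch of the coupling construction referenced in Section 4.2 above: the discrete-time noise w(i) is obtained by summing increments of a single Brownian path, so that refining the sampling grid for fixed T drives the discrete trajectory, and hence G^n and Q^n, toward their continuous-time counterparts. Both trajectories are started at zero here for simplicity, rather than with the stationary initialization used in the proof, and all function names are illustrative.

```python
# Sketch of the coupling in Section 4.2: one Brownian path drives both the discrete
# model (6) (coarse grid, w(i) = b(Ti/n) - b(T(i-1)/n)) and a fine Euler discretization
# standing in for the continuous model (1). Not the authors' code.
import numpy as np

def coupled_trajectories(A0, T, n_coarse, n_fine, rng):
    """Simulate the dynamics on two grids driven by the same Brownian increments."""
    p = A0.shape[0]
    assert n_fine % n_coarse == 0
    db_fine = np.sqrt(T / n_fine) * rng.standard_normal((n_fine, p))
    db_coarse = db_fine.reshape(n_coarse, n_fine // n_coarse, p).sum(axis=1)
    def run(db, eta):
        x = np.zeros((len(db) + 1, p))
        for t in range(len(db)):
            x[t + 1] = x[t] + eta * (A0 @ x[t]) + db[t]
        return x
    return run(db_coarse, T / n_coarse), run(db_fine, T / n_fine)

rng = np.random.default_rng(2)
A0 = random_stable_A(16, 4, rng)                     # from the first sketch
x_coarse, x_fine = coupled_trajectories(A0, T=150.0, n_coarse=1500, n_fine=150000, rng=rng)
# Applying empirical_G_Q (second sketch) to x_coarse and x_fine yields nearby values
# of (G, Q), illustrating G^n -> G and Q^n -> Q under this coupling as n grows.
```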
4 0.485149 196 nips-2010-Online Markov Decision Processes under Bandit Feedback
Author: Gergely Neu, Andras Antos, András György, Csaba Szepesvári
Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state of the art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved regret bound for the problem. 1
5 0.48331684 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
Author: Gowtham Bellala, Suresh Bhavnani, Clayton Scott
Abstract: Generalized Binary Search (GBS) is a well known greedy algorithm for identifying an unknown object while minimizing the number of “yes” or “no” questions posed about that object, and arises in problems such as active learning and active diagnosis. Here, we provide a coding-theoretic interpretation for GBS and show that GBS can be viewed as a top-down algorithm that greedily minimizes the expected number of queries required to identify an object. This interpretation is then used to extend GBS in two ways. First, we consider the case where the objects are partitioned into groups, and the objective is to identify only the group to which the object belongs. Then, we consider the case where the cost of identifying an object grows exponentially in the number of queries. In each case, we present an exact formula for the objective function involving Shannon or Rényi entropy, and develop a greedy algorithm for minimizing it. 1
6 0.48305467 134 nips-2010-LSTD with Random Projections
7 0.48255557 7 nips-2010-A Family of Penalty Functions for Structured Sparsity
8 0.48225573 265 nips-2010-The LASSO risk: asymptotic results and real world examples
9 0.4820134 261 nips-2010-Supervised Clustering
10 0.48152855 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
11 0.48115069 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
12 0.48096731 92 nips-2010-Fast global convergence rates of gradient methods for high-dimensional statistical recovery
13 0.48064744 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
14 0.48045823 258 nips-2010-Structured sparsity-inducing norms through submodular functions
15 0.48036623 221 nips-2010-Random Projections for $k$-means Clustering
16 0.47983477 210 nips-2010-Practical Large-Scale Optimization for Max-norm Regularization
17 0.47979185 27 nips-2010-Agnostic Active Learning Without Constraints
18 0.47969696 63 nips-2010-Distributed Dual Averaging In Networks
19 0.47884682 274 nips-2010-Trading off Mistakes and Don't-Know Predictions
20 0.47802806 30 nips-2010-An Inverse Power Method for Nonlinear Eigenproblems with Applications in 1-Spectral Clustering and Sparse PCA