nips nips2007 nips2007-58 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Ulrike V. Luxburg, Stefanie Jegelka, Michael Kaufmann, Sébastien Bubeck
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal. As an alternative, we suggest the paradigm of “nearest neighbor clustering”. Instead of selecting the best out of all partitions of the sample, it only considers partitions in some restricted function class. Using tools from statistical learning theory we prove that nearest neighbor clustering is statistically consistent. Moreover, its worst case complexity is polynomial by construction, and it can be implemented with small average case complexity using branch and bound. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Consistent Minimization of Clustering Objective Functions Ulrike von Luxburg, Max Planck Institute for Biological Cybernetics; Sébastien Bubeck, INRIA Futurs Lille, France. ulrike. [sent-1, score-0.082]
2 The objective is to find, among all partitions of the data set, the best one according to some quality measure. [sent-12, score-0.182]
3 However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. [sent-13, score-0.113]
4 As an alternative, we suggest the paradigm of “nearest neighbor clustering”. [sent-15, score-0.126]
5 Instead of selecting the best out of all partitions of the sample, it only considers partitions in some restricted function class. [sent-16, score-0.212]
6 Using tools from statistical learning theory we prove that nearest neighbor clustering is statistically consistent. [sent-17, score-0.501]
7 Moreover, its worst case complexity is polynomial by construction, and it can be implemented with small average case complexity using branch and bound. [sent-18, score-0.089]
8 Many algorithms try to achieve this by minimizing a certain quality function Qn , for example graph cut objective functions such as ratio cut or normalized cut, or various criteria based on some function of the within- and between-cluster similarities. [sent-20, score-0.347]
9 The objective of clustering is then stated as a discrete optimization problem. [sent-21, score-0.33]
10 Given a data set Xn = {X1 , . . . , Xn } and a clustering quality function Qn , the ideal clustering algorithm should take into account all possible partitions of the data set and output the one that minimizes Qn . [sent-25, score-0.631]
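To make the idealized formulation concrete, here is a minimal Python sketch (not from the paper) of the exhaustive approach: it enumerates every labeling of a toy data set and keeps the one with the smallest quality score, using within-cluster sum of squares as a stand-in for Qn. The K^n loop is exactly why this ideal is infeasible beyond a handful of points.

```python
import itertools
import numpy as np

def wss(X, labels, K):
    """Within-cluster sum of squares -- one possible quality function Qn."""
    return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
               for k in range(K) if np.any(labels == k))

def exhaustive_clustering(X, K, quality=wss):
    """Idealized algorithm: try all K^n label assignments, keep the best.
    Only feasible for tiny n -- the search space grows exponentially."""
    best_labels, best_q = None, np.inf
    for assignment in itertools.product(range(K), repeat=len(X)):
        labels = np.array(assignment)
        q = quality(X, labels, K)
        if q < best_q:
            best_q, best_labels = q, labels
    return best_labels, best_q

X = np.random.RandomState(0).randn(8, 2)   # 8 points: 2^8 = 256 labelings
labels, q = exhaustive_clustering(X, K=2)
```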
11 The implicit understanding is that the “best” clustering can be any partition out of the set of all possible partitions of the data set. [sent-26, score-0.381]
12 If we look at clustering from the perspective of statistical learning theory we assume that the finite data set has been sampled from an underlying data space X according to some probability measure. [sent-29, score-0.257]
13 We choose a clustering quality function Q on the set of partitions of the entire data space X , and define the true clustering f ∗ to be the partition minimizing Q. [sent-32, score-0.692]
14 In this setting, a very important property of a clustering algorithm is consistency. [sent-33, score-0.24]
15 Denoting the clustering constructed on the finite sample by fn , we require that Q(fn ) converges to Q(f ∗ ) when n→∞. [sent-34, score-0.799]
16 One can prove that in the standard setting of statistical learning theory, a necessary condition for consistency is that E log NF (x1 , . . . , xn )/n → 0, that is, the number of partitions the algorithm can realize grows sub-exponentially with n. [sent-42, score-0.09]
17 While the discrete optimization approach on any given sample attempts to find the best of all (exponentially many) partitions, the statistical learning theory approach restricts the set of candidate partitions to have sub-exponential size. [sent-50, score-0.199]
18 There are examples (e.g., Bubeck and von Luxburg, 2007) which show that this indeed can happen: here the partitions constructed on the finite sample do not converge to the true clustering of the data space. [sent-53, score-0.471]
19 In practice, in most cases the discrete optimization approach cannot be performed perfectly, as the corresponding optimization problem is NP-hard. [sent-54, score-0.089]
20 Another approach is to construct a relaxation of the original problem which can be solved efficiently (spectral clustering is an example for this). [sent-57, score-0.24]
21 This situation is clearly unsatisfactory: for most clustering algorithms, we neither have guarantees on the finite sample behavior of the algorithm, nor on its statistical consistency in the limit. [sent-59, score-0.321]
22 Directly from the outset, we only consider candidate partitions in some restricted class Fn containing only polynomially many functions. [sent-62, score-0.124]
23 From a theoretical point of view this approach has the advantage that the resulting clustering algorithm has the potential of being consistent. [sent-64, score-0.24]
24 2 Nearest neighbor clustering In the following we assume that we are given a set of data points Xn = {X1 , . . . , Xn }. [sent-68, score-0.398]
25 To follow the approach outlined above we have to optimize Qn over a “small” set Fn of partitions of Xn . [sent-73, score-0.106]
26 A convenient choice satisfying all those properties is the class of “nearest neighbor partitions”. [sent-77, score-0.144]
27 Assign all other data points to their closest seed points, that is, for all j = 1, . . . , m, [sent-83, score-0.12]
28 define the set Zj as the subset of data points whose nearest seed point is Xsj . [sent-86, score-0.238]
29 Then consider all partitions of Xn which are constant on the sets Zj . [sent-87, score-0.123]
30 More formally, for given seeds we define the set Fn as the set of all functions f : X → {1, . . . , K} which are constant on each of the cells Z1 , . . . , Zm . [sent-88, score-0.084]
31 The function class Fn contains K^m functions, which is polynomial in n if the number m of seeds satisfies m = O(log n). [sent-93, score-0.101]
32 Given Fn , the simplest polynomial-time optimization algorithm is then to evaluate Qn (f ) for all f ∈ Fn and choose the solution fn = argmin_{f∈Fn} Qn (f ). [sent-94, score-0.586]
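The brute-force variant just described is short enough to write out; the sketch below is a hedged illustration rather than the authors' code: it chooses the m seed points uniformly at random, which is one natural option the text leaves open, and assumes a quality function with signature Qn(X, f).

```python
import itertools
import numpy as np

def nnc_brute_force(X, Qn, K, m, rng):
    """Nearest neighbor clustering by exhaustive search over Fn: the K^m
    labelings that are constant on the Voronoi cells of m seed points."""
    n = len(X)
    seeds = rng.choice(n, size=m, replace=False)
    # cell[i] = index of the seed closest to data point i
    dists = np.linalg.norm(X[:, None, :] - X[seeds][None, :, :], axis=2)
    cell = np.argmin(dists, axis=1)
    best_f, best_q = None, np.inf
    for seed_labels in itertools.product(range(K), repeat=m):
        f = np.asarray(seed_labels)[cell]        # constant on each cell Zj
        q = Qn(X, f)
        if q < best_q:
            best_q, best_f = q, f
    return best_f, best_q
```

With m = O(log n), the loop runs over only polynomially many candidates, which is the source of the polynomial worst-case complexity claimed in the abstract.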
33 We call the resulting clustering the nearest neighbor clustering and denote it by NNC(Qn ). [sent-95, score-0.74]
34 3 Consistency of nearest neighbor clustering In this section we prove that nearest neighbor clustering is statistically consistent for many clustering quality functions. [sent-97, score-0.501]
35 Let m := m(n) ≤ n be the number of seeds used in nearest neighbor clustering. [sent-108, score-0.31]
36 Furthermore, let Q : F → R be the quality function we aim to minimize, and Qn : Fn → R an estimator of this quality function on a finite sample. [sent-131, score-0.109]
37 With this notation, the true clustering f ∗ on the underlying space and the nearest neighbor clustering fn introduced in the last section are given by f ∗ ∈ argmin_{f∈F} Q(f ) and fn ∈ argmin_{f∈Fn} Qn (f ). [sent-132, score-1.862]
38 Later on we will also need to work with the functions fn∗ ∈ argmin_{f∈Fn} Q(f ) and f̃∗ (x) := f ∗ (NNm (x)), the nearest neighbor extension of f ∗ . [sent-133, score-0.574]
39 Theorem 1 (Consistency of nearest neighbor clustering) Let (Xi )i∈N be a sequence of points drawn i.i.d. [sent-138, score-0.276]
40 according to some probability measure P on Rd , and m := m(n) the number of seed points used in nearest neighbor clustering. [sent-141, score-0.364]
41 Let Q : F → R be a clustering quality function, Qn : Fn → R its estimator, and A(f ) and An (f ) some predicates. [sent-142, score-0.285]
42 Qn (f ) is a consistent estimator of Q(f ) which converges sufficiently fast: ∀ε > 0, K^m (2n)^((d+1)m²) sup_{f∈Fn} P(|Qn (f ) − Q(f )| > ε) → 0. [sent-144, score-0.103]
43 Then nearest neighbor clustering as introduced in Section 2 is weakly consistent, that is Q(fn ) → Q(f ∗ ) in probability. [sent-151, score-0.484]
44 Moreover, note that the function class Fn is data dependent as the seed points used in the Voronoi partition are data points. [sent-162, score-0.173]
45 It is well known (Devroye et al., 1996) that the number of Voronoi partitions of n points using m cells in Rd is bounded by n^((d+1)m²), hence the number of nearest neighbor clusterings into K classes is bounded by SK (Fn , n) ≤ K^m n^((d+1)m²). [sent-174, score-0.422]
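The growth rate of this bound can be checked numerically. The snippet below (assuming the reconstructed form K^m · n^((d+1)m²) above is correct) shows that the log of the bound is o(n) when m = log n, which is what the consistency conditions require.

```python
import numpy as np

def log_num_nn_clusterings(n, m, d, K):
    """log of the assumed bound S_K(Fn, n) <= K^m * n^((d+1) m^2)."""
    return m * np.log(K) + (d + 1) * m**2 * np.log(n)

for n in (10**3, 10**4, 10**5, 10**6):
    m = max(1, int(np.log(n)))
    print(n, log_num_nn_clusterings(n, m, d=2, K=2) / n)  # ratio tends to 0
```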
46 To deal with the approximation error, observe that if An (f̃∗ ) is true, then f̃∗ ∈ Fn , and by the definition of fn∗ we have Q(fn∗ ) − Q(f ∗ ) ≤ Q(f̃∗ ) − Q(f ∗ ), and thus P(Q(fn∗ ) − Q(f ∗ ) ≥ ε) ≤ P(An (f̃∗ ) false) + P(f̃∗ ∈ Fn and Q(f̃∗ ) − Q(f ∗ ) ≥ ε). [sent-177, score-0.512]
47 For k ∈ {1, . . . , K} denote by fk (x) := 1_{f(x)=k} the indicator function of the k-th cluster. [sent-190, score-0.282]
48 A(f ) is true :⇐⇒ vol(fk ) > a ∀k = 1, . . . , K, and An (f ) is true :⇐⇒ voln (fk ) > an ∀k = 1, . . . , K. [sent-195, score-0.139]
49 Let m := m(n) be the number of seed points used in nearest neighbor clustering, a > 0 an arbitrary constant, and (an )n∈N a monotonically decreasing sequence with an → a. [sent-202, score-0.364]
50 Then nearest neighbor clustering using Q := Ncut, Qn := Ncutn , and A and An as defined in (3) is weakly consistent if m(n) → ∞ and m² log n/(n(a − an )²) → 0. [sent-203, score-0.505]
51 First we establish that {| cutn (fk ) − cut(fk )| ≤ aε} ∩ {| voln (fk ) − vol(fk )| ≤ aε} ⊂ {| cutn (fk )/voln (fk ) − cut(fk )/vol(fk )| ≤ 2ε}. [sent-206, score-0.339]
52 Applying the McDiarmid inequality to cutn and voln , respectively, we obtain that for all f ∈ Fn : P(| Ncut(f ) − Ncutn (f )| > ε) ≤ 4K exp(−na²ε²/(8C²K²)). [sent-207, score-0.339]
53 The proof of Condition (2) is rather technical, but in the end also follows by applying the McDiarmid inequality to voln (f ). [sent-209, score-0.113]
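Since the extraction lost the formal definitions of cut, vol, and Ncutn, here is one plausible reading in code: the empirical normalized cut of a labeling, computed from a symmetric similarity matrix S, with degenerate clusters (those failing the predicate An) mapped to infinity.

```python
import numpy as np

def ncut(S, labels, K):
    """Empirical normalized cut: sum over k of cutn(f_k) / voln(f_k)."""
    total = 0.0
    for k in range(K):
        in_k = labels == k
        vol_k = S[in_k].sum()                  # voln(f_k): total similarity from cluster k
        cut_k = S[np.ix_(in_k, ~in_k)].sum()   # cutn(f_k): cluster k to the rest
        if vol_k == 0:
            return np.inf                      # predicate An fails: cluster too small
        total += cut_k / vol_k
    return total
```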
54 In fact, Theorem 1 can be applied to a large variety of clustering objective functions. [sent-211, score-0.271]
55 Theorem 3 (Consistency of NNC(RatioCutn ), NNC(WSSn ), and NNC(BWn )) Let fn and f ∗ be the empirical and true minimizers of nearest neighbor clustering using RatioCutn , WSSn , or BWn , respectively. [sent-213, score-1.022]
56 4 Implementation using branch and bound An obvious question is how nearest neighbor clustering can be implemented more efficiently than by simply trying all functions in Fn . [sent-217, score-0.597]
57 As an example we introduce a branch and bound algorithm for solving NNC(Ncut) for K = 2 clusters. [sent-220, score-0.095]
58 First of all, observe that minimizing Ncutn over the nearest neighbor function set Fn is the same as minimizing Ncutm over all partitions of a contracted data set consisting of m “super-points” Z1 , . . . , Zm [sent-222, score-0.37]
59 (super-point Zi contains all data points assigned to the i-th seed point), endowed with the “super-similarity” function s(Zs , Zt ) := Σ_{Xi∈Zs, Xj∈Zt} s(Xi , Xj ). [sent-225, score-0.12]
60 Hence nearest neighbor clustering on the original data set with n points can be performed by directly optimizing Ncut on the contracted data set consisting of only m super-points. [sent-226, score-0.536]
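A short sketch of this contraction (hypothetical names; cell[i] is the nearest-seed index computed when building the Voronoi cells):

```python
import numpy as np

def contract(S, cell, m):
    """Contract an n x n similarity matrix to the m x m super-similarity
    s(Zs, Zt) = sum over Xi in Zs, Xj in Zt of s(Xi, Xj)."""
    Z = np.zeros((len(cell), m))
    Z[np.arange(len(cell)), cell] = 1.0        # n x m cell-membership indicator
    return Z.T @ S @ Z                         # m x m super-similarity matrix
```

The equivalence holds because both cut and vol are sums of pairwise similarities, and the contraction preserves those sums.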
61 Analogously to the notation fk of the previous section, in case K = 2 we can decompose Ncut(f ) = cut(f+1 ) · (1/vol(f+1 ) + 1/vol(f−1 )); we call the first term the “cut term” and the second term the “volume term”. [sent-239, score-0.298]
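In code, the K = 2 decomposition reads as follows (a sketch under the same assumptions as above):

```python
import numpy as np

def ncut2(S, plus):
    """Ncut for K = 2: cut(f+1) * (1/vol(f+1) + 1/vol(f-1))."""
    plus = np.asarray(plus, dtype=bool)
    cut = S[np.ix_(plus, ~plus)].sum()          # the "cut term"
    vol_p, vol_m = S[plus].sum(), S[~plus].sum()
    if vol_p == 0 or vol_m == 0:
        return float("inf")                     # degenerate partition
    return cut * (1.0 / vol_p + 1.0 / vol_m)    # times the "volume term"
```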
62 As is standard in branch and bound, we have to investigate whether the “branch” of clusterings with the specific fixed labels on A could contain a solution which is better than all previously considered solutions. [sent-240, score-0.135]
63 The first one is very simple: assigning at least one point in B to +1 can only lead to an improvement if this decreases either the cut term or the volume term of Ncut. [sent-242, score-0.118]
64 // If no pruning possible, recursively call bbncut: • Set gi := +1, call g′ := bbncut(S̄, g, i + 1, θu ), θu := min{Ncut(g′ ), θu }. • Set gi := −1, call g′′ := bbncut(S̄, g, i + 1, θu ), θu := min{Ncut(g′′ ), θu }. • If Ncut(g′′ ) ≤ Ncut(g′ ) then return g′′ , else return g′ . [sent-255, score-0.174]
65 If θl ≥ θu then no improvement is possible by any clustering in the current branch of the tree, and we retract. [sent-263, score-0.312]
66 Using the conventions s(A, B) = Σ_{Zi∈A, Zj∈B} sij and s(A, ∅) = 0, the cut term is bounded by cut(A+ ∪ B+ , A− ∪ B− ) ≥ min_{j≥i} s(Zj , A+ ) if A− = ∅, and cut(A+ ∪ B+ , A− ∪ B− ) ≥ s(A+ , A− ) + min_{j≥i} s(Zj , A− ) otherwise. [sent-265, score-0.137]
67 For example, in our experience it is advantageous to sort the super-points by decreasing degree, and to alternate from one recursion level to the next between first visiting the branch gi = +1 and the branch gi = −1. [sent-273, score-0.144]
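Putting the pieces together, here is a deliberately simplified recursive skeleton of bbncut: it reuses ncut2 from above, branches in a fixed order rather than with the degree-sorting heuristic just mentioned, and only marks where the paper's cut/volume lower bounds would prune.

```python
import numpy as np

def bbncut(S, g, i, theta_u, best):
    """Toy branch and bound for NNC(Ncut) with K = 2 on m super-points.
    g holds +1/-1 labels for super-points 0..i-1; theta_u is the best
    Ncut value found so far."""
    m = S.shape[0]
    if i == m:                                  # complete labeling: evaluate it
        val = ncut2(S, g == 1)
        return (val, g.copy()) if val < theta_u else (theta_u, best)
    for label in (1, -1):                       # branch on super-point i
        g[i] = label
        # here a lower bound theta_l on Ncut over all completions would be
        # computed; if theta_l >= theta_u this branch could be skipped
        theta_u, best = bbncut(S, g, i + 1, theta_u, best)
    g[i] = 0
    return theta_u, best

# usage: theta, labels = bbncut(S_super, np.zeros(m, dtype=int), 0, np.inf, None)
```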
68 5 Experiments The main point about nearest neighbor clustering is its statistical consistency: for large n it reveals an approximately correct clustering. [sent-274, score-0.501]
69 Given an objective function Qn (such as WSS or Ncut) we compare the NNC results to heuristics designed to optimize Qn directly (such as k-means or spectral clustering). [sent-276, score-0.091]
70 As numeric data sets we used classification benchmark data sets from different repositories (UCI repository, repository by G. [sent-277, score-0.132]
71 We always set the number m of seed points for NNC to m = log n. [sent-286, score-0.12]
72 In the case of WSS, we compare the result of the k-means algorithm to the result of NNC using the WSS objective function and the Euclidean distance to assign data points to seed points. [sent-287, score-0.174]
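A minimal version of this comparison might look as follows (assuming scikit-learn for k-means and reusing the nnc_brute_force and wss sketches from Section 2; the data set here is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
f_nnc, q_nnc = nnc_brute_force(X, lambda X, f: wss(X, f, 2), K=2,
                               m=int(np.log(len(X))), rng=rng)
print("k-means WSS:", wss(X, km.labels_, 2), "  NNC(WSS):", q_nnc)
```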
73 Table 1: Results for the K-means algorithm and NNC(WSS) with Euclidean distance; spectral clustering (SC) and NNC(Ncut) with commute distance. [sent-475, score-0.333]
74 NNC(Ncut) with commute distance and spectral clustering, both trained on the entire graph. [sent-478, score-0.116]
75 The kernel width σ is set to the mean distance of a data point to its k-th nearest neighbor. [sent-481, score-0.141]
76 We then build the k-nearest neighbor graph (both times using k = ln n). [sent-482, score-0.176]
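One way to realize this construction (a sketch; the exact graph variant, mutual vs. symmetric kNN, is an assumption here):

```python
import numpy as np

def knn_graph_similarity(X, k):
    """Gaussian similarities restricted to a symmetric k-nearest-neighbor
    graph, with kernel width sigma = mean distance to the k-th NN."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    sigma = np.sort(D, axis=1)[:, k].mean()     # column 0 is the self-distance
    S = np.exp(-D**2 / (2 * sigma**2))
    nn = np.argsort(D, axis=1)[:, 1:k + 1]      # k nearest neighbors of each point
    mask = np.zeros_like(S, dtype=bool)
    mask[np.repeat(np.arange(len(X)), k), nn.ravel()] = True
    mask = mask | mask.T                        # symmetrize the edge set
    np.fill_diagonal(mask, False)
    return S * mask
```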
77 We use the commute distance (see, e.g., Gutman and Xiao, 2004) as distance function to determine the nearest seed points for NNC. [sent-486, score-0.261]
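The commute distance itself can be computed from the pseudoinverse of the graph Laplacian; a standard formulation (again a sketch, cubic in n and only practical for moderate graph sizes):

```python
import numpy as np

def commute_distance(S):
    """Commute-time distances c(i,j) = vol(G) * (L+_ii + L+_jj - 2 L+_ij),
    where L+ is the Moore-Penrose pseudoinverse of the Laplacian
    (assumes a connected graph)."""
    d = S.sum(axis=1)
    L = np.diag(d) - S
    Lp = np.linalg.pinv(L)
    diag = np.diag(Lp)
    return d.sum() * (diag[:, None] + diag[None, :] - 2 * Lp)
```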
78 From the numeric data sets we generated z = 40 training sets by subsampling n/2 points. [sent-488, score-0.098]
79 On each training set, we repeated all algorithms r = 50 times with different random initializations (the seeds in NNC; the centers in K-means; the centers in the K-means post-processing step in spectral clustering). [sent-489, score-0.144]
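The restart protocol is easy to factor out; a generic helper (hypothetical names) that keeps the best of r randomly initialized runs:

```python
import numpy as np

def best_of_restarts(fit, objective, r=50, seed=0):
    """Run a randomly initialized clustering algorithm r times and keep
    the labeling with the smallest objective value."""
    rng = np.random.default_rng(seed)
    best_labels, best_q = None, np.inf
    for _ in range(r):
        labels = fit(rng)            # one randomly initialized run
        q = objective(labels)
        if q < best_q:
            best_q, best_labels = q, labels
    return best_labels, best_q
```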
80 For the network data sets we ran spectral clustering and NNC on the whole graph. [sent-491, score-0.338]
81 For both the numeric data sets (left table, top lines) and the network data sets (right table) we see that the training performance of NNC is comparable to that of the other algorithms. [sent-494, score-0.119]
82 This is what we had hoped for, and we find it remarkable, as NNC is in fact a very simple clustering algorithm. [sent-495, score-0.24]
83 For each of the numeric data sets we cluster n/2 points, extend the clustering to the other n/2 points, and then compute the objective function on the test set. [sent-497, score-0.369]
84 For Ncut, we do this based on the k-nearest neighbor graph on the test set only. [sent-501, score-0.143]
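Extending an NNC clustering to unseen points only requires the seeds and their labels, since f(x) = f(NNm(x)); a sketch:

```python
import numpy as np

def extend_clustering(X_train, seeds, seed_labels, X_test):
    """Each test point inherits the label of its nearest seed point."""
    D = np.linalg.norm(X_test[:, None] - X_train[seeds][None, :], axis=2)
    return seed_labels[np.argmin(D, axis=1)]
```

The test objective (WSS, or Ncut on the test kNN graph) is then evaluated on these extended labels.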
85 The results on the numeric data sets are reported in Table 1 (left table, bottom lines). [sent-503, score-0.081]
86 The most likely explanation is that both K-means and spectral clustering have already reasonably good extension properties. [sent-506, score-0.3]
87 This can be due to the fact that, like NNC, both algorithms consider only a certain subclass of all partitions: Voronoi partitions for K-means, and partitions induced by eigenvectors for spectral clustering. [sent-507, score-0.272]
88 6 Discussion In this paper we investigate clustering algorithms which minimize quality functions. [sent-509, score-0.285]
89 If we even choose Fn to be of polynomial size, then all problems due to the NP-hardness of discrete optimization formally disappear, as the remaining optimization problems become inherently polynomial. [sent-511, score-0.089]
90 From a practical point of view, the approach of using a restricted function class Fn can be seen as a more controlled way of simplifying NP-hard optimization problems than the standard approaches of local optimization or relaxation. [sent-512, score-0.078]
91 The generic clustering algorithm we studied in this article is nearest neighbor clustering, which produces clusterings that are constant on small local neighborhoods. [sent-515, score-0.524]
92 We have proved that this algorithm is statistically consistent for a large variety of popular clustering objective functions. [sent-516, score-0.292]
93 Thus, as opposed to other clustering algorithms such as the K-means algorithm or spectral clustering, nearest neighbor clustering is guaranteed to converge to a true global minimizer of the objective on the underlying space. [sent-517, score-0.836]
94 For K-means it has been proved that the global minimizer of the WSS objective function on the sample converges to a global minimizer on the underlying space. [sent-519, score-0.13]
95 A related effect happens for spectral clustering, which is a relaxation attempting to minimize Ncut (see von Luxburg (2007) for a tutorial). [sent-523, score-0.142]
96 It has been shown that under certain conditions the solution of the relaxed problem on the finite sample converges to some limit clustering. [sent-524, score-0.287]
97 However, it has been conjectured that this limit clustering is not necessarily the optimizer of the Ncut objective function. [sent-528, score-0.271]
98 So for both cases, our consistency results represent an improvement: our algorithm provably converges to the true limit minimizer of K-means or Ncut, respectively. [sent-529, score-0.129]
wordName wordTfidf (topN-words)
[('fn', 0.512), ('ncut', 0.377), ('vol', 0.307), ('nnc', 0.289), ('fk', 0.282), ('clustering', 0.24), ('qn', 0.196), ('neighbor', 0.126), ('wss', 0.126), ('cut', 0.118), ('nearest', 0.118), ('cutn', 0.113), ('voln', 0.113), ('partitions', 0.106), ('luxburg', 0.09), ('seed', 0.088), ('von', 0.082), ('ncutn', 0.075), ('nnm', 0.075), ('branch', 0.072), ('seeds', 0.066), ('numeric', 0.064), ('spectral', 0.06), ('zj', 0.059), ('rd', 0.055), ('bbncut', 0.05), ('bubeck', 0.05), ('voronoi', 0.05), ('consistency', 0.047), ('quality', 0.045), ('argminf', 0.044), ('clusterings', 0.04), ('bwn', 0.038), ('minr', 0.038), ('ratiocut', 0.038), ('ratiocutn', 0.038), ('wssn', 0.038), ('gi', 0.036), ('np', 0.036), ('partition', 0.035), ('repository', 0.034), ('ln', 0.033), ('supf', 0.033), ('commute', 0.033), ('points', 0.032), ('devroye', 0.032), ('xn', 0.032), ('objective', 0.031), ('converges', 0.03), ('bw', 0.03), ('optimization', 0.03), ('discrete', 0.029), ('xi', 0.028), ('return', 0.027), ('true', 0.026), ('condition', 0.026), ('minimizer', 0.026), ('brusco', 0.025), ('efk', 0.025), ('guimer', 0.025), ('gutman', 0.025), ('jegelka', 0.025), ('jeong', 0.025), ('symmetrization', 0.025), ('bound', 0.023), ('zi', 0.023), ('et', 0.023), ('distance', 0.023), ('sc', 0.022), ('eqn', 0.022), ('nf', 0.022), ('spellman', 0.022), ('watts', 0.022), ('network', 0.021), ('consistent', 0.021), ('minj', 0.02), ('contracted', 0.02), ('maxj', 0.02), ('predicate', 0.02), ('theorem', 0.02), ('xj', 0.019), ('estimator', 0.019), ('mcdiarmid', 0.019), ('sij', 0.019), ('email', 0.019), ('protein', 0.018), ('class', 0.018), ('functions', 0.018), ('tsuda', 0.018), ('initializations', 0.018), ('zs', 0.018), ('polynomial', 0.017), ('cluster', 0.017), ('sample', 0.017), ('zm', 0.017), ('lj', 0.017), ('statistical', 0.017), ('sets', 0.017), ('graph', 0.017), ('call', 0.016)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 58 nips-2007-Consistent Minimization of Clustering Objective Functions
Author: Ulrike V. Luxburg, Stefanie Jegelka, Michael Kaufmann, Sébastien Bubeck
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal. As an alternative, we suggest the paradigm of “nearest neighbor clustering”. Instead of selecting the best out of all partitions of the sample, it only considers partitions in some restricted function class. Using tools from statistical learning theory we prove that nearest neighbor clustering is statistically consistent. Moreover, its worst case complexity is polynomial by construction, and it can be implemented with small average case complexity using branch and bound. 1
2 0.27478704 200 nips-2007-The Tradeoffs of Large Scale Learning
Author: Olivier Bousquet, Léon Bottou
Abstract: This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways.
3 0.18071949 170 nips-2007-Robust Regression with Twinned Gaussian Processes
Author: Andrew Naish-guzman, Sean Holden
Abstract: We propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp [1] and Rasmussen and Ghahramani [2]. The value of this restriction is in its tractable expectation propagation updates, which allow for faster inference and model selection, and better convergence than the standard mixture. An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. We also address similarities with the work of Goldberg et al. [3].
4 0.13305224 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
Author: Francis R. Bach, Zaïd Harchaoui
Abstract: We present a novel linear clustering framework (D IFFRAC) which relies on a linear discriminative cost function and a convex relaxation of a combinatorial optimization problem. The large convex optimization problem is solved through a sequence of lower dimensional singular value decompositions. This framework has several attractive properties: (1) although apparently similar to K-means, it exhibits superior clustering performance than K-means, in particular in terms of robustness to noise. (2) It can be readily extended to non linear clustering if the discriminative cost function is based on positive definite kernels, and can then be seen as an alternative to spectral clustering. (3) Prior information on the partition is easily incorporated, leading to state-of-the-art performance for semi-supervised learning, for clustering or classification. We present empirical evaluations of our algorithms on synthetic and real medium-scale datasets.
5 0.11561605 127 nips-2007-Measuring Neural Synchrony by Message Passing
Author: Justin Dauwels, François Vialatte, Tomasz Rutkowski, Andrzej S. Cichocki
Abstract: A novel approach to measure the interdependence of two time series is proposed, referred to as “stochastic event synchrony” (SES); it quantifies the alignment of two point processes by means of the following parameters: time delay, variance of the timing jitter, fraction of “spurious” events, and average similarity of events. SES may be applied to generic one-dimensional and multi-dimensional point processes, however, the paper mainly focusses on point processes in time-frequency domain. The average event similarity is in that case described by two parameters: the average frequency offset between events in the time-frequency plane, and the variance of the frequency offset (“frequency jitter”); SES then consists of five parameters in total. Those parameters quantify the synchrony of oscillatory events, and hence, they provide an alternative to existing synchrony measures that quantify amplitude or phase synchrony. The pairwise alignment of point processes is cast as a statistical inference problem, which is solved by applying the maxproduct algorithm on a graphical model. The SES parameters are determined from the resulting pairwise alignment by maximum a posteriori (MAP) estimation. The proposed interdependence measure is applied to the problem of detecting anomalies in EEG synchrony of Mild Cognitive Impairment (MCI) patients; the results indicate that SES significantly improves the sensitivity of EEG in detecting MCI.
6 0.10740387 46 nips-2007-Cluster Stability for Finite Samples
7 0.10568865 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
8 0.10200348 80 nips-2007-Ensemble Clustering using Semidefinite Programming
9 0.090083323 61 nips-2007-Convex Clustering with Exemplar-Based Models
10 0.089399442 195 nips-2007-The Generalized FITC Approximation
11 0.055649448 16 nips-2007-A learning framework for nearest neighbor search
12 0.051196102 116 nips-2007-Learning the structure of manifolds using random projections
13 0.048977982 70 nips-2007-Discriminative K-means for Clustering
14 0.047472574 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
15 0.041237112 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
16 0.038475286 141 nips-2007-New Outer Bounds on the Marginal Polytope
17 0.038157139 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
18 0.03712846 160 nips-2007-Random Features for Large-Scale Kernel Machines
19 0.036250792 213 nips-2007-Variational Inference for Diffusion Processes
20 0.035807949 118 nips-2007-Learning with Transformation Invariant Kernels
topicId topicWeight
[(0, -0.141), (1, 0.017), (2, -0.079), (3, 0.05), (4, -0.042), (5, -0.065), (6, -0.046), (7, 0.003), (8, -0.172), (9, -0.015), (10, -0.127), (11, 0.068), (12, 0.007), (13, -0.235), (14, -0.037), (15, 0.232), (16, 0.091), (17, -0.075), (18, -0.186), (19, 0.137), (20, -0.078), (21, -0.136), (22, -0.024), (23, -0.202), (24, -0.127), (25, -0.143), (26, 0.002), (27, 0.266), (28, 0.024), (29, -0.02), (30, -0.066), (31, 0.107), (32, -0.043), (33, -0.024), (34, 0.039), (35, -0.032), (36, -0.033), (37, 0.11), (38, -0.093), (39, -0.087), (40, 0.046), (41, -0.036), (42, 0.029), (43, 0.073), (44, -0.094), (45, -0.062), (46, 0.026), (47, -0.026), (48, -0.087), (49, -0.057)]
simIndex simValue paperId paperTitle
same-paper 1 0.95233053 58 nips-2007-Consistent Minimization of Clustering Objective Functions
Author: Ulrike V. Luxburg, Stefanie Jegelka, Michael Kaufmann, Sébastien Bubeck
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal. As an alternative, we suggest the paradigm of “nearest neighbor clustering”. Instead of selecting the best out of all partitions of the sample, it only considers partitions in some restricted function class. Using tools from statistical learning theory we prove that nearest neighbor clustering is statistically consistent. Moreover, its worst case complexity is polynomial by construction, and it can be implemented with small average case complexity using branch and bound. 1
2 0.77085847 200 nips-2007-The Tradeoffs of Large Scale Learning
Author: Olivier Bousquet, Léon Bottou
Abstract: This contribution develops a theoretical framework that takes into account the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for the case of small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation–estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff involving the computational complexity of the underlying optimization algorithms in non-trivial ways. 1 Motivation The computational complexity of learning algorithms has seldom been taken into account by the learning theory. Valiant [1] states that a problem is “learnable” when there exists a probably approximatively correct learning algorithm with polynomial complexity. Whereas much progress has been made on the statistical aspect (e.g., [2, 3, 4]), very little has been told about the complexity side of this proposal (e.g., [5].) Computational complexity becomes the limiting factor when one envisions large amounts of training data. Two important examples come to mind: • Data mining exists because competitive advantages can be achieved by analyzing the masses of data that describe the life of our computerized society. Since virtually every computer generates data, the data volume is proportional to the available computing power. Therefore one needs learning algorithms that scale roughly linearly with the total volume of data. • Artificial intelligence attempts to emulate the cognitive capabilities of human beings. Our biological brains can learn quite efficiently from the continuous streams of perceptual data generated by our six senses, using limited amounts of sugar as a source of power. This observation suggests that there are learning algorithms whose computing time requirements scale roughly linearly with the total volume of data. This contribution finds its source in the idea that approximate optimization algorithms might be sufficient for learning purposes. The first part proposes new decomposition of the test error where an additional term represents the impact of approximate optimization. In the case of small-scale learning problems, this decomposition reduces to the well known tradeoff between approximation error and estimation error. In the case of large-scale learning problems, the tradeoff is more complex because it involves the computational complexity of the learning algorithm. The second part explores the asymptotic properties of the large-scale learning tradeoff for various prototypical learning algorithms under various assumptions regarding the statistical estimation rates associated with the chosen objective functions. This part clearly shows that the best optimization algorithms are not necessarily the best learning algorithms. Maybe more surprisingly, certain algorithms perform well regardless of the assumed rate for the statistical estimation error. 2 2.1 Approximate Optimization Setup Following [6, 2], we consider a space of input-output pairs (x, y) ∈ X × Y endowed with a probability distribution P (x, y). The conditional distribution P (y|x) represents the unknown relationship between inputs and outputs. The discrepancy between the predicted output y and the real output ˆ y is measured with a loss function ℓ(ˆ, y). Our benchmark is the function f ∗ that minimizes the y expected risk E(f ) = that is, ℓ(f (x), y) dP (x, y) = E [ℓ(f (x), y)], f ∗ (x) = arg min E [ ℓ(ˆ, y)| x]. 
Although the distribution $P(x, y)$ is unknown, we are given a sample $S$ of $n$ independently drawn training examples $(x_i, y_i)$, $i = 1 \dots n$. We define the empirical risk $$E_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i) = \mathbb{E}_n[\ell(f(x), y)].$$ Our first learning principle consists in choosing a family $\mathcal{F}$ of candidate prediction functions and finding the function $f_n = \arg\min_{f \in \mathcal{F}} E_n(f)$ that minimizes the empirical risk. Well-known combinatorial results (e.g., [2]) support this approach provided that the chosen family $\mathcal{F}$ is sufficiently restrictive. Since the optimal function $f^*$ is unlikely to belong to the family $\mathcal{F}$, we also define $f^*_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} E(f)$. For simplicity, we assume that $f^*$, $f^*_{\mathcal{F}}$ and $f_n$ are well defined and unique. We can then decompose the excess error as $$\mathbb{E}[E(f_n) - E(f^*)] = \mathbb{E}[E(f^*_{\mathcal{F}}) - E(f^*)] + \mathbb{E}[E(f_n) - E(f^*_{\mathcal{F}})] = \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}}, \quad (1)$$ where the expectation is taken with respect to the random choice of training set. The approximation error $\mathcal{E}_{\mathrm{app}}$ measures how closely functions in $\mathcal{F}$ can approximate the optimal solution $f^*$. The estimation error $\mathcal{E}_{\mathrm{est}}$ measures the effect of minimizing the empirical risk $E_n(f)$ instead of the expected risk $E(f)$. The estimation error is determined by the number of training examples and by the capacity of the family of functions [2]. Large families¹ of functions have smaller approximation errors but lead to higher estimation errors. This tradeoff has been extensively discussed in the literature [2, 3] and leads to excess errors that scale between the inverse and the inverse square root of the number of examples [7, 8]. 2.2 Optimization Error Finding $f_n$ by minimizing the empirical risk $E_n(f)$ is often a computationally expensive operation. Since the empirical risk $E_n(f)$ is already an approximation of the expected risk $E(f)$, it should not be necessary to carry out this minimization with great accuracy. For instance, we could stop an iterative optimization algorithm long before its convergence. Let us assume that our minimization algorithm returns an approximate solution $\tilde{f}_n$ such that $$E_n(\tilde{f}_n) < E_n(f_n) + \rho,$$ where $\rho \ge 0$ is a predefined tolerance. An additional term $\mathcal{E}_{\mathrm{opt}} = \mathbb{E}[E(\tilde{f}_n) - E(f_n)]$ then appears in the decomposition of the excess error $\mathcal{E} = \mathbb{E}[E(\tilde{f}_n) - E(f^*)]$: $$\mathcal{E} = \mathbb{E}[E(f^*_{\mathcal{F}}) - E(f^*)] + \mathbb{E}[E(f_n) - E(f^*_{\mathcal{F}})] + \mathbb{E}[E(\tilde{f}_n) - E(f_n)] = \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}}. \quad (2)$$ We call this additional term the optimization error. It reflects the impact of the approximate optimization on the generalization performance. Its magnitude is comparable to $\rho$ (see section 3.1). ¹ We often consider nested families of functions of the form $\mathcal{F}_c = \{f \in \mathcal{H},\ \Omega(f) \le c\}$. Then, for each value of $c$, the function $f_n$ is obtained by minimizing the regularized empirical risk $E_n(f) + \lambda\,\Omega(f)$ for a suitable choice of the Lagrange coefficient $\lambda$. We can then control the estimation–approximation tradeoff by choosing $\lambda$ instead of $c$. 2.3 The Approximation–Estimation–Optimization Tradeoff This decomposition leads to a more complicated compromise. It involves three variables and two constraints. The constraints are the maximal number of available training examples and the maximal computation time. The variables are the size of the family of functions $\mathcal{F}$, the optimization accuracy $\rho$, and the number of examples $n$. This is formalized by the following optimization problem.
$$\min_{\mathcal{F},\, \rho,\, n} \; \mathcal{E} = \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}} \quad \text{subject to} \quad n \le n_{\max}, \quad T(\mathcal{F}, \rho, n) \le T_{\max}. \quad (3)$$ The number $n$ of training examples is a variable because we could choose to use only a subset of the available training examples in order to complete the optimization within the allotted time. This happens often in practice. Table 1 summarizes the typical evolution of the quantities of interest as the three variables $\mathcal{F}$, $n$, and $\rho$ increase.

Table 1: Typical variations when $\mathcal{F}$, $n$, and $\rho$ increase (ր: increases, ց: decreases, ···: roughly unaffected).

                                         F      n      ρ
  E_app  (approximation error)           ց
  E_est  (estimation error)              ր      ց
  E_opt  (optimization error)            ···    ···    ր
  T      (computation time)              ր      ր      ց

The solution of the optimization program (3) depends critically on which budget constraint is active: the constraint $n < n_{\max}$ on the number of examples, or the constraint $T < T_{\max}$ on the training time. • We speak of a small-scale learning problem when (3) is constrained by the maximal number of examples $n_{\max}$. Since the computing time is not limited, we can reduce the optimization error $\mathcal{E}_{\mathrm{opt}}$ to insignificant levels by choosing $\rho$ arbitrarily small. The excess error is then dominated by the approximation and estimation errors, $\mathcal{E}_{\mathrm{app}}$ and $\mathcal{E}_{\mathrm{est}}$. Taking $n = n_{\max}$, we recover the approximation–estimation tradeoff that is the object of an abundant literature. • We speak of a large-scale learning problem when (3) is constrained by the maximal computing time $T_{\max}$. Approximate optimization, that is, choosing $\rho > 0$, can possibly achieve better generalization because more training examples can be processed during the allowed time. The specifics depend on the computational properties of the chosen optimization algorithm through the expression of the computing time $T(\mathcal{F}, \rho, n)$. 3 The Asymptotics of Large-scale Learning In the previous section, we have extended the classical approximation–estimation tradeoff by taking into account the optimization error. We have given an objective criterion to distinguish small-scale and large-scale learning problems. In the small-scale case, we recover the classical tradeoff between approximation and estimation. The large-scale case is substantially different because it involves the computational complexity of the learning algorithm. In order to clarify the large-scale learning tradeoff with sufficient generality, this section makes several simplifications: • We are studying upper bounds of the approximation, estimation, and optimization errors (2). It is often accepted that these upper bounds give a realistic idea of the actual convergence rates [9, 10, 11, 12]. Another way to find comfort in this approach is to say that we study guaranteed convergence rates instead of possibly pathological special cases. • We are studying the asymptotic properties of the tradeoff when the problem size increases. Instead of carefully balancing the three terms, we write $\mathcal{E} = O(\mathcal{E}_{\mathrm{app}}) + O(\mathcal{E}_{\mathrm{est}}) + O(\mathcal{E}_{\mathrm{opt}})$ and only need to ensure that the three terms decrease with the same asymptotic rate. • We are considering a fixed family of functions $\mathcal{F}$ and therefore avoid taking into account the approximation error $\mathcal{E}_{\mathrm{app}}$. This part of the tradeoff covers a wide spectrum of practical realities such as choosing models and choosing features. In the context of this work, we do not believe we can meaningfully address it without discussing, for instance, the thorny issue of feature selection. Instead we focus on the choice of optimization algorithm. • Finally, in order to keep this paper short, we consider that the family of functions $\mathcal{F}$ is linearly parametrized by a vector $w \in \mathbb{R}^d$.
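Before turning to these asymptotics, here is a minimal sketch (ours, not from the paper) of empirical risk minimization run under both budgets of program (3): the loop stops either when an accuracy proxy reaches the tolerance ρ or when the wall-clock budget Tmax is exhausted. The logistic loss, the gradient-norm stopping proxy, and all numerical values are assumptions made purely for illustration.

```python
import time
import numpy as np

def approx_erm(X, y, rho=1e-3, t_max=1.0, eta=0.1):
    """Gradient descent on the empirical logistic risk E_n(w).

    Stops when the gradient norm falls below rho (a crude proxy for the
    tolerance E_n(w) < E_n(w_n) + rho) or when the time budget t_max in
    seconds is exhausted -- the two constraints of program (3).
    """
    n, d = X.shape
    w = np.zeros(d)
    start = time.time()
    while time.time() - start < t_max:              # constraint T <= T_max
        margins = np.clip(y * (X @ w), -500, 500)   # avoid exp overflow
        grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
        if np.linalg.norm(grad) < rho:              # accurate enough: rho bites first
            break
        w -= eta * grad
    return w

# Illustrative usage on synthetic data (labels in {-1, +1}).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = np.sign(X @ rng.normal(size=20) + 0.1 * rng.normal(size=5000))
w = approx_erm(X, y, rho=1e-4, t_max=0.5)
```

When `t_max` is generous, the tolerance test fires first (the small-scale regime); when it is tight, the time budget fires first and ρ is effectively whatever accuracy the allotted time buys (the large-scale regime).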
We also assume that $x$, $y$ and $w$ are bounded, ensuring that there is a constant $B$ such that $0 \le \ell(f_w(x), y) \le B$ and $\ell(\cdot, y)$ is Lipschitz. We first explain how the uniform convergence bounds provide convergence rates that take the optimization error into account. Then we discuss and compare the asymptotic learning properties of several optimization algorithms. 3.1 Convergence of the Estimation and Optimization Errors The optimization error $\mathcal{E}_{\mathrm{opt}}$ depends directly on the optimization accuracy $\rho$. However, the accuracy $\rho$ involves the empirical quantity $E_n(\tilde{f}_n) - E_n(f_n)$, whereas the optimization error $\mathcal{E}_{\mathrm{opt}}$ involves its expected counterpart $E(\tilde{f}_n) - E(f_n)$. This section discusses the impact of the optimization accuracy $\rho$ on generalization bounds that leverage the uniform convergence concepts pioneered by Vapnik and Chervonenkis (e.g., [2]). In this discussion, we use the letter $c$ to refer to any positive constant. Multiple occurrences of the letter $c$ do not necessarily imply that the constants have identical values. 3.1.1 Simple Uniform Convergence Bounds Recall that we assume that $\mathcal{F}$ is linearly parametrized by $w \in \mathbb{R}^d$. Elementary uniform convergence results then state that $$\mathbb{E}\Big[\sup_{f \in \mathcal{F}} |E(f) - E_n(f)|\Big] \le c\,\sqrt{\frac{d}{n}},$$ where the expectation is taken with respect to the random choice of the training set.² This result immediately provides a bound on the estimation error: $$\mathcal{E}_{\mathrm{est}} = \mathbb{E}\big[(E(f_n) - E_n(f_n)) + (E_n(f_n) - E_n(f^*_{\mathcal{F}})) + (E_n(f^*_{\mathcal{F}}) - E(f^*_{\mathcal{F}}))\big] \le 2\,\mathbb{E}\Big[\sup_{f \in \mathcal{F}} |E(f) - E_n(f)|\Big] \le c\,\sqrt{\frac{d}{n}}.$$ The same result also provides a combined bound for the estimation and optimization errors: $$\mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}} = \mathbb{E}[E(\tilde{f}_n) - E_n(\tilde{f}_n)] + \mathbb{E}[E_n(\tilde{f}_n) - E_n(f_n)] + \mathbb{E}[E_n(f_n) - E_n(f^*_{\mathcal{F}})] + \mathbb{E}[E_n(f^*_{\mathcal{F}}) - E(f^*_{\mathcal{F}})] \le c\sqrt{\frac{d}{n}} + \rho + 0 + c\sqrt{\frac{d}{n}} = c\Big(\rho + \sqrt{\frac{d}{n}}\Big).$$ Unfortunately, this convergence rate is known to be pessimistic in many important cases. More sophisticated bounds are required. 3.1.2 Faster Rates in the Realizable Case When the loss function $\ell(\hat{y}, y)$ is positive, with probability $1 - e^{-\tau}$ for any $\tau > 0$, relative uniform convergence bounds state that $$\sup_{f \in \mathcal{F}} \frac{E(f) - E_n(f)}{\sqrt{E(f)}} \le c\,\sqrt{\frac{d}{n}\log\frac{n}{d} + \frac{\tau}{n}}.$$ This result is very useful because it provides faster convergence rates $O(\log n / n)$ in the realizable case, that is, when $\ell(f_n(x_i), y_i) = 0$ for all training examples $(x_i, y_i)$. We then have $E_n(f_n) = 0$, $E_n(\tilde{f}_n) \le \rho$, and we can write $$E(\tilde{f}_n) - \rho \le c\,\sqrt{E(\tilde{f}_n)}\,\sqrt{\frac{d}{n}\log\frac{n}{d} + \frac{\tau}{n}}.$$ Viewing this as a second-degree polynomial inequality in the variable $\sqrt{E(\tilde{f}_n)}$ (the algebra is spelled out in the sketch below), we obtain $$E(\tilde{f}_n) \le c\Big(\rho + \frac{d}{n}\log\frac{n}{d} + \frac{\tau}{n}\Big).$$ Integrating this inequality using a standard technique (see, e.g., [13]), we obtain a better convergence rate for the combined estimation and optimization error: $$\mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}} = \mathbb{E}[E(\tilde{f}_n) - E(f^*_{\mathcal{F}})] \le \mathbb{E}[E(\tilde{f}_n)] = c\Big(\rho + \frac{d}{n}\log\frac{n}{d}\Big).$$ 3.1.3 Fast Rate Bounds Many authors (e.g., [10, 4, 12]) obtain fast statistical estimation rates in more general conditions. These bounds have the general form $$\mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} \le c\Big(\mathcal{E}_{\mathrm{app}} + \Big(\frac{d}{n}\log\frac{n}{d}\Big)^{\alpha}\Big) \quad \text{for} \quad \frac{1}{2} \le \alpha \le 1. \quad (4)$$ This result holds when one can establish the following variance condition: $$\forall f \in \mathcal{F}, \quad \mathbb{E}\Big[\big(\ell(f(X), Y) - \ell(f^*_{\mathcal{F}}(X), Y)\big)^2\Big] \le c\,\big(E(f) - E(f^*_{\mathcal{F}})\big)^{2 - \frac{1}{\alpha}}. \quad (5)$$ The convergence rate of (4) is described by the exponent $\alpha$, which is determined by the quality of the variance bound (5). ² Although the original Vapnik–Chervonenkis bounds have the form $c\sqrt{\frac{d}{n}\log\frac{n}{d}}$, the logarithmic term can be eliminated using the “chaining” technique (e.g., [10]).
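For readers reconstructing the second-degree polynomial inequality step in section 3.1.2, here is a sketch of the algebra; the substitution $u$ and the shorthand $B$ are ours, not the paper's. Set $u = \sqrt{E(\tilde{f}_n)}$ and $B = \sqrt{\frac{d}{n}\log\frac{n}{d} + \frac{\tau}{n}}$, so the inequality reads $u^2 - \rho \le c\,u\,B$.

```latex
% Sketch: solving the quadratic inequality u^2 - cBu - rho <= 0 for u >= 0.
\begin{aligned}
u^2 - cBu - \rho \le 0
  &\;\Longrightarrow\; u \le \frac{cB + \sqrt{c^2B^2 + 4\rho}}{2}
     && \text{(keep the larger root)} \\
  &\;\Longrightarrow\; u^2 \le \frac{(cB)^2 + (c^2B^2 + 4\rho)}{2} = c^2B^2 + 2\rho
     && \text{(since } (a+b)^2 \le 2a^2 + 2b^2\text{)} \\
  &\;\Longrightarrow\; E(\tilde{f}_n) \le c'\Big(\rho + \frac{d}{n}\log\frac{n}{d} + \frac{\tau}{n}\Big)
     && \text{(absorbing constants into } c'\text{).}
\end{aligned}
```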
Works on fast statistical estimation identify two main ways to establish such a variance condition. • Exploiting the strict convexity of certain loss functions [12, theorem 12]. For instance, Lee et al. [14] establish an $O(\log n / n)$ rate using the squared loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$. • Making assumptions on the data distribution. In the case of pattern recognition problems, for instance, the “Tsybakov condition” indicates how cleanly the posterior distributions $P(y|x)$ cross near the optimal decision boundary [11, 12]. The realizable case discussed in section 3.1.2 can be viewed as an extreme case of this. Despite their much greater complexity, fast rate estimation results can accommodate the optimization accuracy $\rho$ using essentially the methods illustrated in sections 3.1.1 and 3.1.2. We then obtain a bound of the form $$\mathcal{E} = \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}} = \mathbb{E}[E(\tilde{f}_n) - E(f^*)] \le c\Big(\mathcal{E}_{\mathrm{app}} + \Big(\frac{d}{n}\log\frac{n}{d}\Big)^{\alpha} + \rho\Big). \quad (6)$$ For instance, a general result with $\alpha = 1$ is provided by Massart [13, theorem 4.2]. Combining this result with standard bounds on the complexity of classes of linear functions (e.g., [10]) yields the following result: $$\mathcal{E} = \mathcal{E}_{\mathrm{app}} + \mathcal{E}_{\mathrm{est}} + \mathcal{E}_{\mathrm{opt}} = \mathbb{E}[E(\tilde{f}_n) - E(f^*)] \le c\Big(\mathcal{E}_{\mathrm{app}} + \frac{d}{n}\log\frac{n}{d} + \rho\Big). \quad (7)$$ See also [15, 4] for more bounds taking into account the optimization accuracy. 3.2 Gradient Optimization Algorithms We now discuss and compare the asymptotic learning properties of four gradient optimization algorithms. Recall that the family of functions $\mathcal{F}$ is linearly parametrized by $w \in \mathbb{R}^d$. Let $w^*_{\mathcal{F}}$ and $w_n$ correspond to the functions $f^*_{\mathcal{F}}$ and $f_n$ defined in section 2.1. In this section, we assume that the functions $w \mapsto \ell(f_w(x), y)$ are convex and twice differentiable with continuous second derivatives. Convexity ensures that the empirical cost function $C(w) = E_n(f_w)$ has a single minimum. Two matrices play an important role in the analysis: the Hessian matrix $H$ and the gradient covariance matrix $G$, both measured at the empirical optimum $w_n$: $$H = \frac{\partial^2 C}{\partial w^2}(w_n) = \mathbb{E}_n\Big[\frac{\partial^2 \ell(f_{w_n}(x), y)}{\partial w^2}\Big], \quad (8)$$ $$G = \mathbb{E}_n\Big[\Big(\frac{\partial \ell(f_{w_n}(x), y)}{\partial w}\Big)\Big(\frac{\partial \ell(f_{w_n}(x), y)}{\partial w}\Big)'\Big]. \quad (9)$$ The relation between these two matrices depends on the chosen loss function. In order to summarize them, we assume that there are constants $\lambda_{\max} \ge \lambda_{\min} > 0$ and $\nu > 0$ such that, for any $\eta > 0$, we can choose the number of examples $n$ large enough to ensure that the following assertion is true with probability greater than $1 - \eta$: $$\mathrm{tr}(G H^{-1}) \le \nu \quad \text{and} \quad \mathrm{EigenSpectrum}(H) \subset [\lambda_{\min}, \lambda_{\max}]. \quad (10)$$ The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ is a good indicator of the difficulty of the optimization [16]. The condition $\lambda_{\min} > 0$ avoids complications with stochastic gradient algorithms. Note that this condition only implies strict convexity around the optimum. For instance, consider a loss function $\ell$ obtained by smoothing the well-known hinge loss $\ell(z, y) = \max\{0, 1 - yz\}$ in a small neighborhood of its non-differentiable points. The function $C(w)$ is then piecewise linear with smoothed edges and vertices. It is not strictly convex. However, its minimum is likely to be on a smoothed vertex with a non-singular Hessian. When we have strict convexity, the argument of [12, theorem 12] yields fast estimation rates $\alpha \approx 1$ in (4) and (6). This is not necessarily the case here. The four algorithms considered in this paper use information about the gradient of the cost function to iteratively update their current estimate $w(t)$ of the parameter vector. • Gradient Descent (GD) iterates $$w(t+1) = w(t) - \eta\,\frac{\partial C}{\partial w}(w(t)) = w(t) - \eta\,\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial w}\,\ell\big(f_{w(t)}(x_i), y_i\big),$$ where $\eta > 0$ is a small enough gain.
GD is an algorithm with linear convergence [16]. When $\eta = 1/\lambda_{\max}$, this algorithm requires $O(\kappa \log(1/\rho))$ iterations to reach accuracy $\rho$. The exact number of iterations depends on the choice of the initial parameter vector. • Second Order Gradient Descent (2GD) iterates $$w(t+1) = w(t) - H^{-1}\frac{\partial C}{\partial w}(w(t)) = w(t) - H^{-1}\,\frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial w}\,\ell\big(f_{w(t)}(x_i), y_i\big),$$ where the matrix $H^{-1}$ is the inverse of the Hessian matrix (8). This is more favorable than Newton’s algorithm because we do not evaluate the local Hessian at each iteration but simply assume that we know in advance the Hessian at the optimum. 2GD is a superlinear optimization algorithm with quadratic convergence [16]. When the cost is quadratic, a single iteration is sufficient. In the general case, $O(\log\log(1/\rho))$ iterations are required to reach accuracy $\rho$. • Stochastic Gradient Descent (SGD) picks a random training example $(x_t, y_t)$ at each iteration and updates the parameter $w$ on the basis of this example only: $$w(t+1) = w(t) - \frac{\eta}{t}\,\frac{\partial}{\partial w}\,\ell\big(f_{w(t)}(x_t), y_t\big).$$ Murata [17, section 2.2] characterizes the mean $\mathbb{E}_S[w(t)]$ and variance $\mathrm{Var}_S[w(t)]$ with respect to the distribution implied by the random examples drawn from the training set $S$ at each iteration. Applying this result to the discrete training set distribution for $\eta = 1/\lambda_{\min}$, we have $\delta w(t)^2 = O(1/t)$, where $\delta w(t)$ is a shorthand notation for $w(t) - w_n$. We can then write $$\mathbb{E}_S[C(w(t)) - \inf C] = \mathrm{tr}\big(H\,\mathbb{E}_S[\delta w(t)\,\delta w(t)']\big) + o\big(\tfrac{1}{t}\big) = \mathrm{tr}\big(H\,\mathbb{E}_S[\delta w(t)]\,\mathbb{E}_S[\delta w(t)]' + H\,\mathrm{Var}_S[w(t)]\big) + o\big(\tfrac{1}{t}\big) \le \frac{\mathrm{tr}(G H)}{t} + o\big(\tfrac{1}{t}\big) \le \frac{\nu\kappa^2}{t} + o\big(\tfrac{1}{t}\big). \quad (11)$$ Therefore the SGD algorithm reaches accuracy $\rho$ after less than $\nu\kappa^2/\rho + o(1/\rho)$ iterations on average. The SGD convergence is essentially limited by the stochastic noise induced by the random choice of one example at each iteration. Neither the initial value of the parameter vector $w$ nor the total number of examples $n$ appear in the dominant term of this bound! When the training set is large, one could reach the desired accuracy $\rho$ measured on the whole training set without even visiting all the training examples. This is in fact a kind of generalization bound. • Second Order Stochastic Gradient Descent (2SGD) replaces the gain $\eta$ by the inverse of the Hessian matrix $H$: $$w(t+1) = w(t) - \frac{1}{t}\,H^{-1}\,\frac{\partial}{\partial w}\,\ell\big(f_{w(t)}(x_t), y_t\big).$$ Unlike standard gradient algorithms, using the second order information does not change the influence of $\rho$ on the convergence rate but improves the constants. Using again [17, theorem 4], accuracy $\rho$ is reached after $\nu/\rho + o(1/\rho)$ iterations. For each of the four gradient algorithms, the first three columns of table 2 report the time for a single iteration, the number of iterations needed to reach a predefined accuracy $\rho$, and their product, the time needed to reach accuracy $\rho$. These asymptotic results are valid with probability 1, since the probability of their complement is smaller than $\eta$ for any $\eta > 0$.

Table 2: Asymptotic results for gradient algorithms (with probability 1). Compare the second-to-last column (time to optimize) with the last column (time to reach excess test error $\varepsilon$). Legend: $n$ number of examples; $d$ parameter dimension; $\kappa$, $\nu$ see equation (10).

  Algorithm   Cost of one iteration   Iterations to reach ρ    Time to reach accuracy ρ      Time to reach E ≤ c(E_app + ε)
  GD          O(nd)                   O(κ log(1/ρ))            O(ndκ log(1/ρ))               O((d²κ/ε^{1/α}) log²(1/ε))
  2GD         O(d² + nd)              O(log log(1/ρ))          O((d² + nd) log log(1/ρ))     O((d²/ε^{1/α}) log(1/ε) log log(1/ε))
  SGD         O(d)                    νκ²/ρ + o(1/ρ)           O(dνκ²/ρ)                     O(dνκ²/ε)
  2SGD        O(d²)                   ν/ρ + o(1/ρ)             O(d²ν/ρ)                      O(d²ν/ε)
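To make the four updates concrete, here is a minimal sketch (ours, not the paper's) on a plain least-squares cost, for which the Hessian at the optimum is available in closed form; the synthetic data, gains, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Squared loss: C(w) = (1/2n)||Xw - y||^2, so the Hessian is H = X'X / n.
H = X.T @ X / n
H_inv = np.linalg.inv(H)
lam = np.linalg.eigvalsh(H)     # eigenvalues in ascending order
eta_gd = 1.0 / lam[-1]          # eta = 1/lambda_max for GD
eta_sgd = 1.0 / lam[0]          # eta = 1/lambda_min for SGD (Murata's setting)

def full_grad(w):
    """Gradient of the empirical cost, averaged over all n examples."""
    return X.T @ (X @ w - y) / n

def one_grad(w, i):
    """Gradient of the loss on a single example i."""
    return (X[i] @ w - y[i]) * X[i]

w_gd = w_2gd = w_sgd = w_2sgd = np.zeros(d)
for t in range(1, 1001):
    w_gd   = w_gd   - eta_gd * full_grad(w_gd)                   # GD
    w_2gd  = w_2gd  - H_inv @ full_grad(w_2gd)                   # 2GD (one step suffices here)
    i = rng.integers(n)                                          # random example
    w_sgd  = w_sgd  - (eta_sgd / t) * one_grad(w_sgd, i)         # SGD, gain eta/t
    w_2sgd = w_2sgd - (1.0 / t) * (H_inv @ one_grad(w_2sgd, i))  # 2SGD, gain H^{-1}/t
```

On this quadratic cost a single 2GD step lands on the minimum, while the stochastic iterates approach it at the O(1/t) rate of (11); swapping in another twice-differentiable convex loss only changes how H is estimated.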
The fourth column bounds the time necessary to reduce the excess error $\mathcal{E}$ below $c\,(\mathcal{E}_{\mathrm{app}} + \varepsilon)$, where $c$ is the constant from (6). This is computed by observing that choosing $\rho \sim \big(\frac{d}{n}\log\frac{n}{d}\big)^{\alpha}$ in (6) achieves the fastest rate for $\varepsilon$, with minimal computation time. We can then use the asymptotic equivalences $\rho \sim \varepsilon$ and $n \sim \frac{d}{\varepsilon^{1/\alpha}}\log\frac{1}{\varepsilon}$. Setting the fourth-column expressions to $T_{\max}$ and solving for $\varepsilon$ yields the best excess error achieved by each algorithm within the limited time $T_{\max}$. This provides the asymptotic solution of the estimation–optimization tradeoff (3) for large-scale problems satisfying our assumptions. These results clearly show that the generalization performance of large-scale learning systems depends on both the statistical properties of the estimation procedure and the computational properties of the chosen optimization algorithm. Their combination leads to surprising consequences: • The SGD and 2SGD results do not depend on the estimation rate $\alpha$. When the estimation rate is poor, there is less need to optimize accurately. That leaves time to process more examples. A potentially more useful interpretation leverages the fact that (11) is already a kind of generalization bound: its fast rate trumps the slower rate assumed for the estimation error. • Second order algorithms bring little asymptotic improvement in $\varepsilon$. Although the superlinear 2GD algorithm improves the logarithmic term, all four algorithms are dominated by the polynomial term in $1/\varepsilon$. However, there are important variations in the influence of the constants $d$, $\kappa$ and $\nu$. These constants are very important in practice. • Stochastic algorithms (SGD, 2SGD) yield the best generalization performance despite being the worst optimization algorithms. This had been described before [18] and observed in experiments. In contrast, since the optimization error $\mathcal{E}_{\mathrm{opt}}$ of small-scale learning systems can be reduced to insignificant levels, their generalization performance is solely determined by the statistical properties of their estimation procedure. 4 Conclusion Taking into account budget constraints on both the number of examples and the computation time, we find qualitative differences between the generalization performance of small-scale learning systems and large-scale learning systems. The generalization properties of large-scale learning systems depend on both the statistical properties of the estimation procedure and the computational properties of the optimization algorithm. We illustrate this fact with some asymptotic results on gradient algorithms. Considerable refinements of this framework can be expected. Extending the analysis to regularized risk formulations would make results on the complexity of primal and dual optimization algorithms [19, 20] directly exploitable. The choice of surrogate loss function [7, 12] could also have a non-trivial impact in the large-scale case. Acknowledgments Part of this work was funded by NSF grant CCR-0325463. References [1] Leslie G. Valiant. A theory of the learnable. Proc. of the 1984 STOC, pages 436–445, 1984. [2] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer Series in Statistics. Springer-Verlag, Berlin, 1982. [3] Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005. [4] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probability Theory and Related Fields, 135(3):311–334, 2006. [5] J. Stephen Judd.
On the complexity of loading shallow neural networks. Journal of Complexity, 4(3):177–192, 1988. [6] Richard O. Duda and Peter E. Hart. Pattern Classification And Scene Analysis. Wiley and Son, 1973. [7] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004. [8] Clint Scovel and Ingo Steinwart. Fast rates for support vector machines. In Peter Auer and Ron Meir, editors, Proceedings of the 18th Conference on Learning Theory (COLT 2005), volume 3559 of Lecture Notes in Computer Science, pages 279–294, Bertinoro, Italy, June 2005. Springer-Verlag. [9] Vladimir N. Vapnik, Esther Levin, and Yann LeCun. Measuring the VC-dimension of a learning machine. Neural Computation, 6(5):851–876, 1994. [10] Olivier Bousquet. Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms. PhD thesis, Ecole Polytechnique, 2002. [11] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1), 2004. [12] Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification and risk bounds. Journal of the American Statistical Association, 101(473):138–156, March 2006. [13] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, series 6, 9(2):245–303, 2000. [14] Wee S. Lee, Peter L. Bartlett, and Robert C. Williamson. The importance of convexity in learning with squared loss. IEEE Transactions on Information Theory, 44(5):1974–1980, 1998. [15] Shahar Mendelson. A few notes on statistical learning theory. In Shahar Mendelson and Alexander J. Smola, editors, Advanced Lectures in Machine Learning, volume 2600 of Lecture Notes in Computer Science, pages 1–40. Springer-Verlag, Berlin, 2003. [16] John E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods For Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1983. [17] Noboru Murata. A statistical study of on-line learning. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998. [18] Léon Bottou and Yann Le Cun. Large scale online learning. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004. [19] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of KDD’06, Philadelphia, PA, USA, August 20–23 2006. ACM. [20] Don Hush, Patrick Kelly, Clint Scovel, and Ingo Steinwart. QP algorithms with guaranteed accuracy and run time for support vector machines. Journal of Machine Learning Research, 7:733–769, 2006.
3 0.4688915 170 nips-2007-Robust Regression with Twinned Gaussian Processes
Author: Andrew Naish-Guzman, Sean Holden
Abstract: We propose a Gaussian process (GP) framework for robust inference in which a GP prior on the mixing weights of a two-component noise model augments the standard process over latent function values. This approach is a generalization of the mixture likelihood used in traditional robust GP regression, and a specialization of the GP mixture models suggested by Tresp [1] and Rasmussen and Ghahramani [2]. The value of this restriction is in its tractable expectation propagation updates, which allow for faster inference and model selection, and better convergence than the standard mixture. An additional benefit over the latter method lies in our ability to incorporate knowledge of the noise domain to influence predictions, and to recover with the predictive distribution information about the outlier distribution via the gating process. The model has asymptotic complexity equal to that of conventional robust methods, but yields more confident predictions on benchmark problems than classical heavy-tailed models and exhibits improved stability for data with clustered corruptions, for which they fail altogether. We show further how our approach can be used without adjustment for more smoothly heteroscedastic data, and suggest how it could be extended to more general noise models. We also address similarities with the work of Goldberg et al. [3].
4 0.39408195 195 nips-2007-The Generalized FITC Approximation
Author: Andrew Naish-Guzman, Sean Holden
Abstract: We present an efficient generalization of the sparse pseudo-input Gaussian process (SPGP) model developed by Snelson and Ghahramani [1], applying it to binary classification problems. By taking advantage of the SPGP prior covariance structure, we derive a numerically stable algorithm with O(N M 2 ) training complexity—asymptotically the same as related sparse methods such as the informative vector machine [2], but which more faithfully represents the posterior. We present experimental results for several benchmark problems showing that in many cases this allows an exceptional degree of sparsity without compromising accuracy. Following [1], we locate pseudo-inputs by gradient ascent on the marginal likelihood, but exhibit occasions when this is likely to fail, for which we suggest alternative solutions.
5 0.390093 127 nips-2007-Measuring Neural Synchrony by Message Passing
Author: Justin Dauwels, François Vialatte, Tomasz Rutkowski, Andrzej S. Cichocki
Abstract: A novel approach to measure the interdependence of two time series is proposed, referred to as “stochastic event synchrony” (SES); it quantifies the alignment of two point processes by means of the following parameters: time delay, variance of the timing jitter, fraction of “spurious” events, and average similarity of events. SES may be applied to generic one-dimensional and multi-dimensional point processes; however, the paper mainly focuses on point processes in the time-frequency domain. The average event similarity is in that case described by two parameters: the average frequency offset between events in the time-frequency plane, and the variance of the frequency offset (“frequency jitter”); SES then consists of five parameters in total. Those parameters quantify the synchrony of oscillatory events, and hence, they provide an alternative to existing synchrony measures that quantify amplitude or phase synchrony. The pairwise alignment of point processes is cast as a statistical inference problem, which is solved by applying the max-product algorithm on a graphical model. The SES parameters are determined from the resulting pairwise alignment by maximum a posteriori (MAP) estimation. The proposed interdependence measure is applied to the problem of detecting anomalies in EEG synchrony of Mild Cognitive Impairment (MCI) patients; the results indicate that SES significantly improves the sensitivity of EEG in detecting MCI.
6 0.38106507 80 nips-2007-Ensemble Clustering using Semidefinite Programming
7 0.36931068 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
8 0.34794506 46 nips-2007-Cluster Stability for Finite Samples
9 0.34558356 61 nips-2007-Convex Clustering with Exemplar-Based Models
10 0.29901022 16 nips-2007-A learning framework for nearest neighbor search
11 0.2582514 122 nips-2007-Locality and low-dimensions in the prediction of natural experience from fMRI
12 0.2541289 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
13 0.23321112 70 nips-2007-Discriminative K-means for Clustering
14 0.21584395 159 nips-2007-Progressive mixture rules are deviation suboptimal
15 0.19545138 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
16 0.18849039 78 nips-2007-Efficient Principled Learning of Thin Junction Trees
17 0.18846713 101 nips-2007-How SVMs can estimate quantiles and the median
18 0.18824503 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
19 0.18777534 68 nips-2007-Discovering Weakly-Interacting Factors in a Complex Stochastic Process
20 0.18624811 167 nips-2007-Regulator Discovery from Gene Expression Time Series of Malaria Parasites: a Hierachical Approach
topicId topicWeight
[(5, 0.024), (13, 0.027), (16, 0.013), (21, 0.049), (31, 0.055), (34, 0.025), (35, 0.046), (41, 0.362), (47, 0.052), (49, 0.016), (83, 0.141), (85, 0.023), (87, 0.011), (90, 0.047)]
simIndex simValue paperId paperTitle
1 0.72401679 83 nips-2007-Evaluating Search Engines by Modeling the Relationship Between Relevance and Clicks
Author: Ben Carterette, Rosie Jones
Abstract: We propose a model that leverages the millions of clicks received by web search engines to predict document relevance. This allows the comparison of ranking functions when clicks are available but complete relevance judgments are not. After an initial training phase using a set of relevance judgments paired with click data, we show that our model can predict the relevance score of documents that have not been judged. These predictions can be used to evaluate the performance of a search engine, using our novel formalization of the confidence of the standard evaluation metric discounted cumulative gain (DCG), so comparisons can be made across time and datasets. This contrasts with previous methods which can provide only pair-wise relevance judgments between results shown for the same query. When no relevance judgments are available, we can identify the better of two ranked lists up to 82% of the time, and with only two relevance judgments for each query, we can identify the better ranking up to 94% of the time. While our experiments are on sponsored search results, which is the financial backbone of web search, our method is general enough to be applicable to algorithmic web search results as well. Furthermore, we give an algorithm to guide the selection of additional documents to judge to improve confidence. 1
same-paper 2 0.71620131 58 nips-2007-Consistent Minimization of Clustering Objective Functions
Author: Ulrike V. Luxburg, Stefanie Jegelka, Michael Kaufmann, Sébastien Bubeck
Abstract: Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal. As an alternative, we suggest the paradigm of “nearest neighbor clustering”. Instead of selecting the best out of all partitions of the sample, it only considers partitions in some restricted function class. Using tools from statistical learning theory we prove that nearest neighbor clustering is statistically consistent. Moreover, its worst case complexity is polynomial by construction, and it can be implemented with small average case complexity using branch and bound. 1
3 0.44410026 80 nips-2007-Ensemble Clustering using Semidefinite Programming
Author: Vikas Singh, Lopamudra Mukherjee, Jiming Peng, Jinhui Xu
Abstract: We consider the ensemble clustering problem where the task is to ‘aggregate’ multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a 2D string encoding rather than the voting-strategy-based methods proposed in the literature. Using this generalization, we first derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict 0-1 Semidefinite Program (SDP) via novel convexification techniques, which can subsequently be relaxed to a polynomial-time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. We discuss evaluations on clustering and image segmentation databases. 1
4 0.43509188 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
Author: Kai Yu, Wei Chu
Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1
5 0.43456873 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
6 0.43315366 16 nips-2007-A learning framework for nearest neighbor search
7 0.43226179 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
8 0.4316029 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
9 0.43133739 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
10 0.43082875 190 nips-2007-Support Vector Machine Classification with Indefinite Kernels
11 0.43071216 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
12 0.42955005 144 nips-2007-On Ranking in Survival Analysis: Bounds on the Concordance Index
13 0.42910263 134 nips-2007-Multi-Task Learning via Conic Programming
14 0.42894927 175 nips-2007-Semi-Supervised Multitask Learning
15 0.42882395 116 nips-2007-Learning the structure of manifolds using random projections
16 0.42794082 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
17 0.42757401 49 nips-2007-Colored Maximum Variance Unfolding
18 0.42710105 141 nips-2007-New Outer Bounds on the Marginal Polytope
19 0.42706579 10 nips-2007-A Randomized Algorithm for Large Scale Support Vector Learning
20 0.42696506 156 nips-2007-Predictive Matrix-Variate t Models