nips nips2012 nips2012-291 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniil Ryabko, Jeremie Mary
Abstract: We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. [sent-8, score-0.332]
2 The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. [sent-9, score-0.151]
3 The series can exhibit arbitrary long-range dependence, and different time-series samples may be interdependent as well. [sent-22, score-0.112]
4 Moreover, the learning problems that we consider — the three-sample problem, time-series clustering, and homogeneity testing — at first glance seem completely unrelated to classification. [sent-23, score-0.358]
5 The results include asymptotically consistent algorithms, as well as finite-sample analysis. [sent-25, score-0.154]
6 To establish the consistency of the suggested methods for clustering and the three-sample problem, the only assumption that we make on the data is that the distributions generating the samples are stationary ergodic; this is one of the weakest assumptions used in statistics. [sent-26, score-0.617]
7 For homogeneity testing we have to make some mixing assumptions in order to obtain consistency results (this is indeed unavoidable [22]). [sent-27, score-0.453]
8 The proposed approach is based on a new distance between time-series distributions (that is, between probability distributions on the space of infinite sequences), which we call the telescope distance. [sent-29, score-0.62]
9 This distance can be evaluated using binary classification methods, and its finite-sample estimates are shown to be asymptotically consistent. [sent-30, score-0.263]
10 The first one is a distance on finite-dimensional marginal distributions. [sent-32, score-0.116]
11 The distance we use for this is the following: dH(P, Q) := sup_{h∈H} |EP h − EQ h|, where P, Q are distributions and H is a set of functions. [sent-33, score-0.211]
12 This distance can be estimated using binary classification methods, and thus can be used to reduce various statistical problems to the classification problem. [sent-34, score-0.205]
13 This distance was previously applied to such statistical problems as homogeneity testing and change-point estimation [14]. [sent-35, score-0.448]
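As an illustration of the reduction, here is a minimal sketch (our own, not code from the paper) of how dH between two samples could be estimated with an off-the-shelf classifier: label one sample as positive, the other as negative, fit a linear SVM as an approximate empirical-risk minimizer over half-spaces, and compare the fractions of each sample that the learned rule labels positive. The function and parameter names are ours.

```python
import numpy as np
from sklearn.svm import LinearSVC

def estimate_dh(sample_p, sample_q):
    """Rough estimate of d_H(P, Q) = sup_{h in H} |E_P h - E_Q h|,
    with H taken to be half-spaces found by a linear SVM.
    sample_p, sample_q: arrays of shape (n, d) and (m, d)."""
    data = np.vstack([sample_p, sample_q])
    labels = np.concatenate([np.ones(len(sample_p)), np.zeros(len(sample_q))])
    h = LinearSVC(C=1.0).fit(data, labels)      # surrogate for empirical-risk minimization over H
    return abs(h.predict(sample_p).mean() - h.predict(sample_q).mean())
```

Because the SVM minimizes a hinge-loss surrogate rather than the exact empirical risk, this only approximates the supremum over H.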
14 Thus, the second building block consists of the recent results of [1, 2], which show that empirical estimates of dH are consistent (under certain conditions on H) for arbitrary stationary ergodic distributions. [sent-40, score-0.601]
15 This, however, is not enough: evaluating dH for (stationary ergodic) time-series distributions means measuring the distance between their finite-dimensional marginals, and not between the distributions themselves. [sent-41, score-0.306]
16 Finally, the third step to construct the distance is what we call telescoping. [sent-42, score-0.142]
17 We show that the resulting distance (telescope distance) indeed can be consistently estimated based on sampling, for arbitrary stationary ergodic distributions. [sent-44, score-0.619]
18 Further, we show how this fact can be used to construct consistent algorithms for the considered problems on time series. [sent-45, score-0.122]
19 A related approach to address the problems considered here, as well as some related problems about stationary ergodic time series, is based on (consistent) empirical estimates of the distributional distance; see [23, 21, 13] and [8] about the distributional distance. [sent-50, score-0.66]
20 The empirical distance is based on counting frequencies of bins of decreasing sizes and “telescoping”. [sent-51, score-0.142]
21 In Section 3 we introduce and discuss the telescope distance. [sent-58, score-0.314]
22 Section 4 explains how this distance can be calculated using binary classification methods. [sent-59, score-0.164]
23 In Section 7, under some mixing conditions, we address the problems of homogeneity testing, clustering with unknown k, and finite-sample performance guarantees. [sent-61, score-0.531]
24 Time-series (or process) distributions are probability measures on the space (X^N, F_N) of one-way infinite sequences (where F_N is the Borel sigma-algebra of X^N). [sent-65, score-0.147]
25 A stationary distribution ρ is called (stationary) ergodic if, for every k ∈ N and every A ∈ F_{X^k}, lim_{n→∞} 1/n Σ_{i=1}^{n} I{X_{i..i+k−1} ∈ A} = ρ(A) with probability 1. [sent-78, score-0.476]
26 3 A distance between time-series distributions. We start with a distance between distributions on X, and then we will extend it to distributions on X^∞. [sent-88, score-0.517]
27 For two probability distributions P and Q on (X, F) and a set H of measurable functions on X, one can define the distance dH(P, Q) := sup_{h∈H} |EP h − EQ h|. [sent-89, score-0.256]
28 Special cases of this distance are the Kolmogorov-Smirnov [15], Kantorovich-Rubinstein [11] and Fortet-Mourier [7] metrics; the general case has been studied since at least [26]. [sent-90, score-0.116]
29 It is also easy to check that it holds if H is the set of halfspaces in the feature space of most commonly used kernels (provided the feature space is of the same or higher dimension than the input space), such as polynomial and Gaussian kernels. [sent-98, score-0.123]
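If one wants H to be half-spaces in the feature space of a Gaussian kernel, as in the remark above, the same sketch goes through with a kernel SVM; this variant is our illustration, and the bandwidth gamma is an arbitrary choice rather than a value from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def estimate_dh_kernel(sample_p, sample_q, gamma=1.0):
    """Variant of the dH estimate with H = half-spaces in the RBF feature space."""
    data = np.vstack([sample_p, sample_q])
    labels = np.concatenate([np.ones(len(sample_p)), np.zeros(len(sample_q))])
    h = SVC(kernel="rbf", gamma=gamma).fit(data, labels)   # kernel classifier = half-space in feature space
    return abs(h.predict(sample_p).mean() - h.predict(sample_q).mean())
```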
30 Based on dH we can construct a distance between time-series probability distributions. [sent-99, score-0.142]
31 For two time-series distributions ρ1 , ρ2 we take the dH between k-dimensional marginal distributions of ρ1 and ρ2 for each k ∈ N, and sum them all up with decreasing weights. [sent-100, score-0.19]
32 For two time series distributions ρ1 and ρ2 on the space (X^∞, F^∞) and a sequence of sets of functions H = (H1, H2, ...), [sent-102, score-0.178]
33 define the telescope distance DH(ρ1, ρ2) := Σ_{k=1}^{∞} wk sup_{h∈Hk} |Eρ1 h(X1, ..., Xk) − Eρ2 h(Y1, ..., Yk)|. [sent-105, score-0.518]
34 The statement follows from the fact that two process distributions are the same if and only if all their finite-dimensional marginals coincide. [sent-117, score-0.182]
35 The empirical telescope distance (2) is D̂H(X_{1..n}, Y_{1..m}) := Σ_{k=1}^{min{n,m}} wk sup_{h∈Hk} |1/(n−k+1) Σ_{i=1}^{n−k+1} h(X_{i..i+k−1}) − 1/(m−k+1) Σ_{i=1}^{m−k+1} h(Y_{i..i+k−1})|. All the methods presented in this work are based on the empirical telescope distance. [sent-132, score-0.26]
36 The key fact is that it is an asymptotically consistent estimate of the telescope distance, that is, the latter can be consistently estimated based on sampling. [sent-133, score-0.521]
37 Let H := (H1, H2, ...), Hk, k ∈ N, be a sequence of separable sets of indicator functions on X^k of finite VC dimension such that Hk generates F_{X^k}. [sent-138, score-0.179]
38 Then, for all stationary ergodic time-series distributions ρX and ρY generating samples X_{1..n} and Y_{1..m}, the estimate D̂H(X_{1..n}, Y_{1..m}) converges to DH(ρX, ρY) with probability 1 as n, m → ∞. [sent-139, score-0.71]
39 The condition that the sets Hk are sets of indicator functions of finite VC dimension comes from [2], where it is shown that for any stationary ergodic distribution ρ, under these conditions, [sent-153, score-0.624]
40 sup_{h∈Hk} 1/(n−k+1) Σ_{i=1}^{n−k+1} h(X_{i..i+k−1}) is an asymptotically consistent estimate of Eρ h(X1, ..., Xk). [sent-155, score-0.154]
41 The definition (2) of D̂H involves calculating l summands (where l := min{n, m}), that is, sup_{h∈Hk} |1/(n−k+1) Σ_{i=1}^{n−k+1} h(X_{i..i+k−1}) − 1/(m−k+1) Σ_{i=1}^{m−k+1} h(Y_{i..i+k−1})| for each k = 1..l. [sent-163, score-0.14]
42 Assuming that h ∈ Hk are indicator functions, calculating each of the summands amounts to solving the following k-dimensional binary classification problem. [sent-170, score-0.19]
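Putting the pieces together, the following sketch (ours, under the assumption that a linear SVM is an acceptable stand-in for empirical-risk minimization over Hk) computes an empirical telescope distance between two one-dimensional series: for each k it separates k-length windows of the two series with a classifier and compares the fractions labelled positive. The weights wk = 2^{-k} and the cut-off k_max are illustrative choices, not taken from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def windows(series, k):
    """All consecutive length-k windows of a 1-D series, shape (len(series)-k+1, k)."""
    return np.lib.stride_tricks.sliding_window_view(np.asarray(series, dtype=float), k)

def telescope_distance(x, y, k_max=5, weight=lambda k: 2.0 ** -k):
    """Sketch of the empirical telescope distance between two series x and y."""
    total = 0.0
    for k in range(1, min(k_max, len(x), len(y)) + 1):
        wx, wy = windows(x, k), windows(y, k)
        data = np.vstack([wx, wy])
        labels = np.concatenate([np.ones(len(wx)), np.zeros(len(wy))])
        h = LinearSVC(C=1.0).fit(data, labels)   # approximate the sup over Hk via classification
        total += weight(k) * abs(h.predict(wx).mean() - h.predict(wy).mean())
    return total
```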
43 Thus, as long as we have a way to find h ∈ Hk that minimizes empirical risk, we have a consistent estimate of DH (ρX , ρY ), under the mild conditions on H required by Theorem 1. [sent-186, score-0.125]
44 In contrast, all we need to achieve asymptotically zero error in estimating D̂ (and therefore, in the learning problems considered below) is that the sets Hk asymptotically generate F_{X^k} and have a finite VC dimension (for each k). [sent-190, score-0.337]
45 Finally, we remark that while in (2) the number of summands is l, it can be replaced with any γl such that γl → ∞, without affecting any asymptotic consistency results. [sent-196, score-0.135]
46 Both distributions are assumed to be stationary ergodic, but no further assumptions are made about them (no independence, mixing or memory assumptions). [sent-212, score-0.396]
47 The three-sample problem for dependent time series has been addressed in [9] for Markov processes and in [23] for stationary ergodic time series. [sent-213, score-0.532]
48 Indeed, to solve this problem it suffices to have consistent estimates of some distance between time series distributions. [sent-215, score-0.227]
49 Let samples X = (X1, ..., Xn), Y = (Y1, ..., Ym) and Z = (Z1, ..., Zl) be generated by stationary ergodic distributions ρX, ρY and ρZ, with ρX ≠ ρY and either (i) ρZ = ρX or (ii) ρZ = ρY. [sent-228, score-0.598]
50 Assume that the sets Hk, k ∈ N, are separable sets of indicator functions on X^k of finite VC dimension such that Hk generates F_{X^k}. [sent-229, score-0.206]
51 It is straightforward to extend this theorem to more than two classes; in other words, instead of X and Y one can have an arbitrary number of samples from different stationary ergodic distributions. [sent-231, score-0.57]
52 6 Clustering time series. We are given N samples X^1 = (X^1_1, ..., X^1_{n_1}), ..., X^N = (X^N_1, ..., X^N_{n_N}), [sent-232, score-0.112]
53 generated by k different stationary ergodic time-series distributions ρ1, ..., ρk. [sent-241, score-0.598]
54 Thus, define target clustering as the partitioning in which those and only those samples that were generated by the same distribution are placed in the same cluster. [sent-250, score-0.283]
55 A clustering algorithm is called asymptotically consistent if with probability 1 there is an n such that the algorithm produces the target clustering whenever max_{i=1..N} n_i ≥ n. [sent-251, score-0.51]
56 Again, to solve this problem it is enough to have a metric between time-series distributions that can be consistently estimated. [sent-254, score-0.158]
57 Our approach here is based on the telescope distance, and thus we use D̂. [sent-255, score-0.314]
58 The clustering problem is relatively simple if the target clustering has what is called the strict separation property [4]: every two points in the same target cluster are closer to each other than to any point from a different target cluster. [sent-256, score-0.558]
59 Assume that the sets Hk, k ∈ N, are separable sets of indicator functions on X^k of finite VC dimension, such that Hk generates F_{X^k}. [sent-259, score-0.159]
60 If the samples X^1, ..., X^N are stationary ergodic, then with probability 1, from some n := max_{i=1..N} n_i on, [sent-272, score-0.183]
61 the target clustering has the strict separation property with respect to D̂H. [sent-274, score-0.337]
62 With the strict separation property at hand, it is easy to find asymptotically consistent algorithms. [sent-275, score-0.27]
63 We will give some simple examples, but the theorem below can be extended to many other distance-based clustering algorithms. [sent-276, score-0.194]
64 The distance between clusters is defined as the average distance between points in these clusters. [sent-278, score-0.291]
65 Under the conditions of Theorem 3, average linkage and farthest point clusterings are asymptotically consistent. [sent-293, score-0.249]
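For concreteness, a possible average-linkage implementation on top of pairwise distance estimates is sketched below; the names are ours, and dist can be any consistent estimate of the telescope distance (e.g. the telescope_distance sketch above).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_series(samples, k, dist):
    """Average-linkage clustering of time-series samples into k clusters,
    with dist(a, b) supplying the pairwise distance estimates."""
    n = len(samples)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = dist(samples[i], samples[j])
    z = linkage(squareform(d), method="average")    # average linkage on the condensed matrix
    return fcluster(z, t=k, criterion="maxclust")   # cluster labels 1..k
```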
66 Note that we do not require the samples to be independent; the joint distributions of the samples may be completely arbitrary, as long as the marginal distribution of each sample is stationary ergodic. [sent-294, score-0.39]
67 Speed of convergence. The results established so far are asymptotic out of necessity: they are established under the assumption that the distributions involved are stationary ergodic, which is too general to allow for any meaningful finite-time performance guarantees. [sent-296, score-0.3]
68 Moreover, some statistical problems, such as homogeneity testing or clustering when the number of clusters is unknown, are provably impossible to solve under this assumption [22]. [sent-297, score-0.506]
69 Moreover, since it is usually not known in advance whether the data at hand satisfies given assumptions or not, it appears important to have methods that have both asymptotic consistency in the general setting and finite-time performance guarantees under stronger assumptions. [sent-299, score-0.129]
70 A stationary distribution on the space of one-way infinite sequences (X^N, F_N) can be uniquely extended to a stationary distribution on the space of two-way infinite sequences (X^Z, F_Z) of the form ..., X−1, X0, X1, .... [sent-301, score-0.47]
71 For a process distribution ρ define the mixing coefficients β(ρ, k) := sup_{A∈σ(X_{−∞..0}), B∈σ(X_{k..∞})} |ρ(A ∩ B) − ρ(A)ρ(B)|. [sent-309, score-0.129]
72 Suppose a sample X_{1..n} is generated by a distribution ρ that is uniformly β-mixing with coefficients β(ρ, k). Assume further that Hk is a set of indicator functions with a finite VC dimension dk, for each k ∈ N. [sent-323, score-0.18]
73 Let the samples X_{1..n} and Y_{1..m} be generated by stationary distributions ρX and ρY with suitably decaying β-mixing coefficients β(ρX, k) and β(ρY, k). [sent-342, score-0.305]
74 Let Hk , k ∈ N be some sets of indicator functions on X k whose VC dimension dk is finite and non-decreasing with k. [sent-344, score-0.18]
75 Given samples X_{1..n} and Y_{1..m} generated by distributions ρX and ρY respectively, the problem of homogeneity testing (or the two-sample problem) consists in deciding whether ρX = ρY. [sent-355, score-0.439]
76 In general, for stationary ergodic time series distributions, there is no asymptotically consistent test for homogeneity [22], so stronger assumptions are in order. [sent-357, score-0.975]
77 Our contribution to this line of research is to show that this problem can be reduced (via the telescope distance) to binary classification, in the case of strongly dependent processes satisfying some mixing conditions. [sent-364, score-0.446]
78 It is easy to see that under the mixing conditions of Lemma 1 a consistent test for homogeneity exists, and finite-sample performance guarantees can be obtained. [sent-365, score-0.436]
79 Under the conditions of Lemma 2 the probability of Type I error (the distributions are the same but the test says they are different) of the described test is upper-bounded by 4∆(ε/8, n). [sent-378, score-0.163]
80 The probability of Type II error (the distributions are different but the test says they are the same) is upper-bounded by 4∆(δ − ε/8, n), where δ := (1/2) DH(ρX, ρY). [sent-379, score-0.119]
81 The optimal choice of εn may depend on the speed at which dk (the VC dimension of Hk) increases; however, for most natural cases (recall that Hk are also parameters of the algorithm) this growth is polynomial, so the main term to control is e^{−√n ε²/8}. [sent-380, score-0.16]
82 For example, if Hk is the set of halfspaces in X^k = R^k then dk = k + 1 and one can choose εn := n^{−1/8}. [sent-381, score-0.137]
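A sketch of the resulting homogeneity test under the assumptions above is given below; the threshold rate n^{-1/8} follows the half-space example in the text, distance stands for any consistent estimate of DH, and the names are ours.

```python
def homogeneity_test(x, y, distance, eps_exponent=-1.0 / 8):
    """Declare rho_X = rho_Y iff the estimated distance falls below a
    threshold eps_n that shrinks with the (smaller) sample length."""
    n = min(len(x), len(y))
    eps_n = n ** eps_exponent
    return distance(x, y) <= eps_n     # True: accept homogeneity
```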
83 3 Clustering with a known or unknown number of clusters. If the distributions generating the samples satisfy certain mixing conditions, then we can augment Theorems 3 and 4 with finite-sample performance guarantees. [sent-384, score-0.345]
84 Then with probability at least 1 − N (N − 1)∆(δ/4, n)/2 the target clustering of the samples has the strict separation property. [sent-404, score-0.345]
85 In this case single linkage and farthest point algorithms output the target clustering. [sent-405, score-0.15]
86 Note that a sufficient condition for the strict separation property to hold is that for every one out of N(N − 1)/2 pairs of samples the estimate D̂H(X^i, X^j), i, j = 1..N, [sent-407, score-0.145]
87 is within δ/4 of the DH distance between the corresponding distributions. [sent-409, score-0.116]
88 As with homogeneity testing, while in the general case of stationary ergodic distributions it is impossible to have a consistent clustering algorithm when the number of clusters k is unknown, the situation changes if the distributions satisfy certain mixing conditions. [sent-412, score-1.246]
89 In this case a consistent clustering algorithm can be obtained as follows. [sent-413, score-0.211]
90 Assign to the same cluster all samples that are at most εn -far from each other, where the threshold εn is selected the same way as for homogeneity testing: εn → 0 and ∆(εn , n) → 0. [sent-414, score-0.307]
91 The optimal choice of this parameter depends on the choice of Hk through the speed of growth of the VC dimension dk of these sets. [sent-415, score-0.16]
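One way to realize this rule (a sketch with hypothetical names; distance is any consistent estimate of DH and eps_n the threshold chosen as above) is to link all samples whose pairwise distance estimates are at most eps_n and output the connected components:

```python
def cluster_unknown_k(samples, distance, eps_n):
    """Clustering with unknown number of clusters: samples within eps_n of each
    other (transitively) end up in the same cluster, via union-find."""
    n = len(samples)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distance(samples[i], samples[j]) <= eps_n:
                parent[find(i)] = find(j)   # merge the clusters of i and j
    return [find(i) for i in range(n)]
```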
92 Given N samples generated by k different stationary distributions ρi , i = 1. [sent-417, score-0.361]
93 Average-linkage clustering is used, with the telescope distance between samples calculated using an SVM, as described in Section 4. [sent-427, score-0.642]
94 1 Synthetic data. For the artificial setting we have chosen highly dependent time-series distributions which have the same single-dimensional marginals and which cannot be well approximated by finite- or countable-state models. [sent-430, score-0.185]
95 If α is irrational, then the distribution ρ(α) is stationary ergodic, but does not belong to any simpler natural distribution family [25]. [sent-442, score-0.183]
96 One clustering experiment on sequences of length 1000 takes about 5 min. [sent-455, score-0.208]
97 Originally, each time series consisted of several consecutive sequences of different classes, and the problem was supervised: three time series for training and one for testing. [sent-460, score-0.164]
98 We split each of the original time series into classes, and then used our clustering algorithm in a completely unsupervised setting. [sent-461, score-0.234]
99 The latter is of particular interest since the classification method we used in the telescope distance is also an SVM, but our setting is unsupervised (clustering). [sent-475, score-0.478]
100 Figure 1: Error of two-class clustering using 4 methods (rows) for 3 subjects (columns). [sent-477, score-0.156]
wordName wordTfidf (topN-words)
[('dh', 0.578), ('telescope', 0.314), ('hk', 0.311), ('ergodic', 0.293), ('homogeneity', 0.226), ('stationary', 0.183), ('clustering', 0.156), ('vc', 0.119), ('fx', 0.116), ('distance', 0.116), ('asymptotically', 0.099), ('distributions', 0.095), ('tssvm', 0.09), ('mixing', 0.084), ('classi', 0.07), ('summands', 0.069), ('suph', 0.067), ('xnn', 0.067), ('testing', 0.065), ('bci', 0.065), ('dk', 0.059), ('farthest', 0.059), ('clusters', 0.059), ('samples', 0.056), ('series', 0.056), ('consistent', 0.055), ('ryabko', 0.055), ('statement', 0.053), ('sequences', 0.052), ('halfspaces', 0.049), ('ni', 0.048), ('binary', 0.048), ('tn', 0.048), ('dimension', 0.047), ('indicator', 0.047), ('linkage', 0.047), ('sup', 0.045), ('strict', 0.045), ('kcpa', 0.045), ('consistency', 0.044), ('separation', 0.044), ('conditions', 0.044), ('target', 0.044), ('wk', 0.043), ('problems', 0.041), ('foot', 0.04), ('dtw', 0.04), ('svm', 0.038), ('distributional', 0.038), ('theorem', 0.038), ('cation', 0.037), ('daniil', 0.037), ('metric', 0.036), ('marginals', 0.034), ('assumptions', 0.034), ('lemma', 0.033), ('zl', 0.033), ('mary', 0.033), ('fn', 0.033), ('qn', 0.03), ('generates', 0.029), ('balcan', 0.029), ('lille', 0.029), ('separable', 0.029), ('stronger', 0.029), ('chose', 0.029), ('ri', 0.029), ('speed', 0.028), ('generated', 0.027), ('assign', 0.027), ('consistently', 0.027), ('sets', 0.027), ('easy', 0.027), ('deferred', 0.027), ('generating', 0.027), ('calculating', 0.026), ('construct', 0.026), ('unrelated', 0.026), ('deciding', 0.026), ('growth', 0.026), ('latter', 0.026), ('empirical', 0.026), ('nite', 0.025), ('cluster', 0.025), ('adams', 0.025), ('ym', 0.025), ('risk', 0.024), ('error', 0.024), ('unknown', 0.024), ('classes', 0.024), ('libsvm', 0.024), ('interface', 0.024), ('ln', 0.024), ('langford', 0.023), ('cients', 0.023), ('establish', 0.022), ('asymptotic', 0.022), ('unsupervised', 0.022), ('concern', 0.022), ('mini', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 291 nips-2012-Reducing statistical time-series problems to binary classification
Author: Daniil Ryabko, Jeremie Mary
Abstract: We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. 1
2 0.23506445 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points
Author: Azadeh Khaleghi, Daniil Ryabko
Abstract: The problem of multiple change point estimation is considered for sequences with unknown number of change points. A consistency framework is suggested that is suitable for highly dependent time-series, and an asymptotically consistent algorithm is proposed. In order for the consistency to be established the only assumption required is that the data is generated by stationary ergodic time-series distributions. No modeling, independence or parametric assumptions are made; the data are allowed to be dependent and the dependence can be of arbitrary form. The theoretical results are complemented with experimental evaluations. 1
3 0.12874649 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
Author: Koby Crammer, Yishay Mansour
Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1
4 0.11296826 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
Author: Nan Li, Longin J. Latecki
Abstract: We formulate clustering aggregation as a special instance of Maximum-Weight Independent Set (MWIS) problem. For a given dataset, an attributed graph is constructed from the union of the input clusterings generated by different underlying clustering algorithms with different parameters. The vertices, which represent the distinct clusters, are weighted by an internal index measuring both cohesion and separation. The edges connect the vertices whose corresponding clusters overlap. Intuitively, an optimal aggregated clustering can be obtained by selecting an optimal subset of non-overlapping clusters partitioning the dataset together. We formalize this intuition as the MWIS problem on the attributed graph, i.e., finding the heaviest subset of mutually non-adjacent vertices. This MWIS problem exhibits a special structure. Since the clusters of each input clustering form a partition of the dataset, the vertices corresponding to each clustering form a maximal independent set (MIS) in the attributed graph. We propose a variant of simulated annealing method that takes advantage of this special structure. Our algorithm starts from each MIS, which is close to a distinct local optimum of the MWIS problem, and utilizes a local search heuristic to explore its neighborhood in order to find the MWIS. Extensive experiments on many challenging datasets show that: 1. our approach to clustering aggregation automatically decides the optimal number of clusters; 2. it does not require any parameter tuning for the underlying clustering algorithms; 3. it can combine the advantages of different underlying clustering algorithms to achieve superior performance; 4. it is robust against moderate or even bad input clusterings. 1
5 0.10061247 264 nips-2012-Optimal kernel choice for large-scale two-sample tests
Author: Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, Bharath K. Sriperumbudur
Abstract: Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
6 0.099658892 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation
7 0.094015136 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
8 0.092780374 282 nips-2012-Proximal Newton-type methods for convex optimization
9 0.092341565 69 nips-2012-Clustering Sparse Graphs
10 0.091481917 307 nips-2012-Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning
11 0.076837748 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
12 0.075275347 218 nips-2012-Mixing Properties of Conditional Markov Chains with Unbounded Feature Functions
13 0.071380131 106 nips-2012-Dynamical And-Or Graph Learning for Object Shape Modeling and Detection
14 0.071009696 165 nips-2012-Iterative ranking from pair-wise comparisons
15 0.070783369 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
16 0.070110574 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
17 0.069098145 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
18 0.067543425 297 nips-2012-Robustness and risk-sensitivity in Markov decision processes
19 0.066915572 254 nips-2012-On the Sample Complexity of Robust PCA
20 0.066555358 126 nips-2012-FastEx: Hash Clustering with Exponential Families
topicId topicWeight
[(0, 0.185), (1, 0.033), (2, 0.048), (3, -0.053), (4, 0.041), (5, -0.031), (6, 0.002), (7, 0.07), (8, 0.041), (9, 0.029), (10, 0.044), (11, -0.092), (12, 0.003), (13, -0.073), (14, 0.025), (15, -0.144), (16, -0.083), (17, 0.088), (18, -0.032), (19, -0.004), (20, -0.136), (21, -0.09), (22, -0.04), (23, 0.068), (24, 0.037), (25, 0.009), (26, 0.072), (27, -0.024), (28, 0.001), (29, -0.066), (30, 0.129), (31, 0.072), (32, -0.049), (33, -0.105), (34, 0.025), (35, -0.018), (36, -0.145), (37, 0.092), (38, 0.013), (39, -0.002), (40, 0.025), (41, -0.019), (42, -0.012), (43, -0.136), (44, 0.001), (45, -0.032), (46, -0.103), (47, 0.003), (48, -0.191), (49, -0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.93193686 291 nips-2012-Reducing statistical time-series problems to binary classification
Author: Daniil Ryabko, Jeremie Mary
Abstract: We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. 1
2 0.74474365 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points
Author: Azadeh Khaleghi, Daniil Ryabko
Abstract: The problem of multiple change point estimation is considered for sequences with unknown number of change points. A consistency framework is suggested that is suitable for highly dependent time-series, and an asymptotically consistent algorithm is proposed. In order for the consistency to be established the only assumption required is that the data is generated by stationary ergodic time-series distributions. No modeling, independence or parametric assumptions are made; the data are allowed to be dependent and the dependence can be of arbitrary form. The theoretical results are complemented with experimental evaluations. 1
3 0.63905489 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
Author: Koby Crammer, Yishay Mansour
Abstract: In this work we consider a setting where we have a very large number of related tasks with few examples from each individual task. Rather than either learning each task individually (and having a large generalization error) or learning all the tasks together using a single hypothesis (and suffering a potentially large inherent error), we consider learning a small pool of shared hypotheses. Each task is then mapped to a single hypothesis in the pool (hard association). We derive VC dimension generalization bounds for our model, based on the number of tasks, shared hypothesis and the VC dimension of the hypotheses class. We conducted experiments with both synthetic problems and sentiment of reviews, which strongly support our approach. 1
4 0.5833773 85 nips-2012-Convergence and Energy Landscape for Cheeger Cut Clustering
Author: Xavier Bresson, Thomas Laurent, David Uminsky, James V. Brecht
Abstract: This paper provides both theoretical and algorithmic results for the ℓ1-relaxation of the Cheeger cut problem. The ℓ2-relaxation, known as spectral clustering, only loosely relates to the Cheeger cut; however, it is convex and leads to a simple optimization problem. The ℓ1-relaxation, in contrast, is non-convex but is provably equivalent to the original problem. The ℓ1-relaxation therefore trades convexity for exactness, yielding improved clustering results at the cost of a more challenging optimization. The first challenge is understanding convergence of algorithms. This paper provides the first complete proof of convergence for algorithms that minimize the ℓ1-relaxation. The second challenge entails comprehending the ℓ1 energy landscape, i.e. the set of possible points to which an algorithm might converge. We show that ℓ1-algorithms can get trapped in local minima that are not globally optimal and we provide a classification theorem to interpret these local minima. This classification gives meaning to these suboptimal solutions and helps to explain, in terms of graph structure, when the ℓ1-relaxation provides the solution of the original Cheeger cut problem. 1
5 0.55712241 175 nips-2012-Learning High-Density Regions for a Generalized Kolmogorov-Smirnov Test in High-Dimensional Data
Author: Assaf Glazer, Michael Lindenbaum, Shaul Markovitch
Abstract: We propose an efficient, generalized, nonparametric, statistical KolmogorovSmirnov test for detecting distributional change in high-dimensional data. To implement the test, we introduce a novel, hierarchical, minimum-volume sets estimator to represent the distributions to be tested. Our work is motivated by the need to detect changes in data streams, and the test is especially efficient in this context. We provide the theoretical foundations of our test and show its superiority over existing methods. 1
6 0.53870565 133 nips-2012-Finding Exemplars from Pairwise Dissimilarities via Simultaneous Sparse Recovery
7 0.5381422 226 nips-2012-Multiclass Learning Approaches: A Theoretical Comparison with Implications
8 0.5286988 68 nips-2012-Clustering Aggregation as Maximum-Weight Independent Set
9 0.49962217 7 nips-2012-A Divide-and-Conquer Method for Sparse Inverse Covariance Estimation
10 0.49465114 99 nips-2012-Dip-means: an incremental clustering method for estimating the number of clusters
11 0.49418104 196 nips-2012-Learning with Partially Absorbing Random Walks
12 0.47941083 271 nips-2012-Pointwise Tracking the Optimal Regression Function
13 0.47645399 299 nips-2012-Scalable imputation of genetic data with a discrete fragmentation-coagulation process
14 0.47490981 70 nips-2012-Clustering by Nonnegative Matrix Factorization Using Graph Random Walk
15 0.46077418 98 nips-2012-Dimensionality Dependent PAC-Bayes Margin Bound
16 0.45910195 155 nips-2012-Human memory search as a random walk in a semantic network
17 0.44318458 69 nips-2012-Clustering Sparse Graphs
18 0.43618888 338 nips-2012-The Perturbed Variation
19 0.42964965 361 nips-2012-Volume Regularization for Binary Classification
20 0.41614747 135 nips-2012-Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach
topicId topicWeight
[(0, 0.046), (21, 0.032), (34, 0.026), (35, 0.193), (36, 0.015), (38, 0.155), (39, 0.027), (42, 0.029), (54, 0.021), (55, 0.029), (74, 0.035), (76, 0.173), (80, 0.085), (92, 0.042)]
simIndex simValue paperId paperTitle
1 0.87461758 213 nips-2012-Minimization of Continuous Bethe Approximations: A Positive Variation
Author: Jason Pacheco, Erik B. Sudderth
Abstract: We develop convergent minimization algorithms for Bethe variational approximations which explicitly constrain marginal estimates to families of valid distributions. While existing message passing algorithms define fixed point iterations corresponding to stationary points of the Bethe free energy, their greedy dynamics do not distinguish between local minima and maxima, and can fail to converge. For continuous estimation problems, this instability is linked to the creation of invalid marginal estimates, such as Gaussians with negative variance. Conversely, our approach leverages multiplier methods with well-understood convergence properties, and uses bound projection methods to ensure that marginal approximations are valid at all iterations. We derive general algorithms for discrete and Gaussian pairwise Markov random fields, showing improvements over standard loopy belief propagation. We also apply our method to a hybrid model with both discrete and continuous variables, showing improvements over expectation propagation. 1
2 0.85717964 354 nips-2012-Truly Nonparametric Online Variational Inference for Hierarchical Dirichlet Processes
Author: Michael Bryant, Erik B. Sudderth
Abstract: Variational methods provide a computationally scalable alternative to Monte Carlo methods for large-scale, Bayesian nonparametric learning. In practice, however, conventional batch and online variational methods quickly become trapped in local optima. In this paper, we consider a nonparametric topic model based on the hierarchical Dirichlet process (HDP), and develop a novel online variational inference algorithm based on split-merge topic updates. We derive a simpler and faster variational approximation of the HDP, and show that by intelligently splitting and merging components of the variational posterior, we can achieve substantially better predictions of test data than conventional online and batch variational algorithms. For streaming analysis of large datasets where batch analysis is infeasible, we show that our split-merge updates better capture the nonparametric properties of the underlying model, allowing continual learning of new topics.
same-paper 3 0.85127658 291 nips-2012-Reducing statistical time-series problems to binary classification
Author: Daniil Ryabko, Jeremie Mary
Abstract: We show how binary classification methods developed to work on i.i.d. data can be used for solving statistical problems that are seemingly unrelated to classification and concern highly-dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. The algorithms that we construct for solving these problems are based on a new metric between time-series distributions, which can be evaluated using binary classification methods. Universal consistency of the proposed algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. 1
4 0.84484273 229 nips-2012-Multimodal Learning with Deep Boltzmann Machines
Author: Nitish Srivastava, Ruslan Salakhutdinov
Abstract: A Deep Boltzmann Machine is described for learning a generative model of data that consists of multiple and diverse input modalities. The model can be used to extract a unified representation that fuses modalities together. We find that this representation is useful for classification and information retrieval tasks. The model works by learning a probability density over the space of multimodal inputs. It uses states of latent variables as representations of the input. The model can extract this representation even when some modalities are absent by sampling from the conditional distribution over them and filling them in. Our experimental results on bi-modal data consisting of images and text show that the Multimodal DBM can learn a good generative model of the joint space of image and text inputs that is useful for information retrieval from both unimodal and multimodal queries. We further demonstrate that this model significantly outperforms SVMs and LDA on discriminative tasks. Finally, we compare our model to other deep learning methods, including autoencoders and deep belief networks, and show that it achieves noticeable gains. 1
5 0.82967502 162 nips-2012-Inverse Reinforcement Learning through Structured Classification
Author: Edouard Klein, Matthieu Geist, Bilal Piot, Olivier Pietquin
Abstract: This paper adresses the inverse reinforcement learning (IRL) problem, that is inferring a reward for which a demonstrated expert behavior is optimal. We introduce a new algorithm, SCIRL, whose principle is to use the so-called feature expectation of the expert as the parameterization of the score function of a multiclass classifier. This approach produces a reward function for which the expert policy is provably near-optimal. Contrary to most of existing IRL algorithms, SCIRL does not require solving the direct RL problem. Moreover, with an appropriate heuristic, it can succeed with only trajectories sampled according to the expert behavior. This is illustrated on a car driving simulator. 1
6 0.78952032 319 nips-2012-Sparse Prediction with the $k$-Support Norm
7 0.78908032 316 nips-2012-Small-Variance Asymptotics for Exponential Family Dirichlet Process Mixture Models
8 0.78852779 104 nips-2012-Dual-Space Analysis of the Sparse Linear Model
9 0.78826398 203 nips-2012-Locating Changes in Highly Dependent Data with Unknown Number of Change Points
10 0.78788465 69 nips-2012-Clustering Sparse Graphs
11 0.78764987 335 nips-2012-The Bethe Partition Function of Log-supermodular Graphical Models
12 0.78724784 139 nips-2012-Fused sparsity and robust estimation for linear models with unknown variance
13 0.78655034 312 nips-2012-Simultaneously Leveraging Output and Task Structures for Multiple-Output Regression
14 0.78640276 199 nips-2012-Link Prediction in Graphs with Autoregressive Features
15 0.78603357 277 nips-2012-Probabilistic Low-Rank Subspace Clustering
16 0.78564537 149 nips-2012-Hierarchical Optimistic Region Selection driven by Curiosity
17 0.78557092 355 nips-2012-Truncation-free Online Variational Inference for Bayesian Nonparametric Models
18 0.78552884 326 nips-2012-Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses
19 0.78495228 179 nips-2012-Learning Manifolds with K-Means and K-Flats
20 0.78449672 117 nips-2012-Ensemble weighted kernel estimators for multivariate entropy estimation