nips nips2013 nips2013-158 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Daniel Vainsencher, Shie Mannor, Huan Xu
Abstract: We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases. A common approach to solving new MML problems is to generalize Lloyd’s algorithm for clustering (or Expectation-Maximization for soft clustering). However this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models. We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufficiently spread out. This enhances robustness by making assumptions on class balance. We further provide generalization bounds and explain how the new iterations may be computed efficiently. We demonstrate the robustness benefits of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a fixed percentage of adversarial unbounded outliers. 1
Reference: text
sentIndex sentText sentNum sentScore
1 sg Abstract We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases. [sent-9, score-0.497]
2 A common approach to solving new MML problems is to generalize Lloyd’s algorithm for clustering (or Expectation-Maximization for soft clustering). [sent-10, score-0.307]
3 However this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models. [sent-11, score-0.361]
4 We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufficiently spread out. [sent-12, score-0.376]
5 We demonstrate the robustness benefits of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a fixed percentage of adversarial unbounded outliers. [sent-15, score-0.691]
6 Yet, in practice, it is common to encounter data that were generated by a mixture of several models rather than a single one, and the goal is to learn a number of models such that any given data point can be explained by at least one of the learned models. [sent-21, score-0.376]
7 It is also common for the data to contain outliers: data-points that are not well explained by any of the models to be learned, possibly inserted by external processes. [sent-22, score-0.207]
8 At its center is the problem of assigning data points to models, with the main consideration that every model be consistent with many of the data points. [sent-24, score-0.326]
9 Thus we seek for each model a distribution of weights over the data points, and encourage even weights by regularizing these distributions (hence our approach is called Regularized Weighting; abbreviated as RW). [sent-25, score-0.256]
10 A data point that is inconsistent with all available models will receive lower weight and even sometimes be ignored. [sent-26, score-0.366]
11 The value of ignoring difficult points is illustrated by contrast with the common approach, which we consider next. [sent-27, score-0.249]
12 This leaves the minimum loss approach vulnerable to outliers and corruptions: If one data point goes to infinity, so must at least one model. [sent-30, score-0.67]
13 Indeed, as we show later, the RW formulation is provably robust in the case of clustering, in the sense of having non-zero breakdown point [2]. [sent-32, score-0.561]
14 A new formulation of the sub-task of associating data points to models as a convex optimization problem for setting weights. [sent-35, score-0.422]
15 This problem favors broadly based models, and may ignore difficult data points entirely. [sent-36, score-0.209]
16 We show that the breakdown point of the proposed method is bounded away from zero for the clustering case. [sent-41, score-0.621]
17 The breakdown point is a concept from robust statistics: it is the fraction of adversarial outliers that an algorithm can sustain without having its output arbitrarily changed. [sent-42, score-0.821]
18 We show, empirically on synthetic and real-world datasets, that our formulation is more resistant to fat-tailed additive noise. [sent-45, score-0.334]
19 As with almost every method that tackles the multiple model learning problem, we use alternating optimization of the models and the association (weights), i.e., we alternate between optimizing the models with the weights fixed and the weights with the models fixed. [sent-53, score-0.291]
20 Our formulation for optimizing the association requires solving a quadratic problem in kn variables, where k is the number of models and n is the number of points. [sent-56, score-0.295]
21 Indeed, special examples of multi-model learning have been studied, including k-means clustering [3, 4, 5] (and many other variants thereof), Gaussian mixture models (and extensions) [6, 7] and the subspace segmentation problem [8, 9, 10]; see Section 2 for details. [sent-61, score-0.57]
22 Slightly closer to our approach is [12], whose formulation generalizes a common approach to different model types and permits problem-specific regularization, giving both generalization results and algorithmic iteration complexity results. [sent-64, score-0.247]
23 Algorithms for dealing with outliers and multiple models together have been proposed in the context of clustering [14]. [sent-66, score-0.687]
24 2 Formulation In this section we show how multi-model learning problems can be formed from a simple estimation problem (where we seek to explain weighted data points by a single model) by imposing a particular joint loss. [sent-69, score-0.445]
25 We contrast the joint loss proposed here to a common one through the weights assigned by each and their effects on robustness. [sent-70, score-0.352]
26 We refer throughout to n data points from X by $(x_i)_{i=1}^{n} = X \in \mathcal{X}^n$, which we seek to explain by k models from M denoted $(m_j)_{j=1}^{k} = M \in \mathcal{M}^k$. [sent-71, score-0.346]
27 A base weighted learning problem is a tuple (X , M, ℓ, A), where ℓ : X × M → R+ is a non-negative convex function, which we call a base loss function and A : △n × X n → M defines an efficient algorithm for choosing a model. [sent-74, score-0.438]
28 Given the weight w and data X, A obtains low weighted empirical loss $\sum_{i=1}^{n} w_i \ell(x_i, m)$ (the weighted empirical loss need not be minimal, allowing for regularization, which we do not discuss further). [sent-75, score-0.781]
29 We will often denote the losses of a model m over X as a vector $l = (\ell(x_i, m))_{i=1}^{n}$. [sent-76, score-0.212]
30 In the context of a set of models M, we similarly associate the loss vector $l_j$ and the weight vector $w_j$ with the model $m_j$; this allows us to use the terse notation $w_j^\top l_j$ for the weighted loss of model j. [sent-77, score-1.433]
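As a concrete illustration of base weighted learning problems, the sketch below (our own NumPy code and function names, not taken from the paper) implements two possible procedures A: a weighted mean for squared-Euclidean clustering and weighted least squares for regression clustering. Each returns a model with low weighted empirical loss $\sum_i w_i \ell(x_i, m)$ for a weight vector w on the simplex.

```python
import numpy as np

def fit_center(w, X):
    """Weighted base learner for clustering: the weighted mean minimizes
    sum_i w_i * ||x_i - m||^2 over centers m (w is a distribution over rows of X)."""
    return np.average(X, axis=0, weights=w)

def fit_regressor(w, X, y):
    """Weighted base learner for regression clustering: weighted least squares
    minimizes sum_i w_i * (y_i - x_i @ beta)^2 over coefficient vectors beta."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return beta

def base_losses(m, X):
    """Loss vector l = (ell(x_i, m))_i for the squared-Euclidean base loss."""
    return ((X - m) ** 2).sum(axis=1)
```

Either learner can play the role of A in the alternating scheme discussed later; only the base loss changes.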
31 • In subspace clustering, also known as subspace segmentation, the objective is to group the training samples into subsets, such that each subset can be well approximated by a low-dimensional affine subspace. [sent-82, score-0.246]
32 • Regression clustering [16] extends the standard linear regression problem in that the training samples cannot be explained by one linear function. [sent-84, score-0.377]
33 The most common way to tackle the multiple model learning problem is the minimum loss approach, i.e., [sent-88, score-0.393]
34 to minimize the following joint loss $L(X, M) = \frac{1}{n} \sum_{x \in X} \min_{m \in M} \ell(x, m)$. (2.1) [sent-89, score-0.202]
35 In terms of weighted base learning problems, each model gives equal weight to all points for which it is the best (lowest loss) model. [sent-91, score-0.56]
36 For example, when $M = X = \mathbb{R}^n$, the squared Euclidean distance loss $\ell(x, m) = \|x - m\|_2^2$ yields k-means clustering. [sent-92, score-0.256]
37 In this context, alternating between choosing for each x its loss-minimizing model, and adjusting each model to minimize the squared Euclidean loss, yields Lloyd’s algorithm (and its generalizations for other problems). [sent-93, score-0.327]
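For concreteness, here is a minimal sketch of that min-loss alternation for the k-means instance (our own illustrative code, not reproduced from the paper; initialization and iteration count are arbitrary choices):

```python
import numpy as np

def lloyd(X, k, n_iter=50, seed=0):
    """Minimum-loss alternation (Lloyd's algorithm): assign each point to its
    lowest-loss center, then refit each center as the mean of its points."""
    rng = np.random.default_rng(seed)
    M = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # losses[j, i] = ||x_i - m_j||^2
        losses = ((X[None, :, :] - M[:, None, :]) ** 2).sum(axis=2)
        assign = losses.argmin(axis=0)              # min-loss assignment
        for j in range(k):
            pts = X[assign == j]
            if len(pts) > 0:                        # keep the old center if none assigned
                M[j] = pts.mean(axis=0)
    return M
```

A single point sent far from the rest drags whichever center claims it, which is exactly the sensitivity discussed next.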
38 The minimum loss approach requires that every point is assigned to a model; this can potentially cause problems in the presence of outliers. [sent-94, score-0.309]
39 For example, consider the clustering case where the data contain a single outlier point xi . [sent-95, score-0.553]
40 Let xi tend to infinity; there will always be some mj that is closest to xi , and is therefore (at equilibrium) the average of xi and some other data points. [sent-96, score-0.502]
41 We call this phenomenon mode I of sensitivity to outliers; it is common also to such simple estimators as the mean. [sent-98, score-0.233]
42 Mode II of sensitivity is more particular: as mj follows xi to infinity, it stops being the closest to any other points, until the model is associated only to the outlier and thus matches it perfectly. [sent-99, score-0.567]
43 Mode II of sensitivity is not clustering specific, and Fig. 2.1 illustrates it in a regression clustering setting. [sent-103, score-0.34]
44 Neither mode is avoided by spreading a point’s weight among models as in mixture models [6]. [sent-106, score-0.523]
45 A penalty term discourages the concentration of a model on few points and thus mode II sensitivity. [sent-108, score-0.324]
46 For clustering this robustness is formalized in Theorem 2. [sent-110, score-0.424]
47 Figure 2.1: Data is a mixture of two quadratics, with positive fat-tailed noise. [sent-129, score-0.296]
48 Under a minimum loss approach an off-the-chart high-noise point suffices to prevent the top broken line from being close to many other data points. [sent-130, score-0.313]
49 Given k weight vectors, we denote their average $v(W) = k^{-1} \sum_{j=1}^{k} w_j$, and just v when W is clear from context. [sent-135, score-0.288]
50 The Regularized Weighting multiple model learning loss is a function $L_\alpha : \mathcal{X}^n \times \mathcal{M}^k \times (\triangle^n)^k \to \mathbb{R}$ defined as $L_\alpha(X, M, W) = \alpha \|u - v(W)\|_2^2 + k^{-1} \sum_{j=1}^{k} l_j^\top w_j$. (2.3) [sent-136, score-0.389]
51 As its name suggests, our formulation regularizes distributions of weight over data points; specifically, the $w_j$ are controlled by forcing their average v to be close to the uniform distribution u. [sent-139, score-0.438]
52 We avoid this by penalizing the squared Euclidean distance from uniformity, which emphasizes points receiving weight much higher than the natural $n^{-1}$, and essentially ignores small variations around $n^{-1}$. [sent-141, score-0.47]
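Evaluating the joint loss (2.3) is straightforward; the sketch below (our own code, assuming the layout losses[j, i] = ℓ(x_i, m_j) with each row of W a distribution over the n points) only fixes the conventions used in the later snippets.

```python
import numpy as np

def rw_loss(losses, W, alpha):
    """Regularized Weighting joint loss (2.3):
    alpha * ||u - v(W)||_2^2 + (1/k) * sum_j <l_j, w_j>,
    where u is uniform over the n points and v(W) averages the k weight rows."""
    k, n = losses.shape
    u = np.full(n, 1.0 / n)
    v = W.mean(axis=0)
    return alpha * np.sum((u - v) ** 2) + np.sum(W * losses) / k
```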
53 In the following examples, we will consider a set of $\gamma n k^{-1}$ data points, recalling that $nk^{-1}$ is the natural number of points per model. [sent-144, score-0.264]
54 To avoid letting a few high loss outliers skew our models (mode I of sensitivity), we prefer instead to give them zero weight. [sent-145, score-0.526]
55 Take $\gamma \ll k/2$; then the cost of ignoring some $\gamma n k^{-1}$ points in all models is at most $\alpha n^{-1} \cdot 2\gamma k^{-1} \ll \alpha n^{-1}$. [sent-146, score-0.263]
56 [Figure 2.2: weight assigned by each model, scaled as $w_{j,i} \cdot n/k$; legend: model 1 weighs 21 points, model 2 weighs 39 points.] [sent-151, score-0.57]
57 Within each model, weights are affine in the loss (see Section 2.1). [sent-170, score-0.234]
58 The gap allowed between the maximal weights of different models allows a point from the right cluster to be adopted by the left model, lowering overall penalty at a cost to weighted losses. [sent-172, score-0.272]
59 If the jth model is fit to only $\gamma n k^{-1}$ points for $\gamma \ll 1$, the penalty from those points will be at least (approximately) $\alpha n^{-1} \cdot \gamma^{-1} k^{-1}$. [sent-174, score-0.377]
60 We can make the first situation cheap and the second expensive (per model) in comparison to the empirical weighted loss term by choosing $\alpha n^{-1} \approx k^{-1} \sum_{j=1}^{k} w_j^\top l_j$. [sent-175, score-0.34]
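Read as code, this rule of thumb for the scale of α might look as follows (the function name and (k, n) array layout are our own conventions; the relation itself is the one stated above):

```python
import numpy as np

def alpha_heuristic(W, losses):
    """Choose alpha so that alpha * n^{-1} is on the order of the average
    weighted loss k^{-1} * sum_j <w_j, l_j>."""
    n = losses.shape[1]
    return n * np.mean(np.sum(W * losses, axis=1))
```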
61 Consider the case where a model has low loss for fewer than n/(2k) points: spreading its weight only over them can incur very high costs due to the regularization term, which might be lowered by including some higher-loss points that are indeed better explained by another model (see Figure 2.2). [sent-178, score-0.768]
62 This challenge might be solved by explicitly and separately estimating the relative frequencies of the classes, and penalizing deviations from the estimates rather than from equal frequencies, as is done in mixture models [6]; this is left for future study. [sent-180, score-0.203]
63 2.1 Two properties of Regularized Weighting. Two properties of our formulation result from an analysis (in Appendix A, for lack of space) of a dual problem of the weight setting problem (2.4). [sent-182, score-0.277]
64 If outliers are present and $\alpha n^{-1} > 2B$, where B bounds losses on all points including outliers, weights will be almost uniform (enabling mode I of sensitivity). [sent-189, score-0.858]
65 One observation from this lemma is that if a particular model j gives weight to some point i, then every point with lower loss $\ell(x_{i'}, m_j)$ under that model will receive at least that much weight. [sent-200, score-0.881]
66 This property plays a key role in the proof of robustness to outliers in clustering. [sent-201, score-0.431]
67 The loss (2.3) is convex when we fix the models, and an efficient procedure A is assumed for solving a weighted base learning problem for a model, supporting an alternating optimization approach, as in Algorithm 1; see Section 5 for further discussion. [sent-205, score-0.247]
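A minimal sketch of such an alternating scheme for the clustering case follows. This is our own illustrative code, not the paper's Algorithm 1: the weight step is plain projected gradient descent on the convex objective (2.3) over the product of simplices (where the paper points to an accelerated, FISTA-style solver), the model step is the weighted mean, and the step size, iteration counts and function names are arbitrary choices.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto the probability simplex (sort-based)."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, y.size + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

def rw_clustering(X, k, alpha, outer=20, inner=200, seed=0):
    """Alternate between the convex weight subproblem (projected gradient on the
    RW objective for fixed centers) and weighted refitting of the centers.
    Assumes alpha > 0 and losses[j, i] = ||x_i - m_j||^2 as elsewhere."""
    rng = np.random.default_rng(seed)
    n = len(X)
    M = X[rng.choice(n, size=k, replace=False)].astype(float)
    W = np.full((k, n), 1.0 / n)
    u = np.full(n, 1.0 / n)
    for _ in range(outer):
        losses = ((X[None, :, :] - M[:, None, :]) ** 2).sum(axis=2)   # (k, n)
        step = k / (2.0 * alpha)          # 1/L for the quadratic penalty term
        for _ in range(inner):            # weight step: projected gradient descent
            v = W.mean(axis=0)
            grad = (2.0 * alpha / k) * (v - u) + losses / k
            W = np.vstack([project_simplex(W[j] - step * grad[j]) for j in range(k)])
        for j in range(k):                # model step: weighted mean per model
            M[j] = np.average(X, axis=0, weights=W[j])
    return M, W
```

The inner loop is the quadratic problem in kn variables mentioned in the introduction; the simplex projection keeps each w_j a distribution, so points with large losses can receive weight zero.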
68 Figure 2.1 on page 4 provides a positive example in regression clustering, and a more substantial empirical evaluation on subspace clustering is in Appendix B. [sent-208, score-0.476]
69 In the particular case of clustering with the squared Euclidean loss, robustness benefits can be proved. [sent-209, score-0.444]
70 We use “breakdown point” – the standard robustness measure in the literature of robust statistics [2], see also [17, 18] and many others – to quantify the robustness property of the proposed formulation. [sent-210, score-0.35]
71 The breakdown point of an estimator is the smallest fraction of bad observations that can cause the estimator to take arbitrarily aberrant values, i.e., [sent-211, score-0.388]
72 the smallest fraction of outliers needed to completely break an estimator. [sent-213, score-0.304]
73 For the case of clustering with the squared Euclidean distance base loss, the min-loss approach corresponds to k-means clustering which is not robust in this sense; its breakdown point is 0. [sent-214, score-1.164]
74 The non-robustness of k-means has led to the development of many formulations of robust clustering; see a review by [14]. [sent-215, score-0.223]
75 In contrast, we show that our joint loss yields an estimator that has a non-zero breakdown point, and is hence robust. [sent-216, score-0.502]
76 In general, a squared loss clustering formulation that assigns equal weight to different data points cannot be robust – as one data point tends to infinity so must at least one model. [sent-217, score-1.163]
77 On the other hand, if α is too low, it becomes possible for each model to assign all of its weight to a single point, which may well be an outlier tending to infinity. [sent-219, score-0.348]
78 Let X = M be a Euclidean space in which we perform clustering with the loss $\ell(x_i, m_j) = \|m_j - x_i\|^2$ and k centers. [sent-222, score-0.999]
79 Denote by R the radius of any ball containing the inliers, and by $\eta < k^{-2}/22$ the proportion of outliers allowed to be outside the ball. [sent-223, score-0.347]
80 Denote also by r a radius such that there exists $M' = \{m'_1, \cdots, m'_k\}$ such that each inlier is within a distance r of some model $m'_j$ and each $m'_j$ approximates (i.e., is close to) some inlier. [sent-224, score-0.466]
81 $\|m_j - x_i\|_2 \le 6R$ for every model $m_j$ and inlier $x_i$. [sent-228, score-0.751]
82 Theorem 2 shows that when the number of outliers is not too high, the learned model, regardless of the magnitude of the outliers, is close to the inliers and hence cannot be arbitrarily bad. [sent-229, score-0.487]
83 In particular, the theorem implies a non-zero breakdown point for any $\alpha > nr^2$; taking too high an α merely forces a larger but still finite R. [sent-230, score-0.357]
84 If the inliers are amenable to balanced clustering so that r ≪ R, the regime of non-zero breakdown is extended to smaller α. [sent-231, score-0.716]
85 First, due to the regularization term, for any model, the total weight on the few outliers is at most 1/3. [sent-233, score-0.503]
86 Second, an optimal model must thus be at least twice as close to the weighted average of its inliers as it is to the weighted average of its outliers. [sent-234, score-0.305]
87 This step depends critically on squared Euclidean loss being used. [sent-235, score-0.218]
88 Lastly, this gap in distances cannot be large in absolute terms, due to Lemma 2; an outlier that is much farther from the model than the inliers must receive weight zero. [sent-236, score-0.541]
89 The current formulation seems to be particularly vulnerable since it allows data to be ignored, in contrast to most generalization bounds that assume equal weight is given to all data. [sent-239, score-0.46]
90 Our loss Lα (X, M ) differs from common losses in allowing data points to be differently weighted. [sent-240, score-0.618]
91 We shall endow $\mathcal{M}^k$ with the metric $d_\infty(M, M') = \max_{j \in [k]} \|\ell(\cdot, m_j) - \ell(\cdot, m'_j)\|_\infty$ and define its covering number $N_\varepsilon(\mathcal{M}^k)$ as the minimal cardinality of a set $\mathcal{M}^k_\varepsilon$ such that $\mathcal{M}^k \subseteq \bigcup_{M \in \mathcal{M}^k_\varepsilon} B(M, \varepsilon)$. [sent-246, score-0.322]
92 The bound depends on an upper bound on base losses denoted B; this should be viewed as fixing a scale for the losses and is standard where losses are not naturally bounded. [sent-247, score-0.599]
93 Let the base losses be bounded in the interval [0, B], let $\mathcal{M}^k$ have covering numbers $N_\varepsilon(\mathcal{M}^k) \le (C/\varepsilon)^{dk}$, and let $\gamma = nB/(2\alpha)$. [sent-252, score-0.325]
94 Given data and models (X, M), there exists an algorithm that finds a weight matrix W such that $L_\alpha(X, M, W) - L_\alpha(X, M) \le \varepsilon$ using $O(\cdot)$ time and memory. [sent-261, score-0.268]
95 The first bound might suggest that typical settings of α ∝ n require the number of iterations to increase with the number of points n; the second bound shows this is not always necessary. [sent-263, score-0.233]
96 6 Conclusion In this paper, we proposed and analyzed, from a general perspective, a new formulation for learning multiple models that explain much of the data well. [sent-266, score-0.274]
97 This is based on associating to each model a regularized weight distribution over the data it explains well. [sent-267, score-0.385]
98 We further provided generalization bounds and explained an optimization procedure to solve the formulation at scale. [sent-269, score-0.261]
99 Our main motivation comes from the fast-growing attention to analyzing data using multiple models, under the names of k-means clustering, subspace segmentation, and Gaussian mixture models, to name a few. [sent-270, score-0.297]
100 We believe general methods with desirable properties such as generalization and robustness will supply ready tools for new applications using other model types. [sent-272, score-0.223]
wordName wordTfidf (topN-words)
[('outliers', 0.304), ('breakdown', 0.3), ('clustering', 0.264), ('mj', 0.258), ('weight', 0.169), ('losses', 0.169), ('points', 0.167), ('loss', 0.165), ('mk', 0.16), ('inliers', 0.152), ('outlier', 0.136), ('robustness', 0.127), ('subspace', 0.123), ('fat', 0.121), ('wj', 0.119), ('mode', 0.114), ('formulation', 0.108), ('tailed', 0.105), ('mml', 0.103), ('robust', 0.096), ('base', 0.092), ('weighted', 0.089), ('weighting', 0.087), ('lj', 0.086), ('rw', 0.084), ('inlier', 0.084), ('regularized', 0.083), ('nity', 0.079), ('sensitivity', 0.076), ('shie', 0.075), ('euclidean', 0.071), ('mixture', 0.07), ('weights', 0.069), ('alternating', 0.066), ('explained', 0.065), ('kn', 0.064), ('covering', 0.064), ('lloyd', 0.063), ('multiple', 0.062), ('singapore', 0.058), ('point', 0.057), ('models', 0.057), ('segmentation', 0.056), ('spreading', 0.056), ('weighs', 0.056), ('nk', 0.055), ('xi', 0.054), ('sons', 0.053), ('generalization', 0.053), ('squared', 0.053), ('fista', 0.053), ('vulnerable', 0.053), ('minimum', 0.049), ('lemma', 0.048), ('regression', 0.048), ('associating', 0.048), ('explain', 0.047), ('technion', 0.046), ('huan', 0.046), ('haifa', 0.044), ('af', 0.044), ('penalizing', 0.043), ('radius', 0.043), ('model', 0.043), ('common', 0.043), ('appendix', 0.042), ('data', 0.042), ('receive', 0.041), ('page', 0.041), ('partly', 0.041), ('tend', 0.04), ('ignoring', 0.039), ('mi', 0.039), ('assigned', 0.038), ('distance', 0.038), ('israel', 0.038), ('focs', 0.037), ('joint', 0.037), ('ieee', 0.036), ('bounds', 0.035), ('associate', 0.034), ('wiley', 0.034), ('solutions', 0.034), ('optimizing', 0.034), ('xu', 0.034), ('seek', 0.033), ('adversarial', 0.033), ('formalized', 0.033), ('ignored', 0.033), ('frequencies', 0.033), ('association', 0.032), ('consideration', 0.032), ('allowing', 0.032), ('iterations', 0.031), ('tackle', 0.031), ('arbitrarily', 0.031), ('seeks', 0.031), ('regularization', 0.03), ('lowered', 0.03), ('ticular', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 158 nips-2013-Learning Multiple Models via Regularized Weighting
Author: Daniel Vainsencher, Shie Mannor, Huan Xu
Abstract: We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases. A common approach to solving new MML problems is to generalize Lloyd’s algorithm for clustering (or Expectation-Maximization for soft clustering). However this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models. We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufficiently spread out. This enhances robustness by making assumptions on class balance. We further provide generalization bounds and explain how the new iterations may be computed efficiently. We demonstrate the robustness benefits of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a fixed percentage of adversarial unbounded outliers. 1
2 0.2184713 232 nips-2013-Online PCA for Contaminated Data
Author: Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan
Abstract: We consider the online Principal Component Analysis (PCA) where contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily skewed by the outliers. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a 50% breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data. 1
3 0.14473665 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling
Author: Mahito Sugiyama, Karsten Borgwardt
Abstract: Distance-based approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search. 1
4 0.13346362 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
Author: Yu-Xiang Wang, Huan Xu, Chenlei Leng
Abstract: Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered as the state-of-the-art methods for subspace clustering. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of “Self-Expressiveness”. The main difference is that SSC minimizes the vector 1 norm of the representation matrix to induce sparsity while LRR minimizes nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develops theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strength and weakness of SSC and LRR and demonstrate how LRSSC can take the advantages of both methods in preserving the “Self-Expressiveness Property” and “Graph Connectivity” at the same time. 1
5 0.11778869 91 nips-2013-Dirty Statistical Models
Author: Eunho Yang, Pradeep Ravikumar
Abstract: We provide a unified framework for the high-dimensional analysis of “superposition-structured” or “dirty” statistical models: where the model parameters are a superposition of structurally constrained parameters. We allow for any number and types of structures, and any statistical model. We consider the general class of M -estimators that minimize the sum of any loss function, and an instance of what we call a “hybrid” regularization, that is the infimal convolution of weighted regularization functions, one for each structural component. We provide corollaries showcasing our unified framework for varied statistical models such as linear regression, multiple regression and principal component analysis, over varied superposition structures. 1
6 0.11276466 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
7 0.10859823 75 nips-2013-Convex Two-Layer Modeling
8 0.10698813 171 nips-2013-Learning with Noisy Labels
9 0.10617891 233 nips-2013-Online Robust PCA via Stochastic Optimization
10 0.095931336 63 nips-2013-Cluster Trees on Manifolds
11 0.093383551 230 nips-2013-Online Learning with Costly Features and Labels
12 0.093110286 242 nips-2013-PAC-Bayes-Empirical-Bernstein Inequality
13 0.091375522 224 nips-2013-On the Sample Complexity of Subspace Learning
14 0.089938872 180 nips-2013-Low-rank matrix reconstruction and clustering via approximate message passing
15 0.086326122 137 nips-2013-High-Dimensional Gaussian Process Bandits
16 0.085540891 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
17 0.084502347 355 nips-2013-Which Space Partitioning Tree to Use for Search?
18 0.080227479 188 nips-2013-Memory Limited, Streaming PCA
19 0.079891741 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
20 0.079429559 298 nips-2013-Small-Variance Asymptotics for Hidden Markov Models
topicId topicWeight
[(0, 0.248), (1, 0.064), (2, 0.077), (3, 0.008), (4, 0.035), (5, 0.057), (6, -0.048), (7, 0.05), (8, -0.096), (9, -0.008), (10, -0.044), (11, 0.125), (12, 0.111), (13, 0.098), (14, 0.096), (15, 0.01), (16, 0.048), (17, -0.028), (18, 0.1), (19, -0.043), (20, -0.035), (21, 0.078), (22, 0.018), (23, 0.017), (24, -0.026), (25, 0.022), (26, -0.081), (27, 0.003), (28, -0.028), (29, -0.018), (30, -0.098), (31, 0.067), (32, 0.145), (33, 0.008), (34, 0.03), (35, -0.011), (36, -0.108), (37, -0.068), (38, -0.018), (39, 0.033), (40, -0.058), (41, 0.038), (42, -0.076), (43, 0.054), (44, 0.066), (45, -0.146), (46, 0.079), (47, 0.019), (48, -0.015), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.9484517 158 nips-2013-Learning Multiple Models via Regularized Weighting
Author: Daniel Vainsencher, Shie Mannor, Huan Xu
Abstract: We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases. A common approach to solving new MML problems is to generalize Lloyd’s algorithm for clustering (or Expectation-Maximization for soft clustering). However this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models. We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufficiently spread out. This enhances robustness by making assumptions on class balance. We further provide generalization bounds and explain how the new iterations may be computed efficiently. We demonstrate the robustness benefits of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a fixed percentage of adversarial unbounded outliers. 1
2 0.70300525 261 nips-2013-Rapid Distance-Based Outlier Detection via Sampling
Author: Mahito Sugiyama, Karsten Borgwardt
Abstract: Distance-based approaches to outlier detection are popular in data mining, as they do not require to model the underlying probability distribution, which is particularly challenging for high-dimensional data. We present an empirical comparison of various approaches to distance-based outlier detection across a large number of datasets. We report the surprising observation that a simple, sampling-based scheme outperforms state-of-the-art techniques in terms of both efficiency and effectiveness. To better understand this phenomenon, we provide a theoretical analysis why the sampling-based approach outperforms alternative methods based on k-nearest neighbor search. 1
3 0.65738779 259 nips-2013-Provable Subspace Clustering: When LRR meets SSC
Author: Yu-Xiang Wang, Huan Xu, Chenlei Leng
Abstract: Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR) are both considered as the state-of-the-art methods for subspace clustering. The two methods are fundamentally similar in that both are convex optimizations exploiting the intuition of “Self-Expressiveness”. The main difference is that SSC minimizes the vector 1 norm of the representation matrix to induce sparsity while LRR minimizes nuclear norm (aka trace norm) to promote a low-rank structure. Because the representation matrix is often simultaneously sparse and low-rank, we propose a new algorithm, termed Low-Rank Sparse Subspace Clustering (LRSSC), by combining SSC and LRR, and develops theoretical guarantees of when the algorithm succeeds. The results reveal interesting insights into the strength and weakness of SSC and LRR and demonstrate how LRSSC can take the advantages of both methods in preserving the “Self-Expressiveness Property” and “Graph Connectivity” at the same time. 1
4 0.64225018 232 nips-2013-Online PCA for Contaminated Data
Author: Jiashi Feng, Huan Xu, Shie Mannor, Shuicheng Yan
Abstract: We consider the online Principal Component Analysis (PCA) where contaminated samples (containing outliers) are revealed sequentially to the Principal Components (PCs) estimator. Due to their sensitiveness to outliers, previous online PCA algorithms fail in this case and their results can be arbitrarily skewed by the outliers. Here we propose the online robust PCA algorithm, which is able to improve the PCs estimation upon an initial one steadily, even when faced with a constant fraction of outliers. We show that the final result of the proposed online RPCA has an acceptable degradation from the optimum. Actually, under mild conditions, online RPCA achieves the maximal robustness with a 50% breakdown point. Moreover, online RPCA is shown to be efficient for both storage and computation, since it need not re-explore the previous samples as in traditional robust PCA algorithms. This endows online RPCA with scalability for large scale data. 1
5 0.63323736 80 nips-2013-Data-driven Distributionally Robust Polynomial Optimization
Author: Martin Mevissen, Emanuele Ragnoli, Jia Yuan Yu
Abstract: We consider robust optimization for polynomial optimization problems where the uncertainty set is a set of candidate probability density functions. This set is a ball around a density function estimated from data samples, i.e., it is data-driven and random. Polynomial optimization problems are inherently hard due to nonconvex objectives and constraints. However, we show that by employing polynomial and histogram density estimates, we can introduce robustness with respect to distributional uncertainty sets without making the problem harder. We show that the optimum to the distributionally robust problem is the limit of a sequence of tractable semidefinite programming relaxations. We also give finite-sample consistency guarantees for the data-driven uncertainty sets. Finally, we apply our model and solution method in a water network optimization problem. 1
6 0.62000525 94 nips-2013-Distributed $k$-means and $k$-median Clustering on General Topologies
7 0.61369205 233 nips-2013-Online Robust PCA via Stochastic Optimization
8 0.60718912 357 nips-2013-k-Prototype Learning for 3D Rigid Structures
9 0.60396284 63 nips-2013-Cluster Trees on Manifolds
10 0.59898919 238 nips-2013-Optimistic Concurrency Control for Distributed Unsupervised Learning
11 0.58481181 135 nips-2013-Heterogeneous-Neighborhood-based Multi-Task Local Learning Algorithms
12 0.57833183 358 nips-2013-q-OCSVM: A q-Quantile Estimator for High-Dimensional Distributions
13 0.57265902 197 nips-2013-Moment-based Uniform Deviation Bounds for $k$-means and Friends
14 0.56922984 31 nips-2013-Adaptivity to Local Smoothness and Dimension in Kernel Regression
15 0.56765187 244 nips-2013-Parametric Task Learning
16 0.56455064 202 nips-2013-Multiclass Total Variation Clustering
17 0.55100906 188 nips-2013-Memory Limited, Streaming PCA
18 0.54494494 224 nips-2013-On the Sample Complexity of Subspace Learning
19 0.54211903 294 nips-2013-Similarity Component Analysis
20 0.54141319 192 nips-2013-Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
topicId topicWeight
[(2, 0.022), (10, 0.193), (16, 0.029), (33, 0.156), (34, 0.137), (36, 0.01), (41, 0.038), (49, 0.036), (56, 0.122), (70, 0.054), (85, 0.03), (89, 0.048), (93, 0.051), (95, 0.017)]
simIndex simValue paperId paperTitle
1 0.86933953 342 nips-2013-Unsupervised Spectral Learning of Finite State Transducers
Author: Raphael Bailly, Xavier Carreras, Ariadna Quattoni
Abstract: Finite-State Transducers (FST) are a standard tool for modeling paired inputoutput sequences and are used in numerous applications, ranging from computational biology to natural language processing. Recently Balle et al. [4] presented a spectral algorithm for learning FST from samples of aligned input-output sequences. In this paper we address the more realistic, yet challenging setting where the alignments are unknown to the learning algorithm. We frame FST learning as finding a low rank Hankel matrix satisfying constraints derived from observable statistics. Under this formulation, we provide identifiability results for FST distributions. Then, following previous work on rank minimization, we propose a regularized convex relaxation of this objective which is based on minimizing a nuclear norm penalty subject to linear constraints and can be solved efficiently. 1
2 0.86702269 276 nips-2013-Reshaping Visual Datasets for Domain Adaptation
Author: Boqing Gong, Kristen Grauman, Fei Sha
Abstract: In visual recognition problems, the common data distribution mismatches between training and testing make domain adaptation essential. However, image data is difficult to manually divide into the discrete domains required by adaptation algorithms, and the standard practice of equating datasets with domains is a weak proxy for all the real conditions that alter the statistics in complex ways (lighting, pose, background, resolution, etc.) We propose an approach to automatically discover latent domains in image or video datasets. Our formulation imposes two key properties on domains: maximum distinctiveness and maximum learnability. By maximum distinctiveness, we require the underlying distributions of the identified domains to be different from each other to the maximum extent; by maximum learnability, we ensure that a strong discriminative model can be learned from the domain. We devise a nonparametric formulation and efficient optimization procedure that can successfully discover domains among both training and test data. We extensively evaluate our approach on object recognition and human activity recognition tasks. 1
same-paper 3 0.8603695 158 nips-2013-Learning Multiple Models via Regularized Weighting
Author: Daniel Vainsencher, Shie Mannor, Huan Xu
Abstract: We consider the general problem of Multiple Model Learning (MML) from data, from the statistical and algorithmic perspectives; this problem includes clustering, multiple regression and subspace clustering as special cases. A common approach to solving new MML problems is to generalize Lloyd’s algorithm for clustering (or Expectation-Maximization for soft clustering). However this approach is unfortunately sensitive to outliers and large noise: a single exceptional point may take over one of the models. We propose a different general formulation that seeks for each model a distribution over data points; the weights are regularized to be sufficiently spread out. This enhances robustness by making assumptions on class balance. We further provide generalization bounds and explain how the new iterations may be computed efficiently. We demonstrate the robustness benefits of our approach with some experimental results and prove for the important case of clustering that our approach has a non-trivial breakdown point, i.e., is guaranteed to be robust to a fixed percentage of adversarial unbounded outliers. 1
4 0.79750657 95 nips-2013-Distributed Exploration in Multi-Armed Bandits
Author: Eshcar Hillel, Zohar S. Karnin, Tomer Koren, Ronny Lempel, Oren Somekh
Abstract: We study exploration in Multi-Armed Bandits in a setting where k players collaborate in order to identify an ε-optimal arm. Our motivation comes from recent employment of bandit algorithms in computationally intensive, large-scale applications. Our results demonstrate a non-trivial tradeoff between the number of arm pulls required by each of the players, and the amount of communication between them. In particular, our main result shows that by allowing the k players to com√ municate only once, they are able to learn k times faster than a single player. √ That is, distributing learning to k players gives rise to a factor k parallel speedup. We complement this result with a lower bound showing this is in general the best possible. On the other extreme, we present an algorithm that achieves the ideal factor k speed-up in learning performance, with communication only logarithmic in 1/ε. 1
5 0.79009891 173 nips-2013-Least Informative Dimensions
Author: Fabian Sinz, Anna Stockl, January Grewe, January Benda
Abstract: We present a novel non-parametric method for finding a subspace of stimulus features that contains all information about the response of a system. Our method generalizes similar approaches to this problem such as spike triggered average, spike triggered covariance, or maximally informative dimensions. Instead of maximizing the mutual information between features and responses directly, we use integral probability metrics in kernel Hilbert spaces to minimize the information between uninformative features and the combination of informative features and responses. Since estimators of these metrics access the data via kernels, are easy to compute, and exhibit good theoretical convergence properties, our method can easily be generalized to populations of neurons or spike patterns. By using a particular expansion of the mutual information, we can show that the informative features must contain all information if we can make the uninformative features independent of the rest. 1
6 0.78769898 49 nips-2013-Bayesian Inference and Online Experimental Design for Mapping Neural Microcircuits
7 0.78670341 236 nips-2013-Optimal Neural Population Codes for High-dimensional Stimulus Variables
8 0.78500849 150 nips-2013-Learning Adaptive Value of Information for Structured Prediction
9 0.78495705 239 nips-2013-Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result
10 0.78463912 201 nips-2013-Multi-Task Bayesian Optimization
11 0.78459692 310 nips-2013-Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators.
12 0.78405887 5 nips-2013-A Deep Architecture for Matching Short Texts
13 0.78375161 22 nips-2013-Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization
14 0.78309697 114 nips-2013-Extracting regions of interest from biological images with convolutional sparse block coding
15 0.78269851 101 nips-2013-EDML for Learning Parameters in Directed and Undirected Graphical Models
16 0.78267932 116 nips-2013-Fantope Projection and Selection: A near-optimal convex relaxation of sparse PCA
17 0.78220981 318 nips-2013-Structured Learning via Logistic Regression
18 0.78140354 97 nips-2013-Distributed Submodular Maximization: Identifying Representative Elements in Massive Data
19 0.78135753 45 nips-2013-BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
20 0.78125829 286 nips-2013-Robust learning of low-dimensional dynamics from large neural ensembles