nips nips2004 nips2004-158 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rob Fergus, Andrew Zisserman, Pietro Perona
Abstract: We present an algorithm to overcome the local maxima problem in estimating the parameters of mixture models. It combines existing approaches from both EM and a robust fitting algorithm, RANSAC, to give a data-driven stochastic learning scheme. Minimal subsets of data points, sufficient to constrain the parameters of the model, are drawn from proposal densities to discover new regions of high likelihood. The proposal densities are learnt using EM and bias the sampling toward promising solutions. The algorithm is computationally efficient, as well as effective at escaping from local maxima. We compare it with alternative methods, including EM and RANSAC, on both challenging synthetic data and the computer vision problem of alpha-matting. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract: We present an algorithm to overcome the local maxima problem in estimating the parameters of mixture models. [sent-10, score-0.206]
2 Minimal subsets of data points, sufficient to constrain the parameters of the model, are drawn from proposal densities to discover new regions of high likelihood. [sent-12, score-0.77]
3 The proposal densities are learnt using EM and bias the sampling toward promising solutions. [sent-13, score-0.76]
4 The algorithm is computationally efficient, as well as effective at escaping from local maxima. [sent-14, score-0.071]
5 In Fig. 1(a) we have some clumps of data that are embedded in noise. [sent-18, score-0.06]
6 Our data has many components, so our model must too. [sent-20, score-0.086]
7 The challenge here is to find these lines reliably despite them only constituting a small portion of the data. [sent-24, score-0.086]
8 Our motivating real-world problem is to learn a visual model from the set of images returned by Google’s image search on an object type (such as “camel”, “tiger” or “bottles”), like those shown. [sent-27, score-0.077]
9 Since text-based cues alone were used to compile the images, typically only 20%–50% of the images are visually consistent, and the remainder may not even be images of the sought object type, resulting in a challenging learning problem. [sent-28, score-0.09]
10 The parameters of these may be estimated using algorithms based on EM [2] in a maximum likelihood framework. [sent-30, score-0.1]
11 While EM provides an efficient estimation scheme, it has a serious problem: for complex models, a local maximum of the likelihood function is often reached rather than the global maximum. [sent-31, score-0.226]
12 (b) Synthetic line data with few components but with a large portion of background noise. [sent-34, score-0.301]
13 yN} and a parametric mixture model with parameters θ, of the form: p(x|θ) = Σ_y p(x, y|θ) = Σ_y p(x|y, θ) P(y|θ)  (1). We assume the number of mixture components is known and equal to C. [sent-46, score-0.198]
14 We also assume that the parametric form of the mixture components is given. [sent-47, score-0.126]
15 One of these components will model the background noise, while the remainder fit the signal within the data. [sent-48, score-0.173]
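As a concrete illustration of the mixture form in (1) with one uniform background component, here is a minimal sketch of evaluating the data log-likelihood. The function names, the dense-covariance Gaussian implementation, and the bounding-box volume used for the background are illustrative assumptions, not code from the paper.

```python
import numpy as np

def gauss_pdf(x, mean, cov):
    """Multivariate Gaussian density evaluated at the rows of x (shape (N, d))."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / norm

def mixture_loglik(x, weights, means, covs, bg_volume):
    """log p(x|theta) = sum_n log sum_y p(x_n|y, theta) P(y|theta).
    `weights` has one entry per Gaussian component plus a final entry for the
    uniform background component, assumed spread over a box of volume bg_volume."""
    densities = [gauss_pdf(x, m, c) for m, c in zip(means, covs)]
    densities.append(np.full(len(x), 1.0 / bg_volume))   # uniform background
    p = np.sum(np.array(weights)[:, None] * np.vstack(densities), axis=0)
    return np.sum(np.log(p + 1e-300))
```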
16 This is not straightforward, as the dimensionality of θ is large and the likelihood function is highly non-linear. [sent-50, score-0.068]
17 Algorithms such as EM often get stuck in local maxima such as those illustrated in Fig. [sent-51, score-0.134]
18 Before describing our algorithm, we first review the robust fitting algorithm RANSAC, from which we borrow several key concepts to enable us to escape from local maxima. [sent-53, score-0.083]
19 2.1 RANSAC: RANSAC (RANdom SAmple Consensus) attempts to find the global maximum by drawing random subsets of points, fitting a model to them and then measuring their support from the data. [sent-55, score-0.161]
20 The basic idea is to draw at random and without replacement from x, a set of P samples for each of the C components in our model; P being the smallest number required to compute the parameters θc for each component. [sent-57, score-0.254]
21 Let draw i be represented by zi , a vector of length N containing exactly P ones, indicating the points selected with the rest being zeros. [sent-58, score-0.145]
22 Thus x(zi ) is the subset of points drawn from x. [sent-59, score-0.063]
23 Having done this for all components, we then estimate the component mixing proportions π using EM (keeping the other parameters fixed), giving a set of parameters for draw i, θ^i = {π^i, θ_1^i ... θ_C^i}. [sent-61, score-0.246]
24 Using these parameters, we compute the likelihood over all the data: p(x|θ^i). [sent-65, score-0.068]
25 The entire process is repeated until either we exceed our maximum limit on the number of draws or we reach a pre-defined performance level. [sent-66, score-0.102]
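The random-draw loop just described can be sketched as follows. This is a simplified stand-in rather than the authors' implementation: `fit` and `loglik` are hypothetical callables supplied by the user, and the mixing proportions are left to `loglik` instead of being re-estimated with EM.

```python
import numpy as np

def ransac_style_search(x, C, P, n_draws, fit, loglik, seed=0):
    """Repeatedly draw P points per component at random (without replacement),
    fit each component to its draw, and keep the highest-likelihood draw.
    `fit` maps a (P, d) subset to component parameters; `loglik` scores the
    full data set under the C fitted components."""
    rng = np.random.default_rng(seed)
    best_ll, best_params = -np.inf, None
    for _ in range(n_draws):
        params = []
        for _ in range(C):
            idx = rng.choice(len(x), size=P, replace=False)  # the draw z_i
            params.append(fit(x[idx]))
        ll = loglik(x, params)
        if ll > best_ll:
            best_ll, best_params = ll, params
    return best_params, best_ll
```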
26 Since this process explores a finite set of points in the space of θ, it is unlikely that the globally optimal point, θ_ML, will be found, but θ* should be close, so that running EM from it is guaranteed to find the global optimum. [sent-68, score-0.063]
27 However, it is clear that the approach of sampling randomly, while guaranteed to eventually find a point close to θ_ML, is very inefficient, since the number of possible draws scales exponentially with both P and C. [sent-69, score-0.125]
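To make the scaling claim concrete, a rough count of the search space (an illustrative estimate, not a figure taken from the paper): drawing P points independently for each of the C components gives on the order of

$$\text{number of distinct joint draws} \;\sim\; \binom{N}{P}^{C},$$

which grows exponentially with C and, since each factor is of order $N^{P}/P!$, extremely quickly with P as well.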
28 [6] proposed drawing the samples from a non-uniform density; however, this approach involved incorporating auxiliary information about each sample point, which may not be available for more general problems. [sent-73, score-0.064]
29 [1] propose a general scheme to draw samples selectively from points tentatively classified as signal. [sent-76, score-0.151]
30 The problem with RANSAC is that points are drawn randomly. [sent-79, score-0.063]
31 Even after a large number of draws this random sampling continues, despite the fact that we may have already discovered a good, albeit local, maximum in our likelihood function. [sent-80, score-0.193]
32 The key idea in PROPOSAL is to draw samples from a proposal density. [sent-81, score-0.686]
33 Initially this density is uniform, as in RANSAC, but as regions of high likelihood are discovered, we update it so that it gives a strong bias toward producing good draws again, increasing the efficiency of the sampling process. [sent-82, score-0.266]
34 However, having found a local maximum, we must still be able to escape it and find the global maximum. [sent-83, score-0.079]
35 Local maxima are characterized by too many components in one part of the space and too few in another. [sent-84, score-0.189]
36 In the first, a component in an underpopulated region is split into two new ones, while in the second two components in an overpopulated area are merged. [sent-87, score-0.249]
37 These two moves are performed together to keep the number of components constant. [sent-88, score-0.086]
38 In Fig. 2(a), merging the green and blue components while splitting the red component will yield a superior solution. [sent-90, score-0.298]
39 Figure 2: (a) Examples of different types of local maxima encountered. [sent-91, score-0.134]
40 The green and blue components on the left are overpopulating a small clump of data. [sent-92, score-0.273]
41 The magenta component in the center models noise, while missing a clump altogether. [sent-93, score-0.187]
42 The single red component on the right is inadequately modeling two clumps of data. [sent-94, score-0.182]
43 PROPOSAL acts in a similar manner, by first finding components that are superfluous via two tests (described in section 3.3): [sent-96, score-0.11]
44 (i) the Evaporation test – which would find the magenta component in Fig. 2(a); [sent-97, score-0.147]
45 and (ii) the Overlap test – which would identify one of the green and blue components in Fig. 2(a). [sent-98, score-0.233]
46 Then their proposal densities are adjusted so that they focus on data that is underpopulated by the model; thus subsequent samples are likely to discover a superior solution. [sent-100, score-0.833]
47 An overview of the algorithm is as follows. Algorithm 1 PROPOSAL – Require: data x; parameters C, π_min, ε. for i = 1 to I_Max do: repeat: • For each component c, compute parameters θ_c^i from P points drawn from the proposal density q_c(x|θ_c). [sent-101, score-0.759]
48 • Compute the likelihood L^i = Π_n p(x_n | π^i, θ_1^i ... θ_C^i). [sent-103, score-0.068]
49 until L^i > L^Best_Rough. • Refine θ^i using EM to give θ* with likelihood L*. [sent-107, score-0.068]
50 if L* > L^Best then: • Update the proposal densities, q(x|θ), using θ*. [sent-108, score-0.574]
51 • Apply the Evaporation and Overlap tests (using parameters π_min and ε). [sent-109, score-0.056]
52 • Reassign the proposal densities of any components failing the above tests. [sent-110, score-0.775]
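A skeleton of the loop in Algorithm 1, offered as a hedged sketch: the EM refinement, the proposal-density update and the Evaporation/Overlap tests are passed in as callables (`refine_em`, `update_proposals`, `failed_components`, `reset_density`), and the inner-loop cap and all names are assumptions of this sketch rather than details from the paper.

```python
import numpy as np

def proposal_search(x, C, P, fit, rough_loglik, refine_em, update_proposals,
                    failed_components, reset_density, n_outer=50,
                    max_inner=1000, seed=0):
    """Skeleton of Algorithm 1 (PROPOSAL). Proposal densities are represented
    as discrete distributions over the N data points, one per component,
    initially uniform."""
    rng = np.random.default_rng(seed)
    N = len(x)
    q = np.full((C, N), 1.0 / N)                    # proposal densities q_c
    best_rough = best_refined = -np.inf
    best_theta = None
    for _ in range(n_outer):
        theta = None
        for _ in range(max_inner):                  # repeat ... until L_i > L^Best_Rough
            cand = [fit(x[rng.choice(N, size=P, replace=False, p=q[c])])
                    for c in range(C)]
            L_i = rough_loglik(x, cand)
            if L_i > best_rough:
                best_rough, theta = L_i, cand
                break
        if theta is None:                           # no improving rough draw found
            continue
        theta_star, L_star = refine_em(x, theta)    # refine with full EM
        if L_star > best_refined:
            best_refined, best_theta = L_star, theta_star
            q = update_proposals(x, theta_star)     # q_c from P(y=c | x, theta*)
            for c in failed_components(theta_star): # Evaporation / Overlap tests
                q[c] = reset_density(q, c)          # max-entropy reset (section 3.4)
    return best_theta, best_refined
```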
53 Fig. 3(c) shows the non-uniform proposal densities for each component on a simulated problem. [sent-117, score-0.767]
54 3.2 Computing model parameters: Each component c has a subset of points picked out by z^i, from which its parameters θ_c^i are estimated. [sent-121, score-0.211]
55 For the Gaussian example of Fig. 3 (footnote 1: recall that z is a vector representing a draw of P points from q(x|θ)), [sent-123, score-0.121]
56 we draw 3 points for each of the 4 Gaussian components, whose mean and covariance matrices are directly computed, using appropriate normalizations to give unbiased estimators of the population parameters. [sent-126, score-0.121]
57 This can be done efficiently since the component parameters are fixed, allowing the pre-computation of p(x|y, θ). [sent-130, score-0.11]
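The rough-model step above estimates only the mixing proportions, with the component densities p(x|y, θ) pre-computed and held fixed. A minimal sketch of that restricted EM (the function name and iteration count are illustrative):

```python
import numpy as np

def estimate_mixing_em(px_given_y, n_iter=50):
    """EM over the mixing proportions only. px_given_y is an (N, C) matrix of
    pre-computed densities p(x_n | y=c, theta_c), which stay fixed throughout."""
    N, C = px_given_y.shape
    pi = np.full(C, 1.0 / C)
    for _ in range(n_iter):
        resp = px_given_y * pi                               # E-step: unnormalized P(y|x)
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300
        pi = resp.mean(axis=0)                               # M-step: new mixing proportions
    loglik = np.sum(np.log((px_given_y * pi).sum(axis=1) + 1e-300))
    return pi, loglik
```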
58 3.3 Updating proposal densities: Having obtained a rough model for draw i with parameters θ^i and likelihood L^i, we first see if its likelihood exceeds the likelihood of the previous best rough model, L^Best_Rough. [sent-133, score-1.162]
59 If this is the case we refine the rough model, to ensure that we are at an actual maximum, since the sampling process limits us to a set of discrete points in θ-space, which are unlikely to be maxima themselves. [sent-134, score-0.256]
60 Running EM again, this time updating all parameters and using θ^i as an initialization, the parameters converge to θ*, having likelihood L*. [sent-135, score-0.153]
61 If L* exceeds a second threshold (the previous best refined model's likelihood) L^Best, then we recompute the proposal densities, as given in (2), using P(y|x, θ*). [sent-136, score-0.597]
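Equation (2), the exact form of the updated proposal densities, is not reproduced in this extract; as an assumed stand-in, the sketch below simply normalizes the posterior responsibilities P(y=c | x_n, θ*) over the data points to obtain one discrete proposal density per component.

```python
import numpy as np

def update_proposals(resp):
    """resp: (N, C) matrix of posterior responsibilities P(y=c | x_n, theta*).
    Returns a (C, N) array whose row c is a discrete proposal density q_c over
    the data points, biased toward the points that component c already explains."""
    q = resp / (resp.sum(axis=0, keepdims=True) + 1e-300)
    return q.T
```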
62 In updating the proposal densities, two tests are applied to θ*: [sent-138, score-0.619]
63 1. Evaporation test: if π_c < π_min, then the component is deemed to model noise, so it is flagged for resetting. [sent-139, score-0.078]
64 2. Overlap test: if for any two components a and b the parameters θ_a^i and θ_b^i are closer than a threshold ε, then the two components are judged to be fitting the same data. [sent-143, score-0.086]
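A sketch of the two tests. The paper's exact overlap criterion is garbled in this extract, so a plain Euclidean distance between component parameter vectors is used here as an assumed stand-in, with π_min and ε supplied by the caller.

```python
import numpy as np

def components_to_reset(mixing_weights, component_params, pi_min, eps):
    """Return indices of components flagged by (i) the Evaporation test
    (mixing weight below pi_min, i.e. the component is modelling noise) and
    (ii) the Overlap test (two components with nearly identical parameters are
    fitting the same data, so one of the pair is flagged)."""
    flagged = {c for c, w in enumerate(mixing_weights) if w < pi_min}
    T = np.asarray(component_params, dtype=float)
    for a in range(len(T)):
        for b in range(a + 1, len(T)):
            if np.linalg.norm(T[a] - T[b]) < eps:
                flagged.add(b)
    return sorted(flagged)
```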
65 3.4 Resetting a proposal density: If a component's proposal density is to be reset, it is given a new density that maximizes the entropy of the mean proposal density q_M(x|θ) = (1/C) Σ_{c=1}^{C} q_c(x|θ). [sent-146, score-1.962]
66 By maximizing the entropy of q_M(x|θ), we are ensuring that the samples will subsequently be drawn as widely as possible, maximizing the chances of escaping from local maxima. [sent-147, score-0.125]
67 If q_d(x|θ), d = 1 ... D, are the proposal densities to be reset, then we wish to maximize H[q_M(x|θ)] = H[ (1/C) Σ_d q_d(x|θ) + ((C−D)/C) q_f(x|θ) ]  (4), where q_f(x|θ) = (1/(C−D)) Σ_{c≠d} q_c(x|θ), with the constraints that Σ_n q_d(x_n|θ) = 1 ∀d and q_d(x_n|θ) ≥ 0 ∀n, d. [sent-148, score-1.714]
68 Since a uniform distribution has the highest entropy but q_d(x|θ) cannot be negative, the optimal choice of q_d(x|θ) will be zero everywhere, except for x corresponding to the smallest k values of q_f(x|θ). [sent-150, score-0.808]
69 At these points qd (x|θ) must add to qf (x|θ) to give a constant qM (x|θ). [sent-151, score-0.347]
70 Thus q_d(x|θ) will be large where q_f(x|θ) is small, giving the appealing result that the new component will draw preferentially from the underpopulated portion of the data, as demonstrated in Fig. 3(d). [sent-153, score-0.636]
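The max-entropy reset described above can be sketched as follows: mass is placed only on the k points where the mean of the remaining proposal densities, q_f, is smallest, topped up toward a constant level so that the mean proposal density flattens. The choice of k and the discrete (per-data-point) representation are assumptions of this sketch.

```python
import numpy as np

def reset_proposal(q, d, k):
    """Reset row d of the (C, N) proposal array q using the max-entropy rule:
    q_d is zero except at the k points where q_f (the mean of the other rows)
    is smallest, where it fills the gap up to the k-th smallest value of q_f."""
    q_f = np.delete(q, d, axis=0).mean(axis=0)
    idx = np.argsort(q_f)[:k]                        # smallest k values of q_f
    new = np.zeros_like(q_f)
    new[idx] = q_f[idx].max() - q_f[idx] + 1e-12     # top up toward a constant
    return new / new.sum()
```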
71 (c) shows the corresponding proposal densities for each component (black is the background model). [sent-175, score-0.854]
72 Note how spiky the green density is, since it is only modeling a few data points. [sent-176, score-0.141]
73 Since πgreen < πmin , its proposal density is set to qd (x|θ), as shown in (d). [sent-177, score-0.863]
74 Note how qd (x|θ) is higher in the areas occupied by the red component which is a poor fit for two clumps of data. [sent-178, score-0.421]
75 (b) The global maximum along with its proposal density (e). [sent-179, score-0.751]
76 4 Experiments. 4.1 Synthetic experiments: We tested PROPOSAL on two types of synthetic data – mixtures of 2-D lines and Gaussians with uniform background noise. [sent-182, score-0.16]
77 The first pair of experiments examined how many components the different algorithms could handle reliably. [sent-185, score-0.086]
78 The second pair tested the robustness to background noise. [sent-186, score-0.125]
79 In the Gaussian experiments, the model consisted of a mixture of 2-D Gaussian densities and a uniform background component. [sent-187, score-0.264]
80 In the line experiments, the model consisted of a mixture of densities modeling the residual to the line with a Gaussian noise model, having a variance σ that was also learnt. [sent-188, score-0.274]
81 Each line component therefore has three parameters – its gradient, y-intercept, and variance. [sent-189, score-0.152]
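For the line mixtures, the three parameters of a component could be fitted from the P points of a draw along the following lines; this is a sketch assuming P ≥ 3, an ordinary least-squares fit, and a residual variance estimated from the leftover degrees of freedom, none of which are specified in this extract.

```python
import numpy as np

def fit_line_component(points):
    """Gradient, y-intercept, and residual Gaussian variance of a 2-D line
    component, estimated from the P points of a single draw (P >= 3)."""
    x, y = points[:, 0], points[:, 1]
    gradient, intercept = np.polyfit(x, y, deg=1)      # least-squares line fit
    residuals = y - (gradient * x + intercept)
    variance = (residuals ** 2).sum() / max(len(points) - 2, 1)
    return gradient, intercept, variance
```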
82 In each experiment, the same amount of time was allocated to each algorithm; so, for example, EM, which ran quickly, was repeated until it had spent the same amount of time as the slowest method (usually PROPOSAL or SMEM), and the best result from the repeated runs was taken. [sent-192, score-0.095]
83 In the first pair of experiments, the number of components was varied from 2 up to 10 for lines and 20 for Gaussians. [sent-197, score-0.086]
84 In the second pair of experiments, C = 3 components were used, with the background noise varying from 1% up to 99%. [sent-202, score-0.208]
85 The x-axis is the number of components ranging from 2 upwards. [sent-229, score-0.086]
86 The y-axis is the portion of correct solutions found from 250 runs, each with a different randomly generated dataset. [sent-230, score-0.086]
87 PROPOSAL is still achieving 75% correct with 10 components - twice the performance of the next best algorithm (SMEM). [sent-235, score-0.109]
89 Figure 5: Experiments showing the robustness to background noise. [sent-263, score-0.125]
90 The x-axis is the portion of noise, varying between 1% and 99%. [sent-264, score-0.086]
91 4.2 Real data experiments: We test PROPOSAL against other clustering methods on the computer vision problem of alpha-matting (the extraction of a foreground element from a background image by estimating the opacity of each pixel of the foreground element; see Figure 6 for examples). [sent-273, score-0.379]
92 The simple approach we adopt is to first form a tri-mask (the composite image is divided into 3 regions: pixels that are definitely foreground, pixels that are definitely background, and uncertain pixels). [sent-274, score-0.26]
93 Two color models are constructed by clustering the foreground and background pixels, respectively, with a mixture of Gaussians. [sent-275, score-0.342]
94 The opacity (alpha values) of the uncertain pixels is then determined by comparing the color of each pixel under the foreground and background color models. [sent-276, score-0.377]
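The exact rule used to turn the two color-model likelihoods into an alpha value is not spelled out in this extract; a simple likelihood-ratio rule of the kind described is sketched below as an assumed stand-in.

```python
import numpy as np

def alpha_from_color_models(p_fg, p_bg):
    """Per-pixel opacity from likelihoods under the foreground and background
    color models (arrays of the same shape); a higher foreground likelihood
    pushes alpha toward 1, a higher background likelihood toward 0."""
    p_fg, p_bg = np.asarray(p_fg, float), np.asarray(p_bg, float)
    return p_fg / (p_fg + p_bg + 1e-300)
```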
95 Figure 7 compares the likelihood of the foreground and background color models clustered using EM, SMEM and PROPOSAL on two sets of images (11 face images and 5 dog images, examples of which are shown in Fig. 6). [sent-277, score-0.418]
96 Each model clusters ∼2×10^4 pixels in a 4-D space (R, G, B, and edge strength) using 10 components. [sent-279, score-0.155]
97 It predominantly examines the small portion of θ-space which has support from the data. [sent-282, score-0.086]
98 The y-axis is the log-likelihood of the foreground color model on the foreground pixels plus the log-likelihood of the background color model on the background pixels. [sent-310, score-0.495]
99 A drawback with PROPOSAL is that P scales with the square of the dimension of the data (due to the number of terms in the covariance matrix), meaning that for high dimensions a very large number of draws would be needed to find new portions of data. [sent-313, score-0.113]
100 Maximum likelihood from incomplete data via the EM algorithm. [sent-326, score-0.271]
wordName wordTfidf (topN-words)
[('proposal', 0.574), ('smem', 0.439), ('qd', 0.239), ('em', 0.203), ('ransac', 0.199), ('mlesac', 0.179), ('daem', 0.14), ('sem', 0.14), ('lbest', 0.12), ('densities', 0.115), ('maxima', 0.103), ('foreground', 0.094), ('green', 0.091), ('background', 0.087), ('components', 0.086), ('portion', 0.086), ('draw', 0.082), ('evaporation', 0.08), ('qm', 0.08), ('component', 0.078), ('draws', 0.077), ('magenta', 0.069), ('qf', 0.069), ('likelihood', 0.068), ('rough', 0.066), ('clumps', 0.06), ('underpopulated', 0.06), ('blue', 0.056), ('synthetic', 0.051), ('density', 0.05), ('sampling', 0.048), ('images', 0.045), ('pixels', 0.045), ('merge', 0.044), ('red', 0.044), ('color', 0.044), ('xn', 0.042), ('line', 0.042), ('google', 0.042), ('consensus', 0.042), ('bottles', 0.04), ('clump', 0.04), ('dagm', 0.04), ('escaping', 0.04), ('opacity', 0.04), ('tordoff', 0.04), ('qc', 0.04), ('mcmc', 0.04), ('mixture', 0.04), ('points', 0.039), ('robustness', 0.038), ('solid', 0.038), ('portions', 0.036), ('overlap', 0.036), ('agged', 0.035), ('matas', 0.035), ('dog', 0.035), ('noise', 0.035), ('drawing', 0.034), ('image', 0.032), ('parameters', 0.032), ('clustering', 0.032), ('cyan', 0.032), ('local', 0.031), ('picked', 0.03), ('samples', 0.03), ('tting', 0.03), ('ueda', 0.03), ('superior', 0.029), ('success', 0.029), ('gaussian', 0.028), ('borrow', 0.028), ('composite', 0.028), ('torr', 0.026), ('ec', 0.026), ('perona', 0.026), ('annealing', 0.026), ('repeated', 0.025), ('discover', 0.025), ('split', 0.025), ('typical', 0.024), ('replacement', 0.024), ('escape', 0.024), ('global', 0.024), ('drawn', 0.024), ('tests', 0.024), ('zi', 0.024), ('dataset', 0.023), ('plain', 0.023), ('toward', 0.023), ('best', 0.023), ('uncertain', 0.023), ('giving', 0.022), ('runs', 0.022), ('uniform', 0.022), ('initially', 0.022), ('world', 0.022), ('li', 0.021), ('reset', 0.021), ('updating', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000001 158 nips-2004-Sampling Methods for Unsupervised Learning
Author: Rob Fergus, Andrew Zisserman, Pietro Perona
Abstract: We present an algorithm to overcome the local maxima problem in estimating the parameters of mixture models. It combines existing approaches from both EM and a robust fitting algorithm, RANSAC, to give a data-driven stochastic learning scheme. Minimal subsets of data points, sufficient to constrain the parameters of the model, are drawn from proposal densities to discover new regions of high likelihood. The proposal densities are learnt using EM and bias the sampling toward promising solutions. The algorithm is computationally efficient, as well as effective at escaping from local maxima. We compare it with alternative methods, including EM and RANSAC, on both challenging synthetic data and the computer vision problem of alpha-matting. 1
2 0.074834786 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
3 0.071680903 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
Author: Anton Schwaighofer, Volker Tresp, Kai Yu
Abstract: We present a novel method for learning with Gaussian process regression in a hierarchical Bayesian framework. In a first step, kernel matrices on a fixed set of input points are learned from data using a simple and efficient EM algorithm. This step is nonparametric, in that it does not require a parametric form of covariance function. In a second step, kernel functions are fitted to approximate the learned covariance matrix using a generalized Nystr¨ m method, which results in a complex, data o driven kernel. We evaluate our approach as a recommendation engine for art images, where the proposed hierarchical Bayesian method leads to excellent prediction performance. 1
4 0.070356436 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
Author: Aharon Bar-hillel, Adam Spiro, Eran Stark
Abstract: Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. It is a complicated problem, which requires a lot of human labor, partly due to the non-stationary nature of the data. We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. Transition probabilities are computed using local stationarity assumptions and are based on a Gaussian version of the Jensen-Shannon divergence. The method was applied to several recordings. The performance appeared almost indistinguishable from humans in a wide range of scenarios, including movement, merges, and splits of clusters. 1
5 0.069595642 164 nips-2004-Semi-supervised Learning by Entropy Minimization
Author: Yves Grandvalet, Yoshua Bengio
Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1
6 0.069580123 77 nips-2004-Hierarchical Clustering of a Mixture Model
7 0.058658045 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
8 0.058120549 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
9 0.05684799 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
10 0.055325408 130 nips-2004-Newscast EM
11 0.053514544 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
12 0.053402212 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
13 0.053126231 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
14 0.05176609 90 nips-2004-Joint Probabilistic Curve Clustering and Alignment
15 0.051351089 168 nips-2004-Semigroup Kernels on Finite Sets
16 0.049182639 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
17 0.048416376 73 nips-2004-Generative Affine Localisation and Tracking
18 0.047146149 161 nips-2004-Self-Tuning Spectral Clustering
19 0.046130419 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning
20 0.041574463 114 nips-2004-Maximum Likelihood Estimation of Intrinsic Dimension
topicId topicWeight
[(0, -0.141), (1, 0.024), (2, -0.048), (3, -0.091), (4, 0.0), (5, -0.0), (6, -0.056), (7, 0.051), (8, -0.007), (9, 0.024), (10, 0.061), (11, 0.072), (12, 0.058), (13, -0.086), (14, 0.019), (15, -0.031), (16, 0.018), (17, 0.016), (18, 0.038), (19, -0.116), (20, -0.032), (21, 0.025), (22, 0.033), (23, 0.051), (24, 0.005), (25, -0.03), (26, 0.027), (27, 0.036), (28, 0.015), (29, -0.003), (30, 0.058), (31, 0.008), (32, -0.029), (33, 0.025), (34, -0.018), (35, 0.074), (36, -0.02), (37, -0.041), (38, 0.051), (39, 0.036), (40, -0.062), (41, -0.047), (42, -0.155), (43, -0.028), (44, 0.013), (45, 0.096), (46, 0.096), (47, 0.062), (48, -0.071), (49, 0.035)]
simIndex simValue paperId paperTitle
same-paper 1 0.94249237 158 nips-2004-Sampling Methods for Unsupervised Learning
Author: Rob Fergus, Andrew Zisserman, Pietro Perona
Abstract: We present an algorithm to overcome the local maxima problem in estimating the parameters of mixture models. It combines existing approaches from both EM and a robust fitting algorithm, RANSAC, to give a data-driven stochastic learning scheme. Minimal subsets of data points, sufficient to constrain the parameters of the model, are drawn from proposal densities to discover new regions of high likelihood. The proposal densities are learnt using EM and bias the sampling toward promising solutions. The algorithm is computationally efficient, as well as effective at escaping from local maxima. We compare it with alternative methods, including EM and RANSAC, on both challenging synthetic data and the computer vision problem of alpha-matting. 1
2 0.64534897 163 nips-2004-Semi-parametric Exponential Family PCA
Author: Sajama Sajama, Alon Orlitsky
Abstract: We present a semi-parametric latent variable model based technique for density modelling, dimensionality reduction and visualization. Unlike previous methods, we estimate the latent distribution non-parametrically which enables us to model data generated by an underlying low dimensional, multimodal distribution. In addition, we allow the components of latent variable models to be drawn from the exponential family which makes the method suitable for special data types, for example binary or count data. Simulations on real valued, binary and count data show favorable comparison to other related schemes both in terms of separating different populations and generalization to unseen samples. 1
3 0.5947035 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
Author: Aharon Bar-hillel, Adam Spiro, Eran Stark
Abstract: Spike sorting involves clustering spike trains recorded by a microelectrode according to the source neuron. It is a complicated problem, which requires a lot of human labor, partly due to the non-stationary nature of the data. We propose an automated technique for the clustering of non-stationary Gaussian sources in a Bayesian framework. At a first search stage, data is divided into short time frames and candidate descriptions of the data as a mixture of Gaussians are computed for each frame. At a second stage transition probabilities between candidate mixtures are computed, and a globally optimal clustering is found as the MAP solution of the resulting probabilistic model. Transition probabilities are computed using local stationarity assumptions and are based on a Gaussian version of the Jensen-Shannon divergence. The method was applied to several recordings. The performance appeared almost indistinguishable from humans in a wide range of scenarios, including movement, merges, and splits of clusters. 1
4 0.54567868 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering
Author: Zhengdong Lu, Todd K. Leen
Abstract: While clustering is usually an unsupervised operation, there are circumstances in which we believe (with varying degrees of certainty) that items A and B should be assigned to the same cluster, while items A and C should not. We would like such pairwise relations to influence cluster assignments of out-of-sample data in a manner consistent with the prior knowledge expressed in the training set. Our starting point is probabilistic clustering based on Gaussian mixture models (GMM) of the data distribution. We express clustering preferences in the prior distribution over assignments of data points to clusters. This prior penalizes cluster assignments according to the degree with which they violate the preferences. We fit the model parameters with EM. Experiments on a variety of data sets show that PPC can consistently improve clustering results.
5 0.5314585 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection
Author: Dan Pelleg, Andrew W. Moore
Abstract: We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points. Our distinction is that the usefulness of anomalies is categorized subjectively by the user. We make two additional assumptions. First, there exist extremely few useful anomalies to be hunted down within a massive dataset. Second, both useful and useless anomalies may sometimes exist within tiny classes of similar anomalies. The challenge is thus to identify “rare category” records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data, but otherwise makes no assumptions on the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands. 1
6 0.5214445 77 nips-2004-Hierarchical Clustering of a Mixture Model
7 0.50926888 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
8 0.49849099 169 nips-2004-Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes
9 0.49835649 130 nips-2004-Newscast EM
10 0.46201459 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
11 0.45333493 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes
12 0.44747028 2 nips-2004-A Direct Formulation for Sparse PCA Using Semidefinite Programming
13 0.44266266 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters
14 0.44033647 164 nips-2004-Semi-supervised Learning by Entropy Minimization
15 0.41855249 172 nips-2004-Sparse Coding of Natural Images Using an Overcomplete Set of Limited Capacity Units
16 0.41546094 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
17 0.41523117 46 nips-2004-Constraining a Bayesian Model of Human Visual Speed Perception
18 0.41305104 50 nips-2004-Dependent Gaussian Processes
19 0.41296551 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis
20 0.40362093 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
topicId topicWeight
[(13, 0.085), (15, 0.118), (26, 0.063), (31, 0.053), (33, 0.145), (35, 0.016), (38, 0.307), (39, 0.021), (50, 0.076)]
simIndex simValue paperId paperTitle
same-paper 1 0.77228993 158 nips-2004-Sampling Methods for Unsupervised Learning
Author: Rob Fergus, Andrew Zisserman, Pietro Perona
Abstract: We present an algorithm to overcome the local maxima problem in estimating the parameters of mixture models. It combines existing approaches from both EM and a robust fitting algorithm, RANSAC, to give a data-driven stochastic learning scheme. Minimal subsets of data points, sufficient to constrain the parameters of the model, are drawn from proposal densities to discover new regions of high likelihood. The proposal densities are learnt using EM and bias the sampling toward promising solutions. The algorithm is computationally efficient, as well as effective at escaping from local maxima. We compare it with alternative methods, including EM and RANSAC, on both challenging synthetic data and the computer vision problem of alpha-matting. 1
2 0.74629045 175 nips-2004-Stable adaptive control with online learning
Author: H. J. Kim, Andrew Y. Ng
Abstract: Learning algorithms have enjoyed numerous successes in robotic control tasks. In problems with time-varying dynamics, online learning methods have also proved to be a powerful tool for automatically tracking and/or adapting to the changing circumstances. However, for safety-critical applications such as airplane flight, the adoption of these algorithms has been significantly hampered by their lack of safety, such as “stability,” guarantees. Rather than trying to show difficult, a priori, stability guarantees for specific learning methods, in this paper we propose a method for “monitoring” the controllers suggested by the learning algorithm online, and rejecting controllers leading to instability. We prove that even if an arbitrary online learning method is used with our algorithm to control a linear dynamical system, the resulting system is stable. 1
3 0.59342611 69 nips-2004-Fast Rates to Bayes for Kernel Machines
Author: Ingo Steinwart, Clint Scovel
Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) with hinge loss. In particular, for SVMs with Gaussian RBF kernels we propose a geometric condition for distributions which can be used to determine approximation properties of these kernels. Finally, we compare our methods with a recent paper of G. Blanchard et al.. 1
4 0.59132552 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
5 0.57938349 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
6 0.57844669 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
7 0.57824993 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
8 0.57617015 136 nips-2004-On Semi-Supervised Classification
9 0.57615858 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
10 0.57315284 103 nips-2004-Limits of Spectral Clustering
11 0.57185853 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling
12 0.57162976 29 nips-2004-Beat Tracking the Graphical Model Way
13 0.57057679 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection
14 0.57054919 178 nips-2004-Support Vector Classification with Input Data Uncertainty
15 0.5704596 28 nips-2004-Bayesian inference in spiking neurons
16 0.56955743 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
17 0.56838655 49 nips-2004-Density Level Detection is Classification
18 0.5683859 187 nips-2004-The Entire Regularization Path for the Support Vector Machine
19 0.56834161 70 nips-2004-Following Curved Regularized Optimization Solution Paths
20 0.5674243 163 nips-2004-Semi-parametric Exponential Family PCA