nips nips2004 nips2004-136 knowledge-graph by maker-knowledge-mining

136 nips-2004-On Semi-Supervised Classification


Source: pdf

Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink

Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Figueiredo, Instituto de Telecomunicações, Instituto Superior Técnico, Portugal. Abstract: A graph-based prior is proposed for parametric semi-supervised classification. [sent-3, score-0.085]

2 The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. [sent-4, score-1.125]

3 An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. [sent-7, score-1.013]

4 Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. [sent-8, score-0.757]

5 1 Introduction In many pattern classification problems, the acquisition of labelled training data is costly and/or time consuming, whereas unlabelled samples can be obtained easily. [sent-10, score-1.138]

6 Semisupervised algorithms that learn from both labelled and unlabelled samples have been the focus of much research in the last few years; a comprehensive review up to 2001 can be found in [13], while more recent references include [1, 2, 6, 7, 16–18]. [sent-11, score-1.028]

7 This regularization is often implemented by associating the vertices of a graph to all the (labelled and unlabelled) samples, and then formulating the problem on the vertices of the graph [6, 16–18]. [sent-16, score-0.37]

8 While current graph-based algorithms are inherently transductive — i.e. [sent-17, score-0.123]

9 They cannot be used directly to classify samples not present during training — our classifier, in contrast, is parametric, and the learned classifier can be used directly on new samples. [sent-19, score-0.225]

10 Unlike existing methods, our algorithm automatically learns the relative importance of the labelled and unlabelled data. [sent-21, score-0.896]

11 When multiple views of the same sample are provided (e.g., from multiple sensors). [sent-22, score-0.106]

12 In addition, we also show how to exploit the unlabelled data and the redundant views of the sample (from co-training) in order to improve active label query selection [15]. [sent-25, score-0.995]

13 Section 3 describes the priors for semi-supervised learning and co-training. [sent-30, score-0.081]

14 2 Multinomial Logistic Regression. In an m-class supervised learning problem, one is given a labelled training set DL = {(x1, y1), … [sent-38, score-0.455]

15 …, (xL, yL)}, where xi ∈ R^d is a feature vector and yi the corresponding class label. [sent-41, score-0.078]

16 yi = [yi^(1), …, yi^(m)] is a binary vector, such that yi^(c) = 1 and yi^(j) = 0 for j ≠ c, indicating that sample i belongs to class c. [sent-45, score-0.224]

17 In multinomial logistic regression [5], the posterior class probabilities are modelled as log P(y^(c) = 1 | x) = x^T w^(c) − log Σ_{k=1}^{m} exp(x^T w^(k)), for c = 1, …, m. [sent-46, score-0.445]

18 Given the labels Y = {y1, …, yL} [5], the log-likelihood is ℓ(w) ≡ log P(Y|w) = Σ_{i=1}^{L} [ Σ_{c=1}^{m} yi^(c) xi^T w^(c) − log Σ_{j=1}^{m} exp(xi^T w^(j)) ]. (2) [sent-57, score-0.139]

19 In the presence of a prior p(w), we seek a maximum a posteriori (MAP) estimate, ŵ = arg max_w {ℓ(w) + log p(w)}. [sent-58, score-0.137]

20 Actually, if the training data is separable, ℓ(w) is unbounded, and a prior is crucial. [sent-59, score-0.115]
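
To make this concrete, here is a minimal NumPy sketch of the log-likelihood (2) and the resulting MAP objective. It is not code from the paper; the array shapes and the class-wise application of a single shared prior precision Gamma are illustrative assumptions.

```python
import numpy as np

def log_likelihood(W, X, Y):
    """Multinomial logistic log-likelihood, as in (2).

    W : (d, m) class weight vectors, X : (L, d) features,
    Y : (L, m) one-hot ("1-of-m") label vectors."""
    scores = X @ W                                 # (L, m) entries x_i^T w^(c)
    log_norm = np.log(np.exp(scores).sum(axis=1))  # log sum_j exp(x_i^T w^(j))
    return np.sum(Y * scores) - np.sum(log_norm)

def map_objective(W, X, Y, Gamma):
    """MAP criterion l(w) + log p(w) for a Gaussian prior with precision Gamma,
    applied here to each class weight vector independently (an assumption)."""
    log_prior = -0.5 * sum(W[:, c] @ Gamma @ W[:, c] for c in range(W.shape[1]))
    return log_likelihood(W, X, Y) + log_prior
```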

21 Consider the set {1, …, |V|} of vertices of an undirected graph (V, E). [sent-73, score-0.141]

22 Each edge of the graph, joining vertices i and j, is given a weight kij = kji ≥ 0, and we collect all the weights in a |V | × |V | matrix K. [sent-74, score-0.169]

23 A natural way to measure how much f varies across the graph is by the quantity Σ_i Σ_j kij (fi − fj)^2 = 2 f^T ∆ f, (3) where ∆ = diag{Σ_j k1j, … [sent-75, score-0.157]

24 …, Σ_j k|V|j} − K is the so-called graph Laplacian [2]. [sent-78, score-0.079]

25 Notice that kij ≥ 0 (for all i, j) guarantees that ∆ is positive semi-definite and also that ∆ has (at least) one null eigenvalue (1^T ∆ 1 = 0, where 1 has all elements equal to one). [sent-79, score-0.078]
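
A small sketch of how the Laplacian in (3) can be assembled from pairwise similarities; the RBF similarity used for kij is an assumed choice, since the extracted text does not specify how the edge weights are computed.

```python
import numpy as np

def graph_laplacian(X, gamma=1.0):
    """Build the similarity matrix K and the Laplacian Delta = diag(K 1) - K."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)        # k_ij >= 0 and symmetric
    Delta = np.diag(K.sum(axis=1)) - K   # positive semi-definite, Delta @ 1 = 0
    return K, Delta

def smoothness(f, Delta):
    """The quantity sum_ij k_ij (f_i - f_j)^2 = 2 f^T Delta f from (3)."""
    return 2.0 * f @ Delta @ f
```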

26 In semi-supervised learning, in addition to DL, we are given U unlabelled samples DU = {xL+1, …, xL+U}. [sent-80, score-0.658]

27 To use (3) for semi-supervised learning, the usual choice is to assign one vertex of the graph to each sample in X = [x1, … [sent-84, score-0.132]

28 …, xL+U]^T (thus |V| = L + U), and to let kij represent some (non-negative) measure of “similarity” between xi and xj. [sent-87, score-0.12]

29 In contrast to previous uses of graph-based priors, we define f as the real function f (defined over the entire observation space) evaluated at the graph nodes. [sent-93, score-0.079]

30 Specifically, f is defined as a linear function of x, and at graph node i, fi ≡ f(xi) = w^T xi. [sent-94, score-0.121]

31 Thus f = [f1, …, f|V|]^T = Xw, and p(f) induces a Gaussian prior on w, with precision matrix A = X^T ∆ X: p(w) ∝ exp{−(λ/2) w^T X^T ∆ X w} = exp{−(λ/2) w^T A w}. (4) [sent-98, score-0.173]

32 Notice that since ∆ is singular, A may also be singular, and the corresponding prior may therefore be improper. [sent-99, score-0.085]

33 This is no problem for MAP estimation of w because (as is well known) the normalization factor of the prior plays no role in this estimate. [sent-100, score-0.085]

34 If we include extra regularization, by adding a non-negative diagonal matrix to A, the prior becomes p(w) ∝ exp{−(1/2) w^T (λ0 A + Λ) w}, (5) where we may choose Λ = diag{λ1, … [sent-101, score-0.169]
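
Combining (4) and (5), the prior on w is a (possibly improper) Gaussian whose precision is λ0 X^T ∆ X + Λ. The sketch below assembles that precision; the default Λ values are placeholders of my own, not values from the paper.

```python
import numpy as np

def prior_precision(X_all, Delta, lam0=1.0, lam_diag=None):
    """Precision of the graph-based Gaussian prior: lam0 * X^T Delta X + Lambda."""
    A = X_all.T @ Delta @ X_all                    # graph-smoothness term, as in (4)
    if lam_diag is None:
        lam_diag = 1e-3 * np.ones(X_all.shape[1])  # illustrative extra regularization
    return lam0 * A + np.diag(lam_diag)            # the precision in (5)

def log_prior(w, Prec):
    """Unnormalized log of the (possibly improper) Gaussian prior on w."""
    return -0.5 * w @ Prec @ w
```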

35 Finally, since all the λ's are inverses of variances, the conjugate priors are Gamma [3]: p(λ0^(c) | α0, β0) = Ga(λ0^(c) | α0, β0) and p(λi^(c) | α1, β1) = Ga(λi^(c) | α1, β1), for c = 1, …, m. [sent-122, score-0.111]

36 Summarizing, our model for semi-supervised learning includes the log-likelihood (2), a prior (6), and Gamma hyper-priors. [sent-131, score-0.085]

37 3 Exploiting Features from Multiple Sensors: The Co-Training Prior In some applications several sensors are available, each providing a different set of features. [sent-134, score-0.202]

38 For simplicity, we assume two sensors s ∈ {1, 2}, but everything discussed here is easily extended to any number of sensors. [sent-135, score-0.202]

39 Denote the features from sensor s, for sample i, as xi^(s), and Ss as the set of sample indices for which we have features from sensor s (S1 ∪ S2 = {1, … [sent-136, score-0.49]

40 Let O = S1 ∩ S2 be the indices for which both sensors are available, and OU = O ∩ {L+1, …, L+U}. [sent-140, score-0.202]

41 By using the samples in S1 and S2 as two independent training sets, we may obtain two separate classifiers (denoted w1 and w2 ). [sent-144, score-0.162]

42 However, we can coordinate the information from both sensors by using an idea known as co-training [4]: on the OU samples, classifiers w1 and w2 should agree as much as possible. [sent-145, score-0.202]

43 Notice that, in a logistic regression framework, the disagreement between the two classifiers on the OU samples can be measured by Σ_{i∈OU} [(w1)^T xi^(1) − (w2)^T xi^(2)]^2 = ω^T C ω, (8) where ω = [(w1)^T (w2)^T]^T and C = Σ_{i∈OU} [(xi^(1))^T (−xi^(2))^T]^T [(xi^(1))^T (−xi^(2))^T]. [sent-146, score-0.403]

44 This Gaussian prior can be combined with two smoothness Gaussian priors on w1 and w2 (obtained as described in Section 3. [sent-148, score-0.166]

45 2); this leads to a prior which is still Gaussian, p(w1, w2) = p(ω) ∝ exp{−(1/2) ω^T (λco C + block-diag{Γ1, Γ2}) ω}, (10) where Γ1 and Γ2 are the two graph-based precision matrices (see (7)) for w1 and w2. [sent-149, score-0.199]

46 Under this prior, and with a logistic regression likelihood as above, estimates of w1 and w2 can easily be found using minor modifications to the EM algorithm described in Section 4. [sent-151, score-0.187]
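
As a rough illustration of how (8)–(10) fit together, the sketch below builds the disagreement matrix C from the overlap samples and combines it with the two graph-based precisions. Variable names such as X1_ou and X2_ou are my own, not the paper's notation.

```python
import numpy as np
from scipy.linalg import block_diag

def cotraining_precision(X1_ou, X2_ou, Gamma1, Gamma2, lam_co=1.0):
    """Precision of the joint prior on omega = [w1; w2], as in (10).

    X1_ou, X2_ou : features of the overlap samples OU (rows aligned),
    Gamma1, Gamma2 : graph-based precisions of the two single-sensor priors."""
    Z = np.hstack([X1_ou, -X2_ou])   # each row is [x_i^(1)^T, -x_i^(2)^T]
    C = Z.T @ Z                      # sum_i z_i z_i^T, so omega^T C omega is (8)
    return lam_co * C + block_diag(Gamma1, Gamma2)
```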

47 4 Learning Via EM To find the MAP estimate w, we use the EM algorithm, with λ as missing data, which is equivalent to integrating out λ from the full posterior before maximization [8]. [sent-153, score-0.075]

48 For simplicity, we will only describe the single sensor case (no co-training). [sent-154, score-0.133]

49 Since log p(w, λ|Y) = log p(Y|w) − (1/2) w^T Γ(λ) w + K, (11) (where K collects all terms independent of w) is linear w.r.t. [sent-156, score-0.104]

50 all the λ parameters (see (6) and (7)), we just have to plug their conditional expectations into (11): Q(w | ŵ) = log p(Y|w) − (1/2) w^T E[Γ(λ) | ŵ] w = ℓ(w) − (1/2) w^T Υ(ŵ) w. [sent-159, score-0.082]

51 The necessary expectations have well-known closed forms, due to the use of conjugate Gamma hyper-priors [3]. [sent-161, score-0.141]
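
For a Gamma prior on a Gaussian precision, the posterior mean is the textbook E[λ | w] = (α + 1/2)/(β + w²/2); the one-line sketch below assumes this standard form for the diagonal precisions λi (the extracted text only states that closed forms exist, so the exact expression is an assumption).

```python
import numpy as np

def e_step_lambda(w, alpha=1.0, beta=1.0):
    """E-step expectation E[lambda_i | w] for Gamma(alpha, beta) hyper-priors on
    the diagonal precisions; standard conjugate Gamma/Gaussian result (assumed)."""
    return (alpha + 0.5) / (beta + 0.5 * w ** 2)   # element-wise over the entries of w
```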

52 M-step: Given the matrix Υ(ŵ), the M-step reduces to a logistic regression problem with a quadratic regularizer, i.e. [sent-169, score-0.216]

53 The sequential algorithm visits one particular element of w, say wu , and updates its estimate by maximizing the bound derived above, while keeping all other variables fixed at their previous values. [sent-180, score-0.124]

54 This leads to wu^new = wu + [gu(w) − (Υ(w)w)u] [(B + Υ(w))uu]^{−1}, (13) and wv^new = wv, for v ≠ u. [sent-181, score-0.15]

55 5 Active Label Selection. If we are allowed to obtain the label for one of the unlabelled samples, the following question arises: which sample, if labelled, would provide the most information? [sent-186, score-0.664]

56 Our approach uses a Laplace approximation of the posterior, p(w|Y) ≈ N(w | ŵ, H^{−1}), where H is the posterior precision matrix. [sent-188, score-0.151]

57 This approximation is known to be accurate for logistic regression under a Gaussian prior [14]. [sent-191, score-0.272]

58 Now let x∗ ∈ DU be an unlabelled sample and y ∗ its label. [sent-193, score-0.579]

59 Accepting it implies that after labeling x∗, and regardless of y∗, the posterior precision changes to H′ = H + (diag{p∗} − p∗ p∗^T) ⊗ x∗ x∗^T. (14) [sent-197, score-0.105]

60 Since the entropy of a Gaussian with precision H is (−1/2) log |H| (up to an additive constant), the mutual information (MI) between y∗ and w (i.e., the expected reduction in the entropy of w) follows from the change in log |H|. [sent-198, score-0.148]

61 Our criterion is then: the best sample to label is the one that maximizes I(w; y ∗ ). [sent-201, score-0.137]
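
A sketch of this selection rule: every candidate x∗ yields an updated precision as in (14), and the candidate with the largest log-determinant increase is queried. The weight layout implied by the Kronecker ordering, and all function names, are assumptions of this sketch.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def mi_score(H, W, x_star):
    """I(w; y*) up to constants: 0.5 * (log|H'| - log|H|), where
    H' = H + (diag(p*) - p* p*^T) kron (x* x*^T), as in (14)."""
    p = softmax(W.T @ x_star)   # predicted class probabilities p*
    H_new = H + np.kron(np.diag(p) - np.outer(p, p), np.outer(x_star, x_star))
    _, logdet_new = np.linalg.slogdet(H_new)
    _, logdet_old = np.linalg.slogdet(H)
    return 0.5 * (logdet_new - logdet_old)

def select_query(H, W, X_unlabelled):
    """Return the index of the unlabelled sample whose label is most informative."""
    return int(np.argmax([mi_score(H, W, x) for x in X_unlabelled]))
```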

62 It is large for samples with high variance of the corresponding class probability estimate. [sent-209, score-0.168]

63 Summarizing, (15) favors samples with uncertain class labels and high uncertainty in the class probability estimate. [sent-210, score-0.239]

64 6 Experimental Results We begin by presenting two-dimensional synthetic examples to visually illustrate our semisupervised classifier. [sent-211, score-0.086]

65 Fig. 1 shows the utility of using unlabelled data to improve the decision boundary. (Figure 1 caption: Synthetic two-dimensional examples.) [sent-213, score-0.086]

66 (a) Comparison of the supervised logistic linear classifier (boundary shown as dashed line) learned only from the labelled data (shown in color) with the proposed semi-supervised classifier (boundary shown as solid line) which also uses the unlabelled samples (shown as dots). [sent-214, score-1.269]

67 (b) An RBF kernel classifier obtained by our algorithm, using two labelled samples (shaded circles) and many unlabelled samples. [sent-215, score-1.028]

68 Figure 2: (a)-(c) Accuracy (on UCI datasets) of the proposed method, the supervised SVM, and the other semi-supervised classifiers mentioned in the text; a subset of samples is labelled and the others are treated as unlabelled samples. [sent-216, score-1.083]

69 In (d), a separate holdout set is used to evaluate the accuracy of our method versus the amount of labelled and unlabelled data. [sent-217, score-0.896]

70 This holds for the decision boundary in both linear and non-linear (kernel) classifiers (see figure caption for details). [sent-218, score-0.094]

71 We compare our method against state-of-the-art semi-supervised classifiers: the GRF method of [18], the SGT method of [10], and the transductive SVM (TSVM) of [9]. [sent-221, score-0.123]

72 Each result in Fig. 2(a)-(c) is an average of 20 trials: we randomly select 20 labelled sets, which are used by every method. [sent-227, score-0.37]

73 All remaining samples are used as unlabelled by the semi-supervised algorithms. [sent-228, score-0.688]

74 The results in Fig. 2(a)-(c) are transductive, in the sense that the unlabelled and test data are the same. [sent-230, score-0.526]

75 Our logistic GRF is non-transductive: after being trained, it may be applied to classify new data without re-training. [sent-231, score-0.16]

76 Training took place using labelled and unlabelled data, and testing was performed on 200 new unseen samples. [sent-234, score-0.896]

77 The results suggest that semi-supervised classifiers are most relevant when the labelled set is small relative to the unlabelled set (as is often the case). [sent-235, score-0.896]

78 Two sensors were used: (1) a 70-band hyper-spectral electro-optic (EOIR) sensor; (2) an X-band synthetic aperture radar (SAR). [sent-240, score-0.242]

79 123 samples have features from the EOIR sensor alone, 398 from the… (Figure 3 caption: (a) Land mine detection ROC curves of classifiers designed using only hyperspectral (EOIR) features, only SAR features, and both.) [sent-242, score-0.336]

80 (b) Number of landmines detected during the active querying process (dotted lines), for active training and random selection (for the latter the bars reflect one standard deviation about the mean). [sent-243, score-0.479]

81 ROC curves (solid) are for the learned classifier as applied to the remaining samples. [sent-244, score-0.093]

82 For the purely supervised case, a sparseness prior is used (as in [14]). [sent-248, score-0.176]

83 For the data for which only one sensor is available, 20% of it is labelled (selected randomly). [sent-250, score-0.503]

84 For the data for which both sensors are available, 80% is labelled (again selected randomly). [sent-251, score-0.572]

85 The results in Fig. 3(a) show that, in general, the semi-supervised classifiers outperform the corresponding supervised ones, and the classifier learned from both sensors is markedly superior to classifiers learned from either sensor alone. [sent-253, score-0.45]

86 For comparison, we also show average results over 100 independent realizations for random label query selection (error bars indicate one standard deviation). [sent-256, score-0.225]

87 The curves in Fig. 3(b) are plotted in two stages: first, mines and clutter are selected during the labeling process (dashed curves); then, the 100 labelled examples are used to build the final semi-supervised classifier, for which the ROC curve is obtained using the remaining unlabelled data (solid curves). [sent-258, score-1.04]

88 Interestingly, the active-learning algorithm finds almost half of the mines while querying for labels. [sent-259, score-0.153]

89 Due to physical limitations of the sensors, the rate at which mines are detected drops precipitously after approximately 90 mines are detected — i.e. [sent-260, score-0.338]

90 That is, the remaining mines are poorly matched to the sensor physics. [sent-262, score-0.277]

91 Transductive: Unlike most earlier methods, after the training stage our algorithm can directly classify new samples without computationally expensive re-training. [sent-265, score-0.195]

92 Tradeoff between labelled and unlabelled data: Automatically addressing the inherent tradeoff between their relative contributions, we have ensured that even a small amount of labelled data does not get overlooked because of an abundance of unlabelled samples. [sent-266, score-1.837]

93 Bayesian co-training: Using the proposed prior, classifiers for all sensors are improved using: (a) the label information provided on the other types of data, and (b) samples drawn from the joint distribution of features from multiple sensors. [sent-267, score-0.456]

94 Active label acquisition: We explicitly account for the knowledge of the unlabelled data and the co-training information while computing the well known mutual information criterion. [sent-268, score-0.647]

95 2 Quality of Assumptions and Empirically Observed Shortcomings The assumption that the mode of the posterior distribution of the classifier remains unchanged after seeing an additional label is clearly not true at the beginning of the active learning procedure. [sent-270, score-0.308]

96 However, we have empirically found it to be a very good approximation once the active learning procedure has yielded as few as 15 labels. [sent-271, score-0.14]

97 This assumption allows a tremendous saving in the computational cost, since it helps us avoid repeated re-training of classifiers in the active label acquisition process while evaluating candidate queries. [sent-272, score-0.272]

98 A shortcoming that has been noted before (e.g., in [12]) and that we have confirmed (in unreported experiments) is that the error rate of the active query selection increases slightly when the number of labelled samples grows beyond an optimal number. [sent-275, score-0.752]

99 Support vector machine active learning with applications to text classification. [sent-372, score-0.14]

100 Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. [sent-394, score-0.14]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('unlabelled', 0.526), ('labelled', 0.37), ('sensors', 0.202), ('active', 0.14), ('ou', 0.134), ('classi', 0.133), ('sensor', 0.133), ('samples', 0.132), ('wt', 0.13), ('logistic', 0.127), ('eoir', 0.123), ('sar', 0.123), ('xt', 0.123), ('transductive', 0.123), ('mines', 0.114), ('ers', 0.112), ('grf', 0.092), ('prior', 0.085), ('label', 0.084), ('xl', 0.082), ('diag', 0.081), ('priors', 0.081), ('er', 0.079), ('graph', 0.079), ('kij', 0.078), ('multinomial', 0.072), ('gamma', 0.072), ('co', 0.07), ('query', 0.066), ('em', 0.065), ('vertices', 0.062), ('instituto', 0.062), ('sion', 0.062), ('wnew', 0.062), ('xw', 0.062), ('regression', 0.06), ('precision', 0.059), ('supervised', 0.055), ('detected', 0.055), ('exp', 0.055), ('shortcomings', 0.054), ('wv', 0.054), ('jeffreys', 0.054), ('sample', 0.053), ('views', 0.053), ('log', 0.052), ('maybe', 0.049), ('summarizing', 0.049), ('icml', 0.049), ('wu', 0.048), ('acquisition', 0.048), ('maximizing', 0.047), ('regularization', 0.047), ('posterior', 0.046), ('semisupervised', 0.046), ('tradeoff', 0.045), ('selection', 0.044), ('roc', 0.044), ('aw', 0.043), ('land', 0.043), ('contributions', 0.042), ('xi', 0.042), ('formulating', 0.041), ('dl', 0.041), ('du', 0.041), ('map', 0.041), ('gaussian', 0.04), ('synthetic', 0.04), ('notice', 0.039), ('querying', 0.039), ('features', 0.038), ('pi', 0.038), ('ga', 0.038), ('unchanged', 0.038), ('mutual', 0.037), ('sparseness', 0.036), ('class', 0.036), ('bayesian', 0.035), ('uncertain', 0.035), ('hessian', 0.035), ('belkin', 0.035), ('yi', 0.035), ('classify', 0.033), ('lafferty', 0.033), ('curves', 0.033), ('boundary', 0.032), ('costly', 0.032), ('bars', 0.031), ('colt', 0.031), ('expectations', 0.03), ('conjugate', 0.03), ('training', 0.03), ('remaining', 0.03), ('learned', 0.03), ('belongs', 0.03), ('solid', 0.029), ('pt', 0.029), ('redundant', 0.029), ('estimate', 0.029), ('matrix', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000002 136 nips-2004-On Semi-Supervised Classification

Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink

Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1

2 0.12814201 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1

3 0.11980105 110 nips-2004-Matrix Exponential Gradient Updates for On-line Learning and Bregman Projection

Author: Koji Tsuda, Gunnar Rätsch, Manfred K. Warmuth

Abstract: We address the problem of learning a symmetric positive definite matrix. The central issue is to design parameter updates that preserve positive definiteness. Our updates are motivated with the von Neumann divergence. Rather than treating the most general case, we focus on two key applications that exemplify our methods: On-line learning with a simple square loss and finding a symmetric positive definite matrix subject to symmetric linear constraints. The updates generalize the Exponentiated Gradient (EG) update and AdaBoost, respectively: the parameter is now a symmetric positive definite matrix of trace one instead of a probability vector (which in this context is a diagonal positive definite matrix with trace one). The generalized updates use matrix logarithms and exponentials to preserve positive definiteness. Most importantly, we show how the analysis of each algorithm generalizes to the non-diagonal case. We apply both new algorithms, called the Matrix Exponentiated Gradient (MEG) update and DefiniteBoost, to learn a kernel matrix from distance measurements.

4 0.11844846 23 nips-2004-Analysis of a greedy active learning strategy

Author: Sanjoy Dasgupta

Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1

5 0.10569587 178 nips-2004-Support Vector Classification with Input Data Uncertainty

Author: Jinbo Bi, Tong Zhang

Abstract: This paper investigates a new learning model in which the input data is corrupted with noise. We present a general statistical framework to tackle this problem. Based on the statistical reasoning, we propose a novel formulation of support vector classification, which allows uncertainty in input data. We derive an intuitive geometric interpretation of the proposed formulation, and develop algorithms to efficiently solve it. Empirical results are included to show that the newly formed method is superior to the standard SVM for problems with noisy input. 1

6 0.10390649 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

7 0.1001756 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

8 0.094226688 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

9 0.092136726 98 nips-2004-Learning Gaussian Process Kernels via Hierarchical Bayes

10 0.08981435 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

11 0.088935815 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

12 0.085108057 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

13 0.084505543 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

14 0.082262099 166 nips-2004-Semi-supervised Learning via Gaussian Processes

15 0.081682891 42 nips-2004-Computing regularization paths for learning multiple kernels

16 0.08070831 82 nips-2004-Incremental Algorithms for Hierarchical Classification

17 0.07880532 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

18 0.078581505 175 nips-2004-Stable adaptive control with online learning

19 0.078069083 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

20 0.077424467 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.27), (1, 0.074), (2, 0.031), (3, 0.095), (4, 0.095), (5, -0.032), (6, 0.02), (7, 0.074), (8, 0.037), (9, 0.01), (10, 0.021), (11, 0.098), (12, -0.028), (13, -0.083), (14, -0.025), (15, 0.009), (16, -0.001), (17, -0.093), (18, 0.115), (19, -0.057), (20, 0.037), (21, 0.117), (22, 0.006), (23, 0.062), (24, -0.008), (25, 0.055), (26, 0.042), (27, -0.033), (28, 0.009), (29, 0.006), (30, 0.015), (31, -0.018), (32, 0.022), (33, 0.063), (34, -0.024), (35, -0.035), (36, -0.067), (37, -0.079), (38, 0.09), (39, 0.126), (40, -0.022), (41, -0.069), (42, 0.191), (43, 0.092), (44, -0.076), (45, 0.054), (46, 0.005), (47, -0.059), (48, 0.026), (49, 0.05)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94165838 136 nips-2004-On Semi-Supervised Classification

Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink

Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1

2 0.62830722 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

Author: Omid Madani, David M. Pennock, Gary W. Flake

Abstract: In the context of binary classification, we define disagreement as a measure of how often two independently-trained models differ in their classification of unlabeled data. We explore the use of disagreement for error estimation and model selection. We call the procedure co-validation, since the two models effectively (in)validate one another by comparing results on unlabeled data, which we assume is relatively cheap and plentiful compared to labeled data. We show that per-instance disagreement is an unbiased estimate of the variance of error for that instance. We also show that disagreement provides a lower bound on the prediction (generalization) error, and a tight upper bound on the “variance of prediction error”, or the variance of the average error across instances, where variance is measured across training sets. We present experimental results on several data sets exploring co-validation for error estimation and model selection. The procedure is especially effective in active learning settings, where training sets are not drawn at random and cross validation overestimates error. 1

3 0.60824865 164 nips-2004-Semi-supervised Learning by Entropy Minimization

Author: Yves Grandvalet, Yoshua Bengio

Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which enables to incorporate unlabeled data in the standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. The performances are definitely in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to the violation of the “cluster assumption”. Finally, we also illustrate that the method can also be far superior to manifold learning in high dimension spaces. 1

4 0.59052205 23 nips-2004-Analysis of a greedy active learning strategy

Author: Sanjoy Dasgupta

Abstract: We abstract out the core search problem of active learning schemes, to better understand the extent to which adaptive labeling can improve sample complexity. We give various upper and lower bounds on the number of labels which need to be queried, and we prove that a popular greedy active learning rule is approximately as good as any other strategy for minimizing this number of labels. 1

5 0.5675059 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

Author: Isabelle Guyon, Steve Gunn, Asa Ben-Hur, Gideon Dror

Abstract: The NIPS 2003 workshops included a feature selection competition organized by the authors. We provided participants with five datasets from different application domains and called for classification results using a minimal number of features. The competition took place over a period of 13 weeks and attracted 78 research groups. Participants were asked to make on-line submissions on the validation and test sets, with performance on the validation set being presented immediately to the participant and performance on the test set presented to the participants at the workshop. In total 1863 entries were made on the validation sets during the development period and 135 entries on all test sets for the final competition. The winners used a combination of Bayesian neural networks with ARD priors and Dirichlet diffusion trees. Other top entries used a variety of methods for feature selection, which combined filters and/or wrapper or embedded methods using Random Forests, kernel methods, or neural networks as a classification engine. The results of the benchmark (including the predictions made by the participants and the features they selected) and the scoring software are publicly available. The benchmark is available at www.nipsfsc.ecs.soton.ac.uk for post-challenge submissions to stimulate further research. 1

6 0.54052973 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

7 0.53484976 166 nips-2004-Semi-supervised Learning via Gaussian Processes

8 0.52136827 15 nips-2004-Active Learning for Anomaly and Rare-Category Detection

9 0.50901359 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

10 0.50863039 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

11 0.50624847 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

12 0.50066739 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

13 0.49532628 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

14 0.49183199 149 nips-2004-Probabilistic Inference of Alternative Splicing Events in Microarray Data

15 0.48405576 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants

16 0.47608954 143 nips-2004-PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data

17 0.47120282 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

18 0.47077531 8 nips-2004-A Machine Learning Approach to Conjoint Analysis

19 0.45055497 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

20 0.44839162 82 nips-2004-Incremental Algorithms for Hierarchical Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.09), (15, 0.113), (26, 0.044), (31, 0.041), (33, 0.144), (35, 0.012), (39, 0.019), (50, 0.457)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.89742815 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)

Author: Kumar Chellapilla, Patrice Y. Simard

Abstract: Machine learning is often used to automatically solve human tasks. In this paper, we look for tasks where machine learning algorithms are not as good as humans with the hope of gaining insight into their current limitations. We studied various Human Interactive Proofs (HIPs) on the market, because they are systems designed to tell computers and humans apart by posing challenges presumably too hard for computers. We found that most HIPs are pure recognition tasks which can easily be broken using machine learning. The harder HIPs use a combination of segmentation and recognition tasks. From this observation, we found that building segmentation tasks is the most effective way to confuse machine learning algorithms. This has enabled us to build effective HIPs (which we deployed in MSN Passport), as well as design challenging segmentation tasks for machine learning algorithms. 1 In t rod u ct i on The OCR problem for high resolution printed text has virtually been solved 10 years ago [1]. On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. Is there a fundamental difference between these two seemingly similar problems? To shed more light on this question, we study problems that have been designed to be difficult for computers. The hope is that we will get some insight on what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences. Work on distinguishing computers from humans traces back to the original Turing Test [2] which asks that a human distinguish between another human and a machine by asking questions of both. Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. Such systems have been termed Human Interactive Proofs (HIPs) [3] or Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) [4]. An overview of the work in this area can be found in [5]. Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. Ideally a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0.01% of the time. This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack. This constraint of generating tasks that are failed 99.99% of the time by all automated algorithms has generated various solutions which can easily be sampled on the internet. Seven different HIPs, namely, Mailblocks, MSN (before April 28th, 2004), Ticketmaster, Yahoo, Yahoo v2 (after Sept’04), Register, and Google, will be given as examples in the next section. We will show in Section 3 that machinelearning-based attacks are far more successful than 1 in 10,000. Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. 
The final section discusses a (known) weakness of machine learning algorithms and suggests designing simple artificial datasets for studying this weakness. 2 Exa mp les o f H I Ps The HIPs explored in this paper are made of characters (or symbols) rendered to an image and presented to the user. Solving the HIP requires identifying all characters in the correct order. The following HIPs can be sampled from the web: Mailblocks: While signing up for free email service with (www.mailblocks.com), you will find HIP challenges of the type: mailblocks MSN: While signing up for free e-mail with MSN Hotmail (www.hotmail.com), you will find HIP challenges of the type: Register.com: While requesting a whois lookup for a domain at www.register.com, you will HIP challenges of the type: Yahoo!/EZ-Gimpy (CMU): While signing up for free e-mail service with Yahoo! (www.yahoo.com), you will receive HIP challenges of the type: Yahoo! (version 2): Starting in August 2004, Yahoo! introduced their second generation HIP. Three examples are presented below: Ticketmaster: While looking for concert tickets at www.ticketmaster.com, you will receive HIP challenges of the type: Google/Gmail: While signing up for free e-mail with Gmail at www.google.com, one will receive HIP challenges of the type: While solutions to Yahoo HIPs are common English words, those for ticketmaster and Google do not necessarily belong to the English dictionary. They appear to have been created using a phonetic generator [8]. 3 Usi n g ma ch i n e lea rn i n g t o b rea k H IP s Breaking HIPs is not new. Mori and Malik [7] have successfully broken the EZGimpy (92% success) and Gimpy (33% success) HIPs from CMU. Our approach aims at an automatic process for solving multiple HIPs with minimum human intervention, using machine learning. In this paper, our main goal is to learn more about the common strengths and weaknesses of these HIPs rather than to prove that we can break any one HIP in particular with the highest possible success rate. We have results for six different HIPs: EZ-Gimpy/Yahoo, Yahoo v2, mailblocks, register, ticketmaster, and Google. To simplify our study, we will not be using language models in our attempt to break HIPs. For example, there are only about 600 words in the EZ-Gimpy dictionary [7], which means that a random guess attack would get a success rate of 1 in 600 (more than enough to break the HIP, i.e., greater than 0.01% success). HIPs become harder when no language model is used. Similarly, when a HIP uses a language model to generate challenges, success rate of attacks can be significantly improved by incorporating the language model. Further, since the language model is not common to all HIPs studied, it was not used in this paper. Our generic method for breaking all of these HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation, or finding the characters, is simple for many HIPs which makes the process of breaking the HIP particularly easy. Gimpy uses a single constant predictable color (black) for letters even though the background color changes. We quickly realized that once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. Our recognition engine is based on neural networks [6][9]. It yielded a 0.4% error rate on the MNIST database, uses little memory, and is very fast for recognition (important for breaking HIPs). 
For each HIP, we have a segmentation step, followed by a recognition step. It should be stressed that we are not trying to solve every HIP of a given type i.e., our goal is not 100% success rate, but something efficient that can achieve much better than 0.01%. In each of the following experiments, 2500 HIPs were hand labeled and used as follows (a) recognition (1600 for training, 200 for validation, and 200 for testing), and (b) segmentation (500 for testing segmentation). For each of the five HIPs, a convolution neural network, identical to the one described in [6], was trained and tested on gray level character images centered on the guessed character positions (see below). The trained neural network became the recognizer. 3.1 M a i l b l oc k s To solve the HIP, we select the red channel, binarize and erode it, extract the largest connected components (CCs), and breakup CCs that are too large into two or three adjacent CCs. Further, vertically overlapping half character size CCs are merged. The resulting rough segmentation works most of the time. Here is an example: For instance, in the example above, the NN would be trained, and tested on the following images: … The end-to-end success rate is 88.8% for segmentation, 95.9% for recognition (given correct segmentation), and (0.888)*(0.959)7 = 66.2% total. Note that most of the errors come from segmentation, even though this is where all the custom programming was invested. 3.2 Register The procedure to solve HIPs is very similar. The image was smoothed, binarized, and the largest 5 connected components were identified. Two examples are presented below: The end-to-end success rate is 95.4% for segmentation, 87.1% for recognition (given correct segmentation), and (0.954)*(0.871)5 = 47.8% total. 3.3 Y a h oo/ E Z - G i mp y Unlike the mailblocks and register HIPs, the Yahoo/EZ-Gimpy HIPs are richer in that a variety of backgrounds and clutter are possible. Though some amount of text warping is present, the text color, size, and font have low variability. Three simple segmentation algorithms were designed with associated rules to identify which algorithm to use. The goal was to keep these simple yet effective: a) No mesh: Convert to grayscale image, threshold to black and white, select large CCs with sizes close to HIP char sizes. One example: b) Black mesh: Convert to grayscale image, threshold to black and white, remove vertical and horizontal line pixels that don’t have neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: c) White mesh: Convert to grayscale image, threshold to black and white, add black pixels (in white line locations) if there exist neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: Tests for black and white meshes were performed to determine which segmentation algorithm to use. The end-to-end success rate was 56.2% for segmentation (38.2% came from a), 11.8% from b), and 6.2% from c), 90.3% for recognition (given correct segmentation), and (0.562)*(0.903)4.8 = 34.4% total. The average length of a Yahoo HIP solution is 4.8 characters. 3.4 T i c k e t ma s t e r The procedure that solved the Yahoo HIP is fairly successful at solving some of the ticket master HIPs. These HIPs are characterized by cris-crossing lines at random angles clustered around 0, 45, 90, and 135 degrees. A multipronged attack as in the Yahoo case (section 3.3) has potential. 
In the interests of simplicity, a single attack was developed: Convert to grayscale, threshold to black and white, up-sample image, dilate first then erode, select large CCs with sizes close to HIP char sizes. One example: The dilate-erode combination causes the lines to be removed (along with any thin objects) but retains solid thick characters. This single attack is successful in achieving an end-to-end success rate of 16.6% for segmentation, the recognition rate was 82.3% (in spite of interfering lines), and (0.166)*(0.823)6.23 = 4.9% total. The average HIP solution length is 6.23 characters. 3.5 Y a h oo ve r s i on 2 The second generation HIP from Yahoo had several changes: a) it did not use words from a dictionary or even use a phonetic generator, b) it uses only black and white colors, c) uses both letters and digits, and d) uses connected lines and arcs as clutter. The HIP is somewhat similar to the MSN/Passport HIP which does not use a dictionary, uses two colors, uses letters and digits, and background and foreground arcs as clutter. Unlike the MSN/Passport HIP, several different fonts are used. A single segmentation attack was developed: Remove 6 pixel border, up-sample, dilate first then erode, select large CCs with sizes close to HIP char sizes. The attack is practically identical to that used for the ticketmaster HIP with different preprocessing stages and slightly modified parameters. Two examples: This single attack is successful in achieving an end-to-end success rate of 58.4% for segmentation, the recognition rate was 95.2%, and (0.584)*(0.952)5 = 45.7% total. The average HIP solution length is 5 characters. 3.6 G oog l e / G M a i l The Google HIP is unique in that it uses only image warp as a means of distorting the characters. Similar to the MSN/Passport and Yahoo version 2 HIPs, it is also two color. The HIP characters are arranged closed to one another (they often touch) and follow a curved baseline. The following very simple attack was used to segment Google HIPs: Convert to grayscale, up-sample, threshold and separate connected components. a) b) This very simple attack gives an end-to-end success rate of 10.2% for segmentation, the recognition rate was 89.3%, giving (0.102)*(0.893)6.5 = 4.89% total probability of breaking a HIP. Average Google HIP solution length is 6.5 characters. This can be significantly improved upon by judicious use of dilate-erode attack. A direct application doesn’t do as well as it did on the ticketmaster and yahoo HIPs (because of the shear and warp of the baseline of the word). More successful and complicated attacks might estimate and counter the shear and warp of the baseline to achieve better success rates. 4 Lesso n s lea rn ed f ro m b rea ki n g H IPs From the previous section, it is clear that most of the errors come from incorrect segmentations, even though most of the development time is spent devising custom segmentation schemes. This observation raises the following questions: Why is segmentation a hard problem? Can we devise harder HIPs and datasets? Can we build an automatic segmentor? Can we compare classification algorithms based on how useful they are for segmentation? 4.1 T h e s e g me n t a t i on p r ob l e m As a review, segmentation is difficult for the following reasons: 1. Segmentation is computationally expensive. In order to find valid patterns, a recognizer must attempt recognition at many different candidate locations. 2. The segmentation function is complex. 
To segment successfully, the system must learn to identify which patterns are valid among the set of all possible valid and non-valid patterns. This task is intrinsically more difficult than classification because the space of input is considerably larger. Unlike the space of valid patterns, the space of non-valid patterns is typically too vast to sample. This is a problem for many learning algorithms which yield too many false positives when presented non-valid patterns. 3. Identifying valid characters among a set of valid and invalid candidates is a combinatorial problem. For example, correctly identifying which 8 characters among 20 candidates (assuming 12 false positives), has a 1 in 125,970 (20 choose 8) chances of success by random guessing. 4.2 B ui l d i n g b e t te r / h a r de r H I P s We can use what we have learned to build better HIPs. For instance the HIP below was designed to make segmentation difficult and a similar version has been deployed by MSN Passport for hotmail registrations (www.hotmail.com): The idea is that the additional arcs are themselves good candidates for false characters. The previous segmentation attacks would fail on this HIP. Furthermore, simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to. We believe HIPs that emphasize the segmentation problem, such as the above example, are much stronger than the HIPs we examined in this paper, which rely on recognition being difficult. Pushing this to the extreme, we can easily generate the following HIPs: Despite the apparent difficulty of these HIPs, humans are surprisingly good at solving these, indicating that humans are far better than computers at segmentation. This approach of adding several competing false positives can in principle be used to automatically create difficult segmentation problems or benchmarks to test classification algorithms. 4.3 B ui l d i n g a n a ut o ma t i c s e g me n t or To build an automatic segmentor, we could use the following procedure. Label characters based on their correct position and train a recognizer. Apply the trained recognizer at all locations in the HIP image. Collect all candidate characters identified with high confidence by the recognizer. Compute the probability of each combination of candidates (going from left to right), and output the solution string with the highest probability. This is better illustrated with an example. Consider the following HIP (to the right). The trained neural network has these maps (warm colors indicate recognition) that show that K, Y, and so on are correctly identified. However, the maps for 7 and 9 show several false positives. In general, we would get the following color coded map for all the different candidates: HIP K Y B 7 9 With a threshold of 0.5 on the network’s outputs, the map obtained is: We note that there are several false positives for each true positive. The number of false positives per true positive character was found to be between 1 and 4, giving a 1 in C(16,8) = 12,870 to 1 in C(32,8) = 10,518,300 random chance of guessing the correct segmentation for the HIP characters. These numbers can be improved upon by constraining solution strings to flow sequentially from left to right and by restricting overlap. For each combination, we compute a probability by multiplying the 8 probabilities of the classifier for each position. The combination with the highest probability is the one proposed by the classifier. 
We do not have results for such an automatic segmentor at this time. It is interesting to note that with such a method a classifier that is robust to false positives would do far better than one that is not. This suggests another axis for comparing classifiers. 5 Con clu si on In this paper, we have successfully applied machine learning to the problem of solving HIPs. We have learned that decomposing the HIP problem into segmentation and recognition greatly simplifies analysis. Recognition on even unprocessed images (given segmentation is a solved) can be done automatically using neural networks. Segmentation, on the other hand, is the difficulty differentiator between weaker and stronger HIPs and requires custom intervention for each HIP. We have used this observation to design new HIPs and new tests for machine learning algorithms with the hope of improving them. A c k n ow l e d ge me n t s We would like to acknowledge Chau Luu and Eric Meltzer for their help with labeling and segmenting various HIPs. We would also like to acknowledge Josh Benaloh and Cem Paya for stimulating discussions on HIP security. References [1] Baird HS (1992), “Anatomy of a versatile page reader,” IEEE Pro., v.80, pp. 1059-1065. [2] Turing AM (1950), “Computing Machinery and Intelligence,” Mind, 59:236, pp. 433-460. [3] First Workshop on Human Interactive Proofs, Palo Alto, CA, January 2002. [4] Von Ahn L, Blum M, and Langford J, The Captcha Project. http://www.captcha.net [5] Baird HS and Popat K (2002) “Human Interactive Proofs and Document Image Analysis,” Proc. IAPR 2002 Workshop on Document Analysis Systerms, Princeton, NJ. [6] Simard PY, Steinkraus D, and Platt J, (2003) “Best Practice for Convolutional Neural Networks Applied to Visual Document Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), pp. 958-962, IEEE Computer Society, Los Alamitos. [7] Mori G, Malik J (2003), “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,” Proc. of the Computer Vision and Pattern Recognition (CVPR) Conference, IEEE Computer Society, vol.1, pages:I-134 - I-141, June 18-20, 2003 [8] Chew, M. and Baird, H. S. (2003), “BaffleText: a Human Interactive Proof,” Proc., 10th IS&T;/SPIE Document Recognition & Retrieval Conf., Santa Clara, CA, Jan. 22. [9] LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,’ Proceedings of the IEEE, Nov. 1998.

same-paper 2 0.85355318 136 nips-2004-On Semi-Supervised Classification

Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink

Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1

3 0.82701826 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning

Author: Liam Paninski

Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.

4 0.79242492 49 nips-2004-Density Level Detection is Classification

Author: Ingo Steinwart, Don Hush, Clint Scovel

Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1

5 0.77651036 29 nips-2004-Beat Tracking the Graphical Model Way

Author: Dustin Lang, Nando D. Freitas

Abstract: We present a graphical model for beat tracking in recorded music. Using a probabilistic graphical model allows us to incorporate local information and global smoothness constraints in a principled manner. We evaluate our model on a set of varied and difficult examples, and achieve impressive results. By using a fast dual-tree algorithm for graphical model inference, our system runs in less time than the duration of the music being processed. 1

6 0.58455968 69 nips-2004-Fast Rates to Bayes for Kernel Machines

7 0.57920152 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

8 0.54264116 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

9 0.53607696 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms

10 0.53554934 158 nips-2004-Sampling Methods for Unsupervised Learning

11 0.53079247 103 nips-2004-Limits of Spectral Clustering

12 0.5290674 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments

13 0.52675211 168 nips-2004-Semigroup Kernels on Finite Sets

14 0.52184618 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture

15 0.52090877 175 nips-2004-Stable adaptive control with online learning

16 0.51995409 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression

17 0.51977861 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors

18 0.51827204 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

19 0.5078969 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

20 0.50623786 23 nips-2004-Analysis of a greedy active learning strategy