jmlr jmlr2013 jmlr2013-22 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, Dun Yu Hsiao
Abstract: We consider the problem of classifying a test sample given incomplete information. This problem arises naturally when data about a test sample is collected over time, or when costs must be incurred to compute the classification features. For example, in a distributed sensor network only a fraction of the sensors may have reported measurements at a certain time, and additional time, power, and bandwidth is needed to collect the complete data to classify. A practical goal is to assign a class label as soon as enough data is available to make a good decision. We formalize this goal through the notion of reliability—the probability that a label assigned given incomplete data would be the same as the label assigned given the complete data, and we propose a method to classify incomplete data only if some reliability threshold is met. Our approach models the complete data as a random variable whose distribution is dependent on the current incomplete data and the (complete) training data. The method differs from standard imputation strategies in that our focus is on determining the reliability of the classification decision, rather than just the class label. We show that the method provides useful reliability estimates of the correctness of the imputed class labels on a set of experiments on time-series data sets, where the goal is to classify the time-series as early as possible while still guaranteeing that the reliability threshold is met. Keywords: classification, sensor networks, signals, reliability
Reference: text
sentIndex sentText sentNum sentScore
1 Department of Electrical Engineering, University of Washington, Seattle, WA 98195-4322, USA. Editor: Kevin Murphy. Abstract: We consider the problem of classifying a test sample given incomplete information. [sent-9, score-0.319]
2 We formalize this goal through the notion of reliability—the probability that a label assigned given incomplete data would be the same as the label assigned given the complete data, and we propose a method to classify incomplete data only if some reliability threshold is met. [sent-13, score-0.903]
3 The method differs from standard imputation strategies in that our focus is on determining the reliability of the classification decision, rather than just the class label. [sent-15, score-0.362]
4 We show that the method provides useful reliability estimates of the correctness of the imputed class labels on a set of experiments on time-series data sets, where the goal is to classify the time-series as early as possible while still guaranteeing that the reliability threshold is met. [sent-16, score-0.766]
5 Specifically, we wish to guarantee that a decision made from incomplete test data has a high probability of being the same decision that would be made given the complete test data. [sent-24, score-0.597]
6 In this paper, we focus on answering the question “With probability at least equal to τ, will the classification decision from incomplete data be the same as that which would be made from the complete data? [sent-25, score-0.357]
7 ” Our approach also makes it possible to answer the related question, “If we classify based on the current incomplete data, what is the probability that the class decision will be the same as classifying from the complete data? [sent-26, score-0.512]
8 ” First, we propose optimal and practical decision rules for classifying incomplete data. [sent-27, score-0.367]
9 In Sections 3, 4, and 5 we provide the details on how to efficiently and accurately implement the proposed practical decision rule for classifiers that use linear or quadratic discriminants, such as linear support vector machines and linear or quadratic discriminant analysis (LDA and QDA). [sent-28, score-0.37]
10 Experiments in Section 7 show that the proposed incomplete decision rule consistently provides enhanced reliability over the state of the art in classifying incomplete data. [sent-30, score-0.938]
11 However, suppose that at test time we do not have x, but instead have some incomplete information given as a vector z. [sent-37, score-0.351]
12 To that end, we consider decision rules that answer the question: “Can we classify z and know that we meet some minimum probability threshold of making the same decision that we would make on x? [sent-39, score-0.344]
13 ” We use the term reliability to mean the probability that the class label assigned to z matches that assigned to x. [sent-40, score-0.352]
14 To estimate reliability, we model the classification features derived from the complete data as a random variable X, where X is jointly distributed with the random variable Z modeling the incomplete data. [sent-41, score-0.334]
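To make this modeling step concrete, here is a minimal sketch of the conditional mean and covariance of X given Z = z when [Z; X] is modeled as jointly Gaussian. The partitioned-Gaussian identity is standard, but the function name and interface are ours, not the paper's.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, z, d_z):
    """Mean and covariance of X | Z = z when [Z; X] ~ N(mu, Sigma),
    with the first d_z coordinates playing the role of Z."""
    mu_z, mu_x = mu[:d_z], mu[d_z:]
    S_zz = Sigma[:d_z, :d_z]
    S_xz = Sigma[d_z:, :d_z]
    S_xx = Sigma[d_z:, d_z:]
    K = S_xz @ np.linalg.inv(S_zz)          # regression of X on Z
    mu_cond = mu_x + K @ (np.asarray(z) - mu_z)
    Sigma_cond = S_xx - K @ S_xz.T
    return mu_cond, Sigma_cond
```

For example, with unit variances and correlation 0.5, observing z = 1 shifts the conditional mean of X to 0.5 and shrinks its variance to 0.75.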
15 Given a desired reliability τ ∈ [0, 1] and a realization of the incomplete information z, an ideal incomplete decision rule is to classify as class g if P(ĝ(X) = g | Z = z) = ∫_{x : ĝ(x) = g} p(x|z) dx ≥ τ. [sent-42, score-1.031]
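When p(x|z) can be sampled, the probability in the ideal rule can be approximated by simple Monte Carlo. The hypothetical sketch below (assuming a Gaussian conditional, our choice for illustration) is not the paper's computable rule, which deliberately avoids this integral.

```python
import numpy as np

def mc_reliability(classify, mu_cond, Sigma_cond, rng, n_samples=10000):
    """Monte Carlo estimate of max_g P(g(X) = g | Z = z): draw X ~ p(x|z),
    classify each draw, and return the most frequent label and its frequency."""
    X = rng.multivariate_normal(mu_cond, Sigma_cond, size=n_samples)
    labels = np.array([classify(x) for x in X])
    values, counts = np.unique(labels, return_counts=True)
    best = int(np.argmax(counts))
    return values[best], counts[best] / n_samples
```

One would classify early only if the returned frequency meets the threshold τ.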
16 An alternative check that we find easier to approximate is to consider all sets A in the domain of X such that P(X ∈ A | Z = z) ≥ τ. Figure 1: In this example, the available information is the incomplete time signal z, shown in green. [sent-48, score-0.315]
17 We propose that a more conservative, but computable, incomplete decision rule is to classify as class g if g(x) = g for all x ∈ A for some set A such that P(X ∈ A|Z = z) ≥ τ. [sent-53, score-0.509]
18 Defining a Set A that Contains Measure τ of X To implement the incomplete decision rule (2), one must be able to construct a set A that contains at least τ measure of X given z. [sent-63, score-0.385]
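As one illustration of such a construction, the sketch below builds an axis-aligned box in the spirit of the Gaussian naïve Bayes box set (5): assuming the coordinates of X given z are independent Gaussians, each coordinate interval is chosen to hold τ^(1/d) mass so that the box holds mass τ overall. The function name and interface are ours.

```python
import numpy as np
from statistics import NormalDist

def naive_bayes_box(mu, sigma, tau):
    """Axis-aligned box holding mass tau of a product of independent
    Gaussians N(mu_j, sigma_j^2): each coordinate gets a central interval
    of mass tau**(1/d), so the product of the d per-coordinate masses is tau."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    per_dim = tau ** (1.0 / len(mu))
    k = NormalDist().inv_cdf(0.5 + per_dim / 2.0)  # half-width in std units
    return mu - k * sigma, mu + k * sigma
```

In one dimension with τ = 0.95 this recovers the familiar ±1.96σ interval.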
19 For desired reliability values τ that are smaller than the probability mass of X that falls to the left of the decision boundary, the ideal incomplete decision rule would choose to classify based on the incomplete information z. [sent-71, score-1.111]
20 Right: The entire probability mass of X falls on one side of the decision boundary, and thus the ideal incomplete decision rule would choose to classify rather than wait, for every value of τ. [sent-72, score-0.589]
21 However, our computable incomplete decision rule constructs a set A that captures a fraction τ of the mass of X and requires the entire set A to lie on one side of the decision boundary. [sent-73, score-0.497]
22 For the choice of A shown here in blue, the set A crosses the decision boundary, and thus the computable decision rule would choose to wait for more information before classifying. [sent-74, score-0.296]
23 If we assume more about the distribution, we can define a smaller constraint set A that results in a less conservative decision rule, and therefore earlier classification for the same reliability requirement τ. [sent-82, score-0.48]
24 An optimal method for checking the incomplete decision rule (2) for this discriminant is an open question. [sent-111, score-0.456]
25 A conservative reliability decision can be made by treating each sample as its own class in (6). [sent-112, score-0.485]
26 Then the proposed incomplete data decision rule (2) is implemented: ĝ(z) = 1 if max_{x∈A} f(x) ≤ 0; ĝ(z) = 2 if min_{x∈A} f(x) > 0; no decision otherwise. (7) [sent-122, score-0.497]
27 Note that the decision rule (7) depends on the incomplete data through the dependence of A on z. [sent-123, score-0.385]
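For a linear discriminant f(x) = βᵀx + b and an ellipsoidal set A = {x : (x − m)ᵀΣ⁻¹(x − m) ≤ r²}, the extrema needed by rule (7) are available in closed form, which the sketch below uses; the interface is ours.

```python
import numpy as np

def linear_rule_over_ellipsoid(beta, b, m, Sigma, r2):
    """Rule (7) for f(x) = beta^T x + b over the ellipsoid
    {x : (x - m)^T Sigma^{-1} (x - m) <= r2}; the max/min of f over the
    ellipsoid are beta^T m + b +/- sqrt(r2 * beta^T Sigma beta)."""
    center = beta @ m + b
    spread = np.sqrt(r2 * (beta @ Sigma @ beta))
    if center + spread <= 0:
        return 1          # all of A on the class-1 side
    if center - spread > 0:
        return 2          # all of A on the class-2 side
    return None           # A straddles the boundary: wait for more data
```

No optimization solver is needed in this special case, which is the point of pairing a linear discriminant with an ellipsoidal A.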
28 The three different conditions in (7) are shown for a quadratic discriminant (and hence quadratic decision boundary) and a quadratic construction of the set A in Figure 4. [sent-124, score-0.39]
29 In the center and rightmost plots, A lies completely on a single side of the decision boundary, so the classifier assigns a label to the incomplete data. [sent-127, score-0.358]
30 Coupled with a quadratic set A, such as the Chebyshev or naïve Bayes quadratic sets A given in Section 3, finding the maximum and minimum are linear programs with quadratic constraints: max_{x∈A} f(x) = max_x βᵀx + b subject to x ∈ A. (8) [sent-132, score-0.349]
31 First consider finding the maximum and minimum of (12), as required by the incomplete decision rule (7), over a quadratic constraint set A. [sent-158, score-0.509]
32 After this change of variables, we can greatly simplify the maximum and minimum computations required by the incomplete decision rule (7) by making the naïve Bayes assumption on the random variable Y = V^{1/2}X rather than on X. [sent-171, score-0.527]
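The change of variables itself is just multiplication by a matrix square root; a minimal sketch, computing the symmetric square root of V by eigendecomposition (the helper name is ours):

```python
import numpy as np

def apply_sqrt(V, X):
    """Return Y = X V^{1/2} for row-stacked samples X, where V^{1/2} is the
    symmetric square root of the positive semi-definite matrix V."""
    w, U = np.linalg.eigh(V)
    V_half = U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T
    return X @ V_half
```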
33 The proposed incomplete data classification rule (2) can be written: ĝ(z) = c if min_{x∈A} [f_c(x) − f_h(x)] ≥ 0 for all h ≠ c; no decision otherwise. (15) [sent-184, score-0.461]
34 That is, classify z as class c if the set A lies completely within the decision region for some class c, and do not decide at the requested reliability if the set A straddles a decision boundary. [sent-185, score-0.678]
35 To classify the incomplete data z early as class c, class c must dominate all other classes. [sent-199, score-0.426]
36 If yes, classify the incomplete data as the class labelled candidate; if not, output no decision. [sent-210, score-0.426]
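The overall multiclass check can be sketched as follows, with the pairwise test left abstract: dominates(c, h) is assumed to return True when min over A of f_c(x) − f_h(x) is nonnegative, and the function names are ours.

```python
def multiclass_early_label(classes, dominates):
    """Rule (15): return class c if it dominates every other class over the
    set A; otherwise return None ('no decision')."""
    for c in classes:
        if all(dominates(c, h) for h in classes if h != c):
            return c
    return None
```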
37 We do this by leveraging the incomplete information about the test signal that is currently available, together with prior knowledge of the structure of the test signal gained from the training data, under the standard assumption that the training and test features are IID. [sent-215, score-0.513]
38 However, the minimum support parameter is different from our τ parameter in that it does not provide an explicit guarantee on the reliability of the early decision. [sent-246, score-0.372]
39 Given such costs, an optimal stopping rule approach would attempt to estimate the probability of each class given the current incomplete information, and determine the expected costs of making a decision or waiting. [sent-275, score-0.448]
40 Generally stopping rules are not applicable to the problem we focus on because they assume that all increasing sets of features can be compared, rather than that one only has the incomplete set of features and must make a decision. [sent-280, score-0.357]
41 We also use the Synthetic Control data set from this repository, a data set of Gaussian data that has only three hundred test samples, to further illustrate the differences between the constraint sets and estimation methods that we have described for the proposed incomplete decision rule. [sent-313, score-0.427]
42 At time t, the incomplete data for the ith test time-series is z_i ∈ R^t, the first t samples of x_i. [sent-317, score-0.384]
43 At each time t we check the proposed incomplete decision rule and classify zi if the reliability condition is met for τ. [sent-318, score-0.866]
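The sweep over time can be sketched as below; rule(z, tau) stands for the incomplete decision rule (returning a label or None), classify_complete for the ordinary complete-data classifier, and the interface is ours.

```python
def earliest_classification(x, rule, classify_complete, tau):
    """Return (t, label): the first time t at which the incomplete decision
    rule issues a label for z = x[:t]; at t = d the complete data is in hand
    and the ordinary classifier is applied unconditionally."""
    d = len(x)
    for t in range(1, d):
        label = rule(x[:t], tau)
        if label is not None:
            return t, label
    return d, classify_complete(x)
```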
44 Let t_i(τ) be the minimum time at which the ith test signal can be classified with reliability constraint τ, and let ĝ(z_i(τ)) be the class label assigned to z_i at this time. [sent-344, score-0.567]
45 We measure the test reliability as (1/n) ∑_{i=1}^{n} I(ĝ(z_i(τ)) = ĝ(x_i)), where ĝ(x_i) is the label assigned to the complete data and I(·) is one if the argument is true and zero otherwise. [sent-345, score-0.405]
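The metric is a simple agreement rate between early and complete-data labels; a minimal sketch:

```python
import numpy as np

def empirical_reliability(early_labels, complete_labels):
    """(1/n) * sum_i I(g(z_i(tau)) == g(x_i)): the fraction of test series
    whose early label agrees with the complete-data label."""
    return float(np.mean(np.asarray(early_labels) == np.asarray(complete_labels)))
```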
46 Ideally, we would like to classify with the smallest average classification time while still meeting reliability requirement τ. [sent-347, score-0.483]
47 Local QDA learns the mean and covariance of the class-g discriminant function f_g(x) for test point x by estimating them from the k nearest class-g training points to x. [sent-351, score-0.315]
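A minimal sketch of such a local QDA discriminant, with a small ridge term added to keep the covariance invertible (the regularization constant and the function name are our choices, not the paper's):

```python
import numpy as np

def local_qda_discriminant(x, X_g, k, ridge=1e-6):
    """Gaussian log-discriminant for class g at x, with mean and covariance
    estimated from the k class-g training points nearest to x."""
    d = X_g.shape[1]
    nearest = X_g[np.argsort(np.linalg.norm(X_g - x, axis=1))[:k]]
    mu = nearest.mean(axis=0)
    Sigma = np.cov(nearest, rowvar=False) + ridge * np.eye(d)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (diff @ np.linalg.solve(Sigma, diff) + logdet)
```

The test point is assigned to whichever class gives the largest discriminant value.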
48 2 Comparison of Construction of Sets of Measure τ. We first compare the three set construction methods proposed in Section 3: the Chebyshev set (3), the Gaussian naïve Bayes quadratic set (4), and the Gaussian naïve Bayes box set (5). [sent-357, score-0.357]
49 We vary the reliability parameter between four values τ = [0. [sent-358, score-0.298]
50 In all cases, the empirical reliability rate exceeds the reliability requirement τ. [sent-364, score-0.596]
51 Additionally, these plots verify that the Chebyshev set is the most conservative, as it waits the longest to classify the test data, and the naïve Bayes quadratic set is the least conservative. [sent-365, score-0.339]
52 This table shows that the naïve Bayes box set is the least computationally complex, followed by the naïve Bayes quadratic set, and finally the Chebyshev set. [sent-368, score-0.357]
53 3 Comparison of Estimation Methods In this section we compare the performance of reliable incomplete classification using jointly Gaussian estimation (16) to that using GMM estimation (17). [sent-370, score-0.368]
54 test reliability for the jointly Gaussian and GMM estimation methods using the naïve Bayes quadratic constraint set. [sent-373, score-0.61]
55 Figure 6: Average classification time vs. test reliability for local QDA (left column) and linear SVM (right column) using jointly Gaussian prediction; curves compare the Chebyshev, naïve Bayes quadratic, and naïve Bayes box sets. [sent-383, score-0.518]
56 Figure 7: Average classification time vs. test reliability for local QDA (left column) and linear SVM (right column) using the naïve Bayes quadratic constraint set, with τ varied; curves compare jointly Gaussian and GMM estimation. [sent-395, score-0.69]
57 4 Dimensionality Reduction Features An advantage of our reliable incomplete classification approach is that it can use any features derived from the data for which we can estimate the mean and covariance. [sent-417, score-0.381]
58 Thus, if we simply use the time-series samples as the features for classification, the optimization problem that the reliable incomplete classifier must solve has d − t free variables. [sent-426, score-0.381]
59 The table also compares the testing time required to perform reliable local QDA classification with the naïve Bayes quadratic constraint set and jointly Gaussian estimation at time t = 1, with and without LDG dimensionality reduction. [sent-433, score-0.538]
60 The test time shown measures the average time, per test sample, to perform reliable classification at time t = 1. [sent-436, score-0.39]
61 5 Comparison to Other Methods In this section, we compare the performance of our reliable incomplete data classifier to ECTS (Xing et al. [sent-439, score-0.33]
62 For all experiments in this section, we use the naïve Bayes quadratic constraint set because it proved to be uniformly better than the box constraint set across all experiments in Section 7. [sent-441, score-0.297]
63 However, we emphasize that this parameter is not the same as our reliability parameter τ: it provides no guarantee on the reliability of the early predictions, but is instead a knob that the user can tune to trade off earliness against reliability. [sent-448, score-0.642]
[Figure 8 panels omitted: Face (All), Medical Images, Non-invasive Fetal ECG 1, Non-invasive Fetal ECG 2, Starlight Curves, Swedish Leaf; axes are average classification time vs. test reliability.]
70 Figure 8: Average classification time vs. test reliability for reliable incomplete local QDA classification (Rel.), reliable incomplete local QDA classification with LDG features (LDG Rel.), ECTS, fixed-time QDA, and fixed-time 1-NN. [sent-483, score-1.113]
[Figure 9 panels omitted: U Wave Gesture Library Y, Wafer, Yoga; axes are average classification time vs. test reliability.]
75 Figure 9: Average classification time vs. test reliability for reliable incomplete local QDA classification (Rel.), reliable incomplete local QDA classification with LDG features (LDG Rel.), ECTS, fixed-time QDA, and fixed-time 1-NN. [sent-510, score-1.113]
77 The reliability results are shown in Figures 8 and 9. [sent-517, score-0.298]
78 Reliable incomplete local QDA classification and reliable incomplete local QDA classification with LDG features perform well across all experiments. [sent-518, score-0.673]
[Figure 10 panels omitted: Face (All), Medical Images, Non-invasive Fetal ECG 1, Non-invasive Fetal ECG 2, Swedish Leaf; axes are average classification time vs. test accuracy.]
84 Figure 10: Average classification time vs. test accuracy for reliable incomplete local QDA classification (Rel.), reliable incomplete local QDA classification with LDG features (LDG Rel.), ECTS, fixed-time QDA, and fixed-time 1-NN. [sent-547, score-0.837]
[Figure 11 panels omitted: U Wave Gesture Library Y, U Wave Gesture Library Z, Wafer, Yoga; axes are average classification time vs. test accuracy.]
91 Figure 11: Average classification time vs. test accuracy for reliable incomplete local QDA classification (Rel.), reliable incomplete local QDA classification with LDG features (LDG Rel.), ECTS, fixed-time QDA, and fixed-time 1-NN. [sent-576, score-0.837]
93 Therefore, if someone wanted to set τ by cross-validation on the training data set, the reliable incomplete classifier offers more flexibility than ECTS. [sent-586, score-0.353]
94 Discussion and Some Open Questions We have proposed a practical incomplete decision rule that is a conservative approximation of the optimal rule. [sent-595, score-0.428]
95 This paper has focused on answering the question “With probability τ, will the classification decision from this incomplete data be the same as from the complete data? [sent-600, score-0.357]
96 ” The presented tools can also be used to answer the related question: “If we classify based on the current incomplete data, what is the probability that the assigned label will match the one that would be chosen using the complete data? [sent-601, score-0.359]
97 Another related question that can be answered is, “Can we reliably classify as class g with this incomplete data? [sent-603, score-0.348]
98 This question can be answered by applying the incomplete decision rule given in (2) only to the class of interest. [sent-607, score-0.417]
99 Proof of Proposition 2: First, note that each pairwise check reduces the number of classes labelled candidate by either two classes if the classes tie, or by one class (the loser) if one class dominates. [sent-619, score-0.306]
100 Furthermore, for the incomplete data decision rule (7), it is not necessary to find the true minimum over A of f (x), but it is instead sufficient to know only whether or not it is less than or equal to zero. [sent-652, score-0.413]
wordName wordTfidf (topN-words)
[('qda', 0.547), ('ldg', 0.357), ('reliability', 0.298), ('classification', 0.24), ('incomplete', 0.224), ('fixed', 0.221), ('parrish', 0.152), ('ive', 0.114), ('decision', 0.112), ('gmm', 0.111), ('lassifying', 0.106), ('nderson', 0.106), ('onfidence', 0.106), ('saio', 0.106), ('reliable', 0.106), ('chebyshev', 0.1), ('ects', 0.093), ('classify', 0.092), ('upta', 0.091), ('wave', 0.091), ('nn', 0.09), ('classi', 0.078), ('labelled', 0.078), ('ecg', 0.076), ('fetal', 0.076), ('discriminant', 0.071), ('quadratic', 0.069), ('mt', 0.065), ('test', 0.064), ('time', 0.063), ('gesture', 0.06), ('box', 0.06), ('bayes', 0.059), ('missing', 0.052), ('mg', 0.052), ('features', 0.051), ('rule', 0.049), ('medical', 0.048), ('early', 0.046), ('chlorine', 0.046), ('mpl', 0.046), ('conservative', 0.043), ('fc', 0.041), ('raj', 0.039), ('rg', 0.039), ('jointly', 0.038), ('sandia', 0.038), ('sprt', 0.038), ('starlight', 0.038), ('wafer', 0.038), ('classes', 0.037), ('library', 0.036), ('fh', 0.035), ('local', 0.034), ('ith', 0.033), ('discriminants', 0.033), ('imputation', 0.032), ('mq', 0.032), ('er', 0.032), ('class', 0.032), ('stopping', 0.031), ('classifying', 0.031), ('swedish', 0.03), ('yoga', 0.03), ('average', 0.03), ('naive', 0.03), ('ers', 0.029), ('fg', 0.029), ('sdp', 0.029), ('zk', 0.028), ('minimum', 0.028), ('check', 0.028), ('constraint', 0.027), ('svm', 0.027), ('patterns', 0.026), ('images', 0.026), ('candidate', 0.025), ('dimensionality', 0.024), ('final', 0.024), ('italy', 0.024), ('training', 0.023), ('gaussian', 0.023), ('btsrp', 0.023), ('rodriguez', 0.023), ('trsp', 0.023), ('wait', 0.023), ('synthetic', 0.023), ('tied', 0.022), ('label', 0.022), ('dominated', 0.022), ('accuracy', 0.022), ('cation', 0.022), ('cdf', 0.022), ('vs', 0.021), ('complete', 0.021), ('ms', 0.021), ('demand', 0.021), ('speech', 0.021), ('dominates', 0.02), ('gupta', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 22 jmlr-2013-Classifying With Confidence From Incomplete Information
Author: Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, Dun Yu Hsiao
2 0.048588317 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Largemargin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
3 0.041307107 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
Author: Fang Han, Tuo Zhao, Han Liu
Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efficiency and robustness, the nonparametric rank-based methods including the Spearman’s rho and Kendall’s tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent to the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra flexibility gained by the CODA method incurs little efficiency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics
4 0.037767574 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
Author: Jun Wan, Qiuqi Ruan, Wei Li, Shuang Deng
Abstract: For one-shot learning gesture recognition, two important challenges are: how to extract distinctive features and how to learn a discriminative model from only one training sample per gesture class. For feature extraction, a new spatio-temporal feature representation called 3D enhanced motion scale-invariant feature transform (3D EMoSIFT) is proposed, which fuses RGB-D data. Compared with other features, the new feature set is invariant to scale and rotation, and has more compact and richer visual representations. For learning a discriminative model, all features extracted from training samples are clustered with the k-means algorithm to learn a visual codebook. Then, unlike the traditional bag of feature (BoF) models using vector quantization (VQ) to map each feature into a certain visual codeword, a sparse coding method named simulation orthogonal matching pursuit (SOMP) is applied and thus each feature can be represented by some linear combination of a small number of codewords. Compared with VQ, SOMP leads to a much lower reconstruction error and achieves better performance. The proposed approach has been evaluated on ChaLearn gesture database and the result has been ranked amongst the top best performing techniques on ChaLearn gesture challenge (round 2). Keywords: gesture recognition, bag of features (BoF) model, one-shot learning, 3D enhanced motion scale invariant feature transform (3D EMoSIFT), Simulation Orthogonal Matching Pursuit (SOMP)
5 0.035995461 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition
Author: Sean Ryan Fanello, Ilaria Gori, Giorgio Metta, Francesca Odone
Abstract: Sparsity has been showed to be one of the most important properties for visual recognition purposes. In this paper we show that sparse representation plays a fundamental role in achieving one-shot learning and real-time recognition of actions. We start off from RGBD images, combine motion and appearance cues and extract state-of-the-art features in a computationally efficient way. The proposed method relies on descriptors based on 3D Histograms of Scene Flow (3DHOFs) and Global Histograms of Oriented Gradient (GHOGs); adaptive sparse coding is applied to capture high-level patterns from data. We then propose a simultaneous on-line video segmentation and recognition of actions using linear SVMs. The main contribution of the paper is an effective realtime system for one-shot action modeling and recognition; the paper highlights the effectiveness of sparse coding techniques to represent 3D actions. We obtain very good results on three different data sets: a benchmark data set for one-shot action learning (the ChaLearn Gesture Data Set), an in-house data set acquired by a Kinect sensor including complex actions and gestures differing by small details, and a data set created for human-robot interaction purposes. Finally we demonstrate that our system is effective also in a human-robot interaction setting and propose a memory game, “All Gestures You Can”, to be played against a humanoid robot. Keywords: real-time action recognition, sparse representation, one-shot action learning, human robot interaction
6 0.03549229 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
8 0.030185305 8 jmlr-2013-A Theory of Multiclass Boosting
9 0.030119641 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems
10 0.029869296 58 jmlr-2013-Language-Motivated Approaches to Action Recognition
11 0.029533932 68 jmlr-2013-Machine Learning with Operational Costs
12 0.028112732 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
13 0.028094444 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs
14 0.027773913 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
15 0.026777972 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
16 0.026333779 54 jmlr-2013-JKernelMachines: A Simple Framework for Kernel Machines
17 0.025855616 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
18 0.025842357 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
19 0.023534654 12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting
topicId topicWeight
[(0, -0.125), (1, 0.009), (2, -0.079), (3, 0.007), (4, 0.031), (5, -0.018), (6, 0.009), (7, 0.0), (8, -0.026), (9, -0.077), (10, 0.037), (11, -0.061), (12, 0.048), (13, -0.1), (14, -0.054), (15, -0.032), (16, 0.013), (17, 0.021), (18, -0.139), (19, 0.003), (20, -0.037), (21, 0.012), (22, -0.008), (23, -0.068), (24, 0.059), (25, -0.05), (26, -0.054), (27, 0.076), (28, -0.198), (29, -0.0), (30, 0.185), (31, -0.066), (32, -0.316), (33, 0.038), (34, 0.022), (35, 0.108), (36, 0.093), (37, -0.132), (38, 0.092), (39, 0.013), (40, -0.029), (41, 0.059), (42, -0.153), (43, -0.001), (44, -0.022), (45, 0.14), (46, -0.015), (47, 0.269), (48, 0.051), (49, 0.259)]
simIndex simValue paperId paperTitle
same-paper 1 0.94174147 22 jmlr-2013-Classifying With Confidence From Incomplete Information
Author: Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, Dun Yu Hsiao
Abstract: We consider the problem of classifying a test sample given incomplete information. This problem arises naturally when data about a test sample is collected over time, or when costs must be incurred to compute the classification features. For example, in a distributed sensor network only a fraction of the sensors may have reported measurements at a certain time, and additional time, power, and bandwidth are needed to collect the complete data to classify. A practical goal is to assign a class label as soon as enough data is available to make a good decision. We formalize this goal through the notion of reliability—the probability that a label assigned given incomplete data would be the same as the label assigned given the complete data, and we propose a method to classify incomplete data only if some reliability threshold is met. Our approach models the complete data as a random variable whose distribution is dependent on the current incomplete data and the (complete) training data. The method differs from standard imputation strategies in that our focus is on determining the reliability of the classification decision, rather than just the class label. We show that the method provides useful reliability estimates of the correctness of the imputed class labels on a set of experiments on time-series data sets, where the goal is to classify the time-series as early as possible while still guaranteeing that the reliability threshold is met. Keywords: classification, sensor networks, signals, reliability
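The reliability notion in this abstract can be sketched with a small Monte Carlo experiment: sample plausible completions of the missing features, classify each, and report how often the completed classifications agree with the imputed label. Everything below (the toy Gaussian classes, the nearest-mean classifier, the resampling of missing features from the training data) is a hypothetical stand-in, not the authors' actual model of the complete data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: two Gaussian classes in 4 dimensions (hypothetical).
mu = {0: np.zeros(4), 1: np.full(4, 2.0)}
X_train = np.vstack([rng.normal(mu[c], 1.0, size=(100, 4)) for c in (0, 1)])

def classify(x):
    # Nearest-class-mean stand-in for any fixed classifier.
    return min(mu, key=lambda c: np.linalg.norm(x - mu[c]))

def reliability(x_obs, observed, n_samples=500):
    """Estimate P(label from the completed data == imputed label).

    Missing features are filled from random training samples -- a crude
    surrogate for a model of the complete data given the incomplete data.
    """
    missing = [d for d in range(4) if d not in observed]
    labels = []
    for _ in range(n_samples):
        x = np.array(x_obs, dtype=float)
        donor = X_train[rng.integers(len(X_train))]
        x[missing] = donor[missing]        # fill the unobserved dims
        labels.append(classify(x))
    labels = np.array(labels)
    imputed = np.bincount(labels).argmax() # majority label over completions
    return imputed, np.mean(labels == imputed)

label, rel = reliability(np.array([1.8, 1.9, 0.0, 0.0]), observed=[0, 1])
# Classify now only if `rel` exceeds a chosen threshold, e.g., 0.9;
# otherwise wait for more sensors to report.
```

The point of the threshold rule is exactly this comparison: `rel` quantifies how much the eventual complete-data decision could still change.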
2 0.6943332 73 jmlr-2013-Multicategory Large-Margin Unified Machines
Author: Chong Zhang, Yufeng Liu
Abstract: Hard and soft classifiers are two important groups of techniques for classification problems. Logistic regression and Support Vector Machines are typical examples of soft and hard classifiers, respectively. The essential difference between these two groups is whether one needs to estimate the class conditional probability for the classification task or not. In particular, soft classifiers predict the label based on the obtained class conditional probabilities, while hard classifiers bypass the estimation of probabilities and focus on the decision boundary. In practice, for the goal of accurate classification, it is unclear which one to use in a given situation. To tackle this problem, the Large-margin Unified Machine (LUM) was recently proposed as a unified family to embrace both groups. The LUM family enables one to study the behavior change from soft to hard binary classifiers. For multicategory cases, however, the concept of soft and hard classification becomes less clear. In that case, class probability estimation becomes more involved as it requires estimation of a probability vector. In this paper, we propose a new Multicategory LUM (MLUM) framework to investigate the behavior of soft versus hard classification under multicategory settings. Our theoretical and numerical results help to shed some light on the nature of multicategory classification and its transition behavior from soft to hard classifiers. The numerical results suggest that the proposed tuned MLUM yields very competitive performance. Keywords: hard classification, large-margin, soft classification, support vector machine
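The soft-versus-hard distinction can be made concrete with a toy fit of the two canonical losses. The logistic and hinge losses below are generic illustrations of the two endpoints, not the LUM family itself, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two overlapping 1-D Gaussian classes, labels in {-1, +1}.
X = np.r_[rng.normal(-1, 1, (200, 1)), rng.normal(1, 1, (200, 1))]
y = np.r_[-np.ones(200), np.ones(200)]

def fit(loss_grad, lr=0.1, steps=2000):
    """Gradient descent on mean loss; loss_grad is dL/dmargin."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        m = y * (w * X[:, 0] + b)                  # margins
        g = loss_grad(m)
        w -= lr * np.mean(g * y * X[:, 0])
        b -= lr * np.mean(g * y)
    return w, b

# Soft endpoint: logistic loss log(1 + exp(-m)); gradient -1/(1+exp(m)).
w_soft, b_soft = fit(lambda m: -1.0 / (1.0 + np.exp(m)))
# Hard endpoint: hinge loss max(0, 1-m); subgradient -1 where m < 1.
w_hard, b_hard = fit(lambda m: -(m < 1).astype(float))

# Only the soft fit supports a class-probability read-out:
p = 1.0 / (1.0 + np.exp(-(w_soft * 0.0 + b_soft)))  # P(y=1 | x=0)
```

Both fits produce a sensible boundary near x = 0, but only the logistic fit yields a probability estimate (about 0.5 at the boundary here); the hinge fit gives a decision rule alone, which is the practical trade-off the LUM family interpolates.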
3 0.31933403 61 jmlr-2013-Learning Theory Analysis for Association Rules and Sequential Event Prediction
Author: Cynthia Rudin, Benjamin Letham, David Madigan
Abstract: We present a theoretical analysis for prediction algorithms based on association rules. As part of this analysis, we introduce a problem for which rules are particularly natural, called “sequential event prediction.” In sequential event prediction, events in a sequence are revealed one by one, and the goal is to determine which event will next be revealed. The training set is a collection of past sequences of events. An example application is to predict which item will next be placed into a customer’s online shopping cart, given his/her past purchases. In the context of this problem, algorithms based on association rules have distinct advantages over classical statistical and machine learning methods: they look at correlations based on subsets of co-occurring past events (items a and b imply item c), they can be applied to the sequential event prediction problem in a natural way, they can potentially handle the “cold start” problem where the training set is small, and they yield interpretable predictions. In this work, we present two algorithms that incorporate association rules. These algorithms can be used both for sequential event prediction and for supervised classification, and they are simple enough that they can possibly be understood by users, customers, patients, managers, etc. We provide generalization guarantees on these algorithms based on algorithmic stability analysis from statistical learning theory. We include a discussion of the strict minimum support threshold often used in association rule mining, and introduce an “adjusted confidence” measure that provides a weaker minimum support condition that has advantages over the strict minimum support. The paper brings together ideas from statistical learning theory, association rule mining and Bayesian analysis. Keywords: statistical learning theory, algorithmic stability, association rules, sequence prediction, associative classification
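Rule-based sequential event prediction with a shrunken confidence score can be sketched as follows. The shopping data are invented, and the score #(lhs and rhs) / (#lhs + K) is our illustrative reading of an adjusted, support-penalized confidence, not necessarily the paper's exact definition.

```python
# Past shopping sequences (hypothetical data).
sequences = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk", "bread", "butter"],
    ["milk", "cereal"],
]

def adjusted_confidence(lhs, rhs, K=1.0):
    """Shrunken rule score: #(lhs and rhs) / (#lhs + K).

    K > 0 down-weights rules whose left-hand side has little support,
    giving a softer alternative to a strict minimum support threshold.
    """
    both = sum(1 for s in sequences if lhs in s and rhs in s)
    lhs_n = sum(1 for s in sequences if lhs in s)
    return both / (lhs_n + K)

def predict_next(cart, candidates, K=1.0):
    # Score each candidate item by the best rule from any cart item.
    return max(candidates,
               key=lambda c: max((adjusted_confidence(a, c, K) for a in cart),
                                 default=0.0))

# With bread already in the cart, butter outscores cereal.
choice = predict_next(["bread"], ["butter", "cereal"])
```

Here `adjusted_confidence("bread", "butter")` is 3/(4+1) = 0.6, so the predicted next item is butter; raising K shrinks all scores toward zero, hurting rare-antecedent rules most.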
4 0.31105092 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation
Author: Philémon Brakel, Dirk Stroobandt, Benjamin Schrauwen
Abstract: Imputing missing values in high dimensional time-series is a difficult problem. This paper presents a strategy for training energy-based graphical models for imputation directly, bypassing difficulties probabilistic approaches would face. The training strategy is inspired by recent work on optimization-based learning (Domke, 2012) and allows complex neural models with convolutional and recurrent structures to be trained for imputation tasks. In this work, we use this training strategy to derive learning rules for three substantially different neural architectures. Inference in these models is done by either truncated gradient descent or variational mean-field iterations. In our experiments, we found that the training methods outperform the Contrastive Divergence learning algorithm. Moreover, the training methods can easily handle missing values in the training data itself during learning. We demonstrate the performance of this learning scheme and the three models we introduce on one artificial and two real-world data sets. Keywords: neural networks, energy-based models, time-series, missing values, optimization
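The inference scheme this abstract describes, filling missing values by truncated gradient descent on an energy function while holding observed values fixed, can be sketched on a toy quadratic energy. The parameters below are hypothetical stand-ins for the paper's neural energy models.

```python
import numpy as np

# Quadratic energy E(x) = 0.5 * (x - mu)^T P (x - mu): a stand-in
# for a learned neural energy function (all numbers are invented).
mu = np.array([1.0, -1.0, 0.5])
P = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])

def impute(x, missing, lr=0.05, steps=500):
    """Fill `missing` entries by (truncated) gradient descent on E,
    holding the observed entries fixed."""
    x = x.copy()
    for _ in range(steps):
        grad = P @ (x - mu)            # dE/dx for the quadratic energy
        x[missing] -= lr * grad[missing]
    return x

x_obs = np.array([1.0, 0.0, 0.0])      # entries 1 and 2 are missing
x_hat = impute(x_obs, missing=[1, 2])  # converges near mu[1], mu[2]
```

For this energy the descent converges to the exact conditional minimizer; with the paper's neural energies the same loop is simply truncated after a fixed budget of steps, which is what makes training through the inference procedure tractable.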
5 0.28661287 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis
Author: Fang Han, Tuo Zhao, Han Liu
Abstract: We propose a high dimensional classification method, named the Copula Discriminant Analysis (CODA). The CODA generalizes the normal-based linear discriminant analysis to the larger Gaussian Copula models (or the nonparanormal) as proposed by Liu et al. (2009). To simultaneously achieve estimation efficiency and robustness, nonparametric rank-based methods including Spearman's rho and Kendall's tau are exploited in estimating the covariance matrix. In high dimensional settings, we prove that the sparsity pattern of the discriminant features can be consistently recovered with the parametric rate, and the expected misclassification error is consistent with the Bayes risk. Our theory is backed up by careful numerical experiments, which show that the extra flexibility gained by the CODA method incurs little efficiency loss even when the data are truly Gaussian. These results suggest that the CODA method can be an alternative choice besides the normal-based high dimensional linear discriminant analysis. Keywords: high dimensional statistics, sparse nonlinear discriminant analysis, Gaussian copula, nonparanormal distribution, rank-based statistics
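The rank-based covariance estimation that CODA relies on can be illustrated with Kendall's tau and the standard Gaussian-copula bridge r = sin(pi * tau / 2). The synthetic data below distort one margin monotonically; tau is invariant to the distortion, so the latent correlation is still recovered.

```python
import numpy as np

rng = np.random.default_rng(3)

def kendall_tau(x, y):
    # O(n^2) Kendall's tau: average sign-agreement over all pairs.
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return 2.0 * s / (n * (n - 1))

# Correlated Gaussians with a monotone transform applied to one margin.
rho = 0.6
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=400)
x, y = z[:, 0], np.exp(z[:, 1])        # exp() wrecks the margin, not the ranks

# Gaussian-copula relation: latent correlation = sin(pi * tau / 2).
rho_hat = np.sin(np.pi * kendall_tau(x, y) / 2.0)
```

Despite the non-Gaussian margin, `rho_hat` lands near the true 0.6, which is why plugging rank-based estimates into a discriminant rule buys robustness at little efficiency cost.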
6 0.24582069 11 jmlr-2013-Algorithms for Discovery of Multiple Markov Boundaries
7 0.23692906 19 jmlr-2013-BudgetedSVM: A Toolbox for Scalable SVM Approximations
8 0.22773559 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos
9 0.22155213 12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting
10 0.21819049 57 jmlr-2013-Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels
11 0.21162769 40 jmlr-2013-Efficient Program Synthesis Using Constraint Satisfaction in Inductive Logic Programming
12 0.199826 23 jmlr-2013-Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty
13 0.19849712 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification
14 0.19638008 81 jmlr-2013-Optimal Discovery with Probabilistic Expert Advice: Finite Time Analysis and Macroscopic Optimality
15 0.18680401 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications
17 0.17498183 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning
18 0.17462738 80 jmlr-2013-One-shot Learning Gesture Recognition from RGB-D Data Using Bag of Features
19 0.17403845 109 jmlr-2013-Stress Functions for Nonlinear Dimension Reduction, Proximity Analysis, and Graph Drawing
20 0.17078349 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
topicId topicWeight
[(0, 0.035), (5, 0.103), (6, 0.053), (9, 0.012), (10, 0.062), (20, 0.025), (22, 0.393), (23, 0.03), (53, 0.016), (68, 0.022), (70, 0.015), (71, 0.021), (75, 0.055), (85, 0.022), (87, 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 0.68503106 22 jmlr-2013-Classifying With Confidence From Incomplete Information
Author: Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, Dun Yu Hsiao
Abstract: We consider the problem of classifying a test sample given incomplete information. This problem arises naturally when data about a test sample is collected over time, or when costs must be incurred to compute the classification features. For example, in a distributed sensor network only a fraction of the sensors may have reported measurements at a certain time, and additional time, power, and bandwidth are needed to collect the complete data to classify. A practical goal is to assign a class label as soon as enough data is available to make a good decision. We formalize this goal through the notion of reliability—the probability that a label assigned given incomplete data would be the same as the label assigned given the complete data, and we propose a method to classify incomplete data only if some reliability threshold is met. Our approach models the complete data as a random variable whose distribution is dependent on the current incomplete data and the (complete) training data. The method differs from standard imputation strategies in that our focus is on determining the reliability of the classification decision, rather than just the class label. We show that the method provides useful reliability estimates of the correctness of the imputed class labels on a set of experiments on time-series data sets, where the goal is to classify the time-series as early as possible while still guaranteeing that the reliability threshold is met. Keywords: classification, sensor networks, signals, reliability
2 0.57663262 75 jmlr-2013-Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood
Author: Jaakko Riihimäki, Pasi Jylänki, Aki Vehtari
Abstract: This paper considers probabilistic multinomial probit classification using Gaussian process (GP) priors. Challenges with multiclass GP classification are the integration over the non-Gaussian posterior distribution, and the increase of the number of unknown latent variables as the number of target classes grows. Expectation propagation (EP) has proven to be a very accurate method for approximate inference but the existing EP approaches for the multinomial probit GP classification rely on numerical quadratures, or independence assumptions between the latent values associated with different classes, to facilitate the computations. In this paper we propose a novel nested EP approach which does not require numerical quadratures, and approximates accurately all between-class posterior dependencies of the latent values, but still scales linearly in the number of classes. The predictive accuracy of the nested EP approach is compared to Laplace, variational Bayes, and Markov chain Monte Carlo (MCMC) approximations with various benchmark data sets. In the experiments nested EP was the most consistent method compared to MCMC sampling, but in terms of classification accuracy the differences between all the methods were small from a practical point of view. Keywords: Gaussian process, multiclass classification, multinomial probit, approximate inference, expectation propagation
Author: Alberto N. Escalante-B., Laurenz Wiskott
Abstract: Supervised learning from high-dimensional data, for example, multimedia data, is a challenging task. We propose an extension of slow feature analysis (SFA) for supervised dimensionality reduction called graph-based SFA (GSFA). The algorithm extracts a label-predictive low-dimensional set of features that can be post-processed by typical supervised algorithms to generate the final label or class estimation. GSFA is trained with a so-called training graph, in which the vertices are the samples and the edges represent similarities of the corresponding labels. A new weighted SFA optimization problem is introduced, generalizing the notion of slowness from sequences of samples to such training graphs. We show that GSFA computes an optimal solution to this problem in the considered function space and propose several types of training graphs. For classification, the most straightforward graph yields features equivalent to those of (nonlinear) Fisher discriminant analysis. Emphasis is on regression, where four different graphs were evaluated experimentally with a subproblem of face detection on photographs. The method proposed is promising particularly when linear models are insufficient as well as when feature selection is difficult. Keywords: slow feature analysis, feature extraction, classification, regression, pattern recognition, training graphs, nonlinear dimensionality reduction, supervised learning, implicitly supervised, high-dimensional data, image analysis
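Linear SFA, the special case that GSFA generalizes to training graphs, reduces to a small generalized eigenproblem: minimize the variance of a feature's temporal derivative subject to unit variance. The sketch below uses synthetic mixed signals and is not the GSFA algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(6)

# A slow sine and fast white noise, linearly mixed into two channels.
t = np.linspace(0, 4 * np.pi, 1000)
slow = np.sin(t)
fast = rng.normal(size=t.size)
X = np.c_[slow + 0.1 * fast, fast + 0.1 * slow]
X = X - X.mean(axis=0)

# Linear SFA: minimize w' A w subject to w' B w = 1, where
# A is the covariance of the time differences and B the data covariance.
dX = np.diff(X, axis=0)
A = dX.T @ dX / len(dX)
B = X.T @ X / len(X)

# Generalized eigenproblem A w = lambda B w; the smallest eigenvalue
# gives the slowest feature.
evals, evecs = np.linalg.eig(np.linalg.solve(B, A))
w = np.real(evecs[:, np.argmin(np.real(evals))])
y = X @ w                                  # extracted slow feature
corr = abs(np.corrcoef(y, slow)[0, 1])     # ~1: the sine is recovered
```

GSFA replaces the implicit "consecutive samples are neighbors" structure with edge weights from a training graph, so the same eigenproblem machinery extracts label-predictive rather than merely slow features.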
4 0.3441292 120 jmlr-2013-Variational Algorithms for Marginal MAP
Author: Qiang Liu, Alexander Ihler
Abstract: The marginal maximum a posteriori probability (MAP) estimation problem, which calculates the mode of the marginal posterior distribution of a subset of variables with the remaining variables marginalized, is an important inference problem in many models, such as those with hidden variables or uncertain parameters. Unfortunately, marginal MAP can be NP-hard even on trees, and has attracted less attention in the literature compared to the joint MAP (maximization) and marginalization problems. We derive a general dual representation for marginal MAP that naturally integrates the marginalization and maximization operations into a joint variational optimization problem, making it possible to easily extend most or all variational-based algorithms to marginal MAP. In particular, we derive a set of “mixed-product” message passing algorithms for marginal MAP, whose form is a hybrid of max-product, sum-product and a novel “argmax-product” message updates. We also derive a class of convergent algorithms based on proximal point methods, including one that transforms the marginal MAP problem into a sequence of standard marginalization problems. Theoretically, we provide guarantees under which our algorithms give globally or locally optimal solutions, and provide novel upper bounds on the optimal objectives. Empirically, we demonstrate that our algorithms significantly outperform the existing approaches, including a state-of-the-art algorithm based on local search methods. Keywords: graphical models, message passing, belief propagation, variational methods, maximum a posteriori, marginal-MAP, hidden variable models
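The difference between marginal MAP and joint MAP can be shown by brute force on a tiny probability table; the numbers are contrived so the two answers disagree, and this is of course a toy, not the paper's variational message-passing machinery.

```python
import numpy as np

# A small joint p(a, b): rows index a, columns index b (toy numbers).
p = np.array([[0.20, 0.20, 0.20],    # p(a=0, b=.) -- diffuse mass
              [0.30, 0.05, 0.05]])   # p(a=1, b=.) -- one sharp peak

# Marginal MAP over a: sum out b first, then maximize (max-sum).
a_mmap = int(np.argmax(p.sum(axis=1)))                     # a = 0 wins, 0.6 vs 0.4

# Versus the a-component of the joint MAP (max-max over both variables).
a_joint = int(np.unravel_index(np.argmax(p), p.shape)[0])  # a = 1 via the 0.30 cell

# The two disagree, which is why marginal MAP needs its own algorithms.
```

Interleaving a sum over b with a max over a is what breaks the usual dynamic-programming structure and makes marginal MAP hard even on trees.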
5 0.3428233 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization
Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright
Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error (MSE) that decays as O(N^{-1} + (N/m)^{-2}). Whenever m ≤ √N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O(N^{-1} + (N/m)^{-3}), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O(N^{-1} + (N/m)^{-3/2}), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 10^8 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling
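The one-round averaging scheme can be sketched for least squares: split the data, fit locally, and average the m local estimates. The sample sizes, noise level, and even split below are arbitrary toys.

```python
import numpy as np

rng = np.random.default_rng(4)

# Linear regression y = X @ theta + noise, to be split across m "machines".
d, N, m = 3, 6000, 6
theta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(N, d))
y = X @ theta + rng.normal(scale=0.5, size=N)

def local_fit(Xi, yi):
    # Each machine solves its own least squares problem independently.
    return np.linalg.lstsq(Xi, yi, rcond=None)[0]

# One round of communication: average the m local estimates.
theta_avg = np.mean(
    [local_fit(X[i::m], y[i::m]) for i in range(m)], axis=0)
```

Each machine sees N/m = 1000 samples, yet the averaged estimate lands close to the true coefficients; the subsampled-bootstrap variant in the paper refines exactly this estimator to reduce the bias term.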
6 0.34276524 110 jmlr-2013-Sub-Local Constraint-Based Learning of Bayesian Networks Using A Joint Dependence Criterion
7 0.34212101 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation
8 0.34201998 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning
10 0.33939001 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization
11 0.33873832 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference
12 0.33861145 3 jmlr-2013-A Framework for Evaluating Approximation Methods for Gaussian Process Regression
13 0.3376551 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees
14 0.33718556 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut
15 0.33559334 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components
16 0.33555323 59 jmlr-2013-Large-scale SVD and Manifold Learning
17 0.33487442 93 jmlr-2013-Random Walk Kernels and Learning Curves for Gaussian Process Regression on Random Graphs
18 0.33478409 86 jmlr-2013-Parallel Vector Field Embedding
19 0.33440313 51 jmlr-2013-Greedy Sparsity-Constrained Optimization
20 0.33407289 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning