nips nips2011 nips2011-19 knowledge-graph by maker-knowledge-mining

19 nips-2011-Active Classification based on Value of Classifier


Source: pdf

Author: Tianshi Gao, Daphne Koller

Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. [sent-6, score-0.094]

2 Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. [sent-7, score-0.074]

3 This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. [sent-9, score-0.149]

4 We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods. [sent-10, score-0.193]

5 For example, in an image labeling task, we have the option of using the GIST feature [26], the SIFT feature [23], the spatial HOG feature [33], Object Bank [21], and more. [sent-14, score-0.133]

6 However, as we discuss in Section 2, most of these methods use a fixed procedure determined at training time to apply the classifiers without adapting to each individual test instance. [sent-21, score-0.067]

7 In this paper, we take an active and adaptive approach to combine multiple classifiers/features at test time, based on the idea of value of information [16, 17, 24, 22]. [sent-22, score-0.094]

8 At training time, we construct a rich family of classifiers, which may vary in the features that they use or the set of distinctions that they make. [sent-23, score-0.087]

9 At test time, we dynamically select an instance-specific subset of classifiers. [sent-27, score-0.071]

10 Starting from an empty set of observations, at each stage we use a myopic value-of-information computation to select the next classifier to apply to the instance, in a way that attempts to increase the accuracy of our classification state. [sent-29, score-0.088]

11 2 Related Work Our classification model is based on multiple classifiers, so it resembles ensemble methods like boosting [10], random forests [4] and output-coding based multiclass classification [9, 1, 29, 14]. [sent-38, score-0.112]

12 However, these methods use a static decision process, where all classifiers have to be evaluated before any decision can be made. [sent-39, score-0.081]

13 [30] proposed a stopping criterion for random forests such that decisions can be made based on a subset of the trees. [sent-45, score-0.079]

14 Instance-specific decision paths based on previous observations can be found in decision-tree-style models. [sent-47, score-0.086]

15 Instead of making hard decisions based on individual observations like these methods, we use a probabilistic model to fuse information from multiple observations and only make decisions when it is sufficiently confident. [sent-50, score-0.144]

16 Instead of selecting a fixed set of features in the learning stage [34], we actively select instance-specific features in the test stage. [sent-52, score-0.152]

17 Our selection criterion trades off the statistical gain against the computational cost of the classifier, resulting in a computationally efficient cheap-to-expensive evaluation process. [sent-54, score-0.115]

18 [2] performed feature selection to achieve a certain accuracy under some computational budget, but the selection is done at training time without adaptation to individual test instances. [sent-58, score-0.219]

19 [5] considered test-time feature value acquisition under the strong assumption that observations are conditionally independent given the class variable. [sent-60, score-0.099]

20 [7] used active learning to select training instances to reduce the labeling cost and speed up learning, while our work focuses on inference. [sent-64, score-0.179]

21 Furthermore, we assume that we have been provided a set of trained classifiers $\mathcal{H}$, where each $h_i \in \mathcal{H}: \mathcal{X} \rightarrow \mathbb{R}$ can be any real-valued classifier (function) from existing methods. [sent-66, score-0.213]

22 For example, for multiclass classification, $h_i$ can be a one-vs-all classifier, a one-vs-one classifier, or a weak learner from a boosting algorithm. [sent-67, score-0.291]

23 Note that the $h_i$'s do not have to be homogeneous, meaning that they can have different functional forms. [sent-68, score-0.196]

24 For example, they may be linear or nonlinear; more importantly, they can be trained on different types of features with various computational costs. [sent-70, score-0.086]

25 Given an instance $x$, our goal is to infer $Y$ by sequentially selecting one $h_i$ to evaluate at a time, based on what has already been observed, until we are sufficiently confident about $Y$ or some other stopping criterion is met. [sent-71, score-0.256]
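The sequential procedure just described can be summarized in a short control loop. The sketch below is our own minimal rendering, not the authors' code: the `model` object, its `posterior`/`value`/`cost` methods, and the two stopping knobs (`conf_thresh`, `budget`) are hypothetical stand-ins for the machinery defined in the rest of this section.

```python
import numpy as np

def active_classify(x, classifiers, model, conf_thresh=0.95, budget=np.inf):
    """Sequentially evaluate classifiers on x until the belief P(Y | m_O)
    is confident enough or the budget is exhausted (a hedged sketch)."""
    observed, spent = {}, 0.0                        # start from O = empty set
    while True:
        posterior = model.posterior(observed)        # current belief P(Y | m_O)
        if posterior.max() >= conf_thresh or spent >= budget:
            break                                    # confident enough / out of budget
        remaining = [i for i in range(len(classifiers)) if i not in observed]
        if not remaining:
            break
        # myopic step: evaluate the unobserved classifier with the highest value
        best = max(remaining, key=lambda i: model.value(i, observed))
        spent += model.cost(best, observed)          # pay C(h_best | O)
        observed[best] = classifiers[best](x)        # record the margin m_best
    return int(np.argmax(model.posterior(observed)))
```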

26 The key quantity in this process is how valuable we think a classifier $h_i$ is, so we introduce the value of a classifier as follows. [sent-74, score-0.212]

27 Denote by $M_i = h_i(X)$ the random variable for the response/margin of the $i$-th classifier in $\mathcal{H}$, and denote the random vector of observed classifier responses by $M_O = [M_{o_1}, M_{o_2}, \ldots]$. [sent-77, score-0.196]

28 Furthermore, we use $C(h_i \mid O)$ to denote the computational cost of evaluating classifier $h_i$ conditioned on the set of evaluated classifiers $O$. [sent-83, score-0.316]

29 This is because if $h_i$ shares the same feature with some $o_i \in O$, we do not need to compute the feature again. [sent-84, score-0.286]

30 With some chosen reward $R$ and a computational cost model $C(h_i \mid O)$, we define the value of an unobserved classifier as follows. [sent-85, score-0.083]

31 Definition 1. The value of classifier $V(h_i \mid m_O)$ for a classifier $h_i$, given the observed classifier responses $m_O$, is the combination of the expected reward of the state informed by $h_i$ and the computational cost of $h_i$. [sent-86, score-0.722]

32 The first part, $V_R(h_i \mid m_O) = \mathbb{E}\left[ R(P(Y \mid M_i, m_O)) \right]$, is the expected reward of $P(Y \mid M_i, m_O)$, where the expectation is with respect to the posterior of $M_i$ given $m_O$. [sent-88, score-0.074]

33 The second part, $V_C(h_i \mid m_O) \triangleq -\frac{1}{\tau} C(h_i \mid O)$, is a computational penalty incurred by evaluating the classifier $h_i$. [sent-89, score-0.246]

34 Given the definition of the value of classifier, at each step of our sequential evaluations our goal is to pick the $h_i$ with the highest value:
$$h^* = \operatorname*{argmax}_{h_i \in \mathcal{H}\setminus O} V(h_i \mid m_O) = \operatorname*{argmax}_{h_i \in \mathcal{H}\setminus O} \big[ V_R(h_i \mid m_O) + V_C(h_i \mid m_O) \big] \qquad (2)$$
We introduce the building blocks of the value of classifier, i.e., the expected reward and the computational cost model, next. [sent-91, score-0.652]
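As a concrete illustration of Eq. (2), here is a hedged Monte Carlo sketch of computing $V(h_i \mid m_O)$. The `model.sample_margin` and `model.posterior` hooks, and the sample count, are our own assumptions rather than the paper's implementation; the inline reward anticipates the negative-entropy choice of Eq. (3) below.

```python
import numpy as np

def value_of_classifier(i, observed, model, tau, n_samples=50):
    """V(h_i | m_O) = V_R(h_i | m_O) + V_C(h_i | m_O), Definition 1 / Eq. (2).
    The expectation over M_i is approximated by Monte Carlo sampling."""
    rewards = []
    for _ in range(n_samples):
        m_i = model.sample_margin(i, observed)        # M_i ~ P(M_i | m_O)
        post = model.posterior({**observed, i: m_i})  # P(Y | m_i, m_O)
        # negative-entropy reward R(P) = sum_y P(y) log P(y), Eq. (3)
        rewards.append(float(np.sum(post * np.log(post + 1e-12))))
    v_r = float(np.mean(rewards))                     # expected reward V_R
    v_c = -model.cost(i, observed) / tau              # penalty V_C = -C/tau
    return v_r + v_c
```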

35 Formally, given some posterior distribution $P(Y \mid m_O)$, we define
$$R(P(Y \mid m_O)) = -H(Y \mid m_O) = \sum_y P(y \mid m_O) \log P(y \mid m_O) \qquad (3)$$
The value of classifier under this reward definition is closely related to information gain. [sent-99, score-0.09]

36 The cost of evaluating a classifier h on an instance x can be broken down into two parts. [sent-112, score-0.092]

37 The first part is the cost of computing the feature $\phi: \mathcal{X} \rightarrow \mathbb{R}^n$ on which $h$ is built, and the second is the cost of computing the function value of $h$ given the input $\phi(x)$. [sent-113, score-0.158]

38 If $h$ shares the same feature as some evaluated classifier in $O$, then $C(h \mid O)$ only consists of the cost of evaluating the function $h$; otherwise it also includes the cost of computing the feature input $\phi$. [sent-114, score-0.229]
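A minimal sketch of this feature-sharing cost model follows. The lookup tables `feature_of`, `feature_cost` and `eval_cost` are hypothetical bookkeeping of our own; the paper does not specify the accounting at this level of detail.

```python
def classifier_cost(i, evaluated, feature_of, feature_cost, eval_cost):
    """C(h_i | O): cost of evaluating classifier i given the set `evaluated`
    of already-evaluated classifier indices (a hedged sketch)."""
    cost = eval_cost[i]                            # applying h_i to phi(x)
    already = {feature_of[j] for j in evaluated}   # features shared with O
    if feature_of[i] not in already:
        cost += feature_cost[feature_of[i]]        # usually the dominant term
    return cost
```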

39 Note that computing φ is usually much more expensive than evaluating the function value of h. [sent-115, score-0.076]

40 Given a test instance $x$, we construct an instance-specific joint distribution over $Y$ and the selected observations $M_O$. [sent-117, score-0.071]

41 Given the subset of the training set $\{(x^{(j)}, y^{(j)} = y)\}_{j=1}^{N_y}$ corresponding to the instances from class $y$, we denote $m_i^{(j)} = h_i(x^{(j)})$; our goal is then to learn $P(M_i \mid m_O, y)$ from $\{(m^{(j)}, y^{(j)} = y)\}_{j=1}^{N_y}$. [sent-126, score-0.466]

42 If $O = \emptyset$, then $P(M_i \mid m_O, y)$ reduces to the marginal distribution $P(M_i \mid y) = \mathcal{N}(\mu_y, \sigma_y^2)$, and based on maximum likelihood estimation we have $\mu_y = \frac{1}{N_y} \sum_j m_i^{(j)}$ and $\sigma_y^2 = \frac{1}{N_y} \sum_j (m_i^{(j)} - \mu_y)^2$. [sent-127, score-0.215]
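The closed-form estimates above translate directly into code. The sketch below fits the per-class Gaussian marginals, assuming the responses of one classifier on the training set are available as an array:

```python
import numpy as np

def fit_class_marginals(margins, labels):
    """Maximum likelihood fit of P(M_i | y) = N(mu_y, sigma_y^2) from the
    N_y training responses of each class y.
    margins: (N,) responses m_i^(j) of one classifier; labels: (N,) classes."""
    params = {}
    for y in np.unique(labels):
        m_y = margins[labels == y]
        mu_y = m_y.mean()                    # (1/N_y) sum_j m_i^(j)
        var_y = ((m_y - mu_y) ** 2).mean()   # biased MLE of sigma_y^2
        params[y] = (mu_y, var_y)
    return params
```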

43 It is worth noting that $(\bar{M}_O^T W \bar{M}_O + \lambda I)^{-1} W$ in (8) does not depend on $i$, so it can be computed once and shared across the different classifiers $h_i$. [sent-136, score-0.196]
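A hedged sketch of that shared computation is given below, assuming a standard weighted ridge form for Eq. (8); the exact way $W$ enters the paper's expression may differ, and the shapes are our own assumptions.

```python
import numpy as np

def shared_regression_factor(M_obs, w, lam=1e-3):
    """The i-independent factor of the weighted ridge regression used to
    model P(M_i | m_O, y): computed once, then reused for every h_i.
    M_obs: (N, k) observed responses; w: (N,) sample weights."""
    W = np.diag(w)
    k = M_obs.shape[1]
    # factor once; for each unobserved h_i: coeffs_i = A @ targets_i
    A = np.linalg.solve(M_obs.T @ W @ M_obs + lam * np.eye(k), M_obs.T @ W)
    return A
```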

44 For example, in multiclass classification, if we include all one-vs-one classifiers into H, |H| is quadratic in the number of classes. [sent-149, score-0.069]

45 Based on the posterior $P(Y \mid m_O)$, we can make a dynamic and adaptive decision about whether to continue observing new classifiers or to stop the process. [sent-152, score-0.065]
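One plausible rendering of this control step is sketched below. Stopping on posterior confidence and pruning the class space by cumulative posterior mass are our illustrative choices, not necessarily the paper's exact rules; all thresholds are hypothetical.

```python
import numpy as np

def decide(posterior, classes, spent, conf=0.95, budget=np.inf, keep=0.99):
    """Stop if the belief P(Y | m_O) is confident enough or the budget is
    spent; otherwise keep only the smallest set of classes covering `keep`
    posterior mass (the promising classes N_{t,Y} discussed below)."""
    if posterior.max() >= conf or spent >= budget:
        return "stop", [classes[int(np.argmax(posterior))]]
    order = np.argsort(posterior)[::-1]        # classes by decreasing belief
    mass, promising = 0.0, []
    for idx in order:
        promising.append(classes[idx])
        mass += posterior[idx]
        if mass >= keep:
            break
    return "continue", promising
```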

46 To avoid computing the values of all the remaining classifiers, we can dynamically restrict the search space of classifiers to those having high expected mutual information with Y with respect to the current posterior P (Y |mO ). [sent-171, score-0.089]

47 Specifically, during training, for each classifier $h_i$ we can compute the mutual information $I(M_i; B_y)$ between its response $M_i$ and a class $y$, where $B_y$ is a binary variable indicating whether an instance is from class $y$ or not. [sent-172, score-0.277]
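A simple plug-in estimator of $I(M_i; B_y)$ can be precomputed for every classifier-class pair at training time, for example by discretizing the responses; the histogram binning below is our own estimation choice, not the paper's.

```python
import numpy as np

def mutual_information(margins, is_class, bins=20):
    """Plug-in estimate of I(M_i; B_y) between a classifier's discretized
    response and the binary class indicator B_y (0/1 array)."""
    edges = np.histogram_bin_edges(margins, bins=bins)
    m_disc = np.digitize(margins, edges[1:-1])     # bin index per response
    mi = 0.0
    for b in (0, 1):
        p_b = np.mean(is_class == b)
        for v in np.unique(m_disc):
            p_joint = np.mean((is_class == b) & (m_disc == v))
            if p_joint > 0:
                p_v = np.mean(m_disc == v)
                mi += p_joint * np.log(p_joint / (p_b * p_v))
    return mi
```

At test time, the candidate set can then be restricted to the top-$L$ classifiers by $I(M_i; B_y)$ over the currently promising classes $y$.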

48–54 [Figure 1 plot residue removed; recoverable content: four panels titled "results on satimage / pendigits / vowel / letter dataset", a shared legend ("selection by value of classifier", "random selection", "one-vs-all", "dagsvm", "one-vs-one", "tree"), x-axis "number of evaluated classifiers", y-axis "test classification accuracy".]

55 Figure 1: (Best viewed magnified and in colors) Performance comparisons on UCI datasets. [sent-217, score-0.243]

56 From left to right are the results on the satimage, pendigits, vowel and letter (in log scale) datasets. [sent-218, score-0.086]

57 Note that the error bars for pendigits and letter datasets are very small (around 0. [sent-219, score-0.081]

58 First, selecting the top L candidate observations takes constant time, since we can sort the observations based on $I(M_i; B_y)$ before the test process. [sent-222, score-0.071]

59 Given our dynamic pruning of the class space, suppose there are only $N_{t,Y}$ promising classes to consider instead of the total number of classes $K$. [sent-224, score-0.067]

60 The key to a low cost is to effectively prune the class space (small $N_{t,Y}$) and reach a decision quickly (small $t$). [sent-229, score-0.082]

61 Specifically, it dynamically selects an instance-specific subset of features, resulting in classification accuracy comparable to or higher than using all features, but with a significant reduction in the computational cost. [sent-235, score-0.151]

62 If there are multiple features $\{\phi_i\}_{i=1}^F$, our pool of classifiers is $\mathcal{H} = \cup_{i=1}^F \mathcal{H}_{\phi_i}$. [sent-239, score-0.069]

63 During training, in addition to learning the classifiers, we also need to compute the response $m_i^{(j)}$ of each classifier $h_i \in \mathcal{H}$ for each training instance $x^{(j)}$. [sent-241, score-0.455]

64 In order to make the training distribution of the classifier responses better match the test distribution, when evaluating classifier $h_i$ on $x^{(j)}$, we do not want $h_i$ to have been trained on $x^{(j)}$. [sent-242, score-0.497]

65 After this procedure, each training instance $x^{(j)}$ will have been evaluated by all $h_i$'s. [sent-245, score-0.273]
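This held-out response computation is essentially a cross-validation-style pass over the training set. A minimal sketch, assuming a hypothetical `train_classifier` factory (returning a fitted model with a `decision_function`) and scikit-learn's `KFold` splitter; the fold count is our own choice:

```python
import numpy as np
from sklearn.model_selection import KFold

def heldout_responses(X, y, train_classifier, n_folds=5):
    """Compute training responses m^(j) = h(x^(j)) so that no instance is
    scored by a classifier that saw it during training, keeping the
    training-time response distribution close to the test-time one."""
    responses = np.zeros(len(X))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, held_idx in kf.split(X):
        h = train_classifier(X[train_idx], y[train_idx])  # retrain per fold
        responses[held_idx] = h.decision_function(X[held_idx])
    return responses
```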

66 Note that the classifiers used in the test stage are trained on the entire training set. [sent-246, score-0.074]

67 Although for different training instances $x^{(j)}$ and $x^{(k)}$ from different folds and a test instance $x$, the responses $m_i^{(j)}$, $m_i^{(k)}$ and $m_i$ are obtained using different $h_i$'s, our experimental results confirmed that their empirical distributions are close enough to achieve good performance. [sent-247, score-0.922]

68 We compare different methods in terms of both the classification accuracy and the number of evaluated classifiers. [sent-256, score-0.082]

69 For our algorithm and the random selection baseline, we show the accuracy over iterations as well. [sent-257, score-0.072]

70 As can be seen, our method can achieve accuracy comparable to or higher than that of traditional methods. [sent-261, score-0.085]

71 In terms of the number of evaluated classifiers, our active scheme is very effective: the mean numbers of classifier evaluations for the 6-class, 10-class, 11-class and 26-class problems are 4. [sent-267, score-0.106]

72 Although the tree-based method can also use a small number of classifiers, it sometimes suffers a significant drop in accuracy, as on the vowel and letter datasets. [sent-272, score-0.135]

73 To maintain a belief over the class variable $Y$ and to dynamically select classifiers with high value, we have introduced additional computational costs. [sent-275, score-0.088]

74 For example, this additional cost is around 10 ms for satimage; however, evaluating a linear classifier takes less than 1 ms due to the very low feature dimension, so the actual running time of the active scheme is higher than that of one-vs-one. [sent-278, score-0.179]

75 Therefore, our method will have a real computational advantage only if the cost of evaluating the classifiers is higher than the cost of our probabilistic inference. [sent-279, score-0.144]

76 We test our active classification on the benchmark scene recognition dataset Scene15 [20]. [sent-282, score-0.105]

77 [Flattened table residue; recoverable structure: columns "model" and "accuracy", with rows for all features, the best individual feature (Object Bank [21]), the fastest feature (GIST [26]), and ours with τ = 25, τ = 100 and τ = 600; the accuracy values (beginning 86.) did not survive extraction.]

78 For our methods, we show the speedup factors with respect to using all the features in a static way. [sent-319, score-0.067]

79 We consider various types of features since, as shown in [33], the classification accuracy can be significantly improved by combining multiple features, but at a high computational cost. [sent-320, score-0.118]

80 Our feature set includes 7 features from [33], namely GIST, spatial HOG, dense SIFT, Local Binary Pattern, self-similarity, texton histogram, and geometry-specific histograms (please refer to [33] for details), plus another recently proposed high-level image feature, Object Bank [21]. [sent-321, score-0.145]

81 Instead of treating $\Phi$ as an undecomposable single feature vector, we can think of it as a collection of 177 different features $\{\phi_i\}_{i=1}^{177}$. [sent-328, score-0.088]

82 Therefore, our feature pool consists of 184 features in total. [sent-329, score-0.107]

83 One traditional way to combine these features is through multiple kernel learning. [sent-333, score-0.072]

84 For our active classification, we will not compute all features at the beginning of the evaluation process, but will only compute a component $\phi_i$ when a classifier $h$ based on it is selected. [sent-336, score-0.105]

85 To demonstrate the trade-off between accuracy and computational cost in the definition of the value of classifier, we run multiple experiments with various $\tau$'s. [sent-339, score-0.121]

86 We also report results on the Scene15 dataset, comparing to the best individual features in terms of either accuracy or speed (the reported accuracy is the better of one-vs-one and one-vs-all). [sent-341, score-0.172]

87 As can be seen, combining all features using the traditional method indeed improves the accuracy significantly over the individual features, but at a high computational cost. [sent-342, score-0.178]

88 However, using active classification, to achieve accuracy similar to the baseline of using all features, we can get 28. [sent-343, score-0.264]

89 Note that at this configuration, our method is faster than the state-of-the-art individual feature [21], and is also 2. [sent-345, score-0.081]

90 To further test the effectiveness of our active selection scheme, we compare with another baseline that sequentially adds features. [De-interleaved Figure 2 caption: Classification accuracy versus running time for the baseline, active classification, and various individual features (GIST, LBP, spatial HOG, Object Bank, dense SIFT).] [sent-349, score-0.221]

91 Specifically, we first rank the individual features based on their classification accuracy, and only consider the top 80 features (using 80 features achieves essentially the same accuracy as using 184 features). [sent-351, score-0.223]

92 Given this selected pool, we arrange the features in order of increasing computational complexity, and then train a classifier based on the top N features for all values of N from 1 to 80. [sent-352, score-0.119]

93 As shown in Figure 2, our active scheme is one order of magnitude faster than the baseline given the same level of accuracy. [sent-353, score-0.088]

94 [Axis-tick residue removed.] 6 Conclusion and Future Work. In this paper, we presented an active classification process based on the value of classifier. [sent-360, score-0.071]

95 We applied this active scheme in the context of multiclass classification, and achieved comparable or even higher classification accuracy with significant computational savings compared to traditional static methods. [sent-361, score-0.246]

96 One interesting future direction is to estimate the value of features instead of individual classifiers. [sent-362, score-0.09]

97 This is particularly important when computing the feature is much more expensive than evaluating the function value of classifiers, which is often the case. [sent-363, score-0.114]

98 Therefore, predicting the value of a feature (equivalent to the joint value of multiple classifiers sharing the same feature) can potentially lead to a more computationally efficient classification process. [sent-365, score-0.07]

99 Reducing multiclass to binary: a unifying approach for margin classifiers. [sent-376, score-0.069]

100 Object bank: A high-level image representation for scene classification and semantic feature sparsification. [sent-505, score-0.065]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mo', 0.855), ('mi', 0.215), ('classi', 0.208), ('hi', 0.196), ('ers', 0.186), ('er', 0.103), ('multiclass', 0.069), ('dagsvm', 0.062), ('active', 0.055), ('satimage', 0.052), ('features', 0.05), ('vr', 0.05), ('accuracy', 0.049), ('reward', 0.048), ('vowel', 0.046), ('classifiers', 0.041), ('pendigits', 0.041), ('tny', 0.041), ('classification', 0.04), ('letter', 0.04), ('feature', 0.038), ('cost', 0.037), ('emi', 0.036), ('vs', 0.036), ('scheduling', 0.036), ('dynamically', 0.033), ('evaluated', 0.033), ('classifier', 0.031), ('evaluating', 0.031), ('wmo', 0.031), ('cation', 0.031), ('uci', 0.03), ('bank', 0.028), ('scene', 0.027), ('ny', 0.027), ('decisions', 0.026), ('posterior', 0.026), ('boosting', 0.026), ('object', 0.025), ('vc', 0.025), ('decision', 0.024), ('spent', 0.024), ('instance', 0.024), ('individual', 0.024), ('observations', 0.024), ('wj', 0.024), ('selection', 0.023), ('classes', 0.023), ('test', 0.023), ('gist', 0.022), ('mt', 0.022), ('traditional', 0.022), ('stopping', 0.022), ('fi', 0.022), ('gain', 0.022), ('class', 0.021), ('stanford', 0.021), ('focuses', 0.021), ('angelova', 0.021), ('instancespeci', 0.021), ('schwing', 0.021), ('probabilistic', 0.02), ('training', 0.02), ('spatial', 0.019), ('pool', 0.019), ('computational', 0.019), ('scheme', 0.018), ('hog', 0.018), ('met', 0.018), ('speedup', 0.017), ('distinctions', 0.017), ('forests', 0.017), ('trained', 0.017), ('value', 0.016), ('costs', 0.016), ('inform', 0.016), ('sian', 0.016), ('argmax', 0.016), ('mutual', 0.015), ('select', 0.015), ('computing', 0.015), ('baseline', 0.015), ('cvpr', 0.015), ('chai', 0.015), ('cheap', 0.015), ('stop', 0.015), ('criterion', 0.014), ('expensive', 0.014), ('expert', 0.014), ('cohn', 0.014), ('oi', 0.014), ('cally', 0.014), ('sift', 0.014), ('stage', 0.014), ('comparable', 0.014), ('instances', 0.014), ('responses', 0.014), ('koller', 0.014), ('maxy', 0.014), ('tree', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 19 nips-2011-Active Classification based on Value of Classifier

Author: Tianshi Gao, Daphne Koller

Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.

2 0.1287723 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li

Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1

3 0.10396607 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon

Abstract: We introduce PiCoDes: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. PiCoDes of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.

4 0.097550906 178 nips-2011-Multiclass Boosting: Theory and Algorithms

Author: Mohammad J. Saberian, Nuno Vasconcelos

Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1

5 0.091985509 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

6 0.088949651 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification

7 0.085205995 28 nips-2011-Agnostic Selective Classification

8 0.068080001 33 nips-2011-An Exact Algorithm for F-Measure Maximization

9 0.06717582 53 nips-2011-Co-Training for Domain Adaptation

10 0.066985153 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

11 0.063825786 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions

12 0.055692159 165 nips-2011-Matrix Completion for Multi-label Image Classification

13 0.053979371 261 nips-2011-Sparse Filtering

14 0.053329926 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound

15 0.05307965 154 nips-2011-Learning person-object interactions for action recognition in still images

16 0.052723128 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

17 0.051198896 258 nips-2011-Sparse Bayesian Multi-Task Learning

18 0.050138142 180 nips-2011-Multiple Instance Filtering

19 0.049465347 302 nips-2011-Variational Learning for Recurrent Spiking Networks

20 0.048883956 59 nips-2011-Composite Multiclass Losses


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.134), (1, 0.038), (2, -0.048), (3, 0.068), (4, 0.017), (5, 0.054), (6, 0.032), (7, -0.054), (8, -0.094), (9, 0.012), (10, -0.083), (11, 0.01), (12, -0.006), (13, 0.113), (14, -0.03), (15, -0.021), (16, -0.073), (17, -0.018), (18, 0.06), (19, -0.088), (20, 0.029), (21, 0.018), (22, 0.07), (23, -0.032), (24, 0.027), (25, 0.025), (26, -0.044), (27, -0.058), (28, 0.048), (29, 0.032), (30, 0.151), (31, -0.02), (32, 0.027), (33, -0.085), (34, -0.017), (35, -0.104), (36, -0.078), (37, -0.086), (38, -0.012), (39, 0.043), (40, 0.01), (41, 0.055), (42, -0.034), (43, -0.05), (44, 0.086), (45, 0.07), (46, 0.01), (47, 0.006), (48, -0.055), (49, -0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94845557 19 nips-2011-Active Classification based on Value of Classifier

Author: Tianshi Gao, Daphne Koller

Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.

2 0.67365247 28 nips-2011-Agnostic Selective Classification

Author: Yair Wiener, Ran El-Yaniv

Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this √ rejected region of the domain is shown to be diminishing at rate O(Bθ( 1/m)β ), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1

3 0.65021604 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li

Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1

4 0.63439244 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon

Abstract: We introduce PiCoDes: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. PiCoDes of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.

5 0.61795479 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing

Author: Shai Shalev-shwartz, Yonatan Wexler, Amnon Shashua

Abstract: Multiclass prediction is the problem of classifying an object into a relevant target class. We consider the problem of learning a multiclass predictor that uses only few features, and in particular, the number of used features should increase sublinearly with the number of possible classes. This implies that features should be shared by several classes. We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. We also describe how to use ShareBoost for learning a non-linear predictor that has a fast evaluation time. In a series of experiments with natural data sets we demonstrate the benefits of ShareBoost and evaluate its success relatively to other state-of-the-art approaches. 1

6 0.61119026 33 nips-2011-An Exact Algorithm for F-Measure Maximization

7 0.59952271 178 nips-2011-Multiclass Boosting: Theory and Algorithms

8 0.59718895 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification

9 0.54462349 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

10 0.53749555 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions

11 0.53055781 59 nips-2011-Composite Multiclass Losses

12 0.52282637 277 nips-2011-Submodular Multi-Label Learning

13 0.49480906 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

14 0.48839203 254 nips-2011-Similarity-based Learning via Data Driven Embeddings

15 0.48381102 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers

16 0.47758994 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

17 0.47237825 141 nips-2011-Large-Scale Category Structure Aware Image Categorization

18 0.47044691 287 nips-2011-The Manifold Tangent Classifier

19 0.43688318 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

20 0.43501359 143 nips-2011-Learning Anchor Planes for Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.016), (4, 0.042), (20, 0.051), (26, 0.029), (31, 0.083), (33, 0.052), (39, 0.206), (43, 0.053), (45, 0.185), (57, 0.042), (65, 0.019), (74, 0.024), (83, 0.03), (99, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8976782 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

Author: Kristen Grauman, Fei Sha, Sung J. Hwang

Abstract: We introduce an approach to learn discriminative visual representations while exploiting external semantic knowledge about object category relationships. Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM). In this tree, we have one metric for each non-leaf node of the object hierarchy, and each metric is responsible for discriminating among its immediate subcategory children. Specifically, a Mahalanobis metric learned for a given node must satisfy the appropriate (dis)similarity constraints generated only among its subtree members’ training instances. To further exploit the semantics, we introduce a novel regularizer coupling the metrics that prefers a sparse disjoint set of features to be selected for each metric relative to its ancestor (supercategory) nodes’ metrics. Intuitively, this reflects that visual cues most useful to distinguish the generic classes (e.g., feline vs. canine) should be different than those cues most useful to distinguish their component fine-grained classes (e.g., Persian cat vs. Siamese cat). We validate our approach with multiple image datasets using the WordNet taxonomy, show its advantages over alternative metric learning approaches, and analyze the meaning of attribute features selected by our algorithm. 1

2 0.83369029 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure

Author: Fatma K. Karzan, Arkadi S. Nemirovski, Boris T. Polyak, Anatoli Juditsky

Abstract: We discuss new methods for the recovery of signals with block-sparse structure, based on ℓ1-minimization. Our emphasis is on the efficiently computable error bounds for the recovery routines. We optimize these bounds with respect to the method parameters to construct the estimators with improved statistical properties. We justify the proposed approach with an oracle inequality which links the properties of the recovery algorithms and the best estimation performance. 1

3 0.83318067 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements

Author: Yi-kai Liu

Abstract: We study the problem of reconstructing an unknown matrix M of rank r and dimension d using O(rd poly log d) Pauli measurements. This has applications in quantum state tomography, and is a non-commutative analogue of a well-known problem in compressed sensing: recovering a sparse vector from a few of its Fourier coefficients. We show that almost all sets of O(rd log^6 d) Pauli measurements satisfy the rank-r restricted isometry property (RIP). This implies that M can be recovered from a fixed (“universal”) set of Pauli measurements, using nuclear-norm minimization (e.g., the matrix Lasso), with nearly-optimal bounds on the error. A similar result holds for any class of measurements that use an orthonormal operator basis whose elements have small operator norm. Our proof uses Dudley’s inequality for Gaussian processes, together with bounds on covering numbers obtained via entropy duality. 1

4 0.83095974 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent

Author: Inderjit S. Dhillon, Pradeep K. Ravikumar, Ambuj Tewari

Abstract: Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives, as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.

same-paper 5 0.8249414 19 nips-2011-Active Classification based on Value of Classifier

Author: Tianshi Gao, Daphne Koller

Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.

6 0.79633576 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

7 0.76342297 209 nips-2011-Orthogonal Matching Pursuit with Replacement

8 0.73649621 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

9 0.73481596 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

10 0.7342782 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

11 0.73359305 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition

12 0.73303926 168 nips-2011-Maximum Margin Multi-Instance Learning

13 0.73092461 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

14 0.72984207 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

15 0.72430837 244 nips-2011-Selecting Receptive Fields in Deep Networks

16 0.72397673 49 nips-2011-Boosting with Maximum Adaptive Sampling

17 0.72288036 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

18 0.72152072 271 nips-2011-Statistical Tests for Optimization Efficiency

19 0.72117466 165 nips-2011-Matrix Completion for Multi-label Image Classification

20 0.72101468 231 nips-2011-Randomized Algorithms for Comparison-based Search