nips nips2011 nips2011-19 knowledge-graph by maker-knowledge-mining

19 nips-2011-Active Classification based on Value of Classifier


Source: pdf

Author: Tianshi Gao, Daphne Koller

Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. [sent-6, score-0.094]

2 Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. [sent-7, score-0.074]

3 This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. [sent-9, score-0.149]

4 We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods. [sent-10, score-0.193]

5 For example, in an image labeling task, we have the option of using the GIST feature [26], the SIFT feature [23], the spatial HOG feature [33], Object Bank [21], and more. [sent-14, score-0.133]

6 However, as we discuss in Section 2, most of these methods use a fixed procedure determined at training time to apply the classifiers without adapting to each individual test instance. [sent-21, score-0.067]

7 In this paper, we take an active and adaptive approach to combine multiple classifiers/features at test time, based on the idea of value of information [16, 17, 24, 22]. [sent-22, score-0.094]

8 At training time, we construct a rich family of classifiers, which may vary in the features that they use or the set of distinctions that they make. [sent-23, score-0.087]

9 At test time, we dynamically select an instance-specific subset of classifiers. [sent-27, score-0.071]

10 Starting from an empty set of observations, at each stage we use a myopic value-of-information computation to select the next classifier to apply to the instance, in a way that attempts to increase the accuracy of our classification state. [sent-29, score-0.088]

11 2 Related Work Our classification model is based on multiple classifiers, so it resembles ensemble methods like boosting [10], random forests [4] and output-coding based multiclass classification [9, 1, 29, 14]. [sent-38, score-0.112]

12 However, these methods use a static decision process, where all classifiers have to be evaluated before any decision can be made. [sent-39, score-0.081]

13 [30] proposed a stopping criterion for random forests such that decisions can be made based on a subset of the trees. [sent-45, score-0.079]

14 Instance-specific decision paths based on previous observations can be found in decision-tree-style models. [sent-47, score-0.086]

15 Instead of making hard decisions based on individual observations like these methods, we use a probabilistic model to fuse information from multiple observations and only make decisions when it is sufficiently confident. [sent-50, score-0.144]

16 Instead of selecting a fixed set of features in the learning stage [34], we actively select instance-specific features in the test stage. [sent-52, score-0.152]

17 Our selection criterion trades off the statistical gain against the computational cost of the classifier, resulting in a computationally efficient cheap-to-expensive evaluation process. [sent-54, score-0.115]

18 [2] performed feature selection to achieve a certain accuracy under some computational budget, but the selection is done at training time without adaptation to individual test instances. [sent-58, score-0.219]

19 [5] considered test-time feature value acquisition under the strong assumption that observations are conditionally independent given the class variable. [sent-60, score-0.099]

20 [7] used active learning to select training instances to reduce the labeling cost and speed up learning, while our work focuses on inference. [sent-64, score-0.179]

21 Furthermore, we assume that we have been provided a set of trained classifiers $\mathcal{H}$, where each $h_i \in \mathcal{H}: \mathcal{X} \rightarrow \mathbb{R}$ can be any real-valued classifier (function) from existing methods. [sent-66, score-0.213]

22 For example, for multiclass classification, $h_i$ can be a one-vs-all classifier, a one-vs-one classifier, or a weak learner from a boosting algorithm. [sent-67, score-0.291]

23 Note that the $h_i$'s do not have to be homogeneous, meaning that they can have different functional forms. [sent-68, score-0.196]

24 For example, they may be linear or nonlinear; more importantly, they can be trained on different types of features with various computational costs. [sent-70, score-0.086]

25 Given an instance $x$, our goal is to infer $Y$ by sequentially selecting one $h_i$ to evaluate at a time, based on what has already been observed, until we are sufficiently confident about $Y$ or some other stopping criterion is met. [sent-71, score-0.256]
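The sequential procedure just described can be summarized in a short control loop. The sketch below is our own minimal rendering, not the authors' code: the `model` object, its `posterior`/`value`/`cost` methods, and the two stopping knobs (`conf_thresh`, `budget`) are hypothetical stand-ins for the machinery defined in the rest of this section.

```python
import numpy as np

def active_classify(x, classifiers, model, conf_thresh=0.95, budget=np.inf):
    """Sequentially evaluate classifiers on x until the belief P(Y | m_O)
    is confident enough or the budget is exhausted (a hedged sketch)."""
    observed, spent = {}, 0.0                        # start from O = empty set
    while True:
        posterior = model.posterior(observed)        # current belief P(Y | m_O)
        if posterior.max() >= conf_thresh or spent >= budget:
            break                                    # confident enough / out of budget
        remaining = [i for i in range(len(classifiers)) if i not in observed]
        if not remaining:
            break
        # myopic step: evaluate the unobserved classifier with the highest value
        best = max(remaining, key=lambda i: model.value(i, observed))
        spent += model.cost(best, observed)          # pay C(h_best | O)
        observed[best] = classifiers[best](x)        # record the margin m_best
    return int(np.argmax(model.posterior(observed)))
```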

26 The key quantity in this process is how valuable we think a classifier $h_i$ is, so we introduce the value of a classifier as follows. [sent-74, score-0.212]

27 Denote by $M_i = h_i(X)$ the random variable for the response/margin of the $i$-th classifier in $\mathcal{H}$, and denote the random vector of observed classifier responses by $M_O = [M_{o_1}, M_{o_2}, \ldots]$. [sent-77, score-0.196]

28 Furthermore, we use $C(h_i \mid O)$ to denote the computational cost of evaluating classifier $h_i$ conditioned on the set of evaluated classifiers $O$. [sent-83, score-0.316]

29 This is because if $h_i$ shares the same feature with some $o_i \in O$, we do not need to compute the feature again. [sent-84, score-0.286]

30 With some chosen reward $R$ and a computational cost model $C(h_i \mid O)$, we define the value of an unobserved classifier as follows. [sent-85, score-0.083]

31 Definition 1. The value of classifier $V(h_i \mid m_O)$ for a classifier $h_i$, given the observed classifier responses $m_O$, is the combination of the expected reward of the state informed by $h_i$ and the computational cost of $h_i$. [sent-86, score-0.722]

32 The first part, $V_R(h_i \mid m_O) = \mathbb{E}\left[ R(P(Y \mid M_i, m_O)) \right]$, is the expected reward of $P(Y \mid M_i, m_O)$, where the expectation is with respect to the posterior of $M_i$ given $m_O$. [sent-88, score-0.074]

33 The second part, $V_C(h_i \mid m_O) \triangleq -\frac{1}{\tau} C(h_i \mid O)$, is a computational penalty incurred by evaluating the classifier $h_i$. [sent-89, score-0.246]

34 Given the definition of the value of classifier, at each step of our sequential evaluations our goal is to pick the $h_i$ with the highest value:
$$h^* = \operatorname*{argmax}_{h_i \in \mathcal{H}\setminus O} V(h_i \mid m_O) = \operatorname*{argmax}_{h_i \in \mathcal{H}\setminus O} \big[ V_R(h_i \mid m_O) + V_C(h_i \mid m_O) \big] \qquad (2)$$
We introduce the building blocks of the value of classifier, i.e., the expected reward and the computational cost model, next. [sent-91, score-0.652]
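As a concrete illustration of Eq. (2), here is a hedged Monte Carlo sketch of computing $V(h_i \mid m_O)$. The `model.sample_margin` and `model.posterior` hooks, and the sample count, are our own assumptions rather than the paper's implementation; the inline reward anticipates the negative-entropy choice of Eq. (3) below.

```python
import numpy as np

def value_of_classifier(i, observed, model, tau, n_samples=50):
    """V(h_i | m_O) = V_R(h_i | m_O) + V_C(h_i | m_O), Definition 1 / Eq. (2).
    The expectation over M_i is approximated by Monte Carlo sampling."""
    rewards = []
    for _ in range(n_samples):
        m_i = model.sample_margin(i, observed)        # M_i ~ P(M_i | m_O)
        post = model.posterior({**observed, i: m_i})  # P(Y | m_i, m_O)
        # negative-entropy reward R(P) = sum_y P(y) log P(y), Eq. (3)
        rewards.append(float(np.sum(post * np.log(post + 1e-12))))
    v_r = float(np.mean(rewards))                     # expected reward V_R
    v_c = -model.cost(i, observed) / tau              # penalty V_C = -C/tau
    return v_r + v_c
```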

35 Formally, given some posterior distribution $P(Y \mid m_O)$, we define
$$R(P(Y \mid m_O)) = -H(Y \mid m_O) = \sum_y P(y \mid m_O) \log P(y \mid m_O) \qquad (3)$$
The value of classifier under this reward definition is closely related to information gain. [sent-99, score-0.09]

36 The cost of evaluating a classifier h on an instance x can be broken down into two parts. [sent-112, score-0.092]

37 The first part is the cost of computing the feature $\phi: \mathcal{X} \rightarrow \mathbb{R}^n$ on which $h$ is built, and the second is the cost of computing the function value of $h$ given the input $\phi(x)$. [sent-113, score-0.158]

38 If $h$ shares the same feature as some evaluated classifier in $O$, then $C(h \mid O)$ only consists of the cost of evaluating the function $h$; otherwise it also includes the cost of computing the feature input $\phi$. [sent-114, score-0.229]
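A minimal sketch of this feature-sharing cost model follows. The lookup tables `feature_of`, `feature_cost` and `eval_cost` are hypothetical bookkeeping of our own; the paper does not specify the accounting at this level of detail.

```python
def classifier_cost(i, evaluated, feature_of, feature_cost, eval_cost):
    """C(h_i | O): cost of evaluating classifier i given the set `evaluated`
    of already-evaluated classifier indices (a hedged sketch)."""
    cost = eval_cost[i]                            # applying h_i to phi(x)
    already = {feature_of[j] for j in evaluated}   # features shared with O
    if feature_of[i] not in already:
        cost += feature_cost[feature_of[i]]        # usually the dominant term
    return cost
```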

39 Note that computing φ is usually much more expensive than evaluating the function value of h. [sent-115, score-0.076]

40 Given a test instance $x$, we construct an instance-specific joint distribution over $Y$ and the selected observations $M_O$. [sent-117, score-0.071]

41 Given the subset of the training set $\{(x^{(j)}, y^{(j)} = y)\}_{j=1}^{N_y}$ corresponding to the instances from class $y$, we denote $m_i^{(j)} = h_i(x^{(j)})$; our goal is then to learn $P(M_i \mid m_O, y)$ from $\{(m^{(j)}, y^{(j)} = y)\}_{j=1}^{N_y}$. [sent-126, score-0.466]

42 If $O = \emptyset$, then $P(M_i \mid m_O, y)$ reduces to the marginal distribution $P(M_i \mid y) = \mathcal{N}(\mu_y, \sigma_y^2)$, and based on maximum likelihood estimation we have $\mu_y = \frac{1}{N_y} \sum_j m_i^{(j)}$ and $\sigma_y^2 = \frac{1}{N_y} \sum_j (m_i^{(j)} - \mu_y)^2$. [sent-127, score-0.215]
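The closed-form estimates above translate directly into code. The sketch below fits the per-class Gaussian marginals, assuming the responses of one classifier on the training set are available as an array:

```python
import numpy as np

def fit_class_marginals(margins, labels):
    """Maximum likelihood fit of P(M_i | y) = N(mu_y, sigma_y^2) from the
    N_y training responses of each class y.
    margins: (N,) responses m_i^(j) of one classifier; labels: (N,) classes."""
    params = {}
    for y in np.unique(labels):
        m_y = margins[labels == y]
        mu_y = m_y.mean()                    # (1/N_y) sum_j m_i^(j)
        var_y = ((m_y - mu_y) ** 2).mean()   # biased MLE of sigma_y^2
        params[y] = (mu_y, var_y)
    return params
```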

43 It is worth noting that $(\bar{M}_O^T W \bar{M}_O + \lambda I)^{-1} W$ in (8) does not depend on $i$, so it can be computed once and shared across the different classifiers $h_i$. [sent-136, score-0.196]
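A hedged sketch of that shared computation is given below, assuming a standard weighted ridge form for Eq. (8); the exact way $W$ enters the paper's expression may differ, and the shapes are our own assumptions.

```python
import numpy as np

def shared_regression_factor(M_obs, w, lam=1e-3):
    """The i-independent factor of the weighted ridge regression used to
    model P(M_i | m_O, y): computed once, then reused for every h_i.
    M_obs: (N, k) observed responses; w: (N,) sample weights."""
    W = np.diag(w)
    k = M_obs.shape[1]
    # factor once; for each unobserved h_i: coeffs_i = A @ targets_i
    A = np.linalg.solve(M_obs.T @ W @ M_obs + lam * np.eye(k), M_obs.T @ W)
    return A
```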

44 For example, in multiclass classification, if we include all one-vs-one classifiers into H, |H| is quadratic in the number of classes. [sent-149, score-0.069]

45 Based on the posterior $P(Y \mid m_O)$, we can make a dynamic and adaptive decision about whether to continue observing new classifiers or to stop the process. [sent-152, score-0.065]
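One plausible rendering of this control step is sketched below. Stopping on posterior confidence and pruning the class space by cumulative posterior mass are our illustrative choices, not necessarily the paper's exact rules; all thresholds are hypothetical.

```python
import numpy as np

def decide(posterior, classes, spent, conf=0.95, budget=np.inf, keep=0.99):
    """Stop if the belief P(Y | m_O) is confident enough or the budget is
    spent; otherwise keep only the smallest set of classes covering `keep`
    posterior mass (the promising classes N_{t,Y} discussed below)."""
    if posterior.max() >= conf or spent >= budget:
        return "stop", [classes[int(np.argmax(posterior))]]
    order = np.argsort(posterior)[::-1]        # classes by decreasing belief
    mass, promising = 0.0, []
    for idx in order:
        promising.append(classes[idx])
        mass += posterior[idx]
        if mass >= keep:
            break
    return "continue", promising
```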

46 To avoid computing the values of all the remaining classifiers, we can dynamically restrict the search space of classifiers to those having high expected mutual information with Y with respect to the current posterior P (Y |mO ). [sent-171, score-0.089]

47 Specifically, during training, for each classifier $h_i$ we can compute the mutual information $I(M_i; B_y)$ between its response $M_i$ and a class $y$, where $B_y$ is a binary variable indicating whether an instance is from class $y$ or not. [sent-172, score-0.277]
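A simple plug-in estimator of $I(M_i; B_y)$ can be precomputed for every classifier-class pair at training time, for example by discretizing the responses; the histogram binning below is our own estimation choice, not the paper's.

```python
import numpy as np

def mutual_information(margins, is_class, bins=20):
    """Plug-in estimate of I(M_i; B_y) between a classifier's discretized
    response and the binary class indicator B_y (0/1 array)."""
    edges = np.histogram_bin_edges(margins, bins=bins)
    m_disc = np.digitize(margins, edges[1:-1])     # bin index per response
    mi = 0.0
    for b in (0, 1):
        p_b = np.mean(is_class == b)
        for v in np.unique(m_disc):
            p_joint = np.mean((is_class == b) & (m_disc == v))
            if p_joint > 0:
                p_v = np.mean(m_disc == v)
                mi += p_joint * np.log(p_joint / (p_b * p_v))
    return mi
```

At test time, the candidate set can then be restricted to the top-$L$ classifiers by $I(M_i; B_y)$ over the currently promising classes $y$.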

48–54 [Figure 1 plot residue removed; recoverable content: four panels titled "results on satimage / pendigits / vowel / letter dataset", a shared legend ("selection by value of classifier", "random selection", "one-vs-all", "dagsvm", "one-vs-one", "tree"), x-axis "number of evaluated classifiers", y-axis "test classification accuracy".]

55 Figure 1: (Best viewed magnified and in colors) Performance comparisons on UCI datasets. [sent-217, score-0.243]

56 From left to right are the results on the satimage, pendigits, vowel and letter (in log scale) datasets. [sent-218, score-0.086]

57 Note that the error bars for pendigits and letter datasets are very small (around 0. [sent-219, score-0.081]

58 First, selecting the top L candidate observations takes constant time, since we can sort the observations based on $I(M_i; B_y)$ before the test process. [sent-222, score-0.071]

59 Given our dynamic pruning of the class space, suppose there are only $N_{t,Y}$ promising classes to consider instead of the total number of classes $K$. [sent-224, score-0.067]

60 The key to a low cost is to effectively prune the class space (small $N_{t,Y}$) and reach a decision quickly (small $t$). [sent-229, score-0.082]

61 Specifically, it dynamically selects an instance-specific subset of features, resulting in classification accuracy comparable to or higher than using all features, but with a significant reduction in the computational cost. [sent-235, score-0.151]

62 If there are multiple features $\{\phi_i\}_{i=1}^F$, our pool of classifiers is $\mathcal{H} = \cup_{i=1}^F \mathcal{H}_{\phi_i}$. [sent-239, score-0.069]

63 During training, in addition to learning the classifiers, we also need to compute the response $m_i^{(j)}$ of each classifier $h_i \in \mathcal{H}$ for each training instance $x^{(j)}$. [sent-241, score-0.455]

64 In order to make the training distribution of the classifier responses better match the test distribution, when evaluating classifier $h_i$ on $x^{(j)}$, we do not want $h_i$ to have been trained on $x^{(j)}$. [sent-242, score-0.497]

65 After this procedure, each training instance $x^{(j)}$ will have been evaluated by all $h_i$'s. [sent-245, score-0.273]
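This held-out response computation is essentially a cross-validation-style pass over the training set. A minimal sketch, assuming a hypothetical `train_classifier` factory (returning a fitted model with a `decision_function`) and scikit-learn's `KFold` splitter; the fold count is our own choice:

```python
import numpy as np
from sklearn.model_selection import KFold

def heldout_responses(X, y, train_classifier, n_folds=5):
    """Compute training responses m^(j) = h(x^(j)) so that no instance is
    scored by a classifier that saw it during training, keeping the
    training-time response distribution close to the test-time one."""
    responses = np.zeros(len(X))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, held_idx in kf.split(X):
        h = train_classifier(X[train_idx], y[train_idx])  # retrain per fold
        responses[held_idx] = h.decision_function(X[held_idx])
    return responses
```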

66 Note that the classifiers used in the test stage are trained on the entire training set. [sent-246, score-0.074]

67 Although for different training instances $x^{(j)}$ and $x^{(k)}$ from different folds and a test instance $x$, the responses $m_i^{(j)}$, $m_i^{(k)}$ and $m_i$ are obtained using different $h_i$'s, our experimental results confirmed that their empirical distributions are close enough to achieve good performance. [sent-247, score-0.922]

68 We compare different methods in terms of both the classification accuracy and the number of evaluated classifiers. [sent-256, score-0.082]

69 For our algorithm and the random selection baseline, we show the accuracy over iterations as well. [sent-257, score-0.072]

70 As can be seen, our method can achieve accuracy comparable to or higher than that of traditional methods. [sent-261, score-0.085]

71 In terms of the number of evaluated classifiers, our active scheme is very effective: the mean numbers of classifier evaluations for the 6-class, 10-class, 11-class and 26-class problems are 4. [sent-267, score-0.106]

72 Although the tree-based method can also use a small number of classifiers, it sometimes suffers a significant drop in accuracy, as on the vowel and letter datasets. [sent-272, score-0.135]

73 To maintain a belief over the class variable $Y$ and to dynamically select classifiers with high value, we have introduced additional computational costs. [sent-275, score-0.088]

74 For example, this additional cost is around 10 ms for satimage; however, evaluating a linear classifier takes less than 1 ms due to the very low feature dimension, so the actual running time of the active scheme is higher than that of one-vs-one. [sent-278, score-0.179]

75 Therefore, our method will have a real computational advantage only if the cost of evaluating the classifiers is higher than the cost of our probabilistic inference. [sent-279, score-0.144]

76 We test our active classification on the benchmark scene recognition dataset Scene15 [20]. [sent-282, score-0.105]

77 [Flattened table residue; recoverable structure: columns "model" and "accuracy", with rows for all features, the best individual feature (Object Bank [21]), the fastest feature (GIST [26]), and ours with τ = 25, τ = 100 and τ = 600; the accuracy values (beginning 86.) did not survive extraction.]

78 For our methods, we show the speedup factors with respect to using all the features in a static way. [sent-319, score-0.067]

79 We consider various types of features since, as shown in [33], the classification accuracy can be significantly improved by combining multiple features, but at a high computational cost. [sent-320, score-0.118]

80 Our feature set includes 7 features from [33], namely GIST, spatial HOG, dense SIFT, Local Binary Pattern, self-similarity, texton histogram, and geometry-specific histograms (please refer to [33] for details), plus another recently proposed high-level image feature, Object Bank [21]. [sent-321, score-0.145]

81 Instead of treating $\Phi$ as an undecomposable single feature vector, we can think of it as a collection of 177 different features $\{\phi_i\}_{i=1}^{177}$. [sent-328, score-0.088]

82 Therefore, our feature pool consists of 184 features in total. [sent-329, score-0.107]

83 One traditional way to combine these features is through multiple kernel learning. [sent-333, score-0.072]

84 For our active classification, we will not compute all features at the beginning of the evaluation process, but will only compute a component $\phi_i$ when a classifier $h$ based on it is selected. [sent-336, score-0.105]

85 To demonstrate the trade-off between accuracy and computational cost in the definition of the value of classifier, we run multiple experiments with various $\tau$'s. [sent-339, score-0.121]

86 We also report results on the Scene15 dataset, comparing to the best individual features in terms of either accuracy or speed (the reported accuracy is the better of one-vs-one and one-vs-all). [sent-341, score-0.172]

87 As can be seen, combining all features using the traditional method indeed improves the accuracy significantly over the individual features, but at a high computational cost. [sent-342, score-0.178]

88 However, using active classification, to achieve accuracy similar to the baseline of using all features, we can get 28. [sent-343, score-0.264]

89 Note that at this configuration, our method is faster than the state-of-the-art individual feature [21], and is also 2. [sent-345, score-0.081]

90 To further test the effectiveness of our active selection scheme, we compare with another baseline that sequentially adds features. [De-interleaved Figure 2 caption: Classification accuracy versus running time for the baseline, active classification, and various individual features (GIST, LBP, spatial HOG, Object Bank, dense SIFT).] [sent-349, score-0.221]

91 Specifically, we first rank the individual features based on their classification accuracy, and only consider the top 80 features (using 80 features achieves essentially the same accuracy as using 184 features). [sent-351, score-0.223]

92 Given this selected pool, we arrange the features in order of increasing computational complexity, and then train a classifier based on the top N features for all values of N from 1 to 80. [sent-352, score-0.119]

93 As shown in Figure 2, our active scheme is one order of magnitude faster than the baseline given the same level of accuracy. [sent-353, score-0.088]

94 [Axis-tick residue removed.] 6 Conclusion and Future Work. In this paper, we presented an active classification process based on the value of classifier. [sent-360, score-0.071]

95 We applied this active scheme in the context of multiclass classification, and achieved comparable or even higher classification accuracy with significant computational savings compared to traditional static methods. [sent-361, score-0.246]

96 One interesting future direction is to estimate the value of features instead of individual classifiers. [sent-362, score-0.09]

97 This is particularly important when computing the feature is much more expensive than evaluating the function value of classifiers, which is often the case. [sent-363, score-0.114]

98 Therefore, predicting the value of a feature (equivalent to the joint value of multiple classifiers sharing the same feature) can potentially lead to a more computationally efficient classification process. [sent-365, score-0.07]

99 Reducing multiclass to binary: a unifying approach for margin classifiers. [sent-376, score-0.069]

100 Object bank: A high-level image representation for scene classification and semantic feature sparsification. [sent-505, score-0.065]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mo', 0.855), ('mi', 0.215), ('classi', 0.208), ('hi', 0.196), ('ers', 0.186), ('er', 0.103), ('multiclass', 0.069), ('dagsvm', 0.062), ('active', 0.055), ('satimage', 0.052), ('features', 0.05), ('vr', 0.05), ('accuracy', 0.049), ('reward', 0.048), ('vowel', 0.046), ('classifiers', 0.041), ('pendigits', 0.041), ('tny', 0.041), ('classification', 0.04), ('letter', 0.04), ('feature', 0.038), ('cost', 0.037), ('emi', 0.036), ('vs', 0.036), ('scheduling', 0.036), ('dynamically', 0.033), ('evaluated', 0.033), ('classifier', 0.031), ('evaluating', 0.031), ('wmo', 0.031), ('cation', 0.031), ('uci', 0.03), ('bank', 0.028), ('scene', 0.027), ('ny', 0.027), ('decisions', 0.026), ('posterior', 0.026), ('boosting', 0.026), ('object', 0.025), ('vc', 0.025), ('decision', 0.024), ('spent', 0.024), ('instance', 0.024), ('individual', 0.024), ('observations', 0.024), ('wj', 0.024), ('selection', 0.023), ('classes', 0.023), ('test', 0.023), ('gist', 0.022), ('mt', 0.022), ('traditional', 0.022), ('stopping', 0.022), ('fi', 0.022), ('gain', 0.022), ('class', 0.021), ('stanford', 0.021), ('focuses', 0.021), ('angelova', 0.021), ('instancespeci', 0.021), ('schwing', 0.021), ('probabilistic', 0.02), ('training', 0.02), ('spatial', 0.019), ('pool', 0.019), ('computational', 0.019), ('scheme', 0.018), ('hog', 0.018), ('met', 0.018), ('speedup', 0.017), ('distinctions', 0.017), ('forests', 0.017), ('trained', 0.017), ('value', 0.016), ('costs', 0.016), ('inform', 0.016), ('sian', 0.016), ('argmax', 0.016), ('mutual', 0.015), ('select', 0.015), ('computing', 0.015), ('baseline', 0.015), ('cvpr', 0.015), ('chai', 0.015), ('cheap', 0.015), ('stop', 0.015), ('criterion', 0.014), ('expensive', 0.014), ('expert', 0.014), ('cohn', 0.014), ('oi', 0.014), ('cally', 0.014), ('sift', 0.014), ('stage', 0.014), ('comparable', 0.014), ('instances', 0.014), ('responses', 0.014), ('koller', 0.014), ('maxy', 0.014), ('tree', 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 19 nips-2011-Active Classification based on Value of Classifier

Author: Tianshi Gao, Daphne Koller

Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.

2 0.1287723 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li

Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1

3 0.10396607 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon

Abstract: We introduce PiCoDes: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. PiCoDes of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.

4 0.097550906 178 nips-2011-Multiclass Boosting: Theory and Algorithms

Author: Mohammad J. Saberian, Nuno Vasconcelos

Abstract: The problem of multi-class boosting is considered. A new framework, based on multi-dimensional codewords and predictors is introduced. The optimal set of codewords is derived, and a margin enforcing loss proposed. The resulting risk is minimized by gradient descent on a multidimensional functional space. Two algorithms are proposed: 1) CD-MCBoost, based on coordinate descent, updates one predictor component at a time, 2) GD-MCBoost, based on gradient descent, updates all components jointly. The algorithms differ in the weak learners that they support but are both shown to be 1) Bayes consistent, 2) margin enforcing, and 3) convergent to the global minimum of the risk. They also reduce to AdaBoost when there are only two classes. Experiments show that both methods outperform previous multiclass boosting approaches on a number of datasets. 1

5 0.091985509 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

Author: Congcong Li, Ashutosh Saxena, Tsuhan Chen

Abstract: For most scene understanding tasks (such as object detection or depth estimation), the classifiers need to consider contextual information in addition to the local features. We can capture such contextual information by taking as input the features/attributes from all the regions in the image. However, this contextual dependence also varies with the spatial location of the region of interest, and we therefore need a different set of parameters for each spatial location. This results in a very large number of parameters. In this work, we model the independence properties between the parameters for each location and for each task, by defining a Markov Random Field (MRF) over the parameters. In particular, two sets of parameters are encouraged to have similar values if they are spatially close or semantically close. Our method is, in principle, complementary to other ways of capturing context such as the ones that use a graphical model over the labels instead. In extensive evaluation over two different settings, of multi-class object detection and of multiple scene understanding tasks (scene categorization, depth estimation, geometric labeling), our method beats the state-of-the-art methods in all the four tasks. 1

6 0.088949651 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification

7 0.085205995 28 nips-2011-Agnostic Selective Classification

8 0.068080001 33 nips-2011-An Exact Algorithm for F-Measure Maximization

9 0.06717582 53 nips-2011-Co-Training for Domain Adaptation

10 0.066985153 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

11 0.063825786 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions

12 0.055692159 165 nips-2011-Matrix Completion for Multi-label Image Classification

13 0.053979371 261 nips-2011-Sparse Filtering

14 0.053329926 233 nips-2011-Rapid Deformable Object Detection using Dual-Tree Branch-and-Bound

15 0.05307965 154 nips-2011-Learning person-object interactions for action recognition in still images

16 0.052723128 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning

17 0.051198896 258 nips-2011-Sparse Bayesian Multi-Task Learning

18 0.050138142 180 nips-2011-Multiple Instance Filtering

19 0.049465347 302 nips-2011-Variational Learning for Recurrent Spiking Networks

20 0.048883956 59 nips-2011-Composite Multiclass Losses


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.134), (1, 0.038), (2, -0.048), (3, 0.068), (4, 0.017), (5, 0.054), (6, 0.032), (7, -0.054), (8, -0.094), (9, 0.012), (10, -0.083), (11, 0.01), (12, -0.006), (13, 0.113), (14, -0.03), (15, -0.021), (16, -0.073), (17, -0.018), (18, 0.06), (19, -0.088), (20, 0.029), (21, 0.018), (22, 0.07), (23, -0.032), (24, 0.027), (25, 0.025), (26, -0.044), (27, -0.058), (28, 0.048), (29, 0.032), (30, 0.151), (31, -0.02), (32, 0.027), (33, -0.085), (34, -0.017), (35, -0.104), (36, -0.078), (37, -0.086), (38, -0.012), (39, 0.043), (40, 0.01), (41, 0.055), (42, -0.034), (43, -0.05), (44, 0.086), (45, 0.07), (46, 0.01), (47, 0.006), (48, -0.055), (49, -0.041)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94845557 19 nips-2011-Active Classification based on Value of Classifier

Author: Tianshi Gao, Daphne Koller

Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.

2 0.67365247 28 nips-2011-Agnostic Selective Classification

Author: Yair Wiener, Ran El-Yaniv

Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this √ rejected region of the domain is shown to be diminishing at rate O(Bθ( 1/m)β ), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1

3 0.65021604 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition

Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li

Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1

4 0.63439244 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

Author: Alessandro Bergamo, Lorenzo Torresani, Andrew W. Fitzgibbon

Abstract: We introduce PiCoDes: a very compact image descriptor which nevertheless allows high performance on object category recognition. In particular, we address novel-category recognition: the task of defining indexing structures and image representations which enable a large collection of images to be searched for an object category that was not known when the index was built. Instead, the training images defining the category are supplied at query time. We explicitly learn descriptors of a given length (from as small as 16 bytes per image) which have good object-recognition performance. In contrast to previous work in the domain of object recognition, we do not choose an arbitrary intermediate representation, but explicitly learn short codes. In contrast to previous approaches to learn compact codes, we optimize explicitly for (an upper bound on) classification performance. Optimization directly for binary features is difficult and nonconvex, but we present an alternation scheme and convex upper bound which demonstrate excellent performance in practice. PiCoDes of 256 bytes match the accuracy of the current best known classifier for the Caltech256 benchmark, but they decrease the database storage size by a factor of 100 and speed-up the training and testing of novel classes by orders of magnitude.

5 0.61795479 252 nips-2011-ShareBoost: Efficient multiclass learning with feature sharing

Author: Shai Shalev-shwartz, Yonatan Wexler, Amnon Shashua

Abstract: Multiclass prediction is the problem of classifying an object into a relevant target class. We consider the problem of learning a multiclass predictor that uses only few features, and in particular, the number of used features should increase sublinearly with the number of possible classes. This implies that features should be shared by several classes. We describe and analyze the ShareBoost algorithm for learning a multiclass predictor that uses few shared features. We prove that ShareBoost efficiently finds a predictor that uses few shared features (if such a predictor exists) and that it has a small generalization error. We also describe how to use ShareBoost for learning a non-linear predictor that has a fast evaluation time. In a series of experiments with natural data sets we demonstrate the benefits of ShareBoost and evaluate its success relatively to other state-of-the-art approaches. 1

6 0.61119026 33 nips-2011-An Exact Algorithm for F-Measure Maximization

7 0.59952271 178 nips-2011-Multiclass Boosting: Theory and Algorithms

8 0.59718895 279 nips-2011-Target Neighbor Consistent Feature Weighting for Nearest Neighbor Classification

9 0.54462349 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

10 0.53749555 7 nips-2011-A Machine Learning Approach to Predict Chemical Reactions

11 0.53055781 59 nips-2011-Composite Multiclass Losses

12 0.52282637 277 nips-2011-Submodular Multi-Label Learning

13 0.49480906 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

14 0.48839203 254 nips-2011-Similarity-based Learning via Data Driven Embeddings

15 0.48381102 284 nips-2011-The Impact of Unlabeled Patterns in Rademacher Complexity Theory for Kernel Classifiers

16 0.47758994 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models

17 0.47237825 141 nips-2011-Large-Scale Category Structure Aware Image Categorization

18 0.47044691 287 nips-2011-The Manifold Tangent Classifier

19 0.43688318 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

20 0.43501359 143 nips-2011-Learning Anchor Planes for Classification


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.016), (4, 0.042), (20, 0.051), (26, 0.029), (31, 0.083), (33, 0.052), (39, 0.206), (43, 0.053), (45, 0.185), (57, 0.042), (65, 0.019), (74, 0.024), (83, 0.03), (99, 0.037)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.8976782 151 nips-2011-Learning a Tree of Metrics with Disjoint Visual Features

Author: Kristen Grauman, Fei Sha, Sung J. Hwang

Abstract: We introduce an approach to learn discriminative visual representations while exploiting external semantic knowledge about object category relationships. Given a hierarchical taxonomy that captures semantic similarity between the objects, we learn a corresponding tree of metrics (ToM). In this tree, we have one metric for each non-leaf node of the object hierarchy, and each metric is responsible for discriminating among its immediate subcategory children. Specifically, a Mahalanobis metric learned for a given node must satisfy the appropriate (dis)similarity constraints generated only among its subtree members’ training instances. To further exploit the semantics, we introduce a novel regularizer coupling the metrics that prefers a sparse disjoint set of features to be selected for each metric relative to its ancestor (supercategory) nodes’ metrics. Intuitively, this reflects that visual cues most useful to distinguish the generic classes (e.g., feline vs. canine) should be different than those cues most useful to distinguish their component fine-grained classes (e.g., Persian cat vs. Siamese cat). We validate our approach with multiple image datasets using the WordNet taxonomy, show its advantages over alternative metric learning approaches, and analyze the meaning of attribute features selected by our algorithm. 1

2 0.83369029 203 nips-2011-On the accuracy of l1-filtering of signals with block-sparse structure

Author: Fatma K. Karzan, Arkadi S. Nemirovski, Boris T. Polyak, Anatoli Juditsky

Abstract: We discuss new methods for the recovery of signals with block-sparse structure, based on ℓ1-minimization. Our emphasis is on the efficiently computable error bounds for the recovery routines. We optimize these bounds with respect to the method parameters to construct the estimators with improved statistical properties. We justify the proposed approach with an oracle inequality which links the properties of the recovery algorithms and the best estimation performance. 1

3 0.83318067 297 nips-2011-Universal low-rank matrix recovery from Pauli measurements

Author: Yi-kai Liu

Abstract: We study the problem of reconstructing an unknown matrix M of rank r and dimension d using O(rd poly log d) Pauli measurements. This has applications in quantum state tomography, and is a non-commutative analogue of a well-known problem in compressed sensing: recovering a sparse vector from a few of its Fourier coefficients. We show that almost all sets of O(rd log^6 d) Pauli measurements satisfy the rank-r restricted isometry property (RIP). This implies that M can be recovered from a fixed (“universal”) set of Pauli measurements, using nuclear-norm minimization (e.g., the matrix Lasso), with nearly-optimal bounds on the error. A similar result holds for any class of measurements that use an orthonormal operator basis whose elements have small operator norm. Our proof uses Dudley’s inequality for Gaussian processes, together with bounds on covering numbers obtained via entropy duality. 1

4 0.83095974 182 nips-2011-Nearest Neighbor based Greedy Coordinate Descent

Author: Inderjit S. Dhillon, Pradeep K. Ravikumar, Ambuj Tewari

Abstract: Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives, as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.

same-paper 5 0.8249414 19 nips-2011-Active Classification based on Value of Classifier

Author: Tianshi Gao, Daphne Koller

Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.

6 0.79633576 82 nips-2011-Efficient coding of natural images with a population of noisy Linear-Nonlinear neurons

7 0.76342297 209 nips-2011-Orthogonal Matching Pursuit with Replacement

8 0.73649621 1 nips-2011-$\theta$-MRF: Capturing Spatial and Semantic Structure in the Parameters for Scene Understanding

9 0.73481596 214 nips-2011-PiCoDes: Learning a Compact Code for Novel-Category Recognition

10 0.7342782 169 nips-2011-Maximum Margin Multi-Label Structured Prediction

11 0.73359305 105 nips-2011-Generalized Lasso based Approximation of Sparse Coding for Visual Recognition

12 0.73303926 168 nips-2011-Maximum Margin Multi-Instance Learning

13 0.73092461 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries

14 0.72984207 70 nips-2011-Dimensionality Reduction Using the Sparse Linear Model

15 0.72430837 244 nips-2011-Selecting Receptive Fields in Deep Networks

16 0.72397673 49 nips-2011-Boosting with Maximum Adaptive Sampling

17 0.72288036 172 nips-2011-Minimax Localization of Structural Information in Large Noisy Matrices

18 0.72152072 271 nips-2011-Statistical Tests for Optimization Efficiency

19 0.72117466 165 nips-2011-Matrix Completion for Multi-label Image Classification

20 0.72101468 231 nips-2011-Randomized Algorithms for Comparison-based Search