nips nips2010 nips2010-52 knowledge-graph by maker-knowledge-mining

52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio


Source: pdf

Author: Fuxin Li, Cristian Sminchisescu

Abstract: We propose an approach to multiple-instance learning that reformulates the problem as a convex optimization on the likelihood ratio between the positive and the negative class for each training instance. This is cast as joint estimation of both a likelihood ratio predictor and the target (likelihood ratio variable) for instances. Theoretically, we prove a quantitative relationship between the risk estimated under the 0-1 classification loss, and under a loss function for likelihood ratio. It is shown that likelihood ratio estimation is generally a good surrogate for the 0-1 loss, and separates positive and negative instances well. The likelihood ratio estimates provide a ranking of instances within a bag and are used as input features to learn a linear classifier on bags of instances. Instance-level classification is achieved from the bag-level predictions and the individual likelihood ratios. Experiments on synthetic and real datasets demonstrate the competitiveness of the approach.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 de Abstract We propose an approach to multiple-instance learning that reformulates the problem as a convex optimization on the likelihood ratio between the positive and the negative class for each training instance. [sent-5, score-0.595]

2 This is cast as joint estimation of both a likelihood ratio predictor and the target (likelihood ratio variable) for instances. [sent-6, score-0.539]

3 Theoretically, we prove a quantitative relationship between the risk estimated under the 0-1 classification loss, and under a loss function for likelihood ratio. [sent-7, score-0.351]

4 It is shown that likelihood ratio estimation is generally a good surrogate for the 0-1 loss, and separates positive and negative instances well. [sent-8, score-0.715]

5 The likelihood ratio estimates provide a ranking of instances within a bag and are used as input features to learn a linear classifier on bags of instances. [sent-9, score-1.045]

6 Instance-level classification is achieved from the bag-level predictions and the individual likelihood ratios. [sent-10, score-0.176]

7 There is an OR relationship in a bag: if one of the feature vectors is classified as positive, the entire bag is considered positive. [sent-15, score-0.325]
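The OR rule described here can be sketched in a few lines; the scoring function and the toy bags below are illustrative only, not from the paper:

```python
# OR rule of multiple-instance learning: a bag is positive
# iff at least one of its instances is classified positive.

def bag_label(instance_scores):
    """Return +1 if any instance score is positive, else -1."""
    return 1 if any(s > 0 for s in instance_scores) else -1

# Toy per-instance scores sign(f(x)) for two hypothetical bags.
positive_bag = [-0.7, 0.2, -1.1]   # one positive instance -> positive bag
negative_bag = [-0.4, -0.9, -0.2]  # no positive instance -> negative bag

print(bag_label(positive_bag))   # 1
print(bag_label(negative_bag))   # -1
```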

8 However, the state-of-the-art in MIL is often obtained by simply using a weighted sum of kernel values between all instance pairs within the bags, while ignoring the prediction of instance labels [8, 9, 10]. [sent-25, score-0.24]

9 It is intriguing why MIL algorithms that exploit instance-level information cannot achieve better performance, as constraints at the instance level seem abundant – none of the negative instances is positive. [sent-26, score-0.428]

10 This should provide additional constraints in defining the region of positive instances and should help classification in input space. [sent-27, score-0.234]

11 Most of these algorithms perform alternating minimization on the classifier and the instance weights. [sent-29, score-0.165]

12 A newly proposed algorithm, ALP-SVM, uses a preset parameter defining the fixed ratio of witnesses – the true positive instances in a positive bag. [sent-34, score-0.454]

13 Excellent results were obtained with this witness rate parameter set to the correct value. [sent-35, score-0.456]

14 In principle, the witness rate should also be estimated, and this learning stage partially accounts for the non-convexity of the MIL problem. [sent-37, score-0.456]

15 The success of the Latent SVM for person detection [4] shows that a standard MIL procedure (the reformulation of the alternating minimization MI-SVM algorithm in [2]) can achieve good results if properly initialized. [sent-40, score-0.125]

16 Although the formulation is convex, its scalability drops significantly for bags with many instances. [sent-46, score-0.281]

17 In this paper we make an alternative attempt towards a convex formulation: we establish that nonconvex MIL constraints can be recast reliably into convex constraints on the likelihood ratio between the positive and negative classes for each instance. [sent-47, score-0.691]

18 We transform the multiple-instance learning problem into a convex joint estimation of the likelihood ratio function and the likelihood ratio values on training instances. [sent-48, score-0.803]
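The general shape of such a joint estimation problem can be sketched as follows, where the divergence D, the regularizer weight λ, and the bag thresholds θ_j are illustrative placeholders rather than the paper's exact choices:

```latex
\min_{f \in \mathcal{H},\; r \in \mathbb{R}^{n}}
  \;\sum_{i=1}^{n} D\!\left(f(x_i),\, r_i\right)
  \;+\; \lambda \,\lVert f \rVert_{\mathcal{H}}^{2}
\quad \text{s.t.} \quad
  \sum_{x_i \in B_j} r_i \;\ge\; \theta_j
  \;\;\text{for every positive bag } B_j,
  \qquad r_i \ge 0 .
```

The bag constraints are linear in the ratio targets r, and for suitable f-divergences D is jointly convex in (f(x_i), r_i), so the whole problem is jointly convex in f and r.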

19 The choice of jointly convex loss functions is rich, including at least a family of f-divergences. [sent-49, score-0.144]

20 Theoretically, we prove consistency results for likelihood ratio estimation, thus showing that f-divergence loss functions upper bound the classification 0-1 loss tightly, unless the likelihood is very large. [sent-50, score-0.669]

21 A support vector regression scheme is implemented to estimate the likelihood ratio, and it is shown to separate positive and negative instances well. [sent-51, score-0.492]
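A minimal sketch of how such ratio estimates separate instances once they are available; the numbers below are hypothetical ratio estimates, not the output of an actual SVR fit:

```python
# Instances with estimated likelihood ratio Pr(y=1|x)/Pr(y=-1|x)
# above 1 are predicted positive; below 1, negative.

def classify_by_ratio(ratio_estimates, threshold=1.0):
    return [1 if r > threshold else -1 for r in ratio_estimates]

# Hypothetical ratio estimates for six instances: well-separated
# positives tend to get ratios > 1, negatives < 1.
ratios = [2.4, 0.3, 1.6, 0.1, 0.8, 5.0]
print(classify_by_ratio(ratios))  # [1, -1, 1, -1, -1, 1]
```

The paper's point that a fixed threshold of 1 is hard to get right in practice is exactly why the `threshold` parameter is left explicit here.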

22 However, determining the correct threshold for instance classification from the training set remains non-trivial. [sent-52, score-0.193]

23 To address this problem, we propose a post-processing step based on a bag classifier computed as a linear combination of likelihood ratios. [sent-53, score-0.501]

24 2 Convex Reformulation of the Multiple Instance Constraint Let us consider a learning problem with n training instances in total, n+ positive and n− negative. [sent-55, score-0.23]

25 In negative bags, every instance is negative, hence we do not separately define such bags – instead we directly work with the instances. [sent-56, score-0.435]

26 Let X = {x+_1, x+_2, . . . , x+_{n+}, x−_1, x−_2, . . . , x−_{n−}} be the training input, where each x+_i belongs to a positive bag Bj and each x−_i is a negative instance. [sent-66, score-0.593]

27 The goal of multiple instance learning is, given {X+, X−, B}, to learn a decision rule, sign(f(x)), to predict the label {+1, −1} for a test instance x. [sent-67, score-0.237]

28 1) negative-exclusion: if none of the instances in a bag is positive, the bag is not positive. [sent-69, score-0.777]

29 2) positive-identifiability: if one of the instances in the bag is positive, the bag is positive. [sent-70, score-0.777]

30 These properties are equivalent to a constraint max_{xi∈Bj} f(xi) ≥ 0 on positive bags. [sent-71, score-0.126]

31 This constraint is not convex since the negative max function is concave. [sent-72, score-0.197]
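The non-convexity is easy to verify numerically on per-instance scores; the score vectors below are illustrative:

```python
# The feasible set {s : max_i s_i >= 0} over per-instance scores s
# is not convex: two feasible score vectors can average to an
# infeasible one, so the MIL max-constraint defines a nonconvex region.

def feasible(scores):
    return max(scores) >= 0

a = [1.0, -3.0]    # feasible: the first instance scores positive
b = [-3.0, 1.0]    # feasible: the second instance scores positive
mid = [(x + y) / 2 for x, y in zip(a, b)]   # midpoint [-1.0, -1.0]

print(feasible(a), feasible(b), feasible(mid))  # True True False
```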

32 However, this hardly retains positive-identifiability, since if there is only one xi with f (xi ) > 0, this can be superseded by other instances with f (xi ) < 0. [sent-74, score-0.206]

33 However, in this paper we show that if MIL conditions are formulated as constraints on the likelihood ratio, convexity can be achieved. [sent-76, score-0.228]

34 When the size of the bag is large, the assumption Pr(y = 1|xi) > |Bj|/(|Bj|+1) can be too strong. [sent-79, score-0.325]

35 If Pr(y = 1|xi) is not close to 1/2, then likelihood ratio sums on negative examples can become much smaller, hence we can adopt a significantly lower threshold at some degree of violation of the negative-exclusion property. [sent-83, score-0.528]

36 Assuming Mβ holds, we prove the following result which allows us to relax the hard constraint (1): Theorem 1 ∀δ > 0, for each xi in a bag Bj, assume yi is drawn i.i.d. [sent-88, score-0.516]

37 If all instances xi ∈ Bj are negative, then the probability that Σ_{xi∈Bj} Pr(y = 1|xi)/Pr(y = −1|xi) ≥ (β+4)/(2(β+1)(β+2)) |Bj| + √((4β+1)/(2(2β+3)^2) |Bj| log(1/δ)) + (2(β+1)/3) log(1/δ) (2) is at most δ. [sent-92, score-0.285]

38 3 Likelihood Ratio Estimation To estimate the likelihood ratio, one possibility would be to use kernel methods as nonparametric estimators over an RKHS. [sent-107, score-0.221]

39 This approach was taken in [22], where predictions of the ratio provided a variational estimate of an f -divergence (or Ali-Silvey divergence) between two distributions. [sent-108, score-0.184]

40 An important issue is the relationship between likelihood ratio estimation and our final goal: binary classification. [sent-123, score-0.376]

41 In this paper we extend these results to loss functions for likelihood ratio estimation. [sent-125, score-0.416]

42 Let R(f) = P(sign(y) ≠ sign(f(x) − 1)) be the 0-1 risk of a likelihood estimator f, with the classification rule: predict +1 iff f(x) ≥ 1. [sent-126, score-0.259]

43 For a generic loss function C(α, η) with η = Pr(y = 1|x), we can define the C-risk as RC(f) = E(C(f, η)) and R*C = inf_f RC(f). [sent-128, score-0.127]

44 Our goal is to bound the excess 0-1 risk R(f) − R* by the excess-C risk RC(f) − R*C, so that minimizing the excess-C risk can be converted into minimizing the classification loss. [sent-129, score-0.213]

45 Let us further define the optimal conditional risk as H(η) = inf_{α∈R} C(α, η), and H−(η) = inf_{α:(α−1)(2η−1)≤0} C(α, η). [sent-130, score-0.163]

46 The difference between likelihood ratio estimation and the classification setting is in the asymmetric scaling of the loss function for positive and negative examples. [sent-133, score-0.621]

47 Let ψ−(x) = ψ(−x), R−(f) = Pr(y = −1, f(x) > 1), R*− = inf_f R−(f), R+(f) = Pr(y = 1, f(x) < 1) and R*+ = inf_f R+(f) be the risks and Bayes risks on negative and positive examples, respectively. [sent-134, score-0.331]

48 Compared against Theorem 3 in [20], which has the form ψ(R(f) − R*) ≤ RC(f) − R*C, the difference here stems from the different loss transforms used for the positive and the negative examples. [sent-140, score-0.274]

49 We consider an f-divergence of the likelihood as the loss function, i.e., C(α, η) = D(α, η/(1−η)), where η/(1−η) is the likelihood ratio when Pr(y = 1|x) = η. [sent-141, score-0.253]

51 However, likelihood estimation would severely penalize the misclassified positive examples with large Pr(yi = 1|xi ). [sent-149, score-0.321]

52 (a) The function ψ appearing in the losses used for likelihood estimation (L1, KL-divergence) is similar to the hinge loss when θ > 0; however, it goes to infinity as θ approaches 1. [sent-170, score-0.325]

53 (c) If we only know the labels of the negative examples (blue) and the maximal positive example (red), determining the optimal threshold becomes non-trivial. [sent-175, score-0.297]

54 We set δ = 0.05, which gives the estimate of the bound for each bag as Di = (1/4)|Bi| + (3/14)√|Bi| + 1 when |Bi| ≥ 3, and Di = |Bi| when |Bi| < 3. [sent-180, score-0.346]

55 4 Bag and Instance Classification If the likelihood ratio is obtained using an unbiased estimator, a decision rule that predicts positive iff f(x) ≥ 1 should give the optimal classifier. [sent-188, score-0.36]

56 In positive bags, it is unclear whether an instance should be labeled positive or negative, as long as it does not contribute significantly to the classification error of its bag (fig. [sent-190, score-0.623]

57 This means that based on the learned likelihood ratio, the positive examples are usually well separated from the negative ones. [sent-194, score-0.37]

58 The main difficulty stems from the compound source of bias which arises from both the estimation of η + and the loss minimization over η + and f . [sent-196, score-0.167]

59 Instead of directly estimating the threshold, we learn a linear combination of instance likelihood ratios to classify the bag. [sent-198, score-0.361]

60 First, we sort the instance likelihood ratios for each bag into a vector of length max_i |Bi|. [sent-199, score-0.637]

61 We append 0 to bags that do not have enough instances. [sent-200, score-0.254]

[Figure 2: Synthetic dataset (best viewed in color). Panels (a)–(f) plot bag and pattern (instance) classification error, averaged over 50 runs, against the percent of positives in bags, for mi-SVM, AL-SVM (T = 10C) and the estimated witness rate variant.]
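The bag-feature construction (sort the per-instance ratios into a fixed-length vector, zero-padding shorter bags to the largest bag size) can be sketched as follows; sorting in descending order is an assumption, chosen so that the first entry is the largest ratio:

```python
# Build a fixed-length feature vector per bag from instance
# likelihood-ratio estimates: sort (descending) and zero-pad
# to the size of the largest bag.

def bag_features(bags):
    max_len = max(len(b) for b in bags)
    feats = []
    for b in bags:
        v = sorted(b, reverse=True)          # largest ratio first
        v += [0.0] * (max_len - len(v))      # append 0 for short bags
        feats.append(v)
    return feats

bags = [[0.2, 3.1, 0.9], [1.5, 0.4]]
print(bag_features(bags))  # [[3.1, 0.9, 0.2], [1.5, 0.4, 0.0]]
```

These fixed-length vectors, paired with the bag labels, are what the linear SVM in the bag-classification step would consume.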

65 Under this representation, bag classification turns into a standard binary decision problem where a vector and a binary label are given for each bag, and a linear SVM is learned to solve the problem. [sent-240, score-0.372]

66 If we were to classify only the likelihood ratio on the first instance, this procedure would reduce to simple thresholding. [sent-241, score-0.366]

67 This approach is derived from the basic MIL assumption that all instances in a negative bag are negative. [sent-247, score-0.538]

68 Based on instance classification we could also estimate the witness rate of the dataset. [sent-248, score-0.572]

69 This is computed as the ratio of positively classified instances to the total number of instances in the positive bags of the training set. [sent-249, score-0.774]
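A sketch of this witness-rate estimate, with hypothetical ±1 instance predictions standing in for the classifier's output:

```python
# Estimate the witness rate: the fraction of instances in the
# positive training bags that the instance-level classifier
# labels positive.

def witness_rate(positive_bag_predictions):
    total = sum(len(p) for p in positive_bag_predictions)
    positives = sum(sum(1 for y in p if y == 1)
                    for p in positive_bag_predictions)
    return positives / total

# Hypothetical +1/-1 predictions for three positive bags.
preds = [[1, -1, -1], [1, 1, -1], [-1, -1, -1]]
print(witness_rate(preds))  # 3 positives out of 9 instances
```

As the text notes, an estimate near 1.0 suggests the data is effectively fully supervised and a conventional classifier may suffice.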

70 Since our algorithm automatically adjusts to different witness rates, this estimate offers quantitative insight as to whether MIL should be used. [sent-250, score-0.424]

71 For instance, if the witness rate is 100%, it may be more effective to use a conventional learning approach. [sent-251, score-0.483]

72 The positive bags have a fraction of points sampled uniformly from the white region and the rest sampled uniformly from the black region. [sent-256, score-0.336]

73 An example of the sample at 40% witness rate is shown in fig. [sent-257, score-0.456]

74 In this figure, the plotted instance labels are those of their bags – indeed, one can notice many positive (blue) instances in the negative (red) region. [sent-259, score-0.67]

75 “WR” gives the estimated witness rates of our method. [sent-268, score-0.468]

76 In order to test the effect of witness rates, 10 different types of datasets are created by varying the rates over the range 0. [sent-343, score-0.483]

77 Under the likelihood ratio, the positive examples are well separated from negatives. [sent-353, score-0.284]

78 Complete results on datasets with different witness rates are shown in fig. [sent-355, score-0.483]

79 We give both bag classification and instance classification results. [sent-357, score-0.42]

80 BEST THRESHOLD refers to a method where the best threshold was chosen based on the full knowledge of training/test instance labels, i.e., the optimal performance our likelihood ratio estimator can achieve. [sent-359, score-0.172]

82 SVR-SVM generally works well when the witness rate is not very low. [sent-363, score-0.456]

83 From instance classification, one can see that the original mi-SVM is only competitive when the witness rate is near 1 – this situation is close to a supervised SVM. [sent-364, score-0.574]

84 With a deterministic annealing approach in [15], AW-SVM and mi-SVM perform quite the opposite – competitive when the witness rate is small, but they degrade when it is large. [sent-365, score-0.591]

85 Presumably this is because deterministic annealing is initialized with the a priori assumption that the datasets are multiple-instance. [sent-366, score-0.162]

86 When the witness rate is large, annealing does not improve performance. [sent-369, score-0.545]

87 On the contrary, the proposed SVR-SVM does not appear to be affected by the witness rate. [sent-370, score-0.403]

88 With the same parameters used across all the experiments, the method self-adjusts to different witness rates. [sent-371, score-0.403]

89 This is visible in fig. 2 (e): regardless of the witness rate, the instance error rate remains roughly the same. [sent-373, score-0.572]

90 But we note that results in ALP-SVM are obtained by tuning the witness rate to the optimal value, which may be difficult in practical settings. [sent-382, score-0.456]

91 The slightly lower performance compared to miGraph suggests that we may be inferior in the bag classification step, which we already know is suboptimal. [sent-383, score-0.349]

92 These data have the benefit of being designed to have a small witness rate. [sent-599, score-0.403]

93 These are derived from the 20 Newsgroups corpus, with 50 positive and 50 negative bags for each of the 20 news categories. [sent-601, score-0.422]

94 This shows the potential of methods based on likelihood ratio estimators for multiple instance learning. [sent-610, score-0.434]

95 6 Conclusion We have proposed an approach to multiple-instance learning based on estimating the likelihood ratio between the positive and the negative classes on instances. [sent-611, score-0.529]

96 The MIL constraint is reformulated into a convex constraint on the likelihood ratio where a joint estimation of both the function and the target ratios on the training set is performed. [sent-612, score-0.616]

97 Theoretically we justify that learning the likelihood ratio is Bayes-consistent and has desirable excess loss transform properties. [sent-613, score-0.44]

98 Although we are not able to find the optimal classification threshold on the estimated ratio function, our proposed bag classifier based on such ratios obtains state-of-the-art results on a number of difficult datasets. [sent-614, score-0.641]

99 In future work, we plan to explore transductive learning techniques in order to leverage the information in the learned ratio function and identify better threshold estimation procedures. [sent-615, score-0.277]

100 : Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. [sent-750, score-0.509]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mil', 0.472), ('witness', 0.403), ('bag', 0.325), ('bags', 0.254), ('migraph', 0.218), ('pr', 0.191), ('likelihood', 0.176), ('ratio', 0.163), ('rc', 0.161), ('instances', 0.127), ('bj', 0.119), ('bi', 0.105), ('classi', 0.102), ('instance', 0.095), ('annealing', 0.089), ('negative', 0.086), ('svm', 0.086), ('positive', 0.082), ('xi', 0.079), ('loss', 0.077), ('threshold', 0.077), ('yi', 0.068), ('svr', 0.068), ('convex', 0.067), ('risk', 0.063), ('wr', 0.054), ('rate', 0.053), ('inf', 0.05), ('datasets', 0.05), ('alternating', 0.046), ('supervisory', 0.045), ('accompanying', 0.044), ('surrogate', 0.044), ('constraint', 0.044), ('synthetic', 0.042), ('misclassi', 0.042), ('ratios', 0.041), ('di', 0.04), ('divergence', 0.04), ('sminchisescu', 0.04), ('unclear', 0.039), ('estimation', 0.037), ('bonn', 0.036), ('sign', 0.035), ('reformulation', 0.035), ('percent', 0.035), ('estimated', 0.035), ('hinge', 0.035), ('cation', 0.034), ('dominating', 0.034), ('kwok', 0.034), ('er', 0.033), ('gehler', 0.032), ('weak', 0.032), ('rates', 0.03), ('tiger', 0.03), ('stems', 0.029), ('li', 0.029), ('bayes', 0.028), ('aw', 0.028), ('website', 0.028), ('classify', 0.027), ('dataset', 0.027), ('conventional', 0.027), ('formulation', 0.027), ('smo', 0.027), ('convexity', 0.027), ('examples', 0.026), ('label', 0.026), ('tsybakov', 0.026), ('labels', 0.026), ('fi', 0.026), ('positives', 0.025), ('constraints', 0.025), ('rkhs', 0.025), ('chapelle', 0.025), ('minimization', 0.024), ('kernel', 0.024), ('excess', 0.024), ('inferior', 0.024), ('classification', 0.024), ('theoretically', 0.023), ('competitive', 0.023), ('deterministic', 0.023), ('hofmann', 0.023), ('ambiguous', 0.023), ('zhou', 0.023), ('reformulated', 0.023), ('suboptimal', 0.023), ('estimating', 0.022), ('labeling', 0.022), ('estimate', 0.021), ('categorization', 0.021), ('measurable', 0.021), ('remains', 0.021), ('training', 0.021), ('decision', 0.021), ('estimator', 0.02), ('person', 0.02), ('flach', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio

Author: Fuxin Li, Cristian Sminchisescu

Abstract: We propose an approach to multiple-instance learning that reformulates the problem as a convex optimization on the likelihood ratio between the positive and the negative class for each training instance. This is cast as joint estimation of both a likelihood ratio predictor and the target (likelihood ratio variable) for instances. Theoretically, we prove a quantitative relationship between the risk estimated under the 0-1 classification loss, and under a loss function for likelihood ratio. It is shown that likelihood ratio estimation is generally a good surrogate for the 0-1 loss, and separates positive and negative instances well. The likelihood ratio estimates provide a ranking of instances within a bag and are used as input features to learn a linear classifier on bags of instances. Instance-level classification is achieved from the bag-level predictions and the individual likelihood ratios. Experiments on synthetic and real datasets demonstrate the competitiveness of the approach.

2 0.32254636 36 nips-2010-Avoiding False Positive in Multi-Instance Learning

Author: Yanjun Han, Qing Tao, Jue Wang

Abstract: In multi-instance learning, there are two kinds of prediction failure, i.e., false negative and false positive. Current research mainly focus on avoiding the former. We attempt to utilize the geometric distribution of instances inside positive bags to avoid both the former and the latter. Based on kernel principal component analysis, we define a projection constraint for each positive bag to classify its constituent instances far away from the separating hyperplane while place positive instances and negative instances at opposite sides. We apply the Constrained Concave-Convex Procedure to solve the resulted problem. Empirical results demonstrate that our approach offers improved generalization performance.

3 0.25613976 151 nips-2010-Learning from Candidate Labeling Sets

Author: Jie Luo, Francesco Orabona

Abstract: In many real world applications we do not have access to fully-labeled training data, but only to a list of possible labels. This is the case, e.g., when learning visual classifiers from images downloaded from the web, using just their text captions or tags as learning oracles. In general, these problems can be very difficult. However most of the time there exist different implicit sources of information, coming from the relations between instances and labels, which are usually dismissed. In this paper, we propose a semi-supervised framework to model this kind of problems. Each training sample is a bag containing multi-instances, associated with a set of candidate labeling vectors. Each labeling vector encodes the possible labels for the instances in the bag, with only one being fully correct. The use of the labeling vectors provides a principled way not to exclude any information. We propose a large margin discriminative formulation, and an efficient algorithm to solve it. Experiments conducted on artificial datasets and a real-world images and captions dataset show that our approach achieves performance comparable to an SVM trained with the ground-truth labels, and outperforms other baselines.

4 0.10942147 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision

Author: Matthew Blaschko, Andrea Vedaldi, Andrew Zisserman

Abstract: A standard approach to learning object category detectors is to provide strong supervision in the form of a region of interest (ROI) specifying each instance of the object in the training images [17]. In this work are goal is to learn from heterogeneous labels, in which some images are only weakly supervised, specifying only the presence or absence of the object or a weak indication of object location, whilst others are fully annotated. To this end we develop a discriminative learning approach and make two contributions: (i) we propose a structured output formulation for weakly annotated images where full annotations are treated as latent variables; and (ii) we propose to optimize a ranking objective function, allowing our method to more effectively use negatively labeled images to improve detection average precision performance. The method is demonstrated on the benchmark INRIA pedestrian detection dataset of Dalal and Triggs [14] and the PASCAL VOC dataset [17], and it is shown that for a significant proportion of weakly supervised images the performance achieved is very similar to the fully supervised (state of the art) results. 1

5 0.090880632 243 nips-2010-Smoothness, Low Noise and Fast Rates

Author: Nathan Srebro, Karthik Sridharan, Ambuj Tewari

Abstract: √ ˜ We establish an excess risk bound of O HR2 + HL∗ Rn for ERM with an H-smooth loss n function and a hypothesis class with Rademacher complexity Rn , where L∗ is the best risk achievable by the hypothesis class. For typical hypothesis classes where Rn = R/n, this translates to ˜ ˜ a learning rate of O (RH/n) in the separable (L∗ = 0) case and O RH/n + L∗ RH/n more generally. We also provide similar guarantees for online and stochastic convex optimization of a smooth non-negative objective. 1

6 0.089363642 225 nips-2010-Relaxed Clipping: A Global Training Method for Robust Regression and Classification

7 0.087350138 22 nips-2010-Active Estimation of F-Measures

8 0.084652603 272 nips-2010-Towards Holistic Scene Understanding: Feedback Enabled Cascaded Classification Models

9 0.079878889 118 nips-2010-Implicit Differentiation by Perturbation

10 0.07896512 282 nips-2010-Variable margin losses for classifier design

11 0.075146131 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

12 0.074843794 228 nips-2010-Reverse Multi-Label Learning

13 0.074533902 23 nips-2010-Active Instance Sampling via Matrix Partition

14 0.073775053 235 nips-2010-Self-Paced Learning for Latent Variable Models

15 0.07244315 275 nips-2010-Transduction with Matrix Completion: Three Birds with One Stone

16 0.069615431 145 nips-2010-Learning Kernels with Radiuses of Minimum Enclosing Balls

17 0.069542259 86 nips-2010-Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

18 0.069405973 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

19 0.069125272 135 nips-2010-Label Embedding Trees for Large Multi-Class Tasks

20 0.068647988 269 nips-2010-Throttling Poisson Processes


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.219), (1, 0.071), (2, 0.088), (3, -0.122), (4, 0.056), (5, 0.093), (6, -0.095), (7, -0.086), (8, 0.015), (9, -0.159), (10, -0.064), (11, -0.003), (12, 0.043), (13, 0.091), (14, -0.011), (15, -0.002), (16, -0.141), (17, 0.068), (18, 0.066), (19, -0.008), (20, 0.018), (21, -0.114), (22, 0.04), (23, 0.091), (24, 0.065), (25, 0.144), (26, 0.162), (27, -0.135), (28, -0.14), (29, 0.036), (30, 0.036), (31, 0.098), (32, 0.144), (33, -0.004), (34, -0.084), (35, 0.086), (36, 0.021), (37, 0.006), (38, -0.156), (39, -0.229), (40, 0.08), (41, 0.083), (42, 0.029), (43, 0.033), (44, -0.004), (45, -0.073), (46, 0.073), (47, -0.041), (48, -0.029), (49, 0.101)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.92424059 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio

Author: Fuxin Li, Cristian Sminchisescu

Abstract: We propose an approach to multiple-instance learning that reformulates the problem as a convex optimization on the likelihood ratio between the positive and the negative class for each training instance. This is cast as joint estimation of both a likelihood ratio predictor and the target (likelihood ratio variable) for instances. Theoretically, we prove a quantitative relationship between the risk estimated under the 0-1 classification loss, and under a loss function for likelihood ratio. It is shown that likelihood ratio estimation is generally a good surrogate for the 0-1 loss, and separates positive and negative instances well. The likelihood ratio estimates provide a ranking of instances within a bag and are used as input features to learn a linear classifier on bags of instances. Instance-level classification is achieved from the bag-level predictions and the individual likelihood ratios. Experiments on synthetic and real datasets demonstrate the competitiveness of the approach.

2 0.91740119 36 nips-2010-Avoiding False Positive in Multi-Instance Learning

Author: Yanjun Han, Qing Tao, Jue Wang

Abstract: In multi-instance learning, there are two kinds of prediction failure, i.e., false negative and false positive. Current research mainly focus on avoiding the former. We attempt to utilize the geometric distribution of instances inside positive bags to avoid both the former and the latter. Based on kernel principal component analysis, we define a projection constraint for each positive bag to classify its constituent instances far away from the separating hyperplane while place positive instances and negative instances at opposite sides. We apply the Constrained Concave-Convex Procedure to solve the resulted problem. Empirical results demonstrate that our approach offers improved generalization performance.

3 0.81574476 151 nips-2010-Learning from Candidate Labeling Sets

Author: Jie Luo, Francesco Orabona

Abstract: In many real world applications we do not have access to fully-labeled training data, but only to a list of possible labels. This is the case, e.g., when learning visual classifiers from images downloaded from the web, using just their text captions or tags as learning oracles. In general, these problems can be very difficult. However most of the time there exist different implicit sources of information, coming from the relations between instances and labels, which are usually dismissed. In this paper, we propose a semi-supervised framework to model this kind of problems. Each training sample is a bag containing multi-instances, associated with a set of candidate labeling vectors. Each labeling vector encodes the possible labels for the instances in the bag, with only one being fully correct. The use of the labeling vectors provides a principled way not to exclude any information. We propose a large margin discriminative formulation, and an efficient algorithm to solve it. Experiments conducted on artificial datasets and a real-world images and captions dataset show that our approach achieves performance comparable to an SVM trained with the ground-truth labels, and outperforms other baselines.

4 0.47058013 235 nips-2010-Self-Paced Learning for Latent Variable Models

Author: M. P. Kumar, Benjamin Packer, Daphne Koller

Abstract: Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition. 1

5 0.44973406 228 nips-2010-Reverse Multi-Label Learning

Author: James Petterson, Tibério S. Caetano

Abstract: Multi-label classification is the task of predicting potentially multiple labels for a given instance. This is common in several applications such as image annotation, document classification and gene function prediction. In this paper we present a formulation for this problem based on reverse prediction: we predict sets of instances given the labels. By viewing the problem from this perspective, the most popular quality measures for assessing the performance of multi-label classification admit relaxations that can be efficiently optimised. We optimise these relaxations with standard algorithms and compare our results with several stateof-the-art methods, showing excellent performance. 1

6 0.43248758 3 nips-2010-A Bayesian Framework for Figure-Ground Interpretation

7 0.41473499 214 nips-2010-Probabilistic Belief Revision with Structural Constraints

8 0.40496421 267 nips-2010-The Multidimensional Wisdom of Crowds

9 0.40149087 269 nips-2010-Throttling Poisson Processes

10 0.39659029 22 nips-2010-Active Estimation of F-Measures

11 0.38956931 236 nips-2010-Semi-Supervised Learning with Adversarially Missing Label Information

12 0.38796046 282 nips-2010-Variable margin losses for classifier design

13 0.38336283 240 nips-2010-Simultaneous Object Detection and Ranking with Weak Supervision

14 0.37930822 23 nips-2010-Active Instance Sampling via Matrix Partition

15 0.37788752 1 nips-2010-(RF)^2 -- Random Forest Random Field

16 0.37325558 169 nips-2010-More data means less inference: A pseudo-max approach to structured learning

17 0.37195557 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification

18 0.36491171 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods

19 0.36445016 15 nips-2010-A Theory of Multiclass Boosting

20 0.36443365 202 nips-2010-Parallelized Stochastic Gradient Descent


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.05), (17, 0.014), (18, 0.153), (27, 0.061), (30, 0.064), (35, 0.021), (45, 0.242), (50, 0.05), (52, 0.033), (60, 0.028), (77, 0.057), (78, 0.085), (90, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.92456603 267 nips-2010-The Multidimensional Wisdom of Crowds

Author: Peter Welinder, Steve Branson, Pietro Perona, Serge J. Belongie

Abstract: Distributing labeling tasks among hundreds or thousands of annotators is an increasingly important method for annotating large datasets. We present a method for estimating the underlying value (e.g. the class) of each image from (noisy) annotations provided by multiple annotators. Our method is based on a model of the image formation and annotation process. Each image has different characteristics that are represented in an abstract Euclidean space. Each annotator is modeled as a multidimensional entity with variables representing competence, expertise and bias. This allows the model to discover and represent groups of annotators that have different sets of skills and knowledge, as well as groups of images that differ qualitatively. We find that our model predicts ground truth labels on both synthetic and real data more accurately than state of the art methods. Experiments also show that our model, starting from a set of binary labels, may discover rich information, such as different “schools of thought” amongst the annotators, and can group together images belonging to separate categories.
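
The core idea — weight each annotator by an estimated competence rather than taking a flat majority vote — can be sketched in one dimension. The code below is a deliberate simplification of the paper's model (no item difficulty, bias, or expertise dimensions); all variable names and the simulated competences are illustrative assumptions.

```python
import numpy as np

def estimate_labels(votes, iters=10):
    """Iteratively estimate binary ground truth from noisy annotators.

    votes: (n_items, n_annotators) array of 0/1 labels.
    Alternates between (a) a log-odds-weighted majority vote using the
    current competence estimates and (b) re-estimating each annotator's
    competence as agreement with the current consensus. A one-dimensional
    simplification of the paper's multidimensional annotator model.
    """
    labels = (votes.mean(axis=1) > 0.5).astype(int)   # init: plain majority
    for _ in range(iters):
        comp = (votes == labels[:, None]).mean(axis=0)         # competence in [0, 1]
        c = np.clip(comp, 1e-6, 1 - 1e-6)
        w = np.log(c / (1 - c))                                # log-odds weights
        score = (2 * votes - 1) @ w                            # signed weighted vote
        labels = (score > 0).astype(int)
    return labels, comp

rng = np.random.default_rng(1)
truth = rng.integers(0, 2, size=200)
comps = np.array([0.95, 0.9, 0.85, 0.55, 0.5])        # two near-random annotators
flip = rng.random((200, 5)) > comps                    # flip with prob 1 - competence
votes = np.where(flip, 1 - truth[:, None], truth[:, None])
labels, comp_hat = estimate_labels(votes)
print((labels == truth).mean())
```

The weighted scheme learns to discount the two near-random annotators, so its accuracy exceeds that of a plain five-way majority vote on the same data.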

same-paper 2 0.89132267 52 nips-2010-Convex Multiple-Instance Learning by Estimating Likelihood Ratio

Author: Fuxin Li, Cristian Sminchisescu

Abstract: We propose an approach to multiple-instance learning that reformulates the problem as a convex optimization on the likelihood ratio between the positive and the negative class for each training instance. This is cast as joint estimation of both a likelihood ratio predictor and the target (likelihood ratio variable) for instances. Theoretically, we prove a quantitative relationship between the risk estimated under the 0-1 classification loss, and under a loss function for likelihood ratio. It is shown that likelihood ratio estimation is generally a good surrogate for the 0-1 loss, and separates positive and negative instances well. The likelihood ratio estimates provide a ranking of instances within a bag and are used as input features to learn a linear classifier on bags of instances. Instance-level classification is achieved from the bag-level predictions and the individual likelihood ratios. Experiments on synthetic and real datasets demonstrate the competitiveness of the approach.
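
The bag-level pipeline described here — per-instance likelihood-ratio scores, sorted within each bag as a ranking, then a linear classifier on those features — can be illustrated on synthetic Gaussian bags. In this sketch the true closed-form log-ratio stands in for the paper's learned convex estimator, and the data, feature standardization, and gradient-descent classifier are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_bag(positive, size=8):
    # Negative instances ~ N(0,1); a positive bag carries one instance ~ N(3,1).
    x = rng.normal(0.0, 1.0, size)
    if positive:
        x[0] = rng.normal(3.0, 1.0)
    return x

def log_ratio(x):
    # Stand-in for the learned predictor: for these two Gaussians the true
    # log likelihood ratio log p(x|+)/p(x|-) has the closed form 3x - 4.5.
    return 3.0 * x - 4.5

bags = [make_bag(i % 2 == 1) for i in range(300)]
y = np.array([i % 2 for i in range(300)], dtype=float)

# Bag features: instance log-ratios sorted within each bag (a ranking),
# standardized so plain gradient descent is numerically stable.
F = np.array([np.sort(log_ratio(b))[::-1] for b in bags])
F = (F - F.mean(axis=0)) / F.std(axis=0)

# Minimal linear bag classifier: logistic regression by gradient descent.
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(1000):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    w -= 0.5 * F.T @ (p - y) / len(y)
    b -= 0.5 * (p - y).mean()
pred = ((F @ w + b) > 0).astype(float)
acc = (pred == y).mean()
print(acc)

# Instance-level labels, as in the abstract: within a bag predicted positive,
# flag the instances whose likelihood ratio favors the positive class.
inst = [(log_ratio(b) > 0.0) if p else np.zeros(len(b), bool)
        for b, p in zip(bags, pred)]
```

The sorted-ratio features let the linear classifier pick up the "at least one strongly positive instance" pattern that characterizes positive bags.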

3 0.88469577 211 nips-2010-Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression

Author: Ling Huang, Jinzhu Jia, Bin Yu, Byung-gon Chun, Petros Maniatis, Mayur Naik

Abstract: Predicting the execution time of computer programs is an important but challenging problem in the computer systems community. Existing methods require experts to perform detailed analysis of program code in order to construct predictors or select important features. We recently developed a new system to automatically extract a large number of features from program execution on sample inputs, on which prediction models can be constructed without expert knowledge. In this paper we study the construction of predictive models for this problem. We propose the SPORE (Sparse POlynomial REgression) methodology to build accurate prediction models of program performance using feature data collected from program execution on sample inputs. Our two SPORE algorithms are able to build relationships between responses (e.g., the execution time of a computer program) and features, and select a few from hundreds of the retrieved features to construct an explicitly sparse and non-linear model to predict the response variable. The compact and explicitly polynomial form of the estimated model could reveal important insights into the computer program (e.g., features and their non-linear combinations that dominate the execution time), enabling a better understanding of the program’s behavior. Our evaluation on three widely used computer programs shows that SPORE methods can give accurate prediction with relative error less than 7% using a moderate number of training data samples. In addition, we compare SPORE algorithms to state-of-the-art sparse regression algorithms, and show that SPORE methods, motivated by real applications, outperform the other methods in terms of both interpretability and prediction accuracy.
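
The general recipe behind sparse polynomial regression — expand the raw features into monomials, then fit an L1-regularized model so only a few terms survive — can be sketched with a small proximal-gradient (ISTA) solver. This is an illustration of the recipe, not the paper's SPORE algorithms; the toy data and all names are assumptions.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_expand(X, degree=2):
    """Expand the columns of X into all monomials up to the given degree."""
    n, d = X.shape
    cols = [np.ones(n)]                       # intercept term
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

def lasso_ista(A, y, lam=0.5, iters=500):
    """L1-regularized least squares via ISTA (proximal gradient descent)."""
    L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ w - y)
        z = w - g / L
        w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return w

# Toy "program features": the response depends sparsely and non-linearly on
# two of five features, mimicking the setting the abstract describes.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] * X[:, 3]
A = poly_expand(X, degree=2)
w = lasso_ista(A, y)
nz = np.flatnonzero(np.abs(w) > 0.2)          # surviving monomials
print(nz)
```

With the column ordering above (intercept, then linear terms, then degree-2 products), the x0 term sits at index 1 and the x1·x3 interaction at index 13, so the fit recovers exactly the sparse non-linear structure planted in the response.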

4 0.87424207 154 nips-2010-Learning sparse dynamic linear systems using stable spline kernels and exponential hyperpriors

Author: Alessandro Chiuso, Gianluigi Pillonetto

Abstract: We introduce a new Bayesian nonparametric approach to identification of sparse dynamic linear systems. The impulse responses are modeled as Gaussian processes whose autocovariances encode the BIBO stability constraint, as defined by the recently introduced “Stable Spline kernel”. Sparse solutions are obtained by placing exponential hyperpriors on the scale factors of such kernels. Numerical experiments on the estimation of ARMAX models show that this technique provides a definite advantage over a group LAR algorithm and state-of-the-art parametric identification techniques based on prediction error minimization.
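
To make the "autocovariance encodes stability" idea concrete, the sketch below builds the kernel matrix of the first-order member of this family and checks that it is a valid, decaying covariance. It assumes the first-order stable spline (TC) kernel K(s,t) = λ·exp(−β·max(s,t)); the paper itself works with a second-order variant and places exponential hyperpriors on the scale λ.

```python
import numpy as np

def stable_spline_kernel(T=50, beta=0.2, scale=1.0):
    """First-order stable spline (TC) kernel matrix over lags 1..T.

    K(s,t) = scale * exp(-beta * max(s, t)). Its exponential decay in the
    lag encodes BIBO stability: impulse responses drawn from a Gaussian
    process with this covariance decay to zero almost surely.
    """
    t = np.arange(1, T + 1)
    return scale * np.exp(-beta * np.maximum.outer(t, t))

K = stable_spline_kernel()
# A valid covariance matrix: symmetric positive semidefinite, with the
# prior variance K[i, i] shrinking as the lag grows.
eigs = np.linalg.eigvalsh(K)
print(eigs.min() >= -1e-9, K[0, 0] > K[-1, -1])
```

Sampling impulse responses from a zero-mean Gaussian process with this covariance (e.g. via a Cholesky factor of K plus a small jitter) yields exponentially decaying, hence stable, trajectories.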

5 0.87422204 144 nips-2010-Learning Efficient Markov Networks

Author: Vibhav Gogate, William Webb, Pedro Domingos

Abstract: We present an algorithm for learning high-treewidth Markov networks where inference is still tractable. This is made possible by exploiting context-specific independence and determinism in the domain. The class of models our algorithm can learn has the same desirable properties as thin junction trees: polynomial inference, closed-form weight learning, etc., but is much broader. Our algorithm searches for a feature that divides the state space into subspaces where the remaining variables decompose into independent subsets (conditioned on the feature and its negation) and recurses on each subspace/subset of variables until no useful new features can be found. We provide probabilistic performance guarantees for our algorithm under the assumption that the maximum feature length is bounded by a constant k (the treewidth can be much larger) and dependences are of bounded strength. We also propose a greedy version of the algorithm that, while forgoing these guarantees, is much more efficient. Experiments on a variety of domains show that our approach outperforms many state-of-the-art Markov network structure learners.

6 0.86681145 132 nips-2010-Joint Cascade Optimization Using A Product Of Boosted Classifiers

7 0.86223435 112 nips-2010-Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning

8 0.85951072 151 nips-2010-Learning from Candidate Labeling Sets

9 0.85828429 222 nips-2010-Random Walk Approach to Regret Minimization

10 0.85681671 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics

11 0.85469639 282 nips-2010-Variable margin losses for classifier design

12 0.85329229 277 nips-2010-Two-Layer Generalization Analysis for Ranking Using Rademacher Average

13 0.8490141 243 nips-2010-Smoothness, Low Noise and Fast Rates

14 0.8487432 63 nips-2010-Distributed Dual Averaging In Networks

15 0.84615541 36 nips-2010-Avoiding False Positive in Multi-Instance Learning

16 0.84462523 23 nips-2010-Active Instance Sampling via Matrix Partition

17 0.84436131 195 nips-2010-Online Learning in The Manifold of Low-Rank Matrices

18 0.84304786 1 nips-2010-(RF)^2 -- Random Forest Random Field

19 0.84155303 158 nips-2010-Learning via Gaussian Herding

20 0.84148806 290 nips-2010-t-logistic regression