
58 jmlr-2011-Learning from Partial Labels


Source: pdf

Author: Timothee Cour, Ben Sapp, Ben Taskar

Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. [sent-13, score-0.594]

2 In this setting each face is ambiguously labeled with the set of names extracted from the caption, see Figure 1 (bottom). [sent-17, score-0.383]

3 Top: using a screenplay, we can tell who is in a movie scene, but for every face in the corresponding images, the person’s identity is ambiguous (green labels). [sent-25, score-0.599]

4 In both cases, our goal is to learn a model from ambiguously labeled examples so as to disambiguate the training labels and also generalize to unseen examples. [sent-27, score-0.41]

5 We pose the partially labeled learning problem as minimization of an ambiguous loss in Section 3, and establish upper and lower bounds between the (unobserved) true loss and the (observed) ambiguous loss in terms of a critical distributional property we call ambiguity degree. [sent-55, score-1.449]

6 We propose the novel Convex Learning from Partial Labels (CLPL) formulation in Section 4, and show it offers a tighter approximation to the ambiguous loss, compared to a straightforward formulation. [sent-56, score-0.477]

7 We also apply our framework to a naming task in TV series, where screenplay and closed captions provide ambiguous labels. [sent-61, score-0.733]

8 [Residue of the Figure 2 diagram, which contrasts how labels attach to instances under supervised, unsupervised, semi-supervised, multi-label, multi-instance, and partial-label supervision.] [sent-78, score-0.893]

9 Figure 2: Range of supervision in classification. [sent-84, score-0.455]

10 2 Learning From Partially-labeled or Ambiguous Data There have been several papers that addressed the ambiguous label problem. [sent-100, score-0.619]

11 Hullermeier and Beringer (2006) propose several nonparametric, instance-based algorithms for ambiguous learning based on greedy heuristics. [sent-106, score-0.477]

12 These papers only report results on synthetically-created ambiguous labels for data sets such as the UCI repository. [sent-107, score-0.618]

13 Instead of estimating a prior probability for each individual, the algorithm estimates a prior for groups using the ambiguous labels. [sent-133, score-0.477]

14 Clearly, our setup generalizes the standard semi-supervised setting where some examples are labeled and some are unlabeled: an example is labeled when the corresponding ambiguity set yi is a singleton, and unlabeled when yi includes all the labels. [sent-161, score-0.463]

15 In order to learn from ambiguous data, we must make some assumptions about the distribution P(Z | X,Y ). [sent-164, score-0.477]

16 Consider a very simple ambiguity pattern that makes accurate learning impossible: L = 3, |zi | = 1 and label 1 is present in every set yi , for all i. [sent-165, score-0.384]

17 For example, consider our initial motivation of naming characters in TV shows, where the ambiguity set for any given detected face in a scene is given by the set of characters occurring at some point in that scene. [sent-170, score-0.663]

18 This suggests that for most characters, ambiguity sets are diverse and we can expect that the ambiguity degree is small. [sent-182, score-0.4]

19 We define a surrogate loss which we can evaluate, and call it the ambiguous or partial 0/1 loss (where A stands for ambiguous): Partial 0/1 loss: LA(h(x), y) = 1(h(x) ∉ y). [sent-189, score-0.683]
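
A minimal sketch (not the authors' code) of the two losses just defined; the function names and the toy instance are illustrative only.

```python
def true_01_loss(pred, y_true):
    """Standard 0/1 loss: 1 if the prediction differs from the true label."""
    return float(pred != y_true)

def partial_01_loss(pred, y_set):
    """Partial (ambiguous) 0/1 loss L_A: 1 if the prediction falls outside the candidate set."""
    return float(pred not in y_set)

# Toy example: true label is 2, but the annotation only says it is one of {1, 2}.
pred, y_true, y_set = 1, 2, {1, 2}
print(true_01_loss(pred, y_true))    # 1.0 -- the prediction is actually wrong
print(partial_01_loss(pred, y_set))  # 0.0 -- but the partial loss cannot see it
```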

20 However, in the ambiguous learning setting we would like to minimize the true 0/1 loss, with access only to the partial loss. [sent-192, score-0.567]

21 We first introduce a measure of hardness of learning under ambiguous supervision, which we define as the ambiguity degree ε of a distribution P(X, Y, Z): Ambiguity degree: ε = sup_{x, y, z : P(x,y) > 0, z ∈ {1, . . . , L}, z ≠ y} P(z ∈ Z | X = x, Y = y). [sent-194, score-0.696]

22 (1) In words, ε corresponds to the maximum probability of an extra label z co-occurring with a true label y, over all labels and inputs. [sent-198, score-0.425]
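
The ambiguity degree can be estimated empirically when the true labels are known (e.g., on synthetic or fully annotated data). The sketch below is a hedged illustration of Equation 1 on hypothetical (x, y, Z) triples; it is not part of the paper's code.

```python
from collections import defaultdict

def empirical_ambiguity_degree(samples):
    """Estimate epsilon: the largest empirical frequency with which some extra label z
    co-occurs with a given (x, y) pair, over all x, y and z != y."""
    counts = defaultdict(int)    # draws observed for each (x, y)
    cooccur = defaultdict(int)   # times z appeared alongside (x, y)
    for x, y, z_set in samples:
        counts[(x, y)] += 1
        for z in z_set:
            if z != y:
                cooccur[(x, y, z)] += 1
    return max((c / counts[(x, y)] for (x, y, z), c in cooccur.items()), default=0.0)

# Toy triples (x, true label y, extra labels Z): label 1 always co-occurs with the
# true label 0 at x = 0, so the estimate is 1 -- the hard, non-learnable case.
samples = [(0, 0, {1}), (0, 0, {1}), (1, 2, {0}), (1, 2, {1})]
print(empirical_ambiguity_degree(samples))  # 1.0
```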

23 Intuitively, the best case scenario for ambiguous learning corresponds to a distribution with high conditional entropy for P(Z | X,Y ). [sent-207, score-0.477]

24 Proposition 1 (Partial loss bound via ambiguity degree ε) For any classifier h and distribution P(X, Y, Z), with ambiguous label set Y = {Y} ∪ Z and ambiguity degree ε: EP[LA(h(X), Y)] ≤ EP[L(h(X), Y)] ≤ (1 / (1 − ε)) EP[LA(h(X), Y)], with the convention 1/0 = +∞. [sent-210, score-0.496]

25 We can further tighten our bounds between the ambiguous loss and the standard 0/1 loss. [sent-237, score-0.562]

26 Figure 4: Feasible region for expected ambiguous and true loss, for a given ambiguity degree ε (axis labels: E[ambiguous loss] and E[true loss]). [sent-247, score-0.477]

27 These bounds give a strong connection between ambiguous loss and real loss when ε is small. [sent-259, score-0.62]

28 This assumption allows us to approximately minimize the expected real loss by minimizing (an upper bound on) the ambiguous loss, as we propose in the following section. [sent-260, score-0.535]

29 We now focus on a particular family of classifiers, which assigns a score ga(x) to each label a for a given input x and selects the highest scoring label: h(x) = arg max_a ga(x). [sent-263, score-0.947]

30 Below, we use a multi-linear function class G, assuming a feature mapping f(x) : X → Rd from inputs to d real-valued features and letting ga(x) = wa · f(x), where wa ∈ Rd is a weight vector for each class, bounded by some norm: ||wa||_p ≤ B for p = 1, 2. [sent-271, score-0.506]
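
A small illustration of this score-based classifier family, h(x) = argmax_a wa · f(x); the weight matrix and feature vector below are made up for the example.

```python
import numpy as np

def scores(W, f_x):
    """g_a(x) = w_a . f(x) for every label a; W has one weight row per label."""
    return W @ f_x

def predict(W, f_x):
    """h(x) = argmax_a g_a(x)."""
    return int(np.argmax(scores(W, f_x)))

W = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]])  # 3 labels, d = 2 features
f_x = np.array([0.2, 0.8])
print(scores(W, f_x), predict(W, f_x))  # [0.2 0.5 1.4] 2
```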

31 1 Convex Loss for Partial Labels In the partially labeled setting, instead of an unambiguous single label y per instance we have a set of labels Y , one of which is the correct label for the instance. [sent-281, score-0.538]

32 We show next that our loss function is consistent under certain assumptions and offers a tighter upper bound on the ambiguous loss than a more straightforward multi-label approach. [sent-289, score-0.593]

33 First, we require that arg max_a P(Y = a | X = x) = arg max_a P(a ∈ Y | X = x), since otherwise arg max_a P(Y = a | X = x) cannot be determined by any algorithm from partial labels Y without additional information, even with an infinite amount of data. [sent-294, score-0.66]

34 It satisfies arg max_a ga = 2 but arg max_a ∑_b Pab = 1. [sent-312, score-0.654]

35 To gain additional intuition as to why CLPL is better than the naive loss in Equation 3: for an input x with ambiguous label set (a, b), CLPL only encourages the average of ga(x) and gb(x) to be large, allowing the correct score to be positive and the extraneous score to be negative. [sent-325, score-1.303]

36 In contrast, the naive model encourages both ga (x) and gb (x) to be large. [sent-328, score-0.572]
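
The following hedged sketch contrasts the two surrogates on a toy score vector, assuming the hinge function ψ(u) = max(0, 1 − u) and assuming the naive loss averages ψ over the candidate scores (the exact normalization of Equation 3 may differ); it is only meant to illustrate the intuition above, not to reproduce the authors' code.

```python
import numpy as np

def psi(u):
    """Hinge upper bound on the 0/1 loss, psi(u) = max(0, 1 - u)."""
    return np.maximum(0.0, 1.0 - u)

def clpl_loss(g, y_set):
    """CLPL-style loss: psi of the *average* score over the candidate set,
    plus psi(-g_a) for every label outside the set."""
    y = sorted(y_set)
    rest = [a for a in range(len(g)) if a not in y_set]
    return psi(np.mean(g[y])) + psi(-g[rest]).sum()

def naive_loss(g, y_set):
    """Naive multi-label loss: average of psi over each candidate score (assumed form),
    plus psi(-g_a) for every label outside the set."""
    y = sorted(y_set)
    rest = [a for a in range(len(g)) if a not in y_set]
    return np.mean(psi(g[y])) + psi(-g[rest]).sum()

# Candidate set {0, 1}: a large correct score can compensate a negative extraneous one
# under CLPL, but not under the naive loss.
g = np.array([3.0, -1.0, -2.0])
print(clpl_loss(g, {0, 1}))   # psi(mean(3, -1)) = psi(1) = 0  -> total 0.0
print(naive_loss(g, {0, 1}))  # (psi(3) + psi(-1)) / 2 = (0 + 2) / 2 -> total 1.0
```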

37 We assume a feature mapping f(x) : X → Rd from inputs to d real-valued features and let ga(x) = wa · f(x), where wa ∈ Rd is a weight vector for each class, bounded by the L2 norm: ||wa||_2 ≤ B. [sent-331, score-0.506]

38 Here y(x) denotes the candidate (ambiguous) label set of some x ∈ S, and z(x) = y(x) \ {y(x)} denotes its extra labels, i.e., the ambiguous set minus the true label. [sent-351, score-0.619]

39 If Lψ (g(x), y(x)) ≤ ψ(η/2) and ∀a ∈ z(x), ∃x′ ∈ Bη (x) such that ga (x′ ) ≤ −η/2, then g predicts the correct label for x. [sent-356, score-0.51]

40 In other words, g predicts the correct label for x when its loss is sufficiently small, and for each of its ambiguous labels a, we can find a neighbor with same label whose score ga (x′ ) is small enough. [sent-357, score-1.355]

41 We stack the parameters and features into one vector, so that ga(x) = wa · f(x) = w · f(x, a), with w = (w1; . . . ; wL) and f(x, a) = (1(a = 1)f(x); . . . ; 1(a = L)f(x)). [sent-375, score-0.437]
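
A quick numerical check of the stacking construction (illustrative code, not from the paper): the per-class score wa · f(x) equals the stacked score w · f(x, a).

```python
import numpy as np

def stacked_feature(f_x, a, num_labels):
    """Build f(x, a): f(x) placed in the block for label a, zeros elsewhere, so that
    g_a(x) = w . f(x, a) with w the concatenation of (w_1, ..., w_L)."""
    d = f_x.shape[0]
    out = np.zeros(num_labels * d)
    out[a * d:(a + 1) * d] = f_x
    return out

W = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]])   # per-label weights w_a
w = W.reshape(-1)                                     # stacked weight vector
f_x = np.array([0.2, 0.8])
a = 1
print(W[a] @ f_x, w @ stacked_feature(f_x, a, num_labels=3))  # identical scores (0.5, 0.5)
```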

42 We analyze the effect on learning of the following factors: distribution of ambiguous labels, size of ambiguous bags, proportion of instances which contain an ambiguous bag, entropy of the ambiguity, distribution of true labels and number of distinct labels. [sent-400, score-1.572]

43 We compare our CLPL approach against a number of baselines, including a generative model, a discriminative maximum-entropy model, a naive model, two K-nearest neighbor models, as well as models that ignore the ambiguous bags. [sent-401, score-0.586]

44 CHANCE BASELINE: We define the chance baseline as randomly guessing among the possible ambiguous labels only. [sent-408, score-0.675]

45 Defining the (empirical) average ambiguous size to be ES[|y|] = (1/m) ∑_{i=1}^m |yi|, the expected error of the chance baseline is given by error_chance = 1 − 1/ES[|y|]. [sent-409, score-0.534]
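
A one-liner implementing this chance-baseline formula on hypothetical candidate sets (illustrative only):

```python
def chance_baseline_error(ambiguous_sets):
    """Expected error of guessing uniformly within each candidate set:
    1 - 1 / E_S[|y|], with E_S[|y|] the empirical average candidate-set size."""
    avg_size = sum(len(y) for y in ambiguous_sets) / len(ambiguous_sets)
    return 1.0 - 1.0 / avg_size

print(chance_baseline_error([{1, 2}, {0, 1, 2}, {3}]))  # average size 2 -> error 0.5
```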

46 After training, we predict the label with the highest score (in the transductive setting): ŷ = arg max_{a∈y} ga(x). [sent-413, score-0.626]

47 (1993) for machine translation, but we can adapt it to the ambiguous label case. [sent-417, score-0.619]

48 DISCRIMINATIVE EM: We compare with the model proposed in Jin and Ghahramani (2002), which is a discriminative model with an EM procedure adapted for the ambiguous label setting. [sent-422, score-0.619]

49 The authors minimize the KL divergence between a maximum entropy model P (estimated in the M-step) and a distribution P̂ over ambiguous labels (estimated in the E-step): J(θ, P̂) = ∑_i ∑_{a∈yi} P̂(a | xi) log[ P̂(a | xi) / P(a | xi; θ) ]. [sent-423, score-0.654]

50 SUPERVISED MODELS: Finally, we also consider two baselines that ignore the ambiguous label setting. [sent-431, score-0.663]

51 The ambiguous labels in the training set are generated randomly according to different noise models which we specify in each case. [sent-471, score-0.618]

52 For each method and parameter setting, we report the average test error rate over 20 trials after training the model on the ambiguous train set. [sent-472, score-0.477]

53 Note, in the inductive setting we consider the test set as unlabeled, thus the classifier votes among all possible labels: a∗ = h(x) = arg max_{a ∈ {1,...,L}} ga(x). [sent-474, score-0.449]

54 For the transductive experiments, there is no test set; we report the error rate for disambiguating the ambiguous labels (also averaged over 20 trials corresponding to random settings of ambiguous labels). [sent-477, score-1.142]

55 The main differences with the inductive setting are: (1) the model is trained on all instances and tested on the same instances; and (2) the classifier votes only among the ambiguous labels, which is easier: a∗ = h(x) = arg max_{a∈y} ga(x). [sent-478, score-0.926]
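
The two voting rules can be sketched as follows (illustrative code; the score vector and candidate set are made up):

```python
import numpy as np

def predict_inductive(g):
    """Inductive: vote among all L labels."""
    return int(np.argmax(g))

def predict_transductive(g, y_set):
    """Transductive: vote only among the instance's ambiguous labels."""
    y = sorted(y_set)
    return y[int(np.argmax(g[y]))]

g = np.array([0.9, 0.2, 0.7, -0.1])
print(predict_inductive(g))             # 0
print(predict_transductive(g, {1, 2}))  # 2 -- restricted to the candidate set
```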

56 We experiment with different noise models for ambiguous bags, parametrized by p, q, ε, see Figure 6. [sent-483, score-0.477]

57 q represents the number of extra labels for each ambiguous example. [sent-485, score-0.618]

58 ε represents the degree of ambiguity (defined in Equation 1) for each ambiguous example. [sent-486, score-0.696]

59-60 [Residue of the flattened experiment-summary table: each row lists the quantity being varied (number of ambiguous bags, degree of ambiguity, dimension, or ambiguity size), the figure reporting the experiment (6, 7, or 8), whether the setting is inductive (yes/no), the data set (FIW(10b), FIW(10), FIW(100), Lost, audio, ecoli, derma, abalone), and the parameter range, e.g., p ∈ [0, . . .].] [sent-490, score-2.249] [sent-491, score-0.683]

61 We experiment with 3 different noise models for ambiguous bags, parametrized by p, q, ε. [sent-502, score-0.477]

62 q represents the number of extra labels for each ambiguous example (generated uniformly without replacement). [sent-504, score-0.618]

63 ε represents the degree of ambiguity for each ambiguous example (see definition 1). [sent-505, score-0.696]

64 The supervised baselines (which ignore the ambiguously labeled examples) consistently perform worse than their counterparts adapted for the ambiguous setting. [sent-523, score-0.71]

65 We first choose at random for each label a dominant co-occurring label which is sampled with probability ε; the rest of the labels are sampled uniformly with probability (1 − ε)/(L − 2) (there is a single extra label per example). [sent-527, score-0.567]
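
A hedged sketch of this label-noise generator. For simplicity the dominant co-occurring label is chosen deterministically here ((a + 1) mod L), whereas the text above picks it at random; everything else follows the description.

```python
import numpy as np

def sample_ambiguous_labels(true_labels, num_labels, eps, rng=None):
    """For each true label, add one extra label: a fixed 'dominant' co-occurring label
    with probability eps, otherwise another label drawn uniformly
    (probability (1 - eps) / (L - 2) each)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Dominant co-occurring label per class (deterministic choice for illustration).
    dominant = [(a + 1) % num_labels for a in range(num_labels)]
    bags = []
    for y in true_labels:
        if rng.random() < eps:
            z = dominant[y]
        else:
            others = [a for a in range(num_labels) if a != y and a != dominant[y]]
            z = int(rng.choice(others))
        bags.append({y, z})
    return bags

print(sample_ambiguous_labels([0, 1, 2, 3], num_labels=5, eps=0.9))
```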

66 (top left) increasing proportion of ambiguous bags q, inductive setting. [sent-575, score-0.583]

67 Figure 9: Left: We experiment with a boosting version of the ambiguous learning, and compare to a boosting version of the naive baseline (here with ambiguous bags of size 3). [sent-692, score-1.218]

68 Our goal is to identify characters given ambiguous labels derived from the screenplay. [sent-708, score-0.717]

69 Given an alignment of the screenplay to frames, we have ambiguous labels for characters in each scene: the set of speakers mentioned at some point in the scene, as shown in Figure 1. [sent-730, score-0.847]

70 We use the ambiguous sets to select face tracks filtered through our pipeline. [sent-747, score-0.647]

71 This leaves ambiguous bags of size 1, 2 or 3, with an average bag size of 2. [sent-749, score-0.544]

72 3 Errors in Ambiguous Label Sets In the TV episodes we considered, we observed that approximately 1% of ambiguous label sets were wrong, in that they didn’t contain the ground truth label of the face track. [sent-756, score-0.939]

73 While this is not a major problem, it becomes so when we consider additional cues (mouth motion, gender) that restrict the ambiguous label set. [sent-758, score-0.648]

74 4 Results with the Basic System Now that we have a set of instances (face tracks), feature descriptors for the face track and ambiguous label sets for each face track, we can apply the same method as described in the previous section. [sent-762, score-0.891]

75 The confusion matrix displaying the distribution of ambiguous labels for the top 16 characters in Lost is shown in Figure 11 (left). [sent-764, score-0.717]

76 The confusion matrix of our predictions after applying our ambiguous learning algorithm is shown in Figure 11 (right). [sent-765, score-0.477]

77 7% of the ambiguous bags, 3 times less then the second least common character) and Liam Pace from Charlie Pace (they are brothers and co-occur frequently, as can be seen in the top figure). [sent-767, score-0.477]

78 As we can see, the most difficult classes are the ones with which another class is strongly correlated in the ambiguous label confusion matrix. [sent-770, score-0.619]

79 Element Di j represents the proportion of times class i was seen with class j in the ambiguous bags, and D1 = 1. [sent-803, score-0.477]

80 we can define the relative-constrained score as an adaptation to the ambiguous setting; we only consider votes among ambiguous labels y (where a∗ = arg maxa∈y ga (x)): Crel,y (g(x)) = ga∗ (x) − max ga (x). [sent-828, score-1.9]
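
A small sketch of this relative-constrained confidence; we read the (partially garbled) formula above as the gap between the best and second-best scores within the ambiguous set y, which is an assumption on our part.

```python
import numpy as np

def relative_constrained_score(g, y_set):
    """Confidence of the constrained prediction a* = argmax_{a in y} g_a(x):
    the gap between the best and second-best scores *within* the ambiguous set y.
    With |y| = 1 we return +inf by convention."""
    s = np.sort(g[sorted(y_set)])[::-1]
    return float(s[0] - s[1]) if len(s) > 1 else float("inf")

g = np.array([2.0, 1.5, -0.5])
print(relative_constrained_score(g, {0, 1}))  # 2.0 - 1.5 = 0.5
```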

81 The relative-constrained score improves the high-recall/low-precision region by only voting among the ambiguous bags, but it suffers in the high-precision/low-recall region because some ambiguous bags may be erroneous. [sent-844, score-1.021]

82 There are some problems with all of those choices, especially in the case where we have some errors in the ambiguous label set (a ∉ Y for the true label a). [sent-846, score-0.761]

83 At high recall, the errors in the classifier dominate the errors in ambiguous labels, and relative-constrained confidence gives better precision because of the restriction. [sent-850, score-0.477]

84 We introduce a hybrid confidence measure that performs well for all recall levels r, interpolating between the two confidence measures: ha^r(x) = ga(x) if a ∈ y, and ha^r(x) = (1 − r) ga(x) + r min_b gb(x) otherwise. [sent-851, score-0.463]
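
The piecewise form above is partly reconstructed from a garbled display (scores inside the ambiguous set are kept; scores outside it are pulled toward the minimum score as r grows), so the sketch below should be read as our interpretation rather than the authors' exact definition.

```python
import numpy as np

def hybrid_confidence(g, y_set, r):
    """Hybrid confidence h_a^r(x): keep g_a(x) for labels inside the ambiguous set;
    for labels outside it, interpolate between g_a(x) (r = 0, no constraint)
    and min_b g_b(x) (r = 1, fully constrained to the bag)."""
    h = np.array(g, dtype=float)
    g_min = h.min()
    for a in range(len(h)):
        if a not in y_set:
            h[a] = (1.0 - r) * h[a] + r * g_min
    return h

g = np.array([1.0, 0.6, 0.9])
print(hybrid_confidence(g, {0, 1}, r=0.0))  # scores unchanged
print(hybrid_confidence(g, {0, 1}, r=1.0))  # label outside the bag pushed to the min score
```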

85 We follow Everingham et al. (2006) to detect mouth motion during dialog and adapt it to our ambiguous label setting. [sent-861, score-0.796]

86 For a face track x with ambiguous label set y and a temporally overlapping utterance from a speaker a ∈ {1, . . . , L}. [sent-862, score-0.811]

87 GENDER CONSTRAINTS: We introduce a gender classifier to constrain the ambiguous labels based on predicted gender. [sent-867, score-0.703]

88 We use gender by filtering out the labels whose gender does not match the predicted gender of a face track, when the prediction confidence exceeds a threshold (one threshold for females and one for males, set on validation data to achieve 90% precision for each direction of the gender prediction). [sent-871, score-0.603]

89 Thus, we modify the ambiguous label set y as: y := y if gender is uncertain; y := y − {a : a is male} if gender predicts female; y := y − {a : a is female} if gender predicts male. [sent-872, score-0.874]
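
A direct transcription of this filtering rule as a helper function (illustrative; the gender map and the handling of the confidence threshold are hypothetical):

```python
def filter_by_gender(y_set, predicted_gender, gender_of):
    """Restrict the ambiguous label set using a gender prediction for the face track.
    gender_of maps each label to 'male' or 'female'; predicted_gender is
    'male', 'female', or None when the classifier is not confident enough."""
    if predicted_gender is None:
        return set(y_set)                      # gender uncertain: keep the set unchanged
    drop = 'male' if predicted_gender == 'female' else 'female'
    return {a for a in y_set if gender_of[a] != drop}

gender_of = {0: 'male', 1: 'female', 2: 'male'}
print(filter_by_gender({0, 1, 2}, 'female', gender_of))  # {1}
print(filter_by_gender({0, 1, 2}, None, gender_of))      # {0, 1, 2}, unchanged
```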

90 We also propagate the predicted labels of our model to all faces in the same face track throughout an episode. [sent-955, score-0.421]

91 Hence we can focus on analysis for a fixed x (with P(X = x) > 0), writing ga = ga(x), and for any set c ⊆ {1, . . . , L}. [sent-1012, score-0.736]

92 The case of Pa = 0 leads to ga → −∞ and it can be ignored without loss of generality, so we can assume that optimal g is bounded for fixed p with 0 < Pa < 1. [sent-1020, score-0.426]

93 Taking the derivative of the loss with respect to ga and setting it to 0, we have the first-order optimality conditions: ∂Lψ(g)/∂ga = ∑_c Pc,a ψ′(gc,a) / (|c| + 1) − (1 − Pa) ψ′(−ga) = 0. [sent-1021, score-0.426]

94 Since ga ≤ gb, ψ′(gc,a) ≤ ψ′(gc,b) and ψ′(−ga) ≥ ψ′(−gb). [sent-1024, score-0.463]

95 Then the minimizer g satisfies either (1) ga → ∞ (this happens if ψ′(·) < 0 for finite arguments) while the ga′ remain finite because of the (1 − Pa′)ψ(−ga′) terms in the objective, or (2) g is finite and the proof above applies since dominance holds: Pc,b = 0 if a ∉ c, so we can apply the theorem. [sent-1031, score-0.427]

96 The second inequality comes from the fact that max_{a∈y} ga(x) ≥ (1/|y|) ∑_{a∈y} ga(x). [sent-1038, score-0.736]

97 For the tightness proof: when ga(x) is constant over a ∈ y, we have ψ(max_{a∈y} ga(x)) = ψ((1/|y|) ∑_{a∈y} ga(x)) = (1/|y|) ∑_{a∈y} ψ(ga(x)), implying Lψ^max(g(x), y) = Lψ(g(x), y) = Lψ^naive(g(x), y). [sent-1040, score-1.213]

98 Ĝm(Ga) = Eν[ sup_{ga∈Ga} (2/m) ∑_i νi ga(xi) | S ] = (2/m) Eν[ sup_{||wa||≤B} wa · ∑_i νi f(xi) | S ] = (2B/m) Eν[ ||∑_i νi f(xi)|| | S ] ≤ (2B/m) √( Eν[ ∑_{ij} νi νj f(xi)T f(xj) | S ] ) = (2B/m) √( ∑_i ||f(xi)||² ). [sent-1049, score-0.805]

99 By definition of Bη(x), ga(x) = ga(x′) + wa · (f(x) − f(x′)) ≤ ga(x′) + ||wa||∗ η ≤ ga(x′) + η ≤ η/2. [sent-1078, score-1.541]

100 In fact, we also have ga(x) < η/2, by considering two cases (wa = 0 or wa ≠ 0) and using the fact that ||f(x) − f(x′)|| < η. [sent-1079, score-0.437]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ambiguous', 0.477), ('ga', 0.368), ('clpl', 0.246), ('ep', 0.183), ('ambiguity', 0.181), ('knn', 0.17), ('ambiguously', 0.153), ('app', 0.153), ('label', 0.142), ('labels', 0.141), ('mouth', 0.136), ('faces', 0.13), ('abels', 0.123), ('face', 0.122), ('fiw', 0.119), ('naming', 0.117), ('tv', 0.113), ('naive', 0.109), ('maxa', 0.101), ('characters', 0.099), ('gb', 0.095), ('partial', 0.09), ('gender', 0.085), ('screenplay', 0.085), ('lost', 0.081), ('taskar', 0.08), ('labeled', 0.08), ('groundtruth', 0.076), ('la', 0.075), ('wa', 0.069), ('bags', 0.067), ('everingham', 0.065), ('cour', 0.065), ('yi', 0.061), ('yes', 0.06), ('dominance', 0.059), ('gm', 0.059), ('loss', 0.058), ('chance', 0.057), ('episodes', 0.056), ('jin', 0.054), ('captions', 0.054), ('pa', 0.051), ('multiclass', 0.05), ('audio', 0.049), ('video', 0.048), ('tracks', 0.048), ('transductive', 0.047), ('wild', 0.045), ('alignment', 0.045), ('scene', 0.045), ('baselines', 0.044), ('boosting', 0.044), ('gallagher', 0.042), ('hullermeier', 0.042), ('arg', 0.042), ('berg', 0.042), ('speaker', 0.042), ('motion', 0.041), ('supervised', 0.041), ('caption', 0.039), ('inductive', 0.039), ('degree', 0.038), ('images', 0.038), ('er', 0.037), ('pab', 0.036), ('barnard', 0.036), ('disambiguate', 0.036), ('earning', 0.036), ('xi', 0.036), ('people', 0.035), ('uci', 0.035), ('beringer', 0.034), ('ecoli', 0.034), ('infg', 0.034), ('partially', 0.033), ('mendelson', 0.033), ('proposition', 0.032), ('movies', 0.032), ('hinge', 0.03), ('classi', 0.03), ('supervision', 0.029), ('cues', 0.029), ('dermatology', 0.029), ('duygulu', 0.029), ('gc', 0.029), ('satoh', 0.029), ('track', 0.028), ('pb', 0.028), ('names', 0.028), ('dence', 0.027), ('bounds', 0.027), ('character', 0.027), ('score', 0.027), ('videos', 0.026), ('losses', 0.026), ('labeling', 0.026), ('ablative', 0.025), ('ambroise', 0.025), ('crel', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 58 jmlr-2011-Learning from Partial Labels

Author: Timothee Cour, Ben Sapp, Ben Taskar

Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces

2 0.13590088 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari

Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation

3 0.13229126 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

Author: Botond Cseke, Tom Heskes

Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation

4 0.11718647 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

5 0.059461422 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

Author: Mark D. Reid, Robert C. Williamson

Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds

6 0.057057038 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

7 0.048192319 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

8 0.04426346 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

9 0.043208793 16 jmlr-2011-Clustering Algorithms for Chains

10 0.041893087 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

11 0.039639901 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

12 0.038390394 61 jmlr-2011-Logistic Stick-Breaking Process

13 0.037961043 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

14 0.037087243 12 jmlr-2011-Bayesian Co-Training

15 0.036793876 55 jmlr-2011-Learning Multi-modal Similarity

16 0.036253642 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

17 0.034029819 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

18 0.033968188 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series

19 0.031935353 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

20 0.031179598 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.199), (1, -0.102), (2, -0.123), (3, 0.042), (4, 0.084), (5, -0.16), (6, -0.064), (7, -0.085), (8, 0.083), (9, -0.188), (10, -0.112), (11, -0.095), (12, 0.088), (13, -0.0), (14, -0.086), (15, -0.0), (16, -0.103), (17, -0.118), (18, -0.128), (19, -0.118), (20, 0.054), (21, -0.066), (22, -0.131), (23, 0.001), (24, -0.178), (25, -0.01), (26, -0.063), (27, 0.065), (28, -0.053), (29, 0.056), (30, 0.036), (31, -0.177), (32, -0.004), (33, -0.117), (34, 0.097), (35, -0.056), (36, 0.029), (37, -0.053), (38, -0.09), (39, 0.015), (40, 0.072), (41, -0.042), (42, 0.04), (43, -0.107), (44, 0.023), (45, -0.105), (46, -0.083), (47, -0.027), (48, -0.019), (49, 0.061)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94566816 58 jmlr-2011-Learning from Partial Labels

Author: Timothee Cour, Ben Sapp, Ben Taskar

Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces

2 0.57410085 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

3 0.48452568 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari

Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation

4 0.45149705 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models

Author: Botond Cseke, Tom Heskes

Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation

5 0.41437075 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

Author: Philippe Rigollet, Xin Tong

Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization

6 0.37042686 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

7 0.35777166 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels

8 0.296413 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem

9 0.29606754 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership

10 0.29271787 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

11 0.29171735 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning

12 0.27654642 16 jmlr-2011-Clustering Algorithms for Chains

13 0.25282645 83 jmlr-2011-Scikit-learn: Machine Learning in Python

14 0.23975757 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

15 0.22728437 61 jmlr-2011-Logistic Stick-Breaking Process

16 0.22096543 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models

17 0.20823339 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

18 0.20632415 12 jmlr-2011-Bayesian Co-Training

19 0.20468408 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

20 0.20362726 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.03), (9, 0.025), (10, 0.023), (13, 0.012), (24, 0.551), (31, 0.089), (32, 0.024), (41, 0.016), (60, 0.011), (71, 0.017), (73, 0.04), (78, 0.06), (90, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.96635491 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

Author: Thomas L. Griffiths, Zoubin Ghahramani

Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices

same-paper 2 0.91011095 58 jmlr-2011-Learning from Partial Labels

Author: Timothee Cour, Ben Sapp, Ben Taskar

Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces

3 0.87233126 28 jmlr-2011-Double Updating Online Learning

Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin

Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification

4 0.67483538 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models

Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson

Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised

5 0.57747525 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar

Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC based models for the problems of grouped feature selection (MICG ROUP) and multi-task feature selection (MIC-M ULTI). MIC-G ROUP assumes that the features are divided into groups and induces two level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-M ULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, T RANS F EAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.1 Keywords: feature selection, minimum description length principle, multi-task learning

6 0.56567258 61 jmlr-2011-Logistic Stick-Breaking Process

7 0.56029916 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

8 0.55921048 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

9 0.50571507 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

10 0.49251261 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

11 0.48569629 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes

12 0.48393011 48 jmlr-2011-Kernel Analysis of Deep Networks

13 0.47115511 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

14 0.46639878 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

15 0.46272248 66 jmlr-2011-Multiple Kernel Learning Algorithms

16 0.4561488 16 jmlr-2011-Clustering Algorithms for Chains

17 0.45491102 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking

18 0.44836047 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

19 0.44567367 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

20 0.44508383 7 jmlr-2011-Adaptive Exact Inference in Graphical Models