jmlr jmlr2011 jmlr2011-58 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
Reference: text
sentIndex sentText sentNum sentScore
1 We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. [sent-13, score-0.594]
2 In this setting each face is ambiguously labeled with the set of names extracted from the caption, see Figure 1 (bottom). [sent-17, score-0.383]
3 Top: using a screenplay, we can tell who is in a movie scene, but for every face in the corresponding images, the person’s identity is ambiguous (green labels). [sent-25, score-0.599]
4 In both cases, our goal is to learn a model from ambiguously labeled examples so as to disambiguate the training labels and also generalize to unseen examples. [sent-27, score-0.41]
5 We pose the partially labeled learning problem as minimization of an ambiguous loss in Section 3, and establish upper and lower bounds between the (unobserved) true loss and the (observed) ambiguous loss in terms of a critical distributional property we call ambiguity degree. [sent-55, score-1.449]
6 We propose the novel Convex Learning from Partial Labels (CLPL) formulation in Section 4, and show it offers a tighter approximation to the ambiguous loss, compared to a straightforward formulation. [sent-56, score-0.477]
7 We also apply our framework to a naming task in TV series, where screenplay and closed captions provide ambiguous labels. [sent-61, score-0.733]
8 [Figure 2 shows the range of supervision settings: supervised, unsupervised, semi-supervised, multi-label, multi-instance and partial-label, illustrated by how labels attach to instances.] [sent-78, score-0.893]
9 Figure 2: Range of supervision in classification. [sent-84, score-0.455]
10 2 Learning From Partially-labeled or Ambiguous Data There have been several papers that addressed the ambiguous label problem. [sent-100, score-0.619]
11 Hullermeier and Beringer (2006) propose several nonparametric, instance-based algorithms for ambiguous learning based on greedy heuristics. [sent-106, score-0.477]
12 These papers only report results on synthetically-created ambiguous labels for data sets such as the UCI repository. [sent-107, score-0.618]
13 Instead of estimating a prior probability for each individual, the algorithm estimates a prior for groups using the ambiguous labels. [sent-133, score-0.477]
14 Clearly, our setup generalizes the standard semi-supervised setting where some examples are labeled and some are unlabeled: an example is labeled when the corresponding ambiguity set y_i is a singleton, and unlabeled when y_i includes all the labels. [sent-161, score-0.463]
15 In order to learn from ambiguous data, we must make some assumptions about the distribution P(Z | X,Y ). [sent-164, score-0.477]
16 Consider a very simple ambiguity pattern that makes accurate learning impossible: L = 3, |zi | = 1 and label 1 is present in every set yi , for all i. [sent-165, score-0.384]
17 For example, consider our initial motivation of naming characters in TV shows, where the ambiguity set for any given detected face in a scene is given by the set of characters occurring at some point in that scene. [sent-170, score-0.663]
18 This suggests that for most characters, ambiguity sets are diverse and we can expect that the ambiguity degree is small. [sent-182, score-0.4]
19 We define a surrogate loss which we can evaluate, and which we call the ambiguous or partial 0/1 loss (where A stands for ambiguous): Partial 0/1 loss: L_A(h(x), y) = 1(h(x) ∉ y). [sent-189, score-0.683]
20 However, in the ambiguous learning setting we would like to minimize the true 0/1 loss, with access only to the partial loss. [sent-192, score-0.567]
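To make the contrast concrete, here is a small illustrative sketch (not from the paper's code; the toy predictions, labels and candidate sets are invented) comparing the true 0/1 loss with the partial 0/1 loss — the partial loss only checks whether the prediction lands inside the candidate set, so it can never exceed the true loss.

```python
import numpy as np

def true_01_loss(pred, y_true):
    # Standard 0/1 loss: 1 if the prediction differs from the true label.
    return float(pred != y_true)

def partial_01_loss(pred, candidate_set):
    # Ambiguous (partial) 0/1 loss: 1 only if the prediction falls outside the
    # candidate label set; it cannot tell which candidate is the correct one.
    return float(pred not in candidate_set)

# Toy example: three instances whose candidate sets contain the true label plus extras.
preds          = [2, 0, 1]
true_labels    = [2, 1, 1]
candidate_sets = [{2, 3}, {0, 1}, {1}]

print(np.mean([true_01_loss(p, y) for p, y in zip(preds, true_labels)]))       # 0.333...
print(np.mean([partial_01_loss(p, Y) for p, Y in zip(preds, candidate_sets)])) # 0.0
```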
21 We first introduce a measure of hardness of learning under ambiguous supervision, which we define as the ambiguity degree ε of a distribution P(X, Y, Z): Ambiguity degree: ε = sup_{x, y, z : P(x, y) > 0, z ∈ {1, . . . , L}, z ≠ y} P(z ∈ Z | X = x, Y = y). (1) [sent-194, score-0.696]
22 In words, ε corresponds to the maximum probability of an extra label z co-occurring with a true label y, over all labels and inputs. [sent-198, score-0.425]
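One plausible way to estimate the ambiguity degree from annotated data is sketched below; it is a deliberate simplification of Equation (1) in that it conditions only on the true label (pooling over inputs x), and the function name and toy data are invented for illustration.

```python
from collections import defaultdict

def empirical_ambiguity_degree(true_labels, candidate_sets):
    """For every true label y and every other label z, compute the fraction of
    y-instances whose candidate set also contains z; return the maximum such
    co-occurrence frequency (a plug-in, label-conditional estimate of epsilon)."""
    count_y = defaultdict(int)
    cooc = defaultdict(int)
    for y, Y in zip(true_labels, candidate_sets):
        count_y[y] += 1
        for z in Y:
            if z != y:
                cooc[(y, z)] += 1
    return max((c / count_y[y] for (y, z), c in cooc.items()), default=0.0)

# Label 2 co-occurs with every bag -> estimated ambiguity degree of 1 (hard case).
print(empirical_ambiguity_degree([0, 0, 1], [{0, 2}, {0, 2}, {1, 2}]))
```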
23 Intuitively, the best case scenario for ambiguous learning corresponds to a distribution with high conditional entropy for P(Z | X,Y ). [sent-207, score-0.477]
24 Proposition 1 (Partial loss bound via ambiguity degree ε) For any classifier h and distribution P(X, Y, Z), with ambiguous label set Y = {Y} ∪ Z and ambiguity degree ε: E_P[L_A(h(X), Y)] ≤ E_P[L(h(X), Y)] ≤ (1/(1 − ε)) E_P[L_A(h(X), Y)], with the convention 1/0 = +∞. [sent-210, score-0.496]
25 We can further tighten our bounds between the ambiguous loss and the standard 0/1 loss. [sent-237, score-0.562]
26 [Figure 4: Feasible region for expected ambiguous loss and expected true loss (axes E[ambiguous loss] and E[true loss]), for a fixed ambiguity degree ε.] [sent-247, score-0.477]
27 These bounds give a strong connection between ambiguous loss and real loss when ε is small. [sent-259, score-0.62]
28 This assumption allows us to approximately minimize the expected real loss by minimizing (an upper bound on) the ambiguous loss, as we propose in the following section. [sent-260, score-0.535]
29 We now focus on a particular family of classifiers, which assigns a score g_a(x) to each label a for a given input x and selects the highest scoring label: h(x) = arg max_a g_a(x). [sent-263, score-0.947]
30 Below, we use a multi-linear function class G by assuming a feature mapping f(x) : X → Rd from inputs to d real-valued features and let ga (x) = wa · f(x), where wa ∈ Rd is a weight vector for each class, bounded by some norm: ||wa || p ≤ B for p = 1, 2. [sent-271, score-0.506]
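A minimal sketch of this scoring family, taking f(x) to be the identity map for simplicity; the sizes and random weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 5, 4                       # feature dimension and number of labels (toy sizes)
W = rng.normal(size=(L, d))       # one weight vector w_a per label, stacked as rows

def scores(x, W):
    # g_a(x) = w_a . f(x); here f(x) = x.
    return W @ x

def predict(x, W):
    # h(x) = argmax_a g_a(x)
    return int(np.argmax(scores(x, W)))

x = rng.normal(size=d)
print(scores(x, W), predict(x, W))
```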
31 1 Convex Loss for Partial Labels In the partially labeled setting, instead of an unambiguous single label y per instance we have a set of labels Y , one of which is the correct label for the instance. [sent-281, score-0.538]
32 We show next that our loss function is consistent under certain assumptions and offers a tighter upper bound on the ambiguous loss compared to a more straightforward multi-label approach. [sent-289, score-0.593]
33 First, we require that arg maxa P(Y = a | X = x) = arg maxa P(a ∈ Y | X = x), since otherwise arg maxa P(Y = a | X = x) cannot be determined by any algorithm from partial labels Y without additional information even with an infinite amount of data. [sent-294, score-0.66]
34 It satisfies arg max_a g_a = 2 but arg max_a ∑_b P_ab = 1. [sent-312, score-0.654]
35 To gain additional intuition as to why CLPL is better than the naive loss of Equation 3: for an input x with ambiguous label set (a, b), CLPL only encourages the average of g_a(x) and g_b(x) to be large, allowing the correct score to be positive and the extraneous score to be negative. [sent-325, score-1.303]
36 In contrast, the naive model encourages both ga (x) and gb (x) to be large. [sent-328, score-0.572]
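The toy sketch below contrasts the two behaviours under one plausible reading of the CLPL objective that is consistent with the text (ψ applied to the average candidate score, plus ψ(−g_a) on every non-candidate) versus the naive per-candidate version; the hinge choice ψ(u) = max(0, 1 − u) and the specific scores are assumptions made for illustration only.

```python
import numpy as np

def psi(u):
    # A convex, decreasing surrogate; the hinge max(0, 1 - u) is just one choice.
    return np.maximum(0.0, 1.0 - u)

def clpl_loss(g, candidates):
    # psi of the *average* candidate score, plus psi(-g_a) for each non-candidate.
    cand = np.array(sorted(candidates))
    rest = np.array([a for a in range(len(g)) if a not in candidates], dtype=int)
    return psi(g[cand].mean()) + (psi(-g[rest]).sum() if rest.size else 0.0)

def naive_loss(g, candidates):
    # psi applied to *each* candidate score separately (multi-label style).
    cand = np.array(sorted(candidates))
    rest = np.array([a for a in range(len(g)) if a not in candidates], dtype=int)
    return psi(g[cand]).sum() + (psi(-g[rest]).sum() if rest.size else 0.0)

# Candidate set {a, b} = {0, 1}: correct score g_0 = +2, extraneous score g_1 = -1.
g = np.array([2.0, -1.0, -2.0])
print(clpl_loss(g, {0, 1}))   # 0.5: only the average of the candidate scores matters
print(naive_loss(g, {0, 1}))  # 2.0: the negative extraneous score is penalized heavily
```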
37 We assume a feature mapping f(x) : X → Rd from inputs to d real-valued features and let ga (x) = wa ·f(x), where wa ∈ Rd is a weight vector for each class, bounded by L2 norm : ||wa ||2 ≤ B. [sent-331, score-0.506]
38 (Here y(x) denotes the ambiguous label set of some x ∈ S, and z(x) = y(x) \ {y(x)} its extra labels, i.e., the candidates other than the true label y(x).) [sent-351, score-0.619]
39 If Lψ (g(x), y(x)) ≤ ψ(η/2) and ∀a ∈ z(x), ∃x′ ∈ Bη (x) such that ga (x′ ) ≤ −η/2, then g predicts the correct label for x. [sent-356, score-0.51]
40 In other words, g predicts the correct label for x when its loss is sufficiently small, and for each of its ambiguous labels a, we can find a neighbor with same label whose score ga (x′ ) is small enough. [sent-357, score-1.355]
41 We stack the parameters and features into one vector as follows, so that g_a(x) = w_a · f(x) = w · f(x, a): the stacked parameter vector is w = (w_1, . . . , w_L), and the stacked feature map f(x, a) = (1(a = 1)f(x), . . . , 1(a = L)f(x)) places f(x) in the a-th block and zeros elsewhere. [sent-375, score-0.437]
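A small sketch of this stacking trick (again taking f(x) = x for simplicity); the assertion checks that the stacked inner product reproduces the per-class score.

```python
import numpy as np

def stacked_feature(x, a, L):
    # f(x, a): a block vector with f(x) in the a-th block and zeros elsewhere,
    # so that w . f(x, a) = w_a . f(x) for the stacked w = (w_1, ..., w_L).
    d = x.shape[0]
    out = np.zeros(L * d)
    out[a * d:(a + 1) * d] = x
    return out

rng = np.random.default_rng(1)
d, L = 3, 4
W = rng.normal(size=(L, d))      # per-class weight vectors
w = W.reshape(-1)                # stacked parameter vector
x = rng.normal(size=d)
a = 2
assert np.isclose(w @ stacked_feature(x, a, L), W[a] @ x)
```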
42 We analyze the effect on learning of the following factors: distribution of ambiguous labels, size of ambiguous bags, proportion of instances which contain an ambiguous bag, entropy of the ambiguity, distribution of true labels and number of distinct labels. [sent-400, score-1.572]
43 We compare our CLPL approach against a number of baselines, including a generative model, a discriminative maximum-entropy model, a naive model, two K-nearest neighbor models, as well as models that ignore the ambiguous bags. [sent-401, score-0.586]
44 1 CHANCE BASELINE We define the chance baseline as randomly guessing between the possible ambiguous labels only. [sent-408, score-0.675]
45 Defining the (empirical) average ambiguous size to be E_S[|y|] = (1/m) ∑_{i=1}^m |y_i|, the expected error from the chance baseline is given by error_chance = 1 − 1/E_S[|y|]. [sent-409, score-0.534]
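A tiny sketch of the chance-baseline computation as written above (the candidate sets are invented).

```python
import numpy as np

def chance_error(candidate_sets):
    # error_chance = 1 - 1 / E_S[|y|], with E_S[|y|] the average candidate-set size.
    avg_size = np.mean([len(Y) for Y in candidate_sets])
    return 1.0 - 1.0 / avg_size

print(chance_error([{0, 1}, {0, 2, 3}, {1}]))  # average size 2 -> chance error 0.5
```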
46 After training, we predict the label with the highest score (in the transductive setting): y = arg maxa∈y ga (x). [sent-413, score-0.626]
47 (1993) for machine translation, but we can adapt it to the ambiguous label case. [sent-417, score-0.619]
48 4 DISCRIMINATIVE EM We compare with the model proposed in Jin and Ghahramani (2002), which is a discriminative model with an EM procedure adapted for the ambiguous label setting. [sent-422, score-0.619]
49 The authors minimize the KL divergence between a maximum entropy model P (estimated in the M-step) and a distribution P̂ over the ambiguous labels (estimated in the E-step): J(θ, P̂) = ∑_i ∑_{a∈y_i} P̂(a | x_i) log[ P̂(a | x_i) / P(a | x_i; θ) ]. [sent-423, score-0.654]
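The sketch below is a heavily simplified EM-style procedure in the spirit of Jin and Ghahramani (2002), not the authors' implementation: the E-step renormalizes a softmax model over each instance's candidate set, and the M-step takes a gradient step on the resulting weighted cross-entropy. The model form, learning rate, iteration count and toy data are all assumptions.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def em_partial_labels(X, candidate_sets, L, iters=50, lr=0.1):
    m, d = X.shape
    W = np.zeros((L, d))
    M = np.zeros((m, L))                      # M[i, a] = 1 iff a is a candidate for i
    for i, Y in enumerate(candidate_sets):
        M[i, list(Y)] = 1.0
    for _ in range(iters):
        P = softmax(X @ W.T)                  # model posteriors P(a | x_i; theta)
        Q = P * M                             # E-step: restrict to candidates ...
        Q /= Q.sum(axis=1, keepdims=True)     # ... and renormalize
        grad = (P - Q).T @ X / m              # M-step: one step on weighted cross-entropy
        W -= lr * grad
    return W

# Toy data: three Gaussian clusters; each candidate set holds the true label
# plus one uniformly chosen distractor, so co-occurrences are diverse.
rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
y_true = np.repeat([0, 1, 2], 40)
X = centers[y_true] + rng.normal(scale=0.8, size=(120, 2))
cands = [{int(y), int(rng.choice([a for a in range(3) if a != y]))} for y in y_true]
W = em_partial_labels(X, cands, L=3)
print(np.mean(np.argmax(X @ W.T, axis=1) == y_true))  # typically high accuracy
```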
50 6 SUPERVISED MODELS Finally, we also consider two baselines that ignore the ambiguous label setting. [sent-431, score-0.663]
51 The ambiguous labels in the training set are generated randomly according to different noise models which we specify in each case. [sent-471, score-0.618]
52 For each method and parameter setting, we report the average test error rate over 20 trials after training the model on the ambiguous train set. [sent-472, score-0.477]
53 Note, in the inductive setting we consider the test set as unlabeled, thus the classifier votes among all possible labels: a∗ = h(x) = arg max_{a∈{1,...,L}} g_a(x). [sent-474, score-0.449]
54 For the transductive experiments, there is no test set; we report the error rate for disambiguating the ambiguous labels (also averaged over 20 trials corresponding to random settings of ambiguous labels). [sent-477, score-1.142]
55 The main differences with the inductive setting are: (1) the model is trained on all instances and tested on the same instances; and (2) the classifier votes only among the ambiguous labels, which is easier: a∗ = h(x) = arg max_{a∈y} g_a(x). [sent-478, score-0.926]
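A small sketch of the two voting rules (the score vector and candidate set are invented).

```python
import numpy as np

def predict_inductive(g):
    # Inductive setting: the instance is treated as unlabeled, vote among all labels.
    return int(np.argmax(g))

def predict_transductive(g, candidates):
    # Transductive setting: vote only among the instance's own ambiguous labels.
    cand = sorted(candidates)
    return cand[int(np.argmax(g[cand]))]

g = np.array([0.2, 1.5, -0.3, 0.9])
print(predict_inductive(g))             # 1
print(predict_transductive(g, {0, 3}))  # 3: label 1 is not a candidate here
```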
56 We experiment with different noise models for ambiguous bags, parametrized by p, q, ε, see Figure 6. [sent-483, score-0.477]
57 q represents the number of extra labels for each ambiguous example. [sent-485, score-0.618]
58 ε represents the degree of ambiguity (defined in (1)) for each ambiguous example. [sent-486, score-0.696]
59 [Table: overview of the controlled experiments — the quantity varied in each (# of ambiguous bags, degree of ambiguity, dimension, ambiguity size) and the corresponding figure (6, 7 or 8). [sent-490, score-2.249]
60 Continued: whether each experiment is inductive (all but one are), the data set (FIW(10b), FIW(10), FIW(100), Lost, audio, ecoli, derma, abalone) and the varied parameter, e.g., p ∈ [0, ...].] [sent-491, score-0.683]
61 We experiment with 3 different noise models for ambiguous bags, parametrized by p, q, ε. [sent-502, score-0.477]
62 q represents the number of extra labels for each ambiguous example (generated uniformly without replacement). [sent-504, score-0.618]
63 ε represents the degree of ambiguity for each ambiguous example (see definition 1). [sent-505, score-0.696]
64 6 (which ignore the ambiguously labeled examples) consistently perform worse than their counterparts adapted for the ambiguous setting. [sent-523, score-0.71]
65 We first choose at random for each label a dominant co-occurring label which is sampled with probability ε; the rest of the labels are sampled uniformly with probability (1 − ε)/(L − 2) (there is a single extra label per example). [sent-527, score-0.567]
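A sketch of this ε-controlled noise model (the function name and toy labels are invented): with probability ε the label's dominant co-occurring label is added, otherwise one of the remaining L − 2 labels is drawn uniformly.

```python
import numpy as np

def add_ambiguity(true_labels, L, eps, rng):
    # Pick one dominant co-occurring label per class, then add a single extra
    # label to each instance: the dominant one w.p. eps, else a uniform other one.
    dominant = {a: int(rng.choice([b for b in range(L) if b != a])) for a in range(L)}
    bags = []
    for y in true_labels:
        if rng.random() < eps:
            z = dominant[y]
        else:
            z = int(rng.choice([b for b in range(L) if b not in (y, dominant[y])]))
        bags.append({y, z})
    return bags

rng = np.random.default_rng(0)
print(add_ambiguity([0, 1, 2, 0, 1], L=5, eps=0.8, rng=rng))
```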
66 (top left) increasing proportion of ambiguous bags q, inductive setting. [sent-575, score-0.583]
67 Figure 9: Left: We experiment with a boosting version of the ambiguous learning, and compare to a boosting version of the naive baseline (here with ambiguous bags of size 3). [sent-692, score-1.218]
68 Our goal is to identify characters given ambiguous labels derived from the screenplay. [sent-708, score-0.717]
69 Given an alignment of the screenplay to frames, we have ambiguous labels for characters in each scene: the set of speakers mentioned at some point in the scene, as shown in Figure 1. [sent-730, score-0.847]
70 We use the ambiguous sets to select face tracks filtered through our pipeline. [sent-747, score-0.647]
71 This leaves ambiguous bags of size 1, 2 or 3, with an average bag size of 2. [sent-749, score-0.544]
72 3 Errors in Ambiguous Label Sets In the TV episodes we considered, we observed that approximately 1% of ambiguous label sets were wrong, in that they didn’t contain the ground truth label of the face track. [sent-756, score-0.939]
73 While this is not a major problem, it becomes so when we consider additional cues (mouth motion, gender) that restrict the ambiguous label set. [sent-758, score-0.648]
74 4 Results with the Basic System Now that we have a set of instances (face tracks), feature descriptors for the face track and ambiguous label sets for each face track, we can apply the same method as described in the previous section. [sent-762, score-0.891]
75 The confusion matrix displaying the distribution of ambiguous labels for the top 16 characters in Lost is shown in Figure 11 (left). [sent-764, score-0.717]
76 The confusion matrix of our predictions after applying our ambiguous learning algorithm is shown in Figure 11 (right). [sent-765, score-0.477]
77 7% of the ambiguous bags, 3 times less than the second least common character) and Liam Pace from Charlie Pace (they are brothers and co-occur frequently, as can be seen in the top figure). [sent-767, score-0.477]
78 As we can see, the most difficult classes are the ones with which another class is strongly correlated in the ambiguous label confusion matrix. [sent-770, score-0.619]
79 Element D_ij represents the proportion of times class i was seen with class j in the ambiguous bags, and D_ii = 1. [sent-803, score-0.477]
80 we can define the relative-constrained score as an adaptation to the ambiguous setting; we only consider votes among the ambiguous labels y (where a∗ = arg max_{a∈y} g_a(x)): C_rel,y(g(x)) = g_{a∗}(x) − max_{a≠a∗} g_a(x). [sent-828, score-1.9]
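A sketch of this confidence measure; the competitor set inside the second max is not fully recoverable from the extracted formula, so the version below (margin of the constrained winner over the best other label) is an assumption.

```python
import numpy as np

def relative_constrained_confidence(g, candidates):
    # Vote only among the candidate labels, then score the margin between the
    # constrained winner and the best competing label.
    cand = sorted(candidates)
    a_star = cand[int(np.argmax(g[cand]))]
    competitors = np.delete(g, a_star)
    return a_star, float(g[a_star] - competitors.max())

g = np.array([0.2, 1.5, -0.3, 0.9])
print(relative_constrained_confidence(g, {0, 3}))  # (3, -0.6): label 1 outscores the winner
```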
81 The relative-constrained score improves the high-recall/low-precision region by only voting among the ambiguous bags, but it suffers in the high-precision/low-recall region because some ambiguous bags may be erroneous. [sent-844, score-1.021]
82 There are some problems with all of those choices, especially in the case where we have some errors in the ambiguous label set (a ∉ y for the true label a). [sent-846, score-0.761]
83 At high recall, the errors in the classifier dominate the errors in ambiguous labels, and relative-constrained confidence gives better precision because of the restriction. [sent-850, score-0.477]
84 We introduce a hybrid confidence measure that performs well at all recall levels r, interpolating between the two confidence measures: h_a^r(x) = g_a(x) if a ∈ y, and (1 − r) g_a(x) + r min_b g_b(x) otherwise. [sent-851, score-0.463]
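A sketch of the hybrid confidence measure, assuming the piecewise form written above (the scores and candidate set are invented).

```python
import numpy as np

def hybrid_confidence(g, candidates, r):
    # h_a^r(x): raw score for candidate labels; for non-candidates, interpolate
    # between the raw score and the worst score as the recall level r grows.
    g_min = g.min()
    return np.array([g[a] if a in candidates else (1.0 - r) * g[a] + r * g_min
                     for a in range(len(g))])

g = np.array([0.2, 1.5, -0.3, 0.9])
print(hybrid_confidence(g, {0, 3}, r=0.0))  # identical to g: no constraint at low recall
print(hybrid_confidence(g, {0, 3}, r=1.0))  # non-candidates pushed to the minimum score
```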
85 (2006) to detect mouth motion during dialog and adapt it to our ambiguous label setting. [sent-861, score-0.796]
86 5 For a face track x with ambiguous label set y and a temporally overlapping utterance from a speaker a ∈ {1, . . . , L}. [sent-862, score-0.811]
87 2 GENDER CONSTRAINTS We introduce a gender classifier to constrain the ambiguous labels based on predicted gender. [sent-867, score-0.703]
88 We use gender by filtering out the labels whose gender does not match the predicted gender of a face track, if the gender confidence exceeds a threshold (one threshold for females and one for males, set on validation data to achieve 90% precision for each direction of the gender prediction). [sent-871, score-0.603]
89 Thus, we modify the ambiguous label set y as: y := y if the gender is uncertain; y := y − {a : a is male} if the classifier predicts female; y := y − {a : a is female} if the classifier predicts male. [sent-872, score-0.874]
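A sketch of this filtering rule; the gender metadata mapping and the confidence flag are assumed inputs.

```python
def filter_by_gender(candidates, predicted_gender, gender_of, confident):
    # Drop candidate names whose known gender contradicts the face track's predicted
    # gender; keep the set unchanged if the gender prediction is not confident.
    if not confident:
        return set(candidates)
    if predicted_gender == 'female':
        return {a for a in candidates if gender_of[a] != 'male'}
    if predicted_gender == 'male':
        return {a for a in candidates if gender_of[a] != 'female'}
    return set(candidates)

gender_of = {0: 'male', 1: 'female', 2: 'male'}
print(filter_by_gender({0, 1, 2}, 'female', gender_of, confident=True))   # {1}
print(filter_by_gender({0, 1, 2}, 'male',   gender_of, confident=False))  # unchanged
```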
90 We also propagate the predicted labels of our model to all faces in the same face track throughout an episode. [sent-955, score-0.421]
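One simple way to implement this propagation is sketched below — every face in a track receives the majority prediction of that track; the majority-vote mechanism is an assumption, since the text only states that predicted labels are propagated along the track.

```python
from collections import Counter, defaultdict

def propagate_track_labels(track_ids, predictions):
    # Group per-face predictions by track and replace each with the track majority.
    by_track = defaultdict(list)
    for t, p in zip(track_ids, predictions):
        by_track[t].append(p)
    majority = {t: Counter(ps).most_common(1)[0][0] for t, ps in by_track.items()}
    return [majority[t] for t in track_ids]

tracks = ['t1', 't1', 't1', 't2', 't2']
preds  = ['Charlie', 'Liam', 'Charlie', 'Liam', 'Liam']
print(propagate_track_labels(tracks, preds))
# ['Charlie', 'Charlie', 'Charlie', 'Liam', 'Liam']
```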
91 Hence we can focus on the analysis for a fixed x (with P(X = x) > 0), writing g_a = g_a(x), and for any set c ⊆ {1, . . . , L}. [sent-1012, score-0.736]
92 The case of Pa = 0 leads to ga → −∞ and it can be ignored without loss of generality, so we can assume that optimal g is bounded for fixed p with 0 < Pa < 1. [sent-1020, score-0.426]
93 Taking the derivative of the loss with respect to g_a and setting it to 0, we obtain the first-order optimality conditions: ∂L_ψ(g)/∂g_a = ∑_{c : a,b ∉ c} P_{c,a} ψ′(g_{c,a}) / (|c| + 1) − (1 − P_a) ψ′(−g_a) = 0. [sent-1021, score-0.426]
94 Since g_a ≤ g_b, ψ′(g_{c,a}) ≤ ψ′(g_{c,b}) and ψ′(−g_a) ≥ ψ′(−g_b). [sent-1024, score-0.463]
95 Then the minimizer g satisfies either (1) g_a → ∞ (this happens if ψ′(·) < 0 for finite arguments) while the g_{a′} remain finite because of the (1 − P_{a′}) ψ(−g_{a′}) terms in the objective, or (2) g is finite and the proof above applies since dominance holds: P_{c,b} = 0 if a ∉ c, so we can apply the theorem. [sent-1031, score-0.427]
96 The second inequality comes from the fact that max_{a∈y} g_a(x) ≥ (1/|y|) ∑_{a∈y} g_a(x). [sent-1038, score-0.736]
97 For the tightness proof: when g_a(x) is constant over a ∈ y, we have ψ(max_{a∈y} g_a(x)) = ψ((1/|y|) ∑_{a∈y} g_a(x)) = (1/|y|) ∑_{a∈y} ψ(g_a(x)), implying L_ψ^max(g(x), y) = L_ψ(g(x), y) = L_ψ^naive(g(x), y). [sent-1040, score-1.213]
98 Ĝ_m(G_a) = E_ν[ sup_{g_a∈G_a} (2/m) ∑_i ν_i g_a(x_i) | S ] = E_ν[ sup_{||w_a||≤B} (2/m) w_a · ∑_i ν_i f(x_i) | S ] = (2B/m) E_ν[ ||∑_i ν_i f(x_i)|| | S ] ≤ (2B/m) √( E_ν[ ∑_{i,j} ν_i ν_j f(x_i)^T f(x_j) | S ] ) = (2B/m) √( ∑_i ||f(x_i)||^2 ). [sent-1049, score-0.805]
99 By definition of B_η(x), g_a(x) = g_a(x′) + w_a · (f(x) − f(x′)) ≤ g_a(x′) + ||w_a||_* η ≤ g_a(x′) + η ≤ η/2. [sent-1078, score-1.541]
100 In fact, we also have g_a(x) < η/2, by considering the two cases (w_a = 0 or w_a ≠ 0) and using the fact that ||f(x) − f(x′)|| < η. [sent-1079, score-0.437]
wordName wordTfidf (topN-words)
[('ambiguous', 0.477), ('ga', 0.368), ('clpl', 0.246), ('ep', 0.183), ('ambiguity', 0.181), ('knn', 0.17), ('ambiguously', 0.153), ('app', 0.153), ('label', 0.142), ('labels', 0.141), ('mouth', 0.136), ('faces', 0.13), ('abels', 0.123), ('face', 0.122), ('fiw', 0.119), ('naming', 0.117), ('tv', 0.113), ('naive', 0.109), ('maxa', 0.101), ('characters', 0.099), ('gb', 0.095), ('partial', 0.09), ('gender', 0.085), ('screenplay', 0.085), ('lost', 0.081), ('taskar', 0.08), ('labeled', 0.08), ('groundtruth', 0.076), ('la', 0.075), ('wa', 0.069), ('bags', 0.067), ('everingham', 0.065), ('cour', 0.065), ('yi', 0.061), ('yes', 0.06), ('dominance', 0.059), ('gm', 0.059), ('loss', 0.058), ('chance', 0.057), ('episodes', 0.056), ('jin', 0.054), ('captions', 0.054), ('pa', 0.051), ('multiclass', 0.05), ('audio', 0.049), ('video', 0.048), ('tracks', 0.048), ('transductive', 0.047), ('wild', 0.045), ('alignment', 0.045), ('scene', 0.045), ('baselines', 0.044), ('boosting', 0.044), ('gallagher', 0.042), ('hullermeier', 0.042), ('arg', 0.042), ('berg', 0.042), ('speaker', 0.042), ('motion', 0.041), ('supervised', 0.041), ('caption', 0.039), ('inductive', 0.039), ('degree', 0.038), ('images', 0.038), ('er', 0.037), ('pab', 0.036), ('barnard', 0.036), ('disambiguate', 0.036), ('earning', 0.036), ('xi', 0.036), ('people', 0.035), ('uci', 0.035), ('beringer', 0.034), ('ecoli', 0.034), ('infg', 0.034), ('partially', 0.033), ('mendelson', 0.033), ('proposition', 0.032), ('movies', 0.032), ('hinge', 0.03), ('classi', 0.03), ('supervision', 0.029), ('cues', 0.029), ('dermatology', 0.029), ('duygulu', 0.029), ('gc', 0.029), ('satoh', 0.029), ('track', 0.028), ('pb', 0.028), ('names', 0.028), ('dence', 0.027), ('bounds', 0.027), ('character', 0.027), ('score', 0.027), ('videos', 0.026), ('losses', 0.026), ('labeling', 0.026), ('ablative', 0.025), ('ambroise', 0.025), ('crel', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999952 58 jmlr-2011-Learning from Partial Labels
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
2 0.13590088 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
3 0.13229126 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
Author: Botond Cseke, Tom Heskes
Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation
4 0.11718647 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
5 0.059461422 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
Author: Mark D. Reid, Robert C. Williamson
Abstract: We unify f -divergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROC-curves and statistical information. We do this by systematically studying integral and variational representations of these objects and in so doing identify their representation primitives which all are related to cost-sensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating f -divergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates maximum mean discrepancy to Fisher linear discriminants. Keywords: classification, loss functions, divergence, statistical information, regret bounds
6 0.057057038 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
7 0.048192319 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates
8 0.04426346 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
9 0.043208793 16 jmlr-2011-Clustering Algorithms for Chains
10 0.041893087 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
11 0.039639901 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
12 0.038390394 61 jmlr-2011-Logistic Stick-Breaking Process
13 0.037961043 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
14 0.037087243 12 jmlr-2011-Bayesian Co-Training
15 0.036793876 55 jmlr-2011-Learning Multi-modal Similarity
16 0.036253642 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
17 0.034029819 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
18 0.033968188 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series
19 0.031935353 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels
20 0.031179598 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
topicId topicWeight
[(0, 0.199), (1, -0.102), (2, -0.123), (3, 0.042), (4, 0.084), (5, -0.16), (6, -0.064), (7, -0.085), (8, 0.083), (9, -0.188), (10, -0.112), (11, -0.095), (12, 0.088), (13, -0.0), (14, -0.086), (15, -0.0), (16, -0.103), (17, -0.118), (18, -0.128), (19, -0.118), (20, 0.054), (21, -0.066), (22, -0.131), (23, 0.001), (24, -0.178), (25, -0.01), (26, -0.063), (27, 0.065), (28, -0.053), (29, 0.056), (30, 0.036), (31, -0.177), (32, -0.004), (33, -0.117), (34, 0.097), (35, -0.056), (36, 0.029), (37, -0.053), (38, -0.09), (39, 0.015), (40, 0.072), (41, -0.042), (42, 0.04), (43, -0.107), (44, 0.023), (45, -0.105), (46, -0.083), (47, -0.027), (48, -0.019), (49, 0.061)]
simIndex simValue paperId paperTitle
same-paper 1 0.94566816 58 jmlr-2011-Learning from Partial Labels
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
2 0.57410085 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon
Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood
3 0.48452568 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood
Author: Pasi Jylänki, Jarno Vanhatalo, Aki Vehtari
Abstract: This paper considers the robust and efficient implementation of Gaussian process regression with a Student-t observation model, which has a non-log-concave likelihood. The challenge with the Student-t model is the analytically intractable inference which is why several approximative methods have been proposed. Expectation propagation (EP) has been found to be a very accurate method in many empirical studies but the convergence of EP is known to be problematic with models containing non-log-concave site functions. In this paper we illustrate the situations where standard EP fails to converge and review different modifications and alternative algorithms for improving the convergence. We demonstrate that convergence problems may occur during the type-II maximum a posteriori (MAP) estimation of the hyperparameters and show that standard EP may not converge in the MAP values with some difficult data sets. We present a robust implementation which relies primarily on parallel EP updates and uses a moment-matching-based double-loop algorithm with adaptively selected step size in difficult cases. The predictive performance of EP is compared with Laplace, variational Bayes, and Markov chain Monte Carlo approximations. Keywords: Gaussian process, robust regression, Student-t distribution, approximate inference, expectation propagation
4 0.45149705 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
Author: Botond Cseke, Tom Heskes
Abstract: We consider the problem of improving the Gaussian approximate posterior marginals computed by expectation propagation and the Laplace method in latent Gaussian models and propose methods that are similar in spirit to the Laplace approximation of Tierney and Kadane (1986). We show that in the case of sparse Gaussian models, the computational complexity of expectation propagation can be made comparable to that of the Laplace method by using a parallel updating scheme. In some cases, expectation propagation gives excellent estimates where the Laplace approximation fails. Inspired by bounds on the correct marginals, we arrive at factorized approximations, which can be applied on top of both expectation propagation and the Laplace method. The factorized approximations can give nearly indistinguishable results from the non-factorized approximations and their computational complexity scales linearly with the number of variables. We experienced that the expectation propagation based marginal approximations we introduce are typically more accurate than the methods of similar complexity proposed by Rue et al. (2009). Keywords: approximate marginals, Gaussian Markov random fields, Laplace approximation, variational inference, expectation propagation
5 0.41437075 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints
Author: Philippe Rigollet, Xin Tong
Abstract: Motivated by problems of anomaly detection, this paper implements the Neyman-Pearson paradigm to deal with asymmetric errors in binary classification with a convex loss ϕ. Given a finite collection of classifiers, we combine them and obtain a new classifier that satisfies simultaneously the two following properties with high probability: (i) its ϕ-type I error is below a pre-specified level and (ii), it has ϕ-type II error close to the minimum possible. The proposed classifier is obtained by minimizing an empirical convex objective with an empirical convex constraint. The novelty of the method is that the classifier output by this computationally feasible program is shown to satisfy the original constraint on type I error. New techniques to handle such problems are developed and they have consequences on chance constrained programming. We also evaluate the price to pay in terms of type II error for being conservative on type I error. Keywords: binary classification, Neyman-Pearson paradigm, anomaly detection, empirical constraint, empirical risk minimization, chance constrained optimization
6 0.37042686 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
7 0.35777166 31 jmlr-2011-Efficient and Effective Visual Codebook Generation Using Additive Kernels
9 0.29606754 52 jmlr-2011-Large Margin Hierarchical Classification with Mutually Exclusive Class Membership
10 0.29271787 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning
11 0.29171735 63 jmlr-2011-MULAN: A Java Library for Multi-Label Learning
12 0.27654642 16 jmlr-2011-Clustering Algorithms for Chains
13 0.25282645 83 jmlr-2011-Scikit-learn: Machine Learning in Python
14 0.23975757 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation
15 0.22728437 61 jmlr-2011-Logistic Stick-Breaking Process
16 0.22096543 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
17 0.20823339 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
18 0.20632415 12 jmlr-2011-Bayesian Co-Training
19 0.20468408 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
20 0.20362726 5 jmlr-2011-A Refined Margin Analysis for Boosting Algorithms via Equilibrium Margin
topicId topicWeight
[(4, 0.03), (9, 0.025), (10, 0.023), (13, 0.012), (24, 0.551), (31, 0.089), (32, 0.024), (41, 0.016), (60, 0.011), (71, 0.017), (73, 0.04), (78, 0.06), (90, 0.015)]
simIndex simValue paperId paperTitle
1 0.96635491 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review
Author: Thomas L. Griffiths, Zoubin Ghahramani
Abstract: The Indian buffet process is a stochastic process defining a probability distribution over equivalence classes of sparse binary matrices with a finite number of rows and an unbounded number of columns. This distribution is suitable for use as a prior in probabilistic models that represent objects using a potentially infinite array of features, or that involve bipartite graphs in which the size of at least one class of nodes is unknown. We give a detailed derivation of this distribution, and illustrate its use as a prior in an infinite latent feature model. We then review recent applications of the Indian buffet process in machine learning, discuss its extensions, and summarize its connections to other stochastic processes. Keywords: nonparametric Bayes, Markov chain Monte Carlo, latent variable models, Chinese restaurant processes, beta process, exchangeable distributions, sparse binary matrices
same-paper 2 0.91011095 58 jmlr-2011-Learning from Partial Labels
Author: Timothee Cour, Ben Sapp, Ben Taskar
Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces
3 0.87233126 28 jmlr-2011-Double Updating Online Learning
Author: Peilin Zhao, Steven C.H. Hoi, Rong Jin
Abstract: In most kernel based online learning algorithms, when an incoming instance is misclassified, it will be added into the pool of support vectors and assigned with a weight, which often remains unchanged during the rest of the learning process. This is clearly insufficient since when a new support vector is added, we generally expect the weights of the other existing support vectors to be updated in order to reflect the influence of the added support vector. In this paper, we propose a new online learning method, termed Double Updating Online Learning, or DUOL for short, that explicitly addresses this problem. Instead of only assigning a fixed weight to the misclassified example received at the current trial, the proposed online learning algorithm also tries to update the weight for one of the existing support vectors. We show that the mistake bound can be improved by the proposed online learning method. We conduct an extensive set of empirical evaluations for both binary and multi-class online learning tasks. The experimental results show that the proposed technique is considerably more effective than the state-of-the-art online learning algorithms. The source code is available to public at http://www.cais.ntu.edu.sg/˜chhoi/DUOL/. Keywords: online learning, kernel method, support vector machines, maximum margin learning, classification
4 0.67483538 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models
Author: Sharon Goldwater, Thomas L. Griffiths, Mark Johnson
Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that can generically produce power laws, breaking generative models into two stages. The first stage, the generator, can be any standard probabilistic model, while the second stage, the adaptor, transforms the word frequencies of this model to provide a closer match to natural language. We show that two commonly used Bayesian models, the Dirichlet-multinomial model and the Dirichlet process, can be viewed as special cases of our framework. We discuss two stochastic processes—the Chinese restaurant process and its two-parameter generalization based on the Pitman-Yor process—that can be used as adaptors in our framework to produce power-law distributions over word frequencies. We show that these adaptors justify common estimation procedures based on logarithmic or inverse-power transformations of empirical frequencies. In addition, taking the Pitman-Yor Chinese restaurant process as an adaptor justifies the appearance of type frequencies in formal analyses of natural language and improves the performance of a model for unsupervised learning of morphology. Keywords: nonparametric Bayes, Pitman-Yor process, language model, unsupervised
5 0.57747525 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar
Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC based models for the problems of grouped feature selection (MICG ROUP) and multi-task feature selection (MIC-M ULTI). MIC-G ROUP assumes that the features are divided into groups and induces two level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-M ULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, T RANS F EAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems.1 Keywords: feature selection, minimum description length principle, multi-task learning
6 0.56567258 61 jmlr-2011-Logistic Stick-Breaking Process
7 0.56029916 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
8 0.55921048 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
9 0.50571507 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing
10 0.49251261 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes
11 0.48569629 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
12 0.48393011 48 jmlr-2011-Kernel Analysis of Deep Networks
13 0.47115511 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models
15 0.46272248 66 jmlr-2011-Multiple Kernel Learning Algorithms
16 0.4561488 16 jmlr-2011-Clustering Algorithms for Chains
17 0.45491102 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
18 0.44836047 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
19 0.44567367 68 jmlr-2011-Natural Language Processing (Almost) from Scratch
20 0.44508383 7 jmlr-2011-Adaptive Exact Inference in Graphical Models