nips nips2002 nips2002-135 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Rong Jin, Zoubin Ghahramani
Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if class labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naive methods degrade rapidly as more ambiguity is introduced into the labels. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. [sent-6, score-1.784]
2 E.g., in an information retrieval setting where a set of words is associated with an image, or if class labels are organized hierarchically. [sent-9, score-0.474]
3 We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. [sent-10, score-1.045]
4 The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. [sent-11, score-1.58]
5 In supervised classification each training instance is associated with a single class label, while in unsupervised classification (i.e. clustering) no class labels are given. [sent-14, score-0.658]
6 We generalize the notion of supervision by thinking of learning problems where multiple candidate class labels are associated with each training instance, and it is assumed that only one of the candidates is the correct label. [sent-19, score-1.145]
7 For a supervised classification problem, the set of candidate class labels for every training instance contains only one label, while for an unsupervised learning problem, the set of candidate class labels for each training instance contains all the possible class labels. [sent-20, score-2.536]
8 For a learning problem with a mixture of labeled and unlabeled training data, the number of candidate class labels for every training instance can be either one or the total number of different classes. [sent-21, score-1.304]
9 Here we consider a learning problem in which each training instance is assigned a subset of all the class labels (later, we further generalize this to include arbitrary distributions over the class labels). [sent-24, score-1.212]
10 For example, there may be 10 different classes and each training instance is given two candidate class labels, only one of which is correct. [sent-25, score-1.543]
11 This learning problem is more difficult than supervised classification because for each training example we don't know which class among the given set of candidate classes is actually the target. [sent-26, score-0.727]
12 For easy reference, we call this class of learning problems 'multiple-label' problems. [sent-27, score-0.276]
13 For example, the problem of having several different class labels for a single training example can be caused by disagreement among several assessors. [sent-29, score-0.852]
14 Consider the scenario when two assessors are hired to label the training data and sometimes the two assessors give different class labels to the same training example. [sent-30, score-1.654]
15 In this case, we will have two class labels for a single training instance and don't know which, if any, is actually correct. [sent-31, score-0.931]
16 Another scenario that can cause multiple class labels to be assigned to a single training example is when there is a hierarchical structure over the class labels and some of the training data are given the labels of the internal nodes in the hierarchy (i.e. [sent-32, score-2.132]
17 superclasses) instead of the labels of the leaf nodes (subclasses). [sent-34, score-0.471]
18 For such hierarchical labels, we can treat the label of an internal node as the set of labels of the leaf nodes beneath it. [sent-36, score-0.926]
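As a concrete illustration of this reduction, the following Python sketch (our own, with a hypothetical class hierarchy; nothing here comes from the paper) expands an internal-node label into the set of leaf labels beneath it, which can then serve as the candidate set S_i for that training example.

# Hypothetical class hierarchy: each internal node maps to its children.
HIERARCHY = {
    "animal": ["mammal", "bird"],
    "mammal": ["dog", "cat"],
    "bird": ["sparrow", "eagle"],
}

def leaf_label_set(label):
    # A leaf label expands to the singleton set containing itself;
    # an internal-node label expands to all leaves beneath it.
    if label not in HIERARCHY:
        return {label}
    leaves = set()
    for child in HIERARCHY[label]:
        leaves |= leaf_label_set(child)
    return leaves

print(leaf_label_set("mammal"))   # {'dog', 'cat'} -> candidate set S_i
print(leaf_label_set("sparrow"))  # {'sparrow'}    -> ordinary single label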
19 2 Related Work First of all, we need to distinguish this 'multiple-label' problem from the problem where the classes are not mutually exclusive and therefore each training example is allowed several class labels [4]. [sent-37, score-0.816]
20 There, even though each training example can have multiple class labels, all the assigned class labels are actually correct, while in 'multiple-label' problems only one of the assigned labels is the target label for the training instance. [sent-38, score-2.911]
21 The essential difficulty of 'multiple-label' problems comes from the ambiguity in the class labels for training data, i.e. [sent-39, score-0.994]
22 among the several labels assigned to every training instance, only one is presumed to be correct, and unfortunately we are not told which one is the target label. [sent-41, score-0.892]
23 A similar difficulty appears in the problem of classification from labeled and unlabeled training data. [sent-42, score-0.307]
24 The difference between the 'multiple-label' problem and the labeled/unlabeled classification problem is that in the former only a subset of the class labels can be the candidate for the target label, while in the latter any class label can be the candidate. [sent-43, score-1.655]
25 As will be shown later, this constraint makes it possible for us to build a purely discriminative approach, while for learning problems using unlabeled data people usually take a generative approach and model properties of the input distribution. [sent-44, score-0.262]
26 In contrast to the 'multiple-label' problem, there is a set of problems named 'multiple-instance' problems [3], where instances are organized into 'bags' of several instances and a class label is attached to every bag of instances. [sent-45, score-0.95]
27 In the 'multiple-instance' problem, at least one of the instances within each bag corresponds to the label of the bag and all other instances within the bag are just noise. [sent-46, score-0.792]
28 The difference between 'multiple-label' problems and 'multiple-instance' problems is that for 'multiple-label' problems the ambiguity lies on the side of the class labels, while for 'multiple-instance' problems the ambiguity comes from the instances within the bag. [sent-47, score-1.147]
29 Our multiple-label framework differs in that we don't know which observer assigned which label to each case. [sent-49, score-0.683]
30 3 Formal Description of the 'Multiple-label' Problem As described in the introduction, for a 'multiple-label' problem, each training instance is associated with a set of candidate class labels, only one of which is the target label for that instance. [sent-53, score-1.13]
31 Let x_i be the input for the i-th training example, and S_i be the set of candidate class labels for the i-th training example. [sent-54, score-1.07]
32 Our goal is to find the model parameters θ in some class of models M, i.e. [sent-55, score-0.308]
33 a parameterized classifier with parameters θ which maps inputs to labels, so that the predicted class label y for the i-th training example has a high probability of being a member of the set S_i. [sent-57, score-0.947]
34 For the case of hard label assignments, this goal can be simply stated as in Eqn. (1). 4 Description of the Discriminative Model for the 'Multiple-label' Problem Before discussing the discriminative model for the 'multiple-label' problem, let's look at the standard discriminative model for supervised classification. [sent-61, score-0.404]
35 Let p̂(y | x_i) stand for some given conditional distribution of class labels for the training instance x_i and p(y | x_i, θ) be the model-based conditional distribution for the training instance x_i to have the class label y. [sent-62, score-1.899]
36 θ* = argmin_θ { Σ_i Σ_y p̂(y | x_i) log [ p̂(y | x_i) / p(y | x_i, θ) ] } (2) For supervised learning problems, the class label for every training instance is known. [sent-65, score-0.974]
37 Therefore, the given conditional distribution of the class label for every training instance is a delta function, p̂(y | x_i) = δ(y, y_i), where y_i is the given class label for the i-th instance. [sent-66, score-1.737]
38 For the 'multiple-label' problem, each training instance x_i is assigned a set of candidate class labels S_i, and therefore Eqn. [sent-69, score-1.16]
39 (2) has to be generalized (Eqns. (3) and (4)). In the 'multiple-label' problem the distribution of class labels p̂(y | x_i) is unknown except for the constraint that the target class label for every training example is a member of the corresponding set of candidate class labels. [sent-71, score-2.036]
40 A simple solution to the problem of the unknown label distribution is to assume that it is uniform over the candidate set, i.e. p̂(y | x_i) = 1/|S_i| for y ∈ S_i and 0 otherwise. [sent-72, score-0.493]
41 For the case of multiple assessors giving differing labels to the data, discussed in the introduction, this corresponds to concatenating the labeled data sets. [sent-77, score-0.68]
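For concreteness, here is a minimal sketch of the resulting training objective, assuming a softmax linear classifier and plain NumPy (the function and variable names are ours, not the paper's): with a uniform label distribution over each candidate set S_i, the loss is just the average negative log-probability of the candidate labels. Minimizing it with any gradient-based optimizer yields the uniform-assumption baseline referred to below as the 'Naive Model'.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def naive_multilabel_loss(W, X, candidate_sets):
    # Negative log-likelihood against a uniform label distribution over each
    # candidate set S_i: mean over i of (1/|S_i|) * sum_{y in S_i} -log p(y | x_i, W).
    probs = softmax(X @ W)
    per_example = [-np.mean(np.log(probs[i, list(S)] + 1e-12))
                   for i, S in enumerate(candidate_sets)]
    return float(np.mean(per_example))

# Toy usage: 3 classes, 2 examples, each with two candidate labels.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.zeros((2, 3))
print(naive_multilabel_loss(W, X, [{0, 2}, {1, 2}]))  # -log(1/3) ~= 1.0986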
42 A better solution than the 'Naive Model' is to disambiguate the label association, i.e. [sent-80, score-0.455]
43 to find which label among the given set is more appropriate than the others and use the appropriate label for training. [sent-82, score-0.94]
44 Starting with the assumption that every class label within the set is equally likely, we train a conditional model p(y | x, θ). [sent-84, score-0.839]
45 Then, with the help of this conditional model, we estimate the label distribution p̂(y | x_i) for each data point. [sent-85, score-0.561]
46 With these label distributions, we refit the conditional model p(y | x, θ), and so on. [sent-86, score-0.57]
47 More formally, this idea can be expressed as follows: First, we estimate the conditional model based on the assumed or estimated label distribution according to Eqn. [sent-87, score-0.608]
48 Then, in the E-step, new label distributions are estimated by maximizing Eqn. [sent-90, score-0.455]
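The alternation described in the last few sentences can be written as a short EM-style loop. The sketch below is our own minimal illustration, not the authors' code: a multinomial logistic model fit by a few gradient steps stands in for "any standard discriminative model"; the M-step fits the classifier to the current label distribution, and the E-step re-estimates that distribution from the classifier, restricted and renormalized over each candidate set.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def em_multilabel_train(X, candidate_sets, n_classes,
                        n_em_iters=20, n_grad_steps=50, lr=0.1):
    n, d = X.shape
    W = np.zeros((d, n_classes))
    # Initialize with the 'Naive Model' assumption: uniform over each candidate set.
    P_hat = np.zeros((n, n_classes))
    for i, S in enumerate(candidate_sets):
        P_hat[i, list(S)] = 1.0 / len(S)
    for _ in range(n_em_iters):
        # M-step: fit p(y | x, theta) to the current label distribution P_hat
        # by gradient descent on the cross-entropy between P_hat and the model.
        for _ in range(n_grad_steps):
            probs = softmax(X @ W)
            W -= lr * X.T @ (probs - P_hat) / n
        # E-step: re-estimate the label distribution from the current model,
        # keeping probability mass only on the candidate labels.
        probs = softmax(X @ W)
        P_hat = np.zeros((n, n_classes))
        for i, S in enumerate(candidate_sets):
            idx = list(S)
            P_hat[i, idx] = probs[i, idx] / probs[i, idx].sum()
    return W, P_hat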
49 In some 'multiple-label' problems, information on which class label within the set S_i is more likely to be the correct one can be obtained. [sent-101, score-0.746]
50 For example, if three assessors manually label the training data, in some cases two assessors will agree on the class label while the third disagrees. [sent-102, score-1.56]
51 We should give more weight to the labels agreed on by two assessors and less weight to the labels chosen by only one. [sent-103, score-1.03]
52 To accommodate prior information on the class labels, we generalize the previous framework so that the estimated label distribution p̂(y | x_i) has low relative entropy with respect to the prior on the class labels. [sent-104, score-1.186]
53 (θ*, p̂*) = argmin_{θ, p̂} Σ_i { Σ_{y∈S_i} p̂(y | x_i) log [ p̂(y | x_i) / π_{i,y} ] − Σ_{y∈S_i} p̂(y | x_i) log p(y | x_i, θ) } (7) where π_{i,y} is the prior probability for the i-th training example to have class label y. [sent-106, score-0.897]
54 The first term in the objective function (7) encourages the estimated label distribution to be consistent with the prior distribution of class labels and the second term encourages the prediction of the model to be consistent with the estimated label distribution. [sent-107, score-1.868]
55 When there is no prior information about which class label within the given set is preferable, we can set π_{i,y} = 1/|S_i| for y ∈ S_i and Eqn. [sent-109, score-0.786]
56 (3), which shows that when there is no prior knowledge on the class label distribution, we revert back to the 'EM Model'. [sent-112, score-0.737]
57 We optimize (7) using the EM algorithm, estimating the label distribution p̂(y | x_i) in the E-step and fitting any standard discriminative model for p(y | x, θ) in the M-step. [sent-114, score-0.669]
58 The label distribution that optimizes (7) in the E-step is: p̂(y | x_i) = π_{i,y} p(y | x_i, θ) / Σ_{y'∈S_i} π_{i,y'} p(y' | x_i, θ) for y ∈ S_i, and p̂(y | x_i) = 0 otherwise. [sent-115, score-0.524]
59 As we would expect, the label distribution p̂(y | x_i) trades off both the prior π_{i,y} and the model-based prediction p(y | x_i, θ). [sent-119, score-0.593]
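This closed-form E-step is straightforward to implement. The following sketch (ours, not the authors' code; the array layout is an assumption) multiplies the prior π_{i,y} by the current model prediction p(y | x_i, θ), zeroes out labels outside S_i, and renormalizes. With a uniform prior over each candidate set it reduces to the plain 'EM Model' E-step of the earlier sketch.

import numpy as np

def e_step_with_prior(model_probs, priors, candidate_sets):
    # model_probs, priors: arrays of shape (n_examples, n_classes).
    # Returns the estimated label distribution, proportional to
    # priors[i, y] * model_probs[i, y] for y in S_i and zero elsewhere.
    n, n_classes = model_probs.shape
    p_hat = np.zeros((n, n_classes))
    for i, S in enumerate(candidate_sets):
        idx = list(S)
        w = priors[i, idx] * model_probs[i, idx]
        p_hat[i, idx] = w / w.sum()
    return p_hat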
60 The basic idea is illustrated in Figure 1, where the random variable t_i represents the event that the true label y_i belongs to the label set S_i. [sent-122, score-0.95]
61 The difference between the 'EM Model' and the 'Naive Model' for 'multiple-label' problems is that the 'Naive Model' makes no effort to find the correct label within the given label set, while the 'EM Model' applies the EM algorithm to clarify the ambiguity in the class label. [sent-130, score-1.418]
62 Therefore, in this experiment, we need to verify empirically whether the effort spent disambiguating class labels is effective. [sent-131, score-0.72]
63 The difference between the 'EM Model' and the 'EM+Prior Model' is that the 'EM+Prior Model' takes advantage of prior knowledge on the distribution of class labels for instances. [sent-134, score-0.807]
64 However, since the prior knowledge on the class label can sometimes be misleading, we need to test the robustness of the 'EM+Prior Model' to such noisy prior knowledge. [sent-135, score-0.937]
65 1 Experimental Data Since there are no standard data sets with training instances assigned to multiple class labels, we create several data sets with multiple class labels from the UCI classification datasets. [sent-137, score-1.317]
66 To make our experiments more realistic, we tried two different methods of creating datasets with multiple class labels: • Random Distractors. [sent-138, score-0.331]
67 For every training instance, in addition to the originally assigned label, several randomly selected labels are added to the label candidate set. [sent-139, score-1.355]
68 • Naive Bayes Distractor. In the previous method, the added class labels are randomly selected and therefore independent of the original class label. [sent-142, score-0.943]
69 To simulate the more realistic situation in which the distractor labels are correlated with the true label, we use the output of a Naive Bayes (NB) classifier as an additional member of the class label candidate set. [sent-144, score-1.015]
70 Then, the trained NB classifier is asked to predict the class label of the training data. [sent-146, score-0.907]
71 When the output of the NB classifier differs from the original label, it is added as a candidate label. [sent-147, score-0.371]
72 Otherwise, a randomly selected label is added to the candidate set. [sent-148, score-0.677]
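Both corruption schemes are easy to reproduce. The sketch below is a hypothetical reconstruction of the protocol as described above (the use of scikit-learn's GaussianNB and all function names are our own choices, not the paper's): random distractors are drawn uniformly from the wrong labels, while the NB distractor adds the Naive Bayes prediction whenever it disagrees with the true label and a random wrong label otherwise.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def random_distractors(y, n_classes, n_extra=1):
    # Candidate set = true label plus n_extra labels drawn uniformly at random
    # from the remaining classes.
    sets = []
    for yi in y:
        wrong = [c for c in range(n_classes) if c != yi]
        extra = rng.choice(wrong, size=n_extra, replace=False)
        sets.append({int(yi), *(int(e) for e in extra)})
    return sets

def nb_distractors(X, y, n_classes):
    # Candidate set = true label plus the Naive Bayes prediction when it
    # disagrees with the true label, otherwise plus one random wrong label.
    nb_pred = GaussianNB().fit(X, y).predict(X)
    sets = []
    for yi, pi in zip(y, nb_pred):
        if pi != yi:
            extra = int(pi)
        else:
            extra = int(rng.choice([c for c in range(n_classes) if c != yi]))
        sets.append({int(yi), extra})
    return sets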
73 Five different UCI datasets were selected; information about these datasets is listed in Table 1, together with the cross-validation results for the ME model and the percentage of instances for which the NB output differs from the originally assigned label. [sent-160, score-0.834]
74 Table 2 lists the results for the 'Naive Model' and the 'EM Model' over a varied number of additional class labels created by the random distractor and the Naive Bayes distractor. [sent-165, score-0.699]
75 Since the 'wine' and 'iris' datasets have only 3 different classes, the maximum number of additional class labels for these two data sets is 1. [sent-166, score-0.726]
76 Therefore, there are no experimental results for the case of 2 or 3 distractor class labels for 'wine' and 'iris'. [sent-167, score-0.825]
77 Therefore, we can conclude that the 'EM Model' is able to reduce the noise caused by randomly added class labels. [sent-178, score-0.308]
78 Table 2: Average 10-fold cross-validation error rates for both the 'Naive Model' and the 'EM Model' on the ecoli, wine, pendigit, iris, and glass datasets, with 1, 2, or 3 extra labels added by the random distractor. [sent-179, score-1.098]
81 Second, we compare the performance of these two models in a more realistic setup for the 'multiple-label' problem, where the distractor identity is correlated with the true label (simulated by using the NB distractor). [sent-209, score-0.642]
82 Table 1 gives the percentage of times when the trained Naive Bayes classifier disagreed with the 'true' labels, which is also the percentage of the additional class labels that are created by the 'Naive Bayes distracter'. [sent-210, score-0.859]
83 The last row of Table 2 shows the performance of these two models when the additional class labels are introduced by the 'NB distracter'. [sent-211, score-0.669]
84 For the 'ecoli', 'wine', and 'iris' datasets, the average error rates of the 'EM Model' are very close to the cases when there are no distractor class labels. [sent-213, score-0.387]
85 Therefore, we can conclude that the 'EM Model' is able to reduce the noise caused not only by random label ambiguity but also by some systematic label ambiguity. [sent-214, score-1.08]
86 Average 10-fold cross-validation error rates of the 'EM+Prior Model' with perfect and noisy prior knowledge on the ecoli, wine, pendigit, iris, and glass datasets, with 1, 2, or 3 extra labels added by the random distractor. [sent-218, score-0.990]
89 In this subsection, we focus on whether the information from a prior distribution on class labels can improve the performance. [sent-250, score-0.807]
90 • 'Perfect Case'. Here the guidance of the prior distribution on class labels is always correct. [sent-252, score-0.868]
91 In our experiments, for every training instance x_i we set the prior probability π_{i,y_i} of the correct label y_i to be twice as large as π_{i,y} for every other label y ≠ y_i. • 'Noisy Case'. [sent-253, score-0.319]
92 In this case, we allow the guidance of the prior distribution on the class label to be correct only 70% of the time. [sent-254, score-0.945]
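Both prior settings can be generated as follows. This is our own sketch of the protocol as we read it (the exact corruption mechanism in the noisy case is an assumption): the favored candidate in S_i receives twice the prior weight of the other candidates, and that favored candidate is the correct label always in the 'Perfect Case' but only 70% of the time in the 'Noisy Case'.

import numpy as np

rng = np.random.default_rng(0)

def build_prior(candidate_sets, y_true, n_classes, correct_rate=1.0):
    # Returns priors of shape (n, n_classes): the favored candidate gets twice
    # the weight of the other candidates, and labels outside S_i get zero weight.
    # correct_rate=1.0 gives the 'Perfect Case'; correct_rate=0.7 mimics the 'Noisy Case'.
    n = len(candidate_sets)
    priors = np.zeros((n, n_classes))
    for i, S in enumerate(candidate_sets):
        idx = list(S)
        others = [c for c in idx if c != y_true[i]]
        if others and rng.random() >= correct_rate:
            favored = int(rng.choice(others))   # misleading guidance
        else:
            favored = int(y_true[i])            # correct guidance
        priors[i, idx] = 1.0
        priors[i, favored] = 2.0
        priors[i] /= priors[i].sum()
    return priors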
93 In the 'Perfect Case', the average error rates of the 'EM+Prior Model' are quite close to the case when there is no label ambiguity at all (see Table 1). [sent-257, score-0.591]
94 Therefore, we can conclude that our 'EM+Prior Model' is able to take advantage of the prior distribution on class labels even when some of the 'guidance' is not correct. [sent-259, score-0.758]
95 6 Conclusions and Future Work We introduced the 'multiple-label' problem and proposed a discriminative framework that is able to clarify the ambiguity between labels. [sent-260, score-0.332]
96 The framework was generalized to take advantage of prior knowledge on which class label is more likely to be the target label. [sent-262, score-0.861]
97 Our experiments clearly indicate that the proposed discriminative model is robust to the addition of noisy class labels and to errors in the prior distribution over class labels. [sent-263, score-1.265]
98 Each model is trained on a small labeled data set and predicts labels on a large unlabeled data set. [sent-270, score-0.571]
99 These predicted labels can be combined with the small set to form a larger multiply-labeled data set (since not all models will agree). [sent-271, score-0.438]
100 (4) It is possible to extend this framework to handle the presence of label noise and to combine it with the multiple-instance problem [3]. [sent-273, score-0.486]
wordName wordTfidf (topN-words)
[('label', 0.455), ('labels', 0.438), ('class', 0.231), ('distracter', 0.205), ('candidate', 0.179), ('distractor', 0.156), ('assessors', 0.154), ('em', 0.14), ('ambiguity', 0.136), ('discriminative', 0.129), ('nb', 0.119), ('training', 0.111), ('instance', 0.11), ('classifier', 0.11), ('naive', 0.108), ('prior', 0.1), ('assigned', 0.091), ('jj', 0.09), ('yes', 0.09), ('extra', 0.084), ('ecoli', 0.077), ('ixjlog', 0.077), ('labe', 0.077), ('classification', 0.077), ('argmin', 0.076), ('si', 0.072), ('instances', 0.071), ('xi', 0.068), ('conditional', 0.068), ('observer', 0.067), ('bag', 0.065), ('iris', 0.061), ('guidance', 0.061), ('correct', 0.06), ('wine', 0.057), ('datasets', 0.057), ('bayes', 0.056), ('supervised', 0.052), ('atton', 0.051), ('bynb', 0.051), ('disambiguating', 0.051), ('ive', 0.051), ('ixj', 0.051), ('jri', 0.051), ('pendigit', 0.051), ('pnor', 0.051), ('rong', 0.051), ('trammg', 0.051), ('yesi', 0.051), ('noisy', 0.051), ('model', 0.047), ('listed', 0.047), ('labeled', 0.045), ('table', 0.045), ('problems', 0.045), ('target', 0.044), ('ix', 0.044), ('added', 0.043), ('multiple', 0.043), ('unlabeled', 0.041), ('actually', 0.041), ('unlabelled', 0.041), ('cross', 0.041), ('ti', 0.04), ('percentage', 0.04), ('member', 0.04), ('differs', 0.039), ('five', 0.039), ('every', 0.038), ('distribution', 0.038), ('simplified', 0.038), ('mo', 0.038), ('candidates', 0.038), ('disagreement', 0.038), ('isi', 0.038), ('arg', 0.037), ('perfect', 0.036), ('classes', 0.036), ('name', 0.036), ('yi', 0.036), ('clarify', 0.036), ('zoubin', 0.036), ('della', 0.036), ('likelihood', 0.035), ('uci', 0.034), ('caused', 0.034), ('pietra', 0.034), ('bl', 0.034), ('encourages', 0.033), ('leaf', 0.033), ('difficulty', 0.033), ('va', 0.033), ('kl', 0.032), ('reference', 0.032), ('ii', 0.032), ('optimizes', 0.031), ('framework', 0.031), ('setup', 0.031), ('lists', 0.03), ('find', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 135 nips-2002-Learning with Multiple Labels
Author: Rong Jin, Zoubin Ghahramani
Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if class labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naive methods degrade rapidly as more ambiguity is introduced into the labels. 1
2 0.22266807 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
Author: Stuart Andrews, Ioannis Tsochantaridis, Thomas Hofmann
Abstract: This paper presents two new formulations of multiple-instance learning as a maximum margin problem. The proposed extensions of the Support Vector Machine (SVM) learning approach lead to mixed integer quadratic programs that can be solved heuristically. Our generalization of SVMs makes a state-of-the-art classification technique, including non-linear classification via kernels, available to an area that up to now has been largely dominated by special purpose methods. We present experimental results on a pharmaceutical data set and on applications in automated image indexing and document categorization. 1
3 0.18724039 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both, conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Applications areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. 1
4 0.13979478 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
Author: Ofer Dekel, Yoram Singer
Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.
5 0.11802655 90 nips-2002-Feature Selection in Mixture-Based Clustering
Author: Martin H. Law, Anil K. Jain, Mário Figueiredo
Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-informationbased feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.
6 0.10637372 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
7 0.10111687 120 nips-2002-Kernel Design Using Boosting
8 0.09783987 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
9 0.094014652 114 nips-2002-Information Regularization with Partially Labeled Data
10 0.090876177 115 nips-2002-Informed Projections
11 0.090727985 188 nips-2002-Stability-Based Model Selection
12 0.087671749 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
13 0.080519609 110 nips-2002-Incremental Gaussian Processes
14 0.077771023 119 nips-2002-Kernel Dependency Estimation
15 0.074928716 161 nips-2002-PAC-Bayes & Margins
16 0.072292544 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
17 0.071819961 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
18 0.068107024 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
19 0.067449041 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
20 0.061091036 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
topicId topicWeight
[(0, -0.207), (1, -0.108), (2, 0.051), (3, -0.006), (4, 0.029), (5, 0.049), (6, -0.072), (7, -0.044), (8, 0.009), (9, -0.083), (10, 0.03), (11, -0.085), (12, -0.061), (13, -0.031), (14, -0.208), (15, 0.001), (16, 0.039), (17, 0.075), (18, 0.17), (19, 0.13), (20, 0.096), (21, -0.032), (22, 0.044), (23, -0.022), (24, 0.209), (25, -0.238), (26, -0.038), (27, -0.155), (28, -0.156), (29, -0.069), (30, -0.094), (31, -0.091), (32, 0.055), (33, -0.173), (34, -0.081), (35, -0.041), (36, -0.015), (37, -0.087), (38, -0.186), (39, 0.017), (40, -0.08), (41, -0.022), (42, -0.121), (43, 0.157), (44, 0.042), (45, 0.091), (46, 0.009), (47, -0.095), (48, -0.063), (49, 0.069)]
simIndex simValue paperId paperTitle
same-paper 1 0.98568606 135 nips-2002-Learning with Multiple Labels
Author: Rong Jin, Zoubin Ghahramani
Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if class labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naive methods degrade rapidly as more ambiguity is introduced into the labels. 1
2 0.84063005 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
Author: Stuart Andrews, Ioannis Tsochantaridis, Thomas Hofmann
Abstract: This paper presents two new formulations of multiple-instance learning as a maximum margin problem. The proposed extensions of the Support Vector Machine (SVM) learning approach lead to mixed integer quadratic programs that can be solved heuristically. Our generalization of SVMs makes a state-of-the-art classification technique, including non-linear classification via kernels, available to an area that up to now has been largely dominated by special purpose methods. We present experimental results on a pharmaceutical data set and on applications in automated image indexing and document categorization. 1
3 0.59056962 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
Author: Yasemin Altun, Thomas Hofmann, Mark Johnson
Abstract: This paper investigates a boosting approach to discriminative learning of label sequences based on a sequence rank loss function. The proposed method combines many of the advantages of boosting schemes with the efficiency of dynamic programming methods and is attractive both, conceptually and computationally. In addition, we also discuss alternative approaches based on the Hamming loss for label sequences. The sequence boosting algorithm offers an interesting alternative to methods based on HMMs and the more recently proposed Conditional Random Fields. Applications areas for the presented technique range from natural language processing and information extraction to computational biology. We include experiments on named entity recognition and part-of-speech tagging which demonstrate the validity and competitiveness of our approach. 1
4 0.45586395 188 nips-2002-Stability-Based Model Selection
Author: Tilman Lange, Mikio L. Braun, Volker Roth, Joachim M. Buhmann
Abstract: Model selection is linked to model assessment, which is the problem of comparing different models, or model parameters, for a specific learning task. For supervised learning, the standard practical technique is crossvalidation, which is not applicable for semi-supervised and unsupervised settings. In this paper, a new model assessment scheme is introduced which is based on a notion of stability. The stability measure yields an upper bound to cross-validation in the supervised case, but extends to semi-supervised and unsupervised problems. In the experimental part, the performance of the stability measure is studied for model order selection in comparison to standard techniques in this area.
5 0.44573933 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
Author: Shantanu Chakrabartty, Gert Cauwenberghs
Abstract: Forward decoding kernel machines (FDKM) combine large-margin classifiers with hidden Markov models (HMM) for maximum a posteriori (MAP) adaptive sequence estimation. State transitions in the sequence are conditioned on observed data using a kernel-based probability model trained with a recursive scheme that deals effectively with noisy and partially labeled data. Training over very large data sets is accomplished using a sparse probabilistic support vector machine (SVM) model based on quadratic entropy, and an on-line stochastic steepest descent algorithm. For speaker-independent continuous phone recognition, FDKM trained over 177,080 samples of the TIMIT database achieves 80.6% recognition accuracy over the full test set, without use of a prior phonetic language model.
6 0.4243395 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
7 0.37854934 131 nips-2002-Learning to Classify Galaxy Shapes Using the EM Algorithm
8 0.36425501 90 nips-2002-Feature Selection in Mixture-Based Clustering
9 0.35488972 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
10 0.34741083 114 nips-2002-Information Regularization with Partially Labeled Data
11 0.34646809 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
12 0.31875718 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
13 0.29798514 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis
14 0.29369858 115 nips-2002-Informed Projections
15 0.29187599 110 nips-2002-Incremental Gaussian Processes
16 0.29161799 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
17 0.28963566 119 nips-2002-Kernel Dependency Estimation
18 0.27578825 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
19 0.27556214 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
20 0.27431247 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
topicId topicWeight
[(3, 0.012), (11, 0.016), (23, 0.016), (42, 0.076), (54, 0.129), (55, 0.036), (57, 0.277), (68, 0.019), (74, 0.177), (92, 0.04), (98, 0.104)]
simIndex simValue paperId paperTitle
1 0.91709632 94 nips-2002-Fractional Belief Propagation
Author: Wim Wiegerinck, Tom Heskes
Abstract: We consider loopy belief propagation for approximate inference in probabilistic graphical models. A limitation of the standard algorithm is that clique marginals are computed as if there were no loops in the graph. To overcome this limitation, we introduce fractional belief propagation. Fractional belief propagation is formulated in terms of a family of approximate free energies, which includes the Bethe free energy and the naive mean-field free as special cases. Using the linear response correction of the clique marginals, the scale parameters can be tuned. Simulation results illustrate the potential merits of the approach.
2 0.82939398 187 nips-2002-Spikernels: Embedding Spiking Neurons in Inner-Product Spaces
Author: Lavi Shpigelman, Yoram Singer, Rony Paz, Eilon Vaadia
Abstract: Inner-product operators, often referred to as kernels in statistical learning, define a mapping from some input space into a feature space. The focus of this paper is the construction of biologically-motivated kernels for cortical activities. The kernels we derive, termed Spikernels, map spike count sequences into an abstract vector space in which we can perform various prediction tasks. We discuss in detail the derivation of Spikernels and describe an efficient algorithm for computing their value on any two sequences of neural population spike counts. We demonstrate the merits of our modeling approach using the Spikernel and various standard kernels for the task of predicting hand movement velocities from cortical recordings. In all of our experiments all the kernels we tested outperform the standard scalar product used in regression with the Spikernel consistently achieving the best performance. 1
same-paper 3 0.82249761 135 nips-2002-Learning with Multiple Labels
Author: Rong Jin, Zoubin Ghahramani
Abstract: In this paper, we study a special kind of learning problem in which each training instance is given a set of (or distribution over) candidate class labels and only one of the candidate labels is the correct one. Such a problem can occur, e.g., in an information retrieval setting where a set of words is associated with an image, or if class labels are organized hierarchically. We propose a novel discriminative approach for handling the ambiguity of class labels in the training examples. The experiments with the proposed approach over five different UCI datasets show that our approach is able to find the correct label among the set of candidate labels and actually achieve performance close to the case when each training instance is given a single correct label. In contrast, naive methods degrade rapidly as more ambiguity is introduced into the labels. 1
4 0.7383191 189 nips-2002-Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy
Author: Tom Heskes
Abstract: We extend recent work on the connection between loopy belief propagation and the Bethe free energy. Constrained minimization of the Bethe free energy can be turned into an unconstrained saddle-point problem. Both converging double-loop algorithms and standard loopy belief propagation can be interpreted as attempts to solve this saddle-point problem. Stability analysis then leads us to conclude that stable fixed points of loopy belief propagation must be (local) minima of the Bethe free energy. Perhaps surprisingly, the converse need not be the case: minima can be unstable fixed points. We illustrate this with an example and discuss implications. 1
5 0.7064817 152 nips-2002-Nash Propagation for Loopy Graphical Games
Author: Luis E. Ortiz, Michael Kearns
Abstract: We introduce NashProp, an iterative and local message-passing algorithm for computing Nash equilibria in multi-player games represented by arbitrary undirected graphs. We provide a formal analysis and experimental evidence demonstrating that NashProp performs well on large graphical games with many loops, often converging in just a dozen iterations on graphs with hundreds of nodes. NashProp generalizes the tree algorithm of (Kearns et al. 2001), and can be viewed as similar in spirit to belief propagation in probabilistic inference, and thus complements the recent work of (Vickrey and Koller 2002), who explored a junction tree approach. Thus, as for probabilistic inference, we have at least two promising general-purpose approaches to equilibria computation in graphs.
7 0.67644686 82 nips-2002-Exponential Family PCA for Belief Compression in POMDPs
8 0.66941899 124 nips-2002-Learning Graphical Models with Mercer Kernels
9 0.66698301 163 nips-2002-Prediction and Semantic Association
10 0.66644436 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
11 0.664729 173 nips-2002-Recovering Intrinsic Images from a Single Image
12 0.66003174 4 nips-2002-A Differential Semantics for Jointree Algorithms
13 0.65909666 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
14 0.65689707 192 nips-2002-Support Vector Machines for Multiple-Instance Learning
15 0.65557188 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
16 0.65527076 169 nips-2002-Real-Time Particle Filters
17 0.65291131 39 nips-2002-Bayesian Image Super-Resolution
18 0.65189272 74 nips-2002-Dynamic Structure Super-Resolution
19 0.65100789 89 nips-2002-Feature Selection by Maximum Marginal Diversity
20 0.64908636 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks