nips nips2002 nips2002-59 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Sariel Har-Peled, Dan Roth, Dav Zimak
Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. [sent-2, score-1.244]
2 We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification. [sent-4, score-0.831]
3 While binary classification is well understood, relatively little is known about multiclass classification. [sent-7, score-0.525]
4 Indeed, the most common approach to multiclass classification, the one-versus-all (OvA) approach, makes direct use of standard binary classifiers to encode and train the output labels. [sent-8, score-0.563]
5 The OvA scheme assumes that for each class there exists a single (simple) separator between that class and all the other classes. [sent-9, score-0.106]
6 Another common approach, all-versus-all (AvA) [HT98], is a more expressive alternative which assumes the existence of a separator between any two classes. [sent-10, score-0.093]
7 OvA classifiers are usually implemented using a winner-take-all (WTA) strategy that associates a real-valued function with each class in order to determine class membership. [sent-11, score-0.068]
8 While it is known that WTA is an expressive classifier [Maa00], it has limited expressivity when trained using the OvA assumption since OvA assumes that each class can be easily separated from the rest. [sent-15, score-0.142]
9 This work is motivated by several successful practical approaches, such as multiclass support vector machines (SVMs) and the sparse network of winnows (SNoW) architecture that rely on the WTA strategy over linear functions. [sent-17, score-0.556]
10 An alternative interpretation of WTA is that every example provides an ordering of the classes (sorted in descending order by the assigned values), where the “winner” is the first class in this ordering. [sent-19, score-0.118]
11 It is thus natural to specify the ordering of the classes for an example directly, instead of implicitly through WTA. [sent-20, score-0.079]
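To make this ordering interpretation of WTA concrete, here is a minimal sketch, not taken from the paper, of a WTA classifier over linear functions viewed as sorting the classes by their assigned scores; the function names class_ordering and wta_predict are hypothetical.

```python
# Illustrative sketch: a winner-take-all classifier over linear functions,
# viewed as inducing an ordering of the classes by descending score.
import numpy as np

def class_ordering(W, x):
    """Return class indices sorted by descending score w_i . x."""
    scores = W @ x                    # one real-valued score per class
    return np.argsort(-scores)        # the 'winner' is the first entry

def wta_predict(W, x):
    return int(class_ordering(W, x)[0])

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # 3 classes, 2 features
x = np.array([0.2, 0.9])
print(class_ordering(W, x), wta_predict(W, x))        # [1 0 2] 1
```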
12 In Section 2, we introduce constraint classification, where each example is labeled with a set of constraints relating multiple classes. [sent-21, score-0.253]
13 Each such constraint specifies the relative order of two classes for this example. [sent-22, score-0.252]
14 Learning is made possible by a simple transformation mapping each example into a set of examples (one for each constraint) and the application of any binary classifier on the mapped examples. [sent-24, score-0.165]
15 In Section 3, we present a new algorithm for constraint classification that takes on the properties of the binary classification algorithm used. [sent-25, score-0.352]
16 Therefore, using the Perceptron algorithm it is able to learn a consistent classifier if one exists; using the Winnow algorithm it can learn attribute-efficiently; and using the SVM, it provides a simple implementation of multiclass SVM. [sent-26, score-0.643]
17 The algorithm can be implemented with a subtle change to the standard (via OvA) approach to training a network of linear threshold gates. [sent-27, score-0.13]
18 In Section 4, we discuss both VC-dimension and margin-based generalization bounds presented in a companion paper [HPRZ02]. [sent-28, score-0.117]
19 Our generalization bounds apply to WTA classifiers over linear functions, for which VC-style bounds were not known. [sent-29, score-0.226]
20 In addition to multiclass classification, constraint classification generalizes multilabel classification, ranking on labels, and of course, binary classification. [sent-30, score-0.919]
21 For example, in Section , we show that the commonly used OvA assumption can cause learning to fail, even when a consistent classifier exists. [sent-32, score-0.112]
22 Section 5 provides empirical evidence that constraint classification outperforms the OvA approach. [sent-33, score-0.281]
23 1 (Learning) Given examples from , a hypothesis class , and an error function , a learning algorithm attempts to output a function , where , that minimizes the expected error on a randomly drawn example. [sent-37, score-0.176]
24 Similarly, denotes the set of all partial orders consisting of all permutations of over . [sent-41, score-0.104]
25 A partial order, , defines a binary relation, and can be represented by the set of pairs for which the relation holds. [sent-42, score-0.107]
26 Given two partial orders order produced by the transitive closure of with respect to , is consistent with (denoted ) if for every , holds whenever . [sent-44, score-0.136]
27 If is a full order, then it can be represented by a list of integers where if precedes in the list. [sent-45, score-0.097]
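As a small illustration of the consistency notion in the definitions above, the sketch below treats a partial order as a set of pairwise constraints (i, j), meaning class i precedes class j, and checks whether a full order (a permutation given as a list of integers) respects every constraint; the function name is_consistent is invented for this sketch, not from the paper.

```python
# Hypothetical helper: consistency of a full order with a partial order
# given as a set of pairwise constraints (i, j), read as "i precedes j".
def is_consistent(full_order, constraints):
    position = {c: rank for rank, c in enumerate(full_order)}
    return all(position[i] < position[j] for (i, j) in constraints)

print(is_consistent([2, 0, 1], {(2, 0), (2, 1)}))  # True: 2 precedes both 0 and 1
print(is_consistent([0, 2, 1], {(2, 0)}))          # False: 0 precedes 2
```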
28 3 (Constraint Classification) Constraint classification is a learning problem where each example is labeled according to a partial order . [sent-48, score-0.1]
29 A constraint classifier, , is consistent with example if is consistent with ( ). [sent-49, score-0.367]
30 4 (Error Indicator Function) For any , and hypothesis , the indicator function indicates an error on example , if , and otherwise. [sent-59, score-0.072]
31 For example, if and example , , and , then is correct since precedes and precedes in the full order whereas is incorrect since precedes in . [sent-60, score-0.319]
32 Constraint classification can model many well-studied learning problems including multiclass classification, ranking and multilabel classification. [sent-74, score-0.692]
33 Table 1 shows a few interesting classification problems expressible as constraint classification. [sent-75, score-0.258]
34 7 (Problem mappings) All of the learning problems in Table 1 can be expressed as constraint classification problems. [sent-77, score-0.286]
35 If we find a constraint classifier that correctly labels according to the given constraints where , , and , then . [sent-80, score-0.251]
36 If instead we are given a ranking example , it can be transformed into . [sent-81, score-0.127]
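A hedged sketch of these problem mappings: a multiclass label y over k classes becomes the constraint set {(y, j) : j != y}, and a ranking (a full order over the classes) becomes the pairwise constraints between consecutive classes in that order, the remaining pairs following by transitivity. The helper names are illustrative.

```python
# Illustrative mappings into constraint sets (names are not from the paper).
def multiclass_to_constraints(y, k):
    """A multiclass label y over k classes: y must precede every other class."""
    return {(y, j) for j in range(k) if j != y}

def ranking_to_constraints(order):
    """A full ranking of the classes: each class precedes its successor."""
    return {(order[t], order[t + 1]) for t in range(len(order) - 1)}

print(multiclass_to_constraints(1, 4))    # {(1, 0), (1, 2), (1, 3)}
print(ranking_to_constraints([2, 0, 1]))  # {(2, 0), (0, 1)}
```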
37 3 Learning In this section, -class constraint classification is transformed into binary classification in higher dimension. [sent-84, score-0.313]
38 Each example becomes a set of examples in a higher-dimensional space, with each constraint contributing a single 'positive' and a single 'negative' example. [sent-85, score-0.3]
39 Then, a separating hyperplane for the expanded example set (in ) can be viewed as a linear sorting function over linear functions, each in dimensional space. [sent-86, score-0.437]
40 1 Kesler's Construction Kesler's construction for multiclass classification was first introduced by Nilsson in 1965 [Nil65, 75–77] and can also be found more recently [DH73]. [sent-88, score-0.496]
41 This subsection extends the Kesler construction for constraint classification. [sent-89, score-0.259]
42 2 (Expansion) Let denote the embedding obtained by writing the coordinates of in the -th chunk of a vector in . [sent-93, score-0.095]
43 Finally, , is the embedding of in the -th chunk and in the -th chunk of a vector in . [sent-96, score-0.19]
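The following is a minimal sketch of the expansion just described, under the assumption that an example x in d dimensions with constraint (i, j) is embedded into a kd-dimensional vector with x in the i-th chunk and -x in the j-th chunk, each constraint contributing one 'positive' and one 'negative' expanded point; the function names are hypothetical.

```python
# Assumed form of the Kesler-style expansion: x with constraint (i, j) is
# written into the i-th chunk and -x into the j-th chunk of a k*d vector.
import numpy as np

def expand(x, i, j, k):
    d = len(x)
    v = np.zeros(k * d)
    v[i * d:(i + 1) * d] = x
    v[j * d:(j + 1) * d] = -x
    return v

def expand_example(x, constraints, k):
    """Each constraint contributes one 'positive' and one 'negative' point."""
    points, labels = [], []
    for (i, j) in constraints:
        v = expand(np.asarray(x, dtype=float), i, j, k)
        points.extend([v, -v])
        labels.extend([+1, -1])
    return np.array(points), np.array(labels)
```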
44 2 Algorithm Figure 1 (a) shows a meta-learning algorithm for constraint classification that finds a linear sorting function by using any algorithm for learning a binary classifier. [sent-103, score-0.621]
45 Given a set of examples , the algorithm simply finds a separating hyperplane for . [sent-104, score-0.175]
46 Suppose correctly classifies , , and the constraint on (dictating that then ) is consistent with . [sent-105, score-0.308]
47 Therefore, if correctly classifies all , then is a consistent linear sorting function. [sent-106, score-0.324]
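Putting the pieces together, here is an assumption-laden sketch of the meta-algorithm: expand every (example, constraint) pair via the Kesler construction, learn a single separating hyperplane with any binary learner (scikit-learn's Perceptron is used below purely as an example; any linear separator would do), and read the k chunks of the learned weight vector as the per-class weights of a linear sorting function. All names are illustrative.

```python
# Sketch of the meta-algorithm; the binary learner is interchangeable.
import numpy as np
from sklearn.linear_model import Perceptron

def _expand(x, i, j, k):
    d = len(x)
    v = np.zeros(k * d)
    v[i * d:(i + 1) * d] = x
    v[j * d:(j + 1) * d] = -x
    return v

def train_constraint_classifier(data, k, d):
    """data: iterable of (x, constraints), x in R^d, constraints a set of (i, j)."""
    X, y = [], []
    for x, constraints in data:
        for (i, j) in constraints:
            v = _expand(np.asarray(x, dtype=float), i, j, k)
            X.extend([v, -v])
            y.extend([+1, -1])
    clf = Perceptron(fit_intercept=False).fit(np.array(X), np.array(y))
    return clf.coef_.reshape(k, d)     # row t = weight vector of class t

def predict_order(W, x):
    """Linear sorting function: classes ordered by descending w_t . x."""
    return np.argsort(-(W @ x))        # predicted multiclass label = first entry
```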
48 First, the hypothesis learned above is more expressive than when the OvA assumption is used. [sent-108, score-0.126]
49 Finally, the multiclass support vector machine can be implemented by learning a hyperplane that separates the expanded example set with maximal margin. [sent-111, score-0.538]
50 is any binary learning algorithm returning a separating hyperplane. [sent-115, score-0.171]
51 (b) Online meta-algorithm for constraint classification with linear sorting functions (see Definition 2. [sent-116, score-0.466]
52 The particular online algorithm used determines how is initialized and the promotion and demotion strategies. [sent-118, score-0.112]
53 Learning proceeds by learning independent binary classifiers, one corresponding to each class, where example is considered positive for classifier and negative for all others. [sent-120, score-0.119]
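For comparison, a minimal sketch of this OvA training scheme: k independent binary Perceptrons, each treating its own class as positive and all others as negative, with winner-take-all prediction over their scores. The choice of scikit-learn's Perceptron and the function names are assumptions for illustration.

```python
# Minimal OvA sketch: k independent binary problems, WTA prediction.
import numpy as np
from sklearn.linear_model import Perceptron

def train_ova(X, y, k):
    """X: (n, d) array; y: length-n array of class labels in {0, ..., k-1}."""
    classifiers = []
    for c in range(k):
        target = np.where(np.asarray(y) == c, 1, -1)  # positive only for class c
        classifiers.append(Perceptron(fit_intercept=False).fit(X, target))
    return classifiers

def predict_ova(classifiers, x):
    scores = [clf.decision_function(np.asarray(x).reshape(1, -1))[0]
              for clf in classifiers]
    return int(np.argmax(scores))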
54 It is easy to construct an example where the OvA assumption causes learning to fail even when there exists a consistent linear sorting function. [sent-121, score-0.381]
55 (see Figure 2) Notice that, since the existence of a consistent linear sorting function (w.r.t. the original example set) [sent-122, score-0.298]
56 implies the existence of a separating hyperplane (w.r.t. the expanded example set), any learning algorithm guaranteed to separate two separable sets of points [sent-125, score-0.156]
57 (e.g. the Perceptron algorithm) is guaranteed to find a consistent linear sorting function. [sent-130, score-0.298]
58 In Section 5, we use the perceptron algorithm to find a consistent classifier for an extension of the example in Figure 2, for which OvA fails. [sent-131, score-0.25]
59 4 Comparison to Networks of Linear Threshold Gates (Perceptron) It is possible to implement the algorithm in Section using a network of linear classifiers such as multi-output Perceptron [AB99], SNoW [CCRR99, Rot98], and multiclass SVM [CS00, WW99]. [sent-133, score-0.564]
60 Such a network has as input and outputs, each represented by a weight vector, where the -th output computes (see Figure 1(b)). [sent-134, score-0.077]
61 Typically, a label is mapped, via a fixed transformation, into a -dimensional output vector, and each output is trained separately, as in the OvA case. [sent-135, score-0.076]
62 Alternatively, if the online perceptron algorithm is plugged into the meta-algorithm in Section , then updates are performed according to a dynamic transformation. [sent-136, score-0.21]
63 Specifically, given , for every constraint , if , is ‘promoted’ and is ‘demoted’. [sent-137, score-0.225]
64 Using a network in this way results in an ultraconservative online algorithm for multiclass classification [CS01]. [sent-138, score-0.635]
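A sketch of this online variant, assuming simple additive promotion and demotion: one weight vector per class is maintained and, for every violated constraint (i, j) on example x, w_i is promoted by adding x and w_j is demoted by subtracting x. The exact initialization and update rules of the ultraconservative algorithms in [CS01] differ in detail, so treat this as an illustration rather than their method.

```python
# Simplified additive promotion/demotion update for constraint examples.
import numpy as np

def online_constraint_perceptron(stream, k, d, epochs=1):
    W = np.zeros((k, d))                    # one weight vector per class output
    for _ in range(epochs):
        for x, constraints in stream:
            x = np.asarray(x, dtype=float)
            scores = W @ x                  # scores computed once per example
            for (i, j) in constraints:
                if scores[i] <= scores[j]:  # constraint "i before j" violated
                    W[i] += x               # promote the i-th output
                    W[j] -= x               # demote the j-th output
    return W
```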
65 This subtle change enables the commonly used network of linear threshold gates to learn every hypothesis it is capable of representing. [sent-139, score-0.171]
66 Figure 2: A 3-class classification example showing that one-versus-all (OvA) does not converge to a consistent hypothesis. [sent-140, score-0.112]
67 The OvA assumption will attempt to separate the circles from squares and triangles with a single separating hyperplane, as well as the other 2 combinations. [sent-143, score-0.11]
68 Table 2: Summary of problems from the UCI repository. [sent-147, score-0.197]
Dataset     Features  Classes  Training Examples  Testing Examples
glass            9        6          214                 –
vowel           10       11          528                462
soybean         35       19          307                376
audiology       69       24          200                 26
ISOLET         617       26         6238               1559
letter          16       26        16000               4000
Synthetic*     100        3        50000              50000
69 The synthetic data is sampled from a random linear sorting function (see Section 5). [sent-148, score-0.297]
70 4 Generalization Bounds A PAC-style analysis of multiclass functions that uses an extended notion of VC-dimension for the multiclass case [BCHL95] provides poor bounds on generalization for WTA, and the current best bounds rely on a generalized notion of margin [ASS00]. [sent-149, score-1.148]
71 In this section, we prove tighter bounds using the new framework. [sent-150, score-0.078]
72 We seek generalization bounds for learning with , the class of linear sorting functions (Definition 2. [sent-151, score-0.42]
73 Although both VC-dimension-based (based on the growth function) and margin-based bounds for the class of hyperplanes in are known [Vap98, AB99], they cannot be applied directly since produces points that are random, but not independently drawn. [sent-153, score-0.137]
74 It turns out that bounds can be derived indirectly by using known bounds for constraint classification. [sent-154, score-0.381]
75 Due to space considerations, see [HPRZ02], where natural extensions to the growth function and margin are used to develop generalization bounds. [sent-155, score-0.088]
76 5 Experiments As in previous multiclass classification work [DB95, ASS00], we tested our algorithm on a suite of problems from the Irvine Repository of machine learning [BM98] (see Table 2). [sent-156, score-0.555]
77 All of the results for the constraint classification algorithm are competitive with known results. [sent-160, score-0.257]
78 The synthetic data would converge to zero error using constraint classification but would not converge using the OvA approach. [sent-161, score-0.335]
79 A comparison is made between the OvA approach (Section ) and the constraint classification approach. [sent-163, score-0.225]
80 Both were implemented on the same multi-output Perceptron network with weights (with one threshold per class). [sent-164, score-0.078]
81 For each multiclass example , a constraint classification example was created, where . [sent-168, score-0.743]
82 The error (Definition 2.4) of corresponds to the traditional error for multiclass classification. [sent-170, score-0.462]
83 Figure 3 shows that constraint classification outperforms the multi-output Perceptron when using the OvA assumption. [sent-171, score-0.225]
84 6 Discussion We think constraint classification provides two significant contributions to multiclass classification. [sent-172, score-0.742]
85 Firstly, it provides a conceptual generalization that encompasses multiclass classification, multilabel classification, and label ranking problems in addition to problems with more complex relationships between labels. [sent-173, score-0.765]
86 Secondly, it reminds the community that the Kesler construction can be used to extend any learning algorithm for binary classification to the multiclass (or constraint) setting. [sent-174, score-0.619]
87 Section 5 showed that the constraint approach to learning is advantageous over the oneversus-all approach on both real-world and synthetic data sets. [sent-175, score-0.309]
88 However, preliminary experiments using various natural language data sets, such as part-of-speech tagging, do not yield any significant difference between the two approaches. [sent-176, score-0.064]
89 We used a common transformation [EZR01] to convert the raw data to approximately three million examples in a roughly one-hundred-thousand-dimensional boolean feature space. [sent-177, score-0.074]
90 Because the constraint approach is more expressive than the one-versus-all approach, and because both approaches use the same hypothesis space ( linear functions), we expected the constraint approach to achieve higher accuracy. [sent-179, score-0.58]
91 Again, we think this is not the case, since it seems that “most” random winner-take-all problems (as with the synthetic data) would cause the one-versus-all assumption to fail. [sent-183, score-0.142]
92 Rather, we conjecture that for some reason, natural language problems (along with the transformation) are suited to the one-versus-all approach and do not require a more complex hypothesis. [sent-184, score-0.097]
93 7 Conclusions The view of multiclass classification presented here simplifies the implementation, analysis, and understanding of many preexisting approaches. [sent-186, score-0.462]
94 Multiclass support vector machines, ultraconservative online algorithms, and traditional one-versus-all approaches can be cast in this framework. [sent-187, score-0.102]
95 It would be interesting to see if it could be combined with the error-correcting output coding method in [DB95], which provides another way to extend the OvA approach. [sent-188, score-0.067]
96 Furthermore, this view allows for a very natural extension of multiclass classification to constraint classification – capturing within it complex learning tasks such as multilabel classification and ranking. [sent-189, score-0.834]
97 Because constraint classification is a very intuitive approach and its implementation can be carried out by any discriminant technique, and not only by optimization techniques, we think it will have useful real-world applications. [sent-190, score-0.251]
98 Reducing multiclass to binary: A unifying approach for margin classifiers. [sent-208, score-0.462]
99 On the learnability and design of output codes for multiclass problems. [sent-245, score-0.529]
100 Learning to resolve natural language ambiguities: A unified approach. [sent-319, score-0.064]
wordName wordTfidf (topN-words)
[('multiclass', 0.462), ('ova', 0.418), ('classi', 0.402), ('constraint', 0.225), ('sorting', 0.21), ('cation', 0.167), ('wta', 0.139), ('perceptron', 0.133), ('abba', 0.114), ('hbbbapc', 0.109), ('kesler', 0.109), ('precedes', 0.097), ('multilabel', 0.095), ('chunk', 0.095), ('er', 0.095), ('jg', 0.087), ('bounds', 0.078), ('nition', 0.078), ('bgc', 0.076), ('ranking', 0.074), ('abab', 0.066), ('babgc', 0.066), ('binary', 0.063), ('snow', 0.057), ('kh', 0.057), ('ultraconservative', 0.057), ('consistent', 0.057), ('synthetic', 0.056), ('expressive', 0.055), ('pb', 0.048), ('separating', 0.048), ('hyperplane', 0.048), ('examples', 0.047), ('vv', 0.046), ('online', 0.045), ('partial', 0.044), ('hypothesis', 0.044), ('abbgc', 0.044), ('audiology', 0.044), ('gec', 0.044), ('ged', 0.044), ('lass', 0.044), ('lbd', 0.044), ('nput', 0.044), ('utput', 0.044), ('expanded', 0.041), ('language', 0.04), ('network', 0.039), ('generalization', 0.039), ('ers', 0.039), ('separator', 0.038), ('isolet', 0.038), ('soybean', 0.038), ('winnow', 0.038), ('wsu', 0.038), ('output', 0.038), ('orders', 0.035), ('triangles', 0.035), ('earn', 0.035), ('promotion', 0.035), ('tagging', 0.035), ('class', 0.034), ('construction', 0.034), ('problems', 0.033), ('vowel', 0.032), ('winner', 0.032), ('algorithm', 0.032), ('linear', 0.031), ('provides', 0.029), ('learnability', 0.029), ('gates', 0.029), ('learning', 0.028), ('example', 0.028), ('subtle', 0.028), ('crammer', 0.028), ('roth', 0.028), ('converge', 0.027), ('classes', 0.027), ('expansion', 0.027), ('transformation', 0.027), ('assumption', 0.027), ('de', 0.027), ('empirical', 0.027), ('pages', 0.027), ('correctly', 0.026), ('think', 0.026), ('separated', 0.026), ('permutation', 0.026), ('letter', 0.026), ('hc', 0.026), ('transformed', 0.025), ('hypotheses', 0.025), ('growth', 0.025), ('permutations', 0.025), ('attribute', 0.025), ('section', 0.024), ('machines', 0.024), ('natural', 0.024), ('glass', 0.024), ('verify', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
Author: Sariel Har-Peled, Dan Roth, Dav Zimak
Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.
2 0.2537753 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola
Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with for instance many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using twoclass classifiers, similar in spirit to Boosting.
3 0.23606341 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
Author: Ofer Dekel, Yoram Singer
Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.
Author: Sepp Hochreiter, Klaus Obermayer
Abstract: We investigate the problem of learning a classification task for datasets which are described by matrices. Rows and columns of these matrices correspond to objects, where row and column objects may belong to different sets, and the entries in the matrix express the relationships between them. We interpret the matrix elements as being produced by an unknown kernel which operates on object pairs and we show that - under mild assumptions - these kernels correspond to dot products in some (unknown) feature space. Minimizing a bound for the generalization error of a linear classifier which has been obtained using covering numbers we derive an objective function for model selection according to the principle of structural risk minimization. The new objective function has the advantage that it allows the analysis of matrices which are not positive definite, and not even symmetric or square. We then consider the case that row objects are interpreted as features. We suggest an additional constraint, which imposes sparseness on the row objects and show, that the method can then be used for feature selection. Finally, we apply this method to data obtained from DNA microarrays, where “column” objects correspond to samples, “row” objects correspond to genes and matrix elements correspond to expression levels. Benchmarks are conducted using standard one-gene classification and support vector machines and K-nearest neighbors after standard feature selection. Our new method extracts a sparse set of genes and provides superior classification results. 1
5 0.20253606 67 nips-2002-Discriminative Binaural Sound Localization
Author: Ehud Ben-reuven, Yoram Singer
Abstract: Time difference of arrival (TDOA) is commonly used to estimate the azimuth of a source in a microphone array. The most common methods to estimate TDOA are based on finding extrema in generalized crosscorrelation waveforms. In this paper we apply microphone array techniques to a manikin head. By considering the entire cross-correlation waveform we achieve azimuth prediction accuracy that exceeds extrema locating methods. We do so by quantizing the azimuthal angle and treating the prediction problem as a multiclass categorization task. We demonstrate the merits of our approach by evaluating the various approaches on Sony’s AIBO robot.
6 0.16744295 45 nips-2002-Boosted Dyadic Kernel Discriminants
7 0.16336094 92 nips-2002-FloatBoost Learning for Classification
8 0.15393497 58 nips-2002-Conditional Models on the Ranking Poset
9 0.15370266 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
10 0.13651419 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
11 0.12761904 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
12 0.1179772 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study
13 0.11760385 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
14 0.11360613 55 nips-2002-Combining Features for BCI
15 0.11050989 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach
16 0.10755922 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
17 0.10678634 120 nips-2002-Kernel Design Using Boosting
18 0.10571525 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
19 0.10384692 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis
20 0.092116341 140 nips-2002-Margin Analysis of the LVQ Algorithm
topicId topicWeight
[(0, -0.238), (1, -0.158), (2, 0.123), (3, -0.057), (4, 0.361), (5, -0.022), (6, -0.055), (7, -0.196), (8, 0.036), (9, -0.025), (10, -0.113), (11, 0.101), (12, -0.03), (13, -0.068), (14, 0.153), (15, -0.002), (16, 0.046), (17, 0.126), (18, -0.082), (19, -0.09), (20, 0.018), (21, -0.053), (22, -0.11), (23, 0.119), (24, 0.053), (25, 0.022), (26, -0.063), (27, 0.111), (28, -0.055), (29, 0.054), (30, -0.011), (31, -0.022), (32, 0.107), (33, -0.052), (34, -0.006), (35, -0.011), (36, -0.03), (37, -0.005), (38, 0.021), (39, 0.019), (40, 0.039), (41, 0.022), (42, -0.035), (43, 0.013), (44, 0.032), (45, -0.008), (46, -0.056), (47, -0.068), (48, 0.026), (49, -0.044)]
simIndex simValue paperId paperTitle
same-paper 1 0.96345478 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
Author: Sariel Har-Peled, Dan Roth, Dav Zimak
Abstract: The constraint classification framework captures many flavors of multiclass classification including winner-take-all multiclass classification, multilabel classification and ranking. We present a meta-algorithm for learning in this framework that learns via a single linear classifier in high dimension. We discuss distribution independent as well as margin-based generalization bounds and present empirical and theoretical evidence showing that constraint classification benefits over existing methods of multiclass classification.
2 0.8094784 149 nips-2002-Multiclass Learning by Probabilistic Embeddings
Author: Ofer Dekel, Yoram Singer
Abstract: We describe a new algorithmic framework for learning multiclass categorization problems. In this framework a multiclass predictor is composed of a pair of embeddings that map both instances and labels into a common space. In this space each instance is assigned the label it is nearest to. We outline and analyze an algorithm, termed Bunching, for learning the pair of embeddings from labeled data. A key construction in the analysis of the algorithm is the notion of probabilistic output codes, a generalization of error correcting output codes (ECOC). Furthermore, the method of multiclass categorization using ECOC is shown to be an instance of Bunching. We demonstrate the advantage of Bunching over ECOC by comparing their performance on numerous categorization problems.
3 0.75596422 58 nips-2002-Conditional Models on the Ranking Poset
Author: Guy Lebanon, John D. Lafferty
Abstract: A distance-based conditional model on the ranking poset is presented for use in classification and ranking. The model is an extension of the Mallows model, and generalizes the classifier combination methods used by several ensemble learning algorithms, including error correcting output codes, discrete AdaBoost, logistic regression and cranking. The algebraic structure of the ranking poset leads to a simple Bayesian interpretation of the conditional model and its special cases. In addition to a unifying view, the framework suggests a probabilistic interpretation for error correcting output codes and an extension beyond the binary coding scheme.
4 0.7317096 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
Author: Gunnar Rätsch, Sebastian Mika, Alex J. Smola
Abstract: In this paper we consider formulations of multi-class problems based on a generalized notion of a margin and using output coding. This includes, but is not restricted to, standard multi-class SVM formulations. Differently from many previous approaches we learn the code as well as the embedding function. We illustrate how this can lead to a formulation that allows for solving a wider range of problems with for instance many classes or even “missing classes”. To keep our optimization problems tractable we propose an algorithm capable of solving them using twoclass classifiers, similar in spirit to Boosting.
5 0.70830566 45 nips-2002-Boosted Dyadic Kernel Discriminants
Author: Baback Moghaddam, Gregory Shakhnarovich
Abstract: We introduce a novel learning algorithm for binary classification with hyperplane discriminants based on pairs of training points from opposite classes (dyadic hypercuts). This algorithm is further extended to nonlinear discriminants using kernel functions satisfying Mercer’s conditions. An ensemble of simple dyadic hypercuts is learned incrementally by means of a confidence-rated version of AdaBoost, which provides a sound strategy for searching through the finite set of hypercut hypotheses. In experiments with real-world datasets from the UCI repository, the generalization performance of the hypercut classifiers was found to be comparable to that of SVMs and k-NN classifiers. Furthermore, the computational cost of classification (at run time) was found to be similar to, or better than, that of SVM. Similarly to SVMs, boosted dyadic kernel discriminants tend to maximize the margin (via AdaBoost). In contrast to SVMs, however, we offer an on-line and incremental learning machine for building kernel discriminants whose complexity (number of kernel evaluations) can be directly controlled (traded off for accuracy). 1
6 0.69348902 108 nips-2002-Improving Transfer Rates in Brain Computer Interfacing: A Case Study
7 0.68585384 67 nips-2002-Discriminative Binaural Sound Localization
8 0.65804541 92 nips-2002-FloatBoost Learning for Classification
9 0.64709926 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
10 0.64406002 72 nips-2002-Dyadic Classification Trees via Structural Risk Minimization
11 0.64157534 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
12 0.62719238 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
13 0.61496633 55 nips-2002-Combining Features for BCI
14 0.58154786 109 nips-2002-Improving a Page Classifier with Anchor Extraction and Link Analysis
15 0.43396181 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
16 0.42870453 173 nips-2002-Recovering Intrinsic Images from a Single Image
17 0.41300195 140 nips-2002-Margin Analysis of the LVQ Algorithm
18 0.40939057 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
19 0.40512574 51 nips-2002-Classifying Patterns of Visual Motion - a Neuromorphic Approach
20 0.39327416 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
topicId topicWeight
[(1, 0.015), (11, 0.011), (41, 0.013), (42, 0.052), (52, 0.025), (54, 0.136), (55, 0.036), (67, 0.017), (68, 0.017), (74, 0.059), (87, 0.01), (92, 0.039), (98, 0.46)]
simIndex simValue paperId paperTitle
1 0.99482954 103 nips-2002-How Linear are Auditory Cortical Responses?
Author: Maneesh Sahani, Jennifer F. Linden
Abstract: By comparison to some other sensory cortices, the functional properties of cells in the primary auditory cortex are not yet well understood. Recent attempts to obtain a generalized description of auditory cortical responses have often relied upon characterization of the spectrotemporal receptive field (STRF), which amounts to a model of the stimulusresponse function (SRF) that is linear in the spectrogram of the stimulus. How well can such a model account for neural responses at the very first stages of auditory cortical processing? To answer this question, we develop a novel methodology for evaluating the fraction of stimulus-related response power in a population that can be captured by a given type of SRF model. We use this technique to show that, in the thalamo-recipient layers of primary auditory cortex, STRF models account for no more than 40% of the stimulus-related power in neural responses.
2 0.99177229 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger
Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on informationtheoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d2 ), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation. 1
3 0.98921448 129 nips-2002-Learning in Spiking Neural Assemblies
Author: David Barber
Abstract: We consider a statistical framework for learning in a class of networks of spiking neurons. Our aim is to show how optimal local learning rules can be readily derived once the neural dynamics and desired functionality of the neural assembly have been specified, in contrast to other models which assume (sub-optimal) learning rules. Within this framework we derive local rules for learning temporal sequences in a model of spiking neurons and demonstrate its superior performance to correlation (Hebbian) based approaches. We further show how to include mechanisms such as synaptic depression and outline how the framework is readily extensible to learning in networks of highly complex spiking neurons. A stochastic quantal vesicle release mechanism is considered and implications on the complexity of learning discussed. 1
4 0.98844677 56 nips-2002-Concentration Inequalities for the Missing Mass and for Histogram Rule Error
Author: Luis E. Ortiz, David A. McAllester
Abstract: This paper gives distribution-free concentration inequalities for the missing mass and the error rate of histogram rules. Negative association methods can be used to reduce these concentration problems to concentration questions about independent sums. Although the sums are independent, they are highly heterogeneous. Such highly heterogeneous independent sums cannot be analyzed using standard concentration inequalities such as Hoeffding’s inequality, the Angluin-Valiant bound, Bernstein’s inequality, Bennett’s inequality, or McDiarmid’s theorem.
5 0.98587346 92 nips-2002-FloatBoost Learning for Classification
Author: Stan Z. Li, Zhenqiu Zhang, Heung-yeung Shum, Hongjiang Zhang
Abstract: AdaBoost [3] minimizes an upper error bound which is an exponential function of the margin on the training set [14]. However, the ultimate goal in applications of pattern classification is always minimum error rate. On the other hand, AdaBoost needs an effective procedure for learning weak classifiers, which by itself is difficult especially for high dimensional data. In this paper, we present a novel procedure, called FloatBoost, for learning a better boosted classifier. FloatBoost uses a backtrack mechanism after each iteration of AdaBoost to remove weak classifiers which cause higher error rates. The resulting float-boosted classifier consists of fewer weak classifiers yet achieves lower error rates than AdaBoost in both training and test. We also propose a statistical model for learning weak classifiers, based on a stagewise approximation of the posterior using an overcomplete set of scalar features. Experimental comparisons of FloatBoost and AdaBoost are provided through a difficult classification problem, face detection, where the goal is to learn from training examples a highly nonlinear classifier to differentiate between face and nonface patterns in a high dimensional space. The results clearly demonstrate the promises made by FloatBoost over AdaBoost.
same-paper 6 0.96507406 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
7 0.92047858 79 nips-2002-Evidence Optimization Techniques for Estimating Stimulus-Response Functions
8 0.87644517 50 nips-2002-Circuit Model of Short-Term Synaptic Dynamics
9 0.87378919 184 nips-2002-Spectro-Temporal Receptive Fields of Subthreshold Responses in Auditory Cortex
10 0.86902505 110 nips-2002-Incremental Gaussian Processes
11 0.8515709 43 nips-2002-Binary Coding in Auditory Cortex
12 0.84948468 102 nips-2002-Hidden Markov Model of Cortical Synaptic Plasticity: Derivation of the Learning Rule
13 0.83985353 41 nips-2002-Bayesian Monte Carlo
14 0.83417273 12 nips-2002-A Neural Edge-Detection Model for Enhanced Auditory Sensitivity in Modulated Noise
15 0.83044499 116 nips-2002-Interpreting Neural Response Variability as Monte Carlo Sampling of the Posterior
16 0.81803358 199 nips-2002-Timing and Partial Observability in the Dopamine System
17 0.81765544 180 nips-2002-Selectivity and Metaplasticity in a Unified Calcium-Dependent Model
18 0.8164928 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
19 0.81522167 81 nips-2002-Expected and Unexpected Uncertainty: ACh and NE in the Neocortex
20 0.81121922 45 nips-2002-Boosted Dyadic Kernel Discriminants