nips nips2002 nips2002-194 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marina Sokolova, Mario Marchand, Nathalie Japkowicz, John S. Shawe-taylor
Abstract: We introduce a new learning algorithm for decision lists to allow features that are constructed from the data and to allow a tradeoff between accuracy and complexity. We bound its generalization error in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. 1
Reference: text
sentIndex sentText sentNum sentScore
1 uk Abstract We introduce a new learning algorithm for decision lists to allow features that are constructed from the data and to allow a tradeoff between accuracy and complexity. [sent-13, score-0.147]
2 We bound its generalization error in terms of the number of errors and the size of the classifier it finds on the training data. [sent-14, score-0.229]
3 Given a feature space, the SCM tries to find the smallest conjunction (or disjunction) of features that gives a small training error. [sent-17, score-0.15]
4 Hence, we denote by decision list machine (DLM) any classifier which computes a decision list of Boolean-valued features, including features that are possibly constructed from the data. [sent-21, score-0.294]
5 By extending the sample compression technique of Littlestone and Warmuth (1986), we bound the generalization error of the DLM with data-dependent balls in terms of the number of errors and the number of balls it achieves on the training data. [sent-23, score-1.074]
6 We also show that the DLM with balls can provide better generalization than the SCM with this same set of features on some natural data sets. [sent-24, score-0.465]
7 We consider binary classification problems for which the training set S = P ∪ N consists of a set P of positive training examples and a set N of negative training examples. [sent-26, score-0.327]
8 Given any set H = {hi (x) : i = 1, . . . , |H|} of features hi (x) and any training set S, the learning algorithm returns a small subset R ⊂ H of features. [sent-28, score-0.408]
9 The DLM computes a function of the form: If (h1 (x)) then b1 ; Else If (h2 (x)) then b2 ; . . . ; Else If (hr (x)) then br ; Else br+1 , where each bi ∈ {0, 1} defines the output of f (x) if and only if hi is the first feature to be satisfied on x (i.e., the smallest i for which hi (x) = 1). [sent-32, score-0.434]
10 Note that f computes a disjunction of the hi s whenever bi = 1 for i = 1, . . . , r (with br+1 = 0). [sent-36, score-0.397]
11 To compute a conjunction of hi s, we simply place in f the negation of each hi with bi = 0 for i = 1, . . . , r (with br+1 = 1). [sent-40, score-0.626]
12 A decision list containing an alternation, i.e. a pair (bi , bi+1 ) for which bi ≠ bi+1 for some i < r, cannot be represented as a (pure) conjunction or disjunction of hi s (and their negations). [sent-46, score-0.43]
13 Hence, the class of decision lists strictly includes conjunctions and disjunctions. [sent-47, score-0.138]
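To make the decision-list semantics above concrete, here is a minimal sketch (in Python, with illustrative names that are not from the paper) of how a DLM could evaluate f(x): the output is the bi of the first satisfied feature, or the default br+1 if no feature fires.

```python
# Minimal sketch of decision-list evaluation, assuming features are callables
# x -> bool and the classifier stores (feature, output) pairs plus a default.
class DecisionListMachine:
    def __init__(self, rules, default):
        self.rules = rules      # list of (h_i, b_i) pairs, with b_i in {0, 1}
        self.default = default  # the default value b_{r+1}

    def predict(self, x):
        for h, b in self.rules:
            if h(x):            # h_i is the first feature satisfied on x ...
                return b        # ... so f(x) = b_i
        return self.default     # no feature fired: f(x) = b_{r+1}

# A disjunction h1 OR h2 is the special case b1 = b2 = 1 with default 0:
h1 = lambda x: x[0] > 0.5
h2 = lambda x: x[1] > 0.5
dlm = DecisionListMachine([(h1, 1), (h2, 1)], default=0)
print(dlm.predict([0.9, 0.1]), dlm.predict([0.1, 0.1]))  # prints: 1 0
```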
14 From this definition, it seems natural to use the following greedy algorithm for building a DLM from a training set. [sent-48, score-0.119]
15 For a given set S′ = P′ ∪ N′ of examples (where P′ ⊆ P and N′ ⊆ N ) and a given set H of features, consider only the features hi ∈ H which make no errors on either P′ or N′. [sent-49, score-0.424]
16 If hi makes no error with P′, let Qi be the subset of examples of N′ on which hi makes no errors. [sent-50, score-0.755]
17 Otherwise, if hi makes no error with N′, let Qi be the subset of examples of P′ on which hi makes no errors. [sent-51, score-0.755]
18 Then it finds the hi with the largest |Qi | and appends this hi to the DLM. [sent-54, score-0.554]
19 It then removes Qi from S′ and repeats, finding the hk with the largest |Qk | until either P′ or N′ is empty. [sent-55, score-0.081]
20 Following Rivest (1987), this greedy algorithm is assured to build a DLM that makes no training errors whenever there exists a DLM on a set E ⊆ H of features that makes zero training errors. [sent-57, score-0.373]
21 However, this constraint is not really required in practice since we do want to permit the user of a learning algorithm to control the tradeoff between the accuracy achieved on the training data and the complexity (here the size) of the classifier. [sent-58, score-0.092]
22 Indeed, a small DLM which makes a few errors on the training set might give better generalization than a larger DLM (with more features) which makes zero training errors. [sent-59, score-0.31]
23 One way to include this flexibility is to early-stop the greedy algorithm when there remain a few more training examples to be covered. [sent-60, score-0.175]
24 But a further reduction in the size of the DLM can be accomplished by the algorithm of Figure 1. Algorithm BuildDLM(P, N, pp , pn , s, H). Input: A set P of positive examples, a set N of negative examples, the penalty values pp and pn , a stopping point s, and a set H = {hi (x) : i = 1, . . . , |H|} of Boolean-valued features. [sent-61, score-0.7]
25 Output: A decision list f consisting of a set R = {(hi , bi ) : i = 1, . . . , r} of features hi with their corresponding output values bi , and a default value br+1 . [sent-62, score-0.57]
26 For each hi ∈ H, let Pi and Ni be respectively the subsets of P and N correctly classified by hi . [sent-64, score-0.589]
27 For each hi compute its usefulness Ui , defined as Ui = max { |Pi | − pn · |N − Ni | , |Ni | − pp · |P − Pi | }. [sent-65, score-0.678]
28 Let hk be a feature with the largest value of Uk . [sent-66, score-0.081]
29 If (|Pk | − pn · |N − Nk | ≥ |Nk | − pp · |P − Pk |) then R = R ∪ {(hk , 1)}, P = P − Pk , N = Nk . [sent-68, score-0.276]
30 If (|Pk | − pn · |N − Nk | < |Nk | − pp · |P − Pk |) then R = R ∪ {(¬hk , 0)}, N = N − Nk , P = Pk . [sent-70, score-0.276]
31 Figure 1: The learning algorithm for the Decision List Machine. The reduction in size is obtained by considering features hi that do make a few errors on P (or N ) if many more examples Qi ⊆ N (or Qi ⊆ P ) can be covered. [sent-76, score-0.424]
32 Hence, to include this flexibility in choosing the proper tradeoff between complexity and accuracy, we propose the following modification of the greedy algorithm. [sent-77, score-0.081]
33 For every feature hi , let us denote by Pi the subset of P on which hi makes no errors and by Ni the subset of N on which hi makes no error. [sent-78, score-1.016]
34 For a given set S = P ∪ N , we will select the feature hi with the largest value of Ui and append this hi to the DLM; if |Pi | − pn · |N − Ni | ≥ |Ni | − pp · |P − Pi |, we then remove from S the examples in Pi ∪ (N − Ni ). [sent-81, score-0.554]
35 Otherwise if |Pi | − pn · |N − Ni | < |Ni | − pp · |P − Pi |, we will then remove from S examples in Ni ∪ (P − Pi ). [sent-83, score-0.368]
36 Hence, we recover the simple greedy algorithm when pp = pn = ∞. [sent-84, score-0.33]
37 The penalty parameters pp and pn and the early stopping point s are the model-selection parameters that give the user the ability to control the proper tradeoff between the training accuracy and the size of the DLM. [sent-86, score-0.44]
38 Their values could be determined either by using k-fold cross-validation, or by computing our bound (see section 4) on the generalization error. [sent-87, score-0.084]
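As a concrete illustration of the BuildDLM procedure and of the usefulness Ui above, the following Python sketch implements the greedy loop under the assumption that each feature is summarized by the sets Pi and Ni of positive and negative training examples it classifies correctly; the function and variable names, and the choice of default value when stopping early, are illustrative assumptions, not the authors' code.

```python
# Sketch of BuildDLM (cf. Figure 1), assuming each feature h is given together
# with the sets Pi, Ni of positive / negative example ids it classifies correctly.
def build_dlm(P, N, features, p_p, p_n, s):
    """P, N: sets of example ids; features: list of (h, Pi, Ni) triples;
    p_p, p_n: penalty values; s: stopping point (max number of features)."""
    P, N = set(P), set(N)
    R = []                                   # the decision list: (h, negate, output)
    while P and N and len(R) < s:
        best = None
        for h, Pi, Ni in features:
            Pi_cur, Ni_cur = Pi & P, Ni & N
            u_pos = len(Pi_cur) - p_n * len(N - Ni_cur)   # usefulness with output 1
            u_neg = len(Ni_cur) - p_p * len(P - Pi_cur)   # usefulness with output 0
            U = max(u_pos, u_neg)
            if best is None or U > best[0]:
                best = (U, h, Pi_cur, Ni_cur, u_pos, u_neg)
        if best is None:
            break
        _, h, Pi_cur, Ni_cur, u_pos, u_neg = best
        if u_pos >= u_neg:
            R.append((h, False, 1))          # "If h(x) then 1"
            P -= Pi_cur                      # remove covered positives ...
            N &= Ni_cur                      # ... and the negatives on which h errs
        else:
            R.append((h, True, 0))           # "If not h(x) then 0" (the paper's (not h, 0))
            N -= Ni_cur                      # remove covered negatives ...
            P &= Pi_cur                      # ... and the positives on which h errs
    # Default value b_{r+1}: majority class of the uncovered examples
    # (an assumption here; Figure 1 fixes this choice precisely).
    default = 1 if len(P) > len(N) else 0
    return R, default
```

With p_p and p_n set very large this reduces to the simple greedy algorithm above, since any feature that errs on the uncovered examples of one class receives a strongly negative usefulness.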
39 Given a set S of m training examples, our initial set of features consists, in principle, of all balls hi,ρ with a center xi ∈ S and a radius ρ ∈ [0, ∞), i.e. H = ∪i∈S ∪ρ∈[0,∞) {hi,ρ }. [sent-94, score-0.117]
40 But obviously, for each training example xi , we need only consider the set of m − 1 distances {d(xi , xj )}j≠i . [sent-95, score-0.164]
41 In fact, from the description of the DLM in the previous section, it follows that the ball with the largest usefulness belongs to one of the following four types: Pi , Po , Ni , and No . [sent-97, score-0.221]
42 Balls of type Pi (positive inside) are balls having a positive example x as center and a radius given by ρ = d(x, x′) − ε for some negative example x′ (that we call a border point) and a very small positive number ε. [sent-98, score-0.777]
43 Balls of type Po (positive outside) have a negative example center x and a radius ρ = d(x, x′) + ε given by a negative border x′. [sent-99, score-0.366]
44 Balls of type Ni (negative inside) have a negative center x and a radius ρ = d(x, x′) − ε given by a positive border x′. [sent-100, score-0.368]
45 Balls of type No (negative outside) have a positive center x and a radius ρ = d(x, x′) + ε given by a positive border x′. [sent-101, score-0.37]
46 This proposed set of features, constructed from the training data, provides to the user full control for choosing the proper tradeoff between training accuracy and function size. [sent-102, score-0.157]
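The four ball types can be enumerated directly from the pairwise distances. The sketch below (illustrative names, Euclidean metric) builds the candidate (center, radius, type) triples described above, leaving the Boolean output of each ball feature to the DLM construction.

```python
# Sketch of the candidate data-dependent balls of types Pi, Po, Ni, No,
# built from the training data under the L2 metric. Names are illustrative.
import numpy as np

def candidate_balls(X_pos, X_neg, eps=1e-6):
    """X_pos, X_neg: arrays of shape (n_pos, d) and (n_neg, d).
    Returns a list of (center, radius, ball_type) triples."""
    balls = []
    for x in X_pos:
        d_neg = np.linalg.norm(X_neg - x, axis=1)
        d_pos = np.linalg.norm(X_pos - x, axis=1)
        # Pi: positive center, radius up to (but excluding) a negative border.
        balls += [(x, d - eps, "Pi") for d in d_neg]
        # No: positive center, radius just past a positive border.
        balls += [(x, d + eps, "No") for d in d_pos if d > 0]
    for x in X_neg:
        d_pos = np.linalg.norm(X_pos - x, axis=1)
        d_neg = np.linalg.norm(X_neg - x, axis=1)
        # Ni: negative center, radius up to (but excluding) a positive border.
        balls += [(x, d - eps, "Ni") for d in d_pos]
        # Po: negative center, radius just past a negative border.
        balls += [(x, d + eps, "Po") for d in d_neg if d > 0]
    return balls
```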
47 4 Bound on the Generalization Error Note that we cannot use the “standard” VC theory to bound the expected loss of DLMs with data-dependent features because the VC dimension is a property of a function class defined on some input domain without reference to the data. [sent-103, score-0.093]
48 Since our learning algorithm tries to build a DLM with the smallest number of datadependent balls, we seek a bound that depends on this number and, consequently, on the number of examples that are used in the final classifier (the hypothesis). [sent-105, score-0.097]
49 We can thus think of our learning algorithm as compressing the training set into a small subset of examples that we call the compression set. [sent-106, score-0.254]
50 It was shown by Littlestone and Warmuth (1986) and Floyd and Warmuth (1995) that we can bound the generalization error of the hypothesis f if we can always reconstruct f from the compression set. [sent-107, score-0.264]
51 Hence, the only requirement is the existence of such a reconstruction function and its only purpose is to permit the exact identification of the hypothesis from the compression set and, possibly, additional bits of information. [sent-108, score-0.268]
52 Not surprisingly, the bound on the generalization error increases rapidly with the number of these additional bits of information. [sent-109, score-0.171]
53 We now describe our reconstruction function and the additional information that it needs to assure, in all cases, the proper reconstruction of the hypothesis from a compression set. [sent-111, score-0.236]
54 Our proposed scheme works in all cases provided that the learning algorithm returns a hypothesis that always correctly classifies the compression set (but not necessarily all of the training set). [sent-112, score-0.218]
55 We first describe the simpler case where only balls of types Pi and Ni are permitted and, later, describe the additional requirements that are introduced when we also permit balls of types Po and No . [sent-114, score-0.767]
56 Given a compression set Λ (returned by the learning algorithm), we first partition it into four disjoint subsets Cp , Cn , Bp , and Bn consisting of positive ball centers, negative ball centers, positive borders, and negative borders respectively. [sent-115, score-0.631]
57 When only balls of type Pi and Ni are permitted, the center of a ball cannot be the center of another ball since the center is removed from the remaining examples to be covered when a ball is added to the DLM. [sent-117, score-1.232]
58 But a center can be the border of a previous ball in the DLM and a border can be the border of more than one ball. [sent-118, score-0.477]
59 Hence, points in Bp ∪Bn are examples that are borders without being the center of another ball. [sent-119, score-0.182]
60 Because of the crucial importance of the ordering of the features in a decision list, these sets do not provide enough information by themselves to be able to reconstruct the hypothesis. [sent-120, score-0.166]
61 To specify the ordering of each ball center it is sufficient to provide log2 (r) bits of additional information, where the number r of balls is given by r = cp + cn for cp = |Cp | and cn = |Cn |. [sent-121, score-1.521]
62 To find the radius ρi for each center xi we start with C′p = Cp , C′n = Cn , B′p = Bp , B′n = Bn , and do the following, sequentially from the first center to the last. [sent-122, score-0.319]
63 If center xi ∈ C′p , then the radius is given by ρi = min{d(xi , xj ) : xj ∈ C′n ∪ B′n } − ε and we remove center xi from C′p and any other point from B′p covered by this ball (to find the radius of the other balls). [sent-123, score-0.786]
64 If center xi ∈ C′n , then the radius is given by ρi = min{d(xi , xj ) : xj ∈ C′p ∪ B′p } − ε and we remove center xi from C′n and any other point from B′n covered by this ball. [sent-124, score-0.549]
65 The output bi for each ball hi is 1 if the center xi ∈ Cp and 0 otherwise. [sent-125, score-0.631]
66 This reconstructed decision list of balls will be the same as the hypothesis if and only if the compression set is always correctly classified by the learning algorithm. [sent-126, score-0.644]
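The sequential radius reconstruction just described can be written down directly. The sketch below handles the Pi/Ni case only and assumes the ordered list of centers (with their side) has already been decoded from the additional information bits; the data layout and names are assumptions, not the authors' code.

```python
# Sketch of reconstructing the radii of a DLM with Pi / Ni balls from its
# compression set (Cp, Cn, Bp, Bn) and the decoded ordering of the centers.
import numpy as np

def reconstruct_radii(ordered_centers, Cp, Cn, Bp, Bn, eps=1e-6):
    """ordered_centers: list of (x, side) with side in {"p", "n"}, in DLM order.
    Cp, Cn, Bp, Bn: lists of points. Returns (center, radius, output) triples."""
    Cp_, Cn_, Bp_, Bn_ = list(Cp), list(Cn), list(Bp), list(Bn)
    rules = []
    for x, side in ordered_centers:
        x = np.asarray(x, dtype=float)
        if side == "p":                                    # center in C'_p: output 1
            rho = min(np.linalg.norm(x - np.asarray(z)) for z in Cn_ + Bn_) - eps
            Cp_ = [z for z in Cp_ if not np.array_equal(z, x)]
            Bp_ = [z for z in Bp_ if np.linalg.norm(np.asarray(z) - x) > rho]
            rules.append((x, rho, 1))
        else:                                              # center in C'_n: output 0
            rho = min(np.linalg.norm(x - np.asarray(z)) for z in Cp_ + Bp_) - eps
            Cn_ = [z for z in Cn_ if not np.array_equal(z, x)]
            Bn_ = [z for z in Bn_ if np.linalg.norm(np.asarray(z) - x) > rho]
            rules.append((x, rho, 0))
    return rules
```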
67 Once we can identify the hypothesis from the compression set, we can bound its generalization error. [sent-127, score-0.237]
68 Theorem 1 Let S = P ∪ N be a training set of positive and negative examples of size m = mp + mn . [sent-128, score-0.398]
69 Let A be the learning algorithm BuildDLM that uses data-dependent balls of type Pi and Ni for its set of features with the constraint that the returned function A(S) always correctly classifies every example in the compression set. [sent-129, score-0.569]
70 Suppose that A(S) contains r balls, and makes kp training errors on P , kn training errors on N (with k = kp + kn ), and has a compression set Λ = Cp ∪ Cn ∪ Bp ∪ Bn (as defined above) of size λ = cp + cn + bp + bn . [sent-130, score-1.409]
71 With probability 1 − δ over all random training sets S of size m, the generalization error er(A(S)) of A(S) is bounded by er(A(S)) ≤ 1 − exp( (−1/(m − λ − k)) · [ ln Bλ + ln(r!) + ln(1/δλ ) ] ), [sent-131, score-0.221]
72 where Bλ = C(mp , cp ) · C(mn , cn ) · C(mp − cp , bp ) · C(mn − cn , bn ) · C(mp − cp − bp , kp ) · C(mn − cn − bn , kn ) (with C(n, k) denoting the binomial coefficient) and δλ = (π²/6)^(−6) · ((cp + 1)(cn + 1)(bp + 1)(bn + 1)(kp + 1)(kn + 1))^(−2) · δ. Proof Let X be the set of training sets of size m. [sent-132, score-1.897]
73 Let us first bound the probability Pm = P{S ∈ X : er(A(S)) ≥ ε | m(S) = m}, given that m(S) is fixed to some value m, where m = (m, mp , mn , cp , cn , bp , bn , kp , kn ). [sent-133, score-1.385]
74 For this, denote by Ep the subset of P on which A(S) makes an error and similarly for En . [sent-134, score-0.097]
75 Denote also by I the additional bits of information needed to specify the ordering of the balls (as described above). [sent-136, score-0.168]
76 Now define P′m to be P′m = P{S ∈ X : er(A(S)) ≥ ε | Cp = S1 , Cn = S2 , Bp = S3 , Bn = S4 , Ep = S5 , En = S6 , I = I0 , m(S) = m} for some fixed set of disjoint subsets {Si }, i = 1, . . . , 6, of S and some fixed information message I0 . [sent-137, score-0.168]
77 Since Bλ is the number of different ways of choosing the different compression subsets and the set of error points in a training set of fixed m, we have Pm ≤ (r!) · Bλ · P′m , [sent-138, score-0.226]
78 where the first factor comes from the additional information that is needed to specify the ordering of the r balls. [sent-139, score-0.206]
79 Note that the hypothesis f = A(S) is fixed in P′m (because the compression set is fixed and the required information bits are given). [sent-140, score-0.213]
80 Let p be the probability of obtaining a positive example, let α be the probability that the fixed hypothesis f makes an error on a positive example, and let β be the probability that f makes an error on a negative example. [sent-142, score-0.508]
81 Let tp = cp + bp + kp and tn = cn + bn + kn . [sent-143, score-1.307]
82 We then have: P′m = C(m − tn − tp , mp − tp ) p^(mp − tp ) (1 − p)^(m − tn − mp ) (1 − α)^(mp − tp ) (1 − β)^(m − tn − mp ) ≤ Σ_{m′=tp}^{m−tn} C(m − tn − tp , m′ − tp ) p^(m′ − tp ) (1 − p)^(m − tn − m′ ) (1 − α)^(m′ − tp ) (1 − β)^(m − tn − m′ ) = [(1 − α)p + (1 − β)(1 − p)]^(m − tn − tp ) = (1 − er(f ))^(m − tn − tp ) ≤ (1 − ε)^(m − tn − tp ). Consequently: Pm ≤ (r!) · Bλ · (1 − ε)^(m − tn − tp ). [sent-144, score-0.945]
83 The use of balls of type Po and No introduces a few more difficulties that are taken into account by sending more bits to the reconstruction function. [sent-148, score-0.5]
84 First, the center of a ball of type Po or No can be used for more than one ball since the covered examples are outside the ball. [sent-149, score-0.572]
85 Hence, the number r of balls can now exceed cp + cn = c. [sent-150, score-0.765]
86 Then, for each ball, we can send log2 c bits to specify which center this ball is using and another bit to specify if the examples covered are inside or outside the ball. [sent-152, score-0.517]
87 Using the same notation as before, the radius ρi of a center xi of a ball of type Po is given by ρi = max{d(xi , xj ) : xj ∈ C′n ∪ B′n } + ε, and for a center xi of a ball of type No , its radius is given by ρi = max{d(xi , xj ) : xj ∈ C′p ∪ B′p } + ε. [sent-153, score-0.988]
88 Theorem 2 Let A be the learning algorithm BuildDLM that uses data-dependent balls of type Pi , Ni , Po , and No for its set of features. [sent-155, score-0.412]
89 Consider all the definitions used for Theorem 1, with c = cp + cn . [sent-156, score-0.313]
90 With probability 1 − δ over all random training sets S of size m, we have er(A(S)) ≤ 1 − exp( (−1/(m − λ − k)) · [ ln Bλ + ln λ + r ln(2c) + ln(1/δλ ) ] ). Basically, our bound states that good generalization is expected when we can find a small DLM that makes few training errors. [sent-157, score-0.454]
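For readability, the Theorem 1 bound and the quantities it uses can also be written out in display form; this is only a transcription of the statement as reconstructed above, so it inherits whatever uncertainty that reconstruction carries.

```latex
% Theorem 1 bound (Pi and Ni balls only), transcribed from the statement above.
\[
\mathrm{er}(A(S)) \;\le\; 1-\exp\!\left(\frac{-1}{m-\lambda-k}
   \left[\ln B_\lambda + \ln(r!) + \ln\frac{1}{\delta_\lambda}\right]\right),
\]
\[
B_\lambda \;=\;
  \binom{m_p}{c_p}\binom{m_n}{c_n}\binom{m_p-c_p}{b_p}\binom{m_n-c_n}{b_n}
  \binom{m_p-c_p-b_p}{k_p}\binom{m_n-c_n-b_n}{k_n},
\]
\[
\delta_\lambda \;=\;
  \left(\frac{\pi^2}{6}\right)^{-6}
  \bigl((c_p+1)(c_n+1)(b_p+1)(b_n+1)(k_p+1)(k_n+1)\bigr)^{-2}\,\delta .
\]
```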
91 In principle, we could use it as a guide for choosing the model selection parameters s, pp , and pn since it depends only on what the hypothesis has achieved on the training data. [sent-158, score-0.389]
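Since the bound depends only on counts that are available once the DLM has been built (compression set sizes and training errors), it can be evaluated directly for model selection. The sketch below computes the Theorem 1 bound as transcribed above; it is illustrative, and its constants follow that reconstruction.

```python
# Sketch: evaluating the Theorem 1 generalization bound (Pi/Ni balls only)
# from the counts achieved on the training data. Follows the transcription above.
import math

def ln_binom(n, k):
    # ln of the binomial coefficient C(n, k)
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def theorem1_bound(m_p, m_n, c_p, c_n, b_p, b_n, k_p, k_n, delta=0.05):
    m = m_p + m_n
    lam = c_p + c_n + b_p + b_n              # compression set size (lambda)
    k = k_p + k_n                            # number of training errors
    r = c_p + c_n                            # number of balls
    ln_B = (ln_binom(m_p, c_p) + ln_binom(m_n, c_n)
            + ln_binom(m_p - c_p, b_p) + ln_binom(m_n - c_n, b_n)
            + ln_binom(m_p - c_p - b_p, k_p) + ln_binom(m_n - c_n - b_n, k_n))
    ln_delta_lam = (-6.0 * math.log(math.pi ** 2 / 6.0)
                    - 2.0 * math.log((c_p + 1) * (c_n + 1) * (b_p + 1)
                                     * (b_n + 1) * (k_p + 1) * (k_n + 1))
                    + math.log(delta))
    exponent = -(ln_B + math.lgamma(r + 1) - ln_delta_lam) / (m - lam - k)
    return 1.0 - math.exp(exponent)

# Example: a DLM with 5 balls, a compression set of 12 points and 9 training
# errors on a training set of 600 examples.
print(theorem1_bound(300, 300, 3, 2, 4, 3, 5, 4))
```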
92 For each data set, we have removed all examples that contained attributes with unknown values (this has reduced substantially the “votes” data set) and we have removed examples with contradictory labels (this occurred only for a few examples in the Haberman data set). [sent-162, score-0.168]
93 (i.e. each machine was trained on the same training sets and tested on the same testing sets). [sent-171, score-0.095]
94 The “size” column refers to the average number of support vectors contained in SVM machines obtained from the 10 different training sets of 10-fold cross-validation. [sent-174, score-0.124]
95 We have reported the results for the SCM (Marchand and Shawe-Taylor, 2002) and the DLM when both machines are equipped with data-dependent balls under the L2 metric. [sent-175, score-0.37]
96 For the SCM, the T column refers to the type of the best machine found. The table reports, for each data set, its name and number of examples (#exs), the SVM size and errors, the results of the SCM with balls (columns T, p, s, errors), and the results of the DLM with balls (columns T, pp , pn , s, errors). The SVM results are: BreastW (683 exs): size 58, errors 19; Votes (52): size 18, errors 3; Pima (768): size 526, errors 203; Haberman (294): size 146, errors 71; Bupa (345): size 266, errors 107; Glass (214): size 125, errors 34; Credit (653): size 423, errors 190. [sent-176, score-0.547]
98 In the T column, c stands for conjunction and d for disjunction; the p column refers to the best value found for the penalty parameter, and the s column refers to the best stopping point in terms of the number of balls. [sent-189, score-0.13]
99 The same definitions apply also to the DLMs, except that two different penalty values (pp and pn ) are used. [sent-190, score-0.19]
100 In the T column of the DLM results, we have specified by s (simple) when the DLM was trained by using only balls of type Pi and Ni and by c (complex) when the four possible types of balls were used (see section 3). [sent-191, score-0.782]
wordName wordTfidf (topN-words)
[('balls', 0.37), ('dlm', 0.347), ('hi', 0.263), ('ni', 0.255), ('cn', 0.221), ('cp', 0.174), ('bn', 0.172), ('marchand', 0.166), ('ball', 0.151), ('pn', 0.148), ('def', 0.139), ('tp', 0.137), ('pi', 0.137), ('bp', 0.137), ('mp', 0.135), ('scm', 0.133), ('tn', 0.132), ('pm', 0.132), ('pp', 0.128), ('compression', 0.105), ('br', 0.104), ('tradeo', 0.101), ('ottawa', 0.1), ('kn', 0.087), ('radius', 0.086), ('center', 0.083), ('border', 0.081), ('po', 0.077), ('nk', 0.074), ('kp', 0.074), ('bi', 0.067), ('xi', 0.067), ('builddlm', 0.067), ('disjunction', 0.067), ('dlms', 0.067), ('mario', 0.067), ('mn', 0.066), ('pk', 0.066), ('list', 0.065), ('training', 0.065), ('covered', 0.062), ('bits', 0.06), ('qi', 0.057), ('decision', 0.056), ('examples', 0.056), ('ln', 0.056), ('greedy', 0.054), ('errors', 0.053), ('hk', 0.053), ('features', 0.052), ('scms', 0.05), ('hypothesis', 0.048), ('er', 0.047), ('conjunctions', 0.043), ('borders', 0.043), ('rivest', 0.043), ('generalization', 0.043), ('type', 0.042), ('makes', 0.042), ('penalty', 0.042), ('bound', 0.041), ('covering', 0.04), ('littlestone', 0.04), ('positive', 0.039), ('lists', 0.039), ('specify', 0.039), ('negative', 0.037), ('erent', 0.036), ('di', 0.036), ('remove', 0.036), ('classi', 0.035), ('let', 0.034), ('dhagat', 0.033), ('disjunctions', 0.033), ('haberman', 0.033), ('maxxj', 0.033), ('minxj', 0.033), ('sokolova', 0.033), ('votes', 0.033), ('warmuth', 0.033), ('conjunction', 0.033), ('svm', 0.033), ('xj', 0.032), ('ui', 0.032), ('metrics', 0.03), ('stopping', 0.03), ('sets', 0.03), ('refers', 0.029), ('floyd', 0.029), ('hence', 0.029), ('subsets', 0.029), ('subset', 0.028), ('reconstruction', 0.028), ('ordering', 0.028), ('largest', 0.028), ('proper', 0.027), ('error', 0.027), ('outside', 0.027), ('permit', 0.027), ('ep', 0.027)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999958 194 nips-2002-The Decision List Machine
Author: Marina Sokolova, Mario Marchand, Nathalie Japkowicz, John S. Shawe-taylor
Abstract: We introduce a new learning algorithm for decision lists to allow features that are constructed from the data and to allow a tradeoff between accuracy and complexity. We bound its generalization error in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. 1
2 0.10819784 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
Author: Harald Steck, Tommi S. Jaakkola
Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as
3 0.09385325 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
Author: Ralf Herbrich, Neil D. Lawrence, Matthias Seeger
Abstract: We present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on informationtheoretic principles, previously suggested for active learning. Our goal is not only to learn d–sparse predictors (which can be evaluated in O(d) rather than O(n), d n, n the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n · d2 ), and in large real-world classification experiments we show that it can match prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities (‘error bars’), allows for Bayesian model selection and is less complex in implementation. 1
4 0.073556714 45 nips-2002-Boosted Dyadic Kernel Discriminants
Author: Baback Moghaddam, Gregory Shakhnarovich
Abstract: We introduce a novel learning algorithm for binary classification with hyperplane discriminants based on pairs of training points from opposite classes (dyadic hypercuts). This algorithm is further extended to nonlinear discriminants using kernel functions satisfying Mercer’s conditions. An ensemble of simple dyadic hypercuts is learned incrementally by means of a confidence-rated version of AdaBoost, which provides a sound strategy for searching through the finite set of hypercut hypotheses. In experiments with real-world datasets from the UCI repository, the generalization performance of the hypercut classifiers was found to be comparable to that of SVMs and k-NN classifiers. Furthermore, the computational cost of classification (at run time) was found to be similar to, or better than, that of SVM. Similarly to SVMs, boosted dyadic kernel discriminants tend to maximize the margin (via AdaBoost). In contrast to SVMs, however, we offer an on-line and incremental learning machine for building kernel discriminants whose complexity (number of kernel evaluations) can be directly controlled (traded off for accuracy). 1
5 0.073402271 32 nips-2002-Approximate Inference and Protein-Folding
Author: Chen Yanover, Yair Weiss
Abstract: Side-chain prediction is an important subtask in the protein-folding problem. We show that finding a minimal energy side-chain configuration is equivalent to performing inference in an undirected graphical model. The graphical model is relatively sparse yet has many cycles. We used this equivalence to assess the performance of approximate inference algorithms in a real-world setting. Specifically we compared belief propagation (BP), generalized BP (GBP) and naive mean field (MF). In cases where exact inference was possible, max-product BP always found the global minimum of the energy (except in few cases where it failed to converge), while other approximation algorithms of similar complexity did not. In the full protein data set, maxproduct BP always found a lower energy configuration than the other algorithms, including a widely used protein-folding software (SCWRL). 1
6 0.071330853 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
7 0.070648342 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
8 0.070258185 195 nips-2002-The Effect of Singularities in a Learning Machine when the True Parameters Do Not Lie on such Singularities
9 0.06582018 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
10 0.061527416 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
11 0.060939409 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs
12 0.060192436 90 nips-2002-Feature Selection in Mixture-Based Clustering
13 0.056987274 171 nips-2002-Reconstructing Stimulus-Driven Neural Networks from Spike Times
14 0.055272222 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
15 0.053773887 120 nips-2002-Kernel Design Using Boosting
16 0.05310772 59 nips-2002-Constraint Classification for Multiclass Classification and Ranking
17 0.050577 69 nips-2002-Discriminative Learning for Label Sequences via Boosting
18 0.050191771 159 nips-2002-Optimality of Reinforcement Learning Algorithms with Linear Function Approximation
19 0.04963962 54 nips-2002-Combining Dimensions and Features in Similarity-Based Representations
20 0.048750348 181 nips-2002-Self Supervised Boosting
topicId topicWeight
[(0, -0.149), (1, -0.069), (2, 0.005), (3, -0.028), (4, 0.07), (5, 0.028), (6, -0.036), (7, -0.004), (8, -0.044), (9, 0.021), (10, -0.003), (11, -0.024), (12, 0.024), (13, 0.004), (14, -0.031), (15, -0.089), (16, 0.026), (17, 0.065), (18, 0.015), (19, 0.017), (20, -0.001), (21, 0.031), (22, 0.019), (23, -0.174), (24, 0.049), (25, 0.041), (26, -0.024), (27, 0.185), (28, 0.024), (29, -0.002), (30, 0.011), (31, 0.112), (32, -0.056), (33, 0.014), (34, -0.064), (35, 0.046), (36, 0.183), (37, -0.195), (38, -0.141), (39, 0.129), (40, -0.004), (41, 0.051), (42, 0.145), (43, 0.011), (44, -0.087), (45, -0.134), (46, -0.068), (47, -0.02), (48, 0.047), (49, -0.082)]
simIndex simValue paperId paperTitle
same-paper 1 0.91858095 194 nips-2002-The Decision List Machine
Author: Marina Sokolova, Mario Marchand, Nathalie Japkowicz, John S. Shawe-taylor
Abstract: We introduce a new learning algorithm for decision lists to allow features that are constructed from the data and to allow a tradeoff between accuracy and complexity. We bound its generalization error in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. 1
Author: Sumio Watanabe, Shun-ichi Amari
Abstract: A lot of learning machines with hidden variables used in information science have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, resulting that the learning theory of regular statistical models does not hold. Recently, it was proven that, if the true parameter is contained in singularities, then the coefficient of the Bayes generalization error is equal to the pole of the zeta function of the Kullback information. In this paper, under the condition that the true parameter is almost but not contained in singularities, we show two results. (1) If the dimension of the parameter from inputs to hidden units is not larger than three, then there exits a region of true parameters where the generalization error is larger than those of regular models, however, if otherwise, then for any true parameter, the generalization error is smaller than those of regular models. (2) The symmetry of the generalization error and the training error does not hold in singular models in general. 1
3 0.44095224 62 nips-2002-Coulomb Classifiers: Generalizing Support Vector Machines via an Analogy to Electrostatic Systems
Author: Sepp Hochreiter, Michael C. Mozer, Klaus Obermayer
Abstract: We introduce a family of classifiers based on a physical analogy to an electrostatic system of charged conductors. The family, called Coulomb classifiers, includes the two best-known support-vector machines (SVMs), the ν–SVM and the C–SVM. In the electrostatics analogy, a training example corresponds to a charged conductor at a given location in space, the classification function corresponds to the electrostatic potential function, and the training objective function corresponds to the Coulomb energy. The electrostatic framework provides not only a novel interpretation of existing algorithms and their interrelationships, but it suggests a variety of new methods for SVMs including kernels that bridge the gap between polynomial and radial-basis functions, objective functions that do not require positive-definite kernels, regularization techniques that allow for the construction of an optimal classifier in Minkowski space. Based on the framework, we propose novel SVMs and perform simulation studies to show that they are comparable or superior to standard SVMs. The experiments include classification tasks on data which are represented in terms of their pairwise proximities, where a Coulomb Classifier outperformed standard SVMs. 1
4 0.42807284 32 nips-2002-Approximate Inference and Protein-Folding
Author: Chen Yanover, Yair Weiss
Abstract: Side-chain prediction is an important subtask in the protein-folding problem. We show that finding a minimal energy side-chain configuration is equivalent to performing inference in an undirected graphical model. The graphical model is relatively sparse yet has many cycles. We used this equivalence to assess the performance of approximate inference algorithms in a real-world setting. Specifically we compared belief propagation (BP), generalized BP (GBP) and naive mean field (MF). In cases where exact inference was possible, max-product BP always found the global minimum of the energy (except in few cases where it failed to converge), while other approximation algorithms of similar complexity did not. In the full protein data set, maxproduct BP always found a lower energy configuration than the other algorithms, including a widely used protein-folding software (SCWRL). 1
5 0.41879037 54 nips-2002-Combining Dimensions and Features in Similarity-Based Representations
Author: Daniel J. Navarro, Michael D. Lee
Abstract: unkown-abstract
6 0.41357803 86 nips-2002-Fast Sparse Gaussian Process Methods: The Informative Vector Machine
7 0.40930551 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
8 0.38489184 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers
9 0.37970713 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
10 0.36822048 164 nips-2002-Prediction of Protein Topologies Using Generalized IOHMMs and RNNs
11 0.34148639 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
12 0.33757678 58 nips-2002-Conditional Models on the Ranking Poset
13 0.33675468 136 nips-2002-Linear Combinations of Optic Flow Vectors for Estimating Self-Motion - a Real-World Test of a Neural Model
14 0.32892174 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
15 0.30330884 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
16 0.28765777 123 nips-2002-Learning Attractor Landscapes for Learning Motor Primitives
17 0.28602663 179 nips-2002-Scaling of Probability-Based Optimization Algorithms
18 0.27118507 196 nips-2002-The RA Scanner: Prediction of Rheumatoid Joint Inflammation Based on Laser Imaging
19 0.26993814 158 nips-2002-One-Class LP Classifiers for Dissimilarity Representations
20 0.26898846 60 nips-2002-Convergence Properties of Some Spike-Triggered Analysis Techniques
topicId topicWeight
[(0, 0.419), (23, 0.019), (42, 0.065), (54, 0.112), (55, 0.025), (67, 0.011), (68, 0.022), (74, 0.074), (92, 0.043), (98, 0.1)]
simIndex simValue paperId paperTitle
same-paper 1 0.77688038 194 nips-2002-The Decision List Machine
Author: Marina Sokolova, Mario Marchand, Nathalie Japkowicz, John S. Shawe-taylor
Abstract: We introduce a new learning algorithm for decision lists to allow features that are constructed from the data and to allow a tradeoff between accuracy and complexity. We bound its generalization error in terms of the number of errors and the size of the classifier it finds on the training data. We also compare its performance on some natural data sets with the set covering machine and the support vector machine. 1
2 0.40173492 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
3 0.40098172 21 nips-2002-Adaptive Classification by Variational Kalman Filtering
Author: Peter Sykacek, Stephen J. Roberts
Abstract: We propose in this paper a probabilistic approach for adaptive inference of generalized nonlinear classification that combines the computational advantage of a parametric solution with the flexibility of sequential sampling techniques. We regard the parameters of the classifier as latent states in a first order Markov process and propose an algorithm which can be regarded as variational generalization of standard Kalman filtering. The variational Kalman filter is based on two novel lower bounds that enable us to use a non-degenerate distribution over the adaptation rate. An extensive empirical evaluation demonstrates that the proposed method is capable of infering competitive classifiers both in stationary and non-stationary environments. Although we focus on classification, the algorithm is easily extended to other generalized nonlinear models.
4 0.40072531 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
5 0.40056866 127 nips-2002-Learning Sparse Topographic Representations with Products of Student-t Distributions
Author: Max Welling, Simon Osindero, Geoffrey E. Hinton
Abstract: We propose a model for natural images in which the probability of an image is proportional to the product of the probabilities of some filter outputs. We encourage the system to find sparse features by using a Studentt distribution to model each filter output. If the t-distribution is used to model the combined outputs of sets of neurally adjacent filters, the system learns a topographic map in which the orientation, spatial frequency and location of the filters change smoothly across the map. Even though maximum likelihood learning is intractable in our model, the product form allows a relatively efficient learning procedure that works well even for highly overcomplete sets of filters. Once the model has been learned it can be used as a prior to derive the “iterated Wiener filter” for the purpose of denoising images.
6 0.39967674 53 nips-2002-Clustering with the Fisher Score
7 0.3993533 3 nips-2002-A Convergent Form of Approximate Policy Iteration
8 0.39810041 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
9 0.3974061 124 nips-2002-Learning Graphical Models with Mercer Kernels
10 0.39729774 46 nips-2002-Boosting Density Estimation
11 0.39712632 37 nips-2002-Automatic Derivation of Statistical Algorithms: The EM Family and Beyond
12 0.39688915 24 nips-2002-Adaptive Scaling for Feature Selection in SVMs
13 0.39671478 10 nips-2002-A Model for Learning Variance Components of Natural Images
14 0.39595348 27 nips-2002-An Impossibility Theorem for Clustering
15 0.39577165 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
16 0.39572257 203 nips-2002-Using Tarjan's Red Rule for Fast Dependency Tree Construction
17 0.39308921 31 nips-2002-Application of Variational Bayesian Approach to Speech Recognition
18 0.39198399 188 nips-2002-Stability-Based Model Selection
19 0.39196816 141 nips-2002-Maximally Informative Dimensions: Analyzing Neural Responses to Natural Signals
20 0.39012077 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition