nips nips2007 nips2007-149 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marco Barreno, Alvaro Cardenas, J. D. Tygar
Abstract: We present a new analysis for the combination of binary classifiers. Our analysis makes use of the Neyman-Pearson lemma as a theoretical basis to analyze combinations of classifiers. We give a method for finding the optimal decision rule for a combination of classifiers and prove that it has the optimal ROC curve. We show how our method generalizes and improves previous work on combining classifiers and generating ROC curves. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We give a method for finding the optimal decision rule for a combination of classifiers and prove that it has the optimal ROC curve. [sent-8, score-0.235]
2 1 Introduction We present an optimal way to combine binary classifiers in the Neyman-Pearson sense: for a given upper bound on false alarms (false positives), we find the set of combination rules maximizing the detection rate (true positives). [sent-10, score-0.328]
3 This forms the optimal ROC curve of a combination of classifiers. [sent-11, score-0.244]
4 In addition, we prove the following results: (1) We show that the optimal ROC curve is composed in general of 2^n + 1 different decision rules and of the interpolation between these rules (over the space of 2^(2^n) possible Boolean rules). [sent-15, score-0.537]
5 (2) We prove that our method is optimal in this space. [sent-16, score-0.084]
6 (3) We prove that the Boolean AND and OR rules are always part of the optimal set for the special case of independent classifiers (though in general we make no independence assumptions). [sent-17, score-0.27]
7 We address the problem of combining results from n base classifiers f1 , f2 , . . . , fn . [sent-23, score-0.137]
8 We can characterize the performance of classifier fi by its detection rate (also true positives, or power) PDi = Pr[Yi = 1|H1 ] and its false alarm rate (also false positives, or test size) PF i = Pr[Yi = 1|H0 ]. [sent-31, score-0.226]
9 1 The Receiver Operating Characteristic (ROC) curve plots PF on the x-axis and PD on the y-axis (ROC space). [sent-34, score-0.14]
10 If one classifier’s curve has no points below another, it weakly dominates the latter. [sent-36, score-0.245]
11 If no points are below and at least one point is strictly above, it dominates it. [sent-37, score-0.107]
12 The line y = x describes a classifier that is no better than chance, and every proper classifier dominates this line. [sent-38, score-0.098]
13 When an ROC curve consists of a single point, we connect it with straight lines to (0, 0) and (1, 1) in order to compare it with others (see Lemma 1). [sent-39, score-0.155]
14 In this paper, we focus on base classifiers that occupy a single point in ROC space. [sent-40, score-0.11]
15 1 The ROC convex hull Provost and Fawcett [1] give a seminal result on the use of ROC curves for combining classifiers. [sent-43, score-0.166]
16 They suggest taking the convex hull of all points of the ROC curves of the classifiers. [sent-44, score-0.168]
17 This ROC convex hull (ROCCH) combination rule interpolates between base classifiers f1 , f2 , . . . , [sent-45, score-0.304]
18 fn , selecting (1) a single best classifier or (2) a randomization between the decisions of two classifiers for every false alarm rate [1]. [sent-48, score-0.128]
19 This approach, however, is not optimal: as pointed out in later work by Fawcett, the Boolean AND and OR rules over classifiers can perform better than the ROCCH [3]. [sent-49, score-0.149]
20 AND and OR are only 2 of 2^(2^n) possible Boolean rules over the outputs of n base classifiers (n classifiers ⇒ 2^n possible outcomes ⇒ 2^(2^n) rules over outcomes). [sent-50, score-0.493]
21 Let the random variable Y have probability distributions P (Y|H0 ) under H0 and P (Y|H1 ) under H1 , and define the likelihood ratio ℓ(Y) = P (Y|H1 )/P (Y|H0 ). [sent-55, score-0.115]
22 The Neyman-Pearson lemma states that the likelihood ratio test D(Y) = 1 if ℓ(Y) > τ , γ if ℓ(Y) = τ , and 0 if ℓ(Y) < τ (1), for some τ ∈ (0, ∞) and γ ∈ [0, 1], is a most powerful test for its size: no other test has higher PD = Pr[D(Y) = 1|H1 ] for the same bound on PF = Pr[D(Y) = 1|H0 ]. [sent-56, score-0.262]
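The randomized test in equation (1) is straightforward to express in code. The following Python sketch is illustrative only (the names lr_test, lr, tau, and gamma are ours, not the paper's); it assumes the likelihood ratio ℓ(·) is supplied as a function.

```python
import random

def lr_test(y, lr, tau, gamma, rng=random.random):
    """Randomized likelihood-ratio test of equation (1).

    y     -- an observed outcome (e.g. the tuple of base-classifier outputs)
    lr    -- function mapping an outcome to its likelihood ratio l(y)
    tau   -- threshold in (0, inf)
    gamma -- probability of outputting 1 when l(y) equals tau exactly
    """
    ratio = lr(y)
    if ratio > tau:
        return 1
    if ratio < tau:
        return 0
    return 1 if rng() < gamma else 0  # randomize on the boundary
```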
23 3 The optimal ROC curve for a combination of classifiers We characterize the optimal ROC curve for a decision based on a combination of arbitrary classifiers—for any given bound α on PF , we maximize PD . [sent-61, score-0.487]
24 Its optimal ROC curve is a piecewise linear function parameterized by a free parameter α bounding PF : for α < PF 1 , PD (α) = (PD1 /PF 1 )α, and for α > PF 1 , PD (α) = [(1 − PD1 )/(1 − PF 1 )](α − PF 1 ) + PD1 . [sent-67, score-0.186]
25 When α < PF 1 , we can obtain a likelihood ratio test by setting τ ∗ = ℓ(1) and γ ∗ = α/PF 1 , and for α > PF 1 , we set τ ∗ = ℓ(0) and γ ∗ = (α − PF 1 )/(1 − PF 1 ). [sent-69, score-0.154]
26 The intuitive interpretation of this result is that to decrease or increase the false alarm rate of the classifier, we randomize between using its predictions and always choosing 1 or 0. [sent-70, score-0.092]
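For a single base classifier with operating point (PF1, PD1), the piecewise-linear curve above can be evaluated directly. A minimal sketch (function and argument names are ours, not the paper's):

```python
def single_classifier_pd(alpha, pd1, pf1):
    """Optimal PD(alpha) for one proper classifier with ROC point (pf1, pd1).

    Below pf1 we randomize between 'always predict 0' and the classifier;
    above pf1 we randomize between the classifier and 'always predict 1'.
    """
    if alpha <= pf1:
        return (pd1 / pf1) * alpha
    return (1.0 - pd1) / (1.0 - pf1) * (alpha - pf1) + pd1
```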
27 With this information we then compute and sort the likelihood ratios ℓ(y) for all outcomes y ∈ {0, 1}^n . [sent-73, score-0.201]
28 Let L be the list of likelihood ratios ranked from low to high. [sent-74, score-0.135]
29 Lemma 2 Given any 0 ≤ α ≤ 1, the ordering L determines parameters τ ∗ and γ ∗ for a likelihood ratio test of size α. [sent-75, score-0.198]
30 Lemma 2 sets up a classification rule for each interval between likelihoods in L and interpolates between them to create a test with size exactly α. [sent-76, score-0.101]
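One way to realize Lemma 2 in code is to walk the ranked outcomes from the highest likelihood ratio downward, accumulating false-alarm mass until the bound α is met; the threshold τ* is the ratio where the walk stops and γ* interpolates the remainder. This is only a sketch, under the simplifying assumption that the ratios are distinct; all names are illustrative.

```python
def threshold_for_size(ranked, p_h0, alpha):
    """Pick (tau*, gamma*) so the test of equation (1) has size exactly alpha.

    ranked -- the list L: (outcome, ratio) pairs sorted from low to high ratio
    p_h0   -- dict mapping each outcome to its probability under H0
    """
    pf_above = 0.0                      # false-alarm mass of outcomes above tau*
    for y, ratio in reversed(ranked):   # highest ratio first
        mass = p_h0.get(y, 0.0)
        if pf_above + mass >= alpha:
            gamma = (alpha - pf_above) / mass if mass > 0 else 0.0
            return ratio, gamma
        pf_above += mass
    return 0.0, 1.0                     # alpha covers everything: always fire
```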
31 To find the ROC curve for our meta-classifier, we plot PD against PF for all 0 ≤ α ≤ 1. [sent-78, score-0.14]
32 We call the ROC curve obtained in this way the LR-ROC. [sent-81, score-0.14]
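Equivalently, the vertices of the LR-ROC can be computed in one pass by adding outcomes to the "predict 1" region in decreasing order of their likelihood ratio and accumulating probability mass under each hypothesis. A self-contained sketch (names are ours, not the paper's):

```python
from itertools import product

def lr_roc_vertices(p_h0, p_h1, n):
    """Vertices (PF, PD) of the LR-ROC for n base classifiers.

    p_h0, p_h1 -- dicts mapping outcome tuples in {0,1}^n to probabilities
                  under H0 and H1; the randomization gamma interpolates
                  linearly between consecutive vertices.
    """
    eps = 1e-12  # guard against division by zero for unseen outcomes
    outcomes = sorted(
        product((0, 1), repeat=n),
        key=lambda y: p_h1.get(y, 0.0) / max(p_h0.get(y, 0.0), eps),
        reverse=True,
    )
    pf, pd, vertices = 0.0, 0.0, [(0.0, 0.0)]
    for y in outcomes:
        pf += p_h0.get(y, 0.0)
        pd += p_h1.get(y, 0.0)
        vertices.append((pf, pd))
    return vertices
```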
33 Theorem 1 The LR-ROC weakly dominates the ROC curve of any possible combination of Boolean functions g : {0, 1}^n → {0, 1} over the outputs of n classifiers. [sent-82, score-0.277]
34 Let α′ be the probability of false alarm PF for g. [sent-84, score-0.092]
35 Then our meta-classifier’s decision rule is a likelihood ratio test. [sent-86, score-0.178]
36 Since this is true at any α′ , the LR-ROC weakly dominates the ROC curve for g. [sent-89, score-0.216]
37 1 Practical considerations To compute all likelihood ratios for the classifier outcomes we need to know the probability distributions P (Y|H0 ) and P (Y|H1 ). [sent-91, score-0.218]
38 The simplest method is to run the base classifiers on a training set and count occurrences of each outcome. [sent-93, score-0.145]
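As a sketch of this counting step (illustrative names; the paper does not prescribe an implementation), the joint distributions can be estimated from the base classifiers' outputs on labeled training data:

```python
from collections import Counter

def estimate_outcome_distributions(outputs, labels):
    """Estimate P(Y|H0) and P(Y|H1) by counting outcome tuples.

    outputs -- one outcome tuple (y1, ..., yn) per training example
    labels  -- the true class of each example: 0 for H0, 1 for H1
    """
    counts = {0: Counter(), 1: Counter()}
    for y, label in zip(outputs, labels):
        counts[label][tuple(y)] += 1
    totals = {h: sum(counts[h].values()) for h in (0, 1)}
    p_h0 = {y: c / totals[0] for y, c in counts[0].items()}
    p_h1 = {y: c / totals[1] for y, c in counts[1].items()}
    return p_h0, p_h1
```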
39 It is likely that some outcomes will not occur in the training, or will occur only a small number of times. [sent-94, score-0.066]
40 Furthermore, the optimal ROC curve may have a different likelihood ratio for each possible outcome from the n classifiers, and therefore a different point in ROC space, so optimal ROC curves in general have up to 2^n points. [sent-98, score-0.396]
41 This implies an exponential (in the number of classifiers) lower bound on the running time of any algorithm to compute the optimal ROC curve for a combination of classifiers. [sent-99, score-0.245]
42 For a handful of classifiers, such a bound is not problematic, but it is impractical to compute the optimal ROC curve for dozens or hundreds of classifiers. [sent-100, score-0.186]
43 (However, by computing and sorting the likelihood ratios we avoid a 2^(2^n)-time search over all possible classification functions.) [sent-101, score-0.135]
44 The assumption that f1 and f2 are conditionally independent and proper defines a partial ordering on the likelihood ratio: ℓ(00) < ℓ(10) < ℓ(11) and ℓ(00) < ℓ(01) < ℓ(11). [sent-106, score-0.186]
45 This ordering breaks the likelihood ratio’s range (0, ∞) into five regions; choosing τ in each region defines a different decision rule. [sent-125, score-0.13]
46 The assumption of conditional independence is a sufficient condition for ensuring that the AND and OR rules improve on the ROCCH for n classifiers, as the following result shows. [sent-134, score-0.195]
47 Theorem 2 If the distributions of the outputs of n proper binary classifiers Y1 , Y2 , . . . , [sent-135, score-0.085]
48 Yn are conditionally independent given the instance class, then the points in ROC space for the rules AND (Y1 ∧ Y2 ∧ · · · ∧ Yn ) and OR (Y1 ∨ Y2 ∨ · · · ∨ Yn ) are strictly above the convex hull of the ROC curves of the base classifiers f1 , . . . , fn . [sent-138, score-0.494]
49 The likelihood ratio of the case when AND outputs 1 is given by ℓ(11 · · · 1) = (PD1 PD2 · · · PDn )/(PF 1 PF 2 · · · PF n ). [sent-144, score-0.134]
50 The likelihood ratio of the case when OR does not output 1 is given by ℓ(00 · · · 0) = [(1 − PD1 )(1 − PD2 ) · · · (1 − PDn )]/[(1 − PF 1 )(1 − PF 2 ) · · · (1 − PF n )]. [sent-145, score-0.115]
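These two product formulas are easy to check numerically. A small sketch (names are ours), using the per-classifier rates PDi and PFi:

```python
from math import prod

def and_or_likelihood_ratios(pd_rates, pf_rates):
    """l(11...1) and l(00...0) for conditionally independent base classifiers."""
    l_and = prod(d / f for d, f in zip(pd_rates, pf_rates))             # AND fires
    l_or = prod((1 - d) / (1 - f) for d, f in zip(pd_rates, pf_rates))  # OR silent
    return l_and, l_or

# e.g. and_or_likelihood_ratios([0.9, 0.8], [0.1, 0.2]) -> (36.0, ~0.0278)
```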
51 Now recall that for proper classifiers fi , PDi > PF i and thus (1 − PDi )/(1 − PF i ) < 1 < PDi /PF i . [sent-146, score-0.07]
52 These rules are strictly above the ROCCH: because ℓ(11 · · · 1) > PD1 /PF 1 , and PD1 /PF 1 is the slope of the line from (0, 0) to the first point in the ROCCH (f1 ), the AND point must be above the ROCCH. [sent-148, score-0.194]
53 The three non-trivial rules correspond to the Boolean functions Y1 ∨ ¬Y2 , Y1 , and Y1 ∧ ¬Y2 . [sent-158, score-0.149]
54 Note that Y2 appears only negatively despite being a proper classifier, and both the AND and OR rules are sub-optimal. [sent-159, score-0.194]
55 The likelihood ratios of the outcomes are ℓ(00) = 2. [sent-161, score-0.201]
56 4, and ℓ(11) = 5, so ℓ(10) < ℓ(01) < ℓ(00) < ℓ(11) and the three points defining the optimal ROC curve are ¬Y1 ∨ Y2 , ¬(Y1 ⊕ Y2 ), and Y1 ∧ Y2 (see Figure 1b). [sent-163, score-0.215]
57 In this case, an XOR rule emerges from the likelihood ratio analysis. [sent-164, score-0.147]
58 These examples show that for true optimal results it is not sufficient to use weighted voting rules w1 Y1 + w2 Y2 + · · · + wn Yn ≥ τ , where w ∈ (0, ∞) (like some ensemble methods). [sent-165, score-0.249]
59 Weighted voting always has AND and OR rules in its ROC curve, so it cannot always express optimal rules. [sent-166, score-0.229]
60 (c) Original ROC curve and optimal ROC curve for example in Section 4. [sent-193, score-0.326]
61 3 Optimality of the ROCCH We have seen that in some cases, rules exist with points strictly above the ROCCH. [sent-196, score-0.203]
62 The convex hull of points (PF i , PDi ) with (0, 0) and (1, 1) (the ROCCH) is an optimal ROC curve for the combination if (Yi = 1) ⇒ (Yj = 1) for i < j and the following ordering holds: ℓ(00 · · · 0) < ℓ(00 · · · 01) < ℓ(00 · · · 011) < · · · < ℓ(1 · · · 1). [sent-202, score-0.391]
63 The condition (Yi = 1) ⇒ (Yj = 1) for i < j implies that we only need to consider n + 2 points in the ROC space (the two extra points are (0, 0) and (1, 1)) rather than 2^n . [sent-204, score-0.093]
64 With these conditions and the ordering condition on the likelihood ratios, we have Pr[ℓ(Y) > ℓ(1 · · · 1)|H0 ] = 0, and Pr[ℓ(Y) > ℓ(0 · · · 0 1 · · · 1)|H0 ] = PF i , where the argument of ℓ has i trailing ones. [sent-206, score-0.117]
65 Therefore, finding the optimal threshold of the likelihood ratio test: for PF i−1 ≤ α < PF i we get τ ∗ = ℓ(0 · · · 0 1 · · · 1) with i − 1 trailing ones, and for PF i ≤ α < PF i+1 we get τ ∗ = ℓ(0 · · · 0 1 · · · 1) with i trailing ones. [sent-207, score-0.2]
66 The condition Yi = 1 ⇒ Yj = 1 for i < j is the same inclusion condition Flach and Wu use for repairing an ROC curve [2]. [sent-210, score-0.215]
67 4 Repairing an ROC curve Flach and Wu give a voting technique to repair concavities in an ROC curve that generates operating points above the ROCCH [2]. [sent-214, score-0.45]
68 Their intuition is that points underneath the convex hull can be mirrored to appear above the convex hull in much the same way as an improper classifier can be negated to obtain a proper classifier. [sent-215, score-0.254]
69 Although their algorithm produces better ROC curves, their solution will often yield curves with new concavities (see for example Flach and Wu’s Figure 4 [2]). [sent-216, score-0.094]
70 Flach and Wu’s method assumes the original ROC curve to be repaired has three models, or operating points: f1 predicts 1 when Y ∈ {11}, f2 predicts 1 when Y ∈ {11, 01}, and f3 predicts 1 when Y ∈ {11, 01, 10}. [sent-228, score-0.3]
71 If we apply Flach and Wu’s repair algorithm, the point f2 is corrected to the point f2′ ; however, the operating points of f1 and f3 remain the same. [sent-229, score-0.091]
72 Figure 2: Empirical ROC curves for experimental results on four UCI datasets (panels (c) sick-euthyroid and (d) sick; x-axis Pfa). [sent-273, score-0.088]
73 Our method improves on this result by ordering the likelihood ratios ℓ(01) < ℓ(00) < ℓ(11) < ℓ(10) and using that ordering to make three different rules: f1′ predicts 1 when Y ∈ {10}, f2′ predicts 1 when Y ∈ {10, 11}, and f3′ predicts 1 when Y ∈ {10, 11, 00}. [sent-274, score-0.347]
74 5 Experiments We ran experiments to test the performance of our combining method on the adult, hypothyroid, sick-euthyroid, and sick datasets from the UCI machine learning repository [6]. [sent-275, score-0.121]
75 We chose five base classifiers from the YALE machine learning platform [7]: PART (a decision list algorithm), SMO (Sequential Minimal Optimization), SimpleLogistic, VotedPerceptron, and Y-NaiveBayes. [sent-276, score-0.141]
76 The adult dataset has around 30,000 training points and 15,000 test points and the sick dataset has around 2000 training points and 700 test points. [sent-278, score-0.278]
77 For each experiment, we estimate the joint distribution by training the base classifiers on a training set and counting the outcomes. [sent-280, score-0.163]
78 We compute likelihood ratios for all outcomes and order them. [sent-281, score-0.201]
79 When outcomes have no examples, we treat ·/0 as near-infinite and 0/· as near-zero and define 0/0 = 1. [sent-282, score-0.066]
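A sketch of this convention for the ratio of estimated probabilities (the specific "near-infinite" and "near-zero" constants below are ours; the paper does not give values):

```python
def smoothed_ratio(p1, p0, big=1e9, small=1e-9):
    """l(y) = P(y|H1)/P(y|H0) with the conventions for unseen outcomes:
    x/0 is treated as near-infinite, 0/x as near-zero, and 0/0 is defined as 1."""
    if p0 == 0.0 and p1 == 0.0:
        return 1.0
    if p0 == 0.0:
        return big      # ./0 -> near-infinite
    if p1 == 0.0:
        return small    # 0/. -> near-zero
    return p1 / p0
```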
80 6 We derive a sequence of decision rules from the likelihood ratios computed on the training set. [sent-283, score-0.334]
81 We can compute an optimal ROC curve for the combination by counting the number of true positives and false positives each rule achieves. [sent-284, score-0.436]
82 In the test set we use the rules learned on the training set. [sent-285, score-0.207]
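One way to carry this out (an illustrative sketch; variable names are ours) is to take the outcome ordering learned on the training set and, for each nested rule, count true and false positives on the held-out set:

```python
def empirical_roc(ranked_outcomes, test_outputs, test_labels):
    """Empirical (PF, PD) points of the nested rules induced by a fixed
    likelihood-ratio ordering, evaluated by counting on a labeled set.

    ranked_outcomes -- outcome tuples sorted from highest to lowest ratio
    test_outputs    -- the outcome tuple each example produces
    test_labels     -- the true class of each example (0 or 1)
    """
    n_pos = sum(test_labels)
    n_neg = len(test_labels) - n_pos
    points, predict_one = [(0.0, 0.0)], set()
    for y in ranked_outcomes:
        predict_one.add(tuple(y))
        tp = sum(1 for out, lab in zip(test_outputs, test_labels)
                 if lab == 1 and tuple(out) in predict_one)
        fp = sum(1 for out, lab in zip(test_outputs, test_labels)
                 if lab == 0 and tuple(out) in predict_one)
        points.append((fp / n_neg, tp / n_pos))
    return points
```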
83 First, PART clearly outperforms the other base classifiers in three out of four experiments, though it seems to overfit on the hypothyroid dataset. [sent-290, score-0.155]
84 The LR-ROC dominates the ROC curves of the base classifiers on both training and test sets. [sent-291, score-0.27]
85 The ROC curves for the base classifiers are all strictly below the LR-ROC in results on the test sets. [sent-292, score-0.223]
86 The results on training sets seem to imply that the LR-ROC is primarily classifying like PART with a small boost from the other classifiers; however, the test set results (in particular, Figure 2b) demonstrate that the LR-ROC generalizes better than the base classifiers. [sent-293, score-0.197]
87 Estimation of the joint distribution and computation of the ROC curve finished within one minute for 20 classifiers (not including time to train the individual classifiers). [sent-297, score-0.176]
88 Unfortunately, the inherent exponential structure of the optimal ROC curve means we cannot expect to do significantly better (at the same rate, 30 classifiers would take over 12 hours and 40 classifiers almost a year and a half). [sent-298, score-0.186]
89 6 Related work Our work is loosely related to ensemble methods such as bagging [8] and boosting [9] because it finds meta-classification rules over a set of base classifiers. [sent-299, score-0.358]
90 However, bagging and boosting each take one base classifier and train many times, resampling or reweighting the training data to generate classifier diversity [10] or increase the classification margin [11]. [sent-300, score-0.244]
91 The decision rules applied to the generated classifiers are (weighted) majority voting. [sent-301, score-0.18]
92 In contrast, our method takes any binary classifiers and finds optimal combination rules from the more general space of all binary functions. [sent-302, score-0.295]
93 Although we present our methods in a different light, our decision rule can be interpreted as a ranking algorithm. [sent-304, score-0.099]
94 Another more similar method for combining classifiers is stacking [13]. [sent-308, score-0.108]
95 Stacking trains a metalearner to combine the predictions of several base classifiers; in fact, our method might be considered a stacking method with a particular meta-classifier. [sent-309, score-0.207]
96 It can be difficult to show the improvement of stacking in general over selecting the best base-level classifier [14]; however, stacking has a useful interpretation as generalized cross-validation that makes it practical. [sent-310, score-0.13]
97 Our analysis shows that our combination method is the optimal meta-learner in the Neyman-Pearson sense, but incorporating the model validation aspect of stacking would make an interesting extension to our work. [sent-311, score-0.169]
98 We give a method for combining classifiers and prove that it is optimal in the Neyman-Pearson sense. [sent-313, score-0.111]
99 Though in general we make no assumptions about independence, Theorem 2 shows that certain simple rules are optimal when we do know that the classifiers are independent. [sent-317, score-0.195]
100 Is combining classifiers with stacking better than selecting the best one? [sent-396, score-0.092]
wordName wordTfidf (topN-words)
[('pf', 0.613), ('roc', 0.467), ('ers', 0.237), ('pd', 0.198), ('classi', 0.19), ('rocch', 0.151), ('rules', 0.149), ('curve', 0.14), ('pr', 0.133), ('base', 0.11), ('flach', 0.106), ('pdi', 0.106), ('er', 0.097), ('meta', 0.096), ('ratios', 0.08), ('hull', 0.07), ('boolean', 0.067), ('outcomes', 0.066), ('stacking', 0.065), ('pfa', 0.06), ('ratio', 0.06), ('false', 0.055), ('likelihood', 0.055), ('dominates', 0.053), ('positives', 0.053), ('curves', 0.049), ('wu', 0.049), ('provost', 0.048), ('fawcett', 0.048), ('optimal', 0.046), ('bagging', 0.045), ('concavities', 0.045), ('hypothyroid', 0.045), ('pdn', 0.045), ('proper', 0.045), ('ordering', 0.044), ('conditionally', 0.042), ('combination', 0.042), ('yn', 0.04), ('sick', 0.039), ('repairing', 0.039), ('test', 0.039), ('alarm', 0.037), ('train', 0.036), ('adult', 0.036), ('hc', 0.036), ('predicts', 0.036), ('fn', 0.036), ('operating', 0.036), ('ranking', 0.036), ('voting', 0.034), ('boosting', 0.034), ('rule', 0.032), ('lr', 0.032), ('decision', 0.031), ('lemma', 0.03), ('icast', 0.03), ('interpolates', 0.03), ('points', 0.029), ('classifying', 0.029), ('independence', 0.028), ('yi', 0.028), ('combining', 0.027), ('yale', 0.026), ('repair', 0.026), ('yoav', 0.026), ('strictly', 0.025), ('part', 0.025), ('fi', 0.025), ('weakly', 0.023), ('rankboost', 0.022), ('hp', 0.022), ('telecom', 0.022), ('prove', 0.022), ('freund', 0.022), ('binary', 0.021), ('convex', 0.02), ('slope', 0.02), ('uci', 0.02), ('ensemble', 0.02), ('opinions', 0.019), ('training', 0.019), ('outputs', 0.019), ('mining', 0.018), ('yj', 0.018), ('condition', 0.018), ('technologies', 0.018), ('implies', 0.017), ('sun', 0.017), ('considerations', 0.017), ('auc', 0.016), ('tom', 0.016), ('method', 0.016), ('forms', 0.016), ('notes', 0.016), ('march', 0.015), ('schapire', 0.015), ('counting', 0.015), ('detection', 0.015), ('others', 0.015)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers
Author: Marco Barreno, Alvaro Cardenas, J. D. Tygar
Abstract: We present a new analysis for the combination of binary classifiers. Our analysis makes use of the Neyman-Pearson lemma as a theoretical basis to analyze combinations of classifiers. We give a method for finding the optimal decision rule for a combination of classifiers and prove that it has the optimal ROC curve. We show how our method generalizes and improves previous work on combining classifiers and generating ROC curves. 1
2 0.17429097 147 nips-2007-One-Pass Boosting
Author: Zafer Barutcuoglu, Phil Long, Rocco Servedio
Abstract: This paper studies boosting algorithms that make a single pass over a set of base classifiers. We first analyze a one-pass algorithm in the setting of boosting with diverse base classifiers. Our guarantee is the same as the best proved for any boosting algorithm, but our one-pass algorithm is much faster than previous approaches. We next exhibit a random source of examples for which a “picky” variant of AdaBoost that skips poor base classifiers can outperform the standard AdaBoost algorithm, which uses every base classifier, by an exponential factor. Experiments with Reuters and synthetic data show that one-pass boosting can substantially improve on the accuracy of Naive Bayes, and that picky boosting can sometimes lead to a further improvement in accuracy.
3 0.16222952 39 nips-2007-Boosting the Area under the ROC Curve
Author: Phil Long, Rocco Servedio
Abstract: We show that any weak ranker that can achieve an area under the ROC curve slightly better than 1/2 (which can be achieved by random guessing) can be efficiently boosted to achieve an area under the ROC curve arbitrarily close to 1. We further show that this boosting can be performed even in the presence of independent misclassification noise, given access to a noise-tolerant weak ranker.
4 0.097914606 137 nips-2007-Multiple-Instance Pruning For Learning Efficient Cascade Detectors
Author: Cha Zhang, Paul A. Viola
Abstract: Cascade detectors have been shown to operate extremely rapidly, with high accuracy, and have important applications such as face detection. Driven by this success, cascade learning has been an area of active research in recent years. Nevertheless, there are still challenging technical problems during the training process of cascade detectors. In particular, determining the optimal target detection rate for each stage of the cascade remains an unsolved issue. In this paper, we propose the multiple instance pruning (MIP) algorithm for soft cascades. This algorithm computes a set of thresholds which aggressively terminate computation with no reduction in detection rate or increase in false positive rate on the training dataset. The algorithm is based on two key insights: i) examples that are destined to be rejected by the complete classifier can be safely pruned early; ii) face detection is a multiple instance learning problem. The MIP process is fully automatic and requires no assumptions of probability distributions, statistical independence, or ad hoc intermediate rejection targets. Experimental results on the MIT+CMU dataset demonstrate significant performance advantages. 1
5 0.093593314 42 nips-2007-CPR for CSPs: A Probabilistic Relaxation of Constraint Propagation
Author: Luis E. Ortiz
Abstract: This paper proposes constraint propagation relaxation (CPR), a probabilistic approach to classical constraint propagation that provides another view on the whole parametric family of survey propagation algorithms SP(ρ). More importantly, the approach elucidates the implicit, but fundamental assumptions underlying SP(ρ), thus shedding some light on its effectiveness and leading to applications beyond k-SAT. 1
6 0.08201056 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
7 0.07876578 32 nips-2007-Bayesian Co-Training
8 0.07731808 175 nips-2007-Semi-Supervised Multitask Learning
9 0.073591068 166 nips-2007-Regularized Boost for Semi-Supervised Learning
10 0.067238919 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis
11 0.064268082 126 nips-2007-McRank: Learning to Rank Using Multiple Classification and Gradient Boosting
12 0.062230315 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
13 0.057437681 11 nips-2007-A Risk Minimization Principle for a Class of Parzen Estimators
14 0.047246788 9 nips-2007-A Probabilistic Approach to Language Change
15 0.046618965 110 nips-2007-Learning Bounds for Domain Adaptation
16 0.046424471 97 nips-2007-Hidden Common Cause Relations in Relational Learning
17 0.046305452 112 nips-2007-Learning Monotonic Transformations for Classification
18 0.045289278 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
19 0.045126386 198 nips-2007-The Noisy-Logical Distribution and its Application to Causal Inference
20 0.044495914 69 nips-2007-Discriminative Batch Mode Active Learning
topicId topicWeight
[(0, -0.143), (1, 0.034), (2, -0.068), (3, 0.092), (4, 0.06), (5, 0.13), (6, 0.128), (7, -0.073), (8, 0.048), (9, -0.087), (10, -0.007), (11, -0.009), (12, -0.063), (13, -0.103), (14, -0.023), (15, -0.057), (16, 0.108), (17, 0.035), (18, -0.155), (19, -0.028), (20, 0.0), (21, 0.126), (22, -0.144), (23, 0.109), (24, -0.04), (25, -0.045), (26, 0.078), (27, 0.098), (28, -0.003), (29, 0.153), (30, 0.008), (31, -0.01), (32, 0.059), (33, 0.053), (34, -0.071), (35, -0.049), (36, 0.048), (37, -0.031), (38, 0.031), (39, -0.056), (40, 0.074), (41, 0.056), (42, 0.006), (43, 0.126), (44, 0.047), (45, -0.171), (46, -0.181), (47, 0.097), (48, 0.09), (49, 0.005)]
simIndex simValue paperId paperTitle
same-paper 1 0.96874493 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers
Author: Marco Barreno, Alvaro Cardenas, J. D. Tygar
Abstract: We present a new analysis for the combination of binary classifiers. Our analysis makes use of the Neyman-Pearson lemma as a theoretical basis to analyze combinations of classifiers. We give a method for finding the optimal decision rule for a combination of classifiers and prove that it has the optimal ROC curve. We show how our method generalizes and improves previous work on combining classifiers and generating ROC curves. 1
2 0.74348539 39 nips-2007-Boosting the Area under the ROC Curve
Author: Phil Long, Rocco Servedio
Abstract: We show that any weak ranker that can achieve an area under the ROC curve slightly better than 1/2 (which can be achieved by random guessing) can be efficiently boosted to achieve an area under the ROC curve arbitrarily close to 1. We further show that this boosting can be performed even in the presence of independent misclassification noise, given access to a noise-tolerant weak ranker.
3 0.70949548 147 nips-2007-One-Pass Boosting
Author: Zafer Barutcuoglu, Phil Long, Rocco Servedio
Abstract: This paper studies boosting algorithms that make a single pass over a set of base classifiers. We first analyze a one-pass algorithm in the setting of boosting with diverse base classifiers. Our guarantee is the same as the best proved for any boosting algorithm, but our one-pass algorithm is much faster than previous approaches. We next exhibit a random source of examples for which a “picky” variant of AdaBoost that skips poor base classifiers can outperform the standard AdaBoost algorithm, which uses every base classifier, by an exponential factor. Experiments with Reuters and synthetic data show that one-pass boosting can substantially improve on the accuracy of Naive Bayes, and that picky boosting can sometimes lead to a further improvement in accuracy.
4 0.62833154 137 nips-2007-Multiple-Instance Pruning For Learning Efficient Cascade Detectors
Author: Cha Zhang, Paul A. Viola
Abstract: Cascade detectors have been shown to operate extremely rapidly, with high accuracy, and have important applications such as face detection. Driven by this success, cascade learning has been an area of active research in recent years. Nevertheless, there are still challenging technical problems during the training process of cascade detectors. In particular, determining the optimal target detection rate for each stage of the cascade remains an unsolved issue. In this paper, we propose the multiple instance pruning (MIP) algorithm for soft cascades. This algorithm computes a set of thresholds which aggressively terminate computation with no reduction in detection rate or increase in false positive rate on the training dataset. The algorithm is based on two key insights: i) examples that are destined to be rejected by the complete classifier can be safely pruned early; ii) face detection is a multiple instance learning problem. The MIP process is fully automatic and requires no assumptions of probability distributions, statistical independence, or ad hoc intermediate rejection targets. Experimental results on the MIT+CMU dataset demonstrate significant performance advantages. 1
5 0.54912281 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
Author: John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, Heung-yeung Shum
Abstract: We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We prove asymptotic optimality of this criterion for Gaussian data and analyze its relationships to classical classifiers. Theoretical results provide new insights into relationships among popular classifiers such as MAP and RDA, as well as unsupervised clustering methods based on lossy compression [13]. Minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Compression also provides a uniform means of handling classes of varying dimension. This simple classification criterion and its kernel and local versions perform competitively against existing classifiers on both synthetic examples and real imagery data such as handwritten digits and human faces, without requiring domain-specific information. 1
6 0.46457008 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
7 0.4150196 166 nips-2007-Regularized Boost for Semi-Supervised Learning
8 0.40046313 90 nips-2007-FilterBoost: Regression and Classification on Large Datasets
9 0.39418206 126 nips-2007-McRank: Learning to Rank Using Multiple Classification and Gradient Boosting
10 0.3781926 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
11 0.35672739 188 nips-2007-Subspace-Based Face Recognition in Analog VLSI
12 0.35637128 97 nips-2007-Hidden Common Cause Relations in Relational Learning
13 0.3522225 175 nips-2007-Semi-Supervised Multitask Learning
14 0.34649536 32 nips-2007-Bayesian Co-Training
15 0.34211147 38 nips-2007-Boosting Algorithms for Maximizing the Soft Margin
16 0.32457083 173 nips-2007-Second Order Bilinear Discriminant Analysis for single trial EEG analysis
17 0.31329805 24 nips-2007-An Analysis of Inference with the Universum
18 0.30486465 114 nips-2007-Learning and using relational theories
19 0.30447429 27 nips-2007-Anytime Induction of Cost-sensitive Trees
20 0.29537448 112 nips-2007-Learning Monotonic Transformations for Classification
topicId topicWeight
[(5, 0.036), (13, 0.033), (16, 0.021), (18, 0.014), (19, 0.016), (21, 0.082), (31, 0.025), (34, 0.06), (35, 0.022), (46, 0.01), (47, 0.082), (49, 0.014), (63, 0.278), (83, 0.139), (85, 0.018), (87, 0.019), (90, 0.041)]
simIndex simValue paperId paperTitle
1 0.85955024 25 nips-2007-An in-silico Neural Model of Dynamic Routing through Neuronal Coherence
Author: Devarajan Sridharan, Brian Percival, John Arthur, Kwabena A. Boahen
Abstract: We describe a neurobiologically plausible model to implement dynamic routing using the concept of neuronal communication through neuronal coherence. The model has a three-tier architecture: a raw input tier, a routing control tier, and an invariant output tier. The correct mapping between input and output tiers is realized by an appropriate alignment of the phases of their respective background oscillations by the routing control units. We present an example architecture, implemented on a neuromorphic chip, that is able to achieve circular-shift invariance. A simple extension to our model can accomplish circular-shift dynamic routing with only O(N ) connections, compared to O(N 2 ) connections required by traditional models. 1 Dynamic Routing Circuit Models for Circular-Shift Invariance Dynamic routing circuit models are among the most prominent neural models for invariant recognition [1] (also see [2] for review). These models implement shift invariance by dynamically changing spatial connectivity to transform an object to a standard position or orientation. The connectivity between the raw input and invariant output layers is controlled by routing units, which turn certain subsets of connections on or off (Figure 1A). An important feature of this model is the explicit representation of what and where information in the main network and the routing units, respectively; the routing units use the where information to create invariant representations. Traditional solutions for shift invariance are neurobiologically implausible for at least two reasons. First, there are too many synaptic connections: for N input neurons, N output neurons and N possible input-output mappings, the network requires O(N 2 ) connections in the routing layer— between each of the N routing units and each set of N connections that that routing unit gates (Figure 1A). Second, these connections must be extremely precise: each routing unit must activate an inputoutput mapping (N individual connections) corresponding to the desired shift (as highlighted in Figure 1A). Other approaches that have been proposed, including invariant feature networks [3,4], also suffer from significant drawbacks, such as the inability to explicitly represent where information [2]. It remains an open question how biology could achieve shift invariance without profligate and precise connections. In this article, we propose a simple solution for shift invariance for quantities that are circular or periodic in nature—circular-shift invariance (CSI)—orientation invariance in vision and key invariance in music. The visual system may create orientation-invariant representations to aid recognition under conditions of object rotation or head-tilt [5,6]; a similar mechanism could be employed by the auditory system to create key-invariant representations under conditions where the same melody 1 Figure 1: Dynamic routing. A In traditional dynamic routing, connections from the (raw) input layer to the (invariant) output layer are gated by routing units. For instance, the mapping from A to 5, B to 6, . . . , F to 4 is achieved by turning on the highlighted routing unit. B In time-division multiplexing (TDM), the encoder samples input channels periodically (using a rotating switch) while the decoder sends each sample to the appropriate output channel (based on its time bin). 
TDM can be extended to achieve a circular-shift transformation by altering the angle between encoder and decoder switches (θ), thereby creating a rotated mapping between input and output channels (adapted from [7]). is played in different keys. Similar to orientation, which is a periodic quantity, musical notes one octave apart sound alike, a phenomenon known as octave equivalence [8]. Thus, the problems of key invariance and orientation invariance admit similar solutions. Deriving inspiration from time-division multiplexing (TDM), we propose a neural network for CSI that uses phase to encode and decode information. We modulate the temporal window of communication between (raw) input and (invariant) output neurons to achieve the appropriate input–output mapping. Extending TDM, any particular circular-shift transformation can be accomplished by changing the relative angle, θ, between the rotating switches of the encoder (that encodes the raw input in time) and decoder (that decodes the invariant output in time) (Figure 1B). This obviates the need to hardwire routing control units that specifically modulate the strength of each possible inputoutput connection, thereby significantly reducing the complexity inherent in the traditional dynamic routing solution. Similarly, a remapping between the input and output neurons can be achieved by introducing a relative phase-shift in their background oscillations. 2 Dynamic Routing through Neuronal Coherence To modulate the temporal window of communication, the model uses a ring of neurons (the oscillation ring) to select the pool of neurons (in the projection ring) that encode or decode information at a particular time (Figure 2A). Each projection pool encodes a specific value of the feature (for example, one of twelve musical notes). Upon activation by external input, each pool is active only when background inhibition generated by the oscillation ring (outer ring of neurons) is at a minimum. In addition to exciting 12 inhibitory interneurons in the projection ring, each oscillation ring neuron excites its nearest 18 neighbors in the clockwise direction around the oscillation ring. As a result, a wave of inhibition travels around the projection ring that allows only one pool to be excitable at any point in time. These neurons become excitable at roughly the same time (numbered sectors, inner ring) by virtue of recurrent excitatory intra-pool connections. Decoding is accomplished by a second tier of rings (Figure 2B). The projection ring of the first (input) tier connects all-to-all to the projection ring of the second (output) tier. The two oscillation rings create a window of excitability for the pools of neurons in their respective projection rings. Hence, the most effective communication occurs between input and output pools that become excitable at the same time (i.e. are oscillating in phase with one another [9]). The CSI problem is solved by introducing a phase-shift between the input and output tiers. If they are exactly in phase, then an input pool is simply mapped to the output pool directly above it. If their 2 Figure 2: Double-Ring Network for Encoding and Decoding. A The projection (inner) ring is divided into (numbered) pools. The oscillation (outer) ring modulates sub-threshold activity (waveforms) of the projection ring by exciting (black distribution) inhibitory neurons that inhibit neighboring projection neurons. 
A wave of activity travels around the oscillation ring due to asymmetric excitatory connections, creating a corresponding wave of inhibitory activity in the projection ring, such that only one pool of projection neurons is excitable (spikes) at a given time. B Two instances of the double-ring structure from A. The input projection ring connects all-to-all to the output projection ring (dashed lines). Because each input pool will spike only during a distinct time bin, and each output pool is excitable only in a certain time bin, communication occurs between input and output pools that are oscillating in phase with each other. Appropriate phase offset between input and output oscillation rings realizes the desired circular shift (input pool H to output pool 1, solid arrow). C Interactions among pools highlighted in B. phases are different, the input is dynamically routed to an appropriate circularly shifted position in the output tier. Such changes in phase are analogous to adjusting the angle of the rotating switch at either the encoder or the decoder in TDM (see Figure 1B). There is some evidence that neural systems could employ phase relationships of subthreshold oscillations to selectively target neural populations [9-11]. 3 Implementation in Silicon We implemented this solution to CSI on a neuromorphic silicon chip [12]. The neuromorphic chip has neurons whose properties resemble that of biological neurons; these neurons even have intrinsic differences, thereby mimicking heterogeneity in real neurobiological systems. The chip uses a conductance-based spiking model for both inhibitory and excitatory neurons. Inhibitory neurons project to nearby excitatory and inhibitory neurons via a diffusor network that determines the spread of inhibition. A lookup table of excitatory synaptic connectivity is stored in a separate randomaccess memory (RAM) chip. Spikes occurring on-chip are converted to a neuron address, mapped to synapses (if any) via the lookup table, and routed to the targeted on-chip synapse. A universal serial bus (USB) interface chip communicates spikes to and from a computer, for external input and 3 Figure 3: Traveling-wave activity in the oscillation ring. A Population activity (5ms bins) of a pool of eighteen (adjacent) oscillation neurons. B Increasing the strength of feedforward excitation led to increasing frequencies of periodic firing in the θ and α range (1-10 Hz). Strength of excitation is the amplitude change in post-synaptic conductance due to a single pre-synaptic spike (measured relative to minimum amplitude used). data analysis, respectively. Simulations on the chip occur in real-time, making it an attractive option for implementing the model. We configured the following parameters: • Magnitude of a potassium M-current: increasing this current’s magnitude increased the post-spike repolarization time of the membrane potential, thereby constraining spiking to a single time bin per cycle. • The strength of excitatory and inhibitory synapses: a correct balance had to be established between excitation and inhibition to make only a small subset of neurons in the projection rings fire at a time—too much excitation led to widespread firing and too much inhibition led to neurons that were entirely silent or fired sporadically. • The space constant of inhibitory spread: increasing the spread was effective in preventing runaway excitation, which could occur due to the recurrent excitatory connections. 
We were able to create a stable traveling wave of background activity within the oscillation ring. We transiently stimulated a small subset of the neurons, which initiated a wave of activity that propagated in a stable manner around the ring after the transient external stimulation had ceased (Figure 3A). The network frequency determined from a Fourier transform of the network activity smoothed with a non-causal Gaussian kernel (FDHM = 80ms) was 7.4Hz. The frequency varied with the strength of the neurons’ excitatory connections (Figure 3B), measured as the amplitude of the step increase in membrane conductivity due to the arrival of a pre-synaptic spike. Over much of the range of the synaptic strengths tested, we observed stable oscillations in the θ and α bands (1-10Hz); the frequency appeared to increase logarithmically with synaptic strength. 4 Phase-based Encoding and Decoding In order to assess the best-case performance of the model, the background activity in the input and output projection rings was derived from the input oscillation ring. Their spikes were delivered to the appropriately circularly-shifted output oscillation neurons. The asymmetric feedforward connections were disabled in the output oscillation ring. For instance, in order to achieve a circular shift by k pools (i.e. mapping input projection pool 1 to output projection pool k + 1, input pool 2 to output pool k + 2, and so on), activity from the input oscillation neurons closest to input pool 1 was fed into the output oscillation neurons closest to output pool k. By providing the appropriate phase difference between input and output oscillation, we were able to assess the performance of the model under ideal conditions. In the Discussion section, we discuss a biologically plausible mechanism to control the relative phases. 4 Figure 4: Phase-based encoding. Rasters indicating activity of projection pools in 1ms bins, and mean phase of firing (×’s) for each pool (relative to arbitrary zero time). The abscissa shows firing time normalized by the period of oscillation (which may be converted to firing phase by multiplication by 2π). Under constant input to the input projection ring, the input pools fire approximately in sequence. Two cycles of pool activity normalized by maximum firing rate for each pool are shown in left inset (for clarity, pools 1-6 are shown in the top panel and pools 7-12 are shown separately in the bottom panel); phase of background inhibition of pool 4 is shown (below) for reference. Phase-aligned average1 of activity (right inset) showed that the firing times were relatively tight and uniform across pools: a standard deviation of 0.0945 periods, or equivalently, a spread of 1.135 pools at any instant of time. We verified that the input projection pools fired in a phase-shifted fashion relative to one another, a property critical for accurate encoding (see Figure 2). We stimulated all pools in the input projection ring simultaneously while the input oscillation ring provided a periodic wave of background inhibition. The mean phase of firing for each pool (relative to arbitrary zero time) increased nearly linearly with pool number, thereby providing evidence for accurate, phase-based encoding (Figure 4). The firing times of all pools are shown for two cycles of background oscillatory activity (Figure 4 left inset). A phase-aligned average1 showed that the timing was relatively tight (standard deviation 1.135 pools) and uniform across pools of neurons (Figure 4 right inset). 
We then characterized the system’s ability to correctly decode this encoding under a given circular shift. The shift was set to seven pools, mapping input pool 1 to output pool 8, and so on. Each input pool was stimulated in turn. We expected to see only the appropriately shifted output pool become highly active. In fact, not only was this pool active, but other pools around it were also active, though to a lesser extent (Figure 5A). Thus, the phase-encoded input was decoded successfully, and circularly shifted, except that the output units were broadly tuned. To quantify the overall precision of encoding and decoding, we constructed an input-locked average of the tuning curves (Figure 5B): the curves were circularly shifted to the left by an amount corresponding to the stimulated input pool number, and the raw pool firing rates were averaged. If the phase-based encoding and decoding were perfect, the peak should occur at a shift of 7 pools. 1 The phase-aligned average was constructed by shifting the pool-activity curves by the (# of the pool) × 1 ( 12 of the period) to align activity across pools, which was then averaged. 5 Figure 5: Decoding phase-encoded input. A In order to assess decoding performance under a given circular shift (here 7 pools) each input pool was stimulated in turn and activity in each output pool was recorded and averaged over 500ms. The pool’s response, normalized by its maximum firing rate, is plotted for each stimulated input pool (arrows pointing to curves, color code as in Figure 4). Each input pool stimulation trial consistently resulted in peak activity in the appropriate output pool; however, adjacent pools were also active, but to a lesser extent, resulting in a broad tuning curve. B The best-fit Gaussian (dot-dashed grey curve, σ = 2.30 pools) to the input-locked average of the raw pool firing rates (see text for details) revealed a maximum between a shift of 7 and 8 pools (inverted grey triangle; expected peak at a shift of 7 pools). Indeed, the highest (average) firing rate corresponded to a shift of 7 pools. However, the activity corresponding to a shift of 8 pools was nearly equal to that of 7 pools, and the best fitting Gaussian curve to the activity histogram (grey dot-dashed line) peaked at a point between pools 7 and 8 (inverted grey triangle). The standard deviation (σ) was 2.30 pools, versus the expected ideal σ of 1.60, which corresponds to the encoding distribution (σ = 1.135 pools) convolved with itself. 5 Discussion We have demonstrated a biologically plausible mechanism for the dynamic routing of information in time that obviates the need for precise gating of connections. This mechanism requires that a wave of activity propagate around pools of neurons arranged in a ring. While previous work has described traveling waves in a ring of neurons [13], and a double ring architecture (for determining head-direction) [14], our work combines these two features (twin rings with phase-shifted traveling waves) to achieve dynamic routing. These features of the model are found in the cortex: Bonhoeffer and Grinwald [15] describe iso-orientation columns in the cat visual cortex that are arranged in ring-like pinwheel patterns, with orientation tuning changing gradually around the pinwheel center. Moreover, Rubino et al. [16] have shown that coherent oscillations can propagate as waves across the cortical surface in the motor cortex of awake, behaving monkeys performing a delayed reaching task. Our solution for CSI is also applicable to music perception. 
In the Western twelve-tone, equaltemperament tuning system (12-tone scale), each octave is divided into twelve logarithmicallyspaced notes. Human observers are known to construct mental representations for raw notes that are invariant of the (perceived) key of the music: a note of C heard in the key of C-Major is perceptually equivalent to the note C# heard in the key of C#-Major [8,17]. In previous dynamic routing models of key invariance, the tonic—the first note of the key (e.g., C is the tonic of C-Major)— supplies the equivalent where information used by routing units that gate precise connections to map the raw note into a key-invariant output representation [17]. To achieve key invariance in our model, the bottom tier encodes raw note information while the top tier decodes key-invariant notes (Figure 6). The middle tier receives the tonic information and aligns the phase of the first output pool (whose invariant representation corresponds to the tonic) with the appropriate input pool (whose raw note representation corresponds to the tonic of the perceived key). 6 Figure 6: Phase-based dynamic routing to achieve key-invariance. The input (bottom) tier encodes raw note information, and the output (top) tier decodes key-invariant information. The routing (middle) tier sets the phase of the background wave activity in the input and output oscillation rings (dashed arrows) such that the first output pool is in phase with the input pool representing the note corresponding to the tonic. On the left, where G is the tonic, input pool G, output pool 1, and the routing tier are in phase with one another (black clocks), while input pool C and output pool 6 are in phase with one another (grey clocks). Thus, the raw note input, G, activates the invariant output 1, which corresponds to the perceived tonic invariant representation (heavy solid arrows). On the right, the same raw input note, G, is active, but the key is different and A is now the active tonic; thus the raw input, G, is now mapped to output pool 11. The tonic information is supplied to a specific pool in the routing ring according to the perceived key. This pool projects directly down to the input pool corresponding to the tonic. This ensures that the current tonic’s input pool is excitable in the same time bin as the first output pool. Each of the remaining raw input notes of the octave is mapped by time binning to the corresponding key-invariant representation in the output tier, as the phases of input pools are all shifted by the same amount. Supporting evidence for phase-based encoding of note information comes from MEG recordings in humans: the phase of the MEG signal (predominantly over right hemispheric sensor locations) tracks the note of the heard note sequence with surprising accuracy [18]. The input and output tiers’ periods must be kept in lock-step, which can be accomplished through more plausible means than employed in the current implementation of this model. Here, we maintained a fixed phase shift between the input and output oscillation rings by feeding activity from the input oscillation ring to the appropriately shifted pool in the output oscillation ring. This approach allowed us to avoid difficulties achieving coherent oscillations at identical frequencies in the input and output oscillation rings. 
Alternatively, entrainment could be achieved even when the frequencies are not identical—a more biologically plausible scenario—if the routing ring resets the phase of the input and output rings on a cycle-by-cycle basis. Lakatos et al. [19] have shown that somatosensory inputs can reset the phase of ongoing neuronal oscillations in the primary auditory cortex (A1), which helps in the generation of a unified auditory-tactile percept (the so-called “Hearing-Hands Effect”). A simple extension to our model can reduce the number of connections below the requirements of traditional dynamic routing models. Instead of having all-to-all connections between the input and output layers, a relay layer of very few (M N ) neurons could be used to transmit the spikes form the input neurons to the output neurons (analogous to the single wire connecting encoder and decoder in Figure 1B). A small number of (or ideally even one) relay neurons suffices because encoding and decoding occur in time. Hence, the connections between each input pool and the relay neurons require O(M N ) ≈ O(N ) connections (as long as M does not scale with N ) and those between the relay neurons and each output pool require O(M N) ≈ O(N ) connections as well. Thus, by removing all-to-all connectivity between the input and output units (a standard feature in traditional dynamic routing models), the number of required connections is reduced from O(N 2 ) 7 to O(N ). Further, by replacing the strict pool boundaries with nearest neighbor connectivity in the projection rings, the proposed model can accommodate a continuum of rotation angles. In summary, we propose that the mechanism of dynamic routing through neuronal coherence could be a general mechanism that could be used by multiple sensory and motor modalities in the neocortex: it is particularly suitable for placing raw information in an appropriate context (defined by the routing tier). Acknowledgments DS was supported by a Stanford Graduate Fellowship and BP was supported under a National Science Foundation Graduate Research Fellowship. References [1] Olshausen B.A., Anderson C.H. & Van Essen D.C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience 13(11):47004719. [2] Wiskott L. (2004). How does our visual system achieve shift and size invariance? In J.L. van Hemmen & T.J. Sejnowski (Eds.), 23 Problems in Systems Neuroscience, Oxford University Press. [3] Fukushima K., Miyake S. & Ito T. (1983). A neural network model for a mechanism of visual pattern recognition. IEEE Transactions on Systems, Man and Cybernetics 13:826-834. [4] Mel B.W., Ruderman D.L & Archie K.A. (1998). Translation invariant orientation tuning in visual “complex” cells could derive from intradendritic computations. Journal of Neuroscience 18(11):4325-4334. [5] McKone, E. & Grenfell, T. (1999). Orientation invariance in naming rotated objects: Individual differences and repetition priming. Perception and Psychophysics, 61:1590-1603. [6] Harris IM & Dux PE. (2005). Orientation-invariant object recognition: evidence from repetition blindness. Cognition, 95(1):73-93. [7] Naval Electrical Engineering Training Series (NEETS). Module 17, Radio-Frequency Communication Principles, Chapter 3, pp.32. Published online at http://www.tpub.com/content/neets/14189 (Integrated Publishing). [8] Krumhansl C.L. (1990). Cognitive foundations of musical pitch. Oxford University Press, 1990. [9] Fries P. (2005). 
A mechanism for cognitive dynamics: neuronal communication through neuronal coherence. Trends in Cognitive Sciences 9(10):474-480. [10] Buzsaki G. & Draguhn A. (2004). Neuronal Oscillations in Cortical Networks. Science 304(5679):19261929. [11] Sejnowski T.J. & Paulsen O. (2006). Network oscillations: Emerging computational principles. Journal of Neuroscience 26(6):1673-1676. [12] Arthur J.A. & Boahen K. (2005). Learning in Silicon: Timing is Everything. Advances in Neural Information Processing Systems 17, B Sholkopf and Y Weiss, Eds, MIT Press, 2006. [13] Hahnloser R.H.R., Sarpeshkar R., Mahowald M.A., Douglas R.J., & Seung H.S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405:947-951. [14] Xie X., Hahnloser R.H.R., & Seung H.S (2002). Double-ring network modeling of the head-direction system. Phys. Rev. E66 041902:1-9. [15] Bonhoeffer K. & Grinwald A. (1991). Iso-orientation domains in cat visual cortex are arranged in pinwheel-like patterns. Nature 353:426-437. [16] Rubino D., Robbins K.A. & Hastopoulos N.G. (2006). Propagating waves mediate information transfer in the motor cortex. Nature Neuroscience 9:1549-1557. [17] Bharucha J.J. (1999). Neural nets, temporal composites and tonality. In D. Deutsch (Ed.), The Psychology of Music (2d Ed.) Academic Press, New York. [18] Patel A.D. & Balaban E. (2000). Temporal patterns of human cortical activity reflect tone sequence structure. Nature 404:80-84. [19] Lakatos P., Chen C., O’Connell M., Mills A. & Schroeder C. (2007). Neuronal oscillations and multisensory interaction in primary auditory cortex. Neuron 53(2):279-292. 8
same-paper 2 0.7876367 149 nips-2007-Optimal ROC Curve for a Combination of Classifiers
Author: Marco Barreno, Alvaro Cardenas, J. D. Tygar
Abstract: We present a new analysis for the combination of binary classifiers. Our analysis makes use of the Neyman-Pearson lemma as a theoretical basis to analyze combinations of classifiers. We give a method for finding the optimal decision rule for a combination of classifiers and prove that it has the optimal ROC curve. We show how our method generalizes and improves previous work on combining classifiers and generating ROC curves. 1
3 0.57063782 94 nips-2007-Gaussian Process Models for Link Analysis and Transfer Learning
Author: Kai Yu, Wei Chu
Abstract: This paper aims to model relational data on edges of networks. We describe appropriate Gaussian Processes (GPs) for directed, undirected, and bipartite networks. The inter-dependencies of edges can be effectively modeled by adapting the GP hyper-parameters. The framework suggests an intimate connection between link prediction and transfer learning, which were traditionally two separate research topics. We develop an efficient learning algorithm that can handle a large number of observations. The experimental results on several real-world data sets verify superior learning capacity. 1
4 0.56782699 138 nips-2007-Near-Maximum Entropy Models for Binary Neural Representations of Natural Images
Author: Matthias Bethge, Philipp Berens
Abstract: Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model, that makes this type of analysis feasible for very high-dimensional data—the model parameters can be derived in closed form and sampling is easy. Therefore, our NearMaxEnt approach can serve as a tool for testing predictions from a pairwise maximum entropy model not only for low-dimensional marginals, but also for high dimensional measurements of more than thousand units. We demonstrate its usefulness by studying natural images with dichotomized pixel intensities. Our results indicate that the statistics of such higher-dimensional measurements exhibit additional structure that are not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics surprisingly well up to the limit of dimensionality where estimation of the full joint distribution is feasible. 1
5 0.56684154 18 nips-2007-A probabilistic model for generating realistic lip movements from speech
Author: Gwenn Englebienne, Tim Cootes, Magnus Rattray
Abstract: The present work aims to model the correspondence between facial motion and speech. The face and sound are modelled separately, with phonemes being the link between both. We propose a sequential model and evaluate its suitability for the generation of the facial animation from a sequence of phonemes, which we obtain from speech. We evaluate the results both by computing the error between generated sequences and real video, as well as with a rigorous double-blind test with human subjects. Experiments show that our model compares favourably to other existing methods and that the sequences generated are comparable to real video sequences. 1
6 0.5666005 6 nips-2007-A General Boosting Method and its Application to Learning Ranking Functions for Web Search
7 0.56536853 69 nips-2007-Discriminative Batch Mode Active Learning
9 0.56459552 209 nips-2007-Ultrafast Monte Carlo for Statistical Summations
10 0.56380641 63 nips-2007-Convex Relaxations of Latent Variable Training
11 0.56378472 45 nips-2007-Classification via Minimum Incremental Coding Length (MICL)
12 0.56266248 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
13 0.56235617 128 nips-2007-Message Passing for Max-weight Independent Set
14 0.5622052 187 nips-2007-Structured Learning with Approximate Inference
15 0.5618636 86 nips-2007-Exponential Family Predictive Representations of State
16 0.56060475 186 nips-2007-Statistical Analysis of Semi-Supervised Regression
17 0.559874 93 nips-2007-GRIFT: A graphical model for inferring visual classification features from human data
18 0.55969071 92 nips-2007-Fixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
19 0.55965364 148 nips-2007-Online Linear Regression and Its Application to Model-Based Reinforcement Learning
20 0.55923361 212 nips-2007-Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes