nips nips2004 nips2004-162 knowledge-graph by maker-knowledge-mining

162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction


Source: pdf

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. [sent-6, score-0.098]

2 Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements x_i of x. [sent-7, score-0.223]

3 Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. [sent-10, score-0.302]

4 In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. [sent-11, score-0.119]

5 In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. [sent-12, score-0.412]

6 1 Introduction Conditional random fields (CRFs) are a recently-introduced formalism [12] for representing a conditional model Pr(y|x), where both x and y have non-trivial structure (often sequential). [sent-13, score-0.067]

7 Recall that semi-Markov chain models extend hidden Markov models (HMMs) by allowing each state s_i to persist for a non-unit length of time d_i. [sent-15, score-0.136]

8 After this time has elapsed, the system will transition to a new state s′, which depends only on s_i; however, during the “segment” of time between i and i + d_i, the behavior of the system may be non-Markovian. [sent-16, score-0.094]

9 We also argue that segments often have a clear intuitive meaning, and hence semi-CRFs are more natural than conventional CRFs. [sent-20, score-0.236]

10 We focus here on named entity recognition (NER), in which a segment corresponds to an extracted entity; however, similar arguments might be made for several other tasks, such as gene-finding [11] or NP-chunking [16]. [sent-21, score-0.521]

11 In NER, a semi-Markov formulation allows one to easily construct entity-level features (such as “entity length” and “similarity to other known entities”) which cannot be easily encoded in CRFs. [sent-22, score-0.09]

12 Experiments on five different NER problems show that semi-CRFs often outperform conventional CRFs. [sent-23, score-0.103]

13 For instance, in the NER application, x might be a sequence of words, and y might be a sequence in {I, O}^{|x|}, where y_i = I indicates “word x_i is inside a name” and y_i = O indicates the opposite. [sent-29, score-0.157]

14 For example, in NER, the components of f might include the measurement f^{13}(i, x, y) = [[x_i is capitalized]] · [[y_i = I]], where the indicator function [[c]] = 1 if c is true and zero otherwise; this implies that F^{13}(x, y) would be the number of capitalized words x_i paired with the label I. [sent-35, score-0.251]
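
As an illustration only, a minimal Python sketch of such a local indicator feature and its aggregated version (the function names are assumptions, and positions are 0-based here):

def f13(i, x, y):
    # [[x_i is capitalized]] * [[y_i = I]]
    return 1 if x[i][:1].isupper() and y[i] == "I" else 0

def F13(x, y):
    # Aggregated feature: number of capitalized words paired with the label I.
    return sum(f13(i, x, y) for i in range(len(x)))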

15 Let s = s_1, . . . , s_p denote a segmentation of x, where segment s_j = (t_j, u_j, y_j) consists of a start position t_j, an end position u_j, and a label y_j ∈ Y. [sent-40, score-1.167]

16 Conceptually, a segment means that the tag y_j is given to all x_i's between i = t_j and i = u_j, inclusive. [sent-41, score-0.645]

17 We assume segments have positive length, and completely cover the sequence 1..|x| without overlapping (assuming that non-entity words are placed in unit-length segments, as we do below). [sent-42, score-0.223]

18 That is, t_j and u_j always satisfy t_1 = 1, u_p = |x|, 1 ≤ t_j ≤ u_j ≤ |x|, and t_{j+1} = u_j + 1. [sent-45, score-0.707]

19 For NER, a valid segmentation of the sentence “I went skiing with Fernando Pereira in British Columbia” might be s = (1, 1, O), (2, 2, O), (3, 3, O), (4, 4, O), (5, 6, I), (7, 7, O), (8, 9, I), corresponding to the label sequence y = O, O, O, O, I, I, O, I, I. [sent-46, score-0.187]
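
The segmentation notation can be made concrete with a small sketch (hypothetical helper names; positions are 1-based and inclusive, as in the text above):

def check_segmentation(segments, n):
    # Each segment is a (t, u, y) triple; segments must cover 1..n without overlapping.
    assert segments[0][0] == 1 and segments[-1][1] == n
    for t, u, _ in segments:
        assert 1 <= t <= u <= n
    for (_, u, _), (t_next, _, _) in zip(segments, segments[1:]):
        assert t_next == u + 1

def to_label_sequence(segments, n):
    # Expand a segmentation into the per-word label sequence y.
    labels = [None] * (n + 1)   # index 0 unused; positions are 1-based
    for t, u, y in segments:
        for i in range(t, u + 1):
            labels[i] = y
    return labels[1:]

words = "I went skiing with Fernando Pereira in British Columbia".split()
s = [(1, 1, "O"), (2, 2, "O"), (3, 3, "O"), (4, 4, "O"),
     (5, 6, "I"), (7, 7, "O"), (8, 9, "I")]
check_segmentation(s, len(words))
print(to_label_sequence(s, len(words)))   # ['O', 'O', 'O', 'O', 'I', 'I', 'O', 'I', 'I']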

20 We now assume a vector g of segment feature functions g = g^1, . . . [sent-47, score-0.273]

21 We also make a restriction on the features, analogous to the usual Markovian assumption made in CRFs, and assume that every component g^k of g is a function only of x, s_j, and the label y_{j−1} associated with the preceding segment s_{j−1}. [sent-51, score-0.361]

22 In other words, we assume that every g^k(j, x, s) can be rewritten as g^k(j, x, s) = g′^k(y_j, y_{j−1}, x, t_j, u_j) (2) for an appropriately defined g′^k. [sent-52, score-0.278]

23 In the rest of the paper, we will drop the g′ notation and use g for both versions of the segment-level feature functions. [sent-53, score-0.097]

24 2 An efficient inference algorithm The inference problem for a semi-CRF is defined as follows: given W and x, find the best segmentation, argmax_s Pr(s|x, W), where Pr(s|x, W) is defined by Equation 3. [sent-56, score-0.203]

25 An efficient inference algorithm is suggested by Equation 2, which implies that argmax_s Pr(s|x, W) = argmax_s W · G(x, s) = argmax_s W · Σ_j g(y_j, y_{j−1}, x, t_j, u_j). Let L be an upper bound on segment length. [sent-57, score-0.839]

26 Let s_{i:y} denote the set of all partial segmentations starting from 1 (the first index of the sequence) to i, such that the last segment has the label y and ending position i. [sent-58, score-0.369]

27 Letting V(i, y) denote the largest value of W · G(x, s′) over all s′ ∈ s_{i:y}, the recursion is V(i, y) = max_{d=1..L, y′} [ V(i − d, y′) + W · g(y, y′, x, i − d + 1, i) ] if i > 0, V(i, y) = 0 if i = 0, and V(i, y) = −∞ if i < 0 (4). The best segmentation then corresponds to the path traced by max_y V(|x|, y). [sent-63, score-0.109]
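
A sketch of this recursion as a semi-Markov Viterbi search follows. Here segment_score(y, y_prev, t, u) stands in for W · g(y, y′, x, t, u) (the function name and calling convention are assumptions for illustration), positions are 1-based, and back-pointers recover the best segmentation:

def semi_markov_viterbi(n, labels, L, segment_score):
    NEG_INF = float("-inf")
    V = {(0, y): 0.0 for y in labels}      # V(0, y) = 0; i < 0 never arises because d <= i
    back = {}
    for i in range(1, n + 1):
        for y in labels:
            best, arg = NEG_INF, None
            for d in range(1, min(L, i) + 1):
                for y_prev in labels:
                    score = V[(i - d, y_prev)] + segment_score(y, y_prev, i - d + 1, i)
                    if score > best:
                        best, arg = score, (i - d, y_prev)
            V[(i, y)], back[(i, y)] = best, arg
    y = max(labels, key=lambda lab: V[(n, lab)])   # max_y V(|x|, y)
    i, segments = n, []
    while i > 0:
        j, y_prev = back[(i, y)]
        segments.append((j + 1, i, y))             # recovered segment (t, u, y)
        i, y = j, y_prev
    return list(reversed(segments))

The run time of this sketch is O(|x| · L · |Y|²) per sequence, roughly a factor of L over first-order CRF Viterbi, consistent with the "small constant factor" noted in the abstract.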

28 3 Semi-Markov CRFs vs order-L CRFs Since conventional CRFs need not maximize over possible segment lengths d, inference for semi-CRFs is more expensive. [sent-65, score-0.362]

29 Semi-CRFs are also a natural restriction, as it is often convenient to express features in terms of segments. [sent-73, score-0.09]

30 As an example, let d_j denote the length of a segment, and let µ be the average length of all segments with label I. [sent-74, score-0.366]

31 Now consider the segment feature g^{k1}(j, x, s) = (d_j − µ)² · [[y_j = I]]. [sent-75, score-0.273]

32 After training, the contribution of this feature toward Pr(s|x) associated with a length-d entity will be proportional to e^{w_{k1}·(d−µ)²}. [sent-76, score-0.302]

33 That is, it allows the learner to model a Gaussian distribution of entity lengths. [sent-78, score-0.241]

34 An exponential model for lengths could be implemented with the feature g^{k2}(j, x, s) = d_j · [[y_j = I]]. [sent-79, score-0.138]
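
Written in the g(y_j, y_{j−1}, x, t_j, u_j) convention of Equation 2, these two length features might look as follows (a sketch only; µ is passed in as a precomputed constant):

def g_k1(y, y_prev, x, t, u, mu):
    # Gaussian-style length feature: (d_j - mu)^2 * [[y_j = I]]
    d = u - t + 1
    return (d - mu) ** 2 if y == "I" else 0.0

def g_k2(y, y_prev, x, t, u):
    # Exponential-style length feature: d_j * [[y_j = I]]
    d = u - t + 1
    return float(d) if y == "I" else 0.0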

35 In contrast to the Gaussian-length feature above, g^{k2} is “equivalent to” a local feature function f(i, x, y) = [[y_i = I]], in the following sense: for every triple x, y, s, where y is the tag sequence for s, Σ_j g^{k2}(j, x, s) = Σ_i f(i, x, y). [sent-80, score-0.191]

36 Thus a semi-CRF model based on the single feature g^{k2} could also be represented by a conventional CRF. [sent-81, score-0.128]

37 In general, a semi-CRF model can be factorized in terms of an equivalent order-1 CRF model if and only if the sum of the segment features can be rewritten as a sum of local features. [sent-82, score-0.334]

38 4 Learning algorithm During training the goal is to maximize log-likelihood over a given training set T = {(x_ℓ, s_ℓ)}_{ℓ=1}^N. [sent-85, score-0.084]

39 Following the notation of Sha and Pereira [16], we express the log-likelihood over the training sequences as L(W) = Σ_ℓ log Pr(s_ℓ|x_ℓ, W) = Σ_ℓ (W · G(x_ℓ, s_ℓ) − log Z_W(x_ℓ)). [sent-86, score-0.111]

40 However, to compute the normalizer Z_W(x_ℓ) and the expected value of the features under the current weight vector, E_{Pr(s′|W)} G(x_ℓ, s′), we must use the Markov property of G to construct a dynamic programming algorithm similar to forward-backward. [sent-91, score-0.09]

41 We thus define α(i, y) as the value of Σ_{s′∈s_{i:y}} e^{W·G(s′, x)}, where again s_{i:y} denotes all segmentations from 1 to i ending at i and labeled y. [sent-92, score-0.09]

42 For the k-th component of G, let η^k(i, y) be the value of the sum Σ_{s′∈s_{i:y}} G^k(s′, x) e^{W·G(x, s′)}, restricted to the part of the segmentation ending at position i. [sent-96, score-0.12]
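
A sketch of the corresponding forward recursion for the normalizer Z_W(x); segment_score again stands in for W · g, a distinguished start label handles the left boundary (an implementation choice, not spelled out in the extracted text), and a real implementation would work in log space:

import math

def semi_crf_partition(n, labels, L, segment_score, start="<START>"):
    # alpha[(i, y)] sums exp(W . G) over partial segmentations ending at i with last label y.
    alpha = {(0, start): 1.0}
    for i in range(1, n + 1):
        for y in labels:
            total = 0.0
            for d in range(1, min(L, i) + 1):
                prev_labels = [start] if i - d == 0 else labels
                for y_prev in prev_labels:
                    total += alpha[(i - d, y_prev)] * math.exp(
                        segment_score(y, y_prev, i - d + 1, i))
            alpha[(i, y)] = total
    return sum(alpha[(n, y)] for y in labels)      # Z_W(x)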

43 Experiments with NER data Baseline algorithms and datasets In our experiments, we trained semi-CRFs to mark entity segments with the label I, and put non-entity words into unit-length segments with label O. [sent-99, score-0.782]

44 The first version, which we call CRF/1, labels words inside and outside entities with I and O, respectively. [sent-101, score-0.17]

45 The second version, called CRF/4, replaces the I tag with four tags B, E, C, and U , which depend on where the word appears in an entity [2]. [sent-102, score-0.376]
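
A sketch of that segment-to-tag conversion, assuming the usual reading of the four tags (B = first word of a multi-word entity, C = continuation, E = last word, U = a one-word entity; non-entity words keep O):

def segments_to_becu(segments):
    tags = []
    for t, u, y in segments:
        length = u - t + 1
        if y != "I":
            tags.extend(["O"] * length)
        elif length == 1:
            tags.append("U")
        else:
            tags.extend(["B"] + ["C"] * (length - 2) + ["E"])
    return tags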

46 We considered extraction of city names and state names from this corpus. [sent-105, score-0.381]

47 The Jobs corpus contains 73,330 words, and consists of 300 computer-related job postings [4]. [sent-106, score-0.081]

48 We considered extraction of company names and job titles. [sent-107, score-0.271]

49 The 18,121-word Email corpus contains 216 email messages taken from the CSPACE email corpus [10], which is mail associated with a 14-week, 277-person management game. [sent-108, score-0.278]

50 2 Features As features for CRF, we used indicators for specific words at location i, or locations within three words of i. [sent-112, score-0.348]

51 Following previous NER work [7], we also used indicators for capitalization/letter patterns (such as “Aa+” for a capitalized word, or “D” for a single-digit number). [sent-113, score-0.201]
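
One way such letter patterns can be computed (a sketch of the idea, not the authors' exact pattern set):

import re

def letter_pattern(word):
    # Map a word to a coarse shape, e.g. "Fernando" -> "Aa+", "7" -> "D".
    shape = ""
    for ch in word:
        if ch.isupper():
            shape += "A"
        elif ch.islower():
            shape += "a"
        elif ch.isdigit():
            shape += "D"
        else:
            shape += ch
    return re.sub(r"a+", "a+", shape)   # collapse runs of lowercase letters only, in this sketch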

52 As features for semi-CRFs, we used the same set of word-level features, as well as their logical extensions to segments. [sent-114, score-0.09]

53 Specifically, we used indicators for the phrase inside a segment and the capitalization pattern inside a segment, as well as indicators for words and capitalization patterns in 3-word windows before and after the segment. [sent-115, score-0.713]

54 We also used indicators for each segment length (d = 1, . . . , L). [sent-116, score-0.374]

55 In addition, we combined all word-level features with indicators for the beginning and end of a segment. [sent-119, score-0.21]

56 To exploit more of the power of semi-CRFs, we also implemented a number of dictionary-derived features, each of which was based on a different dictionary D and similarity function sim. [sent-120, score-0.303]

57 For each segment, writing x_{s_j} for the word sequence x_{t_j} . . . x_{u_j}, a dictionary feature is defined as g^{D,sim}(j, x, s) = max_{u∈D} sim(x_{s_j}, u). [sent-124, score-0.463]

58 That is, the distance from the word sequence x_{s_j} to the closest element in D. [sent-126, score-0.243]
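
A sketch of such a segment-level dictionary feature using Jaccard similarity over word sets (one of the three similarity metrics mentioned below; the helper names are illustrative):

def jaccard(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def g_dict(segment_tokens, dictionary):
    # max_{u in D} sim(x_{s_j}, u): similarity of the segment to its closest dictionary entry.
    return max((jaccard(segment_tokens, u.split()) for u in dictionary), default=0.0)

print(g_dict(["British", "Columbia"], ["British Columbia", "New Delhi"]))   # 1.0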

59 For each of the extraction problems, we assembled one external dictionary of strings, which were similar (but not identical) to the entity names in the documents. [sent-127, score-0.814]

60 For instance, for city names in the Address data, we used a web page listing cities in India. [sent-128, score-0.176]

61 Due to variations in the way entity names are written, rote matching these dictionaries to the data gives relatively low F1 values, ranging from 22% (for the job-title extraction task) to 57% (for the person-name task). [sent-129, score-0.553]

62 We used three different similarity metrics (Jaccard, TFIDF, and JaroWinkler) which are known to work well for name-matching in data integration tasks [5]. [sent-130, score-0.106]

63 That is, the distance-based segment features cannot be decomposed into sums of local features. [sent-133, score-0.302]

64 We also extended the semi-CRF algorithm to construct, on the fly, an internal segment dictionary of segments labeled as entities in the training data. [sent-135, score-0.881]

65 To make measurements on training data similar to those on test data, when finding the closest neighbor of x_{s_j} in the internal dictionary, we excluded all strings formed from x, thus excluding matches of x_{s_j} to itself (or subsequences of itself). [sent-136, score-0.52]
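
A sketch of that internal-dictionary lookup with self-exclusion; tracking a source-document id for each stored entity string is an assumption made here for illustration:

def internal_dict_feature(segment_tokens, doc_id, internal_dict, sim):
    # internal_dict: list of (entity_string, source_doc_id) pairs built from training segments.
    candidates = [entity for entity, src in internal_dict if src != doc_id]
    return max((sim(segment_tokens, entity.split()) for entity in candidates), default=0.0)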

66 For completeness in the experiments, we also evaluated local versions of the dictionary features. [sent-138, score-0.339]

67 Specifically, we constructed dictionary features of the form f^{D,sim}(i, x, y) = max_{u∈D} sim(x_i, u), where D is either the external dictionary used above, or an internal word dictionary formed from all words contained in entities. [sent-139, score-1.423]

68 As before, words in x were excluded in finding near neighbors to xi . [sent-140, score-0.107]

69 3 Results and Discussion We evaluated the F1-measure performance of CRF/1, CRF/4, and semi-CRFs, with and without internal and external dictionaries. [sent-142, score-0.202]

70 Gaussian priors were used for all algorithms, and for semi-CRFs, a fixed value of L was chosen for each dataset based on observed entity lengths. [sent-146, score-0.241]

71 In the baseline configuration in which no dictionary features are used, semi-CRFs perform well. (F1 is defined as 2*precision*recall/(precision+recall).) [sent-148, score-0.427]
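
A direct transcription of that definition:

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0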

72 Figure 1 plots F1 span accuracy against the fraction of available training data for the Address_State, Address_City, and Email_Person tasks, comparing CRF/4, SemiCRF+int, CRF/4+dict, and SemiCRF+int+dict. [sent-149, score-0.114]

73 Figure 1: F1 as a function of training set size. [sent-179, score-0.084]

74 Algorithms marked with “+dict” include external dictionary features, and algorithms marked with “+int” include internal dictionary features. [sent-180, score-0.808]

75 We do not use internal dictionary features for CRF/4 since they lead to reduced accuracy. [sent-181, score-0.496]

76 Table 1 reports baseline F1 for CRF/1, CRF/4, and semi-CRF on the state, title, person, city, and company extraction tasks, and F1 with ∆base for the +internal dict, +external dict, and +both dictionaries configurations (∆extern is measured relative to the external-dictionary configuration). [sent-182, score-1.593]

77 Table 1: Comparing various methods on five IE tasks, with and without dictionary features. [sent-302, score-0.303]

78 When internal dictionary features are used, the performance of semi-CRFs is often improved, and never degraded by more than 2. [sent-306, score-0.496]

79 However, the less-natural local version of these features often leads to substantial performance losses for CRF/1 and CRF/4. [sent-308, score-0.09]

80 Semi-CRFs perform best on nine of the ten task variants for which internal dictionaries were used. [sent-309, score-0.244]

81 The external-dictionary features are helpful to all the algorithms. [sent-310, score-0.09]

82 Semi-CRFs perform best on three of five tasks in which only external dictionaries were used. [sent-311, score-0.28]

83 If we consider the tasks with and without external dictionary features as separate “conditions”, then semi-CRFs using all available information outperform both CRF variants on eight of ten “conditions”. [sent-313, score-0.568]

84 That is, the both-dictionary version when external dictionaries are available, and the internal-dictionary-only version otherwise. [sent-318, score-0.24]

85 The primary difference is that the “inner model” for semi-CRFs is of short, uniformly-labeled segments with non-Markovian properties, while nested HMMs allow longer, diversely-labeled, Markovian “segments”. [sent-344, score-0.207]

86 However, for these methods, inference is not tractable, and hence approximations must be made in training and classification. [sent-349, score-0.094]

87 An interesting question for future research is whether the tractable extension to CRF inference considered here can be used to improve inference methods for more expressive models. [sent-350, score-0.218]

88 In recent prior work [6], we investigated semi-Markov learning methods for NER based on a voted perceptron training algorithm [7]. [sent-351, score-0.195]

89 The voted perceptron has some advantages in ease of implementation and efficiency. [sent-352, score-0.153]

90 (In particular, the voted perceptron algorithm is more stable numerically, as it does not need to compute a partition function.) [sent-353, score-0.153]

91 Probabilistically-grounded approaches like CRFs are also preferable to margin-based approaches like the voted perceptron in certain settings, e.g. [sent-355, score-0.153]

92 A major advantage of semi-CRFs is that they allow features which measure properties of segments, rather than individual elements. [sent-359, score-0.09]

93 For applications like NER and gene-finding [11], these features can be quite natural. [sent-360, score-0.09]

94 net, and an NER package using this package is available on http://minorthird. [sent-363, score-0.08]

95 Exploiting diverse knowledge sources via maximum entropy in named entity recognition. [sent-380, score-0.309]

96 Bottom-up relational learning of pattern matching rules for information extraction. [sent-395, score-0.069]

97 Exploiting dictionaries in named entity extraction: Combining semi-markov extraction processes and data integration methods. [sent-409, score-0.534]

98 Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. [sent-413, score-0.109]

99 Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. [sent-461, score-0.437]

100 Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. [sent-480, score-0.121]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dictionary', 0.303), ('ner', 0.27), ('crfs', 0.27), ('dict', 0.244), ('entity', 0.241), ('segment', 0.212), ('semicrf', 0.19), ('segments', 0.169), ('zw', 0.163), ('uj', 0.151), ('dictionaries', 0.141), ('xsj', 0.135), ('ew', 0.129), ('tj', 0.127), ('indicators', 0.12), ('pr', 0.112), ('yj', 0.112), ('crf', 0.108), ('int', 0.108), ('internal', 0.103), ('external', 0.099), ('argmax', 0.099), ('features', 0.09), ('city', 0.089), ('names', 0.087), ('voted', 0.086), ('extraction', 0.084), ('capitalized', 0.081), ('epr', 0.081), ('title', 0.071), ('words', 0.069), ('email', 0.069), ('relational', 0.069), ('named', 0.068), ('label', 0.067), ('conditional', 0.067), ('conventional', 0.067), ('perceptron', 0.067), ('segmentation', 0.066), ('company', 0.066), ('metrics', 0.066), ('elds', 0.065), ('feature', 0.061), ('si', 0.06), ('expressive', 0.06), ('sequence', 0.054), ('extern', 0.054), ('sunita', 0.054), ('tractible', 0.054), ('cohen', 0.054), ('ending', 0.054), ('word', 0.054), ('markov', 0.053), ('inference', 0.052), ('entities', 0.052), ('person', 0.05), ('ve', 0.049), ('inside', 0.049), ('hmms', 0.048), ('corpus', 0.047), ('base', 0.047), ('sim', 0.047), ('capitalization', 0.047), ('management', 0.046), ('dj', 0.046), ('maxy', 0.043), ('tag', 0.043), ('training', 0.042), ('length', 0.042), ('sj', 0.042), ('india', 0.04), ('package', 0.04), ('restriction', 0.04), ('proceedings', 0.04), ('tasks', 0.04), ('span', 0.038), ('excluded', 0.038), ('tags', 0.038), ('nested', 0.038), ('versions', 0.036), ('sixth', 0.036), ('subsequences', 0.036), ('segmentations', 0.036), ('edmonton', 0.036), ('outperform', 0.036), ('job', 0.034), ('sha', 0.034), ('state', 0.034), ('measurement', 0.034), ('baseline', 0.034), ('discriminative', 0.033), ('pereira', 0.033), ('factorized', 0.032), ('sutton', 0.032), ('markovian', 0.031), ('viterbi', 0.031), ('conditionally', 0.031), ('lengths', 0.031), ('triple', 0.031), ('strings', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1

2 0.14040166 44 nips-2004-Conditional Random Fields for Object Recognition

Author: Ariadna Quattoni, Michael Collins, Trevor Darrell

Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.

3 0.13206841 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

Author: Maria-florina Balcan, Avrim Blum, Ke Yang

Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1

4 0.13034458 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

Author: Andrew McCallum, Ben Wellner

Abstract: Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing, databases, citation matching and many other tasks. This paper introduces several discriminative, conditional-probability models for coreference analysis, all examples of undirected graphical models. Unlike many historical approaches to coreference, the models presented here are relational—they do not assume that pairwise coreference decisions should be made independently from each other. Unlike other relational models of coreference that are generative, the conditional model here can incorporate a great variety of features of the input without having to be concerned about their dependencies—paralleling the advantages of conditional random fields over hidden Markov models. We present positive results on noun phrase coreference in two standard text data sets. 1

5 0.12625133 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester

Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1

6 0.11590319 192 nips-2004-The power of feature clustering: An application to object detection

7 0.10444531 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

8 0.074542709 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

9 0.072141178 205 nips-2004-Who's In the Picture

10 0.06586159 108 nips-2004-Markov Networks for Detecting Overlapping Elements in Sequence Data

11 0.0649084 82 nips-2004-Incremental Algorithms for Hierarchical Classification

12 0.062550455 54 nips-2004-Distributed Information Regularization on Graphs

13 0.060586449 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

14 0.058152828 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

15 0.057911366 163 nips-2004-Semi-parametric Exponential Family PCA

16 0.056979228 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

17 0.055047482 87 nips-2004-Integrating Topics and Syntax

18 0.05455742 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

19 0.05455 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

20 0.05377448 99 nips-2004-Learning Hyper-Features for Visual Identification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.193), (1, 0.043), (2, 0.003), (3, -0.062), (4, 0.072), (5, 0.115), (6, 0.069), (7, 0.094), (8, -0.052), (9, 0.006), (10, 0.012), (11, 0.019), (12, -0.132), (13, 0.149), (14, -0.019), (15, 0.041), (16, 0.068), (17, 0.097), (18, 0.071), (19, -0.042), (20, 0.0), (21, 0.047), (22, -0.121), (23, -0.129), (24, -0.126), (25, 0.135), (26, -0.029), (27, -0.014), (28, -0.155), (29, 0.07), (30, 0.002), (31, -0.099), (32, 0.064), (33, 0.038), (34, -0.06), (35, -0.048), (36, -0.055), (37, -0.058), (38, 0.001), (39, -0.04), (40, -0.123), (41, -0.051), (42, 0.092), (43, -0.056), (44, 0.109), (45, -0.148), (46, -0.028), (47, 0.114), (48, 0.092), (49, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94349098 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1

2 0.66773444 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

Author: Andrew McCallum, Ben Wellner

Abstract: Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing, databases, citation matching and many other tasks. This paper introduces several discriminative, conditional-probability models for coreference analysis, all examples of undirected graphical models. Unlike many historical approaches to coreference, the models presented here are relational—they do not assume that pairwise coreference decisions should be made independently from each other. Unlike other relational models of coreference that are generative, the conditional model here can incorporate a great variety of features of the input without having to be concerned about their dependencies—paralleling the advantages of conditional random fields over hidden Markov models. We present positive results on noun phrase coreference in two standard text data sets. 1

3 0.57543451 108 nips-2004-Markov Networks for Detecting Overlapping Elements in Sequence Data

Author: Mark Craven, Joseph Bockhorst

Abstract: Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. 1

4 0.55492133 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

Author: Maria-florina Balcan, Avrim Blum, Ke Yang

Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1

5 0.51546532 44 nips-2004-Conditional Random Fields for Object Recognition

Author: Ariadna Quattoni, Michael Collins, Trevor Darrell

Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.

6 0.48697913 205 nips-2004-Who's In the Picture

7 0.46516365 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

8 0.44542256 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

9 0.44295102 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

10 0.4347316 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)

11 0.43353704 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

12 0.43249953 192 nips-2004-The power of feature clustering: An application to object detection

13 0.4268429 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

14 0.41436878 122 nips-2004-Modelling Uncertainty in the Game of Go

15 0.40207416 109 nips-2004-Mass Meta-analysis in Talairach Space

16 0.4015812 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

17 0.39909682 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

18 0.3783837 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

19 0.37530681 193 nips-2004-Theories of Access Consciousness

20 0.33329237 74 nips-2004-Harmonising Chorales by Probabilistic Inference


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(9, 0.016), (13, 0.079), (15, 0.104), (17, 0.023), (26, 0.068), (31, 0.029), (32, 0.019), (33, 0.198), (35, 0.021), (39, 0.028), (50, 0.02), (53, 0.32)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94587159 35 nips-2004-Chemosensory Processing in a Spiking Model of the Olfactory Bulb: Chemotopic Convergence and Center Surround Inhibition

Author: Baranidharan Raman, Ricardo Gutierrez-osuna

Abstract: This paper presents a neuromorphic model of two olfactory signal-processing primitives: chemotopic convergence of olfactory receptor neurons, and center on-off surround lateral inhibition in the olfactory bulb. A self-organizing model of receptor convergence onto glomeruli is used to generate a spatially organized map, an olfactory image. This map serves as input to a lattice of spiking neurons with lateral connections. The dynamics of this recurrent network transforms the initial olfactory image into a spatio-temporal pattern that evolves and stabilizes into odor- and intensity-coding attractors. The model is validated using experimental data from an array of temperature-modulated gas sensors. Our results are consistent with recent neurobiological findings on the antennal lobe of the honeybee and the locust.

same-paper 2 0.80558419 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1

3 0.79096341 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes

Author: Yoshitatsu Matsuda, Kazunori Yamaguchi

Abstract: In this paper, linear multilayer ICA (LMICA) is proposed for extracting independent components from quite high-dimensional observed signals such as large-size natural scenes. There are two phases in each layer of LMICA. One is the mapping phase, where a one-dimensional mapping is formed by a stochastic gradient algorithm which makes more highlycorrelated (non-independent) signals be nearer incrementally. Another is the local-ICA phase, where each neighbor (namely, highly-correlated) pair of signals in the mapping is separated by the MaxKurt algorithm. Because LMICA separates only the highly-correlated pairs instead of all ones, it can extract independent components quite efficiently from appropriate observed signals. In addition, it is proved that LMICA always converges. Some numerical experiments verify that LMICA is quite efficient and effective in large-size natural image processing.

4 0.60780436 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

Author: Corinna Cortes, Mehryar Mohri

Abstract: In many applications, good ranking is a highly desirable performance for a classifier. The criterion commonly used to measure the ranking quality of a classification algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of confidence for its value. This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent, it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate. They are compared with previous approaches in several standard classification tasks demonstrating the benefits of our analysis.
Section 5 gives the reduced expressions for the expected value and variance of the AUC for a fixed error rate. These can be efficiently computed and used to determine our confidence intervals for the AUC (Section 6). Section 7 reports the result of the comparison of our method with previous approaches, including empirical results for several standard tasks. 2 Definition and Properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally introduced in signal detection theory [6] in connection with the study of radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [14, 13, 7, 12, 16, 3]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. The AUC is defined as the area under the ROC curve. Consider a binary classification task with m positive examples and n negative examples. Let C be a fixed classifier that outputs a strictly ordered list for these examples. Let x1 , . . . , xm be the output of C on the positive examples and y1 , . . . , yn its output on the negative examples and denote by 1X the indicator function of a set X. Then, the AUC, A, associated to C is given by: A= m i=1 n j=1 1xi >yj (1) mn which is the value of the Wilcoxon-Mann-Whitney statistic [10]. Thus, the AUC is closely related to the ranking quality of the classification. It can be viewed as a measure based on pairwise comparisons between classifications of the two classes. It is an estimate of the probability Pxy that the classifier ranks a randomly chosen positive example higher than a negative example. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC, and the expected AUC value for a random ranking is 0.5. 3 Overview of Related Work This section briefly describes some previous distribution-dependent approaches presented in the statistical literature to derive confidence intervals for the AUC and compares them to our method. The starting point for these analyses is a formula giving the variance of the AUC, A, for a fixed distribution of the scores Px of the positive examples and Py of the negative examples [10, 1]: 2 σA = A(1 − A) + (m − 1)(Pxxy − A2 ) + (n − 1)(Pxyy − A2 ) mn (2) where Pxxy is the probability that the classifier ranks two randomly chosen positive examples higher than a negative one, and Pxyy the probability that it ranks two randomly chosen negative examples lower than a positive one. To compute the variance exactly using Equation 2, the distributions Px and Py must be known. Hanley and McNeil [10] argue in favor of exponential distributions, loosely claiming that this upper-bounds the variance of normal distributions with various means and ratios of A 2A2 variances. They show that for exponential distributions Pxxy = 2−A and Pxyy = 1+A . 
The resulting confidence intervals are of course relatively tight, but their validity is questionable since they are based on a strong assumption about the distributions of the positive and negative scores that may not hold in many cases. An alternative considered by several authors to the exact computation of the variance is to determine instead the maximum of the variance over all possible continuous distributions with the same expected value of the AUC. For all such distributions, one can fix m and n and compute the expected AUC and its variance. The maximum variance is denoted by 2 σmax and is given by [5, 2]: 2 σmax = A(1 − A) 1 ≤ min {m, n} 4 min {m, n} (3) Unfortunately, this often yields loose confidence intervals of limited practical use. Our approach for computing the mean and variance of the AUC is distribution-independent and inspired by the machine learning literature where analyses typically center on the error rate. We require only that the error rate be measured and compute the mean and variance of the AUC over all distributions Px and Py that maintain the same error rate. Our approach is in line with that of [5, 2] but it crucially avoids considering the maximum of the variance. We show that it is possible to compute directly the mean and variance of the AUC assigning equal weight to all the possible distributions. Of course, one could argue that not all distributions Px and Py are equally probable, but since these distributions are highly problem-dependent, we find it risky to make any general assumption on the distributions and thereby limit the validity of our results. Our approach is further justified empirically by the experiments reported in the last section. 4 Combinatorial Analysis The analysis of the statistical properties of the AUC given a fixed error rate requires various combinatorial calculations. This section describes several of the combinatorial identities that are used in our computation of the confidence intervals. For all q ≥ 0, let Xq (k, m, n) be defined by: k M M Xq (k, m, n) = xq (4) x x x=0 where M = m − (k − x) + x, M = n + (k − x) − x, and x = k − x. In previous work, we derived the following two identities which we used to compute the expected value of the AUC [4]: k X0 (k, m, n) = x=0 n+m+1 x k X1 (k, m, n) = (k − x)(m − n) + k n + m + 1 2 x x=0 To simplify the expression of the variance of the AUC, we need to compute X2 (k, m, n). Proposition 1 Let k, m, n be non-negative integers such that k ≤ min{m, n}, then: k X2 (k, m, n) = P2 (k, m, n, x) x=0 m+n+1 x (5) where P2 is the following 4th-degree polynomial: P2 (k, m, n, x) = (k − x)/12(−2x3 + 2x2 (2m − n + 2k − 4) + x(−3m2 + 3nm + 3m − 5km − 2k 2 + 2 + k + nk + 6n) + (3(k − 1)m2 − 3nm(k − 1) + 6km + 5m + k 2 m + 8n + 8 − 9nk + 3k + k 2 + k 2 n)). Proof. 5 The proof of the proposition is left to a longer version of this paper. Expectation and Variance of the AUC This section presents the expression of the expectation and variance of the AUC for a fixed error rate k/N assuming that all classifications or rankings with k errors are equiprobable. For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are x = k − x false negative examples. The expression Xq discussed in the previous section represents the q-th moment of x over all classifications with exactly k errors. 
In previous work, we gave the exact expression of the expectation of the AUC for a fixed number of errors k:

Proposition 2 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC, A, over all classifications with k errors is given by:

$$E[A] = 1 - \frac{k}{m+n} - \frac{(n-m)^2(m+n+1)}{4mn}\left(\frac{k}{m+n} - \frac{\sum_{x=0}^{k-1}\binom{m+n}{x}}{\sum_{x=0}^{k}\binom{m+n+1}{x}}\right).$$

Note that the two sums in this expression cannot be further simplified since they are known not to admit a closed form [9]. We also gave the expression of the variance of the AUC in terms of the function F defined for all Y by:

$$F(Y) = \frac{\sum_{x=0}^{k}\binom{M}{x}\binom{M'}{x'}\,Y}{\sum_{x=0}^{k}\binom{M}{x}\binom{M'}{x'}}. \qquad (6)$$

The following proposition reproduces that result:

Proposition 3 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classifications with k errors is given by:

$$\sigma^2(A) = F\!\left(\Big(1 - \tfrac{\frac{k-x}{n} + \frac{x}{m}}{2}\Big)^2\right) - F\!\left(1 - \tfrac{\frac{k-x}{n} + \frac{x}{m}}{2}\right)^2 + F\!\left(\tfrac{mx^2 + n(k-x)^2 + (m(m+1)x + n(n+1)(k-x)) - 2x(k-x)(m+n+1)}{12 m^2 n^2}\right).$$

Because of the products of binomial terms, the computation of the variance using this expression is inefficient even for relatively small values of m and n. This expression can however be reduced using the identities presented in the previous section, which leads to the significantly more efficient computations that we have been using in all our experiments.

Corollary 1 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classifications with k errors is given by:

$$\sigma^2(A) = \frac{(m+n+1)(m+n)(m+n-1)T\big((m+n-2)Z_4 - (2m - n + 3k - 10)Z_3\big)}{72 m^2 n^2} + \frac{(m+n+1)(m+n)T\big(m^2 - nm + 3km - 5m + 2k^2 - nk + 12 - 9k\big)Z_2}{48 m^2 n^2} - \frac{(m+n+1)^2(m-n)^4 Z_1^2}{16 m^2 n^2} - \frac{(m+n+1)Q_1 Z_1}{72 m^2 n^2} + \frac{k Q_0}{144 m^2 n^2}$$

with:

$$Z_i = \frac{\sum_{x=0}^{k-i}\binom{m+n+1-i}{x}}{\sum_{x=0}^{k}\binom{m+n+1}{x}}, \qquad T = 3\big((m-n)^2 + m + n\big) + 2,$$

and:

$$Q_0 = (m+n+1)Tk^2 + \big((-3n^2 + 3mn + 3m + 1)T - 12(3mn + m + n) - 8\big)k + (-3m^2 + 7m + 10n + 3nm + 10)T - 4(3mn + m + n + 1)$$

$$Q_1 = Tk^3 + 3(m-1)Tk^2 + \big((-3n^2 + 3mn - 3m + 8)T - 6(6mn + m + n)\big)k + (-3m^2 + 7(m+n) + 3mn)T - 2(6mn + m + n)$$

Proof. The expression of the variance given in Proposition 3 requires the computation of X_q(k, m, n), q = 0, 1, 2. Using the identities giving the expressions of X_0 and X_1 and Proposition 1, which provides the expression of X_2, σ²(A) can be reduced to the expression given by the corollary.

6 Theory and Analysis

Our estimate of the confidence interval for the AUC is based on a simple and natural assumption. The main idea for its computation is the following. Assume that a confidence interval E = [e_1, e_2] is given for the error rate of a classifier C over a sample S, with the confidence level 1 − ε. This interval may have been derived from a binomial model of C, which is a standard assumption for determining a confidence interval for the error rate, or from any other model used to compute that interval. For a given error rate e ∈ E, or equivalently for a given number of misclassifications, we can use the expectation and variance computed in the previous section and Chebyshev's inequality to predict a confidence interval A_e for the AUC at the confidence level 1 − δ. Since our equiprobable model for the classifications is independent of the model used to compute the interval of confidence for the error rate, we can use E and A_e, e ∈ E, to compute a confidence interval of the AUC at the level (1 − ε)(1 − δ).
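Proposition 2 is easy to evaluate exactly. The sketch below is ours; it implements the expectation with exact binomial sums and checks two limiting cases (a perfect classifier, and balanced classes) that follow directly from the formula. The variance of Corollary 1 can be coded in the same way but is omitted here; in the later sketches it is assumed to be available as a function var_auc_fixed_k(k, m, n).

```python
from math import comb

def expected_auc_fixed_k(k, m, n):
    """E[A] over all classifications with exactly k errors (Proposition 2)."""
    num = sum(comb(m + n, x) for x in range(k))          # sum_{x=0}^{k-1} C(m+n, x)
    den = sum(comb(m + n + 1, x) for x in range(k + 1))  # sum_{x=0}^{k}   C(m+n+1, x)
    r = k / (m + n)
    return 1.0 - r - (n - m) ** 2 * (m + n + 1) / (4.0 * m * n) * (r - num / den)

# k = 0 gives E[A] = 1; for m = n the correction term vanishes and E[A] = 1 - k/(m+n).
assert expected_auc_fixed_k(0, 100, 50) == 1.0
assert abs(expected_auc_fixed_k(10, 80, 80) - (1 - 10 / 160)) < 1e-12
```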
Theorem 1 Let C be a binary classifier and let S be a data sample of size N with m positive examples and n negative examples, N = m + n. Let E = [e_1, e_2] be a confidence interval for the error rate of C over S at the confidence level 1 − ε. Then, for any δ, 0 ≤ δ ≤ 1, we can compute a confidence interval for the AUC value of the classifier C at the confidence level (1 − ε)(1 − δ) that depends only on ε, δ, m, n, and the interval E.

Proof. Let k_1 = N e_1 and k_2 = N e_2 be the numbers of errors associated with the error rates e_1 and e_2, and let I_K be the interval I_K = [k_1, k_2]. For a fixed k ∈ I_K, by Proposition 2 and Corollary 1, we can compute the exact value of the expectation E[A_k] and variance σ²(A_k) of the AUC A_k. Using Chebyshev's inequality, for any k ∈ I_K and any δ_k > 0,

$$P\left(|A_k - E[A_k]| \ge \frac{\sigma(A_k)}{\sqrt{\delta_k}}\right) \le \delta_k \qquad (7)$$

where E[A_k] and σ(A_k) are the expressions given in Proposition 2 and Corollary 1, which depend only on k, m, and n. Let α_1 and α_2 be defined by:

$$\alpha_1 = \min_{k \in I_K}\left(E[A_k] - \frac{\sigma(A_k)}{\sqrt{\delta_k}}\right) \qquad \alpha_2 = \max_{k \in I_K}\left(E[A_k] + \frac{\sigma(A_k)}{\sqrt{\delta_k}}\right) \qquad (8)$$

α_1 and α_2 only depend on I_K (i.e., on e_1 and e_2), and on δ_k, m, and n. Let I_A be the confidence interval defined by I_A = [α_1, α_2] and let δ_k = δ for all k ∈ I_K. Using the fact that the confidence interval E is independent of our equiprobability model for fixed-k AUC values and Bayes' rule:

$$P(A \in I_A) = \sum_{k \in \mathbb{R}^+} P(A \in I_A \mid K = k)\,P(K = k) \qquad (9)$$
$$\ge \sum_{k \in I_K} P(A \in I_A \mid K = k)\,P(K = k) \qquad (10)$$
$$\ge (1 - \delta)\sum_{k \in I_K} P(K = k) \ \ge\ (1 - \delta)(1 - \epsilon) \qquad (11)$$

where we used the property of Eq. 7 and the definitions of the intervals I_K and I_A. Thus, I_A constitutes a confidence interval for the AUC value of C at the confidence level (1 − ε)(1 − δ).

In practice, the confidence interval E is often determined as a result of the assumption that C follows a binomial law. This leads to the following theorem.

[Figure 1 (two panels, (a) and (b): standard deviation as a function of the AUC for the Max, Distribution-dependent, and Distribution-independent methods). Caption: Comparison of the standard deviations for three different methods with: (a) m = n = 500; (b) m = 400 and n = 200. The curves are obtained by computing the expected AUC and its standard deviations for different values of the error rate using the maximum-variance approach (Eq. 3), our distribution-independent method, and the distribution-dependent approach of Hanley [10].]

Theorem 2 Let C be a binary classifier, let S be a data sample of size N with m positive examples and n negative examples, N = m + n, and let k_0 be the number of misclassifications of C on S. Assume that C follows a binomial law. Then, for any ε, 0 ≤ ε ≤ 1, we can compute a confidence interval of the AUC value of the classifier C at the confidence level 1 − ε that depends only on ε, k_0, m, and n.

Proof. Assume that C follows a binomial law with coefficient p. Then, Chebyshev's inequality yields:

$$P(|C - E[C]| \ge \eta) \le \frac{p(1-p)}{N\eta^2} \le \frac{1}{4N\eta^2} \qquad (12)$$

Thus,

$$E = \left[\frac{k_0}{N} - \frac{1}{2\sqrt{(1-\sqrt{1-\epsilon})N}},\ \frac{k_0}{N} + \frac{1}{2\sqrt{(1-\sqrt{1-\epsilon})N}}\right]$$

forms a confidence interval for the error rate of C at the confidence level $1 - (1-\sqrt{1-\epsilon}) = \sqrt{1-\epsilon}$. By Theorem 1, we can compute for the AUC value a confidence interval at the level $(1 - (1 - \sqrt{1-\epsilon}))(1 - (1 - \sqrt{1-\epsilon})) = 1 - \epsilon$ depending only on ε, m, n, and the interval E, i.e., on k_0, N = m + n, and ε.
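The proof of Theorem 2 translates directly into a procedure: bound the error rate with Chebyshev's inequality, then scan the corresponding range of k and apply the per-k Chebyshev bound of Eq. (7). The sketch below is ours; it keeps δ_k constant as in the proof (the δ_k refinement mentioned below is omitted), reuses the expected_auc_fixed_k function sketched above, and assumes a function var_auc_fixed_k(k, m, n) implementing Corollary 1 is available.

```python
import math

def auc_confidence_interval(k0, m, n, eps, var_auc_fixed_k):
    """[alpha_1, alpha_2] for the AUC at confidence level 1 - eps (Theorem 2,
    Chebyshev version), given k0 observed errors on m positives and n negatives."""
    N = m + n
    delta = 1.0 - math.sqrt(1.0 - eps)         # risk split between the two stages
    eta = 1.0 / (2.0 * math.sqrt(delta * N))   # error-rate half-width from Eq. (12)
    e1, e2 = max(0.0, k0 / N - eta), min(1.0, k0 / N + eta)
    # Conservative I_K; the propositions assume k does not exceed min(m, n).
    k1, k2 = int(math.floor(e1 * N)), int(math.ceil(e2 * N))

    alpha1, alpha2 = float("inf"), float("-inf")
    for k in range(k1, k2 + 1):                # Eq. (8) with delta_k = delta
        mu = expected_auc_fixed_k(k, m, n)
        half = math.sqrt(var_auc_fixed_k(k, m, n) / delta)
        alpha1 = min(alpha1, mu - half)
        alpha2 = max(alpha2, mu + half)
    return max(0.0, alpha1), min(1.0, alpha2)  # clip to the valid AUC range
```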
For large N, we can use the normal approximation of the binomial law to determine a finer interval E. Indeed, for large N,

$$P(|C - E[C]| \ge \eta) \le 2\Phi(2\sqrt{N}\eta) \qquad (13)$$

with $\Phi(u) = \int_u^{\infty}\frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx$. Thus,

$$E = \left[\frac{k_0}{N} - \frac{\Phi^{-1}\!\big(\frac{1-\sqrt{1-\epsilon}}{2}\big)}{2\sqrt{N}},\ \frac{k_0}{N} + \frac{\Phi^{-1}\!\big(\frac{1-\sqrt{1-\epsilon}}{2}\big)}{2\sqrt{N}}\right]$$

is the confidence interval for the error rate at the confidence level $\sqrt{1-\epsilon}$.

For simplicity, in the proof of Theorem 2, δ_k was chosen to be a constant (δ_k = δ) but, in general, it can be another function of k leading to tighter confidence intervals. The results presented in the next section were obtained with $\delta_k = a_0 \exp\big((k - k_0)^2/(2a_1^2)\big)$, where a_0 and a_1 are constants selected so that inequality (11) is verified.

7 Experiments and Comparisons

The analysis in the previous section provides a principled method for computing a confidence interval of the AUC value of a classifier C at the confidence level 1 − ε that depends only on k, n, and m. As already discussed, other expressions found in the statistical literature lead to either too loose or unsafely narrow confidence intervals based on questionable assumptions on the probability functions P_x and P_y [10, 15]. Figure 1 shows a comparison of the standard deviations obtained using the maximum-variance approach (Eq. 3), the distribution-dependent expression from [10], and our distribution-independent method for various error rates. For m = n = 500, our distribution-independent method consistently leads to tighter confidence intervals (Fig. 1(a)). It also leads to tighter confidence intervals for AUC values above 0.75 for the uneven distribution m = 400 and n = 200 (Fig. 1(b)). For lower AUC values, the distribution-dependent approach produces tighter intervals, but its underlying assumptions may not hold.

A different comparison was made using several datasets available from the UC Irvine repository (Table 1).

Table 1: Accuracy and AUC values for AdaBoost [8] and estimated standard deviations for several datasets from the UC Irvine repository. σ_indep is the distribution-independent standard deviation obtained using our method (Theorem 2). σ_A is given by Eq. (2) with the values of A, P_xxy, and P_xyy derived from the data. σ_dep is the distribution-dependent standard deviation of Hanley [10], which is based on assumptions that may not always hold. σ_max is defined by Eq. (3). All results were obtained on a randomly selected test set of size m + n.

NAME          m+n    n/(m+n)   AUC    k/(m+n)   σ_indep   σ_A      σ_dep    σ_max
pima          368    0.63      0.70   0.24      0.0297    0.0440   0.0269   0.0392
yeast         700    0.67      0.63   0.26      0.0277    0.0330   0.0215   0.0317
credit        303    0.54      0.87   0.13      0.0176    0.0309   0.0202   0.0281
internet-ads  1159   0.17      0.85   0.05      0.0177    0.0161   0.0176   0.0253
page-blocks   2473   0.10      0.84   0.03      0.0164    0.0088   0.0161   0.0234
ionosphere    201    0.37      0.85   0.13      0.0271    0.0463   0.0306   0.0417

The table shows that our estimates of the standard deviations (σ_indep) are in general close to or tighter than the distribution-dependent standard deviation σ_dep of Hanley [10]. This is despite the fact that we make no assumption about the distributions of positive and negative examples, whereas Hanley's method is based on specific assumptions about these distributions. Plots of the actual ranking distribution demonstrate, however, that these assumptions are often violated. Thus, the relatively good performance of Hanley's approach on several data sets can be viewed as fortuitous and is not general. Our distribution-independent method provides tight confidence intervals, in some cases tighter than those derived from σ_A, in particular because it exploits the information provided by the error rate.
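As a quick sanity check on the reconstructed table, the σ_max column can be recomputed from Eq. (3) alone. The sketch below is ours; the class split is inferred from the m + n and n/(m+n) columns and is therefore an approximation.

```python
import math

# pima row of Table 1: m + n = 368, n/(m+n) = 0.63, AUC = 0.70.
total = 368
n = round(0.63 * total)      # about 232 negative test examples (inferred)
m = total - n                # about 136 positive test examples
sigma_max = math.sqrt(0.70 * (1 - 0.70) / min(m, n))   # Eq. (3)
print(round(sigma_max, 4))   # ~0.0393, in line with the 0.0392 reported
```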
Our analysis can also be used to determine whether the difference between the AUC values produced by two classifiers is statistically significant, by checking if the AUC value of one falls within the confidence interval of the other.

8 Conclusion

We presented principled techniques for computing useful confidence intervals for the AUC from simple parameters: the error rate, and the negative and positive sample sizes. We demonstrated the practicality of these confidence intervals by comparing them to previous approaches in several tasks. We also derived the exact expression of the variance of the AUC for a fixed k, which can be of interest in other analyses related to the AUC.

The Wilcoxon-Mann-Whitney statistic is a general measure of the quality of a ranking that is an estimate of the probability that the classifier ranks a randomly chosen positive example higher than a negative example. One could argue that accuracy at the top or the bottom of the ranking is of higher importance. This, however, contrary to some belief, is already captured to a certain degree by the definition of the Wilcoxon-Mann-Whitney statistic, which penalizes errors at the top or the bottom of the ranking more heavily. It is however an interesting research problem to determine how to incorporate this bias in a stricter way in the form of a score-specific weight in the ranking measure, a weighted Wilcoxon-Mann-Whitney statistic, or how to compute the corresponding expected value and standard deviation in a general way and design machine learning algorithms to optimize such a measure. A preliminary analysis suggests, however, that the calculation of the expectation and the variance are likely to be extremely complex in that case. Finally, it could also be interesting but difficult to adapt our results to the distribution-dependent case and compare them to those of [10].

Acknowledgments

We thank Rob Schapire for pointing out to us the problem of the statistical significance of the AUC, Daryl Pregibon for the reference to [11], and Saharon Rosset for various discussions about the topic of this paper.

References

[1] D. Bamber. The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph. Journal of Math. Psychology, 12, 1975.
[2] Z. W. Birnbaum and O. M. Klose. Bounds for the Variance of the Mann-Whitney Statistic. Annals of Mathematical Statistics, 38, 1957.
[3] J-H. Chauchat, R. Rakotomalala, M. Carloz, and C. Pelletier. Targeting Customer Groups using Gain and Cost Matrix; a Marketing Application. Technical report, ERIC Laboratory - University of Lyon 2, 2001.
[4] Corinna Cortes and Mehryar Mohri. AUC Optimization vs. Error Rate Minimization. In Advances in Neural Information Processing Systems (NIPS 2003), volume 16, Vancouver, Canada, 2004. MIT Press.
[5] D. Van Dantzig. On the Consistency and Power of Wilcoxon's Two Sample Test. In Koninklijke Nederlandse Akademie van Wetenschappen, Series A, volume 54, 1951.
[6] J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.
[7] C. Ferri, P. Flach, and J. Hernández-Orallo. Learning Decision Trees Using the Area Under the ROC Curve. In Proceedings of the 19th International Conference on Machine Learning. Morgan Kaufmann, 2002.
[8] Yoav Freund and Robert E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In Proceedings of the Second European Conference on Computational Learning Theory, volume 2, 1995.
[9] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics.
Addison-Wesley, Reading, Massachusetts, 1994.
[10] J. A. Hanley and B. J. McNeil. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, 1982.
[11] E. L. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco, California, 1975.
[12] M. C. Mozer, R. Dodier, M. D. Colagrosso, C. Guerra-Salcedo, and R. Wolniewicz. Prodding the ROC Curve: Constrained Optimization of Classifier Performance. In Neural Information Processing Systems (NIPS 2002). MIT Press, 2002.
[13] C. Perlich, F. Provost, and J. Simonoff. Tree Induction vs. Logistic Regression: A Learning Curve Analysis. Journal of Machine Learning Research, 2003.
[14] F. Provost and T. Fawcett. Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distribution. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI, 1997.
[15] Saharon Rosset. Ranking-Methods for Flexible Evaluation and Efficient Comparison of 2-Class Models. Master's thesis, Tel-Aviv University, 1999.
[16] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing Classifier Performance via the Wilcoxon-Mann-Whitney Statistics. In Proceedings of the International Conference on Machine Learning, 2003.

5 0.60733086 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of AdaBoost to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for AdaBoost to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1

6 0.60543162 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

7 0.60442019 69 nips-2004-Fast Rates to Bayes for Kernel Machines

8 0.60434365 161 nips-2004-Self-Tuning Spectral Clustering

9 0.60431898 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

10 0.6039294 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

11 0.60386628 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

12 0.60357553 23 nips-2004-Analysis of a greedy active learning strategy

13 0.60329527 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

14 0.60326725 102 nips-2004-Learning first-order Markov models for control

15 0.60286397 44 nips-2004-Conditional Random Fields for Object Recognition

16 0.60255688 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

17 0.60231996 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

18 0.60194606 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

19 0.60192448 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

20 0.60158294 77 nips-2004-Hierarchical Clustering of a Mixture Model