nips nips2004 nips2004-162 knowledge-graph by maker-knowledge-mining

162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction


Source: pdf

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. [sent-6, score-0.098]

2 Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements x_i of x. [sent-7, score-0.223]

3 Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. [sent-10, score-0.302]

4 In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. [sent-11, score-0.119]

5 In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. [sent-12, score-0.412]

6 1 Introduction Conditional random fields (CRFs) are a recently-introduced formalism [12] for representing a conditional model Pr(y|x), where both x and y have non-trivial structure (often sequential). [sent-13, score-0.067]

7 Recall that semi-Markov chain models extend hidden Markov models (HMMs) by allowing each state s_i to persist for a non-unit length of time d_i. [sent-15, score-0.136]

8 After this time has elapsed, the system will transition to a new state s′, which depends only on s_i; however, during the “segment” of time between i and i + d_i, the behavior of the system may be non-Markovian. [sent-16, score-0.094]

9 We also argue that segments often have a clear intuitive meaning, and hence semi-CRFs are more natural than conventional CRFs. [sent-20, score-0.236]

10 We focus here on named entity recognition (NER), in which a segment corresponds to an extracted entity; however, similar arguments might be made for several other tasks, such as gene-finding [11] or NP-chunking [16]. [sent-21, score-0.521]

11 In NER, a semi-Markov formulation allows one to easily construct entity-level features (such as “entity length” and “similarity to other known entities”) which cannot be easily encoded in CRFs. [sent-22, score-0.09]

12 Experiments on five different NER problems show that semi-CRFs often outperform conventional CRFs. [sent-23, score-0.103]

13 For instance, in the NER application, x might be a sequence of words, and y might be a sequence in {I, O}^{|x|}, where y_i = I indicates “word x_i is inside a name” and y_i = O indicates the opposite. [sent-29, score-0.157]

14 For example, in NER, the components of f might include the measurement f^{13}(i, x, y) = [[x_i is capitalized]] · [[y_i = I]], where the indicator function [[c]] = 1 if c is true and zero otherwise; this implies that F^{13}(x, y) would be the number of capitalized words x_i paired with the label I. [sent-35, score-0.251]
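
As an illustration only, a minimal Python sketch of such a local indicator feature and its aggregated version (the function names are assumptions, and positions are 0-based here):

def f13(i, x, y):
    # [[x_i is capitalized]] * [[y_i = I]]
    return 1 if x[i][:1].isupper() and y[i] == "I" else 0

def F13(x, y):
    # Aggregated feature: number of capitalized words paired with the label I.
    return sum(f13(i, x, y) for i in range(len(x)))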

15 Let s = s_1, . . . , s_p denote a segmentation of x, where segment s_j = (t_j, u_j, y_j) consists of a start position t_j, an end position u_j, and a label y_j ∈ Y. [sent-40, score-1.167]

16 Conceptually, a segment means that the tag y_j is given to all x_i's between i = t_j and i = u_j, inclusive. [sent-41, score-0.645]

17 We assume segments have positive length, and completely cover the sequence 1..|x| without overlapping (assuming that non-entity words are placed in unit-length segments, as we do below). [sent-42, score-0.223]

18 That is, t_j and u_j always satisfy t_1 = 1, u_p = |x|, 1 ≤ t_j ≤ u_j ≤ |x|, and t_{j+1} = u_j + 1. [sent-45, score-0.707]

19 For NER, a valid segmentation of the sentence “I went skiing with Fernando Pereira in British Columbia” might be s = (1, 1, O), (2, 2, O), (3, 3, O), (4, 4, O), (5, 6, I), (7, 7, O), (8, 9, I), corresponding to the label sequence y = O, O, O, O, I, I, O, I, I. [sent-46, score-0.187]
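
The segmentation notation can be made concrete with a small sketch (hypothetical helper names; positions are 1-based and inclusive, as in the text above):

def check_segmentation(segments, n):
    # Each segment is a (t, u, y) triple; segments must cover 1..n without overlapping.
    assert segments[0][0] == 1 and segments[-1][1] == n
    for t, u, _ in segments:
        assert 1 <= t <= u <= n
    for (_, u, _), (t_next, _, _) in zip(segments, segments[1:]):
        assert t_next == u + 1

def to_label_sequence(segments, n):
    # Expand a segmentation into the per-word label sequence y.
    labels = [None] * (n + 1)   # index 0 unused; positions are 1-based
    for t, u, y in segments:
        for i in range(t, u + 1):
            labels[i] = y
    return labels[1:]

words = "I went skiing with Fernando Pereira in British Columbia".split()
s = [(1, 1, "O"), (2, 2, "O"), (3, 3, "O"), (4, 4, "O"),
     (5, 6, "I"), (7, 7, "O"), (8, 9, "I")]
check_segmentation(s, len(words))
print(to_label_sequence(s, len(words)))   # ['O', 'O', 'O', 'O', 'I', 'I', 'O', 'I', 'I']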

20 We now assume a vector g of segment feature functions g = g^1, . . . [sent-47, score-0.273]

21 We also make a restriction on the features, analogous to the usual Markovian assumption made in CRFs, and assume that every component g^k of g is a function only of x, s_j, and the label y_{j−1} associated with the preceding segment s_{j−1}. [sent-51, score-0.361]

22 In other words, we assume that every g^k(j, x, s) can be rewritten as g^k(j, x, s) = g′^k(y_j, y_{j−1}, x, t_j, u_j) (2) for an appropriately defined g′^k. [sent-52, score-0.278]

23 In the rest of the paper, we will drop the g′ notation and use g for both versions of the segment-level feature functions. [sent-53, score-0.097]

24 2 An efficient inference algorithm The inference problem for a semi-CRF is defined as follows: given W and x, find the best segmentation, argmax_s Pr(s|x, W), where Pr(s|x, W) is defined by Equation 3. [sent-56, score-0.203]

25 An efficient inference algorithm is suggested by Equation 2, which implies that argmax_s Pr(s|x, W) = argmax_s W · G(x, s) = argmax_s W · Σ_j g(y_j, y_{j−1}, x, t_j, u_j). Let L be an upper bound on segment length. [sent-57, score-0.839]

26 Let s_{i:y} denote the set of all partial segmentations starting from 1 (the first index of the sequence) to i, such that the last segment has the label y and ending position i. [sent-58, score-0.369]

27 Letting V(i, y) denote the largest value of W · G(x, s′) over all s′ ∈ s_{i:y}, the recursion is V(i, y) = max_{d=1..L, y′} [ V(i − d, y′) + W · g(y, y′, x, i − d + 1, i) ] if i > 0, V(i, y) = 0 if i = 0, and V(i, y) = −∞ if i < 0 (4). The best segmentation then corresponds to the path traced by max_y V(|x|, y). [sent-63, score-0.109]
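
A sketch of this recursion as a semi-Markov Viterbi search follows. Here segment_score(y, y_prev, t, u) stands in for W · g(y, y′, x, t, u) (the function name and calling convention are assumptions for illustration), positions are 1-based, and back-pointers recover the best segmentation:

def semi_markov_viterbi(n, labels, L, segment_score):
    NEG_INF = float("-inf")
    V = {(0, y): 0.0 for y in labels}      # V(0, y) = 0; i < 0 never arises because d <= i
    back = {}
    for i in range(1, n + 1):
        for y in labels:
            best, arg = NEG_INF, None
            for d in range(1, min(L, i) + 1):
                for y_prev in labels:
                    score = V[(i - d, y_prev)] + segment_score(y, y_prev, i - d + 1, i)
                    if score > best:
                        best, arg = score, (i - d, y_prev)
            V[(i, y)], back[(i, y)] = best, arg
    y = max(labels, key=lambda lab: V[(n, lab)])   # max_y V(|x|, y)
    i, segments = n, []
    while i > 0:
        j, y_prev = back[(i, y)]
        segments.append((j + 1, i, y))             # recovered segment (t, u, y)
        i, y = j, y_prev
    return list(reversed(segments))

The run time of this sketch is O(|x| · L · |Y|²) per sequence, roughly a factor of L over first-order CRF Viterbi, consistent with the "small constant factor" noted in the abstract.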

28 3 Semi-Markov CRFs vs order-L CRFs Since conventional CRFs need not maximize over possible segment lengths d, inference for semi-CRFs is more expensive. [sent-65, score-0.362]

29 Semi-CRFs are also a natural restriction, as it is often convenient to express features in terms of segments. [sent-73, score-0.09]

30 As an example, let d_j denote the length of a segment, and let µ be the average length of all segments with label I. [sent-74, score-0.366]

31 Now consider the segment feature g^{k1}(j, x, s) = (d_j − µ)² · [[y_j = I]]. [sent-75, score-0.273]

32 After training, the contribution of this feature toward Pr(s|x) associated with a length-d entity will be proportional to e^{w_{k1}·(d−µ)²}. [sent-76, score-0.302]

33 That is, it allows the learner to model a Gaussian distribution of entity lengths. [sent-78, score-0.241]

34 An exponential model for lengths could be implemented with the feature g^{k2}(j, x, s) = d_j · [[y_j = I]]. [sent-79, score-0.138]
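
Written in the g(y_j, y_{j−1}, x, t_j, u_j) convention of Equation 2, these two length features might look as follows (a sketch only; µ is passed in as a precomputed constant):

def g_k1(y, y_prev, x, t, u, mu):
    # Gaussian-style length feature: (d_j - mu)^2 * [[y_j = I]]
    d = u - t + 1
    return (d - mu) ** 2 if y == "I" else 0.0

def g_k2(y, y_prev, x, t, u):
    # Exponential-style length feature: d_j * [[y_j = I]]
    d = u - t + 1
    return float(d) if y == "I" else 0.0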

35 In contrast to the Gaussian-length feature above, g^{k2} is “equivalent to” a local feature function f(i, x, y) = [[y_i = I]], in the following sense: for every triple x, y, s, where y is the tag sequence for s, Σ_j g^{k2}(j, x, s) = Σ_i f(i, x, y). [sent-80, score-0.191]

36 Thus a semi-CRF model based on the single feature g^{k2} could also be represented by a conventional CRF. [sent-81, score-0.128]

37 In general, a semi-CRF model can be factorized in terms of an equivalent order-1 CRF model if and only if the sum of the segment features can be rewritten as a sum of local features. [sent-82, score-0.334]

38 4 Learning algorithm During training the goal is to maximize log-likelihood over a given training set T = {(x_ℓ, s_ℓ)}_{ℓ=1}^N. [sent-85, score-0.084]

39 Following the notation of Sha and Pereira [16], we express the log-likelihood over the training sequences as L(W) = Σ_ℓ log Pr(s_ℓ|x_ℓ, W) = Σ_ℓ (W · G(x_ℓ, s_ℓ) − log Z_W(x_ℓ)). [sent-86, score-0.111]

40 However, to compute the normalizer Z_W(x_ℓ) and the expected value of the features under the current weight vector, E_{Pr(s′|W)} G(x_ℓ, s′), we must use the Markov property of G to construct a dynamic programming algorithm similar to forward-backward. [sent-91, score-0.09]

41 We thus define α(i, y) as the value of Σ_{s′∈s_{i:y}} e^{W·G(s′, x)}, where again s_{i:y} denotes all segmentations from 1 to i ending at i and labeled y. [sent-92, score-0.09]

42 For the k-th component of G, let η^k(i, y) be the value of the sum Σ_{s′∈s_{i:y}} G^k(s′, x) e^{W·G(x, s′)}, restricted to the part of the segmentation ending at position i. [sent-96, score-0.12]
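
A sketch of the corresponding forward recursion for the normalizer Z_W(x); segment_score again stands in for W · g, a distinguished start label handles the left boundary (an implementation choice, not spelled out in the extracted text), and a real implementation would work in log space:

import math

def semi_crf_partition(n, labels, L, segment_score, start="<START>"):
    # alpha[(i, y)] sums exp(W . G) over partial segmentations ending at i with last label y.
    alpha = {(0, start): 1.0}
    for i in range(1, n + 1):
        for y in labels:
            total = 0.0
            for d in range(1, min(L, i) + 1):
                prev_labels = [start] if i - d == 0 else labels
                for y_prev in prev_labels:
                    total += alpha[(i - d, y_prev)] * math.exp(
                        segment_score(y, y_prev, i - d + 1, i))
            alpha[(i, y)] = total
    return sum(alpha[(n, y)] for y in labels)      # Z_W(x)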

43 Experiments with NER data Baseline algorithms and datasets In our experiments, we trained semi-CRFs to mark entity segments with the label I, and put non-entity words into unit-length segments with label O. [sent-99, score-0.782]

44 The first version, which we call CRF/1, labels words inside and outside entities with I and O, respectively. [sent-101, score-0.17]

45 The second version, called CRF/4, replaces the I tag with four tags B, E, C, and U , which depend on where the word appears in an entity [2]. [sent-102, score-0.376]
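
A sketch of that segment-to-tag conversion, assuming the usual reading of the four tags (B = first word of a multi-word entity, C = continuation, E = last word, U = a one-word entity; non-entity words keep O):

def segments_to_becu(segments):
    tags = []
    for t, u, y in segments:
        length = u - t + 1
        if y != "I":
            tags.extend(["O"] * length)
        elif length == 1:
            tags.append("U")
        else:
            tags.extend(["B"] + ["C"] * (length - 2) + ["E"])
    return tags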

46 We considered extraction of city names and state names from this corpus. [sent-105, score-0.381]

47 The Jobs corpus contains 73,330 words, and consists of 300 computer-related job postings [4]. [sent-106, score-0.081]

48 We considered extraction of company names and job titles. [sent-107, score-0.271]

49 The 18,121-word Email corpus contains 216 email messages taken from the CSPACE email corpus [10], which is mail associated with a 14-week, 277-person management game. [sent-108, score-0.278]

50 2 Features As features for CRF, we used indicators for specific words at location i, or locations within three words of i. [sent-112, score-0.348]

51 Following previous NER work [7], we also used indicators for capitalization/letter patterns (such as “Aa+” for a capitalized word, or “D” for a single-digit number). [sent-113, score-0.201]
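
One way such letter patterns can be computed (a sketch of the idea, not the authors' exact pattern set):

import re

def letter_pattern(word):
    # Map a word to a coarse shape, e.g. "Fernando" -> "Aa+", "7" -> "D".
    shape = ""
    for ch in word:
        if ch.isupper():
            shape += "A"
        elif ch.islower():
            shape += "a"
        elif ch.isdigit():
            shape += "D"
        else:
            shape += ch
    return re.sub(r"a+", "a+", shape)   # collapse runs of lowercase letters only, in this sketch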

52 As features for semi-CRFs, we used the same set of word-level features, as well as their logical extensions to segments. [sent-114, score-0.09]

53 Specifically, we used indicators for the phrase inside a segment and the capitalization pattern inside a segment, as well as indicators for words and capitalization patterns in 3-word windows before and after the segment. [sent-115, score-0.713]

54 We also used indicators for each segment length (d = 1, . . . , L). [sent-116, score-0.374]

55 In addition, we combined all word-level features with indicators for the beginning and end of a segment. [sent-119, score-0.21]

56 To exploit more of the power of semi-CRFs, we also implemented a number of dictionary-derived features, each of which was based on a different dictionary D and similarity function sim. [sent-120, score-0.303]

57 For each segment, writing x_{s_j} for the word sequence x_{t_j} . . . x_{u_j}, a dictionary feature is defined as g^{D,sim}(j, x, s) = max_{u∈D} sim(x_{s_j}, u). [sent-124, score-0.463]

58 That is, the distance from the word sequence x_{s_j} to the closest element in D. [sent-126, score-0.243]
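
A sketch of such a segment-level dictionary feature using Jaccard similarity over word sets (one of the three similarity metrics mentioned below; the helper names are illustrative):

def jaccard(a_tokens, b_tokens):
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def g_dict(segment_tokens, dictionary):
    # max_{u in D} sim(x_{s_j}, u): similarity of the segment to its closest dictionary entry.
    return max((jaccard(segment_tokens, u.split()) for u in dictionary), default=0.0)

print(g_dict(["British", "Columbia"], ["British Columbia", "New Delhi"]))   # 1.0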

59 For each of the extraction problems, we assembled one external dictionary of strings, which were similar (but not identical) to the entity names in the documents. [sent-127, score-0.814]

60 For instance, for city names in the Address data, we used a web page listing cities in India. [sent-128, score-0.176]

61 Due to variations in the way entity names are written, rote matching these dictionaries to the data gives relatively low F1 values, ranging from 22% (for the job-title extraction task) to 57% (for the person-name task). [sent-129, score-0.553]

62 We used three different similarity metrics (Jaccard, TFIDF, and JaroWinkler) which are known to work well for name-matching in data integration tasks [5]. [sent-130, score-0.106]

63 That is, the distance-based segment features cannot be decomposed into sums of local features. [sent-133, score-0.302]

64 We also extended the semi-CRF algorithm to construct, on the fly, an internal segment dictionary of segments labeled as entities in the training data. [sent-135, score-0.881]

65 To make measurements on training data similar to those on test data, when finding the closest neighbor of x_{s_j} in the internal dictionary, we excluded all strings formed from x, thus excluding matches of x_{s_j} to itself (or subsequences of itself). [sent-136, score-0.52]
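
A sketch of that internal-dictionary lookup with self-exclusion; tracking a source-document id for each stored entity string is an assumption made here for illustration:

def internal_dict_feature(segment_tokens, doc_id, internal_dict, sim):
    # internal_dict: list of (entity_string, source_doc_id) pairs built from training segments.
    candidates = [entity for entity, src in internal_dict if src != doc_id]
    return max((sim(segment_tokens, entity.split()) for entity in candidates), default=0.0)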

66 For completeness in the experiments, we also evaluated local versions of the dictionary features. [sent-138, score-0.339]

67 Specifically, we constructed dictionary features of the form f^{D,sim}(i, x, y) = max_{u∈D} sim(x_i, u), where D is either the external dictionary used above, or an internal word dictionary formed from all words contained in entities. [sent-139, score-1.423]

68 As before, words in x were excluded in finding near neighbors to xi . [sent-140, score-0.107]

69 3 Results and Discussion We evaluated the F1-measure performance of CRF/1, CRF/4, and semi-CRFs, with and without internal and external dictionaries. [sent-142, score-0.202]

70 Gaussian priors were used for all algorithms, and for semi-CRFs, a fixed value of L was chosen for each dataset based on observed entity lengths. [sent-146, score-0.241]

71 In the baseline configuration in which no dictionary features are used, semi-CRFs perform well. (F1 is defined as 2*precision*recall/(precision+recall).) [sent-148, score-0.427]
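
A direct transcription of that definition:

def f1(precision, recall):
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0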

72 Figure 1 plots F1 span accuracy against the fraction of available training data for the Address_State, Address_City, and Email_Person tasks, comparing CRF/4, SemiCRF+int, CRF/4+dict, and SemiCRF+int+dict. [sent-149, score-0.114]

73 Figure 1: F1 as a function of training set size. [sent-179, score-0.084]

74 Algorithms marked with “+dict” include external dictionary features, and algorithms marked with “+int” include internal dictionary features. [sent-180, score-0.808]

75 We do not use internal dictionary features for CRF/4 since they lead to reduced accuracy. [sent-181, score-0.496]

76 Table 1 reports baseline F1 for CRF/1, CRF/4, and semi-CRF on the state, title, person, city, and company extraction tasks, and F1 with ∆base for the +internal dict, +external dict, and +both dictionaries configurations (∆extern is measured relative to the external-dictionary configuration). [sent-182, score-1.593]

77 Table 1: Comparing various methods on five IE tasks, with and without dictionary features. [sent-302, score-0.303]

78 When internal dictionary features are used, the performance of semi-CRFs is often improved, and never degraded by more than 2. [sent-306, score-0.496]

79 However, the less-natural local version of these features often leads to substantial performance losses for CRF/1 and CRF/4. [sent-308, score-0.09]

80 Semi-CRFs perform best on nine of the ten task variants for which internal dictionaries were used. [sent-309, score-0.244]

81 The external-dictionary features are helpful to all the algorithms. [sent-310, score-0.09]

82 Semi-CRFs perform best on three of five tasks in which only external dictionaries were used. [sent-311, score-0.28]

83 If we consider the tasks with and without external dictionary features as separate “conditions”, then semi-CRFs using all available information outperform both CRF variants on eight of ten “conditions”. [sent-313, score-0.568]

84 That is, the both-dictionary version when external dictionaries are available, and the internal-dictionary-only version otherwise. [sent-318, score-0.24]

85 The primary difference is that the “inner model” for semi-CRFs is of short, uniformly-labeled segments with non-Markovian properties, while nested HMMs allow longer, diversely-labeled, Markovian “segments”. [sent-344, score-0.207]

86 However, for these methods, inference is not tractable, and hence approximations must be made in training and classification. [sent-349, score-0.094]

87 An interesting question for future research is whether the tractable extension to CRF inference considered here can be used to improve inference methods for more expressive models. [sent-350, score-0.218]

88 In recent prior work [6], we investigated semi-Markov learning methods for NER based on a voted perceptron training algorithm [7]. [sent-351, score-0.195]

89 The voted perceptron has some advantages in ease of implementation and efficiency. [sent-352, score-0.153]

90 (In particular, the voted perceptron algorithm is more stable numerically, as it does not need to compute a partition function.) [sent-353, score-0.153]

91 Probabilistically-grounded approaches like CRFs are also preferable to margin-based approaches like the voted perceptron in certain settings, e.g. [sent-355, score-0.153]

92 A major advantage of semi-CRFs is that they allow features which measure properties of segments, rather than individual elements. [sent-359, score-0.09]

93 For applications like NER and gene-finding [11], these features can be quite natural. [sent-360, score-0.09]

94 net, and an NER package using this package is available on http://minorthird. [sent-363, score-0.08]

95 Exploiting diverse knowledge sources via maximum entropy in named entity recognition. [sent-380, score-0.309]

96 Bottom-up relational learning of pattern matching rules for information extraction. [sent-395, score-0.069]

97 Exploiting dictionaries in named entity extraction: Combining semi-markov extraction processes and data integration methods. [sent-409, score-0.534]

98 Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. [sent-413, score-0.109]

99 Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. [sent-461, score-0.437]

100 Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. [sent-480, score-0.121]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('dictionary', 0.303), ('ner', 0.27), ('crfs', 0.27), ('dict', 0.244), ('entity', 0.241), ('segment', 0.212), ('semicrf', 0.19), ('segments', 0.169), ('zw', 0.163), ('uj', 0.151), ('dictionaries', 0.141), ('xsj', 0.135), ('ew', 0.129), ('tj', 0.127), ('indicators', 0.12), ('pr', 0.112), ('yj', 0.112), ('crf', 0.108), ('int', 0.108), ('internal', 0.103), ('external', 0.099), ('argmax', 0.099), ('features', 0.09), ('city', 0.089), ('names', 0.087), ('voted', 0.086), ('extraction', 0.084), ('capitalized', 0.081), ('epr', 0.081), ('title', 0.071), ('words', 0.069), ('email', 0.069), ('relational', 0.069), ('named', 0.068), ('label', 0.067), ('conditional', 0.067), ('conventional', 0.067), ('perceptron', 0.067), ('segmentation', 0.066), ('company', 0.066), ('metrics', 0.066), ('elds', 0.065), ('feature', 0.061), ('si', 0.06), ('expressive', 0.06), ('sequence', 0.054), ('extern', 0.054), ('sunita', 0.054), ('tractible', 0.054), ('cohen', 0.054), ('ending', 0.054), ('word', 0.054), ('markov', 0.053), ('inference', 0.052), ('entities', 0.052), ('person', 0.05), ('ve', 0.049), ('inside', 0.049), ('hmms', 0.048), ('corpus', 0.047), ('base', 0.047), ('sim', 0.047), ('capitalization', 0.047), ('management', 0.046), ('dj', 0.046), ('maxy', 0.043), ('tag', 0.043), ('training', 0.042), ('length', 0.042), ('sj', 0.042), ('india', 0.04), ('package', 0.04), ('restriction', 0.04), ('proceedings', 0.04), ('tasks', 0.04), ('span', 0.038), ('excluded', 0.038), ('tags', 0.038), ('nested', 0.038), ('versions', 0.036), ('sixth', 0.036), ('subsequences', 0.036), ('segmentations', 0.036), ('edmonton', 0.036), ('outperform', 0.036), ('job', 0.034), ('sha', 0.034), ('state', 0.034), ('measurement', 0.034), ('baseline', 0.034), ('discriminative', 0.033), ('pereira', 0.033), ('factorized', 0.032), ('sutton', 0.032), ('markovian', 0.031), ('viterbi', 0.031), ('conditionally', 0.031), ('lengths', 0.031), ('triple', 0.031), ('strings', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1

2 0.14040166 44 nips-2004-Conditional Random Fields for Object Recognition

Author: Ariadna Quattoni, Michael Collins, Trevor Darrell

Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.

3 0.13206841 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

Author: Maria-florina Balcan, Avrim Blum, Ke Yang

Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1

4 0.13034458 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

Author: Andrew McCallum, Ben Wellner

Abstract: Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing, databases, citation matching and many other tasks. This paper introduces several discriminative, conditional-probability models for coreference analysis, all examples of undirected graphical models. Unlike many historical approaches to coreference, the models presented here are relational—they do not assume that pairwise coreference decisions should be made independently from each other. Unlike other relational models of coreference that are generative, the conditional model here can incorporate a great variety of features of the input without having to be concerned about their dependencies—paralleling the advantages of conditional random fields over hidden Markov models. We present positive results on noun phrase coreference in two standard text data sets. 1

5 0.12625133 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

Author: Peter L. Bartlett, Michael Collins, Ben Taskar, David A. McAllester

Abstract: We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs. 1

6 0.11590319 192 nips-2004-The power of feature clustering: An application to object detection

7 0.10444531 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

8 0.074542709 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

9 0.072141178 205 nips-2004-Who's In the Picture

10 0.06586159 108 nips-2004-Markov Networks for Detecting Overlapping Elements in Sequence Data

11 0.0649084 82 nips-2004-Incremental Algorithms for Hierarchical Classification

12 0.062550455 54 nips-2004-Distributed Information Regularization on Graphs

13 0.060586449 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

14 0.058152828 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

15 0.057911366 163 nips-2004-Semi-parametric Exponential Family PCA

16 0.056979228 9 nips-2004-A Method for Inferring Label Sampling Mechanisms in Semi-Supervised Learning

17 0.055047482 87 nips-2004-Integrating Topics and Syntax

18 0.05455742 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

19 0.05455 187 nips-2004-The Entire Regularization Path for the Support Vector Machine

20 0.05377448 99 nips-2004-Learning Hyper-Features for Visual Identification


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.193), (1, 0.043), (2, 0.003), (3, -0.062), (4, 0.072), (5, 0.115), (6, 0.069), (7, 0.094), (8, -0.052), (9, 0.006), (10, 0.012), (11, 0.019), (12, -0.132), (13, 0.149), (14, -0.019), (15, 0.041), (16, 0.068), (17, 0.097), (18, 0.071), (19, -0.042), (20, 0.0), (21, 0.047), (22, -0.121), (23, -0.129), (24, -0.126), (25, 0.135), (26, -0.029), (27, -0.014), (28, -0.155), (29, 0.07), (30, 0.002), (31, -0.099), (32, 0.064), (33, 0.038), (34, -0.06), (35, -0.048), (36, -0.055), (37, -0.058), (38, 0.001), (39, -0.04), (40, -0.123), (41, -0.051), (42, 0.092), (43, -0.056), (44, 0.109), (45, -0.148), (46, -0.028), (47, 0.114), (48, 0.092), (49, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94349098 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1

2 0.66773444 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

Author: Andrew McCallum, Ben Wellner

Abstract: Coreference analysis, also known as record linkage or identity uncertainty, is a difficult and important problem in natural language processing, databases, citation matching and many other tasks. This paper introduces several discriminative, conditional-probability models for coreference analysis, all examples of undirected graphical models. Unlike many historical approaches to coreference, the models presented here are relational—they do not assume that pairwise coreference decisions should be made independently from each other. Unlike other relational models of coreference that are generative, the conditional model here can incorporate a great variety of features of the input without having to be concerned about their dependencies—paralleling the advantages of conditional random fields over hidden Markov models. We present positive results on noun phrase coreference in two standard text data sets. 1

3 0.57543451 108 nips-2004-Markov Networks for Detecting Overlapping Elements in Sequence Data

Author: Mark Craven, Joseph Bockhorst

Abstract: Many sequential prediction tasks involve locating instances of patterns in sequences. Generative probabilistic language models, such as hidden Markov models (HMMs), have been successfully applied to many of these tasks. A limitation of these models however, is that they cannot naturally handle cases in which pattern instances overlap in arbitrary ways. We present an alternative approach, based on conditional Markov networks, that can naturally represent arbitrarily overlapping elements. We show how to efficiently train and perform inference with these models. Experimental results from a genomics domain show that our models are more accurate at locating instances of overlapping patterns than are baseline models based on HMMs. 1

4 0.55492133 37 nips-2004-Co-Training and Expansion: Towards Bridging Theory and Practice

Author: Maria-florina Balcan, Avrim Blum, Ke Yang

Abstract: Co-training is a method for combining labeled and unlabeled data when examples can be thought of as containing two distinct sets of features. It has had a number of practical successes, yet previous theoretical analyses have needed very strong assumptions on the data that are unlikely to be satisfied in practice. In this paper, we propose a much weaker “expansion” assumption on the underlying data distribution, that we prove is sufficient for iterative cotraining to succeed given appropriately strong PAC-learning algorithms on each feature set, and that to some extent is necessary as well. This expansion assumption in fact motivates the iterative nature of the original co-training algorithm, unlike stronger assumptions (such as independence given the label) that allow a simpler one-shot co-training to succeed. We also heuristically analyze the effect on performance of noise in the data. Predicted behavior is qualitatively matched in synthetic experiments on expander graphs. 1

5 0.51546532 44 nips-2004-Conditional Random Fields for Object Recognition

Author: Ariadna Quattoni, Michael Collins, Trevor Darrell

Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.

6 0.48697913 205 nips-2004-Who's In the Picture

7 0.46516365 101 nips-2004-Learning Syntactic Patterns for Automatic Hypernym Discovery

8 0.44542256 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

9 0.44295102 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge

10 0.4347316 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)

11 0.43353704 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

12 0.43249953 192 nips-2004-The power of feature clustering: An application to object detection

13 0.4268429 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization

14 0.41436878 122 nips-2004-Modelling Uncertainty in the Game of Go

15 0.40207416 109 nips-2004-Mass Meta-analysis in Talairach Space

16 0.4015812 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

17 0.39909682 14 nips-2004-A Topographic Support Vector Machine: Classification Using Local Label Configurations

18 0.3783837 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields

19 0.37530681 193 nips-2004-Theories of Access Consciousness

20 0.33329237 74 nips-2004-Harmonising Chorales by Probabilistic Inference


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(9, 0.016), (13, 0.079), (15, 0.104), (17, 0.023), (26, 0.068), (31, 0.029), (32, 0.019), (33, 0.198), (35, 0.021), (39, 0.028), (50, 0.02), (53, 0.32)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.94587159 35 nips-2004-Chemosensory Processing in a Spiking Model of the Olfactory Bulb: Chemotopic Convergence and Center Surround Inhibition

Author: Baranidharan Raman, Ricardo Gutierrez-osuna

Abstract: This paper presents a neuromorphic model of two olfactory signal-processing primitives: chemotopic convergence of olfactory receptor neurons, and center on-off surround lateral inhibition in the olfactory bulb. A self-organizing model of receptor convergence onto glomeruli is used to generate a spatially organized map, an olfactory image. This map serves as input to a lattice of spiking neurons with lateral connections. The dynamics of this recurrent network transforms the initial olfactory image into a spatio-temporal pattern that evolves and stabilizes into odor- and intensity-coding attractors. The model is validated using experimental data from an array of temperature-modulated gas sensors. Our results are consistent with recent neurobiological findings on the antennal lobe of the honeybee and the locust.

same-paper 2 0.80558419 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

Author: Sunita Sarawagi, William W. Cohen

Abstract: We describe semi-Markov conditional random fields (semi-CRFs), a conditionally trained version of semi-Markov chains. Intuitively, a semiCRF on an input sequence x outputs a “segmentation” of x, in which labels are assigned to segments (i.e., subsequences) of x rather than to individual elements xi of x. Importantly, features for semi-CRFs can measure properties of segments, and transitions within a segment can be non-Markovian. In spite of this additional power, exact learning and inference algorithms for semi-CRFs are polynomial-time—often only a small constant factor slower than conventional CRFs. In experiments on five named entity recognition problems, semi-CRFs generally outperform conventional CRFs. 1

3 0.79096341 104 nips-2004-Linear Multilayer Independent Component Analysis for Large Natural Scenes

Author: Yoshitatsu Matsuda, Kazunori Yamaguchi

Abstract: In this paper, linear multilayer ICA (LMICA) is proposed for extracting independent components from quite high-dimensional observed signals such as large-size natural scenes. There are two phases in each layer of LMICA. One is the mapping phase, where a one-dimensional mapping is formed by a stochastic gradient algorithm which makes more highlycorrelated (non-independent) signals be nearer incrementally. Another is the local-ICA phase, where each neighbor (namely, highly-correlated) pair of signals in the mapping is separated by the MaxKurt algorithm. Because LMICA separates only the highly-correlated pairs instead of all ones, it can extract independent components quite efficiently from appropriate observed signals. In addition, it is proved that LMICA always converges. Some numerical experiments verify that LMICA is quite efficient and effective in large-size natural image processing.

4 0.60780436 45 nips-2004-Confidence Intervals for the Area Under the ROC Curve

Author: Corinna Cortes, Mehryar Mohri

Abstract: In many applications, good ranking is a highly desirable performance for a classifier. The criterion commonly used to measure the ranking quality of a classification algorithm is the area under the ROC curve (AUC). To report it properly, it is crucial to determine an interval of confidence for its value. This paper provides confidence intervals for the AUC based on a statistical and combinatorial analysis using only simple parameters such as the error rate and the number of positive and negative examples. The analysis is distribution-independent, it makes no assumption about the distribution of the scores of negative or positive examples. The results are of practical use and can be viewed as the equivalent for AUC of the standard confidence intervals given in the case of the error rate. They are compared with previous approaches in several standard classification tasks demonstrating the benefits of our analysis.
Section 5 gives the reduced expressions for the expected value and variance of the AUC for a fixed error rate. These can be efficiently computed and used to determine our confidence intervals for the AUC (Section 6). Section 7 reports the result of the comparison of our method with previous approaches, including empirical results for several standard tasks. 2 Definition and Properties of the AUC The Receiver Operating Characteristics (ROC) curves were originally introduced in signal detection theory [6] in connection with the study of radio signals, and have been used since then in many other applications, in particular for medical decision-making. Over the last few years, they have found increased interest in the machine learning and data mining communities for model evaluation and selection [14, 13, 7, 12, 16, 3]. The ROC curve for a binary classification problem plots the true positive rate as a function of the false positive rate. The points of the curve are obtained by sweeping the classification threshold from the most positive classification value to the most negative. For a fully random classification, the ROC curve is a straight line connecting the origin to (1, 1). Any improvement over random classification results in an ROC curve at least partially above this straight line. The AUC is defined as the area under the ROC curve. Consider a binary classification task with m positive examples and n negative examples. Let C be a fixed classifier that outputs a strictly ordered list for these examples. Let x1 , . . . , xm be the output of C on the positive examples and y1 , . . . , yn its output on the negative examples and denote by 1X the indicator function of a set X. Then, the AUC, A, associated to C is given by: A= m i=1 n j=1 1xi >yj (1) mn which is the value of the Wilcoxon-Mann-Whitney statistic [10]. Thus, the AUC is closely related to the ranking quality of the classification. It can be viewed as a measure based on pairwise comparisons between classifications of the two classes. It is an estimate of the probability Pxy that the classifier ranks a randomly chosen positive example higher than a negative example. With a perfect ranking, all positive examples are ranked higher than the negative ones and A = 1. Any deviation from this ranking decreases the AUC, and the expected AUC value for a random ranking is 0.5. 3 Overview of Related Work This section briefly describes some previous distribution-dependent approaches presented in the statistical literature to derive confidence intervals for the AUC and compares them to our method. The starting point for these analyses is a formula giving the variance of the AUC, A, for a fixed distribution of the scores Px of the positive examples and Py of the negative examples [10, 1]: 2 σA = A(1 − A) + (m − 1)(Pxxy − A2 ) + (n − 1)(Pxyy − A2 ) mn (2) where Pxxy is the probability that the classifier ranks two randomly chosen positive examples higher than a negative one, and Pxyy the probability that it ranks two randomly chosen negative examples lower than a positive one. To compute the variance exactly using Equation 2, the distributions Px and Py must be known. Hanley and McNeil [10] argue in favor of exponential distributions, loosely claiming that this upper-bounds the variance of normal distributions with various means and ratios of A 2A2 variances. They show that for exponential distributions Pxxy = 2−A and Pxyy = 1+A . 
The resulting confidence intervals are of course relatively tight, but their validity is questionable since they are based on a strong assumption about the distributions of the positive and negative scores that may not hold in many cases. An alternative considered by several authors to the exact computation of the variance is to determine instead the maximum of the variance over all possible continuous distributions with the same expected value of the AUC. For all such distributions, one can fix m and n and compute the expected AUC and its variance. The maximum variance is denoted by 2 σmax and is given by [5, 2]: 2 σmax = A(1 − A) 1 ≤ min {m, n} 4 min {m, n} (3) Unfortunately, this often yields loose confidence intervals of limited practical use. Our approach for computing the mean and variance of the AUC is distribution-independent and inspired by the machine learning literature where analyses typically center on the error rate. We require only that the error rate be measured and compute the mean and variance of the AUC over all distributions Px and Py that maintain the same error rate. Our approach is in line with that of [5, 2] but it crucially avoids considering the maximum of the variance. We show that it is possible to compute directly the mean and variance of the AUC assigning equal weight to all the possible distributions. Of course, one could argue that not all distributions Px and Py are equally probable, but since these distributions are highly problem-dependent, we find it risky to make any general assumption on the distributions and thereby limit the validity of our results. Our approach is further justified empirically by the experiments reported in the last section. 4 Combinatorial Analysis The analysis of the statistical properties of the AUC given a fixed error rate requires various combinatorial calculations. This section describes several of the combinatorial identities that are used in our computation of the confidence intervals. For all q ≥ 0, let Xq (k, m, n) be defined by: k M M Xq (k, m, n) = xq (4) x x x=0 where M = m − (k − x) + x, M = n + (k − x) − x, and x = k − x. In previous work, we derived the following two identities which we used to compute the expected value of the AUC [4]: k X0 (k, m, n) = x=0 n+m+1 x k X1 (k, m, n) = (k − x)(m − n) + k n + m + 1 2 x x=0 To simplify the expression of the variance of the AUC, we need to compute X2 (k, m, n). Proposition 1 Let k, m, n be non-negative integers such that k ≤ min{m, n}, then: k X2 (k, m, n) = P2 (k, m, n, x) x=0 m+n+1 x (5) where P2 is the following 4th-degree polynomial: P2 (k, m, n, x) = (k − x)/12(−2x3 + 2x2 (2m − n + 2k − 4) + x(−3m2 + 3nm + 3m − 5km − 2k 2 + 2 + k + nk + 6n) + (3(k − 1)m2 − 3nm(k − 1) + 6km + 5m + k 2 m + 8n + 8 − 9nk + 3k + k 2 + k 2 n)). Proof. 5 The proof of the proposition is left to a longer version of this paper. Expectation and Variance of the AUC This section presents the expression of the expectation and variance of the AUC for a fixed error rate k/N assuming that all classifications or rankings with k errors are equiprobable. For a given classification, there may be x, 0 ≤ x ≤ k, false positive examples. Since the number of errors is fixed, there are x = k − x false negative examples. The expression Xq discussed in the previous section represents the q-th moment of x over all classifications with exactly k errors. 
In previous work, we gave the exact expression of the expectation of the AUC for a fixed number of errors k:

Proposition 2 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the expected value of the AUC, A, over all classifications with k errors is given by:

$$E[A] = 1 - \frac{k}{m+n} - \frac{(n-m)^2(m+n+1)}{4mn}\left(\frac{k}{m+n} - \frac{\sum_{x=0}^{k-1}\binom{m+n}{x}}{\sum_{x=0}^{k}\binom{m+n+1}{x}}\right).$$

Note that the two sums in this expression cannot be further simplified since they are known not to admit a closed form [9]. We also gave the expression of the variance of the AUC in terms of the function F defined for all Y by:

$$F(Y) = \frac{\sum_{x=0}^{k}\binom{M}{x}\binom{M'}{x'}\,Y}{\sum_{x=0}^{k}\binom{M}{x}\binom{M'}{x'}}. \qquad (6)$$

The following proposition reproduces that result:

Proposition 3 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classifications with k errors is given by:

$$\sigma^2(A) = F\!\left(\Big(1 - \tfrac{\frac{k-x}{n} + \frac{x}{m}}{2}\Big)^2\right) - F\!\left(1 - \tfrac{\frac{k-x}{n} + \frac{x}{m}}{2}\right)^2 + F\!\left(\tfrac{mx^2 + n(k-x)^2 + (m(m+1)x + n(n+1)(k-x)) - 2x(k-x)(m+n+1)}{12 m^2 n^2}\right).$$

Because of the products of binomial terms, the computation of the variance using this expression is inefficient even for relatively small values of m and n. This expression can however be reduced using the identities presented in the previous section, which leads to the significantly more efficient computations that we have been using in all our experiments.

Corollary 1 ([4]) Assume that a binary classification task with m positive examples and n negative examples is given. Then, the variance of the AUC A over all classifications with k errors is given by:

$$\sigma^2(A) = \frac{(m+n+1)(m+n)(m+n-1)T\big((m+n-2)Z_4 - (2m - n + 3k - 10)Z_3\big)}{72 m^2 n^2} + \frac{(m+n+1)(m+n)T\big(m^2 - nm + 3km - 5m + 2k^2 - nk + 12 - 9k\big)Z_2}{48 m^2 n^2} - \frac{(m+n+1)^2(m-n)^4 Z_1^2}{16 m^2 n^2} - \frac{(m+n+1)Q_1 Z_1}{72 m^2 n^2} + \frac{k Q_0}{144 m^2 n^2}$$

with:

$$Z_i = \frac{\sum_{x=0}^{k-i}\binom{m+n+1-i}{x}}{\sum_{x=0}^{k}\binom{m+n+1}{x}}, \qquad T = 3\big((m-n)^2 + m + n\big) + 2,$$

and:

$$Q_0 = (m+n+1)Tk^2 + \big((-3n^2 + 3mn + 3m + 1)T - 12(3mn + m + n) - 8\big)k + (-3m^2 + 7m + 10n + 3nm + 10)T - 4(3mn + m + n + 1)$$

$$Q_1 = Tk^3 + 3(m-1)Tk^2 + \big((-3n^2 + 3mn - 3m + 8)T - 6(6mn + m + n)\big)k + (-3m^2 + 7(m+n) + 3mn)T - 2(6mn + m + n)$$

Proof. The expression of the variance given in Proposition 3 requires the computation of X_q(k, m, n), q = 0, 1, 2. Using the identities giving the expressions of X_0 and X_1 and Proposition 1, which provides the expression of X_2, σ²(A) can be reduced to the expression given by the corollary.

6 Theory and Analysis

Our estimate of the confidence interval for the AUC is based on a simple and natural assumption. The main idea for its computation is the following. Assume that a confidence interval E = [e_1, e_2] is given for the error rate of a classifier C over a sample S, with the confidence level 1 − ε. This interval may have been derived from a binomial model of C, which is a standard assumption for determining a confidence interval for the error rate, or from any other model used to compute that interval. For a given error rate e ∈ E, or equivalently for a given number of misclassifications, we can use the expectation and variance computed in the previous section and Chebyshev's inequality to predict a confidence interval A_e for the AUC at the confidence level 1 − δ. Since our equiprobable model for the classifications is independent of the model used to compute the interval of confidence for the error rate, we can use E and A_e, e ∈ E, to compute a confidence interval of the AUC at the level (1 − ε)(1 − δ).
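Proposition 2 is easy to evaluate exactly. The sketch below is ours; it implements the expectation with exact binomial sums and checks two limiting cases (a perfect classifier, and balanced classes) that follow directly from the formula. The variance of Corollary 1 can be coded in the same way but is omitted here; in the later sketches it is assumed to be available as a function var_auc_fixed_k(k, m, n).

```python
from math import comb

def expected_auc_fixed_k(k, m, n):
    """E[A] over all classifications with exactly k errors (Proposition 2)."""
    num = sum(comb(m + n, x) for x in range(k))          # sum_{x=0}^{k-1} C(m+n, x)
    den = sum(comb(m + n + 1, x) for x in range(k + 1))  # sum_{x=0}^{k}   C(m+n+1, x)
    r = k / (m + n)
    return 1.0 - r - (n - m) ** 2 * (m + n + 1) / (4.0 * m * n) * (r - num / den)

# k = 0 gives E[A] = 1; for m = n the correction term vanishes and E[A] = 1 - k/(m+n).
assert expected_auc_fixed_k(0, 100, 50) == 1.0
assert abs(expected_auc_fixed_k(10, 80, 80) - (1 - 10 / 160)) < 1e-12
```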
Theorem 1 Let C be a binary classifier and let S be a data sample of size N with m positive examples and n negative examples, N = m + n. Let E = [e_1, e_2] be a confidence interval for the error rate of C over S at the confidence level 1 − ε. Then, for any δ, 0 ≤ δ ≤ 1, we can compute a confidence interval for the AUC value of the classifier C at the confidence level (1 − ε)(1 − δ) that depends only on ε, δ, m, n, and the interval E.

Proof. Let k_1 = N e_1 and k_2 = N e_2 be the numbers of errors associated with the error rates e_1 and e_2, and let I_K be the interval I_K = [k_1, k_2]. For a fixed k ∈ I_K, by Proposition 2 and Corollary 1, we can compute the exact value of the expectation E[A_k] and variance σ²(A_k) of the AUC A_k. Using Chebyshev's inequality, for any k ∈ I_K and any δ_k > 0,

$$P\left(|A_k - E[A_k]| \ge \frac{\sigma(A_k)}{\sqrt{\delta_k}}\right) \le \delta_k \qquad (7)$$

where E[A_k] and σ(A_k) are the expressions given in Proposition 2 and Corollary 1, which depend only on k, m, and n. Let α_1 and α_2 be defined by:

$$\alpha_1 = \min_{k \in I_K}\left(E[A_k] - \frac{\sigma(A_k)}{\sqrt{\delta_k}}\right) \qquad \alpha_2 = \max_{k \in I_K}\left(E[A_k] + \frac{\sigma(A_k)}{\sqrt{\delta_k}}\right) \qquad (8)$$

α_1 and α_2 only depend on I_K (i.e., on e_1 and e_2), and on δ_k, m, and n. Let I_A be the confidence interval defined by I_A = [α_1, α_2] and let δ_k = δ for all k ∈ I_K. Using the fact that the confidence interval E is independent of our equiprobability model for fixed-k AUC values and Bayes' rule:

$$P(A \in I_A) = \sum_{k \in \mathbb{R}^+} P(A \in I_A \mid K = k)\,P(K = k) \qquad (9)$$
$$\ge \sum_{k \in I_K} P(A \in I_A \mid K = k)\,P(K = k) \qquad (10)$$
$$\ge (1 - \delta)\sum_{k \in I_K} P(K = k) \ \ge\ (1 - \delta)(1 - \epsilon) \qquad (11)$$

where we used the property of Eq. 7 and the definitions of the intervals I_K and I_A. Thus, I_A constitutes a confidence interval for the AUC value of C at the confidence level (1 − ε)(1 − δ).

In practice, the confidence interval E is often determined as a result of the assumption that C follows a binomial law. This leads to the following theorem.

[Figure 1 (two panels, (a) and (b): standard deviation as a function of the AUC for the Max, Distribution-dependent, and Distribution-independent methods). Caption: Comparison of the standard deviations for three different methods with: (a) m = n = 500; (b) m = 400 and n = 200. The curves are obtained by computing the expected AUC and its standard deviations for different values of the error rate using the maximum-variance approach (Eq. 3), our distribution-independent method, and the distribution-dependent approach of Hanley [10].]

Theorem 2 Let C be a binary classifier, let S be a data sample of size N with m positive examples and n negative examples, N = m + n, and let k_0 be the number of misclassifications of C on S. Assume that C follows a binomial law. Then, for any ε, 0 ≤ ε ≤ 1, we can compute a confidence interval of the AUC value of the classifier C at the confidence level 1 − ε that depends only on ε, k_0, m, and n.

Proof. Assume that C follows a binomial law with coefficient p. Then, Chebyshev's inequality yields:

$$P(|C - E[C]| \ge \eta) \le \frac{p(1-p)}{N\eta^2} \le \frac{1}{4N\eta^2} \qquad (12)$$

Thus,

$$E = \left[\frac{k_0}{N} - \frac{1}{2\sqrt{(1-\sqrt{1-\epsilon})N}},\ \frac{k_0}{N} + \frac{1}{2\sqrt{(1-\sqrt{1-\epsilon})N}}\right]$$

forms a confidence interval for the error rate of C at the confidence level $1 - (1-\sqrt{1-\epsilon}) = \sqrt{1-\epsilon}$. By Theorem 1, we can compute for the AUC value a confidence interval at the level $(1 - (1 - \sqrt{1-\epsilon}))(1 - (1 - \sqrt{1-\epsilon})) = 1 - \epsilon$ depending only on ε, m, n, and the interval E, i.e., on k_0, N = m + n, and ε.
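The proof of Theorem 2 translates directly into a procedure: bound the error rate with Chebyshev's inequality, then scan the corresponding range of k and apply the per-k Chebyshev bound of Eq. (7). The sketch below is ours; it keeps δ_k constant as in the proof (the δ_k refinement mentioned below is omitted), reuses the expected_auc_fixed_k function sketched above, and assumes a function var_auc_fixed_k(k, m, n) implementing Corollary 1 is available.

```python
import math

def auc_confidence_interval(k0, m, n, eps, var_auc_fixed_k):
    """[alpha_1, alpha_2] for the AUC at confidence level 1 - eps (Theorem 2,
    Chebyshev version), given k0 observed errors on m positives and n negatives."""
    N = m + n
    delta = 1.0 - math.sqrt(1.0 - eps)         # risk split between the two stages
    eta = 1.0 / (2.0 * math.sqrt(delta * N))   # error-rate half-width from Eq. (12)
    e1, e2 = max(0.0, k0 / N - eta), min(1.0, k0 / N + eta)
    # Conservative I_K; the propositions assume k does not exceed min(m, n).
    k1, k2 = int(math.floor(e1 * N)), int(math.ceil(e2 * N))

    alpha1, alpha2 = float("inf"), float("-inf")
    for k in range(k1, k2 + 1):                # Eq. (8) with delta_k = delta
        mu = expected_auc_fixed_k(k, m, n)
        half = math.sqrt(var_auc_fixed_k(k, m, n) / delta)
        alpha1 = min(alpha1, mu - half)
        alpha2 = max(alpha2, mu + half)
    return max(0.0, alpha1), min(1.0, alpha2)  # clip to the valid AUC range
```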
For large N, we can use the normal approximation of the binomial law to determine a finer interval E. Indeed, for large N,

$$P(|C - E[C]| \ge \eta) \le 2\Phi(2\sqrt{N}\eta) \qquad (13)$$

with $\Phi(u) = \int_u^{\infty}\frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx$. Thus,

$$E = \left[\frac{k_0}{N} - \frac{\Phi^{-1}\!\big(\frac{1-\sqrt{1-\epsilon}}{2}\big)}{2\sqrt{N}},\ \frac{k_0}{N} + \frac{\Phi^{-1}\!\big(\frac{1-\sqrt{1-\epsilon}}{2}\big)}{2\sqrt{N}}\right]$$

is the confidence interval for the error rate at the confidence level $\sqrt{1-\epsilon}$.

For simplicity, in the proof of Theorem 2, δ_k was chosen to be a constant (δ_k = δ) but, in general, it can be another function of k leading to tighter confidence intervals. The results presented in the next section were obtained with $\delta_k = a_0 \exp\big((k - k_0)^2/(2a_1^2)\big)$, where a_0 and a_1 are constants selected so that inequality (11) is verified.

7 Experiments and Comparisons

The analysis in the previous section provides a principled method for computing a confidence interval of the AUC value of a classifier C at the confidence level 1 − ε that depends only on k, n, and m. As already discussed, other expressions found in the statistical literature lead to either too loose or unsafely narrow confidence intervals based on questionable assumptions on the probability functions P_x and P_y [10, 15]. Figure 1 shows a comparison of the standard deviations obtained using the maximum-variance approach (Eq. 3), the distribution-dependent expression from [10], and our distribution-independent method for various error rates. For m = n = 500, our distribution-independent method consistently leads to tighter confidence intervals (Fig. 1(a)). It also leads to tighter confidence intervals for AUC values above 0.75 for the uneven distribution m = 400 and n = 200 (Fig. 1(b)). For lower AUC values, the distribution-dependent approach produces tighter intervals, but its underlying assumptions may not hold.

A different comparison was made using several datasets available from the UC Irvine repository (Table 1).

Table 1: Accuracy and AUC values for AdaBoost [8] and estimated standard deviations for several datasets from the UC Irvine repository. σ_indep is the distribution-independent standard deviation obtained using our method (Theorem 2). σ_A is given by Eq. (2) with the values of A, P_xxy, and P_xyy derived from the data. σ_dep is the distribution-dependent standard deviation of Hanley [10], which is based on assumptions that may not always hold. σ_max is defined by Eq. (3). All results were obtained on a randomly selected test set of size m + n.

NAME          m+n    n/(m+n)   AUC    k/(m+n)   σ_indep   σ_A      σ_dep    σ_max
pima          368    0.63      0.70   0.24      0.0297    0.0440   0.0269   0.0392
yeast         700    0.67      0.63   0.26      0.0277    0.0330   0.0215   0.0317
credit        303    0.54      0.87   0.13      0.0176    0.0309   0.0202   0.0281
internet-ads  1159   0.17      0.85   0.05      0.0177    0.0161   0.0176   0.0253
page-blocks   2473   0.10      0.84   0.03      0.0164    0.0088   0.0161   0.0234
ionosphere    201    0.37      0.85   0.13      0.0271    0.0463   0.0306   0.0417

The table shows that our estimates of the standard deviations (σ_indep) are in general close to or tighter than the distribution-dependent standard deviation σ_dep of Hanley [10]. This is despite the fact that we make no assumption about the distributions of positive and negative examples, whereas Hanley's method is based on specific assumptions about these distributions. Plots of the actual ranking distribution demonstrate, however, that these assumptions are often violated. Thus, the relatively good performance of Hanley's approach on several data sets can be viewed as fortuitous and is not general. Our distribution-independent method provides tight confidence intervals, in some cases tighter than those derived from σ_A, in particular because it exploits the information provided by the error rate.
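As a quick sanity check on the reconstructed table, the σ_max column can be recomputed from Eq. (3) alone. The sketch below is ours; the class split is inferred from the m + n and n/(m+n) columns and is therefore an approximation.

```python
import math

# pima row of Table 1: m + n = 368, n/(m+n) = 0.63, AUC = 0.70.
total = 368
n = round(0.63 * total)      # about 232 negative test examples (inferred)
m = total - n                # about 136 positive test examples
sigma_max = math.sqrt(0.70 * (1 - 0.70) / min(m, n))   # Eq. (3)
print(round(sigma_max, 4))   # ~0.0393, in line with the 0.0392 reported
```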
Our analysis can also be used to determine whether the difference between the AUC values produced by two classifiers is statistically significant, by checking if the AUC value of one falls within the confidence interval of the other.

8 Conclusion

We presented principled techniques for computing useful confidence intervals for the AUC from simple parameters: the error rate, and the negative and positive sample sizes. We demonstrated the practicality of these confidence intervals by comparing them to previous approaches in several tasks. We also derived the exact expression of the variance of the AUC for a fixed k, which can be of interest in other analyses related to the AUC.

The Wilcoxon-Mann-Whitney statistic is a general measure of the quality of a ranking that is an estimate of the probability that the classifier ranks a randomly chosen positive example higher than a negative example. One could argue that accuracy at the top or the bottom of the ranking is of higher importance. This, however, contrary to some belief, is already captured to a certain degree by the definition of the Wilcoxon-Mann-Whitney statistic, which penalizes errors at the top or the bottom of the ranking more heavily. It is however an interesting research problem to determine how to incorporate this bias in a stricter way in the form of a score-specific weight in the ranking measure, a weighted Wilcoxon-Mann-Whitney statistic, or how to compute the corresponding expected value and standard deviation in a general way and design machine learning algorithms to optimize such a measure. A preliminary analysis suggests, however, that the calculation of the expectation and the variance are likely to be extremely complex in that case. Finally, it could also be interesting but difficult to adapt our results to the distribution-dependent case and compare them to those of [10].

Acknowledgments

We thank Rob Schapire for pointing out to us the problem of the statistical significance of the AUC, Daryl Pregibon for the reference to [11], and Saharon Rosset for various discussions about the topic of this paper.

References

[1] D. Bamber. The Area above the Ordinal Dominance Graph and the Area below the Receiver Operating Characteristic Graph. Journal of Math. Psychology, 12, 1975.
[2] Z. W. Birnbaum and O. M. Klose. Bounds for the Variance of the Mann-Whitney Statistic. Annals of Mathematical Statistics, 38, 1957.
[3] J-H. Chauchat, R. Rakotomalala, M. Carloz, and C. Pelletier. Targeting Customer Groups using Gain and Cost Matrix; a Marketing Application. Technical report, ERIC Laboratory - University of Lyon 2, 2001.
[4] Corinna Cortes and Mehryar Mohri. AUC Optimization vs. Error Rate Minimization. In Advances in Neural Information Processing Systems (NIPS 2003), volume 16, Vancouver, Canada, 2004. MIT Press.
[5] D. Van Dantzig. On the Consistency and Power of Wilcoxon's Two Sample Test. In Koninklijke Nederlandse Akademie van Wetenschappen, Series A, volume 54, 1951.
[6] J. P. Egan. Signal Detection Theory and ROC Analysis. Academic Press, 1975.
[7] C. Ferri, P. Flach, and J. Hernández-Orallo. Learning Decision Trees Using the Area Under the ROC Curve. In Proceedings of the 19th International Conference on Machine Learning. Morgan Kaufmann, 2002.
[8] Yoav Freund and Robert E. Schapire. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. In Proceedings of the Second European Conference on Computational Learning Theory, volume 2, 1995.
[9] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics.
Addison-Wesley, Reading, Massachusetts, 1994.
[10] J. A. Hanley and B. J. McNeil. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology, 1982.
[11] E. L. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Holden-Day, San Francisco, California, 1975.
[12] M. C. Mozer, R. Dodier, M. D. Colagrosso, C. Guerra-Salcedo, and R. Wolniewicz. Prodding the ROC Curve: Constrained Optimization of Classifier Performance. In Neural Information Processing Systems (NIPS 2002). MIT Press, 2002.
[13] C. Perlich, F. Provost, and J. Simonoff. Tree Induction vs. Logistic Regression: A Learning Curve Analysis. Journal of Machine Learning Research, 2003.
[14] F. Provost and T. Fawcett. Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distribution. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining. AAAI, 1997.
[15] Saharon Rosset. Ranking-Methods for Flexible Evaluation and Efficient Comparison of 2-Class Models. Master's thesis, Tel-Aviv University, 1999.
[16] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz. Optimizing Classifier Performance via the Wilcoxon-Mann-Whitney Statistics. In Proceedings of the International Conference on Machine Learning, 2003.

5 0.60733086 72 nips-2004-Generalization Error and Algorithmic Convergence of Median Boosting

Author: Balázs Kégl

Abstract: We have recently proposed an extension of AdaBoost to regression that uses the median of the base regressors as the final regressor. In this paper we extend theoretical results obtained for AdaBoost to median boosting and to its localized variant. First, we extend recent results on efficient margin maximizing to show that the algorithm can converge to the maximum achievable margin within a preset precision in a finite number of steps. Then we provide confidence-interval-type bounds on the generalization error. 1

6 0.60543162 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

7 0.60442019 69 nips-2004-Fast Rates to Bayes for Kernel Machines

8 0.60434365 161 nips-2004-Self-Tuning Spectral Clustering

9 0.60431898 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

10 0.6039294 67 nips-2004-Exponentiated Gradient Algorithms for Large-margin Structured Classification

11 0.60386628 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

12 0.60357553 23 nips-2004-Analysis of a greedy active learning strategy

13 0.60329527 204 nips-2004-Variational Minimax Estimation of Discrete Distributions under KL Loss

14 0.60326725 102 nips-2004-Learning first-order Markov models for control

15 0.60286397 44 nips-2004-Conditional Random Fields for Object Recognition

16 0.60255688 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms

17 0.60231996 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images

18 0.60194606 7 nips-2004-A Large Deviation Bound for the Area Under the ROC Curve

19 0.60192448 86 nips-2004-Instance-Specific Bayesian Model Averaging for Classification

20 0.60158294 77 nips-2004-Hierarchical Clustering of a Mixture Model