nips nips2000 nips2000-138 knowledge-graph by maker-knowledge-mining

138 nips-2000-The Use of Classifiers in Sequential Inference


Source: pdf

Author: Vasin Punyakanok, Dan Roth

Abstract: We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem - identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. [sent-5, score-0.559]

2 In particular, we develop two general approaches for an important subproblem - identifying phrase structure. [sent-6, score-0.559]

3 The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. [sent-7, score-0.444]

4 We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing. [sent-9, score-0.214]

5 1 Introduction In many situations it is necessary to make decisions that depend on the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints - the sequential nature of the data or other domain-specific constraints. [sent-10, score-0.64]

6 Consider, for example, the problem of chunking natural language sentences, where the goal is to identify several kinds of phrases (e.g., base noun phrases and subject-verb patterns). [sent-11, score-0.632]

7 For example, one way to address the problem is to utilize two classifiers for each phrase type, one of which recognizes the beginning of the phrase, and the other its end. [sent-15, score-0.731]

8 Clearly, there are constraints over the predictions; for instance, phrases cannot overlap and there are probabilistic constraints over the order of phrases and their lengths. [sent-16, score-1.14]

9 The above mentioned problem is an instance of a general class of problems - identifying the phrase structure in sequential data. [sent-17, score-0.525]

10 This paper develops two general approaches for this class of problems by utilizing general classifiers and performing inferences with their outcomes. [sent-18, score-0.346]

11 Our formalisms directly apply to natural language problems such as shallow parsing [7, 23, 5, 3, 21], computational biology problems such as identifying splice sites [8, 4, 15], and problems in information extraction [9]. [sent-19, score-0.55]

12 In this case, classifiers are functions of the observation sequence and their outcomes represent states; we study two Markov models that are used as inference procedures and differ in the type of classifiers and the details of the probabilistic modeling. [sent-21, score-1.003]

13 The critical shortcoming of this framework is that it attempts to maximize the likelihood of the state sequence - not the true performance measure of interest but only a derivative of it. [sent-22, score-0.127]

14 The second approach extends a constraint satisfaction formalism to deal with variables that are associated with costs and shows how to use this to model the classifier combination problem. [sent-23, score-0.25]

15 In this approach general constraints can be incorporated flexibly and algorithms can be developed that closely address the true global optimization criterion of interest. [sent-24, score-0.159]

16 For both approaches we develop efficient combination algorithms that use general classifiers to yield the inference. [sent-25, score-0.346]

17 The approaches are studied experimentally in the context of shallow parsing - the task of identifying syntactic sequences in sentences [14, 1, 11] - which has been found useful in many large-scale language processing applications including information extraction and text summarization [12, 2]. [sent-26, score-0.664]

18 Working within a concrete task allows us to compare the approaches experimentally for phrase types such as base Noun Phrases (NPs) and Subject-Verb phrases (SVs) that differ significantly in their statistical properties, including length and internal dependencies. [sent-27, score-0.986]

19 Our two main methods, projection-based Markov Models (PMM) and constraint satisfaction with classifiers (CSCL), are shown to perform very well on the task of predicting NP and SV phrases, with CSCL at least as good as any other method tried on these tasks. [sent-29, score-0.411]

20 We attribute this to CSCL's ability to cope better with the length of the phrase and the long-term dependencies. [sent-31, score-0.446]

21 2 Identifying Phrase Structure The inference problem considered can be formalized as that of identifying the phrase structure of an input string. [sent-33, score-0.553]

22 Given an input string O = <o1, ..., on>, a phrase is a substring of consecutive input symbols oi, oi+1, ..., oj. [sent-37, score-0.447]

23 Some external mechanism is assumed to consistently (or stochastically) annotate substrings as phrases (footnote 1). [sent-41, score-0.496]

24 Our goal is to come up with a mechanism that, given an input string, identifies the phrases in this string. [sent-42, score-0.526]

25 The identification mechanism works by using classifiers that attempt to recognize in the input string local signals which are indicative of the existence of a phrase. [sent-43, score-0.618]

26 We assume that the outcome of the classifier at input symbol o can be represented as a function of the local context of o in the input string, perhaps with the aid of some external information inferred from it (footnote 2). [sent-44, score-0.216]

27 Classifiers can indicate that an input symbol o is inside or outside a phrase (IO modeling) or they can indicate that an input symbol o opens or closes a phrase (OC modeling) or some combination of the two. [sent-45, score-1.071]

28 Our work here focuses on OC modeling, which has been shown to be more robust than IO, especially with fairly long phrases [21]. [sent-46, score-0.504]
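
As a hedged illustration of the two representations, the sketch below labels a hypothetical token sequence under IO and OC modeling; the tokens, the phrase span, and the tag strings are invented for exposition and are not taken from the paper's data.

```python
# Hypothetical sentence with one phrase, "the big dog", at positions 1-3 (inclusive).
tokens = ["saw", "the", "big", "dog", "yesterday"]
phrase_spans = [(1, 3)]

# IO modeling: each token is Inside (I) or Outside (O) a phrase.
io_tags = ["O"] * len(tokens)
for start, end in phrase_spans:
    for i in range(start, end + 1):
        io_tags[i] = "I"

# OC modeling: local signals mark where a phrase opens and where it closes
# (a length-one phrase would carry both signals on the same token).
oc_tags = ["-"] * len(tokens)
for start, end in phrase_spans:
    oc_tags[start] = "Open"
    oc_tags[end] = "Close"

print(list(zip(tokens, io_tags, oc_tags)))
# [('saw','O','-'), ('the','I','Open'), ('big','I','-'), ('dog','I','Close'), ('yesterday','O','-')]
```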

29 In any case, the classifiers' outcomes can be combined to determine the phrases in the input string. [sent-47, score-0.582]

30 This process, however, needs to satisfy some constraints for the resulting set of phrases to be legitimate. [sent-48, score-0.544]

31 The goal is thus twofold: to learn classifiers that recognize the local signals and to combine them in a way that respects the constraints. [sent-50, score-0.427]

32 We call the inference algorithm that combines the classifiers and outputs a coherent phrase structure a combinator. [sent-51, score-0.822]

33 The performance of this process is measured by how accurately it retrieves the phrase structure of the input string. [sent-52, score-0.447]

34 This is quantified in terms of recall - the percentage of phrases that are correctly identified - and precision - the percentage of identified phrases that are indeed correct phrases. [sent-53, score-0.93]
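
For concreteness, a minimal sketch of these two measures together with the F_{β=1} score reported later in Table 1; the phrase spans in the usage line are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F_{beta=1} over sets of phrase spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Hypothetical (start, end) spans of predicted and gold phrases.
print(precision_recall_f1([(1, 3), (5, 6)], [(1, 3), (5, 7)]))  # (0.5, 0.5, 0.5)
```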

35 1 We assume here a single type of phrase, and thus each input symbol is either in a phrase or outside it. [sent-54, score-0.535]

36 All the methods can be extended to deal with several kinds of phrases in a string. [sent-55, score-0.465]

37 2 In the case of natural language processing, if the oi's are words in a sentence, additional information might include morphological information, part-of-speech tags, semantic class information from WordNet, etc. [sent-56, score-0.21]

38 3 Markov Modeling An HMM is a probabilistic finite state automaton that models the probabilistic generation of sequential processes. [sent-58, score-0.196]

39 It consists of a finite set S of states, a set O of observations, an initial state distribution P1(s), a state-transition distribution P(s|s') (s, s' ∈ S) and an observation distribution P(o|s) (o ∈ O, s ∈ S). [sent-59, score-0.157]

40 A sequence of observations is generated by first picking an initial state according to P1(s); this state produces an observation according to P(o|s) and transits to a new state according to P(s|s'). [sent-60, score-0.388]

41 This state produces the next observation, and the process goes on until it reaches a designated final state [22]. [sent-61, score-0.114]
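
As a hedged illustration of this generative process, the sketch below samples a sequence from a toy two-state HMM; the state names, observation alphabet, and probabilities are invented for exposition, and the sketch stops after a fixed length rather than at a designated final state.

```python
import random

# Toy HMM: initial distribution P1(s), transitions P(s|s'), observation distribution P(o|s).
P1 = {"Phrase": 0.4, "Outside": 0.6}
P_trans = {"Phrase": {"Phrase": 0.7, "Outside": 0.3},
           "Outside": {"Phrase": 0.3, "Outside": 0.7}}
P_obs = {"Phrase": {"DT": 0.4, "NN": 0.5, "VB": 0.1},
         "Outside": {"DT": 0.2, "NN": 0.2, "VB": 0.6}}

def sample(dist):
    """Draw one key of `dist` with probability proportional to its value."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate(length):
    """Pick an initial state, emit an observation, transit, and repeat."""
    state, sequence = sample(P1), []
    for _ in range(length):
        sequence.append((state, sample(P_obs[state])))
        state = sample(P_trans[state])
    return sequence

print(generate(5))  # e.g. [('Outside', 'VB'), ('Phrase', 'DT'), ('Phrase', 'NN'), ...]
```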

42 In a supervised learning task, an observation sequence O = <o1, o2, ..., on> is supervised by a corresponding state sequence S = <s1, s2, ..., sn>. [sent-62, score-0.17] [sent-64, score-0.127]

44 This allows one to estimate the HMM parameters and then, given a new observation sequence, to identify the most likely corresponding state sequence. [sent-65, score-0.191]

45 2) using local signals from which the state sequence can be recovered. [sent-67, score-0.156]

46 1 A Hidden Markov Model Combinator To recover the most likely state sequence in an HMM, we wish to estimate all the required probability distributions. [sent-71, score-0.161]

47 That is, we are given classifiers with states as their outcomes. [sent-74, score-0.351]

48 We still need Pt(o), which is harder to approximate but, for each t, can be treated as a constant ηt because the goal is to find the most likely sequence of states for the given observations, which are the same for all compared sequences. [sent-81, score-0.175]
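
A hedged reconstruction of the rewrite this treatment supports, assuming (as stated above) classifiers that output Pt(s | ot) while the HMM needs the observation probability Pt(ot | s); the notation ηt for the constant follows the sentence above, and the displayed form is not quoted from the paper.

```latex
% Bayes-rule rewrite behind the HMM combinator (a reconstruction, not verbatim from the paper).
P_t(o_t \mid s) \;=\; \frac{P_t(s \mid o_t)\, P_t(o_t)}{P_t(s)}
              \;=\; \frac{P_t(s \mid o_t)}{P_t(s)}\, \eta_t .
```

Since ηt is shared by every candidate state sequence for the same observations, the maximization can drop it and score states by Pt(s | ot)/Pt(s) in place of the observation probabilities.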

49 With this scheme, we can still combine the classifiers' predictions by finding the most likely sequence for an observation sequence using dynamic programming. [sent-82, score-0.313]

50 In Sec. 6 we estimate P(s|o) based on a whole observation sequence rather than ot, to significantly improve the performance. [sent-86, score-0.17]

51 2 A Projection-based Markov Model Combinator In HMMs, observations are allowed to depend only on the current state, and long-term dependencies are not modeled. [sent-88, score-0.104]

52 Equivalently, the constraint structure is restricted by having a stationary probability distribution of a state given the previous one. [sent-89, score-0.136]

53 Thus, given an observation sequence O we can find the most likely state sequence S given O by maximizing P(s1 | o1) · ∏_{t=2}^{n} P(st | st-1, ot). Hence, this model generalizes the standard HMM by combining the state-transition probability and the observation probability into one function. [sent-98, score-0.431]

54 The most likely state sequence can still be recovered using the dynamic programming (Viterbi) algorithm if we modify the recursive step: δt(s) = max_{s' ∈ S} δt-1(s') · P(s | s', ot). [sent-99, score-0.161]
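
A minimal sketch of this modified Viterbi recursion, assuming the combined probabilities P(s | s', ot) are available as callables indexed by the previous state; the initialization with P(s | o1), the log-space guards, and the function names are illustrative choices, not the paper's exact implementation.

```python
import math

def pmm_viterbi(obs, states, p_init, p_cond):
    """Most likely state sequence under delta_t(s) = max_{s'} delta_{t-1}(s') * P(s | s', o_t).

    p_init[s](o)         -- score for the first position, ~ P(s | o_1)
    p_cond[s_prev][s](o) -- combined transition-observation probability P(s | s_prev, o)
    """
    # Work in log space to avoid underflow on long observation sequences.
    delta = {s: math.log(max(p_init[s](obs[0]), 1e-12)) for s in states}
    backpointers = []
    for o in obs[1:]:
        new_delta, pointer = {}, {}
        for s in states:
            scores = {sp: delta[sp] + math.log(max(p_cond[sp][s](o), 1e-12)) for sp in states}
            best = max(scores, key=scores.get)
            pointer[s], new_delta[s] = best, scores[best]
        delta = new_delta
        backpointers.append(pointer)
    # Backtrack from the best final state.
    state = max(delta, key=delta.get)
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    return list(reversed(path))
```

Here the p_cond[s'][s] lookups play the role of the per-previous-state functions Ps'(s | o) obtained by the projection approach described in the next sentences.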

55 To learn these classifiers we follow the projection approach [26] and separate P(s | s', o) into many functions Ps'(s | o) according to the previous state s'. [sent-101, score-0.38]

56 Since these are simpler classifiers, we hope that the performance will improve. [sent-104, score-0.314]

57 Sec. 6 exhibits the contribution of estimating Ps'(s | o) using a wider window in the observation sequence. [sent-107, score-0.129]

58 In both cases, the attempt to combine classifiers with Markov models is motivated by the desire to improve the existing Markov models; the belief is that this would yield better generalization than the pure observation probability estimation from the training data. [sent-111, score-0.531]

59 The starting point is the existence of general classifiers that provide some local information on the input sequence along with constraints on their outcomes; our goal is to use the classifiers to infer the phrase structure of the sequence in a way that satisfies the constraints. [sent-113, score-1.357]

60 Technically, another novelty worth mentioning is that we use a wider range of observations instead of a single observation to predict a state. [sent-115, score-0.176]

61 4 Constraints Satisfaction with Classifiers This section describes a different model that is based on an extension of the Boolean constraint satisfaction (CSP) formalism [17] to handle variables that are the outcome of classifiers. [sent-117, score-0.147]

62 As before, we are given an observation sequence O = <o1, ..., on> and local classifiers that, without loss of generality, take two distinct values, one indicating opening a phrase and a second indicating closing it (OC modeling). [sent-121, score-0.76]

63 The classifiers provide their output in terms of the probabilities P(o) and P(c), given the observation. [sent-122, score-0.314]

64 We extend the CSP formalism to deal with probabilistic variables (or, more generally, variables with cost) as follows. [sent-123, score-0.102]

65 The constraints are encoded as clauses and, as in standard CSP modeling, the Boolean CSP becomes a CNF (conjunctive normal form) formula f. [sent-125, score-0.149]

66 One efficient way to use this general scheme is by encoding phrases as variables. [sent-128, score-0.465]

67 Then, all the non-overlapping constraints can be encoded in ∧_{ei overlaps ej} (¬ei ∨ ¬ej). [sent-130, score-0.11]

68 This yields a quadratic number of variables, and the constraints are binary, encoding the restriction that phrases do not overlap. [sent-131, score-0.544]
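
A small sketch of this encoding, in which candidate phrases become Boolean variables and every overlapping pair contributes one binary clause (¬ei ∨ ¬ej); the candidate spans and the tuple representation of clauses are hypothetical choices for illustration.

```python
from itertools import combinations

def overlaps(a, b):
    """Two phrases given as inclusive (start, end) spans overlap if their spans intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def non_overlap_clauses(candidates):
    """One clause (not e_i or not e_j) per overlapping pair of candidate phrases."""
    return [(("not", i), ("not", j))
            for (i, a), (j, b) in combinations(enumerate(candidates), 2)
            if overlaps(a, b)]

# Hypothetical candidate phrases over a 6-symbol observation sequence.
candidates = [(0, 2), (2, 4), (5, 5)]
print(non_overlap_clauses(candidates))  # [(('not', 0), ('not', 1))]
```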

69 For the specific case of phrase structure, however, we can find the optimal solution in linear time. [sent-133, score-0.417]

70 The solution to the optimization problem corresponds to a shortest path in a directed acyclic graph constructed on the observation symbols, with legitimate phrases (the variables of the CSP) as its edges and their cost as the edges' weights. [sent-134, score-0.579]

71 A natural cost function is to use the classifiers' probabilities P(o) and P(c) and define, for a phrase e = (o, c), c(e) = 1 - P(o)P(c). [sent-140, score-0.799]
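
A minimal sketch of this shortest-path formulation under the stated cost: positions between observation symbols are the graph's nodes, each candidate phrase e = (o, c) is an edge weighted 1 - P(o)P(c), and 'skip' edges let a symbol remain outside every phrase. The candidate list, the skip-edge cost, and the helper names are assumptions for illustration, not the paper's exact construction.

```python
def best_phrase_set(n, candidates, skip_cost=0.5):
    """Shortest path over positions 0..n; each candidate (i, j, p_open, p_close) is an
    edge i -> j+1 with weight 1 - p_open*p_close, and each position also has a 'skip'
    edge i -> i+1 (the skip cost here is an illustrative assumption)."""
    INF = float("inf")
    dist = [INF] * (n + 1)
    choice = [None] * (n + 1)  # edge used to reach each position
    dist[0] = 0.0
    edges = [(i, i + 1, skip_cost, None) for i in range(n)]
    edges += [(i, j + 1, 1.0 - po * pc, (i, j)) for (i, j, po, pc) in candidates]
    # Positions form a DAG ordered left to right, so one left-to-right pass relaxes all edges.
    for start in range(n):
        if dist[start] == INF:
            continue
        for (i, j, w, phrase) in edges:
            if i == start and dist[i] + w < dist[j]:
                dist[j], choice[j] = dist[i] + w, (i, phrase)
    # Recover the chosen (non-overlapping) phrases by walking back from position n.
    phrases, pos = [], n
    while pos > 0:
        prev, phrase = choice[pos]
        if phrase is not None:
            phrases.append(phrase)
        pos = prev
    return list(reversed(phrases))

# Hypothetical candidates (start, end, P(open), P(close)) over 5 symbols.
print(best_phrase_set(5, [(0, 2, 0.9, 0.8), (1, 3, 0.4, 0.5), (4, 4, 0.7, 0.9)]))
# [(0, 2), (4, 4)]
```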

72 5 Shallow Parsing We use shallow parsing tasks in order to evaluate our approaches. [sent-143, score-0.256]

73 Shallow parsing involves the identification of phrases or of words that participate in a syntactic relationship. [sent-144, score-0.721]

74 The observation that shallow syntactic information can be extracted using local information - by examining the pattern itself, its nearby context and the local part-of-speech information - has motivated the use of learning methods to recognize these patterns [7, 23, 3, 5]. [sent-145, score-0.395]

75 In this work we study the identification of two types of phrases, base Noun Phrases (NP) and Subject Verb (SV) patterns. [sent-146, score-0.131]

76 Consequently, each classifier may output three possible outcomes O, nOi, nOo (open, not open inside, not open outside) and C, nCi, nCo, resp. [sent-151, score-0.22]

77 Figure 1: State-transition diagram for the phrase recognition problem. [sent-154, score-0.417]

78 1 Classification The classifier we use to learn the states as a function of the observation is SNoW [24, 6], a multi-class classifier that is specifically tailored for large scale learning tasks. [sent-156, score-0.283]

79 SNoW has already been used successfully for a variety of tasks in natural language and visual processing [10, 25]. [sent-158, score-0.109]

80 In the current study we normalize the activation levels of all targets to sum to 1 and output the outcomes for all targets (states). [sent-161, score-0.178]
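
A minimal sketch of this normalization step, assuming raw per-target activation levels from the classifier; the target names and values are hypothetical.

```python
def normalize(activations):
    """Scale the per-target activation levels so they sum to 1."""
    total = sum(activations.values())
    return {target: a / total for target, a in activations.items()}

# Hypothetical activations for the three open-classifier targets at one input symbol.
print(normalize({"O": 2.0, "nOi": 1.0, "nOo": 1.0}))  # {'O': 0.5, 'nOi': 0.25, 'nOo': 0.25}
```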

81 6 Experiments We experimented both with NPs and SVs, and we show results for two different representations of the observations (that is, different feature sets for the classifiers) - part-of-speech (POS) information only and POS with additional lexical information (words). [sent-164, score-0.202]

82 For NP, the training and test corpus was prepared from sections 15 to 18 and section 20, respectively; the SV phrase corpus was prepared from sections 1 to 9 for training and section for testing. [sent-169, score-0.571]

83 When the observations are in terms of lexical items, the data is too sparse to yield robust estimates and these entries were left empty. [sent-172, score-0.103]

84 The NB (naive Bayes) and SNoW classifiers use the same feature set, conjunctions of size 3 of POS tags (POS and words, resp.). [sent-173, score-0.49]

85 Table 1: Results (F_{β=1}) of different methods on NP and SV recognition, comparing the HMM, PMM, and CSCL models, each combined with the SNoW, NB, and Simple classifiers, with POS tags only and with POS tags plus words as features; the numeric entries are not recoverable from this extraction. [sent-175, score-0.112]

86 The first important observation is that the SV identification task is significantly more difficult than the NP task. [sent-205, score-0.163]

87 What is interesting here is the very significant sensitivity to the feature base of the classifiers used, despite the violation of the probabilistic assumptions. [sent-208, score-0.401]

88 For the easier NP task, the HMM model is competitive with the others when the classifiers used are NB or SNoW. [sent-209, score-0.314]

89 In particular, the significant improvements both probabilistic methods achieve when their input is given by SNoW confirm the claim that the output of SNoW can be used reliably as a probabilistic classifier. [sent-210, score-0.134]

90 PMM and CSCL perform very well on predicting NP and SV phrases, with CSCL at least as good as any other method tried on these tasks. [sent-211, score-0.465]

91 We attribute this to CSCL's ability to cope better with the length of the phrase and the long-term dependencies. [sent-213, score-0.446]

92 7 Conclusion We have addressed the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. [sent-214, score-0.526]

93 The focus here is on an important subproblem, the identification of phrase structure. [sent-216, score-0.48]

94 It seems that the CSP formalisms can support the desired performance measure as well as complex constraints and dependencies more flexibly than the Markovian approach. [sent-219, score-0.148]

95 As a side effect, this work exhibits the use of general classifiers within a probabilistic framework. [sent-221, score-0.366]

96 Future work includes extensions to deal with more general constraints by exploiting more general probabilistic structures and generalizing the CSP approach. [sent-222, score-0.131]

97 A memory-based approach to learning shallow natural language patterns. [sent-248, score-0.253]

98 Error-driven pruning of treebank grammars for base noun phrase identification. [sent-258, score-0.545]

99 A stochastic parts program and noun phrase parser for unrestricted text. [sent-270, score-0.51]

100 Learning to resolve natural language ambiguities: A unified approach. [sent-366, score-0.109]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('phrases', 0.465), ('phrase', 0.417), ('classifiers', 0.314), ('cscl', 0.186), ('snow', 0.151), ('csp', 0.149), ('shallow', 0.144), ('sls', 0.13), ('slo', 0.128), ('hmm', 0.122), ('parsing', 0.112), ('pmm', 0.112), ('tags', 0.112), ('sv', 0.11), ('pt', 0.106), ('np', 0.102), ('observation', 0.1), ('noun', 0.093), ('ols', 0.093), ('outcomes', 0.087), ('nb', 0.08), ('extraction', 0.08), ('language', 0.079), ('constraints', 0.079), ('pos', 0.074), ('identifying', 0.073), ('classifier', 0.073), ('sequence', 0.07), ('satisfaction', 0.067), ('string', 0.067), ('hmms', 0.065), ('pas', 0.064), ('identification', 0.063), ('sis', 0.058), ('coherent', 0.058), ('state', 0.057), ('lexical', 0.056), ('nps', 0.056), ('svs', 0.056), ('markov', 0.054), ('symbol', 0.054), ('probabilistic', 0.052), ('formalism', 0.05), ('syntactic', 0.048), ('pl', 0.048), ('roth', 0.048), ('corpus', 0.048), ('observations', 0.047), ('recognize', 0.045), ('oc', 0.045), ('incorporated', 0.043), ('combine', 0.039), ('modeling', 0.039), ('attempt', 0.039), ('markovian', 0.038), ('cost', 0.038), ('experimentally', 0.037), ('combinator', 0.037), ('flexibly', 0.037), ('freitag', 0.037), ('otls', 0.037), ('punyakanok', 0.037), ('subproblem', 0.037), ('verb', 0.037), ('states', 0.037), ('boolean', 0.036), ('sequential', 0.035), ('inside', 0.035), ('base', 0.035), ('speech', 0.035), ('likely', 0.034), ('outside', 0.034), ('harder', 0.034), ('satisfies', 0.034), ('inference', 0.033), ('words', 0.033), ('study', 0.033), ('semantic', 0.033), ('linguistics', 0.032), ('penn', 0.032), ('formalisms', 0.032), ('approaches', 0.032), ('mechanism', 0.031), ('encoded', 0.031), ('input', 0.03), ('extends', 0.03), ('text', 0.03), ('natural', 0.03), ('open', 0.03), ('constraint', 0.03), ('local', 0.029), ('projection', 0.029), ('targets', 0.029), ('assignment', 0.029), ('path', 0.029), ('sentences', 0.029), ('attribute', 0.029), ('wider', 0.029), ('chunking', 0.029), ('prepared', 0.029)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999988 138 nips-2000-The Use of Classifiers in Sequential Inference

Author: Vasin Punyakanok, Dan Roth

Abstract: We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem - identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing.

2 0.12273809 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

3 0.11082613 75 nips-2000-Large Scale Bayes Point Machines

Author: Ralf Herbrich, Thore Graepel

Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m^2) of memory and O(N · m^2) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%. 1

4 0.105045 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

Author: Hervé Bourlard, Samy Bengio, Katrin Weber

Abstract: In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi/band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent)

5 0.10077059 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

Author: Koby Crammer, Yoram Singer

Abstract: Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes.

6 0.098968603 130 nips-2000-Text Classification using String Kernels

7 0.071528144 96 nips-2000-One Microphone Source Separation

8 0.066790283 58 nips-2000-From Margin to Sparsity

9 0.066120997 6 nips-2000-A Neural Probabilistic Language Model

10 0.063577667 145 nips-2000-Weak Learners and Improved Rates of Convergence in Boosting

11 0.060114507 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

12 0.060084712 133 nips-2000-The Kernel Gibbs Sampler

13 0.056970652 41 nips-2000-Discovering Hidden Variables: A Structure-Based Approach

14 0.054354012 120 nips-2000-Sparse Greedy Gaussian Process Regression

15 0.053012166 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

16 0.052397456 51 nips-2000-Factored Semi-Tied Covariance Matrices

17 0.051395945 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

18 0.051232863 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

19 0.048744123 4 nips-2000-A Linear Programming Approach to Novelty Detection

20 0.047785871 140 nips-2000-Tree-Based Modeling and Estimation of Gaussian Processes on Graphs with Cycles


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.195), (1, 0.041), (2, 0.03), (3, 0.017), (4, -0.09), (5, -0.092), (6, -0.038), (7, -0.004), (8, 0.006), (9, 0.126), (10, 0.139), (11, 0.017), (12, 0.064), (13, 0.01), (14, -0.129), (15, 0.069), (16, -0.0), (17, 0.014), (18, 0.123), (19, 0.041), (20, -0.016), (21, -0.03), (22, 0.003), (23, -0.193), (24, 0.006), (25, -0.123), (26, -0.065), (27, 0.041), (28, 0.083), (29, -0.02), (30, -0.019), (31, 0.216), (32, 0.072), (33, 0.046), (34, 0.04), (35, -0.022), (36, 0.017), (37, 0.038), (38, 0.074), (39, -0.201), (40, 0.006), (41, -0.075), (42, -0.008), (43, 0.103), (44, -0.012), (45, -0.047), (46, -0.059), (47, -0.031), (48, 0.076), (49, 0.106)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94152099 138 nips-2000-The Use of Classifiers in Sequential Inference

Author: Vasin Punyakanok, Dan Roth

Abstract: We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem - identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing.

2 0.48119774 68 nips-2000-Improved Output Coding for Classification Using Continuous Relaxation

Author: Koby Crammer, Yoram Singer

Abstract: Output coding is a general method for solving multiclass problems by reducing them to multiple binary classification problems. Previous research on output coding has employed, almost solely, predefined discrete codes. We describe an algorithm that improves the performance of output codes by relaxing them to continuous codes. The relaxation procedure is cast as an optimization problem and is reminiscent of the quadratic program for support vector machines. We describe experiments with the proposed algorithm, comparing it to standard discrete output codes. The experimental results indicate that continuous relaxations of output codes often improve the generalization performance, especially for short codes.

3 0.47979352 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

Author: Hervé Bourlard, Samy Bengio, Katrin Weber

Abstract: In this paper, we discuss some new research directions in automatic speech recognition (ASR), and which somewhat deviate from the usual approaches. More specifically, we will motivate and briefly describe new approaches based on multi-stream and multi/band ASR. These approaches extend the standard hidden Markov model (HMM) based approach by assuming that the different (frequency) channels representing the speech signal are processed by different (independent)

4 0.43095657 9 nips-2000-A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

Author: Ralf Herbrich, Thore Graepel

Abstract: We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC- Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of less training examples than input dimensions sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets. 1

5 0.39936563 75 nips-2000-Large Scale Bayes Point Machines

Author: Ralf Herbrich, Thore Graepel

Abstract: The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m^2) of memory and O(N · m^2) computational steps where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%. 1

6 0.37537754 80 nips-2000-Learning Switching Linear Models of Human Motion

7 0.36380649 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

8 0.34646675 130 nips-2000-Text Classification using String Kernels

9 0.32674545 96 nips-2000-One Microphone Source Separation

10 0.30469993 136 nips-2000-The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity

11 0.30254486 6 nips-2000-A Neural Probabilistic Language Model

12 0.29949608 58 nips-2000-From Margin to Sparsity

13 0.29203236 132 nips-2000-The Interplay of Symbolic and Subsymbolic Processes in Anagram Problem Solving

14 0.28206182 44 nips-2000-Efficient Learning of Linear Perceptrons

15 0.27525508 133 nips-2000-The Kernel Gibbs Sampler

16 0.27084681 86 nips-2000-Model Complexity, Goodness of Fit and Diminishing Returns

17 0.26721516 51 nips-2000-Factored Semi-Tied Covariance Matrices

18 0.26241624 56 nips-2000-Foundations for a Circuit Complexity Theory of Sensory Processing

19 0.25374252 103 nips-2000-Probabilistic Semantic Video Indexing

20 0.25299942 109 nips-2000-Redundancy and Dimensionality Reduction in Sparse-Distributed Representations of Natural Objects in Terms of Their Local Features


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.023), (10, 0.032), (17, 0.081), (26, 0.019), (32, 0.018), (33, 0.057), (55, 0.038), (62, 0.062), (65, 0.035), (67, 0.042), (74, 0.329), (75, 0.011), (76, 0.03), (79, 0.02), (81, 0.038), (90, 0.035), (97, 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.79329956 138 nips-2000-The Use of Classifiers in Sequential Inference

Author: Vasin Punyakanok, Dan Roth

Abstract: We study the problem of combining the outcomes of several different classifiers in a way that provides a coherent inference that satisfies some constraints. In particular, we develop two general approaches for an important subproblem - identifying phrase structure. The first is a Markovian approach that extends standard HMMs to allow the use of a rich observation structure and of general classifiers to model state-observation dependencies. The second is an extension of constraint satisfaction formalisms. We develop efficient combination algorithms under both models and study them experimentally in the context of shallow parsing.

2 0.38055146 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

Author: Zoubin Ghahramani, Matthew J. Beal

Abstract: Variational approximations are becoming a widespread tool for Bayesian learning of graphical models. We provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. We show how the belief propagation and the junction tree algorithms can be used in the inference step of variational Bayesian learning. Applying these results to the Bayesian analysis of linear-Gaussian state-space models we obtain a learning procedure that exploits the Kalman smoothing propagation, while integrating over all model parameters. We demonstrate how this can be used to infer the hidden state dimensionality of the state-space model in a variety of synthetic problems and one real high-dimensional data set. 1

3 0.38053763 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

Author: Javier R. Movellan, Paul Mineiro, Ruth J. Williams

Abstract: This paper explores a framework for recognition of image sequences using partially observable stochastic differential equation (SDE) models. Monte-Carlo importance sampling techniques are used for efficient estimation of sequence likelihoods and sequence likelihood gradients. Once the network dynamics are learned, we apply the SDE models to sequence recognition tasks in a manner similar to the way Hidden Markov models (HMMs) are commonly applied. The potential advantage of SDEs over HMMS is the use of continuous state dynamics. We present encouraging results for a video sequence recognition task in which SDE models provided excellent performance when compared to hidden Markov models. 1

4 0.3786054 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

Author: Claudio Gentile

Abstract: A new incremental learning algorithm is described which approximates the maximal margin hyperplane w.r.t. norm p ~ 2 for a set of linearly separable data. Our algorithm, called ALMAp (Approximate Large Margin algorithm w.r.t. norm p), takes 0 ((P~21;;2) corrections to separate the data with p-norm margin larger than (1 - 0:) ,,(, where,,( is the p-norm margin of the data and X is a bound on the p-norm of the instances. ALMAp avoids quadratic (or higher-order) programming methods. It is very easy to implement and is as fast as on-line algorithms, such as Rosenblatt's perceptron. We report on some experiments comparing ALMAp to two incremental algorithms: Perceptron and Li and Long's ROMMA. Our algorithm seems to perform quite better than both. The accuracy levels achieved by ALMAp are slightly inferior to those obtained by Support vector Machines (SVMs). On the other hand, ALMAp is quite faster and easier to implement than standard SVMs training algorithms.

5 0.37570903 104 nips-2000-Processing of Time Series by Neural Circuits with Biologically Realistic Synaptic Dynamics

Author: Thomas Natschläger, Wolfgang Maass, Eduardo D. Sontag, Anthony M. Zador

Abstract: Experimental data show that biological synapses behave quite differently from the symbolic synapses in common artificial neural network models. Biological synapses are dynamic, i.e., their

6 0.36978111 80 nips-2000-Learning Switching Linear Models of Human Motion

7 0.36824194 69 nips-2000-Incorporating Second-Order Functional Knowledge for Better Option Pricing

8 0.36506873 74 nips-2000-Kernel Expansions with Unlabeled Examples

9 0.36451128 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script

10 0.36298293 146 nips-2000-What Can a Single Neuron Compute?

11 0.36127299 6 nips-2000-A Neural Probabilistic Language Model

12 0.35968566 123 nips-2000-Speech Denoising and Dereverberation Using Probabilistic Models

13 0.35853216 28 nips-2000-Balancing Multiple Sources of Reward in Reinforcement Learning

14 0.35703534 49 nips-2000-Explaining Away in Weight Space

15 0.3561908 111 nips-2000-Regularized Winnow Methods

16 0.35616305 79 nips-2000-Learning Segmentation by Random Walks

17 0.35585222 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

18 0.35583761 90 nips-2000-New Approaches Towards Robust and Adaptive Speech Recognition

19 0.3550179 94 nips-2000-On Reversing Jensen's Inequality

20 0.3549118 127 nips-2000-Structure Learning in Human Causal Induction