(Not) Bounding the True Error (NIPS 2001)
Authors: John Langford, Rich Caruana
Abstract: We present a new approach to bounding the true error rate of a continuous valued classifier based upon PAC-Bayes bounds. The method first constructs a distribution over classifiers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classifier found with this sensitivity analysis can then be tightly bounded using a PAC-Bayes bound. In this paper we demonstrate the method on artificial neural networks, with results showing an order of magnitude improvement over the best deterministic neural net bounds.
1 Introduction

In machine learning it is important to know the true error rate a classifier will achieve on future test cases. Unfortunately, known bounds are typically far too loose to be useful: for example, all known bounds on the true error rate of artificial neural networks tend to be extremely loose and often result in the meaningless bound of "always err" (error rate = 1.0). In this paper, we do not bound the true error rate of a neural network. Instead, we bound the true error rate of a distribution over neural networks, which we create by analysing one neural network. This approach proves to be much more fruitful than trying to bound the true error rate of an individual network. The best current approaches [1][2] often require many more examples than are typically available before producing a nontrivial bound on the true error rate; we produce nontrivial bounds on the true error rate of a stochastic neural network with far fewer examples. A stochastic neural network is a neural network in which each weight is perturbed by Gaussian noise of some variance every time it is evaluated.
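A minimal sketch of evaluating such a stochastic network (not the authors' code; the 2-layer architecture, the tanh transfer function, and the per-layer noise scales are illustrative assumptions):

```python
import numpy as np

def stochastic_forward(x, weights, sigmas, rng):
    """One evaluation of a hypothetical 2-layer stochastic net: every
    weight is re-drawn from a Gaussian centred on its trained value
    each time the net is run."""
    W1, b1, W2, b2 = weights
    s1, sb1, s2, sb2 = sigmas          # noise scales (scalars or arrays)
    h = np.tanh(x @ (W1 + rng.normal(0.0, s1, W1.shape))
                + (b1 + rng.normal(0.0, sb1, b1.shape)))
    out = (h @ (W2 + rng.normal(0.0, s2, W2.shape))
           + (b2 + rng.normal(0.0, sb2, b2.shape)))
    return out.ravel() > 0             # hard classification

def stochastic_error(X, y, weights, sigmas, n_draws, rng):
    """Average empirical error over n_draws of the weight noise."""
    errs = [np.mean(stochastic_forward(X, weights, sigmas, rng) != y)
            for _ in range(n_draws)]
    return float(np.mean(errs))
```

With every noise scale set to zero this reduces to the deterministic net; per-weight variances would be passed in as arrays of the same shape as the weights.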
The approach can be thought of as a redivision of the work between the experimenter and the theoretician: we make the experimenter work harder so that the theoretician's true error bound becomes much tighter. An alternative viewpoint is that the classification problem is to find a hypothesis with a low upper bound on the future error rate. We present a post-processing phase for neural networks which results in a classifier with a much lower upper bound on the future error rate, and we exhibit a technique which will likely give nontrivial true error rate bounds for Bayesian neural networks regardless of approximation or prior modeling errors. Given the average empirical error rate, it is straightforward to apply the PAC-Bayes bound in order to find a bound on the average true error rate.
In this paper we refer to the distribution of neural nets that results from this noise analysis as a stochastic neural net model. Why do we expect the PAC-Bayes bound to be a significant improvement over standard covering number and VC bound approaches? There exist learning problems for which the gap between the lower bound and the PAC-Bayes upper bound is tight up to a small term in m, where m is the number of training examples.
Since the underlying distribution is unknown, the true error rate is not directly observable. Now that the basic quantities of interest are defined, we will first present a modern neural network bound, and then specialize the PAC-Bayes bound to a stochastic neural network. A stochastic neural network is simply a neural network where each weight is drawn from some distribution whenever it is used.
2.1 Neural network bound

We will compare a specialization of the best current neural network true error rate bound [2] with our approach. The neural network bound is described in terms of several parameters, among them an upper bound on the sum of the magnitudes of the weights in each layer of the network and a Lipschitz constant for the transfer function (a Lipschitz constant is a bound on the magnitude of the derivative).

Corollary 2.1 (2-layer feed-forward neural network true error bound) expresses a bound on the true error rate in terms of the empirical error, the per-layer weight-magnitude bounds, the Lipschitz constants, and the number of training examples.

The neural network true error bound above is (perhaps) the tightest known bound for general feed-forward neural networks, and so it is the natural bound to compare against.
This 2-layer feed-forward bound is not easily applied in a tight manner because we cannot calculate a priori what our weight bound should be. This can be patched up using the principle of structural risk minimization: we state the bound for a discretized family of weight caps indexed by a non-negative integer and scaled by a constant.

Corollary 2.2 (2-layer feed-forward neural network true error bound) restates Corollary 2.1 simultaneously for every weight cap in this family.

Proof: Apply the union bound to all possible values of the weight cap, as discussed above.
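One standard way to make this union-bound step explicit is a geometric allocation of confidence across the discretized weight caps (an assumption for illustration; the text does not pin down the exact constants):

```latex
\sum_{k=0}^{\infty} \delta_k \;=\; \sum_{k=0}^{\infty} \frac{\delta}{2^{k+1}} \;=\; \delta,
\qquad\text{so}\qquad
\Pr\!\left[\,\exists\, k \ge 0:\ \text{the bound with weight cap } c\,2^{k}\ \text{fails at confidence } \delta_k\right] \;\le\; \delta .
```

Any summable allocation works; the geometric one costs only an additive $\ln 2^{k+1}$ term inside the $k$-th bound.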
In practice, we consider a finite range of weight caps and report the value of the tightest applicable bound over all of them.
2.2 Stochastic neural network bound

Our approach starts with a simple refinement [3] of the original PAC-Bayes bound [5]. We will first specialize this bound to stochastic neural networks, and then show that using it in conjunction with a post-processing algorithm results in a much tighter upper bound on the true error rate. The true error rate of the stochastic hypothesis is the error rate of the classifier which, on every evaluation, draws a hypothesis from the posterior distribution and outputs its prediction; the average empirical error rate is the corresponding quantity on the training set. Now we are ready to state the theorem.
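The refinement of the PAC-Bayes bound used here is commonly stated as follows (standard notation, which may differ in constants from the authors' exact statement: $Q$ is the posterior over hypotheses, $P$ the prior, $\hat{e}_Q$ and $e_Q$ the average empirical and true error rates, $m$ the number of training examples):

```latex
\Pr_{S \sim D^{m}}\!\left[\ \forall Q:\;
\mathrm{KL}\!\left(\hat{e}_Q \,\middle\|\, e_Q\right)
\;\le\; \frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{m+1}{\delta}}{m}
\ \right] \;\ge\; 1-\delta,
```

where $\mathrm{KL}(q \,\|\, p) = q \ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p}$ is the KL divergence between Bernoulli distributions.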
We need to specialize this theorem for application to a stochastic neural network with a particular choice of "prior". Our "prior" places zero mass on all neural net structures other than the one we train, and a multidimensional isotropic Gaussian on the values of the weights in our neural network.
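For a diagonal-Gaussian posterior $Q$ (mean at the trained weights $w_i$, per-weight standard deviations $\sigma_i$) and an isotropic Gaussian prior $P$ (mean $\mu_{P,i}$, common standard deviation $\sigma_P$), the KL term in the bound has the standard closed form (the symbols here are our own):

```latex
\mathrm{KL}(Q \,\|\, P) \;=\; \sum_{i=1}^{n}\left[
\ln\frac{\sigma_P}{\sigma_i}
+ \frac{\sigma_i^{2} + (w_i - \mu_{P,i})^{2}}{2\,\sigma_P^{2}}
- \frac{1}{2}\right].
```

Note that each term shrinks as $\sigma_i$ grows toward $\sigma_P$, which is why noise-tolerant ("meaningless") weights contribute little to the complexity.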
Now, the union bound implies that all bounds hold simultaneously with probability at least 1 − δ. We avoid the need for a direct evaluation of the stochastic empirical error by using a Monte Carlo evaluation together with a bound on the tail of the Monte Carlo estimate. Proof: this is simply an application of the Chernoff bound for the tail of a Binomial, where a "head" occurs when an error is observed and the bias is the true stochastic error rate. In order to calculate a bound on the expected true error rate, we first bound the expected empirical error rate with confidence δ/2, and then bound the expected true error rate with confidence δ/2 using our bound on the empirical error rate.
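The two-step calculation can be sketched numerically. The δ/2 split, the ln(m+1) slack term, and all names below are standard choices assumed for illustration, not necessarily the paper's exact constants:

```python
import math

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_upper_inverse(q, bound):
    """Largest p >= q with KL(q || p) <= bound, found by bisection."""
    lo, hi = q, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_bernoulli(q, mid) > bound:
            hi = mid
        else:
            lo = mid
    return lo

def true_error_bound(mc_errors, n_mc, kl_qp, m, delta):
    """Step 1: lift the Monte Carlo estimate of the stochastic net's
    empirical error to a high-probability upper bound (Binomial/Chernoff
    tail, confidence delta/2).  Step 2: lift that to a true-error bound
    via the PAC-Bayes theorem (confidence delta/2, m training examples)."""
    e_hat = kl_upper_inverse(mc_errors / n_mc, math.log(2 / delta) / n_mc)
    return kl_upper_inverse(e_hat, (kl_qp + math.log(2 * (m + 1) / delta)) / m)
```

The same KL inversion serves both steps because the Chernoff tail bound for a Binomial can be written in the same relative-entropy form as the PAC-Bayes bound.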
Since the total probability of failure is then at most δ, the combined bound holds with probability at least 1 − δ. In practice, we use many evaluations of the empirical error rate of the stochastic neural network. The variance of the posterior Gaussian needs to depend on each weight in order to achieve a tight bound, since we want any "meaningless" weights not to contribute significantly to the overall sample complexity. For every weight, we search for the variance which reduces the empirical accuracy of the stochastic neural network by some fixed target percentage (5% in Figure 1, 1% in Figure 2) while holding all other weights fixed.
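The per-weight search can be sketched as a one-dimensional bisection (an illustrative scheme; acc_fn, the bracket [lo, hi], and the monotonicity assumption are ours, not the paper's):

```python
def find_weight_sigma(acc_fn, base_acc, target_drop, lo=0.0, hi=4.0, iters=30):
    """Bisect for the noise level sigma at which acc_fn(sigma) -- the
    empirical accuracy of the net with Gaussian noise on this one
    weight only -- drops by target_drop below base_acc.  Assumes
    accuracy is (approximately) non-increasing in sigma and that hi
    already causes at least target_drop of damage."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if base_acc - acc_fn(mid) < target_drop:
            lo = mid      # too little damage: the noise can grow
        else:
            hi = mid      # too much damage: shrink the noise
    return (lo + hi) / 2.0
```

In practice acc_fn would average many stochastic evaluations, so the search has to tolerate Monte Carlo noise (for example by using enough draws per probe).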
Figure 1: Plot of measured errors and error bounds (SNN bound, NN bound, SNN/NN train error, SNN/NN test error) for the neural network (NN) and the stochastic neural network (SNN) on the synthetic problem, as a function of the number of pattern presentations. The training set has 100 cases and the reduction in empirical error is 5%. Note that a true error bound of "100" (visible in the graph on the left) implies that many more examples are required in order to make a nonvacuous bound. The graph on the right expands the vertical scale by excluding the poor true error bound. The shape of the SNN bound roughly mimics the shape of the empirically measured true error (this is more visible in the graph on the right) and thus might be useful for indicating where the net begins overfitting.
The stochastic neural network defined by these per-weight variances will generally lose too much accuracy, because all weights are perturbed jointly. Therefore, we calculate a global multiplier on the variances such that the resulting stochastic neural network decreases the empirical accuracy by only the same target amount (in absolute error rate) used in Step 2.
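This global correction can reuse the same kind of one-dimensional search, now over a single multiplier applied to every per-weight variance at once (a sketch; the quadratic accuracy proxy in the demo is purely illustrative):

```python
def bisect_multiplier(acc_at_kappa, base_acc, target_drop, lo=0.0, hi=1.0, iters=40):
    """Find the global multiplier kappa at which scaling all per-weight
    sigmas by kappa costs roughly target_drop of empirical accuracy.
    Assumes accuracy is non-increasing in kappa."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if base_acc - acc_at_kappa(mid) < target_drop:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Toy illustration: accuracy falls quadratically with the joint noise
# scale, so a 5% target drop is hit at kappa = 0.5.
kappa = bisect_multiplier(lambda k: 1.0 - 0.2 * k * k, base_acc=1.0, target_drop=0.05)
```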
Then, we evaluate the empirical error rate of the resulting stochastic neural net by repeatedly drawing samples from the stochastic neural network.

3 Experimental Results

How well can we bound the true error rate of a stochastic neural network? The answer is: much better than we can bound the true error rate of a neural network.
The "ideal" neural net for solving the synthetic problem is a single-node perceptron. We instead use a 2-layer neural net with 2 hidden nodes and the sigmoid transfer function. This overly complex neural net creates the potential for significant overfitting, which makes the bound prediction problem interesting. We use a 2-layer neural net with 2 hidden units for the ADULT problem as well, because preliminary experiments showed that nets this small can overfit the ADULT dataset if the training sample is small.
Figure 2: Plot of measured errors and error bounds (SNN bound, NN bound, SNN/NN train error, SNN/NN test error) for the neural network (NN) and the stochastic neural network (SNN) on the UCI ADULT dataset, as a function of the number of pattern presentations. These graphs show the results obtained using a 1% reduction in empirical error instead of the 5% reduction used in Figure 1. As in Figure 1, a true error bound of "100" implies that many more examples are required in order to make a nonvacuous bound. The graph on the right expands the vertical scale by excluding the poor true error bound.
As we see in Figures 1 and 2, constructing a nonvacuous bound for a continuous hypothesis space from so few examples is quite difficult. The choice of the reduction in empirical error is somewhat arbitrary. While not as tight as might be desired, the SNN upper bounds are orders of magnitude better than the NN bounds and are not vacuous.
In particular, on the synthetic problem the true error rate of the SNN is only slightly worse than the true error rate of the NN (true error rates are estimated using large test sets). This is surprising considering that we fixed the difference in empirical error rates at 5% for the synthetic problem. Similarly, on the ADULT problem the difference in true error rates between the SNN and the NN is typically very small. On both test problems, the shape of the SNN bound is somewhat similar to the shape of the true error rate; in particular, the local minima in the SNN bound occur roughly where the local minima in the true error rates occur. The SNN bound may therefore weakly predict the overfitting points of the SNN and NN nets.
The comparison between the neural network bound and the stochastic neural network bound is not quite "fair" due to the form of the bound. In particular, the stochastic neural network bound can never return a value greater than "always err". This implies that when the bound is near that value, it is difficult to judge how rapidly extra examples will improve the stochastic neural network bound. We can judge the sample complexity of the stochastic bound by plotting the value of the numerator in Equation 1.
Figure 3: Plot of the "complexity" of the stochastic network model (the numerator of Equation 1) vs. the number of pattern presentations. Note that the complexity increases with more training, as expected, and stays below the training set size, implying nonvacuous bounds.
The stochastic bound is a radical improvement on the neural network bound, but it is not yet a perfectly tight bound. Given that we do not have a perfectly tight bound, one important question arises: does the minimum of the stochastic bound predict the minimum of the true error rate (as estimated on a large holdout dataset)? In particular, can we use the stochastic bound to determine when we should cease training? The stochastic bound depends upon (1) the complexity, which increases with training time, and (2) the training error, which decreases with training time.
In general, the "optimal" choice of the extra error rate depends upon the learning problem. Figure 4 shows the resulting bound for different choices of posterior, parameterized by the extra empirical error they introduce. The bound has a minimum at a small extra error, indicating that our initial choices are in the right ballpark and may be unnecessarily large.
Larger differences in empirical error rate are easier to obtain reliably with fewer samples from the stochastic neural net, but we have not had difficulty using as few as 100 samples from the SNN with as small as a 1% increase in empirical error. Also note that the complexity always decreases with increasing entropy in the distribution of our stochastic neural net. The existence of a minimum in Figure 4 is the "right" behaviour: the increased empirical error rate is significant in the calculation of the true error bound.
4 Conclusion

We have applied a PAC-Bayes bound to the true error rate of a stochastic neural network. The stochastic neural network bound results in a radically tighter (orders of magnitude) bound on the true error rate of a classifier, while increasing the empirical and true error rates only a small amount.

Figure 4: Plot of the stochastic neural net (SNN) bound for "posterior" distributions chosen according to the extra empirical error they introduce.
Although the stochastic neural net bound is not completely tight, it is not vacuous even with a modest number of examples, and the minima of the bound weakly predict the point where overtraining occurs.
The results with two datasets (one synthetic and one from UCI) are extremely promising: the bounds are orders of magnitude better. Our next step will be to test the method on more datasets using a greater variety of net architectures, to ensure that the bounds remain tight. Also, we have not taken into account symmetries within the network, which would allow for an even tighter bound calculation.
wordName wordTfidf (topN-words)
[('snn', 0.469), ('bound', 0.466), ('nn', 0.268), ('stochastic', 0.222), ('error', 0.186), ('presentations', 0.172), ('net', 0.172), ('true', 0.15), ('network', 0.146), ('bounds', 0.14), ('rate', 0.127), ('adult', 0.125), ('empirical', 0.117), ('neural', 0.099), ('tight', 0.093), ('nonvacuous', 0.09), ('multidimensional', 0.085), ('synthetic', 0.084), ('kl', 0.084), ('classi', 0.079), ('experimenter', 0.068), ('tighter', 0.066), ('layer', 0.065), ('complexity', 0.063), ('extra', 0.063), ('specialize', 0.06), ('reduction', 0.056), ('train', 0.054), ('nontrivial', 0.053), ('drawn', 0.05), ('pattern', 0.047), ('ers', 0.047), ('orders', 0.046), ('magnitude', 0.046), ('caruana', 0.045), ('epochs', 0.045), ('theoretician', 0.045), ('bounding', 0.045), ('uci', 0.045), ('tting', 0.044), ('training', 0.044), ('er', 0.043), ('minima', 0.043), ('weakly', 0.043), ('dif', 0.042), ('numerator', 0.041), ('corollary', 0.039), ('coin', 0.039), ('tightest', 0.039), ('somewhat', 0.039), ('variance', 0.037), ('theorem', 0.037), ('upper', 0.036), ('err', 0.036), ('lipschitz', 0.036), ('examples', 0.035), ('datasets', 0.035), ('perfectly', 0.035), ('th', 0.035), ('proof', 0.034), ('union', 0.034), ('networks', 0.034), ('judge', 0.033), ('meaningless', 0.033), ('test', 0.033), ('graph', 0.033), ('divergence', 0.033), ('signi', 0.032), ('sample', 0.032), ('dataset', 0.032), ('fi', 0.031), ('hypotheses', 0.031), ('evaluations', 0.031), ('langford', 0.031), ('tail', 0.031), ('cant', 0.031), ('choice', 0.03), ('dependent', 0.03), ('exhibit', 0.03), ('covering', 0.03), ('visible', 0.03), ('rates', 0.03), ('ned', 0.029), ('plot', 0.029), ('excluding', 0.028), ('expands', 0.028), ('shape', 0.028), ('weights', 0.028), ('sigmoid', 0.027), ('calculate', 0.027), ('hold', 0.026), ('holds', 0.026), ('culty', 0.026), ('nets', 0.026), ('things', 0.026), ('constructing', 0.025), ('david', 0.025), ('failure', 0.025), ('expected', 0.025), ('mackay', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 1 nips-2001-(Not) Bounding the True Error
Author: John Langford, Rich Caruana
Abstract: We present a new approach to bounding the true error rate of a continuous valued classifier based upon PAC-Bayes bounds. The method first constructs a distribution over classifiers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classifier found with the sensitivity analysis can then be tightly bounded using a PAC-Bayes bound. In this paper we demonstrate the method on artificial neural networks with results of a order of magnitude improvement vs. the best deterministic neural net bounds. £ ¡ ¤¢
2 0.24747364 119 nips-2001-Means, Correlations and Bounds
Author: Martijn Leisink, Bert Kappen
Abstract: The partition function for a Boltzmann machine can be bounded from above and below. We can use this to bound the means and the correlations. For networks with small weights, the values of these statistics can be restricted to non-trivial regions (i.e. a subset of [-1 , 1]). Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results. 1
3 0.1865111 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
Author: T. Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert functional Spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.
4 0.15756623 143 nips-2001-PAC Generalization Bounds for Co-training
Author: Sanjoy Dasgupta, Michael L. Littman, David A. McAllester
Abstract: The rule-based bootstrapping introduced by Yarowsky, and its cotraining variant by Blum and Mitchell, have met with considerable empirical success. Earlier work on the theory of co-training has been only loosely related to empirically useful co-training algorithms. Here we give a new PAC-style bound on generalization error which justifies both the use of confidences — partial rules and partial labeling of the unlabeled data — and the use of an agreement-based objective function as suggested by Collins and Singer. Our bounds apply to the multiclass case, i.e., where instances are to be assigned one of labels for . £ ¡ ¤¢
5 0.13349426 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model
Author: Dan Klein, Christopher D. Manning
Abstract: This paper presents a novel approach to the unsupervised learning of syntactic analyses of natural language text. Most previous work has focused on maximizing likelihood according to generative PCFG models. In contrast, we employ a simpler probabilistic model over trees based directly on constituent identity and linear context, and use an EM-like iterative procedure to induce structure. This method produces much higher quality analyses, giving the best published results on the ATIS dataset. 1 Overview To enable a wide range of subsequent tasks, human language sentences are standardly given tree-structure analyses, wherein the nodes in a tree dominate contiguous spans of words called constituents, as in figure 1(a). Constituents are the linguistically coherent units in the sentence, and are usually labeled with a constituent category, such as noun phrase (NP) or verb phrase (VP). An aim of grammar induction systems is to figure out, given just the sentences in a corpus S, what tree structures correspond to them. In this sense, the grammar induction problem is an incomplete data problem, where the complete data is the corpus of trees T , but we only observe their yields S. This paper presents a new approach to this problem, which gains leverage by directly making use of constituent contexts. It is an open problem whether entirely unsupervised methods can produce linguistically accurate parses of sentences. Due to the difficulty of this task, the vast majority of statistical parsing work has focused on supervised learning approaches to parsing, where one uses a treebank of fully parsed sentences to induce a model which parses unseen sentences [7, 3]. But there are compelling motivations for unsupervised grammar induction. Building supervised training data requires considerable resources, including time and linguistic expertise. 
Investigating unsupervised methods can shed light on linguistic phenomena which are implicit within a supervised parser’s supervisory information (e.g., unsupervised systems often have difficulty correctly attaching subjects to verbs above objects, whereas for a supervised parser, this ordering is implicit in the supervisory information). Finally, while the presented system makes no claims to modeling human language acquisition, results on whether there is enough information in sentences to recover their structure are important data for linguistic theory, where it has standardly been assumed that the information in the data is deficient, and strong innate knowledge is required for language acquisition [4]. Node S VP NP NN1 NNS Factory payrolls VBD fell PP IN NN2 in September Constituent S NP VP PP NN 1 NNS VBD IN NN 2 NN NNS VBD IN NN NN NNS VBD IN NN IN NN NN NNS VBD IN NNS Context – – VBD NNS – VBD – – NNS NN – VBD NNS – IN VBD – NN IN – Empty 0 1 2 3 4 5 Context – NN – NNS – VBD – IN – NN – NN NNS VBD IN NN Figure 1: Example parse tree with the constituents and contexts for each tree node. 2 Previous Approaches One aspect of grammar induction where there has already been substantial success is the induction of parts-of-speech. Several different distributional clustering approaches have resulted in relatively high-quality clusterings, though the clusters’ resemblance to classical parts-of-speech varies substantially [9, 15]. For the present work, we take the part-ofspeech induction problem as solved and work with sequences of parts-of-speech rather than words. In some ways this makes the problem easier, such as by reducing sparsity, but in other ways it complicates the task (even supervised parsers perform relatively poorly with the actual words replaced by parts-of-speech). Work attempting to induce tree structures has met with much less success. 
Most grammar induction work assumes that trees are generated by a symbolic or probabilistic context-free grammar (CFG or PCFG). These systems generally boil down to one of two types. Some fix the structure of the grammar in advance [12], often with an aim to incorporate linguistic constraints [2] or prior knowledge [13]. These systems typically then attempt to find the grammar production parameters which maximize the likelihood P(S| ) using the inside-outside algorithm [1], which is an efficient (dynamic programming) instance of the EM algorithm [8] for PCFG s. Other systems (which have generally been more successful) incorporate a structural search as well, typically using a heuristic to propose candidate grammar modifications which minimize the joint encoding of data and grammar using an MDL criterion, which asserts that a good analysis is a short one, in that the joint encoding of the grammar and the data is compact [6, 16, 18, 17]. These approaches can also be seen as likelihood maximization where the objective function is the a posteriori likelihood of the grammar given the data, and the description length provides a structural prior. The “compact grammar” aspect of MDL is close to some traditional linguistic argumentation which at times has argued for minimal grammars on grounds of analytical [10] or cognitive [5] economy. However, the primary weakness of MDL-based systems does not have to do with the objective function, but the search procedures they employ. Such systems end up growing structures greedily, in a bottom-up fashion. Therefore, their induction quality is determined by how well they are able to heuristically predict what local intermediate structures will fit into good final global solutions. A potential advantage of systems which fix the grammar and only perform parameter search is that they do compare complete grammars against each other, and are therefore able to detect which give rise to systematically compatible parses. 
However, although early work showed that small, artificial CFGs could be induced with the EM algorithm [12], studies with large natural language grammars have generally suggested that completely unsupervised EM over PCFG s is ineffective for grammar acquisition. For instance, Carroll and Charniak [2] describe experiments running the EM algorithm from random starting points, which produced widely varying learned grammars, almost all of extremely poor quality. 1 1 We duplicated one of their experiments, which used grammars restricted to rules of the form x → x y | y x, where there is one category x for each part-of-speech (such a restricted CFG is isomorphic to a dependency grammar). We began reestimation from a grammar with uniform rewrite It is well-known that EM is only locally optimal, and one might think that the locality of the search procedure, not the objective function, is to blame. The truth is somewhere in between. There are linguistic reasons to distrust an ML objective function. It encourages the symbols and rules to align in ways which maximize the truth of the conditional independence assumptions embodied by the PCFG. The symbols and rules of a natural language grammar, on the other hand, represent syntactically and semantically coherent units, for which a host of linguistic arguments have been made [14]. None of these have anything to do with conditional independence; traditional linguistic constituency reflects only grammatical regularities and possibilities for expansion. There are expected to be strong connections across phrases (such as dependencies between verbs and their selected arguments). It could be that ML over PCFGs and linguistic criteria align, but in practice they do not always seem to. 
Experiments with both artificial [12] and real [13] data have shown that starting from fixed, correct (or at least linguistically reasonable) structure, EM produces a grammar which has higher log-likelihood than the linguistically determined grammar, but lower parsing accuracy. However, we additionally conjecture that EM over PCFGs fails to propagate contextual cues efficiently. The reason we expect an algorithm to converge on a good PCFG is that there seem to be coherent categories, like noun phrases, which occur in distinctive environments, like between the beginning of the sentence and the verb phrase. In the inside-outside algorithm, the product of inside and outside probabilities α j ( p, q)β j ( p, q) is the probability of generating the sentence with a j constituent spanning words p through q: the outside probability captures the environment, and the inside probability the coherent category. If we had a good idea of what VPs and NPs looked like, then if a novel NP appeared in an NP context, the outside probabilities should pressure the sequence to be parsed as an NP . However, what happens early in the EM procedure, when we have no real idea about the grammar parameters? With randomly-weighted, complete grammars over a symbol set X, we have observed that a frequent, short, noun phrase sequence often does get assigned to some category x early on. However, since there is not a clear overall structure learned, there is only very weak pressure for other NPs, even if they occur in the same positions, to also be assigned to x, and the reestimation process goes astray. To enable this kind of constituent-context pressure to be effective, we propose the model in the following section. 3 The Constituent-Context Model We propose an alternate parametric family of models over trees which is better suited for grammar induction. Broadly speaking, inducing trees like the one shown in figure 1(a) can be broken into two tasks. 
One is deciding constituent identity: where the brackets should be placed. The second is deciding what to label the constituents. These tasks are certainly correlated and are usually solved jointly. However, the task of labeling chosen brackets is essentially the same as the part-of-speech induction problem, and the solutions cited above can be adapted to cluster constituents [6]. The task of deciding brackets, is the harder task. For example, the sequence DT NN IN DT NN ([the man in the moon]) is virtually always a noun phrase when it is a constituent, but it is only a constituent 66% of the time, because the IN DT NN is often attached elsewhere ([we [sent a man] [to the moon]]). Figure 2(a) probabilities. Figure 4 shows that the resulting grammar (DEP - PCFG) is not as bad as conventional wisdom suggests. Carroll and Charniak are right to observe that the search spaces is riddled with pronounced local maxima, and EM does not do nearly so well when randomly initialized. The need for random seeding in using EM over PCFGs is two-fold. For some grammars, such as one over a set X of non-terminals in which any x 1 → x2 x3 , xi ∈ X is possible, it is needed to break symmetry. This is not the case for dependency grammars, where symmetry is broken by the yields (e.g., a sentence noun verb can only be covered by a noun or verb projection). The second reason is to start the search from a random region of the space. But unless one does many random restarts, the uniform starting condition is better than most extreme points in the space, and produces superior results. 1.5 2 Usually a Constituent Rarely a Constituent 1 1 0.5 0 0 −1 −2 −3 −1.5 −1 −0.5 NP VP PP −1 −0.5 0 0.5 1 −1.5 −1 −0.5 0 0.5 1 1.5 (a) (b) Figure 2: The most frequent examples of (a) different constituent labels and (b) constituents and non-constituents, in the vector space of linear contexts, projected onto the first two principal components. 
Clustering is effective for labeling, but not for detecting constituents. Figure 2(a) shows the 50 most frequent constituent sequences of three types, represented as points in the vector space of their contexts (see below), projected onto their first two principal components. The three clusters are relatively coherent, and it is not difficult to believe that a clustering algorithm could detect them in the unprojected space. Figure 2(b), however, shows 150 sequences which are parsed as constituents at least 50% of the time along with 150 which are not, again projected onto the first two components. This plot at least suggests that the constituent/non-constituent classification is less amenable to direct clustering. Thus, it is important that an induction system be able to detect constituents, either implicitly or explicitly. A variety of methods of constituent detection have been proposed [11, 6], usually based on information-theoretic properties of a sequence's distributional context. However, here we rely entirely on the following two simple assumptions: (i) constituents of a parse do not cross each other, and (ii) constituents occur in constituent contexts. The first property is self-evident from the nature of the parse trees. The second is an extremely weakened version of classic linguistic constituency tests [14]. Let σ be a terminal sequence. Every occurrence of σ will be in some linear context c(σ) = x σ y, where x and y are the adjacent terminals or sentence boundaries. Then we can view any tree t over a sentence s as a collection of sequences and contexts, one of each for every node in the tree, plus one for each inter-terminal empty span, as in figure 1(b). Good trees will include nodes whose yields frequently occur as constituents and whose contexts frequently surround constituents.
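As a concrete sketch of this decomposition (the span representation and the <s>/</s> boundary symbols are our own illustrative choices, not the paper's implementation), the (sequence, context) pairs for a bracketing can be enumerated as follows:

```python
def tree_decomposition(sent, spans):
    """Decompose a tree, given as a set of (i, j) spans over sent, into the
    (sequence, context) pairs used by the model: one pair per node, plus one
    per inter-terminal empty span, with markers at the sentence boundaries."""
    padded = ["<s>"] + list(sent) + ["</s>"]

    def context(i, j):
        # linear context c(sigma) = x sigma y: the terminals adjacent to the span
        return (padded[i], padded[j + 1])

    pairs = [(tuple(sent[i:j]), context(i, j)) for (i, j) in sorted(spans)]
    pairs += [((), context(k, k)) for k in range(len(sent) + 1)]  # empty spans
    return pairs
```

For the two-word sentence "the screen" with all brackets, this yields one pair per span plus three empty-span pairs, matching the "one per node, plus one per inter-terminal empty span" description above.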
Formally, we use a conditional exponential model of the form:

P(t|s, Λ) = exp(Σ_σ λ_σ f_σ(t) + Σ_c λ_c f_c(t)) / Σ_{t′: yield(t′)=s} exp(Σ_σ λ_σ f_σ(t′) + Σ_c λ_c f_c(t′))

We have one feature f_σ(t) for each sequence σ whose value on a tree t is the number of nodes in t with yield σ, and one feature f_c(t) for each context c representing the number of times c is the context of the yield of some node in the tree.2 No joint features over c and σ are used, and, unlike many other systems, there is no distinction between constituent types. We model only the conditional likelihood of the trees, P(T|S, Λ), where Λ = {λ_σ, λ_c}. We then use an iterative EM-style procedure to find a local maximum P(T|S, Λ) of the completed data (trees) T, where P(T|S, Λ) = Π_{t∈T, s=yield(t)} P(t|s, Λ). We initialize Λ such that each λ is zero and initialize T to any arbitrary set of trees. In alternating steps, we first fix the parameters Λ and find the most probable single tree structure t∗ for each sentence s according to P(t|s, Λ), using a simple dynamic program. For any Λ this produces the set of parses T∗ which maximizes P(T|S, Λ). Since T∗ maximizes this quantity, if T is the former set of trees, P(T∗|S, Λ) ≥ P(T|S, Λ). Second, we fix the trees and estimate new parameters Λ. The task of finding the parameters Λ∗ which maximize P(T|S, Λ) is simply the well-studied task of fitting our exponential model to maximize the conditional likelihood of the fixed parses. Running, for example, a conjugate gradient (CG) ascent on Λ will produce the desired Λ∗. If Λ is the former parameter vector, then we will have P(T|S, Λ∗) ≥ P(T|S, Λ). Therefore, each iteration will increase P(T|S, Λ) until convergence.3 Note that our parsing model is not a generative model, and this procedure, though clearly related, is not exactly an instance of the EM algorithm.

2 So, for the tree in figure 1(a), P(t|s) ∝ exp(λ_{NN NNS} + λ_{VBD IN NN} + λ_{IN NN} + λ_{−VBD} + λ_{NNS−} + λ_{VBD−} + λ_{−NNS} + λ_{NN−VBD} + λ_{NNS−IN} + λ_{VBD−NN} + λ_{IN−}).
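Because a tree's score decomposes into a sum of per-span terms (λ_σ of the yield plus λ_c of the context), the "simple dynamic program" for the most probable single tree can be a max-sum CKY over binary bracketings. This is a hedged sketch: the weight tables, boundary symbols, and tuple tree encoding are invented for illustration.

```python
def span_score(sent, i, j, lam_seq, lam_ctx):
    # lambda_sigma for the span's yield plus lambda_c for its linear context
    left = sent[i - 1] if i > 0 else "<s>"
    right = sent[j] if j < len(sent) else "</s>"
    return lam_seq.get(tuple(sent[i:j]), 0.0) + lam_ctx.get((left, right), 0.0)

def best_tree(sent, lam_seq, lam_ctx):
    """Max-sum CKY over fully binary bracketings: best[i, j] is the score of
    the best subtree over span (i, j); split[i, j] records its argmax split."""
    n = len(sent)
    best, split = {}, {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            node = span_score(sent, i, j, lam_seq, lam_ctx)
            if length == 1:
                best[i, j] = node
                continue
            k, val = max(((k, best[i, k] + best[k, j]) for k in range(i + 1, j)),
                         key=lambda kv: kv[1])
            best[i, j] = node + val
            split[i, j] = k

    def build(i, j):  # recover the bracketing from the split table
        if j - i == 1:
            return sent[i]
        k = split[i, j]
        return (build(i, k), build(k, j))

    return build(0, n), best[0, n]
```

With a single positive weight on the yield ("the", "screen"), the program brackets that pair together and splits the sentence around it.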
We merely guarantee that the conditional likelihood of the data completions is increasing. Furthermore, unlike in EM, where each iteration increases the marginal likelihood of the fixed observed data, our procedure increases the conditional likelihood of a changing complete data set, with the completions changing at every iteration as we reparse. Several implementation details were important in making the system work well. First, tie-breaking was needed, most of all for the first round. Initially, the parameters are zero, and all parses are therefore equally likely. To prevent bias, all ties were broken randomly. Second, as in so many statistical NLP tasks, smoothing was vital. There are features in our model for arbitrarily long yields, and most yield types occurred only a few times. The most severe consequence of this sparsity was that initial parsing choices could easily become frozen. If a λ_σ for some yield σ was either ∞ or −∞, which was usually the case for rare yields, σ would either be locked into always occurring or never occurring, respectively. Not only did we want to push the λ_σ values close to zero, we also wanted to account for the fact that most spans are not constituents.4 Therefore, we expect the distribution of the λ_σ to be skewed towards low values.5 A greater amount of smoothing was needed for the first few iterations, while much less was required in later iterations. Finally, parameter estimation using a CG method was slow and difficult to smooth in the desired manner, and so we used the smoothed relative frequency estimates λ_σ = count(f_σ)/(count(σ) + M) and λ_c = count(f_c)/(count(c) + N). These estimates ensured that the λ values were between 0 and 1, and gave the desired bias towards non-constituency. These estimates were fast and surprisingly effective, but do not guarantee non-decreasing conditional likelihood (though the conditional likelihood was increasing in practice).
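The smoothed relative frequency estimate is easy to state in code; the counts and the smoothing constant M below are made-up illustrative values, not the paper's settings.

```python
def smoothed_lambdas(constituent_counts, occurrence_counts, M=10.0):
    """lambda_sigma = count(f_sigma) / (count(sigma) + M): the fraction of a
    sequence's occurrences that were parsed as constituents, shrunk toward 0
    so that rare sequences are biased toward non-constituency."""
    return {seq: constituent_counts.get(seq, 0) / (occurrence_counts[seq] + M)
            for seq in occurrence_counts}
```

A sequence seen 90 times and parsed as a constituent 81 of them gets λ = 81/100 = 0.81, while a sequence seen once as a constituent gets only 1/11 ≈ 0.09, which keeps rare yields from freezing into "always a constituent".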
4 Results

In all experiments, we used hand-parsed sentences from the Penn Treebank. For training, we took the approximately 7500 sentences in the Wall Street Journal (WSJ) section which contained 10 words or fewer after the removal of punctuation. For testing, we evaluated the system by comparing the system's parses for those same sentences against the supervised parses in the treebank. We consider each parse as a set of constituent brackets, discarding all trivial brackets.7 We calculated the precision and recall of these brackets against the treebank parses in the obvious way.

3 In practice, we stopped the system after 10 iterations, but final behavior was apparent after 4–8.
4 In a sentence of length n, there are (n + 1)(n + 2)/2 total (possibly size-zero) spans, but only 3n constituent spans: n − 1 of size ≥ 2, n of size 1, and n + 1 empty spans.
5 Gaussian priors for the exponential model accomplish the former goal, but not the latter.
6 The relative frequency estimators had a somewhat subtle positive effect. Empty spans have no effect on the model when using CG fitting, as all trees include the same empty spans. However, including their counts improved performance substantially when using relative frequency estimators. This is perhaps an indication that a generative version of this model would be advantageous.
7 We discarded both brackets of length one and brackets spanning the entire sentence, since all of these are impossible to get incorrect, and hence ignored sentences of length ≤ 2 during testing.

[Figure 3 parse-tree diagrams omitted; example sentence: "The screen was a sea of red".]

Figure 3: Alternate parse trees for a sentence: (a) the Penn Treebank tree (deemed correct), (b) the one found by our system CCM, and (c) the one found by DEP-PCFG.
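The evaluation described here (unlabeled bracket precision and recall with trivial brackets discarded) can be sketched directly on span sets; the half-open (i, j) span encoding is our own convention.

```python
def bracket_prf(guess, gold, n):
    """Unlabeled bracket precision/recall/F1 over (i, j) spans, discarding the
    trivial brackets: spans of length one and the span covering the whole
    sentence, which are impossible to get wrong."""
    def nontrivial(spans):
        return {(i, j) for (i, j) in spans if j - i > 1 and (i, j) != (0, n)}
    g, t = nontrivial(guess), nontrivial(gold)
    if not g or not t:
        return 0.0, 0.0, 0.0
    hits = len(g & t)
    p, r = hits / len(g), hits / len(t)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For example, a guess {(0,2), (2,5), (0,5), (3,4)} against gold {(0,2), (2,4), (2,5)} over a 5-word sentence keeps only {(0,2), (2,5)} as non-trivial guesses, giving precision 1.0 and recall 2/3.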
Method      UP    UR     F1    NP UR   PP UR   VP UR
LBRANCH     20.5  24.2   22.2  28.9    6.3     0.6
RANDOM      29.0  31.0   30.0  42.8    23.6    26.3
DEP-PCFG    39.5  42.3   40.9  69.7    44.1    22.8
RBRANCH     54.1  67.5   60.0  38.3    44.5    85.8
CCM         60.1  75.4   66.9  83.8    71.6    66.3
UBOUND      78.2  100.0  87.8  100.0   100.0   100.0
(a)

System      UP    UR    F1    CB
EMILE       51.6  16.8  25.4  0.84
ABL         43.6  35.6  39.2  2.12
CDC-40      53.4  34.6  42.0  1.46
RBRANCH     39.9  46.4  42.9  2.18
CCM         54.4  46.8  50.3  1.61
(b)

Figure 4: Comparative accuracy on WSJ sentences (a) and on the ATIS corpus (b). UR = unlabeled recall; UP = unlabeled precision; F1 = the harmonic mean of UR and UP; CB = crossing brackets. Separate recall values are shown for three major categories.

To situate the results of our system, figure 4(a) gives the values of several parsing strategies. CCM is our constituent-context model. DEP-PCFG is a dependency PCFG model [2] trained using the inside-outside algorithm. Figure 3 shows sample parses to give a feel for the parses the systems produce. We also tested several baselines. RANDOM parses randomly. This is an appropriate baseline for an unsupervised system. RBRANCH always chooses the right-branching chain, while LBRANCH always chooses the left-branching chain. RBRANCH is often used as a baseline for supervised systems, but exploits a systematic right-branching tendency of English. An unsupervised system has no a priori reason to prefer right chains to left chains, and LBRANCH is well below RANDOM. A system need not beat RBRANCH to claim partial success at grammar induction. Finally, we include an upper bound. All of the parsing strategies and systems mentioned here give fully binary-branching structures. Treebank trees, however, need not be fully binary-branching, and generally are not. As a result, there is an upper bound UBOUND on the precision and F1 scores achievable when structurally confined to binary trees. Clearly, CCM is parsing much better than the RANDOM baseline and the DEP-PCFG induced grammar.
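The RBRANCH and LBRANCH baselines are deterministic and easy to generate as span sets, which can then be fed into any bracket scorer; the half-open (i, j) encoding over an n-word sentence is our own convention.

```python
def rbranch(n):
    """Right-branching chain over n words: [w0 [w1 [w2 ...]]]."""
    return {(i, n) for i in range(n - 1)}

def lbranch(n):
    """Left-branching chain over n words: [[[... w0] w1] w2]."""
    return {(0, j) for j in range(2, n + 1)}
```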
Significantly, it also outperforms RBRANCH in both precision and recall, and, to our knowledge, it is the first unsupervised system to do so. To facilitate comparison with other recent systems, figure 4(b) gives results where we trained as before but used (all) the sentences from the distributionally different ATIS section of the treebank as a test set. For this experiment, precision and recall were calculated using the EVALB system (as in [6, 17]); EVALB is a standard for parser evaluation, but complex, and unsuited to evaluating unlabeled constituency. EMILE and ABL are lexical systems described in [17]. The results for CDC-40, from [6], reflect training on much more data (12M words). Our system is superior in terms of both precision and recall (and so F1). These figures are certainly not all that there is to say about an induced grammar; there are a number of issues in how to interpret the results of an unsupervised system when comparing with treebank parses. Errors come in several kinds. First are innocent sins of commission. Treebank trees are very flat; for example, there is no analysis of the inside of many short noun phrases ([two hard drives] rather than [two [hard drives]]).
Our system gives a (usually correct) analysis of the insides of such NPs, for which it is penalized on precision (though not recall or crossing brackets). Second are systematic alternate analyses. Our system tends to form modal verb groups and often attaches verbs first to pronoun subjects rather than to objects. As a result, many VPs are systematically incorrect, boosting crossing bracket scores and impacting VP recall. Finally, the treebank's grammar is sometimes an arbitrary, and even inconsistent, standard for an unsupervised learner: alternate analyses may be just as good.8 Notwithstanding this, we believe that the treebank parses have enough truth in them that parsing scores are a useful component of evaluation. Ideally, we would like to inspect the quality of the grammar directly. Unfortunately, the grammar acquired by our system is implicit in the learned feature weights. These are not by themselves particularly interpretable, and not directly comparable to the grammars produced by other systems, except through their functional behavior.

Sequence   Example             CORRECT  FREQUENCY  ENTROPY  DEP-PCFG  CCM
DT NN      the man             1        2          2        1         1
NNP NNP    United States       2        1          –        2         2
CD CD      4 1/2               3        9          –        5         5
JJ NNS     daily yields        4        7          3        4         4
DT JJ NN   the top rank        5        –          –        7         6
DT NNS     the people          6        –          –        –         10
JJ NN      plastic furniture   7        3          7        3         3
CD NN      12 percent          8        –          –        –         9
IN NN      on Monday           9        –          9        –         –
IN DT NN   for the moment      10       –          –        –         –
NN NNS     fire trucks         11       –          6        –         8
NN NN      fire truck          22       8          10       –         7
TO VB      to go               26       –          1        6         –
DT JJ      ?the big            78       6          –        –         –
IN DT      *of the             90       4          –        10        –
PRP VBZ    ?he says            95       –          –        8         –
PRP VBP    ?they say           180      –          –        9         –
NNS VBP    ?people are         =350     –          4        –         –
NN VBZ     ?value is           =532     10         5        –         –
NN IN      *man from           =648     5          –        –         –
NNS VBD    ?people were        =648     –          8        –         –

Figure 5: Top non-trivial sequences by actual treebank constituent counts, linear frequency, scaled context entropy, and in DEP-PCFG and CCM learned models' parses.
Any grammar which parses a corpus will have a distribution over which sequences tend to be analyzed as constituents. These distributions can give a good sense of what structures are and are not being learned. Therefore, to supplement the parsing scores above, we examine these distributions. Figure 5 shows the top-scoring constituents by several orderings. These lists do not say very much about how long, complex, recursive constructions are being analyzed by a given system, but grammar induction systems are still at the level where major mistakes manifest themselves in short, frequent sequences. CORRECT ranks sequences by how often they occur as constituents in the treebank parses. DEP-PCFG and CCM are the same, but use counts from the DEP-PCFG and CCM parses. As a baseline, FREQUENCY lists sequences by how often they occur anywhere in the sentence yields. Note that the sequence IN DT (e.g., "of the") is high on this list, and is a typical error of many early systems. Finally, ENTROPY is the heuristic proposed in [11], which ranks by context entropy. It is better in practice than FREQUENCY, but that is not self-evident from this list. Clearly, the lists produced by the CCM system are closer to correct than the others. They look much like a censored version of the FREQUENCY list, where sequences which do not co-exist with higher-ranked ones have been removed (e.g., IN DT often crosses DT NN). This observation may explain a good part of the success of this method. Another explanation for the surprising success of the system is that it exploits a deep fact about language. Most long constituents have some short, frequent equivalent, or proform, which occurs in similar contexts [14].
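The ENTROPY heuristic can be sketched as follows: collect each sequence's linear contexts over a corpus and rank by the Shannon entropy of the context distribution. The toy corpus and length cutoff below are invented for illustration; this is the unscaled variant of the heuristic attributed to [11].

```python
import math
from collections import Counter, defaultdict

def context_entropy_ranking(sentences, max_len=3):
    """Rank terminal sequences by the entropy of their linear-context
    distribution: sequences appearing in many distinct environments score
    high, a rough distributional signal of constituency."""
    contexts = defaultdict(Counter)
    for sent in sentences:
        s = ["<s>"] + list(sent) + ["</s>"]
        for length in range(1, max_len + 1):
            for i in range(1, len(s) - length):
                contexts[tuple(s[i:i + length])][(s[i - 1], s[i + length])] += 1
    def entropy(ctx):
        total = sum(ctx.values())
        return -sum((c / total) * math.log2(c / total) for c in ctx.values())
    return sorted(contexts, key=lambda seq: entropy(contexts[seq]), reverse=True)
```

On a tiny tag corpus, DT NN (seen in two different contexts) outranks DT JJ (seen in only one), mirroring the intuition that varied environments suggest constituency.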
In the very common case where the proform is a single word, it is guaranteed constituency, which will be transmitted to longer sequences via shared contexts (categories like PP which have infrequent proforms are not learned well unless the empty sequence is in the model; interestingly, the empty sequence appears to act as the proform for PPs, possibly due to the highly optional nature of many PPs).

8 For example, transitive sentences are bracketed [subject [verb object]] (The president [executed the law]) while nominalizations are bracketed [[possessive noun] complement] ([The president's execution] of the law), an arbitrary inconsistency which is unlikely to be learned automatically.

5 Conclusions

We have presented an alternate probability model over trees which is based on simple assumptions about the nature of natural language structure. It is driven by the explicit transfer between sequences and their contexts, and exploits both the proform phenomenon and the fact that good constituents must tile in ways that systematically cover the corpus sentences without crossing. The model clearly has limits. Lacking recursive features, it essentially must analyze long, rare constructions using only contexts. However, despite, or perhaps due to, its simplicity, our model predicts bracketings very well, producing higher-quality structural analyses than previous methods which employ the PCFG model family.

Acknowledgements. We thank John Lafferty, Fernando Pereira, Ben Taskar, and Sebastian Thrun for comments and discussion. This paper is based on work supported in part by the National Science Foundation under Grant No. IIS-0085896.

References

[1] James K. Baker. Trainable grammars for speech recognition. In D. H. Klatt and J. J. Wolf, editors, Speech Communication Papers for the 97th Meeting of the ASA, pages 547–550, 1979.
[2] Glenn Carroll and Eugene Charniak. Two experiments on learning probabilistic dependency grammars from corpora. In C. Weir, S. Abney, R.
Grishman, and R. Weischedel, editors, Working Notes of the Workshop on Statistically-Based NLP Techniques, pages 1–13. AAAI Press, 1992.
[3] Eugene Charniak. A maximum-entropy-inspired parser. In NAACL 1, pages 132–139, 2000.
[4] Noam Chomsky. Knowledge of Language. Praeger, New York, 1986.
[5] Noam Chomsky and Morris Halle. The Sound Pattern of English. Harper & Row, New York, 1968.
[6] Alexander Clark. Unsupervised induction of stochastic context-free grammars using distributional clustering. In The Fifth Conference on Natural Language Learning, 2001.
[7] Michael John Collins. Three generative, lexicalised models for statistical parsing. In ACL 35/EACL 8, pages 16–23, 1997.
[8] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
[9] Steven Finch and Nick Chater. Distributional bootstrapping: From word class to proto-sentence. In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pages 301–306, Hillsdale, NJ, 1994. Lawrence Erlbaum.
[10] Zellig Harris. Methods in Structural Linguistics. University of Chicago Press, Chicago, 1951.
[11] Dan Klein and Christopher D. Manning. Distributional phrase structure induction. In The Fifth Conference on Natural Language Learning, 2001.
[12] K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35–56, 1990.
[13] Fernando Pereira and Yves Schabes. Inside-outside reestimation from partially bracketed corpora. In ACL 30, pages 128–135, 1992.
[14] Andrew Radford. Transformational Grammar. Cambridge University Press, Cambridge, 1988.
[15] Hinrich Schütze. Distributional part-of-speech tagging. In EACL 7, pages 141–148, 1995.
[16] Andreas Stolcke and Stephen M. Omohundro. Inducing probabilistic grammars by Bayesian model merging.
In Grammatical Inference and Applications: Proceedings of the Second International Colloquium on Grammatical Inference. Springer-Verlag, 1994.
[17] M. van Zaanen and P. Adriaans. Comparing two unsupervised grammar induction systems: Alignment-based learning vs. EMILE. Technical Report 2001.05, University of Leeds, 2001.
[18] J. G. Wolff. Learning syntax and meanings through optimization and distributional analysis. In Y. Levy, I. M. Schlesinger, and M. D. S. Braine, editors, Categories and Processes in Language Acquisition, pages 179–215. Lawrence Erlbaum, Hillsdale, NJ, 1988.
6 0.12291737 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
7 0.11990483 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
8 0.11332601 114 nips-2001-Learning from Infinite Data in Finite Time
9 0.10997973 139 nips-2001-Online Learning with Kernels
10 0.10924946 31 nips-2001-Algorithmic Luckiness
11 0.10635613 8 nips-2001-A General Greedy Approximation Algorithm with Applications
12 0.10445915 105 nips-2001-Kernel Machines and Boolean Functions
13 0.094217718 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
14 0.093975663 36 nips-2001-Approximate Dynamic Programming via Linear Programming
15 0.090985574 46 nips-2001-Categorization by Learning and Combining Object Parts
16 0.083388135 129 nips-2001-Multiplicative Updates for Classification by Mixture Models
17 0.081579499 21 nips-2001-A Variational Approach to Learning Curves
18 0.081234679 164 nips-2001-Sampling Techniques for Kernel Methods
19 0.078570127 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance
20 0.069821104 38 nips-2001-Asymptotic Universality for Learning Curves of Support Vector Machines
topicId topicWeight
[(0, -0.247), (1, 0.051), (2, 0.021), (3, 0.169), (4, 0.054), (5, -0.141), (6, 0.043), (7, 0.069), (8, -0.104), (9, -0.064), (10, -0.09), (11, 0.087), (12, -0.007), (13, 0.036), (14, -0.164), (15, 0.201), (16, 0.033), (17, -0.178), (18, 0.015), (19, -0.229), (20, -0.202), (21, 0.04), (22, 0.018), (23, -0.178), (24, -0.083), (25, 0.037), (26, -0.043), (27, 0.006), (28, 0.015), (29, -0.029), (30, 0.023), (31, 0.034), (32, 0.001), (33, -0.086), (34, 0.006), (35, -0.043), (36, 0.012), (37, 0.01), (38, -0.072), (39, 0.061), (40, -0.029), (41, -0.079), (42, 0.034), (43, -0.029), (44, -0.035), (45, -0.009), (46, 0.11), (47, 0.094), (48, -0.001), (49, -0.064)]
simIndex simValue paperId paperTitle
same-paper 1 0.98634607 1 nips-2001-(Not) Bounding the True Error
Author: John Langford, Rich Caruana
Abstract: We present a new approach to bounding the true error rate of a continuous valued classifier based upon PAC-Bayes bounds. The method first constructs a distribution over classifiers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classifier found with the sensitivity analysis can then be tightly bounded using a PAC-Bayes bound. In this paper we demonstrate the method on artificial neural networks with results of an order of magnitude improvement vs. the best deterministic neural net bounds.
2 0.87810022 119 nips-2001-Means, Correlations and Bounds
Author: Martijn Leisink, Bert Kappen
Abstract: The partition function for a Boltzmann machine can be bounded from above and below. We can use this to bound the means and the correlations. For networks with small weights, the values of these statistics can be restricted to non-trivial regions (i.e., a subset of [-1, 1]). Experimental results show that reasonable bounding occurs for weight sizes where mean field expansions generally give good results.
3 0.71530503 143 nips-2001-PAC Generalization Bounds for Co-training
Author: Sanjoy Dasgupta, Michael L. Littman, David A. McAllester
Abstract: The rule-based bootstrapping introduced by Yarowsky, and its co-training variant by Blum and Mitchell, have met with considerable empirical success. Earlier work on the theory of co-training has been only loosely related to empirically useful co-training algorithms. Here we give a new PAC-style bound on generalization error which justifies both the use of confidences — partial rules and partial labeling of the unlabeled data — and the use of an agreement-based objective function as suggested by Collins and Singer. Our bounds apply to the multiclass case, i.e., where instances are to be assigned one of several possible labels.
4 0.68609059 114 nips-2001-Learning from Infinite Data in Finite Time
Author: Pedro Domingos, Geoff Hulten
Abstract: We propose the following general method for scaling learning algorithms to arbitrarily large data sets. Consider the model M_n̄ learned by the algorithm using n_i examples in step i (n̄ = (n_1, ..., n_m)), and the model M_∞ that would be learned using infinite examples. Upper-bound the loss L(M_n̄, M_∞) between them as a function of n̄, and then minimize the algorithm's time complexity f(n̄) subject to the constraint that L(M_∞, M_n̄) be at most ε with probability at most δ. We apply this method to the EM algorithm for mixtures of Gaussians. Preliminary experiments on a series of large data sets provide evidence of the potential of this approach.

1 An Approach to Large-Scale Learning

Large data sets make it possible to reliably learn complex models. On the other hand, they require large computational resources to learn from. While in the past the factor limiting the quality of learnable models was typically the quantity of data available, in many domains today data is super-abundant, and the bottleneck is the time required to process it. Many algorithms for learning on large data sets have been proposed, but in order to achieve scalability they generally compromise the quality of the results to an unspecified degree. We believe this unsatisfactory state of affairs is avoidable, and in this paper we propose a general method for scaling learning algorithms to arbitrarily large databases without compromising the quality of the results. Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data. Consider the simplest possible learning problem: estimating the mean of a random variable x. If we have a very large number of samples, most of them are probably superfluous.
If we are willing to accept an error of at most ε with probability at most δ, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of x, only n = (1/2)(R/ε)² ln(2/δ) samples are needed, where R is x's range. We propose to extend this type of reasoning beyond learning single parameters, to learning complex models. The approach we propose consists of three steps: 1. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm. 2. Derive an upper bound on the time complexity of the learning algorithm, as a function of the number of samples used in each step. 3. Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss. In this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. In earlier papers we applied it (or an earlier version of it) to decision tree induction [2] and k-means clustering [3]. Despite its wide use, EM has long been criticized for its inefficiency (see discussion following Dempster et al. [1]), and has been considered unsuitable for large data sets [8]. Many approaches to speeding it up have been proposed (see Thiesson et al. [6] for a survey). Our method can be seen as an extension of progressive sampling approaches like Meek et al. [5]: rather than minimize the total number of samples needed by the algorithm, we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves.

2 A Loss Bound for EM

In a mixture of Gaussians model, each D-dimensional data point x_j is assumed to have been independently generated by the following process: 1) randomly choose a mixture component k; 2) randomly generate a point from it according to a Gaussian distribution with mean μ_k and covariance matrix Σ_k.
In this paper we will restrict ourselves to the case where the number K of mixture components and the probability of selection P(μ_k) and covariance matrix for each component are known. Given a training set S = {x_1, ..., x_N}, the learning goal is then to find the maximum-likelihood estimates of the means μ_k. The EM algorithm [1] accomplishes this by, starting from some set of initial means, alternating until convergence between estimating the probability P(μ_k|x_j) that each point was generated by each Gaussian (the E step), and computing the ML estimates of the means μ̂_k = Σ_{j=1}^N w_jk x_j / Σ_{j=1}^N w_jk (the M step), where w_jk = P(μ_k|x_j) from the previous E step. In the basic EM algorithm, all N examples in the training set are used in each iteration. The goal in this paper is to speed up EM by using only n_i < N examples in the ith iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large N. Let M_n̄ = (μ̂_1, ..., μ̂_K) be the vector of mean estimates obtained by the finite-data EM algorithm (i.e., using n_i examples in iteration i), and let M_∞ = (μ_1, ..., μ_K) be the vector obtained using infinite examples at each iteration. In order to proceed, we need to quantify the difference between M_n̄ and M_∞. A natural choice is the sum of the squared errors between corresponding means, which is proportional to the negative log-likelihood of the finite-data means given the infinite-data ones:

L(M_n̄, M_∞) = Σ_{k=1}^K ||μ̂_k − μ_k||² = Σ_{k=1}^K Σ_{d=1}^D |μ̂_kd − μ_kd|²   (1)

where μ̂_kd is the dth coordinate of μ̂_k, and similarly for μ_kd. After any given iteration of EM, |μ̂_kd − μ_kd| has two components. One, which we call the sampling error, derives from the fact that μ̂_kd is estimated from a finite sample, while μ_kd is estimated from an infinite one.
The other component, which we call the weighting error, derives from the fact that, due to sampling errors in previous iterations, the weights w_jk used to compute the two estimates may differ. Let μ_kdi be the infinite-data estimate of the dth coordinate of the kth mean produced in iteration i, μ̂_kdi be the corresponding finite-data estimate, and μ̃_kdi be the estimate that would be obtained if there were no weighting errors in that iteration. Then the sampling error at iteration i is |μ̃_kdi − μ_kdi|, the weighting error is |μ̂_kdi − μ̃_kdi|, and the total error |μ̂_kdi − μ_kdi| ≤ |μ̂_kdi − μ̃_kdi| + |μ̃_kdi − μ_kdi|. Given bounds on the total error of each coordinate of each mean after iteration i − 1, we can derive a bound on the weighting error after iteration i as follows. Bounds on μ_kd,i−1 for each d imply bounds on P(x_j|μ_ki) for each example x_j, obtained by substituting the maximum and minimum allowed distances between x_jd and μ_kd,i−1 into the expression of the Gaussian distribution. Let p⁺_jki be the upper bound on P(x_j|μ_ki), and p⁻_jki be the lower bound. Then the weight of example x_j in mean μ_ki can be bounded from below by w⁻_jki = p⁻_jki P(μ_k) / Σ_{k′=1}^K p⁺_jk′i P(μ_k′), and from above by w⁺_jki = min{p⁺_jki P(μ_k) / Σ_{k′=1}^K p⁻_jk′i P(μ_k′), 1}. Let w^(+)_jki = w⁺_jki if x_j ≥ 0 and w⁻_jki otherwise, and let w^(−)_jki = w⁻_jki if x_j ≥ 0 and w⁺_jki otherwise. Then |μ̂_kdi − μ̃_kdi| ≤ max …
5 0.63867199 81 nips-2001-Generalization Performance of Some Learning Problems in Hilbert Functional Spaces
Author: T. Zhang
Abstract: We investigate the generalization performance of some learning problems in Hilbert functional spaces. We introduce a notion of convergence of the estimated functional predictor to the best underlying predictor, and obtain an estimate on the rate of the convergence. This estimate allows us to derive generalization bounds on some learning formulations.
6 0.56820565 52 nips-2001-Computing Time Lower Bounds for Recurrent Sigmoidal Neural Networks
7 0.55337739 31 nips-2001-Algorithmic Luckiness
8 0.45159635 186 nips-2001-The Noisy Euclidean Traveling Salesman Problem and Learning
9 0.45034793 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance
10 0.44702435 138 nips-2001-On the Generalization Ability of On-Line Learning Algorithms
11 0.43532535 36 nips-2001-Approximate Dynamic Programming via Linear Programming
12 0.386915 8 nips-2001-A General Greedy Approximation Algorithm with Applications
13 0.36672977 99 nips-2001-Intransitive Likelihood-Ratio Classifiers
14 0.36156258 139 nips-2001-Online Learning with Kernels
15 0.35180229 130 nips-2001-Natural Language Grammar Induction Using a Constituent-Context Model
16 0.34376371 133 nips-2001-On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes
17 0.33918056 105 nips-2001-Kernel Machines and Boolean Functions
18 0.33883533 85 nips-2001-Grammar Transfer in a Second Order Recurrent Neural Network
19 0.33851403 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks
20 0.32804501 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade
topicId topicWeight
[(14, 0.031), (17, 0.034), (19, 0.088), (20, 0.205), (27, 0.126), (30, 0.101), (36, 0.013), (59, 0.036), (72, 0.111), (79, 0.04), (83, 0.012), (91, 0.124)]
simIndex simValue paperId paperTitle
1 0.92474449 116 nips-2001-Linking Motor Learning to Function Approximation: Learning in an Unlearnable Force Field
Author: O. Donchin, Reza Shadmehr
Abstract: Reaching movements require the brain to generate motor commands that rely on an internal model of the task’s dynamics. Here we consider the errors that subjects make early in their reaching trajectories to various targets as they learn an internal model. Using a framework from function approximation, we argue that the sequence of errors should reflect the process of gradient descent. If so, then the sequence of errors should obey hidden state transitions of a simple dynamical system. Fitting the system to human data, we find a surprisingly good fit accounting for 98% of the variance. This allows us to draw tentative conclusions about the basis elements used by the brain in transforming sensory space to motor commands. To test the robustness of the results, we estimate the shape of the basis elements under two conditions: in a traditional learning paradigm with a consistent force field, and in a random sequence of force fields where learning is not possible. Remarkably, we find that the basis remains invariant.
2 0.90362537 172 nips-2001-Speech Recognition using SVMs
Author: N. Smith, Mark Gales
Abstract: An important issue in applying SVMs to speech recognition is the ability to classify variable length sequences. This paper presents extensions to a standard scheme for handling this variable length data, the Fisher score. A more useful mapping is introduced based on the likelihood-ratio. The score-space defined by this mapping avoids some limitations of the Fisher score. Class-conditional generative models are directly incorporated into the definition of the score-space. The mapping, and appropriate normalisation schemes, are evaluated on a speaker-independent isolated letter task where the new mapping outperforms both the Fisher score and HMMs trained to maximise likelihood.
same-paper 3 0.87448645 1 nips-2001-(Not) Bounding the True Error
Author: John Langford, Rich Caruana
Abstract: We present a new approach to bounding the true error rate of a continuous valued classifier based upon PAC-Bayes bounds. The method first constructs a distribution over classifiers by determining how sensitive each parameter in the model is to noise. The true error rate of the stochastic classifier found with the sensitivity analysis can then be tightly bounded using a PAC-Bayes bound. In this paper we demonstrate the method on artificial neural networks with results of a order of magnitude improvement vs. the best deterministic neural net bounds. £ ¡ ¤¢
4 0.81086999 44 nips-2001-Blind Source Separation via Multinode Sparse Representation
Author: Michael Zibulevsky, Pavel Kisilev, Yehoshua Y. Zeevi, Barak A. Pearlmutter
Abstract: We consider the problem of blind source separation from a set of instantaneous linear mixtures, where the mixing matrix is unknown. It was discovered recently that exploiting the sparsity of sources in an appropriate representation, according to some signal dictionary, dramatically improves the quality of separation. In this work we use the property of multiscale transforms, such as wavelets or wavelet packets, to decompose signals into sets of local features with various degrees of sparsity. We use this intrinsic property to select the best (most sparse) subsets of features for further separation. The performance of the algorithm is verified on noise-free and noisy data. Experiments with simulated signals, musical sounds and images demonstrate significant improvement of separation quality over previously reported results.
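The subset-selection step the abstract alludes to can be illustrated with a sparsity ranking. This sketch assumes a Hoyer-style L1/L2 sparsity score and plain Python lists standing in for wavelet subbands; the paper's actual measure and signal dictionary may differ.

```python
import math

def l1_l2_sparsity(coeffs):
    """Hoyer-style sparsity score in [0, 1]: 1 means a single nonzero
    coefficient (maximally sparse), 0 means all coefficients equal."""
    n = len(coeffs)
    l1 = sum(abs(c) for c in coeffs)
    l2 = math.sqrt(sum(c * c for c in coeffs))
    if l2 == 0:
        return 0.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def sparsest_subsets(subbands, top=2):
    """Rank feature subsets (e.g., wavelet subbands) by sparsity and
    return the indices of the `top` sparsest ones."""
    scores = [l1_l2_sparsity(b) for b in subbands]
    order = sorted(range(len(subbands)), key=lambda i: scores[i], reverse=True)
    return order[:top]
```

The separation stage would then be run only on the features from the selected subbands, where the sources overlap least.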
5 0.74783885 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing
Author: Benjamin Blankertz, Gabriel Curio, Klaus-Robert Müller
Abstract: Driven by the progress in the field of single-trial analysis of EEG, there is a growing interest in brain computer interfaces (BCIs), i.e., systems that enable human subjects to control a computer only by means of their brain signals. In a pseudo-online simulation our BCI detects upcoming finger movements in a natural keyboard typing condition and predicts their laterality. This can be done on average 100–230 ms before the respective key is actually pressed, i.e., long before the onset of EMG. Our approach is appealing for its short response time and high classification accuracy (>96%) in a binary decision where no human training is involved. We compare discriminative classifiers like Support Vector Machines (SVMs) and different variants of Fisher Discriminant that possess favorable regularization properties for dealing with high-noise cases (inter-trial variability).
6 0.74625695 13 nips-2001-A Natural Policy Gradient
7 0.74421024 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine
8 0.73936039 102 nips-2001-KLD-Sampling: Adaptive Particle Filters
9 0.73746932 127 nips-2001-Multi Dimensional ICA to Separate Correlated Sources
11 0.73629272 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines
12 0.73569155 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex
13 0.73423696 28 nips-2001-Adaptive Nearest Neighbor Classification Using Support Vector Machines
14 0.73395693 29 nips-2001-Adaptive Sparseness Using Jeffreys Prior
15 0.7327795 60 nips-2001-Discriminative Direction for Kernel Classifiers
16 0.72955334 185 nips-2001-The Method of Quantum Clustering
17 0.72863668 195 nips-2001-Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning
18 0.72733593 36 nips-2001-Approximate Dynamic Programming via Linear Programming
19 0.72503901 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimentions
20 0.72491443 121 nips-2001-Model-Free Least-Squares Policy Iteration