nips nips2001 nips2001-144 knowledge-graph by maker-knowledge-mining

144 nips-2001-Partially labeled classification with Markov random walks


Source: pdf

Author: Martin Szummer, Tommi Jaakkola

Abstract: To classify a large number of unlabeled examples we combine a limited number of labeled examples with a Markov random walk representation over the unlabeled examples. The random walk representation exploits any low dimensional structure in the data in a robust, probabilistic manner. We develop and compare several estimation criteria/algorithms suited to this representation. This includes in particular multi-way classification with an average margin criterion which permits a closed form solution. The time scale of the random walk regularizes the representation and can be set through a margin-based criterion favoring unambiguous classification. We also extend this basic regularization by adapting time scales for individual examples. We demonstrate the approach on synthetic examples and on text classification problems.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Partially labeled classification with Markov random walks Tommi Jaakkola MIT AI Lab Cambridge, MA 02139 tommi@ai.mit.edu [sent-1, score-0.453]

2 Abstract: To classify a large number of unlabeled examples we combine a limited number of labeled examples with a Markov random walk representation over the unlabeled examples. [sent-5, score-1.486]

3 The random walk representation exploits any low dimensional structure in the data in a robust, probabilistic manner. [sent-6, score-0.298]

4 This includes in particular multi-way classification with an average margin criterion which permits a closed form solution. [sent-8, score-0.319]

5 The time scale of the random walk regularizes the representation and can be set through a margin-based criterion favoring unambiguous classification. [sent-9, score-0.431]

6 We also extend this basic regularization by adapting time scales for individual examples. [sent-10, score-0.164]

7 1 Introduction Classification with partially labeled examples involves a limited dataset of labeled examples as well as a large unlabeled dataset. [sent-12, score-1.213]

8 The unlabeled examples to be classified provide information about the structure of the domain while the few labeled examples identify the classification task expressed in this structure. [sent-13, score-0.818]

9 When this assumption is appropriate, we only require one labeled point for each cluster to properly classify the whole dataset. [sent-15, score-0.425]

10 Data points are typically given relative to a global coordinate system with an associated metric. [sent-16, score-0.179]

11 While the metric may provide a reasonable local similarity measure, it is frequently inadequate as a measure of global similarity. [sent-17, score-0.169]

12 For example, the data may lie on a submanifold of the space, revealed by the density, and any global comparisons should preferably be made along the manifold structure. [sent-18, score-0.152]

13 Moreover, we often wish to assign higher similarity values to examples contained in the same high-density regions or clusters, implying that comparisons ought to incorporate the density in addition to the manifold structure. [sent-19, score-0.264]

14 A representation of examples that satisfies these and other desiderata can be constructed through a Markov random walk similarly to [3]. [sent-20, score-0.383]

15 The resulting global comparisons of examples integrate a “volume” of paths connecting the examples as opposed to shortest paths that are susceptible to noise. [sent-21, score-0.403]

16 2 Representation based on Markov random walks We define a Markov random walk based on a locally appropriate metric [3]. [sent-24, score-0.402]

17 The metric is the basis for the neighborhood graph, associated weights on the edges, and consequently the transition probabilities for the random walk. [sent-25, score-0.299]

18 The new representation for the examples can be obtained naturally from the random walk. [sent-26, score-0.206]

19 More formally, consider a set of points {x_1, ..., x_N} with a metric d(x_i, x_k). [sent-28, score-0.193]

20 We first construct a symmetrized K nearest neighbor graph over the points and assign a weight W_ik = exp(-d(x_i, x_k)/σ) to each undirected edge in the graph. [sent-29, score-0.274]

21 Note that the product of weights along a path in the graph relates to the total length of the path in the same way as the edge weights relate to the distances between the corresponding points. [sent-31, score-0.211]

22 While the weights are symmetric, the transition probabilities generally are not, because the normalization varies from node to node. [sent-35, score-0.157]
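To make the construction concrete, here is a minimal NumPy sketch of the one-step transition matrix; the Euclidean metric, the function name, and the default values K=5 and sigma=0.6 are illustrative assumptions, not the paper's code:

import numpy as np
from scipy.spatial.distance import cdist

def transition_matrix(X, K=5, sigma=0.6):
    # One-step transitions of the walk on a symmetrized K-nearest-neighbor
    # graph with edge weights W_ik = exp(-d(x_i, x_k) / sigma).
    D = cdist(X, X)                          # pairwise distances d(x_i, x_k)
    N = len(X)
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(D[i])[1:K + 1]     # K nearest neighbors, excluding i
        W[i, nbrs] = np.exp(-D[i, nbrs] / sigma)
    W = np.maximum(W, W.T)                   # symmetrize: weights are symmetric
    A = W / W.sum(axis=1, keepdims=True)     # row-normalize: A[i, k] = p(k | i)
    return A                                 # rows sum to 1; A itself is not symmetric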

23 The one-step transition probabilities p_ik = W_ik / Σ_j W_ij define the matrix in (2). We assume that the starting point for the Markov random walk is chosen uniformly at random, i.e., P(i) = 1/N. [sent-38, score-0.253]

24 These conditional probabilities define our new representation for the examples. [sent-42, score-0.154]

25 The points in this representation are close whenever they have nearly the same distribution over the starting states. [sent-44, score-0.204]
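Continuing the sketch above, the t-step representation P_{0|t}(i | k) follows from matrix powering and Bayes' rule with the uniform starting distribution (a sketch; it assumes the transition_matrix function defined earlier):

def representation(A, t):
    # rep[i, k] = P_{0|t}(i | k): probability the walk started at i, given
    # that it is at k after t steps and started uniformly at random.
    At = np.linalg.matrix_power(A, t)          # At[i, k] = P(at k after t | start i)
    # Bayes rule; the uniform prior cancels. (Columns with zero mass, i.e.
    # unreachable points, would need special care in a full implementation.)
    return At / At.sum(axis=0, keepdims=True)

Columns of the result are the new coordinates of the points; two points are close when their columns are nearly equal.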

26 This representation is crucially affected by the time scale parameter t. [sent-45, score-0.16]

27 When t → ∞, all the points become indistinguishable provided that the original neighborhood graph is connected. [sent-46, score-0.26]

28 In this representation, t controls the resolution at which we look at the data points (cf. [3]). [sent-48, score-0.235]

29 The representation is also influenced by K, σ, and the local distance metric d, which together define the one-step transition probabilities (see section 4). [sent-49, score-0.271]

30 Given a partially labeled data set {(x_1, ỹ_1), ..., (x_L, ỹ_L), x_{L+1}, ..., x_N}, we wish to classify the unlabeled points. [sent-51, score-0.741]

31 The labels may come from two or more classes, and typically, the number of labeled points L is a small fraction of the total number of points N. [sent-52, score-0.673]

32 Now given a point k, which may be labeled or unlabeled, we interpret it as a sample from the t-step Markov random walk. [sent-55, score-0.441]

33 Since labels are associated with the original (starting) points, the posterior probability of the label for point k is given by P(y | k) = Σ_i P(y | i) P_{0|t}(i | k) (3). To classify the k-th point, we choose the class that maximizes the posterior: arg max_y P(y | k). [sent-56, score-0.321]
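In code, the posterior of equation (3) is a single matrix product (a sketch; P_y_given_i is an assumed N-by-C array holding the parameters P(y | i)):

def classify(rep, P_y_given_i):
    # post[y, k] = P(y | k) = sum_i P(y | i) * P_{0|t}(i | k)   (Eq. 3)
    post = P_y_given_i.T @ rep
    return post, post.argmax(axis=0)         # posteriors and arg-max labels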

34 We will now discuss two techniques for estimating the unknown parameters P(y | i): maximum likelihood with EM, and maximum margin subject to constraints. [sent-57, score-0.171]

35 3.1 EM estimation The estimation criterion here is the conditional log-likelihood of the labeled points, Σ_{k=1..L} log P(ỹ_k | k) = Σ_{k=1..L} log Σ_i P(ỹ_k | i) P_{0|t}(i | k) (4). [sent-59, score-0.637]
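A minimal EM loop for this criterion might look as follows; the latent variable is the starting point i of the walk, and the encoding labels[k] = -1 for unlabeled points is an assumption of this sketch, not from the paper:

def em_estimate(rep, labels, C, iters=50):
    N = rep.shape[0]
    L = np.where(labels >= 0)[0]             # indices of labeled points
    P = np.full((N, C), 1.0 / C)             # initialize P(y | i) uniformly
    for _ in range(iters):
        # E-step: q[i, j] proportional to P(y_kj | i) * P_{0|t}(i | kj)
        q = P[:, labels[L]] * rep[:, L]
        q /= np.maximum(q.sum(axis=0, keepdims=True), 1e-12)
        # M-step: re-estimate P(y | i) from the soft assignments
        P = np.zeros((N, C))
        for j, k in enumerate(L):
            P[:, labels[k]] += q[:, j]
        P /= np.maximum(P.sum(axis=1, keepdims=True), 1e-12)
    return P

As usual for EM, each iteration cannot decrease the conditional log-likelihood (4).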

36 Define the margin of the classifier on labeled point k and class y to be γ_{ky} = P(ỹ_k | k) − P(y | k). [sent-69, score-0.583]

37 For correct classification, the margin should be nonnegative for all classes other than ỹ_k, i.e., γ_{ky} ≥ 0 for all y ≠ ỹ_k. [sent-70, score-0.226]

38 During training, we find the parameters that maximize the average margin on the labeled points, thereby forcing most of them to be correctly classified. [sent-73, score-0.552]

39 The solution is achieved at extremal points of the parameter set and thus it is not surprising that the optimal parameters reduce to hard values (0 or 1). [sent-75, score-0.18]
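Because the average margin is linear in the parameters P(y | i), the maximizer can be written down directly. The sketch below implements one reading of this closed form, assigning each node hard to the class whose labeled points are, on per-class average, its most likely end points (it assumes every class has at least one labeled point):

def average_margin_solution(rep, labels, C):
    N = rep.shape[0]
    avg = np.zeros((N, C))
    for y in range(C):
        ks = np.where(labels == y)[0]           # labeled points of class y
        avg[:, y] = rep[:, ks].mean(axis=1)     # class-average of P_{0|t}(i | k)
    P = np.zeros((N, C))
    P[np.arange(N), avg.argmax(axis=1)] = 1.0   # hard values (0 or 1)
    return P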

40 Figure 1 legend: unlabeled; labeled +1; labeled −1. Top left: local connectivity for K = 5 neighbors. [sent-77, score-1.025]

41 Below are classifications using Markov random walks for t = 3, 10, and 30 (top to bottom, left to right), estimated with average margin. [sent-78, score-0.158]

42 There are two labeled points (large cross, triangle) and 148 unlabeled points, classified (small crosses, triangles) or unclassified (small dots). [sent-79, score-0.78]

43 The large margin restricts the dimension of the classifier (section 3.4) [sent-82, score-0.171]

44 and encourages generalization to correct classification of the unlabeled points as well. [sent-83, score-0.469]

45 For separable problems, we can maximize the minimum margin instead of the average margin. [sent-87, score-0.214]

46 In the case of only two classes, we then have only one global margin parameter for all labeled points. [sent-88, score-0.582]

47 If we tackled noisy or non-separable problems by adding a linear slack variable to each constraint, we would arrive at the average margin criterion given above (because of linearity). [sent-90, score-0.285]

48 We are given 2 labeled and 148 unlabeled points in an intertwining two moons pattern. [sent-97, score-0.78]

49 At t = 3 the random walk has not connected all unlabeled points to some labeled point. [sent-102, score-1.065]

50 The parameters for unconnected points do not affect likelihood or margin, so we assign them uniformly to both classes. [sent-103, score-0.158]

51 The other points have a path to only one of the classes, and are therefore fully assigned to that class. [sent-104, score-0.182]

52 At t = 10 all points have paths to labeled points but the Markov process has not mixed well. [sent-105, score-0.666]

53 3.4 Sample size requirements Here we quantify the sample size that is needed for accurate estimation of the labels for the unlabeled examples. [sent-110, score-0.417]

54 As before, the probabilities P_{0|t}(i | k) and P_{0|t}(i | l) denote the conditional probabilities of having started the random walk in i, given that the process ends up in k or l, respectively. [sent-114, score-0.399]

55 The dimension [5] of the binary transductive classifier is upper bounded by the number of connected components of a graph with N nodes and adjacency matrix E, where E_{kl} = 1 if the representations of points k and l are within the margin of each other, and zero otherwise. [sent-118, score-0.158]

56 Since any pair of examples for which E_{kl} = 1 share the same label, different labels can be assigned only to examples not connected by the relation, i.e., in different components. [sent-122, score-0.324]

57 This theorem applies more generally to any transductive classifier based on a weighted representation of examples so long as the weights are bounded in [0, 1]. [sent-125, score-0.218]

58 To determine the sample size needed for a given dataset and a desired classification margin γ, let d be the resulting dimension. [sent-126, score-0.171]

59 With high probability we can correctly classify the unlabeled points given a number of labeled examples on the order of this dimension [4]. [sent-127, score-0.925]

60 [Figure 2 residue — legend: Class Mac, Class Win; curves: Markov avg margin, Markov min margin, Markov max ent, SVM labeled only; axes: average margin per class over t = 1–20 (left), accuracy over 4–128 labeled examples (right).] [sent-131, score-0.941]

61 Figure 2: Windows vs. Mac. [sent-143, score-0.338]

62 Left: Average per-class margins for different t, 16 labeled documents. [sent-145, score-0.423]

63 Right: Classification accuracy, between 2 and 128 labeled documents, for Markov random walks and best SVM. [sent-146, score-0.453]

64 The local neighborhood size K should be on the order of the manifold dimensionality, sufficiently small to avoid introducing edges in the neighborhood graph that span outside the manifold. [sent-149, score-0.324]

65 The local scale parameter σ trades off the emphasis on shortest paths (low σ effectively ignores distant points) versus the volume of paths (high σ). [sent-151, score-0.252]

66 The smoothness of the random walk representation depends on t, the number of transitions. [sent-152, score-0.328]

67 This is a regularization parameter akin to the kernel width of a density estimator. [sent-153, score-0.169]

68 In the limiting case t = ∞, the representation for each node becomes a flat distribution over the points in the same connected component. [sent-157, score-0.321]

69 Cross-validation could provide a supervised choice of t but requires too many labeled points for good accuracy. [sent-161, score-0.47]

70 Instead, we propose to choose the t that maximizes the average margin, per class, on both labeled and unlabeled data. [sent-162, score-0.896]

71 We plot it separately for labeled and unlabeled points to avoid issues of their relative weights. [sent-163, score-0.78]

72 For labeled points, y is the given label ỹ_k; for unlabeled points, y is the class assigned by the classifier. [sent-164, score-0.719]

73 Figure 2 shows the average margin as a function of t, for a large text dataset (section 5). [sent-165, score-0.276]
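A sketch of the selection rule: sweep t, fit the average margin classifier, and keep the t with the largest average margin. Since the posteriors over the C classes sum to one, the per-point margin against the competing classes reduces to P(ỹ_k | k) − (1 − P(ỹ_k | k)) / (C − 1):

def choose_t(A, labels, C, t_values=range(1, 21)):
    best_t, best_m = None, -np.inf
    L = np.where(labels >= 0)[0]
    for t in t_values:
        rep = representation(A, t)
        P = average_margin_solution(rep, labels, C)
        post = P.T @ rep                     # post[y, k] = P(y | k)
        true = post[labels[L], L]            # posterior of the given label
        m = (true - (1.0 - true) / (C - 1)).mean()
        if m > best_m:
            best_t, best_m = t, m
    return best_t

The same loop, run with the classifier's own labels standing in for ỹ_k on the unlabeled points, would give the unlabeled-data margin curve.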

74 4.1 Adaptive time scales So far, we have employed a single global value of t. [sent-168, score-0.149]

75 However, the desired smoothness may be different at different locations (akin to adaptive kernel widths in kernel density estimation). [sent-169, score-0.178]

76 At the simplest, if the graph has multiple connected components, we can set an individual t for each component. [sent-170, score-0.149]

77 Here we propose a restricted version of this criterion where we find individual time scales for each unlabeled point but estimate a single timescale for labeled points as before. [sent-172, score-1.082]

78 The principle by which we select the time scales for the unlabeled points encourages the node identities to become the only common correlates for the labels. [sent-173, score-0.601]

79 More precisely, the criterion for any unlabeled point k is defined in terms of the conditional probabilities P_{0|t_k}(i | k), where both summations are only over the labeled points. [sent-174, score-0.675]

80 Note that the criterion remains a function of all the individual time scales for the unlabeled points. [sent-176, score-0.436]

81 Note that the ideal setting of the time scales would be one that determines the labels for the unlabeled points uniquely on the basis of only the labeled examples while at the same time preserving the overall variability of the labels across the nodes. [sent-178, score-1.138]

82 This would happen, for example, if the labeled examples fall on distinct connected components. [sent-179, score-0.482]

83 We optimize the criterion by an axis-parallel search, trying only discrete values of t_k large enough that at least one labeled point is reached from each unlabeled point. [sent-180, score-0.746]

84 We initialize t_k to the smallest number of transitions needed to reach a labeled point. [sent-181, score-0.338]
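This initialization is just a multi-source breadth-first search from the labeled set over the neighborhood graph (a sketch, reusing the transition matrix A only for its sparsity pattern):

from collections import deque

def init_time_scales(A, labels):
    # t0[k] = smallest number of transitions from k to some labeled point
    N = A.shape[0]
    t0 = np.where(labels >= 0, 0, -1)        # labeled points need 0 steps
    queue = deque(np.where(labels >= 0)[0])
    while queue:
        i = queue.popleft()
        for j in np.nonzero(A[i])[0]:        # graph neighbors (symmetric pattern)
            if t0[j] < 0:
                t0[j] = t0[i] + 1
                queue.append(j)
    return t0                                # -1 marks unreachable points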

85 5 Experimental results We applied the Markov random walk approach to partially labeled text classification, with few labeled documents but many unlabeled ones. [sent-184, score-1.331]

86 Text documents are represented by high-dimensional vectors but only occupy low-dimensional manifolds, so we expect the Markov random walk representation to be beneficial. [sent-185, score-0.274]

87 We estimated the manifold dimensionality to exceed 7, and a histogram of the distances to the 10 nearest neighbors is peaked at 1. [sent-188, score-0.16]

88 The average margin criterion indicated a suitable t, and we also cross-validated t and plotted the decay of mutual information over t. [sent-192, score-0.311]

89 We trained both the EM and the margin-based formulations, using between 2 and 128 labeled points, treating all remaining points as unlabeled. [sent-193, score-0.47]

90 Results in figure 2 show that Markov random walk based algorithms have (footnote 1: processed as 20news-18827, http://www.) [sent-195, score-0.226]

91 a clear advantage over the best SVM using only labeled data (which had a linear kernel and C = 3), out of linear and Gaussian kernels, different kernel widths, and values of C. [sent-199, score-0.436]

92 The advantage is especially noticeable for few labeled points, but decreases thereafter. [sent-200, score-0.338]

93 The average margin classifier can handle outliers and mislabeled points, unlike the min margin classifier, which stops improving once 8 or more labeled points are supplied. [sent-202, score-0.641]

94 The adaptive timescale criterion favors relatively small timescales for this dataset. [sent-203, score-0.173]

95 For 90% of the unlabeled points, it picks the smallest timescale that reaches a labeled point, which is at most 8 for any point. [sent-204, score-0.757]

96 As the number of labeled points increases, shorter times are chosen. [sent-205, score-0.47]

97 For a few points, the criterion picks a maximally smooth representation (the highest timescale considered here, t = 12), possibly to increase the value of the criterion. [sent-206, score-0.252]

98 6 Discussion The Markov random walk representation of examples provides a robust, variable-resolution approach to classifying data sets with significant manifold structure and very few labels. [sent-208, score-0.487]

99 The average margin estimation criterion proposed in this context leads to a closed form solution and strong empirical performance. [sent-209, score-0.355]

100 To facilitate continuum limit analysis (and establish better correspondence with the underlying density), we can construct the neighborhood graph on the basis of ε-balls rather than nearest neighbors. [sent-214, score-0.155]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('uy', 0.584), ('labeled', 0.338), ('unlabeled', 0.31), ('classi', 0.204), ('ay', 0.177), ('walk', 0.177), ('margin', 0.171), ('tg', 0.166), ('points', 0.132), ('markov', 0.113), ('cation', 0.093), ('vuy', 0.09), ('examples', 0.085), ('timescale', 0.078), ('scales', 0.073), ('manifold', 0.073), ('er', 0.072), ('representation', 0.072), ('criterion', 0.071), ('labels', 0.071), ('cay', 0.067), ('mac', 0.067), ('walks', 0.066), ('graph', 0.066), ('paths', 0.064), ('neighborhood', 0.062), ('sg', 0.062), ('metric', 0.061), ('classify', 0.06), ('connected', 0.059), ('probabilities', 0.058), ('classes', 0.055), ('szummer', 0.053), ('arg', 0.051), ('random', 0.049), ('documents', 0.048), ('global', 0.047), ('class', 0.047), ('akin', 0.045), ('fuy', 0.045), ('euclidean', 0.045), ('average', 0.043), ('transition', 0.041), ('local', 0.039), ('regularization', 0.038), ('margins', 0.038), ('text', 0.038), ('distances', 0.037), ('estimation', 0.036), ('closed', 0.034), ('em', 0.034), ('kernel', 0.034), ('maximizes', 0.034), ('scale', 0.033), ('started', 0.033), ('transductive', 0.033), ('concave', 0.033), ('jaakkola', 0.033), ('partially', 0.033), ('comparisons', 0.032), ('label', 0.031), ('resolution', 0.031), ('picks', 0.031), ('vs', 0.031), ('node', 0.03), ('curved', 0.03), ('smoothness', 0.03), ('widths', 0.03), ('time', 0.029), ('tommi', 0.028), ('limiting', 0.028), ('weights', 0.028), ('point', 0.027), ('nearest', 0.027), ('encourages', 0.027), ('path', 0.026), ('assign', 0.026), ('windows', 0.026), ('fg', 0.026), ('shortest', 0.026), ('density', 0.026), ('parameter', 0.026), ('mutual', 0.026), ('choices', 0.025), ('lab', 0.025), ('adaptive', 0.024), ('individual', 0.024), ('assigned', 0.024), ('ne', 0.024), ('formulations', 0.024), ('dataset', 0.024), ('conditional', 0.024), ('jointly', 0.023), ('ideally', 0.023), ('neighbor', 0.023), ('nips', 0.023), ('de', 0.022), ('hard', 0.022), ('similarity', 0.022), ('decisions', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 144 nips-2001-Partially labeled classification with Markov random walks

Author: Martin Szummer, Tommi Jaakkola

Abstract: To classify a large number of unlabeled examples we combine a limited number of labeled examples with a Markov random walk representation over the unlabeled examples. The random walk representation exploits any low dimensional structure in the data in a robust, probabilistic manner. We develop and compare several estimation criteria/algorithms suited to this representation. This includes in particular multi-way classification with an average margin criterion which permits a closed form solution. The time scale of the random walk regularizes the representation and can be set through a margin-based criterion favoring unambiguous classification. We also extend this basic regularization by adapting time scales for individual examples. We demonstrate the approach on synthetic examples and on text classification problems.

2 0.1821344 167 nips-2001-Semi-supervised MarginBoost

Author: Florence D'alché-buc, Yves Grandvalet, Christophe Ambroise

Abstract: In many discrimination problems a large amount of data is available but only a few of them are labeled. This provides a strong motivation to improve or develop methods for semi-supervised learning. In this paper, boosting is generalized to this task within the optimization framework of MarginBoost . We extend the margin definition to unlabeled data and develop the gradient descent algorithm that corresponds to the resulting margin cost function. This meta-learning scheme can be applied to any base classifier able to benefit from unlabeled data. We propose here to apply it to mixture models trained with an Expectation-Maximization algorithm. Promising results are presented on benchmarks with different rates of labeled data. 1

3 0.15396106 25 nips-2001-Active Learning in the Drug Discovery Process

Author: Manfred K. Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, Christian Lemmen

Abstract: We investigate the following data mining problem from Computational Chemistry: From a large data set of compounds, find those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data. Each weight vector votes with its prediction and we pick the unlabeled examples for which the prediction is most evenly split between +1 and −1. For a third selection strategy note that each unlabeled example bisects the version space of consistent weight vectors. We estimate the volume on both sides of the split by bouncing a billiard through the version space and select unlabeled examples that cause the most even split of the version space. We demonstrate on two data sets provided by DuPont Pharmaceuticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing.

4 0.13752745 143 nips-2001-PAC Generalization Bounds for Co-training

Author: Sanjoy Dasgupta, Michael L. Littman, David A. McAllester

Abstract: The rule-based bootstrapping introduced by Yarowsky, and its cotraining variant by Blum and Mitchell, have met with considerable empirical success. Earlier work on the theory of co-training has been only loosely related to empirically useful co-training algorithms. Here we give a new PAC-style bound on generalization error which justifies both the use of confidences — partial rules and partial labeling of the unlabeled data — and the use of an agreement-based objective function as suggested by Collins and Singer. Our bounds apply to the multiclass case, i.e., where instances are to be assigned one of k labels, for k ≥ 2.

5 0.13710293 58 nips-2001-Covariance Kernels from Bayesian Generative Models

Author: Matthias Seeger

Abstract: We propose the framework of mutual information kernels for learning covariance kernels, as used in Support Vector machines and Gaussian process classifiers, from unlabeled task data using Bayesian techniques. We describe an implementation of this framework which uses variational Bayesian mixtures of factor analyzers in order to attack classification problems in high-dimensional spaces where labeled data is sparse, but unlabeled data is abundant. 1

6 0.13186419 46 nips-2001-Categorization by Learning and Combining Object Parts

7 0.12401985 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

8 0.11746108 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

9 0.11539348 60 nips-2001-Discriminative Direction for Kernel Classifiers

10 0.11313908 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

11 0.10998768 139 nips-2001-Online Learning with Kernels

12 0.10545636 159 nips-2001-Reducing multiclass to binary by coupling probability estimates

13 0.10111759 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

14 0.10078056 105 nips-2001-Kernel Machines and Boolean Functions

15 0.096079983 63 nips-2001-Dynamic Time-Alignment Kernel in Support Vector Machine

16 0.091771409 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance

17 0.091461249 43 nips-2001-Bayesian time series classification

18 0.08999937 84 nips-2001-Global Coordination of Local Linear Models

19 0.089142807 135 nips-2001-On Spectral Clustering: Analysis and an algorithm

20 0.088736273 20 nips-2001-A Sequence Kernel and its Application to Speaker Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.259), (1, 0.144), (2, -0.043), (3, 0.124), (4, 0.007), (5, 0.001), (6, -0.102), (7, -0.032), (8, -0.003), (9, 0.021), (10, 0.148), (11, -0.112), (12, -0.062), (13, 0.013), (14, 0.025), (15, -0.112), (16, -0.012), (17, -0.045), (18, 0.071), (19, -0.056), (20, -0.154), (21, 0.099), (22, -0.106), (23, -0.002), (24, 0.129), (25, 0.127), (26, 0.121), (27, -0.05), (28, -0.081), (29, -0.067), (30, -0.037), (31, 0.157), (32, 0.015), (33, 0.145), (34, -0.179), (35, -0.024), (36, -0.083), (37, 0.0), (38, 0.022), (39, -0.036), (40, -0.089), (41, -0.081), (42, 0.04), (43, 0.191), (44, -0.038), (45, 0.042), (46, -0.085), (47, -0.052), (48, -0.075), (49, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96470344 144 nips-2001-Partially labeled classification with Markov random walks

Author: Martin Szummer, Tommi Jaakkola

Abstract: To classify a large number of unlabeled examples we combine a limited number of labeled examples with a Markov random walk representation over the unlabeled examples. The random walk representation exploits any low dimensional structure in the data in a robust, probabilistic manner. We develop and compare several estimation criteria/algorithms suited to this representation. This includes in particular multi-way classification with an average margin criterion which permits a closed form solution. The time scale of the random walk regularizes the representation and can be set through a margin-based criterion favoring unambiguous classification. We also extend this basic regularization by adapting time scales for individual examples. We demonstrate the approach on synthetic examples and on text classification problems.

2 0.7487756 25 nips-2001-Active Learning in the Drug Discovery Process

Author: Manfred K. Warmuth, Gunnar Rätsch, Michael Mathieson, Jun Liao, Christian Lemmen

Abstract: We investigate the following data mining problem from Computational Chemistry: From a large data set of compounds, find those that bind to a target molecule in as few iterations of biological testing as possible. In each iteration a comparatively small batch of compounds is screened for binding to the target. We apply active learning techniques for selecting the successive batches. One selection strategy picks unlabeled examples closest to the maximum margin hyperplane. Another produces many weight vectors by running perceptrons over multiple permutations of the data. Each weight vector votes with its prediction and we pick the unlabeled examples for which the prediction is most evenly split between +1 and −1. For a third selection strategy note that each unlabeled example bisects the version space of consistent weight vectors. We estimate the volume on both sides of the split by bouncing a billiard through the version space and select unlabeled examples that cause the most even split of the version space. We demonstrate on two data sets provided by DuPont Pharmaceuticals that all three selection strategies perform comparably well and are much better than selecting random batches for testing.

3 0.52422386 167 nips-2001-Semi-supervised MarginBoost

Author: Florence D'alché-buc, Yves Grandvalet, Christophe Ambroise

Abstract: In many discrimination problems a large amount of data is available but only a few of them are labeled. This provides a strong motivation to improve or develop methods for semi-supervised learning. In this paper, boosting is generalized to this task within the optimization framework of MarginBoost . We extend the margin definition to unlabeled data and develop the gradient descent algorithm that corresponds to the resulting margin cost function. This meta-learning scheme can be applied to any base classifier able to benefit from unlabeled data. We propose here to apply it to mixture models trained with an Expectation-Maximization algorithm. Promising results are presented on benchmarks with different rates of labeled data. 1

4 0.52242285 101 nips-2001-K-Local Hyperplane and Convex Distance Nearest Neighbor Algorithms

Author: Pascal Vincent, Yoshua Bengio

Abstract: Guided by an initial idea of building a complex (non linear) decision surface with maximal local margin in input space, we give a possible geometrical intuition as to why K-Nearest Neighbor (KNN) algorithms often perform more poorly than SVMs on classification tasks. We then propose modified K-Nearest Neighbor algorithms to overcome the perceived problem. The approach is similar in spirit to Tangent Distance, but with invariances inferred from the local neighborhood rather than prior knowledge. Experimental results on real world classification tasks suggest that the modified KNN algorithms often give a dramatic improvement over standard KNN and perform as well or better than SVMs. 1 Motivation The notion of margin for classification tasks has been largely popularized by the success of the Support Vector Machine (SVM) [2, 15] approach. The margin of SVMs has a nice geometric interpretation1: it can be defined informally as (twice) the smallest Euclidean distance between the decision surface and the closest training point. The decision surface produced by the original SVM algorithm is the hyperplane that maximizes this distance while still correctly separating the two classes. While the notion of keeping the largest possible safety margin between the decision surface and the data points seems very reasonable and intuitively appealing, questions arise when extending the approach to building more complex, non-linear decision surfaces. Non-linear SVMs usually use the “kernel trick” to achieve their non-linearity. This conceptually corresponds to first mapping the input into a higher-dimensional feature space with some non-linear transformation and building a maximum-margin hyperplane (a linear decision surface) there. The “trick” is that this mapping is never computed directly, but implicitly induced by a kernel. In this setting, the margin being maximized is still the smallest Euclidean distance between the decision surface and the training points, but this time measured in some strange, sometimes infinite dimensional, kernel-induced feature space rather than the original input space. It is less clear whether maximizing the margin in this new space, is meaningful in general (see [16]). 1 for the purpose of this discussion, we consider the original hard-margin SVM algorithm for two linearly separable classes. A different approach is to try and build a non-linear decision surface with maximal distance to the closest data point as measured directly in input space (as proposed in [14]). We could for instance restrict ourselves to a certain class of decision functions and try to find the function with maximal margin among this class. But let us take this even further. Extending the idea of building a correctly separating non-linear decision surface as far away as possible from the data points, we define the notion of local margin as the Euclidean distance, in input space, between a given point on the decision surface and the closest training point. Now would it be possible to find an algorithm that could produce a decision surface which correctly separates the classes and such that the local margin is everywhere maximal along its surface? Surprisingly, the plain old Nearest Neighbor algorithm (1NN) [5] does precisely this! So why does 1NN in practice often perform worse than SVMs? One typical explanation, is that it has too much capacity, compared to SVM, that the class of function it can produce is too rich. 
But, considering it has infinite capacity (VC-dimension), 1NN is still performing quite well... This study is an attempt to better understand what is happening, based on geometrical intuition, and to derive an improved Nearest Neighbor algorithm from this understanding. 2 Fixing a broken Nearest Neighbor algorithm 2.1 Setting and definitions The setting is that of a classical classification problem in the input space.

5 0.50845855 143 nips-2001-PAC Generalization Bounds for Co-training

Author: Sanjoy Dasgupta, Michael L. Littman, David A. McAllester

Abstract: The rule-based bootstrapping introduced by Yarowsky, and its cotraining variant by Blum and Mitchell, have met with considerable empirical success. Earlier work on the theory of co-training has been only loosely related to empirically useful co-training algorithms. Here we give a new PAC-style bound on generalization error which justifies both the use of confidences — partial rules and partial labeling of the unlabeled data — and the use of an agreement-based objective function as suggested by Collins and Singer. Our bounds apply to the multiclass case, i.e., where instances are to be assigned one of k labels, for k ≥ 2.

6 0.47265571 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

7 0.45392361 60 nips-2001-Discriminative Direction for Kernel Classifiers

8 0.44770932 104 nips-2001-Kernel Logistic Regression and the Import Vector Machine

9 0.42333028 77 nips-2001-Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade

10 0.42030963 90 nips-2001-Hyperbolic Self-Organizing Maps for Semantic Navigation

11 0.41762689 89 nips-2001-Grouping with Bias

12 0.41715688 50 nips-2001-Classifying Single Trial EEG: Towards Brain Computer Interfacing

13 0.41257593 152 nips-2001-Prodding the ROC Curve: Constrained Optimization of Classifier Performance

14 0.40533724 106 nips-2001-Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering

15 0.40287194 159 nips-2001-Reducing multiclass to binary by coupling probability estimates

16 0.39636105 99 nips-2001-Intransitive Likelihood-Ratio Classifiers

17 0.39619723 64 nips-2001-EM-DD: An Improved Multiple-Instance Learning Technique

18 0.39182213 193 nips-2001-Unsupervised Learning of Human Motion Models

19 0.38566089 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

20 0.37506998 69 nips-2001-Escaping the Convex Hull with Extrapolated Vector Machines


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.029), (17, 0.018), (19, 0.024), (27, 0.136), (30, 0.05), (38, 0.013), (59, 0.024), (72, 0.074), (79, 0.037), (83, 0.014), (91, 0.496)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.9988004 107 nips-2001-Latent Dirichlet Allocation

Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan

Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model , also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification. 1

2 0.99781108 148 nips-2001-Predictive Representations of State

Author: Michael L. Littman, Richard S. Sutton

Abstract: We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model. In predicting or controlling a sequence of observations, the concepts of state and state estimation inevitably arise. There have been two dominant approaches. The generative-model approach, typified by research on partially observable Markov decision processes (POMDPs), hypothesizes a structure for generating observations and estimates its state and state dynamics. The history-based approach, typified by k-order Markov methods, uses simple functions of past observations as state, that is, as the immediate basis for prediction and control. (The data flow in these two approaches are diagrammed in Figure 1.) Of the two, the generative-model approach is more general. The model's internal state gives it temporally unlimited memorythe ability to remember an event that happened arbitrarily long ago--whereas a history-based approach can only remember as far back as its history extends. The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. Most uses of POMDPs, for example, assume a perfect dynamics model and attempt only to estimate state. There are algorithms for simultaneously estimating state and dynamics (e.g., Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al., 1970), but these are only effective at tuning parameters that are already approximately correct (e.g., Shatkay & Kaelbling, 1997). observations (and actions) 1-----1-----1..- (a) state rep'n observations (and actions) ¢E / t/' --+ 1-step delays . state rep'n (b) Figure 1: Data flow in a) POMDP and other recursive updating of state representation, and b) history-based state representation. In practice, history-based approaches are often much more effective. Here, the state representation is a relatively simple record of the stream of past actions and observations. It might record the occurrence of a specific subsequence or that one event has occurred more recently than another. Such representations are far more closely linked to the data than are POMDP representations. One way of saying this is that POMDP learning algorithms encounter many local minima and saddle points because all their states are equipotential. History-based systems immediately break symmetry, and their direct learning procedure makes them comparably simple. McCallum (1995) has shown in a number of examples that sophisticated history-based methods can be effective in large problems, and are often more practical than POMDP methods even in small ones. 
The predictive state representation (PSR) approach, which we develop in this paper, is like the generative-model approach in that it updates the state representation recursively, as in Figure l(a), rather than directly computing it from data. We show that this enables it to attain generality and compactness at least equal to that of the generative-model approach. However, the PSR approach is also like the history-based approach in that its representations are grounded in data. Whereas a history-based representation looks to the past and records what did happen, a PSR looks to the future and represents what will happen. In particular, a PSR is a vector of predictions for a specially selected set of action-observation sequences, called tests (after Rivest & Schapire, 1994). For example, consider the test U101U202, where U1 and U2 are specific actions and 01 and 02 are specific observations. The correct prediction for this test given the data stream up to time k is the probability of its observations occurring (in order) given that its actions are taken (in order) (i.e., Pr {Ok = 01, Ok+1 = 02 I A k = u1,A k + 1 = U2}). Each test is a kind of experiment that could be performed to tell us something about the system. If we knew the outcome of all possible tests, then we would know everything there is to know about the system. A PSR is a set of tests that is sufficient information to determine the prediction for all possible tests (a sufficient statistic). As an example of these points, consider the float/reset problem (Figure 2) consisting of a linear string of 5 states with a distinguished reset state on the far right. One action, f (float), causes the system to move uniformly at random to the right or left by one state, bounded at the two ends. The other action, r (reset), causes a jump to the reset state irrespective of the current state. The observation is always o unless the r action is taken when the system is already in the reset state, in which case the observation is 1. Thus, on an f action, the correct prediction is always 0, whereas on an r action, the correct prediction depends on how many fs there have been since the last r: for zero fS, it is 1; for one or two fS, it is 0.5; for three or four fS, it is 0.375; for five or six fs, it is 0.3125, and so on decreasing after every second f, asymptotically bottoming out at 0.2. No k-order Markov method can model this system exactly, because no limited-. .5 .5 a) float action 1,0=1 b) reset action Figure 2: Underlying dynamics of the float/reset problem for a) the float action and b) the reset action. The numbers on the arcs indicate transition probabilities. The observation is always 0 except on the reset action from the rightmost state, which produces an observation of 1. length history is a sufficient statistic. A POMDP approach can model it exactly by maintaining a belief-state representation over five or so states. A PSR, on the other hand, can exactly model the float/reset system using just two tests: rl and fOrI. Starting from the rightmost state, the correct predictions for these two tests are always two successive probabilities in the sequence given above (1, 0.5, 0.5, 0.375,...), which is always a sufficient statistic to predict the next pair in the sequence. Although this informational analysis indicates a solution is possible in principle, it would require a nonlinear updating process for the PSR. 
In this paper we restrict consideration to a linear special case of PSRs, for which we can guarantee that the number of tests needed does not exceed the number of states in the minimal POMDP representation (although we have not ruled out the possibility it can be considerably smaller). Of greater ultimate interest are the prospects for learning PSRs and their update functions, about which we can only speculate at this time. The difficulty of learning POMDP structures without good prior models are well known. To the extent that this difficulty is due to the indirect link between the POMDP states and the data, predictive representations may be able to do better. Jaeger (2000) introduced the idea of predictive representations as an alternative to belief states in hidden Markov models and provided a learning procedure for these models. We build on his work by treating the control case (with actions), which he did not significantly analyze. We have also been strongly influenced by the work of Rivest and Schapire (1994), who did consider tests including actions, but treated only the deterministic case, which is significantly different. They also explored construction and learning algorithms for discovering system structure. 1 Predictive State Representations We consider dynamical systems that accept actions from a discrete set A and generate observations from a discrete set O. We consider only predicting the system, not controlling it, so we do not designate an explicit reward observation. We refer to such a system as an environment. We use the term history to denote a test forming an initial stream of experience and characterize an environment by a probability distribution over all possible histories, P : {OIA}* H- [0,1], where P(Ol··· Otl a1··· at) is the probability of observations 01, ... , O£ being generated, in that order, given that actions aI, ... ,at are taken, in that order. The probability of a test t conditional on a history h is defined as P(tlh) = P(ht)/P(h). Given a set of q tests Q = {til, we define their (1 x q) prediction vector, p(h) = [P(t1Ih),P(t2Ih), ... ,P(tqlh)], as a predictive state representation (PSR) if and only if it forms a sufficient statistic for the environment, Le., if and only if P(tlh) = ft(P(h)), (1) for any test t and history h, and for some projection junction ft : [0, l]q ~ [0,1]. In this paper we focus on linear PSRs, for which the projection functions are linear, that is, for which there exist a (1 x q) projection vector mt, for every test t, such that (2) P(tlh) == ft(P(h)) =7 p(h)mf, for all histories h. Let Pi(h) denote the ith component of the prediction vector for some PSR. This can be updated recursively, given a new action-observation pair a,o, by .(h ) == P(t.lh ) == P(otil ha ) == faati(P(h)) == p(h)m'{;ati P2 ao 2 ao P(olha) faa (P(h)) p(h)mro ' (3) where the last step is specific to linear PSRs. We can now state our main result: Theorem 1 For any environment that can be represented by a finite POMDP model, there exists a linear PSR with number of tests no larger than the number of states in the minimal POMDP model. 2 Proof of Theorem 1: Constructing a PSR from a POMDP We prove Theorem 1 by showing that for any POMDP model of the environment, we can construct in polynomial time a linear PSR for that POMDP of lesser or equal complexity that produces the same probability distribution over histories as the POMDP model. We proceed in three steps. First, we review POMDP models and how they assign probabilities to tests. 
Next, we define an algorithm that takes an n-state POMDP model and produces a set of n or fewer tests, each of length less than or equal to n. Finally, we show that the set of tests constitute a PSR for the POMDP, that is, that there are projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. A POMDP (Lovejoy, 1991; Kaelbling et al., 1998) is defined by a sextuple (8, A, 0, bo, T, 0). Here, 8 is a set of n underlying (hidden) states, A is a discrete set of actions, and 0 is a discrete set of observations. The (1 x n) vector bo is an initial state distribution. The set T consists of (n x n) transition matrices Ta, one for each action a, where Tlj is the probability of a transition from state i to j when action a is chosen. The set 0 consists of diagonal (n x n) observation matrices oa,o, one for each pair of observation 0 and action a, where o~'o is the probability of observation 0 when action a is selected and state i is reached. l The state representation in a POMDP (Figure l(a)) is the belief state-the (1 x n) vector of the state-occupation probabilities given the history h. It can be computed recursively given a new action a and observation 0 by b(h)Taoa,o b(hao) = b(h)Taoa,oe;' where en is the (1 x n)-vector of all Is. Finally, a POMDP defines a probability distribution over tests (and thus histories) by P(Ol ... otlhal ... at) == b(h)Ta1oal,Ol ... Taloa£,Ole~. (4) IThere are many equivalent formulations and the conversion procedure described here can be easily modified to accommodate other POMDP definitions. We now present our algorithm for constructing a PSR for a given POMDP. It uses a function u mapping tests to (1 x n) vectors defined recursively by u(c) == en and u(aot) == (Taoa,ou(t)T)T, where c represents the null test. Conceptually, the components of u(t) are the probabilities of the test t when applied from each underlying state of the POMDP; we call u(t) the outcome vector for test t. We say a test t is linearly independent of a set of tests S if its outcome vector is linearly independent of the set of outcome vectors of the tests in S. Our algorithm search is used and defined as Q -<- search(c, {}) search(t, S): for each a E A, 0 E 0 if aot is linearly independent of S then S -<- search(aot, S U {aot}) return S The algorithm maintains a set of tests and searches for new tests that are linearly independent of those already found. It is a form of depth-first search. The algorithm halts when it checks all the one-step extensions of its tests and finds none that are linearly independent. Because the set of tests Q returned by search have linearly independent outcome vectors, the cardinality of Q is bounded by n, ensuring that the algorithm halts after a polynomial number of iterations. Because each test in Q is formed by a one-step extension to some other test in Q, no test is longer than n action-observation pairs. The check for linear independence can be performed in many ways, including Gaussian elimination, implying that search terminates in polynomial time. By construction, all one-step extensions to the set of tests Q returned by search are linearly dependent on those in Q. We now show that this is true for any test. Lemma 1 The outcome vectors of the tests in Q can be linearly combined to produce the outcome vector for any test. Proof: Let U be the (n x q) matrix formed by concatenating the outcome vectors for all tests in Q. 
Since, for all combinations of a and 0, the columns of Taoa,ou are linearly dependent on the columns of U, we can write Taoa,ou == UW T for some q x q matrix of weights W. If t is a test that is linearly dependent on Q, then anyone-step extension of t, aot, is linearly dependent on Q. This is because we can write the outcome vector for t as u(t) == (UwT)T for some (1 x q) weight vector w and the outcome vector for aot as u(aot) == (Taoa,ou(t)T)T == (Taoa,oUwT)T == (UWTwT)T. Thus, aot is linearly dependent on Q. Now, note that all one-step tests are linearly dependent on Q by the structure of the search algorithm. Using the previous paragraph as an inductive argument, this implies that all tests are linearly dependent on Q. 0 Returning to the float/reset example POMDP, search begins with by enumerating the 4 extensions to the null test (fO, fl, rO, and rl). Of these, only fa and rO are are linearly independent. Of the extensions of these, fOrO is the only one that is linearly independent of the other two. The remaining two tests added to Q by search are fOfOrO and fOfOfOrO. No extensions of the 5 tests in Q are linearly independent of the 5 tests in Q, so the procedure halts. We now show that the set of tests Q constitute a PSR for the POMDP by constructing projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. For each combination of a and 0, define a q x q matrix Mao == (U+Taoa,ou)T and a 1 x q vector mao == (U+Taoa,oe;;J T , where U is the matrix of outcome vectors defined in the previous section and U+ is its pseudoinverse2 • The ith row of Mao is maoti. The probability distribution on histories implied by these projection vectors is p(h )m~101 alOl p(h)M~ol M~_10l_1 m~Ol b(h)UU+r a1 oa 1,01 U ... U+T al-10 al-1,Ol-1 UU+Taloal,ol b(h)T a1 0 a1,01 ... ral-l0al-t,ol-lTaloal,Ole~, Le., it is the same as that of the POMDP, as in Equation 4. Here, the last step uses the fact that UU+v T == v T for v T linearly dependent on the columns of U. This holds by construction of U in the previous section. This completes the proof of Theorem 1. Completing the float/reset example, consider the Mf,o matrix found by the process defined in this section. It derives predictions for each test in Q after taking action f. Most of these are quite simple because the tests are so similar: the new prediction for rO is exactly the old prediction for fOrO, for example. The only non trivial test is fOfOfOrO. Its outcome can be computed from 0.250 p(rOlh) - 0.0625 p(fOrOlh) + 0.750 p(fOfOrOlh). This example illustrates that the projection vectors need not contain only positive entries. 3 Conclusion We have introduced a predictive state representation for dynamical systems that is grounded in actions and observations and shown that, even in its linear form, it is at least as general and compact as POMDPs. In essence, we have established PSRs as a non-inferior alternative to POMDPs, and suggested that they might have important advantages, while leaving demonstration of those advantages to future work. We conclude by summarizing the potential advantages (to be explored in future work): Learnability. The k-order Markov model is similar to PSRs in that it is entirely based on actions and observations. Such models can be learned trivially from data by counting-it is an open question whether something similar can be done with a PSR. 
Jaeger (2000) showed how to learn such a model in the uncontrolled setting, but the situation is more complex in the multiple action case since outcomes are conditioned on behavior, violating some required independence assumptions. Compactness. We have shown that there exist linear PSRs no more complex that the minimal POMDP for an environment, but in some cases the minimal linear PSR seems to be much smaller. For example, a POMDP extension of factored MDPs explored by Singh and Cohn (1998) would be cross-products of separate POMDPs and have linear PSRs that increase linearly with the number and size of the component POMDPs, whereas their minimal POMDP representation would grow as the size 2If U = A~BT is the singular value decomposition of U, then B:E+ AT is the pseudoinverse. The pseudoinverse of the diagonal matrix }J replaces each non-zero element with its reciprocal. e; of the state space, Le., exponential in the number of component POMDPs. This (apparent) advantage stems from the PSR's combinatorial or factored structure. As a vector of state variables, capable of taking on diverse values, a PSR may be inherently more powerful than the distribution over discrete states (the belief state) of a POMDP. We have already seen that general PSRs can be more compact than POMDPs; they are also capable of efficiently capturing environments in the diversity representation used by Rivest and Schapire (1994), which is known to provide an extremely compact representation for some environments. Generalization. There are reasons to think that state variables that are themselves predictions may be particularly useful in learning to make other predictions. With so many things to predict, we have in effect a set or sequence of learning problems, all due to the same environment. In many such cases the solutions to earlier problems have been shown to provide features that generalize particularly well to subsequent problems (e.g., Baxter, 2000; Thrun & Pratt, 1998). Powerful, extensible representations. PSRs that predict tests could be generalized to predict the outcomes of multi-step options (e.g., Sutton et al., 1999). In this case, particularly, they would constitute a powerful language for representing the state of complex environments. AcknowledgIllents: We thank Peter Dayan, Lawrence Saul, Fernando Pereira and Rob Schapire for many helpful discussions of these and related ideas. References Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164-171. Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149-198. Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 183-188). San Jose, California: AAAI Press. Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12, 1371-1398. Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in ' partially observable stochastic domains. Artificial Intelligence, 101, 99-134. Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47-65. McCallum, A. K. (1995). Reinforcement learning with selective perception and hidden state. 
Doctoral dissertation, Department of Computer Science, University of Rochester.

Rivest, R. L., & Schapire, R. E. (1994). Diversity-based inference of finite automata. Journal of the ACM, 41, 555-589.

Shatkay, H., & Kaelbling, L. P. (1997). Learning topological maps with weak local odometric information. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97) (pp. 920-929).

Singh, S., & Cohn, D. (1998). How to dynamically merge Markov decision processes. Advances in Neural Information Processing Systems 10 (pp. 1057-1063).

Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181-211.

Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.

3 0.99439287 111 nips-2001-Learning Lateral Interactions for Feature Binding and Sensory Segmentation

Author: Heiko Wersing

Abstract: We present a new approach to the supervised learning of lateral interactions for the competitive layer model (CLM) dynamic feature binding architecture. The method is based on consistency conditions, which were recently shown to characterize the attractor states of this linear threshold recurrent network. For a given set of training examples the learning problem is formulated as a convex quadratic optimization problem in the lateral interaction weights. An efficient dimension reduction of the learning problem can be achieved by using a linear superposition of basis interactions. We show the successful application of the method to a medical image segmentation problem of fluorescence microscope cell images.
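A hedged sketch of the optimization structure this abstract describes (our illustration only: the condition matrix standing in for the paper's consistency conditions is fabricated, and the exact CLM constraints are not reproduced): the full lateral interaction matrix is taken to be a linear superposition of K fixed basis interactions, and only the K superposition weights are fit, by a convex quadratic program with softened constraints so the toy problem is always feasible.

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    K, M = 5, 40                        # K basis interactions, M consistency conditions
    A = rng.standard_normal((M, K))     # placeholder: row m encodes a linear condition a_m . w >= 1

    w = cp.Variable(K)                  # weights over the basis interactions
    # Soft-margin convex QP: keep weights small, penalize violated conditions.
    objective = cp.Minimize(cp.sum_squares(w) + 10.0 * cp.sum(cp.pos(1 - A @ w)))
    cp.Problem(objective).solve()
    print(w.value)                      # the learned interaction matrix would be sum_k w_k * basis_k

Optimizing K weights rather than every pairwise interaction is the dimension reduction the abstract refers to.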

4 0.99418449 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task

Author: Michael C. Mozer, Michael D. Colagrosso, David E. Huber

Abstract: We are interested in the mechanisms by which individuals monitor and adjust their performance of simple cognitive tasks. We model a speeded discrimination task in which individuals are asked to classify a sequence of stimuli (Jones & Braver, 2001). Response conflict arises when one stimulus class is infrequent relative to another, resulting in more errors and slower reaction times for the infrequent class. How do control processes modulate behavior based on the relative class frequencies? We explain performance from a rational perspective that casts the goal of individuals as minimizing a cost that depends both on error rate and reaction time. With two additional assumptions of rationality—that class prior probabilities are accurately estimated and that inference is optimal subject to limitations on rate of information transmission—we obtain a good fit to overall RT and error data, as well as trial-by-trial variations in performance. Consider the following scenario: While driving, you approach an intersection at which the traffic light has already turned yellow, signaling that it is about to turn red. You also notice that a car is approaching you rapidly from behind, with no indication of slowing. Should you stop or speed through the intersection? The decision is difficult due to the presence of two conflicting signals. Such response conflict can be produced in a psychological laboratory as well. For example, Stroop (1935) asked individuals to name the color of ink on which a word is printed. When the words are color names incongruous with the ink color— e.g., “blue” printed in red—reaction times are slower and error rates are higher. We are interested in the control mechanisms underlying performance of high-conflict tasks. Conflict requires individuals to monitor and adjust their behavior, possibly responding more slowly if errors are too frequent. In this paper, we model a speeded discrimination paradigm in which individuals are asked to classify a sequence of stimuli (Jones & Braver, 2001). The stimuli are letters of the alphabet, A–Z, presented in rapid succession. In a choice task, individuals are asked to press one response key if the letter is an X or another response key for any letter other than X (as a shorthand, we will refer to non-X stimuli as Y). In a go/no-go task, individuals are asked to press a response key when X is presented and to make no response otherwise. We address both tasks because they elicit slightly different decision-making behavior. In both tasks, Jones and Braver (2001) manipulated the relative frequency of the X and Y stimuli; the ratio of presentation frequency was either 17:83, 50:50, or 83:17. Response conflict arises when the two stimulus classes are unbalanced in frequency, resulting in more errors and slower reaction times. For example, when X’s are frequent but Y is presented, individuals are predisposed toward producing the X response, and this predisposition must be overcome by the perceptual evidence from the Y. Jones and Braver (2001) also performed an fMRI study of this task and found that anterior cingulate cortex (ACC) becomes activated in situations involving response conflict. Specifically, when one stimulus occurs infrequently relative to the other, event-related fMRI response in the ACC is greater for the low frequency stimulus. Jones and Braver also extended a neural network model of Botvinick, Braver, Barch, Carter, and Cohen (2001) to account for human performance in the two discrimination tasks. 
The heart of the model is a mechanism that monitors conflict—the posited role of the ACC—and adjusts response biases accordingly. In this paper, we develop a parsimonious alternative account of the role of the ACC and of how control processes modulate behavior when response conflict arises.

1 A RATIONAL ANALYSIS

Our account is based on a rational analysis of human cognition, which views cognitive processes as being optimized with respect to certain task-related goals, and being adaptive to the structure of the environment (Anderson, 1990). We make three assumptions of rationality: (1) perceptual inference is optimal but is subject to rate limitations on information transmission, (2) response class prior probabilities are accurately estimated, and (3) the goal of individuals is to minimize a cost that depends both on error rate and reaction time. The heart of our account is an existing probabilistic model that explains a variety of facilitation effects that arise from long-term repetition priming (Colagrosso, in preparation; Mozer, Colagrosso, & Huber, 2000), and more broadly, that addresses changes in the nature of information transmission in neocortex due to experience. We give a brief overview of this model; the details are not essential for the present work. The model posits that neocortex can be characterized by a collection of information-processing pathways, and any act of cognition involves coordination among pathways. To model a simple discrimination task, we might suppose a perceptual pathway to map the visual input to a semantic representation, and a response pathway to map the semantic representation to a response. The choice and go/no-go tasks described earlier share a perceptual pathway, but require different response pathways. The model is framed in terms of probability theory: pathway inputs and outputs are random variables and microinference in a pathway is carried out by Bayesian belief revision. To elaborate, consider a pathway whose input at each time step is a discrete random variable that can assume values corresponding to alternative input states; similarly, the output of the pathway at each time step is a discrete random variable ranging over alternative output states. For example, the input to the perceptual pathway in the discrimination task is one of 26 visual patterns corresponding to the letters of the alphabet, and the output is one of 26 letter identities. (This model is highly abstract: the visual patterns are enumerated, but the actual pixel patterns are not explicitly represented in the model. Nonetheless, the similarity structure among inputs can be captured, but we skip a discussion of this issue because it is irrelevant for the current work.) To present a particular input alternative to the model for some number of time steps, we clamp the input to that alternative for those steps; the model then computes a probability distribution over the output given the input.
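The belief-revision step in this pathway model can be pictured with a small sketch (our hypothetical illustration, not the authors' code; the channel matrix and evidence encoding are made up): while the input is clamped, each time step delivers a small amount of evidence through a rate-limited channel, and Bayes' rule sharpens the posterior over responses gradually, more slowly when the prior is biased against the presented class.

    import numpy as np

    def pathway_posterior(prior, channel, evidence):
        # prior: class prior probabilities (e.g., the 83:17 frequency manipulation)
        # channel[y, x]: p(evidence symbol x | class y) for a weak, rate-limited channel
        # evidence: symbols received on successive time steps while the input is clamped
        post = prior.astype(float)
        for x in evidence:
            post = post * channel[:, x]   # Bayes' rule: posterior proportional to prior times likelihood
            post = post / post.sum()
        return post

    # Infrequent class (17%) presented: its evidence must overcome the biased prior.
    prior = np.array([0.83, 0.17])
    channel = np.array([[0.6, 0.4],
                        [0.4, 0.6]])
    print(pathway_posterior(prior, channel, [1, 1, 1]))  # ~[0.59, 0.41] after 3 steps

With these made-up numbers, three steps of evidence for the infrequent class still leave the posterior favoring the frequent class, which is the qualitative pattern behind the slower reaction times and higher error rates described above.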

5 0.99352872 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images

Author: James M. Coughlan, Alan L. Yuille

Abstract: We describe the g-factor, which relates probability distributions on image features to distributions on the images themselves. The g-factor depends only on our choice of features and lattice quantization and is independent of the training image data. We illustrate the importance of the g-factor by analyzing how the parameters of Markov Random Field (i.e. Gibbs or log-linear) probability models of images are learned from data by maximum likelihood estimation. In particular, we study homogeneous MRF models which learn image distributions in terms of clique potentials corresponding to feature histogram statistics (cf. Minimax Entropy Learning (MEL) by Zhu, Wu and Mumford 1997 [11]). We first use our analysis of the g-factor to determine when the clique potentials decouple for different features. Second, we show that clique potentials can be computed analytically by approximating the g-factor. Third, we demonstrate a connection between this approximation and the Generalized Iterative Scaling algorithm (GIS), due to Darroch and Ratcliff 1972 [2], for calculating potentials. This connection enables us to use GIS to improve our multinomial approximation, using Bethe-Kikuchi [8] approximations to simplify the GIS procedure. We support our analysis by computer simulations.
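Since the abstract leans on Generalized Iterative Scaling, a minimal sketch of the classical GIS update may help (our illustration on a toy enumerable space, not the paper's Bethe-Kikuchi-accelerated procedure):

    import numpy as np

    def gis(F, target, iters=500):
        # F: k x N array with F[i, x] = f_i(x) on an enumerable space of N configurations.
        # GIS assumes f_i(x) >= 0 and sum_i f_i(x) = C for every x; target holds the
        # positive empirical expectations E_emp[f_i] to be matched.
        C = F.sum(axis=0)[0]
        lam = np.zeros(F.shape[0])
        for _ in range(iters):
            logp = lam @ F                         # unnormalized log-probabilities
            p = np.exp(logp - logp.max())
            p = p / p.sum()                        # model p(x) proportional to exp(sum_i lam_i f_i(x))
            lam += np.log(target / (F @ p)) / C    # GIS update: move E_model[f] toward E_emp[f]
        return lam

    # One-hot features give C = 1, so GIS matches the target distribution in one step.
    lam = gis(np.eye(4), np.array([0.4, 0.3, 0.2, 0.1]), iters=1)

Brute-force enumeration stands in here for the model expectations that the paper approximates analytically via the g-factor.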

6 0.99244761 87 nips-2001-Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway

same-paper 7 0.99001449 144 nips-2001-Partially labeled classification with Markov random walks

8 0.97750753 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

9 0.93875229 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds

10 0.91690034 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

11 0.89979076 30 nips-2001-Agglomerative Multivariate Information Bottleneck

12 0.89974362 68 nips-2001-Entropy and Inference, Revisited

13 0.89794332 24 nips-2001-Active Information Retrieval

14 0.89389032 183 nips-2001-The Infinite Hidden Markov Model

15 0.8922838 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

16 0.89154625 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

17 0.88448274 96 nips-2001-Information-Geometric Decomposition in Spike Analysis

18 0.8818692 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

19 0.87343925 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments

20 0.87008238 11 nips-2001-A Maximum-Likelihood Approach to Modeling Multisensory Enhancement