nips nips2001 nips2001-111 knowledge-graph by maker-knowledge-mining

111 nips-2001-Learning Lateral Interactions for Feature Binding and Sensory Segmentation


Source: pdf

Author: Heiko Wersing

Abstract: We present a new approach to the supervised learning of lateral interactions for the competitive layer model (CLM) dynamic feature binding architecture. The method is based on consistency conditions, which were recently shown to characterize the attractor states of this linear threshold recurrent network. For a given set of training examples the learning problem is formulated as a convex quadratic optimization problem in the lateral interaction weights. An efficient dimension reduction of the learning problem can be achieved by using a linear superposition of basis interactions. We show the successful application of the method to a medical image segmentation problem of fluorescence microscope cell images.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract We present a new approach to the supervised learning of lateral interactions for the competitive layer model (CLM) dynamic feature binding architecture. [sent-8, score-1.191]

2 The method is based on consistency conditions, which were recently shown to characterize the attractor states of this linear threshold recurrent network. [sent-9, score-0.269]

3 For a given set of training examples the learning problem is formulated as a convex quadratic optimization problem in the lateral interaction weights. [sent-10, score-0.94]

4 An efficient dimension reduction of the learning problem can be achieved by using a linear superposition of basis interactions. [sent-11, score-0.139]

5 We show the successful application of the method to a medical image segmentation problem of fluorescence microscope cell images. [sent-12, score-0.479]

6 1 Introduction Feature binding has been proposed to provide elegant solution strategies to the segmentation problem in perception [11, 12, 14]. [sent-13, score-0.463]

7 Many feature binding models have thus tried to reproduce grouping mechanisms like the Gestalt laws of visual perception, e.g. [sent-14, score-0.3]

8 Quite generally in these models, grouping is based on lateral interactions between feature-representing neurons, which characterize the degree of compatibility between features. [sent-17, score-0.756]

9 Currently in most of the approaches this lateral interaction scheme is chosen heuristically, since the experimental data on the corresponding connection patterns in the visual cortex is insufficient. [sent-18, score-0.792]

10 Nevertheless, in more complex feature spaces this heuristic approach becomes infeasible, raising the question for more systematic learning methods for lateral interactions. [sent-19, score-0.5]

11 [4] suggested supervised learning for a dynamic feature binding model of complex-valued directional units, where the connections to hidden units guiding the grouping dynamics were adapted by recurrent backpropagation learning. [sent-21, score-0.541]

12 [2] considered unsupervised texture segmentation by a pairwise clustering approach on feature vectors derived from Gabor filter banks at different frequencies and orientations. [sent-24, score-0.412]

13 In their model the pairwise feature compatibilities are determined by a divergence measure of the local feature distributions which was shown to achieve good segmentation results for a range of image types. [sent-25, score-0.574]

14 The problem of segmentation can also be phrased as a labeling problem, where relaxation labeling algorithms have been used as a popular tool in a wide range of computer vision applications. [sent-26, score-0.462]

15 Pelillo & Refice [7] suggested a supervised learning method for the compatibility coefficients of relaxation labeling algorithms, based on minimizing the distance between a target labeling vector and the output after iterating a fixed number of relaxation steps. [sent-27, score-0.495]

16 [16] demonstrated how these properties can be used to learn winner-take-all competition between groups of neurons in an LT network with lateral inhibition. [sent-31, score-0.454]

17 The CLM binding model is implemented by a large-scale topographically organized LT network, and it was shown that this leads to consistency conditions characterizing its binding states [14]. [sent-32, score-0.654]

18 In this contribution we show how these conditions can be used to formulate a learning approach for the CLM as a quadratic optimization problem. [sent-33, score-0.188]

19 In Section 2 we briefly introduce the competitive layer binding model. [sent-34, score-0.467]

20 In Section 4 we show application results of the approach to a cell segmentation problem and give a discussion in the final Section 5. [sent-36, score-0.393]

21 2 The CLM architecture   The CLM [9, 14] consists of a set of layers of feature-selective neurons (see Fig. [sent-37, score-0.137]

22 The activity of a neuron at position r in layer α is denoted by x_{rα}, and a column x_r denotes the set of the neuron activities x_{rα}, α = 1, ..., L, sharing a common position r. [sent-39, score-0.474]

23 local edge elements characterized by position and orientation . [sent-42, score-0.196]

24 A binding between two features, represented by columns x_r and x_{r'}, is expressed by simultaneous activities x_{rα} and x_{r'α} that share a common layer α. [sent-43, score-0.53]

25 All neurons in a column r are equally driven by an external input h_r, which represents the significance of the detection of feature r by a preprocessing step. [sent-44, score-0.177]

26 Within each layer the activities are coupled via lateral connections f_{rr'}, which characterize the degree of compatibility between features r and r' and which are a symmetric function of the feature parameters, thus f_{rr'} = f_{r'r}. [sent-46, score-0.926]

27 The purpose of the layered arrangement in the CLM is to enforce an assignment of the input features to the layers by the dynamics, using the contextual information stored in the lateral interactions. [sent-47, score-0.553]

28 The unique assignment to a single layer is realized by a columnar Winner-Take-All (WTA) circuit, which uses mutual symmetric inhibitory interactions with absolute strength J between neural activities x_{rα} and x_{rβ} that share a common column r. [sent-48, score-0.624]

29 Due to the WTA coupling, for a stable equilibrium state of the CLM only a neuron from one layer can be active within each column [14]. [sent-49, score-0.311]

30 The number of layers does not predetermine the number of active groups, since for sufficiently many layers only those are active that carry a salient group. [sent-50, score-0.235]

31 The combination of afferent inputs and lateral and vertical interactions yields the standard linear threshold additive activity dynamics ẋ_{rα} = −x_{rα} + σ( J(h_r − Σ_β x_{rβ}) + Σ_{r'} f_{rr'} x_{r'α} ), with σ(x) = max(0, x). (1) [sent-51, score-0.599]

32 For J large compared to the lateral weights f_{rr'}, the single active neuron in a column reproduces its afferent input, x_{rα̂} ≈ h_r. [sent-59, score-0.557]

33 As was shown in [14], the stable states of (1) satisfy the consistency conditions Σ_{r'} f_{rr'} x_{r'α̂(r)} ≥ Σ_{r'} f_{rr'} x_{r'β} for all r and all β, (2) [sent-60, score-0.238]

34 which express the assignment of a feature r to the layer α̂(r) with highest lateral support (2). [sent-64, score-0.706]
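These consistency conditions can be checked mechanically for a candidate binding state. The sketch below (a 6-feature, 2-group interaction matrix; all sizes and values are invented for illustration, not taken from the paper) tests whether the assigned layer of every feature receives the highest lateral support:

```python
import numpy as np

def is_consistent(f, x):
    """x[r, a]: column-wise activities; a state is consistent when each
    feature's active layer receives the highest lateral support f @ x."""
    support = f @ x                  # support[r, a] = sum_r' f[r, r'] * x[r', a]
    assigned = x.argmax(axis=1)      # active layer per column
    return bool(np.all(support.argmax(axis=1) == assigned))

N, L = 6, 2
group = np.array([0, 0, 0, 1, 1, 1])     # two intended feature groups
# Compatible features interact positively, incompatible ones negatively.
f = np.where(group[:, None] == group[None, :], 0.8, -0.8)

# Correct assignment: each group occupies its own layer -> consistent.
x_good = np.zeros((N, L))
x_good[np.arange(N), group] = 1.0

# One feature moved to the wrong layer -> inconsistent for that column.
x_bad = x_good.copy()
x_bad[0] = [0.0, 1.0]

print(is_consistent(f, x_good), is_consistent(f, x_bad))  # prints: True False
```

A grouping that respects the pairwise compatibilities satisfies (2), while relabeling a single feature violates the condition in that feature's column.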

35 Figure 1: The competitive layer model architecture: layers 1, ..., L of activities x_{rα} with lateral interactions within each layer, a vertical WTA interaction within each column, and afferent inputs h_r (see text for description). [sent-65, score-1.956]

36 The overall task of the learning algorithm is to adapt the lateral interactions, given by the interaction coefficients f_{rr'}, such that the CLM architecture performs appropriate segmentation on the labeled training data and also generalizes to new test data. [sent-67, score-1.197]

37 We assume that the training data consist of a set of labeled training patterns, where each pattern consists of a subset of different features with their corresponding labels. [sent-68, score-0.368]

38 For each labeled training pattern a target labeling vector is constructed by activating, in each column, the neuron of the layer given by the feature's label. [sent-69, score-0.259]

39 The indices run over the subset of features realized in a particular pattern, e.g. all edges of different orientations at different image positions. [sent-74, score-0.236]

40 The assignment vectors form the basis of the learning approach since they represent the target activity distribution, which we want to obtain after iterating the CLM with appropriately adjusted lateral interactions. [sent-77, score-0.579]

41 The goal of the learning process is to make the training patterns consistent, which is, in accordance with (2), expressed by the inequalities Σ_{r'} f_{rr'} y_{r'α̂(r)} ≥ Σ_{r'} f_{rr'} y_{r'β} for all r and all β, (4) where y denotes the target labeling vectors. [sent-81, score-0.269]

42 These inequalities define the learning problem that we want to solve in the following. [sent-82, score-0.143]

43 The index runs over all consistency relations defined for the labeled columns of the assignment vectors. [sent-86, score-0.378]

44 The vectors with components are called consistency vectors and represent the consistency constraints for the lateral interaction. [sent-87, score-0.845]

45 The index runs over all entries in the lateral interaction matrix. [sent-88, score-0.727]

46 The inequalities (4) can then be written in the form w^T c_j ≤ 0 for all j, (6) where w is the weight vector of lateral interaction entries and c_j are the consistency vectors. This illustrates the nature of the learning problem. [sent-91, score-0.143]

47 The problem is to find a weight vector which leads to a lateral interaction matrix, such that the consistency vectors lie in the opposite half space of the weight state space. [sent-92, score-0.962]

48 Since the conditions (6) determine the attractivity of the training patterns, it is customary to introduce a positive margin to achieve greater robustness. [sent-93, score-0.193]

49 This gives the target inequalities w^T c_j ≤ −κ for all j, with a positive margin κ, (7) which we want to solve in w for the given training data. [sent-94, score-0.171]

50 If the number of features is large, the number of parameters in the complete interaction matrix may be too large to be robustly estimated from a limited number of training examples. [sent-98, score-0.472]

51 An example is to choose basis functions which incorporate invariances such as translation and rotation invariance, or which satisfy the constraint that the interaction is equal in all layers. [sent-101, score-0.404]

52 Now the learning problem of solving the inequalities (7) can be recast in the new free parameters. [sent-103, score-0.143]

53 After inserting (8) into (7) we obtain the transformed problem Σ_k a_k (g_k^T c_j) ≤ −κ for all j, (9) where g_k^T c_j is the component of the consistency vector c_j in the basis interaction g_k and the a_k are the free coefficients. [sent-104, score-0.606]

54 The basis interactions can thus be used to reduce the dimensionality of the learning problem. [sent-105, score-0.255]

55 To avoid any redundancy, the basis interactions should be linearly independent. [sent-106, score-0.222]
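As a toy illustration of this dimension reduction, the sketch below composes a full interaction matrix from two fixed symmetric basis interactions with free coefficients, in the spirit of a linear superposition f = Σ_k a_k g_k (the two basis matrices, a short-range excitatory template and a global inhibitory bias, and the coefficient values are invented for illustration):

```python
import numpy as np

N = 5
pos = np.arange(N, dtype=float)
dist = np.abs(pos[:, None] - pos[None, :])   # pairwise feature distances

# Two linearly independent, symmetric basis interactions g_k.
g = np.stack([
    (dist <= 1).astype(float),   # g_0: short-range support
    np.ones((N, N)),             # g_1: global (inhibitory) bias
])

a = np.array([1.0, -0.3])        # free parameters to be learned
f = np.tensordot(a, g, axes=1)   # f = sum_k a_k g_k

assert np.allclose(f, f.T)       # symmetry inherited from the symmetric bases
# Learning now only determines len(a) = 2 coefficients instead of
# N*(N+1)/2 independent interaction-matrix entries.
```

With invariance constraints built into the g_k (e.g. dependence on distance only, as here), the learned coefficients automatically produce an interaction with the same invariances.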

56 The basis interactions need not span the whole space of interactions. Quadratic Consistency Optimization. [sent-108, score-0.17]

57 The generic case in any real-world application is that the majority of training vectors contain relevant information, while single spurious vectors may be present due to noise or other disturbing factors. [sent-109, score-0.21]

58 This will especially be the case if a low-dimensional embedding is used for the basis function templates, as described above. [sent-111, score-0.133]

59 We therefore suggest adapting the interactions by minimizing the following convex cost function: E_QCO(w) = Σ_j ( w^T c_j + κ )². (10) [sent-112, score-0.17]

60 A similar minimization approach was suggested for the imprinting of attractors for the Brain-State-in-a-Box (BSB) model [8], and a recent study has shown that the approach is competitive with other methods for designing BSB associative memories [6]. [sent-113, score-0.141]

61 For a fixed positive margin, the cost function (10) is minimized by making the inner products of the weight vector and the consistency vectors negative. [sent-114, score-0.282]

62 The global minimum is attained if the inner products are all equal to the negative margin, which can be interpreted with QCO such that all consistency inequalities are fulfilled in an equal manner. [sent-115, score-0.312]
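Because the cost is a convex sum of squares, its global minimizer is a linear least-squares solution. A minimal numerical sketch (the problem sizes, margin value, and randomly generated consistency vectors are invented stand-ins for those built from labeled patterns):

```python
import numpy as np

rng = np.random.default_rng(1)

M, D = 40, 8                 # number of constraints x parameter dimension
kappa = 0.5                  # positive margin

# Random stand-ins for the consistency vectors c_j (rows of C).
C = rng.normal(size=(M, D))

# Minimizing E(w) = sum_j (w . c_j + kappa)^2 is the least-squares
# problem C w ~= -kappa * 1; lstsq returns its global minimizer.
w, *_ = np.linalg.lstsq(C, -kappa * np.ones(M), rcond=None)

cost = np.sum((C @ w + kappa) ** 2)
cost_at_zero = M * kappa ** 2        # cost of the trivial solution w = 0
assert cost <= cost_at_zero          # the optimum can only improve on w = 0
```

Driving every inner product w·c_j towards −κ fulfils the consistency inequalities in an equal manner, as described above, even when the inequality system itself is infeasible.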

63 4 Application to Cell Segmentation. The automatic detection and segmentation of individual cells in fluorescence micrographs is a key technology for high-throughput analysis of immune cell surface proteins [5]. [sent-124, score-0.403]

64 In the bottom row corresponding image patches are displayed, where individual cell regions were manually labeled to obtain training data for the learning process. [sent-128, score-0.455]

65 For each of the image patches, a training vector consists of a list of labeled edge features, each parameterized by a position in the image and a unit local edge orientation vector computed from the intensity gradient. [sent-129, score-0.672]
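A sketch of this kind of edge-feature preprocessing (the synthetic step image and the gradient-magnitude threshold are invented for illustration; the paper's actual preprocessing may differ):

```python
import numpy as np

img = np.zeros((16, 16))
img[:, 8:] = 1.0                         # synthetic vertical step edge

gy, gx = np.gradient(img)                # intensity gradient per pixel
mag = np.hypot(gx, gy)

ys, xs = np.nonzero(mag > 0.25)          # keep salient edge locations
features = [
    # position (y, x) plus unit orientation vector from the gradient
    (y, x, gx[y, x] / mag[y, x], gy[y, x] / mag[y, x])
    for y, x in zip(ys, xs)
]

# Every stored orientation is a unit vector, as required above.
assert all(abs(np.hypot(nx, ny) - 1.0) < 1e-9 for _, _, nx, ny in features)
```

Each tuple in `features` would then receive a layer label (cell index or background) to form one labeled training pattern.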

66 Since the figure-ground separating mechanism as implemented by the CLM [14] is also used for this cell segmentation application, features which are not labeled as part of a cell obtain the corresponding background label. [sent-131, score-0.643]

67 Each training pattern contains one additional free layer, to enable the learning algorithm to generalize over the number of layers. [sent-132, score-0.128]

68 The unit-vector representation is converted into an angular orientation variable. To achieve rotation invariance of the interaction, it is only dependent on the relative edge orientations. [sent-135, score-0.274]

69  ¤ 2 ©% # 8 7 6 8 9 7 6 5 4 4 2 5 3 2 3 a) Manually labelled training patterns b) Grouping results after learning Figure 2: a) Original images and manually labeled training patterns from a fluorescence micrograph. [sent-136, score-0.43]

70 b) Test patterns and resulting CLM segmentation with learned lateral interaction. [sent-137, score-0.725]

71 Grayscale represents different layer activations, where a total of 20 layers plus one background layer (black) was used. [sent-138, score-0.474]

72 For each combination of the two discretized edge orientations an interaction template is generated, which responds only in this combined orientation interval. [sent-142, score-0.608]

73 Thus the angular templates do not overlap in the combined space, i.e. [sent-143, score-0.162]

74 Since the interaction must be symmetric under feature exchange, this does not result in distinct templates for every ordered combination, but only 36 independent templates. [sent-146, score-0.476]

75 Apart from the discretization, the interaction represents the most general angle-dependent interaction within the local neighborhood which is symmetric under feature exchange. [sent-147, score-0.828]
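The template count can be reproduced with a quick sketch. Assuming B = 8 orientation bins (the bin count is an assumption chosen so that B(B+1)/2 matches the 36 independent templates mentioned above), symmetry under feature exchange leaves only the unordered bin pairs:

```python
import numpy as np

B = 8  # assumed number of discretized orientation bins

# Ordered bin combinations collapse to unordered pairs under exchange.
pairs = sorted({tuple(sorted((i, j))) for i in range(B) for j in range(B)})
assert len(pairs) == B * (B + 1) // 2 == 36

def template_response(pair, theta1, theta2, B=B):
    """1.0 iff the (unordered) orientation pair falls into this bin pair."""
    bins = tuple(sorted(int(t * B / (2 * np.pi)) % B for t in (theta1, theta2)))
    return float(bins == pair)

# The templates tile the combined orientation space without overlap,
# so exactly one template responds for any orientation pair.
responses = [template_response(p, 0.1, 1.0) for p in pairs]
assert sum(responses) == 1.0
```

Each such indicator template would then be multiplied by a radial profile to form one basis interaction g_k for the superposition above.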

76 We use two sets of angular templates, one for a near and one for a far distance range, where the outer bound is the maximal local interaction range. [sent-148, score-0.514]

77 Figure 3 compares the optimized interaction field to earlier heuristic lateral interactions for contour grouping. [sent-150, score-0.936]

78 For all the training examples we used, the resulting inequalities (9) were in fact incompatible, rendering a direct solution of (9) infeasible. [sent-154, score-0.171]

79 After training was completed by minimizing (12), a new image patch was selected as a test pattern and the CLM grouping was performed with the lateral interaction learned before, using the dynamical model as described in [14]. [sent-155, score-1.029]

80 The quadratic consistency optimization was performed as described in the previous section, exploring the free margin parameter. [sent-156, score-0.368]

81 For a set of two training patterns as shown in Fig. [sent-157, score-0.126]

82 Typical segmentation results obtained with the quadratic consistency optimization approach are shown in Figure 2b, where the margin was set to a fixed value. [sent-159, score-0.623]

83 The grouping results show a good segmentation performance where most of the salient cells are detected as single groups. [sent-161, score-0.427]

84 There are some spurious groups where a dark image region forms an additional group and some smaller cells are rejected into the background layer. [sent-162, score-0.262]

85 Apart from these minor errors, the optimization has achieved an adequate balancing of the different lateral interaction components for this segmentation task. [sent-163, score-1.033]

86 The interaction depends on the difference vector and two unit vectors, shown in b), encoding directed orientation. [sent-165, score-0.416]

87 a) explains the interaction visualizations c) and d) by showing a magnification of the plot c) of the interaction field of a single horizontal edge pointing to the left. [sent-166, score-0.8]

88 The plots are generated by computing the interaction of the central directed edge with directed edges of all directions (like a cylindrical plot) at a spatial grid. [sent-167, score-0.569]

89 Black edges share excitatory and white edges share inhibitory interaction with the central edge, and length codes for interaction strength. [sent-168, score-1.002]

90 The cocircular continuity field in c) depends on position and orientation but is not direction-selective. [sent-169, score-0.207]

91 lie tangentially to a common circle and has been recently used for contour segmentation [3, 14]. [sent-172, score-0.294]

92 The learned lateral interaction field is shown in d). [sent-173, score-0.757]

93 £      5 Discussion The presented results show that appropriate lateral interactions can be obtained for the CLM binding architecture from the quadratic consistency optimization approach. [sent-176, score-1.107]

94 This supervised learning approach has clear advantages over the manual tuning of complex feature interactions in complex feature spaces with many parameters. [sent-178, score-0.441]

95 We consider this as an important step towards practical applicability of the feature binding concept. [sent-179, score-0.3]

96 The presented quadratic consistency optimization method is based on choosing equal margins for all consistency inequalities. [sent-180, score-0.523]

97 The application of similar methods to the supervised learning of CLM interactions provides an interesting field for future work. [sent-182, score-0.297]

98 The author thanks Helge Ritter and Tim Nattkemper for discussions and Walter Schubert for providing the cell image data. [sent-184, score-0.184]

99 A competitive layer model for feature binding and sensory segmentation. [sent-273, score-0.559]

100 Learning winner-take-all competition between groups of neurons in lateral inhibition networks. [sent-284, score-0.454]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('lateral', 0.375), ('clm', 0.368), ('interaction', 0.352), ('segmentation', 0.255), ('binding', 0.208), ('consistency', 0.202), ('layer', 0.183), ('interactions', 0.17), ('inequalities', 0.11), ('qco', 0.098), ('uorescence', 0.098), ('cell', 0.098), ('edge', 0.096), ('feature', 0.092), ('grouping', 0.091), ('labeled', 0.088), ('image', 0.086), ('compatibility', 0.085), ('angular', 0.081), ('templates', 0.081), ('labeling', 0.076), ('competitive', 0.076), ('bsb', 0.074), ('tissue', 0.074), ('wersing', 0.074), ('quadratic', 0.068), ('activities', 0.065), ('patterns', 0.065), ('eld', 0.065), ('wta', 0.064), ('layers', 0.063), ('training', 0.061), ('features', 0.059), ('edges', 0.059), ('continuity', 0.058), ('cients', 0.058), ('manually', 0.057), ('assignment', 0.056), ('relaxation', 0.055), ('supervised', 0.054), ('orientation', 0.054), ('afferent', 0.054), ('superposition', 0.054), ('coef', 0.053), ('basis', 0.052), ('optimization', 0.051), ('cells', 0.05), ('attractivity', 0.049), ('cocircular', 0.049), ('compatibilities', 0.049), ('heiko', 0.049), ('nattkemper', 0.049), ('schubert', 0.049), ('lt', 0.049), ('margin', 0.047), ('position', 0.046), ('neuron', 0.045), ('background', 0.045), ('pf', 0.045), ('column', 0.044), ('xr', 0.043), ('orientations', 0.043), ('incompatible', 0.043), ('hahnloser', 0.043), ('spurious', 0.043), ('pelillo', 0.043), ('share', 0.042), ('neurons', 0.041), ('application', 0.04), ('active', 0.039), ('contour', 0.039), ('xie', 0.039), ('ritter', 0.039), ('groups', 0.038), ('hr', 0.036), ('lled', 0.036), ('ampli', 0.036), ('conditions', 0.036), ('characterize', 0.035), ('pattern', 0.034), ('memories', 0.034), ('sta', 0.034), ('vectors', 0.033), ('architecture', 0.033), ('learning', 0.033), ('realized', 0.032), ('texture', 0.032), ('analogue', 0.032), ('discretization', 0.032), ('patches', 0.032), ('discretized', 0.032), ('columns', 0.032), ('recurrent', 0.032), ('symmetric', 0.032), ('directed', 0.031), ('salient', 0.031), ('template', 
0.031), ('suggested', 0.031), ('learned', 0.03), ('iterating', 0.03)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 111 nips-2001-Learning Lateral Interactions for Feature Binding and Sensory Segmentation

Author: Heiko Wersing

Abstract: We present a new approach to the supervised learning of lateral interactions for the competitive layer model (CLM) dynamic feature binding architecture. The method is based on consistency conditions, which were recently shown to characterize the attractor states of this linear threshold recurrent network. For a given set of training examples the learning problem is formulated as a convex quadratic optimization problem in the lateral interaction weights. An efficient dimension reduction of the learning problem can be achieved by using a linear superposition of basis interactions. We show the successful application of the method to a medical image segmentation problem of fluorescence microscope cell images.

2 0.19578609 89 nips-2001-Grouping with Bias

Author: Stella X. Yu, Jianbo Shi

Abstract: With the optimization of pattern discrimination as a goal, graph partitioning approaches often lack the capability to integrate prior knowledge to guide grouping. In this paper, we consider priors from unitary generative models, partially labeled data and spatial attention. These priors are modelled as constraints in the solution space. By imposing uniformity condition on the constraints, we restrict the feasible space to one of smooth solutions. A subspace projection method is developed to solve this constrained eigenproblema We demonstrate that simple priors can greatly improve image segmentation results. 1

3 0.11600272 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations

Author: Randall C. O'Reilly, R. S. Busby

Abstract: We present a model of binding of relationship information in a spatial domain (e.g., square above triangle) that uses low-order coarse-coded conjunctive representations instead of more popular temporal synchrony mechanisms. Supporters of temporal synchrony argue that conjunctive representations lack both efficiency (i.e., combinatorial numbers of units are required) and systematicity (i.e., the resulting representations are overly specific and thus do not support generalization to novel exemplars). To counter these claims, we show that our model: a) uses far fewer hidden units than the number of conjunctions represented, by using coarse-coded, distributed representations where each unit has a broad tuning curve through high-dimensional conjunction space, and b) is capable of considerable generalization to novel inputs.

4 0.11456999 12 nips-2001-A Model of the Phonological Loop: Generalization and Binding

Author: Randall C. O'Reilly, R. Soto

Abstract: We present a neural network model that shows how the prefrontal cortex, interacting with the basal ganglia, can maintain a sequence of phonological information in activation-based working memory (i.e., the phonological loop). The primary function of this phonological loop may be to transiently encode arbitrary bindings of information necessary for tasks - the combinatorial expressive power of language enables very flexible binding of essentially arbitrary pieces of information. Our model takes advantage of the closed-class nature of phonemes, which allows different neural representations of all possible phonemes at each sequential position to be encoded. To make this work, we suggest that the basal ganglia provide a region-specific update signal that allocates phonemes to the appropriate sequential coding slot. To demonstrate that flexible, arbitrary binding of novel sequences can be supported by this mechanism, we show that the model can generalize to novel sequences after moderate amounts of training. 1

5 0.10543779 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

Author: Jens Kohlmorgen, Steven Lemm

Abstract: We propose a novel method for the analysis of sequential data that exhibits an inherent mode switching. In particular, the data might be a non-stationary time series from a dynamical system that switches between multiple operating modes. Unlike other approaches, our method processes the data incrementally and without any training of internal parameters. We use an HMM with a dynamically changing number of states and an on-line variant of the Viterbi algorithm that performs an unsupervised segmentation and classification of the data on-the-fly, i.e. the method is able to process incoming data in real-time. The main idea of the approach is to track and segment changes of the probability density of the data in a sliding window on the incoming data stream. The usefulness of the algorithm is demonstrated by an application to a switching dynamical system. 1

6 0.10015196 82 nips-2001-Generating velocity tuning by asymmetric recurrent connections

7 0.092351988 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network

8 0.087808549 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates

9 0.081561476 87 nips-2001-Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway

10 0.081378974 73 nips-2001-Eye movements and the maturation of cortical orientation selectivity

11 0.080868997 37 nips-2001-Associative memory in realistic neuronal networks

12 0.080864973 43 nips-2001-Bayesian time series classification

13 0.078642786 141 nips-2001-Orientation-Selective aVLSI Spiking Neurons

14 0.068433367 46 nips-2001-Categorization by Learning and Combining Object Parts

15 0.06781593 65 nips-2001-Effective Size of Receptive Fields of Inferior Temporal Visual Cortex Neurons in Natural Scenes

16 0.067009605 129 nips-2001-Multiplicative Updates for Classification by Mixture Models

17 0.066732265 150 nips-2001-Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex

18 0.065964572 144 nips-2001-Partially labeled classification with Markov random walks

19 0.064866632 34 nips-2001-Analog Soft-Pattern-Matching Classifier using Floating-Gate MOS Technology

20 0.064384997 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.215), (1, -0.104), (2, -0.101), (3, -0.003), (4, -0.017), (5, 0.006), (6, -0.178), (7, 0.007), (8, 0.01), (9, -0.018), (10, 0.009), (11, 0.056), (12, 0.132), (13, 0.072), (14, 0.039), (15, 0.016), (16, -0.007), (17, 0.006), (18, 0.097), (19, -0.023), (20, 0.037), (21, 0.166), (22, -0.024), (23, 0.089), (24, -0.026), (25, 0.097), (26, 0.134), (27, -0.037), (28, 0.016), (29, -0.047), (30, -0.145), (31, -0.018), (32, -0.046), (33, 0.218), (34, 0.047), (35, 0.23), (36, 0.056), (37, 0.232), (38, -0.003), (39, 0.003), (40, 0.053), (41, -0.027), (42, 0.065), (43, 0.018), (44, 0.004), (45, -0.007), (46, -0.015), (47, 0.126), (48, -0.075), (49, 0.104)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95378911 111 nips-2001-Learning Lateral Interactions for Feature Binding and Sensory Segmentation

Author: Heiko Wersing

Abstract: We present a new approach to the supervised learning of lateral interactions for the competitive layer model (CLM) dynamic feature binding architecture. The method is based on consistency conditions, which were recently shown to characterize the attractor states of this linear threshold recurrent network. For a given set of training examples the learning problem is formulated as a convex quadratic optimization problem in the lateral interaction weights. An efficient dimension reduction of the learning problem can be achieved by using a linear superposition of basis interactions. We show the successful application of the method to a medical image segmentation problem of fluorescence microscope cell images.

2 0.71142662 89 nips-2001-Grouping with Bias

Author: Stella X. Yu, Jianbo Shi

Abstract: With the optimization of pattern discrimination as a goal, graph partitioning approaches often lack the capability to integrate prior knowledge to guide grouping. In this paper, we consider priors from unitary generative models, partially labeled data and spatial attention. These priors are modelled as constraints in the solution space. By imposing uniformity condition on the constraints, we restrict the feasible space to one of smooth solutions. A subspace projection method is developed to solve this constrained eigenproblema We demonstrate that simple priors can greatly improve image segmentation results. 1

3 0.48630208 12 nips-2001-A Model of the Phonological Loop: Generalization and Binding

Author: Randall C. O'Reilly, R. Soto

Abstract: We present a neural network model that shows how the prefrontal cortex, interacting with the basal ganglia, can maintain a sequence of phonological information in activation-based working memory (i.e., the phonological loop). The primary function of this phonological loop may be to transiently encode arbitrary bindings of information necessary for tasks - the combinatorial expressive power of language enables very flexible binding of essentially arbitrary pieces of information. Our model takes advantage of the closed-class nature of phonemes, which allows different neural representations of all possible phonemes at each sequential position to be encoded. To make this work, we suggest that the basal ganglia provide a region-specific update signal that allocates phonemes to the appropriate sequential coding slot. To demonstrate that flexible, arbitrary binding of novel sequences can be supported by this mechanism, we show that the model can generalize to novel sequences after moderate amounts of training. 1

4 0.47112116 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

Author: Jens Kohlmorgen, Steven Lemm

Abstract: We propose a novel method for the analysis of sequential data that exhibits an inherent mode switching. In particular, the data might be a non-stationary time series from a dynamical system that switches between multiple operating modes. Unlike other approaches, our method processes the data incrementally and without any training of internal parameters. We use an HMM with a dynamically changing number of states and an on-line variant of the Viterbi algorithm that performs an unsupervised segmentation and classification of the data on-the-fly, i.e. the method is able to process incoming data in real-time. The main idea of the approach is to track and segment changes of the probability density of the data in a sliding window on the incoming data stream. The usefulness of the algorithm is demonstrated by an application to a switching dynamical system. 1

5 0.46463162 191 nips-2001-Transform-invariant Image Decomposition with Similarity Templates

Author: Chris Stauffer, Erik Miller, Kinh Tieu

Abstract: Recent work has shown impressive transform-invariant modeling and clustering for sets of images of objects with similar appearance. We seek to expand these capabilities to sets of images of an object class that show considerable variation across individual instances (e.g. pedestrian images) using a representation based on pixel-wise similarities, similarity templates. Because of its invariance to the colors of particular components of an object, this representation enables detection of instances of an object class and enables alignment of those instances. Further, this model implicitly represents the regions of color regularity in the class-specific image set enabling a decomposition of that object class into component regions. 1

6 0.40806186 19 nips-2001-A Rotation and Translation Invariant Discrete Saliency Network

7 0.3892912 182 nips-2001-The Fidelity of Local Ordinal Encoding

8 0.3882629 80 nips-2001-Generalizable Relational Binding from Coarse-coded Distributed Representations

9 0.35981333 14 nips-2001-A Neural Oscillator Model of Auditory Selective Attention

10 0.35613516 82 nips-2001-Generating velocity tuning by asymmetric recurrent connections

11 0.34083173 34 nips-2001-Analog Soft-Pattern-Matching Classifier using Floating-Gate MOS Technology

12 0.34078699 176 nips-2001-Stochastic Mixed-Signal VLSI Architecture for High-Dimensional Kernel Machines

13 0.32273853 26 nips-2001-Active Portfolio-Management based on Error Correction Neural Networks

14 0.31274486 177 nips-2001-Switch Packet Arbitration via Queue-Learning

15 0.31135148 109 nips-2001-Learning Discriminative Feature Transforms to Low Dimensions in Low Dimensions

16 0.30327621 125 nips-2001-Modularity in the motor system: decomposition of muscle patterns as combinations of time-varying synergies

17 0.29933277 87 nips-2001-Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway

18 0.2992112 144 nips-2001-Partially labeled classification with Markov random walks

19 0.29805687 23 nips-2001-A theory of neural integration in the head-direction system

20 0.28955185 141 nips-2001-Orientation-Selective aVLSI Spiking Neurons


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(14, 0.025), (17, 0.011), (19, 0.017), (27, 0.086), (30, 0.064), (36, 0.011), (38, 0.037), (59, 0.025), (72, 0.04), (74, 0.018), (79, 0.022), (83, 0.017), (91, 0.552)]
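The topic weights above are sparse (topicId, topicWeight) vectors. A hypothetical sketch of how similarity scores like the simValue column could be derived from such vectors via cosine similarity (the mixtures for paper_b and paper_c are made up; this is not necessarily the site's actual pipeline):

```python
from math import sqrt

def cosine(u, v):
    # u, v: sparse vectors as dicts mapping topicId -> topicWeight
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(w * w for w in u.values()))
    nv = sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

paper_a = {27: 0.086, 30: 0.064, 91: 0.552}  # truncated mixture for this paper
paper_b = {14: 0.020, 27: 0.050, 91: 0.600}  # hypothetical paper on similar topics
paper_c = {3: 0.400, 55: 0.300}              # hypothetical unrelated paper

print(round(cosine(paper_a, paper_b), 3), cosine(paper_a, paper_c))
```

Papers sharing dominant topics (here topic 91) score close to 1; papers with disjoint topic support score 0.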

similar papers list:

simIndex simValue paperId paperTitle

1 0.99548477 87 nips-2001-Group Redundancy Measures Reveal Redundancy Reduction in the Auditory Pathway

Author: Gal Chechik, Amir Globerson, M. J. Anderson, E. D. Young, Israel Nelken, Naftali Tishby

Abstract: The way groups of auditory neurons interact to code acoustic information is investigated using an information theoretic approach. We develop measures of redundancy among groups of neurons, and apply them to the study of collaborative coding efficiency in two processing stations in the auditory pathway: the inferior colliculus (IC) and the primary auditory cortex (AI). Under two schemes for the coding of the acoustic content, acoustic segments coding and stimulus identity coding, we show differences both in information content and group redundancies between IC and AI neurons. These results provide, for the first time, direct evidence for redundancy reduction along the ascending auditory pathway, as has been hypothesized on theoretical grounds [Barlow 1959, 2001]. The redundancy effects under the single-spikes coding scheme are significant only for groups larger than ten cells, and cannot be revealed with the redundancy measures that use only pairs of cells. The results suggest that the auditory system transforms low level representations that contain redundancies due to the statistical structure of natural stimuli into a representation in which cortical neurons extract rare and independent components of complex acoustic signals that are useful for auditory scene analysis.
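The group-redundancy idea can be illustrated with a toy computation (a sketch, not the paper's estimator, which must also handle limited-sampling bias): redundancy is the sum of the single-cell stimulus informations minus the information the group carries jointly.

```python
import random
from math import log2
from collections import Counter

def mutual_info(pairs):
    # empirical mutual information (in bits) between paired samples
    n = len(pairs)
    pj = Counter(pairs)
    pr = Counter(r for r, _ in pairs)
    ps = Counter(s for _, s in pairs)
    return sum((c / n) * log2((c / n) / ((pr[r] / n) * (ps[s] / n)))
               for (r, s), c in pj.items())

def group_redundancy(responses, stim):
    # sum of single-cell informations minus the joint group information
    joint = list(zip(zip(*responses), stim))
    single = sum(mutual_info(list(zip(r, stim))) for r in responses)
    return single - mutual_info(joint)

random.seed(1)
stim = [0, 0, 1, 1] * 25                      # balanced binary "stimulus"
noise = [random.randint(0, 1) for _ in stim]  # a cell unrelated to the stimulus

redundant = group_redundancy([list(stim), list(stim)], stim)  # two identical coders
independent = group_redundancy([list(stim), noise], stim)     # coder + noise cell
print(round(redundant, 3), round(independent, 3))
```

Two cells that copy each other yield one full bit of redundancy, while a coder paired with an uninformative cell yields roughly zero; redundancy reduction along a pathway would show up as this quantity shrinking.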

2 0.99370509 18 nips-2001-A Rational Analysis of Cognitive Control in a Speeded Discrimination Task

Author: Michael C. Mozer, Michael D. Colagrosso, David E. Huber

Abstract: We are interested in the mechanisms by which individuals monitor and adjust their performance of simple cognitive tasks. We model a speeded discrimination task in which individuals are asked to classify a sequence of stimuli (Jones & Braver, 2001). Response conflict arises when one stimulus class is infrequent relative to another, resulting in more errors and slower reaction times for the infrequent class. How do control processes modulate behavior based on the relative class frequencies? We explain performance from a rational perspective that casts the goal of individuals as minimizing a cost that depends both on error rate and reaction time. With two additional assumptions of rationality—that class prior probabilities are accurately estimated and that inference is optimal subject to limitations on rate of information transmission—we obtain a good fit to overall RT and error data, as well as trial-by-trial variations in performance. Consider the following scenario: While driving, you approach an intersection at which the traffic light has already turned yellow, signaling that it is about to turn red. You also notice that a car is approaching you rapidly from behind, with no indication of slowing. Should you stop or speed through the intersection? The decision is difficult due to the presence of two conflicting signals. Such response conflict can be produced in a psychological laboratory as well. For example, Stroop (1935) asked individuals to name the color of ink on which a word is printed. When the words are color names incongruous with the ink color— e.g., “blue” printed in red—reaction times are slower and error rates are higher. We are interested in the control mechanisms underlying performance of high-conflict tasks. Conflict requires individuals to monitor and adjust their behavior, possibly responding more slowly if errors are too frequent. 
In this paper, we model a speeded discrimination paradigm in which individuals are asked to classify a sequence of stimuli (Jones & Braver, 2001). The stimuli are letters of the alphabet, A–Z, presented in rapid succession. In a choice task, individuals are asked to press one response key if the letter is an X or another response key for any letter other than X (as a shorthand, we will refer to non-X stimuli as Y). In a go/no-go task, individuals are asked to press a response key when X is presented and to make no response otherwise. We address both tasks because they elicit slightly different decision-making behavior. In both tasks, Jones and Braver (2001) manipulated the relative frequency of the X and Y stimuli; the ratio of presentation frequency was either 17:83, 50:50, or 83:17. Response conflict arises when the two stimulus classes are unbalanced in frequency, resulting in more errors and slower reaction times. For example, when X’s are frequent but Y is presented, individuals are predisposed toward producing the X response, and this predisposition must be overcome by the perceptual evidence from the Y. Jones and Braver (2001) also performed an fMRI study of this task and found that anterior cingulate cortex (ACC) becomes activated in situations involving response conflict. Specifically, when one stimulus occurs infrequently relative to the other, event-related fMRI response in the ACC is greater for the low frequency stimulus. Jones and Braver also extended a neural network model of Botvinick, Braver, Barch, Carter, and Cohen (2001) to account for human performance in the two discrimination tasks. The heart of the model is a mechanism that monitors conflict—the posited role of the ACC—and adjusts response biases accordingly. In this paper, we develop a parsimonious alternative account of the role of the ACC and of how control processes modulate behavior when response conflict arises. 
1 A RATIONAL ANALYSIS

Our account is based on a rational analysis of human cognition, which views cognitive processes as being optimized with respect to certain task-related goals, and being adaptive to the structure of the environment (Anderson, 1990). We make three assumptions of rationality: (1) perceptual inference is optimal but is subject to rate limitations on information transmission, (2) response class prior probabilities are accurately estimated, and (3) the goal of individuals is to minimize a cost that depends both on error rate and reaction time. The heart of our account is an existing probabilistic model that explains a variety of facilitation effects that arise from long-term repetition priming (Colagrosso, in preparation; Mozer, Colagrosso, & Huber, 2000), and more broadly, that addresses changes in the nature of information transmission in neocortex due to experience. We give a brief overview of this model; the details are not essential for the present work. The model posits that neocortex can be characterized by a collection of information-processing pathways, and any act of cognition involves coordination among pathways. To model a simple discrimination task, we might suppose a perceptual pathway to map the visual input to a semantic representation, and a response pathway to map the semantic representation to a response. The choice and go/no-go tasks described earlier share a perceptual pathway, but require different response pathways. The model is framed in terms of probability theory: pathway inputs and outputs are random variables and microinference in a pathway is carried out by Bayesian belief revision. To elaborate, consider a pathway whose input at time t is a discrete random variable, denoted X(t), which can assume values corresponding to the alternative input states. Similarly, the output of the pathway at time t is a discrete random variable, denoted Y(t), which can assume values corresponding to the alternative output states.
For example, the input to the perceptual pathway in the discrimination task is one of the visual patterns corresponding to the letters of the alphabet, and the output is one of the letter identities. (This model is highly abstract: the visual patterns are enumerated, but the actual pixel patterns are not explicitly represented in the model. Nonetheless, the similarity structure among inputs can be captured, but we skip a discussion of this issue because it is irrelevant for the current work.) To present a particular input alternative i to the model for some duration, we clamp the pathway input to i for that many time steps. The model then computes a probability distribution over the pathway output given the clamped input, i.e., the posterior probability of each output alternative.
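The flavor of such a rational account can be illustrated with a hedged sketch (not the authors' actual model, and with made-up parameters): sequential Bayesian updating from an unequal class prior already reproduces the qualitative effect that the rare class is answered more slowly and less accurately.

```python
import random

random.seed(2)

def trial(true_class, prior_x, rel=0.65, conf=0.95, max_steps=500):
    # Sequential Bayesian updating: each noisy sample favours the true class
    # with probability `rel`; respond once the posterior clears `conf`.
    p = prior_x  # posterior probability of class X
    for step in range(1, max_steps + 1):
        sample_says_x = (random.random() < rel) == (true_class == 'X')
        like_x = rel if sample_says_x else 1.0 - rel
        like_y = 1.0 - rel if sample_says_x else rel
        p = p * like_x / (p * like_x + (1.0 - p) * like_y)
        if p > conf:
            return step, true_class == 'X'
        if p < 1.0 - conf:
            return step, true_class == 'Y'
    return max_steps, (p > 0.5) == (true_class == 'X')

def summarize(true_class, prior_x, n=2000):
    rts, correct = [], 0
    for _ in range(n):
        rt, ok = trial(true_class, prior_x)
        rts.append(rt)
        correct += ok
    return sum(rts) / n, correct / n

# Condition in which X is the rare class (17:83), as in the task described above
rt_rare, acc_rare = summarize('X', prior_x=0.17)
rt_freq, acc_freq = summarize('Y', prior_x=0.17)
print(rt_rare > rt_freq, acc_rare < acc_freq)  # rare class: slower, more errors
```

The rare class starts far from its decision threshold and close to the wrong one, so it takes more evidence samples (longer RT) and fails more often, mirroring the response-conflict pattern.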

3 0.99282247 189 nips-2001-The g Factor: Relating Distributions on Features to Distributions on Images

Author: James M. Coughlan, Alan L. Yuille

Abstract: We describe the g-factor, which relates probability distributions on image features to distributions on the images themselves. The g-factor depends only on our choice of features and lattice quantization and is independent of the training image data. We illustrate the importance of the g-factor by analyzing how the parameters of Markov Random Field (i.e. Gibbs or log-linear) probability models of images are learned from data by maximum likelihood estimation. In particular, we study homogeneous MRF models which learn image distributions in terms of clique potentials corresponding to feature histogram statistics (cf. Minimax Entropy Learning (MEL) by Zhu, Wu and Mumford 1997 [11]). We first use our analysis of the g-factor to determine when the clique potentials decouple for different features. Second, we show that clique potentials can be computed analytically by approximating the g-factor. Third, we demonstrate a connection between this approximation and the Generalized Iterative Scaling algorithm (GIS), due to Darroch and Ratcliff 1972 [2], for calculating potentials. This connection enables us to use GIS to improve our multinomial approximation, using Bethe-Kikuchi [8] approximations to simplify the GIS procedure. We support our analysis by computer simulations.

4 0.99281704 148 nips-2001-Predictive Representations of State

Author: Michael L. Littman, Richard S. Sutton

Abstract: We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model. In predicting or controlling a sequence of observations, the concepts of state and state estimation inevitably arise. There have been two dominant approaches. The generative-model approach, typified by research on partially observable Markov decision processes (POMDPs), hypothesizes a structure for generating observations and estimates its state and state dynamics. The history-based approach, typified by k-order Markov methods, uses simple functions of past observations as state, that is, as the immediate basis for prediction and control. (The data flow in these two approaches is diagrammed in Figure 1.) Of the two, the generative-model approach is more general. The model's internal state gives it temporally unlimited memory: the ability to remember an event that happened arbitrarily long ago. A history-based approach, by contrast, can only remember as far back as its history extends. The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. Most uses of POMDPs, for example, assume a perfect dynamics model and attempt only to estimate state.
There are algorithms for simultaneously estimating state and dynamics (e.g., Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al., 1970), but these are only effective at tuning parameters that are already approximately correct (e.g., Shatkay & Kaelbling, 1997).

[Figure 1: Data flow in (a) POMDP and other recursive updating of state representation, and (b) history-based state representation.]

In practice, history-based approaches are often much more effective. Here, the state representation is a relatively simple record of the stream of past actions and observations. It might record the occurrence of a specific subsequence or that one event has occurred more recently than another. Such representations are far more closely linked to the data than are POMDP representations. One way of saying this is that POMDP learning algorithms encounter many local minima and saddle points because all their states are equipotential. History-based systems immediately break symmetry, and their direct learning procedure makes them comparably simple. McCallum (1995) has shown in a number of examples that sophisticated history-based methods can be effective in large problems, and are often more practical than POMDP methods even in small ones. The predictive state representation (PSR) approach, which we develop in this paper, is like the generative-model approach in that it updates the state representation recursively, as in Figure 1(a), rather than directly computing it from data. We show that this enables it to attain generality and compactness at least equal to that of the generative-model approach. However, the PSR approach is also like the history-based approach in that its representations are grounded in data.
Whereas a history-based representation looks to the past and records what did happen, a PSR looks to the future and represents what will happen. In particular, a PSR is a vector of predictions for a specially selected set of action-observation sequences, called tests (after Rivest & Schapire, 1994). For example, consider the test u_1 o_1 u_2 o_2, where u_1 and u_2 are specific actions and o_1 and o_2 are specific observations. The correct prediction for this test given the data stream up to time k is the probability of its observations occurring (in order) given that its actions are taken (in order), i.e., Pr{O_k = o_1, O_{k+1} = o_2 | A_k = u_1, A_{k+1} = u_2}. Each test is a kind of experiment that could be performed to tell us something about the system. If we knew the outcome of all possible tests, then we would know everything there is to know about the system. A PSR is a set of tests that is sufficient information to determine the prediction for all possible tests (a sufficient statistic). As an example of these points, consider the float/reset problem (Figure 2), consisting of a linear string of 5 states with a distinguished reset state on the far right. One action, f (float), causes the system to move uniformly at random to the right or left by one state, bounded at the two ends. The other action, r (reset), causes a jump to the reset state irrespective of the current state. The observation is always 0 unless the r action is taken when the system is already in the reset state, in which case the observation is 1. Thus, on an f action, the correct prediction is always 0, whereas on an r action, the correct prediction depends on how many f's there have been since the last r: for zero f's, it is 1; for one or two f's, it is 0.5; for three or four f's, it is 0.375; for five or six f's, it is 0.3125, and so on, decreasing after every second f and asymptotically bottoming out at 0.2. No k-order Markov method can model this system exactly, because no limited-length history is a sufficient statistic.

[Figure 2: Underlying dynamics of the float/reset problem for (a) the float action and (b) the reset action. The numbers on the arcs indicate transition probabilities. The observation is always 0 except on the reset action from the rightmost state, which produces an observation of 1.]

A POMDP approach can model it exactly by maintaining a belief-state representation over five or so states. A PSR, on the other hand, can exactly model the float/reset system using just two tests: r1 and f0r1. Starting from the rightmost state, the correct predictions for these two tests are always two successive probabilities in the sequence given above (1, 0.5, 0.5, 0.375, ...), which is always a sufficient statistic to predict the next pair in the sequence. Although this informational analysis indicates a solution is possible in principle, it would require a nonlinear updating process for the PSR. In this paper we restrict consideration to a linear special case of PSRs, for which we can guarantee that the number of tests needed does not exceed the number of states in the minimal POMDP representation (although we have not ruled out the possibility that it can be considerably smaller). Of greater ultimate interest are the prospects for learning PSRs and their update functions, about which we can only speculate at this time. The difficulty of learning POMDP structures without good prior models is well known. To the extent that this difficulty is due to the indirect link between the POMDP states and the data, predictive representations may be able to do better. Jaeger (2000) introduced the idea of predictive representations as an alternative to belief states in hidden Markov models and provided a learning procedure for these models. We build on his work by treating the control case (with actions), which he did not significantly analyze.
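The prediction sequence for the float/reset problem can be checked by direct computation on the five-state chain. A minimal sketch, taking "bounded at the two ends" to mean the walk stays put with probability 0.5 at an end state (one plausible reading of Figure 2), with state 4 as the reset state:

```python
def float_step(p):
    # one 'float' action on the 5-state chain; ends stay put with prob 0.5
    q = [0.0] * 5
    for i, mass in enumerate(p):
        q[max(i - 1, 0)] += 0.5 * mass
        q[min(i + 1, 4)] += 0.5 * mass
    return q

def reset_prediction(num_floats):
    # P(observe 1 on 'reset' | started in reset state, then num_floats floats)
    p = [0.0, 0.0, 0.0, 0.0, 1.0]  # all mass on the reset state (state 4)
    for _ in range(num_floats):
        p = float_step(p)
    return p[4]  # obs is 1 iff 'reset' is taken while already in state 4

preds = [reset_prediction(k) for k in range(7)]
print(preds)  # [1.0, 0.5, 0.5, 0.375, 0.375, 0.3125, 0.3125]
```

The values reproduce the sequence quoted above (1, 0.5, 0.5, 0.375, 0.375, 0.3125, ...), dropping after every second float and tending toward the uniform stationary value of 0.2.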
We have also been strongly influenced by the work of Rivest and Schapire (1994), who did consider tests including actions, but treated only the deterministic case, which is significantly different. They also explored construction and learning algorithms for discovering system structure.

1 Predictive State Representations

We consider dynamical systems that accept actions from a discrete set A and generate observations from a discrete set O. We consider only predicting the system, not controlling it, so we do not designate an explicit reward observation. We refer to such a system as an environment. We use the term history to denote a test forming an initial stream of experience and characterize an environment by a probability distribution over all possible histories, P : {O|A}* → [0,1], where P(o_1 ··· o_t | a_1 ··· a_t) is the probability of observations o_1, ..., o_t being generated, in that order, given that actions a_1, ..., a_t are taken, in that order. The probability of a test t conditional on a history h is defined as P(t|h) = P(ht)/P(h). Given a set of q tests Q = {t_i}, we define their (1 × q) prediction vector, p(h) = [P(t_1|h), P(t_2|h), ..., P(t_q|h)], as a predictive state representation (PSR) if and only if it forms a sufficient statistic for the environment, i.e., if and only if P(t|h) = f_t(p(h)) (1) for any test t and history h, and for some projection function f_t : [0,1]^q → [0,1]. In this paper we focus on linear PSRs, for which the projection functions are linear, that is, for which there exists a (1 × q) projection vector m_t, for every test t, such that P(t|h) = f_t(p(h)) = p(h) m_t^T (2) for all histories h. Let p_i(h) denote the ith component of the prediction vector for some PSR. This can be updated recursively, given a new action-observation pair a, o, by p_i(hao) = P(t_i|hao) = P(o t_i | ha) / P(o | ha) = f_{aot_i}(p(h)) / f_{ao}(p(h)) = p(h) m_{aot_i}^T / p(h) m_{ao}^T (3) where the last step is specific to linear PSRs.
We can now state our main result:

Theorem 1. For any environment that can be represented by a finite POMDP model, there exists a linear PSR with number of tests no larger than the number of states in the minimal POMDP model.

2 Proof of Theorem 1: Constructing a PSR from a POMDP

We prove Theorem 1 by showing that for any POMDP model of the environment, we can construct in polynomial time a linear PSR for that POMDP of lesser or equal complexity that produces the same probability distribution over histories as the POMDP model. We proceed in three steps. First, we review POMDP models and how they assign probabilities to tests. Next, we define an algorithm that takes an n-state POMDP model and produces a set of n or fewer tests, each of length less than or equal to n. Finally, we show that the set of tests constitute a PSR for the POMDP, that is, that there are projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. A POMDP (Lovejoy, 1991; Kaelbling et al., 1998) is defined by a sextuple (S, A, O, b_0, T, O). Here, S is a set of n underlying (hidden) states, A is a discrete set of actions, and O is a discrete set of observations. The (1 × n) vector b_0 is an initial state distribution. The set T consists of (n × n) transition matrices T^a, one for each action a, where T^a_{ij} is the probability of a transition from state i to j when action a is chosen. The set O consists of diagonal (n × n) observation matrices O^{a,o}, one for each pair of observation o and action a, where O^{a,o}_{ii} is the probability of observation o when action a is selected and state i is reached. The state representation in a POMDP (Figure 1(a)) is the belief state: the (1 × n) vector b(h) of state-occupation probabilities given the history h. It can be computed recursively, given a new action a and observation o, by b(hao) = b(h) T^a O^{a,o} / (b(h) T^a O^{a,o} e_n^T), where e_n is the (1 × n) vector of all 1s.
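The belief-state update just given is straightforward to implement. A sketch (an illustration, not code from the paper) that applies it to the float action of the float/reset problem; on a float the observation is always 0 and hence uninformative, so the update reduces to the transition dynamics:

```python
def belief_update(b, T, O_diag):
    # b(hao) = b(h) T^a O^{a,o} / (b(h) T^a O^{a,o} e_n^T)
    # T is the (n x n) transition matrix; O_diag the diagonal of O^{a,o}
    n = len(b)
    v = [sum(b[i] * T[i][j] for i in range(n)) * O_diag[j] for j in range(n)]
    z = sum(v)  # normalizer: b(h) T^a O^{a,o} e_n^T
    return [x / z for x in v]

# Float action of the float/reset problem: end states stay put with prob 0.5
T_f = [[0.0] * 5 for _ in range(5)]
for i in range(5):
    T_f[i][max(i - 1, 0)] += 0.5
    T_f[i][min(i + 1, 4)] += 0.5
O_f0 = [1.0] * 5  # 'float' always yields observation 0, so obs is uninformative

b = [0.0, 0.0, 0.0, 0.0, 1.0]  # start in the reset state (state 4)
for _ in range(3):
    b = belief_update(b, T_f, O_f0)
print(b[4])  # 0.375 -- the reset prediction after three floats
```

Note that a fully faithful 5-state encoding of float/reset needs some care, because the reset observation depends on the state the reset was taken from, while this formulation conditions the observation on the state reached; the float dynamics shown here sidestep that issue.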
Finally, a POMDP defines a probability distribution over tests (and thus histories) by P(o_1 ··· o_l | h a_1 ··· a_l) = b(h) T^{a_1} O^{a_1,o_1} ··· T^{a_l} O^{a_l,o_l} e_n^T. (4) (There are many equivalent formulations, and the conversion procedure described here can be easily modified to accommodate other POMDP definitions.) We now present our algorithm for constructing a PSR for a given POMDP. It uses a function u mapping tests to (1 × n) vectors, defined recursively by u(ε) = e_n and u(aot) = (T^a O^{a,o} u(t)^T)^T, where ε represents the null test. Conceptually, the components of u(t) are the probabilities of the test t when applied from each underlying state of the POMDP; we call u(t) the outcome vector for test t. We say a test t is linearly independent of a set of tests S if its outcome vector is linearly independent of the set of outcome vectors of the tests in S. Our algorithm, search, is invoked as Q ← search(ε, {}) and defined as:

search(t, S):
    for each a ∈ A, o ∈ O:
        if aot is linearly independent of S, then S ← search(aot, S ∪ {aot})
    return S

The algorithm maintains a set of tests and searches for new tests that are linearly independent of those already found. It is a form of depth-first search. The algorithm halts when it checks all the one-step extensions of its tests and finds none that are linearly independent. Because the set of tests Q returned by search have linearly independent outcome vectors, the cardinality of Q is bounded by n, ensuring that the algorithm halts after a polynomial number of iterations. Because each test in Q is formed by a one-step extension to some other test in Q, no test is longer than n action-observation pairs. The check for linear independence can be performed in many ways, including Gaussian elimination, implying that search terminates in polynomial time. By construction, all one-step extensions to the set of tests Q returned by search are linearly dependent on those in Q. We now show that this is true for any test.
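A sketch of the search procedure (an illustrative reimplementation, not the authors' code), using Gaussian elimination for the linear-independence check and a small random POMDP with made-up numbers; by the rank argument above it can return at most n tests:

```python
import random

random.seed(3)

def matrix_rank(rows, tol=1e-9):
    # rank via Gaussian elimination on copies of the row vectors
    rows = [list(r) for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        piv = next((i for i in range(rank, len(rows))
                    if abs(rows[i][col]) > tol), None)
        if piv is None:
            continue
        rows[rank], rows[piv] = rows[piv], rows[rank]
        for i in range(len(rows)):
            if i != rank and abs(rows[i][col]) > tol:
                f = rows[i][col] / rows[rank][col]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[rank])]
        rank += 1
    return rank

def extend(T, O, a, o, u_t):
    # outcome vector u(aot) = T^a O^{a,o} u(t)
    n = len(u_t)
    return [sum(T[a][i][j] * O[a][o][j] * u_t[j] for j in range(n))
            for i in range(n)]

def search(T, O, n_states):
    found = []                         # list of (test, outcome vector)
    stack = [((), [1.0] * n_states)]   # start from the null test, u(eps) = e_n
    while stack:
        test, u_t = stack.pop()        # depth-first expansion
        for a in range(len(T)):
            for o in range(len(O[a])):
                uv = extend(T, O, a, o, u_t)
                if matrix_rank([v for _, v in found] + [uv]) > len(found):
                    new = (((a, o),) + test, uv)
                    found.append(new)
                    stack.append(new)
    return found

# A random 3-state, 2-action, 2-observation POMDP (hypothetical numbers)
n, nA, nO = 3, 2, 2
T = [[[0.0] * n for _ in range(n)] for _ in range(nA)]
O = [[[0.0] * n for _ in range(nO)] for _ in range(nA)]
for a in range(nA):
    for i in range(n):
        w = [random.random() for _ in range(n)]
        s = sum(w)
        for j in range(n):
            T[a][i][j] = w[j] / s
    for j in range(n):
        p = random.random()
        O[a][0][j], O[a][1][j] = p, 1.0 - p

Q = search(T, O, n)
print(len(Q))  # at most n = 3 linearly independent tests, as Theorem 1 requires
```

Since the tests kept in `found` always have linearly independent outcome vectors, a candidate is accepted exactly when appending it raises the rank, and the loop must terminate once the rank reaches n.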
Lemma 1. The outcome vectors of the tests in Q can be linearly combined to produce the outcome vector for any test.

Proof: Let U be the (n × q) matrix formed by concatenating the outcome vectors for all tests in Q. Since, for all combinations of a and o, the columns of T^a O^{a,o} U are linearly dependent on the columns of U, we can write T^a O^{a,o} U = U W^T for some (q × q) matrix of weights W. If t is a test that is linearly dependent on Q, then any one-step extension of t, aot, is linearly dependent on Q. This is because we can write the outcome vector for t as u(t) = (U w^T)^T for some (1 × q) weight vector w, and the outcome vector for aot as u(aot) = (T^a O^{a,o} u(t)^T)^T = (T^a O^{a,o} U w^T)^T = (U W^T w^T)^T. Thus, aot is linearly dependent on Q. Now, note that all one-step tests are linearly dependent on Q by the structure of the search algorithm. Using the previous paragraph as an inductive argument, this implies that all tests are linearly dependent on Q. □

Returning to the float/reset example POMDP, search begins by enumerating the 4 extensions to the null test (f0, f1, r0, and r1). Of these, only f0 and r0 are linearly independent. Of the extensions of these, f0r0 is the only one that is linearly independent of the other two. The remaining two tests added to Q by search are f0f0r0 and f0f0f0r0. No extensions of the 5 tests in Q are linearly independent of the 5 tests in Q, so the procedure halts. We now show that the set of tests Q constitute a PSR for the POMDP by constructing projection vectors that, together with the tests' predictions, produce the same probability distribution over histories as the POMDP. For each combination of a and o, define a (q × q) matrix M_{ao} = (U^+ T^a O^{a,o} U)^T and a (1 × q) vector m_{ao} = (U^+ T^a O^{a,o} e_n^T)^T, where U is the matrix of outcome vectors defined in the previous section and U^+ is its pseudoinverse. The ith row of M_{ao} is m_{aot_i}.
The probability distribution on histories implied by these projection vectors is p(h) M_{a_1 o_1}^T ··· M_{a_{l-1} o_{l-1}}^T m_{a_l o_l}^T = b(h) U U^+ T^{a_1} O^{a_1,o_1} U ··· U^+ T^{a_{l-1}} O^{a_{l-1},o_{l-1}} U U^+ T^{a_l} O^{a_l,o_l} e_n^T = b(h) T^{a_1} O^{a_1,o_1} ··· T^{a_{l-1}} O^{a_{l-1},o_{l-1}} T^{a_l} O^{a_l,o_l} e_n^T, i.e., it is the same as that of the POMDP, as in Equation 4. Here, the last step uses the fact that U U^+ v^T = v^T for any v^T linearly dependent on the columns of U. This holds by construction of U in the previous section. This completes the proof of Theorem 1. Completing the float/reset example, consider the M_{f0} matrix found by the process defined in this section. It derives predictions for each test in Q after taking action f. Most of these are quite simple because the tests are so similar: the new prediction for r0 is exactly the old prediction for f0r0, for example. The only nontrivial test is f0f0f0r0. Its outcome can be computed from 0.250 p(r0|h) - 0.0625 p(f0r0|h) + 0.750 p(f0f0r0|h). This example illustrates that the projection vectors need not contain only positive entries.

3 Conclusion

We have introduced a predictive state representation for dynamical systems that is grounded in actions and observations and shown that, even in its linear form, it is at least as general and compact as POMDPs. In essence, we have established PSRs as a non-inferior alternative to POMDPs, and suggested that they might have important advantages, while leaving demonstration of those advantages to future work. We conclude by summarizing the potential advantages (to be explored in future work): Learnability. The k-order Markov model is similar to PSRs in that it is entirely based on actions and observations. Such models can be learned trivially from data by counting; it is an open question whether something similar can be done with a PSR.
Jaeger (2000) showed how to learn such a model in the uncontrolled setting, but the situation is more complex in the multiple action case since outcomes are conditioned on behavior, violating some required independence assumptions. Compactness. We have shown that there exist linear PSRs no more complex than the minimal POMDP for an environment, but in some cases the minimal linear PSR seems to be much smaller. For example, a POMDP extension of factored MDPs explored by Singh and Cohn (1998) would be cross-products of separate POMDPs and have linear PSRs that increase linearly with the number and size of the component POMDPs, whereas their minimal POMDP representation would grow as the size of the state space, i.e., exponentially in the number of component POMDPs. (Footnote: if U = AΣB^T is the singular value decomposition of U, then BΣ^+A^T is the pseudoinverse; the pseudoinverse Σ^+ of the diagonal matrix Σ replaces each non-zero element with its reciprocal.) This (apparent) advantage stems from the PSR's combinatorial or factored structure. As a vector of state variables, capable of taking on diverse values, a PSR may be inherently more powerful than the distribution over discrete states (the belief state) of a POMDP. We have already seen that general PSRs can be more compact than POMDPs; they are also capable of efficiently capturing environments in the diversity representation used by Rivest and Schapire (1994), which is known to provide an extremely compact representation for some environments. Generalization. There are reasons to think that state variables that are themselves predictions may be particularly useful in learning to make other predictions. With so many things to predict, we have in effect a set or sequence of learning problems, all due to the same environment. In many such cases the solutions to earlier problems have been shown to provide features that generalize particularly well to subsequent problems (e.g., Baxter, 2000; Thrun & Pratt, 1998).
Powerful, extensible representations. PSRs that predict tests could be generalized to predict the outcomes of multi-step options (e.g., Sutton et al., 1999). In this case, particularly, they would constitute a powerful language for representing the state of complex environments.

Acknowledgments: We thank Peter Dayan, Lawrence Saul, Fernando Pereira and Rob Schapire for many helpful discussions of these and related ideas.

References

Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164-171.
Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149-198.
Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 183-188). San Jose, California: AAAI Press.
Jaeger, H. (2000). Observable operator models for discrete stochastic time series. Neural Computation, 12, 1371-1398.
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 99-134.
Lovejoy, W. S. (1991). A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research, 28, 47-65.
McCallum, A. K. (1995). Reinforcement learning with selective perception and hidden state. Doctoral dissertation, Department of Computer Science, University of Rochester.
Rivest, R. L., & Schapire, R. E. (1994). Diversity-based inference of finite automata. Journal of the ACM, 41, 555-589.
Shatkay, H., & Kaelbling, L. P. (1997). Learning topological maps with weak local odometric information. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97) (pp. 920-929).
Singh, S., & Cohn, D. (1998).
How to dynamically merge Markov decision processes. Advances in Neural and Information Processing Systems 10 (pp. 1057-1063). Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 181-211. Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Kluwer Academic Publishers.

5 0.98571205 107 nips-2001-Latent Dirichlet Allocation

Author: David M. Blei, Andrew Y. Ng, Michael I. Jordan

Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.
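The generative story in this abstract (topic proportions drawn per document from a Dirichlet, each word drawn from the word distribution of its assigned topic) can be forward-sampled in a few lines. A toy NumPy sketch; the sizes and hyperparameter values are arbitrary assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, N = 3, 20, 50          # topics, vocabulary size, words in one document
alpha = np.full(K, 0.5)      # assumed Dirichlet prior on topic proportions
beta = rng.dirichlet(np.ones(V), size=K)   # one word distribution per topic

theta = rng.dirichlet(alpha)               # this document's latent topic mixture
z = rng.choice(K, size=N, p=theta)         # latent topic assignment per word
words = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed words
```

This is only the forward model; the hard part, recovering `theta` and `beta` from the observed words, is what the paper's variational inference and learning algorithms address.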

same-paper 6 0.98198611 111 nips-2001-Learning Lateral Interactions for Feature Binding and Sensory Segmentation

7 0.96203119 144 nips-2001-Partially labeled classification with Markov random walks

8 0.94448119 66 nips-2001-Efficiency versus Convergence of Boolean Kernels for On-Line Learning Algorithms

9 0.91560745 174 nips-2001-Spike timing and the coding of naturalistic sounds in a central auditory area of songbirds

10 0.88376057 123 nips-2001-Modeling Temporal Structure in Classical Conditioning

11 0.87999815 30 nips-2001-Agglomerative Multivariate Information Bottleneck

12 0.85505271 24 nips-2001-Active Information Retrieval

13 0.84872639 96 nips-2001-Information-Geometric Decomposition in Spike Analysis

14 0.84798479 183 nips-2001-The Infinite Hidden Markov Model

15 0.84623098 68 nips-2001-Entropy and Inference, Revisited

16 0.84600252 100 nips-2001-Iterative Double Clustering for Unsupervised and Semi-Supervised Learning

17 0.83910424 11 nips-2001-A Maximum-Likelihood Approach to Modeling Multisensory Enhancement

18 0.83604348 7 nips-2001-A Dynamic HMM for On-line Segmentation of Sequential Data

19 0.83074665 55 nips-2001-Convergence of Optimistic and Incremental Q-Learning

20 0.82827902 160 nips-2001-Reinforcement Learning and Time Perception -- a Model of Animal Experiments