nips nips2004 nips2004-99 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Andras D. Ferencz, Erik G. Learned-miller, Jitendra Malik
Abstract: We address the problem of identifying specific instances of a class (cars) from a set of images all belonging to that class. Although we cannot build a model for any particular instance (as we may be provided with only one “training” example of it), we can use information extracted from observing other members of the class. We pose this task as a learning problem, in which the learner is given image pairs, labeled as matching or not, and must discover which image features are most consistent for matching instances and discriminative for mismatches. We explore a patch based representation, where we model the distributions of similarity measurements defined on the patches. Finally, we describe an algorithm that selects the most salient patches based on a mutual information criterion. This algorithm performs identification well for our challenging dataset of car images, after matching only a few, well chosen patches. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We pose this task as a learning problem, in which the learner is given image pairs, labeled as matching or not, and must discover which image features are most consistent for matching instances and discriminative for mismatches. [sent-4, score-0.659]
2 We explore a patch based representation, where we model the distributions of similarity measurements defined on the patches. [sent-5, score-0.543]
3 Finally, we describe an algorithm that selects the most salient patches based on a mutual information criterion. [sent-6, score-0.352]
4 This algorithm performs identification well for our challenging dataset of car images, after matching only a few, well chosen patches. [sent-7, score-0.216]
5 1 Introduction Figure 1 shows six cars: the two leftmost cars were captured by one camera; the right four cars were seen later by another camera from a different angle. [sent-8, score-0.564]
6 While object recognition is used loosely for several problems (including this one), we differentiate visual identification, where the challenge is distinguishing between visually similar objects of one category (e.g. [sent-12, score-0.199]
7 faces, cars), and categorization, where the goal is to distinguish between categories rather than between specific instances. Figure 1: The Identification Problem: Which of these cars are the same? [sent-14, score-0.208]
8 The two cars on the left, photographed from camera 1, also drive past camera 2. [sent-15, score-0.38]
9 Which of the four images on the right, taken by camera 2, match the cars on the left? [sent-16, score-0.473]
10 Solving this problem will enable applications such as wide area tracking of cars with a sparse set of cameras [2, 9]. [sent-17, score-0.278]
11 Figure 2: Detecting and warping car images into alignment: Our identification algorithm assumes that a detection process has found members of the class and approximately aligned them to a canonical view. [sent-18, score-0.26]
12 A projective warp to align the sides is computed by calibrating the pose of the camera to the road and finding the wheels of the vehicle. [sent-20, score-0.184]
13 Note that this is only a rough approximation (the two warped images, center and right, are far from perfectly aligned) that helps to simplify our patch descriptors and positional bookkeeping. [sent-21, score-0.554]
14 Identification is also distinct from “object localization,” where the goal is locating a specific object in scenes in which distractors have little similarity to the target object [6]. [sent-23, score-0.162]
15 One characteristic of the identification problem is that the algorithm typically only receives one positive example of each query class (e.g. [sent-24, score-0.138]
16 a single image of a specific car), before having to classify other images as the “same” or “different”. [sent-26, score-0.213]
17 The core idea is to use a training set of other image pairs from the category (in our case cars), labeled as matching or not, to learn what characterizes features that are informative in distinguishing one instance from another (i.e., which features are consistent for matching pairs and discriminative for mismatches). [sent-32, score-0.511]
18 Our algorithm, given a single novel query image, can build a “same” vs. [sent-35, score-0.138]
19 “different” classifier by: (1) examining a set of candidate features (local image patches) on the query image, (2) selecting a small number of them that are likely to be the most informative for this query class, and (3) estimating a function for scoring the match for each selected feature. [sent-36, score-0.732]
20 In Section 2, we describe our decision framework including the decomposition of an image pair into bi-patches, which give local indications of match or mismatch, and introduce the appearance distance between the two halves as a discriminative statistic of bi-patches. [sent-39, score-0.431]
21 This model is then refined in Section 3 by conditioning the distance distributions on hyper-features such as patch location, contrast, and dominant orientation. [sent-40, score-0.542]
22 A patch saliency measure based on the estimated distance distributions is introduced in Section 3.4. [sent-41, score-0.569]
23 In Section 4, we extend our model to include another comparison statistic, the difference in patch position between images. [sent-43, score-0.497]
24 Finally, in Section 5, we conclude and show that comparing a small number of well-chosen patches produces performance nearly as good as matching a dense sampling of them. [sent-44, score-0.411]
25 2 Matching Patches We seek to determine whether a new query image I L (the “Left” image) represents the same vehicle as any of our previously seen database images I R (the “Right” image). [sent-45, score-0.383]
26 We assume that these images are known to contain vehicles, have been brought into rough correspondence (in our data set, through a projective transformation that aligns the sides of the car) and have been scaled to approximately 200 pixels in length (see Figure 2 for details). [sent-46, score-0.194]
27 Figure 3: Patch Matching: The left (query) image is sampled (red dots) by patches encoded as oriented filter channels (for labeled patch 2, this encoding is shown). [sent-49, score-0.965]
28 Each patch is matched to the best point in the database image of the same car by maximizing the appearance similarity between the patches (the similarity score is indicated by the size and color of the dots, where larger and redder is more similar). [sent-50, score-1.255]
29 Although the classification result for this pair of images should be “same” (C = 1), notice that some bi-patches are better predictors of this result than others (the similarity score of 2 & 3 is much better than for patch 1). [sent-52, score-0.62]
30 Our goal is to be able to accurately predict the distributions P (d|C = 1) and P (d|C = 0) for each patch, based on the appearance and position of the patch in the query image (for the 3 patches, our predictions are shown on the right). [sent-53, score-1.368]
31 2.1 Image Patch Features Our strategy is to break up the whole image comparison problem into multiple local matching problems, where we encode a small patch FjL (1 ≤ j ≤ n) of the query image I L and compare each piece separately [12, 14]. [sent-55, score-0.969]
32 Specifically, we apply a first derivative Gaussian odd-symmetric filter to the patch at four orientations (horizontal, vertical, and two diagonal), giving four signed numbers per pixel. [sent-57, score-0.504]
33 To compare a query patch FjL to an area of the right image FjR , we encode both patches as 4 × 25^2 length vectors (4 orientations per pixel) and compute the distance dj = 1 − CorrCoef(FjL , FjR ), one minus the normalized correlation between these vectors. [sent-58, score-1.078]
34 As the two car images are in rough alignment, we need only to search a small area of I R to find the best corresponding patch FjR , i.e., the one with the smallest distance dj . [sent-59, score-0.668]
35 We will refer to such a matched left and right patch pair FjL , FjR , together with the derived distance dj , as a bi-patch Fj . [sent-62, score-0.915]
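To make the bi-patch construction above concrete, here is a minimal sketch (ours, not the authors' code) of the encoding, distance, and local-search steps, assuming 25×25 patches, a small search window, and first-derivative-of-Gaussian filters at four orientations; the helper names (oriented_channels, encode_patch, match_patch) and the filter/window parameters are our own choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def oriented_channels(image, sigma=1.0):
    """Four signed first-derivative-of-Gaussian responses per pixel
    (horizontal, vertical, and two diagonals)."""
    dx = gaussian_filter(image, sigma, order=(0, 1))   # derivative along x
    dy = gaussian_filter(image, sigma, order=(1, 0))   # derivative along y
    d1 = (dx + dy) / np.sqrt(2.0)                      # diagonal
    d2 = (dx - dy) / np.sqrt(2.0)                      # anti-diagonal
    return np.stack([dx, dy, d1, d2], axis=0)          # shape (4, H, W)

def encode_patch(channels, y, x, size=25):
    """Flatten a size x size window of the 4 channels into a 4*size^2 vector."""
    h = size // 2
    return channels[:, y - h:y + h + 1, x - h:x + h + 1].ravel()

def appearance_distance(fL, fR):
    """d = 1 - normalized correlation of the two encoded patches."""
    return 1.0 - np.corrcoef(fL, fR)[0, 1]

def match_patch(chL, chR, y, x, search=8, size=25):
    """Find the right-image patch minimizing d within a small search window,
    exploiting the rough alignment of the two car images."""
    fL = encode_patch(chL, y, x, size)
    best = (np.inf, y, x)
    for oy in range(-search, search + 1):
        for ox in range(-search, search + 1):
            d = appearance_distance(fL, encode_patch(chR, y + oy, x + ox, size))
            if d < best[0]:
                best = (d, y + oy, x + ox)
    return best  # (dj, matched y, matched x); with (y, x) this forms a bi-patch Fj
```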
36 2.2 The Decision Rule We pose the task of deciding whether a database image I R is the same as a query image I L as a decision rule R = P (C = 1|I L , I R ) / P (C = 0|I L , I R ) = [P (I L , I R |C = 1)P (C = 1)] / [P (I L , I R |C = 0)P (C = 0)] > λ (Eq. 1). [sent-64, score-0.446]
37 With our image decomposition into patches, the posteriors from Eq. (1) factor into a product of per-bi-patch terms P (Fj |C) (Eq. 2). [sent-68, score-0.129]
38 (2) In practice, we compute the log of this likelihood ratio, where each patch contributes an additive term (denoted LLRi for patch i). [sent-88, score-0.852]
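A small sketch of how the per-patch terms combine into a decision, assuming that density estimates for P(d|C=1) and P(d|C=0) are already available as callables; the function name and the prior handling below are ours.

```python
import numpy as np

def log_likelihood_ratio(d_values, p_d_same, p_d_diff, log_prior_ratio=0.0):
    """Sum of per-patch LLR terms log P(dj|C=1) - log P(dj|C=0), plus the
    log prior ratio; declare 'same' if the total exceeds log(lambda)."""
    llr = sum(np.log(p_d_same(d)) - np.log(p_d_diff(d)) for d in d_values)
    return llr + log_prior_ratio

# decision: same = log_likelihood_ratio(ds, p1, p0) > np.log(lam)
```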
39 2.3 Uniform Appearance Model The most straightforward way to estimate P (Fj |C) is to assume that the appearance difference dj captures all of the information in Fj about the probability of a match (i.e. [sent-91, score-0.641]
40 C and Fj are independent given dj ), and that all of the dj 's from all patches are identically distributed. [sent-93, score-1.099]
41 We use this log-likelihood ratio (Eq. 3) to classify each test pair as matching or not, producing a precision-recall curve. [sent-98, score-0.121]
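As one concrete (and hedged) reading of the uniform model, the sketch below pools the distances from all training bi-patches of each class and fits a single density per class. We use a gamma fit here purely for convenience; this summary does not specify the density estimator used for the uniform baseline.

```python
import numpy as np
from scipy.stats import gamma

def fit_uniform_model(d_same, d_diff):
    """One shared density per class for every patch (the uniform model):
    pool the d's of all training bi-patches and fit a single gamma each."""
    p1 = gamma(*gamma.fit(np.asarray(d_same), floc=0.0))
    p0 = gamma(*gamma.fit(np.asarray(d_diff), floc=0.0))
    return p1.pdf, p0.pdf   # plug these into log_likelihood_ratio above
```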
42 Figure 4 compares this patch-based model to a direct image comparison method. [sent-99, score-0.129]
43 Figure 4: Identification using appearance differences: The bottom curve shows the precision vs. [sent-101, score-0.276]
44 recall curve for the direct (global) image comparison baseline of footnote 4. Notice that all three patch-based models outperform this method. [sent-104, score-0.426]
45 The three top curves show results for the various models of dj from Sections 2.3, 3.1, and 3.2. [sent-105, score-0.454]
46 3 Refining the Appearance Distributions with Hyper-Features The most significant weakness of the above model is the assumption that the dj 's from different bi-patches should be identically distributed (observe the 3 labeled patches in Figure 3). [sent-111, score-0.340]
47 When a training set of “same” (C = 1) and “different” (C = 0) images is available for a specific query image, estimating these distributions directly for each patch is straightforward. [sent-112, score-0.759]
48 How can we estimate a distribution for P (dj |C = 1), where FjL is a patch from a new query image, when we only have that single positive example of FjL ? [sent-113, score-0.564]
49 The intuitive answer: by finding analogous patches in the training set of labeled (same/different) image pairs. [sent-114, score-0.476]
50 However, since the space of all possible patches (appearance & position, a 25 × 25 + 2 dimensional space) is very large, the chance of having seen a very similar patch to FjL in the training set is small. [sent-115, score-0.748]
51 In the next sections we present two approaches, both of which rely on projecting FjL into a much lower-dimensional space by extracting meaningful features from its position and appearance (the hyper-features). [sent-116, score-0.303]
52 3.1 Non-Parametric Model with Discrete Hyper-Features First we attempted a non-parametric approach, where we model the joint distribution of dj and a few hyper-features (e.g. [sent-118, score-0.392]
53 the x and y coordinates of the patch FjL). Footnote 3: Data consisted of 175 pairs (88 training, 87 test pairs) of matching car images (C=1) from two cameras located on the same side of the street, one block apart. [sent-120, score-0.769]
54 Within training and testing sets, about 4000 pairs of mismatched cars (C=0) were formed from non-corresponding images, one from each camera. [sent-121, score-0.272]
55 Footnote 4: The global image comparison method used here as a baseline uses normalized correlation on a combination of intensity and filter channels, and attempts to overcome slight misalignment. [sent-123, score-0.157]
56 Figure 5: Fitting a GLM to the Γ distribution: we demonstrate our approach by fitting a gamma distribution, through the latent variables Θ = (µ, γ), to the y position of the patches. [sent-124, score-0.134]
57 The center-left square shows, on each row, a distribution of d conditioned on the y position of the left patch (F L ) for each bi-patch, for training data taken from matching vehicles. [sent-128, score-0.705]
58 For reference, we include a partial image of a car whose y-coordinate is aligned with the center images. [sent-132, score-0.277]
59 On the right, we show two histogram plots, each corresponding to one row of the center images (a small range of y corresponding to the black arrows). [sent-133, score-0.141]
60 The resulting gamma distributions are superimposed on the histograms. [sent-134, score-0.142]
61 In this model P (C|Fj ) ≈ P (C|dj , yj , xj ) ∝ P (dj |yj , xj , C)P (yj , xj |C)P (C) ∝ P (dj |yj , xj , C), where the last step follows from the assumption of equal priors (P (C) = 0.5) and from treating the patch position (yj , xj ) as independent of C. [sent-140, score-0.209]
62 The Discrete Hyper-Features curve in Figure 4 shows the performance gain from conditioning on these positional hyper-features. [sent-142, score-0.167]
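A possible implementation of the discrete-hyper-feature model is a table of smoothed histograms of d indexed by class and by binned patch position; the binning scheme and smoothing constant below are our assumptions, not the paper's.

```python
import numpy as np

def discrete_position_model(d, xs, ys, cs, n_bins=5, n_dbins=32):
    """Empirical P(d | x-bin, y-bin, C): a histogram of d for every
    (x, y, class) cell, normalized to a density over d."""
    x_edges = np.quantile(xs, np.linspace(0, 1, n_bins + 1))
    y_edges = np.quantile(ys, np.linspace(0, 1, n_bins + 1))
    d_edges = np.linspace(0.0, 2.0, n_dbins + 1)        # d = 1 - corr lies in [0, 2]
    tables = np.zeros((2, n_bins, n_bins, n_dbins))
    xi = np.clip(np.searchsorted(x_edges, xs) - 1, 0, n_bins - 1)
    yi = np.clip(np.searchsorted(y_edges, ys) - 1, 0, n_bins - 1)
    di = np.clip(np.searchsorted(d_edges, d) - 1, 0, n_dbins - 1)
    for c, a, b, k in zip(cs, xi, yi, di):
        tables[int(c), a, b, k] += 1
    tables += 1e-3                                      # simple smoothing
    tables /= tables.sum(axis=-1, keepdims=True) * np.diff(d_edges)
    return tables, (x_edges, y_edges, d_edges)
```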
63 We then moved to a smooth parametric representation for both the distribution of dj and the model by which the hyper-features influence this distribution. [sent-146, score-0.392]
64 Specifically, we model the distributions P (dj |C = 1) and P (dj |C = 0) as gamma distributions (notated Γ()) parameterized by the mean and shape parameter θ = {µ, γ} (see the right panel of Figure 5 for examples of the Γ() fitting the empirical distributions). [sent-147, score-0.259]
65 Our goal is to fit gamma distributions to the distributions of d values for various patches by maximizing the probability density of data under gamma distributions whose parameters are simple polynomial functions of the hyper-features. [sent-151, score-0.653]
66 The α's are adapted to maximize the joint data likelihood over all patches for C = 0 or C = 1 within the training set. [sent-166, score-0.322]
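The following sketch fits such a GLM by maximizing the gamma likelihood with the mean and shape parameters tied to the hyper-features. We assume an exponential (log) link to keep both parameters positive; that is a common GLM choice of ours rather than something stated in this summary.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma as gamma_dist

def fit_gamma_glm(Z, d):
    """Fit mu(z) = exp(z . a_mu) and shape gamma(z) = exp(z . a_g) by maximum
    likelihood over all training bi-patches of one class (C=0 or C=1).
    Z: (N, K) hyper-feature design matrix (include a constant column);
    d: (N,) appearance distances."""
    N, K = Z.shape

    def neg_log_lik(alpha):
        a_mu, a_g = alpha[:K], alpha[K:]
        mu = np.exp(Z @ a_mu)
        shape = np.exp(Z @ a_g)
        scale = mu / shape                    # gamma mean = shape * scale
        return -np.sum(gamma_dist.logpdf(d, shape, scale=scale))

    res = minimize(neg_log_lik, np.zeros(2 * K), method="L-BFGS-B")
    return res.x[:K], res.x[K:]               # the fitted alpha's
```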
67 Running an automatic feature selection technique on this large set of possible conditioning features gives us a principled method of reducing the complexity of our model. [sent-177, score-0.175]
68 The top curve in Figure 4 shows results when Z includes the first 10 features found by LARS. [sent-179, score-0.127]
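One way to reproduce this selection step is sketched below: expand the raw hyper-features into polynomial terms and let scikit-learn's LARS keep the first few, here regressing directly on d as a proxy objective (the paper's exact selection criterion inside the GLM fit is not reproduced here).

```python
import numpy as np
from sklearn.linear_model import Lars
from sklearn.preprocessing import PolynomialFeatures

def select_hyper_features(raw_features, d, n_keep=10):
    """Expand raw hyper-features (position, contrast, orientation energy, ...)
    into polynomial terms and keep the first n_keep entered by LARS."""
    Z_full = PolynomialFeatures(degree=2, include_bias=False).fit_transform(raw_features)
    lars = Lars(n_nonzero_coefs=n_keep).fit(Z_full, d)
    keep = np.flatnonzero(lars.coef_)
    return Z_full[:, keep], keep
```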
69 3.4 Estimating the Saliency of a Patch From the distributions P (dj |C = 0) and P (dj |C = 1) computed separately for each patch, it is also possible to estimate the saliency of the patch, i.e., how much information its match is expected to provide. [sent-182, score-0.169]
70 Intuitively, if the distribution of Dj is very different for C = 0 and C = 1, then the amount of information gained by matching patch j is likely to be large (see the 3 distributions on the right of Figure 3). [sent-185, score-0.664]
71 To emphasize the fact that the distribution P (dj |C) is a fixed function of FjL , given the learned hyper-feature weights α, we slightly abuse notation and refer to the random variable from which dj is sampled as FjL . [sent-186, score-0.392]
72 With this notation, computing the mutual information between FjL and C gives us a measure of the expected information gain from a patch with particular hyper-features: I(FjL ; C) = H(FjL ) − H(FjL |C). [sent-187, score-0.486]
73 The key fact to notice is that this measure can be computed just from the estimated distributions over dj (which, in turn, were estimated from the hyper-features of FjL ) before the patch has been matched. [sent-189, score-0.938]
74 This allows us to match only those patches that are likely to be informative, leading to significant computational savings. [sent-190, score-0.361]
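A numeric sketch of this saliency score: with the two hyper-feature-conditioned densities in hand and equal priors assumed, the mutual information can be computed by integrating over d before any matching is done. The integration limit reflects d = 1 - corr lying in [0, 2].

```python
import numpy as np
from scipy.integrate import quad

def patch_mutual_information(p1, p0, prior1=0.5, upper=2.0):
    """I(F;C) = H(mixture over d) - [prior1*H(p1) + (1-prior1)*H(p0)],
    where p1(d) = P(d|C=1) and p0(d) = P(d|C=0); computed purely from the
    hyper-feature-conditioned densities, before the patch is matched."""
    prior0 = 1.0 - prior1

    def h(p):  # differential entropy of a density on [0, upper]
        return -quad(lambda d: p(d) * np.log(p(d) + 1e-300), 0.0, upper)[0]

    mix = lambda d: prior1 * p1(d) + prior0 * p0(d)
    return h(mix) - (prior1 * h(p1) + prior0 * h(p0))
```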
75 4 Modeling Appearance and Position Differences In the last section, we only considered the similarity of two matching patches that make up a bi-patch in terms of the appearance of the patches (dj ). [sent-191, score-0.917]
76 Recall that for each left patch FjL , a matching right patch FjR is found by searching for the most similar patch in some large neighborhood around the expected location for the match. [sent-192, score-1.501]
77 In this section, we show how to model the change in position, rj , of the match relative to its expected location, and how this, when combined with the appearance model, improves the matching performance. [sent-193, score-0.526]
78 Finally, performance is improved by combining position with appearance (“Complete” curve) compared to using appearance alone. [sent-202, score-0.427]
79 The CENTER pair of images shows a correct match, with the patch centers indicated by circles. [sent-203, score-0.51]
80 The color of the circles in the top image indicates MIj ; in the bottom image, LLRj . [sent-204, score-0.324]
81 Our patch selection algorithm chooses the top patches based on MI where subsequent patches are penalized for overlapping with earlier ones (neighborhood suppression). [sent-205, score-1.075]
82 The top 10 “left” patches chosen are marked with arrows connecting them to the corresponding “right” patches. [sent-206, score-0.323]
83 The RIGHT plot quantifies this observation: the curves show 3 different methods of choosing the order of patches - random order, MI and MI with neighborhood suppression. [sent-208, score-0.354]
84 Notice that this top curve with 3 patches does as well as the direct comparison method. [sent-209, score-0.363]
85 Let rj = (δxj , δyj ) be the difference in position between the coordinates of FjL and FjR within the standardized coordinate frames. [sent-211, score-0.2]
86 Generally, we expect rj ≈ 0 if the two images portray the same object (C = 1). [sent-212, score-0.275]
87 Here we focus on the first factor, where the distribution of rj given C is dependent on the appearance and position of the left patch (FjL , through the hyper-features Zj ) and on the similarity in appearance (dj ). [sent-214, score-1.049]
88 The intuition for the dependence on dj is that for the C = 1 case, we expect rj to be smaller on average when a good appearance match (small dj ) was found. [sent-215, score-1.162]
89 Following our approach for dj , we model the distribution of rj as a zero-mean normal distribution, N (0, Σ), where Σ (we use a diagonal covariance) is a function of Zj and dj . [sent-216, score-0.521]
90 The parameterization of (Zj ,dj ) is found through feature selection, while the weights for the linear function are obtained by maximizing the likelihood of rj over the training data. [sent-217, score-0.185]
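A minimal sketch of this fit, assuming (as a modeling choice of ours) that the log-variances of the diagonal covariance are linear in the selected (Zj, dj) terms.

```python
import numpy as np
from scipy.optimize import minimize

def fit_position_model(ZD, R):
    """rj ~ N(0, diag(sx^2, sy^2)) with log-variances linear in the selected
    (hyper-feature, dj) terms. ZD: (N, K) design matrix; R: (N, 2) = (dx, dy)."""
    N, K = ZD.shape

    def neg_log_lik(w):
        wx, wy = w[:K], w[K:]
        var = np.stack([np.exp(ZD @ wx), np.exp(ZD @ wy)], axis=1)   # (N, 2)
        return 0.5 * np.sum(np.log(2 * np.pi * var) + R**2 / var)

    res = minimize(neg_log_lik, np.zeros(2 * K), method="L-BFGS-B")
    return res.x[:K], res.x[K:]
```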
91 The bottom four curves of Figure 6 show that fitting an affine model first significantly improves the positional signal. [sent-220, score-0.178]
92 While position seems to be less informative than appearance, the complete model, which combines appearance and position, performs better than either alone (the “Complete” curve in Figure 6). [sent-221, score-0.393]
93 5 Conclusion The center and right sides of Figure 6 show our ability to select the most informative patches using the estimated mutual information I(FjL , C) of each patch. [sent-223, score-0.503]
94 To prevent spatially overlapping patches from being chosen, we added a penalty factor to the mutual information score that penalizes patches that are very close to other chosen patches (MI with neighborhood suppression). [sent-224, score-0.971]
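A hedged sketch of this greedy selection: the exact form of the overlap penalty is not given in this summary, so the radius and multiplicative penalty below are placeholders.

```python
import numpy as np

def select_patches(mi, positions, n_select=10, radius=12.0, penalty=0.5):
    """Greedily pick patches by mutual information, multiplying the score of
    any remaining patch within `radius` pixels of a chosen one by `penalty`
    (a simple form of neighborhood suppression)."""
    score = np.asarray(mi, dtype=float).copy()
    pos = np.asarray(positions, dtype=float)
    chosen = []
    for _ in range(n_select):
        j = int(np.argmax(score))
        chosen.append(j)
        score[j] = -np.inf
        near = np.linalg.norm(pos - pos[j], axis=1) < radius
        score[near & np.isfinite(score)] *= penalty
    return chosen
```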
95 To give a numerical indication of the performance, we note that with only 10 patches, given a 1-to-87 forced choice problem, our algorithm chooses the correct matching image 93% of the time. [sent-225, score-0.25]
96 Related work approaches this problem by learning distributions on shared factors [8] or priors on the parameters of fixed distributions for a category [5], where the training data consists of images from other categories. [sent-227, score-0.351]
97 Instead, we take a discriminative approach and model the statistical properties of image patch differences conditioned on properties of the patch. [sent-229, score-0.608]
98 These learned conditional distributions allow us to evaluate, for each feature, the amount of information potentially gained by matching it to the other image. [sent-230, score-0.2]
99 Footnote 7: Answer to Figure 1: top left matches bottom center; bottom left matches bottom right. [sent-311, score-0.19]
100 For our algorithm, matching these images was not a challenge. [sent-312, score-0.205]
wordName wordTfidf (topN-words)
[('patch', 0.426), ('fjl', 0.41), ('dj', 0.392), ('patches', 0.29), ('cars', 0.208), ('appearance', 0.178), ('fjr', 0.151), ('query', 0.138), ('rj', 0.129), ('image', 0.129), ('matching', 0.121), ('car', 0.095), ('zj', 0.092), ('glm', 0.086), ('camera', 0.086), ('images', 0.084), ('fj', 0.083), ('distributions', 0.079), ('identi', 0.075), ('informative', 0.073), ('position', 0.071), ('match', 0.071), ('positional', 0.065), ('saliency', 0.064), ('gamma', 0.063), ('object', 0.062), ('fm', 0.06), ('features', 0.054), ('yj', 0.053), ('lars', 0.051), ('category', 0.045), ('bipatches', 0.043), ('cameras', 0.043), ('fusiform', 0.043), ('glms', 0.043), ('notice', 0.041), ('mi', 0.04), ('cvpr', 0.04), ('curve', 0.04), ('sides', 0.04), ('ordinary', 0.038), ('similarity', 0.038), ('right', 0.038), ('conditioning', 0.037), ('alignment', 0.036), ('rough', 0.036), ('selection', 0.036), ('neighborhood', 0.035), ('mutual', 0.035), ('projective', 0.034), ('visual', 0.034), ('channels', 0.034), ('bottom', 0.033), ('top', 0.033), ('lter', 0.032), ('priors', 0.032), ('distinguishing', 0.032), ('vehicle', 0.032), ('mismatched', 0.032), ('oriented', 0.032), ('training', 0.032), ('xj', 0.031), ('recall', 0.031), ('score', 0.031), ('histogram', 0.03), ('matched', 0.03), ('orientations', 0.03), ('traf', 0.03), ('left', 0.029), ('canonical', 0.029), ('instances', 0.029), ('curves', 0.029), ('intensity', 0.028), ('naive', 0.028), ('suppression', 0.027), ('salient', 0.027), ('area', 0.027), ('link', 0.027), ('center', 0.027), ('discriminative', 0.027), ('improves', 0.027), ('visually', 0.026), ('detection', 0.026), ('conditioned', 0.026), ('aligned', 0.026), ('decision', 0.026), ('tting', 0.026), ('cation', 0.026), ('separately', 0.026), ('identically', 0.025), ('labeled', 0.025), ('precision', 0.025), ('gain', 0.025), ('af', 0.025), ('histograms', 0.025), ('dots', 0.025), ('pose', 0.024), ('feature', 0.024), ('four', 0.024), ('automatic', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 99 nips-2004-Learning Hyper-Features for Visual Identification
Author: Andras D. Ferencz, Erik G. Learned-miller, Jitendra Malik
Abstract: We address the problem of identifying specific instances of a class (cars) from a set of images all belonging to that class. Although we cannot build a model for any particular instance (as we may be provided with only one “training” example of it), we can use information extracted from observing other members of the class. We pose this task as a learning problem, in which the learner is given image pairs, labeled as matching or not, and must discover which image features are most consistent for matching instances and discriminative for mismatches. We explore a patch based representation, where we model the distributions of similarity measurements defined on the patches. Finally, we describe an algorithm that selects the most salient patches based on a mutual information criterion. This algorithm performs identification well for our challenging dataset of car images, after matching only a few, well chosen patches. 1
2 0.20826381 160 nips-2004-Seeing through water
Author: Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai
Abstract: We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted image patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow computations and solves the “leakage” problem inherent in recent manifold embedding techniques. 1
3 0.20765005 44 nips-2004-Conditional Random Fields for Object Recognition
Author: Ariadna Quattoni, Michael Collins, Trevor Darrell
Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.
4 0.15716293 40 nips-2004-Common-Frame Model for Object Recognition
Author: Pierre Moreels, Pietro Perona
Abstract: A generative probabilistic model for objects in images is presented. An object consists of a constellation of features. Feature appearance and pose are modeled probabilistically. Scene images are generated by drawing a set of objects from a given database, with random clutter sprinkled on the remaining image surface. Occlusion is allowed. We study the case where features from the same object share a common reference frame. Moreover, parameters for shape and appearance densities are shared across features. This is to be contrasted with previous work on probabilistic ‘constellation’ models where features depend on each other, and each feature and model have different pose and appearance statistics [1, 2]. These two differences allow us to build models containing hundreds of features, as well as to train each model from a single example. Our model may also be thought of as a probabilistic revisitation of Lowe’s model [3, 4]. We propose an efficient entropy-minimization inference algorithm that constructs the best interpretation of a scene as a collection of objects and clutter. We test our ideas with experiments on two image databases. We compare with Lowe’s algorithm and demonstrate better performance, in particular in presence of large amounts of background clutter.
5 0.13736753 192 nips-2004-The power of feature clustering: An application to object detection
Author: Shai Avidan, Moshe Butman
Abstract: We give a fast rejection scheme that is based on image segments and demonstrate it on the canonical example of face detection. However, instead of focusing on the detection step we focus on the rejection step and show that our method is simple and fast to be learned, thus making it an excellent pre-processing step to accelerate standard machine learning classifiers, such as neural-networks, Bayes classifiers or SVM. We decompose a collection of face images into regions of pixels with similar behavior over the image set. The relationships between the mean and variance of image segments are used to form a cascade of rejectors that can reject over 99.8% of image patches, thus only a small fraction of the image patches must be passed to a full-scale classifier. Moreover, the training time for our method is much less than an hour, on a standard PC. The shape of the features (i.e. image segments) we use is data-driven, they are very cheap to compute and they form a very low dimensional feature space in which exhaustive search for the best features is tractable. 1
6 0.12326261 83 nips-2004-Incremental Learning for Visual Tracking
7 0.11024058 68 nips-2004-Face Detection --- Efficient and Rank Deficient
8 0.10401078 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
9 0.10287748 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
10 0.10178491 73 nips-2004-Generative Affine Localisation and Tracking
11 0.10063352 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
12 0.098596491 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
13 0.095733352 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
14 0.088636704 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
15 0.087515742 23 nips-2004-Analysis of a greedy active learning strategy
16 0.083603971 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields
17 0.083249517 134 nips-2004-Object Classification from a Single Example Utilizing Class Relevance Metrics
18 0.077892281 42 nips-2004-Computing regularization paths for learning multiple kernels
19 0.075972155 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
20 0.063425809 179 nips-2004-Surface Reconstruction using Learned Shape Models
topicId topicWeight
[(0, -0.211), (1, 0.05), (2, -0.084), (3, -0.231), (4, 0.207), (5, 0.077), (6, 0.077), (7, -0.171), (8, -0.068), (9, 0.036), (10, -0.027), (11, 0.056), (12, 0.109), (13, 0.008), (14, -0.078), (15, 0.05), (16, 0.042), (17, 0.016), (18, -0.018), (19, -0.094), (20, 0.099), (21, 0.004), (22, -0.054), (23, -0.096), (24, 0.019), (25, 0.1), (26, 0.022), (27, -0.11), (28, -0.01), (29, -0.041), (30, -0.036), (31, -0.055), (32, -0.057), (33, -0.076), (34, 0.139), (35, 0.076), (36, -0.07), (37, -0.078), (38, -0.124), (39, 0.102), (40, -0.023), (41, -0.041), (42, 0.058), (43, 0.079), (44, -0.117), (45, -0.045), (46, -0.001), (47, -0.142), (48, -0.105), (49, 0.007)]
simIndex simValue paperId paperTitle
same-paper 1 0.95777655 99 nips-2004-Learning Hyper-Features for Visual Identification
Author: Andras D. Ferencz, Erik G. Learned-miller, Jitendra Malik
Abstract: We address the problem of identifying specific instances of a class (cars) from a set of images all belonging to that class. Although we cannot build a model for any particular instance (as we may be provided with only one “training” example of it), we can use information extracted from observing other members of the class. We pose this task as a learning problem, in which the learner is given image pairs, labeled as matching or not, and must discover which image features are most consistent for matching instances and discriminative for mismatches. We explore a patch based representation, where we model the distributions of similarity measurements defined on the patches. Finally, we describe an algorithm that selects the most salient patches based on a mutual information criterion. This algorithm performs identification well for our challenging dataset of car images, after matching only a few, well chosen patches. 1
2 0.66231316 40 nips-2004-Common-Frame Model for Object Recognition
Author: Pierre Moreels, Pietro Perona
Abstract: A generative probabilistic model for objects in images is presented. An object consists of a constellation of features. Feature appearance and pose are modeled probabilistically. Scene images are generated by drawing a set of objects from a given database, with random clutter sprinkled on the remaining image surface. Occlusion is allowed. We study the case where features from the same object share a common reference frame. Moreover, parameters for shape and appearance densities are shared across features. This is to be contrasted with previous work on probabilistic ‘constellation’ models where features depend on each other, and each feature and model have different pose and appearance statistics [1, 2]. These two differences allow us to build models containing hundreds of features, as well as to train each model from a single example. Our model may also be thought of as a probabilistic revisitation of Lowe’s model [3, 4]. We propose an efficient entropy-minimization inference algorithm that constructs the best interpretation of a scene as a collection of objects and clutter. We test our ideas with experiments on two image databases. We compare with Lowe’s algorithm and demonstrate better performance, in particular in presence of large amounts of background clutter.
3 0.6511808 192 nips-2004-The power of feature clustering: An application to object detection
Author: Shai Avidan, Moshe Butman
Abstract: We give a fast rejection scheme that is based on image segments and demonstrate it on the canonical example of face detection. However, instead of focusing on the detection step we focus on the rejection step and show that our method is simple and fast to be learned, thus making it an excellent pre-processing step to accelerate standard machine learning classifiers, such as neural-networks, Bayes classifiers or SVM. We decompose a collection of face images into regions of pixels with similar behavior over the image set. The relationships between the mean and variance of image segments are used to form a cascade of rejectors that can reject over 99.8% of image patches, thus only a small fraction of the image patches must be passed to a full-scale classifier. Moreover, the training time for our method is much less than an hour, on a standard PC. The shape of the features (i.e. image segments) we use is data-driven, they are very cheap to compute and they form a very low dimensional feature space in which exhaustive search for the best features is tractable. 1
4 0.64110476 160 nips-2004-Seeing through water
Author: Alexei Efros, Volkan Isler, Jianbo Shi, Mirkó Visontai
Abstract: We consider the problem of recovering an underwater image distorted by surface waves. A large amount of video data of the distorted image is acquired. The problem is posed in terms of finding an undistorted image patch at each spatial location. This challenging reconstruction task can be formulated as a manifold learning problem, such that the center of the manifold is the image of the undistorted patch. To compute the center, we present a new technique to estimate global distances on the manifold. Our technique achieves robustness through convex flow computations and solves the “leakage” problem inherent in recent manifold embedding techniques. 1
5 0.63161725 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
Author: Odelia Schwartz, Terrence J. Sejnowski, Peter Dayan
Abstract: In the analysis of natural images, Gaussian scale mixtures (GSM) have been used to account for the statistics of filter responses, and to inspire hierarchical cortical representational learning schemes. GSMs pose a critical assignment problem, working out which filter responses were generated by a common multiplicative factor. We present a new approach to solving this assignment problem through a probabilistic extension to the basic GSM, and show how to perform inference in the model using Gibbs sampling. We demonstrate the efficacy of the approach on both synthetic and image data. Understanding the statistical structure of natural images is an important goal for visual neuroscience. Neural representations in early cortical areas decompose images (and likely other sensory inputs) in a way that is sensitive to sophisticated aspects of their probabilistic structure. This structure also plays a key role in methods for image processing and coding. A striking aspect of natural images that has reflections in both top-down and bottom-up modeling is coordination across nearby locations, scales, and orientations. From a topdown perspective, this structure has been modeled using what is known as a Gaussian Scale Mixture model (GSM).1–3 GSMs involve a multi-dimensional Gaussian (each dimension of which captures local structure as in a linear filter), multiplied by a spatialized collection of common hidden scale variables or mixer variables∗ (which capture the coordination). GSMs have wide implications in theories of cortical receptive field development, eg the comprehensive bubbles framework of Hyv¨ rinen.4 The mixer variables provide the a top-down account of two bottom-up characteristics of natural image statistics, namely the ‘bowtie’ statistical dependency,5, 6 and the fact that the marginal distributions of receptive field-like filters have high kurtosis.7, 8 In hindsight, these ideas also bear a close relationship with Ruderman and Bialek’s multiplicative bottom-up image analysis framework 9 and statistical models for divisive gain control.6 Coordinated structure has also been addressed in other image work,10–14 and in other domains such as speech15 and finance.16 Many approaches to the unsupervised specification of representations in early cortical areas rely on the coordinated structure.17–21 The idea is to learn linear filters (eg modeling simple cells as in22, 23 ), and then, based on the coordination, to find combinations of these (perhaps non-linearly transformed) as a way of finding higher order filters (eg complex cells). One critical facet whose specification from data is not obvious is the neighborhood arrangement, ie which linear filters share which mixer variables. ∗ Mixer variables are also called mutlipliers, but are unrelated to the scales of a wavelet. Here, we suggest a method for finding the neighborhood based on Bayesian inference of the GSM random variables. In section 1, we consider estimating these components based on information from different-sized neighborhoods and show the modes of failure when inference is too local or too global. Based on these observations, in section 2 we propose an extension to the GSM generative model, in which the mixer variables can overlap probabilistically. We solve the neighborhood assignment problem using Gibbs sampling, and demonstrate the technique on synthetic data. In section 3, we apply the technique to image data. 
1 GSM inference of Gaussian and mixer variables In a simple, n-dimensional, version of a GSM, filter responses l are synthesized † by multiplying an n-dimensional Gaussian with values g = {g1 . . . gn }, by a common mixer variable v. l = vg (1) We assume g are uncorrelated (σ 2 along diagonal of the covariance matrix). For the analytical calculations, we assume that v has a Rayleigh distribution: where 0 < a ≤ 1 parameterizes the strength of the prior p[v] ∝ [v exp −v 2 /2]a (2) For ease, we develop the theory for a = 1. As is well known,2 and repeated in figure 1(B), the marginal distribution of the resulting GSM is sparse and highly kurtotic. The joint conditional distribution of two elements l1 and l2 , follows a bowtie shape, with the width of the distribution of one dimension increasing for larger values (both positive and negative) of the other dimension. The inverse problem is to estimate the n+1 variables g1 . . . gn , v from the n filter responses l1 . . . ln . It is formally ill-posed, though regularized through the prior distributions. Four posterior distributions are particularly relevant, and can be derived analytically from the model: rv distribution posterior mean ” “ √ σ |l1 | 2 2 l1 |l1 | B“ 1, σ ” |l1 | ” exp − v − “ p[v|l1 ] 2 2v 2 σ 2 σ 1 |l1 | 1 |l1 | B p[v|l] p[|g1 ||l1 ] p[|g1 ||l] √ B 2, σ 1 (n−2) 2 2 2 ( ) −(n−1) exp − v2 − 2vl2 σ2 l v B(1− n , σ ) 2 √ σ|l1 | g2 l2 “ ” 1 exp − 12 − 12 2σ 1 |l1 | g2 2g l σ B −2, σ|l1 | ”1 |l1 | 2 (2−n) l n l 2 −1, σ “ B( ) σ (n−3) g1 1 l σ σ 1 g2 2 1 exp − 2σ2 l2 − l 1 2 l1 2 2g1 σ |l1 | σ ( ( 2, σ ) ) l B 3−n,σ 2 2 l B 1− n , σ “ 2 ” |l1 | B 0, σ |l1 | “ ” σ B − 1 , |l1 | 2 σ n 1 l |l1 | B( 2 − 2 , σ ) n l B( −1, l ) 2 σ 2 where B(n, x) is the modified Bessel function of the second kind (see also24 ), l = i li and gi is forced to have the same sign as li , since the mixer variables are always positive. Note that p[v|l1 ] and p[g1 |l1 ] (rows 1,3) are local estimates, while p[v|l] and p[g|l] (rows 2,4) are estimates according to filter outputs {l1 . . . ln }. The posterior p[v|l] has also been estimated numerically in noise removal for other mixer priors, by Portilla et al 25 The full GSM specifies a hierarchy of mixer variables. Wainwright2 considered a prespecified tree-based hierarhical arrangement. In practice, for natural sensory data, given a heterogeneous collection of li , it is advantageous to learn the hierachical arrangement from examples. In an approach related to that of the GSM, Karklin and Lewicki19 suggested We describe the l as being filter responses even in the synthetic case, to facilitate comparison with images. † B A α 1 ... g v 20 1 ... β 0.1 l 0 -5 0 l 2 0 21 0 0 5 l 1 0 l 1 1 l ... l 21 40 20 Actual Distribution 0 D Gaussian 0 5 0 0 -5 0 0 5 0 5 -5 0 g 1 0 5 E(g 1 | l1) 1 .. 40 ) 0.06 -5 0 0 5 2 E(g |l 1 1 .. 20 ) 0 1 E(g | l ) -5 5 E(g | l 1 2 1 .. 20 5 α E(g |l 1 .. 20 ) E(g |l 0 E(v | l α 0.06 E(g | l2) 2 2 0 5 E(v | l 1 .. 20 ) E(g | l1) 1 1 g 0 1 0.06 0 0.06 E(vαl | ) g 40 filters, too global 0.06 0.06 0.06 Distribution 20 filters 1 filter, too local 0.06 vα E Gaussian joint conditional 40 l l C Mixer g ... 21 Multiply Multiply l g Distribution g v 1 .. 40 1 .. 40 ) ) E(g | l 1 1 .. 40 ) Figure 1: A Generative model: each filter response is generated by multiplying its Gaussian variable by either mixer variable vα , or mixer variable vβ . B Marginal and joint conditional statistics (bowties) of sample synthetic filter responses. 
For the joint conditional statistics, intensity is proportional to the bin counts, except that each column is independently re-scaled to fill the range of intensities. C-E Left: actual distributions of mixer and Gaussian variables; other columns: estimates based on different numbers of filter responses. C Distribution of estimate of the mixer variable vα . Note that mixer variable values are by definition positive. D Distribution of estimate of one of the Gaussian variables, g1 . E Joint conditional statistics of the estimates of Gaussian variables g1 and g2 . generating log mixer values for all the filters and learning the linear combinations of a smaller collection of underlying values. Here, we consider the problem in terms of multiple mixer variables, with the linear filters being clustered into groups that share a single mixer. This poses a critical assignment problem of working out which filter responses share which mixer variables. We first study this issue using synthetic data in which two groups of filter responses l1 . . . l20 and l21 . . . l40 are generated by two mixer variables vα and vβ (figure 1). We attempt to infer the components of the GSM model from the synthetic data. Figure 1C;D shows the empirical distributions of estimates of the conditional means of a mixer variable E(vα |{l}) and one of the Gaussian variables E(g1 |{l}) based on different assumed assignments. For estimation based on too few filter responses, the estimates do not well match the actual distributions. For example, for a local estimate based on a single filter response, the Gaussian estimate peaks away from zero. For assignments including more filter responses, the estimates become good. However, inference is also compromised if the estimates for vα are too global, including filter responses actually generated from vβ (C and D, last column). In (E), we consider the joint conditional statistics of two components, each 1 v v α vγ β g 1 ... v vα B Actual A Generative model 1 100 1 100 0 v 01 l1 ... l100 0 l 1 20 2 0 0 l 1 0 -4 100 Filter number vγ β 1 100 1 0 Filter number 100 1 Filter number 0 E(g 1 | l ) Gibbs fit assumed 0.15 E(g | l ) 0 2 0 1 Mixer Gibbs fit assumed 0.1 4 0 E(g 1 | l ) Distribution Distribution Distribution l 100 Filter number Gaussian 0.2 -20 1 1 0 Filter number Inferred v α Multiply 100 1 Filter number Pixel vγ 1 g 0 C β E(v | l ) β 0 0 0 15 E(v | l ) α 0 E(v | l ) α Figure 2: A Generative model in which each filter response is generated by multiplication of its Gaussian variable by a mixer variable. The mixer variable, v α , vβ , or vγ , is chosen probabilistically upon each filter response sample, from a Rayleigh distribution with a = .1. B Top: actual probability of filter associations with vα , vβ , and vγ ; Bottom: Gibbs estimates of probability of filter associations corresponding to vα , vβ , and vγ . C Statistics of generated filter responses, and of Gaussian and mixer estimates from Gibbs sampling. estimating their respective g1 and g2 . Again, as the number of filter responses increases, the estimates improve, provided that they are taken from the right group of filter responses with the same mixer variable. Specifically, the mean estimates of g1 and g2 become more independent (E, third column). 
Note that for estimations based on a single filter response, the joint conditional distribution of the Gaussian appears correlated rather than independent (E, second column); for estimation based on too many filter responses (40 in this example), the joint conditional distribution of the Gaussian estimates shows a dependent (rather than independent) bowtie shape (E, last column). Mixer variable joint statistics also deviate from the actual when the estimations are too local or global (not shown). We have observed qualitatively similar statistics for estimation based on coefficients in natural images. Neighborhood size has also been discussed in the context of the quality of noise removal, assuming a GSM model.26 2 Neighborhood inference: solving the assignment problem The plots in figure 1 suggest that it should be possible to infer the assignments, ie work out which filter responses share common mixers, by learning from the statistics of the resulting joint dependencies. Hard assignment problems (in which each filter response pays allegiance to just one mixer) are notoriously computationally brittle. Soft assignment problems (in which there is a probabilistic relationship between filter responses and mixers) are computationally better behaved. Further, real world stimuli are likely better captured by the possibility that filter responses are coordinated in somewhat different collections in different images. We consider a richer, mixture GSM as a generative model (Figure 2). To model the generation of filter responses li for a single image patch, we multiply each Gaussian variable gi by a single mixer variable from the set v1 . . . vm . We assume that gi has association probabil- ity pij (satisfying j pij = 1, ∀i) of being assigned to mixer variable vj . The assignments are assumed to be made independently for each patch. We use si ∈ {1, 2, . . . m} for the assignments: li = g i vs i (3) Inference and learning in this model proceeds in two stages, according to the expectation maximization algorithm. First, given a filter response li , we use Gibbs sampling for the E phase to find possible appropriate (posterior) assignments. Williams et al.27 suggested using Gibbs sampling to solve a similar assignment problem in the context of dynamic tree models. Second, for the M phase, given the collection of assignments across multiple filter responses, we update the association probabilities pij . Given sample mixer assignments, we can estimate the Gaussian and mixer components of the GSM using the table of section 1, but restricting the filter response samples just to those associated with each mixer variable. We tested the ability of this inference method to find the associations in the probabilistic mixer variable synthetic example shown in figure 2, (A,B). The true generative model specifies probabilistic overlap of 3 mixer variables. We generated 5000 samples for each filter according to the generative model. We ran the Gibbs sampling procedure, setting the number of possible neighborhoods to 5 (e.g., > 3); after 500 iterations the weights converged near to the proper probabilities. In (B, top), we plot the actual probability distributions for the filter associations with each of the mixer variables. In (B, bottom), we show the estimated associations: the three non-zero estimates closely match the actual distributions; the other two estimates are zero (not shown). The procedure consistently finds correct associations even in larger examples of data generated with up to 10 mixer variables. 
In (C) we show an example of the actual and estimated distributions of the mixer and Gaussian components of the GSM. Note that the joint conditional statistics of both mixer and Gaussian are independent, since the variables were generated as such in the synthetic example. The Gibbs procedure can be adjusted for data generated with different parameters a of equation 2, and for related mixers,2 allowing for a range of image coefficient behaviors. 3 Image data Having validated the inference model using synthetic data, we turned to natural images. We derived linear filters from a multi-scale oriented steerable pyramid,28 with 100 filters, at 2 preferred orientations, 25 non-overlapping spatial positions (with spatial subsampling of 8 pixels), and two phases (quadrature pairs), and a single spatial frequency peaked at 1/6 cycles/pixel. The image ensemble is 4 images from a standard image compression database (boats, goldhill, plant leaves, and mountain) and 4000 samples. We ran our method with the same parameters as for synthetic data, with 7 possible neighborhoods and Rayleigh parameter a = .1 (as in figure 2). Figure 3 depicts the association weights pij of the coefficients for each of the obtained mixer variables. In (A), we show a schematic (template) of the association representation that will follow in (B, C) for the actual data. Each mixer variable neighborhood is shown for coefficients of two phases and two orientations along a spatial grid (one grid for each phase). The neighborhood is illustrated via the probability of each coefficient to be generated from a given mixer variable. For the first two neighborhoods (B), we also show the image patches that yielded the maximum log likelihood of P (v|patch). The first neighborhood (in B) prefers vertical patterns across most of its “receptive field”, while the second has a more localized region of horizontal preference. This can also be seen by averaging the 200 image patches with the maximum log likelihood. Strikingly, all the mixer variables group together two phases of quadrature pair (B, C). Quadrature pairs have also been extracted from cortical data, and are the components of ideal complex cell models. Another tendency is to group Phase 2 Phase 1 19 Y position Y position A 0 -19 Phase 1 Phase 2 19 0 -19 -19 0 19 X position -19 0 19 X position B Neighborhood Example max patches Average Neighborhood Example max patches C Neighborhood Average Gaussian 0.25 l2 0 -50 0 l 1 50 0 l 1 Mixer Gibbs fit assumed Gibbs fit assumed Distribution Distribution Distribution D Coefficient 0.12 E(g | l ) 0 2 0 -5 0 E(g 1 | l ) 5 0 E(g 1 | l ) 0.15 ) E(v | l ) β 0 00 15 E(v | l ) α 0 E(v | l ) α Figure 3: A Schematic of the mixer variable neighborhood representation. The probability that each coefficient is associated with the mixer variable ranges from 0 (black) to 1 (white). Left: Vertical and horizontal filters, at two orientations, and two phases. Each phase is plotted separately, on a 38 by 38 pixel spatial grid. Right: summary of representation, with filter shapes replaced by oriented lines. Filters are approximately 6 pixels in diameter, with the spacing between filters 8 pixels. B First two image ensemble neighborhoods obtained from Gibbs sampling. Also shown, are four 38×38 pixel patches that had the maximum log likelihood of P (v|patch), and the average of the first 200 maximal patches. C Other image ensemble neighborhoods. 
D Statistics of representative coefficients of two spatially displaced vertical filters, and of inferred Gaussian and mixer variables. orientations across space. The phase and iso-orientation grouping bear some interesting similarity to other recent suggestions;17, 18 as do the maximal patches.19 Wavelet filters have the advantage that they can span a wider spatial extent than is possible with current ICA techniques, and the analysis of parameters such as phase grouping is more controlled. We are comparing the analysis with an ICA first-stage representation, which has other obvious advantages. We are also extending the analysis to correlated wavelet filters; 25 and to simulations with a larger number of neighborhoods. From the obtained associations, we estimated the mixer and Gaussian variables according to our model. In (D) we show representative statistics of the coefficients and of the inferred variables. The learned distributions of Gaussian and mixer variables are quite close to our assumptions. The Gaussian estimates exhibit joint conditional statistics that are roughly independent, and the mixer variables are weakly dependent. We have thus far demonstrated neighborhood inference for an image ensemble, but it is also interesting and perhaps more intuitive to consider inference for particular images or image classes. In figure 4 (A-B) we demonstrate example mixer variable neighborhoods derived from learning patches of a zebra image (Corel CD-ROM). As before, the neighborhoods are composed of quadrature pairs; however, the spatial configurations are richer and have A Neighborhood B Neighborhood Average Example max patches Top 25 max patches Average Example max patches Top 25 max patches Figure 4: Example of Gibbs on Zebra image. Image is 151×151 pixels, and each spatial neighborhood spans 38×38 pixels. A, B Example mixer variable neighborhoods. Left: example mixer variable neighborhood, and average of 200 patches that yielded the maximum likelihood of P (v|patch). Right: Image and marked on top of it example patches that yielded the maximum likelihood of P (v|patch). not been previously reported with unsupervised hierarchical methods: for example, in (A), the mixture neighborhood captures a horizontal-bottom/vertical-top spatial configuration. This appears particularly relevant in segmenting regions of the front zebra, as shown by marking in the image the patches i that yielded the maximum log likelihood of P (v|patch). In (B), the mixture neighborhood captures a horizontal configuration, more focused on the horizontal stripes of the front zebra. This example demonstrates the logic behind a probabilistic mixture: coefficients corresponding to the bottom horizontal stripes might be linked with top vertical stripes (A) or to more horizontal stripes (B). 4 Discussion Work on the study of natural image statistics has recently evolved from issues about scalespace hierarchies, wavelets, and their ready induction through unsupervised learning models (loosely based on cortical development) towards the coordinated statistical structure of the wavelet components. This includes bottom-up (eg bowties, hierarchical representations such as complex cells) and top-down (eg GSM) viewpoints. The resulting new insights inform a wealth of models and ideas and form the essential backdrop for the work in this paper. They also link to impressive engineering results in image coding and processing. 
A most critical aspect of an hierarchical representational model is the way that the structure of the hierarchy is induced. We addressed the hierarchy question using a novel extension to the GSM generative model in which mixer variables (at one level of the hierarchy) enjoy probabilistic assignments to filter responses (at a lower level). We showed how these assignments can be learned (using Gibbs sampling), and illustrated some of their attractive properties using both synthetic and a variety of image data. We grounded our method firmly in Bayesian inference of the posterior distributions over the two classes of random variables in a GSM (mixer and Gaussian), placing particular emphasis on the interplay between the generative model and the statistical properties of its components. An obvious question raised by our work is the neural correlate of the two different posterior variables. The Gaussian variable has characteristics resembling those of the output of divisively normalized simple cells;6 the mixer variable is more obviously related to the output of quadrature pair neurons (such as orientation energy or motion energy cells, which may also be divisively normalized). How these different information sources may subsequently be used is of great interest. Acknowledgements This work was funded by the HHMI (OS, TJS) and the Gatsby Charitable Foundation (PD). We are very grateful to Patrik Hoyer, Mike Lewicki, Zhaoping Li, Simon Osindero, Javier Portilla and Eero Simoncelli for discussion. References [1] D Andrews and C Mallows. Scale mixtures of normal distributions. J. Royal Stat. Soc., 36:99–102, 1974. [2] M J Wainwright and E P Simoncelli. Scale mixtures of Gaussians and the statistics of natural images. In S. A. Solla, T. K. Leen, and K.-R. M¨ ller, editors, Adv. Neural Information Processing Systems, volume 12, pages 855–861, Cambridge, MA, u May 2000. MIT Press. [3] M J Wainwright, E P Simoncelli, and A S Willsky. Random cascades on wavelet trees and their use in modeling and analyzing natural imagery. Applied and Computational Harmonic Analysis, 11(1):89–123, July 2001. Special issue on wavelet applications. [4] A Hyv¨ rinen, J Hurri, and J Vayrynen. Bubbles: a unifying framework for low-level statistical properties of natural image a sequences. Journal of the Optical Society of America A, 20:1237–1252, May 2003. [5] R W Buccigrossi and E P Simoncelli. Image compression via joint statistical characterization in the wavelet domain. IEEE Trans Image Proc, 8(12):1688–1701, December 1999. [6] O Schwartz and E P Simoncelli. Natural signal statistics and sensory gain control. Nature Neuroscience, 4(8):819–825, August 2001. [7] D J Field. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12):2379–2394, 1987. [8] H Attias and C E Schreiner. Temporal low-order statistics of natural sounds. In M Jordan, M Kearns, and S Solla, editors, Adv in Neural Info Processing Systems, volume 9, pages 27–33. MIT Press, 1997. [9] D L Ruderman and W Bialek. Statistics of natural images: Scaling in the woods. Phys. Rev. Letters, 73(6):814–817, 1994. [10] C Zetzsche, B Wegmann, and E Barth. Nonlinear aspects of primary vision: Entropy reduction beyond decorrelation. In Int’l Symposium, Society for Information Display, volume XXIV, pages 933–936, 1993. [11] J Huang and D Mumford. Statistics of natural images and models. In CVPR, page 547, 1999. [12] J. Romberg, H. Choi, and R. Baraniuk. 
[12] J. Romberg, H. Choi, and R. Baraniuk. Bayesian wavelet domain image modeling using hidden Markov trees. In Proc. IEEE Int'l Conf on Image Proc, Kobe, Japan, October 1999.
[13] A Turiel, G Mato, N Parga, and J P Nadal. The self-similarity properties of natural images resemble those of turbulent flows. Phys. Rev. Lett., 80:1098–1101, 1998.
[14] J Portilla and E P Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int'l Journal of Computer Vision, 40(1):49–71, 2000.
[15] Helmut Brehm and Walter Stammler. Description and generation of spherically invariant speech-model signals. Signal Processing, 12:119–141, 1987.
[16] T Bollerslev, K Engle, and D Nelson. ARCH models. In B Engle and D McFadden, editors, Handbook of Econometrics V. 1994.
[17] A Hyvärinen and P Hoyer. Emergence of topography and complex cell properties from natural images using extensions of ICA. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Adv. Neural Information Processing Systems, volume 12, pages 827–833, Cambridge, MA, May 2000. MIT Press.
[18] P Hoyer and A Hyvärinen. A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12):1593–1605, 2002.
[19] Y Karklin and M S Lewicki. Learning higher-order structures in natural images. Network: Computation in Neural Systems, 14:483–499, 2003.
[20] L Wiskott and T Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.
[21] C Kayser, W Einhäuser, O Dümmer, P König, and K P Körding. Extracting slow subspaces from natural videos leads to complex cells. In G Dorffner, H Bischof, and K Hornik, editors, Proc. Int'l Conf. on Artificial Neural Networks (ICANN-01), pages 1075–1080, Vienna, Aug 2001. Springer-Verlag, Heidelberg.
[22] B A Olshausen and D J Field. Emergence of simple-cell receptive field properties by learning a sparse factorial code. Nature, 381:607–609, 1996.
[23] A J Bell and T J Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.
[24] U Grenander and A Srivastava. Probability models for clutter in natural images. IEEE Trans. on Patt. Anal. and Mach. Intel., 23:423–429, 2002.
[25] J Portilla, V Strela, M Wainwright, and E Simoncelli. Adaptive Wiener denoising using a Gaussian scale mixture model in the wavelet domain. In Proc 8th IEEE Int'l Conf on Image Proc, pages 37–40, Thessaloniki, Greece, Oct 7-10 2001. IEEE Computer Society.
[26] J Portilla, V Strela, M Wainwright, and E P Simoncelli. Image denoising using a scale mixture of Gaussians in the wavelet domain. IEEE Trans Image Processing, 12(11):1338–1351, November 2003.
[27] C K I Williams and N J Adams. Dynamic trees. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems, volume 11, pages 634–640, Cambridge, MA, 1999. MIT Press.
[28] E P Simoncelli, W T Freeman, E H Adelson, and D J Heeger. Shiftable multi-scale transforms. IEEE Trans Information Theory, 38(2):587–607, March 1992. Special Issue on Wavelets.
6 0.58907825 44 nips-2004-Conditional Random Fields for Object Recognition
7 0.53391337 73 nips-2004-Generative Affine Localisation and Tracking
8 0.4976691 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
9 0.46374688 68 nips-2004-Face Detection --- Efficient and Rank Deficient
10 0.44463047 47 nips-2004-Contextual Models for Object Detection Using Boosted Random Fields
11 0.42401746 83 nips-2004-Incremental Learning for Visual Tracking
12 0.40816182 89 nips-2004-Joint MRI Bias Removal Using Entropy Minimization Across Images
13 0.39413702 121 nips-2004-Modeling Nonlinear Dependencies in Natural Images using Mixture of Laplacian Distribution
14 0.37310857 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
15 0.37303141 85 nips-2004-Instance-Based Relevance Feedback for Image Retrieval
16 0.36914358 21 nips-2004-An Information Maximization Model of Eye Movements
17 0.35043606 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
18 0.34350246 179 nips-2004-Surface Reconstruction using Learned Shape Models
19 0.33864692 81 nips-2004-Implicit Wiener Series for Higher-Order Image Analysis
20 0.33298439 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
topicId topicWeight
[(13, 0.11), (15, 0.139), (17, 0.012), (21, 0.225), (26, 0.049), (31, 0.025), (33, 0.19), (35, 0.034), (36, 0.013), (39, 0.017), (50, 0.06), (87, 0.014)]
simIndex simValue paperId paperTitle
1 0.92028534 111 nips-2004-Maximal Margin Labeling for Multi-Topic Text Categorization
Author: Hideto Kazawa, Tomonori Izumitani, Hirotoshi Taira, Eisaku Maeda
Abstract: In this paper, we address the problem of statistical learning for multi-topic text categorization (MTC), whose goal is to choose all relevant topics (a label) from a given set of topics. The proposed algorithm, Maximal Margin Labeling (MML), treats all possible labels as independent classes and learns a multi-class classifier on the induced multi-class categorization problem. To cope with the data sparseness caused by the huge number of possible labels, MML combines some prior knowledge about label prototypes and a maximal margin criterion in a novel way. Experiments with multi-topic Web pages show that MML outperforms existing learning algorithms, including Support Vector Machines.

1 Multi-topic Text Categorization (MTC)

This paper addresses the problem of learning for multi-topic text categorization (MTC), whose goal is to select all topics relevant to a text from a given set of topics. In MTC, multiple topics may be relevant to a single text. We thus call a set of topics a label, and say that a text is assigned a label, not a topic.

In almost all previous text categorization studies (e.g. [1, 2]), the label is predicted by judging each topic's relevance to the text. In this decomposition approach, the features specific to a topic, not a label, are regarded as the important features. However, this approach may result in inefficient learning, as the following example illustrates. Imagine an MTC problem over scientific papers in which quantum computing papers are assigned the multi-topic label "quantum physics (QP) & computer science (CS)" (QP and CS are topics in this example). Since there are words specific to quantum computing, such as "qbit"¹, an efficient MTC learner should use such words to assign the label QP & CS. However, the decomposition approach is likely to ignore these words, since they are specific to only a small portion of all QP or CS papers (there are many more QP and CS papers than quantum computing papers), and therefore are not discriminative features for either topic QP or topic CS.

¹ Qbit is a unit of quantum information; it appears frequently in the quantum computing literature but is rarely seen elsewhere.

Symbol                      Meaning
x (∈ R^d)                   A document vector
t_1, t_2, ..., t_l          Topics
T                           The set of all topics
L, λ (⊂ T)                  A label
L[j]                        The binary representation of L: 1 if t_j ∈ L and 0 otherwise
Λ (= 2^T)                   The set of all possible labels
{(x_i, L_i)}_{i=1}^m        Training samples
Table 1: Notation

Parametric Mixture Model (PMM) [3] adopts another approach to MTC. PMM assumes that multi-topic texts are generated from a mixture of topic-specific word distributions, and its labeling decision is made at once, not separately for each topic. However, PMM also has a problem with multi-topic-specific features such as "qbit", since it is impossible for texts to have such features under PMM's mixture process. These problems with multi-topic-specific features are caused by dependency assumptions between labels, which are explicitly or implicitly made in existing methods. To solve these problems, we propose Maximal Margin Labeling, which treats labels as independent classes and learns a multi-class classifier on the induced multi-class problem.

In this paper, we first discuss why multi-class classifiers cannot be directly applied to MTC in Section 2. We then propose MML in Section 3, and address implementation issues in Section 4. In Section 5, MML is experimentally compared with existing methods using a collection of multi-topic Web pages. We summarize this paper in Section 6.
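Before moving on, here is a toy illustration (not from the paper) of the notation in Table 1: a label is a set of topics, and L[j] is its binary representation over the topic list. The topic names below are made up for the example.

```python
# Minimal sketch of Table 1's notation (illustrative only).
from typing import FrozenSet, List

topics: List[str] = ["quantum physics", "computer science", "biology"]   # t_1..t_l (toy list)

def binary_representation(label: FrozenSet[str]) -> List[int]:
    """L[j] = 1 if topic t_j is in the label, 0 otherwise."""
    return [1 if t in label else 0 for t in topics]

label = frozenset({"quantum physics", "computer science"})   # a multi-topic label
print(binary_representation(label))        # [1, 1, 0]

# The label space Lambda = 2^T grows exponentially with the number of topics l:
print("number of possible labels:", 2 ** len(topics))
```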
2 Solving MTC as a Multi-Class Categorization

To discuss why existing multi-class classifiers do not work in MTC, we start from the multi-class classifier proposed in [4]. Hereafter we use the notation given in Table 1. The multi-class classifier in [4] categorizes an object into the class whose prototype vector is closest to the object's feature vector. Substituting labels for classes, the classifier can be written as

  $f(x) = \arg\max_{\lambda \in \Lambda} \langle x, m_\lambda \rangle_X$,   (1)

where $\langle\cdot,\cdot\rangle_X$ is the inner product of $R^d$ and $m_\lambda \in R^d$ is the prototype vector of label $\lambda$. Following a similar argument to that in [4], the prototype vectors are learned by solving the following maximal margin problem²:

  $\min_{M}\ \frac{1}{2}\|M\|^2 + C \sum_{1\le i\le m}\ \sum_{\lambda\in\Lambda,\,\lambda\neq L_i} \xi_i^{\lambda}$  s.t.  $\langle x_i, m_{L_i}\rangle_X - \langle x_i, m_\lambda\rangle_X \ge 1 - \xi_i^{\lambda}$  for $1\le i\le m,\ \forall \lambda\neq L_i$,   (2)

where $M$ is the prototype matrix whose columns are the prototype vectors, and $\|M\|$ is the Frobenius norm of $M$.

² In Eq. (2), we penalize all violations of the margin constraints. Crammer and Singer, on the other hand, penalize only the largest violation of the margin constraint for each training sample [4]. We chose the "penalize-all" approach since it leads to an optimization problem without equality constraints (see Eq. (7)), which is much easier to solve than the one in [4].

Note that Eq. (1) and Eq. (2) cover not only the training samples' labels, but all possible labels. This is because labels unseen in the training samples may be relevant to test samples. In usual multi-class problems, such unseen labels seldom exist. In MTC, however, the number of labels is generally very large (e.g. one of our datasets has 1,054 labels; see Table 2), and unseen labels often exist. Thus it is necessary to consider all possible labels in Eq. (1) and Eq. (2), since it is impossible to know which unseen labels are present in the test samples.

There are two problems with Eq. (1) and Eq. (2). The first is that they involve the prototype vectors of seldom or never seen labels; without prior knowledge about where these prototype vectors should be, it is impossible to obtain appropriate prototypes for such labels. The second is that these equations are computationally too demanding, since they involve combinatorial maximization and summation over all possible labels, whose number can be quite large (around $2^{30}$ in the datasets used in our experiments). We address the first problem in Section 3 and the second in Section 4.

3 Maximal Margin Labeling

In this section, we incorporate prior knowledge about the location of the prototype vectors into Eq. (1) and Eq. (2), and propose a novel MTC learning algorithm, Maximal Margin Labeling (MML). As prior knowledge, we simply assume that the prototype vectors of similar labels should be placed close to each other. Based on this assumption, we first rewrite Eq. (1) as

  $f(x) = \arg\max_{\lambda\in\Lambda} \langle M^{T} x, e_\lambda \rangle_L$,   (3)

where $\langle\cdot,\cdot\rangle_L$ is the inner product of $R^{|\Lambda|}$ and $\{e_\lambda\}_{\lambda\in\Lambda}$ is the orthonormal basis of $R^{|\Lambda|}$. The classifier of Eq. (3) can be interpreted as a two-step process: the first step maps the vector $x$ into $R^{|\Lambda|}$ by $M^{T}$, and the second step finds the $e_\lambda$ closest to the image $M^{T}x$. (A minimal sketch of this baseline prototype classifier is given below.)
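The sketch below is an illustration, not the authors' implementation: it spells out the prototype classifier of Eq. (1), scoring each candidate label by the inner product of the document vector with that label's prototype. The brute-force enumeration of $\Lambda = 2^T$ is exactly the computational burden discussed above; the random prototypes and toy sizes are assumptions made for the example.

```python
# Illustrative prototype classifier of Eq. (1): predict the label whose
# prototype m_lambda has the largest inner product with the document vector x.
from itertools import combinations
import numpy as np

def all_labels(num_topics):
    """Enumerate Lambda = 2^T as tuples of topic indices (exponential in num_topics)."""
    for size in range(num_topics + 1):
        yield from combinations(range(num_topics), size)

def predict(x, prototypes):
    """prototypes: dict mapping label (tuple of topic indices) -> prototype vector m_lambda."""
    return max(prototypes, key=lambda lab: float(np.dot(x, prototypes[lab])))

# Toy usage with random prototypes for l = 4 topics (2^4 = 16 possible labels).
rng = np.random.default_rng(1)
d, l = 5, 4
prototypes = {lab: rng.normal(size=d) for lab in all_labels(l)}
x = rng.normal(size=d)
print("predicted label:", predict(x, prototypes))
```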
Then we replace $\{e_\lambda\}_{\lambda\in\Lambda}$ with (generally) non-orthogonal vectors $\{\phi(\lambda)\}_{\lambda\in\Lambda}$ whose geometrical configuration reflects label similarity. More formally, we use vectors $\{\phi(\lambda)\}_{\lambda\in\Lambda}$ that satisfy the condition

  $\langle \phi(\lambda_1), \phi(\lambda_2) \rangle_S = S(\lambda_1, \lambda_2)$  for all $\lambda_1, \lambda_2 \in \Lambda$,   (4)

where $\langle\cdot,\cdot\rangle_S$ is an inner product of the vector space spanned by $\{\phi(\lambda)\}_{\lambda\in\Lambda}$, and $S$ is a Mercer kernel [5] on $\Lambda\times\Lambda$ that serves as a similarity measure between labels. We call the vector space spanned by $\{\phi(\lambda)\}$ $V_S$. With this replacement, MML's classifier becomes

  $f(x) = \arg\max_{\lambda\in\Lambda} \langle Wx, \phi(\lambda)\rangle_S$,   (5)

where $W$ is a linear map from $R^d$ to $V_S$. $W$ is the solution of the following problem:

  $\min_{W}\ \frac{1}{2}\|W\|^2 + C\sum_{i=1}^{m}\ \sum_{\lambda\in\Lambda,\,\lambda\neq L_i}\xi_i^{\lambda}$  s.t.  $\big\langle Wx_i,\ \frac{\phi(L_i)-\phi(\lambda)}{\|\phi(L_i)-\phi(\lambda)\|}\big\rangle_S \ge 1-\xi_i^{\lambda},\ \ \xi_i^{\lambda}\ge 0$  for $1\le i\le m,\ \forall\lambda\neq L_i$.   (6)

Note that if $\phi(\lambda)$ is replaced by $e_\lambda$, Eq. (6) becomes identical to Eq. (2) except for a scale factor. Thus Eq. (5) and Eq. (6) are natural extensions of the multi-class classifier in [4]. We call the MTC classifier of Eq. (5) and Eq. (6) "Maximal Margin Labeling" (MML).

Figure 1 explains the margin (the inner product in Eq. (6)) in MML. The margin represents the distance from the image of the training sample $x_i$ to the boundary between the correct label $L_i$ and a wrong label $\lambda$. MML optimizes the linear map $W$ so that the smallest margin over all training samples and all possible labels becomes maximal, with a penalty $C$ for samples that penetrate into the margin. [Figure 1: Maximal Margin Labeling.]

Dual form. For numerical computation, the following Wolfe dual form of Eq. (6) is more convenient (we omit its derivation due to space limits):

  $\max_{\alpha}\ \sum_{i,\lambda}\alpha_i^{\lambda} - \frac{1}{2}\sum_{i,\lambda}\sum_{i',\lambda'}\alpha_i^{\lambda}\alpha_{i'}^{\lambda'}(x_i\cdot x_{i'})\, \frac{S(L_i,L_{i'})-S(L_i,\lambda')-S(\lambda,L_{i'})+S(\lambda,\lambda')}{2\sqrt{(1-S(L_i,\lambda))(1-S(L_{i'},\lambda'))}}$  s.t.  $0\le\alpha_i^{\lambda}\le C$  for $1\le i\le m,\ \forall\lambda\neq L_i$,   (7)

where we denote $\sum_{i=1}^{m}\sum_{\lambda\in\Lambda,\,\lambda\neq L_i}$ by $\sum_{i,\lambda}$, and the $\alpha_i^{\lambda}$ are the dual variables corresponding to the first inequality constraints in Eq. (6). Note that Eq. (7) does not contain $\phi(\lambda)$: all computations involving $\phi$ can be done through the label similarity $S$. Additionally, $x_i$ only appears in inner products, and can therefore be replaced by any kernel of $x$.

Using the solution $\alpha_i^{\lambda}$ of Eq. (7), the MML classifier in Eq. (5) can be written as

  $f(x) = \arg\max_{L\in\Lambda}\ \sum_{i,\lambda}\alpha_i^{\lambda}(x\cdot x_i)\, \frac{S(L_i,L)-S(\lambda,L)}{\sqrt{2(1-S(L_i,\lambda))}}$.   (8)

Label similarity.³ As examples of label similarity, we use two similarity measures, the Dice measure and the cosine measure:

  Dice measure:  $S_D(\lambda_1,\lambda_2) = \frac{2|\lambda_1\cap\lambda_2|}{|\lambda_1|+|\lambda_2|} = \frac{2\sum_{j=1}^{l}\lambda_1[j]\lambda_2[j]}{\sum_{j=1}^{l}\lambda_1[j] + \sum_{j=1}^{l}\lambda_2[j]}$,   (9)

  Cosine measure:  $S_C(\lambda_1,\lambda_2) = \frac{|\lambda_1\cap\lambda_2|}{\sqrt{|\lambda_1|}\sqrt{|\lambda_2|}} = \frac{\sum_{j=1}^{l}\lambda_1[j]\lambda_2[j]}{\sqrt{\sum_{j=1}^{l}\lambda_1[j]}\sqrt{\sum_{j=1}^{l}\lambda_2[j]}}$.   (10)

³ The following discussion is easily extended to include the case in which both $\lambda_1$ and $\lambda_2$ are empty, although we do not discuss that case due to space limits.

4 Efficient Implementation

4.1 Approximation in Learning

Eq. (7) contains a sum over all possible labels. As the number of topics $l$ increases, this summation rapidly becomes intractable, since $|\Lambda|$ grows exponentially as $2^l$. To circumvent this problem, we approximate the sum over all possible labels in Eq. (7) by the partial sum over those $\alpha_i^{\lambda}$ with $|(L_i\cap\lambda^{c})\cup(L_i^{c}\cap\lambda)| = 1$ (i.e. over labels $\lambda$ that differ from $L_i$ in exactly one topic), and set all the other $\alpha_i^{\lambda}$ to zero. This approximation greatly reduces the burden of the summation: the number of summands drops from $2^l$ to $l$, a huge reduction, especially when many topics exist. (A small sketch of the two label-similarity measures of Eqs. (9)–(10) follows below.)
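As promised above, here is a minimal sketch (illustrative only) of the Dice and cosine label similarities of Eqs. (9) and (10), computed on labels represented as topic sets; the handling of empty labels is an assumed convention (cf. footnote 3), not something specified in the text.

```python
# Illustrative Dice and cosine label similarities (Eqs. 9-10) on sets of topics.
import math

def dice_similarity(l1: set, l2: set) -> float:
    """S_D(l1, l2) = 2|l1 ∩ l2| / (|l1| + |l2|)."""
    if not l1 and not l2:
        return 1.0                 # assumed convention for two empty labels
    return 2.0 * len(l1 & l2) / (len(l1) + len(l2))

def cosine_similarity(l1: set, l2: set) -> float:
    """S_C(l1, l2) = |l1 ∩ l2| / (sqrt(|l1|) * sqrt(|l2|))."""
    if not l1 or not l2:
        return 1.0 if (not l1 and not l2) else 0.0   # assumed convention for empty labels
    return len(l1 & l2) / (math.sqrt(len(l1)) * math.sqrt(len(l2)))

a = {"quantum physics", "computer science"}
b = {"computer science"}
print(dice_similarity(a, b))    # 2*1/(2+1) ≈ 0.667
print(cosine_similarity(a, b))  # 1/(sqrt(2)*sqrt(1)) ≈ 0.707
```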
To understand the rationale behind this approximation, first note that $\alpha_i^{\lambda}$ is the dual variable corresponding to the first inequality constraint (the margin constraint) in Eq. (7). Thus $\alpha_i^{\lambda}$ is non-zero if and only if $Wx_i$ falls inside the margin between $\phi(L_i)$ and $\phi(\lambda)$. We assume that this margin violation mainly occurs when $\phi(\lambda)$ is "close" to $\phi(L_i)$, i.e. when $|(L_i\cap\lambda^{c})\cup(L_i^{c}\cap\lambda)| = 1$. If this assumption holds well, the proposed approximation of the sum will lead to a good approximation of the exact solution.

4.2 Polynomial-Time Algorithms for Classification

The classification step of MML (Eq. (8)) involves a combinatorial maximization over all possible labels, so it can be computationally demanding. However, efficient classification algorithms are available when either the cosine measure or the Dice measure is used as the label similarity. Eq. (8) can be divided into subproblems by the number of topics in a label:

  $f(x) = \arg\max_{L\in\{\hat L_1, \hat L_2, \ldots, \hat L_l\}} g(x, L)$,   (11)
  $\hat L_n = \arg\max_{L\in\Lambda,\,|L|=n} g(x, L)$,   (12)

where $g(x,L) = \sum_{j=1}^{l} c_n[j]\, L[j]$ and

  $c_n[j] = \sum_{i,\lambda}\frac{\alpha_i^{\lambda}(x\cdot x_i)}{\sqrt{2(1-S_D(L_i,\lambda))}}\Big(\frac{2L_i[j]}{|L_i|+n} - \frac{2\lambda[j]}{|\lambda|+n}\Big)$  if $S_D$ is used,   (13)
  $c_n[j] = \sum_{i,\lambda}\frac{\alpha_i^{\lambda}(x\cdot x_i)}{\sqrt{2(1-S_C(L_i,\lambda))}}\Big(\frac{L_i[j]}{\sqrt{|L_i|}\sqrt{n}} - \frac{\lambda[j]}{\sqrt{|\lambda|}\sqrt{n}}\Big)$  if $S_C$ is used.

Here $n = |L|$. The computational cost of Eq. (13) for all $j$ is $O(n_\alpha l)$ (where $n_\alpha$ is the number of non-zero $\alpha$), and that of Eq. (12) is $O(l\log l)$. Thus the total cost of classification by Eq. (11) is $O(n_\alpha l^2 + l^2\log l)$. Moreover, $n_\alpha$ is $O(ml)$ under the approximation described above, so classification can be done within $O(ml^3)$ computational steps, a significant reduction compared with the brute-force search over Eq. (8). (A compact sketch of this two-stage search appears after Table 3 below.)

5 Experiments

In this section, we report experiments that compared MML to PMM [3], SVM⁵ [6], and BoosTexter [2] on a collection of Web pages. We used a normalized linear kernel $k(x,x') = x\cdot x'/(\|x\|\,\|x'\|)$ in MML and SVM. For BoosTexter, "real abstaining AdaBoost.MH" was used as the weak learner.

⁵ For each topic, an SVM classifier is trained to predict whether the topic is relevant (positive) or irrelevant (negative) to input documents.

5.1 Experimental Setup

The datasets used in our experiments represent the Web page collection used in [3] (Table 2). The Web pages were collected through the hyperlinks from Yahoo!'s top directory (www.yahoo.com), and then divided into 11 datasets by Yahoo!'s top category.

Dataset (Abbrev.)            #Text    #Voc    #Tpc  #Lbl   Label size frequency (%): 1 / 2 / 3 / 4 / ≥5
Arts & Humanities (Ar)        7,484   23,146   26    599   55.6 / 30.5 /  9.7 / 2.8 / 1.4
Business & Economy (Bu)      11,214   21,924   30    233   57.6 / 28.8 / 11.0 / 1.7 / 0.8
Computers & Internet (Co)    12,444   34,096   33    428   69.8 / 18.2 /  7.8 / 3.0 / 1.1
Education (Ed)               12,030   27,534   33    511   66.9 / 23.4 /  7.3 / 1.9 / 0.6
Entertainment (En)           12,730   32,001   21    337   72.3 / 21.1 /  4.5 / 1.0 / 1.1
Health (He)                   9,205   30,605   32    335   53.2 / 34.0 /  9.5 / 2.4 / 0.9
Recreation (Rc)              12,828   30,324   22    530   69.2 / 23.1 /  5.6 / 1.4 / 0.6
Reference (Rf)                8,027   39,679   33    275   85.5 / 12.6 /  1.5 / 0.3 / 0.1
Science (Si)                  6,428   37,187   40    457   68.0 / 22.3 /  7.3 / 1.9 / 0.5
Social Science (SS)          12,111   52,350   39    361   78.4 / 17.0 /  3.7 / 0.7 / 0.3
Society & Culture (SC)       14,512   31,802   27   1054   59.6 / 26.1 /  9.2 / 2.9 / 2.2
Table 2: A summary of the Web page datasets. "#Text" is the number of texts in the dataset, "#Voc" the number of vocabularies (i.e. features), "#Tpc" the number of topics, "#Lbl" the number of labels, and "Label Size Frequency" the relative frequency of each label size (the label size is the number of topics in a label).

Method       Feature type    Parameters
MML          TF, TF×IDF      C = 0.1, 1, 10
PMM          TF              Model1, Model2
SVM          TF, TF×IDF      C = 0.1, 1, 10
BoosTexter   Binary          R = {2, 4, 6, 8, 10}×10³
Table 3: Candidate feature types and learning parameters (R is the number of weak hypotheses). The underlined features and parameters were selected for the evaluation with the test data.
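Returning to Section 4.2 above, the following sketch is one way to realize the two-stage classification of Eqs. (11)–(13) with the Dice measure. It is an interpretation of the text, not the authors' code; the data structure for the dual solution (a list of support tuples holding the non-zero dual variables) and the non-empty-label assumption are simplifications made for the example.

```python
# Illustrative decoding of Eqs. (11)-(13), Dice-measure case.
# `support` is a list of tuples (alpha, x_i, L_i, lam) for the non-zero dual
# variables, where L_i and lam are labels encoded as 0/1 lists of length l
# (assumed non-empty so the Dice denominator is positive).
import math
import numpy as np

def dice(L1, L2):
    inter = sum(a and b for a, b in zip(L1, L2))
    return 2.0 * inter / (sum(L1) + sum(L2))

def classify(x, support, l):
    best_label, best_score = None, -math.inf
    for n in range(1, l + 1):                       # candidate label sizes (Eq. 12)
        c = np.zeros(l)                             # c_n[j] of Eq. (13), Dice version
        for alpha, x_i, L_i, lam in support:
            w = alpha * float(np.dot(x, x_i)) / math.sqrt(2.0 * (1.0 - dice(L_i, lam)))
            for j in range(l):
                c[j] += w * (2.0 * L_i[j] / (sum(L_i) + n) - 2.0 * lam[j] / (sum(lam) + n))
        top = np.argsort(c)[-n:].tolist()           # best size-n label keeps the n largest c_n[j]
        score = float(c[top].sum())                 # g(x, L_hat_n)
        if score > best_score:
            best_score = score
            best_label = [1 if j in set(top) else 0 for j in range(l)]
    return best_label                               # Eq. (11): best label over all sizes
```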
Each page is labeled with the Yahoo! second-level sub-categories from which it is hyperlinked (thus, the sub-categories are the topics in our terminology); see [3] for more details about the collection. The Web pages were then converted into three types of feature vectors: (a) binary vectors, where each feature indicates the presence (1) or absence (0) of a term; (b) TF vectors, where each feature is the number of appearances of a term (term frequency); and (c) TF×IDF vectors, where each feature is the product of term frequency and inverse document frequency [7].

To select the best combination of feature type and learning parameters (such as the penalty C for MML), the learners were trained on 2,000 Web pages with all combinations of feature and parameter listed in Table 3, and then evaluated by labeling F-measure on independently drawn development data. The combinations that achieved the best labeling F-measure (underlined in Table 3) were used in the following experiments.

5.2 Evaluation Measures

We used three measures to evaluate labeling performance: labeling F-measure, exact match ratio, and retrieval F-measure. In the following definitions, $\{L_i^{pred}\}_{i=1}^{n}$ and $\{L_i^{true}\}_{i=1}^{n}$ denote the predicted labels and the true labels, respectively.

Labeling F-measure. The labeling F-measure $F_L$ evaluates the average labeling performance while taking partial matches into account:

  $F_L = \frac{1}{n}\sum_{i=1}^{n}\frac{2|L_i^{pred}\cap L_i^{true}|}{|L_i^{pred}|+|L_i^{true}|} = \frac{1}{n}\sum_{i=1}^{n}\frac{2\sum_{j=1}^{l}L_i^{pred}[j]\,L_i^{true}[j]}{\sum_{j=1}^{l}(L_i^{pred}[j]+L_i^{true}[j])}$.   (14)

Exact match ratio. The exact match ratio $EX$ counts only exact matches between the predicted label and the true label:

  $EX = \frac{1}{n}\sum_{i=1}^{n} I[L_i^{pred} = L_i^{true}]$,   (15)

where $I[S]$ is 1 if the statement $S$ is true and 0 otherwise.

Retrieval F-measure.⁶ For real tasks it is also important to evaluate retrieval performance, i.e. how accurately classifiers can find the relevant texts for a given topic. The retrieval F-measure $F_R$ measures the average retrieval performance over all topics:

  $F_R = \frac{1}{l}\sum_{j=1}^{l}\frac{2\sum_{i=1}^{n}L_i^{pred}[j]\,L_i^{true}[j]}{\sum_{i=1}^{n}(L_i^{pred}[j]+L_i^{true}[j])}$.   (16)

⁶ $F_R$ is called "the macro average of F-measures" in the text categorization community.
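A minimal sketch (illustrative, not the authors' evaluation code) of the three measures of Eqs. (14)–(16); labels are assumed to be 0/1 lists over the l topics, and the treatment of empty denominators is an assumed convention.

```python
# Illustrative implementations of Eqs. (14)-(16).
def labeling_f_measure(preds, trues):
    """Eq. (14): per-document F-measure between predicted and true topic sets, averaged."""
    total = 0.0
    for p, t in zip(preds, trues):
        denom = sum(p) + sum(t)
        overlap = sum(pj * tj for pj, tj in zip(p, t))
        total += 2.0 * overlap / denom if denom else 1.0   # empty/empty counted as perfect (assumed)
    return total / len(preds)

def exact_match_ratio(preds, trues):
    """Eq. (15): fraction of documents whose predicted label equals the true label exactly."""
    return sum(p == t for p, t in zip(preds, trues)) / len(preds)

def retrieval_f_measure(preds, trues):
    """Eq. (16): per-topic F-measure averaged over topics (the macro-averaged F-measure)."""
    l, total = len(trues[0]), 0.0
    for j in range(l):
        denom = sum(p[j] for p in preds) + sum(t[j] for t in trues)
        overlap = sum(p[j] * t[j] for p, t in zip(preds, trues))
        total += 2.0 * overlap / denom if denom else 1.0   # assumed convention
    return total / l

preds = [[1, 1, 0], [0, 1, 0]]
trues = [[1, 0, 0], [0, 1, 1]]
print(labeling_f_measure(preds, trues), exact_match_ratio(preds, trues), retrieval_f_measure(preds, trues))
```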
5.3 Results

First, we trained the classifiers with 2,000 randomly chosen samples. We then calculated the three evaluation measures on 3,000 other randomly chosen samples. This process was repeated five times, and the resulting averaged values are shown in Table 4.

Labeling F-measure            Exact match ratio              Retrieval F-measure
Dataset  MD   MC   PM   SV   BO  |  MD   MC   PM   SV   BO  |  MD   MC   PM   SV   BO
Ar      0.55 0.44 0.50 0.46 0.38 | 0.44 0.32 0.21 0.29 0.22 | 0.30 0.26 0.24 0.29 0.22
Bu      0.80 0.81 0.75 0.76 0.75 | 0.63 0.62 0.48 0.57 0.53 | 0.25 0.27 0.20 0.29 0.20
Co      0.62 0.59 0.61 0.55 0.47 | 0.51 0.46 0.35 0.41 0.34 | 0.27 0.25 0.19 0.30 0.17
Ed      0.56 0.43 0.51 0.48 0.37 | 0.45 0.34 0.19 0.30 0.23 | 0.25 0.23 0.21 0.25 0.16
En      0.64 0.52 0.61 0.54 0.49 | 0.55 0.44 0.31 0.42 0.36 | 0.37 0.33 0.30 0.35 0.29
He      0.74 0.74 0.66 0.67 0.60 | 0.58 0.53 0.34 0.47 0.39 | 0.35 0.35 0.23 0.35 0.26
Rc      0.63 0.46 0.55 0.49 0.44 | 0.54 0.38 0.25 0.37 0.33 | 0.47 0.39 0.36 0.40 0.33
Rf      0.67 0.58 0.63 0.56 0.50 | 0.60 0.51 0.39 0.49 0.41 | 0.29 0.25 0.24 0.25 0.16
Si      0.61 0.54 0.52 0.47 0.39 | 0.52 0.43 0.22 0.36 0.28 | 0.37 0.35 0.28 0.31 0.19
SS      0.73 0.71 0.66 0.64 0.59 | 0.65 0.60 0.45 0.55 0.49 | 0.36 0.35 0.18 0.31 0.15
SC      0.60 0.55 0.54 0.49 0.44 | 0.44 0.40 0.21 0.32 0.27 | 0.29 0.28 0.25 0.26 0.20
Avg     0.65 0.58 0.59 0.56 0.49 | 0.54 0.46 0.31 0.41 0.35 | 0.32 0.30 0.24 0.31 0.21
Table 4: Performance comparison by labeling F-measure, exact match ratio, and retrieval F-measure. The bold figures are the best among the five methods, and the underlined figures the second best. MD, MC, PM, SV, and BO represent MML with S_D, MML with S_C, PMM, SVM, and BoosTexter, respectively.

Table 4 shows that MML with the Dice measure outperforms the other methods in labeling F-measure and exact match ratio. The MMLs also show the best performance with regard to retrieval F-measure, although the margins over the other methods are not as large as those observed for labeling F-measure and exact match ratio. Note that no classifier except MML with the Dice measure achieves good performance on all three measures; PMM, for example, shows high labeling F-measures but performs rather poorly when evaluated by retrieval F-measure.

As a second experiment, we evaluated classifiers trained with 250–2,000 training samples on the same test samples. Figure 2 shows each measure averaged over all datasets. The MMLs show high generalization even when the training data is small. An interesting point is that MML with the cosine measure achieves rather high labeling F-measure and retrieval F-measure with smaller training sets; this advantage, however, does not persist when trained on larger data.

[Figure 2: The learning curves of labeling F-measure (left), exact match ratio (middle), and retrieval F-measure (right). MD, MC, PM, SV, and BO mean the same as in Table 4.]

6 Conclusion

In this paper, we proposed a novel learning algorithm for multi-topic text categorization. The algorithm, Maximal Margin Labeling, embeds labels (sets of topics) into a similarity-induced vector space and learns a large-margin classifier in that space. To overcome the demanding computational cost of MML, we provided an approximation method for learning and efficient classification algorithms. In experiments on a collection of Web pages, MML outperformed other methods, including SVM, and showed better generalization.

Acknowledgement
The authors would like to thank Naonori Ueda, Kazumi Saito and Yuji Kaneda of Nippon Telegraph and Telephone Corporation for providing PMM's code and the datasets.

References
[1] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proc. of the 10th European Conference on Machine Learning, number 1398, pages 137–142, 1998.
[2] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.
[3] Naonori Ueda and Kazumi Saito. Parametric mixture models for multi-topic text. In Advances in Neural Information Processing Systems 15, pages 1261–1268, 2003.
[4] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[5] Klaus-Robert Müller, Sebastian Mika, Gunnar Rätsch, Koji Tsuda, and Bernhard Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.
[6] Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, Inc., 1998.
[7] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
same-paper 2 0.86218613 99 nips-2004-Learning Hyper-Features for Visual Identification
Author: Andras D. Ferencz, Erik G. Learned-miller, Jitendra Malik
Abstract: We address the problem of identifying specific instances of a class (cars) from a set of images all belonging to that class. Although we cannot build a model for any particular instance (as we may be provided with only one “training” example of it), we can use information extracted from observing other members of the class. We pose this task as a learning problem, in which the learner is given image pairs, labeled as matching or not, and must discover which image features are most consistent for matching instances and discriminative for mismatches. We explore a patch based representation, where we model the distributions of similarity measurements defined on the patches. Finally, we describe an algorithm that selects the most salient patches based on a mutual information criterion. This algorithm performs identification well for our challenging dataset of car images, after matching only a few, well chosen patches. 1
3 0.76444083 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
Author: Ofer Dekel, Shai Shalev-shwartz, Yoram Singer
Abstract: Prediction suffix trees (PST) provide a popular and effective tool for tasks such as compression, classification, and language modeling. In this paper we take a decision theoretic view of PSTs for the task of sequence prediction. Generalizing the notion of margin to PSTs, we present an online PST learning algorithm and derive a loss bound for it. The depth of the PST generated by this algorithm scales linearly with the length of the input. We then describe a self-bounded enhancement of our learning algorithm which automatically grows a bounded-depth PST. We also prove an analogous mistake-bound for the self-bounded algorithm. The result is an efficient algorithm that neither relies on a-priori assumptions on the shape or maximal depth of the target PST nor does it require any parameters. To our knowledge, this is the first provably-correct PST learning algorithm which generates a bounded-depth PST while being competitive with any fixed PST determined in hindsight. 1
4 0.75973177 131 nips-2004-Non-Local Manifold Tangent Learning
Author: Yoshua Bengio, Martin Monperrus
Abstract: We claim and present arguments to the effect that a large class of manifold learning algorithms that are essentially local and can be framed as kernel learning algorithms will suffer from the curse of dimensionality, at the dimension of the true underlying manifold. This observation suggests to explore non-local manifold learning algorithms which attempt to discover shared structure in the tangent planes at different positions. A criterion for such an algorithm is proposed and experiments estimating a tangent plane prediction function are presented, showing its advantages with respect to local manifold learning algorithms: it is able to generalize very far from training data (on learning handwritten character image rotations), where a local non-parametric method fails. 1
5 0.75283027 151 nips-2004-Rate- and Phase-coded Autoassociative Memory
Author: Máté Lengyel, Peter Dayan
Abstract: Areas of the brain involved in various forms of memory exhibit patterns of neural activity quite unlike those in canonical computational models. We show how to use well-founded Bayesian probabilistic autoassociative recall to derive biologically reasonable neuronal dynamics in recurrently coupled models, together with appropriate values for parameters such as the membrane time constant and inhibition. We explicitly treat two cases. One arises from a standard Hebbian learning rule, and involves activity patterns that are coded by graded firing rates. The other arises from a spike timing dependent learning rule, and involves patterns coded by the phase of spike times relative to a coherent local field potential oscillation. Our model offers a new and more complete understanding of how neural dynamics may support autoassociation. 1
6 0.75198054 178 nips-2004-Support Vector Classification with Input Data Uncertainty
7 0.75169337 102 nips-2004-Learning first-order Markov models for control
8 0.75161362 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
9 0.75016105 127 nips-2004-Neighbourhood Components Analysis
10 0.74888855 25 nips-2004-Assignment of Multiplicative Mixtures in Natural Images
11 0.74866825 53 nips-2004-Discriminant Saliency for Visual Recognition from Cluttered Scenes
12 0.7484417 69 nips-2004-Fast Rates to Bayes for Kernel Machines
13 0.7475614 163 nips-2004-Semi-parametric Exponential Family PCA
14 0.74737293 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
15 0.74709129 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning
16 0.74663186 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
17 0.74652737 116 nips-2004-Message Errors in Belief Propagation
18 0.74606735 207 nips-2004-ℓ₀-norm Minimization for Basis Selection
19 0.74600399 4 nips-2004-A Generalized Bradley-Terry Model: From Group Competition to Individual Skill
20 0.74562544 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform