nips nips2006 nips2006-66 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Alessandro Bissacco, Ming-Hsuan Yang, Stefano Soatto
Abstract: We consider the problem of detecting humans and classifying their pose from a single image. Specifically, our goal is to devise a statistical model that simultaneously answers two questions: 1) is there a human in the image? and, if so, 2) what is a low-dimensional representation of her pose? We investigate models that can be learned in an unsupervised manner on unlabeled images of human poses, and provide information that can be used to match the pose of a new image to the ones present in the training set. Starting from a set of descriptors recently proposed for human detection, we apply the Latent Dirichlet Allocation framework to model the statistics of these features, and use the resulting model to answer the above questions. We show how our model can efficiently describe the space of images of humans with their pose, by providing an effective representation of poses for tasks such as classification and matching, while performing remarkably well in human/non human decision problems, thus enabling its use for human detection. We validate the model with extensive quantitative experiments and comparisons with other approaches on human detection and pose matching. 1
Reference: text
sentIndex sentText sentNum sentScore
1 We consider the problem of detecting humans and classifying their pose from a single image. [sent-6, score-0.624]
2 Specifically, our goal is to devise a statistical model that simultaneously answers two questions: 1) is there a human in the image? [sent-7, score-0.261]
3 We investigate models that can be learned in an unsupervised manner on unlabeled images of human poses, and provide information that can be used to match the pose of a new image to the ones present in the training set. [sent-9, score-0.922]
4 Starting from a set of descriptors recently proposed for human detection, we apply the Latent Dirichlet Allocation framework to model the statistics of these features, and use the resulting model to answer the above questions. [sent-10, score-0.381]
5 We validate the model with extensive quantitative experiments and comparisons with other approaches on human detection and pose matching. [sent-12, score-0.896]
6 1 Introduction Human detection and localization from a single image is an active area of research that has witnessed a surge of interest in recent years [9, 18, 6]. [sent-13, score-0.263]
7 Simply put, given an image, we want to devise an automatic procedure that locates the regions that contain human bodies in arbitrary pose. [sent-14, score-0.229]
8 This is hard because of the wide variability that images of humans exhibit. [sent-15, score-0.213]
9 Given that it is impractical to explicitly model nuisance factors such as clothing, lighting conditions, viewpoint, body pose, partial and/or self-occlusions, one can learn a descriptive model of human/non human statistics. [sent-16, score-0.333]
10 Consequently, the main focus of research on human detection so far has been on deriving a suitable representation [9, 18, 6]. [sent-18, score-0.473]
11 Recently, local descriptors based on histograms of gradient orientations, such as [6], have proven particularly successful for human detection tasks. [sent-21, score-0.743]
12 The main idea is to use distributions of gradient orientations in order to be insensitive to color, brightness and contrast changes and, to some extent, local deformations. [sent-22, score-0.242]
13 We show how a special class of hierarchical Bayesian processes can be used as generative models for these features and applied to the problem of detection and pose classification. [sent-24, score-0.721]
14 This work can be interpreted as an attempt to bridge the gap between the two related problems of human detection and pose estimation in the literature. [sent-25, score-0.864]
15 In human detection, since a simple yes/no answer is required, there is no need to introduce a complex model with latent variables associated with physical quantities. [sent-26, score-0.331]
16 In pose estimation, on the other hand, the goal is to infer these quantities and therefore a full generative model is a natural approach. [sent-27, score-0.549]
17 However, these quantities are instrumental to both human detection and the pose classification problem. [sent-30, score-0.864]
18 The main difficulty is in the representation of the pose information. [sent-31, score-0.503]
19 Humans are highly articulated objects with many degrees of freedom, which makes defining pose classes a remarkably difficult problem. [sent-32, score-0.462]
20 We propose an approach which allows for unsupervised clustering of images of humans and provides a low dimensional representation encoding essential information on their pose. [sent-35, score-0.326]
21 The chief difference with standard clustering or dimensionality reduction techniques is that we derive a full probabilistic framework, which provides principled ways to combine and compare different models, as required for tasks such as human detection, pose classification and matching. [sent-36, score-0.696]
22 2 Context and Motivation The literature on human detection and pose estimation is too broad for us to review here. [sent-37, score-0.864]
23 Detecting humans and estimating their poses from single images is a fundamental problem with a range of practical applications, such as image retrieval and understanding. [sent-39, score-0.421]
24 It makes sense to tackle this problem since humans can readily tell the locations and poses of people from the visual information contained in photographs. [sent-40, score-0.277]
25 Numerous representation schemes have been exploited for human detection, e.g. [sent-42, score-0.239]
26 Haar wavelets [18], edges [9], gradient orientations [6], gradients and second derivatives [19], and regions from image segmentation [15]. [sent-44, score-0.292]
27 With these representations, algorithms such as template matching [9], support vector machines [19, 6], AdaBoost [18], and grouping [15], to name a few, have been applied to the detection process. [sent-45, score-0.327]
28 Most approaches to pose estimation are based on body part detectors, using either edge, shape, color and texture cues [7, 21, 15], or detectors learned from training data [19]. [sent-46, score-0.614]
29 These works focus on only one of the two problems, either detection or pose estimation. [sent-48, score-0.666]
30 Thus we want to simultaneously perform detection and pose classification, and we want to do it in an unsupervised manner. [sent-50, score-0.703]
31 We start from the representation [6] based on gradient histograms recently applied to human detection with excellent results, and derive a probabilistic model for it. [sent-53, score-0.659]
32 (Histogram of Oriented Gradients) Local descriptors based on gradient orientations are among the most successful representations for image-based detection and matching, as first demonstrated by Lowe in [14]. [sent-59, score-0.466]
33 This descriptor is obtained by computing weighted histograms of gradient orientations over a grid of spatial neighborhoods (cells), which are then grouped in overlapping regions (blocks) and normalized for brightness and contrast changes. [sent-61, score-0.366]
34 Assume that we are given a patch of 64 × 128 pixels; we divide the patch into cells of 8 × 8 pixels, and for each cell a gradient orientation histogram is computed. [sent-62, score-0.282]
35 The histogram represents a quantization of gradient orientations in the range 0°–180° into 9 bins. [sent-63, score-0.498]
36 Each pixel contributes to the neighboring bins, both in orientation and space, by an amount proportional to the gradient magnitude and linearly decreasing with the distance from the bin center. [sent-64, score-0.236]
37 The final descriptor is a collection of histograms from overlapping blocks (each cell is shared by 4 blocks). [sent-67, score-0.225]
38 The main characteristic of such a representation is robustness to local deformations, illumination changes and, to a limited extent, viewpoint and pose changes due to coarsening of the histograms. [sent-68, score-0.503]
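To make the construction above concrete, the following minimal sketch (Python/NumPy, not the authors' code) computes per-cell orientation histograms for a 64 × 128 patch with 8 × 8 cells and 9 bins over 0°–180°. For brevity, each pixel votes for a single bin weighted by its gradient magnitude; the soft orientation/spatial interpolation and the block normalization described above are omitted.

```python
import numpy as np

def cell_histograms(patch, cell=8, n_bins=9):
    """Per-cell gradient-orientation histograms (hard-assignment sketch)."""
    patch = patch.astype(float)
    gy, gx = np.gradient(patch)                    # image gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0   # unsigned orientation in [0, 180)
    n_rows, n_cols = patch.shape[0] // cell, patch.shape[1] // cell
    hists = np.zeros((n_rows, n_cols, n_bins))
    for i in range(n_rows):
        for j in range(n_cols):
            block = np.s_[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            hists[i, j], _ = np.histogram(ang[block], bins=n_bins,
                                          range=(0.0, 180.0), weights=mag[block])
    return hists

patch = np.random.rand(128, 64)        # stand-in for a 64 x 128 detection window
print(cell_histograms(patch).shape)    # (16, 8, 9): 128 cells x 9 orientation bins
```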
39 In order to handle the larger variations typical of human body images, we need to complement this representation with a model. [sent-69, score-0.357]
40 A document w = (w1, w2, · · · , wW) is a collection of word counts wj, with w1 + · · · + wW = N. [sent-74, score-0.421]
41 For each word j = 1, · · · , W in the dictionary, choose a word count wj ∼ p(wj |θ, β). [sent-82, score-0.283]
42 where the word counts wj are drawn from a discrete distribution conditioned on the topic proportions θ, with word probabilities given by the j-th entry of βθ, i.e., p(wj | θ, β) = Σk βjk θk. [sent-83, score-0.561]
43 The hyperparameter α ∈ R+^K represents the prior on the topic distribution, θ ∈ R+^K are the topic proportions, and β ∈ R+^(W×K) are the parameters of the word distributions conditioned on topics. [sent-86, score-0.554]
44 In the context of this work, words correspond to oriented gradients, while documents and the corpus correspond to images and sets of images, respectively. [sent-87, score-0.31]
45 The topics derived by the LDA model correspond to the poses of interest in this work. [sent-88, score-0.711]
46 Here we can safely assume that the topic distributions β are deterministic parameters; later, for the purpose of inference, we will treat them as random variables and assign them a Dirichlet prior with hyperparameter η. [sent-89, score-0.281]
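For concreteness, here is an illustrative sketch of the generative process just described, with histogram bins playing the role of dictionary words and images the role of documents. The dictionary size, topic count, tokens per image, and the random topic matrix β are hypothetical stand-ins, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W, K, N = 3780, 20, 500                 # hypothetical: dictionary size (bins), topics, tokens per image
alpha = np.full(K, 2.0 / K)             # scalar Dirichlet prior, as in the experiments below
beta = rng.dirichlet(np.ones(W), size=K).T   # W x K; column k is the word distribution of topic k

def generate_image_counts():
    theta = rng.dirichlet(alpha)        # topic proportions for this image
    word_probs = beta @ theta           # p(word j | theta, beta) = sum_k beta[j, k] * theta[k]
    tokens = rng.choice(W, size=N, p=word_probs)
    return theta, np.bincount(tokens, minlength=W)   # bag-of-words counts w_j

theta, w = generate_image_counts()
print(w.sum(), theta.shape)             # 500 (20,)
```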
47 Here we consider the histogram of oriented gradients [6] as the basic feature from which we build our generative model, but let us point out that the framework we introduce is more general and can be applied to any descriptor based on histograms. [sent-96, score-0.39]
48 In this histogram descriptor, each bin represents the intensity of the gradient at a particular location, defined by a range of orientations and a local neighborhood (cell). [sent-97, score-0.431]
49 We tested the effect of quantization on the performance of the human detector based on Histogram of Oriented Gradient descriptors and linear Support Vector Machines described in [6]. [sent-106, score-0.468]
50 Thus in what follows we can safely assume that the basic features are collections of small integers, the histogram bin counts wj . [sent-108, score-0.426]
51 Thus, if we quantize histogram bins and assign a unique word to each bin, we obtain a representation for which we can directly apply the LDA framework. [sent-109, score-0.341]
52 Analogous to document analysis, an orientation histogram computed on an image patch is a document w represented as a bag of words ( w1 , · · · , wW ), where the word counts wj are the bin heights. [sent-110, score-0.754]
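A minimal sketch of this descriptor-to-document mapping follows; the specific quantization rule (scaling each descriptor to 16 integer levels) is an assumption consistent with the quantization experiment mentioned below, not necessarily the exact procedure used by the authors.

```python
import numpy as np

def descriptor_to_counts(descriptor, levels=16):
    """Quantize non-negative histogram bins into small integer word counts."""
    d = np.asarray(descriptor, dtype=float)
    scale = (levels - 1) / (d.max() + 1e-12)   # assumed per-descriptor scaling
    return np.round(d * scale).astype(int)     # one count per bin/word

hog = np.abs(np.random.randn(3780))            # stand-in descriptor (not unit-norm)
w = descriptor_to_counts(hog)
print(w.min(), w.max())                        # small integers in [0, 15]
```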
53 We assume that such a histogram is generated by a mixture of basic components (topics), where each topic z induces a discrete distribution p(r | β·z) over the bin indices r. [sent-111, score-0.401]
54 By summing the contributions from each topic we obtain the total count wj for each bin, distributed according to p(wj |θ, β). [sent-113, score-0.326]
55 That is, the same bin may have contributions from multiple topics, and this models the fact that the bin height is the count of edges in a neighborhood which may include parts generated by different components. [sent-115, score-0.24]
56 Finally, let us point out that by assigning a unique word to each bin we model spatial information, encoded in the word identity, whereas most previous approaches do not. [sent-116, score-0.326]
57 Notice that our main goal is to develop a model to represent the statistics of images for human pose classification. [sent-120, score-0.777]
58 We use the human detection problem as a convenient testbed for validating the goodness of our representation, since for this application large labelled datasets and efficient algorithms are available. [sent-121, score-0.439]
59 By no means do we intend to compete with state-of-the-art discriminative approaches for human detection alone, which are optimized to represent the decision boundary and are thus expected to perform better than generative approaches in binary classification tasks. [sent-122, score-0.457]
60 However, if the generative model is good at capturing the statistics of human images we expect it to perform well also in discriminating humans from the background. [sent-123, score-0.498]
61 In human detection, given a set of positive and negative examples and a previously unseen image Inew , we are asked to choose between two hypotheses: either it contains a human or it is a background image. [sent-124, score-0.502]
62 The first step is to compute the gradient histogram representation w(I) for the test and training images. [sent-125, score-0.288]
63 Then we learn a model for humans and one for background images, and use a threshold to decide between the two hypotheses. (Note that, due to the particular normalization procedure applied, the histogram features we consider here do not have unit norm; in fact, they are zero on uniform regions.) [sent-126, score-0.425]
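Schematically, the resulting decision rule could look like the sketch below, assuming functions that return an approximate marginal log-likelihood of the word counts under the human and background models (e.g., obtained via Gibbs sampling or a mean-field bound); the threshold tau is a free parameter that can be swept to trade off misses against false positives.

```python
def detect_human(w_new, loglik_human, loglik_background, tau=0.0):
    """Generative detection sketch: compare the two model likelihoods."""
    # w_new: bag-of-words counts of the test window.
    # loglik_*: callables returning an approximate log p(w | model).
    score = loglik_human(w_new) - loglik_background(w_new)
    return score > tau, score
```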
64 In pose classification, we start from a set of unlabeled training examples of human poses and learn the topic distribution β. [sent-139, score-1.071]
65 This defines a probabilistic mapping to the topic variables, which can be seen as an economical representation encoding essential information about the pose. [sent-140, score-0.363]
66 That is, from a new image Inew, we estimate the topic proportions θ̂(Inew) as θ̂(Inew) = ∫ θ p(θ | w(Inew), α, β) dθ (3). Pose information can be recovered by matching the new image Inew to an image I in the training set. [sent-141, score-0.626]
67 For each training document I, in the learning step we compute the posterior topic proportions θ̂(I) as in (3). [sent-146, score-0.409]
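A minimal sketch of this matching step, assuming the posterior topic proportions of the training and test images have already been estimated as in (3); the Euclidean distance in topic space used here is an illustrative choice, not necessarily the similarity measure used in the paper.

```python
import numpy as np

def match_pose(theta_new, theta_train, top_k=10):
    """Return indices of the training images closest to theta_new in topic space."""
    dists = np.linalg.norm(theta_train - theta_new, axis=1)
    return np.argsort(dists)[:top_k]

rng = np.random.default_rng(1)
theta_train = rng.dirichlet(np.ones(8), size=200)   # K = 8 topics, 200 training frames
theta_new = rng.dirichlet(np.ones(8))
print(match_pose(theta_new, theta_train))            # indices of the 10 best matches
```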
68 (5 Experiments) We first tested the efficacy of our model on the human detection task. [sent-148, score-0.434]
69 We used the dataset provided by [6], consisting of 2340 64 × 128 images of pedestrians in various configurations and 1671 images of outdoor scenes not containing humans. [sent-149, score-0.248]
70 We first computed the histograms of oriented gradients from the image patches following the procedure outlined in Section 3. [sent-153, score-0.262]
71 Given the number of topics K and the prior hyperparameters α, η, we learned the topic distributions β and topic proportions θ̂(I) using either Gibbs sampling or Mean Field. [sent-161, score-0.751]
72 We tested both Gamma [5] and Dirichlet [3, 4] distributions for topic priors, obtaining best results with the multinomial model [4] with scalar priors αi = a, ηi = b, in these experiments a = 2/K and b = 0. [sent-162, score-0.316]
73 Eventually, an excessive number of topics causes overfitting, which can be measured as a decrease of the likelihood on the test dataset. [sent-167, score-0.253]
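For reference, a compact collapsed Gibbs sampler for the multinomial LDA variant over bag-of-words count matrices is sketched below; this is a generic textbook sampler with symmetric priors a and b (as above), not the authors' implementation, and it expands bin counts into individual tokens for simplicity.

```python
import numpy as np

def lda_gibbs(counts, K, a=None, b=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA on a (documents x words) count matrix."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    a = 2.0 / K if a is None else a
    d_idx, w_idx = np.nonzero(counts)                    # expand counts into tokens
    reps = counts[d_idx, w_idx]
    docs, words = np.repeat(d_idx, reps), np.repeat(w_idx, reps)
    z = rng.integers(K, size=docs.size)                  # random initial topic assignments
    ndk = np.zeros((D, K)); nkw = np.zeros((K, W)); nk = np.zeros(K)
    np.add.at(ndk, (docs, z), 1); np.add.at(nkw, (z, words), 1); np.add.at(nk, z, 1)
    for _ in range(n_iter):
        for i in range(docs.size):
            d, w, k = docs[i], words[i], z[i]
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1   # remove token i from the counts
            p = (ndk[d] + a) * (nkw[:, w] + b) / (nk + W * b)
            k = rng.choice(K, p=p / p.sum())             # resample its topic
            z[i] = k
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    beta = ((nkw + b) / (nk[:, None] + W * b)).T         # W x K topic-word distributions
    theta = (ndk + a) / (ndk.sum(1, keepdims=True) + K * a)  # D x K topic proportions
    return beta, theta

X = np.random.default_rng(2).integers(0, 4, size=(20, 100))  # 20 toy images, 100 bins
beta, theta = lda_gibbs(X, K=8)
print(beta.shape, theta.shape)                                # (100, 8) (20, 8)
```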
74 Using an analogy with the LDA model, the columns of W contain the topic distributions, while the columns of H represent the component weights. [sent-179, score-0.217]
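A sketch of how such an NMF baseline could be set up; the use of scikit-learn and the specific initialization are assumptions, since the paper does not specify the implementation. The count matrix is arranged as words × documents so that, as stated above, the columns of W are topic-like distributions over bins and the columns of H are the per-image component weights.

```python
import numpy as np
from sklearn.decomposition import NMF

V = np.random.default_rng(3).integers(0, 5, size=(300, 50)).astype(float)  # (bins x images) counts
model = NMF(n_components=8, init="nndsvda", max_iter=500)
W_mat = model.fit_transform(V)    # (300, 8): columns are topic-like distributions over bins
H = model.components_             # (8, 50): per-image component weights
print(W_mat.shape, H.shape)
```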
75 We would like to stress that a comparison on detection performance alone with state-of-the-art discriminative classifiers would be inappropriate, since our model targets pose classification, which is harder than binary detection. [sent-182, score-0.698]
76 For the experiments on pose classification and matching, we used the CMU Mobo dataset [11]. [sent-185, score-0.498]
77 In the first experiment we trained the model with all the views and set the number of topics equal to the number of views, K = 6. [sent-188, score-0.2]
78 As expected, each topic distribution represents a view, and by assigning every image I to the topic k with highest proportion, k = arg maxk θ̂k(I), we correctly associated all the images from the same view to the same topic. [sent-189, score-0.578]
79 We learned a model with K = 8 topics from 16 training sequences, and used the remaining 6 for testing. [sent-191, score-0.251]
80 We can see how the pose is matched despite changes of appearance and motion style; specifically, a test subject's pose is matched to similar poses of different subjects in the training set. [sent-194, score-1.297]
81 This shows how the topic representation factors out most of the appearance variations and retains only essential information on the pose. [sent-195, score-0.409]
82 In order to give a quantitative evaluation of the pose matching performance and compare with other approaches, we labeled the dataset by mapping the set of walking poses to the interval [0, 1]. [sent-196, score-0.77]
83 Figure 1: Human detection results (curves for NMF, VQ, LDA Gibbs, LDA MF, and Linear SVM; x-axis: false positives per window, FPPW). [sent-208, score-0.244]
84 (Left) Effect on human detection performance of quantizing the histogram of oriented gradient descriptor [6] for a boosted linear SVM classifier based on these features. [sent-209, score-0.796]
85 We can see that for 16 quantization levels or more the differences are negligible, thus validating our discrete approach. [sent-212, score-0.274]
86 The first column shows the distribution of local orientations associated with topic k: (top) visualization of the orientations and (bottom) average gradient intensities for each cell. [sent-218, score-0.504]
87 The right 5 columns show the top ten images in the dataset with highest topic proportion θ̂k, shown below each image. [sent-219, score-0.338]
88 We can see that topics are tightly related to pose classes. [sent-220, score-0.6]
89 For each test frame, we computed the pose error as the difference between the associated pose value and the average pose of the best top 10 matches in the training dataset. [sent-225, score-1.463]
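A small sketch of this evaluation metric, assuming every frame carries a scalar pose label in [0, 1] as described above and that matches are ranked by similarity of topic proportions; the plain absolute difference used here ignores the cyclic nature of the walking phase, which is a simplification.

```python
import numpy as np

def pose_error(test_pose, train_poses, match_idx, top_k=10):
    """Absolute difference between a frame's pose label and the mean label of its matches."""
    # test_pose: scalar label in [0, 1]; train_poses: array of scalar labels;
    # match_idx: training-frame indices ranked by similarity to the test frame.
    return abs(test_pose - np.mean(train_poses[match_idx[:top_k]]))
```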
90 In Figure 4 we also show the average pose error when matching test frames to a single training sequence. [sent-232, score-0.653]
91 Although the differences in appearance affect the matching performance, overall the results show how our approach can be successfully applied to automatically match poses of different subjects. [sent-233, score-0.341]
92 (6 Conclusions) We introduce a novel approach to human detection, pose classification and matching from a single image. [sent-234, score-0.783]
93 Starting from a representation robust to a limited range of variations in the appearance of humans in images, we derive a generative probabilistic model which allows for automatic discovery of pose information. [sent-235, score-0.87]
94 The model can successfully perform detection and provides a low dimensional representation of the pose. [sent-236, score-0.277]
95 It automatically clusters the images using representative distributions and allows for an efficient approach to pose matching. [sent-237, score-0.58]
96 Our experiments show that our approach matches or exceeds the state of the art in human detection, pose classification and matching. [sent-238, score-0.692]
97 We can see how our approach makes it possible to match poses despite large changes in appearance, and the same pose is correctly matched across different subjects. [sent-241, score-0.642]
98 (Left) Average pose error in matching test sequences to the training set, for our model (both Gibbs and Mean Field learning), Non-Negative Matrix Factorization and Vector Quantization. [sent-248, score-0.694]
99 (Right) Average pose error in matching test and training sequence pairs with our approach, where each row is a test sequence and each column a training sequence. [sent-250, score-0.675]
100 3d human pose from silhouettes by relevance vector regression. [sent-259, score-0.702]
wordName wordTfidf (topN-words)
[('pose', 0.462), ('inew', 0.381), ('topic', 0.217), ('lda', 0.215), ('detection', 0.204), ('human', 0.198), ('poses', 0.149), ('quantization', 0.144), ('topics', 0.138), ('histogram', 0.133), ('humans', 0.128), ('matching', 0.123), ('bin', 0.12), ('orientations', 0.109), ('wj', 0.109), ('gibbs', 0.098), ('dirichlet', 0.092), ('word', 0.087), ('images', 0.085), ('descriptors', 0.084), ('document', 0.083), ('histograms', 0.079), ('descriptor', 0.078), ('body', 0.071), ('oriented', 0.069), ('appearance', 0.069), ('gradient', 0.069), ('frames', 0.068), ('latent', 0.066), ('proportions', 0.064), ('classi', 0.064), ('angeles', 0.063), ('iccv', 0.061), ('image', 0.059), ('los', 0.059), ('allocation', 0.058), ('gradients', 0.055), ('inria', 0.055), ('generative', 0.055), ('factorization', 0.052), ('discrete', 0.051), ('cvpr', 0.05), ('motion', 0.048), ('background', 0.047), ('variations', 0.047), ('orientation', 0.047), ('sampling', 0.046), ('performances', 0.045), ('training', 0.045), ('cmu', 0.044), ('object', 0.043), ('bins', 0.043), ('bissacco', 0.042), ('fppw', 0.042), ('mobo', 0.042), ('pedestrians', 0.042), ('silhouettes', 0.042), ('levels', 0.042), ('detector', 0.042), ('representation', 0.041), ('false', 0.04), ('documents', 0.04), ('detectors', 0.039), ('validating', 0.037), ('honda', 0.037), ('quantize', 0.037), ('weber', 0.037), ('unsupervised', 0.037), ('dataset', 0.036), ('learned', 0.036), ('probabilistic', 0.036), ('blocks', 0.035), ('answer', 0.035), ('essential', 0.035), ('detecting', 0.034), ('multinomial', 0.034), ('soatto', 0.034), ('ww', 0.034), ('economical', 0.034), ('nmf', 0.034), ('svm', 0.033), ('score', 0.033), ('distributions', 0.033), ('cell', 0.033), ('counts', 0.033), ('sequences', 0.032), ('field', 0.032), ('model', 0.032), ('matches', 0.032), ('brightness', 0.031), ('vq', 0.031), ('safely', 0.031), ('efros', 0.031), ('devise', 0.031), ('corpus', 0.031), ('matched', 0.031), ('deriving', 0.03), ('trained', 0.03), ('agarwal', 0.03), ('miss', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 66 nips-2006-Detecting Humans via Their Pose
Author: Alessandro Bissacco, Ming-Hsuan Yang, Stefano Soatto
Abstract: We consider the problem of detecting humans and classifying their pose from a single image. Specifically, our goal is to devise a statistical model that simultaneously answers two questions: 1) is there a human in the image? and, if so, 2) what is a low-dimensional representation of her pose? We investigate models that can be learned in an unsupervised manner on unlabeled images of human poses, and provide information that can be used to match the pose of a new image to the ones present in the training set. Starting from a set of descriptors recently proposed for human detection, we apply the Latent Dirichlet Allocation framework to model the statistics of these features, and use the resulting model to answer the above questions. We show how our model can efficiently describe the space of images of humans with their pose, by providing an effective representation of poses for tasks such as classification and matching, while performing remarkably well in human/non human decision problems, thus enabling its use for human detection. We validate the model with extensive quantitative experiments and comparisons with other approaches on human detection and pose matching. 1
2 0.27256098 122 nips-2006-Learning to parse images of articulated bodies
Author: Deva Ramanan
Abstract: We consider the machine vision task of pose estimation from static images, specifically for the case of articulated objects. This problem is hard because of the large number of degrees of freedom to be estimated. Following a established line of research, pose estimation is framed as inference in a probabilistic model. In our experience however, the success of many approaches often lie in the power of the features. Our primary contribution is a novel casting of visual inference as an iterative parsing process, where one sequentially learns better and better features tuned to a particular image. We show quantitative results for human pose estimation on a database of over 300 images that suggest our algorithm is competitive with or surpasses the state-of-the-art. Since our procedure is quite general (it does not rely on face or skin detection), we also use it to estimate the poses of horses in the Weizmann database. 1
3 0.20071869 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model
Author: Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers
Abstract: Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-sematic indexing) or that only match documents at the specific word level (such as TF-IDF). 1
4 0.17158766 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
Author: Frank Moosmann, Bill Triggs, Frederic Jurie
Abstract: Some of the most effective recent methods for content-based image classification work by extracting dense or sparse local image descriptors, quantizing them according to a coding rule such as k-means vector quantization, accumulating histograms of the resulting “visual word” codes over the image, and classifying these with a conventional classifier such as an SVM. Large numbers of descriptors and large codebooks are needed for good results and this becomes slow using k-means. We introduce Extremely Randomized Clustering Forests – ensembles of randomly created clustering trees – and show that these provide more accurate results, much faster training and testing and good resistance to background clutter in several state-of-the-art image classification tasks. 1
5 0.15215936 34 nips-2006-Approximate Correspondences in High Dimensions
Author: Kristen Grauman, Trevor Darrell
Abstract: Pyramid intersection is an efficient method for computing an approximate partial matching between two sets of feature vectors. We introduce a novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors. The matching similarity is computed in linear time and forms a Mercer kernel. Whereas previous matching approximation algorithms suffer from distortion factors that increase linearly with the feature dimension, we demonstrate that our approach can maintain constant accuracy even as the feature dimension increases. When used as a kernel in a discriminative classifier, our approach achieves improved object recognition results over a state-of-the-art set kernel. 1
6 0.10892533 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions
7 0.10120451 50 nips-2006-Chained Boosting
8 0.10101794 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing
9 0.096121252 2 nips-2006-A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation
10 0.095483296 185 nips-2006-Subordinate class recognition using relational object models
11 0.094969667 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization
12 0.093726277 17 nips-2006-A recipe for optimizing a time-histogram
13 0.091108024 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
14 0.089310117 134 nips-2006-Modeling Human Motion Using Binary Latent Variables
15 0.083739288 103 nips-2006-Kernels on Structured Objects Through Nested Histograms
16 0.07741116 111 nips-2006-Learning Motion Style Synthesis from Perceptual Observations
17 0.076438874 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
18 0.075630561 74 nips-2006-Efficient Structure Learning of Markov Networks using $L 1$-Regularization
19 0.074199408 51 nips-2006-Clustering Under Prior Knowledge with Application to Image Segmentation
20 0.073567845 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
topicId topicWeight
[(0, -0.263), (1, 0.024), (2, 0.188), (3, -0.152), (4, 0.057), (5, -0.011), (6, -0.105), (7, -0.125), (8, 0.023), (9, -0.076), (10, 0.055), (11, 0.168), (12, 0.049), (13, 0.062), (14, 0.158), (15, 0.057), (16, -0.207), (17, -0.072), (18, 0.066), (19, -0.142), (20, -0.039), (21, -0.163), (22, 0.075), (23, 0.115), (24, 0.013), (25, -0.147), (26, -0.049), (27, -0.077), (28, -0.045), (29, -0.046), (30, -0.125), (31, -0.059), (32, -0.043), (33, -0.018), (34, -0.009), (35, 0.005), (36, -0.007), (37, -0.093), (38, -0.092), (39, -0.191), (40, -0.014), (41, 0.068), (42, -0.027), (43, 0.024), (44, -0.041), (45, -0.063), (46, 0.037), (47, -0.025), (48, 0.05), (49, -0.048)]
simIndex simValue paperId paperTitle
same-paper 1 0.94627374 66 nips-2006-Detecting Humans via Their Pose
Author: Alessandro Bissacco, Ming-Hsuan Yang, Stefano Soatto
Abstract: We consider the problem of detecting humans and classifying their pose from a single image. Specifically, our goal is to devise a statistical model that simultaneously answers two questions: 1) is there a human in the image? and, if so, 2) what is a low-dimensional representation of her pose? We investigate models that can be learned in an unsupervised manner on unlabeled images of human poses, and provide information that can be used to match the pose of a new image to the ones present in the training set. Starting from a set of descriptors recently proposed for human detection, we apply the Latent Dirichlet Allocation framework to model the statistics of these features, and use the resulting model to answer the above questions. We show how our model can efficiently describe the space of images of humans with their pose, by providing an effective representation of poses for tasks such as classification and matching, while performing remarkably well in human/non human decision problems, thus enabling its use for human detection. We validate the model with extensive quantitative experiments and comparisons with other approaches on human detection and pose matching. 1
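The abstract above repeats the paper header. As a loose sketch of the kind of pipeline it describes, the code below quantizes block-level gradient-orientation descriptors into visual words, fits a topic model to the resulting counts, and uses the per-window topic proportions for pose matching and a crude human/non-human score. The descriptor extraction is faked with random data, and the vocabulary size, topic count, and use of scikit-learn's KMeans and LatentDirichletAllocation are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Placeholder data: in a real pipeline each row would be a block-level
# histogram-of-gradient-orientations descriptor extracted from a window.
n_images, blocks_per_image, desc_dim = 60, 50, 36
descriptors = rng.random((n_images * blocks_per_image, desc_dim))

# 1) Quantize block descriptors into a small visual vocabulary (assumed size).
vocab_size = 100
kmeans = KMeans(n_clusters=vocab_size, n_init=4, random_state=0).fit(descriptors)
words = kmeans.labels_.reshape(n_images, blocks_per_image)

# 2) Bag-of-words count matrix: one row per image window.
counts = np.zeros((n_images, vocab_size), dtype=int)
for i in range(n_images):
    counts[i] = np.bincount(words[i], minlength=vocab_size)

# 3) Fit LDA; the per-image topic proportions are the low-dimensional pose code.
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)
theta = lda.transform(counts)
theta = theta / theta.sum(axis=1, keepdims=True)

# 4) Pose matching: nearest training image in topic space (L1 distance here).
query = theta[0]
nearest = np.argmin(np.abs(theta[1:] - query).sum(axis=1)) + 1
print("closest pose match for image 0:", nearest)

# 5) A crude detection score: approximate per-word log-likelihood (variational
#    bound) of one window's word counts under the fitted model.
loglik_per_word = lda.score(counts[:1]) / counts[:1].sum()
print("human/non-human score (higher = more human-like):", loglik_per_word)
```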
2 0.79936254 122 nips-2006-Learning to parse images of articulated bodies
Author: Deva Ramanan
Abstract: We consider the machine vision task of pose estimation from static images, specifically for the case of articulated objects. This problem is hard because of the large number of degrees of freedom to be estimated. Following an established line of research, pose estimation is framed as inference in a probabilistic model. In our experience, however, the success of many approaches often lies in the power of the features. Our primary contribution is a novel casting of visual inference as an iterative parsing process, where one sequentially learns better and better features tuned to a particular image. We show quantitative results for human pose estimation on a database of over 300 images that suggest our algorithm is competitive with or surpasses the state-of-the-art. Since our procedure is quite general (it does not rely on face or skin detection), we also use it to estimate the poses of horses in the Weizmann database. 1
3 0.69149387 133 nips-2006-Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model
Author: Chaitanya Chemudugunta, Padhraic Smyth, Mark Steyvers
Abstract: Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-sematic indexing) or that only match documents at the specific word level (such as TF-IDF). 1 Introduction and Motivation Reducing high-dimensional data vectors to robust and interpretable lower-dimensional representations has a long and successful history in data analysis, including recent innovations such as latent semantic indexing (LSI) (Deerwester et al, 1994) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). These types of techniques have found broad application in modeling of sparse high-dimensional count data such as the “bag of words” representations for documents or transaction data for Web and retail applications. Approaches such as LSI and LDA have both been shown to be useful for “object matching” in their respective latent spaces. In information retrieval for example, both a query and a set of documents can be represented in the LSI or topic latent spaces, and the documents can be ranked in terms of how well they match the query based on distance or similarity in the latent space. The mapping to latent space represents a generalization or abstraction away from the sparse set of observed words, to a “higher-level” semantic representation in the latent space. These abstractions in principle lead to better generalization on new data compared to inferences carried out directly in the original sparse high-dimensional space. The capability of these models to provide improved generalization has been demonstrated empirically in a number of studies (e.g., Deerwester et al 1994; Hofmann 1999; Canny 2004; Buntine et al, 2005). However, while this type of generalization is broadly useful in terms of inference and prediction, there are situations where one can over-generalize. Consider trying to match the following query to a historical archive of news articles: election + campaign + Camejo. The query is intended to find documents that are about US presidential campaigns and also about Peter Camejo (who ran as vice-presidential candidate alongside independent Ralph Nader in 2004). LSI and topic models are likely to highly rank articles that are related to presidential elections (even if they don’t necessarily contain the words election or campaign). However, a potential problem is that the documents that are highly ranked by LSI or topic models need not include any mention of the name Camejo. 
The reason is that the combination of words in this query is likely to activate one or more latent variables related to the concept of presidential campaigns. However, once this generalization is made the model has “lost” the information about the specific word Camejo and it will only show up in highly ranked documents if this word happens to frequently occur in these topics (unlikely in this case given that this candidate received relatively little media coverage compared to the coverage given to the candidates from the two main parties). But from the viewpoint of the original query, our preference would be to get documents that are about the general topic of US presidential elections with the specific constraint that they mention Peter Camejo. Word-based retrieval techniques, such as the widely-used term-frequency inverse-documentfrequency (TF-IDF) method, have the opposite problem in general. They tend to be overly specific in terms of matching words in the query to documents. In general of course one would like to have a balance between generality and specificity. One ad hoc approach is to combine scores from a general method such as LSI with those from a more specific method such as TF-IDF in some manner, and indeed this technique has been proposed in information retrieval (Vogt and Cottrell, 1999). Similarly, in the ad hoc LDA approach (Wei and Croft, 2006), the LDA model is linearly combined with document-specific word distributions to capture both general as well as specific information in documents. However, neither method is entirely satisfactory since it is not clear how to trade-off generality and specificity in a principled way. The contribution of this paper is a new graphical model based on latent topics that handles the tradeoff between generality and specificity in a fully probabilistic and automated manner. The model, which we call the special words with background (SWB) model, is an extension of the LDA model. The new model allows words in documents to be modeled as either originating from general topics, or from document-specific “special” word distributions, or from a corpus-wide background distribution. The idea is that words in a document such as election and campaign are likely to come from a general topic on presidential elections, whereas a name such as Camejo is much more likely to be treated as “non-topical” and specific to that document. Words in queries are automatically interpreted (in a probabilistic manner) as either being topical or special, in the context of each document, allowing for a data-driven document-specific trade-off between the benefits of topic-based abstraction and specific word matching. Daum´ and Marcu (2006) independently proposed a probabilistic e model using similar concepts for handling different training and test distributions in classification problems. Although we have focused primarily on documents in information retrieval in the discussion above, the model we propose can in principle be used on any large sparse matrix of count data. For example, transaction data sets where rows are individuals and columns correspond to items purchased or Web sites visited are ideally suited to this approach. The latent topics can capture broad patterns of population behavior and the “special word distributions” can capture the idiosyncracies of specific individuals. Section 2 reviews the basic principles of the LDA model and introduces the new SWB model. 
Section 3 illustrates how the model works in practice using examples from New York Times news articles. In Section 4 we describe a number of experiments with 4 different document sets, including perplexity experiments and information retrieval experiments, illustrating the trade-offs between generalization and specificity for different models. Section 5 contains a brief discussion and concluding comments. 2 A Topic Model for Special Words Figure 1(a) shows the graphical model for what we will refer to as the “standard topic model” or LDA. There are D documents and document d has Nd words. α and β are fixed parameters of symmetric Dirichlet priors for the D document-topic multinomials represented by θ and the T topicword multinomials represented by φ. In the generative model, for each document d, the Nd words (a) β2 β1 α γ Ω ψ θ λ β0 z x φ w (b) α θ β z φ w Nd T T D Nd D Figure 1: Graphical models for (a) the standard LDA topic model (left) and (b) the proposed special words topic model with a background distribution (SWB) (right). are generated by drawing a topic t from the document-topic distribution p(z|θd ) and then drawing a word w from the topic-word distribution p(w|z = t, φt ). As shown in Griffiths and Steyvers (2004) the topic assignments z for each word token in the corpus can be efficiently sampled via Gibbs sampling (after marginalizing over θ and φ). Point estimates for the θ and φ distributions can be computed conditioned on a particular sample, and predictive distributions can be obtained by averaging over multiple samples. We will refer to the proposed model as the special words topic model with background distribution (SWB) (Figure 1(b)). SWB has a similar general structure to the LDA model (Figure 1(a)) but with additional machinery to handle special words and background words. In particular, associated with each word token is a latent random variable x, taking value x = 0 if the word w is generated via the topic route, value x = 1 if the word is generated as a special word (for that document) and value x = 2 if the word is generated from a background distribution specific for the corpus. The variable x acts as a switch: if x = 0, the previously described standard topic mechanism is used to generate the word, whereas if x = 1 or x = 2, words are sampled from a document-specific multinomial Ψ or a corpus specific multinomial Ω (with symmetric Dirichlet priors parametrized by β1 and β2 ) respectively. x is sampled from a document-specific multinomial λ, which in turn has a symmetric Dirichlet prior, γ. One could also use a hierarchical Bayesian approach to introduce another level of uncertainty about the Dirichlet priors (e.g., see Blei, Ng, and Jordan, 2003)—we have not investigated this option, primarily for computational reasons. In all our experiments, we set α = 0.1, β0 = β2 = 0.01, β1 = 0.0001 and γ = 0.3—all weak symmetric priors. The conditional probability of a word w given a document d can be written as: T p(w|z = t)p(z = t|d) + p(x = 1|d)p′ (w|d) + p(x = 2|d)p′′ (w) p(w|d) = p(x = 0|d) t=1 ′ where p (w|d) is the special word distribution for document d, and p′′ (w) is the background word distribution for the corpus. Note that when compared to the standard topic model the SWB model can explain words in three different ways, via topics, via a special word distribution, or via a background word distribution. 
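With point estimates of the parameters in hand, the three-route mixture above is direct to evaluate. The sketch below uses random placeholder parameters of toy sizes purely to show the computation; nothing here is learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)
W, T, D = 1000, 20, 5                      # vocabulary, topics, documents (toy sizes)

def normalize(a, axis=-1):
    return a / a.sum(axis=axis, keepdims=True)

phi   = normalize(rng.random((T, W)))      # p(w | z=t): topic-word distributions
theta = normalize(rng.random((D, T)))      # p(z=t | d): document-topic proportions
psi   = normalize(rng.random((D, W)))      # p'(w | d): document-specific special words
omega = normalize(rng.random(W))           # p''(w): corpus-wide background distribution
lam   = normalize(rng.random((D, 3)))      # p(x = 0/1/2 | d): route probabilities

def p_word_given_doc(w, d):
    """SWB mixture: topic route + special-word route + background route."""
    topic_route      = lam[d, 0] * theta[d] @ phi[:, w]
    special_route    = lam[d, 1] * psi[d, w]
    background_route = lam[d, 2] * omega[w]
    return topic_route + special_route + background_route

# sanity check: probabilities over the vocabulary sum to one for each document
print(sum(p_word_given_doc(w, 0) for w in range(W)))   # ~1.0
```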
Given the graphical model above, it is relatively straightforward to derive Gibbs sampling equations that allow joint sampling of the zi and xi latent variables for each word token wi , for xi = 0: p (xi = 0, zi = t |w, x−i , z−i , α, β0 , γ ) ∝ Nd0,−i + γ × Nd,−i + 3γ TD Ctd,−i + α t′ TD Ct′ d,−i + T α × WT Cwt,−i + β0 WT w ′ Cw ′ t,−i + W β0 and for xi = 1: p (xi = 1 |w, x−i , z−i , β1 , γ ) ∝ Nd1,−i + γ × Nd,−i + 3γ WD Cwd,−i + β1 w′ WD Cw′ d,−i + W β1 e mail krugman nytimes com memo to critics of the media s liberal bias the pinkos you really should be going after are those business reporters even i was startled by the tone of the jan 21 issue of investment news which describes itself as the weekly newspaper for financial advisers the headline was paul o neill s sweet deal the blurb was irs backs off closing loophole averting tax liability for execs and treasury chief it s not really news that the bush administration likes tax breaks for businessmen but two weeks later i learned from the wall street journal that this loophole is more than a tax break for businessmen it s a gift to biznesmen and it may be part of a larger pattern confused in the former soviet union the term biznesmen pronounced beeznessmen refers to the class of sudden new rich who emerged after the fall of communism and who generally got rich by using their connections to strip away the assets of public enterprises what we ve learned from enron and other players to be named later is that america has its own biznesmen and that we need to watch out for policies that make it easier for them to ply their trade it turns out that the sweet deal investment news was referring to the use of split premium life insurance policies to give executives largely tax free compensation you don t want to know the details is an even sweeter deal for executives of companies that go belly up it shields their wealth from creditors and even from lawsuits sure enough reports the wall street journal former enron c e o s kenneth lay and jeffrey skilling both had large split premium policies so what other pro biznes policies have been promulgated lately last year both houses of … john w snow was paid more than 50 million in salary bonus and stock in his nearly 12 years as chairman of the csx corporation the railroad company during that period the company s profits fell and its stock rose a bit more than half as much as that of the average big company mr snow s compensation amid csx s uneven performance has drawn criticism from union officials and some corporate governance specialists in 2000 for example after the stock had plunged csx decided to reverse a 25 million loan to him the move is likely to get more scrutiny after yesterday s announcement that mr snow has been chosen by president bush to replace paul o neill as the treasury secretary like mr o neill mr snow is an outsider on wall street but an insider in corporate america with long experience running an industrial company some wall street analysts who follow csx said yesterday that mr snow had ably led the company through a difficult period in the railroad industry and would make a good treasury secretary it s an excellent nomination said jill evans an analyst at j p morgan who has a neutral rating on csx stock i think john s a great person for the administration he as the c e o of a railroad has probably touched every sector of the economy union officials are less complimentary of mr snow s performance at csx last year the a f l c i o criticized him and csx for the company s decision to 
reverse the loan allowing him to return stock he had purchased with the borrowed money at a time when independent directors are in demand a corporate governance specialist said recently that mr snow had more business relationships with members of his own board than any other chief executive in addition mr snow is the third highest paid of 37 chief executives of transportation companies said ric marshall chief executive of the corporate library which provides specialized investment research into corporate boards his own compensation levels have been pretty high mr marshall said he could afford to take a public service job a csx program in 1996 allowed mr snow and other top csx executives to buy… Figure 2: Examples of two news articles with special words (as inferred by the model) shaded in gray. (a) upper, email article with several colloquialisms, (b) lower, article about CSX corporation. and for xi = 2: p (xi = 2 |w, x−i , z−i , β2 , γ ) ∝ Nd2,−i + γ × Nd,−i + 3γ W Cw,−i + β2 W w ′ Cw ′ ,−i + W β2 where the subscript −i indicates that the count for word token i is removed, Nd is the number of words in document d and Nd0 , Nd1 and Nd2 are the number of words in document d assigned to the W W W latent topics, special words and background component, respectively, CwtT , CwdD and Cw are the number of times word w is assigned to topic t, to the special-words distribution of document d, and to the background distribution, respectively, and W is the number of unique words in the corpus. Note that when there is not strong supporting evidence for xi = 0 (i.e., the conditional probability of this event is low), then the probability of the word being generated by the special words route, xi = 1, or background route, xi = 2 increases. One iteration of the Gibbs sampler corresponds to a sampling pass through all word tokens in the corpus. In practice we have found that around 500 iterations are often sufficient for the in-sample perplexity (or log-likelihood) and the topic distributions to stabilize. We also pursued a variant of SWB, the special words (SW) model that excludes the background distribution Ω and has a symmetric Beta prior, γ, on λ (which in SW is a document-specific Bernoulli distribution). In all our SW model runs, we set γ = 0.5 resulting in a weak symmetric prior that is equivalent to adding one pseudo-word to each document. Experimental results (not shown) indicate that the final word-topic assignments are not sensitive to either the value of the prior or the initial assignments to the latent variables, x and z. 3 Illustrative Examples We illustrate the operation of the SW model with a data set consisting of 3104 articles from the New York Times (NYT) with a total of 1,399,488 word tokens. This small set of NYT articles was formed by selecting all NYT articles that mention the word “Enron.” The SW topic model was run with T = 100 topics. In total, 10 Gibbs samples were collected from the model. Figure 2 shows two short fragments of articles from this NYT dataset. The background color of words indicates the probability of assigning words to the special words topic—darker colors are associated with higher probability that over the 10 Gibbs samples a word was assigned to the special topic. The words with gray foreground colors were treated as stopwords and were not included in the analysis. 
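The collapsed Gibbs conditionals quoted above (for x_i = 0, 1 and 2) reduce to a short per-token sampling step once the count arrays are kept explicitly. The sketch below is a simplified reading of those equations on a random toy corpus, using the hyperparameter values stated in Section 2; it is not the authors' sampler.

```python
import numpy as np

rng = np.random.default_rng(0)
W, T, D = 500, 10, 20                         # toy vocabulary / topic / document sizes
alpha, beta0, beta1, beta2, gamma = 0.1, 0.01, 0.0001, 0.01, 0.3

ctd = np.zeros((T, D))    # topic-document counts (topic route)
cwt = np.zeros((W, T))    # word-topic counts (topic route)
cwd = np.zeros((W, D))    # word-document counts (special-word route)
cw  = np.zeros(W)         # word counts (background route)
ndx = np.zeros((D, 3))    # per-document counts of x = 0, 1, 2

def resample_token(w, d, z, x):
    """One collapsed Gibbs update for a single token (word w in document d)."""
    # remove the token's current assignment from all counts
    ndx[d, x] -= 1
    if x == 0: ctd[z, d] -= 1; cwt[w, z] -= 1
    elif x == 1: cwd[w, d] -= 1
    else: cw[w] -= 1

    route = (ndx[d] + gamma) / (ndx[d].sum() + 3 * gamma)
    p_topics = route[0] * (ctd[:, d] + alpha) / (ctd[:, d].sum() + T * alpha) \
                        * (cwt[w] + beta0) / (cwt.sum(axis=0) + W * beta0)
    p_special = route[1] * (cwd[w, d] + beta1) / (cwd[:, d].sum() + W * beta1)
    p_backgr  = route[2] * (cw[w] + beta2) / (cw.sum() + W * beta2)

    p = np.concatenate([p_topics, [p_special, p_backgr]])
    k = rng.choice(T + 2, p=p / p.sum())
    z_new, x_new = (k, 0) if k < T else (0, k - T + 1)

    # add the new assignment back
    ndx[d, x_new] += 1
    if x_new == 0: ctd[z_new, d] += 1; cwt[w, z_new] += 1
    elif x_new == 1: cwd[w, d] += 1
    else: cw[w] += 1
    return z_new, x_new

# random toy corpus with random initial assignments
tokens, assign = [], []
for _ in range(2000):
    w, d, z, x = rng.integers(W), rng.integers(D), rng.integers(T), rng.integers(3)
    ndx[d, x] += 1
    if x == 0: ctd[z, d] += 1; cwt[w, z] += 1
    elif x == 1: cwd[w, d] += 1
    else: cw[w] += 1
    tokens.append((w, d)); assign.append((z, x))

for sweep in range(5):                        # a few Gibbs sweeps over all tokens
    assign = [resample_token(w, d, z, x) for (w, d), (z, x) in zip(tokens, assign)]
```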
Figure 2(a) shows how intentionally misspelled words such as “biznesmen” and “beeznessmen” and rare Collection NIPS PATENTS AP FR # of Docs 1740 6711 10000 2500 Total # of Word Tokens 2,301,375 15,014,099 2,426,181 6,332,681 Median Doc Length 1310 1858 235.5 516 Mean Doc Length 1322.6 2237.2 242.6 2533.1 # of Queries N/A N/A 142 30 Table 1: General characteristics of document data sets used in experiments. NIPS set number results case problem function values paper approach large PATENTS .0206 .0167 .0153 .0123 .0118 .0108 .0102 .0088 .0080 .0079 fig end extend invent view shown claim side posit form .0647 .0372 .0267 .0246 .0214 .0191 .0189 .0177 .0153 .0128 AP tagnum itag requir includ section determin part inform addit applic FR .0416 .0412 .0381 .0207 .0189 .0134 .0112 .0105 .0096 .0086 nation sai presid polici issu call support need govern effort .0147 .0129 .0118 .0108 .0096 .0094 .0085 .0079 .0070 .0068 Figure 3: Examples of background distributions (10 most likely words) learned by the SWB model for 4 different document corpora. words such as “pinkos” are likely to be assigned to the special words topic. Figure 2(b) shows how a last name such as “Snow” and the corporation name “CSX” that are specific to the document are likely to be assigned to the special topic. The words “Snow” and “CSX” do not occur often in other documents but are mentioned several times in the example document. This combination of low document-frequency and high term-frequency within the document is one factor that makes these words more likely to be treated as “special” words. 4 Experimental Results: Perplexity and Precision We use 4 different document sets in our experiments, as summarized in Table 1. The NIPS and PATENTS document sets are used for perplexity experiments and the AP and FR data sets for retrieval experiments. The NIPS data set is available online1 and PATENTS, AP, and FR consist of documents from the U.S. Patents collection (TREC Vol-3), Associated Press news articles from 1998 (TREC Vol-2), and articles from the Federal Register (TREC Vol-1, 2) respectively. To create the sampled AP and FR data sets, all documents relevant to queries were included first and the rest of the documents were chosen randomly. In the results below all LDA/SWB/SW models were fit using T = 200 topics. Figure 3 demonstrates the background component learned by the SWB model on the 4 different document data sets. The background distributions learned for each set of documents are quite intuitive, with words that are commonly used across a broad range of documents within each corpus. The ratio of words assigned to the special words distribution and the background distribution are (respectively for each data set), 25%:10% (NIPS), 58%:5% (PATENTS), 11%:6% (AP), 50%:11% (FR). Of note is the fact that a much larger fraction of words are treated as special in collections containing long documents (NIPS, PATENTS, and FR) than in short “abstract-like” collections (such as AP)—this makes sense since short documents are more likely to contain general summary information while longer documents will have more specific details. 4.1 Perplexity Comparisons The NIPS and PATENTS documents sets do not have queries and relevance judgments, but nonetheless are useful for evaluating perplexity. We compare the predictive performance of the SW and SWB topic models with the standard topic model by computing the perplexity of unseen words in test documents. 
Perplexity of a test set under a model is defined as follows: 1 From http://www.cs.toronto.edu/˜roweis/data.html 1900 550 LDA SW SWB 1800 500 1700 450 LDA SW SWB Perplexity Perplexity 1600 1500 1400 400 350 300 1300 250 1200 200 1100 1000 10 20 30 40 50 60 70 80 150 10 90 20 Percentage of Words Observed 30 40 50 60 70 80 90 Percentage of Words Observed Figure 4: Average perplexity of the two special words models and the standard topics model as a function of the percentage of words observed in test documents on the NIPS data set (left) and the PATENTS data set (right). Perplexity(wtest |Dtrain ) = exp − Dtest d=1 log p(wd |Dtrain ) Dtest d=1 Nd where wtest is a vector of words in the test data set, wd is a vector of words in document d of the test set, and Dtrain is the training set. For the SWB model, we approximate p(wd |Dtrain ) as follows: p(wd |Dtrain ) ≈ 1 S S p(wd |{Θs Φs Ψs Ωs λs }) s=1 where Θs , Φs , Ψs , Ωs and λs are point estimates from s = 1:S different Gibbs sampling runs. The probability of the words wd in a test document d, given its parameters, can be computed as follows: Nd p(wd |{Θs Φs Ψs Ωs λs }) = T s s φs i t θtd + λs ψwi d + λs Ωs i 3d w w 2d λs 1d i=1 t=1 where Nd is the number of words in test document d and wi is the ith word being predicted in the s s test document. θtd , φs i t , ψwi d , Ωs i and λs are point estimates from sample s. w w d When a fraction of words of a test document d is observed, a Gibbs sampler is run on the observed words to update the document-specific parameters, θd , ψd and λd and these updated parameters are used in the computation of perplexity. For the NIPS data set, documents from the last year of the data set were held out to compute perplexity (Dtest = 150), and for the PATENTS data set 500 documents were randomly selected as test documents. From the perplexity figures, it can be seen that once a small fraction of the test document words is observed (20% for NIPS and 10% for PATENTS), the SW and SWB models have significantly lower perplexity values than LDA indicating that the SW and SWB models are using the special words “route” to better learn predictive models for individual documents. 4.2 Information Retrieval Results Returning to the point of capturing both specific and general aspects of documents as discussed in the introduction of the paper, we generated 500 queries of length 3-5 using randomly selected lowfrequency words from the NIPS corpus and then ranked documents relative to these queries using several different methods. Table 2 shows for the top k-ranked documents (k = 1, 10, 50, 100) how many of the retrieved documents contained at least one of the words in the query. Note that we are not assessing relevance here in a traditional information retrieval sense, but instead are assessing how Method TF-IDF LSI LDA SW SWB 1 Ret Doc 100.0 97.6 90.0 99.2 99.4 10 Ret Docs 100.0 82.7 80.6 97.1 96.6 50 Ret Docs 100.0 64.6 67.0 79.1 78.7 100 Ret Docs 100.0 54.3 58.7 67.3 67.2 Table 2: Percentage of retrieved documents containing at least one query word (NIPS corpus). 
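Stated plainly, the perplexity used above is the exponential of the negative average per-word log-likelihood over held-out tokens. A minimal sketch, assuming the per-word probabilities (for SWB, the averaged point-estimate mixture described earlier) have already been computed:

```python
import numpy as np

def perplexity(per_word_probs):
    """per_word_probs: list of 1-d arrays, one per test document, holding the
    model probability of each held-out word token in that document."""
    total_loglik = sum(np.log(p).sum() for p in per_word_probs)
    total_words = sum(len(p) for p in per_word_probs)
    return np.exp(-total_loglik / total_words)

# toy example: three "documents" whose word probabilities we pretend to know
rng = np.random.default_rng(0)
docs = [rng.uniform(1e-4, 1e-2, size=n) for n in (120, 80, 200)]
print(perplexity(docs))
```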
AP MAP Method TF-IDF LSI LDA SW SWB Title .353 .286 .424 .466* .460* Pr@10d Desc .358 .387 .394 .430* .417 Concepts .498 .459 .498 .550* .549* Method TF-IDF LSI LDA SW SWB Title .406 .455 .478 .524* .513* Method TF-IDF LSI LDA SW SWB Title .300 .366 .428 .469 .462 Desc .434 .469 .463 .509* .495 Concepts .549 .523 .556 .599* .603* FR MAP Method TF-IDF LSI LDA SW SWB Title .268 .329 .344 .371 .373 Pr@10d Desc .272 .295 .271 .323* .328* Concepts .391 .399 .396 .448* .435 Desc .287 .327 .340 .407* .423* Concepts .483 .487 .487 .550* .523 *=sig difference wrt LDA Figure 5: Information retrieval experimental results. often specific query words occur in retrieved documents. TF-IDF has 100% matches, as one would expect, and the techniques that generalize (such as LSI and LDA) have far fewer exact matches. The SWB and SW models have more specific matches than either LDA or LSI, indicating that they have the ability to match at the level of specific words. Of course this is not of much utility unless the SWB and SW models can also perform well in terms of retrieving relevant documents (not just documents containing the query words), which we investigate next. For the AP and FR documents sets, 3 types of query sets were constructed from TREC Topics 1150, based on the T itle (short), Desc (sentence-length) and Concepts (long list of keywords) fields. Queries that have no relevance judgments for a collection were removed from the query set for that collection. The score for a document d relative to a query q for the SW and standard topic models can be computed as the probability of q given d (known as the query-likelihood model in the IR community). For the SWB topic model, we have T p(w|z = t)p(z = t|d) + p(x = 1|d)p′ (w|d) + p(x = 2|d)p′′ (w)] [p(x = 0|d) p(q|d) ≈ w∈q t=1 We compare SW and SWB models with the standard topic model (LDA), LSI and TF-IDF. The TFCW D D wd IDF score for a word w in a document d is computed as TF-IDF(w, d) = Nd × log2 Dw . For LSI, the TF-IDF weight matrix is reduced to a K-dimensional latent space using SVD, K = 200. A given query is first mapped into the LSI latent space or the TF-IDF space (known as query folding), and documents are scored based on their cosine distances to the mapped queries. To measure the performance of each algorithm we used 2 metrics that are widely used in IR research: the mean average precision (MAP) and the precision for the top 10 documents retrieved (pr@10d). The main difference between the AP and FR documents is that the latter documents are considerably longer on average and there are fewer queries for the FR data set. Figure 5 summarizes the results, broken down by algorithm, query type, document set, and metric. The maximum score for each query experiment is shown in bold: in all cases (query-type/data set/metric) the SW or SWB model produced the highest scores. To determine statistical significance, we performed a t-test at the 0.05 level between the scores of each of the SW and SWB models, and the scores of the LDA model (as LDA has the best scores overall among TF-IDF, LSI and LDA). Differences between SW and SWB are not significant. In figure 5, we use the symbol * to indicate scores where the SW and SWB models showed a statistically significant difference (always an improvement) relative to the LDA model. The differences for the “non-starred” query and metric scores of SW and SWB are not statistically significant but nonetheless always favor SW and SWB over LDA. 
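The two scoring rules compared above can be sketched side by side: the TF-IDF score from the formula in the text, and a query-likelihood score that sums log p(w|d) over query words for whatever per-document word distribution a model provides. The smoothed unigram language model standing in for a learned model below is an assumption added so that zero counts do not zero out the score; it is not one of the models evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, W = 8, 30                                   # toy corpus: 8 documents, 30 terms
counts = rng.integers(0, 5, size=(D, W))       # term-count matrix

def tfidf_scores(query, counts):
    """TF-IDF(w, d) = (count of w in d / length of d) * log2(D / doc-freq of w)."""
    doclen = counts.sum(axis=1)
    df = (counts > 0).sum(axis=0)
    idf = np.log2(counts.shape[0] / np.maximum(df, 1))
    return sum((counts[:, w] / doclen) * idf[w] for w in query)

def query_likelihood_scores(query, word_prob):
    """log p(q|d) = sum over query words of log p(w|d); word_prob is any model's
    per-document word distribution of shape (D, W), e.g. the SWB mixture."""
    return sum(np.log(word_prob[:, w]) for w in query)

# stand-in for a learned model: smoothed unigram document language models
word_prob = (counts + 0.1) / (counts.sum(axis=1, keepdims=True) + 0.1 * W)

query = [3, 17, 22]                            # a query of three term ids
print("TF-IDF ranking:   ", np.argsort(-tfidf_scores(query, counts)))
print("query-lik ranking:", np.argsort(-query_likelihood_scores(query, word_prob)))
```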
5 Discussion and Conclusions Wei and Croft (2006) have recently proposed an ad hoc LDA approach that models p(q|d) as a weighted combination of a multinomial over the entire corpus (the background model), a multinomial over the document, and an LDA model. Wei and Croft showed that this combination provides excellent retrieval performance compared to other state-of-the-art IR methods. In a number of experiments (not shown) comparing the SWB and ad hoc LDA models we found that the two techniques produced comparable precision performance, with small but systematic performance gains being achieved by an ad hoc combination where the standard LDA model in ad hoc LDA was replaced with the SWB model. An interesting direction for future work is to investigate fully generative models that can achieve the performance of ad hoc approaches. In conclusion, we have proposed a new probabilistic model that accounts for both general and specific aspects of documents or individual behavior. The model extends existing latent variable probabilistic approaches such as LDA by allowing these models to take into account specific aspects of documents (or individuals) that are exceptions to the broader structure of the data. This allows, for example, documents to be modeled as a mixture of words generated by general topics and words generated in a manner specific to that document. Experimental results on information retrieval tasks indicate that the SWB topic model does not suffer from the weakness of techniques such as LSI and LDA when faced with very specific query words, nor does it suffer the limitations of TF-IDF in terms of its ability to generalize. Acknowledgements We thank Tom Griffiths for useful initial discussions about the special words model. This material is based upon work supported by the National Science Foundation under grant IIS-0083489. We acknowledge use of the computer clusters supported by NIH grant LM-07443-01 and NSF grant EIA-0321390 to Pierre Baldi and the Institute of Genomics and Bioinformatics. References Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003) Latent Dirichlet allocation, Journal of Machine Learning Research 3: 993-1022. Buntine, W., L¨ fstr¨ m, J., Perttu, S. and Valtonen, K. (2005) Topic-specific scoring of documents for relevant o o retrieval Workshop on Learning in Web Search: 22nd International Conference on Machine Learning, pp. 34-41. Bonn, Germany. Canny, J. (2004) GaP: a factor model for discrete data. Proceedings of the 27th Annual SIGIR Conference, pp. 122-129. Daum´ III, H., and Marcu, D. (2006) Domain Adaptation for Statistical classifiers. Journal of the Artificial e Intelligence Research, 26: 101-126. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6): 391-407. Griffiths, T. L., and Steyvers, M. (2004) Finding scientific topics, Proceedings of the National Academy of Sciences, pp. 5228-5235. Hofmann, T. (1999) Probabilistic latent semantic indexing, Proceedings of the 22nd Annual SIGIR Conference, pp. 50-57. Vogt, C. and Cottrell, G. (1999) Fusion via a linear combination of scores. Information Retrieval, 1(3): 151173. Wei, X. and Croft, W.B. (2006) LDA-based document models for ad-hoc retrieval, Proceedings of the 29th SIGIR Conference, pp. 178-185.
4 0.49861827 34 nips-2006-Approximate Correspondences in High Dimensions
Author: Kristen Grauman, Trevor Darrell
Abstract: Pyramid intersection is an efficient method for computing an approximate partial matching between two sets of feature vectors. We introduce a novel pyramid embedding based on a hierarchy of non-uniformly shaped bins that takes advantage of the underlying structure of the feature space and remains accurate even for sets with high-dimensional feature vectors. The matching similarity is computed in linear time and forms a Mercer kernel. Whereas previous matching approximation algorithms suffer from distortion factors that increase linearly with the feature dimension, we demonstrate that our approach can maintain constant accuracy even as the feature dimension increases. When used as a kernel in a discriminative classifier, our approach achieves improved object recognition results over a state-of-the-art set kernel. 1
5 0.48820063 199 nips-2006-Unsupervised Learning of a Probabilistic Grammar for Object Detection and Parsing
Author: Yuanhao Chen, Long Zhu, Alan L. Yuille
Abstract: We describe an unsupervised method for learning a probabilistic grammar of an object from a set of training examples. Our approach is invariant to the scale and rotation of the objects. We illustrate our approach using thirteen objects from the Caltech 101 database. In addition, we learn the model of a hybrid object class where we do not know the specific object or its position, scale or pose. This is illustrated by learning a hybrid class consisting of faces, motorbikes, and airplanes. The individual objects can be recovered as different aspects of the grammar for the object class. In all cases, we validate our results by learning the probability grammars from training datasets and evaluating them on the test datasets. We compare our method to alternative approaches. The advantages of our approach are the speed of inference (under one second), the parsing of the object, and increased accuracy of performance. Moreover, our approach is very general and can be applied to a large range of objects and structures. 1
6 0.48496386 52 nips-2006-Clustering appearance and shape by learning jigsaws
7 0.47040001 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
8 0.44983327 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions
9 0.43978083 174 nips-2006-Similarity by Composition
10 0.42427596 185 nips-2006-Subordinate class recognition using relational object models
11 0.41791931 170 nips-2006-Robotic Grasping of Novel Objects
12 0.39010036 40 nips-2006-Bayesian Detection of Infrequent Differences in Sets of Time Series with Shared Structure
13 0.38351893 166 nips-2006-Recursive Attribute Factoring
14 0.37918583 161 nips-2006-Particle Filtering for Nonparametric Bayesian Matrix Factorization
15 0.37912142 50 nips-2006-Chained Boosting
16 0.37261814 119 nips-2006-Learning to Rank with Nonsmooth Cost Functions
17 0.37077925 101 nips-2006-Isotonic Conditional Random Fields and Local Sentiment Flow
18 0.36590251 132 nips-2006-Modeling Dyadic Data with Binary Latent Factors
19 0.36393654 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
20 0.36282548 160 nips-2006-Part-based Probabilistic Point Matching using Equivalence Constraints
topicId topicWeight
[(1, 0.066), (3, 0.04), (7, 0.067), (9, 0.028), (12, 0.026), (20, 0.027), (21, 0.022), (22, 0.04), (44, 0.044), (57, 0.484), (65, 0.036), (69, 0.033), (90, 0.01)]
simIndex simValue paperId paperTitle
same-paper 1 0.97503304 66 nips-2006-Detecting Humans via Their Pose
Author: Alessandro Bissacco, Ming-Hsuan Yang, Stefano Soatto
Abstract: We consider the problem of detecting humans and classifying their pose from a single image. Specifically, our goal is to devise a statistical model that simultaneously answers two questions: 1) is there a human in the image? and, if so, 2) what is a low-dimensional representation of her pose? We investigate models that can be learned in an unsupervised manner on unlabeled images of human poses, and provide information that can be used to match the pose of a new image to the ones present in the training set. Starting from a set of descriptors recently proposed for human detection, we apply the Latent Dirichlet Allocation framework to model the statistics of these features, and use the resulting model to answer the above questions. We show how our model can efficiently describe the space of images of humans with their pose, by providing an effective representation of poses for tasks such as classification and matching, while performing remarkably well in human/non human decision problems, thus enabling its use for human detection. We validate the model with extensive quantitative experiments and comparisons with other approaches on human detection and pose matching. 1
2 0.95713747 50 nips-2006-Chained Boosting
Author: Christian R. Shelton, Wesley Huie, Kin F. Kan
Abstract: We describe a method to learn to make sequential stopping decisions, such as those made along a processing pipeline. We envision a scenario in which a series of decisions must be made as to whether to continue to process. Further processing costs time and resources, but may add value. Our goal is to create, based on historic data, a series of decision rules (one at each stage in the pipeline) that decide, based on information gathered up to that point, whether to continue processing the part. We demonstrate how our framework encompasses problems from manufacturing to vision processing. We derive a quadratic (in the number of decisions) bound on testing performance and provide empirical results on object detection. 1 Pipelined Decisions In many decision problems, all of the data do not arrive at the same time. Often further data collection can be expensive and we would like to make a decision without accruing the added cost. Consider silicon wafer manufacturing. The wafer is processed in a series of stages. After each stage some tests are performed to judge the quality of the wafer. If the wafer fails (due to flaws), then the processing time, energy, and materials are wasted. So, we would like to detect such a failure as early as possible in the production pipeline. A similar problem can occur in vision processing. Consider the case of object detection in images. Often low-level pixel operations (such as downsampling an image) can be performed in parallel by dedicated hardware (on a video capture board, for example). However, searching each subimage patch of the whole image to test whether it is the object in question takes time that is proportional to the number of pixels. Therefore, we can imagine a image pipeline in which low resolution versions of the whole image are scanned first. Subimages which are extremely unlikely to contain the desired object are rejected and only those which pass are processed at higher resolution. In this way, we save on many pixel operations and can reduce the cost in time to process an image. Even if downsampling is not possible through dedicated hardware, for most object detection schemes, the image must be downsampled to form an image pyramid in order to search for the object at different scales. Therefore, we can run the early stages of such a pipelined detector at the low resolution versions of the image and throw out large regions of the high resolution versions. Most of the processing is spent searching for small faces (at the high resolutions), so this method can save a lot of processing. Such chained decisions also occur if there is a human in the decision process (to ask further clarifying questions in database search, for instance). We propose a framework that can model all of these scenarios and allow such decision rules to be learned from historic data. We give a learning algorithm based on the minimization of the exponential loss and conclude with some experimental results. 1.1 Problem Formulation Let there be s stages to the processing pipeline. We assume that there is a static distribution from which the parts, objects, or units to be processed are drawn. Let p(x, c) represent this distribution in which x is a vector of the features of this unit and c represents the costs associated with this unit. In particular, let xi (1 ≤ i ≤ s) be the set of measurements (features) available to the decision maker immediately following stage i. 
Let c i (1 ≤ i ≤ s) be the cost of rejecting (or stopping the processing of) this unit immediately following stage i. Finally, let c s+1 be the cost of allowing the part to pass through all processing stages. Note that ci need not be monotonic in i. To take our wafer manufacturing example, for wafers that are good we might let c i = i for 1 ≤ i ≤ s, indicating that if a wafer is rejected at any stage, one unit of work has been invested for each stage of processing. For the same good wafers, we might let cs+1 = s − 1000, indicating that the value of a completed wafer is 1000 units and therefore the total cost is the processing cost minus the resulting value. For a flawed wafer, the values might be the same, except for c s+1 which we would set to s, indicating that there is no value for a bad wafer. Note that the costs may be either positive or negative. However, only their relative values are important. Once a part has been drawn from the distribution, there is no way of affecting the “base level” for the value of the part. Therefore, we assume for the remainder of this paper that c i ≥ 0 for 1 ≤ i ≤ s + 1 and that ci = 0 for some value of i (between 1 and s + 1). Our goal is to produce a series of decision rules f i (xi ) for 1 ≤ i ≤ s. We let fi have a range of {0, 1} and let 0 indicate that processing should continue and 1 indicate that processing should be halted. We let f denote the collection of these s decision rules and augment the collection with an additional rule f s+1 which is identically 1 (for ease of notation). The cost of using these rules to halt processing an example is therefore s+1 i−1 ci fi (xi ) L(f (x), c) = i=1 (1 − fj (xj )) . j=1 We would like to find a set of decision rules that minimize E p [L(f (x), c)]. While p(x, c) is not known, we do have a series of samples (training set) D = {(x1 , c1 ), (x2 , c2 ), . . . , (xn , cn )} of n examples drawn from the distribution p. We use superscripts to denote the example index and subscripts to denote the stage index. 2 Boosting Solution For this paper, we consider constructing the rules f i from simpler decision rules, much as in the Adaboost algorithm [1, 2]. We assume that each decision f i (xi ) is computed as the threshold of another function g i (xi ): fi (xi ) = I(gi (xi ) > 0).1 We bound the empirical risk: n n s+1 L(f (xk ), ck ) = i−1 ck I(gi (xk ) > 0) i i k=1 i=1 k=1 n s+1 j=1 k i−1 ck egi (xi ) i ≤ k=1 i=1 I(gj (xk ) ≤ 0) j k n s+1 j=1 k ck egi (xi )− i e−gj (xj ) = Pi−1 j=1 gj (xk ) j . (1) k=1 i=1 Our decision to make all costs positive ensures that the bounds hold. Our decision to make the optimal cost zero helps to ensure that the bound is reasonably tight. As in boosting, we restrict g i (xi ) to take the form mi αi,l hi,l (xi ), the weighted sum of m i subl=1 classifiers, each of which returns either −1 or +1. We will construct these weighted sums incrementally and greedily, adding one additional subclassifier and associated weight at each step. We will pick the stage, weight, and function of the subclassifier in order to make the largest negative change in the exponential bound to the empirical risk. The subclassifiers, h i,l will be drawn from a small class of hypotheses, H. 1 I is the indicator function that equals 1 if the argument is true and 0 otherwise. 1. Initialize gi (x) = 0 for all stages i k 2. Initialize wi = ck for all stages i and examples k. i 3. For each stage i: (a) Calculate targets for each training example, as shown in equation 5. 
(b) Let h be the result of running the base learner on this set. (c) Calculate the corresponding α as per equation 3. (d) Score this classification as per equation 4 ¯ 4. Select the stage ¯ with the best (highest) score. Let h and α be the classifier and ı ¯ weight found at that stage. ¯¯ 5. Let g¯(x) ← g¯(x) + αh(x). ı ı 6. Update the weights (see equation 2): ¯ k k ¯ ı • ∀1 ≤ k ≤ n, multiply w¯ by eαh(x¯ ) . ı ¯ k k ¯ ı • ∀1 ≤ k ≤ n, j > ¯, multiply wj by e−αh(x¯ ) . ı 7. Repeat from step 3 Figure 1: Chained Boosting Algorithm 2.1 Weight Optimization We first assume that the stage at which to add a new subclassifier and the subclassifier to add have ¯ ¯ already been chosen: ¯ and h, respectively. That is, h will become h¯,m¯+1 but we simplify it for ı ı ı ease of expression. Our goal is to find α ¯,m¯+1 which we similarly abbreviate to α. We first define ¯ ı ı k k wi = ck egi (xi )− i Pi−1 j=1 gj (xk ) j (2) + as the weight of example k at stage i, or its current contribution to our risk bound. If we let D h be ¯ − ¯ the set of indexes of the members of D for which h returns +1, and let D h be similarly defined for ¯ ¯ returns −1, we can further define those for which h s+1 + W¯ = ı k w¯ + ı + k∈Dh ¯ s+1 k wi − W¯ = ı − ı k∈Dh i=¯+1 ¯ k w¯ + ı − k∈Dh ¯ k wi . + ı k∈Dh i=¯+1 ¯ + ¯ We interpret W¯ to be the sum of the weights which h will emphasize. That is, it corresponds to ı ¯ ¯ the weights along the path that h selects: For those examples for which h recommends termination, we add the current weight (related to the cost of stopping the processing at this stage). For those ¯ examples for which h recommends continued processing, we add in all future weights (related to all − future costs associated with this example). W¯ can be similarly interpreted to be the weights (or ı ¯ costs) that h recommends skipping. If we optimize the loss bound of Equation 1 with respect to α, we obtain ¯ α= ¯ 1 Wı− log ¯ . + 2 W¯ ı (3) The more weight (cost) that the rule recommends to skip, the higher its α coefficient. 2.2 Full Optimization Using Equation 3 it is straight forward to show that the reduction in Equation 1 due to the addition of this new subclassifier will be + ¯ − ¯ W¯ (1 − eα ) + W¯ (1 − e−α ) . ı ı (4) We know of no efficient method for determining ¯, the stage at which to add a subclassifier, except ı by exhaustive search. However, within a stage, the choice of which subclassifier to use becomes one of maximizing n s+1 k¯ k k z¯ h(x¯ ) , where z¯ = ı ı ı k k wi − w¯ ı (5) i=¯+1 ı k=1 ¯ with respect to h. This is equivalent to an weighted empirical risk minimization where the training 1 2 n k k set is {x¯ , x¯ , . . . , x¯ }. The label of x¯ is the sign of z¯ , and the weight of the same example is the ı ı ı ı ı k magnitude of z¯ . ı 2.3 Algorithm The resulting algorithm is only slightly more complex than standard Adaboost. Instead of a weight vector (one weight for each data example), we now have a weight matrix (one weight for each data example for each stage). We initialize each weight to be the cost associated with halting the corresponding example at the corresponding stage. We start with all g i (x) = 0. The complete algorithm is as in Figure 1. Each time through steps 3 through 7, we complete one “round” and add one additional rule to one stage of the processing. We stop executing this loop when α ≤ 0 or when an iteration counter ¯ exceeds a preset threshold. 
Bottom-Up Variation In situations where information is only gained after each stage (such as in section 4), we can also train the classifiers “bottom-up.” That is, we can start by only adding classifiers to the last stage. Once finished with it, we proceed to the previous stage, and so on. Thus instead of selecting the best stage, i, in each round, we systematically work our way backward through the stages, never revisiting previously set stages. 3 Performance Bounds Using the bounds in [3] we can provide a risk bound for this problem. We let E denote the expectaˆ tion with respect to the true distribution p(x, c) and En denote the empirical average with respect to the n training samples. We first bound the indicator function with a piece-wise linear function, b θ , 1 with a maximum slope of θ : z ,0 . I(z > 0) ≤ bθ (z) = max min 1, 1 + θ We then bound the loss: L(f (x), c) ≤ φ θ (f (x), c) where s+1 φθ (f (x), c) = ci min{bθ (gi (xi )), bθ (−gi−1 (xi−1 )), bθ (−gi−2 (xi−2 )), . . . , bθ (−g1 (x1 ))} i=1 s+1 i ci Bθ (gi (xi ), gi−1 (xi−1 ), . . . , g1 (x1 )) = i=1 We replaced the product of indicator functions with a minimization and then bounded each indicator i with bθ . Bθ is just a more compact presentation of the composition of the function b θ and the minimization. We assume that the weights α at each stage have been scaled to sum to 1. This has no affect on the resulting classifications, but is necessary for the derivation below. Before stating the theorem, for clarity, we state two standard definition: Definition 1. Let p(x) be a probability distribution on the set X and let {x 1 , x2 , . . . , xn } be n independent samples from p(x). Let σ 1 , σ 2 , . . . , σ n be n independent samples from a Rademacher random variable (a binary variable that takes on either +1 or −1 with equal probability). Let F be a class of functions mapping X to . Define the Rademacher Complexity of F to be Rn (F ) = E sup f ∈F n 1 n σ i f (xi ) i=1 1 where the expectation is over the random draws of x through xn and σ 1 through σ n . Definition 2. Let p(x), {x1 , x2 , . . . , xn }, and F be as above. Let g 1 , g 2 , . . . , g n be n independent samples from a Gaussian distribution with mean 0 and variance 1. Analogous to the above definition, define the Gaussian Complexity of G to be Gn (F ) = E sup f ∈F 1 n n g i f (xi ) . i=1 We can now state our theorem, bounding the true risk by a function of the empirical risk: Theorem 3. Let H1 , H2 , . . . , Hs be the sequence of the sets of functions from which the base classifier draws for chain boosting. If H i is closed under negation for all i, all costs are bounded between 0 and 1, and the weights for the classifiers at each stage sum to 1, then with probability 1 − δ, k ˆ E [L(f (x), c)] ≤ En [φθ (f (x), c)] + θ s (i + 1)Gn (Hi ) + i=1 8 ln 2 δ n for some constant k. Proof. Theorem 8 of [3] states ˆ E [L(x, c)] ≤ En (φθ (f (x), c)) + 2Rn (φθ ◦ F ) + 8 ln 2 δ n and therefore we need only bound the R n (φθ ◦ F ) term to demonstrate our theorem. For our case, we have 1 Rn (φθ ◦ F ) = E sup n f ∈F = E sup f ∈F 1 n s+1 ≤ E sup j=1 f ∈F n n σ i φθ (f (xi ), ci ) i=1 s+1 σi i=1 1 n s ci Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) j j j−1 1 j=1 n s+1 s σ i Bθ (gj (xi ), gj−1 (xi ), . . . , g1 (xi )) = j j−1 1 i=1 s Rn (Bθ ◦ G j ) j=1 where Gi is the space of convex combinations of functions from H i and G i is the cross product of G1 through Gi . The inequality comes from switching the expectation and the maximization and then from dropping the c i (see [4], lemma 5). 
j s s Lemma 4 of [3] states that there exists a k such that R n (Bθ ◦ G j ) ≤ kGn (Bθ ◦ G j ). Theorem 14 j 2 s j s of the same paper allows us to conclude that G n (Bθ ◦ G ) ≤ θ i=1 Gn (Gi ). (Because Bθ is the 1 s minimum over a set of functions with maximum slope of θ , the maximum slope of B θ is also 1 .) θ Theorem 12, part 2 states G n (Gi ) = Gn (Hi ). Taken together, this proves our result. Note that this bound has only quadratic dependence on s, the length of the chain and does not explicitly depend on the number of rounds of boosting (the number of rounds affects φ θ which, in turn, affects the bound). 4 Application We tested our algorithm on the MIT face database [5]. This database contains 19-by-19 gray-scale images of faces and non-faces. The training set has 2429 face images and 4548 non-face images. The testing set has 472 faces and 23573 non-faces. We weighted the training set images so that the ratio of the weight of face images to non-face images matched the ratio in the testing set. 0.4 0.4 CB Global CB Bottom−up SVM Boosting 0.3 0.25 0.2 0.15 0.1 150 0.2 100 0.1 False positive rate 0.3 50 Average number of pixels 0.35 average cost/error per example 200 training cost training error testing cost testing error 0.05 0 100 200 300 400 500 number of rounds 700 1000 (a) 0 0 0.2 0.4 0.6 False negative rate 0.8 0 1 (b) Figure 2: (a) Accuracy verses the number of rounds for a typical run, (b) Error rates and average costs for a variety of cost settings. 4.1 Object Detection as Chained Boosting Our goal is to produce a classifier that can identify non-face images at very low resolutions, thereby allowing for quick processing of large images (as explained later). Most image patches (or subwindows) do not contain faces. We, therefore, built a multi-stage detection system where any early rejection is labeled as a non-face. The first stage looks at image patches of size 3-by-3 (i.e. a lowerresolution version of the 19-by-19 original image). The next stage looks at the same image, but at a resolution of 6-by-6. The third stage considers the image at 12-by-12. We did not present the full 19-by-19 images as the classification did not significantly improve over the 12-by-12 versions. We employ a simple base classifier: the set of all functions that look at a single pixel and predict the class by thresholding the pixel’s value. The total classifier at any stage is a linear combination of these simple classifiers. For a given stage, all of the base classifiers that target a particular pixel are added together producing a complex function of the value of the pixel. Yet, this pixel can only take on a finite number of values (256 in this case). Therefore, we can compile this set of base classifiers into a single look-up function that maps the brightness of the pixel into a real number. The total classifier for the whole stage is merely the sum of these look-up functions. Therefore, the total work necessary to compute the classification at a stage is proportional to the number of pixels in the image considered at that stage, regardless of the number of base classifiers used. We therefore assign a cost to each stage of processing proportional to the number of pixels at that stage. If the image is a face, we add a negative cost (i.e. bonus) if the image is allowed to pass through all of the processing stages (and is therefore “accepted” as a face). If the image is a nonface, we add a bonus if the image is rejected at any stage before completion (i.e. correctly labelled). 
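The stage classifiers described above (weighted sums of single-pixel threshold rules) compile into one 256-entry lookup table per pixel, so evaluating a stage costs one table lookup per pixel no matter how many weak rules were added. A minimal sketch of that compilation, with made-up weak rules and weights:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 6                                       # one low-resolution stage (e.g. 6x6)

# made-up weak rules for this stage: (pixel index, brightness threshold, signed weight)
rules = [(int(rng.integers(H * W)), int(rng.integers(256)), rng.normal())
         for _ in range(500)]

# compile every rule into a single 256-entry lookup table per pixel
tables = np.zeros((H * W, 256))
for pix, thr, wgt in rules:
    tables[pix, thr + 1:] += wgt                # rule votes +wgt when pixel value > thr
    tables[pix, :thr + 1] -= wgt                # and -wgt otherwise

def stage_score(patch):
    """g_i(x_i): one table lookup per pixel, summed over the stage's image."""
    return tables[np.arange(H * W), patch.reshape(-1)].sum()

patch = rng.integers(0, 256, size=(H, W))
# rejection follows f_i(x_i) = I(g_i(x_i) > 0): a positive score halts processing
print("reject (non-face)" if stage_score(patch) > 0 else "pass to next stage")
```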
While this dataset has only segmented image patches, in a real application the classifier would be run on all sub-windows of an image. More importantly, it would also be run at multiple resolutions in order to detect faces of different sizes (or at different distances from the camera). The classifier chain could be run simultaneously at each of these resolutions. To wit, while running the final 12-by-12 stage at one resolution of the image, the 6-by-6 (previous) stage could be run at the same image resolution. This 6-by-6 processing would be the necessary pre-processing step to running the 12-by-12 stage at a higher resolution. As we run our final scan for big faces (at a low resolution), we can already (at the same image resolution) be performing initial tests to throw out portions of the image as not worthy of testing for smaller faces (at a higher resolution). Most of the work of detecting objects must be done at the high resolutions because there are many more overlapping sub-windows. This chained method allows the culling of most of this high-resolution image processing.

4.2 Experiments

For each example, we construct a vector of stage costs as above. We add a constant to this vector to ensure that the minimal element is zero, as per Section 1.1. We scale all vectors by the same amount to ensure that the maximal value is 1. This means that the number of misclassifications is an upper bound on the total cost that the learning algorithm is trying to minimize. (A sketch of this construction appears at the end of this subsection.) There are three flexible quantities in this problem formulation: the cost of a pixel evaluation, the bonus for a correct face classification, and the bonus for a correct non-face classification. Changing these quantities controls the trade-off between false positives and true positives, and between classification error and speed.

Figure 2(a) shows the result of a typical run of the algorithm. As a function of the number of rounds, it plots the cost (that which the algorithm is trying to minimize) and the error (number of misclassified image patches), for both the training and testing sets (where the training set has been reweighted to have the same proportion of faces to non-faces as the testing set). We compared our algorithm's performance to the performance of support vector machines (SVM) [6] and AdaBoost [1] trained and tested on the highest-resolution, 12-by-12, image patches. We employed SVM-light [7] with a linear kernel. Figure 2(b) compares the error rates for the methods (solid lines, read against the left vertical axis). Note that the error rates are almost identical for the methods. The dashed lines (read against the right vertical axis) show the average number of pixels evaluated (or total processing cost) for each of the methods. The SVM and AdaBoost algorithms have a constant processing cost. Our method (by either training scheme) produces a lower processing cost for most error rates.
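As a concrete reading of the cost construction described at the start of this subsection, the following sketch (ours; the per-stage pixel counts, pixel cost, and bonus values are made-up placeholders, and the exact exit-cost bookkeeping is our interpretation) builds one raw cost vector per example and then applies the shift-and-scale normalization.

```python
import numpy as np

def cost_vector(is_face, pixel_cost, face_bonus, nonface_bonus,
                pixels_per_stage=(9, 36, 144)):
    """Raw exit costs for one example: entry i (i < s) is the cost of being
    rejected at stage i+1; the last entry is the cost of passing every stage."""
    cum = np.cumsum(pixels_per_stage).astype(float)   # pixels examined by each exit point
    c = pixel_cost * np.concatenate([cum, cum[-1:]])  # passing all stages costs the same pixels
    if is_face:
        c[-1] -= face_bonus        # bonus for letting a true face through every stage
    else:
        c[:-1] -= nonface_bonus    # bonus for rejecting a non-face before completion
    return c

def normalize(all_costs):
    """Shift each vector so its minimum is 0, then scale every vector by the
    same global factor so the overall maximum is 1."""
    shifted = [c - c.min() for c in all_costs]
    global_max = max(c.max() for c in shifted)
    return [c / global_max for c in shifted]

# Hypothetical settings: cheap pixel evaluations, equal bonuses.
costs = [cost_vector(True, 0.001, 1.0, 1.0), cost_vector(False, 0.001, 1.0, 1.0)]
print(normalize(costs))
```

Under this reading, raising the pixel cost pushes the learner toward earlier rejections (favoring speed), while raising either bonus pushes it toward classification accuracy, which is the trade-off discussed above.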
5 Related Work

Cascade detectors for vision processing (see [8] or [9], for example) may appear to be similar to the work in this paper; at first glance, especially in the area of object detection, they appear almost the same. However, cascade detection and this work (chained detection) are quite different. Cascade detectors are built one at a time. A coarse detector is first trained. The examples which pass that detector are then passed to a finer detector for training, and so on. A series of targets for false-positive rates defines the increasing accuracy of the detector cascade.

By contrast, our chain detectors are trained as an ensemble. This is necessary because of two differences in the problem formulation. First, we assume that the information available at each stage changes. Second, we assume there is an explicit cost model that dictates the cost of proceeding from stage to stage and the cost of rejection (or acceptance) at any particular stage. By contrast, cascade detectors seek to minimize the computation necessary for a fixed decision. Therefore, the information available to all of the stages is the same, and there are no fixed costs associated with each stage.

The ability to train all of the classifiers at the same time is crucial to good performance in our framework. The first classifier in the chain cannot determine whether it is advantageous to send an example further along unless it knows how the later stages will process the example. Conversely, the later stages cannot construct optimal classifications until they know the distribution of examples that they will see.

Section 4.1 may further confuse the matter. We demonstrated how chained boosting can be used to reduce the computational costs of object detection in images. Cascade detectors are often used for the same purpose. However, the reductions in computational time come from two different sources. In cascade detectors, the time taken to evaluate a given image patch is reduced. In our chained detector formulation, image patches are ignored completely based on analysis of lower-resolution patches in the image pyramid. To further illustrate the difference, cascade detectors can always be used to speed up asymmetric classification tasks (and are often applied to image detection). By contrast, in Section 4.1 we have exploited the fact that object detection in images is typically performed at multiple scales to turn the problem into a pipeline and apply our framework. Cascade detectors address situations in which prior class probabilities are not equal, while chained detectors address situations in which information is gained at a cost. Both are valid (and separate) ways of tackling image processing (and other tasks as well). In many ways, they are complementary approaches.

Classic sequential analysis [10, 11] also addresses the problem of optimal stopping. However, it assumes that the samples are drawn i.i.d. from a (usually) known distribution. Our problem is quite different in that each consecutive sample is drawn from a different (and related) distribution, and our goal is to find a decision rule without producing a generative model. WaldBoost [12] is a boosting algorithm based on this idea. It builds a series of features and a ratio comparison test in order to decide when to stop. For WaldBoost, the available features (information) do not change between stages. Rather, any feature is available for selection at any point in the chain. Again, this is a different problem than the one considered in this paper.

6 Conclusions

We feel this framework of staged decision making is useful in a wide variety of areas. This paper demonstrated how the framework applies to one vision-processing task. Obviously it also applies to manufacturing pipelines where errors can be introduced at different stages. It should also be applicable to scenarios where information gathering is costly.

Our current formulation only allows for early negative detection. In the face detection example above, this means that in order to report "face," the classifier must process each stage, even if the result is assured earlier.
In Figure 2(b), clearly the upper-left corner (100% false positives and 0% false negatives) is reachable with little effort: classify everything positive without looking at any features. We would like to extend this framework to cover such two-sided early decisions. While perhaps not useful in manufacturing (or even face detection, where the interesting part of the ROC curve is far from the upper-left), it would make the framework more applicable to information-gathering applications.

Acknowledgements

This research was supported through the grant "Adaptive Decision Making for Silicon Wafer Testing" from Intel Research and UC MICRO.

References

[1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37, 1995.
[2] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996.
[3] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2:463–482, 2002.
[4] Ron Meir and Tong Zhang. Generalization error bounds for Bayesian mixture algorithms. JMLR, 4:839–860, 2003.
[5] MIT. CBCL face database #1, 2000. http://cbcl.mit.edu/cbcl/softwaredatasets/FaceData2.html.
[6] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT, pages 144–152, 1992.
[7] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods — Support Vector Learning. MIT Press, 1999.
[8] Paul A. Viola and Michael J. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, pages 511–518, 2001.
[9] Jianxin Wu, Matthew D. Mullin, and James M. Rehg. Linear asymmetric classifier for cascade detectors. In ICML, pages 988–995, 2005.
[10] Abraham Wald. Sequential Analysis. Chapman & Hall, Ltd., 1947.
[11] K. S. Fu. Sequential Methods in Pattern Recognition and Machine Learning. Academic Press, 1968.
[12] Jan Šochman and Jiří Matas. WaldBoost — learning for time constrained sequential detection. In CVPR, pages 150–156, 2005.
3 0.95712018 52 nips-2006-Clustering appearance and shape by learning jigsaws
Author: Anitha Kannan, John Winn, Carsten Rother
Abstract: Patch-based appearance models are used in a wide range of computer vision applications. To learn such models it has previously been necessary to specify a suitable set of patch sizes and shapes by hand. In the jigsaw model presented here, the shape, size and appearance of patches are learned automatically from the repeated structures in a set of training images. By learning such irregularly shaped ‘jigsaw pieces’, we are able to discover both the shape and the appearance of object parts without supervision. When applied to face images, for example, the learned jigsaw pieces are surprisingly strongly associated with face parts of different shapes and scales such as eyes, noses, eyebrows and cheeks, to name a few. We conclude that learning the shape of the patch not only improves the accuracy of appearance-based part detection but also allows for shape-based part detection. This enables parts of similar appearance but different shapes to be distinguished; for example, while foreheads and cheeks are both skin colored, they have markedly different shapes. 1
4 0.95273191 74 nips-2006-Efficient Structure Learning of Markov Networks using $L_1$-Regularization
Author: Su-in Lee, Varun Ganapathi, Daphne Koller
Abstract: Markov networks are commonly used in a wide variety of applications, ranging from computer vision, to natural language, to computational biology. In most current applications, even those that rely heavily on learned models, the structure of the Markov network is constructed by hand, due to the lack of effective algorithms for learning Markov network structure from data. In this paper, we provide a computationally efficient method for learning Markov network structure from data. Our method is based on the use of L1 regularization on the weights of the log-linear model, which has the effect of biasing the model towards solutions where many of the parameters are zero. This formulation converts the Markov network learning problem into a convex optimization problem in a continuous space, which can be solved using efficient gradient methods. A key issue in this setting is the (unavoidable) use of approximate inference, which can lead to errors in the gradient computation when the network structure is dense. Thus, we explore the use of different feature introduction schemes and compare their performance. We provide results for our method on synthetic data, and on two real world data sets: pixel values in the MNIST data, and genetic sequence variations in the human HapMap data. We show that our L1 -based method achieves considerably higher generalization performance than the more standard L2 -based method (a Gaussian parameter prior) or pure maximum-likelihood learning. We also show that we can learn MRF network structure at a computational cost that is not much greater than learning parameters alone, demonstrating the existence of a feasible method for this important problem.
5 0.7775327 8 nips-2006-A Nonparametric Approach to Bottom-Up Visual Saliency
Author: Wolf Kienzle, Felix A. Wichmann, Matthias O. Franz, Bernhard Schölkopf
Abstract: This paper addresses the bottom-up influence of local image information on human eye movements. Most existing computational models use a set of biologically plausible linear filters, e.g., Gabor or Difference-of-Gaussians filters as a front-end, the outputs of which are nonlinearly combined into a real number that indicates visual saliency. Unfortunately, this requires many design parameters such as the number, type, and size of the front-end filters, as well as the choice of nonlinearities, weighting and normalization schemes etc., for which biological plausibility cannot always be justified. As a result, these parameters have to be chosen in a more or less ad hoc way. Here, we propose to learn a visual saliency model directly from human eye movement data. The model is rather simplistic and essentially parameter-free, and therefore contrasts recent developments in the field that usually aim at higher prediction rates at the cost of additional parameters and increasing model complexity. Experimental results show that—despite the lack of any biological prior knowledge—our model performs comparably to existing approaches, and in fact learns image features that resemble findings from several previous studies. In particular, its maximally excitatory stimuli have center-surround structure, similar to receptive fields in the early human visual system. 1
6 0.76220959 42 nips-2006-Bayesian Image Super-resolution, Continued
7 0.76146811 34 nips-2006-Approximate Correspondences in High Dimensions
8 0.75880629 94 nips-2006-Image Retrieval and Classification Using Local Distance Functions
9 0.74325019 122 nips-2006-Learning to parse images of articulated bodies
10 0.7420637 110 nips-2006-Learning Dense 3D Correspondence
11 0.73444724 47 nips-2006-Boosting Structured Prediction for Imitation Learning
12 0.72917563 185 nips-2006-Subordinate class recognition using relational object models
13 0.7182821 45 nips-2006-Blind Motion Deblurring Using Image Statistics
14 0.68589389 154 nips-2006-Optimal Change-Detection and Spiking Neurons
15 0.68469632 118 nips-2006-Learning to Model Spatial Dependency: Semi-Supervised Discriminative Random Fields
16 0.68437093 72 nips-2006-Efficient Learning of Sparse Representations with an Energy-Based Model
17 0.68421477 78 nips-2006-Fast Discriminative Visual Codebooks using Randomized Clustering Forests
18 0.67860001 188 nips-2006-Temporal and Cross-Subject Probabilistic Models for fMRI Prediction Tasks
19 0.67429054 130 nips-2006-Max-margin classification of incomplete data
20 0.66711694 53 nips-2006-Combining causal and similarity-based reasoning