nips nips2004 nips2004-205 knowledge-graph by maker-knowledge-mining

205 nips-2004-Who's In the Picture


Source: pdf

Author: Tamara L. Berg, Alexander C. Berg, Jaety Edwards, David A. Forsyth

Abstract: The context in which a name appears in a caption provides powerful cues as to who is depicted in the associated image. We obtain 44,773 face images, using a face detector, from approximately half a million captioned news images and automatically link names, obtained using a named entity recognizer, with these faces. A simple clustering method can produce fair results. We improve these results significantly by combining the clustering process with a model of the probability that an individual is depicted given its context. Once the labeling procedure is over, we have an accurately labeled set of faces, an appearance model for each individual depicted, and a natural language model that can produce accurate results on captions in isolation. 1

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Abstract: The context in which a name appears in a caption provides powerful cues as to who is depicted in the associated image. [sent-10, score-0.537]

2 We obtain 44,773 face images, using a face detector, from approximately half a million captioned news images and automatically link names, obtained using a named entity recognizer, with these faces. [sent-11, score-0.628]

3 Once the labeling procedure is over, we have an accurately labeled set of faces, an appearance model for each individual depicted, and a natural language model that can produce accurate results on captions in isolation. [sent-14, score-0.655]

4 In this paper, we show that significant gains are available by treating language more carefully. [sent-18, score-0.28]

5 A face detector is used to identify potential faces and a named entity recognizer to identify potential names. [sent-20, score-0.595]

6 Multiple faces and names from one image-caption pair are quite common. [sent-21, score-0.415]

7 The problem is to find a correspondence between some of the faces and names. [sent-22, score-0.196]

8 As part of the solution we learn an appearance model for each pictured name and the likelihood of a particular instance of a name being pictured based on the surrounding words and punctuation. [sent-23, score-1.88]

9 Although face recognition is well studied, it does not work very well in practice [15]. [sent-26, score-0.239]

10 One motivation for our work is to take the large collection of news images and captions as semi-supervised input and produce a fully supervised dataset of faces labeled with names. [sent-27, score-0.458]

11 The resulting dataset exhibits many of the confounding factors that make real-world face recognition difficult, in particular modes of variation that are not found in face recognition datasets collected in laboratories. [sent-28, score-0.504]

12 It is important to note that this task is easier than general face recognition because each face has only a few associated names. [sent-29, score-0.431]

13 Center: Detected faces and names for this data item. [sent-33, score-0.415]

14 Our model allows each face to be assigned to at most one name, each name to be assigned to at most one face, and any face or name to be assigned to Null. [sent-35, score-1.028]

15 Our named entity recognizer occasionally identifies strings that do not refer to actual people (e. [sent-36, score-0.264]

16 These names are assigned low probability under our model and therefore their assignment to a face is unlikely. [sent-39, score-0.649]

17 EM iterates between computing the expectation of the possible face-name correspondences and updating the face clusters and language model. [sent-40, score-0.585]

18 Language: Quite simple, common phenomena in captions suggest using a language model. [sent-43, score-0.42]

19 First, our named entity recognizer occasionally marks incorrect names like “United Nations”. [sent-44, score-0.538]

20 Our approach combines a simple appearance model using kPCA and LDA, with a language model, based on context. [sent-53, score-0.401]

21 We evaluate both an EM and maximum likelihood clustering and show that incorporating language with appearance produces better results than using appearance alone. [sent-54, score-0.593]

22 We also show the results of the learned natural language classifier applied to a set of captions in isolation. [sent-55, score-0.42]

23 (Sec 2) Linking a face and language model with EM: A natural way of thinking about name assignment is as a hidden variable problem where the hidden variables are the correct name-face correspondences for each picture. [sent-56, score-0.966]

24 Figure 2: Names assigned using our raw clustering procedure (before) and incorporating a language model (after). (Figure label residue: before → after name pairs, e.g. “al Qaeda” → “Null”, “James Bond” → “Pierce Brosnan”, “James Ivory” → “Naomi Watts”.) [sent-65, score-0.444]

25 Our named entity recognizer occasionally detects incorrect names (e. [sent-66, score-0.538]

26 “CEO Summit”), but based on context the language model assigns low probabilities to these names, making their assignment unlikely. [sent-68, score-0.52]

27 When multiple names are detected like “Julia Vakulenko” and “Jennifer Capriati”, the probability for each name depends on its context. [sent-69, score-0.531]

28 The caption for this picture reads “American Jennifer Capriati returns the ball to her Ukrainian opponent Julia Vakulenko in Paris during. [sent-70, score-0.195]

29 ” The language model prefers to assign the name “Jennifer Capriati” because its context (beginning of the caption followed by a present tense verb) indicates it is more likely to be pictured than “Julia Vakulenko” (middle of the caption followed by a preposition). [sent-73, score-1.653]

30 For pictures such as the man relabeled from “al Qaeda” to “Null”, where the individual is not named in the caption, the language model correctly assigns “Null” to the face. [sent-74, score-0.543]

31 As table 1 shows, incorporating a language model improves our face clusters significantly. [sent-75, score-0.58]

32 EM alternates between computing the expected values of the set of face-name correspondences (given a face clustering and language model) and updating the face clusters and language model given the correspondences. [sent-76, score-1.149]

33 For each of these (name, context) pairs, generate a binary variable pictured conditioned on the context alone (from P(pictured | context)). [sent-84, score-0.79]

34 For each pictured = 1, generate a face from P(face | name) (giving the F_n pictured faces). [sent-86, score-0.814]

35 (Graphical-model plate residue: variables name, context, pictured, face_u, face_n; plate counts N, F_u, F_n, D.) [sent-87, score-0.706]

36 5. Generate the F_u = F − F_n other faces from P(face). [sent-88, score-0.157]

37–38 The parameters of the model are P(face | name) (sec 2.2), the probability that a face is generated by a given name, P(pictured | context) (sec 2.3), the probability that a name is pictured given its context, and P(face), the probability that a face is generated without a name. [sent-90, score-0.192] [sent-91, score-1.059]
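The generative steps in sentences 33–38 can be written down directly. A minimal sketch, assuming hypothetical handles `p_pictured`, `sample_face_given_name`, and `sample_background_face` for the three learned distributions (none of these names come from the paper):

```python
import random

def sample_image(name_context_pairs, total_faces, p_pictured,
                 sample_face_given_name, sample_background_face):
    """Sample one image's faces from the generative model sketched above.

    name_context_pairs: (name, context) tuples extracted from the caption.
    total_faces: F, the number of faces in the image.
    """
    faces, pictured = [], []
    # For each (name, context), flip `pictured` from P(pictured | context).
    for name, context in name_context_pairs:
        if random.random() < p_pictured(context):
            pictured.append(name)
            faces.append(sample_face_given_name(name))  # drawn from P(face | name)
    # Generate the remaining F_u = F - F_n faces from P(face).
    for _ in range(max(0, total_faces - len(pictured))):
        faces.append(sample_background_face())
    return faces, pictured
```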

39 Figure 3: Left: Example clusters using only appearance to cluster. (Figure residue: cluster labels such as “empty” under “Without language model” vs. “Anastasia Myskina” under “With language model”.) [sent-96, score-0.76]

40 Right: The same clusters, but using appearance + language to cluster. [sent-97, score-0.376]

41 All clusters get more accurate because the language model is breaking ambiguities and giving the clustering a push in the right direction. [sent-100, score-0.426]

42 Other clusters like “Abraham Lincoln” (who is a person, but whose associated pictures most often portray people other than “Abraham Lincoln”) become empty when using the language model, presumably because these faces are assigned to the correct names. [sent-104, score-0.698]

43 1 Name Assignment For each image-caption pair, we calculate the costs of all possible assignments of names to faces (dependent upon the associated faces and names) and use the best such assignment. [sent-106, score-0.606]

44 An example of the extracted names, faces and all possible assignments can be seen in figure 1. [sent-107, score-0.191]
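Since a caption rarely yields more than a handful of names and faces, the legal assignments (partial one-to-one matchings, with leftovers going to Null) can be enumerated exhaustively. A sketch under that assumption; `score` is a placeholder for the likelihood given below:

```python
from itertools import combinations, permutations

def all_assignments(names, faces):
    """Yield every partial one-to-one matching of names to faces as a dict
    {name: face}; names and faces missing from the dict map to Null."""
    for k in range(min(len(names), len(faces)) + 1):
        for chosen_names in combinations(names, k):
            for chosen_faces in permutations(faces, k):
                yield dict(zip(chosen_names, chosen_faces))

def best_assignment(names, faces, score):
    """Pick the assignment with the highest likelihood under the model."""
    return max(all_assignments(names, faces), key=score)
```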

45 The likelihood of picture x_i under assignment a_j of names to faces under our generative model is: L(x_i, a_j) = P(N) P(F) P(n_1, c_1) ⋯ P(n_N, c_N) ∏_α P(pictured | c_α) P(f_{σ(α)} | n_α) ∏_β P(not pictured | c_β) ∏_γ P(f_γ). [sent-108, score-0.707]

46 In assignment a_j, α indexes into the names that are pictured, σ(α) indexes into the faces assigned to the pictured names, β indexes into the names that are not pictured, and γ indexes into the faces without assigned names. [sent-112, score-2.52]

47 The terms P(n_n, c_n) are not dependent on the assignment, so we can ignore them when calculating the probability of an assignment and focus on the remaining terms. [sent-116, score-0.262]
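Dropping the assignment-independent factors P(N), P(F), and the P(n, c) terms as sentence 47 suggests, one assignment can be scored in log space. A hedged sketch; the three probability handles are the model components named in sentences 37–38, and faces are assumed to be hashable ids:

```python
import math

def log_score(assignment, name_context_pairs, faces,
              p_pictured, p_face_given_name, p_face):
    """Assignment-dependent log-likelihood terms of L(x_i, a_j).

    assignment: {name: face} for pictured names (the alpha / sigma(alpha)
    terms); everything left out contributes a beta or gamma (Null) term.
    """
    logp = 0.0
    for name, context in name_context_pairs:
        if name in assignment:                    # alpha: pictured
            logp += math.log(p_pictured(context))
            logp += math.log(p_face_given_name(assignment[name], name))
        else:                                     # beta: not pictured
            logp += math.log(1.0 - p_pictured(context))
    used = set(assignment.values())
    for face in faces:
        if face not in used:                      # gamma: unassigned faces
            logp += math.log(p_face(face))
    return logp
```

Plugged into `best_assignment` above, this yields the maximum-likelihood name assignment for one image-caption pair.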

48 (Truncated example caption from Figure 4: “… food company, on July 1, 2003 said it would take steps, like capping portion sizes and providing more nutrition information, as it and other companies face growing concern and even lawsuits due to rising obesity rates.”) [sent-132, score-0.192]

49 Figure 4: Our new procedure gives us not only better clustering results, but also a natural language classifier which can be tested on captions in isolation. [sent-134, score-0.487]

50 Above: a few captions labeled with IN (pictured) and OUT (not pictured) using our learned language model. [sent-135, score-0.46]

51 Our language model has learned which contexts have high probability of referring to pictured individuals and which contexts have low probabilities. [sent-136, score-0.991]

52 We observe an 85% accuracy of labeling who is portrayed in a picture using only our language model. [sent-137, score-0.374]

53 The last incorrectly labels “Stephen Joseph” as not pictured when in fact he is the subject of the picture. [sent-139, score-0.622]

54 Some contexts that are often incorrectly labeled are those where the name appears near the end of the caption (usually a cue that the individual named is not pictured). [sent-140, score-0.557]

55 Some cues we could add that should improve the accuracy of our language model are the nearness of words like “shown”, “pictured”, or “photographed”. [sent-141, score-0.363]

56 This gives a straightforward EM procedure: E step – update the P_ij according to the normalized probability of picture i with assignment j; M step – update the face clusters and language model given these correspondences. [sent-144, score-0.176]
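Combining the pieces, the E step normalizes assignment likelihoods into the P_ij and the M step refits the appearance and language models; flipping one flag gives the maximal-correspondence (MM) variant discussed below. A schematic sketch reusing the `all_assignments` and `log_score` helpers above, with a hypothetical `model` interface:

```python
import math

def fit(data, model, n_iters=20, hard=False):
    """Alternate E and M steps over image-caption items.

    data: list of (name_context_pairs, faces) items.
    hard=True is the MM variant: all posterior mass on the best assignment.
    """
    for _ in range(n_iters):
        posteriors = []                # one {assignment: weight} dict per item
        for name_context_pairs, faces in data:
            names = [n for n, _ in name_context_pairs]
            scored = {
                frozenset(a.items()): math.exp(log_score(
                    a, name_context_pairs, faces, model.p_pictured,
                    model.p_face_given_name, model.p_face))
                for a in all_assignments(names, faces)
            }
            if hard:                   # MM: winner takes all
                best = max(scored, key=scored.get)
                posteriors.append({best: 1.0})
            else:                      # EM: normalized expectations P_ij
                z = sum(scored.values())
                posteriors.append({a: s / z for a, s in scored.items()})
        # M step: update face clusters and the language model (sentence 32).
        model.update_appearance(data, posteriors)
        model.update_language(data, posteriors)
    return model
```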

57 (Sec 2.2) Modeling the appearance of faces – P(face | name): We model appearance using a mixture model with one mixture element per name in our lexicon. [sent-147, score-0.644]

58 Our representation is obtained by rectification of the faces followed by kernel principal components analysis (kPCA) and linear discriminant analysis (LDA) (details in [5]). [sent-149, score-0.19]

59 Here we show what percentage of those faces are correctly labeled by each of our methods (EM and maximal correspondence clustering, MM). [sent-153, score-0.34]

60 For both methods, incorporating a language model improves their respective clusterings greatly. [sent-154, score-0.334]

61 Standard statistical intuition says that using the expected values (EM) should perform better than simply choosing the maximal assignment at each step (MM). [sent-155, score-0.177]

62 However, we have found that using the maximal assignment works better than taking an expectation. [sent-156, score-0.177]

63 One reason this could be true is that EM averages incorrect faces into the appearance model, making the mean unstable. [sent-157, score-0.295]

64 We then use kPCA ([16]) to reduce the dimensionality of our data and compute linear discriminants ([3]) on the single name, single face pictures. [sent-159, score-0.192]
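The rectified-face → kPCA → LDA pipeline can be approximated with scikit-learn. A minimal sketch, assuming rectified faces arrive as flattened pixel vectors; the component counts are illustrative, not the paper's settings:

```python
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_face_space(X_single, y_single, n_kpca=100, n_lda=50):
    """Fit the projections on single-name, single-face images.

    X_single: rectified faces as row vectors; y_single: their name labels.
    n_lda must be smaller than the number of distinct names.
    """
    kpca = KernelPCA(n_components=n_kpca, kernel="rbf").fit(X_single)
    lda = LinearDiscriminantAnalysis(n_components=n_lda)
    lda.fit(kpca.transform(X_single), y_single)
    # Return a function that embeds any set of faces into the space.
    return lambda X: lda.transform(kpca.transform(X))
```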

65 (Sec 2.3) Language model – P(pictured | context): Our language model assigns a probability to each name based on its context within the caption. [sent-164, score-0.634]

66 These distributions, P (pictured|context), are learned using counts of how often each context appears describing an assigned name, versus how often that context appears describing an unassigned name. [sent-165, score-0.211]

67 We have one distribution for each possible context cue, and assume that context cues are modeled independently (because we lack enough data to model them jointly). [sent-166, score-0.251]

68 For context, we use a variety of cues: the part of speech tags of the word immediately prior to the name and immediately after the name within the caption (modeled jointly), the location of the name in the caption, and the distances to the nearest punctuation marks such as “,” and “.”. [sent-167, score-0.885]
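A hedged sketch of extracting those cues for one detected name. The tagger here is NLTK's `pos_tag` (an assumption; the paper does not name its tagger), and all feature names are made up for illustration:

```python
import nltk  # assumes the averaged_perceptron_tagger model is downloaded

def context_cues(tokens, name_start, name_end):
    """Cues for the name spanning tokens [name_start, name_end):
    joint POS tags of the words before/after the name, the name's
    location in the caption, and distances to the nearest ',' and '.'."""
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    prev_tag = tags[name_start - 1] if name_start > 0 else "<START>"
    next_tag = tags[name_end] if name_end < len(tags) else "<END>"
    cues = {
        "pos_pair": (prev_tag, next_tag),               # modeled jointly
        "location": round(name_start / max(1, len(tokens)), 1),
    }
    for punct in (",", "."):
        dists = [abs(i - name_start) for i, t in enumerate(tokens) if t == punct]
        cues["dist_" + punct] = min(dists, default=len(tokens))
    return cues
```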

69 We tried adding a variety of other language model cues, but found that they did not increase the assignment accuracy. [sent-169, score-0.436]

70 The probability of being pictured given multiple context cues (where the C_i are the different independent context cues) can be formed using Bayes' rule: P(pictured | C_1, …, C_n) ∝ P(pictured) ∏_i P(C_i | pictured). [sent-170, score-0.848]
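Under that independence assumption, the combination is naive Bayes over the cue values. A minimal sketch with Laplace smoothing; the count tables are hypothetical stand-ins for the training counts of sentence 66:

```python
def p_pictured_given_cues(cues, counts_in, counts_out, prior_in, alpha=1.0):
    """Combine independent context cues with Bayes' rule.

    counts_in[cue][value] / counts_out[cue][value]: training counts of each
    cue value for pictured / not-pictured names; prior_in: P(pictured).
    """
    odds = prior_in / (1.0 - prior_in)
    for cue, value in cues.items():
        n_in = counts_in[cue].get(value, 0) + alpha
        n_out = counts_out[cue].get(value, 0) + alpha
        z_in = sum(counts_in[cue].values()) + alpha * max(1, len(counts_in[cue]))
        z_out = sum(counts_out[cue].values()) + alpha * max(1, len(counts_out[cue]))
        odds *= (n_in / z_in) / (n_out / z_out)
    return odds / (1.0 + odds)
```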

71 We have tried both methods and found that using the maximal assignment produced better results (table 1). [sent-180, score-0.177]

72 Classifier | labels correct | IN correct | OUT correct
Baseline | 67% | 100% | 0%
EM Labeling with Language Model | 76% | 95% | 56%
MM Labeling with Language Model | 84% | 87% | 76%
Table 2: Above: Results of applying our learned language model to a test set of 430 captions (text alone). [sent-183, score-0.547]

73 In our test set, we hand labeled each detected name with IN/OUT based on whether the named individual was pictured within the corresponding picture. [sent-184, score-1.18]

74 We then tested how well our language model could predict those labels (“labels correct” refers to the total percentage of names that were correctly labeled, “IN correct” the percentage of pictured names correctly labeled, and “OUT correct” the percentage of not pictured names correctly labeled). [sent-185, score-2.497]

75 The baseline figure gives the accuracy of labeling all names as pictured. [sent-186, score-0.307]

76 Using EM to learn a language model gives an accuracy of 76% while using a maximum likelihood clustering gives 84%. [sent-187, score-0.397]

77 Names that are most often mislabeled are those that appear near the end of the caption or in other contexts that usually denote a name being not pictured. [sent-189, score-0.454]

78 The maximal assignment (MM) process is nearly the same as the EM process, except that instead of computing an expectation over assignments, only the maximal assignment is given nonzero weight. [sent-190, score-0.308]

79 3 Results We have collected a dataset consisting of approximately half a million news pictures and captions from Yahoo News over a period of roughly two years. [sent-193, score-0.363]

80 Faces: Using the face detector of [14], we extract 44,773 large, well-detected face images. [sent-194, score-0.455]

81 Our face recognition dataset is more varied than any other to date. [sent-196, score-0.265]

82 Names: We use an open source named entity recognizer ([8]) to detect proper names in each of the associated captions. [sent-197, score-0.498]

83 This gives us a set of names associated with each picture. [sent-198, score-0.258]

84 Scale: We obtain 44,773 large and reliable face detector responses. [sent-199, score-0.235]

85 We reject face images that cannot be rectified satisfactorily, leaving 34,623. [sent-200, score-0.221]

86 Finally, we concentrate on images within whose captions we detect proper names, leaving 30,281, the final set we cluster on. [sent-201, score-0.169]

87 1 Quantitative Results Incorporating a natural language model into face clustering produces much better results than clustering on appearance alone. [sent-203, score-0.727]

88 As can be seen in table 1, using only appearance produces an accuracy of 67%, while appearance + language gives 77%. [sent-204, score-0.472]

89 For face labeling, using the maximum likelihood assignment (MM) rather than the average (EM) produces better results (77% vs 72%). [sent-205, score-0.348]

90 One neat by-product of our clustering is a natural language classifier. [sent-206, score-0.347]

91 In table 2, we show results for labeling names with pictured and not pictured using our language model. [sent-208, score-1.831]

92 Using the language model we correctly label 84% of the names while the baseline (labeling everyone as pictured) only gives 67%. [sent-209, score-0.594]

93 The maximum likelihood assignment also produces a better language model than EM (84% vs 76%). [sent-210, score-0.461]

94 A few things that our language model learns as indicative of being pictured are being near the beginning of the caption, being followed by a present tense verb, and being near “(L)”, “(R)”, or “(C)”. [sent-211, score-1.045]

95 4 Discussion We have shown previously ([5]) that a good clustering can be created using names and faces. [sent-212, score-0.325]

96 In this work, we show that by analyzing language more carefully we can produce a much better clustering (table 1). [sent-213, score-0.347]

97 Not only do we produce better face clusters, but we also learn a natural language classifier that can be used to determine who is pictured from text alone (table 2). [sent-214, score-1.094]

98 We have coupled language and images, using language to learn about images and images to learn about language. [sent-215, score-0.618]

99 The next step will be to try to learn a language model for free text on a webpage. [sent-216, score-0.305]

100 (Reference fragment:) “… Fisherfaces: Recognition Using Class Specific Linear Projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, special issue on face recognition. [sent-235, score-0.192]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('pictured', 0.622), ('language', 0.28), ('names', 0.258), ('name', 0.245), ('face', 0.192), ('null', 0.161), ('faces', 0.157), ('caption', 0.15), ('captions', 0.14), ('assignment', 0.131), ('pictures', 0.104), ('appearance', 0.096), ('ace', 0.092), ('context', 0.084), ('em', 0.084), ('recognizer', 0.081), ('clustering', 0.067), ('news', 0.066), ('named', 0.063), ('abraham', 0.062), ('angelina', 0.062), ('capriati', 0.062), ('jolie', 0.062), ('julia', 0.062), ('vakulenko', 0.062), ('berg', 0.062), ('entity', 0.059), ('correspondences', 0.059), ('cues', 0.058), ('clusters', 0.054), ('forsyth', 0.054), ('lincoln', 0.054), ('jennifer', 0.054), ('mm', 0.052), ('kpca', 0.049), ('indexes', 0.049), ('president', 0.049), ('item', 0.049), ('pij', 0.049), ('labeling', 0.049), ('recognition', 0.047), ('anastasia', 0.047), ('avram', 0.047), ('dole', 0.047), ('elizabeth', 0.047), ('marcel', 0.047), ('myskina', 0.047), ('olympics', 0.047), ('summit', 0.047), ('maximal', 0.046), ('picture', 0.045), ('assigned', 0.043), ('detector', 0.043), ('incorrect', 0.042), ('jackson', 0.041), ('verb', 0.041), ('winter', 0.041), ('labeled', 0.04), ('correspondence', 0.039), ('anniversary', 0.037), ('open', 0.037), ('occasionally', 0.035), ('assignments', 0.034), ('correct', 0.034), ('followed', 0.033), ('aj', 0.033), ('lm', 0.033), ('contexts', 0.032), ('bid', 0.031), ('celebrating', 0.031), ('ceo', 0.031), ('fehr', 0.031), ('furlong', 0.031), ('maria', 0.031), ('qaeda', 0.031), ('sampras', 0.031), ('tense', 0.031), ('unusually', 0.031), ('correctly', 0.031), ('july', 0.031), ('images', 0.029), ('incorporating', 0.029), ('detected', 0.028), ('blur', 0.027), ('friday', 0.027), ('thursday', 0.027), ('barnard', 0.027), ('joseph', 0.027), ('victory', 0.027), ('near', 0.027), ('percentage', 0.027), ('malik', 0.027), ('million', 0.027), ('people', 0.026), ('dataset', 0.026), ('model', 0.025), ('likelihood', 0.025), ('votes', 0.025), ('donald', 0.025), ('preposition', 0.025)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000005 205 nips-2004-Who's In the Picture

Author: Tamara L. Berg, Alexander C. Berg, Jaety Edwards, David A. Forsyth

Abstract: The context in which a name appears in a caption provides powerful cues as to who is depicted in the associated image. We obtain 44,773 face images, using a face detector, from approximately half a million captioned news images and automatically link names, obtained using a named entity recognizer, with these faces. A simple clustering method can produce fair results. We improve these results significantly by combining the clustering process with a model of the probability that an individual is depicted given its context. Once the labeling procedure is over, we have an accurately labeled set of faces, an appearance model for each individual depicted, and a natural language model that can produce accurate results on captions in isolation. 1

2 0.1110151 200 nips-2004-Using Random Forests in the Structured Language Model

Author: Peng Xu, Frederick Jelinek

Abstract: In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. 1

3 0.11018725 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

Author: Margarita Osadchy, Matthew L. Miller, Yann L. Cun

Abstract: We describe a novel method for real-time, simultaneous multi-view face detection and facial pose estimation. The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and non-face images to points far from that manifold. This network is trained by optimizing a loss function of three variables: image, pose, and face/non-face label. We test the resulting system, in a single configuration, on three standard data sets – one for frontal pose, one for rotated faces, and one for profiles – and find that its performance on each set is comparable to previous multi-view face detectors that can only handle one form of pose variation. We also show experimentally that the system’s accuracy on both face detection and pose estimation is improved by training for the two tasks together.

4 0.083874255 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

5 0.076322734 40 nips-2004-Common-Frame Model for Object Recognition

Author: Pierre Moreels, Pietro Perona

Abstract: A generative probabilistic model for objects in images is presented. An object consists of a constellation of features. Feature appearance and pose are modeled probabilistically. Scene images are generated by drawing a set of objects from a given database, with random clutter sprinkled on the remaining image surface. Occlusion is allowed. We study the case where features from the same object share a common reference frame. Moreover, parameters for shape and appearance densities are shared across features. This is to be contrasted with previous work on probabilistic ‘constellation’ models where features depend on each other, and each feature and model have different pose and appearance statistics [1, 2]. These two differences allow us to build models containing hundreds of features, as well as to train each model from a single example. Our model may also be thought of as a probabilistic revisitation of Lowe’s model [3, 4]. We propose an efficient entropy-minimization inference algorithm that constructs the best interpretation of a scene as a collection of objects and clutter. We test our ideas with experiments on two image databases. We compare with Lowe’s algorithm and demonstrate better performance, in particular in presence of large amounts of background clutter.

6 0.072141178 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

7 0.072112806 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

8 0.065878361 115 nips-2004-Maximum Margin Clustering

9 0.065580197 44 nips-2004-Conditional Random Fields for Object Recognition

10 0.063762404 166 nips-2004-Semi-supervised Learning via Gaussian Processes

11 0.060506098 87 nips-2004-Integrating Topics and Syntax

12 0.057823859 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

13 0.056818746 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

14 0.05679005 192 nips-2004-The power of feature clustering: An application to object detection

15 0.055440631 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

16 0.054294076 127 nips-2004-Neighbourhood Components Analysis

17 0.053430807 68 nips-2004-Face Detection --- Efficient and Rank Deficient

18 0.051496558 83 nips-2004-Incremental Learning for Visual Tracking

19 0.049897868 191 nips-2004-The Variational Ising Classifier (VIC) Algorithm for Coherently Contaminated Data

20 0.048597436 161 nips-2004-Self-Tuning Spectral Clustering


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.138), (1, 0.046), (2, -0.052), (3, -0.138), (4, 0.064), (5, 0.075), (6, -0.019), (7, 0.037), (8, -0.015), (9, 0.059), (10, -0.002), (11, 0.069), (12, -0.125), (13, 0.076), (14, -0.044), (15, -0.068), (16, 0.03), (17, -0.002), (18, 0.065), (19, -0.006), (20, -0.026), (21, -0.068), (22, -0.07), (23, -0.061), (24, -0.007), (25, 0.025), (26, 0.025), (27, 0.015), (28, -0.052), (29, 0.12), (30, 0.084), (31, 0.026), (32, 0.06), (33, 0.147), (34, 0.054), (35, -0.057), (36, 0.089), (37, 0.047), (38, 0.096), (39, 0.061), (40, -0.135), (41, 0.041), (42, 0.055), (43, 0.056), (44, -0.043), (45, -0.03), (46, 0.138), (47, 0.033), (48, 0.01), (49, -0.004)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96076363 205 nips-2004-Who's In the Picture

Author: Tamara L. Berg, Alexander C. Berg, Jaety Edwards, David A. Forsyth

Abstract: The context in which a name appears in a caption provides powerful cues as to who is depicted in the associated image. We obtain 44,773 face images, using a face detector, from approximately half a million captioned news images and automatically link names, obtained using a named entity recognizer, with these faces. A simple clustering method can produce fair results. We improve these results significantly by combining the clustering process with a model of the probability that an individual is depicted given its context. Once the labeling procedure is over, we have an accurately labeled set of faces, an appearance model for each individual depicted, and a natural language model that can produce accurate results on captions in isolation. 1

2 0.5817613 200 nips-2004-Using Random Forests in the Structured Language Model

Author: Peng Xu, Frederick Jelinek

Abstract: In this paper, we explore the use of Random Forests (RFs) in the structured language model (SLM), which uses rich syntactic information in predicting the next word based on words already seen. The goal in this work is to construct RFs by randomly growing Decision Trees (DTs) using syntactic information and investigate the performance of the SLM modeled by the RFs in automatic speech recognition. RFs, which were originally developed as classifiers, are a combination of decision tree classifiers. Each tree is grown based on random training data sampled independently and with the same distribution for all trees in the forest, and a random selection of possible questions at each node of the decision tree. Our approach extends the original idea of RFs to deal with the data sparseness problem encountered in language modeling. RFs have been studied in the context of n-gram language modeling and have been shown to generalize well to unseen data. We show in this paper that RFs using syntactic information can also achieve better performance in both perplexity (PPL) and word error rate (WER) in a large vocabulary speech recognition system, compared to a baseline that uses Kneser-Ney smoothing. 1

3 0.570409 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)

Author: Kumar Chellapilla, Patrice Y. Simard

Abstract: Machine learning is often used to automatically solve human tasks. In this paper, we look for tasks where machine learning algorithms are not as good as humans with the hope of gaining insight into their current limitations. We studied various Human Interactive Proofs (HIPs) on the market, because they are systems designed to tell computers and humans apart by posing challenges presumably too hard for computers. We found that most HIPs are pure recognition tasks which can easily be broken using machine learning. The harder HIPs use a combination of segmentation and recognition tasks. From this observation, we found that building segmentation tasks is the most effective way to confuse machine learning algorithms. This has enabled us to build effective HIPs (which we deployed in MSN Passport), as well as design challenging segmentation tasks for machine learning algorithms. 1 In t rod u ct i on The OCR problem for high resolution printed text has virtually been solved 10 years ago [1]. On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. Is there a fundamental difference between these two seemingly similar problems? To shed more light on this question, we study problems that have been designed to be difficult for computers. The hope is that we will get some insight on what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences. Work on distinguishing computers from humans traces back to the original Turing Test [2] which asks that a human distinguish between another human and a machine by asking questions of both. Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. Such systems have been termed Human Interactive Proofs (HIPs) [3] or Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) [4]. An overview of the work in this area can be found in [5]. Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. Ideally a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0.01% of the time. This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack. This constraint of generating tasks that are failed 99.99% of the time by all automated algorithms has generated various solutions which can easily be sampled on the internet. Seven different HIPs, namely, Mailblocks, MSN (before April 28th, 2004), Ticketmaster, Yahoo, Yahoo v2 (after Sept’04), Register, and Google, will be given as examples in the next section. We will show in Section 3 that machinelearning-based attacks are far more successful than 1 in 10,000. Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. 
The final section discusses a (known) weakness of machine learning algorithms and suggests designing simple artificial datasets for studying this weakness. 1

4 0.53681761 182 nips-2004-Synergistic Face Detection and Pose Estimation with Energy-Based Models

Author: Margarita Osadchy, Matthew L. Miller, Yann L. Cun

Abstract: We describe a novel method for real-time, simultaneous multi-view face detection and facial pose estimation. The method employs a convolutional network to map face images to points on a manifold, parametrized by pose, and non-face images to points far from that manifold. This network is trained by optimizing a loss function of three variables: image, pose, and face/non-face label. We test the resulting system, in a single configuration, on three standard data sets – one for frontal pose, one for rotated faces, and one for profiles – and find that its performance on each set is comparable to previous multi-view face detectors that can only handle one form of pose variation. We also show experimentally that the system’s accuracy on both face detection and pose estimation is improved by training for the two tasks together.

5 0.51043379 78 nips-2004-Hierarchical Distributed Representations for Statistical Language Modeling

Author: John Blitzer, Fernando Pereira, Kilian Q. Weinberger, Lawrence K. Saul

Abstract: Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e.g., n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial distribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocabulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1

6 0.47987503 162 nips-2004-Semi-Markov Conditional Random Fields for Information Extraction

7 0.43691596 87 nips-2004-Integrating Topics and Syntax

8 0.43067083 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition

9 0.42272457 40 nips-2004-Common-Frame Model for Object Recognition

10 0.41348785 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference

11 0.38801518 191 nips-2004-The Variational Ising Classifier (VIC) Algorithm for Coherently Contaminated Data

12 0.37846395 192 nips-2004-The power of feature clustering: An application to object detection

13 0.36793107 107 nips-2004-Making Latin Manuscripts Searchable using gHMM's

14 0.36606365 10 nips-2004-A Probabilistic Model for Online Document Clustering with Application to Novelty Detection

15 0.35408655 193 nips-2004-Theories of Access Consciousness

16 0.34947488 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data

17 0.33841917 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making

18 0.31832352 74 nips-2004-Harmonising Chorales by Probabilistic Inference

19 0.29366922 44 nips-2004-Conditional Random Fields for Object Recognition

20 0.29195961 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(13, 0.06), (15, 0.105), (22, 0.356), (26, 0.03), (27, 0.012), (31, 0.036), (32, 0.015), (33, 0.181), (35, 0.026), (50, 0.03), (71, 0.012), (76, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81481993 205 nips-2004-Who's In the Picture

Author: Tamara L. Berg, Alexander C. Berg, Jaety Edwards, David A. Forsyth

Abstract: The context in which a name appears in a caption provides powerful cues as to who is depicted in the associated image. We obtain 44,773 face images, using a face detector, from approximately half a million captioned news images and automatically link names, obtained using a named entity recognizer, with these faces. A simple clustering method can produce fair results. We improve these results significantly by combining the clustering process with a model of the probability that an individual is depicted given its context. Once the labeling procedure is over, we have an accurately labeled set of faces, an appearance model for each individual depicted, and a natural language model that can produce accurate results on captions in isolation. 1

2 0.53066063 167 nips-2004-Semi-supervised Learning with Penalized Probabilistic Clustering

Author: Zhengdong Lu, Todd K. Leen

Abstract: While clustering is usually an unsupervised operation, there are circumstances in which we believe (with varying degrees of certainty) that items A and B should be assigned to the same cluster, while items A and C should not. We would like such pairwise relations to influence cluster assignments of out-of-sample data in a manner consistent with the prior knowledge expressed in the training set. Our starting point is probabilistic clustering based on Gaussian mixture models (GMM) of the data distribution. We express clustering preferences in the prior distribution over assignments of data points to clusters. This prior penalizes cluster assignments according to the degree with which they violate the preferences. We fit the model parameters with EM. Experiments on a variety of data sets show that PPC can consistently improve clustering results.

3 0.5271396 44 nips-2004-Conditional Random Fields for Object Recognition

Author: Ariadna Quattoni, Michael Collins, Trevor Darrell

Abstract: We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e. local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.

4 0.52625859 77 nips-2004-Hierarchical Clustering of a Mixture Model

Author: Jacob Goldberger, Sam T. Roweis

Abstract: In this paper we propose an efficient algorithm for reducing a large mixture of Gaussians into a smaller mixture while still preserving the component structure of the original model; this is achieved by clustering (grouping) the components. The method minimizes a new, easily computed distance measure between two Gaussian mixtures that can be motivated from a suitable stochastic model and the iterations of the algorithm use only the model parameters, avoiding the need for explicit resampling of datapoints. We demonstrate the method by performing hierarchical clustering of scenery images and handwritten digits. 1

5 0.52509087 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach

Author: Francis R. Bach, Michael I. Jordan

Abstract: We present an algorithm to perform blind, one-microphone speech separation. Our algorithm separates mixtures of speech without modeling individual speakers. Instead, we formulate the problem of speech separation as a problem in segmenting the spectrogram of the signal into two or more disjoint sets. We build feature sets for our segmenter using classical cues from speech psychophysics. We then combine these features into parameterized affinity matrices. We also take advantage of the fact that we can generate training examples for segmentation by artificially superposing separately-recorded signals. Thus the parameters of the affinity matrices can be tuned using recent work on learning spectral clustering [1]. This yields an adaptive, speech-specific segmentation algorithm that can successfully separate one-microphone speech mixtures. 1

6 0.52343404 11 nips-2004-A Second Order Cone programming Formulation for Classifying Missing Data

7 0.52326804 23 nips-2004-Analysis of a greedy active learning strategy

8 0.52254099 127 nips-2004-Neighbourhood Components Analysis

9 0.5224123 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications

10 0.52192086 179 nips-2004-Surface Reconstruction using Learned Shape Models

11 0.52176112 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data

12 0.52175212 61 nips-2004-Efficient Out-of-Sample Extension of Dominant-Set Clusters

13 0.52156723 133 nips-2004-Nonparametric Transforms of Graph Kernels for Semi-Supervised Learning

14 0.52155793 166 nips-2004-Semi-supervised Learning via Gaussian Processes

15 0.5213331 177 nips-2004-Supervised Graph Inference

16 0.5212242 207 nips-2004-ℓ₀-norm Minimization for Basis Selection

17 0.52112609 3 nips-2004-A Feature Selection Algorithm Based on the Global Minimization of a Generalization Error Bound

18 0.5209738 99 nips-2004-Learning Hyper-Features for Visual Identification

19 0.52097374 102 nips-2004-Learning first-order Markov models for control

20 0.52093887 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees