171 nips-2005-Searching for Character Models

Source: pdf

Author: Jaety Edwards, David Forsyth

Abstract: We introduce a method to automatically improve character models for a handwritten script without the use of transcriptions and using a minimum of document-specific training data. We show that we can use searches for the words in a dictionary to identify portions of the document whose transcriptions are unambiguous. Using templates extracted from those regions, we retrain our character prediction model to drastically improve our search retrieval performance for words in the document.

Reference: text

Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We introduce a method to automatically improve character models for a handwritten script without the use of transcriptions and using a minimum of document-specific training data. [sent-5, score-0.822]

2 We show that we can use searches for the words in a dictionary to identify portions of the document whose transcriptions are unambiguous. [sent-6, score-0.475]

3 Using templates extracted from those regions, we retrain our character prediction model to drastically improve our search retrieval performance for words in the document. [sent-7, score-1.228]

4 This per character segmentation is expensive and often impractical to acquire, particularly if the corpora in question contain documents in many different scripts. [sent-10, score-0.645]

5 Training with these datasets is made possible by explicitly modelling possible segmentations in addition to having a model for character templates. [sent-13, score-0.619]

6 In their research on “wordspotting”, Lavrenko et al [4] demonstrate that images of entire words can be highly discriminative, even when the individual characters composing the word are locally ambiguous. [sent-14, score-0.601]

7 This implies that images of many sufﬁciently long words should have unambiguous transcriptions, even when the character models are poorly tuned. [sent-15, score-0.724]

8 In our previous work, [2], the discriminatory power of whole words allowed us to achieve strong search results with a model trained on a single example per character. [sent-16, score-0.216]

9 The ﬁrst of these two points implies that given a transcription, we can learn new character models. [sent-19, score-0.591]

10 The second implies that for at least some parts of a document, we should be able to provide that transcription “for free”, by matching against a dictionary of known words. [sent-20, score-0.208]

11 Each state st is defined by its left and right characters ctl and ctr (e.g. “x” and “e” for s4). [sent-22, score-0.593]

12 In the image, a state spans half of each of these two characters, starting just past the center of the left character and extending to the center of the right character, i. [sent-23, score-0.687]

13 The relative position of the two characters is given by a displacement vector dt (superimposed on the image as white lines). [sent-26, score-0.583]

14 Associating states with intercharacter spaces instead of with individual characters allows for the bounding boxes of characters to overlap while maintaining the independence properties of the Markov chain. [sent-27, score-0.713]

15 In this work we combine these two observations in order to improve character models without the need for a document speciﬁc transcription. [sent-28, score-0.699]

16 We provide a generic dictionary of words in the target language. [sent-29, score-0.199]

17 These are image regions for which exactly one word from our dictionary scores highly under our model. [sent-31, score-0.436]

18 In these regions, we infer a segmentation and extract new character examples. [sent-33, score-0.687]

19 Finally, we use these new exemplars to learn an improved character prediction model. [sent-34, score-0.669]

20 In their simplest incarnation, a hidden state represents a character and the evidence variable is some feature vector calculated at points along the line. [sent-37, score-0.631]

21 If all characters were known to be of a single ﬁxed width, this model would sufﬁce. [sent-38, score-0.304]

22 In particular, the portion of the ink generated by the current character is assumed to be independent of the preceding character. [sent-47, score-0.635]

23 In other words, the model assumes that the bounding boxes of characters do not overlap. [sent-48, score-0.409]

24 In this work, however, we need to extract new templates, and thus correct localization and segmentation of templates is crucial. [sent-55, score-0.511]

25 In our current work, we have relaxed this constraint, allowing characters to partially overlap. [sent-56, score-0.276]

26 We achieve this by changing hidden states to represent character bigrams instead of single characters (Figure 1). [sent-57, score-0.867]

27 In the image, a state now spans the pixels from just past the center of the left character to the pixel containing the center of the right character. [sent-58, score-0.73]

28 We adjust our notation somewhat to reﬂect this change, letting st now represent the tth hidden state and ctl and ctr be the left and right characters associated with s. [sent-59, score-0.66]

29 dt is now the displacement vector between the centers of ctl and ctr . [sent-60, score-0.452]

30 1 Model Parameters Our transition distribution between states is simply a 3-gram character model. [sent-68, score-0.591]
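Because each hidden state is a character bigram, a 3-gram character model can serve as the transition between consecutive states: moving from (a, b) to (b, c) amounts to predicting c given a and b. The sketch below is only an illustration of that idea. It assumes unsmoothed maximum likelihood counts from a word list, and all function names are hypothetical.

```python
from collections import Counter


def count_trigrams(words):
    """Count character trigrams and their two-character contexts."""
    trigram_counts = Counter()
    context_counts = Counter()
    for word in words:
        for a, b, c in zip(word, word[1:], word[2:]):
            trigram_counts[(a, b, c)] += 1
            context_counts[(a, b)] += 1
    return trigram_counts, context_counts


def transition_prob(state, next_state, trigram_counts, context_counts):
    """P(next_state | state) for bigram states (a, b) -> (b, c)."""
    (a, b), (b_next, c) = state, next_state
    if b != b_next:
        # Consecutive bigram states must share their middle character.
        return 0.0
    seen = context_counts[(a, b)]
    if seen == 0:
        return 0.0
    return trigram_counts[(a, b, c)] / seen


trigrams, contexts = count_trigrams(["est", "esse"])
p = transition_prob(("e", "s"), ("s", "t"), trigrams, contexts)
```

A practical model would add smoothing so that unseen trigrams do not receive zero probability; that choice is left open here.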

31 Conditioned on displacement vector, the emission model for generating an image chunk given a state is a mixture of gaussians. [sent-71, score-0.298]

32 We associate with each character a set of image windows extracted from various locations in the document. [sent-72, score-0.705]

33 We adjust the probability of an image given the state to include the distribution over blocks by expanding the last term of Equation 3 to reﬂect this mixture. [sent-74, score-0.149]

34 However, our model does not seem particularly sensitive to this displacement distribution and so in practice, we have a single, fairly loose, displacement distribution per character. [sent-77, score-0.36]

35 Given a displacement vector, we can generate the maximum likelihood template image under our model by compositing the correct halves of the left and right blocks. [sent-78, score-0.52]

36 Reshaping the image window into a vector, the likelihood of an image window is then modeled as a gaussian, using the corresponding pixels in the template as the means, and assuming a diagonal covariance matrix. [sent-79, score-0.361]
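The emission term described above can be sketched as a diagonal Gaussian log likelihood over flattened pixel vectors. This is a minimal illustration, assuming the window, the composited template, and the per-pixel variances are already given as equal-length lists; the function name is hypothetical.

```python
import math


def diag_gaussian_log_likelihood(window, template, variances):
    """Log likelihood of a flattened image window under a Gaussian whose
    means are the template pixels and whose covariance is diagonal."""
    if not (len(window) == len(template) == len(variances)):
        raise ValueError("window, template and variances must have equal length")
    total = 0.0
    for pixel, mean, var in zip(window, template, variances):
        if var <= 0:
            raise ValueError("variances must be positive")
        total += -0.5 * (math.log(2.0 * math.pi * var) + (pixel - mean) ** 2 / var)
    return total


# A window that matches the template exactly, with unit variances.
ll_match = diag_gaussian_log_likelihood([0.0, 1.0], [0.0, 1.0], [1.0, 1.0])
ll_mismatch = diag_gaussian_log_likelihood([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
```

Under this form, giving a pixel a large variance reduces the penalty for a mismatch at that pixel, which is one way a diagonal covariance could act as a mask over regions of a bounding box.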

37 The covariance matrix largely serves to mask out empty regions of a character’s bounding box, so that we do not pay a penalty when the overlap of two characters’ bounding boxes contains only whitespace. [sent-80, score-0.293]

38 2 Efﬁciency Considerations The number of possible different templates for a state is O(|B| × |B| × |D|), where |B| is the number of different possible blocks and |D| is the number of candidate displacement vectors. [sent-82, score-0.622]

39 For a given pair of blocks bl and br , we consider only displacement vectors within some small x distance from a mean displacement mbl ,br , and we have a uniform distribution within this region. [sent-84, score-0.377]

40 We can enumerate the set of possible states by looking at every pair of sites whose displacement vector has a non-zero probability. [sent-94, score-0.223]

41 A path through this lattice deﬁnes both a transcription and a segmentation of the line into individual characters. [sent-97, score-0.24]

42 Inference in this model is relatively straightforward because of our constraint that each character may overlap only one preceding and one following character, and our restriction of displacement vectors to a small discrete range. [sent-98, score-0.841]

43 A given state st is independent of the rest of the line given the values of all other states within dmax of either edge of st (where dmax is the legal displacement vector with the longest x component). [sent-101, score-0.401]

44 3 Learning Better Character Templates We initialize our algorithm with a set of handcut templates, exactly 1 per character, (Figure 2), and our goal is to construct more accurate character models automatically from unsupervised data. [sent-104, score-0.591]

45 (Recall that a site is a particular character template at a given (x,y) location in the line.) [sent-106, score-0.821]

46 The traditional EM approach to estimating new templates would be to use these [Figure 2: Original Training Data. These 22 glyphs are our only document-specific training data.] [sent-107, score-0.591]

47 We use the model based on these characters to extract the new examples shown below. [Figure 3: Examples of extracted templates. We extract new templates from high confidence regions.] [sent-108, score-1.18]

48 This happens when the combination of constraints from the dictionary and the surrounding glyphs makes a “q” or “a” the only possible explanation for this region, even though its local likelihood is poor. [sent-112, score-0.29]

49 Unfortunately, the constraints imposed by 3 and even 4-gram character models seem to be insufﬁcient. [sent-114, score-0.591]

50 The key to successfully learning new templates lies in the observation from our previous work [2], that even when the posteriors of individual characters are not discriminative, one can still achieve very good search results with the same model. [sent-116, score-0.773]

51 The search word in effect serves as its own language model, only allowing paths through the state graph that actually contain it, and the longer the word the more it constrains the model. [sent-117, score-0.581]

52 Whole words impose much tighter constraints than a 2 or 3-gram character model, and it is only with this added power that we can successfully learn new character templates. [sent-118, score-1.277]

53 We deﬁne the score for a search as the negative log likelihood of the best path containing that word. [sent-119, score-0.25]

54 Moreover, if we are given a large dictionary of words and no alternative word explains a region of ink nearly as well as the best scoring word, then we can be extremely conﬁdent that this is a true transcription of that piece of ink. [sent-121, score-0.626]

55 Starting with a weak character model, we do not expect to ﬁnd many of these “high conﬁdence” regions, but with a large enough document, we should expect to ﬁnd some. [sent-122, score-0.591]

56 From these regions, we can extract new, reliable templates with which to improve our character models. [sent-123, score-1.004]

57 The most valuable of these new templates will be those that are signiﬁcantly different from any in our current set. [sent-124, score-0.371]

58 For example, in Figure 3, note that our system identiﬁes capital Q’s, even though our only input template was lower case. [sent-125, score-0.176]

59 We can easily infer the missing character in the string “obv-ous” because the other letters constrain us to one possible solution. [sent-127, score-0.591]

60 Similarly, if other character templates in a word match well, then we can unambiguously identify the other, more ambiguous ones. [sent-128, score-1.228]

61 1 Extracting New Templates and Updating The Model Within a high conﬁdence region we have both a transcription and a localization of template centers. [sent-131, score-0.284]

62 We accomplish this by creating a template image for the column of pixels from the corresponding block templates and then assigning image pixels to the nearest template character (measured by Euclidean distance). [sent-133, score-1.513]
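The pixel assignment step can be sketched as a nearest-neighbour labelling. This sketch assumes ink pixels are given as (x, y) coordinates and that the template image has already been rendered with each ink pixel tagged by the character it belongs to. The function name and data layout are hypothetical.

```python
import math


def assign_pixels(image_ink, template_ink):
    """Assign each image ink pixel to the character whose template ink
    pixel is closest in Euclidean distance.

    image_ink: list of (x, y) ink pixels from the document image.
    template_ink: dict mapping character label -> list of (x, y) ink
    pixels of that character in the rendered template image.
    """
    assignment = {}
    for px, py in image_ink:
        best_label, best_dist = None, math.inf
        for label, pixels in template_ink.items():
            for tx, ty in pixels:
                dist = math.hypot(px - tx, py - ty)
                if dist < best_dist:
                    best_label, best_dist = label, dist
        assignment[(px, py)] = best_label
    return assignment


labels = assign_pixels([(0, 0), (5, 0)], {"q": [(1, 0)], "u": [(4, 0)]})
```

The pixels assigned to one character then form that character's new exemplar. The brute-force search is quadratic in the number of pixels; a distance transform would be a faster alternative for large windows.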

63 Given a set of templates extracted from high confidence regions, we choose a subset of [Figure 4 (axis: Score Under Model, from worse to best; label: Confidence Margins): Each line segment in the lower figure represents a proposed location for a word from our dictionary.] [sent-134, score-0.749]

64 The dotted line is the score of our model’s best possible path. [sent-137, score-0.176]

65 We deﬁne the conﬁdence margin of a location as the difference in score between the best ﬁtting word from our dictionary and the next best. [sent-139, score-0.436]

66 Figure 5: Extracting Templates For a region with sufﬁciently high conﬁdence margin, we construct the maximum likelihood template from our current exemplars. [sent-140, score-0.222]

67 (left), and we assign pixels from the original image to a template based on its distance to the nearest pixel in the template image, extracting new glyph exemplars (right). [sent-141, score-0.499]

68 These new glyphs become the exemplars for our next round of training. [sent-142, score-0.212]

69 Currently, we threshold the number of new templates for the sake of efﬁciency. [sent-145, score-0.371]

70 4 Results Our algorithm iteratively improves the character model by gathering new training data from high conﬁdence regions. [sent-147, score-0.619]

71 Figure 3 shows that this method ﬁnds new templates signiﬁcantly different from the originals. [sent-148, score-0.371]

72 In this document, our set of examples after one round appears to cover the space of character images well, at least those in lower case. [sent-149, score-0.68]

73 They make certain regions of the document more ambiguous locally, but that local ambiguity can be overcome with the context provided by surrounding characters and a language model. [sent-154, score-0.557]

74 Improved Character Models We evaluate the method more quantitatively by testing the impact of the new templates on the quality of searches performed against the document. [sent-155, score-0.418]

75 To search for a given word, we rank lines by the ratio of the maximum likelihood transcription/segmentation that contains the search word to the likelihood of the best possible segmentation/transcription under our model. [sent-156, score-0.554]

76 The lowest possible search score is 1, happening when the search word is actually a substring of the maximum likelihood transcription. [sent-157, score-0.515]

77 Higher scores mean that the word is increasingly unlikely under our model. [sent-158, score-0.192]
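The ranking rule can be sketched as follows. Since the lowest possible score is 1 and higher scores mean a less likely word, this sketch assumes the ratio is taken between negative log likelihoods: the best path constrained to contain the search word over the best unconstrained path. That reading, the function names, and the example numbers are assumptions.

```python
def search_score(nll_with_word, nll_best):
    """Ratio of the constrained path's negative log likelihood to the
    unconstrained best path's negative log likelihood (1.0 is the minimum)."""
    if nll_best <= 0:
        raise ValueError("expected a positive negative log likelihood")
    if nll_with_word < nll_best:
        raise ValueError("a constrained path cannot beat the unconstrained best")
    return nll_with_word / nll_best


def rank_lines(candidates):
    """Sort (line_id, nll_with_word, nll_best) tuples, best match first."""
    return sorted(candidates, key=lambda item: search_score(item[1], item[2]))


ranking = rank_lines([
    ("line_a", 3400.0, 3300.0),  # word fits worse than the best transcription
    ("line_b", 3300.0, 3300.0),  # word is part of the best transcription
])
```

Normalising by the unconstrained score makes lines of different lengths and ink densities comparable, which a raw likelihood would not.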

78 In Figure 7, the ﬁgure on the left shows the improvement in ranking of the lines that truly contain selected search words. [sent-159, score-0.184]

79 Dotted lines represent other search results, where we have made a few larger in order to show those words that are the closest competitors to the true word. [sent-163, score-0.251]

80 Each correct word has signiﬁcantly improved after one round of template reestimation. [sent-166, score-0.464]

81 Both “nuptiis” and “postquam” are now the highest likelihood words for their region barring smaller subsequences, and “videt” has narrowed the gap with its competitor “video”. [sent-168, score-0.169]

82 while the even rows (in green) use an additional 332 glyphs extracted from high conﬁdence regions. [sent-169, score-0.162]

83 The word “est”, for instance, only had 15 of 24 of the correct lines in the top 100 under the original model, while under the learned model all 24 are not only present but also more highly ranked. [sent-171, score-0.327]

84 Most words have greatly improved relative to their next best alternative. [sent-173, score-0.153]

85 Precision is the percentage of lines truly containing a word in the top n search results, and recall is the percentage of all lines containing the word returned in the top n results. [sent-176, score-0.636]
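These two definitions translate directly into code. The sketch below returns fractions rather than percentages and uses hypothetical line identifiers.

```python
def precision_recall_at_n(ranked_lines, relevant_lines, n):
    """Precision and recall of the top n search results.

    ranked_lines: line ids ordered from best to worst search score.
    relevant_lines: set of line ids that truly contain the search word.
    """
    top = ranked_lines[:n]
    hits = sum(1 for line in top if line in relevant_lines)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_lines) if relevant_lines else 0.0
    return precision, recall


precision, recall = precision_recall_at_n(
    ["l1", "l2", "l3", "l4"], {"l1", "l4", "l9"}, n=2
)
```

Sweeping n from 1 to the number of returned lines and plotting the resulting pairs gives a precision/recall curve for one search word.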

86 5 Conclusions and Future Work In most fonts, characters are quite ambiguous locally. [sent-181, score-0.313]

87 This ambiguity is the major hurdle to the unsupervised learning of character templates. [sent-183, score-0.591]

88 Language models help, but the standard n-gram models provide insufﬁcient constraints, giving posteriors for character sites too uninformative to get EM off the ground. [sent-184, score-0.681]

89 [Figure residue. Panel titles: Aggregate Precision/Recall Curve; Selected Words, Top 100 Returned Lines. Axis: Precision. Per-word counts, (original model, learned model)/total correct lines: est (15,24)/24; nescio (1,1)/1; postquam (0,2)/2; quod (14,14)/14; moram (0,2)/2; non (8,8)/8; quid (9,9)/9.] [sent-185, score-0.144]

90 Almost all search words in our corpus show a signiﬁcant improvement. [sent-201, score-0.188]

91 The numbers to the right (x/y) mean that out of y lines that actually contained the search word in our document, x of them made it into the top ten. [sent-202, score-0.384]

92 On the right are average precision/recall curves for 21 high frequency words under the model with our original templates (Rnd 1) and after reﬁtting with new extracted templates (Rnd 2). [sent-203, score-0.915]

93 Extracting new templates vastly improves our search quality. An entire word is much different. [sent-204, score-0.656]

94 Given a dictionary, we expect many word images to have a single likely transcription even if many characters are locally ambiguous. [sent-205, score-0.61]

95 We show that we can identify these high conﬁdence regions even with a poorly tuned character model. [sent-206, score-0.667]

96 By extracting new templates only from these regions of the document, we overcome the noise problem and signiﬁcantly improve our character models. [sent-207, score-1.085]

97 We demonstrate this improvement for the task of search where the reﬁtted models have drastically better search responses than with the original. [sent-208, score-0.186]

98 Our method is indifferent to the form of the actual character emission model. [sent-209, score-0.591]

99 There is a rich literature in character prediction from isolated image windows, and we expect that incorporating more powerful character models should provide even greater returns and help us in learning less regular scripts. [sent-210, score-1.246]

100 The probability of a character given an image window depends not only on the identity of surrounding characters but also on their spatial configuration. [sent-214, score-0.963]

similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('character', 0.591), ('templates', 0.371), ('characters', 0.276), ('word', 0.192), ('displacement', 0.166), ('template', 0.148), ('dence', 0.119), ('ctl', 0.112), ('glyphs', 0.112), ('document', 0.108), ('dictionary', 0.104), ('transcription', 0.104), ('rnd', 0.097), ('ctr', 0.097), ('words', 0.095), ('transcriptions', 0.093), ('search', 0.093), ('ct', 0.077), ('dt', 0.077), ('regions', 0.076), ('postquam', 0.075), ('image', 0.064), ('lines', 0.063), ('con', 0.063), ('score', 0.059), ('sites', 0.057), ('bounding', 0.056), ('overlap', 0.056), ('ctx', 0.056), ('iam', 0.056), ('latin', 0.056), ('nuptiis', 0.056), ('videt', 0.056), ('line', 0.055), ('segmentation', 0.054), ('location', 0.052), ('round', 0.051), ('extracted', 0.05), ('boxes', 0.049), ('widths', 0.049), ('exemplars', 0.049), ('searches', 0.047), ('extracting', 0.047), ('blocks', 0.045), ('ink', 0.044), ('correct', 0.044), ('pixels', 0.043), ('extract', 0.042), ('likelihood', 0.042), ('tth', 0.041), ('block', 0.041), ('state', 0.04), ('st', 0.04), ('images', 0.038), ('im', 0.038), ('ascii', 0.037), ('bodleian', 0.037), ('ideo', 0.037), ('imx', 0.037), ('inquam', 0.037), ('jaety', 0.037), ('kopec', 0.037), ('lavrenko', 0.037), ('manuscripts', 0.037), ('nupta', 0.037), ('quid', 0.037), ('stx', 0.037), ('transducers', 0.037), ('unambiguously', 0.037), ('ambiguous', 0.037), ('actually', 0.036), ('berkeley', 0.034), ('dotted', 0.033), ('posteriors', 0.033), ('returned', 0.033), ('statespace', 0.032), ('est', 0.032), ('region', 0.032), ('surrounding', 0.032), ('looks', 0.032), ('site', 0.03), ('handwritten', 0.03), ('dmax', 0.03), ('edwards', 0.03), ('forsyth', 0.03), ('cut', 0.029), ('improved', 0.029), ('best', 0.029), ('language', 0.028), ('left', 0.028), ('model', 0.028), ('spans', 0.028), ('portions', 0.028), ('capital', 0.028), ('path', 0.027), ('ect', 0.026), ('letting', 0.026), ('piece', 0.026), ('lecun', 0.026), ('library', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000001 171 nips-2005-Searching for Character Models

Author: Jaety Edwards, David Forsyth

2 0.099962413 100 nips-2005-Interpolating between types and tokens by estimating power-law generators

Author: Sharon Goldwater, Mark Johnson, Thomas L. Griffiths

Abstract: Standard statistical models of language fail to capture one of the most striking properties of natural languages: the power-law distribution in the frequencies of word tokens. We present a framework for developing statistical models that generically produce power-laws, augmenting standard generative models with an adaptor that produces the appropriate pattern of token frequencies. We show that taking a particular stochastic process – the Pitman-Yor process – as an adaptor justiﬁes the appearance of type frequencies in formal analyses of natural language, and improves the performance of a model for unsupervised learning of morphology.

3 0.078883164 184 nips-2005-Structured Prediction via the Extragradient Method

Author: Ben Taskar, Simon Lacoste-Julien, Michael I. Jordan

Abstract: We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.

4 0.074205577 48 nips-2005-Context as Filtering

Author: Daichi Mochihashi, Yuji Matsumoto

Abstract: Long-distance language modeling is important not only in speech recognition and machine translation, but also in high-dimensional discrete sequence modeling in general. However, the problem of context length has almost been neglected so far and a naïve bag-of-words history has been employed in natural language processing. In contrast, in this paper we view topic shifts within a text as a latent stochastic process to give an explicit probabilistic generative model that has partial exchangeability. We propose an online inference algorithm using particle filters to recognize topic shifts to employ the most appropriate length of context automatically. Experiments on the BNC corpus showed consistent improvement over previous methods involving no chronological order.

5 0.072611034 185 nips-2005-Subsequence Kernels for Relation Extraction

Author: Raymond J. Mooney, Razvan C. Bunescu

Abstract: We present a new kernel method for extracting semantic relations between entities in natural language text, based on a generalization of subsequence kernels. This kernel uses three types of subsequence patterns that are typically employed in natural language to assert relationships between two entities. Experiments on extracting protein interactions from biomedical corpora and top-level relations from newspaper corpora demonstrate the advantages of this approach.

6 0.070906222 121 nips-2005-Location-based activity recognition

7 0.07007128 42 nips-2005-Combining Graph Laplacians for Semi--Supervised Learning

8 0.066834353 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

9 0.063377343 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

10 0.059111409 52 nips-2005-Correlated Topic Models

11 0.059099384 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks

12 0.058785986 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection

13 0.057904918 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

14 0.053863514 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

15 0.0508219 33 nips-2005-Bayesian Sets

16 0.047797482 131 nips-2005-Multiple Instance Boosting for Object Detection

17 0.045683917 113 nips-2005-Learning Multiple Related Tasks using Latent Independent Component Analysis

18 0.045534853 103 nips-2005-Kernels for gene regulatory regions

19 0.044720232 195 nips-2005-Transfer learning for text classification

20 0.043282799 115 nips-2005-Learning Shared Latent Structure for Image Synthesis and Robotic Imitation

similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.159), (1, 0.004), (2, 0.014), (3, 0.097), (4, -0.033), (5, -0.025), (6, 0.041), (7, 0.124), (8, 0.015), (9, 0.033), (10, -0.055), (11, -0.081), (12, 0.08), (13, -0.109), (14, -0.101), (15, -0.004), (16, 0.002), (17, 0.073), (18, -0.032), (19, -0.012), (20, -0.064), (21, -0.095), (22, -0.041), (23, 0.0), (24, -0.08), (25, 0.076), (26, -0.047), (27, 0.019), (28, 0.054), (29, -0.015), (30, 0.004), (31, 0.017), (32, -0.06), (33, 0.182), (34, 0.04), (35, -0.088), (36, -0.087), (37, 0.044), (38, -0.076), (39, -0.006), (40, -0.003), (41, -0.04), (42, 0.025), (43, -0.144), (44, 0.073), (45, -0.027), (46, 0.168), (47, 0.005), (48, 0.006), (49, 0.068)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94464761 171 nips-2005-Searching for Character Models

Author: Jaety Edwards, David Forsyth

2 0.56508499 100 nips-2005-Interpolating between types and tokens by estimating power-law generators

Author: Sharon Goldwater, Mark Johnson, Thomas L. Griffiths

3 0.52260959 185 nips-2005-Subsequence Kernels for Relation Extraction

Author: Raymond J. Mooney, Razvan C. Bunescu

4 0.45958626 48 nips-2005-Context as Filtering

Author: Daichi Mochihashi, Yuji Matsumoto

5 0.44736114 184 nips-2005-Structured Prediction via the Extragradient Method

Author: Ben Taskar, Simon Lacoste-Julien, Michael I. Jordan

6 0.4337458 121 nips-2005-Location-based activity recognition

7 0.42286831 89 nips-2005-Group and Topic Discovery from Relations and Their Attributes

8 0.40677965 18 nips-2005-Active Learning For Identifying Function Threshold Boundaries

9 0.39614499 3 nips-2005-A Bayesian Framework for Tilt Perception and Confidence

10 0.38567081 97 nips-2005-Inferring Motor Programs from Images of Handwritten Digits

11 0.38477743 103 nips-2005-Kernels for gene regulatory regions

12 0.35765931 174 nips-2005-Separation of Music Signals by Harmonic Structure Modeling

13 0.35620239 198 nips-2005-Using ``epitomes'' to model genetic diversity: Rational design of HIV vaccine cocktails

14 0.35384315 131 nips-2005-Multiple Instance Boosting for Object Detection

15 0.34893972 68 nips-2005-Factorial Switching Kalman Filters for Condition Monitoring in Neonatal Intensive Care

16 0.34592679 11 nips-2005-A Hierarchical Compositional System for Rapid Object Detection

17 0.3367652 164 nips-2005-Representing Part-Whole Relationships in Recurrent Neural Networks

18 0.33198723 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

19 0.32426843 127 nips-2005-Mixture Modeling by Affinity Propagation

20 0.32054922 195 nips-2005-Transfer learning for text classification

similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(3, 0.049), (10, 0.043), (11, 0.013), (27, 0.056), (31, 0.074), (34, 0.081), (39, 0.023), (41, 0.03), (44, 0.011), (55, 0.019), (67, 0.016), (69, 0.068), (73, 0.023), (84, 0.267), (88, 0.114), (91, 0.024)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.81114161 171 nips-2005-Searching for Character Models

Author: Jaety Edwards, David Forsyth

2 0.73563623 193 nips-2005-The Role of Top-down and Bottom-up Processes in Guiding Eye Movements during Visual Search

Author: Gregory Zelinsky, Wei Zhang, Bing Yu, Xin Chen, Dimitris Samaras

Abstract: To investigate how top-down (TD) and bottom-up (BU) information is weighted in the guidance of human search behavior, we manipulated the proportions of BU and TD components in a saliency-based model. The model is biologically plausible and implements an artificial retina and a neuronal population code. The BU component is based on feature contrast. The TD component is defined by a feature-template match to a stored target representation. We compared the model’s behavior at different mixtures of TD and BU components to the eye movement behavior of human observers performing the identical search task. We found that a purely TD model provides a much closer match to human behavior than any mixture model using BU information. Only when biological constraints are removed (e.g., eliminating the retina) did a BU/TD mixture model begin to approximate human behavior.

3 0.6386677 201 nips-2005-Variational Bayesian Stochastic Complexity of Mixture Models

Author: Kazuho Watanabe, Sumio Watanabe

Abstract: The Variational Bayesian framework has been widely used to approximate the Bayesian learning. In various applications, it has provided computational tractability and good generalization performance. In this paper, we discuss the Variational Bayesian learning of the mixture of exponential families and provide some additional theoretical support by deriving the asymptotic form of the stochastic complexity. The stochastic complexity, which corresponds to the minimum free energy and a lower bound of the marginal likelihood, is a key quantity for model selection. It also enables us to discuss the effect of hyperparameters and the accuracy of the Variational Bayesian approach as an approximation of the true Bayesian learning.

4 0.56293803 200 nips-2005-Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery

Author: Jeremy Kubica, Joseph Masiero, Robert Jedicke, Andrew Connolly, Andrew W. Moore

Abstract: In this paper we consider the problem of ﬁnding sets of points that conform to a given underlying model from within a dense, noisy set of observations. This problem is motivated by the task of efﬁciently linking faint asteroid detections, but is applicable to a range of spatial queries. We survey current tree-based approaches, showing a trade-off exists between single tree and multiple tree algorithms. To this end, we present a new type of multiple tree algorithm that uses a variable number of trees to exploit the advantages of both approaches. We empirically show that this algorithm performs well using both simulated and astronomical data.

5 0.56058896 45 nips-2005-Conditional Visual Tracking in Kernel Space

Author: Cristian Sminchisescu, Atul Kanaujia, Zhiguo Li, Dimitris Metaxas

Abstract: We present a conditional temporal probabilistic framework for reconstructing 3D human motion in monocular video based on descriptors encoding image silhouette observations. For computational efÄ?Ĺš ciency we restrict visual inference to low-dimensional kernel induced non-linear state spaces. Our methodology (kBME) combines kernel PCA-based non-linear dimensionality reduction (kPCA) and Conditional Bayesian Mixture of Experts (BME) in order to learn complex multivalued predictors between observations and model hidden states. This is necessary for accurate, inverse, visual perception inferences, where several probable, distant 3D solutions exist due to noise or the uncertainty of monocular perspective projection. Low-dimensional models are appropriate because many visual processes exhibit strong non-linear correlations in both the image observations and the target, hidden state variables. The learned predictors are temporally combined within a conditional graphical model in order to allow a principled propagation of uncertainty. We study several predictors and empirically show that the proposed algorithm positively compares with techniques based on regression, Kernel Dependency Estimation (KDE) or PCA alone, and gives results competitive to those of high-dimensional mixture predictors at a fraction of their computational cost. We show that the method successfully reconstructs the complex 3D motion of humans in real monocular video sequences. 1 Introduction and Related Work We consider the problem of inferring 3D articulated human motion from monocular video. This research topic has applications for scene understanding including human-computer interfaces, markerless human motion capture, entertainment and surveillance. A monocular approach is relevant because in real-world settings the human body parts are rarely completely observed even when using multiple cameras. This is due to occlusions form other people or objects in the scene. 
A robust system necessarily has to deal with incomplete, ambiguous and uncertain measurements. Methods for 3D human motion reconstruction can be classified as generative and discriminative. Both require a state representation, namely a 3D human model with kinematics (joint angles) or shape (surfaces or joint positions), and both use a set of image features as observations for state inference. The computational goal in both cases is the conditional distribution of the model state given image observations. Generative model-based approaches [6, 16, 14, 13] have been demonstrated to flexibly reconstruct complex unknown human motions and to naturally handle problem constraints. However, it is difficult to construct reliable observation likelihoods due to the complexity of modeling human appearance, which varies widely with clothing and deformation, body proportions and lighting conditions. Besides being somewhat indirect, the generative approach further imposes strict conditional independence assumptions on the temporal observations given the states in order to ensure computational tractability. Due to these factors, inference is expensive and produces highly multimodal state distributions [6, 16, 13]. Generative inference algorithms require complex annealing schedules [6, 13] or systematic non-linear search for local optima [16] in order to ensure continued tracking. These difficulties motivate a complementary class of discriminative algorithms [10, 12, 18, 2] that approximate the state conditional directly in order to simplify inference. However, inverse, observation-to-state multivalued mappings are difficult to learn (see e.g. fig. 1a), and a probabilistic temporal setting is necessary. In an earlier paper [15] we introduced a probabilistic discriminative framework for human motion reconstruction.
Because that method operates in the originally selected state and observation spaces, which can be task-generic and therefore redundant and often high-dimensional, inference is more expensive and can be less robust.

Figure 1: (a, Left) Example of 180° ambiguity in predicting 3D human poses from silhouette image features (center). It is essential that multiple plausible solutions (e.g. F1 and F2) are correctly represented and tracked over time. A single state predictor will either average the distant solutions or zig-zag between them, see also tables 1 and 2. (b, Right) A conditional chain model. The local distributions p(y_t | y_{t-1}, z_t) or p(y_t | z_t) are learned as in fig. 2. For inference, the predicted local state conditional is recursively combined with the filtered prior, cf. (1).

To summarize, reconstructing 3D human motion in a conditional temporal framework poses the following difficulties:

(i) The mapping between temporal observations and states is multivalued (i.e. the local conditional distributions to be learned are multimodal), so it cannot be accurately represented using global function approximations.

(ii) Human models have multivariate, high-dimensional continuous states of 50 or more human joint angles. The temporal state conditionals are multimodal, which makes efficient Kalman filtering algorithms inapplicable. General inference methods (particle filters, mixtures) have to be used instead, but these are expensive for high-dimensional models (e.g. when reconstructing the motion of several people that operate in a joint state space).

(iii) The components of the human state and of the silhouette observation vector exhibit strong correlations, because many repetitive human activities like walking or running have low intrinsic dimensionality. It appears wasteful to work with high-dimensional states of 50+ joint angles.
Even if the space were truly high-dimensional, predicting correlated state dimensions independently may still be suboptimal.

In this paper we present a conditional temporal estimation algorithm that restricts visual inference to low-dimensional, kernel-induced state spaces. To exploit correlations among observations and among state variables, we model the local temporal conditional distributions using ideas from kernel PCA [11, 19] and conditional mixture modeling [7, 5], here adapted to produce multiple probabilistic predictions. The corresponding predictor is referred to as a Conditional Bayesian Mixture of Low-dimensional Kernel-Induced Experts (kBME). By integrating it within a conditional graphical model framework (fig. 1b), we can exploit temporal constraints probabilistically. We demonstrate that this methodology is effective for reconstructing the 3D motion of multiple people in monocular video. Our contribution w.r.t. [15] is a probabilistic conditional inference framework that operates over non-linear, kernel-induced low-dimensional state spaces, and a set of experiments (on both real and artificial image sequences) that show how the proposed framework compares favorably with powerful predictors based on KDE or PCA, and with the high-dimensional models of [15], at a fraction of their cost.

2 Probabilistic Inference in a Kernel Induced State Space

We work with conditional graphical models with a chain structure [9], as shown in fig. 1b. These have continuous temporal states y_t, t = 1 ... T, and observations z_t. For compactness, we denote joint states Y_t = (y_1, y_2, ..., y_t) and joint observations Z_t = (z_1, ..., z_t). Learning and inference are based on the local conditionals p(y_t | z_t) and p(y_t | y_{t-1}, z_t), with y_t and z_t being low-dimensional, kernel-induced representations of some initial model having state x_t and observation r_t. We obtain z_t, y_t from r_t, x_t using kernel PCA [11, 19].
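As a rough illustration of the kernel PCA step that maps r_t, x_t to z_t, y_t, the following NumPy sketch fits a Gaussian-kernel PCA and projects points onto the leading components. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the bandwidth parameter `gamma`, and the random stand-in data are all hypothetical and chosen only for illustration.

```python
import numpy as np


def rbf_kernel(X, Y, gamma):
    """Gaussian kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def fit_kpca(X, n_components, gamma):
    """Fit kernel PCA on training rows X (hypothetical helper)."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # centre in feature space
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:n_components]
    # Scale eigenvectors so each principal axis has unit norm in feature space.
    alphas = vecs[:, top] / np.sqrt(vals[top])
    return {"X": X, "K": K, "alphas": alphas, "gamma": gamma}


def transform_kpca(model, X_new):
    """Principal coordinates of new rows under a fitted model."""
    K = model["K"]
    n = K.shape[0]
    K_new = rbf_kernel(X_new, model["X"], model["gamma"])
    one_n = np.full((n, n), 1.0 / n)
    one_m = np.full((X_new.shape[0], n), 1.0 / n)
    K_new_c = K_new - one_m @ K - K_new @ one_n + one_m @ K @ one_n
    return K_new_c @ model["alphas"]


# Stand-in data: 40 random "poses" with 56 d.o.f., reduced to 6 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 56))
model = fit_kpca(X, n_components=6, gamma=0.01)
Y = transform_kpca(model, X)   # principal coordinates, shape (40, 6)
```

On the training set the resulting coordinates are mutually uncorrelated, which is the decorrelation property the predictors in the text rely on. A dot-product kernel, as used in the paper for the observation space, would only require swapping `rbf_kernel` for `X @ Y.T`.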
Inference is performed in a low-dimensional, non-linear, kernel-induced latent state space (see fig. 1b, fig. 2 and (1)). For display or error reporting, we compute the original conditional p(x | r), or a temporally filtered version p(x_t | R_t), R_t = (r_1, r_2, ..., r_t), using a learned pre-image state map [3].

2.1 Density Propagation for Continuous Conditional Chains

For online filtering, we compute the optimal distribution p(y_t | Z_t) for the state y_t, conditioned on the observations Z_t up to time t. The filtered density can be derived recursively as:

\[ p(y_t \mid Z_t) = \int_{y_{t-1}} p(y_t \mid y_{t-1}, z_t)\, p(y_{t-1} \mid Z_{t-1}) \qquad (1) \]

We compute (1) using a conditional mixture for p(y_t | y_{t-1}, z_t) (a Bayesian mixture of experts, cf. §2.2) and the prior p(y_{t-1} | Z_{t-1}), each having, say, M components. We integrate the M^2 pairwise products of Gaussians analytically. The means of the expanded posterior are clustered, and the centers are used to initialize a reduced M-component Kullback-Leibler approximation that is refined using gradient descent [15]. The propagation rule (1) is similar to the one used for discrete state labels [9], but here we work with multivariate continuous state spaces and represent the local multimodal state conditionals using kBME (fig. 2) rather than log-linear models [9] (these would require intractable normalization). This complex continuous model rules out inference based on Kalman filtering or dynamic programming [9].

2.2 Learning Bayesian Mixtures over Kernel Induced State Spaces (kBME)

In order to model conditional mappings between low-dimensional non-linear spaces, we rely on kernel dimensionality reduction and conditional mixture predictors. The authors of KDE [19] propose a powerful structured unimodal predictor. This works by decorrelating the output using kernel PCA and learning a ridge regressor between the input and each decorrelated output dimension.
Our procedure is also based on kernel PCA, but takes into account the structure of the studied visual problem, where both inputs and outputs are likely to be low-dimensional and the mapping between them multivalued. The output variables x_i are projected onto the column vectors of the principal space in order to obtain their principal coordinates y_i. A similar procedure is performed on the inputs r_i to obtain z_i. In order to relate the reduced feature spaces of z and y (P(F_r) and P(F_x)), we estimate a probability distribution over mappings from training pairs (z_i, y_i).

[Figure 2 diagram. Input chain: r ∈ R ⊂ R^r is mapped by Φ_r to Φ_r(r) ⊂ F_r, then by kPCA to z ∈ P(F_r). Output chain: x ∈ X ⊂ R^x is mapped by Φ_x to Φ_x(x) ⊂ F_x, then by kPCA to y ∈ P(F_x). The two chains are linked by p(y | z), with x ≈ PreImage(y) and p(x | r) ≈ p(x | y).]

Figure 2: The learned low-dimensional predictor, kBME, for computing p(x | r) ≡ p(x_t | r_t), ∀t. (We similarly learn p(x_t | x_{t-1}, r_t), with input (x, r) instead of r; here we illustrate only p(x | r) for clarity.) The input r and the output x are decorrelated using kernel PCA to obtain z and y, respectively. The kernels used for the input and output are Φ_r and Φ_x, with induced feature spaces F_r and F_x, respectively. Their principal subspaces obtained by kernel PCA are denoted by P(F_r) and P(F_x), respectively. A conditional Bayesian mixture of experts p(y | z) is learned using the low-dimensional representation (z, y). Using learned local conditionals of the form p(y_t | z_t) or p(y_t | y_{t-1}, z_t), temporal inference can be performed efficiently in a low-dimensional kernel-induced state space (see e.g. (1) and fig. 1b). For visualization and error measurement, the filtered density, e.g. p(y_t | Z_t), can be mapped back to p(x_t | R_t) using the pre-image, cf. (3).
We use a conditional Bayesian mixture of experts (BME) [7, 5] in order to account for ambiguity when mapping similar, possibly identical reduced feature inputs to very different feature outputs, as is common in our problem (fig. 1a). This gives a model that is a conditional mixture of low-dimensional kernel-induced experts (kBME):

\[ p(y \mid z) = \sum_{j=1}^{M} g(z \mid \delta_j)\, N(y \mid W_j z, \Sigma_j) \qquad (2) \]

where g(z | δ_j) is a softmax function parameterized by δ_j, and (W_j, Σ_j) are the parameters and the output covariance of expert j, here a linear regressor. As in many Bayesian settings [17, 5], the weights of the experts and of the gates, W_j and δ_j, are controlled by hierarchical priors, typically zero-mean Gaussians whose inverse-variance hyperparameters are controlled by a second level of Gamma distributions. We learn this model using a double-loop EM and employ ML-II type approximations [8, 17] with greedy (weight) subset selection [17, 15].

Finally, the kBME algorithm requires the computation of pre-images in order to recover the state distribution x from its image y ∈ P(F_x). This is a closed-form computation for polynomial kernels of odd degree. For more general kernels, optimization or learning (regression-based) methods are necessary [3]. Following [3, 19], we use a sparse Bayesian kernel regressor to learn the pre-image. This is based on training data (x_i, y_i):

\[ p(x \mid y) = N(x \mid A \Phi_y(y), \Omega) \qquad (3) \]

with parameters and covariances (A, Ω). Since temporal inference is performed in the low-dimensional kernel-induced state space, the pre-image function needs to be calculated only for visualizing results or for error reporting. Propagating the result from the reduced feature space P(F_x) to the output space X produces a Gaussian mixture with M elements, having coefficients g(z | δ_j) and components N(x | A Φ_y(W_j z), A J_{Φ_y} Σ_j J_{Φ_y}^T A^T + Ω), where J_{Φ_y} is the Jacobian of the mapping Φ_y.
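Once its parameters are known, the conditional mixture in (2) is straightforward to evaluate. The sketch below does so in NumPy, assuming the common linear softmax gate g(z | δ_j) ∝ exp(δ_j · z); the paper does not spell out the gate's exact form, so that choice, the function names, and the random stand-in parameters are assumptions for illustration only. Learning the parameters (double-loop EM with hierarchical priors) is not shown.

```python
import numpy as np


def gate_probs(z, delta):
    """Softmax gates g(z | delta_j); delta holds one row per expert."""
    a = delta @ z
    e = np.exp(a - a.max())   # subtract the max for numerical stability
    return e / e.sum()


def gaussian_pdf(y, mean, cov):
    """Density of N(y | mean, cov)."""
    d = y.shape[0]
    diff = y - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + logdet + quad))


def kbme_density(y, z, delta, W, Sigma):
    """Evaluate p(y | z) = sum_j g(z | delta_j) N(y | W_j z, Sigma_j)."""
    g = gate_probs(z, delta)
    return sum(g_j * gaussian_pdf(y, W_j @ z, S_j)
               for g_j, W_j, S_j in zip(g, W, Sigma))


def expert_predictions(z, delta, W):
    """Gate probabilities and per-expert means, most probable expert first."""
    g = gate_probs(z, delta)
    order = np.argsort(g)[::-1]
    return g[order], np.stack([W[j] @ z for j in order])


# Random stand-in parameters: M = 5 experts, 25-d input z, 6-d output y.
rng = np.random.default_rng(0)
M, dz, dy = 5, 25, 6
delta = rng.normal(size=(M, dz))
W = rng.normal(size=(M, dy, dz))
Sigma = np.stack([np.eye(dy)] * M)
z = rng.normal(size=dz)
probs, means = expert_predictions(z, delta, W)
```

Returning all expert means ranked by gate probability, rather than a single averaged prediction, is what lets a multivalued mapping keep distant solutions separate.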
3 Experiments

We run experiments on both real image sequences (fig. 5 and fig. 6) and on sequences where silhouettes were artificially rendered. The prediction error is reported in degrees (for mixtures of experts, this is w.r.t. the most probable one, but see also fig. 4a), normalized per joint angle, per frame. The models are learned using standard cross-validation. Pre-images are learned using kernel regressors and have an average error of 1.7°.

Training Set and Model State Representation: For training we gather pairs of 3D human poses together with their image projections, here silhouettes, using the graphics package Maya. We use realistically rendered computer graphics human surface models, which we animate using human motion capture [1]. Our original human representation (x) is based on articulated skeletons with spherical joints and has 56 skeletal d.o.f., including global translation. The database consists of 8000 samples of human activities including walking, running, turns, jumps, gestures in conversations, quarreling and pantomime.

Image Descriptors: We work with image silhouettes obtained using statistical background subtraction (with foreground and background models). Silhouettes are informative for pose estimation, although prone to ambiguities (e.g. the left / right limb assignment in side views) or occasional lack of observability of some of the d.o.f. (e.g. 180° ambiguities in the global azimuthal orientation for frontal views, see fig. 1a). These are multiplied by intrinsic forward / backward monocular ambiguities [16]. As observations r, we use shape contexts extracted on the silhouette [4] (5 radial and 12 angular bins, size range 1/8 to 3 on a log scale). The features are computed at different scales and sizes for points sampled on the silhouette. To work in a common coordinate system, we cluster all features in the training set into K = 50 clusters.
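The codebook introduced here, together with the inverse-distance voting and 200-point sampling that the text goes on to describe, can be sketched roughly as follows. This is a hedged illustration only: the function name, the small constant `eps`, the choice to give each sampled point one unit of total vote, and the random stand-in features (60-d, matching 5 radial by 12 angular bins) are assumptions, and the shape-context extraction and clustering steps are not shown.

```python
import numpy as np


def silhouette_descriptor(point_features, centers, eps=1e-8):
    """Soft-vote per-point features into a K-bin histogram."""
    # Pairwise Euclidean distances between points and centers, shape (n_points, K).
    dist = np.linalg.norm(point_features[:, None, :] - centers[None, :, :], axis=2)
    votes = 1.0 / (dist + eps)                  # inverse-distance weights
    votes /= votes.sum(axis=1, keepdims=True)   # one unit of vote per point (assumption)
    return votes.sum(axis=0)                    # accumulate over all sampled points


# Stand-in data: K = 50 cluster centers and 200 sampled silhouette points.
rng = np.random.default_rng(0)
centers = rng.normal(size=(50, 60))
points = rng.normal(size=(200, 60))
r = silhouette_descriptor(points, centers)      # descriptor of length 50
```

Because every point spreads its vote over all centers, neighbouring bins of the histogram are correlated, which is consistent with the text's remark that the descriptor elements are not independent.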
To compute the representation of a new shape feature (a point on the silhouette), we 'project' onto the common basis by (inverse distance) weighted voting into the cluster centers. To obtain the representation (r) for a new silhouette, we regularly sample 200 points on it and add all their feature vectors into a feature histogram. Because the representation uses overlapping features of the observation, the elements of the descriptor are not independent. However, a conditional temporal framework (fig. 1b) flexibly accommodates this.

For experiments, we use Gaussian kernels for the joint angle feature space and dot product kernels for the observation feature space. We learn state conditionals for p(y_t | z_t) and p(y_t | y_{t-1}, z_t) using 6 dimensions for the joint angle kernel-induced state space and 25 dimensions for the observation-induced feature space, respectively. In fig. 3b we show an evaluation of the efficacy of our kBME predictor for different dimensions of the joint angle kernel-induced state space (the observation feature space dimension is here 50). On the analyzed dancing sequence, which involves complex motions of the arms and the legs, the non-linear model significantly outperforms alternative PCA methods and gives good predictions for compact, low-dimensional models (footnote 1).

Footnote 1. Running times: On a Pentium 4 PC (3 GHz, 2 GB RAM), a full dimensional BME model with 5 experts takes 802s to train p(x_t | x_{t-1}, r_t), whereas a kBME (including the pre-image) takes 95s to train p(y_t | y_{t-1}, z_t). The prediction time is 13.7s for BME and 8.7s (including the pre-image cost 1.04s) for kBME. The integration in (1) takes 2.67s for BME and 0.31s for kBME. The speed-up for kBME is significant and likely to increase with original models having higher dimensionality.

[Figure 3 plots. (a) Number of Clusters against Degree of Multimodality. (b) Prediction Error against Number of Dimensions; legend: kBME, KDE_RVM, PCA_BME, PCA_RVM.]

Figure 3: (a, Left) Analysis of 'multimodality' for a training set. The input z_t dimension is 25, the output y_t dimension is 6, both reduced using kPCA. We cluster independently in (y_{t-1}, z_t) and y_t using many clusters (2100) to simulate small input perturbations, and we histogram the y_t clusters falling within each cluster in (y_{t-1}, z_t). This gives intuition on the degree of ambiguity in modeling p(y_t | y_{t-1}, z_t) for small perturbations in the input. (b, Right) Evaluation of dimensionality reduction methods for an artificial dancing sequence (models trained on 300 samples). The kBME is our model (§2.2), whereas the KDE-RVM is a KDE model learned with a Relevance Vector Machine (RVM) [17] feature space map. PCA-BME and PCA-RVM are models where the mapping between feature spaces (obtained using PCA) is learned using a BME and an RVM, respectively. The non-linearity is significant. Kernel-based methods outperform PCA and give low prediction error for 5-6d models.

In tables 1 and 2, as well as fig. 4, we perform quantitative experiments on artificially rendered silhouettes. 3D ground truth joint angles are available, and this allows a more systematic evaluation. Notice that the kernelized low-dimensional models generally outperform the PCA ones. At the same time, they give results competitive with those of high-dimensional BME predictors, while being lower-dimensional and therefore significantly less expensive for inference, e.g. the integral in (1).

In fig. 5 and fig. 6 we show human motion reconstruction results for two real image sequences. Fig. 5 shows the good quality reconstruction of a person performing an agile jump. (Given the missing observations in a side view, 3D inference for the occluded body parts would not be possible without using prior knowledge!) For this sequence we do inference using conditionals having 5 modes and reduced 6d states. We initialize tracking using p(y_t | z_t), whereas for inference we use p(y_t | y_{t-1}, z_t) within (1).
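To make the use of p(y_t | y_{t-1}, z_t) within (1) more concrete, here is a simplified sketch of one propagation step for a Gaussian-mixture prior and linear-Gaussian experts. Two simplifications are assumptions of this sketch and not the paper's procedure: the gates are frozen at each prior component's mean so that every pairwise integral is Gaussian, and the expanded mixture is pruned by keeping the highest-weight components instead of the clustering plus Kullback-Leibler refinement of [15]. All names and the random stand-in values are hypothetical.

```python
import numpy as np


def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()


def propagate(prior_w, prior_mu, prior_cov, z, delta, W, Sigma, n_keep):
    """One approximate step of (1) for a Gaussian-mixture prior.

    prior_w (M,), prior_mu (M, d), prior_cov (M, d, d): p(y_{t-1} | Z_{t-1}).
    delta (E, d + k), W (E, d, d + k), Sigma (E, d, d): experts on [y_{t-1}, z_t].
    """
    d = prior_mu.shape[1]
    w, mu, cov = [], [], []
    for pi_i, m_i, S_i in zip(prior_w, prior_mu, prior_cov):
        u = np.concatenate([m_i, z])
        g = softmax(delta @ u)            # gates frozen at the prior mean (assumption)
        for g_j, W_j, Q_j in zip(g, W, Sigma):
            A_j = W_j[:, :d]              # block of W_j acting on y_{t-1}
            w.append(pi_i * g_j)
            mu.append(W_j @ u)            # mean of the integrated pairwise product
            cov.append(Q_j + A_j @ S_i @ A_j.T)
    w, mu, cov = np.array(w), np.array(mu), np.array(cov)
    keep = np.argsort(w)[::-1][:n_keep]   # crude pruning, not the KL refinement
    return w[keep] / w[keep].sum(), mu[keep], cov[keep]


# Stand-in setup: 5 prior modes, 5 experts, 6-d state, 25-d observation.
rng = np.random.default_rng(0)
M, d, k = 5, 6, 25
prior_w = np.full(M, 1.0 / M)
prior_mu = rng.normal(size=(M, d))
prior_cov = np.stack([0.1 * np.eye(d)] * M)
delta = rng.normal(size=(M, d + k))
W = 0.1 * rng.normal(size=(M, d, d + k))
Sigma = np.stack([0.05 * np.eye(d)] * M)
z = rng.normal(size=k)
post_w, post_mu, post_cov = propagate(prior_w, prior_mu, prior_cov,
                                      z, delta, W, Sigma, n_keep=M)
```

The loop makes the M^2 cost of the expansion explicit, which is one reason working with 6-d rather than 56-d states reduces the cost of this integration.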
In the second sequence, in fig. 6, we simultaneously reconstruct the motion of two people mimicking domestic activities, namely washing a window and picking up an object. Here we do inference over a 12-dimensional product state space consisting of the joint 6d state of each person. We obtain good 3D reconstruction results using only 5 hypotheses. Notice, however, that the results are not perfect: there are small errors in the elbow and the bending of the knee for the subject on the left-hand side, and in the wrist orientations for the subject on the right-hand side. This reflects the bias of our training set.

Model      Walk and turn   Conversation   Run and turn left
KDE-RR     10.46           7.95           5.22
RVM        4.95            4.96           5.02
KDE-RVM    7.57            6.31           6.25
BME        4.27            4.15           5.01
kBME       4.69            4.79           4.92

Table 1: Comparison of average joint angle prediction error (in degrees) for different models. All kPCA-based models use 6 output dimensions. Testing is done on 100 video frames for each sequence; the inputs are artificially generated silhouettes, not in the training set. 3D joint angle ground truth is used for evaluation. KDE-RR is a KDE model with ridge regression (RR) for the feature space mapping, KDE-RVM uses an RVM. BME uses a Bayesian mixture of experts with no dimensionality reduction. kBME is our proposed model. kPCA-based methods use kernel regressors to compute pre-images.

[Figure 4 plots; recoverable labels: Expert Prediction; Frequency (Closest to Ground truth); Frequency (Close to ground truth); Expert Number; Current Expert; legend: 1st to 5th Probable Prev Output.]

Figure 4: (a, Left) Histogram showing the accuracy of various expert predictors: how many times the expert ranked as the k-th most probable by the model (horizontal axis) is closest to the ground truth. The model is consistent (the most probable expert is indeed the most accurate most frequently), but occasionally less probable experts are better.
(b, Right) Histograms show the dynamics of p(y_t | y_{t-1}, z_t), i.e. how the probability mass is redistributed among experts between two successive time steps, in a conversation sequence.

Model      Walk and turn back   Run and turn
KDE-RR     7.59                 17.7
RVM        6.9                  16.8
KDE-RVM    7.15                 16.08
BME        3.6                  8.2
kBME       3.72                 8.01

Table 2: Joint angle prediction error (in degrees) computed for two complex sequences with walks, runs and turns, thus more ambiguity (100 frames). Models have 6 state dimensions. Unimodal predictors average competing solutions. kBME has significantly lower error.

Figure 5: Reconstruction of a jump (selected frames). Top: original image sequence. Middle: extracted silhouettes. Bottom: 3D reconstruction seen from a synthetic viewpoint.

Figure 6: Reconstructing the activities of 2 people operating in a 12-d state space (each person has its own 6d state). Top: original image sequence. Bottom: 3D reconstruction seen from a synthetic viewpoint.

4 Conclusion

We have presented a probabilistic framework for conditional inference in latent kernel-induced low-dimensional state spaces. Our approach has the following properties:

(a) Accounts for non-linear correlations among input or output variables, by using kernel non-linear dimensionality reduction (kPCA);

(b) Learns probability distributions over mappings between low-dimensional state spaces using conditional Bayesian mixture of experts, as required for accurate prediction. In the resulting low-dimensional kBME predictor, ambiguities and multiple solutions common in visual inverse perception problems are accurately represented.

(c) Works in a continuous, conditional temporal probabilistic setting and offers a formal management of uncertainty.

We show comparisons that demonstrate how the proposed approach outperforms regression, PCA or KDE alone for reconstructing 3D human motion in monocular video. In future work we will investigate scaling aspects for large training sets and alternative structured prediction methods.
References

[1] CMU Human Motion DataBase. Online at http://mocap.cs.cmu.edu/search.html, 2003.
[2] A. Agarwal and B. Triggs. 3d human pose from silhouettes by Relevance Vector Regression. In CVPR, 2004.
[3] G. Bakir, J. Weston, and B. Schölkopf. Learning to find pre-images. In NIPS, 2004.
[4] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24, 2002.
[5] C. Bishop and M. Svensen. Bayesian mixtures of experts. In UAI, 2003.
[6] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In CVPR, 2000.
[7] M. Jordan and R. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, (6):181–214, 1994.
[8] D. MacKay. Bayesian interpolation. Neural Computation, 4(5):720–736, 1992.
[9] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML, 2000.
[10] R. Rosales and S. Sclaroff. Learning Body Pose Via Specialized Maps. In NIPS, 2002.
[11] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299–1319, 1998.
[12] G. Shakhnarovich, P. Viola, and T. Darrell. Fast Pose Estimation with Parameter Sensitive Hashing. In ICCV, 2003.
[13] L. Sigal, S. Bhatia, S. Roth, M. Black, and M. Isard. Tracking Loose-limbed People. In CVPR, 2004.
[14] C. Sminchisescu and A. Jepson. Generative Modeling for Continuous Non-Linearly Embedded Visual Inference. In ICML, pages 759–766, Banff, 2004.
[15] C. Sminchisescu, A. Kanaujia, Z. Li, and D. Metaxas. Discriminative Density Propagation for 3D Human Motion Estimation. In CVPR, 2005.
[16] C. Sminchisescu and B. Triggs. Kinematic Jump Processes for Monocular 3D Human Tracking. In CVPR, volume 1, pages 69–76, Madison, 2003.
[17] M. Tipping. Sparse Bayesian learning and the Relevance Vector Machine. JMLR, 2001.
[18] C. Tomasi, S. Petrov, and A. Sastry. 3d tracking = classification + interpolation. In ICCV, 2003.
[19] J. Weston, O. Chapelle, A. Elisseeff, B. Schölkopf, and V. Vapnik. Kernel dependency estimation. In NIPS, 2002.

6 0.55558801 78 nips-2005-From Weighted Classification to Policy Search

7 0.55468911 144 nips-2005-Off-policy Learning with Options and Recognizers

8 0.55255854 63 nips-2005-Efficient Unsupervised Learning for Localization and Detection in Object Categories

9 0.54404694 153 nips-2005-Policy-Gradient Methods for Planning

10 0.54368925 136 nips-2005-Noise and the two-thirds power Law

11 0.54337752 23 nips-2005-An Application of Markov Random Fields to Range Sensing

12 0.54310888 30 nips-2005-Assessing Approximations for Gaussian Process Classification

13 0.54095191 21 nips-2005-An Alternative Infinite Mixture Of Gaussian Process Experts

14 0.54042596 132 nips-2005-Nearest Neighbor Based Feature Selection for Regression and its Application to Neural Activity

15 0.53968167 184 nips-2005-Structured Prediction via the Extragradient Method

16 0.53960586 36 nips-2005-Bayesian models of human action understanding

17 0.53825402 154 nips-2005-Preconditioner Approximations for Probabilistic Graphical Models

18 0.53824466 124 nips-2005-Measuring Shared Information and Coordinated Activity in Neuronal Networks

19 0.53811842 67 nips-2005-Extracting Dynamical Structure Embedded in Neural Activity

20 0.53788835 48 nips-2005-Context as Filtering