170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields


Source: pdf

Author: Taehwan Kim, Greg Shakhnarovich, Karen Livescu

Abstract: Recognition of gesture sequences is in general a very difficult problem, but in certain domains the difficulty may be mitigated by exploiting the domain’s “grammar”. One such grammatically constrained gesture sequence domain is sign language. In this paper we investigate the case of fingerspelling recognition, which can be very challenging due to the quick, small motions of the fingers. Most prior work on this task has assumed a closed vocabulary of fingerspelled words; here we study the more natural open-vocabulary case, where the only domain knowledge is the possible fingerspelled letters and statistics of their sequences. We develop a semi-Markov conditional model approach, where feature functions are defined over segments of video and their corresponding letter labels. We use classifiers of letters and linguistic handshape features, along with expected motion profiles, to define segmental feature functions. This approach improves letter error rate (Levenshtein distance between hypothesized and correct letter sequences) from 16.3% using a hidden Markov model baseline to 11.6% using the proposed semi-Markov model.

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 Fingerspelling recognition with semi-Markov conditional random fields Taehwan Kim Greg Shakhnarovich Karen Livescu Toyota Technological Institute at Chicago 6045 S Kenwood Ave, Chicago IL 60637 {taehwan, greg, klivescu}@ttic.edu [sent-1, score-0.119]

2 Abstract Recognition of gesture sequences is in general a very difficult problem, but in certain domains the difficulty may be mitigated by exploiting the domain’s “grammar”. [sent-2, score-0.126]

3 One such grammatically constrained gesture sequence domain is sign language. [sent-3, score-0.335]

4 In this paper we investigate the case of fingerspelling recognition, which can be very challenging due to the quick, small motions of the fingers. [sent-4, score-0.462]

5 Most prior work on this task has assumed a closed vocabulary of fingerspelled words; here we study the more natural open-vocabulary case, where the only domain knowledge is the possible fingerspelled letters and statistics of their sequences. [sent-5, score-0.379]

6 We develop a semi-Markov conditional model approach, where feature functions are defined over segments of video and their corresponding letter labels. [sent-6, score-0.545]

7 We use classifiers of letters and linguistic handshape features, along with expected motion profiles, to define segmental feature functions. [sent-7, score-0.72]

8 This approach improves letter error rate (Levenshtein distance between hypothesized and correct letter sequences) from 16. [sent-8, score-0.802]

9 Introduction Recognition of gesture sequences is in general very challenging. [sent-12, score-0.126]

10 One of the most practically important of such grammatically constrained gesture sequence domains is sign language. [sent-14, score-0.335]

11 In this paper we consider American Sign Language (ASL), and focus in particular on recognition of fingerspelled letter sequences. [sent-15, score-0.535]

12 In fingerspelling, signers spell out a word as a sequence of handshapes or hand trajectories corresponding to individual letters. [sent-16, score-0.39]

13 The handshapes used in fingerspelling are also used throughout ASL. [sent-17, score-0.569]

14 In fact, the fingerspelling handshapes account for about 72% of ASL handshapes [7], making research on fingerspelling applicable to ASL in general. [sent-18, score-1.138]

15 ASL fingerspelling uses a single hand and involves relatively small and quick motions of the hand and fingers, as opposed to the typically larger arm motions involved in other signs. [sent-22, score-0.556]

16 Most prior work on fingerspelling recognition has assumed a closed vocabulary of fingerspelled words, often limited to 20-100 words, typically using hidden Markov models (HMMs) representing letters or letter-to-letter transitions [14, 20, 26]. [sent-24, score-0.781]

17 In such settings it is common to obtain letter error rates (Levenshtein distances between hypothesized and true letter sequences, as a proportion of the number of true letters) of 10% or less. [sent-25, score-0.842]
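The letter error rate defined above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation script: the function names are hypothetical, and it assumes the rate is pooled over a test set (total edit distance divided by total number of true letters) rather than averaged per word, which the text does not specify.

```python
def levenshtein(hyp: str, ref: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning hyp into ref."""
    prev = list(range(len(ref) + 1))  # distances from the empty hypothesis prefix
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # distance from hyp[:i] to the empty reference prefix
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete h from the hypothesis
                            curr[j - 1] + 1,      # insert r into the hypothesis
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def letter_error_rate(hyps, refs) -> float:
    """Pooled error rate (an assumption): total edits / total true letters."""
    total_edits = sum(levenshtein(h, r) for h, r in zip(hyps, refs))
    total_letters = sum(len(r) for r in refs)
    return total_edits / total_letters
```

For example, recognizing "tulp" for "tulip" and "rose" for "rose" gives one edit over nine true letters.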

18 In contrast, we address the problem of recognizing unconstrained fingerspelling sequences. [sent-26, score-0.462]

19 This is a more natural setting, since fingerspelling is often used for names and other “new” terms, which may not appear in any closed vocabulary. [sent-27, score-0.497]

20 We develop a semi-Markov conditional random field (SCRF) approach to the unconstrained fingerspelling recognition problem. [sent-28, score-0.553]

21 In SCRFs [28, 41], feature functions are defined over segments of observed variables (in our case, any number of consecutive video frames) and their corresponding labels (in our case, letters). [sent-29, score-0.114]

22 The use of such segmental feature functions is useful for gesture modeling, where it is natural to consider the trajectory of some measurement or the statistics of an entire segment. [sent-30, score-0.284]

23 In this work we define feature functions based on scores of letter classifiers, as well as classifiers of handshape features suggested by linguistics research on ASL [6, 19]. [sent-31, score-0.744]

24 Linguistic handshape features summarize certain important aspects of a given letter, such as the “active” fingers or the flexed/non-flexed status Figure 1. [sent-32, score-0.319]

25 Related work There has been significant work on sign language recognition from video, but there are still large gaps, especially for continuous, large-vocabulary signing settings. [sent-39, score-0.439]

26 Some prior work has used representations of handshape and motion that are linguistically motivated, e. [sent-42, score-0.249]

27 However, a much finer level of detail is needed for the sub-articulators of the hand, which motivates our use of linguistic handshape features here. [sent-45, score-0.454]

28 The only unrestricted fingerspelling recognition work of which we are aware is [19], using HMM-based approaches; we consider this work as the most competitive baseline and compare to it in the experiments section. [sent-49, score-0.161]

29 The relatively little work on applying segmental (semi-Markov) models to vision tasks has focused on classification and segmentation of action sequences [29, 12] with a small set of possible activities to choose from, including recent work on spotting of specific signs in sign language video [8]. [sent-50, score-0.615]

30 In natural language processing, semi-Markov CRFs have been used for named entity recognition [28], where the labeling is binary. [sent-51, score-0.256]

31 The work presented in this paper is the largest-scale use of semi-Markov models in computer vision, as well as the least constrained fingerspelling recognition experiments, of which we are aware. [sent-55, score-0.521]

32 Ideally we would like to predict the best label sequence, marginalizing out different possible label start and end times, but in practice we use the typical approach of predicting the best sequence of frame labels S = s1, . [sent-63, score-0.132]

33 more natural to consider feature functions that span entire segments corresponding to the same label. [sent-74, score-0.143]

34 In a SCRF, we consider the segmentation to be a latent variable and sum over all possible segmentations of the observations corresponding to a given label sequence to get the conditional probability of the label sequence S = s1, . [sent-77, score-0.29]

35 ), e ranges over all state pairs in S, sle is the state which is on the left of an edge, sre is the state on the right of an edge, and Oe is the multi-frame observation segment associated with sre. [sent-97, score-0.311]

36 In our work, we use a baseline frame-based recognizer to generate a set of candidate segmentations of O, and sum only over those candidate segmentations. [sent-98, score-0.243]

37 Feature functions We define several types of feature functions, some of which are quite general to sequence recognition tasks and some of which are tailored to fingerspelling recognition: 3. [sent-102, score-0.626]

38 1 Language model feature The language model feature is a smoothed bigram probability of the letter pair corresponding to an edge: flm(sle, 3. [sent-104, score-0.71]
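A smoothed letter bigram of the kind described above might be estimated as follows. This is only a sketch: the excerpt does not say which smoothing method is used, so add-k smoothing is an assumption here, and `train_letter_bigram` and its parameters are hypothetical names.

```python
from collections import Counter


def train_letter_bigram(words, alphabet, k=1.0):
    """Return a function giving the add-k smoothed probability P(right | left).

    Add-k smoothing is an assumption; the source only says "smoothed".
    """
    pair_counts = Counter()   # counts of (left, right) letter pairs
    left_counts = Counter()   # counts of left letters that have a successor
    for word in words:
        for left, right in zip(word, word[1:]):
            pair_counts[(left, right)] += 1
            left_counts[left] += 1
    vocab_size = len(alphabet)

    def prob(left, right):
        numerator = pair_counts[(left, right)] + k
        denominator = left_counts[left] + k * vocab_size
        return numerator / denominator

    return prob
```

With this form, unseen letter pairs still receive a small non-zero probability, which matters in the open-vocabulary setting where names and new terms occur.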

39 Baseline consistency feature To take advantage of the existence of a high-quality baseline, we use a baseline feature like the one introduced by [41]. [sent-107, score-0.125]

40 3 Handshape classifier-based feature functions The next set of feature functions measure the degree of match between the intended segment label and the appearance of the frames within the segment. [sent-112, score-0.294]

41 For this purpose we use a set of frame classifiers, each of which classifies either letters or linguistic handshape features. [sent-113, score-0.572]

42 As in [19], we use the linguistic handshape feature set developed by Brentari [6], who proposed seven features to describe handshape in ASL. [sent-114, score-0.738]

43 Each such linguistic feature (not to be confused with feature functions) has 2-7 possible values. [sent-115, score-0.275]

44 For each linguistic feature or letter, we train a classifier that produces a score for each feature value for each video frame. [sent-118, score-0.275]

45 mean: fyv_mean(sle, sre, Oe) = δ(w(sre) = y) · mean_{i∈(t(e),T(e))} g(v|oi); max: fyv_max(sle, sre, Oe) = δ(w(sre) = y) · max_{i∈(t(e),T(e))} g(v|oi); divs: a concatenation of three mean feature functions, each computed over a third of the segment. [Figure: diagram relating states, segmentations, and visual features o1, o2, o3.] [sent-124, score-0.149]

46 divm: a concatenation of three max feature functions, each computed over a third of the segment. 4 Peak detection features Fingerspelling a sequence of letters yields a corresponding sequence of “peaks” of articulation. [sent-143, score-0.286]
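The mean, max, divs, and divm feature functions can be illustrated with a short sketch over per-frame classifier scores g(v|oi). The function names are hypothetical, the segment is assumed to include both end frames, and the way a segment is cut into thirds when its length is not a multiple of three is an assumption.

```python
def split_thirds(seg):
    """Cut a list into three contiguous, non-empty parts (needs >= 3 items)."""
    n = len(seg)
    if n < 3:
        raise ValueError("need at least 3 frames to form three parts")
    bounds = [k * n // 3 for k in range(4)]  # e.g. n=5 -> [0, 1, 3, 5]
    return [seg[bounds[k]:bounds[k + 1]] for k in range(3)]


def average(xs):
    return sum(xs) / len(xs)


def segment_features(scores, start, end, segment_label, y):
    """Segmental features for value v, gated by the indicator (segment label == y).

    scores[i] is the frame classifier score g(v|o_i); frames start..end
    (inclusive, an assumption) make up the segment O_e.
    """
    if segment_label != y:  # the delta(w(s_r^e) = y) indicator is zero
        return {"mean": 0.0, "max": 0.0, "divs": [0.0] * 3, "divm": [0.0] * 3}
    seg = scores[start:end + 1]
    thirds = split_thirds(seg)
    return {
        "mean": average(seg),
        "max": max(seg),
        "divs": [average(t) for t in thirds],  # three mean features
        "divm": [max(t) for t in thirds],      # three max features
    }
```

The three-way split lets the model distinguish the onset, middle, and offset of a letter segment, which a single mean or max over the whole segment cannot do.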

47 Intuitively, these are frames in which the hand reaches the target handshape for a particular letter. [sent-144, score-0.347]

48 The peak frame and the frames around it for each letter tend to be characterized by very little motion as the transition to the current letter has ended while the transition to the next letter has not yet begun, whereas the transitional frames between letter peaks have more motion. [sent-145, score-1.756]

49 To use this information and encourage each predicted letter segment to have a single peak, we define letter-specific “peak detection features” as follows. [sent-146, score-0.417]

50 Then we define the feature function corresponding to each letter y as fypeak(sle, sre, Oe) = δ(w(sre) = y) · δpeak(Oe) where δpeak(Oe) is 1 if there is only one local minimum in the segment Oe and 0 otherwise. [sent-149, score-0.452]
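A minimal sketch of the peak detection feature follows. The sentence defining the quantity whose local minima are counted is missing from this excerpt, so the code assumes a one-dimensional per-frame motion measure (low values meaning little motion). The function names are hypothetical, and only strict interior minima are counted, so plateaus are ignored.

```python
def count_local_minima(motion):
    """Count strict interior local minima of a per-frame motion profile."""
    count = 0
    for i in range(1, len(motion) - 1):
        if motion[i] < motion[i - 1] and motion[i] < motion[i + 1]:
            count += 1
    return count


def peak_feature(segment_label, y, motion):
    """delta(w(s_r^e) = y) * delta_peak(O_e) for one segment.

    motion holds the assumed motion measure for the frames of segment O_e.
    """
    has_single_minimum = count_local_minima(motion) == 1
    return 1 if (segment_label == y and has_single_minimum) else 0
```

A segment whose motion dips once fires the feature for its letter; a segment with two dips, which may span two letters, does not.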

51 The first five features are properties of the active fingers (selected fingers, SF); the last feature is the state of the inactive or unselected fingers (UF). [sent-156, score-0.206]

52 non-overlapping lists of 300 words (one for signers 1 and 2, the other for signers 3 and 4). [sent-159, score-0.312]

53 For comparison with prior work, we use the same data from signers 1 and 2 as [18, 19], as well as additional data from signers 3 and 4. [sent-162, score-0.284]

54 The signers indicated the start and end of each word by pressing a button, allowing automatic partition of the recording into a separate video for every word. [sent-166, score-0.223]

55 Every video was verified and manually labeled by multiple annotators with the times and letter identities of the peaks of articulation (see Sec. [sent-167, score-0.415]

56 The peak annotations are used for the training portion of the data in each experiment to segment a word into letters (the boundary between consecutive letters is defined as the midpoint between their peaks). [sent-171, score-0.35]
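The midpoint rule for turning peak annotations into letter segments can be sketched as below. The function name is hypothetical. Assigning the midpoint frame to the earlier letter, rounding down, and extending the first and last segments to the ends of the word video are assumptions that the text does not settle.

```python
def segments_from_peaks(peaks, first_frame, last_frame):
    """Convert annotated peaks into inclusive (letter, start, end) segments.

    peaks: list of (letter, peak_frame) pairs sorted by frame.
    The boundary between consecutive letters is the midpoint of their peaks.
    """
    segments = []
    start = first_frame
    for idx, (letter, frame) in enumerate(peaks):
        if idx + 1 < len(peaks):
            next_frame = peaks[idx + 1][1]
            end = (frame + next_frame) // 2  # midpoint, rounded down
        else:
            end = last_frame                 # last letter runs to the end
        segments.append((letter, start, end))
        start = end + 1
    return segments
```

For peaks at frames 10, 20, and 31 in a 41-frame video, the boundaries fall at frames 15 and 25.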

57 Note that while this procedure currently requires manual annotation for a small number of frames in our offline recognition setting, it could be fully automated in a realistic interactive setting, by asking the subject to place his/her hand in a few defined locations for calibration. [sent-185, score-0.133]

58 Letter and linguistic feature classifiers We use feedforward neural network classifiers (NNs) (trained with Quicknet [25]) for letters and linguistic feature labels for each video frame. [sent-190, score-0.635]

59 Both baselines have one 3-state HMM per letter, plus a separate HMM for the sequence-initial and sequence-final non-signing portions (referred to as “n/a”), Gaussian mixture observation densities, and a letter bigram language model. [sent-195, score-0.671]

60 The second baseline uses as observations the linear outputs of the NN linguistic feature classifiers, reproducing the “tandem” approach of [19]. [sent-197, score-0.295]

61 This was done to confirm that we can reproduce the result of [19] showing an advantage for the tandem system over standard HMMs. [sent-198, score-0.204]

62 Generating candidate segmentations for SCRFs As described above, we use a two-phase inference approach where a baseline recognizer produces a set of candidate segmentations and label sequences, and a SCRF is used to re-rank the baseline candidates. [sent-200, score-0.39]

63 We produce a list of N-best candidate segmentations using the tandem baseline (as it is the better performer of the two baselines). [sent-201, score-0.327]
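The second phase of the two-phase approach can be pictured as scoring each N-best candidate with a weighted sum of feature functions over its edges and keeping the best one. This is a deliberately simplified sketch: it scores one segmentation at a time instead of summing over the segmentations of a label sequence, and `score_candidate`, `rerank`, the segment tuples, and the `feature_fn` callback are all hypothetical.

```python
def score_candidate(segments, weights, feature_fn):
    """Weighted sum of feature functions over consecutive segment pairs (edges).

    segments: list of (label, start, end) tuples, beginning with a start
    marker segment (an assumption) so that the first letter has a left state.
    feature_fn(left, right) returns a dict mapping feature name -> value.
    """
    total = 0.0
    for left, right in zip(segments, segments[1:]):
        feats = feature_fn(left, right)
        total += sum(weights.get(name, 0.0) * value
                     for name, value in feats.items())
    return total


def rerank(candidates, weights, feature_fn):
    """Return the candidate segmentation with the highest score."""
    return max(candidates,
               key=lambda segs: score_candidate(segs, weights, feature_fn))
```

In a fuller implementation, `feature_fn` would return the language model, baseline consistency, classifier-based, and peak detection features, with weights learned during training.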

64 For those training examples where the correct letter sequence is not among the baseline N-best candidates, we have several choices. [sent-203, score-0.473]

65 We independently tune the parameters in each fold (that is, we run 10 separate, complete experiments) and report the average letter error rate (LER) over the 10 folds. [sent-208, score-0.399]

66 We train the letter bigram language models from large online dictionaries of varying sizes that include both English words and names [2]. [sent-209, score-0.703]

67 We use HTK [3] to implement the baseline HMM-based recognizers and SRILM [31] to train the language models. [sent-210, score-0.276]

68 The HMM parameters (number of Gaussians per state, size of language model vocabulary, transition penalty and language model weight), as well as the dimensionality of the HOG descriptor input and HOG depth, were tuned to minimize development set letter error rates for the baseline HMM system. [sent-211, score-1.014]

69 For the NN classifiers, the input window size was tuned to minimize frame error rate of the classifiers on the development set. [sent-212, score-0.167]

70 The NN output type was tuned separately for the tandem HMM and the SCRF. [sent-214, score-0.214]

71 Finally, additional parameters tuned for the SCRF models included the N-best list sizes, type of feature functions, choice of language models, and L1 and L2 regularization parameters. [sent-215, score-0.292]

72 First, we confirm that the tandem baseline improves over a standard HMM baseline. [sent-217, score-0.233]

73 Second, we find that the proposed SCRF improves over the tandem HMM-based system, correcting about 21% of the errors (or 29% of the errors committed by Figure 4. [sent-218, score-0.178]
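The percentages above are relative error reductions. As a check that uses only the two letter error rates quoted in the abstract (16.3% for the hidden Markov model baseline and 11.6% for the proposed model), the relative reduction comes to roughly 29%. The tandem baseline's error rate does not appear in this excerpt, so the 21% figure is not recomputed here. The function name is hypothetical.

```python
def relative_error_reduction(baseline_ler, new_ler):
    """Fraction of the baseline's errors removed by the new system."""
    return (baseline_ler - new_ler) / baseline_ler


# Letter error rates taken from the abstract, in percent.
reduction = relative_error_reduction(16.3, 11.6)
print(f"{100 * reduction:.1f}% of baseline errors corrected")
```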

74 Letter error rate for each signer, and average letter error rate over all signers, for the two baselines and for the proposed SCRF. [sent-219, score-0.454]

75 Neural network classifier error rates on letter classification (blue) and linguistic feature classification (red) on test data for each signer. [sent-222, score-0.679]

76 The linguistic feature error rates are averaged over the six linguistic feature classification tasks; error rates for each linguistic feature type range between 4% and 10%. [sent-223, score-0.848]

77 comparison, we have also conducted the same experiments while keeping the training, development, and test vocabularies disjoint; in this modified setup, letter error rates increase by about 2-3% overall, but the SCRFs still outperform the other models. [sent-225, score-0.439]

78 Figures 6 and 7 illustrate the recognition task, showing examples in which the SCRF corrected mistakes made by the tandem HMM recognizer. [sent-226, score-0.213]

79 as well as a frame before the first peak and after the last peak. [sent-236, score-0.114]

80 Below the ground truth segmentations are the segmentations obtained with the baseline tandem HMM, and at the bottom are segmentations obtained with the SCRF. [sent-237, score-0.422]

81 Discussion This paper proposes an approach to automatic recognition of fingerspelled words in ASL, in a challenging open-vocabulary scenario. [sent-239, score-0.23]

82 The work has implications for the larger task of general ASL recognition, where the same handshapes used in fingerspelling are used throughout. [sent-243, score-0.569]

83 We believe that our results are promising in a broader context of recognition of action sequences (and in particular gesture sequences) with any sort of “grammar”, that is, constraints that limit a set of configurations and introduce structure into statistics of possible transitions. [sent-244, score-0.161]

84 We also are investigating more complex segmental feature functions that would capture additional properties of the data. [sent-247, score-0.196]

85 Finally, although user-dependent sign language recognition could be useful in practice, as evidenced by the prevalence of such applications for spoken language recognition (such as dictation systems), we would like to develop methods that are more signer-independent. [sent-248, score-0.702]

86 A linguistic feature vector for the visual interpretation of sign language. [sent-273, score-0.391]

87 Native and foreign vocabulary in American Sign Language: A lexicon with multiple origins. [sent-282, score-0.13]

88 In Foreign vocabulary in sign languages: A cross-linguistic investigation of word formation, pages 87–119. [sent-283, score-0.244]

89 Sign language spotting based on semi-Markov conditional random field. [sent-292, score-0.301]

90 Modelling and recognition of the linguistic components in American Sign Language. [sent-303, score-0.24]

91 Speech recognition techniques for a sign language recognition system. [sent-312, score-0.442]

92 Machine recognition of Auslan signs using powergloves: towards large lexicon integration of sign language. [sent-352, score-0.265]

93 American Sign Language fingerspelling recognition with phonological feature-based tandem models. [sent-364, score-0.728]

94 Automatic recognition of fingerspelled words in British Sign Language. [sent-369, score-0.188]

95 How the alphabet came to be used in a sign language. [sent-381, score-0.151]

96 Advances in phonetics-based sub-unit modeling for transcription alignment and sign language recognition. [sent-388, score-0.372]

97 Affine-invariant modeling of shape-appearance images applied on sign language handshape classification. [sent-410, score-0.621]

98 Exploiting phonological constraints for handshape inference in ASL video. [sent-444, score-0.302]

99 Model-level data-driven sub-units for signs in videos of continuous sign language. [sent-450, score-0.195]

100 Detecting coarticulation in sign language using conditional random fields. [sent-486, score-0.464]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fingerspelling', 0.462), ('letter', 0.375), ('handshape', 0.249), ('asl', 0.236), ('scrf', 0.231), ('language', 0.221), ('linguistic', 0.205), ('tandem', 0.178), ('sign', 0.151), ('signers', 0.142), ('hmm', 0.129), ('fingerspelled', 0.125), ('segmental', 0.11), ('handshapes', 0.107), ('sre', 0.097), ('scrfs', 0.089), ('gesture', 0.088), ('letters', 0.087), ('peak', 0.083), ('oe', 0.08), ('signer', 0.079), ('sle', 0.079), ('vogler', 0.073), ('brentari', 0.071), ('spelling', 0.071), ('fingers', 0.07), ('american', 0.066), ('recognizer', 0.063), ('segmentations', 0.063), ('conditional', 0.056), ('baseline', 0.055), ('foreign', 0.053), ('phonological', 0.053), ('pitsikalis', 0.053), ('theodorakis', 0.053), ('frames', 0.051), ('functions', 0.051), ('word', 0.051), ('nn', 0.048), ('hand', 0.047), ('signs', 0.044), ('bigram', 0.044), ('english', 0.043), ('sequence', 0.043), ('development', 0.042), ('vocabulary', 0.042), ('segment', 0.042), ('speech', 0.04), ('rates', 0.04), ('peaks', 0.04), ('spoken', 0.039), ('sl', 0.038), ('sequences', 0.038), ('tuned', 0.036), ('asru', 0.036), ('coarticulation', 0.036), ('ediavch', 0.036), ('fyv', 0.036), ('grobel', 0.036), ('levenshtein', 0.036), ('pbxg', 0.036), ('phand', 0.036), ('semimarkov', 0.036), ('srilm', 0.036), ('recognition', 0.035), ('lexicon', 0.035), ('feature', 0.035), ('names', 0.035), ('classifiers', 0.034), ('signing', 0.032), ('aircraft', 0.032), ('livescu', 0.032), ('workshop', 0.031), ('baselines', 0.031), ('cx', 0.031), ('candidate', 0.031), ('state', 0.031), ('frame', 0.031), ('crf', 0.03), ('grammar', 0.03), ('recording', 0.03), ('hidden', 0.03), ('span', 0.029), ('languages', 0.029), ('grammatically', 0.029), ('label', 0.029), ('segments', 0.028), ('hypothesized', 0.028), ('words', 0.028), ('greg', 0.028), ('native', 0.028), ('segmentation', 0.027), ('oi', 0.026), ('reproduce', 0.026), ('nns', 0.026), ('crfs', 0.026), ('spotting', 0.024), ('constrained', 0.024), ('error', 0.024), ('dancing', 0.023)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000007 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields

Author: Taehwan Kim, Greg Shakhnarovich, Karen Livescu

Abstract: Recognition of gesture sequences is in general a very difficult problem, but in certain domains the difficulty may be mitigated by exploiting the domain’s “grammar”. One such grammatically constrained gesture sequence domain is sign language. In this paper we investigate the case of fingerspelling recognition, which can be very challenging due to the quick, small motions of the fingers. Most prior work on this task has assumed a closed vocabulary of fingerspelled words; here we study the more natural open-vocabulary case, where the only domain knowledge is the possible fingerspelled letters and statistics of their sequences. We develop a semi-Markov conditional model approach, where feature functions are defined over segments of video and their corresponding letter labels. We use classifiers of letters and linguistic handshape features, along with expected motion profiles, to define segmental feature functions. This approach improves letter error rate (Levenshtein distance between hypothesized and correct letter sequences) from 16.3% using a hidden Markov model baseline to 11.6% using the proposed semi-Markov model.

2 0.12184771 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions

Author: Alessandro Bissacco, Mark Cummins, Yuval Netzer, Hartmut Neven

Abstract: We describe PhotoOCR, a system for text extraction from images. Our particular focus is reliable text extraction from smartphone imagery, with the goal of text recognition as a user input modality similar to speech recognition. Commercially available OCR performs poorly on this task. Recent progress in machine learning has substantially improved isolated character classification; we build on this progress by demonstrating a complete OCR system using these techniques. We also incorporate modern datacenter-scale distributed language modelling. Our approach is capable of recognizing text in a variety of challenging imaging conditions where traditional OCR systems fail, notably in the presence of substantial blur, low resolution, low contrast, high image noise and other distortions. It also operates with low latency; mean processing time is 600 ms per image. We evaluate our system on public benchmark datasets for text extraction and outperform all previously reported results, more than halving the error rate on multiple benchmarks. The system is currently in use in many applications at Google, and is available as a user input modality in Google Translate for Android.

3 0.10723884 428 iccv-2013-Translating Video Content to Natural Language Descriptions

Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele

Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.

4 0.093680196 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

Author: Vignesh Ramanathan, Percy Liang, Li Fei-Fei

Abstract: Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio-temporal annotations, which requires extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio-temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.

5 0.082930729 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences

Author: Bing Su, Xiaoqing Ding

Abstract: Dimensionality reduction for vectors in sequences is challenging since labels are attached to sequences as a whole. This paper presents a model-based dimensionality reduction method for vector sequences, namely linear sequence discriminant analysis (LSDA), which attempts to find a subspace in which sequences of the same class are projected together while those of different classes are projected as far as possible. For each sequence class, an HMM is built from states of which statistics are extracted. Means of these states are linked in order to form a mean sequence, and the variance of the sequence class is defined as the sum of all variances of component states. LSDA then learns a transformation by maximizing the separability between sequence classes and at the same time minimizing the within-sequence class scatter. DTW distance between mean sequences is used to measure the separability between sequence classes. We show that the optimization problem can be approximately transformed into an eigen decomposition problem. LDA can be seen as a special case of LSDA by considering non-sequential vectors as sequences of length one. The effectiveness of the proposed LSDA is demonstrated on two individual sequence datasets from UCI machine learning repository as well as two concatenate sequence datasets: APTI Arabic printed text database and IFN/ENIT Arabic handwriting database.

6 0.07728035 133 iccv-2013-Efficient Hand Pose Estimation from a Single Depth Image

7 0.075486563 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition

8 0.072660536 381 iccv-2013-Semantically-Based Human Scanpath Estimation with HMMs

9 0.064627178 452 iccv-2013-YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

10 0.060768276 192 iccv-2013-Handwritten Word Spotting with Corrected Attributes

11 0.056965146 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments

12 0.05526809 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

13 0.05523688 417 iccv-2013-The Moving Pose: An Efficient 3D Kinematics Descriptor for Low-Latency Action Recognition and Detection

14 0.048279386 218 iccv-2013-Interactive Markerless Articulated Hand Motion Tracking Using RGB and Depth Data

15 0.047828034 377 iccv-2013-Segmentation Driven Object Detection with Fisher Vectors

16 0.047299359 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection

17 0.043530058 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes

18 0.04343807 4 iccv-2013-ACTIVE: Activity Concept Transitions in Video Event Classification

19 0.042854376 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

20 0.042489704 166 iccv-2013-Finding Actors and Actions in Movies


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.129), (1, 0.031), (2, 0.014), (3, 0.015), (4, 0.036), (5, 0.016), (6, 0.005), (7, -0.0), (8, -0.014), (9, -0.008), (10, 0.057), (11, -0.04), (12, 0.029), (13, 0.015), (14, -0.0), (15, 0.017), (16, -0.052), (17, -0.005), (18, -0.049), (19, 0.012), (20, 0.037), (21, 0.001), (22, 0.012), (23, 0.006), (24, -0.047), (25, -0.019), (26, 0.025), (27, 0.005), (28, -0.049), (29, 0.01), (30, -0.02), (31, -0.076), (32, -0.008), (33, -0.012), (34, -0.066), (35, 0.014), (36, -0.044), (37, -0.006), (38, -0.0), (39, -0.027), (40, -0.002), (41, 0.078), (42, -0.04), (43, 0.035), (44, -0.033), (45, -0.0), (46, 0.057), (47, 0.001), (48, 0.001), (49, 0.066)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89145488 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields

Author: Taehwan Kim, Greg Shakhnarovich, Karen Livescu

Abstract: Recognition of gesture sequences is in general a very difficult problem, but in certain domains the difficulty may be mitigated by exploiting the domain’s “grammar”. One such grammatically constrained gesture sequence domain is sign language. In this paper we investigate the case of fingerspelling recognition, which can be very challenging due to the quick, small motions of the fingers. Most prior work on this task has assumed a closed vocabulary of fingerspelled words; here we study the more natural open-vocabulary case, where the only domain knowledge is the possible fingerspelled letters and statistics of their sequences. We develop a semi-Markov conditional model approach, where feature functions are defined over segments of video and their corresponding letter labels. We use classifiers of letters and linguistic handshape features, along with expected motion profiles, to define segmental feature functions. This approach improves letter error rate (Levenshtein distance between hypothesized and correct letter sequences) from 16.3% using a hidden Markov model baseline to 11.6% using the proposed semi-Markov model.

2 0.71047074 428 iccv-2013-Translating Video Content to Natural Language Descriptions

Author: Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, Bernt Schiele

Abstract: Humans use rich natural language to describe and communicate visual perceptions. In order to provide natural language descriptions for visual content, this paper combines two important ingredients. First, we generate a rich semantic representation of the visual content including e.g. object and activity labels. To predict the semantic representation we learn a CRF to model the relationships between different components of the visual input. And second, we propose to formulate the generation of natural language as a machine translation problem using the semantic representation as source language and the generated sentences as target language. For this we exploit the power of a parallel corpus of videos and textual descriptions and adapt statistical machine translation to translate between our two languages. We evaluate our video descriptions on the TACoS dataset [23], which contains video snippets aligned with sentence descriptions. Using automatic evaluation and human judgments we show significant improvements over several baseline approaches, motivated by prior work. Our translation approach also shows improvements over related work on an image description task.

3 0.66827381 22 iccv-2013-A New Adaptive Segmental Matching Measure for Human Activity Recognition

Author: Shahriar Shariat, Vladimir Pavlovic

Abstract: The problem of human activity recognition is a central problem in many real-world applications. In this paper we propose a fast and effective segmental alignment-based method that is able to classify activities and interactions in complex environments. We empirically show that such a model is able to recover the alignment that leads to improved similarity measures within sequence classes and hence raises the classification performance. We also apply a bounding technique on the histogram distances to reduce the computation of the otherwise exhaustive search.

4 0.66720724 452 iccv-2013-YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition

Author: Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, Kate Saenko

Abstract: Despite a recent push towards large-scale object recognition, activity recognition remains limited to narrow domains and small vocabularies of actions. In this paper, we tackle the challenge of recognizing and describing activities “in-the-wild”. We present a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object. Unlike previous work, our approach works on out-of-domain actions: it does not require training videos of the exact activity. If it cannot find an accurate prediction for a pre-trained model, it finds a less specific answer that is also plausible from a pragmatic standpoint. We use semantic hierarchies learned from the data to help to choose an appropriate level of generalization, and priors learned from web-scale natural language corpora to penalize unlikely combinations of actors/actions/objects; we also use a web-scale language model to “fill in” novel verbs, i.e. when the verb does not appear in the training set. We evaluate our method on a large YouTube corpus and demonstrate it is able to generate short sentence descriptions of video clips better than baseline approaches.

5 0.62749988 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences

Author: Bing Su, Xiaoqing Ding

Abstract: Dimensionality reduction for vectors in sequences is challenging since labels are attached to sequences as a whole. This paper presents a model-based dimensionality reduction method for vector sequences, namely linear sequence discriminant analysis (LSDA), which attempts to find a subspace in which sequences of the same class are projected together while those of different classes are projected as far as possible. For each sequence class, an HMM is built from states of which statistics are extracted. Means of these states are linked in order to form a mean sequence, and the variance of the sequence class is defined as the sum of all variances of component states. LSDA then learns a transformation by maximizing the separability between sequence classes and at the same time minimizing the within-sequence class scatter. DTW distance between mean sequences is used to measure the separability between sequence classes. We show that the optimization problem can be approximately transformed into an eigen decomposition problem. LDA can be seen as a special case of LSDA by considering non-sequential vectors as sequences of length one. The effectiveness of the proposed LSDA is demonstrated on two individual sequence datasets from UCI machine learning repository as well as two concatenate sequence datasets: APTI Arabic printed text database and IFN/ENIT Arabic handwriting database.

6 0.60629904 315 iccv-2013-PhotoOCR: Reading Text in Uncontrolled Conditions

7 0.60019845 451 iccv-2013-Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions

8 0.59088635 246 iccv-2013-Learning the Visual Interpretation of Sentences

9 0.53479683 442 iccv-2013-Video Segmentation by Tracking Many Figure-Ground Segments

10 0.52950394 145 iccv-2013-Estimating the Material Properties of Fabric from Video

11 0.51850575 440 iccv-2013-Video Event Understanding Using Natural Language Descriptions

12 0.51802862 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories

13 0.503497 130 iccv-2013-Dynamic Structured Model Selection

14 0.49981877 74 iccv-2013-Co-segmentation by Composition

15 0.49520239 345 iccv-2013-Recognizing Text with Perspective Distortion in Natural Scenes

16 0.49406195 37 iccv-2013-Action Recognition and Localization by Hierarchical Space-Time Segments

17 0.48934969 412 iccv-2013-Synergistic Clustering of Image and Segment Descriptors for Unsupervised Scene Understanding

18 0.4873569 57 iccv-2013-BOLD Features to Detect Texture-less Objects

19 0.47219747 416 iccv-2013-The Interestingness of Images

20 0.47145706 160 iccv-2013-Fast Object Segmentation in Unconstrained Video


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(2, 0.05), (7, 0.012), (22, 0.331), (26, 0.055), (31, 0.044), (34, 0.015), (42, 0.064), (48, 0.014), (64, 0.056), (73, 0.043), (78, 0.023), (84, 0.019), (89, 0.14), (98, 0.013)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.70381761 170 iccv-2013-Fingerspelling Recognition with Semi-Markov Conditional Random Fields

Author: Taehwan Kim, Greg Shakhnarovich, Karen Livescu

Abstract: Recognition of gesture sequences is in general a very difficult problem, but in certain domains the difficulty may be mitigated by exploiting the domain’s “grammar”. One such grammatically constrained gesture sequence domain is sign language. In this paper we investigate the case of fingerspelling recognition, which can be very challenging due to the quick, small motions of the fingers. Most prior work on this task has assumed a closed vocabulary of fingerspelled words; here we study the more natural open-vocabulary case, where the only domain knowledge is the possible fingerspelled letters and statistics of their sequences. We develop a semi-Markov conditional model approach, where feature functions are defined over segments of video and their corresponding letter labels. We use classifiers of letters and linguistic handshape features, along with expected motion profiles, to define segmental feature functions. This approach improves letter error rate (Levenshtein distance between hypothesized and correct letter sequences) from 16.3% using a hidden Markov model baseline to 11.6% using the proposed semi-Markov model.

2 0.63476157 49 iccv-2013-An Enhanced Structure-from-Motion Paradigm Based on the Absolute Dual Quadric and Images of Circular Points

Author: Lilian Calvet, Pierre Gurdjos

Abstract: This work aims at introducing a new unified Structure-from-Motion (SfM) paradigm in which images of circular point-pairs can be combined with images of natural points. An imaged circular point-pair encodes the 2D Euclidean structure of a world plane and can easily be derived from the image of a planar shape, especially those including circles. A classical SfM method generally runs two steps: first a projective factorization of all matched image points (into projective cameras and points) and second a camera self-calibration that updates the obtained world from projective to Euclidean. This work shows how to introduce images of circular points in these two SfM steps while its key contribution is to provide the theoretical foundations for combining “classical” linear self-calibration constraints with additional ones derived from such images. We show that the two proposed SfM steps clearly contribute to better results than the classical approach. We validate our contributions on synthetic and real images.

3 0.61487377 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests

Author: Danhang Tang, Tsz-Ho Yu, Tae-Kyun Kim

Abstract: This paper presents the first semi-supervised transductive algorithm for real-time articulated hand pose estimation. Noisy data and occlusions are the major challenges of articulated hand pose estimation. In addition, the discrepancies among realistic and synthetic pose data undermine the performances of existing approaches that use synthetic data extensively in training. We therefore propose the Semi-supervised Transductive Regression (STR) forest which learns the relationship between a small, sparsely labelled realistic dataset and a large synthetic dataset. We also design a novel data-driven, pseudo-kinematic technique to refine noisy or occluded joints. Our contributions include: (i) capturing the benefits of both realistic and synthetic data via transductive learning; (ii) showing accuracies can be improved by considering unlabelled data; and (iii) introducing a pseudo-kinematic technique to refine articulations efficiently. Experimental results show not only the promising performance of our method with respect to noise and occlusions, but also its superiority over state-of-the-arts in accuracy, robustness and speed.

4 0.53247446 75 iccv-2013-CoDeL: A Human Co-detection and Labeling Framework

Author: Jianping Shi, Renjie Liao, Jiaya Jia

Abstract: We propose a co-detection and labeling (CoDeL) framework to identify persons that contain self-consistent appearance in multiple images. Our CoDeL model builds upon the deformable part-based model to detect human hypotheses and exploits cross-image correspondence via a matching classifier. Relying on a Gaussian process, this matching classifier models the similarity of two hypotheses and efficiently captures the relative importance contributed by various visual features, reducing the adverse effect of scattered occlusion. Further, the detector and matching classifier together make our model fit into a semi-supervised co-training framework, which can get enhanced results with a small amount of labeled training data. Our CoDeL model achieves decent performance on existing and new benchmark datasets.

5 0.53170693 274 iccv-2013-Monte Carlo Tree Search for Scheduling Activity Recognition

Author: Mohamed R. Amer, Sinisa Todorovic, Alan Fern, Song-Chun Zhu

Abstract: This paper presents an efficient approach to video parsing. Our videos show a number of co-occurring individual and group activities. To address challenges of the domain, we use an expressive spatiotemporal AND-OR graph (ST-AOG) that jointly models activity parts, their spatiotemporal relations, and context, as well as enables multitarget tracking. The standard ST-AOG inference is prohibitively expensive in our setting, since it would require running a multitude of detectors, and tracking their detections in a long video footage. This problem is addressed by formulating a cost-sensitive inference of ST-AOG as Monte Carlo Tree Search (MCTS). For querying an activity in the video, MCTS optimally schedules a sequence of detectors and trackers to be run, and where they should be applied in the space-time volume. Evaluation on the benchmark datasets demonstrates that MCTS enables two-magnitude speed-ups without compromising accuracy relative to the standard cost-insensitive inference.

6 0.48836145 60 iccv-2013-Bayesian Robust Matrix Factorization for Image and Video Processing

7 0.48499221 420 iccv-2013-Topology-Constrained Layered Tracking with Latent Flow

8 0.48413429 175 iccv-2013-From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding

9 0.48361674 230 iccv-2013-Latent Data Association: Bayesian Model Selection for Multi-target Tracking

10 0.48321608 425 iccv-2013-Tracking via Robust Multi-task Multi-view Joint Sparse Representation

11 0.48306459 150 iccv-2013-Exemplar Cut

12 0.48305592 127 iccv-2013-Dynamic Pooling for Complex Event Recognition

13 0.48257759 359 iccv-2013-Robust Object Tracking with Online Multi-lifespan Dictionary Learning

14 0.48252255 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning

15 0.48249522 376 iccv-2013-Scene Text Localization and Recognition with Oriented Stroke Detection

16 0.48249441 89 iccv-2013-Constructing Adaptive Complex Cells for Robust Visual Tracking

17 0.48239192 65 iccv-2013-Breaking the Chain: Liberation from the Temporal Markov Assumption for Tracking Human Poses

18 0.48224771 204 iccv-2013-Human Attribute Recognition by Rich Appearance Dictionary

19 0.48213401 107 iccv-2013-Deformable Part Descriptors for Fine-Grained Recognition and Attribute Prediction

20 0.48211852 253 iccv-2013-Linear Sequence Discriminant Analysis: A Model-Based Dimensionality Reduction Method for Vector Sequences