iccv iccv2013 iccv2013-437 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuru Pei, Tae-Kyun Kim, Hongbin Zha
Abstract: Lipreading from visual channels remains a challenging topic considering the various speaking characteristics. In this paper, we address an efficient lipreading approach by investigating the unsupervised random forest manifold alignment (RFMA). The density random forest is employed to estimate affinity of patch trajectories in speaking facial videos. We propose novel criteria for node splitting to avoid the rank-deficiency in learning density forests. By virtue of the hierarchical structure of random forests, the trajectory affinities are measured efficiently, which are used to find embeddings of the speaking video clips by a graph-based algorithm. Lipreading is formulated as matching between manifolds of query and reference video clips. We employ the manifold alignment technique for matching, where the L∞-norm-based manifold-to-manifold distance is proposed to find the matching pairs. We apply this random forest manifold alignment technique to various video data sets captured by consumer cameras. The experiments demonstrate that lipreading can be performed effectively and that the approach outperforms the state of the art.
Reference: text
sentIndex sentText sentNum sentScore
1 In this paper, we address an efficient lipreading approach by investigating the unsupervised random forest manifold alignment (RFMA). [sent-11, score-1.36]
2 The density random forest is employed to estimate affinity of patch trajectories in speaking facial videos. [sent-12, score-0.906]
3 We propose novel criteria for node splitting to avoid the rank-deficiency in learning density forests. [sent-13, score-0.283]
4 By virtue of the hierarchical structure of random forests, the trajectory affinities are measured efficiently, which are used to find embeddings of the speaking video clips by a graph-based algorithm. [sent-14, score-0.337]
5 We employ the manifold alignment technique for matching, where the L∞-norm-based manifold-to-manifold distance is proposed to find the matching pairs. [sent-16, score-0.266]
6 We apply this random forest manifold alignment technique to various video data sets captured by consumer cameras. [sent-17, score-0.635]
7 The experiments demonstrate that lipreading can be performed effectively and that the approach outperforms the state of the art. [sent-18, score-0.733]
8 Introduction Automatic lipreading plays an important role in communications in noisy environments, e. [sent-20, score-0.733]
9 The lipreading is traditionally viewed as a supplement to speech recognition [18]. [sent-23, score-0.763]
10 In recent years, more research has focused on lipreading solely from visual channels. [sent-24, score-0.761]
11 However, the robust lipreading from visual channels still faces challenges in the following three aspects. [sent-26, score-0.795]
12 The situation becomes worse when it comes to the lipreading of a set of subjects. [sent-28, score-0.733]
13 The generalization capacity is desirable in the lipreading tasks. [sent-32, score-0.733]
14 Second, the lipreading system should be realtime considering the communication requirements. [sent-33, score-0.733]
15 Third, the current lipreading systems based on ordinary videos suffer from illumination and texture variations. [sent-34, score-0.808]
16 Confronted with these problems, we propose a novel framework to handle lipreading of phrases by the alignment of random forest manifolds as shown in Fig. [sent-36, score-1.258]
17 The lipreading is performed in the low-dimensional manifold instead of the original feature space, where the video clip represented by a set of patch trajectories is converted to a simplex motion pattern manifold. [sent-38, score-1.312]
18 For the purpose of graph-based embedding, the affinity of patch trajectories is estimated by an unsupervised density random forest, which is known for fast training, online testing, and generalization capacity. [sent-39, score-0.567]
19 The random forest often works in a supervised manner, which is trained with labeled data to get a reasonable partition of visual words for image categorization [14], pose estimation [20], and object detection [9]. [sent-42, score-0.373]
20 Different from the supervised random forests, our random forest manifold framework works in an unsupervised manner. [sent-43, score-0.614]
21 By introducing a dummy set, a classification random forest was used to cluster unlabeled data set [2]. [sent-45, score-0.373]
22 Recently, the density forest was introduced under a Gaussian distribution assumption in tree nodes [6]. [sent-47, score-0.434]
23 We address the rank-deficiency problem by proposing novel criteria in node splitting by combining the trace-based distribution measurement and a scatter index to estimate the optimal node splitting. [sent-51, score-0.382]
24 [10] employed a supervised classification random forest to derive the distance matrix of medical images and data embedding. [sent-56, score-0.455]
25 [6] addressed the forest manifold of low dimensional toy data, where the Laplacian eigenmaps was employed to find the embedding from the affinity matrix produced by the density forest. [sent-58, score-1.026]
26 In this work, a density forest is built to measure affinities of the patch trajectories. [sent-59, score-0.539]
27 Once given a query phrase video, the extracted patch trajectories are fed to the random forest for a pairwise affinity matrix and embedding. [sent-61, score-0.917]
28 The lipreading in this paper is formulated as the matching of motion pattern manifolds of the query and those of labeled references by alignments in the embedding space. [sent-62, score-1.054]
29 The manifold pairs with the minimum distance are considered to share the same phrase label. [sent-63, score-0.289]
30 By virtue of the unsupervised random forest manifold, the lipreading can be performed by matching simplex motion patterns effectively. [sent-66, score-1.275]
31 Related Work Automatic lipreading has been studied in computer vision for several years [18]. [sent-69, score-0.733]
32 Aside from the lipreading in the original feature space, automatic lipreading in manifolds has also been investigated. [sent-74, score-1.529]
33 The embedding space with representative key images can be used for lipreading by matching uttering contours. [sent-76, score-0.96]
34 Different from the above lipreading in color videos solely, we integrate multimodal data and fusions of feature channels to improve the recognition performance, where the density forests and the manifold alignment are employed for an efficient lipreading system. [sent-79, score-2.052]
35 The manifold alignment provides an approach to establish correspondence between two embedding spaces. [sent-80, score-0.366]
36 [22] built a mutual embedding space for manifold alignment, where the transformations were solved by eigenvector decomposition of the Laplacian matrix. [sent-85, score-0.308]
37 An extended affine transformation for the non-holistic manifold alignment was proposed in [17]. [sent-86, score-0.259]
38 Patch Trajectory The patch trajectories are extracted from speaking videos captured by consumer depth cameras (Kinect). [sent-89, score-0.369]
39 The patch trajectory features include the trajectory shape s as the difference of patch positions in adjacent frames. [sent-94, score-0.4]
40 The shape and texture features are concatenated together to describe the patch trajectory t, and t = (s, h), where h is the combined texture features of HOG and LBP with respect to the color and depth images. [sent-107, score-0.376]
41 (c)The patch trajectory and the corresponding texture feature histogram h in the color and depth images. [sent-115, score-0.341]
42 We propose the random forest manifold to represent lip motions. [sent-117, score-0.661]
43 Random Forest Manifold The random forest manifold technique integrates the unsupervised density forest for affinity estimation and graph-based embedding. [sent-119, score-1.14]
44 The framework takes advantage of an efficient affinity estimation in both the training and query by hierarchical tree structures (See Section 3. [sent-120, score-0.313]
45 The density forest with a novel node splitting strategy is introduced to handle the rank-deficiency as described in Section 3. [sent-123, score-0.552]
46 Given one data set, the random forest yields an affinity matrix as described in Section 3. [sent-125, score-0.59]
47 The graph-based embedding algorithm is employed to find manifolds from the data affinities (Section 3. [sent-127, score-0.28]
48 The optimal node splitting parameters, including feature channel and random splitting threshold, are obtained by maximizing the information gain. [sent-146, score-0.303]
49 The density forest tends to produce the compact local clusters. [sent-147, score-0.379]
50 Confronted with this problem, we propose criteria for node splitting as an integration of a trace-based distribution measurement I1 and a scatter index I2. [sent-154, score-0.294]
51 Affinity The forest leaves L define a partition of the training data. [sent-183, score-0.3]
52 With an ensemble of density trees, the forest tends to yield a generalized affinity of the data set. [sent-193, score-0.299]
53 Since only points inside one leaf node cluster are considered to be similar, the affinity matrix from the random forest automatically possesses the local neighborship. [sent-198, score-0.775]
54 The cost of computing the affinity is O(NνnT), and the tree depth is ν = log2(N/nl). [sent-216, score-0.316]
55 Embedding The density tree derives the affinity matrix and the neighboring relationship of the data set simultaneously. [sent-229, score-0.379]
56 We apply the random forest manifold method to a set of toy data as shown in Fig. [sent-232, score-0.586]
57 The embeddings based on the random forest affinity are similar to those by the L2 norm and kNN, while the latter is more time-consuming. [sent-234, score-0.568]
58 By virtue of the random forest manifold embedding, the original video clip consisting of a set of patch trajectories is converted to a simplex pattern in a low dimensional space, where each point corresponds to a patch trajectory. [sent-235, score-1.123]
59 Lipreading In the phrase lipreading scenarios, there is a predefined reference phrase corpus. [sent-237, score-0.964]
60 The lipreading is performed by finding the most similar reference clip and assigning its label to the query clip. [sent-238, score-0.868]
61 In our experiments, the patch trajectories from the predefined phrase data set serve as the training data for the random forest. [sent-239, score-0.42]
62 Once given the affinity matrix of trajectories, the reference phrases can be embedded to a low dimensional motion pattern set Θ = {Pr}. [sent-240, score-0.47]
63 The lipreading is performed in the embedding space by searching a reference motion pattern Pr that best matches the query Pq. [sent-242, score-1.03]
64 We employ the manifold alignment technique [17] to estimate the pattern correspondence and the manifold-to-manifold distance. [sent-250, score-0.268]
65 Manifold Alignment The goal of manifold alignment is to transform the reference and the query motion patterns, Pr and Pq, to a mutual embedding space. [sent-253, score-0.507]
66 In case the query and reference video clips share the same patch extraction configurations, the correspondence between the two manifolds is known in advance. [sent-254, score-0.354]
67 The phrase label of the reference motion pattern with the minimum distance to Pq is assigned to the query clip for lipreading. [sent-276, score-0.308]
68 The random forest manifold From left to right: punctured spheres, corner. [sent-284, score-0.526]
69 1, the criterion for the node splitting is a combination of a trace-based cluster compactness I1 and a scatter index I2. [sent-319, score-0.3]
70 The lipreading results based on the shapes (RFMAshape), the color (RFMAHOG+LBP(color)) and depth patches (RFMAHOG(depth)) solely and integration of all feature channels together (RFMAfusion) are shown in Table 3. [sent-328, score-0.929]
71 We have compared our method with the recent lipreading works [24, 25, 26] on OULUVS data set, where the features of the patch trajectories are only extracted from color videos. [sent-330, score-1.006]
72 The fusion of the color patch features (HOG and LBP) and the trajectories shape (RFMAfusion) outperformed the reported state-of-the-arts. [sent-332, score-0.273]
73 It is a promising way to achieve a powerful lipreading system by virtue of multimodal data. [sent-344, score-0.804]
74 We employ the leave-one-out strategy, where the patch trajectories from one subject are removed, and the remaining data are used as the training data for the random forest manifolds. [sent-348, score-0.648]
75 Almost all of the automatic lipreading literature reports this problem [1, 24, 26], which comes from the personal characteristics during speaking, and some person-specific texture differences caused by moustache, skin color, lip and teeth shapes. [sent-352, score-0.903]
76 We compare the time costs for affinity matrices of 3D corner data and 2730-dimensional patch trajectories data by the proposed random forest and the kNN graph-based method [3]. [sent-358, score-0.857]
77 Since the forest traversal is extremely fast and independent of the data dimensionality, our method is especially superior to the kNN for high dimensional and large data sets. [sent-360, score-0.416]
78 In the patch trajectory extraction, a small image region around the salient marker is extracted, and concatenated together as a patch trajectory. [sent-364, score-0.328]
79 It is interesting to note that the optimal patch sizes of the color and depth videos are different. [sent-365, score-0.274]
80 7, the video patch size of KinectVS is set at 15 × 15, while the patch size for depth video is set at 7 × 7. [sent-367, score-0.348]
81 The time cost in computation of affinity matrices by kNN [3] and our random forest of 3D corner data (a) and 2730- dimensional patch trajectories (b) of different sizes. [sent-385, score-0.889]
82 one point in the leaf node, will lead to a very deep tree with nt leaf nodes and log2 nt levels for a balanced tree. [sent-392, score-0.363]
83 With various leaf size thresholds, the forest tends to yield different affinity matrices. [sent-393, score-0.617]
84 8(a), the small leaf size threshold yields a sparser matrix than that of the large leaf size. [sent-395, score-0.275]
85 In the tree with a small leaf size, the clusters corresponding to the leaf node could be compact, and Figure 7. [sent-398, score-0.393]
86 Accuracy variations with different patch sizes in color (a) and depth (b) videos. [sent-399, score-0.255]
87 The affinity matrix and the embedding of corner data with different leaf sizes of 20, 60 and 100 (from left to right). [sent-401, score-0.554]
88 (b) The difference e between the affinity matrices by our random forest and L2 norm with different forest sizes in corner data. [sent-402, score-0.914]
89 In the lipreading experiments, the accuracy reaches a local maximum when nl is set at 30. [sent-404, score-0.76]
90 It is worth noting that there is one danger with a large leaf size: the points inside the same leaf node may not be similar, which could impair the embedding (Fig. [sent-405, score-0.603]
91 It is believed that in the random forest the more trees, the more accurate fitting to the original data distribution and affinity estimation. [sent-408, score-0.565]
92 However, the comparatively large computation cost is introduced both in the training and testing processes with increasing forest size. [sent-409, score-0.3]
93 The difference e decreases when enlarging the forest size. [sent-426, score-0.3]
94 The accuracy reaches a local maximum when the forest size is 17 in lipreading experiments. [sent-427, score-1.06]
95 Conclusions We have presented a random forest manifold technique and applied it to lipreading in color and depth videos. [sent-429, score-1.365]
96 The video clips represented as a set of patch trajectories are converted to simplex motion patterns in the embedding space. [sent-430, score-0.523]
97 The lipreading is realized by motion pattern matching based on the manifold alignment. [sent-431, score-0.979]
98 Our framework takes advantage of the efficient training and testing of random forest, especially for affinity estimation, together with the unsupervised manifold distance estimation by the manifold alignment. [sent-433, score-0.669]
99 The proposed method can handle large data set efficiently, and at the same time can perform lipreading in relatively low-resolution videos effectively. [sent-434, score-0.801]
100 Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. [sent-469, score-0.26]
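The affinity estimation described in sentences 51 to 53 can be illustrated with a small sketch. It is not the authors' implementation. It assumes that each tree has already assigned every point to a leaf, and it uses a simple binary rule (two points are similar in a tree only if they share a leaf), averaged over trees. The function name forest_affinity and the leaf_ids array are hypothetical, and the exact within-leaf affinity used in the paper may differ.

```python
import numpy as np

def forest_affinity(leaf_ids):
    """Average leaf co-occurrence over trees (assumed binary rule).

    leaf_ids: (N, T) integer array; leaf_ids[i, t] is the index of the
    leaf that point i reaches in tree t (hypothetical input format).
    Returns an (N, N) symmetric matrix with entries in [0, 1].
    """
    leaf_ids = np.asarray(leaf_ids)
    n_points, n_trees = leaf_ids.shape
    affinity = np.zeros((n_points, n_points))
    for t in range(n_trees):
        col = leaf_ids[:, t]
        # 1 where two points fall into the same leaf of tree t, else 0.
        affinity += (col[:, None] == col[None, :])
    return affinity / n_trees

# Toy example: 4 points, 2 trees.
leaf_ids = np.array([[0, 1],
                     [0, 1],
                     [1, 1],
                     [2, 0]])
A = forest_affinity(leaf_ids)
```

Because points in different leaves contribute zero, the resulting matrix is sparse for small leaf sizes, which matches the remark in sentence 84 that a small leaf size threshold yields a sparser matrix.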
wordName wordTfidf (topN-words)
[('lipreading', 0.733), ('forest', 0.3), ('affinity', 0.192), ('manifold', 0.181), ('lip', 0.135), ('patch', 0.128), ('embedding', 0.127), ('leaf', 0.125), ('kinectvs', 0.1), ('uttering', 0.1), ('node', 0.088), ('splitting', 0.085), ('phrase', 0.081), ('trajectories', 0.08), ('density', 0.079), ('trajectory', 0.072), ('depth', 0.069), ('pq', 0.067), ('ouluvs', 0.067), ('query', 0.066), ('scatter', 0.065), ('manifolds', 0.063), ('channels', 0.062), ('dimensional', 0.06), ('phrases', 0.059), ('alignment', 0.058), ('tree', 0.055), ('speaking', 0.052), ('pr', 0.052), ('avletters', 0.05), ('deserved', 0.05), ('lips', 0.05), ('simplex', 0.049), ('knn', 0.047), ('virtue', 0.047), ('forests', 0.047), ('random', 0.045), ('unsupervised', 0.043), ('videos', 0.04), ('reference', 0.039), ('covariance', 0.039), ('subject', 0.039), ('tji', 0.039), ('compactness', 0.037), ('color', 0.037), ('corner', 0.036), ('motion', 0.036), ('texture', 0.035), ('clips', 0.035), ('bjq', 0.033), ('jaws', 0.033), ('rfmafusion', 0.033), ('rfmahog', 0.033), ('lbp', 0.033), ('toy', 0.032), ('determinant', 0.032), ('affinities', 0.032), ('embeddings', 0.031), ('criteria', 0.031), ('speech', 0.03), ('clip', 0.03), ('embedded', 0.03), ('employed', 0.03), ('predefined', 0.03), ('dimensionality', 0.029), ('nt', 0.029), ('pattern', 0.029), ('pages', 0.028), ('solely', 0.028), ('data', 0.028), ('fusion', 0.028), ('lafon', 0.027), ('mq', 0.027), ('reaches', 0.027), ('distance', 0.027), ('dependent', 0.026), ('sd', 0.025), ('index', 0.025), ('matrix', 0.025), ('children', 0.025), ('nq', 0.025), ('multimodal', 0.024), ('resolution', 0.024), ('mr', 0.024), ('matchings', 0.024), ('confronted', 0.024), ('zhao', 0.023), ('video', 0.023), ('converted', 0.023), ('peking', 0.023), ('subjects', 0.022), ('cox', 0.022), ('patterns', 0.022), ('thresholds', 0.021), ('pei', 0.021), ('sizes', 0.021), ('centroids', 0.02), ('dm', 0.02), ('matrices', 0.02), ('affine', 0.02)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 437 iccv-2013-Unsupervised Random Forest Manifold Alignment for Lipreading
2 0.16300768 297 iccv-2013-Online Motion Segmentation Using Dynamic Label Propagation
Author: Ali Elqursh, Ahmed Elgammal
Abstract: The vast majority of work on motion segmentation adopts the affine camera model due to its simplicity. Under the affine model, the motion segmentation problem becomes that of subspace separation. Due to this assumption, such methods are mainly offline and exhibit poor performance when the assumption is not satisfied. This is made evident in state-of-the-art methods that relax this assumption by using piecewise affine spaces and spectral clustering techniques to achieve better results. In this paper, we formulate the problem of motion segmentation as that of manifold separation. We then show how label propagation can be used in an online framework to achieve manifold separation. The performance of our framework is evaluated on a benchmark dataset and achieves competitive performance while being online.
3 0.1393567 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection
Author: Javier Marín, David Vázquez, Antonio M. López, Jaume Amores, Bastian Leibe
Abstract: Pedestrian detection is one of the most challenging tasks in computer vision, and has received a lot of attention in the last years. Recently, some authors have shown the advantages of using combinations of part/patch-based detectors in order to cope with the large variability of poses and the existence of partial occlusions. In this paper, we propose a pedestrian detection method that efficiently combines multiple local experts by means of a Random Forest ensemble. The proposed method works with rich block-based representations such as HOG and LBP, in such a way that the same features are reused by the multiple local experts, so that no extra computational cost is needed with respect to a holistic method. Furthermore, we demonstrate how to integrate the proposed approach with a cascaded architecture in order to achieve not only high accuracy but also an acceptable efficiency. In particular, the resulting detector operates at five frames per second using a laptop machine. We tested the proposed method with well-known challenging datasets such as Caltech, ETH, Daimler, and INRIA. The method proposed in this work consistently ranks among the top performers in all the datasets, being either the best method or having a small difference with the best one.
4 0.12398446 404 iccv-2013-Structured Forests for Fast Edge Detection
Author: Piotr Dollár, C. Lawrence Zitnick
Abstract: Edge detection is a critical component of many vision systems, including object detectors and image segmentation algorithms. Patches of edges exhibit well-known forms of local structure, such as straight lines or T-junctions. In this paper we take advantage of the structure present in local image patches to learn both an accurate and computationally efficient edge detector. We formulate the problem of predicting local edge masks in a structured learning framework applied to random decision forests. Our novel approach to learning decision trees robustly maps the structured labels to a discrete space on which standard information gain measures may be evaluated. The result is an approach that obtains realtime performance that is orders of magnitude faster than many competing state-of-the-art approaches, while also achieving state-of-the-art edge detection results on the BSDS500 Segmentation dataset and NYU Depth dataset. Finally, we show the potential of our approach as a general purpose edge detector by showing our learned edge models generalize well across datasets.
5 0.11407956 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
Author: Xiatian Zhu, Chen Change Loy, Shaogang Gong
Abstract: Generating coherent synopsis for surveillance video stream remains a formidable challenge due to the ambiguity and uncertainty inherent to visual observations. In contrast to existing video synopsis approaches that rely on visual cues alone, we propose a novel multi-source synopsis framework capable of correlating visual data and independent non-visual auxiliary information to better describe and summarise subtle physical events in complex scenes. Specifically, our unsupervised framework is capable of seamlessly uncovering latent correlations among heterogeneous types of data sources, despite the non-trivial heteroscedasticity and dimensionality discrepancy problems. Additionally, the proposed model is robust to partial or missing non-visual information. We demonstrate the effectiveness of our framework on two crowded public surveillance datasets.
6 0.1134679 259 iccv-2013-Manifold Based Face Synthesis from Sparse Samples
7 0.11090717 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests
8 0.10690516 10 iccv-2013-A Framework for Shape Analysis via Hilbert Space Embedding
9 0.10653184 439 iccv-2013-Video Co-segmentation for Meaningful Action Extraction
10 0.10627595 133 iccv-2013-Efficient Hand Pose Estimation from a Single Depth Image
11 0.098899387 47 iccv-2013-Alternating Regression Forests for Object Detection and Pose Estimation
12 0.094386086 265 iccv-2013-Mining Motion Atoms and Phrases for Complex Action Recognition
13 0.086884305 352 iccv-2013-Revisiting Example Dependent Cost-Sensitive Learning with Decision Trees
14 0.085537352 68 iccv-2013-Camera Alignment Using Trajectory Intersections in Unsynchronized Videos
15 0.08485353 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
16 0.084731154 421 iccv-2013-Total Variation Regularization for Functions with Values in a Manifold
17 0.080842018 361 iccv-2013-Robust Trajectory Clustering for Motion Segmentation
18 0.080731593 305 iccv-2013-POP: Person Re-identification Post-rank Optimisation
19 0.079526074 214 iccv-2013-Improving Graph Matching via Density Maximization
20 0.079268262 314 iccv-2013-Perspective Motion Segmentation via Collaborative Clustering
topicId topicWeight
[(0, 0.171), (1, -0.002), (2, -0.033), (3, 0.013), (4, -0.025), (5, 0.083), (6, 0.037), (7, 0.041), (8, 0.043), (9, 0.005), (10, -0.006), (11, 0.015), (12, 0.02), (13, 0.027), (14, 0.073), (15, 0.048), (16, -0.016), (17, -0.128), (18, 0.036), (19, 0.067), (20, -0.091), (21, 0.047), (22, 0.057), (23, 0.186), (24, -0.034), (25, 0.081), (26, -0.0), (27, 0.087), (28, 0.023), (29, -0.046), (30, 0.057), (31, 0.034), (32, -0.111), (33, 0.094), (34, -0.009), (35, 0.038), (36, -0.032), (37, -0.016), (38, -0.003), (39, -0.092), (40, -0.051), (41, -0.007), (42, -0.039), (43, -0.106), (44, -0.056), (45, -0.002), (46, 0.041), (47, 0.039), (48, 0.052), (49, -0.09)]
simIndex simValue paperId paperTitle
same-paper 1 0.94406241 437 iccv-2013-Unsupervised Random Forest Manifold Alignment for Lipreading
2 0.66951251 47 iccv-2013-Alternating Regression Forests for Object Detection and Pose Estimation
Author: Samuel Schulter, Christian Leistner, Paul Wohlhart, Peter M. Roth, Horst Bischof
Abstract: We present Alternating Regression Forests (ARFs), a novel regression algorithm that learns a Random Forest by optimizing a global loss function over all trees. This interrelates the information of single trees during the training phase and results in more accurate predictions. ARFs can minimize any differentiable regression loss without sacrificing the appealing properties of Random Forests, like low computational complexity during both, training and testing. Inspired by recent developments for classification [19], we derive a new algorithm capable of dealing with different regression loss functions, discuss its properties and investigate the relations to other methods like Boosted Trees. We evaluate ARFs on standard machine learning benchmarks, where we observe better generalization power compared to both standard Random Forests and Boosted Trees. Moreover, we apply the proposed regressor to two computer vision applications: object detection and head pose estimation from depth images. ARFs outperform the Random Forest baselines in both tasks, illustrating the importance of optimizing a common loss function for all trees.
3 0.62680209 404 iccv-2013-Structured Forests for Fast Edge Detection
Author: Piotr Dollár, C. Lawrence Zitnick
Abstract: Edge detection is a critical component of many vision systems, including object detectors and image segmentation algorithms. Patches of edges exhibit well-known forms of local structure, such as straight lines or T-junctions. In this paper we take advantage of the structure present in local image patches to learn both an accurate and computationally efficient edge detector. We formulate the problem of predicting local edge masks in a structured learning framework applied to random decision forests. Our novel approach to learning decision trees robustly maps the structured labels to a discrete space on which standard information gain measures may be evaluated. The result is an approach that obtains realtime performance that is orders of magnitude faster than many competing state-of-the-art approaches, while also achieving state-of-the-art edge detection results on the BSDS500 Segmentation dataset and NYU Depth dataset. Finally, we show the potential of our approach as a general purpose edge detector by showing our learned edge models generalize well across datasets.
4 0.5831908 352 iccv-2013-Revisiting Example Dependent Cost-Sensitive Learning with Decision Trees
Author: Oisin Mac Aodha, Gabriel J. Brostow
Abstract: Typical approaches to classification treat class labels as disjoint. For each training example, it is assumed that there is only one class label that correctly describes it, and that all other labels are equally bad. We know however, that good and bad labels are too simplistic in many scenarios, hurting accuracy. In the realm of example dependent cost-sensitive learning, each label is instead a vector representing a data point’s affinity for each of the classes. At test time, our goal is not to minimize the misclassification rate, but to maximize that affinity. We propose a novel example dependent cost-sensitive impurity measure for decision trees. Our experiments show that this new impurity measure improves test performance while still retaining the fast test times of standard classification trees. We compare our approach to classification trees and other cost-sensitive methods on three computer vision problems, tracking, descriptor matching, and optical flow, and show improvements in all three domains.
5 0.58305085 443 iccv-2013-Video Synopsis by Heterogeneous Multi-source Correlation
Author: Xiatian Zhu, Chen Change Loy, Shaogang Gong
Abstract: Generating a coherent synopsis for a surveillance video stream remains a formidable challenge due to the ambiguity and uncertainty inherent to visual observations. In contrast to existing video synopsis approaches that rely on visual cues alone, we propose a novel multi-source synopsis framework capable of correlating visual data and independent non-visual auxiliary information to better describe and summarise subtle physical events in complex scenes. Specifically, our unsupervised framework is capable of seamlessly uncovering latent correlations among heterogeneous types of data sources, despite the non-trivial heteroscedasticity and dimensionality discrepancy problems. Additionally, the proposed model is robust to partial or missing non-visual information. We demonstrate the effectiveness of our framework on two crowded public surveillance datasets.
6 0.55237144 336 iccv-2013-Random Forests of Local Experts for Pedestrian Detection
7 0.53526855 178 iccv-2013-From Semi-supervised to Transfer Counting of Crowds
8 0.51715595 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
9 0.50226367 297 iccv-2013-Online Motion Segmentation Using Dynamic Label Propagation
10 0.49795371 259 iccv-2013-Manifold Based Face Synthesis from Sparse Samples
11 0.45926595 340 iccv-2013-Real-Time Articulated Hand Pose Estimation Using Semi-supervised Transductive Regression Forests
12 0.43705931 101 iccv-2013-DCSH - Matching Patches in RGBD Images
13 0.4291378 133 iccv-2013-Efficient Hand Pose Estimation from a Single Depth Image
14 0.42760459 11 iccv-2013-A Fully Hierarchical Approach for Finding Correspondences in Non-rigid Shapes
15 0.42760336 361 iccv-2013-Robust Trajectory Clustering for Motion Segmentation
16 0.42638758 134 iccv-2013-Efficient Higher-Order Clustering on the Grassmann Manifold
17 0.41953748 193 iccv-2013-Heterogeneous Auto-similarities of Characteristics (HASC): Exploiting Relational Information for Classification
18 0.41792333 421 iccv-2013-Total Variation Regularization for Functions with Values in a Manifold
19 0.41703925 176 iccv-2013-From Large Scale Image Categorization to Entry-Level Categories
20 0.41544753 117 iccv-2013-Discovering Details and Scene Structure with Hierarchical Iconoid Shift
topicId topicWeight
[(2, 0.111), (4, 0.012), (7, 0.029), (13, 0.013), (26, 0.083), (31, 0.037), (38, 0.19), (40, 0.027), (42, 0.079), (48, 0.021), (64, 0.025), (73, 0.032), (89, 0.204), (95, 0.02), (98, 0.014)]
simIndex simValue paperId paperTitle
1 0.90000463 84 iccv-2013-Complex 3D General Object Reconstruction from Line Drawings
Author: Linjie Yang, Jianzhuang Liu, Xiaoou Tang
Abstract: An important topic in computer vision is 3D object reconstruction from line drawings. Previous algorithms either deal with simple general objects or are limited to only manifolds (a subset of solids). In this paper, we propose a novel approach to 3D reconstruction of complex general objects, including manifolds, non-manifold solids, and non-solids. Through developing some 3D object properties, we use the degree of freedom of objects to decompose a complex line drawing into multiple simpler line drawings that represent meaningful building blocks of a complex object. After 3D objects are reconstructed from the decomposed line drawings, they are merged to form a complex object from their touching faces, edges, and vertices. Our experiments show a number of reconstruction examples from both complex line drawings and images with line drawings superimposed. Comparisons are also given to indicate that our algorithm can deal with much more complex line drawings of general objects than previous algorithms.
1. Introduction and Related Work
A 2D line drawing is the most straightforward way of illustrating a 3D object. Given a line drawing representing a 3D object, our visual system can understand the 3D structure easily. For example, we can interpret without difficulty the line drawing shown in Fig. 1(a) as a castle with four walls and one door. Imitating this ability has been a longstanding and challenging topic in computer vision when a line drawing is as complex as this example. The applications of this work include 3D object design in CAD and for 3D printers, 3D query generation for 3D object retrieval, and 3D modeling from images.
Fig. 1. (a) A line drawing representing a castle. (b) The 3D model of the line drawing.
In this paper, same as the majority of related work, a line drawing is defined as the orthogonal projection of the edges and vertices of a 3D object in a generic view, and objects with planar surfaces are considered.
A line drawing is represented by an edge-vertex graph. It can be obtained by the user/designer who draws on the screen with a tablet pen, a mouse, or a finger (on a touch sensitive screen), with all, with some, or without hidden edges and vertices.
Line labeling is the earliest work to interpret line drawings [1], [17]. It searches for a set of consistent labels such as convex, concave, and occluding from a line drawing to test its correctness and/or realizability. Line labeling itself cannot recover 3D shape from a line drawing. Later, 3D reconstruction from the contours (line drawings) of objects in images is studied [19], [14], [13], which handles simple objects only. Model-based 3D reconstruction [2], [3], [20] can deal with more complex objects, but these methods require a set of parametric models to be pre-defined.
Recently, popular methods of 3D reconstruction from line drawings are optimization based, which are most related to our work and are reviewed next. These methods can be classified into two categories: one dealing with manifolds and the other dealing with general objects. A general object can be a manifold, non-manifold solid, or non-solid. Manifolds are a subset of solids, defined as follows:
A manifold, or more rigorously 2D manifold, is a solid where every point on its surface has a neighborhood topologically equivalent to an open disk in the 2D Euclidean space.
In this paper, a solid is a portion of 3D space bounded by planar faces, and a manifold is also bounded by planar faces. Each edge of such manifolds is shared by exactly two faces [4].
Most 3D reconstruction methods from a line drawing assume that the face topology of the line drawing is known in advance. This information can reduce the reconstruction complexity greatly. Algorithms have been developed to find faces from a line drawing in [16], [10], and [9], where [16] and [10] are for general objects and [9] for manifolds.
Optimization-based 3D reconstruction depends on some criteria (also called image regularities) that simulate our visual perception. Marill proposes a very simple but effective criterion to reconstruct a simple object: minimizing the standard deviation of the angles (MSDA) in the object [11]. Later, other regularities are proposed to deal with more complex objects such as face planarity, line parallelism, isometry, and corner orthogonality [5], [6], [15], [18]. In these methods, an objective function
$$\Phi(z_1, z_2, \ldots, z_{N_v}) = \sum_{i=1}^{C} \omega_i \phi_i(z_1, z_2, \ldots, z_{N_v}) \qquad (1)$$
is minimized to derive the depths z1, z2, ..., zNv, where Nv is the number of vertices in the line drawing, φi, i = 1, 2, ..., C, are the regularities, and ωi, i = 1, 2, ..., C, are the weights. The main problem in this approach is that these algorithms are easily trapped in local minima (yielding failed results) when a line drawing is complex with many vertices, due to the search in a high-dimensional space (Nv dimensions) with the non-convex objective function. For example, the search space is of 56 dimensions for the object in Fig. 1(a). To alleviate this problem, Liu et al. formulate 3D reconstruction in a lower dimensional space so that the optimization procedure has a better chance to find desired results [7]. For the complex object in Fig. 1(a), however, the search in a space with 18 dimensions is still too difficult for it to obtain a satisfactory 3D object (see Section 3).
The methods in [5], [6], [15], [18], and [7] reconstruct general objects, and the one in [7] can deal with more complex objects than the other four. But these algorithms cannot avoid the local minimum problem in a high dimensional search space when a line drawing is complex. In [8], a divide-and-conquer (D&C) strategy is used to tackle this problem.
It first separates a complex line drawing into multiple simpler ones, then independently recovers the 3D objects from these line drawings, and finally merges them to form a complete object. Since the separated line drawings are much simpler than the original one, the 3D reconstruction from each of them is an easy task. This D&C approach handles manifolds only. Based on known faces found by the face identification algorithm in [9], it uses manifold properties to find internal faces from a line drawing and then separates the line drawing from the internal faces. An internal face is defined as an imaginary face lying inside a manifold with only its edges visible on the surface [8]. It is not a real face but can be considered as two coincident real faces of identical shape belonging to two manifolds. For example, Fig. 2(a) shows a manifold with nine faces. The D&C first finds the internal face (a, b, c, d) and then decomposes the line drawing from this internal face (Fig. 2(b)).
Fig. 2. (a) A simple manifold with nine faces and one internal face (a, b, c, d). (b) Decomposition result from the internal face.
Fig. 3. (a) A non-manifold solid. (b) Expected decomposition of (a). (c) A sheet object. (d) Expected decomposition of (c).
However, handling manifolds only limits the applications of [8]. In many applications in computer vision and graphics such as 3D object matching, retrieval, and rendering, it is unnecessary to represent objects as manifolds, in order to facilitate data processing and reduce data storage. For example, a flat ground can be represented by a sheet (one face), but if it is represented by a manifold, a thin box with six faces has to be used. Fig. 1(a), Fig. 3(a), and Fig. 3(c) are line drawings of three non-manifolds.
In this paper, we propose a novel approach to 3D reconstruction of complex general objects based on visual perception, object properties, and new line drawing decomposition.
Compared with previous methods, ours can deal with much more complex line drawings of general objects. It can handle not only manifolds but also non-manifold solids and non-solids, and is insensitive to sketching errors.
2. General Object Reconstruction
We also use the D&C strategy to deal with 3D reconstruction from a line drawing representing a complex general object. The key is how to decompose a complex line drawing of any objects into multiple simpler line drawings. These decomposed line drawings should represent objects that are in accordance with our visual perception, which makes the 3D reconstruction from these line drawings easier and better because the regularities used to build an objective function for reconstruction follow human perception of common objects [11], [5], [6], [15], [18].
Before the decomposition of a line drawing, we assume that all the real and internal faces of the object have been obtained from the line drawing using a face identification algorithm. For example, the algorithm in [10] finds 10 faces from the line drawing in Fig. 2(a) (including the internal face), and obtains 12 and 7 faces from the line drawings in Figs. 3(a) and (c), respectively.
2.1. Decomposing line drawings of solids
In this subsection, we consider the line drawings of solids first. The decomposition method will be extended to the line drawings of general objects in the next subsection. It is not difficult to see that in general, a complex object, especially a manmade complex object, can be considered as the combinations of multiple smaller objects. The most common combination is the touch of two faces from two different objects such as the one in Fig. 2. Other combinations are the touches among lines, faces, and vertices. Our target is to decompose a complex solid into multiple primitive solids. Before the definition of a primitive solid, we introduce a term called the degree of freedom of a solid.
Definition 1.
The degree of freedom (DoF) of a 3D solid represented by a line drawing is the minimal number of z-coordinates that can uniquely determine this 3D solid.
This is the first time that the concept of DoF is used to decompose line drawings. Now let us consider a simple object in Fig. 4(a). The cube has six faces: (v1, v2, v3, v4), (v1, v2, v6, v5), (v1, v4, v8, v5), (v2, v3, v7, v6), (v4, v3, v7, v8), and (v6, v7, v8, v5). We can show that the cube is determined if the z-coordinates of its four non-coplanar vertices are known. Without loss of generality, suppose z1, z2, z4, and z5 are known. Since the 3D coordinates of v1, v2, and v4 are fixed (recall that the x- and y-coordinates of all the vertices are known under the orthogonal projection), the 3D plane passing through the face (v1, v2, v3, v4) is determined, and thus z3 can be calculated. Similarly, z6 and z8 can be obtained. Finally, z7 can be computed with the 3D coordinates of v3, v4, and v8 known, which determine the plane passing through the face (v4, v3, v7, v8). So the 3D cube can be determined by the known four z-coordinates, z1, z2, z4, and z5. Further, it can be verified that three 3D vertices cannot determine this object uniquely because they can only define one face in 3D space. Therefore, the DoF of the cube is 4.
Similar analysis allows us to know that the solids in Fig. 2(b), Fig. 3(b), and Fig. 4(b) all have DoF 4, while the two solids in Fig. 2(a) and Fig. 3(a) have DoFs 5 and 6, respectively. From these analyses, we can have the intuition that solids with DoF 4 serve as the building blocks of more complex solids whose DoFs are more than 4. Besides, we have the following property:
Property 1. There is no solid with DoF less than 4.
Fig. 4. (a) A cube whose DoF is 4. (b) Another solid whose DoF is also 4. (c) Part of a line drawing with each vertex of degree 3.
This property is easy to verify. A solid with fewest faces is a tetrahedron.
Every two of its four faces are not co-planar. Three 3D vertices of a tetrahedron can only determine one 3D face. Next, we define primitive solids.
Definition 2. A 3D solid represented by a line drawing is called a primitive solid if its DoF is 4.
Property 2. If every vertex of a 3D solid represented by a line drawing has degree 3 (the degree of a vertex is the number of edges connected to this vertex), then it is a primitive solid.
Proof. Let part of such a line drawing be the one as shown in Fig. 4(c). At each vertex, every two of the three edges form a face, because a solid is bounded by faces without dangling faces and edges. Let the three paths fA, fB, and fC in Fig. 4(c) denote the three faces at vertex a. Without loss of generality, suppose that the four z-coordinates (and thus the four 3D coordinates) of vertices a, b, c, and d are known. Then the three planes passing through fA, fB, and fC are determined in 3D space. With the two known 3D planes passing through fA and fB at vertex b, the 3D coordinates of vertices g and h connected to b can be computed. Similarly, the 3D coordinates of vertices e and f connected to d and the 3D coordinates of vertices i and j connected to c can be obtained. Furthermore, all the 3D coordinates of the other vertices connected to e, f, g, h, i, and j can be derived in the same way. This derivation can propagate to all the vertices of this solid. Therefore, the DoF of this solid is 4 and it is a primitive solid.
Property 3. A solid obtained by gluing two faces of two primitive solids has DoF 5.
Proof. Let the two primitive solids be PS1 and PS2 and their corresponding gluing faces be f1 and f2, respectively. The DoFs of PS1 and PS2 are both 4. Suppose that PS1 is determined in 3D space, which requires four z-coordinates. Then f1 and f2 are also determined in 3D space. When the z-coordinates of three vertices on PS2 are known based on f2, one more z-coordinate of a vertex not coplanar with f2 on PS2 can determine PS2 in 3D space. Therefore, the DoF of the combined solid is 5.
Fig. 2 is a typical example of two primitive solids gluing together along faces. Fig. 3(a) is an example of two primitive solids gluing together along edges. Two primitive solids may also connect at one vertex. The following property is easy to verify.
Property 4. A solid obtained by gluing two edges of two primitive solids has DoF 6. A solid obtained by gluing two vertices of two primitive solids has DoF 7.
From the above properties, we can see that primitive solids are indeed the "smallest" solids in terms of DoF and they can serve as the building blocks to construct more complex solids. Therefore, our next target is to decompose a line drawing representing a complex solid into multiple line drawings representing primitive solids. Before giving Definition 3, we define some terms first.
Vertex set of a face. The vertex set Ver(f) of a face f is the set of all the vertices of f.
Fixed vertex. A fixed vertex is one with its z-coordinate (thus its 3D coordinate) known.
Unfixed vertex. An unfixed vertex is one with its z-coordinate unknown.
Fixed face. A fixed face is one with its 3D position determined by its three fixed vertices.
Unfixed face. An unfixed face is one with its 3D position undetermined.
Definition 3. Let the vertex set and the face set of a line drawing be V = {v1, v2, ..., vn} and F = {f1, f2, ..., fm}, respectively, where n and m are the numbers of the vertices and the faces, respectively. Also let Vfixed, Ffixed, Vunfixed, and Funfixed be the sets of fixed vertices, fixed faces, unfixed vertices, and unfixed faces, respectively. Suppose that an initial set of two fixed neighboring faces sharing an edge is Finitial with all their fixed vertices in Vinitial. The final Ffixed in Algorithm 1 is called the maximum extended face set (MEFS) from Finitial.
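The notion of a fixed face can be made concrete with a small computation. Under orthogonal projection the x- and y-coordinates of every vertex are known, so three fixed vertices that are not collinear in the drawing determine the plane of a face, and the depth of any other vertex on that face follows. The Python sketch below illustrates this step only; the function names `plane_from_points` and `depth_on_face` are hypothetical and do not come from the paper.

```python
def plane_from_points(p1, p2, p3):
    """Return (a, b, c) such that z = a*x + b*y + c passes through three 3D points."""
    (x1, y1, z1), (x2, y2, z2), (x3, y3, z3) = p1, p2, p3
    ux, uy, uz = x2 - x1, y2 - y1, z2 - z1
    vx, vy, vz = x3 - x1, y3 - y1, z3 - z1
    # Normal vector of the plane (cross product u x v).
    nx = uy * vz - uz * vy
    ny = uz * vx - ux * vz
    nz = ux * vy - uy * vx
    if abs(nz) < 1e-12:
        # nz is the 2D cross product of the projected points: zero means the
        # three vertices are collinear in the drawing and fix no plane.
        raise ValueError("vertices are collinear in the line drawing")
    a, b = -nx / nz, -ny / nz
    c = z1 - a * x1 - b * y1
    return a, b, c


def depth_on_face(fixed_vertices, x, y):
    """Depth of the vertex drawn at (x, y) on the face fixed by three 3D vertices."""
    a, b, c = plane_from_points(*fixed_vertices)
    return a * x + b * y + c


# Illustrative data: three fixed vertices lying on the plane z = x + 2y.
FIXED = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, 2.0)]
Z_FOURTH = depth_on_face(FIXED, 1.0, 1.0)
```

This is the same reasoning used for the cube in Fig. 4(a), where z3 is obtained from the plane through v1, v2, and v4.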
In Algorithm 1, a face f that satisfies the condition in step 3 is a face that has been determined in 3D space by the current fixed vertices in Ffixed. When this face is found, it becomes a fixed face and all its vertices become fixed vertices. The DoF of the initial two fixed faces combined is 4. It is not difficult to see that the algorithm does not increase the initial DoF, and thus the final object represented by the MEFS also has DoF 4. Next, let us consider a simple example shown in Fig. 2(a) with the following three cases:
Case 1. Suppose that Finitial = {(e, f, g, h), (e, f, b, a)}, Vinitial = {e, f, g, h, b, a}, and the algorithm adds the faces into Ffixed in this order: (f, g, c, b) → (a, b, c, d) → (e, h, d, a) → (g, h, d, c). Then the final object found by the algorithm is the cube. Note that the algorithm does not add any triangular faces into Ffixed because they do not satisfy the condition in step 3.
Case 2. If Finitial = {(b, i, a), (b, i, c)}, then the final object found is the pyramid, and the algorithm does not add any rectangular faces except (a, b, c, d) into Ffixed.
Case 3. If Finitial = {(b, a, i), (e, f, b, a)}, the algorithm cannot find any other faces to add to Ffixed. Thus, it fails to find the cube or pyramid.
Algorithm 1 Face extending procedure
Initialization: Funfixed = F \ Finitial, Ffixed = Finitial, Vfixed = Vinitial, Vunfixed = V \ Vinitial.
1. do the following steps until no face satisfies the condition in step 3;
2. Find a face f ∈ Funfixed that satisfies
3. the number of non-collinear vertices in Ver(f) ∩ Vfixed is more than 2;
4. Add face f into Ffixed and delete it from Funfixed;
5. For each vertex v ∈ Ver(f), if v ∈ Vunfixed, add v into Vfixed and delete it from Vunfixed;
Return The final Ffixed.
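The face extending procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' C++ implementation: faces are tuples of vertex labels, the condition in step 3 is read as "at least three of the face's fixed vertices are non-collinear in the drawing", and the 2D coordinates `XY` for the object of Fig. 2(a) are invented for the example (the paper does not give them). The names `extend_faces` and `has_three_noncollinear` are hypothetical.

```python
from itertools import combinations


def has_three_noncollinear(points, eps=1e-9):
    """True if some three of the 2D points are not collinear."""
    for (x1, y1), (x2, y2), (x3, y3) in combinations(points, 3):
        if abs((x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)) > eps:
            return True
    return False


def extend_faces(faces, xy, f_initial):
    """Grow the fixed face set from two initial neighboring faces (Algorithm 1)."""
    f_fixed = list(f_initial)
    f_unfixed = [f for f in faces if f not in f_initial]
    v_fixed = {v for f in f_initial for v in f}
    found = True
    while found:
        found = False
        for f in f_unfixed:
            fixed_pts = [xy[v] for v in f if v in v_fixed]
            if has_three_noncollinear(fixed_pts):   # step 3
                f_fixed.append(f)                   # step 4
                f_unfixed.remove(f)
                v_fixed.update(f)                   # step 5
                found = True
                break                               # rescan with the new fixed vertices
    return f_fixed


# Assumed 2D coordinates for Fig. 2(a): a cube (a..h) with a pyramid (apex i)
# glued on the internal face (a, b, c, d).
XY = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (1.5, 0.3), "d": (0.5, 0.3),
      "e": (0.0, -1.0), "f": (1.0, -1.0), "g": (1.5, -0.7), "h": (0.5, -0.7),
      "i": (0.75, 1.15)}
FACES = [("e", "f", "g", "h"), ("e", "f", "b", "a"), ("f", "g", "c", "b"),
         ("a", "b", "c", "d"), ("e", "h", "d", "a"), ("g", "h", "d", "c"),
         ("b", "i", "a"), ("b", "i", "c"), ("c", "i", "d"), ("d", "i", "a")]

CASE1 = extend_faces(FACES, XY, [("e", "f", "g", "h"), ("e", "f", "b", "a")])
CASE2 = extend_faces(FACES, XY, [("b", "i", "a"), ("b", "i", "c")])
CASE3 = extend_faces(FACES, XY, [("b", "i", "a"), ("e", "f", "b", "a")])
```

With these assumed coordinates, `CASE1` contains the six faces of the cube, `CASE2` the five faces of the pyramid, and `CASE3` only its two initial faces, which mirrors Cases 1 to 3 above.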
Fig. 5. (a) A complex line drawing of a non-manifold solid. (b) The decomposition result by our algorithm.
In case 3, the object represented by the MEFS has only two initial faces and this object is discarded. In order not to miss a primitive solid, we run Algorithm 1 multiple times each with a different pair of neighboring faces in Finitial. Then, we can always have Finitial with its two faces from one primitive solid. For the object in Fig. 2(a), we can always find the cube and the pyramid. Note that the same primitive solid may be found multiple times from different Finitial, and finally we keep only one copy of each different object (cube and pyramid in this example).
When a complex solid is formed by more than two primitive solids, Algorithm 1 can still be used to find the primitive solids, which is the decomposition result of the complex line drawing. More complex examples are given in Section 3. Besides, Algorithm 1 can also deal with complex solids formed by gluing primitive solids between edges and vertices. Fig. 5(a) is a solid constructed by gluing eight primitive solids between faces, edges, and vertices. Running Algorithm 1 multiple times with different pairs of neighboring faces in Finitial generates the primitive solids as shown in Fig. 5(b).
Fig. 6. Illustration of our decomposition method. (a) A line drawing. (b) The set of MEFSs from (a). (c) The weighted object-coexistence graph where the maximum weight clique is shown in bold. (d) The decomposition of (a).
2.2. Decomposing line drawings of general objects
A general object can be a manifold, non-manifold solid, or non-solid. Given a line drawing representing a general object, it is unknown whether this object consists of only primitive solids. However, we can always apply Algorithm 1 to the line drawing multiple times, each with a
different pair of neighboring faces in Finitial, generating a set SMEFS of MEFSs (recall that an MEFS with only two initial neighboring faces is discarded). In what follows, we also call an MEFS an object, which is represented by the MEFS. Note that an MEFS generated from a general line drawing may not be a primitive solid, but its DoF must be 4. Objects of DoF 4 have relatively simple structures and are easy to reconstruct. A number of decomposition examples of complex general line drawings can be seen in the experimental section.
One issue existing in this decomposition method is that two different MEFSs may share many faces. For example, from the line drawing in Fig. 6(a), all different MEFSs found by running Algorithm 1 multiple times are shown in Fig. 6(b), where Obj 1 and Obj 5 share four faces, and so do Obj 2 and Obj 6. Obviously, Obj 5 and Obj 6 are not necessary. Next we define object coexistence and a rule to choose objects.
Definition 4. Two objects are called coexistent if they share no face or share only coplanar faces.
Rule 1. Choose a subset of SMEFS such that in the subset, all the objects are coexistent and the number of total faces is maximized.
From Definition 4, Obj 1 and Obj 5 are not coexistent in Fig. 6, and Obj 2 and Obj 6 are not either. If Obj 5 and Obj 6 are kept with Obj 1 and Obj 2 discarded, many faces in the original object will be missing. Rule 1 guarantees that Obj 1 and Obj 2 are kept but not Obj 5 and Obj 6.
Algorithm 2 Decomposition of a general line drawing
Input: A line drawing G = (V, E, F).
Initialization: SMEFS = ∅, SMWC = ∅.
1. for each pair of neighboring faces {fa, fb} in F do
2. Call Algorithm 1 with Finitial = {fa, fb} and Vinitial = Ver(fa) ∪ Ver(fb);
3. if the returned Ffixed from Algorithm 1 contains more than two faces do
4. SMEFS ← SMEFS ∪ {Ffixed};
5. Construct the object-coexistence graph Gobj with SMEFS;
6. SMWC ← the maximum weight clique found from Gobj;
7. for each face f not contained in SMWC do
8. Attach f to the object in SMWC that contains the maximum number of the vertices of f;
Return SMWC.
Fig. 7. (a) A sheet object with 23 faces. (b) Decomposition result by Algorithm 2 with the modification in Algorithm 1.
We formulate Rule 1 as a maximum weight clique problem (MWCP), which is to find a clique of the maximum weight from a weighted graph (a clique is a subgraph of a graph such that every two vertices in the subgraph are connected by an edge). First, we construct a weighted graph, called the object-coexistence graph, in which a vertex denotes an object in SMEFS and there is an edge connecting two vertices if the two objects represented by the two vertices are coexistent. Besides, each vertex is assigned a weight equal to the number of the faces of the corresponding object. The MWCP is a well-known NP-hard problem. In our application, however, solving this problem is fast enough since an object-coexistence graph usually has fewer than 20 objects (vertices). We use the algorithm in [12] to deal with this problem.
Fig. 6(c) is the object-coexistence graph constructed from the six objects in Fig. 6(b), where the weights of the vertices are denoted by the numbers in the parentheses. The maximum weight clique is shown in bold. From Fig. 6, we see that the face (14, 13, 26, 25) is not contained in SMWC, which is used to store the objects in the maximum weight clique. This face is finally attached to Obj 3. In general, each of the faces not in SMWC is attached to an object that contains the maximum number of the vertices of this face. If there are two or more objects that contain the same number of the vertices of this face, this face is assigned to any of them.
Algorithm 2 shows the complete algorithm to decompose a general line drawing. Steps 7 and 8 attach the faces not in SMWC to some objects in SMWC.
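Steps 5 and 6 of Algorithm 2 can be illustrated with an exhaustive search, which is practical only because the object-coexistence graph is small. The sketch below is not the algorithm of [12]; it simply enumerates subsets. Objects are modeled as sets of face identifiers, the vertex weight is the number of faces, and the coexistence test is passed in as a function. The example predicate "share at most one face" is a stand-in assumption for Definition 4, since the real test (shared faces must be coplanar) needs geometry that this toy data does not carry. All names and data are hypothetical.

```python
from itertools import combinations


def max_weight_clique(objects, coexistent):
    """Return (indices, weight) of a maximum weight clique by brute force.

    objects    -- list of sets of face ids; the weight of an object is its face count
    coexistent -- function(obj_a, obj_b) -> bool, the edge test of the graph
    """
    best, best_weight = [], -1
    n = len(objects)
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            # A clique requires every pair in the subset to be coexistent.
            if all(coexistent(objects[i], objects[j])
                   for i, j in combinations(subset, 2)):
                weight = sum(len(objects[i]) for i in subset)
                if weight > best_weight:
                    best, best_weight = list(subset), weight
    return best, best_weight


def shares_at_most_one_face(a, b):
    """Stand-in for Definition 4 (assumption: one shared face is a coplanar touch)."""
    return len(a & b) <= 1


# Toy MEFSs: object 2 overlaps object 0 in three faces, so they cannot coexist.
OBJECTS = [
    {1, 2, 3, 4, 5},
    {6, 7, 8, 9},
    {1, 2, 3, 10},
    {11, 12, 13},
]
CLIQUE, WEIGHT = max_weight_clique(OBJECTS, shares_at_most_one_face)
```

The search visits every subset, so its cost grows as 2^n; a dedicated MWCP solver would be the natural replacement for larger graphs.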
A common complex object usually consists of primitive solids and sheets, and Algorithm 2 works well for the decomposition of most complex line drawings. However, there are still some line drawings the algorithm cannot deal with. Such an example is shown in Fig. 7(a) which is a sheet object with 23 faces. In Algorithm 1, with any pair of initial neighboring faces, there is no other face satisfying the condition in step 3, thus no object of DoF 4 will be found. The following scheme can solve this problem.
Given a line drawing, steps 1–6 in Algorithm 2 are used to decompose it into multiple objects of DoF 4. If there are separate groups of faces not in SMWC, where the faces in each group are connected, then attach the groups each with less than four faces to some objects in SMWC (see footnote 3; the attachment method is similar to steps 7 and 8 in Algorithm 2). For a group with four or more connected faces, Algorithm 2 is applied to it with a minor modification in Algorithm 1. The modification is to set Finitial to contain three connected faces whose combined DoF is 5. This modification allows the search of objects of DoF 5. Suppose the object in Fig. 7(a) is such a group. Applying Algorithm 2 to it with the minor modification generates the decomposition result as shown in Fig. 7(b).
2.3. 3D Reconstruction
A complex line drawing can be decomposed into several simpler ones using the method proposed in Sections 2.1 and 2.2. The next step is to reconstruct a 3D object from each of them, which is an easy task because the decomposed line drawings are simple. The method in [6] or [7] can carry out this task very well. We use the one in [6] for our work with the objective function Φ(z1, z2, ..., zNv) constructed by these five image regularities: MSDA, face planarity, line parallelism, isometry, and corner orthogonality. The details of the regularities can be found in [6].
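As an illustration of how such an objective is assembled, the Python sketch below implements one regularity, MSDA, and the weighted sum of Eq. (1). It assumes that the angles in MSDA are those between every pair of edges meeting at a vertex, measured in 3D; the other four regularities and the weights used in [6] are not reproduced here. The function names and the rotated unit cube used as test data are hypothetical.

```python
import math
from itertools import combinations
from statistics import pstdev


def msda(xy, z, edges):
    """Standard deviation of the 3D angles between edge pairs meeting at a vertex."""
    pts = {v: (xy[v][0], xy[v][1], z[v]) for v in xy}
    neighbors = {v: [] for v in xy}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    angles = []
    for v, ns in neighbors.items():
        for a, b in combinations(ns, 2):
            u = [pts[a][k] - pts[v][k] for k in range(3)]
            w = [pts[b][k] - pts[v][k] for k in range(3)]
            dot = sum(p * q for p, q in zip(u, w))
            norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in w))
            angles.append(math.acos(max(-1.0, min(1.0, dot / norm))))
    return pstdev(angles)


def objective(z, regularities, weights):
    """Eq. (1): weighted sum of regularity terms evaluated at the depths z."""
    return sum(w * phi(z) for w, phi in zip(weights, regularities))


def rotate(p, ax=0.6, ay=0.4):
    """Rotate a 3D point about the x-axis, then the y-axis (a generic view)."""
    x, y, z = p
    y, z = y * math.cos(ax) - z * math.sin(ax), y * math.sin(ax) + z * math.cos(ax)
    x, z = x * math.cos(ay) + z * math.sin(ay), -x * math.sin(ay) + z * math.cos(ay)
    return x, y, z


# Test data: a unit cube in a generic view, projected orthogonally onto the xy-plane.
CUBE = {i: rotate((i & 1, (i >> 1) & 1, (i >> 2) & 1)) for i in range(8)}
EDGES = [(i, j) for i, j in combinations(range(8), 2) if bin(i ^ j).count("1") == 1]
XY = {i: (p[0], p[1]) for i, p in CUBE.items()}
Z_TRUE = {i: p[2] for i, p in CUBE.items()}
Z_FLAT = {i: 0.0 for i in CUBE}

PHI_TRUE = objective(Z_TRUE, [lambda z: msda(XY, z, EDGES)], [1.0])
PHI_FLAT = objective(Z_FLAT, [lambda z: msda(XY, z, EDGES)], [1.0])
```

With the true depths every corner angle of the cube is a right angle, so the MSDA term is essentially zero, whereas the flat interpretation (all depths zero) gives a clearly larger value. A minimizer of Φ over the depths would therefore prefer the cube.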
After obtaining the 3D objects from all the decomposed line drawings, the next step is to merge them to form one complex object. When merging two 3D objects, since they are reconstructed separately, the gluing parts (face or edge) of them are usually not of the same size. Then one object is automatically rescaled according to the sizes of the two gluing parts, and the vertices of the gluing part of this object are also adjusted so that the two parts are the same. After merging all the 3D objects, the whole object is fine-tuned by minimizing the objective function Φ on the object.
We can also apply our method to reconstruct 3D shapes from objects in images. First, the user draws a line drawing along the visible edges of an object and he/she can also guess (draw) the hidden edges. Then from this line drawing, our approach described above reconstructs the 3D geometry of the object in the image.
Footnote 3: The reason to attach a group with less than four faces to an object in SMWC is that this group is small and does not need to be reconstructed as an independent object.
3. Experimental Results
In this section, we show a number of complex 3D reconstruction examples from both line drawings and images to demonstrate the performance of our approach.
The first set of experiments in Fig. 8 has nine complex line drawings. Fig. 8(a) is a manifold, and the others are non-manifold solids or non-solids. The decompositions of the line drawings are also given in the figure, from which we can see that the results are in accordance with our visual perception very well. All the primitive solids are found by our algorithm. It is the successful decompositions that make the 3D reconstructions from these complex line drawings possible. The expected satisfactory reconstruction results are shown also in Fig. 8 each in two views.
Fig. 9 shows another set of 3D reconstructions from objects in images with line drawings drawn on the objects.
The decomposition results are omitted due to the space limitation. Each reconstruction result obtained by our algorithm is shown in two views with the texture from the image mapped onto the surface. We can see that the results are very good. The details of the objects and the line drawings can be shown by enlarging the figures on the screen.
Among all the previous algorithms for general object reconstruction, the one in [7] can deal with the most complex objects. Due to the local minimum problem in a high dimensional search space, however, this algorithm cannot handle line drawings as complex as those in Figs. 8 and 9. For example, Fig. 10(a) shows its reconstruction result from the line drawing in Fig. 8(c), which is a failure.
The reader may wonder what happens if the 3D reconstruction is based on an arbitrary decomposition of a complex line drawing, instead of the proposed one. Fig. 10(b) shows such a decomposition from Fig. 8(c). Based on this decomposition, the 3D reconstruction result obtained by the scheme described in Section 2.3 is given in Fig. 10(c), which is a failure. The failure is caused by two reasons: (i) An arbitrary decomposition usually does not generate common objects, which makes the image regularities less meaningful for the 3D reconstruction. (ii) The gluing of 3D objects from the decomposition in Fig. 10(b) is difficult because of the irregular touches between the objects. The fine-tuning processing (see Section 2.3) cannot reduce the large distortion to an acceptable result.
Note that since our algorithm is not limited to manifolds, it can deal with line drawings with some or without hidden lines. The third line drawing in Fig. 9 is an example where some hidden lines are not drawn.
Fig. 8. Nine complex line drawings, their decompositions, and 3D reconstruction results in two views where different colors are used to denote the faces (better viewed on the screen).
Fig. 9. Four images, the corresponding line drawings, and the reconstructed 3D objects with texture mapped, each shown in two views. The details can be seen by enlarging the figures on the screen.
Fig. 10. (a) A failed reconstruction by the algorithm in [7]. (b) An arbitrary decomposition of the line drawing in Fig. 8(c) without using our decomposition method. (c) Failed 3D reconstruction based on the decomposition in (b).
Fig. 11. (a) A line drawing with strong sketching errors. (b)(c) Two views of the successful reconstruction result by our algorithm.
Most of the line drawings in this paper look tidy. This is for easy observation of the objects. In fact, our algorithm is not sensitive to sketching errors. Take Fig. 8(a) as an example and assume it is an accurate projection of the 3D object. Then, random variations are generated with the Gaussian distribution N(0, σ²) on the 2D locations of the vertices. Fig. 11(a) is a resulting noisy line drawing with σ = W/200 where W is the width of the line drawing in Fig. 8(a). From Fig. 11, we see that even for this line drawing with strong sketching errors, our algorithm can still obtain a good reconstruction result.
Our algorithm is implemented in C++. The computational time includes two parts: line drawing decomposition and 3D reconstruction. The main computation is consumed by the second part. On average, a common PC takes about one minute to obtain the reconstruction from each of the line drawings in Figs. 8 and 9.
4. Conclusion
Previous algorithms of 3D object reconstruction from line drawings either deal with simple general objects or are limited to only manifolds (a subset of solids). In this paper, we have proposed a novel approach that can handle complex general objects, including manifolds, non-manifold solids, and non-solids. It decomposes a complex line drawing into simpler ones according to the degree of freedom of objects, which is based on the developed 3D object properties. After 3D objects are reconstructed from the decomposed line drawings, they are merged to form a complex object.
We have shown a number of reconstruction examples with comparison to the best previous algorithm. The results indicate that our algorithm can tackle much more complex line drawings of general objects and is insensitive to sketching errors. Future work includes (i) the correction of the distortions of 3D objects reconstructed from images caused by the perspective projection, and (ii) the extension of this work to objects with curved faces.

Acknowledgements

This work was supported by grants from the Natural Science Foundation of China (No. 61070148), the Science, Industry, Trade, and Information Technology Commission of Shenzhen Municipality, China (No. JC201005270378A), and the Guangdong Innovative Research Team Program (No. 201001D0104648280). Jianzhuang Liu is the corresponding author.

References

[1] M. Clowes. On seeing things. Artificial Intelligence, 2:79–116, 1971.
[2] P. Debevec, C. Taylor, and J. Malik. Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach. Proc. ACM SIGGRAPH, pages 11–20, 1996.
[3] D. Jelinek and C. Taylor. Reconstruction of linearly parameterized models from single images with a camera of unknown focal length. IEEE T-PAMI, 23(7):767–773, 2001.
[4] D. E. LaCourse. Handbook of Solid Modeling. McGraw-Hill, 1995.
[5] Y. Leclerc and M. Fischler. An optimization-based approach to the interpretation of single line drawings as 3D wire frames. IJCV, 9(2):113–136, 1992.
[6] H. Lipson and M. Shpitalni. Optimization-based reconstruction of a 3d object from a single freehand line drawing. Computer-Aided Design, 28(7):651–663, 1996.
[7] J. Liu, L. Cao, Z. Li, and X. Tang. Plane-based optimization for 3D object reconstruction from single line drawings. IEEE T-PAMI, 30(2):315–327, 2008.
[8] J. Liu, Y. Chen, and X. Tang. Decomposition of complex line drawings with hidden lines for 3d planar-faced manifold object reconstruction. IEEE T-PAMI, 33(1):3–15, 2011.
[9] J. Liu, Y. Lee, and W. Cham. Identifying faces in a 2D line drawing representing a manifold object. IEEE T-PAMI, 24(12):1579–1593, 2002.
[10] J. Liu and X. Tang. Evolutionary search for faces from line drawings. IEEE T-PAMI, 27(6):861–872, 2005.
[11] T. Marill. Emulating the human interpretation of line-drawings as three-dimensional objects. IJCV, 6(2):147–161, 1991.
[12] P. R. J. Östergård. A new algorithm for the maximum-weight clique problem. Nordic J. of Computing, 8(4):424–436, Dec. 2001.
[13] H. Shimodaira. A shape-from-shading method of polyhedral objects using prior information. IEEE T-PAMI, 28(4):612–624, 2006.
[14] I. Shimshoni and J. Ponce. Recovering the shape of polyhedra using line-drawing analysis and complex reflectance models. Computer Vision and Image Understanding, 65(2):296–310, 1997.
[15] K. Shoji, K. Kato, and F. Toyama. 3-d interpretation of single line drawings based on entropy minimization principle. CVPR, 2001.
[16] M. Shpitalni and H. Lipson. Identification of faces in a 2d line drawing projection of a wireframe object. IEEE T-PAMI, 18(10), 1996.
[17] K. Sugihara. Machine interpretation of line drawings. MIT Press, 1986.
[18] A. Turner, D. Chapman, and A. Penn. Sketching space. Computer and Graphics, 24:869–879, 2000.
[19] F. Ulupinar and R. Nevatia. Shape from contour: straight homogeneous generalized cylinders and constant cross-section generalized cylinders. IEEE T-PAMI, 17(2):120–135, 1995.
[20] T. Xue, J. Liu, and X. Tang. Example-based 3d object reconstruction from line drawings. CVPR, 2012.
same-paper 2 0.85734856 437 iccv-2013-Unsupervised Random Forest Manifold Alignment for Lipreading
Author: Yuru Pei, Tae-Kyun Kim, Hongbin Zha
Abstract: Lipreading from visual channels remains a challenging topic considering the various speaking characteristics. In this paper, we address an efficient lipreading approach by investigating the unsupervised random forest manifold alignment (RFMA). The density random forest is employed to estimate affinity of patch trajectories in speaking facial videos. We propose novel criteria for node splitting to avoid the rank-deficiency in learning density forests. By virtue of the hierarchical structure of random forests, the trajectory affinities are measured efficiently, which are used to find embeddings of the speaking video clips by a graph-based algorithm. Lipreading is formulated as matching between manifolds of query and reference video clips. We employ the manifold alignment technique for matching, where the L∞ norm-based manifold-to-manifold distance is proposed to find the matching pairs. We apply this random forest manifold alignment technique to various video data sets captured by consumer cameras. The experiments demonstrate that lipreading can be performed effectively and that the approach outperforms the state of the art.
3 0.85246176 267 iccv-2013-Model Recommendation with Virtual Probes for Egocentric Hand Detection
Author: Cheng Li, Kris M. Kitani
Abstract: Egocentric cameras can be used to benefit such tasks as analyzing fine motor skills, recognizing gestures and learning about hand-object manipulation. To enable such technology, we believe that the hands must be detected on the pixel level to gain important information about the shape of the hands and fingers. We show that the problem of pixel-wise hand detection can be effectively solved by posing the problem as a model recommendation task. As such, the goal of a recommendation system is to recommend the n-best hand detectors based on the probe set, a small amount of labeled data from the test distribution. This requirement of a probe set is a serious limitation in many applications, such as egocentric hand detection, where the test distribution may be continually changing. To address this limitation, we propose the use of virtual probes which can be automatically extracted from the test distribution. The key idea is that many features, such as the color distribution or relative performance between two detectors, can be used as a proxy to the probe set. In our experiments we show that the recommendation paradigm is well-equipped to handle complex changes in the appearance of the hands in first-person vision. In particular, we show how our system is able to generalize to new scenarios by testing our model across multiple users.
4 0.83681691 95 iccv-2013-Cosegmentation and Cosketch by Unsupervised Learning
Author: Jifeng Dai, Ying Nian Wu, Jie Zhou, Song-Chun Zhu
Abstract: Cosegmentation refers to the problem of segmenting multiple images simultaneously by exploiting the similarities between the foreground and background regions in these images. The key issue in cosegmentation is to align common objects between these images. To address this issue, we propose an unsupervised learning framework for cosegmentation, by coupling cosegmentation with what we call “cosketch”. The goal of cosketch is to automatically discover a codebook of deformable shape templates shared by the input images. These shape templates capture distinct image patterns and each template is matched to similar image patches in different images. Thus the cosketch of the images helps to align foreground objects, thereby providing crucial information for cosegmentation. We present a statistical model whose energy function couples cosketch and cosegmentation. We then present an unsupervised learning algorithm that performs cosketch and cosegmentation by energy minimization. Experiments show that our method outperforms state-of-the-art methods for cosegmentation on the challenging MSRC and iCoseg datasets. We also illustrate our method on a new dataset called Coseg-Rep where cosegmentation can be performed within a single image with repetitive patterns.
5 0.83184421 197 iccv-2013-Hierarchical Joint Max-Margin Learning of Mid and Top Level Representations for Visual Recognition
Author: Hans Lobel, René Vidal, Alvaro Soto
Abstract: Currently, Bag-of-Visual-Words (BoVW) and part-based methods are the most popular approaches for visual recognition. In both cases, a mid-level representation is built on top of low-level image descriptors and top-level classifiers use this mid-level representation to achieve visual recognition. While in current part-based approaches, mid- and top-level representations are usually jointly trained, this is not the usual case for BoVW schemes. A main reason for this is the complex data association problem related to the usual large dictionary size needed by BoVW approaches. As a further observation, typical solutions based on BoVW and part-based representations are usually limited to extensions of binary classification schemes, a strategy that ignores relevant correlations among classes. In this work we propose a novel hierarchical approach to visual recognition based on a BoVW scheme that jointly learns suitable mid- and top-level representations. Furthermore, using a max-margin learning framework, the proposed approach directly handles the multiclass case at both levels of abstraction. We test our proposed method using several popular benchmark datasets. As our main result, we demonstrate that, by coupling learning of mid- and top-level representations, the proposed approach fosters sharing of discriminative visual words among target classes, being able to achieve state-of-the-art recognition performance using far fewer visual words than previous approaches.
6 0.82600695 10 iccv-2013-A Framework for Shape Analysis via Hilbert Space Embedding
7 0.80319023 404 iccv-2013-Structured Forests for Fast Edge Detection
8 0.79759026 448 iccv-2013-Weakly Supervised Learning of Image Partitioning Using Decision Trees with Structured Split Criteria
9 0.79593611 426 iccv-2013-Training Deformable Part Models with Decorrelated Features
10 0.79516351 229 iccv-2013-Large-Scale Video Hashing via Structure Learning
11 0.79399431 322 iccv-2013-Pose Estimation and Segmentation of People in 3D Movies
12 0.79347169 71 iccv-2013-Category-Independent Object-Level Saliency Detection
13 0.79310513 47 iccv-2013-Alternating Regression Forests for Object Detection and Pose Estimation
14 0.7908318 238 iccv-2013-Learning Graphs to Match
15 0.79060602 132 iccv-2013-Efficient 3D Scene Labeling Using Fields of Trees
17 0.7886948 126 iccv-2013-Dynamic Label Propagation for Semi-supervised Multi-class Multi-label Classification
18 0.7885164 137 iccv-2013-Efficient Salient Region Detection with Soft Image Abstraction
19 0.78845304 111 iccv-2013-Detecting Dynamic Objects with Multi-view Background Subtraction
20 0.78807354 351 iccv-2013-Restoring an Image Taken through a Window Covered with Dirt or Rain