nips nips2000 nips2000-30 knowledge-graph by maker-knowledge-mining

30 nips-2000-Bayesian Video Shot Segmentation


Source: pdf

Author: Nuno Vasconcelos, Andrew Lippman

Abstract: Prior knowledge about video structure can be used both as a means to improve the performance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulation is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Bayesian video shot segmentation Nuno Vasconcelos Andrew Lippman MIT Media Laboratory, 20 Ames St, E15-354, Cambridge, MA 02139, {nuno,lip}@media.mit.edu [sent-1, score-1.063]

2 Abstract Prior knowledge about video structure can be used both as a means to improve the performance of content analysis and to extract features that allow semantic classification. [sent-6, score-0.342]

3 We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. [sent-7, score-1.774]

4 The new formulation is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy. [sent-8, score-0.288]

5 1 Introduction Given the recent advances in video coding and streaming technology and the pervasiveness of video as a form of communication, there is currently a strong interest in the development of techniques for browsing, categorizing, retrieving and automatically summarizing video. [sent-9, score-0.404]

6 In this context, two tasks are of particular relevance: the decomposition of a video stream into its component units, and the extraction of features for the automatic characterization of these units. [sent-10, score-0.315]

7 Significant progress can only be attained by a deeper understanding of the relationship between the message conveyed by the video and the patterns of visual structure that it exhibits. [sent-13, score-0.249]

8 For example, it is well known by film theorists that the message strongly constrains the stylistic elements of the video [1, 6], which are usually grouped into two major categories: the elements of montage and the elements of mise-en-scene. [sent-15, score-0.435]

9 Montage refers to the temporal structure, namely the aspects of film editing, while mise-en-scene deals with spatial structure, i.e. [sent-16, score-0.098]

10 the composition of each image, and includes variables such as the type of set in which the scene develops, the placement of the actors, aspects of lighting, focus, camera angles, and so on. [sent-18, score-0.028]

11 On the other hand, it will provide constraints for the low-level analysis algorithms required to perform tasks such as video segmentation, keyframing, and so on. [sent-20, score-0.202]

12 The first point is illustrated by Figure 1 where we show how a collection of promotional trailers for commercially released feature films populates a 2-D feature space based on the most elementary characterization of montage and mise-en-scene: average shot duration vs. [sent-21, score-1.122]

13 Despite the coarseness of this characterization, it captures aspects that are important for semantic movie classification: close inspection of the genre assigned to each movie by the Motion Picture Association of America reveals that in this space the movies cluster by genre! [sent-23, score-0.268]

14 The genre of each movie is identified by the symbol used to represent the movie in the plot. [sent-28, score-0.213]

15 how the structure exhibited by Figure 1 can be exploited to improve the performance of low-level processing tasks such as shot segmentation. [sent-31, score-0.754]

16 Because knowledge about the video structure is a form of prior knowledge, Bayesian procedures provide a natural way to accomplish this goal. [sent-32, score-0.285]

17 We therefore introduce computational models for shot duration and activity and develop a Bayesian framework for segmentation that is shown to significantly outperform current approaches. [sent-33, score-1.088]

18 2 Modeling shot duration Because shot boundaries can be seen as arrivals over discrete, non-overlapping temporal intervals, a Poisson process seems an appropriate model for shot duration [3]. [sent-34, score-2.578]

19 However, events generated by Poisson processes have inter-arrival times characterized by the exponential density, which is a monotonically decreasing function of time. [sent-35, score-0.145]

20 This is clearly not the case for the shot duration, as can be seen from the histograms of Figure 2. [sent-36, score-0.762]

21 2.1 The Erlang model Letting T be the time since the previous boundary, the Erlang distribution [3] is described by $\mathcal{E}_{r,\lambda}(T) = \frac{\lambda^{r} T^{r-1}}{(r-1)!}\, e^{-\lambda T}$. (1) ¹The activity features are described in section 3. [sent-39, score-0.119]

22 Figure 2: Shot duration histogram, and maximum likelihood fit obtained with the Erlang (left) and Weibull (right) distributions. [sent-40, score-0.219]

23 It is a generalization of the exponential density, characterized by two parameters: the order r, and the expected inter-arrival time ($1/\lambda$) of the underlying Poisson process. [sent-41, score-0.044]

24 When $r = 1$, the Erlang distribution becomes the exponential distribution. [sent-42, score-0.023]

25 For larger values of r, it characterizes the rth-order inter-arrival time of the Poisson process. [sent-43, score-0.042]

26 This leads to an intuitive explanation for the use of the Erlang distribution as a model of shot duration: for a given order r, the shot is modeled as a sequence of r events which are themselves the outcomes of Poisson processes. [sent-44, score-1.539]

27 Such events may reflect properties of the shot content, such as "setting the context" through a wide angle view followed by "zooming in on the details" when $r = 2$, or "emotional buildup" followed by "action" and "action outcome" when $r = 3$. [sent-45, score-0.773]

28 Figure 2 presents a shot duration histogram, obtained from the training set to be described in section 5, and its maximum likelihood (ML) Erlang fit. [sent-46, score-0.933]
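
This fit is straightforward to reproduce numerically. Below is a minimal sketch (not the authors' code), assuming `durations` is a NumPy array of shot lengths in seconds: for a fixed order r the ML rate is λ = r / mean(T), so a small search over integer orders suffices.

```python
# Hedged sketch: ML Erlang fit by searching over the integer order r.
# For each r the ML rate is lam = r / mean(T); the (r, lam) pair with
# the highest log-likelihood is kept.
import numpy as np
from scipy.special import gammaln

def fit_erlang(durations, max_order=10):
    T = np.asarray(durations, dtype=float)
    best = None
    for r in range(1, max_order + 1):
        lam = r / T.mean()  # ML rate for this order
        # log E_{r,lam}(T) = r*log(lam) + (r-1)*log(T) - lam*T - log((r-1)!)
        ll = np.sum(r * np.log(lam) + (r - 1) * np.log(T)
                    - lam * T - gammaln(r))
        if best is None or ll > best[0]:
            best = (ll, r, lam)
    return best[1], best[2]  # order r, rate lambda
```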

29 2.2 The Weibull model While the Erlang model provides a good fit to the empirical density, it is of limited practical utility due to the constant arrival rate assumption [5] inherent to the underlying Poisson process. [sent-48, score-0.132]

30 Because $\lambda$ is a constant, the expected rate of occurrence of a new shot boundary is the same whether 10 seconds or 1 hour have elapsed since the occurrence of the previous one. [sent-49, score-0.9]

31 This limitation is avoided by the Weibull distribution, $w_{\alpha,\beta}(T) = \frac{\alpha T^{\alpha-1}}{\beta^{\alpha}}\, e^{-\frac{T^{\alpha}}{\beta^{\alpha}}}$. (2) Figure 2 presents the ML Weibull fit to the shot duration histogram. [sent-51, score-0.946]

32 Once again we obtain a good approximation to the empirical density estimate. [sent-52, score-0.032]
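
The Weibull fit can be obtained just as easily; a sketch assuming SciPy's `weibull_min` with the location pinned at zero, whose shape and scale correspond to α and β in (2):

```python
# Hedged sketch: ML Weibull fit. With floc=0, weibull_min has density
# (a/b) * (T/b)**(a - 1) * exp(-(T/b)**a), i.e. shape a = alpha and
# scale b = beta in the notation of equation (2).
from scipy.stats import weibull_min

alpha, _, beta = weibull_min.fit(durations, floc=0)
```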

33 3 Modeling shot activity The color histogram distance has been widely used as a measure of (dis)similarity between images for the purposes of object recognition [7], content-based retrieval [4], and temporal video segmentation [2]. [sent-53, score-1.318]

34 A histogram is first computed for each image in the sequence and the distance between successive histograms is used as a measure of local activity. [sent-54, score-0.181]

35 A standard metric for video segmentation [2] is the $L_1$ norm of the histogram difference, $\mathcal{V}(a, b) = \sum_{i=1}^{B} |a_i - b_i|$, (3) where a and b are histograms of successive frames, and B the number of histogram bins. [sent-55, score-0.593]
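
A sketch of this feature computation, assuming `frames` is an iterable of HxWx3 uint8 images (the function name and bin count are illustrative, not from the paper):

```python
# Hedged sketch: per-frame color histograms and the L1 distance of (3)
# between successive frames, used as the local activity measure.
import numpy as np

def l1_activity(frames, bins=8):
    prev = None
    for frame in frames:
        hist, _ = np.histogramdd(frame.reshape(-1, 3),
                                 bins=(bins,) * 3,
                                 range=((0, 256),) * 3)
        hist = hist.ravel() / hist.sum()     # normalize to a distribution
        if prev is not None:
            yield np.abs(hist - prev).sum()  # sum_i |a_i - b_i|
        prev = hist
```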

36 Statistical modeling of the histogram distance features requires the identification of the various states through which the video may progress. [sent-56, score-0.327]

37 For simplicity, in this work we restrict ourselves to a video model composed of two states: "regular frames" (S = 0) and "shot transitions" (S = 1). [sent-57, score-0.202]

38 As illustrated by Figure 3, for "regular frames" the distribution is asymmetric about the mean, always positive and concentrated near zero. [sent-59, score-0.027]

39 This suggests that a mixture of Erlang distributions is an appropriate model for this state, a suggestion that is confirmed by the fit to the empirical density obtained with EM, also depicted in the figure. [sent-60, score-0.092]

40 On the other hand, for "shot transitions" the fit obtained with a simple Gaussian model is sufficient to achieve a reasonable approximation to the empirical density. [sent-61, score-0.038]

41 In both cases, a uniform mixture component is introduced to account for the tails of the distributions. [sent-62, score-0.022]

42 Figure 3: Left: Conditional activity histogram for regular frames, and best fit by a mixture with three Erlang and a uniform component. [sent-63, score-0.256]

43 Right: Conditional activity histogram for shot transitions, and best fit by a mixture with a Gaussian and a uniform component. [sent-64, score-0.957]
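
For concreteness, a sketch of the two state-conditional activity densities, with all mixture parameters assumed already estimated (e.g. by EM); every numeric value below is a placeholder, not a fitted value from the paper:

```python
# Hedged sketch: activity likelihoods p(d | S). Regular frames (S = 0):
# a mixture of three Erlangs plus a uniform tail; shot transitions
# (S = 1): a Gaussian plus a uniform tail. All parameters are placeholders.
from scipy.stats import erlang, norm, uniform

D_MAX = 2.0  # largest possible L1 distance between normalized histograms

def p_activity(d, shot_transition):
    if shot_transition:
        return 0.95 * norm.pdf(d, loc=1.0, scale=0.2) \
             + 0.05 * uniform.pdf(d, loc=0.0, scale=D_MAX)
    mixture = sum(w * erlang.pdf(d, r, scale=s)
                  for w, r, s in [(0.50, 1, 0.05),
                                  (0.30, 2, 0.05),
                                  (0.15, 3, 0.05)])
    return mixture + 0.05 * uniform.pdf(d, loc=0.0, scale=D_MAX)
```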

44 Extensive evaluation of various approaches has shown that simple thresholding of histogram distances performs surprisingly well and is difficult to beat [2]. [sent-66, score-0.192]

45 It is well known that standard thresholding is a particular case of this formulation, in which both conditional densities are assumed to be Gaussians with the same covariance. [sent-68, score-0.094]
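
This reduction is a standard identity, spelled out here for completeness: with equal-variance Gaussian class conditionals $p(d \mid S = i) = \mathcal{N}(d; \mu_i, \sigma^2)$, the log-likelihood ratio is $\log \frac{p(d \mid S = 1)}{p(d \mid S = 0)} = \frac{\mu_1 - \mu_0}{\sigma^2}\, d + \frac{\mu_0^2 - \mu_1^2}{2\sigma^2}$, which is monotone in d, so comparing the likelihood ratio to any constant is equivalent to comparing the histogram distance d itself against a fixed threshold.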

46 One further limitation of the thresholding model is that it does not take into account the fact that the likelihood of a new shot transition is dependent on how much time has elapsed since the previous one. [sent-70, score-0.965]

47 On the other hand, the statistical formulation can easily incorporate the shot duration models developed in section 2. [sent-71, score-0.913]

48 4.1 Notation Because video is a discrete process, characterized by a given frame rate, shot boundaries are not instantaneous, but last for one frame period. [sent-73, score-1.051]

49 To account for this, states are defined over time intervals, i.e. [sent-74, score-0.021]

50 instead of $S_t = 0$ or $S_t = 1$, we have $S_{t,t+\delta} = 0$ or $S_{t,t+\delta} = 1$, where t is the start of a time interval, and $\delta$ its duration. [sent-76, score-0.021]

51 We designate the features observed during the interval $[t, t+\delta]$ by $V_{t,t+\delta}$. To simplify the notation, we reserve t for the temporal instant at which the last shot boundary has occurred and make all temporal indexes relative to this instant. [sent-77, score-1.059]

52 Furthermore, we reserve the symbol $\delta$ for the duration of the interval between successive frames (inverse of the frame rate), and use the same notation for a simple frame interval and a vector of frame intervals (the temporal indexes being themselves enough to avoid ambiguity). [sent-81, score-0.657]

53 Thus, while $S_{\tau,\tau+\delta} = 0$ indicates that no shot boundary is present in the interval $[t+\tau, t+\tau+\delta]$, $S_{\tau+\delta} = \mathbf{0}$ indicates that no shot boundary has occurred in any of the frames between t and $t+\tau+\delta$. [sent-84, score-1.781]

54 In this expression, while the first term on the right hand side is the ratio of the conditional likelihoods of activity given the state sequence, the second term is simply the ratio of probabilities that there may (or not) be a shot transition $\tau$ units of time after the previous one. [sent-89, score-0.888]

55 Hence, the shot duration density becomes a prior for the segmentation process. [sent-90, score-1.085]

56 This is intuitive since knowledge about the shot duration is a form of prior knowledge about the structure of the video that should be used to favor segmentations that are more plausible. [sent-91, score-1.228]

57 The optimal answer to the question of whether a shot change occurs or not in $[t+\tau, t+\tau+\delta]$ is thus to declare that a boundary exists if $\log \frac{P(V_{\tau,\tau+\delta} \mid S_{\tau,\tau+\delta} = 1)}{P(V_{\tau,\tau+\delta} \mid S_{\tau,\tau+\delta} = 0)} > \mathcal{T}(\tau)$, (7) where $\mathcal{T}(\tau) = \log \frac{\int_{\tau+\delta}^{\infty} p(\alpha)\, d\alpha}{\int_{\tau}^{\tau+\delta} p(\alpha)\, d\alpha}$, (6) and that there is no boundary otherwise. [sent-93, score-0.922]

58 Comparing this with (4), it is clear that the inclusion of the shot duration prior transforms the fixed thresholding approach into an adaptive one, where the threshold depends on how much time has elapsed since the previous shot boundary. [sent-94, score-1.944]
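
A sketch of this adaptive threshold for an arbitrary duration prior p, under the decision rule (7) as reconstructed above (the function name is illustrative):

```python
# Hedged sketch: the duration-dependent threshold T(tau) of (7) as the
# log-ratio of (i) the probability that the shot survives past tau + delta
# to (ii) the probability that it ends inside [tau, tau + delta].
import numpy as np
from scipy.integrate import quad

def bayes_threshold(p, tau, delta):
    survive, _ = quad(p, tau + delta, np.inf)
    end_now, _ = quad(p, tau, tau + delta)
    return np.log(survive / end_now)
```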

59 1 The Erlang model It can be shown that, under the Erlang assumption, $\int_{\tau}^{\infty} \mathcal{E}_{r,\lambda}(\alpha)\, d\alpha = \frac{1}{\lambda} \sum_{i=1}^{r} \mathcal{E}_{i,\lambda}(\tau)$ (8) and the threshold of (7) becomes $\mathcal{T}_{\mathcal{E}}(\tau) = -\log\left[\frac{\sum_{i=1}^{r} \mathcal{E}_{i,\lambda}(\tau) - \sum_{i=1}^{r} \mathcal{E}_{i,\lambda}(\tau+\delta)}{\sum_{i=1}^{r} \mathcal{E}_{i,\lambda}(\tau+\delta)}\right]$. (9) [sent-97, score-0.095]

60 Its variation over time is presented in Figure 4. [sent-101, score-0.021]

61 While in the initial segment of the shot the threshold is large and shot changes are unlikely to be accepted, the threshold decreases as the scene progresses, increasing the likelihood that shot boundaries will be declared. [sent-102, score-1.74]
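
Numerically, the Erlang case needs no integration, because the Erlang survival function is a Poisson tail sum; a sketch consistent with the reconstruction of (8)-(9) above:

```python
# Hedged sketch: closed-form Erlang threshold. P(T > t) for an
# Erlang(r, lam) duration equals P(Poisson(lam * t) <= r - 1).
import numpy as np
from scipy.stats import poisson

def erlang_threshold(tau, delta, r, lam):
    s_now = poisson.cdf(r - 1, lam * tau)             # P(T > tau)
    s_next = poisson.cdf(r - 1, lam * (tau + delta))  # P(T > tau + delta)
    return np.log(s_next / (s_now - s_next))
```

As tau grows this expression tends to the constant $\log\frac{e^{-\lambda\delta}}{1 - e^{-\lambda\delta}}$, which is positive for typical frame periods ($\lambda\delta \ll 1$): exactly the steady-state lower bound criticized next.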

62 Figure 4: Temporal evolution of the Bayesian threshold for the Erlang (left) and Weibull (center) priors. [sent-108, score-0.095]

63 Even though, qualitatively, this is the behavior one would desire, a closer observation of the figure reveals the major limitation of the Erlang prior: its steady-state behavior. [sent-110, score-0.059]

64 Ideally, in addition to decreasing monotonically over time, the threshold should not be lower-bounded by a positive value, as this may lead to situations in which its steady-state value is high enough to miss several consecutive shot boundaries. [sent-111, score-0.868]

65 This limitation is a consequence of the constant arrival rate assumption discussed in section 2 and can be avoided by relying instead on the Weibull prior. [sent-112, score-0.091]

66 2 The Weibull model It can be shown that, under the Weibull assumption, the integrals of (7) again have closed form, (10) from which $\mathcal{T}_W(\tau) = -\log\left\{\exp\left[\frac{(\tau+\delta)^{\alpha} - \tau^{\alpha}}{\beta^{\alpha}}\right] - 1\right\}$. (11) [sent-115, score-0.059]

67 As illustrated by Figure 4, unlike the threshold associated with the Erlang prior, $\mathcal{T}_W(\tau)$ tends to $-\infty$ when $\tau$ grows without bound. [sent-116, score-0.122]

68 This guarantees that a new shot boundary will always be found if one waits long enough. [sent-117, score-0.814]
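
The Weibull threshold of (11) is a one-liner; a sketch (for α > 1 the exponent grows with τ, driving the threshold to −∞ as described):

```python
# Hedged sketch: the Weibull threshold of (11). expm1 keeps the
# computation stable when the exponent is small (early in the shot).
import numpy as np

def weibull_threshold(tau, delta, alpha, beta):
    inc = ((tau + delta) ** alpha - tau ** alpha) / beta ** alpha
    return -np.log(np.expm1(inc))
```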

69 In summary, both the Erlang and the Weibull priors lead to adaptive thresholds that are more intuitive than the fixed threshold commonly employed for shot segmentation. [sent-118, score-0.92]

70 5 Segmentation Results The performance of Bayesian shot segmentation was evaluated on a database containing the promotional trailers of Figure 1. [sent-119, score-0.982]

71 Each trailer consists of 2 to 5 minutes of video and the total number of shots in the database is 1959. [sent-120, score-0.263]

72 Ground truth was obtained by manual segmentation of all the trailers. [sent-122, score-0.133]

73 We evaluated the performance of Bayesian models with Erlang, Weibull and Poisson shot duration priors and compared them against the best possible performance achievable with a fixed threshold. [sent-123, score-0.909]

74 For the latter, the optimal threshold was obtained by brute-force, i.e. [sent-124, score-0.095]

75 Error rates for all priors are shown in Figure 4 where it is visible that, while the Poisson prior leads to worse accuracy than the static threshold, both the Erlang and the Weibull priors lead to significant improvements. [sent-127, score-0.086]

76 The Weibull prior achieves the overall best performance, decreasing the error rate of the static threshold by 20%. [sent-128, score-0.177]

77 The reasons for the improved performance of Bayesian segmentation are illustrated by Figure 5, which presents the evolution of the thresholding process for a segment from one of the trailers in the database ("blankman"). [sent-129, score-0.362]

78 Two thresholding approaches are depicted: Bayesian with the Weibull prior, and standard fixed thresholding. [sent-130, score-0.094]

79 The adaptive behavior of the Bayesian threshold significantly increases the robustness against spurious peaks of the activity metric originated by events such as very fast motion, explosions, camera flashes, etc. [sent-131, score-0.234]
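
Putting the pieces together, a sketch of the full adaptive-threshold loop; it reuses `p_activity` and `weibull_threshold` from the sketches above, and all names are illustrative rather than the authors':

```python
# Hedged sketch: Bayesian shot segmentation. Declare a boundary whenever
# the activity log-likelihood ratio exceeds the duration-dependent
# threshold, and reset the elapsed time tau at each detected boundary.
import numpy as np

def segment(activity, delta, alpha, beta):
    boundaries, tau = [], 0.0
    for n, d in enumerate(activity):
        llr = np.log(p_activity(d, True) / p_activity(d, False))
        if llr > weibull_threshold(tau, delta, alpha, beta):
            boundaries.append(n + 1)  # boundary between frames n and n + 1
            tau = 0.0                 # a new shot starts
        else:
            tau += delta
    return boundaries
```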

80 Figure 5: An example of the thresholding process. [sent-132, score-0.094]

81 The likelihood ratio and the Weibull threshold are shown. [sent-134, score-0.154]

82 Histogram distances and optimal threshold (determined by leave-one-out using the remainder of the database) are presented. [sent-136, score-0.095]

83 The QBIC project: Querying images by content using color, texture, and shape. [sent-156, score-0.034]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('shot', 0.728), ('erlang', 0.329), ('weibull', 0.238), ('video', 0.202), ('duration', 0.156), ('segmentation', 0.133), ('histogram', 0.098), ('threshold', 0.095), ('thresholding', 0.094), ('vrh', 0.091), ('boundary', 0.086), ('poisson', 0.079), ('oisr', 0.073), ('activity', 0.071), ('frames', 0.071), ('movie', 0.066), ('characterization', 0.064), ('elapsed', 0.063), ('film', 0.057), ('interval', 0.057), ('clouds', 0.055), ('genre', 0.055), ('jungle', 0.055), ('montage', 0.055), ('trailers', 0.055), ('vengeance', 0.055), ('bayesian', 0.05), ('color', 0.045), ('events', 0.045), ('boundaries', 0.041), ('temporal', 0.041), ('frame', 0.04), ('intuitive', 0.038), ('fit', 0.038), ('bull', 0.037), ('dredd', 0.037), ('eden', 0.037), ('edwood', 0.037), ('madness', 0.037), ('princess', 0.037), ('promotional', 0.037), ('scout', 0.037), ('sleeping', 0.037), ('srh', 0.037), ('stylistic', 0.037), ('tide', 0.037), ('prior', 0.036), ('santa', 0.035), ('ratio', 0.034), ('st', 0.034), ('histograms', 0.034), ('content', 0.034), ('arrival', 0.034), ('limitation', 0.034), ('density', 0.032), ('semantic', 0.032), ('da', 0.032), ('texture', 0.032), ('odds', 0.032), ('shots', 0.032), ('vis', 0.032), ('database', 0.029), ('formulation', 0.029), ('french', 0.029), ('reserve', 0.029), ('wei', 0.029), ('successive', 0.028), ('scene', 0.028), ('features', 0.027), ('illustrated', 0.027), ('regular', 0.027), ('aro', 0.026), ('symbol', 0.026), ('intervals', 0.026), ('structure', 0.026), ('likelihood', 0.025), ('priors', 0.025), ('transitions', 0.025), ('occurred', 0.025), ('spie', 0.025), ('indexes', 0.025), ('reveals', 0.025), ('presents', 0.024), ('motion', 0.024), ('tw', 0.023), ('walking', 0.023), ('adaptive', 0.023), ('exponential', 0.023), ('decreasing', 0.023), ('rate', 0.023), ('monotonically', 0.022), ('extraction', 0.022), ('log', 0.022), ('mixture', 0.022), ('knowledge', 0.021), ('message', 0.021), ('time', 0.021), ('image', 0.021), ('notation', 0.021), ('elements', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999994 30 nips-2000-Bayesian Video Shot Segmentation

Author: Nuno Vasconcelos, Andrew Lippman

Abstract: Prior knowledge about video structure can be used both as a means to improve the performance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulation is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.

2 0.18077993 103 nips-2000-Probabilistic Semantic Video Indexing

Author: Milind R. Naphade, Igor Kozintsev, Thomas S. Huang

Abstract: We propose a novel probabilistic framework for semantic video indexing. We define probabilistic multimedia objects (multijects) to map low-level media features to high-level semantic labels. A graphical network of such multijects (multinet) captures scene context by discovering intra-frame as well as inter-frame dependency relations between the concepts. The main contribution is a novel application of a factor graph framework to model this network. We model relations between semantic concepts in terms of their co-occurrence as well as the temporal dependencies between these concepts within video shots. Using the sum-product algorithm [1] for approximate or exact inference in these factor graph multinets, we attempt to correct errors made during isolated concept detection by forcing high-level constraints. This results in a significant improvement in the overall detection performance.

3 0.12527069 50 nips-2000-FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks

Author: Malcolm Slaney, Michele Covell

Abstract: FaceSync is an optimal linear algorithm that finds the degree of synchronization between the audio and image recordings of a human speaker. Using canonical correlation, it finds the best direction to combine all the audio and image data, projecting them onto a single axis. FaceSync uses Pearson's correlation to measure the degree of synchronization between the audio and image data. We derive the optimal linear transform to combine the audio and visual information and describe an implementation that avoids the numerical problems caused by computing the correlation matrices. 1 Motivation In many applications, we want to know about the synchronization between an audio signal and the corresponding image data. In a teleconferencing system, we might want to know which of the several people imaged by a camera is heard by the microphones; then, we can direct the camera to the speaker. In post-production for a film, clean audio dialog is often dubbed over the video; we want to adjust the audio signal so that the lip-sync is perfect. When analyzing a film, we want to know when the person talking is in the shot, instead of off camera. When evaluating the quality of dubbed films, we can measure how well the translated words and audio fit the actor's face. This paper describes an algorithm, FaceSync, that measures the degree of synchronization between the video image of a face and the associated audio signal. We can do this task by synthesizing the talking face, using techniques such as Video Rewrite [1], and then comparing the synthesized video with the test video. That process, however, is expensive. Our solution finds a linear operator that, when applied to the audio and video signals, generates an audio-video-synchronization-error signal. The linear operator gathers information from throughout the image and thus allows us to do the computation inexpensively. Hershey and Movellan [2] describe an approach based on measuring the mutual information between the audio signal and individual pixels in the video. The correlation between the audio signal, x, and one pixel in the image y, is given by Pearson's correlation, r. The mutual information between these two variables is given by $I(x, y) = -\frac{1}{2}\log(1 - r^2)$. They create movies that show the regions of the video that have high correlation with the audio; 1. Currently at IBM Almaden Research, 650 Harry Road, San Jose, CA 95120. 2. Currently at YesVideo.com, 2192 Fortune Drive, San Jose, CA 95131.

4 0.11688118 83 nips-2000-Machine Learning for Video-Based Rendering

Author: Arno Schödl, Irfan A. Essa

Abstract: We present techniques for rendering and animation of realistic scenes by analyzing and training on short video sequences. This work extends the new paradigm for computer animation, video textures, which uses recorded video to generate novel animations by replaying the video samples in a new order. Here we concentrate on video sprites, which are a special type of video texture. In video sprites, instead of storing whole images, the object of interest is separated from the background and the video samples are stored as a sequence of alpha-matted sprites with associated velocity information. They can be rendered anywhere on the screen to create a novel animation of the object. We present methods to create such animations by finding a sequence of sprite samples that is both visually smooth and follows a desired path. To estimate visual smoothness, we train a linear classifier to estimate visual similarity between video samples. If the motion path is known in advance, we use beam search to find a good sample sequence. We can specify the motion interactively by precomputing the sequence cost function using Q-learning.

5 0.1102953 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

Author: John W. Fisher III, Trevor Darrell, William T. Freeman, Paul A. Viola

Abstract: People can understand complex auditory and visual information, often using one to disambiguate the other. Automated analysis, even at a low level, faces severe challenges, including the lack of accurate statistical models for the signals, and their high-dimensionality and varied sampling rates. Previous approaches [6] assumed simple parametric models for the joint distribution which, while tractable, cannot capture the complex signal relationships. We learn the joint distribution of the visual and auditory signals using a non-parametric approach. First, we project the data into a maximally informative, low-dimensional subspace, suitable for density estimation. We then model the complicated stochastic relationships between the signals using a nonparametric density estimator. These learned densities allow processing across signal modalities. We demonstrate, on synthetic and real signals, localization in video of the face that is speaking in audio, and, conversely, audio enhancement of a particular speaker selected from the video.

6 0.066364482 82 nips-2000-Learning and Tracking Cyclic Human Motion

7 0.062993839 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images

8 0.050012577 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates

9 0.048802845 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes

10 0.047974132 48 nips-2000-Exact Solutions to Time-Dependent MDPs

11 0.047018398 79 nips-2000-Learning Segmentation by Random Walks

12 0.041007452 53 nips-2000-Feature Correspondence: A Markov Chain Monte Carlo Approach

13 0.040160012 65 nips-2000-Higher-Order Statistical Properties Arising from the Non-Stationarity of Natural Signals

14 0.039948784 106 nips-2000-Propagation Algorithms for Variational Bayesian Learning

15 0.038129225 135 nips-2000-The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference

16 0.037537925 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

17 0.036118131 76 nips-2000-Learning Continuous Distributions: Simulations With Field Theoretic Priors

18 0.033903085 17 nips-2000-Active Learning for Parameter Estimation in Bayesian Networks

19 0.033767123 80 nips-2000-Learning Switching Linear Models of Human Motion

20 0.033671021 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.14), (1, -0.07), (2, 0.1), (3, 0.09), (4, -0.077), (5, -0.06), (6, 0.192), (7, 0.115), (8, -0.187), (9, -0.038), (10, 0.037), (11, 0.011), (12, -0.053), (13, 0.007), (14, 0.094), (15, 0.105), (16, -0.006), (17, -0.045), (18, 0.052), (19, 0.104), (20, 0.032), (21, 0.055), (22, 0.068), (23, 0.02), (24, -0.057), (25, 0.077), (26, -0.026), (27, -0.039), (28, 0.151), (29, 0.017), (30, 0.019), (31, 0.022), (32, 0.02), (33, -0.053), (34, -0.057), (35, 0.009), (36, 0.128), (37, -0.027), (38, -0.143), (39, 0.12), (40, -0.085), (41, -0.163), (42, -0.033), (43, 0.214), (44, -0.129), (45, -0.01), (46, 0.062), (47, 0.027), (48, -0.159), (49, 0.045)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9585557 30 nips-2000-Bayesian Video Shot Segmentation

Author: Nuno Vasconcelos, Andrew Lippman

Abstract: Prior knowledge about video structure can be used both as a means to improve the performance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulation is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.

2 0.75550586 103 nips-2000-Probabilistic Semantic Video Indexing

Author: Milind R. Naphade, Igor Kozintsev, Thomas S. Huang

Abstract: We propose a novel probabilistic framework for semantic video indexing. We define probabilistic multimedia objects (multijects) to map low-level media features to high-level semantic labels. A graphical network of such multijects (multinet) captures scene context by discovering intra-frame as well as inter-frame dependency relations between the concepts. The main contribution is a novel application of a factor graph framework to model this network. We model relations between semantic concepts in terms of their co-occurrence as well as the temporal dependencies between these concepts within video shots. Using the sum-product algorithm [1] for approximate or exact inference in these factor graph multinets, we attempt to correct errors made during isolated concept detection by forcing high-level constraints. This results in a significant improvement in the overall detection performance.

3 0.48261902 83 nips-2000-Machine Learning for Video-Based Rendering

Author: Arno Schödl, Irfan A. Essa

Abstract: We present techniques for rendering and animation of realistic scenes by analyzing and training on short video sequences. This work extends the new paradigm for computer animation, video textures, which uses recorded video to generate novel animations by replaying the video samples in a new order. Here we concentrate on video sprites, which are a special type of video texture. In video sprites, instead of storing whole images, the object of interest is separated from the background and the video samples are stored as a sequence of alpha-matted sprites with associated velocity information. They can be rendered anywhere on the screen to create a novel animation of the object. We present methods to create such animations by finding a sequence of sprite samples that is both visually smooth and follows a desired path. To estimate visual smoothness, we train a linear classifier to estimate visual similarity between video samples. If the motion path is known in advance, we use beam search to find a good sample sequence. We can specify the motion interactively by precomputing the sequence cost function using Q-learning.

4 0.34376517 48 nips-2000-Exact Solutions to Time-Dependent MDPs

Author: Justin A. Boyan, Michael L. Littman

Abstract: We describe an extension of the Markov decision process model in which a continuous time dimension is included in the state space. This allows for the representation and exact solution of a wide range of problems in which transitions or rewards vary over time. We examine problems based on route planning with public transportation and telescope observation scheduling.

5 0.34235156 79 nips-2000-Learning Segmentation by Random Walks

Author: Marina Meila, Jianbo Shi

Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features.

6 0.3319487 78 nips-2000-Learning Joint Statistical Models for Audio-Visual Fusion and Segregation

7 0.31736127 50 nips-2000-FaceSync: A Linear Operator for Measuring Synchronization of Video Facial Images and Audio Tracks

8 0.31149706 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes

9 0.25871673 82 nips-2000-Learning and Tracking Cyclic Human Motion

10 0.25646323 135 nips-2000-The Manhattan World Assumption: Regularities in Scene Statistics which Enable Bayesian Inference

11 0.25474846 72 nips-2000-Keeping Flexible Active Contours on Track using Metropolis Updates

12 0.22784665 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

13 0.22508256 42 nips-2000-Divisive and Subtractive Mask Effects: Linking Psychophysics and Biophysics

14 0.22167096 29 nips-2000-Bayes Networks on Ice: Robotic Search for Antarctic Meteorites

15 0.20962118 45 nips-2000-Emergence of Movement Sensitive Neurons' Properties by Learning a Sparse Code for Natural Moving Images

16 0.18001799 102 nips-2000-Position Variance, Recurrence and Perceptual Learning

17 0.17966457 31 nips-2000-Beyond Maximum Likelihood and Density Estimation: A Sample-Based Criterion for Unsupervised Learning of Complex Models

18 0.1781545 15 nips-2000-Accumulator Networks: Suitors of Local Probability Propagation

19 0.17791043 16 nips-2000-Active Inference in Concept Learning

20 0.17340343 32 nips-2000-Color Opponency Constitutes a Sparse Representation for the Chromatic Structure of Natural Scenes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(10, 0.026), (15, 0.015), (17, 0.099), (32, 0.021), (33, 0.041), (55, 0.015), (62, 0.037), (65, 0.016), (67, 0.043), (75, 0.396), (76, 0.028), (79, 0.023), (81, 0.07), (90, 0.027), (91, 0.012), (97, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.89150327 30 nips-2000-Bayesian Video Shot Segmentation

Author: Nuno Vasconcelos, Andrew Lippman

Abstract: Prior knowledge about video structure can be used both as a means to improve the performance of content analysis and to extract features that allow semantic classification. We introduce statistical models for two important components of this structure, shot duration and activity, and demonstrate the usefulness of these models by introducing a Bayesian formulation for the shot segmentation problem. The new formulation is shown to extend standard thresholding methods in an adaptive and intuitive way, leading to improved segmentation accuracy.

2 0.82783091 38 nips-2000-Data Clustering by Markovian Relaxation and the Information Bottleneck Method

Author: Naftali Tishby, Noam Slonim

Abstract: We introduce a new, non-parametric and principled, distance-based clustering method. This method combines a pairwise-based approach with a vector-quantization method, which provides a meaningful interpretation of the resulting clusters. The idea is based on turning the distance matrix into a Markov process and then examining the decay of mutual information during the relaxation of this process. The clusters emerge as quasi-stable structures during this relaxation, and are then extracted using the information bottleneck method. These clusters capture the information about the initial point of the relaxation in the most effective way. The method can cluster data with no geometric or other bias and makes no assumption about the underlying distribution.

3 0.79349494 70 nips-2000-Incremental and Decremental Support Vector Machine Learning

Author: Gert Cauwenberghs, Tomaso Poggio

Abstract: An on-line recursive algorithm for training support vector machines, one vector at a time, is presented. Adiabatic increments retain the Kuhn-Tucker conditions on all previously seen training data, in a number of steps each computed analytically. The incremental procedure is reversible, and decremental

4 0.49743617 79 nips-2000-Learning Segmentation by Random Walks

Author: Marina Meila, Jianbo Shi

Abstract: We present a new view of image segmentation by pairwise similarities. We interpret the similarities as edge flows in a Markov random walk and study the eigenvalues and eigenvectors of the walk's transition matrix. This interpretation shows that spectral methods for clustering and segmentation have a probabilistic foundation. In particular, we prove that the Normalized Cut method arises naturally from our framework. Finally, the framework provides a principled method for learning the similarity function as a combination of features.

5 0.42611104 71 nips-2000-Interactive Parts Model: An Application to Recognition of On-line Cursive Script

Author: Predrag Neskovic, Philip C. Davis, Leon N. Cooper

Abstract: In this work, we introduce an Interactive Parts (IP) model as an alternative to Hidden Markov Models (HMMs). We tested both models on a database of on-line cursive script. We show that implementations of HMMs and the IP model, in which all letters are assumed to have the same average width, give comparable results. However, in contrast to HMMs, the IP model can handle duration modeling without an increase in computational complexity.

6 0.3916769 103 nips-2000-Probabilistic Semantic Video Indexing

7 0.38738605 12 nips-2000-A Support Vector Method for Clustering

8 0.38677755 48 nips-2000-Exact Solutions to Time-Dependent MDPs

9 0.38130796 84 nips-2000-Minimum Bayes Error Feature Selection for Continuous Speech Recognition

10 0.37438625 146 nips-2000-What Can a Single Neuron Compute?

11 0.37043071 74 nips-2000-Kernel Expansions with Unlabeled Examples

12 0.36453179 7 nips-2000-A New Approximate Maximal Margin Classification Algorithm

13 0.36326307 122 nips-2000-Sparse Representation for Gaussian Process Models

14 0.36057904 107 nips-2000-Rate-coded Restricted Boltzmann Machines for Face Recognition

15 0.35306063 98 nips-2000-Partially Observable SDE Models for Image Sequence Recognition Tasks

16 0.35239869 82 nips-2000-Learning and Tracking Cyclic Human Motion

17 0.35145113 39 nips-2000-Decomposition of Reinforcement Learning for Admission Control of Self-Similar Call Arrival Processes

18 0.35116449 55 nips-2000-Finding the Key to a Synapse

19 0.35087195 4 nips-2000-A Linear Programming Approach to Novelty Detection

20 0.34878752 80 nips-2000-Learning Switching Linear Models of Human Motion