nips nips2004 nips2004-29 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dustin Lang, Nando D. Freitas
Abstract: We present a graphical model for beat tracking in recorded music. Using a probabilistic graphical model allows us to incorporate local information and global smoothness constraints in a principled manner. We evaluate our model on a set of varied and difficult examples, and achieve impressive results. By using a fast dual-tree algorithm for graphical model inference, our system runs in less time than the duration of the music being processed. 1
Reference: text
sentIndex sentText sentNum sentScore
1 ca Abstract We present a graphical model for beat tracking in recorded music. [sent-3, score-0.701]
2 Using a probabilistic graphical model allows us to incorporate local information and global smoothness constraints in a principled manner. [sent-4, score-0.197]
3 By using a fast dual-tree algorithm for graphical model inference, our system runs in less time than the duration of the music being processed. [sent-6, score-0.239]
4 1 Introduction This paper describes our approach to the beat tracking problem. [sent-7, score-0.655]
5 Dixon describes beats as follows: “much music has as its rhythmic basis a series of pulses, spaced approximately equally in time, relative to which the timing of all musical events can be described”. [sent-8, score-0.515]
6 Given a piece of recorded music (an MP3 file, for example), we wish to produce a set of beats that correspond to the beats perceived by human listeners. [sent-10, score-0.509]
7 The set of beats of a song can be characterised by the trajectories through time of the tempo and phase offset. [sent-11, score-1.107]
8 Tempo is typically measured in beats per minute (BPM), and describes the frequency of beats. [sent-12, score-0.197]
9 The phase offset determines the time offset of the beat. [sent-13, score-0.282]
10 When tapping a foot in time to music, tempo is the rate of foot tapping and phase offset is the time at which the tap occurs. [sent-14, score-0.897]
11 The beat tracking problem, in its general form, is quite difficult. [sent-15, score-0.611]
12 Music is often ambiguous; different human listeners can perceive the beat differently. [sent-16, score-0.554]
13 There are often several beat tracks that could be considered correct. [sent-17, score-0.569]
14 Human perception of the beat is influenced both by ‘local’ and contextual information; the beat can continue through several seconds of silence in the middle of a song. [sent-18, score-1.025]
15 We see the beat tracking problem as not only an interesting problem in its own right, but as one aspect of the larger problem of machine analysis of music. [sent-19, score-0.611]
16 Given beat tracks for a number of songs, we could extract descriptions of the rhythm and use these features for clustering or searching in music collections. [sent-20, score-0.764]
17 We could also use the rhythm information to do structural analysis of songs - for example, to find repeating sections. [sent-21, score-0.139]
18 In addition, we note that beat tracking produces a description of the time scale of a song; knowledge of the tempo of a song would be one way to achieve time-invariance in a symbolic description. [sent-22, score-1.431]
19 Finally, we note that beat tracking tells us where the important parts of a song are; the beats (and major divisions of the beats) are good sampling points for other music-analysis problems such as note detection. [sent-23, score-1.026]
20 2 Related Work Many researchers have investigated the beat tracking problem; we present only a brief overview here. [sent-24, score-0.611]
21 Goto [3] has described several systems for beat tracking. [sent-28, score-0.495]
22 Cemgil and Kappen [4] phrase the beat tracking problem in probabilistic terms, and we adapt their model as our local observation model. [sent-30, score-0.641]
23 3 Graphical Model In formulating our model for beat tracking, we assume that the tempo is nearly constant over short periods of time, and usually varies smoothly. [sent-32, score-1.027]
24 We break the song into a set of frames of two seconds; each frame is a node in the graphical model. [sent-35, score-0.326]
25 We expect the tempo to be constant within each frame, and the tempo and phase offset parameters to vary smoothly between frames. [sent-36, score-1.277]
26 Figure 1: Our graphical model for beat tracking (a chain of hidden states X1 … XF with observations Y1 … YF; φ links each state to its observation and ψ links neighbouring states). [sent-37, score-0.559]
27 The hidden state X is composed of the state variables tempo and phase offset. [sent-38, score-0.714]
28 The observations Y are the features extracted by our audio signal processing. [sent-39, score-0.146]
29 The potential function φ describes the compatibility of the observations with the state, while the potential function ψ describes the smoothness between neighbouring states. [sent-40, score-0.241]
30 In this undirected probabilistic graphical model, the potential function φ describes the compatibility of the state variables X = {T, P } composed of tempo T and phase offset P with the local observations Y . [sent-41, score-0.943]
31 The potential function ψ describes the smoothness constraints between frames. [sent-42, score-0.143]
32 This model allows us to trade off local fit and global smoothness in a principled manner. [sent-45, score-0.133]
33 In such models, belief propagation (BP) [5] allows us to compute the marginal probabilities of the state variables in each frame. [sent-47, score-0.134]
34 Alternatively, maximum belief propagation (max-BP) allows a joint maximum a posteriori (MAP) set of state variables to be determined. [sent-48, score-0.18]
35 Our smoothness function ψ is the product of tempo and phase smoothness components ψT and ψP. [sent-53, score-0.78]
36 For the tempo component, we use a Gaussian on the log of tempo. [sent-54, score-0.532]
37 For the phase offset component, we want the phases to agree at a particular point in time: the boundary between the two frames (nodes), tb . [sent-55, score-0.29]
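As a rough illustration of how such a smoothness potential might be computed (a sketch only, not the authors' code; the width parameters sigma_T and sigma_P and the radians convention for phase are assumptions):

```python
import numpy as np

def smoothness_psi(tempo1, phase1, tempo2, phase2, t_b, sigma_T=0.1, sigma_P=0.5):
    """Hypothetical sketch of psi = psi_T * psi_P between two neighbouring frames.

    Tempos are in beats per second, phase offsets in radians, and t_b is the
    time of the boundary between the two frames.  sigma_T and sigma_P are
    assumed width parameters (not given in this summary)."""
    # Tempo component: a Gaussian on the difference of log-tempos.
    psi_T = np.exp(-0.5 * ((np.log(tempo2) - np.log(tempo1)) / sigma_T) ** 2)

    # Phase component: the beat phase predicted by each frame, evaluated at the
    # boundary time t_b, should agree (compared modulo one beat period).
    pos1 = (tempo1 * t_b - phase1 / (2 * np.pi)) % 1.0
    pos2 = (tempo2 * t_b - phase2 / (2 * np.pi)) % 1.0
    diff = min(abs(pos1 - pos2), 1.0 - abs(pos1 - pos2))
    psi_P = np.exp(-0.5 * (diff / sigma_P) ** 2)

    return psi_T * psi_P
```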
38 The qualitative results seem to be fairly stable as a function of these smoothness parameters. [sent-60, score-0.109]
39 The beat is divided into a fixed number of bins (some power of two), and each note is assigned to the nearest bin. [sent-64, score-0.495]
40 The probability of observing a note at a coarse subdivision of the beat is greater than at a finer subdivision. [sent-65, score-0.495]
41 More precisely, a note that is quantized to the bin at beat number k has probability p(k) ∝ exp(−λ d(k)), where d(k) is the number of digits in the binary representation of the number k mod 1. [sent-66, score-0.495]
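A minimal sketch of this observation probability (illustrative only; the digit cap and the value of λ are assumptions):

```python
import numpy as np

def binary_digits(frac, max_digits=8):
    """Digits needed to write frac (in [0, 1)) in binary, capped at max_digits."""
    digits = 0
    while frac != 0.0 and digits < max_digits:
        frac = (frac * 2.0) % 1.0
        digits += 1
    return digits

def note_probability(k, lam=1.0):
    """Unnormalised p(k) proportional to exp(-lambda * d(k)) for a note quantised
    to beat position k; d(k) counts the binary digits of k mod 1."""
    return np.exp(-lam * binary_digits(k % 1.0))

# On-beat notes are most probable, then half-beat, then quarter-beat notes:
# note_probability(3.0) > note_probability(3.5) > note_probability(3.25)
```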
42 Since we use recorded music rather than MIDI, we must perform signal processing to extract features from the raw data. [sent-67, score-0.26]
43 This process produces a signal that has considerably more uncertainty than the discrete events of MIDI data, so we adjust the model. [sent-68, score-0.147]
44 The template function b(t) expresses our belief about the distribution of musical events within the beat. [sent-73, score-0.255]
45 By shifting and scaling b(t), we can describe the expected distribution of notes in time for different tempos and phase offsets: b(t | T, P) = b(T t − P/(2π)). [sent-74, score-0.182]
46 Our signal processing (described below) yields a discrete set of events that are meant to correspond to musical events. [sent-75, score-0.246]
47 Given tempo and phase offset values, we stretch and shift this function to get the expected distribution of notes in time. [sent-82, score-0.763]
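The stretch-and-shift step could be sketched as follows (a toy illustration; the particular template shape and the beats-per-second convention are assumptions):

```python
import numpy as np

def expected_note_density(t, tempo, phase, template):
    """Evaluate the within-beat template at absolute time t for a given tempo
    (beats per second) and phase offset (radians): b(t | T, P) = b(T t - P/(2 pi))."""
    beat_position = (tempo * t - phase / (2.0 * np.pi)) % 1.0
    return template(beat_position)

def toy_template(x):
    """A made-up template: most mass on the beat, some on the half-beat."""
    d_beat = np.minimum(x, 1.0 - x)   # distance to the nearest beat (with wrap-around)
    d_half = np.abs(x - 0.5)          # distance to the half-beat
    return (1.0 + 0.8 * np.exp(-0.5 * (d_beat / 0.03) ** 2)
                + 0.3 * np.exp(-0.5 * (d_half / 0.03) ** 2))
```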
48 5 Signal Processing Our signal processing stage is meant to extract features that approximate musical events (drum beats, piano notes, guitar strums, etc.). [sent-85, score-0.361]
49 As discussed above, we produce a set of events composed of time and ‘strength’ values, where the strength describes our certainty that an event occurred. [sent-87, score-0.14]
50 We assume that musical events are characterised by brief, rapid increases in energy in the audio signal. [sent-88, score-0.334]
51 Following a suggestion by [2], we pass the energy spectrum through a bank of five filters that sum the energy in different portions of the spectrum. [sent-92, score-0.122]
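A rough sketch of this kind of onset-event extraction (not the authors' implementation; the simple thresholding and the band layout are assumptions):

```python
import numpy as np

def onset_events(stft_power, band_edges, frame_times, threshold=0.0):
    """Sum the power spectrum within each band and emit an event wherever the
    band energy rises sharply from one frame to the next.

    stft_power  : array of shape (n_frames, n_freq_bins)
    band_edges  : frequency-bin indices delimiting the bands (six edges -> five bands)
    frame_times : time in seconds of each frame
    Returns a time-sorted list of (time, strength) pairs."""
    events = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band_energy = stft_power[:, lo:hi].sum(axis=1)
        rise = np.diff(band_energy)            # frame-to-frame energy increase
        for i, r in enumerate(rise):
            if r > threshold:
                events.append((frame_times[i + 1], float(r)))
    return sorted(events)
```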
52 6 Fast Inference To find a maximum a posteriori (MAP) set of state variables that best explain a set of observations, we need to optimize a 2F-dimensional, continuous, non-linear, non-Gaussian function that has many local extrema. [sent-99, score-0.114]
53 F is the number of frames in the song, so it is on the order of the length of the song in seconds - typically in the hundreds. [sent-100, score-0.294]
54 In the first strategy, we convert the continuous state space into a uniform discrete grid and run discrete belief propagation. [sent-103, score-0.199]
55 In the second strategy, we run a particle filter in the forward direction, then use the particles as ‘grid’ points and run discrete belief propagation as per [6]. [sent-104, score-0.3]
56 Since the landscape we are optimizing has many local maxima, we must use a fine discretization grid (for the first strategy) or a large number of particles (for the second strategy). [sent-105, score-0.122]
57 The message-passing stage in discrete belief propagation takes O(N²) if performed naively, where N is the number of discretized states (or particles) per frame. [sent-106, score-0.129]
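For reference, the naive O(N²) step that the fast dual-tree algorithm accelerates looks roughly like the standard max-product recursion below (a sketch, not the authors' code):

```python
import numpy as np

def max_product_message(phi_curr, psi_pair, incoming):
    """One naive O(N^2) max-product (max-BP) message update along the chain:
    m_out(x_next) = max over x_curr of phi(x_curr) * psi(x_curr, x_next) * m_in(x_curr).

    phi_curr : (N,) local potentials of the current frame's discretised states
    psi_pair : (N, N) smoothness potentials between current and next frame
    incoming : (N,) message arriving at the current frame
    Returns the outgoing (N,) message and back-pointers for decoding the MAP track."""
    scores = (phi_curr * incoming)[:, None] * psi_pair   # (N, N) candidate scores
    back_pointers = scores.argmax(axis=0)                # best predecessor per next state
    return scores.max(axis=0), back_pointers
```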
58 As an aside, we note that if we wish to compute the smoothed marginal probabilities rather than the MAP set of parameters, then we can use standard discrete belief propagation or particle smoothing. [sent-109, score-0.291]
59 For the results presented here, we discretize the state space into NT = 90 tempo values and NP = 50 phase offset values for the belief propagation version. [sent-112, score-0.847]
60 We distribute the tempo values uniformly on a log scale between 40 and 150 BPM, and distribute the phase offsets uniformly. [sent-113, score-0.722]
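A sketch of the discretised state space described above (the exact spacing and the [0, 2π) convention for phase offsets are assumptions):

```python
import numpy as np

N_T, N_P = 90, 50
tempos_bpm = np.exp(np.linspace(np.log(40.0), np.log(150.0), N_T))   # log-uniform, 40-150 BPM
phase_offsets = np.linspace(0.0, 2.0 * np.pi, N_P, endpoint=False)   # uniform phase offsets

# Grid of (tempo, phase-offset) states used by discrete BP; the particle-filter
# variant uses N_T * N_P = 4500 particles instead of this fixed grid.
state_grid = [(t, p) for t in tempos_bpm for p in phase_offsets]
```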
61 For the particle filter version, we use NT × NP = 4500 particles. [sent-114, score-0.115]
62 7 Results A standard corpus of labelled ground truth data for the beat-tracking problem does not exist. [sent-116, score-0.222]
63 Therefore, we labelled a relatively small number of songs for evaluation of our algorithm, by listening to the songs and pressing a key at each perceived beat. [sent-117, score-0.186]
64 One can imagine several methods for speeding up the process of generating ground truth labellings and of cleaning up the noisy results generated by humans. [sent-121, score-0.222]
65 For example, a human labelling of a short segment of the song could be automatically extrapolated to the remainder of the song, using energy spikes in the audio signal to fine-tune the placement of beats. [sent-122, score-0.471]
66 However, by generating ground truth using assumptions similar to those embodied in the models we intend to test, we risk invalidating the results. [sent-123, score-0.222]
67 There is no standard evaluation metric for beat tracking. [sent-125, score-0.495]
68 We use the ρ function presented by Cemgil et al [11] and used by Dixon [1] in his analysis: ρ(S, T) = (100 / ((NS + NT)/2)) Σ_{i=1..NS} max_{j∈T} exp(−(Si − Tj)² / (2σ²)), where S and T are the ground-truth and proposed beat times, and σ is set to 40 milliseconds. [sent-126, score-0.495]
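A direct sketch of this metric (beat times in seconds, σ = 40 ms as stated in the text):

```python
import numpy as np

def rho(ground_truth, proposed, sigma=0.040):
    """Cemgil et al. beat-tracking score: for each ground-truth beat, credit the
    closest proposed beat under a Gaussian window of width sigma (seconds)."""
    S = np.asarray(ground_truth, dtype=float)
    T = np.asarray(proposed, dtype=float)
    closeness = np.exp(-((S[:, None] - T[None, :]) ** 2) / (2.0 * sigma ** 2))
    return 100.0 * closeness.max(axis=1).sum() / ((len(S) + len(T)) / 2.0)
```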
69 A ρ value near 100 means that each predicted beat is close to a true beat, while a value near zero means that each predicted beat is far from a true beat. [sent-127, score-1.122]
70 We have focused on finding a globally-optimum beat track rather than precisely locating each beat. [sent-128, score-0.555]
71 Note the wide range of genres and the choice of songs with features that we thought would make beat tracking difficult. [sent-131, score-0.704]
72 The first two columns list the name of the song and the reason we included it. [sent-133, score-0.262]
73 The third column lists the qualitative performance of the fixed grid version: double means our algorithm produced a beat track twice as fast as ground truth, half means we tracked at half speed, and sync means we produced a syncopated (π phase error) beat track. [sent-134, score-1.673]
74 A blank entry means our algorithm produced the correct beat track. [sent-135, score-0.495]
75 A star (★) means that our result incorrectly switches phase or tempo. [sent-136, score-0.153]
76 (Fragment of the results table.) Song entries: Pole / 2 / Stadt; Underworld / A Hundred Days Off / MoMove; Ravi Shankar / The Sounds Of India / Bhimpalsi (edit); Pitamaha: Music From Bali / Puri Bagus, Bamboo (excerpt); Gamelan Sekar Jaya / Byomantara (excerpt). Fixed-grid (BP Perf.) labels: double, half, sync, double, three-half, sync, half. [sent-144, score-0.463]
77 (Fragment of the results table.) Column headers Song, Comment, PF Perf.; particle-filter (PF Perf.) labels: sync, double, half, sync, double, three-half, sync, half. [sent-147, score-0.569]
78 Center: ‘raw’ ground-truth tempo (instantaneous tempo estimate based on the time between adjacent beats) and smoothed ground truth (by averaging). [sent-149, score-1.359]
79 The remaining columns contain the same items for the particle filter version. [sent-153, score-0.115]
80 Out of 25 examples, the fixed grid version produces the correct answer in 17 cases, tracks at double speed in two cases, half speed in two cases, syncopated in one case, and in three cases produces a track that (incorrectly) switches tempo or phase. [sent-154, score-0.928]
81 The particle filter version produces 16 correct answers, two double-speed, two half-speed, two syncopated, and the same three ‘switching’ tracks. [sent-155, score-0.115]
82 An example of a successful tempo track is shown in Figure 3. [sent-156, score-0.592]
83 The song switches time signature between 3/4 and 4/4 time a total of five times; see Figure 4. [sent-158, score-0.411]
84 On the fourth change (from 4/4 to 3/4), it tracks at 2/3 the ground truth rate instead. [sent-160, score-0.296]
85 The global maximum corresponds to the beat track shown in the left plot. [sent-163, score-0.555]
86 The local maximum near 50 BPM corresponds to an alternate solution in which, rather than tracking the quarter notes, we produce one beat per measure; this track is quite plausible. [sent-164, score-0.732]
87 Note that there is also a local maximum near 100 BPM, but phase-shifted by half a beat. [sent-166, score-0.114]
88 This is the solution in which the beats are syncopated from the true result. [sent-167, score-0.224]
89 8 Conclusions and Further Work We present a graphical model for beat tracking and evaluate it on a set of varied and difficult examples. [sent-168, score-0.675]
90 The beat tracking problem has inherent ambiguity and multiple interpretations are often plausible. [sent-171, score-0.646]
91 This is particularly useful for situations in which beat tracking is one element in a larger machine listening application. [sent-173, score-0.611]
92 Probabilistic graphical models allow flexible and powerful handling of uncertainty, and allow local and contextual information to interact in a principled manner. [sent-174, score-0.161]
93 The global maximum corresponds to the tempo track shown. [sent-184, score-0.592]
94 Longer-range tempo smoothness constraints as suggested by [11] could be useful. [sent-189, score-0.603]
95 At present, we first perform a full particle filtering sweep and then run max-BP. [sent-192, score-0.115]
96 Taking into account the quality of the partial MAP solutions during particle filtering might allow superior results by directing more particles toward regions of the state space that are likely to contain the final MAP solution. [sent-193, score-0.209]
97 Since we know that our probability terrain is multimodal, a mixture particle filter would be useful [12]. [sent-194, score-0.115]
98 An audio-based real-time beat tracking system for music with or without drum-sounds. [sent-206, score-0.76]
99 Monte Carlo methods for tempo tracking and rhythm quantization. [sent-209, score-0.694]
100 Maximum a posteriori sequence estimation using Monte Carlo particle filters. [sent-215, score-0.161]
wordName wordTfidf (topN-words)
[('tempo', 0.532), ('beat', 0.495), ('song', 0.262), ('beats', 0.153), ('music', 0.149), ('bpm', 0.142), ('truth', 0.117), ('tracking', 0.116), ('particle', 0.115), ('sync', 0.106), ('phase', 0.106), ('ground', 0.105), ('musical', 0.099), ('songs', 0.093), ('piano', 0.077), ('tb', 0.077), ('audio', 0.076), ('offset', 0.075), ('tracks', 0.074), ('smoothness', 0.071), ('cemgil', 0.071), ('guitar', 0.071), ('syncopated', 0.071), ('events', 0.07), ('graphical', 0.064), ('energy', 0.061), ('track', 0.06), ('belief', 0.059), ('particles', 0.056), ('double', 0.055), ('half', 0.053), ('lter', 0.053), ('electronica', 0.053), ('excerpt', 0.053), ('gamelan', 0.053), ('jazz', 0.053), ('lucy', 0.053), ('solo', 0.053), ('notes', 0.05), ('signature', 0.05), ('smoothed', 0.047), ('switches', 0.047), ('midi', 0.046), ('rhythm', 0.046), ('posteriori', 0.046), ('describes', 0.044), ('signal', 0.044), ('edit', 0.042), ('raw', 0.041), ('state', 0.038), ('qualitative', 0.038), ('propagation', 0.037), ('sky', 0.037), ('ns', 0.037), ('grid', 0.036), ('cake', 0.035), ('chan', 0.035), ('dixon', 0.035), ('drum', 0.035), ('drums', 0.035), ('err', 0.035), ('foot', 0.035), ('indonesian', 0.035), ('loudness', 0.035), ('miles', 0.035), ('nando', 0.035), ('quartet', 0.035), ('threehalf', 0.035), ('ambiguity', 0.035), ('contextual', 0.035), ('predicted', 0.035), ('nt', 0.034), ('discrete', 0.033), ('frames', 0.032), ('principled', 0.032), ('instruments', 0.031), ('listeners', 0.031), ('rock', 0.031), ('tapping', 0.031), ('near', 0.031), ('local', 0.03), ('bp', 0.028), ('distribute', 0.028), ('diamonds', 0.028), ('characterised', 0.028), ('davis', 0.028), ('doucet', 0.028), ('gauss', 0.028), ('instrumental', 0.028), ('offsets', 0.028), ('potential', 0.028), ('human', 0.028), ('template', 0.027), ('observations', 0.026), ('pf', 0.026), ('kappen', 0.026), ('lang', 0.026), ('time', 0.026), ('recorded', 0.026), ('strategy', 0.025)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999911 29 nips-2004-Beat Tracking the Graphical Model Way
Author: Dustin Lang, Nando D. Freitas
Abstract: We present a graphical model for beat tracking in recorded music. Using a probabilistic graphical model allows us to incorporate local information and global smoothness constraints in a principled manner. We evaluate our model on a set of varied and difficult examples, and achieve impressive results. By using a fast dual-tree algorithm for graphical model inference, our system runs in less time than the duration of the music being processed. 1
2 0.10025641 74 nips-2004-Harmonising Chorales by Probabilistic Inference
Author: Moray Allan, Christopher Williams
Abstract: We describe how we used a data set of chorale harmonisations composed by Johann Sebastian Bach to train Hidden Markov Models. Using a probabilistic framework allows us to create a harmonisation system which learns from examples, and which can compose new harmonisations. We make a quantitative comparison of our system’s harmonisation performance against simpler models, and provide example harmonisations. 1
3 0.081832781 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
Author: N. J. Hill, Thomas N. Lal, Karin Bierig, Niels Birbaumer, Bernhard Schölkopf
Abstract: Motivated by the particular problems involved in communicating with “locked-in” paralysed patients, we aim to develop a braincomputer interface that uses auditory stimuli. We describe a paradigm that allows a user to make a binary decision by focusing attention on one of two concurrent auditory stimulus sequences. Using Support Vector Machine classification and Recursive Channel Elimination on the independent components of averaged eventrelated potentials, we show that an untrained user’s EEG data can be classified with an encouragingly high level of accuracy. This suggests that it is possible for users to modulate EEG signals in a single trial by the conscious direction of attention, well enough to be useful in BCI. 1
4 0.065674864 83 nips-2004-Incremental Learning for Visual Tracking
Author: Jongwoo Lim, David A. Ross, Ruei-sung Lin, Ming-Hsuan Yang
Abstract: Most existing tracking algorithms construct a representation of a target object prior to the tracking task starts, and utilize invariant features to handle appearance variation of the target caused by lighting, pose, and view angle change. In this paper, we present an efficient and effective online algorithm that incrementally learns and adapts a low dimensional eigenspace representation to reflect appearance changes of the target, thereby facilitating the tracking task. Furthermore, our incremental method correctly updates the sample mean and the eigenbasis, whereas existing incremental subspace update methods ignore the fact the sample mean varies over time. The tracking problem is formulated as a state inference problem within a Markov Chain Monte Carlo framework and a particle filter is incorporated for propagating sample distributions over time. Numerous experiments demonstrate the effectiveness of the proposed tracking algorithm in indoor and outdoor environments where the target objects undergo large pose and lighting changes. 1
5 0.061761305 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
Author: Erik B. Sudderth, Michael I. Mandel, William T. Freeman, Alan S. Willsky
Abstract: We describe a three–dimensional geometric hand model suitable for visual tracking applications. The kinematic constraints implied by the model’s joints have a probabilistic structure which is well described by a graphical model. Inference in this model is complicated by the hand’s many degrees of freedom, as well as multimodal likelihoods caused by ambiguous image measurements. We use nonparametric belief propagation (NBP) to develop a tracking algorithm which exploits the graph’s structure to control complexity, while avoiding costly discretization. While kinematic constraints naturally have a local structure, self– occlusions created by the imaging process lead to complex interpendencies in color and edge–based likelihood functions. However, we show that local structure may be recovered by introducing binary hidden variables describing the occlusion state of each pixel. We augment the NBP algorithm to infer these occlusion variables in a distributed fashion, and then analytically marginalize over them to produce hand position estimates which properly account for occlusion events. We provide simulations showing that NBP may be used to refine inaccurate model initializations, as well as track hand motion through extended image sequences. 1
6 0.053367041 116 nips-2004-Message Errors in Belief Propagation
7 0.053337641 160 nips-2004-Seeing through water
8 0.053035948 16 nips-2004-Adaptive Discriminative Generative Model and Its Applications
9 0.046307418 174 nips-2004-Spike Sorting: Bayesian Clustering of Non-Stationary Data
10 0.045336615 152 nips-2004-Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization
11 0.043262757 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks
12 0.04305302 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
13 0.042608429 63 nips-2004-Expectation Consistent Free Energies for Approximate Inference
14 0.042466868 5 nips-2004-A Harmonic Excitation State-Space Approach to Blind Separation of Speech
15 0.041298747 76 nips-2004-Hierarchical Bayesian Inference in Networks of Spiking Neurons
16 0.039682474 12 nips-2004-A Temporal Kernel-Based Model for Tracking Hand Movements from Neural Activities
17 0.037886214 13 nips-2004-A Three Tiered Approach for Articulated Object Action Modeling and Recognition
18 0.036611095 60 nips-2004-Efficient Kernel Machines Using the Improved Fast Gauss Transform
19 0.034500208 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
20 0.033746861 31 nips-2004-Blind One-microphone Speech Separation: A Spectral Learning Approach
topicId topicWeight
[(0, -0.121), (1, -0.03), (2, 0.01), (3, -0.082), (4, -0.004), (5, -0.028), (6, 0.083), (7, 0.019), (8, -0.047), (9, -0.086), (10, 0.054), (11, -0.015), (12, 0.054), (13, 0.025), (14, 0.038), (15, -0.01), (16, 0.04), (17, -0.045), (18, -0.027), (19, 0.062), (20, -0.064), (21, -0.02), (22, -0.004), (23, 0.015), (24, 0.025), (25, 0.016), (26, -0.037), (27, 0.041), (28, -0.026), (29, 0.004), (30, -0.027), (31, -0.05), (32, 0.005), (33, 0.002), (34, 0.025), (35, 0.027), (36, 0.003), (37, 0.081), (38, -0.065), (39, 0.071), (40, -0.047), (41, -0.011), (42, 0.013), (43, -0.018), (44, -0.076), (45, -0.005), (46, 0.191), (47, 0.171), (48, -0.032), (49, -0.189)]
simIndex simValue paperId paperTitle
same-paper 1 0.92580235 29 nips-2004-Beat Tracking the Graphical Model Way
Author: Dustin Lang, Nando D. Freitas
Abstract: We present a graphical model for beat tracking in recorded music. Using a probabilistic graphical model allows us to incorporate local information and global smoothness constraints in a principled manner. We evaluate our model on a set of varied and difficult examples, and achieve impressive results. By using a fast dual-tree algorithm for graphical model inference, our system runs in less time than the duration of the music being processed. 1
2 0.73819971 74 nips-2004-Harmonising Chorales by Probabilistic Inference
Author: Moray Allan, Christopher Williams
Abstract: We describe how we used a data set of chorale harmonisations composed by Johann Sebastian Bach to train Hidden Markov Models. Using a probabilistic framework allows us to create a harmonisation system which learns from examples, and which can compose new harmonisations. We make a quantitative comparison of our system’s harmonisation performance against simpler models, and provide example harmonisations. 1
3 0.47478741 171 nips-2004-Solitaire: Man Versus Machine
Author: Xiang Yan, Persi Diaconis, Paat Rusmevichientong, Benjamin V. Roy
Abstract: In this paper, we use the rollout method for policy improvement to analyze a version of Klondike solitaire. This version, sometimes called thoughtful solitaire, has all cards revealed to the player, but then follows the usual Klondike rules. A strategy that we establish, using iterated rollouts, wins about twice as many games on average as an expert human player does. 1
4 0.43422297 55 nips-2004-Distributed Occlusion Reasoning for Tracking with Nonparametric Belief Propagation
Author: Erik B. Sudderth, Michael I. Mandel, William T. Freeman, Alan S. Willsky
Abstract: We describe a three–dimensional geometric hand model suitable for visual tracking applications. The kinematic constraints implied by the model’s joints have a probabilistic structure which is well described by a graphical model. Inference in this model is complicated by the hand’s many degrees of freedom, as well as multimodal likelihoods caused by ambiguous image measurements. We use nonparametric belief propagation (NBP) to develop a tracking algorithm which exploits the graph’s structure to control complexity, while avoiding costly discretization. While kinematic constraints naturally have a local structure, self– occlusions created by the imaging process lead to complex interpendencies in color and edge–based likelihood functions. However, we show that local structure may be recovered by introducing binary hidden variables describing the occlusion state of each pixel. We augment the NBP algorithm to infer these occlusion variables in a distributed fashion, and then analytically marginalize over them to produce hand position estimates which properly account for occlusion events. We provide simulations showing that NBP may be used to refine inaccurate model initializations, as well as track hand motion through extended image sequences. 1
5 0.35892734 83 nips-2004-Incremental Learning for Visual Tracking
Author: Jongwoo Lim, David A. Ross, Ruei-sung Lin, Ming-Hsuan Yang
Abstract: Most existing tracking algorithms construct a representation of a target object prior to the tracking task starts, and utilize invariant features to handle appearance variation of the target caused by lighting, pose, and view angle change. In this paper, we present an efficient and effective online algorithm that incrementally learns and adapts a low dimensional eigenspace representation to reflect appearance changes of the target, thereby facilitating the tracking task. Furthermore, our incremental method correctly updates the sample mean and the eigenbasis, whereas existing incremental subspace update methods ignore the fact the sample mean varies over time. The tracking problem is formulated as a state inference problem within a Markov Chain Monte Carlo framework and a particle filter is incorporated for propagating sample distributions over time. Numerous experiments demonstrate the effectiveness of the proposed tracking algorithm in indoor and outdoor environments where the target objects undergo large pose and lighting changes. 1
6 0.34580019 122 nips-2004-Modelling Uncertainty in the Game of Go
7 0.3357597 20 nips-2004-An Auditory Paradigm for Brain-Computer Interfaces
8 0.32884169 116 nips-2004-Message Errors in Belief Propagation
9 0.32531559 120 nips-2004-Modeling Conversational Dynamics as a Mixed-Memory Markov Process
10 0.32148129 1 nips-2004-A Cost-Shaping LP for Bellman Error Minimization with Performance Guarantees
11 0.31856039 108 nips-2004-Markov Networks for Detecting Overalpping Elements in Sequence Data
12 0.31762341 97 nips-2004-Learning Efficient Auditory Codes Using Spikes Predicts Cochlear Filters
13 0.31711465 203 nips-2004-Validity Estimates for Loopy Belief Propagation on Binary Real-world Networks
14 0.31701982 147 nips-2004-Planning for Markov Decision Processes with Sparse Stochasticity
15 0.31526113 184 nips-2004-The Cerebellum Chip: an Analog VLSI Implementation of a Cerebellar Model of Classical Conditioning
16 0.31415737 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
17 0.31213164 75 nips-2004-Heuristics for Ordering Cue Search in Decision Making
18 0.31015623 56 nips-2004-Dynamic Bayesian Networks for Brain-Computer Interfaces
19 0.30683076 43 nips-2004-Conditional Models of Identity Uncertainty with Application to Noun Coreference
20 0.30613503 193 nips-2004-Theories of Access Consciousness
topicId topicWeight
[(13, 0.059), (15, 0.104), (26, 0.044), (31, 0.015), (33, 0.153), (39, 0.011), (50, 0.362), (51, 0.014), (56, 0.011), (69, 0.035), (71, 0.021), (76, 0.014), (87, 0.03)]
simIndex simValue paperId paperTitle
1 0.92109764 199 nips-2004-Using Machine Learning to Break Visual Human Interaction Proofs (HIPs)
Author: Kumar Chellapilla, Patrice Y. Simard
Abstract: Machine learning is often used to automatically solve human tasks. In this paper, we look for tasks where machine learning algorithms are not as good as humans with the hope of gaining insight into their current limitations. We studied various Human Interactive Proofs (HIPs) on the market, because they are systems designed to tell computers and humans apart by posing challenges presumably too hard for computers. We found that most HIPs are pure recognition tasks which can easily be broken using machine learning. The harder HIPs use a combination of segmentation and recognition tasks. From this observation, we found that building segmentation tasks is the most effective way to confuse machine learning algorithms. This has enabled us to build effective HIPs (which we deployed in MSN Passport), as well as design challenging segmentation tasks for machine learning algorithms. 1 In t rod u ct i on The OCR problem for high resolution printed text has virtually been solved 10 years ago [1]. On the other hand, cursive handwriting recognition today is still too poor for most people to rely on. Is there a fundamental difference between these two seemingly similar problems? To shed more light on this question, we study problems that have been designed to be difficult for computers. The hope is that we will get some insight on what the stumbling blocks are for machine learning and devise appropriate tests to further understand their similarities and differences. Work on distinguishing computers from humans traces back to the original Turing Test [2] which asks that a human distinguish between another human and a machine by asking questions of both. Recent interest has turned to developing systems that allow a computer to distinguish between another computer and a human. These systems enable the construction of automatic filters that can be used to prevent automated scripts from utilizing services intended for humans [4]. Such systems have been termed Human Interactive Proofs (HIPs) [3] or Completely Automated Public Turing Tests to Tell Computers and Humans Apart (CAPTCHAs) [4]. An overview of the work in this area can be found in [5]. Construction of HIPs that are of practical value is difficult because it is not sufficient to develop challenges at which humans are somewhat more successful than machines. This is because the cost of failure for an automatic attacker is minimal compared to the cost of failure for humans. Ideally a HIP should be solved by humans more than 80% of the time, while an automatic script with reasonable resource use should succeed less than 0.01% of the time. This latter ratio (1 in 10,000) is a function of the cost of an automatic trial divided by the cost of having a human perform the attack. This constraint of generating tasks that are failed 99.99% of the time by all automated algorithms has generated various solutions which can easily be sampled on the internet. Seven different HIPs, namely, Mailblocks, MSN (before April 28th, 2004), Ticketmaster, Yahoo, Yahoo v2 (after Sept’04), Register, and Google, will be given as examples in the next section. We will show in Section 3 that machinelearning-based attacks are far more successful than 1 in 10,000. Yet, some of these HIPs are harder than others and could be made even harder by identifying the recognition and segmentation parts, and emphasizing the latter. Section 4 presents examples of more difficult HIPs which are much more respectable challenges for machine learning, and yet surprisingly easy for humans. 
The final section discusses a (known) weakness of machine learning algorithms and suggests designing simple artificial datasets for studying this weakness. 2 Exa mp les o f H I Ps The HIPs explored in this paper are made of characters (or symbols) rendered to an image and presented to the user. Solving the HIP requires identifying all characters in the correct order. The following HIPs can be sampled from the web: Mailblocks: While signing up for free email service with (www.mailblocks.com), you will find HIP challenges of the type: mailblocks MSN: While signing up for free e-mail with MSN Hotmail (www.hotmail.com), you will find HIP challenges of the type: Register.com: While requesting a whois lookup for a domain at www.register.com, you will HIP challenges of the type: Yahoo!/EZ-Gimpy (CMU): While signing up for free e-mail service with Yahoo! (www.yahoo.com), you will receive HIP challenges of the type: Yahoo! (version 2): Starting in August 2004, Yahoo! introduced their second generation HIP. Three examples are presented below: Ticketmaster: While looking for concert tickets at www.ticketmaster.com, you will receive HIP challenges of the type: Google/Gmail: While signing up for free e-mail with Gmail at www.google.com, one will receive HIP challenges of the type: While solutions to Yahoo HIPs are common English words, those for ticketmaster and Google do not necessarily belong to the English dictionary. They appear to have been created using a phonetic generator [8]. 3 Usi n g ma ch i n e lea rn i n g t o b rea k H IP s Breaking HIPs is not new. Mori and Malik [7] have successfully broken the EZGimpy (92% success) and Gimpy (33% success) HIPs from CMU. Our approach aims at an automatic process for solving multiple HIPs with minimum human intervention, using machine learning. In this paper, our main goal is to learn more about the common strengths and weaknesses of these HIPs rather than to prove that we can break any one HIP in particular with the highest possible success rate. We have results for six different HIPs: EZ-Gimpy/Yahoo, Yahoo v2, mailblocks, register, ticketmaster, and Google. To simplify our study, we will not be using language models in our attempt to break HIPs. For example, there are only about 600 words in the EZ-Gimpy dictionary [7], which means that a random guess attack would get a success rate of 1 in 600 (more than enough to break the HIP, i.e., greater than 0.01% success). HIPs become harder when no language model is used. Similarly, when a HIP uses a language model to generate challenges, success rate of attacks can be significantly improved by incorporating the language model. Further, since the language model is not common to all HIPs studied, it was not used in this paper. Our generic method for breaking all of these HIPs is to write a custom algorithm to locate the characters, and then use machine learning for recognition. Surprisingly, segmentation, or finding the characters, is simple for many HIPs which makes the process of breaking the HIP particularly easy. Gimpy uses a single constant predictable color (black) for letters even though the background color changes. We quickly realized that once the segmentation problem is solved, solving the HIP becomes a pure recognition problem, and it can trivially be solved using machine learning. Our recognition engine is based on neural networks [6][9]. It yielded a 0.4% error rate on the MNIST database, uses little memory, and is very fast for recognition (important for breaking HIPs). 
For each HIP, we have a segmentation step, followed by a recognition step. It should be stressed that we are not trying to solve every HIP of a given type i.e., our goal is not 100% success rate, but something efficient that can achieve much better than 0.01%. In each of the following experiments, 2500 HIPs were hand labeled and used as follows (a) recognition (1600 for training, 200 for validation, and 200 for testing), and (b) segmentation (500 for testing segmentation). For each of the five HIPs, a convolution neural network, identical to the one described in [6], was trained and tested on gray level character images centered on the guessed character positions (see below). The trained neural network became the recognizer. 3.1 M a i l b l oc k s To solve the HIP, we select the red channel, binarize and erode it, extract the largest connected components (CCs), and breakup CCs that are too large into two or three adjacent CCs. Further, vertically overlapping half character size CCs are merged. The resulting rough segmentation works most of the time. Here is an example: For instance, in the example above, the NN would be trained, and tested on the following images: … The end-to-end success rate is 88.8% for segmentation, 95.9% for recognition (given correct segmentation), and (0.888)*(0.959)7 = 66.2% total. Note that most of the errors come from segmentation, even though this is where all the custom programming was invested. 3.2 Register The procedure to solve HIPs is very similar. The image was smoothed, binarized, and the largest 5 connected components were identified. Two examples are presented below: The end-to-end success rate is 95.4% for segmentation, 87.1% for recognition (given correct segmentation), and (0.954)*(0.871)5 = 47.8% total. 3.3 Y a h oo/ E Z - G i mp y Unlike the mailblocks and register HIPs, the Yahoo/EZ-Gimpy HIPs are richer in that a variety of backgrounds and clutter are possible. Though some amount of text warping is present, the text color, size, and font have low variability. Three simple segmentation algorithms were designed with associated rules to identify which algorithm to use. The goal was to keep these simple yet effective: a) No mesh: Convert to grayscale image, threshold to black and white, select large CCs with sizes close to HIP char sizes. One example: b) Black mesh: Convert to grayscale image, threshold to black and white, remove vertical and horizontal line pixels that don’t have neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: c) White mesh: Convert to grayscale image, threshold to black and white, add black pixels (in white line locations) if there exist neighboring pixels, select large CCs with sizes close to HIP char sizes. One example: Tests for black and white meshes were performed to determine which segmentation algorithm to use. The end-to-end success rate was 56.2% for segmentation (38.2% came from a), 11.8% from b), and 6.2% from c), 90.3% for recognition (given correct segmentation), and (0.562)*(0.903)4.8 = 34.4% total. The average length of a Yahoo HIP solution is 4.8 characters. 3.4 T i c k e t ma s t e r The procedure that solved the Yahoo HIP is fairly successful at solving some of the ticket master HIPs. These HIPs are characterized by cris-crossing lines at random angles clustered around 0, 45, 90, and 135 degrees. A multipronged attack as in the Yahoo case (section 3.3) has potential. 
In the interests of simplicity, a single attack was developed: Convert to grayscale, threshold to black and white, up-sample image, dilate first then erode, select large CCs with sizes close to HIP char sizes. One example: The dilate-erode combination causes the lines to be removed (along with any thin objects) but retains solid thick characters. This single attack is successful in achieving an end-to-end success rate of 16.6% for segmentation, the recognition rate was 82.3% (in spite of interfering lines), and (0.166)*(0.823)6.23 = 4.9% total. The average HIP solution length is 6.23 characters. 3.5 Y a h oo ve r s i on 2 The second generation HIP from Yahoo had several changes: a) it did not use words from a dictionary or even use a phonetic generator, b) it uses only black and white colors, c) uses both letters and digits, and d) uses connected lines and arcs as clutter. The HIP is somewhat similar to the MSN/Passport HIP which does not use a dictionary, uses two colors, uses letters and digits, and background and foreground arcs as clutter. Unlike the MSN/Passport HIP, several different fonts are used. A single segmentation attack was developed: Remove 6 pixel border, up-sample, dilate first then erode, select large CCs with sizes close to HIP char sizes. The attack is practically identical to that used for the ticketmaster HIP with different preprocessing stages and slightly modified parameters. Two examples: This single attack is successful in achieving an end-to-end success rate of 58.4% for segmentation, the recognition rate was 95.2%, and (0.584)*(0.952)5 = 45.7% total. The average HIP solution length is 5 characters. 3.6 G oog l e / G M a i l The Google HIP is unique in that it uses only image warp as a means of distorting the characters. Similar to the MSN/Passport and Yahoo version 2 HIPs, it is also two color. The HIP characters are arranged closed to one another (they often touch) and follow a curved baseline. The following very simple attack was used to segment Google HIPs: Convert to grayscale, up-sample, threshold and separate connected components. a) b) This very simple attack gives an end-to-end success rate of 10.2% for segmentation, the recognition rate was 89.3%, giving (0.102)*(0.893)6.5 = 4.89% total probability of breaking a HIP. Average Google HIP solution length is 6.5 characters. This can be significantly improved upon by judicious use of dilate-erode attack. A direct application doesn’t do as well as it did on the ticketmaster and yahoo HIPs (because of the shear and warp of the baseline of the word). More successful and complicated attacks might estimate and counter the shear and warp of the baseline to achieve better success rates. 4 Lesso n s lea rn ed f ro m b rea ki n g H IPs From the previous section, it is clear that most of the errors come from incorrect segmentations, even though most of the development time is spent devising custom segmentation schemes. This observation raises the following questions: Why is segmentation a hard problem? Can we devise harder HIPs and datasets? Can we build an automatic segmentor? Can we compare classification algorithms based on how useful they are for segmentation? 4.1 T h e s e g me n t a t i on p r ob l e m As a review, segmentation is difficult for the following reasons: 1. Segmentation is computationally expensive. In order to find valid patterns, a recognizer must attempt recognition at many different candidate locations. 2. The segmentation function is complex. 
To segment successfully, the system must learn to identify which patterns are valid among the set of all possible valid and non-valid patterns. This task is intrinsically more difficult than classification because the space of input is considerably larger. Unlike the space of valid patterns, the space of non-valid patterns is typically too vast to sample. This is a problem for many learning algorithms which yield too many false positives when presented non-valid patterns. 3. Identifying valid characters among a set of valid and invalid candidates is a combinatorial problem. For example, correctly identifying which 8 characters among 20 candidates (assuming 12 false positives), has a 1 in 125,970 (20 choose 8) chances of success by random guessing. 4.2 B ui l d i n g b e t te r / h a r de r H I P s We can use what we have learned to build better HIPs. For instance the HIP below was designed to make segmentation difficult and a similar version has been deployed by MSN Passport for hotmail registrations (www.hotmail.com): The idea is that the additional arcs are themselves good candidates for false characters. The previous segmentation attacks would fail on this HIP. Furthermore, simple change of fonts, distortions, or arc types would require extensive work for the attacker to adjust to. We believe HIPs that emphasize the segmentation problem, such as the above example, are much stronger than the HIPs we examined in this paper, which rely on recognition being difficult. Pushing this to the extreme, we can easily generate the following HIPs: Despite the apparent difficulty of these HIPs, humans are surprisingly good at solving these, indicating that humans are far better than computers at segmentation. This approach of adding several competing false positives can in principle be used to automatically create difficult segmentation problems or benchmarks to test classification algorithms. 4.3 B ui l d i n g a n a ut o ma t i c s e g me n t or To build an automatic segmentor, we could use the following procedure. Label characters based on their correct position and train a recognizer. Apply the trained recognizer at all locations in the HIP image. Collect all candidate characters identified with high confidence by the recognizer. Compute the probability of each combination of candidates (going from left to right), and output the solution string with the highest probability. This is better illustrated with an example. Consider the following HIP (to the right). The trained neural network has these maps (warm colors indicate recognition) that show that K, Y, and so on are correctly identified. However, the maps for 7 and 9 show several false positives. In general, we would get the following color coded map for all the different candidates: HIP K Y B 7 9 With a threshold of 0.5 on the network’s outputs, the map obtained is: We note that there are several false positives for each true positive. The number of false positives per true positive character was found to be between 1 and 4, giving a 1 in C(16,8) = 12,870 to 1 in C(32,8) = 10,518,300 random chance of guessing the correct segmentation for the HIP characters. These numbers can be improved upon by constraining solution strings to flow sequentially from left to right and by restricting overlap. For each combination, we compute a probability by multiplying the 8 probabilities of the classifier for each position. The combination with the highest probability is the one proposed by the classifier. 
We do not have results for such an automatic segmentor at this time. It is interesting to note that with such a method a classifier that is robust to false positives would do far better than one that is not. This suggests another axis for comparing classifiers. 5 Con clu si on In this paper, we have successfully applied machine learning to the problem of solving HIPs. We have learned that decomposing the HIP problem into segmentation and recognition greatly simplifies analysis. Recognition on even unprocessed images (given segmentation is a solved) can be done automatically using neural networks. Segmentation, on the other hand, is the difficulty differentiator between weaker and stronger HIPs and requires custom intervention for each HIP. We have used this observation to design new HIPs and new tests for machine learning algorithms with the hope of improving them. A c k n ow l e d ge me n t s We would like to acknowledge Chau Luu and Eric Meltzer for their help with labeling and segmenting various HIPs. We would also like to acknowledge Josh Benaloh and Cem Paya for stimulating discussions on HIP security. References [1] Baird HS (1992), “Anatomy of a versatile page reader,” IEEE Pro., v.80, pp. 1059-1065. [2] Turing AM (1950), “Computing Machinery and Intelligence,” Mind, 59:236, pp. 433-460. [3] First Workshop on Human Interactive Proofs, Palo Alto, CA, January 2002. [4] Von Ahn L, Blum M, and Langford J, The Captcha Project. http://www.captcha.net [5] Baird HS and Popat K (2002) “Human Interactive Proofs and Document Image Analysis,” Proc. IAPR 2002 Workshop on Document Analysis Systerms, Princeton, NJ. [6] Simard PY, Steinkraus D, and Platt J, (2003) “Best Practice for Convolutional Neural Networks Applied to Visual Document Analysis,” in International Conference on Document Analysis and Recognition (ICDAR), pp. 958-962, IEEE Computer Society, Los Alamitos. [7] Mori G, Malik J (2003), “Recognizing Objects in Adversarial Clutter: Breaking a Visual CAPTCHA,” Proc. of the Computer Vision and Pattern Recognition (CVPR) Conference, IEEE Computer Society, vol.1, pages:I-134 - I-141, June 18-20, 2003 [8] Chew, M. and Baird, H. S. (2003), “BaffleText: a Human Interactive Proof,” Proc., 10th IS&T;/SPIE Document Recognition & Retrieval Conf., Santa Clara, CA, Jan. 22. [9] LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,’ Proceedings of the IEEE, Nov. 1998.
2 0.87960672 136 nips-2004-On Semi-Supervised Classification
Author: Balaji Krishnapuram, David Williams, Ya Xue, Lawrence Carin, Mário Figueiredo, Alexander J. Hartemink
Abstract: A graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1
3 0.86033005 105 nips-2004-Log-concavity Results on Gaussian Process Methods for Supervised and Unsupervised Learning
Author: Liam Paninski
Abstract: Log-concavity is an important property in the context of optimization, Laplace approximation, and sampling; Bayesian methods based on Gaussian process priors have become quite popular recently for classification, regression, density estimation, and point process intensity estimation. Here we prove that the predictive densities corresponding to each of these applications are log-concave, given any observed data. We also prove that the likelihood is log-concave in the hyperparameters controlling the mean function of the Gaussian prior in the density and point process intensity estimation cases, and the mean, covariance, and observation noise parameters in the classification and regression cases; this result leads to a useful parameterization of these hyperparameters, indicating a suitably large class of priors for which the corresponding maximum a posteriori problem is log-concave.
4 0.83534867 49 nips-2004-Density Level Detection is Classification
Author: Ingo Steinwart, Don Hush, Clint Scovel
Abstract: We show that anomaly detection can be interpreted as a binary classification problem. Using this interpretation we propose a support vector machine (SVM) for anomaly detection. We then present some theoretical results which include consistency and learning rates. Finally, we experimentally compare our SVM with the standard one-class SVM. 1
same-paper 5 0.82862175 29 nips-2004-Beat Tracking the Graphical Model Way
Author: Dustin Lang, Nando D. Freitas
Abstract: We present a graphical model for beat tracking in recorded music. Using a probabilistic graphical model allows us to incorporate local information and global smoothness constraints in a principled manner. We evaluate our model on a set of varied and difficult examples, and achieve impressive results. By using a fast dual-tree algorithm for graphical model inference, our system runs in less time than the duration of the music being processed. 1
6 0.63172781 69 nips-2004-Fast Rates to Bayes for Kernel Machines
7 0.62429732 189 nips-2004-The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees
8 0.59480363 206 nips-2004-Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms
9 0.59162199 38 nips-2004-Co-Validation: Using Model Disagreement on Unlabeled Data to Validate Classification Algorithms
10 0.58621633 103 nips-2004-Limits of Spectral Clustering
11 0.58492184 158 nips-2004-Sampling Methods for Unsupervised Learning
12 0.5830397 65 nips-2004-Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments
13 0.57689512 175 nips-2004-Stable adaptive control with online learning
14 0.57629782 168 nips-2004-Semigroup Kernels on Finite Sets
15 0.57163227 156 nips-2004-Result Analysis of the NIPS 2003 Feature Selection Challenge
16 0.5709753 154 nips-2004-Resolving Perceptual Aliasing In The Presence Of Noisy Sensors
17 0.56509101 201 nips-2004-Using the Equivalent Kernel to Understand Gaussian Process Regression
18 0.56059778 84 nips-2004-Inference, Attention, and Decision in a Bayesian Neural Architecture
19 0.56002104 142 nips-2004-Outlier Detection with One-class Kernel Fisher Discriminants
20 0.55966616 48 nips-2004-Convergence and No-Regret in Multiagent Learning