nips nips2011 nips2011-246 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Dmitry Pidan, Ran El-Yaniv
Abstract: Focusing on short term trend prediction in a financial context, we consider the problem of selective prediction whereby the predictor can abstain from prediction in order to improve performance. We examine two types of selective mechanisms for HMM predictors. The first is a rejection in the spirit of Chow’s well-known ambiguity principle. The second is a specialized mechanism for HMMs that identifies low quality HMM states and abstains from prediction in those states. We call this model selective HMM (sHMM). In both approaches we can trade off prediction coverage to gain better accuracy in a controlled manner. We compare the performance of the ambiguity-based rejection technique with that of the sHMM approach. Our results indicate that both methods are effective, and that the sHMM model is superior. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Focusing on short term trend prediction in a financial context, we consider the problem of selective prediction whereby the predictor can abstain from prediction in order to improve performance. [sent-5, score-0.684]
2 We examine two types of selective mechanisms for HMM predictors. [sent-6, score-0.378]
3 The first is a rejection in the spirit of Chow’s well-known ambiguity principle. [sent-8, score-0.333]
4 The second is a specialized mechanism for HMMs that identifies low quality HMM states and abstains from prediction in those states. [sent-10, score-0.322]
5 In both approaches we can trade-off prediction coverage to gain better accuracy in a controlled manner. [sent-12, score-0.374]
6 We compare performance of the ambiguity-based rejection technique with that of the sHMM approach. [sent-13, score-0.218]
7 Currently, manifestations of selective prediction within machine learning mainly exist in the realm of inductive classification. [sent-19, score-0.42]
8 Given a classifier or predictor equipped with a rejection mechanism, we can quantify its performance profile. [sent-24, score-0.269]
9 The RC curve represents a trade-off: the more coverage we compromise, the more accurate we can expect to be, up to the point where we reject everything and (trivially) never err. [sent-26, score-0.451]
10 classifiers achieving useful (and optimal) RC trade-offs, thus providing the user with control over the choice of desired risk (with its associated coverage compromise). [sent-29, score-0.473]
11 Our longer term goal is to study selective prediction models for general sequential prediction tasks. [sent-30, score-0.525]
12 The second is a novel and specialized technique utilizing the HMM state structure. [sent-42, score-0.237]
13 In this approach we identify latent states whose prediction quality is systematically inferior, and abstain from predictions while the underlying source is likely to be in those states. [sent-43, score-0.402]
14 1 Preliminaries Hidden Markov Models in brief A Hidden Markov Model (HMM) is a generative probabilistic state machine with latent states, in which state transitions and observation emissions represent … [sent-58, score-0.324]
15 the most likely (in a Bayesian sense) state machine giving rise to O, with associated latent state sequence S = S1 , . . . , ST . [sent-64, score-0.393]
16 To define the performance parameters in selective prediction we utilize the following definitions. [sent-82, score-0.42]
17 The purpose of a selective prediction model is to provide “suffi…” [sent-96, score-0.42]
18 The functional relation between risk and coverage is called the risk coverage (RC) trade-off. [sent-101, score-0.83]
19 Generally, the user of a selective model would like to bound one measure (either risk or coverage) and then obtain the best model in terms of the other measure. [sent-102, score-0.515]
20 A selective predictor is useful if its RC curve is “non-trivial” [sent-104, score-0.373]
21 in the sense that progressively smaller risk can be obtained with progressively smaller coverage. [sent-105, score-0.196]
22 An interpolated RC curve can be obtained by selecting a number of coverage bounds at certain grid points of choice, and learning (and testing) a selective model aiming to achieve the best possible risk for each coverage level. [sent-109, score-1.09]
23 Obviously, each such model should respect the corresponding coverage bound. [sent-110, score-0.269]
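As a rough illustration (not from the paper), the following Python sketch shows how such an interpolated RC curve could be assembled over a grid of coverage bounds; fit_selective_model and evaluate are hypothetical helpers standing in for the learning and testing steps.

```python
def rc_curve(coverage_bounds, fit_selective_model, evaluate):
    """Assemble an interpolated risk-coverage (RC) curve (a sketch).

    fit_selective_model(B) is assumed to learn a selective model that respects
    coverage bound B; evaluate(model) is assumed to return its empirical
    (risk, coverage) on held-out data.  Both helpers are hypothetical.
    """
    curve = []
    for B in coverage_bounds:
        model = fit_selective_model(B)
        risk, coverage = evaluate(model)
        curve.append((B, coverage, risk))
    return curve
```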
24 We use a classifier, similar to the one used in [3], and endow it with a rejection mechanism in the spirit of Chow [5]. [sent-115, score-0.227]
25 2 State-Based Selectivity We propose a different approach for implementing selective prediction with HMMs. [sent-139, score-0.42]
26 The proposed approach is suitable for prediction problems whose observation sequences are labeled. [sent-142, score-0.226]
27 Each state is assigned risk and visit rate estimates. [sent-146, score-0.598]
28 For each state q, its risk estimate is used as a proxy to the probability of making erroneous predictions from q, and its visit rate quantifies the fraction of time spent in q. [sent-147, score-0.671]
29 A subset of the highest risk states is selected so that their total expected visit rate does not exceed the user-specified bound. [sent-149, score-0.651]
30 These states are called rejective and predictions from them are ignored. [sent-151, score-0.331]
31 We associate with each state q a label Lq representing the HMM prediction while at this state (see Section 3. [sent-154, score-0.454]
32 Denote γt (i) = P [St = qi | O, λ], and note that γt (i) can be efficiently computed. [sent-156, score-0.211]
33 Given an observation sequence, the empirical visit rate, v(i), of a state qi , is the fraction of time the HMM spends in state qi , that is v(i) = (1/T) Σ_{t=1}^{T} γt (i). [sent-161, score-0.237]
34 Given an observation sequence, the empirical risk, r(i), of a state qi , is the rate of erroneous visits to qi , that is r(i) = (1/(v(i) T)) Σ_{1≤t≤T, lt ≠ Lqi } γt (i). [sent-165, score-0.737]
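As a concrete (hypothetical) illustration of these two estimators, the sketch below computes v(i) and r(i) from a matrix of posteriors gamma[t, i] = γt(i); the array names and label encoding are assumptions, not the authors' code.

```python
import numpy as np

def empirical_visit_and_risk(gamma, state_labels, obs_labels):
    """Empirical visit rate v(i) and risk r(i) for every HMM state (a sketch).

    gamma        : (T, N) array, gamma[t, i] = P[S_t = q_i | O, lambda]
    state_labels : (N,) array, L_{q_i}, the prediction made while in state q_i
    obs_labels   : (T,) array, the true label of each time step
    """
    T, _ = gamma.shape
    v = gamma.sum(axis=0) / T                                   # v(i) = (1/T) sum_t gamma_t(i)
    erroneous = state_labels[None, :] != obs_labels[:, None]    # (T, N) indicator of erroneous visits
    r = (gamma * erroneous).sum(axis=0) / (v * T + 1e-12)       # r(i): rate of erroneous visits
    return v, r
```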
35 To achieve this we apply the following greedy selection procedure for rejective states, whereby the highest-risk states are selected sequentially as long as their overall visit rate does not exceed B. [sent-170, score-0.936]
36 If our model does not include a large number of states, or includes states with very high visit rates (as is often the case in applications), the total visit rate of the rejective states might be far from the requested bound B, entailing that selectivity cannot be fully exploited. [sent-190, score-1.063]
37 Let q be the non-rejective state with the highest risk rate. [sent-198, score-0.333]
38 The probability of rejecting predictions emerging from this state is taken to be pq = (B − Σ_{q′∈RS} v(q′)) / v(q). [sent-199, score-0.358]
39 With pq thus defined, the total expected rejection rate is precisely B, when the expectation is taken over the random choices. [sent-201, score-0.238]
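A minimal sketch of this greedy selection, including the randomized treatment of the boundary state, might look as follows (using the v and r arrays from the sketch above; this is an illustration under assumed inputs, not the authors' implementation).

```python
def select_rejective_states(v, r, B):
    """Greedily pick the highest-risk states whose total visit rate fits under B (a sketch).

    Returns (rejective, boundary, p_boundary): predictions from states in
    `rejective` are always rejected, and predictions from the boundary state
    are rejected with probability p_boundary, so that the expected total
    rejection rate equals B.
    """
    order = sorted(range(len(r)), key=lambda i: r[i], reverse=True)  # highest risk first
    rejective, used = [], 0.0
    boundary, p_boundary = None, 0.0
    for i in order:
        if used + v[i] <= B:
            rejective.append(i)
            used += v[i]
        else:
            boundary = i
            p_boundary = (B - used) / v[i]   # p_q = (B - sum_{q' in RS} v(q')) / v(q)
            break
    return rejective, boundary, p_boundary
```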
40 The refinement approach is to construct an approximate HMM whose states have finer granularity. [sent-206, score-0.244]
41 This smaller granularity enables a selection of rejective states whose total visit rate is closer to the required bound. [sent-208, score-0.624]
42 Refinement is achieved by replacing every highly visited state with a complete HMM. [sent-210, score-0.241]
43 In λ0 , states that have visit rate greater than a certain bound are identified. [sent-212, score-0.484]
44 For each such state qi (called a heavy state), a new HMM λi (called a refining model) is generated. [sent-214, score-0.453]
45 Finally, every transition from qi to another state entails a transition from a state in λi whose probability is the original transition probability from qi . [sent-216, score-0.947]
46 States of λi are assigned the label of qi . [sent-217, score-0.236]
47 Refinement continues in a recursive manner and terminates when all the heavy states have been refined. [sent-219, score-0.349]
48 In Figure 1, states 1, 2, and 4 are heavy (refined) states, and states 3,5,6,7,8 are leaf (emitting) states. [sent-234, score-0.242]
49 The model consisting of states 3 and 4 refines state 1, and the model consisting of states 5 and 6 refines state 2. [sent-236, score-0.358]
50 An aggregate state of the complete hierarchical model corresponds to a set of inner HMM states, each of which is a state on a path from the root through refining models. [sent-238, score-0.484]
51 Transition to the next aggregate state always starts at λ0 , and recursively progresses to the leaf states, as shown in the following example. [sent-245, score-0.403]
52 Suppose that the model in Figure 1 is at aggregate state {1,4,7} at time t. [sent-246, score-0.291]
53 The aggregate state at time t + 1 is calculated as follows. [sent-247, score-0.338]
54 λ0 is in state 1, so its next state (say 1 again) is chosen according to the distribution {a11 , a12 }. [sent-248, score-0.324]
55 The next state is then chosen in the model that refines state 1, which was in state 4 at time t. [sent-250, score-0.355]
56 State 3 is a leaf state that emits observations, and the aggregate state at time t + 1 is {1,3}. [sent-252, score-0.53]
57 On the other hand, if state 2 is chosen at the root, a new state (say 6) in its [sent-253, score-0.358]
58 refining model is chosen according to the initial distribution {π5 , π6 } (transition into the heavy state from another state). [sent-254, score-0.273]
59 The chosen state 6 is a leaf state so the new aggregate state becomes {2,6}. [sent-255, score-0.692]
60 3: Remove state qi , with its emission probabilities {bim , m = 1, . . . , M }, from λ, and record it as a state refined by λi . [sent-263, score-0.678]
61 In steps 1-3, a random λi is generated and connected to the HMM λ instead of qi . [sent-271, score-0.211]
62 The algorithm is applied to heavy states until all states in the HMM have visit rates lower than a required bound. [sent-276, score-0.679]
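A high-level sketch of this outer refinement loop is given below; refine_state and estimate_visit_rates are hypothetical helpers standing in for the splicing steps of the algorithm and for the Baum-Welch-style re-estimation, respectively.

```python
def recursive_refinement(hmm, bound, refine_state, estimate_visit_rates):
    """Refine heavy states until no state's visit rate exceeds `bound` (a sketch).

    refine_state(hmm, i)      -- hypothetical: generate a refining model lambda_i
                                 and splice it into `hmm` in place of state q_i
    estimate_visit_rates(hmm) -- hypothetical: re-train and return per-state visit rates
    """
    while True:
        visit_rates = estimate_visit_rates(hmm)
        heavy = [i for i, v in enumerate(visit_rates) if v > bound]
        if not heavy:
            return hmm
        hmm = refine_state(hmm, heavy[0])   # refine one heavy state, then re-estimate
```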
63 2), re-estimation formulas for the parameters of newly added states (Step 8) are presented, where ξt (j, k) = P [qt = j, qt+1 = k | O, λ]. [sent-286, score-0.203]
64 During the refinement process, transitions from other states into heavy state qi also affect the initial distribution of its refining model. [sent-289, score-0.697]
65 The most likely aggregate state at time t, given sequence O, is found in a top-down manner using the hierarchical structure of the model. [sent-291, score-0.391]
66 Starting with the root model, λ0 , the most likely individual state in it, say qi , is identified. [sent-292, score-0.406]
67 Otherwise, the most likely individual state in λi (the HMM that refines qi ), say qj , is [sent-296, score-0.195]
68 identified, and the aggregate state is updated to be {qi , qj }. [sent-298, score-0.434]
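The following sketch mirrors that top-down search; the data layout (one posterior matrix per model, plus a map telling which refining model, if any, replaces each state) is assumed for illustration only.

```python
def most_likely_aggregate_state(refined_by, gammas, t, root=0):
    """Return the root-to-leaf path of inner states forming the aggregate state at time t (a sketch).

    refined_by : dict mapping (model_id, state) -> refining model id; absent for leaf states
    gammas     : dict mapping model_id -> (T, N) matrix of posteriors gamma_t(i) for that model
    """
    path, model_id = [], root
    while True:
        q = int(gammas[model_id][t].argmax())   # most likely individual state in this model
        path.append((model_id, q))
        nxt = refined_by.get((model_id, q))
        if nxt is None:                         # q is a leaf (emitting) state: stop
            return path
        model_id = nxt                          # otherwise descend into its refining HMM
```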
69 The above procedure requires calculation of the quantity γt (i) not only for the leaf states (where it is calculated using a standard forward-backward procedure), but also for the refined states. [sent-301, score-0.289]
70 The rejection subset is found using the Eq. [sent-305, score-0.432]
71 Visit and risk estimates for the aggregate state {qi1 , . . . , [sent-309, score-0.146]
72 qik } are calculated using γt (ik ) of the leaf state qik that identifies it. [sent-312, score-0.382]
73 The outcome of the RR procedure is a tree of HMMs whose main purpose is to redistribute visit rates among states. [sent-314, score-0.269]
74 At first glance, they do not address the visit rate re-distribution objective. [sent-318, score-0.29]
75 Alternatively, state labels can be calculated from the statistics of the states, if an unsupervised training method is used. [sent-325, score-0.209]
76 For a state qi , and given observation label l, we calculate the average number of visits (at qi ) whose corresponding label is l, as E [St = qi | lt = l, O, λ] = Σ_{1≤t≤T, lt =l} γt (i). [sent-327, score-0.965]
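For instance, a hypothetical sketch (with the label encoding assumed) that assigns each state the label maximizing this statistic:

```python
import numpy as np

def unsupervised_state_labels(gamma, obs_labels, label_set):
    """Assign each state the label it co-occurs with most, weighted by gamma_t(i) (a sketch)."""
    obs_labels = np.asarray(obs_labels)
    # scores[i, k] = sum over {t : l_t = label_set[k]} of gamma_t(i)
    scores = np.stack([gamma[obs_labels == l].sum(axis=0) for l in label_set], axis=1)
    return [label_set[k] for k in scores.argmax(axis=1)]
```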
77 4 Experimental Results We compared empirically the four selection mechanisms presented in Section 3, namely, the ambiguity model and the Naive, RLI, and RR sHMMs. [sent-329, score-0.232]
78 For our prediction task, we took as observation sequence the directions of S&P 500 price changes. [sent-336, score-0.253]
79 For the ambiguity model, the partial sequences dt−ℓ+1 , . . . [sent-343, score-0.215]
80 For the ambiguity model we constructed two 8-state HMMs, where the length of a single observation sequence (ℓ) is 5. [sent-353, score-0.228]
81 Each refining model in the RR procedure had the same structure, and the upper bound on the visit rate was fixed… [sent-356, score-0.35]
82 RC curves were computed for each technique by taking the linear grid of rejection rate bounds from 0 to 0. [sent-363, score-0.266]
83 [Figure 2: S&P 500 RC-curves. Panel (a): error rate vs. coverage bound; panel (b): coverage rate vs. coverage bound.] [sent-406, score-0.346]
84 Figure 2a shows that all four methods exhibited meaningful RC-curves; namely, the error rates decreased monotonically with decreasing coverage bounds. [sent-443, score-0.642]
85 The RLI and RR models (curves 3 and 4, respectively) outperformed the Naive one (curve 2), by better exploiting the allotted coverage bound, as is evident from Table 2b. [sent-444, score-0.269]
86 In addition, the RR model outperformed the RLI model, and moreover, its effective coverage is higher for every required coverage bound. [sent-445, score-0.538]
87 When the RR procedure refines a state and the resulting sub-states have different risk rates, the selection procedure will tend to reject the riskier states first. [sent-449, score-0.654]
88 Comparing the state-based models (curves 2-4) to the ambiguity model (curve 1), we see that all the state-based models outperformed the ambiguity model through the entire coverage range (despite the advantage we provided to the ambiguity model). [sent-451, score-0.698]
89 As can be seen in Figure 2a, the selective techniques can also improve the accuracy obtained by these methods (with full coverage). [sent-455, score-0.315]
90 [Figure 3: Distributions of visit and risk train/test differences. Panel (a): visit; panel (b): risk.] Figure 3a depicts the distribution of differences between empirical visit rates, measured on the training set, and those rates on the test set. [sent-463, score-0.689]
91 This means that our empirical visit estimates are quite robust and useful. [sent-465, score-0.242]
92 Selective classification was introduced by Chow [5], who took a Bayesian route to infer the optimal rejection rule and analyze the risk-coverage trade-off under complete knowledge of the underlying probabilistic source. [sent-471, score-0.216]
93 There is a substantial volume of research contributions on selective classification. [sent-475, score-0.315]
94 Therefore, this technique falls within selective prediction but the selection function has been manually predefined. [sent-511, score-0.216]
95 6 Concluding Remarks The structure and modularity of HMMs make them particularly convenient for incorporating selective prediction mechanisms. [sent-513, score-0.42]
96 We focused on selective prediction of trends in a financial context. [sent-515, score-0.468]
97 Even in such difficult prediction tasks, our models can provide non-trivial prediction improvements. [sent-518, score-0.21]
98 We expect that the relative advantage of these selective prediction techniques will be higher in easier tasks, or even in the same task by utilizing more elaborate HMM modeling, perhaps including other sources of specialized information such as prices of other correlated indices. [sent-519, score-0.467]
99 We believe that a major bottleneck in attaining smaller test errors is the noisy risk estimates we obtain for the hidden states (see Figure 3b). [sent-520, score-0.394]
100 Finally, it will be very interesting to examine selective prediction mechanisms in the more general context of Bayesian networks and other types of graphical models. [sent-523, score-0.483]
wordName wordTfidf (topN-words)
[('hmm', 0.386), ('selective', 0.315), ('coverage', 0.269), ('visit', 0.242), ('qi', 0.211), ('rejection', 0.19), ('states', 0.165), ('hmms', 0.164), ('state', 0.162), ('rc', 0.154), ('risk', 0.146), ('qj', 0.143), ('ambiguity', 0.143), ('aggregate', 0.129), ('reject', 0.124), ('chow', 0.123), ('rli', 0.122), ('rejective', 0.119), ('nancial', 0.107), ('prediction', 0.105), ('re', 0.1), ('hijk', 0.099), ('onml', 0.099), ('rr', 0.094), ('ajk', 0.087), ('hidden', 0.083), ('heavy', 0.08), ('shmm', 0.079), ('nement', 0.079), ('leaf', 0.077), ('sequences', 0.072), ('coarseness', 0.07), ('markov', 0.067), ('transition', 0.067), ('classi', 0.064), ('mechanisms', 0.063), ('bjm', 0.06), ('shmms', 0.06), ('ot', 0.059), ('curve', 0.058), ('qn', 0.052), ('kk', 0.052), ('abstain', 0.052), ('observation', 0.049), ('trends', 0.048), ('qik', 0.048), ('rate', 0.048), ('day', 0.048), ('calculated', 0.047), ('predictions', 0.047), ('specialized', 0.047), ('wf', 0.045), ('hypothesized', 0.045), ('er', 0.042), ('st', 0.041), ('lt', 0.041), ('lqi', 0.04), ('dt', 0.039), ('formulas', 0.038), ('instances', 0.038), ('mechanism', 0.037), ('price', 0.037), ('sequence', 0.036), ('recursively', 0.035), ('emit', 0.035), ('likely', 0.033), ('achieving', 0.033), ('depicts', 0.032), ('aij', 0.032), ('rao', 0.032), ('rabiner', 0.032), ('hong', 0.032), ('trained', 0.032), ('hierarchical', 0.031), ('ning', 0.031), ('nes', 0.031), ('visits', 0.03), ('baum', 0.03), ('bound', 0.029), ('naive', 0.029), ('technique', 0.028), ('compromise', 0.027), ('wp', 0.027), ('rates', 0.027), ('option', 0.027), ('trend', 0.027), ('selection', 0.026), ('erroneous', 0.026), ('siddiqi', 0.026), ('selectivity', 0.026), ('reliable', 0.026), ('took', 0.026), ('pq', 0.025), ('progressively', 0.025), ('recursive', 0.025), ('user', 0.025), ('label', 0.025), ('highest', 0.025), ('granularity', 0.024), ('lq', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000002 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
Author: Dmitry Pidan, Ran El-Yaniv
Abstract: Focusing on short term trend prediction in a financial context, we consider the problem of selective prediction whereby the predictor can abstain from prediction in order to improve performance. We examine two types of selective mechanisms for HMM predictors. The first is a rejection in the spirit of Chow’s well-known ambiguity principle. The second is a specialized mechanism for HMMs that identifies low quality HMM states and abstains from prediction in those states. We call this model selective HMM (sHMM). In both approaches we can trade off prediction coverage to gain better accuracy in a controlled manner. We compare the performance of the ambiguity-based rejection technique with that of the sHMM approach. Our results indicate that both methods are effective, and that the sHMM model is superior. 1
2 0.34572953 28 nips-2011-Agnostic Selective Classification
Author: Yair Wiener, Ran El-Yaniv
Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this √ rejected region of the domain is shown to be diminishing at rate O(Bθ( 1/m)β ), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1
3 0.17738266 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
Author: Armen Allahverdyan, Aram Galstyan
Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1
4 0.11160544 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
Author: Jia Deng, Sanjeev Satheesh, Alexander C. Berg, Fei Li
Abstract: We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy trade-off in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1
5 0.095276989 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
Author: Matthew D. Zeiler, Graham W. Taylor, Leonid Sigal, Iain Matthews, Rob Fergus
Abstract: We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our model to a challenging real-world graphics problem: facial expression transfer. Our results demonstrate improved performance over several baselines modeling high-dimensional 2D and 3D data. 1
6 0.084259652 212 nips-2011-Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning
7 0.078875884 4 nips-2011-A Convergence Analysis of Log-Linear Training
8 0.078256637 253 nips-2011-Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment
9 0.077170089 88 nips-2011-Environmental statistics and the trade-off between model-based and TD learning in humans
10 0.075229339 221 nips-2011-Priors over Recurrent Continuous Time Processes
11 0.070800684 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes
12 0.067193426 197 nips-2011-On Tracking The Partition Function
13 0.065912776 176 nips-2011-Multi-View Learning of Word Embeddings via CCA
14 0.065413728 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
15 0.063417859 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
16 0.063040644 249 nips-2011-Sequence learning with hidden units in spiking neural networks
17 0.059664577 205 nips-2011-Online Submodular Set Cover, Ranking, and Repeated Active Learning
18 0.058242697 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
19 0.0559544 131 nips-2011-Inference in continuous-time change-point models
20 0.055676598 198 nips-2011-On U-processes and clustering performance
topicId topicWeight
[(0, 0.194), (1, -0.032), (2, 0.014), (3, 0.028), (4, -0.031), (5, -0.027), (6, 0.001), (7, -0.128), (8, -0.086), (9, -0.068), (10, -0.08), (11, -0.067), (12, 0.099), (13, 0.007), (14, -0.061), (15, -0.121), (16, -0.072), (17, -0.053), (18, 0.09), (19, 0.045), (20, 0.196), (21, 0.016), (22, 0.011), (23, 0.058), (24, 0.021), (25, 0.208), (26, -0.154), (27, -0.022), (28, 0.19), (29, -0.067), (30, 0.043), (31, 0.05), (32, 0.098), (33, 0.024), (34, -0.107), (35, -0.123), (36, -0.167), (37, -0.148), (38, -0.031), (39, 0.124), (40, 0.164), (41, 0.022), (42, -0.013), (43, 0.036), (44, -0.12), (45, -0.058), (46, -0.017), (47, -0.074), (48, 0.043), (49, 0.009)]
simIndex simValue paperId paperTitle
same-paper 1 0.96511543 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
Author: Dmitry Pidan, Ran El-Yaniv
Abstract: Focusing on short term trend prediction in a financial context, we consider the problem of selective prediction whereby the predictor can abstain from prediction in order to improve performance. We examine two types of selective mechanisms for HMM predictors. The first is a rejection in the spirit of Chow’s well-known ambiguity principle. The second is a specialized mechanism for HMMs that identifies low quality HMM states and abstains from prediction in those states. We call this model selective HMM (sHMM). In both approaches we can trade off prediction coverage to gain better accuracy in a controlled manner. We compare the performance of the ambiguity-based rejection technique with that of the sHMM approach. Our results indicate that both methods are effective, and that the sHMM model is superior. 1
2 0.77669525 28 nips-2011-Agnostic Selective Classification
Author: Yair Wiener, Ran El-Yaniv
Abstract: For a learning problem whose associated excess loss class is (β, B)-Bernstein, we show that it is theoretically possible to track the same classification performance of the best (unknown) hypothesis in our class, provided that we are free to abstain from prediction in some region of our choice. The (probabilistic) volume of this √ rejected region of the domain is shown to be diminishing at rate O(Bθ( 1/m)β ), where θ is Hanneke’s disagreement coefficient. The strategy achieving this performance has computational barriers because it requires empirical error minimization in an agnostic setting. Nevertheless, we heuristically approximate this strategy and develop a novel selective classification algorithm using constrained SVMs. We show empirically that the resulting algorithm consistently outperforms the traditional rejection mechanism based on distance from decision boundary. 1
3 0.5616408 57 nips-2011-Comparative Analysis of Viterbi Training and Maximum Likelihood Estimation for HMMs
Author: Armen Allahverdyan, Aram Galstyan
Abstract: We present an asymptotic analysis of Viterbi Training (VT) and contrast it with a more conventional Maximum Likelihood (ML) approach to parameter estimation in Hidden Markov Models. While ML estimator works by (locally) maximizing the likelihood of the observed data, VT seeks to maximize the probability of the most likely hidden state sequence. We develop an analytical framework based on a generating function formalism and illustrate it on an exactly solvable model of HMM with one unambiguous symbol. For this particular model the ML objective function is continuously degenerate. VT objective, in contrast, is shown to have only finite degeneracy. Furthermore, VT converges faster and results in sparser (simpler) models, thus realizing an automatic Occam’s razor for HMM learning. For more general scenario VT can be worse compared to ML but still capable of correctly recovering most of the parameters. 1
4 0.50684619 169 nips-2011-Maximum Margin Multi-Label Structured Prediction
Author: Christoph H. Lampert
Abstract: We study multi-label prediction for structured output sets, a problem that occurs, for example, in object detection in images, secondary structure prediction in computational biology, and graph matching with symmetries. Conventional multilabel classification techniques are typically not applicable in this situation, because they require explicit enumeration of the label set, which is infeasible in case of structured outputs. Relying on techniques originally designed for single-label structured prediction, in particular structured support vector machines, results in reduced prediction accuracy, or leads to infeasible optimization problems. In this work we derive a maximum-margin training formulation for multi-label structured prediction that remains computationally tractable while achieving high prediction accuracy. It also shares most beneficial properties with single-label maximum-margin approaches, in particular formulation as a convex optimization problem, efficient working set training, and PAC-Bayesian generalization bounds. 1
5 0.48784822 19 nips-2011-Active Classification based on Value of Classifier
Author: Tianshi Gao, Daphne Koller
Abstract: Modern classification tasks usually involve many class labels and can be informed by a broad range of features. Many of these tasks are tackled by constructing a set of classifiers, which are then applied at test time and then pieced together in a fixed procedure determined in advance or at training time. We present an active classification process at the test time, where each classifier in a large ensemble is viewed as a potential observation that might inform our classification process. Observations are then selected dynamically based on previous observations, using a value-theoretic computation that balances an estimate of the expected classification gain from each observation as well as its computational cost. The expected classification gain is computed using a probabilistic model that uses the outcome from previous observations. This active classification process is applied at test time for each individual test instance, resulting in an efficient instance-specific decision path. We demonstrate the benefit of the active scheme on various real-world datasets, and show that it can achieve comparable or even higher classification accuracy at a fraction of the computational costs of traditional methods.
6 0.45042482 221 nips-2011-Priors over Recurrent Continuous Time Processes
7 0.44913056 4 nips-2011-A Convergence Analysis of Log-Linear Training
8 0.44248122 17 nips-2011-Accelerated Adaptive Markov Chain for Partition Function Computation
9 0.41187108 96 nips-2011-Fast and Balanced: Efficient Label Tree Learning for Large Scale Object Recognition
10 0.40619203 8 nips-2011-A Model for Temporal Dependencies in Event Streams
11 0.40094942 173 nips-2011-Modelling Genetic Variations using Fragmentation-Coagulation Processes
12 0.39349401 38 nips-2011-Anatomically Constrained Decoding of Finger Flexion from Electrocorticographic Signals
13 0.39084911 55 nips-2011-Collective Graphical Models
14 0.37734428 33 nips-2011-An Exact Algorithm for F-Measure Maximization
15 0.36977458 94 nips-2011-Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines
16 0.3645753 147 nips-2011-Learning Patient-Specific Cancer Survival Distributions as a Sequence of Dependent Regressors
17 0.36289424 59 nips-2011-Composite Multiclass Losses
18 0.35364822 103 nips-2011-Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss
19 0.34652776 237 nips-2011-Reinforcement Learning using Kernel-Based Stochastic Factorization
20 0.34553495 254 nips-2011-Similarity-based Learning via Data Driven Embeddings
topicId topicWeight
[(0, 0.043), (2, 0.119), (4, 0.035), (20, 0.024), (26, 0.023), (31, 0.184), (33, 0.044), (43, 0.051), (45, 0.194), (57, 0.032), (65, 0.018), (74, 0.049), (83, 0.041), (84, 0.021), (99, 0.038)]
simIndex simValue paperId paperTitle
same-paper 1 0.94263327 246 nips-2011-Selective Prediction of Financial Trends with Hidden Markov Models
Author: Dmitry Pidan, Ran El-Yaniv
Abstract: Focusing on short term trend prediction in a financial context, we consider the problem of selective prediction whereby the predictor can abstain from prediction in order to improve performance. We examine two types of selective mechanisms for HMM predictors. The first is a rejection in the spirit of Chow’s well-known ambiguity principle. The second is a specialized mechanism for HMMs that identifies low quality HMM states and abstains from prediction in those states. We call this model selective HMM (sHMM). In both approaches we can trade off prediction coverage to gain better accuracy in a controlled manner. We compare the performance of the ambiguity-based rejection technique with that of the sHMM approach. Our results indicate that both methods are effective, and that the sHMM model is superior. 1
2 0.93287706 191 nips-2011-Nonnegative dictionary learning in the exponential noise model for adaptive music signal representation
Author: Onur Dikmen, Cédric Févotte
Abstract: In this paper we describe a maximum likelihood approach for dictionary learning in the multiplicative exponential noise model. This model is prevalent in audio signal processing where it underlies a generative composite model of the power spectrogram. Maximum joint likelihood estimation of the dictionary and expansion coefficients leads to a nonnegative matrix factorization problem where the Itakura-Saito divergence is used. The optimality of this approach is in question because the number of parameters (which include the expansion coefficients) grows with the number of observations. In this paper we describe a variational procedure for optimization of the marginal likelihood, i.e., the likelihood of the dictionary where the activation coefficients have been integrated out (given a specific prior). We compare the output of both maximum joint likelihood estimation (i.e., standard Itakura-Saito NMF) and maximum marginal likelihood estimation (MMLE) on real and synthetical datasets. The MMLE approach is shown to embed automatic model order selection, akin to automatic relevance determination.
3 0.91114414 223 nips-2011-Probabilistic Joint Image Segmentation and Labeling
Author: Adrian Ion, Joao Carreira, Cristian Sminchisescu
Abstract: We present a joint image segmentation and labeling model (JSL) which, given a bag of figure-ground segment hypotheses extracted at multiple image locations and scales, constructs a joint probability distribution over both the compatible image interpretations (tilings or image segmentations) composed from those segments, and over their labeling into categories. The process of drawing samples from the joint distribution can be interpreted as first sampling tilings, modeled as maximal cliques, from a graph connecting spatially non-overlapping segments in the bag [1], followed by sampling labels for those segments, conditioned on the choice of a particular tiling. We learn the segmentation and labeling parameters jointly, based on Maximum Likelihood with a novel Incremental Saddle Point estimation procedure. The partition function over tilings and labelings is increasingly more accurately approximated by including incorrect configurations that a not-yet-competent model rates probable during learning. We show that the proposed methodology matches the current state of the art in the Stanford dataset [2], as well as in VOC2010, where 41.7% accuracy on the test set is achieved.
4 0.90383065 241 nips-2011-Scalable Training of Mixture Models via Coresets
Author: Dan Feldman, Matthew Faulkner, Andreas Krause
Abstract: How can we train a statistical mixture model on a massive data set? In this paper, we show how to construct coresets for mixtures of Gaussians and natural generalizations. A coreset is a weighted subset of the data, which guarantees that models fitting the coreset will also provide a good fit for the original data set. We show that, perhaps surprisingly, Gaussian mixtures admit coresets of size independent of the size of the data set. More precisely, we prove that a weighted set of O(dk3 /ε2 ) data points suffices for computing a (1 + ε)-approximation for the optimal model on the original n data points. Moreover, such coresets can be efficiently constructed in a map-reduce style computation, as well as in a streaming setting. Our results rely on a novel reduction of statistical estimation to problems in computational geometry, as well as new complexity results about mixtures of Gaussians. We empirically evaluate our algorithms on several real data sets, including a density estimation problem in the context of earthquake detection using accelerometers in mobile phones. 1
5 0.89828873 283 nips-2011-The Fixed Points of Off-Policy TD
Author: J. Z. Kolter
Abstract: Off-policy learning, the ability for an agent to learn about a policy other than the one it is following, is a key element of Reinforcement Learning, and in recent years there has been much work on developing Temporal Different (TD) algorithms that are guaranteed to converge under off-policy sampling. It has remained an open question, however, whether anything can be said a priori about the quality of the TD solution when off-policy sampling is employed with function approximation. In general the answer is no: for arbitrary off-policy sampling the error of the TD solution can be unboundedly large, even when the approximator can represent the true value function well. In this paper we propose a novel approach to address this problem: we show that by considering a certain convex subset of off-policy distributions we can indeed provide guarantees as to the solution quality similar to the on-policy case. Furthermore, we show that we can efficiently project on to this convex set using only samples generated from the system. The end result is a novel TD algorithm that has approximation guarantees even in the case of off-policy sampling and which empirically outperforms existing TD methods. 1
6 0.88899362 229 nips-2011-Query-Aware MCMC
7 0.88733369 278 nips-2011-TD gamma: Re-evaluating Complex Backups in Temporal Difference Learning
8 0.88593477 197 nips-2011-On Tracking The Partition Function
9 0.88534796 37 nips-2011-Analytical Results for the Error in Filtering of Gaussian Processes
10 0.88499808 243 nips-2011-Select and Sample - A Model of Efficient Neural Inference and Learning
11 0.88168621 180 nips-2011-Multiple Instance Filtering
12 0.88040197 163 nips-2011-MAP Inference for Bayesian Inverse Reinforcement Learning
13 0.87949759 32 nips-2011-An Empirical Evaluation of Thompson Sampling
14 0.8769654 10 nips-2011-A Non-Parametric Approach to Dynamic Programming
15 0.87655991 134 nips-2011-Infinite Latent SVM for Classification and Multi-task Learning
16 0.87646711 156 nips-2011-Learning to Learn with Compound HD Models
17 0.87560034 149 nips-2011-Learning Sparse Representations of High Dimensional Data on Large Scale Dictionaries
18 0.87079924 58 nips-2011-Complexity of Inference in Latent Dirichlet Allocation
19 0.87033522 206 nips-2011-Optimal Reinforcement Learning for Gaussian Systems
20 0.86850023 271 nips-2011-Statistical Tests for Optimization Efficiency