nips nips2009 nips2009-69 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Marcus Hutter
Abstract: The Minimum Description Length (MDL) principle selects the model that has the shortest code for data plus model. We show that for a countable class of models, MDL predictions are close to the true distribution in a strong sense. The result is completely general. No independence, ergodicity, stationarity, identifiability, or other assumption on the model class need to be made. More formally, we show that for any countable class of models, the distributions selected by MDL (or MAP) asymptotically predict (merge with) the true measure in the class in total variation distance. Implications for non-i.i.d. domains like time-series forecasting, discriminative learning, and reinforcement learning are discussed. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract The Minimum Description Length (MDL) principle selects the model that has the shortest code for data plus model. [sent-4, score-0.124]
2 We show that for a countable class of models, MDL predictions are close to the true distribution in a strong sense. [sent-5, score-0.388]
3 More formally, we show that for any countable class of models, the distributions selected by MDL (or MAP) asymptotically predict (merge with) the true measure in the class in total variation distance. [sent-8, score-0.526]
4 Implications for non-i.i.d. domains like time-series forecasting, discriminative learning, and reinforcement learning are discussed. [sent-12, score-0.097]
5 The better the compression, the more regularity has been detected, hence the better the predictions will be. [sent-14, score-0.07]
6 Classical prediction is concerned with h = 1, multi-step lookahead with 1 < h < ∞, and total prediction with h = ∞. [sent-29, score-0.18]
7 A problem more specific to computer science is (infinite horizon) reinforcement learning, where predicting the infinite future is necessary for evaluating a policy. [sent-32, score-0.055]
8 Let M = {Q1, Q2, ...} be a countable class of models = theories = hypotheses = probabilities over sequences X∞, sorted w.r.t. [sent-38, score-0.377]
9 Since we do not know P , we could select the Q ∈ M that leads to the shortest code on the observed data x. [sent-48, score-0.07]
10 In order to be able to reconstruct x from the code we need to know which Q has been chosen, so we also need to code Q, which takes K(Q) bits. [sent-49, score-0.074]
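As a toy illustration of this two-part code selection (a sketch, not taken from the paper; the Bernoulli models and the codelengths K(Q) are invented for the example), MDL picks the Q that minimizes the data codelength −log2 Q(x) plus the model codelength K(Q) on the observed prefix:

```python
import math

def two_part_codelength(q_prob_of_x, complexity_bits):
    # Two-part code in bits: -log2 Q(x) for the data given the model, plus K(Q) for the model.
    return -math.log2(q_prob_of_x) + complexity_bits

def bernoulli(theta):
    # Q(x) for a binary prefix x under an i.i.d. Bernoulli(theta) model (illustrative choice).
    return lambda x: math.prod(theta if b == 1 else 1 - theta for b in x)

# Hypothetical two-model class with made-up codelengths K(Q) in bits.
models = {
    "fair":   (bernoulli(0.5), 1.0),  # simple model, short description
    "biased": (bernoulli(0.9), 5.0),  # more specific model, longer description
}

def mdl_select(models, x):
    # Pick the Q minimizing -log2 Q(x) + K(Q) on the observed prefix x.
    return min(models, key=lambda name: two_part_codelength(models[name][0](x), models[name][1]))

x = [1, 1, 1, 0, 1, 1, 1, 1] * 2   # observed prefix with mostly ones
print(mdl_select(models, x))       # 'biased': the likelihood gain now pays for the extra 4 bits
```

On only the first 8 symbols the simpler "fair" model still wins; the selection flips once the likelihood advantage of the more specific model exceeds its extra description bits.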
11 Given x, the true predictive probability of some “future” event A is P[A|x], e.g. [sent-52, score-0.094]
12 A could be xℓ+1:ℓ+h or any other measurable set of sequences (see Section 3 for proper definitions). [sent-54, score-0.081]
13 We consider the sequence of predictive measures MDLx[·|x] for ℓ = 0, 1, 2, ... [sent-55, score-0.145]
14 Our main result is that MDLx[·|x] converges to P[·|x] in total variation distance for ℓ → ∞ with P-probability 1 (see Theorem 1). [sent-59, score-0.058]
15 The analogous result for Bayesian prediction is well-known, and an immediate corollary of Blackwell and Dubins' celebrated merging-of-opinions theorem [BD62]. [sent-60, score-0.155]
16 A priori it is not obvious that it holds at all, and indeed the proof turns out to be much more complex. [sent-62, score-0.067]
17 The results above hold for completely arbitrary countable model classes M. [sent-64, score-0.294]
18 For instance, asymptotic consistency has been shown in [Bar85]. [sent-72, score-0.07]
19 Non-i.i.d. data is pervasive [AHRU09]; it includes all time-series prediction problems like weather forecasting and stock market prediction [CBL06]. [sent-80, score-0.162]
20 Also stationarity is easily violated in multi-agent scenarios: An environment which itself contains a learning agent is non-stationary (during the relevant learning phase). [sent-84, score-0.114]
21 Extensive games and multi-agent reinforcement learning are classical examples [WR04]. [sent-85, score-0.093]
22 For nonergodic environments, asymptotic distinguishability can depend on the realized observations, which prevent a prior reduction or partitioning of M. [sent-87, score-0.105]
23 Indeed this problem is the primary reason for considering predictive MDL. [sent-91, score-0.061]
24 For arbitrary countable model classes, the following results are known: The MDL one-step lookahead predictor (i.e. [sent-93, score-0.378]
25 h = 1) of three variants of MDL converges to the true predictive distribution. [sent-95, score-0.119]
26 Consistency is shown (only) in probability and the predictive implications of the result are unclear. [sent-100, score-0.094]
27 A semi-parametric problem class ∪_{d=1}^∞ Md with Md = {Qθ,d : θ ∈ ℝ^d} (say) can be reduced to a countable class M = {Pd} for which our result holds, where Pd is a Bayes or NML or other estimate of Md [Grü07]. [sent-108, score-0.36]
28 Alternatively, ∪_d Md could be reduced to a countable class by considering only computable parameters θ. [sent-109, score-0.327]
29 Essentially all interesting model classes contain such a countable topologically dense subset. [sent-110, score-0.294]
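As a toy sketch of such a reduction (not from the paper; the dyadic discretization and the codelength assignment are assumptions made purely for illustration), one can index a countable, topologically dense subclass of the Bernoulli family by finite-precision parameters and charge a codelength that grows with the precision:

```python
def countable_bernoulli_subclass(max_bits):
    # Truncated countable, dense subclass of {Bernoulli(theta) : theta in (0,1)}:
    # dyadic rationals theta = k / 2^b with odd k, and a crude illustrative codelength
    # K(Q) = 2*b bits (b bits for the precision level, b for the index within it).
    cls = []
    for b in range(1, max_bits + 1):
        for k in range(1, 2 ** b, 2):          # odd k only, so each theta appears once
            cls.append((k / 2 ** b, 2 * b))    # (theta, K(Q) in bits)
    return cls

subclass = countable_bernoulli_subclass(4)
print(len(subclass))   # 1 + 2 + 4 + 8 = 15 models up to precision 4
print(subclass[:3])    # [(0.5, 2), (0.25, 4), (0.75, 4)]
```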
30 Finally, the techniques for the countable case might aid proving general results for continuous M, possibly along the lines of [Rya09]. [sent-113, score-0.32]
31 The paper is organized as follows: In Section 2 we provide some insights into how MDL works in restricted settings, what breaks down for general countable M, and how to circumvent the problems. [sent-115, score-0.349]
32 In Section 6 we show how the result can be applied to sequence prediction, classification and regression, discriminative learning, and reinforcement learning. [sent-118, score-0.145]
33 2 Facts, Insights, Problems. Before starting with the formal development, we describe how MDL works in some restricted settings, what breaks down for general countable M, and how to circumvent the problems. [sent-120, score-0.376]
34 We have to give up the idea of model identification, and concentrate on predictive performance. [sent-127, score-0.061]
35 Given the true observations x ≡ x^P_{1:ℓ} so far, MDL selects the simplest Q consistent with x^P_{1:ℓ} and for h = 1 predicts x^Q_{ℓ+1}. [sent-138, score-0.087]
36 Since elimination occurs in order of increasing index i, and Qm never makes any error, MDL makes at most m−1 prediction errors. [sent-141, score-0.094]
37 For 1 < h < ∞, the prediction x^Q_{ℓ+1:ℓ+h} may be wrong only on x^Q_{ℓ+h}, which causes h wrong predictions before the error is revealed. [sent-143, score-0.208]
38 The bound is for instance attained on the class consisting of Qi = 1^{ih} 0^∞, and the true sequence switches from 1 to 0 after having observed m·h ones. [sent-146, score-0.114]
39 For h = ∞, a wrong prediction gets eventually revealed. [sent-147, score-0.18]
40 Hence each wrong Qi (i < m) eventually gets eliminated, i.e. [sent-148, score-0.13]
41 So the special deterministic case illustrates the more complex probabilistic case. [sent-157, score-0.073]
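The deterministic error bound can be checked by direct simulation. The sketch below (not from the paper) implements the elimination argument for the class Qi = 1^{ih} 0^∞ with true sequence Qm, counting the time steps whose h-step prediction block is wrong; the count attains the (m−1)·h bound:

```python
def q(i, h):
    # Deterministic model Q_i = 1^{i*h} 0^infinity, as an oracle over 0-based positions t.
    return lambda t: 1 if t < i * h else 0

def mdl_h_step_errors(m, h, horizon):
    # Count the time steps at which the h-step MDL prediction block is wrong,
    # when the class is {Q_1, ..., Q_m} and the true sequence is Q_m.
    truth = q(m, h)
    errors = 0
    for ell in range(horizon):
        observed = [truth(t) for t in range(ell)]
        # simplest (lowest-index) model still consistent with the observed prefix
        chosen = next(i for i in range(1, m + 1)
                      if all(q(i, h)(t) == observed[t] for t in range(ell)))
        predicted = [q(chosen, h)(t) for t in range(ell, ell + h)]
        actual = [truth(t) for t in range(ell, ell + h)]
        errors += predicted != actual
    return errors

print(mdl_h_step_errors(m=5, h=3, horizon=40))   # 12, matching the (m-1)*h bound
```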
42 Further, while the Bayesian bound trivially follows from the half-century-old classical merging-of-opinions result [BD62], the corresponding MDL bound we prove in this paper is more difficult to obtain. [sent-159, score-0.112]
43 For an i.i.d. class M, the law of large numbers applied to the random variables Zt := log[P(xt)/Q(xt)] implies (1/ℓ) Σ_{t=1}^ℓ Zt → KL(P||Q) := Σ_{x1} P(x1) log[P(x1)/Q(x1)] with P-probability 1. [sent-164, score-0.101]
44 For countable M, a refinement of this argument shows that MDL eventually selects P [BC91]. [sent-168, score-0.387]
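A small numerical illustration of this law-of-large-numbers argument (a sketch, not from the paper; the Bernoulli candidates and the codelengths K(Q) are invented): every wrong model pays roughly KL(P||Q) > 0 extra bits per symbol, which eventually outweighs any fixed difference in K(Q).

```python
import math, random

random.seed(0)

# Hypothetical i.i.d. class: Bernoulli(theta) candidates with made-up codelengths K(Q);
# the data are sampled from theta = 0.7, deliberately given the largest K(Q).
thetas = [0.3, 0.5, 0.7, 0.9]
K = {0.3: 2.0, 0.5: 1.0, 0.7: 3.0, 0.9: 2.0}
true_theta = 0.7

def neg_log2_likelihood(theta, xs):
    return -sum(math.log2(theta if x == 1 else 1 - theta) for x in xs)

xs = [1 if random.random() < true_theta else 0 for _ in range(2000)]

for n in (10, 100, 2000):
    chosen = min(thetas, key=lambda th: neg_log2_likelihood(th, xs[:n]) + K[th])
    print(n, chosen)
# The cumulative codelength overhead of a wrong theta grows like n * KL(P||Q),
# so for large n the true parameter is selected despite its larger K(Q).
```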
45 infinitely often larger and smaller than its limit) sequence θt → θ0 one can show that log[P(x1:t)/Q(x1:t)] converges to but oscillates around K(Q) − K(P) w.p.1. [sent-177, score-0.073]
46 there are nonstationary distributions for which MDL does not converge (not even to a wrong distribution). [sent-181, score-0.065]
47 One idea to solve this problem is to partition M, where two distributions are in the same partition if and only if they are asymptotically indistinguishable (like P and Q above), and then ask MDL to only identify a partition. [sent-182, score-0.095]
48 For x1 = 1, let P and Q be asymptotically indistinguishable, e. [sent-184, score-0.062]
49 For x1 =0, let P and Q be asymptotically distinguishable distributions, e. [sent-187, score-0.062]
50 This shows that for non-ergodic sources like this one, asymptotic distinguishability depends on the drawn sequence. [sent-190, score-0.089]
51 We just measure predictive success, and accept infinite oscillations. [sent-196, score-0.099]
52 We need probability measures and filtrations for infinite sequences, conditional probabilities and densities, the total variation distance, and the concept of merging (of opinions), in order to formally state our main result. [sent-199, score-0.104]
53 Let (Ω := X ∞ ,F,P ) be the space of infinite sequences with natural filtration and product σ-field F and probability measure P . [sent-201, score-0.088]
54 Let ω ∈ Ω be an infinite sequence sampled from the true measure P . [sent-202, score-0.119]
55 For countable X , the probability that an infinite sequence starts with x is P (x):=P [{x}×X ∞ ]. [sent-214, score-0.342]
56 P is said to be absolutely continuous relative to Q, written P ≪ Q :⇔ [Q[A] = 0 implies P[A] = 0 for all A ∈ F]. P and Q are said to be mutually singular, written P ⊥ Q, iff there exists an A ∈ F for which P[A] = 1 and Q[A] = 0. [sent-221, score-0.204]
57 The total variation distance (tvd) between Q and P given x is defined as d(P, Q|x) := sup_{A∈F} |Q[A|x] − P[A|x]| (1). Q is said to predict P in tvd (or merge with P) if d(P, Q|x) → 0 for ℓ(x) → ∞ with P-probability 1. [sent-222, score-0.164]
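For intuition about the distance in (1), the sketch below (not from the paper) computes the supremum restricted to events determined by the next h symbols of a binary alphabet, which equals half the L1 distance between the two block distributions and lower-bounds the full tvd; the predictive distributions are assumed conditionally i.i.d. Bernoulli purely for illustration.

```python
import math
from itertools import product

def bernoulli_block(theta):
    # Conditional probability of a length-h binary block under i.i.d. Bernoulli(theta) predictions.
    return lambda z: math.prod(theta if b == 1 else 1 - theta for b in z)

def tvd_next_h(p_cond, q_cond, h, alphabet=(0, 1)):
    # sup over events A determined by the next h symbols of |Q[A|x] - P[A|x]|,
    # i.e. half the L1 distance between the two block distributions; this only
    # lower-bounds the tvd in (1), whose sup ranges over all measurable future events.
    return 0.5 * sum(abs(p_cond(z) - q_cond(z)) for z in product(alphabet, repeat=h))

print(tvd_next_h(bernoulli_block(0.5), bernoulli_block(0.6), h=5))
```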
58 Note that this in particular implies, but is stronger than, one-step predictive on- and off-sequence convergence Q(xℓ+1 = aℓ+1 | x1:ℓ) − P(xℓ+1 = aℓ+1 | x1:ℓ) → 0 for any a, not necessarily equal to ω [KL94]. [sent-223, score-0.061]
59 The famous Blackwell and Dubins convergence result [BD62] states that if P is absolutely continuous relative to Q, then (and only then [KL94]) Q merges with P: If P ≪ Q then d(P, Q|x) → 0 w.p.1. [sent-224, score-0.129]
60 Let M = {Q1, Q2, ...} be a countable (finite or infinite) class of probability measures, and Bayes[A] := Σ_{Q∈M} wQ Q[A] with wQ > 0 ∀Q and Σ_{Q∈M} wQ = 1. [sent-231, score-0.327]
61 If the model assumption P ∈ M holds, then obviously P ≪ Bayes, hence Bayes merges with P, i.e. [sent-232, score-0.089]
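A minimal sketch of this Bayes mixture predictor (not from the paper; the truncated three-model class and the prior weights wQ are invented):

```python
import math

def bernoulli(theta):
    # Q(x) for a binary prefix x under an i.i.d. Bernoulli(theta) model.
    return lambda x: math.prod(theta if b == 1 else 1 - theta for b in x)

def bayes(models, weights, x):
    # Bayes(x) := sum_Q w_Q * Q(x), the mixture probability of the finite prefix x.
    return sum(w * q(x) for q, w in zip(models, weights))

models = [bernoulli(0.3), bernoulli(0.5), bernoulli(0.7)]   # hypothetical (truncated) class
weights = [0.5, 0.25, 0.25]                                  # prior weights w_Q, summing to 1

x = [1, 1, 0, 1, 1, 1, 0, 1]
# One-step predictive probability Bayes(x_{l+1} = 1 | x) = Bayes(x + [1]) / Bayes(x).
# If the true P is in the class (so w_P > 0), then P << Bayes and this prediction merges with P.
print(bayes(models, weights, x + [1]) / bayes(models, weights, x))
```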
62 Unlike many other Bayesian convergence and consistency theorems, no (independence, ergodicity, stationarity, identifiability, or other) assumption on the model class M needs to be made. [sent-237, score-0.075]
63 The analogous result for MDL is as follows: Theorem 1 (MDL predictions) Let M be a countable class of probability measures on X ∞ containing the unknown true sampling distribution P . [sent-238, score-0.424]
64 Then the predictive distributions MDLx[·|x] converge to P[·|x] in the sense that d(P, MDLx|x) ≡ sup_{A∈F} |MDLx[A|x] − P[A|x]| → 0 for ℓ(x) → ∞ w.p.1. [sent-241, score-0.1]
65 The proof of the theorem is surprisingly subtle and complex compared to the analogous Bayesian case. [sent-246, score-0.14]
66 For arbitrary measurable spaces X, definitions are more subtle, essentially because point probabilities Q(x) have to be replaced by probability densities relative to some base measure M, usually Lebesgue for X = ℝ^d, counting measure for countable X, and e.g. [sent-249, score-0.428]
67 For this we need the following Definition and Lemma: Definition 2 (Relations between Q and P) For any probability measures Q and P, let • Qr + Qs = Q be the Lebesgue decomposition of Q relative to P into an absolutely continuous non-negative measure Qr ≪ P and a singular non-negative measure Qs ⊥ P. [sent-259, score-0.194]
68 • g(ω) := dQr/dP = lim_{ℓ→∞} [Q(x1:ℓ)/P(x1:ℓ)] be (a version of) the Radon-Nikodym derivative, i.e. [sent-260, score-0.055]
69 The representation of the Radon-Nikodym derivative as a limit of local densities can e. [sent-266, score-0.091]
70 Qr ≪ P implies that the limit Z∞ is the Radon-Nikodym derivative dQr/dP. [sent-274, score-0.132]
71 (Indeed, Doob’s martingale convergence theorem can be used to prove the Radon-Nikodym theorem.) [sent-275, score-0.077]
72 (iii) says that even if P is not absolutely continuous relative to Q, we still have d(P,Q|x) → 0 on almost every sequence that has a positive limit of Q(x)/P(x). [sent-283, score-0.13]
73 (i⇐) Assume P [Ω◦ ]=0: P [A]>0 implies Q[A]≥Qr [A]= P [Ω◦ ] = 0. [sent-286, score-0.068]
74 Now Qr[Ω◦] = ∫_{Ω◦} g dP = 0 implies 0 ≤ Q[B ∩ Ω◦] ≤ Qs[B] + Qr[Ω◦] = 0 + 0. [sent-291, score-0.068]
75 By P ≪ Q this implies P[B ∩ Ω◦] = 0, hence P[Ω◦] = 0. [sent-292, score-0.11]
76 That P ≪ Q implies P[Ω] = 1 is Blackwell-Dubins’ celebrated result. [sent-294, score-0.101]
77 Since g >0 outside Ω◦ , this implies P [A\Ω◦ ] = 0. [sent-302, score-0.068]
78 Now (ii) implies d(P, Q|x) → 0 with P-probability 1. [sent-305, score-0.068]
79 MDL will asymptotically not select Q for which Q(x)/P (x) → 0. [sent-317, score-0.062]
80 The technical difficulties are for finite M that the eligible Q depend on the sequence ω, and for infinite M to deal with non-uniformly converging d, i. [sent-321, score-0.076]
81 The set of sequences ω for which gQ is undefined for some Q ∈ M has P-measure zero, and hence can be ignored. [sent-329, score-0.13]
82 is +∞, hence ∀Q ∈ Mω ∃ℓQ ∀ℓ > ℓQ : LQ(x) > LP(x). Since M is finite, this implies ∀ℓ > ℓ0 ∀Q ∈ Mω : LQ(x) > LP(x), where ℓ0 := max{ℓQ : Q ∈ Mω} < ∞. Therefore, since P ∈ M, we have MDLx ∈ M̄ω ∀ℓ > ℓ0, so we can safely ignore all Q ∈ Mω and focus on Q ∈ M̄ω := M\Mω. [sent-335, score-0.11]
83 Q ∈ M̄ω ⇒ gQ(ω) > 0 ⇒ ω ∉ Ω◦Q ⇒ ω ∈ ΩQ ⇒ d(P, Q|x) → 0. This implies d(P, MDLx|x) ≤ sup_{Q∈M̄ω} d(P, Q|x) → 0, where the inequality holds for ℓ > ℓ0 and the limit holds, since M is finite. [sent-338, score-0.179]
84 We want to prove that the probability that MDL asymptotically selects “complex” Q is small. [sent-343, score-0.116]
85 The following Lemma establishes that the probability that MDL selects a specific complex Q infinitely often is small. [sent-344, score-0.088]
86 Lemma 4 (MDL avoids complex probability measures Q) For any Q and P we have P[Q(x)/P(x) ≥ c infinitely often] ≤ 1/c. [sent-345, score-0.07]
87 For sufficiently complex Q, Lemma 4 implies that LQ (x) > LP (x) for most x. [sent-350, score-0.102]
88 First we bound sup_{i>m} Qi(x) 2^{−K(Qi)} / P(x) ≤ Σ_{i=m+1}^∞ Qi(x) 2^{−K(Qi)} / P(x) = δ Q̄(x) / P(x), where Q̄(x) := (1/δ) Σ_{i>m} Qi(x) 2^{−K(Qi)} and δ := Σ_{i>m} 2^{−K(Qi)}. While MDL·[·] is not a (single) measure on Ω and hence difficult to deal with, Q̄ is a proper probability measure on Ω. [sent-365, score-0.157]
89 , Qm} with probability at least 1 − ε. Hence the already proven Theorem 1 for finite M implies that d(P, MDLx|x) → 0 with probability at least 1 − ε. [sent-372, score-0.068]
90 Since convergence holds for every ε > 0, it holds w.p.1. [sent-373, score-0.066]
91 We illustrate some immediate implications of Theorem 1 for time-series forecasting, classification, regression, discriminative learning, and reinforcement learning. [sent-377, score-0.13]
92 Classical online sequence prediction is concerned with predicting xℓ+1 from (non-i.i.d.) x1:ℓ. [sent-379, score-0.124]
93 Offline learning is concerned with training a predictor on x1:ℓ for fixed ℓ in-house, and then selling and using the predictor on xℓ+1:∞ without further learning. [sent-388, score-0.086]
94 If we assume that also y follows some distribution, and start with a countable model class M of joint distributions Q(x,y) which contains the true joint distribution P(x,y), our main result implies that MDLD[(x,y)|D] converges to the true distribution P(x,y). [sent-399, score-0.486]
95 Since the x given y are not identically distributed, classical MDL consistency results for i.i.d. [sent-415, score-0.08]
96 The following corollary formalizes our findings: Corollary 5 (Discriminative MDL) Let M ∋ P be a class of discriminative causal distributions Q[·|y1:∞], i.e. [sent-419, score-0.075]
97 Let MDLx|y := argmin_{Q∈M} {−log Q(x|y) + K(Q)} be the discriminative MDL measure (at time ℓ, given x, y). [sent-424, score-0.08]
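The sketch below (not from the paper) instantiates this discriminative selection rule on a tiny, invented class of conditional models Q(x|y) with made-up codelengths K(Q):

```python
import math

def neg_log2_cond_likelihood(q_cond, xs, ys):
    # -log2 Q(x_{1:l} | y_{1:l}) for x_t conditionally independent given y_t.
    return -sum(math.log2(q_cond(x, y)) for x, y in zip(xs, ys))

def discriminative_mdl(models, xs, ys):
    # argmin_Q { -log2 Q(x|y) + K(Q) } over (conditional model, codelength) pairs; returns index.
    costs = [neg_log2_cond_likelihood(q, xs, ys) + k for q, k in models]
    return min(range(len(models)), key=lambda i: costs[i])

# Hypothetical class: Q(x|y) as the probability of x given y, with invented K(Q) in bits.
models = [
    (lambda x, y: 0.5, 1.0),                                                 # ignores y
    (lambda x, y: (0.9 if y else 0.1) if x else (0.1 if y else 0.9), 4.0),   # x tracks y w.p. 0.9
]
ys = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]
xs = ys[:]                                   # here x_t = y_t, so the y-dependent model should win
print(discriminative_mdl(models, xs, ys))    # 1, once the data pay for the extra 3 bits
```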
98 For finite Y and conditionally independent x, the intuitive reason why this can work is as follows: If y appears in y1:∞ only finitely often, it plays asymptotically no role; if it appears infinitely often, then P(·|y) can be learned. [sent-426, score-0.062]
99 For infinite Y and deterministic M, the result is also intelligible: Every y might appear only once, but probing enough function values xt = f(yt) allows one to identify the function. [sent-427, score-0.064]
100 In the agent framework [RN03], an agent interacts with an environment in cycles. [sent-429, score-0.082]
wordName wordTfidf (topN-words)
[('mdl', 0.714), ('mdlx', 0.33), ('countable', 0.294), ('qi', 0.163), ('qr', 0.103), ('lq', 0.103), ('gq', 0.099), ('qs', 0.09), ('xq', 0.078), ('stationarity', 0.073), ('gr', 0.073), ('wq', 0.07), ('implies', 0.068), ('nitely', 0.067), ('wrong', 0.065), ('asymptotically', 0.062), ('forecasting', 0.062), ('logq', 0.062), ('predictive', 0.061), ('ergodicity', 0.056), ('absolutely', 0.056), ('lim', 0.055), ('reinforcement', 0.055), ('lookahead', 0.054), ('selects', 0.054), ('lp', 0.052), ('qm', 0.052), ('nite', 0.051), ('prediction', 0.05), ('lebesgue', 0.05), ('dp', 0.05), ('sequences', 0.05), ('sequence', 0.048), ('merges', 0.047), ('md', 0.045), ('elimination', 0.044), ('theorem', 0.044), ('says', 0.043), ('lemma', 0.043), ('consistency', 0.042), ('discriminative', 0.042), ('hence', 0.042), ('agent', 0.041), ('codelength', 0.041), ('dqr', 0.041), ('nonergodic', 0.041), ('ravens', 0.041), ('tvd', 0.041), ('limit', 0.039), ('deterministic', 0.039), ('sup', 0.039), ('eventually', 0.039), ('opinions', 0.039), ('measure', 0.038), ('classical', 0.038), ('code', 0.037), ('surely', 0.036), ('environments', 0.036), ('distinguishability', 0.036), ('measures', 0.036), ('merging', 0.035), ('independence', 0.034), ('proof', 0.034), ('complex', 0.034), ('variation', 0.033), ('logp', 0.033), ('celebrated', 0.033), ('martingale', 0.033), ('indistinguishable', 0.033), ('marcus', 0.033), ('implications', 0.033), ('class', 0.033), ('true', 0.033), ('holds', 0.033), ('bayes', 0.033), ('shortest', 0.033), ('measurable', 0.031), ('xp', 0.03), ('predictor', 0.03), ('identi', 0.03), ('circumvent', 0.029), ('predictions', 0.028), ('asymptotic', 0.028), ('analogous', 0.028), ('converging', 0.028), ('formal', 0.027), ('densities', 0.027), ('said', 0.027), ('carries', 0.027), ('continuous', 0.026), ('concerned', 0.026), ('iii', 0.026), ('ii', 0.026), ('breaks', 0.026), ('gets', 0.026), ('converges', 0.025), ('sources', 0.025), ('derivative', 0.025), ('xt', 0.025), ('merge', 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 69 nips-2009-Discrete MDL Predicts in Total Variation
Author: Marcus Hutter
Abstract: The Minimum Description Length (MDL) principle selects the model that has the shortest code for data plus model. We show that for a countable class of models, MDL predictions are close to the true distribution in a strong sense. The result is completely general. No independence, ergodicity, stationarity, identifiability, or other assumption on the model class need to be made. More formally, we show that for any countable class of models, the distributions selected by MDL (or MAP) asymptotically predict (merge with) the true measure in the class in total variation distance. Implications for non-i.i.d. domains like time-series forecasting, discriminative learning, and reinforcement learning are discussed. 1
2 0.089296594 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette
Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard p -regularized objective functions currently used, such as ridge regression and p -regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1
3 0.08342506 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
Author: Peter Orbanz
Abstract: We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finitedimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data. 1
4 0.071198992 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
Author: Samory Kpotufe
Abstract: It was recently shown that certain nonparametric regressors can escape the curse of dimensionality when the intrinsic dimension of data is low ([1, 2]). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, adapts to intrinsic dimension, operates on general metrics, yields a smooth function, and evaluates in time O(log n). We derive a tight convergence rate of the form n−2/(2+d) where d is the Assouad dimension of the input space. 1
5 0.068414055 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification
Author: Wenming Zheng, Zhouchen Lin
Abstract: The method of common spatio-spectral patterns (CSSPs) is an extension of common spatial patterns (CSPs) by utilizing the technique of delay embedding to alleviate the adverse effects of noises and artifacts on the electroencephalogram (EEG) classification. Although the CSSPs method has shown to be more powerful than the CSPs method in the EEG classification, this method is only suitable for two-class EEG classification problems. In this paper, we generalize the two-class CSSPs method to multi-class cases. To this end, we first develop a novel theory of multi-class Bayes error estimation and then present the multi-class CSSPs (MCSSPs) method based on this Bayes error theoretical framework. By minimizing the estimated closed-form Bayes error, we obtain the optimal spatio-spectral filters of MCSSPs. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on the BCI competition 2005 data set. The experimental results show that our method significantly outperforms the previous multi-class CSPs (MCSPs) methods in the EEG classification.
6 0.056980774 226 nips-2009-Spatial Normalized Gamma Processes
7 0.048859626 152 nips-2009-Measuring model complexity with the prior predictive
8 0.048490934 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
9 0.047740154 242 nips-2009-The Infinite Partially Observable Markov Decision Process
10 0.046518955 111 nips-2009-Hierarchical Modeling of Local Image Features through $L p$-Nested Symmetric Distributions
11 0.044990107 207 nips-2009-Robust Nonparametric Regression with Metric-Space Valued Output
12 0.043693125 116 nips-2009-Information-theoretic lower bounds on the oracle complexity of convex optimization
13 0.041734181 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
14 0.041274816 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
15 0.041148655 53 nips-2009-Complexity of Decentralized Control: Special Cases
16 0.038090486 260 nips-2009-Zero-shot Learning with Semantic Output Codes
17 0.037659444 147 nips-2009-Matrix Completion from Noisy Entries
18 0.037531126 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
19 0.036961425 245 nips-2009-Thresholding Procedures for High Dimensional Variable Selection and Statistical Estimation
20 0.036581609 157 nips-2009-Multi-Label Prediction via Compressed Sensing
topicId topicWeight
[(0, -0.129), (1, 0.06), (2, 0.026), (3, -0.028), (4, -0.007), (5, -0.025), (6, 0.034), (7, -0.029), (8, -0.006), (9, -0.005), (10, 0.034), (11, 0.021), (12, 0.034), (13, -0.027), (14, 0.071), (15, -0.026), (16, 0.016), (17, 0.01), (18, 0.039), (19, 0.085), (20, 0.118), (21, 0.013), (22, -0.056), (23, -0.013), (24, -0.014), (25, -0.029), (26, 0.014), (27, 0.007), (28, -0.062), (29, 0.03), (30, 0.06), (31, -0.024), (32, 0.069), (33, -0.007), (34, 0.009), (35, 0.019), (36, 0.063), (37, -0.032), (38, -0.031), (39, -0.035), (40, 0.053), (41, -0.026), (42, 0.015), (43, 0.075), (44, 0.027), (45, 0.001), (46, 0.1), (47, -0.039), (48, 0.07), (49, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.8966794 69 nips-2009-Discrete MDL Predicts in Total Variation
Author: Marcus Hutter
Abstract: The Minimum Description Length (MDL) principle selects the model that has the shortest code for data plus model. We show that for a countable class of models, MDL predictions are close to the true distribution in a strong sense. The result is completely general. No independence, ergodicity, stationarity, identifiability, or other assumption on the model class need to be made. More formally, we show that for any countable class of models, the distributions selected by MDL (or MAP) asymptotically predict (merge with) the true measure in the class in total variation distance. Implications for non-i.i.d. domains like time-series forecasting, discriminative learning, and reinforcement learning are discussed. 1
2 0.53276038 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
Author: Pascal Germain, Alexandre Lacasse, Mario Marchand, Sara Shanian, François Laviolette
Abstract: We show that convex KL-regularized objective functions are obtained from a PAC-Bayes risk bound when using convex loss functions for the stochastic Gibbs classifier that upper-bound the standard zero-one loss used for the weighted majority vote. By restricting ourselves to a class of posteriors, that we call quasi uniform, we propose a simple coordinate descent learning algorithm to minimize the proposed KL-regularized cost function. We show that standard p -regularized objective functions currently used, such as ridge regression and p -regularized boosting, are obtained from a relaxation of the KL divergence between the quasi uniform posterior and the uniform prior. We present numerical experiments where the proposed learning algorithm generally outperforms ridge regression and AdaBoost. 1
3 0.52432698 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
Author: Peter Orbanz
Abstract: We consider the general problem of constructing nonparametric Bayesian models on infinite-dimensional random objects, such as functions, infinite graphs or infinite permutations. The problem has generated much interest in machine learning, where it is treated heuristically, but has not been studied in full generality in nonparametric Bayesian statistics, which tends to focus on models over probability distributions. Our approach applies a standard tool of stochastic process theory, the construction of stochastic processes from their finite-dimensional marginal distributions. The main contribution of the paper is a generalization of the classic Kolmogorov extension theorem to conditional probabilities. This extension allows a rigorous construction of nonparametric Bayesian models from systems of finitedimensional, parametric Bayes equations. Using this approach, we show (i) how existence of a conjugate posterior for the nonparametric model can be guaranteed by choosing conjugate finite-dimensional models in the construction, (ii) how the mapping to the posterior parameters of the nonparametric model can be explicitly determined, and (iii) that the construction of conjugate models in essence requires the finite-dimensional models to be in the exponential family. As an application of our constructive framework, we derive a model on infinite permutations, the nonparametric Bayesian analogue of a model recently proposed for the analysis of rank data. 1
4 0.51163918 11 nips-2009-A General Projection Property for Distribution Families
Author: Yao-liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvári
Abstract: Surjectivity of linear projections between distribution families with fixed mean and covariance (regardless of dimension) is re-derived by a new proof. We further extend this property to distribution families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in classification, optimization, portfolio selection and Markov decision processes. 1
5 0.50411743 7 nips-2009-A Data-Driven Approach to Modeling Choice
Author: Vivek Farias, Srikanth Jagabathula, Devavrat Shah
Abstract: We visit the following fundamental problem: For a ‘generic’ model of consumer choice (namely, distributions over preference lists) and a limited amount of data on how consumers actually make decisions (such as marginal preference information), how may one predict revenues from offering a particular assortment of choices? This problem is central to areas within operations research, marketing and econometrics. We present a framework to answer such questions and design a number of tractable algorithms (from a data and computational standpoint) for the same. 1
6 0.46531856 184 nips-2009-Optimizing Multi-Class Spatio-Spectral Filters via Bayes Error Estimation for EEG Classification
7 0.46375936 40 nips-2009-Bayesian Nonparametric Models on Decomposable Graphs
8 0.45429122 226 nips-2009-Spatial Normalized Gamma Processes
9 0.43132392 91 nips-2009-Fast, smooth and adaptive regression in metric spaces
10 0.42983904 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
11 0.42192602 55 nips-2009-Compressed Least-Squares Regression
12 0.41575554 166 nips-2009-Noisy Generalized Binary Search
13 0.41402403 51 nips-2009-Clustering sequence sets for motif discovery
14 0.41065806 71 nips-2009-Distribution-Calibrated Hierarchical Classification
15 0.40195096 112 nips-2009-Human Rademacher Complexity
16 0.39009541 53 nips-2009-Complexity of Decentralized Control: Special Cases
17 0.37509379 129 nips-2009-Learning a Small Mixture of Trees
18 0.37041554 182 nips-2009-Optimal Scoring for Unsupervised Learning
19 0.36864391 194 nips-2009-Predicting the Optimal Spacing of Study: A Multiscale Context Model of Memory
20 0.36696804 78 nips-2009-Efficient Moments-based Permutation Tests
topicId topicWeight
[(7, 0.36), (24, 0.068), (25, 0.082), (35, 0.041), (36, 0.094), (39, 0.034), (58, 0.053), (61, 0.028), (71, 0.07), (81, 0.012), (86, 0.053), (91, 0.01)]
simIndex simValue paperId paperTitle
1 0.90119606 165 nips-2009-Noise Characterization, Modeling, and Reduction for In Vivo Neural Recording
Author: Zhi Yang, Qi Zhao, Edward Keefer, Wentai Liu
Abstract: Studying signal and noise properties of recorded neural data is critical in developing more efficient algorithms to recover the encoded information. Important issues exist in this research including the variant spectrum spans of neural spikes that make it difficult to choose a globally optimal bandpass filter. Also, multiple sources produce aggregated noise that deviates from the conventional white Gaussian noise. In this work, the spectrum variability of spikes is addressed, based on which the concept of adaptive bandpass filter that fits the spectrum of individual spikes is proposed. Multiple noise sources have been studied through analytical models as well as empirical measurements. The dominant noise source is identified as neuron noise followed by interface noise of the electrode. This suggests that major efforts to reduce noise from electronics are not well spent. The measured noise from in vivo experiments shows a family of 1/f x spectrum that can be reduced using noise shaping techniques. In summary, the methods of adaptive bandpass filtering and noise shaping together result in several dB signal-to-noise ratio (SNR) enhancement.
same-paper 2 0.79922152 69 nips-2009-Discrete MDL Predicts in Total Variation
Author: Marcus Hutter
Abstract: The Minimum Description Length (MDL) principle selects the model that has the shortest code for data plus model. We show that for a countable class of models, MDL predictions are close to the true distribution in a strong sense. The result is completely general. No independence, ergodicity, stationarity, identifiability, or other assumption on the model class need to be made. More formally, we show that for any countable class of models, the distributions selected by MDL (or MAP) asymptotically predict (merge with) the true measure in the class in total variation distance. Implications for non-i.i.d. domains like time-series forecasting, discriminative learning, and reinforcement learning are discussed. 1
3 0.78135312 170 nips-2009-Nonlinear directed acyclic structure learning with weakly additive noise models
Author: Arthur Gretton, Peter Spirtes, Robert E. Tillman
Abstract: The recently proposed additive noise model has advantages over previous directed structure learning approaches since it (i) does not assume linearity or Gaussianity and (ii) can discover a unique DAG rather than its Markov equivalence class. However, for certain distributions, e.g. linear Gaussians, the additive noise model is invertible and thus not useful for structure learning, and it was originally proposed for the two variable case with a multivariate extension which requires enumerating all possible DAGs. We introduce weakly additive noise models, which extends this framework to cases where the additive noise model is invertible and when additive noise is not present. We then provide an algorithm that learns an equivalence class for such models from data, by combining a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the Markov equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. 1
4 0.77211934 95 nips-2009-Fast subtree kernels on graphs
Author: Nino Shervashidze, Karsten M. Borgwardt
Abstract: In this article, we propose fast subtree kernels on graphs. On graphs with n nodes and m edges and maximum degree d, these kernels comparing subtrees of height h can be computed in O(mh), whereas the classic subtree kernel by Ramon & Gärtner scales as O(n^2 4^d h). Key to this efficiency is the observation that the Weisfeiler-Lehman test of isomorphism from graph theory elegantly computes a subtree kernel as a byproduct. Our fast subtree kernels can deal with labeled graphs, scale up easily to large graphs and outperform state-of-the-art graph kernels on several classification benchmark datasets in terms of accuracy and runtime. 1
5 0.675574 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
Author: Marius Kloft, Ulf Brefeld, Pavel Laskov, Klaus-Robert Müller, Alexander Zien, Sören Sonnenburg
Abstract: Learning linear combinations of multiple kernels is an appealing strategy when the right choice of features is unknown. Previous approaches to multiple kernel learning (MKL) promote sparse kernel combinations to support interpretability. Unfortunately, 1 -norm MKL is hardly observed to outperform trivial baselines in practical applications. To allow for robust kernel mixtures, we generalize MKL to arbitrary p -norms. We devise new insights on the connection between several existing MKL formulations and develop two efficient interleaved optimization strategies for arbitrary p > 1. Empirically, we demonstrate that the interleaved optimization strategies are much faster compared to the traditionally used wrapper approaches. Finally, we apply p -norm MKL to real-world problems from computational biology, showing that non-sparse MKL achieves accuracies that go beyond the state-of-the-art. 1
6 0.60175192 17 nips-2009-A Sparse Non-Parametric Approach for Single Channel Separation of Known Sounds
7 0.4993867 226 nips-2009-Spatial Normalized Gamma Processes
8 0.47415951 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
9 0.47255358 59 nips-2009-Construction of Nonparametric Bayesian Models from Parametric Bayes Equations
10 0.46678314 162 nips-2009-Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
11 0.46554771 25 nips-2009-Adaptive Design Optimization in Experiments with People
12 0.46373603 168 nips-2009-Non-stationary continuous dynamic Bayesian networks
13 0.46216482 113 nips-2009-Improving Existing Fault Recovery Policies
14 0.46191144 175 nips-2009-Occlusive Components Analysis
15 0.45742089 178 nips-2009-On Stochastic and Worst-case Models for Investing
16 0.45519942 62 nips-2009-Correlation Coefficients are Insufficient for Analyzing Spike Count Dependencies
17 0.45432734 237 nips-2009-Subject independent EEG-based BCI decoding
18 0.45327261 247 nips-2009-Time-rescaling methods for the estimation and assessment of non-Poisson neural encoding models
19 0.45273006 38 nips-2009-Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity
20 0.45227143 3 nips-2009-AUC optimization and the two-sample problem