nips nips2010 nips2010-75 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. [sent-7, score-0.375]
2 We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. [sent-8, score-0.821]
3 We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. [sent-9, score-0.276]
4 1 Introduction Probabilistic grammars are an important statistical model family used in natural language processing [7], computer vision [16], computational biology [19] and more recently, in human activity analysis [12]. [sent-10, score-0.423]
5 Estimating the parameters of a probabilistic grammar can be viewed as minimizing empirical risk with the log-loss [21]. [sent-12, score-0.18]
6 The log-loss is not bounded when applied to probabilistic grammars, and that makes it hard to obtain uniform convergence results. [sent-13, score-0.294]
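A one-line illustration of why the log-loss is unbounded, using the notation of Eq. (1) in §2 below (the bound B and the derivation z are arbitrary placeholders):

```latex
% For any B > 0, take parameters with \theta_{k,i} < e^{-B} for some event (k,i)
% that a derivation z uses at least once (\psi_{k,i}(z) \ge 1). Then
\[
-\log h_\theta(z) \;=\; -\sum_{k,i}\psi_{k,i}(z)\,\log\theta_{k,i}
\;\ge\; -\log\theta_{k,i} \;>\; B ,
\]
% so no uniform bound on the log-loss holds over the whole parameter space.
```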
7 To overcome this problem, we derive distribution-dependent uniform convergence results for probabilistic grammars. [sent-15, score-0.228]
8 Based on the notion of bounded approximations [1, 9], we define a sequence of increasingly better approximations for probabilistic grammars, which we call “proper approximations.” [sent-18, score-0.411]
9 We then derive sample complexity bounds in our framework, for both the supervised case and the unsupervised case. [sent-19, score-0.276]
10 Our results rely on an exponential decay in probabilities with respect to the length of the derivation (number of derivation steps the grammar takes when generating a structure). [sent-20, score-0.567]
11 §4 presents proper approximations, which are approximate concept spaces that permit the derivation of sample complexity bounds for probabilistic grammars. [sent-27, score-0.671]
12 1 2 Probabilistic Grammars A probabilistic grammar defines a probability distribution over grammatical derivations generated through a step-by-step process. [sent-30, score-0.78]
13 For example, probabilistic context-free grammars (PCFGs) generate phrase-structure trees by recursively rewriting nonterminal symbols as sequences of “child” symbols according to a fixed set of production rules. [sent-31, score-0.528]
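As an illustrative sketch of this generative process (the toy grammar, its symbols, and its rule probabilities below are hypothetical, not taken from the paper), a derivation can be sampled by recursively rewriting nonterminals:

```python
import random

# A toy PCFG: each nonterminal maps to a list of (right-hand side, probability) pairs;
# the probabilities for each nonterminal sum to 1. Purely illustrative.
TOY_PCFG = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("dogs",), 0.5), (("cats",), 0.5)],
    "VP": [(("bark",), 0.7), (("sleep",), 0.3)],
}

def sample_derivation(symbol="S"):
    """Recursively rewrite `symbol`; return the derivation as a nested tuple."""
    if symbol not in TOY_PCFG:                 # terminal: nothing left to rewrite
        return symbol
    options = TOY_PCFG[symbol]
    rhs = random.choices([o[0] for o in options], weights=[o[1] for o in options])[0]
    return (symbol, tuple(sample_derivation(child) for child in rhs))

def yield_of(derivation):
    """Read off the string yield(z) of a derivation z."""
    if isinstance(derivation, str):
        return (derivation,)
    _, children = derivation
    return tuple(word for child in children for word in yield_of(child))

z = sample_derivation()
print(z, yield_of(z))
```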
14 In this paper, we will assume that any grammatical derivation z fully determines a string x, denoted yield(z). [sent-33, score-0.208]
15 There may be many derivations z for a given string (perhaps infinitely many for some kinds of grammars; we assume that the number of derivations is finite). [sent-34, score-0.47]
16 In general, a probabilistic grammar defines the probability of a grammatical derivation z as: $h_\theta(z) = \prod_{k=1}^{K}\prod_{i=1}^{N_k} \theta_{k,i}^{\psi_{k,i}(z)} = \exp\big(\sum_{k=1}^{K}\sum_{i=1}^{N_k} \psi_{k,i}(z)\log\theta_{k,i}\big)$ (1), where $\psi_{k,i}$ is a function that “counts” the number of times the kth distribution’s ith event occurs in the derivation. [sent-35, score-0.788]
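Read as computation, Eq. (1) says the log-probability of a derivation is a count-weighted sum of log parameters. A minimal sketch (the counts and parameters below are hypothetical):

```python
import math

# theta[k][i]: probability of the i-th event of the k-th multinomial; each row sums to 1.
theta = [[0.5, 0.5],
         [0.7, 0.2, 0.1]]

# psi[k][i]: how many times event (k, i) fires in a derivation z (hypothetical counts).
psi = [[2, 1],
       [0, 3, 1]]

def log_prob(psi, theta):
    """log h_theta(z) = sum_{k,i} psi_{k,i}(z) * log theta_{k,i}, as in Eq. (1)."""
    return sum(c * math.log(p)
               for counts, params in zip(psi, theta)
               for c, p in zip(counts, params))

def derivation_length(psi):
    """|z| = sum_{k,i} psi_{k,i}(z): the number of event tokens in the derivation."""
    return sum(c for counts in psi for c in counts)

print(math.exp(log_prob(psi, theta)), derivation_length(psi))
```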
17 D(G) denotes the set of all possible derivations of G. [sent-41, score-0.213]
18 We let $|x|$ denote the length of the string x, and $|z| = \sum_{k=1}^{K}\sum_{i=1}^{N_k}\psi_{k,i}(z)$ denote the “length” (number of event tokens) of the derivation z. [sent-43, score-0.233]
19 Parameter estimation for probabilistic grammars means choosing θ from complete data (“supervised”) or incomplete data (“semi-supervised” or “unsupervised,” the latter usually implying that strings x are evidence but all derivations z are missing). [sent-44, score-0.769]
20 For simplicity of notation, we assume that there is a fixed grammar and use H to refer to H(G) and F to refer to F(G). [sent-46, score-0.337]
21 We will make a few assumptions about G and P(z), the distribution that generates derivations from D(G) (note that P does not have to be a probabilistic grammar): • Bounded derivation length: There is an α ≥ 1 such that, for all z, |z| ≤ α|yield(z)|. [sent-48, score-0.437]
22 • Exponential decay of strings: Let $\Lambda(k) = |\{z \in D(G) : |z| = k\}|$ be the number of derivations of length k in G. [sent-51, score-0.285]
23 This implies that the number of derivations of length k may be exponentially large. [sent-53, score-0.252]
24 3 The Learning Setting In the supervised learning setting, a set of grammatical derivations $z_1, \ldots, z_n$ is available as training data. [sent-60, score-0.349]
25 MLE chooses $h^* \in \mathcal{H}$ to maximize the likelihood of the data: $h^* = \operatorname{argmax}_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^{n}\log h(z_i) = \operatorname{argmin}_{h\in\mathcal{H}} \underbrace{\textstyle\sum_{z\in D(G)} \tilde{P}(z)\,(-\log h(z))}_{R_{\mathrm{emp},n}(-\log h)}$ (2). (Learning the rules in a grammar is another important problem that has received much attention [11].) [sent-64, score-0.559]
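For complete data and a fixed grammar, the minimizer in Eq. (2) has the usual closed form: normalized event counts per multinomial. A sketch of this supervised MLE (the count data are hypothetical; the uniform fallback for an all-zero multinomial is an assumption of the sketch, not part of Eq. (2)):

```python
def mle_parameters(count_matrices):
    """Supervised MLE for a fixed grammar: theta_{k,i} is the total count of event (k, i)
    divided by the total count of all events of multinomial k. `count_matrices` is a list
    over derivations z_j of their count tables psi(z_j)."""
    K = len(count_matrices[0])
    totals = [[0.0] * len(count_matrices[0][k]) for k in range(K)]
    for psi in count_matrices:
        for k, counts in enumerate(psi):
            for i, c in enumerate(counts):
                totals[k][i] += c
    theta = []
    for k in range(K):
        denom = sum(totals[k])
        # An event that never occurs gets probability 0 here, which is exactly what makes
        # the log-loss unbounded and motivates the shifted parameters of Section 4.
        theta.append([c / denom if denom > 0 else 1.0 / len(totals[k]) for c in totals[k]])
    return theta

# Hypothetical training data: psi counts for three observed derivations.
data = [[[2, 1], [0, 3, 1]],
        [[1, 0], [1, 1, 0]],
        [[0, 2], [2, 0, 0]]]
print(mle_parameters(data))
```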
26 The expected risk, under P, is the (unknowable) quantity $R(-\log h) = \sum_{z\in D(G)} P(z)\,(-\log h(z)) = \mathbb{E}_P[-\log h]$. Showing convergence of the form $\sup_{h\in\mathcal{H}} |R_{\mathrm{emp},n}(-\log h) - R(-\log h)| \to 0$ as $n \to \infty$ (in probability) is referred to as double-sided uniform convergence. [sent-66, score-0.453]
27 (We note that $\sup_{h\in\mathcal{H}} |R_{\mathrm{emp},n}(-\log h) - R(-\log h)| = \sup_{f\in\mathcal{F}} |R_{\mathrm{emp},n}(f) - R(f)|$.) [sent-67, score-0.22]
28 This kind of uniform convergence is the driving force in showing that the empirical risk minimizer is consistent, i.e., that the minimized empirical risk converges to the minimized expected risk. [sent-68, score-0.33] [sent-70, score-0.18]
30 We assume familiarity with the relevant literature about empirical risk minimization; see [21]. [sent-71, score-0.18]
31 Instead of imposing such a restriction, which is too strong for probabilistic grammars, we revise the learning model according to well-known results about the convergence of stochastic processes. [sent-77, score-0.19]
32 The revised model replaces two-sided uniform convergence with convergence on the sequence of concept spaces. [sent-81, score-0.248]
33 Our approximations are based on the concept of bounded approximations [1, 9]. [sent-84, score-0.342]
34 This is actually a well-motivated characterization of convergence for probabilistic grammars in the supervised setting. [sent-95, score-0.58]
35 We note that a good approximation would have $K_m$ increasing fast as a function of m, and $\epsilon_{\mathrm{tail}}(m)$ and $\epsilon_{\mathrm{bound}}(m)$ decreasing fast as a function of m. [sent-96, score-0.314]
36 1 Constructing Proper Approximations for Probabilistic Grammars We now focus on constructing proper approximations for probabilistic grammars. [sent-99, score-0.413]
37 We make an assumption about the probabilistic grammar that ∀k, Nk = 2. [sent-100, score-0.482]
38 For most common grammar formalisms, this does not change the expressive power: any grammar that can be expressed using Nk > 2 can be expressed using a grammar that has Nk ≤ 2. [sent-101, score-1.011]
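A sketch of the re-expression claimed here: an $N_k$-way multinomial can be replaced by a chain of binary decisions whose products reproduce the original probabilities (the helper functions are illustrative and not from the paper):

```python
def binarize_multinomial(probs, eps=1e-12):
    """Turn an N-way multinomial (p_1, ..., p_N) into a chain of N-1 binary decisions.
    Decision i emits outcome i with probability q_i = p_i / (p_i + ... + p_N) and otherwise
    continues down the chain; the product of choices along the chain equals p_i."""
    qs, remaining = [], 1.0
    for p in probs[:-1]:
        qs.append(p / max(remaining, eps))
        remaining -= p
    return qs  # the last outcome is reached by always choosing "continue"

def chain_probability(qs, outcome):
    """Recover p_outcome from the binary chain (outcomes are 0-indexed)."""
    prob = 1.0
    for i, q in enumerate(qs):
        if i == outcome:
            return prob * q
        prob *= 1.0 - q
    return prob  # outcome == len(qs): the final outcome

probs = [0.5, 0.3, 0.2]
qs = binarize_multinomial(probs)
print([round(chain_probability(qs, i), 6) for i in range(len(probs))])  # ~[0.5, 0.3, 0.2]
```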
39 For each $f \in \mathcal{F}$ we define the transformation $T(f,\gamma)$ that shifts every $\theta_k = \langle\theta_{k,1},\theta_{k,2}\rangle$ in the probabilistic grammar by $\gamma$: $\langle\theta_{k,1},\theta_{k,2}\rangle \leftarrow \langle\gamma,\,1-\gamma\rangle$ if $\theta_{k,1} < \gamma$; $\langle 1-\gamma,\,\gamma\rangle$ if $\theta_{k,1} > 1-\gamma$; $\langle\theta_{k,1},\theta_{k,2}\rangle$ otherwise (3). Note that $T(f,\gamma) \in \mathcal{F}$ for any $\gamma \le 1/2$. [sent-104, score-0.482]
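A sketch of the shift operator $T(f,\gamma)$ of Eq. (3) applied to the parameter pairs of a binomialized grammar (the choice $\gamma = m^{-p}$ follows the text; the parameter values are hypothetical):

```python
def shift_parameters(theta, gamma):
    """T(f, gamma): push every binomial <theta_{k,1}, theta_{k,2}> away from 0 and 1
    by gamma, as in Eq. (3). Requires gamma <= 1/2, so the result stays a distribution."""
    assert gamma <= 0.5
    shifted = []
    for t1, t2 in theta:
        if t1 < gamma:
            shifted.append((gamma, 1.0 - gamma))
        elif t1 > 1.0 - gamma:
            shifted.append((1.0 - gamma, gamma))
        else:
            shifted.append((t1, t2))
    return shifted

theta = [(0.001, 0.999), (0.4, 0.6), (0.9999, 0.0001)]
m, p = 10, 2
print(shift_parameters(theta, gamma=m ** (-p)))  # C_m(f) uses gamma = m^{-p}
```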
40 There exists a constant $\beta = \beta(L, q, p, N) > 0$ such that $\mathcal{F}_m$ has the boundedness property with $K_m = pN\log^3 m$ and $\epsilon_{\mathrm{bound}}(m) = m^{-\beta\log m}$. [sent-109, score-0.236]
41 Then, for all $z \in Z(m)$ we have $f(z) = -\sum_{i,k}\psi_{k,i}(z)\log\theta_{k,i} \le \sum_{i,k}\psi_{k,i}(z)\,(p\log m) \le pN\log^3 m = K_m$, where the first inequality follows from $f \in \mathcal{F}_m$ ($\theta_{k,i} \ge m^{-p}$) and the second from $|z| \le \log^2 m$. [sent-113, score-0.222]
42 In addition, from the requirements on P we have: $\mathbb{E}\big[|f|\times\mathbb{I}(|f|\ge K_m)\big] \le pN\log m \sum_{k>\log^2 m} L\,\Lambda(k)\,r^k\,k \le \kappa\,(\log m)\,q^{\log^2 m}$ for some constant $\kappa > 0$. [sent-114, score-0.287]
43 Finally, for some $\beta(L,q,p,N) = \beta > 0$ and some constant $M$, if $m > M$ then $\kappa\,(\log m)\,q^{\log^2 m} \le m^{-\beta\log m}$. [sent-115, score-0.245]
44 There exists an $M$ such that for every $m \ge M$ we have: $P\big(\bigcup_{f\in\mathcal{F}}\{z \mid C_m(f)(z) - f(z) \ge \epsilon_{\mathrm{tail}}(m)\}\big) \le \epsilon_{\mathrm{tail}}(m)$, where $C_m(f) = T(f, m^{-p})$. [sent-118, score-0.272]
45 Here $\epsilon_{\mathrm{tail}}(m) = \frac{N\log^2 m}{m^p - 1}$. Proof. [sent-119, score-0.726]
46 From this point, we use Fm to denote the proper approximation constructed for G. [sent-122, score-0.187]
47 We use $\epsilon_{\mathrm{bound}}(m)$ and $\epsilon_{\mathrm{tail}}(m)$ as in Proposition 4. [sent-123, score-0.292]
48 2 Asymptotic Empirical Risk Minimization It would be compelling to know that the empirical risk minimizer over Fn is an asymptotic empirical risk minimizer (in the log-loss case, this means it converges to the maximum likelihood estimate). [sent-127, score-0.532]
49 As a conclusion to this section about proper approximations, we motivate the three requirements that we posed on proper approximations by showing that this is indeed true. [sent-128, score-0.475]
50 Let $f_n^*$ be the minimizer of the empirical risk over $\mathcal{F}$ ($f_n^* = \operatorname{argmin}_{f\in\mathcal{F}} R_{\mathrm{emp},n}(f)$) and let $g_n$ be the minimizer of the empirical risk over $\mathcal{F}_n$ ($g_n = \operatorname{argmin}_{f\in\mathcal{F}_n} R_{\mathrm{emp},n}(f)$). [sent-130, score-1.197]
51 The operator $g_n = \operatorname{argmin}_{f\in\mathcal{F}_n} R_{\mathrm{emp},n}(f)$ is an asymptotic empirical risk minimizer if $\mathbb{E}[R_{\mathrm{emp},n}(g_n) - R_{\mathrm{emp},n}(f_n^*)] \to 0$. [sent-135, score-0.401]
52 Then $g_n = \operatorname{argmin}_{f\in\mathcal{F}_n} R_{\mathrm{emp},n}(f)$ is an asymptotic empirical risk minimizer. [sent-142, score-0.386]
53 If $\mathcal{F}_n$ properly approximates $\mathcal{F}$, then: $\mathbb{E}[R_{\mathrm{emp},n}(g_n) - R_{\mathrm{emp},n}(f_n^*)] \le \mathbb{E}[R_{\mathrm{emp},n}(C_n(f_n^*)) \mid A_{\epsilon,n}]\,P(A_{\epsilon,n}) + \mathbb{E}[R_{\mathrm{emp},n}(f_n^*) \mid A_{\epsilon,n}]\,P(A_{\epsilon,n}) + \epsilon_{\mathrm{tail}}(n)$ (4), where the expectations are taken with respect to the dataset D. [sent-147, score-0.26]
54 Let $A_{j,\epsilon,n}$, for $j \in \{1,\ldots,n\}$, be the event “$z_j \in Z_{\epsilon,n}$,” so that $A_{\epsilon,n} = \bigcup_j A_{j,\epsilon,n}$. [sent-157, score-0.34]
55 Splitting $\sum_{l=1}^{n}\mathbb{E}[\psi_{k,i}(z_l) \mid A_{\epsilon,n}]\,P(A_{\epsilon,n})$ over the events $A_{j,\epsilon,n}$, each term with $j \ne l$ is bounded by $\sum_{z_l} P(z_l)\,|z_l|\,P(A_{j,\epsilon,n})$, and together these contribute at most $(n-1)\,B\,P(z \in Z_{\epsilon,n})$. [sent-158, score-0.44]
56 The inequality in Eq. 7 comes from the $z_l$ being independent, and $B$ is the constant from §2. [sent-160, score-0.243]
57 Therefore, we have: $\frac{1}{n}\sum_{l=1}^{n}\sum_{k,i}\mathbb{E}[\psi_{k,i}(z_l) \mid A_{\epsilon,n}]\,P(A_{\epsilon,n}) \le \sum_{k,i}\mathbb{E}[\psi_{k,i}(z) \mid z \in Z_{\epsilon,n}]\,P(z \in Z_{\epsilon,n}) + n^2\,B\,P(z \in Z_{\epsilon,n})$ (10). [sent-161, score-0.268]
58 From the construction of our proper approximations (Proposition 4.2), we know that only derivations of length $\log^2 n$ or greater can be in $Z_{\epsilon,n}$. [sent-162, score-0.252]
59 Sample Complexity Results We now give our main sample complexity results for probabilistic grammars. [sent-167, score-0.27]
60 The following is the key result about the connection between covering numbers and the double-sided convergence of the empirical process $\sup_{f\in\mathcal{F}_n} |R_{\mathrm{emp},n}(f) - R(f)|$ as $n \to \infty$, where covering numbers are taken with respect to the pseudometric $d_P(f,g) = \sum_{z\in D(G)} P(z)\,|f(z)-g(z)| = \mathbb{E}_P[|f-g|]$: Lemma 5. [sent-182, score-0.256]
61 The “permissible class” requirement is a mild regularity condition about measurability that holds for proper approximations. [sent-187, score-0.19]
62 In the case of our “binomialized” probabilistic grammars, the pseudo dimension of Fn is bounded by N , because we have Fn ⊆ F, and the functions in F are linear with N parameters. [sent-196, score-0.26]
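To make that linearity explicit in the paper's notation (a restatement of the argument, not an addition to it):

```latex
% Each concept f in F is linear in an N-dimensional parameter vector:
\[
f(z) \;=\; -\sum_{k=1}^{K}\sum_{i=1}^{N_k} \psi_{k,i}(z)\,\log\theta_{k,i}
      \;=\; \langle \psi(z),\, w \rangle ,
\qquad w_{k,i} = -\log\theta_{k,i},\quad w \in \mathbb{R}^{N},
\]
% so F lies inside an N-dimensional vector space of real-valued functions of z,
% and the pseudo dimension of such a space is at most N.
```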
63 Let $\mathcal{F}_n$ be the proper approximations for probabilistic grammars; then for any $0 < \epsilon < K_n$ we obtain a bound on the covering number $N(\epsilon, \mathcal{F}_{\mathrm{truncated},n})$ that follows from the pseudo dimension bound above.
64 Let Fn be a proper approximation for the corresponding family of probabilistic grammars. [sent-207, score-0.357]
65 Let P(x, z) be a distribution over derivations that satisfies the requirements in §2. [sent-208, score-0.255]
66 2 Unsupervised Case In the unsupervised setting, we have n yields of derivations from the grammar, $x_1, \ldots, x_n$, and our goal again is to identify grammar parameters θ from these yields. [sent-222, score-0.279] [sent-225, score-0.337]
68 Our concept classes are now the sets of log marginalized distributions from $\mathcal{F}_n$, i.e., functions of the form $x \mapsto -\log \sum_{z:\,\mathrm{yield}(z)=x} h(z)$ for $-\log h \in \mathcal{F}_n$. [sent-226, score-0.172]
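A sketch of what a concept in this marginalized class evaluates to: the negated log of the marginal probability of a string, computed by summing over candidate derivations in log space. The explicit enumeration of derivations is an assumption of the sketch (a real grammar would use dynamic programming such as the inside algorithm), and the example values are hypothetical:

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def marginalized_loss(x, derivations_of, log_prob_of):
    """-log sum_{z : yield(z) = x} h_theta(z): the unsupervised (marginalized) log-loss.
    `derivations_of(x)` enumerates candidate derivations of x (finite, as assumed in §2);
    `log_prob_of(z)` returns log h_theta(z), e.g. computed via Eq. (1)."""
    return -logsumexp([log_prob_of(z) for z in derivations_of(x)])

# Hypothetical example: a string with two derivations whose log-probabilities are -3.2 and -4.1.
print(marginalized_loss("a string", lambda x: ["z1", "z2"],
                        {"z1": -3.2, "z2": -4.1}.get))
```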
69 Note that we also need to define the operator Cn (f ) as a first step towards defining Fn as proper approximations (for F ) in the unsupervised setting. [sent-229, score-0.36]
70 It is not immediate to show that Fn is a proper approximation for F . [sent-233, score-0.187]
71 There exists an $M$ such that for any $n \ge M$ we have: $P\big(\bigcup_{f\in\mathcal{F}}\{x \mid C_n(f)(x) - f(x) \ge \epsilon_{\mathrm{tail}}(n)\}\big) \le \epsilon_{\mathrm{tail}}(n)$, for $\epsilon_{\mathrm{tail}}(n) = \frac{N\log^2 n}{n^p - 1}$ and the operator $C_n(f)$ as defined above. [sent-240, score-0.298]
72 If $-\log\sum_i a_i + \log\sum_i b_i \ge \epsilon$, then there exists an $i$ such that $-\log a_i + \log b_i \ge \epsilon$. [sent-244, score-0.645]
73 Sketch of proof of Proposition 5.5: we have $P\big(\bigcup_{f\in\mathcal{F}}\{x \mid C_n(f)(x) - f(x) \ge \epsilon_{\mathrm{tail}}(n)\}\big) \le P\big(\bigcup_{f\in\mathcal{F}}\{x \mid \exists z\colon C_n(f)(z) - f(z) \ge \epsilon_{\mathrm{tail}}(n)\}\big)$ (17). Define $X(n)$ to be all $x$ such that there exists a $z$ with $\mathrm{yield}(z) = x$ and $|z| \ge \log^2 n$. [sent-247, score-0.579]
74 Then the right-hand side of (17) is at most $\sum_{x\in X(n)} P(x) \le \sum_{x:\,|x|\ge\log^2 n/\alpha} P(x) \le \sum_{k=\lceil\log^2 n/\alpha\rceil}^{\infty} L\,\Lambda(k)\,r^k \le \epsilon_{\mathrm{tail}}(n)$ (18), where the last inequality holds for all $n$ larger than a fixed $M$. [sent-251, score-0.466]
75 Computing either the covering number or the pseudo dimension of $\mathcal{F}_n$ is a hard task, because the functions in this class include the “log-sum-exp.” [sent-252, score-0.195]
76 In [9], Dasgupta overcomes this problem for Bayesian networks with fixed structure by giving a bound on the covering number for his (respective) marginalized class that depends on the covering number of $\mathcal{F}$. [sent-253, score-0.239]
77 Unfortunately, we cannot fully adopt this approach, because the derivations of a probabilistic grammar can be arbitrarily large. [sent-254, score-0.718]
78 The more samples we have, the more permissive the grammar can be (i.e., the larger its derivation set may be). [sent-257, score-0.416]
79 On the other hand, the more accuracy we desire, the more restricted we are in choosing grammars that have a large derivation set. [sent-258, score-0.418]
80 For the unsupervised case, then, we get the following sample complexity result: Theorem 5. [sent-265, score-0.191]
81 Let Fn be a proper approximation for the corresponding family of probabilistic grammars. [sent-268, score-0.357]
82 Let P(x, z) be a distribution over derivations that satisfies the requirements in §2. [sent-269, score-0.255]
83 For this sample complexity bound to be non-trivial, we can, for example, restrict $D_x(G)$, through $d(n)$, to have size polynomial in the number of samples. [sent-275, score-0.184]
84 (Table: trade-off criteria — tightness of the proper approximation and the sample complexity bound — as $K_n$ increases.) [sent-279, score-0.445]
85 6 Discussion Our framework can be specialized to improve either of two main criteria that trade off against each other: the tightness of the proper approximation and the sample complexity. [sent-292, score-0.314]
86 For example, we can improve the tightness of our proper approximations by taking a subsequence of Fn . [sent-293, score-0.378]
87 However, this will make the sample complexity bound degrade, because Kn will grow faster. [sent-294, score-0.214]
88 In general, we would want the derivational condition to be removed (choose $d(n) = \infty$, or at least allow $d(n) = \Omega(t^n)$ for some $t$, for small samples), but in that case our sample complexity bounds become trivial. [sent-296, score-0.271]
89 This is consistent with conventional wisdom that lexicalized grammars require much more data for accurate learning. [sent-301, score-0.384]
90 The dependence of the bound on N suggests that it is easier to learn models with a smaller grammar size. [sent-302, score-0.396]
91 The sample complexity bound for the unsupervised case suggests that we need log d(n) times as much data to achieve estimates as good as those for supervised learning. [sent-305, score-0.375]
92 Interestingly, with unsupervised grammar learning, available training sentences longer than a maximum length (e. [sent-306, score-0.442]
93 We note that sample complexity is not the only measure for the complexity of estimating probabilistic grammars. [sent-309, score-0.342]
94 In the unsupervised setting, for example, ERM is computationally NP-hard for PCFGs [5] and for probabilistic automata [2]. [sent-310, score-0.311]
95 7 Conclusion We presented a framework for learning the parameters of a probabilistic grammar under the log-loss and derived sample complexity bounds for it. [sent-311, score-0.641]
96 We motivated this framework by showing that the empirical risk minimizer for our approximate framework is an asymptotic empirical risk minimizer. [sent-312, score-0.465]
97 Our framework uses a sequence of approximations to a family of probabilistic grammars, a sequence that improves as more data become available, to give distribution-dependent sample complexity bounds in the supervised and unsupervised settings. [sent-313, score-0.571]
98 On the computational complexity of approximating distributions by probabilistic automata. [sent-325, score-0.217]
99 Empirical risk minimization for probabilistic grammars: Sample complexity and hardness of learning, in preparation. [sent-351, score-0.401]
100 A stochastic graph grammar for compositional object representation and recognition. [sent-399, score-0.373]
wordName wordTfidf (topN-words)
[('fn', 0.42), ('grammars', 0.339), ('grammar', 0.337), ('tail', 0.233), ('fm', 0.23), ('zl', 0.22), ('derivations', 0.213), ('proper', 0.165), ('probabilistic', 0.145), ('cn', 0.144), ('risk', 0.131), ('kn', 0.126), ('km', 0.12), ('derivational', 0.112), ('approximations', 0.103), ('concept', 0.098), ('nk', 0.095), ('proposition', 0.093), ('covering', 0.09), ('argminf', 0.09), ('parsing', 0.088), ('grammatical', 0.085), ('derivation', 0.079), ('pcfg', 0.079), ('pcfgs', 0.079), ('gn', 0.078), ('pn', 0.076), ('log', 0.074), ('tightness', 0.074), ('complexity', 0.072), ('supf', 0.072), ('minimizer', 0.067), ('unsupervised', 0.066), ('bound', 0.059), ('language', 0.059), ('aj', 0.059), ('dx', 0.058), ('zn', 0.056), ('pseudo', 0.055), ('sample', 0.053), ('supervised', 0.051), ('empirical', 0.049), ('strings', 0.048), ('cm', 0.048), ('event', 0.046), ('convergence', 0.045), ('lexicalized', 0.045), ('string', 0.044), ('requirements', 0.042), ('degrades', 0.041), ('boundedness', 0.041), ('cohen', 0.041), ('exists', 0.039), ('permissible', 0.039), ('acl', 0.039), ('length', 0.039), ('uniform', 0.038), ('asymptotic', 0.038), ('bounded', 0.038), ('material', 0.037), ('subsequence', 0.036), ('abe', 0.036), ('compositional', 0.036), ('suph', 0.036), ('supplementary', 0.036), ('bounds', 0.034), ('decay', 0.033), ('zi', 0.032), ('grow', 0.03), ('grows', 0.03), ('minimization', 0.028), ('hard', 0.028), ('mp', 0.027), ('lemma', 0.027), ('approximates', 0.027), ('operator', 0.026), ('technologies', 0.026), ('hardness', 0.025), ('bp', 0.025), ('let', 0.025), ('rk', 0.025), ('family', 0.025), ('spaces', 0.025), ('requirement', 0.025), ('overview', 0.024), ('implying', 0.024), ('sup', 0.023), ('np', 0.023), ('arbitrarily', 0.023), ('constant', 0.023), ('kth', 0.022), ('bi', 0.022), ('dimension', 0.022), ('pittsburgh', 0.022), ('sequence', 0.022), ('approximation', 0.022), ('symbols', 0.022), ('ep', 0.022), ('proof', 0.021), ('relying', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999988 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
2 0.15346174 264 nips-2010-Synergies in learning words and their referents
Author: Mark Johnson, Katherine Demuth, Bevan Jones, Michael J. Black
Abstract: This paper presents Bayesian non-parametric models that simultaneously learn to segment words from phoneme strings and learn the referents of some of those words, and shows that there is a synergistic interaction in the acquisition of these two kinds of linguistic information. The models themselves are novel kinds of Adaptor Grammars that are an extension of an embedding of topic models into PCFGs. These models simultaneously segment phoneme sequences into words and learn the relationship between non-linguistic objects to the words that refer to them. We show (i) that modelling inter-word dependencies not only improves the accuracy of the word segmentation but also of word-object relationships, and (ii) that a model that simultaneously learns word-object relationships and word segmentation segments more accurately than one that just learns word segmentation on its own. We argue that these results support an interactive view of language acquisition that can take advantage of synergies such as these. 1
3 0.15058097 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
Author: Gilles Blanchard, Nicole Krämer
Abstract: We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates match state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators. 1
4 0.12743258 250 nips-2010-Spectral Regularization for Support Estimation
Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo
Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
5 0.12425093 281 nips-2010-Using body-anchored priors for identifying actions in single images
Author: Leonid Karlinsky, Michael Dinerstein, Shimon Ullman
Abstract: This paper presents an approach to the visual recognition of human actions using only single images as input. The task is easy for humans but difficult for current approaches to object recognition, because instances of different actions may be similar in terms of body pose, and often require detailed examination of relations between participating objects and body parts in order to be recognized. The proposed approach applies a two-stage interpretation procedure to each training and test image. The first stage produces accurate detection of the relevant body parts of the actor, forming a prior for the local evidence needed to be considered for identifying the action. The second stage extracts features that are anchored to the detected body parts, and uses these features and their feature-to-part relations in order to recognize the action. The body anchored priors we propose apply to a large range of human actions. These priors allow focusing on the relevant regions and relations, thereby significantly simplifying the learning process and increasing recognition performance. 1
6 0.11808988 243 nips-2010-Smoothness, Low Noise and Fast Rates
7 0.10732189 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
8 0.10128045 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
9 0.095486306 223 nips-2010-Rates of convergence for the cluster tree
10 0.084928654 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
11 0.084909871 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
12 0.079855174 184 nips-2010-Nonparametric Bayesian Policy Priors for Reinforcement Learning
13 0.077179044 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
14 0.075869903 214 nips-2010-Probabilistic Belief Revision with Structural Constraints
15 0.072987132 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
16 0.070558392 18 nips-2010-A novel family of non-parametric cumulative based divergences for point processes
17 0.070164561 192 nips-2010-Online Classification with Specificity Constraints
18 0.069823973 70 nips-2010-Efficient Optimization for Discriminative Latent Class Models
19 0.068832435 191 nips-2010-On the Theory of Learnining with Privileged Information
20 0.067191757 148 nips-2010-Learning Networks of Stochastic Differential Equations
topicId topicWeight
[(0, 0.169), (1, 0.009), (2, 0.104), (3, 0.025), (4, 0.008), (5, 0.101), (6, 0.007), (7, -0.026), (8, 0.011), (9, 0.067), (10, -0.007), (11, -0.075), (12, -0.145), (13, -0.044), (14, -0.032), (15, 0.022), (16, 0.184), (17, 0.102), (18, 0.045), (19, -0.047), (20, -0.043), (21, -0.04), (22, 0.023), (23, 0.028), (24, 0.271), (25, -0.063), (26, -0.065), (27, -0.05), (28, -0.06), (29, 0.086), (30, -0.035), (31, -0.018), (32, -0.039), (33, 0.146), (34, -0.051), (35, 0.153), (36, -0.03), (37, 0.008), (38, -0.067), (39, 0.02), (40, -0.048), (41, 0.075), (42, 0.069), (43, -0.052), (44, -0.177), (45, 0.066), (46, -0.069), (47, 0.014), (48, 0.036), (49, -0.101)]
simIndex simValue paperId paperTitle
same-paper 1 0.93153322 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
2 0.66733485 199 nips-2010-Optimal learning rates for Kernel Conjugate Gradient regression
Author: Gilles Blanchard, Nicole Krämer
Abstract: We prove rates of convergence in the statistical sense for kernel-based least squares regression using a conjugate gradient algorithm, where regularization against overfitting is obtained by early stopping. This method is directly related to Kernel Partial Least Squares, a regression method that combines supervised dimensionality reduction with least squares projection. The rates depend on two key quantities: first, on the regularity of the target regression function and second, on the effective dimensionality of the data mapped into the kernel space. Lower bounds on attainable rates depending on these two quantities were established in earlier literature, and we obtain upper bounds for the considered method that match these lower bounds (up to a log factor) if the true regression function belongs to the reproducing kernel Hilbert space. If this assumption is not fulfilled, we obtain similar convergence rates provided additional unlabeled data are available. The order of the learning rates match state-of-the-art results that were recently obtained for least squares support vector machines and for linear regularization operators. 1
3 0.51952243 250 nips-2010-Spectral Regularization for Support Estimation
Author: Ernesto D. Vito, Lorenzo Rosasco, Alessandro Toigo
Abstract: In this paper we consider the problem of learning from data the support of a probability distribution when the distribution does not have a density (with respect to some reference measure). We propose a new class of regularized spectral estimators based on a new notion of reproducing kernel Hilbert space, which we call “completely regular”. Completely regular kernels allow to capture the relevant geometric and topological properties of an arbitrary probability space. In particular, they are the key ingredient to prove the universal consistency of the spectral estimators and in this respect they are the analogue of universal kernels for supervised problems. Numerical experiments show that spectral estimators compare favorably to state of the art machine learning algorithms for density support estimation.
4 0.47956234 281 nips-2010-Using body-anchored priors for identifying actions in single images
Author: Leonid Karlinsky, Michael Dinerstein, Shimon Ullman
Abstract: This paper presents an approach to the visual recognition of human actions using only single images as input. The task is easy for humans but difficult for current approaches to object recognition, because instances of different actions may be similar in terms of body pose, and often require detailed examination of relations between participating objects and body parts in order to be recognized. The proposed approach applies a two-stage interpretation procedure to each training and test image. The first stage produces accurate detection of the relevant body parts of the actor, forming a prior for the local evidence needed to be considered for identifying the action. The second stage extracts features that are anchored to the detected body parts, and uses these features and their feature-to-part relations in order to recognize the action. The body anchored priors we propose apply to a large range of human actions. These priors allow focusing on the relevant regions and relations, thereby significantly simplifying the learning process and increasing recognition performance. 1
5 0.44862619 74 nips-2010-Empirical Bernstein Inequalities for U-Statistics
Author: Thomas Peel, Sandrine Anthoine, Liva Ralaivola
Abstract: We present original empirical Bernstein inequalities for U-statistics with bounded symmetric kernels q. They are expressed with respect to empirical estimates of either the variance of q or the conditional variance that appears in the Bernsteintype inequality for U-statistics derived by Arcones [2]. Our result subsumes other existing empirical Bernstein inequalities, as it reduces to them when U-statistics of order 1 are considered. In addition, it is based on a rather direct argument using two applications of the same (non-empirical) Bernstein inequality for U-statistics. We discuss potential applications of our new inequalities, especially in the realm of learning ranking/scoring functions. In the process, we exhibit an efficient procedure to compute the variance estimates for the special case of bipartite ranking that rests on a sorting argument. We also argue that our results may provide test set bounds and particularly interesting empirical racing algorithms for the problem of online learning of scoring functions. 1
6 0.44137871 264 nips-2010-Synergies in learning words and their referents
7 0.44047558 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
8 0.42898089 278 nips-2010-Universal Consistency of Multi-Class Support Vector Classification
9 0.40669274 223 nips-2010-Rates of convergence for the cluster tree
10 0.40590072 214 nips-2010-Probabilistic Belief Revision with Structural Constraints
11 0.39880896 263 nips-2010-Switching state space model for simultaneously estimating state transitions and nonstationary firing rates
12 0.39412808 191 nips-2010-On the Theory of Learnining with Privileged Information
13 0.39357108 233 nips-2010-Scrambled Objects for Least-Squares Regression
14 0.38653025 185 nips-2010-Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable
15 0.37763992 285 nips-2010-Why are some word orders more common than others? A uniform information density account
16 0.37628576 232 nips-2010-Sample Complexity of Testing the Manifold Hypothesis
17 0.37194711 9 nips-2010-A New Probabilistic Model for Rank Aggregation
18 0.35685304 243 nips-2010-Smoothness, Low Noise and Fast Rates
19 0.34999099 205 nips-2010-Permutation Complexity Bound on Out-Sample Error
20 0.32978013 148 nips-2010-Learning Networks of Stochastic Differential Equations
topicId topicWeight
[(9, 0.253), (13, 0.054), (17, 0.019), (27, 0.037), (30, 0.126), (35, 0.015), (45, 0.163), (50, 0.035), (52, 0.03), (60, 0.084), (77, 0.05), (78, 0.01), (90, 0.046)]
simIndex simValue paperId paperTitle
same-paper 1 0.78341645 75 nips-2010-Empirical Risk Minimization with Approximations of Probabilistic Grammars
Author: Noah A. Smith, Shay B. Cohen
Abstract: Probabilistic grammars are generative statistical models that are useful for compositional and sequential structures. We present a framework, reminiscent of structural risk minimization, for empirical risk minimization of the parameters of a fixed probabilistic grammar using the log-loss. We derive sample complexity bounds in this framework that apply both to the supervised setting and the unsupervised setting. 1
2 0.67346561 131 nips-2010-Joint Analysis of Time-Evolving Binary Matrices and Associated Documents
Author: Eric Wang, Dehong Liu, Jorge Silva, Lawrence Carin, David B. Dunson
Abstract: We consider problems for which one has incomplete binary matrices that evolve with time (e.g., the votes of legislators on particular legislation, with each year characterized by a different such matrix). An objective of such analysis is to infer structure and inter-relationships underlying the matrices, here defined by latent features associated with each axis of the matrix. In addition, it is assumed that documents are available for the entities associated with at least one of the matrix axes. By jointly analyzing the matrices and documents, one may be used to inform the other within the analysis, and the model offers the opportunity to predict matrix values (e.g., votes) based only on an associated document (e.g., legislation). The research presented here merges two areas of machine-learning that have previously been investigated separately: incomplete-matrix analysis and topic modeling. The analysis is performed from a Bayesian perspective, with efficient inference constituted via Gibbs sampling. The framework is demonstrated by considering all voting data and available documents (legislation) during the 220-year lifetime of the United States Senate and House of Representatives. 1
3 0.67240572 270 nips-2010-Tight Sample Complexity of Large-Margin Learning
Author: Sivan Sabato, Nathan Srebro, Naftali Tishby
Abstract: We obtain a tight distribution-specific characterization of the sample complexity of large-margin classification with L2 regularization: We introduce the γ-adapted-dimension, which is a simple function of the spectrum of a distribution’s covariance matrix, and show distribution-specific upper and lower bounds on the sample complexity, both governed by the γ-adapted-dimension of the source distribution. We conclude that this new quantity tightly characterizes the true sample complexity of large-margin classification. The bounds hold for a rich family of sub-Gaussian distributions. 1
4 0.65944052 220 nips-2010-Random Projection Trees Revisited
Author: Aman Dhesi, Purushottam Kar
Abstract: The Random Projection Tree (RPT REE) structures proposed in [1] are space partitioning data structures that automatically adapt to various notions of intrinsic dimensionality of data. We prove new results for both the RPT REE -M AX and the RPT REE -M EAN data structures. Our result for RPT REE -M AX gives a nearoptimal bound on the number of levels required by this data structure to reduce the size of its cells by a factor s ≥ 2. We also prove a packing lemma for this data structure. Our final result shows that low-dimensional manifolds have bounded Local Covariance Dimension. As a consequence we show that RPT REE -M EAN adapts to manifold dimension as well.
5 0.65941095 222 nips-2010-Random Walk Approach to Regret Minimization
Author: Hariharan Narayanan, Alexander Rakhlin
Abstract: We propose a computationally efficient random walk on a convex body which rapidly mixes to a time-varying Gibbs distribution. In the setting of online convex optimization and repeated games, the algorithm yields low regret and presents a novel efficient method for implementing mixture forecasting strategies. 1
6 0.65727669 193 nips-2010-Online Learning: Random Averages, Combinatorial Parameters, and Learnability
7 0.65539747 243 nips-2010-Smoothness, Low Noise and Fast Rates
8 0.65046763 58 nips-2010-Decomposing Isotonic Regression for Efficiently Solving Large Problems
9 0.64612144 260 nips-2010-Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework
10 0.64161521 88 nips-2010-Extensions of Generalized Binary Search to Group Identification and Exponential Costs
11 0.63945049 117 nips-2010-Identifying graph-structured activation patterns in networks
12 0.63732272 163 nips-2010-Lower Bounds on Rate of Convergence of Cutting Plane Methods
13 0.63578349 40 nips-2010-Beyond Actions: Discriminative Models for Contextual Group Activities
14 0.63387215 265 nips-2010-The LASSO risk: asymptotic results and real world examples
15 0.63319981 280 nips-2010-Unsupervised Kernel Dimension Reduction
16 0.6324653 80 nips-2010-Estimation of Renyi Entropy and Mutual Information Based on Generalized Nearest-Neighbor Graphs
17 0.63123858 155 nips-2010-Learning the context of a category
18 0.63114613 173 nips-2010-Multi-View Active Learning in the Non-Realizable Case
19 0.63048244 87 nips-2010-Extended Bayesian Information Criteria for Gaussian Graphical Models
20 0.63019675 148 nips-2010-Learning Networks of Stochastic Differential Equations