nips nips2007 nips2007-22 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. [sent-9, score-0.661]
2 Such models often include hidden variables, which play an important role in unsupervised learning and general missing data problems. [sent-13, score-0.152]
3 The basic idea of our approach is to create several tractable submodels and train them jointly to agree on their hidden variables. [sent-23, score-0.655]
4 In some applications, it is computationally infeasible to optimize the objective function; Section 4 provides two alternative objectives that lead to tractable algorithms. [sent-27, score-0.222]
5 Section 5 demonstrates that our methods can be applied successfully to large datasets in three real world problem domains—grammar induction, word alignment, and phylogenetic hidden Markov modeling. [sent-28, score-0.384]
6 2 Agreement-based learning of multiple submodels. Assume we have M (sub)models $p_m(x, z; \theta_m)$, $m = 1, \ldots, M$, [sent-29, score-0.754]
7 where each submodel specifies a distribution over the observed data $x \in X$ and some hidden state $z \in Z$. [sent-32, score-0.496]
8 The submodels could be parameterized in completely different ways as long as they are defined on the common event space X × Z. [sent-33, score-0.341]
9 Intuitively, each submodel should capture a different aspect of the data in a tractable way. [sent-34, score-0.498]
10 To learn these submodels, the simplest approach is to train them independently by maximizing the sum of their log-likelihoods: $O_{\mathrm{indep}}(\theta) \stackrel{\text{def}}{=} \sum_m \log \sum_z p_m(x, z; \theta_m) = \sum_m \log p_m(x; \theta_m)$, (1) [sent-35, score-1.048]
11 where $\theta = (\theta_1, \ldots, \theta_M)$ is the collective set of parameters and $p_m(x; \theta_m) = \sum_z p_m(x, z; \theta_m)$ is the likelihood under submodel $p_m$. [sent-38, score-1.63]
12 Given an input x, we can then produce an output z by combining the posteriors pm (z | x; θm ) of the trained submodels. [sent-39, score-0.473]
13 If we view each submodel as trying to solve the same task of producing the desired posterior over z, then it seems advantageous to train the submodels jointly to encourage “agreement on z.” [sent-40, score-0.823]
14 We propose the following objective which realizes this insight: $O_{\mathrm{agree}}(\theta) \stackrel{\text{def}}{=} \log \sum_z \prod_m p_m(x, z; \theta_m) = \sum_m \log p_m(x; \theta_m) + \log \sum_z \prod_m p_m(z \mid x; \theta_m)$. (2) [sent-41, score-1.02]
15 The last term rewards parameter values θ for which the submodels assign probability mass to the same z (conditioned on x); the summation over z reflects the fact that we do not know what z is. [sent-42, score-0.8]
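As a concrete illustration of the difference between (1) and (2), here is a minimal brute-force sketch in Python (a hypothetical toy setup, not code from the paper): two discrete submodels over a three-element hidden space, with both objectives computed by explicit enumeration over z.

```python
import numpy as np

# Hypothetical toy setup: M = 2 submodels over a tiny hidden space Z = {0, 1, 2},
# each given directly by its joint values p_m(x, z) for one fixed observation x.
p1_xz = np.array([0.10, 0.05, 0.05])  # p_1(x, z) for z = 0, 1, 2 (illustrative numbers)
p2_xz = np.array([0.02, 0.12, 0.06])  # p_2(x, z) for z = 0, 1, 2

def o_indep(p_list):
    """Objective (1): sum_m log sum_z p_m(x, z)."""
    return sum(np.log(p.sum()) for p in p_list)

def o_agree(p_list):
    """Objective (2): log sum_z prod_m p_m(x, z), rewarding mass on a common z."""
    return float(np.log(np.prod(np.stack(p_list), axis=0).sum()))

print(o_indep([p1_xz, p2_xz]))  # independent-training objective
print(o_agree([p1_xz, p2_xz]))  # agreement objective; never exceeds O_indep
```

In these toy numbers, O_agree is strictly smaller than O_indep because the two submodels place most of their posterior mass on different values of z.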
16 Then Oagree is the probability that the submodels all generate the same observed data x and the same hidden state: $p(x_1 = \cdots = x_M = x,\ z_1 = \cdots = z_M; \theta)$. [sent-51, score-0.446]
17 Oagree is also related to the likelihood of a proper probabilistic model pnorm , obtained by normalizing the product of the submodels, as is done in [3]. [sent-52, score-0.238]
18 Our objective Oagree is then a lower bound on the likelihood under pnorm: $p_{\mathrm{norm}}(x; \theta) \stackrel{\text{def}}{=} \dfrac{\sum_z \prod_m p_m(x, z; \theta_m)}{\sum_{x,z} \prod_m p_m(x, z; \theta_m)} \ge \dfrac{\sum_z \prod_m p_m(x, z; \theta_m)}{\prod_m \sum_{x,z} p_m(x, z; \theta_m)} = O_{\mathrm{agree}}(\theta)$. (3) [sent-53, score-1.614]
19 The inequality holds because the denominator of the lower bound contains additional cross terms. [sent-54, score-0.435]
20 The bound is generally loose, but becomes tighter as each pm becomes more deterministic. [sent-55, score-0.413]
21 Note that pnorm is distinct from the product-of-experts model [3], in which each “expert” model pm has its own set of (nuisance) hidden variables: $p_{\mathrm{poe}}(x) \propto \prod_m \sum_z p_m(x, z; \theta_m)$. [sent-56, score-1.033]
22 In contrast, pnorm has one set of hidden variables z common to all submodels, which is what provides the mechanism for agreement-based learning. [sent-57, score-0.207]
23 The product EM algorithm. We now derive the product EM algorithm to maximize Oagree. [sent-59, score-0.272]
24 Simple algebra reveals that this optimization is equivalent to minimizing a KL-divergence: $L(\theta, q) = -\mathrm{KL}(q(z) \,\|\, \prod_m p_m(x, z; \theta_m)) + \text{constant}$, where the constant does not depend on q. (To simplify notation, we consider one data point x.) [sent-63, score-0.413]
25 This quantity is minimized by setting $q(z) \propto \prod_m p_m(x, z; \theta_m)$. [sent-69, score-0.413]
26 In the (product) M-step, we optimize L with respect to θ, which decomposes into M independent objectives: $L(\theta, q) = \sum_m \mathbb{E}_q \log p_m(x, z; \theta_m) + \text{constant}$, where this constant does not depend on θ. [sent-70, score-0.458]
27 Thus, our product EM algorithm differs from independent EM only in the E-step, in which the submodels are multiplied together to produce one posterior over z rather than M separate ones. [sent-72, score-0.548]
28 Assuming that there is an efficient EM algorithm for each submodel pm , there is no difficulty in performing the product M-step. [sent-73, score-0.94]
29 In our applications (Section 5), each pm is composed of multinomial distributions, so the M-step simply involves computing ratios of expected counts. [sent-74, score-0.413]
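To make the E-step/M-step contrast concrete, here is a schematic sketch of one product EM iteration for the case where Z is small enough to enumerate; the three callback arguments are hypothetical interfaces assumed for illustration, not an API from the paper.

```python
import numpy as np

def product_em_iteration(joint_fns, count_fns, mstep_fns, x, Z):
    """One product EM iteration over an enumerable hidden space Z (schematic sketch).

    joint_fns[m](x, z)  -> p_m(x, z) under the current parameters
    count_fns[m](x, z)  -> sufficient-statistic vector for submodel m at (x, z)
    mstep_fns[m](stats) -> new parameters for submodel m (e.g., normalized counts
                           when p_m is built from multinomial distributions)
    """
    # Product E-step: a single posterior q(z) proportional to the product of the joints.
    weights = np.array([np.prod([f(x, z) for f in joint_fns]) for z in Z])
    q = weights / weights.sum()

    # Product M-step: decomposes into M independent M-steps on expected counts under q.
    new_params = []
    for count_fn, mstep in zip(count_fns, mstep_fns):
        expected = sum(q_z * np.asarray(count_fn(x, z)) for q_z, z in zip(q, Z))
        new_params.append(mstep(expected))
    return q, new_params
```

Independent EM would instead compute a separate posterior $q_m(z) \propto p_m(x, z)$ for each submodel; only that one line of the E-step differs.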
30 On the other hand, the product E-step can become intractable and we must develop approximations (Section 4). [sent-75, score-0.236]
31 We can think of all the submodels pm as being defined on a common space Z∪ = ∪m Zm , but the support of q(z) as computed in the E-step is only the intersection Z∩ = ∩m Zm . [sent-78, score-0.776]
32 Controlling this support will be essential in developing tractable approximations (Section 4. [sent-79, score-0.156]
33 In the general formulation, we required only that the submodels share the same event space X × Z. [sent-81, score-0.341]
34 Now we make explicit the possibility of the submodels sharing features, which gives us more structure for deriving approximations. [sent-82, score-0.341]
35 We can now express our objective L(θ, q) (4) using (5) and (6): $L(\theta, q) = \sum_m (\theta_m^T \phi_X(x))\,(\mathbb{E}_{q(z)} \phi_Z(z)) + H(q) - \sum_m A_m(\theta_m)$ for $q \in Q(Z_\cap)$, (7) where $Q(Z') \stackrel{\text{def}}{=} \{q : q(z) = 0 \text{ for } z \notin Z'\}$ is the set of distributions with support $Z'$. [sent-86, score-0.171]
36 Substituting µ and $A^*_{Z_\cap}(\mu)$ into (7), we obtain an objective in terms of the dual variables µ: $L^*(\theta, \mu) \stackrel{\text{def}}{=} \sum_m (\theta_m^T \phi_X(x))\,\mu - A^*_{Z_\cap}(\mu) - \sum_m A_m(\theta_m)$ for $\mu \in \mathcal{M}(Z_\cap)$. (10) [sent-95, score-0.192]
37 The mean parameters µ are exactly the z-specific expected sufficient statistics computed in the product E-step. [sent-97, score-0.136]
38 The product EM algorithm is summarized below:
Product EM
E-step: $\mu = \operatorname{argmax}_{\mu' \in \mathcal{M}(Z_\cap)} \{b^T \mu' - A^*_{Z_\cap}(\mu')\}$
M-step: $\theta_m = \operatorname{argmax}_{\theta_m \in \Theta_m} \{\theta_m^T \phi_X(x)\,\mu - A_m(\theta_m)\}$
4 Approximations. The product M-step is tractable provided that the M-step for each submodel is tractable, which is generally the case. [sent-99, score-0.77]
39 Define $E(b', Z') \stackrel{\text{def}}{=} \operatorname{argmax}_{\mu' \in \mathcal{M}(Z')} \{b'^T \mu' - A^*_{Z'}(\mu')\}$. (11) Using this notation, $E(b_m, Z_m)$ is the E-step for training the m-th submodel independently using EM and $E(b, Z_\cap)$ is the E-step of product EM. [sent-103, score-0.548]
40 If $E(b_m, Z_m)$ is tractable and all submodels have the same dynamic programming structure (e.g., [sent-105, score-0.468]
41 if z is a tree and all features are local with respect to that tree), then $E(b, Z_\cap)$ is also tractable: we can incorporate all the features into the same dynamic program and simply run product EM (see Section 5.1). [sent-107, score-0.19]
42 However, $E(b, Z_\cap)$ is intractable in general, owing to two complications: (1) we can sum over each $Z_m$ efficiently but not the intersection $Z_\cap$; and (2) each $b_m$ corresponds to a decomposable graphical model, but the combined $b = \sum_m b_m$ corresponds to a loopy graph. [sent-109, score-0.308]
43 In the sequel, we describe two approximate objective functions addressing each complication, whose maximization can be carried out by performing M independent tractable E-steps. [sent-110, score-0.195]
44 4.1 Domain-approximate product EM. Assume that for each submodel $p_m$, $E(b, Z_m)$ is tractable (see Section 5.2). [sent-112, score-1.047]
45 We propose maximizing the following objective: $L^*_{\mathrm{dom}}(\theta, \mu_1, \ldots, \mu_M)$. [sent-114, score-0.137]
46 The resulting expected sufficient statistics are averaged and used in the product M-step, which breaks down into M separate M-steps. [sent-119, score-0.158]
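Schematically, one iteration then looks as follows (a sketch under the assumption that each submodel exposes routines for its weights b_m, its tractable E-step over Z_m, and its M-step, and that weights and statistics are array-like so they can be added and averaged; these interfaces are hypothetical):

```python
def domain_approx_product_em_iteration(b_fns, estep_fns, mstep_fns, params):
    """One iteration of domain-approximate product EM (schematic sketch).

    b_fns[m](theta_m) -> weights b_m of submodel m at its current parameters
    estep_fns[m](b)   -> mu_m = E(b, Z_m): expected sufficient statistics over the
                         tractable domain Z_m, computed with the combined weights b
    mstep_fns[m](mu)  -> updated parameters for submodel m
    """
    b = sum(b_fn(theta) for b_fn, theta in zip(b_fns, params))  # b = sum_m b_m
    mus = [estep(b) for estep in estep_fns]                     # relax Z_cap to each Z_m
    mu_avg = sum(mus) / len(mus)                                # average the statistics
    return [mstep(mu_avg) for mstep in mstep_fns]               # M independent M-steps
```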
47 While we have not yet established any relationship between our approximation $L^*_{\mathrm{dom}}$ and the original objective $L^*$, we can, however, relate $L^*_{\mathrm{dom}}$ to $L^*_\cup$, which is defined as an analogue of $L^*$ by replacing $Z_\cap$ with $Z_\cup$ in (10). [sent-120, score-0.31]
48 4.2 Parameter-approximate product EM. Now suppose that for each submodel $p_m$, $E(b_m, Z_\cap)$ is tractable (see Section 5.3). [sent-138, score-1.047]
49 We propose maximizing the following objective: $L^*_{\mathrm{par}}(\theta, \mu_1, \ldots, \mu_M)$. [sent-138, score-0.137]
50 The product E-step could also be approximated by mean-field or loopy belief propagation variants. [sent-159, score-0.157]
51 The two approximations we developed have the advantage of permitting exact tractable solutions without resorting to expensive iterative methods which are only guaranteed to converge to a local optimum. [sent-161, score-0.156]
52 While we still lack a complete theory relating our approximations $L^*_{\mathrm{dom}}$ and $L^*_{\mathrm{par}}$ to the original objective $L^*$, we can give some intuitions. [sent-162, score-0.332]
53 Since we are operating in the space of expected sufficient statistics µm , most of the information about the full posterior pm (z | x) must be captured in these statistics alone. [sent-163, score-0.44]
54 Therefore, we expect our approximations to be accurate when each submodel has enough capacity to represent the posterior pm (z | x; θm ) as a low-variance unimodal distribution. [sent-164, score-0.88]
55 5 Applications. We now empirically validate our algorithms on three concrete applications: grammar induction using product EM (Section 5.1), [sent-165, score-0.279]
56 unsupervised word alignment using domain-approximate product EM (Section 5.2), [sent-166, score-0.475]
57 and prediction of missing nucleotides in DNA sequences using parameter-approximate product EM (Section 5.3). [sent-167, score-0.278]
58 [Figure 1 panels: HMM model, (a) Submodel p1, (b) Submodel p2; accompanying plot: alignment error rate versus iteration.] [sent-169, score-0.175]
59 Figure 1: The two instances of IBM model 1 for word alignment are shown in (a) and (b). [sent-175, score-0.312]
60 5.1 Grammar induction. Grammar induction is the problem of inducing latent syntactic structures given a set of observed sentences. [sent-178, score-0.144]
61 There are two common types of syntactic structure (one based on word dependencies and the other based on constituent phrases), which can each be represented as a submodel. [sent-179, score-0.173]
62 Their algorithm is a special case of our product EM algorithm, although they did not state an objective function. [sent-181, score-0.202]
63 Since the shared hidden state is a tree structure, product EM is tractable. [sent-182, score-0.275]
64 They show that training the two submodels to agree significantly improves accuracy over independent training. [sent-183, score-0.422]
65 5.2 Unsupervised word alignment. Word alignment is an important component of machine translation systems. [sent-186, score-0.525]
66 The goal of unsupervised word alignment is to match the words in a source sentence to the words in the corresponding target sentence. [sent-189, score-0.362]
67 , f|f | ); z is a set of alignment edges between positions in the English and positions in the French. [sent-196, score-0.175]
68 Classical models for word alignment include IBM models 1 and 2 [2] and the HMM model [8]. [sent-197, score-0.352]
69 These are asymmetric models, which means that they assign non-zero probability only to alignments in which each French word is aligned to at most one English word; we denote this set Z1 . [sent-198, score-0.248]
70 , |e|}, corresponding to the English word (if any) that French word fj is aligned to. [sent-205, score-0.379]
71 We have $\phi_{Z,ij}(z) = 1$ if and only if English word $e_i$ is aligned to French word $f_j$, and $z_{\mathrm{NULL},j} = 1$ if and only if $f_j$ is not aligned to any English word. [sent-214, score-0.59]
72 We can define a second submodel p2 (x, z; θ2 ) on X × Z2 by reversing the roles of English and French. [sent-217, score-0.391]
73 We cannot use the product EM algorithm to train p1 and p2 because summing over all alignments in $Z_\cap = Z_1 \cap Z_2$ is NP-hard. [sent-219, score-0.218]
74 However, we can use domain-approximate product EM because $E(b_1 + b_2, Z_m)$ is tractable: the tractability here does not depend on the decomposability of b but on the asymmetric alignment structure of $Z_m$. [sent-220, score-0.335]
75 The concrete change from independent EM is slight: we only need to change the E-step of each $p_m$ to use the product of translation probabilities $t_{1;ef}\, t_{2;fe}$ and change the M-step to use the average of the edge posteriors obtained from the two E-steps. [sent-221, score-0.666]
76 Their M-step uses the elementwise product of $\mu_1$ and $\mu_2$, whereas we use the average $\frac{1}{2}(\mu_1 + \mu_2)$. [sent-225, score-0.136]
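A minimal sketch of the modified E-step for two IBM Model 1 directions (hypothetical data structures: t1[e][f] and t2[f][e] are nested dicts of translation probabilities, assumed positive; NULL-alignment handling is omitted for brevity):

```python
from collections import defaultdict

def agreement_e_step(e_sent, f_sent, t1, t2):
    """Edge posteriors for one sentence pair, with each direction's E-step using the
    product of the two directions' translation probabilities (schematic sketch)."""
    mu1 = defaultdict(float)  # direction 1: each French word aligns to one English word
    for j, f in enumerate(f_sent):
        scores = [t1[e][f] * t2[f][e] for e in e_sent]
        total = sum(scores)
        for i, s in enumerate(scores):
            mu1[(i, j)] = s / total
    mu2 = defaultdict(float)  # direction 2: each English word aligns to one French word
    for i, e in enumerate(e_sent):
        scores = [t2[f][e] * t1[e][f] for f in f_sent]
        total = sum(scores)
        for j, s in enumerate(scores):
            mu2[(i, j)] = s / total
    # An M-step would accumulate 0.5 * (mu1[ij] + mu2[ij]) as expected counts
    # for both directions and renormalize them into new t1, t2 tables.
    return mu1, mu2
```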
77 Alignments are evaluated using alignment error rate (AER); see [6] for more details. [sent-230, score-0.175]
78 We trained two instances of the HMM model [8] (English-to-French and French-to-English) using 10 iterations of domain-approximate product EM, initializing with independently trained IBM model 1 parameters. [sent-231, score-0.184]
79 For prediction, we output alignment edges with sufficient posterior probability: $\{(i, j) : \frac{1}{2}(\mu_{1;ij} + \mu_{2;ij}) \ge \delta\}$. [sent-232, score-0.202]
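For example, given the mu1 and mu2 edge-posterior dictionaries from the sketch above, the decision rule can be written as follows (delta is a tuning threshold, not a value from the paper):

```python
def predict_edges(mu1, mu2, delta):
    """Keep alignment edges whose averaged posterior reaches the threshold delta."""
    return {ij for ij in set(mu1) | set(mu2)
            if 0.5 * (mu1.get(ij, 0.0) + mu2.get(ij, 0.0)) >= delta}
```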
80 5.3 Phylogenetic HMM models. Suppose we have a set of species $s \in S$ arranged in a fixed phylogeny (i.e., a tree). [sent-235, score-0.159]
81 Each species s is associated with a length-L sequence of nucleotides $d_s = (d_{s1}, \ldots, d_{sL})$. [sent-238, score-0.26]
82 A good phylogenetic model should take into consideration both the relationship between nucleotides of the different species at the same site and the relationship between adjacent nucleotides in the same species. [sent-243, score-0.487]
83 Our approach is to instead create two tractable submodels and train them to agree. [sent-246, score-0.487]
84 Define one submodel to be $p_1(x, z; \theta_1) = p_1(d; \theta_1) = \prod_{j\ \mathrm{odd}} \prod_{s \in S} \prod_{s' \in \mathrm{CH}(s)} p_1(d_{s'j} \mid d_{sj}; \theta_1)\, p_1(d_{s',j+1} \mid d_{s'j}, d_{s,j+1}; \theta_1)$, (15) where $\mathrm{CH}(s)$ is the set of children of s in the tree. [sent-247, score-0.473]
85 The second submodel p2 is defined similarly, only with the product taken over j even. [sent-248, score-0.527]
86 Both submodels permit the same set of assignments of hidden nucleotides (Z∩ = Z1 = Z2 ). [sent-250, score-0.588]
87 Exact product EM is not tractable since b = b1 + b2 corresponds to a graph with high tree-width. [sent-252, score-0.243]
88 We can apply parameter-approximate product EM, in which the E-step only involves computing µm = E(2bm , Z∩ ). [sent-253, score-0.136]
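A schematic sketch of that E-step (the per-submodel inference routines are hypothetical interfaces, and the weights are assumed to be array-like so they can be scaled):

```python
def parameter_approx_e_step(b_fns, estep_fns, params):
    """Parameter-approximate product E-step for two submodels (schematic sketch).

    b_fns[m](theta_m) -> decomposable weights b_m of submodel m
    estep_fns[m](b)   -> E(b, Z_cap): exact expected sufficient statistics computed
                         within submodel m's own tractable graphical structure
    Each submodel is run on doubled weights 2 * b_m, which stands in for the combined
    b_1 + b_2 of exact product EM under the approximation that the two submodels'
    weights are similar.
    """
    return [estep(2.0 * b_fn(theta))
            for b_fn, estep, theta in zip(b_fns, estep_fns, params)]
```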
89 Our experiments used a multiple alignment consisting of L = 20,000 consecutive sites belonging to the L1 transposons in the Cystic Fibrosis Transmembrane Conductance Regulator (CFTR) gene (chromosome 7). [sent-256, score-0.175]
90 Eight eutherian species were arranged in the phylogeny shown in Figure 3. [sent-257, score-0.139]
91 We trained two models using 30 iterations of parameter-approximate product EM. [sent-261, score-0.18]
92 Figure 3: The tree is the phylogeny topology used in experiments. [sent-278, score-0.223]
93 The graphs show the prediction accuracy of independent versus agreement-based training (parameter-approximate product EM) when 20% and 50% of the observed nodes are held out. [sent-279, score-0.204]
94 The posteriors over held-out nucleotides under each model are averaged and the one with the highest posterior is chosen. [sent-280, score-0.169]
95 Viewing these submodels as components of an overall model, our framework permits the submodels to be trained jointly without paying the computational cost associated with an actual jointly-normalized probability model. [sent-285, score-0.757]
96 We have presented an objective function for agreement-based learning and three EM-style algorithms that maximize this objective or approximations to this objective. [sent-286, score-0.181]
97 For word alignment and phylogenetic HMMs, our approach provides entirely new algorithms. [sent-289, score-0.454]
98 Acknowledgments We would like to thank Adam Siepel for providing the phylogenetic data and acknowledge the support of the Defense Advanced Research Projects Agency under contract NBCHD030010. [sent-290, score-0.142]
99 Efficient approximations for learning phylogenetic HMM models from data. [sent-320, score-0.211]
100 Combining phylogenetic and hidden Markov models in biosequence analysis. [sent-346, score-0.267]
wordName wordTfidf (topN-words)
[('pm', 0.413), ('submodel', 0.391), ('submodels', 0.341), ('zm', 0.299), ('em', 0.238), ('oagree', 0.184), ('alignment', 0.175), ('phylogenetic', 0.142), ('nucleotides', 0.142), ('word', 0.137), ('product', 0.136), ('dom', 0.122), ('bm', 0.117), ('tractable', 0.107), ('hidden', 0.105), ('def', 0.105), ('pnorm', 0.102), ('par', 0.095), ('english', 0.088), ('ull', 0.082), ('az', 0.082), ('heldout', 0.071), ('hmm', 0.071), ('argmax', 0.071), ('grammar', 0.068), ('objective', 0.066), ('french', 0.065), ('siepel', 0.061), ('fj', 0.061), ('species', 0.061), ('ds', 0.057), ('induction', 0.054), ('phylogeny', 0.053), ('eq', 0.052), ('linguistics', 0.05), ('objectives', 0.049), ('approximations', 0.049), ('aligned', 0.044), ('alignments', 0.043), ('ibm', 0.042), ('azm', 0.041), ('oindep', 0.041), ('pseudolikelihood', 0.041), ('klein', 0.041), ('exponential', 0.039), ('train', 0.039), ('ascent', 0.039), ('ij', 0.038), ('translation', 0.038), ('sentences', 0.038), ('agree', 0.038), ('syntactic', 0.036), ('posteriors', 0.036), ('ri', 0.036), ('jojic', 0.036), ('mutation', 0.036), ('xm', 0.035), ('tree', 0.034), ('division', 0.034), ('maximizing', 0.032), ('berkeley', 0.032), ('aj', 0.031), ('piecewise', 0.031), ('intractable', 0.031), ('pietra', 0.03), ('ei', 0.03), ('bt', 0.03), ('language', 0.029), ('california', 0.028), ('liang', 0.027), ('unsupervised', 0.027), ('posterior', 0.027), ('auxiliary', 0.027), ('permits', 0.026), ('composite', 0.026), ('arranged', 0.025), ('held', 0.025), ('odd', 0.025), ('family', 0.025), ('jointly', 0.025), ('asymmetric', 0.024), ('maximized', 0.024), ('sup', 0.024), ('trained', 0.024), ('sentence', 0.023), ('log', 0.023), ('separate', 0.022), ('independent', 0.022), ('intersection', 0.022), ('inequality', 0.022), ('dual', 0.021), ('training', 0.021), ('concrete', 0.021), ('loopy', 0.021), ('variational', 0.021), ('sutton', 0.02), ('develop', 0.02), ('dynamic', 0.02), ('models', 0.02), ('duality', 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 1.0 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
2 0.24789694 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Gama
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
3 0.16628195 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
Author: Tony Jebara, Yingbo Song, Kapil Thadani
Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.
4 0.16393445 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
5 0.1600063 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
6 0.096562736 9 nips-2007-A Probabilistic Approach to Language Change
7 0.09510348 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
8 0.083823226 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
9 0.065183446 185 nips-2007-Stable Dual Dynamic Programming
10 0.058938969 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
11 0.057967246 160 nips-2007-Random Features for Large-Scale Kernel Machines
12 0.052357726 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
13 0.05046922 75 nips-2007-Efficient Bayesian Inference for Dynamically Changing Graphs
14 0.050297737 34 nips-2007-Bayesian Policy Learning with Trans-Dimensional MCMC
15 0.049125139 197 nips-2007-The Infinite Markov Model
16 0.048633233 192 nips-2007-Testing for Homogeneity with Kernel Fisher Discriminant Analysis
17 0.047470395 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
18 0.046121467 214 nips-2007-Variational inference for Markov jump processes
19 0.043490943 101 nips-2007-How SVMs can estimate quantiles and the median
20 0.043181479 135 nips-2007-Multi-task Gaussian Process Prediction
topicId topicWeight
[(0, -0.164), (1, 0.029), (2, -0.063), (3, -0.138), (4, 0.022), (5, -0.147), (6, -0.015), (7, -0.082), (8, -0.074), (9, -0.008), (10, 0.073), (11, -0.037), (12, -0.029), (13, -0.21), (14, -0.023), (15, -0.242), (16, -0.21), (17, 0.208), (18, -0.051), (19, 0.087), (20, -0.015), (21, -0.13), (22, -0.14), (23, 0.016), (24, 0.036), (25, 0.05), (26, 0.017), (27, -0.136), (28, 0.019), (29, 0.08), (30, -0.142), (31, 0.04), (32, 0.161), (33, -0.097), (34, -0.117), (35, 0.147), (36, -0.096), (37, -0.038), (38, 0.061), (39, 0.07), (40, 0.067), (41, -0.09), (42, 0.054), (43, -0.049), (44, -0.01), (45, 0.042), (46, -0.01), (47, 0.0), (48, 0.018), (49, -0.003)]
simIndex simValue paperId paperTitle
same-paper 1 0.96298343 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
2 0.80029637 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Gama
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
3 0.63347036 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
4 0.61557394 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
5 0.61485678 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
Author: Tony Jebara, Yingbo Song, Kapil Thadani
Abstract: A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or id) sample data (as in nonparametric kernel density estimators) to the extreme of independent identically distributed (or iid) sample data. This article makes independent similarly distributed (or isd) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the isd method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed isd scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the value of isd over iid estimation, id estimation and mixture modeling.
6 0.48008531 9 nips-2007-A Probabilistic Approach to Language Change
7 0.42436704 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
8 0.31425038 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
9 0.31300318 82 nips-2007-Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization
10 0.30535978 185 nips-2007-Stable Dual Dynamic Programming
11 0.29746303 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
12 0.28693774 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
13 0.25877535 47 nips-2007-Collapsed Variational Inference for HDP
14 0.25102937 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
15 0.24123439 61 nips-2007-Convex Clustering with Exemplar-Based Models
16 0.23829021 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
17 0.23527253 119 nips-2007-Learning with Tree-Averaged Densities and Distributions
18 0.23520567 181 nips-2007-Sparse Overcomplete Latent Variable Decomposition of Counts Data
19 0.21510608 159 nips-2007-Progressive mixture rules are deviation suboptimal
20 0.20864169 214 nips-2007-Variational inference for Markov jump processes
topicId topicWeight
[(5, 0.035), (13, 0.479), (16, 0.015), (21, 0.033), (31, 0.022), (34, 0.021), (35, 0.021), (47, 0.05), (49, 0.019), (83, 0.129), (85, 0.025), (87, 0.011), (90, 0.049)]
simIndex simValue paperId paperTitle
1 0.94758719 14 nips-2007-A configurable analog VLSI neural network with spiking neurons and self-regulating plastic synapses
Author: Massimiliano Giulioni, Mario Pannunzi, Davide Badoni, Vittorio Dante, Paolo D. Giudice
Abstract: We summarize the implementation of an analog VLSI chip hosting a network of 32 integrate-and-fire (IF) neurons with spike-frequency adaptation and 2,048 Hebbian plastic bistable spike-driven stochastic synapses endowed with a selfregulating mechanism which stops unnecessary synaptic changes. The synaptic matrix can be flexibly configured and provides both recurrent and AER-based connectivity with external, AER compliant devices. We demonstrate the ability of the network to efficiently classify overlapping patterns, thanks to the self-regulating mechanism.
2 0.92520416 191 nips-2007-Temporal Difference Updating without a Learning Rate
Author: Marcus Hutter, Shane Legg
Abstract: We derive an equation for temporal difference learning from statistical principles. Specifically, we start with the variational principle and then bootstrap to produce an updating rule for discounted state value estimates. The resulting equation is similar to the standard equation for temporal difference learning with eligibility traces, so called TD(λ), however it lacks the parameter α that specifies the learning rate. In the place of this free parameter there is now an equation for the learning rate that is specific to each state transition. We experimentally test this new learning rule against TD(λ) and find that it offers superior performance in various settings. Finally, we make some preliminary investigations into how to extend our new temporal difference algorithm to reinforcement learning. To do this we combine our update equation with both Watkins’ Q(λ) and Sarsa(λ) and find that it again offers superior performance without a learning rate parameter. 1
3 0.91167092 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
Author: Fred Richardson, William M. Campbell
Abstract: Many tasks in speech processing involve classification of long term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique for determining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive sequences, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for the construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.
same-paper 4 0.88686675 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
5 0.84494984 62 nips-2007-Convex Learning with Invariances
Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola
Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1
6 0.67184252 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
7 0.6193471 84 nips-2007-Expectation Maximization and Posterior Constraints
8 0.59930754 117 nips-2007-Learning to classify complex patterns using a VLSI network of spiking neurons
9 0.56288129 102 nips-2007-Incremental Natural Actor-Critic Algorithms
10 0.53837377 9 nips-2007-A Probabilistic Approach to Language Change
11 0.53175884 205 nips-2007-Theoretical Analysis of Learning with Reward-Modulated Spike-Timing-Dependent Plasticity
12 0.52602792 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion
13 0.51695603 151 nips-2007-Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs
14 0.51189137 86 nips-2007-Exponential Family Predictive Representations of State
15 0.50836504 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
16 0.50794309 63 nips-2007-Convex Relaxations of Latent Variable Training
17 0.50783193 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
18 0.49776992 40 nips-2007-Bundle Methods for Machine Learning
19 0.49467614 118 nips-2007-Learning with Transformation Invariant Kernels
20 0.49396315 24 nips-2007-An Analysis of Inference with the Universum