nips nips2007 nips2007-84 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kuzman Ganchev, Ben Taskar, João Graça
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. [sent-3, score-0.292]
2 Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. [sent-4, score-0.26]
3 In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. [sent-5, score-0.434]
4 Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. [sent-6, score-0.162]
5 Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. [sent-7, score-0.717]
6 1 Introduction In unsupervised problems where observed data has sequential, recursive, spatial, relational, or other kinds of structure, we often employ statistical models with latent variables to tease apart the underlying dependencies and induce meaningful semantic parts. [sent-8, score-0.212]
7 Part-of-speech and grammar induction, as well as word and phrase alignment for statistical machine translation, are examples of such aims in natural language processing. [sent-9, score-0.849]
8 A pernicious problem with most models is that the data likelihood is not convex in the model parameters and EM can get stuck in local optima with very different latent variable posteriors. [sent-13, score-0.255]
9 Another problem is that data likelihood may not guide the model towards the intended meaning for the latent variables, instead focusing on explaining irrelevant but common correlations in the data. [sent-14, score-0.247]
10 Very indirect methods such as clever initialization and feature design (as well as ad-hoc procedural modifications) are often used to affect the posteriors of latent variables in a desired manner. [sent-15, score-0.354]
11 By allowing prior information to be specified directly about the posteriors of hidden variables, we can help avoid these difficulties. [sent-16, score-0.178]
12 An approach somewhat similar in spirit is evident in work on the multivariate information bottleneck [8], where extra conditional independence assumptions between latent variables can be imposed to control their “meaning”. [sent-17, score-0.134]
13 Similarly, in many semisupervised approaches, assumptions about smoothness or other properties of the posteriors are often used as regularization [18, 13, 4]. [sent-18, score-0.178]
14 In [17], deterministic annealing was used to explicitly control a particular feature of the posteriors of a grammar induction model. [sent-19, score-0.309]
15 In this paper, we present an approach that effectively incorporates rich constraints on posterior distributions of a graphical model into a simple and efficient EM scheme. [sent-20, score-0.217]
16 An important advantage of our approach is that the E-step remains tractable in a large class of problems even though incorporating the desired constraints directly into the model would make it intractable. [sent-21, score-0.201]
17 2 Expectation Maximization and posterior constraints We are interested in estimating the parameters θ of a model pθ (x, z) over observed variables X taking values x ∈ X and latent variables Z taking values z ∈ Z. [sent-24, score-0.348]
18 We are often even more interested in the induced posterior distribution over the latent variables, pθ(z | x), as we ascribe domain-specific semantics to these variables. [sent-25, score-0.167]
19 We assume that computing the joint and the marginals is tractable and that the model factors across cliques as follows: pθ(x, z) ∝ Πα φθ(xα, zα), where φθ(xα, zα) are clique potentials or conditional probability distributions. [sent-27, score-0.169]
20 [14]): LS(θ) = ES[log pθ(x)] = ES[log Σz pθ(x, z)] ≥ ES[Σz q(z | x) log (pθ(x, z) / q(z | x))] = F(q, θ), (1)–(2), where ES[f(x)] = (1/n) Σi f(xi) denotes the sample average and q(z | x) is non-negative and sums to 1 over z for each x. [sent-32, score-0.132]
21 It can be shown that the lower bound can be made tight for a given value of θ by maximizing over q and under mild continuity conditions on pθ (x, z), local maxima (q ∗ , θ∗ ) of F (q, θ) correspond to local maxima θ∗ of LS (θ) [14]. [sent-34, score-0.314]
22 The E step computes the posteriors of the latent variables given the observed variables and current parameters. [sent-36, score-0.357]
23 The M step uses q to “fill in” the values of latent variables z and estimate parameters θ as if the data was complete. [sent-37, score-0.134]
24 In the following, we build on this simple scheme while incorporating desired constraints on the posteriors over latent variables. [sent-40, score-0.431]
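A minimal sketch of the resulting scheme (not from the paper; the E-step, M-step, and projection routines are problem-specific placeholders introduced here for illustration): the only change to standard EM is that each posterior is projected onto the constraint set Q(x) before the M-step.

```python
def constrained_em(data, init_theta, e_step, m_step, project_to_Q, n_iters=50):
    """Sketch of EM with posterior constraints.

    e_step(theta, x)    -> unconstrained posterior p_theta(z | x) for example x
    project_to_Q(q, x)  -> argmin over q' in Q(x) of KL(q' || q), the constrained posterior
    m_step(data, posts) -> new parameters, treating the projected posteriors as complete data
    All three callables are hypothetical placeholders; the projection is the subject of Section 2.1.
    """
    theta = init_theta
    for _ in range(n_iters):
        # Modified E-step: compute the model posterior, then project it onto Q(x).
        posteriors = [project_to_Q(e_step(theta, x), x) for x in data]
        # Standard M-step on the projected posteriors.
        theta = m_step(data, posteriors)
    return theta
```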
25 1 Constraining the posteriors Our goal is to allow finer-grained control over the posteriors, bypassing the likelihood function. [sent-42, score-0.239]
26 We can express our desired constraints on the posteriors as the requirement that pθ (z | x) ∈ Q(x). [sent-44, score-0.342]
27 For example, in dependency grammar induction, constraining the average length of dependency attachments is desired [17]; in statistical word alignment, the constraint might involve the expected degree of each node in the alignment [3]. [sent-45, score-0.801]
28 Instead of restricting p directly, which might not be feasible, we can penalize the distance of p to the constraint set Q. [sent-46, score-0.124]
29 The situation here is the opposite: we assume the original posterior space is tractable but we add constraints to enforce intended semantics not captured by the simple model. [sent-49, score-0.356]
30 A natural and general way to specify constraints on q is by bounding expectations of given functions: Eq [f (x, z)] ≤ b (equality can be achieved by adding Eq [−f (x, z)] ≤ −b). [sent-53, score-0.122]
31 We are no longer guaranteed that the local maxima of the constrained problem are local maxima of the log-likelihood. [sent-61, score-0.314]
32 However, we can characterize the objective maximized at local maxima as log-likelihood penalized by average KL divergence of posteriors from Q: Proposition 2. [sent-62, score-0.335]
33 1 The local maxima of F(q, θ) such that q(z | x) ∈ Q(x), ∀x ∈ S are local maxima of ES[log pθ(x)] − ES[KL(Q(x) || pθ(z | x))], where KL(Q(x) || pθ(z | x)) = min_{q(z|x)∈Q(x)} KL(q(z | x) || pθ(z | x)). [sent-63, score-0.314]
34 This proposition implies that our procedure trades off likelihood and distance to the desired posterior subspace (modulo getting stuck in local maxima) and provides an effective method of controlling the posteriors. [sent-66, score-0.216]
35 [5, 1]): arg max_{λ≥0} −λ⊤b − log Σz pθt(z | x) exp{−λ⊤f(x, z)} (11). Define qλ(z | x) ∝ pθt(z | x) exp{−λ⊤f(x, z)}; then at the dual optimum λ∗, the primal solution is given by qλ∗(z | x). [sent-70, score-0.131]
36 Such projections become particularly efficient when we assume the constraint functions decompose the same way as the graphical model: f(x, z) = Σα f(xα, zα). [sent-71, score-0.165]
37 The EM algorithm clusters each column of points together, but if we introduce the constraint that each column should contain at least one point from each cluster in expectation, we get the clustering to the right. [sent-75, score-0.195]
38 0 1 2 3 4 5 6 7 8 1 · · · · · · · • · • · · ju de ga ba n 0 · · · · · · 2 · · • · · • 3 · · · · · · 4 · · · • · · 5 · · · · • 6 · · · · · · • · • • • · · · · · · · · un ma an y a ne ima ra da 7 · · · · · · 8 · · · · · · · · · • · · · · • m co . [sent-76, score-0.374]
39 uy rd ia l 0 1 2 3 4 5 6 7 8 0 · · · 1 · · · · • · · · · · · · • · · ju de ga ba n 2 · · 3 · · · • • · · · · · · un a 4 · · · • 5 · · · · · · · · · m · • · · · · · · · · a y an nim era ad a 6 · · · · · • 7 · · · · · · 8 · · · · · · · · · • · · · · • m co . [sent-77, score-0.437]
40 uy rd ia l 0 1 2 3 4 5 6 7 8 0 · · · 1 · · · · • · · · · · · · • · · ju de ga ba n 2 · · 3 4 5 · · · · · · · · · • · · • · · · · • · · · · · · · · · · · · · · · · un ma an y a ne ima ra da 6 · · · · · • 7 · · · · · · 8 · · · · · · · · · • · · · · • m co . [sent-78, score-0.524]
41 uy rd ia it was an animated , very convivial game . [sent-79, score-0.2]
42 In our experiments, we used constraint functions that decompose in the same way as the original model. [sent-85, score-0.117]
43 Note that even in this case, the graphical model pθ (x, z) can not in general satisfy the expectation constraints for every setting of θ and x. [sent-86, score-0.17]
44 Instead, the constrained EM procedure tunes θ to the distribution of x so that these constraints are satisfied in expectation. [sent-87, score-0.122]
45 Every gradient computation thus involves computing marginals of qλ (z | x), which is of the same complexity as computing marginals of pθ (z | x) if no new cliques are added by the constraint functions. [sent-91, score-0.248]
46 In practice, we do not need to solve the dual to a very high precision in every round of EM, so several (about 5-10) gradient steps suffice. [sent-92, score-0.126]
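For a small problem where the posterior can be enumerated explicitly, the projection and its dual gradient steps look roughly as follows (a sketch under the assumption of a fully enumerated z space and constraints Eq[f(x, z)] ≤ b; real models would instead read the required expectations off clique marginals):

```python
import numpy as np

def kl_project(p, F, b, n_steps=10, step_size=0.5):
    """Project a discrete posterior p (length-K array over assignments z, all entries > 0)
    onto {q : F @ q <= b} by projected gradient ascent on dual variables lambda >= 0.

    F is (num_constraints, K) with F[k, z] = f_k(x, z); b has length num_constraints.
    """
    lam = np.zeros(F.shape[0])
    for _ in range(n_steps):
        # q_lambda(z) proportional to p(z) * exp(-lambda . f(x, z))
        logq = np.log(p) - lam @ F
        q = np.exp(logq - logq.max())
        q /= q.sum()
        # Dual gradient is E_q[f] - b; ascend and clip to keep lambda nonnegative.
        lam = np.maximum(0.0, lam + step_size * (F @ q - b))
    logq = np.log(p) - lam @ F
    q = np.exp(logq - logq.max())
    return q / q.sum()
```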
47 When the number of constraints is small, alternating projections are also a good option. [sent-93, score-0.122]
48 Both of these constraints are easy to capture and implement in our framework. [sent-100, score-0.122]
49 Let zij = 1 represent the event that data point i is assigned to cluster j. [sent-101, score-0.214]
50 If we want to ensure that data point i is not assigned to the same cluster as data point i′, then we need to enforce the constraint E[zij + zi′j] ≤ 1, ∀j. [sent-102, score-0.183]
51 To ensure that each cluster has at least one data point assigned to it from an instance I, we need to enforce the constraint E[Σi∈I zij] ≥ 1, ∀j. [sent-103, score-0.447]
52 We implemented this constraint in a mixture of Gaussians clustering algorithm. [sent-104, score-0.146]
53 Figure 1 compares clustering of synthetic data using unconstrained EM as well as our method with the constraint that each column of data points has at least one copy of each cluster in expectation. [sent-105, score-0.213]
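A rough per-instance sketch of this projected E-step (not the authors' implementation; the step size and iteration count are illustrative): each column's mixture responsibilities are tilted by dual variables until every cluster receives at least one expected point.

```python
import numpy as np

def project_at_least_one_per_cluster(resp, n_steps=20, step_size=0.5):
    """resp: (n_points, n_clusters) responsibilities for the points of one instance
    (e.g. one column in Figure 1). Enforces sum_i q[i, j] >= 1 for every cluster j
    by dual ascent on mu_j >= 0, with q[i, :] proportional to resp[i, :] * exp(mu)."""
    mu = np.zeros(resp.shape[1])
    for _ in range(n_steps):
        q = resp * np.exp(mu)              # tilt each point's posterior toward under-used clusters
        q /= q.sum(axis=1, keepdims=True)  # renormalise per point
        slack = 1.0 - q.sum(axis=0)        # shortfall of each cluster below one expected point
        mu = np.maximum(0.0, mu + step_size * slack)
    q = resp * np.exp(mu)
    return q / q.sum(axis=1, keepdims=True)
```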
54 4 4 Statistical word alignment Statistical word alignment, used primarily for machine translation, is a task where the latent variables are intended to have a meaning: whether a word in one language translates into a word in another language in the context of the given sentence pair. [sent-106, score-1.626]
55 The input to an alignment system is a sentence-aligned bilingual corpus, consisting of pairs of sentences in two languages. [sent-107, score-0.504]
56 Figure 2 shows three machine-generated alignments of a sentence pair. [sent-108, score-0.301]
57 The black dots represent the machine alignments and the shading represents the human annotation. [sent-109, score-0.229]
58 Darkly shaded squares with a border represent sure alignments that the system is required to produce, while lightly shaded squares without a border represent possible alignments that the system is optionally allowed to produce. [sent-110, score-0.571]
59 We denote one language the “source” language and use s for its sentences, and the other the “target” language and use t for its sentences. [sent-111, score-0.544]
60 It will also be useful to talk about an alignment for a particular sentence pair as a binary matrix z, with zij = 1 representing “source word i generates target word j. [sent-112, score-0.988]
61 ” The generative models we consider generate target word j from only one source word, and so an alignment is only valid from the point of view of the model when Σi zij = 1, so we can equivalently represent an alignment as an array a of indices, with aj = i ⇔ zij = 1. [sent-113, score-1.504]
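The two representations are interchangeable; a small sketch of the conversion (array shapes and function names are ours, not the paper's):

```python
import numpy as np

def matrix_to_indices(z):
    """Binary alignment matrix z (n_source x n_target) -> index array a with a[j] = i
    iff z[i, j] = 1; assumes each target word is generated by exactly one source word."""
    return np.argmax(z, axis=0)

def indices_to_matrix(a, n_source):
    """Index array a -> binary alignment matrix z of shape (n_source, len(a))."""
    a = np.asarray(a)
    z = np.zeros((n_source, len(a)), dtype=int)
    z[a, np.arange(len(a))] = 1
    return z
```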
62 Figure 2 shows three alignments performed by a baseline model as well as our two modifications. [sent-114, score-0.352]
63 We see that the rare word “convivial” acts as a garbage collector[2], aligning to words that do not have a simple translation in the target sentence. [sent-115, score-0.449]
64 Both of the constraints we suggest repair this problem to different degrees. [sent-116, score-0.122]
65 We now introduce the baseline models and the constraints we impose on them. [sent-117, score-0.284]
66 The three models can be expressed as: p(t, a | s) = Πj pd(aj | j, aj−1) pt(tj | saj), (12) with the three models differing in their definition of the distortion probability pd(aj | j, aj−1). [sent-120, score-0.288]
67 Model 2 allows a dependence on the positions, pd(aj | j, aj−1) = pd(aj | j), and the HMM model assumes that only the distance between the current and previous source positions matters, pd(aj | j, aj−1) = pd(aj − aj−1). [sent-122, score-0.635]
68 All the models are augmented by adding a special “null” word to the source sentence. [sent-123, score-0.33]
69 The likelihood of the corpus, marginalized over possible alignments, is concave for Model 1, but not for the other models [3]. [sent-124, score-0.329]
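As a small illustration of equation (12) (a sketch; the lookup functions are hypothetical placeholders, and the source sentence is assumed to already contain the special "null" word):

```python
def log_prob_alignment(source, target, a, log_pt, log_pd):
    """log p(t, a | s) = sum_j [ log pd(a_j | j, a_prev) + log pt(t_j | s_{a_j}) ].

    log_pt(t_word, s_word) -> log translation probability (table lookup, placeholder)
    log_pd(a_j, j, a_prev) -> log distortion probability; constant for Model 1,
                              independent of a_prev for Model 2, and a function of
                              a_j - a_prev for the HMM model. a_prev is None for j = 0.
    """
    total, a_prev = 0.0, None
    for j, t_word in enumerate(target):
        total += log_pd(a[j], j, a_prev) + log_pt(t_word, source[a[j]])
        a_prev = a[j]
    return total
```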
70 1 Substochastic Constraints A common error for our baseline models is to use rare source words as garbage collectors [2]. [sent-127, score-0.386]
71 The models align target words that do not match any of the source words to rare source words rather than to the null word. [sent-128, score-0.39]
72 While this results in higher data likelihood, the resulting alignments are not desirable, since they cannot be interpreted as translations. [sent-129, score-0.229]
73 One might consider augmenting the models to disallow this, for example by requiring that the alignments be at most one-to-one. [sent-131, score-0.308]
74 Our approach is to instead constrain the posterior distribution over alignments during the E-step. [sent-133, score-0.276]
75 More concretely, we enforce the constraint Eq[Σj zij] ≤ 1 for each source word i. [sent-134, score-0.149]
76 Another way of thinking of this constraint is that we require the expected fertility of each source word to be at most one. [sent-135, score-0.528]
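For a model whose posterior factorizes over target positions (Model 1/2 style), the substochastic projection reduces to the following sketch (illustrative only; for the HMM model the tilted posteriors would instead be recomputed with forward–backward):

```python
import numpy as np

def project_substochastic(post, n_steps=10, step_size=0.5):
    """post: (n_source, n_target) array with post[i, j] = p(a_j = i | s, t), columns summing to 1.
    Enforces E_q[sum_j z_ij] <= 1 (expected fertility of each source word at most one)
    by dual ascent on lambda_i >= 0."""
    lam = np.zeros(post.shape[0])
    for _ in range(n_steps):
        q = post * np.exp(-lam)[:, None]   # penalise over-used source words
        q /= q.sum(axis=0, keepdims=True)  # renormalise each target word's posterior
        fert = q.sum(axis=1)               # expected fertility of each source word
        lam = np.maximum(0.0, lam + step_size * (fert - 1.0))
    q = post * np.exp(-lam)[:, None]
    return q / q.sum(axis=0, keepdims=True)
```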
77 For our hand-aligned corpora Hansards [15] and EPPS [11, 10], the average fertility is around 1 and 1. [sent-136, score-0.196]
78 We will see that these constraints improve alignment accuracy. [sent-139, score-0.446]
79 2 Agreement Constraints Another weakness of our baseline models is that they are asymmetric. [sent-142, score-0.162]
80 In our framework, we can … [Table residue: maximum and average sentence-length statistics for English and French on the Hansards corpus (447 test sentences); the rest of the sentence and the table did not survive extraction.] [sent-146, score-0.108]
81 Define a mixture p(z) = ½ (p→(z) + p←(z)) of the two directional models over Z = Z→ ∪ Z←; the constraint functions in this setup are Eq[fij(x, z)] = 0, with fij(x, z) = 1 if z ∈ Z→ and zij = 1, −1 if z ∈ Z← and zij = 1, and 0 otherwise. [sent-173, score-0.18]
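Under this mixture, E[fij] is half the difference between the two directional models' posterior link probabilities, so the constraint Eq[fij] = 0 forces the two directions to agree on every link; a tiny sketch (the array orientation is an assumption):

```python
import numpy as np

def agreement_violation(post_fwd, post_bwd):
    """post_fwd, post_bwd: (n_source, n_target) arrays of posterior link probabilities
    p(z_ij = 1 | s, t) under the forward and backward models, with the backward one
    already transposed into the forward orientation. Returns E[f_ij] under the
    50/50 mixture; zeros everywhere means the two directions agree."""
    return 0.5 * (post_fwd - post_bwd)
```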
82 5 Evaluation We evaluated our augmented models on two corpora: the Hansards corpus [15] of English/French and the Europarl corpus [10] with EPPS annotation [11]. [sent-174, score-0.405]
83 Hansards test sentences are on average only half as long as those of EPPS, and only 21% of alignments in Hansards are sure (and hence required), compared with 69% for EPPS. [sent-177, score-0.38]
84 Both are alignments of a Romance language to English and the average distance of an alignment to the diagonal is around 2 for both corpora. [sent-181, score-0.662]
85 The error metrics we use are precision, recall and alignment error rate (AER), which is a weighted combination of precision and recall. [sent-182, score-0.475]
86 Although AER is the standard metric in word alignment, it has been shown [7] that it correlates only weakly with the standard MT metric, Bleu, when the alignments are used in a phrase-based translation system. [sent-183, score-0.863]
87 [7] suggest weighted F-measure as an alternative that correlates well with Bleu, so we also report precision and recall numbers. [sent-184, score-0.151]
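For reference, the standard definitions of these metrics (not copied from the paper; the sure links are assumed to be a subset of the possible links, and all sets non-empty):

```python
def alignment_metrics(predicted, sure, possible, alpha=0.5):
    """predicted, sure, possible: sets of (source_index, target_index) links."""
    precision = len(predicted & possible) / len(predicted)
    recall = len(predicted & sure) / len(sure)
    # Alignment error rate combines the two against sure and possible links.
    aer = 1.0 - (len(predicted & sure) + len(predicted & possible)) / (len(predicted) + len(sure))
    # Weighted F-measure with trade-off parameter alpha.
    f_weighted = 1.0 / (alpha / precision + (1.0 - alpha) / recall)
    return precision, recall, aer, f_weighted
```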
88 Following prior work [16], we initialize the Model 1 translation table with uniform probabilities over word pairs that occur together in the same sentence. [sent-185, score-0.31]
89 Model 2 and the HMM model were initialized with the translation probabilities from Model 1 and with uniform distortion probabilities. [sent-186, score-0.142]
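The Model 1 initialization described above amounts to the following sketch (the corpus format and function name are ours):

```python
from collections import defaultdict

def init_model1_translation_table(corpus):
    """corpus: iterable of (source_tokens, target_tokens) sentence pairs. For each source
    word, place uniform probability on exactly the target words it co-occurs with."""
    cooc = defaultdict(set)
    for src, tgt in corpus:
        for s in src:
            cooc[s].update(tgt)
    return {s: {t: 1.0 / len(ts) for t in ts} for s, ts in cooc.items()}
```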
90 We report results for the model with English as the “source” language when using posterior decoding [12]. [sent-193, score-0.156]
91 Figure 3 shows alignment results for the baseline models as well as the models with additional constraints. [sent-194, score-0.439]
92 We show precision, recall and AER for the HMM model as well as precision and recall for Model 2. [sent-195, score-0.211]
93 We note that both constraints improve all measures of performance for all dataset sizes, with most improvement for smaller dataset sizes. [sent-196, score-0.122]
94 We see that the performance gap between the models with and without agreement constraints is preserved as the number of EM iterations increases. [sent-199, score-0.274]
95 Note also that likelihood increases monotonically for all the models and that the baseline model always achieves higher likelihood as expected. [sent-200, score-0.284]
96 Both types of constraints improve all accuracy measures across both datasets and models. [sent-207, score-0.122]
97 We implemented our method on two different problems, probabilistic clustering using mixtures of Gaussians and statistical word alignment, and tested it on synthetic and real data. [sent-214, score-0.664]
98 Maximum likelihood from incomplete data via the EM algorithm. [sent-265, score-0.278]
99 Europarl: A multilingual corpus for evaluation of machine translation, 2002. [sent-282, score-0.143]
100 Guidelines for word alignment evaluation and manual alignment. [sent-289, score-0.53]
wordName wordTfidf (topN-words)
[('alignment', 0.324), ('substochastic', 0.251), ('alignments', 0.229), ('hansards', 0.219), ('em', 0.217), ('word', 0.206), ('epps', 0.201), ('zij', 0.18), ('posteriors', 0.178), ('aj', 0.166), ('fertility', 0.153), ('agreement', 0.152), ('baseline', 0.123), ('maxima', 0.123), ('aer', 0.123), ('constraints', 0.122), ('kl', 0.119), ('language', 0.109), ('sentences', 0.108), ('translation', 0.104), ('eq', 0.102), ('cliques', 0.1), ('hmm', 0.1), ('corpus', 0.093), ('precision', 0.091), ('latent', 0.089), ('pd', 0.086), ('source', 0.085), ('constraint', 0.084), ('della', 0.08), ('ia', 0.075), ('ju', 0.075), ('uy', 0.075), ('pietra', 0.074), ('sentence', 0.072), ('grammar', 0.067), ('bleu', 0.066), ('hermann', 0.066), ('enforce', 0.065), ('clustering', 0.062), ('likelihood', 0.061), ('ba', 0.06), ('recall', 0.06), ('co', 0.056), ('intended', 0.054), ('english', 0.054), ('ls', 0.053), ('es', 0.053), ('arg', 0.052), ('ibm', 0.051), ('convivial', 0.05), ('europarl', 0.05), ('ganchev', 0.05), ('garbage', 0.05), ('graca', 0.05), ('ima', 0.05), ('minq', 0.05), ('multilingual', 0.05), ('ga', 0.05), ('clusters', 0.049), ('graphical', 0.048), ('intuitive', 0.048), ('posterior', 0.047), ('words', 0.046), ('un', 0.046), ('variables', 0.045), ('log', 0.044), ('franz', 0.044), ('josef', 0.044), ('och', 0.044), ('meaning', 0.043), ('corpora', 0.043), ('rare', 0.043), ('sure', 0.043), ('desired', 0.042), ('restricting', 0.04), ('grammars', 0.04), ('constraining', 0.039), ('models', 0.039), ('statistical', 0.039), ('distortion', 0.038), ('intractable', 0.038), ('ra', 0.037), ('baselines', 0.037), ('tommi', 0.037), ('tractable', 0.037), ('border', 0.035), ('dual', 0.035), ('local', 0.034), ('cluster', 0.034), ('ben', 0.033), ('induction', 0.033), ('decompose', 0.033), ('synthetic', 0.033), ('acl', 0.032), ('stuck', 0.032), ('marginals', 0.032), ('semantics', 0.031), ('pennsylvania', 0.031), ('annealing', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000007 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Graça
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
2 0.2841149 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
3 0.24789694 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
4 0.14377123 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
5 0.14027148 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
Author: Kristina Toutanova, Mark Johnson
Abstract: We present a novel Bayesian model for semi-supervised part-of-speech tagging. Our model extends the Latent Dirichlet Allocation model and incorporates the intuition that words’ distributions over tags, p(t|w), are sparse. In addition we introduce a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words. Our model outperforms the best previously proposed model for this task on a standard dataset. 1
6 0.10766782 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
7 0.10112533 9 nips-2007-A Probabilistic Approach to Language Change
8 0.098732509 61 nips-2007-Convex Clustering with Exemplar-Based Models
9 0.089684762 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
10 0.088705115 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
11 0.086159825 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
12 0.079372071 110 nips-2007-Learning Bounds for Domain Adaptation
13 0.078255959 183 nips-2007-Spatial Latent Dirichlet Allocation
14 0.074980587 197 nips-2007-The Infinite Markov Model
15 0.069368377 80 nips-2007-Ensemble Clustering using Semidefinite Programming
16 0.06663014 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
17 0.066333428 189 nips-2007-Supervised Topic Models
18 0.064039335 4 nips-2007-A Constraint Generation Approach to Learning Stable Linear Dynamical Systems
19 0.063464522 140 nips-2007-Neural characterization in partially observed populations of spiking neurons
20 0.063383669 47 nips-2007-Collapsed Variational Inference for HDP
topicId topicWeight
[(0, -0.22), (1, 0.062), (2, -0.092), (3, -0.219), (4, 0.06), (5, -0.139), (6, 0.034), (7, -0.105), (8, -0.116), (9, 0.007), (10, 0.048), (11, -0.04), (12, -0.038), (13, -0.288), (14, -0.03), (15, -0.17), (16, -0.23), (17, 0.163), (18, -0.055), (19, 0.104), (20, -0.051), (21, -0.132), (22, -0.105), (23, 0.01), (24, 0.015), (25, 0.016), (26, 0.03), (27, -0.069), (28, -0.02), (29, 0.039), (30, -0.04), (31, -0.046), (32, 0.04), (33, -0.042), (34, -0.033), (35, 0.037), (36, -0.069), (37, 0.002), (38, 0.059), (39, 0.104), (40, 0.056), (41, 0.039), (42, -0.038), (43, -0.053), (44, -0.026), (45, 0.024), (46, -0.043), (47, 0.048), (48, 0.027), (49, -0.002)]
simIndex simValue paperId paperTitle
same-paper 1 0.95365697 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Graça
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
2 0.88186622 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
3 0.82800949 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
Author: Bing Zhao, Eric P. Xing
Abstract: We present a novel paradigm for statistical machine translation (SMT), based on a joint modeling of word alignment and the topical aspects underlying bilingual document-pairs, via a hidden Markov Bilingual Topic AdMixture (HM-BiTAM). In this paradigm, parallel sentence-pairs from a parallel document-pair are coupled via a certain semantic-flow, to ensure coherence of topical context in the alignment of mapping words between languages, likelihood-based training of topic-dependent translational lexicons, as well as in the inference of topic representations in each language. The learned HM-BiTAM can not only display topic patterns like methods such as LDA [1], but now for bilingual corpora; it also offers a principled way of inferring optimal translation using document context. Our method integrates the conventional model of HMM — a key component for most of the state-of-the-art SMT systems, with the recently proposed BiTAM model [10]; we report an extensive empirical analysis (in many ways complementary to the description-oriented [10]) of our method in three aspects: bilingual topic representation, word alignment, and translation.
4 0.69707114 9 nips-2007-A Probabilistic Approach to Language Change
Author: Alexandre Bouchard-côté, Percy Liang, Dan Klein, Thomas L. Griffiths
Abstract: We present a probabilistic approach to language change in which word forms are represented by phoneme sequences that undergo stochastic edits along the branches of a phylogenetic tree. This framework combines the advantages of the classical comparative method with the robustness of corpus-based probabilistic models. We use this framework to explore the consequences of two different schemes for defining probabilistic models of phonological change, evaluating these schemes by reconstructing ancient word forms of Romance languages. The result is an efficient inference procedure for automatically inferring ancient word forms from modern languages, which can be generalized to support inferences about linguistic phylogenies. 1
5 0.63388121 63 nips-2007-Convex Relaxations of Latent Variable Training
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
6 0.54960573 1 nips-2007-A Bayesian Framework for Cross-Situational Word-Learning
7 0.48759636 210 nips-2007-Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks
8 0.46819463 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
9 0.45578015 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
10 0.39660791 66 nips-2007-Density Estimation under Independent Similarly Distributed Sampling Assumptions
11 0.38406798 61 nips-2007-Convex Clustering with Exemplar-Based Models
12 0.3497515 197 nips-2007-The Infinite Markov Model
13 0.34722495 87 nips-2007-Fast Variational Inference for Large-scale Internet Diagnosis
14 0.29538986 79 nips-2007-Efficient multiple hyperparameter learning for log-linear models
15 0.28954282 47 nips-2007-Collapsed Variational Inference for HDP
16 0.28755695 65 nips-2007-DIFFRAC: a discriminative and flexible framework for clustering
17 0.28582168 31 nips-2007-Bayesian Agglomerative Clustering with Coalescents
18 0.28536686 129 nips-2007-Mining Internet-Scale Software Repositories
19 0.279378 72 nips-2007-Discriminative Log-Linear Grammars with Latent Variables
20 0.26993921 80 nips-2007-Ensemble Clustering using Semidefinite Programming
topicId topicWeight
[(5, 0.054), (13, 0.162), (16, 0.031), (18, 0.015), (21, 0.055), (34, 0.019), (35, 0.038), (47, 0.082), (82, 0.203), (83, 0.128), (85, 0.027), (87, 0.024), (90, 0.053)]
simIndex simValue paperId paperTitle
same-paper 1 0.81968784 84 nips-2007-Expectation Maximization and Posterior Constraints
Author: Kuzman Ganchev, Ben Taskar, João Graça
Abstract: The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models. 1
2 0.79856187 155 nips-2007-Predicting human gaze using low-level saliency combined with face detection
Author: Moran Cerf, Jonathan Harel, Wolfgang Einhaeuser, Christof Koch
Abstract: Under natural viewing conditions, human observers shift their gaze to allocate processing resources to subsets of the visual input. Many computational models try to predict such voluntary eye and attentional shifts. Although the important role of high level stimulus properties (e.g., semantic information) in search stands undisputed, most models are based on low-level image properties. We here demonstrate that a combined model of face detection and low-level saliency significantly outperforms a low-level model in predicting locations humans fixate on, based on eye-movement recordings of humans observing photographs of natural scenes, most of which contained at least one person. Observers, even when not instructed to look for anything particular, fixate on a face with a probability of over 80% within their first two fixations; furthermore, they exhibit more similar scanpaths when faces are present. Remarkably, our model’s predictive performance in images that do not contain faces is not impaired, and is even improved in some cases by spurious face detector responses. 1
3 0.75654286 71 nips-2007-Discriminative Keyword Selection Using Support Vector Machines
Author: Fred Richardson, William M. Campbell
Abstract: Many tasks in speech processing involve classification of long term characteristics of a speech segment such as language, speaker, dialect, or topic. A natural technique for determining these characteristics is to first convert the input speech into a sequence of tokens such as words, phones, etc. From these tokens, we can then look for distinctive sequences, keywords, that characterize the speech. In many applications, a set of distinctive keywords may not be known a priori. In this case, an automatic method of building up keywords from short context units such as phones is desirable. We propose a method for the construction of keywords based upon Support Vector Machines. We cast the problem of keyword selection as a feature selection problem for n-grams of phones. We propose an alternating filter-wrapper method that builds successively longer keywords. Application of this method to language recognition and topic recognition tasks shows that the technique produces interesting and significant qualitative and quantitative results.
4 0.74932164 62 nips-2007-Convex Learning with Invariances
Author: Choon H. Teo, Amir Globerson, Sam T. Roweis, Alex J. Smola
Abstract: Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary losses. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly. 1
5 0.74838245 22 nips-2007-Agreement-Based Learning
Author: Percy Liang, Dan Klein, Michael I. Jordan
Abstract: The learning of probabilistic models with many hidden variables and nondecomposable dependencies is an important and challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable submodels by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We propose an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks. 1
6 0.74575204 191 nips-2007-Temporal Difference Updating without a Learning Rate
7 0.74477112 158 nips-2007-Probabilistic Matrix Factorization
8 0.70817286 14 nips-2007-A configurable analog VLSI neural network with spiking neurons and self-regulating plastic synapses
9 0.70071828 95 nips-2007-HM-BiTAM: Bilingual Topic Exploration, Word Alignment, and Translation
10 0.66962522 102 nips-2007-Incremental Natural Actor-Critic Algorithms
11 0.66837537 63 nips-2007-Convex Relaxations of Latent Variable Training
12 0.6646384 86 nips-2007-Exponential Family Predictive Representations of State
13 0.65522569 177 nips-2007-Simplified Rules and Theoretical Analysis for Information Bottleneck Optimization and PCA with Spiking Neurons
14 0.65385574 9 nips-2007-A Probabilistic Approach to Language Change
15 0.64711267 134 nips-2007-Multi-Task Learning via Conic Programming
16 0.64664006 76 nips-2007-Efficient Convex Relaxation for Transductive Support Vector Machine
17 0.64616662 24 nips-2007-An Analysis of Inference with the Universum
18 0.64413166 117 nips-2007-Learning to classify complex patterns using a VLSI network of spiking neurons
19 0.64221358 2 nips-2007-A Bayesian LDA-based model for semi-supervised part-of-speech tagging
20 0.64029211 98 nips-2007-Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion