nips nips2007 nips2007-63 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Yuhong Guo, Dale Schuurmans
Abstract: We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. This reformulation leads to an exact expression for EM variants in a wide range of problems. We then develop a semidefinite relaxation that yields global training by eliminating local minima. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract We investigate a new, convex relaxation of an expectation-maximization (EM) variant that approximates a standard objective while eliminating local minima. [sent-3, score-0.497]
2 First, a cautionary result is presented, showing that any convex relaxation of EM over hidden variables must give trivial results if any dependence on the missing values is retained. [sent-4, score-0.723]
3 Although this appears to be a strong negative outcome, we then demonstrate how the problem can be bypassed by using equivalence relations instead of value assignments over hidden variables. [sent-5, score-0.588]
4 In particular, we develop new algorithms for estimating exponential conditional models that only require equivalence relation information over the variable values. [sent-6, score-0.487]
5 We then develop a semidefinite relaxation that yields global training by eliminating local minima. [sent-8, score-0.307]
6 Here it is important to distinguish between the EM algorithm (essentially a coordinate descent procedure [10]) and the objective it optimizes (marginal observed or conditional hidden likelihood). [sent-12, score-0.311]
7 In particular, the standard objectives tackled by EM are not convex in any standard probability model (e. [sent-17, score-0.261]
8 We present a convex relaxation of EM for a standard training criterion and a general class of models in an attempt to understand whether local minima are really a necessary aspect of unsupervised learning. [sent-22, score-0.498]
9 In this paper, we propose a convex relaxation of EM that can be applied to a general class of directed graphical models, including mixture models and Bayesian networks, in the presence of hidden variables. [sent-24, score-0.644]
10 There are some technical barriers to overcome in achieving an effective convex relaxation however. [sent-25, score-0.389]
11 First, as we will show, any convex relaxation of EM must produce trivial results if it maintains any dependence on the values of hidden variables. [sent-26, score-0.537]
12 Although this result suggests that any convex relaxation of EM cannot succeed, we subsequently show that the problem can be overcome by working with equivalence relations over the values of the hidden variables, rather than the missing values themselves. [sent-27, score-1.04]
13 Although equivalence relations provide an easy way to solve the symmetry collapsing problem, they do not immediately yield a convex EM formulation, because the underlying estimation principles for directed graphical models have not been formulated in these terms. [sent-28, score-0.774]
14 Our main technical contribution therefore is a reformulation of standard estimation principles for exponential conditional models in terms of equivalence relations on variable values, rather than the variable values themselves. [sent-29, score-0.673]
15 Given an adequate reformulation of the core estimation principle, developing a useful convex relaxation of EM becomes possible. [sent-30, score-0.468]
16 Moreover, the criteria being optimized are in fact well motivated objectives for unsupervised training: joint EM is frequently used in statistical natural language processing (where it is referred to as “Viterbi EM” [3, 7]); the conditional form has been used in [16]. [sent-37, score-0.248]
17 By far, the more common form of EM—contributing the very name expectation-maximization—is given by (marginal EM update) q_y^(k+1) = P(y|x, w^(k)), w^(k+1) = arg max_w Σ_y q_y^(k+1) log P(x, y|w), where q_y is a distribution over possible missing values. [sent-39, score-1.31]
18 [10] later showed that the E-step could be generalized to max_{q_y} Σ_y q_y log [P(x, y|w^(k)) / q_y]. [sent-41, score-0.588]
19 Due to the softer q_y update, the standard EM update does not converge as rapidly to a local maximum as the joint and conditional variants; however, as a result, it tends to find better local maxima. [sent-42, score-0.706]
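To make the two update styles concrete, here is a minimal numpy sketch (not taken from the paper) contrasting the standard soft E-step with the hard, joint/"Viterbi" E-step on a toy one-dimensional Gaussian mixture; all data and settings are illustrative assumptions.

```python
import numpy as np

def em_mixture(x, k=2, iters=50, hard=False, seed=0):
    """One-dimensional Gaussian mixture fit by EM.
    hard=False gives the standard (marginal) EM update for q_y;
    hard=True gives the joint/'Viterbi' variant that assigns each
    point to its most likely component."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)      # component means
    sigma = np.full(k, x.std() + 1e-6)             # component std devs
    pi = np.full(k, 1.0 / k)                       # mixing weights
    for _ in range(iters):
        # E-step: q_y proportional to pi_c * N(x | mu_c, sigma_c)
        logp = (np.log(pi) - np.log(sigma)
                - 0.5 * ((x[:, None] - mu) / sigma) ** 2)
        q = np.exp(logp - logp.max(axis=1, keepdims=True))
        q /= q.sum(axis=1, keepdims=True)
        if hard:                                   # Viterbi EM: collapse q_y to a point mass
            q = np.eye(k)[q.argmax(axis=1)]
        # M-step: weighted maximum likelihood
        nc = q.sum(axis=0) + 1e-12
        mu = (q * x[:, None]).sum(axis=0) / nc
        sigma = np.sqrt((q * (x[:, None] - mu) ** 2).sum(axis=0) / nc) + 1e-6
        pi = nc / nc.sum()
    return pi, mu, sigma

x = np.concatenate([np.random.default_rng(1).normal(-2, 1, 200),
                    np.random.default_rng(2).normal(3, 1, 200)])
print(em_mixture(x, hard=False))   # standard (soft) EM
print(em_mixture(x, hard=True))    # joint / Viterbi EM
```

On toy data like this, the soft variant typically ends in somewhat better local optima than the hard variant, consistent with the observation above.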
20 Nevertheless, none of the training criteria are jointly convex in the optimization variables, thus these iterations are only guaranteed to find local maxima. [sent-44, score-0.325]
21 Of the three training criteria therefore (joint, conditional and marginal), marginal likelihood appears to be the least relevant to learning predictive models. [sent-47, score-0.179]
22 Nevertheless, the convex relaxation techniques we propose can be applied to all three objectives. [sent-48, score-0.389]
23 For simplicity we will focus on maximizing joint likelihood in this paper, since it incorporates aspects of both marginal and conditional training. [sent-49, score-0.158]
24 Conveniently, joint and marginal EM pose nearly identical optimization problems: (joint EM objective) arg max_w max_y P(x, y|w) = arg max_w max_{q_y} Σ_y q_y log P(x, y|w) (marg. [sent-50, score-1.001]
25 EM objective) arg max_w Σ_y P(x, y|w) = arg max_w max_{q_y} Σ_y q_y log P(x, y|w) + H(q_y), where q_y is a distribution over possible missing values [10]. [sent-51, score-2.528]
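Both equalities are easy to check numerically for a single observation with a small number of possible hidden values. The following sketch (an illustration with numpy and made-up probabilities, not code from the paper) verifies that a linear objective in q_y is maximized by a point mass, while adding the entropy H(q_y) recovers the log marginal.

```python
import numpy as np
rng = np.random.default_rng(0)

logP = np.log(rng.dirichlet(np.ones(4)) * 0.3)   # toy log P(x, y|w) for 4 possible y values

# joint EM identity: the best distribution q_y for a linear objective is a point mass
q_hard = np.eye(4)[logP.argmax()]
print(np.isclose((q_hard * logP).sum(), logP.max()))            # True

# marginal EM identity: adding the entropy H(q_y) makes the optimum q_y = P(y|x, w)
q_soft = np.exp(logP) / np.exp(logP).sum()
value = (q_soft * logP).sum() - (q_soft * np.log(q_soft)).sum()
print(np.isclose(value, np.log(np.exp(logP).sum())))            # True
```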
26 Therefore, much of the analysis we provide for joint EM also applies to marginal EM, leaving only a separate convex relaxation of the entropy term that can be conducted independently. [sent-52, score-0.527]
27 We will also primarily consider the hidden variable case and assume a fixed set of random variables Y1 , . [sent-53, score-0.271]
28 2 A Cautionary Result for Convexity Our focus in this paper will be to develop a jointly convex relaxation to the minimization problem posed by joint EM: min_w min_y − Σ_i log P(x_i, y_i|w) (1). One obvious issue we must face is to relax the discrete constraints on the assignments y. [sent-61, score-0.621]
29 In the hidden variable case—when the same variables are missing in each observation—there is a complete symmetry between the missing values. [sent-63, score-0.51]
30 In particular, for any optimal solution (y, w) there must be other, equivalent solutions (y', w') corresponding to a permutation of the hidden variable values. [sent-64, score-0.266]
31 Lemma 1 If f is strictly convex and invariant to permutations of unobserved variable values, then the global minimum of f, (q*_y, w*), must satisfy q*_y = uniform. [sent-66, score-0.249]
32 Proof: Assume (q_y, w) is a global minimum of f but q_y ≠ uniform. [sent-67, score-0.533]
33 Then there must be some permutation of the missing values, Π, such that the alternative (q'_y, w') = (Π(q_y), Π(w)) satisfies q'_y ≠ q_y. [sent-68, score-1.184]
34 Therefore, any convex relaxation of (1) that uses a distribution qy over missing values and does not make arbitrary distinctions can never do anything but produce a uniform distribution over the hidden variable values. [sent-71, score-1.226]
35 ) Moreover, any non-strictly convex relaxation must admit the uniform distribution as a possible solution. [sent-73, score-0.415]
36 (Note that standard coordinate descent algorithms simply break the symmetry arbitrarily and descend into some local solution. [sent-75, score-0.16]
37 ) This negative result seems to imply that no useful convex relaxation of EM is possible in the hidden variable case. [sent-76, score-0.603]
38 However, our key observation is that a convex relaxation expressed in terms of an equivalence relation over the missing values avoids this symmetry breaking problem. [sent-77, score-0.974]
39 In particular, equivalence relations exactly collapse the unresolvable symmetries in this context, while still representing useful structure over the hidden assignments. [sent-78, score-0.592]
40 Representations based on equivalence relations are a useful tool for unsupervised learning that has largely been overlooked (with some exceptions [4, 15]). [sent-79, score-0.446]
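As a small illustration of why equivalence relations sidestep the symmetry problem of Lemma 1, the following numpy sketch (toy data, not from the paper) shows that relabeling the hidden values changes the indicator matrix Y but leaves the equivalence relation M = Y Y' unchanged.

```python
import numpy as np

def indicator(y, v):
    """t x v 0-1 indicator matrix with a single 1 per row."""
    Y = np.zeros((len(y), v))
    Y[np.arange(len(y)), y] = 1
    return Y

y = np.array([0, 0, 1, 2, 1])          # one hidden labeling
perm = np.array([2, 0, 1])              # a permutation of the 3 label values
y_perm = perm[y]                        # the "same" clustering, relabeled

Y, Yp = indicator(y, 3), indicator(y_perm, 3)
M, Mp = Y @ Y.T, Yp @ Yp.T              # equivalence relations M = Y Y'

print(np.array_equal(Y, Yp))            # False: value assignments differ
print(np.array_equal(M, Mp))            # True: the equivalence relation is identical
```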
41 Our goal in this paper, therefore, will be to reformulate standard training objectives to use only equivalence relations on hidden variable values. [sent-80, score-0.777]
42 3 Directed Graphical Models We will derive a convex relaxation framework for a general class of probability models—namely, directed models—that includes mixture models and discrete Bayesian networks as special cases. [sent-81, score-0.547]
43 A directed model defines a joint probability distribution over a set of random variables Z_1, . [sent-82, score-0.193]
44 , Z_n by exploiting the chain rule of probability to decompose the joint into a product of locally normalized conditional distributions P(z|w) = ∏_{j=1}^n P(z_j | z_{π(j)}, w_j). [sent-85, score-0.284]
45 Here π(j) ⊆ {1, ..., j − 1} indexes the parents of Z_j, and w_j is the set of parameters defining conditional distribution j. [sent-89, score-0.23]
46 A discrete Bayesian network is just a directed model where the conditional distributions are represented by a sparse feature vector indicating the identity of the child-parent configuration, φ_j(z_j, z_{π(j)}) = (..., 0, 1, 0, ...). [sent-92, score-0.16]
47 To explain this notation, note that Y and Φ are indicator matrices that have a single 1 in each row, where Y indicates the value of the child variable, and Φ indicates the specific configuration of the parent values, respectively; i. [sent-103, score-0.334]
48 However, for our purposes it suffers from a major drawback: (3) is not expressed in terms of equivalence relations between the variable values. [sent-110, score-0.55]
49 Rather it is expressed in terms of direct indicators of specific variable values in specific examples—which will lead to a trivial outcome if we attempt any convex relaxation. [sent-111, score-0.326]
50 Instead, we require a fundamental reformulation of (3) to remove the value dependence and replace it with a dependence only on equivalence relationships. [sent-112, score-0.341]
51 4 Log-linear Regression on Equivalence Relations The first step in reformulating (3) in terms of equivalence relations is to derive its dual. [sent-113, score-0.413]
52 Lemma 2 An equivalent optimization problem to (3) is max_Θ −tr(Θ log Θ') − (1/(2α)) tr((Y − Θ)' ΦΦ' (Y − Θ)) subject to Θ ≥ 0, Θ1 = 1 (4) Proof: The proof follows a standard derivation, which we sketch; see e. [sent-114, score-0.299]
53 Interestingly, deriving the dual has already achieved part of the desired result: the parent configurations now only enter the problem through the kernel matrix K = ΦΦ'. [sent-118, score-0.213]
54 For Bayesian networks this kernel matrix is in fact an equivalence relation between parent configurations: Φ is a 0-1 indicator matrix with a single 1 in each row, implying that Kij = 1 iff Φi: = Φj: , and Kij = 0 otherwise. [sent-119, score-0.655]
55 But more importantly, K can be re-expressed as a function of the individual equivalence relations on each of the parent variables. [sent-120, score-0.55]
56 Let Y^p ∈ {0, 1}^{t×v_p} indicate the value of a parent variable Z_p for each training example. [sent-121, score-0.244]
57 Then M^p = Y^p Y^p' defines an equivalence relation over the assignments to variable Z_p, since M^p_ij = 1 if Y^p_i: = Y^p_j: and M^p_ij = 0 otherwise. [sent-123, score-0.465]
58 It is not hard to see that the equivalence relation over complete parent configurations, K = ΦΦ', is equal to the componentwise (Hadamard) product of the individual equivalence relations for each parent variable. [sent-124, score-1.093]
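A quick numerical check of this fact on toy data (an illustrative numpy sketch, not code from the paper):

```python
import numpy as np

def rel(y):
    """Equivalence relation M_ij = 1 iff examples i and j share the same value."""
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)

# two parent variables over 5 toy examples
z1 = [0, 1, 1, 0, 2]
z2 = [1, 1, 0, 1, 0]

# K from one-hot indicators of the joint parent configuration
configs = list(zip(z1, z2))
ids = {c: i for i, c in enumerate(sorted(set(configs)))}
Phi = np.zeros((len(configs), len(ids)))
Phi[np.arange(len(configs)), [ids[c] for c in configs]] = 1
K = Phi @ Phi.T

# K equals the Hadamard (componentwise) product of the per-parent relations
print(np.array_equal(K, rel(z1) * rel(z2)))   # True
```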
59 Unfortunately, the dual problem (4) is still expressed in terms of the indicator matrix Y over child variable values, which is still not acceptable. [sent-129, score-0.437]
60 We still need to reformulate (4) in terms of the equivalence relation matrix M = Y Y'. [sent-130, score-0.462]
61 Also note that as long as every child value occurs at least once in the training set, Y has full rank v. [sent-133, score-0.231]
62 If not, then the child variable effectively has fewer values, and we could simply reduce Y until it becomes full rank again without affecting the objective (3). [sent-134, score-0.304]
63 Then we can relate the primal parameters to this larger set of dual parameters by the relation W = (1/α) Φ'(I − Ω)Y. [sent-136, score-0.158]
64 The formulation (5) is now almost completely expressed in terms of equivalence relations over the data, except for one subtle problem: the log normalization factor A(B, Φ_i:) = log Σ_a exp((1/α) Φ_i: Φ' B Y 1_a) still directly depends on the label indicator matrix Y. [sent-140, score-0.727]
65 Our key technical lemma is that this log normalization factor can be re-expressed to depend on the equivalence relation matrix M alone. [sent-141, score-0.499]
66 Lemma 3 A(B, Φ_i:) = log Σ_j exp((1/α) K_i: B M_:j − log 1'M_:j) Proof: The main observation is that an equivalence relation over value indicators, M = Y Y', consists of columns copied from Y. [sent-142, score-0.52]
67 That is, for all j, M_:j = Y_:a for some a corresponding to the child value in example j. [sent-143, score-0.158]
68 Let y(j) denote the child value in example j and let β_i: = (1/α) K_i: B. [sent-144, score-0.158]
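Lemma 3 can also be verified numerically: as long as every child value occurs at least once, summing over examples j and dividing by the column counts 1'M_:j reproduces the sum over values a. A small numpy sketch with arbitrary toy K and B (illustrative assumptions, not the paper's code):

```python
import numpy as np
rng = np.random.default_rng(0)

t, v = 6, 3                                   # examples, child values
y = np.array([0, 1, 2, 0, 1, 1])              # every value occurs at least once
Y = np.zeros((t, v)); Y[np.arange(t), y] = 1  # child indicator matrix
M = Y @ Y.T                                   # child equivalence relation
K = rng.integers(0, 2, (t, t)).astype(float)  # stand-in for Phi Phi' (any t x t matrix works here)
B = rng.normal(size=(t, t))
alpha = 2.0
beta = (K @ B) / alpha                        # row i is beta_i: = (1/alpha) K_i: B

# value-indexed form: log sum_a exp(beta_i: Y_:a)  (equals A(B, Phi_i:) when K = Phi Phi')
lhs = np.log(np.exp(beta @ Y).sum(axis=1))
# equivalence-relation form: log sum_j exp(beta_i: M_:j - log(1' M_:j))
rhs = np.log((np.exp(beta @ M) / M.sum(axis=0)).sum(axis=1))
print(np.allclose(lhs, rhs))                  # True
```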
69 Theorem 1 An equivalent optimization problem to (3) is max_{Λ≥0, Λ1=1} −tr(Λ log Λ') − 1'Λ log(M1) − (1/(2α)) tr((I − Λ)' K (I − Λ) M) (6) where K = M^1 ◦ ··· ◦ M^p for parent variables Z_1, ..., Z_p. [sent-146, score-0.326]
70 This gives our key result: the log-linear regression (3) is equivalent to (6), which is now expressed strictly in terms of equivalence relations over the parent configurations and child values. [sent-152, score-0.792]
71 5 Convex Relaxation of Joint EM The equivalence relation form of log-linear regression can be used to derive useful relaxations of EM variants for directed models. [sent-156, score-0.555]
72 K^j = M^{j1} ◦ ··· ◦ M^{jp} (9) for the parent variables of each child variable Z_j. Note that (8) is an exact reformulation of the joint EM objective (7); no relaxation has yet been introduced. [sent-161, score-0.581]
73 Another nice property of the objective in (8) is that it is concave in each Λ^j and convex in each M^h individually (a maximum of convex functions is convex [2]). [sent-162, score-0.597]
74 Although the constraints imposed in (9) are not convex, there is a natural convex relaxation suggested by the following. [sent-165, score-0.389]
75 Note that since (8) and (10) are expressed solely in terms of equivalence relations, and do not depend on the specific values of hidden variables in any way, this formulation is not subject to the triviality result of Lemma 1. [sent-168, score-0.573]
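For intuition only, here is one way such a relaxed feasible set can be posed as a semidefinite program using cvxpy; the specific constraints below (positive semidefiniteness, unit diagonal, entrywise nonnegativity) are an assumption on our part, since the exact constraint set (10) is not reproduced in this extraction.

```python
import cvxpy as cp

t = 20                                   # number of training examples
M = cp.Variable((t, t), symmetric=True)  # relaxed equivalence relation for one hidden variable

# A common relaxation of the set of 0-1 equivalence relations (assumed here, not taken from the paper):
constraints = [M >> 0,                   # positive semidefinite
               cp.diag(M) == 1,          # every example is equivalent to itself
               M >= 0]                   # entrywise nonnegativity

# any objective convex in M can now be optimized globally over this set, e.g.
objective = cp.Minimize(cp.sum(M))
cp.Problem(objective, constraints).solve()
print(M.value.round(2))
```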
76 First, if there is only a single hidden variable then (8) is convex with respect to the single matrix variable M h . [sent-170, score-0.491]
77 This result immediately provides a convex EM training algorithm for various applications, such as mixture models (see the note regarding continuous random variables below). [sent-171, score-0.384]
78 Second, if there are multiple hidden variables that are separated from each other (none are neighbors, nor share a common child) then the formulation (8) remains convex and can be directly applied. [sent-172, score-0.42]
79 On the other hand, if hidden variables are connected in any way, either by sharing a parent-child relationship or having a common child, then (8) is no longer jointly convex because the trace term is no longer linear in the matrix variables {M h }. [sent-173, score-0.505]
80 In this case, we can restore convexity by further relaxing the problem: To illustrate, if there are multiple hidden parents Z_{p1}, . [sent-174, score-0.201]
81 , Z_p for a given child, then the combined equivalence relation M^{p1} ◦ ··· ◦ M^p is a Hadamard product of the individual matrices. [sent-177, score-0.372]
82 A convex formulation can be recovered by introducing an auxiliary matrix variable M̃ to replace M^{p1} ◦ ··· ◦ M^p in (8) and adding the constraints M̃_ij ≤ M^p_ij for p ∈ {p1, . [sent-178, score-0.334]
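Linear upper-bound constraints are natural here because, for 0-1 equivalence relations, the Hadamard product coincides with an elementwise minimum; a tiny numpy check on toy data (illustrative only):

```python
import numpy as np

def rel(y):
    y = np.asarray(y)
    return (y[:, None] == y[None, :]).astype(float)

M1, M2 = rel([0, 0, 1, 2]), rel([1, 0, 1, 1])
# For 0-1 equivalence relations the Hadamard product is an elementwise minimum,
# so constraints Mtilde_ij <= M1_ij and Mtilde_ij <= M2_ij on a new matrix variable
# Mtilde give a convex outer bound on the product.
print(np.array_equal(M1 * M2, np.minimum(M1, M2)))   # True
```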
83 A similar relaxation can also be applied when a child is hidden concurrently with hidden parent variables. [sent-182, score-0.797]
84 Continuous Variables The formulation in (8) can be applied to directed models with continuous random variables, provided that all hidden variables remain discrete. [sent-183, score-0.36]
85 If every continuous random variable is observed, then the subproblems on these variables can be kept in their natural formulations, and hence still solved. [sent-184, score-0.195]
86 Unfortunately, the techniques developed in this paper do not apply to the situation where there are continuous hidden variables. [sent-186, score-0.189]
87 Recovering the Model Parameters Once the relaxed equivalence relation matrices {M^h} have been obtained, the parameters of the underlying probability model need to be recovered. [sent-187, score-0.406]
88 For hidden variables, we first need to compute a rank-v^h factorization of M^h. [sent-287, score-0.18]
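The extraction does not reproduce the exact recovery procedure, but a typical heuristic is to factor the relaxed M^h and round, e.g. by clustering the rows of a rank-v^h factor; the following numpy sketch is such an assumption-laden illustration, not the paper's algorithm.

```python
import numpy as np

def round_relation(M, v, seed=0):
    """Heuristic rounding of a relaxed equivalence relation M to v discrete values:
    take a rank-v factorization of M and cluster its rows (an assumed procedure;
    the paper's exact recovery step is not reproduced in this extraction)."""
    w, V = np.linalg.eigh(M)                            # symmetric eigendecomposition
    U = V[:, -v:] * np.sqrt(np.clip(w[-v:], 0, None))   # rank-v factor, M ~= U U'
    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), v, replace=False)]   # simple k-means-style rounding
    for _ in range(20):
        labels = np.argmin(((U[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for c in range(v):
            if np.any(labels == c):
                centers[c] = U[labels == c].mean(axis=0)
    return labels

M = np.array([[1, .9, .1, .1],
              [.9, 1, .2, .1],
              [.1, .2, 1, .8],
              [.1, .1, .8, 1]])
print(round_relation(M, 2))    # e.g. [0 0 1 1] up to a label permutation
```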
89 6 Experimental Results An important question to ask is whether the relaxed, convex objective (8) is in fact over-relaxed, and whether important structure in the original marginal likelihood objective has been lost as a result. [sent-293, score-0.334]
90 To investigate this question, we conducted a set of experiments to evaluate our convex approach compared to the standard Viterbi (i. [sent-294, score-0.212]
91 Our experiments are conducted on both synthetic and real Bayesian networks, measuring the trained models by the log-loss they produce on fully observed training and test data. [sent-297, score-0.166]
92 Here we can see that the convex relaxation was successful at preserving structure in the EM objective, and in fact, generally performed much better than the Viterbi EM algorithm, particularly in the case (Synth1) where there were two hidden variables. [sent-307, score-0.537]
93 For the UCI experiments, the results are very similar to those on the synthetic networks, again favoring the convex EM relaxation. [sent-310, score-0.228]
94 We picked one well connected node from each model to serve as the hidden variable, and generated data by sampling from the models. [sent-315, score-0.198]
95 Here we can see that the convex EM relaxation performed well on the Cancer and Alarm networks. [sent-317, score-0.389]
96 Since we only picked one hidden variable from the 37 variables in Alarm, it is understandable that any potential advantage for the convex approach might not be large. [sent-318, score-0.504]
97 7 Conclusion We have presented a new convex relaxation of EM that obtains generally effective results in simple experimental comparisons to a standard joint EM algorithm (Viterbi EM), on both synthetic and real problems. [sent-322, score-0.488]
98 This new approach was facilitated by a novel reformulation of log-linear regression that refers only to equivalence relation information on the data, and thereby allows us to avoid the symmetry breaking problem that blocks naive convexification strategies from working. [sent-323, score-0.596]
99 One shortcoming of the proposed technique however is that it cannot handle continuous hidden variables; this remains a direction for future research. [sent-324, score-0.189]
100 A view of the EM algorithm that justifies incremental, sparse, and other variants. [sent-378, score-0.414]