nips nips2002 nips2002-114 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Martin Szummer, Tommi S. Jaakkola
Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. [sent-5, score-0.876]
2 We formulate a regularization approach to linking the marginal and the conditional in a general way. [sent-6, score-0.572]
3 The regularization penalty measures the information that is implied about the labels over covering regions. [sent-7, score-0.504]
4 No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). [sent-8, score-0.347]
5 We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases. [sent-9, score-0.689]
6 To benefit from such examples, we must exploit, either implicitly or explicitly, the link between the marginal density P (x) over examples x and the conditional P (y|x) representing the decision boundary for the labels y. [sent-11, score-0.51]
7 High density regions or clusters in the data, for example, can be expected to fall solely in one or another class. [sent-12, score-0.385]
8 Most discriminative methods do not attempt to explicitly model or incorporate information from the marginal density P (x). [sent-13, score-0.346]
9 However, many discriminative algorithms such as SVMs exploit the notion of margin that effectively relates P (x) to P (y|x); the decision boundary is biased to fall preferentially in low density regions of P (x) so that only a few points fall within the margin band. [sent-14, score-0.623]
10 In this paper we appeal to information theory to explicitly constrain P (y|x) on the basis of P (x) in a regularization framework. [sent-16, score-0.351]
11 Figure 1: Mutual information I(x; y) measured in bits for four regions with different configurations of labels y = {+, -}. [sent-20, score-0.428]
12 The marginal P (x) is discrete and uniform across the points. [sent-21, score-0.283]
13 The mutual information is low when the labels are homogeneous in the region, and high when labels vary. [sent-22, score-0.383]
14 The mutual information is invariant to the spatial configuration of points within the neighborhood. [sent-23, score-0.338]
15 2 Information Regularization We begin by showing how to regularize a small region of the domain X . [sent-24, score-0.34]
16 We will subsequently cover the domain (or any chosen subset) with multiple small regions, and describe criteria that ensure regularization of the whole domain on the basis of the individual regions. [sent-25, score-0.542]
17 2.1 Regularizing a Single Region Consider a small contiguous region Q in the domain X (e. [sent-27, score-0.254]
18 We will regularize the conditional probability P (y|x) by penalizing the amount of information the conditionals imply about the labels within the region. [sent-30, score-0.634]
19 The regularizer is a function of both P (y|x) and P (x), and will penalize changes in P (y|x) more in regions with high P (x). [sent-31, score-0.42]
20 Let L be the set of labeled points (size NL) and L ∪ U be the set of labeled and unlabeled points (size NLU). [sent-32, score-0.82]
21 The marginal P (x) is assumed to be given, and may be available directly in terms of a continuous density, or as an empirical density P (x) = 1/NLU · Σ_{i∈L∪U} δ(x − xi ) corresponding to a set of points {xi } that may not have labels (δ(·) is the Dirac delta function integrating to 1). [sent-33, score-0.59]
22 As a measure of information, we employ mutual information [5], which is the average number of bits that x contains about the label in region Q (see Figure 1). [sent-34, score-0.499]
23 The measure depends both on the marginal density P (x) (specifically its restriction to x ∈ Q, namely P (x|Q) = P (x)/∫_Q P (x′) dx′) and the conditional P (y|x). [sent-35, score-0.346]
24 More precisely, the mutual information in region Q is IQ(x; y) = Σ_y ∫_{x∈Q} P (x|Q) P (y|x) log [P (y|x) / P (y|Q)] dx, (1) where P (y|Q) = ∫_{x∈Q} P (x|Q) P (y|x) dx. [sent-38, score-0.455]
25 The densities conditioned on Q are normalized to integrate to 1 within the region Q. [sent-39, score-0.273]
26 Note that the mutual information is invariant to permutations of the elements of X within Q, which suggests that the regions must be small enough to preserve locality. [sent-40, score-0.554]
27 The regularization penalty has to further scale with the number of points in the region (or the probability mass). [sent-41, score-0.664]
28 The mutual information IQ (x; y) measures the information per point, and to obtain the total mutual information contained in a region, we must multiply by the probability mass MQ . [sent-44, score-0.587]
29 The regularization will be stronger in regions with high P (x). [sent-45, score-0.553]
30 When the region is small, the marginal will be close to uniform over the region and VQ ∝ R², where R is, e.g., the radius of the region. [sent-48, score-0.63]
31 2.2 Limiting Behavior for Vanishing Size Regions When the size of the region is scaled down, the mutual information will go to zero for any continuous P (y|x). [sent-54, score-0.537]
32 We derive here the appropriate regularization penalty in the limit of vanishing regions. [sent-55, score-0.394]
33 Within a small region Q we can (under mild continuity assumptions) approximate P (y|x) by a Taylor expansion around the mean point x0 ∈ Q, obtaining P (y|Q) ≈ P (y|x0 ) to first order. [sent-57, score-0.241]
34 The regularization penalty should not scale with the resolution at which we penalize information and we thus divide out the size-dependent part. [sent-59, score-0.425]
35 2.3 Regularizing the Domain We want to regularize the conditional P (y|x) across the domain X (or any subset of interest). [sent-63, score-0.252]
36 Since individual regions must be relatively small to preserve locality, we need multiple regions to cover the domain. [sent-64, score-0.764]
37 Since the regularization penalty is assigned to each region, the regions must overlap to ensure that the conditionals in different regions become functionally dependent. [sent-66, score-1.314]
38 In general, all areas with significant marginal density P (x) should be included in the cover; any area left out will not be regularized (areas of zero marginal need not be considered). [sent-68, score-0.614]
39 The cover should generally be connected (with respect to neighborhood relations of the regions) so that labeled points have the potential to influence all conditionals. [sent-69, score-0.499]
40 The amount of overlap between any two regions in the cover determines how strongly the corresponding conditionals are tied to each other. [sent-70, score-0.797]
41 On the other hand, the regions should be small to preserve locality. [sent-71, score-0.317]
42 The limit of a large number of small overlapping regions can be defined, and we ensure continuity of P (y|x) when the offset between regions vanishes relative to their size (in all dimensions). [sent-72, score-0.745]
43 3 Classification with Information Regularization Information regularization across multiple regions can be performed, for example, by minimizing the maximum information per region, subject to correct classification of the labeled points. [sent-73, score-0.851]
44 Specifically, we constrain each region in the cover (Q ∈ C) to carry at most γ units of information. [sent-74, score-0.388]
45 The resulting problem is: min over P (y|xk ) of γ (3a), subject to (MQ/VQ) IQ(x; y) ≤ γ for all Q ∈ C (3b), P (y|xk ) = δ(y, ỹk ) for k ∈ L (3c), and Σ_y P (y|xk ) = 1 with 0 ≤ P (y|xk ) ≤ 1 (3d). We have incorporated the labeled points by constraining their conditionals to the observed values (eq. 3c). [sent-78, score-0.692]
46 The solution P (y|x) to this optimization problem is unique in regions that achieve the information constraint with equality (as long as P (x) > 0). [sent-80, score-0.356]
47 Define an atomic subregion as a non-empty intersection of regions that cannot be further intersected by any region (Figure 2). [sent-82, score-0.848]
48 All unlabeled points in an atomic subregion belong to the same set of regions, and therefore participate in exactly the same constraints. [sent-83, score-0.646]
49 They will be regularized the same way, and since mutual information is a convex function, it will be minimized when the conditionals P (y|x) are equal in the atomic subregion. [sent-84, score-0.849]
50 We can therefore parsimoniously represent conditionals of atomic subregions, instead of individual points, merely by treating such atomic subregions as “merged points” and weighting the associated constraint by the probability mass contained in the subregion. [sent-85, score-1.215]
51 3.1 Incorporating Noisy Labels Labeled points participate in the information regularization in the same way as unlabeled points. [sent-87, score-0.596]
52 However, their conditionals have additional constraints, which incorporate the label information. [sent-88, score-0.391]
53 In equation 3c we used the constraint P (y|xk ) = δ(y, ỹk ) for all labeled points. [sent-89, score-0.378]
54 This constraint does not permit noise in the labels (and cannot be used when two points at the same location have disagreeing labels). [sent-90, score-0.254]
55 The conditionals of the labeled points are directly determined by their labels, and are treated as fixed constants. [sent-96, score-0.692]
56 As long as the permitted label error is below 0.5, the thresholded conditional classifies labeled points in the observed class. [sent-98, score-0.442]
57 In constraint (exp-lbl), the conditionals for labeled points can have an average error of at most b, where the average is over all labeled points. [sent-99, score-0.986]
58 Thus, a few points may have conditionals that deviate significantly from their observed labels, giving robustness against mislabeled points and outliers. [sent-100, score-0.591]
59 3.2 Continuous Densities Information regularization is also computationally feasible for continuous marginal densities, known or estimated. [sent-104, score-0.557]
60 For example, we may be given a continuous unlabeled data distribution P (x) and a few discrete labeled points, and regularize across a finite set of covering regions. [sent-105, score-0.636]
61 The conditionals are uniform inside atomic subregions (except at labeled points), requiring estimates of only a finite number of conditionals. [sent-106, score-1.079]
62 3.3 Implementation First, we choose appropriate regions forming a cover, and find the atomic subregions. [sent-108, score-0.555]
63 If the data is all discrete, create a spherical region centered at every labeled and unlabeled point (or over some reduced set still covering all the points). [sent-111, score-0.626]
64 We have used regions of fixed radius R, but the radius could also be set adaptively at each point to the distance to its K-th nearest neighbor. [sent-112, score-0.373]
65 The union of such regions is our cover, and we choose the radius R (or K) large enough to create a connected cover. [sent-113, score-0.351]
66 The cover induces a set of atomic subregions, and we merge the parameters P (y|x) of points inside individual atomic subregions (atomic subregions with no observed points can be ignored). [sent-114, score-1.366]
67 The marginal of each atomic subregion is proportional to the number of (merged) points it contains. [sent-115, score-0.677]
68 If continuous marginals are given, they will put probability mass in all atomic subregions where the marginal is non-zero. [sent-116, score-0.9]
69 To avoid considering an exponential number of subregions, we can limit the overlap between the regions by creating a sparser cover. [sent-117, score-0.37]
70 Given the cover, we now regularize the conditionals P (y|x) in the regions, according to eq. 3. [sent-118, score-0.423]
71 This is a convex minimization problem with a global minimum, since mutual information is convex in P (y|x). [sent-120, score-0.259]
72 For the optimization problem of eq. 3a, the required gradients of the constraints for the binary-class (y = {±1}) case (region Q, atomic subregion r) are: (MQ/VQ) · dIQ(x; y)/dP (y = 1|xr ) = (MQ/VQ) · P (xr |Q) · log[ P (y = 1|xr ) P (y = −1|Q) / (P (y = −1|xr ) P (y = 1|Q)) ]. [sent-123, score-0.401]
73 3.4 Minimize Average Information An alternative regularization criterion minimizes the average mutual information across regions. [sent-126, score-0.572]
74 When calculating the average, we must correct for the overlaps of intersecting regions to avoid double-counting (in contrast, the previous regularization criterion (eq. [sent-127, score-0.704]
75 3b) avoided double-counting by restricting information in each region individually). [sent-128, score-0.297]
76 The influence of a region is proportional to the probability mass MQ contained in it. [sent-129, score-0.308]
77 We can then minimize average mutual information according to min over P (y|xk ) of Σ_{Q∈C} (MQ*/VQ) IQ(x; y) (5a), where MQ* denotes the overlap-corrected probability mass of region Q, subject to the label and normalization constraints as in eq. 3. [sent-132, score-0.283]
78 3.4.1 Limiting Behavior The above average information criterion is a discrete version of a continuous regularization criterion. [sent-138, score-0.497]
79 In the limit of a large number of small regions in the cover (where the spacing of the regions vanishes relative to their size), we obtain a well-defined regularization criterion resulting in continuous P (y|x): min over P (y|x) of ∫ P (x) Σ_y P (y|x) (d log P (y|x)/dx)² dx, s.t. the conditionals at the labeled points match the observed labels. [sent-139, score-1.205]
80 More generally, we can formulate the regularization problem as a Tikhonov regularization, where the loss is the negative log-probability of labels: min over P (y|x) of −(1/NL) Σ_{k∈L} log P (ỹk |xk ) + λ ∫ P (x0 ) Σ_y P (y|x0 ) (d log P (y|x)/dx)² dx0 . [sent-144, score-0.509]
81 The minimizing conditional satisfies the Euler-Lagrange equation d/dx [∂F/∂(dP (y = 1|x)/dx)] − ∂F/∂P (y = 1|x) = 0, (8) where F denotes the integrand of the regularizer (natural boundary conditions apply since we can assume P (x) = 0 and P (y|x) = 0 at the boundary of the domain X ). [sent-151, score-0.24]
82 4 Results and Discussion We have experimentally studied the behavior of the regularizer with different marginal densities P (x). [sent-159, score-0.362]
83 Figure 3 shows the one-dimensional case with a continuous marginal density. [sent-160, score-0.354]
Figure 2: (Left) Three intersecting regions, and their atomic subregions (numbered). [sent-169, score-0.529]
86 P (y|x) for unlabeled points will be constant in atomic subregions. [sent-170, score-0.489]
Figure 3: (Right) The conditional (solid line) for a continuous marginal P (x) (dotted line) consisting of a mixture of two Gaussians, and two labeled points at (x = −0.… [sent-171, score-0.818]
The row of circles at the top depicts the region structure used (a rendering of overlapping one-dimensional intervals). [sent-174, score-0.251]
Figure 4: Conditionals (solid lines) for two continuous marginals (dotted lines) plus two labeled points. [sent-187, score-0.405]
90 Left: the marginal is uniform, and the conditional approaches a straight line. [sent-188, score-0.273]
The conditional changes slowly in regions of high density. [sent-190, score-0.364]
The marginal is a mixture of two Gaussians, and there are two discrete labeled points. [sent-191, score-0.263]
We choose NQ = 40 regions centered at uniform intervals of [−1, 1], overlapping each other half-way, creating NQ + 1 atomic subregions. [sent-192, score-0.655]
94 The conditional varies smoothly between the labeled points of opposite classes. [sent-197, score-0.442]
95 Note the dependence on the marginal density P (x). [sent-198, score-0.259]
Figure 5: Conditionals for two other continuous marginals plus two labeled points (marked as crosses and located at x = −1, 2 in the left figure and x = −2, 2 in the right), solved via the differential equation (eq. 8). [sent-228, score-0.644]
97 The conditionals are continuous but non-differentiable at the two labeled points (marked as crosses). [sent-230, score-0.787]
98 5 Conclusion We have presented an information theoretic regularization framework for combining conditional and marginal densities in a semi-supervised estimation setting. [sent-231, score-0.653]
99 The shape and structure of the regions give direct ways of imposing relations between particular variables or values of those variables. [sent-236, score-0.301]
100 The regions can be easily defined on low-dimensional data manifolds. [sent-237, score-0.277]
wordName wordTfidf (topN-words)
[('conditionals', 0.337), ('regions', 0.277), ('regularization', 0.276), ('atomic', 0.252), ('xk', 0.236), ('mq', 0.232), ('subregions', 0.232), ('labeled', 0.228), ('region', 0.207), ('marginal', 0.186), ('mutual', 0.173), ('cover', 0.144), ('vq', 0.143), ('iq', 0.134), ('points', 0.127), ('subregion', 0.112), ('unlabeled', 0.11), ('continuous', 0.095), ('dx', 0.089), ('conditional', 0.087), ('regularizer', 0.086), ('yk', 0.086), ('regularize', 0.086), ('labels', 0.086), ('xr', 0.076), ('tommi', 0.076), ('mass', 0.076), ('density', 0.073), ('szummer', 0.067), ('densities', 0.066), ('var', 0.061), ('differential', 0.059), ('marginals', 0.059), ('penalize', 0.057), ('penalty', 0.054), ('limiting', 0.054), ('boundary', 0.052), ('bfgs', 0.052), ('doublecounting', 0.052), ('covering', 0.05), ('radius', 0.048), ('domain', 0.047), ('min', 0.047), ('intersecting', 0.045), ('participate', 0.045), ('lab', 0.044), ('overlapping', 0.044), ('constraint', 0.041), ('preserve', 0.04), ('overlap', 0.039), ('calculus', 0.038), ('information', 0.038), ('constrain', 0.037), ('log', 0.037), ('merged', 0.036), ('vanishing', 0.036), ('jaakkola', 0.035), ('fall', 0.035), ('discrete', 0.035), ('fisher', 0.035), ('continuity', 0.034), ('regularizing', 0.034), ('eq', 0.034), ('sloan', 0.033), ('vanishes', 0.033), ('across', 0.032), ('spherical', 0.031), ('crosses', 0.03), ('uniform', 0.03), ('classi', 0.029), ('martin', 0.029), ('label', 0.029), ('ai', 0.029), ('limit', 0.028), ('criterion', 0.028), ('ensure', 0.028), ('bits', 0.027), ('partitioning', 0.027), ('must', 0.026), ('creating', 0.026), ('bottleneck', 0.026), ('choose', 0.026), ('average', 0.025), ('marked', 0.025), ('entropy', 0.025), ('regularized', 0.025), ('incorporate', 0.025), ('contained', 0.025), ('size', 0.024), ('behavior', 0.024), ('convex', 0.024), ('shape', 0.024), ('adjusted', 0.024), ('discriminative', 0.024), ('formulate', 0.023), ('substituting', 0.023), ('posterior', 0.023), ('xi', 0.023), ('equation', 0.023), ('plus', 0.023)]
simIndex simValue paperId paperTitle
same-paper 1 1.0000004 114 nips-2002-Information Regularization with Partially Labeled Data
Author: Martin Szummer, Tommi S. Jaakkola
Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.
2 0.1378981 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
Author: Harald Steck, Tommi S. Jaakkola
Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as
3 0.13627708 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
Author: Olivier Chapelle, Jason Weston, Bernhard SchĂślkopf
Abstract: We propose a framework to incorporate unlabeled data in kernel classifier, based on the idea that two points in the same cluster are more likely to have the same label. This is achieved by modifying the eigenspectrum of the kernel matrix. Experimental results assess the validity of this approach. 1
4 0.10979089 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
5 0.1094821 124 nips-2002-Learning Graphical Models with Mercer Kernels
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present a class of algorithms for learning the structure of graphical models from data. The algorithms are based on a measure known as the kernel generalized variance (KGV), which essentially allows us to treat all variables on an equal footing as Gaussians in a feature space obtained from Mercer kernels. Thus we are able to learn hybrid graphs involving discrete and continuous variables of arbitrary type. We explore the computational properties of our approach, showing how to use the kernel trick to compute the relevant statistics in linear time. We illustrate our framework with experiments involving discrete and continuous data.
6 0.094014652 135 nips-2002-Learning with Multiple Labels
7 0.079017311 89 nips-2002-Feature Selection by Maximum Marginal Diversity
8 0.073449291 64 nips-2002-Data-Dependent Bounds for Bayesian Mixture Methods
9 0.071272708 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
10 0.068653122 88 nips-2002-Feature Selection and Classification on Matrix Data: From Large Margins to Small Covering Numbers
11 0.066936061 106 nips-2002-Hyperkernels
12 0.066341504 19 nips-2002-Adapting Codes and Embeddings for Polychotomies
13 0.063905545 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
14 0.063631296 53 nips-2002-Clustering with the Fisher Score
15 0.061909124 150 nips-2002-Multiple Cause Vector Quantization
16 0.061691642 151 nips-2002-Multiplicative Updates for Nonnegative Quadratic Programming in Support Vector Machines
17 0.058532022 90 nips-2002-Feature Selection in Mixture-Based Clustering
18 0.058464844 41 nips-2002-Bayesian Monte Carlo
19 0.058149431 83 nips-2002-Extracting Relevant Structures with Side Information
20 0.054747678 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
topicId topicWeight
[(0, -0.182), (1, -0.079), (2, 0.01), (3, 0.012), (4, 0.003), (5, 0.044), (6, -0.063), (7, -0.004), (8, -0.05), (9, 0.045), (10, 0.068), (11, -0.033), (12, 0.03), (13, -0.02), (14, -0.047), (15, -0.051), (16, -0.008), (17, 0.015), (18, 0.073), (19, 0.021), (20, 0.033), (21, 0.041), (22, -0.04), (23, -0.093), (24, 0.064), (25, -0.137), (26, -0.127), (27, 0.029), (28, -0.112), (29, 0.018), (30, -0.008), (31, -0.287), (32, -0.032), (33, -0.134), (34, 0.104), (35, -0.08), (36, 0.043), (37, 0.061), (38, 0.123), (39, 0.03), (40, 0.103), (41, 0.008), (42, 0.059), (43, -0.096), (44, 0.114), (45, -0.031), (46, 0.021), (47, 0.093), (48, -0.092), (49, -0.138)]
simIndex simValue paperId paperTitle
same-paper 1 0.97489864 114 nips-2002-Information Regularization with Partially Labeled Data
Author: Martin Szummer, Tommi S. Jaakkola
Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.
2 0.64828646 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
Author: Harald Steck, Tommi S. Jaakkola
Abstract: A common objective in learning a model from data is to recover its network structure, while the model parameters are of minor interest. For example, we may wish to recover regulatory networks from high-throughput data sources. In this paper we examine how Bayesian regularization using a product of independent Dirichlet priors over the model parameters affects the learned model structure in a domain with discrete variables. We show that a small scale parameter - often interpreted as
3 0.524306 124 nips-2002-Learning Graphical Models with Mercer Kernels
Author: Francis R. Bach, Michael I. Jordan
Abstract: We present a class of algorithms for learning the structure of graphical models from data. The algorithms are based on a measure known as the kernel generalized variance (KGV), which essentially allows us to treat all variables on an equal footing as Gaussians in a feature space obtained from Mercer kernels. Thus we are able to learn hybrid graphs involving discrete and continuous variables of arbitrary type. We explore the computational properties of our approach, showing how to use the kernel trick to compute the relevant statistics in linear time. We illustrate our framework with experiments involving discrete and continuous data.
4 0.45391804 178 nips-2002-Robust Novelty Detection with Single-Class MPM
Author: Laurent E. Ghaoui, Michael I. Jordan, Gert R. Lanckriet
Abstract: In this paper we consider the problem of novelty detection, presenting an algorithm that aims to find a minimal region in input space containing a fraction α of the probability mass underlying a data set. This algorithm- the
5 0.4531939 68 nips-2002-Discriminative Densities from Maximum Contrast Estimation
Author: Peter Meinicke, Thorsten Twellmann, Helge Ritter
Abstract: We propose a framework for classifier design based on discriminative densities for representation of the differences of the class-conditional distributions in a way that is optimal for classification. The densities are selected from a parametrized set by constrained maximization of some objective function which measures the average (bounded) difference, i.e. the contrast between discriminative densities. We show that maximization of the contrast is equivalent to minimization of an approximation of the Bayes risk. Therefore using suitable classes of probability density functions, the resulting maximum contrast classifiers (MCCs) can approximate the Bayes rule for the general multiclass case. In particular for a certain parametrization of the density functions we obtain MCCs which have the same functional form as the well-known Support Vector Machines (SVMs). We show that MCC-training in general requires some nonlinear optimization but under certain conditions the problem is concave and can be tackled by a single linear program. We indicate the close relation between SVM- and MCC-training and in particular we show that Linear Programming Machines can be viewed as an approximate realization of MCCs. In the experiments on benchmark data sets, the MCC shows a competitive classification performance.
6 0.43866956 93 nips-2002-Forward-Decoding Kernel-Based Phone Recognition
7 0.4073956 29 nips-2002-Analysis of Information in Speech Based on MANOVA
8 0.40378034 138 nips-2002-Manifold Parzen Windows
9 0.37573999 7 nips-2002-A Hierarchical Bayesian Markovian Model for Motifs in Biopolymer Sequences
10 0.36348155 89 nips-2002-Feature Selection by Maximum Marginal Diversity
11 0.356837 41 nips-2002-Bayesian Monte Carlo
12 0.35419595 121 nips-2002-Knowledge-Based Support Vector Machine Classifiers
13 0.34884256 204 nips-2002-VIBES: A Variational Inference Engine for Bayesian Networks
14 0.33482367 52 nips-2002-Cluster Kernels for Semi-Supervised Learning
15 0.31717691 172 nips-2002-Recovering Articulated Model Topology from Observed Rigid Motion
16 0.31605795 53 nips-2002-Clustering with the Fisher Score
17 0.31375152 111 nips-2002-Independent Components Analysis through Product Density Estimation
18 0.31318289 117 nips-2002-Intrinsic Dimension Estimation Using Packing Numbers
19 0.31306693 135 nips-2002-Learning with Multiple Labels
20 0.31271744 190 nips-2002-Stochastic Neighbor Embedding
topicId topicWeight
[(14, 0.023), (23, 0.017), (42, 0.054), (54, 0.093), (55, 0.024), (57, 0.011), (68, 0.021), (74, 0.579), (92, 0.024), (98, 0.062)]
simIndex simValue paperId paperTitle
1 0.98988843 57 nips-2002-Concurrent Object Recognition and Segmentation by Graph Partitioning
Author: Stella X. Yu, Ralph Gross, Jianbo Shi
Abstract: Segmentation and recognition have long been treated as two separate processes. We propose a mechanism based on spectral graph partitioning that readily combine the two processes into one. A part-based recognition system detects object patches, supplies their partial segmentations as well as knowledge about the spatial configurations of the object. The goal of patch grouping is to find a set of patches that conform best to the object configuration, while the goal of pixel grouping is to find a set of pixels that have the best low-level feature similarity. Through pixel-patch interactions and between-patch competition encoded in the solution space, these two processes are realized in one joint optimization problem. The globally optimal partition is obtained by solving a constrained eigenvalue problem. We demonstrate that the resulting object segmentation eliminates false positives for the part detection, while overcoming occlusion and weak contours for the low-level edge detection.
same-paper 2 0.97364444 114 nips-2002-Information Regularization with Partially Labeled Data
Author: Martin Szummer, Tommi S. Jaakkola
Abstract: Classification with partially labeled data requires using a large number of unlabeled examples (or an estimated marginal P (x)), to further constrain the conditional P (y|x) beyond a few available labeled examples. We formulate a regularization approach to linking the marginal and the conditional in a general way. The regularization penalty measures the information that is implied about the labels over covering regions. No parametric assumptions are required and the approach remains tractable even for continuous marginal densities P (x). We develop algorithms for solving the regularization problem for finite covers, establish a limiting differential equation, and exemplify the behavior of the new regularization approach in simple cases.
3 0.96669346 78 nips-2002-Efficient Learning Equilibrium
Author: Ronen I. Brafman, Moshe Tennenholtz
Abstract: We introduce efficient learning equilibrium (ELE), a normative approach to learning in non cooperative settings. In ELE, the learning algorithms themselves are required to be in equilibrium. In addition, the learning algorithms arrive at a desired value after polynomial time, and deviations from a prescribed ELE become irrational after polynomial time. We prove the existence of an ELE in the perfect monitoring setting, where the desired value is the expected payoff in a Nash equilibrium. We also show that an ELE does not always exist in the imperfect monitoring case. Yet, it exists in the special case of common-interest games. Finally, we extend our results to general stochastic games. 1
Author: Terry Elliott, Jörg Kramer
Abstract: A neurotrophic model for the co-development of topography and ocular dominance columns in the primary visual cortex has recently been proposed. In the present work, we test this model by driving it with the output of a pair of neuronal vision sensors stimulated by disparate moving patterns. We show that the temporal correlations in the spike trains generated by the two sensors elicit the development of refined topography and ocular dominance columns, even in the presence of significant amounts of spontaneous activity and fixed-pattern noise in the sensors.
5 0.95928389 90 nips-2002-Feature Selection in Mixture-Based Clustering
Author: Martin H. Law, Anil K. Jain, Mário Figueiredo
Abstract: There exist many approaches to clustering, but the important issue of feature selection, i.e., selecting the data attributes that are relevant for clustering, is rarely addressed. Feature selection for clustering is difficult due to the absence of class labels. We propose two approaches to feature selection in the context of Gaussian mixture-based clustering. In the first one, instead of making hard selections, we estimate feature saliencies. An expectation-maximization (EM) algorithm is derived for this task. The second approach extends Koller and Sahami’s mutual-informationbased feature relevance criterion to the unsupervised case. Feature selection is then carried out by a backward search scheme. This scheme can be classified as a “wrapper”, since it wraps mixture estimation in an outer layer that performs feature selection. Experimental results on synthetic and real data show that both methods have promising performance.
6 0.91019642 162 nips-2002-Parametric Mixture Models for Multi-Labeled Text
7 0.76880264 175 nips-2002-Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games
8 0.73333853 39 nips-2002-Bayesian Image Super-Resolution
9 0.70521742 152 nips-2002-Nash Propagation for Loopy Graphical Games
10 0.70025104 89 nips-2002-Feature Selection by Maximum Marginal Diversity
11 0.69783133 157 nips-2002-On the Dirichlet Prior and Bayesian Regularization
12 0.69673479 132 nips-2002-Learning to Detect Natural Image Boundaries Using Brightness and Texture
13 0.69294596 182 nips-2002-Shape Recipes: Scene Representations that Refer to the Image
14 0.68895251 74 nips-2002-Dynamic Structure Super-Resolution
15 0.67662799 122 nips-2002-Learning About Multiple Objects in Images: Factorial Learning without Factorial Search
16 0.66746759 173 nips-2002-Recovering Intrinsic Images from a Single Image
17 0.65053564 100 nips-2002-Half-Lives of EigenFlows for Spectral Clustering
18 0.64949715 135 nips-2002-Learning with Multiple Labels
19 0.64786601 2 nips-2002-A Bilinear Model for Sparse Coding
20 0.64237833 105 nips-2002-How to Combine Color and Shape Information for 3D Object Recognition: Kernels do the Trick