nips nips2009 nips2009-72 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Novi Quadrianto, James Petterson, Alex J. Smola
Abstract: Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach. 1
Reference: text
sentIndex sentText sentNum sentScore
1 Abstract Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. [sent-9, score-0.488]
2 We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. [sent-12, score-0.707]
3 At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. [sent-13, score-0.267]
4 1 Introduction Transduction relies on the fundamental assumption that training and test data should exhibit similar behavior. [sent-16, score-0.186]
5 For instance, in large margin classification a popular concept is to assume that both training and test data should be separable with a large margin [4]. [sent-17, score-0.388]
6 A similar matching assumption is made by [8, 15] in requiring that class means are balanced between training and test set. [sent-18, score-0.404]
7 Such matching assumptions are well founded: after all, we assume that both training data X = {x1 , . . . , xm } and test data X' are drawn from the same distribution p(x). [sent-20, score-0.259]
8 It therefore follows that for any function (or set of functions) f : X → R the distribution of f (x) where x ∼ p(x) should also behave in the same way on both training and test set. [sent-27, score-0.186]
9 That is, it is applicable without much need for customization to all estimation problems, whether structured or not. [sent-33, score-0.119]
10 At its heart it uses the following: rather than minimizing only the empirical risk, regularized risk, log-posterior, or related quantities obtained only on the training set, let us add a divergence term characterizing the mismatch in distributions between training and test set. [sent-37, score-0.365]
11 Moreover, we show that for certain choices of kernels we are able to recover a number of existing transduction constraints as a special case. [sent-39, score-0.666]
12 That said, while distribution matching always holds, thus making our method always applicable, it is not entirely clear whether the cluster assumption is always satisfied (e. [sent-44, score-0.194]
13 Distribution matching, however, comes with a nontrivial price: the objective of the optimization problem ceases to be convex except for rather special cases (which correspond to algorithms that have been proposed as previous work). [sent-47, score-0.112]
14 While this is a downside, it is a property inherent in most transduction algorithms — after all, we are dealing with algorithms to obtain self-consistent labelings, predictions, or regressions on the data and there may exist more than one potential solution. [sent-48, score-0.634]
15 Moreover, denote by X, Y the data and labels of the training set and by X', Y' the test data and labels respectively. [sent-50, score-0.278]
16 In general, when designing an estimator one attempts to minimize some regularized risk functional $R_{\mathrm{reg}}[f, X, Y] := \frac{1}{m} \sum_{i=1}^{m} l(x_i, y_i, f) + \lambda \Omega[f]$ (1), or alternatively (in a Bayesian setting) one deals with a log-posterior probability $\sum_{i=1}^{m} \log p(y_i \mid x_i, f) + \log p(f) + \mathrm{const.}$ [sent-51, score-0.149]
17 f typically is a mapping X → R (for scalar problems such as regression or classification) or X → R^d (for multivariate problems such as named entity tagging, image annotation, matching, ranking, or more generally the clique potentials of graphical models). [sent-53, score-0.393]
18 Denote by f(X) = {f(x_1), . . . , f(x_m)} and f(X') the applications of our estimator (and any related quantities) to the training and test set respectively. [sent-64, score-0.186]
19 After all, we want the empirical risk on the training and test sets to match. [sent-67, score-0.243]
20 This reasoning leads us to the following additional term for the objective function of a transduction problem: D(f(X), f(X')) (4). Here D(f(X), f(X')) denotes the distance between the two distributions f(X) and f(X'). [sent-69, score-0.798]
21 This leads to an overall objective for learning, $R_{\mathrm{train}}[f, X, Y] + \gamma D(f(X), f(X'))$ for some γ > 0 (5), when performing transductive inference. [sent-70, score-0.302]
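As an illustration of the objective in (5), the following Python sketch estimates D(f(X), f(X')) with a Gaussian RBF kernel on the outputs (an MMD-style two-sample estimate, in line with the Hilbert-space distance mentioned in the abstract) and adds it to a hinge-loss training risk for a linear model. The kernel width, regularizer and γ defaults, as well as all function names, are illustrative assumptions rather than the paper's settings.

import numpy as np

def rbf_kernel(a, b, width=1.0):
    # Gaussian RBF kernel between two sets of outputs (scalars or vectors).
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * width ** 2))

def output_distance(fX, fXp, width=1.0):
    # Biased sample estimate of the squared RKHS distance D(f(X), f(X')).
    return (rbf_kernel(fX, fX, width).mean()
            - 2.0 * rbf_kernel(fX, fXp, width).mean()
            + rbf_kernel(fXp, fXp, width).mean())

def transduction_objective(w, X, y, X_test, lam=1e-3, gamma=1.0, width=1.0):
    # R_train[f, X, Y] + gamma * D(f(X), f(X')) for a linear scorer f(x) = <w, x>,
    # using a hinge training loss and an L2 regularizer as stand-ins.
    fX, fXp = X @ w, X_test @ w
    train_risk = np.maximum(0.0, 1.0 - y * fX).mean() + lam * np.dot(w, w)
    return train_risk + gamma * output_distance(fX, fXp, width)

Setting γ = 0 recovers the purely inductive objective, which is the baseline the experiments compare against.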
22 Mean Matching for Classification: Joachims [8] uses the following balancing constraint in the objective function of a binary classifier where ŷ(x) = sgn(f(x)) for f(x) = ⟨w, x⟩. [sent-80, score-0.162]
23 In order to balance the outputs between training and test set, [8] imposes the linear constraint $\frac{1}{m} \sum_{i=1}^{m} f(x_i) = \frac{1}{m'} \sum_{i=1}^{m'} f(x'_i)$ (8). [sent-81, score-0.218]
24 Assuming a linear kernel k on R, this constraint is equivalent to requiring that $\mu[f(X)] = \frac{1}{m} \sum_{i=1}^{m} \langle f(x_i), \cdot \rangle = \frac{1}{m'} \sum_{i=1}^{m'} \langle f(x'_i), \cdot \rangle = \mu[f(X')]$. [sent-82, score-0.149]
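Spelled out, a standard mean-map calculation (stated here under the usual RKHS embedding assumptions, not quoted from the paper) makes the equivalence explicit: $D(f(X), f(X')) = \| \mu[f(X)] - \mu[f(X')] \|$, and for the linear kernel $k(a, b) = ab$ on R, $\| \mu[f(X)] - \mu[f(X')] \|^2 = \big( \frac{1}{m} \sum_{i=1}^{m} f(x_i) - \frac{1}{m'} \sum_{i=1}^{m'} f(x'_i) \big)^2$, so driving the distance to zero recovers exactly the balancing constraint (8).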
25 [5] propose to perform transduction by requiring that the conditional class probabilities on training and test set match. [sent-88, score-0.877]
26 That is, for classifiers generating a distribution of the form yi ∼ p(yi |xi , w) they require that the marginal class probability on the test set matches the empirical class probability on the training set. [sent-89, score-0.278]
27 Again, this can be cast in terms of distribution matching via $\mu[g \circ f(X)] = \frac{1}{m} \sum_{i=1}^{m} \langle g \circ f(x_i), \cdot \rangle = \frac{1}{m'} \sum_{i=1}^{m'} \langle g \circ f(x'_i), \cdot \rangle = \mu[g \circ f(X')]$. Here $g(\chi) = \frac{1}{1 + e^{-\chi}}$ denotes the likelihood of y = 1 in logistic regression for the model $p(y \mid \chi) = \frac{1}{1 + e^{-y\chi}}$. [sent-90, score-0.200]
28 In other words, generalizing distribution matching to apply to transforms other than the logistic leads us directly to our new transduction criterion. [sent-96, score-0.795]
29 Figure caption, from left to right: induction scores on the training set; test set; transduction scores on the training set; test set. Note that while the margin distributions on training and test set are very different for induction, the ones for transduction match rather well. [sent-98, score-2.116]
30 Distribution Matching for Regression A similar idea for transduction was proposed by [10] in the context of regression: requiring that both means and predictive variances of the estimate agree between training and test set. [sent-100, score-0.877]
31 For a heteroscedastic regression estimate this constraint between training and test set is met simply by ensuring that the distributions over first and second order moments of a Gaussian exponential family distribution match. [sent-101, score-0.298]
32 The same goal can be achieved by using a polynomial kernel of second degree on the estimates, which shows that regression transduction can be viewed as a special case. [sent-102, score-0.765]
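To make the reduction concrete, a standard feature-map argument (an illustration under the stated kernel choice, not quoted from the paper): for a second-degree polynomial kernel on the scalar estimates, e.g. $k(a, b) = (ab + 1)^2$, the induced feature map contains the monomials $1$, $a$ and $a^2$ up to constants, so $\mu[f(X)] = \mu[f(X')]$ holds iff $\frac{1}{m} \sum_i f(x_i) = \frac{1}{m'} \sum_i f(x'_i)$ and $\frac{1}{m} \sum_i f(x_i)^2 = \frac{1}{m'} \sum_i f(x'_i)^2$; matching kernel means therefore matches the first and second order moments of the predictions, which is the mean-and-variance constraint of [10].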
33 Large Margin Hypothesis A key assumption in transduction is that a good hypothesis is characterized by a large margin of separation on both training and test set. [sent-103, score-0.921]
34 Generalizations of this approach to multiclass and structured estimation settings are not entirely trivial and require a number of heuristic choices (e. [sent-107, score-0.267]
35 Instead, if we require that the distribution of values f(x, ·) on X' match those on X, we automatically obtain a loss function which enforces the large margin hypothesis whenever it is actually achievable on the training set. [sent-110, score-0.234]
36 4 Algorithm. Streaming Approximation: In general, minimizing D(f(X), f(X')) is computationally infeasible since the estimation of the distributional distance requires access to f(X) and f(X') rather than evaluations on a small sample. [sent-115, score-0.144]
37 Stochastic Gradient Descent: The fact that the estimator D̂ of the distance D decomposes into an average over a function of pairs from the training and test set respectively means that we can use D̂_i as a stochastic approximation. [sent-124, score-0.289]
38 Remark: The streaming formulation does not impose any in-principle limitation regarding matching sample sizes. [sent-129, score-0.195]
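A minimal sketch of one plausible stochastic update, assuming a small minibatch of labelled training points and a small minibatch of unlabelled test points are drawn at each step; this is an illustration consistent with the streaming description above, not a transcription of the paper's Algorithm 1, and the batch handling, step size, kernel width, λ and γ are assumptions.

import numpy as np

def rbf(a, b, width):
    # Gaussian RBF kernel on scalar outputs.
    return np.exp(-(a - b) ** 2 / (2.0 * width ** 2))

def sgd_step(w, xb, yb, xb_test, eta=0.1, lam=1e-3, gamma=1.0, width=1.0):
    # One stochastic update for a linear scorer f(x) = <w, x> on a minibatch of
    # labelled points (xb, yb) and a minibatch of unlabelled test points xb_test.
    f_tr, f_te = xb @ w, xb_test @ w

    # Subgradient of the hinge training risk plus the L2 regularizer.
    viol = (yb * f_tr) < 1.0
    grad = -(yb[viol][:, None] * xb[viol]).sum(0) / len(xb) + 2.0 * lam * w

    # Gradient of the minibatch RBF distance estimate between f(X) and f(X'):
    # d/dw of mean_ij k(f(a_i), f(b_j)) for a Gaussian kernel on the outputs.
    def pair_grad(fa, xa, fb, xbv):
        diff = fa[:, None] - fb[None, :]
        coef = rbf(fa[:, None], fb[None, :], width) * (-diff / width ** 2)
        return np.einsum('ij,ijk->k', coef, xa[:, None, :] - xbv[None, :, :]) / coef.size

    grad += gamma * (pair_grad(f_tr, xb, f_tr, xb)
                     - 2.0 * pair_grad(f_tr, xb, f_te, xb_test)
                     + pair_grad(f_te, xb_test, f_te, xb_test))
    return w - eta * grad

Because the distance estimate decomposes over pairs of training and test outputs, only the current minibatches are needed to form each update, which is what makes the streaming formulation feasible.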
39 DC programming has been used extensively in many other transductive algorithms to deal with the non-convexity of the objective function. [sent-134, score-0.302]
40 In order to minimize an additively decomposable objective function as in our transductive estimation, we could use stochastic gradient descent on the convex upper bound. [sent-137, score-0.518]
41 Note that here the convex upper bound is given by a sum over the convex upper bounds for all terms. [sent-138, score-0.16]
42 This strategy, however, is deficient in a significant aspect: the convex upper bounds on each of the loss terms become increasingly loose as we move f away from the current point of approximation. [sent-139, score-0.115]
43 It would be considerably better if we updated the upper bound after every stochastic gradient descent step. [sent-140, score-0.177]
44 This variant, however, is identical to stochastic gradient descent on the original objective function due to the following: $\partial_x F(x)\big|_{x=x_0} = \partial_x \bar{F}(x, x_0)\big|_{x=x_0} = \partial_x G(x)\big|_{x=x_0} - \partial_x H(x)\big|_{x=x_0}$ for all $x_0$ (17). [sent-141, score-0.177]
45 In other words, in order to compute the gradient of the upper bound we need not compute the upper bound itself. [sent-142, score-0.115]
46 Instead we may use the nonconvex objective directly; hence we did not pursue the DC programming approach, and Algorithm 1 applies. [sent-143, score-0.116]
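For completeness, the standard DC (concave-convex) construction behind (17), written out under the usual assumptions rather than quoted from the paper: with $F(x) = G(x) - H(x)$ and $G, H$ convex, the convex upper bound at the current point $x_0$ linearizes $H$,
$\bar{F}(x, x_0) = G(x) - H(x_0) - \langle \partial_x H(x_0),\, x - x_0 \rangle \ \geq\ F(x),$ with equality at $x = x_0$.
Differentiating at $x = x_0$ gives $\partial_x \bar{F}(x, x_0)\big|_{x=x_0} = \partial_x G(x_0) - \partial_x H(x_0) = \partial_x F(x)\big|_{x=x_0}$, which is (17): the gradient of the upper bound at the current point coincides with the gradient of the nonconvex objective, so refreshing the bound after every step is the same as plain stochastic gradient descent on the original objective.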
47 5 Experiments. To demonstrate the applicability of our approach, we apply transduction to binary and multiclass classification, both on toy datasets from the UCI repository [16] and the LibSVM site [17], plus a larger scale multi-category classification dataset with 3. [sent-144, score-0.909]
48 We also perform experiments on structured estimation problems, i.e. [sent-146, score-0.119]
49 the Japanese named entity recognition task and the CoNLL-2000 base NP chunking task. [sent-148, score-0.329]
50 Algorithms Since we are not aware of other transductive algorithms which can be applied easily to all the problems we consider, we choose problem-specific transduction algorithms as competitors. [sent-149, score-0.895]
51 This method is a variant of the transductive SVM algorithm [8], tailored for linear semi-supervised binary classification on large and sparse datasets, and involves switching more than a single pair of labels at a time. [sent-151, score-0.395]
52 For multiclass categorization we pick a Gaussian-process-based transductive algorithm with a distribution matching term (GPDistMatch) [5]. [sent-152, score-0.567]
53 We use stochastic gradient descent for optimization in both inductive and transductive settings for binary and multiclass losses. [sent-153, score-0.634]
54 More specifically, for transduction we use the Gaussian RBF kernel to compare distributions in (14). [sent-154, score-0.735]
55 Note that, in the multiclass case, the additional distribution matching term measures the distance between multivariate functions. [sent-155, score-0.323]
56 Since we anticipate the relevant length scale in the margin distribution to be in the order of 1 (after all, we use a loss function, i.e. [sent-158, score-0.168]
57 a hinge loss, which uses a margin of 1) we pick a Gaussian RBF kernel width of 0. [sent-160, score-0.209]
58 Moreover, to take scaling in the number of classes √ into account we choose a kernel width of 0. [sent-162, score-0.108]
59 We split data equally into training and test sets, performing model selection on the training set and assessing performance on the test set. [sent-166, score-0.411]
60 In these small scale experiments, we tune hyperparameters via 5-fold cross validation on the entire training set. [sent-167, score-0.13]
61 More specifically, in the model selection stage, for transduction we adjust the regularization λ and the transductive weight term γ (obviously, for inductive inference we only need to adjust λ). [sent-169, score-0.964]
62 For GP transduction both the regularization and divergence parameters were adjusted. [sent-172, score-0.634]
63 Results The experimental results are summarized in Figure 2 for a binary setting and in Table 1 for a multiclass problem. [sent-173, score-0.168]
64 Of the 23 binary datasets, transduction outperforms the inductive setup on 20 of them. [sent-174, score-0.756]
65 Arguably, our proposed transductive method performs on a par with the state-of-the-art transductive approach for each learning problem. [sent-175, score-0.522]
66 In the binary estimation, out of 23 datasets, our method performs significantly worse than the MultiSwitch transduction algorithm on 4 datasets (adult, bupa, pima, and svmguide3) and significantly better on 2 datasets (ionosphere and pageblock), using a one-sided paired t-test with 95% confidence. [sent-176, score-0.757]
67 Further, by casting the transductive solution as an online optimization method, our approach scales well. [sent-182, score-0.261]
68 Larger Scale Experiments Since one of the key points of our approach is that it can be applied to large problems, we performed transduction on the DMOZ ontology [20] of topics. [sent-183, score-0.702]
69 DistMatch: distribution matching (ours) and MultiSwitch: Multi switch transductive SVM, [14]. [sent-190, score-0.422]
70 DistMatch: distribution matching (ours) and GPDistMatch: Gaussian Process transduction, [5]. [sent-193, score-0.161]
71 Table 2: Error rate on the DMOZ ontology for increasing training / test set sizes. [sent-224, score-0.288]
72 Table 2 columns: training / test set size 50,000, 100,000, 200,000, 400,000, 800,000, 1,600,000; rows for induction and transduction. [sent-225, score-0.334]
73 Table 3: Error rate on the DMOZ ontology for a fixed training set size of 100,000 samples. [sent-237, score-0.8]
74 Table 3 columns: test set size 100,000, 200,000, 400,000, 800,000, 1,600,000; rows for induction and transduction. [sent-238, score-0.236]
75 Table 4: Accuracy, precision, recall and Fβ=1 score on the Japanese named entity task. [sent-248, score-0.9]
76 Table 5: Accuracy, precision, recall and Fβ=1 score on the CoNLL-2000 base NP chunking task. [sent-257, score-0.165]
77 To our knowledge, our proposed transduction method is the only one that scales very well due to the stochastic approximation. [sent-269, score-0.69]
78 For each experiment, we split data into training and test sets. [sent-270, score-0.225]
79 Model selection is performed on the training set by putting aside part of the training data as a validation set, which is then used exclusively for tuning the hyperparameters. [sent-271, score-0.263]
80 In large scale transduction two issues matter: firstly, the algorithm needs to be scalable with respect to the training set size. [sent-272, score-0.802]
81 Secondly, we need to be able to scale the algorithm with respect to the test set. [sent-273, score-0.12]
82 Note that Table 2 uses an equal split between training and test sets, while Table 3 uses an unequal split where the test set has many more observations. [sent-275, score-0.352]
83 We see that the algorithm improves with increasing data size, both for training and test sets. [sent-276, score-0.186]
84 We suspect that a location-dependent transduction score would be useful in this context, i.e. [sent-278, score-0.685]
85 instead of only minimizing the discrepancy between decision function values on training and test set, D(f(X), f(X')), we could also introduce local features, D((X, f(X)), (X', f(X'))). [sent-280, score-0.186]
86 Japanese Named Entity Recognition Experiments: A key advantage of our transduction algorithm is that it can be applied to structured estimation without modification. [sent-281, score-0.753]
87 The data contains 716 Japanese sentences with 17 annotated named entities. [sent-283, score-0.145]
88 That is, we have clique potentials joining adjacent labels (yi , yi+1 ), but which are independent of the text itself, and clique potentials joining words and labels (xi , yi ). [sent-288, score-0.536]
89 For the latter, though, we want to enforce that clique potentials are distributed in the same way between training and test set. [sent-290, score-0.325]
90 Since the chain length itself is a random variable, we perform distribution matching on a per-token basis; we oversample each token 10 times in our experiments. [sent-294, score-0.161]
91 The additional distribution matching term then measures the distance between these over-sampled clique potentials. [sent-296, score-0.283]
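A minimal sketch of this per-token matching term, assuming per-token clique-potential values have already been extracted from the CRF for each sentence; the random token resampling stands in for the per-token oversampling used in the streaming approximation, and all names, shapes and defaults are illustrative assumptions.

import numpy as np

def per_token_distance(train_potentials, test_potentials, n_tokens=500, width=1.0, seed=0):
    # train_potentials / test_potentials: lists of (length_i, d) arrays holding
    # per-token clique-potential values, one array per sentence (chain lengths vary).
    # Flatten both sides to token level, resample tokens, and compare the two
    # token samples with a Gaussian RBF (MMD-style) distance.
    rng = np.random.default_rng(seed)
    a_all = np.concatenate(train_potentials, axis=0)
    b_all = np.concatenate(test_potentials, axis=0)
    a = a_all[rng.integers(0, len(a_all), n_tokens)]
    b = b_all[rng.integers(0, len(b_all), n_tokens)]

    def k(u, v):
        sq = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * width ** 2))

    return k(a, a).mean() - 2.0 * k(a, b).mean() + k(b, b).mean()

The resulting scalar plays the role of the D(·, ·) term in (5), weighted by γ as before.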
92 As before, we split data equally into training and test sets and put aside part of the training data as a validation set which is used exclusively for tuning the hyperparameters. [sent-297, score-0.39]
93 We report results in Table 4: precision (the fraction of predicted name tags which match the reference tags), recall (the fraction of reference tags returned), and their harmonic mean, Fβ=1. [sent-299, score-0.206]
94 CoNLL-2000 Base NP Chunking Experiments Our second structured estimation experiment is the CoNLL-2000 base NP chunking dataset [13] as provided in the CRF++ toolkit. [sent-301, score-0.233]
95 Similarly to the Japanese named entity recognition task, we model 1D chain CRFs with only first-order Markov dependencies between chunk tags. [sent-304, score-0.366]
96 The same experimental setup as in the named entity experiments is used. [sent-306, score-0.215]
97 6 Summary and Discussion. We proposed a transductive estimation algorithm which is a) simple, b) general, c) scalable, and d) works well when compared to the state-of-the-art algorithms applied to each specific problem. [sent-309, score-0.345]
98 Not only is it useful for classical binary and multiclass categorization problems but it also applies to ontologies and structured estimation problems. [sent-310, score-0.317]
99 It is not surprising that it performs very comparably to existing algorithms, since they can, in many cases, be seen as special instances of the general purpose distribution matching setting. [sent-311, score-0.193]
100 Early results for named entity recognition with conditional random fields, feature induction and web enhanced lexicons. [sent-423, score-0.363]
wordName wordTfidf (topN-words)
[('transduction', 0.634), ('transductive', 0.261), ('matching', 0.161), ('distmatch', 0.157), ('induction', 0.148), ('xi', 0.144), ('multiclass', 0.115), ('named', 0.113), ('japanese', 0.112), ('dmoz', 0.105), ('multiswitch', 0.105), ('entity', 0.102), ('margin', 0.101), ('training', 0.098), ('yi', 0.092), ('test', 0.088), ('crf', 0.085), ('chunking', 0.084), ('tags', 0.082), ('gpdistmatch', 0.079), ('rtrain', 0.079), ('clique', 0.075), ('nonconvex', 0.075), ('structured', 0.073), ('inductive', 0.069), ('chunk', 0.069), ('rtner', 0.069), ('ontology', 0.068), ('potentials', 0.064), ('ex', 0.061), ('kernel', 0.06), ('requiring', 0.057), ('risk', 0.057), ('np', 0.057), ('dc', 0.057), ('stochastic', 0.056), ('crfs', 0.056), ('binary', 0.053), ('bupa', 0.052), ('pageblock', 0.052), ('satimage', 0.052), ('score', 0.051), ('distributional', 0.051), ('xm', 0.05), ('width', 0.048), ('descent', 0.047), ('distance', 0.047), ('classi', 0.046), ('estimation', 0.046), ('anu', 0.046), ('rsise', 0.046), ('labels', 0.046), ('rbf', 0.042), ('sml', 0.042), ('precision', 0.042), ('objective', 0.041), ('distributions', 0.041), ('upper', 0.041), ('sigir', 0.04), ('heart', 0.04), ('repository', 0.04), ('regression', 0.039), ('pima', 0.039), ('multi', 0.039), ('split', 0.039), ('convex', 0.039), ('scalable', 0.038), ('canberra', 0.037), ('joining', 0.037), ('hilbert', 0.036), ('nicta', 0.036), ('balancing', 0.036), ('aside', 0.036), ('datasets', 0.035), ('reasoning', 0.035), ('loss', 0.035), ('streaming', 0.034), ('iris', 0.034), ('alex', 0.034), ('vehicle', 0.034), ('adult', 0.034), ('gradient', 0.033), ('entirely', 0.033), ('smola', 0.033), ('uci', 0.033), ('constraint', 0.032), ('scale', 0.032), ('availability', 0.032), ('sentences', 0.032), ('ionosphere', 0.032), ('special', 0.032), ('table', 0.031), ('di', 0.031), ('editors', 0.031), ('exclusively', 0.031), ('usps', 0.031), ('categorization', 0.03), ('base', 0.03), ('gretton', 0.03), ('zien', 0.03)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999923 72 nips-2009-Distribution Matching for Transduction
Author: Novi Quadrianto, James Petterson, Alex J. Smola
Abstract: Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach. 1
2 0.11106321 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition
Author: Ashish Kapoor, Eric Horvitz
Abstract: To date, the processes employed for active information acquisition during periods of learning and diagnosis have been considered as separate and have been applied in distinct phases of analysis. While active learning centers on the collection of information about training cases in order to build better predictive models, diagnosis uses fixed predictive models for guiding the collection of observations about a specific test case at hand. We introduce a model and inferential methods that bridge these phases of analysis into a holistic approach to information acquisition that considers simultaneously the extension of the predictive model and the probing of a case at hand. The bridging of active learning and real-time diagnostic feature acquisition leads to a new class of policies for learning and diagnosis. 1
3 0.10976627 57 nips-2009-Conditional Random Fields with High-Order Features for Sequence Labeling
Author: Nan Ye, Wee S. Lee, Hai L. Chieu, Dan Wu
Abstract: Dependencies among neighbouring labels in a sequence is an important source of information for sequence labeling problems. However, only dependencies between adjacent labels are commonly exploited in practice because of the high computational complexity of typical inference algorithms when longer distance dependencies are taken into account. In this paper, we show that it is possible to design efficient inference algorithms for a conditional random field using features that depend on long consecutive label sequences (high-order features), as long as the number of distinct label sequences used in the features is small. This leads to efficient learning algorithms for these conditional random fields. We show experimentally that exploiting dependencies using high-order features can lead to substantial performance improvements for some problems and discuss conditions under which high-order features can be effective. 1
4 0.1030855 82 nips-2009-Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification
Author: Amarnag Subramanya, Jeff A. Bilmes
Abstract: We prove certain theoretical properties of a graph-regularized transductive learning objective that is based on minimizing a Kullback-Leibler divergence based loss. These include showing that the iterative alternating minimization procedure used to minimize the objective converges to the correct solution and deriving a test for convergence. We also propose a graph node ordering algorithm that is cache cognizant and leads to a linear speedup in parallel computations. This ensures that the algorithm scales to large data sets. By making use of empirical evaluation on the TIMIT and Switchboard I corpora, we show this approach is able to outperform other state-of-the-art SSL approaches. In one instance, we solve a problem on a 120 million node graph. 1
5 0.10084178 87 nips-2009-Exponential Family Graph Matching and Ranking
Author: James Petterson, Jin Yu, Julian J. Mcauley, Tibério S. Caetano
Abstract: We present a method for learning max-weight matching predictors in bipartite graphs. The method consists of performing maximum a posteriori estimation in exponential families with sufficient statistics that encode permutations and data features. Although inference is in general hard, we show that for one very relevant application–document ranking–exact inference is efficient. For general model instances, an appropriate sampler is readily available. Contrary to existing max-margin matching models, our approach is statistically consistent and, in addition, experiments with increasing sample sizes indicate superior improvement over such models. We apply the method to graph matching in computer vision as well as to a standard benchmark dataset for learning document ranking, in which we obtain state-of-the-art results, in particular improving on max-margin variants. The drawback of this method with respect to max-margin alternatives is its runtime for large graphs, which is comparatively high. 1
6 0.092222877 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
7 0.091129124 71 nips-2009-Distribution-Calibrated Hierarchical Classification
8 0.08678478 129 nips-2009-Learning a Small Mixture of Trees
9 0.077432349 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine
10 0.076587379 236 nips-2009-Structured output regression for detection with partial truncation
11 0.076398239 47 nips-2009-Boosting with Spatial Regularization
12 0.075816043 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models
13 0.074207567 61 nips-2009-Convex Relaxation of Mixture Regression with Efficient Algorithms
14 0.073642969 157 nips-2009-Multi-Label Prediction via Compressed Sensing
15 0.072261982 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
16 0.072081201 229 nips-2009-Statistical Analysis of Semi-Supervised Learning: The Limit of Infinite Unlabelled Data
17 0.071977198 128 nips-2009-Learning Non-Linear Combinations of Kernels
18 0.071490437 168 nips-2009-Non-stationary continuous dynamic Bayesian networks
19 0.06953305 220 nips-2009-Slow Learners are Fast
20 0.069019102 98 nips-2009-From PAC-Bayes Bounds to KL Regularization
topicId topicWeight
[(0, -0.245), (1, 0.089), (2, -0.077), (3, 0.03), (4, -0.023), (5, -0.051), (6, -0.071), (7, 0.031), (8, -0.07), (9, 0.05), (10, 0.033), (11, -0.012), (12, -0.015), (13, -0.054), (14, 0.025), (15, 0.034), (16, 0.017), (17, -0.072), (18, 0.035), (19, -0.077), (20, 0.063), (21, -0.113), (22, -0.001), (23, 0.087), (24, 0.024), (25, 0.149), (26, 0.06), (27, 0.063), (28, 0.021), (29, -0.01), (30, -0.055), (31, -0.093), (32, -0.017), (33, 0.101), (34, -0.002), (35, 0.039), (36, -0.059), (37, -0.034), (38, -0.025), (39, -0.026), (40, -0.043), (41, 0.026), (42, -0.056), (43, 0.066), (44, 0.022), (45, 0.06), (46, -0.028), (47, 0.005), (48, -0.091), (49, -0.051)]
simIndex simValue paperId paperTitle
same-paper 1 0.94379324 72 nips-2009-Distribution Matching for Transduction
Author: Novi Quadrianto, James Petterson, Alex J. Smola
Abstract: Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach. 1
2 0.69450271 160 nips-2009-Multiple Incremental Decremental Learning of Support Vector Machines
Author: Masayuki Karasuyama, Ichiro Takeuchi
Abstract: We propose a multiple incremental decremental algorithm of Support Vector Machine (SVM). Conventional single incremental decremental SVM can update the trained model efficiently when single data point is added to or removed from the training set. When we add and/or remove multiple data points, this algorithm is time-consuming because we need to repeatedly apply it to each data point. The proposed algorithm is computationally more efficient when multiple data points are added and/or removed simultaneously. The single incremental decremental algorithm is built on an optimization technique called parametric programming. We extend the idea and introduce multi-parametric programming for developing the proposed algorithm. Experimental results on synthetic and real data sets indicate that the proposed algorithm can significantly reduce the computational cost of multiple incremental decremental operation. Our approach is especially useful for online SVM learning in which we need to remove old data points and add new data points in a short amount of time.
3 0.68816698 189 nips-2009-Periodic Step Size Adaptation for Single Pass On-line Learning
Author: Chun-nan Hsu, Yu-ming Chang, Hanshen Huang, Yuh-jye Lee
Abstract: It has been established that the second-order stochastic gradient descent (2SGD) method can potentially achieve generalization performance as well as empirical optimum in a single pass (i.e., epoch) through the training examples. However, 2SGD requires computing the inverse of the Hessian matrix of the loss function, which is prohibitively expensive. This paper presents Periodic Step-size Adaptation (PSA), which approximates the Jacobian matrix of the mapping function and explores a linear relation between the Jacobian and Hessian to approximate the Hessian periodically and achieve near-optimal results in experiments on a wide variety of models and tasks. 1
4 0.67516643 71 nips-2009-Distribution-Calibrated Hierarchical Classification
Author: Ofer Dekel
Abstract: While many advances have already been made in hierarchical classification learning, we take a step back and examine how a hierarchical classification problem should be formally defined. We pay particular attention to the fact that many arbitrary decisions go into the design of the label taxonomy that is given with the training data. Moreover, many hand-designed taxonomies are unbalanced and misrepresent the class structure in the underlying data distribution. We attempt to correct these problems by using the data distribution itself to calibrate the hierarchical classification loss function. This distribution-based correction must be done with care, to avoid introducing unmanageable statistical dependencies into the learning problem. This leads us off the beaten path of binomial-type estimation and into the unfamiliar waters of geometric-type estimation. In this paper, we present a new calibrated definition of statistical risk for hierarchical classification, an unbiased estimator for this risk, and a new algorithmic reduction from hierarchical classification to cost-sensitive classification.
5 0.67508477 49 nips-2009-Breaking Boundaries Between Induction Time and Diagnosis Time Active Information Acquisition
Author: Ashish Kapoor, Eric Horvitz
Abstract: To date, the processes employed for active information acquisition during periods of learning and diagnosis have been considered as separate and have been applied in distinct phases of analysis. While active learning centers on the collection of information about training cases in order to build better predictive models, diagnosis uses fixed predictive models for guiding the collection of observations about a specific test case at hand. We introduce a model and inferential methods that bridge these phases of analysis into a holistic approach to information acquisition that considers simultaneously the extension of the predictive model and the probing of a case at hand. The bridging of active learning and real-time diagnostic feature acquisition leads to a new class of policies for learning and diagnosis. 1
6 0.65825331 56 nips-2009-Conditional Neural Fields
7 0.62340868 75 nips-2009-Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models
8 0.6067757 15 nips-2009-A Rate Distortion Approach for Semi-Supervised Conditional Random Fields
9 0.59328866 130 nips-2009-Learning from Multiple Partially Observed Views - an Application to Multilingual Text Categorization
10 0.58935875 68 nips-2009-Dirichlet-Bernoulli Alignment: A Generative Model for Multi-Class Multi-Label Multi-Instance Corpora
11 0.58861989 106 nips-2009-Heavy-Tailed Symmetric Stochastic Neighbor Embedding
13 0.56959462 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
14 0.56558943 122 nips-2009-Label Selection on Graphs
15 0.56321067 26 nips-2009-Adaptive Regularization for Transductive Support Vector Machine
16 0.55990779 192 nips-2009-Posterior vs Parameter Sparsity in Latent Variable Models
17 0.54942399 47 nips-2009-Boosting with Spatial Regularization
18 0.54761535 97 nips-2009-Free energy score space
19 0.546574 129 nips-2009-Learning a Small Mixture of Trees
20 0.54101926 222 nips-2009-Sparse Estimation Using General Likelihoods and Non-Factorial Priors
topicId topicWeight
[(7, 0.021), (8, 0.197), (21, 0.021), (24, 0.062), (25, 0.048), (35, 0.056), (36, 0.142), (39, 0.05), (58, 0.094), (61, 0.029), (71, 0.071), (81, 0.015), (86, 0.098), (91, 0.024)]
simIndex simValue paperId paperTitle
same-paper 1 0.8467167 72 nips-2009-Distribution Matching for Transduction
Author: Novi Quadrianto, James Petterson, Alex J. Smola
Abstract: Many transductive inference algorithms assume that distributions over training and test estimates should be related, e.g. by providing a large margin of separation on both sets. We use this idea to design a transduction algorithm which can be used without modification for classification, regression, and structured estimation. At its heart we exploit the fact that for a good learner the distributions over the outputs on training and test sets should match. This is a classical two-sample problem which can be solved efficiently in its most general form by using distance measures in Hilbert Space. It turns out that a number of existing heuristics can be viewed as special cases of our approach. 1
2 0.83303368 37 nips-2009-Asymptotically Optimal Regularization in Smooth Parametric Models
Author: Percy Liang, Guillaume Bouchard, Francis R. Bach, Michael I. Jordan
Abstract: Many types of regularization schemes have been employed in statistical learning, each motivated by some assumption about the problem domain. In this paper, we present a unified asymptotic analysis of smooth regularizers, which allows us to see how the validity of these assumptions impacts the success of a particular regularizer. In addition, our analysis motivates an algorithm for optimizing regularization parameters, which in turn can be analyzed within our framework. We apply our analysis to several examples, including hybrid generative-discriminative learning and multi-task learning. 1
3 0.7483322 250 nips-2009-Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference
Author: Khashayar Rohanimanesh, Sameer Singh, Andrew McCallum, Michael J. Black
Abstract: Large, relational factor graphs with structure defined by first-order logic or other languages give rise to notoriously difficult inference problems. Because unrolling the structure necessary to represent distributions over all hypotheses has exponential blow-up, solutions are often derived from MCMC. However, because of limitations in the design and parameterization of the jump function, these samplingbased methods suffer from local minima—the system must transition through lower-scoring configurations before arriving at a better MAP solution. This paper presents a new method of explicitly selecting fruitful downward jumps by leveraging reinforcement learning (RL). Rather than setting parameters to maximize the likelihood of the training data, parameters of the factor graph are treated as a log-linear function approximator and learned with methods of temporal difference (TD); MAP inference is performed by executing the resulting policy on held out test data. Our method allows efficient gradient updates since only factors in the neighborhood of variables affected by an action need to be computed—we bypass the need to compute marginals entirely. Our method yields dramatic empirical success, producing new state-of-the-art results on a complex joint model of ontology alignment, with a 48% reduction in error over state-of-the-art in that domain. 1
4 0.74558789 129 nips-2009-Learning a Small Mixture of Trees
Author: M. P. Kumar, Daphne Koller
Abstract: The problem of approximating a given probability distribution using a simpler distribution plays an important role in several areas of machine learning, for example variational inference and classification. Within this context, we consider the task of learning a mixture of tree distributions. Although mixtures of trees can be learned by minimizing the KL-divergence using an EM algorithm, its success depends heavily on the initialization. We propose an efficient strategy for obtaining a good initial set of trees that attempts to cover the entire observed distribution by minimizing the α-divergence with α = ∞. We formulate the problem using the fractional covering framework and present a convergent sequential algorithm that only relies on solving a convex program at each iteration. Compared to previous methods, our approach results in a significantly smaller mixture of trees that provides similar or better accuracies. We demonstrate the usefulness of our approach by learning pictorial structures for face recognition.
5 0.74509126 132 nips-2009-Learning in Markov Random Fields using Tempered Transitions
Author: Ruslan Salakhutdinov
Abstract: Markov random fields (MRF’s), or undirected graphical models, provide a powerful framework for modeling complex dependencies among random variables. Maximum likelihood learning in MRF’s is hard due to the presence of the global normalizing constant. In this paper we consider a class of stochastic approximation algorithms of the Robbins-Monro type that use Markov chain Monte Carlo to do approximate maximum likelihood learning. We show that using MCMC operators based on tempered transitions enables the stochastic approximation algorithm to better explore highly multimodal distributions, which considerably improves parameter estimates in large, densely-connected MRF’s. Our results on MNIST and NORB datasets demonstrate that we can successfully learn good generative models of high-dimensional, richly structured data that perform well on digit and object recognition tasks.
6 0.74460441 260 nips-2009-Zero-shot Learning with Semantic Output Codes
7 0.74318868 169 nips-2009-Nonlinear Learning using Local Coordinate Coding
8 0.73828697 71 nips-2009-Distribution-Calibrated Hierarchical Classification
9 0.73793054 118 nips-2009-Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions
10 0.73681056 30 nips-2009-An Integer Projected Fixed Point Method for Graph Matching and MAP Inference
11 0.73677492 41 nips-2009-Bayesian Source Localization with the Multivariate Laplace Prior
12 0.73643732 104 nips-2009-Group Sparse Coding
13 0.73626196 18 nips-2009-A Stochastic approximation method for inference in probabilistic graphical models
14 0.73464358 217 nips-2009-Sharing Features among Dynamical Systems with Beta Processes
15 0.73395026 87 nips-2009-Exponential Family Graph Matching and Ranking
16 0.7338683 220 nips-2009-Slow Learners are Fast
17 0.73383844 27 nips-2009-Adaptive Regularization of Weight Vectors
18 0.73316586 3 nips-2009-AUC optimization and the two-sample problem
19 0.7324146 80 nips-2009-Efficient and Accurate Lp-Norm Multiple Kernel Learning
20 0.73239833 174 nips-2009-Nonparametric Latent Feature Models for Link Prediction