nips nips2012 nips2012-278 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams
Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1
Reference: text
sentIndex sentText sentNum sentScore
1 In categorical data there is often structure in the number of variables that take on each label. [sent-19, score-0.171]
2 In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. [sent-21, score-0.223]
3 When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. [sent-22, score-0.711]
4 1 Introduction When models contain multiple output variables, an important potential source of structure is the number of variables that take on a particular value. [sent-25, score-0.167]
5 For example, if we have binary variables indicating the presence or absence of a particular object class in an image, then the number of “present” objects may be highly structured, such as the number of digits in a zip code. [sent-26, score-0.238]
6 In ordinal regression problems there may be some prior knowledge about the proportion of outputs within each level. [sent-27, score-0.392]
7 One popular model for multiple output classification problems is logistic regression (LR), in which the class probabilities are modeled as being conditionally independent, given the features; another popular approach utilizes a softmax over the class outputs. [sent-29, score-0.286]
8 Both models can be seen as possessing a prior on the label counts: in the softmax model this prior is explicit, namely that exactly one label is active. [sent-30, score-0.415]
9 For LR, there is an implicit factorization in which there is a specific prior on counts; this prior is the source of computational tractability, but also imparts an inductive bias to the model. [sent-31, score-0.298]
10 The starting observation for our work is that we do not lose much efficiency by replacing the LR counts prior with a general prior, which permits the specification of a variety of inductive biases. [sent-32, score-0.339]
11 In this paper we present a probabilistic model of multiple output classification, the n-choose-k model, which incorporates a distribution over the label counts, and show that the computations needed for learning and inference in this model are efficient. [sent-33, score-0.311]
12 A maximum-likelihood version of the model can be used for problems such as multi-class recognition, in which the label counts are known at training time but only a prior distribution is known at test time. [sent-35, score-0.498]
13 The model easily extends to ordinal regression problems, such as ranking or collaborative filtering, in which each item is assigned to one of a small number of relevance levels. [sent-36, score-0.73]
14 We establish a connection between n-choose-k models and ranking objectives, and prove that optimal decision theoretic predictions under the model for “monotonic” gain functions (to be defined later), which include standard objectives used in ranking, can be achieved by a simple sorting operation. [sent-37, score-0.751]
15 In the following section, we will generalize to the case of ordinal variables. [sent-44, score-0.233]
16 The model output is a vector of D binary variables y ∈ Y = {0, 1}^D. [sent-46, score-0.199]
17 The generative procedure is then defined as follows: • Draw k from a prior distribution p(k) over counts k. [sent-55, score-0.271]
18 • Draw k variables to take on label 1, where the probability of choosing subset c is given by p(y_c = 1, y_c̄ = 0 | k) = exp{∑_{d∈c} θ_d} / Z_k(θ) if |c| = k, and 0 otherwise, (1) where θ = (θ_1, ..., θ_D) are parameters that determine individual variable biases towards being off or on. [sent-56, score-0.22]
19 Here Z_k(θ) = ∑_{y : ∑_d y_d = k} exp{∑_d θ_d y_d}. [sent-59, score-0.463]
20 Under this definition, Z_0 = 1 and p(0 | 0) = 1. [sent-60, score-0.463]
21 Logistic regression can be viewed as an instantiation of this model, with a “prior” distribution over count values that depends on parameters θ. [sent-62, score-0.191]
22 This is a forced interpretation, but it is useful in understanding the implicit prior over counts that is imposed when using LR. [sent-63, score-0.344]
23 Suppose we have a joint assignment of variables y with ∑_d y_d = k, and that p(k; θ) is the Poisson-Binomial distribution induced by θ; then p(y, k; θ) = p(k; θ) p(y | k; θ) = [Z_k(θ) / ∏_d (1 + exp{θ_d})] · [exp{∑_{d∈c} θ_d} / Z_k(θ)] = ∏_d exp{θ_d y_d} / (1 + exp{θ_d}), (2) which is exactly the logistic regression likelihood. [sent-65, score-0.988]
24 First, we explore treating p(k) as a prior in the Bayesian sense, using it to express prior knowledge about label counts; later we will explore learning p(k) using separate parameters from θ. [sent-68, score-0.313]
25 The log-likelihood is as follows: log p(y; θ) = log ∑_{k=0}^{D} p(k) p(y | k; θ) = log p(y | ∑_d y_d; θ) + κ (3) = ∑_d θ_d y_d − log Z_{∑_d y_d}(θ) + κ, (4) where κ is a constant that is independent of θ. [sent-75, score-1.389]
26 The partial derivatives take a standard log-sum-exp form, requiring expectations E_{p(y_d | k = ∑_d y_d)}[y_d]. [sent-77, score-0.502]
27 A naive computation of this expectation would require summing over all (D choose k) configurations with k = ∑_d y_d. [sent-78, score-0.463]
28 The basic idea is to compute partial sums along a chain that lays out variables yd in sequence. [sent-82, score-0.525]
29 These algorithms are quite general and can also be used to compute Zk values, incorporate prior distributions over count values, and draw a sample of y values conditional upon some k for the same computational cost [5]. [sent-84, score-0.332]
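As an illustration of this chain-structured computation, the following is a minimal Python sketch (ours, not code from the paper; the function name and the example θ values are invented). It computes Z_k(θ) for every k in O(D^2) time and, as a by-product, the Poisson-Binomial count prior implied by logistic regression.

    import numpy as np

    def partition_functions(theta):
        # Z[k] = sum over binary y with sum(y) = k of exp(sum_d theta_d * y_d),
        # accumulated by sweeping the variables y_d along a chain of partial sums.
        D = len(theta)
        Z = np.zeros(D + 1)
        Z[0] = 1.0                             # Z_0 = 1 (the empty configuration)
        for d in range(D):
            w = np.exp(theta[d])
            for k in range(d + 1, 0, -1):      # descend so each variable is counted once
                Z[k] += w * Z[k - 1]
        return Z

    theta = np.array([0.2, -1.0, 0.5, 1.3])    # hypothetical per-variable biases
    Z = partition_functions(theta)
    p_k = Z / np.prod(1.0 + np.exp(theta))     # implied Poisson-Binomial prior over counts
    print(Z, p_k, p_k.sum())                   # p_k sums to 1

The same sweep, combined with a prior p(k) or restricted to a fixed k, gives the other quantities mentioned above at the same cost.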
30 Optimal decision-theoretic predictions (i.e., those that maximize expected gain) can be made in several settings by a simple sorting procedure, and this will be our primary way of using the learned model. [sent-91, score-0.174]
31 To draw a joint sample of y values, we can begin by drawing k from p(k), then use the dynamic programming algorithm to draw a sample of y conditional on that k. [sent-93, score-0.236]
32 An alternative approach is to draw several samples of k from p(k), then for each sampled value, run dynamic programming to compute marginals. [sent-96, score-0.161]
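A minimal sketch of the conditional sampling step (ours, not the paper's code; θ, the count prior p_k, and the seed are made up): suffix partition functions Z_r(θ_d, ..., θ_D) are precomputed with the same kind of dynamic program, and y is then sampled one variable at a time.

    import numpy as np

    def sample_given_k(theta, k, rng):
        # Draw y ~ p(y | sum(y) = k), where p(y | k) is proportional to exp(theta . y).
        # S[d, r] = Z_r(theta_d, ..., theta_{D-1}) are suffix partition functions.
        D = len(theta)
        S = np.zeros((D + 1, k + 1))
        S[D, 0] = 1.0
        for d in range(D - 1, -1, -1):
            w = np.exp(theta[d])
            S[d, 0] = S[d + 1, 0]
            for r in range(1, k + 1):
                S[d, r] = S[d + 1, r] + w * S[d + 1, r - 1]
        y, r = np.zeros(D, dtype=int), k
        for d in range(D):
            if r == 0:
                break                          # no labels left to place
            p_on = np.exp(theta[d]) * S[d + 1, r - 1] / S[d, r]
            if rng.random() < p_on:
                y[d], r = 1, r - 1
        return y

    rng = np.random.default_rng(0)
    theta = np.array([0.2, -1.0, 0.5, 1.3])        # hypothetical biases
    p_k = np.array([0.1, 0.3, 0.4, 0.15, 0.05])    # hypothetical prior over counts k = 0..4
    k = rng.choice(len(p_k), p=p_k)                # draw k from p(k), then y | k
    print(k, sample_given_k(theta, k, rng))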
33 3 Ordinal n-Choose-k Model An extension of the binary n-choose-k model can be developed in the case of ordinal data, where we assume that labels y can take on one of R categorical labels, and where there is an inherent ordering to labels R > R − 1 > ... > 1. [sent-99, score-0.581]
34 Each label represents a relevance level in a learning-to-rank setting. [sent-102, score-0.307]
35 Let k_r represent the number of variables y that take on label r and define k = (k_R, ..., k_1). [sent-103, score-0.498]
36 The idea in the ordinal case is to define a joint model over the count variables k, and then to reduce the conditional distribution p(y | k) to a series of binary models. [sent-107, score-0.529]
37 • Repeat for r = R to 1: – Choose a set cr of kr unlabeled variables y ≤r and assign them relevance label r. [sent-113, score-0.582]
38 Choose subsets with probability equal to the following: p(y_{≤r,c_r} = 1, y_{≤r,c̄_r} = 0 | k_r) = exp{∑_{d∈c_r} θ_d} / Z_{r,k_r}(θ, y_{≤r}) if |c_r| = k_r, and 0 otherwise, (5) where we use the notation y_{≤r} to represent all variables that are given a relevance label less than or equal to r. [sent-114, score-0.806]
39 Note that if R = D and p(k) specifies that k_r = 1 for all r, then this process defines a Plackett-Luce (PL) [6, 7, 8] ranking model. [sent-116, score-0.509]
40 In this work, we focus on ranking with weak labels (R < D) which is more restrictive than modeling distributions over permutations [9], where learning would require marginalizing over all possible permutations consistent with the given labels. [sent-118, score-0.325]
41 In this setting, inference in the ordinal n-choose-k model is both exact and efficient. [sent-119, score-0.318]
42 3.1 Maximum Likelihood Learning Let k_r = ∑_d 1{y_d = r}. [sent-121, score-0.278]
43 log p(y; θ) = log ∑_{k∈K} p(k) p(y | k; θ) = ∑_{r=1}^{R} [ ∑_{d: y_d = r} θ_d − log Z_{r,k_r}(θ, y_{≤r}) ] + κ. (6) Here, we see that learning decomposes into the sum of R objectives that are of the same form as arise in the binary n-choose-k model. [sent-123, score-0.157]
44 3.2 Test-time Inference The test-time inference procedure in the ordinal model is similar to the binary case. [sent-127, score-0.386]
45 To draw samples of y, the main requirement is the ability to draw a joint sample of k from p(k). [sent-129, score-0.15]
46 It is also possible to efficiently draw a joint sample if the distribution over k takes the form p(k) = 1{∑_r k_r = D} · ∏_r p(k_r). [sent-131, score-0.353]
47 That is, there is an arbitrary but independent prior over each kr value, along with a single constraint that the chosen kr values sum to exactly D. [sent-132, score-0.653]
48 To do so, begin by using the binary algorithm to sample kR variables to take on value R. [sent-134, score-0.169]
49 Then remove the chosen variables from the set of possible variables, and sample kR−1 variables to take on value R − 1. [sent-135, score-0.163]
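To make this concrete, here is a small sketch of the ordinal sampler (ours; it reuses sample_given_k and the theta and rng variables from the earlier snippet, and the counts below are made up). Labels are assigned from R down to 1, each time running the binary sampler over the still-unlabelled items.

    def sample_ordinal(theta, counts, rng):
        # counts[r - 1] = k_r, the number of items that receive relevance label r.
        labels = np.zeros(len(theta), dtype=int)
        remaining = np.arange(len(theta))
        for r in range(len(counts), 0, -1):              # r = R, R-1, ..., 1
            pick = sample_given_k(theta[remaining], counts[r - 1], rng)
            labels[remaining[pick == 1]] = r
            remaining = remaining[pick == 0]
        return labels

    # k_1 = 2, k_2 = 1, k_3 = 1, summing to D = 4
    print(sample_ordinal(theta, counts=[2, 1, 1], rng=rng))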
50 The main motivation for the ordinal model is the learning to rank problem [10], so our main interest is in methods that do well under the task-specific evaluation measures that arise in ranking. [sent-138, score-0.54]
51 In Section 4.2, we show that we can make exact optimal decision-theoretic test-time predictions under the learning-to-rank gain functions without the need for sampling. [sent-140, score-0.361]
52 We formulate this task using a gain function, parameterized by a value K and a “scoring vector” t, which is assumed to be of the same dimension as y. [sent-143, score-0.243]
53 The gain function stipulates that K elements of y are chosen (assigning a score of zero if some other number is chosen), and assigns a reward for choosing each element of y based on t. [sent-144, score-0.243]
54 Specifically, the gain function is defined as follows: G_K(y, t) = ∑_d y_d t_d if ∑_d y_d = K, and 0 otherwise. (7) [sent-145, score-1.169]
55 The same gain can be used for Precision@K, in which case the number of nonzero values in t is unrestricted. [sent-146, score-0.282]
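For concreteness, the gain of equation (7) in code (an illustrative sketch; the variable names are ours):

    import numpy as np

    def top_k_gain(y, t, K):
        # G_K(y, t): reward for the chosen elements, but only if exactly K are chosen.
        return float(y @ t) if int(y.sum()) == K else 0.0

    y = np.array([0, 1, 0, 1])
    t = np.array([0.0, 1.0, 0.0, 0.0])
    print(top_k_gain(y, t, K=2))   # 1.0
    print(top_k_gain(y, t, K=1))   # 0.0, since two elements were chosen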
56 An interesting issue is what gain function should be used to train a model when the test-time evaluation metric is TKC or Precision@K. [sent-148, score-0.354]
57 Maximum likelihood training for TKC in the case of a single target class could correspond to a version of our n-choose-k model in which p(k) is a spike at k = 1; note that in this case the n-choose-k model is equivalent to a softmax over the output classes. [sent-149, score-0.262]
58 An alternative is to train using the same gain function used at test-time. [sent-150, score-0.285]
59 Here, we consider incorporating the TKC gain at training time for binary t with one nonzero entry, training the model to maximize expected gain. [sent-151, score-0.603]
60 Specifically, the objective is the following: E_p[G_K(y, t)] = ∑_k ∑_y p(k) p(y | k) 1{∑_d y_d = K} ∑_d y_d t_d = ∑_y p(K) p(y | K) y_{d∗}. (8) It becomes clear that this objective is equivalent to the marginal probability of y_{d∗} under a prior distribution that places all its mass on k = K. [sent-152, score-1.149]
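The marginal p(y_{d∗} = 1 | k = K) that this objective reduces to can be computed exactly with prefix and suffix partial sums, in the spirit of the dynamic program used for the binary model. A small sketch (ours; θ and t are made up):

    import numpy as np

    def marginals_given_k(theta, K):
        # p(y_d = 1 | sum(y) = K) for every d, via prefix/suffix partition functions.
        D, w = len(theta), np.exp(theta)
        P = np.zeros((D + 1, K + 1))     # P[d, a] = Z_a(theta_0 .. theta_{d-1})
        S = np.zeros((D + 1, K + 1))     # S[d, b] = Z_b(theta_d .. theta_{D-1})
        P[0, 0], S[D, 0] = 1.0, 1.0
        for d in range(D):
            P[d + 1, 0] = P[d, 0]
            for a in range(1, K + 1):
                P[d + 1, a] = P[d, a] + w[d] * P[d, a - 1]
        for d in range(D - 1, -1, -1):
            S[d, 0] = S[d + 1, 0]
            for b in range(1, K + 1):
                S[d, b] = S[d + 1, b] + w[d] * S[d + 1, b - 1]
        Z_K = P[D, K]
        return np.array([w[d] * sum(P[d, a] * S[d + 1, K - 1 - a] for a in range(K)) / Z_K
                         for d in range(D)])

    theta = np.array([0.2, -1.0, 0.5, 1.3])   # hypothetical scores
    t = np.array([0.0, 0.0, 1.0, 0.0])        # single nonzero entry at d* = 2
    mu = marginals_given_k(theta, K=1)
    print(float(t @ mu))                      # eq. (8) is p(K) times this quantity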
61 In Section 5.3, we empirically investigate training under expected gain versus training under maximum likelihood. [sent-154, score-0.411]
62 4.2 Optimal Decision-theoretic Predictions for Monotonic Gain Functions We now turn attention to gain functions defined on rankings of items. [sent-155, score-0.243]
63 Letting π be a permutation, we define a “monotonic” gain function as follows: Definition 1. [sent-156, score-0.243]
64 A gain function G(π, r) is a monotonic ranking gain if: • it can be expressed as ∑_{d=1}^{D} α_d f(r_{π_d}), where α_d is a weighting (or discount) term and π_d is the index of the item ranked in position d; • α_d ≥ α_{d+1} ≥ 0 for all d; and • f(r) ≥ f(r′) ≥ 0 for all r ≥ r′. [sent-157, score-0.875]
65 It is straightforward to see that popular learning-to-rank scoring functions like normalized discounted cumulative gain (NDCG) and Precision@K are monotonic ranking gains. [sent-158, score-0.588]
66 We define the Precision@K gain to be the fraction of documents in the top K produced ranks that have label R: P@K(π, r) = ∑_d 1{d ≤ K} 1{r_{π_d} = R}, so set α_d = 1{d ≤ K} and f(r) = 1{r = R}. [sent-160, score-0.44]
67 The expected gain under a monotonic ranking gain and an ordinal n-choose-k model is E_p[G(π)] = ∑_{y′∈Y} p(y′) ∑_{d=1}^{D} α_d f(y′_{π_d}) = ∑_{d=1}^{D} α_d ∑_{y′_{π_d}=1}^{R} f(y′_{π_d}) p(y_{π_d} = y′_{π_d}) = ∑_{d=1}^{D} α_d g_{π_d}, (9) where we have defined g_d = ∑_{r=1}^{R} f(r) p(y_d = r). [sent-161, score-1.138]
68 Consider two pairs of non-negative real numbers a_i, a_j and b_i, b_j where a_i ≥ a_j and b_i ≥ b_j; then a_i b_i + a_j b_j ≥ a_i b_j + a_j b_i, so pairing the larger a with the larger b can only increase the sum. [sent-177, score-0.394]
69 Under an ordinal n-choose-k model, the optimal decision theoretic predictions for a monotonic ranking gain are made by sorting θ values. [sent-180, score-1.011]
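The sorting rule is easy to state in code. The sketch below (ours) ranks items by θ and scores the result with one common NDCG@K convention (α_d = 1/log2(d + 1), f(r) = 2^r − 1); the exact discount and gain used in the paper's experiments may differ.

    import numpy as np

    def rank_by_scores(theta):
        # Optimal decision-theoretic ranking under a monotonic gain: sort by theta.
        return np.argsort(-theta)

    def ndcg_at_k(pi, rel, K):
        def dcg(order):
            gains = 2.0 ** rel[order[:K]] - 1.0
            discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
            return float(np.sum(gains * discounts))
        ideal = np.argsort(-rel)
        return dcg(pi) / max(dcg(ideal), 1e-12)

    theta = np.array([0.3, 2.1, -0.4, 1.0])   # hypothetical per-item scores
    rel = np.array([1, 2, 0, 2])              # relevance labels
    pi = rank_by_scores(theta)
    print(pi, ndcg_at_k(pi, rel, K=3))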
70 Figure 1: Four example images from the embedded MNIST dataset test set, along with the Poisson-Binomial distribution produced by logistic regression for each image. [sent-181, score-0.26]
71 The area marked in red has zero probability under the data distribution, but the logistic regression model is not flexible enough to model it. [sent-182, score-0.226]
72 To generate an image, we uniformly sample a count between 1 and 4, and then take that number of digit instances (with at most one instance per digit class) from the MNIST dataset and embed them in a 60 × 60 image. [sent-196, score-0.266]
73 We train a binary n-choose-k model on this dataset. [sent-201, score-0.147]
74 As a baseline, we trained a logistic regression classifier on the features and achieved a test-set negative loglikelihood (NLL) of 2. [sent-203, score-0.189]
75 In Figure 1, we show four test images, and the Poisson-Binomial distribution over counts that arises from the logistic regression model. [sent-206, score-0.357]
76 Here it is clear that the implicit count prior in LR is not powerful enough to model this data. [sent-208, score-0.299]
77 As a comparison, we trained a binary n-choose-k model in which we explicitly parameterize and learn an input-dependent prior. [sent-209, score-0.142]
78 The model learns the correct distribution over counts and achieves a test-set NLL of 1. [sent-210, score-0.211]
79 We show a visualization of the learned likelihood and prior parameters in the supplementary material. [sent-212, score-0.148]
80 We report on comparisons to other ranking approaches, using the seven datasets associated with the LETOR 3.0 collection. [sent-215, score-0.267]
81 For each dataset, we train an ordinal n-choose-k model to maximize the likelihood of the data, where each training example consists of a number of items, each assigned a particular relevance level; the number of levels ranges from 2 to 4 across the datasets. [sent-218, score-0.591]
82 At test time, we make what Section 4.2 shows to be the optimal decision-theoretic prediction under a ranking gain function: we simply sort the items for each test query based on their θ score values. [sent-239, score-0.639]
83 Note that this is a very simple ranking model, in that the score assigned to each test item by the model is a linear function of the input features, and the only hyperparameter to tune is an ℓ2 regularization strength. [sent-240, score-0.397]
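A sketch of this test-time procedure (ours; the weight vector and feature matrices are synthetic): each item is scored by a linear function of its features and the items of each query are sorted by that score.

    import numpy as np

    def rank_queries(features_by_query, w):
        # theta_d = w . x_d for each item d of a query; rank items by descending theta.
        return {qid: np.argsort(-(X @ w)) for qid, X in features_by_query.items()}

    rng = np.random.default_rng(1)
    w = rng.normal(size=5)                        # stand-in for learned weights
    queries = {"q1": rng.normal(size=(3, 5)),     # 3 items, 5 features
               "q2": rng.normal(size=(4, 5))}     # 4 items, 5 features
    print(rank_queries(queries, w))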
84 We hypothesize that proper probabilistic incorporation of weak labels helps to mitigate this effect to some degree. [sent-245, score-0.175]
85 We train binary n-choose-k models, experimenting with different training protocols that directly maximize expected gain under the model, as described in Section 4. [sent-250, score-0.495]
86 That is, we train on the expected top-K gain for different values of K. [sent-252, score-0.322]
87 We experimented on the embedded MNIST dataset where all but one label from each example was randomly removed, and on the Caltech-101 Silhouettes dataset [11], which consists of images of binarized silhouettes from 101 different categories. [sent-256, score-0.248]
88 On Caltech it is clear that training for the expected gain improves the corresponding test accuracy in this regime. [sent-260, score-0.351]
89 Table 1: Top-K classification results when various models are trained using an expected top-K gain and then tested using some possibly different top-K criterion. [sent-318, score-0.357]
90 Instead, we focus on work related to the main novelty in this paper, the explicit modeling of structure on label counts. [sent-323, score-0.153]
91 That is, given that we have prior knowledge of label count structure, or are modeling a domain that exhibits such structure, the question is how the structure can be leveraged to improve a model. [sent-324, score-0.379]
92 The first and most direct approach is the one that we take here: explicitly model the count structure within the model. [sent-325, score-0.239]
93 Similarly, [13] develops an example application where a cardinality-based term constrains the number of pixels that take on the label “foreground” in a foreground/background image segmentation task. [sent-328, score-0.264]
94 [14] develops models that include a penalty in the energy function for using more labels, which can be seen as a restricted form of structure over label cardinalities. [sent-329, score-0.217]
95 An alternative way of incorporating structure over counts into a model is via the gain function. [sent-330, score-0.522]
96 A different approach to including count information in the gain function comes from [16], which trains an image segmentation model so as to match count statistics present in the ground truth data. [sent-332, score-0.58]
97 Overall, the main difference between our work and these others is that we work in a proper probabilistic framework: maximizing likelihood, maximizing expected gain, and/or making proper decision-theoretic predictions at test time. [sent-335, score-0.248]
98 7 Discussion We have presented a flexible probabilistic model for multiple output variables that explicitly models structure in the number of variables taking on specific values. [sent-337, score-0.265]
99 Our theoretical contribution provides a link between this type of ordinal model and ranking problems, bridging the gap between the two tasks, and allowing the same model to be effective for several quite different problems. [sent-339, score-0.569]
100 Also, while we chose to take a maximum likelihood approach in this paper, the model is well suited to fully Bayesian inference using e. [sent-342, score-0.175]
wordName wordTfidf (topN-words)
[('yd', 0.463), ('kr', 0.278), ('gain', 0.243), ('ordinal', 0.233), ('ranking', 0.231), ('counts', 0.174), ('lr', 0.147), ('ndcg', 0.134), ('zk', 0.133), ('count', 0.129), ('letor', 0.128), ('tkc', 0.128), ('label', 0.119), ('monotonic', 0.114), ('prior', 0.097), ('logistic', 0.09), ('swersky', 0.085), ('top', 0.078), ('draw', 0.075), ('sorting', 0.072), ('relevance', 0.069), ('binary', 0.068), ('inductive', 0.068), ('softmax', 0.065), ('maximize', 0.065), ('develops', 0.064), ('digits', 0.064), ('emnist', 0.064), ('mnist', 0.063), ('caltech', 0.062), ('tarlow', 0.062), ('variables', 0.062), ('theoretic', 0.062), ('regression', 0.062), ('bj', 0.06), ('ordering', 0.06), ('proposition', 0.059), ('aj', 0.058), ('imagenet', 0.057), ('nll', 0.057), ('predictions', 0.056), ('cr', 0.054), ('assigned', 0.054), ('labels', 0.054), ('silhouettes', 0.052), ('dynamic', 0.052), ('likelihood', 0.051), ('zemel', 0.051), ('precision', 0.05), ('td', 0.05), ('objectives', 0.05), ('digit', 0.049), ('inference', 0.048), ('yj', 0.047), ('classi', 0.047), ('item', 0.044), ('objects', 0.044), ('proper', 0.043), ('embedded', 0.043), ('repeat', 0.043), ('image', 0.042), ('bi', 0.042), ('train', 0.042), ('frey', 0.041), ('exible', 0.041), ('training', 0.04), ('toronto', 0.04), ('weak', 0.04), ('take', 0.039), ('arise', 0.039), ('nonzero', 0.039), ('probabilistic', 0.038), ('objective', 0.038), ('propositions', 0.038), ('ep', 0.037), ('expected', 0.037), ('forced', 0.037), ('gk', 0.037), ('trained', 0.037), ('model', 0.037), ('ai', 0.037), ('marginals', 0.036), ('categorical', 0.036), ('implicit', 0.036), ('datasets', 0.036), ('reliability', 0.035), ('adams', 0.035), ('calls', 0.035), ('truncation', 0.034), ('images', 0.034), ('programming', 0.034), ('incorporating', 0.034), ('structure', 0.034), ('pl', 0.033), ('strength', 0.032), ('output', 0.032), ('issue', 0.032), ('criteria', 0.031), ('quite', 0.031), ('test', 0.031)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999964 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams
Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1
2 0.18213364 330 nips-2012-Supervised Learning with Similarity Functions
Author: Purushottam Kar, Prateek Jain
Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1
3 0.16228752 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
Author: Yanyan Lan, Jiafeng Guo, Xueqi Cheng, Tie-yan Liu
Abstract: This paper is concerned with the statistical consistency of ranking methods. Recently, it was proven that many commonly used pairwise ranking methods are inconsistent with the weighted pairwise disagreement loss (WPDL), which can be viewed as the true loss of ranking, even in a low-noise setting. This result is interesting but also surprising, given that the pairwise ranking methods have been shown very effective in practice. In this paper, we argue that the aforementioned result might not be conclusive, depending on what kind of assumptions are used. We give a new assumption that the labels of objects to rank lie in a rank-differentiable probability space (RDPS), and prove that the pairwise ranking methods become consistent with WPDL under this assumption. What is especially inspiring is that RDPS is actually not stronger than but similar to the low-noise setting. Our studies provide theoretical justifications of some empirical findings on pairwise ranking methods that are unexplained before, which bridge the gap between theory and applications.
4 0.14439751 365 nips-2012-Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding
Author: Philip Sterne, Joerg Bornschein, Abdul-saboor Sheikh, Joerg Luecke, Jacquelyn A. Shelton
Abstract: Modelling natural images with sparse coding (SC) has faced two main challenges: flexibly representing varying pixel intensities and realistically representing lowlevel image components. This paper proposes a novel multiple-cause generative model of low-level image statistics that generalizes the standard SC model in two crucial points: (1) it uses a spike-and-slab prior distribution for a more realistic representation of component absence/intensity, and (2) the model uses the highly nonlinear combination rule of maximal causes analysis (MCA) instead of a linear combination. The major challenge is parameter optimization because a model with either (1) or (2) results in strongly multimodal posteriors. We show for the first time that a model combining both improvements can be trained efficiently while retaining the rich structure of the posteriors. We design an exact piecewise Gibbs sampling method and combine this with a variational method based on preselection of latent dimensions. This combined training scheme tackles both analytical and computational intractability and enables application of the model to a large number of observed and hidden dimensions. Applying the model to image patches we study the optimal encoding of images by simple cells in V1 and compare the model’s predictions with in vivo neural recordings. In contrast to standard SC, we find that the optimal prior favors asymmetric and bimodal activity of simple cells. Testing our model for consistency we find that the average posterior is approximately equal to the prior. Furthermore, we find that the model predicts a high percentage of globular receptive fields alongside Gabor-like fields. Similarly high percentages are observed in vivo. Our results thus argue in favor of improvements of the standard sparse coding model for simple cells by using flexible priors and nonlinear combinations. 1
5 0.14378208 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
Author: Weiwei Cheng, Willem Waegeman, Volkmar Welker, Eyke Hüllermeier
Abstract: Several machine learning methods allow for abstaining from uncertain predictions. While being common for settings like conventional classification, abstention has been studied much less in learning to rank. We address abstention for the label ranking setting, allowing the learner to declare certain pairs of labels as being incomparable and, thus, to predict partial instead of total orders. In our method, such predictions are produced via thresholding the probabilities of pairwise preferences between labels, as induced by a predicted probability distribution on the set of all rankings. We formally analyze this approach for the Mallows and the Plackett-Luce model, showing that it produces proper partial orders as predictions and characterizing the expressiveness of the induced class of partial orders. These theoretical results are complemented by experiments demonstrating the practical usefulness of the approach. 1
6 0.13971069 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm
7 0.12296988 75 nips-2012-Collaborative Ranking With 17 Parameters
8 0.11889498 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
9 0.10731132 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
10 0.10123777 138 nips-2012-Fully Bayesian inference for neural models with negative-binomial spiking
11 0.095218368 30 nips-2012-Accuracy at the Top
12 0.094655097 165 nips-2012-Iterative ranking from pair-wise comparisons
13 0.091293335 111 nips-2012-Efficient Sampling for Bipartite Matching Problems
14 0.090237767 286 nips-2012-Random Utility Theory for Social Choice
15 0.087535828 252 nips-2012-On Multilabel Classification and Ranking with Partial Feedback
16 0.087281741 280 nips-2012-Proper losses for learning from partial labels
17 0.087229535 127 nips-2012-Fast Bayesian Inference for Non-Conjugate Gaussian Process Regression
18 0.086263858 200 nips-2012-Local Supervised Learning through Space Partitioning
19 0.085770421 186 nips-2012-Learning as MAP Inference in Discrete Graphical Models
20 0.085250489 249 nips-2012-Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison
topicId topicWeight
[(0, 0.257), (1, 0.04), (2, -0.06), (3, -0.013), (4, -0.027), (5, -0.094), (6, 0.028), (7, 0.168), (8, 0.023), (9, 0.093), (10, -0.117), (11, 0.085), (12, -0.034), (13, 0.087), (14, 0.076), (15, -0.018), (16, 0.058), (17, 0.007), (18, -0.005), (19, 0.01), (20, 0.037), (21, 0.026), (22, -0.046), (23, 0.14), (24, 0.059), (25, 0.037), (26, 0.056), (27, 0.055), (28, -0.066), (29, -0.004), (30, 0.005), (31, -0.182), (32, 0.143), (33, -0.06), (34, 0.039), (35, -0.091), (36, -0.089), (37, -0.089), (38, 0.041), (39, 0.034), (40, -0.011), (41, -0.009), (42, 0.034), (43, -0.031), (44, -0.021), (45, 0.053), (46, -0.015), (47, 0.032), (48, 0.022), (49, 0.139)]
simIndex simValue paperId paperTitle
same-paper 1 0.92417216 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams
Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1
2 0.76957983 141 nips-2012-GenDeR: A Generic Diversified Ranking Algorithm
Author: Jingrui He, Hanghang Tong, Qiaozhu Mei, Boleslaw Szymanski
Abstract: Diversified ranking is a fundamental task in machine learning. It is broadly applicable in many real world problems, e.g., information retrieval, team assembling, product search, etc. In this paper, we consider a generic setting where we aim to diversify the top-k ranking list based on an arbitrary relevance function and an arbitrary similarity function among all the examples. We formulate it as an optimization problem and show that in general it is NP-hard. Then, we show that for a large volume of the parameter space, the proposed objective function enjoys the diminishing returns property, which enables us to design a scalable, greedy algorithm to find the (1 − 1/e) near-optimal solution. Experimental results on real data sets demonstrate the effectiveness of the proposed algorithm.
3 0.76801413 169 nips-2012-Label Ranking with Partial Abstention based on Thresholded Probabilistic Models
Author: Weiwei Cheng, Willem Waegeman, Volkmar Welker, Eyke Hüllermeier
Abstract: Several machine learning methods allow for abstaining from uncertain predictions. While being common for settings like conventional classification, abstention has been studied much less in learning to rank. We address abstention for the label ranking setting, allowing the learner to declare certain pairs of labels as being incomparable and, thus, to predict partial instead of total orders. In our method, such predictions are produced via thresholding the probabilities of pairwise preferences between labels, as induced by a predicted probability distribution on the set of all rankings. We formally analyze this approach for the Mallows and the Plackett-Luce model, showing that it produces proper partial orders as predictions and characterizing the expressiveness of the induced class of partial orders. These theoretical results are complemented by experiments demonstrating the practical usefulness of the approach. 1
4 0.76485717 323 nips-2012-Statistical Consistency of Ranking Methods in A Rank-Differentiable Probability Space
Author: Yanyan Lan, Jiafeng Guo, Xueqi Cheng, Tie-yan Liu
Abstract: This paper is concerned with the statistical consistency of ranking methods. Recently, it was proven that many commonly used pairwise ranking methods are inconsistent with the weighted pairwise disagreement loss (WPDL), which can be viewed as the true loss of ranking, even in a low-noise setting. This result is interesting but also surprising, given that the pairwise ranking methods have been shown very effective in practice. In this paper, we argue that the aforementioned result might not be conclusive, depending on what kind of assumptions are used. We give a new assumption that the labels of objects to rank lie in a rank-differentiable probability space (RDPS), and prove that the pairwise ranking methods become consistent with WPDL under this assumption. What is especially inspiring is that RDPS is actually not stronger than but similar to the low-noise setting. Our studies provide theoretical justifications of some empirical findings on pairwise ranking methods that are unexplained before, which bridge the gap between theory and applications.
5 0.64639956 330 nips-2012-Supervised Learning with Similarity Functions
Author: Purushottam Kar, Prateek Jain
Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a “goodness” criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using “good” similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs. 1
6 0.58660358 75 nips-2012-Collaborative Ranking With 17 Parameters
7 0.57259375 286 nips-2012-Random Utility Theory for Social Choice
8 0.57178706 280 nips-2012-Proper losses for learning from partial labels
9 0.56280184 130 nips-2012-Feature-aware Label Space Dimension Reduction for Multi-label Classification
10 0.54485911 365 nips-2012-Why MCA? Nonlinear sparse coding with spike-and-slab prior for neurally plausible image encoding
11 0.53684133 22 nips-2012-A latent factor model for highly multi-relational data
12 0.53000361 228 nips-2012-Multilabel Classification using Bayesian Compressed Sensing
13 0.52001667 230 nips-2012-Multiple Choice Learning: Learning to Produce Multiple Structured Outputs
14 0.50104308 359 nips-2012-Variational Inference for Crowdsourcing
15 0.4879075 54 nips-2012-Bayesian Probabilistic Co-Subspace Addition
16 0.48401007 148 nips-2012-Hamming Distance Metric Learning
17 0.48325992 246 nips-2012-Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction
18 0.48215565 58 nips-2012-Bayesian models for Large-scale Hierarchical Classification
19 0.48043451 67 nips-2012-Classification Calibration Dimension for General Multiclass Losses
20 0.47957489 30 nips-2012-Accuracy at the Top
topicId topicWeight
[(0, 0.023), (38, 0.513), (39, 0.015), (42, 0.02), (54, 0.013), (55, 0.019), (74, 0.047), (76, 0.137), (80, 0.101), (92, 0.031)]
simIndex simValue paperId paperTitle
1 0.99577373 29 nips-2012-Accelerated Training for Matrix-norm Regularization: A Boosting Approach
Author: Xinhua Zhang, Dale Schuurmans, Yao-liang Yu
Abstract: Sparse learning models typically combine a smooth loss with a nonsmooth penalty, such as trace norm. Although recent developments in sparse approximation have offered promising solution methods, current approaches either apply only to matrix-norm constrained problems or provide suboptimal convergence rates. In this paper, we propose a boosting method for regularized learning that guarantees accuracy within O(1/ ) iterations. Performance is further accelerated by interlacing boosting with fixed-rank local optimization—exploiting a simpler local objective than previous work. The proposed method yields state-of-the-art performance on large-scale problems. We also demonstrate an application to latent multiview learning for which we provide the first efficient weak-oracle. 1
Author: Zhuo Wang, Alan Stocker, Daniel Lee
Abstract: In this work we study how the stimulus distribution influences the optimal coding of an individual neuron. Closed-form solutions to the optimal sigmoidal tuning curve are provided for a neuron obeying Poisson statistics under a given stimulus distribution. We consider a variety of optimality criteria, including maximizing discriminability, maximizing mutual information and minimizing estimation error under a general Lp norm. We generalize the Cramer-Rao lower bound and show how the Lp loss can be written as a functional of the Fisher Information in the asymptotic limit, by proving the moment convergence of certain functions of Poisson random variables. In this manner, we show how the optimal tuning curve depends upon the loss function, and the equivalence of maximizing mutual information with minimizing Lp loss in the limit as p goes to zero. 1
3 0.99387413 84 nips-2012-Convergence Rate Analysis of MAP Coordinate Minimization Algorithms
Author: Ofer Meshi, Amir Globerson, Tommi S. Jaakkola
Abstract: Finding maximum a posteriori (MAP) assignments in graphical models is an important task in many applications. Since the problem is generally hard, linear programming (LP) relaxations are often used. Solving these relaxations efficiently is thus an important practical problem. In recent years, several authors have proposed message passing updates corresponding to coordinate descent in the dual LP. However, these are generally not guaranteed to converge to a global optimum. One approach to remedy this is to smooth the LP, and perform coordinate descent on the smoothed dual. However, little is known about the convergence rate of this procedure. Here we perform a thorough rate analysis of such schemes and derive primal and dual convergence rates. We also provide a simple dual to primal mapping that yields feasible primal solutions with a guaranteed rate of convergence. Empirical evaluation supports our theoretical claims and shows that the method is highly competitive with state of the art approaches that yield global optima. 1
4 0.99224991 165 nips-2012-Iterative ranking from pair-wise comparisons
Author: Sahand Negahban, Sewoong Oh, Devavrat Shah
Abstract: The question of aggregating pairwise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR’s TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining ranking, finding ‘scores’ for each object (e.g. player’s rating) is of interest to understanding the intensity of the preferences. In this paper, we propose a novel iterative rank aggregation algorithm for discovering scores for objects from pairwise comparisons. The algorithm has a natural random walk interpretation over the graph of objects with edges present between two objects if they are compared; the scores turn out to be the stationary probability of this random walk. The algorithm is model independent. To establish the efficacy of our method, however, we consider the popular Bradley-Terry-Luce (BTL) model in which each object has an associated score which determines the probabilistic outcomes of pairwise comparisons between objects. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. This, in essence, leads to order-optimal dependence on the number of samples required to learn the scores well by our algorithm. Indeed, the experimental evaluation shows that our (model independent) algorithm performs as well as the Maximum Likelihood Estimator of the BTL model and outperforms a recently proposed algorithm by Ammar and Shah [1]. 1
same-paper 5 0.97980541 278 nips-2012-Probabilistic n-Choose-k Models for Classification and Ranking
Author: Kevin Swersky, Brendan J. Frey, Daniel Tarlow, Richard S. Zemel, Ryan P. Adams
Abstract: In categorical data there is often structure in the number of variables that take on each label. For example, the total number of objects in an image and the number of highly relevant documents per query in web search both tend to follow a structured distribution. In this paper, we study a probabilistic model that explicitly includes a prior distribution over such counts, along with a count-conditional likelihood that defines probabilities over all subsets of a given size. When labels are binary and the prior over counts is a Poisson-Binomial distribution, a standard logistic regression model is recovered, but for other count distributions, such priors induce global dependencies and combinatorics that appear to complicate learning and inference. However, we demonstrate that simple, efficient learning procedures can be derived for more general forms of this model. We illustrate the utility of the formulation by exploring applications to multi-object classification, learning to rank, and top-K classification. 1
6 0.97735977 268 nips-2012-Perfect Dimensionality Recovery by Variational Bayesian PCA
7 0.97423941 181 nips-2012-Learning Multiple Tasks using Shared Hypotheses
8 0.96429837 240 nips-2012-Newton-Like Methods for Sparse Inverse Covariance Estimation
9 0.96231186 143 nips-2012-Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins
10 0.96123028 208 nips-2012-Matrix reconstruction with the local max norm
11 0.92525989 285 nips-2012-Query Complexity of Derivative-Free Optimization
12 0.92422158 86 nips-2012-Convex Multi-view Subspace Learning
13 0.91978687 241 nips-2012-No-Regret Algorithms for Unconstrained Online Convex Optimization
14 0.91717261 15 nips-2012-A Polylog Pivot Steps Simplex Algorithm for Classification
15 0.91018081 324 nips-2012-Stochastic Gradient Descent with Only One Projection
16 0.91008651 17 nips-2012-A Scalable CUR Matrix Decomposition Algorithm: Lower Time Complexity and Tighter Bound
17 0.90657496 187 nips-2012-Learning curves for multi-task Gaussian process regression
18 0.90215158 178 nips-2012-Learning Label Trees for Probabilistic Modelling of Implicit Feedback
19 0.9011513 30 nips-2012-Accuracy at the Top
20 0.90012574 227 nips-2012-Multiclass Learning with Simplex Coding