jmlr jmlr2008 jmlr2008-69 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Guy Lebanon, Yi Mao
Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing
Reference: text
sentIndex sentText sentNum sentScore
1 School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN. Editor: Tommi Jaakkola. Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. [sent-5, score-0.773]
2 Keywords: ranked data, partially ordered sets, kernel smoothing 1. [sent-9, score-0.35]
3 Introduction. Rankers such as people, search engines, and classifiers output full or partial rankings representing preference relations over n items or alternatives. [sent-10, score-0.803]
4 For example, in the case of m = 6 rankers issuing full or partial preferences over n = 3 items, a possible data set is 3 ≻ 1 ≻ 2, 3 ≻ 2 ≻ 1, 1 ≻ 3 ≻ 2, 1 ≻ {2, 3}, 3 ≻ {1, 2}, {2, 3} ≻ 1 (1). [sent-11, score-0.499]
5 The first three expressions in (1) correspond to full rankings while the last three expressions correspond to partial rankings (the numbers correspond to items and the symbol ≻ corresponds to a preference relation). [sent-12, score-1.326]
6 While it is likely that some rankings will contradict others, it is natural to assume that the data in (1) was sampled iid from some distribution p over rankings. [sent-13, score-0.384]
7 Despite this motivating observation, modeling ranked data is less popular than modeling existing numeric scores, or even made-up numeric scores when the true scores are unavailable (as is the case with the frequently used Borda count). [sent-21, score-0.435]
8 The main reason for this is that rankings over a large number of items n reside in an extremely large discrete space whose modeling often requires intractable computation. [sent-22, score-0.599]
9 Previous attempts at modeling ranked data have been mostly parametric and often designed to work with fully ranked data (Marden, 1996). [sent-23, score-0.415]
10 Most of the aforementioned approaches are unsuitable for modeling partial rankings for medium and large n due to the computational difficulties of handling a probability space of size n! [sent-29, score-0.622]
11 The few possible exceptions (Critchlow, 1985; Marden, 1996) are usually more ad-hoc and do not correspond to an underlying permutation model, making them ill-suited to handle partial rankings of different types. [sent-31, score-0.689]
12 On the other hand, there has been a recent increase in data sets containing partial or full rankings for large n. [sent-33, score-0.603]
13 More details on how these data sets correspond to partial rankings may be found in Section 2. [sent-35, score-0.602]
14 (2) The resulting estimate p̂ should assign probabilities to full and partial rankings in a coherent and contradiction-free manner (described in Section 4). [sent-43, score-0.603]
15 (3) Estimate p based on partial rankings of different types (defined in Section 2). [sent-44, score-0.577]
16 (5) The statistical accuracy of p̂ may improve slowly for fully ranked data but should improve faster when restricted to simpler partial rankings. [sent-67, score-0.378]
17 (6) Obtaining the estimate p̂ and using it to compute probabilities p̂(A) of partial rankings should be computationally feasible, even for large n. [sent-68, score-0.577]
18 All 6 properties above are crucial in the large n scenario: it is often impossible for rankers to specify full rankings over a very large number of items, making the use of partial rankings a necessity. [sent-69, score-1.234]
19 Different rankers may choose to output partial rankings of different types; for example, one ranker can output 3 ≻ {1, 2} (3 is preferred to both 1 and 2) and another ranker can output {1, 3} ≻ 2 (both 1 and 3 are preferred to 2). [sent-70, score-0.704]
20 In this notation the numbers correspond to items and the locations of the items in their corresponding compartments correspond to their ranks. [sent-93, score-0.434]
21 The collection of all permutations of n items forms the non-Abelian symmetric group of order n, denoted by Sn, with function composition as the group operation πσ = π ◦ σ. [sent-94, score-0.344]
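To make the group structure concrete, the following minimal sketch (an illustration, not the paper's code) stores a permutation π as a tuple whose i-th entry is the 0-based rank π(i) assigned to item i, and implements composition and inversion; this representation convention is an assumption made here for the example.

```python
# Minimal sketch: elements of the symmetric group S_n as rank-assignment tuples.

def compose(pi, sigma):
    """Group operation: (pi sigma)(i) = pi(sigma(i))."""
    return tuple(pi[sigma[i]] for i in range(len(pi)))

def inverse(pi):
    """Inverse permutation: inverse(pi)[pi[i]] = i."""
    inv = [0] * len(pi)
    for i, r in enumerate(pi):
        inv[r] = i
    return tuple(inv)

if __name__ == "__main__":
    pi, sigma = (2, 0, 1), (0, 2, 1)
    print(compose(pi, sigma))                           # (2, 1, 0)
    print(compose(pi, inverse(pi)))                     # identity (0, 1, 2)
    print(compose(pi, sigma) == compose(sigma, pi))     # False: S_n is non-Abelian
```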
22 A reasonable solution is achieved by considering partial rankings which correspond to cosets of the symmetric group. [sent-107, score-0.701]
23 Figure 1 (diagram residue omitted; recoverable content: S1,1,2 π = {σ1 π, σ2 π} = 3|1|2, 4): a partial ranking corresponds to a coset, or set of permutations; the set of all permutations that fix the top k positions is denoted S1,...,1,n−k . [sent-109, score-0.79]
24 It may thus be interpreted as a partial ranking of the top k items that does not contain any information concerning the relative ranking of the bottom n − k items. [sent-123, score-0.571]
25 The set of all such partial rankings forms the quotient space Sn /S1,...,1,n−k . [sent-124, score-0.577]
26 Figure 1 illustrates the identification of a coset as a partial ranking of the top 2 out of 4 items. [sent-128, score-0.529]
27 We generalize the above relationship between partial rankings and cosets through the following definition of a composition. [sent-129, score-0.676]
28 A composition γ = (γ1 , . . . , γr ) corresponds to a partial ranking with γ1 items in the first position, γ2 items in the second position, and so on. [sent-138, score-0.707]
29 For such a partial ranking it is known that the first set of γ1 items are to be ranked before the second set of γ2 items, etc. [sent-139, score-0.892]
30 A partial ranking of type γ is equivalent to a coset Sγ π = {σπ : σ ∈ Sγ } and the set of such partial rankings forms the quotient space Sn /Sγ . [sent-164, score-1.076]
31 The vertical bar notation described above for permutations is particularly convenient for denoting partial rankings. [sent-165, score-0.411]
32 In this notation the items 1, . . . , n are written separated by vertical bars, indicating that items on the left side of each vertical bar are preferred to (ranked higher than) items on the right side of the bar. [sent-169, score-0.533]
33 For example, the partial ranking displayed in Figure 1 is denoted by 3|1|2, 4. [sent-171, score-0.367]
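As an illustration of the notation, the hypothetical helpers below (the function names and string format are choices made here, not the paper's) parse a vertical bar expression into compartments and enumerate the corresponding coset explicitly; the enumeration is only practical for very small compartments.

```python
from itertools import permutations, product

def parse_partial_ranking(s):
    """Parse '3|1|2,4' into compartments [[3], [1], [2, 4]], most preferred first."""
    return [[int(x) for x in block.split(",")] for block in s.split("|")]

def coset(compartments):
    """Yield every full ordering (tuple of items, best first) consistent with the partial ranking."""
    for choice in product(*(permutations(block) for block in compartments)):
        yield tuple(item for block in choice for item in block)

if __name__ == "__main__":
    pr = parse_partial_ranking("3|1|2,4")
    print(pr)                 # [[3], [1], [2, 4]]
    print(list(coset(pr)))    # [(3, 1, 2, 4), (3, 1, 4, 2)]
```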
34 The set of all partial rankings Wn =def {Sγ π : π ∈ Sn , ∀γ} (2), which includes the set of full rankings Sn , is a subset of all possible partial orders on {1, . . . , n}. [sent-173, score-1.254]
35 While the formalism of partial rankings in Wn cannot realize all partial orderings, it is sufficiently powerful to include many useful and naturally occurring orderings as special cases. [sent-177, score-0.801]
36 Special cases of particular interest are the following partial rankings: • π ∈ Sn corresponds to a permutation or a full ordering, for example, 3|2|4|1. [sent-179, score-0.69]
37 An example of such a ranking is a ranked list of the top k webpages output by search engines in response to a query. [sent-189, score-0.389]
38 An example of such a ranking is a selection of preferred and non-preferred items from a list. [sent-204, score-0.369]
39 One particular extension to partial rankings is to consider a partial ranking as censored data equivalent to the set of permutations in its related coset. [sent-237, score-1.125]
40 In other words, we define the probability the model assigns to the partial ranking Sγ π by ∑_{τ∈Sγ π} pκ (τ) = ψ^{−1}(c) ∑_{τ∈Sγ π} exp(−c d(τ, κ)). [sent-238, score-0.395]
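This censored data definition can be mirrored by brute force on a toy example. The sketch below is only an illustration, not the paper's implementation: it builds the Mallows model with Kendall's tau distance and sums it over the coset of a partial ranking. The enumeration is exponential in n, which is exactly the cost the closed forms developed later avoid.

```python
from itertools import combinations, permutations, product
from math import exp

def kendall_tau(x, y):
    """Number of item pairs that the full orderings x and y order differently."""
    px = {v: r for r, v in enumerate(x)}
    py = {v: r for r, v in enumerate(y)}
    return sum(1 for i, j in combinations(x, 2) if (px[i] - px[j]) * (py[i] - py[j]) < 0)

def mallows(kappa, c):
    """Mallows model p(tau) proportional to exp(-c d(tau, kappa)) over orderings of kappa's items."""
    scores = {tau: exp(-c * kendall_tau(tau, kappa)) for tau in permutations(sorted(kappa))}
    z = sum(scores.values())                    # the normalizer psi(c), here by brute force
    return {tau: s / z for tau, s in scores.items()}

def coset_probability(p, compartments):
    """Probability assigned to the partial ranking: sum of p over its coset (Equation 7)."""
    return sum(
        p[tuple(i for block in choice for i in block)]
        for choice in product(*(permutations(b) for b in compartments))
    )

if __name__ == "__main__":
    p = mallows(kappa=(1, 2, 3, 4), c=1.0)
    print(coset_probability(p, [[3], [1], [2, 4]]))   # probability of the partial ranking 3|1|2,4
```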
41 However, the apparent absence of a closed form formula for more general partial rankings prevented the widespread use of Equation 7 for large n and encouraged more ad-hoc and heuristic models (Critchlow, 1985; Marden, 1996). [sent-243, score-0.577]
42 Section 7 describes an efficient computational procedure for computing (7) for more general partial ranking types γ. [sent-244, score-0.367]
43 The Ranking Lattice. Partial rankings Sγ π relate to each other in a natural way by expressing more general, more specific, or inconsistent orderings. [sent-246, score-0.384]
44 We define below the concepts of partially ordered sets and lattices and then relate them to partial rankings by considering the set of partial rankings Wn as a lattice. [sent-247, score-1.286]
45 The set of partial rankings Wn defined in (2) is naturally endowed with the partial order of ranking refinement, that is, π ⪯ σ if π refines σ, or alternatively if we can get from π to σ by dropping vertical lines (Lebanon and Lafferty, 2003). [sent-256, score-1.008]
46 Using the vertical bar notation, two elements are inconsistent iff there exist two items i, j that appear on opposing sides of a vertical bar in x and y, that is, x = · · · i| j · · · while y = · · · j|i · · · . [sent-274, score-0.378]
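A sketch of the two lattice relations just described, assuming the compartment representation used in the earlier snippets: π refines σ when σ can be obtained from π by dropping vertical bars, and two partial rankings are inconsistent when some item pair is strictly ordered in opposite ways. The helper names are illustrative, not from the paper.

```python
def refines(fine, coarse):
    """True if the partial ranking `fine` can be coarsened into `coarse` by dropping bars."""
    it = iter(fine)
    for block in coarse:
        target, acc = set(block), set()
        while acc != target:
            nxt = set(next(it, ()))
            if not nxt or not nxt <= target - acc:
                return False
            acc |= nxt
    return next(it, None) is None       # every compartment of `fine` must be consumed

def strict_pairs(compartments):
    """Ordered pairs (i, j) meaning item i is ranked strictly above item j."""
    pairs, seen = set(), []
    for block in compartments:
        pairs |= {(i, j) for i in seen for j in block}
        seen.extend(block)
    return pairs

def inconsistent(x, y):
    """True if some pair of items appears on opposing sides of a vertical bar in x and y."""
    py = strict_pairs(y)
    return any((j, i) in py for (i, j) in strict_pairs(x))

if __name__ == "__main__":
    print(refines([[3], [1], [2], [4]], [[3], [1], [2, 4]]))   # True: drop the last bar
    print(inconsistent([[1], [2, 3]], [[3], [1, 2]]))          # True: 1 above 3 vs. 3 above 1
```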
47 Probabilistic Models on the Ranking Lattice. The ranking lattice is a convenient framework to define and study probabilistic models on partial rankings. [sent-299, score-0.459]
48 Interpreting partial rankings Sγ π ∈ W̃n as the disjoint union of the events defined by the coset Sγ π, we have that g(Sγ π) = ∑_{τ∈Sγ π} p(τ) (9), which may be interpreted as the probability under p of the disjoint union Sγ π of permutations. [sent-301, score-0.709]
49 We refer to the function g as the partial ranking or lattice version of p. [sent-302, score-0.459]
50 The function g is defined on the entire lattice, but when restricted to partial rankings of the same type, G = {Sγ π : π ∈ Sn } ⊂ W̃n , it constitutes a normalized probability distribution on G. [sent-307, score-0.577]
51 Figure 3 illustrates this problem for partial rankings with the same (left) and different (right) number of vertical bars. [sent-312, score-0.641]
52 In addition to this construction, which logically occurs after obtaining the estimator p̂, we also need to consider how to use partially ranked data in the process of obtaining the estimator p̂. [sent-315, score-0.401]
53 Instead, the inference needs to be conducted based on a set of partial rankings D = {Sγi πi : i = 1, . . . , m}. [sent-317, score-0.577]
54 Assuming uniformly random censoring in a parametric setting, we obtain the following observed likelihood with respect to the partially ranked data set D: ℓ(θ|D) = ∑_{i=1}^{m} log ∑_{σ∈Sγi πi} pθ (σ) + const. [sent-323, score-0.39]
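A toy sketch of this observed likelihood, using a made-up Mallows-style model centered at the identity ordering and a tiny hand-written data set of mixed partial ranking types; each observation contributes the log of the model mass of its coset. For realistic n the coset sums would be replaced by the closed forms of Section 7.

```python
from itertools import permutations, product
from math import exp, log

def inversions(tau):
    """Inversions of a full ordering tau (items listed best first), relative to 1|2|...|n."""
    return sum(1 for a in range(len(tau)) for b in range(a + 1, len(tau)) if tau[a] > tau[b])

def model(items, c):
    """Toy Mallows-style model centered at the identity: p(tau) ~ exp(-c * inversions(tau))."""
    scores = {tau: exp(-c * inversions(tau)) for tau in permutations(items)}
    z = sum(scores.values())
    return {tau: s / z for tau, s in scores.items()}

def coset_mass(p, compartments):
    """Model mass of a partial ranking: sum of p over its coset."""
    return sum(
        p[tuple(i for block in choice for i in block)]
        for choice in product(*(permutations(b) for b in compartments))
    )

def observed_log_likelihood(p, data):
    """Censored-data log-likelihood: each partial ranking contributes the log of its coset mass."""
    return sum(log(coset_mass(p, obs)) for obs in data)

if __name__ == "__main__":
    p = model(items=(1, 2, 3), c=0.5)
    data = [[[3], [1, 2]], [[1], [2], [3]], [[1], [2, 3]]]   # partial rankings of mixed types
    print(observed_log_likelihood(p, data))
```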
55 In the next section we explore in detail a non-parametric kernel smoothing alternative to estimating p and g based on partially ranked data. [sent-325, score-0.35]
56 Figure 3: Two partial rankings with the same (left) and different (right) number of vertical bars in the Hasse diagram of W̃n . [sent-332, score-0.778]
57 To avoid probabilistic contradictions, the values of g at two non-disjoint partial rankings Sγ π, Sλ σ cannot be specified in an independent manner. [sent-335, score-0.577]
58 Since the normalization term ψ does not depend on the location parameter (6), the kernel smoothing estimator for p is p̂(π) = (1/(m ψ(c))) ∑_{i=1}^{m} exp(−c d(π, πi )), π, πi ∈ Sn (11), assuming the data consists of complete rankings π1 , . . . , πm . [sent-341, score-0.522]
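A sketch of the estimator (11), assuming the closed-form Mallows normalizer ψ(c) = ∏_{j=1}^{n} (1 − e^{−jc})/(1 − e^{−c}); the data and smoothing parameter are made up, and the final brute-force check only confirms that the estimate sums to one on a tiny item set.

```python
from itertools import combinations, permutations
from math import exp, prod

def kendall_tau(x, y):
    """Kendall's tau distance: number of item pairs ordered differently by x and y."""
    px = {v: r for r, v in enumerate(x)}
    py = {v: r for r, v in enumerate(y)}
    return sum(1 for i, j in combinations(x, 2) if (px[i] - px[j]) * (py[i] - py[j]) < 0)

def psi(c, n):
    """Closed-form Mallows normalizer: prod_{j=1}^{n} (1 - e^{-jc}) / (1 - e^{-c})."""
    q = exp(-c)
    return prod((1 - q ** j) / (1 - q) for j in range(1, n + 1))

def kernel_estimate(pi, data, c):
    """Equation 11: p_hat(pi) = (1 / (m psi(c))) * sum_i exp(-c d(pi, pi_i))."""
    return sum(exp(-c * kendall_tau(pi, pi_i)) for pi_i in data) / (len(data) * psi(c, len(pi)))

if __name__ == "__main__":
    data = [(3, 1, 2, 4), (3, 2, 1, 4), (1, 3, 2, 4)]   # made-up full rankings, best item first
    p_hat = {pi: kernel_estimate(pi, data, c=1.0) for pi in permutations((1, 2, 3, 4))}
    print(sum(p_hat.values()))                          # approximately 1.0
    print(max(p_hat, key=p_hat.get))                    # mode of the estimate
```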
59 The lattice or partial ranking version ĝ corresponding to p̂ in (12) is ĝ(Sλ π) = (1/(m ψ(c))) ∑_{i=1}^{m} (1/|Sγi |) ∑_{κ∈Sλ π} ∑_{τ∈Sγi πi} exp(−c d(κ, τ)), Sλ π ∈ W̃n . [sent-350, score-0.459]
60 The set appearing in the definition of b^γ_{kl}(τ) contains pairs (s, t) that are inversions of τ and for which s and t appear in the l and k compartments of γ, respectively. [sent-366, score-0.363]
61 (15) Proof: ∑_{τ∈Sγ π} q^{∑_{k=1}^{r} ak(τ) + ∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(τ)} = q^{∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(π)} ∑_{τ∈Sγ π} q^{∑_{k=1}^{r} ak(τ)} = q^{∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(π)} ∏_{s=1}^{r} ∑_{τ∈Sγs} q^{i(τ)} = q^{∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(π)} ∏_{s=1}^{r} ∏_{j=1}^{γs−1} ∑_{k=0}^{j} q^{k}. [sent-380, score-1.09]
62 The remaining terms depend only on the partial ranking type γ and thus may be pre-computed and tabulated for efficient computation. [sent-383, score-0.367]
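The factorization above can be spot-checked numerically for a small composition. The sketch below verifies that ∑_{τ∈Sγ π} q^{i(τ)} equals q^{m0} ∏_s ∏_{j=1}^{γs−1} ∑_{k=0}^{j} q^k, where m0 is taken here to be the minimal inversion count over the coset (which coincides with the constant between-compartment term); the composition, q, and random permutation are arbitrary choices made for the example.

```python
from itertools import permutations, product
from random import sample, seed

def inversions(tau):
    """Inversions of a full ordering tau (items listed best first) w.r.t. natural item order."""
    return sum(1 for a in range(len(tau)) for b in range(a + 1, len(tau)) if tau[a] > tau[b])

def coset(compartments):
    for choice in product(*(permutations(b) for b in compartments)):
        yield tuple(i for block in choice for i in block)

def q_factorial(m, q):
    """[m]_q! = prod_{j=1}^{m-1} (1 + q + ... + q^j)."""
    out = 1.0
    for j in range(1, m):
        out *= sum(q ** k for k in range(j + 1))
    return out

if __name__ == "__main__":
    seed(0)
    q, gamma = 0.7, (2, 1, 3)                      # a composition of n = 6
    items = sample(range(1, 7), 6)                 # a random full ranking pi of 6 items
    blocks, start = [], 0
    for g in gamma:                                # split pi into compartments of sizes gamma
        blocks.append(list(items[start:start + g]))
        start += g
    lhs = sum(q ** inversions(tau) for tau in coset(blocks))
    m0 = min(inversions(tau) for tau in coset(blocks))
    rhs = q ** m0
    for g in gamma:
        rhs *= q_factorial(g, q)
    print(lhs, rhs)                                # the two values agree
```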
63 Corollary 2 The partial ranking version g corresponding to the Mallows kernel pκ is pκ(Sγ π) = [∏_{s=1}^{r} ∏_{j=1}^{γs−1} ∑_{k=0}^{j} e^{−kc} / ∏_{j=1}^{n−1} ∑_{k=0}^{j} e^{−kc}] e^{−c ∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(πκ^{−1})} ∝ e^{−c ∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(πκ^{−1})}. [sent-388, score-0.581]
64 The complexity of computing (16), (12), (13) for some popular partial ranking types appears in Table 1. [sent-396, score-0.367]
65 The term ψ(2c)/ψ²(c) is bounded since it may be written as a product ∏_{j=1}^{n} Rj(c), with Rj(c) = [(1 + e^{−jc})/(1 − e^{−jc})] / [(1 + e^{−c})/(1 − e^{−c})] ≤ 1 for all c ∈ R+ and j ≥ 1, since the function (1 + ε)/(1 − ε) increases with ε > 0. [sent-426, score-0.454]
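The reconstructed product formula for ψ(2c)/ψ²(c) can be checked numerically, assuming the closed form ψ(c) = ∏_{j=1}^{n} (1 − e^{−jc})/(1 − e^{−c}); the snippet below also confirms that each factor Rj(c) is at most one. The values of n and c are arbitrary.

```python
from math import exp, prod

def psi(c, n):
    """Closed-form Mallows normalizer psi(c) = prod_{j=1}^{n} (1 - e^{-jc}) / (1 - e^{-c})."""
    q = exp(-c)
    return prod((1 - q ** j) / (1 - q) for j in range(1, n + 1))

def R(j, c):
    """R_j(c) = ((1 + e^{-jc}) / (1 - e^{-jc})) / ((1 + e^{-c}) / (1 - e^{-c}))."""
    return ((1 + exp(-j * c)) / (1 - exp(-j * c))) / ((1 + exp(-c)) / (1 - exp(-c)))

if __name__ == "__main__":
    n, c = 8, 0.7
    print(psi(2 * c, n) / psi(c, n) ** 2, prod(R(j, c) for j in range(1, n + 1)))  # equal
    print(all(R(j, c) <= 1 for j in range(1, n + 1)))                              # True
```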
66 Based on Proposition 7 and Equations (17)-(18), |bias(p̂(π))| ≤ M ∑_{j=1}^{n} j e^{−jc}/(1 − e^{−jc}) ≤ M n e^{−c}/(1 − e^{−c}), and Var(p̂(π)) is bounded by an expression involving p(π) ψ(2c)/ψ²(c), M, and n e^{−c}/(1 − e^{−c}). [sent-427, score-0.454]
67 For large n, it is often the case that partial, rather than full, rankings are available for estimating p. [sent-437, score-0.384]
68 Furthermore, in many cases, rankers can make some partial ranking assertions with certainty but do not have a clear opinion on other preferences. [sent-439, score-0.444]
69 The censored data interpretation enables efficient use of partially ranked data of multiple types in the estimation process (12). [sent-440, score-0.639]
70 Statistically, expressing partially ranked data as censored data has the effect of increased smoothing, and therefore reduces the variance while increasing the bias. [sent-441, score-0.415]
71 A consequence of this proposition, which is also illustrated experimentally in Section 9, is that even if the fully ranked data is somehow available, estimating p̂ based on the partial rankings obtained by censoring it tends to increase the estimation accuracy. [sent-443, score-0.937]
72 Proposition 9 Assuming the same conditions as in Proposition 7, the bias and variance of the censored data or partial ranking estimator (12) for γ1 = … [sent-444, score-0.578]
73 Contrasting the expressions in Proposition 9 with those in Proposition 7 indicates that reverting to partial rankings tends to increase the bias but reduce the variance. [sent-459, score-0.665]
74 The precise changes in the bias and variance that occur due to using partial rankings depend on γ, n, m, c, M. [sent-465, score-0.658]
75 Indeed, in the common case described earlier where the number of items n is large, switching to partially ranked data can dramatically improve the estimation accuracy. [sent-467, score-0.445]
76 It is remarkable that this statistical motivation to use partial rather than full rankings is aligned with the data availability and ease of use as well as with the computational efficiency demonstrated in the previous section. [sent-469, score-0.603]
77 Figure 5: Values of sp(Sγ)/|Sγ| (top row) and their base-2 logarithms (bottom row) for n = 15 and various partial ranking types. [sent-493, score-0.367]
78 A popular example is collaborative filtering, which is the task of recommending items to a user based on partial preference information output by that user (Resnick et al. [sent-499, score-0.393]
79 Given a particular partial ranking Sγ π output by a certain user, we can predict its most likely refinement arg max_{Sλ σ} p(Sλ σ | Sγ π). [sent-502, score-0.367]
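A toy sketch of this prediction rule: estimate p̂ from a handful of made-up full rankings with the kernel estimator, then return the most probable full ordering consistent with the user's partial ranking. Restricting the refinement to full orderings and enumerating Sn by brute force are simplifications made here for illustration.

```python
from itertools import combinations, permutations
from math import exp, prod

def kendall_tau(x, y):
    px = {v: r for r, v in enumerate(x)}
    py = {v: r for r, v in enumerate(y)}
    return sum(1 for i, j in combinations(x, 2) if (px[i] - px[j]) * (py[i] - py[j]) < 0)

def kernel_estimate(data, c):
    """Kernel smoothing estimate over all full orderings of the observed items (Equation 11)."""
    items = sorted(data[0])
    n, m, q = len(items), len(data), exp(-c)
    psi = prod((1 - q ** j) / (1 - q) for j in range(1, n + 1))
    return {pi: sum(exp(-c * kendall_tau(pi, d)) for d in data) / (m * psi)
            for pi in permutations(items)}

def consistent(pi, compartments):
    """True if the full ordering pi refines the partial ranking given as compartments."""
    rank = {item: r for r, item in enumerate(pi)}
    seen = []
    for block in compartments:
        if any(rank[j] < rank[i] for i in seen for j in block):
            return False
        seen.extend(block)
    return True

def most_likely_refinement(p_hat, compartments):
    """arg max over full orderings consistent with the user's partial ranking."""
    return max((pi for pi in p_hat if consistent(pi, compartments)), key=p_hat.get)

if __name__ == "__main__":
    data = [(3, 1, 2, 4), (3, 2, 1, 4), (1, 3, 2, 4), (3, 1, 4, 2)]   # made-up training rankings
    p_hat = kernel_estimate(data, c=1.0)
    print(most_likely_refinement(p_hat, [[3], [1, 2, 4]]))            # user only stated 3 over {1, 2, 4}
```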
80 The first is the APA data set (Diaconis, 1989), which contains several thousand rankings of 5 APA presidential candidates. [sent-508, score-0.384]
81 The second is the Jester data set containing rankings of 100 jokes by 73,496 users. [sent-509, score-0.428]
82 The third is the EachMovie data set, containing rankings of 1628 movies by 72,916 users. [sent-510, score-0.412]
83 Due to the computational difficulty associated with maximum likelihood for the Mallows model for large n, we experimented with rankings over a small number of items. [sent-516, score-0.411]
84 The vertices of the permutation polytope, displayed in Figure 7, correspond to S4 and its edges correspond to pairs of permutations with Kendall’s tau distance 1. [sent-523, score-0.353]
85 Figure 8 demonstrates non-parametric modeling of partial rankings for n = 100 (the Mallows model maximum likelihood estimator cannot be computed for such n). [sent-530, score-0.712]
86 We used 10043 rankings from the Jester data set in which users ranked all n = 100 jokes. [sent-531, score-0.558]
87 … set log-likelihood with respect to the lattice version ĝ(Sγ π) of the non-parametric estimator p̂ for partial rankings γ = (5, n − 5) (top) and γ = (1, 1, 1, n − 3) (bottom). [sent-557, score-0.522]
88 The variance reduction by (k, n − k) partial rankings clearly outweighs the bias increase. [sent-566, score-0.658]
89 Similarly, it is typically the case for large n that both the data available for estimating p̂ and the use of p̂ will be restricted to partial rankings or cosets of the symmetric group. [sent-569, score-0.676]
90 Attempts to define a probabilistic model directly on multiple types of partial rankings H ⊂ W̃n face a challenging problem of preventing probabilistic contradictions. [sent-570, score-0.577]
91 A simple solution is to define the partial ranking model ĝ in terms of a permutation model p̂ through the mechanism of Möbius inversion and the censored data interpretation. [sent-571, score-0.593]
92 We also examine the effect of using partial, rather than full, rankings on the bias and variance of the estimator. [sent-591, score-0.465]
93 Complexity Issues. Table 1 lists the computational complexity results for computing (13) for some popular partial ranking types γ and λ. [sent-594, score-0.367]
94 [Appendix derivation residue: the individual product formulas over factors (1 − e^{−jc}) are not recoverable from the extraction.] [sent-630, score-0.454]
96 Substituting the above results into Equation 16, we get a closed-form expression for ψ^{−1}(c) |Sγ|^{−1} ∑_{σ∈Sλ π1} ∑_{τ∈Sγ π2} e^{−c d(σ,τ)}, built from the factor e^{−c|A||B|} and ratios of products of terms (1 − e^{−jc}) over index ranges determined by |A|, |Ā|, |B|, k and t. [sent-633, score-0.454]
99 Note |A| ≤ min(k, t), |Ā| ≤ k and |B| ≤ t; therefore the above expression takes O(k + t) to evaluate. [sent-646, score-0.681]
100 [Appendix derivation residue: a q-series identity involving q^{k(k+1)/2}, nested sums over a_k, and q^{b^γ_{12}(π)}.] [sent-658, score-0.567]
wordName wordTfidf (topN-words)
[('rankings', 0.384), ('mallows', 0.318), ('sn', 0.269), ('jc', 0.227), ('partial', 0.193), ('bkl', 0.187), ('ranked', 0.185), ('ranking', 0.174), ('items', 0.17), ('anked', 0.154), ('ebanon', 0.154), ('odeling', 0.154), ('wn', 0.136), ('coset', 0.132), ('inversions', 0.132), ('ak', 0.125), ('sp', 0.115), ('permutations', 0.114), ('bab', 0.112), ('ao', 0.107), ('tau', 0.102), ('cosets', 0.099), ('lattice', 0.092), ('kendall', 0.092), ('qk', 0.092), ('qi', 0.09), ('partially', 0.09), ('censoring', 0.088), ('kc', 0.088), ('permutation', 0.087), ('proposition', 0.087), ('poset', 0.077), ('rankers', 0.077), ('hasse', 0.075), ('def', 0.074), ('inversion', 0.072), ('censored', 0.067), ('qa', 0.067), ('vertical', 0.064), ('estimator', 0.063), ('composition', 0.06), ('bias', 0.056), ('stanley', 0.056), ('apa', 0.055), ('compartment', 0.055), ('eachmovie', 0.055), ('fligner', 0.055), ('jester', 0.055), ('cd', 0.05), ('lebanon', 0.05), ('mum', 0.05), ('numeric', 0.05), ('smoothing', 0.048), ('modeling', 0.045), ('compartments', 0.044), ('critchlow', 0.044), ('jokes', 0.044), ('diagram', 0.042), ('lattices', 0.042), ('psfrag', 0.042), ('bar', 0.04), ('contradictions', 0.037), ('subgroup', 0.037), ('item', 0.036), ('var', 0.034), ('replacements', 0.033), ('preferences', 0.033), ('bius', 0.033), ('bxy', 0.033), ('disconcordant', 0.033), ('marden', 0.033), ('qak', 0.033), ('expressions', 0.032), ('orderings', 0.031), ('rated', 0.031), ('visualizing', 0.031), ('scores', 0.03), ('top', 0.03), ('preference', 0.03), ('bars', 0.029), ('movies', 0.028), ('verducci', 0.028), ('exp', 0.028), ('lipschitz', 0.028), ('likelihood', 0.027), ('continuity', 0.027), ('kernel', 0.027), ('qn', 0.027), ('full', 0.026), ('correspond', 0.025), ('mode', 0.025), ('preferred', 0.025), ('variance', 0.025), ('summaries', 0.023), ('summations', 0.023), ('tj', 0.023), ('enables', 0.022), ('compositions', 0.022), ('diaconis', 0.022), ('guy', 0.022)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999982 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
Author: Guy Lebanon, Yi Mao
Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing
2 0.11638141 9 jmlr-2008-Active Learning by Spherical Subdivision
Author: Falk-Florian Henrich, Klaus Obermayer
Abstract: We introduce a computationally feasible, “constructive” active learning method for binary classification. The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. In order to reduce computational complexity the version space is restricted to spherical simplices and learning procedes by subdividing the edges of maximal length. We show that this procedure optimally reduces a tight upper bound on the generalization error. The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. An upper bound is provided for the probability of disagreement between classifiers (hence the generalization error) for non-constant data densities on the sphere. The emphasis of this work lies on providing mathematically exact performance estimates for active learning strategies. Keywords: active learning, spherical subdivision, error bounds, simplex halving
3 0.071102753 80 jmlr-2008-Ranking Individuals by Group Comparisons
Author: Tzu-Kuo Huang, Chih-Jen Lin, Ruby C. Weng
Abstract: This paper proposes new approaches to rank individuals from their group comparison results. Many real-world problems are of this type. For example, ranking players from team comparisons is important in some sports. In machine learning, a closely related application is classification using coding matrices. Group comparison results are usually in two types: binary indicator outcomes (wins/losses) or measured outcomes (scores). For each type of results, we propose new models for estimating individuals’ abilities, and hence a ranking of individuals. The estimation is carried out by solving convex minimization problems, for which we develop easy and efficient solution procedures. Experiments on real bridge records and multi-class classification demonstrate the viability of the proposed models. Keywords: ranking, group comparison, binary/scored outcomes, Bradley-Terry model, multiclass classification
4 0.051908199 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
5 0.050147001 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
Author: Bernadetta Tarigan, Sara A. van de Geer
Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy
6 0.038386889 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
7 0.034686703 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
8 0.034030579 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
9 0.033426538 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers
10 0.033418491 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
11 0.033237901 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
12 0.031048724 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
13 0.030735658 22 jmlr-2008-Closed Sets for Labeled Data
14 0.030182244 96 jmlr-2008-Visualizing Data using t-SNE
15 0.029302787 52 jmlr-2008-Learning from Multiple Sources
16 0.028825991 56 jmlr-2008-Magic Moments for Structured Output Prediction
17 0.027434509 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
18 0.02710798 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
19 0.025994135 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
20 0.025865078 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective
topicId topicWeight
[(0, 0.146), (1, -0.028), (2, -0.039), (3, -0.014), (4, -0.005), (5, -0.054), (6, 0.025), (7, -0.062), (8, 0.004), (9, -0.068), (10, -0.005), (11, -0.024), (12, -0.16), (13, 0.047), (14, 0.241), (15, -0.003), (16, 0.04), (17, 0.024), (18, -0.007), (19, 0.113), (20, 0.007), (21, -0.298), (22, -0.419), (23, -0.205), (24, -0.116), (25, 0.03), (26, 0.122), (27, -0.144), (28, 0.1), (29, -0.145), (30, 0.126), (31, -0.084), (32, -0.103), (33, -0.105), (34, -0.043), (35, 0.034), (36, 0.107), (37, 0.023), (38, -0.074), (39, 0.044), (40, -0.0), (41, 0.033), (42, 0.002), (43, -0.07), (44, 0.068), (45, -0.072), (46, 0.044), (47, 0.021), (48, 0.085), (49, 0.039)]
simIndex simValue paperId paperTitle
same-paper 1 0.97427642 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
Author: Guy Lebanon, Yi Mao
Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing
2 0.64512122 9 jmlr-2008-Active Learning by Spherical Subdivision
Author: Falk-Florian Henrich, Klaus Obermayer
Abstract: We introduce a computationally feasible, “constructive” active learning method for binary classification. The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. In order to reduce computational complexity the version space is restricted to spherical simplices and learning procedes by subdividing the edges of maximal length. We show that this procedure optimally reduces a tight upper bound on the generalization error. The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. An upper bound is provided for the probability of disagreement between classifiers (hence the generalization error) for non-constant data densities on the sphere. The emphasis of this work lies on providing mathematically exact performance estimates for active learning strategies. Keywords: active learning, spherical subdivision, error bounds, simplex halving
3 0.38881654 80 jmlr-2008-Ranking Individuals by Group Comparisons
Author: Tzu-Kuo Huang, Chih-Jen Lin, Ruby C. Weng
Abstract: This paper proposes new approaches to rank individuals from their group comparison results. Many real-world problems are of this type. For example, ranking players from team comparisons is important in some sports. In machine learning, a closely related application is classification using coding matrices. Group comparison results are usually in two types: binary indicator outcomes (wins/losses) or measured outcomes (scores). For each type of results, we propose new models for estimating individuals’ abilities, and hence a ranking of individuals. The estimation is carried out by solving convex minimization problems, for which we develop easy and efficient solution procedures. Experiments on real bridge records and multi-class classification demonstrate the viability of the proposed models. Keywords: ranking, group comparison, binary/scored outcomes, Bradley-Terry model, multiclass classification
4 0.19862194 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis
Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui
Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso
5 0.18452753 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties
Author: Sivan Sabato, Shai Shalev-Shwartz
Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, that estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimation of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets. Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees
6 0.17321104 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction
7 0.16904312 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods
8 0.16833425 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
9 0.15494747 22 jmlr-2008-Closed Sets for Labeled Data
10 0.14356516 96 jmlr-2008-Visualizing Data using t-SNE
11 0.13158637 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction
12 0.12858143 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data
13 0.1260248 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
14 0.12595482 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment
15 0.12534069 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines (Special Topic on Model Selection)
16 0.12024723 92 jmlr-2008-Universal Multi-Task Kernels
17 0.11869487 61 jmlr-2008-Mixed Membership Stochastic Blockmodels
18 0.10810877 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes
19 0.10809025 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces
20 0.10312185 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers
topicId topicWeight
[(0, 0.026), (5, 0.016), (40, 0.024), (54, 0.028), (58, 0.041), (64, 0.476), (66, 0.082), (76, 0.034), (78, 0.013), (88, 0.067), (92, 0.037), (94, 0.035), (99, 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.80012733 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data
Author: Guy Lebanon, Yi Mao
Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing
2 0.24934205 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning
Author: Francis R. Bach
Abstract: We consider the least-square regression problem with regularization by a block 1 -norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1 -norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators
3 0.2492757 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model
Author: Matthias W. Seeger
Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing
4 0.2462751 57 jmlr-2008-Manifold Learning: The Price of Normalization
Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov
Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap
5 0.24428232 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks
Author: Gal Elidan, Stephen Gould
Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth
6 0.24323998 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace
7 0.24185084 83 jmlr-2008-Robust Submodular Observation Selection
8 0.2416622 56 jmlr-2008-Magic Moments for Structured Output Prediction
9 0.2407487 58 jmlr-2008-Max-margin Classification of Data with Absent Features
10 0.24034993 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration
11 0.23824798 9 jmlr-2008-Active Learning by Spherical Subdivision
12 0.23792307 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition
13 0.23723145 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning
14 0.2355763 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming (Special Topic on Model Selection)
15 0.23533125 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers
16 0.2345423 86 jmlr-2008-SimpleMKL
18 0.23376866 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections
19 0.23312908 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks
20 0.23282877 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss