jmlr jmlr2008 jmlr2008-69 knowledge-graph by maker-knowledge-mining

69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data


Source: pdf

Author: Guy Lebanon, Yi Mao

Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN. Editor: Tommi Jaakkola. Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. [sent-5, score-0.773]

2 Keywords: ranked data, partially ordered sets, kernel smoothing 1. [sent-9, score-0.35]

3 Introduction Rankers such as people, search engines, and classifiers, output full or partial rankings representing preference relations over n items or alternatives. [sent-10, score-0.803]

4 For example, in the case of m = 6 rankers issuing full or partial preferences over n = 3 items, a possible data set is 3 ≻ 1 ≻ 2, 3 ≻ 2 ≻ 1, 1 ≻ 3 ≻ 2, 1 ≻ {2, 3}, 3 ≻ {1, 2}, {2, 3} ≻ 1. [sent-11, score-0.499]

5 (1) The first three expressions in (1) correspond to full rankings while the last three expressions correspond to partial rankings (the numbers correspond to items and the symbol ≻ corresponds to a preference relation). [sent-12, score-1.326]

6 While it is likely that some rankings will contradict others, it is natural to assume that the data in (1) was sampled iid from some distribution p over rankings. [sent-13, score-0.384]

7 Despite this motivating observation, modeling ranked data is less popular than modeling the existing numeric scores, or even made-up numeric scores in case the true scores are unavailable (such is the case with the frequently used Borda count). [sent-21, score-0.435]

8 The main reason for this is that rankings over a large number of items n reside in an extremely large discrete space whose modeling often requires intractable computation. [sent-22, score-0.599]

9 Previous attempts at modeling ranked data have been mostly parametric and often designed to work with fully ranked data (Marden, 1996). [sent-23, score-0.415]

10 Most of the aforementioned approaches are unsuitable for modeling partial rankings for medium and large n due to the computational difficulties of handling a probability space of size n! [sent-29, score-0.622]

11 The few possible exceptions (Critchlow, 1985; Marden, 1996) are usually more ad hoc and do not correspond to an underlying permutation model, making them ill-suited to handle partial rankings of different types. [sent-31, score-0.689]

12 On the other hand, there has been a recent increase in data sets containing partial or full rankings for large n. [sent-33, score-0.603]

13 More details on how these data sets correspond to partial rankings may be found in Section 2. [sent-35, score-0.602]

14 (2) The resulting estimate p̂ should assign probabilities to full and partial rankings in a coherent and contradiction-free manner (described in Section 4). [sent-43, score-0.603]

15 (3) Estimate p based on partial rankings of different types (defined in Section 2). [sent-44, score-0.577]

16 (5) Statistical convergence of p̂ can be slow for fully ranked data but should be accelerated when restricted to simpler partial rankings. [sent-67, score-0.378]

17 (6) Obtaining the estimate p̂ and using it to compute probabilities p̂(A) of partial rankings should be computationally feasible, even for large n. [sent-68, score-0.577]

18 All 6 properties above are crucial in the large n scenario: it is often impossible for rankers to specify full rankings over a very large number of items making the use of partial rankings a necessity. [sent-69, score-1.234]

19 Different rankers may choose to output partial rankings of different types; for example, one ranker can output 3 ≻ {1, 2} (3 is preferred to both 1 and 2) and another ranker can output {1, 3} ≻ 2 (both 1 and 3 are preferred to 2). [sent-70, score-0.704]

20 In this notation the numbers correspond to items and the locations of the items in their corresponding compartments correspond to their ranks. [sent-93, score-0.434]

21 The collection of all permutations of n items forms the non-Abelian symmetric group of order n, denoted by S_n, using function composition as the group operation πσ = π ◦ σ. [sent-94, score-0.344]
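As a small illustration of this group structure (not from the paper; representing a permutation as a tuple whose i-th entry is the rank of item i is my own convention), the following Python sketch composes two permutations and computes Kendall's tau distance, the distance the Mallows model uses later:

from itertools import permutations

def compose(pi, sigma):
    # Function composition pi o sigma: apply sigma first, then pi.
    # A permutation is a tuple where position i holds the rank of item i.
    return tuple(pi[sigma[i]] for i in range(len(pi)))

def kendall_tau(pi, sigma):
    # Number of item pairs ordered differently by pi and sigma (discordant pairs).
    n = len(pi)
    return sum(1 for i in range(n) for j in range(i + 1, n)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

S3 = list(permutations(range(3)))   # the symmetric group S_3 has 3! = 6 elements
print(len(S3), compose(S3[1], S3[4]), kendall_tau(S3[0], S3[5]))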

22 A reasonable solution is achieved by considering partial rankings which correspond to cosets of the symmetric group. [sent-107, score-0.701]

23 Figure 1: A partial ranking corresponds to a coset, that is, a set of permutations; here S_{1,1,2}π = {σ_1π, σ_2π} = 3|1|2, 4. The subgroup of all permutations that fix the top k positions is denoted S_{1,…,1,n−k}. [sent-109, score-0.79]

24 It may thus be interpreted as a partial ranking of the top k items, that does not contain any information concerning the relative ranking of the bottom n − k items. [sent-123, score-0.571]

25 The set of all such partial rankings forms the quotient space S_n / S_{1,…,1,n−k}. [sent-124, score-0.577]

26 Figure 1 illustrates the identification of a coset as a partial ranking of the top 2 out of 4 items. [sent-128, score-0.529]

27 We generalize the above relationship between partial rankings and cosets through the following definition of a composition. [sent-129, score-0.676]

28 A composition γ = (γ_1, …, γ_r) corresponds to a partial ranking with γ_1 items in the first position, γ_2 items in the second position, and so on. [sent-138, score-0.707]

29 For such a partial ranking it is known that the first set of γ_1 items are to be ranked before the second set of γ_2 items, etc. [sent-139, score-0.892]

30 A partial ranking of type γ is equivalent to a coset Sγ π = {σπ : σ ∈ Sγ } and the set of such partial rankings forms the quotient space Sn /Sγ . [sent-164, score-1.076]
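A brute-force way to realize this correspondence for small n is sketched below; the helper names and the assumption that S_γ permutes ranks within consecutive blocks of sizes γ_1, …, γ_r are mine, not the paper's:

from itertools import permutations, product

def gamma_subgroup(gamma):
    # Enumerate S_gamma: permutations that only shuffle ranks within
    # consecutive blocks of sizes gamma_1, ..., gamma_r.
    n = sum(gamma)
    blocks, start = [], 0
    for g in gamma:
        blocks.append(list(range(start, start + g)))
        start += g
    for choice in product(*[permutations(b) for b in blocks]):
        sigma = [0] * n
        for block, perm in zip(blocks, choice):
            for pos, val in zip(block, perm):
                sigma[pos] = val
        yield tuple(sigma)

def coset(gamma, pi):
    # S_gamma pi = {sigma o pi : sigma in S_gamma}, i.e. apply pi first, then sigma.
    n = len(pi)
    return {tuple(sigma[pi[i]] for i in range(n)) for sigma in gamma_subgroup(gamma)}

# The partial ranking 3|1|2,4 of Figure 1 (items 0-indexed) as a coset of size 1!*1!*2! = 2.
print(coset((1, 1, 2), (1, 2, 0, 3)))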

31 The vertical bar notation described above for permutations is particularly convenient for denoting partial rankings. [sent-165, score-0.411]

32 A partial ranking is written as the items 1, …, n separated by vertical bars, indicating that items on the left side of each vertical bar are preferred to (ranked higher than) items on the right side of the bar. [sent-169, score-0.533]

33 For example, the partial ranking displayed in Figure 1 is denoted by 3|1|2, 4. [sent-171, score-0.367]
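A tiny parser for this vertical-bar notation (a sketch; the list-of-sets representation is my own choice rather than anything in the paper):

def parse_bar_notation(s):
    # '3|1|2,4' -> [{3}, {1}, {2, 4}]: one set of items per compartment,
    # earlier compartments are preferred to later ones.
    return [set(int(x) for x in block.split(',')) for block in s.split('|')]

def ranking_type(compartments):
    # The composition gamma = (gamma_1, ..., gamma_r) of a partial ranking.
    return tuple(len(block) for block in compartments)

print(parse_bar_notation('3|1|2,4'), ranking_type(parse_bar_notation('3|1|2,4')))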

34 The set of all partial rankings W_n def= {S_γπ : π ∈ S_n, ∀γ} (2), which includes the set of full rankings S_n, is a subset of all possible partial orders on {1, …, n}. [sent-173, score-1.254]

35 While the formalism of partial rankings in Wn cannot realize all partial orderings, it is sufficiently powerful to include many useful and naturally occurring orderings as special cases. [sent-177, score-0.801]

36 Special cases of particular interest are the following partial rankings: • π ∈ S_n corresponds to a permutation or a full ordering, for example, 3|2|4|1. [sent-179, score-0.69]

37 An example for such a ranking is a ranked list of the top k webpages output by search engines in response to a query. [sent-189, score-0.389]

38 An example for such a ranking is selection of preferred and non-preferred items from a list. [sent-204, score-0.369]

39 One particular extension to partial rankings is to consider a partial ranking as censored data equivalent to the set of permutations in its related coset. [sent-237, score-1.125]

40 In other words, we define the probability the model assigns to the partial ranking S_γπ by ∑_{τ∈S_γπ} p_κ(τ) = ψ^{−1}(c) ∑_{τ∈S_γπ} exp(−c d(τ, κ)). [sent-238, score-0.395]
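For small n this censored-data probability can be checked by explicit summation; the sketch below assumes the coset and kendall_tau helpers from the earlier snippets and simply sums the Mallows kernel over the coset:

import math
from itertools import permutations

def mallows_prob_of_coset(gamma, pi, kappa, c):
    # p_kappa(S_gamma pi) = psi(c)^{-1} * sum_{tau in S_gamma pi} exp(-c d(tau, kappa)),
    # where d is Kendall's tau distance and psi(c) normalizes over all of S_n.
    n = len(pi)
    psi = sum(math.exp(-c * kendall_tau(tau, kappa)) for tau in permutations(range(n)))
    return sum(math.exp(-c * kendall_tau(tau, kappa)) for tau in coset(gamma, pi)) / psi

# Example: probability of the top-2 partial ranking 3|1|2,4 under a Mallows model
# centred at the identity permutation with spread c = 1.
print(mallows_prob_of_coset((1, 1, 2), (1, 2, 0, 3), tuple(range(4)), 1.0))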

41 However, the apparent absence of a closed form formula for more general partial rankings prevented the widespread use of Equation 7 for large n and encouraged more ad-hoc and heuristic models (Critchlow, 1985; Marden, 1996). [sent-243, score-0.577]

42 Section 7 describes an efficient computational procedure for computing (7) for more general partial ranking types γ. [sent-244, score-0.367]

43 The Ranking Lattice. Partial rankings S_γπ relate to each other in a natural way by expressing more general, more specific, or inconsistent orderings. [sent-246, score-0.384]

44 We define below the concepts of partially ordered sets and lattices and then relate them to partial rankings by considering the set of partial rankings W_n as a lattice. [sent-247, score-1.286]

45 The set of partial rankings W_n defined in (2) is naturally endowed with the partial order of ranking refinement, that is, π ⪯ σ if π refines σ, or alternatively, if we can get from π to σ by dropping vertical lines (Lebanon and Lafferty, 2003). [sent-256, score-1.008]

46 Using the vertical bar notation, two elements are inconsistent iff there exist two items i, j that appear on opposing sides of a vertical bar in x and y, that is, x = · · · i| j · · · while y = · · · j|i · · · . [sent-274, score-0.378]
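A direct translation of this inconsistency test, sketched over the list-of-sets representation used in the parsing snippet above (parse_bar_notation is the hypothetical helper defined there):

def compartment_of(ranking, item):
    # Index of the compartment containing item, or None if the item is unranked.
    for idx, block in enumerate(ranking):
        if item in block:
            return idx
    return None

def inconsistent(x, y):
    # x and y are inconsistent iff some pair i, j appears as ...i|...|j... in x
    # but as ...j|...|i... in y.
    items = set().union(*x) & set().union(*y)
    for i in items:
        for j in items:
            xi, xj = compartment_of(x, i), compartment_of(x, j)
            yi, yj = compartment_of(y, i), compartment_of(y, j)
            if xi < xj and yi > yj:
                return True
    return False

print(inconsistent(parse_bar_notation('3|1|2,4'), parse_bar_notation('1|3|2,4')))  # True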

47 Probabilistic Models on the Ranking Lattice. The ranking lattice is a convenient framework to define and study probabilistic models on partial rankings. [sent-299, score-0.459]

48 Interpreting partial rankings S_γπ ∈ W̃_n as the disjoint union of the events defined by the coset S_γπ, we have that g(S_γπ) = ∑_{τ∈S_γπ} p(τ) (9) may be interpreted as the probability under p of the disjoint union S_γπ of permutations. [sent-301, score-0.709]

49 We refer to the function g as the partial ranking or lattice version of p. [sent-302, score-0.459]

50 The function g is defined on the entire lattice, but when restricted to partial rankings of the same type G = {S_γπ : π ∈ S_n} ⊂ W̃_n, constitutes a normalized probability distribution on G. [sent-307, score-0.577]

51 Figure 3 illustrates this problem for partial rankings with the same (left) and different (right) number of vertical bars. [sent-312, score-0.641]

52 In addition to this construction which logically occurs after obtaining the estimator p̂, we also need to consider how to use partially ranked data in the process of obtaining the estimator p̂. [sent-315, score-0.401]

53 Instead, the inference needs to be conducted based on a set of partial rankings D = {S_{γ_i}π_i : i = 1, …, m}. [sent-317, score-0.577]

54 Assuming uniformly random censoring in a parametric setting, we obtain the following observed likelihood with respect to the partially ranked data set D: ℓ(θ|D) = ∑_{i=1}^{m} log ∑_{σ∈S_{γ_i}π_i} p_θ(σ) + const. [sent-323, score-0.39]
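In code, this censored-data likelihood just sums the model over each observed coset before taking logs; a sketch reusing the hypothetical coset helper from above, with p_theta standing for any permutation model:

import math

def observed_loglik(data, p_theta):
    # data = [(gamma_i, pi_i), ...]; each observation is treated as the event
    # S_{gamma_i} pi_i, so its contribution is the log of the summed probability
    # of that coset (uniform censoring only contributes an additive constant).
    return sum(math.log(sum(p_theta(sigma) for sigma in coset(g, pi)))
               for g, pi in data)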

55 In the next section we explore in detail a non-parametric kernel smoothing alternative to estimating p and g based on partially ranked data. [sent-325, score-0.35]

56 Figure 3: Two partial rankings with the same (left) and different (right) number of vertical bars in the Hasse diagram of W̃_n. [sent-332, score-0.778]

57 To avoid probabilistic contradictions, the values of g at two non-disjoint partial rankings Sγ π, Sλ σ cannot be specified in an independent manner. [sent-335, score-0.577]

58 Since the normalization term ψ does not depend on the location parameter (6), the kernel smoothing estimator for p is p̂(π) = (1/(m ψ(c))) ∑_{i=1}^{m} exp(−c d(π, π_i)), π, π_i ∈ S_n (11), assuming the data consists of complete rankings π_1, …, π_m. [sent-341, score-0.522]
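A brute-force rendering of this estimator for small n (a sketch; it reuses the kendall_tau helper from the first snippet and computes ψ(c) by explicit summation over S_n, exploiting the fact that ψ does not depend on the kernel's location):

import math
from itertools import permutations

def kde_full(pi, data, c):
    # Equation 11 sketch: average of Mallows kernels centred at the observed
    # full rankings pi_1, ..., pi_m, evaluated at the full ranking pi.
    n = len(pi)
    identity = tuple(range(n))
    psi = sum(math.exp(-c * kendall_tau(tau, identity)) for tau in permutations(range(n)))
    return sum(math.exp(-c * kendall_tau(pi, pi_i)) for pi_i in data) / (len(data) * psi)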

59 The lattice or partial ranking version ĝ corresponding to p̂ in (12) is ĝ(S_λπ) = (1/(m ψ(c))) ∑_{i=1}^{m} (1/|S_{γ_i}|) ∑_{κ∈S_λπ} ∑_{τ∈S_{γ_i}π_i} exp(−c d(κ, τ)), S_λπ ∈ W̃_n. [sent-350, score-0.459]
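The corresponding lattice version, again by brute force for small n (a sketch reusing the coset and kendall_tau helpers above; for large n the paper's closed-form expressions below replace these explicit sums):

import math
from itertools import permutations

def g_hat(lam, pi, data, c):
    # Equation 13 sketch: smoothed probability of the partial ranking S_lambda pi
    # when each observation (gamma_i, pi_i) is treated as censored data.
    n = len(pi)
    identity = tuple(range(n))
    psi = sum(math.exp(-c * kendall_tau(tau, identity)) for tau in permutations(range(n)))
    query = coset(lam, pi)
    total = 0.0
    for gamma_i, pi_i in data:
        obs = coset(gamma_i, pi_i)
        total += sum(math.exp(-c * kendall_tau(kappa, tau))
                     for kappa in query for tau in obs) / len(obs)
    return total / (len(data) * psi)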

60 The set appearing in the definition of b^γ_{kl}(τ) contains pairs (s, t) that are inversions of τ and for which s and t appear in the l-th and k-th compartments of γ, respectively. [sent-366, score-0.363]

61 (15) Proof: ∑_{τ∈S_γπ} q^{∑_{k=1}^{r} a^γ_k(τ) + ∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(τ)} = q^{∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(π)} ∑_{τ∈S_γπ} q^{∑_{k=1}^{r} a^γ_k(τ)} = q^{∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(π)} ∏_{s=1}^{r} ∑_{τ∈S_{γ_s}} q^{i(τ)} = q^{∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(π)} ∏_{s=1}^{r} ∏_{j=1}^{γ_s−1} ∑_{k=0}^{j} q^k. [sent-380, score-1.09]
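The factorization used in the last step is the classical generating function of inversions, ∑_{τ∈S_n} q^{i(τ)} = ∏_{j=1}^{n−1} ∑_{k=0}^{j} q^k, which is easy to sanity-check numerically (a sketch of my own, not code from the paper):

from itertools import permutations

def inversions(tau):
    # i(tau): number of pairs out of order.
    n = len(tau)
    return sum(1 for a in range(n) for b in range(a + 1, n) if tau[a] > tau[b])

def check_inversion_gf(n, q):
    lhs = sum(q ** inversions(tau) for tau in permutations(range(n)))
    rhs = 1.0
    for j in range(1, n):
        rhs *= sum(q ** k for k in range(j + 1))
    return abs(lhs - rhs) < 1e-9

print(all(check_inversion_gf(n, 0.7) for n in range(1, 6)))  # True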

62 The remaining terms depend only on the partial ranking type γ and thus may be pre-computed and tabulated for efficient computation. [sent-383, score-0.367]

63 Corollary 2 The partial ranking version g corresponding to the Mallows kernel p_κ is p_κ(S_γπ) = (∏_{s=1}^{r} ∏_{j=1}^{γ_s−1} ∑_{k=0}^{j} e^{−kc}) / (∏_{j=1}^{n−1} ∑_{k=0}^{j} e^{−kc}) · e^{−c ∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(πκ^{−1})} ∝ e^{−c ∑_{k=1}^{r} ∑_{l=k+1}^{r} b^γ_{kl}(πκ^{−1})}. [sent-388, score-0.581]

64 The complexity of computing (16), (12), (13) for some popular partial ranking types appears in Table 1. [sent-396, score-0.367]

65 The term ψ(2c)/ψ²(c) is bounded since it may be written as a product ∏_{j=1}^{n} R_j(c), with R_j(c) = [(1 + e^{−jc})/(1 − e^{−jc})] / [(1 + e^{−c})/(1 − e^{−c})] ≤ 1 for all c ∈ R_+ and j ≥ 1, since the function (1 + ε)/(1 − ε) increases with ε > 0. [sent-426, score-0.454]

66 Based on Proposition 7 and Equations (17)-(18), |bias(p̂(π))| ≤ M ∑_{j=1}^{n} j e^{−jc}/(1 − e^{−jc}) ≤ M n e^{−c}/(1 − e^{−c}), with a corresponding bound on Var(p̂(π)) in terms of p(π), M, n e^{−c}/(1 − e^{−c}) and ψ(2c)/ψ²(c). [sent-427, score-0.454]

67 For large n, it is often the case that partial, rather than full, rankings are available for estimating p. [sent-437, score-0.384]

68 Furthermore, in many cases, rankers can make some partial ranking assertions with certainty but do not have a clear opinion on other preferences. [sent-439, score-0.444]

69 Using the censored data interpretation of partially ranked data enables efficient use of partially ranked data of multiple types in the estimation process (12). [sent-440, score-0.639]

70 Statistically, expressing partially ranked data as censored data has the effect of increased smoothing and therefore it reduces the variance while increasing the bias. [sent-441, score-0.415]

71 A consequence of this proposition, which is also illustrated experimentally in Section 9, is that even if the fully ranked data is somehow available, estimating p̂ based on the partial rankings obtained by censoring it tends to increase the estimation accuracy. [sent-443, score-0.937]

72 Proposition 9 Assuming the same conditions as in Proposition 7, the bias and variance of the censored data or partial ranking estimator (12) for γ1 = . [sent-444, score-0.578]

73 Contrasting the expressions in Proposition 9 with those in Proposition 7 indicates that reverting to partial rankings tends to increase the bias but reduce the variance. [sent-459, score-0.665]

74 The precise changes in the bias and variance that occur due to using partial rankings depend on γ, n, m, c, M. [sent-465, score-0.658]

75 Indeed, in the common case described earlier where the number of items n is large, switching to partially ranked data can dramatically improve the estimation accuracy. [sent-467, score-0.445]

76 It is remarkable that this statistical motivation to use partial rather than full rankings is aligned with the data availability and ease of use as well as with the computational efficiency demonstrated in the previous section. [sent-469, score-0.603]

77 Figure 5: Values of sp(S_γ)/|S_γ| (top row) and log_2 of the corresponding ratio (bottom row) for n = 15 and various partial ranking types γ, such as (4, n−4), (5, n−5), (6, n−6), (7, n−7). [sent-493, score-0.367]

78 A popular example is collaborative filtering which is the task of recommending items to a user based on partial preference information that is output by that user (Resnick et al. [sent-499, score-0.393]

79 Given a particular partial ranking S_γπ output by a certain user, we can predict its most likely refinement arg max_{S_λσ} p(S_λσ | S_γπ). [sent-502, score-0.367]
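A sketch of that prediction step, assuming a set of candidate refinements of the observed ranking and the g_hat estimator sketched earlier (all names are illustrative, not the paper's):

def predict_refinement(candidates, data, c):
    # candidates = [(lambda_i, sigma_i), ...], each refining the observed partial ranking.
    # Because every candidate refines the observation, conditioning on it only divides
    # each score by the same constant, so the argmax over g_hat is unchanged.
    return max(candidates, key=lambda cand: g_hat(cand[0], cand[1], data, c))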

80 The first is the APA data set (Diaconis, 1989) which contains several thousand rankings of 5 APA presidential candidates. [sent-508, score-0.384]

81 The second is the Jester data set containing rankings of 100 jokes by 73,496 users. [sent-509, score-0.428]

82 The third data set is the EachMovie data set containing rankings of 1628 movies by 72,916 users. [sent-510, score-0.412]

83 Due to the computational difficulty associated with maximum likelihood for the Mallows model for large n we experimented with rankings over a small number of items. [sent-516, score-0.411]

84 The vertices of the permutation polytope, displayed in Figure 7, correspond to S_4 and its edges correspond to pairs of permutations with Kendall’s tau distance 1. [sent-523, score-0.353]

85 Figure 8 demonstrates non-parametric modeling of partial rankings for n = 100 (the Mallows model maximum likelihood estimator cannot be computed for such n). [sent-530, score-0.712]

86 We used 10043 rankings from the Jester data set in which users ranked all n = 100 jokes. [sent-531, score-0.558]

87 Test set log-likelihood with respect to the lattice version ĝ(S_γπ) of the non-parametric estimator p̂ for partial ranking types γ = (5, n − 5) (top) and γ = (1, 1, 1, n − 3) (bottom). [sent-557, score-0.522]

88 The variance reduction by (k, n − k) partial rankings clearly outweighs the bias increase. [sent-566, score-0.658]

89 Similarly, it is typically the case for large n that both the data available for estimating p̂ and the use of p̂ will be restricted to partial rankings or cosets of the symmetric group. [sent-569, score-0.676]

90 Attempts to define a probabilistic model directly on multiple types of partial rankings H ⊂ W̃_n face a challenging problem of preventing probabilistic contradictions. [sent-570, score-0.577]

91 A simple solution is to define the partial ranking model ĝ in terms of a permutation model p̂ through the mechanism of Möbius inversion and the censored data interpretation. [sent-571, score-0.593]

92 We also examine the effect of using partial, rather than full, rankings on the bias and variance of the estimator. [sent-591, score-0.465]

93 Complexity Issues. Table 1 lists the computational complexity results for computing (13) for some popular partial ranking types γ and λ. [sent-594, score-0.367]

94 [Appendix derivation fragments: the inner sums over S_t and S_{n−t} reduce to ratios of products of factors of the form (1 − e^{−jc}), scaled by |Ā|! and |A|! respectively.]

96 Substituting the above results into Equation 16, we get a closed form for ψ^{−1}(c) |S_γ|^{−1} ∑_{σ∈S_λπ_1} ∑_{τ∈S_γπ_2} e^{−c d(σ,τ)} as a ratio of products of factors of the form (1 − e^{−jc}), multiplied by the term e^{−c|A||B|}.

99 Note |Ā| ≤ min(k, t), |A| ≤ k and |B| ≤ t, therefore the above expression takes O(k + t) to evaluate.

100 [Fragment of a nested-sum generating-function identity for ∑ q^{b^γ_{12}(π)}, involving the factor k! and q^{(1+k)k/2}, used in the complexity analysis.]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('rankings', 0.384), ('mallows', 0.318), ('sn', 0.269), ('jc', 0.227), ('partial', 0.193), ('bkl', 0.187), ('ranked', 0.185), ('ranking', 0.174), ('items', 0.17), ('anked', 0.154), ('ebanon', 0.154), ('odeling', 0.154), ('wn', 0.136), ('coset', 0.132), ('inversions', 0.132), ('ak', 0.125), ('sp', 0.115), ('permutations', 0.114), ('bab', 0.112), ('ao', 0.107), ('tau', 0.102), ('cosets', 0.099), ('lattice', 0.092), ('kendall', 0.092), ('qk', 0.092), ('qi', 0.09), ('partially', 0.09), ('censoring', 0.088), ('kc', 0.088), ('permutation', 0.087), ('proposition', 0.087), ('poset', 0.077), ('rankers', 0.077), ('hasse', 0.075), ('def', 0.074), ('inversion', 0.072), ('censored', 0.067), ('qa', 0.067), ('vertical', 0.064), ('estimator', 0.063), ('composition', 0.06), ('bias', 0.056), ('stanley', 0.056), ('apa', 0.055), ('compartment', 0.055), ('eachmovie', 0.055), ('fligner', 0.055), ('jester', 0.055), ('cd', 0.05), ('lebanon', 0.05), ('mum', 0.05), ('numeric', 0.05), ('smoothing', 0.048), ('modeling', 0.045), ('compartments', 0.044), ('critchlow', 0.044), ('jokes', 0.044), ('diagram', 0.042), ('lattices', 0.042), ('psfrag', 0.042), ('bar', 0.04), ('contradictions', 0.037), ('subgroup', 0.037), ('item', 0.036), ('var', 0.034), ('replacements', 0.033), ('preferences', 0.033), ('bius', 0.033), ('bxy', 0.033), ('disconcordant', 0.033), ('marden', 0.033), ('qak', 0.033), ('expressions', 0.032), ('orderings', 0.031), ('rated', 0.031), ('visualizing', 0.031), ('scores', 0.03), ('top', 0.03), ('preference', 0.03), ('bars', 0.029), ('movies', 0.028), ('verducci', 0.028), ('exp', 0.028), ('lipschitz', 0.028), ('likelihood', 0.027), ('continuity', 0.027), ('kernel', 0.027), ('qn', 0.027), ('full', 0.026), ('correspond', 0.025), ('mode', 0.025), ('preferred', 0.025), ('variance', 0.025), ('summaries', 0.023), ('summations', 0.023), ('tj', 0.023), ('enables', 0.022), ('compositions', 0.022), ('diaconis', 0.022), ('guy', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data

Author: Guy Lebanon, Yi Mao

Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing

2 0.11638141 9 jmlr-2008-Active Learning by Spherical Subdivision

Author: Falk-Florian Henrich, Klaus Obermayer

Abstract: We introduce a computationally feasible, “constructive” active learning method for binary classification. The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. In order to reduce computational complexity the version space is restricted to spherical simplices and learning procedes by subdividing the edges of maximal length. We show that this procedure optimally reduces a tight upper bound on the generalization error. The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. An upper bound is provided for the probability of disagreement between classifiers (hence the generalization error) for non-constant data densities on the sphere. The emphasis of this work lies on providing mathematically exact performance estimates for active learning strategies. Keywords: active learning, spherical subdivision, error bounds, simplex halving

3 0.071102753 80 jmlr-2008-Ranking Individuals by Group Comparisons

Author: Tzu-Kuo Huang, Chih-Jen Lin, Ruby C. Weng

Abstract: This paper proposes new approaches to rank individuals from their group comparison results. Many real-world problems are of this type. For example, ranking players from team comparisons is important in some sports. In machine learning, a closely related application is classification using coding matrices. Group comparison results are usually in two types: binary indicator outcomes (wins/losses) or measured outcomes (scores). For each type of results, we propose new models for estimating individuals’ abilities, and hence a ranking of individuals. The estimation is carried out by solving convex minimization problems, for which we develop easy and efficient solution procedures. Experiments on real bridge records and multi-class classification demonstrate the viability of the proposed models. Keywords: ranking, group comparison, binary/scored outcomes, Bradley-Terry model, multiclass classification

4 0.051908199 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui

Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso

5 0.050147001 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

Author: Bernadetta Tarigan, Sara A. van de Geer

Abstract: The success of support vector machines in binary classification relies on the fact that hinge loss employed in the risk minimization targets the Bayes rule. Recent research explores some extensions of this large margin based method to the multicategory case. We show a moment bound for the socalled multi-hinge loss minimizers based on two kinds of complexity constraints: entropy with bracketing and empirical entropy. Obtaining such a result based on the latter is harder than finding one based on the former. We obtain fast rates of convergence that adapt to the unknown margin. Keywords: multi-hinge classification, all-at-once, moment bound, fast rate, entropy

6 0.038386889 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

7 0.034686703 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment

8 0.034030579 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

9 0.033426538 32 jmlr-2008-Estimating the Confidence Interval for Prediction Errors of Support Vector Machine Classifiers

10 0.033418491 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction

11 0.033237901 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers

12 0.031048724 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

13 0.030735658 22 jmlr-2008-Closed Sets for Labeled Data

14 0.030182244 96 jmlr-2008-Visualizing Data using t-SNE

15 0.029302787 52 jmlr-2008-Learning from Multiple Sources

16 0.028825991 56 jmlr-2008-Magic Moments for Structured Output Prediction

17 0.027434509 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

18 0.02710798 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties

19 0.025994135 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

20 0.025865078 90 jmlr-2008-Theoretical Advantages of Lenient Learners: An Evolutionary Game Theoretic Perspective


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.146), (1, -0.028), (2, -0.039), (3, -0.014), (4, -0.005), (5, -0.054), (6, 0.025), (7, -0.062), (8, 0.004), (9, -0.068), (10, -0.005), (11, -0.024), (12, -0.16), (13, 0.047), (14, 0.241), (15, -0.003), (16, 0.04), (17, 0.024), (18, -0.007), (19, 0.113), (20, 0.007), (21, -0.298), (22, -0.419), (23, -0.205), (24, -0.116), (25, 0.03), (26, 0.122), (27, -0.144), (28, 0.1), (29, -0.145), (30, 0.126), (31, -0.084), (32, -0.103), (33, -0.105), (34, -0.043), (35, 0.034), (36, 0.107), (37, 0.023), (38, -0.074), (39, 0.044), (40, -0.0), (41, 0.033), (42, 0.002), (43, -0.07), (44, 0.068), (45, -0.072), (46, 0.044), (47, 0.021), (48, 0.085), (49, 0.039)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97427642 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data

Author: Guy Lebanon, Yi Mao

Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing

2 0.64512122 9 jmlr-2008-Active Learning by Spherical Subdivision

Author: Falk-Florian Henrich, Klaus Obermayer

Abstract: We introduce a computationally feasible, “constructive” active learning method for binary classification. The learning algorithm is initially formulated for separable classification problems, for a hyperspherical data space with constant data density, and for great spheres as classifiers. In order to reduce computational complexity the version space is restricted to spherical simplices and learning procedes by subdividing the edges of maximal length. We show that this procedure optimally reduces a tight upper bound on the generalization error. The method is then extended to other separable classification problems using products of spheres as data spaces and isometries induced by charts of the sphere. An upper bound is provided for the probability of disagreement between classifiers (hence the generalization error) for non-constant data densities on the sphere. The emphasis of this work lies on providing mathematically exact performance estimates for active learning strategies. Keywords: active learning, spherical subdivision, error bounds, simplex halving

3 0.38881654 80 jmlr-2008-Ranking Individuals by Group Comparisons

Author: Tzu-Kuo Huang, Chih-Jen Lin, Ruby C. Weng

Abstract: This paper proposes new approaches to rank individuals from their group comparison results. Many real-world problems are of this type. For example, ranking players from team comparisons is important in some sports. In machine learning, a closely related application is classification using coding matrices. Group comparison results are usually in two types: binary indicator outcomes (wins/losses) or measured outcomes (scores). For each type of results, we propose new models for estimating individuals’ abilities, and hence a ranking of individuals. The estimation is carried out by solving convex minimization problems, for which we develop easy and efficient solution procedures. Experiments on real bridge records and multi-class classification demonstrate the viability of the proposed models. Keywords: ranking, group comparison, binary/scored outcomes, Bradley-Terry model, multiclass classification

4 0.19862194 75 jmlr-2008-Optimal Solutions for Sparse Principal Component Analysis

Author: Alexandre d'Aspremont, Francis Bach, Laurent El Ghaoui

Abstract: Given a sample covariance matrix, we examine the problem of maximizing the variance explained by a linear combination of the input variables while constraining the number of nonzero coefficients in this combination. This is known as sparse principal component analysis and has a wide array of applications in machine learning and engineering. We formulate a new semidefinite relaxation to this problem and derive a greedy algorithm that computes a full set of good solutions for all target numbers of non zero coefficients, with total complexity O(n3 ), where n is the number of variables. We then use the same relaxation to derive sufficient conditions for global optimality of a solution, which can be tested in O(n3 ) per pattern. We discuss applications in subset selection and sparse recovery and show on artificial examples and biological data that our algorithm does provide globally optimal solutions in many cases. Keywords: PCA, subset selection, sparse eigenvalues, sparse recovery, lasso

5 0.18452753 79 jmlr-2008-Ranking Categorical Features Using Generalization Properties

Author: Sivan Sabato, Shai Shalev-Shwartz

Abstract: Feature ranking is a fundamental machine learning task with various applications, including feature selection and decision tree learning. We describe and analyze a new feature ranking method that supports categorical features with a large number of possible values. We show that existing ranking criteria rank a feature according to the training error of a predictor based on the feature. This approach can fail when ranking categorical features with many values. We propose the Ginger ranking criterion, that estimates the generalization error of the predictor associated with the Gini index. We show that for almost all training sets, the Ginger criterion produces an accurate estimation of the true generalization error, regardless of the number of values in a categorical feature. We also address the question of finding the optimal predictor that is based on a single categorical feature. It is shown that the predictor associated with the misclassification error criterion has the minimal expected generalization error. We bound the bias of this predictor with respect to the generalization error of the Bayes optimal predictor, and analyze its concentration properties. We demonstrate the efficiency of our approach for feature selection and for learning decision trees in a series of experiments with synthetic and natural data sets. Keywords: feature ranking, categorical features, generalization bounds, Gini index, decision trees

6 0.17321104 59 jmlr-2008-Maximal Causes for Non-linear Component Extraction

7 0.16904312 29 jmlr-2008-Cross-Validation Optimization for Large Scale Structured Classification Kernel Methods

8 0.16833425 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

9 0.15494747 22 jmlr-2008-Closed Sets for Labeled Data

10 0.14356516 96 jmlr-2008-Visualizing Data using t-SNE

11 0.13158637 31 jmlr-2008-Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

12 0.12858143 62 jmlr-2008-Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data

13 0.1260248 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

14 0.12595482 13 jmlr-2008-An Error Bound Based on a Worst Likely Assignment

15 0.12534069 15 jmlr-2008-An Information Criterion for Variable Selection in Support Vector Machines    (Special Topic on Model Selection)

16 0.12024723 92 jmlr-2008-Universal Multi-Task Kernels

17 0.11869487 61 jmlr-2008-Mixed Membership Stochastic Blockmodels

18 0.10810877 81 jmlr-2008-Regularization on Graphs with Function-adapted Diffusion Processes

19 0.10809025 70 jmlr-2008-On Relevant Dimensions in Kernel Feature Spaces

20 0.10312185 51 jmlr-2008-Learning Similarity with Operator-valued Large-margin Classifiers


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (5, 0.016), (40, 0.024), (54, 0.028), (58, 0.041), (64, 0.476), (66, 0.082), (76, 0.034), (78, 0.013), (88, 0.067), (92, 0.037), (94, 0.035), (99, 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.80012733 69 jmlr-2008-Non-Parametric Modeling of Partially Ranked Data

Author: Guy Lebanon, Yi Mao

Abstract: Statistical models on full and partial rankings of n items are often of limited practical use for large n due to computational consideration. We explore the use of non-parametric models for partially ranked data and derive computationally efficient procedures for their use for large n. The derivations are largely possible through combinatorial and algebraic manipulations based on the lattice of partial rankings. A bias-variance analysis and an experimental study demonstrate the applicability of the proposed method. Keywords: ranked data, partially ordered sets, kernel smoothing

2 0.24934205 27 jmlr-2008-Consistency of the Group Lasso and Multiple Kernel Learning

Author: Francis R. Bach

Abstract: We consider the least-square regression problem with regularization by a block 1 -norm, that is, a sum of Euclidean norms over spaces of dimensions larger than one. This problem, referred to as the group Lasso, extends the usual regularization by the 1 -norm where all spaces have dimension one, where it is commonly referred to as the Lasso. In this paper, we study the asymptotic group selection consistency of the group Lasso. We derive necessary and sufficient conditions for the consistency of group Lasso under practical assumptions, such as model misspecification. When the linear predictors and Euclidean norms are replaced by functions and reproducing kernel Hilbert norms, the problem is usually referred to as multiple kernel learning and is commonly used for learning from heterogeneous data sources and for non linear variable selection. Using tools from functional analysis, and in particular covariance operators, we extend the consistency results to this infinite dimensional case and also propose an adaptive scheme to obtain a consistent model estimate, even when the necessary condition required for the non adaptive scheme is not satisfied. Keywords: sparsity, regularization, consistency, convex optimization, covariance operators

3 0.2492757 18 jmlr-2008-Bayesian Inference and Optimal Design for the Sparse Linear Model

Author: Matthias W. Seeger

Abstract: The linear model with sparsity-favouring prior on the coefficients has important applications in many different domains. In machine learning, most methods to date search for maximum a posteriori sparse solutions and neglect to represent posterior uncertainties. In this paper, we address problems of Bayesian optimal design (or experiment planning), for which accurate estimates of uncertainty are essential. To this end, we employ expectation propagation approximate inference for the linear model with Laplace prior, giving new insight into numerical stability properties and proposing a robust algorithm. We also show how to estimate model hyperparameters by empirical Bayesian maximisation of the marginal likelihood, and propose ideas in order to scale up the method to very large underdetermined problems. We demonstrate the versatility of our framework on the application of gene regulatory network identification from micro-array expression data, where both the Laplace prior and the active experimental design approach are shown to result in significant improvements. We also address the problem of sparse coding of natural images, and show how our framework can be used for compressive sensing tasks. Part of this work appeared in Seeger et al. (2007b). The gene network identification application appears in Steinke et al. (2007). Keywords: sparse linear model, Laplace prior, expectation propagation, approximate inference, optimal design, Bayesian statistics, gene network recovery, image coding, compressive sensing

4 0.2462751 57 jmlr-2008-Manifold Learning: The Price of Normalization

Author: Yair Goldberg, Alon Zakai, Dan Kushnir, Ya'acov Ritov

Abstract: We analyze the performance of a class of manifold-learning algorithms that find their output by minimizing a quadratic form under some normalization constraints. This class consists of Locally Linear Embedding (LLE), Laplacian Eigenmap, Local Tangent Space Alignment (LTSA), Hessian Eigenmaps (HLLE), and Diffusion maps. We present and prove conditions on the manifold that are necessary for the success of the algorithms. Both the finite sample case and the limit case are analyzed. We show that there are simple manifolds in which the necessary conditions are violated, and hence the algorithms cannot recover the underlying manifolds. Finally, we present numerical results that demonstrate our claims. Keywords: dimensionality reduction, manifold learning, Laplacian eigenmap, diffusion maps, locally linear embedding, local tangent space alignment, Hessian eigenmap

5 0.24428232 48 jmlr-2008-Learning Bounded Treewidth Bayesian Networks

Author: Gal Elidan, Stephen Gould

Abstract: With the increased availability of data for complex domains, it is desirable to learn Bayesian network structures that are sufficiently expressive for generalization while at the same time allow for tractable inference. While the method of thin junction trees can, in principle, be used for this purpose, its fully greedy nature makes it prone to overfitting, particularly when data is scarce. In this work we present a novel method for learning Bayesian networks of bounded treewidth that employs global structure modifications and that is polynomial both in the size of the graph and the treewidth bound. At the heart of our method is a dynamic triangulation that we update in a way that facilitates the addition of chain structures that increase the bound on the model’s treewidth by at most one. We demonstrate the effectiveness of our “treewidth-friendly” method on several real-life data sets and show that it is superior to the greedy approach as soon as the bound on the treewidth is nontrivial. Importantly, we also show that by making use of global operators, we are able to achieve better generalization even when learning Bayesian networks of unbounded treewidth. Keywords: Bayesian networks, structure learning, model selection, bounded treewidth

6 0.24323998 5 jmlr-2008-A New Algorithm for Estimating the Effective Dimension-Reduction Subspace

7 0.24185084 83 jmlr-2008-Robust Submodular Observation Selection

8 0.2416622 56 jmlr-2008-Magic Moments for Structured Output Prediction

9 0.2407487 58 jmlr-2008-Max-margin Classification of Data with Absent Features

10 0.24034993 36 jmlr-2008-Finite-Time Bounds for Fitted Value Iteration

11 0.23824798 9 jmlr-2008-Active Learning by Spherical Subdivision

12 0.23792307 88 jmlr-2008-Structural Learning of Chain Graphs via Decomposition

13 0.23723145 89 jmlr-2008-Support Vector Machinery for Infinite Ensemble Learning

14 0.2355763 66 jmlr-2008-Multi-class Discriminant Kernel Learning via Convex Programming    (Special Topic on Model Selection)

15 0.23533125 3 jmlr-2008-A Moment Bound for Multi-hinge Classifiers

16 0.2345423 86 jmlr-2008-SimpleMKL

17 0.23450717 41 jmlr-2008-Graphical Models for Structured Classification, with an Application to Interpreting Images of Protein Subcellular Location Patterns

18 0.23376866 74 jmlr-2008-Online Learning of Complex Prediction Problems Using Simultaneous Projections

19 0.23312908 34 jmlr-2008-Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

20 0.23282877 21 jmlr-2008-Classification with a Reject Option using a Hinge Loss