jmlr jmlr2010 jmlr2010-89 knowledge-graph by maker-knowledge-mining

89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond


Source: pdf

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of tree-shaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily extends to this problem and suggests that graph clustering should optimize the trade-off between empirical data fit and the mutual information that clusters preserve on graph nodes.

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. [sent-18, score-0.589]

2 The analysis of co-clustering easily extends to this problem and suggests that graph clustering should optimize the trade-off between empirical data fit and the mutual information that clusters preserve on graph nodes. [sent-19, score-0.424]

3 All these possibilities are equally plausible and asking whether a clustering of blocks by shape is better or worse than a clustering of blocks by color makes no sense. [sent-31, score-0.398]

4 Returning to the bag of blocks example, if we know that after clustering the blocks we will have to pack them into a box, then the clustering of blocks by shape is much more useful than the clustering of blocks by color, since packing is indifferent to color. [sent-35, score-0.597]

5 In Section 8 we extend our approach to the formulation of unsupervised learning problems as prediction problems to graph clustering and pairwise clustering (the latter is equivalent to clustering of a weighted graph, where edge weights correspond to pairwise distances). [sent-134, score-0.797]

6 We formulate weighted graph clustering as a prediction problem: given a sample of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. [sent-135, score-0.589]

7 The bound shows that graph clustering should optimize a trade-off between empirical data fit and the mutual information that clusters preserve on the graph nodes. [sent-137, score-0.472]

8 As well, we define $\mathrm{kl}(\hat{L}(Q)\,\|\,L(Q)) = \hat{L}(Q)\ln\frac{\hat{L}(Q)}{L(Q)} + (1-\hat{L}(Q))\ln\frac{1-\hat{L}(Q)}{1-L(Q)}$ as the KL-divergence between two Bernoulli distributions with biases $\hat{L}(Q)$ and $L(Q)$. [sent-211, score-0.434]
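
A minimal sketch of this binary kl-divergence as a helper function (the name bernoulli_kl and the clamping constant are mine, not from the paper):

```python
import math

def bernoulli_kl(q, p, eps=1e-12):
    """kl(q || p) between two Bernoulli distributions with biases q and p."""
    q = min(max(q, eps), 1.0 - eps)  # clamp away from {0, 1}; eps is an implementation choice
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
```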

9 Furthermore, $L(h) = p_h(Z=1)$ and $\hat{L}(h) = \hat{p}_h(Z=1)$, hence $\mathrm{kl}(\hat{L}(Q)\,\|\,L(Q)) = \mathrm{KL}(\hat{p}_Q(z)\,\|\,p_Q(z))$. [sent-234, score-0.628]

10 Although there is no analytical expression for the inverse of the kl-divergence, $L(Q)$ can be bounded numerically: $L(Q) \le \mathrm{kl}^{-1}\!\left(\hat{L}(Q),\ \frac{\mathrm{KL}(Q\,\|\,P) + \frac{1}{2}\ln(4N) - \ln\delta}{N}\right)$. [sent-240, score-0.434]
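
As the sentence notes, $\mathrm{kl}^{-1}$ has no closed form, but since $\mathrm{kl}(q\,\|\,p)$ is increasing in $p$ for $p \ge q$ it can be inverted numerically, e.g. by bisection. A hedged sketch reusing the bernoulli_kl helper above (tolerance and the returned upper endpoint are implementation choices):

```python
def kl_inverse(q, eps, tol=1e-9):
    """Numerically upper-bound kl^{-1}(q, eps) = max{p in [q, 1) : kl(q || p) <= eps}."""
    lo, hi = q, 1.0 - 1e-12
    if bernoulli_kl(q, hi) <= eps:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bernoulli_kl(q, mid) <= eps:
            lo = mid   # still inside the feasible set, move right
        else:
            hi = mid   # outside the feasible set, move left
    return hi          # hi always remains a valid upper bound on the inverse
```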

11 The first auxiliary result applies the method of types to bound the expectation of the exponent of the divergence between the empirical and expected distributions over $Z$ for a single hypothesis: $\mathbb{E}_S\, e^{N\cdot\mathrm{KL}(\hat{p}_h(z)\,\|\,p_h(z))}$. [sent-248, score-0.362]

12 Then: $N\,\mathrm{KL}(\hat{p}_Q(z)\,\|\,p_Q(z)) = N\,\mathrm{KL}(\mathbb{E}_{Q(h)}\hat{p}_h(z)\,\|\,\mathbb{E}_{Q(h)}p_h(z)) \le \mathbb{E}_{Q(h)}\, N\,\mathrm{KL}(\hat{p}_h(z)\,\|\,p_h(z))$ (10) $\le \mathrm{KL}(Q\,\|\,P) + \ln \mathbb{E}_{P(h)}\, e^{N\,\mathrm{KL}(\hat{p}_h(z)\,\|\,p_h(z))}$, (11) where (10) is by the convexity of the KL-divergence (Cover and Thomas, 1991) and (11) is by the change of measure inequality. [sent-295, score-1.473]

13 To obtain (2) it is left to bound $\mathbb{E}_{P(h)}\, e^{N\,\mathrm{KL}(\hat{p}_h(z)\,\|\,p_h(z))}$. [sent-296, score-0.362]

14 In order to obtain a bound on $\mathbb{E}_{P(h)}\, e^{N\,\mathrm{KL}(\hat{p}_h(z)\,\|\,p_h(z))}$, by Markov's inequality we know that with probability at least $1-\delta$ over the sample: $\mathbb{E}_{P(h)}\, e^{N\,\mathrm{KL}(\hat{p}_h(z)\,\|\,p_h(z))} \le \frac{1}{\delta}\,\mathbb{E}_S\,\mathbb{E}_{P(h)}\, e^{N\,\mathrm{KL}(\hat{p}_h(z)\,\|\,p_h(z))}$. [sent-298, score-0.676]

15 To bound $\mathbb{E}_S\,\mathbb{E}_{P(h)}\, e^{N\,\mathrm{KL}(\hat{p}_h(z)\,\|\,p_h(z))}$ we note that it is possible to exchange $\mathbb{E}_S$ with $\mathbb{E}_{P(h)}$ since $S$ and $h$ are independent: $\mathbb{E}_S\,\mathbb{E}_{P(h)}\, e^{N\,\mathrm{KL}(\hat{p}_h(z)\,\|\,p_h(z))} = \mathbb{E}_{P(h)}\,\mathbb{E}_S\, e^{N\,\mathrm{KL}(\hat{p}_h(z)\,\|\,p_h(z))} \le (N+1)^{|Z|-1}$. [sent-299, score-0.942]

16 We denote a smoothed version of $\hat{p}_Q$ by $\tilde{p}_Q$ and define it as: $\tilde{p}_h(z) = \frac{\hat{p}_h(z) + \gamma}{1 + \gamma|Z|}$, $\tilde{p}_Q(z) = \mathbb{E}_{Q(h)}\tilde{p}_h(z) = \frac{\hat{p}_Q(z) + \gamma}{1 + \gamma|Z|}$. (13) [sent-315, score-0.942]

17 In the following theorem we show that if $\mathrm{KL}(\hat{p}_Q(z)\,\|\,p_Q(z)) \le \varepsilon(Q)$ and $\gamma(Q) = \frac{\sqrt{\varepsilon(Q)/2}}{|Z|}$, then $-\mathbb{E}_{\hat{p}_Q(z)}\ln\tilde{p}_Q(z)$ is roughly within a $\pm\sqrt{\varepsilon(Q)/2}\,\ln|Z|$ range around $H(\hat{p}_Q(z))$. [sent-316, score-0.469]

18 Note that for a uniform distribution $u(z) = \frac{1}{|Z|}$ the value of $-\mathbb{E}_{p(z)}\ln u(z) = \ln|Z|$. [sent-319, score-0.434]

19 For technical reasons in the proofs of the following section, the upper bound in the next theorem is stated for −EpQ (z) ln pQ (z) and for −EQ(h) Eph (z) ln ph (z). [sent-321, score-0.831]

20 Note that both $\varphi(\varepsilon)$ and $\psi(\varepsilon)$ converge to zero approximately as $-\sqrt{\varepsilon/2}\,\ln(\varepsilon/2)$, and for $\varepsilon/2 < \frac{1}{|Z|}$ they are in fact dominant over the $\sqrt{\varepsilon/2}\,\ln|Z|$ term. [sent-327, score-0.434]

21 The conditional probability distribution $q(c_i|x_i)$ represents the probability of mapping (assigning) $x_i$ to cluster $c_i$. [sent-398, score-0.485]

22 We define $\bar{I}(X_i;C_i) = \frac{1}{n_i}\sum_{x_i,c_i} q(c_i|x_i)\,\ln\frac{q(c_i|x_i)}{\bar{q}(c_i)}$, where $x_i \in \mathcal{X}_i$ are the possible values of $X_i$, $c_i \in \{1,\ldots$ [sent-425, score-0.935]

23 $\ldots, m_i\}$ are the possible values of $C_i$, and $\bar{q}(c_i) = \frac{1}{n_i}\sum_{x_i} q(c_i|x_i)$ is the marginal distribution over $C_i$ corresponding to $q(c_i|x_i)$ and a uniform distribution $u(x_i) = \frac{1}{n_i}$ over $\mathcal{X}_i$. [sent-427, score-0.598]

24 Thus, $\bar{I}(X_i;C_i)$ is the mutual information corresponding to the joint distribution $\bar{q}(x_i,c_i) = \frac{1}{n_i}\,q(c_i|x_i)$ defined by $q(c_i|x_i)$ and the uniform distribution over $\mathcal{X}_i$. [sent-428, score-0.582]
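
A sketch of this quantity: $\bar{I}(X_i;C_i)$ is just the mutual information of the soft assignments $q(c_i|x_i)$ paired with a uniform marginal over the $n_i$ values of $X_i$ (numpy; the function and variable names are mine):

```python
import numpy as np

def bar_mutual_information(q_c_given_x):
    """q_c_given_x: float array of shape (n_i, m_i); row x holds q(c | x) and sums to 1.
    Returns bar-I(X_i; C_i), the mutual information of the joint q(c|x) / n_i."""
    n_i, _ = q_c_given_x.shape
    q_bar_c = q_c_given_x.mean(axis=0)          # \bar{q}(c) = (1/n_i) * sum_x q(c|x)
    mi = 0.0
    for x in range(n_i):
        for c, q_xc in enumerate(q_c_given_x[x]):
            if q_xc > 0:
                mi += q_xc * np.log(q_xc / q_bar_c[c])
    return mi / n_i
```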

25 $\ldots, c_d)$: $\mathrm{kl}(\hat{L}(Q)\,\|\,L(Q)) \le \frac{\sum_{i=1}^{d}\left(n_i\bar{I}(X_i;C_i) + m_i\ln n_i\right) + M\ln|\mathcal{Y}| + \frac{1}{2}\ln(4N) - \ln\delta}{N}$, (18) where $M = \prod_{i=1}^{d} m_i$ is the number of partition cells. [sent-439, score-0.434]
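
A sketch that assembles the right-hand side of bound (18) for the d = 2 (co-clustering) case and inverts it numerically; it reuses the kl_inverse and bar_mutual_information helpers sketched above, assumes a loss bounded in [0, 1], and num_labels stands for |Y| (all names are mine):

```python
import math

def coclustering_bound(empirical_loss, q1, q2, N, num_labels, delta):
    """Upper-bounds L(Q) via (18) for d = 2, given the empirical loss L_hat(Q),
    soft row/column assignments q1 (n1 x m1) and q2 (n2 x m2), and sample size N."""
    (n1, m1), (n2, m2) = q1.shape, q2.shape
    M = m1 * m2                                   # number of partition cells
    eps = (n1 * bar_mutual_information(q1) + n2 * bar_mutual_information(q2)
           + m1 * math.log(n1) + m2 * math.log(n2)
           + M * math.log(num_labels)
           + 0.5 * math.log(4 * N) - math.log(delta)) / N
    return kl_inverse(empirical_loss, eps)
```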

26 And if we assign each $x_i$ to a separate cluster, $\bar{I}(X_i;C_i)$ is large, specifically in this case $\bar{I}(X_i;C_i) = \ln n_i$, and $L(Q)$ is far from $\hat{L}(Q)$. [sent-452, score-0.553]

27 Bear in mind that ni I(Xi ; Ci ) is linear in ni , whereas mi ln ni is logarithmic in ni . [sent-454, score-1.155]

28 Thus, at least when mi is small compared to ni (which is a reasonable assumption when we cluster the values of Xi ) the leading term in (18) is ¯ ni I(Xi ; Ci ). [sent-455, score-0.582]

29 Thus, the subspace of the unbalanced partitions is smaller than the subspace of the balanced partitions and the unbalanced partitions are simpler (it is easier to describe an unbalanced partition than a balanced one). [sent-462, score-0.356]

30 M is the number of partition cells (in a hard partition) and M ln |Y| corresponds to the size of the C1 , . [sent-472, score-0.335]

31 ) In a hard grid partition each value xi ∈ Xi is mapped deterministically to a single cluster ci ∈ {1, . [sent-490, score-0.621]

32 In order to draw a hypothesis $h \in \mathcal{H}$ according to $Q$ we draw a cluster $c_i$ for each $x_i \in \mathcal{X}_i$ according to $q(c_i|x_i)$ and then draw a label for each partition cell according to $q(y|c_1,\ldots$ [sent-517, score-0.629]

33 For example, we map each viewer to a cluster of viewers, map each movie to a cluster of movies, and assign ratings to the product space of viewer clusters by movie clusters. [sent-520, score-0.409]

34 , xd we simply check which partition cell it has fallen into and return the corresponding label. [sent-523, score-0.376]
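
A hedged sketch of the randomized prediction process in sentences 32–34 for d = 2 (e.g. viewer x movie): draw a cluster for each ID from the soft assignments, then draw a label for the resulting partition cell. All names and shapes are assumptions about the representation, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_randomized(x1, x2, q1, q2, q_y_given_c):
    """q1[x1], q2[x2]: soft assignments q(c1|x1), q(c2|x2);
    q_y_given_c[c1, c2]: distribution q(y|c1, c2) over the |Y| labels."""
    c1 = rng.choice(q1.shape[1], p=q1[x1])          # draw a row cluster
    c2 = rng.choice(q2.shape[1], p=q2[x2])          # draw a column cluster
    y = rng.choice(q_y_given_c.shape[2], p=q_y_given_c[c1, c2])
    return y
```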

35 3 Combinatorial Priors in PAC-Bayesian Bounds. In this section we design a combinatorial prior over the grid clustering hypothesis space and calculate the KL-divergence $\mathrm{KL}(Q\,\|\,P)$ between the posterior defined earlier and the prior. [sent-531, score-0.347]

36 Another important point to mention is that the posterior $Q$ returns a named partition of $\mathcal{X}_i$-s (the conditional distribution $q(c_i|x_i)$ specifies the “name” $c_i$ of the cluster that $x_i$ is mapped to). [sent-534, score-0.559]

37 Lemma 8 It is possible to define a prior $P$ over $\mathcal{H}_{\bar{m}}$ that satisfies: $P(h) \ge \frac{1}{\exp\left(\sum_{i=1}^{d}\left(n_i H(\mathrm{hist}(h|_i)) + (m_i-1)\ln n_i\right)\right)}$, (19) where $\mathrm{hist}(h|_i) = \{|c_{i1}|,\ldots$ [sent-538, score-0.470]

38 $\ldots, |c_{im_i}|\}$ denotes the cardinality profile (histogram) of cluster sizes along dimension $i$ of a partition corresponding to $h$, and $H(\mathrm{hist}(h|_i)) = -\sum_{j=1}^{m_i}\frac{|c_{ij}|}{n_i}\ln\frac{|c_{ij}|}{n_i}$ is the entropy of the (normalized) cardinality profile (note that $\sum_{j=1}^{m_i}|c_{ij}| = n_i$). [sent-540, score-1.228]

39 It is further possible to define a prior $P$ over $\mathcal{H} = \bigcup_{\bar{m}}\left(\mathcal{H}_{\bar{m}} \times \mathcal{H}_{|\mathcal{Y}|^{\bar{m}}}\right)$ that satisfies: $P(h) \ge \frac{1}{\exp\left(\sum_{i=1}^{d}\left(n_i H(\mathrm{hist}(h|_i)) + m_i\ln n_i\right) + M\ln|\mathcal{Y}|\right)}$. [sent-541, score-0.757]

40 Note that the leading terms in the prior are ni H(hist(h|i )) that count the number of possible ways to assign xi -s to ci -s, which are invariant under permutation of xi -s within each Xi (see the proof for details). [sent-543, score-0.779]

41 “Weak” prior knowledge on the number of clusters mi along each dimension and even on their sizes can only introduce an improvement that is logarithmic in ni -s to the bounds. [sent-545, score-0.424]

42 Lemma 9 For the prior defined in (19) and posterior $Q = \{q(c_i|x_i)\}_{i=1}^{d}$: $\mathrm{KL}(Q\,\|\,P) \le \sum_{i=1}^{d}\left(n_i\bar{I}(X_i;C_i) + (m_i-1)\ln n_i\right)$. [sent-555, score-0.687]

43 $\ldots, c_d)$: $\mathrm{KL}(Q\,\|\,P) \le \sum_{i=1}^{d}\left(n_i\bar{I}(X_i;C_i) + m_i\ln n_i\right) + M\ln|\mathcal{Y}|$. [sent-558, score-1.112]

44 There are $\binom{n_i-1}{m_i-1} \le n_i^{\,m_i-1}$ possibilities to choose a cluster cardinality profile along a dimension $i$. [sent-562, score-0.546]

45 To define a cardinality profile we are free to distribute the “excess mass” of ni − mi among the mi clusters. [sent-564, score-0.391]

46 The number of possible distributions equals the number of possibilities to place $m_i-1$ ones in a sequence of $(n_i-m_i)+(m_i-1) = n_i-1$ ones and zeros. [sent-565, score-0.574]

47 There are $n_i$ possibilities to choose the value of $m_i$ (we can assign all $x_i$-s to a single cluster, assign each $x_i$ to a separate cluster, and all the possibilities in between). [sent-573, score-0.525]

48 We use the decomposition KL(Q P) = −EQ ln P(h) − H(Q) and bound −EQ ln P(h) and H(Q) separately. [sent-576, score-0.482]

49 Then $-\mathbb{E}_Q \ln P(h) = -\sum_i \mathbb{E}_Q \ln P(h|_i)$ and $H(Q) = \sum_i H(Q(h|_i))$. [sent-580, score-0.434]

50 By Lemma 8: −EQ ln P(h|i ) ≤ (mi − 1) ln ni + ni EQ H(hist(h|i )). [sent-582, score-0.868]

51 (23) Hence, in order to bound −EQ ln P(h|i ) we have to bound the expected entropy of cluster cardinality profiles of the hypotheses generated by Q. [sent-583, score-0.449]

52 Recall that $Q$ draws a cluster $C_i$ for each $x_i \in \mathcal{X}_i$ according to $q(c_i|x_i)$ and that this process results in the marginal distribution $\bar{q}(c_i) = \frac{1}{n_i}\sum_{x_i} q(c_i|x_i)$ over the normalized cluster sizes (this is where the uniform distribution over $\mathcal{X}_i$ comes in). [sent-584, score-0.561]

53 Substituting this into (23) yields: $-\mathbb{E}_Q \ln P(h|_i) \le n_i H(\bar{q}(c_i)) + (m_i-1)\ln n_i$. [sent-594, score-0.868]

54 The bound follows from the fact that if we draw $n_i$ values of $C_i$ according to $q(c_i|x_i)$ the probability of the resulting type is bounded from above by $e^{-n_i\bar{H}(C_i|X_i)}$, where $\bar{H}(C_i|X_i) = -\frac{1}{n_i}\sum_{x_i,c_i} q(c_i|x_i)\ln q(c_i|x_i)$ (see Cover and Thomas, 1991, Theorem 12. [sent-597, score-0.793]

55 Hence, −EQ ln P(h|y|m ) = M ln |Y| ¯ ¯ and −H(Q(h|y|m )) ≤ 0. [sent-602, score-0.434]

56 Finally, since the prior $P(\bar{m})$ over the selection of $\bar{m}$ is uniform we have $-\mathbb{E}_Q \ln P(\bar{m}) = \sum_{i=1}^{d}\ln n_i$ and $H(Q(\bar{m})) = 0$, which is added to (21) by the additivity of $\mathrm{KL}(Q\,\|\,P)$, completing the proof. [sent-603, score-0.687]

57 , xd ) we need N to be exponential in ni -s, since the cardinality of the random variable ˆ X1 , . [sent-621, score-0.553]

58 , ch (xd )) i q(ch (x) )) , 1 d i i where ch (xi ) = h(xi ) is the partition cell that xi fell within h. [sent-659, score-0.328]

59 , xd directly, the described randomized prediction process matches the predictions by (27) and thus enables its analysis with PAC-Bayesian bounds. [sent-662, score-0.392]

60 ˆ We extrapolate ph , pQ , ph and pQ to the whole space X1 × . [sent-701, score-0.628]

61 $\ldots, c_d)) \le \frac{\sum_{i=1}^{d} n_i\bar{I}(X_i;C_i) + K_1}{N}$ (28) and for all $i$: $\mathrm{KL}(\hat{p}(x_i)\,\|\,p(x_i)) \le \frac{(n_i-1)\ln(N+1) + \ln\frac{d+1}{\delta}}{N}$, (29) [sent-752, score-0.608]

62 where $K_1 = \sum_{i=1}^{d} m_i\ln n_i + (M-1)\ln(N+1) + \ln\frac{d+1}{\delta}$. (30) As well, with probability greater than $1-\delta$: $\mathrm{KL}(\hat{p}_Q(x_1,\ldots$ [sent-753, score-0.721]

63 $\ldots, x_d)) \le \frac{\sum_{i=1}^{d} n_i\bar{I}(X_i;C_i) + K_2}{N}$, (31) where $K_2 = \sum_{i=1}^{d} m_i\ln n_i + \left(M + \sum_{i=1}^{d} n_i - d - 1\right)\ln(N+1) - \ln\delta$. [sent-757, score-1.457]

64 , xd ) ˆ directly using Theorem 6, because the cardinality of the random variable X1 , . [sent-769, score-0.336]

65 , cd ) + γ ˆ , 1 + γM p(xi ) + γi ˆ , p(xi ) = ˜ 1 + γi n i ph (c1 , . [sent-775, score-0.488]

66 , cd ) = ˜ ph (ci ) = ˜ (32) p(xi ), ˜ xi :h(xi )=ci d ph (x1 , . [sent-777, score-0.896]

67 , cd ))+ln(M ) ˜ p d ¯ i=1 ni I(Xi ; Ci ) + K1 2N +K3 , (36) d where I(ˆQ (c1 , . [sent-816, score-0.391]

68 $\ldots, c_d)$, $K_1$ is defined by (30), $K_3 = \varphi(\varepsilon(Q)) + \sum_{i=1}^{d}\left(H(\hat{p}(x_i)) + 2\sqrt{\varepsilon_i/2}\,\ln n_i + \varphi(\varepsilon_i) + \psi(\varepsilon_i)\right)$, and the functions $\varphi$ and $\psi$ are defined in Theorem 6. [sent-824, score-0.608]

69 $\ldots, c_d))$ and $\sum_i n_i\bar{I}(X_i;C_i)$ [note that the latter also appears in $\varphi(\varepsilon(Q))$ in $K_3$]. [sent-837, score-0.391]

70 $\ldots, c_d))$ and the statistical reliability of the estimate of the utility function, which is related to $\sum_i n_i\bar{I}(X_i;C_i)$. [sent-869, score-0.391]

71 $\ldots, x_d)$ we are able to reduce this dependency to $(M + \sum_i n_i - d - 1)$. [sent-887, score-0.519]

72 , cd )) + p KL(ˆh (xi |ch (xi )) ph (xi |ch (xi ))) p i i d KL(ˆ(xi ) p(xi )) − p KL(ˆ(xi ) p(xi )). [sent-937, score-0.488]

73 , Cd ) − i i p(Xi ) ˜ ph (Cih (Xi )) ˜ EQ(h) Eph (ci ) ln ph (Ci ) ˜ Ep(xi ) ln p(Xi ) + ˜ i EpQ (ci ) ln pQ (Ci ) ˜ Ep(xi ) ln p(Xi ) + ˜ i At this point we use (14) to bound the first and the second term and the lower bound (16) to bound the last term and obtain (36). [sent-987, score-1.64]

74 1 Minimization of the PAC-Bayesian Bound for Discriminative Prediction with Grid Clustering. We start with minimization of the PAC-Bayesian bound for discriminative prediction based on grid clustering (18) suggested in Theorem 7. [sent-999, score-0.435]

75 We rewrite the bound in a slightly different way in order to separate terms which are independent of the conditional distributions in $Q$: $\mathrm{kl}(\hat{L}(Q)\,\|\,L(Q)) \le \frac{\sum_{i=1}^{d} n_i\bar{I}(X_i;C_i) + K}{N}$, (37) where $K = \sum_{i=1}^{d} m_i\ln n_i + M\ln|\mathcal{Y}| + \frac{1}{2}\ln(4N) - \ln\delta$. [sent-1000, score-1.203]

76 Since $K$ is constant, $L(Q)$ depends on a parameterized trade-off between $\hat{L}(Q)$ and $\sum_{i=1}^{d} n_i\bar{I}(X_i;C_i)$, which can be written as follows: $F(Q,\beta) = \beta N\hat{L}(Q) + \sum_{i=1}^{d} n_i\bar{I}(X_i;C_i)$, (40) where $\beta$ is the trade-off parameter. [sent-1009, score-0.434]
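
A sketch of the trade-off objective (40) for d = 2, reusing bar_mutual_information from above; the empirical loss $\hat{L}(Q)$ is assumed to be computed elsewhere over the observed entries:

```python
def tradeoff_objective(beta, N, empirical_loss, q1, q2):
    """F(Q, beta) = beta * N * L_hat(Q) + sum_i n_i * bar-I(X_i; C_i), for d = 2."""
    n1, n2 = q1.shape[0], q2.shape[0]
    regularizer = n1 * bar_mutual_information(q1) + n2 * bar_mutual_information(q2)
    return beta * N * empirical_loss + regularizer
```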

77 p l(y, y ′ ) = ∂q(c1 |x1 ) y,y′ x2 ,c2 ¯ Recall that I(Xi ; Ci ) = ¯ tive of ni I(Xi ; Ci ) is: 1 ni q(ci |xi ) xi ,ci q(ci |xi ) ln q (ci ) ¯ and q (ci ) = ¯ 1 ni xi q(ci |xi ). [sent-1026, score-1.056]

78 2 Minimization of the PAC-Bayesian Bound for Density Estimation. Similar to the PAC-Bayesian bound for discriminative prediction, the PAC-Bayesian bound for density estimation (36) depends on the trade-off: $G(Q,\beta) = -\beta N\, I(\hat{p}_Q(c_1,c_2)) + \sum_{i=1}^{2} n_i\bar{I}(X_i;C_i)$. [sent-1045, score-0.428]

79 , each xi is deterministically assigned to a single ci ). [sent-1056, score-0.407]

80 Compute G(Q, β) for each possible assignment of xi to ci ∈ {1, . [sent-1062, score-0.407]

81 , mi } Reassign xi to ci such that G(Q, β) is minimized. [sent-1064, score-0.477]
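
A schematic version of the reassignment step in sentences 79–81: with hard assignments, each point is moved to whichever cluster value minimizes the objective, and sweeps are repeated until nothing moves. The objective callable stands in for G(Q, β) (or F(Q, β)); this is a sketch, not the paper's algorithm:

```python
def greedy_reassign(assignment, num_clusters, objective):
    """assignment: list mapping each x to its current cluster; objective(assignment) -> float."""
    improved = True
    while improved:
        improved = False
        for x in range(len(assignment)):
            current = assignment[x]
            best_c, best_val = current, objective(assignment)
            for c in range(num_clusters):
                assignment[x] = c                 # tentatively move x to cluster c
                value = objective(assignment)
                if value < best_val:
                    best_c, best_val = c, value
            assignment[x] = best_c                # keep the best move (or stay put)
            if best_c != current:
                improved = True
    return assignment
```

Each accepted move can only lower the objective, which is why this sequential scheme reaches a local minimum, as sentence 97 later notes for graph clustering.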

82 In the MDL formulation the co-clustering solutions are evaluated by the total description length, which includes the length of the description of assignments of xi -s to ci -s together with the length of the description of the ratings given the assignments. [sent-1095, score-0.479]

83 (2007) only hard (deterministic) assignments of xi -s to ci -s were considered. [sent-1098, score-0.46]

84 • Algorithm 1 considers soft assignments of xi -s to ci -s. [sent-1106, score-0.44]

85 For clustering into 13x6 clusters prediction performance of MAE equal to 0. [sent-1279, score-0.349]

86 5; for clustering into 50x50 clusters prediction performance of 0. [sent-1285, score-0.349]

87 7; and for clustering into 283x283 clusters the optimal prediction performance of slightly below 0. [sent-1292, score-0.349]
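
The figures in sentences 85–87 are mean absolute error (MAE) values for the MovieLens task; for reference, a trivial sketch of the metric (treating the evaluated entries as a flat list is my assumption):

```python
def mean_absolute_error(predicted, observed):
    """MAE over a flat list of (viewer, movie) rating predictions and their true values."""
    assert len(predicted) == len(observed)
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)
```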

88 Note that the assignment of xi -s to ci -s remains stochastic. [sent-1305, score-0.407]

89 (2007) the matrices Q1 and Q2 are restricted to deterministic assignments of xi -s to ci -s (the entries of Q1 and Q2 are in {0, 1}), whereas in the factorization proposed here Q1 and Q2 are stochastic matrices and F is arbitrary. [sent-1315, score-0.469]
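
Under the factorization reading of sentence 89, the expected prediction for a (viewer, movie) pair is $\sum_{c_1,c_2} q(c_1|x_1)\,q(c_2|x_2)\,F[c_1,c_2]$, i.e. the matrix product Q1 F Q2ᵀ with stochastic Q1 and Q2. A sketch (shapes and names are my assumption):

```python
import numpy as np

def reconstruct_ratings(Q1, F, Q2):
    """Q1: (n1, m1) rows q(c1|x1); Q2: (n2, m2) rows q(c2|x2); F: (m1, m2) cell values.
    Returns the n1 x n2 matrix of expected predictions for every (x1, x2) pair."""
    return Q1 @ F @ Q2.T
```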

90 In other words (see (46)–(48)), the clustering of viewers into clusters of viewers is shared between the factorizations of $A_1$ and $A_2$, and the clustering of movies into clusters of movies is shared between the factorizations of $A_2$ and $A_3$. [sent-1338, score-0.802]

91 We remind the reader that Algorithm 2 operates with deterministic assignments of xi -s to clusters ci -s. [sent-1357, score-0.541]

92 In this case the hypothesis space is the space of all hard partitions of $x_i$-s to $c_i$-s and of the pairs $c_{2i-1}, c_{2i}$ to $d_i$-s. [sent-1382, score-0.538]

93 By repeating the analysis in Theorem 7 we obtain that with probability greater than $1-\delta$: $\mathrm{kl}(\hat{L}(Q)\,\|\,L(Q)) \le \frac{B_1 + B_2 + |\mathcal{D}_1||\mathcal{D}_2|\ln|\mathcal{Y}| + \frac{1}{2}\ln(4N) - \ln\delta}{N}$, (51) where $B_1 = \sum_{i=1}^{4}\left(n_i\bar{I}(X_i;C_i) + m_i\ln n_i\right)$ and $B_2 = \sum_{i=1}^{2}\left((m_{2i-1}m_{2i})\bar{I}(D_i;C_{2i-1},C_{2i}) + |\mathcal{D}_i|\ln(m_{2i-1}m_{2i})\right)$. [sent-1383, score-1.155]

94 The above approach shares the same basic principle already discussed in the collaborative filtering task: by clustering together similar pairs (and then quadruples) of amino acids we increase the statistical reliability of the observations, but reduce the resolution at which we process the data. [sent-1402, score-0.337]

95 Comparing the different approaches is usually a painful task, mainly because the goal of each of these clustering methods is formulated in terms of the solution: most clustering methods start by defining some objective functional and then minimize it. [sent-1425, score-0.398]

96 sample $S$ of size $N$ according to $p$, for all graph clustering models defined by $Q = \{q(c|x), q(w|c_1,c_2)\}$: $\mathrm{kl}(\hat{L}(Q)\,\|\,L(Q)) \le \frac{n\bar{I}(X;C) + m\ln n + m^2\ln|\mathcal{W}| + \frac{1}{2}\ln(4N) - \ln\delta}{N}$, (53) where $m = |\mathcal{C}|$, $n = |\mathcal{X}|$, and $|\mathcal{W}|$ is the number of distinct edge weights. [sent-1455, score-0.921]
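
A sketch that evaluates the right-hand side of bound (53) and combines it with the numerical kl inverse from earlier; the inputs (a soft node-to-cluster matrix, the number of sampled edge weights N, the number of distinct weights, δ, and a loss in [0, 1]) are assumptions about how the model is represented:

```python
import math

def graph_clustering_bound(empirical_loss, q_c_given_x, N, num_weights, delta):
    """Upper-bounds L(Q) for weighted graph clustering via (53)."""
    n, m = q_c_given_x.shape
    eps = (n * bar_mutual_information(q_c_given_x)
           + m * math.log(n)
           + m * m * math.log(num_weights)
           + 0.5 * math.log(4 * N)
           - math.log(delta)) / N
    return kl_inverse(empirical_loss, eps)
```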

97 Unlike alternating projections, sequential minimization (similar to the one in Algorithm 2) is guaranteed to converge to a local minimum, but can operate with deterministic assignments of graph nodes xi -s to the clusters ci -s only. [sent-1467, score-0.605]

98 Finally, we prove the lower bound (16): −EpQ (z) ln pQ (Z) = E[ˆQ (z)−pQ (z)] ln pQ (Z) − EpQ (z) ln pQ (Z) ˜ ˜ ˜ p ˆ ≥− 1 + γ|Z| + H(ˆQ (z)) p γ |Z|(1 + ε/2) . [sent-1554, score-0.699]

99 ε/2 ln ε/2 1 pQ (z) − pQ (z) ˆ 2 ≥ H(ˆQ (z)) − p 1 ln Appendix B. [sent-1555, score-0.434]

100 Hence, we have $L(Q) \le \mathrm{kl}^{-1}\!\left(\hat{L}(Q) + 2\Delta + \Delta^2,\ \frac{\sum_{i=1}^{d} n_i\bar{I}(X_i;C_i) + K}{N}\right)$, where $K$, defined previously in (38), becomes: $K = \sum_{i=1}^{d} m_i\ln n_i - M\ln\Delta + \frac{1}{2}\ln(4N) - \ln\delta$. [sent-1566, score-1.155]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('ph', 0.314), ('ci', 0.313), ('xd', 0.302), ('pq', 0.295), ('seldin', 0.234), ('ni', 0.217), ('ln', 0.217), ('clustering', 0.199), ('cd', 0.174), ('kl', 0.171), ('eldin', 0.16), ('ishby', 0.16), ('eyond', 0.154), ('eq', 0.147), ('ep', 0.118), ('qt', 0.103), ('clusters', 0.101), ('slonim', 0.1), ('nalysis', 0.096), ('mae', 0.096), ('xi', 0.094), ('tishby', 0.09), ('hm', 0.083), ('ch', 0.08), ('cluster', 0.078), ('naftali', 0.074), ('partition', 0.074), ('mi', 0.07), ('hypothesis', 0.07), ('discriminative', 0.069), ('movielens', 0.068), ('en', 0.063), ('epq', 0.061), ('hist', 0.061), ('viewers', 0.058), ('dhillon', 0.053), ('mutual', 0.052), ('collaborative', 0.049), ('eph', 0.049), ('prediction', 0.049), ('bound', 0.048), ('banerjee', 0.047), ('amino', 0.047), ('density', 0.046), ('noam', 0.043), ('yevgeny', 0.043), ('movies', 0.043), ('grid', 0.042), ('acids', 0.042), ('pac', 0.042), ('randomized', 0.041), ('partitions', 0.041), ('mdl', 0.041), ('ratings', 0.039), ('thomas', 0.039), ('bounds', 0.037), ('graph', 0.036), ('ltering', 0.036), ('prior', 0.036), ('edge', 0.035), ('theorem', 0.035), ('unbalanced', 0.035), ('quantization', 0.035), ('generalization', 0.034), ('cardinality', 0.034), ('yoo', 0.033), ('assignments', 0.033), ('choi', 0.032), ('krupka', 0.031), ('pairwise', 0.03), ('graphical', 0.03), ('factorization', 0.029), ('minimization', 0.028), ('balanced', 0.027), ('nips', 0.027), ('langford', 0.026), ('inderjit', 0.026), ('matrix', 0.026), ('depth', 0.026), ('assign', 0.025), ('trees', 0.025), ('biclustering', 0.025), ('coclustering', 0.025), ('herlocker', 0.025), ('pacbayesian', 0.025), ('cover', 0.024), ('mcallester', 0.024), ('cells', 0.024), ('germain', 0.024), ('viewer', 0.024), ('hypotheses', 0.024), ('loss', 0.022), ('stability', 0.022), ('luxburg', 0.022), ('ding', 0.02), ('distortion', 0.02), ('jerusalem', 0.02), ('movie', 0.02), ('unsupervised', 0.02), ('hard', 0.02)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999952 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily

2 0.18052708 55 jmlr-2010-Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

Author: Nguyen Xuan Vinh, Julien Epps, James Bailey

Abstract: Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0, 1] range better than other normalized variants. Keywords: clustering comparison, information theory, adjustment for chance, normalized information distance

3 0.15054144 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

Author: Liva Ralaivola, Marie Szafranski, Guillaume Stempfel

Abstract: PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and it is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first—to the best of our knowledge—PAC-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph that encodes the dependencies within the data, in sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to get new PAC-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as AUC) and classifiers trained on data distributed according to a stationary β-mixing process. In the way, we show how our approach seamlessly allows us to deal with U-processes. As a side note, we also provide a PAC-Bayes generalization bound for classifiers learned on data from stationary ϕ-mixing distributions. Keywords: PAC-Bayes bounds, non IID data, ranking, U-statistics, mixing processes c 2010 Liva Ralaivola, Marie Szafranski and Guillaume Stempfel. R ALAIVOLA , S ZAFRANSKI AND S TEMPFEL

4 0.12142134 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

Author: Ido Cohn, Tal El-Hay, Nir Friedman, Raz Kupferman

Abstract: Continuous-time Bayesian networks is a natural structured representation language for multicomponent stochastic processes that evolve continuously over time. Despite the compact representation provided by this language, inference in such models is intractable even in relatively simple structured networks. We introduce a mean field variational approximation in which we use a product of inhomogeneous Markov processes to approximate a joint distribution over trajectories. This variational approach leads to a globally consistent distribution, which can be efficiently queried. Additionally, it provides a lower bound on the probability of observations, thus making it attractive for learning tasks. Here we describe the theoretical foundations for the approximation, an efficient implementation that exploits the wide range of highly optimized ordinary differential equations (ODE) solvers, experimentally explore characterizations of processes for which this approximation is suitable, and show applications to a large-scale real-world inference problem. Keywords: continuous time Markov processes, continuous time Bayesian networks, variational approximations, mean field approximation

5 0.080549583 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

Author: Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar

Abstract: We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.1 Keywords: posterior regularization framework, unsupervised learning, latent variables models, prior knowledge, natural language processing

6 0.075460985 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

7 0.07131999 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference

8 0.061762299 10 jmlr-2010-An Exponential Model for Infinite Rankings

9 0.059946936 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

10 0.057284676 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

11 0.05565061 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes

12 0.054091331 44 jmlr-2010-Graph Kernels

13 0.05353846 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

14 0.052622471 33 jmlr-2010-Efficient Heuristics for Discriminative Structure Learning of Bayesian Network Classifiers

15 0.052024718 78 jmlr-2010-Model Selection: Beyond the Bayesian Frequentist Divide

16 0.051762886 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

17 0.051379044 90 jmlr-2010-Permutation Tests for Studying Classifier Performance

18 0.050631057 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

19 0.048812505 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

20 0.046003897 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.254), (1, -0.019), (2, -0.07), (3, -0.032), (4, -0.063), (5, 0.01), (6, -0.012), (7, 0.156), (8, 0.011), (9, 0.21), (10, -0.14), (11, 0.15), (12, 0.107), (13, 0.071), (14, -0.02), (15, -0.227), (16, -0.217), (17, -0.171), (18, 0.239), (19, -0.095), (20, -0.125), (21, 0.027), (22, -0.159), (23, -0.156), (24, 0.075), (25, -0.096), (26, -0.004), (27, 0.08), (28, -0.016), (29, 0.08), (30, -0.103), (31, 0.097), (32, -0.043), (33, 0.062), (34, -0.053), (35, 0.004), (36, -0.049), (37, -0.076), (38, -0.096), (39, -0.041), (40, -0.104), (41, 0.008), (42, -0.022), (43, 0.073), (44, -0.044), (45, 0.039), (46, -0.097), (47, -0.017), (48, 0.027), (49, 0.027)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95782 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily

2 0.76087052 55 jmlr-2010-Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

Author: Nguyen Xuan Vinh, Julien Epps, James Bailey

Abstract: Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic based measures, we advocate the normalized information distance (NID) as a general measure of choice, for it possesses concurrently several important properties, such as being both a metric and a normalized measure, admitting an exact analytical adjusted-for-chance form, and using the nominal [0, 1] range better than other normalized variants. Keywords: clustering comparison, information theory, adjustment for chance, normalized information distance

3 0.70276368 20 jmlr-2010-Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary β-Mixing Processes

Author: Liva Ralaivola, Marie Szafranski, Guillaume Stempfel

Abstract: PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, and it is particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first—to the best of our knowledge—PAC-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph that encodes the dependencies within the data, in sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to get new PAC-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as AUC) and classifiers trained on data distributed according to a stationary β-mixing process. In the way, we show how our approach seamlessly allows us to deal with U-processes. As a side note, we also provide a PAC-Bayes generalization bound for classifiers learned on data from stationary ϕ-mixing distributions. Keywords: PAC-Bayes bounds, non IID data, ranking, U-statistics, mixing processes c 2010 Liva Ralaivola, Marie Szafranski and Guillaume Stempfel. R ALAIVOLA , S ZAFRANSKI AND S TEMPFEL

4 0.4756259 32 jmlr-2010-Efficient Algorithms for Conditional Independence Inference

Author: Remco Bouckaert, Raymond Hemmecke, Silvia Lindner, Milan Studený

Abstract: The topic of the paper is computer testing of (probabilistic) conditional independence (CI) implications by an algebraic method of structural imsets. The basic idea is to transform (sets of) CI statements into certain integral vectors and to verify by a computer the corresponding algebraic relation between the vectors, called the independence implication. We interpret the previous methods for computer testing of this implication from the point of view of polyhedral geometry. However, the main contribution of the paper is a new method, based on linear programming (LP). The new method overcomes the limitation of former methods to the number of involved variables. We recall/describe the theoretical basis for all four methods involved in our computational experiments, whose aim was to compare the efficiency of the algorithms. The experiments show that the LP method is clearly the fastest one. As an example of possible application of such algorithms we show that testing inclusion of Bayesian network structures or whether a CI statement is encoded in an acyclic directed graph can be done by the algebraic method. Keywords: conditional independence inference, linear programming approach

5 0.39855197 19 jmlr-2010-Characterization, Stability and Convergence of Hierarchical Clustering Methods

Author: Gunnar Carlsson, Facundo Mémoli

Abstract: We study hierarchical clustering schemes under an axiomatic view. We show that within this framework, one can prove a theorem analogous to one of Kleinberg (2002), in which one obtains an existence and uniqueness theorem instead of a non-existence result. We explore further properties of this unique scheme: stability and convergence are established. We represent dendrograms as ultrametric spaces and use tools from metric geometry, namely the Gromov-Hausdorff distance, to quantify the degree to which perturbations in the input metric space affect the result of hierarchical methods. Keywords: clustering, hierarchical clustering, stability of clustering, Gromov-Hausdorff distance

6 0.32792622 75 jmlr-2010-Mean Field Variational Approximation for Continuous-Time Bayesian Networks

7 0.32367817 10 jmlr-2010-An Exponential Model for Infinite Rankings

8 0.30059594 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

9 0.27689201 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

10 0.27495366 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

11 0.2726351 81 jmlr-2010-On Finding Predictors for Arbitrary Families of Processes

12 0.24465932 85 jmlr-2010-On the Foundations of Noise-free Selective Classification

13 0.22609058 7 jmlr-2010-A Streaming Parallel Decision Tree Algorithm

14 0.22343417 72 jmlr-2010-Matrix Completion from Noisy Entries

15 0.22226705 21 jmlr-2010-Classification Methods with Reject Option Based on Convex Risk Minimization

16 0.21630599 57 jmlr-2010-Iterative Scaling and Coordinate Descent Methods for Maximum Entropy Models

17 0.21245073 42 jmlr-2010-Generalized Expectation Criteria for Semi-Supervised Learning with Weakly Labeled Data

18 0.19622058 27 jmlr-2010-Consistent Nonparametric Tests of Independence

19 0.19382857 62 jmlr-2010-Learning Gradients: Predictive Models that Infer Geometry and Statistical Dependence

20 0.19374995 106 jmlr-2010-Stability Bounds for Stationary φ-mixing and β-mixing Processes


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(1, 0.012), (3, 0.035), (4, 0.033), (8, 0.021), (9, 0.274), (15, 0.015), (21, 0.021), (24, 0.018), (32, 0.077), (36, 0.059), (37, 0.061), (75, 0.116), (81, 0.028), (85, 0.069), (96, 0.042), (97, 0.011)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.73447692 89 jmlr-2010-PAC-Bayesian Analysis of Co-clustering and Beyond

Author: Yevgeny Seldin, Naftali Tishby

Abstract: We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering.1 We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish among two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved stateof-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization and the results provide generalization bounds, regularization terms, and new algorithms for this form of matrix factorization. The analysis of co-clustering is extended to tree-shaped graphical models, which can be used to analyze high dimensional tensors. According to the bounds, the generalization abilities of treeshaped graphical models depend on a trade-off between their empirical data fit and the mutual information that is propagated up the tree levels. We also formulate weighted graph clustering as a prediction problem: given a subset of edge weights we analyze the ability of graph clustering to predict the remaining edge weights. The analysis of co-clustering easily

2 0.51800257 114 jmlr-2010-Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels

Author: Pinar Donmez, Guy Lebanon, Krishnakumar Balasubramanian

Abstract: Estimating the error rates of classifiers or regression models is a fundamental task in machine learning which has thus far been studied exclusively using supervised learning techniques. We propose a novel unsupervised framework for estimating these error rates using only unlabeled data and mild assumptions. We prove consistency results for the framework and demonstrate its practical applicability on both synthetic and real world data. Keywords: classification and regression, maximum likelihood, latent variable models

3 0.51707256 46 jmlr-2010-High Dimensional Inverse Covariance Matrix Estimation via Linear Programming

Author: Ming Yuan

Abstract: This paper considers the problem of estimating a high dimensional inverse covariance matrix that can be well approximated by “sparse” matrices. Taking advantage of the connection between multivariate linear regression and entries of the inverse covariance matrix, we propose an estimating procedure that can effectively exploit such “sparsity”. The proposed method can be computed using linear programming and therefore has the potential to be used in very high dimensional problems. Oracle inequalities are established for the estimation error in terms of several operator norms, showing that the method is adaptive to different types of sparsity of the problem. Keywords: covariance selection, Dantzig selector, Gaussian graphical model, inverse covariance matrix, Lasso, linear programming, oracle inequality, sparsity

4 0.51239961 49 jmlr-2010-Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data

Author: Miloš Radovanović, Alexandros Nanopoulos, Mirjana Ivanović

Abstract: Different aspects of the curse of dimensionality are known to present serious challenges to various machine-learning methods and tasks. This paper explores a new aspect of the dimensionality curse, referred to as hubness, that affects the distribution of k-occurrences: the number of times a point appears among the k nearest neighbors of other points in a data set. Through theoretical and empirical analysis involving synthetic and real data sets we show that under commonly used assumptions this distribution becomes considerably skewed as dimensionality increases, causing the emergence of hubs, that is, points with very high k-occurrences which effectively represent “popular” nearest neighbors. We examine the origins of this phenomenon, showing that it is an inherent property of data distributions in high-dimensional vector space, discuss its interaction with dimensionality reduction, and explore its influence on a wide range of machine-learning tasks directly or indirectly based on measuring distances, belonging to supervised, semi-supervised, and unsupervised learning families. Keywords: nearest neighbors, curse of dimensionality, classification, semi-supervised learning, clustering

5 0.51172209 103 jmlr-2010-Sparse Semi-supervised Learning Using Conjugate Functions

Author: Shiliang Sun, John Shawe-Taylor

Abstract: In this paper, we propose a general framework for sparse semi-supervised learning, which concerns using a small portion of unlabeled data and a few labeled data to represent target functions and thus has the merit of accelerating function evaluations when predicting the output of a new example. This framework makes use of Fenchel-Legendre conjugates to rewrite a convex insensitive loss involving a regularization with unlabeled data, and is applicable to a family of semi-supervised learning methods such as multi-view co-regularized least squares and single-view Laplacian support vector machines (SVMs). As an instantiation of this framework, we propose sparse multi-view SVMs which use a squared ε-insensitive loss. The resultant optimization is an inf-sup problem and the optimal solutions have arguably saddle-point properties. We present a globally optimal iterative algorithm to optimize the problem. We give the margin bound on the generalization error of the sparse multi-view SVMs, and derive the empirical Rademacher complexity for the induced function class. Experiments on artificial and real-world data show their effectiveness. We further give a sequential training approach to show their possibility and potential for uses in large-scale problems and provide encouraging experimental results indicating the efficacy of the margin bound and empirical Rademacher complexity on characterizing the roles of unlabeled data for semi-supervised learning. Keywords: semi-supervised learning, Fenchel-Legendre conjugate, representer theorem, multiview regularization, support vector machine, statistical learning theory

6 0.51077884 74 jmlr-2010-Maximum Relative Margin and Data-Dependent Regularization

7 0.50813723 102 jmlr-2010-Semi-Supervised Novelty Detection

8 0.50801206 109 jmlr-2010-Stochastic Composite Likelihood

9 0.50521344 59 jmlr-2010-Large Scale Online Learning of Image Similarity Through Ranking

10 0.50034928 91 jmlr-2010-Posterior Regularization for Structured Latent Variable Models

11 0.49721876 92 jmlr-2010-Practical Approaches to Principal Component Analysis in the Presence of Missing Values

12 0.4965435 107 jmlr-2010-Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion

13 0.49630249 95 jmlr-2010-Rademacher Complexities and Bounding the Excess Risk in Active Learning

14 0.49625769 16 jmlr-2010-Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory

15 0.49588168 63 jmlr-2010-Learning Instance-Specific Predictive Models

16 0.49524215 111 jmlr-2010-Topology Selection in Graphical Models of Autoregressive Processes

17 0.49406153 69 jmlr-2010-Lp-Nested Symmetric Distributions

18 0.49281421 54 jmlr-2010-Information Retrieval Perspective to Nonlinear Dimensionality Reduction for Data Visualization

19 0.49260065 60 jmlr-2010-Learnability, Stability and Uniform Convergence

20 0.49186471 105 jmlr-2010-Spectral Regularization Algorithms for Learning Large Incomplete Matrices