jmlr jmlr2011 jmlr2011-84 knowledge-graph by maker-knowledge-mining

84 jmlr-2011-Semi-Supervised Learning with Measure Propagation


Source: pdf

Author: Amarnag Subramanya, Jeff Bilmes

Abstract: We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets, for example, in one instance we use a data set with over 10^8 samples. In this context, we propose a graph node ordering algorithm that is also applicable to other graph-based semi-supervised learning approaches. We compare the proposed approach against other standard semi-supervised learning algorithms on the semi-supervised learning benchmark data sets (Chapelle et al., 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. In each case, the proposed approach outperforms the state-of-the-art. Lastly, we show that our objective can be generalized into a form that includes the standard squared-error loss, and we prove a geometric rate of convergence in that case. Keywords: graph-based semi-supervised learning, transductive inference, large-scale semi-supervised learning, non-parametric models

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 , 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. [sent-11, score-0.211]

2 , 2005), label propagation (Zhu and Ghahramani, 2002a), and spectral graph transduction (Joachims, 2003) on a variety of tasks ranging from synthetic data sets to SSL benchmark data sets (Chapelle et al. [sent-92, score-0.154]

3 A useful byproduct of this experiment is the semi-supervised Switchboard transcription project (S3TP), which consists of phone-level annotations of the Switchboard-I corpus generated in a semi-supervised manner (see Section 8. [sent-100, score-0.15]

4 Graph Construction: Let $D_l = \{(x_i, r_i)\}_{i=1}^{l}$ be the set of labeled samples, $D_u = \{x_i\}_{i=l+1}^{l+u}$ the set of unlabeled samples, and $D \triangleq \{D_l, D_u\}$. [sent-118, score-0.248]

5 Here ri is an encoding of the labeled data and will be explained shortly. [sent-119, score-0.198]

6 Proposed Approach for Graph-based Semi-Supervised Learning: For each $i \in V$ and $j \in V_l$, we define discrete probability measures $p_i$ and $r_j$, respectively, over the measurable space $(\mathrm{Y}, \mathcal{Y})$. [sent-142, score-0.414]

7 That is, for each vertex in the graph we define a measure $p_i$, and for all the labeled vertices, in addition to the $p$'s, we also define $r_i$ (recall, $D_l = \{(x_i, r_i)\}_{i=1}^{l}$). [sent-143, score-0.779]

8 As we only consider classification problems here, pi and ri are in essence multinomial distributions and so pi (y) represents the probability that the sample represented by vertex i belongs to class y. [sent-146, score-0.995]

9 As can be seen, the ri ’s can handle a wide variety of inputs ranging from the most certain case where a single input yields a single output to cases where there is an uncertainty associated with the output for a given input. [sent-153, score-0.144]

10 As $p_i$ and $r_i$ are discrete probability measures, we have that $\sum_y p_i(y) = 1$, $p_i(y) \ge 0$, $\sum_y r_i(y) = 1$, and $r_i(y) \ge 0$. [sent-157, score-1.674]

11 In other words, $p_i$ and $r_i$ lie within a $|\mathrm{Y}|$-dimensional probability simplex, which we represent using $\triangle^{|\mathrm{Y}|}$, and so $p_i, r_i \in \triangle^{|\mathrm{Y}|}$ (henceforth denoted as $\triangle$). [sent-158, score-1.116]

12 Consider the optimization problem $\mathcal{P}_{KL}: \min_{p \in \triangle^m} C_{KL}(p)$ where $C_{KL}(p) = \sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i=1}^{m} \sum_{j \in \mathcal{N}(i)} w_{ij} D_{KL}(p_i \| p_j) - \nu \sum_{i=1}^{n} H(p_i)$. [sent-172, score-0.615]

13 Here $H(p) = -\sum_y p(y) \log p(y)$ is the Shannon entropy of $p$, and $D_{KL}(p_i \| q_j)$ is the KLD between measures $p_i$ and $q_j$, given by $D_{KL}(p \| q) = \sum_y p(y) \log \frac{p(y)}{q(y)}$. [sent-173, score-0.501]
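
A minimal NumPy sketch, not taken from the paper, of how $C_{KL}(p)$ could be evaluated; the dense weight matrix, array shapes, and function name are illustrative assumptions.

```python
import numpy as np

def c_kl(p, r, W, mu, nu, eps=1e-12):
    """Evaluate C_KL(p) for row-stochastic p (n x K), labels r (l x K),
    and a symmetric weight matrix W (n x n, zero for non-neighbors)."""
    l = r.shape[0]
    logp = np.log(p + eps)

    # sum_{i=1}^{l} D_KL(r_i || p_i)
    labeled = np.sum(r * (np.log(r + eps) - logp[:l]))

    # mu * sum_i sum_{j in N(i)} w_ij D_KL(p_i || p_j)
    kl = np.einsum('ik,ik->i', p, logp)[:, None] - p @ logp.T   # kl[i, j] = D_KL(p_i || p_j)
    smoothness = mu * np.sum(W * kl)

    # -nu * sum_i H(p_i), with H(p_i) = -sum_y p_i(y) log p_i(y)
    neg_entropy = nu * np.sum(p * logp)

    return labeled + smoothness + neg_entropy
```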

14 Given a vertex i ∈ V , N (i) denotes the set of neighbors of the vertex in the graph corresponding to wi j and thus |N (i)| represents vertex i’s degree. [sent-175, score-0.199]

15 The first term of CKL will penalize the solution pi , i ∈ {1, . [sent-181, score-0.414]

16 , l}, when it is far away from the labeled training data Dl , but it does not insist that pi = ri , as allowing for deviations from ri can help especially with noisy labels (Bengio et al. [sent-184, score-0.756]

17 If wi j is large, we prefer a solution in which pi and p j are close in the KLD sense. [sent-188, score-0.471]

18 Proof As we have an undirected graph, $W$ is symmetric, that is, $w_{ij} = w_{ji}$, and for every $w_{ij} D_{KL}(p_i \| p_j)$ we also have $w_{ji} D_{KL}(p_j \| p_i)$. [sent-194, score-0.158]

19 The last term encourages each pi to be close to the uniform distribution (i. [sent-195, score-0.433]

20 For example, consider the case where a part of the graph is almost completely disconnected from any labeled vertex—that is, a “pendant” graph component. [sent-199, score-0.14]

21 More generally, we conjecture that by maximizing the entropy of each pi , the classifier has a better chance of producing high entropy results in graph regions of low confidence (e. [sent-202, score-0.531]

22 Finally, while the second graph-regularizer term encourages high-entropy solutions for nodes that have high entropy neighbors, the graph regularizer alone is insufficient to yield high-entropy solutions in other cases where it may be desirable. [sent-208, score-0.142]

23 There can be cases, however, where more uncertainty should be expressed about such a large mass of unlabeled nodes distantly situated from the nearest labeled node. [sent-212, score-0.128]

24 Label uncertainty: The objective is capable of handling uncertainty in the labels (encoded using ri ) (Pearl, 1990). [sent-223, score-0.184]

25 First note that $C_{KL}(p)$ may be rewritten as $C_{KL}(p) = \sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i,j} w_{ij} D_{KL}(p_i \| p_j) + \nu \sum_i D_{KL}(p_i \| u)$, where $u$ is the uniform measure. [sent-228, score-1.029]

26 Now if we replace the uniform measure, u, in the above by p0 then we are asking for each pi to be close to p0 . [sent-230, score-0.414]

27 $p_i(y)$ is of the form $k_1\, p_i(y) \log p_i(y) + k_2\, p_i(y) + k_3$ ($k_1, k_2, k_3$ are constants). [sent-245, score-1.681]

28 It can be shown that the update equations for $p_i(y)$ for solving $\mathcal{P}_{KL}$ using MOM are given by (see Appendix A for details) $p_i^{(n)}(y) = \big[\, p_i^{(n-1)}(y) - \alpha^{(n-1)} \frac{\partial L_{C_{KL}}(p, \Lambda)}{\partial p_i(y)} \big|_{p=p^{(n-1)},\, \Lambda=\Lambda^{(n-1)},\, c=c^{(n-1)}} \,\big]_{+}$ where $n = 1, \ldots$ [sent-256, score-1.242]

29 , is the iteration index, $\alpha^{(n-1)}$ is the learning rate, which is determined using the Armijo rule (Bertsekas, 1999), $[x]_{+} = \max(x, 0)$, and $\frac{\partial L_{C_{KL}}(p, \Lambda)}{\partial p_i(y)} = \mu \sum_{j \in \mathcal{N}(i)} \big[ w_{ij}\big(1 + \log p_i(y) - \log p_j(y)\big) - \frac{w_{ji}\, p_j(y)}{p_i(y)} \big] - \delta(i \le l)\,\frac{r_i(y)}{p_i(y)} + \nu\big(\log p_i(y) + 1\big) + \lambda_i + 2c\big(1 - \sum_y p_i(y)\big)$. [sent-259, score-2.264]

30 Lack of intuition in update equations: While the update equations for pi (y) are easy to obtain, they lack an intuitive explanation. [sent-275, score-0.414]

31 Here the $q_i$'s play a similar role to the $p_i$'s and can potentially be used to obtain a final classification result ($\operatorname{argmax}_y q_i(y)$). [sent-333, score-0.626]

32 Thus, it would seem that we now have two classification results for each sample – one the most likely assignment according to pi and another given by qi . [sent-334, score-0.52]

33 However, CMP includes terms of the form (wii + α)DKL (pi ||qi ) which encourage pi and qi to be close to each other. [sent-335, score-0.52]

34 Thus α, which is a hyper-parameter, plays an important role in ensuring that pi = qi , ∀ i. [sent-336, score-0.52]

35 In the following we will show that there exists a finite $\alpha$ such that at a minimum, $p_i(y) = q_i(y)\ \forall\, i, y$ (henceforth we will denote this as either $p_i = q_i\ \forall\, i$ or $p = q$). [sent-338, score-1.04]

36 While the first term encourages qi for the labeled vertices to be close to the labels, ri , the last term encourages higher entropy p’s. [sent-340, score-0.379]

37 The AM updates (see Appendix E for the derivation) are given by $p_i^{(n)}(y) = \frac{\exp\{\frac{\mu}{\gamma_i} \sum_j w'_{ij} \log q_j^{(n-1)}(y)\}}{\sum_{y'} \exp\{\frac{\mu}{\gamma_i} \sum_j w'_{ij} \log q_j^{(n-1)}(y')\}}$ and $q_i^{(n)}(y) = \frac{r_i(y)\,\delta(i \le l) + \mu \sum_j w'_{ji}\, p_j^{(n)}(y)}{\delta(i \le l) + \mu \sum_j w'_{ji}}$, where $\gamma_i = \nu + \mu \sum_j w'_{ij}$. [sent-385, score-0.815]
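
A hedged sketch, assuming dense NumPy arrays and a uniform initialization, of how these two alternating-minimization steps could be implemented; the function name, the modified weight matrix W_prime, and the iteration count are illustrative choices, not the authors' code.

```python
import numpy as np

def mp_alternating_minimization(r, W_prime, mu, nu, l, n_iters=50, eps=1e-12):
    """Alternate the p-step and q-step of measure propagation.
    r: (n, K) label encodings (only the first l rows are used);
    W_prime: (n, n) modified weights w'_ij; l: number of labeled vertices."""
    n, K = r.shape
    labeled = (np.arange(n) < l).astype(float)[:, None]          # delta(i <= l)
    gamma = nu + mu * W_prime.sum(axis=1, keepdims=True)         # gamma_i = nu + mu * sum_j w'_ij
    q = np.full((n, K), 1.0 / K)                                 # start q at the uniform measure

    for _ in range(n_iters):
        # p-step: p_i(y) proportional to exp{ (mu/gamma_i) * sum_j w'_ij log q_j(y) }
        log_p = (mu / gamma) * (W_prime @ np.log(q + eps))
        log_p -= log_p.max(axis=1, keepdims=True)                # shift for numerical stability
        p = np.exp(log_p)
        p /= p.sum(axis=1, keepdims=True)

        # q-step: q_i(y) = (r_i(y) delta(i<=l) + mu sum_j w'_ji p_j(y)) / (delta(i<=l) + mu sum_j w'_ji)
        num = r * labeled + mu * (W_prime.T @ p)
        den = labeled + mu * W_prime.sum(axis=0)[:, None]
        q = num / den

    return p, q
```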

38 Theorem 9 (Test for convergence, see Appendix D) If $\{(p^{(n)}, q^{(n)})\}_{n=1}^{\infty}$ is generated by AM of $C_{MP}(p, q)$ and $C_{MP}(p^*, q^*) \triangleq \inf_{p,q \in \triangle^n} C_{MP}(p, q)$, then $C_{MP}(p^{(n)}, q^{(n)}) - C_{MP}(p^*, q^*) \le \sum_{i=1}^{n} \big(\delta(i \le l) + d_i\big)\,\beta_i$, where $\beta_i \triangleq \log \sup_y \frac{q_i^{(n)}(y)}{q_i^{(n-1)}(y)}$ and $d_j = \sum_i w_{ij}$. [sent-390, score-0.294]
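
The bound in Theorem 9 can be monitored numerically; below is a small illustrative helper (assumed names, symmetric W) that computes the right-hand side from two successive q iterates.

```python
import numpy as np

def am_convergence_bound(q_new, q_old, W, l, eps=1e-12):
    """Upper bound sum_i (delta(i<=l) + d_i) * beta_i, with
    beta_i = log sup_y q_i^(n)(y) / q_i^(n-1)(y) and d_i = sum_j w_ij."""
    n = q_new.shape[0]
    beta = np.log(np.max((q_new + eps) / (q_old + eps), axis=1))
    d = W.sum(axis=1)                        # vertex degrees (W assumed symmetric)
    delta = (np.arange(n) < l).astype(float)
    return float(np.sum((delta + d) * beta))
```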

39 Consider the optimization problem $\mathcal{P}_{SQ}: \min_{p \in \triangle^m} C_{SQ}(p)$ where $C_{SQ}(p) = \sum_{i=1}^{l} \|r_i - p_i\|^2 + \mu \sum_{i=1}^{m} \sum_{j \in \mathcal{N}(i)} w_{ij} \|p_i - p_j\|^2 + \nu \sum_{i=1}^{m} \|p_i - u\|^2$ and $\|p\|^2 = \sum_y p^2(y)$. [sent-401, score-1.443]

40 Returning to the original formulation, using Lagrange multipliers, setting the gradient to zero and solving for the multipliers, we get the update for the $p_i$'s to be $p_i^{(n)}(y) = \frac{r_i(y)\,\delta(i \le l) + \nu u(y) + \mu \sum_j w_{ij}\, p_j^{(n-1)}(y)}{\delta(i \le l) + \nu + \mu \sum_j w_{ij}}$. [sent-417, score-1.086]
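
A short sketch, under the same assumptions as above (dense arrays, illustrative names), of iterating this closed-form squared-loss update until it stabilizes.

```python
import numpy as np

def sq_loss_updates(r, W, mu, nu, l, n_iters=200):
    """Jacobi-style iteration of
    p_i(y) = (r_i(y) delta(i<=l) + nu u(y) + mu sum_j w_ij p_j(y))
             / (delta(i<=l) + nu + mu sum_j w_ij)."""
    n, K = r.shape
    u = 1.0 / K                                            # uniform measure u(y)
    delta = (np.arange(n) < l).astype(float)[:, None]
    denom = delta + nu + mu * W.sum(axis=1, keepdims=True)
    p = np.full((n, K), u)
    for _ in range(n_iters):
        p = (r * delta + nu * u + mu * (W @ p)) / denom
    return p
```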

41 AM-Amenable Formulation of $\mathcal{P}_{SQ}$: Consider the following reformulation of $C_{SQ}$: $C'_{SQ}(p, q) = \sum_{i=1}^{l} \|r_i - q_i\|^2 + \mu \sum_{i=1}^{n} \sum_j w'_{ij} \|p_i - q_j\|^2 + \nu \sum_{i=1}^{n} \|p_i - u\|^2$. [sent-434, score-1.078]

42 Further, the updates for the two steps of AM have closed-form solutions and are given by $p_i^{(n)}(y) = \frac{\nu u(y) + \mu \sum_j w'_{ij}\, q_j^{(n-1)}(y)}{\nu + \mu \sum_j w'_{ij}}$ and $q_i^{(n)}(y) = \frac{r_i(y)\,\delta(i \le l) + \mu \sum_j w'_{ji}\, p_j^{(n)}(y)}{\delta(i \le l) + \mu \sum_j w'_{ji}}$. [sent-436, score-0.728]

43 In the case of MP, the $p_i(y)$ update may be re-written as $p_i^{(n)}(y) = \frac{\prod_j \big(q_j^{(n-1)}(y)\big)^{\mu w'_{ij}}}{\sum_{y'} \prod_j \big(q_j^{(n-1)}(y')\big)^{\mu w'_{ij}}}$. [sent-469, score-0.907]

44 Thus, while squared loss makes use of a weighted arithmetic mean, MP uses a weighted geometric mean to update $p_i(y)$. [sent-470, score-0.414]
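
The contrast can be made concrete with a simplified sketch (label and uniform terms dropped; names are illustrative): given a node's neighbour measures Q and weights w, the squared-loss update averages them arithmetically while MP averages their logs.

```python
import numpy as np

def arithmetic_mean_update(Q, w):
    """Squared-loss flavour: p proportional to sum_j w_j q_j(y)."""
    p = w @ Q
    return p / p.sum()

def geometric_mean_update(Q, w, eps=1e-12):
    """MP flavour: p proportional to prod_j q_j(y)**w_j."""
    log_p = w @ np.log(Q + eps)
    p = np.exp(log_p - log_p.max())          # shift for numerical stability
    return p / p.sum()

Q = np.array([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]])   # two neighbours' measures
w = np.array([1.0, 0.5])
print(arithmetic_mean_update(Q, w), geometric_mean_update(Q, w))
```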

45 , 2005) proposes a general framework in which a parametric loss function defined over the labeled samples is regularized by a graph smoothness term defined over both the labeled and unlabeled samples. [sent-480, score-0.201]

46 Further, the update equation for one of the steps of the optimization in the case of PD (Equation 13 in Tsuda, 2005) is actually a special case of our update equation for pi (y) and may be obtained by setting wi j = 1/2. [sent-534, score-0.471]

47 For documents in Dl that are labeled with multiple categories, we initialize ri to have equal non-zero probability for each such category. [sent-1046, score-0.23]

48 For example, if document i is annotated as belonging to classes { acq, grain, wheat}, then ri (acq) = ri (grain) = ri (wheat) = 1/3. [sent-1047, score-0.432]
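
This initialization is straightforward to code; a tiny illustrative helper (hypothetical class index) follows.

```python
import numpy as np

def encode_labels(categories, class_index):
    """r_i is uniform over a document's annotated categories and zero elsewhere."""
    r = np.zeros(len(class_index))
    for c in categories:
        r[class_index[c]] = 1.0 / len(categories)
    return r

class_index = {'acq': 0, 'grain': 1, 'wheat': 2, 'earn': 3}
print(encode_labels(['acq', 'grain', 'wheat'], class_index))  # roughly [1/3, 1/3, 1/3, 0]
```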

49 Note that there might be other (non-uniform) ways of initializing ri (e. [sent-1048, score-0.144]

50 We created 21 transduction sets by randomly sampling l documents from the standard Reuters training set with the constraint that each of 11 categories (top 10 categories and the class other) are represented at least once in each set. [sent-1051, score-0.145]

51 In the case of SGT, SQ-Loss-I and MP, the first transduction set was used to tune the hyper-parameters which we then held fixed for all the remaining 20 transduction sets. [sent-1054, score-0.134]

52 4 TIMIT Phone Recognition The TIMIT corpus of read speech was designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems (Zue et al. [sent-1197, score-0.211]

53 In addition, less reliable phone level annotations generated in an automatic manner by a speech recognizer with a non-zero error rate are also available (Deshmukh et al. [sent-1458, score-0.149]

54 As the task was time-consuming, costly, and error-prone, only 75 minutes of speech segments selected from different SWB conversations were annotated at the phone level and about 150 minutes annotated at the syllable level. [sent-1468, score-0.149]

55 The labeled and unlabeled points in the graph changed based on training, development and test sets used. [sent-1487, score-0.169]

56 Phone classification in the case of conversational speech is a much harder task compared to phone classification of read speech (Morgan, 2009). [sent-1500, score-0.212]

57 Consider the optimization problem $\mathcal{P}_{BR}: \min_{p \in \triangle^m} C_{BR}(p)$ where $C_{BR}(p) = \sum_{i=1}^{l} B_\phi(r_i \| p_i) + \mu \sum_{i=1}^{m} \sum_{j \in \mathcal{N}(i)} w_{ij} B_\phi(p_i \| p_j) + \nu \sum_{i=1}^{m} B_\phi(p_i \| u)$. [sent-1511, score-0.615]
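
For intuition, a generic Bregman divergence is easy to write down; the sketch below (assumed function names) shows that the negative-entropy generator recovers the KL divergence and the squared norm recovers squared error, which is how $C_{BR}$ subsumes both objectives.

```python
import numpy as np

def bregman(phi, grad_phi, p, q):
    """B_phi(p || q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    return phi(p) - phi(q) - np.dot(grad_phi(q), p - q)

neg_entropy = (lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0)
sq_norm = (lambda x: np.dot(x, x), lambda x: 2.0 * x)

p = np.array([0.7, 0.2, 0.1]); q = np.array([0.5, 0.3, 0.2])
print(bregman(*neg_entropy, p, q))   # equals sum_y p log(p/q) for distributions
print(bregman(*sq_norm, p, q))       # equals ||p - q||^2
```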

58 For example, in the case of phone classification, as a result of the nature of human speech and language production, some classes of sounds tend to occur at a higher rate compared to others. [sent-1541, score-0.149]

59 This can occur due to a number of reasons such as, (a) improper graph construction, (b) improperly sampled labeled data, that is, the case where a majority of the labeled samples come from one class (similar to the scenario discussed in the case of the 2D two-moon data set). [sent-1545, score-0.151]

60 We first remind the reader that $C_{KL}(p)$ may be re-written as $C_{KL}(p) = \sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i,j} w_{ij} D_{KL}(p_i \| p_j) + \nu \sum_i D_{KL}(p_i \| u)$, where $u$ is the uniform measure. [sent-1554, score-1.029]

61 Now consider minimizing over $p \in \triangle^m$: $C'_{KL}(p) = \sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i,j} w_{ij} D_{KL}(p_i \| p_j) + \nu \sum_i D_{KL}(p_i \| p_0)$. [sent-1555, score-1.029]

62 The above objective is convex and the last term encourages each $p_i$ to be close to $p_0$ without actually insisting that $p_i(y) = p_0(y)\ \forall\, i, y$. [sent-1556, score-0.887]

63 It is possible to reformulate the above objective as $C'_{MP}(p, q) = \sum_{i=1}^{l} D_{KL}(r_i \| q_i) + \mu \sum_{i,j} w'_{ij} D_{KL}(p_i \| q_j) + \nu \sum_i D_{KL}(p_i \| p_0)$. [sent-1557, score-1.012]

64 There are many ways of defining $\tilde{p}$, such as $\tilde{p}(y) = \frac{1}{n} \sum_{i=1}^{n} p_i(y)$ or $\tilde{p}(y) \propto \prod_i \big(p_i(y) + \epsilon\big)$. [sent-1568, score-0.414]

65 Consider the following objective: $C_{MP}^{(D1)}(p, q) = \sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i=1}^{m} \sum_{j \in \mathcal{N}^{(in)}(i)} w_{ij} D_{KL}(p_i \| q_j) - \nu \sum_{i=1}^{m} H(p_i)$. [sent-1591, score-0.655]

66 In this case, for node $i$, the second term in the above objective encourages $p_i$ to be close to the $q$'s of all its neighbors, $\mathcal{N}^{(in)}(i)$. [sent-1592, score-0.494]

67 Consider minimizing $C_{MP}^{(D2)}(p, q) = \sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i=1}^{m} \sum_{j \in \mathcal{N}^{(out)}(i)} w_{ij} D_{KL}(p_i \| q_j) - \nu \sum_{i=1}^{m} H(p_i)$. [sent-1595, score-0.615]

68 4 Connections to Entropy Minimization (Grandvalet and Bengio, 2005) Entropy Minimization uses the entropy of the unlabeled data as a regularizer while optimizing a parametric loss function over the labeled data. [sent-1600, score-0.16]

69 Consider $C_{KL}(p) = \sum_{i=1}^{l} D_{KL}(r_i \| p_i) + \mu \sum_{i,j=1}^{n} w_{ij} D_{KL}(p_i \| p_j) - \nu \sum_{i=1}^{n} H(p_i) \le \sum_{i=1}^{l} D_{KL}(r_i \| p_i) - \mu \sum_{i,j=1}^{n} w_{ij} \sum_y p_i(y) \log p_j(y)$, as $w_{ij}, \nu, H(p_i) \ge 0$. [sent-1610, score-1.312]

70 Consider a degenerate graph in which $w_{ij} = \delta(i = j \wedge i > l)$; then $C_{KL}(p) \le \sum_{i=1}^{l} D_{KL}(r_i \| p_i) - \mu \sum_{i=l+1}^{n} \sum_y p_i(y) \log p_i(y) = \sum_{i=1}^{l} \sum_y \big(r_i(y) \log r_i(y) - r_i(y) \log p_i(y)\big) + \mu \sum_{i=l+1}^{n} H(p_i) \le -\sum_{i=1}^{l} \sum_y r_i(y) \log p_i(y) + \mu \sum_{i=l+1}^{n} H(p_i)$. [sent-1611, score-2.576]

71 , $H(r_i) = 0$) and that each $p_i$ is parameterized by, say, $\theta_i$, then we can rewrite the above as $C_{KL}(p) \le -\sum_{i=1}^{l} \log p_i(y_i; \theta_i) + \mu \sum_{i=l+1}^{n} H(p_i; \theta_i)$. [sent-1615, score-0.853]

72 Now if all the $\theta_i$ were tied to a single $\theta$ then we have that $C_{KL}(p) \le -\sum_{i=1}^{l} \log p_i(y_i; \theta) + \mu \sum_{i=l+1}^{n} H(p_i; \theta)$, which is equal to the entropy minimization objective. [sent-1616, score-0.476]

73 The difficulties associated with analyzing the rate of convergence of MP are mostly due to the non-linear nature of the update equation for $p_i^{(n)}(y)$. [sent-1624, score-0.414]

74 Solving $\mathcal{P}_{KL}$ using the Method of Multipliers: The first step in the application of MOM to solve $\mathcal{P}_{KL}$ is the construction of the augmented Lagrangian function for $C_{KL}(p)$, which is given by $L_{C_1}(p, \Lambda) = C_{KL}(p) + \sum_{i=1}^{n} \lambda_i \big(1 - \sum_y p_i(y)\big) + c \sum_{i=1}^{n} \big(1 - \sum_y p_i(y)\big)^2$ where $\Lambda = \{\lambda_1, \ldots$ [sent-1670, score-0.828]

75 Recall that we require ∑y pi (y) = 1, ∀ i and that pi (y) ≥ 0, ∀ i, y. [sent-1674, score-0.828]

76 Thus the update equation is given by $p_i^{(n)}(y) = \big[\, p_i^{(n-1)}(y) - \alpha^{(n-1)} \frac{\partial L_{C_1}(p, \Lambda)}{\partial p_i(y)} \,\big]_{+}$. [sent-1677, score-0.828]

77 It can be shown that $\frac{\partial L_{C_1}(p, \Lambda)}{\partial p_i(y)} = \mu \sum_{j=1}^{n} \big[ w_{ij}\big(1 + \log p_i(y) - \log p_j(y)\big) - \frac{w_{ji}\, p_j(y)}{p_i(y)} \big] - \delta(i \le l)\,\frac{r_i(y)}{p_i(y)} + \nu\big(\log p_i(y) + 1\big) + \lambda_i + 2c\big(1 - \sum_y p_i(y)\big)$. [sent-1684, score-2.264]

78 Under MOM, the update equation for the Lagrange multipliers is $\lambda_i^{(n)} = \lambda_i^{(n-1)} + c^{(n-1)} \big(\sum_y p_i^{(n-1)}(y) - 1\big)$ and the penalty parameter is updated using $c^{(n)} = \beta c^{(n-1)}$ if $\sum_i \big(\tau_i^{(n)} - \gamma \tau_i^{(n-1)}\big) > 0$ and $c^{(n)} = c^{(n-1)}$ otherwise, where $\tau_i^{(n)} = \big(1 - \sum_y p_i^{(n)}(y)\big)^2$. [sent-1685, score-0.828]

79 Proof Let $\delta(p, p^{(1)}) \triangleq \mu \sum_{i,j=1}^{n} w'_{ij} D_{KL}(p_i \| p_i^{(1)})$ and $f(t) \triangleq C_{MP}(p^{(t)}, q^{(0)})$, where $p^{(t)} = (1-t)\,p + t\,p^{(1)}$, $0 < t \le 1$, and thus $p_i^{(t)} = (1-t)\,p_i + t\,p_i^{(1)}$. [sent-1694, score-0.828]

80 We have that $f(t) = \sum_{i=1}^{l} \sum_{y \in \mathrm{Y}} r_i \log \frac{r_i}{q_i^{(0)}} + \mu \sum_{i,j=1}^{n} w'_{ij} \sum_{y \in \mathrm{Y}} p_i^{(t)} \log \frac{p_i^{(t)}}{q_j^{(0)}} + \nu \sum_{i=1}^{n} \sum_{y \in \mathrm{Y}} p_i^{(t)} \log \frac{p_i^{(t)}}{u}$, where we ignore the argument $y$ in every measure for brevity (e. [sent-1696, score-2.204]

81 If $q_j(y) > 0\ \forall\, y, j$ then there exists $\gamma < \infty$ such that $p_i^{(1)} \log \frac{p_i^{(1)}}{q_j^{(0)}} - p_i^{(t)} \log \frac{p_i^{(t)}}{q_j^{(0)}} < \gamma$ because the difference of two finite real numbers is always bounded above, which implies that the DCT can be used to distribute the limits within the summations. [sent-1707, score-1.864]

82 Thus we have that $0 \ge \mu \sum_{i,j=1}^{n} w'_{ij} \sum_{y \in \mathrm{Y}} \frac{\partial}{\partial t}\big( p_i^{(t)} \log \frac{p_i^{(t)}}{q_j^{(0)}} \big)\big|_{t=1} + \nu \sum_{i=1}^{n} \sum_{y \in \mathrm{Y}} \frac{\partial}{\partial t}\big( p_i^{(t)} \log \frac{p_i^{(t)}}{u} \big)\big|_{t=1} = \mu \sum_{i,j=1}^{n} w'_{ij} \sum_{y \in \mathrm{Y}} \big( p_i^{(1)} \log \frac{p_i^{(1)}}{q_j^{(0)}} - p_i \log \frac{p_i^{(1)}}{q_j^{(0)}} \big) + \nu \sum_{i=1}^{n} \sum_{y \in \mathrm{Y}} \big( p_i^{(1)} \log \frac{p_i^{(1)}}{u} - p_i \log \frac{p_i^{(1)}}{u} \big)$. [sent-1708, score-4.862]

83 The last equation follows as $\sum_{y \in \mathrm{Y}} \big(p_i - p_i^{(1)}\big) = 0$. [sent-1709, score-0.414]

84 Using the above we get $0 \ge C_{MP}(p^{(1)}, q^{(0)}) - \sum_{i=1}^{l} D_{KL}(r_i \| q_i^{(0)}) - \big( \mu \sum_{i,j=1}^{n} w'_{ij} \sum_{y \in \mathrm{Y}} p_i \log \frac{p_i^{(1)}}{q_j^{(0)}} + \nu \sum_{i=1}^{n} \sum_{y \in \mathrm{Y}} p_i \log \frac{p_i^{(1)}}{u} \big)$. [sent-1711, score-1.929]

85 Consider $\sum_{y \in \mathrm{Y}} p_i \log \frac{p_i^{(1)}}{q_j^{(0)}} = \sum_{y \in \mathrm{Y}} p_i \big( \log \frac{p_i}{q_j^{(0)}} + \log \frac{p_i^{(1)}}{p_i} \big) = D_{KL}(p_i \| q_j^{(0)}) - D_{KL}(p_i \| p_i^{(1)})$. [sent-1712, score-4.063]

86 Similarly, $\sum_{y \in \mathrm{Y}} p_i \log \frac{p_i^{(1)}}{u} = \sum_{y \in \mathrm{Y}} p_i \big( \log \frac{p_i}{u} + \log \frac{p_i^{(1)}}{p_i} \big) = D_{KL}(p_i \| u) - D_{KL}(p_i \| p_i^{(1)})$. [sent-1713, score-4.24]

87 It should be clear that $g(t)$ achieves its minimum at $t = 1$ and as a result we have that $\frac{g(1) - g(t)}{1 - t} \le 0$ (5) and $g(t) = \sum_{i=1}^{l} \sum_{y \in \mathrm{Y}} r_i \log \frac{r_i}{q_i^{(t)}} + \mu \sum_{i,j=1}^{n} w'_{ij} \sum_{y \in \mathrm{Y}} p_i^{(1)} \log \frac{p_i^{(1)}}{q_j^{(t)}} + \nu \sum_{i=1}^{n} \sum_{y \in \mathrm{Y}} p_i^{(1)} \log \frac{p_i^{(1)}}{u}$. [sent-1723, score-1.047]

88 As a result we have that $0 \ge -\sum_{i=1}^{l} \sum_{y \in \mathrm{Y}} r_i + \sum_{i=1}^{l} \sum_{y \in \mathrm{Y}} \frac{r_i}{q_i^{(1)}}\, q_i - \mu \sum_{i,j=1}^{n} w'_{ij} \sum_{y \in \mathrm{Y}} p_i^{(1)} + \mu \sum_{i,j=1}^{n} w'_{ij} \sum_{y \in \mathrm{Y}} \frac{p_i^{(1)}}{q_j^{(1)}}\, q_j = -l - \mu \sum_{i,j=1}^{n} w'_{ij} + \sum_{i=1}^{l} \sum_{y \in \mathrm{Y}} \frac{r_i}{q_i^{(1)}}\, q_i + \mu \sum_{i,j=1}^{n} w'_{ij} \sum_{y \in \mathrm{Y}} \frac{p_i^{(1)}}{q_j^{(1)}}\, q_j$. [sent-1727, score-2.071]

89 Theorem 5 (Convergence of AM on $C_{MP}$) If $p^{(n)} = \operatorname{argmin}_{p \in \triangle^m} C_{MP}(p, q^{(n-1)})$, $q^{(n)} = \operatorname{argmin}_{q \in \triangle^m} C_{MP}(p^{(n)}, q)$ and $q_i^{(0)}(y) > 0\ \forall\, y \in \mathrm{Y}, \forall i$, then (a) $C_{MP}(p, q) + C_{MP}(p, p^{(0)}) \ge C_{MP}(p, q^{(1)}) + C_{MP}(p^{(1)}, q^{(1)})$ for all $p, q \in \triangle^m$, and (b) $\lim_{n \to \infty} C_{MP}(p^{(n)}, q^{(n)}) = \inf_{p,q \in \triangle^m} C_{MP}(p, q)$. [sent-1734, score-0.172]

90 Test for Convergence: Theorem 9 (Test for Convergence) If $\{(p^{(n)}, q^{(n)})\}_{n=1}^{\infty}$ is generated by AM of $C_{MP}(p, q)$ and $C_{MP}(p^*, q^*) \triangleq \inf_{p,q \in \triangle^n} C_{MP}(p, q)$, then $C_{MP}(p^{(n)}, q^{(n)}) - C_{MP}(p^*, q^*) \le \sum_{i=1}^{n} \big(\delta(i \le l) + d_i\big)\,\beta_i$, where $\beta_i \triangleq \log \sup_y \frac{q_i^{(n)}(y)}{q_i^{(n-1)}(y)}$ and $d_j = \sum_i w_{ij}$. [sent-1764, score-0.294]

91 Update Equations for $p^{(n)}$ and $q^{(n)}$: The Lagrangian (ignoring the non-negativity constraints) for solving $\min_{p \in \triangle^n} C_{MP}(p, q^{(n-1)})$ is given by $L(p, \Lambda) = \sum_{i=1}^{l} D_{KL}(r_i \| q_i) + \mu \sum_{i,j=1}^{n} w'_{ij} D_{KL}(p_i \| q_j^{(n-1)}) - \nu \sum_{i=1}^{n} H(p_i) + \sum_i \lambda_i \big(\sum_y p_i(y) - 1\big)$ where $\Lambda = \{\lambda_1, \ldots$ [sent-1770, score-0.972]

92 As KKT conditions apply (since we have a convex optimization problem), we have that ▽ pi (y) L (p, Λ) = 0 and p ∈△n at the optimal solution. [sent-1774, score-0.414]

93 Solving the above we have $\log p_i(y) = \frac{-\lambda_i + \beta_i^{(n-1)}(y)}{\alpha_i}$. [sent-1775, score-0.439]

94 Using the above in Equation 7 leads to the dual problem in $\Lambda$, which admits a closed-form solution given by $\lambda_i = \alpha_i \log \sum_y \exp\big(\beta_i^{(n-1)}(y)/\alpha_i\big)$, which in turn yields the update below; clearly $p_i^{(n)}(y) \ge 0\ \forall\, i, y$. [sent-1777, score-0.459]

95 $p_i^{(n)}(y) = \frac{1}{Z_i} \exp\big(\beta_i^{(n-1)}(y)/\alpha_i\big)$. [sent-1778, score-0.414]

96 In this case the KKT conditions require that $\nabla_{q_i(y)} L(q, \Lambda) = 0$, $\sum_y q_i(y) = 1\ \forall\, i$, and $\sigma_{iy}\, q_i(y) = 0\ \forall\, i, y$, solving which yields $q_i^{(n)}(y) = \frac{r_i(y)\,\delta(i \le l) + \mu \sum_j w'_{ji}\, p_j^{(n)}(y)}{\delta(i \le l) + \mu \sum_j w'_{ji}}$. [sent-1786, score-0.462]

97 , $W$ is irreducible) then the sequence of updates $p_i^{(n)}(y) = \frac{r_i(y)\,\delta(i \le l) + \nu u(y) + \mu \sum_j w_{ij}\, p_j^{(n-1)}(y)}{\delta(i \le l) + \nu + \mu \sum_j w_{ij}}$ has a linear (geometric) rate of convergence for all $i$ and $y$. [sent-1791, score-0.672]

98 Learning from labeled and unlabeled data using graph mincuts. [sent-1899, score-0.147]

99 Learning to classify text from labeled and unlabeled documents. [sent-2165, score-0.133]

100 Learning from labeled and unlabeled data on a directed graph. [sent-2313, score-0.127]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('cmp', 0.52), ('pi', 0.414), ('ssl', 0.284), ('mp', 0.263), ('dkl', 0.207), ('ckl', 0.191), ('mom', 0.156), ('ri', 0.144), ('sgt', 0.136), ('ilmes', 0.12), ('ubramanya', 0.12), ('emi', 0.116), ('qi', 0.106), ('ropagation', 0.099), ('easure', 0.099), ('upervised', 0.099), ('kld', 0.092), ('phone', 0.086), ('pkl', 0.08), ('swb', 0.08), ('timit', 0.08), ('qj', 0.079), ('psq', 0.076), ('raph', 0.076), ('zhu', 0.073), ('csq', 0.072), ('transduction', 0.067), ('laprls', 0.064), ('switchboard', 0.064), ('speech', 0.063), ('wi', 0.057), ('qc', 0.054), ('labeled', 0.054), ('joachims', 0.054), ('mlp', 0.051), ('unlabeled', 0.05), ('prbep', 0.048), ('subramanya', 0.044), ('graph', 0.043), ('coil', 0.041), ('objective', 0.04), ('stp', 0.04), ('belkin', 0.039), ('entropy', 0.037), ('sq', 0.034), ('argmin', 0.033), ('tagging', 0.033), ('documents', 0.032), ('dihana', 0.032), ('hf', 0.032), ('symmetrized', 0.032), ('bilmes', 0.031), ('bci', 0.031), ('sim', 0.031), ('neighbors', 0.03), ('earning', 0.029), ('text', 0.029), ('csiszar', 0.028), ('tfidf', 0.028), ('transductive', 0.026), ('usps', 0.025), ('henceforth', 0.025), ('chapelle', 0.025), ('log', 0.025), ('reuters', 0.024), ('utterances', 0.024), ('nodes', 0.024), ('arya', 0.024), ('cbr', 0.024), ('tusnady', 0.024), ('tsvm', 0.024), ('threads', 0.024), ('ghahramani', 0.024), ('vertex', 0.023), ('label', 0.023), ('categories', 0.023), ('priors', 0.023), ('directed', 0.023), ('dl', 0.022), ('ji', 0.022), ('development', 0.022), ('propagation', 0.021), ('node', 0.021), ('bregman', 0.02), ('classi', 0.02), ('corduneanu', 0.02), ('webkb', 0.02), ('bengio', 0.02), ('ai', 0.02), ('discourse', 0.02), ('thread', 0.02), ('bertsekas', 0.02), ('closed', 0.02), ('da', 0.019), ('regularizer', 0.019), ('encourages', 0.019), ('objectives', 0.019), ('ran', 0.018), ('bigram', 0.018), ('dialog', 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999982 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

Author: Amarnag Subramanya, Jeff Bilmes

Abstract: We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets, for example, in one instance we use a data set with over 108 samples. In this context, we propose a graph node ordering algorithm that is also applicable to other graph-based semi-supervised learning approaches. We compare the proposed approach against other standard semi-supervised learning algorithms on the semi-supervised learning benchmark data sets (Chapelle et al., 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. In each case, the proposed approach outperforms the state-of-the-art. Lastly, we show that our objective can be generalized into a form that includes the standard squared-error loss, and we prove a geometric rate of convergence in that case. Keywords: graph-based semi-supervised learning, transductive inference, large-scale semi-supervised learning, non-parametric models

2 0.06714163 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

3 0.05337907 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

4 0.043396622 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

5 0.040397175 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

Author: Cassio P. de Campos, Qiang Ji

Abstract: This paper addresses the problem of learning Bayesian network structures from data based on score functions that are decomposable. It describes properties that strongly reduce the time and memory costs of many known methods without losing global optimality guarantees. These properties are derived for different score criteria such as Minimum Description Length (or Bayesian Information Criterion), Akaike Information Criterion and Bayesian Dirichlet Criterion. Then a branch-andbound algorithm is presented that integrates structural constraints with data in a way to guarantee global optimality. As an example, structural constraints are used to map the problem of structure learning in Dynamic Bayesian networks into a corresponding augmented Bayesian network. Finally, we show empirically the benefits of using the properties with state-of-the-art methods and with the new algorithm, which is able to handle larger data sets than before. Keywords: Bayesian networks, structure learning, properties of decomposable scores, structural constraints, branch-and-bound technique

6 0.035569567 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

7 0.034983851 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

8 0.034482293 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

9 0.034029819 58 jmlr-2011-Learning from Partial Labels

10 0.032346055 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

11 0.032226659 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

12 0.030806726 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

13 0.028596969 68 jmlr-2011-Natural Language Processing (Almost) from Scratch

14 0.0257826 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal

15 0.024451235 101 jmlr-2011-Variable Sparsity Kernel Learning

16 0.02433846 54 jmlr-2011-Learning Latent Tree Graphical Models

17 0.024326155 12 jmlr-2011-Bayesian Co-Training

18 0.024108596 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms

19 0.023446376 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

20 0.023029633 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.14), (1, -0.057), (2, -0.019), (3, -0.035), (4, 0.02), (5, -0.022), (6, -0.011), (7, -0.006), (8, 0.032), (9, 0.04), (10, -0.109), (11, -0.001), (12, 0.012), (13, -0.03), (14, -0.041), (15, -0.019), (16, -0.042), (17, -0.075), (18, -0.07), (19, -0.016), (20, 0.011), (21, 0.091), (22, -0.108), (23, -0.024), (24, -0.053), (25, 0.053), (26, -0.082), (27, 0.059), (28, -0.212), (29, 0.202), (30, -0.072), (31, -0.01), (32, -0.118), (33, 0.049), (34, -0.015), (35, 0.054), (36, 0.211), (37, 0.353), (38, -0.062), (39, -0.128), (40, -0.058), (41, 0.14), (42, 0.021), (43, -0.07), (44, -0.253), (45, 0.293), (46, 0.127), (47, -0.122), (48, 0.308), (49, 0.248)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.95961046 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

Author: Amarnag Subramanya, Jeff Bilmes

Abstract: We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets, for example, in one instance we use a data set with over 108 samples. In this context, we propose a graph node ordering algorithm that is also applicable to other graph-based semi-supervised learning approaches. We compare the proposed approach against other standard semi-supervised learning algorithms on the semi-supervised learning benchmark data sets (Chapelle et al., 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. In each case, the proposed approach outperforms the state-of-the-art. Lastly, we show that our objective can be generalized into a form that includes the standard squared-error loss, and we prove a geometric rate of convergence in that case. Keywords: graph-based semi-supervised learning, transductive inference, large-scale semi-supervised learning, non-parametric models

2 0.23477241 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

3 0.23177914 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

4 0.21652237 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal

Author: Stefano Melacci, Mikhail Belkin

Abstract: In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown the state of the art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. In particular, training a LapSVM in the primal can be efficiently performed with preconditioned conjugate gradient. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. The computational complexity of the training algorithm is reduced from O(n3 ) to O(kn2 ), where n is the combined number of labeled and unlabeled examples and k is empirically evaluated to be significantly smaller than n. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large data sets. We present an extensive experimental evaluation on real world data showing the benefits of the proposed approach. Keywords: Laplacian support vector machines, manifold regularization, semi-supervised learning, classification, optimization

5 0.20984074 55 jmlr-2011-Learning Multi-modal Similarity

Author: Brian McFee, Gert Lanckriet

Abstract: In many applications involving multi-media data, the definition of similarity between items is integral to several key tasks, including nearest-neighbor retrieval, classification, and recommendation. Data in such regimes typically exhibits multiple modalities, such as acoustic and visual content of video. Integrating such heterogeneous data to form a holistic similarity space is therefore a key challenge to be overcome in many real-world applications. We present a novel multiple kernel learning technique for integrating heterogeneous data into a single, unified similarity space. Our algorithm learns an optimal ensemble of kernel transformations which conform to measurements of human perceptual similarity, as expressed by relative comparisons. To cope with the ubiquitous problems of subjectivity and inconsistency in multimedia similarity, we develop graph-based techniques to filter similarity measurements, resulting in a simplified and robust training procedure. Keywords: multiple kernel learning, metric learning, similarity

6 0.19684166 61 jmlr-2011-Logistic Stick-Breaking Process

7 0.18796861 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

8 0.181392 58 jmlr-2011-Learning from Partial Labels

9 0.17221725 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

10 0.15719509 102 jmlr-2011-Waffles: A Machine Learning Toolkit

11 0.15208949 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

12 0.15155141 103 jmlr-2011-Weisfeiler-Lehman Graph Kernels

13 0.15020669 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

14 0.14880356 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

15 0.14575982 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

16 0.14275321 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms

17 0.14001824 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints

18 0.13886733 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

19 0.13853119 70 jmlr-2011-Non-Parametric Estimation of Topic Hierarchies from Texts with Hierarchical Dirichlet Processes

20 0.13282633 68 jmlr-2011-Natural Language Processing (Almost) from Scratch


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.037), (6, 0.012), (9, 0.029), (10, 0.026), (11, 0.011), (24, 0.039), (31, 0.095), (32, 0.028), (38, 0.422), (41, 0.021), (60, 0.017), (71, 0.015), (73, 0.047), (78, 0.057), (90, 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.6813094 84 jmlr-2011-Semi-Supervised Learning with Measure Propagation

Author: Amarnag Subramanya, Jeff Bilmes

Abstract: We describe a new objective for graph-based semi-supervised learning based on minimizing the Kullback-Leibler divergence between discrete probability measures that encode class membership probabilities. We show how the proposed objective can be efficiently optimized using alternating minimization. We prove that the alternating minimization procedure converges to the correct optimum and derive a simple test for convergence. In addition, we show how this approach can be scaled to solve the semi-supervised learning problem on very large data sets, for example, in one instance we use a data set with over 108 samples. In this context, we propose a graph node ordering algorithm that is also applicable to other graph-based semi-supervised learning approaches. We compare the proposed approach against other standard semi-supervised learning algorithms on the semi-supervised learning benchmark data sets (Chapelle et al., 2007), and other real-world tasks such as text classification on Reuters and WebKB, speech phone classification on TIMIT and Switchboard, and linguistic dialog-act tagging on Dihana and Switchboard. In each case, the proposed approach outperforms the state-of-the-art. Lastly, we show that our objective can be generalized into a form that includes the standard squared-error loss, and we prove a geometric rate of convergence in that case. Keywords: graph-based semi-supervised learning, transductive inference, large-scale semi-supervised learning, non-parametric models

2 0.30039129 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

3 0.29956153 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series

Author: Graham W. Taylor, Geoffrey E. Hinton, Sam T. Roweis

Abstract: In this paper we develop a class of nonlinear generative models for high-dimensional time series. We first propose a model based on the restricted Boltzmann machine (RBM) that uses an undirected model with binary latent variables and real-valued “visible” variables. The latent and visible variables at each time step receive directed connections from the visible variables at the last few time-steps. This “conditional” RBM (CRBM) makes on-line inference efficient and allows us to use a simple approximate learning procedure. We demonstrate the power of our approach by synthesizing various sequences from a model trained on motion capture data and by performing on-line filling in of data lost during capture. We extend the CRBM in a way that preserves its most important computational properties and introduces multiplicative three-way interactions that allow the effective interaction weight between two variables to be modulated by the dynamic state of a third variable. We introduce a factoring of the implied three-way weight tensor to permit a more compact parameterization. The resulting model can capture diverse styles of motion with a single set of parameters, and the three-way interactions greatly improve its ability to blend motion styles or to transition smoothly among them. Videos and source code can be found at http://www.cs.nyu.edu/˜gwtaylor/publications/ jmlr2011. Keywords: unsupervised learning, restricted Boltzmann machines, time series, generative models, motion capture

4 0.29776651 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar

Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.

5 0.29709977 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

6 0.29457167 12 jmlr-2011-Bayesian Co-Training

7 0.2938163 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

8 0.29379156 48 jmlr-2011-Kernel Analysis of Deep Networks

9 0.29292247 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

10 0.29146436 91 jmlr-2011-The Sample Complexity of Dictionary Learning

11 0.28936467 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

12 0.28905991 16 jmlr-2011-Clustering Algorithms for Chains

13 0.28716823 66 jmlr-2011-Multiple Kernel Learning Algorithms

14 0.28646919 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

15 0.28637144 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

16 0.2849561 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

17 0.28467208 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments

18 0.28412262 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

19 0.28277698 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

20 0.2827059 101 jmlr-2011-Variable Sparsity Kernel Learning