jmlr jmlr2011 jmlr2011-12 knowledge-graph by maker-knowledge-mining

12 jmlr-2011-Bayesian Co-Training


Source: pdf

Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao

Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of model are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of the active sensing model is that one actively sensed (sample, view) pair improves the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. [sent-17, score-0.451]

2 This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. [sent-18, score-0.794]

3 The strength of the active sensing model is that one actively sensed (sample, view) pair improves the joint multi-view classification on all the samples. [sent-19, score-0.649]

4 Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing. [sent-21, score-0.704]

5 When the data samples can be characterized in multiple views, the disagreement between the class labels suggested by different views can be computed even when using unlabeled data. [sent-33, score-0.512]
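To make this co-regularization idea concrete, here is a small illustrative sketch; the function name and the squared-difference measure of disagreement are my own assumptions, not taken from the paper.

```python
import numpy as np

def view_disagreement(F_unlabeled):
    """Average pairwise squared disagreement between per-view predictions
    on unlabeled points (F_unlabeled has shape (m, n_u)). No labels are
    needed: the views are only compared against each other."""
    m = F_unlabeled.shape[0]
    total, pairs = 0.0, 0
    for j in range(m):
        for k in range(j + 1, m):
            total += np.mean((F_unlabeled[j] - F_unlabeled[k]) ** 2)
            pairs += 1
    return total / pairs
```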

6 More importantly, we extend the Bayesian co-training model to handle data samples with missing views (i. [sent-79, score-0.451]

7 , some views are missing for certain data samples), and introduce a novel application called active sensing. [sent-81, score-0.563]

8 Active sensing aims to efficiently choose, among all the missing features (grouped in views), which views and samples to additionally acquire (or sense) to improve the overall learning performance. [sent-83, score-0.938]

9 So active sensing amounts to deciding which location and which type of sensor we should additionally consider to achieve better detection accuracy. [sent-92, score-0.622]

10 This active sensing problem is similar to active feature acquisition—see, for example, Melville et al. [sent-94, score-0.747]

11 But in active sensing, one actively acquired (sample, view) pair will improve the classification performance of all the unlabeled samples via a co-training setting. [sent-97, score-0.393]

12 The model is extended to handle missing views in Section 4, and this provides the basis for the active sensing solution. [sent-103, score-1.006]

13 The active sensing problem is discussed in Section 5, in which we provide two methods for deciding which incomplete samples should be further characterized, and which sensors should be deployed on them. [sent-104, score-0.66]

14 Figure 2: Factor graph in the functional space for 2-view and multi-view learning. [sent-139, score-1.172]

15 2 Undirected Graphical Model for Multi-View Learning. In multi-view learning, suppose we have m different views of the same set of n data samples. [sent-141, score-0.31]

16 One can certainly concatenate the multiple views of the data into a single view, and apply a single-view GP model. [sent-155, score-0.31]

17 Since one data sample i has only one single label yi even though it has multiple features from the multiple views (i. [sent-161, score-0.417]

18 , latent function value f_j(x_i^(j)) for view j), the label yi should depend on all of these latent function values for data sample i. [sent-163, score-0.279]

19 We tackle this problem by introducing a new latent function, the consensus function fc , to ensure conditional independence between the output y and the m latent functions { f j } for the m views. [sent-165, score-0.835]

20 At the functional level, the output y depends only on fc , and latent functions { f j } depend on each other only via the consensus function fc (see Figure 2 for the factor graphs for 2-view and multi-view cases). [sent-167, score-1.321]

21 That is, the joint probability is defined as: [sent-168, score-0.56]

22 p(y, fc, f1, ..., fm) = (1/Z) ψ(y, fc) ∏_{j=1}^m ψ(f_j, fc), (2) with some potential functions ψ. [sent-171, score-1.2]

23 In the ground network where we have n data samples, let fc = {fc(xi)}_{i=1}^n and f_j = {f_j(x_i^(j))}_{i=1}^n be the functional values for the consensus view and the jth view, respectively. [sent-172, score-1.384]

24 The graphical model leads to the following factorization: [sent-173, score-0.59]

25 p(y, fc, f1, ..., fm) = (1/Z) ∏_{i=1}^n ψ(yi, fc(xi)) ∏_{j=1}^m ψ(f_j) ψ(f_j, fc). (3) [sent-176, score-1.172]

26 Here the within-view potential ψ(f_j) specifies the dependency structure within each view j, and the consensus potential ψ(f_j, fc) describes how each latent function f_j is related to the consensus function fc. [sent-177, score-1.687]

27 Finally, the output potential ψ(yi , fc (xi )) is defined the same as that in (1) for regression or for classification. [sent-180, score-0.565]
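A minimal sketch of evaluating the unnormalized log of the factorization in (3); the Gaussian forms assumed here for the output and within-view potentials, and all variable names, are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def log_joint_unnorm(y, f_c, F, K_list, sigma2_list, noise_var=1.0):
    """Unnormalized log of (3): output potentials times within-view and
    consensus potentials. F is an (m, n) array of per-view latent values,
    K_list holds the n x n within-view kernel matrices, sigma2_list the
    consensus variances sigma_j^2. A Gaussian output potential (regression
    case) is assumed for illustration."""
    lp = 0.0
    # Output potentials psi(y_i, f_c(x_i)): Gaussian noise around f_c.
    lp += -0.5 * np.sum((y - f_c) ** 2) / noise_var
    for f_j, K_j, s2 in zip(F, K_list, sigma2_list):
        # Within-view potential psi(f_j): GP prior with kernel K_j.
        lp += -0.5 * f_j @ np.linalg.solve(K_j, f_j)
        # Consensus potential psi(f_j, f_c): isotropic Gaussian on f_j - f_c.
        lp += -0.5 * np.sum((f_j - f_c) ** 2) / s2
    return lp
```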

28 The most important potential function in Bayesian co-training is the consensus potential, which simply defines an isotropic multivariate Gaussian for the difference of f_j and fc, that is, f_j − fc ∼ N(0, σ_j² I). [sent-181, score-1.298]

29 This can also be interpreted as assuming a conditional isotropic Gaussian for f_j, with the consensus fc being the mean. [sent-182, score-0.733]

30 Alternatively, if fc is of interest, the joint consensus potentials effectively define a conditional Gaussian prior for fc given f1, ..., fm. [sent-183, score-1.923]

31 This indicates that, given the latent functions {f_j}_{j=1}^m, the posterior mean of the consensus function fc is a weighted average of these latent functions, and the weight is given by the inverse variance 1/σ_j². [sent-188, score-0.835]
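As an illustration of this inverse-variance weighting, here is a sketch under the stated Gaussian assumptions; the function and variable names are mine, not the paper's.

```python
import numpy as np

def consensus_posterior(F, sigma2_list):
    """Combine per-view latent vectors F (m, n) with consensus variances
    sigma_j^2 into the conditional Gaussian for f_c: the mean is the
    inverse-variance weighted average of the f_j, and the per-point
    variance is the inverse of the summed precisions."""
    w = 1.0 / np.asarray(sigma2_list)            # precisions 1/sigma_j^2
    mean = (w[:, None] * F).sum(axis=0) / w.sum()
    var = 1.0 / w.sum()
    return mean, var

# Example: a low-variance (trusted) view pulls the consensus toward itself.
F = np.array([[1.0, 2.0, 0.5],
              [1.2, 1.8, 0.4]])
print(consensus_posterior(F, sigma2_list=[0.1, 1.0]))
```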

32 We will discuss the consensus potential and the view variances in more detail in Section 3. [sent-196, score-0.338]

33 Note that this conditional Gaussian for fc has a normalization factor which depends on f1, ..., fm. [sent-209, score-0.537]

34 Since the output vector yl is only of length nl, the joint probability is now: [sent-213, score-0.671]

35 p(yl, fc, f1, ..., fm) = (1/Z) ∏_{i=1}^{nl} ψ(yi, fc(xi)) ∏_{j=1}^m ψ(f_j) ψ(f_j, fc). (6) [sent-216, score-1.215]

36 Note that the product of output potentials contains only that of the nl labeled data samples, and that fc = {fc(xi)}_{i=1}^n and f_j = {f_j(x_i^(j))}_{i=1}^n are still of length n. [sent-217, score-1.246]

37 Unlabeled data samples contribute to the joint probability via the within-view potentials ψ(f_j) and consensus potentials ψ(f_j, fc). [sent-218, score-0.982]

38 Inference and Learning in Bayesian Co-Training. In this section we discuss inference and learning in the proposed model, assuming first that there is no missing data in any of the views (the setting with missing data will be discussed in Section 4). [sent-222, score-0.512]

39 All marginalizations lead to standard Gaussian process inference with a different latent function under consideration, but interestingly, these different marginalizations reveal different insights into the proposed undirected graphical model. [sent-225, score-0.348]

40 Marginal 1: Co-Regularized Multi-View Learning. Our first marginalization focuses on the joint probability distribution of the m latent functions when the consensus function fc is integrated out. [sent-230, score-0.807]

41 Taking the integral of (3) over fc (and ignoring the output potential for the moment), we obtain the joint marginal distribution as follows after some mathematics (for derivations see Appendix A. [sent-235, score-0.611]

42 ({f1, ..., fm} in Marginal 1, fc in Marginal 2, and f_j in Marginal 3). [sent-249, score-0.635]
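For Marginal 2, where the consensus function fc is of interest, the combined kernel for a standard GP classifier can be sketched as below. The specific form K_c = (∑_j (K_j + σ_j² I)^(-1))^(-1) is my reading of how the view kernels and consensus variances would combine, not a formula quoted from this excerpt.

```python
import numpy as np

def cotraining_kernel(K_list, sigma2_list):
    """Sketch of a co-training kernel for the consensus function f_c,
    assuming K_c = (sum_j (K_j + sigma_j^2 I)^(-1))^(-1): each view's GP
    kernel K_j is inflated by its consensus variance and the resulting
    precisions are summed, so views with large sigma_j^2 (less trusted)
    contribute less."""
    n = K_list[0].shape[0]
    precision = np.zeros((n, n))
    for K_j, s2 in zip(K_list, sigma2_list):
        precision += np.linalg.inv(K_j + s2 * np.eye(n))
    return np.linalg.inv(precision)
```

Plugging such a K_c into an ordinary single-view GP classifier is, per the abstract, where the convexity of the approach and the automatic down-weighting of unreliable views would come from.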

43 As mentioned before, the consensus-based potentials in (4) can be interpreted as defining a Gaussian prior (5) for fc, where the mean is a weighted average of the m individual views. [sent-256, score-0.63]

44 This averaging indicates that the value of fc is never higher (or lower) than that of any single view. [sent-257, score-0.537]

45 While the consensus-based potentials are intuitive and useful for many applications, they are limited for some real world problems where the evidence from different views should be additive (or reinforcing) rather than averaged. [sent-258, score-0.403]

46 It’s clear that in this scenario the multiple views are reinforcing or weakening each other, not averaging. [sent-265, score-0.31]

47 Bayesian Co-Training with Missing Views. In the previous two sections we assumed that the input data are complete, that is, all the views are observed for every data sample. [sent-269, score-0.31]

48 (e.g., CT, PET, Ultrasound, MRI) for the final diagnosis, so some views may be missing. [sent-273, score-0.31]

49 To the best of our knowledge, this is the first elegant framework to account for the missing views in the multi-view learning setting. [sent-278, score-0.411]

50 Let each view j be observed for a subset of n j ≤ n samples, and let I j denote the indices of these samples in the whole sample set (including labeled and unlabeled data). [sent-279, score-0.293]

51 We start from the undirected graphical model and make necessary changes to the potentials to account for the missing views. [sent-282, score-0.303]

52 The joint probability can be defined as: [sent-285, score-0.56]

53 p(yl, fc, f1, ..., fm) = (1/Z) ∏_{i=1}^{nl} ψ(yi, fc(xi)) ∏_{j=1}^m ψ(f_j) ψ(f_j, fc), (12) where fc = {fc(xi)}_{i=1}^n ∈ R^n and f_j = {f_j(x_i^(j))}_{i∈I_j} ∈ R^{n_j}. [sent-288, score-2.289]

54 In other words, the consensus potential is defined such that ψ(f_j(xi), fc(xi)) = exp( −(f_j(xi) − fc(xi))² / (2σ_j²) ), for i ∈ I_j. [sent-291, score-1.387]

55 The idea here is to define the consensus potential for view j using only the data samples observed in view j. [sent-292, score-0.492]

56 The other data samples with missing view information for view j are treated as hidden (or integrated out) in this potential definition. [sent-293, score-0.397]

57 As before, σ_j > 0 quantifies how far the latent function f_j is from fc. [sent-294, score-0.588]
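A small sketch of the consensus contribution when view j is observed only on the index set I_j, following the definition above; index handling and names are illustrative assumptions.

```python
import numpy as np

def consensus_log_potential_missing(f_j, f_c, I_j, sigma2_j):
    """log psi(f_j, f_c) when view j is observed only on indices I_j:
    only the observed entries are tied to the consensus function; the
    samples missing view j simply do not appear in this potential."""
    diff = f_j - f_c[I_j]          # f_j has length n_j, f_c has length n
    return -0.5 * np.sum(diff ** 2) / sigma2_j
```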

58 Considering the set of observed views for each sample xi, we need to distinguish two different settings. [sent-310, score-0.42]

59 However, it is not clear which view to acquire for this sample (if more than one view is missing for the sample). [sent-337, score-0.329]

60 In the second part we evaluate the active sensing algorithms in the Bayesian co-training setting. [sent-346, score-0.595]

61 We are given a classification task with missing views, and at each iteration we are allowed to select an unobserved (sample, view) pair for sensing. [sent-347, score-0.57]

62 The proposed methods are compared with random sensing in which a random unobserved (sample, view) pair is selected for sensing. [sent-350, score-0.469]
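The acquisition protocol can be sketched as a generic loop; the function names and the scoring hook below are illustrative assumptions, not the paper's code, and random sensing corresponds to passing no scoring function.

```python
import numpy as np

def sensing_loop(X_views, mask, n_queries, score_fn=None, rng=np.random.default_rng(0)):
    """mask[j, i] is True when view j of sample i is observed. At each
    iteration one unobserved (sample, view) pair is chosen, either by a
    scoring function (active sensing) or uniformly at random (random
    sensing), then revealed so the model can be retrained."""
    history = []
    for _ in range(n_queries):
        candidates = np.argwhere(~mask)            # unobserved (view, sample) pairs
        if candidates.size == 0:
            break
        if score_fn is None:                       # random sensing baseline
            j, i = candidates[rng.integers(len(candidates))]
        else:                                      # active sensing: best-scoring pair
            scores = [score_fn(j, i, X_views, mask) for j, i in candidates]
            j, i = candidates[int(np.argmax(scores))]
        mask[j, i] = True                          # "sense" (acquire) the pair
        history.append((int(i), int(j)))
        # retrain the Bayesian co-training model on the updated data here
    return history
```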

63 This is an ideal case for co-training, since: 1) each single view is sufficient to train a classifier, and 2) both views are conditionally independent given the class labels. [sent-359, score-0.424]

64 There are three natural views for each document: the text view consists of the title and abstract of the paper; the two link views are the inbound and outbound references. [sent-493, score-0.789]

65 There are two views containing the text on the page (24,480 features) and the anchor text (901 features) of all inbound links, respectively. [sent-497, score-0.372]

66 So these multiple views are very unbalanced, and this should be taken into account in co-training by assigning them different weights. [sent-522, score-0.31]

67 Active Sensing on Toy Data. We show some empirical results on active sensing in this and the following subsections. [sent-525, score-0.595]

68 Suppose we are given a classification task with missing views, and at each iteration we are allowed to select an unobserved (sample, view) pair for sensing. [sent-526, score-0.57]

69 We compare the classification performance on unlabeled data using the following three sensing approaches: • Active Sensing MI: The pair is selected based on the mutual information criterion (17). [sent-529, score-0.546]

70 Figure 5: Toy data for active sensing (left); the x-axis of the right panel is the number of acquired (sample, view) pairs. [sent-534, score-0.662]

71 Comparison of active sensing with random sensing is shown on the right. [sent-538, score-1.038]

72 • Active Sensing VAR: A sample is selected first which has the maximal predictive variance and has missing views, and then one of the missing views is randomly selected for sensing. [sent-540, score-0.512]
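The VAR strategy in the entry above is simple enough to sketch directly; the predictive variances would come from the co-training GP posterior, but here they are just an input array, and all names are assumptions.

```python
import numpy as np

def select_pair_var(pred_var, mask, rng=np.random.default_rng(0)):
    """Active Sensing VAR sketch: among samples that still have missing
    views, pick the one with the largest predictive variance, then pick
    one of its missing views uniformly at random. mask[j, i] is True if
    view j of sample i is observed; assumes at least one sample has a
    missing view."""
    has_missing = ~mask.all(axis=0)                 # samples with >= 1 missing view
    candidates = np.flatnonzero(has_missing)
    i = candidates[np.argmax(pred_var[candidates])]
    missing_views = np.flatnonzero(~mask[:, i])
    j = rng.choice(missing_views)
    return int(i), int(j)
```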

73 In active sensing with MI, we use the EM algorithm to learn the GMM structure with missing entries, and the GMM model is re-estimated after each pair is selected and filled in (this is fast thanks to the incremental updates in the EM algorithm). [sent-544, score-0.696]

74 We first illustrate active sensing with a toy example. [sent-545, score-0.675]

75 To simulate our active sensing experiment, we randomly “hide” one of the two features of each sample with 40% probability each, and with 20% probability observe both features. [sent-547, score-0.639]
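A sketch of this masking scheme for a two-view toy set (illustrative only; the paper's exact sampling procedure may differ).

```python
import numpy as np

def hide_views(n, rng=np.random.default_rng(0)):
    """Simulate the toy masking: for each sample, hide view 1 with
    probability 0.4, hide view 2 with probability 0.4, and keep both
    with probability 0.2. Returns a (2, n) boolean observation mask."""
    mask = np.ones((2, n), dtype=bool)
    u = rng.random(n)
    mask[0, u < 0.4] = False                    # hide the first view
    mask[1, (u >= 0.4) & (u < 0.8)] = False     # hide the second view
    return mask                                  # remaining 20%: both observed
```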

76 For active sensing MI we use the Gaussian kernel with width 0. [sent-550, score-0.627]

77 In Figure 5 (right) we compare active sensing with random sensing, using AUC for the unlabeled data. [sent-555, score-0.698]

78 This indicates that active sensing is much better than random sensing in improving the classification performance. [sent-556, score-1.038]

79 The Bayes optimal accuracy (reachable when there is no missing data) is reached by the 16th query by active sensing, whereas random sensing improves much more slowly with the number of acquired pairs. [sent-557, score-1.206]

80 The features for the 2 views are listed in the left table, and the performance comparison of active sensing and random sensing is shown in the right figure. [sent-565, score-1.392]

81 From Bayesian co-training point of view we have 2 views, with 3 features in the first (clinical feature) view and 2 features in the second (imaging-based feature) view. [sent-584, score-0.316]

82 In the active sensing setup, the first view is available for all the patients, and the second view is available only for a randomly chosen 50% of the patients. [sent-589, score-0.823]

83 Figure 6 (right) shows the test AUC scores (with error-bars) of active sensing and random sensing, for different numbers of acquired pairs. [sent-591, score-0.662]

84 Active sensing in general yields better performance, and is significantly better after the first 5 pairs. [sent-593, score-0.443]

85 Active sensing based on MI and VAR again yield very similar results. [sent-594, score-0.443]

86 We split all the features into 3 views (clinical, pre-treatment imaging, post-treatment imaging), and the features are listed in Figure 7 (left). [sent-604, score-0.398]

87 For active sensing, we assume that all the (labeled or unlabeled) patients have view 1 features available, 70% of the patients have view 2 features available, and 40% of the patients have view 3 features available. [sent-605, score-0.956]

88 Figure 7 (right) shows the performance comparison of active sensing with random sensing, and it is seen that after about 18 pair acquisitions, active sensing is significantly better than random sensing. [sent-608, score-1.19]

89 Active sensing MI and VAR share a similar trend, and the MI based active sensing is overall better than VAR based active sensing. [sent-609, score-1.19]

90 The optimal AUC (when there are no missing features) is shown as a dotted line, and we see that with around 34 actively acquired pairs, active sensing can almost achieve the optimum. [sent-611, score-0.794]

91 It takes, however, much longer for random sensing to reach this performance. [sent-612, score-0.443]

92 In the process, we showed that these algorithms have been making an intrinsic assumption of the form p(fc, f1, f2, ..., fm) ∝ ψ(fc, f1) ψ(fc, f2) · · · [sent-616, score-0.632]

93 ψ(fc, fm), even though it was not explicitly realized earlier. [sent-622, score-0.635]

94 The features for the 3 views are listed in the left table, and the performance comparison of active sensing and random sensing is shown in the right figure. [sent-632, score-1.392]

95 We also extend this framework to handle multi-view data with missing features, and introduce an active sensing framework which allows us to actively acquire missing (sample, view) pairs to maximize performance. [sent-641, score-0.828]

96 The joint probability of all the variables is defined as in (6) and is repeated here: [sent-645, score-0.56]

97 p(yl, fc, f1, ..., fm) = (1/Z) ∏_{i=1}^{nl} ψ(yi, fc(xi)) ∏_{j=1}^m ψ(f_j) ψ(f_j, fc). [sent-648, score-1.215]

98 Marginal 1: Co-Regularized Multi-View Learning. The first marginalization integrates out the latent consensus function fc in (21). [sent-652, score-0.784]

99 Ignoring the output potential ψ(yi, fc(xi)) for the moment, we derive the joint likelihood p(f1, ..., fm). [sent-653, score-0.756]

100 Note that C does not depend on fc. [sent-657, score-0.537]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('fc', 0.537), ('sensing', 0.443), ('views', 0.31), ('consensus', 0.196), ('active', 0.152), ('rosales', 0.133), ('gp', 0.124), ('view', 0.114), ('ao', 0.111), ('rishnapuram', 0.111), ('patients', 0.11), ('raining', 0.11), ('unlabeled', 0.103), ('missing', 0.101), ('gplr', 0.1), ('fm', 0.098), ('marginalizations', 0.094), ('potentials', 0.093), ('ext', 0.089), ('gmm', 0.087), ('bayesian', 0.082), ('toy', 0.08), ('ink', 0.076), ('rained', 0.076), ('yl', 0.068), ('acquired', 0.067), ('auc', 0.067), ('yi', 0.063), ('cancer', 0.058), ('undirected', 0.056), ('nsclc', 0.055), ('rectal', 0.055), ('siemens', 0.055), ('xi', 0.055), ('graphical', 0.053), ('latent', 0.051), ('tumor', 0.051), ('imaging', 0.05), ('post', 0.047), ('classi', 0.046), ('fj', 0.045), ('dfc', 0.044), ('krishnapuram', 0.044), ('rain', 0.044), ('features', 0.044), ('nl', 0.043), ('kc', 0.042), ('pcr', 0.042), ('survival', 0.042), ('patient', 0.041), ('blum', 0.041), ('samples', 0.04), ('canonical', 0.04), ('det', 0.038), ('labeled', 0.036), ('mitchell', 0.036), ('disagreement', 0.034), ('mri', 0.034), ('balaji', 0.033), ('cotraining', 0.033), ('inbound', 0.033), ('nbound', 0.033), ('shipeng', 0.033), ('suvmax', 0.033), ('ultrasound', 0.033), ('mi', 0.033), ('kernel', 0.032), ('var', 0.031), ('dent', 0.031), ('actively', 0.031), ('page', 0.029), ('clinical', 0.029), ('characterizations', 0.029), ('potential', 0.028), ('ers', 0.028), ('citeseer', 0.028), ('webkb', 0.028), ('documents', 0.028), ('sensor', 0.027), ('unobserved', 0.026), ('acquisition', 0.026), ('sensors', 0.025), ('trusted', 0.025), ('bj', 0.025), ('lung', 0.025), ('labels', 0.025), ('web', 0.025), ('exp', 0.024), ('er', 0.024), ('misclassi', 0.023), ('gender', 0.023), ('kj', 0.023), ('joint', 0.023), ('marginal', 0.023), ('diagnosis', 0.022), ('malvern', 0.022), ('melville', 0.022), ('nu', 0.022), ('outbound', 0.022), ('pathologic', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0000004 12 jmlr-2011-Bayesian Co-Training

Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao

Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing

2 0.11759889 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

Author: Liwei Wang

Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combining with existing results, it shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function

3 0.07236363 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data

Author: Zeeshan Syed, John Guttag

Abstract: In medicine, one often bases decisions upon a comparative analysis of patient data. In this paper, we build upon this observation and describe similarity-based algorithms to risk stratify patients for major adverse cardiac events. We evolve the traditional approach of comparing patient data in two ways. First, we propose similarity-based algorithms that compare patients in terms of their long-term physiological monitoring data. Symbolic mismatch identifies functional units in longterm signals and measures changes in the morphology and frequency of these units across patients. Second, we describe similarity-based algorithms that are unsupervised and do not require comparisons to patients with known outcomes for risk stratification. This is achieved by using an anomaly detection framework to identify patients who are unlike other patients in a population and may potentially be at an elevated risk. We demonstrate the potential utility of our approach by showing how symbolic mismatch-based algorithms can be used to classify patients as being at high or low risk of major adverse cardiac events by comparing their long-term electrocardiograms to that of a large population. We describe how symbolic mismatch can be used in three different existing methods: one-class support vector machines, nearest neighbor analysis, and hierarchical clustering. When evaluated on a population of 686 patients with available long-term electrocardiographic data, symbolic mismatch-based comparative approaches were able to identify patients at roughly a two-fold increased risk of major adverse cardiac events in the 90 days following acute coronary syndrome. These results were consistent even after adjusting for other clinical risk variables. Keywords: risk stratification, cardiovascular disease, time-series comparison, symbolic analysis, anomaly detection

4 0.061852358 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

Author: Krishnakumar Balasubramanian, Pinar Donmez, Guy Lebanon

Abstract: Many popular linear classifiers, such as logistic regression, boosting, or SVM, are trained by optimizing a margin-based risk function. Traditionally, these risk functions are computed based on a labeled data set. We develop a novel technique for estimating such risks using only unlabeled data and the marginal label distribution. We prove that the proposed risk estimator is consistent on high-dimensional data sets and demonstrate it on synthetic and real-world data. In particular, we show how the estimate is used for evaluating classifiers in transfer learning, and for training classifiers with no labeled data whatsoever. Keywords: classification, large margin, maximum likelihood

5 0.058421552 41 jmlr-2011-Improved Moves for Truncated Convex Models

Author: M. Pawan Kumar, Olga Veksler, Philip H.S. Torr

Abstract: We consider the problem of obtaining an approximate maximum a posteriori estimate of a discrete random field characterized by pairwise potentials that form a truncated convex model. For this problem, we propose two st-MINCUT based move making algorithms that we call Range Swap and Range Expansion. Our algorithms can be thought of as extensions of αβ-Swap and α-Expansion respectively that fully exploit the form of the pairwise potentials. Specifically, instead of dealing with one or two labels at each iteration, our methods explore a large search space by considering a range of labels (that is, an interval of consecutive labels). Furthermore, we show that Range Expansion provides the same multiplicative bounds as the standard linear programming (LP) relaxation in polynomial time. Compared to previous approaches based on the LP relaxation, for example interior-point algorithms or tree-reweighted message passing (TRW), our methods are faster as they use only the efficient st-MINCUT algorithm in their design. We demonstrate the usefulness of the proposed approaches on both synthetic and standard real data problems. Keywords: truncated convex models, move making algorithms, range moves, multiplicative bounds, linear programming relaxation

6 0.056205656 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

7 0.052506309 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

8 0.051236976 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

9 0.04783399 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

10 0.045208622 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

11 0.040746972 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

12 0.039309312 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

13 0.038104776 55 jmlr-2011-Learning Multi-modal Similarity

14 0.03795455 54 jmlr-2011-Learning Latent Tree Graphical Models

15 0.037345693 59 jmlr-2011-Learning with Structured Sparsity

16 0.037087243 58 jmlr-2011-Learning from Partial Labels

17 0.03681203 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

18 0.03391666 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

19 0.03380933 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

20 0.032047853 7 jmlr-2011-Adaptive Exact Inference in Graphical Models


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.179), (1, -0.074), (2, -0.02), (3, -0.018), (4, 0.02), (5, -0.055), (6, -0.046), (7, 0.025), (8, -0.065), (9, 0.003), (10, -0.078), (11, 0.039), (12, 0.256), (13, -0.039), (14, 0.138), (15, 0.073), (16, -0.097), (17, -0.119), (18, -0.19), (19, 0.068), (20, -0.318), (21, 0.018), (22, 0.163), (23, -0.1), (24, -0.024), (25, -0.102), (26, -0.005), (27, -0.149), (28, 0.059), (29, 0.032), (30, -0.047), (31, -0.089), (32, -0.018), (33, 0.054), (34, -0.172), (35, -0.044), (36, -0.023), (37, 0.048), (38, -0.011), (39, -0.033), (40, -0.007), (41, -0.064), (42, 0.165), (43, -0.036), (44, 0.133), (45, 0.089), (46, 0.11), (47, -0.009), (48, -0.066), (49, -0.148)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.93900341 12 jmlr-2011-Bayesian Co-Training

Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao

Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing

2 0.5704549 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

Author: Liwei Wang

Abstract: We study pool-based active learning in the presence of noise, that is, the agnostic setting. It is known that the effectiveness of agnostic active learning depends on the learning problem and the hypothesis space. Although there are many cases on which active learning is very useful, it is also easy to construct examples that no active learning algorithm can have an advantage. Previous works have shown that the label complexity of active learning relies on the disagreement coefficient which often characterizes the intrinsic difficulty of the learning problem. In this paper, we study the disagreement coefficient of classification problems for which the classification boundary is smooth and the data distribution has a density that can be bounded by a smooth function. We prove upper and lower bounds for the disagreement coefficients of both finitely and infinitely smooth problems. Combining with existing results, it shows that active learning is superior to passive supervised learning for smooth problems. Keywords: active learning, disagreement coefficient, label complexity, smooth function

3 0.4223718 99 jmlr-2011-Unsupervised Similarity-Based Risk Stratification for Cardiovascular Events Using Long-Term Time-Series Data

Author: Zeeshan Syed, John Guttag

Abstract: In medicine, one often bases decisions upon a comparative analysis of patient data. In this paper, we build upon this observation and describe similarity-based algorithms to risk stratify patients for major adverse cardiac events. We evolve the traditional approach of comparing patient data in two ways. First, we propose similarity-based algorithms that compare patients in terms of their long-term physiological monitoring data. Symbolic mismatch identifies functional units in longterm signals and measures changes in the morphology and frequency of these units across patients. Second, we describe similarity-based algorithms that are unsupervised and do not require comparisons to patients with known outcomes for risk stratification. This is achieved by using an anomaly detection framework to identify patients who are unlike other patients in a population and may potentially be at an elevated risk. We demonstrate the potential utility of our approach by showing how symbolic mismatch-based algorithms can be used to classify patients as being at high or low risk of major adverse cardiac events by comparing their long-term electrocardiograms to that of a large population. We describe how symbolic mismatch can be used in three different existing methods: one-class support vector machines, nearest neighbor analysis, and hierarchical clustering. When evaluated on a population of 686 patients with available long-term electrocardiographic data, symbolic mismatch-based comparative approaches were able to identify patients at roughly a two-fold increased risk of major adverse cardiac events in the 90 days following acute coronary syndrome. These results were consistent even after adjusting for other clinical risk variables. Keywords: risk stratification, cardiovascular disease, time-series comparison, symbolic analysis, anomaly detection

4 0.41422173 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

Author: Zhihua Zhang, Guang Dai, Michael I. Jordan

Abstract: We propose a fully Bayesian methodology for generalized kernel mixed models (GKMMs), which are extensions of generalized linear mixed models in the feature space induced by a reproducing kernel. We place a mixture of a point-mass distribution and Silverman’s g-prior on the regression vector of a generalized kernel model (GKM). This mixture prior allows a fraction of the components of the regression vector to be zero. Thus, it serves for sparse modeling and is useful for Bayesian computation. In particular, we exploit data augmentation methodology to develop a Markov chain Monte Carlo (MCMC) algorithm in which the reversible jump method is used for model selection and a Bayesian model averaging method is used for posterior prediction. When the feature basis expansion in the reproducing kernel Hilbert space is treated as a stochastic process, this approach can be related to the Karhunen-Lo` ve expansion of a Gaussian process (GP). Thus, our sparse e modeling framework leads to a flexible approximation method for GPs. Keywords: reproducing kernel Hilbert spaces, generalized kernel models, Silverman’s g-prior, Bayesian model averaging, Gaussian processes

5 0.30503544 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis

Author: Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel

Abstract: This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP . The key insight is to relate ranking criteria as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control for learning transformation models, much in a similar vein as the ’margin’ is for Support Vector Machines for classification. The use of this model structure in the context of high dimensional data, as well as for estimating non-linear, and additive models based on primal-dual kernel machines, and for sparse models is indicated. Given n observations, the present method solves a quadratic program existing of O (n) constraints and O (n) unknowns, where most existing risk minimization approaches to ranking problems typically result in algorithms with O (n2 ) constraints or unknowns. We specify the MINLIP method for three different cases: the first one concerns the preference learning problem. Secondly it is specified how to adapt the method to ordinal regression with a finite set of ordered outcomes. Finally, it is shown how the method can be used in the context of survival analysis where one models failure times, typically subject to censoring. The current approach is found to be particularly useful in this context as it can handle, in contrast with the standard statistical model for analyzing survival data, all types of censoring in a straightforward way, and because of the explicit relation with the Proportional Hazard and Accelerated Failure Time models. The advantage of the current method is illustrated on different benchmark data sets, as well as for estimating a model for cancer survival based on different micro-array and clinical data sets. Keywords: support vector machines, preference learning, ranking models, ordinal regression, survival analysis c

6 0.29743832 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

7 0.2964372 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

8 0.28605923 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

9 0.24812663 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

10 0.22975783 41 jmlr-2011-Improved Moves for Truncated Convex Models

11 0.21839255 88 jmlr-2011-Structured Variable Selection with Sparsity-Inducing Norms

12 0.21829522 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

13 0.20946042 58 jmlr-2011-Learning from Partial Labels

14 0.20138881 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

15 0.19812654 54 jmlr-2011-Learning Latent Tree Graphical Models

16 0.19558245 51 jmlr-2011-Laplacian Support Vector Machines Trained in the Primal

17 0.19495058 102 jmlr-2011-Waffles: A Machine Learning Toolkit

18 0.19300428 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods

19 0.19243002 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models

20 0.18797281 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.052), (9, 0.019), (10, 0.046), (24, 0.044), (31, 0.099), (32, 0.027), (41, 0.042), (44, 0.383), (60, 0.018), (70, 0.011), (71, 0.013), (73, 0.052), (78, 0.078), (90, 0.015)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.67237186 12 jmlr-2011-Bayesian Co-Training

Author: Shipeng Yu, Balaji Krishnapuram, Rómer Rosales, R. Bharat Rao

Abstract: Co-training (or more generally, co-regularization) has been a popular algorithm for semi-supervised learning in data with two feature representations (or views), but the fundamental assumptions underlying this type of models are still unclear. In this paper we propose a Bayesian undirected graphical model for co-training, or more generally for semi-supervised multi-view learning. This makes explicit the previously unstated assumptions of a large class of co-training type algorithms, and also clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for co-training, which is a novel co-training kernel for Gaussian process classifiers. The resulting approach is convex and avoids local-maxima problems, and it can also automatically estimate how much each view should be trusted to accommodate noisy or unreliable views. The Bayesian co-training approach can also elegantly handle data samples with missing views, that is, some of the views are not available for some data points at learning time. This is further extended to an active sensing framework, in which the missing (sample, view) pairs are actively acquired to improve learning performance. The strength of active sensing model is that one actively sensed (sample, view) pair would improve the joint multi-view classification on all the samples. Experiments on toy data and several real world data sets illustrate the benefits of this approach. Keywords: co-training, multi-view learning, semi-supervised learning, Gaussian processes, undirected graphical models, active sensing

2 0.36893815 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

Author: Jinfeng Zhuang, Ivor W. Tsang, Steven C.H. Hoi

Abstract: Previous studies of Non-Parametric Kernel Learning (NPKL) usually formulate the learning task as a Semi-Definite Programming (SDP) problem that is often solved by some general purpose SDP solvers. However, for N data examples, the time complexity of NPKL using a standard interiorpoint SDP solver could be as high as O(N 6.5 ), which prohibits NPKL methods applicable to real applications, even for data sets of moderate size. In this paper, we present a family of efficient NPKL algorithms, termed “SimpleNPKL”, which can learn non-parametric kernels from a large set of pairwise constraints efficiently. In particular, we propose two efficient SimpleNPKL algorithms. One is SimpleNPKL algorithm with linear loss, which enjoys a closed-form solution that can be efficiently computed by the Lanczos sparse eigen decomposition technique. Another one is SimpleNPKL algorithm with other loss functions (including square hinge loss, hinge loss, square loss) that can be re-formulated as a saddle-point optimization problem, which can be further resolved by a fast iterative algorithm. In contrast to the previous NPKL approaches, our empirical results show that the proposed new technique, maintaining the same accuracy, is significantly more efficient and scalable. Finally, we also demonstrate that the proposed new technique is also applicable to speed up many kernel learning tasks, including colored maximum variance unfolding, minimum volume embedding, and structure preserving embedding. Keywords: non-parametric kernel learning, semi-definite programming, semi-supervised learning, side information, pairwise constraints

3 0.36867628 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

4 0.36816493 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar

Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.

5 0.36752576 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks

6 0.36636275 91 jmlr-2011-The Sample Complexity of Dictionary Learning

7 0.36513939 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

8 0.36418405 16 jmlr-2011-Clustering Algorithms for Chains

9 0.36277241 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

10 0.36267549 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

11 0.36017895 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

12 0.35857594 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

13 0.35757151 96 jmlr-2011-Two Distributed-State Models For Generating High-Dimensional Time Series

14 0.35685837 13 jmlr-2011-Bayesian Generalized Kernel Mixed Models

15 0.35644108 48 jmlr-2011-Kernel Analysis of Deep Networks

16 0.35405469 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

17 0.35375786 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

18 0.35347441 104 jmlr-2011-X-Armed Bandits

19 0.35298762 69 jmlr-2011-Neyman-Pearson Classification, Convexity and Stochastic Constraints

20 0.35263714 7 jmlr-2011-Adaptive Exact Inference in Graphical Models