jmlr jmlr2011 jmlr2011-56: Learning Transformation Models for Ranking and Survival Analysis (knowledge-graph by maker-knowledge-mining)
Source: pdf
Author: Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel
Abstract: This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP. The key insight is to relate ranking criteria such as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control when learning transformation models, much as the 'margin' is for Support Vector Machines in classification. The use of this model structure in the context of high dimensional data, as well as for estimating non-linear, additive models based on primal-dual kernel machines, and for sparse models is indicated. Given n observations, the present method solves a quadratic program consisting of O(n) constraints and O(n) unknowns, where most existing risk minimization approaches to ranking problems typically result in algorithms with O(n^2) constraints or unknowns. We specify the MINLIP method for three different cases: the first concerns the preference learning problem. Secondly, it is specified how to adapt the method to ordinal regression with a finite set of ordered outcomes. Finally, it is shown how the method can be used in the context of survival analysis, where one models failure times, typically subject to censoring. The current approach is found to be particularly useful in this context as it can handle, in contrast with the standard statistical model for analyzing survival data, all types of censoring in a straightforward way, and because of the explicit relation with the Proportional Hazard and Accelerated Failure Time models. The advantage of the current method is illustrated on different benchmark data sets, as well as for estimating a model for cancer survival based on different micro-array and clinical data sets.
Keywords: support vector machines, preference learning, ranking models, ordinal regression, survival analysis
Reference: text
Katholieke Universiteit Leuven, ESAT-SCD, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium. Editor: Nicolas Vayatis.
Learning ranking functions offers a solution to different types of problems, including ordinal regression, bipartite ranking and discounted cumulative gain ranking (DCG; see Clémençon and Vayatis, 2007), studied frequently in research on information retrieval.
Examples in which k = ∞ are found in survival analysis and preference learning, in cases where the number of classes is not known in advance.
The second component of the model maps this utility to an outcome in R by a transformation function h : R → R.
The central observation is that when one knows the ordinal relations between instances, one can estimate a transformation function mapping the instances to their utility value u(X).
For ranking and survival analysis one typically ignores the second phase, whereas in ordinal regression a prediction of the output level is found by combining the first and the second components.
Transformation models are especially appropriate when considering data arising from a survival study.
The goal in survival analysis is often to relate the time-to-event of an instance to a corresponding set of covariates.
Given a data set D = {(X(i), Y(i))}, i = 1, ..., n, where the instances are sorted such that Y(i) ≤ Y(i+1), a utility function u(X) = w^T φ(X) is trained such that the ranking of the evaluations of this function is representative of the ranking of the outcomes.
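The O(n) constraint structure follows from the sorting: once the data are ordered by outcome, one constraint between each pair of consecutive observations suffices. The following is a minimal sketch in the linear case (φ(X) = X), not the authors' implementation; the helper name and toy data are made up for illustration.

```python
# Minimal sketch of the O(n) MINLIP constraint set for outcome-sorted data:
# w^T (X(i+1) - X(i)) >= Y(i+1) - Y(i), one constraint per consecutive pair.

def minlip_constraints(X, y):
    """Return (lhs, rhs): the rows of w^T lhs[i] >= rhs[i] for sorted data."""
    assert all(y[i] <= y[i + 1] for i in range(len(y) - 1)), "sort by outcome first"
    lhs, rhs = [], []
    for i in range(len(y) - 1):
        lhs.append([a - b for a, b in zip(X[i + 1], X[i])])  # X(i+1) - X(i)
        rhs.append(y[i + 1] - y[i])                          # Y(i+1) - Y(i)
    return lhs, rhs

# Toy data: four observations, already sorted by outcome.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 2.0]]
y = [0.5, 1.0, 2.5, 4.0]
lhs, rhs = minlip_constraints(X, y)
print(len(lhs))  # 3 constraints for n = 4, i.e., O(n) rather than O(n^2)
```

Ranking SVM formulations, by contrast, would enumerate all pairs, giving n(n-1)/2 constraints.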
We find that the class of transformation models is a powerful tool to model data arising from survival studies, for several reasons.
The first reason is that they cleanly separate the model for the time-scale (via the transformation function) from the qualitative characterization of an instance (via the utility function).
In the following, we relate the transformation function to ranking criteria such as Kendall's τ or the area under the curve (AUC), hence outlining a unified framework for studying survival models as used in a statistical context alongside machine learning techniques for learning ranking functions.
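As a concrete reading of this criterion, Kendall's τ can be computed directly from outcomes and utility scores. The sketch below uses made-up data and assumes no ties in the outcomes; it is illustrative only.

```python
# Kendall's tau between outcomes y and utility scores u: the (normalized)
# excess of concordant over discordant pairs.
from itertools import combinations

def kendall_tau(y, u):
    """Kendall's tau, assuming no ties in y."""
    concordant = discordant = 0
    for i, j in combinations(range(len(y)), 2):
        s = (y[i] - y[j]) * (u[i] - u[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(y) * (len(y) - 1) / 2)

# 5 concordant and 1 discordant pair out of 6: tau = (5 - 1) / 6
print(kendall_tau([1, 2, 3, 4], [0.1, 0.4, 0.3, 0.9]))  # ≈ 0.667
```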
Thirdly, in studies of failure time data, censoring is omnipresent.
We consider empirical studies of ordinal regression and survival analysis.
Performance of MINLIP on ordinal regression is analyzed using the ordinal data compiled by Chu and Keerthi (2005).
In a last study, concerning a clinical breast cancer survival study (Schumacher et al.
Most notably, this paper additionally elaborates on the case of survival analysis and a number of new case studies.
2 The Agnostic Case
In case it is impossible to find a utility function u : R^d → R extracting the ranking perfectly, a noisy transformation model is considered: Y = h(w^T X + ε), where u(X) = w^T X.
Estimation of the transformation function for ordinal regression and survival analysis will be illustrated later.
To cope with this issue, we add dummy observations (X, B) in between two consecutive ordinal classes with levels Y(i) < Y(i+1) such that B(i) = (Y(i) + Y(i+1))/2 (see Figure 4), leaving their covariates and utility function unspecified.
Figure 5: Prediction for ordinal regression (outcome levels Y(1), Y(2), Y(3), dummy level B(1), and thresholds v(1), v(2) on the utility axis).
MINLIP for ordinal regression, including unknown thresholds, has the advantage of reducing the prediction step to a simple comparison between the utility of a new observation and the utilities of the thresholds.
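That prediction step can be sketched in a few lines. The threshold values below are illustrative placeholders, not learned quantities.

```python
# Prediction for ordinal regression with learned threshold utilities:
# the ordinal level is simply the position of the new observation's utility
# among the sorted thresholds v(1) <= v(2) <= ...
import bisect

def predict_level(utility, thresholds):
    """Ordinal level (1-based) of a utility value given sorted thresholds."""
    return bisect.bisect_right(thresholds, utility) + 1

thresholds = [40.0, 55.0]  # v(1), v(2): two boundaries give three levels (made up)
print(predict_level(35.0, thresholds))  # 1
print(predict_level(47.0, thresholds))  # 2
print(predict_level(60.0, thresholds))  # 3
```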
They impose a Gaussian process prior distribution on the utility function (called a latent function in their work) and employ an appropriate likelihood function for ordinal variables.
Transformation Models for Failure Time Data
We now turn our attention to the case where the data originate from a survival study, that is, the dependent variable is essentially a time-to-failure and typically requires specific models and tools to capture its behavior.
A key quantity in survival analysis is the conditional survival function S(t|u(X)) : R+ → [0, 1], defined as S(t|u(X)) = P(T > t | u(X)), denoting the probability of the event occurring past t given the value of the utility function u(X) = w^T X.
A related quantity to the conditional survival function is the conditional hazard function λ : R → R+, defined as λ(t|u(X)) = lim_{Δt→0} P(t ≤ T < t + Δt | u(X), T ≥ t) / Δt.
Finally, one can make the relation between the hazard λ and the survival function S even more explicit by introducing the conditional cumulative hazard function Λ(t|u(X)) = ∫_0^t λ(r|u(X)) dr for t ≥ 0, such that Λ(t|u(X)) = −ln S(t|u(X)).
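The relation Λ(t) = −ln S(t) can be checked numerically in the simplest case of a constant hazard, where S(t) = exp(−rate · t). The rate below is arbitrary.

```python
# Numeric check of Lambda(t) = -ln S(t) for a constant hazard lambda(r) = rate.
import math

def cumulative_hazard(rate, t, steps=10000):
    # Riemann-sum integral of the constant hazard on [0, t]
    dt = t / steps
    return sum(rate * dt for _ in range(steps))

rate, t = 0.7, 2.0
S = math.exp(-rate * t)           # survival under a constant hazard
Lam = cumulative_hazard(rate, t)  # equals rate * t up to rounding
print(abs(Lam + math.log(S)) < 1e-6)  # True: Lambda(t) = -ln S(t)
```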
The following subsection enumerates some commonly used (semi-)parametric methods for modelling the survival and hazard functions.
2 Transformation Models for Survival Analysis
The transformation model (see Definition 1) encompasses a broad class of models, including the following classical survival models.
Under the Cox model, the value of the survival function at t = T is S(T|X) = [S0(T)]^{exp(−β^T X)}, where S0(t) = exp(−Λ0(t)) is called the baseline survival function.
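The Cox relation above is easy to evaluate pointwise. The baseline value and coefficients below are made-up illustrative numbers, not estimates from any data set; the sign convention exp(−β^T X) follows the text.

```python
# Pointwise evaluation of the Cox survival relation S(T|X) = S0(T)^exp(-beta^T x).
import math

def cox_survival(S0_t, beta, x):
    """Survival at a fixed time T with baseline value S0_t, per the text's
    sign convention exp(-beta^T x)."""
    risk = math.exp(-sum(b * xi for b, xi in zip(beta, x)))
    return S0_t ** risk

S0_t = 0.8           # baseline survival at some fixed time T (made up)
beta = [0.5, -0.2]   # made-up coefficients
print(cox_survival(S0_t, beta, [0.0, 0.0]))  # 0.8: zero covariates give the baseline
```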
In general the survival function equals S(t) = 1 − F(t), which together with Equation (15) leads to ln((1 − S(T|X)) / S(T|X)) = α(T) + β^T X ⇒ ε = α(T) + u(X) ⇒ T = h(−u(X) + ε).
A failure time is called censored when the exact time of failure is not observed.
This indicator is defined depending on the censoring types present in the data: right censoring occurs when the event of interest has not occurred by the last follow-up time.
In case of right censoring, the comparability indicator Δ takes the value 1 for two observations i and j when the observation with the earliest failure time is observed, and zero otherwise: Δ(T_i, T_j) = 1 if (T_i < T_j and δ_i = 0) or (T_j < T_i and δ_j = 0), and 0 otherwise.
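This indicator translates directly into code. Note that the sketch follows the text's convention in which δ = 0 marks an observed (uncensored) failure.

```python
# Comparability indicator Delta(T_i, T_j) for right censoring.
def comparable_right(t_i, delta_i, t_j, delta_j):
    """1 if the pair's order in survival time is known; delta = 0 marks an
    observed failure (the convention used in the text)."""
    if t_i < t_j and delta_i == 0:
        return 1
    if t_j < t_i and delta_j == 0:
        return 1
    return 0

print(comparable_right(2.0, 0, 5.0, 1))  # 1: earlier failure time is observed
print(comparable_right(2.0, 1, 5.0, 0))  # 0: earlier failure time is censored
```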
Left censoring deals with the case when the failure is known to have happened before a certain time.
Interval censoring is a combination of the previous two censoring types.
In this case the failure time is not known exactly; instead, an interval containing the failure time is indicated.
Whether two observations are comparable or not in case of interval censoring depends on the censoring times T̲_i and T̄_i defining the failure interval for each observation i: T_i ∈ [T̲_i, T̄_i].
For uncensored observations, the failure interval reduces to a single time, namely the failure time T_i = T̲_i = T̄_i.
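A natural reading, inferred from the definitions above rather than quoted from the text, is that two interval-censored observations are comparable exactly when their failure intervals are disjoint, so their order is known with certainty.

```python
# Comparability under interval censoring (inferred reading): a pair is
# comparable when the two failure intervals [lo, hi] do not overlap.
def comparable_interval(lo_i, hi_i, lo_j, hi_j):
    """1 when the intervals are disjoint (order known), 0 otherwise."""
    return 1 if (hi_i < lo_j or hi_j < lo_i) else 0

print(comparable_interval(1.0, 2.0, 3.0, 4.0))  # 1: disjoint intervals
print(comparable_interval(1.0, 3.5, 3.0, 4.0))  # 0: overlapping intervals
print(comparable_interval(2.0, 2.0, 5.0, 5.0))  # 1: two uncensored failure times
```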
Standard statistical methods for modelling survival data obtain parameter estimates by maximizing a (partial) likelihood with respect to these parameters.
In the next section we will illustrate that MINLIP can be easily adapted to right, left, interval and combined censoring schemes.
Two observations i and j are comparable if their relative order in survival time is known.
A pair of observations i and j is concordant if they are comparable and the observation with the lowest failure time also has the lowest score for the utility function u(X).
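These two definitions combine into the concordance index (the fraction of comparable pairs that are concordant). The sketch below handles right censoring with the text's δ = 0 convention for an observed failure; the toy data are made up.

```python
# Concordance index: fraction of comparable pairs that are concordant,
# for right-censored data with delta = 0 marking an observed failure.
from itertools import combinations

def c_index(times, deltas, utilities):
    comparable = concordant = 0
    for i, j in combinations(range(len(times)), 2):
        if times[j] < times[i]:          # let i denote the earlier failure time
            i, j = j, i
        if times[i] < times[j] and deltas[i] == 0:
            comparable += 1
            if utilities[i] < utilities[j]:  # earliest failure has lowest utility
                concordant += 1
    return concordant / comparable

times = [1.0, 2.0, 3.0]
deltas = [0, 0, 1]                       # third observation right-censored
print(c_index(times, deltas, [0.2, 0.5, 0.9]))  # 1.0: perfectly ranked
print(c_index(times, deltas, [0.9, 0.5, 0.2]))  # 0.0: ranking fully reversed
```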
5 Prediction with Transformation Models
The prediction step in survival analysis refers to the estimation of survival and hazard functions rather than the estimation of the failure time itself.
The proportional hazard model estimates these functions by assuming that a baseline hazard function exists, with the covariates changing the hazard only proportionally.
The survival function is found as S(u_i, l) = 1 − F(u_i, l).
The discrete hazard λ(u_i, l) is estimated per time interval, scaled by the interval length t_{l+1} − t_l. Note the analogy with the partial logistic artificial neural network approach to the survival problem proposed by Biganzoli et al.
In a first subsection, three artificial examples illustrate how transformation models are used within the ranking, ordinal regression and survival settings (see also Table 1).
The last two examples concern survival data, one with micro-array data (data also used in Bøvelstad et al.
For both survival examples, the cross-validation concordance index was used as model selection criterion, since the main interest lies in the ranking of the patients.
1 Artificial Examples
This section illustrates the different steps needed to obtain the desired output for ranking, regression and survival problems, using artificial examples.
The outcome is defined as the ranking given by a weighted sum of the ranking in the 9 races, age, weight and condition score.
Since one is only interested in ranking the cyclists, the value of the utility is irrelevant.
Figure 6: Artificial example illustrating the use of the MINLIP transformation model for the ranking setting (estimated utility û versus ranking).
Figure 7: Artificial example illustrating the use of the MINLIP transformation model for ordinal regression (panels (a) and (b): predicted outcome ŷ versus estimated utility û).
The MINLIP model for ordinal regression results in an estimate of the utility function and threshold values.
Students with a utility between both thresholds (medium grey) are estimated to be average students, and students with a utility higher than the second threshold (dark grey) are predicted to be good students.
In addition to an estimate of the utility, the MINLIP model for ordinal regression gives threshold values which can be used to predict the outcome of new observations (see Figure 7).
For the first treatment arm, the survival time has a Weibull distribution with parameters 1 and 0.
For the second treatment arm, the survival time is Weibull distributed with parameters 4 and 5.
Using the information on age, treatment arm and survival for 100 patients, one would like to predict the survival of the remaining 50 patients.
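A setup of this kind, two arms with Weibull-distributed survival times, can be simulated in a few lines. The exact Weibull parameters are truncated in the text above, so the shape and scale values below are placeholders chosen only to make the two arms clearly different.

```python
# Simulating two treatment arms with Weibull-distributed survival times.
# random.weibullvariate(alpha, beta): alpha is the scale, beta the shape.
# Parameter values are placeholders, not the ones used in the paper.
import random

random.seed(0)
arm1 = [random.weibullvariate(1.0, 0.8) for _ in range(75)]
arm2 = [random.weibullvariate(4.0, 5.0) for _ in range(75)]

mean1 = sum(arm1) / len(arm1)
mean2 = sum(arm2) / len(arm2)
print(mean2 > mean1)  # True: the second arm survives longer on average
```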
Figure 8 illustrates that the MINLIP model is able to divide the group of test patients into two groups with significantly different survival (p = 0.
However, in survival analysis, additional information can be provided by performing the second part of the transformation model, namely estimating the transformation function.
The estimated survival curves for all patients are calculated (Figure 9).
The grey and black survival curves correspond to patients in the first and second treatment arm, respectively.
The true survival functions for the first and second treatment are illustrated in thick black and grey lines, respectively.
Figure 8: Artificial example illustrating the use of the MINLIP transformation model for survival analysis (panels (a) and (b)).
The utility is able to group the patients according to the relevant variable, treatment (see the clear separation between circles and stars).
Figure 9: Artificial example illustrating the use of the MINLIP transformation model for survival analysis: illustration of the reconstruction step.
For each patient, the survival curve is calculated using the method discussed in Section 5.
The true survival curves for the first and second treatment are illustrated in thick black and grey lines.
One clearly notices two distinct survival groups, corresponding to the treatment groups.
Table: ordinal regression results of MINLIP versus reference methods on the pyrimidines, triazines, Wisconsin, machine CPU, auto MPG and Boston housing data sets.
The SPCR method (Bair and Tibshirani, 2004; Bair, Hastie, Debashis and Tibshirani, 2006) first selects a subset of genes which are correlated with survival by using univariate selection, and then applies PCR to this subset.
4 Failure Time Data: Cancer Study
In this last example, we investigate the ability of the MINLIP model to estimate how the different covariates influence the survival time.
6%) of the patients had a breast cancer related event within the study period, leaving all other patients with a right censored failure time.
Figure 10: Concordance (left) and time-dependent receiver operating characteristic curve (TDROC) (right) on the test set for three micro-array survival data sets (top: DBCD, middle: DLBCL, bottom: NSBCD).
Note that in Figure 11 the estimates are inversely related to the survival time, whereas in Figure 12 the estimates are related to the survival time itself.
The MINLIP model estimates a higher survival time for older patients up to the age of 65, after which the survival time drops again.
According to this model, a larger tumor, a higher number of positive lymph nodes and lower progesterone and estrogen receptor levels result in lower survival times and thus a higher risk of relapse.
Such models are found useful in the context of ordinal regression and survival analysis, and relate directly to commonly used risk measures such as the area under the curve and others.
Extensions towards tasks where transformation models provide only a (good) approximation (the agnostic case), ordinal regression and survival analysis are given.
Experiments on ordinal regression and survival analysis, on both clinical and high dimensional data sets, illustrate the use of the proposed method.
The estimated effects are inversely related to the survival time.
The estimated covariate effects are directly related to the survival time.
The MINLIP model estimates the covariate effects as follows: the estimated survival time increases with age until the age of 65, after which it drops slightly.
The larger the tumor, the higher the number of positive lymph nodes, and the lower the expression of the receptors, the lower the estimated survival time.
The spread in the survival curves is broader for the MINLIP model, which is confirmed by a larger value of the log-rank test statistic.
In the ordinal regression case, unknown thresholds v are introduced, corresponding to an outcome intermediate between two successive outcome levels.
The model is built by requiring that the difference between the utility of an observation X_i and the largest threshold below that observation's outcome Y_i is larger than the difference between Y_i and the outcome level corresponding to that threshold.
Semi-supervised methods to predict patient survival from gene expression data.
Feedforward neural networks for the analysis of censored survival data: a partial logistic regression approach.
Predicting survival from microarray data: a comparative study.
Time-dependent ROC curves for censored survival data and a diagnostic marker.
Applying a neural network to prostate cancer survival data.
The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma.
A gene-expression signature as a predictor of survival in breast cancer.
wordName wordTfidf (topN-words)
[('survival', 0.525), ('minlip', 0.472), ('utility', 0.197), ('censoring', 0.172), ('ordinal', 0.153), ('failure', 0.146), ('concordance', 0.135), ('elckmans', 0.135), ('elle', 0.135), ('uffel', 0.135), ('urvival', 0.135), ('ranking', 0.134), ('wt', 0.131), ('ransformation', 0.129), ('uykens', 0.115), ('transformation', 0.113), ('censored', 0.11), ('hazard', 0.104), ('van', 0.102), ('anking', 0.095), ('patients', 0.092), ('cox', 0.079), ('outcome', 0.077), ('pelckmans', 0.067), ('freq', 0.061), ('pls', 0.061), ('breast', 0.061), ('ti', 0.061), ('odels', 0.058), ('belle', 0.055), ('imc', 0.055), ('lipschitz', 0.055), ('covariates', 0.054), ('students', 0.052), ('rd', 0.052), ('suykens', 0.052), ('gpor', 0.049), ('spcr', 0.049), ('cancer', 0.048), ('exc', 0.047), ('age', 0.043), ('eisch', 0.043), ('kalb', 0.043), ('progesterone', 0.043), ('dummy', 0.042), ('pcr', 0.042), ('treatment', 0.042), ('risk', 0.041), ('grey', 0.04), ('ki', 0.039), ('zl', 0.039), ('regression', 0.037), ('comparability', 0.037), ('lymph', 0.033), ('margin', 0.032), ('monotonically', 0.032), ('threshold', 0.032), ('ranksvm', 0.031), ('relapse', 0.031), ('whereafter', 0.031), ('realizable', 0.03), ('dy', 0.029), ('thresholds', 0.029), ('covariate', 0.029), ('chu', 0.029), ('yl', 0.028), ('medicine', 0.028), ('arm', 0.026), ('leuven', 0.026), ('estrogen', 0.026), ('mv', 0.026), ('ei', 0.025), ('cn', 0.025), ('bair', 0.025), ('componentwise', 0.025), ('dabrowska', 0.025), ('harrell', 0.025), ('koenker', 0.025), ('qb', 0.025), ('receptors', 0.025), ('rlie', 0.025), ('schumacher', 0.025), ('splines', 0.025), ('velstad', 0.025), ('interval', 0.024), ('tumor', 0.024), ('ln', 0.024), ('dc', 0.024), ('preference', 0.024), ('concordant', 0.023), ('mencon', 0.023), ('herbrich', 0.023), ('agnostic', 0.023), ('prentice', 0.023), ('observations', 0.022), ('graepel', 0.022), ('tl', 0.022), ('brabanter', 0.021), ('prognostic', 0.021), ('women', 0.021)]
simIndex simValue paperId paperTitle
same-paper 1 0.99999905 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
Author: Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel
Abstract: This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP . The key insight is to relate ranking criteria as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control for learning transformation models, much in a similar vein as the ’margin’ is for Support Vector Machines for classification. The use of this model structure in the context of high dimensional data, as well as for estimating non-linear, and additive models based on primal-dual kernel machines, and for sparse models is indicated. Given n observations, the present method solves a quadratic program existing of O (n) constraints and O (n) unknowns, where most existing risk minimization approaches to ranking problems typically result in algorithms with O (n2 ) constraints or unknowns. We specify the MINLIP method for three different cases: the first one concerns the preference learning problem. Secondly it is specified how to adapt the method to ordinal regression with a finite set of ordered outcomes. Finally, it is shown how the method can be used in the context of survival analysis where one models failure times, typically subject to censoring. The current approach is found to be particularly useful in this context as it can handle, in contrast with the standard statistical model for analyzing survival data, all types of censoring in a straightforward way, and because of the explicit relation with the Proportional Hazard and Accelerated Failure Time models. The advantage of the current method is illustrated on different benchmark data sets, as well as for estimating a model for cancer survival based on different micro-array and clinical data sets. 
Keywords: support vector machines, preference learning, ranking models, ordinal regression, survival analysis c
Author: Zeeshan Syed, John Guttag
Abstract: In medicine, one often bases decisions upon a comparative analysis of patient data. In this paper, we build upon this observation and describe similarity-based algorithms to risk stratify patients for major adverse cardiac events. We evolve the traditional approach of comparing patient data in two ways. First, we propose similarity-based algorithms that compare patients in terms of their long-term physiological monitoring data. Symbolic mismatch identifies functional units in longterm signals and measures changes in the morphology and frequency of these units across patients. Second, we describe similarity-based algorithms that are unsupervised and do not require comparisons to patients with known outcomes for risk stratification. This is achieved by using an anomaly detection framework to identify patients who are unlike other patients in a population and may potentially be at an elevated risk. We demonstrate the potential utility of our approach by showing how symbolic mismatch-based algorithms can be used to classify patients as being at high or low risk of major adverse cardiac events by comparing their long-term electrocardiograms to that of a large population. We describe how symbolic mismatch can be used in three different existing methods: one-class support vector machines, nearest neighbor analysis, and hierarchical clustering. When evaluated on a population of 686 patients with available long-term electrocardiographic data, symbolic mismatch-based comparative approaches were able to identify patients at roughly a two-fold increased risk of major adverse cardiac events in the 90 days following acute coronary syndrome. These results were consistent even after adjusting for other clinical risk variables. Keywords: risk stratification, cardiovascular disease, time-series comparison, symbolic analysis, anomaly detection
3 0.10941669 87 jmlr-2011-Stochastic Methods forl1-regularized Loss Minimization
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.1 Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
4 0.081103228 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to wide varieties of sparse estimation problems. We experimentally confirm our analysis in a large scale ℓ1 -regularized logistic regression problem and extensively compare the efficiency of DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
5 0.058252182 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
Author: Aad van der Vaart, Harry van Zanten
Abstract: We consider the quality of learning a response function by a nonparametric Bayesian approach using a Gaussian process (GP) prior on the response function. We upper bound the quadratic risk of the learning procedure, which in turn is an upper bound on the Kullback-Leibler information between the predictive and true data distribution. The upper bound is expressed in small ball probabilities and concentration measures of the GP prior. We illustrate the computation of the upper bound for the Mat´ rn and squared exponential kernels. For these priors the risk, and hence the e information criterion, tends to zero for all continuous response functions. However, the rate at which this happens depends on the combination of true response function and Gaussian prior, and is expressible in a certain concentration function. In particular, the results show that for good performance, the regularity of the GP prior should match the regularity of the unknown response function. Keywords: Bayesian learning, Gaussian prior, information rate, risk, Mat´ rn kernel, squared e exponential kernel
6 0.056205656 12 jmlr-2011-Bayesian Co-Training
7 0.049933124 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
8 0.049506269 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
9 0.049245596 36 jmlr-2011-Generalized TD Learning
10 0.048662145 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
11 0.045989919 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
12 0.045563318 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
13 0.03965674 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
14 0.035107706 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
15 0.035105705 55 jmlr-2011-Learning Multi-modal Similarity
16 0.032472841 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
17 0.027803993 30 jmlr-2011-Efficient Structure Learning of Bayesian Networks using Constraints
18 0.025710531 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
19 0.025552606 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
20 0.025493968 11 jmlr-2011-Approximate Marginals in Latent Gaussian Models
topicId topicWeight
[(0, 0.169), (1, 0.099), (2, -0.01), (3, -0.063), (4, -0.057), (5, -0.155), (6, -0.138), (7, 0.007), (8, -0.102), (9, 0.085), (10, -0.057), (11, -0.016), (12, 0.083), (13, 0.042), (14, 0.074), (15, 0.07), (16, -0.07), (17, 0.032), (18, 0.091), (19, -0.044), (20, -0.288), (21, 0.103), (22, 0.052), (23, -0.137), (24, 0.129), (25, -0.213), (26, 0.257), (27, 0.025), (28, -0.042), (29, -0.023), (30, 0.036), (31, -0.051), (32, 0.083), (33, 0.049), (34, 0.007), (35, 0.058), (36, -0.016), (37, -0.189), (38, 0.179), (39, 0.037), (40, 0.078), (41, 0.08), (42, -0.018), (43, 0.015), (44, -0.112), (45, 0.005), (46, -0.028), (47, -0.006), (48, 0.064), (49, 0.1)]
simIndex simValue paperId paperTitle
same-paper 1 0.9319123 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
Author: Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel
Abstract: This paper studies the task of learning transformation models for ranking problems, ordinal regression and survival analysis. The present contribution describes a machine learning approach termed MINLIP. The key insight is to relate ranking criteria such as the Area Under the Curve to monotone transformation functions. Consequently, the notion of a Lipschitz smoothness constant is found to be useful for complexity control when learning transformation models, much in the same vein as the 'margin' is for Support Vector Machines for classification. The use of this model structure in the context of high-dimensional data, as well as for estimating non-linear and additive models based on primal-dual kernel machines, and for sparse models, is indicated. Given n observations, the present method solves a quadratic program consisting of O(n) constraints and O(n) unknowns, whereas most existing risk minimization approaches to ranking problems typically result in algorithms with O(n^2) constraints or unknowns. We specify the MINLIP method for three different cases: the first concerns the preference learning problem. Secondly, it is specified how to adapt the method to ordinal regression with a finite set of ordered outcomes. Finally, it is shown how the method can be used in the context of survival analysis, where one models failure times, typically subject to censoring. The current approach is found to be particularly useful in this context as it can handle, in contrast with the standard statistical model for analyzing survival data, all types of censoring in a straightforward way, and because of the explicit relation with the Proportional Hazard and Accelerated Failure Time models. The advantage of the current method is illustrated on different benchmark data sets, as well as for estimating a model for cancer survival based on different micro-array and clinical data sets.
Keywords: support vector machines, preference learning, ranking models, ordinal regression, survival analysis
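The O(n) constraint count the abstract highlights comes from constraining only consecutive pairs after sorting by outcome. A minimal sketch of that construction for the noiseless linear case is below; the data, the choice of scipy's SLSQP solver, and all variable names are illustrative assumptions, not the authors' implementation (which also estimates the transformation function and handles censoring).

```python
import numpy as np
from scipy.optimize import minimize

# Toy, noiseless data: outcomes generated by a linear utility.
rng = np.random.default_rng(0)
n, d = 30, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -0.5, 2.0])
t = X @ w_true                      # observed outcomes (e.g., failure times)

# Sort by outcome; only consecutive pairs are constrained, giving O(n) rows
# instead of the O(n^2) rows of all-pairs ranking formulations.
order = np.argsort(t)
Xs, ts = X[order], t[order]
dX, dt = np.diff(Xs, axis=0), np.diff(ts)

# MINLIP-style program: minimize ||w||^2 (controlling the Lipschitz constant
# of the transformation) subject to the utility preserving consecutive gaps.
res = minimize(lambda w: w @ w, x0=np.zeros(d), jac=lambda w: 2 * w,
               method="SLSQP",
               constraints=[{"type": "ineq",
                             "fun": lambda w: dX @ w - dt,
                             "jac": lambda w: dX}])
w_hat = res.x
# The learned utility Xs @ w_hat increases along the true outcome ordering.
```

A dedicated QP solver would be used in practice; SLSQP merely keeps the sketch self-contained.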
Author: Zeeshan Syed, John Guttag
Abstract: In medicine, one often bases decisions upon a comparative analysis of patient data. In this paper, we build upon this observation and describe similarity-based algorithms to risk stratify patients for major adverse cardiac events. We evolve the traditional approach of comparing patient data in two ways. First, we propose similarity-based algorithms that compare patients in terms of their long-term physiological monitoring data. Symbolic mismatch identifies functional units in long-term signals and measures changes in the morphology and frequency of these units across patients. Second, we describe similarity-based algorithms that are unsupervised and do not require comparisons to patients with known outcomes for risk stratification. This is achieved by using an anomaly detection framework to identify patients who are unlike other patients in a population and may potentially be at an elevated risk. We demonstrate the potential utility of our approach by showing how symbolic mismatch-based algorithms can be used to classify patients as being at high or low risk of major adverse cardiac events by comparing their long-term electrocardiograms to those of a large population. We describe how symbolic mismatch can be used in three different existing methods: one-class support vector machines, nearest neighbor analysis, and hierarchical clustering. When evaluated on a population of 686 patients with available long-term electrocardiographic data, symbolic mismatch-based comparative approaches were able to identify patients at roughly a two-fold increased risk of major adverse cardiac events in the 90 days following acute coronary syndrome. These results were consistent even after adjusting for other clinical risk variables. Keywords: risk stratification, cardiovascular disease, time-series comparison, symbolic analysis, anomaly detection
3 0.33293578 87 jmlr-2011-Stochastic Methods for ℓ1-Regularized Loss Minimization
Author: Shai Shalev-Shwartz, Ambuj Tewari
Abstract: We describe and analyze two stochastic methods for ℓ1-regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the choice of feature or example is uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets. Keywords: L1 regularization, optimization, coordinate descent, mirror descent, sparsity
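The first of the two methods (one uniformly random coordinate updated per step) can be sketched for the Lasso with a closed-form soft-thresholding update; the toy data and the specific step-size convention below are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator: the prox of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def stochastic_cd_lasso(X, y, lam, iters=5000, seed=0):
    """Stochastic coordinate descent for (1/2n)||Xw - y||^2 + lam * ||w||_1:
    each step updates one uniformly random coordinate in closed form."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    L = (X ** 2).sum(axis=0) / n      # per-coordinate Lipschitz constants
    w = np.zeros(d)
    r = -y.copy()                     # residual Xw - y, maintained incrementally
    for _ in range(iters):
        j = rng.integers(d)           # feature chosen uniformly at random
        g = X[:, j] @ r / n           # partial derivative of the smooth part
        w_new = soft(w[j] - g / L[j], lam / L[j])
        r += X[:, j] * (w_new - w[j]) # keep the residual in sync
        w[j] = w_new
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
w_star = np.zeros(10); w_star[:3] = [2.0, -1.5, 1.0]   # sparse ground truth
y = X @ w_star + 0.05 * rng.normal(size=200)
w_hat = stochastic_cd_lasso(X, y, lam=0.05)
```

Maintaining the residual incrementally keeps each update at O(n) cost, which is what makes the per-iteration price of the stochastic method low.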
4 0.31250429 89 jmlr-2011-Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation
Author: Ryota Tomioka, Taiji Suzuki, Masashi Sugiyama
Abstract: We analyze the convergence behaviour of a recently proposed algorithm for regularized estimation called Dual Augmented Lagrangian (DAL). Our analysis is based on a new interpretation of DAL as a proximal minimization algorithm. We theoretically show under some conditions that DAL converges super-linearly in a non-asymptotic and global sense. Due to a special modelling of sparse estimation problems in the context of machine learning, the assumptions we make are milder and more natural than those made in conventional analysis of augmented Lagrangian algorithms. In addition, the new interpretation enables us to generalize DAL to a wide variety of sparse estimation problems. We experimentally confirm our analysis on a large-scale ℓ1-regularized logistic regression problem and extensively compare the efficiency of the DAL algorithm to previously proposed algorithms on both synthetic and benchmark data sets. Keywords: dual augmented Lagrangian, proximal minimization, global convergence, sparse estimation, convex optimization
5 0.28206775 49 jmlr-2011-Kernel Regression in the Presence of Correlated Errors
Author: Kris De Brabanter, Jos De Brabanter, Johan A.K. Suykens, Bart De Moor
Abstract: It is a well-known problem that obtaining a correct bandwidth and/or smoothing parameter in nonparametric regression is difficult in the presence of correlated errors. There exists a wide variety of methods coping with this problem, but they all critically depend on a tuning procedure which requires accurate information about the correlation structure. We propose a bandwidth selection procedure based on bimodal kernels which successfully removes the correlation without requiring any prior knowledge about its structure and its parameters. Further, we show that the form of the kernel is very important when errors are correlated, which is in contrast to the independent and identically distributed (i.i.d.) case. Finally, some extensions are proposed to use this criterion in support vector machines and least squares support vector machines for regression. Keywords: nonparametric regression, correlated errors, bandwidth choice, cross-validation, short-range dependence, bimodal kernel
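The key ingredient is a kernel that vanishes at zero, so each point's own (correlated) neighbourhood is discounted. A minimal sketch with a Nadaraya-Watson estimator is below; the particular bimodal kernel shape, bandwidth, and AR(1) error model are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def bimodal_kernel(u):
    """A kernel with K(0) = 0 (two symmetric modes), so an observation
    never weights itself; this discounts short-range error correlation."""
    return (2.0 / np.sqrt(np.pi)) * u ** 2 * np.exp(-u ** 2)

def nadaraya_watson(x_eval, x, y, h, kernel):
    """Kernel regression estimate of E[y | x] at the points x_eval."""
    U = (x_eval[:, None] - x[None, :]) / h
    W = kernel(U)
    return (W @ y) / W.sum(axis=1)

# Equally spaced design with AR(1) (short-range correlated) errors.
rng = np.random.default_rng(2)
n = 300
x = np.linspace(0.0, 1.0, n)
e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.6 * e[i - 1] + 0.2 * rng.normal()
y = np.sin(2 * np.pi * x) + e
y_hat = nadaraya_watson(x, x, y, h=0.05, kernel=bimodal_kernel)
```

With a unimodal kernel the same bandwidth-selection criterion would chase the correlated noise; the zero-at-origin kernel is what removes that dependence.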
6 0.27356389 12 jmlr-2011-Bayesian Co-Training
7 0.26030383 71 jmlr-2011-On Equivalence Relationships Between Classification and Ranking Algorithms
8 0.25934038 44 jmlr-2011-Information Rates of Nonparametric Gaussian Process Methods
9 0.23691925 43 jmlr-2011-Information, Divergence and Risk for Binary Experiments
10 0.23596507 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach
11 0.20084782 24 jmlr-2011-Dirichlet Process Mixtures of Generalized Linear Models
12 0.19329274 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
13 0.1706768 2 jmlr-2011-A Bayesian Approximation Method for Online Ranking
14 0.16642769 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
15 0.16530387 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination
16 0.16133559 39 jmlr-2011-High-dimensional Covariance Estimation Based On Gaussian Graphical Models
17 0.15157585 62 jmlr-2011-MSVMpack: A Multi-Class Support Vector Machine Package
18 0.14775021 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
19 0.13859355 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels
20 0.13814269 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning
topicId topicWeight
[(4, 0.029), (9, 0.023), (10, 0.021), (18, 0.018), (24, 0.028), (31, 0.056), (32, 0.024), (41, 0.536), (65, 0.012), (71, 0.013), (73, 0.04), (78, 0.072), (90, 0.014)]
simIndex simValue paperId paperTitle
1 0.93232375 45 jmlr-2011-Internal Regret with Partial Monitoring: Calibration-Based Optimal Algorithms
Author: Vianney Perchet
Abstract: We provide consistent random algorithms for sequential decision under partial monitoring, when the decision maker does not observe the outcomes but instead receives random feedback signals. Those algorithms have no internal regret in the sense that, on the set of stages where the decision maker chose his action according to a given law, the average payoff could not have been improved on average by using any other fixed law. They are based on a generalization of calibration, no longer defined in terms of a Voronoï diagram but instead of a Laguerre diagram (a more general concept). This allows us to bound, for the first time in this general framework, the expected average internal, as well as the usual external, regret at stage n by O(n^{-1/3}), which is known to be optimal. Keywords: repeated games, on-line learning, regret, partial monitoring, calibration, Voronoï and Laguerre diagrams
2 0.92859304 23 jmlr-2011-DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model
Author: Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O. Hoyer, Kenneth Bollen
Abstract: Structural equation models and Bayesian networks have been widely used to analyze causal relations between continuous variables. In such frameworks, linear acyclic models are typically used to model the data-generating process of variables. Recently, it was shown that use of non-Gaussianity identifies the full structure of a linear acyclic model, that is, a causal ordering of variables and their connection strengths, without using any prior knowledge on the network structure, which is not the case with conventional methods. However, existing estimation methods are based on iterative search algorithms and may not converge to a correct solution in a finite number of steps. In this paper, we propose a new direct method to estimate a causal ordering and connection strengths based on non-Gaussianity. In contrast to the previous methods, our algorithm requires no algorithmic parameters and is guaranteed to converge to the right solution within a small fixed number of steps if the data strictly follows the model, that is, if all the model assumptions are met and the sample size is infinite. Keywords: structural equation models, Bayesian networks, independent component analysis, non-Gaussianity, causal discovery
same-paper 3 0.82709169 56 jmlr-2011-Learning Transformation Models for Ranking and Survival Analysis
Author: Vanya Van Belle, Kristiaan Pelckmans, Johan A. K. Suykens, Sabine Van Huffel
4 0.36884907 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling
Author: Ricardo Henao, Ole Winther
Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and benchmarked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling), and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks
5 0.35505933 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets
Author: Bruno Pelletier, Pierre Pudlo
Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components
6 0.35336262 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments
7 0.34299618 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem
8 0.33621687 81 jmlr-2011-Robust Approximate Bilinear Programming for Value Function Approximation
10 0.32056969 29 jmlr-2011-Efficient Learning with Partially Observed Attributes
11 0.32010525 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models
12 0.31986764 38 jmlr-2011-Hierarchical Knowledge Gradient for Sequential Sampling
13 0.31972378 104 jmlr-2011-X-Armed Bandits
14 0.31677034 48 jmlr-2011-Kernel Analysis of Deep Networks
15 0.31584993 73 jmlr-2011-Online Learning in Case of Unbounded Losses Using Follow the Perturbed Leader Algorithm
16 0.31281963 18 jmlr-2011-Convergence Rates of Efficient Global Optimization Algorithms
17 0.31276426 91 jmlr-2011-The Sample Complexity of Dictionary Learning
18 0.3120361 40 jmlr-2011-Hyper-Sparse Optimal Aggregation
19 0.30914125 1 jmlr-2011-A Bayesian Approach for Learning and Planning in Partially Observable Markov Decision Processes
20 0.30719864 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms