jmlr jmlr2009 jmlr2009-30 knowledge-graph by maker-knowledge-mining
Source: pdf
Author: Holger Höfling, Robert Tibshirani
Abstract: We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate. Keywords: Markov networks, logistic regression, L1 penalty, model selection, binary variables
Reference: text
sentIndex sentText sentNum sentScore
1 For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. [sent-5, score-0.442]
2 The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. [sent-6, score-0.269]
3 Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). [sent-7, score-0.169]
4 The approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate. [sent-9, score-0.164]
5 That is, the penalized log-likelihood ℓ(Θ) − ρ||Θ||1 is maximized, where ℓ is the Gaussian log-likelihood, ||Θ||1 is the sum of the absolute values of the elements of Θ and ρ ≥ 0 is a user-defined tuning parameter. [sent-14, score-0.256]
6 (2008) as well as Friedman, Hastie, and Tibshirani (2008a) propose algorithms for solving this penalized log-likelihood problem; the procedure of Friedman, Hastie, and Tibshirani (2008a), the graphical lasso, is especially fast and efficient. [sent-19, score-0.324]
7 Most edges for L1 -norm ≤ 2 are correctly identified, as in the graph in Figure 1. [sent-29, score-0.248]
8 However, edge (1, 3) is absent in the true model but included in the L1 penalized model relatively early. [sent-30, score-0.281]
9 Our main focus in this paper is to develop and implement fast approximate and exact procedures for solving this class of L1 -penalized binary pairwise Markov networks and compare the accuracy and speed of these to other methods proposed by Lee, Ganapathi, and Koller (2006a) and Wainwright et al. [sent-31, score-0.31]
10 Here, by “exact” procedures we refer to algorithms that find the exact maximizer of the penalized log-likelihood of the model whereas “approximate” procedures only find an approximate solution. [sent-33, score-0.469]
11 Figure 2: Toy example: profiles of estimated edge parameters (curves labeled 1−4 and 2−4) plotted against the L1 norm as the penalty parameter is varied. [sent-42, score-0.128]
12 the added advantage that it can be used as a building block for a new algorithm for maximizing the penalized log-likelihood exactly. [sent-43, score-0.286]
13 Finally, Section 5 discusses the results of the simulations with respect to speed and accuracy of the competing algorithms. [sent-45, score-0.133]
14 The Model and Competing Methods. In this section we briefly outline the competing methods for maximizing the penalized log-likelihood. [sent-47, score-0.327]
15 Consider data generated under the Ising model p(x, Θ) = exp( ∑_{s∈V} θ_s x_s + ∑_{(s,t)∈E} θ_st x_s x_t − Ψ(Θ) ). [sent-51, score-0.286]
16 Here, V denotes the vertices and E the edges of a graph. [sent-59, score-0.196]
17 By setting θ_st = 0 for (s,t) ∉ E, and using the fact that x is a binary vector, we can write the log-likelihood as l(x, Θ) = log p(x, Θ) = ∑_{s≥t≥1}^p θ_st x_s x_t − Ψ(Θ), where Θ is a symmetric p × p matrix with θ_ss = θ_s for s = 1, . . . , p. [sent-61, score-0.184]
18 The penalized log-likelihood for all N observations is ∑_{s≥t≥1}^p (X^T X)_st θ_st − NΨ(Θ) − N||R ∗ Θ||_1, where ∗ denotes component-wise multiplication. [sent-67, score-0.256]
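The penalized log-likelihood above can be checked by brute force for small p. The sketch below uses our own function names (not the authors' code) and enumerates all 2^p binary states, which is feasible only for toy problems; it uses the fact that, for binary x, ∑_{s≥t} θ_st x_s x_t = ½(xᵀΘx + diag(Θ)ᵀx).

```python
import numpy as np
from itertools import product

def energy(Theta, x):
    """sum_{s>=t} theta_st x_s x_t for symmetric Theta and binary x,
    which equals 0.5 * (x' Theta x + diag(Theta)' x) since x_s^2 = x_s."""
    return 0.5 * (x @ Theta @ x + np.diag(Theta) @ x)

def log_partition(Theta):
    """Psi(Theta) = log sum over all binary x of exp(energy(Theta, x))."""
    p = Theta.shape[0]
    states = [np.array(x, dtype=float) for x in product([0, 1], repeat=p)]
    e = np.array([energy(Theta, x) for x in states])
    m = e.max()
    return m + np.log(np.exp(e - m).sum())  # log-sum-exp for numerical stability

def penalized_loglik(Theta, X, R):
    """sum_{s>=t} (X'X)_st theta_st - N Psi(Theta) - N ||R * Theta||_1,
    with the penalty summed over the lower triangle including the diagonal."""
    N, p = X.shape
    XtX = X.T @ X
    lin = sum(XtX[s, t] * Theta[s, t] for s in range(p) for t in range(s + 1))
    pen = sum(R[s, t] * abs(Theta[s, t]) for s in range(p) for t in range(s + 1))
    return lin - N * log_partition(Theta) - N * pen
```

At Θ = 0 the model is uniform over the 2^p states, so Ψ(0) = p log 2 and the penalized log-likelihood reduces to −N p log 2.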
19 The Lee, Ganapathi, and Koller (2006a) Method. The penalized log-likelihood is a concave function, so standard convex programming techniques can be used to maximize it. [sent-71, score-0.281]
20 However, for sparse matrices Θ, algorithms such as the junction tree algorithm exist that can calculate Ψ and its derivatives efficiently. [sent-73, score-0.258]
21 Lee, Ganapathi, and Koller (2006a) achieve this by optimizing the penalized log-likelihood only over a set F of active variables which they gradually enlarge until an optimality criterion is satisfied. [sent-75, score-0.361]
22 Using either conjugate gradients or BFGS, the penalized log-likelihood is maximized over the set of variables F. [sent-79, score-0.293]
23 Their procedure provides an exact solution to the problem when the junction tree algorithm is used to calculate Ψ(Θ) and its derivatives. [sent-84, score-0.266]
24 In their method as well as ours, any procedure to evaluate Ψ(Θ) can be “plugged in” without any further changes to the rest of the algorithm; we decided to evaluate the speed and performance of only the exact algorithms. [sent-86, score-0.17]
25 Wainwright et al. (2006) propose estimation of the Markov network by applying a separate L1 -penalized logistic regression to each of the p variables on the remaining variables. [sent-92, score-0.354]
26 Then set θ̃_ss = β_0, the intercept of the logistic regression, and θ̃_st = β_t, where β_t is the coefficient associated with x_t in the regression. [sent-101, score-0.143]
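The per-variable regressions can be sketched as follows. The paper uses the coordinate descent of Friedman, Hastie, and Tibshirani (2008b); here, purely as a stand-in, a simple proximal-gradient (ISTA) solver is used for the L1 -penalized logistic regressions, and all function names are our own.

```python
import numpy as np

def l1_logistic(X, y, lam, lr=0.1, iters=2000):
    """L1-penalized logistic regression via proximal gradient (ISTA):
    minimize (1/N) sum_k [log(1 + exp(eta_k)) - y_k * eta_k] + lam * ||beta||_1,
    with an unpenalized intercept b0."""
    N, d = X.shape
    b0, beta = 0.0, np.zeros(d)
    for _ in range(iters):
        eta = np.clip(b0 + X @ beta, -30, 30)
        p = 1.0 / (1.0 + np.exp(-eta))
        g = (p - y) / N                      # gradient of the logistic loss in eta
        b0 -= lr * g.sum()                   # intercept: plain gradient step
        beta -= lr * (X.T @ g)
        beta = np.sign(beta) * np.maximum(np.abs(beta) - lr * lam, 0.0)  # prox step
    return b0, beta

def neighborhood_select(X, lam):
    """Regress each variable on all the others; the intercept estimates theta_ss
    and the coefficients give an (asymmetric) estimate of theta_st."""
    N, p = X.shape
    Theta = np.zeros((p, p))
    for s in range(p):
        others = [t for t in range(p) if t != s]
        b0, beta = l1_logistic(X[:, others], X[:, s], lam)
        Theta[s, s] = b0
        for j, t in enumerate(others):
            Theta[s, t] = beta[j]
    return Theta
```

Because the p regressions are fit independently, the resulting matrix is not symmetric; the "min" and "max" rules discussed below are two ways of symmetrizing it.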
27 They show that under certain assumptions, their method correctly identifies the non-zero edges in a Markov graph as N → ∞, even for an increasing number of parameters p or neighborhood size d of the graph, as long as N grows more quickly than d³ log p (see Wainwright et al. [sent-109, score-0.248]
28 Due to the simplicity of the method, it could obviously also be used for parameter estimation itself; here we compare its performance in these cases to the pseudo-likelihood approach proposed below and to the exact solution of the penalized log-likelihood. [sent-111, score-0.36]
29 Furthermore, as an important part of this article is the comparison of the speeds of the underlying algorithms, we implement their method, using the fast coordinate descent algorithm for logistic regression with a lasso penalty (see Friedman, Hastie, and Tibshirani, 2008b). [sent-112, score-0.281]
30 As we will see in the simulations section, the results are very close to the exact solution of the penalized log-likelihood. [sent-115, score-0.386]
31 In the next section, we use the pseudo-likelihood model to design a very fast algorithm for finding an exact solution for the penalized Ising model. [sent-116, score-0.385]
32 Putting all this together, the pseudo-likelihood for a single observation x is given by l̃(Θ|x) = ∑_{s=1}^p ∑_{t=1}^p x_s x_t θ_st − ∑_{s=1}^p Ψ_s(x, Θ). [sent-126, score-0.184]
33 Here S = 2R − diag(R) and is chosen to be roughly equivalent to the penalty terms in the penalized log-likelihood. [sent-129, score-0.359]
34 The penalty term is doubled on the off-diagonal, as the derivative of the pseudo-likelihood on the off-diagonal is roughly twice as large as the derivative of the log-likelihood (see Equation 1). [sent-130, score-0.195]
35 As the number of variables is p(p + 1)/2, the Hessian could become very large, so we restrict our quadratic approximation to have a diagonal Hessian. [sent-134, score-0.129]
36 These are ∂l̃/∂θ_st = 2(X^T X)_st − (p̂_s^T X)_t − (p̂_t^T X)_s for s ≠ t (Equation 1), and ∂l̃/∂θ_ss = (X^T X)_ss − ∑_{k=1}^N p̂_sk, where 1 − p̂_sk = 1/(1 + exp(θ_ss + ∑_{t≠s} X_kt θ_st)). [sent-138, score-0.182]
37 The second derivatives are ∂²l̃/(∂θ_st)² = −(X^T diag(p̂_s) diag(1 − p̂_s) X)_tt − (X^T diag(p̂_t) diag(1 − p̂_t) X)_ss for s ≠ t, and ∂²l̃/(∂θ_ss)² = −∑_{k=1}^N p̂_sk(1 − p̂_sk). [sent-139, score-0.293]
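The first-derivative formulas can be verified numerically on small problems. The sketch below (our own code, with the matrix of p̂_sk written as `P`) evaluates the pseudo-log-likelihood over the symmetric parameterization, computes the analytic gradient of Equation (1), and compares it with central finite differences.

```python
import numpy as np

def pseudo_loglik(Theta, X):
    """Pseudo-log-likelihood sum_k sum_s log p(x_ks | x_k,-s, Theta), with
    eta_ks = theta_ss + sum_{t != s} theta_st x_kt (Theta symmetric)."""
    total = 0.0
    for s in range(Theta.shape[0]):
        eta = Theta[s, s] + X @ Theta[s] - X[:, s] * Theta[s, s]
        total += np.sum(X[:, s] * eta - np.log1p(np.exp(eta)))
    return total

def pl_gradient(Theta, X):
    """Analytic gradient: off-diagonal 2(X'X)_st - (phat_s' X)_t - (phat_t' X)_s,
    diagonal (X'X)_ss - sum_k phat_sk, as in Equation (1)."""
    N, p = X.shape
    P = np.zeros((p, N))
    for s in range(p):
        eta = Theta[s, s] + X @ Theta[s] - X[:, s] * Theta[s, s]
        P[s] = 1.0 / (1.0 + np.exp(-eta))   # phat_sk = P[s, k]
    XtX = X.T @ X
    G = 2.0 * XtX - P @ X - (P @ X).T
    np.fill_diagonal(G, np.diag(XtX) - P.sum(axis=1))
    return G

def numeric_grad(Theta, X, s, t, h=1e-6):
    """Central difference w.r.t. the single free parameter theta_st;
    both symmetric entries are perturbed together when s != t."""
    Tp, Tm = Theta.copy(), Theta.copy()
    Tp[s, t] += h; Tm[s, t] -= h
    if s != t:
        Tp[t, s] += h; Tm[t, s] -= h
    return (pseudo_loglik(Tp, X) - pseudo_loglik(Tm, X)) / (2 * h)
```

The factor of 2 on the off-diagonal appears because θ_st enters the conditional likelihoods of both variable s and variable t, which is exactly the observation used to scale the penalty matrix S below.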
38 Then define the local approximation to l̃(Θ|X) − N||S ∗ Θ||_1 as f_{Θ^(k)}(Θ) = C + ∑_{s≥t} [ (∂l̃/∂θ_st)(θ_st − θ_st^(k)) + (1/2) (∂²l̃/(∂θ_st)²)(θ_st − θ_st^(k))² ] − N||S ∗ Θ||_1, where C is some constant. [sent-141, score-0.4]
39 As stated above, this is just a quadratic approximation with linear term equal to the gradient of l̃ and a diagonal Hessian whose diagonal elements equal those of the Hessian of l̃. [sent-142, score-0.2]
40 The main reasons for using this simple structure are that it keeps the computational complexity per iteration low and that this L1 -penalized local approximation is very easy to solve. [sent-143, score-0.256]
41 Using Θ∗, the next step Θ^(k+1) can now be obtained by, for example, a backtracking line search. [sent-146, score-0.17]
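Because the local approximation is separable with a diagonal quadratic term, each coordinate of its maximizer Θ∗ has a closed form via soft-thresholding. A minimal sketch of that one-dimensional update (our own notation; here h denotes minus the second derivative and is assumed positive):

```python
import numpy as np

def soft_threshold(z, lam):
    """S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def local_update(theta_k, g, h, lam):
    """Maximizer of  g*(theta - theta_k) - (h/2)*(theta - theta_k)^2 - lam*|theta|,
    where g is the gradient at theta_k and h > 0 is minus the second derivative.
    Closed form: theta* = S(theta_k + g/h, lam/h)."""
    return soft_threshold(theta_k + g / h, lam / h)
```

Collecting these coordinate-wise maximizers over all s ≥ t gives the candidate Θ∗ used in the backtracking line search.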
42 (Perkins et al., 2003) and the implementation of the penalized logistic regression in Friedman, Hastie, and Tibshirani (2008b), among others. [sent-157, score-0.317]
43 The inner loop then optimizes over only the active variables until convergence occurs. [sent-160, score-0.131]
44 In the outer loop, the criterion chooses variables to be active that are either non-zero already or will be non-zero after one step of the local approximation over all variables. [sent-164, score-0.166]
45 Therefore, if the active set stays the same, no variables will be moved in the next step, as all active variables are already optimal; we therefore have a solution over all variables. [sent-165, score-0.21]
46 However, if the active set changes, the inner loop is guaranteed to improve the penalized pseudo-likelihood and find the optimum for the given set of active variables. [sent-166, score-0.417]
47 As there are only a finite number of different active variable sets, the algorithm has to converge after a finite number of iterations of the outer loop. [sent-167, score-0.173]
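The outer/inner active-set pattern described here can be illustrated on a simpler objective. The sketch below applies the same loop structure to a least-squares lasso (0.5/N ||y − Xb||² + λ||b||₁) rather than the penalized pseudo-likelihood, purely to show the mechanics; all names are ours.

```python
import numpy as np

def lasso_active_set(X, y, lam, tol=1e-8, max_outer=50):
    """Outer loop: grow the active set with coordinates that are non-zero or
    would become non-zero after one coordinate step; stop when it no longer
    changes. Inner loop: coordinate descent over the active set only."""
    N, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / N

    def cd_step(j, b):
        # exact coordinate minimizer for coordinate j given the others
        r = y - X @ b + X[:, j] * b[j]          # partial residual
        z = X[:, j] @ r / N
        return np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]

    active = set()
    for _ in range(max_outer):
        new_active = {j for j in range(p) if b[j] != 0 or cd_step(j, b) != 0}
        if new_active == active:
            break                                # active set stable: done
        active = new_active
        for _ in range(10000):                   # inner loop to convergence
            delta = 0.0
            for j in active:
                new = cd_step(j, b)
                delta = max(delta, abs(new - b[j]))
                b[j] = new
            if delta < tol:
                break
    return b
```

With orthogonal columns each coordinate decouples, so the solution can be checked against the closed-form soft-threshold answer.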
48 In order to detect this, we calculate the penalized pseudo-likelihood every 100 iterations and check that it has improved during the last 100 steps. [sent-174, score-0.298]
49 Exact Solution Using Pseudo-likelihood. We now turn to our new method for optimizing the penalized log-likelihood. [sent-181, score-0.256]
50 Taking this into consideration, the standard methods mentioned above have the following drawbacks. Gradient descent: it can take many steps to converge and therefore requires many evaluations of the partition function and its derivatives (using the junction tree algorithm). [sent-185, score-0.28]
51 Also, it does not control well the number of non-zero variables in intermediate steps, to which the runtime of the junction tree algorithm is very sensitive. [sent-186, score-0.206]
52 BFGS: takes fewer steps than gradient descent but, similarly to gradient descent, its intermediate steps can have more non-zero variables than the solution. [sent-188, score-0.173]
53 Lee, Ganapathi, and Koller (2006a) use the BFGS method only on a set of active variables onto which additional variables are “grafted” until convergence. [sent-192, score-0.142]
54 Here, we want to use the pseudo-likelihood in a new algorithm for maximizing the penalized log-likelihood. [sent-195, score-0.286]
55 Specifically, among all functions of the form f_{Θ^(k)}(Θ) = l̃(Θ|X) + ∑_{s≥t} a_st(θ_st − θ_st^(k)) − (γ/2) ∑_{s≥t} (θ_st − θ_st^(k))² − N||R ∗ Θ||_1, we fit the one that at Θ^(k) is a first order approximation to the penalized log-likelihood ℓ. [sent-199, score-0.333]
56 Here, a_st will be chosen so that the sub-gradient is equal to that of the penalized log-likelihood ℓ at Θ^(k). [sent-201, score-0.302]
57 Then choose θ_st^(0) = logit( (1/N) ∑_{k=1}^N x_ks ) if s = t, and θ_st^(0) = 0 if s ≠ t. [sent-207, score-0.305]
58 In this case we then have p̂_sk = (1/N) ∑_{k=1}^N x_ks for all k, and also W_st = (1/N) ∑_{k=1}^N x_ks if s = t and W_st = ((1/N) ∑_{k=1}^N x_ks)((1/N) ∑_{k=1}^N x_kt) if s ≠ t; together with γ = 0 we then get f_{Θ^(0)} = (1/2) l̃(Θ|X) − N||R ∗ Θ||_1, and thus just a regular pseudo-likelihood step. [sent-208, score-0.244]
59 The only slight difference is that in this case the penalty term on the diagonal is twice as large as in the pseudo-likelihood case presented above. [sent-209, score-0.142]
60 The algorithm for maximizing the penalized log-likelihood is now very similar to the one presented for maximizing the penalized pseudo-likelihood. [sent-212, score-0.572]
61 Here, the savings in time are especially large due to a special feature of the junction tree algorithm that we use to calculate the derivatives of the partition function. [sent-221, score-0.257]
62 However, p passes of the junction tree are needed in order to get the full matrix of derivatives W. [sent-223, score-0.189]
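The derivative matrix W of the partition function consists of the model's first and second moments, W_st = E_Θ[x_s x_t] (with W_ss = E_Θ[x_s]). For small p it can be computed by exhaustive enumeration, which makes a useful reference implementation against the junction tree computation; the sketch below uses our own naming and is infeasible beyond toy sizes.

```python
import numpy as np
from itertools import product

def moments_bruteforce(Theta):
    """W with W_st = dPsi/dtheta_st = E_Theta[x_s x_t] (and W_ss = E[x_s]),
    computed by enumerating all 2^p binary states of the Ising model."""
    p = Theta.shape[0]
    states = np.array(list(product([0, 1], repeat=p)), dtype=float)
    # energy(x) = 0.5 * (x' Theta x + diag(Theta)' x) for binary x
    e = 0.5 * (np.einsum('ki,ij,kj->k', states, Theta, states)
               + states @ np.diag(Theta))
    w = np.exp(e - e.max())
    w /= w.sum()                               # normalized state probabilities
    return (states * w[:, None]).T @ states    # sum_k w_k x_k x_k'
```

At Θ = 0 the distribution is uniform, so E[x_s] = 1/2 and E[x_s x_t] = 1/4 for s ≠ t.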
63 Therefore, any fast algorithm for solving the penalized log-likelihood exactly has to use as few evaluations of the partition function as possible. [sent-232, score-0.307]
64 This observation also explains the improvement in speed of the pseudo-likelihood based exact algorithm over the specific methods proposed in Lee, Lee, Abbeel, and Ng (2006b). [sent-234, score-0.17]
65 Newton-like procedures that use the second derivative often converge in very few steps; however, these cannot be used here, as it is prohibitively expensive to evaluate the second derivative of the partition function, even in small examples. [sent-236, score-0.181]
66 Then, using the average number of edges per node, upper-triangular elements of Θ are drawn uniformly at random to be non-zero. [sent-248, score-0.196]
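The simulation setup can be sketched as follows. The edge magnitudes and signs, as well as the Gibbs sampler settings, are illustrative assumptions here, not the paper's exact protocol; only the "draw upper-triangular entries uniformly at random to be non-zero" step comes from the text.

```python
import numpy as np

def random_sparse_theta(p, avg_neighbors, scale=0.5, rng=None):
    """Draw upper-triangular entries of Theta uniformly at random to be
    non-zero so that the average number of edges per node is avg_neighbors."""
    rng = np.random.default_rng(rng)
    Theta = np.zeros((p, p))
    n_edges = int(round(p * avg_neighbors / 2))
    pairs = [(s, t) for s in range(p) for t in range(s + 1, p)]
    for idx in rng.choice(len(pairs), size=n_edges, replace=False):
        s, t = pairs[idx]
        Theta[s, t] = Theta[t, s] = rng.choice([-scale, scale])
    return Theta

def gibbs_sample(Theta, N, burn=200, rng=None):
    """Sample binary vectors from the Ising model: one full single-site
    Gibbs sweep per retained draw, after a burn-in period."""
    rng = np.random.default_rng(rng)
    p = Theta.shape[0]
    x = (rng.random(p) < 0.5).astype(float)
    out = []
    for it in range(burn + N):
        for s in range(p):
            eta = Theta[s, s] + x @ Theta[s] - x[s] * Theta[s, s]
            x[s] = float(rng.random() < 1.0 / (1.0 + np.exp(-eta)))
        if it >= burn:
            out.append(x.copy())
    return np.array(out)
```

One sweep per retained draw gives correlated samples; thinning or longer burn-in would be needed for near-independent draws.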
67 For the Wainwright-methods, the penalty parameter is ρ with no penalty on the intercept. [sent-255, score-0.206]
68 Although the log-likelihood functions that are being penalized are somewhat different, this choice of parameters makes them perform roughly equivalently, as can be seen in Figure 3. [sent-256, score-0.256]
69 The number of edges is plotted against the penalty parameter used, and all methods behave very similarly. [sent-257, score-0.299]
70 However, in order not to confound the results with these slight differences in the effect of the penalty, all the following plots are with respect to the number of edges in the graph, not the penalty parameter itself. [sent-258, score-0.299]
71 Plots of the speeds of the exact methods can be seen in Figure 4 and the approximate methods are shown in Figure 5. [sent-262, score-0.131]
72 Each plot shows the time the algorithm needed to converge versus the number of edges in the estimated graph. [sent-263, score-0.234]
73 As can be seen, the pseudo-likelihood based exact algorithm described above is considerably faster than the one proposed in Lee, Ganapathi, and Koller (2006a). [sent-264, score-0.128]
74 Furthermore, we would like to note that we decided to plot the speed against the number of edges in the graph instead of the penalty parameter, as the runtime of the algorithm is very closely related to the actual sparseness of the graph. [sent-268, score-0.417]
75 Overall, when comparing the computation times of the exact algorithms to the approximate algorithms, we can see that the approximate methods are orders of magnitude faster and do not suffer, as the exact methods do, from computation time that increases exponentially with decreasing sparsity. [sent-269, score-0.286]
76 Therefore, if an exact solution is required, our exact algorithm is preferable. [sent-270, score-0.208]
77 Figure 3: Number of edges in the graph vs. the penalty parameter (panels: P=20, N=200, Neigh=3; P=40, N=200, Neigh=4; P=50, N=300, Neigh=4; P=60, N=300, Neigh=4; curves: Exact, Wain−Min, Wain−Max, Pseudo).
80 As both our L1 penalized exact algorithm and the one by Lee, Ganapathi, and Koller (2006a) find the exact maximizer of the L1 -penalized log-likelihood, we only use our algorithm in the comparison. [sent-290, score-0.496]
81 First we investigate how closely the edges in the estimated graph correspond to edges in the true graph. [sent-296, score-0.444]
82 Overall, we see that all approximate algorithms match the results of the exact solution very closely and in some of the plots, the curves even lie almost on top of each other. [sent-299, score-0.153]
83 Figure 5: Computation time (s) for the approximate algorithms versus the number of non-zero elements in Θ, for the different problem sizes.
84 Again, the approximate solutions are all very close to the exact solution (see Figure 7) and the differences are always smaller than 2 standard deviations. [sent-316, score-0.131]
85 For a plot of the KL-divergence against the number of edges in the model see Figure 9. [sent-368, score-0.196]
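For small p, the Kullback-Leibler divergence between the true and estimated Ising models can be computed exactly by enumerating all 2^p states. The sketch below uses our own function names and the same binary-state energy as earlier.

```python
import numpy as np
from itertools import product

def ising_probs(Theta):
    """Exact state probabilities of the Ising model by enumeration."""
    p = Theta.shape[0]
    states = np.array(list(product([0, 1], repeat=p)), dtype=float)
    e = 0.5 * (np.einsum('ki,ij,kj->k', states, Theta, states)
               + states @ np.diag(Theta))
    w = np.exp(e - e.max())
    return w / w.sum()

def kl_divergence(Theta_true, Theta_hat):
    """D_KL(p_true || p_hat), summed over all 2^p states (small p only)."""
    pt, ph = ising_probs(Theta_true), ising_probs(Theta_hat)
    return float(np.sum(pt * (np.log(pt) - np.log(ph))))
```

Since every state has positive probability under the Ising model, the divergence is always finite, non-negative, and zero exactly when the two parameter matrices define the same distribution.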
86 Figure 7: Log-likelihood of the estimated model vs. number of edges in the graph for different problem sizes, averaged over 20 simulations.
87 In Figure 10 the differences of the KL-divergence of the approximate to the exact method can be seen. [sent-383, score-0.131]
88 Discussion When we embarked on this work, our goal was to find a fast method for maximizing the L1 penalized log-likelihood of binary-valued Markov networks. [sent-386, score-0.311]
89 We succeeded in doing this, and found that the resulting procedure is faster than competing exact methods. [sent-387, score-0.169]
90 We also learned something surprising: several approximate methods exist that are much faster and only slightly less accurate than the exact methods. [sent-401, score-0.155]
91 In addition, when a dense solution is required, the exact methods become infeasible while the approximate methods can still be used. [sent-402, score-0.131]
92 Figure 9: Kullback-Leibler divergence of the estimated model vs. number of edges in the graph for different problem sizes (P=20, N=200, Neigh=3; P=50, N=300, Neigh=4; P=60, N=300, Neigh=4), averaged over 20 simulations.
95 Furthermore, we believe that both the exact and fast approximate methods can also be applied to the learning of multilayer generative models, such as restricted Boltzmann machines (see Hinton, 2007). [sent-427, score-0.156]
96 An R language package for fitting sparse graphical models, both by exact and approximate methods, will be made available on the authors’ websites. [sent-428, score-0.201]
97 Figure 10: Difference of Kullback-Leibler divergence of the approximate methods to the exact solution vs. number of edges in the graph for different problem sizes, averaged over 20 simulations.
99 For the penalized pseudo-likelihood algorithm, it is evident from its definition that the approximation is of first order. [sent-478, score-0.287]
100 The situation for the penalized log-likelihood algorithm is a little more complicated; it will be shown in the next section of the appendix that the proposed approximation is of first order and therefore satisfies the assumptions of the proof. [sent-479, score-0.287]
simIndex simValue paperId paperTitle
same-paper 1 0.99999946 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
Author: Holger Höfling, Robert Tibshirani
2 0.075868703 13 jmlr-2009-Bounded Kernel-Based Online Learning
Author: Francesco Orabona, Joseph Keshet, Barbara Caputo
Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set
3 0.074994847 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
Author: Jianqing Fan, Richard Samworth, Yichao Wu
Abstract: Variable selection in high-dimensional space characterizes many contemporary problems in scientific discovery and decision making. Many frequently-used techniques are based on independence screening; examples include correlation ranking (Fan & Lv, 2008) or feature selection using a twosample t-test in high-dimensional classification (Tibshirani et al., 2003). Within the context of the linear model, Fan & Lv (2008) showed that this simple correlation ranking possesses a sure independence screening property under certain conditions and that its revision, called iteratively sure independent screening (ISIS), is needed when the features are marginally unrelated but jointly related to the response variable. In this paper, we extend ISIS, without explicit definition of residuals, to a general pseudo-likelihood framework, which includes generalized linear models as a special case. Even in the least-squares setting, the new method improves ISIS by allowing feature deletion in the iterative process. Our technique allows us to select important features in high-dimensional classification where the popularly used two-sample t-method fails. A new technique is introduced to reduce the false selection rate in the feature screening stage. Several simulated and two real data examples are presented to illustrate the methodology. Keywords: classification, feature screening, generalized linear models, robust regression, feature selection
4 0.073853433 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
Author: Cynthia Rudin, Robert E. Schapire
Abstract: We study boosting algorithms for learning to rank. We give a general margin-based bound for ranking based on covering numbers for the hypothesis space. Our bound suggests that algorithms that maximize the ranking margin will generalize well. We then describe a new algorithm, smooth margin ranking, that precisely converges to a maximum ranking-margin solution. The algorithm is a modification of RankBoost, analogous to “approximate coordinate ascent boosting.” Finally, we prove that AdaBoost and RankBoost are equally good for the problems of bipartite ranking and classification in terms of their asymptotic behavior on the training set. Under natural conditions, AdaBoost achieves an area under the ROC curve that is equally as good as RankBoost’s; furthermore, RankBoost, when given a specific intercept, achieves a misclassification error that is as good as AdaBoost’s. This may help to explain the empirical observations made by Cortes and Mohri, and Caruana and Niculescu-Mizil, about the excellent performance of AdaBoost as a bipartite ranking algorithm, as measured by the area under the ROC curve. Keywords: ranking, RankBoost, generalization bounds, AdaBoost, area under the ROC curve
5 0.071941338 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni
Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω( ε12 ) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit ˜ sphere, we show that our algorithm reaches generalization error ε after asking for just O(d log 1 ) ε labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning
6 0.069006413 90 jmlr-2009-Structure Spaces
7 0.062865108 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
8 0.048997857 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
9 0.047489244 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
10 0.045502663 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
11 0.044519305 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
12 0.043604191 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
13 0.043385692 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
14 0.042841006 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
15 0.041610707 23 jmlr-2009-Discriminative Learning Under Covariate Shift
16 0.040760007 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression
17 0.038709387 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
18 0.036550581 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
19 0.033587545 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
simIndex simValue paperId paperTitle
same-paper 1 0.95954454 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
2 0.42630923 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
3 0.34827274 55 jmlr-2009-Maximum Entropy Discrimination Markov Networks
Author: Jun Zhu, Eric P. Xing
Abstract: The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore its unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply, MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy discrimination learning of classification rules to a new structural max-entropy discrimination paradigm of learning a distribution of Markov networks. 2) It generalizes the extant Markov network structured-prediction rule based on a point estimator of model coefficients to an averaging model akin to a Bayesian predictor that integrates over a learned posterior distribution of model coefficients. 3) It admits flexible entropic regularization of the model during learning. By plugging in different prior distributions of the model coefficients, it subsumes the well-known maximum margin Markov networks (M3N) as a special case, and leads to a model similar to an L1-regularized M3N that is simultaneously primal and dual sparse, or other new types of Markov networks. 4) It applies a modular learning algorithm that combines existing variational inference techniques and convex-optimization based M3N solvers as subroutines. Essentially, MEDN can be understood as a jointly maximum likelihood and maximum margin estimate of Markov network. It represents the first successful attempt to combine maximum entropy learning (a dual form of maximum likelihood learning) with maximum margin learning of Markov network for structured input/output problems; and the basic principle can be generalized to learning arbi…
4 0.28235862 90 jmlr-2009-Structure Spaces
Author: Brijnesh J. Jain, Klaus Obermayer
Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as "natural" representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization
5 0.26557162 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms
Author: Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, Luca Cazzanti
Abstract: This paper reviews and extends the field of similarity-based classification, presenting new analyses, algorithms, data sets, and a comprehensive set of experimental results for a rich collection of classification problems. Specifically, the generalizability of using similarities as features is analyzed, design goals and methods for weighting nearest-neighbors for similarity-based learning are proposed, and different methods for consistently converting similarities into kernels are compared. Experiments on eight real data sets compare eight approaches and their variants to similarity-based learning. Keywords: similarity, dissimilarity, similarity-based learning, indefinite kernels
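One standard way of "consistently converting similarities into kernels" is spectrum clipping: zero out the negative eigenvalues of the symmetrized similarity matrix. The paper compares several such transformations; this is an illustrative sketch of just this one, with a hypothetical function name.

```python
import numpy as np

def clip_to_kernel(S):
    """Make a symmetric similarity matrix positive semidefinite by zeroing
    its negative eigenvalues (the 'spectrum clip' transformation)."""
    S = 0.5 * (S + S.T)                         # enforce symmetry first
    lam, U = np.linalg.eigh(S)                  # eigenvalues in ascending order
    return (U * np.maximum(lam, 0.0)) @ U.T     # rebuild with clipped spectrum
```

For the indefinite similarity matrix [[0, 1], [1, 0]] (eigenvalues +1 and -1), clipping keeps only the +1 component and returns the constant matrix with all entries 0.5.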
6 0.25517362 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks
7 0.23180857 13 jmlr-2009-Bounded Kernel-Based Online Learning
8 0.21405703 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models
9 0.20595688 50 jmlr-2009-Learning When Concepts Abound
10 0.20393209 54 jmlr-2009-Markov Properties for Linear Causal Models with Correlated Errors (Special Topic on Causality)
11 0.20185745 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
12 0.19822194 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques
13 0.18917663 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
14 0.18538746 17 jmlr-2009-Computing Maximum Likelihood Estimates in Recursive Linear Models with Correlated Errors
15 0.18310279 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
16 0.18075916 31 jmlr-2009-Evolutionary Model Type Selection for Global Surrogate Modeling
17 0.17961654 69 jmlr-2009-Optimized Cutting Plane Algorithm for Large-Scale Risk Minimization
18 0.17480205 79 jmlr-2009-Reinforcement Learning in Finite MDPs: PAC Analysis
19 0.17348546 2 jmlr-2009-A New Approach to Collaborative Filtering: Operator Estimation with Spectral Regularization
20 0.16193363 53 jmlr-2009-Marginal Likelihood Integrals for Mixtures of Independence Models
topicId topicWeight
[(8, 0.02), (11, 0.02), (19, 0.015), (26, 0.017), (38, 0.023), (52, 0.036), (55, 0.036), (58, 0.023), (66, 0.098), (68, 0.549), (90, 0.054), (96, 0.019)]
simIndex simValue paperId paperTitle
same-paper 1 0.87228614 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods
Author: Holger Höfling, Robert Tibshirani
Abstract: We consider the problems of estimating the parameters as well as the structure of binary-valued Markov networks. For maximizing the penalized log-likelihood, we implement an approximate procedure based on the pseudo-likelihood of Besag (1975) and generalize it to a fast exact algorithm. The exact algorithm starts with the pseudo-likelihood solution and then adjusts the pseudo-likelihood criterion so that each additional iteration moves it closer to the exact solution. Our results show that this procedure is faster than the competing exact method proposed by Lee, Ganapathi, and Koller (2006a). However, we also find that the approximate pseudo-likelihood as well as the approaches of Wainwright et al. (2006), when implemented using the coordinate descent procedure of Friedman, Hastie, and Tibshirani (2008b), are much faster than the exact methods, and only slightly less accurate. Keywords: Markov networks, logistic regression, L1 penalty, model selection, binary variables
2 0.76032549 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification
Author: Kilian Q. Weinberger, Lawrence K. Saul
Abstract: The accuracy of k-nearest neighbor (kNN) classification depends significantly on the metric used to compute distances between different examples. In this paper, we show how to learn a Mahalanobis distance metric for kNN classification from labeled examples. The Mahalanobis metric can equivalently be viewed as a global linear transformation of the input space that precedes kNN classification using Euclidean distances. In our approach, the metric is trained with the goal that the k-nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. As in support vector machines (SVMs), the margin criterion leads to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our approach requires no modification or extension for problems in multiway (as opposed to binary) classification. In our framework, the Mahalanobis distance metric is obtained as the solution to a semidefinite program. On several data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification. Sometimes these results can be further improved by clustering the training examples and learning an individual metric within each cluster. We show how to learn and combine these local metrics in a globally integrated manner. Keywords: convex optimization, semidefinite programming, Mahalanobis distance, metric learning, multi-class classification, support vector machines
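The equivalence stated in the abstract, that a Mahalanobis metric with M = LᵀL is just Euclidean distance after the global linear map L, can be checked directly. This sketch shows only the distance computation, not the semidefinite program that learns M; the function name is illustrative.

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance d_M(x, y) = (x - y)^T M (x - y)."""
    d = x - y
    return float(d @ M @ d)
```

For any matrix L and M = L.T @ L, mahalanobis_sq(x, y, M) equals the squared Euclidean norm of L @ (x - y).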
3 0.64652008 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting
Author: John Duchi, Yoram Singer
Abstract: We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two-phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. Keywords: subgradient methods, group sparsity, online learning, convex optimization
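For the ℓ1 case, the second phase of the two-phase update described in the abstract has a closed form: soft-thresholding. A minimal sketch of one forward-backward step, with an illustrative function name:

```python
import numpy as np

def fobos_l1_step(w, grad, eta, lam):
    """One forward-backward splitting step with an l1 regularizer:
    an unconstrained gradient step followed by the closed-form proximal
    (soft-thresholding) step that trades off the l1 penalty against
    proximity to the intermediate point."""
    v = w - eta * grad                                        # phase 1: gradient step
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)  # phase 2: prox step
```

Note how the prox step zeroes any coordinate whose magnitude falls below eta * lam, which is what produces sparse solutions.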
4 0.44656855 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection
Author: Jie Chen, Haw-ren Fang, Yousef Saad
Abstract: Nearest neighbor graphs are widely used in data mining and machine learning. A brute-force method to compute the exact kNN graph takes Θ(dn^2) time for n data points in the d dimensional Euclidean space. We propose two divide and conquer methods for computing an approximate kNN graph in Θ(dn^t) time for high dimensional data (large d). The exponent t ∈ (1, 2) is an increasing function of an internal parameter α which governs the size of the common region in the divide step. Experiments show that a high quality graph can usually be obtained with small overlaps, that is, for small values of t. A few of the practical details of the algorithms are as follows. First, the divide step uses an inexpensive Lanczos procedure to perform recursive spectral bisection. After each conquer step, an additional refinement step is performed to improve the accuracy of the graph. Finally, a hash table is used to avoid repeating distance calculations during the divide and conquer process. The combination of these techniques is shown to yield quite effective algorithms for building kNN graphs. Keywords: nearest neighbors graph, high dimensional data, divide and conquer, Lanczos algorithm, spectral method
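For contrast with the Θ(dn^t) divide-and-conquer methods, the Θ(dn^2) brute-force baseline the abstract mentions can be sketched directly (illustrative only; the paper's Lanczos bisection and refinement steps are not shown):

```python
import numpy as np

def knn_graph_brute(X, k):
    """Exact kNN graph by brute force: all pairwise squared Euclidean
    distances, Theta(d n^2) time, the baseline the divide-and-conquer
    methods improve on."""
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distance matrix
    np.fill_diagonal(D, np.inf)                      # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :k]              # indices of the k nearest points
```

On four points on a line at 0, 1, 3, and 10, the 1-NN graph links each point to its unique closest neighbor.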
5 0.40825558 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent
Author: Antoine Bordes, Léon Bottou, Patrick Gallinari
Abstract: The SGD-QN algorithm is a stochastic gradient descent algorithm that makes careful use of second-order information and splits the parameter update into independently scheduled components. Thanks to this design, SGD-QN iterates nearly as fast as a first-order stochastic gradient descent but requires less iterations to achieve the same accuracy. This algorithm won the “Wild Track” of the first PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008). Keywords: support vector machine, stochastic gradient descent
6 0.40774667 87 jmlr-2009-Sparse Online Learning via Truncated Gradient
7 0.37044924 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks
8 0.3620882 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences
9 0.35541621 9 jmlr-2009-Analysis of Perceptron-Based Active Learning
10 0.34069201 97 jmlr-2009-Ultrahigh Dimensional Feature Selection: Beyond The Linear Model
11 0.33295697 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost
12 0.32813337 38 jmlr-2009-Hash Kernels for Structured Data
14 0.32484204 25 jmlr-2009-Distributed Algorithms for Topic Models
15 0.32473171 1 jmlr-2009-A Least-squares Approach to Direct Importance Estimation
16 0.32322213 70 jmlr-2009-Particle Swarm Model Selection (Special Topic on Model Selection)
18 0.31847656 75 jmlr-2009-Provably Efficient Learning with Typed Parametric Models
19 0.31699437 3 jmlr-2009-A Parameter-Free Classification Method for Large Scale Learning
20 0.31283259 71 jmlr-2009-Perturbation Corrections in Approximate Inference: Mixture Modelling Applications