jmlr jmlr2013 jmlr2013-4 knowledge-graph by maker-knowledge-mining

4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion


Source: pdf

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 A max-norm constrained maximum likelihood estimate is introduced and studied. [sent-7, score-0.166]

2 Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. [sent-9, score-0.195]

3 The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. [sent-10, score-0.172]

4 Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence 1. [sent-12, score-0.255]

5 In some real-life situations, however, the observations are highly quantized, sometimes even to a single bit and thus the standard matrix completion techniques do not apply. [sent-15, score-0.408]

6 Davenport et al. (2012) considered the 1-bit matrix completion problem of recovering an approximately low-rank matrix M^* ∈ R^{d1×d2} from a set of n noise-corrupted sign (1-bit) measurements. [sent-32, score-0.518]

7 In particular, they proposed a trace-norm constrained maximum likelihood estimator to estimate M ∗ , based on a small number of binary samples observed according to a probability distribution determined by the entries of M ∗ . [sent-33, score-0.251]

8 It was also shown that the trace-norm constrained optimization method is minimax rate-optimal under the uniform sampling model. [sent-34, score-0.348]

9 The 1-bit measurements are meant to model quantization in the extreme case, and, surprisingly, empirical evidence demonstrates that when the signal-to-noise ratio is low such extreme quantization can be optimal under a fixed bit budget (Laska and Baraniuk, 2012). [sent-36, score-0.185]

10 To be more specific, consider an arbitrary unknown d1 × d2 target matrix M ∗ with rank at most r. [sent-38, score-0.173]

11 A subset {(i_1, j_1), ..., (i_n, j_n)} of entries of a binary matrix Y is observed, where the entries of Y depend on M^* in the following way: Y_{i,j} = +1 if M^*_{i,j} + Z_{i,j} ≥ 0, and Y_{i,j} = −1 if M^*_{i,j} + Z_{i,j} < 0. [sent-42, score-0.310]

12 This latent variable matrix model can be seen as a direct analogue of the usual 1-bit compressed sensing model, in which only the signs of measurements are observed. [sent-44, score-0.201]

13 Contrary to the standard matrix completion model and many other statistical problems, random noise turns out to be helpful and has a positive effect in the 1-bit case, since the problem is ill-posed in the absence of noise, as described in Davenport et al. [sent-49, score-0.481]

14 Although the trace-norm constrained optimization method has been shown to be minimax rate-optimal under the uniform sampling model, it remains unclear whether the trace-norm is the best convex surrogate for the rank. [sent-55, score-0.433]

15 A different convex relaxation for the rank, the matrix max-norm, has been duly noted in the machine learning literature since Srebro et al. [sent-56, score-0.181]

16 Regarding a real d1 × d2 matrix as an operator that maps from R^{d2} to R^{d1}, its rank can alternatively be expressed as the smallest integer k such that one can write M = UV^T, where U ∈ R^{d1×k} and V ∈ R^{d2×k}. [sent-58, score-0.173]
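
As a small illustration of this factorization view of rank (a sketch added here, not taken from the paper), the NumPy snippet below builds such a factorization from a truncated SVD: the number of nonzero singular values gives the smallest k for which M = UV^T with U ∈ R^{d1×k} and V ∈ R^{d2×k}.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, r = 6, 5, 2

# Construct a rank-r matrix M = A B^T.
A = rng.standard_normal((d1, r))
B = rng.standard_normal((d2, r))
M = A @ B.T

# SVD-based factorization: keep only the nonzero singular values.
Uf, s, Vt = np.linalg.svd(M, full_matrices=False)
k = int(np.sum(s > 1e-10))            # numerical rank = smallest feasible k
U = Uf[:, :k] * np.sqrt(s[:k])        # U in R^{d1 x k}
V = Vt[:k, :].T * np.sqrt(s[:k])      # V in R^{d2 x k}

assert np.allclose(M, U @ V.T)        # M = U V^T with k = rank(M) columns
print("numerical rank k =", k)
```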

17 Foygel and Srebro (2011) first used the max-norm for matrix completion under the uniform sampling distribution. [sent-69, score-0.528]

18 Davenport et al. (2012) analyzed 1-bit matrix completion under the uniform sampling model, where observed entries are assumed to be sampled randomly and uniformly. [sent-74, score-0.613]

19 In such a setting, the trace-norm constrained approach has been shown to achieve the minimax rate of convergence. [sent-75, score-0.197]

20 However, in certain applications such as collaborative filtering, the uniform sampling model is over-idealized. [sent-76, score-0.193]

21 In the Netflix problem, for instance, the uniform sampling model is equivalent to assuming all users are equally likely to rate every movie and all movies are equally likely to be rated by any user. [sent-77, score-0.191]

22 In this scenario, Salakhutdinov and Srebro (2010) showed that the standard trace-norm relaxation can behave very poorly, and suggested using a weighted variant of the trace-norm, which takes the sampling distribution into account. [sent-80, score-0.223]

23 Since the true sampling distribution is most likely unknown and can only be estimated based on the locations of the entries revealed in the sample, what is commonly used in practice is the empirically-weighted trace norm. [sent-81, score-0.202]

24 Foygel et al. (2011) provided rigorous recovery guarantees for learning with the standard weighted, smoothed weighted and smoothed empirically-weighted trace-norms. [sent-83, score-0.238]

25 In this paper, we study matrix completion based on noisy 1-bit observations under a general (non-degenerate) sampling model using the max-norm as a convex relaxation for the rank. [sent-85, score-0.623]

26 A matching minimax lower bound is established under the general non-uniform sampling model using information-theoretical methods. [sent-87, score-0.195]

27 The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. [sent-88, score-0.172]

28 As a comparison with the max-norm constrained optimization approach, we also analyze the recovery guarantee of the weighted trace-norm constrained method in the setting of non-uniform sampling distributions. [sent-89, score-0.503]

29 When the noise distribution is Gaussian or, more generally, log-concave, the negative log-likelihood function for M, given the measurements, is convex; hence, computing the max-norm constrained maximum likelihood estimate is a convex optimization problem. [sent-93, score-0.272]
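
To make the log-concavity remark concrete, here is a minimal sketch (not code from the paper) of the per-entry negative log-likelihood for two common choices of F, a logistic link and a Gaussian (probit) link; both are convex in the matrix entry m, which is what makes the max-norm constrained MLE a convex program.

```python
import numpy as np
from scipy.stats import norm

def neg_loglik_logistic(m, y):
    # Logistic link F(m) = 1 / (1 + exp(-m)); y in {+1, -1}.
    # -log P(Y = y | m) = log(1 + exp(-y * m)), convex in m.
    return np.log1p(np.exp(-y * m))

def neg_loglik_probit(m, y, sigma=1.0):
    # Probit link F(m) = Phi(m / sigma); -log P(Y = y | m) = -log Phi(y * m / sigma),
    # convex in m because the Gaussian CDF is log-concave.
    return -norm.logcdf(y * m / sigma)

# Quick numerical check of convexity along a grid of m values (y = +1):
# nonnegative second differences.
m = np.linspace(-3, 3, 201)
for f in (neg_loglik_logistic, neg_loglik_probit):
    assert np.all(np.diff(f(m, +1), n=2) >= -1e-8)
```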

30 Section 3 introduces the 1-bit matrix completion model and the estimation procedure and investigates the theoretical properties of the estimator. [sent-100, score-0.377]

31 Given two norms ℓ_p and ℓ_q on R^{d1} and R^{d2} respectively, the corresponding operator norm ‖·‖_{p,q} of a matrix M ∈ R^{d1×d2} is defined by ‖M‖_{p,q} = sup_{‖x‖_p = 1} ‖Mx‖_q. [sent-120, score-0.223]
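
As a hedged illustration (not taken from the paper), the sketch below evaluates this operator norm numerically: for p = q = 2 it is exactly the largest singular value, and for general (p, q) a random search over unit vectors gives a lower-bound estimate of sup_{‖x‖_p = 1} ‖Mx‖_q.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 3))

# Exact value for p = q = 2: the spectral norm (largest singular value).
spectral = np.linalg.norm(M, 2)

def op_norm_lower_bound(M, p, q, n_samples=20000):
    # Random-search estimate of sup_{||x||_p = 1} ||M x||_q.
    # This can only under-estimate the true operator norm.
    X = rng.standard_normal((n_samples, M.shape[1]))
    X /= np.linalg.norm(X, ord=p, axis=1, keepdims=True)
    return np.max(np.linalg.norm(X @ M.T, ord=q, axis=1))

print(spectral, op_norm_lower_bound(M, 2, 2))   # estimate approaches the exact value
```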

32 Max-Norm Constrained Maximum Likelihood Estimate. In this section, we introduce the max-norm constrained maximum likelihood estimation procedure for 1-bit matrix completion and investigate the theoretical properties of the estimator. [sent-188, score-0.543]

33 Observation Model. We consider 1-bit matrix completion under a general random sampling model. [sent-191, score-0.494]

34 The unknown low-rank matrix M^* ∈ R^{d1×d2} is the object of interest. [sent-192, score-0.173]

35 Instead of observing the noisy entries M_{i,j} + Z_{i,j} directly, as in unquantized matrix completion, we now only observe, with error, the signs of a random subset of the entries of M^*. [sent-193, score-0.343]

36 The index pairs (i_t, j_t) are drawn i.i.d. with replacement according to a general sampling distribution Π = {π_{kl}} on [d1] × [d2]. [sent-200, score-0.180]

37 That is, P{(i_t, j_t) = (k, l)} = π_{kl} for all t and (k, l). [sent-201, score-0.218]

38 Suppose that a (random) subset S of size |S| = n of entries of a sign matrix Y is observed. [sent-202, score-0.174]

39 The dependence of Y on the underlying matrix M^* is as follows: Y_{i,j} = +1 if M^*_{i,j} + Z_{i,j} ≥ 0, and Y_{i,j} = −1 if M^*_{i,j} + Z_{i,j} < 0 (9), where Z = (Z_{i,j}) ∈ R^{d1×d2} is a matrix consisting of i.i.d. noise variables. [sent-203, score-0.178]

40 Let F(·) be the cumulative distribution function of −Z_{1,1}; then the above model can be recast as Y_{i,j} = +1 with probability F(M^*_{i,j}) and Y_{i,j} = −1 with probability 1 − F(M^*_{i,j}) (10), and we observe noisy entries {Y_{i_t, j_t}}_{t=1}^{n} indexed by S. [sent-207, score-0.340]
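
The model (9)-(10) is straightforward to simulate. The sketch below is an illustrative, assumption-laden example (a logistic F and a product-form sampling distribution, neither of which is required by the paper): it draws n index pairs with replacement from Π and generates the corresponding 1-bit observations.

```python
import numpy as np

rng = np.random.default_rng(42)
d1, d2, r, n = 30, 20, 2, 500
alpha = 1.0

# A low-rank target M* with entries bounded by alpha.
M_star = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2))
M_star *= alpha / np.max(np.abs(M_star))

# A non-uniform sampling distribution Pi = {pi_kl} (product form, for illustration only).
row_w = rng.random(d1) + 0.5
col_w = rng.random(d2) + 0.5
Pi = np.outer(row_w, col_w)
Pi /= Pi.sum()

# Draw n index pairs i.i.d. with replacement according to Pi.
flat = rng.choice(d1 * d2, size=n, p=Pi.ravel())
rows, cols = np.unravel_index(flat, (d1, d2))

# Logistic link F(m) = 1 / (1 + exp(-m)); Y = +1 with probability F(M*_{ij}).
F = 1.0 / (1.0 + np.exp(-M_star[rows, cols]))
Y = np.where(rng.random(n) < F, 1, -1)

S = list(zip(rows.tolist(), cols.tolist()))   # observed (possibly repeated) index pairs
print(Y[:10], S[:3])
```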

41 Intuitively, it would be more practical to assume that entries are sampled without replacement, or equivalently, to sample n of the d1·d2 binary entries observed with noise without replacement. [sent-213, score-0.222]

42 We therefore use the sampling-with-replacement approach as a proxy for sampling without replacement throughout this paper. [sent-219, score-0.180]

43 Davenport et al. (2012) have focused on the recovery of approximately low-rank matrices by considering the following class of matrices: K_*(α, r) = { M ∈ R^{d1×d2} : ‖M‖_∞ ≤ α, ‖M‖_*/√(d1 d2) ≤ α√r } (11), where 1 ≤ r ≤ min(d1, d2) and α > 0 is a free parameter to be determined. [sent-242, score-0.188]

44 Clearly, any matrix M with rank at most r satisfying ‖M‖_∞ ≤ α belongs to K_*(α, r). [sent-243, score-0.173]

45 Alternatively, using the max-norm as a convex relaxation for the rank, we consider recovery of matrices with ℓ_∞-norm and max-norm constraints, defined by K_max(α, R) := { M ∈ R^{d1×d2} : ‖M‖_∞ ≤ α, ‖M‖_max ≤ R }. [sent-244, score-0.269]
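
The constraints defining K_*(α, r) in (11) can be checked directly with standard linear algebra, whereas the max-norm in K_max(α, R) has no simple closed form (it is usually computed via semidefinite programming); however, any explicit factorization M = UV^T certifies the upper bound ‖M‖_max ≤ max_i ‖u_i‖_2 · max_j ‖v_j‖_2. The sketch below (an illustration, not the paper's code) checks the K_* conditions and evaluates that factorization bound from a balanced SVD split.

```python
import numpy as np

def in_K_star(M, alpha, r):
    # Membership in K_*(alpha, r): ||M||_inf <= alpha and
    # ||M||_* / sqrt(d1 * d2) <= alpha * sqrt(r), with ||.||_* the nuclear norm.
    d1, d2 = M.shape
    nuc = np.linalg.norm(M, 'nuc')
    return np.max(np.abs(M)) <= alpha and nuc / np.sqrt(d1 * d2) <= alpha * np.sqrt(r)

def max_norm_upper_bound(M):
    # ||M||_max <= (max_i ||u_i||_2) * (max_j ||v_j||_2) for any factorization M = U V^T;
    # here U, V come from the balanced SVD split U = U0 sqrt(S), V = V0 sqrt(S).
    U0, s, V0t = np.linalg.svd(M, full_matrices=False)
    U = U0 * np.sqrt(s)
    V = V0t.T * np.sqrt(s)
    return np.max(np.linalg.norm(U, axis=1)) * np.max(np.linalg.norm(V, axis=1))

rng = np.random.default_rng(3)
M = rng.standard_normal((10, 8))
M /= np.max(np.abs(M))                          # enforce ||M||_inf <= 1
print(in_K_star(M, alpha=1.0, r=8), max_norm_upper_bound(M))
```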

46 This condition is both reasonable and critical in approximate low-rank matrix recovery problems, as it controls the spikiness of the solution. [sent-251, score-0.200]

47 (2005), a large gap between the max-complexity (related to max-norm) and the dimensional-complexity (related to rank) is possible only when the underlying low-rank matrix has entries of vastly varying magnitudes. [sent-256, score-0.174]

48 When the noise distribution is log-concave so that the log-likelihood is a concave function, the max-norm constrained minimization problem (13) is a convex program and we recommend a fast and efficient algorithm developed in Lee et al. [sent-259, score-0.225]

49 Now we are ready to state our main results concerning the recovery of an approximately low-rank matrix M^* using the max-norm constrained maximum likelihood estimate. [sent-274, score-0.335]

50 Theorem 3 Suppose that Condition U holds and assume that the training set S follows a general weighted sampling model according to the distribution Π. [sent-276, score-0.185]

51 Remark 4 (i) While using the trace-norm to study this general weighted sampling model, it is common to assume that each row and column is sampled with positive probability (Klopp, 2012; Negahban and Wainwright, 2012), though in some applications this assumption does not seem realistic. [sent-279, score-0.185]

52 (ii) Klopp (2012) studied the problem of standard matrix completion with noise, also in the case of a general sampling distribution, using the trace-norm penalized approach. [sent-282, score-0.494]

53 Cai and Zhou (2013) studied the unquantized problem under a general sampling model using the max-norm as a convex relaxation for the rank. [sent-293, score-0.256]

54 The lower bounds given in Theorem 5 below show that the rate attained by the max-norm constrained maximum likelihood estimator is optimal up to constant factors. [sent-306, score-0.203]

55 Then, as long as the parameters (R, α) satisfy max{2, 4/(d1 ∨ d2)^{1/2}} ≤ R/α ≤ (d1 ∧ d2)^{1/2}/2, the minimax risk for estimating M over the parameter space K_max(α, R) satisfies inf_{M̂} max_{M ∈ K_max(α,R)} (1/(d1 d2)) E‖M̂ − M‖_F^2 ≥ (1/512) min{ α^2, (√β_{α/2}/2) R √((d1+d2)/n) }. [sent-308, score-0.195]

56 If the target matrix M^* is known to have rank at most r, we can take R = α√r, so that the requirement here on the sample size, n ≥ β_{α/2} r(d1 + d2), is weak, and the optimal rate of convergence becomes of order √(4α^2 r(d1 + d2)/n). [sent-319, score-0.173]

57 Comparison to Prior Work. In this paper, we study a matrix completion model proposed in Davenport et al. [sent-321, score-0.377]

58 (2012), in which it is assumed that a binary matrix is observed at random from a distribution parameterized by an unknown matrix which is (approximately) low-rank. [sent-322, score-0.178]

59 We next turn to a detailed comparison of our results for 1-bit matrix completion to those obtained in Davenport et al. [sent-326, score-0.377]

60 Davenport et al. (2012) have studied 1-bit matrix completion under the uniform sampling distribution over the parameter space K_*(α, r) as given in (11), for some α > 0 and a positive integer r ≤ min{d1, d2}. [sent-329, score-0.528]

61 Comparing to (18) with R = α√r, it is easy to see that under the uniform sampling model, the error bounds in (rescaled) Frobenius norm for the two estimates Mmax and Mtr are of the same order. [sent-331, score-0.245]

62 The lower bounds in Davenport et al. (2012) and in Theorem 5 show that Mtr and Mmax achieve the minimax rate of convergence for recovering approximately low-rank matrices over the parameter spaces K_*(α, r) and K_max(α, R), respectively. [sent-333, score-0.169]

63 Moreover, a weighted variant of the trace-norm, which incorporates knowledge of the true sampling distribution in its construction, was proposed and shown experimentally to lead to superior performance. [sent-336, score-0.185]

64 Using this weighted trace-norm, Negahban and Wainwright (2012) provided theoretical guarantees on approximate low-rank matrix completion in the general sampling case, while assuming that each row and column is sampled with positive probability (see condition (17)). [sent-337, score-0.562]

65 The results of Davenport et al. (2012) for trace-norm regularization in the uniform sampling case may also be extended to the weighted trace-norm method under the general sampling model, by using the matrix Bernstein inequality instead of Seginer's theorem. [sent-343, score-0.425]

66 The following theorem states that the weighted trace-norm regularized approach can be nearly as good as the max-norm regularized estimator (up to logarithmic and constant factors), under a general sampling distribution that is not too far from uniform. [sent-357, score-0.228]

67 (2011) in the standard matrix completion problems under arbitrary sampling distributions. [sent-359, score-0.494]

68 Theorem 7 Suppose that Condition U holds but with M ∗ ∈ KΠ,∗ , and assume that the training set S follows a general weighted sampling model according to the distribution Π satisfying (17). [sent-360, score-0.185]

69 Since the construction of the weighted trace-norm ‖·‖_{w,*} depends heavily on the underlying sampling distribution, which is typically unknown in practice, the constraint M^* ∈ K_{Π,*} seems artificial. [sent-362, score-0.185]

70 The max-norm constrained approach, by contrast, does not require knowledge of the exact sampling distribution, and the error bound in the weighted Frobenius norm, as shown in (16), holds even without prior assumptions on Π such as condition (17). [sent-363, score-0.304]

71 Seginer's theorem applies to a random matrix A with i.i.d. zero-mean entries (corresponding to the uniform sampling distribution): for any h ≤ 2 log(max{d1, d2}), E[‖A‖^h] ≤ K^h E[ max_k ‖a_{k·}‖_2^h + max_l ‖a_{·l}‖_2^h ]. [sent-373, score-0.279]

72 Under the non-uniform sampling model, we will deal with a matrix with independent entries that are not necessarily identically distributed, in which case an alternative result of Latala (2005) can be applied: E[‖A‖] ≤ K′ ( max_k E‖a_{k·}‖_2 + max_l E‖a_{·l}‖_2 + (∑_{k,l} E a_{kl}^4)^{1/4} ). [sent-382, score-0.334]

73 Alternatively, one can instead resort to the matrix Bernstein inequality. [sent-388, score-0.200]

74 Suppose that, before solving (13), we know that the target matrix M ∗ has rank at most r∗ . [sent-420, score-0.173]

75 The stochastic gradient method says that at the t-th iteration, we only need to pick one training pair (i_t, j_t) at random from S, and then update g(u_{i_t}^T v_{j_t}; Y_{i_t, j_t}) via the previous procedure. [sent-428, score-0.654]

76 More precisely, if ‖u_{i_t}‖_2^2 > R, we project it back so that ‖u_{i_t}‖_2^2 = R; otherwise we do not make any change (and do the same for v_{j_t}). [sent-429, score-0.324]

77 Next, if |u_{i_t}^T v_{j_t}| > α, replace u_{i_t} and v_{j_t} with √α u_{i_t}/|u_{i_t}^T v_{j_t}|^{1/2} and √α v_{j_t}/|u_{i_t}^T v_{j_t}|^{1/2} respectively; otherwise we keep everything unchanged. [sent-430, score-0.707]
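
The update just described can be written down in a few lines. The sketch below is a hedged illustration only: it assumes a logistic link and an ad hoc constant step size (neither is specified in the text above), performs one stochastic gradient step on the factors of M = UV^T, and then applies the two projection steps, rescaling u_{i_t} (and v_{j_t}) so its squared ℓ2 norm is at most R, and rescaling the pair whenever |u_{i_t}^T v_{j_t}| exceeds α.

```python
import numpy as np

def sgd_step(U, V, i, j, y, R, alpha, step=0.1):
    """One stochastic gradient step for the factorization M = U V^T on entry (i, j),
    using the logistic loss g(m; y) = log(1 + exp(-y m)), followed by the two
    projection steps described above. Illustrative sketch only."""
    m = U[i] @ V[j]
    grad = -y / (1.0 + np.exp(y * m))              # d/dm log(1 + exp(-y m))
    U[i], V[j] = U[i] - step * grad * V[j], V[j] - step * grad * U[i]

    # Projection 1: keep ||u_i||_2^2 <= R (and likewise for v_j).
    for W, k in ((U, i), (V, j)):
        sq = W[k] @ W[k]
        if sq > R:
            W[k] *= np.sqrt(R / sq)

    # Projection 2: keep |u_i^T v_j| <= alpha by rescaling the pair.
    inner = abs(U[i] @ V[j])
    if inner > alpha:
        scale = np.sqrt(alpha / inner)
        U[i] *= scale
        V[j] *= scale
    return U, V

rng = np.random.default_rng(0)
d1, d2, k, R, alpha = 30, 20, 3, 2.0, 1.0
U = 0.1 * rng.standard_normal((d1, k))
V = 0.1 * rng.standard_normal((d2, k))
U, V = sgd_step(U, V, i=4, j=7, y=+1, R=R, alpha=alpha)
```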

78 Numerical Results. In this section, we report the simulation results for low-rank matrix recovery based on 1-bit observations. [sent-434, score-0.169]

79 Figure 1: Plot of the average Frobenius error ‖M̂ − M^*‖_F^2/d^2 versus the sample size s for three different matrix sizes d ∈ {80, 120, 160}, all with rank r = 2. [sent-453, score-0.173]

80 We plot in Figure 2 the squared Frobenius norm of the error (normalized by the norm of the underlying matrix M ∗ ) over a range of different values of noise level σ on a logarithmic scale. [sent-466, score-0.298]

81 Figure 2: Plot of the relative Frobenius error ‖M̂ − M^*‖_F^2/‖M^*‖_F^2 versus the noise level σ on a logarithmic scale, with rank r = 1, using both max-norm and trace-norm constrained methods. [sent-484, score-0.298]

82 When the infinity norm of the unknown matrix M^* is bounded by a constant and its entries are observed uniformly at random, they show that M^* can be recovered from binary measurements accurately and efficiently. [sent-505, score-0.266]

83 The unit max-norm ball is nearly the convex hull of rank-1 matrices whose entries are bounded in magnitude by 1, and is thus a natural convex relaxation of low-rank matrices, particularly those with bounded infinity norm. [sent-507, score-0.285]
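
A quick numerical illustration of this point (a sketch, not from the paper): any rank-1 sign matrix M = uv^T with u, v having ±1 entries admits the factorization certificate ‖M‖_max ≤ (max_i |u_i|)(max_j |v_j|) = 1, so all such matrices lie inside the unit max-norm ball.

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2 = 7, 9
u = rng.choice([-1.0, 1.0], size=(d1, 1))
v = rng.choice([-1.0, 1.0], size=(d2, 1))
M = u @ v.T                                   # a rank-1 sign matrix with +/-1 entries

# The factorization M = u v^T certifies ||M||_max <= max_i|u_i| * max_j|v_j| = 1.
bound = np.max(np.abs(u)) * np.max(np.abs(v))
print(np.unique(M), bound)                    # entries are {-1, +1}; bound equals 1.0
```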

84 Allowing for non-uniform sampling, we show that max-norm constrained maximum likelihood estimation is rate-optimal up to a constant factor, and that the corresponding convex program may be solved efficiently in polynomial time. [sent-508, score-0.220]

85 An interesting question that naturally arises is whether it is possible to push the theory further to cover exact low-rank matrix completion from noisy binary measurements. [sent-509, score-0.414]

86 The numerical study in Section 5 provides some evidence of the efficiency of the max-norm constrained approach in the 1-bit matrix completion problem. [sent-510, score-0.377]

87 In our previous work (Cai and Zhou, 2013), we suggested using max-norm constrained least squares estimation to study standard matrix completion (from observations where additive noise is present) under a general sampling model. [sent-512, score-0.665]

88 We regard matrix recovery as a prediction problem; that is, we consider a matrix M ∈ R^{d1×d2} as a function [d1] × [d2] → R with M(k, l) = M_{k,l}. [sent-519, score-0.258]

89 For a training set S = {(i_1, j_1), ..., (i_n, j_n)} ⊆ ([d1] × [d2])^n of the observed entries of Y, let D_S(M;Y) = (1/n) ∑_{t=1}^{n} g(M_{i_t, j_t}; Y_{i_t, j_t}) = (1/n) ℓ_S(M;Y) be the average empirical likelihood function, where the training set S is drawn i.i.d. according to Π. [sent-524, score-0.619]

90 Then we have D_Π(M;Y) := E_{S∼Π}[g(M_{i_t, j_t}; Y_{i_t, j_t})] = ∑_{(k,l)∈[d1]×[d2]} π_{kl} · g(M_{k,l}; Y_{k,l}). [sent-528, score-0.436]
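
A minimal numerical sanity check of this identity (a sketch under the same illustrative logistic-link assumption as above, not the paper's code): conditionally on Y, the empirical average D_S(M; Y) concentrates around the population quantity D_Π(M; Y) as n grows.

```python
import numpy as np

rng = np.random.default_rng(7)
d1, d2, n = 15, 12, 200000

def g(m, y):
    # Per-entry negative log-likelihood under the logistic link.
    return np.log1p(np.exp(-y * m))

M = rng.uniform(-1, 1, (d1, d2))                 # a candidate matrix M
M_star = rng.uniform(-1, 1, (d1, d2))            # the underlying truth
Y = np.where(rng.random((d1, d2)) < 1 / (1 + np.exp(-M_star)), 1, -1)

Pi = rng.random((d1, d2))
Pi /= Pi.sum()                                   # a generic sampling distribution

# Population quantity D_Pi(M; Y) = sum_{k,l} pi_kl * g(M_kl; Y_kl).
D_Pi = np.sum(Pi * g(M, Y))

# Empirical average D_S(M; Y) over n indices drawn i.i.d. from Pi.
flat = rng.choice(d1 * d2, size=n, p=Pi.ravel())
rows, cols = np.unravel_index(flat, (d1, d2))
D_S = np.mean(g(M[rows, cols], Y[rows, cols]))

print(D_Pi, D_S)   # the two values should agree closely for large n
```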

91 Since Mmax is optimal and M^* is feasible for the optimization problem (13), we have D_S(Mmax;Y) ≤ D_S(M^*;Y) = (1/n) ∑_{t=1}^{n} g(M^*_{i_t, j_t}; Y_{i_t, j_t}). [sent-532, score-0.436]

92 Applying Fano's inequality (Cover and Thomas, 1991) gives the lower bound P(M̂ ≠ M^⋆) ≥ 1 − (I(M^⋆; Y_S) + log 2)/log|M| (37), where I(M^⋆; Y_S) denotes the mutual information between the random parameter M^⋆ in M and the observation matrix Y_S. [sent-567, score-0.215]

93 Let A_1 = (i_1, j_1), ..., A_n = (i_n, j_n) be independent random variables taking values in [d1] × [d2] according to Π, and recall that ℓ_S(M;Y) = ∑_{t=1}^{n} [ 1{Y_{A_t} = 1} log(1/F(M_{A_t})) + 1{Y_{A_t} = −1} log(1/(1 − F(M_{A_t}))) ]. [sent-581, score-0.177]

94 Then it follows from (5) that ∆ ≤ α√r · E[ sup_{‖u‖_2 = ‖v‖_2 = 1} ∑_{t=1}^{n} (ε_t/√(π_{i_t·} π_{·j_t})) u_{i_t} v_{j_t} ] = α√r · E‖ ∑_{t=1}^{n} (ε_t/√(π_{i_t·} π_{·j_t})) e_{i_t} e_{j_t}^T ‖. [sent-587, score-0.512]

95 (2011), we see that, under condition (26), E‖∑_{t=1}^{n} Q_t‖ ≤ C( σ_1 √(log d) + σ_2 log d ), with σ_1 = n · max{ max_k ∑_l π_{kl}/(π_{k·} π_{·l}), max_l ∑_k π_{kl}/(π_{k·} π_{·l}) } ≤ µ n max{d1, d2} and σ_2 = max_{k,l} 1/√(π_{k·} π_{·l}) ≤ µ √(d1 d2). [sent-595, score-0.172]

96 We shall show here that in the uniform sampling setting, the results obtained in this paper continue to hold if the (binary) entries are sampled without replacement. [sent-599, score-0.236]

97 Suppose that (X_1, ..., X_n) is a C^n-valued random vector modeling sampling with replacement from C. [sent-619, score-0.18]

98 Learning with the weighted trace-norm under arbitrary sampling distributions. [sent-697, score-0.185]

99 Nuclear norm penalization and optimal rates for noisy low rank matrix completion. [sent-760, score-0.267]

100 Restricted strong convexity and weighted matrix completion: optimal bounds with noise. [sent-798, score-0.194]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('mmax', 0.305), ('davenport', 0.292), ('completion', 0.288), ('kmax', 0.278), ('jt', 0.218), ('ompletion', 0.217), ('srebro', 0.193), ('rs', 0.165), ('foygel', 0.159), ('hou', 0.154), ('ds', 0.144), ('atrix', 0.144), ('yat', 0.124), ('constrained', 0.119), ('sampling', 0.117), ('frobenius', 0.115), ('rademacher', 0.113), ('kl', 0.111), ('ys', 0.11), ('ey', 0.109), ('mat', 0.106), ('matrix', 0.089), ('yit', 0.088), ('entries', 0.085), ('rank', 0.084), ('recovery', 0.08), ('cai', 0.079), ('minimax', 0.078), ('uv', 0.078), ('mi', 0.073), ('weighted', 0.068), ('plan', 0.067), ('bmax', 0.066), ('log', 0.063), ('replacement', 0.063), ('klopp', 0.062), ('mtr', 0.062), ('zhou', 0.059), ('zi', 0.058), ('norm', 0.057), ('conv', 0.055), ('matrices', 0.054), ('convex', 0.054), ('burer', 0.053), ('uit', 0.053), ('ai', 0.052), ('arxiv', 0.052), ('noise', 0.052), ('jn', 0.051), ('fano', 0.048), ('likelihood', 0.047), ('grothendieck', 0.047), ('kg', 0.047), ('laska', 0.047), ('shraibman', 0.047), ('unquantized', 0.047), ('dh', 0.046), ('salakhutdinov', 0.046), ('smoothed', 0.045), ('negahban', 0.044), ('max', 0.043), ('logarithmic', 0.043), ('collaborative', 0.042), ('tony', 0.041), ('boufounos', 0.04), ('movies', 0.04), ('sy', 0.04), ('compressed', 0.039), ('norms', 0.039), ('sensing', 0.038), ('sup', 0.038), ('relaxation', 0.038), ('preprint', 0.037), ('ix', 0.037), ('mendelson', 0.037), ('quantized', 0.037), ('noisy', 0.037), ('bounds', 0.037), ('linial', 0.036), ('monteiro', 0.036), ('vershynin', 0.036), ('lee', 0.036), ('measurements', 0.035), ('marginals', 0.035), ('logistic', 0.034), ('uniform', 0.034), ('mw', 0.033), ('qt', 0.033), ('wharton', 0.033), ('absconv', 0.031), ('marketing', 0.031), ('rateoptimal', 0.031), ('seginer', 0.031), ('spence', 0.031), ('spikiness', 0.031), ('tracenorm', 0.031), ('uvt', 0.031), ('packing', 0.031), ('bit', 0.031), ('inf', 0.031)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999976 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence

2 0.095658153 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

3 0.082151063 96 jmlr-2013-Regularization-Free Principal Curve Estimation

Author: Samuel Gerber, Ross Whitaker

Abstract: Principal curves and manifolds provide a framework to formulate manifold learning within a statistical context. Principal curves define the notion of a curve passing through the middle of a distribution. While the intuition is clear, the formal definition leads to some technical and practical difficulties. In particular, principal curves are saddle points of the mean-squared projection distance, which poses severe challenges for estimation and model selection. This paper demonstrates that the difficulties in model selection associated with the saddle point property of principal curves are intrinsically tied to the minimization of the mean-squared projection distance. We introduce a new objective function, facilitated through a modification of the principal curve estimation approach, for which all critical points are principal curves and minima. Thus, the new formulation removes the fundamental issue for model selection in principal curve estimation. A gradient-descent-based estimator demonstrates the effectiveness of the new formulation for controlling model complexity on numerical experiments with synthetic and real data. Keywords: principal curve, manifold estimation, unsupervised learning, model complexity, model selection

4 0.07262487 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

Author: Antony Joseph

Abstract: The performance of orthogonal matching pursuit (OMP) for variable selection is analyzed for random designs. When contrasted with the deterministic case, since the performance is here measured after averaging over the distribution of the design matrix, one can have far less stringent sparsity constraints on the coefficient vector. We demonstrate that for exact sparse vectors, the performance of the OMP is similar to known results on the Lasso algorithm (Wainwright, 2009). Moreover, variable selection under a more relaxed sparsity assumption on the coefficient vector, whereby one has only control on the ℓ1 norm of the smaller coefficients, is also analyzed. As consequence of these results, we also show that the coefficient estimate satisfies strong oracle type inequalities. Keywords: high dimensional regression, greedy algorithms, Lasso, compressed sensing

5 0.072219931 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

Author: Tony Cai, Jianqing Fan, Tiefeng Jiang

Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere

6 0.071074739 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

7 0.065218322 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

8 0.06238018 114 jmlr-2013-The Rate of Convergence of AdaBoost

9 0.061525665 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling

10 0.060311273 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

11 0.056163073 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models

12 0.050421976 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

13 0.049174968 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

14 0.048519246 104 jmlr-2013-Sparse Single-Index Model

15 0.047972336 31 jmlr-2013-Derivative Estimation with Local Polynomial Fitting

16 0.045685597 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

17 0.043659344 29 jmlr-2013-Convex and Scalable Weakly Labeled SVMs

18 0.042302206 6 jmlr-2013-A Plug-in Approach to Neyman-Pearson Classification

19 0.042294152 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

20 0.042212486 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.247), (1, 0.059), (2, 0.073), (3, 0.023), (4, 0.05), (5, -0.048), (6, 0.044), (7, -0.014), (8, 0.147), (9, 0.049), (10, 0.037), (11, 0.03), (12, 0.068), (13, -0.006), (14, -0.036), (15, 0.076), (16, -0.052), (17, -0.18), (18, -0.112), (19, -0.007), (20, 0.017), (21, -0.099), (22, 0.003), (23, 0.076), (24, 0.025), (25, -0.043), (26, 0.056), (27, -0.054), (28, 0.064), (29, 0.025), (30, -0.03), (31, 0.018), (32, -0.01), (33, 0.058), (34, 0.021), (35, -0.149), (36, -0.12), (37, -0.034), (38, 0.129), (39, -0.112), (40, 0.138), (41, 0.051), (42, -0.018), (43, 0.057), (44, -0.123), (45, 0.037), (46, -0.057), (47, 0.029), (48, -0.1), (49, -0.149)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.91334128 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence

2 0.58457035 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos

Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm

3 0.52155954 36 jmlr-2013-Distributions of Angles in Random Packing on Spheres

Author: Tony Cai, Jianqing Fan, Tiefeng Jiang

Abstract: This paper studies the asymptotic behaviors of the pairwise angles among n randomly and uniformly distributed unit vectors in R p as the number of points n → ∞, while the dimension p is either fixed or growing with n. For both settings, we derive the limiting empirical distribution of the random angles and the limiting distributions of the extreme angles. The results reveal interesting differences in the two settings and provide a precise characterization of the folklore that “all high-dimensional random vectors are almost always nearly orthogonal to each other”. Applications to statistics and machine learning and connections with some open problems in physics and mathematics are also discussed. Keywords: random angle, uniform distribution on sphere, empirical law, maximum of random variables, minimum of random variables, extreme-value distribution, packing on sphere

4 0.4670189 62 jmlr-2013-Learning Theory Approach to Minimum Error Entropy Criterion

Author: Ting Hu, Jun Fan, Qiang Wu, Ding-Xuan Zhou

Abstract: We consider the minimum error entropy (MEE) criterion and an empirical risk minimization learning algorithm when an approximation of Rényi’s entropy (of order 2) by Parzen windowing is minimized. This learning algorithm involves a Parzen windowing scaling parameter. We present a learning theory approach for this MEE algorithm in a regression setting when the scaling parameter is large. Consistency and explicit convergence rates are provided in terms of the approximation ability and capacity of the involved hypothesis space. Novel analysis is carried out for the generalization error associated with Rényi’s entropy and a Parzen windowing function, to overcome technical difficulties arising from the essential differences between the classical least squares problems and the MEE setting. An involved symmetrized least squares error is introduced and analyzed, which is related to some ranking algorithms. Keywords: minimum error entropy, learning theory, Rényi’s entropy, empirical risk minimization, approximation error

5 0.46006906 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

6 0.45571342 21 jmlr-2013-Classifier Selection using the Predicate Depth

7 0.44575331 35 jmlr-2013-Distribution-Dependent Sample Complexity of Large Margin Learning

8 0.44448858 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

9 0.44337761 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

10 0.4410446 104 jmlr-2013-Sparse Single-Index Model

11 0.40228972 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

12 0.39110669 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

13 0.38633204 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

14 0.38446176 53 jmlr-2013-Improving CUR Matrix Decomposition and the Nystrom Approximation via Adaptive Sampling

15 0.38191646 96 jmlr-2013-Regularization-Free Principal Curve Estimation

16 0.36537358 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference

17 0.36472365 49 jmlr-2013-Global Analytic Solution of Fully-observed Variational Bayesian Matrix Factorization

18 0.35866597 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

19 0.35430902 84 jmlr-2013-PC Algorithm for Nonparanormal Graphical Models

20 0.35319546 18 jmlr-2013-Beyond Fano's Inequality: Bounds on the Optimal F-Score, BER, and Cost-Sensitive Risk and Their Implications


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.015), (5, 0.147), (6, 0.042), (10, 0.083), (20, 0.022), (23, 0.04), (53, 0.016), (55, 0.293), (68, 0.027), (70, 0.047), (75, 0.042), (85, 0.047), (87, 0.038), (89, 0.019), (93, 0.013), (94, 0.014)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.71246964 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

Author: Tony Cai, Wen-Xin Zhou

Abstract: We consider in this paper the problem of noisy 1-bit matrix completion under a general non-uniform sampling distribution using the max-norm as a convex relaxation for the rank. A max-norm constrained maximum likelihood estimate is introduced and studied. The rate of convergence for the estimate is obtained. Information-theoretical methods are used to establish a minimax lower bound under the general sampling model. The minimax upper and lower bounds together yield the optimal rate of convergence for the Frobenius norm loss. Computational algorithms and numerical performance are also discussed. Keywords: 1-bit matrix completion, low-rank matrix, max-norm, trace-norm, constrained optimization, maximum likelihood estimate, optimal rate of convergence

2 0.54736334 25 jmlr-2013-Communication-Efficient Algorithms for Statistical Optimization

Author: Yuchen Zhang, John C. Duchi, Martin J. Wainwright

Abstract: We analyze two communication-efficient algorithms for distributed optimization in statistical settings involving large-scale data sets. The first algorithm is a standard averaging method that distributes the N data samples evenly to m machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves √ mean-squared error (MSE) that decays as O (N −1 + (N/m)−2 ). Whenever m ≤ N, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all N samples. The second algorithm is a novel method, based on an appropriate form of bootstrap subsampling. Requiring only a single round of communication, it has mean-squared error that decays as O (N −1 + (N/m)−3 ), and so is more robust to the amount of parallelization. In addition, we show that a stochastic gradient-based method attains mean-squared error decaying as O (N −1 + (N/m)−3/2 ), easing computation at the expense of a potentially slower MSE rate. We also provide an experimental evaluation of our methods, investigating their performance both on simulated data and on a large-scale regression problem from the internet search domain. In particular, we show that our methods can be used to efficiently solve an advertisement prediction problem from the Chinese SoSo Search Engine, which involves logistic regression with N ≈ 2.4 × 108 samples and d ≈ 740,000 covariates. Keywords: distributed learning, stochastic optimization, averaging, subsampling

3 0.54046547 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

Author: Sébastien Gerchinovitz

Abstract: We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T . We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW and based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design. Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds

4 0.53314203 2 jmlr-2013-A Binary-Classification-Based Metric between Time-Series Distributions and Its Use in Statistical and Learning Problems

Author: Daniil Ryabko, Jérémie Mary

Abstract: A metric between time-series distributions is proposed that can be evaluated using binary classification methods, which were originally developed to work on i.i.d. data. It is shown how this metric can be used for solving statistical problems that are seemingly unrelated to classification and concern highly dependent time series. Specifically, the problems of time-series clustering, homogeneity testing and the three-sample problem are addressed. Universal consistency of the resulting algorithms is proven under most general assumptions. The theoretical results are illustrated with experiments on synthetic and real-world data. Keywords: time series, reductions, stationary ergodic, clustering, metrics between probability distributions

5 0.5318045 26 jmlr-2013-Conjugate Relation between Loss Functions and Uncertainty Sets in Classification Problems

Author: Takafumi Kanamori, Akiko Takeda, Taiji Suzuki

Abstract: There are two main approaches to binary classification problems: the loss function approach and the uncertainty set approach. The loss function approach is widely used in real-world data analysis. Statistical decision theory has been used to elucidate its properties such as statistical consistency. Conditional probabilities can also be estimated by using the minimum solution of the loss function. In the uncertainty set approach, an uncertainty set is defined for each binary label from training samples. The best separating hyperplane between the two uncertainty sets is used as the decision function. Although the uncertainty set approach provides an intuitive understanding of learning algorithms, its statistical properties have not been sufficiently studied. In this paper, we show that the uncertainty set is deeply connected with the convex conjugate of a loss function. On the basis of the conjugate relation, we propose a way of revising the uncertainty set approach so that it will have good statistical properties such as statistical consistency. We also introduce statistical models corresponding to uncertainty sets in order to estimate conditional probabilities. Finally, we present numerical experiments, verifying that the learning with revised uncertainty sets improves the prediction accuracy. Keywords: loss function, uncertainty set, convex conjugate, consistency

6 0.53012723 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

7 0.52988333 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

8 0.52966815 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

9 0.52914423 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

10 0.5259614 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

11 0.52413452 73 jmlr-2013-Multicategory Large-Margin Unified Machines

12 0.52373731 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation

13 0.52349424 39 jmlr-2013-Efficient Active Learning of Halfspaces: An Aggressive Approach

14 0.52146423 50 jmlr-2013-Greedy Feature Selection for Subspace Clustering

15 0.52089798 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

16 0.52073425 68 jmlr-2013-Machine Learning with Operational Costs

17 0.52017725 10 jmlr-2013-Algorithms and Hardness Results for Parallel Large Margin Learning

18 0.52017635 69 jmlr-2013-Manifold Regularization and Semi-supervised Learning: Some Theoretical Analyses

19 0.51917893 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

20 0.51889145 59 jmlr-2013-Large-scale SVD and Manifold Learning