jmlr jmlr2013 jmlr2013-101 knowledge-graph by maker-knowledge-mining

101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning


Source: pdf

Author: Markus Thom, Günther Palm

Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. [sent-7, score-0.663]

2 Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity 1. [sent-13, score-0.657]

3 This sparseness measure has the significant disadvantage of not being scale-invariant, so that an intuitive notion of sparseness cannot be derived from it. [sent-34, score-0.822]

4 This sparseness measure fulfills all criteria of Hurley and Rickard (2009) except for Dalton’s fourth law, which states that the sparseness of a vector should be identical to the sparseness of the vector resulting from multiple concatenation of the original vector. [sent-40, score-1.233]
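The measure referred to here is Hoyer's (2004) normalized ratio of the L1 and L2 norms, σ(x) = (√n − ‖x‖1/‖x‖2)/(√n − 1), which equals 1 for a vector with a single nonzero entry and 0 for a nonzero vector with all entries equal. A minimal NumPy sketch (the function name and the eps guard are illustrative choices, not taken from the paper):

```python
import numpy as np

def hoyer_sparseness(x, eps=1e-12):
    """Hoyer's sparseness measure for a vector x with n >= 2 entries:
    1 for a one-hot vector, 0 for a constant nonzero vector."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x * x).sum())
    return (np.sqrt(n) - l1 / max(l2, eps)) / (np.sqrt(n) - 1.0)
```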

5 For example, sparseness of connectivity in a biological brain increases quickly with its volume, so that connectivity in a human brain is about 170 times more sparse than in a rat brain (Karbowski, 2003). [sent-42, score-0.684]

6 For a pre-defined target degree of sparseness σ∗ ∈ (0, 1), the operator finds the closest vector of a given scale that has sparseness σ∗ given an arbitrary vector. [sent-45, score-0.92]
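Given this definition, fixing a target sparseness σ* only pins down the ratio of the target norms λ1/λ2; with λ2 set to unity, as done in the paper's applications, the matching L1 target follows by solving σ(x) = σ* for λ1. A small helper illustrating the relation (the function name is a hypothetical choice):

```python
import numpy as np

def target_norms(n, sigma_star, lam2=1.0):
    """Return (lam1, lam2) such that a vector of dimension n with these
    L1 and L2 norms has Hoyer sparseness exactly sigma_star."""
    lam1 = lam2 * (np.sqrt(n) - sigma_star * (np.sqrt(n) - 1.0))
    return lam1, lam2
```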

7 Hoyer’s original algorithm for computation of such a projection is an alternating projection onto a hyperplane representing the L1 norm constraint, a hypersphere representing the L2 norm constraint, and the non-negative orthant. [sent-50, score-0.86]
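Hoyer's original scheme can be sketched roughly as follows; this is a simplified reading of that alternating projection (variable names, the termination policy and the clipping loop are illustrative), not the improved Algorithm 1 proposed in this paper:

```python
import numpy as np

def hoyer_projection(x, lam1, lam2, max_iter=100):
    """Rough sketch of a Hoyer-style alternating projection onto
    {s >= 0, sum(s) = lam1, ||s||_2 = lam2}."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # project onto the hyperplane sum(s) = lam1
    s = x + (lam1 - x.sum()) / n
    mask = np.ones(n, dtype=bool)            # coordinates still free
    for _ in range(max_iter):
        if not mask.any():
            break
        # project onto the hypersphere ||s||_2 = lam2 within the
        # hyperplane: move radially away from the barycenter m
        m = np.zeros(n)
        m[mask] = lam1 / mask.sum()
        d = s - m
        norm_d = np.linalg.norm(d)
        if norm_d > 0.0:
            alpha = np.sqrt(max(lam2 ** 2 - (m * m).sum(), 0.0)) / norm_d
            s = m + alpha * d
        if (s >= -1e-12).all():
            return np.clip(s, 0.0, None)
        # project onto the non-negative orthant: clip negative entries,
        # freeze them, and re-enforce the L1 target on the remaining ones
        neg = s < 0.0
        s[neg] = 0.0
        mask &= ~neg
        s[mask] += (lam1 - s.sum()) / mask.sum()
    return np.clip(s, 0.0, None)
```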

8 Section 2 proposes a simple algorithm for carrying out sparseness-enforcing projections with respect to Hoyer’s sparseness measure. [sent-56, score-0.597]

9 Because the projection itself is differentiable, it is the ideal tool for achieving sparseness in gradient-based learning. [sent-58, score-0.632]

10 This is exploited in Section 3, where the sparseness projection is used to obtain a classifier that features both sparse activity and sparse connectivity in a natural way. [sent-59, score-1.064]

11 The problem of finding projections onto sets where Hoyer’s sparseness measure attains a constant value is addressed in Appendix C. [sent-66, score-0.818]

12 Appendix D investigates analytical properties of the sparseness projection and concludes with an efficient algorithm that computes its gradient. [sent-68, score-0.632]

13 As a consequence, when a vector is sorted in ascending or descending order, then its projection onto M is sorted accordingly. [sent-79, score-0.577]

14 By exploiting this property, projections onto M can be obtained by recording and discarding the signs of the coordinates of the argument, projecting onto M ∩ R^n_≥0, and finally restoring the signs of the coordinates of the result using the signs of the argument. [sent-82, score-0.628]
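A direct transcription of that symmetry argument, assuming some routine project_nonneg for the non-negative case (for instance a sketch like hoyer_projection above with fixed target norms):

```python
import numpy as np

def project_signed(x, project_nonneg):
    """Reduce a projection with free signs to the non-negative case:
    record the signs, project the magnitudes, restore the signs."""
    x = np.asarray(x, dtype=float)
    signs = np.sign(x)
    signs[signs == 0.0] = 1.0        # arbitrary sign for zero entries
    return signs * project_nonneg(np.abs(x))

# example usage with assumed target norms lam1, lam2:
#   p = project_signed(x, lambda v: hoyer_projection(v, lam1, lam2))
```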

15 It is hence enough to handle projections onto S^{≥0}_2 in the first place, as projections onto the unrestricted target set can easily be recovered. [sent-91, score-0.942]

16 In the applications of the sparseness projection in this paper, λ2 is always set to unity to achieve normalized projections, and λ1 is adjusted as explained in Section 1. [sent-93, score-0.632]

17 For computation of projections onto an intersection of a finite number of closed and convex sets, it is enough to perform alternating projections onto the members of the intersection (Deutsch, 2001). [sent-108, score-0.929]

18 With these preparations, a simple algorithm can be proposed; it computes the sparseness-enforcing projection with respect to a constraint induced by Hoyer’s sparseness measure σ. [sent-120, score-0.632]

19 Algorithm 1: Proposed algorithm for computing the sparseness-enforcing projection operator for Hoyer's sparseness measure σ. [sent-123, score-0.663]

20 …, n} | r_i ≠ 0}; end. As already pointed out, the idea of Algorithm 1 is that projections onto D can be computed by alternating projections onto the geometric structures just defined. [sent-133, score-0.893]

21 The alternating projections onto H and L at the beginning of Algorithm 1 make the result of the projection onto D invariant to positive scaling and arbitrary shifting of the argument, as shown in Corollary 19. [sent-142, score-0.921]

22 The formula for projections onto L can be generalized for projections onto LI for an index set I ⊆ {1, . [sent-144, score-0.814]

23 The separator t̂ and the number of nonzero entries in the projection onto C can be computed with Algorithm 2, which is an adapted version of the algorithm of Chen and Ye (2011). [sent-155, score-0.572]
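The separator can be found with the familiar sort-and-threshold scheme for Euclidean projection onto the scaled canonical simplex {s ≥ 0, ∑ s_i = λ1}; the following is a standard sketch in the spirit of Chen and Ye (2011), not their exact pseudocode:

```python
import numpy as np

def project_simplex(x, lam1=1.0):
    """Euclidean projection of x onto {s >= 0, sum(s) = lam1}."""
    x = np.asarray(x, dtype=float)
    u = np.sort(x)[::-1]                        # entries in descending order
    css = np.cumsum(u) - lam1                   # shifted cumulative sums
    rho = np.nonzero(u > css / (np.arange(x.size) + 1.0))[0][-1]
    t_hat = css[rho] / (rho + 1.0)              # the separator threshold
    return np.maximum(x - t_hat, 0.0)
```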

24 Based on the symmetries induced by Hoyer’s sparseness measure σ and by exploiting the projection onto a simplex, an improved method was given in Algorithm 3. [sent-227, score-0.892]

25 Therefore, at least the same amount of entries is set to zero in the simplex projection compared to the projection onto the non-negative orthant, see also Corollary 29. [sent-235, score-0.882]

26 15 were sampled and their sparse projections were computed using the respective algorithms, to gain the best normalized approximations with a target sparseness degree of σ∗ := 0. [sent-239, score-0.737]

27 Starting with the second iteration, dimensions are discarded by projecting onto R≥0 in the original algorithm and onto C in the improved variant, which yields vanishing entries in the working vectors. [sent-261, score-0.614]

28 Then the absolute time needed for the algorithms to compute the projections with a target sparseness of 0. [sent-299, score-0.664]

29 Because the projection onto D is essentially a composition of projections onto H, C, L and LI , the overall gradient can be computed using the chain rule. [sent-341, score-0.91]
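For the simplex step with a fixed support I of size d, the local Jacobian is the identity minus (1/d) times the all-ones matrix on the support, and zero elsewhere, so backpropagation only needs a masked mean; a hedged sketch of the corresponding vector-Jacobian product (the full gradient of the projection onto D composes such factors with those of the H, L and LI steps, along the lines of the algorithm in Appendix D):

```python
import numpy as np

def simplex_projection_vjp(grad_out, support):
    """Vector-Jacobian product of the simplex projection for a fixed
    support pattern: on the support the Jacobian is I - (1/d) * 11^T,
    off the support it is zero."""
    grad_in = np.zeros_like(grad_out, dtype=float)
    if support.any():
        g = grad_out[support]
        grad_in[support] = g - g.mean()       # subtract (1/d) * sum(g)
    return grad_in
```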

30 This section proposes a hybrid of an auto-encoder network and a two-layer neural network, where the sparseness projection is employed as a neuronal transfer function. [sent-350, score-0.762]

31 If the transfer function f is set to the sparseness projection, the internal representation h will be sparsely populated, fulfilling the sparse activity property. [sent-366, score-0.839]
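A minimal sketch of such a forward pass, with the projection serving as the transfer function; the exact wiring assumed here (tied weights for the reconstruction, a softmax output, no bias terms) is illustrative and not the paper's precise architecture:

```python
import numpy as np

def soae_forward(x, W, W_out, project_sigma, sigma_h):
    """Hybrid auto-encoder / classifier forward pass in which the
    hidden representation is made sparse by construction."""
    a = W.T @ x                        # pre-activations of the hidden layer
    h = project_sigma(a, sigma_h)      # sparse activity via the projection
    x_rec = W @ h                      # auto-encoder reconstruction
    scores = W_out @ h                 # classification scores
    z = np.exp(scores - scores.max())  # numerically stable softmax
    return h, x_rec, z / z.sum()
```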

32 …, n}, where σ_W ∈ (0, 1) is the target degree of connectivity sparseness and W e_i is the i-th column of W. [sent-374, score-0.578]

33 Here, each update to the degrees of freedom is followed by application of the sparseness projection to the columns of W to enforce sparse connectivity. [sent-406, score-0.705]
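A hedged sketch of such a projected-gradient step for sparse connectivity; here project_sigma stands for the σ-projection with unit L2 target, and the plain learning-rate update is an illustrative simplification:

```python
import numpy as np

def projected_gradient_step(W, grad_W, lr, project_sigma, sigma_w):
    """Gradient update followed by a column-wise sparseness projection,
    enforcing sparse connectivity of the weight matrix."""
    W = W - lr * grad_W
    for i in range(W.shape[1]):
        W[:, i] = project_sigma(W[:, i], sigma_w)   # sparse i-th column
    return W
```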

34 There are theoretical results on the convergence of projected gradient methods when projections are carried out onto convex sets (Bertsekas, 1999), but here the target set for projection is non-convex. [sent-407, score-0.825]

35 The dimensionality of the internal representation n and the target degree of sparseness with respect to the connectivity σW ∈ (0, 1) are parameters of the algorithm. [sent-413, score-0.657]

36 The more sophisticated method is to use the unrestricted sparseness-enforcing projection operator π with respect to Hoyer's sparseness measure σ, which can be carried out by Algorithm 3. [sent-417, score-0.729]

37 After every epoch, a sparseness projection is applied to the columns of W . [sent-434, score-0.632]

38 Target degrees of sparse activity σH with respect to Hoyer's sparseness measure σ were chosen from the interval [0. [sent-507, score-0.67]

39 For every value of σH , the resulting sparseness of activity was measured after training using the L0 pseudo-norm. [sent-513, score-0.597]

40 The standard deviation of the activity decreases when sparseness increases, hence the mapping from σH to the resulting number of active units becomes more accurate. [sent-521, score-0.659]

41 The target sparseness of activity is given by a parameter κ ∈ {1, . [sent-523, score-0.664]
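The L0 alternative simply keeps the κ entries of largest magnitude and zeroes the rest; a minimal sketch (the renormalization to unit L2 norm is an assumption made in analogy to the σ variant):

```python
import numpy as np

def project_l0(x, kappa, normalize=True):
    """Projection onto vectors with at most kappa nonzero entries:
    keep the kappa entries of largest magnitude."""
    x = np.asarray(x, dtype=float)
    h = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[::-1][:kappa]   # indices of largest entries
    h[idx] = x[idx]
    nrm = np.linalg.norm(h)
    if normalize and nrm > 0.0:
        h /= nrm
    return h
```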

42 Usage of the σ projection consequently outperforms the L0 projection for all sparseness degrees. [sent-530, score-0.853]

43 SOAE-σ is more robust: its classification capabilities only begin to collapse when sparseness drops below 5%, whereas SOAE-L0 starts to degenerate when sparseness falls below 20%. [sent-533, score-0.864]

44 The variants SOAE-σ and SOAE-L0 denote the entirety of the respective experiments where sparseness of activity lies in the intervals described above, that is σH ∈ [0. [sent-540, score-0.597]

45 Figure 5: Resulting amount of nonzero entries in an internal representation h with 1000 entries, depending on the target degree of sparseness for activity σH with respect to Hoyer's sparseness measure σ. [sent-567, score-1.252]

46 For low values of σH , about 80% of the entries are nonzero, whereas for very high sparseness degrees only 1% of the entries do not vanish. [sent-568, score-0.611]

47 Figure 6: Resulting classification error on the MNIST evaluation set for the supervised online autoencoder network, in dependence of sparseness of activity in the hidden layer (horizontal axis: sparseness of activity measured in the L0 pseudo-norm, in percent). [sent-580, score-0.626]

48 The projection onto an L0 pseudo-norm constraint for variant SOAE-L0 and the projection onto a constraint determined by sparseness measure σ for variant SOAE-σ were used as transfer functions. [sent-581, score-1.42]

49 First, sparseness of activity is achieved through a latent variable that stores the optimal sparse code words of all samples simultaneously. [sent-642, score-0.67]

50 With SOAE, sparseness of activity is guaranteed by construction. [sent-646, score-0.597]

51 As Hoyer's sparseness measure σ and the corresponding projection possess desirable analytical properties, they can be considered smooth approximations to the L0 pseudo-norm. [sent-747, score-0.632]

52 Here, an algorithm for the sparseness-enforcing projection with respect to Hoyer’s sparseness measure σ was proposed. [sent-756, score-0.632]

53 But this changes the meaning of the problem dramatically, as now virtually any sparseness below the original target degree of sparseness can be obtained. [sent-769, score-0.889]

54 This is because when the target L1 norm λ1 is fixed, the sparseness measure σ decreases whenever the target L2 norm decreases. [sent-770, score-0.609]

55 In geometric terms, the method proposed in this paper performs a projection from within a circle onto its boundary to increase the sparseness of the working vector. [sent-771, score-0.892]

56 When only independent solutions are required, the projection of a point x onto a scaled canonical simplex of L1 norm λ1 can also be carried out in linear time (Liu and Ye, 2009), without having to sort the vector that is to be projected. [sent-781, score-0.662]

57 For the simplex C, a characterization of critical points is given with Lemma 32 and Lemma 33, and it is shown that the expression for the projection onto C is invariant to local changes. [sent-793, score-0.561]

58 Similar approaches for sparseness projections are discussed in the following. [sent-796, score-0.597]

59 A problem where similar projections are employed is to minimize a convex function subject to group sparseness (see for example Friedman et al. [sent-810, score-0.597]

60 When p = 1, then the projection onto a simplex can be generalized directly for q = 2 (van den Berg et al. [sent-818, score-0.561]

61 Therefore, the elastic net induces a different notion of sparseness than Hoyer’s sparseness measure σ does. [sent-825, score-0.822]

62 As is the case for mixed norm balls, the projection onto a simplex can be generalized to achieve projections onto N (Mairal et al. [sent-826, score-1.0]

63 2 Supervised Online Auto-Encoder The sparseness-enforcing projection operator π with respect to Hoyer’s sparseness measure σ and the projection onto an L0 pseudo-norm constraint are differentiable almost everywhere. [sent-829, score-1.163]

64 In addition, the matrix of bases which was used to compute the internal representation was enforced to be sparsely populated by application of the sparseness projection after each learning epoch. [sent-833, score-0.766]

65 Similar techniques to achieve a shrinkage-like effect for increasing sparseness of activity in a neural network were used by Gregor and LeCun (2010) and Glorot et al. [sent-855, score-0.625]

66 This paper studied Hoyer’s sparseness measure σ, and in particular the projection of arbitrary vectors onto sets where σ attains a constant value. [sent-894, score-0.853]

67 As projections onto σ constraints are well-understood, they constitute the ideal tool for building systems that can benefit from sparseness constraints. [sent-898, score-0.818]

68 However, when the target degree of sparseness of the activity is in a reasonable range, classification results are not only equivalent but superior to classical non-sparse approaches. [sent-905, score-0.664]

69 As an example, note that the outcome of the sparseness-enforcing projection operator depends only on the target sparseness degree up to scaling: Remark 5 Let λ1, λ2 > 0 and λ̃1, λ̃2 > 0 be pairs of target norms such that λ1/λ2 = λ̃1/λ̃2. [sent-938, score-0.797]

70 Projections onto Symmetric Sets: This appendix investigates certain symmetries of sets and their effect on projections onto such sets. [sent-959, score-0.667]

71 When the projection onto a permutation-invariant set is unique, then equal entries of the argument cause equal entries in the projection: Remark 10 Let ∅ ≠ M ⊆ Rn be permutation-invariant and x ∈ Rn. [sent-1020, score-0.642]

72 (2008), when the connection between projections onto a simplex and onto an L1 ball was studied. [sent-1035, score-0.747]

73 Geometric Structures and First Considerations: The aim is to compute projections onto D, which is the intersection of the non-negative orthant R^n_≥0, the target hyperplane H and the target hypersphere K, see Section 2. [sent-1088, score-0.71]

74 With these definitions the intermediate goal is now to prove that projections onto D can be computed by alternating projections onto the geometric structures defined earlier. [sent-1108, score-0.851]

75 Further, ⟨m, h⟩ = λ1²/n for all h ∈ H, and thus ‖h − m‖_2² = ‖h‖_2² − λ1²/n. (L2 Norm Constraint—Target Hypersphere.) After the projection onto the target hyperplane H has been carried out, consider now the joint constraint of H and the target hypersphere K. [sent-1135, score-0.711]

76 From Lemma 13 and Lemma 17 it follows that projD(x) = projD(r) = projD(s); hence the next steps consist of projecting s onto C for finding the projection of x onto D. [sent-1214, score-0.697]

77 Projection onto a Face of a Simplex: The projection within H onto C yields zero entries in the working vector. [sent-1259, score-0.581]

78 It still remains to be shown that the projection onto D possesses zero entries at the same coordinates as the projection onto C. [sent-1260, score-0.984]

79 It describes the construction of the result of the projection onto a simplex face and poses a statement on its norm, which in turn is used to prove that the position of vanishing entries does not change upon projection. [sent-1264, score-0.694]

80 …, s^(h) ∈ Rn defined iteratively by s^(0) := q and s^(k) := s^(k−1) − s_{j_k}^(k−1) e_{j_k} + (1/(n−k)) s_{j_k}^(k−1) (e − ∑_{i=1}^{k} e_{j_i}) for k ∈ {1, … [sent-1278, score-0.587]

81 …, k − 1}, hence with induction it follows that s_{j_k}^(k−1) = ⟨e_{j_k}, s^(k−1)⟩ = ⟨e_{j_k}, s^(0)⟩ + ∑_{i=1}^{k−1} s_{j_i}^(i−1) ⟨e_{j_k}, a_i⟩ = q_{j_k} + ∑_{i=1}^{k−1} (1/(n−i)) s_{j_i}^(i−1) = (by the IH) q_{j_k} + ∑_{i=1}^{k−1} (1/(n−i)) q_{j_i} + ∑_{i=1}^{k−1} ∑_{µ=1}^{i−1} (1/((n−i)(n−i+1))) q_{j_µ} = q_{j_k} + ∑_{i=1}^{k−1} q_{j_i} (1/(n−i) + ∑_{µ=i+1}^{k−1} 1/((n−µ)(n−µ+1))). [sent-1388, score-1.467]

82 …, j_k}^C(j) = 0, therefore s_i^(k) − s_j^(k) = s_i^(k−1) − s_{j_k}^(k−1) χ_{{j_k}}(i) + (1/(n−k)) s_{j_k}^(k−1) χ_{{j_1, … [sent-1435, score-0.874]

83 …, j_k}^C}(i) − s_j^(k−1) + s_{j_k}^(k−1) χ_{{j_k}}(j) − (1/(n−k)) s_{j_k}^(k−1) χ_{{j_1, …

84 …, j_k}^C}(j) = s_i^(k−1) − s_j^(k−1) + s_{j_k}^(k−1) (1/(n−k) + χ_{{j_k}}(j)) ≥ 0, where s_i^(k−1) − s_j^(k−1) ≥ 0 holds by the induction hypothesis. [sent-1441, score-0.691]

85 Therefore, ∑_{i∈I} s_i^(k−1) + ∑_{i=1}^{k−1} s_{j_i}^(k−1) + s_{j_k}^(k−1) + ∑_{i=k+1}^{h} s_{j_i}^(k−1) ≥ ((n − h) + 1 + (h − k)) s_{j_k}^(k−1) = (n − k + 1) s_{j_k}^(k−1) holds, where s_i^(k−1) ≥ s_{j_k}^(k−1), and the claim follows because n − k + 1 > 0. [sent-1450, score-0.912]

86 Furthermore, ⟨s^(k−1), a_k⟩ = (1/(n−k)) (λ1 − ∑_{i=1}^{k−1} s_{j_i}^(k−1) − s_{j_k}^(k−1)) − s_{j_k}^(k−1) = (1/(n−k)) (λ1 − s_{j_k}^(k−1)) − s_{j_k}^(k−1), because all s_{j_i}^(k−1) vanish for i ∈ {1, … [sent-1452, score-0.77]

87 Application of Proposition 4 yields ‖s^(k)‖_2² − ‖s^(k−1)‖_2² = 2 s_{j_k}^(k−1) ⟨s^(k−1), a_k⟩ + (s_{j_k}^(k−1))² ‖a_k‖_2² = (2λ1/(n−k)) s_{j_k}^(k−1) − (s_{j_k}^(k−1))² (1 + 1/(n−k)) = (1 + 1/(n−k)) s_{j_k}^(k−1) (2λ1/(n−k) · (1 + 1/(n−k))^(−1) − s_{j_k}^(k−1)), where 2λ1/(n−k) · (1 + 1/(n−k))^(−1) = 2λ1/(n−k+1). [sent-1456, score-1.502]

88 …, j_h} with q_{j_k} ≠ 0, let k be minimal such that either k = 1 or q_{j_{k−1}} = 0, hence s_{j_k}^(k−1) = q_{j_k}. [sent-1462, score-0.613]

89 Figure 10: Sketch of the proof of Corollary 27: the projection of v onto D must be located on the simplex face C_I. [sent-1464, score-0.561]

90 It shows that the solution set with respect to the projection onto D is left unchanged, and that all solutions have zeros at the same positions as the projection onto C. [sent-1504, score-0.884]

91 For this, the simplex projection r is projected onto the target hypersphere, simultaneously keeping already vanished entries at zero, yielding a point u. [sent-1573, score-0.758]

92 Lemma 30 gives an explicit formulation of this projection and shows that the solution set with respect to the projection onto D stays the same. [sent-1574, score-0.663]

93 It is clear by Theorem 2 that the projection of any point onto D can be written as a finite composition of projections onto H, L, C and LI, respectively. [sent-1681, score-0.849]

94 Thus only the projection onto the simplex Ĉ demands attention. [sent-1705, score-0.561]

95 However, for every point where the projection onto C is not differentiable, a subtle change is sufficient to find a point where the projection is differentiable: Lemma 33 Consider the function p : Rn \ C → C, s ↦ projC(s), and let x ∈ Rn \ C be a point such that p is not differentiable at x. [sent-1741, score-0.721]

96 …, n} | e_i^T projC(x) ≠ 0} is the index set of nonzero entries of the projection onto C, d := |I|, u := ∑_{i∈I} e_i ∈ Rn and v := e − u ∈ Rn. [sent-1762, score-0.599]

97 Let N := dh = ‖π_{≥0}(x)‖_0 denote the number of nonzero entries in the projection onto D, and define s(i) := e^T r(i), … [sent-1782, score-0.572]

98 Proof The gradient of projections onto H is merely a special case of projections onto C, which also applies to the respective projections onto L and LI , see Lemma 34. [sent-1791, score-1.282]

99 However, when the target degree of sparseness is low, and thus the amount of nonzero entries N in the result of the projection is large, gradient computation can become very inefficient. [sent-1832, score-0.89]

100 A finite algorithm for finding the projection of a point onto the canonical simplex of Rn . [sent-2192, score-0.591]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('projd', 0.441), ('sparseness', 0.411), ('onto', 0.221), ('projection', 0.221), ('projections', 0.186), ('activity', 0.186), ('jk', 0.183), ('parse', 0.151), ('hoyer', 0.147), ('hom', 0.145), ('onnectivity', 0.139), ('palm', 0.125), ('projc', 0.119), ('simplex', 0.119), ('rn', 0.115), ('entries', 0.1), ('connectivity', 0.1), ('upervised', 0.099), ('projl', 0.095), ('wout', 0.095), ('ci', 0.083), ('mi', 0.079), ('projm', 0.078), ('transfer', 0.073), ('sparse', 0.073), ('projh', 0.073), ('si', 0.071), ('target', 0.067), ('soae', 0.067), ('theis', 0.067), ('qi', 0.061), ('hypersphere', 0.061), ('gradient', 0.061), ('differentiable', 0.058), ('barycenter', 0.056), ('synaptic', 0.056), ('thom', 0.056), ('sr', 0.056), ('en', 0.053), ('rc', 0.052), ('lemma', 0.05), ('sorted', 0.05), ('hypercircle', 0.05), ('sparsely', 0.049), ('pi', 0.049), ('earning', 0.048), ('module', 0.047), ('internal', 0.047), ('sc', 0.046), ('proposition', 0.046), ('diag', 0.046), ('projli', 0.045), ('capabilities', 0.042), ('ri', 0.042), ('classi', 0.041), ('lecun', 0.04), ('working', 0.039), ('rrt', 0.039), ('uut', 0.039), ('intersection', 0.039), ('symmetries', 0.039), ('carried', 0.039), ('populated', 0.038), ('layer', 0.038), ('ji', 0.038), ('alternating', 0.037), ('sgn', 0.036), ('gradients', 0.036), ('li', 0.036), ('remark', 0.036), ('hyperplane', 0.035), ('descending', 0.035), ('orthant', 0.034), ('hence', 0.034), ('yd', 0.033), ('vanishing', 0.033), ('claim', 0.033), ('ye', 0.033), ('ful', 0.033), ('norm', 0.032), ('dimensionality', 0.032), ('operator', 0.031), ('canonical', 0.03), ('nonzero', 0.03), ('projected', 0.03), ('jh', 0.03), ('neuronal', 0.029), ('qqt', 0.029), ('hidden', 0.029), ('rd', 0.028), ('esoae', 0.028), ('jittering', 0.028), ('proja', 0.028), ('units', 0.028), ('network', 0.028), ('ei', 0.027), ('mnist', 0.027), ('unrestricted', 0.027), ('membership', 0.026), ('variant', 0.026)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.9999997 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

Author: Markus Thom, Günther Palm

Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity

2 0.085897371 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

Author: Manavender R. Malgireddy, Ifeoma Nwogu, Venu Govindaraju

Abstract: We present language-motivated approaches to detecting, localizing and classifying activities and gestures in videos. In order to obtain statistical insight into the underlying patterns of motions in activities, we develop a dynamic, hierarchical Bayesian model which connects low-level visual features in videos with poses, motion patterns and classes of activities. This process is somewhat analogous to the method of detecting topics or categories from documents based on the word content of the documents, except that our documents are dynamic. The proposed generative model harnesses both the temporal ordering power of dynamic Bayesian networks such as hidden Markov models (HMMs) and the automatic clustering power of hierarchical Bayesian models such as the latent Dirichlet allocation (LDA) model. We also introduce a probabilistic framework for detecting and localizing pre-specified activities (or gestures) in a video sequence, analogous to the use of filler models for keyword detection in speech processing. We demonstrate the robustness of our classification model and our spotting framework by recognizing activities in unconstrained real-life video sequences and by spotting gestures via a one-shot-learning approach. Keywords: dynamic hierarchical Bayesian networks, topic models, activity recognition, gesture spotting, generative models

3 0.083615817 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

Author: Tingni Sun, Cun-Hui Zhang

Abstract: We propose a new method of learning a sparse nonnegative-definite target matrix. Our primary example of the target matrix is the inverse of a population covariance or correlation matrix. The algorithm first estimates each column of the target matrix by the scaled Lasso and then adjusts the matrix estimator to be symmetric. The penalty level of the scaled Lasso for each column is completely determined by data via convex minimization, without using cross-validation. We prove that this scaled Lasso method guarantees the fastest proven rate of convergence in the spectrum norm under conditions of weaker form than those in the existing analyses of other ℓ1 regularized algorithms, and has faster guaranteed rate of convergence when the ratio of the ℓ1 and spectrum norms of the target inverse matrix diverges to infinity. A simulation study demonstrates the computational feasibility and superb performance of the proposed method. Our analysis also provides new performance bounds for the Lasso and scaled Lasso to guarantee higher concentration of the error at a smaller threshold level than previous analyses, and to allow the use of the union bound in column-by-column applications of the scaled Lasso without an adjustment of the penalty level. In addition, the least squares estimation after the scaled Lasso selection is considered and proven to guarantee performance bounds similar to that of the scaled Lasso. Keywords: precision matrix, concentration matrix, inverse matrix, graphical model, scaled Lasso, linear regression, spectrum norm

4 0.073758572 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

5 0.06394583 90 jmlr-2013-Quasi-Newton Method: A New Direction

Author: Philipp Hennig, Martin Kiefel

Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes

6 0.059246976 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

7 0.053897981 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

8 0.05331634 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

9 0.053288102 96 jmlr-2013-Regularization-Free Principal Curve Estimation

10 0.052724954 82 jmlr-2013-Optimally Fuzzy Temporal Memory

11 0.050822191 111 jmlr-2013-Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

12 0.049174968 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

13 0.04900923 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

14 0.048507821 99 jmlr-2013-Semi-Supervised Learning Using Greedy Max-Cut

15 0.046193648 106 jmlr-2013-Stationary-Sparse Causality Network Learning

16 0.045848977 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

17 0.045539644 56 jmlr-2013-Keep It Simple And Sparse: Real-Time Action Recognition

18 0.044729952 76 jmlr-2013-Nonparametric Sparsity and Regularization

19 0.044195026 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

20 0.04396648 47 jmlr-2013-Gaussian Kullback-Leibler Approximate Inference


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.227), (1, 0.049), (2, -0.057), (3, -0.024), (4, 0.072), (5, -0.027), (6, 0.01), (7, -0.044), (8, 0.12), (9, 0.021), (10, 0.008), (11, -0.031), (12, -0.117), (13, 0.024), (14, 0.054), (15, 0.01), (16, -0.054), (17, -0.04), (18, -0.053), (19, 0.009), (20, 0.147), (21, 0.035), (22, -0.021), (23, 0.116), (24, 0.048), (25, -0.327), (26, -0.093), (27, 0.041), (28, 0.077), (29, 0.04), (30, -0.132), (31, -0.284), (32, 0.136), (33, -0.025), (34, 0.008), (35, -0.03), (36, -0.029), (37, 0.025), (38, -0.109), (39, -0.168), (40, -0.154), (41, 0.046), (42, 0.02), (43, -0.14), (44, 0.008), (45, 0.013), (46, 0.015), (47, 0.03), (48, 0.056), (49, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.94897968 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

Author: Markus Thom, Günther Palm

Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity

2 0.53309351 82 jmlr-2013-Optimally Fuzzy Temporal Memory

Author: Karthik H. Shankar, Marc W. Howard

Abstract: Any learner with the ability to predict the future of a structured time-varying signal must maintain a memory of the recent past. If the signal has a characteristic timescale relevant to future prediction, the memory can be a simple shift register—a moving window extending into the past, requiring storage resources that linearly grows with the timescale to be represented. However, an independent general purpose learner cannot a priori know the characteristic prediction-relevant timescale of the signal. Moreover, many naturally occurring signals show scale-free long range correlations implying that the natural prediction-relevant timescale is essentially unbounded. Hence the learner should maintain information from the longest possible timescale allowed by resource availability. Here we construct a fuzzy memory system that optimally sacrifices the temporal accuracy of information in a scale-free fashion in order to represent prediction-relevant information from exponentially long timescales. Using several illustrative examples, we demonstrate the advantage of the fuzzy memory system over a shift register in time series forecasting of natural signals. When the available storage resources are limited, we suggest that a general purpose learner would be better off committing to such a fuzzy memory system. Keywords: temporal information compression, forecasting long range correlated time series

3 0.43482569 100 jmlr-2013-Similarity-based Clustering by Left-Stochastic Matrix Factorization

Author: Raman Arora, Maya R. Gupta, Amol Kapila, Maryam Fazel

Abstract: For similarity-based clustering, we propose modeling the entries of a given similarity matrix as the inner products of the unknown cluster probabilities. To estimate the cluster probabilities from the given similarity matrix, we introduce a left-stochastic non-negative matrix factorization problem. A rotation-based algorithm is proposed for the matrix factorization. Conditions for unique matrix factorizations and clusterings are given, and an error bound is provided. The algorithm is particularly efficient for the case of two clusters, which motivates a hierarchical variant for cases where the number of desired clusters is large. Experiments show that the proposed left-stochastic decomposition clustering model produces relatively high within-cluster similarity on most data sets and can match given class labels, and that the efficient hierarchical variant performs surprisingly well. Keywords: clustering, non-negative matrix factorization, rotation, indefinite kernel, similarity, completely positive

4 0.43181416 104 jmlr-2013-Sparse Single-Index Model

Author: Pierre Alquier, Gérard Biau

Abstract: Let (X,Y ) be a random pair taking values in R p × R. In the so-called single-index model, one has Y = f ⋆ (θ⋆T X) +W , where f ⋆ is an unknown univariate measurable function, θ⋆ is an unknown vector in Rd , and W denotes a random noise satisfying E[W |X] = 0. The single-index model is known to offer a flexible way to model a variety of high-dimensional real-world phenomena. However, despite its relative simplicity, this dimension reduction scheme is faced with severe complications as soon as the underlying dimension becomes larger than the number of observations (“p larger than n” paradigm). To circumvent this difficulty, we consider the single-index model estimation problem from a sparsity perspective using a PAC-Bayesian approach. On the theoretical side, we offer a sharp oracle inequality, which is more powerful than the best known oracle inequalities for other common procedures of single-index recovery. The proposed method is implemented by means of the reversible jump Markov chain Monte Carlo technique and its performance is compared with that of standard procedures. Keywords: single-index model, sparsity, regression estimation, PAC-Bayesian, oracle inequality, reversible jump Markov chain Monte Carlo method

5 0.37838331 90 jmlr-2013-Quasi-Newton Method: A New Direction

Author: Philipp Hennig, Martin Kiefel

Abstract: Four decades after their invention, quasi-Newton methods are still state of the art in unconstrained numerical optimization. Although not usually interpreted thus, these are learning algorithms that fit a local quadratic approximation to the objective function. We show that many, including the most popular, quasi-Newton methods can be interpreted as approximations of Bayesian linear regression under varying prior assumptions. This new notion elucidates some shortcomings of classical algorithms, and lights the way to a novel nonparametric quasi-Newton method, which is able to make more efficient use of available information at computational cost similar to its predecessors. Keywords: optimization, numerical analysis, probability, Gaussian processes

6 0.35181782 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

7 0.34545359 58 jmlr-2013-Language-Motivated Approaches to Action Recognition

8 0.34081411 106 jmlr-2013-Stationary-Sparse Causality Network Learning

9 0.33312443 71 jmlr-2013-Message-Passing Algorithms for Quadratic Minimization

10 0.32580584 119 jmlr-2013-Variable Selection in High-Dimension with Random Designs and Orthogonal Matching Pursuit

11 0.31980011 72 jmlr-2013-Multi-Stage Multi-Task Feature Learning

12 0.30686653 20 jmlr-2013-CODA: High Dimensional Copula Discriminant Analysis

13 0.28955719 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

14 0.28512478 115 jmlr-2013-Training Energy-Based Models for Time-Series Imputation

15 0.28132725 41 jmlr-2013-Experiment Selection for Causal Discovery

16 0.27824205 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

17 0.27696341 70 jmlr-2013-Maximum Volume Clustering: A New Discriminative Clustering Approach

18 0.27035111 15 jmlr-2013-Bayesian Canonical Correlation Analysis

19 0.26324648 118 jmlr-2013-Using Symmetry and Evolutionary Search to Minimize Sorting Networks

20 0.25886473 77 jmlr-2013-On the Convergence of Maximum Variance Unfolding


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(0, 0.026), (5, 0.116), (6, 0.039), (10, 0.084), (14, 0.015), (20, 0.034), (23, 0.042), (68, 0.034), (70, 0.02), (75, 0.058), (85, 0.025), (87, 0.025), (89, 0.388)]

similar papers list:

simIndex simValue paperId paperTitle

1 0.91314107 12 jmlr-2013-Alleviating Naive Bayes Attribute Independence Assumption by Attribute Weighting

Author: Nayyar A. Zaidi, Jesús Cerquides, Mark J. Carman, Geoffrey I. Webb

Abstract: Despite the simplicity of the Naive Bayes classifier, it has continued to perform well against more sophisticated newcomers and has remained, therefore, of great interest to the machine learning community. Of numerous approaches to refining the naive Bayes classifier, attribute weighting has received less attention than it warrants. Most approaches, perhaps influenced by attribute weighting in other machine learning algorithms, use weighting to place more emphasis on highly predictive attributes than those that are less predictive. In this paper, we argue that for naive Bayes attribute weighting should instead be used to alleviate the conditional independence assumption. Based on this premise, we propose a weighted naive Bayes algorithm, called WANBIA, that selects weights to minimize either the negative conditional log likelihood or the mean squared error objective functions. We perform extensive evaluations and find that WANBIA is a competitive alternative to state of the art classifiers like Random Forest, Logistic Regression and A1DE. Keywords: classification, naive Bayes, attribute independence assumption, weighted naive Bayes classification

2 0.85337949 63 jmlr-2013-Learning Trees from Strings: A Strong Learning Algorithm for some Context-Free Grammars

Author: Alexander Clark

Abstract: Standard models of language learning are concerned with weak learning: the learner, receiving as input only information about the strings in the language, must learn to generalise and to generate the correct, potentially infinite, set of strings generated by some target grammar. Here we define the corresponding notion of strong learning: the learner, again only receiving strings as input, must learn a grammar that generates the correct set of structures or parse trees. We formalise this using a modification of Gold’s identification in the limit model, requiring convergence to a grammar that is isomorphic to the target grammar. We take as our starting point a simple learning algorithm for substitutable context-free languages, based on principles of distributional learning, and modify it so that it will converge to a canonical grammar for each language. We prove a corresponding strong learning result for a subclass of context-free grammars. Keywords: context-free grammars, grammatical inference, identification in the limit, structure learning

same-paper 3 0.78005302 101 jmlr-2013-Sparse Activity and Sparse Connectivity in Supervised Learning

Author: Markus Thom, Günther Palm

Abstract: Sparseness is a useful regularizer for learning in a wide range of applications, in particular in neural networks. This paper proposes a model targeted at classification tasks, where sparse activity and sparse connectivity are used to enhance classification capabilities. The tool for achieving this is a sparseness-enforcing projection operator which finds the closest vector with a pre-defined sparseness for any given vector. In the theoretical part of this paper, a comprehensive theory for such a projection is developed. In conclusion, it is shown that the projection is differentiable almost everywhere and can thus be implemented as a smooth neuronal transfer function. The entire model can hence be tuned end-to-end using gradient-based methods. Experiments on the MNIST database of handwritten digits show that classification performance can be boosted by sparse activity or sparse connectivity. With a combination of both, performance can be significantly better compared to classical non-sparse approaches. Keywords: supervised learning, sparseness projection, sparse activity, sparse connectivity

4 0.41944891 51 jmlr-2013-Greedy Sparsity-Constrained Optimization

Author: Sohail Bahmani, Bhiksha Raj, Petros T. Boufounos

Abstract: Sparsity-constrained optimization has wide applicability in machine learning, statistics, and signal processing problems such as feature selection and Compressed Sensing. A vast body of work has studied the sparsity-constrained optimization from theoretical, algorithmic, and application aspects in the context of sparse estimation in linear models where the fidelity of the estimate is measured by the squared error. In contrast, relatively less effort has been made in the study of sparsityconstrained optimization in cases where nonlinear models are involved or the cost function is not quadratic. In this paper we propose a greedy algorithm, Gradient Support Pursuit (GraSP), to approximate sparse minima of cost functions of arbitrary form. Should a cost function have a Stable Restricted Hessian (SRH) or a Stable Restricted Linearization (SRL), both of which are introduced in this paper, our algorithm is guaranteed to produce a sparse vector within a bounded distance from the true sparse optimum. Our approach generalizes known results for quadratic cost functions that arise in sparse linear regression and Compressed Sensing. We also evaluate the performance of GraSP through numerical simulations on synthetic and real data, where the algorithm is employed for sparse logistic regression with and without ℓ2 -regularization. Keywords: sparsity, optimization, compressed sensing, greedy algorithm

5 0.41847649 59 jmlr-2013-Large-scale SVD and Manifold Learning

Author: Ameet Talwalkar, Sanjiv Kumar, Mehryar Mohri, Henry Rowley

Abstract: This paper examines the efficacy of sampling-based low-rank approximation techniques when applied to large dense kernel matrices. We analyze two common approximate singular value decomposition techniques, namely the Nystr¨ m and Column sampling methods. We present a theoretical o comparison between these two methods, provide novel insights regarding their suitability for various tasks and present experimental results that support our theory. Our results illustrate the relative strengths of each method. We next examine the performance of these two techniques on the largescale task of extracting low-dimensional manifold structure given millions of high-dimensional face images. We address the computational challenges of non-linear dimensionality reduction via Isomap and Laplacian Eigenmaps, using a graph containing about 18 million nodes and 65 million edges. We present extensive experiments on learning low-dimensional embeddings for two large face data sets: CMU-PIE (35 thousand faces) and a web data set (18 million faces). Our comparisons show that the Nystr¨ m approximation is superior to the Column sampling method for this o task. Furthermore, approximate Isomap tends to perform better than Laplacian Eigenmaps on both clustering and classification with the labeled CMU-PIE data set. Keywords: low-rank approximation, manifold learning, large-scale matrix factorization

6 0.41646707 52 jmlr-2013-How to Solve Classification and Regression Problems on High-Dimensional Data with a Supervised Extension of Slow Feature Analysis

7 0.41418004 4 jmlr-2013-A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion

8 0.40814057 5 jmlr-2013-A Near-Optimal Algorithm for Differentially-Private Principal Components

9 0.40153986 34 jmlr-2013-Distance Preserving Embeddings for General n-Dimensional Manifolds

10 0.40105122 74 jmlr-2013-Multivariate Convex Regression with Adaptive Partitioning

11 0.40067267 105 jmlr-2013-Sparsity Regret Bounds for Individual Sequences in Online Linear Regression

12 0.39777246 17 jmlr-2013-Belief Propagation for Continuous State Spaces: Stochastic Message-Passing with Quantitative Guarantees

13 0.3950386 14 jmlr-2013-Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

14 0.39418405 102 jmlr-2013-Sparse Matrix Inversion with Scaled Lasso

15 0.38957968 28 jmlr-2013-Construction of Approximation Spaces for Reinforcement Learning

16 0.38880286 33 jmlr-2013-Dimension Independent Similarity Computation

17 0.38871303 64 jmlr-2013-Lovasz theta function, SVMs and Finding Dense Subgraphs

18 0.3881785 82 jmlr-2013-Optimally Fuzzy Temporal Memory

19 0.38782135 38 jmlr-2013-Dynamic Affine-Invariant Shape-Appearance Handshape Features and Classification in Sign Language Videos

20 0.38743204 48 jmlr-2013-Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation