jmlr jmlr2009 jmlr2009-90 knowledge-graph by maker-knowledge-mining

90 jmlr-2009-Structure Spaces


Source: pdf

Author: Brijnesh J. Jain, Klaus Obermayer

Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. [sent-7, score-0.326]

2 In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. [sent-8, score-0.317]

3 Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. [sent-9, score-0.619]

4 An example of such a distance function is the edit distance on strings, trees, or graphs (Levenshtein, 1966; Sanfeliu and Fu, 1983; Shapiro and Haralick, 1985; Shasha and Zhang, 1989; Zhang, 1996). [sent-17, score-0.294]

5 Because distance spaces (X , d) of structures often have less mathematical structure than Banach spaces, several standard statistical pattern recognition techniques cannot be easily applied to (X , d). [sent-21, score-0.317]

6 But distance matrices of a finite set of combinatorial structures are often not of negative type and therefore an isometric embedding into a Hilbert space or Euclidean space is not possible. [sent-44, score-0.285]

7 In this contribution, we present a theoretical framework that isometrically and isostructurally embeds certain metric spaces (X , d) of combinatorial structures into a quotient space (X ′ , d ′ ) of a Euclidean vector space. [sent-56, score-0.408]

8 The proposed approach maps combinatorial structures to equivalence classes of vectors, where the elements of the same equivalence class are different vector representations of the same structure. [sent-64, score-0.302]

9 We show that the gradient of a smooth function on structures satisfies the necessary condition of optimality and is a well-defined structure pointing in the direction of steepest ascent. [sent-73, score-0.318]

10 We show that the proposed cost functions are locally Lipschitz and therefore nonsmooth only on a set of Lebesgue measure zero. [sent-75, score-0.38]

11 In line with this formulation, we define a sample mean of D as a global minimum of the cost function F(X) = ∑_{i=1}^{k} D(X, X_i)^2 (Equation 1), where D is some appropriate distance function on G[R] that measures the structurally consistent and inconsistent parts of the graphs under consideration. [sent-109, score-0.28]
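
A minimal sketch of evaluating this cost for a candidate mean; `graph_distance` is a hypothetical stand-in for whichever structural distance D is actually used, not part of the paper's code.

```python
def sample_mean_cost(X, sample, graph_distance):
    """Evaluate F(X) = sum_i D(X, X_i)^2 for a candidate sample mean X.

    X              : candidate structure, e.g., a weighted adjacency matrix
    sample         : list of structures X_1, ..., X_k
    graph_distance : callable D(X, Y) returning a structural distance
    """
    return sum(graph_distance(X, Xi) ** 2 for Xi in sample)
```

A sample mean is then any global minimizer of this function, which, as the next sentences note, is in general neither convex nor differentiable.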

12 Here, we first consider distance functions on structures that generalize the Euclidean metric, because Euclidean spaces have a rich repository of analytical tools. [sent-111, score-0.274]

13 This definition implies that (i) the cost function F is neither differentiable nor convex; (ii) the sample mean of graphs is not unique as shown in Figure 2; and (iii) determining a sample mean of graphs is NP-complete, because evaluation of D is NP-complete. [sent-115, score-0.432]

14 As we will see later, a distance function is well-behaved if it is locally Lipschitz. [sent-128, score-0.285]

15 For practical issues, it is important to note that the restriction to structures of bounded order n and the alignment of structures are purely technical assumptions that simplify the mathematics. [sent-135, score-0.288]

16 We consider a more general setting in the sense that we include classes of finite structures other than directed graphs, with attributes from arbitrary vector spaces rather than weights from R. [sent-181, score-0.351]

17 In a similar way, we can define further types of graphs such as, for example, trees, directed acyclic graphs, complete graphs, and regular graphs as 2-structures by specifying the corresponding properties on R . [sent-207, score-0.317]

18 Any inner product space X is a normed space with norm ‖x‖ = √⟨x, x⟩ and a metric space with metric d(x, y) = ‖x − y‖. [sent-305, score-0.355]

19 A T -function is locally Lipschitz if, and only if, its representation function is locally Lipschitz. [sent-330, score-0.503]

20 We refer to Appendix C for basic definitions and properties from nonsmooth analysis of locally Lipschitz functions. [sent-331, score-0.344]

21 Suppose that F is a locally Lipschitz function with representation function f . [sent-332, score-0.277]

22 By Proposition 21, the cost function F is also locally Lipschitz, and we can rewrite (P1) as the equivalent optimization problem (P2): minimize f : X → R, x ↦ ∑_{i=1}^{k} f_i(x), subject to x ∈ U, where the component functions f_i are the representation functions of F_i and U ⊆ X is the feasible set with µ(U ) = UT. [sent-344, score-0.485]
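
Written out in display form, a reconstruction of the program (P2) from the sentence above, with U_T denoting the feasible subset of the T-space as in the surrounding text:

```latex
\text{(P2)}\qquad
\begin{aligned}
\underset{x}{\text{minimize}}\quad & f(x) \;=\; \sum_{i=1}^{k} f_i(x)\\[0.5ex]
\text{subject to}\quad & x \in U \subseteq \mathcal{X},
\qquad \mu(U) = \mathcal{U}_T .
\end{aligned}
```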

23 Hence, f is the representation function of the locally Lipschitz T -function F and therefore also locally Lipschitz. [sent-345, score-0.503]

24 To minimize locally Lipschitz functions, the field of nonsmooth optimization offers a number of techniques. [sent-346, score-0.344]

25 As an example, we describe subgradient methods, which are easy to implement and well-suited to identify the difficulties arising in nonsmooth optimization. [sent-348, score-0.328]

26 In the direction-finding step, we generate a descent direction by exploiting the fact that the direction opposite to the gradient is locally the direction of steepest descent. [sent-355, score-0.456]

27 The generalized gradient coincides with the gradient at differentiable points and is a convex set of subgradients at non-differentiable points. [sent-360, score-0.298]

28 The basic idea of subgradient methods is to generalize the methods for smooth problems by replacing the gradient by an arbitrary subgradient. [sent-363, score-0.319]

29 If f is differentiable at x, then the subgradient d coincides with the gradient ∇ f (x). [sent-365, score-0.391]

30 At non-differentiable points, however, an arbitrary subgradient provides no information about whether zero is contained in the generalized gradient ∂ f (x). [sent-376, score-0.275]

31 An approximated subgradient corresponds to a direction that is no longer a subgradient of the cost function. [sent-390, score-0.493]

32 We call Algorithm 1 an approximate incremental subgradient method if the direction-finding step produces directions that are not necessarily subgradients of the corresponding component function fi . [sent-392, score-0.385]
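
A minimal sketch of the incremental scheme described in the sentences above. The direction oracle `approx_subgradient`, the projection `project`, and the diminishing step sizes are illustrative assumptions standing in for the details of Algorithm 1.

```python
import numpy as np

def incremental_subgradient(x0, components, approx_subgradient,
                            project, n_epochs=50, step0=1.0):
    """Minimize f(x) = sum_i f_i(x) by cycling through the components
    and stepping opposite an (approximate) subgradient of each f_i.

    x0                 : initial point (numpy array)
    components         : list of component functions f_i
    approx_subgradient : callable(f_i, x) -> a direction in (or near) the
                         generalized gradient of f_i at x
    project            : callable mapping a point back onto the feasible set U
    """
    x = x0.copy()
    for epoch in range(n_epochs):
        step = step0 / (1.0 + epoch)          # diminishing step sizes
        for fi in components:
            g = approx_subgradient(fi, x)     # may only approximate a subgradient
            x = project(x - step * g)         # move opposite the direction found
    return x
```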

33 In a computer simulation, we show that determining a sample mean of weighted graphs is indeed possible when using approximate subgradient methods. [sent-394, score-0.272]

34 Suppose that XT is the T -space of simple weighted graphs over X = Rn×n , and let UT ⊆ XT be the subset of weighted graphs with attributes from the interval [0, 1]. [sent-395, score-0.28]

35 Given a representation x of X, the computationally intractable task is to find a representation xi of Xi such that (x, xi) ∈ supp(d²Xi |x). [sent-400, score-0.289]

36 An approximated subgradient corresponds to a direction that is no longer a subgradient of the cost function. [sent-404, score-0.493]

37 We call Algorithm 1 an approximate incremental subgradient method if the direction-finding step produces directions that are not necessarily subgradients of the corresponding component function fi . [sent-406, score-0.385]

38 (A Generic Approach: Learning in Distance Spaces) Without loss of generality, we may assume that (XT , D) is a distance space, where D is either the metric D∗ induced by the Euclidean metric on X or another (not necessarily metric) distance function that is more appropriate for the problem at hand. [sent-413, score-0.346]

39 Since XT is a metric space over a Euclidean vector space, we can apply subgradient methods or other techniques from nonsmooth optimization to minimize locally Lipschitz T -functions on XT . [sent-420, score-0.668]

40 If the cost function F depends on a distance measure D, we demand that D is locally Lipschitz to ensure the locally Lipschitz property for F. [sent-421, score-0.547]

41 Examples include visualizing or comparing two populations of chemical graphs, and central clustering of structures (Section 4. [sent-424, score-0.338]

42 If D is locally Lipschitz, then F is also locally Lipschitz by Prop. [sent-430, score-0.452]

43 If the distortion measure is locally Lipschitz, then F as a function of the cluster centers Y j is locally Lipschitz by Prop. [sent-449, score-0.597]

44 A number of central clustering algorithms for graphs have been devised recently (Gold, Rangarajan, and Mjolsness, 1996; Günter and Bunke, 2002; Lozano and Escolano, 2003; Jain and Wysotzki, 2004; Bonev, Escolano, Lozano, Suau, Cazorla, and Aguilar, 2007). [sent-451, score-0.275]

45 The first term of f is smooth and convex (and therefore locally Lipschitz). [sent-469, score-0.27]

46 The locally Lipschitz property and convexity of the second term follow from the rules of calculus for locally Lipschitz functions (see Section C) and Prop. [sent-470, score-0.452]

47 The function f is locally Lipschitz if ℓ and h (as a function of θh ) are locally Lipschitz. [sent-487, score-0.452]

48 As an example of a locally Lipschitz function f, we extend supervised neural learning machines to k-structures. Loss function: the loss ℓ(x, y) = D∗(µ(x), µ(y))² is locally Lipschitz as a function of x. [sent-488, score-0.452]

49 For each component gi, the pointwise maximizer hi(x) = max_{T ∈ T} gi(T x) is locally Lipschitz. [sent-495, score-0.391]

50 Sample Mean: To assess the performance and to investigate the behavior of the subgradient and approximated subgradient methods for determining a sample mean, we conducted experiments on random graphs, letter graphs, and chemical graphs. [sent-533, score-0.613]

51 We sampled k graphs by distorting a given initial graph according to the following scheme: First, we randomly generated an initial graph M0 with 6 vertices and edge density 0. [sent-540, score-0.373]

52 Finally, the vertices of the distorted graphs were randomly permuted. [sent-565, score-0.293]
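
A sketch of one way to reproduce this sampling scheme for weighted graphs represented as adjacency matrices. The edge density, noise level, and attribute range below are illustrative choices, not the exact values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_weighted_graph(n=6, density=0.5):
    """Symmetric weighted adjacency matrix with the given edge density."""
    mask = np.triu(rng.random((n, n)) < density, k=1)
    weights = np.triu(rng.random((n, n)), k=1) * mask
    return weights + weights.T

def distort(adjacency, noise=0.1):
    """Perturb the edge weights and randomly permute the vertices."""
    n = adjacency.shape[0]
    noisy = np.clip(adjacency + noise * rng.standard_normal((n, n)), 0.0, 1.0)
    noisy = np.triu(noisy, k=1)
    noisy = noisy + noisy.T                    # keep the matrix symmetric
    perm = rng.permutation(n)
    return noisy[np.ix_(perm, perm)]           # relabel the vertices

# Sample k distorted copies of an initial graph M0.
M0 = random_weighted_graph()
sample = [distort(M0) for _ in range(10)]
```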

53 The graphs represent distorted letter drawings from the Roman alphabet that consist of straight lines only. [sent-570, score-0.292]

54 We considered the 150 letter graphs representing the capital letter A at a medium distortion level. [sent-574, score-0.434]

55 We generated 100 samples, each consisting of k = 10 letter graphs drawn from a uniform distribution over the data set of 150 graph letters representing letter A at a medium distortion level. [sent-575, score-0.496]

56 We generated 100 samples each consisting of k = 10 chemical graphs drawn from a uniform distribution over the data set of 340 chemical graphs. [sent-582, score-0.31]

57 Compared with the set median, the results indicate that the subgradient and approximated subgradient methods found reasonable solutions, in the sense that the resulting average SSD is lower. [sent-596, score-0.42]

58 This method operates like the EM algorithm of standard k-means, where the chosen distortion measure D is used in the E-step to assign the structures Xi to the nearest cluster center Yj. [sent-605, score-0.263]

59 The structural version of simple competitive learning corresponds to the basic incremental subgradient method described in Algorithm 1 for minimizing the cluster objective F(X). [sent-608, score-0.322]
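
A sketch of the EM-style variant described in the two sentences above, written for structures represented as matrices. `graph_distance` and `align_to` (which returns a representation of X_i aligned to the current center) are hypothetical helpers standing in for the graph-matching step.

```python
import numpy as np

def structural_kmeans(sample, centers, graph_distance, align_to, n_iter=20):
    """EM-style central clustering of structures.

    E-step: assign each structure X_i to the nearest center Y_j under D.
    M-step: move each center to the mean of its members after alignment.

    graph_distance(X, Y) -> structural distance D(X, Y)
    align_to(X, Y)       -> matrix representation of X aligned to Y
    """
    centers = [Y.copy() for Y in centers]
    labels = [0] * len(sample)
    for _ in range(n_iter):
        # E-step: nearest-center assignment under the structural distance.
        labels = [int(np.argmin([graph_distance(X, Y) for Y in centers]))
                  for X in sample]
        # M-step: average the aligned members of each cluster.
        for j, Y in enumerate(centers):
            members = [align_to(X, Y) for X, l in zip(sample, labels) if l == j]
            if members:
                centers[j] = np.mean(members, axis=0)
    return centers, labels
```

The competitive-learning variant replaces the batch M-step by an incremental step toward the aligned member, which is exactly the incremental subgradient update of the squared-distance objective.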

60 We consider all 750 graphs from the test data set representing distorted letter drawings from the Roman alphabet that consist of straight lines only (A, E, F, H, I, K, L, M, N, T, V, W, X, Y, Z). [sent-639, score-0.292]

61 Figure 6: Example of letter drawings: Prototype of letter A and distorted copies generated by imposing low, medium, and high distortion (from left to right) on prototype A. [sent-646, score-0.338]

62 We describe molecules by graphs in the usual way: atoms are represented by vertices labeled with the atom type of the corresponding atom, and bonds between atoms are represented by edges labeled with the valence of the corresponding bonds. [sent-677, score-0.435]

63 [Fragment of a results table: error, accuracy, and silhouette measures for the molecules, fingerprint, and grec data sets.] [sent-710, score-0.284]

64 For subgradient and graph distance calculations, we applied a depth-first search algorithm on the letter data set and the graduated assignment algorithm (Gold and Rangarajan, 1996) on the grec, fingerprint, and molecule data sets. [sent-719, score-0.516]

65 This result shows that approximated subgradient methods can be applied to central clustering in the domain of graphs. [sent-726, score-0.345]

66 Clustering Protein Structures: In our last experiment, we compared the performance of k-means and simple competitive learning of graphs with hierarchical clustering applied to protein structures. [sent-728, score-0.283]

67 The chosen distance measure D, for both the letter graphs and the contact maps, is the minimizer of the standard Euclidean metric. [sent-751, score-0.46]
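
As a concrete illustration of such a minimizer, the sketch below takes the minimum of the Euclidean (Frobenius) metric over all vertex relabelings of one graph. This brute-force form is an assumption made only for clarity; it is exponential in the graph order, and the experiments approximate this step with depth-first search or graduated assignment instead.

```python
import itertools
import numpy as np

def min_euclidean_graph_distance(A, B):
    """D(A, B) = min over vertex permutations P of ||A - P B P^T||_F.

    A, B : weighted adjacency matrices of the same (small!) order n.
    Enumerates all n! permutations, so it is only feasible for tiny graphs.
    """
    n = A.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(n)):
        B_perm = B[np.ix_(perm, perm)]          # relabel the vertices of B
        best = min(best, np.linalg.norm(A - B_perm))
    return best
```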

68 For subgradient and distance calculations, we used a combination of graduated assignment and dynamic programming (Jain and Lappe, 2007). [sent-762, score-0.346]

69 One problem of central clustering algorithms applied to contact maps is the increasing size of the cluster centers caused by the updating step. [sent-770, score-0.313]

70 A solution to this problem is to restrict the vertices of a cluster center to those vertices that occur in at least one cluster member. [sent-771, score-0.352]

71 This construction turns out to be a convenient abstraction of combinatorial structures for formally adopting geometrical and analytical concepts from vector spaces. [sent-787, score-0.298]

72 The metric of the representation space X induces a metric on the T -space. [sent-788, score-0.279]

73 The field of nonsmooth optimization provides techniques like the subgradient method to minimize this class of nonsmooth functions. [sent-801, score-0.446]

74 The cost functions are locally Lipschitz, but computation of a subgradient is computationally intractable. [sent-803, score-0.472]

75 To cope with the computational complexity, we suggested an approximate subgradient method that chooses, as the descent direction, the opposite of a direction close to the generalized gradient. [sent-804, score-0.312]

76 Even so, the high computational complexity of deriving a subgradient demands a reevaluation of existing nonsmooth optimization methods and calls for devising algorithms that use approximations of the generalized gradient. [sent-806, score-0.328]

77 As a normed vector space, X is a metric space with metric d(x, y) = ‖x − y‖ for all x, y ∈ X . [sent-967, score-0.279]

78 Any inner product space (X , ⟨·, ·⟩) is a metric space with metric d(x, y) = ‖x − y‖. [sent-1028, score-0.304]

79 Basic definitions and results on nonsmooth analysis of locally Lipschitz mappings are given in Appendix C. [sent-1050, score-0.344]

80 If f is continuous (Lipschitz, locally Lipschitz), then f lifts to a unique continuous (Lipschitz, locally Lipschitz) mapping F : XT → Y with f (x) = F(µ(x)) for all x ∈ X . [sent-1091, score-0.452]

81 We have: if all fi ∈ supp(f) are locally Lipschitz at x, then f is locally Lipschitz at x and ∂f(x) ⊆ con{∂fi(x) : fi ∈ supp(f) ∧ fi(x) = f(x)}. [sent-1185, score-0.796]

82 • If all fi ∈ supp( f ) are smooth at x, then f is regular at x and ∂ f (x) = con {∇ fi (x) : fi ∈ supp( f ) ∧ fi (x) = f (x)} . [sent-1187, score-0.388]

83 If all support functions of f ∗ are locally Lipschitz, then f ∗ is also locally Lipschitz and admits a generalized gradient at any point from U . [sent-1193, score-0.3]

84 The function sY represents S∗ (·,Y ) and is a pointwise maximizer with support supp (sY ) = {sy : sy (·) = s (·, y), y ∈ Y }. [sent-1206, score-0.4]

85 If the support functions of sY are locally Lipschitz, regular, or smooth, we can apply Theorem 18 to show that sY is locally Lipschitz, admits a generalized gradient at each point of its domain, and is differentiable almost everywhere. [sent-1207, score-0.633]

86 For a fixed structure Y ∈ XT , the support of the pointwise maximizer sY is of the form supp(sY) = {sy : sy(·) = ⟨·, y⟩, y ∈ Y }. As linear functions, these support functions sy are continuously differentiable. [sent-1212, score-0.413]

87 From Theorem 18 it follows that sY is locally Lipschitz and regular, and that the generalized gradient ∂sY(x) is the convex set ∂sY(x) = con{y ∈ Y : (x, y) ∈ supp(sY |x)}. [sent-1213, score-0.291]

88 The function dY represents D∗ (·,Y ) and is a pointwise maximizer with support supp(dY) = {dy : dy(·) = −d(·, y), y ∈ Y }. [sent-1225, score-0.59]

89 The function dY represents D∗ (·,Y ) and is a pointwise maximizer with support supp(dY) = {dy : y ∈ Y }, where dy(x) = −‖x − y‖ for all x ∈ X . [sent-1230, score-0.433]

90 The support functions of dY are locally Lipschitz and regular on X , and smooth on X \ {y}. [sent-1231, score-0.27]

91 From Theorem 18 it follows that dY is locally Lipschitz and regular, and that the generalized gradient ∂dY(x) is the convex set ∂dY(x) = con{−(x − y)/‖x − y‖ : y ∈ Y ∧ (y, x) ∈ supp(dY |x)} for x ≠ y, and ∂dY(x) = {y ∈ XT : ‖y‖ ≤ 1} for x = y. [sent-1232, score-0.291]

92 In particular, d²Y is locally Lipschitz and regular, and the generalized gradient ∂d²Y(x) is the convex set ∂d²Y(x) = con{−2(x − y) : y ∈ Y ∧ (y, x) ∈ supp(d²Y |x)}. [sent-1235, score-0.402]
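
For readability, the two generalized gradients in the two items above, rewritten in one place; this is a reconstruction of the garbled display, not new material:

```latex
\partial d_Y(x) =
\begin{cases}
\operatorname{con}\!\left\{ -\dfrac{x-y}{\lVert x-y \rVert} \;:\; y \in Y \,\wedge\, (y,x) \in \operatorname{supp}(d_Y \mid x) \right\}, & x \neq y,\\[2ex]
\{\, y \in \mathcal{X}_T : \lVert y \rVert \le 1 \,\}, & x = y,
\end{cases}
\qquad
\partial d_Y^2(x) = \operatorname{con}\!\left\{ -2(x-y) \;:\; y \in Y \,\wedge\, (y,x) \in \operatorname{supp}\!\left(d_Y^2 \mid x\right) \right\}.
```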

93 Proposition 22 Let f : X → Y be locally Lipschitz at x, and let g : Y → Z be locally Lipschitz at y = f (x). [sent-1247, score-0.452]

94 If f is smooth at x, then f is locally Lipschitz, regular, continuous, and differentiable at x. [sent-1265, score-0.386]

95 If f is locally Lipschitz or differentiable at x, then f is continuous at x. [sent-1267, score-0.342]

96 We have: • If f is locally Lipschitz and differentiable at x, then ∇ f (x) ∈ ∂ f (x). [sent-1280, score-0.342]

97 • If f is locally Lipschitz, regular, and differentiable at x, then ∂ f (x) = {∇ f (x)} . [sent-1281, score-0.342]

98 The Graduated Assignment Algorithm: We use the graduated assignment algorithm to approximate the NP-hard squared distance D∗(X,Y)² between two weighted graphs and to determine a subgradient from the generalized gradient ∂d²Y (see Section B).
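
A compact sketch of the graduated-assignment idea (Gold and Rangarajan, 1996): alternate a softmax over match scores with Sinkhorn row and column normalization while lowering the temperature. The annealing schedule and the quadratic compatibility score below are simplified assumptions, not the exact formulation used in the paper.

```python
import numpy as np

def graduated_assignment(A, B, beta0=0.5, beta_max=10.0, rate=1.075,
                         n_sinkhorn=30):
    """Approximate a soft vertex correspondence between graphs A and B.

    A, B : weighted adjacency matrices of the same order n.
    Returns an (approximately) doubly stochastic match matrix M.
    """
    n = A.shape[0]
    M = np.full((n, n), 1.0 / n)                 # uniform initial assignment
    beta = beta0
    while beta < beta_max:
        Q = A @ M @ B.T                          # gradient of the quadratic matching objective
        M = np.exp(beta * Q)                     # softassign at the current temperature
        for _ in range(n_sinkhorn):              # Sinkhorn normalization
            M /= M.sum(axis=1, keepdims=True)    # rows
            M /= M.sum(axis=0, keepdims=True)    # columns
        beta *= rate
    return M
```

A hard correspondence can then be read off M, for example by a greedy or Hungarian assignment, and used to evaluate the squared distance and the direction −2(x − y) from the generalized gradient above.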

99 A distance measure between attributed relational graphs for pattern recognition. [sent-1595, score-0.284]

100 On certain metric spaces arising from Euclidean spaces by a change of metric and their imbedding in Hilbert space. [sent-1600, score-0.427]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('xt', 0.356), ('lipschitz', 0.309), ('locally', 0.226), ('subgradient', 0.21), ('bermayer', 0.206), ('jain', 0.189), ('tructure', 0.168), ('paces', 0.168), ('dy', 0.157), ('graphs', 0.14), ('sy', 0.124), ('structures', 0.118), ('nonsmooth', 0.118), ('differentiable', 0.116), ('metric', 0.114), ('supp', 0.111), ('contact', 0.111), ('vertices', 0.109), ('letter', 0.108), ('maximizer', 0.102), ('clustering', 0.09), ('euclidean', 0.087), ('fi', 0.086), ('kel', 0.086), ('neittaanm', 0.086), ('chemical', 0.085), ('distortion', 0.078), ('bunke', 0.077), ('graduated', 0.077), ('inner', 0.076), ('geometrical', 0.073), ('ut', 0.072), ('orbits', 0.069), ('cluster', 0.067), ('combinatorial', 0.066), ('gradient', 0.065), ('dx', 0.064), ('pointwise', 0.063), ('graph', 0.062), ('frequent', 0.061), ('grec', 0.06), ('silhouette', 0.06), ('ssd', 0.06), ('distance', 0.059), ('rangarajan', 0.058), ('obermayer', 0.058), ('spaces', 0.056), ('proteins', 0.054), ('quotient', 0.054), ('steepest', 0.054), ('protein', 0.053), ('alignment', 0.052), ('substructure', 0.052), ('subgradients', 0.052), ('atom', 0.051), ('caprara', 0.051), ('skolnick', 0.051), ('normed', 0.051), ('representation', 0.051), ('vt', 0.048), ('sa', 0.047), ('structural', 0.045), ('central', 0.045), ('membership', 0.045), ('pattern', 0.044), ('molecules', 0.044), ('maximizers', 0.044), ('distorted', 0.044), ('smooth', 0.044), ('escolano', 0.043), ('istrail', 0.043), ('lancia', 0.043), ('mjolsness', 0.043), ('ngerprint', 0.043), ('minimizer', 0.042), ('embedding', 0.042), ('analytical', 0.041), ('attributed', 0.041), ('zn', 0.04), ('xk', 0.04), ('equivalence', 0.04), ('recognition', 0.04), ('edges', 0.04), ('gold', 0.04), ('lozano', 0.039), ('xi', 0.038), ('angle', 0.038), ('representations', 0.038), ('structured', 0.037), ('directed', 0.037), ('direction', 0.037), ('yy', 0.037), ('adjacency', 0.037), ('orbit', 0.036), ('edit', 0.036), ('xn', 0.036), ('cost', 0.036), ('xx', 0.035), ('orderings', 0.035), ('banach', 0.035)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999917 90 jmlr-2009-Structure Spaces

Author: Brijnesh J. Jain, Klaus Obermayer

Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization

2 0.19846022 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques

Author: Barnabás Póczos, András Lőrincz

Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design

3 0.12318099 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon

Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness

4 0.11900026 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

Author: Sanjoy Dasgupta, Adam Tauman Kalai, Claire Monteleoni

Abstract: We start by showing that in an active learning setting, the Perceptron algorithm needs Ω(1/ε²) labels to learn linear separators within generalization error ε. We then present a simple active learning algorithm for this problem, which combines a modification of the Perceptron update with an adaptive filtering rule for deciding which points to query. For data distributed uniformly over the unit sphere, we show that our algorithm reaches generalization error ε after asking for just Õ(d log(1/ε)) labels. This exponential improvement over the usual sample complexity of supervised learning had previously been demonstrated only for the computationally more complex query-by-committee algorithm. Keywords: active learning, perceptron, label complexity bounds, online learning

5 0.11496948 13 jmlr-2009-Bounded Kernel-Based Online Learning

Author: Francesco Orabona, Joseph Keshet, Barbara Caputo

Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set

6 0.11432777 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes

7 0.11356588 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

8 0.098693043 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

9 0.086864032 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

10 0.071836695 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

11 0.071592428 23 jmlr-2009-Discriminative Learning Under Covariate Shift

12 0.069171138 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

13 0.069006413 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

14 0.063800864 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

15 0.063085258 68 jmlr-2009-Online Learning with Samples Drawn from Non-identical Distributions

16 0.060139865 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

17 0.059574205 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

18 0.05852123 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

19 0.057329569 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

20 0.053789843 33 jmlr-2009-Exploring Strategies for Training Deep Neural Networks


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.326), (1, 0.142), (2, -0.011), (3, 0.068), (4, -0.041), (5, -0.187), (6, -0.006), (7, -0.133), (8, -0.205), (9, -0.144), (10, 0.024), (11, 0.095), (12, 0.129), (13, 0.186), (14, 0.046), (15, -0.031), (16, -0.083), (17, 0.047), (18, -0.039), (19, -0.014), (20, -0.102), (21, 0.113), (22, 0.009), (23, 0.048), (24, -0.214), (25, -0.029), (26, -0.09), (27, 0.073), (28, -0.039), (29, 0.062), (30, -0.104), (31, -0.168), (32, -0.085), (33, -0.075), (34, -0.028), (35, 0.08), (36, 0.011), (37, -0.014), (38, 0.032), (39, -0.096), (40, 0.015), (41, 0.03), (42, 0.002), (43, 0.011), (44, 0.073), (45, 0.004), (46, -0.011), (47, 0.001), (48, -0.066), (49, 0.062)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96924603 90 jmlr-2009-Structure Spaces

Author: Brijnesh J. Jain, Klaus Obermayer

Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization

2 0.74466252 40 jmlr-2009-Identification of Recurrent Neural Networks by Bayesian Interrogation Techniques

Author: Barnabás Póczos, András Lőrincz

Abstract: We introduce novel online Bayesian methods for the identification of a family of noisy recurrent neural networks (RNNs). We present Bayesian active learning techniques for stimulus selection given past experiences. In particular, we consider the unknown parameters as stochastic variables and use A-optimality and D-optimality principles to choose optimal stimuli. We derive myopic cost functions in order to maximize the information gain concerning network parameters at each time step. We also derive the A-optimal and D-optimal estimations of the additive noise that perturbs the dynamical system of the RNN. Here we investigate myopic as well as non-myopic estimations, and study the problem of simultaneous estimation of both the system parameters and the noise. Employing conjugate priors our derivations remain approximation-free and give rise to simple update rules for the online learning of the parameters. The efficiency of our method is demonstrated for a number of selected cases, including the task of controlled independent component analysis. Keywords: active learning, system identification, online Bayesian learning, A-optimality, Doptimality, infomax control, optimal design

3 0.55326986 72 jmlr-2009-Polynomial-Delay Enumeration of Monotonic Graph Classes

Author: Jan Ramon, Siegfried Nijssen

Abstract: Algorithms that list graphs such that no two listed graphs are isomorphic, are important building blocks of systems for mining and learning in graphs. Algorithms are already known that solve this problem efficiently for many classes of graphs of restricted topology, such as trees. In this article we introduce the concept of a dense augmentation schema, and introduce an algorithm that can be used to enumerate any class of graphs with polynomial delay, as long as the class of graphs can be described using a monotonic predicate operating on a dense augmentation schema. In practice this means that this is the first enumeration algorithm that can be applied theoretically efficiently in any frequent subgraph mining algorithm, and that this algorithm generalizes to situations beyond the standard frequent subgraph mining setting. Keywords: graph mining, enumeration, monotonic graph classes

4 0.55109477 51 jmlr-2009-Low-Rank Kernel Learning with Bregman Matrix Divergences

Author: Brian Kulis, Mátyás A. Sustik, Inderjit S. Dhillon

Abstract: In this paper, we study low-rank matrix nearness problems, with a focus on learning low-rank positive semidefinite (kernel) matrices for machine learning applications. We propose efficient algorithms that scale linearly in the number of data points and quadratically in the rank of the input matrix. Existing algorithms for learning kernel matrices often scale poorly, with running times that are cubic in the number of data points. We employ Bregman matrix divergences as the measures of nearness—these divergences are natural for learning low-rank kernels since they preserve rank as well as positive semidefiniteness. Special cases of our framework yield faster algorithms for various existing learning problems, and experimental results demonstrate that our algorithms can effectively learn both low-rank and full-rank kernel matrices. Keywords: kernel methods, Bregman divergences, convex optimization, kernel learning, matrix nearness

5 0.42669612 13 jmlr-2009-Bounded Kernel-Based Online Learning

Author: Francesco Orabona, Joseph Keshet, Barbara Caputo

Abstract: A common problem of kernel-based online algorithms, such as the kernel-based Perceptron algorithm, is the amount of memory required to store the online hypothesis, which may increase without bound as the algorithm progresses. Furthermore, the computational load of such algorithms grows linearly with the amount of memory used to store the hypothesis. To attack these problems, most previous work has focused on discarding some of the instances, in order to keep the memory bounded. In this paper we present a new algorithm, in which the instances are not discarded, but are instead projected onto the space spanned by the previous online hypothesis. We call this algorithm Projectron. While the memory size of the Projectron solution cannot be predicted before training, we prove that its solution is guaranteed to be bounded. We derive a relative mistake bound for the proposed algorithm, and deduce from it a slightly different algorithm which outperforms the Perceptron. We call this second algorithm Projectron++. We show that this algorithm can be extended to handle the multiclass and the structured output settings, resulting, as far as we know, in the first online bounded algorithm that can learn complex classification tasks. The method of bounding the hypothesis representation can be applied to any conservative online algorithm and to other online algorithms, as it is demonstrated for ALMA2 . Experimental results on various data sets show the empirical advantage of our technique compared to various bounded online algorithms, both in terms of memory and accuracy. Keywords: online learning, kernel methods, support vector machines, bounded support set

6 0.35509124 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

7 0.34871921 30 jmlr-2009-Estimation of Sparse Binary Pairwise Markov Networks using Pseudo-likelihoods

8 0.33502388 66 jmlr-2009-On the Consistency of Feature Selection using Greedy Least Squares Regression

9 0.33104628 24 jmlr-2009-Distance Metric Learning for Large Margin Nearest Neighbor Classification

10 0.33101076 86 jmlr-2009-Similarity-based Classification: Concepts and Algorithms

11 0.31327549 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

12 0.29316682 9 jmlr-2009-Analysis of Perceptron-Based Active Learning

13 0.29130596 65 jmlr-2009-On Uniform Deviations of General Empirical Risks with Unboundedness, Dependence, and High Dimensionality

14 0.27643874 93 jmlr-2009-The Hidden Life of Latent Variables: Bayesian Learning with Mixed Graph Models

15 0.26248801 11 jmlr-2009-Bayesian Network Structure Learning by Recursive Autonomy Identification

16 0.25369892 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection

17 0.2523689 83 jmlr-2009-SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent

18 0.24782839 100 jmlr-2009-When Is There a Representer Theorem? Vector Versus Matrix Regularizers

19 0.24681422 98 jmlr-2009-Universal Kernel-Based Learning with Applications to Regular Languages    (Special Topic on Mining and Learning with Graphs and Relations)

20 0.24389094 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(8, 0.012), (11, 0.024), (19, 0.019), (21, 0.011), (26, 0.012), (38, 0.017), (52, 0.031), (55, 0.043), (58, 0.032), (66, 0.098), (68, 0.021), (90, 0.594), (96, 0.018)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96787101 90 jmlr-2009-Structure Spaces

Author: Brijnesh J. Jain, Klaus Obermayer

Abstract: Finite structures such as point patterns, strings, trees, and graphs occur as ”natural” representations of structured data in different application areas of machine learning. We develop the theory of structure spaces and derive geometrical and analytical concepts such as the angle between structures and the derivative of functions on structures. In particular, we show that the gradient of a differentiable structural function is a well-defined structure pointing in the direction of steepest ascent. Exploiting the properties of structure spaces, it will turn out that a number of problems in structural pattern recognition such as central clustering or learning in structured output spaces can be formulated as optimization problems with cost functions that are locally Lipschitz. Hence, methods from nonsmooth analysis are applicable to optimize those cost functions. Keywords: graphs, graph matching, learning in structured domains, nonsmooth optimization

2 0.94867271 8 jmlr-2009-An Anticorrelation Kernel for Subsystem Training in Multiple Classifier Systems

Author: Luciana Ferrer, Kemal Sönmez, Elizabeth Shriberg

Abstract: We present a method for training support vector machine (SVM)-based classification systems for combination with other classification systems designed for the same task. Ideally, a new system should be designed such that, when combined with existing systems, the resulting performance is optimized. We present a simple model for this problem and use the understanding gained from this analysis to propose a method to achieve better combination performance when training SVM systems. We include a regularization term in the SVM objective function that aims to reduce the average class-conditional covariance between the resulting scores and the scores produced by the existing systems, introducing a trade-off between such covariance and the system’s individual performance. That is, the new system “takes one for the team”, falling somewhat short of its best possible performance in order to increase the diversity of the ensemble. We report results on the NIST 2005 and 2006 speaker recognition evaluations (SREs) for a variety of subsystems. We show a gain of 19% on the equal error rate (EER) of a combination of four systems when applying the proposed method with respect to the performance obtained when the four systems are trained independently of each other. Keywords: system combination, ensemble diversity, multiple classifier systems, support vector machines, speaker recognition, kernel methods

3 0.92925251 95 jmlr-2009-The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List

Author: Cynthia Rudin

Abstract: We are interested in supervised ranking algorithms that perform especially well near the top of the ranked list, and are only required to perform sufficiently well on the rest of the list. In this work, we provide a general form of convex objective that gives high-scoring examples more importance. This “push” near the top of the list can be chosen arbitrarily large or small, based on the preference of the user. We choose ℓ p -norms to provide a specific type of push; if the user sets p larger, the objective concentrates harder on the top of the list. We derive a generalization bound based on the p-norm objective, working around the natural asymmetry of the problem. We then derive a boosting-style algorithm for the problem of ranking with a push at the top. The usefulness of the algorithm is illustrated through experiments on repository data. We prove that the minimizer of the algorithm’s objective is unique in a specific sense. Furthermore, we illustrate how our objective is related to quality measurements for information retrieval. Keywords: ranking, RankBoost, generalization bounds, ROC, information retrieval

4 0.68752873 42 jmlr-2009-Incorporating Functional Knowledge in Neural Networks

Author: Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, René Garcia

Abstract: Incorporating prior knowledge of a particular task into the architecture of a learning algorithm can greatly improve generalization performance. We study here a case where we know that the function to be learned is non-decreasing in its two arguments and convex in one of them. For this purpose we propose a class of functions similar to multi-layer neural networks but (1) that has those properties, (2) is a universal approximator of Lipschitz functions with these and other properties. We apply this new class of functions to the task of modelling the price of call options. Experiments show improvements on regressing the price of call options using the new types of function classes that incorporate the a priori constraints. Keywords: neural networks, universal approximation, monotonicity, convexity, call options

5 0.66199362 80 jmlr-2009-Reproducing Kernel Banach Spaces for Machine Learning

Author: Haizhang Zhang, Yuesheng Xu, Jun Zhang

Abstract: We introduce the notion of reproducing kernel Banach spaces (RKBS) and study special semiinner-product RKBS by making use of semi-inner-products and the duality mapping. Properties of an RKBS and its reproducing kernel are investigated. As applications, we develop in the framework of RKBS standard learning schemes including minimal norm interpolation, regularization network, support vector machines, and kernel principal component analysis. In particular, existence, uniqueness and representer theorems are established. Keywords: reproducing kernel Banach spaces, reproducing kernels, learning theory, semi-innerproducts, representer theorems

6 0.65474218 52 jmlr-2009-Margin-based Ranking and an Equivalence between AdaBoost and RankBoost

7 0.64762753 85 jmlr-2009-Settable Systems: An Extension of Pearl's Causal Model with Optimization, Equilibrium, and Learning

8 0.61724144 32 jmlr-2009-Exploiting Product Distributions to Identify Relevant Variables of Correlation Immune Functions

9 0.59912288 59 jmlr-2009-Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions

10 0.59898597 99 jmlr-2009-Using Local Dependencies within Batches to Improve Large Margin Classifiers

11 0.5982722 22 jmlr-2009-Deterministic Error Analysis of Support Vector Regression and Related Regularized Kernel Methods

12 0.59718949 45 jmlr-2009-Learning Approximate Sequential Patterns for Classification

13 0.59677011 70 jmlr-2009-Particle Swarm Model Selection    (Special Topic on Model Selection)

14 0.58399993 34 jmlr-2009-Fast ApproximatekNN Graph Construction for High Dimensional Data via Recursive Lanczos Bisection

15 0.58077776 58 jmlr-2009-NEUROSVM: An Architecture to Reduce the Effect of the Choice of Kernel on the Performance of SVM

16 0.57789534 18 jmlr-2009-Consistency and Localizability

17 0.56454206 39 jmlr-2009-Hybrid MPI OpenMP Parallel Linear Support Vector Machine Training

18 0.56267542 78 jmlr-2009-Refinement of Reproducing Kernels

19 0.56239849 27 jmlr-2009-Efficient Online and Batch Learning Using Forward Backward Splitting

20 0.55692345 36 jmlr-2009-Fourier Theoretic Probabilistic Inference over Permutations