nips nips2008 nips2008-44 knowledge-graph by maker-knowledge-mining

44 nips-2008-Characteristic Kernels on Groups and Semigroups


Source: pdf

Author: Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Bharath K. Sriperumbudur

Abstract: Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for Rn . In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and Rn . + 1

Reference: text


Summary: the most important sentences generated by the tfidf model

sentIndex sentText sentNum sentScore

1 Necessary and sufficient conditions for an RKHS to be characteristic exist for Rn . [sent-11, score-0.337]

2 In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. [sent-12, score-0.457]

3 Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and Rn . [sent-13, score-0.557]

4 Key to the above work is the notion of a characteristic kernel, as introduced in [5, 6]: it gives an RKHS for which probabilities have unique images (i.e., distinct probabilities are mapped to distinct mean elements). [sent-17, score-0.361]

5 Universal kernels on compact metric spaces [16] are characteristic [8], as are Gaussian and Laplace kernels on Rn [6]. [sent-21, score-0.979]

6 Recently, it has been shown [14] that a continuous shift-invariant R-valued positive definite kernel on Rn is characteristic if and only if the support of its Fourier transform is the entire Rn . [sent-22, score-0.717]

7 This completely determines the set of characteristic ones in the convex cone of continuous shift-invariant positive definite kernels on Rn . [sent-23, score-0.718]

8 One of the chief advantages of kernel methods is that they allow us to deal straightforwardly with complex domains, through use of a kernel function to determine the similarity between objects in these domains [13]. [sent-24, score-0.264]

9 A question that naturally arises is whether characteristic kernels can be defined on spaces besides Rn . [sent-25, score-0.582]

10 Several such domains constitute topological groups/semigroups, and our focus is on kernels defined by their algebraic structure. [sent-26, score-0.342]

11 Broadly speaking, our approach is based on extensions of Fourier analysis to groups and semigroups, where we apply appropriate extensions of Bochner’s theorem to obtain the required conditions on the kernel. [sent-27, score-0.221]

12 The most immediate generalization of the results in [14] is to locally compact Abelian groups, of which (Rn , +) is one example. [sent-28, score-0.239]

13 Thus, in Section 2 we provide a review of characteristic kernels on (Rn , +) from this viewpoint. [sent-29, score-0.557]

14 In Section 3 we derive necessary and sufficient conditions for kernels on locally compact Abelian groups to be characteristic. [sent-30, score-0.579]

15 We next address non-Abelian compact groups in Section 4, for which we obtain a sufficient condition for a characteristic kernel. [sent-34, score-0.66]

16 Finally, in Section 5, we consider the Abelian semigroup (Rn+ , +), where R+ = [0, ∞). [sent-36, score-0.212]

17 This semigroup has many practical applications, including expressions of nonnegative measures or frequency on n points [3]. [sent-37, score-0.255]

18 Note that in all cases, we provide specific examples of characteristic kernels to illustrate the properties required. [sent-38, score-0.557]

19 2 Preliminaries: Characteristic kernels and shift-invariant kernels Let X be a random variable taking values on a measurable space (Ω, B), and H be an RKHS defined by a measurable kernel k on Ω such that E[√k(X, X)] < ∞. [sent-39, score-0.671]

20 A bounded measurable kernel k on Ω is called characteristic if {P : probability on (Ω, B)} → H, P → mP = EX∼P [k(·, X)] (1) is injective ([5, 6]). [sent-42, score-0.581]

21 Therefore, by definition, a characteristic kernel uniquely determines a probability by its mean element. [sent-43, score-0.456]
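
As an illustration of Eq. (1) (a minimal sketch, not code from the paper), the mean element can be estimated by a sample average of kernel sections, and two distributions can be compared through the squared RKHS distance between their embeddings, assuming a Gaussian RBF kernel, which is characteristic on Rn:

```python
import numpy as np

def rbf_gram(X, Y, sigma=1.0):
    # Gram matrix of k(x, y) = exp(-||x - y||^2 / sigma^2)
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

def embedding_distance2(X, Y, sigma=1.0):
    # Empirical ||m_P - m_Q||_H^2 = mean(K_XX) + mean(K_YY) - 2 mean(K_XY)
    return (rbf_gram(X, X, sigma).mean() + rbf_gram(Y, Y, sigma).mean()
            - 2.0 * rbf_gram(X, Y, sigma).mean())

rng = np.random.default_rng(0)
P = rng.normal(0.0, 1.0, size=(500, 2))       # sample from P
Q = rng.laplace(0.0, 1.0, size=(500, 2))      # sample from a different Q
print(embedding_distance2(P, rng.normal(0.0, 1.0, size=(500, 2))))  # near 0: same law
print(embedding_distance2(P, Q))              # clearly positive: m_P differs from m_Q
```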

22 The following result provides the necessary and sufficient condition for a kernel to be characteristic and shows its associated RKHS to be a rich function class. [sent-46, score-0.509]

23 Let (Ω, B) be a measurable space, k be a bounded measurable positive definite kernel on Ω, and H be the associated RKHS. [sent-49, score-0.333]

24 Then, k is characteristic if and only if H + R (direct sum of the two RKHS’s) is dense in L2 (P ) for every probability P on (Ω, B). [sent-50, score-0.362]

25 The above lemma and Theorem 3 of [6] imply that characteristic kernels give a criterion of (conditional) independence through (conditional) covariance on RKHS, which enables statistical tests of independence with kernels [6]. [sent-51, score-0.821]
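
The independence criterion mentioned here can be sketched (for illustration only; the precise estimators and tests are in [6] and related work) by the Hilbert-Schmidt norm of the empirical cross-covariance operator, which vanishes in the population exactly under independence when characteristic kernels are used:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma ** 2)

def hs_dependence(X, Y, sigma=1.0):
    # Biased estimate of ||C_XY||_HS^2: (1/n^2) Tr[K H L H], with H the centering matrix
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    K, L = rbf_gram(X, sigma), rbf_gram(Y, sigma)
    return np.trace(K @ H @ L @ H) / n ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1))
print(hs_dependence(X, rng.normal(size=(300, 1))))                  # ~0: independent
print(hs_dependence(X, X ** 2 + 0.1 * rng.normal(size=(300, 1))))   # larger: dependent
```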

26 This also explains the practical importance of characteristic kernels. [sent-52, score-0.337]

27 The following result shows that the characteristic property is invariant under some conformal mappings introduced in [17] and provides a construction to generate new characteristic kernels. [sent-53, score-0.764]

28 Let Ω be a topological space with Borel σ-field, k be a measurable positive definite kernel on Ω such that ∫_Ω k(·, y) dµ(y) = 0 means µ = 0 for a finite Borel measure µ, and f : Ω → C be a bounded continuous function such that f (x) > 0 for all x ∈ Ω and k(x, x)|f (x)|^2 is bounded. [sent-55, score-0.432]
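
A small sketch of the construction this result licenses (illustrative only; the function names are ours): given a base kernel k satisfying the hypothesis and a strictly positive bounded weight f, the conformally transformed kernel f(x)k(x, y)f(y) is again a valid kernel and inherits the characteristic property.

```python
import numpy as np

def conformal_kernel(k, f):
    """Return the transformed kernel (x, y) -> f(x) * k(x, y) * f(y)."""
    return lambda x, y: f(x) * k(x, y) * f(y)

# Example on R: reweight a Gaussian RBF kernel by f(x) = 1 / (1 + x^2),
# which is bounded, continuous and strictly positive, as the lemma requires.
k_rbf = lambda x, y: np.exp(-(x - y) ** 2)
f = lambda x: 1.0 / (1.0 + x ** 2)
k_new = conformal_kernel(k_rbf, f)

xs = np.array([-1.0, 0.0, 0.5, 2.0])
K = k_new(xs[:, None], xs[None, :])                 # Gram matrix of the new kernel
print(np.all(np.linalg.eigvalsh(K) >= -1e-12))      # still positive semidefinite
```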

29 We will focus on spaces with algebraic structure for better description of characteristic kernels. [sent-61, score-0.422]

30 We call this type of positive definite kernel shift-invariant, because k(zx, zy) = φ((zy)^{−1}(zx)) = φ(y^{−1}x) = k(x, y) for any z ∈ G. [sent-64, score-0.294]

31 There are many examples of shift-invariant positive definite kernels on the additive group Rn: the Gaussian RBF kernel k(x, y) = exp(−‖x − y‖^2/σ^2) and the Laplacian kernel k(x, y) = exp(−β Σ_{i=1}^n |xi − yi|) are famous ones. [sent-65, score-0.692]

32 φ(x) = ∫_{Rn} e^{√−1 x^T ω} dΛ(ω). (2) Bochner’s theorem completely characterizes the set of continuous shift-invariant positive definite kernels on Rn by the Fourier transform. [sent-69, score-0.482]
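
Bochner's theorem in Eq. (2) is also what makes the standard random Fourier feature approximation work: drawing frequencies ω from the (normalized) measure Λ gives explicit features whose inner product approximates the kernel. A minimal sketch for the Gaussian kernel, whose Λ is itself a Gaussian density (a standard technique, not code from this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, D, d = 1.0, 4000, 3      # bandwidth, number of random features, input dimension

# For k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), Lambda is the N(0, sigma^{-2} I) density.
W = rng.normal(0.0, 1.0 / sigma, size=(D, d))       # frequencies drawn from Lambda
b = rng.uniform(0.0, 2 * np.pi, size=D)
features = lambda X: np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

x, y = rng.normal(size=(1, d)), rng.normal(size=(1, d))
exact = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
approx = (features(x) @ features(y).T)[0, 0]
print(exact, approx)            # the Monte Carlo value should be close to the exact kernel
```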

33 It also implies that the continuous positive definite functions form a convex cone with the extreme points given by the Fourier kernels {e^{√−1 x^T ω} | ω ∈ Rn}. [sent-70, score-0.381]

34 It is interesting to determine the class of continuous shift-invariant “characteristic” kernels on Rn . [sent-71, score-0.307]

35 The basic idea is the following: since the mean element EP [φ(y − X)] is equal to the convolution φ ∗ P, the Fourier transform rewrites the definition of the characteristic property as (P^ − Q^)Λ = 0 =⇒ P = Q, where ^ denotes the Fourier transform, and we use (φ ∗ P)^ = ΛP^. [sent-75, score-0.46]

36 We will extend these results to more general algebraic objects, such as groups and semigroups, on which Fourier analysis and Bochner’s theorem can be extended. [sent-77, score-0.281]

37 3 Characteristic kernels on locally compact Abelian groups It is known that most of the results on Fourier analysis for Rn are extended to any locally compact Abelian (LCA) group, which is an Abelian (i. [sent-78, score-0.818]

38 commutative) topological group whose topology is Hausdorff and locally compact. [sent-80, score-0.295]

39 The group operation is denoted by “+” in Abelian cases. [sent-82, score-0.185]

40 Hereafter, for an LCA group G, we consider only the probability measures included in the set of finite regular measures M(G) (see Supplements) when discussing the characteristic property. [sent-83, score-0.583]

41 For a LCA group G, there exists a non-negative regular measure m on G such that m(E + x) = m(E) for every x ∈ G and every Borel set E in G. [sent-88, score-0.242]

42 The set of all continuous characters of G forms an Abelian group with the operation (γ1 γ2 )(x) = γ1 (x)γ2 (x). [sent-94, score-0.292]

43 This group is called the dual group of G, and denoted by Ĝ. [sent-98, score-0.375]

44 It is known that Ĝ is an LCA group if the weakest topology is introduced so that x̂ is continuous for each x ∈ G. [sent-100, score-0.284]

45 We can therefore consider the dual of Ĝ, denoted by (Ĝ)^, and the group homomorphism G → (Ĝ)^, x ↦ x̂. [sent-101, score-0.266]

46 It is known that the dual group of the LCA group Rn is {e^{√−1 ω^T x} | ω ∈ Rn}, which can be identified with Rn. [sent-119, score-0.35]

47 The above definition and properties of the Fourier transform for LCA groups are extensions of the ordinary Fourier transform on Rn. [sent-120, score-0.344]

48 A continuous function φ on G is positive definite if and only if there is a unique non-negative measure Λ ∈ M(Ĝ) such that φ(x) = ∫_{Ĝ} (x, γ) dΛ(γ). (5) [sent-128, score-0.217]

49 Shift-invariant characteristic kernels on LCA groups Based on Bochner’s theorem, a sufficient condition for the characteristic property is obtained. [sent-130, score-1.103]

50 Let φ be a continuous positive definite function on an LCA group G given by Eq. (5). [sent-132, score-0.321]

51 If supp(Λ) = Ĝ, then the positive definite kernel k(x, y) = φ(x − y) is characteristic. [sent-134, score-0.193]

52 Let φ be an R-valued continuous positive definite function on an LCA group G given by Eq. (5). [sent-142, score-0.321]

53 The kernel φ(x − y) is characteristic if and only if (i) {0} is not open in Ĝ and supp(Λ) = Ĝ, or (ii) {0} is open in Ĝ and supp(Λ) ⊃ Ĝ − {0}. [sent-144, score-0.534]

54 It is obvious that k is characteristic if and only if so is k(x, y) + 1. [sent-149, score-0.364]

55 Take an open neighborhood W of 0 in G with compact closure such that W ⊂ τ −1 (U − γ0 ). [sent-155, score-0.216]

56 Also, g is positive definite, since Σ_{i,j} ci cj g(xi − xj) = Σ_{i,j} ci cj ∫_G χW(xi − xj − y) χ−W(y) dy = Σ_{i,j} ci cj ∫_G χW(xi − y) χ−W(y − xj) dy = ∫_G (Σ_i ci χW(xi − y)) (Σ_j cj χW(xj − y)) dy ≥ 0. [sent-159, score-0.796]

57 From Theorem 8, we can see that the characteristic property is stable under products of shift-invariant kernels. [sent-174, score-0.36]

58 Let φ1 (x − y) and φ2 (x − y) be R-valued continuous shift-invariant characteristic kernels on a LCA group G. [sent-176, score-0.804]
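
In code, the corollary sketched in the two preceding sentences is just pointwise multiplication (an illustrative sketch): the product of the Gaussian and Laplacian kernels on Rn, each characteristic, is again a shift-invariant characteristic kernel.

```python
import numpy as np

k_gauss = lambda x, y: np.exp(-np.sum((x - y) ** 2, axis=-1))      # characteristic on R^n
k_laplace = lambda x, y: np.exp(-np.sum(np.abs(x - y), axis=-1))   # characteristic on R^n
k_product = lambda x, y: k_gauss(x, y) * k_laplace(x, y)           # characteristic by the corollary

x, y = np.array([0.0, 1.0]), np.array([0.5, -0.5])
print(k_gauss(x, y), k_laplace(x, y), k_product(x, y))
```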

59 (Rn , +): As already shown in [6, 14], the Gaussian RBF kernel exp(−‖x − y‖^2/(2σ^2)) and the Laplacian kernel exp(−β Σ_{i=1}^n |xi − yi|) are characteristic on Rn. [sent-187, score-0.575]

60 An example of a positive definite kernel that is not characteristic on Rn is sinc(x − y) = sin(x − y)/(x − y). [sent-188, score-0.53]
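
The two spectral measures can be written down directly (a didactic sketch, not code from the paper), which makes the support criterion concrete: the Gaussian kernel's Λ has a density that is positive on all of R, whereas the sinc kernel's Λ is, up to constants, the uniform measure on [−1, 1], so its support misses most of the dual group.

```python
import numpy as np

# Spectral densities (Fourier representations of the kernel profiles, up to constants):
#   Gaussian:  phi(t) = exp(-t^2 / 2)  <->  density exp(-w^2 / 2)   -> supp(Lambda) = R
#   sinc:      phi(t) = sin(t) / t     <->  density 1_{[-1, 1]}(w)  -> supp(Lambda) = [-1, 1]
gauss_density = lambda w: np.exp(-w ** 2 / 2.0)
sinc_density = lambda w: 1.0 if abs(w) <= 1.0 else 0.0

for w in (0.5, 1.5, 3.0):
    print(f"w = {w}: gaussian density = {gauss_density(w):.4f}, sinc density = {sinc_density(w):.1f}")
# The Gaussian density never vanishes, so that kernel is characteristic;
# the sinc density vanishes for |w| > 1, so the sinc kernel is not.
```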

61 The dual group is {e^{√−1 nx} | n ∈ Z}, which is isomorphic to Z. [sent-191, score-0.19]

62 Examples of non-characteristic kernels on [0, 2π) include cos(x − y), the Fejér kernel, and the Dirichlet kernel. [sent-198, score-0.22]
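
For contrast, a kernel on [0, 2π) that does satisfy the support condition (stated here as an illustration consistent with the theorem above, since its Fourier coefficients α^{|n|} are strictly positive for every n ∈ Z) is the Poisson kernel, which has a simple closed form:

```python
import numpy as np

def poisson_kernel(x, y, alpha=0.5):
    """Shift-invariant kernel on the circle [0, 2*pi), with 0 < alpha < 1:
        k(x, y) = sum_{n in Z} alpha^{|n|} e^{i n (x - y)}
                = (1 - alpha**2) / (1 - 2*alpha*cos(x - y) + alpha**2).
    Every Fourier coefficient alpha^{|n|} is positive, so supp(Lambda) = Z."""
    c = np.cos(x - y)
    return (1.0 - alpha ** 2) / (1.0 - 2.0 * alpha * c + alpha ** 2)

theta = np.linspace(0.0, 2.0 * np.pi, 6, endpoint=False)
K = poisson_kernel(theta[:, None], theta[None, :])
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # Gram matrix is positive semidefinite
```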

63 4 Characteristic kernels on compact groups We discuss non-Abelian cases in this section. [sent-199, score-0.517]

64 Providing useful positive definite kernels on this class is important in those application areas. [sent-202, score-0.294]

65 First, we give a brief summary of known results on the Fourier analysis on locally compact and compact groups. [sent-203, score-0.416]

66 1 Unitary representation and Fourier analysis Let G be a locally compact group, which may not be Abelian. [sent-206, score-0.239]

67 For a unitary representation (T, H) on a locally compact group G, a subspace V in H is called G-invariant if T (x)V ⊂ V for every x ∈ G. [sent-208, score-0.586]

68 A unitary representation (T, H) is irreducible if there are no closed G-invariant subspaces except {0} and H. [sent-209, score-0.214]

69 Unitary representations (T1 , H1 ) and (T2 , H2 ) are said to be equivalent if there is a unitary isomorphism A : H1 → H2 such that T1 = A−1 T2 A. [sent-210, score-0.196]

70 (i) If G is a compact group, every irreducible unitary representation (T, H) of G is finite dimensional, that is, H is finite dimensional. [sent-216, score-0.416]

71 (ii) If G is an Abelian group, every irreducible unitary representation of G is one dimensional. [sent-217, score-0.239]

72 It is possible to extend Fourier analysis to locally compact non-Abelian groups. [sent-219, score-0.239]

73 Unlike the Abelian case, a Fourier transform by characters is not possible; instead we need to consider unitary representations and an operator-valued Fourier transform. [sent-220, score-0.307]

74 We define Ĝ to be the set of equivalence classes of irreducible unitary representations of a compact group G. [sent-225, score-0.551]

75 The equivalence class of a unitary representation (T, HT ) is denoted by [T ], and the dimensionality of HT by dT . [sent-226, score-0.187]

76 It is known that on a compact group G there is a Haar measure m, which is a left and right invariant non-negative finite measure. [sent-228, score-0.369]

77 This is a natural extension of the Fourier transform on LCA groups, where Ĝ consists of the characters, which serve as the Fourier kernel in view of Theorem 10. [sent-233, score-0.264]

78 Bochner’s theorem can be extended to compact groups as follows [11, Section 34. [sent-239, score-0.398]

79 A continuous function φ on a compact group G is positive definite if and only if the Fourier transform φ̂(T) is positive semidefinite and gives an absolutely convergent series Eq. [sent-242, score-0.739]

80 The proof of the “if” part is easy; in fact, Σ_{i,j} ci cj φ(xj^{−1} xi) = Σ_{[T]} dT Σ_{i,j} ci cj Tr[T(xi) φ̂(T) T(xj)∗] = Σ_{[T]} dT Tr[(Σ_i ci T(xi)) φ̂(T) (Σ_j cj T(xj))∗] ≥ 0. [sent-244, score-0.529]

81 Shift-invariant characteristic kernels on compact groups We have the following sufficient condition for the characteristic property on compact groups. [sent-246, score-1.417]

82 If φ̂(T) is strictly positive definite for every [T] ∈ Ĝ \ {1}, the kernel φ(y^{−1}x) is characteristic. [sent-250, score-0.218]

83 Eq. (8) derives a γn for each term; we see that a sequence {an}_{n=0}^∞ such that a0 ≥ 0, an > 0 (n ≥ 1), and Σ_{n=0}^∞ an (2n + 1)^2 < ∞ defines a characteristic positive definite kernel on SO(3); one example is Σ_{n=0}^∞ [1/(2n + 1)^3] sin((2n + 1)θ)/(8 sin θ) (cos θ = Tr[B^{−1}A], 0 ≤ θ ≤ π). [sent-285, score-0.556]

84 Another, with coefficients α^{2n+1}/(2n + 1), has the closed form Σ_{n=0}^∞ α^{2n+1} sin((2n + 1)θ)/((2n + 1) sin θ) = (1/(2 sin θ)) arctan(2α sin θ/(1 − α^2)). [sent-288, score-0.23]
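
A hedged sketch of evaluating a kernel from this family on rotation matrices: we take the arctan closed form above with positive weights over the irreducible characters of SO(3), and assume the standard half-angle parametrization χn(ψ) = sin((2n+1)ψ/2)/sin(ψ/2) with rotation angle ψ recovered from Tr[BᵀA] = 1 + 2 cos ψ (the paper's exact normalization and angle convention may differ; those choices are ours).

```python
import numpy as np
from scipy.spatial.transform import Rotation

def so3_character_kernel(A, B, alpha=0.5):
    """Positive combination of SO(3) characters chi_n(psi) = sin((2n+1) psi/2) / sin(psi/2)
    with weights alpha^{2n+1} / (2n+1), summed in closed form via the arctan identity above.
    All weights are strictly positive, matching the sufficient condition for being characteristic.
    Assumption: psi is the rotation angle of B^{-1} A, with Tr[B^{-1} A] = 1 + 2 cos(psi)."""
    R = B.T @ A                                              # B^{-1} = B^T for rotation matrices
    psi = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    s = np.sin(psi / 2.0)
    if np.isclose(s, 0.0):                                   # psi -> 0: limit of the series
        return alpha / (1.0 - alpha ** 2)
    return np.arctan(2.0 * alpha * s / (1.0 - alpha ** 2)) / (2.0 * s)

A = Rotation.from_euler("z", 30, degrees=True).as_matrix()
B = Rotation.from_euler("x", 75, degrees=True).as_matrix()
print(so3_character_kernel(A, B), so3_character_kernel(A, A))
```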

85 5 Characteristic kernels on the semigroup Rn+ In this section, we consider kernels on an Abelian semigroup (S, +). [sent-289, score-1.094]

86 In this case, a kernel based on the semigroup structure is defined by k(x, y) = φ(x + y). [sent-290, score-0.331]

87 For an Abelian semigroup (S, +), a semicharacter is defined by a map ρ : S → C such that ρ(x + y) = ρ(x)ρ(y). [sent-291, score-0.212]

88 While extensions of Bochner’s theorem are known for semigroups [2], the topology on the set of semicharacters is not as obvious as for LCA groups, and the straightforward extension of the results in Section 3 is difficult. [sent-292, score-0.3]

89 We focus only on the Abelian semigroup (Rn+ , +), where R+ = [0, ∞). [sent-293, score-0.212]

90 This semigroup has many practical applications in data analysis, including expressions of nonnegative measures or frequencies on n points [3]. [sent-294, score-0.255]

91 For Rn+ , the Laplace transform replaces the Fourier transform to give Bochner’s theorem. [sent-300, score-0.2]

92 Based on the above theorem (Eq. (9)), we have the following sufficient condition for the characteristic property. [sent-305, score-0.363]

93 If supp Λ = Rn+ , then the positive definite kernel k(x, y) = φ(x + y) is characteristic. [sent-309, score-0.193]

94 We show some examples of characteristic kernels on (Rn+ , +). [sent-317, score-0.557]
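
One concrete instance satisfying the sufficient condition (stated here as an illustration; the paper's own examples may use a different φ) is φ(t) = ∏_i 1/(ti + c), the Laplace transform of the density ∏_i e^{−c ωi}, which is strictly positive on all of [0, ∞)^n, so supp Λ = Rn+:

```python
import numpy as np

def semigroup_kernel(X, Y, c=1.0):
    """k(x, y) = phi(x + y) on R^n_+ with phi(t) = prod_i 1 / (t_i + c).
    phi is the Laplace transform of the density prod_i exp(-c * w_i), whose
    support is all of [0, inf)^n, so the sufficient condition above applies."""
    S = X[:, None, :] + Y[None, :, :]            # pairwise sums x + y
    return np.prod(1.0 / (S + c), axis=-1)

rng = np.random.default_rng(0)
X = rng.exponential(size=(6, 3))                 # non-negative data, e.g. frequency vectors
K = semigroup_kernel(X, X)
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))   # Gram matrix is positive semidefinite
```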

95 This type of kernel on non-negative measures is discussed in [3] in connection with semigroup structure. [sent-323, score-0.374]

96 6 Conclusions We have discussed conditions under which kernels defined by the algebraic structure of groups and semigroups are characteristic. [sent-324, score-0.496]

97 For locally compact Abelian groups, the continuous shift-invariant Rvalued characteristic kernels are completely determined by the Fourier inverse of positive measures with support equal to the entire dual group. [sent-325, score-1.03]

98 For compact (non-Abelian) groups, we give a sufficient condition for continuous shift-invariant kernels to be characteristic in terms of the operator-valued Fourier transform. [sent-326, score-0.847]

99 In the advanced theory of harmonic analysis, Bochner’s theorem and Fourier analysis can be extended to more general algebraic structures to some extent. [sent-328, score-0.198]

100 While the characteristic property gives a necessary requirement for RKHS embeddings of distributions to be distinguishable, it does not address optimal kernel choice at finite sample sizes. [sent-332, score-0.518]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('characteristic', 0.337), ('fourier', 0.288), ('abelian', 0.27), ('bochner', 0.27), ('lca', 0.27), ('supp', 0.246), ('kernels', 0.22), ('rn', 0.214), ('semigroup', 0.212), ('compact', 0.177), ('unitary', 0.162), ('group', 0.16), ('groups', 0.12), ('kernel', 0.119), ('sin', 0.115), ('dt', 0.107), ('theorem', 0.101), ('fukumizu', 0.1), ('transform', 0.1), ('rkhs', 0.098), ('semigroups', 0.096), ('borel', 0.093), ('continuous', 0.087), ('nite', 0.086), ('tr', 0.082), ('mx', 0.077), ('cj', 0.075), ('positive', 0.074), ('countable', 0.072), ('ci', 0.069), ('haar', 0.067), ('rkhss', 0.067), ('locally', 0.062), ('ht', 0.062), ('algebraic', 0.06), ('gretton', 0.06), ('measurable', 0.056), ('cos', 0.056), ('dx', 0.055), ('irreducible', 0.052), ('homomorphism', 0.051), ('bi', 0.05), ('dy', 0.05), ('fubini', 0.046), ('characters', 0.045), ('tn', 0.045), ('measures', 0.043), ('injective', 0.041), ('embeddings', 0.039), ('open', 0.039), ('conformal', 0.039), ('pontryagin', 0.039), ('semicharacters', 0.039), ('tij', 0.039), ('zx', 0.039), ('zy', 0.039), ('ai', 0.038), ('hilbert', 0.038), ('topology', 0.037), ('harmonic', 0.037), ('topological', 0.036), ('sch', 0.035), ('jmlr', 0.034), ('mpi', 0.034), ('absolutely', 0.034), ('geophysics', 0.034), ('isomorphism', 0.034), ('convergent', 0.033), ('character', 0.033), ('xj', 0.032), ('measure', 0.032), ('laplace', 0.032), ('operators', 0.032), ('dual', 0.03), ('sriperumbudur', 0.029), ('mappings', 0.028), ('let', 0.028), ('bounded', 0.028), ('xi', 0.028), ('obvious', 0.027), ('rich', 0.027), ('uniqueness', 0.027), ('suf', 0.027), ('domains', 0.026), ('derives', 0.026), ('lkopf', 0.026), ('ti', 0.026), ('duality', 0.026), ('condition', 0.026), ('cybernetics', 0.025), ('every', 0.025), ('spaces', 0.025), ('denoted', 0.025), ('reproducing', 0.024), ('ordinary', 0.024), ('mod', 0.024), ('rotational', 0.024), ('unique', 0.024), ('property', 0.023), ('independence', 0.022)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 1.0 44 nips-2008-Characteristic Kernels on Groups and Semigroups

Author: Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Bharath K. Sriperumbudur

Abstract: Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for Rn . In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and Rn . + 1

2 0.19967514 106 nips-2008-Inferring rankings under constrained sensing

Author: Srikanth Jagabathula, Devavrat Shah

Abstract: Motivated by applications like elections, web-page ranking, revenue maximization etc., we consider the question of inferring popular rankings using constrained data. More specifically, we consider the problem of inferring a probability distribution over the group of permutations using its first order marginals. We first prove that it is not possible to recover more than O(n) permutations over n elements with the given information. We then provide a simple and novel algorithm that can recover up to O(n) permutations under a natural stochastic model; in this sense, the algorithm is optimal. In certain applications, the interest is in recovering only the most popular (or mode) ranking. As a second result, we provide an algorithm based on the Fourier Transform over the symmetric group to recover the mode under a natural majority condition; the algorithm turns out to be a maximum weight matching on an appropriately defined weighted bipartite graph. The questions considered are also thematically related to Fourier Transforms over the symmetric group and the currently popular topic of compressed sensing. 1

3 0.15056784 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Author: Francis R. Bach

Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

4 0.11952396 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning

Author: Jihun Hamm, Daniel D. Lee

Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1

5 0.096814036 143 nips-2008-Multi-label Multiple Kernel Learning

Author: Shuiwang Ji, Liang Sun, Rong Jin, Jieping Ye

Abstract: We present a multi-label multiple kernel learning (MKL) formulation in which the data are embedded into a low-dimensional space directed by the instancelabel correlations encoded into a hypergraph. We formulate the problem in the kernel-induced feature space and propose to learn the kernel matrix as a linear combination of a given collection of kernel matrices in the MKL framework. The proposed learning formulation leads to a non-smooth min-max problem, which can be cast into a semi-infinite linear program (SILP). We further propose an approximate formulation with a guaranteed error bound which involves an unconstrained convex optimization problem. In addition, we show that the objective function of the approximate formulation is differentiable with Lipschitz continuous gradient, and hence existing methods can be employed to compute the optimal solution efficiently. We apply the proposed formulation to the automated annotation of Drosophila gene expression pattern images, and promising results have been reported in comparison with representative algorithms.

6 0.084854513 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

7 0.081201874 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels

8 0.078135274 138 nips-2008-Modeling human function learning with Gaussian processes

9 0.076689482 112 nips-2008-Kernel Measures of Independence for non-iid Data

10 0.073610432 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data

11 0.069609478 107 nips-2008-Influence of graph construction on graph-based clustering measures

12 0.068147525 113 nips-2008-Kernelized Sorting

13 0.064862393 117 nips-2008-Learning Taxonomies by Dependence Maximization

14 0.063555367 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition

15 0.058801524 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching

16 0.057333287 175 nips-2008-PSDBoost: Matrix-Generation Linear Programming for Positive Semidefinite Matrices Learning

17 0.056969807 133 nips-2008-Mind the Duality Gap: Logarithmic regret algorithms for online optimization

18 0.056787971 170 nips-2008-Online Optimization in X-Armed Bandits

19 0.055309962 202 nips-2008-Robust Regression and Lasso

20 0.052695729 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, -0.161), (1, -0.042), (2, -0.077), (3, 0.083), (4, 0.062), (5, -0.049), (6, 0.002), (7, -0.047), (8, 0.068), (9, -0.027), (10, 0.189), (11, -0.137), (12, 0.017), (13, 0.085), (14, 0.049), (15, 0.067), (16, 0.072), (17, -0.049), (18, -0.089), (19, 0.034), (20, -0.052), (21, 0.07), (22, 0.096), (23, -0.003), (24, 0.071), (25, -0.01), (26, -0.001), (27, 0.018), (28, 0.185), (29, 0.026), (30, 0.035), (31, 0.003), (32, -0.024), (33, 0.068), (34, 0.052), (35, 0.05), (36, 0.077), (37, 0.012), (38, 0.122), (39, 0.112), (40, 0.028), (41, -0.115), (42, -0.238), (43, 0.068), (44, 0.097), (45, 0.005), (46, -0.017), (47, 0.028), (48, 0.073), (49, 0.093)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.97204596 44 nips-2008-Characteristic Kernels on Groups and Semigroups

Author: Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Bharath K. Sriperumbudur

Abstract: Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for Rn . In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and Rn . + 1

2 0.6752218 106 nips-2008-Inferring rankings under constrained sensing

Author: Srikanth Jagabathula, Devavrat Shah

Abstract: Motivated by applications like elections, web-page ranking, revenue maximization etc., we consider the question of inferring popular rankings using constrained data. More specifically, we consider the problem of inferring a probability distribution over the group of permutations using its first order marginals. We first prove that it is not possible to recover more than O(n) permutations over n elements with the given information. We then provide a simple and novel algorithm that can recover up to O(n) permutations under a natural stochastic model; in this sense, the algorithm is optimal. In certain applications, the interest is in recovering only the most popular (or mode) ranking. As a second result, we provide an algorithm based on the Fourier Transform over the symmetric group to recover the mode under a natural majority condition; the algorithm turns out to be a maximum weight matching on an appropriately defined weighted bipartite graph. The questions considered are also thematically related to Fourier Transforms over the symmetric group and the currently popular topic of compressed sensing. 1

3 0.63340962 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Author: Francis R. Bach

Abstract: For supervised and unsupervised learning, positive definite kernels allow to use large and potentially infinite dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1 -norm or the block ℓ1 -norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to non linear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

4 0.62912458 80 nips-2008-Extended Grassmann Kernels for Subspace-Based Learning

Author: Jihun Hamm, Daniel D. Lee

Abstract: Subspace-based learning problems involve data whose elements are linear subspaces of a vector space. To handle such data structures, Grassmann kernels have been proposed and used previously. In this paper, we analyze the relationship between Grassmann kernels and probabilistic similarity measures. Firstly, we show that the KL distance in the limit yields the Projection kernel on the Grassmann manifold, whereas the Bhattacharyya kernel becomes trivial in the limit and is suboptimal for subspace-based problems. Secondly, based on our analysis of the KL distance, we propose extensions of the Projection kernel which can be extended to the set of affine as well as scaled subspaces. We demonstrate the advantages of these extended kernels for classification and recognition tasks with Support Vector Machines and Kernel Discriminant Analysis using synthetic and real image databases. 1

5 0.52920306 122 nips-2008-Learning with Consistency between Inductive Functions and Kernels

Author: Haixuan Yang, Irwin King, Michael Lyu

Abstract: Regularized Least Squares (RLS) algorithms have the ability to avoid over-fitting problems and to express solutions as kernel expansions. However, we observe that the current RLS algorithms cannot provide a satisfactory interpretation even on the penalty of a constant function. Based on the intuition that a good kernelbased inductive function should be consistent with both the data and the kernel, a novel learning scheme is proposed. The advantages of this scheme lie in its corresponding Representer Theorem, its strong interpretation ability about what kind of functions should not be penalized, and its promising accuracy improvements shown in a number of experiments. Furthermore, we provide a detailed technical description about heat kernels, which serves as an example for the readers to apply similar techniques for other kernels. Our work provides a preliminary step in a new direction to explore the varying consistency between inductive functions and kernels under various distributions. 1

6 0.48783249 203 nips-2008-Scalable Algorithms for String Kernels with Inexact Matching

7 0.47226992 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

8 0.46155953 143 nips-2008-Multi-label Multiple Kernel Learning

9 0.42813697 113 nips-2008-Kernelized Sorting

10 0.41177753 20 nips-2008-An Extended Level Method for Efficient Multiple Kernel Learning

11 0.3953467 97 nips-2008-Hierarchical Fisher Kernels for Longitudinal Data

12 0.37161511 188 nips-2008-QUIC-SVD: Fast SVD Using Cosine Trees

13 0.36330327 217 nips-2008-Sparsity of SVMs that use the epsilon-insensitive loss

14 0.35680529 182 nips-2008-Posterior Consistency of the Silverman g-prior in Bayesian Model Choice

15 0.35535812 68 nips-2008-Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection

16 0.35224634 56 nips-2008-Deep Learning with Kernel Regularization for Visual Recognition

17 0.33296111 51 nips-2008-Convergence and Rate of Convergence of a Manifold-Based Dimension Reduction Algorithm

18 0.32776958 238 nips-2008-Theory of matching pursuit

19 0.32453695 170 nips-2008-Online Optimization in X-Armed Bandits

20 0.32378969 55 nips-2008-Cyclizing Clusters via Zeta Function of a Graph


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(6, 0.05), (7, 0.065), (12, 0.069), (28, 0.171), (57, 0.063), (59, 0.021), (61, 0.295), (71, 0.015), (77, 0.085), (83, 0.059)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.78481686 44 nips-2008-Characteristic Kernels on Groups and Semigroups

Author: Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, Bharath K. Sriperumbudur

Abstract: Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for Rn . In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and Rn . + 1

2 0.73801309 221 nips-2008-Stochastic Relational Models for Large-scale Dyadic Data using MCMC

Author: Shenghuo Zhu, Kai Yu, Yihong Gong

Abstract: Stochastic relational models (SRMs) [15] provide a rich family of choices for learning and predicting dyadic data between two sets of entities. The models generalize matrix factorization to a supervised learning problem that utilizes attributes of entities in a hierarchical Bayesian framework. Previously variational Bayes inference was applied for SRMs, which is, however, not scalable when the size of either entity set grows to tens of thousands. In this paper, we introduce a Markov chain Monte Carlo (MCMC) algorithm for equivalent models of SRMs in order to scale the computation to very large dyadic data sets. Both superior scalability and predictive accuracy are demonstrated on a collaborative filtering problem, which involves tens of thousands users and half million items. 1 Stochastic Relational Models Stochastic relational models (SRMs) [15] are generalizations of Gaussian process (GP) models [11] to the relational domain, where each observation is a dyadic datum, indexed by a pair of entities. They model dyadic data by a multiplicative interaction of two Gaussian process priors. Let U be the feature representation (or index) space of a set of entities. A pair-wise similarity in U is given by a kernel (covariance) function Σ : U × U → R. A Gaussian process (GP) defines a random function f : U → R, whose distribution is characterized by a mean function and the covariance function Σ, denoted by f ∼ N∞ (0, Σ)1 , where, for simplicity, we assume the mean to be the constant zero. GP complies with the intuition regarding the smoothness — if two entities ui and uj are similar according to Σ, then f (ui ) and f (uj ) are similar with a high probability. A domain of dyadic data must involve another set of entities, let it be represented (or indexed) by V. In a similar way, this entity set is associated with another kernel function Ω. For example, in a typical collaborative filtering domain, U represents users while V represents items, then, Σ measures the similarity between users and Ω measures the similarity between items. Being the relation between a pair of entities from different sets, a dyadic variable y is indexed by the product space U × V. Then an SRM aims to model y(u, v) by the following generative process, Model 1. The generative model of an SRM: 1. Draw kernel functions Σ ∼ IW ∞ (δ, Σ◦ ), and Ω ∼ IW ∞ (δ, Ω◦ ); 2. For k = 1, . . . , d: draw random functions fk ∼ N∞ (0, Σ), and gk ∼ N∞ (0, Ω); 1 We denote an n dimensional Gaussian distribution with a covariance matrix Σ by Nn (0, Σ). Then N∞ (0, Σ) explicitly indicates that a GP follows an “infinite dimensional” Gaussian distribution. 1 3. For each pair (u, v): draw y(u, v) ∼ p(y(u, v)|z(u, v), γ), where d 1 z(u, v) = √ fk (u)gk (v) + b(u, v). d k=1 In this model, IW ∞ (δ, Σ◦ ) and IW ∞ (δ, Ω◦ ) are hyper priors, whose details will be introduced later. p(y|z, γ) is the problem-specific noise model. For example, it can follow a Gaussian noise distribution y ∼ N1 (z, γ) if y is numerical, or, a Bernoulli distribution if y is binary. Function b(u, v) is the bias function over the U × V. For simplicity, we assume b(u, v) = 0. In the limit d → ∞, the model converges to a special case where fk and gk can be analytically marginalized out and z becomes a Gaussian process z ∼ N∞ (0, Σ ⊗ Ω) [15], with the covariance between pairs being a tensor kernel K ((ui , vs ), (uj , vt )) = Σ(ui , uj )Ω(vs , vt ). 
In anther special case, if Σ and Ω are both fixed to be Dirac delta functions, and U, V are finite sets, it is easy to see that the model reduces to probabilistic matrix factorization. The hyper prior IW ∞ (δ, Σ◦ ) is called inverted Wishart Process that generalizes the finite ndimensional inverted Wishart distribution [2] IW n (Σ|δ, Σ◦ ) ∝ |Σ| − 1 (δ+2n) 2 1 etr − Σ−1 Σ◦ , 2 where δ is the degree-of-freedom parameter, and Σ◦ is a positive definite kernel matrix. We note that the above definition is different from the popular formulation [3] or [4] in the machine learning community. The advantage of this new notation is demonstrated by the following theorem [2]. Theorem 1. Let A ∼ IW m (δ, K), A ∈ R+ , K ∈ R+ , and A and K be partitioned as A= A11 , A12 , A21 , A22 K= K11 , K12 K21 , K22 where A11 and K11 are two n × n sub matrices, n < m, then A11 ∼ IW n (δ, K11 ). The new formulation of inverted Wishart is consistent under marginalization. Therefore, similar to the way of deriving GPs from Gaussian distributions, we define a distribution of infinite-dimensional kernel functions, denoted by Σ ∼ IW ∞ (δ, Σ◦ ), such that any sub kernel matrix of size m × m follows Σ ∼ IW m (δ, Σ◦ ), where both Σ and Σ◦ are positive definite kernel functions. In case when U and V are sets of entity indices, SRMs let Σ◦ and Ω◦ both be Dirac delta functions, i.e., any of their sub kernel matrices is an identity matrix. Similar to GP regression/classification, the major application of SRMs is supervised prediction based on observed relational values and input features of entities. Formally, let YI = {y(u, v)|(u, v) ∈ I} be the set of noisy observations, where I ⊂ U × V, the model aims to predict the noise-free values ZO = {z(u, v)|(u, v) ∈ O} on O ⊂ U × V. As our computation is always on a finite set containing both I and O, from now on, we only consider the finite subset U0 × V0 , a finite support subset of U × V that contains I ∪ O. Accordingly we let Σ be the covariance matrix of Σ on U0 , and Ω be the covariance matrix of Ω on V0 . Previously a variational Bayesian method was applied to SRMs [15], which computes the maximum a posterior estimates of Σ and Ω, given YI , and then predicts ZO based on the estimated Σ and Ω. There are two limitations of this empirical Bayesian approach: (1) The variational method is not a fully Bayesian treatment. Ideally we wish to integrate Σ and Ω; (2) The more critical issue is, the algorithm has the complexity O(m3 + n3 ), with m = |U0 | and n = |V0 |, is not scalable to a large relational domain where m or n exceeds several thousands. In this paper we will introduce a fully Bayesian inference algorithm using Markov chain Monte Carlo sampling. By deriving equivalent sampling processes, we show the algorithms can be applied to a dataset, which is 103 times larger than the previous work [15], and produce an excellent accuracy. In the rest of this paper, we present our algorithms for Bayesian inference of SRMs in Section 2. Some related work is discussed in Section 3, followed by experiment results of SRMs in Section 4. Section 5 concludes. 2 2 Bayesian Models and MCMC Inference In this paper, we tackle the scalability issue with a fully Bayesian paradigm. We estimate the expectation of ZO directly from YI using Markov-chain Monte Carlo (MCMC) algorithm (specifically, Gibbs sampling), instead of evaluating that from estimated Σ or Ω. Our contribution is in how to make the MCMC inference more efficient for large scale data. We first introduce some necessary notation here. 
Bold capital letters, e.g. X, indicate matrices. I(m) is an identity matrix of size m × m. Nd , Nm,d , IW m , χ−2 are the multivariate normal distribution, the matrix-variate normal distribution, the inverse-Wishart distribution, and the inverse chi-square distribution, respectively. 2.1 Models with Non-informative Priors Let r = |I|, m = |U0 | and n = |V0 |. It is assumed that d min(m, n), and the observed set, I, is sparse, i.e. r mn. First, we consider the case of Σ◦ = αI(m) and Ω◦ = βI(n) . Let {fk } on U0 denoted by matrix variate F of size m × d, {gk } on V0 denoted by matrix variate G of size n × d. Then the generative model is written as Model 2 and depicted in Figure 1. Model 2. The generative model of a matrix-variate SRM: Σ 1. Draw Σ ∼ IW m (δ, αI(m) ) and Ω ∼ IW n (δ, βI(n) ); Ω I(d) G F 2. Draw F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ) and G|Ω ∼ Nn,d (0, Ω ⊗ I(d) ); I(d) Z 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y where Nm,d is the matrix-variate normal distribution of size m × d; α, Figure 1: Model 2 β, δ, ν and σ 2 are scalar parameters of the model. A slight difference √ between this finite model and Model 1 is that the coefficient 1/ d is ignored for simplicity because this coefficient can be absorbed by α or β. As we can explicitly compute Pr(Σ|F), Pr(Ω|G), Pr(F|YI , G, Σ, s2 ), Pr(G|YI , F, Ω, s2 ), Pr(s2 |YI , F, G), we can apply Gibbs sampling algorithm to compute ZO . However, the computational time complexity is at least O(m3 + n3 ), which is not practical for large scale data. 2.2 Gibbs Sampling Method To overcome the inefficiency in sampling large covariance matrices, we rewrite the sampling process using the property of Theorem 2 to take the advantage of d min(m, n). αI(d) αI(m) Theorem 2. If 1. Σ ∼ IW m (δ, αI(m) ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), 2. K ∼ IW d (δ, αI(d) ) and H|K ∼ Nm,d (0, I(m) ⊗ K), then, matrix variates, F and H, have the same distribution. Proof sketch. Matrix variate F follows a matrix variate t distribution, t(δ, 0, αI(m) , I(d) ), which is written as 1 Σ I(d) F → I(m) K F Figure 2: Theorem 2 1 p(F) ∝ |I(m) + (αI(m) )−1 F(I(d) )−1 F |− 2 (δ+m+d−1) = |I(m) + α−1 FF |− 2 (δ+m+d−1) Matrix variate H follows a matrix variate t distribution, t(δ, 0, I(m) , αI(d) ), which can be written as 1 1 p(H) ∝ |I(m) + (I(m) )−1 H(αI(d) )−1 H |− 2 (δ+m+d−1) = |I(m) + α−1 HH |− 2 (δ+m+d−1) Thus, matrix variates, F and H, have the same distribution. 3 This theorem allows us to sample a smaller covariance matrix K of size d × d on the column side instead of sampling a large covariance matrix Σ of size m × m on the row side. The translation is depicted in Figure 2. This theorem applies to G as well, thus we rewrite the model as Model 3 (or Figure 3). A similar idea was used in our previous work [16]. Model 3. The alternative generative model of a matrix-variate SRM: I(m) I(n) K R 1. Draw K ∼ IW d (δ, αI(d) ) and R ∼ IW d (δ, βI(d) ); G F 2. Draw F|K ∼ Nm,d (0, I(m) ⊗ K), and G|R ∼ Nn,d (0, I(n) ⊗ R), 3. Draw s2 ∼ χ−2 (ν, σ 2 ) ; Z 4. Draw Y|F, G, s2 ∼ Nm,n (Z, s2 I(m) ⊗ I(n) ), where Z = FG . s2 Y Let column vector f i be the i-th row of matrix F, and column vector gj Figure 3: Model 3 be the j-th row of matrix G. In Model 3, {f i } are independent given K, 2 G and s . Similar independence applies to {gj } as well. The conditional posterior distribution of K, R, {f i }, {gj } and s2 can be easily computed, thus the Gibbs sampling for SRM is named BSRM (for Bayesian SRM). 
We use Gibbs sampling to compute the mean of ZO , which is derived from the samples of FG . Because of the sparsity of I, each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n)) time complexity2 , which is a dramatic reduction from the previous time complexity O(m3 + n3 ) . 2.3 Models with Informative Priors An important characteristic of SRMs is that it allows the inclusion of certain prior knowledge of entities into the model. Specifically, the prior information is encoded as the prior covariance parameters, i.e. Σ◦ and Ω◦ . In the general case, it is difficult to run sampling process due to the size of Σ◦ and Ω◦ . We assume that Σ◦ and Ω◦ have a special form, i.e. Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix, and Ω◦ = G◦ (G◦ ) + βI(n) , where G◦ is an n × q matrix, and the magnitude of p and q is about the same as or less than that of d. This prior knowledge can be obtained from some additional features of entities. Although such an informative Σ◦ prevents us from directly sampling each row of F independently, as we do in Model 3, we can expand matrix F of size m × d to (F, F◦ ) of size m × (d + p), and derive an equivalent model, where rows of F are conditionally independent given F◦ . Figure 4 illustrates this transformation. Theorem 3. Let δ > p, Σ◦ = F◦ (F◦ ) + αI(m) , where F◦ is an m × p matrix. If 1. Σ ∼ IW m (δ, Σ◦ ) and F|Σ ∼ Nm,d (0, Σ ⊗ I(d) ), K11 K12 ∼ IW d+p (δ − p, αI(d+p) ) and K21 K22 H|K ∼ Nm,d (F◦ K−1 K21 , I(m) ⊗ K11·2 ), 22 2. K = αI(d+p) Σ0 Σ I(d) F → I(m) K (F, F0 ) Figure 4: Theorem 3 where K11·2 = K11 − K12 K−1 K21 , then F and H have the same distribution. 22 Proof sketch. Consider the distribution (H1 , H2 )|K ∼ Nm,d+p (0, I(m) ⊗ K). (1) Because H1 |H2 ∼ Nm,d (H2 K−1 K21 , I(m) ⊗ K11·2 ), p(H) = p(H1 |H2 = F◦ ). On the other 22 hand, we have a matrix-variate t distribution, (H1 , H2 ) ∼ tm,d+p (δ − p, 0, αI(m) , I(d+p) ). By Theorem 4.3.9 in [4], we have H1 |H2 ∼ tm,d (δ, 0, αI(m) + H2 H2 , I(d) ) = tm,d (δ, 0, Σ◦ , I(d) ), which implies p(F) = p(H1 |H2 = F◦ ) = p(H). 2 |Y − FG |2 can be efficiently computed in O(dr) time. I 4 The following corollary allows us to compute the posterior distribution of K efficiently. Corollary 4. K|H ∼ IW d+p (δ + m, αI(d+p) + (H, F◦ ) (H, F◦ )). Proof sketch. Because normal distribution and inverse Wishart distribution are conjugate, we can derive the posterior distribution K from Eq. (1). Thus, we can explicitly sample from the conditional posterior distributions, as listed in Algorithm 1 (BSRM/F for BSRM with features) in Appendix. We note that when p = q = 0, Algorithm 1 (BSRM/F) reduces to the exact algorithm for BSRM. Each iteration in this sampling algorithm can be computed in O(d2 r + d3 (m + n) + dpm + dqn) time complexity. 2.4 Unblocking for Sampling Implementation Blocking Gibbs sampling technique is commonly used to improve the sampling efficiency by reducing the sample variance according to the Rao-Blackwell theorem (c.f. [9]). However, blocking Gibbs sampling is not necessary to be computationally efficient. To improve the computational efficiency of Algorithm 1, we use unblocking sampling to reduce the major computational cost is Step 2 and Step 4. We consider sampling each element of F conditionally. The sampling process is written as Step 4 and Step 9 of Algorithm 2, which is called BSRM/F with conditional Gibss sampling. We can reduce the computational cost of each iteration to O(dr + d2 (m + n) + dpm + dqn), which is comparable to other low-rank matrix factorization approaches. 
Though such a conditional sampling process increases the sample variance comparing to Algorithm 1, we can afford more samples within a given amount of time due to its faster speed. Our experiments show that the overall computational cost of Algorithm 2 is usually less than that of Algorithm 1 when achieving the same accuracy. Additionally, since {f i } are independent, we can parallelize the for loops in Step 4 and Step 9 of Algorithm 2. 3 Related Work SRMs fall into a class of statistical latent-variable relational models that explain relations by latent factors of entities. Recently a number of such models were proposed that can be roughly put into two groups, depending on whether the latent factors are continuous or discrete: (1) Discrete latent-state relational models: a large body of research infers latent classes of entities and explains the entity relationship by the probability conditioned on the joint state of participating entities, e.g., [6, 14, 7, 1]. In another work [10], binary latent factors are modeled; (2) Continuous latent-variable relational models: many such models assume relational data underlain by multiplicative effects between latent variables of entities, e.g. [5]. A simple example is matrix factorization, which recently has become very popular in collaborative filtering applications, e.g., [12, 8, 13]. The latest Bayesian probabilistic matrix factorization [13] reported the state-of-the-art accuracy of matrix factorization on Netflix data. Interestingly, the model turns out to be similar to our Model 3 under the non-informative prior. This paper reveals the equivalence between different models and offers a more general Bayesian framework that allows informative priors from entity features to play a role. The framework also generalizes Gaussian processes [11] to a relational domain, where a nonparametric prior for stochastic relational processes is described. 4 Experiments Synthetic data: We compare BSRM under noninformative priors against two other algorithms: the fast max-margin matrix factorization (fMMMF) in [12] with a square loss, and SRM using variational Bayesian approach (SRM-VB) in [15]. We generate a 30 × 20 random matrix (Figure 5(a)), then add Gaussian noise with σ 2 = 0.1 (Figure 5(b)). The root mean squared noise is 0.32. We select 70% elements as the observed data and use the rest of the elements for testing. The reconstruction matrix and root mean squared errors (RMSEs) of predictions on the test elements are shown in Figure 5(c)-5(e). BSRM outperforms the variational approach of SRMs and fMMMF. Note that because of the log-determinant penalty of the inverse Wishart prior, SRM-VB enforces the rank to be smaller, thus the result of SRM-VB looks smoother than that of BSRM. 5 5 5 5 5 5 10 10 10 10 10 15 15 15 15 15 20 20 20 20 20 25 25 25 25 30 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 25 30 2 4 6 8 10 12 14 16 18 20 30 2 4 6 8 10 12 14 16 18 20 (a) Original Matrix (b) With Noise(0.32) (c) fMMMF (0.27) (d) SRM-VB(0.22) 2 4 6 8 10 12 14 16 18 20 (e) BSRM(0.19) Figure 5: Experiments on synthetic data. RMSEs are shown in parentheses. RMSE MAE User Mean 1.425 1.141 Movie Mean 1.387 1.103 fMMMF [12] 1.186 0.943 VB [8] 1.165 0.915 Table 1: RMSE (root mean squared error) and MAE (mean absolute error) of the experiments on EachMovie data. All standard errors are 0.001 or less. EachMovie data: We test the accuracy and the efficiency of our algorithms on EachMovie. The dataset contains 74, 424 users’ 2, 811, 718 ratings on 1, 648 movies, i.e. 
about 2.29% are rated by zero-to-five stars. We put all the ratings into a matrix, and randomly select 80% as observed data to predict the remaining ratings. The random selection was carried out 10 times independently. We compare our approach against several competing methods: 1) User Mean, predicting ratings by the sample mean of the same user’s ratings; 2) Move Mean, predicting rating by the sample mean of ratings on the same movie; 3) fMMMF [12]; 4) VB introduced in [8], which is a probabilistic lowrank matrix factorization using variational approximation. Because of the data size, we cannot run the SRM-VB of [15]. We test the algorithms BSRM and BSRM/F, both following Algorithm 2, which run without and with features, respectively. The features used in BSRM/F are generated from the PCA result of the binary indicator matrix that indicates whether the user rates the movie. The 10 top factors of both the user side and the movie side are used as features, i.e. p = 10, q = 10. We run the experiments with different d = 16, 32, 100, 200, 300. The hyper parameters are set to some trivial values, δ = p + 1 = 11, α = β = 1, σ 2 = 1, and ν = 1. The results are shown in Table 1 and 2. We find that the accuracy improves as the number of d is increased. Once d is greater than 100, the further improvement is not very significant within a reasonable amount of running time. rank (d) BSRM RMSE MAE BSRM/F RMSE MAE 16 1.0983 0.8411 1.0952 0.8311 32 1.0924 0.8321 1.0872 0.8280 100 1.0905 0.8335 1.0848 0.8289 200 1.0903 0.8340 1.0846 0.8293 300 1.0902 0.8393 1.0852 0.8292 Table 2: RMSE (root mean squared error) and MAE (mean absolute error) of experiments on EachMovie data. All standard errors are 0.001 or less. RMSE To compare the overall computational efficiency of the two Gibbs sampling procedures, Algorithm 1 and Algorithm 2, we run both algorithms 1.2 and record the running time and accuracy Algorithm 1 Algorithm 2 in RMSE. The dimensionality d is set to 1.18 be 100. We compute the average ZO and burn-in ends evaluate it after a certain number of itera1.16 tions. The evaluation results are shown in Figure 6. We run both algorithms for 100 1.14 burn-in ends iterations as the burn-in period, so that we 1.12 can have an independent start sample. After the burn-in period, we restart to compute 1.1 the averaged ZO and evaluate them, therefore there are abrupt points at 100 iterations 1.08 in both cases. The results show that the 0 1000 2000 3000 4000 5000 6000 7000 8000 overall accuracy of Algorithm 2 is better at Running time (sec) any given time. Figure 6: Time-Accuracy of Algorithm 1 and 2 6 Netflix data: We also test the algorithms on the large collection of user ratings from netflix.com. The dataset consists of 100, 480, 507 ratings from 480, 189 users on 17, 770 movies. In addition, Netflix also provides a set of validation data with 1, 408, 395 ratings. In order to evaluate the prediction accuracy, there is a test set containing 2, 817, 131 ratings whose values are withheld and unknown for all the participants. The features used in BSRM/F are generated from the PCA result of a binary matrix that indicates whether or not the user rated the movie. The top 30 user-side factors are used as features, none of movie-side factors are used, i.e. p = 30, q = 0. The hyper parameters are set to some trivial values, δ = p + 1 = 31, α = β = 1, σ 2 = 1, and ν = 1. The results on the validation data are shown in Table 3. The submitted result of BSRM/F(400) achieves RMSE 0.8881 on the test set. 
The running time is around 21 minutes per iteration for 400 latent dimensions on an Intel Xeon 2GHz PC. RMSE VB[8] 0.9141 BPMF [13] 0.8920 100 0.8930 BSRM 200 0.8910 400 0.8895 100 0.8926 BSRM/F 200 400 0.8880 0.8874 Table 3: RMSE (root mean squared error) of experiments on Netflix data. 5 Conclusions In this paper, we study the fully Bayesian inference for stochastic relational models (SRMs), for learning the real-valued relation between entities of two sets. We overcome the scalability issue by transforming SRMs into equivalent models, which can be efficiently sampled. The experiments show that the fully Bayesian inference outperforms the previously used variational Bayesian inference on SRMs. In addition, some techniques for efficient computation in this paper can be applied to other large-scale Bayesian inferences, especially for models involving inverse-Wishart distributions. Acknowledgment: We thank the reviewers and Sarah Tyler for constructive comments. References [1] E. Airodi, D. Blei, S. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. In Journal of Machine Learning Research, 2008. [2] A. P. Dawid. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68:265–274, 1981. [3] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman & Hall, New York, 1995. [4] A. K. Gupta and D. K. Nagar. Matrix Variate Distributions. Chapman & Hall/CRC, 2000. [5] P. Hoff. Multiplicative latent factor models for description and prediction of social networks. Computational and Mathematical Organization Theory, 2007. [6] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89–115, 2004. [7] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 2006. [8] Y. J. Lim and Y. W. Teh. Variational Bayesian approach to movie rating prediction. In Proceedings of KDD Cup and Workshop, 2007. [9] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer, 2001. [10] E. Meeds, Z. Ghahramani, R. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Advances in Neural Information Processing Systems 19, 2007. [11] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006. [12] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005. 7 [13] R. Salakhutdinov and A. Mnih. Bayeisna probabilistic matrix factorization using Markov chain Monte Carlo. In The 25th International Conference on Machine Learning, 2008. [14] Z. Xu, V. Tresp, K. Yu, and H.-P. Kriegel. Infinite hidden relational models. In Proceedings of the 22nd International Conference on Uncertainty in Artificial Intelligence (UAI), 2006. [15] K. Yu, W. Chu, S. Yu, V. Tresp, and Z. Xu. Stochastic relational models for discriminative link prediction. In Advances in Neural Information Processing Systems 19 (NIPS), 2006. [16] S. Zhu, K. Yu, and Y. Gong. Predictive matrix-variate t models. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, NIPS ’07: Advances in Neural Information Processing Systems 20, pages 1721–1728. MIT Press, Cambridge, MA, 2008. Appendix Before presenting the algorithms, we introduce the necessary notation. Let Ii = {j|(i, j) ∈ I} and Ij = {i|(i, j) ∈ I}. 
A matrix with subscripts denotes the submatrix consisting of its entries at the given indices; for example, X_{I_j,j} is the subvector of the j-th column of X whose row indices are in the set I_j, and X_{·,j} is the j-th column of X (· indicates the full index set). X_{i,j} denotes the (i, j)-th entry of X. |X|_I^2 is the squared sum of the elements indexed by I, i.e. Σ_{(i,j)∈I} X_{i,j}^2. We fill the unobserved elements in Y with 0 for simplicity of notation.

Algorithm 1 BSRM/F: Gibbs sampling for SRM with features
1: Draw K ∼ IW_{d+p}(δ + m, αI_{d+p} + (F, F◦)ᵀ(F, F◦));
2: For each i ∈ U0, draw f_i ∼ N_d(K_{(i)}(s^{-2} Gᵀ(Y_{i,·})ᵀ + K_{11·2}^{-1} K_{12} K_{22}^{-1} f_i^◦), K_{(i)}), where K_{(i)}^{-1} = s^{-2} (G_{I_i,·})ᵀ G_{I_i,·} + K_{11·2}^{-1};
3: Draw R ∼ IW_{d+q}(δ + n, βI_{d+q} + (G, G◦)ᵀ(G, G◦));
4: For each j ∈ V0, draw g_j ∼ N_d(R_{(j)}(s^{-2} Fᵀ Y_{·,j} + R_{11·2}^{-1} R_{12} R_{22}^{-1} g_j^◦), R_{(j)}), where R_{(j)}^{-1} = s^{-2} (F_{I_j,·})ᵀ F_{I_j,·} + R_{11·2}^{-1};
5: Draw s^2 ∼ χ^{-2}(ν + r, σ^2 + |Y − FGᵀ|_I^2).

Algorithm 2 BSRM/F: Conditional Gibbs sampling for SRM with features
1: Δ_{i,j} ← Y_{i,j} − Σ_k F_{i,k} G_{j,k}, for (i, j) ∈ I;
2: Draw Φ ∼ W_{d+p}(δ + m + d + p − 1, (αI_{d+p} + (F, F◦)ᵀ(F, F◦))^{-1});
3: for each (i, k) ∈ U0 × {1, . . . , d} do
4:   Draw f ∼ N_1(φ^{-1}(s^{-2} Δ_{i,I_i} G_{I_i,k} − F_{i,·} Φ_{·,k}), φ^{-1}), where φ = s^{-2} (G_{I_i,k})ᵀ G_{I_i,k} + Φ_{k,k};
5:   Update F_{i,k} ← F_{i,k} + f, and Δ_{i,j} ← Δ_{i,j} − f G_{j,k}, for j ∈ I_i;
6: end for
7: Draw Ψ ∼ W_{d+q}(δ + n + d + q − 1, (βI_{d+q} + (G, G◦)ᵀ(G, G◦))^{-1});
8: for each (j, k) ∈ V0 × {1, . . . , d} do
9:   Draw g ∼ N_1(ψ^{-1}(s^{-2} (Δ_{I_j,j})ᵀ F_{I_j,k} − G_{j,·} Ψ_{·,k}), ψ^{-1}), where ψ = s^{-2} (F_{I_j,k})ᵀ F_{I_j,k} + Ψ_{k,k};
10:  Update G_{j,k} ← G_{j,k} + g, and Δ_{i,j} ← Δ_{i,j} − g F_{i,k}, for i ∈ I_j;
11: end for
12: Draw s^2 ∼ χ^{-2}(ν + r, σ^2 + |Δ|_I^2).
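To make the conditional sweep of Algorithm 2 more concrete, here is a minimal Python sketch of one Gibbs sweep under simplifying assumptions: the side features F◦ and G◦ are dropped (p = q = 0), and the parameterization of the scaled inverse-chi-square draw for s^2 is an assumption. The function name and variable names are hypothetical, not taken from the paper.

import numpy as np
from scipy.stats import wishart

def conditional_gibbs_sweep(Y, mask, F, G, s2, alpha=1.0, beta=1.0,
                            delta=None, nu=1.0, sigma2=1.0, rng=None):
    """One sweep of the conditional (per-entry) Gibbs sampler, features omitted.

    Y    : (m, n) ratings with unobserved entries set to 0
    mask : (m, n) boolean array of observed entries
    F, G : (m, d) and (n, d) latent factor matrices, updated in place
    s2   : current noise variance
    """
    rng = np.random.default_rng() if rng is None else rng
    m, d = F.shape
    n = G.shape[0]
    delta = d + 1 if delta is None else delta

    # Residuals on observed entries: Delta = Y - F G^T (zero elsewhere).
    Delta = np.where(mask, Y - F @ G.T, 0.0)

    # Row-precision matrix for F (Wishart draw in the transformed model).
    Phi = wishart.rvs(df=delta + m + d - 1,
                      scale=np.linalg.inv(alpha * np.eye(d) + F.T @ F),
                      random_state=rng)
    for i in range(m):
        Ji = np.flatnonzero(mask[i])              # movies rated by user i
        if Ji.size == 0:
            continue
        for k in range(d):
            g_col = G[Ji, k]
            phi = (g_col @ g_col) / s2 + Phi[k, k]         # conditional precision
            mean = ((Delta[i, Ji] @ g_col) / s2 - F[i] @ Phi[:, k]) / phi
            f_inc = rng.normal(mean, np.sqrt(1.0 / phi))
            F[i, k] += f_inc
            Delta[i, Ji] -= f_inc * g_col

    # Column-precision matrix for G and the symmetric per-entry update.
    Psi = wishart.rvs(df=delta + n + d - 1,
                      scale=np.linalg.inv(beta * np.eye(d) + G.T @ G),
                      random_state=rng)
    for j in range(n):
        Ij = np.flatnonzero(mask[:, j])           # users who rated movie j
        if Ij.size == 0:
            continue
        for k in range(d):
            f_col = F[Ij, k]
            psi = (f_col @ f_col) / s2 + Psi[k, k]
            mean = ((Delta[Ij, j] @ f_col) / s2 - G[j] @ Psi[:, k]) / psi
            g_inc = rng.normal(mean, np.sqrt(1.0 / psi))
            G[j, k] += g_inc
            Delta[Ij, j] -= g_inc * f_col

    # Noise variance: scaled inverse-chi-square draw (parameterization assumed).
    r = int(mask.sum())
    s2 = (sigma2 + float(np.sum(Delta[mask] ** 2))) / rng.chisquare(nu + r)
    return F, G, s2

Updating one scalar entry F_{i,k} (or G_{j,k}) at a time keeps every conditional draw univariate, which is what makes this sweep cheap per iteration compared with the blocked multivariate draws of Algorithm 1.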

3 0.59559971 216 nips-2008-Sparse probabilistic projections

Author: Cédric Archambeau, Francis R. Bach

Abstract: We present a generative model for performing sparse probabilistic projections, which includes sparse principal component analysis and sparse canonical correlation analysis as special cases. Sparsity is enforced by means of automatic relevance determination or by imposing appropriate prior distributions, such as generalised hyperbolic distributions. We derive a variational Expectation-Maximisation algorithm for the estimation of the hyperparameters and show that our novel probabilistic approach compares favourably to existing techniques. We illustrate how the proposed method can be applied in the context of cryptanalysis as a preprocessing tool for the construction of template attacks.

4 0.59467375 205 nips-2008-Semi-supervised Learning with Weakly-Related Unlabeled Data : Towards Better Text Categorization

Author: Liu Yang, Rong Jin, Rahul Sukthankar

Abstract: The cluster assumption is exploited by most semi-supervised learning (SSL) methods. However, if the unlabeled data is merely weakly related to the target classes, it becomes questionable whether driving the decision boundary to the low-density regions of the unlabeled data will help the classification. In such a case, the cluster assumption may not be valid, and consequently how to leverage this type of unlabeled data to enhance the classification accuracy becomes a challenge. We introduce “Semi-supervised Learning with Weakly-Related Unlabeled Data” (SSLW), an inductive method that builds upon the maximum-margin approach, towards a better usage of weakly-related unlabeled information. Although SSLW could improve a wide range of classification tasks, in this paper we focus on text categorization with a small training pool. The key assumption behind this work is that, even with different topics, the word usage patterns across different corpora tend to be consistent. To this end, SSLW estimates the optimal word-correlation matrix that is consistent with both the co-occurrence information derived from the weakly-related unlabeled documents and the labeled documents. For empirical evaluation, we present a direct comparison with a number of state-of-the-art methods for inductive semi-supervised learning and text categorization. We show that SSLW results in a significant improvement in categorization accuracy, equipped with a small training set and an unlabeled resource that is weakly related to the test domain.

5 0.59316462 79 nips-2008-Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning

Author: Francis R. Bach

Abstract: For supervised and unsupervised learning, positive definite kernels allow the use of large and potentially infinite-dimensional feature spaces with a computational cost that only depends on the number of observations. This is usually done through the penalization of predictor functions by Euclidean or Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing norms such as the ℓ1-norm or the block ℓ1-norm. We assume that the kernel decomposes into a large sum of individual basis kernels which can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a hierarchical multiple kernel learning framework, in polynomial time in the number of selected kernels. This framework is naturally applied to nonlinear variable selection; our extensive simulations on synthetic datasets and datasets from the UCI repository show that efficiently exploring the large feature space through sparsity-inducing norms leads to state-of-the-art predictive performance.

6 0.59065574 118 nips-2008-Learning Transformational Invariants from Natural Movies

7 0.590523 245 nips-2008-Unlabeled data: Now it helps, now it doesn't

8 0.59045655 135 nips-2008-Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of \boldmath$\ell 1$-regularized MLE

9 0.59029198 195 nips-2008-Regularized Policy Iteration

10 0.59001732 246 nips-2008-Unsupervised Learning of Visual Sense Models for Polysemous Words

11 0.58964407 49 nips-2008-Clusters and Coarse Partitions in LP Relaxations

12 0.58879781 201 nips-2008-Robust Near-Isometric Matching via Structured Learning of Graphical Models

13 0.58834684 87 nips-2008-Fitted Q-iteration by Advantage Weighted Regression

14 0.58824313 138 nips-2008-Modeling human function learning with Gaussian processes

15 0.58737129 200 nips-2008-Robust Kernel Principal Component Analysis

16 0.58664984 50 nips-2008-Continuously-adaptive discretization for message-passing algorithms

17 0.58632219 231 nips-2008-Temporal Dynamics of Cognitive Control

18 0.58598232 63 nips-2008-Dimensionality Reduction for Data in Multiple Feature Representations

19 0.58595157 227 nips-2008-Supervised Exponential Family Principal Component Analysis via Convex Optimization

20 0.58579749 116 nips-2008-Learning Hybrid Models for Image Annotation with Partially Labeled Data