jmlr jmlr2011 jmlr2011-16 knowledge-graph by maker-knowledge-mining

16 jmlr-2011-Clustering Algorithms for Chains


Source: pdf

Author: Antti Ukkonen

Abstract: We consider the problem of clustering a set of chains to k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the significance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artificial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd’s algorithm, orders, preference statements, planted partition model, randomization testing

Reference: text


Summary: the most important sentences generated by tfidf model

sentIndex sentText sentNum sentScore

1 We consider the problem of clustering a set of chains to k clusters. [sent-4, score-0.816]

2 First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. [sent-7, score-0.892]

3 We also present two different approaches for mapping chains to a vector space. [sent-8, score-0.676]

4 To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. [sent-11, score-0.649]

5 Informally, chains are totally ordered subsets of a set of items, meaning that for all items that belong to a chain we know the order, and for items not belonging to the chain the order is unknown. [sent-32, score-1.472]
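
As a minimal illustration (not taken from the paper), a chain can be stored as a tuple of item identifiers in preference order; the relative order is then defined only for pairs of items that both appear in the chain. The helper name below is a hypothetical choice:

    # A chain is a totally ordered subset of the item set M, stored as a tuple
    # of item ids in order of preference.
    def precedes(chain, u, v):
        """Return True/False if u precedes v in the chain, or None if the order is unknown."""
        if u not in chain or v not in chain:
            return None  # at least one of the items is not covered by this chain
        return chain.index(u) < chain.index(v)

    # Example: the chain (1, 4, 5) says nothing about the relative order of items 2 and 3.
    print(precedes((1, 4, 5), 1, 5))  # True
    print(precedes((1, 4, 5), 2, 3))  # None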

6 In this scenario chains are a natural representation for the preference statements, as it is very unlikely that everyone would list the same movies. [sent-34, score-0.704]

7 This example also illustrates a very useful property of chains as preference statements: independence of the “scale” used by the respondents when assigning scores to the alternatives. [sent-36, score-0.734]

8 Statements in the form of chains let us directly focus on the relationships between the alternatives. [sent-43, score-0.649]

9 Moreover, the use of chains can also facilitate preference elicitation, as people may find it easier to rank a small set of items instead of assigning scores to individual items. [sent-44, score-0.996]

10 First, defining a good distance function for chains is not straightforward. [sent-47, score-0.713]

11 However, if the chains have no overlap, which can in practice happen quite often, their distance has to be defined in some arbitrary way. [sent-50, score-0.713]

12 We tackle the aforementioned issues on one hand by formulating the clustering problem in a way that no computationally hard subproblems are involved (Section 2), and on the other hand by mapping the chains to a vector space (Section 3). [sent-55, score-0.843]

13 By taking the latter approach the problem of clustering chains is reduced to that of clustering vectors in Rn . [sent-56, score-0.983]

14 To make use of this approach we propose a method for generating random sets of chains that share some properties with our original input (Section 4). [sent-63, score-0.675]

15 While this framework seems quite interesting, extending it for chains seems nontrivial. [sent-76, score-0.649]

16 • In Section 3 we present two methods for mapping chains to high-dimensional vector spaces. [sent-84, score-0.676]

17 The first method aims to preserve the distance between two chains that are assumed to originate from the same component in a simple generative model. [sent-85, score-0.713]

18 • In Section 4 we present an MCMC algorithm for uniformly sampling sets of chains that share a number of characteristics with a given set of chains. [sent-89, score-0.649]

19 The random sets of chains are used for significance testing. [sent-90, score-0.649]

20 This is very useful, because defining a good distance function for chains is not straightforward. [sent-147, score-0.713]

21 For example, given the chains (1, 4, 5) and (2, 3, 6), it is not easy to say anything about their similarity, as they share no common items. [sent-148, score-0.649]

22 1, but before this we will describe an approach where the distance between two chains is not required. [sent-150, score-0.713]

23 If we use a total order for representing the centroid we have to solve the rank aggregation problem: given all chains belonging to the cluster Ci , we have to compute a total order that is in some sense the “average” of the chains in Ci . [sent-152, score-1.565]

24 , 2001), but for Spearman’s rho it can be solved in polynomial time if all chains in the input happen to be total orders. [sent-159, score-0.698]

25 By writing the sum in terms of pairs of items instead of chains, we obtain c(X, D) = ∑_{u∈M} ∑_{v∈M} CD(u, v) X(v, u)^2, where CD(u, v) denotes the number of chains in D where u appears before v. [sent-184, score-0.913]
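
A rough sketch of how the pair counts CD(u, v) and this cost could be computed; the dictionary-of-dictionaries representation of the centroid X and all names here are illustrative assumptions, not the paper's implementation:

    from collections import defaultdict
    from itertools import combinations

    def pair_counts(D):
        """CD(u, v): number of chains in D that rank item u before item v."""
        C = defaultdict(int)
        for chain in D:
            for i, j in combinations(range(len(chain)), 2):
                C[(chain[i], chain[j])] += 1  # chain[i] precedes chain[j]
        return C

    def cost(X, D):
        """c(X, D) = sum over (u, v) of CD(u, v) * X(v, u)^2, with X[v][u] the centroid entry X(v, u)."""
        C = pair_counts(D)
        return sum(n * X[v][u] ** 2 for (u, v), n in C.items())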

26 This is a natural way of expressing the ordering information present in a set of chains without having to construct an explicit total order. [sent-191, score-0.649]

27 It is also worth noting that long chains will be consistently further away from the centroid than short chains, because we do not normalize Equation 2 with the length of the chain. [sent-192, score-0.828]

28 Distances of two chains of possibly different lengths are not compared. [sent-194, score-0.649]

29 We also emphasize that even if longer chains in some sense contribute more to the centroid, as they contain a larger number of pairs, the contribution to an individual element of the matrix X is independent of chain length. [sent-195, score-0.774]

30 When assigning chains to updated centroids the error can only decrease (or stay the same) because the chains are assigned to clusters that minimize the error (line 8 of Alg. [sent-198, score-1.358]

31 When we recompute the centroids given the new assignment of chains to clusters, the error is non-increasing as well, because the centroid X ∗ (Equation 4) by definition minimizes the error for every cluster. [sent-200, score-0.86]
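
A schematic of this Lloyd-style loop, reusing pair_counts from the sketch above. The concrete centroid update used here, X(u, v) = CD(u, v) / (CD(u, v) + CD(v, u)), is one plausible reading of the per-cluster minimizer (Equation 4 is not reproduced in these excerpts), so treat it as an assumption:

    import random
    from itertools import combinations

    def chain_distance(chain, X):
        """Distance of a chain to a centroid X: sum of X(v, u)^2 over pairs (u, v) ordered by the chain."""
        return sum(X[chain[j]][chain[i]] ** 2
                   for i, j in combinations(range(len(chain)), 2))

    def update_centroid(cluster, items):
        """Assumed update: X(u, v) = CD(u, v) / (CD(u, v) + CD(v, u)); 0.5 for unobserved pairs."""
        C = pair_counts(cluster)  # pair_counts as defined in the earlier sketch
        X = {u: {v: 0.5 for v in items} for u in items}
        for u in items:
            for v in items:
                total = C[(u, v)] + C[(v, u)]
                if total > 0:
                    X[u][v] = C[(u, v)] / total
        return X

    def lloyd_for_chains(D, k, items, n_iter=20, seed=0):
        rng = random.Random(seed)
        clusters = [[] for _ in range(k)]
        for chain in D:  # random initial assignment
            clusters[rng.randrange(k)].append(chain)
        for _ in range(n_iter):
            centroids = [update_centroid(c, items) for c in clusters]
            clusters = [[] for _ in range(k)]
            for chain in D:  # reassignment cannot increase the error
                j = min(range(k), key=lambda i: chain_distance(chain, centroids[i]))
                clusters[j].append(chain)
        return clusters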

32 Note that these mappings can also be used to visualize sets of chains (Ukkonen, 2007; Kidwell et al. [sent-207, score-0.649]

33 1 Graph Representation The mapping that we describe in this section is based on the adjacency matrices of two graphs where the chains of the input D appear as vertices. [sent-210, score-0.725]

34 Both Spearman’s rho and Kendall’s tau can be modified for chains so that they only consider the common items. [sent-215, score-0.71]

35 If the chains π1 and π2 have no items in common, we have to use a fixed distance between π1 and π2 . [sent-216, score-0.977]

36 This is done for example by Kamishima and Fujiki (2003), where the distance between two chains is given by 1 − ρ, where ρ ∈ [−1, 1] is Spearman’s rho. [sent-217, score-0.713]

37 For two fully correlated chains the distance becomes 0, and for chains with strong negative correlation the distance is 2. [sent-218, score-1.426]

38 If the chains have no common items we have ρ = 0 and the distance is 1. [sent-219, score-0.977]

39 We could use the same approach also with the Kendall distance by defining the distance between the chains π1 and π2 as the (normalized) Kendall distance between the permutations that are induced by the common items in π1 and π2 . [sent-220, score-1.168]
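
A hedged sketch of this kind of pairwise distance: the normalized Kendall disagreement over the items shared by the two chains, with an arbitrary fixed value returned when fewer than two items are shared. The fallback of 0.5 and the normalization are illustrative choices, not the paper's:

    from itertools import combinations

    def kendall_on_common(pi1, pi2, fallback=0.5):
        """Normalized Kendall distance restricted to items appearing in both chains."""
        pos1 = {x: i for i, x in enumerate(pi1)}
        pos2 = {x: i for i, x in enumerate(pi2)}
        common = [x for x in pi1 if x in pos2]
        if len(common) < 2:
            return fallback  # no usable overlap: the distance must be fixed arbitrarily
        disagreements = sum((pos1[u] < pos1[v]) != (pos2[u] < pos2[v])
                            for u, v in combinations(common, 2))
        return disagreements / (len(common) * (len(common) - 1) / 2)

    # Example: (1, 4, 5) and (2, 3, 6) share no items, so the fixed fallback is returned.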

40 Under this assumption it no longer appears meaningful to have dK (π1 , π2 ) = dK (π1 , π3 ), as the clustering algorithm should separate chains generated by different components from each other. [sent-236, score-0.816]

41 Agreement and disagreement graphs: Next we propose a method for mapping the chains to Rn so that the distances between the vectors that correspond to π1, π2 and π3 satisfy the inequality of the example above. [sent-241, score-0.676]

42 In general we want chains that are generated by the same component to have a shorter distance to each other than to chains that originate from other components. [sent-242, score-1.362]

43 To this end, we define the distance between two chains in D as the distance between their neighborhoods in appropriately constructed graphs. [sent-243, score-0.777]

44 If the neighborhoods are similar, that is, there are many chains in D that are (in a sense to be formalized shortly) “close to” both π1 and π2 , we consider also π1 and π2 similar to each other. [sent-244, score-0.649]

45 Note that this definition of distance between two chains is dependent on the input D. [sent-245, score-0.739]

46 In other words, the distance between π1 and π2 can change if other chains in D are modified. [sent-246, score-0.713]

47 We say that chains π1 and π2 agree if for some items u and v we have (u, v) ∈ π1 and (u, v) ∈ π2 . [sent-247, score-0.913]

48 Likewise, the chains π1 and π2 disagree if for some u and v we have (u, v) ∈ π1 and (v, u) ∈ π2 . [sent-248, score-0.672]

49 We define the agreement and disagreement graphs: Definition 1 Let Ga (D) and Gd (D) be undirected graphs with chains in D as vertices. [sent-250, score-0.699]

50 The graph Ga (D) is the agreement graph, where two vertices are connected by an edge if their respective chains agree and do not disagree. [sent-251, score-0.728]

51 The graph Gd (D) is the disagreement graph, where two vertices are connected by an edge if their respective chains disagree and do not agree. [sent-252, score-0.757]

52 The distance between chains π1 and π2 will be a function of the sets of neighboring vertices of π1 and π2 in Ga (D) and Gd (D). [sent-253, score-0.748]
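
A small sketch of how Ga(D) and Gd(D) could be built following Definition 1; the adjacency-set representation and function names are assumptions made for illustration:

    from itertools import combinations

    def agree_disagree(pi1, pi2):
        """Count pairs of common items ordered the same way (agree) and oppositely (disagree)."""
        pos1 = {x: i for i, x in enumerate(pi1)}
        pos2 = {x: i for i, x in enumerate(pi2)}
        common = [x for x in pi1 if x in pos2]
        agree = disagree = 0
        for u, v in combinations(common, 2):
            if (pos1[u] < pos1[v]) == (pos2[u] < pos2[v]):
                agree += 1
            else:
                disagree += 1
        return agree, disagree

    def build_graphs(D):
        """Ga: edge iff the chains agree and never disagree; Gd: edge iff they disagree and never agree."""
        Ga = {i: set() for i in range(len(D))}
        Gd = {i: set() for i in range(len(D))}
        for i, j in combinations(range(len(D)), 2):
            a, d = agree_disagree(D[i], D[j])
            if a > 0 and d == 0:
                Ga[i].add(j); Ga[j].add(i)
            elif d > 0 and a == 0:
                Gd[i].add(j); Gd[j].add(i)
        return Ga, Gd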

53 First, the chains corresponding to the vertices must have at least 2 common items, the probability of which we denote by Pr(|π1 ∩ π2 | ≥ 2). [sent-274, score-0.684]

54 Next we use this to derive the probabilities p and q of observing an edge between two chains that belong either to the same or to two different components, respectively. [sent-280, score-0.694]

55 The bound expresses how the density of the graph Ga (D) depends on the number of items m and the length of the chains l. [sent-310, score-0.935]

56 This, combined with the existing results concerning ∆, means that for short chains over a large M the input D has to be very large for the algorithms of Condon and Karp (2001) and Shamir and Tsur (2002) to produce good results. [sent-312, score-0.675]

57 Therefore, we conclude that for these algorithms to be of practical use a relatively large number of chains is needed if the data consists of short chains over a large number of different items. [sent-316, score-1.298]

58 Also, even though Theorem 2 is related to the graph Ga (D), it gives some theoretical justification to the intuition that increasing the length of the chains should make the clusters easier to separate. [sent-317, score-0.699]

59 Using Ga(D) and Gd(D): In the agreement graph, under ideal circumstances the chain π is mostly connected to chains generated by the same component as π. [sent-320, score-0.796]

60 Also, it is easy to see that in the disagreement graph the chain π is (again under ideal circumstances) not connected to any of the chains generated by the same component, and only to chains generated by the other components. [sent-321, score-1.473]

61 Above we argued that representations of two chains emitted by the same component should be more alike than representations of two chains emitted by different components. [sent-324, score-1.298]

62 Therefore, at least under these simple assumptions the expected distance between two chains from the same component is always less than the expected distance between two chains from different components. [sent-333, score-1.426]

63 2 Hypersphere Representation Next we devise a method for mapping chains to an m-dimensional (as opposed to n-dimensional) vector space. [sent-348, score-0.676]

64 Let f be the mapping from the set of all chains to Rm and let d be a distance function in Rm . [sent-351, score-0.74]

65 Less formally, we want the reversal of a chain to be furthest away from it in the vector space (8), and the distance between π and πR should be the same for all chains (9). [sent-354, score-0.838]
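
The paper's exact mapping is not reproduced in these excerpts. As one concrete construction that meets requirements (8) and (9), items in a chain can be given centered ranks, missing items set to zero, and the vector normalized to unit length, so that a chain's reversal maps to its antipode. This is an illustrative stand-in, not necessarily the paper's formula:

    import math

    def chain_to_vector(chain, items):
        """Map a chain (length >= 2) to a unit vector in R^m: centered ranks for its items, 0 elsewhere."""
        center = (len(chain) + 1) / 2.0
        vec = {x: 0.0 for x in items}
        for rank, x in enumerate(chain, start=1):
            vec[x] = center - rank  # earlier (more preferred) items get larger values
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {x: v / norm for x, v in vec.items()}

    # The reversal of a chain maps to the negated vector, so its Euclidean distance from the
    # original is always 2 -- the maximum possible between unit vectors -- satisfying (8) and (9).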

66 A mapping for chains: To extend the above for chains we apply the technique used also by Critchlow (1985) and later by Busse et al. [sent-393, score-0.649]

67 Computing the vectors fπ for all chains in the input is of order O(nm), which is considerably less than the requirement of O(n2 m2 ) for the graph based approach. [sent-416, score-0.72]

68 As a downside we lose the property of having a shorter distance between chains generated by the same component than between chains generated by different components. [sent-417, score-1.362]

69 The random sets of chains must share certain aspects with our original input D. [sent-428, score-0.675]

70 In this section we define these aspects precisely, and devise a method for sampling randomized sets of chains that share these aspects with a given input D. [sent-429, score-0.71]

71 , Dh a sequence of random sets of chains that share certain properties with D. [sent-439, score-0.649]

72 Definition 6 Let D1 and D2 be two sets of chains on items of the set M. [sent-463, score-0.913]

73 The number of chains of length l is the same in D1 as in D2 for all l. [sent-466, score-0.649]

74 For all M ′ ⊆ M, the number of chains that contain M ′ as a subset is the same in D1 and D2 . [sent-468, score-0.649]

75 We have CD1 (u, v) = CD2 (u, v) for all u, v ∈ M, where CD (u, v) is the number of chains in D that rank u before v. [sent-470, score-0.677]

76 When analyzing chains over the items in M, the most interesting property is how the chains actually order the items. [sent-474, score-1.562]

77 Condition 1 is used to rule out the possibility that the value of A (D) is somehow caused only by the length distribution of the chains in D. [sent-478, score-0.649]
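
A sketch of how membership in the same equivalence class (Definition 6) could be verified for two sets of chains; condition 2 is checked here through the multiset of item sets, which is sufficient for it, and the helper names are assumptions:

    from collections import Counter, defaultdict
    from itertools import combinations

    def pair_counts(D):
        """CD(u, v): number of chains in D that rank u before v (as in the earlier sketch)."""
        C = defaultdict(int)
        for chain in D:
            for i, j in combinations(range(len(chain)), 2):
                C[(chain[i], chain[j])] += 1
        return C

    def same_equivalence_class(D1, D2):
        same_lengths = Counter(map(len, D1)) == Counter(map(len, D2))                  # condition 1
        same_item_sets = Counter(map(frozenset, D1)) == Counter(map(frozenset, D2))    # condition 2
        same_pair_counts = pair_counts(D1) == pair_counts(D2)                          # condition 3
        return same_lengths and same_item_sets and same_pair_counts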

78 The intuition with chains is the same: we view D as a set of points in the space of chains. [sent-483, score-0.649]

79 The following approach, originally proposed by Besag and Clifford (1989), draws h sets of chains from C (D) so that the samples satisfy the exchangeability condition. [sent-525, score-0.649]

80 Since the Markov chain must remain in C(D), the swap may never result in a set of chains D̂ ∉ C(D). [sent-537, score-0.923]

81 For example, if π1 = (1, 2, 3, 4, 5) and π2 = (3, 2, 6, 4, 1), the swap (π1, π2, 2, 1) will result in the chains π′1 = (1, 3, 2, 4, 5) and π′2 = (2, 3, 6, 4, 1). [sent-542, score-0.798]

82 But the second transposition we carry out in π2 cancels out the effect the first transposition had on CD (u, v) and CD (v, u), and the resulting set of chains remains in C (D). [sent-548, score-0.695]

83 Definition 7 Let D be a set of chains and let π1 and π2 belong to D. [sent-549, score-0.694]

84 Let Di contain the three chains below: π1 : (1, 2, 3, 4, 5), π2 : (7, 8, 4, 3, 6), π3 : (3, 2, 6, 4, 1). The valid swaps in this case are (π1, π3, 2, 1) and (π1, π2, 3, 3). [sent-554, score-0.802]

85 If we apply the swap (π1, π2, 3, 3) we obtain the chains π′1 : (1, 2, 4, 3, 5) and π′2 : (7, 8, 3, 4, 6), while π3 : (3, 2, 6, 4, 1) is unchanged. Obviously (π1, π2, 3, 3) is still a valid swap, as we can always revert the previous swap. [sent-555, score-0.823]
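
A sketch of the swap operation as illustrated in the examples above: an adjacent transposition is applied in both chains, and the swap is valid when the two chains order the transposed pair oppositely, so the counts CD(u, v) are unchanged. Positions are 1-based to match the examples; this is a reading of the excerpts, not the paper's pseudocode:

    def is_valid_swap(pi1, pi2, i, j):
        """Valid iff the pair at positions i, i+1 of pi1 is the reverse of the pair at j, j+1 of pi2."""
        u, v = pi1[i - 1], pi1[i]
        return (pi2[j - 1], pi2[j]) == (v, u)

    def apply_swap(pi1, pi2, i, j):
        """Transpose adjacent positions i, i+1 in pi1 and j, j+1 in pi2 (1-based positions)."""
        a, b = list(pi1), list(pi2)
        a[i - 1], a[i] = a[i], a[i - 1]
        b[j - 1], b[j] = b[j], b[j - 1]
        return tuple(a), tuple(b)

    # Reproducing the earlier example with pi1 = (1, 2, 3, 4, 5) and pi2 = (3, 2, 6, 4, 1):
    pi1, pi2 = (1, 2, 3, 4, 5), (3, 2, 6, 4, 1)
    assert is_valid_swap(pi1, pi2, 2, 1)
    print(apply_swap(pi1, pi2, 2, 1))  # ((1, 3, 2, 4, 5), (2, 3, 6, 4, 1))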

86 Note that here we are assuming that the chains in D are labeled. [sent-572, score-0.649]

87 To do this we should compute the Kendall distance between D( j) and Di (h( j)), where h is a bijective mapping between chains in D and Di that minimizes the sum of the pairwise distances. [sent-577, score-0.74]

88 This is simply a list that contains the set of chains where we can transpose u and v. [sent-591, score-0.649]

89 In SD(u, v) we have chains where u appears before v, while SD(v, u) contains chains where v appears before u. [sent-593, score-1.298]

90 Of this we took a random sample of 5000 chains for the analysis. [sent-666, score-0.649]

91 These conditions can be characterized by parameters of the input data, such as length of the chains or total number of items. [sent-695, score-0.675]

92 As suggested on one hand by intuition, and on the other hand by Theorem 2, finding a planted clustering becomes easier as the length of the chains increases. [sent-724, score-0.937]

93 Figure 1: The Adjusted Rand Index (median over 25 trials) between a recovered clustering and the true clustering as a function of the length of a chain in random data sets consisting of 2000 chains each. [sent-744, score-1.108]

94 We tested this hypothesis by running an experiment with random data sets ten times larger, that is, with an input of 20000 chains on m = 100 items. [sent-769, score-0.675]

95 (The curves for HS and GR therefore show performance that is obtained simply by mapping chains to the respective vector spaces and running standard k-means.) [sent-834, score-0.676]

96 Running times t (seconds): SUSHI 297, MLENS 670, DUBLIN 73, MSNBC 81. We observe that n, the number of chains in the input, does not affect the running time t, but the number of items m plays a significant part. [sent-873, score-0.913]

97 In Section 3 we gave two methods for mapping chains to a high-dimensional vector space. [sent-916, score-0.676]

98 We also proposed a method for testing if a clustering found in a set of chains is any different from a clustering of random data. [sent-923, score-1.004]

99 To this end we devised an MCMC algorithm for sampling sets of chains that all belong to the same equivalence class as a given set of chains. [sent-925, score-0.694]

100 Figure 8: Permutations in S(I) have the positions I occupied by items that belong to the chain π, while permutations in S(I^R) have the positions I^R occupied by items of π. [sent-1022, score-0.805]


similar papers computed by tfidf model

tfidf for this paper:

wordName wordTfidf (topN-words)

[('chains', 0.649), ('items', 0.264), ('di', 0.234), ('centroid', 0.179), ('tmse', 0.173), ('clustering', 0.167), ('swap', 0.149), ('hains', 0.136), ('kkonen', 0.128), ('swaps', 0.128), ('chain', 0.125), ('sd', 0.123), ('planted', 0.121), ('lloyd', 0.116), ('kamishima', 0.113), ('sushi', 0.098), ('lustering', 0.095), ('msnbc', 0.091), ('dublin', 0.083), ('mlens', 0.083), ('ga', 0.077), ('akaho', 0.075), ('ebc', 0.075), ('hs', 0.075), ('cd', 0.07), ('clusterings', 0.069), ('ukkonen', 0.068), ('distance', 0.064), ('permutations', 0.063), ('rand', 0.059), ('gd', 0.059), ('orders', 0.058), ('ad', 0.058), ('preference', 0.055), ('lgorithms', 0.054), ('condon', 0.053), ('karp', 0.053), ('tsur', 0.053), ('ratings', 0.052), ('rnd', 0.051), ('shamir', 0.051), ('spearman', 0.045), ('belong', 0.045), ('kendall', 0.044), ('randomization', 0.044), ('rankings', 0.042), ('sdi', 0.039), ('tau', 0.038), ('dk', 0.037), ('item', 0.036), ('randomized', 0.035), ('vertices', 0.035), ('besag', 0.035), ('movies', 0.035), ('aggregation', 0.035), ('position', 0.033), ('reconstruction', 0.032), ('hypersphere', 0.032), ('preferences', 0.032), ('centroids', 0.032), ('ml', 0.031), ('adi', 0.03), ('busse', 0.03), ('gad', 0.03), ('precedes', 0.03), ('respondents', 0.03), ('pr', 0.028), ('disagreement', 0.028), ('movie', 0.028), ('rank', 0.028), ('clusters', 0.028), ('mapping', 0.027), ('uv', 0.026), ('partition', 0.026), ('clifford', 0.026), ('input', 0.026), ('valid', 0.025), ('cluster', 0.025), ('permutation', 0.025), ('gr', 0.025), ('considerably', 0.023), ('uncorrelated', 0.023), ('disagree', 0.023), ('dwork', 0.023), ('graham', 0.023), ('adjacency', 0.023), ('kidwell', 0.023), ('mannila', 0.023), ('precede', 0.023), ('rho', 0.023), ('swappable', 0.023), ('transposition', 0.023), ('graph', 0.022), ('agreement', 0.022), ('positions', 0.022), ('statistic', 0.022), ('testing', 0.021), ('adjusted', 0.021), ('ds', 0.021), ('bucket', 0.021)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.99999809 16 jmlr-2011-Clustering Algorithms for Chains

Author: Antti Ukkonen

Abstract: We consider the problem of clustering a set of chains to k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the significance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artificial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd’s algorithm, orders, preference statements, planted partition model, randomization testing

2 0.059380401 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

3 0.05454858 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms

Author: Volodymyr Melnykov, Ranjan Maitra

Abstract: This paper presents the Clustering Algorithms' Referee Package or CARP, an open source GNU GPL-licensed C package for evaluating clustering algorithms. Calibrating performance of such algorithms is important and CARP addresses this need by generating datasets of different clustering complexity and by assessing the performance of the concerned algorithm in terms of its ability to classify each dataset relative to the true grouping. This paper briefly describes the software and its capabilities. Keywords: CARP, MixSim, clustering algorithm, Gaussian mixture, overlap

4 0.051755484 54 jmlr-2011-Learning Latent Tree Graphical Models

Author: Myung Jin Choi, Vincent Y. F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our algorithms can be applied to both discrete and Gaussian random variables and our learned models are such that all the observed and latent variables have the same domain (state space). Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures such as neighbor-joining) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world data sets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups data set. Keywords: graphical models, Markov random fields, hidden variables, latent tree models, structure learning

5 0.045296628 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama

Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel

6 0.044985108 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes

7 0.043208793 58 jmlr-2011-Learning from Partial Labels

8 0.038773555 55 jmlr-2011-Learning Multi-modal Similarity

9 0.035428796 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models

10 0.033763174 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

11 0.033627868 91 jmlr-2011-The Sample Complexity of Dictionary Learning

12 0.032425467 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

13 0.030788705 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

14 0.030668603 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

15 0.027799593 90 jmlr-2011-The Indian Buffet Process: An Introduction and Review

16 0.026496131 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

17 0.024910571 100 jmlr-2011-Unsupervised Supervised Learning II: Margin-Based Classification Without Labels

18 0.024455303 82 jmlr-2011-Robust Gaussian Process Regression with a Student-tLikelihood

19 0.023805676 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

20 0.023374282 72 jmlr-2011-On the Relation between Realizable and Nonrealizable Cases of the Sequence Prediction Problem


similar papers computed by lsi model

lsi for this paper:

topicId topicWeight

[(0, 0.13), (1, -0.061), (2, -0.029), (3, -0.002), (4, -0.031), (5, 0.003), (6, -0.006), (7, 0.075), (8, -0.028), (9, 0.019), (10, -0.053), (11, 0.13), (12, 0.069), (13, 0.03), (14, -0.051), (15, -0.09), (16, 0.016), (17, 0.068), (18, -0.119), (19, -0.153), (20, -0.02), (21, -0.051), (22, -0.199), (23, -0.16), (24, -0.162), (25, -0.174), (26, -0.138), (27, 0.106), (28, 0.081), (29, -0.244), (30, 0.193), (31, 0.262), (32, 0.021), (33, -0.041), (34, 0.131), (35, 0.023), (36, -0.01), (37, 0.09), (38, 0.046), (39, 0.128), (40, -0.008), (41, -0.033), (42, -0.049), (43, -0.112), (44, 0.065), (45, -0.067), (46, -0.109), (47, -0.095), (48, -0.025), (49, 0.113)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.96755785 16 jmlr-2011-Clustering Algorithms for Chains

Author: Antti Ukkonen

Abstract: We consider the problem of clustering a set of chains to k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the significance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artificial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd’s algorithm, orders, preference statements, planted partition model, randomization testing

2 0.61532819 15 jmlr-2011-CARP: Software for Fishing Out Good Clustering Algorithms

Author: Volodymyr Melnykov, Ranjan Maitra

Abstract: This paper presents the Clustering Algorithms' Referee Package or CARP, an open source GNU GPL-licensed C package for evaluating clustering algorithms. Calibrating performance of such algorithms is important and CARP addresses this need by generating datasets of different clustering complexity and by assessing the performance of the concerned algorithm in terms of its ability to classify each dataset relative to the true grouping. This paper briefly describes the software and its capabilities. Keywords: CARP, MixSim, clustering algorithm, Gaussian mixture, overlap

3 0.47000119 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

Author: Bruno Pelletier, Pierre Pudlo

Abstract: Following Hartigan (1975), a cluster is defined as a connected component of the t-level set of the underlying density, that is, the set of points for which the density is greater than t. A clustering algorithm which combines a density estimate with spectral clustering techniques is proposed. Our algorithm is composed of two steps. First, a nonparametric density estimate is used to extract the data points for which the estimated density takes a value greater than t. Next, the extracted points are clustered based on the eigenvectors of a graph Laplacian matrix. Under mild assumptions, we prove the almost sure convergence in operator norm of the empirical graph Laplacian operator associated with the algorithm. Furthermore, we give the typical behavior of the representation of the data set into the feature space, which establishes the strong consistency of our proposed algorithm. Keywords: spectral clustering, graph, unsupervised classification, level sets, connected components

4 0.28207088 57 jmlr-2011-Learning a Robust Relevance Model for Search Using Kernel Methods

Author: Wei Wu, Jun Xu, Hang Li, Satoshi Oyama

Abstract: This paper points out that many search relevance models in information retrieval, such as the Vector Space Model, BM25 and Language Models for Information Retrieval, can be viewed as a similarity function between pairs of objects of different types, referred to as an S-function. An S-function is specifically defined as the dot product between the images of two objects in a Hilbert space mapped from two different input spaces. One advantage of taking this view is that one can take a unified and principled approach to address the issues with regard to search relevance. The paper then proposes employing a kernel method to learn a robust relevance model as an S-function, which can effectively deal with the term mismatch problem, one of the biggest challenges in search. The kernel method exploits a positive semi-definite kernel referred to as an S-kernel. The paper shows that when using an S-kernel the model learned by the kernel method is guaranteed to be an S-function. The paper then gives more general principles for constructing S-kernels. A specific implementation of the kernel method is proposed using the Ranking SVM techniques and click-through data. The proposed approach is employed to learn a relevance model as an extension of BM25, referred to as Robust BM25. Experimental results on web search and enterprise search data show that Robust BM25 significantly outperforms baseline methods and can successfully tackle the term mismatch problem. Keywords: search, term mismatch, kernel machines, similarity learning, s-function, s-kernel

5 0.24423951 58 jmlr-2011-Learning from Partial Labels

Author: Timothee Cour, Ben Sapp, Ben Taskar

Abstract: We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partiallylabeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost. Keywords: weakly supervised learning, multiclass classification, convex learning, generalization bounds, names and faces

6 0.24144706 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

7 0.19838929 26 jmlr-2011-Distance Dependent Chinese Restaurant Processes

8 0.19713765 93 jmlr-2011-The arules R-Package Ecosystem: Analyzing Interesting Patterns from Large Transaction Data Sets

9 0.18439727 54 jmlr-2011-Learning Latent Tree Graphical Models

10 0.16874203 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

11 0.16864713 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

12 0.15849459 34 jmlr-2011-Faster Algorithms for Max-Product Message-Passing

13 0.15672037 80 jmlr-2011-Regression on Fixed-Rank Positive Semidefinite Matrices: A Riemannian Approach

14 0.15090771 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

15 0.15013269 91 jmlr-2011-The Sample Complexity of Dictionary Learning

16 0.14636882 19 jmlr-2011-Convergence of Distributed Asynchronous Learning Vector Quantization Algorithms

17 0.14420721 55 jmlr-2011-Learning Multi-modal Similarity

18 0.14417394 85 jmlr-2011-Smoothness, Disagreement Coefficient, and the Label Complexity of Agnostic Active Learning

19 0.14228795 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

20 0.14100255 40 jmlr-2011-Hyper-Sparse Optimal Aggregation


similar papers computed by lda model

lda for this paper:

topicId topicWeight

[(4, 0.057), (9, 0.023), (10, 0.052), (23, 0.371), (24, 0.061), (31, 0.079), (32, 0.023), (41, 0.039), (60, 0.018), (65, 0.015), (70, 0.011), (71, 0.018), (73, 0.027), (78, 0.065), (90, 0.034)]

similar papers list:

simIndex simValue paperId paperTitle

same-paper 1 0.67443103 16 jmlr-2011-Clustering Algorithms for Chains

Author: Antti Ukkonen

Abstract: We consider the problem of clustering a set of chains to k clusters. A chain is a totally ordered subset of a finite set of items. Chains are an intuitive way to express preferences over a set of alternatives, as well as a useful representation of ratings in situations where the item-specific scores are either difficult to obtain, too noisy due to measurement error, or simply not as relevant as the order that they induce over the items. First we adapt the classical k-means for chains by proposing a suitable distance function and a centroid structure. We also present two different approaches for mapping chains to a vector space. The first one is related to the planted partition model, while the second one has an intuitive geometrical interpretation. Finally we discuss a randomization test for assessing the significance of a clustering. To this end we present an MCMC algorithm for sampling random sets of chains that share certain properties with the original data. The methods are studied in a series of experiments using real and artificial data. Results indicate that the methods produce interesting clusterings, and for certain types of inputs improve upon previous work on clustering algorithms for orders. Keywords: Lloyd’s algorithm, orders, preference statements, planted partition model, randomization testing

2 0.35700923 86 jmlr-2011-Sparse Linear Identifiable Multivariate Modeling

Author: Ricardo Henao, Ole Winther

Abstract: In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component δ-function and continuous mixtures), non-Gaussian latent factors and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and bench-marked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling) and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for the temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Keywords: parsimony, sparsity, identifiability, factor models, linear Bayesian networks

3 0.35153466 77 jmlr-2011-Posterior Sparsity in Unsupervised Dependency Parsing

Author: Jennifer Gillenwater, Kuzman Ganchev, João Graça, Fernando Pereira, Ben Taskar

Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.

4 0.34972084 64 jmlr-2011-Minimum Description Length Penalization for Group and Multi-Task Sparse Learning

Author: Paramveer S. Dhillon, Dean Foster, Lyle H. Ungar

Abstract: We propose a framework MIC (Multiple Inclusion Criterion) for learning sparse models based on the information theoretic Minimum Description Length (MDL) principle. MIC provides an elegant way of incorporating arbitrary sparsity patterns in the feature space by using two-part MDL coding schemes. We present MIC based models for the problems of grouped feature selection (MIC-GROUP) and multi-task feature selection (MIC-MULTI). MIC-GROUP assumes that the features are divided into groups and induces two level sparsity, selecting a subset of the feature groups, and also selecting features within each selected group. MIC-MULTI applies when there are multiple related tasks that share the same set of potentially predictive features. It also induces two level sparsity, selecting a subset of the features, and then selecting which of the tasks each feature should be added to. Lastly, we propose a model, TRANSFEAT, that can be used to transfer knowledge from a set of previously learned tasks to a new task that is expected to share similar features. All three methods are designed for selecting a small set of predictive features from a large pool of candidate features. We demonstrate the effectiveness of our approach with experimental results on data from genomics and from word sense disambiguation problems. Keywords: feature selection, minimum description length principle, multi-task learning

5 0.34671095 53 jmlr-2011-Learning High-Dimensional Markov Forest Distributions: Analysis of Error Rates

Author: Vincent Y.F. Tan, Animashree Anandkumar, Alan S. Willsky

Abstract: The problem of learning forest-structured discrete graphical models from i.i.d. samples is considered. An algorithm based on pruning of the Chow-Liu tree through adaptive thresholding is proposed. It is shown that this algorithm is both structurally consistent and risk consistent and the error probability of structure learning decays faster than any polynomial in the number of samples under fixed model size. For the high-dimensional scenario where the size of the model d and the number of edges k scale with the number of samples n, sufficient conditions on (n, d, k) are given for the algorithm to satisfy structural and risk consistencies. In addition, the extremal structures for learning are identified; we prove that the independent (resp., tree) model is the hardest (resp., easiest) to learn using the proposed algorithm in terms of error rates for structure learning. Keywords: graphical models, forest distributions, structural consistency, risk consistency, method of types

6 0.34456131 91 jmlr-2011-The Sample Complexity of Dictionary Learning

7 0.34455132 74 jmlr-2011-Operator Norm Convergence of Spectral Clustering on Level Sets

8 0.34427255 29 jmlr-2011-Efficient Learning with Partially Observed Attributes

9 0.34408951 12 jmlr-2011-Bayesian Co-Training

10 0.34363091 7 jmlr-2011-Adaptive Exact Inference in Graphical Models

11 0.34244695 67 jmlr-2011-Multitask Sparsity via Maximum Entropy Discrimination

12 0.3422592 21 jmlr-2011-Cumulative Distribution Networks and the Derivative-sum-product Algorithm: Models and Inference for Cumulative Distribution Functions on Graphs

13 0.34163168 104 jmlr-2011-X-Armed Bandits

14 0.34131572 59 jmlr-2011-Learning with Structured Sparsity

15 0.34006175 4 jmlr-2011-A Family of Simple Non-Parametric Kernel Learning Algorithms

16 0.33939403 76 jmlr-2011-Parameter Screening and Optimisation for ILP using Designed Experiments

17 0.33896649 9 jmlr-2011-An Asymptotic Behaviour of the Marginal Likelihood for General Markov Models

18 0.33651683 48 jmlr-2011-Kernel Analysis of Deep Networks

19 0.33632967 17 jmlr-2011-Computationally Efficient Convolved Multiple Output Gaussian Processes

20 0.33570641 78 jmlr-2011-Producing Power-Law Distributions and Damping Word Frequencies with Two-Stage Language Models